GEMV: General Matrix Vector Product

Content

In this session we will implement and benchmark different implementations for the so called GEMV operation:

\[y \leftarrow \beta y + \alpha A x,\quad A \in\mathbb{M}^{m \times n},\; x \in \mathbb{M}^n, y \in \mathbb{M}^m\]

This operation is specified in BLAS Level 2. Here some additional notes on the parameters and the operation:

Error bound

We provide one reference implementation for GEMV. An alternative implementation will be tested as follows:

Even both implementations are numerically acceptable they might differ due to round off errors. However, we will accept if the expression (where \(\text{eps}\) denotes the machine precision)

\[\frac{\| y_\text{ref} - y_\text{sol} \| }{\text{eps}\left( \text{max}\{m, n\} \cdot |\alpha| \cdot \|A\|_\infty \cdot \|x\|_\infty + m \cdot |\beta| \cdot \|y_0\|_\infty \right) }\]

is smaller than a threshold \(\tau\). Theoretically one can only show that \(\tau = \mathcal{O}(1)\) (i.e. does not depend on matrix dimensions, the condition of \(A\) or the particular choice of \(x\) and \(y\)). But that means, that there is still an unknown constant involved. In practice, choosing \(\tau = 2\) should typically be sufficient to detect errors in the implementation.

Setting up Matrices and Vectors for Testing

In general we will initialize matrices and vectors with random numbers (using srand for initializing the random number generator and rand() for getting a pseudo random number).

However, in case \(\alpha\) equals zero we initialize \(A\) and \(x\) with NaN, and in case \(\beta\) equals zero we initialize \(y_0\) with NaN.

Values for \(\alpha\) and \(\beta\) can be specified when you compile the code:

Measure performance

The GEMV operation requires about \(m \cdot (2\cdot n + 1)\) floating point operations (we do not distinguish between floating point addition and multiplication). The efficiency will be measured in mega flops per second:

\[\text{MFLOPS} = \frac{\text{Floating point operations} }{10^6 \cdot \text{wall time in seconds}}\]

For measuring the required wall time for an operation we re-run the operation several times and use the average time.

Skeleton for Test and Benchmark

Make yourself familiar with the skeleton below:

#include <stdlib.h>         // for malloc(), free(), rand(), srand()
#include <stdio.h>          // for printf()
#include <stddef.h>         // for size_t, ptrdiff_t
#include <math.h>           // for nan(), fabs()
#include <float.h>          // for DBL_EPSILON
#include <stdbool.h>        // for typedef bool
#include <sys/times.h>      // needed for walltime()
#include <unistd.h>         // needed for walltime()

//-- Function for benchmarking and testing -------------------------------------

double
walltime()
{
   struct tms    ts;
   static double ClockTick=0.0;

   if (ClockTick==0.0) {
        ClockTick = 1.0 / ((double) sysconf(_SC_CLK_TCK));
   }
   return ((double) times(&ts)) * ClockTick;
}


void
initGeMatrix(size_t m, size_t n,
             double *A,
             ptrdiff_t incRowA, ptrdiff_t incColA)
{
    for (size_t i=0; i<m; ++i) {
        for (size_t j=0; j<n; ++j) {
            A[i*incRowA + j*incColA] = i*n + j +1;
        }
    }
}

void
randGeMatrix(size_t m, size_t n, bool withNan,
             double *A,
             ptrdiff_t incRowA, ptrdiff_t incColA)
{
    for (size_t i=0; i<m; ++i) {
        for (size_t j=0; j<n; ++j) {
            A[i*incRowA + j*incColA] = withNan
                                     ? nan("")
                                     : 2.*(rand()-RAND_MAX/2)/RAND_MAX;
        }
    }
}

void
printGeMatrix(size_t m, size_t n,
              const double *A,
              ptrdiff_t incRowA, ptrdiff_t incColA)
{
    for (size_t i=0; i<m; ++i) {
        for (size_t j=0; j<n; ++j) {
            printf("%9.2lf ", A[i*incRowA + j*incColA]);
        }
        printf("\n");
    }
    printf("\n");
}

double
dgenrm_inf(size_t m, size_t n,
           const double *A,
           ptrdiff_t incRowA, ptrdiff_t incColA)
{
    double result = 0;

    for (size_t i=0; i<m; ++i) {
        double sum = 0;

        for (size_t j=0; j<n; ++j) {
            sum += fabs(A[i*incRowA+j*incColA]);
        }

        if (sum>result) {
            result = sum;
        }
    }

    return result;
}

//-- BLAS Level 1 functions ----------------------------------------------------

void
dcopy(size_t n,
      const double *x, ptrdiff_t incX,
      double *y, ptrdiff_t incY)
{
    for (size_t i=0; i<n; ++i) {
        y[i*incY] = x[i*incX];
    }
}

void
daxpy(size_t n, double alpha,
      const double *x, ptrdiff_t incX,
      double *y, ptrdiff_t incY)
{
    for (size_t i=0; i<n; ++i) {
        y[i*incY] += alpha*x[i*incX];
    }
}

double
ddot(size_t n,
     const double *x, ptrdiff_t incX,
     const double *y, ptrdiff_t incY)
{
    double result = 0;

    for (size_t i=0; i<n; ++i) {
        result += x[i*incX]*y[i*incY];
    }

    return result;
}

void
dscal(size_t n,
      double alpha,
      double *x, ptrdiff_t incX)
{
    if (alpha==1) {
        return;
    }
    if (alpha!=0) {
        for (size_t i=0; i<n; ++i) {
            x[i*incX] *= alpha;
        }
    } else {
        for (size_t i=0; i<n; ++i) {
            x[i*incX] = 0;
        }
    }
}

//-- BLAS Level 2 functions ----------------------------------------------------

void
dgemv_dot(size_t m, size_t n,
          double alpha,
          const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
          const double *x, ptrdiff_t incX,
          double beta,
          double *y, ptrdiff_t incY)
{
    /* Your code here */
}

void
dgemv_axpy(size_t m, size_t n,
           double alpha,
           const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
           const double *x, ptrdiff_t incX,
           double beta,
           double *y, ptrdiff_t incY)
{
    /* your code here */
}


#ifndef DOTF
#define DOTF 2
#endif

void
dgemv_dotf(size_t m, size_t n,
           double alpha,
           const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
           const double *x, ptrdiff_t incX,
           double beta,
           double *y, ptrdiff_t incY)
{
    /* your code here */
}


#ifndef AXPYF
#define AXPYF 2
#endif

void
dgemv_axpyf(size_t m, size_t n,
            double alpha,
            const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
            const double *x, ptrdiff_t incX,
            double beta,
            double *y, ptrdiff_t incY)
{
    /* Your code here */
}


//-- BLAS Level 2: dgemv reference implementation and error bound --------------

void
dgemv_ref(size_t m, size_t n,
          double alpha,
          const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
          const double *x, ptrdiff_t incX,
          double beta,
          double *y, ptrdiff_t incY)
{
    if (beta!=1) {
        if (beta!=0) {
            for (size_t i=0; i<m; ++i) {
                y[i*incY] *= beta;
            }
        } else {
            for (size_t i=0; i<m; ++i) {
                y[i*incY] = 0;
            }
        }
    }
    if (alpha!=0) {
        for (size_t j=0; j<n; ++j) {
            for (size_t i=0; i<m; ++i) {
                y[i*incY] += alpha*A[i*incRowA+j*incColA]*x[j*incX];
            }
        }
    }
}

// - Computes error bound for the test result ySol of the gemv operation
//   beta*y0 + alpha*A*x.
// - yRef is trusted result.
// - ySol gets overwritten.
double
dgemv_err(size_t m, size_t n,
          double alpha,
          const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
          const double *x, ptrdiff_t incX,
          const double *y0, ptrdiff_t incY0,
          double beta,
          const double *yRef, ptrdiff_t incYRef,
          double *ySol, ptrdiff_t incYSol)
{
    double nrmY0 = dgenrm_inf(m, 1, y0, incY0, 1);
    double nrmX  = dgenrm_inf(n, 1, x, incX, 1);
    double nrmA  = dgenrm_inf(m, n, A, incRowA, incColA);
    size_t maxMN = m<n ? n : m;

    // nrmDiff = ||y2 - y1||_inf
    daxpy(m, -1, yRef, incYRef, ySol, incYSol);
    double nrmDiff = dgenrm_inf(m, 1, ySol, incYSol, 1);

    return nrmDiff /
        (DBL_EPSILON*(maxMN*fabs(alpha)*nrmA*nrmX + m*fabs(beta)*nrmY0));
}



//------------------------------------------------------------------------------

#ifndef COLMAJOR
#define COLMAJOR 0
#endif

#ifndef SEED_RAND
#define SEED_RAND 0
#endif

#ifndef MAX_M
#define MAX_M 4000
#endif

#ifndef MAX_N
#define MAX_N 4000
#endif

#ifndef ALPHA
#define ALPHA 1
#endif

#ifndef BETA
#define BETA 1
#endif

int
main()
{
    srand(SEED_RAND);
    printf("#COLMAJOR = %d\n", COLMAJOR);
    printf("#ALPHA    = %lf\n", (double)ALPHA);
    printf("#BETA     = %lf\n", (double)BETA);

    double *A    = malloc(MAX_M*MAX_N*sizeof(double));
    double *x    = malloc(MAX_N*sizeof(double));
    double *y0   = malloc(MAX_M*sizeof(double));
    double *yRef = malloc(MAX_M*sizeof(double));
    double *ySol = malloc(MAX_M*sizeof(double));
    if (!A || !x || !y0 || !yRef || !ySol) {
        abort();
    }

    // print header
    printf("#%4s %4s ", "m", "n");
    printf("%10s %10s ", "time ref", "mflops ref");
    printf("%10s %10s %7s ", "time 1", "mflops 1", "err");
    printf("%10s %10s %7s ", "time 2", "mflops 2", "err");
    printf("\n");

    for (size_t m=10, n=10; m<=MAX_M && n<=MAX_N; m+=10, n+=10) {
        ptrdiff_t incRowA = COLMAJOR ? 1 : n;
        ptrdiff_t incColA = COLMAJOR ? m : 1;

        ptrdiff_t incX    = 1;
        ptrdiff_t incY0   = 1;
        ptrdiff_t incYRef = 1;
        ptrdiff_t incYSol = 1;

        double    alpha = ALPHA;
        double    beta  = BETA;

        double    mflop = m*(2*n+1)/1000./1000.;

        randGeMatrix(m, n, ALPHA==0, A, incRowA, incColA);
        randGeMatrix(n, 1, ALPHA==0, x, incX, 0);
        randGeMatrix(m, 1, BETA==0, y0, incY0, 0);

        printf("A =\n");
        printGeMatrix(m, n, A, incRowA, incColA);

        printf("x =\n");
        printGeMatrix(1, n, x, 0, incX);

        printf("y0 =\n");
        printGeMatrix(1, m, y0, 0, incY0);

        printf(" %4zu %4zu ", m,  n);

        {
            double t    = 0;
            size_t runs = 0;

            while (t<0.1 || runs<3) {
                dcopy(m, y0, incY0, yRef, incYRef);

                double t0 = walltime();
                dgemv_ref(m, n,
                          alpha,
                          A, incRowA, incColA,
                          x, incX,
                          beta,
                          yRef, incYRef);
                t += walltime() - t0;
                ++runs;
            }
            t /= runs;

            printf("\nyRef =\n");
            printGeMatrix(1, m, yRef, 0, incYRef);
            printf("%10.2lf %10.2lf ", t, mflop/t);
        }

        {
            double t    = 0;
            size_t runs = 0;

            while (t<0.1 || runs<3) {
                dcopy(m, y0, incY0, ySol, incYSol);

                double t0 = walltime();
                dgemv_dot(m, n,
                          alpha,
                          A, incRowA, incColA,
                          x, incX,
                          beta,
                          ySol, incYSol);
                t += walltime() - t0;
                ++runs;
            }
            t /= runs;

            printf("\nySol =\n");
            printGeMatrix(1, m, ySol, 0, incYSol);

            double err = dgemv_err(m, n, alpha,
                                   A, incRowA, incColA,
                                   x, incX,
                                   y0, incY0,
                                   beta,
                                   yRef, incYRef,
                                   ySol, incYSol);

            printf("\nyDiff =\n");
            printGeMatrix(1, m, ySol, 0, incYSol);

            printf("%10.2lf %10.2lf %7.1e ", t, mflop/t, err);
        }

        printf("\n");
        break;
    }

    free(A);
    free(x);
    free(y0);
    free(yRef);
    free(ySol);
}

Here the output generated by the current version:

heim$ gcc -Wall -std=c11 -O3 -o gemv gemv_ex1.c
heim$ ./gemv
#COLMAJOR = 0
#ALPHA    = 1.000000
#BETA     = 1.000000
#   m    n   time ref mflops ref     time 1   mflops 1     err     time 2   mflops 2     err 
A =
     0.68     -0.21      0.57      0.60      0.82     -0.60     -0.33      0.54     -0.44      0.11 
    -0.05      0.26     -0.27      0.03      0.90      0.83      0.27      0.43     -0.72      0.21 
    -0.97     -0.51     -0.73      0.61     -0.69     -0.20     -0.74     -0.78      1.00     -0.56 
     0.03      0.68      0.23     -0.41      0.28      0.05     -0.01      0.95     -0.41      0.54 
     0.05      0.54     -0.20      0.78     -0.43     -0.30      0.62      0.84     -0.86      0.90 
     0.05     -0.83     -0.62      0.33      0.78     -0.30     -0.87     -0.96     -0.08     -0.87 
    -0.52      0.94      0.80      0.70     -0.47      0.08     -0.25      0.52      0.03      0.34 
     0.06     -0.92     -0.12      0.86      0.86      0.44     -0.43      0.48      0.28     -0.29 
     0.38     -0.67     -0.12      0.76      0.66     -0.34     -0.54      0.79     -0.30      0.37 
     0.91      0.18      0.31      0.72     -0.12      0.85     -0.20      0.63      0.37      0.82 

x =
    -0.04     -0.57      0.90      0.84     -0.70      0.76      0.28     -0.14      0.24     -0.44 

y0 =
     0.57     -0.39     -0.11     -0.55     -0.62     -0.45      0.11     -0.17     -0.66      0.81 

   10   10 
yRef =
     0.32     -1.00      0.79     -1.70     -0.91     -0.79      1.02      0.71     -0.98      1.88 

      0.00     502.36 
ySol =
     0.57     -0.39     -0.11     -0.55     -0.62     -0.45      0.11     -0.17     -0.66      0.81 


yDiff =
     0.25      0.62     -0.90      1.15      0.29      0.34     -0.90     -0.87      0.32     -1.07 

      0.00     640.79 7.5e+13 
heim$ 

Exercise

Implement the different algorithms for gemv:

Run tests for col major (compile with -DCOLMAJOR=1) and row major (-DCOLMAJOR=0). Also make sure that you also test cases with \(\alpha=0\) or \(\beta=0\).

Algorithm for dgemv_dot

We consider \(A\) as a \(m \times n\) matrix that consists of \(m\) row vectors:

\[A = \left( \begin{array}{c} a_0^T \\ \hline a_1^T \\ \hline \vdots \\ \hline a_{m-1}^T \end{array} \right)\]

This partitioning of \(A\) in the operation \(\beta y + \alpha A x\) requires that we also consider \(y\) component wise. So that it reads

\[ \left( \begin{array}{c} y_0 \\ \hline y_1 \\ \hline \vdots \\ \hline y_{m-1} \end{array} \right)\leftarrow\beta \cdot \left( \begin{array}{c} y_0 \\ \hline y_1 \\ \hline \vdots \\ \hline y_{m-1} \end{array} \right)+ \alpha \cdot \left( \begin{array}{c} a_0^T \\ \hline a_1^T \\ \hline \vdots \\ \hline a_{m-1}^T \end{array} \right)\cdot x\]

The complete algorithm then reads as follows:

  • \(y \leftarrow \beta \cdot y\) (scal operation)

  • For \(i=0, \dots, m-1\)

    • \(y_i \leftarrow y_i + \alpha \cdot a_i^T x\) (dot operation)

Algorithm for dgemv_axpy

We consider \(A\) as a \(m \times n\) matrix that consists of \(n\) column vectors:

\[A = \left( \begin{array}{c|c|c|c} a_0 & a_1 & \dots a_{n-1} \end{array} \right)\]

This requires to consider \(x\) component wise so that the operation reads:

\[y \leftarrow y \beta+ \alpha \cdot \left( \begin{array}{c|c|c|c} a_0 & a_1 & \dots a_{n-1} \end{array} \right) \left( \begin{array}{c} x_0 \\ \hline x_1 \\ \hline \vdots \\ x_{n-1} \end{array} \right)\]

The complete algorithm then reads as follows:

  • \(y \leftarrow \beta y\) (scal operation)

  • For \(j=0, \dots, n-1\)

    • \(y \leftarrow y + \left(\alpha x_j \right) \cdot a_j\) (axpy operation)

Algorithm for dgemv_dotf (fused dot products)

This algorithm is similar to the one for dgemv_dotf. However, it can be interpreted that it fuses several dot products together.

The so called fuse factor \(f\) is integer constant known at compile time (in the exercise its a literal value defined through a macro) so that the most inner loop can be unrolled:

(\(\;f\) is a literal (positiv) integer)

  • \(y \leftarrow \beta y\) (scal operation)

  • For \(i=0, \dots, \left\lfloor\frac{m}{f}\right\rfloor - 1\)

    • For \(j=0, \dots, n-1\)

      • For \(\ell=0, \dots, f-1\)

        • \(y_{i \cdot f + \ell} \leftarrow y_{i \cdot f + \ell} + \alpha \cdot a_{i \cdot f + \ell,\, j} \cdot x_j\)

  • For \(i=\left\lfloor\frac{m}{f}\right\rfloor \cdot f, \dots, m-1\)

    • \(y_i \leftarrow y_i + \alpha \cdot a_i^T x\) (dot operation)

Algorithm for dgemv_axpyf (fused axpy operations)

This algorithm is similar to the one for dgemv_axpyf with a fixed number of axpy operations fused. interpreted that it fuses several dot products together.

Again the fuse factor \(f\) (which can differ from the fuse factor for dgemv_dotf) is a integer constant known at compile time:

(\(\;f\) is a literal (positiv) integer)

  • \(y \leftarrow \beta y\) (scal operation)

  • For \(j=0, \dots, \left\lfloor\frac{n}{f}\right\rfloor -1\)

    • For \(i=0, \dots, m-1\)

      • For \(\ell=0, \dots, f-1\)

        • \(y_{i} \leftarrow y_{i} + \alpha \cdot a_{i, \, j \cdot f + \ell} \cdot x_{j \cdot f + \ell}\)

  • For \(j=\left\lfloor\frac{n}{f}\right\rfloor \cdot f, \dots, n-1\)

    • \(y \leftarrow y + \left(\alpha x_j \right) \cdot a_j\) (axpy operation)