GEMV: General Matrix Vector Product

Content

GEMV: General Matrix Vector Product
Possible Solution and another Exercise

In this session we will implement and benchmark different implementations for the so called GEMV operation:

\[y \leftarrow \beta y + \alpha A x,\quad A \in\mathbb{M}^{m \times n},\; x \in \mathbb{M}^n, y \in \mathbb{M}^m\]

This operation is specified in BLAS Level 2. Here some additional notes on the parameters and the operation:

Dimensions \(m\) and \(n\) can be zero.
If \(m\) equals zero gemv is a no-operation.
If \(n\) equals zero or \(\alpha\) equals zero it just computes \(y \leftarrow \beta y\). In this case \(x\) or \(A\) can contain NaN values.
If \(\beta\) equals zero it computes \(y \leftarrow \alpha A x\). In this case \(y\) is allowed to contain NaN values.

Error bound

We provide one reference implementation for GEMV. An alternative implementation will be tested as follows:

Choose \(m\), \(n\), \(\alpha\), \(\beta\) and generate matrix \(A\) and vectors \(x\), \(y_0\).
Use the reference implementation to compute a trusted \(y_\text{ref}\) with \(y_\text{ref} = \beta y_0 + \alpha A x\)
Use the alternative implementation to compute \(y_\text{sol}\) with \(y_\text{sol} = \beta y_0 + \alpha A x\)

Even both implementations are numerically acceptable they might differ due to round off errors. However, we will accept if the expression (where \(\text{eps}\) denotes the machine precision)

\[\frac{\| y_\text{ref} - y_\text{sol} \| }{\text{eps}\left( \text{max}\{m, n\} \cdot |\alpha| \cdot \|A\|_\infty \cdot \|x\|_\infty + m \cdot |\beta| \cdot \|y_0\|_\infty \right) }\]

is smaller than a threshold \(\tau\). Theoretically one can only show that \(\tau = \mathcal{O}(1)\) (i.e. does not depend on matrix dimensions, the condition of \(A\) or the particular choice of \(x\) and \(y\)). But that means, that there is still an unknown constant involved. In practice, choosing \(\tau = 2\) should typically be sufficient to detect errors in the implementation.

Setting up Matrices and Vectors for Testing

In general we will initialize matrices and vectors with random numbers (using srand for initializing the random number generator and rand() for getting a pseudo random number).

However, in case \(\alpha\) equals zero we initialize \(A\) and \(x\) with NaN, and in case \(\beta\) equals zero we initialize \(y_0\) with NaN.

Values for \(\alpha\) and \(\beta\) can be specified when you compile the code:

Use for example flag -DALPHA=1.5 to set \(\alpha\) to \(1.5\) and analogously
flag -DBETA=2.5 to specify \(\beta\).

Measure performance

The GEMV operation requires about \(m \cdot (2\cdot n + 1)\) floating point operations (we do not distinguish between floating point addition and multiplication). The efficiency will be measured in mega flops per second:

\[\text{MFLOPS} = \frac{\text{Floating point operations} }{10^6 \cdot \text{wall time in seconds}}\]

For measuring the required wall time for an operation we re-run the operation several times and use the average time.

Skeleton for Test and Benchmark

Make yourself familiar with the skeleton below:

In main we first allocate sufficient memory for all matrices and vectors used in the test.
In a for loop we iterate (beginning with \(m=n=10\)) over all matrix/vector dimensions used for the benchmark.

Each gemv implementation is tested inside its own scope. For example:

{
    double t    = 0;
    size_t runs = 0;

    while (t<0.1 and runs<3) {
        dcopy(m, y0, incY0, yRef, incYRef);

        double t0 = walltime();
        dgemv_ref(m, n,
                  alpha,
                  A, incRowA, incColA,
                  x, incX,
                  beta,
                  yRef, incYRef);
        t += walltime() - t0;
        ++runs;
    }
    t /= runs;

    printf("\nyRef =\n");
    printGeMatrix(1, m, yRef, 0, incYRef);
    printf("%10.2lf %10.2lf ", t, mflop/t);
}

This allows to re-use local variable names t and runs. Before an gemv implementation gets tested a corresponding vector for \(y\) gets initialized with y0. This means, that we each time use the same data for the operation.

Except for the reference implementation we compute the error bounds.
The code contains in this form a lot of output:
- Output of vectors and matrices
  
  We use this output for a quick eye-test for the computed results. This is only helpful for small problem sizes. Remove this kind of output for larger problem sizes.
- Output of error bounds and mflops
  
  This output is useful for testing larger problem sizes (but not so helpful for debugging)
In this skeleton there is a break statement at the end of the for loop. Therefore test stops after testing the problemsize \(m=n=10\). Once you eye-checked your implementation remove this statement.

session11/gemv_ex1.c

#include <stdlib.h>         // for malloc(), free(), rand(), srand()
#include <stdio.h>          // for printf()
#include <stddef.h>         // for size_t, ptrdiff_t
#include <math.h>           // for nan(), fabs()
#include <float.h>          // for DBL_EPSILON
#include <stdbool.h>        // for typedef bool
#include <sys/times.h>      // needed for walltime()
#include <unistd.h>         // needed for walltime()

//-- Function for benchmarking and testing -------------------------------------

double
walltime()
{
   struct tms    ts;
   static double ClockTick=0.0;

   if (ClockTick==0.0) {
        ClockTick = 1.0 / ((double) sysconf(_SC_CLK_TCK));
   }
   return ((double) times(&ts)) * ClockTick;
}


void
initGeMatrix(size_t m, size_t n,
             double *A,
             ptrdiff_t incRowA, ptrdiff_t incColA)
{
    for (size_t i=0; i<m; ++i) {
        for (size_t j=0; j<n; ++j) {
            A[i*incRowA + j*incColA] = i*n + j +1;
        }
    }
}

void
randGeMatrix(size_t m, size_t n, bool withNan,
             double *A,
             ptrdiff_t incRowA, ptrdiff_t incColA)
{
    for (size_t i=0; i<m; ++i) {
        for (size_t j=0; j<n; ++j) {
            A[i*incRowA + j*incColA] = withNan
                                     ? nan("")
                                     : 2.*(rand()-RAND_MAX/2)/RAND_MAX;
        }
    }
}

void
printGeMatrix(size_t m, size_t n,
              const double *A,
              ptrdiff_t incRowA, ptrdiff_t incColA)
{
    for (size_t i=0; i<m; ++i) {
        for (size_t j=0; j<n; ++j) {
            printf("%9.2lf ", A[i*incRowA + j*incColA]);
        }
        printf("\n");
    }
    printf("\n");
}

double
dgenrm_inf(size_t m, size_t n,
           const double *A,
           ptrdiff_t incRowA, ptrdiff_t incColA)
{
    double result = 0;

    for (size_t i=0; i<m; ++i) {
        double sum = 0;

        for (size_t j=0; j<n; ++j) {
            sum += fabs(A[i*incRowA+j*incColA]);
        }

        if (sum>result) {
            result = sum;
        }
    }

    return result;
}

//-- BLAS Level 1 functions ----------------------------------------------------

void
dcopy(size_t n,
      const double *x, ptrdiff_t incX,
      double *y, ptrdiff_t incY)
{
    for (size_t i=0; i<n; ++i) {
        y[i*incY] = x[i*incX];
    }
}

void
daxpy(size_t n, double alpha,
      const double *x, ptrdiff_t incX,
      double *y, ptrdiff_t incY)
{
    for (size_t i=0; i<n; ++i) {
        y[i*incY] += alpha*x[i*incX];
    }
}

double
ddot(size_t n,
     const double *x, ptrdiff_t incX,
     const double *y, ptrdiff_t incY)
{
    double result = 0;

    for (size_t i=0; i<n; ++i) {
        result += x[i*incX]*y[i*incY];
    }

    return result;
}

void
dscal(size_t n,
      double alpha,
      double *x, ptrdiff_t incX)
{
    if (alpha==1) {
        return;
    }
    if (alpha!=0) {
        for (size_t i=0; i<n; ++i) {
            x[i*incX] *= alpha;
        }
    } else {
        for (size_t i=0; i<n; ++i) {
            x[i*incX] = 0;
        }
    }
}

//-- BLAS Level 2 functions ----------------------------------------------------

void
dgemv_dot(size_t m, size_t n,
          double alpha,
          const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
          const double *x, ptrdiff_t incX,
          double beta,
          double *y, ptrdiff_t incY)
{
    /* Your code here */
}

void
dgemv_axpy(size_t m, size_t n,
           double alpha,
           const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
           const double *x, ptrdiff_t incX,
           double beta,
           double *y, ptrdiff_t incY)
{
    /* your code here */
}


#ifndef DOTF
#define DOTF 2
#endif

void
dgemv_dotf(size_t m, size_t n,
           double alpha,
           const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
           const double *x, ptrdiff_t incX,
           double beta,
           double *y, ptrdiff_t incY)
{
    /* your code here */
}


#ifndef AXPYF
#define AXPYF 2
#endif

void
dgemv_axpyf(size_t m, size_t n,
            double alpha,
            const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
            const double *x, ptrdiff_t incX,
            double beta,
            double *y, ptrdiff_t incY)
{
    /* Your code here */
}


//-- BLAS Level 2: dgemv reference implementation and error bound --------------

void
dgemv_ref(size_t m, size_t n,
          double alpha,
          const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
          const double *x, ptrdiff_t incX,
          double beta,
          double *y, ptrdiff_t incY)
{
    if (beta!=1) {
        if (beta!=0) {
            for (size_t i=0; i<m; ++i) {
                y[i*incY] *= beta;
            }
        } else {
            for (size_t i=0; i<m; ++i) {
                y[i*incY] = 0;
            }
        }
    }
    if (alpha!=0) {
        for (size_t j=0; j<n; ++j) {
            for (size_t i=0; i<m; ++i) {
                y[i*incY] += alpha*A[i*incRowA+j*incColA]*x[j*incX];
            }
        }
    }
}

// - Computes error bound for the test result ySol of the gemv operation
//   beta*y0 + alpha*A*x.
// - yRef is trusted result.
// - ySol gets overwritten.
double
dgemv_err(size_t m, size_t n,
          double alpha,
          const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
          const double *x, ptrdiff_t incX,
          const double *y0, ptrdiff_t incY0,
          double beta,
          const double *yRef, ptrdiff_t incYRef,
          double *ySol, ptrdiff_t incYSol)
{
    double nrmY0 = dgenrm_inf(m, 1, y0, incY0, 1);
    double nrmX  = dgenrm_inf(n, 1, x, incX, 1);
    double nrmA  = dgenrm_inf(m, n, A, incRowA, incColA);
    size_t maxMN = m<n ? n : m;

    // nrmDiff = ||y2 - y1||_inf
    daxpy(m, -1, yRef, incYRef, ySol, incYSol);
    double nrmDiff = dgenrm_inf(m, 1, ySol, incYSol, 1);

    return nrmDiff /
        (DBL_EPSILON*(maxMN*fabs(alpha)*nrmA*nrmX + m*fabs(beta)*nrmY0));
}



//------------------------------------------------------------------------------

#ifndef COLMAJOR
#define COLMAJOR 0
#endif

#ifndef SEED_RAND
#define SEED_RAND 0
#endif

#ifndef MAX_M
#define MAX_M 4000
#endif

#ifndef MAX_N
#define MAX_N 4000
#endif

#ifndef ALPHA
#define ALPHA 1
#endif

#ifndef BETA
#define BETA 1
#endif

int
main()
{
    srand(SEED_RAND);
    printf("#COLMAJOR = %d\n", COLMAJOR);
    printf("#ALPHA    = %lf\n", (double)ALPHA);
    printf("#BETA     = %lf\n", (double)BETA);

    double *A    = malloc(MAX_M*MAX_N*sizeof(double));
    double *x    = malloc(MAX_N*sizeof(double));
    double *y0   = malloc(MAX_M*sizeof(double));
    double *yRef = malloc(MAX_M*sizeof(double));
    double *ySol = malloc(MAX_M*sizeof(double));
    if (!A || !x || !y0 || !yRef || !ySol) {
        abort();
    }

    // print header
    printf("#%4s %4s ", "m", "n");
    printf("%10s %10s ", "time ref", "mflops ref");
    printf("%10s %10s %7s ", "time 1", "mflops 1", "err");
    printf("%10s %10s %7s ", "time 2", "mflops 2", "err");
    printf("\n");

    for (size_t m=10, n=10; m<=MAX_M && n<=MAX_N; m+=10, n+=10) {
        ptrdiff_t incRowA = COLMAJOR ? 1 : n;
        ptrdiff_t incColA = COLMAJOR ? m : 1;

        ptrdiff_t incX    = 1;
        ptrdiff_t incY0   = 1;
        ptrdiff_t incYRef = 1;
        ptrdiff_t incYSol = 1;

        double    alpha = ALPHA;
        double    beta  = BETA;

        double    mflop = m*(2*n+1)/1000./1000.;

        randGeMatrix(m, n, ALPHA==0, A, incRowA, incColA);
        randGeMatrix(n, 1, ALPHA==0, x, incX, 0);
        randGeMatrix(m, 1, BETA==0, y0, incY0, 0);

        printf("A =\n");
        printGeMatrix(m, n, A, incRowA, incColA);

        printf("x =\n");
        printGeMatrix(1, n, x, 0, incX);

        printf("y0 =\n");
        printGeMatrix(1, m, y0, 0, incY0);

        printf(" %4zu %4zu ", m,  n);

        {
            double t    = 0;
            size_t runs = 0;

            while (t<0.1 || runs<3) {
                dcopy(m, y0, incY0, yRef, incYRef);

                double t0 = walltime();
                dgemv_ref(m, n,
                          alpha,
                          A, incRowA, incColA,
                          x, incX,
                          beta,
                          yRef, incYRef);
                t += walltime() - t0;
                ++runs;
            }
            t /= runs;

            printf("\nyRef =\n");
            printGeMatrix(1, m, yRef, 0, incYRef);
            printf("%10.2lf %10.2lf ", t, mflop/t);
        }

        {
            double t    = 0;
            size_t runs = 0;

            while (t<0.1 || runs<3) {
                dcopy(m, y0, incY0, ySol, incYSol);

                double t0 = walltime();
                dgemv_dot(m, n,
                          alpha,
                          A, incRowA, incColA,
                          x, incX,
                          beta,
                          ySol, incYSol);
                t += walltime() - t0;
                ++runs;
            }
            t /= runs;

            printf("\nySol =\n");
            printGeMatrix(1, m, ySol, 0, incYSol);

            double err = dgemv_err(m, n, alpha,
                                   A, incRowA, incColA,
                                   x, incX,
                                   y0, incY0,
                                   beta,
                                   yRef, incYRef,
                                   ySol, incYSol);

            printf("\nyDiff =\n");
            printGeMatrix(1, m, ySol, 0, incYSol);

            printf("%10.2lf %10.2lf %7.1e ", t, mflop/t, err);
        }

        printf("\n");
        break;
    }

    free(A);
    free(x);
    free(y0);
    free(yRef);
    free(ySol);
}

Here the output generated by the current version:

heim$ gcc -Wall -std=c11 -O3 -o gemv gemv_ex1.c
heim$ ./gemv
#COLMAJOR = 0
#ALPHA    = 1.000000
#BETA     = 1.000000
#   m    n   time ref mflops ref     time 1   mflops 1     err     time 2   mflops 2     err 
A =
     0.68     -0.21      0.57      0.60      0.82     -0.60     -0.33      0.54     -0.44      0.11 
    -0.05      0.26     -0.27      0.03      0.90      0.83      0.27      0.43     -0.72      0.21 
    -0.97     -0.51     -0.73      0.61     -0.69     -0.20     -0.74     -0.78      1.00     -0.56 
     0.03      0.68      0.23     -0.41      0.28      0.05     -0.01      0.95     -0.41      0.54 
     0.05      0.54     -0.20      0.78     -0.43     -0.30      0.62      0.84     -0.86      0.90 
     0.05     -0.83     -0.62      0.33      0.78     -0.30     -0.87     -0.96     -0.08     -0.87 
    -0.52      0.94      0.80      0.70     -0.47      0.08     -0.25      0.52      0.03      0.34 
     0.06     -0.92     -0.12      0.86      0.86      0.44     -0.43      0.48      0.28     -0.29 
     0.38     -0.67     -0.12      0.76      0.66     -0.34     -0.54      0.79     -0.30      0.37 
     0.91      0.18      0.31      0.72     -0.12      0.85     -0.20      0.63      0.37      0.82 

x =
    -0.04     -0.57      0.90      0.84     -0.70      0.76      0.28     -0.14      0.24     -0.44 

y0 =
     0.57     -0.39     -0.11     -0.55     -0.62     -0.45      0.11     -0.17     -0.66      0.81 

   10   10 
yRef =
     0.32     -1.00      0.79     -1.70     -0.91     -0.79      1.02      0.71     -0.98      1.88 

      0.00     502.36 
ySol =
     0.57     -0.39     -0.11     -0.55     -0.62     -0.45      0.11     -0.17     -0.66      0.81 


yDiff =
     0.25      0.62     -0.90      1.15      0.29      0.34     -0.90     -0.87      0.32     -1.07 

      0.00     640.79 7.5e+13 
heim$

Exercise

Implement the different algorithms for gemv:

gemv based on the dot product (function dgemv_dot).
gemv based on the scaled vector update axpy (function dgemv_axpy).
gemv based on the fused dot product (function dgemv_dotf). Use macro DOTF for the fuse factor.
gemv based on the scaled vector update axpy (function dgemv_axpyf). Use macro AXPYF for the fuse factor.

Run tests for col major (compile with -DCOLMAJOR=1) and row major (-DCOLMAJOR=0). Also make sure that you also test cases with \(\alpha=0\) or \(\beta=0\).

Algorithm for `dgemv_dot`

We consider \(A\) as a \(m \times n\) matrix that consists of \(m\) row vectors:

\[A = \left( \begin{array}{c} a_0^T \\ \hline a_1^T \\ \hline \vdots \\ \hline a_{m-1}^T \end{array} \right)\]

This partitioning of \(A\) in the operation \(\beta y + \alpha A x\) requires that we also consider \(y\) component wise. So that it reads

\[ \left( \begin{array}{c} y_0 \\ \hline y_1 \\ \hline \vdots \\ \hline y_{m-1} \end{array} \right)\leftarrow\beta \cdot \left( \begin{array}{c} y_0 \\ \hline y_1 \\ \hline \vdots \\ \hline y_{m-1} \end{array} \right)+ \alpha \cdot \left( \begin{array}{c} a_0^T \\ \hline a_1^T \\ \hline \vdots \\ \hline a_{m-1}^T \end{array} \right)\cdot x\]

The complete algorithm then reads as follows:

\(y \leftarrow \beta \cdot y\) (scal operation)
For \(i=0, \dots, m-1\)
- \(y_i \leftarrow y_i + \alpha \cdot a_i^T x\) (dot operation)

Algorithm for `dgemv_axpy`

We consider \(A\) as a \(m \times n\) matrix that consists of \(n\) column vectors:

\[A = \left( \begin{array}{c|c|c|c} a_0 & a_1 & \dots a_{n-1} \end{array} \right)\]

This requires to consider \(x\) component wise so that the operation reads:

\[y \leftarrow y \beta+ \alpha \cdot \left( \begin{array}{c|c|c|c} a_0 & a_1 & \dots a_{n-1} \end{array} \right) \left( \begin{array}{c} x_0 \\ \hline x_1 \\ \hline \vdots \\ x_{n-1} \end{array} \right)\]

The complete algorithm then reads as follows:

\(y \leftarrow \beta y\) (scal operation)
For \(j=0, \dots, n-1\)
- \(y \leftarrow y + \left(\alpha x_j \right) \cdot a_j\) (axpy operation)

Algorithm for `dgemv_dotf` (fused dot products)

This algorithm is similar to the one for dgemv_dotf. However, it can be interpreted that it fuses several dot products together.

The so called fuse factor \(f\) is integer constant known at compile time (in the exercise its a literal value defined through a macro) so that the most inner loop can be unrolled:

(\(\;f\) is a literal (positiv) integer)

\(y \leftarrow \beta y\) (scal operation)
For \(i=0, \dots, \left\lfloor\frac{m}{f}\right\rfloor - 1\)
- For \(j=0, \dots, n-1\)
  - For \(\ell=0, \dots, f-1\)
    - \(y_{i \cdot f + \ell} \leftarrow y_{i \cdot f + \ell} + \alpha \cdot a_{i \cdot f + \ell,\, j} \cdot x_j\)
For \(i=\left\lfloor\frac{m}{f}\right\rfloor \cdot f, \dots, m-1\)
- \(y_i \leftarrow y_i + \alpha \cdot a_i^T x\) (dot operation)

Algorithm for `dgemv_axpyf` (fused axpy operations)

This algorithm is similar to the one for dgemv_axpyf with a fixed number of axpy operations fused. interpreted that it fuses several dot products together.

Again the fuse factor \(f\) (which can differ from the fuse factor for dgemv_dotf) is a integer constant known at compile time: