GEMV: General Matrix Vector Product
Content |
In this session we will implement and benchmark different implementations for the so called GEMV operation:
\[y \leftarrow \beta y + \alpha A x,\quad A \in\mathbb{M}^{m \times n},\; x \in \mathbb{M}^n, y \in \mathbb{M}^m\]This operation is specified in BLAS Level 2. Here some additional notes on the parameters and the operation:
-
Dimensions \(m\) and \(n\) can be zero.
-
If \(m\) equals zero gemv is a no-operation.
-
If \(n\) equals zero or \(\alpha\) equals zero it just computes \(y \leftarrow \beta y\). In this case \(x\) or \(A\) can contain NaN values.
-
If \(\beta\) equals zero it computes \(y \leftarrow \alpha A x\). In this case \(y\) is allowed to contain NaN values.
Error bound
We provide one reference implementation for GEMV. An alternative implementation will be tested as follows:
-
Choose \(m\), \(n\), \(\alpha\), \(\beta\) and generate matrix \(A\) and vectors \(x\), \(y_0\).
-
Use the reference implementation to compute a trusted \(y_\text{ref}\) with \(y_\text{ref} = \beta y_0 + \alpha A x\)
-
Use the alternative implementation to compute \(y_\text{sol}\) with \(y_\text{sol} = \beta y_0 + \alpha A x\)
Even both implementations are numerically acceptable they might differ due to round off errors. However, we will accept if the expression (where \(\text{eps}\) denotes the machine precision)
\[\frac{\| y_\text{ref} - y_\text{sol} \| }{\text{eps}\left( \text{max}\{m, n\} \cdot |\alpha| \cdot \|A\|_\infty \cdot \|x\|_\infty + m \cdot |\beta| \cdot \|y_0\|_\infty \right) }\]is smaller than a threshold \(\tau\). Theoretically one can only show that \(\tau = \mathcal{O}(1)\) (i.e. does not depend on matrix dimensions, the condition of \(A\) or the particular choice of \(x\) and \(y\)). But that means, that there is still an unknown constant involved. In practice, choosing \(\tau = 2\) should typically be sufficient to detect errors in the implementation.
Setting up Matrices and Vectors for Testing
In general we will initialize matrices and vectors with random numbers (using srand for initializing the random number generator and rand() for getting a pseudo random number).
However, in case \(\alpha\) equals zero we initialize \(A\) and \(x\) with NaN, and in case \(\beta\) equals zero we initialize \(y_0\) with NaN.
Values for \(\alpha\) and \(\beta\) can be specified when you compile the code:
-
Use for example flag -DALPHA=1.5 to set \(\alpha\) to \(1.5\) and analogously
-
flag -DBETA=2.5 to specify \(\beta\).
Measure performance
The GEMV operation requires about \(m \cdot (2\cdot n + 1)\) floating point operations (we do not distinguish between floating point addition and multiplication). The efficiency will be measured in mega flops per second:
\[\text{MFLOPS} = \frac{\text{Floating point operations} }{10^6 \cdot \text{wall time in seconds}}\]For measuring the required wall time for an operation we re-run the operation several times and use the average time.
Skeleton for Test and Benchmark
Make yourself familiar with the skeleton below:
-
In main we first allocate sufficient memory for all matrices and vectors used in the test.
-
In a for loop we iterate (beginning with \(m=n=10\)) over all matrix/vector dimensions used for the benchmark.
-
Each gemv implementation is tested inside its own scope. For example:
{ double t = 0; size_t runs = 0; while (t<0.1 and runs<3) { dcopy(m, y0, incY0, yRef, incYRef); double t0 = walltime(); dgemv_ref(m, n, alpha, A, incRowA, incColA, x, incX, beta, yRef, incYRef); t += walltime() - t0; ++runs; } t /= runs; printf("\nyRef =\n"); printGeMatrix(1, m, yRef, 0, incYRef); printf("%10.2lf %10.2lf ", t, mflop/t); }
This allows to re-use local variable names t and runs. Before an gemv implementation gets tested a corresponding vector for \(y\) gets initialized with y0. This means, that we each time use the same data for the operation.
-
Except for the reference implementation we compute the error bounds.
-
The code contains in this form a lot of output:
-
Output of vectors and matrices
We use this output for a quick eye-test for the computed results. This is only helpful for small problem sizes. Remove this kind of output for larger problem sizes.
-
Output of error bounds and mflops
This output is useful for testing larger problem sizes (but not so helpful for debugging)
-
-
In this skeleton there is a break statement at the end of the for loop. Therefore test stops after testing the problemsize \(m=n=10\). Once you eye-checked your implementation remove this statement.
#include <stdlib.h> // for malloc(), free(), rand(), srand() #include <stdio.h> // for printf() #include <stddef.h> // for size_t, ptrdiff_t #include <math.h> // for nan(), fabs() #include <float.h> // for DBL_EPSILON #include <stdbool.h> // for typedef bool #include <sys/times.h> // needed for walltime() #include <unistd.h> // needed for walltime() //-- Function for benchmarking and testing ------------------------------------- double walltime() { struct tms ts; static double ClockTick=0.0; if (ClockTick==0.0) { ClockTick = 1.0 / ((double) sysconf(_SC_CLK_TCK)); } return ((double) times(&ts)) * ClockTick; } void initGeMatrix(size_t m, size_t n, double *A, ptrdiff_t incRowA, ptrdiff_t incColA) { for (size_t i=0; i<m; ++i) { for (size_t j=0; j<n; ++j) { A[i*incRowA + j*incColA] = i*n + j +1; } } } void randGeMatrix(size_t m, size_t n, bool withNan, double *A, ptrdiff_t incRowA, ptrdiff_t incColA) { for (size_t i=0; i<m; ++i) { for (size_t j=0; j<n; ++j) { A[i*incRowA + j*incColA] = withNan ? nan("") : 2.*(rand()-RAND_MAX/2)/RAND_MAX; } } } void printGeMatrix(size_t m, size_t n, const double *A, ptrdiff_t incRowA, ptrdiff_t incColA) { for (size_t i=0; i<m; ++i) { for (size_t j=0; j<n; ++j) { printf("%9.2lf ", A[i*incRowA + j*incColA]); } printf("\n"); } printf("\n"); } double dgenrm_inf(size_t m, size_t n, const double *A, ptrdiff_t incRowA, ptrdiff_t incColA) { double result = 0; for (size_t i=0; i<m; ++i) { double sum = 0; for (size_t j=0; j<n; ++j) { sum += fabs(A[i*incRowA+j*incColA]); } if (sum>result) { result = sum; } } return result; } //-- BLAS Level 1 functions ---------------------------------------------------- void dcopy(size_t n, const double *x, ptrdiff_t incX, double *y, ptrdiff_t incY) { for (size_t i=0; i<n; ++i) { y[i*incY] = x[i*incX]; } } void daxpy(size_t n, double alpha, const double *x, ptrdiff_t incX, double *y, ptrdiff_t incY) { for (size_t i=0; i<n; ++i) { y[i*incY] += alpha*x[i*incX]; } } double ddot(size_t n, const double *x, ptrdiff_t incX, const double *y, ptrdiff_t incY) { double result = 0; for (size_t i=0; i<n; ++i) { result += x[i*incX]*y[i*incY]; } return result; } void dscal(size_t n, double alpha, double *x, ptrdiff_t incX) { if (alpha==1) { return; } if (alpha!=0) { for (size_t i=0; i<n; ++i) { x[i*incX] *= alpha; } } else { for (size_t i=0; i<n; ++i) { x[i*incX] = 0; } } } //-- BLAS Level 2 functions ---------------------------------------------------- void dgemv_dot(size_t m, size_t n, double alpha, const double *A, ptrdiff_t incRowA, ptrdiff_t incColA, const double *x, ptrdiff_t incX, double beta, double *y, ptrdiff_t incY) { /* Your code here */ } void dgemv_axpy(size_t m, size_t n, double alpha, const double *A, ptrdiff_t incRowA, ptrdiff_t incColA, const double *x, ptrdiff_t incX, double beta, double *y, ptrdiff_t incY) { /* your code here */ } #ifndef DOTF #define DOTF 2 #endif void dgemv_dotf(size_t m, size_t n, double alpha, const double *A, ptrdiff_t incRowA, ptrdiff_t incColA, const double *x, ptrdiff_t incX, double beta, double *y, ptrdiff_t incY) { /* your code here */ } #ifndef AXPYF #define AXPYF 2 #endif void dgemv_axpyf(size_t m, size_t n, double alpha, const double *A, ptrdiff_t incRowA, ptrdiff_t incColA, const double *x, ptrdiff_t incX, double beta, double *y, ptrdiff_t incY) { /* Your code here */ } //-- BLAS Level 2: dgemv reference implementation and error bound -------------- void dgemv_ref(size_t m, size_t n, double alpha, const double *A, ptrdiff_t incRowA, ptrdiff_t incColA, const double *x, ptrdiff_t incX, double beta, double *y, ptrdiff_t incY) { if (beta!=1) { if (beta!=0) { for (size_t i=0; i<m; ++i) { y[i*incY] *= beta; } } else { for (size_t i=0; i<m; ++i) { y[i*incY] = 0; } } } if (alpha!=0) { for (size_t j=0; j<n; ++j) { for (size_t i=0; i<m; ++i) { y[i*incY] += alpha*A[i*incRowA+j*incColA]*x[j*incX]; } } } } // - Computes error bound for the test result ySol of the gemv operation // beta*y0 + alpha*A*x. // - yRef is trusted result. // - ySol gets overwritten. double dgemv_err(size_t m, size_t n, double alpha, const double *A, ptrdiff_t incRowA, ptrdiff_t incColA, const double *x, ptrdiff_t incX, const double *y0, ptrdiff_t incY0, double beta, const double *yRef, ptrdiff_t incYRef, double *ySol, ptrdiff_t incYSol) { double nrmY0 = dgenrm_inf(m, 1, y0, incY0, 1); double nrmX = dgenrm_inf(n, 1, x, incX, 1); double nrmA = dgenrm_inf(m, n, A, incRowA, incColA); size_t maxMN = m<n ? n : m; // nrmDiff = ||y2 - y1||_inf daxpy(m, -1, yRef, incYRef, ySol, incYSol); double nrmDiff = dgenrm_inf(m, 1, ySol, incYSol, 1); return nrmDiff / (DBL_EPSILON*(maxMN*fabs(alpha)*nrmA*nrmX + m*fabs(beta)*nrmY0)); } //------------------------------------------------------------------------------ #ifndef COLMAJOR #define COLMAJOR 0 #endif #ifndef SEED_RAND #define SEED_RAND 0 #endif #ifndef MAX_M #define MAX_M 4000 #endif #ifndef MAX_N #define MAX_N 4000 #endif #ifndef ALPHA #define ALPHA 1 #endif #ifndef BETA #define BETA 1 #endif int main() { srand(SEED_RAND); printf("#COLMAJOR = %d\n", COLMAJOR); printf("#ALPHA = %lf\n", (double)ALPHA); printf("#BETA = %lf\n", (double)BETA); double *A = malloc(MAX_M*MAX_N*sizeof(double)); double *x = malloc(MAX_N*sizeof(double)); double *y0 = malloc(MAX_M*sizeof(double)); double *yRef = malloc(MAX_M*sizeof(double)); double *ySol = malloc(MAX_M*sizeof(double)); if (!A || !x || !y0 || !yRef || !ySol) { abort(); } // print header printf("#%4s %4s ", "m", "n"); printf("%10s %10s ", "time ref", "mflops ref"); printf("%10s %10s %7s ", "time 1", "mflops 1", "err"); printf("%10s %10s %7s ", "time 2", "mflops 2", "err"); printf("\n"); for (size_t m=10, n=10; m<=MAX_M && n<=MAX_N; m+=10, n+=10) { ptrdiff_t incRowA = COLMAJOR ? 1 : n; ptrdiff_t incColA = COLMAJOR ? m : 1; ptrdiff_t incX = 1; ptrdiff_t incY0 = 1; ptrdiff_t incYRef = 1; ptrdiff_t incYSol = 1; double alpha = ALPHA; double beta = BETA; double mflop = m*(2*n+1)/1000./1000.; randGeMatrix(m, n, ALPHA==0, A, incRowA, incColA); randGeMatrix(n, 1, ALPHA==0, x, incX, 0); randGeMatrix(m, 1, BETA==0, y0, incY0, 0); printf("A =\n"); printGeMatrix(m, n, A, incRowA, incColA); printf("x =\n"); printGeMatrix(1, n, x, 0, incX); printf("y0 =\n"); printGeMatrix(1, m, y0, 0, incY0); printf(" %4zu %4zu ", m, n); { double t = 0; size_t runs = 0; while (t<0.1 || runs<3) { dcopy(m, y0, incY0, yRef, incYRef); double t0 = walltime(); dgemv_ref(m, n, alpha, A, incRowA, incColA, x, incX, beta, yRef, incYRef); t += walltime() - t0; ++runs; } t /= runs; printf("\nyRef =\n"); printGeMatrix(1, m, yRef, 0, incYRef); printf("%10.2lf %10.2lf ", t, mflop/t); } { double t = 0; size_t runs = 0; while (t<0.1 || runs<3) { dcopy(m, y0, incY0, ySol, incYSol); double t0 = walltime(); dgemv_dot(m, n, alpha, A, incRowA, incColA, x, incX, beta, ySol, incYSol); t += walltime() - t0; ++runs; } t /= runs; printf("\nySol =\n"); printGeMatrix(1, m, ySol, 0, incYSol); double err = dgemv_err(m, n, alpha, A, incRowA, incColA, x, incX, y0, incY0, beta, yRef, incYRef, ySol, incYSol); printf("\nyDiff =\n"); printGeMatrix(1, m, ySol, 0, incYSol); printf("%10.2lf %10.2lf %7.1e ", t, mflop/t, err); } printf("\n"); break; } free(A); free(x); free(y0); free(yRef); free(ySol); }
Here the output generated by the current version:
heim$ gcc -Wall -std=c11 -O3 -o gemv gemv_ex1.c heim$ ./gemv #COLMAJOR = 0 #ALPHA = 1.000000 #BETA = 1.000000 # m n time ref mflops ref time 1 mflops 1 err time 2 mflops 2 err A = 0.68 -0.21 0.57 0.60 0.82 -0.60 -0.33 0.54 -0.44 0.11 -0.05 0.26 -0.27 0.03 0.90 0.83 0.27 0.43 -0.72 0.21 -0.97 -0.51 -0.73 0.61 -0.69 -0.20 -0.74 -0.78 1.00 -0.56 0.03 0.68 0.23 -0.41 0.28 0.05 -0.01 0.95 -0.41 0.54 0.05 0.54 -0.20 0.78 -0.43 -0.30 0.62 0.84 -0.86 0.90 0.05 -0.83 -0.62 0.33 0.78 -0.30 -0.87 -0.96 -0.08 -0.87 -0.52 0.94 0.80 0.70 -0.47 0.08 -0.25 0.52 0.03 0.34 0.06 -0.92 -0.12 0.86 0.86 0.44 -0.43 0.48 0.28 -0.29 0.38 -0.67 -0.12 0.76 0.66 -0.34 -0.54 0.79 -0.30 0.37 0.91 0.18 0.31 0.72 -0.12 0.85 -0.20 0.63 0.37 0.82 x = -0.04 -0.57 0.90 0.84 -0.70 0.76 0.28 -0.14 0.24 -0.44 y0 = 0.57 -0.39 -0.11 -0.55 -0.62 -0.45 0.11 -0.17 -0.66 0.81 10 10 yRef = 0.32 -1.00 0.79 -1.70 -0.91 -0.79 1.02 0.71 -0.98 1.88 0.00 502.36 ySol = 0.57 -0.39 -0.11 -0.55 -0.62 -0.45 0.11 -0.17 -0.66 0.81 yDiff = 0.25 0.62 -0.90 1.15 0.29 0.34 -0.90 -0.87 0.32 -1.07 0.00 640.79 7.5e+13 heim$
Exercise
Implement the different algorithms for gemv:
-
gemv based on the dot product (function dgemv_dot).
-
gemv based on the scaled vector update axpy (function dgemv_axpy).
-
gemv based on the fused dot product (function dgemv_dotf). Use macro DOTF for the fuse factor.
-
gemv based on the scaled vector update axpy (function dgemv_axpyf). Use macro AXPYF for the fuse factor.
Run tests for col major (compile with -DCOLMAJOR=1) and row major (-DCOLMAJOR=0). Also make sure that you also test cases with \(\alpha=0\) or \(\beta=0\).
Algorithm for dgemv_dot
We consider \(A\) as a \(m \times n\) matrix that consists of \(m\) row vectors:
\[A = \left( \begin{array}{c} a_0^T \\ \hline a_1^T \\ \hline \vdots \\ \hline a_{m-1}^T \end{array} \right)\]This partitioning of \(A\) in the operation \(\beta y + \alpha A x\) requires that we also consider \(y\) component wise. So that it reads
\[ \left( \begin{array}{c} y_0 \\ \hline y_1 \\ \hline \vdots \\ \hline y_{m-1} \end{array} \right)\leftarrow\beta \cdot \left( \begin{array}{c} y_0 \\ \hline y_1 \\ \hline \vdots \\ \hline y_{m-1} \end{array} \right)+ \alpha \cdot \left( \begin{array}{c} a_0^T \\ \hline a_1^T \\ \hline \vdots \\ \hline a_{m-1}^T \end{array} \right)\cdot x\]The complete algorithm then reads as follows:
-
\(y \leftarrow \beta \cdot y\) (scal operation)
-
For \(i=0, \dots, m-1\)
-
\(y_i \leftarrow y_i + \alpha \cdot a_i^T x\) (dot operation)
-
Algorithm for dgemv_axpy
We consider \(A\) as a \(m \times n\) matrix that consists of \(n\) column vectors:
\[A = \left( \begin{array}{c|c|c|c} a_0 & a_1 & \dots a_{n-1} \end{array} \right)\]This requires to consider \(x\) component wise so that the operation reads:
\[y \leftarrow y \beta+ \alpha \cdot \left( \begin{array}{c|c|c|c} a_0 & a_1 & \dots a_{n-1} \end{array} \right) \left( \begin{array}{c} x_0 \\ \hline x_1 \\ \hline \vdots \\ x_{n-1} \end{array} \right)\]The complete algorithm then reads as follows:
-
\(y \leftarrow \beta y\) (scal operation)
-
For \(j=0, \dots, n-1\)
-
\(y \leftarrow y + \left(\alpha x_j \right) \cdot a_j\) (axpy operation)
-
Algorithm for dgemv_dotf (fused dot products)
This algorithm is similar to the one for dgemv_dotf. However, it can be interpreted that it fuses several dot products together.
The so called fuse factor \(f\) is integer constant known at compile time (in the exercise its a literal value defined through a macro) so that the most inner loop can be unrolled:
(\(\;f\) is a literal (positiv) integer)
-
\(y \leftarrow \beta y\) (scal operation)
-
For \(i=0, \dots, \left\lfloor\frac{m}{f}\right\rfloor - 1\)
-
For \(j=0, \dots, n-1\)
-
For \(\ell=0, \dots, f-1\)
-
\(y_{i \cdot f + \ell} \leftarrow y_{i \cdot f + \ell} + \alpha \cdot a_{i \cdot f + \ell,\, j} \cdot x_j\)
-
-
-
-
For \(i=\left\lfloor\frac{m}{f}\right\rfloor \cdot f, \dots, m-1\)
-
\(y_i \leftarrow y_i + \alpha \cdot a_i^T x\) (dot operation)
-
Algorithm for dgemv_axpyf (fused axpy operations)
This algorithm is similar to the one for dgemv_axpyf with a fixed number of axpy operations fused. interpreted that it fuses several dot products together.
Again the fuse factor \(f\) (which can differ from the fuse factor for dgemv_dotf) is a integer constant known at compile time:
(\(\;f\) is a literal (positiv) integer)
-
\(y \leftarrow \beta y\) (scal operation)
-
For \(j=0, \dots, \left\lfloor\frac{n}{f}\right\rfloor -1\)
-
For \(i=0, \dots, m-1\)
-
For \(\ell=0, \dots, f-1\)
-
\(y_{i} \leftarrow y_{i} + \alpha \cdot a_{i, \, j \cdot f + \ell} \cdot x_{j \cdot f + \ell}\)
-
-
-
-
For \(j=\left\lfloor\frac{n}{f}\right\rfloor \cdot f, \dots, n-1\)
-
\(y \leftarrow y + \left(\alpha x_j \right) \cdot a_j\) (axpy operation)
-