===================================
GEMV: General Matrix Vector Product                                        [TOC]
===================================

In this session we will implement and benchmark different variants of the
so-called GEMV operation:

---- LATEX ---------------------------------------------------------------------
    y \leftarrow \beta y + \alpha A x, \quad
    A \in \mathbb{M}^{m \times n},\;
    x \in \mathbb{M}^n,\;
    y \in \mathbb{M}^m
--------------------------------------------------------------------------------

This operation is specified in BLAS Level 2. Here are some additional notes on
the parameters and the operation:

 - Dimensions $m$ and $n$ can be zero.
 - If $m$ equals zero, gemv is a no-operation.
 - If $n$ equals zero or $\alpha$ equals zero, it just computes
   $y \leftarrow \beta y$. In this case $x$ or $A$ may contain NaN values.
 - If $\beta$ equals zero, it computes $y \leftarrow \alpha A x$. In this case
   $y$ is allowed to contain NaN values.

Error bound
===========

We provide one reference implementation for GEMV. An alternative
implementation will be tested as follows:

 - Choose $m$, $n$, $\alpha$, $\beta$ and generate matrix $A$ and vectors $x$,
   $y_0$.
 - Use the reference implementation to compute a trusted $y_\text{ref}$ with
   $y_\text{ref} = \beta y_0 + \alpha A x$.
 - Use the alternative implementation to compute $y_\text{sol}$ with
   $y_\text{sol} = \beta y_0 + \alpha A x$.

Even if both implementations are numerically acceptable, they might differ due
to round-off errors. However, we will accept the result if the expression
(where $\text{eps}$ denotes the machine precision)

---- LATEX ---------------------------------------------------------------------
    \frac{\| y_\text{ref} - y_\text{sol} \|}
         {\text{eps} \left(
              \text{max}\{m, n\} \cdot |\alpha| \cdot \|A\|_\infty
                                \cdot \|x\|_\infty
            + m \cdot |\beta| \cdot \|y_0\|_\infty
          \right)}
--------------------------------------------------------------------------------

is smaller than a threshold $\tau$. Theoretically one can only show that
$\tau = \mathcal{O}(1)$, i.e. that it does not depend on the matrix
dimensions, the condition of $A$, or the particular choice of $x$ and $y$.
That means there is still an unknown constant involved. In practice, choosing
$\tau = 2$ should typically be sufficient to detect errors in the
implementation.

Setting up Matrices and Vectors for Testing
===========================================

In general we initialize matrices and vectors with random numbers (using
`srand` to initialize the random number generator and `rand()` to get a
pseudo random number). However, in case $\alpha$ equals zero we initialize
$A$ and $x$ with NaN, and in case $\beta$ equals zero we initialize $y_0$
with NaN. Values for $\alpha$ and $\beta$ can be specified when you compile
the code:

 - Use for example flag `-DALPHA=1.5` to set $\alpha$ to $1.5$, and
 - analogously flag `-DBETA=2.5` to specify $\beta$.

Measure performance
===================

The GEMV operation requires about $m \cdot (2 \cdot n + 1)$ floating point
operations (we do not distinguish between floating point addition and
multiplication). The efficiency will be measured in mega flops per second:

---- LATEX ---------------------------------------------------------------------
    \text{MFLOPS}
    = \frac{\text{Floating point operations}}
           {10^6 \cdot \text{wall time in seconds}}
--------------------------------------------------------------------------------

For measuring the required wall time of an operation we re-run the operation
several times and use the average time.
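The skeleton below measures time with a helper `walltime()`. Its actual
definition is part of the imported source file; as a rough sketch, assuming a
POSIX system, it could be implemented with `clock_gettime()` like this:

---- CODE(type=c) ------------------------------------------------------------
/* Sketch of a possible walltime() helper (assumption: POSIX clock_gettime()
   is available; the helper actually used by the skeleton may differ). */
#define _POSIX_C_SOURCE 199309L
#include <time.h>

double
walltime(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);    /* monotonic clock: not affected
                                               by system clock adjustments */
    return (double) ts.tv_sec + (double) ts.tv_nsec * 1e-9;
}
------------------------------------------------------------------------------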
Skeleton for Test and Benchmark
===============================

Make yourself familiar with the skeleton below:

 - In `main` we first allocate sufficient memory for all matrices and vectors
   used in the test.
 - In a for loop we iterate (beginning with $m=n=10$) over all matrix/vector
   dimensions used for the benchmark.
 - Each gemv implementation is tested inside its own scope. For example:

   ---- CODE(type=c) ------------------------------------------------------------
   {
       double t    = 0;
       size_t runs = 0;

       while (t < 0.1 && runs < 3) {
           dcopy(m, y0, incY0, yRef, incYRef);

           double t0 = walltime();
           dgemv_ref(m, n, alpha,
                     A, incRowA, incColA,
                     x, incX,
                     beta,
                     yRef, incYRef);
           t += walltime() - t0;
           ++runs;
       }
       t /= runs;

       printf("\nyRef =\n");
       printGeMatrix(1, m, yRef, 0, incYRef);
       printf("%10.2lf %10.2lf ", t, mflop/t);
   }
   ------------------------------------------------------------------------------

   This allows re-using the local variable names `t` and `runs`. Before a gemv
   implementation gets tested, its result vector gets initialized with `y0`
   (here via `dcopy`). This means that each implementation operates on the
   same data.
 - Except for the reference implementation, we compute the error bound for
   each implementation.
 - In its current form the code produces a lot of output:
    - *Output of vectors and matrices*

      We use this output for a quick eye-test of the computed results. This is
      only helpful for small problem sizes. Remove this kind of output for
      larger problem sizes.
    - *Output of error bounds and MFLOPS*

      This output is useful for testing larger problem sizes (but not so
      helpful for debugging).
 - In this skeleton there is a `break` statement at the end of the for loop.
   Therefore the test stops after the problem size $m=n=10$. Once you have
   eye-checked your implementation, remove this statement.

:import: session11/gemv_ex1.c

Here is the output generated by the current version:

---- SHELL (path=session11/,hostname=heim) -------------------------------------
gcc -Wall -std=c11 -O3 -o gemv gemv_ex1.c
./gemv
--------------------------------------------------------------------------------

Exercise
========

Implement the different algorithms for gemv:

 - gemv based on the dot product (function `dgemv_dot`).
 - gemv based on the scaled vector update `axpy` (function `dgemv_axpy`).
 - gemv based on fused dot products (function `dgemv_dotf`). Use macro `DOTF`
   for the fuse factor.
 - gemv based on fused `axpy` operations (function `dgemv_axpyf`). Use macro
   `AXPYF` for the fuse factor.

Run tests for col major (compile with `-DCOLMAJOR=1`) and row major
(`-DCOLMAJOR=0`). Also make sure to test the cases $\alpha=0$ and $\beta=0$.
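To illustrate the calling convention used throughout (the strides `incRowA`
and `incColA` let the same code address $A$ in row or col major order), here
is a minimal two-loop sketch. The function name `dgemv_simple` is made up for
this illustration; the actual `dgemv_ref` in `gemv_ex1.c` may differ in
details, e.g. in how it treats the special cases:

---- CODE(type=c) ------------------------------------------------------------
#include <stddef.h>

/* Hypothetical straightforward gemv (illustration only, not the course's
   dgemv_ref): y <- beta*y + alpha*A*x, where element A(i,j) is stored at
   A[i*incRowA + j*incColA].  For simplicity positive increments are
   assumed. */
void
dgemv_simple(size_t m, size_t n, double alpha,
             const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
             const double *x, ptrdiff_t incX,
             double beta,
             double *y, ptrdiff_t incY)
{
    /* y <- beta*y; beta == 0 is treated as an assignment so that y may
       initially contain NaN values */
    for (size_t i = 0; i < m; ++i) {
        y[i*incY] = (beta == 0) ? 0 : beta*y[i*incY];
    }
    if (n == 0 || alpha == 0) {     /* nothing more to do; A, x may be NaN */
        return;
    }
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            y[i*incY] += alpha * A[i*incRowA + j*incColA] * x[j*incX];
        }
    }
}
------------------------------------------------------------------------------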
Algorithm for `dgemv_dot`
-------------------------

We consider $A$ as an $m \times n$ matrix that consists of $m$ row vectors:

---- LATEX ---------------------------------------------------------------------
    A = \left(
        \begin{array}{c}
          a_0^T     \\ \hline
          a_1^T     \\ \hline
          \vdots    \\ \hline
          a_{m-1}^T
        \end{array}
        \right)
--------------------------------------------------------------------------------

This partitioning of $A$ in the operation $\beta y + \alpha A x$ requires that
we also consider $y$ component wise, so that the operation reads:

---- LATEX ---------------------------------------------------------------------
    \left(
    \begin{array}{c}
      y_0     \\ \hline
      y_1     \\ \hline
      \vdots  \\ \hline
      y_{m-1}
    \end{array}
    \right)
    \leftarrow
    \beta \cdot
    \left(
    \begin{array}{c}
      y_0     \\ \hline
      y_1     \\ \hline
      \vdots  \\ \hline
      y_{m-1}
    \end{array}
    \right)
    + \alpha \cdot
    \left(
    \begin{array}{c}
      a_0^T     \\ \hline
      a_1^T     \\ \hline
      \vdots    \\ \hline
      a_{m-1}^T
    \end{array}
    \right)
    \cdot x
--------------------------------------------------------------------------------

The complete algorithm then reads as follows:

---- BOX -----------------------------------------------------------------------
 - $y \leftarrow \beta \cdot y$ (scal operation)
 - For $i=0, \dots, m-1$
    - $y_i \leftarrow y_i + \alpha \cdot a_i^T x$ (dot operation)
--------------------------------------------------------------------------------

Algorithm for `dgemv_axpy`
--------------------------

We consider $A$ as an $m \times n$ matrix that consists of $n$ column vectors:

---- LATEX ---------------------------------------------------------------------
    A = \left(
        \begin{array}{c|c|c|c}
          a_0 & a_1 & \dots & a_{n-1}
        \end{array}
        \right)
--------------------------------------------------------------------------------

This requires considering $x$ component wise, so that the operation reads:

---- LATEX ---------------------------------------------------------------------
    y \leftarrow \beta y + \alpha \cdot
    \left(
    \begin{array}{c|c|c|c}
      a_0 & a_1 & \dots & a_{n-1}
    \end{array}
    \right)
    \left(
    \begin{array}{c}
      x_0     \\ \hline
      x_1     \\ \hline
      \vdots  \\ \hline
      x_{n-1}
    \end{array}
    \right)
--------------------------------------------------------------------------------

The complete algorithm then reads as follows:

---- BOX -----------------------------------------------------------------------
 - $y \leftarrow \beta y$ (scal operation)
 - For $j=0, \dots, n-1$
    - $y \leftarrow y + \left(\alpha x_j \right) \cdot a_j$ (axpy operation)
--------------------------------------------------------------------------------

Algorithm for `dgemv_dotf` (fused dot products)
-----------------------------------------------

This algorithm is similar to the one for `dgemv_dot`. However, it can be
interpreted as fusing several dot products together. The so-called *fuse
factor* $f$ is an integer constant known at compile time (in the exercise it
is a literal value defined through the macro `DOTF`) so that the innermost
loop can be unrolled:

---- BOX -----------------------------------------------------------------------
 ($\;f$ is a literal (positive) integer)

 - $y \leftarrow \beta y$ (scal operation)
 - For $i=0, \dots, \left\lfloor\frac{m}{f}\right\rfloor - 1$
    - For $j=0, \dots, n-1$
       - For $\ell=0, \dots, f-1$
          - $y_{i \cdot f + \ell} \leftarrow y_{i \cdot f + \ell}
             + \alpha \cdot a_{i \cdot f + \ell,\, j} \cdot x_j$
 - For $i=\left\lfloor\frac{m}{f}\right\rfloor \cdot f, \dots, m-1$
    - $y_i \leftarrow y_i + \alpha \cdot a_i^T x$ (dot operation)
--------------------------------------------------------------------------------
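A possible realization of this algorithm is sketched below. It assumes the
conventions from the skeleton (dimensions as `size_t`, strides as
`ptrdiff_t`, positive increments) and uses the macro `DOTF` as fuse factor;
the intended solution may differ in details:

---- CODE(type=c) ------------------------------------------------------------
#include <stddef.h>

#ifndef DOTF
#define DOTF 4          /* example fuse factor; could be set via -DDOTF=... */
#endif

/* Sketch of dgemv_dotf: DOTF dot products are fused, so x is traversed
   only once per block of DOTF rows of A. */
void
dgemv_dotf(size_t m, size_t n, double alpha,
           const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
           const double *x, ptrdiff_t incX,
           double beta,
           double *y, ptrdiff_t incY)
{
    /* y <- beta*y (beta == 0 treated as assignment) */
    for (size_t i = 0; i < m; ++i) {
        y[i*incY] = (beta == 0) ? 0 : beta*y[i*incY];
    }
    if (n == 0 || alpha == 0) {
        return;
    }
    size_t mb = m / DOTF;           /* number of fused row blocks */

    for (size_t i = 0; i < mb; ++i) {
        for (size_t j = 0; j < n; ++j) {
            /* fixed trip count: the compiler can unroll this loop */
            for (size_t l = 0; l < DOTF; ++l) {
                y[(i*DOTF + l)*incY] += alpha
                                      * A[(i*DOTF + l)*incRowA + j*incColA]
                                      * x[j*incX];
            }
        }
    }
    /* remaining m - mb*DOTF rows: plain dot products */
    for (size_t i = mb*DOTF; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            y[i*incY] += alpha * A[i*incRowA + j*incColA] * x[j*incX];
        }
    }
}
------------------------------------------------------------------------------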
Algorithm for `dgemv_axpyf` (fused axpy operations)
---------------------------------------------------

This algorithm is similar to the one for `dgemv_axpy`. However, it can be
interpreted as fusing several axpy operations together. Again the *fuse
factor* $f$ (which can differ from the fuse factor for `dgemv_dotf`) is an
integer constant known at compile time (here defined through the macro
`AXPYF`):

---- BOX -----------------------------------------------------------------------
 ($\;f$ is a literal (positive) integer)

 - $y \leftarrow \beta y$ (scal operation)
 - For $j=0, \dots, \left\lfloor\frac{n}{f}\right\rfloor - 1$
    - For $i=0, \dots, m-1$
       - For $\ell=0, \dots, f-1$
          - $y_{i} \leftarrow y_{i}
             + \alpha \cdot a_{i,\, j \cdot f + \ell} \cdot x_{j \cdot f + \ell}$
 - For $j=\left\lfloor\frac{n}{f}\right\rfloor \cdot f, \dots, n-1$
    - $y \leftarrow y + \left(\alpha x_j \right) \cdot a_j$ (axpy operation)
--------------------------------------------------------------------------------
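Analogously, a sketch for `dgemv_axpyf` (same assumptions as for the
`dgemv_dotf` sketch above; the macro `AXPYF` is the fuse factor):

---- CODE(type=c) ------------------------------------------------------------
#include <stddef.h>

#ifndef AXPYF
#define AXPYF 4         /* example fuse factor; could be set via -DAXPYF=... */
#endif

/* Sketch of dgemv_axpyf: AXPYF axpy operations are fused, so y is
   traversed only once per block of AXPYF columns of A. */
void
dgemv_axpyf(size_t m, size_t n, double alpha,
            const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
            const double *x, ptrdiff_t incX,
            double beta,
            double *y, ptrdiff_t incY)
{
    /* y <- beta*y (beta == 0 treated as assignment) */
    for (size_t i = 0; i < m; ++i) {
        y[i*incY] = (beta == 0) ? 0 : beta*y[i*incY];
    }
    if (n == 0 || alpha == 0) {
        return;
    }
    size_t nb = n / AXPYF;          /* number of fused column blocks */

    for (size_t j = 0; j < nb; ++j) {
        for (size_t i = 0; i < m; ++i) {
            /* fixed trip count: the compiler can unroll this loop */
            for (size_t l = 0; l < AXPYF; ++l) {
                y[i*incY] += alpha
                           * A[i*incRowA + (j*AXPYF + l)*incColA]
                           * x[(j*AXPYF + l)*incX];
            }
        }
    }
    /* remaining n - nb*AXPYF columns: plain axpy operations */
    for (size_t j = nb*AXPYF; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            y[i*incY] += (alpha*x[j*incX]) * A[i*incRowA + j*incColA];
        }
    }
}
------------------------------------------------------------------------------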