===================================
GEMV: General Matrix Vector Product                                        [TOC]
===================================

In this session we will implement and benchmark different variants of the
so-called GEMV operation:

---- LATEX ---------------------------------------------------------------------
    y \leftarrow \beta y + \alpha A x, \quad
    A \in \mathbb{M}^{m \times n},\;
    x \in \mathbb{M}^n,\;
    y \in \mathbb{M}^m
--------------------------------------------------------------------------------

This operation is specified in BLAS Level 2. Here are some additional notes on
the parameters and the operation:

 - Dimensions $m$ and $n$ can be zero.
 - If $m$ equals zero, gemv is a no-operation.
 - If $n$ equals zero or $\alpha$ equals zero, it just computes
   $y \leftarrow \beta y$. In this case $x$ or $A$ may contain NaN values.
 - If $\beta$ equals zero, it computes $y \leftarrow \alpha A x$. In this case
   $y$ is allowed to contain NaN values.

Error bound
===========

We provide one reference implementation for GEMV. An alternative
implementation will be tested as follows:

 - Choose $m$, $n$, $\alpha$, $\beta$ and generate matrix $A$ and vectors $x$,
   $y_0$.
 - Use the reference implementation to compute a trusted $y_\text{ref}$ with
   $y_\text{ref} = \beta y_0 + \alpha A x$.
 - Use the alternative implementation to compute $y_\text{sol}$ with
   $y_\text{sol} = \beta y_0 + \alpha A x$.

Even if both implementations are numerically acceptable, they might differ due
to round-off errors. However, we will accept the result if the expression
(where $\text{eps}$ denotes the machine precision)

---- LATEX ---------------------------------------------------------------------
    \frac{\| y_\text{ref} - y_\text{sol} \|}
         {\text{eps} \left(
              \text{max}\{m, n\} \cdot |\alpha| \cdot \|A\|_\infty
                                \cdot \|x\|_\infty
            + m \cdot |\beta| \cdot \|y_0\|_\infty
          \right)}
--------------------------------------------------------------------------------

is smaller than a threshold $\tau$. Theoretically one can only show that
$\tau = \mathcal{O}(1)$, i.e. that it does not depend on the matrix
dimensions, the condition of $A$, or the particular choice of $x$ and $y$.
That means there is still an unknown constant involved. In practice, choosing
$\tau = 2$ should typically be sufficient to detect errors in the
implementation.

Setting up Matrices and Vectors for Testing
===========================================

In general we initialize matrices and vectors with random numbers (using
`srand` to initialize the random number generator and `rand()` to get a
pseudo random number). However, in case $\alpha$ equals zero we initialize
$A$ and $x$ with NaN, and in case $\beta$ equals zero we initialize $y_0$
with NaN. Values for $\alpha$ and $\beta$ can be specified when you compile
the code:

 - Use for example flag `-DALPHA=1.5` to set $\alpha$ to $1.5$, and
 - analogously flag `-DBETA=2.5` to specify $\beta$.

Measure performance
===================

The GEMV operation requires about $m \cdot (2 \cdot n + 1)$ floating point
operations (we do not distinguish between floating point addition and
multiplication). The efficiency will be measured in mega flops per second:

---- LATEX ---------------------------------------------------------------------
    \text{MFLOPS}
    = \frac{\text{Floating point operations}}
           {10^6 \cdot \text{wall time in seconds}}
--------------------------------------------------------------------------------

For measuring the required wall time of an operation we re-run the operation
several times and use the average time.
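The skeleton below measures time with a helper `walltime()`. Its actual
definition is part of the imported source file; as a rough sketch, assuming a
POSIX system, it could be implemented with `clock_gettime()` like this:

---- CODE(type=c) ------------------------------------------------------------
/* Sketch of a possible walltime() helper (assumption: POSIX clock_gettime()
   is available; the helper actually used by the skeleton may differ). */
#define _POSIX_C_SOURCE 199309L
#include <time.h>

double
walltime(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);    /* monotonic clock: not affected
                                               by system clock adjustments */
    return (double) ts.tv_sec + (double) ts.tv_nsec * 1e-9;
}
------------------------------------------------------------------------------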
Skeleton for Test and Benchmark
===============================

Make yourself familiar with the skeleton below:

 - In `main` we first allocate sufficient memory for all matrices and vectors
   used in the test.
 - In a for loop we iterate (beginning with $m=n=10$) over all matrix/vector
   dimensions used for the benchmark.
 - Each gemv implementation is tested inside its own scope. For example:

   ---- CODE(type=c) ------------------------------------------------------------
   {
       double t    = 0;
       size_t runs = 0;

       while (t < 0.1 && runs < 3) {
           dcopy(m, y0, incY0, yRef, incYRef);

           double t0 = walltime();
           dgemv_ref(m, n, alpha,
                     A, incRowA, incColA,
                     x, incX,
                     beta,
                     yRef, incYRef);
           t += walltime() - t0;
           ++runs;
       }
       t /= runs;

       printf("\nyRef =\n");
       printGeMatrix(1, m, yRef, 0, incYRef);
       printf("%10.2lf %10.2lf ", t, mflop/t);
   }
   ------------------------------------------------------------------------------

   This allows re-using the local variable names `t` and `runs`. Before a gemv
   implementation gets tested, its result vector gets initialized with `y0`
   (here via `dcopy`). This means that each implementation operates on the
   same data.
 - Except for the reference implementation, we compute the error bound for
   each implementation.
 - In its current form the code produces a lot of output:
    - *Output of vectors and matrices*

      We use this output for a quick eye-test of the computed results. This is
      only helpful for small problem sizes. Remove this kind of output for
      larger problem sizes.
    - *Output of error bounds and MFLOPS*

      This output is useful for testing larger problem sizes (but not so
      helpful for debugging).
 - In this skeleton there is a `break` statement at the end of the for loop.
   Therefore the test stops after the problem size $m=n=10$. Once you have
   eye-checked your implementation, remove this statement.

:import: session11/gemv_ex1.c

Here is the output generated by the current version:

---- SHELL (path=session11/,hostname=heim) -------------------------------------
gcc -Wall -std=c11 -O3 -o gemv gemv_ex1.c
./gemv
--------------------------------------------------------------------------------

Exercise
========

Implement the different algorithms for gemv:

 - gemv based on the dot product (function `dgemv_dot`).
 - gemv based on the scaled vector update `axpy` (function `dgemv_axpy`).
 - gemv based on fused dot products (function `dgemv_dotf`). Use macro `DOTF`
   for the fuse factor.
 - gemv based on fused `axpy` operations (function `dgemv_axpyf`). Use macro
   `AXPYF` for the fuse factor.

Run tests for col major (compile with `-DCOLMAJOR=1`) and row major
(`-DCOLMAJOR=0`). Also make sure to test the cases $\alpha=0$ and $\beta=0$.
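To illustrate the calling convention used throughout (the strides `incRowA`
and `incColA` let the same code address $A$ in row or col major order), here
is a minimal two-loop sketch. The function name `dgemv_simple` is made up for
this illustration; the actual `dgemv_ref` in `gemv_ex1.c` may differ in
details, e.g. in how it treats the special cases:

---- CODE(type=c) ------------------------------------------------------------
#include <stddef.h>

/* Hypothetical straightforward gemv (illustration only, not the course's
   dgemv_ref): y <- beta*y + alpha*A*x, where element A(i,j) is stored at
   A[i*incRowA + j*incColA].  For simplicity positive increments are
   assumed. */
void
dgemv_simple(size_t m, size_t n, double alpha,
             const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
             const double *x, ptrdiff_t incX,
             double beta,
             double *y, ptrdiff_t incY)
{
    /* y <- beta*y; beta == 0 is treated as an assignment so that y may
       initially contain NaN values */
    for (size_t i = 0; i < m; ++i) {
        y[i*incY] = (beta == 0) ? 0 : beta*y[i*incY];
    }
    if (n == 0 || alpha == 0) {     /* nothing more to do; A, x may be NaN */
        return;
    }
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            y[i*incY] += alpha * A[i*incRowA + j*incColA] * x[j*incX];
        }
    }
}
------------------------------------------------------------------------------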
Algorithm for `dgemv_dot`
-------------------------

We consider $A$ as an $m \times n$ matrix that consists of $m$ row vectors:

---- LATEX ---------------------------------------------------------------------
    A = \left(
        \begin{array}{c}
          a_0^T     \\ \hline
          a_1^T     \\ \hline
          \vdots    \\ \hline
          a_{m-1}^T
        \end{array}
        \right)
--------------------------------------------------------------------------------

This partitioning of $A$ in the operation $\beta y + \alpha A x$ requires that
we also consider $y$ component wise, so that the operation reads:

---- LATEX ---------------------------------------------------------------------
    \left(
    \begin{array}{c}
      y_0     \\ \hline
      y_1     \\ \hline
      \vdots  \\ \hline
      y_{m-1}
    \end{array}
    \right)
    \leftarrow
    \beta \cdot
    \left(
    \begin{array}{c}
      y_0     \\ \hline
      y_1     \\ \hline
      \vdots  \\ \hline
      y_{m-1}
    \end{array}
    \right)
    + \alpha \cdot
    \left(
    \begin{array}{c}
      a_0^T     \\ \hline
      a_1^T     \\ \hline
      \vdots    \\ \hline
      a_{m-1}^T
    \end{array}
    \right)
    \cdot x
--------------------------------------------------------------------------------

The complete algorithm then reads as follows:

---- BOX -----------------------------------------------------------------------
 - $y \leftarrow \beta \cdot y$ (scal operation)
 - For $i=0, \dots, m-1$
    - $y_i \leftarrow y_i + \alpha \cdot a_i^T x$ (dot operation)
--------------------------------------------------------------------------------

Algorithm for `dgemv_axpy`
--------------------------

We consider $A$ as an $m \times n$ matrix that consists of $n$ column vectors:

---- LATEX ---------------------------------------------------------------------
    A = \left(
        \begin{array}{c|c|c|c}
          a_0 & a_1 & \dots & a_{n-1}
        \end{array}
        \right)
--------------------------------------------------------------------------------

This requires considering $x$ component wise, so that the operation reads:

---- LATEX ---------------------------------------------------------------------
    y \leftarrow \beta y + \alpha \cdot
    \left(
    \begin{array}{c|c|c|c}
      a_0 & a_1 & \dots & a_{n-1}
    \end{array}
    \right)
    \left(
    \begin{array}{c}
      x_0     \\ \hline
      x_1     \\ \hline
      \vdots  \\ \hline
      x_{n-1}
    \end{array}
    \right)
--------------------------------------------------------------------------------

The complete algorithm then reads as follows:

---- BOX -----------------------------------------------------------------------
 - $y \leftarrow \beta y$ (scal operation)
 - For $j=0, \dots, n-1$
    - $y \leftarrow y + \left(\alpha x_j \right) \cdot a_j$ (axpy operation)
--------------------------------------------------------------------------------

Algorithm for `dgemv_dotf` (fused dot products)
-----------------------------------------------

This algorithm is similar to the one for `dgemv_dot`. However, it can be
interpreted as fusing several dot products together. The so-called *fuse
factor* $f$ is an integer constant known at compile time (in the exercise it
is a literal value defined through the macro `DOTF`) so that the innermost
loop can be unrolled:

---- BOX -----------------------------------------------------------------------
 ($\;f$ is a literal (positive) integer)

 - $y \leftarrow \beta y$ (scal operation)
 - For $i=0, \dots, \left\lfloor\frac{m}{f}\right\rfloor - 1$
    - For $j=0, \dots, n-1$
       - For $\ell=0, \dots, f-1$
          - $y_{i \cdot f + \ell} \leftarrow y_{i \cdot f + \ell}
             + \alpha \cdot a_{i \cdot f + \ell,\, j} \cdot x_j$
 - For $i=\left\lfloor\frac{m}{f}\right\rfloor \cdot f, \dots, m-1$
    - $y_i \leftarrow y_i + \alpha \cdot a_i^T x$ (dot operation)
--------------------------------------------------------------------------------
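A possible realization of this algorithm is sketched below. It assumes the
conventions from the skeleton (dimensions as `size_t`, strides as
`ptrdiff_t`, positive increments) and uses the macro `DOTF` as fuse factor;
the intended solution may differ in details:

---- CODE(type=c) ------------------------------------------------------------
#include <stddef.h>

#ifndef DOTF
#define DOTF 4          /* example fuse factor; could be set via -DDOTF=... */
#endif

/* Sketch of dgemv_dotf: DOTF dot products are fused, so x is traversed
   only once per block of DOTF rows of A. */
void
dgemv_dotf(size_t m, size_t n, double alpha,
           const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
           const double *x, ptrdiff_t incX,
           double beta,
           double *y, ptrdiff_t incY)
{
    /* y <- beta*y (beta == 0 treated as assignment) */
    for (size_t i = 0; i < m; ++i) {
        y[i*incY] = (beta == 0) ? 0 : beta*y[i*incY];
    }
    if (n == 0 || alpha == 0) {
        return;
    }
    size_t mb = m / DOTF;           /* number of fused row blocks */

    for (size_t i = 0; i < mb; ++i) {
        for (size_t j = 0; j < n; ++j) {
            /* fixed trip count: the compiler can unroll this loop */
            for (size_t l = 0; l < DOTF; ++l) {
                y[(i*DOTF + l)*incY] += alpha
                                      * A[(i*DOTF + l)*incRowA + j*incColA]
                                      * x[j*incX];
            }
        }
    }
    /* remaining m - mb*DOTF rows: plain dot products */
    for (size_t i = mb*DOTF; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            y[i*incY] += alpha * A[i*incRowA + j*incColA] * x[j*incX];
        }
    }
}
------------------------------------------------------------------------------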
Algorithm for `dgemv_axpyf` (fused axpy operations)
---------------------------------------------------

This algorithm is similar to the one for `dgemv_axpy`. However, it can be
interpreted as fusing several axpy operations together. Again the *fuse
factor* $f$ (which can differ from the fuse factor for `dgemv_dotf`) is an
integer constant known at compile time (here defined through the macro
`AXPYF`):

---- BOX -----------------------------------------------------------------------
 ($\;f$ is a literal (positive) integer)

 - $y \leftarrow \beta y$ (scal operation)
 - For $j=0, \dots, \left\lfloor\frac{n}{f}\right\rfloor - 1$
    - For $i=0, \dots, m-1$
       - For $\ell=0, \dots, f-1$
          - $y_{i} \leftarrow y_{i}
             + \alpha \cdot a_{i,\, j \cdot f + \ell} \cdot x_{j \cdot f + \ell}$
 - For $j=\left\lfloor\frac{n}{f}\right\rfloor \cdot f, \dots, n-1$
    - $y \leftarrow y + \left(\alpha x_j \right) \cdot a_j$ (axpy operation)
--------------------------------------------------------------------------------
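Analogously, a sketch for `dgemv_axpyf` (same assumptions as for the
`dgemv_dotf` sketch above; the macro `AXPYF` is the fuse factor):

---- CODE(type=c) ------------------------------------------------------------
#include <stddef.h>

#ifndef AXPYF
#define AXPYF 4         /* example fuse factor; could be set via -DAXPYF=... */
#endif

/* Sketch of dgemv_axpyf: AXPYF axpy operations are fused, so y is
   traversed only once per block of AXPYF columns of A. */
void
dgemv_axpyf(size_t m, size_t n, double alpha,
            const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
            const double *x, ptrdiff_t incX,
            double beta,
            double *y, ptrdiff_t incY)
{
    /* y <- beta*y (beta == 0 treated as assignment) */
    for (size_t i = 0; i < m; ++i) {
        y[i*incY] = (beta == 0) ? 0 : beta*y[i*incY];
    }
    if (n == 0 || alpha == 0) {
        return;
    }
    size_t nb = n / AXPYF;          /* number of fused column blocks */

    for (size_t j = 0; j < nb; ++j) {
        for (size_t i = 0; i < m; ++i) {
            /* fixed trip count: the compiler can unroll this loop */
            for (size_t l = 0; l < AXPYF; ++l) {
                y[i*incY] += alpha
                           * A[i*incRowA + (j*AXPYF + l)*incColA]
                           * x[(j*AXPYF + l)*incX];
            }
        }
    }
    /* remaining n - nb*AXPYF columns: plain axpy operations */
    for (size_t j = nb*AXPYF; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            y[i*incY] += (alpha*x[j*incX]) * A[i*incRowA + j*incColA];
        }
    }
}
------------------------------------------------------------------------------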