============================================
Unblocked LU: Variant based on GEMV and TRSV                            [TOC]
============================================

In Session 12 we implemented the LU factorization based on the BLAS operation
GER (rank 1 update).  In this session we consider another variant that is based
on the BLAS operations GEMV (matrix vector product) and TRSV (triangular
solver).

Note that the term variant is used here to express that both algorithms (and
implementation) carry out the same underlying floating point operations.
However, because of the different BLAS building blocks the operations are
carried out in a different order.  So it is possible that each variant of
an algorithm achieves a different different performance.  That is because:

- Depending on an architecture different BLAS operations can be realized more
  or less efficient.
- Different BLAS operations might be implemented more or less efficient.

Algorithm
=========

---- BOX -----------------------------------------------------------------------

 +--------------+-----------------------------------------------------------+
 | Input:       | An $m \times n$ matrix $A$.                               |
 +--------------+-----------------------------------------------------------+
 | Output:      | - $A$ is overwritten with the LU factorization.           |
 |              | - $p$ a pivot vector of length $k := \text{min}\{m, n\}$. |
 +--------------+-----------------------------------------------------------+
 | Return value:| - $-1$ on success and otherwise the index of the diagonal |
 |              |   element that was exactly zero during the factorization  |
 +--------------+-----------------------------------------------------------+

- For $j=0, \dots, k-1$

  - Consider for the first $j+1$ columns of $A$ the partition

    ---- LATEX -----------------------------------------------------------------
    \left(\begin{array}{ccc}
          a_{0,0} & \dots & a_{0,j} \\
          \vdots  &       & \vdots    \\
          a_{m,0} & \dots & a_{m,j} \\
          \end{array}\right)
    =
    \left(\begin{array}{c|c}
          A_{0,0}   &   \vec{a_{0,j}} \\
          \hline
          A_{j,0}   &   \vec{a_{j,j}} \\
          \end{array}\right)
    ----------------------------------------------------------------------------

  - *TRSV operation*

    Compute $\vec{a_{0,j}} \leftarrow L_{\text{unit}}^{-1}(A_{0,0})
    \vec{a_{0,j}}$ where $L_{\text{unit}}( \cdot )$ denotes that a matrix
    argument is considered as a lower unit triangular matrix.

  - *GEMV operation*

    Compute $\vec{a_{j,j}} \leftarrow \vec{a_{j,j}} - A_{j,0} \vec{a_{0,j}}$.

  - *Partial pivoting*

    Find pivot index $p_j$ from $\vec{a_{j,j}}$, i.e.

    ---- LATEX -----------------------------------------------------------------
    p_j = j + \text{argmax}_{0 \leq \ell < m} a_{j,l}
    ----------------------------------------------------------------------------

    If $p_j \neq j$ interchange in $A$ rows with index $j$ and $p_j$.

  - Consider for $\vec{a_{j,j}}$ the sub-partition

    ---- LATEX -----------------------------------------------------------------
    \vec{a_{j,j}} =
    \left(\begin{array}{c}
          a_{j,j} \\
          \hline
          \vec{a_{j+1,j}}
          \end{array}\right)
    ----------------------------------------------------------------------------

  - If $a_{j,j}$ is exactly zero return $j$

  - *Scale operation*

    Compute $\vec{a_{j+1,j}} \leftarrow \frac{1}{a_{j,j}} \vec{a_{j+1,j}}$

- For $j=k, \dots, n-1$

  - Consider for $A$ the partition

    ---- LATEX -----------------------------------------------------------------
    A = \left(\begin{array}{c|c|c|c}
              A_{0,0}   &   \vec{a_{0,k}} & \cdots & \vec{a_{0,n-1}} \\
              \end{array}\right)
    ----------------------------------------------------------------------------

  - *TRSV operation*

    Compute $\vec{a_{0,j}} \leftarrow L_{\text{unit}}^{-1}(A_{0,0})
    \vec{a_{0,j}}$

--------------------------------------------------------------------------------

Exercise
========

- Use the source file `ulmblas.c` from Session 12 as starting point.  Or copy
  `ulmblas.c` from ` /home/numerik/pub/hpc/ss18/ulmblas/session13b/ulmblas.c`

- Modify the code as follows, so that we can use a compiler switch to select
  the implementation for `dgetrf`:

  ---- CODE (type=c) -----------------------------------------------------------
  /*
      Implementation of other BLAS functions
  */
  
  ptrdiff_t
  dgetrf_ger(size_t m, size_t n,
             double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
             size_t *p, ptrdiff_t incP)
  {
      /* Implementation from session 12 */
  }
  
  ptrdiff_t
  dgetrf_gemv(size_t m, size_t n,
              double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
              size_t *p, ptrdiff_t incP)
  {
      /* TODO: Your implementation based on dgemv and dtrsv */
  }
  
  #define GETRF_GER   1
  #define GETRF_GEMV  2
  
  #ifndef GETRF
  #define GETRF   0
  #endif
  
  // this pragma makes gcc stop immediately on '#error'
  #pragma GCC diagnostic error "-Wfatal-errors"
  
  ptrdiff_t
  dgetrf(size_t m, size_t n,
         double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
         size_t *p, ptrdiff_t incP)
  {
  #if  GETRF == GETRF_GER
      return dgetrf_ger(m, n, A, incRowA, incColA, p, incP);
  #elif GETRF == GETRF_GEMV
      return dgetrf_gemv(m, n, A, incRowA, incColA, p, incP);
  #else
  #error "Compile with either -DGETRF=GETRF_GER or -DGETRF=GETRF_GEMV"
  #endif
  }
  ------------------------------------------------------------------------------

- Implement function `dgetrf_gemv`.  Use the GEMV/TRSV based algorithm.  Test
  your implementation as described in the next section.


Test and Benchmark
==================

---- SHELL (path=session14, hide) ----------------------------------------------
rm -rf getrf
mkdir getrf
cd getrf
cp /home/numerik/pub/hpc/ss18/ulmblas/*.[hc] .
cp /home/numerik/pub/hpc/ss18/ulmblas/session14/ulmblas.* .
cp ../getrf.plot .
--------------------------------------------------------------------------------

- Compile with either `-DGETRF=GETRF_GER` or `-DGETRF=GETRF_GER`.  Otherwise
  you get *the following error*:

  ---- SHELL (path=session14/getrf, hostname=heim) -----------------------------
  gcc -Wall -std=c11 -I. -O3 -o test_dgetrf test_dgetrf.c ulmaux.c ulmblas.c
  ------------------------------------------------------------------------------

- Create different executables and check results:

  - You can create an executable `test_dgetrf_ger` for testing and benchmarking
    the variant based on GER:

    ---- SHELL (path=session14/getrf, hostname=heim,fold) ----------------------
    gcc -Wall -std=c11 -I. -O3 -o test_dgetrf_ger -DGETRF=GETRF_GER          +++
        test_dgetrf.c ulmaux.c ulmblas.c
    ./test_dgetrf_ger check
    ----------------------------------------------------------------------------

  - and `test_dgetrf_gemv` for benchmarking the variant based on GEMV and TRSV:

    ---- SHELL (path=session14/getrf, hostname=heim,fold) ----------------------
    gcc -Wall -std=c11 -I. -O3 -o test_dgetrf_gemv -DGETRF=GETRF_GEMV        +++
        test_dgetrf.c ulmaux.c ulmblas.c
    ./test_dgetrf_gemv check
    ----------------------------------------------------------------------------

- Benchmark both variants:

    ---- SHELL (path=session14/getrf, hostname=heim,fold) ----------------------
    ./test_dgetrf_ger bench > bench.dgetrf_ger
    ./test_dgetrf_gemv bench > bench.dgetrf_gemv
    cat bench.dgetrf_ger
    cat bench.dgetrf_gemv
    ----------------------------------------------------------------------------

    You can use the Gnuplot script

    :import: session14/getrf.plot

    for plotting the performance:

    ---- SHELL(path=session14/getrf,hostname=heim) -----------------------------
    gnuplot getrf.plot
    ----------------------------------------------------------------------------

    gives

    ---- IMAGE -----------------------
    session14/getrf/bench.getrf.svg
    ----------------------------------