============================================
Using fused AXPY and DOT Operations for TRSV                            [TOC]
============================================

In the previous benchmark the LU variant based on GEMV/TRSV was more efficient.
That is mainly due to the fact that the GEMV implementation uses fused AXPY and
fused DOT operations.

We will now also use the fused operations for TRSV and hope to achieve a
further improvement in performance.

---- SHELL (path=session14, hide) ----------------------------------------------
rm -rf getrf2
mkdir getrf2
cd getrf2
cp /home/numerik/pub/hpc/ss18/ulmblas/*.[hc] .
cp /home/numerik/pub/hpc/ss18/ulmblas/session14b/ulmblas.* .
cp ../getrf.plot2 .
cp ../getrf/bench* .
--------------------------------------------------------------------------------


TRSV: Optimized for col major case
==================================
The TRSV operation can be optimized using fused AXPY or fused DOT operations. In
the code of `ulmblas.c` (found in
`/home/numerik/pub/hpc/ss18/ulmblas/session14b/`) this kind of optimization was
applied for the col major case.  The following further modifications were made:

- Function `daxpyf` was added.  This function performs the fused AXPY
  operation.  Its purpose is to make this operation reusable.
- Function `dgemv_axpyf` now calls `daxpyf`.  Previously the fused AXPY
  operations were hard-coded in `dgemv_axpyf`.
- For lower triangular, col major matrices the function `dtrsv` also calls
  `daxpyf` (and therefore exploits the reusable implementation of the fused
  AXPY operation).

:import: session14/getrf2/ulmblas.c [fold]
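
To illustrate what a fused AXPY computes, here is a minimal sketch in C.  The
name `axpyf_sketch`, its parameter list and the fuse factor `FUSE` are
assumptions made for illustration.  The actual `daxpyf` in `ulmblas.c` may
differ in its signature and in how the loops are unrolled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fuse factor: number of AXPY operations fused into one pass. */
#define FUSE 4

/*
 * Fused AXPY sketch (hypothetical signature):
 *
 *     y <- y + alpha * (x_0*A(:,0) + ... + x_{FUSE-1}*A(:,FUSE-1))
 *
 * A is an m x FUSE panel with generic row and column strides.  Each element
 * of y is loaded and stored once instead of FUSE times.
 */
void
axpyf_sketch(ptrdiff_t m, double alpha,
             const double *x, ptrdiff_t incX,
             const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
             double *y, ptrdiff_t incY)
{
    assert(incY != 0);  /* elements of y must be distinct */

    for (ptrdiff_t i = 0; i < m; ++i) {
        double acc = 0;
        /* accumulate the contribution of all FUSE columns to y_i */
        for (ptrdiff_t j = 0; j < FUSE; ++j) {
            acc += A[i*incRowA + j*incColA] * x[j*incX];
        }
        y[i*incY] += alpha * acc;
    }
}
```

For a col major panel (`incRowA = 1`) the inner accesses to each column are
contiguous, which is why this formulation suits the col major case.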

Exercise
========
- Try to express the underlying algorithm in TRSV for the case of a lower
  triangular matrix that is stored col major.
- Run benchmarks to evaluate and verify the performance (see below).
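
As a starting point for the first item, one possible way to organize TRSV for
a lower triangular matrix around a fused AXPY is sketched here.  This is only
a sketch under assumptions: the names `trsv_lower_sketch` and `axpyf_sketch`,
their parameter lists and the fuse factor `FUSE` are hypothetical, and the
`dtrsv` in `ulmblas.c` may be structured differently.  The idea is to process
the matrix in panels of `FUSE` columns: first solve with the small diagonal
block, then update all remaining entries of the right-hand side with a single
fused AXPY.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FUSE 4  /* hypothetical fuse factor */

/* Fused AXPY sketch: y <- y + alpha*A*x for an m x FUSE panel A. */
static void
axpyf_sketch(ptrdiff_t m, double alpha,
             const double *x, ptrdiff_t incX,
             const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
             double *y, ptrdiff_t incY)
{
    for (ptrdiff_t i = 0; i < m; ++i) {
        double acc = 0;
        for (ptrdiff_t j = 0; j < FUSE; ++j) {
            acc += A[i*incRowA + j*incColA] * x[j*incX];
        }
        y[i*incY] += alpha * acc;
    }
}

/*
 * Solve L*x = b in place for a lower triangular n x n matrix L stored in A.
 * On entry x holds b, on exit x holds the solution.
 */
void
trsv_lower_sketch(ptrdiff_t n, bool unitDiag,
                  const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
                  double *x, ptrdiff_t incX)
{
    assert(n >= 0);

    ptrdiff_t j = 0;

    /* Full panels of FUSE columns. */
    for (; j + FUSE <= n; j += FUSE) {
        /* 1. Solve with the FUSE x FUSE diagonal block (plain AXPY form). */
        for (ptrdiff_t k = j; k < j + FUSE; ++k) {
            if (!unitDiag) {
                x[k*incX] /= A[k*incRowA + k*incColA];
            }
            for (ptrdiff_t i = k + 1; i < j + FUSE; ++i) {
                x[i*incX] -= A[i*incRowA + k*incColA] * x[k*incX];
            }
        }
        /* 2. Update the entries below the block with one fused AXPY. */
        ptrdiff_t rest = n - j - FUSE;
        if (rest > 0) {
            axpyf_sketch(rest, -1.0,
                         &x[j*incX], incX,
                         &A[(j + FUSE)*incRowA + j*incColA], incRowA, incColA,
                         &x[(j + FUSE)*incX], incX);
        }
    }

    /* Remaining n mod FUSE columns: plain column-wise forward substitution. */
    for (; j < n; ++j) {
        if (!unitDiag) {
            x[j*incX] /= A[j*incRowA + j*incColA];
        }
        for (ptrdiff_t i = j + 1; i < n; ++i) {
            x[i*incX] -= A[i*incRowA + j*incColA] * x[j*incX];
        }
    }
}
```

For a col major matrix with leading dimension `n` the call would use
`incRowA = 1` and `incColA = n`.  The strictly upper triangular part of `A` is
never referenced.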

Test and Benchmark
==================

---- SHELL (path=session14, hide) ----------------------------------------------
rm -rf getrf2
mkdir getrf2
cd getrf2
cp /home/numerik/pub/hpc/ss18/ulmblas/*.[hc] .
cp /home/numerik/pub/hpc/ss18/ulmblas/session14b/ulmblas.* .
cp ../getrf.plot2 .
cp ../getrf/bench* .
--------------------------------------------------------------------------------

- Create executable and check results:

    ---- SHELL (path=session14/getrf2, hostname=heim,fold) ---------------------
    gcc -Wall -std=c11 -I. -O3 -o test_dgetrf_gemv_fused -DGETRF=GETRF_GEMV  +++
        test_dgetrf.c ulmaux.c ulmblas.c
    ./test_dgetrf_gemv_fused check
    ----------------------------------------------------------------------------

- Benchmark:

    ---- SHELL (path=session14/getrf2, hostname=heim,fold) ---------------------
    ./test_dgetrf_gemv_fused bench > bench.dgetrf_gemv_fused
    cat bench.dgetrf_gemv_fused
    ----------------------------------------------------------------------------

    You can use the Gnuplot script

    :import: session14/getrf.plot2

    for plotting the performance:

    ---- SHELL(path=session14/getrf2,hostname=heim) ----------------------------
    gnuplot getrf.plot2
    ----------------------------------------------------------------------------

    gives

    ---- IMAGE -----------------------
    session14/getrf2/bench.getrf.svg
    ----------------------------------