===================================
Pure ANSI C Implementation of DGEMM                                     [TOC]
===================================

In the following we present the cache-optimized implementation of the
matrix-matrix product.  The function `dgemm_nn` can compute operations of
the form $C \leftarrow \beta C + \alpha A B$.  All matrices can have
*arbitrary row and column strides*.  In particular this means that each
matrix can be stored row- or column-wise.  It further means that we can
also use the function for the computation of
$C \leftarrow \beta C + \alpha A^T B$,
$C \leftarrow \beta C + \alpha A B^T$ and
$C \leftarrow \beta C + \alpha A^T B^T$.
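For example, $C \leftarrow \beta C + \alpha A^T B$ merely requires passing
the row and column stride of $A$ in swapped order.  The following test
program is a minimal sketch of this trick.  It is not part of ulmBLAS: it
assumes the `dgemm_nn` signature from `src/level3/dgemm_nn.c` (imported
further below) and has to be linked against the corresponding object file.

---- CODE (C) ------------------------------------------------------------------
#include <stdio.h>

/*  Signature of dgemm_nn as defined in src/level3/dgemm_nn.c             */
void dgemm_nn(int m, int n, int k,
              double alpha,
              const double *A, int incRowA, int incColA,
              const double *B, int incRowB, int incColB,
              double beta,
              double *C, int incRowC, int incColC);

int
main()
{
/*  A is 3x2, B is 3x4, C is 2x4; all stored row-wise.                    */
    double A[] = { 1, 2,
                   3, 4,
                   5, 6 };
    double B[] = { 1, 0, 0, 1,
                   0, 1, 0, 1,
                   0, 0, 1, 1 };
    double C[2*4] = { 0 };
    int    i, j;

/*  C <- A^T*B:  swapping incRowA and incColA makes dgemm_nn traverse     */
/*  the row-wise stored 3x2 matrix A as the 2x3 matrix A^T.               */
    dgemm_nn(2, 4, 3,
             1.0,
             A, 1, 2,       /* strides of A swapped: reads A^T            */
             B, 4, 1,       /* B stored row-wise                          */
             0.0,
             C, 4, 1);      /* C stored row-wise                          */

    for (i=0; i<2; ++i) {
        for (j=0; j<4; ++j) {
            printf("%6.1f", C[i*4+j]);
        }
        printf("\n");
    }
    return 0;
}
--------------------------------------------------------------------------------

If the strides are correct this prints the $2 \times 4$ matrix $A^T B$,
i.e. the rows $(1, 3, 5, 9)$ and $(2, 4, 6, 12)$.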
Select the demo-pure-c Branch
=============================

Before switching branches do a `make clean` first:

*--[SHELL(height=6)]------------------------------------------------*
|                                                                   |
|  cd ulmBLAS                                                       |
|  make clean                                                       |
|                                                                   |
*-------------------------------------------------------------------*

Now we switch by checking out the `demo-pure-c` branch:

*--[SHELL(path=ulmBLAS)]--------------------------------------------*
|                                                                   |
|  git branch -a                                                    |
|  git checkout -B demo-pure-c remotes/origin/demo-pure-c           |
|                                                                   |
*-------------------------------------------------------------------*

Then we compile the project:

*--[SHELL(path=ulmBLAS,height=15)]----------------------------------*
|                                                                   |
|  make                                                             |
|                                                                   |
*-------------------------------------------------------------------*


Building Blocks of dgemm_nn
===========================

- `pack_MRxk` and `pack_A` copy row panels from blocks of $A$ into the
  buffer `_A`.  Details are described in __GEMM Packing Matrix A__.

- `pack_kxNR` and `pack_B` copy column panels from blocks of $B$ into the
  buffer `_B`.  Details are described in __GEMM Packing Matrix B__.

- `dgemm_micro_kernel` multiplies a row panel of `_A` with a column panel
  of `_B`.  Details are given in __GEMM Micro Kernel__.

- `dgemm_macro_kernel` multiplies a packed block of $A$ with a packed
  block of $B$ by calling the micro kernel for all combinations of row
  and column panels.  Details are given in __GEMM Macro Kernel__.

- `dgemm_nn` computes $C \leftarrow \beta C + \alpha A B$ as described in
  __GEMM__.


The Micro Kernel Algorithm
==========================

For the sake of simplicity we assume $m_r = n_r = 4$ in the description of
the algorithm.  Note that the pure C implementation works as long as $m_r$
is a divisor of $m_c$ and $n_r$ a divisor of $n_c$.

Recall that `A` points to a packed (possibly zero-padded) row panel of
height four and width $k_c$.  That means we have the column-wise stored
panel

---- LATEX ---------------------------------------------------------------------
A = \begin{pmatrix}
    a_0 & a_4 & \dots & a_{4 k_c -4} \\
    a_1 & a_5 & \dots & a_{4 k_c -3} \\
    a_2 & a_6 & \dots & a_{4 k_c -2} \\
    a_3 & a_7 & \dots & a_{4 k_c -1} \\
    \end{pmatrix}
--------------------------------------------------------------------------------

Further, `B` points to a column panel of height $k_c$ and width four.
That can be illustrated as a row-wise stored panel

---- LATEX ---------------------------------------------------------------------
B = \begin{pmatrix}
    b_0         & b_1         & b_2         & b_3         \\
    b_4         & b_5         & b_6         & b_7         \\
    \vdots      & \vdots      & \vdots      & \vdots      \\
    b_{4 k_c-4} & b_{4 k_c-3} & b_{4 k_c-2} & b_{4 k_c-1} \\
    \end{pmatrix}
--------------------------------------------------------------------------------

For the product $A \cdot B$ this means

---- LATEX ---------------------------------------------------------------------
\begin{eqnarray*}
A \cdot B
&=& \begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ \end{pmatrix}
    \begin{pmatrix} b_0 & b_1 & b_2 & b_3 \end{pmatrix}
 +  \begin{pmatrix} a_4 \\ a_5 \\ a_6 \\ a_7 \\ \end{pmatrix}
    \begin{pmatrix} b_4 & b_5 & b_6 & b_7 \end{pmatrix}
 +  \dots
 +  \begin{pmatrix} a_{4 k_c-4} \\ a_{4 k_c-3} \\ a_{4 k_c-2} \\ a_{4 k_c-1} \\ \end{pmatrix}
    \begin{pmatrix} b_{4 k_c-4} & b_{4 k_c-3} & b_{4 k_c-2} & b_{4 k_c-1} \end{pmatrix}\\[1cm]
&=& \begin{pmatrix}
    a_0 b_0 & a_0 b_1 & a_0 b_2 & a_0 b_3 \\
    a_1 b_0 & a_1 b_1 & a_1 b_2 & a_1 b_3 \\
    a_2 b_0 & a_2 b_1 & a_2 b_2 & a_2 b_3 \\
    a_3 b_0 & a_3 b_1 & a_3 b_2 & a_3 b_3 \\
    \end{pmatrix}
 +  \begin{pmatrix}
    a_4 b_4 & a_4 b_5 & a_4 b_6 & a_4 b_7 \\
    a_5 b_4 & a_5 b_5 & a_5 b_6 & a_5 b_7 \\
    a_6 b_4 & a_6 b_5 & a_6 b_6 & a_6 b_7 \\
    a_7 b_4 & a_7 b_5 & a_7 b_6 & a_7 b_7 \\
    \end{pmatrix}
 +  \dots
 +  \begin{pmatrix}
    a_{4 k_c-4} b_{4 k_c-4} & a_{4 k_c-4} b_{4 k_c-3} & a_{4 k_c-4} b_{4 k_c-2} & a_{4 k_c-4} b_{4 k_c-1} \\
    a_{4 k_c-3} b_{4 k_c-4} & a_{4 k_c-3} b_{4 k_c-3} & a_{4 k_c-3} b_{4 k_c-2} & a_{4 k_c-3} b_{4 k_c-1} \\
    a_{4 k_c-2} b_{4 k_c-4} & a_{4 k_c-2} b_{4 k_c-3} & a_{4 k_c-2} b_{4 k_c-2} & a_{4 k_c-2} b_{4 k_c-1} \\
    a_{4 k_c-1} b_{4 k_c-4} & a_{4 k_c-1} b_{4 k_c-3} & a_{4 k_c-1} b_{4 k_c-2} & a_{4 k_c-1} b_{4 k_c-1} \\
    \end{pmatrix}
\end{eqnarray*}
--------------------------------------------------------------------------------

We compute $\mathbf{AB} = A \cdot B$ sequentially:

- Initialize: $\mathbf{AB} \leftarrow \mathbf{0}_{4 \times 4}$
- For $l = 0, \dots, k_c-1$ update:
  - $\mathbf{AB} \leftarrow \mathbf{AB} + \begin{pmatrix} a_{4l} \\ a_{4l+1} \\ a_{4l+2} \\ a_{4l+3}\end{pmatrix} \begin{pmatrix} b_{4l} & b_{4l+1} & b_{4l+2} & b_{4l+3}\end{pmatrix}$

*(Note that this loop is the computational hotspot.  In principle the
overall performance depends only on the efficient update of $\mathbf{AB}$.)*

Afterwards we merely update the left-hand side micro block $\tilde{C}$:

- $\tilde{C} \leftarrow \beta \tilde{C}$
- $\tilde{C} \leftarrow \tilde{C} + \alpha \mathbf{AB}$
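Spelled out in C the algorithm looks roughly as follows.  This is only a
sketch for illustration, not the ulmBLAS micro kernel itself (the actual
implementation is imported below and, among other things, treats the case
$\beta = 0$ separately so that an uninitialized $\tilde{C}$ is never read):

---- CODE (C) ------------------------------------------------------------------
/*  Sketch of the micro kernel algorithm for MR = NR = 4.  A and B point  */
/*  to the packed panels described above; C is the 4x4 micro block of the */
/*  result, stored with strides incRowC and incColC.                      */
static void
dgemm_micro_kernel_sketch(int kc, double alpha,
                          const double *A, const double *B,
                          double beta,
                          double *C, int incRowC, int incColC)
{
    double AB[4*4];         /* local buffer for AB, stored column-wise    */
    int    i, j, l;

/*  Initialize:  AB <- 0                                                  */
    for (i=0; i<4*4; ++i) {
        AB[i] = 0.0;
    }

/*  Computational hotspot: for l = 0, ..., kc-1 add the outer product of  */
/*  the l-th column of the A panel and the l-th row of the B panel to AB. */
    for (l=0; l<kc; ++l) {
        for (j=0; j<4; ++j) {
            for (i=0; i<4; ++i) {
                AB[i+j*4] += A[i]*B[j];
            }
        }
        A += 4;             /* advance to the next column of the A panel  */
        B += 4;             /* advance to the next row of the B panel     */
    }

/*  Update the micro block:  C <- beta*C,  then  C <- C + alpha*AB        */
    for (j=0; j<4; ++j) {
        for (i=0; i<4; ++i) {
            C[i*incRowC+j*incColC] *= beta;
            C[i*incRowC+j*incColC] += alpha*AB[i+j*4];
        }
    }
}
--------------------------------------------------------------------------------

Because $\mathbf{AB}$ consists of only sixteen doubles it is small enough
to be kept in registers, which is why the efficiency of the hotspot loop
alone determines the overall performance.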
The dgemm_nn Code (less than 450 lines!)
========================================

:import: ulmBLAS/src/level3/dgemm_nn.c [linenumbers]


Benchmark Results
=================

In `bench/` we have extracted the benchmark suite from the __ATLAS__
project:

*--[SHELL(path=ulmBLAS)]--------------------------------------------*
|                                                                   |
|  cd bench                                                         |
|  ./xdl3blastst                                                    |
|                                                                   |
*-------------------------------------------------------------------*

Lines with `PASS` show results for our own implementation.  Lines with
`-----` belong to the reference implementation.

We can visualize the benchmarks with __gnuplot__.  First we write the
results to a report file and use `grep` to separate the results of our
implementation from those of the reference implementation:

*--[SHELL(path=ulmBLAS/bench)]--------------------------------------*
|                                                                   |
|  ./xdl3blastst > report                                           |
|  grep PASS report > demo-pure-c                                   |
|  grep "\ \-\-\-\-\-$" report > refBLAS                            |
|                                                                   |
*-------------------------------------------------------------------*

Then we can use gnuplot to create an SVG of the benchmark results:

:import: ulmBLAS/bench/bench.gps

Feeding gnuplot with this script

*--[SHELL(path=ulmBLAS/bench)]--------------------------------------*
|                                                                   |
|  gnuplot bench.gps                                                |
|                                                                   |
*-------------------------------------------------------------------*

creates

---- IMAGE ---------------------------------------------------------------------
ulmBLAS/bench/bench.svg
--------------------------------------------------------------------------------


Sensitivity to Compilers
========================

Unfortunately different compilers, flags and versions can have a big
influence on the performance.  Using SSE intrinsics or inline assembly in
the micro kernel solves this problem.

Here I can demonstrate a significant performance drop if I compile with
`clang` instead of `gcc-4.8`.  Note that in other cases my `clang`
compiler performs much better optimizations.

*--[SHELL(path=ulmBLAS)]--------------------------------------------*
|                                                                   |
|  clang --version                                                  |
|  gcc-4.8 --version                                                |
|                                                                   |
|  cd src                                                           |
|  clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer  +++|
|      -DFAKE_ATLAS -c -o level3/atl_dgemm_nn.o level3/dgemm_nn.c   |
|  make                                                             |
|  cd ../bench                                                      |
|  make                                                             |
|  ./xdl3blastst > report                                           |
|  cat report                                                       |
|                                                                   |
*-------------------------------------------------------------------*

We can visualize this again:

*--[SHELL(path=ulmBLAS/bench)]--------------------------------------*
|                                                                   |
|  grep PASS report > demo-pure-c-with-clang                        |
|                                                                   |
*-------------------------------------------------------------------*

By adapting the gnuplot script

:import: ulmBLAS/bench/bench2.gps

and feeding gnuplot with this script

*--[SHELL(path=ulmBLAS/bench)]--------------------------------------*
|                                                                   |
|  gnuplot bench2.gps                                               |
|                                                                   |
*-------------------------------------------------------------------*

we obtain the plot below.  Although the performance drops dramatically we
see that it is stable: there is no drop depending on cache sizes like in
the `refBLAS` case.  Therefore it still beats the reference implementation
if problem sizes are large enough.

---- IMAGE ---------------------------------------------------------------------
ulmBLAS/bench/bench2.svg
--------------------------------------------------------------------------------

:links: __ATLAS__                 -> http://math-atlas.sourceforge.net
        __gnuplot__               -> http://www.gnuplot.info
        __GEMM Packing Matrix A__ -> https://apfel.mathematik.uni-ulm.de/~lehn/sghpc/day08/page02.html
        __GEMM Packing Matrix B__ -> https://apfel.mathematik.uni-ulm.de/~lehn/sghpc/day08/page03.html
        __GEMM Micro Kernel__     -> https://apfel.mathematik.uni-ulm.de/~lehn/sghpc/day08/page04.html
        __GEMM Macro Kernel__     -> https://apfel.mathematik.uni-ulm.de/~lehn/sghpc/day08/page05.html
        __GEMM__                  -> https://apfel.mathematik.uni-ulm.de/~lehn/sghpc/day08/page06.html

:navigate: __up__   -> doc:index
           __back__ -> doc:page01/index
           __next__ -> doc:page03/index