===========================
Naive Use of AVX Intrinsics                                              [TOC]
===========================

Using __AVX intrinsics__ instead of SSE, we follow the straightforward
approach of __Naive Use of SSE Intrinsics__.


Clone the ulmBLAS Repository
============================
    *--[SHELL(hide)]----------------------------------------------------*
    |                                                                   |
    | rm -rf ulmBLAS/                                                   |
    |                                                                   |
    *-------------------------------------------------------------------*

If not done already, clone the __ulmBLAS__ repository.

    *--[SHELL]----------------------------------------------------------*
    |                                                                   |
    |  git clone https://github.com/michael-lehn/ulmBLAS.git            |
    |                                                                   |
    *-------------------------------------------------------------------*


Select the demo-naive-avx-with-intrinsics Branch
================================================
Again, we run `make clean` before switching branches:

    *--[SHELL(height=6)]------------------------------------------------*
    |                                                                   |
    |  cd ulmBLAS                                                       |
    |  make clean                                                       |
    |                                                                   |
    *-------------------------------------------------------------------*

Then we check out the `demo-naive-avx-with-intrinsics` branch:

    *--[SHELL(path=ulmBLAS)]--------------------------------------------*
    |                                                                   |
    |  git branch -a                                                    |
    |  git checkout -B demo-naive-avx-with-intrinsics                +++|
    |               remotes/origin/demo-naive-avx-with-intrinsics       |
    |                                                                   |
    *-------------------------------------------------------------------*

Then we compile the project:

    *--[SHELL(path=ulmBLAS,height=15)]----------------------------------*
    |                                                                   |
    |  make                                                             |
    |                                                                   |
    *-------------------------------------------------------------------*


The Micro Kernel Algorithm
==========================
We use parameters $m_r = 8$ and $n_r=4$ in the micro kernel.  We merely
optimize the update step

---- LATEX ---------------------------------------------------------------------
\mathbf{AB} \leftarrow \mathbf{AB} +
        \begin{pmatrix}
           a_{4l  } \\ a_{4l+1} \\ a_{4l+2} \\ a_{4l+3} \\
           a_{4l+4} \\ a_{4l+5} \\ a_{4l+6} \\ a_{4l+7}
        \end{pmatrix}
        \begin{pmatrix} b_{4l}, & b_{4l+1}, & b_{4l+2}, & b_{4l+3}\end{pmatrix}
--------------------------------------------------------------------------------

by using AVX intrinsics.  Looking at the original C code

---- CODE(type=c) ------------------------
for (l=0; l<kc; ++l) {
    for (j=0; j<NR; ++j) {
        for (i=0; i<MR; ++i) {
            AB[i+j*MR] += A[i]*B[j];
        }
    }
    A += MR;
    B += NR;
}
-----------------------------------------

we notice that in the innermost loop the value `B[j]` does not change.  The
natural idea is to compute this step column by column as

---- LATEX ---------------------------------------------------------------------
\mathbf{AB} \leftarrow \mathbf{AB} +
        \begin{pmatrix}
        b_{4l} \begin{pmatrix}
                a_{4l  } \\ a_{4l+1} \\ a_{4l+2} \\ a_{4l+3} \\
                a_{4l+4} \\ a_{4l+5} \\ a_{4l+6} \\ a_{4l+7}
               \end{pmatrix},   &
        b_{4l+1} \begin{pmatrix}
                 a_{4l  } \\ a_{4l+1} \\ a_{4l+2} \\ a_{4l+3} \\
                 a_{4l+4} \\ a_{4l+5} \\ a_{4l+6} \\ a_{4l+7}
                 \end{pmatrix}, &
        b_{4l+2} \begin{pmatrix}
                  a_{4l  } \\ a_{4l+1} \\ a_{4l+2} \\ a_{4l+3} \\
                  a_{4l+4} \\ a_{4l+5} \\ a_{4l+6} \\ a_{4l+7}
                  \end{pmatrix}, &
        b_{4l+3} \begin{pmatrix}
                  a_{4l  } \\ a_{4l+1} \\ a_{4l+2} \\ a_{4l+3} \\
                  a_{4l+4} \\ a_{4l+5} \\ a_{4l+6} \\ a_{4l+7}
                  \end{pmatrix}
        \end{pmatrix}
--------------------------------------------------------------------------------
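The column-wise view of a single update step can be sketched in plain C before any intrinsics are involved.  This is only an illustration; the function name `update_by_columns` is hypothetical and does not appear in ulmBLAS.  It assumes that `AB` is an $m_r \times n_r$ block stored column by column, as in the loop shown earlier.

```c
enum { MR = 8, NR = 4 };

/*
 * One update step: AB <- AB + a * b^T, computed column by column.
 * a has MR entries, b has NR entries, AB is MR x NR in column-major order.
 * (Hypothetical helper for illustration only.)
 */
void
update_by_columns(const double *a, const double *b, double *AB)
{
    for (int j = 0; j < NR; ++j) {
        const double bj  = b[j];       /* constant during the inner loop */
        double      *col = AB + j*MR;  /* start of column j of AB        */

        for (int i = 0; i < MR; ++i) {
            col[i] += bj * a[i];
        }
    }
}
```

Each column update scales the same eight values of `a` by one scalar `b[j]`.  This is the pattern that maps onto two 4-wide registers for `a` and one broadcast register per `b[j]`.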

Let $\mathbb{b}_{0000}, \mathbb{b}_{1111}, \mathbb{b}_{2222}, \mathbb{b}_{3333}$
and $\mathbb{a}_{0123}, \mathbb{a}_{4567}$ denote AVX registers.  We use these
six registers to store the operands:

---- LATEX ---------------------------------------------------------------------
\begin{array}{llllll}
\mathbb{b}_{0000} \leftarrow \begin{pmatrix} b_{4l  } \\ b_{4l  } \\ b_{4l  } \\ b_{4l  } \end{pmatrix}, &
\mathbb{b}_{1111} \leftarrow \begin{pmatrix} b_{4l+1} \\ b_{4l+1} \\ b_{4l+1} \\ b_{4l+1} \end{pmatrix}, &
\mathbb{b}_{2222} \leftarrow \begin{pmatrix} b_{4l+2} \\ b_{4l+2} \\ b_{4l+2} \\ b_{4l+2} \end{pmatrix}, &
\mathbb{b}_{3333} \leftarrow \begin{pmatrix} b_{4l+3} \\ b_{4l+3} \\ b_{4l+3} \\ b_{4l+3} \end{pmatrix}, &
\mathbb{a}_{0123} \leftarrow \begin{pmatrix} a_{4l  } \\ a_{4l+1} \\ a_{4l+2} \\ a_{4l+3} \end{pmatrix}, &
\mathbb{a}_{4567} \leftarrow \begin{pmatrix} a_{4l+4} \\ a_{4l+5} \\ a_{4l+6} \\ a_{4l+7} \end{pmatrix}
\end{array}
--------------------------------------------------------------------------------

Another 8 AVX registers denoted as $\mathbb{ab}_{\cdot,\cdot,\cdot,\cdot}$ are
used to represent $\mathbf{AB}$:

---- LATEX ---------------------------------------------------------------------
\begin{array}{llll}
\mathbb{ab}_{00,10,20,30} \leftarrow
                          \begin{pmatrix}
                          ab_{0,0} \\ ab_{1,0} \\ ab_{2,0} \\ ab_{3,0}
                          \end{pmatrix}, &
\mathbb{ab}_{01,11,21,31} \leftarrow
                          \begin{pmatrix}
                          ab_{0,1} \\ ab_{1,1} \\ ab_{2,1} \\ ab_{3,1}
                          \end{pmatrix}, &
\mathbb{ab}_{02,12,22,32} \leftarrow
                          \begin{pmatrix}
                          ab_{0,2} \\ ab_{1,2} \\ ab_{2,2} \\ ab_{3,2}
                          \end{pmatrix}, &
\mathbb{ab}_{03,13,23,33} \leftarrow
                          \begin{pmatrix}
                          ab_{0,3} \\ ab_{1,3} \\ ab_{2,3} \\ ab_{3,3}
                          \end{pmatrix} \\[0.5cm]
\mathbb{ab}_{40,50,60,70} \leftarrow
                          \begin{pmatrix}
                          ab_{4,0} \\ ab_{5,0} \\ ab_{6,0} \\ ab_{7,0}
                          \end{pmatrix}, &
\mathbb{ab}_{41,51,61,71} \leftarrow
                          \begin{pmatrix}
                          ab_{4,1} \\ ab_{5,1} \\ ab_{6,1} \\ ab_{7,1}
                          \end{pmatrix}, &
\mathbb{ab}_{42,52,62,72} \leftarrow
                          \begin{pmatrix}
                          ab_{4,2} \\ ab_{5,2} \\ ab_{6,2} \\ ab_{7,2}
                          \end{pmatrix}, &
\mathbb{ab}_{43,53,63,73} \leftarrow
                          \begin{pmatrix}
                          ab_{4,3} \\ ab_{5,3} \\ ab_{6,3} \\ ab_{7,3}
                          \end{pmatrix}
\end{array}
--------------------------------------------------------------------------------

As our architecture has a total of 16 AVX registers, we have two registers left.
We use them for temporary results and denote them as $\mathbb{tmp}_1$ and
$\mathbb{tmp}_2$.

A single update can now be computed as follows:
 - Update the first column:
     - $\mathbb{tmp}_1 \leftarrow \mathbb{a}_{0123}$
     - $\mathbb{tmp}_2 \leftarrow \mathbb{a}_{4567}$
     - $\mathbb{tmp}_1 \leftarrow \mathbb{tmp}_1 \odot \mathbb{b}_{0000}$
     - $\mathbb{tmp}_2 \leftarrow \mathbb{tmp}_2 \odot \mathbb{b}_{0000}$
     - $\mathbb{ab}_{00,10,20,30} \leftarrow \mathbb{ab}_{00,10,20,30} + \mathbb{tmp}_1$
     - $\mathbb{ab}_{40,50,60,70} \leftarrow \mathbb{ab}_{40,50,60,70} + \mathbb{tmp}_2$
 - Update the second column:
     - $\mathbb{tmp}_1 \leftarrow \mathbb{a}_{0123}$
     - $\mathbb{tmp}_2 \leftarrow \mathbb{a}_{4567}$
     - $\mathbb{tmp}_1 \leftarrow \mathbb{tmp}_1 \odot \mathbb{b}_{1111}$
     - $\mathbb{tmp}_2 \leftarrow \mathbb{tmp}_2 \odot \mathbb{b}_{1111}$
     - $\mathbb{ab}_{01,11,21,31} \leftarrow \mathbb{ab}_{01,11,21,31} + \mathbb{tmp}_1$
     - $\mathbb{ab}_{41,51,61,71} \leftarrow \mathbb{ab}_{41,51,61,71} + \mathbb{tmp}_2$
 - Update the third column:
     - $\mathbb{tmp}_1 \leftarrow \mathbb{a}_{0123}$
     - $\mathbb{tmp}_2 \leftarrow \mathbb{a}_{4567}$
     - $\mathbb{tmp}_1 \leftarrow \mathbb{tmp}_1 \odot \mathbb{b}_{2222}$
     - $\mathbb{tmp}_2 \leftarrow \mathbb{tmp}_2 \odot \mathbb{b}_{2222}$
     - $\mathbb{ab}_{02,12,22,32} \leftarrow \mathbb{ab}_{02,12,22,32} + \mathbb{tmp}_1$
     - $\mathbb{ab}_{42,52,62,72} \leftarrow \mathbb{ab}_{42,52,62,72} + \mathbb{tmp}_2$
 - Update the fourth column:
     - $\mathbb{tmp}_1 \leftarrow \mathbb{a}_{0123}$
     - $\mathbb{tmp}_2 \leftarrow \mathbb{a}_{4567}$
     - $\mathbb{tmp}_1 \leftarrow \mathbb{tmp}_1 \odot \mathbb{b}_{3333}$
     - $\mathbb{tmp}_2 \leftarrow \mathbb{tmp}_2 \odot \mathbb{b}_{3333}$
     - $\mathbb{ab}_{03,13,23,33} \leftarrow \mathbb{ab}_{03,13,23,33} + \mathbb{tmp}_1$
     - $\mathbb{ab}_{43,53,63,73} \leftarrow \mathbb{ab}_{43,53,63,73} + \mathbb{tmp}_2$

Here $\odot$ denotes the usual componentwise multiplication of AVX registers.
We also assume that all the $\mathbb{ab}$ registers are zero initialized before
the first update step.

Once we have completed the total of $k_c$ updates we write the result back to
memory into $\mathbf{AB}$.
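The scheme described in this section might look roughly as follows when written with AVX intrinsics.  This is a minimal sketch under stated assumptions, not the code from `dgemm_nn.c`: the names `micro_kernel_sketch`, `ab_lo` and `ab_hi` are hypothetical, the loop over `j` stands in for the four column updates, and `A` and `AB` are assumed to be 32-byte aligned.  A scalar fallback is included only so that the sketch still builds when AVX is not enabled at compile time.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

enum { MR = 8, NR = 4 };

/*
 * Hypothetical sketch: AB <- sum_{l<kc} a_l * b_l^T.
 * A:  kc packed columns of MR doubles (32-byte aligned, assumption).
 * B:  kc packed rows of NR doubles.
 * AB: MR x NR block in column-major order (32-byte aligned, assumption).
 */
void
micro_kernel_sketch(size_t kc, const double *A, const double *B, double *AB)
{
#if defined(__AVX__)
    __m256d ab_lo[NR], ab_hi[NR];   /* rows 0..3 and rows 4..7 per column */

    for (int j = 0; j < NR; ++j) {  /* zero initialize the accumulators  */
        ab_lo[j] = _mm256_setzero_pd();
        ab_hi[j] = _mm256_setzero_pd();
    }
    for (size_t l = 0; l < kc; ++l) {
        __m256d a_0123 = _mm256_load_pd(A);      /* aligned loads */
        __m256d a_4567 = _mm256_load_pd(A + 4);

        for (int j = 0; j < NR; ++j) {
            /* broadcast b_j into all four lanes */
            __m256d b_jjjj = _mm256_broadcast_sd(B + j);
            __m256d tmp1   = _mm256_mul_pd(a_0123, b_jjjj);
            __m256d tmp2   = _mm256_mul_pd(a_4567, b_jjjj);

            ab_lo[j] = _mm256_add_pd(ab_lo[j], tmp1);
            ab_hi[j] = _mm256_add_pd(ab_hi[j], tmp2);
        }
        A += MR;
        B += NR;
    }
    for (int j = 0; j < NR; ++j) {  /* write the result back to memory */
        _mm256_store_pd(AB + j*MR,     ab_lo[j]);
        _mm256_store_pd(AB + j*MR + 4, ab_hi[j]);
    }
#else
    /* Scalar fallback with the same result, used when AVX is disabled. */
    for (int k = 0; k < MR*NR; ++k) {
        AB[k] = 0.0;
    }
    for (size_t l = 0; l < kc; ++l) {
        for (int j = 0; j < NR; ++j) {
            for (int i = 0; i < MR; ++i) {
                AB[i + j*MR] += A[i]*B[j];
            }
        }
        A += MR;
        B += NR;
    }
#endif
}
```

With GCC or Clang the AVX path is selected by compiling with `-mavx`.  Whether the compiler really keeps all accumulators in registers depends on the optimization level, so the generated assembly is worth checking.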


The dgemm_nn Code
=================
Note that we also added an attribute for 32-byte alignment to the definitions
of the local buffers `_A`, `_B`, `_C` and `AB`.  A 32-byte alignment is required
by the load and store intrinsics used in the micro kernel.
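As a rough illustration of such an attribute, the following sketch uses the GCC/Clang syntax `__attribute__((aligned(32)))`.  The buffer name `AB_buf` and the helper `is_aligned_32` are hypothetical, and the exact spelling used in the imported source may differ.

```c
#include <stdint.h>

enum { MR = 8, NR = 4 };

/* Hypothetical buffer; the attribute syntax is GCC/Clang specific. */
static double AB_buf[MR*NR] __attribute__((aligned(32)));

/* Returns 1 if p is a multiple of 32 bytes, 0 otherwise. */
int
is_aligned_32(const void *p)
{
    return ((uintptr_t)p % 32) == 0;
}
```

Since four doubles occupy 32 bytes, every fourth element of such a buffer is again 32-byte aligned, which is why both halves of each column of `AB` can be accessed with aligned loads and stores.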

:import: ulmBLAS/src/level3/dgemm_nn.c [linenumbers]


Benchmark Results
=================
We run the benchmarks

    *--[SHELL(path=ulmBLAS)]--------------------------------------------*
    |                                                                   |
    |  cd bench                                                         |
    |  ./xdl3blastst -N 100 2500 100                                    |
    |                                                                   |
    *-------------------------------------------------------------------*

and filter out the results for the `demo-naive-avx-with-intrinsics` branch:

    *--[SHELL(path=ulmBLAS/bench)]--------------------------------------*
    |                                                                   |
    |  ./xdl3blastst -N 100 2500 100 > report                           |
    |  grep PASS report > demo-naive-avx-with-intrinsics                |
    |  grep "\ \-\-\-\-\-$" report > refBLAS                            |
    |                                                                   |
    *-------------------------------------------------------------------*

With the gnuplot script

:import: ulmBLAS/bench/bench.gps

we feed gnuplot

    *--[SHELL(path=ulmBLAS/bench)]--------------------------------------*
    |                                                                   |
    |  gnuplot bench.gps                                                |
    |                                                                   |
    *-------------------------------------------------------------------*

and get

    ---- IMAGE ------------
    ulmBLAS/bench/bench.png
    -----------------------


:links: AVX intrinsics               -> https://software.intel.com/sites/landingpage/IntrinsicsGuide/
        Naive Use of SSE Intrinsics  -> http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/page03/index.html
        ulmBLAS                      -> https://github.com/michael-lehn/ulmBLAS


:navigate: __up__    -> doc:index