===========================
Notes on the GEMM Algorithm                                          [TOC:2]
===========================

Of course the right thing to do is to read the __BLIS__ paper.

:links: BLIS -> http://www.cs.utexas.edu/users/flame/pubs/blis1_toms_rev3.pdf

Concept of the Micro-Kernel `ugemm`
===================================

The micro-kernel computes the GEMM operation

---- LATEX--------------------------------------------------
    C \leftarrow \beta C + \alpha A \cdot B
    \quad\text{with}\quad
    C \in \mathbb{M}^{M_R \times N_R}, \quad
    A \in \mathbb{M}^{M_R \times k}
    \;\text{and}\;
    B \in \mathbb{M}^{k \times N_R}
------------------------------------------------------------

The dimensions $M_R$ and $N_R$ are constants chosen depending on the number
of available registers.  The variable dimension $k$ is bounded by
$k \leq K_c$.  The constant $K_c$ is chosen such that $A$ and $B$ fit into
the L1 cache.  Accessing elements of $A$ and $B$ will therefore be fast, but
accessing elements of $C$ is slow in general.  We therefore use an
$M_R \times N_R$ matrix $\overline{C}$ that can be kept in registers and
split the operation as follows:

---- LATEX--------------------------------------------------
    \begin{array}{lcl}
    \overline{C} & \leftarrow & A \cdot B             \\
    \overline{C} & \leftarrow & \alpha \overline{C}   \\
    C            & \leftarrow & \beta C + \overline{C} \\
    \end{array}
------------------------------------------------------------

Relevant for performance is mainly the operation

---- LATEX--------------------------------------------------
    \overline{C} \leftarrow A \cdot B
------------------------------------------------------------

Here we have the situation that accessing elements of $A$ and $B$ (even from
the L1 cache) is slower than accessing elements of $\overline{C}$ stored in
registers.  We exploit this fact by carrying out the operation as a sequence
of rank-1 updates:

 - The elements of $A$ are guaranteed to be stored column-wise, i.e.
---- LATEX--------------------------------------------------
    A = \left(\vec{a}_1, \vec{a}_2, \dots, \vec{a}_k \right)
------------------------------------------------------------

 - The elements of $B$ are guaranteed to be stored row-wise, i.e.

---- LATEX--------------------------------------------------
    B = \left(\begin{array}{c}
        \vec{b}_1^T \\
        \vec{b}_2^T \\
        \vdots      \\
        \vec{b}_k^T \\
        \end{array}\right)
------------------------------------------------------------

Hence we get

---- LATEX--------------------------------------------------
    A \cdot B = \vec{a}_1 \vec{b}_1^T
              + \vec{a}_2 \vec{b}_2^T
              + \dots
              + \vec{a}_k \vec{b}_k^T
------------------------------------------------------------

So if $\overline{C}$ is zero-initialized we can compute the operation as the
sequence of accumulating updates

---- LATEX--------------------------------------------------
    \begin{array}{lcl}
    \overline{C} & \leftarrow & \overline{C} + \vec{a}_1 \vec{b}_1^T \\
    \overline{C} & \leftarrow & \overline{C} + \vec{a}_2 \vec{b}_2^T \\
    \vdots       &            & \vdots                               \\
    \overline{C} & \leftarrow & \overline{C} + \vec{a}_k \vec{b}_k^T \\
    \end{array}
------------------------------------------------------------

So we finally end up with the task of computing in the $l$-th step the
rank-1 update

---- LATEX--------------------------------------------------
    \overline{C} \leftarrow \overline{C} + \vec{a}_l \vec{b}_l^T,
    \quad 1 \leq l \leq k
------------------------------------------------------------


Computing the Rank-1 Update
===========================

For simplicity we assume $M_R = 4$ and $N_R = 8$.  These are the parameters
used for the AVX micro-kernel.
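Before looking at the explicit form, the whole micro-kernel can be sketched
in plain C as a scalar reference: zero a local buffer standing in for the
register block $\overline{C}$, accumulate the $k$ rank-1 updates, then apply
the $\alpha$/$\beta$ scaling.  The function and variable names below are
illustrative, not the tuned AVX kernel:

```c
#include <stddef.h>

enum { MR = 4, NR = 8 };

/* Scalar reference of the micro-kernel (hypothetical name `ugemm_ref`):
 * computes  C <- beta*C + alpha*A*B  for an MR x NR block.
 * A is packed column-wise (MR x k), B is packed row-wise (k x NR),
 * C has arbitrary row/column strides. */
static void
ugemm_ref(size_t k, double alpha,
          const double *A, const double *B,
          double beta, double *C, size_t incRowC, size_t incColC)
{
    double AB[MR*NR];   /* stands in for the register block C_bar,
                           stored column-wise */
    size_t i, j, l;

    /* C_bar <- 0 */
    for (i = 0; i < MR*NR; ++i) {
        AB[i] = 0.0;
    }

    /* C_bar <- C_bar + a_l * b_l^T  (sequence of rank-1 updates) */
    for (l = 0; l < k; ++l) {
        for (j = 0; j < NR; ++j) {
            for (i = 0; i < MR; ++i) {
                AB[i+j*MR] += A[i+l*MR] * B[l*NR+j];
            }
        }
    }

    /* C <- beta*C + alpha*C_bar */
    for (j = 0; j < NR; ++j) {
        for (i = 0; i < MR; ++i) {
            C[i*incRowC+j*incColC] = beta * C[i*incRowC+j*incColC]
                                   + alpha * AB[i+j*MR];
        }
    }
}
```

The three loop nests correspond exactly to the three-step split above; the
tuned kernels only replace the middle nest by vectorized rank-1 updates.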
The rank-1 update then reads

---- LATEX--------------------------------------------------
    \overline{C} \leftarrow \overline{C} +
    \left(\begin{array}{c} a_0 \\ a_1 \\ a_2 \\ a_3 \\ \end{array}\right)
    \left(b_0, b_1, b_2, b_3, b_4, b_5, b_6, b_7 \right)
------------------------------------------------------------

Or more explicitly

---- LATEX--------------------------------------------------
    \left(\begin{array}{cccccccc}
    c_{00} & c_{01} & c_{02} & c_{03} & c_{04} & c_{05} & c_{06} & c_{07} \\
    c_{10} & c_{11} & c_{12} & c_{13} & c_{14} & c_{15} & c_{16} & c_{17} \\
    c_{20} & c_{21} & c_{22} & c_{23} & c_{24} & c_{25} & c_{26} & c_{27} \\
    c_{30} & c_{31} & c_{32} & c_{33} & c_{34} & c_{35} & c_{36} & c_{37} \\
    \end{array}\right)
    \leftarrow
    \left(\begin{array}{cccccccc}
    c_{00} + a_{0} b_{0} & c_{01} + a_{0} b_{1} & c_{02} + a_{0} b_{2} & c_{03} + a_{0} b_{3} & c_{04} + a_{0} b_{4} & c_{05} + a_{0} b_{5} & c_{06} + a_{0} b_{6} & c_{07} + a_{0} b_{7} \\
    c_{10} + a_{1} b_{0} & c_{11} + a_{1} b_{1} & c_{12} + a_{1} b_{2} & c_{13} + a_{1} b_{3} & c_{14} + a_{1} b_{4} & c_{15} + a_{1} b_{5} & c_{16} + a_{1} b_{6} & c_{17} + a_{1} b_{7} \\
    c_{20} + a_{2} b_{0} & c_{21} + a_{2} b_{1} & c_{22} + a_{2} b_{2} & c_{23} + a_{2} b_{3} & c_{24} + a_{2} b_{4} & c_{25} + a_{2} b_{5} & c_{26} + a_{2} b_{6} & c_{27} + a_{2} b_{7} \\
    c_{30} + a_{3} b_{0} & c_{31} + a_{3} b_{1} & c_{32} + a_{3} b_{2} & c_{33} + a_{3} b_{3} & c_{34} + a_{3} b_{4} & c_{35} + a_{3} b_{5} & c_{36} + a_{3} b_{6} & c_{37} + a_{3} b_{7} \\
    \end{array}\right)
------------------------------------------------------------

Simple Variant for AVX
----------------------

You will see that there are many variants for carrying out the rank-1
update.  All these variants differ only in the order in which the elements
of $\overline{C}$ get computed.  However, the performance of these variants
can differ significantly.

We first consider a variant that theoretically could be produced by a
compiler from code like

---- CODE (type=cc) ----------------------------
    /* AB is an MR x NR buffer for C_bar, stored column-wise */
    for (i=0; i<MR; ++i) {
        for (j=0; j<NR; ++j) {
            AB[i+j*MR] += a[i]*b[j];
        }
    }
------------------------------------------------

:navigate: back -> doc:session5/page01
           next -> doc:session7/page01
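As a sanity check, the simple variant can be wrapped in a self-contained
function (scalar C, illustrative names, not the AVX kernel) and compared
entry by entry against the explicit $4 \times 8$ outer product shown above:

```c
enum { MR = 4, NR = 8 };

/* Simple variant of the rank-1 update:  AB <- AB + a * b^T,
 * where AB (MR x NR) stands in for the register block C_bar,
 * stored column-wise: entry (i,j) lives at AB[i+j*MR]. */
static void
rank1_update(double *AB, const double *a, const double *b)
{
    int i, j;

    for (i = 0; i < MR; ++i) {
        for (j = 0; j < NR; ++j) {
            AB[i+j*MR] += a[i]*b[j];
        }
    }
}
```

Starting from a zeroed buffer, entry $(i,j)$ of `AB` receives exactly
$a_i b_j$, matching the explicit matrix above; calling the function again
accumulates, as required by the sequence of rank-1 updates.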