# Notes on the GEMM Algorithm

Of course, the right thing to do is to read the BLIS paper.

## Concept of the Micro-Kernel ugemm

The micro-kernel computes the GEMM operation

$C \leftarrow \beta C + \alpha A \cdot B \quad\text{with}\quad C \in \mathbb{M}^{M_R \times N_R},\quad A \in \mathbb{M}^{M_R \times k}\quad\text{and}\quad B \in \mathbb{M}^{k \times N_R}$

The dimensions $$M_R$$ and $$N_R$$ are constants chosen depending on the number of available registers. The variable dimension $$k$$ is bounded by $$k \leq K_c$$. The constant $$K_c$$ is chosen such that $$A$$ and $$B$$ fit into L1-cache. (Note that together $$A$$ and $$B$$ occupy at most $$(M_R + N_R)\,K_c$$ elements.)

So accessing elements of $$A$$ and $$B$$ will be fast, but accessing elements of $$C$$ is in general slow. We therefore consider an $$M_R \times N_R$$ matrix $$\overline{C}$$ that can be kept in registers and split the operation as follows:

$\begin{array}{lcl}\overline{C} & \leftarrow & A \cdot B\\\overline{C} & \leftarrow & \alpha \overline{C}\\C & \leftarrow & \beta C + \overline{C}\\\end{array}$
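As a plain reference sketch, the split can be written as follows (the function name, the template parameters, and the packed storage layout are assumptions based on the definitions above; this is not the tuned micro-kernel):

```cpp
#include <cstddef>

// Reference sketch of the micro-kernel split:
//   Cbar <- A*B,  Cbar <- alpha*Cbar,  C <- beta*C + Cbar
// A is MR x k, packed column-wise (column l starts at A + l*MR);
// B is k x NR, packed row-wise (row l starts at B + l*NR);
// C is a general MR x NR matrix with row and column strides.
template <std::size_t MR, std::size_t NR>
void ugemm_ref(std::size_t k, double alpha,
               const double *A, const double *B,
               double beta,
               double *C, std::size_t incRowC, std::size_t incColC)
{
    double Cbar[MR][NR] = {};   // zero-initialized accumulator

    // Cbar <- A*B
    for (std::size_t l = 0; l < k; ++l)
        for (std::size_t i = 0; i < MR; ++i)
            for (std::size_t j = 0; j < NR; ++j)
                Cbar[i][j] += A[l*MR + i] * B[l*NR + j];

    // Cbar <- alpha*Cbar and C <- beta*C + Cbar, fused into one pass
    for (std::size_t i = 0; i < MR; ++i)
        for (std::size_t j = 0; j < NR; ++j)
            C[i*incRowC + j*incColC] =
                beta * C[i*incRowC + j*incColC] + alpha * Cbar[i][j];
}

// Hypothetical 2x2 example used for a quick check:
double demo_entry(int i, int j)
{
    const double A[4] = {1, 2, 3, 4};   // columns (1,2) and (3,4)
    const double B[4] = {5, 6, 7, 8};   // rows (5,6) and (7,8)
    double C[4] = {1, 1, 1, 1};         // row-major 2x2, incRowC=2, incColC=1
    ugemm_ref<2, 2>(2, 2.0, A, B, 1.0, C, 2, 1);
    return C[2*i + j];
}
```

For this toy example $AB = \left(\begin{smallmatrix}26 & 30\\ 38 & 44\end{smallmatrix}\right)$, so with $\alpha = 2$, $\beta = 1$ and $C$ initially all ones the result is $\left(\begin{smallmatrix}53 & 61\\ 77 & 89\end{smallmatrix}\right)$.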

The performance will mainly be determined by the operation

$\overline{C} \leftarrow A \cdot B$

Here we have the situation that accessing elements of $$A$$ and $$B$$ (even if in L1-cache) is slower than accessing elements of $$\overline{C}$$ stored in registers. We exploit this fact by carrying out the operation as a sequence of rank-1 updates:

- The elements of $$A$$ are guaranteed to be stored column-wise, i.e.

$A = \left(\vec{a}_1, \vec{a}_2, \dots, \vec{a}_k \right)$

- The elements of $$B$$ are guaranteed to be stored row-wise, i.e.

$B = \left(\begin{array}{c} \vec{b}_1^T \\ \vec{b}_2^T \\ \vdots \\ \vec{b}_k^T \\ \end{array}\right)$

Hence we get

$A \cdot B = \vec{a}_1 \vec{b}_1^T + \vec{a}_2 \vec{b}_2^T + \dots + \vec{a}_k \vec{b}_k^T$
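This identity is easy to check numerically. The following toy example (a $2 \times 2$ result with $k = 3$; dimensions chosen for brevity, not the micro-kernel sizes) accumulates the outer products exactly as in the stated storage formats:

```cpp
#include <array>

// Check that the sum of outer products a_l * b_l^T equals A*B
// for A = [[1,2,3],[4,5,6]] and B = [[7,8],[9,10],[11,12]].
std::array<std::array<double, 2>, 2> sum_of_outer_products()
{
    // A stored column-wise: A[l][i] is element (i, l)
    const double A[3][2] = {{1, 4}, {2, 5}, {3, 6}};
    // B stored row-wise: B[l][j] is element (l, j)
    const double B[3][2] = {{7, 8}, {9, 10}, {11, 12}};

    std::array<std::array<double, 2>, 2> C{};   // zero-initialized
    for (int l = 0; l < 3; ++l)                 // one rank-1 update per l
        for (int i = 0; i < 2; ++i)
            for (int j = 0; j < 2; ++j)
                C[i][j] += A[l][i] * B[l][j];
    return C;                                   // equals A*B = [[58,64],[139,154]]
}
```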

So if $$\overline{C}$$ is zero-initialized we can compute the operation as a sequence of accumulated updates

$\begin{array}{lcl}\overline{C} & \leftarrow & \overline{C} + \vec{a}_1 \vec{b}_1^T \\\overline{C} & \leftarrow & \overline{C} + \vec{a}_2 \vec{b}_2^T \\ \vdots & & \vdots \\\overline{C} & \leftarrow & \overline{C} + \vec{a}_k \vec{b}_k^T \\\end{array}$

So we finally end up with the task of computing in the $$l$$-th step a rank-1 update

$\overline{C} \leftarrow \overline{C} + \vec{a}_l \vec{b}_l^T,\quad 1 \leq l \leq k$

## Computing the Rank-1 Update

For simplicity we assume $$M_R = 4$$ and $$N_R = 8$$. These are the parameters used for the AVX micro-kernel. The rank-1 update then reads

$\overline{C} \leftarrow \overline{C} +\left(\begin{array}{c} a_0 \\ a_1 \\ a_2 \\ a_3 \\ \end{array}\right)\left(b_0, b_1, b_2, b_3, b_4, b_5, b_6, b_7 \right)$

Or, more explicitly:

$\left(\begin{array}{cccccccc} c_{00} & c_{01} & c_{02} & c_{03} & c_{04} & c_{05} & c_{06} & c_{07} \\ c_{10} & c_{11} & c_{12} & c_{13} & c_{14} & c_{15} & c_{16} & c_{17} \\ c_{20} & c_{21} & c_{22} & c_{23} & c_{24} & c_{25} & c_{26} & c_{27} \\ c_{30} & c_{31} & c_{32} & c_{33} & c_{34} & c_{35} & c_{36} & c_{37} \\ \end{array}\right)\leftarrow\left(\begin{array}{cccccccc} c_{00} & c_{01} & c_{02} & c_{03} & c_{04} & c_{05} & c_{06} & c_{07} \\ c_{10} & c_{11} & c_{12} & c_{13} & c_{14} & c_{15} & c_{16} & c_{17} \\ c_{20} & c_{21} & c_{22} & c_{23} & c_{24} & c_{25} & c_{26} & c_{27} \\ c_{30} & c_{31} & c_{32} & c_{33} & c_{34} & c_{35} & c_{36} & c_{37} \\ \end{array}\right)+\left(\begin{array}{cccccccc} a_{0} b_{0} & a_{0} b_{1}& a_{0} b_{2}& a_{0} b_{3}& a_{0} b_{4}& a_{0} b_{5}& a_{0} b_{6}& a_{0} b_{7}\\ a_{1} b_{0} & a_{1} b_{1}& a_{1} b_{2}& a_{1} b_{3}& a_{1} b_{4}& a_{1} b_{5}& a_{1} b_{6}& a_{1} b_{7}\\ a_{2} b_{0} & a_{2} b_{1}& a_{2} b_{2}& a_{2} b_{3}& a_{2} b_{4}& a_{2} b_{5}& a_{2} b_{6}& a_{2} b_{7}\\ a_{3} b_{0} & a_{3} b_{1}& a_{3} b_{2}& a_{3} b_{3}& a_{3} b_{4}& a_{3} b_{5}& a_{3} b_{6}& a_{3} b_{7}\\ \end{array}\right)$

### Simple Variant for AVX

You will see that there are many variants for how to carry out the rank-1 update. All these variants differ only in the order in which the elements of $$\overline{C}$$ get computed. However, the performance of these variants can differ significantly. We first consider a variant that theoretically could be produced by a compiler from code like

```cpp
for (i=0; i<MR; ++i)
    for (j=0; j<NR; ++j)
        c[i][j] += a[i]*b[j];
```


Obviously the value a[i] does not change in the inner loop, so the compiler can buffer it as in

```cpp
for (i=0; i<MR; ++i) {
    auto a_ = a[i];
    for (j=0; j<NR; ++j)
        c[i][j] += a_*b[j];
}
```


The compiler can even buffer all values of c, a and b in registers. Having 16 AVX registers ($$\text{reg}_0, \dots, \text{reg}_{15}$$), each 256 bits wide, the following mapping can be used for values of type double:

- Matrix $$\overline{C}$$ is mapped as

$\left(\begin{array}{cc} \text{reg}_{8 } & \text{reg}_{9 }\\ \text{reg}_{10} & \text{reg}_{11} \\ \text{reg}_{12} & \text{reg}_{13} \\ \text{reg}_{14} & \text{reg}_{15} \\ \end{array}\right)\leftrightarrow\left(\begin{array}{cccccccc} c_{00} & c_{01} & c_{02} & c_{03} & c_{04} & c_{05} & c_{06} & c_{07} \\ c_{10} & c_{11} & c_{12} & c_{13} & c_{14} & c_{15} & c_{16} & c_{17} \\ c_{20} & c_{21} & c_{22} & c_{23} & c_{24} & c_{25} & c_{26} & c_{27} \\ c_{30} & c_{31} & c_{32} & c_{33} & c_{34} & c_{35} & c_{36} & c_{37} \\ \end{array}\right)$

Or in more detail:

$\begin{array}{cc} \text{reg}_{8 } \leftrightarrow \left(c_{00} , c_{01} , c_{02} , c_{03} \right) & \text{reg}_{9 } \leftrightarrow \left(c_{04} , c_{05} , c_{06} , c_{07} \right) \\ \text{reg}_{10} \leftrightarrow \left(c_{10} , c_{11} , c_{12} , c_{13} \right) & \text{reg}_{11} \leftrightarrow \left(c_{14} , c_{15} , c_{16} , c_{17} \right) \\ \text{reg}_{12} \leftrightarrow \left(c_{20} , c_{21} , c_{22} , c_{23} \right) & \text{reg}_{13} \leftrightarrow \left(c_{24} , c_{25} , c_{26} , c_{27} \right) \\ \text{reg}_{14} \leftrightarrow \left(c_{30} , c_{31} , c_{32} , c_{33} \right) & \text{reg}_{15} \leftrightarrow \left(c_{34} , c_{35} , c_{36} , c_{37} \right) \\ \end{array}$
- Buffering the value a_ (auto a_ = a[i]) in an AVX register can be done by a broadcast such that it can be used for component-wise multiplications later:

$\begin{array}{c}\text{broadcast a_0:}\quad \text{reg}_{0} \leftarrow \left( a_0, a_0, a_0, a_0 \right)\\\text{broadcast a_1:}\quad \text{reg}_{1} \leftarrow \left( a_1, a_1, a_1, a_1 \right)\\\text{broadcast a_2:}\quad \text{reg}_{2} \leftarrow \left( a_2, a_2, a_2, a_2 \right)\\\text{broadcast a_3:}\quad \text{reg}_{3} \leftarrow \left( a_3, a_3, a_3, a_3 \right)\\\end{array}$
- Values of $$b$$ can be buffered in $$\text{reg}_{4}$$ and $$\text{reg}_{5}$$:

$\begin{array}{cc} \text{reg}_{4 } \leftrightarrow \left(b_{0} , b_{1} , b_{2} , b_{3} \right), & \text{reg}_{5 } \leftrightarrow \left(b_{4} , b_{5} , b_{6} , b_{7} \right) \end{array}$

So we have $$\text{reg}_{6}$$ and $$\text{reg}_7$$ available for temporary results:

- We can compute $$\text{reg}_{8 } \leftrightarrow \left(c_{00} , c_{01} , c_{02} , c_{03} \right)$$ with

$\begin{array}{lclcl} \text{reg}_{6} &\leftarrow& \text{reg}_{0} &\odot& \text{reg}_{4 } \\ \text{reg}_{8} &\leftarrow& \text{reg}_{8} &+ & \text{reg}_{6} \end{array}$

where $$\odot$$ denotes a component-wise multiplication. This corresponds to

$\begin{array}{lclcl} \left(a_0 b_0, a_0 b_1, a_0 b_2, a_0 b_3\right) &\leftarrow& \left( a_0, a_0, a_0, a_0 \right) &\odot& \left(b_{0} , b_{1} , b_{2} , b_{3} \right)\\ \left(c_{00} , c_{01} , c_{02} , c_{03} \right) &\leftarrow& \left(c_{00} , c_{01} , c_{02} , c_{03} \right) &+& \left(a_0 b_0, a_0 b_1, a_0 b_2, a_0 b_3\right) \end{array}$
- Computing the rest of $$\overline{C}$$ goes analogously. Pipelining can be improved by using registers $$\text{reg}_{6}$$ and $$\text{reg}_7$$ alternately for the temporary results. E.g. the entries $$\left(c_{04} , c_{05} , c_{06} , c_{07} \right)$$ buffered in $$\text{reg}_{9}$$ can be computed by

$\begin{array}{lclcl} \text{reg}_{7} &\leftarrow& \text{reg}_{0} &\odot& \text{reg}_{5 } \\ \text{reg}_{9} &\leftarrow& \text{reg}_{9} &+ & \text{reg}_{7} \end{array}$
- So the complete algorithm for the rank-1 update would read

$\begin{array}{lcl} \text{broadcast a_0:}\quad \text{reg}_{0} &\leftarrow& \left( a_0, a_0, a_0, a_0 \right)\\ \text{broadcast a_1:}\quad \text{reg}_{1} &\leftarrow& \left( a_1, a_1, a_1, a_1 \right)\\ \text{broadcast a_2:}\quad \text{reg}_{2} &\leftarrow& \left( a_2, a_2, a_2, a_2 \right)\\ \text{broadcast a_3:}\quad \text{reg}_{3} &\leftarrow& \left( a_3, a_3, a_3, a_3 \right)\\ \text{reg}_{4 } &\leftarrow& \left(b_{0} , b_{1} , b_{2} , b_{3} \right)\\ \text{reg}_{5 } &\leftarrow& \left(b_{4} , b_{5} , b_{6} , b_{7} \right)\\ \text{reg}_{6} &\leftarrow& \text{reg}_{0} \odot \text{reg}_{4 } \\ \text{reg}_{8} &\leftarrow& \text{reg}_{8} + \text{reg}_{6} \\ \text{reg}_{7} &\leftarrow& \text{reg}_{0} \odot \text{reg}_{5 } \\ \text{reg}_{9} &\leftarrow& \text{reg}_{9} + \text{reg}_{7} \\ \text{reg}_{6} &\leftarrow& \text{reg}_{1} \odot \text{reg}_{4 } \\ \text{reg}_{10} &\leftarrow& \text{reg}_{10} + \text{reg}_{6} \\ \text{reg}_{7} &\leftarrow& \text{reg}_{1} \odot \text{reg}_{5 } \\ \text{reg}_{11} &\leftarrow& \text{reg}_{11} + \text{reg}_{7} \\ \text{reg}_{6} &\leftarrow& \text{reg}_{2} \odot \text{reg}_{4 } \\ \text{reg}_{12} &\leftarrow& \text{reg}_{12} + \text{reg}_{6} \\ \text{reg}_{7} &\leftarrow& \text{reg}_{2} \odot \text{reg}_{5 } \\ \text{reg}_{13} &\leftarrow& \text{reg}_{13} + \text{reg}_{7} \\ \text{reg}_{6} &\leftarrow& \text{reg}_{3} \odot \text{reg}_{4 } \\ \text{reg}_{14} &\leftarrow& \text{reg}_{14} + \text{reg}_{6} \\ \text{reg}_{7} &\leftarrow& \text{reg}_{3} \odot \text{reg}_{5 } \\ \text{reg}_{15} &\leftarrow& \text{reg}_{15} + \text{reg}_{7} \\ \end{array}$
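To make the data flow concrete, the sequence above can be simulated in scalar code, with each 256-bit register modeled as a 4-element array (a simulation only; a real kernel would use AVX intrinsics or compiler vector types):

```cpp
#include <array>

// Each "register" holds 4 doubles; the circle-dot operation is a
// component-wise product.
using Reg = std::array<double, 4>;

Reg broadcast(double x) { return {x, x, x, x}; }
Reg mul(Reg u, Reg v)   { return {u[0]*v[0], u[1]*v[1], u[2]*v[2], u[3]*v[3]}; }
Reg add(Reg u, Reg v)   { return {u[0]+v[0], u[1]+v[1], u[2]+v[2], u[3]+v[3]}; }

// One rank-1 update for MR=4, NR=8. The accumulator lives in reg[8..15]:
// row i of Cbar maps to reg[8+2i] (columns 0..3) and reg[9+2i] (columns 4..7).
void rank1_update(Reg reg[16], const double a[4], const double b[8])
{
    for (int i = 0; i < 4; ++i)
        reg[i] = broadcast(a[i]);          // reg0..reg3 <- broadcast a_i
    reg[4] = {b[0], b[1], b[2], b[3]};
    reg[5] = {b[4], b[5], b[6], b[7]};
    for (int i = 0; i < 4; ++i) {
        reg[6] = mul(reg[i], reg[4]);      // temporary in reg6
        reg[8 + 2*i] = add(reg[8 + 2*i], reg[6]);
        reg[7] = mul(reg[i], reg[5]);      // temporary in reg7
        reg[9 + 2*i] = add(reg[9 + 2*i], reg[7]);
    }
}
```

After one update with zero-initialized accumulators, entry $(i, j)$ of $\overline{C}$ equals $a_i b_j$, matching the rank-1 update above.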

So this only requires loading data from L1-cache and we have a lot of freedom to improve pipelining.

On the previous page we saw how using the GCC vector extensions allows us to give the compiler the hints necessary to come very close to this. This can actually be observed in the assembly code produced by the compiler.
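For illustration, a rank-1 update along these lines can be sketched with the GCC/Clang vector extensions (the accumulator layout below is an assumption for this sketch, not necessarily the exact code from the previous page):

```cpp
// v4df is a 256-bit vector of 4 doubles (GCC/Clang vector extension).
typedef double v4df __attribute__((vector_size(32)));

// One rank-1 update for MR=4, NR=8. AB holds the 4x8 accumulator Cbar as
// 8 vectors: row i maps to AB[2*i] (columns 0..3) and AB[2*i+1] (columns 4..7).
void rank1_update_v(v4df AB[8], const double a[4], const double b[8])
{
    v4df b0123 = {b[0], b[1], b[2], b[3]};
    v4df b4567 = {b[4], b[5], b[6], b[7]};
    for (int i = 0; i < 4; ++i) {
        v4df ai = {a[i], a[i], a[i], a[i]};   // broadcast a_i
        AB[2*i]     += ai * b0123;            // component-wise multiply-add
        AB[2*i + 1] += ai * b4567;
    }
}
```

With `-O3` the compiler can map the broadcasts, component-wise products, and additions almost directly onto the register algorithm shown above.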

### More Advanced Variant

On some architectures the broadcast instruction is slower than a simple move. So let us consider moving a complete column of $$A$$ into a single register and a row of $$B$$ into two registers:

$\begin{array}{ccc} \text{reg}_{0 } \leftrightarrow \left(a_{0} , a_{1} , a_{2} , a_{3} \right), & \text{reg}_{1 } \leftrightarrow \left(b_{0} , b_{1} , b_{2} , b_{3} \right), & \text{reg}_{2 } \leftrightarrow \left(b_{4} , b_{5} , b_{6} , b_{7} \right) \end{array}$

The component-wise multiplication would give:

$\begin{array}{cc} \text{reg}_{0 } \odot \text{reg}_{1 } &= \left(a_0 b_0 , a_1 b_1 , a_2 b_2 , a_3 b_3\right) \\ \end{array}$

So the values computed would correspond to the red-marked entries in $$\overline{C}$$:

$\require{color} \left(\begin{array}{cccccccc} \textcolor{red}{c_{00}}& c_{01} & c_{02} & c_{03} & c_{04} & c_{05} & c_{06} & c_{07} \\ c_{10} & \textcolor{red}{c_{11}} & c_{12} & c_{13} & c_{14} & c_{15} & c_{16} & c_{17} \\ c_{20} & c_{21} & \textcolor{red}{c_{22}} & c_{23} & c_{24} & c_{25} & c_{26} & c_{27} \\ c_{30} & c_{31} & c_{32} & \textcolor{red}{c_{33}} & c_{34} & c_{35} & c_{36} & c_{37} \\ \end{array}\right)$

This suggests the mapping

$\begin{array}{cc} \text{reg}_{8 } \leftrightarrow \left(c_{00} , c_{11}, c_{22}, c_{33} \right) \end{array}$

Analogously the component-wise multiplication

$\begin{array}{cc} \text{reg}_{0 } \odot \text{reg}_{2 } &= \left(a_0 b_4 , a_1 b_5 , a_2 b_6 , a_3 b_7\right) \\ \end{array}$

contributes to the entries

$\require{color} \left(\begin{array}{cccccccc} c_{00} & c_{01} & c_{02} & c_{03} & \textcolor{red}{c_{04}}& c_{05} & c_{06} & c_{07} \\ c_{10} & c_{11} & c_{12} & c_{13} & c_{14} & \textcolor{red}{c_{15}} & c_{16} & c_{17} \\ c_{20} & c_{21} & c_{22} & c_{23} & c_{24} & c_{25} & \textcolor{red}{c_{26}} & c_{27} \\ c_{30} & c_{31} & c_{32} & c_{33} & c_{34} & c_{35} & c_{36} & \textcolor{red}{c_{37}} \\ \end{array}\right)$

This suggests the mapping

$\begin{array}{cc} \text{reg}_{9 } \leftrightarrow \left(c_{04} , c_{15}, c_{26}, c_{37} \right) \end{array}$
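A scalar stand-in for the component-wise product makes this diagonal pattern easy to verify:

```cpp
#include <array>

// Component-wise product of two 4-element "registers".
using Vec4 = std::array<double, 4>;

Vec4 cmul(Vec4 u, Vec4 v)
{
    return {u[0]*v[0], u[1]*v[1], u[2]*v[2], u[3]*v[3]};
}
```

For $u = (a_0, a_1, a_2, a_3)$ and $v = (b_0, b_1, b_2, b_3)$ the result is $(a_0 b_0, a_1 b_1, a_2 b_2, a_3 b_3)$, i.e. exactly one contribution per row of $\overline{C}$: the entries $c_{00}, c_{11}, c_{22}, c_{33}$.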

Assume we also have some kind of swap or shuffle operations that achieve the following:

- The content $$(t_0, t_1, t_2, t_3)$$ of a register can be shuffled to $$(t_2, t_3, t_0, t_1)$$. We denote this operation by $$\text{shuffle}(\text{reg}_i)$$.

- The content $$(t_0, t_1, t_2, t_3)$$ of a register can be permuted to $$(t_1, t_0, t_3, t_2)$$. We denote this operation by $$\text{permute}(\text{reg}_i)$$.
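Simulated on 4-element arrays, these two operations read as follows (on AVX they would correspond to shuffle/permute instructions; this sketch only pins down the definitions above):

```cpp
#include <array>

using Vec4 = std::array<double, 4>;

// (t0, t1, t2, t3) -> (t2, t3, t0, t1)
Vec4 shuffle(Vec4 t)
{
    return {t[2], t[3], t[0], t[1]};
}

// (t0, t1, t2, t3) -> (t1, t0, t3, t2)
Vec4 permute(Vec4 t)
{
    return {t[1], t[0], t[3], t[2]};
}
```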

So you might already get the idea ... I will update this page with further details if there is interest.