Content

Notes on the GEMM Algorithm

Of course the right thing to do is reading the BLIS paper.

Concept of the Micro-Kernel ugemm

The micro-kernel computes the GEMM operation

\[C \leftarrow \beta C + \alpha A \cdot B \quad\text{with}\quad C \in \mathbb{M}^{M_R \times N_R},\quad A \in \mathbb{M}^{M_R \times k}\;\text{and}\;\quad B \in \mathbb{M}^{k \times N_R}\;\text{and}\;\]

The dimensions \(M_R\) and \(N_R\) are constants chosen depending on the available number of registers. The variable dimension \(k\) is bounded by \(k \leq K_c\). The constant \(K_c\) is chosen such that \(A\) and \(B\) fit into L1-cache.

So accessing elements of \(A\) and \(B\) will be fast, but accessing elements of \(C\) is slow in general. We therefore consider a \(M_R \times N_R\) matrix \(\overline{C}\) can be kept in registers and split the operation as follows:

\[\begin{array}{lcl}\overline{C} & \leftarrow & A \cdot B\\\overline{C} & \leftarrow & \alpha \overline{C}\\C & \leftarrow & \beta C + \overline{C}\\\end{array}\]

Relevant for the performance will mainly be the operation

\[\overline{C} \leftarrow A \cdot B\]

Here we have the situation that accessing elements of \(A\) and \(B\) (even if in L1-cache) is slower than accessing elements of \(\overline{C}\) stored in registers. We exploit this fact by carrying out the operation as a sequence of rank-1 updates:

Hence we get

\[A \cdot B = \vec{a}_1 \vec{b}_1^T + \vec{a}_2 \vec{b}_2^T + \dots + \vec{a}_k \vec{b}_k^T\]

So if \(\overline{C}\) is zero-initialized we can compute the operation as

\[\begin{array}{lcl}\overline{C} & \leftarrow & \vec{a}_1 \vec{b}_1^T \\\overline{C} & \leftarrow & \vec{a}_2 \vec{b}_2^T \\ \vdots & & \vdots \\\overline{C} & \leftarrow & \vec{a}_k \vec{b}_k^T \\\end{array}\]

So we finally end up with the task to compute in the \(l\)-th step a rank-1 update

\[\overline{C} \leftarrow \vec{a}_l \vec{b}_l^T\quad 1 \leq l \leq k\]

Computing the Rank-1 Update

For simplicity we assume \(M_R = 4\) and \(N_R = 8\). These are the parameters used for the AVX micro-kernel. The rank-1 update then reads

\[\overline{C} \leftarrow\left(\begin{array}{c} a_0 \\ a_1 \\ a_2 \\ a_3 \\ \end{array}\right)\left(b_0, b_1, b_2, b_3, b_4, b_5, b_6, b_7 \right)\]

Or more explicit as

\[\left(\begin{array}{cccccccc} c_{00} & c_{01} & c_{02} & c_{03} & c_{04} & c_{05} & c_{06} & c_{07} \\ c_{10} & c_{11} & c_{12} & c_{13} & c_{14} & c_{15} & c_{16} & c_{17} \\ c_{20} & c_{21} & c_{22} & c_{23} & c_{24} & c_{25} & c_{26} & c_{27} \\ c_{30} & c_{31} & c_{32} & c_{33} & c_{34} & c_{35} & c_{36} & c_{37} \\ \end{array}\right)\leftarrow\left(\begin{array}{cccccccc} a_{0} b_{0} & a_{0} b_{1}& a_{0} b_{2}& a_{0} b_{3}& a_{0} b_{4}& a_{0} b_{5}& a_{0} b_{6}& a_{0} b_{7}\\ a_{1} b_{0} & a_{1} b_{1}& a_{1} b_{2}& a_{1} b_{3}& a_{1} b_{4}& a_{1} b_{5}& a_{1} b_{6}& a_{1} b_{7}\\ a_{2} b_{0} & a_{2} b_{1}& a_{2} b_{2}& a_{2} b_{3}& a_{2} b_{4}& a_{2} b_{5}& a_{2} b_{6}& a_{2} b_{7}\\ a_{3} b_{0} & a_{3} b_{1}& a_{3} b_{2}& a_{3} b_{3}& a_{3} b_{4}& a_{3} b_{5}& a_{3} b_{6}& a_{3} b_{7}\\ \end{array}\right)\]

Simple Variant for AVX

You will see there are many variants for how to carry out the rank-1 update. All these variants differ only in the order elements of \(\overline{C}\) get computed. However, performance of these variants can differ significantly. We first consider a variant the theoretically could be produced from a compiler from a code like

for (i=0; i<MR; ++i)
    for (j=0; j<NR; ++j)
        c[i][j] += a[i]*b[j];

Obviously the value a[i] does not change in the inner loop so the compiler can buffer it like in

for (i=0; i<MR; ++i) {
    auto a_ = a[i];
    for (j=0; j<NR; ++j)
        c[i][j] += a_*b[j];
}

The compiler can even buffer all values of c, a and b in registers. Having 16 AVX regsiters (\(\text{reg}_0, \dots, \text{reg}_{15}\)) with 256 bits the following mapping can be used for values of type double:

So we have \(\text{reg}_{6}\) and \(\text{reg}_7\) available for temporary results:

On the previous page we saw how using the GCC vector extensions allows to give some necessary hints to the compiler in order to come very close to this. This can actually be observed from the assembly code produced by the compiler.

More Advanced Variant

On some architectures the broadcast instruction is slower than a simple move. So let us consider to move a complete column of \(A\) into a single register and a row of \(B\) into two registers:

\[\begin{array}{cc} \text{reg}_{0 } \leftrightarrow \left(a_{0} , a_{1} , a_{2} , a_{3} \right), & \text{reg}_{1 } \leftrightarrow \left(b_{0} , b_{1} , b_{2} , b_{3} \right), & \text{reg}_{2 } \leftrightarrow \left(b_{4} , b_{5} , b_{6} , b_{7} \right) \end{array}\]

The component-wise multiplication would give:

\[\begin{array}{cc} \text{reg}_{0 } \odot \text{reg}_{1 } &= \left(a_0 b_0 , a_1 b_1 , a_2 b_2 , a_3 b_3\right) \\ \end{array}\]

So the values computed would correspond to the red-marked entries in \(\overline{C}\):

\[\require{color} \left(\begin{array}{cccccccc} \textcolor{red}{c_{00}}& c_{01} & c_{02} & c_{03} & c_{04} & c_{05} & c_{06} & c_{07} \\ c_{10} & \textcolor{red}{c_{11}} & c_{12} & c_{13} & c_{14} & c_{15} & c_{16} & c_{17} \\ c_{20} & c_{21} & \textcolor{red}{c_{22}} & c_{23} & c_{24} & c_{25} & c_{26} & c_{27} \\ c_{30} & c_{31} & c_{32} & \textcolor{red}{c_{33}} & c_{34} & c_{35} & c_{36} & c_{37} \\ \end{array}\right)\]

This suggests to make the mapping

\[\begin{array}{cc} \text{reg}_{8 } \leftrightarrow \left(c_{00} , c_{11}, c_{22}, c_{33} \right) \end{array}\]

Analogously the component-wise multiplication

\[\begin{array}{cc} \text{reg}_{0 } \odot \text{reg}_{2 } &= \left(a_0 b_4 , a_1 b_5 , a_2 b_6 , a_3 b_7\right) \\ \end{array}\]

contributes to the entries

\[\require{color} \left(\begin{array}{cccccccc} c_{00} & c_{01} & c_{02} & c_{03} & \textcolor{red}{c_{04}}& c_{05} & c_{06} & c_{07} \\ c_{10} & c_{11} & c_{12} & c_{13} & c_{14} & \textcolor{red}{c_{15}} & c_{16} & c_{17} \\ c_{20} & c_{21} & c_{22} & c_{23} & c_{24} & c_{25} & \textcolor{red}{c_{26}} & c_{27} \\ c_{30} & c_{31} & c_{32} & c_{33} & c_{34} & c_{35} & c_{36} & \textcolor{red}{c_{37}} \\ \end{array}\right)\]

This suggests to make the mapping

\[\begin{array}{cc} \text{reg}_{9 } \leftrightarrow \left(c_{04} , c_{15}, c_{26}, c_{37} \right) \end{array}\]

Assume we also have some kind of swap or shuffel operations to achieve the following:

So you already might get the idea ... I will update this page with further details if interested.