===========================
Notes on the GEMM Algorithm                                          [TOC:2]
===========================

Of course the right thing to do is to read the __BLIS__ paper.

:links: BLIS -> http://www.cs.utexas.edu/users/flame/pubs/blis1_toms_rev3.pdf

Concept of the Micro-Kernel `ugemm`
===================================

The micro-kernel computes the GEMM operation

---- LATEX--------------------------------------------------
    C \leftarrow \beta C + \alpha A \cdot B
    \quad\text{with}\quad
    C \in \mathbb{M}^{M_R \times N_R}, \quad
    A \in \mathbb{M}^{M_R \times k}
    \;\text{and}\;
    B \in \mathbb{M}^{k \times N_R}
------------------------------------------------------------

The dimensions $M_R$ and $N_R$ are constants chosen depending on the number
of available registers.  The variable dimension $k$ is bounded by
$k \leq K_c$.  The constant $K_c$ is chosen such that $A$ and $B$ fit into
the L1 cache.  Accessing elements of $A$ and $B$ will therefore be fast, but
accessing elements of $C$ is slow in general.  We therefore use an
$M_R \times N_R$ matrix $\overline{C}$ that can be kept in registers and
split the operation as follows:

---- LATEX--------------------------------------------------
    \begin{array}{lcl}
    \overline{C} & \leftarrow & A \cdot B             \\
    \overline{C} & \leftarrow & \alpha \overline{C}   \\
    C            & \leftarrow & \beta C + \overline{C} \\
    \end{array}
------------------------------------------------------------

Relevant for performance is mainly the operation

---- LATEX--------------------------------------------------
    \overline{C} \leftarrow A \cdot B
------------------------------------------------------------

Here we have the situation that accessing elements of $A$ and $B$ (even from
the L1 cache) is slower than accessing elements of $\overline{C}$ stored in
registers.  We exploit this fact by carrying out the operation as a sequence
of rank-1 updates:

 - The elements of $A$ are guaranteed to be stored column-wise, i.e.
---- LATEX--------------------------------------------------
    A = \left(\vec{a}_1, \vec{a}_2, \dots, \vec{a}_k \right)
------------------------------------------------------------

 - The elements of $B$ are guaranteed to be stored row-wise, i.e.

---- LATEX--------------------------------------------------
    B = \left(\begin{array}{c}
        \vec{b}_1^T \\
        \vec{b}_2^T \\
        \vdots      \\
        \vec{b}_k^T \\
        \end{array}\right)
------------------------------------------------------------

Hence we get

---- LATEX--------------------------------------------------
    A \cdot B = \vec{a}_1 \vec{b}_1^T
              + \vec{a}_2 \vec{b}_2^T
              + \dots
              + \vec{a}_k \vec{b}_k^T
------------------------------------------------------------

So if $\overline{C}$ is zero-initialized we can compute the operation as the
sequence of accumulating updates

---- LATEX--------------------------------------------------
    \begin{array}{lcl}
    \overline{C} & \leftarrow & \overline{C} + \vec{a}_1 \vec{b}_1^T \\
    \overline{C} & \leftarrow & \overline{C} + \vec{a}_2 \vec{b}_2^T \\
    \vdots       &            & \vdots                               \\
    \overline{C} & \leftarrow & \overline{C} + \vec{a}_k \vec{b}_k^T \\
    \end{array}
------------------------------------------------------------

So we finally end up with the task of computing in the $l$-th step the
rank-1 update

---- LATEX--------------------------------------------------
    \overline{C} \leftarrow \overline{C} + \vec{a}_l \vec{b}_l^T,
    \quad 1 \leq l \leq k
------------------------------------------------------------


Computing the Rank-1 Update
===========================

For simplicity we assume $M_R = 4$ and $N_R = 8$.  These are the parameters
used for the AVX micro-kernel.
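Before looking at the explicit form, the whole micro-kernel can be sketched
in plain C as a scalar reference: zero a local buffer standing in for the
register block $\overline{C}$, accumulate the $k$ rank-1 updates, then apply
the $\alpha$/$\beta$ scaling.  The function and variable names below are
illustrative, not the tuned AVX kernel:

```c
#include <stddef.h>

enum { MR = 4, NR = 8 };

/* Scalar reference of the micro-kernel (hypothetical name `ugemm_ref`):
 * computes  C <- beta*C + alpha*A*B  for an MR x NR block.
 * A is packed column-wise (MR x k), B is packed row-wise (k x NR),
 * C has arbitrary row/column strides. */
static void
ugemm_ref(size_t k, double alpha,
          const double *A, const double *B,
          double beta, double *C, size_t incRowC, size_t incColC)
{
    double AB[MR*NR];   /* stands in for the register block C_bar,
                           stored column-wise */
    size_t i, j, l;

    /* C_bar <- 0 */
    for (i = 0; i < MR*NR; ++i) {
        AB[i] = 0.0;
    }

    /* C_bar <- C_bar + a_l * b_l^T  (sequence of rank-1 updates) */
    for (l = 0; l < k; ++l) {
        for (j = 0; j < NR; ++j) {
            for (i = 0; i < MR; ++i) {
                AB[i+j*MR] += A[i+l*MR] * B[l*NR+j];
            }
        }
    }

    /* C <- beta*C + alpha*C_bar */
    for (j = 0; j < NR; ++j) {
        for (i = 0; i < MR; ++i) {
            C[i*incRowC+j*incColC] = beta * C[i*incRowC+j*incColC]
                                   + alpha * AB[i+j*MR];
        }
    }
}
```

The three loop nests correspond exactly to the three-step split above; the
tuned kernels only replace the middle nest by vectorized rank-1 updates.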
The rank-1 update then reads

---- LATEX--------------------------------------------------
    \overline{C} \leftarrow \overline{C} +
    \left(\begin{array}{c} a_0 \\ a_1 \\ a_2 \\ a_3 \\ \end{array}\right)
    \left(b_0, b_1, b_2, b_3, b_4, b_5, b_6, b_7 \right)
------------------------------------------------------------

Or more explicitly

---- LATEX--------------------------------------------------
    \left(\begin{array}{cccccccc}
    c_{00} & c_{01} & c_{02} & c_{03} & c_{04} & c_{05} & c_{06} & c_{07} \\
    c_{10} & c_{11} & c_{12} & c_{13} & c_{14} & c_{15} & c_{16} & c_{17} \\
    c_{20} & c_{21} & c_{22} & c_{23} & c_{24} & c_{25} & c_{26} & c_{27} \\
    c_{30} & c_{31} & c_{32} & c_{33} & c_{34} & c_{35} & c_{36} & c_{37} \\
    \end{array}\right)
    \leftarrow
    \left(\begin{array}{cccccccc}
    c_{00} + a_{0} b_{0} & c_{01} + a_{0} b_{1} & c_{02} + a_{0} b_{2} & c_{03} + a_{0} b_{3} & c_{04} + a_{0} b_{4} & c_{05} + a_{0} b_{5} & c_{06} + a_{0} b_{6} & c_{07} + a_{0} b_{7} \\
    c_{10} + a_{1} b_{0} & c_{11} + a_{1} b_{1} & c_{12} + a_{1} b_{2} & c_{13} + a_{1} b_{3} & c_{14} + a_{1} b_{4} & c_{15} + a_{1} b_{5} & c_{16} + a_{1} b_{6} & c_{17} + a_{1} b_{7} \\
    c_{20} + a_{2} b_{0} & c_{21} + a_{2} b_{1} & c_{22} + a_{2} b_{2} & c_{23} + a_{2} b_{3} & c_{24} + a_{2} b_{4} & c_{25} + a_{2} b_{5} & c_{26} + a_{2} b_{6} & c_{27} + a_{2} b_{7} \\
    c_{30} + a_{3} b_{0} & c_{31} + a_{3} b_{1} & c_{32} + a_{3} b_{2} & c_{33} + a_{3} b_{3} & c_{34} + a_{3} b_{4} & c_{35} + a_{3} b_{5} & c_{36} + a_{3} b_{6} & c_{37} + a_{3} b_{7} \\
    \end{array}\right)
------------------------------------------------------------

Simple Variant for AVX
----------------------

You will see that there are many variants for carrying out the rank-1
update.  All these variants differ only in the order in which the elements
of $\overline{C}$ get computed.  However, the performance of these variants
can differ significantly.

We first consider a variant that theoretically could be produced by a
compiler from code like

---- CODE (type=cc) ----------------------------
    /* AB is an MR x NR buffer for C_bar, stored column-wise */
    for (i=0; i<MR; ++i) {
        for (j=0; j<NR; ++j) {
            AB[i+j*MR] += a[i]*b[j];
        }
    }
------------------------------------------------

:navigate: back -> doc:session5/page01
           next -> doc:session7/page01
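As a sanity check, the simple variant can be wrapped in a self-contained
function (scalar C, illustrative names, not the AVX kernel) and compared
entry by entry against the explicit $4 \times 8$ outer product shown above:

```c
enum { MR = 4, NR = 8 };

/* Simple variant of the rank-1 update:  AB <- AB + a * b^T,
 * where AB (MR x NR) stands in for the register block C_bar,
 * stored column-wise: entry (i,j) lives at AB[i+j*MR]. */
static void
rank1_update(double *AB, const double *a, const double *b)
{
    int i, j;

    for (i = 0; i < MR; ++i) {
        for (j = 0; j < NR; ++j) {
            AB[i+j*MR] += a[i]*b[j];
        }
    }
}
```

Starting from a zeroed buffer, entry $(i,j)$ of `AB` receives exactly
$a_i b_j$, matching the explicit matrix above; calling the function again
accumulates, as required by the sequence of rank-1 updates.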