Packing: Filling buffers for a fully cacheoptimized GEMM
Content 
For a fully cacheoptimized GEMM implementation matrices \(A\) and \(B\) will be blocked and each block buffered in a certain way. We will first focus on how to buffer blocks of \(A\).
Partitioning of matrix \(A\)
Matrix \(A\) has dimension \(m \times k\) and will be partitioned with respect to some constants \(M_c\) and \(K_c\). This means that \(A\) is seen as a \(m_b \times k_b\) blockmatrix:
\[A = \left(\begin{array}{ccc} A_{0,0} & \dots & A_{0,k_b1} \\ \hline \vdots & & \vdots \\ \hline A_{m_b1,0} & \dots & A_{m_b1,k_b1} \\ \end{array}\right),\quad m_b = \left\lceil \frac{m}{M_c} \right\rceil,\; k_b = \left\lceil \frac{k}{K_c} \right\rceil.\]Each block has maximal dimension \(M_c \times K_c\). However, blocks at the bottom or at the right border can be smaller. More precisely: For a block
\[A_{i,j} \in \mathbb{M}^{M\times K} \quad (0 \leq i < m_b, \; 0 \leq j < k_b)\]its dimension is given by
\[M = \begin{cases} M_c, & i < m_b1 \;\; \lor \;\; M_c \,\, m, \\ m \bmod M_c, & \text{else}, \end{cases}\]and
\[K = \begin{cases} K_c, & j < k_b1 \;\; \lor \;\;K_c \,\, k, \\ k \bmod K_c, & \text{else}. \end{cases}\]Packing blocks of \(A\)
With \(\overline{A}\) we will denote an arbitrary block of \(A\). So the dimension of \(\overline{A}\) is \(M \times K\) with \(M \leq M_c\) and \(K \leq K_c\).
This block gets partitioned further into horizontal panels. Partitioning happens with respect to a constant \(M_r\). Here we require that \(M_c\) is divisible by \(M_r\) (so in the most frequent case where \(M=M_c\) all horizontal panels have the same height).
In general, the partitioning
\[\overline{A} = \left(\begin{array}{c} A_{0} \\ \hline \vdots \\ \hline A_{m_p1} \\ \end{array}\right),\quad m_p = \left\lceil \frac{M}{M_r} \right\rceil,\]has horizontal panels
\[A_{i} \in \mathbb{M}^{m_r \times K} \quad (0 \leq i < m_p)\]with
\[m_r = \begin{cases} M_r, & i < m_p1 \;\; \lor \;\; M_r \,\, M, \\ M \bmod M_r, & \text{else}, \end{cases}\]However, the packing operations acts as if \(\overline{A}\) was zeroextended such that each panel has exactly \(M_r\) rows. We denote this virtual extension as matrix \(\overline{A}_Z\). It's dimension is \(\left(m_p \cdot M_r\right) \times K\).
The elements of \(\overline{A}_Z\) are copied into a buffer \(p\) of size \(M_c \cdot K_c\) satisfying the following conditions:

For \(0 \leq i < m_p\) each panel \(A_i\) gets stored column wise in the buffer:

Elements of \(A_0\) get stored column wise in \((\; p_0, \dots, p_{M_r \cdot K 1})\),

Elements of \(A_1\) get stored column wise in \((\; p_{M_r \cdot K}, \dots, p_{2 \cdot M_r \cdot K 1})\),

etc.
So in general:

Elements of \(A_i\) get stored column wise in \((\; p_{i \cdot M_r \cdot K}, \dots, p_{(i+1) \cdot M_r \cdot K 1})\),

So in case \(M=M_c\) and \(K=K_c\) the complete buffer gets overwritten. In other cases only part of the buffer gets used.
Exercise: Algorithm for packing blocks of A
Let \(\overline{A}\) be a \(M \times K\) matrix with \(M \leq M_C\) and \(K \leq K_C\). With \(\overline{A}_Z\) we denote its zeroextension to dimensions \(\left(m_p \cdot M_r\right) \times K\). Further, elements of \(\overline{A}_Z\) are denoted with \(a_{I,J}\), i.e.
\[\overline{A}_Z = \left( a_{I,J} \right)\quad (0 \leq I < m_p \cdot M_r,\; 0 \leq J < K)\]Find and define a function
\[\nu: \{0,\dots,m_p \cdot M_r1\} \times \{0,\dots,K1\} \to \{0,\dots, M_c \cdot K_c1\},\;\; (I,J) \mapsto \nu(I,J)\]such that \(\nu(I,J)\) gives the index in the buffer to which element \(a_{I,J}\) gets copied.
Hence we can use the following algorithm for packing the actual block \(\overline{A}\):

For \(0 \leq J < K\)

For \(0 \leq I < m_p \cdot M_r\)

if \(0 \leq I < M\)
\(p_{\nu(I,J)} \leftarrow a_{I,J}\)

else
\(p_{\nu(I,J)} \leftarrow 0\)


Note that the algorithm only accesses actually existing elements of the nonextended matrix block \(\overline{A}\). Further, elements of \(\overline{A}\) are accessed columnwise. For a rowwise access simply swap the loops.