Packing: Filling buffers for a fully cache-optimized GEMM


For a fully cache-optimized GEMM implementation matrices \(A\) and \(B\) will be blocked and each block buffered in a certain way. We will first focus on how to buffer blocks of \(A\).

Partitioning of matrix \(A\)

Matrix \(A\) has dimension \(m \times k\) and will be partitioned with respect to some constants \(M_c\) and \(K_c\). This means that \(A\) is seen as a \(m_b \times k_b\) block-matrix:

\[A = \left(\begin{array}{c|c|c} A_{0,0} & \dots & A_{0,k_b-1} \\ \hline \vdots & & \vdots \\ \hline A_{m_b-1,0} & \dots & A_{m_b-1,k_b-1} \\ \end{array}\right),\quad m_b = \left\lceil \frac{m}{M_c} \right\rceil,\; k_b = \left\lceil \frac{k}{K_c} \right\rceil.\]

Each block has maximal dimension \(M_c \times K_c\). However, blocks at the bottom or at the right border can be smaller. More precisely: For a block

\[A_{i,j} \in \mathbb{M}^{M\times K} \quad (0 \leq i < m_b, \; 0 \leq j < k_b)\]

its dimension is given by

\[M = \begin{cases} M_c, & i < m_b-1 \;\; \lor \;\; M_c \,|\, m, \\ m \bmod M_c, & \text{else}, \end{cases}\]


\[K = \begin{cases} K_c, & j < k_b-1 \;\; \lor \;\;K_c \,|\, k, \\ k \bmod K_c, & \text{else}. \end{cases}\]

Packing blocks of \(A\)

With \(\overline{A}\) we will denote an arbitrary block of \(A\). So the dimension of \(\overline{A}\) is \(M \times K\) with \(M \leq M_c\) and \(K \leq K_c\).

This block gets partitioned further into horizontal panels. Partitioning happens with respect to a constant \(M_r\). Here we require that \(M_c\) is divisible by \(M_r\) (so in the most frequent case where \(M=M_c\) all horizontal panels have the same height).

In general, the partitioning

\[\overline{A} = \left(\begin{array}{c} A_{0} \\ \hline \vdots \\ \hline A_{m_p-1} \\ \end{array}\right),\quad m_p = \left\lceil \frac{M}{M_r} \right\rceil,\]

has horizontal panels

\[A_{i} \in \mathbb{M}^{m_r \times K} \quad (0 \leq i < m_p)\]


\[m_r = \begin{cases} M_r, & i < m_p-1 \;\; \lor \;\; M_r \,|\, M, \\ M \bmod M_r, & \text{else}, \end{cases}\]

However, the packing operations acts as if \(\overline{A}\) was zero-extended such that each panel has exactly \(M_r\) rows. We denote this virtual extension as matrix \(\overline{A}_Z\). It's dimension is \(\left(m_p \cdot M_r\right) \times K\).

The elements of \(\overline{A}_Z\) are copied into a buffer \(p\) of size \(M_c \cdot K_c\) satisfying the following conditions:

So in case \(M=M_c\) and \(K=K_c\) the complete buffer gets overwritten. In other cases only part of the buffer gets used.

Exercise: Algorithm for packing blocks of A

Let \(\overline{A}\) be a \(M \times K\) matrix with \(M \leq M_C\) and \(K \leq K_C\). With \(\overline{A}_Z\) we denote its zero-extension to dimensions \(\left(m_p \cdot M_r\right) \times K\). Further, elements of \(\overline{A}_Z\) are denoted with \(a_{I,J}\), i.e.

\[\overline{A}_Z = \left( a_{I,J} \right)\quad (0 \leq I < m_p \cdot M_r,\; 0 \leq J < K)\]

Find and define a function

\[\nu: \{0,\dots,m_p \cdot M_r-1\} \times \{0,\dots,K-1\} \to \{0,\dots, M_c \cdot K_c-1\},\;\; (I,J) \mapsto \nu(I,J)\]

such that \(\nu(I,J)\) gives the index in the buffer to which element \(a_{I,J}\) gets copied.

Hence we can use the following algorithm for packing the actual block \(\overline{A}\):

Note that the algorithm only accesses actually existing elements of the non-extended matrix block \(\overline{A}\). Further, elements of \(\overline{A}\) are accessed column-wise. For a row-wise access simply swap the loops.