# Packing: Filling buffers for a fully cache-optimized GEMM

#### Content

For a fully cache-optimized GEMM implementation matrices $$A$$ and $$B$$ will be blocked and each block buffered in a certain way. We will first focus on how to buffer blocks of $$A$$.

## Partitioning of matrix $$A$$

Matrix $$A$$ has dimension $$m \times k$$ and will be partitioned with respect to some constants $$M_c$$ and $$K_c$$. This means that $$A$$ is seen as a $$m_b \times k_b$$ block-matrix:

$A = \left(\begin{array}{c|c|c} A_{0,0} & \dots & A_{0,k_b-1} \\ \hline \vdots & & \vdots \\ \hline A_{m_b-1,0} & \dots & A_{m_b-1,k_b-1} \\ \end{array}\right),\quad m_b = \left\lceil \frac{m}{M_c} \right\rceil,\; k_b = \left\lceil \frac{k}{K_c} \right\rceil.$

Each block has maximal dimension $$M_c \times K_c$$. However, blocks at the bottom or at the right border can be smaller. More precisely: For a block

$A_{i,j} \in \mathbb{M}^{M\times K} \quad (0 \leq i < m_b, \; 0 \leq j < k_b)$

its dimension is given by

$M = \begin{cases} M_c, & i < m_b-1 \;\; \lor \;\; M_c \,|\, m, \\ m \bmod M_c, & \text{else}, \end{cases}$

and

$K = \begin{cases} K_c, & j < k_b-1 \;\; \lor \;\;K_c \,|\, k, \\ k \bmod K_c, & \text{else}. \end{cases}$

## Packing blocks of $$A$$

With $$\overline{A}$$ we will denote an arbitrary block of $$A$$. So the dimension of $$\overline{A}$$ is $$M \times K$$ with $$M \leq M_c$$ and $$K \leq K_c$$.

This block gets partitioned further into horizontal panels. Partitioning happens with respect to a constant $$M_r$$. Here we require that $$M_c$$ is divisible by $$M_r$$ (so in the most frequent case where $$M=M_c$$ all horizontal panels have the same height).

In general, the partitioning

$\overline{A} = \left(\begin{array}{c} A_{0} \\ \hline \vdots \\ \hline A_{m_p-1} \\ \end{array}\right),\quad m_p = \left\lceil \frac{M}{M_r} \right\rceil,$

has horizontal panels

$A_{i} \in \mathbb{M}^{m_r \times K} \quad (0 \leq i < m_p)$

with

$m_r = \begin{cases} M_r, & i < m_p-1 \;\; \lor \;\; M_r \,|\, M, \\ M \bmod M_r, & \text{else}, \end{cases}$

However, the packing operations acts as if $$\overline{A}$$ was zero-extended such that each panel has exactly $$M_r$$ rows. We denote this virtual extension as matrix $$\overline{A}_Z$$. It's dimension is $$\left(m_p \cdot M_r\right) \times K$$.

The elements of $$\overline{A}_Z$$ are copied into a buffer $$p$$ of size $$M_c \cdot K_c$$ satisfying the following conditions:

• For $$0 \leq i < m_p$$ each panel $$A_i$$ gets stored column wise in the buffer:

• Elements of $$A_0$$ get stored column wise in $$(\; p_0, \dots, p_{M_r \cdot K -1})$$,

• Elements of $$A_1$$ get stored column wise in $$(\; p_{M_r \cdot K}, \dots, p_{2 \cdot M_r \cdot K -1})$$,

• etc.

So in general:

• Elements of $$A_i$$ get stored column wise in $$(\; p_{i \cdot M_r \cdot K}, \dots, p_{(i+1) \cdot M_r \cdot K -1})$$,

So in case $$M=M_c$$ and $$K=K_c$$ the complete buffer gets overwritten. In other cases only part of the buffer gets used.

## Exercise: Algorithm for packing blocks of A

Let $$\overline{A}$$ be a $$M \times K$$ matrix with $$M \leq M_C$$ and $$K \leq K_C$$. With $$\overline{A}_Z$$ we denote its zero-extension to dimensions $$\left(m_p \cdot M_r\right) \times K$$. Further, elements of $$\overline{A}_Z$$ are denoted with $$a_{I,J}$$, i.e.

$\overline{A}_Z = \left( a_{I,J} \right)\quad (0 \leq I < m_p \cdot M_r,\; 0 \leq J < K)$

Find and define a function

$\nu: \{0,\dots,m_p \cdot M_r-1\} \times \{0,\dots,K-1\} \to \{0,\dots, M_c \cdot K_c-1\},\;\; (I,J) \mapsto \nu(I,J)$

such that $$\nu(I,J)$$ gives the index in the buffer to which element $$a_{I,J}$$ gets copied.

Hence we can use the following algorithm for packing the actual block $$\overline{A}$$:

• For $$0 \leq J < K$$

• For $$0 \leq I < m_p \cdot M_r$$

• if $$0 \leq I < M$$

$$p_{\nu(I,J)} \leftarrow a_{I,J}$$

• else

$$p_{\nu(I,J)} \leftarrow 0$$

Note that the algorithm only accesses actually existing elements of the non-extended matrix block $$\overline{A}$$. Further, elements of $$\overline{A}$$ are accessed column-wise. For a row-wise access simply swap the loops.