========================================================= Packing: Filling buffers for a fully cache-optimized GEMM [TOC] ========================================================= For a fully cache-optimized GEMM implementation matrices $A$ and $B$ will be blocked and each block buffered in a _certain way_. We will first focus on how to buffer blocks of $A$. Partitioning of matrix $A$ ========================== Matrix $A$ has dimension $m \times k$ and will be partitioned with respect to some constants $M_c$ and $K_c$. This means that $A$ is seen as a $m_b \times k_b$ block-matrix: ---- LATEX --------------------------------------------------------------------- A = \left(\begin{array}{c|c|c} A_{0,0} & \dots & A_{0,k_b-1} \\ \hline \vdots & & \vdots \\ \hline A_{m_b-1,0} & \dots & A_{m_b-1,k_b-1} \\ \end{array}\right), \quad m_b = \left\lceil \frac{m}{M_c} \right\rceil, \; k_b = \left\lceil \frac{k}{K_c} \right\rceil. -------------------------------------------------------------------------------- Each block has maximal dimension $M_c \times K_c$. However, blocks at the bottom or at the right border can be smaller. More precisely: For a block ---- LATEX --------------------------------------------------------------------- A_{i,j} \in \mathbb{M}^{M\times K} \quad (0 \leq i < m_b, \; 0 \leq j < k_b) -------------------------------------------------------------------------------- its dimension is given by ---- LATEX --------------------------------------------------------------------- M = \begin{cases} M_c, & i < m_b-1 \;\; \lor \;\; M_c \,|\, m, \\ m \bmod M_c, & \text{else}, \end{cases} -------------------------------------------------------------------------------- and ---- LATEX --------------------------------------------------------------------- K = \begin{cases} K_c, & j < k_b-1 \;\; \lor \;\;K_c \,|\, k, \\ k \bmod K_c, & \text{else}. \end{cases} -------------------------------------------------------------------------------- Packing blocks of $A$ ===================== With $\overline{A}$ we will denote an arbitrary block of $A$. So the dimension of $\overline{A}$ is $M \times K$ with $M \leq M_c$ and $K \leq K_c$. This block gets partitioned further into horizontal panels. Partitioning happens with respect to a constant $M_r$. Here we require that $M_c$ is divisible by $M_r$ (so in the most frequent case where $M=M_c$ all horizontal panels have the same height). In general, the partitioning ---- LATEX --------------------------------------------------------------------- \overline{A} = \left(\begin{array}{c} A_{0} \\ \hline \vdots \\ \hline A_{m_p-1} \\ \end{array}\right), \quad m_p = \left\lceil \frac{M}{M_r} \right\rceil, -------------------------------------------------------------------------------- has horizontal panels ---- LATEX --------------------------------------------------------------------- A_{i} \in \mathbb{M}^{m_r \times K} \quad (0 \leq i < m_p) -------------------------------------------------------------------------------- with ---- LATEX --------------------------------------------------------------------- m_r = \begin{cases} M_r, & i < m_p-1 \;\; \lor \;\; M_r \,|\, M, \\ M \bmod M_r, & \text{else}, \end{cases} -------------------------------------------------------------------------------- *However, the packing operations acts as if $\overline{A}$ was zero-extended such that each panel has exactly $M_r$ rows. We denote this virtual extension as matrix $\overline{A}_Z$. It's dimension is $\left(m_p \cdot M_r\right) \times K$.* The elements of $\overline{A}_Z$ are copied into a buffer $p$ of size $M_c \cdot K_c$ satisfying the following conditions: - For $0 \leq i < m_p$ each panel $A_i$ gets stored column wise in the buffer: - Elements of $A_0$ get stored column wise in $(\; p_0, \dots, p_{M_r \cdot K -1})$, - Elements of $A_1$ get stored column wise in $(\; p_{M_r \cdot K}, \dots, p_{2 \cdot M_r \cdot K -1})$, - etc. So in general: - Elements of $A_i$ get stored column wise in $(\; p_{i \cdot M_r \cdot K}, \dots, p_{(i+1) \cdot M_r \cdot K -1})$, So in case $M=M_c$ and $K=K_c$ the complete buffer gets overwritten. In other cases only part of the buffer gets used. Exercise: Algorithm for packing blocks of A =========================================== Let $\overline{A}$ be a $M \times K$ matrix with $M \leq M_C$ and $K \leq K_C$. With $\overline{A}_Z$ we denote its zero-extension to dimensions $\left(m_p \cdot M_r\right) \times K$. Further, elements of $\overline{A}_Z$ are denoted with $a_{I,J}$, i.e. ---- LATEX --------------------------------------------------------------------- \overline{A}_Z = \left( a_{I,J} \right) \quad (0 \leq I < m_p \cdot M_r,\; 0 \leq J < K) -------------------------------------------------------------------------------- Find and define a function ---- LATEX --------------------------------------------------------------------- \nu: \{0,\dots,m_p \cdot M_r-1\} \times \{0,\dots,K-1\} \to \{0,\dots, M_c \cdot K_c-1\},\;\; (I,J) \mapsto \nu(I,J) -------------------------------------------------------------------------------- such that $\nu(I,J)$ gives the index in the buffer to which element $a_{I,J}$ gets copied. Hence we can use the following algorithm for packing the actual block $\overline{A}$: - For $0 \leq J < K$ - For $0 \leq I < m_p \cdot M_r$ - if $0 \leq I < M$ $p_{\nu(I,J)} \leftarrow a_{I,J}$ - else $p_{\nu(I,J)} \leftarrow 0$ Note that the algorithm only accesses actually existing elements of the non-extended matrix block $\overline{A}$. Further, elements of $\overline{A}$ are accessed column-wise. For a row-wise access simply swap the loops. :navigate: up -> doc:index next -> doc:session09/page02