Packing Blocks of A

In the cache-optimized GEMM operation \(\beta C + \alpha A B \to C\), the matrix \(A\) is partitioned into blocks of maximal dimension \(M_c \times K_c\). Each block \(A_{i,l}\) of \(A\) is packed into horizontal panels with \(M_r\) rows, and each panel is stored in column-major order. If the last panel has fewer than \(M_r\) rows, it is padded with zeros.
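
As a small illustration (the sizes are chosen only for this example), take a block \(X\) with \(m = 3\), \(k = 2\) and \(M_r = 2\). The first panel holds rows 0 and 1. The second panel holds row 2 plus one row of zero padding, as the packing algorithm in this section prescribes. Each panel is laid out column by column:

\[
X = \begin{pmatrix} x_{0,0} & x_{0,1} \\ x_{1,0} & x_{1,1} \\ x_{2,0} & x_{2,1} \end{pmatrix}
\quad\longrightarrow\quad
p = (\underbrace{x_{0,0},\, x_{1,0},\, x_{0,1},\, x_{1,1}}_{\text{panel } 0},\; \underbrace{x_{2,0},\, 0,\, x_{2,1},\, 0}_{\text{panel } 1})
\]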

Assume that \(p\) is a buffer for \(M_c \cdot K_c\) elements and that \(X\) is a matrix block of dimension \(m \times k\) with \(m \leq M_c\) and \(k \leq K_c\). Then the following algorithm (using zero-based indices) can be used for packing:

`dgepack_A(X, p)`

Input:
- matrix \(X = \left(x_{i,l}\right)\) with dimension \(m \times k\) (assume \(m \leq M_c\), \(k \leq K_c\))
- array \(p\) with length \(M_c \cdot K_c\)

On Return:
- \(p\) contains \(X\) packed in horizontal col-major panels with \(M_r\) rows

for \(l\) with \(0 \leq l < k\)
- for \(i_1\) with \(0 \leq i_1 < \left\lceil \frac{m}{M_r} \right\rceil\)
  - for \(i_0\) with \(0 \leq i_0 < M_r\)
    - \(i \leftarrow i_1 \cdot M_r + i_0\)
    - \(\nu \leftarrow i_1 \cdot M_r \cdot k + l \cdot M_r + i_0\)
    - if \(i < m\)
      - \(p_\nu \leftarrow x_{i,l}\)
    - else
      - \(p_\nu \leftarrow 0\)
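
The pseudocode above could be translated to C roughly as follows. This is only a minimal sketch under several assumptions that are not part of the algorithm description: the panel height \(M_r\) is fixed as the macro `MR` with the illustrative value 4, the dimensions \(m\) and \(k\) are passed explicitly, and \(X\) is addressed through the hypothetical stride parameters `incRow` and `incCol`, so that \(x_{i,l}\) is found at `X[i*incRow + l*incCol]`.

```c
#include <stddef.h>

/* Panel height M_r. The value 4 is an assumption made for illustration. */
#ifndef MR
#define MR 4
#endif

/*
 * Pack the m x k block X into the buffer p as horizontal panels with MR rows,
 * each panel stored in column-major order. Rows beyond m are filled with zeros.
 *
 * Assumptions (not part of the original algorithm description):
 *  - element x_{i,l} is located at X[i*incRow + l*incCol]
 *  - p has room for at least ceil(m/MR)*MR*k elements
 */
void
dgepack_A(size_t m, size_t k,
          const double *X, ptrdiff_t incRow, ptrdiff_t incCol,
          double *p)
{
    size_t numPanels = (m + MR - 1) / MR;   /* ceil(m / MR) */

    for (size_t l = 0; l < k; ++l) {
        for (size_t i1 = 0; i1 < numPanels; ++i1) {
            for (size_t i0 = 0; i0 < MR; ++i0) {
                size_t i  = i1*MR + i0;             /* row index in X */
                size_t nu = i1*MR*k + l*MR + i0;    /* position in p  */

                if (i < m) {
                    p[nu] = X[(ptrdiff_t)i*incRow + (ptrdiff_t)l*incCol];
                } else {
                    p[nu] = 0.0;                    /* zero padding   */
                }
            }
        }
    }
}
```

For a column-major block with leading dimension `ldX`, a call would look like `dgepack_A(m, k, X, 1, ldX, p)`. Note that the packed block occupies \(\left\lceil \frac{m}{M_r} \right\rceil \cdot M_r \cdot k\) elements, which fits into a buffer of length \(M_c \cdot K_c\) whenever \(M_c\) is a multiple of \(M_r\).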

Exercise

Implement and test the algorithm above for packing blocks of \(A\).
Start with an empty source file!