Packing Blocks of A

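In the cache-optimized GEMM operation \(\beta C + \alpha A B \to C\) the matrix \(A\) is partitioned into blocks of dimension at most \(M_c \times K_c\). Each block \(A_{i,l}\) of \(A\) is packed into horizontal panels with \(M_r\) rows each, and every panel is stored in column-major order. If the number of rows of a block is not a multiple of \(M_r\), the last panel is padded with zeros.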

Assume that \(p\) is a buffer for \(M_c \cdot K_c\) elements and \(X\) is a matrix block of dimension \(m \times k\) with \(m \leq M_c\) and \(k \leq K_c\). Then the following algorithm (which uses zero-based indices) can be used for packing:

dgepack_A(X, p)

  • Input:

    • matrix \(X = \left(x_{i,l}\right)\) with dimension \(m \times k\) (assume \(m \leq M_c\), \(k \leq K_c\))

    • array \(p\) with length \(M_c \cdot K_c\)

  • On Return:

    • \(p\) contains \(X\) packed in horizontal column-major panels with \(M_r\) rows (rows beyond \(m\) in the last panel are set to zero)

  • for \(l\) with \(0 \leq l < k\)

    • for \(i_1\) with \(0 \leq i_1 < \left\lceil \frac{m}{M_r} \right\rceil\)

      • for \(i_0\) with \(0 \leq i_0 < M_r\)

        • \(i \leftarrow i_1 \cdot M_r + i_0\)

        • \(\nu \leftarrow i_1 \cdot M_r \cdot k + l \cdot M_r + i_0\)

        • if \(i < m\)

          • \(p_\nu \leftarrow x_{i,l}\)

        • else

          • \(p_\nu \leftarrow 0\)
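
The algorithm above can be sketched in C roughly as follows. This is only an illustration under some assumptions that are not part of the text: the panel height is fixed by a hypothetical macro `MR` (set to 4 here), the block \(X\) is accessed through hypothetical row and column strides `incRowX` and `incColX`, and the dimensions \(m\) and \(k\) are passed explicitly. Note that the loops write \(\left\lceil \frac{m}{M_r} \right\rceil \cdot M_r \cdot k\) elements, so a buffer of length \(M_c \cdot K_c\) is sufficient provided \(M_c\) is a multiple of \(M_r\).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical panel height; the text only calls it M_r. */
#define MR 4

/*
 * Pack the m x k block X into p as horizontal panels with MR rows each,
 * every panel stored column by column. Element x_{i,l} is assumed to be
 * located at X[i*incRowX + l*incColX].
 */
static void
dgepack_A(size_t m, size_t k,
          const double *X, ptrdiff_t incRowX, ptrdiff_t incColX,
          double *p)
{
    assert(X != NULL && p != NULL);

    size_t mp = (m + MR - 1) / MR;          /* number of panels: ceil(m / MR) */

    for (size_t l = 0; l < k; ++l) {
        for (size_t i1 = 0; i1 < mp; ++i1) {
            for (size_t i0 = 0; i0 < MR; ++i0) {
                size_t i  = i1 * MR + i0;              /* row index in X     */
                size_t nu = i1 * MR * k + l * MR + i0; /* position in buffer */

                if (i < m) {
                    p[nu] = X[(ptrdiff_t)i * incRowX + (ptrdiff_t)l * incColX];
                } else {
                    p[nu] = 0.0;                       /* zero padding       */
                }
            }
        }
    }
}
```

For example, with `MR` equal to 4 and a row-major \(3 \times 2\) block holding the values 1 to 6 (so `incRowX = 2`, `incColX = 1`), the call `dgepack_A(3, 2, X, 2, 1, p)` fills `p` with 1, 3, 5, 0, 2, 4, 6, 0: a single panel whose two columns are each padded with one zero. The loop order follows the algorithm literally; a tuned implementation would probably iterate panel by panel so that the buffer is written contiguously.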

Exercise