Packing Blocks of B

In the cache-optimized GEMM operation \(\beta C + \alpha A B \to C\) the matrix \(B\) gets partitioned into blocks of maximal dimension \(K_c \times N_c\). Each block \(B_{l,j}\) of \(B\) gets packed into row-major vertical panels with \(N_r\) columns.
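
As a small illustration of this layout, consider a block \(X\) with toy dimensions \(k = 2\), \(n = 3\) and a panel width \(N_r = 2\). These sizes are assumed here only for the example and are far smaller than realistic values. The first panel holds columns 0 and 1 row by row. The second panel holds column 2 and is padded with zeros up to \(N_r\) columns:

\[
X = \begin{pmatrix} x_{0,0} & x_{0,1} & x_{0,2} \\ x_{1,0} & x_{1,1} & x_{1,2} \end{pmatrix}
\quad\longrightarrow\quad
p = \bigl(\underbrace{x_{0,0},\, x_{0,1},\, x_{1,0},\, x_{1,1}}_{\text{panel } 0},\;
          \underbrace{x_{0,2},\, 0,\, x_{1,2},\, 0}_{\text{panel } 1}\bigr)
\]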

Exercise

Assume that \(p\) is a buffer for \(K_c \cdot N_c\) elements and \(X\) a matrix block with dimension \(k \times n\) where \(k \leq K_c\) and \(n \leq N_c\).

For your own benefit, derive the following algorithm (or an equivalent one) for packing blocks of \(B\) yourself:

dgepack_B(X, p)

  • Input:

    • matrix \(X = \left(x_{l,j}\right)\) with dimension \(k \times n\) (assume \(k \leq K_c\), \(n \leq N_c\))

    • array \(p\) with length \(K_c \cdot N_c\)

  • On Return:

    • \(p\) contains \(X\) packed in vertical row-major panels with \(N_r\) columns

  • for \(j_1\) with \(0 \leq j_1 < \left\lceil \frac{n}{N_r} \right\rceil\)

    • for \(j_0\) with \(0 \leq j_0 < N_r\)

      • for \(l\) with \(0 \leq l < k\)

        • \(j \leftarrow j_1 \cdot N_r + j_0\)

        • \(\nu \leftarrow j_1 \cdot N_r \cdot k + l \cdot N_r + j_0\)

        • if \(j < n\)

          • \(p_\nu \leftarrow x_{l,j}\)

        • else

          • \(p_\nu \leftarrow 0\)
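
The algorithm above can be sketched in C roughly as follows. This is a minimal sketch, not a reference implementation. The parameter list is an assumption made for illustration: the block is addressed through row and column increments `incRowX` and `incColX`, and the panel width is passed as a run-time argument `NR` rather than fixed at compile time. The buffer `p` is assumed to hold at least \(\left\lceil n/N_r \right\rceil \cdot N_r \cdot k\) elements.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Pack the k x n block X into p as row-major panels with NR columns each.
 * Element x_{l,j} is read from X[l*incRowX + j*incColX] (assumed layout).
 * Columns beyond n in the last panel are filled with zeros.
 */
void
dgepack_B(size_t k, size_t n,
          const double *X, ptrdiff_t incRowX, ptrdiff_t incColX,
          size_t NR, double *p)
{
    assert(NR > 0);

    size_t numPanels = (n + NR - 1) / NR;   /* ceil(n / NR) */

    for (size_t j1 = 0; j1 < numPanels; ++j1) {
        for (size_t j0 = 0; j0 < NR; ++j0) {
            for (size_t l = 0; l < k; ++l) {
                size_t j  = j1 * NR + j0;              /* column index in X */
                size_t nu = j1 * NR * k + l * NR + j0; /* index in buffer p */

                if (j < n) {
                    p[nu] = X[(ptrdiff_t)l * incRowX + (ptrdiff_t)j * incColX];
                } else {
                    p[nu] = 0.0;                       /* zero padding */
                }
            }
        }
    }
}
```

For a row-major \(2 \times 3\) block with entries 1, 2, 3 in the first row and 4, 5, 6 in the second, the call `dgepack_B(2, 3, X, 3, 1, 2, p)` fills `p` with 1, 2, 4, 5, 3, 0, 6, 0.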

Exercise