GEMM Macro Kernel

The GEMM macro kernel computes the GEMM operation $C \leftarrow \beta C + \alpha A B$ using packed blocks of $A$ and $B$, which are multiplied with the GEMM micro kernel. The matrix dimensions are as follows:

$A$ is an $m \times k$ matrix with $m < M_c$ and $k < K_c$.
$B$ is a $k \times n$ matrix with $n < N_c$.
$C$ is an $m \times n$ matrix.

Exercise: Test Framework

Parameters $M_c$, $N_c$, $K_c$ as well as $M_r$, $N_r$ are defined through macros. We require that $M_r$ divides $M_c$ and $N_r$ divides $N_c$. Apart from these restrictions you can choose arbitrary values. However, it is usually a good idea to use pairwise different values. (Why?)
In main, the following test case should be set up:
- Allocate an $m \times k$ matrix $A$ with $m < M_c$ and $k < K_c$. The strict inequality is chosen on purpose! (Why?)
- Allocate a $k \times n$ matrix $B$ with $n < N_c$.
- Also make sure that $m$, $n$ and $k$ are pairwise different.
- Allocate two $m \times n$ matrices $C_0$ and $C_1$.
- Initialize all matrices. $C_0$ and $C_1$ should be initialized with identical values.
- Print matrices $A$, $B$ and $C_0$.
- For fixed values of $\alpha$ and $\beta$, compute $C_0 \leftarrow \beta C_0 + \alpha A B$ with a reference implementation for GEMM.
- Print $C_0$.
- Print $C_1$.

Note: Print the name of the matrix before you print its value.
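
The building blocks of such a test framework could be sketched as follows. This is only a minimal sketch under assumptions: the macro values are example choices that satisfy the restrictions above, and the function names initGeMatrix, printGeMatrix and dgemm_ref as well as their signatures are hypothetical, not prescribed by the exercise.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>
#include <stdio.h>

/* Example values only: pairwise different, MR divides MC, NR divides NC. */
#define MC 8
#define NC 12
#define KC 10
#define MR 4
#define NR 6

static_assert(MC % MR == 0, "MR must divide MC");
static_assert(NC % NR == 0, "NR must divide NC");

/* Fill an m x n matrix with 1, 2, 3, ... in row-wise order. */
void
initGeMatrix(size_t m, size_t n,
             double *A, ptrdiff_t incRowA, ptrdiff_t incColA)
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            A[(ptrdiff_t)i*incRowA + (ptrdiff_t)j*incColA]
                = (double)(i*n + j + 1);
        }
    }
}

/* Print the name of the matrix first, then its values row by row. */
void
printGeMatrix(const char *name, size_t m, size_t n,
              const double *A, ptrdiff_t incRowA, ptrdiff_t incColA)
{
    printf("%s =\n", name);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            printf(" %9.2f", A[(ptrdiff_t)i*incRowA + (ptrdiff_t)j*incColA]);
        }
        printf("\n");
    }
    printf("\n");
}

/* Reference GEMM with a plain triple loop: C <- beta*C + alpha*A*B. */
void
dgemm_ref(size_t m, size_t n, size_t k, double alpha,
          const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
          const double *B, ptrdiff_t incRowB, ptrdiff_t incColB,
          double beta,
          double *C, ptrdiff_t incRowC, ptrdiff_t incColC)
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t l = 0; l < k; ++l) {
                sum += A[(ptrdiff_t)i*incRowA + (ptrdiff_t)l*incColA]
                     * B[(ptrdiff_t)l*incRowB + (ptrdiff_t)j*incColB];
            }
            double *c = &C[(ptrdiff_t)i*incRowC + (ptrdiff_t)j*incColC];
            /* For beta == 0 the old value of C is overwritten, not scaled. */
            *c = ((beta == 0.0) ? 0.0 : beta * *c) + alpha * sum;
        }
    }
}
```

In main, one would allocate $A$, $B$, $C_0$ and $C_1$ with malloc, initialize them, and then call printGeMatrix and dgemm_ref in the order listed above.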

Exercise: Add a call of the GEMM Macro Kernel

The macro kernel assumes that it receives already packed blocks $A$ and $B$. The signature for the macro kernel is given as

void
dgemm_macro(size_t m, size_t n, size_t k, double alpha,
            const double *A, const double *B,
            double beta,
            double *C, ptrdiff_t incRowC, ptrdiff_t incColC);

Why are there no row and column increments for $A$ and $B$?
In main, add a call to the macro kernel as follows:
- Allocate buffers $A_p$ and $B_p$ of length $M_c \cdot K_c$ and $K_c \cdot N_c$, respectively.
- Pack $A$ and $B$ into these buffers.
- Call the macro kernel.
- Deallocate the buffers.

Do not continue until this compiles and runs without crashing.
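
The packing step could look roughly like the sketch below. It assumes a particular packed layout that is not spelled out in this section: $A$ is stored as a sequence of zero-padded $M_r \times k$ panels, each stored column by column, and $B$ as a sequence of zero-padded $k \times N_r$ panels, each stored row by row. The names pack_A and pack_B, their signatures and the macro values are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Example values only; MR divides MC and NR divides NC. */
#define MC 8
#define NC 12
#define KC 10
#define MR 4
#define NR 6

/* Pack an m x k block of A into Ap (length MC*KC).
   Assumed layout: panels of MR rows, each panel stored column by column,
   missing rows of the last panel filled with zeros. */
void
pack_A(size_t m, size_t k,
       const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
       double *Ap)
{
    assert(m <= MC && k <= KC);
    size_t mp = (m + MR - 1) / MR;          /* number of row panels */
    for (size_t ip = 0; ip < mp; ++ip) {
        for (size_t l = 0; l < k; ++l) {
            for (size_t ir = 0; ir < MR; ++ir) {
                size_t i = ip*MR + ir;
                Ap[ip*MR*k + l*MR + ir]
                    = (i < m)
                    ? A[(ptrdiff_t)i*incRowA + (ptrdiff_t)l*incColA]
                    : 0.0;
            }
        }
    }
}

/* Pack a k x n block of B into Bp (length KC*NC).
   Assumed layout: panels of NR columns, each panel stored row by row,
   missing columns of the last panel filled with zeros. */
void
pack_B(size_t k, size_t n,
       const double *B, ptrdiff_t incRowB, ptrdiff_t incColB,
       double *Bp)
{
    assert(k <= KC && n <= NC);
    size_t np = (n + NR - 1) / NR;          /* number of column panels */
    for (size_t jp = 0; jp < np; ++jp) {
        for (size_t l = 0; l < k; ++l) {
            for (size_t jr = 0; jr < NR; ++jr) {
                size_t j = jp*NR + jr;
                Bp[jp*NR*k + l*NR + jr]
                    = (j < n)
                    ? B[(ptrdiff_t)l*incRowB + (ptrdiff_t)j*incColB]
                    : 0.0;
            }
        }
    }
}
```

With helpers like these, main would allocate Ap and Bp with malloc, call pack_A and pack_B, pass Ap and Bp together with $C_1$ to dgemm_macro, and finally release both buffers with free.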

Exercise: Implement the macro kernel

Implement the macro kernel:

$A$ gets partitioned into horizontal panels of dimension $m_r \times k$ with $m_r \leq M_r$.
$B$ gets partitioned into vertical panels of dimension $k \times n_r$ with $n_r \leq N_r$.
Panels of $A$ are multiplied with panels of $B$ using the micro kernel.
Note that the micro kernel requires panels of $A$ and $B$ with dimensions $M_r \times k$ and $k \times N_r$, respectively!

In case of $m_r < M_r$ or $n_r < N_r$, recall that the buffers contain panels that were extended through zero padding to dimensions $M_r \times k$ and $k \times N_r$, respectively.

So in this case use the following workaround:
- Compute $AB \leftarrow \alpha A_i B_j$ with the micro kernel, where $A_i$ and $B_j$ denote panels and $AB$ is a temporary $M_r \times N_r$ buffer.
- Compute $C_{i,j} \leftarrow \beta C_{i,j}$ (GESCAL), where $C_{i,j}$ denotes the corresponding block of $C$.
- Compute $C_{i,j} \leftarrow C_{i,j} + AB|_{\text{dim}(C_{i,j})}$ (GEAXPY), where $AB|_{\text{dim}(C_{i,j})}$ denotes the upper-left part of $AB$ that is relevant for updating $C_{i,j}$.
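
One possible shape of the macro kernel, including the workaround for partial blocks, is sketched below. It is a sketch under assumptions rather than a reference solution: the micro kernel signature and its plain C body are stand-ins, the packed layout is assumed to consist of column-wise stored $M_r \times k$ panels of $A$ and row-wise stored $k \times N_r$ panels of $B$, and the macro values are example choices.

```c
#include <assert.h>
#include <stddef.h>

/* Example values only; MR divides MC and NR divides NC. */
#define MC 8
#define NC 12
#define KC 10
#define MR 4
#define NR 6

/* Plain C stand-in for the micro kernel (assumed signature):
   C <- beta*C + alpha*A*B for a packed MR x k panel A and k x NR panel B. */
static void
dgemm_micro(size_t k, double alpha, const double *A, const double *B,
            double beta, double *C, ptrdiff_t incRowC, ptrdiff_t incColC)
{
    double AB[MR*NR] = {0.0};
    for (size_t l = 0; l < k; ++l)
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                AB[i*NR + j] += A[l*MR + i] * B[l*NR + j];
    for (size_t i = 0; i < MR; ++i) {
        for (size_t j = 0; j < NR; ++j) {
            double *c = &C[(ptrdiff_t)i*incRowC + (ptrdiff_t)j*incColC];
            *c = ((beta == 0.0) ? 0.0 : beta * *c) + alpha * AB[i*NR + j];
        }
    }
}

void
dgemm_macro(size_t m, size_t n, size_t k, double alpha,
            const double *A, const double *B,
            double beta,
            double *C, ptrdiff_t incRowC, ptrdiff_t incColC)
{
    assert(m <= MC && n <= NC && k <= KC);
    size_t mp = (m + MR - 1) / MR;   /* number of panels of A */
    size_t np = (n + NR - 1) / NR;   /* number of panels of B */

    for (size_t j = 0; j < np; ++j) {
        size_t nr = (j + 1 < np || n % NR == 0) ? NR : n % NR;
        for (size_t i = 0; i < mp; ++i) {
            size_t mr = (i + 1 < mp || m % MR == 0) ? MR : m % MR;
            const double *Ai = &A[i*MR*k];
            const double *Bj = &B[j*NR*k];
            double *Cij = &C[(ptrdiff_t)(i*MR)*incRowC
                           + (ptrdiff_t)(j*NR)*incColC];

            if (mr == MR && nr == NR) {
                /* Full block: the micro kernel can update C directly. */
                dgemm_micro(k, alpha, Ai, Bj, beta, Cij, incRowC, incColC);
            } else {
                /* Step 1: AB <- alpha*Ai*Bj in a row-major MR x NR buffer. */
                double AB[MR*NR];
                dgemm_micro(k, alpha, Ai, Bj, 0.0, AB, NR, 1);
                for (size_t ir = 0; ir < mr; ++ir) {
                    for (size_t jr = 0; jr < nr; ++jr) {
                        double *c = &Cij[(ptrdiff_t)ir*incRowC
                                       + (ptrdiff_t)jr*incColC];
                        /* Step 2: scale the block of C by beta. */
                        *c = (beta == 0.0) ? 0.0 : beta * *c;
                        /* Step 3: add the relevant upper-left part of AB. */
                        *c += AB[ir*NR + jr];
                    }
                }
            }
        }
    }
}
```

The temporary buffer keeps the micro kernel from writing outside of $C$: only the $m_r \times n_r$ upper-left part of $AB$ is ever copied into $C_{i,j}$.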