=================
GEMM Macro Kernel
=================

The GEMM Macro Kernel computes the GEMM operation $C \leftarrow \beta C + \alpha A B$ based
on packed blocks of $A$ and $B$, which are multiplied using the GEMM micro kernel.  The
matrix dimensions are as follows:

- $A$ is a $m \times k$ matrix with $m < M_c$ and $k < K_c$.
- $B$ is a $k \times n$ matrix with $n < N_c$.
- $C$ is a $m \times n$ matrix.

Exercise: Test Framework
========================

- Parameters $M_c$, $N_c$, $K_c$ as well as $M_r$, $N_r$ are defined through macros.  Here
  we require that $M_r$ divides $M_c$ and $N_r$ divides $N_c$.  Apart from these restrictions
  you can choose arbitrary values.  However, it is usually a good idea to use pairwise
  different values. (Why?)
- In `main` the following test case should be set up:
    - Allocate a $m \times k$ matrix $A$ with $m < M_c$ and $k < K_c$.  The strict
      inequality is chosen on purpose! (Why?)
    - Allocate a $k \times n$ matrix $B$ with $n < N_c$.
    - Also make sure that $m$, $n$ and $k$ are pairwise different.
    - Allocate two $m \times n$ matrices $C_0$ and $C_1$.
    - Initialize all matrices.  $C_0$ and $C_1$ should be initialized with equal values.
    - Print matrices $A$, $B$ and $C_0$.
    - For some fixed values of $\alpha$ and $\beta$ compute $C_0 \leftarrow \beta C_0 + \alpha A B$
      with a reference implementation for GEMM.
    - Print $C_0$.
    - Print $C_1$.

Note: Print the name of the matrix before you print its value.
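
The building blocks of such a test framework could look roughly like the following sketch.
The macro values and the names `print_matrix` and `dgemm_ref` are assumptions made for
illustration only; any values that satisfy the divisibility requirements and any correct
reference GEMM will do.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical block sizes: pairwise different, MR divides MC, NR divides NC. */
#define MC 8
#define NC 12
#define KC 5
#define MR 4
#define NR 6

#if (MC % MR != 0) || (NC % NR != 0)
#error "MR must divide MC and NR must divide NC"
#endif

/* Print the name of a matrix, then its m x n entries row by row. */
void
print_matrix(const char *name, size_t m, size_t n,
             const double *X, ptrdiff_t incRowX, ptrdiff_t incColX)
{
    printf("%s =\n", name);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            printf(" %9.3f", X[(ptrdiff_t)i*incRowX + (ptrdiff_t)j*incColX]);
        }
        printf("\n");
    }
    printf("\n");
}

/* Reference GEMM: C <- beta*C + alpha*A*B using three plain nested loops. */
void
dgemm_ref(size_t m, size_t n, size_t k, double alpha,
          const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
          const double *B, ptrdiff_t incRowB, ptrdiff_t incColB,
          double beta,
          double *C, ptrdiff_t incRowC, ptrdiff_t incColC)
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double ab = 0.0;
            for (size_t l = 0; l < k; ++l) {
                ab += A[(ptrdiff_t)i*incRowA + (ptrdiff_t)l*incColA]
                    * B[(ptrdiff_t)l*incRowB + (ptrdiff_t)j*incColB];
            }
            double *c = &C[(ptrdiff_t)i*incRowC + (ptrdiff_t)j*incColC];
            /* For beta == 0 the old value of C is never read. */
            *c = (beta == 0.0) ? alpha*ab : beta*(*c) + alpha*ab;
        }
    }
}
```

In `main` one would then choose, for example, $m = 7$, $n = 11$ and $k = 4$ (all strictly
smaller than the corresponding block size and pairwise different), call `dgemm_ref` on
$C_0$ and print each matrix with its name via `print_matrix`.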


Exercise: Add a call of the GEMM Macro Kernel
=============================================
The macro kernel assumes that it receives already packed blocks $A$ and $B$.
The signature for the macro kernel is given as

---- CODE(type=c) --------------------------------------------------------------
void
dgemm_macro(size_t m, size_t n, size_t k, double alpha,
            const double *A, const double *B,
            double beta,
            double *C, ptrdiff_t incRowC, ptrdiff_t incColC);
--------------------------------------------------------------------------------

- Why are there no row and column increments for $A$ and $B$?
- In `main`, add a call of the macro kernel as follows:
   - Allocate buffers $A_p$ and $B_p$ of length $M_c \cdot K_c$ and $K_c \cdot N_c$
     respectively.
   - Pack $A$ and $B$ into these buffers.
   - Call the macro kernel.
   - Deallocate the buffers.

Do not continue until this compiles and runs without crashing.
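
As a reminder of what the packed buffers are expected to contain, here is a minimal sketch
of the two packing routines.  It assumes the usual panel layout: $A$ is stored as
horizontal panels of $M_r$ rows, each panel column by column, and $B$ as vertical panels of
$N_r$ columns, each panel row by row, with incomplete panels padded by zeros.  The names
`pack_A` and `pack_B` and the macro values are hypothetical; use the packing functions
from your own earlier work if they differ.

```c
#include <stddef.h>

/* Hypothetical micro kernel dimensions. */
#define MR 4
#define NR 6

/* Pack the m x k matrix A into p as panels of MR rows.  Inside a panel the
   MR entries of one column are contiguous.  Missing rows become zeros. */
void
pack_A(size_t m, size_t k,
       const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
       double *p)
{
    for (size_t i0 = 0; i0 < m; i0 += MR) {
        for (size_t l = 0; l < k; ++l) {
            for (size_t i = 0; i < MR; ++i) {
                *p++ = (i0 + i < m)
                     ? A[(ptrdiff_t)(i0 + i)*incRowA + (ptrdiff_t)l*incColA]
                     : 0.0;
            }
        }
    }
}

/* Pack the k x n matrix B into p as panels of NR columns.  Inside a panel the
   NR entries of one row are contiguous.  Missing columns become zeros. */
void
pack_B(size_t k, size_t n,
       const double *B, ptrdiff_t incRowB, ptrdiff_t incColB,
       double *p)
{
    for (size_t j0 = 0; j0 < n; j0 += NR) {
        for (size_t l = 0; l < k; ++l) {
            for (size_t j = 0; j < NR; ++j) {
                *p++ = (j0 + j < n)
                     ? B[(ptrdiff_t)l*incRowB + (ptrdiff_t)(j0 + j)*incColB]
                     : 0.0;
            }
        }
    }
}
```

Because $M_r$ divides $M_c$ and $m < M_c$, the padded panels of $A$ never need more than
$M_c \cdot K_c$ elements; the same argument applies to $B$ and $K_c \cdot N_c$.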

Exercise: Implement the macro kernel
====================================

Implement the macro kernel:

- $A$ gets partitioned into horizontal panels of dimension $m_r \times k$ with $m_r \leq M_r$.
- $B$ gets partitioned into vertical panels of dimension $k \times n_r$ with $n_r \leq N_r$.
- Panels of $A$ are multiplied with panels of $B$ using the micro kernel.
- Note that the micro kernel requires the panels of $A$ and $B$ to have dimensions
  $M_r \times k$ and $k \times N_r$ respectively!

  In case of $m_r < M_r$ or $n_r < N_r$ recall that the buffers contain panels that were
  extended through zero padding to dimensions $M_r \times k$ and $k \times N_r$
  respectively.

  So in this case use the following workaround:

  - Compute $AB \leftarrow \alpha A_i B_j$ where $A_i$ and $B_j$ denote panels.
  - Scale $C_{i,j} \leftarrow \beta C_{i,j}$ where $C_{i,j}$ denotes the
    corresponding block of $C$.
  - Compute $C_{i,j} \leftarrow C_{i,j} + AB|_{\text{dim}{C_{i,j}}}$ (GEAXPY) where
    $AB|_{\text{dim}{C_{i,j}}}$ denotes the upper-left part of $AB$ that is relevant for
    updating $C_{i,j}$.
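
The loop structure and the workaround for incomplete blocks can be sketched as follows.
This is only one possible arrangement, not a reference solution.  The function
`dgemm_micro` is a plain C stand-in with an assumed signature; replace it with your own
micro kernel.  The sketch also assumes that the packed buffers store each panel of $A$
column by column and each panel of $B$ row by row, with zero padding.

```c
#include <stddef.h>

/* Hypothetical micro kernel dimensions. */
#define MR 4
#define NR 6

/* Stand-in micro kernel (assumed signature): C <- beta*C + alpha*A*B for a
   packed MR x k panel A and a packed k x NR panel B. */
void
dgemm_micro(size_t k, double alpha,
            const double *A, const double *B,
            double beta,
            double *C, ptrdiff_t incRowC, ptrdiff_t incColC)
{
    double AB[MR*NR] = {0};

    for (size_t l = 0; l < k; ++l) {
        for (size_t i = 0; i < MR; ++i) {
            for (size_t j = 0; j < NR; ++j) {
                AB[i*NR + j] += A[l*MR + i] * B[l*NR + j];
            }
        }
    }
    for (size_t i = 0; i < MR; ++i) {
        for (size_t j = 0; j < NR; ++j) {
            double *c = &C[(ptrdiff_t)i*incRowC + (ptrdiff_t)j*incColC];
            /* For beta == 0 the old value of C is never read. */
            *c = (beta == 0.0) ? alpha*AB[i*NR + j]
                               : beta*(*c) + alpha*AB[i*NR + j];
        }
    }
}

void
dgemm_macro(size_t m, size_t n, size_t k, double alpha,
            const double *A, const double *B,
            double beta,
            double *C, ptrdiff_t incRowC, ptrdiff_t incColC)
{
    double AB[MR*NR];                        /* row-major scratch block */

    for (size_t j0 = 0; j0 < n; j0 += NR) {
        size_t nr = (n - j0 < NR) ? n - j0 : NR;
        const double *Bj = &B[j0*k];         /* panel j0/NR has NR*k elements */

        for (size_t i0 = 0; i0 < m; i0 += MR) {
            size_t mr = (m - i0 < MR) ? m - i0 : MR;
            const double *Ai = &A[i0*k];     /* panel i0/MR has MR*k elements */
            double *Cij = &C[(ptrdiff_t)i0*incRowC + (ptrdiff_t)j0*incColC];

            if (mr == MR && nr == NR) {
                /* Complete block: the micro kernel updates C directly. */
                dgemm_micro(k, alpha, Ai, Bj, beta, Cij, incRowC, incColC);
                continue;
            }
            /* Incomplete block: AB <- alpha*Ai*Bj into the scratch block ... */
            dgemm_micro(k, alpha, Ai, Bj, 0.0, AB, NR, 1);
            /* ... then scale Cij and add the upper-left mr x nr part of AB. */
            for (size_t i = 0; i < mr; ++i) {
                for (size_t j = 0; j < nr; ++j) {
                    double *c = &Cij[(ptrdiff_t)i*incRowC + (ptrdiff_t)j*incColC];
                    *c = (beta == 0.0) ? 0.0 : beta*(*c);
                    *c += AB[i*NR + j];
                }
            }
        }
    }
}
```

The scratch block keeps the micro kernel from writing outside of $C$: the padded rows and
columns only ever land in `AB`, and just the relevant $m_r \times n_r$ part is copied over.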


:navigate: up    -> doc:index
           back  -> doc:session04/page16
           next  -> doc:session04/page18