=================
GEMM Micro Kernel                                                       [TOC]
=================

The GEMM micro kernel performs the GEMM operation

---- LATEX ---------------------------------------------------------------------
C \leftarrow \beta \cdot C + \alpha \cdot A \cdot B
--------------------------------------------------------------------------------

for the following special case:

- The dimensions of the matrices are as follows:

  - $C$ is a $M_r \times N_r$ matrix.
  - $A$ is a $M_r \times k$ matrix.
  - $B$ is a $k \times N_r$ matrix.

- Storage of $A$ and $B$:

  - Matrix $A$ is stored column major with $\text{incRow}_A = 1$ and
    $\text{incCol}_A = M_r$.  That means all elements of $A$ are located in a
    contiguous memory block of $M_r \cdot k$ elements.

  - Matrix $B$ is stored row major with $\text{incRow}_B = N_r$ and
    $\text{incCol}_B = 1$.  That means all elements of $B$ are located in a
    contiguous memory block of $k \cdot N_r$ elements.

- Parameters $M_r$ and $N_r$ are given as *literal constants*.
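
The storage scheme above can be made concrete with two small index helpers.
This is only a minimal sketch: the values `MR = 4` and `NR = 4` as well as the
helper names `elem_A` and `elem_B` are assumptions chosen for illustration and
are not taken from the test program.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed example values; the real kernel fixes these as literal constants. */
#define MR 4
#define NR 4

/* A is MR x k, column major: incRow_A = 1, incCol_A = MR. */
static double
elem_A(const double *A, size_t i, size_t l)
{
    assert(i < MR);
    return A[i + l*MR];
}

/* B is k x NR, row major: incRow_B = NR, incCol_B = 1. */
static double
elem_B(const double *B, size_t l, size_t j)
{
    assert(j < NR);
    return B[l*NR + j];
}
```

For a fixed $l$, the $M_r$ entries of column $l$ of $A$ and the $N_r$ entries
of row $l$ of $B$ are each adjacent in memory, so a kernel can walk through
both buffers with unit stride.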

Notes on requirements for the Algorithm and the Implementation
==============================================================
The performance of the micro kernel is critical for the overall performance of
the GEMM operation.  So unroll loops wherever possible.  In particular, order
nested loops such that the inner loops can be unrolled.  Do not call functions
unless it is guaranteed that the compiler can inline these calls.
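
One loop order that fits these requirements puts the loop over $k$ outermost,
so that the two inner loops have the constant trip counts $M_r$ and $N_r$.  The
following is a sketch under assumptions: `MR = 4`, `NR = 4` and the function
name `accumulate_AB` are hypothetical, and whether the loops are really
unrolled depends on the compiler and its optimization flags.

```c
#include <stddef.h>

/* Assumed example values; the real kernel fixes these as literal constants. */
#define MR 4
#define NR 4

/* Computes the MR x NR product AB into a row-major buffer.
 * A is MR x k, column major; B is k x NR, row major. */
static void
accumulate_AB(size_t k, const double *A, const double *B, double AB[MR*NR])
{
    for (size_t i=0; i<MR*NR; ++i) {
        AB[i] = 0.0;
    }

    /* Outer loop over k: each step adds one rank-1 update to AB. */
    for (size_t l=0; l<k; ++l) {
        /* Inner loops have constant bounds, so they are unroll candidates. */
        for (size_t i=0; i<MR; ++i) {
            for (size_t j=0; j<NR; ++j) {
                AB[i*NR + j] += A[i + l*MR] * B[l*NR + j];
            }
        }
    }
}
```

With this order, each iteration of the outer loop reads $M_r$ consecutive
elements of $A$ and $N_r$ consecutive elements of $B$.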

The implementation should operate as follows:

- First, compute the matrix product $A \cdot B$ and store the result in CPU
  registers.  For the sake of simplicity we assume that the compiler maps local
  variables directly to registers, so storing a result in a local variable
  means storing it in a register and not on the stack.
- Next, scale $C$, i.e. perform the operation $C \leftarrow \beta C$.  In case
  $\beta = 0$ matrix $C$ can contain NaN entries!  So in this case set $C$ to
  zero explicitly instead of multiplying by $\beta$.
- Finally, update the scaled matrix $C$ with the previously computed product
  $AB$.  Apply the scaling of $AB$ within this update step, i.e. perform here
  the operation

  ---- LATEX -------------------------------------------------------------------
  C \leftarrow C + \alpha \cdot (AB)
  ------------------------------------------------------------------------------
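
The scaling and update steps can be sketched as follows, with particular
attention to the case $\beta = 0$.  The function name `update_C`, the
parameters `incRowC` and `incColC` for the storage of $C$, and the values
`MR = 4`, `NR = 4` are assumptions made for this illustration; the interface
expected by the test program may differ.  The product is assumed to be already
available in a row-major array `AB`.

```c
#include <stddef.h>

/* Assumed example values; the real kernel fixes these as literal constants. */
#define MR 4
#define NR 4

static void
update_C(double alpha, const double AB[MR*NR], double beta,
         double *C, ptrdiff_t incRowC, ptrdiff_t incColC)
{
    /* Scale C.  For beta == 0 overwrite with zeros, because 0*NaN is NaN. */
    if (beta == 0.0) {
        for (ptrdiff_t i=0; i<MR; ++i) {
            for (ptrdiff_t j=0; j<NR; ++j) {
                C[i*incRowC + j*incColC] = 0.0;
            }
        }
    } else if (beta != 1.0) {
        for (ptrdiff_t i=0; i<MR; ++i) {
            for (ptrdiff_t j=0; j<NR; ++j) {
                C[i*incRowC + j*incColC] *= beta;
            }
        }
    }

    /* Update: C <- C + alpha*(AB); the scaling by alpha happens here. */
    for (ptrdiff_t i=0; i<MR; ++i) {
        for (ptrdiff_t j=0; j<NR; ++j) {
            C[i*incRowC + j*incColC] += alpha*AB[i*NR + j];
        }
    }
}
```

Multiplying by $\beta = 0$ would leave NaN entries in place, which is why the
sketch overwrites $C$ with zeros in that branch.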


Exercise
========
- Implement the function `dgemm_micro_ref` in the test program below.  Before
  you start coding, familiarize yourself with the test program.


Simple Test Program
===================
:import: session16/simple_test_micro_ex.c