GEMM Macro Kernel
The GEMM macro kernel computes the GEMM operation \(C \leftarrow \beta C + \alpha A B\) by multiplying packed blocks of \(A\) and \(B\) with the GEMM micro kernel. The matrix dimensions are as follows:
-
\(A\) is an \(m \times k\) matrix with \(m < M_c\) and \(k < K_c\).
-
\(B\) is a \(k \times n\) matrix with \(n < N_c\).
-
\(C\) is an \(m \times n\) matrix.
Exercise: Test Framework
-
The parameters \(M_c\), \(N_c\), \(K_c\) as well as \(M_r\), \(N_r\) are defined through macros. We require that \(M_r\) divides \(M_c\) and \(N_r\) divides \(N_c\). Besides these restrictions you can choose arbitrary values. However, it is usually a good idea to use pairwise different values. (Why?)
-
In main the following test case should be set up:
-
Allocate an \(m \times k\) matrix \(A\) with \(m < M_c\) and \(k < K_c\). The strict inequality is chosen on purpose! (Why?)
-
Allocate a \(k \times n\) matrix \(B\) with \(n < N_c\).
-
Also make sure that \(m\), \(n\) and \(k\) are pairwise different.
-
Allocate two \(m \times n\) matrices \(C_0\) and \(C_1\) of identical layout.
-
Initialize all matrices such that \(C_0\) and \(C_1\) are equal.
-
Print matrices \(A\), \(B\) and \(C_0\).
-
For some fixed values of \(\alpha\) and \(\beta\) compute \(C_0 \leftarrow \beta C_0 + \alpha A B\) with a reference implementation for GEMM.
-
Print \(C_0\)
-
Print \(C_1\)
-
Note: Print the name of the matrix before you print its value.
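The building blocks of such a test framework could look like the following minimal sketch. The macro values and the names `initGeMatrix`, `printGeMatrix` and `dgemm_ref` are assumptions made for illustration only; the exercise does not prescribe them. Matrices are addressed through row and column increments, so the same code works for row-major and column-major storage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed example values: pairwise different, MR divides MC, NR divides NC. */
#define DGEMM_MC 8
#define DGEMM_NC 12
#define DGEMM_KC 11
#define DGEMM_MR 4
#define DGEMM_NR 6

/* Fill an m x n matrix with 1, 2, 3, ... in row-wise order. */
void
initGeMatrix(size_t m, size_t n, double *A,
             ptrdiff_t incRowA, ptrdiff_t incColA)
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            A[(ptrdiff_t)i * incRowA + (ptrdiff_t)j * incColA]
                = (double)(i * n + j + 1);
        }
    }
}

/* Print the name of the matrix first, then its entries row by row. */
void
printGeMatrix(const char *name, size_t m, size_t n, const double *A,
              ptrdiff_t incRowA, ptrdiff_t incColA)
{
    printf("%s =\n", name);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            printf("%9.2f ", A[(ptrdiff_t)i * incRowA + (ptrdiff_t)j * incColA]);
        }
        printf("\n");
    }
    printf("\n");
}

/* Reference GEMM with a plain triple loop: C <- beta*C + alpha*A*B. */
void
dgemm_ref(size_t m, size_t n, size_t k, double alpha,
          const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
          const double *B, ptrdiff_t incRowB, ptrdiff_t incColB,
          double beta,
          double *C, ptrdiff_t incRowC, ptrdiff_t incColC)
{
    assert(DGEMM_MC % DGEMM_MR == 0 && DGEMM_NC % DGEMM_NR == 0);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t l = 0; l < k; ++l) {
                sum += A[(ptrdiff_t)i * incRowA + (ptrdiff_t)l * incColA]
                     * B[(ptrdiff_t)l * incRowB + (ptrdiff_t)j * incColB];
            }
            double *c = &C[(ptrdiff_t)i * incRowC + (ptrdiff_t)j * incColC];
            /* For beta == 0 the old value of C is not read at all. */
            *c = (beta == 0.0 ? 0.0 : beta * *c) + alpha * sum;
        }
    }
}
```

In main these helpers would be called in the order listed above: initialize \(A\), \(B\), \(C_0\) and \(C_1\), print them, apply `dgemm_ref` to \(C_0\), and print \(C_0\) and \(C_1\) again.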
Exercise: Add a call of the GEMM Macro Kernel
The macro kernel assumes that it receives already packed blocks \(A\) and \(B\). The signature for the macro kernel is given as
void dgemm_macro(size_t m, size_t n, size_t k, double alpha, const double *A, const double *B, double beta, double *C, ptrdiff_t incRowC, ptrdiff_t incColC);
-
Why are there no row and column increments for \(A\) and \(B\)?
-
Add in main a call of the macro kernel as follows:
-
Allocate buffers \(A_p\) and \(B_p\) of length \(M_c \cdot K_c\) and \(K_c \cdot N_c\) respectively.
-
Pack \(A\) and \(B\) into these buffers.
-
Call the macro kernel.
-
Deallocate the buffers.
-
Do not continue until this compiles and runs without crashing.
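The steps above can be sketched as follows. The packing format is an assumption that this section does not spell out: \(A\) is stored as horizontal panels of \(M_r\) rows, each panel column by column, and \(B\) as vertical panels of \(N_r\) columns, each panel row by row, with zero padding for incomplete panels. The names `pack_A`, `pack_B` and `call_macro` as well as the macro values are hypothetical. The macro kernel itself is only an empty stub here, which is enough to check that everything compiles and runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed example values (MR divides MC, NR divides NC). */
#define DGEMM_MC 8
#define DGEMM_NC 12
#define DGEMM_KC 11
#define DGEMM_MR 4
#define DGEMM_NR 6

/* Pack A (m x k) into zero-padded panels of MR rows, stored column by column. */
void
pack_A(size_t m, size_t k, const double *A,
       ptrdiff_t incRowA, ptrdiff_t incColA, double *p)
{
    for (size_t i0 = 0; i0 < m; i0 += DGEMM_MR) {
        for (size_t l = 0; l < k; ++l) {
            for (size_t i = i0; i < i0 + DGEMM_MR; ++i) {
                *p++ = (i < m)
                     ? A[(ptrdiff_t)i * incRowA + (ptrdiff_t)l * incColA]
                     : 0.0;
            }
        }
    }
}

/* Pack B (k x n) into zero-padded panels of NR columns, stored row by row. */
void
pack_B(size_t k, size_t n, const double *B,
       ptrdiff_t incRowB, ptrdiff_t incColB, double *p)
{
    for (size_t j0 = 0; j0 < n; j0 += DGEMM_NR) {
        for (size_t l = 0; l < k; ++l) {
            for (size_t j = j0; j < j0 + DGEMM_NR; ++j) {
                *p++ = (j < n)
                     ? B[(ptrdiff_t)l * incRowB + (ptrdiff_t)j * incColB]
                     : 0.0;
            }
        }
    }
}

/* Empty stub with the signature from the exercise; it does nothing yet. */
void
dgemm_macro(size_t m, size_t n, size_t k, double alpha,
            const double *A, const double *B, double beta,
            double *C, ptrdiff_t incRowC, ptrdiff_t incColC)
{
    (void)m; (void)n; (void)k; (void)alpha; (void)A; (void)B;
    (void)beta; (void)C; (void)incRowC; (void)incColC;
}

/* Allocate the buffers, pack, call the macro kernel, deallocate.
   Returns 0 on success and -1 if an allocation failed. */
int
call_macro(size_t m, size_t n, size_t k, double alpha,
           const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
           const double *B, ptrdiff_t incRowB, ptrdiff_t incColB,
           double beta, double *C, ptrdiff_t incRowC, ptrdiff_t incColC)
{
    assert(m <= DGEMM_MC && n <= DGEMM_NC && k <= DGEMM_KC);
    double *A_p = malloc(sizeof(double) * DGEMM_MC * DGEMM_KC);
    double *B_p = malloc(sizeof(double) * DGEMM_KC * DGEMM_NC);
    if (!A_p || !B_p) {
        free(A_p);
        free(B_p);
        return -1;
    }
    pack_A(m, k, A, incRowA, incColA, A_p);
    pack_B(k, n, B, incRowB, incColB, B_p);
    dgemm_macro(m, n, k, alpha, A_p, B_p, beta, C, incRowC, incColC);
    free(A_p);
    free(B_p);
    return 0;
}
```

Because \(M_r\) divides \(M_c\) and \(N_r\) divides \(N_c\), the zero-padded panels always fit into buffers of length \(M_c \cdot K_c\) and \(K_c \cdot N_c\).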
Exercise: Implement the macro kernel
Implement the macro kernel:
-
\(A\) gets partitioned into horizontal panels of dimension \(m_r \times k\) with \(m_r \leq M_r\).
-
\(B\) gets partitioned into vertical panels of dimension \(k \times n_r\) with \(n_r \leq N_r\).
-
Panels of \(A\) are multiplied with panels of \(B\) using the micro kernel.
-
Note that the micro kernel requires that the panels of \(A\) and \(B\) have dimensions \(M_r \times k\) and \(k \times N_r\) respectively!
In case of \(m_r < M_r\) or \(n_r < N_r\) recall that the buffers contain panels that were extended through zero padding to dimensions \(M_r \times k\) and \(k \times N_r\) respectively. The micro kernel can therefore still be called, but its \(M_r \times N_r\) result must not be written directly into \(C\).
So in this case use the following workaround:
-
Compute \(AB \leftarrow \alpha A_i B_j\) where \(A_i\) and \(B_j\) denote panels.
-
Compute \(C_{i,j} \leftarrow \beta C_{i,j}\) (GESCAL) where \(C_{i,j}\) denotes the corresponding block of \(C\).
-
Compute \(C_{i,j} \leftarrow C_{i,j} + AB|_{\text{dim}{C_{i,j}}}\) (GEAXPY) where \(AB|_{\text{dim}{C_{i,j}}}\) denotes the upper-left part of \(AB\) that is relevant for updating \(C_{i,j}\).
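One possible way to put these steps together is sketched below. It assumes a micro kernel with the signature `dgemm_micro(k, alpha, A, B, beta, C, incRowC, incColC)` and helper routines `dgescal` and `dgeaxpy`; these names and signatures, the macro values, and the packing format (panels of \(A\) stored column by column, panels of \(B\) row by row, zero padded) are assumptions for illustration. The micro kernel shown is a plain reference version, not an optimized one.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed example values (MR divides MC, NR divides NC). */
#define DGEMM_MC 8
#define DGEMM_NC 12
#define DGEMM_KC 11
#define DGEMM_MR 4
#define DGEMM_NR 6

/* Reference micro kernel (assumed signature): C <- beta*C + alpha*A*B
   for one packed MR x k panel A and one packed k x NR panel B. */
void
dgemm_micro(size_t k, double alpha, const double *A, const double *B,
            double beta, double *C, ptrdiff_t incRowC, ptrdiff_t incColC)
{
    double AB[DGEMM_MR * DGEMM_NR] = {0};
    for (size_t l = 0; l < k; ++l)
        for (size_t i = 0; i < DGEMM_MR; ++i)
            for (size_t j = 0; j < DGEMM_NR; ++j)
                AB[i + j * DGEMM_MR] += A[l * DGEMM_MR + i] * B[l * DGEMM_NR + j];
    for (size_t i = 0; i < DGEMM_MR; ++i)
        for (size_t j = 0; j < DGEMM_NR; ++j) {
            double *c = &C[(ptrdiff_t)i * incRowC + (ptrdiff_t)j * incColC];
            /* For beta == 0 the old value of C is never read. */
            *c = (beta == 0.0 ? 0.0 : beta * *c) + alpha * AB[i + j * DGEMM_MR];
        }
}

/* GESCAL: X <- alpha * X */
void
dgescal(size_t m, size_t n, double alpha,
        double *X, ptrdiff_t incRowX, ptrdiff_t incColX)
{
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) {
            double *x = &X[(ptrdiff_t)i * incRowX + (ptrdiff_t)j * incColX];
            *x = (alpha == 0.0) ? 0.0 : alpha * *x;
        }
}

/* GEAXPY: Y <- Y + alpha * X */
void
dgeaxpy(size_t m, size_t n, double alpha,
        const double *X, ptrdiff_t incRowX, ptrdiff_t incColX,
        double *Y, ptrdiff_t incRowY, ptrdiff_t incColY)
{
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j)
            Y[(ptrdiff_t)i * incRowY + (ptrdiff_t)j * incColY]
                += alpha * X[(ptrdiff_t)i * incRowX + (ptrdiff_t)j * incColX];
}

void
dgemm_macro(size_t m, size_t n, size_t k, double alpha,
            const double *A, const double *B, double beta,
            double *C, ptrdiff_t incRowC, ptrdiff_t incColC)
{
    assert(m <= DGEMM_MC && n <= DGEMM_NC && k <= DGEMM_KC);
    double AB[DGEMM_MR * DGEMM_NR];          /* local buffer for edge blocks */

    for (size_t j0 = 0; j0 < n; j0 += DGEMM_NR) {
        size_t nr = (n - j0 < DGEMM_NR) ? n - j0 : DGEMM_NR;
        const double *Bj = B + j0 * k;       /* each panel holds k*NR values */
        for (size_t i0 = 0; i0 < m; i0 += DGEMM_MR) {
            size_t mr = (m - i0 < DGEMM_MR) ? m - i0 : DGEMM_MR;
            const double *Ai = A + i0 * k;   /* each panel holds MR*k values */
            double *Cij = C + (ptrdiff_t)i0 * incRowC + (ptrdiff_t)j0 * incColC;

            if (mr == DGEMM_MR && nr == DGEMM_NR) {
                dgemm_micro(k, alpha, Ai, Bj, beta, Cij, incRowC, incColC);
            } else {
                /* Workaround: AB <- alpha*Ai*Bj, then scale and update C. */
                dgemm_micro(k, alpha, Ai, Bj, 0.0, AB, 1, DGEMM_MR);
                dgescal(mr, nr, beta, Cij, incRowC, incColC);
                dgeaxpy(mr, nr, 1.0, AB, 1, DGEMM_MR, Cij, incRowC, incColC);
            }
        }
    }
}
```

Only blocks on the lower or right edge of \(C\) take the workaround path; full \(M_r \times N_r\) blocks are updated by the micro kernel directly, which avoids the extra pass over \(C\).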