======================================================= Simple cache optimization for the matrix-matrix product [TOC] ======================================================= We develop an algorithm that performs the multiplication block-wise. The basic idea should be that matrices are considered to be partitioned block-wise: - Matrix $A$ partitioned by $M_C$ and $K_C$ in ----- LATEX --------------------------------------------------------------- A = \left(\begin{array}{c|c|c} \mathbf{A}_{1,1} & \dots & \mathbf{A}_{1,k_b} \\ \hline \vdots & & \vdots \\ \hline \mathbf{A}_{m_b,1} & \dots & \mathbf{A}_{m_b,k_b} \\ \end{array}\right) \quad\text{with}\quad m_b = \lceil m/M_c \rceil,\; k_b = \lceil k/K_c \rceil --------------------------------------------------------------------------- where ----- LATEX --------------------------------------------------------------- \mathbf{A}_{i,j} \in \mathbb{M}^{M_C \times K_C} \quad 1 \leq i < m_b\; 1 \leq j < k_b. --------------------------------------------------------------------------- - Matrix $B$ partitioned by $K_C$ und $N_C$ in ----- LATEX --------------------------------------------------------------- B = \left(\begin{array}{c|c|c} \mathbf{B}_{1,1} & \dots & \mathbf{B}_{1, n_b} \\ \hline \vdots & & \vdots \\ \hline \mathbf{B}_{k_b,1} & \dots & \mathbf{B}_{k_b,n_b} \\ \end{array}\right) \quad\text{with}\quad k_b = \lceil k/K_c \rceil,\; n_b = \lceil n/N_c \rceil. --------------------------------------------------------------------------- and ----- LATEX --------------------------------------------------------------- \mathbf{B}_{i,j} \in \mathbb{M}^{K_C \times N_C} \quad 1 \leq i < k_b\; 1 \leq j < n_b\; --------------------------------------------------------------------------- - Matrix $C$ partitioned by $M_C$ und $N_C$ in ----- LATEX --------------------------------------------------------------- C = \left(\begin{array}{c|c|c} \mathbf{C}_{1,1} & \dots & \mathbf{C}_{1, n_b} \\ \hline \vdots & & \vdots \\ \hline \mathbf{C}_{m_b,1} & \dots & \mathbf{C}_{m_b,n_b} \\ \end{array}\right) \quad\text{with}\quad m_b = \lceil m/M_c \rceil,\; n_b = \lceil n/N_c \rceil. --------------------------------------------------------------------------- and ----- LATEX --------------------------------------------------------------- \mathbf{C}_{i,j} \in \mathbb{M}^{M_C \times N_C} \quad 1 \leq i < m_b\; 1 \leq j < n_b\; --------------------------------------------------------------------------- Instead of multiplying element-wise, we first copy a block into a buffer. Then we multiply these buffered blocks. E.g.: ---- BOX -------------------------------------------------- = $C \leftarrow \beta\, C$ = $\text{For}\; j=1,\dots,n_b$ = $\text{For}\; l=1,\dots,k_b$ = $\overline{B} \leftarrow \mathbf{B}_{l,j}$ = $\text{For}\; i=1,\dots,m_b$ = $\overline{A} \leftarrow \mathbf{A}_{i,l}$ = $\overline{C} \leftarrow \alpha\,\overline{A} \, \overline{B}$ = $\mathbf{C}_{i,j} \leftarrow \mathbf{C}_{i,j} + \overline{C}$ ----------------------------------------------------------- Hints for the exercise below ============================ - Rewrite the blocked representations of matrices $A$, $B$ and $C$ with indices starting at zero. - Figure out a simple formula that (depending on its indices) gives the actual dimension of a block. Something like: ---- LATEX ------------------------------------------------------------------- A_{i,j}\in\mathbb{M}^{M \times K} ------------------------------------------------------------------------------ with ---- LATEX ------------------------------------------------------------------- M = \begin{cases} \dots \text{TODO} \dots, & \text{Condition depending on $i$,}\\ \dots \text{TODO} \dots, & \text{else,} \end{cases} ------------------------------------------------------------------------------ and ---- LATEX ------------------------------------------------------------------- K = \begin{cases} \dots \text{TODO} \dots, & \text{Condition depending on $j$,}\\ \dots \text{TODO} \dots, & \text{else.} \end{cases} ------------------------------------------------------------------------------ - Recall that for $\beta = 0$ the specification of GEMM allows that $C$ contains entries that are NaN (Not a Number). The same is the case for $A$ and $B$ if $\alpha = 0$. - For performance, also consider cases with $\alpha=0$ or $k = 0$. Exercise ======== - Implement the buffered GEMM operation. Use dynamically allocated memory for the buffers. Macros for block dimensions are already defined through macros. - Also output MFLOPS achieved by `gemm_colmajor` and `gemm_buf` (the GEMM operation requires $2 \cdot m \cdot n \cdot k$ floating point operations). - Try different block sizes. Change the block dimension with the `-D` option when you compile your code! - Take the above hints serious. :import: session05/solution/gemm_ex.c :navigate: up -> doc:index back -> doc:session05/page04 next -> doc:session05/page06