===================================
Pure ANSI C Implementation of DGEMM                                     [TOC]
===================================

In the following we present the cache-optimized implementation of the
matrix-matrix product.  The function `dgemm_nn` can compute operations of
the form $C \leftarrow \beta C + \alpha A B$.  All matrices can have
*arbitrary row and column strides*.  In particular this means that each
matrix can be stored row- or column-wise.  It further means that we can
also use the function for the computation of
$C \leftarrow \beta C + \alpha A^T B$,
$C \leftarrow \beta C + \alpha A B^T$ and
$C \leftarrow \beta C + \alpha A^T B^T$.
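For example, $C \leftarrow \beta C + \alpha A^T B$ merely requires passing
the row and column stride of $A$ in swapped order.  The following test
program is a minimal sketch of this trick.  It is not part of ulmBLAS: it
assumes the `dgemm_nn` signature from `src/level3/dgemm_nn.c` (imported
further below) and has to be linked against the corresponding object file.

---- CODE (C) ------------------------------------------------------------------
#include <stdio.h>

/*  Signature of dgemm_nn as defined in src/level3/dgemm_nn.c             */
void dgemm_nn(int m, int n, int k,
              double alpha,
              const double *A, int incRowA, int incColA,
              const double *B, int incRowB, int incColB,
              double beta,
              double *C, int incRowC, int incColC);

int
main()
{
/*  A is 3x2, B is 3x4, C is 2x4; all stored row-wise.                    */
    double A[] = { 1, 2,
                   3, 4,
                   5, 6 };
    double B[] = { 1, 0, 0, 1,
                   0, 1, 0, 1,
                   0, 0, 1, 1 };
    double C[2*4] = { 0 };
    int    i, j;

/*  C <- A^T*B:  swapping incRowA and incColA makes dgemm_nn traverse     */
/*  the row-wise stored 3x2 matrix A as the 2x3 matrix A^T.               */
    dgemm_nn(2, 4, 3,
             1.0,
             A, 1, 2,       /* strides of A swapped: reads A^T            */
             B, 4, 1,       /* B stored row-wise                          */
             0.0,
             C, 4, 1);      /* C stored row-wise                          */

    for (i=0; i<2; ++i) {
        for (j=0; j<4; ++j) {
            printf("%6.1f", C[i*4+j]);
        }
        printf("\n");
    }
    return 0;
}
--------------------------------------------------------------------------------

If the strides are correct this prints the $2 \times 4$ matrix $A^T B$,
i.e. the rows $(1, 3, 5, 9)$ and $(2, 4, 6, 12)$.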
Select the demo-pure-c Branch
=============================

Before switching branches do a `make clean` first:

*--[SHELL(height=6)]------------------------------------------------*
|                                                                   |
|  cd ulmBLAS                                                       |
|  make clean                                                       |
|                                                                   |
*-------------------------------------------------------------------*

Now we switch by checking out the `demo-pure-c` branch:

*--[SHELL(path=ulmBLAS)]--------------------------------------------*
|                                                                   |
|  git branch -a                                                    |
|  git checkout -B demo-pure-c remotes/origin/demo-pure-c           |
|                                                                   |
*-------------------------------------------------------------------*

Then we compile the project:

*--[SHELL(path=ulmBLAS,height=15)]----------------------------------*
|                                                                   |
|  make                                                             |
|                                                                   |
*-------------------------------------------------------------------*


Building Blocks of dgemm_nn
===========================

- `pack_MRxk` and `pack_A` copy row panels from blocks of $A$ into the
  buffer `_A`.  Details are described in __GEMM Packing Matrix A__.

- `pack_kxNR` and `pack_B` copy column panels from blocks of $B$ into the
  buffer `_B`.  Details are described in __GEMM Packing Matrix B__.

- `dgemm_micro_kernel` multiplies a row panel of `_A` with a column panel
  of `_B`.  Details are given in __GEMM Micro Kernel__.

- `dgemm_macro_kernel` multiplies a packed block of $A$ with a packed
  block of $B$ by calling the micro kernel for all combinations of row
  and column panels.  Details are given in __GEMM Macro Kernel__.

- `dgemm_nn` computes $C \leftarrow \beta C + \alpha A B$ as described in
  __GEMM__.


The Micro Kernel Algorithm
==========================

For the sake of simplicity we assume $m_r = n_r = 4$ in the description of
the algorithm.  Note that the pure C implementation works as long as $m_r$
is a divisor of $m_c$ and $n_r$ a divisor of $n_c$.

Recall that `A` points to a packed (possibly zero-padded) row panel of
height four and width $k_c$.  That means we have the column-wise stored
panel

---- LATEX ---------------------------------------------------------------------
A = \begin{pmatrix}
    a_0 & a_4 & \dots & a_{4 k_c -4} \\
    a_1 & a_5 & \dots & a_{4 k_c -3} \\
    a_2 & a_6 & \dots & a_{4 k_c -2} \\
    a_3 & a_7 & \dots & a_{4 k_c -1} \\
    \end{pmatrix}
--------------------------------------------------------------------------------

Further, `B` points to a column panel of height $k_c$ and width four.
That can be illustrated as a row-wise stored panel

---- LATEX ---------------------------------------------------------------------
B = \begin{pmatrix}
    b_0         & b_1         & b_2         & b_3         \\
    b_4         & b_5         & b_6         & b_7         \\
    \vdots      & \vdots      & \vdots      & \vdots      \\
    b_{4 k_c-4} & b_{4 k_c-3} & b_{4 k_c-2} & b_{4 k_c-1} \\
    \end{pmatrix}
--------------------------------------------------------------------------------

For the product $A \cdot B$ this means

---- LATEX ---------------------------------------------------------------------
\begin{eqnarray*}
A \cdot B
&=& \begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ \end{pmatrix}
    \begin{pmatrix} b_0 & b_1 & b_2 & b_3 \end{pmatrix}
 +  \begin{pmatrix} a_4 \\ a_5 \\ a_6 \\ a_7 \\ \end{pmatrix}
    \begin{pmatrix} b_4 & b_5 & b_6 & b_7 \end{pmatrix}
 +  \dots
 +  \begin{pmatrix} a_{4 k_c-4} \\ a_{4 k_c-3} \\ a_{4 k_c-2} \\ a_{4 k_c-1} \\ \end{pmatrix}
    \begin{pmatrix} b_{4 k_c-4} & b_{4 k_c-3} & b_{4 k_c-2} & b_{4 k_c-1} \end{pmatrix}\\[1cm]
&=& \begin{pmatrix}
    a_0 b_0 & a_0 b_1 & a_0 b_2 & a_0 b_3 \\
    a_1 b_0 & a_1 b_1 & a_1 b_2 & a_1 b_3 \\
    a_2 b_0 & a_2 b_1 & a_2 b_2 & a_2 b_3 \\
    a_3 b_0 & a_3 b_1 & a_3 b_2 & a_3 b_3 \\
    \end{pmatrix}
 +  \begin{pmatrix}
    a_4 b_4 & a_4 b_5 & a_4 b_6 & a_4 b_7 \\
    a_5 b_4 & a_5 b_5 & a_5 b_6 & a_5 b_7 \\
    a_6 b_4 & a_6 b_5 & a_6 b_6 & a_6 b_7 \\
    a_7 b_4 & a_7 b_5 & a_7 b_6 & a_7 b_7 \\
    \end{pmatrix}
 +  \dots
 +  \begin{pmatrix}
    a_{4 k_c-4} b_{4 k_c-4} & a_{4 k_c-4} b_{4 k_c-3} & a_{4 k_c-4} b_{4 k_c-2} & a_{4 k_c-4} b_{4 k_c-1} \\
    a_{4 k_c-3} b_{4 k_c-4} & a_{4 k_c-3} b_{4 k_c-3} & a_{4 k_c-3} b_{4 k_c-2} & a_{4 k_c-3} b_{4 k_c-1} \\
    a_{4 k_c-2} b_{4 k_c-4} & a_{4 k_c-2} b_{4 k_c-3} & a_{4 k_c-2} b_{4 k_c-2} & a_{4 k_c-2} b_{4 k_c-1} \\
    a_{4 k_c-1} b_{4 k_c-4} & a_{4 k_c-1} b_{4 k_c-3} & a_{4 k_c-1} b_{4 k_c-2} & a_{4 k_c-1} b_{4 k_c-1} \\
    \end{pmatrix}
\end{eqnarray*}
--------------------------------------------------------------------------------

We compute $\mathbf{AB} = A \cdot B$ sequentially:

- Initialize: $\mathbf{AB} \leftarrow \mathbf{0}_{4 \times 4}$
- For $l = 0, \dots, k_c-1$ update:
  - $\mathbf{AB} \leftarrow \mathbf{AB} + \begin{pmatrix} a_{4l} \\ a_{4l+1} \\ a_{4l+2} \\ a_{4l+3}\end{pmatrix} \begin{pmatrix} b_{4l} & b_{4l+1} & b_{4l+2} & b_{4l+3}\end{pmatrix}$

*(Note that this loop is the computational hotspot.  In principle the
overall performance depends only on the efficient update of $\mathbf{AB}$.)*

Afterwards we merely update the left-hand side micro block $\tilde{C}$:

- $\tilde{C} \leftarrow \beta \tilde{C}$
- $\tilde{C} \leftarrow \tilde{C} + \alpha \mathbf{AB}$
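Spelled out in C the algorithm looks roughly as follows.  This is only a
sketch for illustration, not the ulmBLAS micro kernel itself (the actual
implementation is imported below and, among other things, treats the case
$\beta = 0$ separately so that an uninitialized $\tilde{C}$ is never read):

---- CODE (C) ------------------------------------------------------------------
/*  Sketch of the micro kernel algorithm for MR = NR = 4.  A and B point  */
/*  to the packed panels described above; C is the 4x4 micro block of the */
/*  result, stored with strides incRowC and incColC.                      */
static void
dgemm_micro_kernel_sketch(int kc, double alpha,
                          const double *A, const double *B,
                          double beta,
                          double *C, int incRowC, int incColC)
{
    double AB[4*4];         /* local buffer for AB, stored column-wise    */
    int    i, j, l;

/*  Initialize:  AB <- 0                                                  */
    for (i=0; i<4*4; ++i) {
        AB[i] = 0.0;
    }

/*  Computational hotspot: for l = 0, ..., kc-1 add the outer product of  */
/*  the l-th column of the A panel and the l-th row of the B panel to AB. */
    for (l=0; l<kc; ++l) {
        for (j=0; j<4; ++j) {
            for (i=0; i<4; ++i) {
                AB[i+j*4] += A[i]*B[j];
            }
        }
        A += 4;             /* advance to the next column of the A panel  */
        B += 4;             /* advance to the next row of the B panel     */
    }

/*  Update the micro block:  C <- beta*C,  then  C <- C + alpha*AB        */
    for (j=0; j<4; ++j) {
        for (i=0; i<4; ++i) {
            C[i*incRowC+j*incColC] *= beta;
            C[i*incRowC+j*incColC] += alpha*AB[i+j*4];
        }
    }
}
--------------------------------------------------------------------------------

Because $\mathbf{AB}$ consists of only sixteen doubles it is small enough
to be kept in registers, which is why the efficiency of the hotspot loop
alone determines the overall performance.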
The dgemm_nn Code (less than 450 lines!)
========================================

:import: ulmBLAS/src/level3/dgemm_nn.c [linenumbers]


Benchmark Results
=================

In `bench/` we have extracted the benchmark suite from the __ATLAS__
project:

*--[SHELL(path=ulmBLAS)]--------------------------------------------*
|                                                                   |
|  cd bench                                                         |
|  ./xdl3blastst                                                    |
|                                                                   |
*-------------------------------------------------------------------*

Lines with `PASS` show results for our own implementation.  Lines with
`-----` belong to the reference implementation.

We can visualize the benchmarks with __gnuplot__.  First we write the
results to a report file and use `grep` to separate the results of our
implementation from those of the reference implementation:

*--[SHELL(path=ulmBLAS/bench)]--------------------------------------*
|                                                                   |
|  ./xdl3blastst > report                                           |
|  grep PASS report > demo-pure-c                                   |
|  grep "\ \-\-\-\-\-$" report > refBLAS                            |
|                                                                   |
*-------------------------------------------------------------------*

Then we can use gnuplot to create an SVG of the benchmark results:

:import: ulmBLAS/bench/bench.gps

Feeding gnuplot with this script

*--[SHELL(path=ulmBLAS/bench)]--------------------------------------*
|                                                                   |
|  gnuplot bench.gps                                                |
|                                                                   |
*-------------------------------------------------------------------*

creates

---- IMAGE ---------------------------------------------------------------------
ulmBLAS/bench/bench.svg
--------------------------------------------------------------------------------


Sensitivity to Compilers
========================

Unfortunately different compilers, flags and versions can have a big
influence on the performance.  Using SSE intrinsics or inline assembly in
the micro kernel solves this problem.

Here I can demonstrate a significant performance drop if I compile with
`clang` instead of `gcc-4.8`.  Note that in other cases my `clang`
compiler performs much better optimizations.

*--[SHELL(path=ulmBLAS)]--------------------------------------------*
|                                                                   |
|  clang --version                                                  |
|  gcc-4.8 --version                                                |
|                                                                   |
|  cd src                                                           |
|  clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer  +++|
|      -DFAKE_ATLAS -c -o level3/atl_dgemm_nn.o level3/dgemm_nn.c   |
|  make                                                             |
|  cd ../bench                                                      |
|  make                                                             |
|  ./xdl3blastst > report                                           |
|  cat report                                                       |
|                                                                   |
*-------------------------------------------------------------------*

We can visualize this again:

*--[SHELL(path=ulmBLAS/bench)]--------------------------------------*
|                                                                   |
|  grep PASS report > demo-pure-c-with-clang                        |
|                                                                   |
*-------------------------------------------------------------------*

By adapting the gnuplot script

:import: ulmBLAS/bench/bench2.gps

and feeding gnuplot with this script

*--[SHELL(path=ulmBLAS/bench)]--------------------------------------*
|                                                                   |
|  gnuplot bench2.gps                                               |
|                                                                   |
*-------------------------------------------------------------------*

we obtain the plot below.  Although the performance drops dramatically we
see that it is stable: there is no drop depending on cache sizes like in
the `refBLAS` case.  Therefore it still beats the reference implementation
if problem sizes are large enough.

---- IMAGE ---------------------------------------------------------------------
ulmBLAS/bench/bench2.svg
--------------------------------------------------------------------------------

:links: __ATLAS__                 -> http://math-atlas.sourceforge.net
        __gnuplot__               -> http://www.gnuplot.info
        __GEMM Packing Matrix A__ -> https://apfel.mathematik.uni-ulm.de/~lehn/sghpc/day08/page02.html
        __GEMM Packing Matrix B__ -> https://apfel.mathematik.uni-ulm.de/~lehn/sghpc/day08/page03.html
        __GEMM Micro Kernel__     -> https://apfel.mathematik.uni-ulm.de/~lehn/sghpc/day08/page04.html
        __GEMM Macro Kernel__     -> https://apfel.mathematik.uni-ulm.de/~lehn/sghpc/day08/page05.html
        __GEMM__                  -> https://apfel.mathematik.uni-ulm.de/~lehn/sghpc/day08/page06.html

:navigate: __up__   -> doc:index
           __back__ -> doc:page01/index
           __next__ -> doc:page03/index