=======================
GEMM C++ Implementation                                                  [TOC:2]
=======================

All things required for the implementation of the matrix-matrix product are
contained in `gemm.h`.  Note that it also contains some functionality that is
already available in BLAZE.  These parts can be removed and replaced by calls
to the adequate BLAZE functions.

Small Demo
==========

In this simple demo you have to call the GEMM function directly:

- An expression like $C \leftarrow \beta C + \alpha A B$ gets evaluated by
  calling

---- CODE (type=cc) ----------------------------------------------------------
foo::gemm(alpha, A, B, beta, C);
------------------------------------------------------------------------------

- An expression like $C \leftarrow \beta C + \alpha (A^T+A) (2 \cdot B)$ gets
  evaluated by calling

---- CODE (type=cc) ----------------------------------------------------------
foo::gemm(alpha, blaze::trans(A)+A, 2*B, beta, C);
------------------------------------------------------------------------------

  Note that this does not create temporary matrices for $A^T+A$ or
  $2 \cdot B$.

Any expression for $A$ and $B$ that allows element-wise access can be used.
If the cost of evaluating an element of an expression is $\mathcal{O}(1)$
(i.e. constant) the impact on performance should not be measurable.

Also note that *matrices can have different element types*.  The performance
depends on the common type of $\alpha$, $A$ and $B$.
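For illustration, here is a small self-contained program along these lines.
This is just a sketch, not the `test_gemm.cc` below; it assumes that `gemm.h`
and the BLAZE headers are in the include path:

---- CODE (type=cc) ----------------------------------------------------------
#include <blaze/Math.h>
#include "gemm.h"

int main()
{
    // The element types may differ: A holds floats, B and C hold doubles.
    blaze::DynamicMatrix<float>  A(300, 300);
    blaze::DynamicMatrix<double> B(300, 400);
    blaze::DynamicMatrix<double> C(300, 400, 0.0);

    randomize(A);
    randomize(B);

    // C <- beta*C + alpha*(A^T+A)*(2*B); the expression arguments get
    // accessed element-wise, no temporaries are created.
    foo::gemm(1.5, blaze::trans(A)+A, 2*B, 2.0, C);
}
------------------------------------------------------------------------------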
Source Code: `test_gemm.cc`
---------------------------

:import: session1/test_gemm.cc

---- SHELL (path=session1,hostname=heim) ---------------------------------------
g++-5.3 -O3 -DNDEBUG -std=c++11 -I /home/numerik/lehn/work/blaze-2.5/       +++
    -I /opt/intel/compilers_and_libraries/linux/mkl/include                 +++
    test_gemm.cc
./a.out
--------------------------------------------------------------------------------

Source Code for GEMM
====================

Most of the algorithm is implemented in pure C++.  Only the so-called frame
algorithm was adapted to BLAZE.  The rest is coded in a C style.  One reason
for this is that in my lecture we start from scratch with a pure C
implementation.  Step by step we C++-fy it.

Performance depends on the so-called _micro-kernels_.  In the following we
have hand-coded micro-kernels for AVX and FMA.  Another micro-kernel uses the
GCC vector extensions.  Again, these kernels are rather simple and I just
included versions for double precision.  More advanced kernels can be taken
from BLIS.  The micro-kernel can be selected by:

- Compile with `-DHAVE_AVX` for the AVX-Kernel.
- Compile with `-DHAVE_FMA` for the FMA-Kernel.
- Compile with `-DHAVE_GCCVEC` for the Kernel using GCC vector extensions.
- Compile with `-DHAVE_BLISAVX` for the Kernel adapted from BLIS.

Otherwise a reference implementation of the micro-kernel gets used.

Frame Algorithm, Macro-Kernel and Reference-Micro-Kernel
--------------------------------------------------------

:import: session1/gemm.h

AVX Micro-Kernel (fairly optimized)
-----------------------------------

:import: session1/avx.h

FMA Micro-Kernel (very simple)
------------------------------

:import: session1/fma.h

Micro-Kernel with GCC Extensions (very simple)
----------------------------------------------

:import: session1/gccvec.h

Micro-Kernel from BLIS (slightly adapted)
-----------------------------------------

:import: session1/blisavx.h

Source Code of Benchmark Program
================================

:import: session1/bench_gemm.cc

Benchmarks
==========

Performance is compared with BLAZE linked against the Intel MKL.  For a
comparison with other BLAS implementations simply link against ATLAS,
OpenBLAS, ...

Hardware used
-------------

Note that the machine only has AVX but not FMA:

---- SHELL (hostname=heim) -----------------------------------------------------
cat /proc/cpuinfo
--------------------------------------------------------------------------------

Run with Reference Micro-Kernel
-------------------------------

Here we only have cache optimization.  So performance is poor (about 10
percent of peak performance) but does not drop for growing problem sizes:

---- SHELL (path=session1,hostname=heim) ---------------------------------------
g++-5.3 -Ofast -mavx -DNDEBUG -std=c++11 -DM_MAX=2000                       +++
    -I /home/numerik/lehn/work/blaze-2.5/                                   +++
    -I /opt/intel/compilers_and_libraries/linux/mkl/include                 +++
    -DBLAZE_BLAS_MODE -DMKL_ILP64                                           +++
    -L /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64             +++
    -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lm -lpthread             +++
    -Wl,-rpath /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64     +++
    bench_gemm.cc
./a.out > report.gemm.ref
cat report.gemm.ref
--------------------------------------------------------------------------------

Run with GCC Vector Extensions
------------------------------

Better, but still off:

---- SHELL (path=session1,hostname=heim) ---------------------------------------
g++-5.3 -Ofast -mavx -DHAVE_GCCVEC -DNDEBUG -std=c++11 -DM_MAX=10000        +++
    -I /home/numerik/lehn/work/blaze-2.5/                                   +++
    -I /opt/intel/compilers_and_libraries/linux/mkl/include                 +++
    -DBLAZE_BLAS_MODE -DMKL_ILP64                                           +++
    -L /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64             +++
    -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lm -lpthread             +++
    -Wl,-rpath /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64     +++
    bench_gemm.cc
./a.out > report.gemm.gccvec
cat report.gemm.gccvec
--------------------------------------------------------------------------------
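To give an idea of what these kernels do, here is a stripped-down sketch of
the central rank-1 update written with GCC vector extensions.  This is *not*
the kernel from `gccvec.h`; it ignores the scaling with $\alpha$ and $\beta$
and assumes a fixed $4 \times 4$ block and packed, 32-byte aligned buffers:

---- CODE (type=cc) ----------------------------------------------------------
#include <cstddef>

// Vector of 4 doubles (256 bit, i.e. one AVX register).
typedef double v4df __attribute__((vector_size(32)));

// Stripped-down 4x4 micro-kernel: AB += sum_l A(:,l)*B(l,:), where A and B
// point to packed, 32-byte aligned buffers and AB is stored column-major.
void
ugemm_4x4(std::size_t kc, const double *A, const double *B, double *AB)
{
    v4df ab[4] = {};                          // one accumulator per column

    for (std::size_t l=0; l<kc; ++l) {
        v4df a = *reinterpret_cast<const v4df *>(A);   // one column of A
        for (int j=0; j<4; ++j) {
            ab[j] += a*B[j];                  // vector*scalar: GCC broadcasts
        }                                     // B[j] to all four lanes
        A += 4;
        B += 4;
    }
    for (int j=0; j<4; ++j) {
        *reinterpret_cast<v4df *>(AB+4*j) = ab[j];
    }
}
------------------------------------------------------------------------------

On this machine the compiler essentially turns this into `vmulpd`/`vaddpd`
sequences.  How well it schedules them is up to the compiler, which is one
reason why such a kernel lags behind a hand-coded one.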
Run with AVX Micro-Kernel
-------------------------

Better.  Further improvement is possible with hints for prefetching, see BLIS:

---- SHELL (path=session1,hostname=heim) ---------------------------------------
g++-5.3 -Ofast -mavx -DHAVE_AVX -DNDEBUG -std=c++11 -DM_MAX=10000           +++
    -I /home/numerik/lehn/work/blaze-2.5/                                   +++
    -I /opt/intel/compilers_and_libraries/linux/mkl/include                 +++
    -DBLAZE_BLAS_MODE -DMKL_ILP64                                           +++
    -L /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64             +++
    -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lm -lpthread             +++
    -Wl,-rpath /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64     +++
    bench_gemm.cc
./a.out > report.gemm.avx
cat report.gemm.avx
--------------------------------------------------------------------------------

Run with Micro-Kernel from BLIS
-------------------------------

---- SHELL (path=session1,hostname=heim) ---------------------------------------
g++-5.3 -Ofast -mavx -DHAVE_BLISAVX -DNDEBUG -std=c++11 -DM_MAX=10000       +++
    -I /home/numerik/lehn/work/blaze-2.5/                                   +++
    -I /opt/intel/compilers_and_libraries/linux/mkl/include                 +++
    -DBLAZE_BLAS_MODE -DMKL_ILP64                                           +++
    -L /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64             +++
    -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lm -lpthread             +++
    -Wl,-rpath /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64     +++
    bench_gemm.cc
./a.out > report.gemm.blisavx
cat report.gemm.blisavx
--------------------------------------------------------------------------------

Run with internal BLAZE Matrix-Matrix Product
---------------------------------------------

In this run we disable the BLAS backend in BLAZE such that the internal BLAZE
implementation gets called:

---- SHELL (path=session1,hostname=heim) ---------------------------------------
g++-5.3 -Ofast -mavx -DHAVE_AVX -DNDEBUG -std=c++11 -DM_MAX=10000           +++
    -DBLAZE_BLAS_MODE=0                                                     +++
    -I /home/numerik/lehn/work/blaze-2.5/                                   +++
    bench_gemm.cc
./a.out > report.gemm.blaze
cat report.gemm.blaze
--------------------------------------------------------------------------------

Plots from the Benchmarks
=========================

Generating Plots
----------------

---- SHELL (path=session1,hostname=heim) ---------------------------------------
gnuplot plot.gemm.time
gnuplot plot.gemm.time_log
gnuplot plot.gemm.mflops
--------------------------------------------------------------------------------

Plot: Time Elapsed
------------------

---- IMAGE -----------------
session1/bench.gemm.time.svg
----------------------------

Plot: Time Elapsed (log scaled)
-------------------------------

---- IMAGE ---------------------
session1/bench.gemm.time_log.svg
--------------------------------

Plot: MFLOPS
------------

---- IMAGE -------------------
session1/bench.gemm.mflops.svg
------------------------------
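For reference: the MFLOPS values in the plot above follow the usual convention
of counting $2 \cdot m \cdot n \cdot k$ floating point operations for one
product.  A minimal timing sketch along these lines (a simplified stand-in,
not the code from `bench_gemm.cc`):

---- CODE (type=cc) ----------------------------------------------------------
#include <chrono>
#include <cstddef>

// Simplified stand-in for the benchmark loop: time one gemm call and
// convert the elapsed wall-clock time into MFLOPS.  For the product
// C <- beta*C + alpha*A*B with A being m x k and B being k x n the
// conventional operation count is 2*m*n*k.
template <typename GemmCall>
double
mflops(std::size_t m, std::size_t n, std::size_t k, GemmCall call)
{
    auto t0 = std::chrono::high_resolution_clock::now();
    call();                        // e.g. foo::gemm(alpha, A, B, beta, C)
    auto t1 = std::chrono::high_resolution_clock::now();

    double sec = std::chrono::duration<double>(t1-t0).count();
    return 2.0*m*n*k/sec/1e6;
}
------------------------------------------------------------------------------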