=======================
GEMM C++ Implementation                                                  [TOC:2]
=======================

All things required for the implementation of the matrix-matrix product are
contained in `gemm.h`.  Note that it also contains some functionality that is
already available in BLAZE.  These parts can be removed and replaced by calls
to the adequate BLAZE functions.

Small Demo
==========

In this simple demo you have to call the GEMM function directly:

- An expression like $C \leftarrow \beta C + \alpha A B$ gets evaluated by
  calling

---- CODE (type=cc) ----------------------------------------------------------
foo::gemm(alpha, A, B, beta, C);
------------------------------------------------------------------------------

- An expression like $C \leftarrow \beta C + \alpha (A^T+A) (2 \cdot B)$ gets
  evaluated by calling

---- CODE (type=cc) ----------------------------------------------------------
foo::gemm(alpha, blaze::trans(A)+A, 2*B, beta, C);
------------------------------------------------------------------------------

  Note that this does not create temporary matrices for $A^T+A$ or
  $2 \cdot B$.

Any expression for $A$ and $B$ that allows element-wise access can be used.
If the cost of evaluating an element of an expression is $\mathcal{O}(1)$
(i.e. constant) the impact on performance should not be measurable.

Also note that *matrices can have different element types*.  The performance
depends on the common type of $\alpha$, $A$ and $B$.
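For illustration, here is a small self-contained program along these lines.
This is just a sketch, not the `test_gemm.cc` below; it assumes that `gemm.h`
and the BLAZE headers are in the include path:

---- CODE (type=cc) ----------------------------------------------------------
#include <blaze/Math.h>
#include "gemm.h"

int main()
{
    // The element types may differ: A holds floats, B and C hold doubles.
    blaze::DynamicMatrix<float>  A(300, 300);
    blaze::DynamicMatrix<double> B(300, 400);
    blaze::DynamicMatrix<double> C(300, 400, 0.0);

    randomize(A);
    randomize(B);

    // C <- beta*C + alpha*(A^T+A)*(2*B); the expression arguments get
    // accessed element-wise, no temporaries are created.
    foo::gemm(1.5, blaze::trans(A)+A, 2*B, 2.0, C);
}
------------------------------------------------------------------------------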
Source Code: `test_gemm.cc`
---------------------------

:import: session1/test_gemm.cc

---- SHELL (path=session1,hostname=heim) ---------------------------------------
g++-5.3 -O3 -DNDEBUG -std=c++11 -I /home/numerik/lehn/work/blaze-2.5/       +++
    -I /opt/intel/compilers_and_libraries/linux/mkl/include                 +++
    test_gemm.cc
./a.out
--------------------------------------------------------------------------------

Source Code for GEMM
====================

Most of the algorithm is implemented in pure C++.  Only the so-called frame
algorithm was adapted to BLAZE.  The rest is coded in a C style.  One reason
for this is that in my lecture we start from scratch with a pure C
implementation.  Step by step we C++-fy it.

Performance depends on the so-called _micro-kernels_.  In the following we
have hand-coded micro-kernels for AVX and FMA.  Another micro-kernel uses the
GCC vector extensions.  Again, these kernels are rather simple and I just
included versions for double precision.  More advanced kernels can be taken
from BLIS.  The micro-kernel can be selected by:

- Compile with `-DHAVE_AVX` for the AVX-Kernel.
- Compile with `-DHAVE_FMA` for the FMA-Kernel.
- Compile with `-DHAVE_GCCVEC` for the Kernel using GCC vector extensions.
- Compile with `-DHAVE_BLISAVX` for the Kernel adapted from BLIS.

Otherwise a reference implementation of the micro-kernel gets used.

Frame Algorithm, Macro-Kernel and Reference-Micro-Kernel
--------------------------------------------------------

:import: session1/gemm.h

AVX Micro-Kernel (fairly optimized)
-----------------------------------

:import: session1/avx.h

FMA Micro-Kernel (very simple)
------------------------------

:import: session1/fma.h

Micro-Kernel with GCC Extensions (very simple)
----------------------------------------------

:import: session1/gccvec.h

Micro-Kernel from BLIS (slightly adapted)
-----------------------------------------

:import: session1/blisavx.h

Source Code of Benchmark Program
================================

:import: session1/bench_gemm.cc

Benchmarks
==========

Performance is compared with BLAZE linked against the Intel MKL.  For a
comparison with other BLAS implementations simply link against ATLAS,
OpenBLAS, ...

Hardware used
-------------

Note that the machine only has AVX but not FMA:

---- SHELL (hostname=heim) -----------------------------------------------------
cat /proc/cpuinfo
--------------------------------------------------------------------------------

Run with Reference Micro-Kernel
-------------------------------

Here we only have cache optimization.  So performance is poor (about 10
percent of peak performance) but does not drop for growing problem sizes:

---- SHELL (path=session1,hostname=heim) ---------------------------------------
g++-5.3 -Ofast -mavx -DNDEBUG -std=c++11 -DM_MAX=2000                       +++
    -I /home/numerik/lehn/work/blaze-2.5/                                   +++
    -I /opt/intel/compilers_and_libraries/linux/mkl/include                 +++
    -DBLAZE_BLAS_MODE -DMKL_ILP64                                           +++
    -L /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64             +++
    -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lm -lpthread             +++
    -Wl,-rpath /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64     +++
    bench_gemm.cc
./a.out > report.gemm.ref
cat report.gemm.ref
--------------------------------------------------------------------------------

Run with GCC Vector Extensions
------------------------------

Better, but still off:

---- SHELL (path=session1,hostname=heim) ---------------------------------------
g++-5.3 -Ofast -mavx -DHAVE_GCCVEC -DNDEBUG -std=c++11 -DM_MAX=10000        +++
    -I /home/numerik/lehn/work/blaze-2.5/                                   +++
    -I /opt/intel/compilers_and_libraries/linux/mkl/include                 +++
    -DBLAZE_BLAS_MODE -DMKL_ILP64                                           +++
    -L /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64             +++
    -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lm -lpthread             +++
    -Wl,-rpath /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64     +++
    bench_gemm.cc
./a.out > report.gemm.gccvec
cat report.gemm.gccvec
--------------------------------------------------------------------------------
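To give an idea of what these kernels do, here is a stripped-down sketch of
the central rank-1 update written with GCC vector extensions.  This is *not*
the kernel from `gccvec.h`; it ignores the scaling with $\alpha$ and $\beta$
and assumes a fixed $4 \times 4$ block and packed, 32-byte aligned buffers:

---- CODE (type=cc) ----------------------------------------------------------
#include <cstddef>

// Vector of 4 doubles (256 bit, i.e. one AVX register).
typedef double v4df __attribute__((vector_size(32)));

// Stripped-down 4x4 micro-kernel: AB += sum_l A(:,l)*B(l,:), where A and B
// point to packed, 32-byte aligned buffers and AB is stored column-major.
void
ugemm_4x4(std::size_t kc, const double *A, const double *B, double *AB)
{
    v4df ab[4] = {};                          // one accumulator per column

    for (std::size_t l=0; l<kc; ++l) {
        v4df a = *reinterpret_cast<const v4df *>(A);   // one column of A
        for (int j=0; j<4; ++j) {
            ab[j] += a*B[j];                  // vector*scalar: GCC broadcasts
        }                                     // B[j] to all four lanes
        A += 4;
        B += 4;
    }
    for (int j=0; j<4; ++j) {
        *reinterpret_cast<v4df *>(AB+4*j) = ab[j];
    }
}
------------------------------------------------------------------------------

On this machine the compiler essentially turns this into `vmulpd`/`vaddpd`
sequences.  How well it schedules them is up to the compiler, which is one
reason why such a kernel lags behind a hand-coded one.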
Run with AVX Micro-Kernel
-------------------------

Better.  Further improvement is possible with hints for prefetching, see BLIS:

---- SHELL (path=session1,hostname=heim) ---------------------------------------
g++-5.3 -Ofast -mavx -DHAVE_AVX -DNDEBUG -std=c++11 -DM_MAX=10000           +++
    -I /home/numerik/lehn/work/blaze-2.5/                                   +++
    -I /opt/intel/compilers_and_libraries/linux/mkl/include                 +++
    -DBLAZE_BLAS_MODE -DMKL_ILP64                                           +++
    -L /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64             +++
    -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lm -lpthread             +++
    -Wl,-rpath /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64     +++
    bench_gemm.cc
./a.out > report.gemm.avx
cat report.gemm.avx
--------------------------------------------------------------------------------

Run with Micro-Kernel from BLIS
-------------------------------

---- SHELL (path=session1,hostname=heim) ---------------------------------------
g++-5.3 -Ofast -mavx -DHAVE_BLISAVX -DNDEBUG -std=c++11 -DM_MAX=10000       +++
    -I /home/numerik/lehn/work/blaze-2.5/                                   +++
    -I /opt/intel/compilers_and_libraries/linux/mkl/include                 +++
    -DBLAZE_BLAS_MODE -DMKL_ILP64                                           +++
    -L /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64             +++
    -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lm -lpthread             +++
    -Wl,-rpath /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64     +++
    bench_gemm.cc
./a.out > report.gemm.blisavx
cat report.gemm.blisavx
--------------------------------------------------------------------------------

Run with internal BLAZE Matrix-Matrix Product
---------------------------------------------

In this run we disable the BLAS backend in BLAZE such that the internal BLAZE
implementation gets called:

---- SHELL (path=session1,hostname=heim) ---------------------------------------
g++-5.3 -Ofast -mavx -DHAVE_AVX -DNDEBUG -std=c++11 -DM_MAX=10000           +++
    -DBLAZE_BLAS_MODE=0                                                     +++
    -I /home/numerik/lehn/work/blaze-2.5/                                   +++
    bench_gemm.cc
./a.out > report.gemm.blaze
cat report.gemm.blaze
--------------------------------------------------------------------------------

Plots from the Benchmarks
=========================

Generating Plots
----------------

---- SHELL (path=session1,hostname=heim) ---------------------------------------
gnuplot plot.gemm.time
gnuplot plot.gemm.time_log
gnuplot plot.gemm.mflops
--------------------------------------------------------------------------------

Plot: Time Elapsed
------------------

---- IMAGE -----------------
session1/bench.gemm.time.svg
----------------------------

Plot: Time Elapsed (log scaled)
-------------------------------

---- IMAGE ---------------------
session1/bench.gemm.time_log.svg
--------------------------------

Plot: MFLOPS
------------

---- IMAGE -------------------
session1/bench.gemm.mflops.svg
------------------------------
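For reference: the MFLOPS values in the plot above follow the usual convention
of counting $2 \cdot m \cdot n \cdot k$ floating point operations for one
product.  A minimal timing sketch along these lines (a simplified stand-in,
not the code from `bench_gemm.cc`):

---- CODE (type=cc) ----------------------------------------------------------
#include <chrono>
#include <cstddef>

// Simplified stand-in for the benchmark loop: time one gemm call and
// convert the elapsed wall-clock time into MFLOPS.  For the product
// C <- beta*C + alpha*A*B with A being m x k and B being k x n the
// conventional operation count is 2*m*n*k.
template <typename GemmCall>
double
mflops(std::size_t m, std::size_t n, std::size_t k, GemmCall call)
{
    auto t0 = std::chrono::high_resolution_clock::now();
    call();                        // e.g. foo::gemm(alpha, A, B, beta, C)
    auto t1 = std::chrono::high_resolution_clock::now();

    double sec = std::chrono::duration<double>(t1-t0).count();
    return 2.0*m*n*k/sec/1e6;
}
------------------------------------------------------------------------------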