===============================
Complete Assembler Micro Kernel                                         [TOC]
===============================

The BLIS micro kernel gains another performance boost from prefetching data.
In my experiments I was only able to add this feature after converting the whole
micro kernel to assembler.

Status quo: So far only the loop for computing `AB` is implemented in
assembler.  The rest of the micro kernel was left untouched and is plain C
code.  The bridge between the assembler and C code is the `double` array `AB`
of length 16.  At its end the assembler code copies its results into this array.
The remaining C code uses `AB` to compute `C <- beta*C + alpha*A*B`.
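
As a rough sketch of this status quo (not the code from the repository), the
C part of the micro kernel could look as follows.  It assumes a 4x4 register
block, which matches the buffer length of 16, and that `AB` is stored column by
column.  The function name `update_from_buffer`, the parameter names and the
storage order of `AB` are assumptions made for illustration only.

---- CODE(type=c)---------------------------------------------------------------
#include <assert.h>
#include <stddef.h>

enum { MR = 4, NR = 4 };    /* assumed register block size: MR*NR == 16 */

/*
 * Hypothetical helper:  C <- beta*C + alpha*AB
 *
 * AB is the buffer filled by the assembler part (assumed column major).
 * incRowC and incColC are the strides of C counted in elements.
 */
void
update_from_buffer(double alpha, const double AB[MR*NR],
                   double beta,
                   double *C, long incRowC, long incColC)
{
    assert(AB != NULL && C != NULL);

    for (long j=0; j<NR; ++j) {
        for (long i=0; i<MR; ++i) {
            double *c = &C[i*incRowC + j*incColC];

            if (beta==0.0) {
                /* do not read C at all, so NaNs in C are not propagated */
                *c = alpha*AB[i + j*MR];
            } else {
                *c = beta*(*c) + alpha*AB[i + j*MR];
            }
        }
    }
}
--------------------------------------------------------------------------------

Every element of `AB` is written once by the assembler code and read once by
the C code.  This round trip is what becomes unnecessary when the update is
done in assembler, too.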

Having the whole micro kernel in assembler removes the need for this internal
buffer `AB`.  Note that this alone will not improve performance.  But again, it
is a prerequisite for effectively adding prefetching.

It is no surprise that performance does not improve.  Using a profiler one can
see that only a few milliseconds were _not spent_ on computing `AB`.


Select the demo-sse-all-asm Branch
==================================
Check out the `demo-sse-all-asm` branch:

    *--[SHELL(path=ulmBLAS)]--------------------------------------------*
    |                                                                   |
    |  git branch -a                                                    |
    |  git checkout -B demo-sse-all-asm                              +++|
    |                  remotes/origin/demo-sse-all-asm                  |
    |                                                                   |
    *-------------------------------------------------------------------*

Then we compile the project

    *--[SHELL(path=ulmBLAS,height=15)]----------------------------------*
    |                                                                   |
    |  make                                                             |
    |                                                                   |
    *-------------------------------------------------------------------*


Outline of the Modification
===========================
- In the micro kernel we also go from `int` to `long`.  It is just simpler to
  use only 64 bit registers.
- Registers `%xmm8`, .., `%xmm15` are used to hold the result of `A*B`.
- Once these registers contain the required values of `A*B` we begin
  with the update `C <- beta*C + alpha*A*B`:
    - `%xmm0` is used to hold `alpha` and `%xmm1` for `beta`.
    - The strides of matrix $C$ must be multiplied by `sizeof(double)`.  This
      gives the stride in bytes.
    - For each element $C_{i,j}$ we do the following:
        - Load $C_{i,j}$.
        - Compute $\left(\alpha A B\right)_{i,j}$.
        - Compute $\beta C_{i,j}$.
        - Store $\beta C_{i,j} + \left(\alpha A B\right)_{i,j}$.
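
The steps of this outline can be sketched in plain C.  This is only a scalar
illustration under assumptions, not the assembler code of the branch: the
function name `update_with_byte_strides` and the pointer `AB` (standing in for
the values held in `%xmm8`, .., `%xmm15`, assumed column major) are
hypothetical.  The point is the address arithmetic with strides in bytes.

---- CODE(type=c)---------------------------------------------------------------
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical scalar sketch of the update  C <- beta*C + alpha*A*B  for a
 * 4x4 block.  AB stands in for the register contents (assumed column major).
 */
void
update_with_byte_strides(double alpha, const double *AB,
                         double beta,
                         double *C, long incRowC, long incColC)
{
    assert(AB != NULL && C != NULL);

    /* strides in bytes: this is what address arithmetic in assembler needs */
    long incRowBytes = incRowC*(long)sizeof(double);
    long incColBytes = incColC*(long)sizeof(double);

    for (long j=0; j<4; ++j) {
        for (long i=0; i<4; ++i) {
            /* address of C(i,j) computed from the byte strides */
            double *c = (double *)((char *)C + i*incRowBytes + j*incColBytes);

            double c_ij  = *c;                  /* load C(i,j)              */
            double ab_ij = alpha*AB[i + 4*j];   /* compute (alpha*A*B)(i,j) */
            c_ij         = beta*c_ij;           /* compute beta*C(i,j)      */
            *c           = c_ij + ab_ij;        /* store the sum            */
        }
    }
}
--------------------------------------------------------------------------------

In the assembler version each SSE register holds two doubles, so the
multiplications and the addition can handle two elements at a time, while the
sketch above treats every element separately.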


The dgemm_nn Code
=================
:import: ulmBLAS/src/level3/dgemm_nn.c [linenumbers]


Benchmark Results
=================
We run the benchmarks

    *--[SHELL(path=ulmBLAS)]--------------------------------------------*
    |                                                                   |
    |  cd bench                                                         |
    |  ./xdl3blastst > report                                           |
    |                                                                   |
    *-------------------------------------------------------------------*

and filter out the results for the `demo-sse-all-asm`
branch:

    *--[SHELL(path=ulmBLAS/bench)]--------------------------------------*
    |                                                                   |
    |  grep PASS report > demo-sse-all-asm                              |
    |                                                                   |
    *-------------------------------------------------------------------*

With the gnuplot script

:import: ulmBLAS/bench/bench13.gps

we feed gnuplot

    *--[SHELL(path=ulmBLAS/bench)]--------------------------------------*
    |                                                                   |
    |  gnuplot bench13.gps                                              |
    |                                                                   |
    *-------------------------------------------------------------------*

and get

    ---- IMAGE --------------
    ulmBLAS/bench/bench13.svg
    -------------------------

Code Size of kb-Loop Body
=========================
On Mac OS X you can use `otool` to get the assembler code from an object file.
Moreover, you can directly see how many bytes each instruction takes.  We will
use this to determine the code size of the kb-loop in the micro kernel.  So we
have to look at the label generated from

---- CODE(type=c)---------------------------------------------------------------
".DLOOP%=:                   \n\t"  // for l = kb,..,1 do
--------------------------------------------------------------------------------

and the jump

---- CODE(type=c)---------------------------------------------------------------
"jne       .DLOOP%=          \n\t"  // if l>= 1 go back
--------------------------------------------------------------------------------

Here is the complete dump of `otool`

    *--[SHELL(path=ulmBLAS/bench,height=12)]----------------------------*
    |                                                                   |
    |  cd ../src/level3                                                 |
    |  otool -dtV dgemm_nn.o                                            |
    |                                                                   |
    *-------------------------------------------------------------------*

So the line right before the `.DLOOP0` label is

    *--[SHELL(path=ulmBLAS/src/level3)]---------------------------------*
    |                                                                   |
    |  otool -dtV dgemm_nn.o | while read line; do if test           +++|
    |     "$line" = ".DLOOP0:"; then echo $last; fi; last=$line;     +++|
    |     done                                                          |
    |                                                                   |
    *-------------------------------------------------------------------*

And the line with `jne  .DLOOP0` is

    *--[SHELL(path=ulmBLAS/src/level3)]---------------------------------*
    |                                                                   |
    |  otool -dtV dgemm_nn.o | grep ".DLOOP0$"                          |
    |                                                                   |
    *-------------------------------------------------------------------*

So the code in between takes

    *--[SHELL(path=ulmBLAS/src/level3)]---------------------------------*
    |                                                                   |
    |  BEGIN=`otool -dtV dgemm_nn.o | while read line; do if test    +++|
    |        "$line" = ".DLOOP0:"; then echo $last; fi; last=$line;  +++|
    |        done`                                                      |
    |  BEGIN=($BEGIN)                                                   |
    |  BEGIN=${BEGIN[0]}                                                |
    |  BEGIN=`echo $BEGIN | tr "a-f" "A-F"`                             |
    |  END=`otool -dtV dgemm_nn.o | grep ".DLOOP0$" | tr "a-f" "A-F"`   |
    |  END=($END)                                                       |
    |  END=${END[0]}                                                    |
    |  END=`echo $END | tr "a-f" "A-F"`                                 |
    |  SIZE=`dc -e "16 i $END $BEGIN - f"`                              |
    |  echo "Code size of loop body (in bytes): $SIZE"                  |
    |                                                                   |
    *-------------------------------------------------------------------*

bytes.
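
The arithmetic done by `dc` above is just a subtraction of two hexadecimal
addresses.  As a small sketch, the same computation in C could look like this.
The function name `code_size` is hypothetical, and the addresses in the comment
are made-up examples, not values taken from the actual object file.

---- CODE(type=c)---------------------------------------------------------------
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical helper: distance in bytes between two addresses given as
 * hexadecimal strings, e.g. the first column of two lines of a disassembly.
 *
 * Example with made-up addresses:
 *     code_size("0000000000000a10", "0000000000000b3c")
 */
unsigned long long
code_size(const char *beginHex, const char *endHex)
{
    /* strtoull accepts lower and upper case hex digits */
    unsigned long long begin = strtoull(beginHex, NULL, 16);
    unsigned long long end   = strtoull(endHex, NULL, 16);

    assert(end >= begin);
    return end - begin;
}
--------------------------------------------------------------------------------

Unlike `dc`, which only accepts upper case hex digits as input, `strtoull`
does not need the conversion step with `tr`.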

:navigate: __up__    -> doc:index
           __back__  -> doc:page11/index
           __next__  -> doc:page13/index