===============================
Another SSE Intrinsics Approach                                          [TOC]
===============================

In this section we follow the idea of the BLIS micro kernel for x86-64
architectures with SSE.  Compared to the original BLIS micro kernel we
simplify quite a few things:

- We are using __SSE intrinsics__ instead of inline assembler.
- We do not take the latency of SSE instructions into account.
- We do not unroll the update loop.
- We do not prefetch panels and other data.  At least not explicitly.

Select the demo-sse-intrinsics Branch
=====================================

Check out the `demo-sse-intrinsics` branch:

*--[SHELL(path=ulmBLAS)]--------------------------------------------*
|                                                                    |
|  git branch -a                                                     |
|  git checkout -B demo-sse-intrinsics                            +++|
|      remotes/origin/demo-sse-intrinsics                            |
|                                                                    |
*-------------------------------------------------------------------*

Then we compile the project:

*--[SHELL(path=ulmBLAS,height=15)]----------------------------------*
|                                                                    |
|  make                                                              |
|                                                                    |
*-------------------------------------------------------------------*

The Micro Kernel Algorithm
==========================

Again we consider the update step

---- LATEX ---------------------------------------------------------------------
    \mathbf{AB} \leftarrow \mathbf{AB}
    + \begin{pmatrix} a_{4l} \\ a_{4l+1} \\ a_{4l+2} \\ a_{4l+3} \end{pmatrix}
      \begin{pmatrix} b_{4l}, & b_{4l+1}, & b_{4l+2}, & b_{4l+3} \end{pmatrix}
--------------------------------------------------------------------------------

And again we will keep the complete matrix $\mathbf{AB}$ in eight SSE
registers.  However, we will do this with a slight modification that we
illustrate below.

Remember that in our so called _naive_ approach the 64-bit operand $b_{4l}$
was duplicated and stored in a 128-bit SSE register.  This time we will store
the two 64-bit operands $b_{4l}$ and $b_{4l+1}$ in a common 128-bit SSE
register.  That is basically the main difference.  More precisely, we store
the operands in the SSE registers $\mathbb{tmp}_0$ to $\mathbb{tmp}_3$:

---- LATEX ---------------------------------------------------------------------
    \begin{array}{llll}
    \mathbb{tmp}_{0} \leftarrow
    \begin{pmatrix} a_{4l  } \\ a_{4l+1} \end{pmatrix}, &
    \mathbb{tmp}_{1} \leftarrow
    \begin{pmatrix} a_{4l+2} \\ a_{4l+3} \end{pmatrix}, &
    \mathbb{tmp}_{2} \leftarrow
    \begin{pmatrix} b_{4l  } \\ b_{4l+1} \end{pmatrix}, &
    \mathbb{tmp}_{3} \leftarrow
    \begin{pmatrix} b_{4l+2} \\ b_{4l+3} \end{pmatrix}
    \end{array}
--------------------------------------------------------------------------------

Now we notice that a component-wise SSE multiplication like
$\mathbb{tmp}_{0} \odot \mathbb{tmp}_{2}$ computes
$\begin{pmatrix} a_{4l} b_{4l} \\ a_{4l+1} b_{4l+1} \end{pmatrix}$
which contributes to $\mathbf{AB}_{0,0}$ and $\mathbf{AB}_{1,1}$.  With these
registers further contributions can be computed for
$(\mathbf{AB}_{2,0}, \mathbf{AB}_{3,1})$, $(\mathbf{AB}_{0,2}, \mathbf{AB}_{1,3})$
and $(\mathbf{AB}_{2,2}, \mathbf{AB}_{3,3})$.
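For instance, the first of these products could be formed with SSE intrinsics
as sketched below.  This is only an illustration: the pointer names `A` and `B`
for the packed panels and their 16-byte alignment are assumptions, not part of
the actual interface.

----- CODE(type=c) -------------------------------------------------------------
#include <emmintrin.h>      /* SSE2 intrinsics for double precision */

/* Sketch only: A and B are assumed to point to 16-byte aligned packed panels
   with A[0..3] = (a_{4l},...,a_{4l+3}) and B[0..3] = (b_{4l},...,b_{4l+3}).   */
static __m128d
contribution_00_11(const double *A, const double *B)
{
    __m128d tmp0 = _mm_load_pd(A);      /* (a_{4l}, a_{4l+1}) */
    __m128d tmp2 = _mm_load_pd(B);      /* (b_{4l}, b_{4l+1}) */

    /* component-wise product (a_{4l} b_{4l}, a_{4l+1} b_{4l+1}), i.e. the
       contributions to AB_{0,0} and AB_{1,1}                               */
    return _mm_mul_pd(tmp0, tmp2);
}
--------------------------------------------------------------------------------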
For the remaining entries of $\mathbf{AB}$ contributions can be computed from
copies of $\mathbb{tmp}_{2}$ and $\mathbb{tmp}_{3}$ in which the two doubles
are swapped:

---- LATEX ---------------------------------------------------------------------
    \begin{array}{ll}
    \mathbb{tmp}_{4} \leftarrow
    \begin{pmatrix} b_{4l+1} \\ b_{4l  } \end{pmatrix}, &
    \mathbb{tmp}_{5} \leftarrow
    \begin{pmatrix} b_{4l+3} \\ b_{4l+2} \end{pmatrix}
    \end{array}
--------------------------------------------------------------------------------

Using eight SSE registers for $\mathbf{AB}$ and six for
$\mathbb{tmp}_0, \dots, \mathbb{tmp}_5$ we have two SSE registers
$\mathbb{tmp}_6$ and $\mathbb{tmp}_7$ left for intermediate results.  Denoting
the SSE register that holds the pair $(\mathbf{AB}_{i,j}, \mathbf{AB}_{k,l})$
by $\mathbb{ab}_{ij,kl}$ we update the diagonal and anti-diagonal pairs of the
first two columns with

---- LATEX ---------------------------------------------------------------------
    \begin{array}{lll}
    \mathbb{tmp}_6      &\leftarrow& \mathbb{tmp}_2 \\
    \mathbb{tmp}_2      &\leftarrow& \mathbb{tmp}_2 \odot \mathbb{tmp}_0 \\
    \mathbb{tmp}_6      &\leftarrow& \mathbb{tmp}_6 \odot \mathbb{tmp}_1 \\
    \mathbb{ab}_{00,11} &\leftarrow& \mathbb{ab}_{00,11} + \mathbb{tmp}_2 \\
    \mathbb{ab}_{20,31} &\leftarrow& \mathbb{ab}_{20,31} + \mathbb{tmp}_6 \\
     & & \\
    \mathbb{tmp}_7      &\leftarrow& \mathbb{tmp}_4 \\
    \mathbb{tmp}_4      &\leftarrow& \mathbb{tmp}_4 \odot \mathbb{tmp}_0 \\
    \mathbb{tmp}_7      &\leftarrow& \mathbb{tmp}_7 \odot \mathbb{tmp}_1 \\
    \mathbb{ab}_{01,10} &\leftarrow& \mathbb{ab}_{01,10} + \mathbb{tmp}_4 \\
    \mathbb{ab}_{21,30} &\leftarrow& \mathbb{ab}_{21,30} + \mathbb{tmp}_7 \\
    \end{array}
--------------------------------------------------------------------------------

Analogously we compute updates for the last two columns of $\mathbf{AB}$

---- LATEX ---------------------------------------------------------------------
    \begin{array}{lll}
    \mathbb{tmp}_6      &\leftarrow& \mathbb{tmp}_3 \\
    \mathbb{tmp}_3      &\leftarrow& \mathbb{tmp}_3 \odot \mathbb{tmp}_0 \\
    \mathbb{tmp}_6      &\leftarrow& \mathbb{tmp}_6 \odot \mathbb{tmp}_1 \\
    \mathbb{ab}_{02,13} &\leftarrow& \mathbb{ab}_{02,13} + \mathbb{tmp}_3 \\
    \mathbb{ab}_{22,33} &\leftarrow& \mathbb{ab}_{22,33} + \mathbb{tmp}_6 \\
     & & \\
    \mathbb{tmp}_7      &\leftarrow& \mathbb{tmp}_5 \\
    \mathbb{tmp}_5      &\leftarrow& \mathbb{tmp}_5 \odot \mathbb{tmp}_0 \\
    \mathbb{tmp}_7      &\leftarrow& \mathbb{tmp}_7 \odot \mathbb{tmp}_1 \\
    \mathbb{ab}_{03,12} &\leftarrow& \mathbb{ab}_{03,12} + \mathbb{tmp}_5 \\
    \mathbb{ab}_{23,32} &\leftarrow& \mathbb{ab}_{23,32} + \mathbb{tmp}_7 \\
    \end{array}
--------------------------------------------------------------------------------

When copying the $\mathbb{ab}_{ij,kl}$ registers back to memory we have to
move the lower and the higher double separately.  For example, the lower
double of $\mathbb{ab}_{00,11}$ gets moved to $\mathbf{AB}_{0,0}$ and the
higher double to $\mathbf{AB}_{1,1}$.
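Put together, the pseudo code above translates almost one-to-one into C with
SSE intrinsics.  The following is merely a sketch under a few assumptions: the
function name, the 16-byte alignment of the packed panels, and a column-wise
4x4 buffer `AB` with leading dimension 4 are illustrative choices; also, the
explicit copies into $\mathbb{tmp}_6$ and $\mathbb{tmp}_7$ are folded into the
multiplications.  The imported `dgemm_nn.c` in the next section contains the
actual micro kernel.

----- CODE(type=c) -------------------------------------------------------------
#include <emmintrin.h>      /* SSE2 intrinsics for double precision */

/* Sketch of the update loop.  Assumptions: A and B point to 16-byte aligned
   packed panels traversed in steps of 4 doubles, and AB is a 4x4 buffer
   stored column wise with leading dimension 4.                              */
static void
dgemm_micro_sketch(int kc, const double *A, const double *B, double *AB)
{
    __m128d ab_00_11 = _mm_setzero_pd();
    __m128d ab_20_31 = _mm_setzero_pd();
    __m128d ab_01_10 = _mm_setzero_pd();
    __m128d ab_21_30 = _mm_setzero_pd();
    __m128d ab_02_13 = _mm_setzero_pd();
    __m128d ab_22_33 = _mm_setzero_pd();
    __m128d ab_03_12 = _mm_setzero_pd();
    __m128d ab_23_32 = _mm_setzero_pd();
    __m128d tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;

    for (int l = 0; l < kc; ++l) {
        tmp0 = _mm_load_pd(A);                  /* (a_{4l}  , a_{4l+1}) */
        tmp1 = _mm_load_pd(A + 2);              /* (a_{4l+2}, a_{4l+3}) */
        tmp2 = _mm_load_pd(B);                  /* (b_{4l}  , b_{4l+1}) */
        tmp3 = _mm_load_pd(B + 2);              /* (b_{4l+2}, b_{4l+3}) */
        tmp4 = _mm_shuffle_pd(tmp2, tmp2, 1);   /* (b_{4l+1}, b_{4l}  ) */
        tmp5 = _mm_shuffle_pd(tmp3, tmp3, 1);   /* (b_{4l+3}, b_{4l+2}) */

        /* first two columns of AB */
        tmp6     = _mm_mul_pd(tmp2, tmp1);
        tmp2     = _mm_mul_pd(tmp2, tmp0);
        ab_00_11 = _mm_add_pd(ab_00_11, tmp2);
        ab_20_31 = _mm_add_pd(ab_20_31, tmp6);

        tmp7     = _mm_mul_pd(tmp4, tmp1);
        tmp4     = _mm_mul_pd(tmp4, tmp0);
        ab_01_10 = _mm_add_pd(ab_01_10, tmp4);
        ab_21_30 = _mm_add_pd(ab_21_30, tmp7);

        /* last two columns of AB */
        tmp6     = _mm_mul_pd(tmp3, tmp1);
        tmp3     = _mm_mul_pd(tmp3, tmp0);
        ab_02_13 = _mm_add_pd(ab_02_13, tmp3);
        ab_22_33 = _mm_add_pd(ab_22_33, tmp6);

        tmp7     = _mm_mul_pd(tmp5, tmp1);
        tmp5     = _mm_mul_pd(tmp5, tmp0);
        ab_03_12 = _mm_add_pd(ab_03_12, tmp5);
        ab_23_32 = _mm_add_pd(ab_23_32, tmp7);

        A += 4;
        B += 4;
    }

    /* copy back: the lower and the higher double of each register go to
       different entries of AB                                             */
    _mm_storel_pd(&AB[0+0*4], ab_00_11);  _mm_storeh_pd(&AB[1+1*4], ab_00_11);
    _mm_storel_pd(&AB[2+0*4], ab_20_31);  _mm_storeh_pd(&AB[3+1*4], ab_20_31);
    _mm_storel_pd(&AB[0+1*4], ab_01_10);  _mm_storeh_pd(&AB[1+0*4], ab_01_10);
    _mm_storel_pd(&AB[2+1*4], ab_21_30);  _mm_storeh_pd(&AB[3+0*4], ab_21_30);
    _mm_storel_pd(&AB[0+2*4], ab_02_13);  _mm_storeh_pd(&AB[1+3*4], ab_02_13);
    _mm_storel_pd(&AB[2+2*4], ab_22_33);  _mm_storeh_pd(&AB[3+3*4], ab_22_33);
    _mm_storel_pd(&AB[0+3*4], ab_03_12);  _mm_storeh_pd(&AB[1+2*4], ab_03_12);
    _mm_storel_pd(&AB[2+3*4], ab_23_32);  _mm_storeh_pd(&AB[3+2*4], ab_23_32);
}
--------------------------------------------------------------------------------

Note that `_mm_shuffle_pd(tmp2, tmp2, 1)` produces the swapped register
$\mathbb{tmp}_4$, and that `_mm_storel_pd`/`_mm_storeh_pd` move the lower and
the higher double separately, exactly as described above.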
The dgemm_nn Code
=================

:import: ulmBLAS/src/level3/dgemm_nn.c [linenumbers]

Benchmark Results
=================

We run the benchmarks

*--[SHELL(path=ulmBLAS)]--------------------------------------------*
|                                                                    |
|  cd bench                                                          |
|  ./xdl3blastst > report                                            |
|  cat report                                                        |
|                                                                    |
*-------------------------------------------------------------------*

and filter out the results for the `demo-sse-intrinsics` branch:

*--[SHELL(path=ulmBLAS/bench)]--------------------------------------*
|                                                                    |
|  grep PASS report > demo-sse-intrinsics                            |
|                                                                    |
*-------------------------------------------------------------------*

With the gnuplot script

:import: ulmBLAS/bench/bench5.gps

we feed gnuplot

*--[SHELL(path=ulmBLAS/bench)]--------------------------------------*
|                                                                    |
|  gnuplot bench5.gps                                                |
|                                                                    |
*-------------------------------------------------------------------*

and get

---- IMAGE -------------
ulmBLAS/bench/bench5.svg
------------------------

Sensitivity to Compilers
========================

Maybe you noticed that this time the `clang` compiler was used above!  With
`gcc 4.8`, in contrast, we get poor results:

*--[SHELL(path=ulmBLAS)]--------------------------------------------*
|                                                                    |
|  gcc-4.8 --version                                                 |
|                                                                    |
|  cd src                                                            |
|  gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse                      +++|
|          -fomit-frame-pointer -DFAKE_ATLAS -c                   +++|
|          -o level3/atl_dgemm_nn.o level3/dgemm_nn.c                |
|  make                                                              |
|  cd ../bench                                                       |
|  make                                                              |
|  ./xdl3blastst                                                     |
|                                                                    |
*-------------------------------------------------------------------*

We can visualize this again:

*--[SHELL(path=ulmBLAS/bench)]--------------------------------------*
|                                                                    |
|  ./xdl3blastst > report                                            |
|  grep PASS report > demo-sse-intrinsics-gcc                        |
|                                                                    |
*-------------------------------------------------------------------*

By adapting the gnuplot script

:import: ulmBLAS/bench/bench6.gps

and feeding gnuplot with this script

*--[SHELL(path=ulmBLAS/bench)]--------------------------------------*
|                                                                    |
|  gnuplot bench6.gps                                                |
|                                                                    |
*-------------------------------------------------------------------*

we get

---- IMAGE -------------
ulmBLAS/bench/bench6.svg
------------------------

Analyzing Assembler Code from gcc 4.8
-------------------------------------

----- CODE(type=s) -------------------------------------------------------------
    # ...
    xorpd    %xmm9, %xmm9
    movapd   %xmm9, %xmm11
    addq     %rdx, %rdi
    movapd   %xmm9, %xmm8
    movapd   %xmm9, %xmm10
    movapd   %xmm9, %xmm13
    movapd   %xmm9, %xmm15
    movapd   %xmm9, %xmm12
    movapd   %xmm9, %xmm14
    .align 4,0x90
L3:
    movapd   (%rdx), %xmm5
    addq     $32, %rdx
    addq     $32, %rsi
    movapd   -16(%rsi), %xmm2
    movapd   %xmm5, %xmm7
    movapd   %xmm5, %xmm1
    movapd   -32(%rsi), %xmm3
    shufpd   $1, %xmm5, %xmm7
    mulpd    %xmm2, %xmm5
    movapd   -16(%rdx), %xmm4
    mulpd    %xmm3, %xmm1
    cmpq     %rdi, %rdx
    movapd   %xmm4, %xmm6
    shufpd   $1, %xmm4, %xmm6
    addpd    %xmm5, %xmm12
    movapd   %xmm7, %xmm5
    mulpd    %xmm3, %xmm5
    addpd    %xmm1, %xmm14
    mulpd    %xmm2, %xmm7
    addpd    %xmm5, %xmm15
    movapd   %xmm4, %xmm5
    mulpd    %xmm3, %xmm5
    addpd    %xmm7, %xmm13
    mulpd    %xmm2, %xmm4
    mulpd    %xmm6, %xmm3
    mulpd    %xmm6, %xmm2
    addpd    %xmm5, %xmm10
    addpd    %xmm4, %xmm8
    addpd    %xmm3, %xmm11
    addpd    %xmm2, %xmm9
    jne      L3
L2:
    movd     %r10, %xmm1
    movapd   %xmm14, %xmm2
    movsd    %xmm14, -120(%rsp)
    ucomisd  LC0(%rip), %xmm1
    movhpd   %xmm15, -112(%rsp)
    movlpd   %xmm12, -104(%rsp)
    movhpd   %xmm13, -96(%rsp)
    movlpd   %xmm15, -88(%rsp)
    movhpd   %xmm14, -80(%rsp)
    movlpd   %xmm13, -72(%rsp)
    movhpd   %xmm12, -64(%rsp)
    movlpd   %xmm10, -56(%rsp)
    movhpd   %xmm11, -48(%rsp)
    movlpd   %xmm8, -40(%rsp)
    movhpd   %xmm9, -32(%rsp)
    movlpd   %xmm11, -24(%rsp)
    movhpd   %xmm10, -16(%rsp)
    movlpd   %xmm9, -8(%rsp)
    movhpd   %xmm8, (%rsp)
    # ...
--------------------------------------------------------------------------------

TODO: re-code this to equivalent SSE intrinsics

Analyzing Assembler Code from clang
-----------------------------------

----- CODE(type=s) -------------------------------------------------------------
    # ...
    xorpd    %xmm8, %xmm8
    xorpd    %xmm9, %xmm9
    xorpd    %xmm14, %xmm14
    xorpd    %xmm15, %xmm15
    xorpd    %xmm10, %xmm10
    xorpd    %xmm11, %xmm11
    xorpd    %xmm12, %xmm12
    xorpd    %xmm13, %xmm13
    jle      LBB1_2
    .align   4, 0x90
LBB1_1:                                 ## %.lr.ph
                                        ## =>This Inner Loop Header: Depth=1
    movapd   (%rsi), %xmm6
    movapd   16(%rsi), %xmm7
    movapd   (%rdx), %xmm2
    movapd   16(%rdx), %xmm3
    movapd   %xmm6, %xmm4
    mulpd    %xmm2, %xmm4
    movapd   %xmm7, %xmm5
    mulpd    %xmm2, %xmm5
    pshufd   $78, %xmm2, %xmm2          ## xmm2 = xmm2[2,3,0,1]
    addpd    %xmm4, %xmm8
    addpd    %xmm5, %xmm9
    movapd   %xmm6, %xmm4
    mulpd    %xmm2, %xmm4
    mulpd    %xmm7, %xmm2
    addpd    %xmm4, %xmm14
    addpd    %xmm2, %xmm15
    movapd   %xmm6, %xmm2
    mulpd    %xmm3, %xmm2
    movapd   %xmm7, %xmm4
    mulpd    %xmm3, %xmm4
    pshufd   $78, %xmm3, %xmm3          ## xmm3 = xmm3[2,3,0,1]
    addpd    %xmm2, %xmm10
    addpd    %xmm4, %xmm11
    mulpd    %xmm3, %xmm6
    mulpd    %xmm3, %xmm7
    addpd    %xmm6, %xmm12
    addpd    %xmm7, %xmm13
    addq     $32, %rsi
    addq     $32, %rdx
    decq     %rdi
    jne      LBB1_1
LBB1_2:                                 ## %._crit_edge
    movlpd   %xmm8, (%rsp)
    movhpd   %xmm14, 8(%rsp)
    movlpd   %xmm9, 16(%rsp)
    movhpd   %xmm15, 24(%rsp)
    movlpd   %xmm14, 32(%rsp)
    movhpd   %xmm8, 40(%rsp)
    movlpd   %xmm15, 48(%rsp)
    movhpd   %xmm9, 56(%rsp)
    movlpd   %xmm10, 64(%rsp)
    movhpd   %xmm12, 72(%rsp)
    movlpd   %xmm11, 80(%rsp)
    movhpd   %xmm13, 88(%rsp)
    movlpd   %xmm12, 96(%rsp)
    movhpd   %xmm10, 104(%rsp)
    movlpd   %xmm13, 112(%rsp)
    movhpd   %xmm11, 120(%rsp)
    # ...
--------------------------------------------------------------------------------

TODO: re-code this to equivalent SSE intrinsics

Conclusion
----------

The performance difference is mainly due to latency issues: we have to take
into account that each SSE instruction has a latency of a few cycles before
its result becomes available.  Assume that instruction `inst2` requires that
`inst1` is already completed.  Then we can improve pipelining by scheduling
other useful, independent instructions in between.

:links: SSE intrinsics -> https://software.intel.com/sites/landingpage/IntrinsicsGuide/

:navigate: __up__    -> doc:index
           __back__  -> doc:page04/index
           __next__  -> doc:page06/index