===============================
Another SSE Intrinsics Approach                                          [TOC]
===============================

In this section we follow the idea of the BLIS micro kernel for x86-64
architectures with SSE.  Compared to the original BLIS micro kernel we
simplify quite a few things:

- We are using __SSE intrinsics__ instead of inline assembler.
- We do not take the latency of SSE instructions into account.
- We do not unroll the update loop.
- We do not prefetch panels and other data.  At least not explicitly.

Select the demo-sse-intrinsics Branch
=====================================

Check out the `demo-sse-intrinsics` branch:

*--[SHELL(path=ulmBLAS)]--------------------------------------------*
|                                                                    |
|  git branch -a                                                     |
|  git checkout -B demo-sse-intrinsics                            +++|
|      remotes/origin/demo-sse-intrinsics                            |
|                                                                    |
*-------------------------------------------------------------------*

Then we compile the project:

*--[SHELL(path=ulmBLAS,height=15)]----------------------------------*
|                                                                    |
|  make                                                              |
|                                                                    |
*-------------------------------------------------------------------*

The Micro Kernel Algorithm
==========================

Again we consider the update step

---- LATEX ---------------------------------------------------------------------
    \mathbf{AB} \leftarrow \mathbf{AB}
    + \begin{pmatrix} a_{4l} \\ a_{4l+1} \\ a_{4l+2} \\ a_{4l+3} \end{pmatrix}
      \begin{pmatrix} b_{4l}, & b_{4l+1}, & b_{4l+2}, & b_{4l+3} \end{pmatrix}
--------------------------------------------------------------------------------

And again we will keep the complete matrix $\mathbf{AB}$ in eight SSE
registers.  However, we will do this with a slight modification that we
illustrate below.

Remember that in our so called _naive_ approach the 64-bit operand $b_{4l}$
was duplicated and stored in a 128-bit SSE register.  This time we will store
the two 64-bit operands $b_{4l}$ and $b_{4l+1}$ in a common 128-bit SSE
register.  That is basically the main difference.  More precisely, we store
the operands in the SSE registers $\mathbb{tmp}_0$ to $\mathbb{tmp}_3$:

---- LATEX ---------------------------------------------------------------------
    \begin{array}{llll}
    \mathbb{tmp}_{0} \leftarrow
    \begin{pmatrix} a_{4l  } \\ a_{4l+1} \end{pmatrix}, &
    \mathbb{tmp}_{1} \leftarrow
    \begin{pmatrix} a_{4l+2} \\ a_{4l+3} \end{pmatrix}, &
    \mathbb{tmp}_{2} \leftarrow
    \begin{pmatrix} b_{4l  } \\ b_{4l+1} \end{pmatrix}, &
    \mathbb{tmp}_{3} \leftarrow
    \begin{pmatrix} b_{4l+2} \\ b_{4l+3} \end{pmatrix}
    \end{array}
--------------------------------------------------------------------------------

Now we notice that a component-wise SSE multiplication like
$\mathbb{tmp}_{0} \odot \mathbb{tmp}_{2}$ computes
$\begin{pmatrix} a_{4l} b_{4l} \\ a_{4l+1} b_{4l+1} \end{pmatrix}$
which contributes to $\mathbf{AB}_{0,0}$ and $\mathbf{AB}_{1,1}$.  With these
registers further contributions can be computed for
$(\mathbf{AB}_{2,0}, \mathbf{AB}_{3,1})$, $(\mathbf{AB}_{0,2}, \mathbf{AB}_{1,3})$
and $(\mathbf{AB}_{2,2}, \mathbf{AB}_{3,3})$.
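For instance, the first of these products could be formed with SSE intrinsics
as sketched below.  This is only an illustration: the pointer names `A` and `B`
for the packed panels and their 16-byte alignment are assumptions, not part of
the actual interface.

----- CODE(type=c) -------------------------------------------------------------
#include <emmintrin.h>      /* SSE2 intrinsics for double precision */

/* Sketch only: A and B are assumed to point to 16-byte aligned packed panels
   with A[0..3] = (a_{4l},...,a_{4l+3}) and B[0..3] = (b_{4l},...,b_{4l+3}).   */
static __m128d
contribution_00_11(const double *A, const double *B)
{
    __m128d tmp0 = _mm_load_pd(A);      /* (a_{4l}, a_{4l+1}) */
    __m128d tmp2 = _mm_load_pd(B);      /* (b_{4l}, b_{4l+1}) */

    /* component-wise product (a_{4l} b_{4l}, a_{4l+1} b_{4l+1}), i.e. the
       contributions to AB_{0,0} and AB_{1,1}                               */
    return _mm_mul_pd(tmp0, tmp2);
}
--------------------------------------------------------------------------------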
For the remaining entries of $\mathbf{AB}$ contributions can be computed from
copies of $\mathbb{tmp}_{2}$ and $\mathbb{tmp}_{3}$ in which the two doubles
are swapped:

---- LATEX ---------------------------------------------------------------------
    \begin{array}{ll}
    \mathbb{tmp}_{4} \leftarrow
    \begin{pmatrix} b_{4l+1} \\ b_{4l  } \end{pmatrix}, &
    \mathbb{tmp}_{5} \leftarrow
    \begin{pmatrix} b_{4l+3} \\ b_{4l+2} \end{pmatrix}
    \end{array}
--------------------------------------------------------------------------------

Using eight SSE registers for $\mathbf{AB}$ and six for
$\mathbb{tmp}_0, \dots, \mathbb{tmp}_5$ we have two SSE registers
$\mathbb{tmp}_6$ and $\mathbb{tmp}_7$ left for intermediate results.  Denoting
the SSE register that holds the pair $(\mathbf{AB}_{i,j}, \mathbf{AB}_{k,l})$
by $\mathbb{ab}_{ij,kl}$ we update the diagonal and anti-diagonal pairs of the
first two columns with

---- LATEX ---------------------------------------------------------------------
    \begin{array}{lll}
    \mathbb{tmp}_6      &\leftarrow& \mathbb{tmp}_2 \\
    \mathbb{tmp}_2      &\leftarrow& \mathbb{tmp}_2 \odot \mathbb{tmp}_0 \\
    \mathbb{tmp}_6      &\leftarrow& \mathbb{tmp}_6 \odot \mathbb{tmp}_1 \\
    \mathbb{ab}_{00,11} &\leftarrow& \mathbb{ab}_{00,11} + \mathbb{tmp}_2 \\
    \mathbb{ab}_{20,31} &\leftarrow& \mathbb{ab}_{20,31} + \mathbb{tmp}_6 \\
     & & \\
    \mathbb{tmp}_7      &\leftarrow& \mathbb{tmp}_4 \\
    \mathbb{tmp}_4      &\leftarrow& \mathbb{tmp}_4 \odot \mathbb{tmp}_0 \\
    \mathbb{tmp}_7      &\leftarrow& \mathbb{tmp}_7 \odot \mathbb{tmp}_1 \\
    \mathbb{ab}_{01,10} &\leftarrow& \mathbb{ab}_{01,10} + \mathbb{tmp}_4 \\
    \mathbb{ab}_{21,30} &\leftarrow& \mathbb{ab}_{21,30} + \mathbb{tmp}_7 \\
    \end{array}
--------------------------------------------------------------------------------

Analogously we compute updates for the last two columns of $\mathbf{AB}$

---- LATEX ---------------------------------------------------------------------
    \begin{array}{lll}
    \mathbb{tmp}_6      &\leftarrow& \mathbb{tmp}_3 \\
    \mathbb{tmp}_3      &\leftarrow& \mathbb{tmp}_3 \odot \mathbb{tmp}_0 \\
    \mathbb{tmp}_6      &\leftarrow& \mathbb{tmp}_6 \odot \mathbb{tmp}_1 \\
    \mathbb{ab}_{02,13} &\leftarrow& \mathbb{ab}_{02,13} + \mathbb{tmp}_3 \\
    \mathbb{ab}_{22,33} &\leftarrow& \mathbb{ab}_{22,33} + \mathbb{tmp}_6 \\
     & & \\
    \mathbb{tmp}_7      &\leftarrow& \mathbb{tmp}_5 \\
    \mathbb{tmp}_5      &\leftarrow& \mathbb{tmp}_5 \odot \mathbb{tmp}_0 \\
    \mathbb{tmp}_7      &\leftarrow& \mathbb{tmp}_7 \odot \mathbb{tmp}_1 \\
    \mathbb{ab}_{03,12} &\leftarrow& \mathbb{ab}_{03,12} + \mathbb{tmp}_5 \\
    \mathbb{ab}_{23,32} &\leftarrow& \mathbb{ab}_{23,32} + \mathbb{tmp}_7 \\
    \end{array}
--------------------------------------------------------------------------------

When copying the $\mathbb{ab}_{ij,kl}$ registers back to memory we have to
move the lower and the higher double separately.  For example, the lower
double of $\mathbb{ab}_{00,11}$ gets moved to $\mathbf{AB}_{0,0}$ and the
higher double to $\mathbf{AB}_{1,1}$.
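Put together, the pseudo code above translates almost one-to-one into C with
SSE intrinsics.  The following is merely a sketch under a few assumptions: the
function name, the 16-byte alignment of the packed panels, and a column-wise
4x4 buffer `AB` with leading dimension 4 are illustrative choices; also, the
explicit copies into $\mathbb{tmp}_6$ and $\mathbb{tmp}_7$ are folded into the
multiplications.  The imported `dgemm_nn.c` in the next section contains the
actual micro kernel.

----- CODE(type=c) -------------------------------------------------------------
#include <emmintrin.h>      /* SSE2 intrinsics for double precision */

/* Sketch of the update loop.  Assumptions: A and B point to 16-byte aligned
   packed panels traversed in steps of 4 doubles, and AB is a 4x4 buffer
   stored column wise with leading dimension 4.                              */
static void
dgemm_micro_sketch(int kc, const double *A, const double *B, double *AB)
{
    __m128d ab_00_11 = _mm_setzero_pd();
    __m128d ab_20_31 = _mm_setzero_pd();
    __m128d ab_01_10 = _mm_setzero_pd();
    __m128d ab_21_30 = _mm_setzero_pd();
    __m128d ab_02_13 = _mm_setzero_pd();
    __m128d ab_22_33 = _mm_setzero_pd();
    __m128d ab_03_12 = _mm_setzero_pd();
    __m128d ab_23_32 = _mm_setzero_pd();
    __m128d tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;

    for (int l = 0; l < kc; ++l) {
        tmp0 = _mm_load_pd(A);                  /* (a_{4l}  , a_{4l+1}) */
        tmp1 = _mm_load_pd(A + 2);              /* (a_{4l+2}, a_{4l+3}) */
        tmp2 = _mm_load_pd(B);                  /* (b_{4l}  , b_{4l+1}) */
        tmp3 = _mm_load_pd(B + 2);              /* (b_{4l+2}, b_{4l+3}) */
        tmp4 = _mm_shuffle_pd(tmp2, tmp2, 1);   /* (b_{4l+1}, b_{4l}  ) */
        tmp5 = _mm_shuffle_pd(tmp3, tmp3, 1);   /* (b_{4l+3}, b_{4l+2}) */

        /* first two columns of AB */
        tmp6     = _mm_mul_pd(tmp2, tmp1);
        tmp2     = _mm_mul_pd(tmp2, tmp0);
        ab_00_11 = _mm_add_pd(ab_00_11, tmp2);
        ab_20_31 = _mm_add_pd(ab_20_31, tmp6);

        tmp7     = _mm_mul_pd(tmp4, tmp1);
        tmp4     = _mm_mul_pd(tmp4, tmp0);
        ab_01_10 = _mm_add_pd(ab_01_10, tmp4);
        ab_21_30 = _mm_add_pd(ab_21_30, tmp7);

        /* last two columns of AB */
        tmp6     = _mm_mul_pd(tmp3, tmp1);
        tmp3     = _mm_mul_pd(tmp3, tmp0);
        ab_02_13 = _mm_add_pd(ab_02_13, tmp3);
        ab_22_33 = _mm_add_pd(ab_22_33, tmp6);

        tmp7     = _mm_mul_pd(tmp5, tmp1);
        tmp5     = _mm_mul_pd(tmp5, tmp0);
        ab_03_12 = _mm_add_pd(ab_03_12, tmp5);
        ab_23_32 = _mm_add_pd(ab_23_32, tmp7);

        A += 4;
        B += 4;
    }

    /* copy back: the lower and the higher double of each register go to
       different entries of AB                                             */
    _mm_storel_pd(&AB[0+0*4], ab_00_11);  _mm_storeh_pd(&AB[1+1*4], ab_00_11);
    _mm_storel_pd(&AB[2+0*4], ab_20_31);  _mm_storeh_pd(&AB[3+1*4], ab_20_31);
    _mm_storel_pd(&AB[0+1*4], ab_01_10);  _mm_storeh_pd(&AB[1+0*4], ab_01_10);
    _mm_storel_pd(&AB[2+1*4], ab_21_30);  _mm_storeh_pd(&AB[3+0*4], ab_21_30);
    _mm_storel_pd(&AB[0+2*4], ab_02_13);  _mm_storeh_pd(&AB[1+3*4], ab_02_13);
    _mm_storel_pd(&AB[2+2*4], ab_22_33);  _mm_storeh_pd(&AB[3+3*4], ab_22_33);
    _mm_storel_pd(&AB[0+3*4], ab_03_12);  _mm_storeh_pd(&AB[1+2*4], ab_03_12);
    _mm_storel_pd(&AB[2+3*4], ab_23_32);  _mm_storeh_pd(&AB[3+2*4], ab_23_32);
}
--------------------------------------------------------------------------------

Note that `_mm_shuffle_pd(tmp2, tmp2, 1)` produces the swapped register
$\mathbb{tmp}_4$, and that `_mm_storel_pd`/`_mm_storeh_pd` move the lower and
the higher double separately, exactly as described above.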
The dgemm_nn Code
=================

:import: ulmBLAS/src/level3/dgemm_nn.c [linenumbers]

Benchmark Results
=================

We run the benchmarks

*--[SHELL(path=ulmBLAS)]--------------------------------------------*
|                                                                    |
|  cd bench                                                          |
|  ./xdl3blastst > report                                            |
|  cat report                                                        |
|                                                                    |
*-------------------------------------------------------------------*

and filter out the results for the `demo-sse-intrinsics` branch:

*--[SHELL(path=ulmBLAS/bench)]--------------------------------------*
|                                                                    |
|  grep PASS report > demo-sse-intrinsics                            |
|                                                                    |
*-------------------------------------------------------------------*

With the gnuplot script

:import: ulmBLAS/bench/bench5.gps

we feed gnuplot

*--[SHELL(path=ulmBLAS/bench)]--------------------------------------*
|                                                                    |
|  gnuplot bench5.gps                                                |
|                                                                    |
*-------------------------------------------------------------------*

and get

---- IMAGE -------------
ulmBLAS/bench/bench5.svg
------------------------

Sensitivity to Compilers
========================

Maybe you noticed that this time the `clang` compiler was used above!  With
`gcc 4.8`, in contrast, we get poor results:

*--[SHELL(path=ulmBLAS)]--------------------------------------------*
|                                                                    |
|  gcc-4.8 --version                                                 |
|                                                                    |
|  cd src                                                            |
|  gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse                      +++|
|          -fomit-frame-pointer -DFAKE_ATLAS -c                   +++|
|          -o level3/atl_dgemm_nn.o level3/dgemm_nn.c                |
|  make                                                              |
|  cd ../bench                                                       |
|  make                                                              |
|  ./xdl3blastst                                                     |
|                                                                    |
*-------------------------------------------------------------------*

We can visualize this again:

*--[SHELL(path=ulmBLAS/bench)]--------------------------------------*
|                                                                    |
|  ./xdl3blastst > report                                            |
|  grep PASS report > demo-sse-intrinsics-gcc                        |
|                                                                    |
*-------------------------------------------------------------------*

By adapting the gnuplot script

:import: ulmBLAS/bench/bench6.gps

and feeding gnuplot with this script

*--[SHELL(path=ulmBLAS/bench)]--------------------------------------*
|                                                                    |
|  gnuplot bench6.gps                                                |
|                                                                    |
*-------------------------------------------------------------------*

we get

---- IMAGE -------------
ulmBLAS/bench/bench6.svg
------------------------

Analyzing Assembler Code from gcc 4.8
-------------------------------------

----- CODE(type=s) -------------------------------------------------------------
    # ...
    xorpd    %xmm9, %xmm9
    movapd   %xmm9, %xmm11
    addq     %rdx, %rdi
    movapd   %xmm9, %xmm8
    movapd   %xmm9, %xmm10
    movapd   %xmm9, %xmm13
    movapd   %xmm9, %xmm15
    movapd   %xmm9, %xmm12
    movapd   %xmm9, %xmm14
    .align 4,0x90
L3:
    movapd   (%rdx), %xmm5
    addq     $32, %rdx
    addq     $32, %rsi
    movapd   -16(%rsi), %xmm2
    movapd   %xmm5, %xmm7
    movapd   %xmm5, %xmm1
    movapd   -32(%rsi), %xmm3
    shufpd   $1, %xmm5, %xmm7
    mulpd    %xmm2, %xmm5
    movapd   -16(%rdx), %xmm4
    mulpd    %xmm3, %xmm1
    cmpq     %rdi, %rdx
    movapd   %xmm4, %xmm6
    shufpd   $1, %xmm4, %xmm6
    addpd    %xmm5, %xmm12
    movapd   %xmm7, %xmm5
    mulpd    %xmm3, %xmm5
    addpd    %xmm1, %xmm14
    mulpd    %xmm2, %xmm7
    addpd    %xmm5, %xmm15
    movapd   %xmm4, %xmm5
    mulpd    %xmm3, %xmm5
    addpd    %xmm7, %xmm13
    mulpd    %xmm2, %xmm4
    mulpd    %xmm6, %xmm3
    mulpd    %xmm6, %xmm2
    addpd    %xmm5, %xmm10
    addpd    %xmm4, %xmm8
    addpd    %xmm3, %xmm11
    addpd    %xmm2, %xmm9
    jne      L3
L2:
    movd     %r10, %xmm1
    movapd   %xmm14, %xmm2
    movsd    %xmm14, -120(%rsp)
    ucomisd  LC0(%rip), %xmm1
    movhpd   %xmm15, -112(%rsp)
    movlpd   %xmm12, -104(%rsp)
    movhpd   %xmm13, -96(%rsp)
    movlpd   %xmm15, -88(%rsp)
    movhpd   %xmm14, -80(%rsp)
    movlpd   %xmm13, -72(%rsp)
    movhpd   %xmm12, -64(%rsp)
    movlpd   %xmm10, -56(%rsp)
    movhpd   %xmm11, -48(%rsp)
    movlpd   %xmm8, -40(%rsp)
    movhpd   %xmm9, -32(%rsp)
    movlpd   %xmm11, -24(%rsp)
    movhpd   %xmm10, -16(%rsp)
    movlpd   %xmm9, -8(%rsp)
    movhpd   %xmm8, (%rsp)
    # ...
--------------------------------------------------------------------------------

TODO: re-code this to equivalent SSE intrinsics

Analyzing Assembler Code from clang
-----------------------------------

----- CODE(type=s) -------------------------------------------------------------
    # ...
    xorpd    %xmm8, %xmm8
    xorpd    %xmm9, %xmm9
    xorpd    %xmm14, %xmm14
    xorpd    %xmm15, %xmm15
    xorpd    %xmm10, %xmm10
    xorpd    %xmm11, %xmm11
    xorpd    %xmm12, %xmm12
    xorpd    %xmm13, %xmm13
    jle      LBB1_2
    .align   4, 0x90
LBB1_1:                                 ## %.lr.ph
                                        ## =>This Inner Loop Header: Depth=1
    movapd   (%rsi), %xmm6
    movapd   16(%rsi), %xmm7
    movapd   (%rdx), %xmm2
    movapd   16(%rdx), %xmm3
    movapd   %xmm6, %xmm4
    mulpd    %xmm2, %xmm4
    movapd   %xmm7, %xmm5
    mulpd    %xmm2, %xmm5
    pshufd   $78, %xmm2, %xmm2          ## xmm2 = xmm2[2,3,0,1]
    addpd    %xmm4, %xmm8
    addpd    %xmm5, %xmm9
    movapd   %xmm6, %xmm4
    mulpd    %xmm2, %xmm4
    mulpd    %xmm7, %xmm2
    addpd    %xmm4, %xmm14
    addpd    %xmm2, %xmm15
    movapd   %xmm6, %xmm2
    mulpd    %xmm3, %xmm2
    movapd   %xmm7, %xmm4
    mulpd    %xmm3, %xmm4
    pshufd   $78, %xmm3, %xmm3          ## xmm3 = xmm3[2,3,0,1]
    addpd    %xmm2, %xmm10
    addpd    %xmm4, %xmm11
    mulpd    %xmm3, %xmm6
    mulpd    %xmm3, %xmm7
    addpd    %xmm6, %xmm12
    addpd    %xmm7, %xmm13
    addq     $32, %rsi
    addq     $32, %rdx
    decq     %rdi
    jne      LBB1_1
LBB1_2:                                 ## %._crit_edge
    movlpd   %xmm8, (%rsp)
    movhpd   %xmm14, 8(%rsp)
    movlpd   %xmm9, 16(%rsp)
    movhpd   %xmm15, 24(%rsp)
    movlpd   %xmm14, 32(%rsp)
    movhpd   %xmm8, 40(%rsp)
    movlpd   %xmm15, 48(%rsp)
    movhpd   %xmm9, 56(%rsp)
    movlpd   %xmm10, 64(%rsp)
    movhpd   %xmm12, 72(%rsp)
    movlpd   %xmm11, 80(%rsp)
    movhpd   %xmm13, 88(%rsp)
    movlpd   %xmm12, 96(%rsp)
    movhpd   %xmm10, 104(%rsp)
    movlpd   %xmm13, 112(%rsp)
    movhpd   %xmm11, 120(%rsp)
    # ...
--------------------------------------------------------------------------------

TODO: re-code this to equivalent SSE intrinsics

Conclusion
----------

The performance difference is mainly due to latency issues: we have to take
into account that each SSE instruction has a latency of a few cycles before
its result becomes available.  Assume that instruction `inst2` requires that
`inst1` is already completed.  Then we can improve pipelining by scheduling
other useful, independent instructions in between.

:links: SSE intrinsics -> https://software.intel.com/sites/landingpage/IntrinsicsGuide/

:navigate: __up__    -> doc:index
           __back__  -> doc:page04/index
           __next__  -> doc:page06/index