=========================================
Fine-tuning the Unrolled Assembler Kernel                                  [TOC]
=========================================

In our previous assembler kernel we stored the pointer to the panel of $A$ in
register `%rax` and the pointer to the panel of $B$ in `%rbx`.  Inside the
unrolled loop we copied and pasted the original loop body four times, so the
pointer increments were also copied four times.  These are the lines

---- CODE(type=c) --------------------------------------------------------------
    "addq      $32,      %%rax      \n\t"  // A += 4;
    "addq      $32,      %%rbx      \n\t"  // B += 4;
--------------------------------------------------------------------------------

We replace them with a single increment, i.e.

---- CODE(type=c) --------------------------------------------------------------
    "addq      $4*32,    %%rax      \n\t"  // A += 16;
    "addq      $4*32,    %%rbx      \n\t"  // B += 16;
--------------------------------------------------------------------------------

at the end of the loop.  As the values of `%rax` and `%rbx` now remain constant
inside the loop body, the CPU can pipeline the instructions better.  As a
consequence, we have to use offsets in the loop, e.g.
when we do the second update step we write

---- CODE(type=c) --------------------------------------------------------------
    "movaps    48(%%rbx), %%xmm3    \n\t"  // tmp3 = _mm_load_pd(B+6)
--------------------------------------------------------------------------------

instead of

---- CODE(type=c) --------------------------------------------------------------
    "movaps    16(%%rbx), %%xmm3    \n\t"  // tmp3 = _mm_load_pd(B+2)
--------------------------------------------------------------------------------


Select the demo-sse-asm-unrolled-v2 Branch
==========================================
Again, we do a `make clean` before switching branches:

*--[SHELL(height=6)]------------------------------------------------*
|                                                                   |
|  cd ulmBLAS                                                       |
|  make clean                                                       |
|                                                                   |
*-------------------------------------------------------------------*

Then we check out the `demo-sse-asm-unrolled-v2` branch:

*--[SHELL(path=ulmBLAS)]--------------------------------------------*
|                                                                   |
|  git branch -a                                                    |
|  git checkout -B demo-sse-asm-unrolled-v2                      +++|
       remotes/origin/demo-sse-asm-unrolled-v2                      |
|                                                                   |
*-------------------------------------------------------------------*

Then we compile the project:

*--[SHELL(path=ulmBLAS,height=15)]----------------------------------*
|                                                                   |
|  make                                                             |
|                                                                   |
*-------------------------------------------------------------------*


The dgemm_nn Code
=================

:import: ulmBLAS/src/level3/dgemm_nn.c [linenumbers]


Benchmark Results
=================
We run the benchmarks

*--[SHELL(path=ulmBLAS)]--------------------------------------------*
|                                                                   |
|  cd bench                                                         |
|  ./xdl3blastst > report                                           |
|                                                                   |
*-------------------------------------------------------------------*

and filter out the results for the `demo-sse-asm-unrolled-v2` branch:

*--[SHELL(path=ulmBLAS/bench)]--------------------------------------*
|                                                                   |
|  grep PASS report > demo-sse-asm-unrolled-v2                      |
|                                                                   |
*-------------------------------------------------------------------*

With the gnuplot script

:import: ulmBLAS/bench/bench11.gps

we feed gnuplot

*--[SHELL(path=ulmBLAS/bench)]--------------------------------------*
|                                                                   |
|  gnuplot bench11.gps                                              |
|                                                                   |
*-------------------------------------------------------------------*

and get

---- IMAGE --------------
ulmBLAS/bench/bench11.svg
-------------------------


:navigate: __up__    -> doc:index
           __back__  -> doc:page09/index
           __next__  -> doc:page11/index