====================
Improving Pipelining                                                      [TOC]
====================

Loading data from memory (even from the L1 cache) into registers takes several
CPU cycles.  We have several loads in the update loop.  Instead of waiting for
all of them to complete, we can try to do some useful work between the load
calls.  This is another step toward discovering the brilliance of the BLIS
micro kernel!

Select the demo-sse-intrinsics-v2 Branch
========================================
Check out the `demo-sse-intrinsics-v2` branch:

  *--[SHELL(path=ulmBLAS)]--------------------------------------------*
  |                                                                   |
  |  git branch -a                                                    |
  |  git checkout -B demo-sse-intrinsics-v2                        +++|
  |      remotes/origin/demo-sse-intrinsics-v2                        |
  |                                                                   |
  *-------------------------------------------------------------------*

Then we compile the project:

  *--[SHELL(path=ulmBLAS,height=15)]----------------------------------*
  |                                                                   |
  |  make                                                             |
  |                                                                   |
  *-------------------------------------------------------------------*

The Micro Kernel Algorithm
==========================
We make 6 modifications:

- Before we enter the update loop, and before zero-initializing the
  $\mathbb{AB}$ registers, we already load
   - $(a_{0}, a_{1})$ into $\mathbb{tmp}_0$,
   - $(a_{2}, a_{3})$ into $\mathbb{tmp}_1$ and
   - $(b_{0}, b_{1})$ into $\mathbb{tmp}_2$.
- Reload $\mathbb{tmp}_2$ with $(b_{4l+4}, b_{4l+5})$ at about the middle of
  the loop.
- Reload $\mathbb{tmp}_0$ with $(a_{4l+4}, a_{4l+5})$ and $\mathbb{tmp}_1$
  with $(a_{4l+6}, a_{4l+7})$ at about the end of the loop (with an SSE
  multiplication in between, and before two SSE additions).

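
The load placement described above can be sketched in C with SSE2 intrinsics.
This is only a minimal illustration under stated assumptions, not the code from
the branch: the function name `ugemm_4x4_sketch`, the use of unaligned loads,
the broadcast of the $b$ values via `_mm_unpacklo_pd`/`_mm_unpackhi_pd`, the
extra register `tmp3` and the guard `more` (which avoids reading past the end
of the packed panels in the last iteration) are all choices made for this
sketch.  It assumes an x86 target and packed panels in which iteration $l$
uses $a_{4l}, \dots, a_{4l+3}$ and $b_{4l}, \dots, b_{4l+3}$.  Note that an
optimizing compiler may still reorder the loads.

```c
#include <emmintrin.h>   /* SSE2 intrinsics (assumes an x86 target) */

/* Hypothetical sketch: AB (4x4, column-major) = sum over l of a_l * b_l^T,
 * with a_l = A[4l..4l+3] and b_l = B[4l..4l+3] from the packed panels.    */
void
ugemm_4x4_sketch(long kc, const double *A, const double *B, double *AB)
{
    __m128d ab[4][2];    /* ab[j][0]: rows 0,1 of column j; ab[j][1]: rows 2,3 */
    __m128d tmp0, tmp1, tmp2, tmp3, bj, p0, p1;
    long    l;
    int     j;

    if (kc <= 0) {                       /* nothing to accumulate */
        for (j = 0; j < 16; ++j) {
            AB[j] = 0.0;
        }
        return;
    }

    /* Loads issued before the accumulators are zeroed */
    tmp0 = _mm_loadu_pd(A);              /* (a_0, a_1) */
    tmp1 = _mm_loadu_pd(A + 2);          /* (a_2, a_3) */
    tmp2 = _mm_loadu_pd(B);              /* (b_0, b_1) */

    for (j = 0; j < 4; ++j) {
        ab[j][0] = _mm_setzero_pd();
        ab[j][1] = _mm_setzero_pd();
    }

    for (l = 0; l < kc; ++l, A += 4, B += 4) {
        const int more = (l + 1 < kc);   /* is there a next iteration? */

        tmp3 = _mm_loadu_pd(B + 2);      /* (b_{4l+2}, b_{4l+3}) */

        bj       = _mm_unpacklo_pd(tmp2, tmp2);          /* (b_{4l}, b_{4l}) */
        ab[0][0] = _mm_add_pd(ab[0][0], _mm_mul_pd(tmp0, bj));
        ab[0][1] = _mm_add_pd(ab[0][1], _mm_mul_pd(tmp1, bj));

        bj       = _mm_unpackhi_pd(tmp2, tmp2);          /* (b_{4l+1}, b_{4l+1}) */
        ab[1][0] = _mm_add_pd(ab[1][0], _mm_mul_pd(tmp0, bj));
        ab[1][1] = _mm_add_pd(ab[1][1], _mm_mul_pd(tmp1, bj));

        /* About the middle of the loop: reload tmp2 for the next iteration */
        if (more) tmp2 = _mm_loadu_pd(B + 4);            /* (b_{4l+4}, b_{4l+5}) */

        bj       = _mm_unpacklo_pd(tmp3, tmp3);          /* (b_{4l+2}, b_{4l+2}) */
        ab[2][0] = _mm_add_pd(ab[2][0], _mm_mul_pd(tmp0, bj));
        ab[2][1] = _mm_add_pd(ab[2][1], _mm_mul_pd(tmp1, bj));

        /* About the end of the loop: reload tmp0 and tmp1 with a
         * multiplication in between and the two additions afterwards */
        bj = _mm_unpackhi_pd(tmp3, tmp3);                /* (b_{4l+3}, b_{4l+3}) */
        p0 = _mm_mul_pd(tmp0, bj);
        if (more) tmp0 = _mm_loadu_pd(A + 4);            /* (a_{4l+4}, a_{4l+5}) */
        p1 = _mm_mul_pd(tmp1, bj);
        if (more) tmp1 = _mm_loadu_pd(A + 6);            /* (a_{4l+6}, a_{4l+7}) */
        ab[3][0] = _mm_add_pd(ab[3][0], p0);
        ab[3][1] = _mm_add_pd(ab[3][1], p1);
    }

    /* Write the accumulators to the column-major 4x4 result buffer */
    for (j = 0; j < 4; ++j) {
        _mm_storeu_pd(AB + 4 * j,     ab[j][0]);
        _mm_storeu_pd(AB + 4 * j + 2, ab[j][1]);
    }
}
```

Calling `ugemm_4x4_sketch(kc, A, B, AB)` with packed panels of `4*kc` doubles
each fills the 16-element buffer `AB`.  The guard on the reloads costs a branch
per iteration; it is kept here only so that the sketch never reads beyond the
panels.
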
The dgemm_nn Code
=================

:import: ulmBLAS/src/level3/dgemm_nn.c [linenumbers]

Benchmark Results
=================
We run the benchmarks

  *--[SHELL(path=ulmBLAS)]--------------------------------------------*
  |                                                                   |
  |  cd bench                                                         |
  |  ./xdl3blastst > report                                           |
  |  cat report                                                       |
  |                                                                   |
  *-------------------------------------------------------------------*

and filter out the results for the `demo-sse-intrinsics-v2` branch:

  *--[SHELL(path=ulmBLAS/bench)]--------------------------------------*
  |                                                                   |
  |  grep PASS report > demo-sse-intrinsics-v2                        |
  |                                                                   |
  *-------------------------------------------------------------------*

With the gnuplot script

:import: ulmBLAS/bench/bench7.gps

we feed gnuplot

  *--[SHELL(path=ulmBLAS/bench)]--------------------------------------*
  |                                                                   |
  |  gnuplot bench7.gps                                               |
  |                                                                   |
  *-------------------------------------------------------------------*

and get

  ---- IMAGE -------------
  ulmBLAS/bench/bench7.svg
  ------------------------

:navigate: __up__    -> doc:index
           __back__  -> doc:page05/index
           __next__  -> doc:page07/index