===========================
Naive Use of AVX Intrinsics                                                [TOC]
===========================

Using __AVX intrinsics__ instead of SSE intrinsics, we follow the
straightforward approach of __Naive Use of SSE Intrinsics__.

Clone the ulmBLAS Repository
============================

*--[SHELL(hide)]----------------------------------------------------*
|                                                                   |
|  rm -rf ulmBLAS/                                                  |
|                                                                   |
*-------------------------------------------------------------------*

If you have not done so already, clone the __ulmBLAS__ repository:

*--[SHELL]----------------------------------------------------------*
|                                                                   |
|  git clone https://github.com/michael-lehn/ulmBLAS.git            |
|                                                                   |
*-------------------------------------------------------------------*

Select the demo-naive-avx-with-intrinsics Branch
================================================

Again, we run `make clean` before switching branches:

*--[SHELL(height=6)]------------------------------------------------*
|                                                                   |
|  cd ulmBLAS                                                       |
|  make clean                                                       |
|                                                                   |
*-------------------------------------------------------------------*

Then we check out the `demo-naive-avx-with-intrinsics` branch:

*--[SHELL(path=ulmBLAS)]--------------------------------------------*
|                                                                   |
|  git branch -a                                                    |
|  git checkout -B demo-naive-avx-with-intrinsics                +++|
|      remotes/origin/demo-naive-avx-with-intrinsics                |
|                                                                   |
*-------------------------------------------------------------------*

Then we compile the project:

*--[SHELL(path=ulmBLAS,height=15)]----------------------------------*
|                                                                   |
|  make                                                             |
|                                                                   |
*-------------------------------------------------------------------*

The Micro Kernel Algorithm
==========================

We use parameters $m_r = 8$ and $n_r=4$ in the micro kernel.
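
To make the parameters above concrete, here is a minimal sketch of an $8 \times 4$ register block updated by a sequence of rank-1 updates.  It is an illustration only and not the code from the ulmBLAS branch.  The function names `rank1_updates_ref` and `rank1_updates_avx` are hypothetical.  The sketch assumes packed panels in which each column of $A$ occupies 8 consecutive doubles and each row of $B$ occupies 4 consecutive doubles, and it assumes that `AB` is stored column-major.

```c
#include <assert.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

enum { MR = 8, NR = 4 };   /* micro kernel block size: m_r x n_r */

/* Plain C reference: AB (MR x NR, column-major) += sum_l a_l * b_l^T.
   Hypothetical helper, used here only for comparison.               */
static void
rank1_updates_ref(long kc, const double *A, const double *B, double *AB)
{
    for (long l = 0; l < kc; ++l) {
        for (int j = 0; j < NR; ++j) {
            for (int i = 0; i < MR; ++i) {
                AB[i + j*MR] += A[l*MR + i] * B[l*NR + j];
            }
        }
    }
}

/* Same update with AVX intrinsics (hypothetical sketch).  Each column
   of AB lives in two 256-bit registers holding 4 doubles each.       */
static void
rank1_updates_avx(long kc, const double *A, const double *B, double *AB)
{
    assert(kc >= 0);
#ifdef __AVX__
    __m256d ab_lo[NR], ab_hi[NR];

    /* Load the current 8x4 block: rows 0..3 and rows 4..7 per column. */
    for (int j = 0; j < NR; ++j) {
        ab_lo[j] = _mm256_loadu_pd(&AB[j*MR]);
        ab_hi[j] = _mm256_loadu_pd(&AB[j*MR + 4]);
    }
    for (long l = 0; l < kc; ++l) {
        /* One column of A: 8 doubles in two registers. */
        __m256d a_lo = _mm256_loadu_pd(&A[l*MR]);
        __m256d a_hi = _mm256_loadu_pd(&A[l*MR + 4]);

        for (int j = 0; j < NR; ++j) {
            /* Broadcast b_j to all four lanes, then multiply and add. */
            __m256d b = _mm256_broadcast_sd(&B[l*NR + j]);
            ab_lo[j] = _mm256_add_pd(ab_lo[j], _mm256_mul_pd(a_lo, b));
            ab_hi[j] = _mm256_add_pd(ab_hi[j], _mm256_mul_pd(a_hi, b));
        }
    }
    /* Write the block back to memory. */
    for (int j = 0; j < NR; ++j) {
        _mm256_storeu_pd(&AB[j*MR],     ab_lo[j]);
        _mm256_storeu_pd(&AB[j*MR + 4], ab_hi[j]);
    }
#else
    /* Built without AVX support: fall back to the plain C loop. */
    rank1_updates_ref(kc, A, B, AB);
#endif
}
```

With GCC or Clang the intrinsic path is enabled by compiling with `-mavx`; otherwise the sketch falls back to the plain C loop.  Calling `rank1_updates_avx(kc, A, B, AB)` should give the same block as `rank1_updates_ref` for the same inputs.  The unaligned load and store variants are used so that the sketch makes no assumption about the alignment of the buffers.
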
We merely optimize the update step

---- LATEX ---------------------------------------------------------------------
\mathbf{AB} \leftarrow \mathbf{AB} + \begin{pmatrix}
                                     a_{8l  } \\
                                     a_{8l+1} \\
                                     a_{8l+2} \\
                                     a_{8l+3} \\
                                     a_{8l+4} \\
                                     a_{8l+5} \\
                                     a_{8l+6} \\
                                     a_{8l+7}
                                     \end{pmatrix}
                                     \begin{pmatrix}
                                     b_{4l}, & b_{4l+1}, & b_{4l+2}, & b_{4l+3}
                                     \end{pmatrix}
--------------------------------------------------------------------------------

by using AVX intrinsics.  Looking at the original C code

---- CODE(type=c) ------------------------
for (l=0; l report | | grep PASS report > demo-naive-avx-with-intrinsics | | grep "\ \-\-\-\-\-$" report > refBLAS | | | *-------------------------------------------------------------------*

With the gnuplot script

:import: ulmBLAS/bench/bench.gps

we feed gnuplot

*--[SHELL(path=ulmBLAS/bench)]--------------------------------------*
|                                                                   |
|  gnuplot bench.gps                                                |
|                                                                   |
*-------------------------------------------------------------------*

and get

---- IMAGE ------------
ulmBLAS/bench/bench.png
-----------------------

:links:
    AVX intrinsics -> https://software.intel.com/sites/landingpage/IntrinsicsGuide/
    Naive Use of SSE Intrinsics -> http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/page03/index.html
    ulmBLAS -> https://github.com/michael-lehn/ulmBLAS

:navigate:
    __up__ -> doc:index