================================================
GEMM: From Pure C to SSE Optimized Micro Kernels
================================================

On the next pages we try to discover how __BLIS__ can achieve such great performance. *For this journey we set up our own BLAS implementation!*

In our __ulmBLAS__ project we have implemented a simple matrix-matrix product that follows the ideas described in __BLIS: A Framework for Rapidly Instantiating BLAS Functionality__.

- __Page 1__ How to obtain the __ulmBLAS__ project.
- __Page 2__ Pure C implementation.
- __Page 3__ Naive use of SSE intrinsics.
- __Page 4__ Applying loop unrolling to the previous implementation.
- __Page 5__ Another SSE intrinsics approach, based on the __BLIS__ micro kernel for SSE architectures.
- __Page 6__ Improving pipelining by reordering SSE intrinsics.
- __Page 7__ Limitations of SSE intrinsics.
- __Page 8__ We go nuclear and translate the intrinsics to assembler ourselves!
- __Page 9__ Unrolling the nuke: demo-asm-unrolled.
- __Page 10__ Fine-tuning the unrolled assembler kernel.
- __Page 11__ More fine-tuning of the unrolled assembler kernel.
- __Page 12__ Preparation for adding prefetching: porting the rest of the micro kernel to assembler.
- __Page 13__ Adding prefetching.
- __Page 14__ Benchmarking! Comparing the performance with __MKL__, __ATLAS__, __Eigen__ and the original __BLIS__ micro kernel.

Note that all benchmarks on these pages were generated when __doctool__ transformed the doc files to HTML. All this happened on my MacBook Pro, which has a 2.4 GHz Intel Core 2 Duo (P8600, "Penryn"). The theoretical peak performance of one core is 9.6 GFLOPS.
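
The peak figure quoted above can be reproduced with a little arithmetic. The sketch below is not part of __ulmBLAS__; the function names `peak_gflops` and `gemm_gflops` are made up for illustration. It assumes that one Penryn core can issue one packed SSE2 addition and one packed SSE2 multiplication per cycle, each working on two doubles, and it uses the usual count of 2·m·n·k floating point operations for a matrix-matrix product to turn a measured wall time into a rate that can be compared with the peak.

```c
/*
 * Hypothetical helpers, not taken from ulmBLAS.
 */

/*
 * Theoretical peak of one core in GFLOPS.
 *
 *   clock_ghz         clock frequency in GHz
 *   doubles_per_reg   doubles per SIMD register (2 for 128-bit SSE)
 *   fp_ops_per_cycle  packed floating point instructions issued per cycle
 *                     (assumed: 2, one add and one multiply)
 */
double
peak_gflops(double clock_ghz, int doubles_per_reg, int fp_ops_per_cycle)
{
    return clock_ghz * doubles_per_reg * fp_ops_per_cycle;
}

/*
 * Achieved rate in GFLOPS of an (m x k) times (k x n) matrix product that
 * took `seconds` of wall time.  Each of the m*n result entries needs
 * k multiplications and k additions, hence 2*m*n*k operations.
 */
double
gemm_gflops(int m, int n, int k, double seconds)
{
    double flops = 2.0 * m * n * k;

    return flops / seconds / 1e9;
}
```

With these assumptions `peak_gflops(2.4, 2, 2)` evaluates to 2.4 · 2 · 2 = 9.6, which matches the number stated for the P8600. Dividing a measured `gemm_gflops` value by this peak gives the fraction of the machine that a kernel actually uses.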

__Back__ to the main course

:links:
  BLIS -> https://code.google.com/p/blis/
  BLIS: A Framework for Rapidly Instantiating BLAS Functionality -> http://www.cs.utexas.edu/users/flame/pubs/BLISTOMSrev2.pdf
  doctool -> https://github.com/michael-lehn/DocTool
  Page (\d) -> doc:page0$1/index
  Page (\d\d) -> doc:page$1/index
  ulmBLAS -> https://github.com/michael-lehn/ulmBLAS
  MKL -> https://software.intel.com/en-us/intel-mkl
  ATLAS -> http://math-atlas.sourceforge.net
  Eigen -> http://eigen.tuxfamily.org/index.php?title=Main_Page
  Back -> http://apfel.mathematik.uni-ulm.de/~lehn/sghpc