=======
ulmBLAS                                                                  [TOC]
=======

ulmBLAS is a high performance C++ implementation of __BLAS__ (Basic Linear
Algebra Subprograms).  Standard-conforming interfaces for C and Fortran are
provided.

__BLAS__ defines a set of low-level linear algebra operations and has become a
de facto standard API for linear algebra libraries and routines.  Several BLAS
library implementations have been tuned for specific computer architectures.
Highly optimized implementations have been developed by hardware vendors such
as Intel (__Intel MKL__) or AMD (__ACML__).  Higher level numerical libraries
and applications like __LAPACK__, __SuperLU__, __Matlab__, __GNU Octave__,
__Mathematica__, etc. use BLAS as their computational kernel.  The performance
of these libraries and applications therefore directly depends on the
performance of the underlying BLAS implementation.

Application in Teaching
=======================

- *ulmBLAS* was developed for our lectures on __High Performance Computing__
  at __Ulm University__.

- In High Performance Computing we are interested in realizing numerical
  algorithms in such a way that we get as close as possible to the theoretical
  peak performance of the underlying hardware.  This requires adapting the
  numerical algorithms.  The usual lectures on numerical analysis or numerical
  linear algebra do not cover this.

- It turns out that performance depends on only a few elementary linear
  algebra operations.  These operations became known as BLAS (Basic Linear
  Algebra Subroutines).

- So obviously an efficient BLAS implementation is crucial for scientific
  computing and scientific applications.  Both commercial (__Intel MKL__, AMD
  __ACML__, Sun Performance Library, ...) and open source (__BLIS__,
  __ATLAS__, ...) implementations of BLAS are available.

- *So if a lecture is called _High Performance Computing_ it has to deal with
  BLAS!*  One way of dealing with BLAS is merely reading papers and using
  existing BLAS implementations as a black box.  *But if you really want to
  understand it you have to implement your own BLAS!*

- Of course some open source implementations already exist.  However, they are
  either _kind of complex_ and therefore not suitable for teaching an
  undergraduate class, or far from efficient.  `ulmBLAS` was designed to be
  suitable for teaching while being on par with commercial implementations in
  performance.

- It is important to know that students develop their own *ulmBLAS* during the
  semester.  This happens step by step (of course these steps are guided) and
  in two phases:

  - *Phase 1:*  In the first half of the semester the students develop a pure
    C implementation of the most important BLAS functions.  In this phase they
    learn about the relevant numerical algorithms, the programming language C,
    programming tools and fundamental concepts of modern hardware
    architecture.  At the end of this phase they reach about 30 percent of the
    theoretical peak performance.  The code for the matrix-matrix product (one
    of the most important BLAS functions) has less than 450 lines.  A sketch
    of a typical starting point is shown after this list.

  - *Phase 2:*  Students now focus on hardware optimizations using techniques
    like loop unrolling, instruction pipelining and SIMD (single instruction,
    multiple data).  In this phase students get introduced to assembly
    language for doing sophisticated low-level programming, in particular for
    realizing the numerical micro kernels at this level.  At the end they can
    achieve about 98 percent of the performance of __Intel MKL__, which is the
    fastest implementation for the Intel platform.  In this phase about 10
    lines of the original C code were replaced by about 400 lines of assembly
    code.  How students proceed in this phase is documented in __GEMM: From
    Pure C to SSE Optimized Micro Kernels__.

- ulmBLAS is not only a nice project for teaching but can also compete with
  commercial products, as you can see in the __GEMM Benchmarks__.
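The following is a minimal sketch of what such a Phase 1 starting point might
look like: a plain, unoptimized matrix-matrix product $C \leftarrow \beta C +
\alpha A B$ for column-major matrices.  The function name `dgemm_ref` and its
argument order are made up for this illustration; the actual ulmBLAS code is
organized differently and subsequently blocked and tuned as described above.

---- CODE (type=c) ------------------------------------------------------------
/* Illustrative sketch only (not the ulmBLAS implementation):
 * compute C <- beta*C + alpha*A*B for column-major matrices with
 * leading dimensions ldA, ldB, ldC.                                */
void
dgemm_ref(int m, int n, int k,
          double alpha,
          const double *A, int ldA,
          const double *B, int ldB,
          double beta,
          double *C, int ldC)
{
    /* Scale C by beta; treat beta==0 as an explicit overwrite. */
    for (int j=0; j<n; ++j) {
        for (int i=0; i<m; ++i) {
            C[i+j*ldC] = (beta==0.0) ? 0.0 : beta*C[i+j*ldC];
        }
    }
    if (alpha==0.0) {
        return;
    }
    /* Accumulate alpha*A*B column by column. */
    for (int j=0; j<n; ++j) {
        for (int l=0; l<k; ++l) {
            double tmp = alpha*B[l+j*ldB];
            for (int i=0; i<m; ++i) {
                C[i+j*ldC] += tmp*A[i+l*ldA];
            }
        }
    }
}
--------------------------------------------------------------------------------

Such a version is easy to understand and correct, but its performance is
limited by memory traffic.  The blocking and micro-kernel work of the course
is about keeping the CPU busy with data that already resides in registers and
caches.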
How to Obtain
=============

You can obtain ulmBLAS from GitHub:

__https://github.com/michael-lehn/ulmBLAS__

For the different stages of the course different branches are available.  The
master branch contains the final C++ implementation.

Benchmarks
==========

All benchmarks were created on a MacBook Pro with a 2.4 GHz Intel Core 2 Duo
(P8600, “Penryn”).  The theoretical peak performance of one core is 9.6
GFLOPS.  Effectiveness of solving linear systems of equations was measured in
seconds (so less is better).  Efficiency of BLAS routines was measured in
MFLOPS (so more is better).

Solving Systems of Linear Equations
-----------------------------------

In undergraduate lectures on numerical linear algebra students learn (among
many other things) how to solve systems of linear equations using the $LU$
factorization with pivoting.  The algorithms covered in these lectures are
so-called unblocked algorithms that are based on vector and matrix-vector
operations.  The performance of these algorithms breaks down when the matrix
dimensions become bigger.  That is because accessing the data in memory
becomes the bottleneck: most of the time the CPU is waiting for data instead
of doing actual computations.

In the lecture on high performance computing the students are introduced to
blocked algorithms.  These make heavy use of matrix-matrix products (and some
variants of it that are specified in BLAS Level 3).  Using blocked algorithms
and efficient implementations of the BLAS Level 3 functions allows achieving
almost peak performance.  The following benchmark shows how big the gain of
blocked over unblocked algorithms can be:

---- IMAGE -------------------------
bench/benchmarks/lu-blocked-time.svg
------------------------------------

(General) Matrix-Matrix Products
--------------------------------

If matrices are non-square or not symmetric they are categorized as general
matrices.  Matrix-matrix products come in the forms

- $C \leftarrow \beta C + \alpha A B$,
- $C \leftarrow \beta C + \alpha A^T B$,
- $C \leftarrow \beta C + \alpha A B^T$ or
- $C \leftarrow \beta C + \alpha A^T B^T$.

All four variants must achieve the same performance.
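In BLAS all of these forms are handled by a single routine (`dgemm` for double
precision); the transpose cases are selected through flags.  As a small usage
sketch (assuming the usual CBLAS-style C interface with a `cblas.h` header;
the exact header and link line depend on how ulmBLAS is installed), the
following computes $C \leftarrow \beta C + \alpha A^T B$ for $2 \times 2$
matrices:

---- CODE (type=c) ------------------------------------------------------------
#include <cblas.h>   /* assumption: a standard CBLAS header is available */
#include <stdio.h>

int
main(void)
{
    /* Column-major 2x2 matrices stored as flat arrays. */
    double A[] = {1.0, 2.0,    /* first column  */
                  3.0, 4.0};   /* second column */
    double B[] = {5.0, 6.0,
                  7.0, 8.0};
    double C[] = {0.0, 0.0,
                  0.0, 0.0};

    /* C <- 0.0*C + 1.0*A^T*B  with m = n = k = 2 and leading dimensions 2. */
    cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                2, 2, 2,
                1.0, A, 2,
                     B, 2,
                0.0, C, 2);

    /* Expected result: C = [17 23; 39 53]. */
    printf("C = [%g %g; %g %g]\n", C[0], C[2], C[1], C[3]);
    return 0;
}
--------------------------------------------------------------------------------

The remaining variants differ only in the `CblasTrans`/`CblasNoTrans` flags.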
---- IMAGE ---------------
bench/benchmarks/xgemm.svg
--------------------------

---- IMAGE ------------------
bench/benchmarks/xgemm_At.svg
-----------------------------

---- IMAGE ------------------
bench/benchmarks/xgemm_Bt.svg
-----------------------------

---- IMAGE ---------------------
bench/benchmarks/xgemm_At_Bt.svg
--------------------------------

:links: BLAS                       -> http://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms
        Intel MKL                  -> http://en.wikipedia.org/wiki/Math_Kernel_Library
        ACML                       -> http://en.wikipedia.org/wiki/AMD_Core_Math_Library
        LAPACK                     -> http://en.wikipedia.org/wiki/LAPACK
        SuperLU                    -> http://crd-legacy.lbl.gov/~xiaoye/SuperLU/
        Matlab                     -> http://en.wikipedia.org/wiki/Matlab
        GNU Octave                 -> http://en.wikipedia.org/wiki/GNU_Octave
        Mathematica                -> http://en.wikipedia.org/wiki/Mathematica
        High Performance Computing -> http://www.uni-ulm.de/mawi/mawi-numerik/lehre/wintersemester-20142015/vorlesung-softwaregrundlagen-high-performance-computing.html
        Ulm University             -> http://www.uni-ulm.de/en/homepage.html
        BLIS                       -> https://code.google.com/p/blis/
        ATLAS                      -> http://math-atlas.sourceforge.net
        GEMM: From Pure C to SSE Optimized Micro Kernels -> http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/
        GEMM Benchmarks            -> http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/page14/index.html#toc5
        https://github.com/michael-lehn/ulmBLAS -> https://github.com/michael-lehn/ulmBLAS