ulmBLAS
ulmBLAS is a high performance C++ implementation of BLAS (Basic Linear Algebra Subprograms). Standard-conforming interfaces for C and Fortran are provided.
BLAS defines a set of low-level linear algebra operations and has become a de facto standard API for linear algebra libraries and routines. Several BLAS library implementations have been tuned for specific computer architectures. Highly optimized implementations have been developed by hardware vendors such as Intel (Intel MKL) or AMD (ACML).
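For illustration, here is a minimal sketch of computing a matrix-matrix product through the standard C interface (CBLAS), which any conforming implementation, including ulmBLAS, can provide; the exact header name may vary between implementations:

    #include <cblas.h>   /* standard CBLAS header; exact name may vary */

    int main(void)
    {
        /* 2x2 matrices stored in column-major order */
        double A[4] = {1, 2, 3, 4};
        double B[4] = {5, 6, 7, 8};
        double C[4] = {0, 0, 0, 0};

        /* C <- 1.0*A*B + 0.0*C */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 2,
                    1.0, A, 2,
                         B, 2,
                    0.0, C, 2);
        return 0;
    }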
Related to ulmBLAS is the FLENS C++ library, which provides powerful and efficient matrix/vector types for implementing numerical algorithms. FLENS uses ulmBLAS as its default BLAS implementation.
Higher level numerical libraries and applications like LAPACK, SuperLU, MATLAB, GNU Octave, Mathematica, etc. use BLAS as their computational kernel. The performance of these libraries and applications therefore depends directly on the performance of the underlying BLAS implementation.
Application in Teaching

ulmBLAS was developed for our lectures on High Performance Computing at Ulm University.

In High Performance Computing we are interested in implementing numerical algorithms in such a way that we get as close as possible to the theoretical peak performance of the underlying hardware. This requires adapting the numerical algorithms to the hardware, a topic that is usually not covered in lectures on numerical analysis or numerical linear algebra.

It turns out that the performance of most numerical algorithms depends mainly on a few elementary linear algebra operations. These operations became known as BLAS (Basic Linear Algebra Subprograms).
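For a rough idea of what these elementary operations look like: BLAS Level 1 covers vector operations such as \(y \leftarrow \alpha x + y\) (axpy), Level 2 covers matrix-vector operations such as \(y \leftarrow \beta y + \alpha A x\) (gemv), and Level 3 covers matrix-matrix operations such as \(C \leftarrow \beta C + \alpha A B\) (gemm).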

So obviously an efficient BLAS implementation is crucial for scientific computing and scientific applications. Both commercial (Intel MKL, AMD ACML, Sun Performance Library, ...) and open source (BLIS, ATLAS, ...) implementations of BLAS are available.

So if a lecture is called High Performance Computing it has to deal with BLAS! One way of dealing with BLAS is merely reading papers and using existing BLAS implementations as a black box. But if you really want to understand it you have to implement your own BLAS!

Of course some open source implementations already exist. However, their implementations are either rather complex and therefore not suitable for teaching an undergraduate class, or far from efficient. ulmBLAS was designed to be suitable for teaching while being on par with commercial implementations in performance.

It is important to know that students develop their own ulmBLAS during the semester. This happens step by step (of course these steps are guided) and in two phases:

Phase 1: In the first half of the semester the students develop a pure C implementation of the most important BLAS functions. In this phase they learn about the relevant numerical algorithms, the programming language C, programming tools and fundamental concepts of modern hardware architecture. At the end of this phase they reach about 30 percent of the theoretical peak performance. The code for the matrix-matrix product (one of the most important BLAS functions) consists of less than 450 lines of code.
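As a rough illustration (not the actual course code, which adds cache blocking and packing to reach the quoted performance), the core operation of Phase 1 in pure C looks like this; all names are illustrative:

    /* Sketch: C <- beta*C + alpha*A*B for column-major matrices. */
    void
    dgemm_simple(int m, int n, int k, double alpha,
                 const double *A, int ldA,
                 const double *B, int ldB,
                 double beta, double *C, int ldC)
    {
        /* scale C by beta */
        for (int j = 0; j < n; ++j) {
            for (int i = 0; i < m; ++i) {
                C[i + j*ldC] *= beta;
            }
        }
        /* accumulate alpha*A*B; the j-l-i loop order gives stride-1
           access in the innermost loop for column-major storage */
        for (int j = 0; j < n; ++j) {
            for (int l = 0; l < k; ++l) {
                for (int i = 0; i < m; ++i) {
                    C[i + j*ldC] += alpha * A[i + l*ldA] * B[l + j*ldB];
                }
            }
        }
    }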

Phase 2: Students now focus on hardware optimizations using techniques like loop unrolling, instruction pipelining and SIMD (single instruction, multiple data). In this phase students are introduced to assembly language for sophisticated low-level programming, in particular for realizing the numerical micro kernels at this level. In the end they can achieve about 98 percent of the performance of Intel MKL, the fastest implementation for the Intel platform. In this phase about 10 lines of the original C code were replaced by about 400 lines of assembly code.
How students proceed in this phase is documented in GEMM: From Pure C to SSE Optimized Micro Kernels.
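To give the flavor of such a micro kernel, here is a sketch using SSE2 intrinsics instead of raw assembly; the 2x2 blocking and all names are illustrative, and the real kernels operate on larger blocks of packed panels:

    #include <emmintrin.h>   /* SSE2 intrinsics */

    /* Sketch of a 2x2 micro kernel: C <- beta*C + alpha*A*B, where A is a
       packed 2 x kc panel and B a packed kc x 2 panel. Illustrative only. */
    static void
    dgemm_micro_2x2(long kc, double alpha, const double *A, const double *B,
                    double beta, double *C, long incRowC, long incColC)
    {
        __m128d ab0 = _mm_setzero_pd();   /* column 0 of the 2x2 result */
        __m128d ab1 = _mm_setzero_pd();   /* column 1 of the 2x2 result */

        for (long l = 0; l < kc; ++l) {
            __m128d a  = _mm_loadu_pd(A);       /* a(0,l), a(1,l) */
            __m128d b0 = _mm_load1_pd(&B[0]);   /* broadcast b(l,0) */
            __m128d b1 = _mm_load1_pd(&B[1]);   /* broadcast b(l,1) */
            ab0 = _mm_add_pd(ab0, _mm_mul_pd(a, b0));
            ab1 = _mm_add_pd(ab1, _mm_mul_pd(a, b1));
            A += 2;
            B += 2;
        }

        double AB[4];
        _mm_storeu_pd(&AB[0], ab0);
        _mm_storeu_pd(&AB[2], ab1);

        /* C <- beta*C + alpha*AB */
        for (long j = 0; j < 2; ++j) {
            for (long i = 0; i < 2; ++i) {
                double *c = &C[i*incRowC + j*incColC];
                *c = beta*(*c) + alpha*AB[i + 2*j];
            }
        }
    }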


ulmBLAS is not only a nice project for teaching but can also compete with commercial products, as you can see in the GEMM Benchmarks.
How to Obtain
You can obtain ulmBLAS from GitHub: https://github.com/michaellehn/ulmBLAS
For different stages of the course different branches are available. The master branch contains the final C++ implementation.
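For example, using standard git commands:

    git clone https://github.com/michaellehn/ulmBLAS.git
    cd ulmBLAS
    git branch -a     # list the branches for the different course stages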
Benchmarks
All benchmarks were created on a MacBook Pro with a 2.4 GHz Intel Core 2 Duo (P8600, “Penryn”). The theoretical peak performance of one core is 9.6 GFLOPS.
The effectiveness of solving linear systems of equations was measured in seconds (so less is better). The efficiency of BLAS routines was measured in MFLOPS (so more is better).
Solving Systems of Linear Equations
In undergraduate lectures on numerical linear algebra students learn (among many other things) how to solve systems of linear equations using the \(LU\) factorization with pivoting. The algorithms covered in these lectures are so-called unblocked algorithms that are based on vector and matrix-vector operations. The performance of these algorithms breaks down when the matrix dimensions become larger, because accessing data in memory becomes the bottleneck: most of the time the CPU is waiting for data instead of doing actual computations.
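As a sketch of such an unblocked algorithm (textbook \(LU\) factorization with partial pivoting, assuming column-major storage; not the actual lecture code), note that the dominant work is a rank-1 update that streams over the whole trailing submatrix in every step:

    #include <math.h>

    /* Sketch: unblocked LU with partial pivoting, column-major storage.
       On return L (unit lower) and U overwrite A; p records the row
       interchanges. No zero-pivot check, for brevity. */
    void
    lu_unblocked(int m, int n, double *A, int ldA, int *p)
    {
        int mn = (m < n) ? m : n;

        for (int j = 0; j < mn; ++j) {
            /* find the pivot row in column j */
            int jp = j;
            for (int i = j + 1; i < m; ++i) {
                if (fabs(A[i + j*ldA]) > fabs(A[jp + j*ldA])) {
                    jp = i;
                }
            }
            p[j] = jp;
            /* swap rows j and jp */
            if (jp != j) {
                for (int k = 0; k < n; ++k) {
                    double t      = A[j  + k*ldA];
                    A[j  + k*ldA] = A[jp + k*ldA];
                    A[jp + k*ldA] = t;
                }
            }
            /* compute the multipliers (column of L) */
            for (int i = j + 1; i < m; ++i) {
                A[i + j*ldA] /= A[j + j*ldA];
            }
            /* rank-1 update of the trailing submatrix: memory bound */
            for (int k = j + 1; k < n; ++k) {
                for (int i = j + 1; i < m; ++i) {
                    A[i + k*ldA] -= A[i + j*ldA] * A[j + k*ldA];
                }
            }
        }
    }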
In the lecture on high performance computing the students are introduced to blocked algorithms. These make heavy use of matrix-matrix products (and some variants of them that are specified in the so-called BLAS Level 3). Using blocked algorithms together with efficient implementations of the BLAS Level 3 functions allows achieving almost peak performance.
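In essence, the blocked variant works on a partitioning of \(A\) into blocks \(A_{11}, A_{12}, A_{21}, A_{22}\): the first block column (the panel) is factorized with the unblocked algorithm, a triangular system is solved for \(A_{12}\), and the trailing block is updated via \(A_{22} \leftarrow A_{22} - A_{21} A_{12}\), which is exactly a matrix-matrix product and therefore runs at BLAS Level 3 speed.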
The following benchmark shows how big the gain of blocked over unblocked algorithms can be:
(General) MatrixMatrix Products
Matrices that are non-square or have no special structure such as symmetry are categorized as general matrices. Matrix-matrix products with general matrices have one of the forms

\(C \leftarrow \beta C + \alpha A B\),

\(C \leftarrow \beta C + \alpha A^T B\),

\(C \leftarrow \beta C + \alpha A B^T\) or

\(C \leftarrow \beta C + \alpha A^T B^T\).
An efficient implementation must achieve the same performance in all four cases. In implementations that follow the BLIS approach, such as ulmBLAS, this is accomplished by packing blocks of \(A\) and \(B\) into contiguous buffers: the transposed cases only change the packing routines, while the same micro kernel is used throughout.