
ulmBLAS

ulmBLAS is a high performance C++ implementation of BLAS (Basic Linear Algebra Subprograms). Standard-conforming interfaces for C and Fortran are provided.

BLAS defines a set of low-level linear algebra operations and has become a de facto standard API for linear algebra libraries and routines. Several BLAS implementations have been tuned for specific computer architectures; highly optimized implementations have been developed by hardware vendors such as Intel (Intel MKL) and AMD (ACML).

Related to ulmBLAS is the FLENS C++ library, which provides powerful and efficient matrix/vector types for implementing numerical algorithms. FLENS uses ulmBLAS as its default BLAS implementation.

Higher-level numerical libraries and applications like LAPACK, SuperLU, Matlab, GNU Octave, Mathematica, etc. use BLAS as their computational kernel. The performance of these libraries and applications therefore depends directly on the performance of the underlying BLAS implementation.

Application in Teaching

How to Obtain

You can obtain ulmBLAS from GitHub: https://github.com/michael-lehn/ulmBLAS

For different stages of the course, different branches are available. The master branch contains the final C++ implementation.
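For example, to get the final implementation from the master branch:

    git clone https://github.com/michael-lehn/ulmBLAS.git
    cd ulmBLAS
    git checkout master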

Benchmarks

All benchmarks were created on a MacBook Pro with a 2.4 GHz Intel Core 2 Duo (P8600, “Penryn”). The theoretical peak performance of one core is 9.6 GFLOPS (2.4 GHz times 4 double-precision floating point operations per cycle with SSE).

The effectiveness of solving linear systems of equations was measured in seconds (less is better). The efficiency of the BLAS routines was measured in MFLOPS (more is better).
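For the matrix-matrix product benchmarks, the MFLOPS rate is typically derived from the roughly \(2mnk\) floating point operations needed to multiply an \(m \times k\) with a \(k \times n\) matrix:

\[
    \text{MFLOPS} = \frac{2\,m\,n\,k}{t \cdot 10^{6}},
\]

where \(t\) is the measured run time in seconds.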

Solving Systems of Linear Equations

In undergraduate lectures on numerical linear algebra, students learn (among many other things) how to solve systems of linear equations using the \(LU\) factorization with pivoting. The algorithms covered in these lectures are so-called unblocked algorithms, which are based on vector and matrix-vector operations. The performance of these algorithms breaks down when the matrix dimensions become larger, because access to memory becomes the bottleneck: most of the time the CPU is waiting for data instead of doing actual computations.
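To make this concrete, here is a minimal textbook-style sketch of an unblocked \(LU\) factorization with partial pivoting (similar in spirit to LAPACK's getf2). It is not taken from ulmBLAS; the function name and the column-major storage convention are assumptions for illustration. Note how every step sweeps over entire columns and the whole trailing matrix, which is what makes the algorithm memory bound for large matrices.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <utility>

    // Unblocked LU factorization with partial pivoting: only vector
    // (BLAS level 1) and rank-1 / matrix-vector style (BLAS level 2)
    // operations are used.  A is an m x n matrix in column-major storage
    // with leading dimension lda; p receives the pivot row chosen in each
    // step.  This is a plain textbook sketch, not the actual ulmBLAS code.
    void
    lu_unblocked(std::size_t m, std::size_t n, double *A, std::size_t lda,
                 std::size_t *p)
    {
        const std::size_t mn = std::min(m, n);

        for (std::size_t j = 0; j < mn; ++j) {
            // Find the pivot: the entry of largest magnitude in column j.
            std::size_t piv = j;
            for (std::size_t i = j + 1; i < m; ++i) {
                if (std::fabs(A[i + j*lda]) > std::fabs(A[piv + j*lda])) {
                    piv = i;
                }
            }
            p[j] = piv;

            // Swap rows j and piv.
            if (piv != j) {
                for (std::size_t k = 0; k < n; ++k) {
                    std::swap(A[j + k*lda], A[piv + k*lda]);
                }
            }

            // Scale the column below the diagonal (vector operation).
            for (std::size_t i = j + 1; i < m; ++i) {
                A[i + j*lda] /= A[j + j*lda];
            }

            // Rank-1 update of the trailing matrix (BLAS level 2 work).
            for (std::size_t k = j + 1; k < n; ++k) {
                for (std::size_t i = j + 1; i < m; ++i) {
                    A[i + k*lda] -= A[i + j*lda] * A[j + k*lda];
                }
            }
        }
    }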

In the lecture on high performance computing, the students are introduced to blocked algorithms. These make heavy use of matrix-matrix products (and of some variants specified in the so-called BLAS Level 3). Using blocked algorithms together with efficient implementations of the BLAS Level 3 functions makes it possible to achieve almost peak performance.
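The following sketch shows the structure of a blocked (right-looking) \(LU\) factorization. As a simplification, pivoting is omitted and the Level 3 operations are written as plain loops so that the block structure stays visible; this is an illustrative assumption, not ulmBLAS code. In a real implementation the panel would be factorized with the unblocked, pivoted routine, and steps (2) and (3) would be calls to the BLAS Level 3 routines dtrsm and dgemm.

    #include <algorithm>
    #include <cstddef>

    // Sketch of a blocked (right-looking) LU factorization of an n x n
    // matrix A in column-major storage with leading dimension lda and block
    // size bs.  Pivoting is omitted to keep the block structure visible.
    void
    lu_blocked(std::size_t n, double *A, std::size_t lda, std::size_t bs = 64)
    {
        for (std::size_t j = 0; j < n; j += bs) {
            const std::size_t jb = std::min(bs, n - j);

            // (1) Factorize the current panel A[j:n, j:j+jb]
            //     (unblocked BLAS level 1/2 work on a tall, narrow block).
            for (std::size_t k = j; k < j + jb; ++k) {
                for (std::size_t i = k + 1; i < n; ++i) {
                    A[i + k*lda] /= A[k + k*lda];
                    for (std::size_t l = k + 1; l < j + jb; ++l) {
                        A[i + l*lda] -= A[i + k*lda] * A[k + l*lda];
                    }
                }
            }

            // (2) Triangular solve for the block row right of the panel:
            //     U12 = inv(L11) * A12   (dtrsm in BLAS terms).
            for (std::size_t l = j + jb; l < n; ++l) {
                for (std::size_t k = j; k < j + jb; ++k) {
                    for (std::size_t i = k + 1; i < j + jb; ++i) {
                        A[i + l*lda] -= A[i + k*lda] * A[k + l*lda];
                    }
                }
            }

            // (3) Update the trailing matrix: A22 -= L21 * U12
            //     (dgemm in BLAS terms).
            for (std::size_t l = j + jb; l < n; ++l) {
                for (std::size_t k = j; k < j + jb; ++k) {
                    for (std::size_t i = j + jb; i < n; ++i) {
                        A[i + l*lda] -= A[i + k*lda] * A[k + l*lda];
                    }
                }
            }
        }
    }

For large \(n\), almost all of the work is done in the trailing update of step (3), which is a matrix-matrix product. This is why the blocked algorithm can run at nearly the speed of an efficient GEMM and hence close to peak performance.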

The following benchmark shows how large the gain of blocked over unblocked algorithms can be:

(General) Matrix-Matrix Products

If matrices are non-square or not symmetric, they are categorized as general matrices. General matrix-matrix products have the forms

\[
    C \leftarrow \beta C + \alpha A B, \qquad
    C \leftarrow \beta C + \alpha A^T B, \qquad
    C \leftarrow \beta C + \alpha A B^T, \qquad
    C \leftarrow \beta C + \alpha A^T B^T.
\]

All of these variants must achieve the same performance.
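For illustration, this is how the \(C \leftarrow \beta C + \alpha A^T B\) variant could be invoked through a standard CBLAS-style C interface. The cblas.h header name and build setup are assumptions for this sketch and not specific to ulmBLAS.

    #include <cblas.h>   // standard CBLAS interface (header name assumed)
    #include <vector>

    int
    main()
    {
        // Compute C <- beta*C + alpha*A^T*B with column-major storage.
        const int m = 3, n = 2, k = 4;
        const double alpha = 1.5, beta = 2.0;

        std::vector<double> A(k*m, 1.0);   // A is k x m, so op(A) = A^T is m x k
        std::vector<double> B(k*n, 1.0);   // B is k x n
        std::vector<double> C(m*n, 1.0);   // C is m x n

        cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                    m, n, k,
                    alpha, A.data(), k,
                           B.data(), k,
                    beta,  C.data(), m);
        return 0;
    }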

Symmetric Matrix-Matrix Product

Triangular Matrix-Matrix Product