Basic Linear Algebra Subprograms (BLAS)
BLAS is a de facto standard that defines a set of linear algebra operations, mainly matrix/vector operations such as dot products, matrix-vector products and matrix-matrix products. Hardware vendors provide highly tuned BLAS implementations for their architectures, e.g. Intel MKL, AMD ACML or the Sun Performance Library. Implementations like NVIDIA cuBLAS exploit GPUs for the computation. In addition, open source implementations like BLIS, ATLAS, OpenBLAS or ulmBLAS exist that are on par with vendor tuned BLAS implementations.
How much performance can be achieved in theory depends on various factors, for instance how well caches, vectorization and parallelism can be exploited. The potential for performance depends on the type of operation. In this respect BLAS divides the operation types into three levels:
-
BLAS Level 1 contains all linear algebra operations that require \(\mathcal{O}(N)\) data and \(\mathcal{O}(N)\) floating point operations. These are vector-vector operations like vector sums or dot products.
-
BLAS Level 2 contains all linear algebra operations that require \(\mathcal{O}(N^2)\) data and \(\mathcal{O}(N^2)\) floating point operations. These are operations like matrix-vector products.
-
BLAS Level 3 contains all linear algebra operations that require \(\mathcal{O}(N^2)\) data and \(\mathcal{O}(N^3)\) floating point operations. These are operations like matrix-matrix products. Only Level 3 operations can efficiently exploit caches. That is because there is a considerable gap between the amount of data and the amount of floating point operations: once a cache is filled with \(\mathcal{O}(N^2)\) data it gets used for \(\mathcal{O}(N^3)\) floating point operations.
With Level 1 or Level 2 functions you cannot exploit caches. An optimized implementation will achieve about 5-15% of the theoretical peak performance. That means 85-95% of the time the CPU is waiting for data.
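To put numbers to this: for \(N = 1000\) in double precision, a matrix-matrix product \(C \leftarrow C + A\,B\) touches \(3 N^2 \cdot 8\,\text{Bytes} \approx 24\,\text{MB}\) of data but performs \(2 N^3 = 2 \cdot 10^9\) floating point operations, i.e. each matrix element gets reused roughly 667 times once it sits in a cache. A dot product, in contrast, performs only two floating point operations per element loaded.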
So achieving high performance in numerical linear algebra is twofold:
-
The Numerical Math Part
First you need to organize an algorithm such that it is dominated by BLAS Level 3 operations. This can be achieved by considering algorithms that work on matrix blocks rather than individual matrix elements. That is exactly what libraries like LAPACK are doing (a small sketch of such a blocked algorithm follows below).
-
The Computer Science Part
Second you need an optimized BLAS implementation that actually can achieve near peak performance for BLAS Level 3 functions.
Note that this separation allows mathematicians and computer scientists to cooperate independently of each other. The mathematician just needs to know that BLAS Level 3 gives great performance. The computer scientist just needs to know that it is important that BLAS Level 3 functions actually achieve great performance. But of course it is always favorable to look beyond one's own nose!
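To make the "numerical math part" concrete, here is a minimal sketch (not taken from LAPACK or FLENS) of a blocked update \(C \leftarrow C + A\,B\). It partitions all matrices into blocks and delegates every block update to the Level 3 function blas::mm, which is introduced further down on this page. The block size and the use of FLENS range views (the underscore notation) are illustrative assumptions:
#include <algorithm>
#include <flens/flens.cxx>
using namespace flens;
typedef GeMatrix<FullStorage<double> >  RealGeMatrix;
// Blocked C += A*B: every call to blas::mm works on O(BS^2) data and
// performs O(BS^3) floating point operations, so the blocks can be
// reused from cache.
void
blockedGemm(const RealGeMatrix &A, const RealGeMatrix &B, RealGeMatrix &C)
{
    const Underscore<RealGeMatrix::IndexType>  _;
    const int BS = 256;                     // illustrative block size
    const int m  = C.numRows();
    const int n  = C.numCols();
    const int k  = A.numCols();
    for (int i=1; i<=m; i+=BS) {
        const int ib = std::min(BS, m-i+1);
        for (int j=1; j<=n; j+=BS) {
            const int jb = std::min(BS, n-j+1);
            for (int l=1; l<=k; l+=BS) {
                const int lb = std::min(BS, k-l+1);
                const auto A_ = A(_(i,i+ib-1), _(l,l+lb-1));
                const auto B_ = B(_(l,l+lb-1), _(j,j+jb-1));
                auto       C_ = C(_(i,i+ib-1), _(j,j+jb-1));
                // Level 3 BLAS on the blocks: C_ += A_*B_
                blas::mm(NoTrans, NoTrans, 1.0, A_, B_, 1.0, C_);
            }
        }
    }
}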
We first give an overview of BLAS and how to make use of it with FLENS. Only a few BLAS operations will be shown. For a complete overview have a look at the FLENS-BLAS page. By default FLENS will use ulmBLAS as computational backend. Optionally an external BLAS implementation can be used. On later pages we will show how algorithms can be developed such that they exploit BLAS Level 3 functions.
Performance: It's BLAS, not the Compiler
As a rule of thumb: in numerical linear algebra you never get the performance from the compiler. You only get it from using BLAS.
For an undergraduate class on high performance computing we developed, step by step, an optimized implementation of the matrix-matrix product. The steps are documented on GEMM: From Pure C to SSE Optimized Micro Kernels.
Performance benchmarks will certainly vary between different architectures. But as they are mainly influenced by how fast data can be moved from memory to the caches and from the caches to the CPU, the relative comparisons stay comparable:
-
With a naive C implementation you achieve about 10% of the theoretical peak performance, even if you try to reduce cache misses by traversing the matrix elements in the right order. The Netlib reference BLAS implementation (Netlib RefBLAS) is such a naive but not stupid implementation: for column major matrices it always traverses the matrices column wise (a sketch of such a loop ordering follows after this list).
-
With a pure but clever C implementation and a really, really good C compiler you can achieve about 30% of the theoretical peak performance. The compiler cannot do better unless you build in some expert knowledge about numerical linear algebra. In the class project the best possible result achievable through compiler optimizations was denoted demo-naive-sse-with-intrinsics:
Actually we bet that you won't find any compiler that can achieve this performance. In the class project we simulated what a compiler could achieve under optimal assumptions about alignment, restricted access to memory, optimal use of registers, etc.
-
An optimized BLAS achieves more than 90% of the theoretical peak performance. In the class project ulmBLAS, 10 lines of C code were replaced with about 500 lines of assembly code. The assembly code contains expert knowledge about numerical linear algebra that you cannot expect from a general purpose compiler. This resulted in:
As ulmBLAS is derived from BLIS it is no coincidence that we achieve the same performance.
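For reference, a naive but cache-aware loop ordering in the spirit of the Netlib RefBLAS might look like the following sketch (plain C++, column major storage assumed; this is an illustration, not code from the class project):
#include <cstddef>
// Naive C += A*B for column major matrices: all matrices are traversed
// column wise (the innermost loop walks down one column of C and A),
// but there is no blocking and no vectorization.
void
gemmNaive(std::size_t m, std::size_t n, std::size_t k,
          const double *A, std::size_t ldA,
          const double *B, std::size_t ldB,
          double *C, std::size_t ldC)
{
    for (std::size_t j=0; j<n; ++j) {
        for (std::size_t l=0; l<k; ++l) {
            for (std::size_t i=0; i<m; ++i) {
                C[i+j*ldC] += A[i+l*ldA]*B[l+j*ldB];
            }
        }
    }
}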
All the benchmarks are for single threaded implementations. Multithreading for BLAS Level 3 scales nearly perfectly. That's because each thread can compute separate blocks of the result matrix and race conditions are not an issue.
About FLENS-BLAS
BLAS functions are provided in FLENS through three layers:
-
CXXBLAS functions in namespace cxxblas
These provide a low-level C++ BLAS interface. These functions work on raw data pointers and by default call the corresponding ulmBLAS functions. Optionally you can compile and link your code such that an external and possibly vendor tuned BLAS function gets called.
The CXXBLAS interface is similar to the CBLAS interface. It uses only very few C++ features, e.g. templates are used merely for the element type. It does not use structs, classes or even enum types. So compared to CBLAS there is, except for the use of templates, no abstraction.
-
FLENS-BLAS functions in namespace flens::blas
These functions provide a high-level BLAS interface for the FLENS matrix and vector types such as DenseVector and GeMatrix. These functions internally call CXXBLAS functions.
-
Overloaded operators
These have the advantage of allowing a more expressive notation. In FLENS we are very pedantic that this kind of syntactic sugar does not lead to any hidden performance penalty. Using overloaded operators for BLAS should not be misunderstood: they are certainly not provided so that a user does not need to know about BLAS. If you want to do high performance computing you always have to think in BLAS. We will discuss overloaded operators in more detail on the next page.
In all cases the runtime overhead (if any) is negligible compared to calling a BLAS implementation directly. Runtime assertions merely check the consistency of the arguments.
Compiling with -DNDEBUG removes all of these checks. It is common practice to use runtime assertions during the test phase and to generate production code with -DNDEBUG.
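As an illustration, consider the following sketch (the mismatched vector lengths are deliberate; it assumes that blas::axpy checks that both vectors have equal length):
#include <flens/flens.cxx>
using namespace flens;
int
main()
{
    DenseVector<Array<double> >  x(4), y(5);
    x = 1, 2, 3, 4;
    // The lengths of x and y differ on purpose: with runtime checks enabled
    // this call should abort with a failed assertion; compiled with -DNDEBUG
    // no such check is performed.
    blas::axpy(2.0, x, y);
}
Compiling as in the examples below with g++ -Wall -std=c++11 -I../.. keeps the checks; adding -DNDEBUG removes them.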
BLAS Level 1 Functions
This BLAS level originally deals with vector-vector operations only. For convenience, FLENS-BLAS in some cases adds matrix variants that, for instance, allow copying or adding matrices.
FUNCTION | DESCRIPTION
copy | \(\vec{y} \leftarrow \vec{x}\). Copies a vector \(\vec{x}\) to a vector \(\vec{y}\) or a matrix \(A\) to a matrix \(B\).
axpy | \(\vec{y} \leftarrow \alpha \vec{x} + \vec{y}\). axpy is short for "alpha x plus y". FLENS also provides a variant for GeMatrix that computes \(Y \leftarrow \alpha X + Y\).
scal | \(\vec{x} \leftarrow \alpha \vec{x}\). Scales a vector by a constant. FLENS also provides a variant for GeMatrix that computes \(X \leftarrow \alpha X\).
swap | Interchanges two vectors, i.e. \(\vec{x} \leftrightarrow \vec{y}\).
asum | Takes the sum of the absolute values, i.e. computes \(\sum\limits_{i} |x_i|\).
dot, dotu | \(\alpha \leftarrow \bar{\vec{x}}^T \vec{y}\) (dot) or \(\alpha \leftarrow \vec{x}^T \vec{y}\) (dotu). Forms the dot product of two vectors, i.e. computes \(\sum\limits_{i} \bar{x}_i y_i\) or \(\sum\limits_{i} x_i y_i\).
nrm2 | Computes the Euclidean norm of a vector, i.e. \(\sqrt{\sum\limits_{i} |x_i|^2}\).
rot | Applies a plane rotation.
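To illustrate the GeMatrix variants of copy and axpy mentioned in the table, here is a minimal sketch; it assumes that these variants take a Transpose flag as their first argument, analogous to blas::mv and blas::mm below:
#include <flens/flens.cxx>
using namespace flens;
int
main()
{
    GeMatrix<FullStorage<double> >  X(2,2), Y(2,2);
    X = 1, 2,
        3, 4;
    // Y = X       (matrix variant of copy)
    blas::copy(NoTrans, X, Y);
    // Y += 2*X    (matrix variant of axpy)
    blas::axpy(NoTrans, 2.0, X, Y);
}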
BLAS Level 1 Example: Adding Vectors with Operators
In the following examples we compute these vector sums:
-
\(\vec{v} \leftarrow \vec{x} + \vec{y} + \vec{z}\) and
-
\(\vec{v} = 2\vec{x} + 3\vec{y} + 4\vec{z}\)
Using overloaded operators this can be coded in an expressive way. People from high performance computing will get nervous if they see such code. They are more concerned about how the computations are carried out and whether temporary objects get created; ease-of-use comes second in this field. We will show in the next example that no temporaries get created and what BLAS routines are used for the evaluation.
#include <flens/flens.cxx>
#include <iostream>
using namespace flens;
using namespace std;
int
main()
{
DenseVector<Array<double> > x(4), y(4), z(4), v(4);
x = 1, 2, 3, 4;
y = 5, 6, 7, 8;
z = 9, 10, 11, 12;
cout << "x = " << x << endl;
cout << "y = " << y << endl;
cout << "z = " << z << endl;
//
// compute v = x + y + z
//
v = x + y + z;
cout << "v = x + y + z = " << v << endl;
//
// compute v = 2*x + 3*y + 4*z
//
v = 2*x + 3*y + 4*z;
cout << "v = 2*x + 3*y + 4*z = " << v << endl;
}
This leads to:
$shell> cd flens/examples $shell> g++ -Wall -std=c++11 -I../.. vecsum_operators.cc $shell> ./a.out x = 1 2 3 4 y = 5 6 7 8 z = 9 10 11 12 v = x + y + z = 15 18 21 24 v = 2*x + 3*y + 4*z = 53 62 71 80
BLAS Level 1 Example: Adding Vectors with FLENS-BLAS
In the next example we show how the expressions v = x + y + z and v = 2*x + 3*y + 4*z are computed internally by calling FLENS-BLAS functions:
-
\(\vec{v} \leftarrow \vec{x} + \vec{y} + \vec{z}\)
This gets computed using blas::copy and blas::axpy:
-
\(\vec{v} \leftarrow \vec{x}\) (blas::copy)
-
\(\vec{v} \leftarrow \vec{v} + \vec{y}\) (blas::axpy)
-
\(\vec{v} \leftarrow \vec{v} + \vec{z}\) (blas::axpy)
-
-
\(\vec{v} = 2\vec{x} + 3\vec{y} + 4\vec{z}\)
This gets computed by zero-fill and blas::axpy:
-
\(\vec{v} \leftarrow 0\) (zero-fill)
-
\(\vec{v} \leftarrow \vec{v} + 2\vec{x}\) (blas::axpy)
-
\(\vec{v} \leftarrow \vec{v} + 3\vec{y}\) (blas::axpy)
-
\(\vec{v} \leftarrow \vec{v} + 4\vec{z}\) (blas::axpy)
Note that zero-fill is not BLAS functionality. It is implemented as a simple loop or uses std::fill if possible. This variant is slightly more efficient than
-
\(\vec{v} \leftarrow x\) (blas::copy)
-
\(\vec{v} \leftarrow 2 \vec{v}\) (blas::scal)
-
\(\vec{v} \leftarrow \vec{v} + 3\vec{y}\) (blas::axpy)
-
\(\vec{v} \leftarrow \vec{v} + 4\vec{z}\) (blas::axpy)
#include <flens/flens.cxx>
#include <iostream>
using namespace flens;
using namespace std;
int
main()
{
DenseVector<Array<double> > x(4), y(4), z(4), v(4);
x = 1, 2, 3, 4;
y = 5, 6, 7, 8;
z = 9, 10, 11, 12;
cout << "x = " << x << endl;
cout << "y = " << y << endl;
cout << "z = " << z << endl;
//
// compute v = x + y + z
//
// v = x
blas::copy(x, v);
// v += y
blas::axpy(1., y, v);
// v += z
blas::axpy(1., z, v);
cout << "v = x + y + z = " << v << endl;
//
// compute v = 2*x + 3*y + 4*z
//
// set all elements of v to zero
v = 0.0;
// v += 2*x
blas::axpy(2., x, v);
// v += 3*y
blas::axpy(3., y, v);
// v += 4*z
blas::axpy(4., z, v);
cout << "v = 2*x + 3*y + 4*z = " << v << endl;
}
We compile as usual and get:
$shell> cd flens/examples $shell> g++ -Wall -std=c++11 -I../.. vecsum_flensblas.cc $shell> ./a.out x = 1 2 3 4 y = 5 6 7 8 z = 9 10 11 12 v = x + y + z = 15 18 21 24 v = 2*x + 3*y + 4*z = 53 62 71 80
BLAS Level 1 Example: Adding Vectors with CXXBLAS
Next we show how the FLENS-BLAS functions internally call the CXXBLAS functions cxxblas::copy and cxxblas::axpy:
#include <algorithm>
#include <flens/flens.cxx>
#include <iostream>
using namespace flens;
using namespace std;
int
main()
{
DenseVector<Array<double> > x(4), y(4), z(4), v(4);
x = 1, 2, 3, 4;
y = 5, 6, 7, 8;
z = 9, 10, 11, 12;
cout << "x = " << x << endl;
cout << "y = " << y << endl;
cout << "z = " << z << endl;
//
// compute v = x + y + z
//
// v = x
cxxblas::copy(v.length(), x.data(), x.stride(), v.data(), v.stride());
// v += y
cxxblas::axpy(v.length(), 1., y.data(), y.stride(), v.data(), v.stride());
// v += z
cxxblas::axpy(v.length(), 1., z.data(), z.stride(), v.data(), v.stride());
cout << "v = x + y + z = " << v << endl;
//
// compute v = 2*x + 3*y + 4*z
//
// set all elements of v to zero
std::fill_n(v.data(), v.length(), 0.0);
// v += 2*x
cxxblas::axpy(v.length(), 2., x.data(), x.stride(), v.data(), v.stride());
// v += 3*y
cxxblas::axpy(v.length(), 3., y.data(), y.stride(), v.data(), v.stride());
// v += 4*z
cxxblas::axpy(v.length(), 4., z.data(), z.stride(), v.data(), v.stride());
cout << "v = 2*x + 3*y + 4*z = " << v << endl;
}
We compile as usual and get:
$shell> cd flens/examples $shell> g++ -Wall -std=c++11 -I../.. vecsum_cxxblas.cc $shell> ./a.out x = 1 2 3 4 y = 5 6 7 8 z = 9 10 11 12 v = x + y + z = 15 18 21 24 v = 2*x + 3*y + 4*z = 53 62 71 80
BLAS Level 2 Functions
FUNCTION | DESCRIPTION
mv | Computes a matrix-vector product. The form of the product depends on the matrix type, e.g. \(\vec{y} \leftarrow \beta \vec{y} + \alpha\,\text{op}(A)\,\vec{x}\) for a general matrix and \(\vec{x} \leftarrow \text{op}(A)\,\vec{x}\) for a triangular matrix. Hereby \(\text{op}(A)\) denotes \(A\), \(A^T\) or \(A^H\).
r | Computes a rank 1 update. The type of operation depends on the type of the matrix that gets updated, e.g. \(A \leftarrow A + \alpha\,\vec{x}\,\vec{y}^T\) for a general matrix.
r2 | Computes a symmetric rank 2 update. The type of operation depends on the type of the matrix that gets updated.
sv | Solves one of the systems of equations \(A\vec{x} = \vec{b}\) or \(A^T \vec{x} = \vec{b}\), where \(A\) is a unit or non-unit, upper or lower triangular matrix.
BLAS Level 2 Example: Matrix-Vector Product with Operators
In the following examples we compute these matrix-vector products:
-
\(\vec{y} \leftarrow A \vec{x}\)
-
\(\vec{y} \leftarrow A^T \vec{x}\)
-
\(\vec{y} \leftarrow \beta \vec{y} + A^T \vec{x}\)
If you are familiar with BLAS you know that each of these operations can be computed by a single BLAS routine, unless \(A\) is a triangular matrix. We will show later that this is exactly how the computations get carried out internally.
#include <flens/flens.cxx>
#include <iostream>
using namespace flens;
using namespace std;
int
main()
{
GeMatrix<FullStorage<double> > A(4,4);
DenseVector<Array<double> > x(4), y(4);
A = 1, 2, 3, 4,
5, 6, 7, 8,
9, 10, 11, 12,
13, 14, 15, 16;
x = 1, 2, 3, 4;
cout << "A = " << A << endl;
cout << "x = " << x << endl;
cout << "y = " << y << endl;
//
// compute y = A*x
//
y = A*x;
cout << "y = A*x = " << y << endl;
//
// compute y = A^T*x
//
y = transpose(A)*x;
cout << "y = A^T*x = " << y << endl;
//
// compute y = 1.5*y + A^T*x
//
y = 1.5*y + transpose(A)*x;
cout << "y = 1.5*y + A^T*x = " << y << endl;
}
This leads to:
$shell> cd flens/examples $shell> g++ -Wall -std=c++11 -I../.. matvecprod_operators.cc $shell> ./a.out A = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 x = 1 2 3 4 y = 0 0 0 0 y = A*x = 30 70 110 150 y = A^T*x = 90 100 110 120 y = 1.5*y + A^T*x = 225 250 275 300
BLAS Level 2 Example: Matrix-Vector Product with FLENS-BLAS
Having a look at the documentation of blas::mv you see that all of these operations can be handled by a single (FLENS-)BLAS function. So here it is more transparent that no temporary objects get created:
#include <flens/flens.cxx>
#include <iostream>
using namespace flens;
using namespace std;
int
main()
{
GeMatrix<FullStorage<double> > A(4,4);
DenseVector<Array<double> > x(4), y(4);
A = 1, 2, 3, 4,
5, 6, 7, 8,
9, 10, 11, 12,
13, 14, 15, 16;
x = 1, 2, 3, 4;
cout << "A = " << A << endl;
cout << "x = " << x << endl;
cout << "y = " << y << endl;
//
// compute y = A*x
//
blas::mv(NoTrans, 1.0, A, x, 0.0, y);
cout << "y = A*x = " << y << endl;
//
// compute y = A^T*x
//
blas::mv(Trans, 1.0, A, x, 0.0, y);
cout << "y = A^T*x = " << y << endl;
//
// compute y = 1.5*y + A^T*x
//
blas::mv(Trans, 1.0, A, x, 1.5, y);
cout << "y = 1.5*y + A^T*x = " << y << endl;
}
This leads to:
$shell> cd flens/examples $shell> g++ -Wall -std=c++11 -I../.. matvecprod_flensblas.cc $shell> ./a.out A = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 x = 1 2 3 4 y = 0 0 0 0 y = A*x = 30 70 110 150 y = A^T*x = 90 100 110 120 y = 1.5*y + A^T*x = 225 250 275 300
BLAS Level 2 Example: Matrix-Vector Product with CXXBLAS
The function called by blas::mv is in this case cxxblas::gemv:
#include <flens/flens.cxx>
#include <iostream>
using namespace flens;
using namespace std;
int
main()
{
GeMatrix<FullStorage<double> > A(4,4);
DenseVector<Array<double> > x(4), y(4);
A = 1, 2, 3, 4,
5, 6, 7, 8,
9, 10, 11, 12,
13, 14, 15, 16;
x = 1, 2, 3, 4;
cout << "A = " << A << endl;
cout << "x = " << x << endl;
cout << "y = " << y << endl;
int m = A.numRows();
int n = A.numCols();
//
// compute y = A*x
//
cxxblas::gemv(m, n, 1.0, false, false,
A.data(), A.strideRow(), A.strideCol(),
x.data(), x.stride(),
0.0,
y.data(), y.stride());
cout << "y = A*x = " << y << endl;
//
// compute y = A^T*x
//
cxxblas::gemv(m, n, 1.0, true, false,
A.data(), A.strideRow(), A.strideCol(),
x.data(), x.stride(),
0.0,
y.data(), y.stride());
cout << "y = A^T*x = " << y << endl;
//
// compute y = 1.5*y + A^T*x
//
cxxblas::gemv(m, n, 1.0, true, false,
A.data(), A.strideRow(), A.strideCol(),
x.data(), x.stride(),
1.5,
y.data(), y.stride());
cout << "y = 1.5*y + A^T*x = " << y << endl;
}
This leads to:
$shell> cd flens/examples $shell> g++ -Wall -std=c++11 -I../.. matvecprod_cxxblas.cc $shell> ./a.out A = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 x = 1 2 3 4 y = 0 0 0 0 y = A*x = 30 70 110 150 y = A^T*x = 90 100 110 120 y = 1.5*y + A^T*x = 225 250 275 300
BLAS Level 3 Functions
FUNCTION | DESCRIPTION
mm | Computes a matrix-matrix product. The form of the product depends on the matrix types, e.g. \(C \leftarrow \beta C + \alpha\,\text{op}(A)\,\text{op}(B)\) if both matrices are general matrices; if one matrix is symmetric or triangular the corresponding variant is used.
r2k | Computes a symmetric rank-2k update, i.e. \(C \leftarrow \beta C + \alpha\,A\, B^T + \alpha\,B\,A^T\) or \(C \leftarrow \beta C + \alpha\,A^T \, B + \alpha\,B^T\,A\).
rk | Computes a symmetric rank-k update, i.e. \(C \leftarrow \beta C + \alpha A \, A^T\) or \(C \leftarrow \beta C + \alpha A^T \, A\).
sm | Solves one of the matrix equations \(\text{op}(A)\,X = B\) or \(X\,\text{op}(A) = B\) for \(X\), where \(A\) is a unit or non-unit, upper or lower triangular matrix and \(\text{op}(A)\) denotes \(A\), \(A^T\) or \(A^H\).
BLAS Level 3 Example: Matrix-Matrix Product with Operators
In the following examples we compute these matrix-matrix products:
-
\(C = A*B\)
-
\(C = A^T*B\)
-
\(C = S * B\)
-
\(B = U * B\)
where \(A\), \(B\), \(C\) are general matrices, \(S\) is a symmetric matrix and \(U\) a triangular matrix.
Again, if you are familiar with BLAS you know that each of these operations can be computed by a single BLAS routine.
#include <flens/flens.cxx>
#include <iostream>
using namespace flens;
using namespace std;
int
main()
{
GeMatrix<FullStorage<double> > A(4,4), B(4,4), C(4,4);
A = 1, 2, 3, 4,
5, 6, 7, 8,
9, 10, 11, 12,
13, 14, 15, 16;
B = 17, 18, 19, 20,
21, 22, 23, 24,
25, 26, 27, 28,
29, 30, 31, 32;
auto U = A.upper();
auto S = A.upper().symmetric();
cout << "A = " << A << endl;
cout << "B = " << B << endl;
cout << "S = " << S << endl;
cout << "U = " << U << endl;
//
// compute C = A*B
//
C = A*B;
cout << "C = A*B = " << C << endl;
//
// compute C = A^T*B
//
C = transpose(A)*B;
cout << "C = A^T*B = " << C << endl;
//
// compute C = S*B
//
C = S*B;
cout << "C = S*B = " << C << endl;
//
// compute B = U*B
//
B = U*B;
cout << "B = U*B = " << B << endl;
}
This leads to:
$shell> cd flens/examples $shell> g++ -Wall -std=c++11 -I../.. matmatprod_operators.cc $shell> ./a.out A = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 B = 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 S = 1 2 3 4 2 6 7 8 3 7 11 12 4 8 12 16 U = 1 2 3 4 0 6 7 8 0 0 11 12 0 0 0 16 C = A*B = 250 260 270 280 618 644 670 696 986 1028 1070 1112 1354 1412 1470 1528 C = A^T*B = 724 752 780 808 816 848 880 912 908 944 980 1016 1000 1040 1080 1120 C = S*B = 250 260 270 280 567 590 613 636 821 854 887 920 1000 1040 1080 1120 B = U*B = 250 260 270 280 533 554 575 596 623 646 669 692 464 480 496 512
BLAS Level 3 Example: Matrix-Matrix Product with FLENS-BLAS
Again, having a look at the documentation of blas::mm you see that all of these operations can be handled by a single (FLENS-)BLAS function:
#include <flens/flens.cxx>
#include <iostream>
using namespace flens;
using namespace std;
int
main()
{
GeMatrix<FullStorage<double> > A(4,4), B(4,4), C(4,4);
A = 1, 2, 3, 4,
5, 6, 7, 8,
9, 10, 11, 12,
13, 14, 15, 16;
B = 17, 18, 19, 20,
21, 22, 23, 24,
25, 26, 27, 28,
29, 30, 31, 32;
auto U = A.upper();
auto S = A.upper().symmetric();
cout << "A = " << A << endl;
cout << "B = " << B << endl;
cout << "S = " << S << endl;
cout << "U = " << U << endl;
//
// compute C = A*B
//
blas::mm(NoTrans, NoTrans, 1.0, A, B, 0.0, C);
cout << "C = A*B = " << C << endl;
//
// compute C = A^T*B
//
blas::mm(Trans, NoTrans, 1.0, A, B, 0.0, C);
cout << "C = A^T*B = " << C << endl;
//
// compute C = S*B
//
blas::mm(Left, 1.0, S, B, 0.0, C);
cout << "C = S*B = " << C << endl;
//
// compute B = U*B
//
blas::mm(Left, NoTrans, 1.0, U, B);
cout << "B = U*B = " << B << endl;
}
This leads to:
$shell> cd flens/examples $shell> g++ -Wall -std=c++11 -I../.. matmatprod_flensblas.cc $shell> ./a.out A = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 B = 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 S = 1 2 3 4 2 6 7 8 3 7 11 12 4 8 12 16 U = 1 2 3 4 0 6 7 8 0 0 11 12 0 0 0 16 C = A*B = 250 260 270 280 618 644 670 696 986 1028 1070 1112 1354 1412 1470 1528 C = A^T*B = 724 752 780 808 816 848 880 912 908 944 980 1016 1000 1040 1080 1120 C = S*B = 250 260 270 280 567 590 613 636 821 854 887 920 1000 1040 1080 1120 B = U*B = 250 260 270 280 533 554 575 596 623 646 669 692 464 480 496 512
BLAS Level 3 Example: Matrix-Matrix Product with CXXBLAS
The functions called by blas::mm are in this case cxxblas::gemm, cxxblas::symm and cxxblas::trmm:
#include <flens/flens.cxx>
#include <iostream>
using namespace flens;
using namespace std;
int
main()
{
GeMatrix<FullStorage<double> > A(4,4), B(4,4), C(4,4);
A = 1, 2, 3, 4,
5, 6, 7, 8,
9, 10, 11, 12,
13, 14, 15, 16;
B = 17, 18, 19, 20,
21, 22, 23, 24,
25, 26, 27, 28,
29, 30, 31, 32;
auto U = A.upper();
auto S = A.upper().symmetric();
cout << "A = " << A << endl;
cout << "B = " << B << endl;
cout << "S = " << S << endl;
cout << "U = " << U << endl;
//
// compute C = A*B
//
cxxblas::gemm(C.numRows(), C.numCols(), A.numCols(),
1.0,
false, false, A.data(), A.strideRow(), A.strideCol(),
false, false, B.data(), B.strideRow(), B.strideCol(),
0.0,
C.data(), C.strideRow(), C.strideCol());
cout << "C = A*B = " << C << endl;
//
// compute C = A^T*B
//
cxxblas::gemm(C.numRows(), C.numCols(), A.numCols(),
1.0,
true, false, A.data(), A.strideRow(), A.strideCol(),
false, false, B.data(), B.strideRow(), B.strideCol(),
0.0,
C.data(), C.strideRow(), C.strideCol());
cout << "C = A^T*B = " << C << endl;
//
// compute C = S*B
//
cxxblas::symm(true, C.numRows(), C.numCols(),
1.0,
(S.upLo()==Lower),
S.data(), S.strideRow(), S.strideCol(),
B.data(), B.strideRow(), B.strideCol(),
0.0,
C.data(), C.strideRow(), C.strideCol());
cout << "C = S*B = " << C << endl;
//
// compute B = U*B
//
cxxblas::trmm(true, C.numRows(), C.numCols(),
1.0,
(U.upLo()==Lower), false, false, (U.diag()==Unit),
U.data(), U.strideRow(), U.strideCol(),
B.data(), B.strideRow(), B.strideCol());
cout << "B = U*B = " << B << endl;
}
This leads to:
$shell> cd flens/examples $shell> g++ -Wall -std=c++11 -I../.. matmatprod_cxxblas.cc $shell> ./a.out A = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 B = 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 S = 1 2 3 4 2 6 7 8 3 7 11 12 4 8 12 16 U = 1 2 3 4 0 6 7 8 0 0 11 12 0 0 0 16 C = A*B = 250 260 270 280 618 644 670 696 986 1028 1070 1112 1354 1412 1470 1528 C = A^T*B = 724 752 780 808 816 848 880 912 908 944 980 1016 1000 1040 1080 1120 C = S*B = 250 260 270 280 567 590 613 636 821 854 887 920 1000 1040 1080 1120 B = U*B = 250 260 270 280 533 554 575 596 623 646 669 692 464 480 496 512
Rationale of FLENS-BLAS Function Names
Function names in FLENS-BLAS are derived from BLAS/CBLAS. As the matrix/vector types encapsulate element types and mathematical properties, the BLAS/CBLAS function name scheme can be simplified.
-
Simplified naming convention
Consider BLAS functions like dgemv, sgemv, ssbmv, dgbmv, etc. All of them implement some kind of matrix-vector product. That's what the mv stands for. However, the names also encode further information about the involved matrix/vector types:
-
The first letter indicates the element type
-
s for single precision,
-
d for double precision,
-
c for complex single precision,
-
z for complex double precision.
-
-
The next two letters indicate the involved matrix type
-
ge for a general matrix with full storage
-
gb for a general matrix with band storage
-
sy for a symmetric matrix with full storage
-
sb for a symmetric matrix with banded storage
-
...
-
Because FLENS provides actual matrix/vector types this information can be retrieved from the argument types. Hence, FLENS merely provides one function blas::mv and overloads it for different matrix/vector types.
Analogously, the FLENS function blas::mm is overloaded for various matrix/vector types such that it unifies gemm, symm, trmm, etc.
-
-
Simplified function arguments
Compared to calling BLAS functions directly or through the CBLAS interface the number of arguments required by FLENS-BLAS functions is significantly smaller. In BLAS/CBLAS you always have to pass
-
matrix/vector dimensions
-
leading dimensions/strides and
-
a data pointer
for each matrix/vector argument. In FLENS-BLAS you simply pass a single matrix/vector object to the corresponding function.
As an example: dgemv from CBLAS requires 12 arguments while its FLENS-BLAS counterpart blas::mv only requires 6 arguments. Besides more convenient usage this also increases safety and improves correctness.
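The following side-by-side sketch illustrates the difference. It assumes an installed CBLAS providing cblas_dgemv and uses column major storage for \(A\) so that the leading dimension equals the column stride:
#include <cblas.h>            // assumes a CBLAS installation is available
#include <flens/flens.cxx>
using namespace flens;
int
main()
{
    GeMatrix<FullStorage<double, ColMajor> >  A(4,4);
    DenseVector<Array<double> >               x(4), y(4);
    A =  1,  2,  3,  4,
         5,  6,  7,  8,
         9, 10, 11, 12,
        13, 14, 15, 16;
    x = 1, 2, 3, 4;
    // CBLAS: dimensions, leading dimension and strides are passed explicitly
    cblas_dgemv(CblasColMajor, CblasNoTrans,
                A.numRows(), A.numCols(),
                1.0, A.data(), A.strideCol(),   // strideCol = leading dimension
                x.data(), x.stride(),
                0.0, y.data(), y.stride());
    // FLENS-BLAS: the matrix/vector objects already carry this information
    blas::mv(NoTrans, 1.0, A, x, 0.0, y);
}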