Content

Basic Linear Algebra Operations (BLAS)

BLAS is a de facto standard that defines a certain set of linear algebra operations. These operations mainly are matrix/vector operations, e.g. dot products, matrix-vector products, matrix-matrix products, etc. Hardware vendors provide highly tuned BLAS implementations for their architecture, e.g. Intel MKL, AMD CML, Sun Performance Library, etc. Implementations like NVIDIA cuBLAS exploit GPUs for the computation. In addition open source implementations like BLIS, ATLAS, OpenBLAS, ulmBLAS, etc. exist that are on par with vendor tuned BLAS implementations.

How much performance can be achieved in theory depends on various factor. For instance exploiting caches, vectorization and parallelism. The potential for performance depends on the type of operation. In this respect BLAS divides the operation types into three Levels:

So achieving high performance in numerical linear algebra is two fold:

Note that this separation allows mathematician and computer scientist to cooperate independent from each other. The mathematician just needs to know that BLAS Level 3 gives great performance. The computer scientist just needs to know that it is important that BLAS Level 3 functions actually achieve great performance. But of course it is aways favorable to look beyond one's own nose!

We first give an overview of BLAS and how to make use of it with FLENS. Only a few BLAS operations will be shown. For a complete overview have a look at the FLENS-BLAS page. By default FLENS will use ulmBLAS as computational backend. Optionally an external BLAS implementation can be used. On later pages we will show how algorithms can be developed such that they exploit BLAS Level 3 functions.

Performance: It's BLAS, not the Compiler

As a rule of thumb you can say: In numerical linear algebra you never get the performance from the compiler. You only get it from using BLAS.

For an undergraduate class on high performance computing we developed step by step an optimized implementation of the matrix-matrix product. The steps are documented on GEMM: From Pure C to SSE Optimized Micro Kernels

Performance benchmarks will certainly vary between different architectures. But as they are mainly influence on how fast data can be move from memory to the caches and from caches to the CPU the relative comparisons stay comparable:

About FLENS-BLAS

BLAS functions are provided in FLENS through three layers:

In all cases runtime overhead (if at all) is negligible compared to calling a BLAS implementation directly. Runtime assertions merely check consistency of the arguments.

Compiling with -DNDEBUG removes any of these checks. It is common practise that during the test phase runtime assertions are used and production code gets generated with -DNDEBUG.

BLAS Level 1 Functions

This BLAS level originally deals with vector-vector operations only. For convenience we added in FLENS-BLAS some cases matrix-matrix variants that for instance allow copying or adding matrices.

FLENS-BLAS

DESCRIPTION

CXXBLAS

copy

\(y \leftarrow x\)

Copies a vector \(\vec{x}\) to a vector \(\vec{y}\) or a matrix \(A\) to a matrix \(B\).

copy

axpy

\(\vec{y} \leftarrow \alpha \vec{x} + \vec{y}\)

axpy is short for alpha x pluy y

FLENS also provides a variant for GeMatrix that computes \(Y \leftarrow \alpha X + Y\).

axpy

scal

\(\vec{x} \leftarrow \alpha \vec{x}\)

Scales a vector by a constant.

FLENS also provides a variant for GeMatrix that computes \(X \leftarrow \alpha X\).

scal

swap

Interchanges two vectors, i.e. \(\vec{x} \leftrightarrow \vec{y}\).

swap

asum

Takes the sum of the absolute values, i.e. computes \(\sum\limits_{i} |x_i|\).

asum

dot, dotu

\(\alpha \leftarrow \vec{x}^T \vec{y}\) or \(\alpha \leftarrow \vec{\bar{x}}^T \vec{y}\)

Forms the dot product of two vectors, i.e. computes \(\sum\limits_{i} \bar{x}_i y_i\) or \(\sum\limits_{i} x_i y_i\).

dot, udot

nrm2

Computes the euclidean norm of a vector, i.e. \(\sqrt{\sum\limits_{i} |x_i|^2}\).

nrm2

rot

Applies a plane rotation.

rot

BLAS Level 1 Example: Adding Vectors with Operators

In the following examples we compute following vector sums:

Using overloaded operators this can be coded in an expressive way. People from high performance computing will get nervous if they see the code. They are more concern about how computations are carried out and whether temporary objects get created. Ease-of-use comes second in this field. We will show in the next example that no temporaries are carried out and what BLAS routines are used for the evaluation.

#include <flens/flens.cxx>
#include <iostream>

using namespace flens;
using namespace std;

int
main()
{
    DenseVector<Array<double> >  x(4), y(4), z(4), v(4);

    x =  1,  2,  3,  4;
    y =  5,  6,  7,  8;
    z =  9101112;

    cout << "x = " << x << endl;
    cout << "y = " << y << endl;
    cout << "z = " << z << endl;

//
//  compute  v = x + y + z
//
    v = x + y + z;

    cout << "v = x + y + z = " << v << endl;

//
//  compute v = 2*x + 3*y + 4*z
//
    v = 2*x + 3*y + 4*z;

    cout << "v = 2*x + 3*y + 4*z = " << v << endl;
}

This leads to:

$shell> cd flens/examples                                                       
$shell> g++ -Wall -std=c++11 -I../.. vecsum_operators.cc                        
$shell> ./a.out                                                                 
x = 
            1              2              3              4 
y = 
            5              6              7              8 
z = 
            9             10             11             12 
v = x + y + z = 
           15             18             21             24 
v = 2*x + 3*y + 4*z = 
           53             62             71             80 

BLAS Level 1 Example: Adding Vectors with FLENS-BLAS

In the next example we show how internally the expression v = x + y + z and v = 2*x + 3*y + 4*z are computed by calling FLENS-BLAS functions:

#include <flens/flens.cxx>
#include <iostream>

using namespace flens;
using namespace std;

int
main()
{
    DenseVector<Array<double> >  x(4), y(4), z(4), v(4);

    x =  1,  2,  3,  4;
    y =  5,  6,  7,  8;
    z =  9101112;

    cout << "x = " << x << endl;
    cout << "y = " << y << endl;
    cout << "z = " << z << endl;

//
//  compute  v = x + y + z
//

    // v = x
    blas::copy(x, v);

    // v += y
    blas::axpy(1., y, v);

    // v += z
    blas::axpy(1., z, v);

    cout << "v = x + y + z = " << v << endl;

//
//  compute v = 2*x + 3*y + 4*z
//

    // set all element of v to zero
    v = 0.0;

    // v += 2*x
    blas::axpy(2., x, v);

    // v += 3*y
    blas::axpy(3., y, v);

    // v += 4*z
    blas::axpy(4., z, v);

    cout << "v = 2*x + 3*y + 4*z = " << v << endl;
}

We compile as usual and get:

$shell> cd flens/examples                                                       
$shell> g++ -Wall -std=c++11 -I../.. vecsum_flensblas.cc                        
$shell> ./a.out                                                                 
x = 
            1              2              3              4 
y = 
            5              6              7              8 
z = 
            9             10             11             12 
v = x + y + z = 
           15             18             21             24 
v = 2*x + 3*y + 4*z = 
           53             62             71             80 

BLAS Level 1 Example: Adding Vectors with CXXBLAS

Next we show how the FLENS-BLAS functions call internally CXXBLAS functionsacxxblas::copy and cxxblas::axpy:

#include <flens/flens.cxx>
#include <algorithm>
#include <iostream>

using namespace flens;
using namespace std;

int
main()
{
    DenseVector<Array<double> >  x(4), y(4), z(4), v(4);

    x =  1,  2,  3,  4;
    y =  5,  6,  7,  8;
    z =  9101112;

    cout << "x = " << x << endl;
    cout << "y = " << y << endl;
    cout << "z = " << z << endl;

//
//  compute  v = x + y + z
//
    // v = x
    cxxblas::copy(v.length(), x.data(), x.stride(), v.data(), v.stride());

    // v += y
    cxxblas::axpy(v.length(), 1., y.data(), y.stride(), v.data(), v.stride());

    // v += z
    cxxblas::axpy(v.length(), 1., z.data(), z.stride(), v.data(), v.stride());

    cout << "v = x + y + z = " << v << endl;

//
//  compute v = 2*x + 3*y + 4*z
//

    // set all element of v to zero
    std::fill_n(v.data(), v.length(), 0.0);

    // v += 2*x
    cxxblas::axpy(v.length(), 2., x.data(), x.stride(), v.data(), v.stride());

    // v += 3*y
    cxxblas::axpy(v.length(), 3., y.data(), y.stride(), v.data(), v.stride());

    // v += 4*z
    cxxblas::axpy(v.length(), 4., z.data(), z.stride(), v.data(), v.stride());

    cout << "v = 2*x + 3*y + 4*z = " << v << endl;
}

We compile as usual and get:

$shell> cd flens/examples                                                       
$shell> g++ -Wall -std=c++11 -I../.. vecsum_cxxblas.cc                          
$shell> ./a.out                                                                 
x = 
            1              2              3              4 
y = 
            5              6              7              8 
z = 
            9             10             11             12 
v = x + y + z = 
           15             18             21             24 
v = 2*x + 3*y + 4*z = 
           53             62             71             80 

BLAS Level 2 Functions

FLENS-BLAS

DESCRIPTION

CXXBLAS

mv

Computes a matrix-vector product. The form of the product depends on the matrix type:

  • For general matrices it is \(y \leftarrow \beta y + \alpha \text{op}(A) x\).

  • For symmetric matrices it is \(y \leftarrow \beta y + \alpha A x\).

  • For hermitian matrices it is \(y \leftarrow \beta y + \alpha A x\).

  • For triangular matrices it is \(x \leftarrow \text{op}(A) x\).

Hereby \(\text{op}(A)\) denotes \(A\), \(A^T\) or \(A^H\).

gemv

symv

hemv

trmv

r

Computes a rank 1 operation. The type of operation depends on type of the matrix that gets updated:

  • For general matrices it is \(A \leftarrow A + \alpha x y^T\).

  • For symmetric matrices it is \(A \leftarrow A + \alpha x x^T\).

geru gerc syr her

r2

Computes a symmetrix rank 2 operation. The type of operation depends on type of the matrix that gets updated:

  • For symmetric matrices it is \(A \leftarrow A + \alpha x y^T + \alpha y x^T\).

  • For hermitian matrices it is \(A \leftarrow A + \alpha x y^H + \overline{\alpha} y x^H\).

syr2

her2

sv

Solves one of the systems of equations \(Ax = b\) or \(A^T x = b\) where \(A\) is an unit or non-unit or upper or lower triangular matrix.

trsv

BLAS Level 2 Example: Matrix-Vector Product with Operators

In the following examples we compute following matrix vector products:

If you are familiar with BLAS you know that each of these operations can be computed by a single BLAS routine unless \(A\) is a trinagular matrix. We will show later that this is exactly how the compuation get carried out internally.

#include <flens/flens.cxx>
#include <iostream>

using namespace flens;
using namespace std;

int
main()
{
    GeMatrix<FullStorage<double> >   A(4,4);
    DenseVector<Array<double> >      x(4), y(4);

    A =  1,  2,  3,  4,
         5,  6,  7,  8,
         9101112,
        13141516;

    x =  1,  2,  3,  4;

    cout << "A = " << A << endl;
    cout << "x = " << x << endl;
    cout << "y = " << y << endl;

//
//  compute  y = A*x
//
    y = A*x;

    cout << "y = A*x = " << y << endl;

//
//  compute  y = A^T*x
//
    y = transpose(A)*x;

    cout << "y = A^T*x = " << y << endl;

//
//  compute  y = 1.5*y + A^T*x
//
    y = 1.5*y + transpose(A)*x;

    cout << "y = 1.5*y + A^T*x = " << y << endl;
}

This leads to:

$shell> cd flens/examples                                                       
$shell> g++ -Wall -std=c++11 -I../.. matvecprod_operators.cc                    
$shell> ./a.out                                                                 
A = 
            1             2             3             4 
            5             6             7             8 
            9            10            11            12 
           13            14            15            16 
x = 
            1              2              3              4 
y = 
            0              0              0              0 
y = A*x = 
           30             70            110            150 
y = A^T*x = 
           90            100            110            120 
y = 1.5*y + A^T*x = 
          225            250            275            300 

BLAS Level 2 Example: Matrix-Vector Product with FLENS-BLAS

Having a look at the documentation of blas::mv you see that all of these operations can handeled by a single (FLENS-)BLAS function. So here it is more transparent that no temporary objects get created:

#include <flens/flens.cxx>
#include <iostream>

using namespace flens;
using namespace std;

int
main()
{
    GeMatrix<FullStorage<double> >   A(4,4);
    DenseVector<Array<double> >      x(4), y(4);

    A =  1,  2,  3,  4,
         5,  6,  7,  8,
         9101112,
        13141516;

    x =  1,  2,  3,  4;

    cout << "A = " << A << endl;
    cout << "x = " << x << endl;
    cout << "y = " << y << endl;

//
//  compute  y = A*x
//
    blas::mv(NoTrans, 1.0, A, x, 0.0, y);

    cout << "y = A*x = " << y << endl;

//
//  compute  y = A^T*x
//
    blas::mv(Trans, 1.0, A, x, 0.0, y);

    cout << "y = A^T*x = " << y << endl;

//
//  compute  y = 1.5*y + A^T*x
//
    blas::mv(Trans, 1.0, A, x, 1.5, y);

    cout << "y = 1.5*y + A^T*x = " << y << endl;
}

This leads to:

$shell> cd flens/examples                                                       
$shell> g++ -Wall -std=c++11 -I../.. matvecprod_flensblas.cc                    
$shell> ./a.out                                                                 
A = 
            1             2             3             4 
            5             6             7             8 
            9            10            11            12 
           13            14            15            16 
x = 
            1              2              3              4 
y = 
            0              0              0              0 
y = A*x = 
           30             70            110            150 
y = A^T*x = 
           90            100            110            120 
y = 1.5*y + A^T*x = 
          225            250            275            300 

BLAS Level 2 Example: Matrix-Vector Product with CXXBLAS

The function called by blas::mv is in this case cxxblas::gemv:

#include <flens/flens.cxx>
#include <iostream>

using namespace flens;
using namespace std;

int
main()
{
    GeMatrix<FullStorage<double> >   A(4,4);
    DenseVector<Array<double> >      x(4), y(4);

    A =  1,  2,  3,  4,
         5,  6,  7,  8,
         9101112,
        13141516;

    x =  1,  2,  3,  4;

    cout << "A = " << A << endl;
    cout << "x = " << x << endl;
    cout << "y = " << y << endl;

    int m = A.numRows();
    int n = A.numCols();

//
//  compute  y = A*x
//
    cxxblas::gemv(m, n, 1.0falsefalse,
                  A.data(), A.strideRow(), A.strideCol(),
                  x.data(), x.stride(),
                  0.0,
                  y.data(), y.stride());

    cout << "y = A*x = " << y << endl;

//
//  compute  y = A^T*x
//
    cxxblas::gemv(m, n, 1.0truefalse,
                  A.data(), A.strideRow(), A.strideCol(),
                  x.data(), x.stride(),
                  0.0,
                  y.data(), y.stride());

    cout << "y = A^T*x = " << y << endl;

//
//  compute  y = 1.5*y + A^T*x
//
    cxxblas::gemv(m, n, 1.0truefalse,
                  A.data(), A.strideRow(), A.strideCol(),
                  x.data(), x.stride(),
                  1.5,
                  y.data(), y.stride());

    cout << "y = 1.5*y + A^T*x = " << y << endl;
}

This leads to:

$shell> cd flens/examples                                                       
$shell> g++ -Wall -std=c++11 -I../.. matvecprod_cxxblas.cc                      
$shell> ./a.out                                                                 
A = 
            1             2             3             4 
            5             6             7             8 
            9            10            11            12 
           13            14            15            16 
x = 
            1              2              3              4 
y = 
            0              0              0              0 
y = A*x = 
           30             70            110            150 
y = A^T*x = 
           90            100            110            120 
y = 1.5*y + A^T*x = 
          225            250            275            300 

BLAS Level 3 Functions

FLENS-BLAS

DESCRIPTION

CXXBLAS

mm

Computes a matrix-matrix product. The form of the product depends on the matrix types. If one matrix is a general matrix and the other matrix is

  • general then it is \(C \leftarrow \beta C + \alpha \, \text{op}(A) \, \text{op}(B)\)

  • symmetric then it is \(C \leftarrow \beta C + \alpha \, A \, \text{op}(B)\)

  • hermitian then it is \(C \leftarrow \beta C + \alpha \, A \, \text{op}(B)\)

  • triangular then it is \(B \leftarrow \alpha \, \text{op}(A) \, B\)

gemm

symm

hemm

trmm

r2k

Compute a symmetric rank-2k update, i.e. \(C \leftarrow \beta C + \alpha\,A\, B^T + \alpha\,B\,A^T\) or \(C \leftarrow \beta C + \alpha\,A^T \, B + \alpha\,B^T\,A\).

syr2k

rk

Compute a symmetric rank-k update, i.e. \(C \leftarrow \beta C + \alpha A \, A^T\) or \(C \leftarrow \beta C + \alpha A^T \, A\)

syrk

sm

Solves one of the matrix equations \(\text{op}(A)\,X = B\) or \(X\,\text{op}(A) = B\) for \(X\) where \(A\) is an unit or non-unit or upper or lower triangular matrix and \(\text{op}(A)\) denotes \(A\), \(A^T\) or \(A^H\).

trsm

BLAS Level 3 Example: Matrix-Matrix Product with Operators

In the following examples we compute following matrix matrix products:

where \(A\), \(B\), \(C\) are general matrices, \(S\) is a symmetric matrix and \(U\) a triangular matrix.

Again, if you are familiar with BLAS you know that each of these operations can be computed by a single BLAS routine.

#include <flens/flens.cxx>
#include <iostream>

using namespace flens;
using namespace std;

int
main()
{
    GeMatrix<FullStorage<double> >   A(4,4), B(4,4), C(4,4);

    A =  1,  2,  3,  4,
         5,  6,  7,  8,
         9101112,
        13141516;

    B = 17181920,
        21222324,
        25262728,
        29303132;

    auto U = A.upper();
    auto S = A.upper().symmetric();

    cout << "A = " << A << endl;
    cout << "B = " << B << endl;
    cout << "S = " << S << endl;
    cout << "U = " << U << endl;

//
//  compute  C = A*B
//
    C = A*B;

    cout << "C = A*B = " << C << endl;

//
//  compute  C = A^T*B
//
    C = transpose(A)*B;

    cout << "C = A^T*B = " << C << endl;

//
//  compute  C = S*B
//
    C = S*B;

    cout << "C = S*B = " << C << endl;

//
//  compute  B = U*B
//
    B = U*B;

    cout << "B = U*B = " << B << endl;
}

This leads to:

$shell> cd flens/examples                                                       
$shell> g++ -Wall -std=c++11 -I../.. matmatprod_operators.cc                    
$shell> ./a.out                                                                 
A = 
            1             2             3             4 
            5             6             7             8 
            9            10            11            12 
           13            14            15            16 
B = 
           17            18            19            20 
           21            22            23            24 
           25            26            27            28 
           29            30            31            32 
S = 
            1             2             3             4 
            2             6             7             8 
            3             7            11            12 
            4             8            12            16 
U = 
            1             2             3             4 
            0             6             7             8 
            0             0            11            12 
            0             0             0            16 
C = A*B = 
          250           260           270           280 
          618           644           670           696 
          986          1028          1070          1112 
         1354          1412          1470          1528 
C = A^T*B = 
          724           752           780           808 
          816           848           880           912 
          908           944           980          1016 
         1000          1040          1080          1120 
C = S*B = 
          250           260           270           280 
          567           590           613           636 
          821           854           887           920 
         1000          1040          1080          1120 
B = U*B = 
          250           260           270           280 
          533           554           575           596 
          623           646           669           692 
          464           480           496           512 

BLAS Level 3 Example: Matrix-Matrix Product with FLENS-BLAS

Again, having a look at the documentation of blas::mm you see that all of these operations can handeled by a single (FLENS-)BLAS function:

#include <flens/flens.cxx>
#include <iostream>

using namespace flens;
using namespace std;

int
main()
{
    GeMatrix<FullStorage<double> >   A(4,4), B(4,4), C(4,4);

    A =  1,  2,  3,  4,
         5,  6,  7,  8,
         9101112,
        13141516;

    B = 17181920,
        21222324,
        25262728,
        29303132;

    auto U = A.upper();
    auto S = A.upper().symmetric();

    cout << "A = " << A << endl;
    cout << "B = " << B << endl;
    cout << "S = " << S << endl;
    cout << "U = " << U << endl;

//
//  compute  C = A*B
//
    blas::mm(NoTrans, NoTrans, 1.0, A, B, 0.0, C);

    cout << "C = A*B = " << C << endl;

//
//  compute  C = A^T*B
//
    blas::mm(Trans, NoTrans, 1.0, A, B, 0.0, C);

    cout << "C = A^T*B = " << C << endl;

//
//  compute  C = S*B
//
    blas::mm(Left, 1.0, S, B, 0.0, C);

    cout << "C = S*B = " << C << endl;

//
//  compute  B = U*B
//
    blas::mm(Left, NoTrans, 1.0, U, B);

    cout << "B = U*B = " << B << endl;
}

This leads to:

$shell> cd flens/examples                                                       
$shell> g++ -Wall -std=c++11 -I../.. matmatprod_flensblas.cc                    
$shell> ./a.out                                                                 
A = 
            1             2             3             4 
            5             6             7             8 
            9            10            11            12 
           13            14            15            16 
B = 
           17            18            19            20 
           21            22            23            24 
           25            26            27            28 
           29            30            31            32 
S = 
            1             2             3             4 
            2             6             7             8 
            3             7            11            12 
            4             8            12            16 
U = 
            1             2             3             4 
            0             6             7             8 
            0             0            11            12 
            0             0             0            16 
C = A*B = 
          250           260           270           280 
          618           644           670           696 
          986          1028          1070          1112 
         1354          1412          1470          1528 
C = A^T*B = 
          724           752           780           808 
          816           848           880           912 
          908           944           980          1016 
         1000          1040          1080          1120 
C = S*B = 
          250           260           270           280 
          567           590           613           636 
          821           854           887           920 
         1000          1040          1080          1120 
B = U*B = 
          250           260           270           280 
          533           554           575           596 
          623           646           669           692 
          464           480           496           512 

BLAS Level 3 Example: Matrix-Matrix Product with CXXBLAS

The functions called by blas::mm are in this case cxxblas::gemm, cxxblas::symm and cxxblas::trmm:

#include <flens/flens.cxx>
#include <iostream>

using namespace flens;
using namespace std;

int
main()
{
    GeMatrix<FullStorage<double> >   A(4,4), B(4,4), C(4,4);

    A =  1,  2,  3,  4,
         5,  6,  7,  8,
         9101112,
        13141516;

    B = 17181920,
        21222324,
        25262728,
        29303132;

    auto U = A.upper();
    auto S = A.upper().symmetric();

    cout << "A = " << A << endl;
    cout << "B = " << B << endl;
    cout << "S = " << S << endl;
    cout << "U = " << U << endl;

//
//  compute  C = A*B
//
    cxxblas::gemm(C.numRows(), C.numCols(), A.numCols(),
                  1.0,
                  falsefalse, A.data(), A.strideRow(), A.strideCol(),
                  falsefalse, B.data(), B.strideRow(), B.strideCol(),
                  0.0,
                  C.data(), C.strideRow(), C.strideCol());

    cout << "C = A*B = " << C << endl;

//
//  compute  C = A^T*B
//
    cxxblas::gemm(C.numRows(), C.numCols(), A.numCols(),
                  1.0,
                  true,  false, A.data(), A.strideRow(), A.strideCol(),
                  falsefalse, B.data(), B.strideRow(), B.strideCol(),
                  0.0,
                  C.data(), C.strideRow(), C.strideCol());

    cout << "C = A^T*B = " << C << endl;

//
//  compute  C = S*B
//
    cxxblas::symm(true, C.numRows(), C.numCols(),
                  1.0,
                  (S.upLo()==Lower),
                  S.data(), S.strideRow(), S.strideCol(),
                  B.data(), B.strideRow(), B.strideCol(),
                  0.0,
                  C.data(), C.strideRow(), C.strideCol());

    cout << "C = S*B = " << C << endl;

//
//  compute  B = U*B
//
    cxxblas::trmm(true, C.numRows(), C.numCols(),
                  1.0,
                  (U.upLo()==Lower), falsefalse, (U.diag()==Unit),
                  U.data(), U.strideRow(), U.strideCol(),
                  B.data(), B.strideRow(), B.strideCol());

    cout << "B = U*B = " << B << endl;
}

This leads to:

$shell> cd flens/examples                                                       
$shell> g++ -Wall -std=c++11 -I../.. matmatprod_cxxblas.cc                      
$shell> ./a.out                                                                 
A = 
            1             2             3             4 
            5             6             7             8 
            9            10            11            12 
           13            14            15            16 
B = 
           17            18            19            20 
           21            22            23            24 
           25            26            27            28 
           29            30            31            32 
S = 
            1             2             3             4 
            2             6             7             8 
            3             7            11            12 
            4             8            12            16 
U = 
            1             2             3             4 
            0             6             7             8 
            0             0            11            12 
            0             0             0            16 
C = A*B = 
          250           260           270           280 
          618           644           670           696 
          986          1028          1070          1112 
         1354          1412          1470          1528 
C = A^T*B = 
          724           752           780           808 
          816           848           880           912 
          908           944           980          1016 
         1000          1040          1080          1120 
C = S*B = 
          250           260           270           280 
          567           590           613           636 
          821           854           887           920 
         1000          1040          1080          1120 
B = U*B = 
          250           260           270           280 
          533           554           575           596 
          623           646           669           692 
          464           480           496           512 

Rational of FLENS-BLAS Function Names

Function names in FLENS-BLAS are derived from BLAS/CBLAS. As matrix/vector types encapsulate element types and mathematical properties this simplifies the BLAS/CBLAS function name scheme.