Basic Linear Algebra Subprograms (BLAS)
BLAS is a de facto standard that defines a set of linear algebra operations, mainly matrix/vector operations such as dot products, matrix-vector products and matrix-matrix products. Hardware vendors provide highly tuned BLAS implementations for their architectures, e.g. Intel MKL, AMD ACML or the Sun Performance Library. Implementations like NVIDIA cuBLAS exploit GPUs for the computation. In addition, open source implementations like BLIS, ATLAS, OpenBLAS or ulmBLAS exist that are on par with vendor tuned BLAS implementations.
How much performance can be achieved in theory depends on various factors, for instance how well caches, vectorization and parallelism can be exploited. The potential for performance depends on the type of operation. In this respect BLAS divides the operation types into three levels:
-
BLAS Level 1 contains all linear algebra operations that require \(\mathcal{O}(N)\) data and \(\mathcal{O}(N)\) floating point operations. These are vector-vector operations like vector sums or dot products.
-
BLAS Level 2 contains all linear algebra operations that require \(\mathcal{O}(N^2)\) data and \(\mathcal{O}(N^2)\) floating point operations. These are operations like matrix-vector products.
-
BLAS Level 3 contains all linear algebra operations that require \(\mathcal{O}(N^2)\) data and \(\mathcal{O}(N^3)\) floating point operations. These are operations like matrix-matrix products. Only Level 3 operations can efficiently exploit caches. That is because there is a considerable gap between the amount of data and the amount of floating point operations: once a cache is filled with \(\mathcal{O}(N^2)\) data it gets used for \(\mathcal{O}(N^3)\) floating point operations.
With Level 1 or Level 2 functions you cannot exploit caches. An optimized implementation will achieve about 5-15% of the theoretical peak performance. That means 85-95% of the time the CPU is waiting for data.
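To put numbers to this: for \(N = 1000\) in double precision, a matrix-matrix product \(C \leftarrow C + A\,B\) touches \(3 N^2 \cdot 8\,\text{Bytes} \approx 24\,\text{MB}\) of data but performs \(2 N^3 = 2 \cdot 10^9\) floating point operations, i.e. each matrix element gets reused roughly 667 times once it sits in a cache. A dot product, in contrast, performs only two floating point operations per element loaded.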
So achieving high performance in numerical linear algebra is twofold:
-
The Numerical Math Part
First you need to organize an algorithm such that it is dominated by BLAS Level 3 operations. This can be achieved by considering algorithms that work on matrix blocks rather than individual matrix elements. That is exactly what libraries like LAPACK are doing (a small sketch of such a blocked algorithm follows below).
-
The Computer Science Part
Second you need an optimized BLAS implementation that actually can achieve near peak performance for BLAS Level 3 functions.
Note that this separation allows mathematicians and computer scientists to cooperate independently of each other. The mathematician just needs to know that BLAS Level 3 gives great performance. The computer scientist just needs to know that it is important that BLAS Level 3 functions actually achieve great performance. But of course it is always favorable to look beyond one's own nose!
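To make the "numerical math part" concrete, here is a minimal sketch (not taken from LAPACK or FLENS) of a blocked update \(C \leftarrow C + A\,B\). It partitions all matrices into blocks and delegates every block update to the Level 3 function blas::mm, which is introduced further down on this page. The block size and the use of FLENS range views (the underscore notation) are illustrative assumptions:
#include <algorithm>
#include <flens/flens.cxx>
using namespace flens;
typedef GeMatrix<FullStorage<double> >  RealGeMatrix;
// Blocked C += A*B: every call to blas::mm works on O(BS^2) data and
// performs O(BS^3) floating point operations, so the blocks can be
// reused from cache.
void
blockedGemm(const RealGeMatrix &A, const RealGeMatrix &B, RealGeMatrix &C)
{
    const Underscore<RealGeMatrix::IndexType>  _;
    const int BS = 256;                     // illustrative block size
    const int m  = C.numRows();
    const int n  = C.numCols();
    const int k  = A.numCols();
    for (int i=1; i<=m; i+=BS) {
        const int ib = std::min(BS, m-i+1);
        for (int j=1; j<=n; j+=BS) {
            const int jb = std::min(BS, n-j+1);
            for (int l=1; l<=k; l+=BS) {
                const int lb = std::min(BS, k-l+1);
                const auto A_ = A(_(i,i+ib-1), _(l,l+lb-1));
                const auto B_ = B(_(l,l+lb-1), _(j,j+jb-1));
                auto       C_ = C(_(i,i+ib-1), _(j,j+jb-1));
                // Level 3 BLAS on the blocks: C_ += A_*B_
                blas::mm(NoTrans, NoTrans, 1.0, A_, B_, 1.0, C_);
            }
        }
    }
}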
We first give an overview of BLAS and how to make use of it with FLENS. Only a few BLAS operations will be shown. For a complete overview have a look at the FLENS-BLAS page. By default FLENS will use ulmBLAS as computational backend. Optionally an external BLAS implementation can be used. On later pages we will show how algorithms can be developed such that they exploit BLAS Level 3 functions.
Performance: It's BLAS, not the Compiler
As a rule of thumb: in numerical linear algebra you never get the performance from the compiler. You only get it from using BLAS.
For an undergraduate class on high performance computing we developed, step by step, an optimized implementation of the matrix-matrix product. The steps are documented on GEMM: From Pure C to SSE Optimized Micro Kernels.
Performance benchmarks will certainly vary between different architectures. But as they are mainly influenced by how fast data can be moved from memory to the caches and from the caches to the CPU, the relative comparisons stay comparable:
-
With a naive C implementation you achieve about 10% of the theoretical peak performance, even if you try to reduce cache misses by traversing the matrix elements in the right order. The Netlib reference BLAS implementation (Netlib RefBLAS) is such a naive but not stupid implementation: for column major matrices it always traverses the matrices column wise (a sketch of such a loop ordering follows after this list).
-
With a pure but clever C implementation and a really, really good C compiler you can achieve about 30% of the theoretical peak performance. The compiler cannot do better unless you build in some expert knowledge about numerical linear algebra. In the class project the best possible result achievable through compiler optimizations was denoted demo-naive-sse-with-intrinsics:
Actually we bet that you won't find any compiler that can achieve this performance. In the class project we simulated what a compiler could achieve under optimal assumptions about alignment, restricted access to memory, optimal use of registers, etc.
-
An optimized BLAS achieves more than 90% of the theoretical peak performance. In the class project ulmBLAS, 10 lines of C code were replaced with about 500 lines of assembly code. The assembly code contains expert knowledge about numerical linear algebra that you cannot expect from a general purpose compiler. This resulted in:
As ulmBLAS is derived from BLIS it is no coincidence that we achieve the same performance.
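For reference, a naive but cache-aware loop ordering in the spirit of the Netlib RefBLAS might look like the following sketch (plain C++, column major storage assumed; this is an illustration, not code from the class project):
#include <cstddef>
// Naive C += A*B for column major matrices: all matrices are traversed
// column wise (the innermost loop walks down one column of C and A),
// but there is no blocking and no vectorization.
void
gemmNaive(std::size_t m, std::size_t n, std::size_t k,
          const double *A, std::size_t ldA,
          const double *B, std::size_t ldB,
          double *C, std::size_t ldC)
{
    for (std::size_t j=0; j<n; ++j) {
        for (std::size_t l=0; l<k; ++l) {
            for (std::size_t i=0; i<m; ++i) {
                C[i+j*ldC] += A[i+l*ldA]*B[l+j*ldB];
            }
        }
    }
}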
All the benchmarks are for single threaded implementations. Multithreading for BLAS Level 3 scales nearly perfectly. That's because each thread can compute separate blocks of the result matrix and race conditions are not an issue.
About FLENS-BLAS
BLAS functions are provided in FLENS through three layers:
-
CXXBLAS functions in namespace cxxblas
These provide a low-level C++ BLAS interface. These functions work on raw data pointers and by default call the corresponding ulmBLAS functions. Optionally you can compile and link your code such that an external and possibly vendor tuned BLAS function gets called.
The CXXBLAS interface is similar to the CBLAS interface. It uses only very few C++ features, e.g. templates are used merely for the element type. It does not use structs, classes or even enum types. So compared to CBLAS there is, except for the use of templates, no abstraction.
-
FLENS-BLAS functions in namespace flens::blas
These functions provide a high-level BLAS interface for the FLENS matrix and vector types such as DenseVector and GeMatrix. These functions internally call CXXBLAS functions.
-
Overloaded operators
These have the advantage of allowing a more expressive notation. In FLENS we are very pedantic that this kind of syntactic sugar does not lead to any hidden performance penalty. Using overloaded operators for BLAS should not be misunderstood: they are certainly not provided so that a user does not need to know about BLAS. If you want to do high performance computing you always have to think in BLAS. We will discuss overloaded operators in more detail on the next page.
In all cases the runtime overhead (if any) is negligible compared to calling a BLAS implementation directly. Runtime assertions merely check the consistency of the arguments.
Compiling with -DNDEBUG removes all of these checks. It is common practice to use runtime assertions during the test phase and to generate production code with -DNDEBUG.
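As an illustration, consider the following sketch (the mismatched vector lengths are deliberate; it assumes that blas::axpy checks that both vectors have equal length):
#include <flens/flens.cxx>
using namespace flens;
int
main()
{
    DenseVector<Array<double> >  x(4), y(5);
    x = 1, 2, 3, 4;
    // The lengths of x and y differ on purpose: with runtime checks enabled
    // this call should abort with a failed assertion; compiled with -DNDEBUG
    // no such check is performed.
    blas::axpy(2.0, x, y);
}
Compiling as in the examples below with g++ -Wall -std=c++11 -I../.. keeps the checks; adding -DNDEBUG removes them.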
BLAS Level 1 Functions
This BLAS level originally deals with vector-vector operations only. For convenience, FLENS-BLAS in some cases adds matrix variants that, for instance, allow copying or adding matrices.
FUNCTION | DESCRIPTION
copy | \(\vec{y} \leftarrow \vec{x}\). Copies a vector \(\vec{x}\) to a vector \(\vec{y}\) or a matrix \(A\) to a matrix \(B\).
axpy | \(\vec{y} \leftarrow \alpha \vec{x} + \vec{y}\). axpy is short for "alpha x plus y". FLENS also provides a variant for GeMatrix that computes \(Y \leftarrow \alpha X + Y\).
scal | \(\vec{x} \leftarrow \alpha \vec{x}\). Scales a vector by a constant. FLENS also provides a variant for GeMatrix that computes \(X \leftarrow \alpha X\).
swap | Interchanges two vectors, i.e. \(\vec{x} \leftrightarrow \vec{y}\).
asum | Takes the sum of the absolute values, i.e. computes \(\sum\limits_{i} |x_i|\).
dot, dotu | \(\alpha \leftarrow \bar{\vec{x}}^T \vec{y}\) (dot) or \(\alpha \leftarrow \vec{x}^T \vec{y}\) (dotu). Forms the dot product of two vectors, i.e. computes \(\sum\limits_{i} \bar{x}_i y_i\) or \(\sum\limits_{i} x_i y_i\).
nrm2 | Computes the Euclidean norm of a vector, i.e. \(\sqrt{\sum\limits_{i} |x_i|^2}\).
rot | Applies a plane rotation.
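To illustrate the GeMatrix variants of copy and axpy mentioned in the table, here is a minimal sketch; it assumes that these variants take a Transpose flag as their first argument, analogous to blas::mv and blas::mm below:
#include <flens/flens.cxx>
using namespace flens;
int
main()
{
    GeMatrix<FullStorage<double> >  X(2,2), Y(2,2);
    X = 1, 2,
        3, 4;
    // Y = X       (matrix variant of copy)
    blas::copy(NoTrans, X, Y);
    // Y += 2*X    (matrix variant of axpy)
    blas::axpy(NoTrans, 2.0, X, Y);
}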
BLAS Level 1 Example: Adding Vectors with Operators
In the following examples we compute these vector sums:
-
\(\vec{v} \leftarrow \vec{x} + \vec{y} + \vec{z}\) and
-
\(\vec{v} = 2\vec{x} + 3\vec{y} + 4\vec{z}\)
Using overloaded operators this can be coded in an expressive way. People from high performance computing will get nervous if they see such code. They are more concerned about how the computations are carried out and whether temporary objects get created; ease-of-use comes second in this field. We will show in the next example that no temporaries get created and what BLAS routines are used for the evaluation.
#include <flens/flens.cxx>
#include <iostream>
using namespace flens;
using namespace std;
int
main()
{
DenseVector<Array<double> > x(4), y(4), z(4), v(4);
x = 1, 2, 3, 4;
y = 5, 6, 7, 8;
z = 9, 10, 11, 12;
cout << "x = " << x << endl;
cout << "y = " << y << endl;
cout << "z = " << z << endl;
//
// compute v = x + y + z
//
v = x + y + z;
cout << "v = x + y + z = " << v << endl;
//
// compute v = 2*x + 3*y + 4*z
//
v = 2*x + 3*y + 4*z;
cout << "v = 2*x + 3*y + 4*z = " << v << endl;
}
This leads to:
$shell> cd flens/examples $shell> g++ -Wall -std=c++11 -I../.. vecsum_operators.cc $shell> ./a.out x = 1 2 3 4 y = 5 6 7 8 z = 9 10 11 12 v = x + y + z = 15 18 21 24 v = 2*x + 3*y + 4*z = 53 62 71 80
BLAS Level 1 Example: Adding Vectors with FLENS-BLAS
In the next example we show how the expressions v = x + y + z and v = 2*x + 3*y + 4*z are computed internally by calling FLENS-BLAS functions:
-
\(\vec{v} \leftarrow \vec{x} + \vec{y} + \vec{z}\)
This gets computed using blas::copy and blas::axpy:
-
\(\vec{v} \leftarrow \vec{x}\) (blas::copy)
-
\(\vec{v} \leftarrow \vec{v} + \vec{y}\) (blas::axpy)
-
\(\vec{v} \leftarrow \vec{v} + \vec{z}\) (blas::axpy)
-
-
\(\vec{v} = 2\vec{x} + 3\vec{y} + 4\vec{z}\)
This gets computed by zero-fill and blas::axpy:
-
\(\vec{v} \leftarrow 0\) (zero-fill)
-
\(\vec{v} \leftarrow \vec{v} + 2\vec{x}\) (blas::axpy)
-
\(\vec{v} \leftarrow \vec{v} + 3\vec{y}\) (blas::axpy)
-
\(\vec{v} \leftarrow \vec{v} + 4\vec{z}\) (blas::axpy)
Note that zero-fill is not BLAS functionality. It is implemented as a simple loop or uses std::fill if possible. This variant is slightly more efficient than
-
\(\vec{v} \leftarrow x\) (blas::copy)
-
\(\vec{v} \leftarrow 2 \vec{v}\) (blas::scal)
-
\(\vec{v} \leftarrow \vec{v} + 3\vec{y}\) (blas::axpy)
-
\(\vec{v} \leftarrow \vec{v} + 4\vec{z}\) (blas::axpy)
#include <flens/flens.cxx>
#include <iostream>
using namespace flens;
using namespace std;
int
main()
{
DenseVector<Array<double> > x(4), y(4), z(4), v(4);
x = 1, 2, 3, 4;
y = 5, 6, 7, 8;
z = 9, 10, 11, 12;
cout << "x = " << x << endl;
cout << "y = " << y << endl;
cout << "z = " << z << endl;
//
// compute v = x + y + z
//
// v = x
blas::copy(x, v);
// v += y
blas::axpy(1., y, v);
// v += z
blas::axpy(1., z, v);
cout << "v = x + y + z = " << v << endl;
//
// compute v = 2*x + 3*y + 4*z
//
// set all elements of v to zero
v = 0.0;
// v += 2*x
blas::axpy(2., x, v);
// v += 3*y
blas::axpy(3., y, v);
// v += 4*z
blas::axpy(4., z, v);
cout << "v = 2*x + 3*y + 4*z = " << v << endl;
}
We compile as usual and get:
$shell> cd flens/examples $shell> g++ -Wall -std=c++11 -I../.. vecsum_flensblas.cc $shell> ./a.out x = 1 2 3 4 y = 5 6 7 8 z = 9 10 11 12 v = x + y + z = 15 18 21 24 v = 2*x + 3*y + 4*z = 53 62 71 80
BLAS Level 1 Example: Adding Vectors with CXXBLAS
Next we show how the FLENS-BLAS functions internally call the CXXBLAS functions cxxblas::copy and cxxblas::axpy:
#include <algorithm>
#include <flens/flens.cxx>
#include <iostream>
using namespace flens;
using namespace std;
int
main()
{
DenseVector<Array<double> > x(4), y(4), z(4), v(4);
x = 1, 2, 3, 4;
y = 5, 6, 7, 8;
z = 9, 10, 11, 12;
cout << "x = " << x << endl;
cout << "y = " << y << endl;
cout << "z = " << z << endl;
//
// compute v = x + y + z
//
// v = x
cxxblas::copy(v.length(), x.data(), x.stride(), v.data(), v.stride());
// v += y
cxxblas::axpy(v.length(), 1., y.data(), y.stride(), v.data(), v.stride());
// v += z
cxxblas::axpy(v.length(), 1., z.data(), z.stride(), v.data(), v.stride());
cout << "v = x + y + z = " << v << endl;
//
// compute v = 2*x + 3*y + 4*z
//
// set all elements of v to zero
std::fill_n(v.data(), v.length(), 0.0);
// v += 2*x
cxxblas::axpy(v.length(), 2., x.data(), x.stride(), v.data(), v.stride());
// v += 3*y
cxxblas::axpy(v.length(), 3., y.data(), y.stride(), v.data(), v.stride());
// v += 4*z
cxxblas::axpy(v.length(), 4., z.data(), z.stride(), v.data(), v.stride());
cout << "v = 2*x + 3*y + 4*z = " << v << endl;
}
We compile as usual and get:
$shell> cd flens/examples $shell> g++ -Wall -std=c++11 -I../.. vecsum_cxxblas.cc $shell> ./a.out x = 1 2 3 4 y = 5 6 7 8 z = 9 10 11 12 v = x + y + z = 15 18 21 24 v = 2*x + 3*y + 4*z = 53 62 71 80
BLAS Level 2 Functions
FUNCTION | DESCRIPTION
mv | Computes a matrix-vector product. The form of the product depends on the matrix type, e.g. \(\vec{y} \leftarrow \beta \vec{y} + \alpha\,\text{op}(A)\,\vec{x}\) for a general matrix and \(\vec{x} \leftarrow \text{op}(A)\,\vec{x}\) for a triangular matrix. Hereby \(\text{op}(A)\) denotes \(A\), \(A^T\) or \(A^H\).
r | Computes a rank 1 update. The type of operation depends on the type of the matrix that gets updated, e.g. \(A \leftarrow A + \alpha\,\vec{x}\,\vec{y}^T\) for a general matrix.
r2 | Computes a symmetric rank 2 update. The type of operation depends on the type of the matrix that gets updated.
sv | Solves one of the systems of equations \(A\vec{x} = \vec{b}\) or \(A^T \vec{x} = \vec{b}\), where \(A\) is a unit or non-unit, upper or lower triangular matrix.
BLAS Level 2 Example: Matrix-Vector Product with Operators
In the following examples we compute these matrix-vector products:
-
\(\vec{y} \leftarrow A \vec{x}\)
-
\(\vec{y} \leftarrow A^T \vec{x}\)
-
\(\vec{y} \leftarrow \beta \vec{y} + A^T \vec{x}\)
If you are familiar with BLAS you know that each of these operations can be computed by a single BLAS routine, unless \(A\) is a triangular matrix. We will show later that this is exactly how the computations get carried out internally.
#include <flens/flens.cxx>
#include <iostream>
using namespace flens;
using namespace std;
int
main()
{
GeMatrix<FullStorage<double> > A(4,4);
DenseVector<Array<double> > x(4), y(4);
A = 1, 2, 3, 4,
5, 6, 7, 8,
9, 10, 11, 12,
13, 14, 15, 16;
x = 1, 2, 3, 4;
cout << "A = " << A << endl;
cout << "x = " << x << endl;
cout << "y = " << y << endl;
//
// compute y = A*x
//
y = A*x;
cout << "y = A*x = " << y << endl;
//
// compute y = A^T*x
//
y = transpose(A)*x;
cout << "y = A^T*x = " << y << endl;
//
// compute y = 1.5*y + A^T*x
//
y = 1.5*y + transpose(A)*x;
cout << "y = 1.5*y + A^T*x = " << y << endl;
}
This leads to:
$shell> cd flens/examples $shell> g++ -Wall -std=c++11 -I../.. matvecprod_operators.cc $shell> ./a.out A = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 x = 1 2 3 4 y = 0 0 0 0 y = A*x = 30 70 110 150 y = A^T*x = 90 100 110 120 y = 1.5*y + A^T*x = 225 250 275 300
BLAS Level 2 Example: Matrix-Vector Product with FLENS-BLAS
Having a look at the documentation of blas::mv you see that all of these operations can be handled by a single (FLENS-)BLAS function. So here it is more transparent that no temporary objects get created:
#include <flens/flens.cxx>
#include <iostream>
using namespace flens;
using namespace std;
int
main()
{
GeMatrix<FullStorage<double> > A(4,4);
DenseVector<Array<double> > x(4), y(4);
A = 1, 2, 3, 4,
5, 6, 7, 8,
9, 10, 11, 12,
13, 14, 15, 16;
x = 1, 2, 3, 4;
cout << "A = " << A << endl;
cout << "x = " << x << endl;
cout << "y = " << y << endl;
//
// compute y = A*x
//
blas::mv(NoTrans, 1.0, A, x, 0.0, y);
cout << "y = A*x = " << y << endl;
//
// compute y = A^T*x
//
blas::mv(Trans, 1.0, A, x, 0.0, y);
cout << "y = A^T*x = " << y << endl;
//
// compute y = 1.5*y + A^T*x
//
blas::mv(Trans, 1.0, A, x, 1.5, y);
cout << "y = 1.5*y + A^T*x = " << y << endl;
}
This leads to:
$shell> cd flens/examples $shell> g++ -Wall -std=c++11 -I../.. matvecprod_flensblas.cc $shell> ./a.out A = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 x = 1 2 3 4 y = 0 0 0 0 y = A*x = 30 70 110 150 y = A^T*x = 90 100 110 120 y = 1.5*y + A^T*x = 225 250 275 300
BLAS Level 2 Example: Matrix-Vector Product with CXXBLAS
The function called by blas::mv is in this case cxxblas::gemv:
#include <flens/flens.cxx>
#include <iostream>
using namespace flens;
using namespace std;
int
main()
{
GeMatrix<FullStorage<double> > A(4,4);
DenseVector<Array<double> > x(4), y(4);
A = 1, 2, 3, 4,
5, 6, 7, 8,
9, 10, 11, 12,
13, 14, 15, 16;
x = 1, 2, 3, 4;
cout << "A = " << A << endl;
cout << "x = " << x << endl;
cout << "y = " << y << endl;
int m = A.numRows();
int n = A.numCols();
//
// compute y = A*x
//
cxxblas::gemv(m, n, 1.0, false, false,
A.data(), A.strideRow(), A.strideCol(),
x.data(), x.stride(),
0.0,
y.data(), y.stride());
cout << "y = A*x = " << y << endl;
//
// compute y = A^T*x
//
cxxblas::gemv(m, n, 1.0, true, false,
A.data(), A.strideRow(), A.strideCol(),
x.data(), x.stride(),
0.0,
y.data(), y.stride());
cout << "y = A^T*x = " << y << endl;
//
// compute y = 1.5*y + A^T*x
//
cxxblas::gemv(m, n, 1.0, true, false,
A.data(), A.strideRow(), A.strideCol(),
x.data(), x.stride(),
1.5,
y.data(), y.stride());
cout << "y = 1.5*y + A^T*x = " << y << endl;
}
This leads to:
$shell> cd flens/examples $shell> g++ -Wall -std=c++11 -I../.. matvecprod_cxxblas.cc $shell> ./a.out A = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 x = 1 2 3 4 y = 0 0 0 0 y = A*x = 30 70 110 150 y = A^T*x = 90 100 110 120 y = 1.5*y + A^T*x = 225 250 275 300
BLAS Level 3 Functions
FUNCTION | DESCRIPTION
mm | Computes a matrix-matrix product. The form of the product depends on the matrix types, e.g. \(C \leftarrow \beta C + \alpha\,\text{op}(A)\,\text{op}(B)\) if both matrices are general matrices; if one matrix is symmetric or triangular the corresponding variant is used.
r2k | Computes a symmetric rank-2k update, i.e. \(C \leftarrow \beta C + \alpha\,A\, B^T + \alpha\,B\,A^T\) or \(C \leftarrow \beta C + \alpha\,A^T \, B + \alpha\,B^T\,A\).
rk | Computes a symmetric rank-k update, i.e. \(C \leftarrow \beta C + \alpha A \, A^T\) or \(C \leftarrow \beta C + \alpha A^T \, A\).
sm | Solves one of the matrix equations \(\text{op}(A)\,X = B\) or \(X\,\text{op}(A) = B\) for \(X\), where \(A\) is a unit or non-unit, upper or lower triangular matrix and \(\text{op}(A)\) denotes \(A\), \(A^T\) or \(A^H\).
BLAS Level 3 Example: Matrix-Matrix Product with Operators
In the following examples we compute these matrix-matrix products:
-
\(C = A*B\)
-
\(C = A^T*B\)
-
\(C = S * B\)
-
\(B = U * B\)
where \(A\), \(B\), \(C\) are general matrices, \(S\) is a symmetric matrix and \(U\) a triangular matrix.
Again, if you are familiar with BLAS you know that each of these operations can be computed by a single BLAS routine.
#include <flens/flens.cxx>
#include <iostream>
using namespace flens;
using namespace std;
int
main()
{
GeMatrix<FullStorage<double> > A(4,4), B(4,4), C(4,4);
A = 1, 2, 3, 4,
5, 6, 7, 8,
9, 10, 11, 12,
13, 14, 15, 16;
B = 17, 18, 19, 20,
21, 22, 23, 24,
25, 26, 27, 28,
29, 30, 31, 32;
auto U = A.upper();
auto S = A.upper().symmetric();
cout << "A = " << A << endl;
cout << "B = " << B << endl;
cout << "S = " << S << endl;
cout << "U = " << U << endl;
//
// compute C = A*B
//
C = A*B;
cout << "C = A*B = " << C << endl;
//
// compute C = A^T*B
//
C = transpose(A)*B;
cout << "C = A^T*B = " << C << endl;
//
// compute C = S*B
//
C = S*B;
cout << "C = S*B = " << C << endl;
//
// compute B = U*B
//
B = U*B;
cout << "B = U*B = " << B << endl;
}
This leads to:
$shell> cd flens/examples $shell> g++ -Wall -std=c++11 -I../.. matmatprod_operators.cc $shell> ./a.out A = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 B = 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 S = 1 2 3 4 2 6 7 8 3 7 11 12 4 8 12 16 U = 1 2 3 4 0 6 7 8 0 0 11 12 0 0 0 16 C = A*B = 250 260 270 280 618 644 670 696 986 1028 1070 1112 1354 1412 1470 1528 C = A^T*B = 724 752 780 808 816 848 880 912 908 944 980 1016 1000 1040 1080 1120 C = S*B = 250 260 270 280 567 590 613 636 821 854 887 920 1000 1040 1080 1120 B = U*B = 250 260 270 280 533 554 575 596 623 646 669 692 464 480 496 512
BLAS Level 3 Example: Matrix-Matrix Product with FLENS-BLAS
Again, having a look at the documentation of blas::mm you see that all of these operations can be handled by a single (FLENS-)BLAS function:
#include <flens/flens.cxx>
#include <iostream>
using namespace flens;
using namespace std;
int
main()
{
GeMatrix<FullStorage<double> > A(4,4), B(4,4), C(4,4);
A = 1, 2, 3, 4,
5, 6, 7, 8,
9, 10, 11, 12,
13, 14, 15, 16;
B = 17, 18, 19, 20,
21, 22, 23, 24,
25, 26, 27, 28,
29, 30, 31, 32;
auto U = A.upper();
auto S = A.upper().symmetric();
cout << "A = " << A << endl;
cout << "B = " << B << endl;
cout << "S = " << S << endl;
cout << "U = " << U << endl;
//
// compute C = A*B
//
blas::mm(NoTrans, NoTrans, 1.0, A, B, 0.0, C);
cout << "C = A*B = " << C << endl;
//
// compute C = A^T*B
//
blas::mm(Trans, NoTrans, 1.0, A, B, 0.0, C);
cout << "C = A^T*B = " << C << endl;
//
// compute C = S*B
//
blas::mm(Left, 1.0, S, B, 0.0, C);
cout << "C = S*B = " << C << endl;
//
// compute B = U*B
//
blas::mm(Left, NoTrans, 1.0, U, B);
cout << "B = U*B = " << B << endl;
}
This leads to:
$shell> cd flens/examples $shell> g++ -Wall -std=c++11 -I../.. matmatprod_flensblas.cc $shell> ./a.out A = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 B = 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 S = 1 2 3 4 2 6 7 8 3 7 11 12 4 8 12 16 U = 1 2 3 4 0 6 7 8 0 0 11 12 0 0 0 16 C = A*B = 250 260 270 280 618 644 670 696 986 1028 1070 1112 1354 1412 1470 1528 C = A^T*B = 724 752 780 808 816 848 880 912 908 944 980 1016 1000 1040 1080 1120 C = S*B = 250 260 270 280 567 590 613 636 821 854 887 920 1000 1040 1080 1120 B = U*B = 250 260 270 280 533 554 575 596 623 646 669 692 464 480 496 512
BLAS Level 3 Example: Matrix-Matrix Product with CXXBLAS
The functions called by blas::mm are in this case cxxblas::gemm, cxxblas::symm and cxxblas::trmm:
#include <flens/flens.cxx>
#include <iostream>
using namespace flens;
using namespace std;
int
main()
{
GeMatrix<FullStorage<double> > A(4,4), B(4,4), C(4,4);
A = 1, 2, 3, 4,
5, 6, 7, 8,
9, 10, 11, 12,
13, 14, 15, 16;
B = 17, 18, 19, 20,
21, 22, 23, 24,
25, 26, 27, 28,
29, 30, 31, 32;
auto U = A.upper();
auto S = A.upper().symmetric();
cout << "A = " << A << endl;
cout << "B = " << B << endl;
cout << "S = " << S << endl;
cout << "U = " << U << endl;
//
// compute C = A*B
//
cxxblas::gemm(C.numRows(), C.numCols(), A.numCols(),
1.0,
false, false, A.data(), A.strideRow(), A.strideCol(),
false, false, B.data(), B.strideRow(), B.strideCol(),
0.0,
C.data(), C.strideRow(), C.strideCol());
cout << "C = A*B = " << C << endl;
//
// compute C = A^T*B
//
cxxblas::gemm(C.numRows(), C.numCols(), A.numCols(),
1.0,
true, false, A.data(), A.strideRow(), A.strideCol(),
false, false, B.data(), B.strideRow(), B.strideCol(),
0.0,
C.data(), C.strideRow(), C.strideCol());
cout << "C = A^T*B = " << C << endl;
//
// compute C = S*B
//
cxxblas::symm(true, C.numRows(), C.numCols(),
1.0,
(S.upLo()==Lower),
S.data(), S.strideRow(), S.strideCol(),
B.data(), B.strideRow(), B.strideCol(),
0.0,
C.data(), C.strideRow(), C.strideCol());
cout << "C = S*B = " << C << endl;
//
// compute B = U*B
//
cxxblas::trmm(true, C.numRows(), C.numCols(),
1.0,
(U.upLo()==Lower), false, false, (U.diag()==Unit),
U.data(), U.strideRow(), U.strideCol(),
B.data(), B.strideRow(), B.strideCol());
cout << "B = U*B = " << B << endl;
}
This leads to:
$shell> cd flens/examples $shell> g++ -Wall -std=c++11 -I../.. matmatprod_cxxblas.cc $shell> ./a.out A = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 B = 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 S = 1 2 3 4 2 6 7 8 3 7 11 12 4 8 12 16 U = 1 2 3 4 0 6 7 8 0 0 11 12 0 0 0 16 C = A*B = 250 260 270 280 618 644 670 696 986 1028 1070 1112 1354 1412 1470 1528 C = A^T*B = 724 752 780 808 816 848 880 912 908 944 980 1016 1000 1040 1080 1120 C = S*B = 250 260 270 280 567 590 613 636 821 854 887 920 1000 1040 1080 1120 B = U*B = 250 260 270 280 533 554 575 596 623 646 669 692 464 480 496 512
Rationale of FLENS-BLAS Function Names
Function names in FLENS-BLAS are derived from BLAS/CBLAS. As the matrix/vector types encapsulate element types and mathematical properties, the BLAS/CBLAS function name scheme can be simplified.
-
Simplified naming convention
Consider BLAS functions like dgemv, sgemv, ssbmv, dgbmv, etc. All of them implement some kind of matrix-vector product. That's what the mv stands for. However, the names also encode further information about the involved matrix/vector types:
-
The first letter indicates the element type
-
s for single precision,
-
d for double precision,
-
c for complex single precision,
-
z for complex double precision.
-
-
The next two letters indicate the involved matrix type
-
ge for a general matrix with full storage
-
gb for a general matrix with band storage
-
sy for a symmetric matrix with full storage
-
sb for a symmetric matrix with banded storage
-
...
-
Because FLENS provides actual matrix/vector types this information can be retrieved from the argument types. Hence, FLENS merely provides one function blas::mv and overloads it for different matrix/vector types.
Analogously, the FLENS function blas::mm is overloaded for various matrix/vector types such that it unifies gemm, symm, trmm, etc.
-
-
Simplified function arguments
Compared to calling BLAS functions directly or through the CBLAS interface the number of arguments required by FLENS-BLAS functions is significantly smaller. In BLAS/CBLAS you always have to pass
-
matrix/vector dimensions
-
leading dimensions/strides and
-
a data pointer
for each matrix/vector argument. In FLENS-BLAS you simply pass a single matrix/vector object to the corresponding function.
As an example: dgemv from CBLAS requires 12 arguments while its FLENS-BLAS counterpart blas::mv only requires 6 arguments. Besides more convenient usage this also increases safety and improves correctness.
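The following side-by-side sketch illustrates the difference. It assumes an installed CBLAS providing cblas_dgemv and uses column major storage for \(A\) so that the leading dimension equals the column stride:
#include <cblas.h>            // assumes a CBLAS installation is available
#include <flens/flens.cxx>
using namespace flens;
int
main()
{
    GeMatrix<FullStorage<double, ColMajor> >  A(4,4);
    DenseVector<Array<double> >               x(4), y(4);
    A =  1,  2,  3,  4,
         5,  6,  7,  8,
         9, 10, 11, 12,
        13, 14, 15, 16;
    x = 1, 2, 3, 4;
    // CBLAS: dimensions, leading dimension and strides are passed explicitly
    cblas_dgemv(CblasColMajor, CblasNoTrans,
                A.numRows(), A.numCols(),
                1.0, A.data(), A.strideCol(),   // strideCol = leading dimension
                x.data(), x.stride(),
                0.0, y.data(), y.stride());
    // FLENS-BLAS: the matrix/vector objects already carry this information
    blas::mv(NoTrans, 1.0, A, x, 0.0, y);
}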