Accessing Raw Data Pointers and Strides
In some object-oriented circles, providing access to raw data is generally considered a bad idea. If you belong to these circles and still want high performance, you have to use libraries written by people who do deal with raw pointers. If you want to implement highly efficient numerical code yourself, you have to know what you are doing and, in some places, work with raw data.
Assuming you know what you are doing, FLENS gives you direct access to the raw data of its matrix/vector types. One application is interfacing with other libraries, for example BLAS, LAPACK, etc.
But also consider cases where efficient numerical code requires local buffers. You often do not want to allocate them on the heap, because allocating data on the heap can cause considerable runtime overhead. So if you depend on dynamic memory allocation, you want it to happen as rarely as possible, ideally only once when your application starts.
In some cases you can do even better: using memory from the stack or the data segment is free, i.e. it has no runtime overhead. Common examples of where you can take advantage of this are:
- If you only need a small buffer inside a function, just use a local buffer on the stack.
- If you know the total amount of workspace needed at compile time, use a global buffer on the data segment. And yes, we assume you know how to deal with it in a multi-threaded environment (one option is sketched below).
Always have in mind an example where you implement a function that gets called by a function that gets called by a function, etc., and each function call happens within a loop.
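Regarding the multi-threading caveat above, one common option is to give each thread its own copy of the global buffer. A minimal sketch, assuming C++11 (WORKSPACE_SIZE is a hypothetical constant, not part of FLENS):

#include <cstddef>
// Hypothetical compile-time workspace size (an assumption for this sketch).
const std::size_t WORKSPACE_SIZE = 4096;
// thread_local (standard since C++11) gives each thread its own buffer,
// so concurrent calls do not race on the same memory.
thread_local double workspace_[WORKSPACE_SIZE];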
Accessing Raw Data from Vectors and Matrices
- For matrices of type GeMatrix the following methods provide the relevant information:
const ElementType * data() const;
Pointer to the first element of the matrix. So this is the address of A(1,1) (or, more generally, of A(A.firstRow(), A.firstCol())).
ElementType * data();
The non-const variant, returning a pointer through which elements can be modified.
StorageOrder order() const;
Returns RowMajor, ColMajor or Grid.
The storage order Grid occurs if, for example, a matrix view references every second row and every third column of a matrix.
IndexType strideRow() const;
Stride in memory between elements in the same column. So A.data() + A.strideRow() is the address of A(2,1).
If your matrix is ColMajor this is 1.
IndexType strideCol() const;
Stride in memory between elements in the same row. So A.data() + A.strideCol() is the address of A(1,2).
If your matrix is RowMajor this is 1.
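In general these two strides determine the address of every element. A minimal sketch of the address arithmetic, assuming i and j are valid indices of a non-const matrix A:

// Offset from the first element by the row and column distances,
// each scaled by the corresponding stride.
double *p = A.data()
          + (i - A.firstRow())*A.strideRow()
          + (j - A.firstCol())*A.strideCol();   // p == &A(i,j)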
IndexType leadingDimension() const;
- For ColMajor matrices returns strideCol().
- For RowMajor matrices returns strideRow().
- For Grid matrices this method triggers a runtime assertion.
- For vectors of type DenseVector we provide:
const ElementType * data() const;
Pointer to the first element of the vector. So this is the address of x(1) (or, more generally, of x(x.firstIndex())).
ElementType * data();
The non-const variant.
IndexType stride() const;
Stride in memory between elements. So x.data()+x.stride() is the address of x(2).
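Analogously for vectors, a sketch of the address arithmetic, assuming i is a valid index of a non-const vector x:

// Offset from the first element by the index distance times the stride.
double *p = x.data() + (i - x.firstIndex())*x.stride();   // p == &x(i)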
Example Code
#include <flens/flens.cxx>
using namespace flens;
using namespace std;
typedef GeMatrix<FullStorage<double> > DGeMatrix;
typedef DenseVector<Array<double> > DDenseVector;
int
main()
{
DGeMatrix A(3,3);
fillRandom(A);
cout << "A = " << A << endl;
cout << "A.strideRow() = " << A.strideRow() << endl;
cout << "A.strideCol() = " << A.strideCol() << endl << endl;
cout << "A.data() = " << A.data() << endl;
cout << "&A(1,1) = " << &A(1,1) << endl << endl;
cout << "A.data()+A.strideRow() = " << A.data()+A.strideRow() << endl;
cout << "&A(2,1) = " << &A(2,1) << endl << endl;
cout << "A.data()+A.strideCol() = " << A.data()+A.strideCol() << endl;
cout << "&A(1,2) = " << &A(1,2) << endl << endl;
}
Compile and Run
$shell> cd flens/examples
$shell> g++ -Wall -std=c++11 -I../.. tut01-page03-example01.cc
$shell> ./a.out
A =
  -0.999984   -0.0826997   -0.905911
  -0.736924    0.0655345    0.357729
   0.511211   -0.562082     0.358593

A.strideRow() = 1
A.strideCol() = 3

A.data() = 0x7fe7f3c039b0
&A(1,1) = 0x7fe7f3c039b0

A.data()+A.strideRow() = 0x7fe7f3c039b8
&A(2,1) = 0x7fe7f3c039b8

A.data()+A.strideCol() = 0x7fe7f3c039c8
&A(1,2) = 0x7fe7f3c039c8
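As mentioned at the beginning, one application of these accessors is interfacing with libraries such as BLAS. The following sketch is not part of the tutorial sources; it assumes a CBLAS implementation with the standard cblas.h header is available (and linked, e.g. with -lcblas or -lopenblas; these flags are assumptions). It uses the standard CBLAS function cblas_dgemv to compute y = A*x from the raw data:

#include <cassert>
#include <cblas.h>
#include <flens/flens.cxx>

using namespace flens;

typedef GeMatrix<FullStorage<double> >  DGeMatrix;
typedef DenseVector<Array<double> >     DDenseVector;

int
main()
{
    DGeMatrix     A(3,3);
    DDenseVector  x(3), y(3);

    fillRandom(A);
    fillRandom(x);

    // Hand the raw data to CBLAS: compute y = 1*A*x + 0*y.  For a
    // ColMajor matrix the leading dimension equals strideCol().
    assert(A.order()==ColMajor);

    cblas_dgemv(CblasColMajor, CblasNoTrans,
                A.numRows(), A.numCols(),
                1.0, A.data(), A.leadingDimension(),
                x.data(), x.stride(),
                0.0, y.data(), y.stride());
}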
Naive Creation of Matrices/Vectors in Functions
Consider a small core routine of a numerical application that operates on small matrix blocks. The function itself does only a few computations but gets called many times. For intermediate results a small buffer is needed. In this case dynamic memory allocation can have a considerable impact on runtime.
So you DO NOT want an implementation of this type:
#include <flens/flens.cxx>
using namespace flens;
using namespace std;
typedef GeMatrix<FullStorage<double> > DGeMatrix;
typedef DenseVector<Array<double> > DDenseVector;
// Each function call causes a malloc for a 4x4 matrix and a free at the end of
// scope.
void
foo()
{
DGeMatrix A(4,4);
A = 1, 2, 3, 4,
5, 6, 7, 8,
9, 10, 11, 12,
13, 14, 15, 16;
cout << "A = " << A << endl;
}
int
main()
{
// Keep in mind that the function is an essential inner core routine
// doing only a little computation but getting called many times.
for (int i=0; i<100; ++i) {
foo();
}
}
Creating Matrices/Vectors from Raw Data
You cannot create matrices/vectors directly from raw data. But you can create storage schemes from raw data and then construct matrices/vectors from these storage schemes.
Recall that for GeMatrix<..> the typedefs GeMatrix<..>::EngineView and GeMatrix<..>::EngineConstView give you the types FullStorageView<..> and ConstFullStorageView<..> with the correct template parameters.
Creating a matrix view that uses externally managed raw data follows this pattern (note that the view does not take ownership, so the buffer must outlive the view):
double buffer_[BUFFERSIZE];
// You are responsible for ensuring that the buffer is big enough, i.e. that
// BUFFERSIZE >= 1 + (NUMROWS-1)*STRIDEROW + (NUMCOLS-1)*STRIDECOL
DGeMatrix::View A = DGeMatrix::EngineView(NUMROWS, NUMCOLS,
buffer_,
STRIDEROW, STRIDECOL);
const DGeMatrix::ConstView B = DGeMatrix::EngineConstView(NUMROWS, NUMCOLS,
buffer_,
STRIDEROW, STRIDECOL);
// Ok, you can also create a const-view from an EngineView (if buffer_ is not
// const)
const DGeMatrix::ConstView C = DGeMatrix::EngineView(NUMROWS, NUMCOLS,
buffer_,
STRIDEROW, STRIDECOL);
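For instance, writing through such a view directly modifies the underlying buffer. A minimal sketch (the 2x2 extent and strides are made up for illustration):

double buffer_[4] = {1, 2, 3, 4};
// ColMajor 2x2 view on buffer_: strideRow=1, strideCol=2
DGeMatrix::View A = DGeMatrix::EngineView(2, 2, buffer_, 1, 2);
A(1,1) = 42;   // writes through to buffer_[0]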
Analogously you can create a vector view from raw data:
double buffer_[BUFFERSIZE];
// Parameter STRIDE defaults to 1
DDenseVector::View x = DDenseVector::EngineView(LENGTH, buffer_);
DDenseVector::View y = DDenseVector::EngineView(LENGTH, buffer_, STRIDE);
const DDenseVector::ConstView v = DDenseVector::EngineConstView(LENGTH, buffer_);
const DDenseVector::ConstView w = DDenseVector::EngineConstView(LENGTH, buffer_, STRIDE);
Using a Stack Buffer for Matrix/Vector Views
If the buffer size is known at compile time and small enough, you want to use a buffer on the stack, because allocating such a buffer is free.
You can then use matrix/vector views as an interface for working on the buffer. In many cases you even want to use the same buffer as both a matrix and a vector. This is no problem: just use the same buffer for a matrix view and a vector view. Each view is then just a different interface to the same data:
#include <flens/flens.cxx>
using namespace flens;
using namespace std;
typedef GeMatrix<FullStorage<double> > DGeMatrix;
typedef DenseVector<Array<double> > DDenseVector;
void
foo()
{
double buffer_[4*4] = { 1, 2, 3, 4,
5, 6, 7, 8,
9, 10, 11, 12,
13, 14, 15, 16};
DGeMatrix::View A = DGeMatrix::EngineView(4, 4, buffer_, 1, 4);
DGeMatrix::View B = DGeMatrix::EngineView(4, 4, buffer_, 4, 1);
DGeMatrix::View C = DGeMatrix::EngineView(2, 4, buffer_, 4, 1);
DDenseVector::View x = DDenseVector::EngineView(4*4, buffer_);
DDenseVector::View y = DDenseVector::EngineView(2*4, buffer_, 2);
cout << "Matrix/vector views created from stack buffer:" << endl;
cout << " ColMajor Matrix View" << endl;
cout << " A = " << A << endl;
cout << " RowMajor Matrix View" << endl;
cout << " B = " << B << endl;
cout << " RowMajor Matrix View referencing 2 rows" << endl;
cout << " C = " << C << endl;
cout << " Vector View" << endl;
cout << " x = " << x << endl;
cout << " Vector View referencing every second element" << endl;
cout << " y = " << y << endl;
}
int
main()
{
foo();
}
$shell> cd flens/examples
$shell> g++ -Wall -std=c++11 -I../.. tut01-page03-example03.cc
$shell> ./a.out
Matrix/vector views created from stack buffer:
 ColMajor Matrix View
 A =
  1   5   9  13
  2   6  10  14
  3   7  11  15
  4   8  12  16
 RowMajor Matrix View
 B =
  1   2   3   4
  5   6   7   8
  9  10  11  12
 13  14  15  16
 RowMajor Matrix View referencing 2 rows
 C =
 1  2  3  4
 5  6  7  8
 Vector View
 x = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
 Vector View referencing every second element
 y = 1 3 5 7 9 11 13 15
If Buffers are Too Big for the Stack
In many cases the required workspace depends on the dimensions of matrix or vector arguments. Then you obviously cannot use a stack buffer. Instead, it should be possible to create the buffer before the function gets called.
Consider a function for computing the eigenvalues of a matrix. An efficient implementation requires a workspace. For the size of the workspace we have two cases:
- Minimal workspace that is required to do the computation at all.
- Optimal workspace for best performance. So you can decide how much memory you want to invest in performance.
So a common practice is to implement two functions:
- eigenvalues_wsq(A) does a workspace query to determine the minimal and optimal workspace for computing the eigenvalues of A. The function just returns a pair of integers.
- eigenvalues(A, v, workspace) does the actual computation. It receives, for example, workspace as a dense vector. An assertion should check that the length of workspace is at least the required minimum. As you saw above, you can also create a matrix view from the underlying memory of the workspace vector if that is more appropriate internally.
Often you want to do the same numerical operation on different matrices of the same size, so the workspace can have the same size each time. For convenience you can consider the following: if workspace has length zero, the function internally calls the workspace query and resizes workspace to the optimal size. In this case the workspace does get allocated on the heap. But if you call the function for the next problem (of the same or smaller size), the workspace gets reused without being allocated again.
If you don't care (or if you know that it does not matter), you can in some cases provide a third function:
- eigenvalues(A, v), which just calls eigenvalues(A, v, workspace) with an empty workspace vector. In this case each function call will trigger a dynamic memory allocation and release.
#include <iostream>
#include <utility>
#include <flens/flens.cxx>
using namespace flens;
using namespace std;
template <typename MA>
std::pair<long, long>
eigenvalues_wsq(const GeMatrix<MA> &A)
{
// Compute and return minimal and optimal workspace sizes
return std::pair<long,long>(A.numRows()*4, A.numRows()*256);
}
template <typename MA, typename VV, typename VWS>
void
eigenvalues(const GeMatrix<MA> &A, DenseVector<VV> &v, DenseVector<VWS> &ws)
{
auto wsq = eigenvalues_wsq(A);
if (ws.length()==0) {
ws.resize(wsq.second);
}
assert(ws.length()>=wsq.first);
//
// ... compute eigenvalues v of A ... here the actual work happens!!!
//
}
template <typename MA, typename VV>
void
eigenvalues(const GeMatrix<MA> &A, DenseVector<VV> &v)
{
DenseVector<VV> ws;
eigenvalues(A, v, ws);
}
int
main()
{
typedef GeMatrix<FullStorage<double> > DGeMatrix;
typedef DenseVector<Array<double> > DDenseVector;
// Use case 1: Use optimal workspace (performance for memory). We have
// a typical case where the same problem size is involved. So
// the workspace can be reused.
{
DGeMatrix A(100, 100);
DDenseVector v(100);
auto workspace = eigenvalues_wsq(A);
DDenseVector ws(workspace.second);
for (int i=0; i<666; ++i) {
fillRandom(A);
eigenvalues(A, v, ws);
}
}
// Use case 2: Use minimal workspace (memory for performance)
{
DGeMatrix A(100, 100);
DDenseVector v(100);
auto workspace = eigenvalues_wsq(A);
DDenseVector ws(workspace.first);
for (int i=0; i<666; ++i) {
fillRandom(A);
eigenvalues(A, v, ws);
}
}
// Use case 3: Allocate (optimal) workspace on first call.
{
DGeMatrix A(100, 100);
DDenseVector v(100);
DDenseVector ws;
for (int i=0; i<666; ++i) {
fillRandom(A);
eigenvalues(A, v, ws);
}
}
// NO-Use case 4: Workspace gets allocated for each call!!!
{
DGeMatrix A(100, 100);
DDenseVector v(100);
for (int i=0; i<666; ++i) {
fillRandom(A);
eigenvalues(A, v);
}
}
// Use case 5: Ok, workspace is needed only once.
{
DGeMatrix A(100, 100);
DDenseVector v(100);
fillRandom(A);
eigenvalues(A, v);
}
}
Using a Global Buffer for Matrix/Vector Views
In Fortran 77 you do not have the possibility to allocate memory dynamically, i.e. at runtime. And some hardcore circles in scientific computing claim that dynamic allocation is a waste of CPU time. The following example shows you how to do things right in these circles:
- You first make a dry run for a workspace query. You merely compute the minimal and optimal workspace required for your problem. So the program just outputs two numbers:
  - the minimal workspace required,
  - the optimal workspace.
- If your hardware has enough memory to satisfy the minimal workspace needed, you compile again. This time your application uses some workspace on the data segment, i.e. some global workspace buffer. The size of the workspace is some number between the minimal and optimal size. This newly compiled program then does the actual computation.
You can realize this using the preprocessor. The macro DIM defines the problem size and WORKSPACE the size of the workspace. If WORKSPACE equals zero, a workspace query is done.
Example Code
#include <iostream>
#include <utility>
#include <flens/flens.cxx>
using namespace flens;
using namespace std;
template <typename MA>
std::pair<long, long>
eigenvalues_wsq(const GeMatrix<MA> &A)
{
// Compute and return minimal and optimal workspace sizes
return std::pair<long,long>(A.numRows()*4, A.numRows()*256);
}
template <typename MA, typename VV, typename VWS>
void
eigenvalues(const GeMatrix<MA> &A, DenseVector<VV> &v, DenseVector<VWS> &ws)
{
auto wsq = eigenvalues_wsq(A);
if (ws.length()==0) {
ws.resize(wsq.second);
}
assert(ws.length()>=wsq.first);
//
// ... compute eigenvalues v of A ... here the actual work happens!!!
//
cout << "Computing eigenvalue "
<< "(Using a workspace of size " << ws.length() << ")"
<< endl;
}
template <typename MA, typename VV>
void
eigenvalues(const GeMatrix<MA> &A, DenseVector<VV> &v)
{
DenseVector<VV> ws;
eigenvalues(A, v, ws);
}
double buffer_[DIM*DIM + DIM + WORKSPACE];
int
main()
{
typedef GeMatrix<FullStorage<double> > DGeMatrix;
typedef DenseVector<Array<double> > DDenseVector;
DGeMatrix::View A = DGeMatrix::EngineView(DIM, DIM,
&buffer_[0],
1, DIM);
DDenseVector::View v = DDenseVector::EngineView(DIM,
&buffer_[DIM*DIM]);
DDenseVector::View ws = DDenseVector::EngineView(WORKSPACE,
&buffer_[DIM*DIM+DIM]);
# if WORKSPACE==0
// Perform workspace query and return
auto workspace = eigenvalues_wsq(A);
cout << workspace.first << " " << workspace.second << endl;
# else
// Solve the problem with given workspace
for (int i=0; i<6; ++i) {
fillRandom(A);
eigenvalues(A, v, ws);
}
# endif
}
Compile and Run
$shell> cd flens/examples
$shell> # compile and run for a workspace query
$shell> g++ -Wall -DDIM=100 -DWORKSPACE=0 -std=c++11 -I../.. tut01-page03-example05.cc
$shell> ./a.out
400 25600
$shell> # We store the output of a.out in a bash array
$shell> WS=(`./a.out`)
$shell> # Compile with minimal workspace and run
$shell> g++ -Wall -DDIM=100 -DWORKSPACE=${WS[0]} -std=c++11 -I../.. tut01-page03-example05.cc
$shell> ./a.out
Computing eigenvalue (Using a workspace of size 400)
Computing eigenvalue (Using a workspace of size 400)
Computing eigenvalue (Using a workspace of size 400)
Computing eigenvalue (Using a workspace of size 400)
Computing eigenvalue (Using a workspace of size 400)
Computing eigenvalue (Using a workspace of size 400)
$shell> # Compile with optimal workspace and run
$shell> g++ -Wall -DDIM=100 -DWORKSPACE=${WS[1]} -std=c++11 -I../.. tut01-page03-example05.cc
$shell> ./a.out
Computing eigenvalue (Using a workspace of size 25600)
Computing eigenvalue (Using a workspace of size 25600)
Computing eigenvalue (Using a workspace of size 25600)
Computing eigenvalue (Using a workspace of size 25600)
Computing eigenvalue (Using a workspace of size 25600)
Computing eigenvalue (Using a workspace of size 25600)