
Using Operators for BLAS

Using overloaded operators for basic linear algebra operations often attracts users from Matlab to C++ libraries. Porting Matlab code seems trivial and promises speedup for free because C++ is said to be faster. At the same time, using overloaded operators has a bad reputation in the high performance computing community.

So why are people in numerical linear algebra worried about overloaded operators? They are worried about *numerically correct results*. Recall that floating point operations are not associative: (a*b)*c is in general not the same as a*(b*c). Dealing with roundoff errors is a crucial matter in numerical applications, so mult(mult(a,b), c) or mult(a, mult(b,c)) can be more expressive in this respect, in particular if you can not trust the operator precedence. Hence a nice feature like overloaded operators for BLAS must preserve operator precedence and must not apply the associative property implicitly.

Once correctness is assured, performance matters. It is important to know which functions are doing the computations. Furthermore, one has to be sure that no temporary vectors or matrices are created silently.

Users coming from Matlab (or other high level applications/libraries) often do not consider this at first: some expressions can not be evaluated without either extending the set of BLAS functions (which is possible) or using temporaries (which would hurt performance and is therefore only possible in a special debug mode). Other expressions require temporaries because the associative property does not hold.

In FLENS you are not forced to use overloaded operators instead of function calls or vice versa. In some cases an explicit function call can be more expressive than a notation with operators. In any case, an expression with overloaded operators is equivalent to a sequence of explicit function calls and will never involve the creation of temporaries.

The primary goal of FLENS is that a FLENS/C++ implementation is at least as efficient as a Fortran or C implementation. It should just be simpler, safer and more convenient. So FLENS is strict: nice features must be transparent and must not come with performance penalties.

How other C++ Libraries Deal with Expressions

Short: Inefficient code looks nice, efficient code not so nice.

Brief: They make it simple to port code from Matlab to C++, and your C++ code looks nice, initially. Dealing with performance and ensuring that no temporaries get created requires more effort and makes the code less attractive.

Eigen

By default an expression like C = A*B will first create a temporary matrix tmp = A*B for the matrix product and then assign C = tmp. This is done to support special cases like A = A*A. So in a loop like

  // Eigen silently creates a temporary
  for (int i=0; i<100000; ++i) {
      C = A*B;
  }

you have 100000 calls to malloc and free. If a user is sure that \(C\) is not identical with \(A\) or \(B\), this can be coded as

  // Enforce Eigen not to create a temporary
  for (int i=0; i<100000; ++i) {
      C.noalias() = A*B;
  }

In this case you achieve the default behaviour of FLENS for C = A*B, i.e. no temporaries get created. In FLENS you have to treat special cases like A = A*A explicitly, i.e. code them as tmp = A*A followed by A = tmp.

Cases like C = A*(B1+B2) are treated correctly: first a temporary tmp = B1+B2 gets created, then C = A*tmp is computed. In this case it obviously can not be avoided that each loop iteration triggers a malloc and free:

  // Eigen silently creates a temporary
  for (int i=0; i<100000; ++i) {
      C = A*(B1+B2);           // alloc tmp
                               // compute tmp = B1+B2,
                               //         C   = A*tmp
                               // release tmp
  }

Unless you use a special debug mode FLENS will force you to explicitly create a temporary:

  // FLENS enforces transparency
  for (int i=0; i<100000; ++i) {
      tmp = B1 + B2;
      C = A*tmp;
  }

If tmp has length zero in the first iteration, it gets resized, which involves a dynamic memory allocation. In the following iterations the memory gets reused. So you have at most one memory allocation and one memory release. If tmp was previously allocated with the correct size, no memory allocations get triggered inside the loop at all.

ublas

Like Eigen, ublas creates a temporary for the result of a matrix product unless you explicitly do something about it:

  C          = prod(A, B);   // alloc tmp
                             // compute tmp = A*B
                             // copy    C   = tmp
                             // release tmp
  noalias(C) = prod(A, B);   // like C = A*B in FLENS

MTL 4

MTL 4 does it right: no temporaries are created for C = A*B. If a temporary would be needed, as in A = A*A, MTL 4 throws an exception in debug mode and gives wrong results in (untested) release mode.

However, a matrix product like A = B*C*D will create a temporary tmp = B*C and then compute A = tmp*D.

