======================================
Basic Linear Algebra Operations (BLAS)                                   [TOC]
======================================

BLAS is a de facto standard that defines a certain set of linear algebra
operations. These operations are mainly matrix/vector operations, e.g. dot
products, matrix-vector products, matrix-matrix products, etc. Hardware
vendors provide highly tuned BLAS implementations for their architectures,
e.g. __Intel MKL__, __AMD ACML__, __Sun Performance Library__, etc.
Implementations like __NVIDIA cuBLAS__ exploit GPUs for the computation. In
addition, open source implementations like __BLIS__, __ATLAS__, __OpenBLAS__,
__ulmBLAS__, etc. exist that are on par with vendor tuned BLAS
implementations.

How much performance can be achieved in theory depends on various factors,
for instance on how well caches, vectorization and parallelism can be
exploited. The potential for performance depends on the type of operation.
In this respect BLAS divides the operation types into three levels:

- *BLAS Level 1* contains all linear algebra operations that require
  $\mathcal{O}(N)$ data and $\mathcal{O}(N)$ floating point operations.
  These are vector-vector operations like vector sums or dot products.

- *BLAS Level 2* contains all linear algebra operations that require
  $\mathcal{O}(N^2)$ data and $\mathcal{O}(N^2)$ floating point operations.
  These are operations like matrix-vector products.

- *BLAS Level 3* contains all linear algebra operations that require
  $\mathcal{O}(N^2)$ data and $\mathcal{O}(N^3)$ floating point operations.
  These are operations like matrix-matrix products.

*Only Level 3 operations can efficiently exploit caches.* That's because
there is a considerable gap between the amount of data and the amount of
floating point operations: once a cache is filled with $\mathcal{O}(N^2)$
data it gets used for $\mathcal{O}(N^3)$ floating point operations. With
Level 1 or 2 functions you can not exploit caches. Even an optimized
implementation of such functions will only achieve about 5-15% of the
theoretical peak performance. That means 85-95% of the time the CPU is
waiting for data.

*So achieving high performance in numerical linear algebra is twofold:*

- *The Numerical Math Part*

  First you need to organize an algorithm such that it is dominated by BLAS
  Level 3 operations. This can be achieved by considering algorithms that
  work on matrix blocks rather than matrix elements. That is exactly what
  libraries like __LAPACK__ are doing.

- *The Computer Science Part*

  Second you need an optimized BLAS implementation that actually achieves
  near peak performance for BLAS Level 3 functions.

Note that this separation allows mathematicians and computer scientists to
cooperate independently of each other. The mathematician just needs to know
that BLAS Level 3 gives great performance. The computer scientist just needs
to know that it is important that BLAS Level 3 functions actually achieve
great performance. But of course it is always favorable to look beyond one's
own nose!

We first give an overview of BLAS and how to make use of it with FLENS. Only
a few BLAS operations will be shown. For a complete overview have a look at
the __FLENS-BLAS__ page. By default FLENS will use __ulmBLAS__ as
computational backend. Optionally an external BLAS implementation can be
used. On later pages we will show how algorithms can be developed such that
they exploit BLAS Level 3 functions.
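To make the three levels concrete, here is a small sketch (our own
illustration, not one of the example files imported later on this page) that
computes one representative operation per level with the FLENS matrix/vector
types and the overloaded operators discussed below:

    #include <flens/flens.cxx>
    #include <iostream>

    using namespace flens;

    int
    main()
    {
        typedef GeMatrix<FullStorage<double, ColMajor> >  Matrix;
        typedef DenseVector<Array<double> >               Vector;

        const int n = 4;
        Matrix A(n, n), B(n, n), C(n, n);
        Vector x(n), y(n), v(n);

        // Fill with some values (FLENS uses 1-based indexing by default).
        for (int i=1; i<=n; ++i) {
            x(i) = i;
            y(i) = n-i;
            for (int j=1; j<=n; ++j) {
                A(i,j) = i+j;
                B(i,j) = i-j;
            }
        }

        v = 2.0*x + 3.0*y;   // Level 1: vector-vector, O(n)   data, O(n)   flops
        v = A*x;             // Level 2: matrix-vector, O(n^2) data, O(n^2) flops
        C = A*B;             // Level 3: matrix-matrix, O(n^2) data, O(n^3) flops

        std::cout << "C = " << C << std::endl;
    }

Only the last line has enough floating point operations per data item to keep
the CPU busy once the data sits in the cache.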
Performance: It's BLAS, not the Compiler
========================================

*As a rule of thumb you can say: In numerical linear algebra you never get
the performance from the compiler. You only get it from using BLAS.*

For an undergraduate class on high performance computing we developed step by
step an optimized implementation of the matrix-matrix product. The steps are
documented on __GEMM: From Pure C to SSE Optimized Micro Kernels__.

Performance benchmarks will certainly vary between different architectures.
But as they are mainly determined by how fast data can be moved from memory
to the caches and from the caches to the CPU, the relative comparison stays
valid:

- With a *naive C implementation* you achieve about 10% of the theoretical
  peak performance. Even if you try to reduce cache misses by traversing the
  matrix elements in the right order. The Netlib reference BLAS
  implementation (__Netlib RefBLAS__) is such a naive but not stupid
  implementation. For column major matrices it always traverses them
  column-wise (a sketch of such a loop ordering is given at the end of this
  section).

- With a *pure but clever C implementation* and a really, really good C
  compiler you can achieve about 30% of the theoretical peak performance.
  The compiler can not do better unless you build in some expert knowledge
  in numerical linear algebra. In the class project the best possible result
  through compiler optimizations was denoted *demo-naive-sse-with-intrinsics*:

  ---- IMAGE -----------------------------------------------------------------
  http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/ulmBLAS/bench/bench3.svg
  ----------------------------------------------------------------------------

  Actually we bet that you won't find any compiler that can achieve this
  performance. In the class project we simulated what a compiler could
  achieve under optimal assumptions about alignment, restricted access to
  memory, optimal use of registers, etc.

- An *optimized BLAS* achieves more than 90% of the theoretical peak
  performance. In the class project __ulmBLAS__ 10 lines of C code were
  replaced with about 500 lines of assembler code. The assembler code
  contains expert knowledge in numerical linear algebra that you can not
  expect from a general purpose compiler. This resulted in:

  ---- IMAGE -------------------------------------------------------------------------
  http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/ulmBLAS/bench/compare-bench5.svg
  ------------------------------------------------------------------------------------

As __ulmBLAS__ is derived from __BLIS__ it is no coincidence that we achieve
the same performance.

All the benchmarks are for single threaded implementations. Multi threading
for BLAS Level 3 scales nearly perfectly. That's because each thread can
compute separate blocks of the result matrix and race conditions are not an
issue.
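To illustrate what "naive but not stupid" means, here is a minimal sketch
(our own, not taken from the class project) of a straightforward triple loop
for the update $C \leftarrow C + A\,B$ with all matrices stored column major:

    // Naive matrix-matrix product C <- C + A*B for column major matrices
    // with leading dimensions ldA, ldB, ldC.  The (j,l,i) loop order lets
    // the innermost loop access columns of A and C contiguously; B(l,j) is
    // constant in the innermost loop.
    void
    gemm_naive(int m, int n, int k,
               const double *A, int ldA,
               const double *B, int ldB,
               double *C, int ldC)
    {
        for (int j=0; j<n; ++j) {
            for (int l=0; l<k; ++l) {
                for (int i=0; i<m; ++i) {
                    C[i+j*ldC] += A[i+l*ldA]*B[l+j*ldB];
                }
            }
        }
    }

Even with this access pattern the routine stays memory bound.  Getting close
to peak performance requires the blocking and micro kernels developed in the
course material linked above.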
:links: Intel MKL                -> https://software.intel.com/en-us/intel-mkl
        AMD ACML                 -> http://developer.amd.com/tools-and-sdks/cpu-development/amd-core-math-library-acml/
        Sun Performance Library  -> https://code.google.com/p/blis/
        NVIDIA cuBLAS            -> https://developer.nvidia.com/cuBLAS
        BLIS                     -> http://math-atlas.sourceforge.net
        OpenBLAS                 -> http://www.openblas.net
        ATLAS                    -> http://math-atlas.sourceforge.net
        ulmBLAS                  -> http://apfel.mathematik.uni-ulm.de/~lehn/ulmBLAS/
        LAPACK                   -> http://www.netlib.org/lapack/
        GEMM: From Pure C to SSE Optimized Micro Kernels -> http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/index.html
        Netlib RefBLAS           -> http://www.netlib.org/blas/
        FLENS-BLAS               -> doc:flens/blas/blas

About FLENS-BLAS
================
BLAS functions are provided in FLENS through three layers:

- __CXXBLAS__ *functions in namespace `cxxblas`*

  These provide a low-level C++ BLAS interface. These functions work on raw
  data pointers and by default call corresponding __ulmBLAS__ functions.
  Optionally you can compile and link your code such that an external and
  possibly vendor tuned BLAS function gets called.

  The __CXXBLAS__ interface is similar to the CBLAS interface. It uses only
  very few C++ features, e.g. templates are used merely for the element
  type. It does not use structs, classes or even enum types. So compared to
  CBLAS there is, except for the use of templates, no additional
  abstraction.

- __FLENS-BLAS__ *functions in namespace `flens::blas`*

  These functions provide a high-level BLAS interface for the FLENS
  matrix/vector types like `DenseVector` or `GeMatrix`. These functions
  internally call __CXXBLAS__ functions.

- *Overloaded operators*

  These have the advantage of allowing a more expressive notation. In FLENS
  we are very pedantic that this kind of syntactic sugar does not lead to
  any hidden performance penalty.

  Using overloaded operators for BLAS should not be misunderstood. They
  certainly are not provided such that a user does not need to know about
  BLAS. If you want to do high performance computing you always have to
  think in BLAS. We will discuss overloaded operators in more detail on the
  next page.

In all cases the runtime overhead (if any) is negligible compared to calling
a BLAS implementation directly. Runtime assertions merely check the
consistency of the arguments. Compiling with `-DNDEBUG` removes all of these
checks. It is common practice to use runtime assertions during the test
phase and to generate production code with `-DNDEBUG`.

:links: FLENS-BLAS -> doc:flens/blas/blas
        CXXBLAS    -> dir:ulmblas/cxxblas/
        ulmBLAS    -> http://apfel.mathematik.uni-ulm.de/~lehn/ulmBLAS/
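The following sketch (our own illustration, assuming the usual `DenseVector`
accessors `length()`, `data()` and `stride()` and the `cxxblas::axpy`
argument order that mirrors CBLAS) performs the same `axpy` update once per
layer, from the most expressive to the most low-level notation:

    #include <flens/flens.cxx>
    #include <iostream>

    using namespace flens;

    int
    main()
    {
        typedef DenseVector<Array<double> >  Vector;

        Vector x(3), y(3);
        x = 1, 2, 3;
        y = 4, 5, 6;

        // Layer 3: overloaded operators
        y += 1.5*x;

        // Layer 2: FLENS-BLAS, called with FLENS vector types
        blas::axpy(1.5, x, y);

        // Layer 1: CXXBLAS, called with raw pointers, length and stride
        cxxblas::axpy(x.length(), 1.5, x.data(), x.stride(),
                      y.data(), y.stride());

        std::cout << "y = " << y << std::endl;
    }

Each of the three calls performs the same update $y \leftarrow y + 1.5\,x$,
so after running the sketch $y$ has been updated three times; the point here
is only how the layers relate to each other.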
BLAS Level 1 Functions
======================
This BLAS level originally deals with vector-vector operations only. For
convenience we added to __FLENS-BLAS__ in some cases matrix-matrix variants
that, for instance, allow copying or adding matrices.

+----------------+-----------------------------------------------+---------------+
| __FLENS-BLAS__ | DESCRIPTION                                   | __CXXBLAS__   |
+================+===============================================+===============+
| __copy__       | $y \leftarrow x$                              | __copy[1]__   |
|                |                                               |               |
|                | Copies a vector $\vec{x}$ to a vector         |               |
|                | $\vec{y}$ or a matrix $A$ to a matrix $B$.    |               |
+----------------+-----------------------------------------------+---------------+
| __axpy__       | $\vec{y} \leftarrow \alpha \vec{x} + \vec{y}$ | __axpy[1]__   |
|                |                                               |               |
|                | `axpy` is short for *alpha x plus y*          |               |
|                |                                               |               |
|                | FLENS also provides a variant for `GeMatrix`  |               |
|                | that computes $Y \leftarrow \alpha X + Y$.    |               |
+----------------+-----------------------------------------------+---------------+
| __scal__       | $\vec{x} \leftarrow \alpha \vec{x}$           | __scal[1]__   |
|                |                                               |               |
|                | Scales a vector by a constant.                |               |
|                |                                               |               |
|                | FLENS also provides a variant for `GeMatrix`  |               |
|                | that computes $X \leftarrow \alpha X$.        |               |
+----------------+-----------------------------------------------+---------------+
| __swap__       | Interchanges two vectors, i.e.                | __swap[1]__   |
|                | $\vec{x} \leftrightarrow \vec{y}$.            |               |
+----------------+-----------------------------------------------+---------------+
| __asum__       | Takes the sum of the absolute values, i.e.    | __asum[1]__   |
|                | computes $\sum\limits_{i} |x_i|$.             |               |
+----------------+-----------------------------------------------+---------------+
| __dot__,       | $\alpha \leftarrow \vec{\bar{x}}^T \vec{y}$ or| __dot[1]__,   |
| dotu           | $\alpha \leftarrow \vec{x}^T \vec{y}$         | udot          |
|                |                                               |               |
|                | Forms the dot product of two vectors, i.e.    |               |
|                | computes $\sum\limits_{i} \bar{x}_i y_i$      |               |
|                | or $\sum\limits_{i} x_i y_i$.                 |               |
+----------------+-----------------------------------------------+---------------+
| __nrm2__       | Computes the Euclidean norm of a vector, i.e. | __nrm2[1]__   |
|                | $\sqrt{\sum\limits_{i} |x_i|^2}$.             |               |
+----------------+-----------------------------------------------+---------------+
| __rot__        | Applies a plane rotation.                     | __rot[1]__    |
+----------------+-----------------------------------------------+---------------+
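The examples below only use `copy` and `axpy`. As a quick illustration of
some of the other entries in the table, here is a small sketch (our own, not
one of the imported example files; it assumes the argument conventions
`scal(alpha, x)`, `swap(x, y)` and `axpy(alpha, x, y)`):

    #include <flens/flens.cxx>
    #include <iostream>

    using namespace flens;

    int
    main()
    {
        typedef DenseVector<Array<double> >  Vector;

        Vector x(3), y(3);
        x = 1, 2, 3;
        y = 4, 5, 6;

        blas::scal(2.0, x);        // x <- 2*x
        blas::swap(x, y);          // exchange x and y
        blas::axpy(-1.0, x, y);    // y <- y - x

        std::cout << "x = " << x << std::endl;
        std::cout << "y = " << y << std::endl;
    }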
BLAS Level 1 Example: Adding Vectors with Operators
---------------------------------------------------
In the following examples we compute the following vector sums:

- $\vec{v} \leftarrow \vec{x} + \vec{y} + \vec{z}$ and
- $\vec{v} = 2\vec{x} + 3\vec{y} + 4\vec{z}$

Using overloaded operators this can be coded in an expressive way. *People
from high performance computing will get nervous if they see the code.* They
are more concerned about how computations are carried out and whether
temporary objects get created. Ease-of-use comes second in this field. We
will show in the next example that no temporaries get created and which BLAS
routines are used for the evaluation.

:import: flens/examples/vecsum_operators.cc

This leads to:

*--[SHELL]-----------------------------------------------------------------------*
|                                                                                 |
|  cd flens/examples                                                              |
|  g++ -Wall -std=c++11 -I../.. vecsum_operators.cc                               |
|  ./a.out                                                                        |
|                                                                                 |
*---------------------------------------------------------------------------------*

BLAS Level 1 Example: Adding Vectors with FLENS-BLAS
----------------------------------------------------
In the next example we show how the expressions `v = x + y + z` and
`v = 2*x + 3*y + 4*z` are internally computed by calling __FLENS-BLAS__
functions:

- $\vec{v} \leftarrow \vec{x} + \vec{y} + \vec{z}$

  This gets computed using __blas::copy__ and __blas::axpy__:

  - $\vec{v} \leftarrow \vec{x}$             (__blas::copy__)
  - $\vec{v} \leftarrow \vec{v} + \vec{y}$   (__blas::axpy__)
  - $\vec{v} \leftarrow \vec{v} + \vec{z}$   (__blas::axpy__)

- $\vec{v} = 2\vec{x} + 3\vec{y} + 4\vec{z}$

  This gets computed by zero-fill and __blas::axpy__:

  - $\vec{v} \leftarrow 0$                   (zero-fill)
  - $\vec{v} \leftarrow \vec{v} + 2\vec{x}$  (__blas::axpy__)
  - $\vec{v} \leftarrow \vec{v} + 3\vec{y}$  (__blas::axpy__)
  - $\vec{v} \leftarrow \vec{v} + 4\vec{z}$  (__blas::axpy__)

  Note that zero-fill is not a BLAS functionality. It is implemented as a
  simple loop or uses `std::fill` if possible.

  This variant is slightly more efficient than

  - $\vec{v} \leftarrow \vec{x}$             (__blas::copy__)
  - $\vec{v} \leftarrow 2 \vec{v}$           (__blas::scal__)
  - $\vec{v} \leftarrow \vec{v} + 3\vec{y}$  (__blas::axpy__)
  - $\vec{v} \leftarrow \vec{v} + 4\vec{z}$  (__blas::axpy__)

:import: flens/examples/vecsum_flensblas.cc

We compile as usual and get:

*--[SHELL]-----------------------------------------------------------------------*
|                                                                                 |
|  cd flens/examples                                                              |
|  g++ -Wall -std=c++11 -I../.. vecsum_flensblas.cc                               |
|  ./a.out                                                                        |
|                                                                                 |
*---------------------------------------------------------------------------------*

BLAS Level 1 Example: Adding Vectors with CXXBLAS
-------------------------------------------------
Next we show how the __FLENS-BLAS__ functions internally call the
__CXXBLAS__ functions __cxxblas::copy__ and __cxxblas::axpy__:

:import: flens/examples/vecsum_cxxblas.cc

We compile as usual and get:

*--[SHELL]-----------------------------------------------------------------------*
|                                                                                 |
|  cd flens/examples                                                              |
|  g++ -Wall -std=c++11 -I../.. vecsum_cxxblas.cc                                 |
|  ./a.out                                                                        |
|                                                                                 |
*---------------------------------------------------------------------------------*

:links: FLENS-BLAS    -> doc:flens/blas/blas
        CXXBLAS       -> dir:ulmblas/cxxblas/
        blas::axpy    -> doc:flens/blas/level1/axpy
        blas::copy    -> doc:flens/blas/level1/copy
        blas::scal    -> doc:flens/blas/level1/scal
        cxxblas::axpy -> file:ulmblas/cxxblas/level1/axpy.h
        cxxblas::copy -> file:ulmblas/cxxblas/level1/copy.h
        (.*)\[1\]     -> doc:ulmblas/cxxblas/level1/$1
        (.*)          -> doc:flens/blas/level1/$1

BLAS Level 2 Functions
======================
+----------------+---------------------------------------------------+------------+
| __FLENS-BLAS__ | DESCRIPTION                                       | __CXXBLAS__|
+================+===================================================+============+
| __mv__         | Computes a matrix-vector product. The form of     |            |
|                | the product depends on the matrix type:           |            |
|                | - For *general* matrices it is                    |            |
|                |   $y \leftarrow \beta y + \alpha \text{op}(A) x$. | __gemv__   |
|                | - For *symmetric* matrices it is                  |            |
|                |   $y \leftarrow \beta y + \alpha A x$.            | __symv__   |
|                | - For *hermitian* matrices it is                  |            |
|                |   $y \leftarrow \beta y + \alpha A x$.            | __hemv__   |
|                | - For *triangular* matrices it is                 |            |
|                |   $x \leftarrow \text{op}(A) x$.                  | __trmv__   |
|                | Here $\text{op}(A)$ denotes $A$, $A^T$ or         |            |
|                | $A^H$.                                            |            |
+----------------+---------------------------------------------------+------------+
| __r__          | Computes a rank `1` operation. The type of        |            |
|                | operation depends on the type of the matrix that  |            |
|                | gets updated:                                     |            |
|                | - For *general* matrices it is                    | __geru__   |
|                |   $A \leftarrow A + \alpha x y^T$.                | __gerc__   |
|                | - For *symmetric* matrices it is                  | __syr__    |
|                |   $A \leftarrow A + \alpha x x^T$.                | __her__    |
+----------------+---------------------------------------------------+------------+
| __r2__         | Computes a rank `2` operation. The type of        |            |
|                | operation depends on the type of the matrix       |            |
|                | that gets updated:                                |            |
|                | - For *symmetric* matrices it is                  |            |
|                |   $A \leftarrow A + \alpha x y^T + \alpha y x^T$. | __syr2__   |
|                | - For *hermitian* matrices it is                  |            |
|                |   $A \leftarrow A + \alpha x y^H                  |            |
|                |   + \overline{\alpha} y x^H$.                     | __her2__   |
+----------------+---------------------------------------------------+------------+
| __sv__         | Solves one of the systems of equations            |            |
|                | $Ax = b$ or $A^T x = b$ where $A$ is a unit or    | __trsv__   |
|                | non-unit, upper or lower triangular matrix.       |            |
+----------------+---------------------------------------------------+------------+
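The `mv` entry unifies several BLAS routines behind one name. As a condensed
preview of the correspondence that the following three subsections work out
in detail, here is a small sketch (our own, not one of the imported example
files; it assumes the `blas::mv` argument order `(trans, alpha, A, x, beta,
y)` and the `Transpose` value `Trans`) for the general case:

    #include <flens/flens.cxx>
    #include <iostream>

    using namespace flens;

    int
    main()
    {
        typedef GeMatrix<FullStorage<double, ColMajor> >  Matrix;
        typedef DenseVector<Array<double> >               Vector;

        Matrix A(3, 3);
        Vector x(3), y(3);

        A = 1, 2, 3,
            4, 5, 6,
            7, 8, 9;
        x = 1, 1, 1;
        y = 0, 0, 0;

        double beta = 0.5;

        // With overloaded operators ...
        y = beta*y + transpose(A)*x;

        // ... and the corresponding FLENS-BLAS call (gemv behind the scenes,
        // applied here to the y just computed above).
        blas::mv(Trans, 1.0, A, x, beta, y);

        std::cout << "y = " << y << std::endl;
    }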
BLAS Level 2 Example: Matrix-Vector Product with Operators
----------------------------------------------------------
In the following examples we compute the following matrix-vector products:

- $\vec{y} \leftarrow A \vec{x}$
- $\vec{y} \leftarrow A^T \vec{x}$
- $\vec{y} \leftarrow \beta \vec{y} + A^T \vec{x}$

If you are familiar with BLAS you know that each of these operations can be
computed by a single BLAS routine unless $A$ is a triangular matrix. We will
show later that this is exactly how the computations get carried out
internally.

:import: flens/examples/matvecprod_operators.cc

This leads to:

*--[SHELL]-----------------------------------------------------------------------*
|                                                                                 |
|  cd flens/examples                                                              |
|  g++ -Wall -std=c++11 -I../.. matvecprod_operators.cc                           |
|  ./a.out                                                                        |
|                                                                                 |
*---------------------------------------------------------------------------------*

BLAS Level 2 Example: Matrix-Vector Product with FLENS-BLAS
-----------------------------------------------------------
Having a look at the documentation of __blas::mv__ you see that all of these
operations can be handled by a single (FLENS-)BLAS function. So here it is
more transparent that no temporary objects get created:

:import: flens/examples/matvecprod_flensblas.cc

This leads to:

*--[SHELL]-----------------------------------------------------------------------*
|                                                                                 |
|  cd flens/examples                                                              |
|  g++ -Wall -std=c++11 -I../.. matvecprod_flensblas.cc                           |
|  ./a.out                                                                        |
|                                                                                 |
*---------------------------------------------------------------------------------*

BLAS Level 2 Example: Matrix-Vector Product with CXXBLAS
--------------------------------------------------------
The function called by __blas::mv__ is in this case __cxxblas::gemv__:

:import: flens/examples/matvecprod_cxxblas.cc

This leads to:

*--[SHELL]-----------------------------------------------------------------------*
|                                                                                 |
|  cd flens/examples                                                              |
|  g++ -Wall -std=c++11 -I../.. matvecprod_cxxblas.cc                             |
|  ./a.out                                                                        |
|                                                                                 |
*---------------------------------------------------------------------------------*

:links: blas::mv      -> doc:flens/blas/level2/mv
        cxxblas::gemv -> file:ulmblas/cxxblas/level2/gemv.h
        FLENS-BLAS    -> doc:flens/blas/blas
        CXXBLAS       -> dir:ulmblas/cxxblas/
        __mv__        -> doc:flens/blas/level2/mv
        __r__         -> doc:flens/blas/level2/r
        __r2__        -> doc:flens/blas/level2/r2
        __sv__        -> doc:flens/blas/level2/sv
        __(.*)__      -> doc:ulmblas/cxxblas/level2/$1

BLAS Level 3 Functions
======================
+----------------+---------------------------------------------------+------------+
| __FLENS-BLAS__ | DESCRIPTION                                       | __CXXBLAS__|
+================+===================================================+============+
| __mm__         | Computes a matrix-matrix product. The form of     |            |
|                | the product depends on the matrix types. If one   |            |
|                | matrix is a *general* matrix and the other matrix |            |
|                | is                                                |            |
|                | - *general* then it is                            | __gemm__   |
|                |   $C \leftarrow \beta C                           |            |
|                |   + \alpha \, \text{op}(A) \, \text{op}(B)$       |            |
|                | - *symmetric* then it is                          | __symm__   |
|                |   $C \leftarrow \beta C                           |            |
|                |   + \alpha \, A \, \text{op}(B)$                  |            |
|                | - *hermitian* then it is                          | __hemm__   |
|                |   $C \leftarrow \beta C                           |            |
|                |   + \alpha \, A \, \text{op}(B)$                  |            |
|                | - *triangular* then it is                         | __trmm__   |
|                |   $B \leftarrow \alpha \, \text{op}(A) \, B$      |            |
+----------------+---------------------------------------------------+------------+
| __r2k__        | Computes a symmetric rank-`2k` update, i.e.       | __syr2k__  |
|                | $C \leftarrow \beta C                             |            |
|                | + \alpha\,A\, B^T + \alpha\,B\,A^T$ or            |            |
|                | $C \leftarrow \beta C                             |            |
|                | + \alpha\,A^T \, B + \alpha\,B^T\,A$.             |            |
+----------------+---------------------------------------------------+------------+
| __rk__         | Computes a symmetric rank-`k` update, i.e.        | __syrk__   |
|                | $C \leftarrow \beta C + \alpha A \, A^T$          |            |
|                | or                                                |            |
|                | $C \leftarrow \beta C + \alpha A^T \, A$          |            |
+----------------+---------------------------------------------------+------------+
| __sm__         | Solves one of the matrix equations                | __trsm__   |
|                | $\text{op}(A)\,X = B$ or $X\,\text{op}(A) = B$    |            |
|                | for $X$ where $A$ is a unit or non-unit, upper    |            |
|                | or lower triangular matrix and $\text{op}(A)$     |            |
|                | denotes $A$, $A^T$ or $A^H$.                      |            |
+----------------+---------------------------------------------------+------------+
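Analogously to the Level 2 case, the `mm` entry unifies several routines. As
a condensed preview of the examples below, here is a small sketch (again our
own illustration; the `blas::mm` argument order `(transA, transB, alpha, A,
B, beta, C)` is an assumption based on the gemm convention) for the general
case:

    #include <flens/flens.cxx>
    #include <iostream>

    using namespace flens;

    int
    main()
    {
        typedef GeMatrix<FullStorage<double, ColMajor> >  Matrix;

        Matrix A(3, 3), B(3, 3), C(3, 3);

        A = 1, 2, 3,
            4, 5, 6,
            7, 8, 9;
        B = 9, 8, 7,
            6, 5, 4,
            3, 2, 1;

        // With overloaded operators ...
        C = transpose(A)*B;

        // ... and the corresponding FLENS-BLAS call (gemm behind the
        // scenes); with beta=0 the previous content of C is ignored.
        blas::mm(Trans, NoTrans, 1.0, A, B, 0.0, C);

        std::cout << "C = " << C << std::endl;
    }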
BLAS Level 3 Example: Matrix-Matrix Product with Operators
----------------------------------------------------------
In the following examples we compute the following matrix-matrix products:

- $C = A*B$
- $C = A^T*B$
- $C = S * B$
- $B = U * B$

where $A$, $B$, $C$ are general matrices, $S$ is a symmetric matrix and $U$
a triangular matrix. Again, if you are familiar with BLAS you know that each
of these operations can be computed by a single BLAS routine.

:import: flens/examples/matmatprod_operators.cc

This leads to:

*--[SHELL]-----------------------------------------------------------------------*
|                                                                                 |
|  cd flens/examples                                                              |
|  g++ -Wall -std=c++11 -I../.. matmatprod_operators.cc                           |
|  ./a.out                                                                        |
|                                                                                 |
*---------------------------------------------------------------------------------*

BLAS Level 3 Example: Matrix-Matrix Product with FLENS-BLAS
-----------------------------------------------------------
Again, having a look at the documentation of __blas::mm__ you see that all
of these operations can be handled by a single (FLENS-)BLAS function:

:import: flens/examples/matmatprod_flensblas.cc

This leads to:

*--[SHELL]-----------------------------------------------------------------------*
|                                                                                 |
|  cd flens/examples                                                              |
|  g++ -Wall -std=c++11 -I../.. matmatprod_flensblas.cc                           |
|  ./a.out                                                                        |
|                                                                                 |
*---------------------------------------------------------------------------------*

BLAS Level 3 Example: Matrix-Matrix Product with CXXBLAS
--------------------------------------------------------
The functions called by __blas::mm__ are in this case __cxxblas::gemm__,
__cxxblas::symm__ and __cxxblas::trmm__:

:import: flens/examples/matmatprod_cxxblas.cc

This leads to:

*--[SHELL]-----------------------------------------------------------------------*
|                                                                                 |
|  cd flens/examples                                                              |
|  g++ -Wall -std=c++11 -I../.. matmatprod_cxxblas.cc                             |
|  ./a.out                                                                        |
|                                                                                 |
*---------------------------------------------------------------------------------*

:links: mm            -> doc:flens/blas/level3/mm
        blas::mm      -> doc:flens/blas/level3/mm
        r2k           -> doc:flens/blas/level3/r2k
        rk            -> doc:flens/blas/level3/rk
        sm            -> doc:flens/blas/level3/sm
        FLENS-BLAS    -> doc:flens/blas/blas
        CXXBLAS       -> dir:ulmblas/cxxblas/
        cxxblas::gemm -> file:ulmblas/cxxblas/level3/gemm.h
        cxxblas::symm -> file:ulmblas/cxxblas/level3/symm.h
        cxxblas::trmm -> file:ulmblas/cxxblas/level3/trmm.h
        (.*)          -> doc:ulmblas/cxxblas/level3/$1

Rationale of FLENS-BLAS Function Names
======================================
Function names in FLENS-BLAS are derived from BLAS/CBLAS. As the FLENS
matrix/vector types encapsulate element types and mathematical properties,
this simplifies the BLAS/CBLAS function name scheme.

- *Simplified naming convention*

  Consider BLAS functions like `dgemv`, `sgemv`, `ssbmv`, `dgbmv`, etc. All
  of them implement some kind of matrix-vector product. That's what the `mv`
  stands for. However, the names also encapsulate further information about
  the involved matrix/vector types:

  - The first letter indicates the element type:
    - `s` for single precision,
    - `d` for double precision,
    - `c` for complex single precision,
    - `z` for complex double precision.
  - The next two letters indicate the involved matrix type:
    - `ge` for a general matrix with full storage,
    - `gb` for a general matrix with band storage,
    - `sy` for a symmetric matrix with full storage,
    - `sb` for a symmetric matrix with band storage,
    - ...

  Because FLENS provides actual matrix/vector types this information can be
  retrieved from the argument types. Hence, FLENS merely provides one
  function `blas::mv` and overloads it for different matrix/vector types.
  Analogously, the FLENS function `blas::mm` is overloaded for various
  matrix/vector types such that it unifies `gemm`, `symm`, `trmm`, etc.

- *Simplified function arguments*

  Compared to calling BLAS functions directly or through the CBLAS
  interface, the number of arguments required by FLENS-BLAS functions is
  significantly smaller. In BLAS/CBLAS you always have to pass

  - matrix/vector dimensions,
  - leading dimensions/strides and
  - a data pointer

  for each matrix/vector argument. In FLENS-BLAS you simply pass a single
  matrix/vector object to the corresponding function. As an example: `dgemv`
  from CBLAS requires `12` arguments while its FLENS-BLAS counterpart
  `blas::mv` only requires `6` arguments. Besides being more convenient this
  also increases safety and improves correctness. A small sketch contrasting
  the two calling styles follows below.
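The following sketch (our own; the exact `cxxblas::gemv` argument order shown
here mirrors CBLAS and is an assumption, as are the `data()`,
`leadingDimension()` and `stride()` accessors) contrasts the two calling
styles for $y \leftarrow A\,x$:

    #include <flens/flens.cxx>

    using namespace flens;

    int
    main()
    {
        typedef GeMatrix<FullStorage<double, ColMajor> >  Matrix;
        typedef DenseVector<Array<double> >               Vector;

        Matrix A(3, 3);
        Vector x(3), y(3);

        A = 1, 2, 3,
            4, 5, 6,
            7, 8, 9;
        x = 1, 1, 1;

        // CBLAS style: 12 arguments, everything passed explicitly.
        cxxblas::gemv(ColMajor, NoTrans,
                      A.numRows(), A.numCols(),
                      1.0, A.data(), A.leadingDimension(),
                      x.data(), x.stride(),
                      0.0, y.data(), y.stride());

        // FLENS-BLAS: 6 arguments, everything else comes from the objects.
        blas::mv(NoTrans, 1.0, A, x, 0.0, y);
    }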
:navigate: __up__    -> doc:flens/examples/tutorial
           __back__  -> doc:flens/examples/tut01-page05
           __next__  -> doc:flens/examples/tut01-page07