Content

Unblocked Algorithm
Blocked Algorithm
Block Algorithm with Parallel BLAS
Parallel Tile Algorithm

Coding a Cache Optimized Parallel LU Factorization

For a general $m \times n$ matrix $A$ the factorization computes $P$, $L$ and $U$ such that

\[ A = P L U\]

where

$P$ is a $m \times m$ permutation matrix. Our implementation will represent it a permutation vector $p$. This vector will only contain integers and encodes the permutation as follows: interchange row $i$ with $p_i$ for $1\leq i \leq m$.

The permutation matrix is determined by a pivoting strategy. Such a strategy is necessary to prevent divisions by zero or division by very small numbers. So it enhances the numerical stability of the algorithm. Pivoting strategies differ in the amount of effort you want to invest for stability. We will use partial pivoting in our factorization.
$L$ is a lower triangular $m \times m$ matrix with ones on the diagonal.

The strict lower triangular part of $A$ gets overwritten with $L$. So no additional matrix is needed to store $L$.
$U$ is an upper triangular $m \times n$ matrix.

The upper triangular part of $A$ gets overwritten with $U$. So no additional matrix is needed to store $U$.

Every student who took an undergraduate class on numerical linear algebra knows how to implement this element wise using 3 loops We will derive two variants that use matrix/vector operations instead of element wise operations. Hence, we can use BLAS for the heavy computations:

Unblocked Variant lu_unblk

This variant uses only BLAS Level 1 and Level 2 functions. So here we can not expect high performance. But we will take advantage of it. Low performance means that the CPU is waiting for work most of the time. So we can do some extra computations if it improves numerical stability.
Blocked Variant lu_blk

This variant uses the unblocked variant lu_unblk only on a rather small block. Using BLAS Level 3 functions the remaining matrix gets updated with the results. We we achieve high performance. Furthermore, it dominates the overall performance and leads to an immense speedup.

Comparing the efficiency (measured in MFLOPS) you see that the blocked variant is superior for all problem sizes. For large systems of linear equations it almost reaches 8 GFLOPS. The unblock variant drops from 1 GFLOP to 500 MFLOPS when the problem size exceeds the cache size. So efficiency is increased by a factor of 14. Note that this is achieved without multi threading just by using an algorithm that allows to exploit cache hierarchies:

Of course in practise not efficiency but effectiveness matters. Effectiveness means how long does it take to compute the factorisation. So less is better:

Note that both variants still have a complexity of $\mathcal{O}(N^3)$. Only the constant for the $N^3$ differs significantly. This can be visualized by a log-scaled plot:
Parallel Tile Variant lu_tile

Once we can exploit a single CPU core we can consider parallelization. The matrix is converted to a block format. For this blocked or tiled matrix tasks for the LU factorization are done in parallel if possible.

The following benchmarks were done on a Quad-Core Intel Xeon Woodcrest with 2.66 GHz:

Using the timing measurements of lu_blk we can see that we have an effective per-core speedup:

Unblocked Algorithm

The idea to develop an algorithm in terms of matrix/vector operations is based on considering partitions of matrices.

Partitioning $A$, $L$ and $U$

We partition $A \in \mathbb{R}^{m \times n}$ along the first row and first column as follows:

\[ \displaystyle A = \left( \begin{array}{c|c} a_{11} & \vec{a}_{12}^T \\ \hline \vec{a}_{21} & A_{22} \end{array}\right) \qquad a_{11} \in \mathbb{R},\; \vec{a}_{12}^T \in \mathbb{R}^{1 \times (n-1)},\; \vec{a}_{21} \in \mathbb{R}^{(m-1) \times 1},\; A_{22} \in \mathbb{R}^{(m-1) \times (m-1)}.\]

So $a_{11}$ is a scalar, $\vec{a}_{21}$ a column vector, $\vec{a}_{12}^T$ a row vector and $A_{22}$ a matrix block.

We first assume that the permutation matrix $P$ is just the identity matrix, i.e. that no pivoting is needed. In this case the factorization reads $A = LU$. We now partition $L$ and $U$ accordingly.

As $L \in \mathbb{R}^{m \times m}$ is lower triangular with unit diagonal the partition reads:

\[ L = \left( \begin{array}{c|c} 1 & \vec{\mathcal{0}}^T \\ \hline \vec{l}_{21} & L_{22} \end{array}\right) \qquad \vec{l}_{21} \in \mathbb{R}^{(m-1) \times 1},\; L_{22} \in \mathbb{R}^{(m-1) \times (m-1)}. \]

Analogously we get for $U$:

\[ U = \left( \begin{array}{c|c} u_{11} & \vec{u}_{12}^T \\ \hline \mathcal{0} & U_{22} \end{array}\right) \qquad u_{11} \in \mathbb{R},\; \vec{u}_{12}^T \in \mathbb{R}^{1 \times (n-1)},\; L_{22} \in \mathbb{R}^{(m-1) \times (n-1)}.\]

Recursive Algorithm for a Factorization $A = LU$

In the matrix product equation $A=LU$ we now simply plug in the partitioned matrices and get

\[ \begin{eqnarray*} A = LU &\;\Leftrightarrow\;& \left( \begin{array}{c|c} a_{11} & \vec{a}_{12}^T \\ \hline \vec{a}_{21} & A_{22} \end{array}\right) = \left( \begin{array}{c|c} 1 & \vec{\mathcal{0}}^T \\ \hline \vec{l}_{21} & L_{22} \end{array}\right) \left( \begin{array}{c|c} u_{11} & \vec{u}_{12}^T \\ \hline \mathcal{0} & U_{22} \end{array}\right) \\[1em] &\;\Leftrightarrow\;& \left( \begin{array}{c|c} a_{11} & \vec{a}_{12}^T \\ \hline \vec{a}_{21} & A_{22} \end{array}\right) = \left( \begin{array}{c|c} u_{11} & \vec{u}_{12}^T \\ \hline \vec{l}_{21} u_{11} & \vec{l}_{21} \vec{u}_{12}^T + L_{22} U_{22} \end{array}\right) \end{eqnarray*}\]

Comparing partition blocks leads to the equations

\[ \left\{ \begin{array}{lcl} a_{11} &=& u_{11} \\ \vec{a}_{12} &=& \vec{u}_{12} \\ \vec{a}_{21} &=& u_{11} \vec{l}_{21} \\ A_{22} &=& \vec{l}_{21} \vec{u}_{12}^T + L_{22}U_{22} \end{array} \right. \quad \Leftrightarrow \quad \left\{ \begin{array}{lcl} u_{11} &=& a_{11} \\ \vec{u}_{12} &=& \vec{a}_{12} \\ \vec{l}_{21} &=& \vec{a}_{21} / u_{11} \\ L_{22}U_{22} &=& A_{22} - \vec{l}_{21} \vec{u}_{12}^T \end{array} \right.\]

So except for $L_{22}$ and $U_{22}$ all entries of the partitioned matrices $L$ and $U$ can be computed directly. The blocks $L_{22}$ and $U_{22}$ can be computed by computing the $LU$ factorization of the $(m-1) \times (n-1)$ matrix $A_{22} - \vec{l}_{21} \vec{u}_{12}^T$.

Overwriting the partitions of $A$ with partitions of $L$ and $U$ reads:

\[ \begin{array}{lcll} a_{11} & \leftarrow & a_{11} & \text{(No Operation)} \\ \vec{a}_{12} & \leftarrow & \vec{a}_{12} & \text{(No Operation)} \\ \vec{a}_{21} & \leftarrow & \vec{a}_{21} / a_{11} & \text{(Scale)} \\ A_{22} & \leftarrow & A_{22} - \vec{a}_{21} \vec{a}_{12}^T & \text{(Rank 1 Update)} \end{array}\]

So obviously this requires that $a_{11} \neq 0$ If $a_{11} = 0$ we have to use a pivot strategy that interchanges two rows of $A$ first. We will add this next. For the moment we derived the well-known Gauss algorithm (without row interchanges):

Partition $A$ as outlined above.
Overwrite $a_{21}$ with $\frac{1}{a_{11}} a_{21}$.
Overwrite $A_{22}$ with $A_{22} - a_{21} a_{12}^T$.
Repeat with $A_{22}$ instead of $A$.

Recursive Algorithm with Pivoting: $A = PLU$

In the above algorithm we have to avoid a division by zero. For numerical stability it is favorable that in a floating point division x/y the absolute value of y is larger than the absolute value of x.

So the idea to achieve both is simple and known as partial pivoting. Before each step we first determine the component in $\vec{a}_{21}$ with maximal absolute value. We interchange then complete rows of the original matrix $A$ such that $a_{11}$ always contains the maximal absolute value. Note that is important the rows of the original matrix get interchanged.

In order to remember what rows of the original matrix where interchanged we use a pivot vector $\vec{p}$ that initially contains values $1, \dots, m$. If row $i$ gets interchanged with row $j$ we set $p_i = j$.

Iterative Algorithm with Pivoting: $A = PLU$

Of course we prefer an iterative algorithm for an implementation. This is simply achieved by considering the partitioning within the original matrix:

\[ \require{color} \left( \begin{array}{c|c} A_{00} & A_{01} \\ \hline A_{10} & {\color{red}\tilde{A}_{11}} \\ \end{array} \right) \qquad\text{with}\quad{\color{red}\tilde{A}} = \left( \begin{array}{c|c} a_{11} & \vec{a}_{12}^T \\ \hline \vec{a}_{21} & A_{22} \end{array} \right)\]

So in the $i$-th step we are working on the sub-matrix $\tilde{A}$. In Matlab notation this sub-matrix is related to $A$ by $\tilde{A} = A(i:m, i:n)$. Parts of $\tilde{A}$ are therefore:

\[ \begin{array}{lclclcl} \text{first column of}\;\tilde{A} & = & A(&i:m &,& i &) \\ a_{11} & = & A(&i &,& i &) \\ \vec{a}_{21} & = & A(&i+1:m &,& i &) \\ \vec{a}_{12}^T & = & A(&i &,& i+1:n &) \\ \vec{a}_{12}^T & = & A(&i+1:m &,& i+1:n &) \\ \end{array}\]

This gives the iterative algorithm for computing the $LU$ factorization with partial pivoting:

For $i=1,\dots,\min\{m,n\}$
- Find $l \in {i,\dots,m}$ such that $|\,A_{l,i}| = \max\limits_{k=i,\dots,m}|\,A_{k,i}|$
- Set $p_i = l$
- If $p_i \neq i$ then
  - interchange rows $i$ and $l$ of $A$.
- If $|\,A_{i,i}|$ is exactly Zero
  - return $i$
- $A_{i+1:m, i} \leftarrow \frac{1}{A_{i,i}} \cdot A_{i+1:m, i} $ (Scale)
- $A_{i+1:m,i+1:n} \leftarrow A_{i+1:m,i+1:n} - A_{i+1:m,i} \cdot A_{i,i+1:n}$ (Rank 1 Update)

The algorithm aborts if in the $i$-th step no non-zero element can be found. So yes, we continue even if $A_{i,i}$ is almost zero. Numerical stability can be optimized if scaling is done by division if $A_{i,i}$ is almost zero.

Improving Numerical Stability

Floating point divisions are more expensive (i.e. need more CPU cycles) than floating point multiplications. That makes it favorable to compute first the reciprocal $\alpha = \frac{1}{A_{i,i}}$ and then the scaling $A_{i+1:m, i} \leftarrow \alpha A_{i+1:m, i}$.

If the absolute value of $A_{i,i}$ is small the division $1/A_{i,i}$ can cause an overflow. Even without an overflow scaling by multiplying with the reciprocal can cause a considerable loss of precision. Because of the partial pivoting the element wise divisions $A_{j,i}/A_{i,i}$ with $i < j \leq m$ still give results within machine accuracy. Using the FPU for the division makes the division considerably more expensive than multiplications using SSE or AVX. However, using the FPU the division gets internally done with 80 bits floating point numbers instead of 64 or 32 bits.

Note that the dominant operations in the above algorithm are Vector Scaling (BLAS Level 1) and Rank 1 Update (BLAS Level 2). We don't have any BLAS Level 3 function. So this algorithm can not exploit caches. This means the CPU will wait most of the time for data from the memory. So once the CPU finally gets something to do we can do some extra computations if it improves roundoff errors. So in each step we will check whether the absolute value of $A_{i,i}$ is smaller than a threshold. Based on that we decide to scale by $1/A_{i,i}$ (possibly using internally SSE or AVX) or doing element wise divisions using the FPU.

For $i=1,\dots,\min\{m,n\}$
- Find $l \in {i,\dots,m}$ such that $|\,A_{l,i}| = \max\limits_{k=i,\dots,m}|\,A_{k,i}|$
- Set $p_i = l$
- If $p_i \neq i$ then
  - interchange rows $i$ and $l$ of $A$.
- If $|\,A_{i,i}|$ is exactly Zero
  - return $i$
- If $|\,A_{i,i}| < \text{thresh}$
  - For $j=i+1, \dots, m$
    - $A_{j, i} \leftarrow A_{j,m}/A_{i,i} $ (FPU Division)
- else
  - $A_{i+1:m, i} \leftarrow \frac{1}{A_{i,i}} \cdot A_{i+1:m, i} $ (Scale)
- $A_{i+1:m,i+1:n} \leftarrow A_{i+1:m,i+1:n} - A_{i+1:m,i} \cdot A_{i,i+1:n}$ (Rank 1 Update)

Numerical stability can be further improved by using full pivoting for the $LU$ factorization. If this is not sufficient more stable (and more expensive) factorizations like $QR$ factorization or a singular value decomposition can be considered.

Implementation of `lu_unblk` using Operators

flens/examples/lu/lu_unblk_with_operators.h

#ifndef LU_LU_UNBLK_H

#define LU_LU_UNBLK_H 1

#include <cxxstd/algorithm.h>

#include <cxxstd/cmath.h>

#include <cxxstd/limits.h>

#include <flens/flens.h>

namespace flens {

template <typename MA, typename VP>

typename GeMatrix<MA>::IndexType

lu_unblk(GeMatrix<MA> &A, DenseVector<VP> &p)

{

    typedef typename GeMatrix<MA>::ElementType  T;

    typedef typename GeMatrix<MA>::IndexType    IndexType;

    const IndexType  m = A.numRows();

    const IndexType  n = A.numCols();

    const IndexType  mn = std::min(m, n);

    const T Zero(0);

    const T One(1);

    const Underscore<IndexType>  _;

    // Compute threshold for stable scaling

    const T eps   = std::numeric_limits<T>::epsilon() * T(0.5);

    T safeMin     = std::numeric_limits<T>::min();

    const T small = One / std::numeric_limits<T>::max();

    if (small>=safeMin) {

        safeMin = small*(One+eps);

    }

    for (IndexType i=1; i<=mn; ++i) {

        // Partition sub-matrix A(i:m,i:n)

        const auto  A_1 = A(_( i, m),        i);  // first column of sub_matrix

        const auto &a11 = A(       i,        i);

        const auto  a12 = A(       i, _(i+1,n));

        auto        a21 = A(_(i+1,m),        i);

        auto        A22 = A(_(i+1,m), _(i+1,n));

        // Find pivot index

        IndexType l = blas::iamax(A_1) +i- 1;

        p(i) = l;

        // Interchange rows of A

        if (p(i)!=i) {

            blas::swap(A(i,_), A(l,_));

        }

        // Only abort if diagonal element is *exactly* zero

        if (a11==Zero) {

            return i;

        }

        // Choose stable scaling

        if (std::abs(a11)<safeMin) {

            a21 /= a11;

        } else {

            a21 *= One/a11;

        }

        // Rank 1 update

        A22 -= a21*transpose(a12);

    }

    return 0;

}

} // namespace flens

#endif // LU_LU_UNBLK_H

Implementation of `lu_unblk` using `flens::blas` functions

In case you find explicit function calls more expressive you can do so. For example, it makes it more obvious that only BLAS Level 1 and BLAS Level 2 functions get used. Actually we usually start an implementation with explicit function calls and only replace them with operator notations if it make code more expressive.

The previous previous code is equivalent to:

flens/examples/lu/lu_unblk_with_flensblas.h

#ifndef LU_LU_UNBLK_H

#define LU_LU_UNBLK_H 1

#include <cxxstd/algorithm.h>

#include <cxxstd/cmath.h>

#include <cxxstd/limits.h>

#include <flens/flens.h>

namespace flens {

template <typename MA, typename VP>

typename GeMatrix<MA>::IndexType

lu_unblk(GeMatrix<MA> &A, DenseVector<VP> &p)

{

    typedef typename GeMatrix<MA>::ElementType  T;

    typedef typename GeMatrix<MA>::IndexType    IndexType;

    const IndexType  m = A.numRows();

    const IndexType  n = A.numCols();

    const IndexType  mn = std::min(m, n);

    const T Zero(0);

    const T One(1);

    const Underscore<IndexType>  _;

    // Compute threshold for stable scaling

    const T eps   = std::numeric_limits<T>::epsilon() * T(0.5);

    T safeMin     = std::numeric_limits<T>::min();

    const T small = One / std::numeric_limits<T>::max();

    if (small>=safeMin) {

        safeMin = small*(One+eps);

    }

    for (IndexType i=1; i<=mn; ++i) {

        // Partition sub-matrix A(i:m,i:n)

        const auto  A_1 = A(_( i, m),        i);  // first column of sub_matrix

        const auto &a11 = A(       i,        i);

        const auto  a12 = A(       i, _(i+1,n));

        auto        a21 = A(_(i+1,m),        i);

        auto        A22 = A(_(i+1,m), _(i+1,n));

        // Find pivot index

        IndexType l = blas::iamax(A_1) +i- 1;

        p(i) = l;

        // Interchange rows of A

        if (p(i)!=i) {

            blas::swap(A(i,_), A(l,_));

        }

        // Only abort if diagonal element is *exactly* zero

        if (a11==Zero) {

            return i;

        }

        // Choose stable scaling

        if (std::abs(a11)<safeMin) {

            blas::rscal(a11, a21);

        } else {

            blas::scal(One/a11, a21);

        }

        // Rank 1 update

        blas::r(-One, a21, a12, A22);

    }

    return 0;

}

} // namespace flens

#endif // LU_LU_UNBLK_H 1

Auxiliary Function for Applying the Permutation

Applying the permutation $P$ to a matrix $X$ means computing $PX$. The permutation matrix $P$ is represented by a pivot vector $p$ that stores row interchanges that happened during the $LU$ factorization. In this case applying $P$ means reversing the interchanges such that $PLU$ gives the original matrix $A$. This is done by the auxiliary function apply_perm:

flens/examples/lu/apply_perm.h

#ifndef LU_APPLY_PERM_H

#define LU_APPLY_PERM_H 1

#include <flens/flens.cxx>

namespace flens {

// Compute A = P*A

template <typename VP, typename MA>

void

apply_perm(const DenseVector<VP> &p, GeMatrix<MA> &A)

{

    typedef typename GeMatrix<MA>::IndexType    IndexType;

    const Underscore<IndexType> _;

    const IndexType m = p.length();

    // reverse row interchanges

    for (IndexType i=m; i>=1; --i) {

        if (i!=p(i)) {

            blas::swap(A(i,_), A(p(i),_));

        }

    }

}

} // namespace flens

#endif // LU_APPLY_PERM_H

Auxiliary Wall Time Function

For measuring the elapsed time for doing a certain task we use a simple wall-time clock. The term 'wall-time' means that for a task we want to measure the real time that elapses from start to end, i.e. including time used for system calls etc. It can be used following the pattern:

double t0 = ATL_walltime();

// do something

t0 = ATL_walltime() - t0;

cout << "elapsed time = " << t0 << endl;

The code for ATL_walltime was taken from ATLAS:

flens/examples/lu/timer.h

#ifndef LU_LU_TIMER_H

#define LU_LU_TIMER_H 1

#include <stdlib.h>

#include <sys/times.h>

#include <unistd.h>

double ATL_walltime(void)

{

   struct tms ts;

   static double ClockTick=0.0;

   if (ClockTick == 0.0) ClockTick = 1.0 / ((double) sysconf(_SC_CLK_TCK));

   return( ((double) times(&ts)) * ClockTick);

}

#endif // LU_LU_TIMER_H

Code for Testing

flens/examples/lu_unblk_test.cc

#include <flens/flens.cxx>

#include <iostream>

#ifdef WITH_OPERATORS

#   include <flens/examples/lu/lu_unblk_with_operators.h>

#else

#   include <flens/examples/lu/lu_unblk_with_flensblas.h>

#endif

#include <flens/examples/lu/apply_perm.h>

#include <flens/examples/lu/timer.h>

using namespace flens;

using namespace std;

int

main()

{

    const int m = 2000;

    const int n = 2000;

    GeMatrix<FullStorage<double> >  A(m,n), A_org;

    DenseVector<Array<int> >        p(m);

    Underscore<int>                 _;

    fillRandom(A);

    // store original A for checking the factorization later

    A_org = A;

    //cout << "A = " << A << endl;

    double t0 = ATL_walltime();

    if (lu_unblk(A, p) != 0) {

        cout << "Matrix is singular or close to singular" << endl;

        return 1;

    }

    t0 = ATL_walltime() - t0;

    cout << "Time elapsed: " << t0 << endl;

    //cout << "L = " << A.lowerUnit() << endl;

    //cout << "U = " << A.upper() << endl;

    //cout << "p = " << p << endl;

    cout << "Checking results:" << endl;

    GeMatrix<FullStorage<double> >  LU(m,n);

    // blas::copy(NoTrans, A.upper(), LU.upper());

    LU.upper() = A.upper();

    // blas::mm(Left, NoTrans, 1.0, A(_(1,4),_(1,4)).lowerUnit(), LU);

    LU = A(_(1,m),_(1,m)).lowerUnit() * LU;

    // LU = P*LU

    apply_perm(p, LU);

    // blas::axpy(NoTrans, -1.0, LU, A_org);

    A_org -= LU;

    cout << "|| A-P*L*U ||_1 = " << blas::asum(A_org) << endl;

    return 0;

}

Note:

Uncomment for playing with small dimensions:

//cout << "A = " << A << endl;

and

    //cout << "L = " << A.lowerUnit() << endl;

    //cout << "U = " << A.upper() << endl;

    //cout << "p = " << p << endl;

Compile and Run

$shell> cd flens/examples                                                                          
$shell> g++ -Wall -std=c++11 -DWITH_OPERATORS -I../.. -o lu_unblk_operators_test lu_unblk_test.cc  
$shell> g++ -Wall -std=c++11 -I../.. -o lu_unblk_flensblas_test lu_unblk_test.cc                   
$shell> ./lu_unblk_operators_test                                                                  
Time elapsed: 13.96
Checking results:
|| A-P*L*U ||_1 = 1.66341e-08
$shell> ./lu_unblk_flensblas_test                                                                  
Time elapsed: 13.85
Checking results:
|| A-P*L*U ||_1 = 1.66341e-08

Blocked Algorithm

We now derive an algorithm that takes advantage of BLAS Level 3 functions and applies the unblocked $LU$ factorization only on a block that fits into the L2 cache.

Block-Partitioning A, L and U

We now consider a block-wise partition of $A \in \mathbb{R}^{m \times n}$

\[ A = \left( \begin{array}{c|c} A_{11} & A_{12} \\ \hline A_{21} & A_{22} \end{array}\right) \]

where block $A_{11}$ is square, i.e. $A_{11} \in \mathbb{R}^{\text{BS} \,\times\, \text{BS}}$, $A_{21} \in \mathbb{R}^{(m-\text{BS}) \,\times\, \text{BS}}$, $A_{12} \in \mathbb{R}^{\text{BS} \,\times\, (n-\text{BS})}$ and $A_{22} \in \mathbb{R}^{(m-\text{BS}) \,\times\, (n-\text{BS})}$ for some block size $\text{BS} \in \mathbb{N}$.

Analogously we consider for $L$ the block-partition

\[ L = \left( \begin{array}{c|c} L_{11} & \mathcal{0} \\ \hline L_{21} & L_{22} \end{array}\right) \]

where $L_{11}$ adn $L_{22}$ are lower triangular matrix with unit diagonal. For $U$ the block-partition

\[ U = \left( \begin{array}{c|c} U_{11} & U_{12} \\ \hline \mathcal{0} & U_{22} \end{array}\right). \]

blocks $U_{11}$ and $U_{22}$ are obviously upper triangular matrices.

So in case no pivoting gets applied the $LU$ factorization is determined by

\[ \begin{eqnarray*} A = LU &\;\Leftrightarrow\;& \left( \begin{array}{c|c} A_{11} & A_{12} \\ \hline A_{21} & A_{22} \end{array}\right) = \left( \begin{array}{c|c} L_{11} & \mathcal{0} \\ \hline L_{21} & L_{22} \end{array}\right) \left( \begin{array}{c|c} U_{11} & U_{12} \\ \hline \mathcal{0} & U_{22} \end{array}\right) \\[1em] &\;\Leftrightarrow\;& \left( \begin{array}{c|c} A_{11} & A_{12} \\ \hline A_{21} & A_{22} \end{array}\right) = \left( \begin{array}{c|c} L_{11} U_{11} & L_{11} U_{12} \\ \hline L_{21} U_{11} & L_{21} U_{12} + L_{22} U_{22} \end{array}\right) \end{eqnarray*}\]

Comparing partition blocks leads to the equations

\[ \left\{ \begin{array}{lcl} A_{11} &=& L_{11} U_{11} \\ A_{12} &=& L_{11} U_{12} \\ A_{21} &=& L_{21} U_{11} \\ A_{22} &=& L_{21} U{12} + L_{22} U_{22} \end{array} \right. \quad \Leftrightarrow \quad \left\{ \begin{array}{lcl} L_{11} U_{11} &=& A_{11} \\ U_{12} &=& L_{11}^{-1} A_{12} \\ L_{21} &=& A_{21} U_{11}^{-1} \\ L_{22}U_{22} &=& A_{22} - L_{21} U_{12} \end{array} \right. \quad \Leftrightarrow \quad \left\{ \begin{array}{lcl} L_{11} U_{11} &=& A_{11} \\ U_{12} &=& L_{11}^{-1} A_{12} \\ L_{21}^T &=& U_{11}^{-T} A_{21}^T \\ L_{22}U_{22} &=& A_{22} - L_{21} U_{12} \end{array} \right.\]

Note that because $L_{11}$ and $U_{11}^T$ are both lower triangular computing $L_{11}^{-1} A_{12}$ and $U_{11}^{-T} A_{21}^T$ can be done using a triangular solver (forward substitution).

Recursive Block Algorithm for a Factorization $A = LU$

So as we again want to overwrite $A$ with $L$ and $U$ we end up with the algorithm

Partition $A$ as outlined above.
Use the unblocked $LU$ factorization to overwrite $A_{11}$ with $L_{11}$ and $U_{11}.
Use a triangular solver to overwrite $A_{21}$ with $A_{21} U_{11}^{-1}$.
Use a triangular solver to overwrite $A_{12}$ with $L_{11}^{–1} U_{12}$
Overwrite $A_{22}$ with $A_{22} - A_{21}A_{12}$ (Matrix-Matrix-Product)
Repeat with $A_{22}$ instead of $A$.

The triangular solvers (blas::sm) as well as the matrix-matrix product (blas::mm) are BLAS Level 3 functions.

Iterative Block Algorithm with Pivoting: $A = PLU$

In order to derive an iterative algorithm we again consider the partitioning within the original matrix:

\[ \require{color} \left( \begin{array}{c|c|c} A_{00} & A_{01} & A_{02} \\ \hline A_{10} & {\color{red}A_{11}} & {\color{red}A_{12}}\\ \hline A_{20} & {\color{red}A_{21}} & {\color{red}A_{22}}\\ \end{array} \right) \qquad\text{with}\quad{\color{red}\tilde{A}} = \left( \begin{array}{c|c} {\color{red}A_{11}} & {\color{red}A_{12}}\\ \hline {\color{red}A_{21}} & {\color{red}A_{22}} \end{array} \right)\]

In the $i$-th iteration the dimension of $\tilde{A}$ is $\bigl(m - (i-1)\cdot\text{BS}\bigr) \times \bigl(n - (i-1)\cdot\text{BS}\bigr)$. Therefore blocks of $\tilde{A}$ have dimensions

\[ \begin{array}{lll} A_{11} & \in \mathbb{R}^{\text{bs} \times \text{bs}} & \quad\text{with}\; \text{bs} = \min\{\text{BS},\, m - i\cdot\text{BS},\, n - i\cdot\text{BS}\}\\ A_{21} & \in \mathbb{R}^{(m - i\cdot\text{BS}) \,\times\, \text{bs}} & \\ A_{12} & \in \mathbb{R}^{\text{bs} \,\times\, (n - i\cdot\text{BS})} & \\ A_{22} & \in \mathbb{R}^{(m - i\cdot\text{BS}) \,\times\, (n - i\cdot\text{BS})} & \\ \end{array}\]

This gives the iterative algorithm for computing block wise the $A=PLU$ factorization with partial pivoting:

Set $i = 1$
While $i=1 \leq \min\{m,n\}$
- Partition $A$ for the $i$-th iteration as illustrated above with $\text{bs} = \min\{\text{BS},\, m - (i-1)\cdot\text{BS},\, n - (i-1)\cdot\text{BS}\}$
- Use the unblocked $LU$ factorization to compute $A_{11} = \tilde{P} L_{11} U_{11}$
- Apply $\tilde{P}^{-1}$ to $A_{10}$, i.e. compute $A_{10} = \tilde{P}^{-1} A_{10}$.
- Apply $\tilde{P}^{-1}$ to $A_{12}$, i.e. compute $A_{12} = \tilde{P}^{-1} A_{12}$.
- Update the permutation matrix $P$ with $\tilde{P}$.
- Use a triangular solver for $A_{21}^T = U_{11}^{-T} A_{21}^T$.
- Use a triangular solver for $A_{12} = L_{11}^{-1} A_{12}$.
- Compute the matrix-product $A_{22} = A_{22} - A_{21} A_{12}$.
- $i = i + \text{BS}$

Note

What does updating $P$ with $\tilde{P}$ mean? The permutation matrix $P$ is represented by a pivot vector $p = (p_1, \dots, p_m)$ and $\tilde{P}$ by a pivot vector $\tilde{p} = (\tilde{p}_1, \dots, \tilde{p}_{\text{bs}})$. So an update means to set $p_{i+k-1} = i - 1 + \tilde{p}_{k}$ for $1 \leq k \leq \text{bs}$.

Auxiliary Function for Applying the Inverse Permutation

Applying the inverse permutation $P^{-1}=P^T$ to a matrix $X$ means computing $P^T X$. Recall, the permutation matrix $P$ is represented by a pivot vector $p$ that stores row interchanges that happened during the $LU$ factorization. So applying $P$ means reversing the interchanges such that $PLU$ gives the original matrix $A$. Hence, applying $P^{-1}$ to $X$ means to apply the same row interchanges that were done during the $LU$ factorization. This is done by the auxiliary function apply_perm_inv:

flens/examples/lu/apply_perm_inv.h

#ifndef LU_APPLY_PERM_INV_H

#define LU_APPLY_PERM_INV_H 1

#include <flens/flens.cxx>

namespace flens {

// Compute A = P^{-1}*A = P^T*A

template <typename VP, typename MA>

void

apply_perm_inv(const DenseVector<VP> &p, GeMatrix<MA> &A)

{

    typedef typename GeMatrix<MA>::IndexType    IndexType;

    const Underscore<IndexType> _;

    const IndexType m = p.length();

    // apply row interchanges

    for (IndexType i=1; i<=m; ++i) {

        if (i!=p(i)) {

            blas::swap(A(i,_), A(p(i),_));

        }

    }

}

} // namespace flens

#endif // LU_APPLY_PERM_INV_H

Implementation of `lu_blk` using `flens::blas` functions

flens/examples/lu/lu_blk_with_flensblas.h

#ifndef LU_LU_BLK_H

#define LU_LU_BLK_H 1

#ifndef LU_BS

#define LU_BS 96

#endif

#include <cxxstd/algorithm.h>

#include <cxxstd/cmath.h>

#include <flens/flens.h>

#include <flens/examples/lu/apply_perm_inv.h>

#include <flens/examples/lu/lu_unblk_with_operators.h>

namespace flens {

template <typename MA, typename VP>

typename GeMatrix<MA>::IndexType

lu_blk(GeMatrix<MA> &A, DenseVector<VP> &p)

{

    typedef typename GeMatrix<MA>::ElementType  T;

    typedef typename GeMatrix<MA>::IndexType    IndexType;

    const IndexType  m = A.numRows();

    const IndexType  n = A.numCols();

    const IndexType  mn = std::min(m, n);

    const T One(1);

    const Underscore<IndexType>  _;

    IndexType info = 0;

    for (IndexType i=1; i<=mn; i+=LU_BS) {

        const IndexType bs = std::min(std::min(LU_BS, m-i+1), n-i+1);

        // Partitions of A

        auto A10 = A(_(   i, i+bs-1), _(   1,    i-1));

        auto A11 = A(_(   i, i+bs-1), _(   i, i+bs-1));

        auto A12 = A(_(   i, i+bs-1), _(i+bs,      n));

        auto A21 = A(_(i+bs,      m), _(   i, i+bs-1));

        auto A22 = A(_(i+bs,      m), _(i+bs,      n));

        // Part of the pivot vector for rows of A11

        auto p_  = p(_(i,i+bs-1));

        // Compute LU factorization of A11

        info = lu_unblk(A11, p_);

        if (info) {

            // All values in column info of A11 are *exactly* zero.

            return info+i-1;

        }

        // Apply permutation to A10 and A12

        apply_perm_inv(p_, A10);

        apply_perm_inv(p_, A12);

        // Update p

        p_ += i-1;

        // Use triangular solver for A21 = A11.upper()*A21

        blas::sm(Right, NoTrans, One, A11.upper(), A21);

        // Use triangular solver for A12 = A11.lowerUnit()*A12

        blas::sm(Left, NoTrans, One, A11.lowerUnit(), A12);

        // Update A22 with matrix-product A22 = A22 - A21*A12

        blas::mm(NoTrans, NoTrans, -One, A21, A12, One, A22);

    }

    return 0;

}

} // namespace flens

#endif // LU_LU_BLK_H

Implementation of `lu_blk` using Operators

For applying triangular solvers we do not provide overloaded operators. So you might be disappointed if you a big fan of it. In that case we leave it as a simple exercise to the dear user to support for a triangular matrix L something like X = inverse(L)*B that gets evaluated as blas::sm(Left, NoTrans, 1.0, L, X) if X and B are identical and as X = B; blas::sm(Left, NoTrans, 1.0, L, X) otherwise.

flens/examples/lu/lu_blk_with_operators.h

#ifndef LU_LU_BLK_H

#define LU_LU_BLK_H 1

#ifndef LU_BS

#define LU_BS 96

#endif

#include <cxxstd/algorithm.h>

#include <cxxstd/cmath.h>

#include <flens/flens.h>

#include <flens/examples/lu/apply_perm_inv.h>

#include <flens/examples/lu/lu_unblk_with_operators.h>

namespace flens {

template <typename MA, typename VP>

typename GeMatrix<MA>::IndexType

lu_blk(GeMatrix<MA> &A, DenseVector<VP> &p)

{

    typedef typename GeMatrix<MA>::ElementType  T;

    typedef typename GeMatrix<MA>::IndexType    IndexType;

    const IndexType  m = A.numRows();

    const IndexType  n = A.numCols();

    const IndexType  mn = std::min(m, n);

    const T One(1);

    const Underscore<IndexType>  _;

    IndexType info = 0;

    for (IndexType i=1; i<=mn; i+=LU_BS) {

        const IndexType bs = std::min(std::min(LU_BS, m-i+1), n-i+1);

        // Partitions of A

        auto A10 = A(_(   i, i+bs-1), _(   1,    i-1));

        auto A11 = A(_(   i, i+bs-1), _(   i, i+bs-1));

        auto A12 = A(_(   i, i+bs-1), _(i+bs,      n));

        auto A21 = A(_(i+bs,      m), _(   i, i+bs-1));

        auto A22 = A(_(i+bs,      m), _(i+bs,      n));

        // Part of the pivot vector for rows of A11

        auto p_  = p(_(i,i+bs-1));

        // Compute LU factorization of A11

        info = lu_unblk(A11, p_);

        if (info) {

            // All values in column info of A11 are *exactly* zero.

            return info+i- 1;

        }

        // Apply permutation to A10 and A12

        apply_perm_inv(p_, A10);

        apply_perm_inv(p_, A12);

        // Update p

        p_ += i-1;

        // Use triangular solver for A21 = A11.upper()*A21

        blas::sm(Right, NoTrans, One, A11.upper(), A21);

        // Use triangular solver for A12 = A11.lowerUnit()*A12

        blas::sm(Left, NoTrans, One, A11.lowerUnit(), A12);

        // Update A22 with matrix-product A22 = A22 - A21*A12

        A22 -= A21*A12;

    }

    return 0;

}

} // namespace flens

#endif // LU_LU_BLK_H

Code for Testing

flens/examples/lu_blk_test.cc

#include <flens/flens.cxx>

#include <algorithm>

#include <iostream>

#ifdef WITH_OPERATORS

#   include <flens/examples/lu/lu_blk_with_operators.h>

#else

#   include <flens/examples/lu/lu_blk_with_flensblas.h>

#endif

#include <flens/examples/lu/check_lu.h>

#include <flens/examples/lu/timer.h>

using namespace flens;

using namespace std;

int

main()

{

    const int m  = 2000;

    const int n  = 2000;

    const int mn = std::min(m, n);

    GeMatrix<FullStorage<double> >  A(m,n), A_org;

    DenseVector<Array<int> >        p(mn);

    Underscore<int>                 _;

    fillRandom(A);

    // store original A for checking the factorization later

    A_org = A;

    double t0 = ATL_walltime();

    if (lu_blk(A, p) != 0) {

        cout << "Matrix is singular or close to singular" << endl;

        return 1;

    }

    t0 = ATL_walltime() - t0;

    cout << "Time elapsed: " << t0 << endl;

    cout << "Checking results:" << endl;

    cout << "|| A-P*L*U ||_1 = " << check_LU(A_org, A, p) << endl;

    return 0;

}

Compile and Run

$shell> cd flens/examples                                                                          
$shell> g++ -Wall -std=c++11 -DWITH_OPERATORS -I../.. -o lu_blk_operators_test lu_blk_test.cc      
$shell> g++ -Wall -std=c++11 -I../.. -o lu_blk_flensblas_test lu_blk_test.cc                       
$shell> ./lu_blk_operators_test                                                                    
Time elapsed: 0.95
Checking results:
|| A-P*L*U ||_1 = 3.56198e-07
$shell> ./lu_blk_flensblas_test                                                                    
Time elapsed: 0.95
Checking results:
|| A-P*L*U ||_1 = 3.56198e-07

Having an algorithm that is capable to exploit caches and BLAS it now makes also sense to compile without debug code and optimizations:

$shell> cd flens/examples                                                                          
$shell> g++ -Wall -DNDEBUG -O3 -std=c++11 -DWITH_OPERATORS -I../.. -o lu_blk_operators_test lu_blk_test.cc                                                                         
$shell> g++ -Wall -DNDEBUG -O3 -std=c++11 -I../.. -o lu_blk_flensblas_test lu_blk_test.cc          
$shell> ./lu_blk_operators_test                                                                    
Time elapsed: 0.76
Checking results:
|| A-P*L*U ||_1 = 3.56198e-07
$shell> ./lu_blk_flensblas_test                                                                    
Time elapsed: 0.76
Checking results:
|| A-P*L*U ||_1 = 3.56198e-07

Block Algorithm with Parallel BLAS

The easiest way to parallelize the $LU$ factorization is using a multi-threaded BLAS implementation for the blocked $LU$ factorization. In this case 'easy' means that you don't have to change function lu_blk you just link against a parallel BLAS library. So the multi-threading part is hidden in BLAS.

In the near future we will extend the ulmBLAS tutorial for parallelization...

Parallel Tile Algorithm

Having a parallel computer architecture we hope that using 2 CPU cores for the $LU$ factorization cuts the wall-time in half, i.e. gives a speedup factor of 2. And of course, with 3 cores we hope for a speedup factor of 3, with 4 cores a speedup factor or 4 etc. This means we hope for a perfect scalability of the problem. However, this usually requires that for $k$ CPU cores we can work on $k$ disjoint sub-problems with perfect load balancing. The later means each sub-problem has the same complexity and the same memory bandwidth can be used for solving each of these sub-problems.

For BLAS Level 3 functions this can be achieved in certain cases. For example, in the matrix product $C = A\cdot B$ each thread computes a different block of $C$. However, a good speedup factor requires that the computation done by a thread compensates the induced overhead, e.g. scheduling threads or copying buffered results.

The PLASMA library tries to achieve a much higher granularity of sub-problems by considering parallelization already in the algorithms. These algorithms basically push forward ideas from blocked algorithms and are known as tiled algorithms.

Example: Tasks for Factorizing a 2x2 Tile/Block Matrix

We again consider the $LU$ factorization of a $2 \times 2$ blocked matrix:

\[A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \\ \end{pmatrix}\]

In the blocked algorithm we identified a sequence of necessary tasks that were done in the certain fixed order. In this case the total sequence of tasks done were:

Compute $A_11 = P_1 L_{11} U_{11}$
Compute $A_{12} \leftarrow P_1^T A_{12}$
Update the permutation: $P \big|_1 \leftarrow P_1$
Compute $A_{12} \leftarrow A_{12} U_{11}^{-1}$
Compute $A_{21} \leftarrow L_{11}^{-1} A_{21}$
Compute $A_{22} \leftarrow A_{22} - A_{21} A_{12}$
Compute $A_{22} = P_2 L_{22} U_{22}$
Compute $A_{21} \leftarrow P_2^T A_{21}$
Update the permutation: $P \big|_2 \leftarrow P_2 + \text{bs}$

Obviously these tasks can not be done in an arbitrary order. And therefore some tasks can not be done in parallel. However, analyzing the dependencies will show what tasks could be done in parallel.

Task Dependencies

We illustrated the dependencies through a directed graph. Hereby $T_1 \rightarrow T_2$ denotes that task $T_1$ must be completed before task $T_2$ can be started.

So once task $A_11 = P_1 L_{11} U_{11}$ is done we could

Compute $A_{12} \leftarrow P_1^T A_{12}$ by one thread and
$A_{12} \leftarrow A_{12} U_{11}^{-1}$ by another thread.

Considering more tiles shows that more and more independent task can be identified.

Example: Tasks for Factorizing a 3x3 Tile/Block Matrix

Listing all operations needed for a $3 \times 3$ tile matrix is already lengthy.

\[A = \begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \\ \end{pmatrix}\]

Gladly we will develop in this section a tool that automatically generates the task dependencies:

Example: Tasks for Factorizing a 10x10 Tile/Block Matrix

Just for fun (and because we have the tool) we also have a graph generated for a 10x10 tile matrix.

Converting Matrices to Tile Matrices

flens/examples/lu/tile.h

#ifndef LU_TILE_H

#define LU_TILE_H 1

#include <flens/flens.h>

namespace flens {

template <typename MA, typename IndexType, typename VB>

void

copyToTiles(IndexType bs, const GeMatrix<MA> &A, DenseVector<VB> &b)

{

    typedef typename GeMatrix<MA>::View   TileView;

    IndexType m = A.numRows();

    IndexType n = A.numCols();

    Underscore<IndexType> _;

    IndexType ib = 0;

    for (IndexType j=1; j<=n; j+=bs) {

        IndexType nb = std::min(bs, n-j+1);

        for (IndexType i=1; i<=m; i+=bs) {

            IndexType mb = std::min(bs, m-i+1);

            TileView  B  = TileView(mb, nb, b(_(ib+1,ib+bs*bs)), bs);

            B = A(_(i,i+mb-1),_(j,j+nb-1));

            ib += bs*bs;

        }

    }

}

template <typename MA, typename IndexType, typename VB>

void

copyFromTiles(IndexType bs, const DenseVector<VB> &b, GeMatrix<MA> &A)

{

    typedef typename GeMatrix<MA>::ConstView   TileView;

    IndexType m = A.numRows();

    IndexType n = A.numCols();

    Underscore<IndexType> _;

    IndexType ib = 0;

    for (IndexType j=1; j<=n; j+=bs) {

        IndexType nb = std::min(bs, n-j+1);

        for (IndexType i=1; i<=m; i+=bs) {

            IndexType mb = std::min(bs, m-i+1);

            TileView  B  = TileView(mb, nb, b(_(ib+1,ib+bs*bs)), bs);

            A(_(i,i+mb-1),_(j,j+nb-1)) = B;

            ib += bs*bs;

        }

    }

}

template <typename GeMatrixType>

class TiledCopy

    : public GeneralMatrix<GeMatrixType>

{

    public:

        typedef typename GeMatrixType::View         TileView;

        typedef typename GeMatrixType::ElementType  ElementType;

        typedef typename GeMatrixType::IndexType    IndexType;

        TiledCopy(const GeMatrixType &A, int bs_)

            : numRows(A.numRows()), numCols(A.numCols()),

              bs(bs_), mr(numRows % bs), nr(numCols % bs),

              tileRows((numRows+bs-1) / bs),

              tileCols((numCols+bs-1) / bs)

        {

            tiles.resize(bs*bs*tileRows*tileCols);

            copyToTiles(bs, A, tiles);

        }

        int

        numTileRows() const

        {

            return tileRows;

        }

        int

        numTileCols() const

        {

            return tileCols;

        }

        int

        blockSize() const

        {

            return bs;

        }

        TileView

        operator()(int i, int j)

        {

            int mb = (i<tileRows || mr==0) ? bs : mr;

            int nb = (j<tileCols || nr==0) ? bs : nr;

            int ib = tileRows*bs*bs*(j-1) + bs*bs*(i-1);

            return TileView(mb, nb, tiles(Range<int>(ib+1,ib+bs*bs)), bs);

        }

        void

        untile(GeMatrixType &A) const

        {

            copyFromTiles(bs, tiles, A);

        }

    private:

        int                             numRows, numCols;

        int                             bs, mr, nr;

        int                             tileRows, tileCols;

        typename GeMatrixType::Vector   tiles;

};

} // namespace flens

#endif // LU_TILE_H

Implementation of `lu_tiled`

For the implementation of the tile algorithm we need a scheduler. This scheduler can be fed with tasks (i.e. functions) that have dependency constraints, e.g. task A must be completed before task B can be started. Tasks can be done in parallel if constraints permit and if enough threads are available.

A task is just a function pointer. Using C++11 lambdas makes is a convenient way of creating such functions. Furthermore it is required to define unique keys for each task.

We will first look at the implementation of the tile algorithm and discuss the design of the scheduler and key functions later:

flens/examples/lu/lu_tiled_mt.h

#ifndef LU_LU_TILED_MT_H

#define LU_LU_TILED_MT_H 1

#include <cxxstd/algorithm.h>

#include <cxxstd/cmath.h>

#include <cstdio>

#include <vector>

#include <flens/flens.h>

#include <flens/examples/lu/apply_perm_inv.h>

#include <flens/examples/lu/lu_blk_with_operators.h>

#include <flens/examples/lu/lu_tiled_keys.h>

#include <flens/examples/lu/scheduler.h>

#include <flens/examples/lu/tile.h>

namespace flens {

template <typename MA, typename VP>

typename DenseVector<VP>::IndexType

lu_tiled(Scheduler &scheduler, TiledCopy<MA> &A, DenseVector<VP> &p)

{

    typedef typename TiledCopy<MA>::ElementType  T;

    typedef typename TiledCopy<MA>::IndexType    IndexType;

    typedef Scheduler::Task                      Task;

    typedef Scheduler::Key                       Key;

    const IndexType  m  = A.numTileRows();

    const IndexType  n  = A.numTileCols();

    const IndexType  bs = A.blockSize();

    const IndexType  mn = std::min(m, n);

    const T One(1);

    const Underscore<IndexType>  _;

    Task task;

    //scheduler.pause();

    std::vector<std::string> gather;

    for (IndexType i=1; i<=mn; ++i) {

        auto A_ii = A(i,i);

        auto p_   = p(_(bs*(i-1)+1,bs*(i-1)+A_ii.numRows()));

        //  Factorize A(i,i)

        task = [=, &scheduler]() mutable

               {

                   auto info = lu_blk(A_ii, p_);

                   if (info!=0) {

                       scheduler.abort(info+bs*(i-1));

                   }

               };

        scheduler.add(keyLU(i), task, gather);

        gather.clear();

        //  Apply permutation to i-th tile row

        for (IndexType j=1; j<=n; ++j) {

            if (i!=j) {

                auto A_ij = A(i,j);

                task = [=]() mutable

                       {

                           apply_perm_inv(p_, A_ij);

                       };

                scheduler.add(keyPtA(i,j), task, { keyLU(i) });

                gather.push_back(keyPtA(i,j));

                // dependency for task: "Update Permutation Vector p"

                scheduler.addArc(keyPtA(i,j), keyUpdateP(i, bs));

            }

        }

        //  Update Permutation Vector p

        task = [=]() mutable

               {

                   p_ += bs*(i-1);

               };

        scheduler.add(keyUpdateP(i, bs), task);

        gather.push_back(keyUpdateP(i, bs));

        //  Apply U_ii^{-1} to i-th tile column

        for (IndexType k=i+1; k<=m; ++k) {

            auto A_ki = A(k,i);

            task = [=]() mutable

                   {

                       auto U_ii = A_ii.upper();

                       blas::sm(Right, NoTrans, One, U_ii, A_ki);

                   };

            scheduler.add(keyApplyU(k,i), task, { keyLU(i) });

            gather.push_back(keyApplyU(k,i));

        }

        //  Apply L_ii^{-1} to i-th tile row

        for (IndexType l=i+1; l<=n; ++l) {

            auto A_il = A(i,l);

            task = [=]() mutable

                   {

                       auto L_ii = A_ii.lowerUnit();

                       blas::sm(Left, NoTrans, One, L_ii, A_il);

                   };

            scheduler.add(keyApplyL(i,l), task, { keyPtA(i,l) });

            gather.push_back(keyApplyL(i,l));

        }

        //  Update remaining blocks

        for (IndexType k=i+1; k<=m; ++k) {

            const auto A_ki = A(k,i);

            for (IndexType l=i+1; l<=n; ++l) {

                const auto A_il = A(i,l);

                auto       A_kl = A(k,l);

                task = [=]() mutable

                       {

                           A_kl -= A_ki * A_il;

                       };

                scheduler.add(keyUpdateA(k,l,i), task, { keyApplyU(k,i),

                                                         keyApplyL(i,l) });

                gather.push_back(keyUpdateA(k,l,i));

            }

        }

    }

    //scheduler.dumpGraph("graph.dot");

    //scheduler.resume();

    return scheduler.join();

}

} // namespace flens

#endif // LU_LU_TILED_MT_H

Task Scheduler and Thread Pool

Class Scheduler provides an abstract interface a thread pool and scheduler:

flens/examples/lu/scheduler.h

#ifndef LU_SCHEDULER_H

#define LU_SCHEDULER_H 1

#include <condition_variable>

#include <list>

#include <unordered_map>

#include <mutex>

#include <thread>

#include <vector>

#include <flens/examples/lu/format.h>

namespace flens {

class Scheduler

{

    public:

        typedef long                     Key;

        typedef std::function<void()>    Task;

        Scheduler(int numThreads);

        void

        pause();

        void

        resume();

        void

        add(std::string keyStr, Task task, std::vector<std::string> pre= {});

        void

        addArc(std::string preStr, std::string succStr);

        void

        dumpGraph(const char *filename);

        int

        join();

        void

        abort(int errorCode);

    private:

        std::pair<Task, Key>

        getTask();

        void

        runThread();

        void

        taskDone(Key key);

        Key

        getKey(std::string keyStr);

        int                                         numThreads;

        int                                         numThreadsBusy;

        int                                         errorCode;

        bool                                        stop;

        std::list<Key>                              ready;

        std::unordered_map<std::string, Key>        keyString;

        std::unordered_map<Key, std::string>        tag;

        std::unordered_map<Key, Task>               scheduled;

        std::unordered_map<Key, long>               predecessorCount;

        std::unordered_map<Key, std::list<Key>>     successor;

        std::condition_variable                     cv_scheduler;

        std::condition_variable                     cv_thread;

        std::mutex                                  mtx;

};

} // namespace flens

#endif // LU_SCHEDULER_H

Internally task are identified by an integer.

typedef long Key;

And a task is just a function pointer.

typedef std::function<void()> Task;

The scheduler and thread pool get initialized when the application gets started, e.g. in main(). Threads will be spawned and initialized. They will immediately wait on a condition variable until a task is ready.

Scheduler(int numThreads);

For illustration purposes it is nice to identify task by a string. In our example these strings happen to be Latex code. So we are able to dump the scheduled directed graph and generate nice pictures.

Adding a task identified by keyStr can depend on a list of other tasks (pre) that must have been completed before it can be started.

        void

        add(std::string keyStr, Task task, std::vector<std::string> pre= {});

In some cases it is convenient to add only some dependencies, i.e. add an arc to the directed graph.

Note that if you want to add 'pre must be done before succ gets started' it is necessary that succ has not been scheduled already. Otherwise task succ could already be running.

Again tasks pre and succ are identified by string.

        void

        addArc(std::string preStr, std::string succStr);

Once all tasks are scheduled we wait until all tasks are completed.

        int

        join();

If a task calls abort then no new tasks will be stared. join will wait until all running tasks are completed and returns errorCode.

        void

        abort(int errorCode);

The implementation uses typical C++11 features for dealing with threads, locks and condition variables. The directed graph is realized using containers from the C++ standard library:

flens/examples/lu/scheduler.cc

#include <flens/examples/lu/scheduler.h>

#include <cassert>

#include <cstdlib>

#include <fstream>

namespace flens {

Scheduler::Scheduler(int numThreads_)

    : numThreads(numThreads_), numThreadsBusy(0), errorCode(0), stop(false)

{

    std::unique_lock<std::mutex>  lock(mtx);

    for (int i=0; i<numThreads; ++i) {

        auto t = std::thread( [=] { this->runThread(); } );

        t.detach();

    }

}

void

Scheduler::pause()

{

    std::unique_lock<std::mutex>  lock(mtx);

    stop = true;

}

void

Scheduler::resume()

{

    bool jobReady = false;

    {

        std::unique_lock<std::mutex>  lock(mtx);

        stop     = false;

        jobReady = (ready.size()>0);

    }

    if (jobReady) {

        cv_thread.notify_one();

    }

}

void

Scheduler::add(std::string keyStr, Task task, std::vector<std::string> pre)

{

    bool jobReady = false;

    {

        std::unique_lock<std::mutex>  lock(mtx);

        Key  key      = getKey(keyStr);

        while (ready.size()>15) {

            cv_scheduler.wait(lock);

        }

        #ifndef NDEBUG

        if (scheduled.count(key)!=0) {

            std::cerr << "Already scheduled: " << keyStr << std::endl;

            assert(0);

        }

        #endif

        scheduled[key]         = task;

        for (auto p_ : pre) {

            auto p = getKey(p_);

            if (scheduled.count(p)>0) {

                successor[p].push_back(key);

                ++predecessorCount[key];

            } else {

                assert(scheduled.count(p)==0);

            }

        }

        if (predecessorCount[key]==0) {

            if (successor[key].size()>3) {

                ready.push_front(key);

            } else {

                ready.push_back(key);

            }

        }

        jobReady = (ready.size()>0);

    }

    if (jobReady) {

        cv_thread.notify_one();

    }

}

void

Scheduler::addArc(std::string preStr, std::string succStr)

{

    std::unique_lock<std::mutex>  lock(mtx);

    Key succ = getKey(succStr);

    Key pre  = getKey(preStr);

    // Dependencies for 'succ' must not be added if 'succ' is

    // already scheduled.

    if (scheduled.count(pre)==0) {

        return;

    }

    #ifndef NDEBUG

    if (scheduled.count(succ)!=0) {

        std::cerr << "Already scheduled and possibly running: "

                  << succStr << std::endl;

        std::cerr << preStr << " -> " << succStr << std::endl;

        assert(0);

    }

    #endif

    successor[pre].push_back(succ);

    ++predecessorCount[succ];

}

void

Scheduler::abort(int errorCode_)

{

    assert(errorCode_!=0);

    {

        std::unique_lock<std::mutex> lock(mtx);

        std::cerr << "abort: error code = " << errorCode_ << std::endl;

        errorCode = errorCode_;

        scheduled.clear();

    }

    cv_scheduler.notify_one();

}

void

Scheduler::dumpGraph(const char *filename)

{

    std::unique_lock<std::mutex> lock(mtx);

    std::ofstream out(filename);

    out << "digraph G {\nnode [shape=\"none\"];" << std::endl;

    for (auto t : tag) {

        out << t.first << "[label=\"" << t.second << "\"]"

            << std::endl;

    }

    for (auto s : successor) {

        for (auto succ : s.second) {

            out << "   \"" << s.first

                << "\" -> \"" << succ

                << "\";" << std::endl;

        }

    }

    out << "}" << std::endl;

    out.close();

}

int

Scheduler::join()

{

    std::unique_lock<std::mutex> lock(mtx);

    while (scheduled.size()+numThreadsBusy>0) {

        if (ready.size()+numThreadsBusy==0) {

            std::cerr << "Error: Task queue not empty but no job ready"

                         " or running." << std::endl;

            std::exit(666);

        }

        cv_scheduler.wait(lock);

    }

    // clean up

    ready.clear();

    keyString.clear();

    tag.clear();

    scheduled.clear();

    predecessorCount.clear();

    successor.clear();

    // restore error code and return old value

    int errorCode_ = errorCode;

    errorCode = 0;

    return errorCode_;

}

std::pair<Scheduler::Task, Scheduler::Key>

Scheduler::getTask()

{

    bool jobReady = false;

    Key  key;

    Task task;

    {

        std::unique_lock<std::mutex> lock(mtx);

        while (stop || ready.size()==0) {

            cv_thread.wait(lock);

        }

        key  = ready.front();

        task = scheduled[key];

        ready.pop_front();

        ++numThreadsBusy;

        if (ready.size()>0) {

            jobReady = true;

        }

    }

    if (jobReady) {

        cv_thread.notify_one();

    }

    return std::pair<Task, Key>(task,key);

}

void

Scheduler::runThread()

{

    while(1) {

        std::pair<Task,Key> job = getTask();

        job.first();

        taskDone(job.second);

    }

}

void

Scheduler::taskDone(Key key)

{

    bool allDone = false;

    {

        std::unique_lock<std::mutex> lock(mtx);

        if (errorCode!=0) {

            allDone= true;

        } else {

            for (auto succ : successor[key]) {

                --predecessorCount[succ];

                if (scheduled.count(succ)>0) {

                    if (predecessorCount[succ]==0) {

                        predecessorCount.erase(succ);

                        if (successor[succ].size()>3) {

                            ready.push_front(succ);

                        } else {

                            ready.push_back(succ);

                        }

                    }

                }

            }

            scheduled.erase(key);

            successor.erase(key);

        }

        --numThreadsBusy;

        if (scheduled.size()==0 && numThreadsBusy==0) {

            allDone = true;

        }

        if (ready.size()==10) {

            allDone = true;

        }

    }

    if (allDone) {

        cv_scheduler.notify_one();

    }

}

Scheduler::Key

Scheduler::getKey(std::string keyStr)

{

    if (keyString.count(keyStr)==0) {

        keyString[keyStr] = keyString.size();

    }

    tag[keyString[keyStr]] = keyStr;

    return keyString[keyStr];

}

} // namespace flens

Auxiliary Functions for Task Keys

For creating task identifiers we simply create strings that contain some Latex code describing what the task is supposed to do. For convenience we define a function such that we can use the beloved format strings from C:

flens/examples/lu/format.h

#ifndef LU_FORMAT_H

#define LU_FORMAT_H 1

#include <memory>

#include <iostream>

#include <string>

#include <cstdio>

namespace flens {

template<typename ... Args>

std::string

format(std::string fmt, Args ... args)

{

    size_t size = 1 + snprintf(nullptr, 0, fmt.c_str(), args ...);

    std::unique_ptr<char[]> buf(new char[size]);

    snprintf(buf.get(), size, fmt.c_str(), args ...);

    return std::string(buf.get(), buf.get() + size-1);

}

} // namespace flens

#endif

Now we simply can define a bunch of functions for generating different keys:

flens/examples/lu/lu_tiled_keys.h

#ifndef LU_LU_TILED_KEYS_H

#define LU_LU_TILED_KEYS_H 1

#include <flens/examples/lu/format.h>

namespace flens {

// Some auxiliary function for generating nice task keys

auto keyLU      = [=](int i)

                  {

                      return format("LU(A_{%d,%d})", i, i);

                  };

auto keyPtA     = [=](int i, int j)

                  {

                      return format("A_{%d,%d} \\leftarrow "

                                    "P_%d^T A_{%d,%d}",

                                    i, j, i, i, j);

                  };

auto keyUpdateP = [=](int i, int bs)

                  {

                      return format("P\\big|_%d \\leftarrow P_%d + %d",

                                    i, i, (i-1)*bs);

                  };

auto keyApplyU  = [=](int i, int j)

                  {

                      return format("A_{%d,%d} \\leftarrow "

                                    "A_{%d,%d} U_{%d,%d}^{-1}",

                                    i, j, i, j, i, i);

                  };

auto keyApplyL  = [=](int i, int j)

                  {

                      return format("A_{%d,%d} \\leftarrow "

                                    "L_{%d,%d}^{-1} A_{%d,%d}",

                                    i, j, i, i, i, j);

                  };

auto keyUpdateA = [=](int i, int j, int k)

                  {

                      return format("A_{%d,%d} \\leftarrow "

                                    "A_{%d,%d} - A_{%d,%d} A_{%d,%d}",

                                    i, j, i, j, i, k, k, j);

                  };

} // namespace flens

#endif // LU_LU_TILED_MT_H

Code for Testing

The test code first initializes the scheduler. Copies and converts the matrix into tile format and calls the implementation of the tile algorithm. Once the factorization is completed the tiled matrix is copied and converted back.

flens/examples/lu_tiled_test.cc

#include <flens/flens.cxx>

#include <iostream>

#include <flens/examples/lu/apply_perm.h>

#include <flens/examples/lu/check_lu.h>

#include <flens/examples/lu/lu_tiled_mt.h>

#include <flens/examples/lu/scheduler.h>

#include <flens/examples/lu/timer.h>

using namespace flens;

using namespace std;

int

main()

{

    typedef GeMatrix<FullStorage<double> >  DGeMatrix;

    typedef TiledCopy<DGeMatrix>            DTiledMatrix;

    typedef DenseVector<Array<double> >     DDenseVector;

    typedef DenseVector<Array<int> >        IDenseVector;

    Scheduler scheduler(2);

    const int bs = 256;

    const int m  = 4*bs;

    const int n  = 4*bs;

    DGeMatrix           A(m,n), A_org;

    IDenseVector        p(m);

    Underscore<int>     _;

    // Init A and store a copy for checking the factorization later

    fillRandom(A);

    A_org = A;

    double t0, t1;

    // Create a new matrix and initialize it with the tiled format

    t0 = ATL_walltime();

    DTiledMatrix        A_(A, bs);

    t1 = ATL_walltime();

    // Call the tile algorithm

    if (lu_tiled(scheduler, A_, p) != 0) {

        cout << "Matrix is singular or close to singular" << endl;

        return 1;

    }

    t1 = ATL_walltime()-t1;

    // Copy back the tiled matrix

    A_.untile(A);

    t0 = ATL_walltime()-t0;

    cout << "total time elapsed:          " << t0 << endl;

    cout << "time elapsed for 'lu_tiled': " << t1 << endl;

    cout << "time elapsed for conversion: " << t0-t1 << endl;

    // Check result

    cout << "Checking results:" << endl;

    cout << "|| A-P*L*U ||_1 = " << check_LU(A_org, A, p) << endl;

    return 0;

}

Compile and Run

$shell> cd flens/examples                                               
$shell> g++ -O3 -Wall -std=c++11 -I../.. -c lu/scheduler.cc             
$shell> g++ -O3 -Wall -std=c++11 -I../.. scheduler.o lu_tiled_test.cc   
$shell> ./a.out                                                         
total time elapsed:          0.0999999
time elapsed for 'lu_tiled': 0.0699999
time elapsed for conversion: 0.03
Checking results:
|| A-P*L*U ||_1 = 4.7791e-09

Content

Coding a Cache Optimized Parallel LU Factorization

Unblocked Algorithm

Partitioning \(A\), \(L\) and \(U\)

Recursive Algorithm for a Factorization \(A = LU\)

Recursive Algorithm with Pivoting: \(A = PLU\)

Iterative Algorithm with Pivoting: \(A = PLU\)

Improving Numerical Stability

Implementation of `lu_unblk` using Operators

Implementation of `lu_unblk` using `flens::blas` functions

Auxiliary Function for Applying the Permutation

Auxiliary Wall Time Function

Code for Testing

Note:

Compile and Run

Blocked Algorithm

Block-Partitioning A, L and U

Recursive Block Algorithm for a Factorization \(A = LU\)

Iterative Block Algorithm with Pivoting: \(A = PLU\)

Note

Auxiliary Function for Applying the Inverse Permutation

Implementation of `lu_blk` using `flens::blas` functions

Implementation of `lu_blk` using Operators

Code for Testing

Compile and Run

Block Algorithm with Parallel BLAS

Parallel Tile Algorithm

Example: Tasks for Factorizing a 2x2 Tile/Block Matrix

Task Dependencies

Example: Tasks for Factorizing a 3x3 Tile/Block Matrix

Example: Tasks for Factorizing a 10x10 Tile/Block Matrix

Converting Matrices to Tile Matrices

Implementation of `lu_tiled`

Task Scheduler and Thread Pool

Auxiliary Functions for Task Keys

Code for Testing

Compile and Run

Content

Coding a Cache Optimized Parallel LU Factorization

Unblocked Algorithm

Partitioning \(A\), \(L\) and \(U\)

Recursive Algorithm for a Factorization \(A = LU\)

Recursive Algorithm with Pivoting: \(A = PLU\)

Iterative Algorithm with Pivoting: \(A = PLU\)

Improving Numerical Stability

Implementation of lu_unblk using Operators

Implementation of lu_unblk using flens::blas functions

Auxiliary Function for Applying the Permutation

Auxiliary Wall Time Function

Code for Testing

Note:

Compile and Run

Blocked Algorithm

Block-Partitioning A, L and U

Recursive Block Algorithm for a Factorization \(A = LU\)

Iterative Block Algorithm with Pivoting: \(A = PLU\)

Note

Auxiliary Function for Applying the Inverse Permutation

Implementation of lu_blk using flens::blas functions

Implementation of lu_blk using Operators

Code for Testing

Compile and Run

Block Algorithm with Parallel BLAS

Parallel Tile Algorithm

Example: Tasks for Factorizing a 2x2 Tile/Block Matrix

Task Dependencies

Example: Tasks for Factorizing a 3x3 Tile/Block Matrix

Example: Tasks for Factorizing a 10x10 Tile/Block Matrix

Converting Matrices to Tile Matrices

Implementation of lu_tiled

Task Scheduler and Thread Pool

Auxiliary Functions for Task Keys

Code for Testing

Compile and Run

Implementation of `lu_unblk` using Operators

Implementation of `lu_unblk` using `flens::blas` functions

Implementation of `lu_blk` using `flens::blas` functions

Implementation of `lu_blk` using Operators

Implementation of `lu_tiled`