Coding a Cache Optimized Parallel LU Factorization
For a general \(m \times n\) matrix \(A\) the factorization computes \(P\), \(L\) and \(U\) such that
\[ A = P L U\]where
-
\(P\) is a \(m \times m\) permutation matrix. Our implementation will represent it a permutation vector \(p\). This vector will only contain integers and encodes the permutation as follows: interchange row \(i\) with \(p_i\) for \(1\leq i \leq m\).
The permutation matrix is determined by a pivoting strategy. Such a strategy is necessary to prevent divisions by zero or division by very small numbers. So it enhances the numerical stability of the algorithm. Pivoting strategies differ in the amount of effort you want to invest for stability. We will use partial pivoting in our factorization.
-
\(L\) is a lower triangular \(m \times m\) matrix with ones on the diagonal.
The strict lower triangular part of \(A\) gets overwritten with \(L\). So no additional matrix is needed to store \(L\).
-
\(U\) is an upper triangular \(m \times n\) matrix.
The upper triangular part of \(A\) gets overwritten with \(U\). So no additional matrix is needed to store \(U\).
Every student who took an undergraduate class on numerical linear algebra knows how to implement this element wise using 3 loops We will derive two variants that use matrix/vector operations instead of element wise operations. Hence, we can use BLAS for the heavy computations:
-
Unblocked Variant lu_unblk
This variant uses only BLAS Level 1 and Level 2 functions. So here we can not expect high performance. But we will take advantage of it. Low performance means that the CPU is waiting for work most of the time. So we can do some extra computations if it improves numerical stability.
-
Blocked Variant lu_blk
This variant uses the unblocked variant lu_unblk only on a rather small block. Using BLAS Level 3 functions the remaining matrix gets updated with the results. We we achieve high performance. Furthermore, it dominates the overall performance and leads to an immense speedup.
Comparing the efficiency (measured in MFLOPS) you see that the blocked variant is superior for all problem sizes. For large systems of linear equations it almost reaches 8 GFLOPS. The unblock variant drops from 1 GFLOP to 500 MFLOPS when the problem size exceeds the cache size. So efficiency is increased by a factor of 14. Note that this is achieved without multi threading just by using an algorithm that allows to exploit cache hierarchies:
Of course in practise not efficiency but effectiveness matters. Effectiveness means how long does it take to compute the factorisation. So less is better:
Note that both variants still have a complexity of \(\mathcal{O}(N^3)\). Only the constant for the \(N^3\) differs significantly. This can be visualized by a log-scaled plot:
-
Parallel Tile Variant lu_tile
Once we can exploit a single CPU core we can consider parallelization. The matrix is converted to a block format. For this blocked or tiled matrix tasks for the LU factorization are done in parallel if possible.
The following benchmarks were done on a Quad-Core Intel Xeon Woodcrest with 2.66 GHz:
Using the timing measurements of lu_blk we can see that we have an effective per-core speedup:
Unblocked Algorithm
The idea to develop an algorithm in terms of matrix/vector operations is based on considering partitions of matrices.
Partitioning \(A\), \(L\) and \(U\)
We partition \(A \in \mathbb{R}^{m \times n}\) along the first row and first column as follows:
\[ \displaystyle A = \left( \begin{array}{c|c} a_{11} & \vec{a}_{12}^T \\ \hline \vec{a}_{21} & A_{22} \end{array}\right) \qquad a_{11} \in \mathbb{R},\; \vec{a}_{12}^T \in \mathbb{R}^{1 \times (n-1)},\; \vec{a}_{21} \in \mathbb{R}^{(m-1) \times 1},\; A_{22} \in \mathbb{R}^{(m-1) \times (m-1)}.\]So \(a_{11}\) is a scalar, \(\vec{a}_{21}\) a column vector, \(\vec{a}_{12}^T\) a row vector and \(A_{22}\) a matrix block.
We first assume that the permutation matrix \(P\) is just the identity matrix, i.e. that no pivoting is needed. In this case the factorization reads \(A = LU\). We now partition \(L\) and \(U\) accordingly.
As \(L \in \mathbb{R}^{m \times m}\) is lower triangular with unit diagonal the partition reads:
\[ L = \left( \begin{array}{c|c} 1 & \vec{\mathcal{0}}^T \\ \hline \vec{l}_{21} & L_{22} \end{array}\right) \qquad \vec{l}_{21} \in \mathbb{R}^{(m-1) \times 1},\; L_{22} \in \mathbb{R}^{(m-1) \times (m-1)}. \]Analogously we get for \(U\):
\[ U = \left( \begin{array}{c|c} u_{11} & \vec{u}_{12}^T \\ \hline \mathcal{0} & U_{22} \end{array}\right) \qquad u_{11} \in \mathbb{R},\; \vec{u}_{12}^T \in \mathbb{R}^{1 \times (n-1)},\; L_{22} \in \mathbb{R}^{(m-1) \times (n-1)}.\]Recursive Algorithm for a Factorization \(A = LU\)
In the matrix product equation \(A=LU\) we now simply plug in the partitioned matrices and get
\[ \begin{eqnarray*} A = LU &\;\Leftrightarrow\;& \left( \begin{array}{c|c} a_{11} & \vec{a}_{12}^T \\ \hline \vec{a}_{21} & A_{22} \end{array}\right) = \left( \begin{array}{c|c} 1 & \vec{\mathcal{0}}^T \\ \hline \vec{l}_{21} & L_{22} \end{array}\right) \left( \begin{array}{c|c} u_{11} & \vec{u}_{12}^T \\ \hline \mathcal{0} & U_{22} \end{array}\right) \\[1em] &\;\Leftrightarrow\;& \left( \begin{array}{c|c} a_{11} & \vec{a}_{12}^T \\ \hline \vec{a}_{21} & A_{22} \end{array}\right) = \left( \begin{array}{c|c} u_{11} & \vec{u}_{12}^T \\ \hline \vec{l}_{21} u_{11} & \vec{l}_{21} \vec{u}_{12}^T + L_{22} U_{22} \end{array}\right) \end{eqnarray*}\]Comparing partition blocks leads to the equations
\[ \left\{ \begin{array}{lcl} a_{11} &=& u_{11} \\ \vec{a}_{12} &=& \vec{u}_{12} \\ \vec{a}_{21} &=& u_{11} \vec{l}_{21} \\ A_{22} &=& \vec{l}_{21} \vec{u}_{12}^T + L_{22}U_{22} \end{array} \right. \quad \Leftrightarrow \quad \left\{ \begin{array}{lcl} u_{11} &=& a_{11} \\ \vec{u}_{12} &=& \vec{a}_{12} \\ \vec{l}_{21} &=& \vec{a}_{21} / u_{11} \\ L_{22}U_{22} &=& A_{22} - \vec{l}_{21} \vec{u}_{12}^T \end{array} \right.\]So except for \(L_{22}\) and \(U_{22}\) all entries of the partitioned matrices \(L\) and \(U\) can be computed directly. The blocks \(L_{22}\) and \(U_{22}\) can be computed by computing the \(LU\) factorization of the \((m-1) \times (n-1)\) matrix \(A_{22} - \vec{l}_{21} \vec{u}_{12}^T\).
Overwriting the partitions of \(A\) with partitions of \(L\) and \(U\) reads:
\[ \begin{array}{lcll} a_{11} & \leftarrow & a_{11} & \text{(No Operation)} \\ \vec{a}_{12} & \leftarrow & \vec{a}_{12} & \text{(No Operation)} \\ \vec{a}_{21} & \leftarrow & \vec{a}_{21} / a_{11} & \text{(Scale)} \\ A_{22} & \leftarrow & A_{22} - \vec{a}_{21} \vec{a}_{12}^T & \text{(Rank 1 Update)} \end{array}\]So obviously this requires that \(a_{11} \neq 0\) If \(a_{11} = 0\) we have to use a pivot strategy that interchanges two rows of \(A\) first. We will add this next. For the moment we derived the well-known Gauss algorithm (without row interchanges):
-
Partition \(A\) as outlined above.
-
Overwrite \(a_{21}\) with \(\frac{1}{a_{11}} a_{21}\).
-
Overwrite \(A_{22}\) with \(A_{22} - a_{21} a_{12}^T\).
-
Repeat with \(A_{22}\) instead of \(A\).
Recursive Algorithm with Pivoting: \(A = PLU\)
In the above algorithm we have to avoid a division by zero. For numerical stability it is favorable that in a floating point division x/y the absolute value of y is larger than the absolute value of x.
So the idea to achieve both is simple and known as partial pivoting. Before each step we first determine the component in \(\vec{a}_{21}\) with maximal absolute value. We interchange then complete rows of the original matrix \(A\) such that \(a_{11}\) always contains the maximal absolute value. Note that is important the rows of the original matrix get interchanged.
In order to remember what rows of the original matrix where interchanged we use a pivot vector \(\vec{p}\) that initially contains values \(1, \dots, m\). If row \(i\) gets interchanged with row \(j\) we set \(p_i = j\).
Iterative Algorithm with Pivoting: \(A = PLU\)
Of course we prefer an iterative algorithm for an implementation. This is simply achieved by considering the partitioning within the original matrix:
\[ \require{color} \left( \begin{array}{c|c} A_{00} & A_{01} \\ \hline A_{10} & {\color{red}\tilde{A}_{11}} \\ \end{array} \right) \qquad\text{with}\quad{\color{red}\tilde{A}} = \left( \begin{array}{c|c} a_{11} & \vec{a}_{12}^T \\ \hline \vec{a}_{21} & A_{22} \end{array} \right)\]So in the \(i\)-th step we are working on the sub-matrix \(\tilde{A}\). In Matlab notation this sub-matrix is related to \(A\) by \(\tilde{A} = A(i:m, i:n)\). Parts of \(\tilde{A}\) are therefore:
\[ \begin{array}{lclclcl} \text{first column of}\;\tilde{A} & = & A(&i:m &,& i &) \\ a_{11} & = & A(&i &,& i &) \\ \vec{a}_{21} & = & A(&i+1:m &,& i &) \\ \vec{a}_{12}^T & = & A(&i &,& i+1:n &) \\ \vec{a}_{12}^T & = & A(&i+1:m &,& i+1:n &) \\ \end{array}\]This gives the iterative algorithm for computing the \(LU\) factorization with partial pivoting:
-
For \(i=1,\dots,\min\{m,n\}\)
-
Find \(l \in {i,\dots,m}\) such that \(|\,A_{l,i}| = \max\limits_{k=i,\dots,m}|\,A_{k,i}|\)
-
Set \(p_i = l\)
-
If \(p_i \neq i\) then
-
interchange rows \(i\) and \(l\) of \(A\).
-
-
If \(|\,A_{i,i}|\) is exactly Zero
-
return \(i\)
-
-
\(A_{i+1:m, i} \leftarrow \frac{1}{A_{i,i}} \cdot A_{i+1:m, i} \) (Scale)
-
\(A_{i+1:m,i+1:n} \leftarrow A_{i+1:m,i+1:n} - A_{i+1:m,i} \cdot A_{i,i+1:n}\) (Rank 1 Update)
-
The algorithm aborts if in the \(i\)-th step no non-zero element can be found. So yes, we continue even if \(A_{i,i}\) is almost zero. Numerical stability can be optimized if scaling is done by division if \(A_{i,i}\) is almost zero.
Improving Numerical Stability
Floating point divisions are more expensive (i.e. need more CPU cycles) than floating point multiplications. That makes it favorable to compute first the reciprocal \(\alpha = \frac{1}{A_{i,i}}\) and then the scaling \(A_{i+1:m, i} \leftarrow \alpha A_{i+1:m, i}\).
If the absolute value of \(A_{i,i}\) is small the division \(1/A_{i,i}\) can cause an overflow. Even without an overflow scaling by multiplying with the reciprocal can cause a considerable loss of precision. Because of the partial pivoting the element wise divisions \(A_{j,i}/A_{i,i}\) with \(i < j \leq m\) still give results within machine accuracy. Using the FPU for the division makes the division considerably more expensive than multiplications using SSE or AVX. However, using the FPU the division gets internally done with 80 bits floating point numbers instead of 64 or 32 bits.
Note that the dominant operations in the above algorithm are Vector Scaling (BLAS Level 1) and Rank 1 Update (BLAS Level 2). We don't have any BLAS Level 3 function. So this algorithm can not exploit caches. This means the CPU will wait most of the time for data from the memory. So once the CPU finally gets something to do we can do some extra computations if it improves roundoff errors. So in each step we will check whether the absolute value of \(A_{i,i}\) is smaller than a threshold. Based on that we decide to scale by \(1/A_{i,i}\) (possibly using internally SSE or AVX) or doing element wise divisions using the FPU.
-
For \(i=1,\dots,\min\{m,n\}\)
-
Find \(l \in {i,\dots,m}\) such that \(|\,A_{l,i}| = \max\limits_{k=i,\dots,m}|\,A_{k,i}|\)
-
Set \(p_i = l\)
-
If \(p_i \neq i\) then
-
interchange rows \(i\) and \(l\) of \(A\).
-
-
If \(|\,A_{i,i}|\) is exactly Zero
-
return \(i\)
-
-
If \(|\,A_{i,i}| < \text{thresh}\)
-
For \(j=i+1, \dots, m\)
-
\(A_{j, i} \leftarrow A_{j,m}/A_{i,i} \) (FPU Division)
-
-
-
else
-
\(A_{i+1:m, i} \leftarrow \frac{1}{A_{i,i}} \cdot A_{i+1:m, i} \) (Scale)
-
-
\(A_{i+1:m,i+1:n} \leftarrow A_{i+1:m,i+1:n} - A_{i+1:m,i} \cdot A_{i,i+1:n}\) (Rank 1 Update)
-
Numerical stability can be further improved by using full pivoting for the \(LU\) factorization. If this is not sufficient more stable (and more expensive) factorizations like \(QR\) factorization or a singular value decomposition can be considered.
Implementation of lu_unblk using Operators
#define LU_LU_UNBLK_H 1
#include <cxxstd/algorithm.h>
#include <cxxstd/cmath.h>
#include <cxxstd/limits.h>
#include <flens/flens.h>
namespace flens {
template <typename MA, typename VP>
typename GeMatrix<MA>::IndexType
lu_unblk(GeMatrix<MA> &A, DenseVector<VP> &p)
{
typedef typename GeMatrix<MA>::ElementType T;
typedef typename GeMatrix<MA>::IndexType IndexType;
const IndexType m = A.numRows();
const IndexType n = A.numCols();
const IndexType mn = std::min(m, n);
const T Zero(0);
const T One(1);
const Underscore<IndexType> _;
// Compute threshold for stable scaling
const T eps = std::numeric_limits<T>::epsilon() * T(0.5);
T safeMin = std::numeric_limits<T>::min();
const T small = One / std::numeric_limits<T>::max();
if (small>=safeMin) {
safeMin = small*(One+eps);
}
for (IndexType i=1; i<=mn; ++i) {
// Partition sub-matrix A(i:m,i:n)
const auto A_1 = A(_( i, m), i); // first column of sub_matrix
const auto &a11 = A( i, i);
const auto a12 = A( i, _(i+1,n));
auto a21 = A(_(i+1,m), i);
auto A22 = A(_(i+1,m), _(i+1,n));
// Find pivot index
IndexType l = blas::iamax(A_1) +i- 1;
p(i) = l;
// Interchange rows of A
if (p(i)!=i) {
blas::swap(A(i,_), A(l,_));
}
// Only abort if diagonal element is *exactly* zero
if (a11==Zero) {
return i;
}
// Choose stable scaling
if (std::abs(a11)<safeMin) {
a21 /= a11;
} else {
a21 *= One/a11;
}
// Rank 1 update
A22 -= a21*transpose(a12);
}
return 0;
}
} // namespace flens
#endif // LU_LU_UNBLK_H
Implementation of lu_unblk using flens::blas functions
In case you find explicit function calls more expressive you can do so. For example, it makes it more obvious that only BLAS Level 1 and BLAS Level 2 functions get used. Actually we usually start an implementation with explicit function calls and only replace them with operator notations if it make code more expressive.
The previous previous code is equivalent to:
#define LU_LU_UNBLK_H 1
#include <cxxstd/algorithm.h>
#include <cxxstd/cmath.h>
#include <cxxstd/limits.h>
#include <flens/flens.h>
namespace flens {
template <typename MA, typename VP>
typename GeMatrix<MA>::IndexType
lu_unblk(GeMatrix<MA> &A, DenseVector<VP> &p)
{
typedef typename GeMatrix<MA>::ElementType T;
typedef typename GeMatrix<MA>::IndexType IndexType;
const IndexType m = A.numRows();
const IndexType n = A.numCols();
const IndexType mn = std::min(m, n);
const T Zero(0);
const T One(1);
const Underscore<IndexType> _;
// Compute threshold for stable scaling
const T eps = std::numeric_limits<T>::epsilon() * T(0.5);
T safeMin = std::numeric_limits<T>::min();
const T small = One / std::numeric_limits<T>::max();
if (small>=safeMin) {
safeMin = small*(One+eps);
}
for (IndexType i=1; i<=mn; ++i) {
// Partition sub-matrix A(i:m,i:n)
const auto A_1 = A(_( i, m), i); // first column of sub_matrix
const auto &a11 = A( i, i);
const auto a12 = A( i, _(i+1,n));
auto a21 = A(_(i+1,m), i);
auto A22 = A(_(i+1,m), _(i+1,n));
// Find pivot index
IndexType l = blas::iamax(A_1) +i- 1;
p(i) = l;
// Interchange rows of A
if (p(i)!=i) {
blas::swap(A(i,_), A(l,_));
}
// Only abort if diagonal element is *exactly* zero
if (a11==Zero) {
return i;
}
// Choose stable scaling
if (std::abs(a11)<safeMin) {
blas::rscal(a11, a21);
} else {
blas::scal(One/a11, a21);
}
// Rank 1 update
blas::r(-One, a21, a12, A22);
}
return 0;
}
} // namespace flens
#endif // LU_LU_UNBLK_H 1
Auxiliary Function for Applying the Permutation
Applying the permutation \(P\) to a matrix \(X\) means computing \(PX\). The permutation matrix \(P\) is represented by a pivot vector \(p\) that stores row interchanges that happened during the \(LU\) factorization. In this case applying \(P\) means reversing the interchanges such that \(PLU\) gives the original matrix \(A\). This is done by the auxiliary function apply_perm:
#define LU_APPLY_PERM_H 1
#include <flens/flens.cxx>
namespace flens {
// Compute A = P*A
template <typename VP, typename MA>
void
apply_perm(const DenseVector<VP> &p, GeMatrix<MA> &A)
{
typedef typename GeMatrix<MA>::IndexType IndexType;
const Underscore<IndexType> _;
const IndexType m = p.length();
// reverse row interchanges
for (IndexType i=m; i>=1; --i) {
if (i!=p(i)) {
blas::swap(A(i,_), A(p(i),_));
}
}
}
} // namespace flens
#endif // LU_APPLY_PERM_H
Auxiliary Wall Time Function
For measuring the elapsed time for doing a certain task we use a simple wall-time clock. The term 'wall-time' means that for a task we want to measure the real time that elapses from start to end, i.e. including time used for system calls etc. It can be used following the pattern:
// do something
t0 = ATL_walltime() - t0;
cout << "elapsed time = " << t0 << endl;
The code for ATL_walltime was taken from ATLAS:
#define LU_LU_TIMER_H 1
#include <stdlib.h>
#include <sys/times.h>
#include <unistd.h>
double ATL_walltime(void)
{
struct tms ts;
static double ClockTick=0.0;
if (ClockTick == 0.0) ClockTick = 1.0 / ((double) sysconf(_SC_CLK_TCK));
return( ((double) times(&ts)) * ClockTick);
}
#endif // LU_LU_TIMER_H
Code for Testing
#include <iostream>
#ifdef WITH_OPERATORS
# include <flens/examples/lu/lu_unblk_with_operators.h>
#else
# include <flens/examples/lu/lu_unblk_with_flensblas.h>
#endif
#include <flens/examples/lu/apply_perm.h>
#include <flens/examples/lu/timer.h>
using namespace flens;
using namespace std;
int
main()
{
const int m = 2000;
const int n = 2000;
GeMatrix<FullStorage<double> > A(m,n), A_org;
DenseVector<Array<int> > p(m);
Underscore<int> _;
fillRandom(A);
// store original A for checking the factorization later
A_org = A;
//cout << "A = " << A << endl;
double t0 = ATL_walltime();
if (lu_unblk(A, p) != 0) {
cout << "Matrix is singular or close to singular" << endl;
return 1;
}
t0 = ATL_walltime() - t0;
cout << "Time elapsed: " << t0 << endl;
//cout << "L = " << A.lowerUnit() << endl;
//cout << "U = " << A.upper() << endl;
//cout << "p = " << p << endl;
cout << "Checking results:" << endl;
GeMatrix<FullStorage<double> > LU(m,n);
// blas::copy(NoTrans, A.upper(), LU.upper());
LU.upper() = A.upper();
// blas::mm(Left, NoTrans, 1.0, A(_(1,4),_(1,4)).lowerUnit(), LU);
LU = A(_(1,m),_(1,m)).lowerUnit() * LU;
// LU = P*LU
apply_perm(p, LU);
// blas::axpy(NoTrans, -1.0, LU, A_org);
A_org -= LU;
cout << "|| A-P*L*U ||_1 = " << blas::asum(A_org) << endl;
return 0;
}
Note:
Uncomment for playing with small dimensions:
and
//cout << "U = " << A.upper() << endl;
//cout << "p = " << p << endl;
Compile and Run
$shell> cd flens/examples $shell> g++ -Wall -std=c++11 -DWITH_OPERATORS -I../.. -o lu_unblk_operators_test lu_unblk_test.cc $shell> g++ -Wall -std=c++11 -I../.. -o lu_unblk_flensblas_test lu_unblk_test.cc $shell> ./lu_unblk_operators_test Time elapsed: 13.96 Checking results: || A-P*L*U ||_1 = 1.66341e-08 $shell> ./lu_unblk_flensblas_test Time elapsed: 13.85 Checking results: || A-P*L*U ||_1 = 1.66341e-08
Blocked Algorithm
We now derive an algorithm that takes advantage of BLAS Level 3 functions and applies the unblocked \(LU\) factorization only on a block that fits into the L2 cache.
Block-Partitioning A, L and U
We now consider a block-wise partition of \(A \in \mathbb{R}^{m \times n}\)
\[ A = \left( \begin{array}{c|c} A_{11} & A_{12} \\ \hline A_{21} & A_{22} \end{array}\right) \]where block \(A_{11}\) is square, i.e. \(A_{11} \in \mathbb{R}^{\text{BS} \,\times\, \text{BS}}\), \(A_{21} \in \mathbb{R}^{(m-\text{BS}) \,\times\, \text{BS}}\), \(A_{12} \in \mathbb{R}^{\text{BS} \,\times\, (n-\text{BS})}\) and \(A_{22} \in \mathbb{R}^{(m-\text{BS}) \,\times\, (n-\text{BS})}\) for some block size \(\text{BS} \in \mathbb{N}\).
Analogously we consider for \(L\) the block-partition
\[ L = \left( \begin{array}{c|c} L_{11} & \mathcal{0} \\ \hline L_{21} & L_{22} \end{array}\right) \]where \(L_{11}\) adn \(L_{22}\) are lower triangular matrix with unit diagonal. For \(U\) the block-partition
\[ U = \left( \begin{array}{c|c} U_{11} & U_{12} \\ \hline \mathcal{0} & U_{22} \end{array}\right). \]blocks \(U_{11}\) and \(U_{22}\) are obviously upper triangular matrices.
So in case no pivoting gets applied the \(LU\) factorization is determined by
\[ \begin{eqnarray*} A = LU &\;\Leftrightarrow\;& \left( \begin{array}{c|c} A_{11} & A_{12} \\ \hline A_{21} & A_{22} \end{array}\right) = \left( \begin{array}{c|c} L_{11} & \mathcal{0} \\ \hline L_{21} & L_{22} \end{array}\right) \left( \begin{array}{c|c} U_{11} & U_{12} \\ \hline \mathcal{0} & U_{22} \end{array}\right) \\[1em] &\;\Leftrightarrow\;& \left( \begin{array}{c|c} A_{11} & A_{12} \\ \hline A_{21} & A_{22} \end{array}\right) = \left( \begin{array}{c|c} L_{11} U_{11} & L_{11} U_{12} \\ \hline L_{21} U_{11} & L_{21} U_{12} + L_{22} U_{22} \end{array}\right) \end{eqnarray*}\]Comparing partition blocks leads to the equations
\[ \left\{ \begin{array}{lcl} A_{11} &=& L_{11} U_{11} \\ A_{12} &=& L_{11} U_{12} \\ A_{21} &=& L_{21} U_{11} \\ A_{22} &=& L_{21} U{12} + L_{22} U_{22} \end{array} \right. \quad \Leftrightarrow \quad \left\{ \begin{array}{lcl} L_{11} U_{11} &=& A_{11} \\ U_{12} &=& L_{11}^{-1} A_{12} \\ L_{21} &=& A_{21} U_{11}^{-1} \\ L_{22}U_{22} &=& A_{22} - L_{21} U_{12} \end{array} \right. \quad \Leftrightarrow \quad \left\{ \begin{array}{lcl} L_{11} U_{11} &=& A_{11} \\ U_{12} &=& L_{11}^{-1} A_{12} \\ L_{21}^T &=& U_{11}^{-T} A_{21}^T \\ L_{22}U_{22} &=& A_{22} - L_{21} U_{12} \end{array} \right.\]Note that because \(L_{11}\) and \(U_{11}^T\) are both lower triangular computing \(L_{11}^{-1} A_{12}\) and \(U_{11}^{-T} A_{21}^T\) can be done using a triangular solver (forward substitution).
Recursive Block Algorithm for a Factorization \(A = LU\)
So as we again want to overwrite \(A\) with \(L\) and \(U\) we end up with the algorithm
-
Partition \(A\) as outlined above.
-
Use the unblocked \(LU\) factorization to overwrite \(A_{11}\) with \(L_{11}\) and $U_{11}.
-
Use a triangular solver to overwrite \(A_{21}\) with \(A_{21} U_{11}^{-1}\).
-
Use a triangular solver to overwrite \(A_{12}\) with \(L_{11}^{–1} U_{12}\)
-
Overwrite \(A_{22}\) with \(A_{22} - A_{21}A_{12}\) (Matrix-Matrix-Product)
-
Repeat with \(A_{22}\) instead of \(A\).
The triangular solvers (blas::sm) as well as the matrix-matrix product (blas::mm) are BLAS Level 3 functions.
Iterative Block Algorithm with Pivoting: \(A = PLU\)
In order to derive an iterative algorithm we again consider the partitioning within the original matrix:
\[ \require{color} \left( \begin{array}{c|c|c} A_{00} & A_{01} & A_{02} \\ \hline A_{10} & {\color{red}A_{11}} & {\color{red}A_{12}}\\ \hline A_{20} & {\color{red}A_{21}} & {\color{red}A_{22}}\\ \end{array} \right) \qquad\text{with}\quad{\color{red}\tilde{A}} = \left( \begin{array}{c|c} {\color{red}A_{11}} & {\color{red}A_{12}}\\ \hline {\color{red}A_{21}} & {\color{red}A_{22}} \end{array} \right)\]In the \(i\)-th iteration the dimension of \(\tilde{A}\) is \(\bigl(m - (i-1)\cdot\text{BS}\bigr) \times \bigl(n - (i-1)\cdot\text{BS}\bigr)\). Therefore blocks of \(\tilde{A}\) have dimensions
\[ \begin{array}{lll} A_{11} & \in \mathbb{R}^{\text{bs} \times \text{bs}} & \quad\text{with}\; \text{bs} = \min\{\text{BS},\, m - i\cdot\text{BS},\, n - i\cdot\text{BS}\}\\ A_{21} & \in \mathbb{R}^{(m - i\cdot\text{BS}) \,\times\, \text{bs}} & \\ A_{12} & \in \mathbb{R}^{\text{bs} \,\times\, (n - i\cdot\text{BS})} & \\ A_{22} & \in \mathbb{R}^{(m - i\cdot\text{BS}) \,\times\, (n - i\cdot\text{BS})} & \\ \end{array}\]This gives the iterative algorithm for computing block wise the \(A=PLU\) factorization with partial pivoting:
-
Set \(i = 1\)
-
While \(i=1 \leq \min\{m,n\}\)
-
Partition \(A\) for the \(i\)-th iteration as illustrated above with \(\text{bs} = \min\{\text{BS},\, m - (i-1)\cdot\text{BS},\, n - (i-1)\cdot\text{BS}\}\)
-
Use the unblocked \(LU\) factorization to compute \(A_{11} = \tilde{P} L_{11} U_{11}\)
-
Apply \(\tilde{P}^{-1}\) to \(A_{10}\), i.e. compute \(A_{10} = \tilde{P}^{-1} A_{10}\).
-
Apply \(\tilde{P}^{-1}\) to \(A_{12}\), i.e. compute \(A_{12} = \tilde{P}^{-1} A_{12}\).
-
Update the permutation matrix \(P\) with \(\tilde{P}\).
-
Use a triangular solver for \(A_{21}^T = U_{11}^{-T} A_{21}^T\).
-
Use a triangular solver for \(A_{12} = L_{11}^{-1} A_{12}\).
-
Compute the matrix-product \(A_{22} = A_{22} - A_{21} A_{12}\).
-
\(i = i + \text{BS}\)
-
Note
What does updating \(P\) with \(\tilde{P}\) mean? The permutation matrix \(P\) is represented by a pivot vector \(p = (p_1, \dots, p_m)\) and \(\tilde{P}\) by a pivot vector \(\tilde{p} = (\tilde{p}_1, \dots, \tilde{p}_{\text{bs}})\). So an update means to set \(p_{i+k-1} = i - 1 + \tilde{p}_{k}\) for \(1 \leq k \leq \text{bs}\).
Auxiliary Function for Applying the Inverse Permutation
Applying the inverse permutation \(P^{-1}=P^T\) to a matrix \(X\) means computing \(P^T X\). Recall, the permutation matrix \(P\) is represented by a pivot vector \(p\) that stores row interchanges that happened during the \(LU\) factorization. So applying \(P\) means reversing the interchanges such that \(PLU\) gives the original matrix \(A\). Hence, applying \(P^{-1}\) to \(X\) means to apply the same row interchanges that were done during the \(LU\) factorization. This is done by the auxiliary function apply_perm_inv:
#define LU_APPLY_PERM_INV_H 1
#include <flens/flens.cxx>
namespace flens {
// Compute A = P^{-1}*A = P^T*A
template <typename VP, typename MA>
void
apply_perm_inv(const DenseVector<VP> &p, GeMatrix<MA> &A)
{
typedef typename GeMatrix<MA>::IndexType IndexType;
const Underscore<IndexType> _;
const IndexType m = p.length();
// apply row interchanges
for (IndexType i=1; i<=m; ++i) {
if (i!=p(i)) {
blas::swap(A(i,_), A(p(i),_));
}
}
}
} // namespace flens
#endif // LU_APPLY_PERM_INV_H
Implementation of lu_blk using flens::blas functions
#define LU_LU_BLK_H 1
#ifndef LU_BS
#define LU_BS 96
#endif
#include <cxxstd/algorithm.h>
#include <cxxstd/cmath.h>
#include <flens/flens.h>
#include <flens/examples/lu/apply_perm_inv.h>
#include <flens/examples/lu/lu_unblk_with_operators.h>
namespace flens {
template <typename MA, typename VP>
typename GeMatrix<MA>::IndexType
lu_blk(GeMatrix<MA> &A, DenseVector<VP> &p)
{
typedef typename GeMatrix<MA>::ElementType T;
typedef typename GeMatrix<MA>::IndexType IndexType;
const IndexType m = A.numRows();
const IndexType n = A.numCols();
const IndexType mn = std::min(m, n);
const T One(1);
const Underscore<IndexType> _;
IndexType info = 0;
for (IndexType i=1; i<=mn; i+=LU_BS) {
const IndexType bs = std::min(std::min(LU_BS, m-i+1), n-i+1);
// Partitions of A
auto A10 = A(_( i, i+bs-1), _( 1, i-1));
auto A11 = A(_( i, i+bs-1), _( i, i+bs-1));
auto A12 = A(_( i, i+bs-1), _(i+bs, n));
auto A21 = A(_(i+bs, m), _( i, i+bs-1));
auto A22 = A(_(i+bs, m), _(i+bs, n));
// Part of the pivot vector for rows of A11
auto p_ = p(_(i,i+bs-1));
// Compute LU factorization of A11
info = lu_unblk(A11, p_);
if (info) {
// All values in column info of A11 are *exactly* zero.
return info+i-1;
}
// Apply permutation to A10 and A12
apply_perm_inv(p_, A10);
apply_perm_inv(p_, A12);
// Update p
p_ += i-1;
// Use triangular solver for A21 = A11.upper()*A21
blas::sm(Right, NoTrans, One, A11.upper(), A21);
// Use triangular solver for A12 = A11.lowerUnit()*A12
blas::sm(Left, NoTrans, One, A11.lowerUnit(), A12);
// Update A22 with matrix-product A22 = A22 - A21*A12
blas::mm(NoTrans, NoTrans, -One, A21, A12, One, A22);
}
return 0;
}
} // namespace flens
#endif // LU_LU_BLK_H
Implementation of lu_blk using Operators
For applying triangular solvers we do not provide overloaded operators. So you might be disappointed if you a big fan of it. In that case we leave it as a simple exercise to the dear user to support for a triangular matrix L something like X = inverse(L)*B that gets evaluated as blas::sm(Left, NoTrans, 1.0, L, X) if X and B are identical and as X = B; blas::sm(Left, NoTrans, 1.0, L, X) otherwise.
#define LU_LU_BLK_H 1
#ifndef LU_BS
#define LU_BS 96
#endif
#include <cxxstd/algorithm.h>
#include <cxxstd/cmath.h>
#include <flens/flens.h>
#include <flens/examples/lu/apply_perm_inv.h>
#include <flens/examples/lu/lu_unblk_with_operators.h>
namespace flens {
template <typename MA, typename VP>
typename GeMatrix<MA>::IndexType
lu_blk(GeMatrix<MA> &A, DenseVector<VP> &p)
{
typedef typename GeMatrix<MA>::ElementType T;
typedef typename GeMatrix<MA>::IndexType IndexType;
const IndexType m = A.numRows();
const IndexType n = A.numCols();
const IndexType mn = std::min(m, n);
const T One(1);
const Underscore<IndexType> _;
IndexType info = 0;
for (IndexType i=1; i<=mn; i+=LU_BS) {
const IndexType bs = std::min(std::min(LU_BS, m-i+1), n-i+1);
// Partitions of A
auto A10 = A(_( i, i+bs-1), _( 1, i-1));
auto A11 = A(_( i, i+bs-1), _( i, i+bs-1));
auto A12 = A(_( i, i+bs-1), _(i+bs, n));
auto A21 = A(_(i+bs, m), _( i, i+bs-1));
auto A22 = A(_(i+bs, m), _(i+bs, n));
// Part of the pivot vector for rows of A11
auto p_ = p(_(i,i+bs-1));
// Compute LU factorization of A11
info = lu_unblk(A11, p_);
if (info) {
// All values in column info of A11 are *exactly* zero.
return info+i- 1;
}
// Apply permutation to A10 and A12
apply_perm_inv(p_, A10);
apply_perm_inv(p_, A12);
// Update p
p_ += i-1;
// Use triangular solver for A21 = A11.upper()*A21
blas::sm(Right, NoTrans, One, A11.upper(), A21);
// Use triangular solver for A12 = A11.lowerUnit()*A12
blas::sm(Left, NoTrans, One, A11.lowerUnit(), A12);
// Update A22 with matrix-product A22 = A22 - A21*A12
A22 -= A21*A12;
}
return 0;
}
} // namespace flens
#endif // LU_LU_BLK_H
Code for Testing
#include <algorithm>
#include <iostream>
#ifdef WITH_OPERATORS
# include <flens/examples/lu/lu_blk_with_operators.h>
#else
# include <flens/examples/lu/lu_blk_with_flensblas.h>
#endif
#include <flens/examples/lu/check_lu.h>
#include <flens/examples/lu/timer.h>
using namespace flens;
using namespace std;
int
main()
{
const int m = 2000;
const int n = 2000;
const int mn = std::min(m, n);
GeMatrix<FullStorage<double> > A(m,n), A_org;
DenseVector<Array<int> > p(mn);
Underscore<int> _;
fillRandom(A);
// store original A for checking the factorization later
A_org = A;
double t0 = ATL_walltime();
if (lu_blk(A, p) != 0) {
cout << "Matrix is singular or close to singular" << endl;
return 1;
}
t0 = ATL_walltime() - t0;
cout << "Time elapsed: " << t0 << endl;
cout << "Checking results:" << endl;
cout << "|| A-P*L*U ||_1 = " << check_LU(A_org, A, p) << endl;
return 0;
}
Compile and Run
$shell> cd flens/examples $shell> g++ -Wall -std=c++11 -DWITH_OPERATORS -I../.. -o lu_blk_operators_test lu_blk_test.cc $shell> g++ -Wall -std=c++11 -I../.. -o lu_blk_flensblas_test lu_blk_test.cc $shell> ./lu_blk_operators_test Time elapsed: 0.95 Checking results: || A-P*L*U ||_1 = 3.56198e-07 $shell> ./lu_blk_flensblas_test Time elapsed: 0.95 Checking results: || A-P*L*U ||_1 = 3.56198e-07
Having an algorithm that is capable to exploit caches and BLAS it now makes also sense to compile without debug code and optimizations:
$shell> cd flens/examples $shell> g++ -Wall -DNDEBUG -O3 -std=c++11 -DWITH_OPERATORS -I../.. -o lu_blk_operators_test lu_blk_test.cc $shell> g++ -Wall -DNDEBUG -O3 -std=c++11 -I../.. -o lu_blk_flensblas_test lu_blk_test.cc $shell> ./lu_blk_operators_test Time elapsed: 0.76 Checking results: || A-P*L*U ||_1 = 3.56198e-07 $shell> ./lu_blk_flensblas_test Time elapsed: 0.76 Checking results: || A-P*L*U ||_1 = 3.56198e-07
Block Algorithm with Parallel BLAS
The easiest way to parallelize the \(LU\) factorization is using a multi-threaded BLAS implementation for the blocked \(LU\) factorization. In this case 'easy' means that you don't have to change function lu_blk you just link against a parallel BLAS library. So the multi-threading part is hidden in BLAS.
In the near future we will extend the ulmBLAS tutorial for parallelization...
Parallel Tile Algorithm
Having a parallel computer architecture we hope that using 2 CPU cores for the \(LU\) factorization cuts the wall-time in half, i.e. gives a speedup factor of 2. And of course, with 3 cores we hope for a speedup factor of 3, with 4 cores a speedup factor or 4 etc. This means we hope for a perfect scalability of the problem. However, this usually requires that for \(k\) CPU cores we can work on \(k\) disjoint sub-problems with perfect load balancing. The later means each sub-problem has the same complexity and the same memory bandwidth can be used for solving each of these sub-problems.
For BLAS Level 3 functions this can be achieved in certain cases. For example, in the matrix product \(C = A\cdot B\) each thread computes a different block of \(C\). However, a good speedup factor requires that the computation done by a thread compensates the induced overhead, e.g. scheduling threads or copying buffered results.
The PLASMA library tries to achieve a much higher granularity of sub-problems by considering parallelization already in the algorithms. These algorithms basically push forward ideas from blocked algorithms and are known as tiled algorithms.
Example: Tasks for Factorizing a 2x2 Tile/Block Matrix
We again consider the \(LU\) factorization of a \(2 \times 2\) blocked matrix:
\[A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \\ \end{pmatrix}\]In the blocked algorithm we identified a sequence of necessary tasks that were done in the certain fixed order. In this case the total sequence of tasks done were:
-
Compute \(A_11 = P_1 L_{11} U_{11}\)
-
Compute \(A_{12} \leftarrow P_1^T A_{12}\)
-
Update the permutation: \(P \big|_1 \leftarrow P_1\)
-
Compute \(A_{12} \leftarrow A_{12} U_{11}^{-1}\)
-
Compute \(A_{21} \leftarrow L_{11}^{-1} A_{21}\)
-
Compute \(A_{22} \leftarrow A_{22} - A_{21} A_{12}\)
-
Compute \(A_{22} = P_2 L_{22} U_{22}\)
-
Compute \(A_{21} \leftarrow P_2^T A_{21}\)
-
Update the permutation: \(P \big|_2 \leftarrow P_2 + \text{bs}\)
Obviously these tasks can not be done in an arbitrary order. And therefore some tasks can not be done in parallel. However, analyzing the dependencies will show what tasks could be done in parallel.
Task Dependencies
We illustrated the dependencies through a directed graph. Hereby \(T_1 \rightarrow T_2\) denotes that task \(T_1\) must be completed before task \(T_2\) can be started.
So once task \(A_11 = P_1 L_{11} U_{11}\) is done we could
-
Compute \(A_{12} \leftarrow P_1^T A_{12}\) by one thread and
-
\(A_{12} \leftarrow A_{12} U_{11}^{-1}\) by another thread.
Considering more tiles shows that more and more independent task can be identified.
Example: Tasks for Factorizing a 3x3 Tile/Block Matrix
Listing all operations needed for a \(3 \times 3\) tile matrix is already lengthy.
\[A = \begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \\ \end{pmatrix}\]Gladly we will develop in this section a tool that automatically generates the task dependencies:
Example: Tasks for Factorizing a 10x10 Tile/Block Matrix
Just for fun (and because we have the tool) we also have a graph generated for a 10x10 tile matrix.
Converting Matrices to Tile Matrices
#define LU_TILE_H 1
#include <flens/flens.h>
namespace flens {
template <typename MA, typename IndexType, typename VB>
void
copyToTiles(IndexType bs, const GeMatrix<MA> &A, DenseVector<VB> &b)
{
typedef typename GeMatrix<MA>::View TileView;
IndexType m = A.numRows();
IndexType n = A.numCols();
Underscore<IndexType> _;
IndexType ib = 0;
for (IndexType j=1; j<=n; j+=bs) {
IndexType nb = std::min(bs, n-j+1);
for (IndexType i=1; i<=m; i+=bs) {
IndexType mb = std::min(bs, m-i+1);
TileView B = TileView(mb, nb, b(_(ib+1,ib+bs*bs)), bs);
B = A(_(i,i+mb-1),_(j,j+nb-1));
ib += bs*bs;
}
}
}
template <typename MA, typename IndexType, typename VB>
void
copyFromTiles(IndexType bs, const DenseVector<VB> &b, GeMatrix<MA> &A)
{
typedef typename GeMatrix<MA>::ConstView TileView;
IndexType m = A.numRows();
IndexType n = A.numCols();
Underscore<IndexType> _;
IndexType ib = 0;
for (IndexType j=1; j<=n; j+=bs) {
IndexType nb = std::min(bs, n-j+1);
for (IndexType i=1; i<=m; i+=bs) {
IndexType mb = std::min(bs, m-i+1);
TileView B = TileView(mb, nb, b(_(ib+1,ib+bs*bs)), bs);
A(_(i,i+mb-1),_(j,j+nb-1)) = B;
ib += bs*bs;
}
}
}
template <typename GeMatrixType>
class TiledCopy
: public GeneralMatrix<GeMatrixType>
{
public:
typedef typename GeMatrixType::View TileView;
typedef typename GeMatrixType::ElementType ElementType;
typedef typename GeMatrixType::IndexType IndexType;
TiledCopy(const GeMatrixType &A, int bs_)
: numRows(A.numRows()), numCols(A.numCols()),
bs(bs_), mr(numRows % bs), nr(numCols % bs),
tileRows((numRows+bs-1) / bs),
tileCols((numCols+bs-1) / bs)
{
tiles.resize(bs*bs*tileRows*tileCols);
copyToTiles(bs, A, tiles);
}
int
numTileRows() const
{
return tileRows;
}
int
numTileCols() const
{
return tileCols;
}
int
blockSize() const
{
return bs;
}
TileView
operator()(int i, int j)
{
int mb = (i<tileRows || mr==0) ? bs : mr;
int nb = (j<tileCols || nr==0) ? bs : nr;
int ib = tileRows*bs*bs*(j-1) + bs*bs*(i-1);
return TileView(mb, nb, tiles(Range<int>(ib+1,ib+bs*bs)), bs);
}
void
untile(GeMatrixType &A) const
{
copyFromTiles(bs, tiles, A);
}
private:
int numRows, numCols;
int bs, mr, nr;
int tileRows, tileCols;
typename GeMatrixType::Vector tiles;
};
} // namespace flens
#endif // LU_TILE_H
Implementation of lu_tiled
For the implementation of the tile algorithm we need a scheduler. This scheduler can be fed with tasks (i.e. functions) that have dependency constraints, e.g. task A must be completed before task B can be started. Tasks can be done in parallel if constraints permit and if enough threads are available.
A task is just a function pointer. Using C++11 lambdas makes is a convenient way of creating such functions. Furthermore it is required to define unique keys for each task.
We will first look at the implementation of the tile algorithm and discuss the design of the scheduler and key functions later:
#define LU_LU_TILED_MT_H 1
#include <cxxstd/algorithm.h>
#include <cxxstd/cmath.h>
#include <cstdio>
#include <vector>
#include <flens/flens.h>
#include <flens/examples/lu/apply_perm_inv.h>
#include <flens/examples/lu/lu_blk_with_operators.h>
#include <flens/examples/lu/lu_tiled_keys.h>
#include <flens/examples/lu/scheduler.h>
#include <flens/examples/lu/tile.h>
namespace flens {
template <typename MA, typename VP>
typename DenseVector<VP>::IndexType
lu_tiled(Scheduler &scheduler, TiledCopy<MA> &A, DenseVector<VP> &p)
{
typedef typename TiledCopy<MA>::ElementType T;
typedef typename TiledCopy<MA>::IndexType IndexType;
typedef Scheduler::Task Task;
typedef Scheduler::Key Key;
const IndexType m = A.numTileRows();
const IndexType n = A.numTileCols();
const IndexType bs = A.blockSize();
const IndexType mn = std::min(m, n);
const T One(1);
const Underscore<IndexType> _;
Task task;
//scheduler.pause();
std::vector<std::string> gather;
for (IndexType i=1; i<=mn; ++i) {
auto A_ii = A(i,i);
auto p_ = p(_(bs*(i-1)+1,bs*(i-1)+A_ii.numRows()));
// Factorize A(i,i)
task = [=, &scheduler]() mutable
{
auto info = lu_blk(A_ii, p_);
if (info!=0) {
scheduler.abort(info+bs*(i-1));
}
};
scheduler.add(keyLU(i), task, gather);
gather.clear();
// Apply permutation to i-th tile row
for (IndexType j=1; j<=n; ++j) {
if (i!=j) {
auto A_ij = A(i,j);
task = [=]() mutable
{
apply_perm_inv(p_, A_ij);
};
scheduler.add(keyPtA(i,j), task, { keyLU(i) });
gather.push_back(keyPtA(i,j));
// dependency for task: "Update Permutation Vector p"
scheduler.addArc(keyPtA(i,j), keyUpdateP(i, bs));
}
}
// Update Permutation Vector p
task = [=]() mutable
{
p_ += bs*(i-1);
};
scheduler.add(keyUpdateP(i, bs), task);
gather.push_back(keyUpdateP(i, bs));
// Apply U_ii^{-1} to i-th tile column
for (IndexType k=i+1; k<=m; ++k) {
auto A_ki = A(k,i);
task = [=]() mutable
{
auto U_ii = A_ii.upper();
blas::sm(Right, NoTrans, One, U_ii, A_ki);
};
scheduler.add(keyApplyU(k,i), task, { keyLU(i) });
gather.push_back(keyApplyU(k,i));
}
// Apply L_ii^{-1} to i-th tile row
for (IndexType l=i+1; l<=n; ++l) {
auto A_il = A(i,l);
task = [=]() mutable
{
auto L_ii = A_ii.lowerUnit();
blas::sm(Left, NoTrans, One, L_ii, A_il);
};
scheduler.add(keyApplyL(i,l), task, { keyPtA(i,l) });
gather.push_back(keyApplyL(i,l));
}
// Update remaining blocks
for (IndexType k=i+1; k<=m; ++k) {
const auto A_ki = A(k,i);
for (IndexType l=i+1; l<=n; ++l) {
const auto A_il = A(i,l);
auto A_kl = A(k,l);
task = [=]() mutable
{
A_kl -= A_ki * A_il;
};
scheduler.add(keyUpdateA(k,l,i), task, { keyApplyU(k,i),
keyApplyL(i,l) });
gather.push_back(keyUpdateA(k,l,i));
}
}
}
//scheduler.dumpGraph("graph.dot");
//scheduler.resume();
return scheduler.join();
}
} // namespace flens
#endif // LU_LU_TILED_MT_H
Task Scheduler and Thread Pool
Class Scheduler provides an abstract interface a thread pool and scheduler:
#define LU_SCHEDULER_H 1
#include <condition_variable>
#include <list>
#include <unordered_map>
#include <mutex>
#include <thread>
#include <vector>
#include <flens/examples/lu/format.h>
namespace flens {
class Scheduler
{
public:
typedef long Key;
typedef std::function<void()> Task;
Scheduler(int numThreads);
void
pause();
void
resume();
void
add(std::string keyStr, Task task, std::vector<std::string> pre= {});
void
addArc(std::string preStr, std::string succStr);
void
dumpGraph(const char *filename);
int
join();
void
abort(int errorCode);
private:
std::pair<Task, Key>
getTask();
void
runThread();
void
taskDone(Key key);
Key
getKey(std::string keyStr);
int numThreads;
int numThreadsBusy;
int errorCode;
bool stop;
std::list<Key> ready;
std::unordered_map<std::string, Key> keyString;
std::unordered_map<Key, std::string> tag;
std::unordered_map<Key, Task> scheduled;
std::unordered_map<Key, long> predecessorCount;
std::unordered_map<Key, std::list<Key>> successor;
std::condition_variable cv_scheduler;
std::condition_variable cv_thread;
std::mutex mtx;
};
} // namespace flens
#endif // LU_SCHEDULER_H
Internally task are identified by an integer.
And a task is just a function pointer.
The scheduler and thread pool get initialized when the application gets started, e.g. in main(). Threads will be spawned and initialized. They will immediately wait on a condition variable until a task is ready.
For illustration purposes it is nice to identify task by a string. In our example these strings happen to be Latex code. So we are able to dump the scheduled directed graph and generate nice pictures.
Adding a task identified by keyStr can depend on a list of other tasks (pre) that must have been completed before it can be started.
add(std::string keyStr, Task task, std::vector<std::string> pre= {});
In some cases it is convenient to add only some dependencies, i.e. add an arc to the directed graph.
Note that if you want to add 'pre must be done before succ gets started' it is necessary that succ has not been scheduled already. Otherwise task succ could already be running.
Again tasks pre and succ are identified by string.
addArc(std::string preStr, std::string succStr);
Once all tasks are scheduled we wait until all tasks are completed.
join();
If a task calls abort then no new tasks will be stared. join will wait until all running tasks are completed and returns errorCode.
abort(int errorCode);
The implementation uses typical C++11 features for dealing with threads, locks and condition variables. The directed graph is realized using containers from the C++ standard library:
#include <cassert>
#include <cstdlib>
#include <fstream>
namespace flens {
Scheduler::Scheduler(int numThreads_)
: numThreads(numThreads_), numThreadsBusy(0), errorCode(0), stop(false)
{
std::unique_lock<std::mutex> lock(mtx);
for (int i=0; i<numThreads; ++i) {
auto t = std::thread( [=] { this->runThread(); } );
t.detach();
}
}
void
Scheduler::pause()
{
std::unique_lock<std::mutex> lock(mtx);
stop = true;
}
void
Scheduler::resume()
{
bool jobReady = false;
{
std::unique_lock<std::mutex> lock(mtx);
stop = false;
jobReady = (ready.size()>0);
}
if (jobReady) {
cv_thread.notify_one();
}
}
void
Scheduler::add(std::string keyStr, Task task, std::vector<std::string> pre)
{
bool jobReady = false;
{
std::unique_lock<std::mutex> lock(mtx);
Key key = getKey(keyStr);
while (ready.size()>15) {
cv_scheduler.wait(lock);
}
#ifndef NDEBUG
if (scheduled.count(key)!=0) {
std::cerr << "Already scheduled: " << keyStr << std::endl;
assert(0);
}
#endif
scheduled[key] = task;
for (auto p_ : pre) {
auto p = getKey(p_);
if (scheduled.count(p)>0) {
successor[p].push_back(key);
++predecessorCount[key];
} else {
assert(scheduled.count(p)==0);
}
}
if (predecessorCount[key]==0) {
if (successor[key].size()>3) {
ready.push_front(key);
} else {
ready.push_back(key);
}
}
jobReady = (ready.size()>0);
}
if (jobReady) {
cv_thread.notify_one();
}
}
void
Scheduler::addArc(std::string preStr, std::string succStr)
{
std::unique_lock<std::mutex> lock(mtx);
Key succ = getKey(succStr);
Key pre = getKey(preStr);
// Dependencies for 'succ' must not be added if 'succ' is
// already scheduled.
if (scheduled.count(pre)==0) {
return;
}
#ifndef NDEBUG
if (scheduled.count(succ)!=0) {
std::cerr << "Already scheduled and possibly running: "
<< succStr << std::endl;
std::cerr << preStr << " -> " << succStr << std::endl;
assert(0);
}
#endif
successor[pre].push_back(succ);
++predecessorCount[succ];
}
void
Scheduler::abort(int errorCode_)
{
assert(errorCode_!=0);
{
std::unique_lock<std::mutex> lock(mtx);
std::cerr << "abort: error code = " << errorCode_ << std::endl;
errorCode = errorCode_;
scheduled.clear();
}
cv_scheduler.notify_one();
}
void
Scheduler::dumpGraph(const char *filename)
{
std::unique_lock<std::mutex> lock(mtx);
std::ofstream out(filename);
out << "digraph G {\nnode [shape=\"none\"];" << std::endl;
for (auto t : tag) {
out << t.first << "[label=\"" << t.second << "\"]"
<< std::endl;
}
for (auto s : successor) {
for (auto succ : s.second) {
out << " \"" << s.first
<< "\" -> \"" << succ
<< "\";" << std::endl;
}
}
out << "}" << std::endl;
out.close();
}
int
Scheduler::join()
{
std::unique_lock<std::mutex> lock(mtx);
while (scheduled.size()+numThreadsBusy>0) {
if (ready.size()+numThreadsBusy==0) {
std::cerr << "Error: Task queue not empty but no job ready"
" or running." << std::endl;
std::exit(666);
}
cv_scheduler.wait(lock);
}
// clean up
ready.clear();
keyString.clear();
tag.clear();
scheduled.clear();
predecessorCount.clear();
successor.clear();
// restore error code and return old value
int errorCode_ = errorCode;
errorCode = 0;
return errorCode_;
}
std::pair<Scheduler::Task, Scheduler::Key>
Scheduler::getTask()
{
bool jobReady = false;
Key key;
Task task;
{
std::unique_lock<std::mutex> lock(mtx);
while (stop || ready.size()==0) {
cv_thread.wait(lock);
}
key = ready.front();
task = scheduled[key];
ready.pop_front();
++numThreadsBusy;
if (ready.size()>0) {
jobReady = true;
}
}
if (jobReady) {
cv_thread.notify_one();
}
return std::pair<Task, Key>(task,key);
}
void
Scheduler::runThread()
{
while(1) {
std::pair<Task,Key> job = getTask();
job.first();
taskDone(job.second);
}
}
void
Scheduler::taskDone(Key key)
{
bool allDone = false;
{
std::unique_lock<std::mutex> lock(mtx);
if (errorCode!=0) {
allDone= true;
} else {
for (auto succ : successor[key]) {
--predecessorCount[succ];
if (scheduled.count(succ)>0) {
if (predecessorCount[succ]==0) {
predecessorCount.erase(succ);
if (successor[succ].size()>3) {
ready.push_front(succ);
} else {
ready.push_back(succ);
}
}
}
}
scheduled.erase(key);
successor.erase(key);
}
--numThreadsBusy;
if (scheduled.size()==0 && numThreadsBusy==0) {
allDone = true;
}
if (ready.size()==10) {
allDone = true;
}
}
if (allDone) {
cv_scheduler.notify_one();
}
}
Scheduler::Key
Scheduler::getKey(std::string keyStr)
{
if (keyString.count(keyStr)==0) {
keyString[keyStr] = keyString.size();
}
tag[keyString[keyStr]] = keyStr;
return keyString[keyStr];
}
} // namespace flens
Auxiliary Functions for Task Keys
For creating task identifiers we simply create strings that contain some Latex code describing what the task is supposed to do. For convenience we define a function such that we can use the beloved format strings from C:
#define LU_FORMAT_H 1
#include <memory>
#include <iostream>
#include <string>
#include <cstdio>
namespace flens {
template<typename ... Args>
std::string
format(std::string fmt, Args ... args)
{
size_t size = 1 + snprintf(nullptr, 0, fmt.c_str(), args ...);
std::unique_ptr<char[]> buf(new char[size]);
snprintf(buf.get(), size, fmt.c_str(), args ...);
return std::string(buf.get(), buf.get() + size-1);
}
} // namespace flens
#endif
Now we simply can define a bunch of functions for generating different keys:
#define LU_LU_TILED_KEYS_H 1
#include <flens/examples/lu/format.h>
namespace flens {
// Some auxiliary function for generating nice task keys
auto keyLU = [=](int i)
{
return format("LU(A_{%d,%d})", i, i);
};
auto keyPtA = [=](int i, int j)
{
return format("A_{%d,%d} \\leftarrow "
"P_%d^T A_{%d,%d}",
i, j, i, i, j);
};
auto keyUpdateP = [=](int i, int bs)
{
return format("P\\big|_%d \\leftarrow P_%d + %d",
i, i, (i-1)*bs);
};
auto keyApplyU = [=](int i, int j)
{
return format("A_{%d,%d} \\leftarrow "
"A_{%d,%d} U_{%d,%d}^{-1}",
i, j, i, j, i, i);
};
auto keyApplyL = [=](int i, int j)
{
return format("A_{%d,%d} \\leftarrow "
"L_{%d,%d}^{-1} A_{%d,%d}",
i, j, i, i, i, j);
};
auto keyUpdateA = [=](int i, int j, int k)
{
return format("A_{%d,%d} \\leftarrow "
"A_{%d,%d} - A_{%d,%d} A_{%d,%d}",
i, j, i, j, i, k, k, j);
};
} // namespace flens
#endif // LU_LU_TILED_MT_H
Code for Testing
The test code first initializes the scheduler. Copies and converts the matrix into tile format and calls the implementation of the tile algorithm. Once the factorization is completed the tiled matrix is copied and converted back.
#include <iostream>
#include <flens/examples/lu/apply_perm.h>
#include <flens/examples/lu/check_lu.h>
#include <flens/examples/lu/lu_tiled_mt.h>
#include <flens/examples/lu/scheduler.h>
#include <flens/examples/lu/timer.h>
using namespace flens;
using namespace std;
int
main()
{
typedef GeMatrix<FullStorage<double> > DGeMatrix;
typedef TiledCopy<DGeMatrix> DTiledMatrix;
typedef DenseVector<Array<double> > DDenseVector;
typedef DenseVector<Array<int> > IDenseVector;
Scheduler scheduler(2);
const int bs = 256;
const int m = 4*bs;
const int n = 4*bs;
DGeMatrix A(m,n), A_org;
IDenseVector p(m);
Underscore<int> _;
// Init A and store a copy for checking the factorization later
fillRandom(A);
A_org = A;
double t0, t1;
// Create a new matrix and initialize it with the tiled format
t0 = ATL_walltime();
DTiledMatrix A_(A, bs);
t1 = ATL_walltime();
// Call the tile algorithm
if (lu_tiled(scheduler, A_, p) != 0) {
cout << "Matrix is singular or close to singular" << endl;
return 1;
}
t1 = ATL_walltime()-t1;
// Copy back the tiled matrix
A_.untile(A);
t0 = ATL_walltime()-t0;
cout << "total time elapsed: " << t0 << endl;
cout << "time elapsed for 'lu_tiled': " << t1 << endl;
cout << "time elapsed for conversion: " << t0-t1 << endl;
// Check result
cout << "Checking results:" << endl;
cout << "|| A-P*L*U ||_1 = " << check_LU(A_org, A, p) << endl;
return 0;
}
Compile and Run
$shell> cd flens/examples $shell> g++ -O3 -Wall -std=c++11 -I../.. -c lu/scheduler.cc $shell> g++ -O3 -Wall -std=c++11 -I../.. scheduler.o lu_tiled_test.cc $shell> ./a.out total time elapsed: 0.0999999 time elapsed for 'lu_tiled': 0.0699999 time elapsed for conversion: 0.03 Checking results: || A-P*L*U ||_1 = 4.7791e-09