Unblocked LU factorization: Another variant


In the lecture we introduced algorithms for an unblocked LU factorization by considering the following partitioning of matrix \(A\):

\[A =\left(\begin{array}{l|l|l} A_{0,0} & a_{0,j} & A_{0,j+1} \\ \hline a_{j,0}^T & a_{j,j} & a_{j,j+1}^T \\ \hline A_{j+1,0} & a_{j+1,j} & A_{j+1,j+1} \end{array}\right)\]

Variant 1

In session 13 we ended up with the following algorithm:

  • for \(j=0, \dots, \min\{m,n\}-1\)

    • Find pivot \(p_j\) in \(\left(\begin{array}{c} a_{j,j} \\ a_{j+1,j} \end{array}\right)\)

      • if \(j \neq p_j\) interchange in \(A\) row \(j\) and row \(p_j\)

    • if element \(A_{j,j}\) is exactly zero return \(j\) (matrix is numerically singular).

    • Scaling: \(a_{j+1,j} \leftarrow \frac{1}{a_{j,j}} a_{j+1,j}\)

    • Rank-1 update: \(A_{j+1,j+1} \leftarrow A_{j+1,j+1} - a_{j+1,j} a_{j,j+1}^T\)

This variant was based on a rank one update.

Variant 2

Another variant can be derived such that we get:

  • for \(j=0, \dots, \min\{m,n\}-1\)

    • Solve: \(a_{0,j} \leftarrow L_{\text{unit}}^{-1}(A_{0,0}) a_{0,j}\)

    • Matrix-vector product \( \left(\begin{array}{c} a_{j,j} \\ a_{j+1,j} \end{array}\right) \leftarrow \left(\begin{array}{c} a_{j,j} \\ a_{j+1,j} \end{array}\right) - \left(\begin{array}{c} a_{j,0}^T \\ A_{j+1,0} \end{array}\right) a_{0,j} \)

    • Find pivot \(p_j\) in \(\left(\begin{array}{c} a_{j,j} \\ a_{j+1,j} \end{array}\right)\)

      • if \(j \neq p_j\) interchange in \(A\) row \(j\) and row \(p_j\)

    • if element \(A_{j,j}\) is exactly zero return \(j\) (matrix is numerically singular).

    • Scaling: \(a_{j+1,j} \leftarrow \frac{1}{a_{j,j}} a_{j+1,j}\)

  • if \(\min\{m,n\} < n\)

    • Solve: \(A_{0,m} \leftarrow L_{\text{unit}}^{-1}(A_{0,0}) A_{0,m}\) where we consider the partitioning case: \( A = \left(\begin{array}{l|l} A_{0,0} & A_{0,m} \\ \end{array}\right)\)

This variant is based on a matrix vector product.

Notes on using the library

The library provides matrix and vector types. Matrix types where extended for triangular (actually trapezoidal) matrices with full storage. Currently we have

With memory allocation and deallocation

Writeable view

Read only view

general matrix




triangular/trapezoidal matrix



dense vector




Creating triangular views from general matrices

Note that triangular matrices can only be created as views from a general matrix. For example

GeMatrix<double>     A(4,8);
// ...

TrMatrixView<double> L = A.view(UpLo::LowerUnit);
TrMatrixView<double> U = A.view(UpLo::Upper);       // upper, non-unit
// ...

or simply

GeMatrix<double>     A(4,8);
// ...

auto L = A.view(UpLo::LowerUnit);
auto U = A.view(UpLo::Upper);       // upper, non-unit
// ...

Using constView instead of view creates a TrMatrixConstView object.

Creating views referencing blocks and subvectors

Assume that A is some general or triangular matrix with full storage. For convenience we provide the following methods for creating matrix and vector views:


Selecting a block starting at A(i,j) with the remaining rows (i.e. A.numRows()-i) and columns (i.e. `A.numCols()-j).


Selecting a block starting at A(0,0) with m rows and n columns.


Selecting the column starting at A(i,j)


Selecting the row starting at A(i,j)

For dense vectors we provide similar methods (and at least currently use the same names for the methods):


Selecting a sub-vector starting at x(i)


Selecting a sub-vector starting at x(0) of length n.

Traits for getting the element type

You will see that in many cases the element type of matrices and vectors is defined through the function signature. But in same cases you still need traits like here:

template <typename VectorX,
          Require< Dense<VectorX> >=true >
foo(VectorX &&x)
    using T = ElementType<VectorX>;             // getting the element type

    // ...

Example for using the triangular solver

Solving \(Lx = b\) where \(L\) is a lower unit triangular matrix can be done with the provided sv function. Note that this function takes \(L\) and \(b\) and then overwrites \(b\) with the solution \(x\).

In case \(L\) is stored in the strict lower triangular part of \(A\) this can be implemented by

sv(A.view(UpLo::LowerUnit), b);

Of course you can create views of views. Like in

sv(A.dim(j,j).view(UpLo::LowerUnit), A.col(0,j).dim(j));

Figure out what happens here before you start with the exercise!


With lu.hpp we provide an implementation of the first variant and a skeleton for the second variant:


#include <cassert>
#include <cstddef>
#include <type_traits>

#include <hpc/matvec/iamax.hpp>
#include <hpc/matvec/rank1.hpp>
#include <hpc/matvec/scal.hpp>
#include <hpc/matvec/sv.hpp>
#include <hpc/matvec/sm.hpp>
#include <hpc/matvec/swap.hpp>

#include <hpc/matvec/densevector.hpp>
#include <hpc/matvec/traits.hpp>
#include <hpc/ulmblas/axpy.hpp>

//#include <hpc/mklblas/mv.hpp>
//#include <hpc/mklblas/sv.hpp>
//#include <hpc/mklblas/mm.hpp>
//#include <hpc/mklblas/sm.hpp>

namespace hpc { namespace matvec {

template <typename MatrixA, typename VectorPiv,
          Require< Ge<MatrixA>,
                 > = true>
lu_unblk_var1(MatrixA &&A, VectorPiv &&piv)
    using T = ElementType<MatrixA>;

    std::size_t m  = A.numRows();
    std::size_t n  = A.numCols();
    std::size_t mn = std::min(m,n);


    for (std::size_t j=0; j<mn; ++j) {
        // pivoting
        auto jp = j + iamax(A.col(j,j));
        if (jp!=j) {
            swap(A.row(j,0), A.row(jp,0));
        piv(j) = jp;
        if (A(j,j)==T(0)) {
            return j;

        // apply gauss
        scal(1/A(j,j), A.col(j+1,j));
        rank1(T(-1), A.col(j+1,j), A.row(j,j+1),
    return -1;

template <typename MatrixA, typename VectorPiv,
          Require< Ge<MatrixA>,
                 > = true>
lu_unblk_var2(MatrixA &&A, VectorPiv &&piv)
    // TODO: Your code
    return -1;

} } // namespace matvec, hpc


Implement the second variant. Use hpc/test/lu_unblk.cpp for benchmarking and testing:

#include <complex>
#include <functional>
#include <random>

#include <printf.hpp>

#include <hpc/matvec.hpp>
#include <hpc/matvec/test/error.hpp>
#include <hpc/matvec/test/rand.hpp>
#include <hpc/matvec/test/walltime.hpp>

#include "lu.hpp"

namespace hpc { namespace matvec {

template <typename T, template<typename> class MatrixA,
          typename lu_func,
          Require< Ge<MatrixA<T>> > = true>
std::pair<double, double>
test_lu(const MatrixA<T> &A0, lu_func lu)
    GeMatrix<double>         A(A0.numRows(), A0.numCols(), Order::ColMajor);
    DenseVector<std::size_t> piv(A.numRows());

    copy(A0, A);
    std::ptrdiff_t res;

    test::WallTime<double> timer;

    res = lu(A, piv);
    double time = timer.toc();
    double err  = test::error_estimate::getrf(A0, A, piv);

    if (res!=-1) {
        fmt::printf("Matrix is (numerically) singular\n");
    return std::pair<double, double>(err, time);

} } // namespace matvec, hpc

#define MIN_M 500
#define MIN_N 2
#define INC_M 0
#define INC_N 1
#define MAX_M 500
#define MAX_N 500

    using namespace hpc::matvec;

    GeMatrix<double> A(MAX_M, MAX_N);


    auto lu1 = lu_unblk_var1<GeMatrix<double> &, DenseVector<std::size_t> &>;
    auto lu2 = lu_unblk_var2<GeMatrix<double> &, DenseVector<std::size_t> &>;

    fmt::printf("%5s %5s "
                "%10s %10s %10s "
                "%10s %10s %10s\n",
                "M", "N",
                "Error 1", "Time 1", "MFLOPS 1",
                "Error 2", "Time 2", "MFLOPS 2");

    for (std::size_t m=MIN_M, n=MIN_N;
         m<=MAX_M && n<=MAX_N;
         m+=INC_M, n+=INC_N)
        double maxMN = std::max(m,n);
        double minMN = std::min(m,n);
        double flops = maxMN*minMN*minMN
                     - ((minMN*minMN*minMN) / 3.0)
                     - (minMN*minMN) / 2.0;
        flops /= 1000000.0;

        auto A0   = A.dim(m, n);
        auto tst1 = test_lu(A0, lu1);
        auto tst2 = test_lu(A0, lu2);

        fmt::printf("%5d %5d "
                    "%10.2e %10.2f %10.2f "
                    "%10.2e %10.2f %10.2f\n",
                    m, n,
                    tst1.first, tst1.second, flops/tst1.second,
                    tst2.first, tst2.second, flops/tst2.second);

Note that by default the benchmark tests matrices with a fixed number of rows and an increasing number of columns. These kind of matrix shapes are relevant when the unblocked LU factorization is used for the blocked factorization.

Compile with

heim$ g++-7.2 -Wall -std=c++17 -I /home/numerik/pub/hpc/ws18/session20/ -O3 -mavx -o test_lu_unblk test_lu_unblk.cpp