================ LU Factorization [TOC:2] ================ The performance of solving a system of linear equations depends on the performance of the matrix-matrix product. What's new: - Triangular solver for $B \leftarrow L^{-1} B$. Currently this is for the special case where $L$ is a lower triangular matrix with unit diagonal. But this can easily be generalized. Note that the performance of the triangular solver merely depends on the performance of the GEMM micro kernel. - Two implementations for the LU factorization: - `lu_blocked` is basically the algorithm of LAPACK function `dgetrf`. - `lu_blocked_recursive` applies the factorization recursively. Both variants exploit the performance of the matrix-matrix product and the triangular solver for matrix equations. Compile and Run Benchmark ========================= ---- SHELL (path=session7,hostname=heim) --------------------------------------- g++-5.3 -Wall -DNDEBUG -mavx -Ofast -std=c++11 -I ../../boost_1_60_0/ -DM_MAX=500 -DHAVE_AVX -DHAVE_GCCVEC bench_lu.cc ./a.out -------------------------------------------------------------------------------- Benchmark (Single Threaded) =========================== ---- SHELL (path=session7,hostname=heim) --------------------------------------- g++-5.3 -Wall -DNDEBUG -mavx -Ofast -std=c++11 -I ../../boost_1_60_0/ -DHAVE_AVX -DHAVE_GCCVEC bench_lu.cc ./a.out > report.session7.lu gnuplot plot.session7.lu gnuplot plot.session7.lu.log gnuplot plot.session7.lu.mflops -------------------------------------------------------------------------------- Time (Effectiveness) -------------------- For the LU-factorization we are primarily interested how long it takes to get the factorization: ---- IMAGE ---------------------------- session7/bench.session7.lu.svg --------------------------------------- Using a log-sclaing of both axis shows, that the complexity is still of order $N^3$: ---- IMAGE ---------------------------- session7/bench.session7.lu.log.svg --------------------------------------- MFLOPS (Efficiency) ------------------- We are also interested how fast each algorithm reaches its peak performance: ---- IMAGE ---------------------------- session7/bench.session7.lu.mflops.svg --------------------------------------- Benchmark (Multi Threaded) ========================== Test with 2 threads: -------------------- ---- SHELL (path=session7,hostname=heim) --------------------------------------- g++-5.3 -fopenmp -Wall -DNDEBUG -mavx -Ofast -std=c++11 -I ../../boost_1_60_0/ -DHAVE_AVX -DHAVE_GCCVEC bench_lu.cc export OMP_NUM_THREADS=2; ./a.out > report.session7-mt2.lu -------------------------------------------------------------------------------- Test with 4 threads: -------------------- ---- SHELL (path=session7,hostname=heim) --------------------------------------- g++-5.3 -fopenmp -Wall -DNDEBUG -mavx -Ofast -std=c++11 -I ../../boost_1_60_0/ -DHAVE_AVX -DHAVE_GCCVEC bench_lu.cc export OMP_NUM_THREADS=4; ./a.out > report.session7-mt4.lu gnuplot plot.session7-mt.lu gnuplot plot.session7-mt.lu.log gnuplot plot.session7-mt.lu.mflops -------------------------------------------------------------------------------- Time (Effectiveness) -------------------- ---- IMAGE ---------------------------- session7/bench.session7-mt.lu.svg --------------------------------------- ---- IMAGE ---------------------------- session7/bench.session7-mt.lu.log.svg --------------------------------------- MFLOPS (Efficiency) ------------------- ---- IMAGE ---------------------------- session7/bench.session7.lu-mt.mflops.svg --------------------------------------- Source Code =========== The tar-ball __session7.tgz__ contains the files: ---- SHELL --------------------------------------------------------------------- tar tfvz session7.tgz -------------------------------------------------------------------------------- `lu.hpp` -------- :import: session7/lu.hpp `gemm.hpp` ---------- :import: session7/gemm.hpp `bench_lu.cc` ------------- :import: session7/bench_lu.cc :links: session7.tgz -> http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session7.tgz :navigate: back -> doc:session6/page01