============================================
Using fused AXPY and DOT Operations for TRSV                               [TOC]
============================================

In the previous benchmark, the LU variant based on GEMV/TRSV turned out to be
the more efficient one. This is mainly because the GEMV implementation uses
fused AXPY and fused DOT operations. We will now also use the fused operations
for TRSV and hope to achieve a further improvement in performance.

---- SHELL (path=session14, hide) ----------------------------------------------
rm -rf getrf2
mkdir getrf2
cd getrf2
cp /home/numerik/pub/hpc/ss18/ulmblas/*.[hc] .
cp /home/numerik/pub/hpc/ss18/ulmblas/session14b/ulmblas.* .
cp ../getrf.plot2 .
cp ../getrf/bench* .
--------------------------------------------------------------------------------

TRSV: Optimized for col major case
==================================

The TRSV operation can be optimized using fused AXPY or fused DOT operations.
In `ulmblas.c` (found in `/home/numerik/pub/hpc/ss18/ulmblas/session14b/`)
this kind of optimization has been applied to the col major case. Further
modifications:

 - Function `daxpyf` was added. This function performs the fused AXPY
   operations. Its purpose is to make this operation reusable.

 - Function `dgemv_axpyf` now calls `daxpyf`. Previously, the fused AXPY
   operations were hard coded in `dgemv_axpyf`.

 - For lower triangular col major matrices, function `dtrsv` also calls
   `daxpyf` (and therefore reuses the implementation of the fused AXPY
   operation).

:import: session14/getrf2/ulmblas.c [fold]

Exercise
========

 - Try to express the underlying algorithm in TRSV for the case of a lower
   triangular matrix that is stored col major.

 - Run benchmarks to evaluate and verify the performance (see below).

Test and Benchmark
==================

---- SHELL (path=session14, hide) ----------------------------------------------
rm -rf getrf2
mkdir getrf2
cd getrf2
cp /home/numerik/pub/hpc/ss18/ulmblas/*.[hc] .
cp /home/numerik/pub/hpc/ss18/ulmblas/session14b/ulmblas.* .
cp ../getrf.plot2 .
cp ../getrf/bench* .
--------------------------------------------------------------------------------

 - Create executable and check results:

   ---- SHELL (path=session14/getrf2, hostname=heim,fold) ---------------------
   gcc -Wall -std=c11 -I. -O3 -o test_dgetrf_gemv_fused -DGETRF=GETRF_GEMV    +++
       test_dgetrf.c ulmaux.c ulmblas.c
   ./test_dgetrf_gemv_fused check
   ----------------------------------------------------------------------------

 - Benchmark:

   ---- SHELL (path=session14/getrf2, hostname=heim,fold) ---------------------
   ./test_dgetrf_gemv_fused bench > bench.dgetrf_gemv_fused
   cat bench.dgetrf_gemv_fused
   ----------------------------------------------------------------------------

   You can use the Gnuplot script

   :import: session14/getrf.plot2

   for plotting the performance:

   ---- SHELL(path=session14/getrf2,hostname=heim) ----------------------------
   gnuplot getrf.plot2
   ----------------------------------------------------------------------------

   This gives

   ---- IMAGE -----------------------
   session14/getrf2/bench.getrf.svg
   ----------------------------------
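The following is one possible way to express the algorithm asked for in the exercise. It is a sketch and not necessarily the exact formulation used in `ulmblas.c`. Partition the lower triangular col major matrix $L$ after its first $k$ columns, where $k$ is the fuse factor. Then $L_{11}$ is a $k \times k$ lower triangular block, and $\ell_l$ denotes the $l$-th column of $L_{21}$:

```latex
\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
\quad\Longleftrightarrow\quad
\left\{
\begin{aligned}
  L_{11}\, x_1 &= b_1, \\
  L_{22}\, x_2 &= b_2 - L_{21}\, x_1
               = b_2 - \sum_{l=1}^{k} x_{1,l}\, \ell_l .
\end{aligned}
\right.
```

Computing $b_2 - L_{21}\, x_1$ takes $k$ AXPY operations on the same vector $b_2$. A fused AXPY performs all of them in a single pass over $b_2$. The same partitioning is then applied again to the remaining system $L_{22}\, x_2 = b_2 - L_{21}\, x_1$.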
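For the exercise on TRSV with a lower triangular col major matrix, here is a minimal C sketch of how the solve can be built on top of a fused AXPY. It makes a few assumptions:

 - vectors have increment 1;
 - the fuse factor is 4;
 - the names `daxpyf_sketch`, `dtrsv_lower_colmajor_sketch` and `DAXPYF_FUSE` are hypothetical and do not reproduce the actual `daxpyf`/`dtrsv` signatures in `ulmblas.c`.

Each iteration handles a block of at most 4 columns. It first runs a small forward substitution on the diagonal block. It then updates all remaining components of the right-hand side with one fused AXPY.

```c
#include <assert.h>
#include <stddef.h>

/* Fuse factor: an assumed value for this sketch. */
enum { DAXPYF_FUSE = 4 };

/*
 * Hypothetical fused AXPY (not the signature used in ulmblas.c):
 *
 *   y <- y + alpha * (a[0]*A(:,0) + ... + a[k-1]*A(:,k-1))
 *
 * A is an m x k col major matrix with leading dimension ldA, k <= DAXPYF_FUSE.
 * Each y[i] is loaded and stored once instead of k times.
 */
static void
daxpyf_sketch(size_t m, size_t k, double alpha, const double *a,
              const double *A, size_t ldA, double *y)
{
    assert(k <= DAXPYF_FUSE);

    double s[DAXPYF_FUSE] = {0};
    for (size_t l = 0; l < k; ++l) {
        s[l] = alpha * a[l];
    }
    for (size_t i = 0; i < m; ++i) {
        double yi = y[i];
        for (size_t l = 0; l < k; ++l) {
            yi += s[l] * A[i + l*ldA];
        }
        y[i] = yi;
    }
}

/*
 * Hypothetical blocked forward substitution: solves L*x = b in place
 * (x contains b on entry) for an n x n lower triangular col major matrix L
 * with leading dimension ldL. If unitDiag != 0 the diagonal is taken as 1.
 */
static void
dtrsv_lower_colmajor_sketch(size_t n, int unitDiag,
                            const double *L, size_t ldL, double *x)
{
    for (size_t j = 0; j < n; j += DAXPYF_FUSE) {
        size_t k = (n - j < DAXPYF_FUSE) ? n - j : DAXPYF_FUSE;
        const double *L11 = &L[j + j*ldL];

        /* Solve L11*x1 = b1 by column oriented forward substitution. */
        for (size_t l = 0; l < k; ++l) {
            if (!unitDiag) {
                x[j+l] /= L11[l + l*ldL];
            }
            for (size_t i = l+1; i < k; ++i) {
                x[j+i] -= L11[i + l*ldL] * x[j+l];
            }
        }

        /* b2 <- b2 - L21*x1: k AXPYs performed as one fused AXPY. */
        daxpyf_sketch(n - j - k, k, -1.0, &x[j],
                      &L[(j+k) + j*ldL], ldL, &x[j+k]);
    }
}
```

The gain comes from the update step. With k separate AXPYs, the trailing part of `x` would be loaded and stored once per column. With the fused version, it is loaded and stored only once per block of k columns.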
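The TRSV section notes that TRSV can also be optimized with fused DOT operations. That variant suits a lower triangular matrix stored row major, where the rows of $L$ are contiguous in memory. The sketch below rests on these assumptions:

 - `ddotf_sketch`, `dtrsv_lower_rowmajor_sketch`, `DDOTF_FUSE` and the fuse factor 4 are hypothetical;
 - vectors have increment 1;
 - the text above does not state that `ulmblas.c` implements this variant.

For each block of rows, the contributions of the already computed components of `x` are subtracted with one fused DOT. Then the small diagonal block is solved.

```c
#include <assert.h>
#include <stddef.h>

/* Fuse factor: an assumed value for this sketch. */
enum { DDOTF_FUSE = 4 };

/*
 * Hypothetical fused DOT: r[l] = A(l,0)*x[0] + ... + A(l,n-1)*x[n-1]
 * for k <= DDOTF_FUSE rows of a row major matrix A with leading dimension ldA.
 * Each x[i] is loaded once and used for all k dot products.
 */
static void
ddotf_sketch(size_t n, size_t k, const double *A, size_t ldA,
             const double *x, double *r)
{
    assert(k <= DDOTF_FUSE);

    double acc[DDOTF_FUSE] = {0};
    for (size_t i = 0; i < n; ++i) {
        double xi = x[i];
        for (size_t l = 0; l < k; ++l) {
            acc[l] += A[l*ldA + i] * xi;
        }
    }
    for (size_t l = 0; l < k; ++l) {
        r[l] = acc[l];
    }
}

/*
 * Hypothetical blocked forward substitution for an n x n lower triangular
 * row major matrix L (leading dimension ldL); x contains b on entry.
 * If unitDiag != 0 the diagonal is taken as 1.
 */
static void
dtrsv_lower_rowmajor_sketch(size_t n, int unitDiag,
                            const double *L, size_t ldL, double *x)
{
    for (size_t j = 0; j < n; j += DDOTF_FUSE) {
        size_t k = (n - j < DDOTF_FUSE) ? n - j : DDOTF_FUSE;
        double r[DDOTF_FUSE];

        /* b1 <- b1 - L10*x0: k dot products performed as one fused DOT. */
        ddotf_sketch(j, k, &L[j*ldL], ldL, x, r);
        for (size_t l = 0; l < k; ++l) {
            x[j+l] -= r[l];
        }

        /* Solve L11*x1 = b1 by row oriented forward substitution. */
        const double *L11 = &L[j*ldL + j];
        for (size_t i = 0; i < k; ++i) {
            for (size_t l = 0; l < i; ++l) {
                x[j+i] -= L11[i*ldL + l] * x[j+l];
            }
            if (!unitDiag) {
                x[j+i] /= L11[i*ldL + i];
            }
        }
    }
}
```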