Content |
Matrix-Vektor Produkt TRMV
Bei der BLAS Level 2 Operation TRMV (triangular matrix vector product) wird eine im Full-Storage-Format gespeicherte Matrix als obere oder untere Dreieicksmatrix interpretiert. Zusätzlich kann angegeben werden, ob bei der Operation angenommen werden soll, dass die Elemente auf der Diagonale alle Eins sind. Das ist eine reine Vereinbarung:
-
Soll die Matrix als untere Dreiecksmatrix interpretiert werden, dann müssen in der oberen Dreiecksmatrix keine Nullen stehen. Die dort liegenden Werte werden einfach ignoriert. Bei gewissen LAPACK Funktionen wird die obere Dreiecksmatrix auch mit Zwischenergebnissen überschrieben.
-
Soll angenommen werden, dass auf der Diagonale Einsen stehen, dann gilt gleiches. Auch hier wird nicht vorausgesetzt, dass im Speicher wirklich Einsen stehen.
-
Analoges gilt für obere Dreiecksmatrizen.
Es geht hier also nicht darum eine Dreiecksmatrix möglichst kompakt zu speichern (wie beispielsweise beim sogenanten Packed-Storage-Format). Der Vorteil ist ein schneller Zugriff auf Elemente und Matrix-Blöcke. Also insbesondere auch auf Spalten und Zeilen.
Wir werden hier nur die TRMV-Operation für den Fall einer unteren Dreiecksmatrix realisieren (der andere Fall geht dann analog). In der Prozedur trlmv (triangular lower matrix vector product) wird die Operation
\[x \longleftarrow L(A) x\quad\text{oder}\quad x \longleftarrow L_1(A) x\quad\text{mit}\quad L(A), L_1(A) \in \mathbb{M}^{n \times n}\]durchgeführt. Durch einen zusätzlichen Parameter kann angegeben werden, dass Diagonalelemente als Eins angenommen werden sollen. Zu beachten ist, dass bei der Operation der Vektor \(x\) mit dem Ergebnis überschrieben werden muss.
Aufgabe: Col-Major-Variante für TRLMV
Von \(L(A)\) betrachten wir die Partitionierung bezüglich der Zeile und Spalte mit Index \(k\):
\[\left(\begin{array}{c} \vec{x_0} \\[0.2cm] \hline x_k \\[0.2cm] \hline \vec{x_{k+1}} \end{array}\right)\longleftarrow\left(\begin{array}{c|c|c} L_{0,0} & & \\ \hline {\vec{l_{k,0}}}^T & l_{k,\,k} & \\[0.2cm] \hline L_{k+1,0} & \vec{l_{k+1,\,k}} & L_{k+1,\,k+1} \end{array}\right)\left(\begin{array}{c} \vec{x_0} \\[0.2cm] \hline x_k \\[0.2cm] \hline \vec{x_{k+1}} \end{array}\right)=\left(\begin{array}{c} L_{0,0} \\[0.2cm] \hline {\vec{l_{k,0}}}^T \\[0.2cm] \hline L_{k+1,0} \end{array}\right) \vec{x_0}+\left(\begin{array}{c} \\[0.2cm] \hline l_{k,\,k} \\[0.2cm] \hline \vec{l_{k+1,\,k}} \end{array}\right) x_k+\left(\begin{array}{c} \\[0.2cm] \hline \\[0.2cm] \hline L_{k+1,\,k+1} \end{array}\right) \vec{x_{k+1}}\]Dies kann man in die drei Teil-Operationen
\[\begin{array}{|c|c|c|}\hline\\\text{pre}(k)&\text{inv}(k)&\text{post}(k)\\[0.8cm]\hline\\\quad \vec{x_{k+1}}\longleftarrow L_{k+1,\,k+1} \vec{x_{k+1}}\quad&\quad\left\{\begin{array}{lccl}(1) & \vec{x_{k+1}} & \leftarrow & \vec{x_{k+1}} + \vec{l_{k+1,\,\,k}} x_k\;, \\[0.5cm](2) & x_k & \leftarrow & l_{k,\,k} x_k\;.\end{array}\right.\quad&\quad\left\{\begin{array}{lccl}(1) & \vec{x_{k+1}} & \leftarrow & \vec{x_{k+1}} + L_{k+1,0} \vec{x_0}\;, \\[0.5cm](2) & x_k & \leftarrow & x_k + {\vec{l_{k,0}}}^T \vec{x_0}\;, \\[0.5cm](3) & \vec{x_0} & \leftarrow & L_{0,0} \vec{x_0}\;.\end{array}\right.\quad\\[1cm]\\\hline\end{array}\]zerlegen. Da \(x\) überschrieben wird müssen diese in der Reihenfolge \(\text{pre}(k)\), \(\text{inv}(k)\), \(\text{post}(k)\) durchgeführt werden. Wir erhalten damit folgenden iterativen Algorithmus:
\[\begin{array}{l}\text{for}\; k=n-1, \dots, 0 \\\qquad \begin{array}{lcl} \vec{x_{k+1}} & \longleftarrow & \vec{x_{k+1}} + \vec{l_{k+1,\,k}} \cdot x_k \\ x_k & \longleftarrow & l_{k,\,k} \cdot x_k \\ \end{array} \\\\\end{array}\]Dabei muss aber zusätzlich der Fall behandelt werden, dass eventuell \(l_{k,\,k}=1\) gelten soll.
Implementiert diesen Algorithmus mit der Prozedur
void dtrlmv_col(int n, int unitDiag, const double *A, int incRowA, int incColA, double *x, int incX);
Im Fall \(\text{unitDiag} \neq 0\) soll angenommen werden, dass alle Diagonalelemente den Wert Eins besitzen.
Aufgabe: Row-Major-Variante für TRLMV
Wir betrachten wieder die gleiche Partionierung von \(L(A)\). Beim Ausmultiplizieren fassen wir das Ergebnis aber zeilenweise zusammen:
\[\left(\begin{array}{c} \vec{x_0} \\[0.2cm] \hline x_k \\[0.2cm] \hline \vec{x_{k+1}} \end{array}\right)\longleftarrow\left(\begin{array}{c|c|c} L_{0,0} & & \\[0.2cm] \hline {\vec{l_{k,0}}}^T & l_{k,\,k} & \\[0.2cm] \hline L_{k+1,0} & \vec{l_{k+1,\,k}} & L_{k+1,\,k+1} \end{array}\right)\left(\begin{array}{c} \vec{x_0} \\[0.2cm] \hline x_k \\[0.2cm] \hline \vec{x_{k+1}} \end{array}\right)=\left(\begin{array}{lclcl} L_{0,0} \,\vec{x_0} && &&\\[0.2cm] \hline {\vec{l_{k,0}}}^T \,\vec{x_0} &+& l_{k,\,k} \,x_k & & \\[0.2cm] \hline L_{k+1,0} \,\vec{x_0} &+& l_{k+1,\,k} \,x_k &+& L_{k+1,\,k+1} \,\vec{x_{k+1}} \end{array}\right)\]Das ist also äquivalent zu den drei Teil-Operationen
\[\begin{array}{|c|c|c|}\hline\\\text{pre}(k)&\text{inv}(k)&\text{post}(k)\\[0.8cm]\hline\\\qquad \vec{x_{k+1}} \longleftarrow L_{k+1,0} \,\vec{x_0} + l_{k+1,\,k} \,x_k + L_{k+1,\,k+1} \,\vec{x_{k+1}} \qquad &\qquad x_k \longleftarrow {\vec{l_{k,0}}}^T \,\vec{x_0} + l_{k,\,k} \,x_k \qquad &\qquad \vec{x_0} \longleftarrow L_{0,0} \,\vec{x_0} \qquad\\[1cm]\\\hline\end{array}\]Wir erhalten damit folgenden iterativen Algorithmus:
\[\begin{array}{l}\text{for}\; k=n-1, \dots, 0 \\\qquad \begin{array}{lcl} x_k & \longleftarrow & {\vec{l_{k,0}}}^T \,\vec{x_0} + l_{k,\,k} \,x_k \end{array}\\\\\end{array}\]Auch hier muss zusätzlicher der eventuelle Fall \(l_{k,\,k}=1\) gesondert behandelt werden.
Implementiert diesen Algorithmus mit der Prozedur
void dtrlmv_row(int n, int unitDiag, const double *A, int incRowA, int incColA, double *x, int incX);
Grundgerüst für Test
#include <stdio.h> #include <math.h> //-- setup and print matrices -------------------------------------------------- void initGeMatrix(int m, int n, double *A, int incRowA, int incColA) { int i, j; for (i=0; i<m; ++i) { for (j=0; j<n; ++j) { A[i*incRowA+j*incColA] = i*n + j + 1; } } } void printGeMatrix(int m, int n, const double *A, int incRowA, int incColA) { int i, j; for (i=0; i<m; ++i) { for (j=0; j<n; ++j) { printf("%10.4lf ", A[i*incRowA+j*incColA]); } printf("\n"); } printf("\n\n"); } void printTrMatrix(int m, int n, int unitDiag, int lower, const double *A, int incRowA, int incColA) { int i, j; for (i=0; i<m; ++i) { for (j=0; j<n; ++j) { if (unitDiag && (i==j)) { printf("%10.4lf ", 1.0); } else if ((lower && (i>=j)) || (!lower && (i<=j))) { printf("%10.4lf ", A[i*incRowA+j*incColA]); } else { printf("%10.4lf ", 0.0); } } printf("\n"); } printf("\n\n"); } //-- some BLAS Level 1 procedures and functions -------------------------------- double ddot(int n, const double *x, int incX, const double *y, int incY) { int i; double alpha = 0; for (i=0; i<n; ++i) { alpha += x[i*incX]*y[i*incY]; } return alpha; } void daxpy(int n, double alpha, const double *x, int incX, double *y, int incY) { int i; if (alpha==0) { return; } for (i=0; i<n; ++i) { y[i*incY] += alpha*x[i*incX]; } } void dscal(int n, double alpha, double *x, int incX) { int i; if (alpha==1) { return; } for (i=0; i<n; ++i) { x[i*incX] *= alpha; } } void dcopy(int n, const double *x, int incX, double *y, int incY) { int i; for (i=0; i<n; ++i) { y[i*incY] = x[i*incX]; } } void dswap(int n, double *x, int incX, double *y, int incY) { int i; for (i=0; i<n; ++i) { double tmp; tmp = x[i*incX]; x[i*incX] = y[i*incY]; y[i*incY] = tmp; } } double damax(int n, const double *x, int incX) { int i; double result = 0; for (i=0; i<n; ++i) { if (fabs(x[i*incX])>result) { result = fabs(x[i*incX]); } } return result; } double dgenrm1(int m, int n, const double *A, int incRowA, int incColA) { int i, j; double result = 0; for (j=0; j<n; ++j) { double sum = 0; for (i=0; i<m; ++i) { sum += fabs(A[i*incRowA+j*incColA]); } if (sum>result) { result = sum; } } return result; } //-- TRMV implementations ------------------------------------------------------ void dtrlmv_col(int n, int unitDiag, const double *A, int incRowA, int incColA, double *x, int incX) { // your code here } void dtrlmv_row(int n, int unitDiag, const double *A, int incRowA, int incColA, double *x, int incX) { // your code here } //------------------------------------------------------------------------------ #ifndef DIM_N #define DIM_N 7 #endif #ifndef ROWMAJOR #define ROWMAJOR 0 #endif #if (ROWMAJOR==1) # define INCROW_A DIM_N # define INCCOL_A 1 #else # define INCROW_A 1 # define INCCOL_A DIM_N #endif #ifndef INC_X #define INC_X 1 #endif double A[DIM_N*DIM_N]; double x_[DIM_N]; double x[INC_X*DIM_N]; int main() { initGeMatrix(DIM_N, DIM_N, A, INCROW_A, INCCOL_A); initGeMatrix(DIM_N, 1, x_, 1, 0); // // Lower triangular matrix with unit diagonal // printf("L_1(A) =\n"); printTrMatrix(DIM_N, DIM_N, 1, 1, A, INCROW_A, INCCOL_A); printf("x =\n"); printGeMatrix(DIM_N, 1, x_, 1, 0); // // Aufruf von trlmv_col // printf("trlmv_col: A * x =\n\n"); dcopy(DIM_N, x_, 1, x, INC_X); dtrlmv_col(DIM_N, 1, A, INCROW_A, INCCOL_A, x, INC_X); printGeMatrix(DIM_N, 1, x, INC_X, 0); // // Aufruf von trlmv_row // printf("trlmv_row: A * x =\n\n"); dcopy(DIM_N, x_, 1, x, INC_X); dtrlmv_row(DIM_N, 1, A, INCROW_A, INCCOL_A, x, INC_X); printGeMatrix(DIM_N, 1, x, INC_X, 0); // // Lower triangular matrix with non-unit diagonal // printf("L(A) =\n"); printTrMatrix(DIM_N, DIM_N, 0, 1, A, INCROW_A, INCCOL_A); printf("x =\n"); printGeMatrix(DIM_N, 1, x_, 1, 0); // // Aufruf von trlmv_col // printf("trlmv_col: A * x =\n\n"); dcopy(DIM_N, x_, 1, x, INC_X); dtrlmv_col(DIM_N, 0, A, INCROW_A, INCCOL_A, x, INC_X); printGeMatrix(DIM_N, 1, x, INC_X, 0); // // Aufruf von trlmv_row // printf("trlmv_row: A * x =\n\n"); dcopy(DIM_N, x_, 1, x, INC_X); dtrlmv_row(DIM_N, 0, A, INCROW_A, INCCOL_A, x, INC_X); printGeMatrix(DIM_N, 1, x, INC_X, 0); return 0; }
Aufgabe: Geblockte TRMV Varinaten
Die Operation soll nun bezüglich einem Blockfaktor \(\text{bs}\) quasi-äquidistant partitioniert werden. Bezüglich der Blockzeile und -spalte erhält man damit
\[\left(\begin{array}{c} \vec{x_0} \\[0.2cm] \hline \vec{x_k} \\[0.2cm] \hline \vec{x_{k+\text{bs}}} \end{array}\right)\longleftarrow\left(\begin{array}{l|l|l} L_{0,0} & & \\[0.2cm] \hline L_{k,0} & L_{k,\,k} & \\[0.2cm] \hline L_{k+\text{bs},0} & L_{k+\text{bs},\,k} & L_{k+\text{bs},\,k+\text{bs}} \end{array}\right)\left(\begin{array}{c} \vec{x_0} \\[0.2cm] \hline \vec{x_k} \\[0.2cm] \hline \vec{x_{k+\text{bs}}} \end{array}\right)=\left(\begin{array}{l} L_{0,0} \\[0.2cm] \hline L_{k,0} \\[0.2cm] \hline L_{k+\text{bs},0} \\ \end{array}\right)\]Col-Major Algorithmus
Für einen Block-Spalten orientierten Algorithmus betrachtet man
\[\left(\begin{array}{l|l|l} L_{0,0} & & \\[0.2cm] \hline L_{k,0} & L_{k,\,k} & \\[0.2cm] \hline L_{k+\text{bs},0} & L_{k+\text{bs},\,k} & L_{k+\text{bs},\,k+\text{bs}} \end{array}\right)\left(\begin{array}{c} \vec{x_0} \\[0.2cm] \hline \vec{x_k} \\[0.2cm] \hline \vec{x_{k+\text{bs}}} \end{array}\right)=\left(\begin{array}{l} L_{0,0} \\[0.2cm] \hline L_{k,0} \\[0.2cm] \hline L_{k+\text{bs},0} \\ \end{array}\right)\vec{x_0}+\left(\begin{array}{l} \\[0.2cm] \hline L_{k,\,k} \\[0.2cm] \hline L_{k+\text{bs},\,k} \\ \end{array}\right)\vec{x_k}+\left(\begin{array}{l} \\[0.2cm] \hline \\[0.2cm] \hline L_{k+\text{bs},\,k+\text{bs}} \\ \end{array}\right)\vec{x_{k+\text{bs}}}\]Row-Major Algorithmus
Für einen Block-Zeilen orientierten Algorithmus betrachtet man
\[\left(\begin{array}{l|l|l} L_{0,0} & & \\[0.2cm] \hline L_{k,0} & L_{k,\,k} & \\[0.2cm] \hline L_{k+\text{bs},0} & L_{k+\text{bs},\,k} & L_{k+\text{bs},\,k+\text{bs}} \end{array}\right)\left(\begin{array}{c} \vec{x_0} \\[0.2cm] \hline \vec{x_k} \\[0.2cm] \hline \vec{x_{k+\text{bs}}} \end{array}\right)=\left(\begin{array}{lclcl} L_{0,0} \;\vec{x_0} && && \\[0.2cm] \hline L_{k,0} \;\vec{x_0} &+& L_{k,\,k} \;\vec{x_k} \\[0.2cm] \hline L_{k+\text{bs},0} \;\vec{x_0} &+& L_{k+\text{bs},\,k} \;\vec{x_k} &+& L_{k+\text{bs},\,k+\text{bs}} \;\vec{x_{k+\text{bs}}} \\ \end{array}\right)\]Leitet für Col-Major und Row-Major Matrizen jeweils einen geblockten Algorithmus für die TRLMV-Operation her. Diese sollen die speziellen GEMV-Operationen für Matrix-Paneele und die ungeblockten TRLMV-Operationen verwenden. Implementiert die Algorithmen in einer gemeinsamen Prozedur:
void dtrlmv_ulm(int n, int unitDiag, const double *A, int incRowA, int incColA, double *x, int incX);
Durch eine Fallunterscheidung soll stets der günstigere Algorithmus verwendet werden.
Grundgerüst für Benchmark
#include <stdio.h> #include <math.h> #include <stdlib.h> #include <sys/times.h> #include <unistd.h> //-- timer for benchmarks ------------------------------------------------------ double walltime() { struct tms ts; static double ClockTick=0.0; if (ClockTick==0.0) { ClockTick = 1.0 / ((double) sysconf(_SC_CLK_TCK)); } return ((double) times(&ts)) * ClockTick; } //-- setup and print matrices -------------------------------------------------- void initGeMatrix(int m, int n, double *A, int incRowA, int incColA) { int i, j; for (i=0; i<m; ++i) { for (j=0; j<n; ++j) { A[i*incRowA+j*incColA] = i*n + j + 1; } } } void randGeMatrix(int m, int n, double *A, int incRowA, int incColA) { int i, j; for (j=0; j<n; ++j) { for (i=0; i<m; ++i) { A[i*incRowA+j*incColA] = ((double)rand()-RAND_MAX/2)*200/RAND_MAX; } } } void printGeMatrix(int m, int n, const double *A, int incRowA, int incColA) { int i, j; for (i=0; i<m; ++i) { for (j=0; j<n; ++j) { printf("%10.4lf ", A[i*incRowA+j*incColA]); } printf("\n"); } printf("\n\n"); } void printTrMatrix(int m, int n, int unitDiag, int lower, const double *A, int incRowA, int incColA) { int i, j; for (i=0; i<m; ++i) { for (j=0; j<n; ++j) { if (unitDiag && (i==j)) { printf("%10.4lf ", 1.0); } else if ((lower && (i>=j)) || (!lower && (i<=j))) { printf("%10.4lf ", A[i*incRowA+j*incColA]); } else { printf("%10.4lf ", 0.0); } } printf("\n"); } printf("\n\n"); } //-- some BLAS Level 1 procedures and functions -------------------------------- double ddot(int n, const double *x, int incX, const double *y, int incY) { int i; double alpha = 0; for (i=0; i<n; ++i) { alpha += x[i*incX]*y[i*incY]; } return alpha; } void daxpy(int n, double alpha, const double *x, int incX, double *y, int incY) { int i; if (alpha==0) { return; } for (i=0; i<n; ++i) { y[i*incY] += alpha*x[i*incX]; } } void dscal(int n, double alpha, double *x, int incX) { int i; if (alpha==1) { return; } for (i=0; i<n; ++i) { x[i*incX] *= alpha; } } void dcopy(int n, const double *x, int incX, double *y, int incY) { int i; for (i=0; i<n; ++i) { y[i*incY] = x[i*incX]; } } void dswap(int n, double *x, int incX, double *y, int incY) { int i; for (i=0; i<n; ++i) { double tmp; tmp = x[i*incX]; x[i*incX] = y[i*incY]; y[i*incY] = tmp; } } double damax(int n, const double *x, int incX) { int i; double result = 0; for (i=0; i<n; ++i) { if (fabs(x[i*incX])>result) { result = fabs(x[i*incX]); } } return result; } double dgenrm1(int m, int n, const double *A, int incRowA, int incColA) { int i, j; double result = 0; for (j=0; j<n; ++j) { double sum = 0; for (i=0; i<m; ++i) { sum += fabs(A[i*incRowA+j*incColA]); } if (sum>result) { result = sum; } } return result; } double dtrnrm1(int m, int n, int unitDiag, int lower, const double *A, int incRowA, int incColA) { int i, j; double result = 0; for (j=0; j<n; ++j) { double sum = 0; for (i=0; i<m; ++i) { if (unitDiag && (i==j)) { sum += 1.0; } else if ((lower && (i>=j)) || (!lower && (i<=j))) { sum += fabs(A[i*incRowA+j*incColA]); } } if (sum>result) { result = sum; } } return result; } //-- error bound for gemv ------------------------------------------------------ double err_dgemv(int m, int n, double alpha, const double *A, int incRowA, int incColA, const double *x, int incX, double beta, const double *y0, int incY0, double *y1, int incY1) { int max_mn = (m>n) ? m : n; double normA = fabs(alpha)*dgenrm1(m, n, A, incRowA, incColA); double normX = damax(n, x, incX); double normY0 = fabs(beta)*damax(m, y0, incY0); double normD; daxpy(m, -1.0, y0, incY0, y1, incY1); normD = damax(n, y1, incY1); return normD/(normA*normX*normY0*max_mn); } double err_dtrmv(int n, int unitDiag, int lower, const double *A, int incRowA, int incColA, const double *x0, int incX0, double *x1, int incX1) { double normA = dtrnrm1(n, n, unitDiag, lower, A, incRowA, incColA); double normX0 = damax(n, x0, incX0); double normD; daxpy(n, -1.0, x0, incX0, x1, incX1); normD = damax(n, x1, incX1); return normD/(n*normA*normX0); } //-- Fused BLAS Level 1 procedures and functions ------------------------------- #ifndef DGEMV_MxBS #define DGEMV_MxBS 4 #endif void dgemv_mxBS(int m, double alpha, const double *A, int incRowA, int incColA, const double *x, int incX, double *y, int incY) { int i, j; if (alpha==0) { return; } for (i=0; i<m; ++i) { double dot = 0; for (j=0; j<DGEMV_MxBS; ++j) { dot += A[i*incRowA+j*incColA]*x[j*incX]; } y[i*incY] += alpha*dot; } } #ifndef DGEMV_BSxN #define DGEMV_BSxN 4 #endif void dgemv_BSxn(int n, double alpha, const double *A, int incRowA, int incColA, const double *x, int incX, double beta, double *y, int incY) { int i, j; if (alpha==0) { return; } if (beta!=1) { for (i=0; i<DGEMV_BSxN; ++i) { y[i*incY] *= beta; } } for (j=0; j<n; ++j) { for (i=0; i<DGEMV_BSxN; ++i) { y[i*incY] += alpha*A[i*incRowA+j*incColA]*x[j*incX]; } } } //-- GEMV implementations ------------------------------------------------------ void dgemv_ref(int m, int n, double alpha, const double *A, int incRowA, int incColA, const double *x, int incX, double beta, double *y, int incY) { int i, j; for (i=0; i<m; ++i) { if (beta==0) { y[i*incY] = 0; } else { y[i*incY] *= beta; } for (j=0; j<n; ++j) { y[i*incY] += alpha*A[i*incRowA+j*incColA]*x[j*incX]; } } } void dgemv_ulm(int m, int n, double alpha, const double *A, int incRowA, int incColA, const double *x, int incX, double beta, double *y, int incY) { if (incRowA<incColA) { int bs = DGEMV_MxBS; int j; dscal(m, beta, y, incY); for (j=0; j+bs<=n; j+=bs) { dgemv_mxBS(m, alpha, &A[j*incColA], incRowA, incColA, &x[j*incX], incX, y, incY); } for (; j<n; ++j) { daxpy(m, alpha*x[j*incX], &A[j*incColA], incRowA, y, incY); } } else { int bs = DGEMV_BSxN; int i; for (i=0; i+bs<=m; i+=bs) { dgemv_BSxn(n, alpha, &A[i*incRowA], incRowA, incColA, x, incX, beta, &y[i*incY], incY); } for (; i<m; ++i) { y[i*incY] *= beta; y[i*incY] += alpha*ddot(n, &A[i*incRowA], incColA, x, incX); } } } //-- TRMV implementations ------------------------------------------------------ void dtrlmv_ref(int n, int unitDiag, const double *A, int incRowA, int incColA, double *x, int incX) { int i, j; for (j=n-1; j>=0; --j) { for (i=j+1; i<n; ++i) { x[i*incX] += A[i*incRowA+j*incColA]*x[j*incX]; } if (!unitDiag) { x[j*incX] *= A[j*(incRowA+incColA)]; } } } void dtrlmv_col(int n, int unitDiag, const double *A, int incRowA, int incColA, double *x, int incX) { // your code here } void dtrlmv_row(int n, int unitDiag, const double *A, int incRowA, int incColA, double *x, int incX) { // your code here } void dtrlmv_ulm(int n, int unitDiag, const double *A, int incRowA, int incColA, double *x, int incX) { if (incRowA<incColA) { // your code here } else { // your code here } } //-- Wrapper for the Intel MKL implementation of gemv -------------------------- #ifdef USE_MKL #include <mkl_types.h> // pre declaration so we don't need to include the complete mkl_blas header. void dgemv(const char *trans, const MKL_INT *m, const MKL_INT *n, const double *alpha, const double *A, const MKL_INT *ldA, const double *x, const MKL_INT *incX, const double *beta, double *y, const MKL_INT *incY); void dgemv_mkl(MKL_INT m, MKL_INT n, double alpha, const double *A, MKL_INT incRowA, MKL_INT incColA, const double *x, MKL_INT incX, double beta, double *y, MKL_INT incY) { MKL_INT ldA = (incRowA==1) ? incColA : incRowA; char trans = (incRowA==1) ? 'N' : 'T'; MKL_INT M = (incRowA==1) ? m : n; MKL_INT N = (incRowA==1) ? n : m; dgemv(&trans, &M, &N, &alpha, A, &ldA, x, &incX, &beta, y, &incY); } #endif // USE_MKL //------------------------------------------------------------------------------ #ifndef MIN_N #define MIN_N 100 #endif #ifndef MAX_N #define MAX_N 8000 #endif #ifndef INC_N #define INC_N 100 #endif #ifndef ROWMAJOR #define ROWMAJOR 0 #endif #ifndef UNITDIAG #define UNITDIAG 0 #endif #if (ROWMAJOR==1) # define INCROW_A MAX_N # define INCCOL_A 1 #else # define INCROW_A 1 # define INCCOL_A MAX_N #endif #ifndef INC_X #define INC_X 1 #endif double A[MAX_N*MAX_N]; double x_[MAX_N]; double x_0[INC_X*MAX_N]; double x_1[INC_X*MAX_N]; int main() { int n; randGeMatrix(MAX_N, MAX_N, A, INCROW_A, INCCOL_A); randGeMatrix(MAX_N, 1, x_, 1, 0); printf("#%9s %9s %9s", "n", "INCROW_A", "INCCOL_A"); printf(" %9s", "INC_X"); printf(" %12s %12s", "t", "MFLOPS"); printf(" %12s %12s %12s", "t", "MFLOPS", "err"); printf(" %12s %12s %12s", "t", "MFLOPS", "err"); printf(" %12s %12s %12s", "t", "MFLOPS", "err"); printf("\n"); for (n=MIN_N; n<=MAX_N; n+=INC_N) { int runs = 1; int ops = n*(n+1)/2 + (n-1)*n/2; double t0, t1, dt, err; printf(" %9d %9d %9d", n, INCROW_A, INCCOL_A); printf(" %9d", INC_X); t0 = 0; runs = 0; do { dcopy(n, x_, 1, x_0, INC_X); dt = walltime(); dtrlmv_ref(n, UNITDIAG, A, INCROW_A, INCCOL_A, x_0, INC_X); dt = walltime() - dt; t0 += dt; ++runs; } while (t0<0.3); t0 /= runs; printf(" %12.2e %12.2lf", t0, ops/(1000*1000*t0)); t1 = 0; runs = 0; do { dcopy(n, x_, 1, x_1, INC_X); dt = walltime(); dtrlmv_col(n, UNITDIAG, A, INCROW_A, INCCOL_A, x_1, INC_X); dt = walltime() - dt; t1 += dt; ++runs; } while (t1<0.3); t1 /= runs; err = err_dtrmv(n, UNITDIAG, 1, A, INCROW_A, INCCOL_A, x_0, INC_X, x_1, INC_X); printf(" %12.2e %12.2lf %12.2e", t1, ops/(1000*1000*t1), err); t1 = 0; runs = 0; do { dcopy(n, x_, 1, x_1, INC_X); dt = walltime(); dtrlmv_row(n, UNITDIAG, A, INCROW_A, INCCOL_A, x_1, INC_X); dt = walltime() - dt; t1 += dt; ++runs; } while (t1<0.3); t1 /= runs; err = err_dtrmv(n, UNITDIAG, 1, A, INCROW_A, INCCOL_A, x_0, INC_X, x_1, INC_X); printf(" %12.2e %12.2lf %12.2e", t1, ops/(1000*1000*t1), err); t1 = 0; runs = 0; do { dcopy(n, x_, 1, x_1, INC_X); dt = walltime(); dtrlmv_ulm(n, UNITDIAG, A, INCROW_A, INCCOL_A, x_1, INC_X); dt = walltime() - dt; t1 += dt; ++runs; } while (t1<0.3); t1 /= runs; err = err_dtrmv(n, UNITDIAG, 1, A, INCROW_A, INCCOL_A, x_0, INC_X, x_1, INC_X); printf(" %12.2e %12.2lf %12.2e", t1, ops/(1000*1000*t1), err); printf("\n"); } return 0; }