# GEMV: Computation based on vector operations

#### Content

In the last session, you implemented the BLAS operation gemv for a matrix-vector product of the form

$y \leftarrow \beta\, y + \alpha\, A x$

The performance of the implementation was independent of the storage order (row-major or col-major) of $$A$$. This was achieved through the following algorithm:

• $$\mathbf{if}\; \beta \neq 1$$

  • $$\mathbf{for}\; i=1,\dots,m$$

    • $$y_i \leftarrow \beta\, y_i$$

• $$\mathbf{if}\; \text{incRow}_\text{A} > \text{incCol}_\text{A}$$

  • $$\mathbf{for}\; i=1,\dots,m$$

    • $$\mathbf{for}\; j=1,\dots,n$$

      • $$y_i \leftarrow y_i + \alpha\, a_{i,j} x_j$$

• $$\mathbf{else}$$

  • $$\mathbf{for}\; j=1,\dots,n$$

    • $$\mathbf{for}\; i=1,\dots,m$$

      • $$y_i \leftarrow y_i + \alpha\, a_{i,j} x_j$$

This algorithm describes the matrix-vector operation component-wise, i.e. element by element.

## Describing Algorithms in terms of vector operations

We first define some basic vector operations (these are the so-called BLAS Level 1 operations):

• The dot product: $$\alpha \leftarrow x^T y$$

• The scal operation: $$x \leftarrow \alpha \; x$$

• The axpy operation (alpha x plus y): $$y \leftarrow y + \alpha \; x$$
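These three operations translate almost directly into C. The following sketch uses the function names and the increment-based vector interface that the exercise below asks for; treat it as one possible implementation, not the reference solution:

```c
#include <stddef.h>

/* scal: x <- alpha*x */
void
dscal_ulm(size_t n, double alpha, double *x, ptrdiff_t incX)
{
    if (alpha==1.0) {
        return;
    }
    for (size_t i=0; i<n; ++i) {
        x[i*incX] *= alpha;
    }
}

/* dot: returns x^T y */
double
ddot_ulm(size_t n, const double *x, ptrdiff_t incX,
         const double *y, ptrdiff_t incY)
{
    double result = 0;
    for (size_t i=0; i<n; ++i) {
        result += x[i*incX]*y[i*incY];
    }
    return result;
}

/* axpy: y <- y + alpha*x */
void
daxpy_ulm(size_t n, double alpha, const double *x, ptrdiff_t incX,
          double *y, ptrdiff_t incY)
{
    if (alpha==0.0) {
        return;
    }
    for (size_t i=0; i<n; ++i) {
        y[i*incY] += alpha*x[i*incX];
    }
}
```

The increments incX and incY allow the same functions to operate on rows or columns of a matrix, since both are just strided vectors in memory.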

Using these as building blocks, we can describe both algorithmic variants in terms of vector operations. The variants have in common that at first the vector $$y$$ gets scaled by $$\beta$$, i.e.

$y \leftarrow \beta y$

So what remains to be done is computing

$y \leftarrow y + \alpha\, A x$

### Variant 1

In the first variant we consider the rows of $$A$$, i.e.

$A = \begin{pmatrix} \mathbf{a}_1^T \\ \vdots \\ \mathbf{a}_m^T \end{pmatrix}$

So $$A$$ is a column vector where each element itself is a row vector. Hence, the operation $$y \leftarrow \beta y + \alpha A x$$ is equivalent to

$\begin{pmatrix}y_1 \\\vdots \\y_m\end{pmatrix}\leftarrow\beta\begin{pmatrix}y_1 \\\vdots \\y_m\end{pmatrix}+ \alpha\begin{pmatrix}\mathbf{a}_1^T x \\\vdots \\\mathbf{a}_m^T x\end{pmatrix}$

Using the dot operation, we can compute the matrix vector product with

• $$\mathbf{for}\; i=1,\dots,m$$

  • $$y_i \leftarrow y_i + \alpha\, \mathbf{a}_i^T x$$
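In C, variant 1 becomes one dot product per row: the inner loop walks along a row of $$A$$ and is therefore contiguous for row-major storage. A sketch, with the dot product inlined so the block stands alone (the name dgemv_dot is made up for illustration):

```c
#include <stddef.h>

/* Variant 1: y_i <- y_i + alpha * a_i^T x, one dot product per row.
   Row i of A starts at A[i*incRowA] and has element stride incColA. */
void
dgemv_dot(size_t m, size_t n, double alpha,
          const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
          const double *x, ptrdiff_t incX,
          double *y, ptrdiff_t incY)
{
    for (size_t i=0; i<m; ++i) {
        double dot = 0;
        for (size_t j=0; j<n; ++j) {
            dot += A[i*incRowA+j*incColA]*x[j*incX];
        }
        y[i*incY] += alpha*dot;
    }
}
```

For row-major storage (incColA == 1) the inner loop touches consecutive memory locations, which is what makes this variant cache-friendly in that case.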

### Variant 2

We now consider the columns of $$A$$, i.e.

$A = \begin{pmatrix} \mathbf{a}_1 & \dots & \mathbf{a}_n \end{pmatrix}$

So $$A$$ is a row vector where each element itself is a column vector. Hence, the operation $$y \leftarrow \beta\, y + \alpha A x$$ is equivalent to

$y \leftarrow \beta\, y+ \alpha\, \mathbf{a}_1 x_1+ \dots+ \alpha \,\mathbf{a}_n x_n$

So computing $$y \leftarrow y + \alpha\, A x$$ is equivalent to

• $$\mathbf{for}\; j=1,\dots,n$$

  • $$y \leftarrow y + \alpha\, \mathbf{a}_{j} x_j$$
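In C, variant 2 becomes one axpy per column: the inner loop walks down a column of $$A$$ and is therefore contiguous for col-major storage. A sketch, with the axpy inlined so the block stands alone (the name dgemv_axpy is made up for illustration):

```c
#include <stddef.h>

/* Variant 2: y <- y + (alpha*x_j) * a_j, one axpy per column.
   Column j of A starts at A[j*incColA] and has element stride incRowA. */
void
dgemv_axpy(size_t m, size_t n, double alpha,
           const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
           const double *x, ptrdiff_t incX,
           double *y, ptrdiff_t incY)
{
    for (size_t j=0; j<n; ++j) {
        double scale = alpha*x[j*incX];
        for (size_t i=0; i<m; ++i) {
            y[i*incY] += scale*A[i*incRowA+j*incColA];
        }
    }
}
```

For col-major storage (incRowA == 1) the inner loop touches consecutive memory locations, mirroring the situation of variant 1 for row-major storage.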

## Combining both variants

Obviously variant 1 performs better for row-major and variant 2 better for col-major matrices. Depending on the increments, we can switch to the more favorable variant:

• $$y \leftarrow \beta\, y$$

• $$\mathbf{if}\; \text{incRow}_\text{A} > \text{incCol}_\text{A}$$

  • $$\mathbf{for}\; i=1,\dots,m$$

    • $$y_i \leftarrow y_i + \alpha\, \mathbf{a}_i^T x$$

• $$\mathbf{else}$$

  • $$\mathbf{for}\; j=1,\dots,n$$

    • $$y \leftarrow y + \alpha\, \mathbf{a}_{j} x_j$$
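Expressed with the BLAS Level 1 helpers, the combined algorithm might be sketched as follows. The helper signatures match the exercise below; the wrapper name dgemv_blas1 is made up for illustration, and this is one possible implementation, not the reference solution:

```c
#include <stddef.h>

/* dot: x^T y */
static double
ddot_ulm(size_t n, const double *x, ptrdiff_t incX,
         const double *y, ptrdiff_t incY)
{
    double result = 0;
    for (size_t i=0; i<n; ++i) {
        result += x[i*incX]*y[i*incY];
    }
    return result;
}

/* axpy: y <- y + alpha*x */
static void
daxpy_ulm(size_t n, double alpha, const double *x, ptrdiff_t incX,
          double *y, ptrdiff_t incY)
{
    for (size_t i=0; i<n; ++i) {
        y[i*incY] += alpha*x[i*incX];
    }
}

/* y <- beta*y + alpha*A*x, switching between the dot-based and the
   axpy-based variant depending on the storage order of A. */
void
dgemv_blas1(size_t m, size_t n, double alpha,
            const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
            const double *x, ptrdiff_t incX,
            double beta, double *y, ptrdiff_t incY)
{
    if (beta!=1.0) {
        for (size_t i=0; i<m; ++i) {
            y[i*incY] *= beta;
        }
    }
    if (incRowA > incColA) {
        /* row major: row i of A is a vector with increment incColA */
        for (size_t i=0; i<m; ++i) {
            y[i*incY] += alpha*ddot_ulm(n, &A[i*incRowA], incColA, x, incX);
        }
    } else {
        /* col major: column j of A is a vector with increment incRowA */
        for (size_t j=0; j<n; ++j) {
            daxpy_ulm(m, alpha*x[j*incX], &A[j*incColA], incRowA, y, incY);
        }
    }
}
```

Note that rows and columns of $$A$$ are simply strided vectors, so no data needs to be copied before calling the Level 1 routines.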

## Exercise

• Describe algorithms for the BLAS Level 1 operations: scal, dot and axpy.

• Implement these algorithms for vectors of type double. Use the function names dscal_ulm, ddot_ulm and daxpy_ulm.

• Implement the algorithm that combines both variants.

## Material

In the computer lab E.44 we installed a free version of the Intel MKL. We will use the Intel MKL to compare the performance of our gemv implementation against it.

### Solution for session 4 with Intel MKL Benchmark

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <sys/times.h>
#include <unistd.h>
#include <mkl_types.h>

#ifndef MIN_M
#define MIN_M 100
#endif

#ifndef MIN_N
#define MIN_N 100
#endif

#ifndef MAX_M
#define MAX_M 1500
#endif

#ifndef MAX_N
#define MAX_N 1500
#endif

#ifndef INCX
#define INCX 1
#endif

#ifndef INCY
#define INCY 1
#endif

#ifndef ALPHA
#define ALPHA 1.5
#endif

#ifndef BETA
#define BETA 1.5
#endif

#ifndef T_MIN
#define T_MIN 5
#endif

double A[MAX_M*MAX_N];
double X[MAX_N*INCX];
double Y[MAX_M*INCY];
double Y1[MAX_M*INCY];
double Y2[MAX_M*INCY];
double Y3[MAX_M*INCY];
double Y4[MAX_M*INCY];
double Y5[MAX_M*INCY];

double
walltime()
{
    struct tms    ts;
    static double ClockTick=0.0;

    if (ClockTick==0.0) {
        ClockTick = 1.0 / ((double) sysconf(_SC_CLK_TCK));
    }
    return ((double) times(&ts)) * ClockTick;
}

void
initMatrix(size_t m, size_t n, double *A, size_t incRowA, size_t incColA)
{
    for (size_t j=0; j<n; ++j) {
        for (size_t i=0; i<m; ++i) {
            A[i*incRowA+j*incColA] = ((double)rand() - RAND_MAX/2)*200/RAND_MAX;
        }
    }
}

void
copyMatrix(size_t m, size_t n,
           const double *A, size_t incRowA, size_t incColA,
           double *B, size_t incRowB, size_t incColB)
{
    for (size_t j=0; j<n; ++j) {
        for (size_t i=0; i<m; ++i) {
            B[i*incRowB+j*incColB] = A[i*incRowA+j*incColA];
        }
    }
}

double
asumDiffMatrix(size_t m, size_t n,
               const double *A, size_t incRowA, size_t incColA,
               double *B, size_t incRowB, size_t incColB)
{
    double asum = 0;

    for (size_t j=0; j<n; ++j) {
        for (size_t i=0; i<m; ++i) {
            asum += fabs(B[i*incRowB+j*incColB] - A[i*incRowA+j*incColA]);
        }
    }
    return asum;
}

//------------------------------------------------------------------------------

void
dgemv(const char *trans,
      const MKL_INT *m, const MKL_INT *n, const double *alpha,
      const double *A, const MKL_INT *ldA, const double *x,
      const MKL_INT *incX,
      const double *beta, double *y, const MKL_INT *incY);

void
dgemv_mkl(MKL_INT m, MKL_INT n,
          double alpha,
          const double *A, MKL_INT incRowA, MKL_INT incColA,
          const double *x, MKL_INT incX,
          double beta,
          double *y, MKL_INT incY)
{
    MKL_INT ldA   = (incRowA==1) ? incColA : incRowA;
    char    trans = (incRowA==1) ? 'N' : 'T';
    MKL_INT M     = (incRowA==1) ? m : n;
    MKL_INT N     = (incRowA==1) ? n : m;

    dgemv(&trans, &M, &N, &alpha, A, &ldA, x, &incX, &beta, y, &incY);
}

//------------------------------------------------------------------------------

void
dgemv_ulm(size_t m, size_t n,
          double alpha,
          const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
          const double *x, ptrdiff_t incX,
          double beta,
          double *y, ptrdiff_t incY)
{
    if (beta!=1.0) {
        for (size_t i=0; i<m; ++i) {
            y[i*incY] *= beta;
        }
    }
    if (alpha==0) {
        return;
    }
    if (incRowA > incColA) {
        for (size_t i=0; i<m; ++i) {
            for (size_t j=0; j<n; ++j) {
                y[i*incY] += alpha*A[i*incRowA+j*incColA]*x[j*incX];
            }
        }
    } else {
        for (size_t j=0; j<n; ++j) {
            for (size_t i=0; i<m; ++i) {
                y[i*incY] += alpha*A[i*incRowA+j*incColA]*x[j*incX];
            }
        }
    }
}

//------------------------------------------------------------------------------

#ifndef COLMAJOR
//#define COLMAJOR 1
#define COLMAJOR 0
#endif

int
main()
{
    size_t runs, incRowA, incColA;
    double t0, t1, t2;
    double diff2;
    double alpha = ALPHA;
    double beta  = BETA;

    initMatrix(MAX_M, MAX_N, A, 1, MAX_M);
    initMatrix(MAX_N, 1, X, INCX, 1);
    initMatrix(MAX_M, 1, Y, INCY, 1);

    printf("# COLMAJOR    = %d\n", COLMAJOR);
    printf("# T_MIN       = %d\n", T_MIN);
    printf("#RUN    M     N  INCROW  INCCOL");
    printf("    GEMV_MKL    GEMV_ULM");
    printf("    GEMV_MKL    GEMV_ULM");
    printf("       DIFF2");
    printf("\n");
    printf("#                              ");
    printf("    (t in s)    (t in s)");
    printf("    (MFLOPS)    (MFLOPS)");
    printf("           ");
    printf("\n");

    for (size_t i=0, m=MIN_M, n=MIN_N; m<=MAX_M && n<=MAX_N;
         ++i, m+=100, n+=100) {

        if (COLMAJOR) {
            incRowA = 1;
            incColA = m;
        } else {
            incRowA = n;
            incColA = 1;
        }

        t1   = 0;
        runs = 0;
        do {
            copyMatrix(MAX_M, 1, Y, INCY, 1, Y1, INCY, 1);
            t0 = walltime();
            dgemv_mkl(m, n, alpha,
                      A, incRowA, incColA,
                      X, INCX,
                      beta,
                      Y1, INCY);
            t1 += walltime() - t0;
            ++runs;
        } while (t1<T_MIN);
        t1 /= runs;

        t2   = 0;
        runs = 0;
        do {
            copyMatrix(MAX_M, 1, Y, INCY, 1, Y2, INCY, 1);
            t0 = walltime();
            dgemv_ulm(m, n, alpha,
                      A, incRowA, incColA,
                      X, INCX,
                      beta,
                      Y2, INCY);
            t2 += walltime() - t0;
            ++runs;
        } while (t2<T_MIN);
        t2 /= runs;
        diff2 = asumDiffMatrix(m, 1, Y1, INCY, 1, Y2, INCY, 1);

        printf("%3zu %5zu %5zu %7zu %7zu ", i, m, n, incRowA, incColA);
        printf("%11.4lf %11.4lf ", t1, t2);
        // gemv performs 2*m*n flops, hence MFLOPS = 2*m*n / (t * 10^6)
        printf("%11.4lf ", 2*(m/1000.0)*(n/1000.0)/t1);
        printf("%11.4lf ", 2*(m/1000.0)*(n/1000.0)/t2);
        printf("%11.4lf ", diff2);
        printf("\n");
    }
    return 0;
}
```


### Makefile (for the lab in E.44)

The following makefile creates the two executables gemv_rowmajor and gemv_colmajor:

• With gemv_rowmajor we test the case that $$A$$ is stored row major, and

• with gemv_colmajor the case that $$A$$ is stored col major.

In both cases the benchmarks compare our performance with the MKL.

```make
COLMAJOR  = -DCOLMAJOR=1
ROWMAJOR  = -DCOLMAJOR=0

CC        = gcc
CFLAGS   += -std=c99
CFLAGS   += -Wall -Ofast

Arch       := intel64
IntelDir   := /opt/intel/compilers_and_libraries/linux
IncludeDir := $(IntelDir)/mkl/include
LibDirs    := $(IntelDir)/lib/$(Arch) $(IntelDir)/mkl/lib/$(Arch)
LinkerFlag := -Wl,
Rpath      := $(patsubst %,$(LinkerFlag)-rpath $(LinkerFlag)%,$(LibDirs))
Lpath      := $(patsubst %,-L%,$(LibDirs))
LDFLAGS    += $(Lpath) $(Rpath)
CPPFLAGS   += -I$(IncludeDir) -DMKL_ILP64
LDLIBS     += -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lm -lpthread

TARGETS = gemv_rowmajor gemv_colmajor

all: $(TARGETS)

%_rowmajor : %.c
	$(CC) $(ROWMAJOR) $(CPPFLAGS) $(CFLAGS) $(LDFLAGS) $(LDLIBS) -o $@ $<

%_colmajor : %.c
	$(CC) $(COLMAJOR) $(CPPFLAGS) $(CFLAGS) $(LDFLAGS) $(LDLIBS) -o $@ $<

clean:
	$(RM) $(TARGETS)
```

### Running the benchmarks

```
heim$ make
gcc -DCOLMAJOR=0 -I/opt/intel/compilers_and_libraries/linux/mkl/include -DMKL_ILP64 -std=c99 -Wall -Ofast -L/opt/intel/compilers_and_libraries/linux/lib/intel64 -L/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64 -Wl,-rpath -Wl,/opt/intel/compilers_and_libraries/linux/lib/intel64 -Wl,-rpath -Wl,/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64 -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lm -lpthread -o gemv_rowmajor gemv.c
gcc -DCOLMAJOR=1 -I/opt/intel/compilers_and_libraries/linux/mkl/include -DMKL_ILP64 -std=c99 -Wall -Ofast -L/opt/intel/compilers_and_libraries/linux/lib/intel64 -L/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64 -Wl,-rpath -Wl,/opt/intel/compilers_and_libraries/linux/lib/intel64 -Wl,-rpath -Wl,/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64 -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lm -lpthread -o gemv_colmajor gemv.c
heim$ ./gemv_colmajor > gemv_colmajor.dat
heim$ ./gemv_rowmajor > gemv_rowmajor.dat
heim$ cat gemv_colmajor.dat
# COLMAJOR    = 1
# T_MIN       = 5
#RUN    M     N  INCROW  INCCOL    GEMV_MKL    GEMV_ULM    GEMV_MKL    GEMV_ULM       DIFF2
#                                  (t in s)    (t in s)    (MFLOPS)    (MFLOPS)
  0   100   100       1     100      0.0000      0.0000   8325.6560   3829.5649      0.0000
  1   200   200       1     200      0.0000      0.0000   8582.8000   4163.8164      0.0000
  2   300   300       1     300      0.0000      0.0000   8503.2000   4454.6400      0.0000
  3   400   400       1     400      0.0000      0.0001   8723.4810   4467.2640      0.0000
  4   500   500       1     500      0.0001      0.0001   8635.0000   4547.7046      0.0000
  5   600   600       1     600      0.0001      0.0002   8484.6240   4476.7904      0.0000
  6   700   700       1     700      0.0001      0.0002   8141.2520   4510.1517      0.0000
  7   800   800       1     800      0.0002      0.0003   7111.6800   4215.0400      0.0000
  8   900   900       1     900      0.0003      0.0004   5737.9042   3992.0040      0.0000
  9  1000  1000       1    1000      0.0004      0.0005   5320.4000   3888.6228      0.0000
 10  1100  1100       1    1100      0.0005      0.0006   4710.0639   3806.1760      0.0000
 11  1200  1200       1    1200      0.0006      0.0008   4440.1437   3814.2720      0.0000
 12  1300  1300       1    1300      0.0008      0.0009   4360.8760   3780.1920      0.0000
 13  1400  1400       1    1400      0.0009      0.0010   4316.7040   3781.2320      0.0000
 14  1500  1500       1    1500      0.0011      0.0012   4283.1000   3793.5000      0.0000
heim$ cat gemv_rowmajor.dat
# COLMAJOR    = 0
# T_MIN       = 5
#RUN    M     N  INCROW  INCCOL    GEMV_MKL    GEMV_ULM    GEMV_MKL    GEMV_ULM       DIFF2
#                                  (t in s)    (t in s)    (MFLOPS)    (MFLOPS)
  0   100   100     100       1      0.0000      0.0000   8266.4113   3801.8040      0.0000
  1   200   200     200       1      0.0000      0.0000   8548.0879   4201.7600      0.0000
  2   300   300     300       1      0.0000      0.0000   8534.4480   4328.4960      0.0000
  3   400   400     400       1      0.0000      0.0001   8714.5600   4483.3214      0.0000
  4   500   500     500       1      0.0001      0.0001   8585.6000   4479.7405      0.0000
  5   600   600     600       1      0.0001      0.0002   8474.4000   4485.8443      0.0000
  6   700   700     700       1      0.0001      0.0002   7972.4960   4327.8760      0.0000
  7   800   800     800       1      0.0002      0.0003   7122.9440   4289.2800      0.0000
  8   900   900     900       1      0.0003      0.0004   6219.5040   4006.2600      0.0000
  9  1000  1000    1000       1      0.0003      0.0005   5797.6000   3906.1876      0.0000
 10  1100  1100    1100       1      0.0005      0.0006   5289.1520   3834.7320      0.0000
 11  1200  1200    1200       1      0.0006      0.0008   4563.6480   3759.5210      0.0000
 12  1300  1300    1300       1      0.0007      0.0009   4532.5800   3795.7400      0.0000
 13  1400  1400    1400       1      0.0009      0.0010   4443.7120   3733.4080      0.0000
 14  1500  1500    1500       1      0.0010      0.0012   4446.9000   3727.8000      0.0000
heim$
```

### Gnuplot scripts for the benchmarks

We provide two scripts for plotting the benchmark results:

• For plotting the col major case you can use:

```
set terminal svg size 900, 500
set output "bench_gemv_colmajor.svg"
set xlabel "Matrix dim A: M=N"
set ylabel "MFLOPS"
set yrange [0:11000]
set title "GEMV: Col major case"
set key outside
set pointsize 0.5
plot "gemv_colmajor.dat" using 2:8 with linespoints lt 2 lw 3 title "MKL", \
     "gemv_colmajor.dat" using 2:9 with linespoints lt 3 lw 3 title "ulmBLAS"
```

• For plotting the row major case you can use:

```
set terminal svg size 900, 500
set output "bench_gemv_rowmajor.svg"
set xlabel "Matrix dim A: M=N"
set ylabel "MFLOPS"
set yrange [0:11000]
set title "GEMV: Row major case"
set key outside
set pointsize 0.5
plot "gemv_rowmajor.dat" using 2:8 with linespoints lt 2 lw 3 title "MKL", \
     "gemv_rowmajor.dat" using 2:9 with linespoints lt 3 lw 3 title "ulmBLAS"
```

As usual, run these scripts through gnuplot:

```
heim$ gnuplot gemv_colmajor.plot
heim$ gnuplot gemv_rowmajor.plot
heim$
```

### Plot of the benchmark results

• Here the plot for the col major case (bench_gemv_colmajor.svg):

• and here for the row major case (bench_gemv_rowmajor.svg):