GEMV: Computation based on vector operations

Content

In the last session, you implemented the BLAS operation gemv for a matrix vector product of the form

\[y \leftarrow \beta\, y + \alpha\, A x\]

The performance of the implementation was independent of the storage order (row-major or col-major). This was achieved through the following algorithm:

  • \(\mathbf{if}\; \beta \neq 1\)

    • \(\mathbf{for}\; i=1,\dots,m\)

      • \(y_i \leftarrow \beta\, y_i\)

  • \(\mathbf{if}\; \text{incRow}_\text{A} > \text{incCol}_\text{A}\)

    • \(\mathbf{for}\; i=1,\dots,m\)

      • \(\mathbf{for}\; j=1,\dots,n\)

        • \(y_i \leftarrow y_i + \alpha\, a_{i,j} x_j\)

  • \(\mathbf{else}\)

    • \(\mathbf{for}\; j=1,\dots,n\)

      • \(\mathbf{for}\; i=1,\dots,m\)

        • \(y_i \leftarrow y_i + \alpha\, a_{i,j} x_j\)

This algorithm describes the matrix and vector operations component-wise.

Describing Algorithms in terms of vector operations

We first define some basic vector operation (these are so called BLAS level 1 operations):

Using these as building blocks, we can describe the algorithmic variants in terms of these vector operations. Both variants have in common that at first vector \(y\) gets scaled by \(\beta\), i.e.

\[y \leftarrow \beta y\]

So what remains to be done is merely computing

\[y \leftarrow \alpha A x\]

Variant 1

In the first variant we consider the rows of \(A\), i.e.

\[A = \begin{pmatrix} \mathbf{a}_1^T \\ \vdots \\ \mathbf{a}_m^T \end{pmatrix}\]

So \(A\) is a column vector where each element itself is a row vector. Hence, the operation \(y \leftarrow \beta y + \alpha A x\) is equivalent to

\[\begin{pmatrix}y_1 \\\vdots \\y_m\end{pmatrix}\leftarrow\beta\begin{pmatrix}y_1 \\\vdots \\y_m\end{pmatrix}+ \alpha\begin{pmatrix}\mathbf{a}_1^T x \\\vdots \\\mathbf{a}_m^T x\end{pmatrix}\]

Using the dot operation, we can compute the matrix vector product with

  • \(\mathbf{for}\; i=1,\dots,m\)

    • \(y_i \leftarrow y_i + \alpha\, \mathbf{a}_i^T x\)

Variant 2

We now consider the columns of \(A\), i.e.

\[A = \begin{pmatrix} \mathbf{a}_1 & \dots & \mathbf{a}_n \end{pmatrix}\]

So \(A\) is a row vector where each element itself is a column vector. Hence, the operation \(y \leftarrow \beta\, y + \alpha A x\) is equivalent to

\[y \leftarrow \beta\, y+ \alpha\, \mathbf{a}_1 x_1+ \dots+ \alpha \,\mathbf{a}_n x_n\]

So computing \(y \leftarrow y + \alpha\, A x\) is equivalent to

  • \(\text{For}\; j=1,\dots,n\)

    • \(y \leftarrow y + \alpha\, \mathbf{a}_{j} x_j\)

Combining both variants

Obviously variant 1 performs better for row-major and variant 2 better for col-major matrices. In dependence of the increments we can switch to the more favorable variant:

  • \(y \leftarrow \beta\, y\)

  • \(\mathbf{if}\; \text{incRow}_\text{A} > \text{incCol}_\text{A}\)

    • \(\mathbf{for}\; i=1,\dots,m\)

      • \(y_i \leftarrow y_i + \alpha\, \mathbf{a}_i^T x\)

  • \(\mathbf{else}\)

    • \(\mathbf{for}\; j=1,\dots,n\)

      • \(y \leftarrow y + \alpha\, \mathbf{a}_{j} x_j\)

Exercise

Material

In the computer lab E.44 we installed a free version of Intel MKL. We will use the Intel MKL for comparing the performance of our gemv implementation.

Solution for session 4 with Intel MKL Benchmark

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <sys/times.h>
#include <unistd.h>
#include <mkl_types.h>

#ifndef MIN_M
#define MIN_M 100
#endif

#ifndef MIN_N
#define MIN_N 100
#endif

#ifndef MAX_M
#define MAX_M 1500
#endif

#ifndef MAX_N
#define MAX_N 1500
#endif

#ifndef INCX
#define INCX 1
#endif

#ifndef INCY
#define INCY 1
#endif

#ifndef ALPHA
#define ALPHA 1.5
#endif

#ifndef BETA
#define BETA 1.5
#endif

#ifndef T_MIN
#define T_MIN 5
#endif


double A[MAX_M*MAX_N];
double X[MAX_N*INCX];
double Y[MAX_M*INCY];
double Y1[MAX_M*INCY];
double Y2[MAX_M*INCY];
double Y3[MAX_M*INCY];
double Y4[MAX_M*INCY];
double Y5[MAX_M*INCY];

double
walltime()
{
   struct tms    ts;
   static double ClockTick=0.0;

   if (ClockTick==0.0) {
        ClockTick = 1.0 / ((double) sysconf(_SC_CLK_TCK));
   }
   return ((double) times(&ts)) * ClockTick;
}

void
initMatrix(size_t m, size_t n, double *A, size_t incRowA, size_t incColA)
{
    for (size_t j=0; j<n; ++j) {
        for (size_t i=0; i<m; ++i) {
            A[i*incRowA+j*incColA] = ((double)rand() - RAND_MAX/2)*200/RAND_MAX;
        }
    }
}

void
copyMatrix(size_t m, size_t n,
           const double *A, size_t incRowA, size_t incColA,
           double *B, size_t incRowB, size_t incColB)
{
    for (size_t j=0; j<n; ++j) {
        for (size_t i=0; i<m; ++i) {
            B[i*incRowB+j*incColB] = A[i*incRowA+j*incColA];
        }
    }
}

double
asumDiffMatrix(size_t m, size_t n,
               const double *A, size_t incRowA, size_t incColA,
               double *B, size_t incRowB, size_t incColB)
{
    double asum = 0;

    for (size_t j=0; j<n; ++j) {
        for (size_t i=0; i<m; ++i) {
            asum += fabs(B[i*incRowB+j*incColB] - A[i*incRowA+j*incColA]);
        }
    }
    return asum;
}

//------------------------------------------------------------------------------

void
dgemv(const char *trans,
      const MKL_INT *m, const MKL_INT *n, const double *alpha,
      const double *A, const MKL_INT *ldA, const double *x,
      const MKL_INT *incX,
      const double *beta, double *y, const MKL_INT *incY);

void
dgemv_mkl(MKL_INT m, MKL_INT n,
          double alpha,
          const double *A, MKL_INT incRowA, MKL_INT incColA,
          const double *x, MKL_INT incX,
          double beta,
          double *y, MKL_INT incY)
{
    MKL_INT ldA   = (incRowA==1) ? incColA : incRowA;
    char    trans = (incRowA==1) ? 'N' : 'T';
    MKL_INT M     = (incRowA==1) ? m : n;
    MKL_INT N     = (incRowA==1) ? n : m;

    dgemv(&trans, &M, &N, &alpha, A, &ldA, x, &incX, &beta, y, &incY);
}

//------------------------------------------------------------------------------

void
dgemv_ulm(size_t m, size_t n,
          double alpha,
          const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
          const double *x, ptrdiff_t incX,
          double beta,
          double *y, ptrdiff_t incY)
{
    if (beta!=1.0) {
        for (size_t i=0; i<m; ++i) {
            y[i*incY] *= beta;
        }
    }
    if (alpha==0) {
        return;
    }
    if (incRowA > incColA) {
        for (size_t i=0; i<m; ++i) {
            for (size_t j=0; j<n; ++j) {
                y[i*incY] += alpha*A[i*incRowA+j*incColA]*x[j*incX];
            }
        }
    } else  {
        for (size_t j=0; j<n; ++j) {
            for (size_t i=0; i<m; ++i) {
                y[i*incY] += alpha*A[i*incRowA+j*incColA]*x[j*incX];
            }
        }
    }
}

//------------------------------------------------------------------------------

#ifndef COLMAJOR
//#define COLMAJOR 1
#define COLMAJOR 0
#endif

int
main()
{
    size_t runs, incRowA, incColA;
    double t0, t1, t2;
    double diff2;
    double alpha = ALPHA;
    double beta  = BETA;

    initMatrix(MAX_M, MAX_N, A, 1, MAX_M);
    initMatrix(MAX_N, 1, X, INCX, 1);
    initMatrix(MAX_M, 1, Y, INCY, 1);

    printf("# COLMAJOR    = %d\n", COLMAJOR);
    printf("# T_MIN       = %d\n", T_MIN);
    printf("#RUN    M     N  INCROW  INCCOL");
    printf("    GEMV_MKL    GEMV_ULM");
    printf("    GEMV_MKL    GEMV_ULM");
    printf("       DIFF2");
    printf("\n");
    printf("#                              ");
    printf("    (t in s)    (t in s)");
    printf("    (MFLOPS)    (MFLOPS)");
    printf("           ");
    printf("\n");

    for (size_t i=0, m=MIN_M, n=MIN_N; m<=MAX_M && n<=MAX_N;
	    ++i, m+=100, n+=100) {

        if (COLMAJOR) {
            incRowA = 1;
            incColA = m;
        } else {
            incRowA = n;
            incColA = 1;
        }

        t1   = 0;
        runs = 0;
        do {
            copyMatrix(MAX_M, 1, Y, INCY, 1, Y1, INCY, 1);
            t0 = walltime();
            dgemv_mkl(m, n, alpha,
                      A, incRowA, incColA,
                      X, INCX,
                      beta,
                      Y1, INCY);
            t1 += walltime() - t0;
            ++runs;
        } while (t1<T_MIN);
        t1 /= runs;

        t2   = 0;
        runs = 0;
        do {
            copyMatrix(MAX_M, 1, Y, INCY, 1, Y2, INCY, 1);
            t0 = walltime();
            dgemv_ulm(m, n, alpha,
                      A, incRowA, incColA,
                      X, INCX,
                      beta,
                      Y2, INCY);
            t2 += walltime() - t0;
            ++runs;
        } while (t2<T_MIN);
        t2 /= runs;
        diff2 = asumDiffMatrix(m, 1, Y1, INCY, 1, Y2, INCY, 1);

        printf("%3ld %5ld %5ld %7ld %7ld ", i, m, n, incRowA, incColA);
        printf("%11.4lf %11.4lf ", t1, t2);
        printf("%11.4lf ", 2*(m/1000.0)*(n/1000.0)/t1);
        printf("%11.4lf ", 2*(m/1000.0)*(n/1000.0)/t2);
        printf("%11.4lf ", diff2);
        printf("\n");
    }
}

Makefile (for the lab in E.44)

The following makefile creates the two executables dgemv_rowmajor and dgemv_colmajor:

In both cases the benchmarks compare our performance with the MKL.

COLMAJOR  = -DCOLMAJOR=1
ROWMAJOR  = -DCOLMAJOR=0

CC        = gcc
CFLAGS   += -std=c99
CFLAGS   += -Wall -Ofast

Arch := intel64
IntelDir := /opt/intel/compilers_and_libraries/linux
IncludeDir := $(IntelDir)/mkl/include
LibDirs := $(IntelDir)/lib/$(Arch) $(IntelDir)/mkl/lib/$(Arch)
LinkerFlag := -Wl,
Rpath := $(patsubst %,$(LinkerFlag)-rpath $(LinkerFlag)%,$(LibDirs))
Lpath := $(patsubst %,-L%,$(LibDirs))

LDFLAGS += $(Lpath) $(Rpath)
CPPFLAGS += -I$(IncludeDir) -DMKL_ILP64
LDLIBS += -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lm -lpthread

TARGETS = gemv_rowmajor gemv_colmajor

all: $(TARGETS)

%_rowmajor : %.c
	$(CC) $(ROWMAJOR) $(CPPFLAGS) $(CFLAGS) $(LDFLAGS) $(LDLIBS) -o $@ $<

%_colmajor : %.c
	$(CC)  $(COLMAJOR) $(CPPFLAGS) $(CFLAGS) $(LDFLAGS) $(LDLIBS) -o $@ $<

clean:
	$(RM) $(TARGETS)

Running the benchmarks

heim$ make
gcc -DCOLMAJOR=0 -I/opt/intel/compilers_and_libraries/linux/mkl/include -DMKL_ILP64 -std=c99 -Wall -Ofast -L/opt/intel/compilers_and_libraries/linux/lib/intel64 -L/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64 -Wl,-rpath -Wl,/opt/intel/compilers_and_libraries/linux/lib/intel64 -Wl,-rpath -Wl,/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64 -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lm -lpthread -o gemv_rowmajor gemv.c
gcc  -DCOLMAJOR=1 -I/opt/intel/compilers_and_libraries/linux/mkl/include -DMKL_ILP64 -std=c99 -Wall -Ofast -L/opt/intel/compilers_and_libraries/linux/lib/intel64 -L/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64 -Wl,-rpath -Wl,/opt/intel/compilers_and_libraries/linux/lib/intel64 -Wl,-rpath -Wl,/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64 -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lm -lpthread -o gemv_colmajor gemv.c
heim$ ./gemv_colmajor > gemv_colmajor.dat
heim$ ./gemv_rowmajor > gemv_rowmajor.dat
heim$ cat gemv_colmajor.dat
# COLMAJOR    = 1
# T_MIN       = 5
#RUN    M     N  INCROW  INCCOL    GEMV_MKL    GEMV_ULM    GEMV_MKL    GEMV_ULM       DIFF2
#                                  (t in s)    (t in s)    (MFLOPS)    (MFLOPS)           
  0   100   100       1     100      0.0000      0.0000   8325.6560   3829.5649      0.0000 
  1   200   200       1     200      0.0000      0.0000   8582.8000   4163.8164      0.0000 
  2   300   300       1     300      0.0000      0.0000   8503.2000   4454.6400      0.0000 
  3   400   400       1     400      0.0000      0.0001   8723.4810   4467.2640      0.0000 
  4   500   500       1     500      0.0001      0.0001   8635.0000   4547.7046      0.0000 
  5   600   600       1     600      0.0001      0.0002   8484.6240   4476.7904      0.0000 
  6   700   700       1     700      0.0001      0.0002   8141.2520   4510.1517      0.0000 
  7   800   800       1     800      0.0002      0.0003   7111.6800   4215.0400      0.0000 
  8   900   900       1     900      0.0003      0.0004   5737.9042   3992.0040      0.0000 
  9  1000  1000       1    1000      0.0004      0.0005   5320.4000   3888.6228      0.0000 
 10  1100  1100       1    1100      0.0005      0.0006   4710.0639   3806.1760      0.0000 
 11  1200  1200       1    1200      0.0006      0.0008   4440.1437   3814.2720      0.0000 
 12  1300  1300       1    1300      0.0008      0.0009   4360.8760   3780.1920      0.0000 
 13  1400  1400       1    1400      0.0009      0.0010   4316.7040   3781.2320      0.0000 
 14  1500  1500       1    1500      0.0011      0.0012   4283.1000   3793.5000      0.0000 
heim$ cat gemv_rowmajor.dat
# COLMAJOR    = 0
# T_MIN       = 5
#RUN    M     N  INCROW  INCCOL    GEMV_MKL    GEMV_ULM    GEMV_MKL    GEMV_ULM       DIFF2
#                                  (t in s)    (t in s)    (MFLOPS)    (MFLOPS)           
  0   100   100     100       1      0.0000      0.0000   8266.4113   3801.8040      0.0000 
  1   200   200     200       1      0.0000      0.0000   8548.0879   4201.7600      0.0000 
  2   300   300     300       1      0.0000      0.0000   8534.4480   4328.4960      0.0000 
  3   400   400     400       1      0.0000      0.0001   8714.5600   4483.3214      0.0000 
  4   500   500     500       1      0.0001      0.0001   8585.6000   4479.7405      0.0000 
  5   600   600     600       1      0.0001      0.0002   8474.4000   4485.8443      0.0000 
  6   700   700     700       1      0.0001      0.0002   7972.4960   4327.8760      0.0000 
  7   800   800     800       1      0.0002      0.0003   7122.9440   4289.2800      0.0000 
  8   900   900     900       1      0.0003      0.0004   6219.5040   4006.2600      0.0000 
  9  1000  1000    1000       1      0.0003      0.0005   5797.6000   3906.1876      0.0000 
 10  1100  1100    1100       1      0.0005      0.0006   5289.1520   3834.7320      0.0000 
 11  1200  1200    1200       1      0.0006      0.0008   4563.6480   3759.5210      0.0000 
 12  1300  1300    1300       1      0.0007      0.0009   4532.5800   3795.7400      0.0000 
 13  1400  1400    1400       1      0.0009      0.0010   4443.7120   3733.4080      0.0000 
 14  1500  1500    1500       1      0.0010      0.0012   4446.9000   3727.8000      0.0000 
heim$ 

Gnuplot scripts for the benchmarks

We provide two scripts for plotting the benchmark results:

As usual run these scripts through gnuplot:

heim$ gnuplot gemv_colmajor.plot
heim$ gnuplot gemv_rowmajor.plot
heim$ 

Plot of the benchmark results