GEMV: Computation based on vector operations

Content

In the last session, you implemented the BLAS operation gemv for a matrix vector product of the form

\[y \leftarrow \beta\, y + \alpha\, A x\]

The performance of the implementation was independent of the storage order (row-major or col-major). This was achieved through the following algorithm:

  • \(\mathbf{if}\; \beta \neq 1\)

    • \(\mathbf{for}\; i=1,\dots,m\)

      • \(y_i \leftarrow \beta\, y_i\)

  • \(\mathbf{if}\; \text{incRow}_\text{A} > \text{incCol}_\text{A}\)

    • \(\mathbf{for}\; i=1,\dots,m\)

      • \(\mathbf{for}\; j=1,\dots,n\)

        • \(y_i \leftarrow y_i + \alpha\, a_{i,j} x_j\)

  • \(\mathbf{else}\)

    • \(\mathbf{for}\; j=1,\dots,n\)

      • \(\mathbf{for}\; i=1,\dots,m\)

        • \(y_i \leftarrow y_i + \alpha\, a_{i,j} x_j\)

This algorithm describes the matrix and vector operations component-wise.

Describing Algorithms in terms of vector operations

We first define some basic vector operation (these are so called BLAS level 1 operations):

Using these as building blocks, we can describe the algorithmic variants in terms of these vector operations. Both variants have in common that at first vector \(y\) gets scaled by \(\beta\), i.e.

\[y \leftarrow \beta y\]

So what remains to be done is merely computing

\[y \leftarrow \alpha A x\]

Variant 1

In the first variant we consider the rows of \(A\), i.e.

\[A = \begin{pmatrix} \mathbf{a}_1^T \\ \vdots \\ \mathbf{a}_m^T \end{pmatrix}\]

So \(A\) is a column vector where each element itself is a row vector. Hence, the operation \(y \leftarrow \beta y + \alpha A x\) is equivalent to

\[\begin{pmatrix}y_1 \\\vdots \\y_m\end{pmatrix}\leftarrow\beta\begin{pmatrix}y_1 \\\vdots \\y_m\end{pmatrix}+ \alpha\begin{pmatrix}\mathbf{a}_1^T x \\\vdots \\\mathbf{a}_m^T x\end{pmatrix}\]

Using the dot operation, we can compute the matrix vector product with

  • \(\mathbf{for}\; i=1,\dots,m\)

    • \(y_i \leftarrow y_i + \alpha\, \mathbf{a}_i^T x\)

Variant 2

We now consider the columns of \(A\), i.e.

\[A = \begin{pmatrix} \mathbf{a}_1 & \dots & \mathbf{a}_n \end{pmatrix}\]

So \(A\) is a row vector where each element itself is a column vector. Hence, the operation \(y \leftarrow \beta\, y + \alpha A x\) is equivalent to

\[y \leftarrow \beta\, y+ \alpha\, \mathbf{a}_1 x_1+ \dots+ \alpha \,\mathbf{a}_n x_n\]

So computing \(y \leftarrow y + \alpha\, A x\) is equivalent to

  • \(\text{For}\; j=1,\dots,n\)

    • \(y \leftarrow y + \alpha\, \mathbf{a}_{j} x_j\)

Combining both variants

Obviously variant 1 performs better for row-major and variant 2 better for col-major matrices. In dependence of the increments we can switch to the more favorable variant:

  • \(y \leftarrow \beta\, y\)

  • \(\mathbf{if}\; \text{incRow}_\text{A} > \text{incCol}_\text{A}\)

    • \(\mathbf{for}\; i=1,\dots,m\)

      • \(y_i \leftarrow y_i + \alpha\, \mathbf{a}_i^T x\)

  • \(\mathbf{else}\)

    • \(\mathbf{for}\; j=1,\dots,n\)

      • \(y \leftarrow y + \alpha\, \mathbf{a}_{j} x_j\)

Exercise

Material

In the computer lab E.44 we installed a free version of Intel MKL. We will use the Intel MKL for comparing the performance of our gemv implementation.

Solution for session 4 with Intel MKL Benchmark

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <sys/times.h>
#include <unistd.h>
#include <mkl_types.h>

#ifndef MIN_M
#define MIN_M 100
#endif

#ifndef MIN_N
#define MIN_N 100
#endif

#ifndef MAX_M
#define MAX_M 1500
#endif

#ifndef MAX_N
#define MAX_N 1500
#endif

#ifndef INCX
#define INCX 1
#endif

#ifndef INCY
#define INCY 1
#endif

#ifndef ALPHA
#define ALPHA 1.5
#endif

#ifndef BETA
#define BETA 1.5
#endif

#ifndef T_MIN
#define T_MIN 5
#endif


double A[MAX_M*MAX_N];
double X[MAX_N*INCX];
double Y[MAX_M*INCY];
double Y1[MAX_M*INCY];
double Y2[MAX_M*INCY];
double Y3[MAX_M*INCY];
double Y4[MAX_M*INCY];
double Y5[MAX_M*INCY];

double
walltime()
{
   struct tms    ts;
   static double ClockTick=0.0;

   if (ClockTick==0.0) {
        ClockTick = 1.0 / ((double) sysconf(_SC_CLK_TCK));
   }
   return ((double) times(&ts)) * ClockTick;
}

void
initMatrix(size_t m, size_t n, double *A, size_t incRowA, size_t incColA)
{
    for (size_t j=0; j<n; ++j) {
        for (size_t i=0; i<m; ++i) {
            A[i*incRowA+j*incColA] = ((double)rand() - RAND_MAX/2)*200/RAND_MAX;
        }
    }
}

void
copyMatrix(size_t m, size_t n,
           const double *A, size_t incRowA, size_t incColA,
           double *B, size_t incRowB, size_t incColB)
{
    for (size_t j=0; j<n; ++j) {
        for (size_t i=0; i<m; ++i) {
            B[i*incRowB+j*incColB] = A[i*incRowA+j*incColA];
        }
    }
}

double
asumDiffMatrix(size_t m, size_t n,
               const double *A, size_t incRowA, size_t incColA,
               double *B, size_t incRowB, size_t incColB)
{
    double asum = 0;

    for (size_t j=0; j<n; ++j) {
        for (size_t i=0; i<m; ++i) {
            asum += fabs(B[i*incRowB+j*incColB] - A[i*incRowA+j*incColA]);
        }
    }
    return asum;
}

//------------------------------------------------------------------------------

void
dgemv(const char *trans,
      const MKL_INT *m, const MKL_INT *n, const double *alpha,
      const double *A, const MKL_INT *ldA, const double *x,
      const MKL_INT *incX,
      const double *beta, double *y, const MKL_INT *incY);

void
dgemv_mkl(MKL_INT m, MKL_INT n,
          double alpha,
          const double *A, MKL_INT incRowA, MKL_INT incColA,
          const double *x, MKL_INT incX,
          double beta,
          double *y, MKL_INT incY)
{
    MKL_INT ldA   = (incRowA==1) ? incColA : incRowA;
    char    trans = (incRowA==1) ? 'N' : 'T';
    MKL_INT M     = (incRowA==1) ? m : n;
    MKL_INT N     = (incRowA==1) ? n : m;

    dgemv(&trans, &M, &N, &alpha, A, &ldA, x, &incX, &beta, y, &incY);
}

//------------------------------------------------------------------------------

void
dgemv_ulm(size_t m, size_t n,
          double alpha,
          const double *A, ptrdiff_t incRowA, ptrdiff_t incColA,
          const double *x, ptrdiff_t incX,
          double beta,
          double *y, ptrdiff_t incY)
{
    if (beta!=1.0) {
        for (size_t i=0; i<m; ++i) {
            y[i*incY] *= beta;
        }
    }
    if (alpha==0) {
        return;
    }
    if (incRowA > incColA) {
        for (size_t i=0; i<m; ++i) {
            for (size_t j=0; j<n; ++j) {
                y[i*incY] += alpha*A[i*incRowA+j*incColA]*x[j*incX];
            }
        }
    } else  {
        for (size_t j=0; j<n; ++j) {
            for (size_t i=0; i<m; ++i) {
                y[i*incY] += alpha*A[i*incRowA+j*incColA]*x[j*incX];
            }
        }
    }
}

//------------------------------------------------------------------------------

#ifndef COLMAJOR
//#define COLMAJOR 1
#define COLMAJOR 0
#endif

int
main()
{
    size_t runs, incRowA, incColA;
    double t0, t1, t2;
    double diff2;
    double alpha = ALPHA;
    double beta  = BETA;

    initMatrix(MAX_M, MAX_N, A, 1, MAX_M);
    initMatrix(MAX_N, 1, X, INCX, 1);
    initMatrix(MAX_M, 1, Y, INCY, 1);

    printf("# COLMAJOR    = %d\n", COLMAJOR);
    printf("# T_MIN       = %d\n", T_MIN);
    printf("#RUN    M     N  INCROW  INCCOL");
    printf("    GEMV_MKL    GEMV_ULM");
    printf("    GEMV_MKL    GEMV_ULM");
    printf("       DIFF2");
    printf("\n");
    printf("#                              ");
    printf("    (t in s)    (t in s)");
    printf("    (MFLOPS)    (MFLOPS)");
    printf("           ");
    printf("\n");

    for (size_t i=0, m=MIN_M, n=MIN_N; m<=MAX_M && n<=MAX_N;
	    ++i, m+=100, n+=100) {

        if (COLMAJOR) {
            incRowA = 1;
            incColA = m;
        } else {
            incRowA = n;
            incColA = 1;
        }

        t1   = 0;
        runs = 0;
        do {
            copyMatrix(MAX_M, 1, Y, INCY, 1, Y1, INCY, 1);
            t0 = walltime();
            dgemv_mkl(m, n, alpha,
                      A, incRowA, incColA,
                      X, INCX,
                      beta,
                      Y1, INCY);
            t1 += walltime() - t0;
            ++runs;
        } while (t1<T_MIN);
        t1 /= runs;

        t2   = 0;
        runs = 0;
        do {
            copyMatrix(MAX_M, 1, Y, INCY, 1, Y2, INCY, 1);
            t0 = walltime();
            dgemv_ulm(m, n, alpha,
                      A, incRowA, incColA,
                      X, INCX,
                      beta,
                      Y2, INCY);
            t2 += walltime() - t0;
            ++runs;
        } while (t2<T_MIN);
        t2 /= runs;
        diff2 = asumDiffMatrix(m, 1, Y1, INCY, 1, Y2, INCY, 1);

        printf("%3ld %5ld %5ld %7ld %7ld ", i, m, n, incRowA, incColA);
        printf("%11.4lf %11.4lf ", t1, t2);
        printf("%11.4lf ", 2*(m/1000.0)*(n/1000.0)/t1);
        printf("%11.4lf ", 2*(m/1000.0)*(n/1000.0)/t2);
        printf("%11.4lf ", diff2);
        printf("\n");
    }
}

Makefile (for the lab in E.44)

The following makefile creates the two executables dgemv_rowmajor and dgemv_colmajor:

In both cases the benchmarks compare our performance with the MKL.

COLMAJOR  = -DCOLMAJOR=1
ROWMAJOR  = -DCOLMAJOR=0

CC        = gcc
CFLAGS   += -std=c99
CFLAGS   += -Wall -Ofast

Arch := intel64
IntelDir := /opt/intel/compilers_and_libraries/linux
IncludeDir := $(IntelDir)/mkl/include
LibDirs := $(IntelDir)/lib/$(Arch) $(IntelDir)/mkl/lib/$(Arch)
LinkerFlag := -Wl,
Rpath := $(patsubst %,$(LinkerFlag)-rpath $(LinkerFlag)%,$(LibDirs))
Lpath := $(patsubst %,-L%,$(LibDirs))

LDFLAGS += $(Lpath) $(Rpath)
CPPFLAGS += -I$(IncludeDir) -DMKL_ILP64
LDLIBS += -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lm -lpthread

TARGETS = gemv_rowmajor gemv_colmajor

all: $(TARGETS)

%_rowmajor : %.c
	$(CC) $(ROWMAJOR) $(CPPFLAGS) $(CFLAGS) $(LDFLAGS) $(LDLIBS) -o $@ $<

%_colmajor : %.c
	$(CC)  $(COLMAJOR) $(CPPFLAGS) $(CFLAGS) $(LDFLAGS) $(LDLIBS) -o $@ $<

clean:
	$(RM) $(TARGETS)

Running the benchmarks

heim$ make
gcc -DCOLMAJOR=0 -I/opt/intel/compilers_and_libraries/linux/mkl/include -DMKL_ILP64 -std=c99 -Wall -Ofast -L/opt/intel/compilers_and_libraries/linux/lib/intel64 -L/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64 -Wl,-rpath -Wl,/opt/intel/compilers_and_libraries/linux/lib/intel64 -Wl,-rpath -Wl,/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64 -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lm -lpthread -o gemv_rowmajor gemv.c
gcc  -DCOLMAJOR=1 -I/opt/intel/compilers_and_libraries/linux/mkl/include -DMKL_ILP64 -std=c99 -Wall -Ofast -L/opt/intel/compilers_and_libraries/linux/lib/intel64 -L/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64 -Wl,-rpath -Wl,/opt/intel/compilers_and_libraries/linux/lib/intel64 -Wl,-rpath -Wl,/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64 -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lm -lpthread -o gemv_colmajor gemv.c
heim$ ./gemv_colmajor > gemv_colmajor.dat
heim$ ./gemv_rowmajor > gemv_rowmajor.dat
heim$ cat gemv_colmajor.dat
# COLMAJOR    = 1
# T_MIN       = 5
#RUN    M     N  INCROW  INCCOL    GEMV_MKL    GEMV_ULM    GEMV_MKL    GEMV_ULM       DIFF2
#                                  (t in s)    (t in s)    (MFLOPS)    (MFLOPS)           
  0   100   100       1     100      0.0000      0.0000   9065.4970   4348.2914      0.0000 
  1   200   200       1     200      0.0000      0.0000   8500.8160   4768.6866      0.0000 
  2   300   300       1     300      0.0000      0.0000   8478.5400   5007.9162      0.0000 
  3   400   400       1     400      0.0000      0.0001   8812.0320   5158.7840      0.0000 
  4   500   500       1     500      0.0001      0.0001   8697.8044   5185.2000      0.0000 
  5   600   600       1     600      0.0001      0.0001   8795.3760   5184.8640      0.0000 
  6   700   700       1     700      0.0001      0.0002   8227.2960   5108.9360      0.0000 
  7   800   800       1     800      0.0002      0.0003   7049.4720   4614.4000      0.0000 
  8   900   900       1     900      0.0003      0.0004   4799.0880   3736.0440      0.0000 
  9  1000  1000       1    1000      0.0004      0.0005   4768.8000   3861.8763      0.0000 
 10  1100  1100       1    1100      0.0005      0.0006   4845.8080   3972.4711      0.0000 
 11  1200  1200       1    1200      0.0006      0.0007   4665.0240   3894.3360      0.0000 
 12  1300  1300       1    1300      0.0008      0.0009   4460.9240   3851.1720      0.0000 
 13  1400  1400       1    1400      0.0009      0.0010   4351.2000   3822.0000      0.0000 
 14  1500  1500       1    1500      0.0011      0.0012   4251.6000   3778.2000      0.0000 
heim$ cat gemv_rowmajor.dat
# COLMAJOR    = 0
# T_MIN       = 5
#RUN    M     N  INCROW  INCCOL    GEMV_MKL    GEMV_ULM    GEMV_MKL    GEMV_ULM       DIFF2
#                                  (t in s)    (t in s)    (MFLOPS)    (MFLOPS)           
  0   100   100     100       1      0.0000      0.0000   8705.2120   3754.4760      0.0000 
  1   200   200     200       1      0.0000      0.0000   8435.5769   4200.0320      0.0000 
  2   300   300     300       1      0.0000      0.0000   8413.5600   4418.1198      0.0000 
  3   400   400     400       1      0.0000      0.0001   8597.8880   4480.7040      0.0000 
  4   500   500     500       1      0.0001      0.0001   8560.3000   4549.2000      0.0000 
  5   600   600     600       1      0.0001      0.0002   8290.0599   4471.9200      0.0000 
  6   700   700     700       1      0.0001      0.0002   7655.9560   4338.4600      0.0000 
  7   800   800     800       1      0.0002      0.0003   6702.8480   4132.6080      0.0000 
  8   900   900     900       1      0.0003      0.0004   5838.1560   3878.9280      0.0000 
  9  1000  1000    1000       1      0.0004      0.0005   5514.4000   3767.6647      0.0000 
 10  1100  1100    1100       1      0.0005      0.0007   4911.1480   3706.9560      0.0000 
 11  1200  1200    1200       1      0.0006      0.0008   4781.9520   3677.7600      0.0000 
 12  1300  1300    1300       1      0.0008      0.0009   4222.9720   3626.7400      0.0000 
 13  1400  1400    1400       1      0.0009      0.0011   4188.3753   3469.9840      0.0000 
 14  1500  1500    1500       1      0.0011      0.0013   3920.6587   3344.4000      0.0000 
heim$ 

Gnuplot scripts for the benchmarks

We provide two scripts for plotting the benchmark results:

As usual run these scripts through gnuplot:

heim$ gnuplot gemv_colmajor.plot
heim$ gnuplot gemv_rowmajor.plot
heim$ 

Plot of the benchmark results