=====================
GNU Vector Extensions                                                   [TOC]
=====================

In this session we give you a taste on the performance boost you can achieve by
applying hardware optimizations to the micro kernel.

But note: You only can observe this performance improvement because we have
previously exploited the cache hierarchy.


Exercise
========

Make the following modifications to `ulmblas.c`:

- In the GEMM frame algorithm (function `dgemm`) allocate the buffers with a 32
  byte alignment:  read the manual for `aligned_alloc`, i.e  `man
  aligned_alloc`.

- Add the micro kernel based on the GNU Vector extensions (function
  `dgemm_micro_gcc`) which is given below.

- In the macro kernel call the new micro kernel (instead of the reference
  implementation).

- Use the following block sizes (modify the macros or specify them with the
  `-D` flag when you compile the benchmark):

  ---- LATEX -------------------------------------------------------------------
  M_c = 256, \; N_C = 1024, \; K_c = 256, \; M_r = 4, \; N_r = 8
  ------------------------------------------------------------------------------

Compile with additional flags `-O3` and `-mavx` and re-run the benchmarks.


GEMM Micro Kernel using the GNU Vector Extensions
=================================================

---- CODE (type=c) -------------------------------------------------------------
//-- GEMM micro kernel (gcc vector extensions) ---------------------------------

#ifndef DGEMM_GCC_VECBITS
#define DGEMM_GCC_VECBITS     256
#endif

#define DGEMM_GCC_VECBYTES    (DGEMM_GCC_VECBITS / 8)
#define DGEMM_GCC_VECDBLS     (DGEMM_GCC_VECBITS / (8*sizeof(double)))
#define DGEMM_GCC_NR          (DGEMM_NR / DGEMM_GCC_VECDBLS)

void
dgemm_micro_gcc(size_t k, double alpha,
                const double *A, const double *B,
                double beta,
                double *C, ptrdiff_t incRowC, ptrdiff_t incColC)
{
    typedef double vec __attribute__((vector_size (DGEMM_GCC_VECBYTES)));
    vec AB[DGEMM_MR*DGEMM_GCC_NR] = {};

    A = (const double*) __builtin_assume_aligned (A, DGEMM_GCC_VECBYTES);
    B = (const double*) __builtin_assume_aligned (B, DGEMM_GCC_VECBYTES);

    // AB <- A*B
    for (size_t l=0; l<k; ++l) {
        const vec *b = (const vec *)B;
        for (size_t i=0; i<DGEMM_MR; ++i) {
            for (size_t j=0; j<DGEMM_GCC_NR; ++j) {
                AB[i*DGEMM_GCC_NR+j] += A[i]*b[j];
            }
        }
        A += DGEMM_MR;
        B += DGEMM_NR;
    }
    // AB <- alpha*AB
    for (size_t i=0; i<DGEMM_MR; ++i) {
        for (size_t j=0; j<DGEMM_GCC_NR; ++j) {
            AB[i*DGEMM_GCC_NR+j] *= alpha;
        }
    }
    // C <- beta*C + alpha*AB
    if (beta!=0) {
        for (size_t i=0; i<DGEMM_MR; ++i) {
            for (size_t j=0; j<DGEMM_GCC_NR; ++j) {
                const double *p = (const double *) &AB[i*DGEMM_GCC_NR+j];
                for (size_t j0=0; j0<DGEMM_GCC_VECDBLS; ++j0) {
                    C[i*incRowC + (j*DGEMM_GCC_VECDBLS + j0)*incColC] *= beta;
                    C[i*incRowC + (j*DGEMM_GCC_VECDBLS + j0)*incColC] += p[j0];
                }
            }
        }
    } else {
        for (size_t i=0; i<DGEMM_MR; ++i) {
            for (size_t j=0; j<DGEMM_GCC_NR; ++j) {
                const double *p = (const double *) &AB[i*DGEMM_GCC_NR+j];
                for (size_t j0=0; j0<DGEMM_GCC_VECDBLS; ++j0) {
                    C[i*incRowC + (j*DGEMM_GCC_VECDBLS + j0)*incColC] = p[j0];
                }
            }
        }
    }
}


--------------------------------------------------------------------------------