=====================
GNU Vector Extensions                                                      [TOC]
=====================

In this session we give you a taste of the performance boost you can achieve by
applying hardware optimizations to the micro kernel.  But note: you can only
observe this performance improvement because we have previously exploited the
cache hierarchy.

Exercise
========

Make the following modifications to `ulmblas.c`:

- In the GEMM frame algorithm (function `dgemm`) allocate the buffers with a
  32-byte alignment: read the manual for `aligned_alloc`, i.e.
  `man aligned_alloc`.
- Add the micro kernel based on the GNU vector extensions (function
  `dgemm_micro_gcc`), which is given below.
- In the macro kernel call the new micro kernel (instead of the reference
  implementation).
- Use the following block sizes (modify the macros or specify them with the
  `-D` flag when you compile the benchmark):

---- LATEX ---------------------------------------------------------------------
M_c = 256, \; N_c = 1024, \; K_c = 256, \; M_r = 4, \; N_r = 8
--------------------------------------------------------------------------------

Compile with the additional flags `-O3` and `-mavx` and re-run the benchmarks.

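As a starting point for the first item of the exercise, the following is a minimal sketch of how a 32-byte aligned buffer of doubles might be allocated with the C11 function `aligned_alloc`.  The names `BUFFER_ALIGNMENT`, `round_up`, `alloc_aligned_doubles` and `is_aligned` are hypothetical helpers introduced only for this illustration; they are not part of `ulmblas.c`.  The sketch rounds the requested size up to a multiple of the alignment, because C11 requires this of the size argument.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical constant: 32 bytes = 256 bits, the width of one AVX register.
#define BUFFER_ALIGNMENT 32

// Round n up to the next multiple of align (align must be a power of two).
size_t
round_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

// Allocate room for n doubles, aligned to BUFFER_ALIGNMENT bytes.
// Returns NULL on failure; release the buffer with free().
double *
alloc_aligned_doubles(size_t n)
{
    // C11 requires the size to be a multiple of the alignment.
    size_t bytes = round_up(n * sizeof(double), BUFFER_ALIGNMENT);
    return aligned_alloc(BUFFER_ALIGNMENT, bytes);
}

// Check whether a pointer is aligned to BUFFER_ALIGNMENT bytes.
int
is_aligned(const void *p)
{
    return ((uintptr_t)p % BUFFER_ALIGNMENT) == 0;
}
```

In `dgemm` the packing buffers could then be requested through such a helper, for example one buffer holding M_c * K_c doubles, and released with `free` once the computation is done.  Whether you wrap the call in a helper or call `aligned_alloc` directly is a matter of taste.
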
GEMM Micro Kernel using the GNU Vector Extensions
=================================================

---- CODE (type=c) -------------------------------------------------------------
//-- GEMM micro kernel (gcc vector extensions) ---------------------------------

#ifndef DGEMM_GCC_VECBITS
#define DGEMM_GCC_VECBITS 256
#endif

#define DGEMM_GCC_VECBYTES (DGEMM_GCC_VECBITS / 8)
#define DGEMM_GCC_VECDBLS  (DGEMM_GCC_VECBITS / (8*sizeof(double)))
#define DGEMM_GCC_NR       (DGEMM_NR / DGEMM_GCC_VECDBLS)

void
dgemm_micro_gcc(size_t k, double alpha,
                const double *A, const double *B,
                double beta,
                double *C, ptrdiff_t incRowC, ptrdiff_t incColC)
{
    typedef double vec __attribute__((vector_size (DGEMM_GCC_VECBYTES)));

    vec AB[DGEMM_MR*DGEMM_GCC_NR] = {};

    A = (const double*) __builtin_assume_aligned (A, DGEMM_GCC_VECBYTES);
    B = (const double*) __builtin_assume_aligned (B, DGEMM_GCC_VECBYTES);

    // AB <- A*B
    for (size_t l=0; l