GEMM C++ Implementation
Everything required for the implementation of the matrix-matrix product is contained in gemm.h. Note that it also contains some functionality that is already available in BLAZE, so these parts could be removed and replaced by calls to the corresponding BLAZE functions.
Small Demo
In this simple demo you have to call the GEMM function directly:
- An expression like \(C \leftarrow \beta C + \alpha A B\) gets evaluated by calling
  foo::gemm(alpha, A, B, beta, C);
- An expression like \(C \leftarrow \beta C + \alpha (A^T+A) (2 \cdot B)\) gets evaluated by calling
  foo::gemm(alpha, blaze::trans(A)+A, 2*B, beta, C);
Note that this does not create temporary matrices for \(A^T+A\) or \(2 \cdot B\). Any expression for \(A\) and \(B\) that allows element-wise access can be used. If the cost of evaluating an element of an expression is \(\mathcal{O}(1)\) (i.e. constant), the impact on performance should not be measurable.
Also note that the matrices can have different element types. The performance depends on the common type of \(\alpha\), \(A\) and \(B\).
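For illustration (this snippet is not part of the demo below; the sizes and initial values are just placeholders), a single-precision matrix \(A\) can be combined with double-precision \(B\) and \(C\). The common type of \(\alpha\), \(A\) and \(B\) is then double, so the packed buffers and the micro-kernel work in double precision:

#include <blaze/Math.h>

#include "gemm.h"

int main()
{
    std::size_t m = 4, n = 4, k = 4;

    blaze::DynamicMatrix<float>   A(m, k, 1.0f);   // single precision
    blaze::DynamicMatrix<double>  B(k, n, 2.0);    // double precision
    blaze::DynamicMatrix<double>  C(m, n, 0.0);

    double alpha = 1;
    double beta  = 0;

    // alpha, A and B have the common type double, so packing and the
    // micro-kernel run in double precision; the result is stored in C.
    foo::gemm(alpha, A, B, beta, C);
}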
Source Code: test_gemm.cc
#include <blaze/Math.h>
#include <iostream>
#include <random>

#include "gemm.h"

// fill rectangular matrix with random values
template <typename MATRIX>
void
fill(MATRIX &A)
{
    typedef typename MATRIX::ElementType  T;

    std::random_device                  random;
    std::default_random_engine          mt(random());
    std::uniform_real_distribution<T>   uniform(-100,100);

    for (std::size_t i=0; i<(~A).rows(); ++i) {
        for (std::size_t j=0; j<(~A).columns(); ++j) {
            A(i,j) = uniform(mt);
        }
    }
}

int
main()
{
    typedef double   ElementType;

    constexpr auto SOA = blaze::rowMajor;
    constexpr auto SOB = blaze::rowMajor;
    constexpr auto SOC = blaze::rowMajor;

    std::size_t m = 5;
    std::size_t n = 5;
    std::size_t k = 5;

    ElementType alpha = 1;
    ElementType beta  = 0;

    blaze::DynamicMatrix<double, SOA>   A(m, k);
    blaze::DynamicMatrix<double, SOB>   B(k, n);
    blaze::DynamicMatrix<double, SOC>   C(m, n);

    fill(A);
    fill(B);
    fill(C);

    std::cout << "A = " << A << std::endl;
    std::cout << "B = " << B << std::endl;
    std::cout << "C = " << C << std::endl;

    std::cout << "C <- " << beta << "*C + " << alpha << "*A*B" << std::endl;
    foo::gemm(alpha, A, B, beta, C);
    std::cout << "C = " << C << std::endl;

    std::cout << "C <- " << beta << "*C + " << alpha << "*(A^T+A)*(2*B)"
              << std::endl;
    foo::gemm(alpha, blaze::trans(A)+A, 2*B, beta, C);
    std::cout << "C = " << C << std::endl;
}
$shell> g++-5.3 -O3 -DNDEBUG -std=c++11 -I /home/numerik/lehn/work/blaze-2.5/ -I /opt/intel/compilers_and_libraries/linux/mkl/include test_gemm.cc
$shell> ./a.out
A = ( -52.3971 -73.3619 -93.1966 12.0865 43.5553 )
( 96.4251 -56.1692 -56.1205 -92.551 16.4769 )
( 1.56675 -1.92304 -84.998 -22.7602 81.1943 )
( -69.6447 85.1908 -71.8187 23.5216 -37.4075 )
( -51.174 63.8766 -71.6123 -26.202 96.7796 )

B = ( -13.2835 -45.8576 38.8924 -9.1605 -12.2619 )
( 85.9343 -14.5258 -55.4144 -36.2752 88.4101 )
( 46.8044 -50.2632 -73.8792 60.0059 -1.6872 )
( -26.0593 96.0564 -70.1142 -67.8452 -56.8731 )
( -67.6937 -93.9384 -83.4871 -73.8218 -79.665 )

C = ( 78.6586 86.2904 -76.0658 -70.2664 -80.1549 )
( 81.0674 -0.353454 86.1323 50.4902 66.6075 )
( -77.5299 40.9973 95.4865 -28.3445 89.5014 )
( -12.0162 -64.8605 -42.6583 9.13303 97.9829 )
( 85.6541 44.1693 5.45224 54.1766 -86.4162 )

C <- 0*C + 1*A*B
C = ( -13233.7 5222.29 4429.01 -6486.49 -9843.43 )
( -7437.97 -11223.1 16122.5 2849.48 -2102.56 )
( -9067.58 -5585.17 1264.22 -9494.72 -5219.72 )
( 6803.8 11339.5 -649.681 -5596.21 10149.2 )
( -3051.38 -6589.87 -6482.02 -11512.3 175.899 )

C <- 0*C + 1*(A^T+A)*(2*B)
C = ( 2201.88 8526.15 12175.1 -1815.01 14718.1 )
( -35848.8 -9527.29 10435.9 -10103.2 -32198.9 )
( -19822.6 7209.15 36086.5 -3092.97 1788.77 )
( -2429.15 35988.8 14337.8 -6753.84 5213.16 )
( -7980.66 -51184.4 -34313.5 -24486.7 -9241.81 )
$shell>
Source Code for GEMM
Most of the algorithm is implemented in plain C++. Only the so-called frame algorithm was adapted to BLAZE; the rest is coded in a C style. One reason for this is that in my lecture we start from scratch with a pure C implementation and C++-ify it step by step.
Performance depends on the so-called micro-kernels. Below are hand-coded micro-kernels for AVX and FMA; another micro-kernel uses the GCC vector extensions. These kernels are rather simple, and I only included versions for double precision. More advanced kernels can be taken from BLIS.
The micro-kernel can be selected by:
- Compile with -DHAVE_AVX for the AVX kernel.
- Compile with -DHAVE_FMA for the FMA kernel.
- Compile with -DHAVE_GCCVEC for the kernel using GCC vector extensions.
- Compile with -DHAVE_BLISAVX for the kernel adapted from BLIS.
Otherwise a reference implementation of the micro-kernel is used.
Frame Algorithm, Macro-Kernel and Reference-Micro-Kernel
// Code extracted from ulmBLAS: https://github.com/michael-lehn/ulmBLAS-core // Contains: GEMM and TRSM. #ifndef GEMM_HPP #define GEMM_HPP #include <algorithm> #include <cstdlib> #if defined(_OPENMP) #include <omp.h> #endif namespace foo { //-- malloc with alignment (I guess that is already in BLAZE) ------------------ void * malloc_aligned(std::size_t alignment, std::size_t size) { alignment = std::max(alignment, alignof(void *)); size += alignment; void *ptr = std::malloc(size); void *ptr2 = (void *)(((uintptr_t)ptr + alignment) & ~(alignment-1)); void **vp = (void**) ptr2 - 1; *vp = ptr; return ptr2; } void free_aligned(void *ptr) { std::free(*((void**)ptr-1)); } //-- Config -------------------------------------------------------------------- // SIMD-Register width in bits // SSE: 128 // AVX/FMA: 256 // AVX-512: 512 #ifndef SIMD_REGISTER_WIDTH #define SIMD_REGISTER_WIDTH 256 #endif #ifdef HAVE_FMA # ifndef BS_D_MR # define BS_D_MR 4 # endif # ifndef BS_D_NR # define BS_D_NR 12 # endif # ifndef BS_D_MC # define BS_D_MC 256 # endif # ifndef BS_D_KC # define BS_D_KC 512 # endif # ifndef BS_D_NC # define BS_D_NC 4092 # endif #endif #ifdef HAVE_BLISAVX # ifndef BS_D_MR # define BS_D_MR 8 # endif # ifndef BS_D_NR # define BS_D_NR 4 # endif # ifndef BS_D_MC # define BS_D_MC 96 # endif # ifndef BS_D_KC # define BS_D_KC 256 # endif # ifndef BS_D_NC # define BS_D_NC 4096 # endif #endif #ifndef BS_D_MR #define BS_D_MR 4 #endif #ifndef BS_D_NR #define BS_D_NR 8 #endif #ifndef BS_D_MC #define BS_D_MC 256 #endif #ifndef BS_D_KC #define BS_D_KC 256 #endif #ifndef BS_D_NC #define BS_D_NC 4096 #endif #if !defined(USE_SIMD) && defined(HAVE_AVX) #define USE_SIMD #endif #if !defined(USE_SIMD) && defined(HAVE_FMA) #define USE_SIMD #endif #if !defined(USE_SIMD) && defined(HAVE_GCCVEC) #define USE_SIMD #endif #if !defined(USE_SIMD) && defined(HAVE_BLISAVX) #define USE_SIMD #endif template <typename T> struct BlockSize { static constexpr int MC = 64; static constexpr int KC = 64; static constexpr int NC = 256; static constexpr int MR = 8; static constexpr int NR = 8; static constexpr int rwidth = 0; static constexpr int align = alignof(T); static constexpr int vlen = 0; static_assert(MC>0 && KC>0 && NC>0 && MR>0 && NR>0, "Invalid block size."); static_assert(MC % MR == 0, "MC must be a multiple of MR."); static_assert(NC % NR == 0, "NC must be a multiple of NR."); }; template <> struct BlockSize<double> { static constexpr int MC = BS_D_MC; static constexpr int KC = BS_D_KC; static constexpr int NC = BS_D_NC; static constexpr int MR = BS_D_MR; static constexpr int NR = BS_D_NR; static constexpr int rwidth = SIMD_REGISTER_WIDTH; static constexpr int align = rwidth / 8; #ifdef USE_SIMD static constexpr int vlen = rwidth / (8*sizeof(double)); #else static constexpr int vlen = 0; #endif static_assert(MC>0 && KC>0 && NC>0 && MR>0 && NR>0, "Invalid block size."); static_assert(MC % MR == 0, "MC must be a multiple of MR."); static_assert(NC % NR == 0, "NC must be a multiple of NR."); static_assert(rwidth % sizeof(double) == 0, "SIMD register width not sane."); }; //-- aux routines (can be replaced by calling BLAZE routines...) 
--------------- template <typename Alpha, typename MX, typename MY> void geaxpy(const Alpha &alpha, const MX &X, MY &Y) { assert(X.rows()==Y.rows()); assert(X.columns()==Y.columns()); for (std::size_t j=0; j<X.columns(); ++j) { for (std::size_t i=0; i<X.rows(); ++i) { Y(i,j) += alpha*X(i,j); } } } template <typename Alpha, typename MX> void gescal(const Alpha &alpha, MX &X) { for (std::size_t j=0; j<X.columns(); ++j) { for (std::size_t i=0; i<X.rows(); ++i) { X(i,j) *= alpha; } } } template <typename Index, typename Alpha, typename TX, typename TY> void geaxpy(Index m, Index n, const Alpha &alpha, const TX *X, Index incRowX, Index incColX, TY *Y, Index incRowY, Index incColY) { for (Index j=0; j<n; ++j) { for (Index i=0; i<m; ++i) { Y[i*incRowY+j*incColY] += alpha*X[i*incRowX+j*incColX]; } } } template <typename Index, typename Alpha, typename TX> void gescal(Index m, Index n, const Alpha &alpha, TX *X, Index incRowX, Index incColX) { for (Index j=0; j<n; ++j) { for (Index i=0; i<m; ++i) { X[i*incRowX+j*incColX] *= alpha; } } } template <typename IndexType, typename MX, typename MY> void gecopy(IndexType m, IndexType n, const MX *X, IndexType incRowX, IndexType incColX, MY *Y, IndexType incRowY, IndexType incColY) { for (IndexType j=0; j<n; ++j) { for (IndexType i=0; i<m; ++i) { Y[i*incRowY+j*incColY] = X[i*incRowX+j*incColX]; } } } //-- Micro Kernel -------------------------------------------------------------- template <typename Index, typename T> typename std::enable_if<BlockSize<T>::vlen == 0, void>::type ugemm(Index kc, T alpha, const T *A, const T *B, T beta, T *C, Index incRowC, Index incColC, const T *a_next, const T *b_next) { const Index MR = BlockSize<T>::MR; const Index NR = BlockSize<T>::NR; T P[BlockSize<T>::MR*BlockSize<T>::NR]; for (Index l=0; l<MR*NR; ++l) { P[l] = 0; } for (Index l=0; l<kc; ++l) { for (Index j=0; j<NR; ++j) { for (Index i=0; i<MR; ++i) { P[i+j*MR] += A[i+l*MR]*B[l*NR+j]; } } } for (Index j=0; j<NR; ++j) { for (Index i=0; i<MR; ++i) { C[i*incRowC+j*incColC] *= beta; C[i*incRowC+j*incColC] += alpha*P[i+j*MR]; } } } #if defined HAVE_AVX #include "avx.h" #elif defined HAVE_FMA #include "fma.h" #elif defined HAVE_GCCVEC #include "gccvec.h" #elif defined HAVE_BLISAVX #include "blisavx.h" #endif //-- Macro Kernel -------------------------------------------------------------- template <typename Index, typename T, typename Beta, typename TC> void mgemm(Index mc, Index nc, Index kc, const T &alpha, const T *A, const T *B, Beta beta, TC *C, Index incRowC, Index incColC) { const Index MR = BlockSize<T>::MR; const Index NR = BlockSize<T>::NR; const Index mp = (mc+MR-1) / MR; const Index np = (nc+NR-1) / NR; const Index mr_ = mc % MR; const Index nr_ = nc % NR; const T *nextA; const T *nextB; #if defined(_OPENMP) #pragma omp parallel for #endif for (Index j=0; j<np; ++j) { const Index nr = (j!=np-1 || nr_==0) ? NR : nr_; T C_[BlockSize<T>::MR*BlockSize<T>::NR]; nextB = &B[j*kc*NR]; for (Index i=0; i<mp; ++i) { const Index mr = (i!=mp-1 || mr_==0) ? 
MR : mr_; nextA = &A[(i+1)*kc*MR]; if (i==mp-1) { nextA = A; nextB = &B[(j+1)*kc*NR]; if (j==np-1) { nextB = B; } } if (mr==MR && nr==NR) { ugemm(kc, alpha, &A[i*kc*MR], &B[j*kc*NR], beta, &C[i*MR*incRowC+j*NR*incColC], incRowC, incColC, nextA, nextB); } else { std::fill_n(C_, MR*NR, T(0)); ugemm(kc, alpha, &A[i*kc*MR], &B[j*kc*NR], T(0), C_, Index(1), MR, nextA, nextB); gescal(mr, nr, beta, &C[i*MR*incRowC+j*NR*incColC], incRowC, incColC); geaxpy(mr, nr, T(1), C_, Index(1), MR, &C[i*MR*incRowC+j*NR*incColC], incRowC, incColC); } } } } //-- Packing blocks ------------------------------------------------------------ template <typename MA, typename T> void pack_A(const MA &A, T *p) { std::size_t mc = A.rows(); std::size_t kc = A.columns(); std::size_t MR = BlockSize<T>::MR; std::size_t mp = (mc+MR-1) / MR; for (std::size_t j=0; j<kc; ++j) { for (std::size_t l=0; l<mp; ++l) { for (std::size_t i0=0; i0<MR; ++i0) { std::size_t i = l*MR + i0; std::size_t nu = l*MR*kc + j*MR + i0; p[nu] = (i<mc) ? A(i,j) : T(0); } } } } template <typename MB, typename T> void pack_B(const MB &B, T *p) { std::size_t kc = B.rows(); std::size_t nc = B.columns(); std::size_t NR = BlockSize<T>::NR; std::size_t np = (nc+NR-1) / NR; for (std::size_t l=0; l<np; ++l) { for (std::size_t j0=0; j0<NR; ++j0) { for (std::size_t i=0; i<kc; ++i) { std::size_t j = l*NR+j0; std::size_t nu = l*NR*kc + i*NR + j0; p[nu] = (j<nc) ? B(i,j) : T(0); } } } } //-- Frame routine ------------------------------------------------------------- template <typename Alpha, typename MatrixA, typename MatrixB, typename Beta, typename MatrixC> void gemm(Alpha alpha, const MatrixA &A, const MatrixB &B, Beta beta, MatrixC &C) { assert((~A).columns()==(~B).rows()); assert((~C).rows()==(~A).rows()); assert((~C).columns()==(~B).columns()); typedef typename MatrixA::ElementType TA; typedef typename MatrixB::ElementType TB; typedef typename MatrixC::ElementType TC; typedef typename std::common_type<Alpha, TA, TB>::type T; const std::size_t m = (~C).rows(); const std::size_t n = (~C).columns(); const std::size_t k = (~A).rows(); // Here we should choose block sizes at runtime based on the problem size const std::size_t MC = BlockSize<T>::MC; const std::size_t NC = BlockSize<T>::NC; const std::size_t KC = BlockSize<T>::KC; const std::size_t MR = BlockSize<T>::MR; const std::size_t NR = BlockSize<T>::NR; const std::size_t mb = (m+MC-1) / MC; const std::size_t nb = (n+NC-1) / NC; const std::size_t kb = (k+KC-1) / KC; const std::size_t mc_ = m % MC; const std::size_t nc_ = n % NC; const std::size_t kc_ = k % KC; if (m==0 || n==0 || ((alpha==Alpha(0) || k==0) && (beta==Beta(1)))) { return; } // Actually C is not required to be row- or col-major... TC *C_ = (~C).data(); const std::size_t incRowC = blaze::IsRowMajorMatrix<MatrixC>::value ? (~C).spacing() : 1; const std::size_t incColC = blaze::IsRowMajorMatrix<MatrixC>::value ? 1 : (~C).spacing(); // Here we should use unique pointers for the buffers A_ and B_ T *A_ = (T*) malloc_aligned(BlockSize<T>::align, sizeof(T)*(MC*KC+MR)); T *B_ = (T*) malloc_aligned(BlockSize<T>::align, sizeof(T)*(KC*NC+NR)); if (alpha==Alpha(0) || k==0) { gescal(beta, ~C); return; } for (std::size_t j=0; j<nb; ++j) { std::size_t nc = (j!=nb-1 || nc_==0) ? NC : nc_; for (std::size_t l=0; l<kb; ++l) { std::size_t kc = (l!=kb-1 || kc_==0) ? KC : kc_; Beta beta_ = (l==0) ? beta : Beta(1); const auto Bs = blaze::submatrix(~B, l*KC, j*NC, kc, nc); pack_B(Bs, B_); for (std::size_t i=0; i<mb; ++i) { std::size_t mc = (i!=mb-1 || mc_==0) ? 
MC : mc_; const auto As = blaze::submatrix(~A, i*MC, l*KC, mc, kc); pack_A(As, A_); mgemm(mc, nc, kc, T(alpha), A_, B_, beta_, &C_[i*MC*incRowC+j*NC*incColC], incRowC, incColC); } } } free_aligned(A_); free_aligned(B_); } } // namespace foo #endif
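As noted in the comment inside the frame routine, the raw buffers A_ and B_ could be managed by unique pointers instead of explicit free_aligned calls. A minimal sketch of such an RAII wrapper around malloc_aligned/free_aligned (this is not part of gemm.h; the names AlignedBuffer and make_aligned_buffer are made up for illustration) might look like this:

#include <memory>

namespace foo {

// Deleter that releases memory obtained from malloc_aligned.
struct AlignedDeleter {
    void operator()(void *p) const { free_aligned(p); }
};

template <typename T>
using AlignedBuffer = std::unique_ptr<T[], AlignedDeleter>;

// Allocate n elements of type T with the alignment required by BlockSize<T>.
template <typename T>
AlignedBuffer<T>
make_aligned_buffer(std::size_t n)
{
    return AlignedBuffer<T>(
        static_cast<T*>(malloc_aligned(BlockSize<T>::align, sizeof(T)*n)));
}

} // namespace foo

// Inside gemm() one could then write
//
//     auto A_ = foo::make_aligned_buffer<T>(MC*KC+MR);
//     auto B_ = foo::make_aligned_buffer<T>(KC*NC+NR);
//
// and drop the free_aligned calls at the end of the function.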
AVX Micro-Kernel (fairly optimized)
#ifndef AVX_HPP #define AVX_HPP #include "gemm.h" #include <type_traits> template <typename Index> typename std::enable_if<std::is_convertible<Index, std::int64_t>::value && BlockSize<double>::MR==4 && BlockSize<double>::NR==8 && BlockSize<double>::align==32, void>::type ugemm(Index kc_, double alpha, const double *A, const double *B, double beta, double *C, Index incRowC_, Index incColC_, const double *, const double *) { int64_t kc = kc_; int64_t incRowC = incRowC_; int64_t incColC = incColC_; double *pAlpha = α double *pBeta = β // // Compute AB = A*B // __asm__ volatile ( "movq %0, %%rdi \n\t" // kc "movq %1, %%rsi \n\t" // A "movq %2, %%rdx \n\t" // B "movq %5, %%rcx \n\t" // C "movq %6, %%r8 \n\t" // incRowC "movq %7, %%r9 \n\t" // incColC "vmovapd 0 * 32(%%rdx), %%ymm4\n\t" "vbroadcastsd 0 * 8(%%rsi), %%ymm0\n\t" "vbroadcastsd 1 * 8(%%rsi), %%ymm1\n\t" "vbroadcastsd 2 * 8(%%rsi), %%ymm2\n\t" "vbroadcastsd 3 * 8(%%rsi), %%ymm3\n\t" "vxorpd %%ymm8, %%ymm8, %%ymm8\n\t" "vxorpd %%ymm9, %%ymm9, %%ymm9\n\t" "vxorpd %%ymm10, %%ymm10, %%ymm10\n\t" "vxorpd %%ymm11, %%ymm11, %%ymm11\n\t" "vxorpd %%ymm12, %%ymm12, %%ymm12\n\t" "vxorpd %%ymm13, %%ymm13, %%ymm13\n\t" "vxorpd %%ymm14, %%ymm14, %%ymm14\n\t" "vxorpd %%ymm15, %%ymm15, %%ymm15\n\t" "jmp check%=\n\t" "loop%=:\n\t" "vmovapd 1 * 32(%%rdx), %%ymm5\n\t" "vmulpd %%ymm0, %%ymm4, %%ymm6\n\t" "vaddpd %%ymm6, %%ymm8, %%ymm8\n\t" "vmulpd %%ymm1, %%ymm4, %%ymm7\n\t" "vaddpd %%ymm7, %%ymm9, %%ymm9\n\t" "vmulpd %%ymm2, %%ymm4, %%ymm6\n\t" "vaddpd %%ymm6, %%ymm10, %%ymm10\n\t" "vmulpd %%ymm3, %%ymm4, %%ymm7\n\t" "vaddpd %%ymm7, %%ymm11, %%ymm11\n\t" "vmovapd 2 * 32(%%rdx), %%ymm4\n\t" "vmulpd %%ymm0, %%ymm5, %%ymm6\n\t" "vaddpd %%ymm6, %%ymm12, %%ymm12\n\t" "vbroadcastsd 4 * 8(%%rsi), %%ymm0\n\t" "vmulpd %%ymm1, %%ymm5, %%ymm7\n\t" "vaddpd %%ymm7, %%ymm13, %%ymm13\n\t" "vbroadcastsd 5 * 8(%%rsi), %%ymm1\n\t" "vmulpd %%ymm2, %%ymm5, %%ymm6\n\t" "vaddpd %%ymm6, %%ymm14, %%ymm14\n\t" "vbroadcastsd 6 * 8(%%rsi), %%ymm2\n\t" "vmulpd %%ymm3, %%ymm5, %%ymm7\n\t" "vaddpd %%ymm7, %%ymm15, %%ymm15\n\t" "vbroadcastsd 7 * 8(%%rsi), %%ymm3\n\t" "addq $32, %%rsi\n\t" "addq $2*32, %%rdx\n\t" "decq %%rdi\n\t" "check%=:\n\t" "testq %%rdi, %%rdi\n\t" "jg loop%=\n\t" "movq %3, %%rdi \n\t" // alpha "movq %4, %%rsi \n\t" // beta "vbroadcastsd (%%rdi), %%ymm6\n\t" "vbroadcastsd (%%rsi), %%ymm7\n\t" "vmulpd %%ymm6, %%ymm8, %%ymm8\n\t" "vmulpd %%ymm6, %%ymm9, %%ymm9\n\t" "vmulpd %%ymm6, %%ymm10, %%ymm10\n\t" "vmulpd %%ymm6, %%ymm11, %%ymm11\n\t" "vmulpd %%ymm6, %%ymm12, %%ymm12\n\t" "vmulpd %%ymm6, %%ymm13, %%ymm13\n\t" "vmulpd %%ymm6, %%ymm14, %%ymm14\n\t" "vmulpd %%ymm6, %%ymm15, %%ymm15\n\t" "leaq (,%%r8,8), %%r8\n\t" "leaq (,%%r9,8), %%r9\n\t" "leaq (,%%r9,2), %%r10\n\t" "leaq (%%r10,%%r9), %%r11\n\t" "leaq (%%rcx,%%r10,2), %%rdx\n\t" "#\n\t" "# Update C(0,:)\n\t" "#\n\t" "vmovlpd (%%rcx), %%xmm0, %%xmm0\n\t" "vmovhpd (%%rcx,%%r9), %%xmm0, %%xmm0\n\t" "vmovlpd (%%rcx,%%r10), %%xmm1, %%xmm1\n\t" "vmovhpd (%%rcx,%%r11), %%xmm1, %%xmm1\n\t" "vmovlpd (%%rdx), %%xmm2, %%xmm2\n\t" "vmovhpd (%%rdx,%%r9), %%xmm2, %%xmm2\n\t" "vmovlpd (%%rdx,%%r10), %%xmm3, %%xmm3\n\t" "vmovhpd (%%rdx,%%r11), %%xmm3, %%xmm3\n\t" "vmulpd %%xmm7, %%xmm0, %%xmm0\n\t" "vmulpd %%xmm7, %%xmm1, %%xmm1\n\t" "vmulpd %%xmm7, %%xmm2, %%xmm2\n\t" "vmulpd %%xmm7, %%xmm3, %%xmm3\n\t" "vextractf128 $1, %%ymm8, %%xmm4\n\t" "vextractf128 $1, %%ymm12, %%xmm5\n\t" "vaddpd %%xmm0, %%xmm8, %%xmm0\n\t" "vaddpd %%xmm1, %%xmm4, %%xmm1\n\t" "vaddpd %%xmm2, %%xmm12, %%xmm2\n\t" "vaddpd %%xmm3, %%xmm5, %%xmm3\n\t" 
"vmovlpd %%xmm0, (%%rcx)\n\t" "vmovhpd %%xmm0, (%%rcx,%%r9)\n\t" "vmovlpd %%xmm1, (%%rcx,%%r10)\n\t" "vmovhpd %%xmm1, (%%rcx,%%r11)\n\t" "vmovlpd %%xmm2, (%%rdx)\n\t" "vmovhpd %%xmm2, (%%rdx,%%r9)\n\t" "vmovlpd %%xmm3, (%%rdx,%%r10)\n\t" "vmovhpd %%xmm3, (%%rdx,%%r11)\n\t" "#\n\t" "# Update C(1,:)\n\t" "#\n\t" "addq %%r8, %%rcx\n\t" "addq %%r8, %%rdx\n\t" "vmovlpd (%%rcx), %%xmm0, %%xmm0\n\t" "vmovhpd (%%rcx,%%r9), %%xmm0, %%xmm0\n\t" "vmovlpd (%%rcx,%%r10), %%xmm1, %%xmm1\n\t" "vmovhpd (%%rcx,%%r11), %%xmm1, %%xmm1\n\t" "vmovlpd (%%rdx), %%xmm2, %%xmm2\n\t" "vmovhpd (%%rdx,%%r9), %%xmm2, %%xmm2\n\t" "vmovlpd (%%rdx,%%r10), %%xmm3, %%xmm3\n\t" "vmovhpd (%%rdx,%%r11), %%xmm3, %%xmm3\n\t" "vmulpd %%xmm7, %%xmm0, %%xmm0\n\t" "vmulpd %%xmm7, %%xmm1, %%xmm1\n\t" "vmulpd %%xmm7, %%xmm2, %%xmm2\n\t" "vmulpd %%xmm7, %%xmm3, %%xmm3\n\t" "vextractf128 $1, %%ymm9, %%xmm4\n\t" "vextractf128 $1, %%ymm13, %%xmm5\n\t" "vaddpd %%xmm0, %%xmm9, %%xmm0\n\t" "vaddpd %%xmm1, %%xmm4, %%xmm1\n\t" "vaddpd %%xmm2, %%xmm13, %%xmm2\n\t" "vaddpd %%xmm3, %%xmm5, %%xmm3\n\t" "vmovlpd %%xmm0, (%%rcx)\n\t" "vmovhpd %%xmm0, (%%rcx,%%r9)\n\t" "vmovlpd %%xmm1, (%%rcx,%%r10)\n\t" "vmovhpd %%xmm1, (%%rcx,%%r11)\n\t" "vmovlpd %%xmm2, (%%rdx)\n\t" "vmovhpd %%xmm2, (%%rdx,%%r9)\n\t" "vmovlpd %%xmm3, (%%rdx,%%r10)\n\t" "vmovhpd %%xmm3, (%%rdx,%%r11)\n\t" "#\n\t" "# Update C(2,:)\n\t" "#\n\t" "addq %%r8, %%rcx\n\t" "addq %%r8, %%rdx\n\t" "vmovlpd (%%rcx), %%xmm0, %%xmm0\n\t" "vmovhpd (%%rcx,%%r9), %%xmm0, %%xmm0\n\t" "vmovlpd (%%rcx,%%r10), %%xmm1, %%xmm1\n\t" "vmovhpd (%%rcx,%%r11), %%xmm1, %%xmm1\n\t" "vmovlpd (%%rdx), %%xmm2, %%xmm2\n\t" "vmovhpd (%%rdx,%%r9), %%xmm2, %%xmm2\n\t" "vmovlpd (%%rdx,%%r10), %%xmm3, %%xmm3\n\t" "vmovhpd (%%rdx,%%r11), %%xmm3, %%xmm3\n\t" "vmulpd %%xmm7, %%xmm0, %%xmm0\n\t" "vmulpd %%xmm7, %%xmm1, %%xmm1\n\t" "vmulpd %%xmm7, %%xmm2, %%xmm2\n\t" "vmulpd %%xmm7, %%xmm3, %%xmm3\n\t" "vextractf128 $1, %%ymm10, %%xmm4\n\t" "vextractf128 $1, %%ymm14, %%xmm5\n\t" "vaddpd %%xmm0, %%xmm10, %%xmm0\n\t" "vaddpd %%xmm1, %%xmm4, %%xmm1\n\t" "vaddpd %%xmm2, %%xmm14, %%xmm2\n\t" "vaddpd %%xmm3, %%xmm5, %%xmm3\n\t" "vmovlpd %%xmm0, (%%rcx)\n\t" "vmovhpd %%xmm0, (%%rcx,%%r9)\n\t" "vmovlpd %%xmm1, (%%rcx,%%r10)\n\t" "vmovhpd %%xmm1, (%%rcx,%%r11)\n\t" "vmovlpd %%xmm2, (%%rdx)\n\t" "vmovhpd %%xmm2, (%%rdx,%%r9)\n\t" "vmovlpd %%xmm3, (%%rdx,%%r10)\n\t" "vmovhpd %%xmm3, (%%rdx,%%r11)\n\t" "#\n\t" "# Update C(3,:)\n\t" "#\n\t" "addq %%r8, %%rcx\n\t" "addq %%r8, %%rdx\n\t" "vmovlpd (%%rcx), %%xmm0, %%xmm0\n\t" "vmovhpd (%%rcx,%%r9), %%xmm0, %%xmm0\n\t" "vmovlpd (%%rcx,%%r10), %%xmm1, %%xmm1\n\t" "vmovhpd (%%rcx,%%r11), %%xmm1, %%xmm1\n\t" "vmovlpd (%%rdx), %%xmm2, %%xmm2\n\t" "vmovhpd (%%rdx,%%r9), %%xmm2, %%xmm2\n\t" "vmovlpd (%%rdx,%%r10), %%xmm3, %%xmm3\n\t" "vmovhpd (%%rdx,%%r11), %%xmm3, %%xmm3\n\t" "vmulpd %%xmm7, %%xmm0, %%xmm0\n\t" "vmulpd %%xmm7, %%xmm1, %%xmm1\n\t" "vmulpd %%xmm7, %%xmm2, %%xmm2\n\t" "vmulpd %%xmm7, %%xmm3, %%xmm3\n\t" "vextractf128 $1, %%ymm11, %%xmm4\n\t" "vextractf128 $1, %%ymm15, %%xmm5\n\t" "vaddpd %%xmm0, %%xmm11, %%xmm0\n\t" "vaddpd %%xmm1, %%xmm4, %%xmm1\n\t" "vaddpd %%xmm2, %%xmm15, %%xmm2\n\t" "vaddpd %%xmm3, %%xmm5, %%xmm3\n\t" "vmovlpd %%xmm0, (%%rcx)\n\t" "vmovhpd %%xmm0, (%%rcx,%%r9)\n\t" "vmovlpd %%xmm1, (%%rcx,%%r10)\n\t" "vmovhpd %%xmm1, (%%rcx,%%r11)\n\t" "vmovlpd %%xmm2, (%%rdx)\n\t" "vmovhpd %%xmm2, (%%rdx,%%r9)\n\t" "vmovlpd %%xmm3, (%%rdx,%%r10)\n\t" "vmovhpd %%xmm3, (%%rdx,%%r11)\n\t" : // output : // input "m" (kc), // 0 "m" (A), // 1 "m" (B), // 2 "m" (pAlpha), // 3 "m" 
(pBeta), // 4 "m" (C), // 5 "m" (incRowC), // 6 "m" (incColC) // 7 : // register clobber list "rax", "rbx", "rcx", "rdx", "rsi", "rdi", "r8", "r9", "r10", "r11", "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7", "xmm8", "xmm9", "xmm10", "xmm11", "xmm12", "xmm13", "xmm14", "xmm15", "memory" ); } #endif
FMA Micro-Kernel (very simple)
#ifndef FMA_HPP #define FMA_HPP #include "gemm.h" #include <type_traits> template <typename Index> typename std::enable_if<std::is_convertible<Index, std::int64_t>::value && BlockSize<double>::MR==4 && BlockSize<double>::NR==12 && BlockSize<double>::align==32, void>::type ugemm(Index kc_, double alpha, const double *A, const double *B, double beta, double *C, Index incRowC_, Index incColC_, const double *, const double *) { int64_t kc = kc_; int64_t incRowC = incRowC_; int64_t incColC = incColC_; double *pAlpha = α double *pBeta = β // // Compute AB = A*B // __asm__ volatile ( "movq %0, %%rdi \n\t" // kc "movq %1, %%rsi \n\t" // A "movq %2, %%rdx \n\t" // B "movq %5, %%rcx \n\t" // C "movq %6, %%r8 \n\t" // incRowC "movq %7, %%r9 \n\t" // incColC "vmovapd 0*32(%%rdx), %%ymm1 \n\t" "vmovapd 1*32(%%rdx), %%ymm2 \n\t" "vmovapd 2*32(%%rdx), %%ymm3 \n\t" "vxorpd %%ymm4, %%ymm4, %%ymm4 \n\t" "vxorpd %%ymm5, %%ymm5, %%ymm5 \n\t" "vxorpd %%ymm6, %%ymm6, %%ymm6 \n\t" "vxorpd %%ymm7, %%ymm7, %%ymm7 \n\t" "vxorpd %%ymm8, %%ymm8, %%ymm8 \n\t" "vxorpd %%ymm9, %%ymm9, %%ymm9 \n\t" "vxorpd %%ymm10, %%ymm10, %%ymm10 \n\t" "vxorpd %%ymm11, %%ymm11, %%ymm11 \n\t" "vxorpd %%ymm12, %%ymm12, %%ymm12 \n\t" "vxorpd %%ymm13, %%ymm13, %%ymm13 \n\t" "vxorpd %%ymm14, %%ymm14, %%ymm14 \n\t" "vxorpd %%ymm15, %%ymm15, %%ymm15 \n\t" "movq $3*32, %%r13 \n\t" "movq $4* 8, %%r12 \n\t" "jmp check%= \n\t" "loop%=: \n\t" "vbroadcastsd 0* 8(%%rsi), %%ymm0 \n\t" "addq %%r13, %%rdx \n\t" "vfmadd231pd %%ymm0, %%ymm1, %%ymm4 \n\t" "vfmadd231pd %%ymm0, %%ymm2, %%ymm8 \n\t" "vfmadd231pd %%ymm0, %%ymm3, %%ymm12 \n\t" "vbroadcastsd 1* 8(%%rsi), %%ymm0 \n\t" "decq %%rdi \n\t" "vfmadd231pd %%ymm0, %%ymm1, %%ymm5 \n\t" "vfmadd231pd %%ymm0, %%ymm2, %%ymm9 \n\t" "vfmadd231pd %%ymm0, %%ymm3, %%ymm13 \n\t" "vbroadcastsd 2* 8(%%rsi), %%ymm0 \n\t" "addq %%r12, %%rsi \n\t" "vfmadd231pd %%ymm0, %%ymm1, %%ymm6 \n\t" "vfmadd231pd %%ymm0, %%ymm2, %%ymm10 \n\t" "vfmadd231pd %%ymm0, %%ymm3, %%ymm14 \n\t" "vbroadcastsd -1* 8(%%rsi), %%ymm0 \n\t" "vfmadd231pd %%ymm0, %%ymm1, %%ymm7 \n\t" "vmovapd 0*32(%%rdx), %%ymm1 \n\t" "vfmadd231pd %%ymm0, %%ymm2, %%ymm11 \n\t" "vmovapd 1*32(%%rdx), %%ymm2 \n\t" "vfmadd231pd %%ymm0, %%ymm3, %%ymm15 \n\t" "vmovapd 2*32(%%rdx), %%ymm3 \n\t" "check%=: \n\t" "testq %%rdi, %%rdi \n\t" "jg loop%= \n\t" "movq %3, %%rdi \n\t" // alpha "vbroadcastsd (%%rdi), %%ymm0 \n\t" "vmulpd %%ymm0, %%ymm4, %%ymm4 \n\t" "vmulpd %%ymm0, %%ymm5, %%ymm5 \n\t" "vmulpd %%ymm0, %%ymm6, %%ymm6 \n\t" "vmulpd %%ymm0, %%ymm7, %%ymm7 \n\t" "vmulpd %%ymm0, %%ymm8, %%ymm8 \n\t" "vmulpd %%ymm0, %%ymm9, %%ymm9 \n\t" "vmulpd %%ymm0, %%ymm10, %%ymm10 \n\t" "vmulpd %%ymm0, %%ymm11, %%ymm11 \n\t" "vmulpd %%ymm0, %%ymm12, %%ymm12 \n\t" "vmulpd %%ymm0, %%ymm13, %%ymm13 \n\t" "vmulpd %%ymm0, %%ymm14, %%ymm14 \n\t" "vmulpd %%ymm0, %%ymm15, %%ymm15 \n\t" "leaq (,%%r8,8), %%r8 \n\t" "leaq (,%%r9,8), %%r9 \n\t" "leaq (,%%r9,2), %%r10 # 2*incColC \n\t" "leaq (%%r10,%%r9), %%r11 # 3*incColC \n\t" "leaq (%%rcx,%%r10,2), %%rdx # C + 4*incColC \n\t" "leaq (%%rdx,%%r10,2), %%rax # C + 8*incColC \n\t" // check if beta == 0 "movq %4, %%rdi \n\t" // beta "vbroadcastsd (%%rdi), %%ymm0 \n\t" "vxorpd %%ymm1, %%ymm1, %%ymm1 \n\t" "vucomisd %%xmm0, %%xmm1 \n\t" "je beta_zero%= \n\t" // case: beta != 0 "# \n\t" "# Update C(0,0:3) \n\t" "# \n\t" "vmovlpd (%%rcx), %%xmm1, %%xmm1 \n\t" "vmovhpd (%%rcx,%%r9), %%xmm1, %%xmm1 \n\t" "vmovlpd (%%rcx,%%r10), %%xmm2, %%xmm2 \n\t" "vmovhpd (%%rcx,%%r11), %%xmm2, %%xmm2 \n\t" "vextractf128 $1, %%ymm4, %%xmm3 \n\t" "vmulpd %%xmm0, %%xmm1, 
%%xmm1 \n\t" "vaddpd %%xmm1, %%xmm4, %%xmm1 \n\t" "vmulpd %%xmm0, %%xmm2, %%xmm2 \n\t" "vaddpd %%xmm2, %%xmm3, %%xmm2 \n\t" "vmovlpd %%xmm1, (%%rcx) \n\t" "vmovhpd %%xmm1, (%%rcx,%%r9) \n\t" "vmovlpd %%xmm2, (%%rcx,%%r10) \n\t" "vmovhpd %%xmm2, (%%rcx,%%r11) \n\t" "# \n\t" "# Update C(0,4:7) \n\t" "# \n\t" "vmovlpd (%%rdx), %%xmm1, %%xmm1 \n\t" "vmovhpd (%%rdx,%%r9), %%xmm1, %%xmm1 \n\t" "vmovlpd (%%rdx,%%r10), %%xmm2, %%xmm2 \n\t" "vmovhpd (%%rdx,%%r11), %%xmm2, %%xmm2 \n\t" "vextractf128 $1, %%ymm8, %%xmm3 \n\t" "vmulpd %%xmm0, %%xmm1, %%xmm1 \n\t" "vaddpd %%xmm1, %%xmm8, %%xmm1 \n\t" "vmulpd %%xmm0, %%xmm2, %%xmm2 \n\t" "vaddpd %%xmm2, %%xmm3, %%xmm2 \n\t" "vmovlpd %%xmm1, (%%rdx) \n\t" "vmovhpd %%xmm1, (%%rdx,%%r9) \n\t" "vmovlpd %%xmm2, (%%rdx,%%r10) \n\t" "vmovhpd %%xmm2, (%%rdx,%%r11) \n\t" "# \n\t" "# Update C(0,8:11) \n\t" "# \n\t" "vmovlpd (%%rax), %%xmm1, %%xmm1 \n\t" "vmovhpd (%%rax,%%r9), %%xmm1, %%xmm1 \n\t" "vmovlpd (%%rax,%%r10), %%xmm2, %%xmm2 \n\t" "vmovhpd (%%rax,%%r11), %%xmm2, %%xmm2 \n\t" "vextractf128 $1, %%ymm12, %%xmm3 \n\t" "vmulpd %%xmm0, %%xmm1, %%xmm1 \n\t" "vaddpd %%xmm1, %%xmm12, %%xmm1 \n\t" "vmulpd %%xmm0, %%xmm2, %%xmm2 \n\t" "vaddpd %%xmm2, %%xmm3, %%xmm2 \n\t" "vmovlpd %%xmm1, (%%rax) \n\t" "vmovhpd %%xmm1, (%%rax,%%r9) \n\t" "vmovlpd %%xmm2, (%%rax,%%r10) \n\t" "vmovhpd %%xmm2, (%%rax,%%r11) \n\t" "# \n\t" "# Update C(1,0:3) \n\t" "# \n\t" "addq %%r8, %%rcx \n\t" "addq %%r8, %%rdx \n\t" "addq %%r8, %%rax \n\t" "vmovlpd (%%rcx), %%xmm1, %%xmm1 \n\t" "vmovhpd (%%rcx,%%r9), %%xmm1, %%xmm1 \n\t" "vmovlpd (%%rcx,%%r10), %%xmm2, %%xmm2 \n\t" "vmovhpd (%%rcx,%%r11), %%xmm2, %%xmm2 \n\t" "vextractf128 $1, %%ymm5, %%xmm3 \n\t" "vmulpd %%xmm0, %%xmm1, %%xmm1 \n\t" "vaddpd %%xmm1, %%xmm5, %%xmm1 \n\t" "vmulpd %%xmm0, %%xmm2, %%xmm2 \n\t" "vaddpd %%xmm2, %%xmm3, %%xmm2 \n\t" "vmovlpd %%xmm1, (%%rcx) \n\t" "vmovhpd %%xmm1, (%%rcx,%%r9) \n\t" "vmovlpd %%xmm2, (%%rcx,%%r10) \n\t" "vmovhpd %%xmm2, (%%rcx,%%r11) \n\t" "# \n\t" "# Update C(1,4:7) \n\t" "# \n\t" "vmovlpd (%%rdx), %%xmm1, %%xmm1 \n\t" "vmovhpd (%%rdx,%%r9), %%xmm1, %%xmm1 \n\t" "vmovlpd (%%rdx,%%r10), %%xmm2, %%xmm2 \n\t" "vmovhpd (%%rdx,%%r11), %%xmm2, %%xmm2 \n\t" "vextractf128 $1, %%ymm9, %%xmm3 \n\t" "vmulpd %%xmm0, %%xmm1, %%xmm1 \n\t" "vaddpd %%xmm1, %%xmm9, %%xmm1 \n\t" "vmulpd %%xmm0, %%xmm2, %%xmm2 \n\t" "vaddpd %%xmm2, %%xmm3, %%xmm2 \n\t" "vmovlpd %%xmm1, (%%rdx) \n\t" "vmovhpd %%xmm1, (%%rdx,%%r9) \n\t" "vmovlpd %%xmm2, (%%rdx,%%r10) \n\t" "vmovhpd %%xmm2, (%%rdx,%%r11) \n\t" "# \n\t" "# Update C(1,8:11) \n\t" "# \n\t" "vmovlpd (%%rax), %%xmm1, %%xmm1 \n\t" "vmovhpd (%%rax,%%r9), %%xmm1, %%xmm1 \n\t" "vmovlpd (%%rax,%%r10), %%xmm2, %%xmm2 \n\t" "vmovhpd (%%rax,%%r11), %%xmm2, %%xmm2 \n\t" "vextractf128 $1, %%ymm13, %%xmm3 \n\t" "vmulpd %%xmm0, %%xmm1, %%xmm1 \n\t" "vaddpd %%xmm1, %%xmm13, %%xmm1 \n\t" "vmulpd %%xmm0, %%xmm2, %%xmm2 \n\t" "vaddpd %%xmm2, %%xmm3, %%xmm2 \n\t" "vmovlpd %%xmm1, (%%rax) \n\t" "vmovhpd %%xmm1, (%%rax,%%r9) \n\t" "vmovlpd %%xmm2, (%%rax,%%r10) \n\t" "vmovhpd %%xmm2, (%%rax,%%r11) \n\t" "# \n\t" "# Update C(2,0:3) \n\t" "# \n\t" "addq %%r8, %%rcx \n\t" "addq %%r8, %%rdx \n\t" "addq %%r8, %%rax \n\t" "vmovlpd (%%rcx), %%xmm1, %%xmm1 \n\t" "vmovhpd (%%rcx,%%r9), %%xmm1, %%xmm1 \n\t" "vmovlpd (%%rcx,%%r10), %%xmm2, %%xmm2 \n\t" "vmovhpd (%%rcx,%%r11), %%xmm2, %%xmm2 \n\t" "vextractf128 $1, %%ymm6, %%xmm3 \n\t" "vmulpd %%xmm0, %%xmm1, %%xmm1 \n\t" "vaddpd %%xmm1, %%xmm6, %%xmm1 \n\t" "vmulpd %%xmm0, %%xmm2, %%xmm2 \n\t" "vaddpd %%xmm2, %%xmm3, %%xmm2 \n\t" "vmovlpd %%xmm1, 
(%%rcx) \n\t" "vmovhpd %%xmm1, (%%rcx,%%r9) \n\t" "vmovlpd %%xmm2, (%%rcx,%%r10) \n\t" "vmovhpd %%xmm2, (%%rcx,%%r11) \n\t" "# \n\t" "# Update C(2,4:7) \n\t" "# \n\t" "vmovlpd (%%rdx), %%xmm1, %%xmm1 \n\t" "vmovhpd (%%rdx,%%r9), %%xmm1, %%xmm1 \n\t" "vmovlpd (%%rdx,%%r10), %%xmm2, %%xmm2 \n\t" "vmovhpd (%%rdx,%%r11), %%xmm2, %%xmm2 \n\t" "vextractf128 $1, %%ymm10, %%xmm3 \n\t" "vmulpd %%xmm0, %%xmm1, %%xmm1 \n\t" "vaddpd %%xmm1, %%xmm10, %%xmm1 \n\t" "vmulpd %%xmm0, %%xmm2, %%xmm2 \n\t" "vaddpd %%xmm2, %%xmm3, %%xmm2 \n\t" "vmovlpd %%xmm1, (%%rdx) \n\t" "vmovhpd %%xmm1, (%%rdx,%%r9) \n\t" "vmovlpd %%xmm2, (%%rdx,%%r10) \n\t" "vmovhpd %%xmm2, (%%rdx,%%r11) \n\t" "# \n\t" "# Update C(2,8:11) \n\t" "# \n\t" "vmovlpd (%%rax), %%xmm1, %%xmm1 \n\t" "vmovhpd (%%rax,%%r9), %%xmm1, %%xmm1 \n\t" "vmovlpd (%%rax,%%r10), %%xmm2, %%xmm2 \n\t" "vmovhpd (%%rax,%%r11), %%xmm2, %%xmm2 \n\t" "vextractf128 $1, %%ymm14, %%xmm3 \n\t" "vmulpd %%xmm0, %%xmm1, %%xmm1 \n\t" "vaddpd %%xmm1, %%xmm14, %%xmm1 \n\t" "vmulpd %%xmm0, %%xmm2, %%xmm2 \n\t" "vaddpd %%xmm2, %%xmm3, %%xmm2 \n\t" "vmovlpd %%xmm1, (%%rax) \n\t" "vmovhpd %%xmm1, (%%rax,%%r9) \n\t" "vmovlpd %%xmm2, (%%rax,%%r10) \n\t" "vmovhpd %%xmm2, (%%rax,%%r11) \n\t" "# \n\t" "# Update C(3,0:3) \n\t" "# \n\t" "addq %%r8, %%rcx \n\t" "addq %%r8, %%rdx \n\t" "addq %%r8, %%rax \n\t" "vmovlpd (%%rcx), %%xmm1, %%xmm1 \n\t" "vmovhpd (%%rcx,%%r9), %%xmm1, %%xmm1 \n\t" "vmovlpd (%%rcx,%%r10), %%xmm2, %%xmm2 \n\t" "vmovhpd (%%rcx,%%r11), %%xmm2, %%xmm2 \n\t" "vextractf128 $1, %%ymm7, %%xmm3 \n\t" "vmulpd %%xmm0, %%xmm1, %%xmm1 \n\t" "vaddpd %%xmm1, %%xmm7, %%xmm1 \n\t" "vmulpd %%xmm0, %%xmm2, %%xmm2 \n\t" "vaddpd %%xmm2, %%xmm3, %%xmm2 \n\t" "vmovlpd %%xmm1, (%%rcx) \n\t" "vmovhpd %%xmm1, (%%rcx,%%r9) \n\t" "vmovlpd %%xmm2, (%%rcx,%%r10) \n\t" "vmovhpd %%xmm2, (%%rcx,%%r11) \n\t" "# \n\t" "# Update C(3,4:7) \n\t" "# \n\t" "vmovlpd (%%rdx), %%xmm1, %%xmm1 \n\t" "vmovhpd (%%rdx,%%r9), %%xmm1, %%xmm1 \n\t" "vmovlpd (%%rdx,%%r10), %%xmm2, %%xmm2 \n\t" "vmovhpd (%%rdx,%%r11), %%xmm2, %%xmm2 \n\t" "vextractf128 $1, %%ymm11, %%xmm3 \n\t" "vmulpd %%xmm0, %%xmm1, %%xmm1 \n\t" "vaddpd %%xmm1, %%xmm11, %%xmm1 \n\t" "vmulpd %%xmm0, %%xmm2, %%xmm2 \n\t" "vaddpd %%xmm2, %%xmm3, %%xmm2 \n\t" "vmovlpd %%xmm1, (%%rdx) \n\t" "vmovhpd %%xmm1, (%%rdx,%%r9) \n\t" "vmovlpd %%xmm2, (%%rdx,%%r10) \n\t" "vmovhpd %%xmm2, (%%rdx,%%r11) \n\t" "# \n\t" "# Update C(3,8:11) \n\t" "# \n\t" "vmovlpd (%%rax), %%xmm1, %%xmm1 \n\t" "vmovhpd (%%rax,%%r9), %%xmm1, %%xmm1 \n\t" "vmovlpd (%%rax,%%r10), %%xmm2, %%xmm2 \n\t" "vmovhpd (%%rax,%%r11), %%xmm2, %%xmm2 \n\t" "vextractf128 $1, %%ymm15, %%xmm3 \n\t" "vmulpd %%xmm0, %%xmm1, %%xmm1 \n\t" "vaddpd %%xmm1, %%xmm15, %%xmm1 \n\t" "vmulpd %%xmm0, %%xmm2, %%xmm2 \n\t" "vaddpd %%xmm2, %%xmm3, %%xmm2 \n\t" "vmovlpd %%xmm1, (%%rax) \n\t" "vmovhpd %%xmm1, (%%rax,%%r9) \n\t" "vmovlpd %%xmm2, (%%rax,%%r10) \n\t" "vmovhpd %%xmm2, (%%rax,%%r11) \n\t" "jmp done%= \n\t" // case: beta == 0 "beta_zero%=: \n\t" "# \n\t" "# Update C(0,0:3) \n\t" "# \n\t" "vextractf128 $1, %%ymm4, %%xmm3 \n\t" "vmovlpd %%xmm4, (%%rcx) \n\t" "vmovhpd %%xmm4, (%%rcx,%%r9) \n\t" "vmovlpd %%xmm3, (%%rcx,%%r10) \n\t" "vmovhpd %%xmm3, (%%rcx,%%r11) \n\t" "# \n\t" "# Update C(0,4:7) \n\t" "# \n\t" "vextractf128 $1, %%ymm8, %%xmm3 \n\t" "vmovlpd %%xmm8, (%%rdx) \n\t" "vmovhpd %%xmm8, (%%rdx,%%r9) \n\t" "vmovlpd %%xmm3, (%%rdx,%%r10) \n\t" "vmovhpd %%xmm3, (%%rdx,%%r11) \n\t" "# \n\t" "# Update C(0,8:11) \n\t" "# \n\t" "vextractf128 $1, %%ymm12, %%xmm3 \n\t" "vmovlpd %%xmm12, (%%rax) \n\t" "vmovhpd 
%%xmm12, (%%rax,%%r9) \n\t" "vmovlpd %%xmm3, (%%rax,%%r10) \n\t" "vmovhpd %%xmm3, (%%rax,%%r11) \n\t" "# \n\t" "# Update C(1,0:3) \n\t" "# \n\t" "addq %%r8, %%rcx \n\t" "addq %%r8, %%rdx \n\t" "addq %%r8, %%rax \n\t" "vextractf128 $1, %%ymm5, %%xmm3 \n\t" "vmovlpd %%xmm5, (%%rcx) \n\t" "vmovhpd %%xmm5, (%%rcx,%%r9) \n\t" "vmovlpd %%xmm3, (%%rcx,%%r10) \n\t" "vmovhpd %%xmm3, (%%rcx,%%r11) \n\t" "# \n\t" "# Update C(1,4:7) \n\t" "# \n\t" "vextractf128 $1, %%ymm9, %%xmm3 \n\t" "vmovlpd %%xmm9, (%%rdx) \n\t" "vmovhpd %%xmm9, (%%rdx,%%r9) \n\t" "vmovlpd %%xmm3, (%%rdx,%%r10) \n\t" "vmovhpd %%xmm3, (%%rdx,%%r11) \n\t" "# \n\t" "# Update C(1,8:11) \n\t" "# \n\t" "vextractf128 $1, %%ymm13, %%xmm3 \n\t" "vmovlpd %%xmm13, (%%rax) \n\t" "vmovhpd %%xmm13, (%%rax,%%r9) \n\t" "vmovlpd %%xmm3, (%%rax,%%r10) \n\t" "vmovhpd %%xmm3, (%%rax,%%r11) \n\t" "# \n\t" "# Update C(2,0:3) \n\t" "# \n\t" "addq %%r8, %%rcx \n\t" "addq %%r8, %%rdx \n\t" "addq %%r8, %%rax \n\t" "vextractf128 $1, %%ymm6, %%xmm3 \n\t" "vmovlpd %%xmm6, (%%rcx) \n\t" "vmovhpd %%xmm6, (%%rcx,%%r9) \n\t" "vmovlpd %%xmm3, (%%rcx,%%r10) \n\t" "vmovhpd %%xmm3, (%%rcx,%%r11) \n\t" "# \n\t" "# Update C(2,4:7) \n\t" "# \n\t" "vextractf128 $1, %%ymm10, %%xmm3 \n\t" "vmovlpd %%xmm10, (%%rdx) \n\t" "vmovhpd %%xmm10, (%%rdx,%%r9) \n\t" "vmovlpd %%xmm3, (%%rdx,%%r10) \n\t" "vmovhpd %%xmm3, (%%rdx,%%r11) \n\t" "# \n\t" "# Update C(2,8:11) \n\t" "# \n\t" "vextractf128 $1, %%ymm14, %%xmm3 \n\t" "vmovlpd %%xmm14, (%%rax) \n\t" "vmovhpd %%xmm14, (%%rax,%%r9) \n\t" "vmovlpd %%xmm3, (%%rax,%%r10) \n\t" "vmovhpd %%xmm3, (%%rax,%%r11) \n\t" "# \n\t" "# Update C(3,0:3) \n\t" "# \n\t" "addq %%r8, %%rcx \n\t" "addq %%r8, %%rdx \n\t" "addq %%r8, %%rax \n\t" "vextractf128 $1, %%ymm7, %%xmm3 \n\t" "vmovlpd %%xmm7, (%%rcx) \n\t" "vmovhpd %%xmm7, (%%rcx,%%r9) \n\t" "vmovlpd %%xmm3, (%%rcx,%%r10) \n\t" "vmovhpd %%xmm3, (%%rcx,%%r11) \n\t" "# \n\t" "# Update C(3,4:7) \n\t" "# \n\t" "vextractf128 $1, %%ymm11, %%xmm3 \n\t" "vmovlpd %%xmm11, (%%rdx) \n\t" "vmovhpd %%xmm11, (%%rdx,%%r9) \n\t" "vmovlpd %%xmm3, (%%rdx,%%r10) \n\t" "vmovhpd %%xmm3, (%%rdx,%%r11) \n\t" "# \n\t" "# Update C(3,8:11) \n\t" "# \n\t" "vextractf128 $1, %%ymm15, %%xmm3 \n\t" "vmovlpd %%xmm15, (%%rax) \n\t" "vmovhpd %%xmm15, (%%rax,%%r9) \n\t" "vmovlpd %%xmm3, (%%rax,%%r10) \n\t" "vmovhpd %%xmm3, (%%rax,%%r11) \n\t" "done%=: \n\t" : // output : // input "m" (kc), // 0 "m" (A), // 1 "m" (B), // 2 "m" (pAlpha), // 3 "m" (pBeta), // 4 "m" (C), // 5 "m" (incRowC), // 6 "m" (incColC) // 7 : // register clobber list "rax", "rcx", "rdx", "rsi", "rdi", "r8", "r9", "r10", "r11", "r12", "r13", "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7", "xmm8", "xmm9", "xmm10", "xmm11", "xmm12", "xmm13", "xmm14", "xmm15", "memory" ); } #endif
Micro-Kernel with GCC Extensions (very simple)
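The listing of the original gccvec.h is not reproduced here. As a rough sketch (my own illustration, not the author's gccvec.h), a micro-kernel based on GCC vector extensions for double precision could look as follows. It assumes the default block sizes MR = 4 and NR = 8 with 256-bit vectors (vlen = 4) and, like the other kernel headers, is meant to be included from gemm.h inside namespace foo:

// NOTE: sketch only, not the original gccvec.h.
#ifndef GCCVEC_SKETCH_HPP
#define GCCVEC_SKETCH_HPP

#include "gemm.h"
#include <type_traits>

typedef double vec4d __attribute__((vector_size(32)));   // vector of 4 doubles

template <typename Index>
typename std::enable_if<std::is_convertible<Index, std::int64_t>::value
                     && BlockSize<double>::MR==4
                     && BlockSize<double>::NR==8
                     && BlockSize<double>::vlen==4,
         void>::type
ugemm(Index kc, double alpha, const double *A, const double *B, double beta,
      double *C, Index incRowC, Index incColC,
      const double *, const double *)
{
    const Index MR = 4;
    const Index NR = 8;

    // Accumulate the 4x8 block in eight vectors (two vectors per row).
    vec4d P[8] = {};

    for (Index l=0; l<kc; ++l) {
        const vec4d b0 = {B[l*NR+0], B[l*NR+1], B[l*NR+2], B[l*NR+3]};
        const vec4d b1 = {B[l*NR+4], B[l*NR+5], B[l*NR+6], B[l*NR+7]};
        for (Index i=0; i<MR; ++i) {
            const double a  = A[l*MR+i];
            const vec4d  a4 = {a, a, a, a};   // broadcast a
            P[2*i+0] += a4*b0;
            P[2*i+1] += a4*b1;
        }
    }

    // C <- beta*C + alpha*P
    for (Index i=0; i<MR; ++i) {
        for (Index j=0; j<NR; ++j) {
            C[i*incRowC+j*incColC] *= beta;
            C[i*incRowC+j*incColC] += alpha*P[2*i+j/4][j%4];
        }
    }
}

#endif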
Micro-Kernel from BLIS (slightly adapted)
/* BLIS An object-based framework for developing high-performance BLAS-like libraries. Copyright (C) 2014, The University of Texas at Austin Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: - Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. - Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. - Neither the name of The University of Texas at Austin nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ /* NOTE: The micro-kernels in this file were partially inspired by portions of code found in OpenBLAS 0.2.8 (http://www.openblas.net/). -FGVZ */ #ifndef BLISAVX_HPP #define BLISAVX_HPP #include "gemm.h" #include <type_traits> template <typename Index> typename std::enable_if<std::is_convertible<Index, std::int64_t>::value && BlockSize<double>::MR==8 && BlockSize<double>::NR==4 && BlockSize<double>::align==32, void>::type ugemm(Index kc_, double alpha_, const double *A, const double *B, double beta_, double *C, Index incRowC, Index incColC, const double * /* a_next */, const double *b_next) { int64_t k = kc_; int64_t k_iter = k / 4; int64_t k_left = k % 4; int64_t rs_c = incRowC; int64_t cs_c = incColC; const double *alpha = &alpha_; const double *beta = &beta_; __asm__ volatile ( " \n\t" " \n\t" "movq %2, %%rax \n\t" // load address of a. "movq %3, %%rbx \n\t" // load address of b. "movq %9, %%r15 \n\t" // load address of b_next. //"movq %10, %%r14 \n\t" // load address of a_next. "addq $-4 * 64, %%r15 \n\t" " \n\t" "vmovapd 0 * 32(%%rax), %%ymm0 \n\t" // initialize loop by pre-loading "vmovapd 0 * 32(%%rbx), %%ymm2 \n\t" // elements of a and b. 
"vpermilpd $0x5, %%ymm2, %%ymm3 \n\t" " \n\t" "movq %6, %%rcx \n\t" // load address of c "movq %8, %%rdi \n\t" // load cs_c "leaq (,%%rdi,8), %%rdi \n\t" // cs_c *= sizeof(double) "leaq (%%rcx,%%rdi,2), %%r10 \n\t" // load address of c + 2*cs_c; " \n\t" "prefetcht0 3 * 8(%%rcx) \n\t" // prefetch c + 0*cs_c "prefetcht0 3 * 8(%%rcx,%%rdi) \n\t" // prefetch c + 1*cs_c "prefetcht0 3 * 8(%%r10) \n\t" // prefetch c + 2*cs_c "prefetcht0 3 * 8(%%r10,%%rdi) \n\t" // prefetch c + 3*cs_c " \n\t" "vxorpd %%ymm8, %%ymm8, %%ymm8 \n\t" "vxorpd %%ymm9, %%ymm9, %%ymm9 \n\t" "vxorpd %%ymm10, %%ymm10, %%ymm10 \n\t" "vxorpd %%ymm11, %%ymm11, %%ymm11 \n\t" "vxorpd %%ymm12, %%ymm12, %%ymm12 \n\t" "vxorpd %%ymm13, %%ymm13, %%ymm13 \n\t" "vxorpd %%ymm14, %%ymm14, %%ymm14 \n\t" "vxorpd %%ymm15, %%ymm15, %%ymm15 \n\t" " \n\t" " \n\t" " \n\t" "movq %0, %%rsi \n\t" // i = k_iter; "testq %%rsi, %%rsi \n\t" // check i via logical AND. "je .DCONSIDKLEFT%= \n\t" // if i == 0, jump to code that " \n\t" // contains the k_left loop. " \n\t" " \n\t" ".DLOOPKITER%=: \n\t" // MAIN LOOP " \n\t" "addq $4 * 4 * 8, %%r15 \n\t" // b_next += 4*4 (unroll x nr) " \n\t" " \n\t" // iteration 0 "vmovapd 1 * 32(%%rax), %%ymm1 \n\t" "vmulpd %%ymm0, %%ymm2, %%ymm6 \n\t" "vperm2f128 $0x3, %%ymm2, %%ymm2, %%ymm4 \n\t" "vmulpd %%ymm0, %%ymm3, %%ymm7 \n\t" "vperm2f128 $0x3, %%ymm3, %%ymm3, %%ymm5 \n\t" "vaddpd %%ymm15, %%ymm6, %%ymm15 \n\t" "vaddpd %%ymm13, %%ymm7, %%ymm13 \n\t" " \n\t" "prefetcht0 16 * 32(%%rax) \n\t" "vmulpd %%ymm1, %%ymm2, %%ymm6 \n\t" "vmovapd 1 * 32(%%rbx), %%ymm2 \n\t" "vmulpd %%ymm1, %%ymm3, %%ymm7 \n\t" "vpermilpd $0x5, %%ymm2, %%ymm3 \n\t" "vaddpd %%ymm14, %%ymm6, %%ymm14 \n\t" "vaddpd %%ymm12, %%ymm7, %%ymm12 \n\t" " \n\t" "vmulpd %%ymm0, %%ymm4, %%ymm6 \n\t" "vmulpd %%ymm0, %%ymm5, %%ymm7 \n\t" "vmovapd 2 * 32(%%rax), %%ymm0 \n\t" "vaddpd %%ymm11, %%ymm6, %%ymm11 \n\t" "vaddpd %%ymm9, %%ymm7, %%ymm9 \n\t" "prefetcht0 0 * 32(%%r15) \n\t" // prefetch b_next[0*4] " \n\t" "vmulpd %%ymm1, %%ymm4, %%ymm6 \n\t" "vmulpd %%ymm1, %%ymm5, %%ymm7 \n\t" "vaddpd %%ymm10, %%ymm6, %%ymm10 \n\t" "vaddpd %%ymm8, %%ymm7, %%ymm8 \n\t" " \n\t" " \n\t" " \n\t" // iteration 1 "vmovapd 3 * 32(%%rax), %%ymm1 \n\t" "vmulpd %%ymm0, %%ymm2, %%ymm6 \n\t" "vperm2f128 $0x3, %%ymm2, %%ymm2, %%ymm4 \n\t" "vmulpd %%ymm0, %%ymm3, %%ymm7 \n\t" "vperm2f128 $0x3, %%ymm3, %%ymm3, %%ymm5 \n\t" "vaddpd %%ymm15, %%ymm6, %%ymm15 \n\t" "vaddpd %%ymm13, %%ymm7, %%ymm13 \n\t" " \n\t" "prefetcht0 18 * 32(%%rax) \n\t" "vmulpd %%ymm1, %%ymm2, %%ymm6 \n\t" "vmovapd 2 * 32(%%rbx), %%ymm2 \n\t" "vmulpd %%ymm1, %%ymm3, %%ymm7 \n\t" "vpermilpd $0x5, %%ymm2, %%ymm3 \n\t" "vaddpd %%ymm14, %%ymm6, %%ymm14 \n\t" "vaddpd %%ymm12, %%ymm7, %%ymm12 \n\t" " \n\t" "vmulpd %%ymm0, %%ymm4, %%ymm6 \n\t" "vmulpd %%ymm0, %%ymm5, %%ymm7 \n\t" "vmovapd 4 * 32(%%rax), %%ymm0 \n\t" "vaddpd %%ymm11, %%ymm6, %%ymm11 \n\t" "vaddpd %%ymm9, %%ymm7, %%ymm9 \n\t" " \n\t" "vmulpd %%ymm1, %%ymm4, %%ymm6 \n\t" "vmulpd %%ymm1, %%ymm5, %%ymm7 \n\t" "vaddpd %%ymm10, %%ymm6, %%ymm10 \n\t" "vaddpd %%ymm8, %%ymm7, %%ymm8 \n\t" " \n\t" " \n\t" " \n\t" // iteration 2 "vmovapd 5 * 32(%%rax), %%ymm1 \n\t" "vmulpd %%ymm0, %%ymm2, %%ymm6 \n\t" "vperm2f128 $0x3, %%ymm2, %%ymm2, %%ymm4 \n\t" "vmulpd %%ymm0, %%ymm3, %%ymm7 \n\t" "vperm2f128 $0x3, %%ymm3, %%ymm3, %%ymm5 \n\t" "vaddpd %%ymm15, %%ymm6, %%ymm15 \n\t" "vaddpd %%ymm13, %%ymm7, %%ymm13 \n\t" " \n\t" "prefetcht0 20 * 32(%%rax) \n\t" "vmulpd %%ymm1, %%ymm2, %%ymm6 \n\t" "vmovapd 3 * 32(%%rbx), %%ymm2 \n\t" "addq $4 * 4 * 8, %%rbx \n\t" // b += 4*4 (unroll 
x nr) "vmulpd %%ymm1, %%ymm3, %%ymm7 \n\t" "vpermilpd $0x5, %%ymm2, %%ymm3 \n\t" "vaddpd %%ymm14, %%ymm6, %%ymm14 \n\t" "vaddpd %%ymm12, %%ymm7, %%ymm12 \n\t" " \n\t" "vmulpd %%ymm0, %%ymm4, %%ymm6 \n\t" "vmulpd %%ymm0, %%ymm5, %%ymm7 \n\t" "vmovapd 6 * 32(%%rax), %%ymm0 \n\t" "vaddpd %%ymm11, %%ymm6, %%ymm11 \n\t" "vaddpd %%ymm9, %%ymm7, %%ymm9 \n\t" "prefetcht0 2 * 32(%%r15) \n\t" // prefetch b_next[2*4] " \n\t" "vmulpd %%ymm1, %%ymm4, %%ymm6 \n\t" "vmulpd %%ymm1, %%ymm5, %%ymm7 \n\t" "vaddpd %%ymm10, %%ymm6, %%ymm10 \n\t" "vaddpd %%ymm8, %%ymm7, %%ymm8 \n\t" " \n\t" " \n\t" " \n\t" // iteration 3 "vmovapd 7 * 32(%%rax), %%ymm1 \n\t" "addq $4 * 8 * 8, %%rax \n\t" // a += 4*8 (unroll x mr) "vmulpd %%ymm0, %%ymm2, %%ymm6 \n\t" "vperm2f128 $0x3, %%ymm2, %%ymm2, %%ymm4 \n\t" "vmulpd %%ymm0, %%ymm3, %%ymm7 \n\t" "vperm2f128 $0x3, %%ymm3, %%ymm3, %%ymm5 \n\t" "vaddpd %%ymm15, %%ymm6, %%ymm15 \n\t" "vaddpd %%ymm13, %%ymm7, %%ymm13 \n\t" " \n\t" //"prefetcht0 22 * 32(%%rax) \n\t" "prefetcht0 14 * 32(%%rax) \n\t" "vmulpd %%ymm1, %%ymm2, %%ymm6 \n\t" "vmovapd 0 * 32(%%rbx), %%ymm2 \n\t" "vmulpd %%ymm1, %%ymm3, %%ymm7 \n\t" "vpermilpd $0x5, %%ymm2, %%ymm3 \n\t" "vaddpd %%ymm14, %%ymm6, %%ymm14 \n\t" "vaddpd %%ymm12, %%ymm7, %%ymm12 \n\t" " \n\t" "vmulpd %%ymm0, %%ymm4, %%ymm6 \n\t" "vmulpd %%ymm0, %%ymm5, %%ymm7 \n\t" "vmovapd 0 * 32(%%rax), %%ymm0 \n\t" "vaddpd %%ymm11, %%ymm6, %%ymm11 \n\t" "vaddpd %%ymm9, %%ymm7, %%ymm9 \n\t" " \n\t" "vmulpd %%ymm1, %%ymm4, %%ymm6 \n\t" "vmulpd %%ymm1, %%ymm5, %%ymm7 \n\t" "vaddpd %%ymm10, %%ymm6, %%ymm10 \n\t" "vaddpd %%ymm8, %%ymm7, %%ymm8 \n\t" " \n\t" " \n\t" " \n\t" //"addq $4 * 8 * 8, %%rax \n\t" // a += 4*8 (unroll x mr) //"addq $4 * 4 * 8, %%rbx \n\t" // b += 4*4 (unroll x nr) " \n\t" "decq %%rsi \n\t" // i -= 1; "jne .DLOOPKITER%= \n\t" // iterate again if i != 0. " \n\t" " \n\t" " \n\t" " \n\t" " \n\t" " \n\t" ".DCONSIDKLEFT%=: \n\t" " \n\t" "movq %1, %%rsi \n\t" // i = k_left; "testq %%rsi, %%rsi \n\t" // check i via logical AND. "je .DPOSTACCUM%= \n\t" // if i == 0, we're done; jump to end. " \n\t" // else, we prepare to enter k_left loop. " \n\t" " \n\t" ".DLOOPKLEFT%=: \n\t" // EDGE LOOP " \n\t" "vmovapd 1 * 32(%%rax), %%ymm1 \n\t" "addq $8 * 1 * 8, %%rax \n\t" // a += 8 (1 x mr) "vmulpd %%ymm0, %%ymm2, %%ymm6 \n\t" "vperm2f128 $0x3, %%ymm2, %%ymm2, %%ymm4 \n\t" "vmulpd %%ymm0, %%ymm3, %%ymm7 \n\t" "vperm2f128 $0x3, %%ymm3, %%ymm3, %%ymm5 \n\t" "vaddpd %%ymm15, %%ymm6, %%ymm15 \n\t" "vaddpd %%ymm13, %%ymm7, %%ymm13 \n\t" " \n\t" "prefetcht0 14 * 32(%%rax) \n\t" "vmulpd %%ymm1, %%ymm2, %%ymm6 \n\t" "vmovapd 1 * 32(%%rbx), %%ymm2 \n\t" "addq $4 * 1 * 8, %%rbx \n\t" // b += 4 (1 x nr) "vmulpd %%ymm1, %%ymm3, %%ymm7 \n\t" "vpermilpd $0x5, %%ymm2, %%ymm3 \n\t" "vaddpd %%ymm14, %%ymm6, %%ymm14 \n\t" "vaddpd %%ymm12, %%ymm7, %%ymm12 \n\t" " \n\t" "vmulpd %%ymm0, %%ymm4, %%ymm6 \n\t" "vmulpd %%ymm0, %%ymm5, %%ymm7 \n\t" "vmovapd 0 * 32(%%rax), %%ymm0 \n\t" "vaddpd %%ymm11, %%ymm6, %%ymm11 \n\t" "vaddpd %%ymm9, %%ymm7, %%ymm9 \n\t" " \n\t" "vmulpd %%ymm1, %%ymm4, %%ymm6 \n\t" "vmulpd %%ymm1, %%ymm5, %%ymm7 \n\t" "vaddpd %%ymm10, %%ymm6, %%ymm10 \n\t" "vaddpd %%ymm8, %%ymm7, %%ymm8 \n\t" " \n\t" " \n\t" "decq %%rsi \n\t" // i -= 1; "jne .DLOOPKLEFT%= \n\t" // iterate again if i != 0. 
" \n\t" " \n\t" " \n\t" ".DPOSTACCUM%=: \n\t" " \n\t" " \n\t" " \n\t" // ymm15: ymm13: ymm11: ymm9: " \n\t" // ( ab00 ( ab01 ( ab02 ( ab03 " \n\t" // ab11 ab10 ab13 ab12 " \n\t" // ab22 ab23 ab20 ab21 " \n\t" // ab33 ) ab32 ) ab31 ) ab30 ) " \n\t" " \n\t" // ymm14: ymm12: ymm10: ymm8: " \n\t" // ( ab40 ( ab41 ( ab42 ( ab43 " \n\t" // ab51 ab50 ab53 ab52 " \n\t" // ab62 ab63 ab60 ab61 " \n\t" // ab73 ) ab72 ) ab71 ) ab70 ) " \n\t" "vmovapd %%ymm15, %%ymm7 \n\t" "vshufpd $0xa, %%ymm15, %%ymm13, %%ymm15 \n\t" "vshufpd $0xa, %%ymm13, %%ymm7, %%ymm13 \n\t" " \n\t" "vmovapd %%ymm11, %%ymm7 \n\t" "vshufpd $0xa, %%ymm11, %%ymm9, %%ymm11 \n\t" "vshufpd $0xa, %%ymm9, %%ymm7, %%ymm9 \n\t" " \n\t" "vmovapd %%ymm14, %%ymm7 \n\t" "vshufpd $0xa, %%ymm14, %%ymm12, %%ymm14 \n\t" "vshufpd $0xa, %%ymm12, %%ymm7, %%ymm12 \n\t" " \n\t" "vmovapd %%ymm10, %%ymm7 \n\t" "vshufpd $0xa, %%ymm10, %%ymm8, %%ymm10 \n\t" "vshufpd $0xa, %%ymm8, %%ymm7, %%ymm8 \n\t" " \n\t" " \n\t" // ymm15: ymm13: ymm11: ymm9: " \n\t" // ( ab01 ( ab00 ( ab03 ( ab02 " \n\t" // ab11 ab10 ab13 ab12 " \n\t" // ab23 ab22 ab21 ab20 " \n\t" // ab33 ) ab32 ) ab31 ) ab30 ) " \n\t" " \n\t" // ymm14: ymm12: ymm10: ymm8: " \n\t" // ( ab41 ( ab40 ( ab43 ( ab42 " \n\t" // ab51 ab50 ab53 ab52 " \n\t" // ab63 ab62 ab61 ab60 " \n\t" // ab73 ) ab72 ) ab71 ) ab70 ) " \n\t" "vmovapd %%ymm15, %%ymm7 \n\t" "vperm2f128 $0x30, %%ymm15, %%ymm11, %%ymm15 \n\t" "vperm2f128 $0x12, %%ymm7, %%ymm11, %%ymm11 \n\t" " \n\t" "vmovapd %%ymm13, %%ymm7 \n\t" "vperm2f128 $0x30, %%ymm13, %%ymm9, %%ymm13 \n\t" "vperm2f128 $0x12, %%ymm7, %%ymm9, %%ymm9 \n\t" " \n\t" "vmovapd %%ymm14, %%ymm7 \n\t" "vperm2f128 $0x30, %%ymm14, %%ymm10, %%ymm14 \n\t" "vperm2f128 $0x12, %%ymm7, %%ymm10, %%ymm10 \n\t" " \n\t" "vmovapd %%ymm12, %%ymm7 \n\t" "vperm2f128 $0x30, %%ymm12, %%ymm8, %%ymm12 \n\t" "vperm2f128 $0x12, %%ymm7, %%ymm8, %%ymm8 \n\t" " \n\t" " \n\t" // ymm9: ymm11: ymm13: ymm15: " \n\t" // ( ab00 ( ab01 ( ab02 ( ab03 " \n\t" // ab10 ab11 ab12 ab13 " \n\t" // ab20 ab21 ab22 ab23 " \n\t" // ab30 ) ab31 ) ab32 ) ab33 ) " \n\t" " \n\t" // ymm8: ymm10: ymm12: ymm14: " \n\t" // ( ab40 ( ab41 ( ab42 ( ab43 " \n\t" // ab50 ab51 ab52 ab53 " \n\t" // ab60 ab61 ab62 ab63 " \n\t" // ab70 ) ab71 ) ab72 ) ab73 ) " \n\t" " \n\t" "movq %4, %%rax \n\t" // load address of alpha "movq %5, %%rbx \n\t" // load address of beta "vbroadcastsd (%%rax), %%ymm0 \n\t" // load alpha and duplicate "vbroadcastsd (%%rbx), %%ymm2 \n\t" // load beta and duplicate " \n\t" "vmulpd %%ymm0, %%ymm8, %%ymm8 \n\t" // scale by alpha "vmulpd %%ymm0, %%ymm9, %%ymm9 \n\t" "vmulpd %%ymm0, %%ymm10, %%ymm10 \n\t" "vmulpd %%ymm0, %%ymm11, %%ymm11 \n\t" "vmulpd %%ymm0, %%ymm12, %%ymm12 \n\t" "vmulpd %%ymm0, %%ymm13, %%ymm13 \n\t" "vmulpd %%ymm0, %%ymm14, %%ymm14 \n\t" "vmulpd %%ymm0, %%ymm15, %%ymm15 \n\t" " \n\t" " \n\t" " \n\t" " \n\t" " \n\t" " \n\t" "movq %7, %%rsi \n\t" // load rs_c "leaq (,%%rsi,8), %%rsi \n\t" // rsi = rs_c * sizeof(double) " \n\t" "leaq (%%rcx,%%rsi,4), %%rdx \n\t" // load address of c + 4*rs_c; " \n\t" "leaq (,%%rsi,2), %%r12 \n\t" // r12 = 2*rs_c; "leaq (%%r12,%%rsi,1), %%r13 \n\t" // r13 = 3*rs_c; " \n\t" " \n\t" " \n\t" " \n\t" // determine if " \n\t" // c % 32 == 0, AND " \n\t" // 8*cs_c % 32 == 0, AND " \n\t" // rs_c == 1 " \n\t" // ie: aligned, ldim aligned, and " \n\t" // column-stored " \n\t" "cmpq $8, %%rsi \n\t" // set ZF if (8*rs_c) == 8. "sete %%bl \n\t" // bl = ( ZF == 1 ? 1 : 0 ); "testq $31, %%rcx \n\t" // set ZF if c & 32 is zero. "setz %%bh \n\t" // bh = ( ZF == 0 ? 
1 : 0 ); "testq $31, %%rdi \n\t" // set ZF if (8*cs_c) & 32 is zero. "setz %%al \n\t" // al = ( ZF == 0 ? 1 : 0 ); " \n\t" // and(bl,bh) followed by " \n\t" // and(bh,al) will reveal result " \n\t" " \n\t" // now avoid loading C if beta == 0 " \n\t" "vxorpd %%ymm0, %%ymm0, %%ymm0 \n\t" // set ymm0 to zero. "vucomisd %%xmm0, %%xmm2 \n\t" // set ZF if beta == 0. "je .DBETAZERO%= \n\t" // if ZF = 1, jump to beta == 0 case " \n\t" " \n\t" " \n\t" // check if aligned/column-stored "andb %%bl, %%bh \n\t" // set ZF if bl & bh == 1. "andb %%bh, %%al \n\t" // set ZF if bh & al == 1. "jne .DCOLSTORED%= \n\t" // jump to column storage case " \n\t" " \n\t" " \n\t" ".DGENSTORED%=: \n\t" " \n\t" // update c00:c33 " \n\t" "vextractf128 $1, %%ymm9, %%xmm1 \n\t" "vmovlpd (%%rcx), %%xmm0, %%xmm0 \n\t" // load c00 and c10, "vmovhpd (%%rcx,%%rsi), %%xmm0, %%xmm0 \n\t" "vmulpd %%xmm2, %%xmm0, %%xmm0 \n\t" // scale by beta, "vaddpd %%xmm9, %%xmm0, %%xmm0 \n\t" // add the gemm result, "vmovlpd %%xmm0, (%%rcx) \n\t" // and store back to memory. "vmovhpd %%xmm0, (%%rcx,%%rsi) \n\t" "vmovlpd (%%rcx,%%r12), %%xmm0, %%xmm0 \n\t" // load c20 and c30, "vmovhpd (%%rcx,%%r13), %%xmm0, %%xmm0 \n\t" "vmulpd %%xmm2, %%xmm0, %%xmm0 \n\t" // scale by beta, "vaddpd %%xmm1, %%xmm0, %%xmm0 \n\t" // add the gemm result, "vmovlpd %%xmm0, (%%rcx,%%r12) \n\t" // and store back to memory. "vmovhpd %%xmm0, (%%rcx,%%r13) \n\t" "addq %%rdi, %%rcx \n\t" // c += cs_c; " \n\t" "vextractf128 $1, %%ymm11, %%xmm1 \n\t" "vmovlpd (%%rcx), %%xmm0, %%xmm0 \n\t" // load c01 and c11, "vmovhpd (%%rcx,%%rsi), %%xmm0, %%xmm0 \n\t" "vmulpd %%xmm2, %%xmm0, %%xmm0 \n\t" // scale by beta, "vaddpd %%xmm11, %%xmm0, %%xmm0 \n\t" // add the gemm result, "vmovlpd %%xmm0, (%%rcx) \n\t" // and store back to memory. "vmovhpd %%xmm0, (%%rcx,%%rsi) \n\t" "vmovlpd (%%rcx,%%r12), %%xmm0, %%xmm0 \n\t" // load c21 and c31, "vmovhpd (%%rcx,%%r13), %%xmm0, %%xmm0 \n\t" "vmulpd %%xmm2, %%xmm0, %%xmm0 \n\t" // scale by beta, "vaddpd %%xmm1, %%xmm0, %%xmm0 \n\t" // add the gemm result, "vmovlpd %%xmm0, (%%rcx,%%r12) \n\t" // and store back to memory. "vmovhpd %%xmm0, (%%rcx,%%r13) \n\t" "addq %%rdi, %%rcx \n\t" // c += cs_c; " \n\t" "vextractf128 $1, %%ymm13, %%xmm1 \n\t" "vmovlpd (%%rcx), %%xmm0, %%xmm0 \n\t" // load c02 and c12, "vmovhpd (%%rcx,%%rsi), %%xmm0, %%xmm0 \n\t" "vmulpd %%xmm2, %%xmm0, %%xmm0 \n\t" // scale by beta, "vaddpd %%xmm13, %%xmm0, %%xmm0 \n\t" // add the gemm result, "vmovlpd %%xmm0, (%%rcx) \n\t" // and store back to memory. "vmovhpd %%xmm0, (%%rcx,%%rsi) \n\t" "vmovlpd (%%rcx,%%r12), %%xmm0, %%xmm0 \n\t" // load c22 and c32, "vmovhpd (%%rcx,%%r13), %%xmm0, %%xmm0 \n\t" "vmulpd %%xmm2, %%xmm0, %%xmm0 \n\t" // scale by beta, "vaddpd %%xmm1, %%xmm0, %%xmm0 \n\t" // add the gemm result, "vmovlpd %%xmm0, (%%rcx,%%r12) \n\t" // and store back to memory. "vmovhpd %%xmm0, (%%rcx,%%r13) \n\t" "addq %%rdi, %%rcx \n\t" // c += cs_c; " \n\t" "vextractf128 $1, %%ymm15, %%xmm1 \n\t" "vmovlpd (%%rcx), %%xmm0, %%xmm0 \n\t" // load c03 and c13, "vmovhpd (%%rcx,%%rsi), %%xmm0, %%xmm0 \n\t" "vmulpd %%xmm2, %%xmm0, %%xmm0 \n\t" // scale by beta, "vaddpd %%xmm15, %%xmm0, %%xmm0 \n\t" // add the gemm result, "vmovlpd %%xmm0, (%%rcx) \n\t" // and store back to memory. 
"vmovhpd %%xmm0, (%%rcx,%%rsi) \n\t" "vmovlpd (%%rcx,%%r12), %%xmm0, %%xmm0 \n\t" // load c23 and c33, "vmovhpd (%%rcx,%%r13), %%xmm0, %%xmm0 \n\t" "vmulpd %%xmm2, %%xmm0, %%xmm0 \n\t" // scale by beta, "vaddpd %%xmm1, %%xmm0, %%xmm0 \n\t" // add the gemm result, "vmovlpd %%xmm0, (%%rcx,%%r12) \n\t" // and store back to memory. "vmovhpd %%xmm0, (%%rcx,%%r13) \n\t" " \n\t" " \n\t" // update c40:c73 " \n\t" "vextractf128 $1, %%ymm8, %%xmm1 \n\t" "vmovlpd (%%rdx), %%xmm0, %%xmm0 \n\t" // load c40 and c50, "vmovhpd (%%rdx,%%rsi), %%xmm0, %%xmm0 \n\t" "vmulpd %%xmm2, %%xmm0, %%xmm0 \n\t" // scale by beta, "vaddpd %%xmm8, %%xmm0, %%xmm0 \n\t" // add the gemm result, "vmovlpd %%xmm0, (%%rdx) \n\t" // and store back to memory. "vmovhpd %%xmm0, (%%rdx,%%rsi) \n\t" "vmovlpd (%%rdx,%%r12), %%xmm0, %%xmm0 \n\t" // load c60 and c70, "vmovhpd (%%rdx,%%r13), %%xmm0, %%xmm0 \n\t" "vmulpd %%xmm2, %%xmm0, %%xmm0 \n\t" // scale by beta, "vaddpd %%xmm1, %%xmm0, %%xmm0 \n\t" // add the gemm result, "vmovlpd %%xmm0, (%%rdx,%%r12) \n\t" // and store back to memory. "vmovhpd %%xmm0, (%%rdx,%%r13) \n\t" "addq %%rdi, %%rdx \n\t" // c += cs_c; " \n\t" "vextractf128 $1, %%ymm10, %%xmm1 \n\t" "vmovlpd (%%rdx), %%xmm0, %%xmm0 \n\t" // load c41 and c51, "vmovhpd (%%rdx,%%rsi), %%xmm0, %%xmm0 \n\t" "vmulpd %%xmm2, %%xmm0, %%xmm0 \n\t" // scale by beta, "vaddpd %%xmm10, %%xmm0, %%xmm0 \n\t" // add the gemm result, "vmovlpd %%xmm0, (%%rdx) \n\t" // and store back to memory. "vmovhpd %%xmm0, (%%rdx,%%rsi) \n\t" "vmovlpd (%%rdx,%%r12), %%xmm0, %%xmm0 \n\t" // load c61 and c71, "vmovhpd (%%rdx,%%r13), %%xmm0, %%xmm0 \n\t" "vmulpd %%xmm2, %%xmm0, %%xmm0 \n\t" // scale by beta, "vaddpd %%xmm1, %%xmm0, %%xmm0 \n\t" // add the gemm result, "vmovlpd %%xmm0, (%%rdx,%%r12) \n\t" // and store back to memory. "vmovhpd %%xmm0, (%%rdx,%%r13) \n\t" "addq %%rdi, %%rdx \n\t" // c += cs_c; " \n\t" "vextractf128 $1, %%ymm12, %%xmm1 \n\t" "vmovlpd (%%rdx), %%xmm0, %%xmm0 \n\t" // load c42 and c52, "vmovhpd (%%rdx,%%rsi), %%xmm0, %%xmm0 \n\t" "vmulpd %%xmm2, %%xmm0, %%xmm0 \n\t" // scale by beta, "vaddpd %%xmm12, %%xmm0, %%xmm0 \n\t" // add the gemm result, "vmovlpd %%xmm0, (%%rdx) \n\t" // and store back to memory. "vmovhpd %%xmm0, (%%rdx,%%rsi) \n\t" "vmovlpd (%%rdx,%%r12), %%xmm0, %%xmm0 \n\t" // load c62 and c72, "vmovhpd (%%rdx,%%r13), %%xmm0, %%xmm0 \n\t" "vmulpd %%xmm2, %%xmm0, %%xmm0 \n\t" // scale by beta, "vaddpd %%xmm1, %%xmm0, %%xmm0 \n\t" // add the gemm result, "vmovlpd %%xmm0, (%%rdx,%%r12) \n\t" // and store back to memory. "vmovhpd %%xmm0, (%%rdx,%%r13) \n\t" "addq %%rdi, %%rdx \n\t" // c += cs_c; " \n\t" "vextractf128 $1, %%ymm14, %%xmm1 \n\t" "vmovlpd (%%rdx), %%xmm0, %%xmm0 \n\t" // load c43 and c53, "vmovhpd (%%rdx,%%rsi), %%xmm0, %%xmm0 \n\t" "vmulpd %%xmm2, %%xmm0, %%xmm0 \n\t" // scale by beta, "vaddpd %%xmm14, %%xmm0, %%xmm0 \n\t" // add the gemm result, "vmovlpd %%xmm0, (%%rdx) \n\t" // and store back to memory. "vmovhpd %%xmm0, (%%rdx,%%rsi) \n\t" "vmovlpd (%%rdx,%%r12), %%xmm0, %%xmm0 \n\t" // load c63 and c73, "vmovhpd (%%rdx,%%r13), %%xmm0, %%xmm0 \n\t" "vmulpd %%xmm2, %%xmm0, %%xmm0 \n\t" // scale by beta, "vaddpd %%xmm1, %%xmm0, %%xmm0 \n\t" // add the gemm result, "vmovlpd %%xmm0, (%%rdx,%%r12) \n\t" // and store back to memory. "vmovhpd %%xmm0, (%%rdx,%%r13) \n\t" " \n\t" " \n\t" "jmp .DDONE%= \n\t" // jump to end. 
" \n\t" " \n\t" " \n\t" ".DCOLSTORED%=: \n\t" " \n\t" // update c00:c33 " \n\t" "vmovapd (%%rcx), %%ymm0 \n\t" // load c00:c30, "vmulpd %%ymm2, %%ymm0, %%ymm0 \n\t" // scale by beta, "vaddpd %%ymm9, %%ymm0, %%ymm0 \n\t" // add the gemm result, "vmovapd %%ymm0, (%%rcx) \n\t" // and store back to memory. "addq %%rdi, %%rcx \n\t" // c += cs_c; " \n\t" "vmovapd (%%rcx), %%ymm0 \n\t" // load c01:c31, "vmulpd %%ymm2, %%ymm0, %%ymm0 \n\t" // scale by beta, "vaddpd %%ymm11, %%ymm0, %%ymm0 \n\t" // add the gemm result, "vmovapd %%ymm0, (%%rcx) \n\t" // and store back to memory. "addq %%rdi, %%rcx \n\t" // c += cs_c; " \n\t" "vmovapd (%%rcx), %%ymm0 \n\t" // load c02:c32, "vmulpd %%ymm2, %%ymm0, %%ymm0 \n\t" // scale by beta, "vaddpd %%ymm13, %%ymm0, %%ymm0 \n\t" // add the gemm result, "vmovapd %%ymm0, (%%rcx) \n\t" // and store back to memory. "addq %%rdi, %%rcx \n\t" // c += cs_c; " \n\t" "vmovapd (%%rcx), %%ymm0 \n\t" // load c03:c33, "vmulpd %%ymm2, %%ymm0, %%ymm0 \n\t" // scale by beta, "vaddpd %%ymm15, %%ymm0, %%ymm0 \n\t" // add the gemm result, "vmovapd %%ymm0, (%%rcx) \n\t" // and store back to memory. " \n\t" " \n\t" // update c40:c73 " \n\t" "vmovapd (%%rdx), %%ymm0 \n\t" // load c40:c70, "vmulpd %%ymm2, %%ymm0, %%ymm0 \n\t" // scale by beta, "vaddpd %%ymm8, %%ymm0, %%ymm0 \n\t" // add the gemm result, "vmovapd %%ymm0, (%%rdx) \n\t" // and store back to memory. "addq %%rdi, %%rdx \n\t" // c += cs_c; " \n\t" "vmovapd (%%rdx), %%ymm0 \n\t" // load c41:c71, "vmulpd %%ymm2, %%ymm0, %%ymm0 \n\t" // scale by beta, "vaddpd %%ymm10, %%ymm0, %%ymm0 \n\t" // add the gemm result, "vmovapd %%ymm0, (%%rdx) \n\t" // and store back to memory. "addq %%rdi, %%rdx \n\t" // c += cs_c; " \n\t" "vmovapd (%%rdx), %%ymm0 \n\t" // load c42:c72, "vmulpd %%ymm2, %%ymm0, %%ymm0 \n\t" // scale by beta, "vaddpd %%ymm12, %%ymm0, %%ymm0 \n\t" // add the gemm result, "vmovapd %%ymm0, (%%rdx) \n\t" // and store back to memory. "addq %%rdi, %%rdx \n\t" // c += cs_c; " \n\t" "vmovapd (%%rdx), %%ymm0 \n\t" // load c43:c73, "vmulpd %%ymm2, %%ymm0, %%ymm0 \n\t" // scale by beta, "vaddpd %%ymm14, %%ymm0, %%ymm0 \n\t" // add the gemm result, "vmovapd %%ymm0, (%%rdx) \n\t" // and store back to memory. " \n\t" " \n\t" "jmp .DDONE%= \n\t" // jump to end. " \n\t" " \n\t" " \n\t" " \n\t" ".DBETAZERO%=: \n\t" " \n\t" // check if aligned/column-stored "andb %%bl, %%bh \n\t" // set ZF if bl & bh == 1. "andb %%bh, %%al \n\t" // set ZF if bh & al == 1. 
"jne .DCOLSTORBZ%= \n\t" // jump to column storage case " \n\t" " \n\t" " \n\t" ".DGENSTORBZ%=: \n\t" " \n\t" // update c00:c33 " \n\t" "vextractf128 $1, %%ymm9, %%xmm1 \n\t" "vmovlpd %%xmm9, (%%rcx) \n\t" // store to c00:c30 "vmovhpd %%xmm9, (%%rcx,%%rsi) \n\t" "vmovlpd %%xmm1, (%%rcx,%%r12) \n\t" "vmovhpd %%xmm1, (%%rcx,%%r13) \n\t" "addq %%rdi, %%rcx \n\t" // c += cs_c; " \n\t" "vextractf128 $1, %%ymm11, %%xmm1 \n\t" "vmovlpd %%xmm11, (%%rcx) \n\t" // store to c01:c31 "vmovhpd %%xmm11, (%%rcx,%%rsi) \n\t" "vmovlpd %%xmm1, (%%rcx,%%r12) \n\t" "vmovhpd %%xmm1, (%%rcx,%%r13) \n\t" "addq %%rdi, %%rcx \n\t" // c += cs_c; " \n\t" "vextractf128 $1, %%ymm13, %%xmm1 \n\t" "vmovlpd %%xmm13, (%%rcx) \n\t" // store to c02:c32 "vmovhpd %%xmm13, (%%rcx,%%rsi) \n\t" "vmovlpd %%xmm1, (%%rcx,%%r12) \n\t" "vmovhpd %%xmm1, (%%rcx,%%r13) \n\t" "addq %%rdi, %%rcx \n\t" // c += cs_c; " \n\t" "vextractf128 $1, %%ymm15, %%xmm1 \n\t" "vmovlpd %%xmm15, (%%rcx) \n\t" // store to c03:c33 "vmovhpd %%xmm15, (%%rcx,%%rsi) \n\t" "vmovlpd %%xmm1, (%%rcx,%%r12) \n\t" "vmovhpd %%xmm1, (%%rcx,%%r13) \n\t" " \n\t" " \n\t" // update c40:c73 " \n\t" "vextractf128 $1, %%ymm8, %%xmm1 \n\t" "vmovlpd %%xmm8, (%%rdx) \n\t" // store to c40:c70 "vmovhpd %%xmm8, (%%rdx,%%rsi) \n\t" "vmovlpd %%xmm1, (%%rdx,%%r12) \n\t" "vmovhpd %%xmm1, (%%rdx,%%r13) \n\t" "addq %%rdi, %%rdx \n\t" // c += cs_c; " \n\t" "vextractf128 $1, %%ymm10, %%xmm1 \n\t" "vmovlpd %%xmm10, (%%rdx) \n\t" // store to c41:c71 "vmovhpd %%xmm10, (%%rdx,%%rsi) \n\t" "vmovlpd %%xmm1, (%%rdx,%%r12) \n\t" "vmovhpd %%xmm1, (%%rdx,%%r13) \n\t" "addq %%rdi, %%rdx \n\t" // c += cs_c; " \n\t" "vextractf128 $1, %%ymm12, %%xmm1 \n\t" "vmovlpd %%xmm12, (%%rdx) \n\t" // store to c42:c72 "vmovhpd %%xmm12, (%%rdx,%%rsi) \n\t" "vmovlpd %%xmm1, (%%rdx,%%r12) \n\t" "vmovhpd %%xmm1, (%%rdx,%%r13) \n\t" "addq %%rdi, %%rdx \n\t" // c += cs_c; " \n\t" "vextractf128 $1, %%ymm14, %%xmm1 \n\t" "vmovlpd %%xmm14, (%%rdx) \n\t" // store to c43:c73 "vmovhpd %%xmm14, (%%rdx,%%rsi) \n\t" "vmovlpd %%xmm1, (%%rdx,%%r12) \n\t" "vmovhpd %%xmm1, (%%rdx,%%r13) \n\t" " \n\t" " \n\t" "jmp .DDONE%= \n\t" // jump to end. " \n\t" " \n\t" " \n\t" ".DCOLSTORBZ%=: \n\t" " \n\t" // update c00:c33 " \n\t" "vmovapd %%ymm9, (%%rcx) \n\t" // store c00:c30 "addq %%rdi, %%rcx \n\t" // c += cs_c; " \n\t" "vmovapd %%ymm11, (%%rcx) \n\t" // store c01:c31 "addq %%rdi, %%rcx \n\t" // c += cs_c; " \n\t" "vmovapd %%ymm13, (%%rcx) \n\t" // store c02:c32 "addq %%rdi, %%rcx \n\t" // c += cs_c; " \n\t" "vmovapd %%ymm15, (%%rcx) \n\t" // store c03:c33 " \n\t" " \n\t" // update c40:c73 " \n\t" "vmovapd %%ymm8, (%%rdx) \n\t" // store c40:c70 "addq %%rdi, %%rdx \n\t" // c += cs_c; " \n\t" "vmovapd %%ymm10, (%%rdx) \n\t" // store c41:c71 "addq %%rdi, %%rdx \n\t" // c += cs_c; " \n\t" "vmovapd %%ymm12, (%%rdx) \n\t" // store c42:c72 "addq %%rdi, %%rdx \n\t" // c += cs_c; " \n\t" "vmovapd %%ymm14, (%%rdx) \n\t" // store c43:c73 " \n\t" " \n\t" " \n\t" " \n\t" " \n\t" ".DDONE%=: \n\t" " \n\t" : // output operands (none) : // input operands "m" (k_iter), // 0 "m" (k_left), // 1 "m" (A), // 2 "m" (B), // 3 "m" (alpha), // 4 "m" (beta), // 5 "m" (C), // 6 "m" (rs_c), // 7 "m" (cs_c), // 8 "m" (b_next)/*, // 9 "m" (a_next)*/ // 10 : // register clobber list "rax", "rbx", "rcx", "rdx", "rsi", "rdi", "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15", "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7", "xmm8", "xmm9", "xmm10", "xmm11", "xmm12", "xmm13", "xmm14", "xmm15", "memory" ); } #endif // BLISAVX_HPP
Source Code of the Benchmark Program: bench_gemm.cc
#include <iostream>
#include <blaze/Math.h>
#include <cassert>
#include <chrono>
#include <cmath>
#include <limits>
#include <random>

#include "gemm.h"

template <typename T>
struct WallTime
{
    void
    tic()
    {
        t0 = std::chrono::high_resolution_clock::now();
    }

    T
    toc()
    {
        using namespace std::chrono;

        elapsed = high_resolution_clock::now() - t0;
        return duration<T,seconds::period>(elapsed).count();
    }

    std::chrono::high_resolution_clock::time_point t0;
    std::chrono::high_resolution_clock::duration   elapsed;
};

// fill rectangular matrix with random values
template <typename MATRIX>
void
fill(MATRIX &A)
{
    typedef typename MATRIX::ElementType  T;

    std::random_device                  random;
    std::default_random_engine          mt(random());
    std::uniform_real_distribution<T>   uniform(-100,100);

    for (std::size_t i=0; i<(~A).rows(); ++i) {
        for (std::size_t j=0; j<(~A).columns(); ++j) {
            A(i,j) = uniform(mt);
        }
    }
}

template <typename MATRIX>
typename MATRIX::ElementType
asum(const MATRIX &A)
{
    typedef typename MATRIX::ElementType  T;

    T asum = 0;

    for (std::size_t i=0; i<A.rows(); ++i) {
        for (std::size_t j=0; j<A.columns(); ++j) {
            asum += std::abs(A(i,j));
        }
    }
    return asum;
}

template <typename MA, typename MB, typename MC0, typename MC1>
double
estimateGemmResidual(const MA &A, const MB &B,
                     const MC0 &C0, const MC1 &C1)
{
    typedef typename MC0::ElementType   TC0;

    std::size_t m = C1.rows();
    std::size_t n = C1.columns();
    std::size_t k = A.columns();

    double aNorm = asum(A);
    double bNorm = asum(B);
    double cNorm = asum(C1);
    double diff  = asum(C1-C0);

    // Using eps for double gives upper bound in case elements have lower
    // precision.
    double eps = std::numeric_limits<double>::epsilon();

    double res = diff/(aNorm*bNorm*cNorm*eps*std::max(std::max(m,n),k));
    return res;
}

#ifndef M_MAX
#define M_MAX 10000
#endif

#ifndef K_MAX
#define K_MAX 10000
#endif

#ifndef N_MAX
#define N_MAX 10000
#endif

#ifndef ALPHA
#define ALPHA 1
#endif

#ifndef BETA
#define BETA 0
#endif

#ifndef USE_SOA
#define USE_SOA blaze::rowMajor
#endif

#ifndef USE_SOB
#define USE_SOB blaze::rowMajor
#endif

#ifndef USE_SOC
#define USE_SOC blaze::rowMajor
#endif

int
main()
{
    typedef double   ElementType;

    constexpr auto SOA = USE_SOA;
    constexpr auto SOB = USE_SOB;
    constexpr auto SOC = USE_SOC;

    ElementType alpha = ALPHA;
    ElementType beta  = BETA;

    const std::size_t m_min = 100;
    const std::size_t k_min = 100;
    const std::size_t n_min = 100;

    const std::size_t m_max = M_MAX;
    const std::size_t k_max = K_MAX;
    const std::size_t n_max = N_MAX;

    const std::size_t m_inc = 100;
    const std::size_t k_inc = 100;
    const std::size_t n_inc = 100;

    std::cout << "# m";
    std::cout << " n";
    std::cout << " k";
    std::cout << " C++GEMM t1";
    std::cout << " MFLOPS";
    std::cout << " BLAZE/BLAS t2";
    std::cout << " MFLOPS";
    std::cout << " Residual ";
    std::cout << std::endl;

    WallTime<double> walltime;

    for (std::size_t m=m_min, k=k_min, n=n_min;
         m<=m_max && k<=k_max && n<=n_max;
         m += m_inc, k += k_inc, n += n_inc)
    {
        blaze::DynamicMatrix<double, SOA>   A(m, k);
        blaze::DynamicMatrix<double, SOB>   B(k, n);
        blaze::DynamicMatrix<double, SOC>   C1(m, n);
        blaze::DynamicMatrix<double, SOC>   C2(m, n);

        fill(A);
        fill(B);
        fill(C1);
        C2 = C1;

        walltime.tic();
        foo::gemm(alpha, A, B, beta, C1);
        double t1 = walltime.toc();

        walltime.tic();
#if BLAZE_BLAS_MODE==1
        blaze::dgemm(C2, A, B, alpha, beta);
#else
        if (beta==ElementType(0)) {
            C2 = alpha*A*B;
        } else if (beta==ElementType(1)) {
            C2 = C2 + alpha*A*B;
        } else {
            C2 = beta*C2;
            C2 += alpha*A*B;
        }
#endif
        double t2 = walltime.toc();

        double res = estimateGemmResidual(A, B, C1, C2);

        std::cout.width(5);  std::cout << m << " ";
        std::cout.width(5);  std::cout << n << " ";
        std::cout.width(5);  std::cout << k << " ";
        std::cout.width(12); std::cout << t1 << " ";
        std::cout.width(12); std::cout << 2.*m/1000.*n/1000.*k/t1 << " ";
        std::cout.width(15); std::cout << t2 << " ";
        std::cout.width(12); std::cout << 2.*m/1000.*n/1000.*k/t2 << " ";
        std::cout.width(15); std::cout << res;
        std::cout << std::endl;
    }
}
Benchmarks
Performance is compared with BLAZE linked against the Intel MKL. To compare with another BLAS implementation (e.g. ATLAS or OpenBLAS), simply link against that library instead.
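The Residual column in the tables below is the quantity computed by estimateGemmResidual in the benchmark program above:

\( \text{res} = \dfrac{\operatorname{asum}(C_1 - C_0)}{\operatorname{asum}(A)\,\operatorname{asum}(B)\,\operatorname{asum}(C_1)\,\varepsilon\,\max(m,n,k)} \)

where \(\operatorname{asum}\) denotes the sum of the absolute values of all matrix elements and \(\varepsilon\) is the machine epsilon for double. A residual of order one or smaller can be read as agreement of the two results up to the expected rounding error.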
Hardware used
Note that this machine supports AVX but not FMA:
$shell> cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 58 model name : Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz stepping : 9 microcode : 0x1b cpu MHz : 1600.000 cache size : 6144 KB physical id : 0 siblings : 4 core id : 0 cpu cores : 4 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms bogomips : 6384.79 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 58 model name : Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz stepping : 9 microcode : 0x1b cpu MHz : 1600.000 cache size : 6144 KB physical id : 0 siblings : 4 core id : 1 cpu cores : 4 apicid : 2 initial apicid : 2 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms bogomips : 6385.20 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: processor : 2 vendor_id : GenuineIntel cpu family : 6 model : 58 model name : Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz stepping : 9 microcode : 0x1b cpu MHz : 1600.000 cache size : 6144 KB physical id : 0 siblings : 4 core id : 2 cpu cores : 4 apicid : 4 initial apicid : 4 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms bogomips : 6385.19 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: processor : 3 vendor_id : GenuineIntel cpu family : 6 model : 58 model name : Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz stepping : 9 microcode : 0x1b cpu MHz : 1600.000 cache size : 6144 KB physical id : 0 siblings : 4 core id : 3 cpu cores : 4 apicid : 6 initial apicid : 6 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand 
lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms bogomips : 6385.20 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: $shell>
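From these flags a rough single-core peak estimate follows: AVX provides 256-bit registers (4 doubles), and with separate addition and multiplication units but no FMA this Ivy Bridge core can retire at most \(4+4=8\) double-precision flops per cycle. At the nominal 3.2 GHz clock this gives

\( 3.2 \cdot 10^{9}\ \tfrac{\text{cycles}}{\text{s}} \cdot 8\ \tfrac{\text{flops}}{\text{cycle}} = 25.6\ \text{GFLOPS} \approx 25\,600\ \text{MFLOPS} \)

per core; with turbo boost the effective peak is somewhat higher, which is consistent with the roughly 27,000 MFLOPS reached by the MKL-backed runs below. Treat these numbers as an estimate derived from the CPU flags above, not as a measured specification.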
Run with Reference Micro-Kernel
Here we only have the cache optimization (blocking and packing of the operands); the micro-kernel itself is plain scalar code. Performance is therefore poor (roughly 10 percent of the peak estimated above) but does not drop for growing problem sizes. A sketch of such a reference micro-kernel is shown after the following benchmark run:
$shell> g++-5.3 -Ofast -mavx -DNDEBUG -std=c++11 -DM_MAX=2000 -I /home/numerik/lehn/work/blaze-2.5/ -I /opt/intel/compilers_and_libraries/linux/mkl/include -DBLAZE_BLAS_MODE -DMKL_ILP64 -L /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64 -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lm -lpthread -Wl,-rpath /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64 bench_gemm.cc $shell> ./a.out > report.gemm.ref $shell> cat report.gemm.ref # m n k C++GEMM t1 MFLOPS BLAZE/BLAS t2 MFLOPS Residual 100 100 100 0.00192924 1036.68 0.00999787 200.043 0 200 200 200 0.0140152 1141.62 0.00155246 10306.2 0 300 300 300 0.0209486 2577.74 0.00212082 25461.9 3.2981e-16 400 400 400 0.0486354 2631.83 0.00496922 25758.6 7.40806e-17 500 500 500 0.0952918 2623.52 0.009898 25257.6 1.62927e-17 600 600 600 0.163133 2648.15 0.0168158 25690.1 0 700 700 700 0.260459 2633.81 0.0264285 25956.8 0 800 800 800 0.387993 2639.22 0.038954 26287.4 0 900 900 900 0.553039 2636.34 0.0561184 25980.8 0 1000 1000 1000 0.75472 2649.99 0.0756805 26426.9 0 1100 1100 1100 1.01181 2630.92 0.102594 25947 0 1200 1200 1200 1.30367 2650.98 0.12927 26734.8 0 1300 1300 1300 1.65913 2648.38 0.166478 26393.8 0 1400 1400 1400 2.06489 2657.77 0.204781 26799.4 0 1500 1500 1500 2.54567 2651.56 0.253282 26650.1 0 1600 1600 1600 3.09508 2646.78 0.310623 26372.8 0 1700 1700 1700 3.70448 2652.47 0.373053 26339.4 0 1800 1800 1800 4.3731 2667.21 0.43445 26847.8 0 1900 1900 1900 5.18503 2645.69 0.521709 26294.3 0 2000 2000 2000 6.01699 2659.14 0.590739 27084.7 0 $shell>
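For illustration, a reference micro-kernel of this kind might look like the sketch below. The name ugemm_ref, the row-major micro-tile buffer, and the MR = 8 / NR = 4 blocking (taken from the AVX and BLIS kernels above) are assumptions for this example; the actual code in gemm.h may differ.

#include <cstddef>

// Reference micro-kernel (sketch): accumulate an MR x NR micro-tile
//     AB += sum_{l=0}^{kc-1} A(:,l) * B(l,:)
// where A and B point to packed panels: A stores MR entries per rank-1
// update, B stores NR entries per rank-1 update.
constexpr std::size_t MR = 8;
constexpr std::size_t NR = 4;

void
ugemm_ref(std::size_t kc, const double *A, const double *B, double *AB)
{
    // AB is a row-major MR x NR buffer holding the micro-tile.
    for (std::size_t l=0; l<kc; ++l) {
        for (std::size_t i=0; i<MR; ++i) {
            for (std::size_t j=0; j<NR; ++j) {
                AB[i*NR+j] += A[i]*B[j];
            }
        }
        A += MR;    // advance to the next column of the packed A panel
        B += NR;    // advance to the next row of the packed B panel
    }
}

The compiler can auto-vectorize such loops to some extent, but register blocking, instruction scheduling, and prefetching are left entirely to it, which explains the gap to the hand-tuned kernels.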
Run with GCC Vector Extensions
Better, but still clearly below the MKL performance. Here the register-block updates of the micro-kernel are expressed with GCC vector extensions; a minimal sketch of the idea follows.
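The sketch below illustrates the technique. The function name ugemm_gccvec and the micro-tile layout are assumptions, not taken from gemm.h; it assumes the MR = 8, NR = 4 blocking used above, AVX (4 doubles per 256-bit register), and 32-byte aligned packed panels.

#include <cstddef>

// 256-bit vector of 4 doubles via GCC vector extensions
typedef double v4df __attribute__((vector_size(32)));

void
ugemm_gccvec(std::size_t kc, const double *A, const double *B, v4df AB[8])
{
    // AB[2*j] holds rows 0..3 of column j of the micro-tile,
    // AB[2*j+1] holds rows 4..7 (an 8 x 4 block in total).
    for (std::size_t l=0; l<kc; ++l) {
        v4df a0 = *reinterpret_cast<const v4df *>(A);      // A(0:3, l), assumes 32-byte alignment
        v4df a1 = *reinterpret_cast<const v4df *>(A + 4);  // A(4:7, l)

        for (int j=0; j<4; ++j) {
            v4df bj = {B[j], B[j], B[j], B[j]};            // broadcast B(l, j)
            AB[2*j]   += a0*bj;
            AB[2*j+1] += a1*bj;
        }
        A += 8;     // next column of the packed A panel (MR = 8)
        B += 4;     // next row of the packed B panel (NR = 4)
    }
}

With -mavx the compiler maps the v4df arithmetic to vmulpd/vaddpd instructions, but register allocation and instruction scheduling remain in its hands, which is why the hand-written kernels below still do better.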
$shell> g++-5.3 -Ofast -mavx -DHAVE_GCCVEC -DNDEBUG -std=c++11 -DM_MAX=10000 -I /home/numerik/lehn/work/blaze-2.5/ -I /opt/intel/compilers_and_libraries/linux/mkl/include -DBLAZE_BLAS_MODE -DMKL_ILP64 -L /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64 -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lm -lpthread -Wl,-rpath /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64 bench_gemm.cc $shell> ./a.out > report.gemm.gccvec $shell> cat report.gemm.gccvec # m n k C++GEMM t1 MFLOPS BLAZE/BLAS t2 MFLOPS Residual 100 100 100 0.000407025 4913.7 0.00847677 235.939 0 200 200 200 0.00224897 7114.36 0.00157076 10186.2 0 300 300 300 0.00575359 9385.45 0.00213457 25297.8 3.28115e-16 400 400 400 0.00685369 18676.1 0.00497949 25705.5 7.38826e-17 500 500 500 0.0130829 19108.9 0.00994235 25145 1.62405e-17 600 600 600 0.0223269 19348.8 0.0168617 25620.2 0 700 700 700 0.0350645 19563.9 0.0265723 25816.3 0 800 800 800 0.0558207 18344.4 0.0391209 26175.2 0 900 900 900 0.0752291 19380.8 0.0563906 25855.4 0 1000 1000 1000 0.101436 19716.8 0.0754637 26502.8 0 1100 1100 1100 0.138357 19240.1 0.101672 26182.4 0 1200 1200 1200 0.173842 19880.1 0.128797 26833 0 1300 1300 1300 0.220414 19935.2 0.165927 26481.5 0 1400 1400 1400 0.272633 20129.6 0.203764 26933.2 0 1500 1500 1500 0.330385 20430.7 0.252433 26739.7 0 1600 1600 1600 0.425025 19274.1 0.3099 26434.4 0 1700 1700 1700 0.480851 20434.6 0.36709 26767.3 0 1800 1800 1800 0.586448 19889.2 0.436498 26721.8 0 1900 1900 1900 0.689052 19908.5 0.525498 26104.8 0 2000 2000 2000 0.779494 20526.1 0.590567 27092.6 0 2100 2100 2100 0.90315 20508.2 0.690838 26810.9 0 2200 2200 2200 1.03099 20655.8 0.784045 27161.7 0 2300 2300 2300 1.22908 19798.6 0.925685 26287.6 0 2400 2400 2400 1.33735 20673.8 1.02526 26966.9 0 2500 2500 2500 1.50181 20808.3 1.15678 27014.7 0 2600 2600 2600 1.69314 20761.4 1.3059 26917.8 0 2700 2700 2700 1.89012 20827.2 1.45705 27017.7 0 2800 2800 2800 2.10954 20812.1 1.62776 26972 0 2900 2900 2900 2.34972 20759.1 1.80801 26978.8 0 3000 3000 3000 2.58171 20916.4 1.98005 27272 0 3100 3100 3100 2.87224 20744.1 2.21605 26886.6 0 3200 3200 3200 3.20359 20457.1 2.44338 26821.9 0 3300 3300 3300 3.42603 20978.8 2.6459 27164.3 0 3400 3400 3400 3.76479 20879.8 2.87655 27327.2 0 3500 3500 3500 4.08918 20970 3.16028 27133.6 0 3600 3600 3600 4.4817 20820.7 3.41756 27303.7 0 3700 3700 3700 4.83091 20970.4 3.72496 27196.5 0 3800 3800 3800 5.30761 20676.7 4.09041 26829.6 0 3900 3900 3900 5.6785 20892.5 4.37506 27116.9 0 4000 4000 4000 6.11521 20931.4 4.67801 27362 0 4100 4100 4100 7.16665 19233.8 5.2965 26025.1 0 4200 4200 4200 7.17797 20643.2 5.46257 27125.7 0 4300 4300 4300 7.63914 20815.7 5.86366 27118.5 0 4400 4400 4400 8.18422 20816.7 6.21776 27400.2 0 4500 4500 4500 8.7041 20938.4 6.68498 27262.6 0 4600 4600 4600 9.68118 20108.3 7.30794 26638.4 0 4700 4700 4700 9.92288 20926 7.61931 27252.6 0 4800 4800 4800 10.5796 20906.6 8.05337 27464.8 0 4900 4900 4900 11.2833 20853.6 8.63389 27252.9 0 5000 5000 5000 11.967 20890.8 9.10993 27442.6 0 5100 5100 5100 12.6556 20963.2 9.788 27104.8 0 5200 5200 5200 13.7619 20434.4 10.541 26678.3 0 5300 5300 5300 14.1787 21000.1 10.9305 27240.8 0 5400 5400 5400 15.0219 20964.6 11.4892 27410.8 0 5500 5500 5500 15.8621 20977.6 12.2196 27230.9 0 5600 5600 5600 16.7443 20976.2 12.7903 27460.8 0 5700 5700 5700 18.7544 19749.3 14.1342 26204.9 0 5800 5800 5800 18.5705 21013.2 14.2071 27466.7 0 5900 5900 5900 19.5723 20986.7 15.0827 27233.8 0 6000 6000 6000 20.6856 20884.1 15.8036 27335.5 0 6100 6100 6100 21.8598 20767 16.8333 
26968.1 0 6200 6200 6200 23.0526 20676.9 17.733 26879.6 0 6300 6300 6300 24.5744 20350.2 18.9015 26457.9 0 6400 6400 6400 25.0979 20889.7 19.1286 27408.7 0 6500 6500 6500 26.0972 21046.3 20.1288 27286.7 0 6600 6600 6600 27.2903 21069.4 20.9308 27471.1 0 6700 6700 6700 28.6022 21030.7 22.0336 27300.4 0 6800 6800 6800 29.9291 21011.8 22.8883 27475.4 0 6900 6900 6900 32.9696 19928 25.0635 26214.1 0 7000 7000 7000 32.6343 21020.8 25.0256 27412 0 7100 7100 7100 33.9858 21062.4 26.2288 27291.4 0 7200 7200 7200 35.5804 20980.5 27.1805 27464.4 0 7300 7300 7300 37.3043 20856.4 28.7618 27050.9 0 7400 7400 7400 39.5695 20481.6 30.6923 26405.6 0 7500 7500 7500 40.0676 21058.1 30.8751 27327.8 0 7600 7600 7600 42.1824 20813.2 32.5473 26974.7 0 7700 7700 7700 43.4245 21026.5 33.5732 27196.3 0 7800 7800 7800 46.6224 20357.2 35.9285 26416.5 0 7900 7900 7900 46.7304 21101.4 36.0548 27349.4 0 8000 8000 8000 48.7735 20995 37.4278 27359.4 0 8100 8100 8100 50.6614 20980.1 39.1723 27133.5 0 8200 8200 8200 56.0748 19665.4 42.4338 25987.2 0 8300 8300 8300 54.5978 20945.4 42.0511 27194.9 0 8400 8400 8400 58.6704 20204.5 44.8213 26447.4 0 8500 8500 8500 58.3901 21035.3 44.9481 27326 0 8600 8600 8600 60.4618 21039.9 46.2047 27532.1 0 8700 8700 8700 63.4636 20752.2 48.2171 27314.1 0 8800 8800 8800 65.0111 20964.8 49.4994 27534.5 0 8900 8900 8900 67.5106 20884.7 52.0179 27104.9 0 9000 9000 9000 69.2848 21043.6 52.925 27548.4 0 9100 9100 9100 74.7006 20175.8 57.4544 26232 0 9200 9200 9200 75.9033 20517.9 57.9829 26859.2 0 9300 9300 9300 76.9345 20910.2 58.8432 27339 0 9400 9400 9400 78.8734 21061.2 60.2768 27559 0 9500 9500 9500 81.9922 20913.6 63.062 27191.5 0 9600 9600 9600 84.2366 21006 64.2094 27557.8 0 9700 9700 9700 86.4003 21126.6 66.5303 27436.3 0 9800 9800 9800 89.3024 21078.8 68.3154 27554.3 0 9900 9900 9900 92.0564 21080.5 70.7997 27409.7 0 10000 10000 10000 95.0441 21042.9 72.6529 27528.1 0 $shell>
Run with AVX Micro-Kernel
Better still. Further improvement is possible with prefetch hints, as used in the BLIS micro-kernel listed above; a small sketch of the idea follows.
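As an illustration of such hints: the helper name prefetch_panels and the distance PF_DIST below are made up for this sketch, not values taken from gemm.h or BLIS, and again assume the MR = 8, NR = 4 packing used above.

#include <cstddef>

// Software prefetching sketch for packed panels with MR = 8, NR = 4.
// PF_DIST is an illustrative tuning parameter: how many rank-1 updates
// ahead of the current iteration the panel data is requested.
constexpr std::size_t PF_DIST = 8;

inline void
prefetch_panels(const double *A, const double *B)
{
    __builtin_prefetch(A + PF_DIST*8, /*rw=*/0, /*locality=*/3);   // future column of packed A
    __builtin_prefetch(B + PF_DIST*4, /*rw=*/0, /*locality=*/3);   // future row of packed B
}

Inside the k-loop of the micro-kernel one would call prefetch_panels(A, B) once per iteration before the multiply/accumulate updates; this plays the same role as the prefetcht0 instructions in the BLIS assembly kernel above, where the distances have been tuned carefully.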
$shell> g++-5.3 -Ofast -mavx -DHAVE_AVX -DNDEBUG -std=c++11 -DM_MAX=10000 -I /home/numerik/lehn/work/blaze-2.5/ -I /opt/intel/compilers_and_libraries/linux/mkl/include -DBLAZE_BLAS_MODE -DMKL_ILP64 -L /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64 -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lm -lpthread -Wl,-rpath /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64 bench_gemm.cc $shell> ./a.out > report.gemm.avx $shell> cat report.gemm.avx # m n k C++GEMM t1 MFLOPS BLAZE/BLAS t2 MFLOPS Residual 100 100 100 0.000367146 5447.42 0.00326583 612.401 0 200 200 200 0.00199287 8028.64 0.00155407 10295.6 0 300 300 300 0.00591876 9123.54 0.00470808 11469.6 3.2758e-16 400 400 400 0.00603099 21223.7 0.00499065 25647.9 7.35799e-17 500 500 500 0.0114509 21832.3 0.00992922 25178.2 1.65011e-17 600 600 600 0.0194779 22178.9 0.0168787 25594.5 0 700 700 700 0.0303694 22588.5 0.026423 25962.2 0 800 800 800 0.0465427 22001.3 0.0389478 26291.6 0 900 900 900 0.0639362 22804 0.0558155 26121.8 0 1000 1000 1000 0.0858282 23302.4 0.0751876 26600.1 0 1100 1100 1100 0.115278 23092 0.101417 26248.1 0 1200 1200 1200 0.150745 22926.1 0.128898 26812 0 1300 1300 1300 0.1902 23102 0.167506 26231.8 0 1400 1400 1400 0.234868 23366.3 0.204432 26845.1 0 1500 1500 1500 0.28564 23631.1 0.255354 26433.9 0 1600 1600 1600 0.370587 22105.5 0.310446 26387.9 0 1700 1700 1700 0.415745 23634.7 0.36711 26765.8 0 1800 1800 1800 0.494782 23574 0.4341 26869.4 0 1900 1900 1900 0.601421 22809.3 0.521157 26322.2 0 2000 2000 2000 0.681128 23490.4 0.590811 27081.4 0 2100 2100 2100 0.789197 23469.4 0.690006 26843.3 0 2200 2200 2200 0.899593 23672.9 0.783601 27177.1 0 2300 2300 2300 1.08288 22471.6 0.925276 26299.2 0 2400 2400 2400 1.17264 23577.6 1.01606 27211 0 2500 2500 2500 1.31376 23786.6 1.15569 27040.1 0 2600 2600 2600 1.48169 23724.2 1.29145 27219 0 2700 2700 2700 1.65705 23756.7 1.45557 27045.1 0 2800 2800 2800 1.84785 23759.5 1.60875 27290.8 0 2900 2900 2900 2.06278 23646.8 1.8063 27004.4 0 3000 3000 3000 2.26218 23870.8 1.97781 27303 0 3100 3100 3100 2.52132 23631.2 2.21096 26948.5 0 3200 3200 3200 2.81773 23258.4 2.44476 26806.8 0 3300 3300 3300 3.00075 23952 2.64563 27167.1 0 3400 3400 3400 3.29343 23868.1 2.8747 27344.8 0 3500 3500 3500 3.62114 23680.4 3.15783 27154.7 0 3600 3600 3600 3.92615 23766.8 3.41648 27312.3 0 3700 3700 3700 4.22823 23959.4 3.72368 27205.9 0 3800 3800 3800 4.65949 23552.8 4.09023 26830.8 0 3900 3900 3900 5.06009 23445.8 4.37677 27106.3 0 4000 4000 4000 5.35292 23912.2 4.68161 27341 0 4100 4100 4100 6.37834 21611 5.29761 26019.7 0 4200 4200 4200 6.29039 23555.9 5.4632 27122.6 0 4300 4300 4300 6.70152 23728.1 5.85646 27151.9 0 4400 4400 4400 7.16907 23764.3 6.21699 27403.6 0 4500 4500 4500 7.62251 23909.4 6.67955 27284.8 0 4600 4600 4600 8.53185 22817.1 7.31215 26623.1 0 4700 4700 4700 8.69293 23886.8 7.61405 27271.4 0 4800 4800 4800 9.25932 23887.7 8.05206 27469.2 0 4900 4900 4900 9.84866 23891.4 8.62733 27273.6 0 5000 5000 5000 10.4685 23881.1 9.11062 27440.5 0 5100 5100 5100 11.1341 23827.9 9.7802 27126.4 0 5200 5200 5200 12.1025 23236.2 10.539 26683.3 0 5300 5300 5300 12.463 23891 10.924 27256.8 0 5400 5400 5400 13.1625 23926.1 11.4865 27417.2 0 5500 5500 5500 13.8845 23965.6 12.2139 27243.6 0 5600 5600 5600 14.8246 23692.5 12.7883 27465.1 0 5700 5700 5700 16.6839 22200.2 14.1249 26222.2 0 5800 5800 5800 16.3759 23829.2 14.2015 27477.6 0 5900 5900 5900 17.2694 23785.3 15.0635 27268.4 0 6000 6000 6000 18.277 23636.3 15.7898 27359.4 0 6100 6100 6100 19.3426 23469.5 16.8154 26996.8 0 6200 6200 
6200 20.3543 23417.9 17.7275 26887.9 0 6300 6300 6300 21.7686 22973.2 18.9147 26439.4 0 6400 6400 6400 21.9993 23832 19.1291 27407.8 0 6500 6500 6500 22.8439 24043.6 20.1025 27322.5 0 6600 6600 6600 24.0512 23907 20.899 27512.9 0 6700 6700 6700 25.3256 23751.7 22.0183 27319.4 0 6800 6800 6800 26.4527 23773.2 22.8772 27488.7 0 6900 6900 6900 29.3864 22357.9 25.0468 26231.6 0 7000 7000 7000 28.8664 23764.7 24.9503 27494.7 0 7100 7100 7100 29.749 24062.1 26.2134 27307.5 0 7200 7200 7200 31.1492 23965.2 27.1771 27467.9 0 7300 7300 7300 32.7721 23740.7 28.7404 27071.1 0 7400 7400 7400 34.7688 23309.6 30.6542 26438.4 0 7500 7500 7500 35.0928 24043.4 30.8547 27345.9 0 7600 7600 7600 37.5 23412 32.4123 27087 0 7700 7700 7700 38.0142 24019.1 33.4129 27326.7 0 7800 7800 7800 41.6594 22782.5 35.7899 26518.8 0 7900 7900 7900 40.8991 24110 36.031 27367.5 0 8000 8000 8000 42.6625 24002.4 37.2796 27468.1 0 8100 8100 8100 44.3779 23950.7 39.1438 27153.3 0 8200 8200 8200 49.8474 22122.2 42.4083 26002.8 0 8300 8300 8300 47.7356 23956.4 42.0241 27212.4 0 8400 8400 8400 52.0519 22773.6 44.786 26468.3 0 8500 8500 8500 51.119 24027.3 44.9517 27323.7 0 8600 8600 8600 53.7675 23659.5 46.2043 27532.3 0 8700 8700 8700 56.0515 23496.4 48.2217 27311.5 0 8800 8800 8800 56.8321 23981.9 49.5053 27531.3 0 8900 8900 8900 59.3367 23761.7 52.1244 27049.5 0 9000 9000 9000 61.6809 23637.8 52.9577 27531.4 0 9100 9100 9100 66.1763 22774.6 57.6216 26155.8 0 9200 9200 9200 66.8674 23290.5 58.0958 26807 0 9300 9300 9300 67.2471 23922.4 58.9583 27285.6 0 9400 9400 9400 69.8445 23783.8 60.3212 27538.7 0 9500 9500 9500 72.4649 23663.2 63.1916 27135.7 0 9600 9600 9600 73.9155 23939.1 64.2658 27533.6 0 9700 9700 9700 76.1038 23984.9 66.6451 27389 0 9800 9800 9800 78.5587 23961.5 68.3817 27527.6 0 9900 9900 9900 81.0652 23938.7 70.873 27381.4 0 10000 10000 10000 84.4209 23690.8 72.6859 27515.6 0 $shell>
Run with Micro-Kernel from BLIS
$shell> g++-5.3 -Ofast -mavx -DHAVE_BLISAVX -DNDEBUG -std=c++11 -DM_MAX=10000 -I /home/numerik/lehn/work/blaze-2.5/ -I /opt/intel/compilers_and_libraries/linux/mkl/include -DBLAZE_BLAS_MODE -DMKL_ILP64 -L /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64 -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lm -lpthread -Wl,-rpath /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64 bench_gemm.cc $shell> ./a.out > report.gemm.blisavx $shell> cat report.gemm.blisavx # m n k C++GEMM t1 MFLOPS BLAZE/BLAS t2 MFLOPS Residual 100 100 100 0.000372942 5362.76 0.00323866 617.54 0 200 200 200 0.00185347 8632.46 0.00159255 10046.8 0 300 300 300 0.0059265 9111.62 0.00473546 11403.3 3.26882e-16 400 400 400 0.00585833 21849.2 0.00497397 25734 7.38718e-17 500 500 500 0.0112347 22252.5 0.00989491 25265.5 1.63896e-17 600 600 600 0.019289 22396.2 0.0168363 25658.9 0 700 700 700 0.0300962 22793.6 0.0264475 25938.2 0 800 800 800 0.0457475 22383.8 0.0389408 26296.3 0 900 900 900 0.0639209 22809.4 0.0558012 26128.5 0 1000 1000 1000 0.0865796 23100.1 0.0751015 26630.6 0 1100 1100 1100 0.115994 22949.5 0.100923 26376.5 0 1200 1200 1200 0.145955 23678.5 0.128666 26860.2 0 1300 1300 1300 0.189539 23182.6 0.166441 26399.8 0 1400 1400 1400 0.233905 23462.5 0.204786 26798.8 0 1500 1500 1500 0.287236 23499.8 0.255567 26411.8 0 1600 1600 1600 0.364779 22457.4 0.310986 26342 0 1700 1700 1700 0.417898 23512.9 0.373009 26342.6 0 1800 1800 1800 0.494057 23608.6 0.436133 26744.1 0 1900 1900 1900 0.584618 23464.9 0.529687 25898.3 0 2000 2000 2000 0.681801 23467.2 0.593978 26937 0 2100 2100 2100 0.825969 22424.6 0.703928 26312.3 0 2200 2200 2200 0.89506 23792.8 0.791303 26912.6 0 2300 2300 2300 1.03217 23575.7 0.939738 25894.4 0 2400 2400 2400 1.17879 23454.5 1.026 26947.3 0 2500 2500 2500 1.30225 23997 1.18086 26463.7 0 2600 2600 2600 1.51517 23200 1.30615 26912.8 0 2700 2700 2700 1.63576 24065.9 1.47914 26614 0 2800 2800 2800 1.80873 24273.3 1.62628 26996.5 0 2900 2900 2900 2.03353 23986.8 1.80907 26963.1 0 3000 3000 3000 2.29453 23534.2 1.9984 27021.7 0 3100 3100 3100 2.50926 23744.9 2.21398 26911.7 0 3200 3200 3200 2.70498 24227.9 2.45706 26672.5 0 3300 3300 3300 2.95724 24304.4 2.70104 26609.8 0 3400 3400 3400 3.25009 24186.4 2.91225 26992.2 0 3500 3500 3500 3.60244 23803.3 3.15971 27138.6 0 3600 3600 3600 3.88113 24042.5 3.42062 27279.3 0 3700 3700 3700 4.19449 24152.2 3.7254 27193.3 0 3800 3800 3800 4.50243 24374.4 4.09083 26826.8 0 3900 3900 3900 4.92075 24109.7 4.3773 27103 0 4000 4000 4000 5.26559 24308.8 4.67802 27362 0 4100 4100 4100 6.10516 22578 5.29677 26023.8 0 4200 4200 4200 6.13505 24152.4 5.46156 27130.7 0 4300 4300 4300 6.56239 24231.1 5.85965 27137.1 0 4400 4400 4400 7.05768 24139.4 6.21792 27399.5 0 4500 4500 4500 7.55223 24131.9 6.68305 27270.5 0 4600 4600 4600 8.33596 23353.3 7.30727 26640.8 0 4700 4700 4700 8.61305 24108.3 7.61665 27262.1 0 4800 4800 4800 9.09944 24307.4 8.05418 27462 0 4900 4900 4900 9.76304 24100.9 8.62957 27266.5 0 5000 5000 5000 10.285 24307.2 9.10928 27444.6 0 5100 5100 5100 11.272 23536.3 9.78555 27111.6 0 5200 5200 5200 11.6597 24118.6 10.5389 26683.7 0 5300 5300 5300 12.2607 24285.2 10.9257 27252.5 0 5400 5400 5400 12.9513 24316.3 11.4897 27409.5 0 5500 5500 5500 13.7315 24232.6 12.2204 27229.1 0 5600 5600 5600 14.3846 24417.2 12.7913 27458.7 0 5700 5700 5700 15.533 23845.1 14.1333 26206.5 0 5800 5800 5800 16.0063 24379.4 14.2062 27468.5 0 5900 5900 5900 16.9706 24204.1 15.0657 27264.5 0 6000 6000 6000 17.7202 24379 15.7907 27357.8 0 6100 6100 6100 18.8619 24067.7 16.8168 
26994.5 0 6200 6200 6200 20.2404 23549.7 17.7308 26883 0 6300 6300 6300 20.6224 24250 18.917 26436.2 0 6400 6400 6400 21.626 24243.4 19.1302 27406.3 0 6500 6500 6500 22.589 24314.9 20.1584 27246.7 0 6600 6600 6600 23.4358 24534.8 20.9342 27466.6 0 6700 6700 6700 25.675 23428.5 22.0967 27222.4 0 6800 6800 6800 25.8039 24370.9 22.9373 27416.7 0 6900 6900 6900 28.1045 23377.6 25.0719 26205.3 0 7000 7000 7000 28.1544 24365.6 25.0284 27408.9 0 7100 7100 7100 30.7377 23288 26.2166 27304.2 0 7200 7200 7200 30.6741 24336.4 27.2284 27416.1 0 7300 7300 7300 32.0757 24256.2 28.7467 27065.2 0 7400 7400 7400 33.1329 24460.5 30.6698 26425 0 7500 7500 7500 34.7644 24270.5 31.0362 27186 0 7600 7600 7600 36.4256 24102.6 32.5308 26988.3 0 7700 7700 7700 38.0805 23977.3 33.6201 27158.3 0 7800 7800 7800 39.0801 24286.1 35.9003 26437.2 0 7900 7900 7900 40.2166 24519.2 36.1637 27267 0 8000 8000 8000 41.9562 24406.4 37.4484 27344.3 0 8100 8100 8100 43.7309 24305.1 39.1661 27137.8 0 8200 8200 8200 48.7764 22608 42.438 25984.6 0 8300 8300 8300 47.0006 24331 42.2964 27037.1 0 8400 8400 8400 48.7043 24338.9 44.9677 26361.3 0 8500 8500 8500 50.7522 24200.9 44.9328 27335.2 0 8600 8600 8600 52.128 24403.6 46.485 27366.1 0 8700 8700 8700 57.0766 23074.3 48.2066 27320 0 8800 8800 8800 55.9607 24355.4 49.6313 27461.4 0 8900 8900 8900 57.9327 24337.5 52.0996 27062.3 0 9000 9000 9000 59.8878 24345.5 52.9507 27535.1 0 9100 9100 9100 62.0927 24272.4 57.5573 26185.1 0 9200 9200 9200 65.3103 23845.8 58.0579 26824.5 0 9300 9300 9300 66.9612 24024.6 58.9337 27297 0 9400 9400 9400 67.9214 24457.2 60.3089 27544.3 0 9500 9500 9500 70.6551 24269.3 63.1537 27152 0 9600 9600 9600 72.1908 24511.1 64.263 27534.8 0 9700 9700 9700 75.8854 24054 66.606 27405.1 0 9800 9800 9800 77.1546 24397.6 68.3478 27541.2 0 9900 9900 9900 79.3865 24444.9 70.8878 27375.6 0 10000 10000 10000 82.042 24377.7 72.6955 27512 0 $shell>
Run with internal BLAZE Matrix-Matrix Product
In this run the BLAS backend of BLAZE is disabled (-DBLAZE_BLAS_MODE=0) so that the internal BLAZE implementation of the matrix-matrix product is used.
$shell> g++-5.3 -Ofast -mavx -DHAVE_AVX -DNDEBUG -std=c++11 -DM_MAX=10000 -DBLAZE_BLAS_MODE=0 -I /home/numerik/lehn/work/blaze-2.5/ bench_gemm.cc $shell> ./a.out > report.gemm.blaze $shell> cat report.gemm.blaze # m n k C++GEMM t1 MFLOPS BLAZE/BLAS t2 MFLOPS Residual 100 100 100 0.000347102 5762 0.00023927 8358.76 0 200 200 200 0.00190906 8381.09 0.00157886 10133.9 1.94376e-15 300 300 300 0.00273987 19709 0.00292348 18471.1 3.09935e-16 400 400 400 0.00609684 20994.5 0.00872672 14667.6 6.77194e-17 500 500 500 0.011434 21864.7 0.0156393 15985.3 2.71608e-17 600 600 600 0.0194431 22218.7 0.0269997 16000.2 1.02718e-17 700 700 700 0.0304357 22539.3 0.0424625 16155.4 4.9156e-18 800 800 800 0.0463552 22090.3 0.076389 13405.1 2.66701e-18 900 900 900 0.0640307 22770.3 0.0908907 16041.3 1.41414e-18 1000 1000 1000 0.0860042 23254.7 0.125091 15988.3 8.89855e-19 1100 1100 1100 0.116019 22944.5 0.164119 16220 5.42639e-19 1200 1200 1200 0.148943 23203.5 0.207642 16644.1 3.53572e-19 1300 1300 1300 0.188562 23302.6 0.270505 16243.7 2.45999e-19 1400 1400 1400 0.232967 23557 0.33137 16561.6 1.63876e-19 1500 1500 1500 0.285457 23646.3 0.415312 16252.8 1.20472e-19 1600 1600 1600 0.36981 22151.9 0.693214 11817.4 8.6996e-20 1700 1700 1700 0.415661 23639.4 0.603168 16290.6 6.42364e-20 1800 1800 1800 0.494785 23573.9 0.710542 16415.6 4.97022e-20 1900 1900 1900 0.596272 23006.3 0.853747 16068 3.69016e-20 2000 2000 2000 0.674733 23713.1 0.987237 16206.9 2.92235e-20 2100 2100 2100 0.783154 23650.5 1.18498 15630.7 2.29818e-20 2200 2200 2200 0.891122 23898 1.31608 16181.4 1.81304e-20 2300 2300 2300 1.07586 22618.3 1.52978 15906.8 1.4854e-20 2400 2400 2400 1.16031 23828.1 2.06077 13416.3 1.17713e-20 2500 2500 2500 1.30164 24008.2 1.97847 15795 9.73602e-21 2600 2600 2600 1.46791 23947 2.21479 15871.4 8.05048e-21 2700 2700 2700 1.63977 24007 2.52042 15618.9 6.64385e-21 2800 2800 2800 1.82847 24011.3 2.75028 15963.5 5.62477e-21 2900 2900 2900 2.04018 23908.7 3.11106 15678.9 4.66769e-21 3000 3000 3000 2.23944 24113.2 3.41054 15833.3 3.97486e-21 3100 3100 3100 2.49366 23893.4 4.01972 14822.4 3.40358e-21 3200 3200 3200 2.78747 23510.9 6.06717 10801.7 2.85368e-21 3300 3300 3300 2.97308 24174.9 4.71707 15237 2.50243e-21 3400 3400 3400 3.26012 24112 5.07351 15493.8 2.14404e-21 3500 3500 3500 3.54648 24178.9 5.84645 14667 1.86284e-21 3600 3600 3600 3.88569 24014.3 6.16058 15146.6 1.63382e-21 3700 3700 3700 4.18605 24200.9 7.11084 14246.7 1.40576e-21 3800 3800 3800 4.60856 23813.1 7.41795 14794.4 1.24944e-21 3900 3900 3900 4.92386 24094.5 8.5716 13840.8 1.09605e-21 4000 4000 4000 5.29786 24160.7 11.2478 11380 9.68361e-22 4100 4100 4100 6.31539 21826.4 14.772 9331.3 8.646e-22 4200 4200 4200 6.22882 23788.8 10.2249 14491.7 7.57525e-22 4300 4300 4300 6.63591 23962.7 11.3878 13963.5 6.80752e-22 4400 4400 4400 7.09774 24003.1 11.5168 14793 6.07852e-22 4500 4500 4500 7.54499 24155.1 12.8636 14167.9 5.44032e-22 4600 4600 4600 8.46969 22984.5 13.2824 14656.4 4.91989e-22 4700 4700 4700 8.60009 24144.6 14.547 14274.2 4.37259e-22 4800 4800 4800 9.17624 24104 20.1638 10969.4 3.97149e-22 4900 4900 4900 9.75638 24117.4 16.1575 14562.7 3.59339e-22 5000 5000 5000 10.3364 24186.5 17.1499 14577.3 3.2494e-22 5100 5100 5100 10.9667 24191.5 18.7719 14133 2.96241e-22 5200 5200 5200 11.9826 23468.7 19.2137 14636.2 2.673e-22 5300 5300 5300 12.2857 24235.9 21.0174 14167.1 2.44629e-22 5400 5400 5400 13.0102 24206.3 21.7198 14499.6 2.23667e-22 5500 5500 5500 13.7445 24209.7 23.4256 14204.5 2.02212e-22 5600 5600 5600 14.5036 24216.9 28.1452 12479.3 
1.87241e-22 5700 5700 5700 16.4479 22518.8 36.5479 10134.2 1.70934e-22 5800 5800 5800 16.098 24240.5 26.3757 14794.9 1.57396e-22 5900 5900 5900 16.9657 24211.1 27.7243 14815.8 1.45233e-22 6000 6000 6000 17.9495 24067.5 29.3344 14726.7 1.32489e-22 6100 6100 6100 18.9978 23895.5 31.4239 14446.4 1.23218e-22 6200 6200 6200 20.0291 23798.1 32.3477 14735.4 1.13494e-22 6300 6300 6300 21.4573 23306.5 34.2461 14602.9 1.05156e-22 6400 6400 6400 21.857 23987.2 55.6257 9425.29 9.77876e-23 6500 6500 6500 22.619 24282.7 37.4572 14663.4 8.98104e-23 6600 6600 6600 23.6509 24311.6 38.9321 14769.1 8.39111e-23 6700 6700 6700 24.7884 24266.4 40.9972 14672.4 7.78455e-23 6800 6800 6800 25.9223 24259.6 42.2862 14871.6 7.24915e-23 6900 6900 6900 28.861 22764.9 73.8922 8891.58 6.7696e-23 7000 7000 7000 28.2965 24243.3 45.735 14999.5 6.26692e-23 7100 7100 7100 29.4585 24299.3 48.8276 14660.2 5.87634e-23 7200 7200 7200 30.8431 24203 57.8147 12911.9 5.48988e-23 7300 7300 7300 32.475 23957.9 53.4841 14547 5.13437e-23 7400 7400 7400 34.4148 23549.4 54.5667 14852.4 4.81175e-23 7500 7500 7500 34.723 24299.5 57.3133 14721.7 4.48379e-23 7600 7600 7600 36.6144 23978.4 59.0675 14863.5 4.21878e-23 7700 7700 7700 37.6367 24260 63.2329 14439.7 3.96239e-23 7800 7800 7800 40.6028 23375.3 64.689 14671.8 3.69181e-23 7900 7900 7900 40.5153 24338.4 68.4646 14402.7 3.49936e-23 8000 8000 8000 42.2423 24241.1 97.4799 10504.7 3.2783e-23 8100 8100 8100 43.9539 24181.7 74.1203 14340 3.09507e-23 8200 8200 8200 49.0463 22483.6 130.974 8419.49 2.92201e-23 8300 8300 8300 47.2549 24200.1 80.0767 14281 2.73256e-23 8400 8400 8400 51.1208 23188.4 81.6187 14523.7 2.59549e-23 8500 8500 8500 50.6099 24269 84.0387 14615.3 2.44362e-23 8600 8600 8600 52.3271 24310.7 86.4851 14709 2.31461e-23 8700 8700 8700 55.4189 23764.6 89.3641 14737.5 2.19231e-23 8800 8800 8800 56.1748 24262.5 111.523 12221.2 2.0592e-23 8900 8900 8900 58.6724 24030.7 96.1167 14669 1.96014e-23 9000 9000 9000 59.9687 24312.7 98.1572 14853.7 1.8538e-23 9100 9100 9100 65.3089 23077.1 104.331 14445.7 1.75964e-23 9200 9200 9200 65.9519 23613.8 105.318 14787.4 1.67012e-23 9300 9300 9300 66.3043 24262.6 109.426 14701.4 1.57665e-23 9400 9400 9400 68.2153 24351.8 111.871 14849 1.50309e-23 9500 9500 9500 71.3227 24042.1 116.323 14741.3 1.42772e-23 9600 9600 9600 73.1117 24202.3 164.611 10749.4 1.3476e-23 9700 9700 9700 74.9019 24369.8 124.793 14627 1.2921e-23 9800 9800 9800 77.4377 24308.4 126.492 14881.4 1.2242e-23 9900 9900 9900 79.9545 24271.3 131.652 14740.4 1.16929e-23 10000 10000 10000 82.4226 24265.2 133.004 15037.1 1.11417e-23 $shell>
Plots from the Benchmarks
Generating Plots
$shell> gnuplot plot.gemm.time $shell> gnuplot plot.gemm.time_log $shell> gnuplot plot.gemm.mflops $shell>