Naive Use of AVX Intrinsics
Using AVX intrinsics instead of SSE intrinsics, we follow the same straightforward approach as in Naive Use of SSE Intrinsics.
Clone the ulmBLAS Repository
If you have not done so already, clone the ulmBLAS repository:
$shell> git clone https://github.com/michael-lehn/ulmBLAS.git
Cloning into 'ulmBLAS'...
Select the demo-naive-avx-with-intrinsics Branch
Again, we run make clean before switching branches:
$shell> cd ulmBLAS
$shell> make clean
for dir in src refblas test bench; do make -C $dir clean; done
rm -f auxiliary/xerbla.o level1/dasum.o level1/daxpy.o level1/dcopy.o level1/ddot.o level1/dnrm2.o level1/drot.o level1/drotg.o level1/drotm.o level1/drotmg.o level1/dscal.o level1/dswap.o level1/idamax.o level3/dgemm.o level3/dgemm_nn.o level3/dsymm.o level3/stubs.o
rm -f auxiliary/atl_xerbla.o level1/atl_dasum.o level1/atl_daxpy.o level1/atl_dcopy.o level1/atl_ddot.o level1/atl_dnrm2.o level1/atl_drot.o level1/atl_drotg.o level1/atl_drotm.o level1/atl_drotmg.o level1/atl_dscal.o level1/atl_dswap.o level1/atl_idamax.o level3/atl_dgemm.o level3/atl_dgemm_nn.o level3/atl_dsymm.o level3/atl_stubs.o
rm -f ../libulmblas.a
rm -f ../libatlulmblas.a
rm -f caxpy.o ccopy.o cdotc.o cdotu.o cgbmv.o cgemm.o cgemv.o cgerc.o cgeru.o chbmv.o chemm.o chemv.o cher.o cher2.o cher2k.o cherk.o chpmv.o chpr.o chpr2.o crotg.o cscal.o csrot.o csscal.o cswap.o csymm.o csyr2k.o csyrk.o ctbmv.o ctbsv.o ctpmv.o ctpsv.o ctrmm.o ctrmv.o ctrsm.o ctrsv.o dasum.o daxpy.o dcabs1.o dcopy.o ddot.o dgbmv.o dgemm.o dgemv.o dger.o dnrm2.o drot.o drotg.o drotm.o drotmg.o dsbmv.o dscal.o dsdot.o dspmv.o dspr.o dspr2.o dswap.o dsymm.o dsymv.o dsyr.o dsyr2.o dsyr2k.o dsyrk.o dtbmv.o dtbsv.o dtpmv.o dtpsv.o dtrmm.o dtrmv.o dtrsm.o dtrsv.o dzasum.o dznrm2.o icamax.o idamax.o isamax.o izamax.o lsame.o sasum.o saxpy.o scabs1.o scasum.o scnrm2.o scopy.o sdot.o sdsdot.o sgbmv.o sgemm.o sgemv.o sger.o snrm2.o srot.o srotg.o srotm.o srotmg.o ssbmv.o sscal.o sspmv.o sspr.o sspr2.o sswap.o ssymm.o ssymv.o ssyr.o ssyr2.o ssyr2k.o ssyrk.o stbmv.o stbsv.o stpmv.o stpsv.o strmm.o strmv.o strsm.o strsv.o xerbla.o xerbla_array.o zaxpy.o zcopy.o zdotc.o zdotu.o zdrot.o zdscal.o zgbmv.o zgemm.o zgemv.o zgerc.o zgeru.o zhbmv.o zhemm.o zhemv.o zher.o zher2.o zher2k.o zherk.o zhpmv.o zhpr.o zhpr2.o zrotg.o zscal.o zswap.o zsymm.o zsyr2k.o zsyrk.o ztbmv.o ztbsv.o ztpmv.o ztpsv.o ztrmm.o ztrmv.o ztrsm.o ztrsv.o
rm -f ../librefblas.a
rm -f dblat1_ref dblat3_ref dblat1_ulm dblat3_ulm *.SUMM
rm -f xdl1blastst libtstatlas.a l1blastst.o ATL_cputime.o ATL_epsilon.o ATL_f77amax.o ATL_f77asum.o ATL_f77axpy.o ATL_f77copy.o ATL_f77dot.o ATL_f77gemm.o ATL_f77nrm2.o ATL_f77rot.o ATL_f77rotg.o ATL_f77rotm.o ATL_f77rotmg.o ATL_f77scal.o ATL_f77swap.o ATL_f77symm.o ATL_f77syr2k.o ATL_f77syrk.o ATL_f77trmm.o ATL_f77trsm.o ATL_flushcache.o ATL_gediffnrm1.o ATL_gegen.o ATL_genrm1.o ATL_infnrm.o ATL_rand.o ATL_set.o ATL_synrm.o ATL_trnrm1.o ATL_vdiff.o ATL_zero.o ATL_df77wrap.o
Then we check out the demo-naive-avx-with-intrinsics branch:
$shell> git branch -a
* master
  remotes/origin/HEAD -> origin/master
  remotes/origin/blis-avx-microkernel
  remotes/origin/demo-naive-avx-with-intrinsics
  remotes/origin/demo-naive-sse-with-intrinsics
  remotes/origin/demo-naive-sse-with-intrinsics-unrolled
  remotes/origin/demo-pure-c
  remotes/origin/demo-sse-asm
  remotes/origin/demo-sse-asm-for-AB-loop
  remotes/origin/demo-sse-asm-unrolled
  remotes/origin/demo-sse-asm-unrolled-v2
  remotes/origin/demo-sse-asm-unrolled-v3
  remotes/origin/demo-sse-asm-unrolled-with-prefetch
  remotes/origin/demo-sse-intrinsics
  remotes/origin/demo-sse-intrinsics-for-AB-loop
  remotes/origin/demo-sse-intrinsics-v2
  remotes/origin/demo-sse-intrinsics-v3
  remotes/origin/demo-with-sse-intrinsics
  remotes/origin/master
$shell> git checkout -B demo-naive-avx-with-intrinsics remotes/origin/demo-naive-avx-with-intrinsics
Switched to a new branch 'demo-naive-avx-with-intrinsics'
Branch demo-naive-avx-with-intrinsics set up to track remote branch demo-naive-avx-with-intrinsics from origin.
Then we compile the project:
$shell> make
make -C src
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -c -o auxiliary/xerbla.o auxiliary/xerbla.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -c -o level1/dasum.o level1/dasum.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -c -o level1/daxpy.o level1/daxpy.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -c -o level1/dcopy.o level1/dcopy.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -c -o level1/ddot.o level1/ddot.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -c -o level1/dnrm2.o level1/dnrm2.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -c -o level1/drot.o level1/drot.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -c -o level1/drotg.o level1/drotg.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -c -o level1/drotm.o level1/drotm.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -c -o level1/drotmg.o level1/drotmg.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -c -o level1/dscal.o level1/dscal.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -c -o level1/dswap.o level1/dswap.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -c -o level1/idamax.o level1/idamax.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -c -o level3/dgemm.o level3/dgemm.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -c -o level3/dgemm_nn.o level3/dgemm_nn.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -c -o level3/dsymm.o level3/dsymm.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -c -o level3/stubs.o level3/stubs.c
ar cru ../libulmblas.a auxiliary/xerbla.o level1/dasum.o level1/daxpy.o level1/dcopy.o level1/ddot.o level1/dnrm2.o level1/drot.o level1/drotg.o level1/drotm.o level1/drotmg.o level1/dscal.o level1/dswap.o level1/idamax.o level3/dgemm.o level3/dgemm_nn.o level3/dsymm.o level3/stubs.o
ranlib ../libulmblas.a
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -DFAKE_ATLAS -c -o auxiliary/atl_xerbla.o auxiliary/xerbla.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_dasum.o level1/dasum.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_daxpy.o level1/daxpy.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_dcopy.o level1/dcopy.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_ddot.o level1/ddot.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_dnrm2.o level1/dnrm2.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_drot.o level1/drot.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_drotg.o level1/drotg.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_drotm.o level1/drotm.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_drotmg.o level1/drotmg.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_dscal.o level1/dscal.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_dswap.o level1/dswap.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_idamax.o level1/idamax.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -DFAKE_ATLAS -c -o level3/atl_dgemm.o level3/dgemm.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -DFAKE_ATLAS -c -o level3/atl_dgemm_nn.o level3/dgemm_nn.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -DFAKE_ATLAS -c -o level3/atl_dsymm.o level3/dsymm.c
gcc-4.8 -Wall -I. -O2 -Xassembler -q -mavx -fomit-frame-pointer -DFAKE_ATLAS -c -o level3/atl_stubs.o level3/stubs.c
ar cru ../libatlulmblas.a auxiliary/atl_xerbla.o level1/atl_dasum.o level1/atl_daxpy.o level1/atl_dcopy.o level1/atl_ddot.o level1/atl_dnrm2.o level1/atl_drot.o level1/atl_drotg.o level1/atl_drotm.o level1/atl_drotmg.o level1/atl_dscal.o level1/atl_dswap.o level1/atl_idamax.o level3/atl_dgemm.o level3/atl_dgemm_nn.o level3/atl_dsymm.o level3/atl_stubs.o
ranlib ../libatlulmblas.a
make -C refblas
gfortran -fimplicit-none -O3 -c -o caxpy.o caxpy.f
gfortran -fimplicit-none -O3 -c -o ccopy.o ccopy.f
gfortran -fimplicit-none -O3 -c -o cdotc.o cdotc.f
gfortran -fimplicit-none -O3 -c -o cdotu.o cdotu.f
gfortran -fimplicit-none -O3 -c -o cgbmv.o cgbmv.f
gfortran -fimplicit-none -O3 -c -o cgemm.o cgemm.f
gfortran -fimplicit-none -O3 -c -o cgemv.o cgemv.f
gfortran -fimplicit-none -O3 -c -o cgerc.o cgerc.f
gfortran -fimplicit-none -O3 -c -o cgeru.o cgeru.f
gfortran -fimplicit-none -O3 -c -o chbmv.o chbmv.f
gfortran -fimplicit-none -O3 -c -o chemm.o chemm.f
gfortran -fimplicit-none -O3 -c -o chemv.o chemv.f
gfortran -fimplicit-none -O3 -c -o cher.o cher.f
gfortran -fimplicit-none -O3 -c -o cher2.o cher2.f
gfortran -fimplicit-none -O3 -c -o cher2k.o cher2k.f
gfortran -fimplicit-none -O3 -c -o cherk.o cherk.f
gfortran -fimplicit-none -O3 -c -o chpmv.o chpmv.f
gfortran -fimplicit-none -O3 -c -o chpr.o chpr.f
gfortran -fimplicit-none -O3 -c -o chpr2.o chpr2.f
gfortran -fimplicit-none -O3 -c -o crotg.o crotg.f
gfortran -fimplicit-none -O3 -c -o cscal.o cscal.f
gfortran -fimplicit-none -O3 -c -o csrot.o csrot.f
gfortran -fimplicit-none -O3 -c -o csscal.o csscal.f
gfortran -fimplicit-none -O3 -c -o cswap.o cswap.f
gfortran -fimplicit-none -O3 -c -o csymm.o csymm.f
gfortran -fimplicit-none -O3 -c -o csyr2k.o csyr2k.f
gfortran -fimplicit-none -O3 -c -o csyrk.o csyrk.f
gfortran -fimplicit-none -O3 -c -o ctbmv.o ctbmv.f
gfortran -fimplicit-none -O3 -c -o ctbsv.o ctbsv.f
gfortran -fimplicit-none -O3 -c -o ctpmv.o ctpmv.f
gfortran -fimplicit-none -O3 -c -o ctpsv.o ctpsv.f
gfortran -fimplicit-none -O3 -c -o ctrmm.o ctrmm.f
gfortran -fimplicit-none -O3 -c -o ctrmv.o ctrmv.f
gfortran -fimplicit-none -O3 -c -o ctrsm.o ctrsm.f
gfortran -fimplicit-none -O3 -c -o ctrsv.o ctrsv.f
gfortran -fimplicit-none -O3 -c -o dasum.o dasum.f
gfortran -fimplicit-none -O3 -c -o daxpy.o daxpy.f
gfortran -fimplicit-none -O3 -c -o dcabs1.o dcabs1.f
gfortran -fimplicit-none -O3 -c -o dcopy.o dcopy.f
gfortran -fimplicit-none -O3 -c -o ddot.o ddot.f
gfortran -fimplicit-none -O3 -c -o dgbmv.o dgbmv.f
gfortran -fimplicit-none -O3 -c -o dgemm.o dgemm.f
gfortran -fimplicit-none -O3 -c -o dgemv.o dgemv.f
gfortran -fimplicit-none -O3 -c -o dger.o dger.f
gfortran -fimplicit-none -O3 -c -o dnrm2.o dnrm2.f
gfortran -fimplicit-none -O3 -c -o drot.o drot.f
gfortran -fimplicit-none -O3 -c -o drotg.o drotg.f
gfortran -fimplicit-none -O3 -c -o drotm.o drotm.f
gfortran -fimplicit-none -O3 -c -o drotmg.o drotmg.f
gfortran -fimplicit-none -O3 -c -o dsbmv.o dsbmv.f
gfortran -fimplicit-none -O3 -c -o dscal.o dscal.f
gfortran -fimplicit-none -O3 -c -o dsdot.o dsdot.f
gfortran -fimplicit-none -O3 -c -o dspmv.o dspmv.f
gfortran -fimplicit-none -O3 -c -o dspr.o dspr.f
gfortran -fimplicit-none -O3 -c -o dspr2.o dspr2.f
gfortran -fimplicit-none -O3 -c -o dswap.o dswap.f
gfortran -fimplicit-none -O3 -c -o dsymm.o dsymm.f
gfortran -fimplicit-none -O3 -c -o dsymv.o dsymv.f
gfortran -fimplicit-none -O3 -c -o dsyr.o dsyr.f
gfortran -fimplicit-none -O3 -c -o dsyr2.o dsyr2.f
gfortran -fimplicit-none -O3 -c -o dsyr2k.o dsyr2k.f
gfortran -fimplicit-none -O3 -c -o dsyrk.o dsyrk.f
gfortran -fimplicit-none -O3 -c -o dtbmv.o dtbmv.f
gfortran -fimplicit-none -O3 -c -o dtbsv.o dtbsv.f
gfortran -fimplicit-none -O3 -c -o dtpmv.o dtpmv.f
gfortran -fimplicit-none -O3 -c -o dtpsv.o dtpsv.f
gfortran -fimplicit-none -O3 -c -o dtrmm.o dtrmm.f
gfortran -fimplicit-none -O3 -c -o dtrmv.o dtrmv.f
gfortran -fimplicit-none -O3 -c -o dtrsm.o dtrsm.f
gfortran -fimplicit-none -O3 -c -o dtrsv.o dtrsv.f
gfortran -fimplicit-none -O3 -c -o dzasum.o dzasum.f
gfortran -fimplicit-none -O3 -c -o dznrm2.o dznrm2.f
gfortran -fimplicit-none -O3 -c -o icamax.o icamax.f
gfortran -fimplicit-none -O3 -c -o idamax.o idamax.f
gfortran -fimplicit-none -O3 -c -o isamax.o isamax.f
gfortran -fimplicit-none -O3 -c -o izamax.o izamax.f
gfortran -fimplicit-none -O3 -c -o lsame.o lsame.f
gfortran -fimplicit-none -O3 -c -o sasum.o sasum.f
gfortran -fimplicit-none -O3 -c -o saxpy.o saxpy.f
gfortran -fimplicit-none -O3 -c -o scabs1.o scabs1.f
gfortran -fimplicit-none -O3 -c -o scasum.o scasum.f
gfortran -fimplicit-none -O3 -c -o scnrm2.o scnrm2.f
gfortran -fimplicit-none -O3 -c -o scopy.o scopy.f
gfortran -fimplicit-none -O3 -c -o sdot.o sdot.f
gfortran -fimplicit-none -O3 -c -o sdsdot.o sdsdot.f
gfortran -fimplicit-none -O3 -c -o sgbmv.o sgbmv.f
gfortran -fimplicit-none -O3 -c -o sgemm.o sgemm.f
gfortran -fimplicit-none -O3 -c -o sgemv.o sgemv.f
gfortran -fimplicit-none -O3 -c -o sger.o sger.f
gfortran -fimplicit-none -O3 -c -o snrm2.o snrm2.f
gfortran -fimplicit-none -O3 -c -o srot.o srot.f
gfortran -fimplicit-none -O3 -c -o srotg.o srotg.f
gfortran -fimplicit-none -O3 -c -o srotm.o srotm.f
gfortran -fimplicit-none -O3 -c -o srotmg.o srotmg.f
gfortran -fimplicit-none -O3 -c -o ssbmv.o ssbmv.f
gfortran -fimplicit-none -O3 -c -o sscal.o sscal.f
gfortran -fimplicit-none -O3 -c -o sspmv.o sspmv.f
gfortran -fimplicit-none -O3 -c -o sspr.o sspr.f
gfortran -fimplicit-none -O3 -c -o sspr2.o sspr2.f
gfortran -fimplicit-none -O3 -c -o sswap.o sswap.f
gfortran -fimplicit-none -O3 -c -o ssymm.o ssymm.f
gfortran -fimplicit-none -O3 -c -o ssymv.o ssymv.f
gfortran -fimplicit-none -O3 -c -o ssyr.o ssyr.f
gfortran -fimplicit-none -O3 -c -o ssyr2.o ssyr2.f
gfortran -fimplicit-none -O3 -c -o ssyr2k.o ssyr2k.f
gfortran -fimplicit-none -O3 -c -o ssyrk.o ssyrk.f
gfortran -fimplicit-none -O3 -c -o stbmv.o stbmv.f
gfortran -fimplicit-none -O3 -c -o stbsv.o stbsv.f
gfortran -fimplicit-none -O3 -c -o stpmv.o stpmv.f
gfortran -fimplicit-none -O3 -c -o stpsv.o stpsv.f
gfortran -fimplicit-none -O3 -c -o strmm.o strmm.f
gfortran -fimplicit-none -O3 -c -o strmv.o strmv.f
gfortran -fimplicit-none -O3 -c -o strsm.o strsm.f
gfortran -fimplicit-none -O3 -c -o strsv.o strsv.f
gfortran -fimplicit-none -O3 -c -o xerbla.o xerbla.f
gfortran -fimplicit-none -O3 -c -o xerbla_array.o xerbla_array.f
gfortran -fimplicit-none -O3 -c -o zaxpy.o zaxpy.f
gfortran -fimplicit-none -O3 -c -o zcopy.o zcopy.f
gfortran -fimplicit-none -O3 -c -o zdotc.o zdotc.f
gfortran -fimplicit-none -O3 -c -o zdotu.o zdotu.f
gfortran -fimplicit-none -O3 -c -o zdrot.o zdrot.f
gfortran -fimplicit-none -O3 -c -o zdscal.o zdscal.f
gfortran -fimplicit-none -O3 -c -o zgbmv.o zgbmv.f
gfortran -fimplicit-none -O3 -c -o zgemm.o zgemm.f
gfortran -fimplicit-none -O3 -c -o zgemv.o zgemv.f
gfortran -fimplicit-none -O3 -c -o zgerc.o zgerc.f
gfortran -fimplicit-none -O3 -c -o zgeru.o zgeru.f
gfortran -fimplicit-none -O3 -c -o zhbmv.o zhbmv.f
gfortran -fimplicit-none -O3 -c -o zhemm.o zhemm.f
gfortran -fimplicit-none -O3 -c -o zhemv.o zhemv.f
gfortran -fimplicit-none -O3 -c -o zher.o zher.f
gfortran -fimplicit-none -O3 -c -o zher2.o zher2.f
gfortran -fimplicit-none -O3 -c -o zher2k.o zher2k.f
gfortran -fimplicit-none -O3 -c -o zherk.o zherk.f
gfortran -fimplicit-none -O3 -c -o zhpmv.o zhpmv.f
gfortran -fimplicit-none -O3 -c -o zhpr.o zhpr.f
gfortran -fimplicit-none -O3 -c -o zhpr2.o zhpr2.f
gfortran -fimplicit-none -O3 -c -o zrotg.o zrotg.f
gfortran -fimplicit-none -O3 -c -o zscal.o zscal.f
gfortran -fimplicit-none -O3 -c -o zswap.o zswap.f
gfortran -fimplicit-none -O3 -c -o zsymm.o zsymm.f
gfortran -fimplicit-none -O3 -c -o zsyr2k.o zsyr2k.f
gfortran -fimplicit-none -O3 -c -o zsyrk.o zsyrk.f
gfortran -fimplicit-none -O3 -c -o ztbmv.o ztbmv.f
gfortran -fimplicit-none -O3 -c -o ztbsv.o ztbsv.f
gfortran -fimplicit-none -O3 -c -o ztpmv.o ztpmv.f
gfortran -fimplicit-none -O3 -c -o ztpsv.o ztpsv.f
gfortran -fimplicit-none -O3 -c -o ztrmm.o ztrmm.f
gfortran -fimplicit-none -O3 -c -o ztrmv.o ztrmv.f
gfortran -fimplicit-none -O3 -c -o ztrsm.o ztrsm.f
gfortran -fimplicit-none -O3 -c -o ztrsv.o ztrsv.f
ar cru ../librefblas.a caxpy.o ccopy.o cdotc.o cdotu.o cgbmv.o cgemm.o cgemv.o cgerc.o cgeru.o chbmv.o chemm.o chemv.o cher.o cher2.o cher2k.o cherk.o chpmv.o chpr.o chpr2.o crotg.o cscal.o csrot.o csscal.o cswap.o csymm.o csyr2k.o csyrk.o ctbmv.o ctbsv.o ctpmv.o ctpsv.o ctrmm.o ctrmv.o ctrsm.o ctrsv.o dasum.o daxpy.o dcabs1.o dcopy.o ddot.o dgbmv.o dgemm.o dgemv.o dger.o dnrm2.o drot.o drotg.o drotm.o drotmg.o dsbmv.o dscal.o dsdot.o dspmv.o dspr.o dspr2.o dswap.o dsymm.o dsymv.o dsyr.o dsyr2.o dsyr2k.o dsyrk.o dtbmv.o dtbsv.o dtpmv.o dtpsv.o dtrmm.o dtrmv.o dtrsm.o dtrsv.o dzasum.o dznrm2.o icamax.o idamax.o isamax.o izamax.o lsame.o sasum.o saxpy.o scabs1.o scasum.o scnrm2.o scopy.o sdot.o sdsdot.o sgbmv.o sgemm.o sgemv.o sger.o snrm2.o srot.o srotg.o srotm.o srotmg.o ssbmv.o sscal.o sspmv.o sspr.o sspr2.o sswap.o ssymm.o ssymv.o ssyr.o ssyr2.o ssyr2k.o ssyrk.o stbmv.o stbsv.o stpmv.o stpsv.o strmm.o strmv.o strsm.o strsv.o xerbla.o xerbla_array.o zaxpy.o zcopy.o zdotc.o zdotu.o zdrot.o zdscal.o zgbmv.o zgemm.o zgemv.o zgerc.o zgeru.o zhbmv.o zhemm.o zhemv.o zher.o zher2.o zher2k.o zherk.o zhpmv.o zhpr.o zhpr2.o zrotg.o zscal.o zswap.o zsymm.o zsyr2k.o zsyrk.o ztbmv.o ztbsv.o ztpmv.o ztpsv.o ztrmm.o ztrmv.o ztrsm.o ztrsv.o
ranlib ../librefblas.a
make -C test
gfortran dblat1.f -L.. -lrefblas -o dblat1_ref
dblat1.f:215.44:
CALL STEST1(DNRM2(N,SX,INCX),STEMP,STEMP,SFAC)
1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
dblat1.f:219.44:
CALL STEST1(DASUM(N,SX,INCX),STEMP,STEMP,SFAC)
1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
gfortran dblat3.f -L.. -lrefblas -o dblat3_ref
gfortran dblat1.f -L.. -lulmblas -o dblat1_ulm
dblat1.f:215.44:
CALL STEST1(DNRM2(N,SX,INCX),STEMP,STEMP,SFAC)
1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
dblat1.f:219.44:
CALL STEST1(DASUM(N,SX,INCX),STEMP,STEMP,SFAC)
1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
gfortran dblat3.f -L.. -lulmblas -o dblat3_ulm
make -C bench
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o l1blastst.o l1blastst.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_cputime.o ATL_cputime.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_epsilon.o ATL_epsilon.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77amax.o ATL_f77amax.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77asum.o ATL_f77asum.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77axpy.o ATL_f77axpy.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77copy.o ATL_f77copy.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77dot.o ATL_f77dot.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77gemm.o ATL_f77gemm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77nrm2.o ATL_f77nrm2.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77rot.o ATL_f77rot.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77rotg.o ATL_f77rotg.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77rotm.o ATL_f77rotm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77rotmg.o ATL_f77rotmg.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77scal.o ATL_f77scal.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77swap.o ATL_f77swap.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77symm.o ATL_f77symm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77syr2k.o ATL_f77syr2k.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77syrk.o ATL_f77syrk.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77trmm.o ATL_f77trmm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77trsm.o ATL_f77trsm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_flushcache.o ATL_flushcache.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_gediffnrm1.o ATL_gediffnrm1.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_gegen.o ATL_gegen.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_genrm1.o ATL_genrm1.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_infnrm.o ATL_infnrm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_rand.o ATL_rand.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_set.o ATL_set.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_synrm.o ATL_synrm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_trnrm1.o ATL_trnrm1.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_vdiff.o ATL_vdiff.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_zero.o ATL_zero.c
gfortran -c -o ATL_df77wrap.o ATL_df77wrap.f
ar r libtstatlas.a ATL_cputime.o ATL_epsilon.o ATL_f77amax.o ATL_f77asum.o ATL_f77axpy.o ATL_f77copy.o ATL_f77dot.o ATL_f77gemm.o ATL_f77nrm2.o ATL_f77rot.o ATL_f77rotg.o ATL_f77rotm.o ATL_f77rotmg.o ATL_f77scal.o ATL_f77swap.o ATL_f77symm.o ATL_f77syr2k.o ATL_f77syrk.o ATL_f77trmm.o ATL_f77trsm.o ATL_flushcache.o ATL_gediffnrm1.o ATL_gegen.o ATL_genrm1.o ATL_infnrm.o ATL_rand.o ATL_set.o ATL_synrm.o ATL_trnrm1.o ATL_vdiff.o ATL_zero.o ATL_df77wrap.o
ar: creating archive libtstatlas.a
ranlib libtstatlas.a
gfortran -o xdl1blastst l1blastst.o libtstatlas.a ../libatlulmblas.a ../librefblas.a
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o l3blastst.o l3blastst.c
gfortran -o xdl3blastst l3blastst.o libtstatlas.a ../libatlulmblas.a ../librefblas.a
The Micro Kernel Algorithm
We use the parameters \(m_r = 8\) and \(n_r=4\) in the micro kernel. We merely optimize the update step
\[\mathbf{AB} \leftarrow \mathbf{AB} + \begin{pmatrix} a_{8l } \\ a_{8l+1} \\ a_{8l+2} \\ a_{8l+3} \\ a_{8l+4} \\ a_{8l+5} \\ a_{8l+6} \\ a_{8l+7} \end{pmatrix} \begin{pmatrix} b_{4l}, & b_{4l+1}, & b_{4l+2}, & b_{4l+3}\end{pmatrix}\]
by using AVX intrinsics. Looking at the original C code
for (l=0; l<kc; ++l) {
    for (j=0; j<NR; ++j) {
        for (i=0; i<MR; ++i) {
            AB[i+j*MR] += A[i]*B[j];
        }
    }
    A += MR;
    B += NR;
}
we notice that in the innermost loop the value B[j] does not change. Since A advances by MR = 8 and B by NR = 4 elements per iteration, the \(l\)-th update reads \(a_{8l}, \dots, a_{8l+7}\) and \(b_{4l}, \dots, b_{4l+3}\). The natural idea is to compute this step column by column as
\[\mathbf{AB} \leftarrow \mathbf{AB} + \begin{pmatrix} b_{4l} \begin{pmatrix} a_{8l } \\ a_{8l+1} \\ a_{8l+2} \\ a_{8l+3} \\ a_{8l+4} \\ a_{8l+5} \\ a_{8l+6} \\ a_{8l+7} \end{pmatrix}, & b_{4l+1} \begin{pmatrix} a_{8l } \\ a_{8l+1} \\ a_{8l+2} \\ a_{8l+3} \\ a_{8l+4} \\ a_{8l+5} \\ a_{8l+6} \\ a_{8l+7} \end{pmatrix}, & b_{4l+2} \begin{pmatrix} a_{8l } \\ a_{8l+1} \\ a_{8l+2} \\ a_{8l+3} \\ a_{8l+4} \\ a_{8l+5} \\ a_{8l+6} \\ a_{8l+7} \end{pmatrix}, & b_{4l+3} \begin{pmatrix} a_{8l } \\ a_{8l+1} \\ a_{8l+2} \\ a_{8l+3} \\ a_{8l+4} \\ a_{8l+5} \\ a_{8l+6} \\ a_{8l+7} \end{pmatrix} \end{pmatrix}\]
Let \(\mathbb{b}_{0000}, \mathbb{b}_{1111}, \mathbb{b}_{2222}, \mathbb{b}_{3333}\) and \(\mathbb{a}_{0123}, \mathbb{a}_{4567}\) denote AVX registers, each holding four double-precision values. We use these six registers to store the operands:
\[\begin{array}{llllll}\mathbb{b}_{0000} \leftarrow \begin{pmatrix} b_{4l } \\ b_{4l } \\ b_{4l } \\ b_{4l } \end{pmatrix}, &\mathbb{b}_{1111} \leftarrow \begin{pmatrix} b_{4l+1} \\ b_{4l+1} \\ b_{4l+1} \\ b_{4l+1} \end{pmatrix}, &\mathbb{b}_{2222} \leftarrow \begin{pmatrix} b_{4l+2} \\ b_{4l+2} \\ b_{4l+2} \\ b_{4l+2} \end{pmatrix}, &\mathbb{b}_{3333} \leftarrow \begin{pmatrix} b_{4l+3} \\ b_{4l+3} \\ b_{4l+3} \\ b_{4l+3} \end{pmatrix}, &\mathbb{a}_{0123} \leftarrow \begin{pmatrix} a_{8l } \\ a_{8l+1} \\ a_{8l+2} \\ a_{8l+3} \end{pmatrix}, &\mathbb{a}_{4567} \leftarrow \begin{pmatrix} a_{8l+4} \\ a_{8l+5} \\ a_{8l+6} \\ a_{8l+7} \end{pmatrix}\end{array}\]
Another eight AVX registers denoted as \(\mathbb{ab}_{\cdot,\cdot,\cdot,\cdot}\) are used to represent \(\mathbf{AB}\):
\[\begin{array}{llll}\mathbb{ab}_{00,10,20,30} \leftarrow \begin{pmatrix} ab_{0,0} \\ ab_{1,0} \\ ab_{2,0} \\ ab_{3,0} \end{pmatrix}, &\mathbb{ab}_{01,11,21,31} \leftarrow \begin{pmatrix} ab_{0,1} \\ ab_{1,1} \\ ab_{2,1} \\ ab_{3,1} \end{pmatrix}, &\mathbb{ab}_{02,12,22,32} \leftarrow \begin{pmatrix} ab_{0,2} \\ ab_{1,2} \\ ab_{2,2} \\ ab_{3,2} \end{pmatrix}, &\mathbb{ab}_{03,13,23,33} \leftarrow \begin{pmatrix} ab_{0,3} \\ ab_{1,3} \\ ab_{2,3} \\ ab_{3,3} \end{pmatrix} \\[0.5cm]\mathbb{ab}_{40,50,60,70} \leftarrow \begin{pmatrix} ab_{4,0} \\ ab_{5,0} \\ ab_{6,0} \\ ab_{7,0} \end{pmatrix}, &\mathbb{ab}_{41,51,61,71} \leftarrow \begin{pmatrix} ab_{4,1} \\ ab_{5,1} \\ ab_{6,1} \\ ab_{7,1} \end{pmatrix}, &\mathbb{ab}_{42,52,62,72} \leftarrow \begin{pmatrix} ab_{4,2} \\ ab_{5,2} \\ ab_{6,2} \\ ab_{7,2} \end{pmatrix}, &\mathbb{ab}_{43,53,63,73} \leftarrow \begin{pmatrix} ab_{4,3} \\ ab_{5,3} \\ ab_{6,3} \\ ab_{7,3} \end{pmatrix}\end{array}\]
As our architecture has a total of 16 AVX registers, we have two registers left. We use them for temporary results and denote them as \(\mathbb{tmp}_1\) and \(\mathbb{tmp}_2\).
A single update can now be computed as
- Update the first column:
  - \(\mathbb{tmp}_1 \leftarrow \mathbb{a}_{0123}\)
  - \(\mathbb{tmp}_2 \leftarrow \mathbb{a}_{4567}\)
  - \(\mathbb{tmp}_1 \leftarrow \mathbb{tmp}_1 \odot \mathbb{b}_{0000}\)
  - \(\mathbb{tmp}_2 \leftarrow \mathbb{tmp}_2 \odot \mathbb{b}_{0000}\)
  - \(\mathbb{ab}_{00,10,20,30} \leftarrow \mathbb{ab}_{00,10,20,30} + \mathbb{tmp}_1\)
  - \(\mathbb{ab}_{40,50,60,70} \leftarrow \mathbb{ab}_{40,50,60,70} + \mathbb{tmp}_2\)
- Update the second column:
  - \(\mathbb{tmp}_1 \leftarrow \mathbb{a}_{0123}\)
  - \(\mathbb{tmp}_2 \leftarrow \mathbb{a}_{4567}\)
  - \(\mathbb{tmp}_1 \leftarrow \mathbb{tmp}_1 \odot \mathbb{b}_{1111}\)
  - \(\mathbb{tmp}_2 \leftarrow \mathbb{tmp}_2 \odot \mathbb{b}_{1111}\)
  - \(\mathbb{ab}_{01,11,21,31} \leftarrow \mathbb{ab}_{01,11,21,31} + \mathbb{tmp}_1\)
  - \(\mathbb{ab}_{41,51,61,71} \leftarrow \mathbb{ab}_{41,51,61,71} + \mathbb{tmp}_2\)
- Update the third column:
  - \(\mathbb{tmp}_1 \leftarrow \mathbb{a}_{0123}\)
  - \(\mathbb{tmp}_2 \leftarrow \mathbb{a}_{4567}\)
  - \(\mathbb{tmp}_1 \leftarrow \mathbb{tmp}_1 \odot \mathbb{b}_{2222}\)
  - \(\mathbb{tmp}_2 \leftarrow \mathbb{tmp}_2 \odot \mathbb{b}_{2222}\)
  - \(\mathbb{ab}_{02,12,22,32} \leftarrow \mathbb{ab}_{02,12,22,32} + \mathbb{tmp}_1\)
  - \(\mathbb{ab}_{42,52,62,72} \leftarrow \mathbb{ab}_{42,52,62,72} + \mathbb{tmp}_2\)
- Update the fourth column:
  - \(\mathbb{tmp}_1 \leftarrow \mathbb{a}_{0123}\)
  - \(\mathbb{tmp}_2 \leftarrow \mathbb{a}_{4567}\)
  - \(\mathbb{tmp}_1 \leftarrow \mathbb{tmp}_1 \odot \mathbb{b}_{3333}\)
  - \(\mathbb{tmp}_2 \leftarrow \mathbb{tmp}_2 \odot \mathbb{b}_{3333}\)
  - \(\mathbb{ab}_{03,13,23,33} \leftarrow \mathbb{ab}_{03,13,23,33} + \mathbb{tmp}_1\)
  - \(\mathbb{ab}_{43,53,63,73} \leftarrow \mathbb{ab}_{43,53,63,73} + \mathbb{tmp}_2\)
Here \(\odot\) denotes the usual component-wise multiplication of AVX registers. We also assume that all the \(\mathbb{ab}\) registers are initialized with zeros before the first update step.
Once we have completed the total of \(k_c\) updates we write the result back to memory into \(\mathbf{AB}\).
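The update steps above can be sketched with AVX intrinsics roughly as follows. This is a minimal sketch under stated assumptions, not the code from the branch: the function names update_ref and update_avx are made up for illustration, the four column updates are written as a loop instead of being spelled out, and the eight accumulators are kept in two small arrays. The intrinsics are the standard ones from immintrin.h. If the compiler does not define __AVX__ (for example when -mavx is missing), the sketch falls back to the scalar loop so that it still builds.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

enum { MR = 8, NR = 4 };

/* Scalar reference: AB = sum over l of (column l of A) * (row l of B).
   AB is an MR x NR block stored column by column. */
void
update_ref(int kc, const double *A, const double *B, double *AB)
{
    int i, j, l;

    for (i=0; i<MR*NR; ++i) {
        AB[i] = 0.0;
    }
    for (l=0; l<kc; ++l) {
        for (j=0; j<NR; ++j) {
            for (i=0; i<MR; ++i) {
                AB[i+j*MR] += A[i]*B[j];
            }
        }
        A += MR;
        B += NR;
    }
}

/* Same computation with AVX registers. A and AB must be 32-byte aligned
   because aligned loads and stores are used. */
void
update_avx(int kc, const double *A, const double *B, double *AB)
{
#if defined(__AVX__)
    __m256d ab_lo[NR], ab_hi[NR];   /* rows 0..3 and rows 4..7 of each column */
    int     j, l;

    assert(((uintptr_t)A % 32)==0 && ((uintptr_t)AB % 32)==0);

    /* all ab registers start at zero */
    for (j=0; j<NR; ++j) {
        ab_lo[j] = _mm256_setzero_pd();
        ab_hi[j] = _mm256_setzero_pd();
    }
    for (l=0; l<kc; ++l) {
        __m256d a_0123 = _mm256_load_pd(&A[0]);
        __m256d a_4567 = _mm256_load_pd(&A[4]);

        for (j=0; j<NR; ++j) {
            /* b_jjjj holds B[j] in all four lanes */
            __m256d b_jjjj = _mm256_broadcast_sd(&B[j]);
            __m256d tmp1   = _mm256_mul_pd(a_0123, b_jjjj);
            __m256d tmp2   = _mm256_mul_pd(a_4567, b_jjjj);

            ab_lo[j] = _mm256_add_pd(ab_lo[j], tmp1);
            ab_hi[j] = _mm256_add_pd(ab_hi[j], tmp2);
        }
        A += MR;
        B += NR;
    }
    /* after kc updates write the result back to memory */
    for (j=0; j<NR; ++j) {
        _mm256_store_pd(&AB[0+j*MR], ab_lo[j]);
        _mm256_store_pd(&AB[4+j*MR], ab_hi[j]);
    }
#else
    update_ref(kc, A, B, AB);       /* fallback without AVX support */
#endif
}
```

Calling both functions on the same 32-byte aligned inputs and comparing the two AB buffers element by element is a simple way to check such a kernel before benchmarking it.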
The dgemm_nn Code
Note that we also added an attribute for 32-byte alignment to the definition of the local buffers _A, _B, _C and AB. A 32-byte alignment is required for the aligned load and store intrinsics used in the micro kernel.
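As a rough illustration of such an alignment attribute, the sketch below declares a few buffers with the GCC attribute syntax and checks their addresses. The names buf_A, buf_B, buf_C (standing in for _A, _B, _C), the helper functions and the block sizes MC, KC, NC are assumptions chosen for this example and are not taken from the branch.

```c
#include <assert.h>
#include <stdint.h>

/* Block sizes: illustrative values only, not taken from the source. */
#define MC 384
#define KC 384
#define NC 4096
#define MR 8
#define NR 4

/* Buffers with a 32-byte alignment attribute (GCC/Clang syntax). */
double buf_A[MC*KC] __attribute__ ((aligned (32)));
double buf_B[KC*NC] __attribute__ ((aligned (32)));
double buf_C[MR*NR] __attribute__ ((aligned (32)));

/* Returns 1 if p is a multiple of 32 bytes, 0 otherwise. */
int
is_aligned32(const void *p)
{
    return ((uintptr_t)p % 32) == 0;
}

/* The same attribute works for a local buffer such as AB. */
int
local_AB_is_aligned(void)
{
    double AB[MR*NR] __attribute__ ((aligned (32)));

    AB[0] = 0.0;
    return is_aligned32(AB);
}

/* Aborts if one of the global buffers is not 32-byte aligned. */
void
check_buffers(void)
{
    assert(is_aligned32(buf_A));
    assert(is_aligned32(buf_B));
    assert(is_aligned32(buf_C));
}
```

Since a double has 8 bytes, every fourth element of an aligned buffer is again 32-byte aligned. With MR = 8 this is why both halves of a packed column of A can be read with aligned loads.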
Benchmark Results
We run the benchmarks
$shell> cd bench
$shell> ./xdl3blastst -N 100 2500 100
./xdl3blastst -N 100 2500 100
--------------------------------- GEMM ----------------------------------
TST# A B    M    N    K ALPHA  LDA  LDB  BETA  LDC  TIME   MFLOP SpUp  TEST
==== = = ==== ==== ==== ===== ==== ==== ===== ==== =====   ===== ==== =====
   0 N N  100  100  100   1.0 2500 2500   1.0 2500  0.00  3546.1 1.00 -----
   0 N N  100  100  100   1.0 2500 2500   1.0 2500  0.00 11173.2 3.15  PASS
   1 N N  200  200  200   1.0 2500 2500   1.0 2500  0.00  3821.4 1.00 -----
   1 N N  200  200  200   1.0 2500 2500   1.0 2500  0.00 14705.9 3.85  PASS
   2 N N  300  300  300   1.0 2500 2500   1.0 2500  0.01  4033.5 1.00 -----
   2 N N  300  300  300   1.0 2500 2500   1.0 2500  0.00 15826.5 3.92  PASS
   3 N N  400  400  400   1.0 2500 2500   1.0 2500  0.03  4106.6 1.00 -----
   3 N N  400  400  400   1.0 2500 2500   1.0 2500  0.01 16522.5 4.02  PASS
   4 N N  500  500  500   1.0 2500 2500   1.0 2500  0.07  3816.7 1.00 -----
   4 N N  500  500  500   1.0 2500 2500   1.0 2500  0.02 16081.3 4.21  PASS
   5 N N  600  600  600   1.0 2500 2500   1.0 2500  0.12  3576.8 1.00 -----
   5 N N  600  600  600   1.0 2500 2500   1.0 2500  0.03 16313.0 4.56  PASS
   6 N N  700  700  700   1.0 2500 2500   1.0 2500  0.20  3362.1 1.00 -----
   6 N N  700  700  700   1.0 2500 2500   1.0 2500  0.04 17555.1 5.22  PASS
   7 N N  800  800  800   1.0 2500 2500   1.0 2500  0.34  2977.9 1.00 -----
   7 N N  800  800  800   1.0 2500 2500   1.0 2500  0.06 17501.6 5.88  PASS
   8 N N  900  900  900   1.0 2500 2500   1.0 2500  0.52  2795.5 1.00 -----
   8 N N  900  900  900   1.0 2500 2500   1.0 2500  0.08 17201.7 6.15  PASS
   9 N N 1000 1000 1000   1.0 2500 2500   1.0 2500  0.73  2731.6 1.00 -----
   9 N N 1000 1000 1000   1.0 2500 2500   1.0 2500  0.11 17764.5 6.50  PASS
  10 N N 1100 1100 1100   1.0 2500 2500   1.0 2500  0.99  2701.9 1.00 -----
  10 N N 1100 1100 1100   1.0 2500 2500   1.0 2500  0.15 17865.4 6.61  PASS
  11 N N 1200 1200 1200   1.0 2500 2500   1.0 2500  1.26  2743.8 1.00 -----
  11 N N 1200 1200 1200   1.0 2500 2500   1.0 2500  0.19 17934.4 6.54  PASS
  12 N N 1300 1300 1300   1.0 2500 2500   1.0 2500  1.60  2754.2 1.00 -----
  12 N N 1300 1300 1300   1.0 2500 2500   1.0 2500  0.25 17904.4 6.50  PASS
  13 N N 1400 1400 1400   1.0 2500 2500   1.0 2500  1.99  2755.0 1.00 -----
  13 N N 1400 1400 1400   1.0 2500 2500   1.0 2500  0.30 18081.2 6.56  PASS
  14 N N 1500 1500 1500   1.0 2500 2500   1.0 2500  2.43  2781.7 1.00 -----
  14 N N 1500 1500 1500   1.0 2500 2500   1.0 2500  0.39 17398.6 6.25  PASS
  15 N N 1600 1600 1600   1.0 2500 2500   1.0 2500  2.93  2795.8 1.00 -----
  15 N N 1600 1600 1600   1.0 2500 2500   1.0 2500  0.45 18008.9 6.44  PASS
  16 N N 1700 1700 1700   1.0 2500 2500   1.0 2500  3.54  2773.3 1.00 -----
  16 N N 1700 1700 1700   1.0 2500 2500   1.0 2500  0.56 17671.4 6.37  PASS
  17 N N 1800 1800 1800   1.0 2500 2500   1.0 2500  4.15  2812.9 1.00 -----
  17 N N 1800 1800 1800   1.0 2500 2500   1.0 2500  0.65 17813.5 6.33  PASS
  18 N N 1900 1900 1900   1.0 2500 2500   1.0 2500  4.91  2796.5 1.00 -----
  18 N N 1900 1900 1900   1.0 2500 2500   1.0 2500  0.76 18052.0 6.46  PASS
  19 N N 2000 2000 2000   1.0 2500 2500   1.0 2500  5.65  2833.3 1.00 -----
  19 N N 2000 2000 2000   1.0 2500 2500   1.0 2500  0.88 18141.8 6.40  PASS
  20 N N 2100 2100 2100   1.0 2500 2500   1.0 2500  6.64  2790.1 1.00 -----
  20 N N 2100 2100 2100   1.0 2500 2500   1.0 2500  1.04 17817.2 6.39  PASS
  21 N N 2200 2200 2200   1.0 2500 2500   1.0 2500  7.62  2793.3 1.00 -----
  21 N N 2200 2200 2200   1.0 2500 2500   1.0 2500  1.18 18084.7 6.47  PASS
  22 N N 2300 2300 2300   1.0 2500 2500   1.0 2500  8.62  2824.5 1.00 -----
  22 N N 2300 2300 2300   1.0 2500 2500   1.0 2500  1.34 18142.4 6.42  PASS
  23 N N 2400 2400 2400   1.0 2500 2500   1.0 2500  9.87  2799.9 1.00 -----
  23 N N 2400 2400 2400   1.0 2500 2500   1.0 2500  1.69 16380.7 5.85  PASS
  24 N N 2500 2500 2500   1.0 2500 2500   1.0 2500 10.86  2878.1 1.00 -----
  24 N N 2500 2500 2500   1.0 2500 2500   1.0 2500  1.81 17251.7 5.99  PASS
25 tests run, 25 passed
and filter out the results for the demo-naive-avx-with-intrinsics branch:
$shell> ./xdl3blastst -N 100 2500 100 > report
$shell> grep PASS report > demo-naive-avx-with-intrinsics
$shell> grep "\ \-\-\-\-\-$" report > refBLAS
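The MFLOP and SpUp columns of the report can be cross-checked by hand. The sketch below assumes the usual count of 2*m*n*k floating point operations for a matrix product and takes the speedup as the ratio of the two MFLOP values. The function names gemm_mflops and speedup are made up for this example, and the benchmark program may count operations slightly differently.

```c
#include <assert.h>

/* MFLOPS of an m x k times k x n product, assuming 2*m*n*k operations. */
double
gemm_mflops(int m, int n, int k, double seconds)
{
    assert(seconds > 0.0);
    return 2.0*m*n*k / seconds / 1.0e6;
}

/* Speedup of a tested implementation relative to the reference. */
double
speedup(double mflops, double mflops_ref)
{
    assert(mflops_ref > 0.0);
    return mflops / mflops_ref;
}
```

For test 9 (M=N=K=1000) the report lists 2731.6 MFLOPS for the reference and 17764.5 MFLOPS for the tested implementation. Their ratio is about 6.50, which matches the SpUp column. Recomputing the MFLOP values from the TIME column only works approximately, because the times are printed with two decimals.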
With the gnuplot script
set output "bench.png"
set xlabel "Matrix dimensions N=M=K"
set ylabel "MFLOPS"
set yrange [0:21600]
set title "Compute C + A*B"
set key outside
plot "refBLAS" using 4:13 with linespoints lt 2 title "Netlib RefBLAS", "demo-naive-avx-with-intrinsics" using 4:13 with linespoints lt 3 title "demo-naive-avx-with-intrinsics"
we feed gnuplot
$shell> gnuplot bench.gps
and get the plot bench.png.
