Content

Complete Assembler Micro Kernel

The BLIS micro kernel gains another performance boost through prefetching data. In my experiments I only was able to add the feature after converting the whole mico kernel into assembler.

Status Quo: So far only the loop for computing AB was implemented in assembler. The rest of the micro kernel was left untouched and is plain C code. The bridge between the assembler and C code is the double array AB of length 16. The assembler code copies at the end its results into this array. The remaining C code uses AB to compute C <- beta*C + alpha*A*B.

Having all the micro kernel in assembler removes the need for this internal buffer AB.

Select the demo-sse-all-asm Branch

Again, we do a make clean before switching a branch:

$shell> cd ulmBLAS                                                       
$shell> make clean                                                       
for dir in src refblas test bench; do make -C $dir clean; done
rm -f  auxiliary/xerbla.o  level1/dasum.o level1/daxpy.o level1/dcopy.o level1/ddot.o level1/dnrm2.o level1/drot.o level1/drotg.o level1/drotm.o level1/drotmg.o level1/dscal.o level1/dswap.o level1/idamax.o  level3/dgemm.o level3/dgemm_nn.o level3/dsymm.o level3/stubs.o
rm -f  auxiliary/atl_xerbla.o  level1/atl_dasum.o level1/atl_daxpy.o level1/atl_dcopy.o level1/atl_ddot.o level1/atl_dnrm2.o level1/atl_drot.o level1/atl_drotg.o level1/atl_drotm.o level1/atl_drotmg.o level1/atl_dscal.o level1/atl_dswap.o level1/atl_idamax.o  level3/atl_dgemm.o level3/atl_dgemm_nn.o level3/atl_dsymm.o level3/atl_stubs.o
rm -f ../libulmblas.a
rm -f ../libatlulmblas.a
rm -f caxpy.o ccopy.o cdotc.o cdotu.o cgbmv.o cgemm.o cgemv.o cgerc.o cgeru.o chbmv.o chemm.o chemv.o cher.o cher2.o cher2k.o cherk.o chpmv.o chpr.o chpr2.o crotg.o cscal.o csrot.o csscal.o cswap.o csymm.o csyr2k.o csyrk.o ctbmv.o ctbsv.o ctpmv.o ctpsv.o ctrmm.o ctrmv.o ctrsm.o ctrsv.o dasum.o daxpy.o dcabs1.o dcopy.o ddot.o dgbmv.o dgemm.o dgemv.o dger.o dnrm2.o drot.o drotg.o drotm.o drotmg.o dsbmv.o dscal.o dsdot.o dspmv.o dspr.o dspr2.o dswap.o dsymm.o dsymv.o dsyr.o dsyr2.o dsyr2k.o dsyrk.o dtbmv.o dtbsv.o dtpmv.o dtpsv.o dtrmm.o dtrmv.o dtrsm.o dtrsv.o dzasum.o dznrm2.o icamax.o idamax.o isamax.o izamax.o lsame.o sasum.o saxpy.o scabs1.o scasum.o scnrm2.o scopy.o sdot.o sdsdot.o sgbmv.o sgemm.o sgemv.o sger.o snrm2.o srot.o srotg.o srotm.o srotmg.o ssbmv.o sscal.o sspmv.o sspr.o sspr2.o sswap.o ssymm.o ssymv.o ssyr.o ssyr2.o ssyr2k.o ssyrk.o stbmv.o stbsv.o stpmv.o stpsv.o strmm.o strmv.o strsm.o strsv.o xerbla.o xerbla_array.o zaxpy.o zcopy.o zdotc.o zdotu.o zdrot.o zdscal.o zgbmv.o zgemm.o zgemv.o zgerc.o zgeru.o zhbmv.o zhemm.o zhemv.o zher.o zher2.o zher2k.o zherk.o zhpmv.o zhpr.o zhpr2.o zrotg.o zscal.o zswap.o zsymm.o zsyr2k.o zsyrk.o ztbmv.o ztbsv.o ztpmv.o ztpsv.o ztrmm.o ztrmv.o ztrsm.o ztrsv.o
rm -f ../librefblas.a
rm -f  dblat1_ref  dblat3_ref  dblat1_ulm  dblat3_ulm *.SUMM
rm -f xdl1blastst libtstatlas.a l1blastst.o  ATL_cputime.o  ATL_epsilon.o  ATL_f77amax.o  ATL_f77asum.o  ATL_f77axpy.o  ATL_f77copy.o  ATL_f77dot.o  ATL_f77gemm.o  ATL_f77nrm2.o  ATL_f77rot.o  ATL_f77rotg.o  ATL_f77rotm.o  ATL_f77rotmg.o  ATL_f77scal.o  ATL_f77swap.o  ATL_f77symm.o  ATL_f77syr2k.o  ATL_f77syrk.o  ATL_f77trmm.o  ATL_f77trsm.o  ATL_flushcache.o  ATL_gediffnrm1.o  ATL_gegen.o  ATL_genrm1.o  ATL_infnrm.o  ATL_rand.o  ATL_set.o  ATL_synrm.o  ATL_trnrm1.o  ATL_vdiff.o  ATL_zero.o  ATL_df77wrap.o

Then we are checking out the demo-sse-all-asm branch:

$shell> git branch -a                                                    
  demo-naive-sse-with-intrinsics
  demo-naive-sse-with-intrinsics-unrolled
  demo-pure-c
  demo-sse-asm
  demo-sse-asm-unrolled
  demo-sse-asm-unrolled-v2
* demo-sse-asm-unrolled-v3
  demo-sse-intrinsics
  demo-sse-intrinsics-v2
  demo-sse-intrinsics-v3
  master
  remotes/origin/HEAD -> origin/master
  remotes/origin/bench-atlas
  remotes/origin/bench-eigen
  remotes/origin/bench-mkl
  remotes/origin/blis-avx-microkernel
  remotes/origin/demo-naive-avx-with-intrinsics
  remotes/origin/demo-naive-sse-with-intrinsics
  remotes/origin/demo-naive-sse-with-intrinsics-unrolled
  remotes/origin/demo-pure-c
  remotes/origin/demo-sse-all-asm
  remotes/origin/demo-sse-asm
  remotes/origin/demo-sse-asm-for-AB-loop
  remotes/origin/demo-sse-asm-unrolled
  remotes/origin/demo-sse-asm-unrolled-v2
  remotes/origin/demo-sse-asm-unrolled-v3
  remotes/origin/demo-sse-asm-unrolled-with-prefetch
  remotes/origin/demo-sse-intrinsics
  remotes/origin/demo-sse-intrinsics-for-AB-loop
  remotes/origin/demo-sse-intrinsics-v2
  remotes/origin/demo-sse-intrinsics-v3
  remotes/origin/demo-with-sse-intrinsics
  remotes/origin/master
  remotes/origin/trsm-assignment
  remotes/origin/trsm-pure-c
$shell> git checkout -B demo-sse-all-asm remotes/origin/demo-sse-all-asm                  
Switched to a new branch 'demo-sse-all-asm'
Branch demo-sse-all-asm set up to track remote branch demo-sse-all-asm from origin.

Then we compile the project

$shell> make                                                             
make -C src
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o auxiliary/xerbla.o auxiliary/xerbla.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/dasum.o level1/dasum.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/daxpy.o level1/daxpy.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/dcopy.o level1/dcopy.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/ddot.o level1/ddot.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/dnrm2.o level1/dnrm2.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/drot.o level1/drot.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/drotg.o level1/drotg.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/drotm.o level1/drotm.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/drotmg.o level1/drotmg.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/dscal.o level1/dscal.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/dswap.o level1/dswap.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/idamax.o level1/idamax.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level3/dgemm.o level3/dgemm.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level3/dgemm_nn.o level3/dgemm_nn.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level3/dsymm.o level3/dsymm.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level3/stubs.o level3/stubs.c
ar cru ../libulmblas.a  auxiliary/xerbla.o  level1/dasum.o level1/daxpy.o level1/dcopy.o level1/ddot.o level1/dnrm2.o level1/drot.o level1/drotg.o level1/drotm.o level1/drotmg.o level1/dscal.o level1/dswap.o level1/idamax.o  level3/dgemm.o level3/dgemm_nn.o level3/dsymm.o level3/stubs.o
ranlib ../libulmblas.a
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o auxiliary/atl_xerbla.o auxiliary/xerbla.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_dasum.o level1/dasum.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_daxpy.o level1/daxpy.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_dcopy.o level1/dcopy.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_ddot.o level1/ddot.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_dnrm2.o level1/dnrm2.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_drot.o level1/drot.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_drotg.o level1/drotg.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_drotm.o level1/drotm.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_drotmg.o level1/drotmg.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_dscal.o level1/dscal.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_dswap.o level1/dswap.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_idamax.o level1/idamax.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level3/atl_dgemm.o level3/dgemm.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level3/atl_dgemm_nn.o level3/dgemm_nn.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level3/atl_dsymm.o level3/dsymm.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level3/atl_stubs.o level3/stubs.c
ar cru ../libatlulmblas.a  auxiliary/atl_xerbla.o  level1/atl_dasum.o level1/atl_daxpy.o level1/atl_dcopy.o level1/atl_ddot.o level1/atl_dnrm2.o level1/atl_drot.o level1/atl_drotg.o level1/atl_drotm.o level1/atl_drotmg.o level1/atl_dscal.o level1/atl_dswap.o level1/atl_idamax.o  level3/atl_dgemm.o level3/atl_dgemm_nn.o level3/atl_dsymm.o level3/atl_stubs.o
ranlib ../libatlulmblas.a
make -C refblas
gfortran -fimplicit-none -O3 -c -o caxpy.o caxpy.f
gfortran -fimplicit-none -O3 -c -o ccopy.o ccopy.f
gfortran -fimplicit-none -O3 -c -o cdotc.o cdotc.f
gfortran -fimplicit-none -O3 -c -o cdotu.o cdotu.f
gfortran -fimplicit-none -O3 -c -o cgbmv.o cgbmv.f
gfortran -fimplicit-none -O3 -c -o cgemm.o cgemm.f
gfortran -fimplicit-none -O3 -c -o cgemv.o cgemv.f
gfortran -fimplicit-none -O3 -c -o cgerc.o cgerc.f
gfortran -fimplicit-none -O3 -c -o cgeru.o cgeru.f
gfortran -fimplicit-none -O3 -c -o chbmv.o chbmv.f
gfortran -fimplicit-none -O3 -c -o chemm.o chemm.f
gfortran -fimplicit-none -O3 -c -o chemv.o chemv.f
gfortran -fimplicit-none -O3 -c -o cher.o cher.f
gfortran -fimplicit-none -O3 -c -o cher2.o cher2.f
gfortran -fimplicit-none -O3 -c -o cher2k.o cher2k.f
gfortran -fimplicit-none -O3 -c -o cherk.o cherk.f
gfortran -fimplicit-none -O3 -c -o chpmv.o chpmv.f
gfortran -fimplicit-none -O3 -c -o chpr.o chpr.f
gfortran -fimplicit-none -O3 -c -o chpr2.o chpr2.f
gfortran -fimplicit-none -O3 -c -o crotg.o crotg.f
gfortran -fimplicit-none -O3 -c -o cscal.o cscal.f
gfortran -fimplicit-none -O3 -c -o csrot.o csrot.f
gfortran -fimplicit-none -O3 -c -o csscal.o csscal.f
gfortran -fimplicit-none -O3 -c -o cswap.o cswap.f
gfortran -fimplicit-none -O3 -c -o csymm.o csymm.f
gfortran -fimplicit-none -O3 -c -o csyr2k.o csyr2k.f
gfortran -fimplicit-none -O3 -c -o csyrk.o csyrk.f
gfortran -fimplicit-none -O3 -c -o ctbmv.o ctbmv.f
gfortran -fimplicit-none -O3 -c -o ctbsv.o ctbsv.f
gfortran -fimplicit-none -O3 -c -o ctpmv.o ctpmv.f
gfortran -fimplicit-none -O3 -c -o ctpsv.o ctpsv.f
gfortran -fimplicit-none -O3 -c -o ctrmm.o ctrmm.f
gfortran -fimplicit-none -O3 -c -o ctrmv.o ctrmv.f
gfortran -fimplicit-none -O3 -c -o ctrsm.o ctrsm.f
gfortran -fimplicit-none -O3 -c -o ctrsv.o ctrsv.f
gfortran -fimplicit-none -O3 -c -o dasum.o dasum.f
gfortran -fimplicit-none -O3 -c -o daxpy.o daxpy.f
gfortran -fimplicit-none -O3 -c -o dcabs1.o dcabs1.f
gfortran -fimplicit-none -O3 -c -o dcopy.o dcopy.f
gfortran -fimplicit-none -O3 -c -o ddot.o ddot.f
gfortran -fimplicit-none -O3 -c -o dgbmv.o dgbmv.f
gfortran -fimplicit-none -O3 -c -o dgemm.o dgemm.f
gfortran -fimplicit-none -O3 -c -o dgemv.o dgemv.f
gfortran -fimplicit-none -O3 -c -o dger.o dger.f
gfortran -fimplicit-none -O3 -c -o dnrm2.o dnrm2.f
gfortran -fimplicit-none -O3 -c -o drot.o drot.f
gfortran -fimplicit-none -O3 -c -o drotg.o drotg.f
gfortran -fimplicit-none -O3 -c -o drotm.o drotm.f
gfortran -fimplicit-none -O3 -c -o drotmg.o drotmg.f
gfortran -fimplicit-none -O3 -c -o dsbmv.o dsbmv.f
gfortran -fimplicit-none -O3 -c -o dscal.o dscal.f
gfortran -fimplicit-none -O3 -c -o dsdot.o dsdot.f
gfortran -fimplicit-none -O3 -c -o dspmv.o dspmv.f
gfortran -fimplicit-none -O3 -c -o dspr.o dspr.f
gfortran -fimplicit-none -O3 -c -o dspr2.o dspr2.f
gfortran -fimplicit-none -O3 -c -o dswap.o dswap.f
gfortran -fimplicit-none -O3 -c -o dsymm.o dsymm.f
gfortran -fimplicit-none -O3 -c -o dsymv.o dsymv.f
gfortran -fimplicit-none -O3 -c -o dsyr.o dsyr.f
gfortran -fimplicit-none -O3 -c -o dsyr2.o dsyr2.f
gfortran -fimplicit-none -O3 -c -o dsyr2k.o dsyr2k.f
gfortran -fimplicit-none -O3 -c -o dsyrk.o dsyrk.f
gfortran -fimplicit-none -O3 -c -o dtbmv.o dtbmv.f
gfortran -fimplicit-none -O3 -c -o dtbsv.o dtbsv.f
gfortran -fimplicit-none -O3 -c -o dtpmv.o dtpmv.f
gfortran -fimplicit-none -O3 -c -o dtpsv.o dtpsv.f
gfortran -fimplicit-none -O3 -c -o dtrmm.o dtrmm.f
gfortran -fimplicit-none -O3 -c -o dtrmv.o dtrmv.f
gfortran -fimplicit-none -O3 -c -o dtrsm.o dtrsm.f
gfortran -fimplicit-none -O3 -c -o dtrsv.o dtrsv.f
gfortran -fimplicit-none -O3 -c -o dzasum.o dzasum.f
gfortran -fimplicit-none -O3 -c -o dznrm2.o dznrm2.f
gfortran -fimplicit-none -O3 -c -o icamax.o icamax.f
gfortran -fimplicit-none -O3 -c -o idamax.o idamax.f
gfortran -fimplicit-none -O3 -c -o isamax.o isamax.f
gfortran -fimplicit-none -O3 -c -o izamax.o izamax.f
gfortran -fimplicit-none -O3 -c -o lsame.o lsame.f
gfortran -fimplicit-none -O3 -c -o sasum.o sasum.f
gfortran -fimplicit-none -O3 -c -o saxpy.o saxpy.f
gfortran -fimplicit-none -O3 -c -o scabs1.o scabs1.f
gfortran -fimplicit-none -O3 -c -o scasum.o scasum.f
gfortran -fimplicit-none -O3 -c -o scnrm2.o scnrm2.f
gfortran -fimplicit-none -O3 -c -o scopy.o scopy.f
gfortran -fimplicit-none -O3 -c -o sdot.o sdot.f
gfortran -fimplicit-none -O3 -c -o sdsdot.o sdsdot.f
gfortran -fimplicit-none -O3 -c -o sgbmv.o sgbmv.f
gfortran -fimplicit-none -O3 -c -o sgemm.o sgemm.f
gfortran -fimplicit-none -O3 -c -o sgemv.o sgemv.f
gfortran -fimplicit-none -O3 -c -o sger.o sger.f
gfortran -fimplicit-none -O3 -c -o snrm2.o snrm2.f
gfortran -fimplicit-none -O3 -c -o srot.o srot.f
gfortran -fimplicit-none -O3 -c -o srotg.o srotg.f
gfortran -fimplicit-none -O3 -c -o srotm.o srotm.f
gfortran -fimplicit-none -O3 -c -o srotmg.o srotmg.f
gfortran -fimplicit-none -O3 -c -o ssbmv.o ssbmv.f
gfortran -fimplicit-none -O3 -c -o sscal.o sscal.f
gfortran -fimplicit-none -O3 -c -o sspmv.o sspmv.f
gfortran -fimplicit-none -O3 -c -o sspr.o sspr.f
gfortran -fimplicit-none -O3 -c -o sspr2.o sspr2.f
gfortran -fimplicit-none -O3 -c -o sswap.o sswap.f
gfortran -fimplicit-none -O3 -c -o ssymm.o ssymm.f
gfortran -fimplicit-none -O3 -c -o ssymv.o ssymv.f
gfortran -fimplicit-none -O3 -c -o ssyr.o ssyr.f
gfortran -fimplicit-none -O3 -c -o ssyr2.o ssyr2.f
gfortran -fimplicit-none -O3 -c -o ssyr2k.o ssyr2k.f
gfortran -fimplicit-none -O3 -c -o ssyrk.o ssyrk.f
gfortran -fimplicit-none -O3 -c -o stbmv.o stbmv.f
gfortran -fimplicit-none -O3 -c -o stbsv.o stbsv.f
gfortran -fimplicit-none -O3 -c -o stpmv.o stpmv.f
gfortran -fimplicit-none -O3 -c -o stpsv.o stpsv.f
gfortran -fimplicit-none -O3 -c -o strmm.o strmm.f
gfortran -fimplicit-none -O3 -c -o strmv.o strmv.f
gfortran -fimplicit-none -O3 -c -o strsm.o strsm.f
gfortran -fimplicit-none -O3 -c -o strsv.o strsv.f
gfortran -fimplicit-none -O3 -c -o xerbla.o xerbla.f
gfortran -fimplicit-none -O3 -c -o xerbla_array.o xerbla_array.f
gfortran -fimplicit-none -O3 -c -o zaxpy.o zaxpy.f
gfortran -fimplicit-none -O3 -c -o zcopy.o zcopy.f
gfortran -fimplicit-none -O3 -c -o zdotc.o zdotc.f
gfortran -fimplicit-none -O3 -c -o zdotu.o zdotu.f
gfortran -fimplicit-none -O3 -c -o zdrot.o zdrot.f
gfortran -fimplicit-none -O3 -c -o zdscal.o zdscal.f
gfortran -fimplicit-none -O3 -c -o zgbmv.o zgbmv.f
gfortran -fimplicit-none -O3 -c -o zgemm.o zgemm.f
gfortran -fimplicit-none -O3 -c -o zgemv.o zgemv.f
gfortran -fimplicit-none -O3 -c -o zgerc.o zgerc.f
gfortran -fimplicit-none -O3 -c -o zgeru.o zgeru.f
gfortran -fimplicit-none -O3 -c -o zhbmv.o zhbmv.f
gfortran -fimplicit-none -O3 -c -o zhemm.o zhemm.f
gfortran -fimplicit-none -O3 -c -o zhemv.o zhemv.f
gfortran -fimplicit-none -O3 -c -o zher.o zher.f
gfortran -fimplicit-none -O3 -c -o zher2.o zher2.f
gfortran -fimplicit-none -O3 -c -o zher2k.o zher2k.f
gfortran -fimplicit-none -O3 -c -o zherk.o zherk.f
gfortran -fimplicit-none -O3 -c -o zhpmv.o zhpmv.f
gfortran -fimplicit-none -O3 -c -o zhpr.o zhpr.f
gfortran -fimplicit-none -O3 -c -o zhpr2.o zhpr2.f
gfortran -fimplicit-none -O3 -c -o zrotg.o zrotg.f
gfortran -fimplicit-none -O3 -c -o zscal.o zscal.f
gfortran -fimplicit-none -O3 -c -o zswap.o zswap.f
gfortran -fimplicit-none -O3 -c -o zsymm.o zsymm.f
gfortran -fimplicit-none -O3 -c -o zsyr2k.o zsyr2k.f
gfortran -fimplicit-none -O3 -c -o zsyrk.o zsyrk.f
gfortran -fimplicit-none -O3 -c -o ztbmv.o ztbmv.f
gfortran -fimplicit-none -O3 -c -o ztbsv.o ztbsv.f
gfortran -fimplicit-none -O3 -c -o ztpmv.o ztpmv.f
gfortran -fimplicit-none -O3 -c -o ztpsv.o ztpsv.f
gfortran -fimplicit-none -O3 -c -o ztrmm.o ztrmm.f
gfortran -fimplicit-none -O3 -c -o ztrmv.o ztrmv.f
gfortran -fimplicit-none -O3 -c -o ztrsm.o ztrsm.f
gfortran -fimplicit-none -O3 -c -o ztrsv.o ztrsv.f
ar cru ../librefblas.a caxpy.o ccopy.o cdotc.o cdotu.o cgbmv.o cgemm.o cgemv.o cgerc.o cgeru.o chbmv.o chemm.o chemv.o cher.o cher2.o cher2k.o cherk.o chpmv.o chpr.o chpr2.o crotg.o cscal.o csrot.o csscal.o cswap.o csymm.o csyr2k.o csyrk.o ctbmv.o ctbsv.o ctpmv.o ctpsv.o ctrmm.o ctrmv.o ctrsm.o ctrsv.o dasum.o daxpy.o dcabs1.o dcopy.o ddot.o dgbmv.o dgemm.o dgemv.o dger.o dnrm2.o drot.o drotg.o drotm.o drotmg.o dsbmv.o dscal.o dsdot.o dspmv.o dspr.o dspr2.o dswap.o dsymm.o dsymv.o dsyr.o dsyr2.o dsyr2k.o dsyrk.o dtbmv.o dtbsv.o dtpmv.o dtpsv.o dtrmm.o dtrmv.o dtrsm.o dtrsv.o dzasum.o dznrm2.o icamax.o idamax.o isamax.o izamax.o lsame.o sasum.o saxpy.o scabs1.o scasum.o scnrm2.o scopy.o sdot.o sdsdot.o sgbmv.o sgemm.o sgemv.o sger.o snrm2.o srot.o srotg.o srotm.o srotmg.o ssbmv.o sscal.o sspmv.o sspr.o sspr2.o sswap.o ssymm.o ssymv.o ssyr.o ssyr2.o ssyr2k.o ssyrk.o stbmv.o stbsv.o stpmv.o stpsv.o strmm.o strmv.o strsm.o strsv.o xerbla.o xerbla_array.o zaxpy.o zcopy.o zdotc.o zdotu.o zdrot.o zdscal.o zgbmv.o zgemm.o zgemv.o zgerc.o zgeru.o zhbmv.o zhemm.o zhemv.o zher.o zher2.o zher2k.o zherk.o zhpmv.o zhpr.o zhpr2.o zrotg.o zscal.o zswap.o zsymm.o zsyr2k.o zsyrk.o ztbmv.o ztbsv.o ztpmv.o ztpsv.o ztrmm.o ztrmv.o ztrsm.o ztrsv.o
ranlib ../librefblas.a
make -C test
gfortran dblat1.f -L.. -lrefblas -o dblat1_ref
dblat1.f:215.44:
               CALL STEST1(DNRM2(N,SX,INCX),STEMP,STEMP,SFAC)           
                                            1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
dblat1.f:219.44:
               CALL STEST1(DASUM(N,SX,INCX),STEMP,STEMP,SFAC)           
                                            1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
gfortran dblat3.f -L.. -lrefblas -o dblat3_ref
gfortran dblat1.f -L.. -lulmblas -o dblat1_ulm
dblat1.f:215.44:
               CALL STEST1(DNRM2(N,SX,INCX),STEMP,STEMP,SFAC)           
                                            1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
dblat1.f:219.44:
               CALL STEST1(DASUM(N,SX,INCX),STEMP,STEMP,SFAC)           
                                            1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
gfortran dblat3.f -L.. -lulmblas -o dblat3_ulm
make -C bench
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o l1blastst.o l1blastst.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_cputime.o ATL_cputime.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_epsilon.o ATL_epsilon.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_f77amax.o ATL_f77amax.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_f77asum.o ATL_f77asum.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_f77axpy.o ATL_f77axpy.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_f77copy.o ATL_f77copy.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_f77dot.o ATL_f77dot.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_f77gemm.o ATL_f77gemm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_f77nrm2.o ATL_f77nrm2.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_f77rot.o ATL_f77rot.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_f77rotg.o ATL_f77rotg.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_f77rotm.o ATL_f77rotm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_f77rotmg.o ATL_f77rotmg.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_f77scal.o ATL_f77scal.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_f77swap.o ATL_f77swap.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_f77symm.o ATL_f77symm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_f77syr2k.o ATL_f77syr2k.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_f77syrk.o ATL_f77syrk.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_f77trmm.o ATL_f77trmm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_f77trsm.o ATL_f77trsm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_flushcache.o ATL_flushcache.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_gediffnrm1.o ATL_gediffnrm1.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_gegen.o ATL_gegen.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_genrm1.o ATL_genrm1.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_infnrm.o ATL_infnrm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_rand.o ATL_rand.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_set.o ATL_set.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_synrm.o ATL_synrm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_trnrm1.o ATL_trnrm1.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_vdiff.o ATL_vdiff.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL   -c -o ATL_zero.o ATL_zero.c
gfortran   -c -o ATL_df77wrap.o ATL_df77wrap.f
ar r libtstatlas.a  ATL_cputime.o  ATL_epsilon.o  ATL_f77amax.o  ATL_f77asum.o  ATL_f77axpy.o  ATL_f77copy.o  ATL_f77dot.o  ATL_f77gemm.o  ATL_f77nrm2.o  ATL_f77rot.o  ATL_f77rotg.o  ATL_f77rotm.o  ATL_f77rotmg.o  ATL_f77scal.o  ATL_f77swap.o  ATL_f77symm.o  ATL_f77syr2k.o  ATL_f77syrk.o  ATL_f77trmm.o  ATL_f77trsm.o  ATL_flushcache.o  ATL_gediffnrm1.o  ATL_gegen.o  ATL_genrm1.o  ATL_infnrm.o  ATL_rand.o  ATL_set.o  ATL_synrm.o  ATL_trnrm1.o  ATL_vdiff.o  ATL_zero.o  ATL_df77wrap.o
ar: creating archive libtstatlas.a
ranlib libtstatlas.a
gfortran -o xdl1blastst l1blastst.o libtstatlas.a ../libatlulmblas.a ../librefblas.a
gfortran -o xdl3blastst l3blastst.o libtstatlas.a ../libatlulmblas.a ../librefblas.a

Outline of the Modification

The dgemm_nn Code

       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19
      20
      21
      22
      23
      24
      25
      26
      27
      28
      29
      30
      31
      32
      33
      34
      35
      36
      37
      38
      39
      40
      41
      42
      43
      44
      45
      46
      47
      48
      49
      50
      51
      52
      53
      54
      55
      56
      57
      58
      59
      60
      61
      62
      63
      64
      65
      66
      67
      68
      69
      70
      71
      72
      73
      74
      75
      76
      77
      78
      79
      80
      81
      82
      83
      84
      85
      86
      87
      88
      89
      90
      91
      92
      93
      94
      95
      96
      97
      98
      99
     100
     101
     102
     103
     104
     105
     106
     107
     108
     109
     110
     111
     112
     113
     114
     115
     116
     117
     118
     119
     120
     121
     122
     123
     124
     125
     126
     127
     128
     129
     130
     131
     132
     133
     134
     135
     136
     137
     138
     139
     140
     141
     142
     143
     144
     145
     146
     147
     148
     149
     150
     151
     152
     153
     154
     155
     156
     157
     158
     159
     160
     161
     162
     163
     164
     165
     166
     167
     168
     169
     170
     171
     172
     173
     174
     175
     176
     177
     178
     179
     180
     181
     182
     183
     184
     185
     186
     187
     188
     189
     190
     191
     192
     193
     194
     195
     196
     197
     198
     199
     200
     201
     202
     203
     204
     205
     206
     207
     208
     209
     210
     211
     212
     213
     214
     215
     216
     217
     218
     219
     220
     221
     222
     223
     224
     225
     226
     227
     228
     229
     230
     231
     232
     233
     234
     235
     236
     237
     238
     239
     240
     241
     242
     243
     244
     245
     246
     247
     248
     249
     250
     251
     252
     253
     254
     255
     256
     257
     258
     259
     260
     261
     262
     263
     264
     265
     266
     267
     268
     269
     270
     271
     272
     273
     274
     275
     276
     277
     278
     279
     280
     281
     282
     283
     284
     285
     286
     287
     288
     289
     290
     291
     292
     293
     294
     295
     296
     297
     298
     299
     300
     301
     302
     303
     304
     305
     306
     307
     308
     309
     310
     311
     312
     313
     314
     315
     316
     317
     318
     319
     320
     321
     322
     323
     324
     325
     326
     327
     328
     329
     330
     331
     332
     333
     334
     335
     336
     337
     338
     339
     340
     341
     342
     343
     344
     345
     346
     347
     348
     349
     350
     351
     352
     353
     354
     355
     356
     357
     358
     359
     360
     361
     362
     363
     364
     365
     366
     367
     368
     369
     370
     371
     372
     373
     374
     375
     376
     377
     378
     379
     380
     381
     382
     383
     384
     385
     386
     387
     388
     389
     390
     391
     392
     393
     394
     395
     396
     397
     398
     399
     400
     401
     402
     403
     404
     405
     406
     407
     408
     409
     410
     411
     412
     413
     414
     415
     416
     417
     418
     419
     420
     421
     422
     423
     424
     425
     426
     427
     428
     429
     430
     431
     432
     433
     434
     435
     436
     437
     438
     439
     440
     441
     442
     443
     444
     445
     446
     447
     448
     449
     450
     451
     452
     453
     454
     455
     456
     457
     458
     459
     460
     461
     462
     463
     464
     465
     466
     467
     468
     469
     470
     471
     472
     473
     474
     475
     476
     477
     478
     479
     480
     481
     482
     483
     484
     485
     486
     487
     488
     489
     490
     491
     492
     493
     494
     495
     496
     497
     498
     499
     500
     501
     502
     503
     504
     505
     506
     507
     508
     509
     510
     511
     512
     513
     514
     515
     516
     517
     518
     519
     520
     521
     522
     523
     524
     525
     526
     527
     528
     529
     530
     531
     532
     533
     534
     535
     536
     537
     538
     539
     540
     541
     542
     543
     544
     545
     546
     547
     548
     549
     550
     551
     552
     553
     554
     555
     556
     557
     558
     559
     560
     561
     562
     563
     564
     565
     566
     567
     568
     569
     570
     571
     572
     573
     574
     575
     576
     577
     578
     579
     580
     581
     582
     583
     584
     585
     586
     587
     588
     589
     590
     591
     592
     593
     594
     595
     596
     597
     598
     599
     600
     601
     602
     603
     604
     605
     606
     607
     608
     609
     610
     611
     612
     613
     614
     615
     616
     617
     618
     619
     620
     621
     622
     623
     624
     625
     626
     627
     628
     629
     630
     631
     632
     633
     634
     635
     636
     637
     638
     639
     640
     641
     642
     643
     644
     645
     646
     647
     648
     649
     650
     651
     652
     653
     654
     655
     656
     657
     658
     659
     660
#include <ulmblas.h>
#include <stdio.h>
#include <emmintrin.h>
#include <immintrin.h>

#define MC  384
#define KC  384
#define NC  4096

#define MR  4
#define NR  4

//
//  Local buffers for storing panels from A, B and C
//
static double _A[MC*KC] __attribute__ ((aligned (16)));
static double _B[KC*NC] __attribute__ ((aligned (16)));
static double _C[MR*NR] __attribute__ ((aligned (16)));

//
//  Packing complete panels from A (i.e. without padding)
//
static void
pack_MRxk(int k, const double *A, int incRowA, int incColA,
          double *buffer)
{
    int i, j;

    for (j=0; j<k; ++j) {
        for (i=0; i<MR; ++i) {
            buffer[i] = A[i*incRowA];
        }
        buffer += MR;
        A      += incColA;
    }
}

//
//  Packing panels from A with padding if required
//
static void
pack_A(int mc, int kc, const double *A, int incRowA, int incColA,
       double *buffer)
{
    int mp  = mc / MR;
    int _mr = mc % MR;

    int i, j;

    for (i=0; i<mp; ++i) {
        pack_MRxk(kc, A, incRowA, incColA, buffer);
        buffer += kc*MR;
        A      += MR*incRowA;
    }
    if (_mr>0) {
        for (j=0; j<kc; ++j) {
            for (i=0; i<_mr; ++i) {
                buffer[i] = A[i*incRowA];
            }
            for (i=_mr; i<MR; ++i) {
                buffer[i] = 0.0;
            }
            buffer += MR;
            A      += incColA;
        }
    }
}

//
//  Packing complete panels from B (i.e. without padding)
//
static void
pack_kxNR(int k, const double *B, int incRowB, int incColB,
          double *buffer)
{
    int i, j;

    for (i=0; i<k; ++i) {
        for (j=0; j<NR; ++j) {
            buffer[j] = B[j*incColB];
        }
        buffer += NR;
        B      += incRowB;
    }
}

//
//  Packing panels from B with padding if required
//
static void
pack_B(int kc, int nc, const double *B, int incRowB, int incColB,
       double *buffer)
{
    int np  = nc / NR;
    int _nr = nc % NR;

    int i, j;

    for (j=0; j<np; ++j) {
        pack_kxNR(kc, B, incRowB, incColB, buffer);
        buffer += kc*NR;
        B      += NR*incColB;
    }
    if (_nr>0) {
        for (i=0; i<kc; ++i) {
            for (j=0; j<_nr; ++j) {
                buffer[j] = B[j*incColB];
            }
            for (j=_nr; j<NR; ++j) {
                buffer[j] = 0.0;
            }
            buffer += NR;
            B      += incRowB;
        }
    }
}

//
//  Micro kernel for multiplying panels from A and B.
//
static void
dgemm_micro_kernel(long kc,
                   double alpha, const double *A, const double *B,
                   double beta,
                   double *C, long incRowC, long incColC)
{
    long kb = kc / 4;
    long kl = kc % 4;

//
//  Compute AB = A*B
//
    __asm__ volatile
    (
    "movq      %0,      %%rsi    \n\t"  // kb (32 bit) stored in %rsi
    "movq      %1,      %%rdi    \n\t"  // kl (32 bit) stored in %rdi
    "movq      %2,      %%rax    \n\t"  // Address of A stored in %rax
    "movq      %3,      %%rbx    \n\t"  // Address of B stored in %rbx
    "                            \n\t"
    "movaps    (%%rax), %%xmm0   \n\t"  // tmp0 = _mm_load_pd(A)
    "movaps  16(%%rax), %%xmm1   \n\t"  // tmp1 = _mm_load_pd(A+2)
    "movaps    (%%rbx), %%xmm2   \n\t"  // tmp2 = _mm_load_pd(B)
    "                            \n\t"
    "xorpd     %%xmm8,  %%xmm8   \n\t"  // ab_00_11 = _mm_setzero_pd()
    "xorpd     %%xmm9,  %%xmm9   \n\t"  // ab_20_31 = _mm_setzero_pd()
    "xorpd     %%xmm10, %%xmm10  \n\t"  // ab_01_10 = _mm_setzero_pd()
    "xorpd     %%xmm11, %%xmm11  \n\t"  // ab_21_30 = _mm_setzero_pd()
    "xorpd     %%xmm12, %%xmm12  \n\t"  // ab_02_13 = _mm_setzero_pd()
    "xorpd     %%xmm13, %%xmm13  \n\t"  // ab_22_33 = _mm_setzero_pd()
    "xorpd     %%xmm14, %%xmm14  \n\t"  // ab_03_12 = _mm_setzero_pd()
    "xorpd     %%xmm15, %%xmm15  \n\t"  // ab_23_32 = _mm_setzero_pd()
    "                            \n\t"
    "xorpd     %%xmm3,  %%xmm3   \n\t"  // tmp3 = _mm_setzero_pd
    "xorpd     %%xmm4,  %%xmm4   \n\t"  // tmp4 = _mm_setzero_pd
    "xorpd     %%xmm5,  %%xmm5   \n\t"  // tmp5 = _mm_setzero_pd
    "xorpd     %%xmm6,  %%xmm6   \n\t"  // tmp6 = _mm_setzero_pd
    "xorpd     %%xmm7,  %%xmm7   \n\t"  // tmp7 = _mm_setzero_pd
    "testq     %%rdi,   %%rdi    \n\t"  // if kl==0 writeback to AB
    "                            \n\t"
    "                            \n\t"
    "testq     %%rsi,   %%rsi    \n\t"  // if kb==0 handle remaining kl
    "je        .DCONSIDERLEFT%=  \n\t"  // update iterations
    "                            \n\t"
    ".DLOOP%=:                   \n\t"  // for l = kb,..,1 do
    "                            \n\t"
    "                            \n\t"  // 1. update
    "addpd     %%xmm3,  %%xmm12  \n\t"  // ab_02_13 = _mm_add_pd(ab_02_13, tmp3)
    "movaps  16(%%rbx), %%xmm3   \n\t"  // tmp3     = _mm_load_pd(B+2)
    "addpd     %%xmm6,  %%xmm13  \n\t"  // ab_22_33 = _mm_add_pd(ab_22_33, tmp6)
    "movaps    %%xmm2,  %%xmm6   \n\t"  // tmp6     = tmp2
    "pshufd $78,%%xmm2, %%xmm4   \n\t"  // tmp4     = _mm_shuffle_pd(tmp2, tmp2,
    "                            \n\t"  //                   _MM_SHUFFLE2(0, 1))
    "mulpd     %%xmm0,  %%xmm2   \n\t"  // tmp2     = _mm_mul_pd(tmp2, tmp0);
    "mulpd     %%xmm1,  %%xmm6   \n\t"  // tmp6     = _mm_mul_pd(tmp6, tmp1);
    "                            \n\t"
    "                            \n\t"
    "addpd     %%xmm5,  %%xmm14  \n\t"  // ab_03_12 = _mm_add_pd(ab_03_12, tmp5)
    "addpd     %%xmm7,  %%xmm15  \n\t"  // ab_23_32 = _mm_add_pd(ab_23_32, tmp7)
    "movaps    %%xmm4,  %%xmm7   \n\t"  // tmp7     = tmp4
    "mulpd     %%xmm0,  %%xmm4   \n\t"  // tmp4     = _mm_mul_pd(tmp4, tmp0)
    "mulpd     %%xmm1,  %%xmm7   \n\t"  // tmp7     = _mm_mul_pd(tmp7, tmp1)
    "                            \n\t"
    "                            \n\t"
    "addpd     %%xmm2,  %%xmm8   \n\t"  // ab_00_11 = _mm_add_pd(ab_00_11, tmp2)
    "movaps  32(%%rbx), %%xmm2   \n\t"  // tmp2     = _mm_load_pd(B+4)
    "addpd     %%xmm6,  %%xmm9   \n\t"  // ab_20_31 = _mm_add_pd(ab_20_31, tmp6)
    "movaps    %%xmm3,  %%xmm6   \n\t"  // tmp6     = tmp3
    "pshufd $78,%%xmm3, %%xmm5   \n\t"  // tmp5     = _mm_shuffle_pd(tmp3, tmp3,
    "                            \n\t"  //                   _MM_SHUFFLE2(0, 1))
    "mulpd     %%xmm0,  %%xmm3   \n\t"  // tmp3     = _mm_mul_pd(tmp3, tmp0)
    "mulpd     %%xmm1,  %%xmm6   \n\t"  // tmp6     = _mm_mul_pd(tmp6, tmp1)
    "                            \n\t"
    "                            \n\t"
    "addpd     %%xmm4,  %%xmm10  \n\t"  // ab_01_10 = _mm_add_pd(ab_01_10, tmp4)
    "addpd     %%xmm7,  %%xmm11  \n\t"  // ab_21_30 = _mm_add_pd(ab_21_30, tmp7)
    "movaps    %%xmm5,  %%xmm7   \n\t"  // tmp7     = tmp5
    "mulpd     %%xmm0,  %%xmm5   \n\t"  // tmp5     = _mm_mul_pd(tmp5, tmp0)
    "movaps  32(%%rax), %%xmm0   \n\t"  // tmp0     = _mm_load_pd(A+4)
    "mulpd     %%xmm1,  %%xmm7   \n\t"  // tmp7     = _mm_mul_pd(tmp7, tmp1)
    "movaps  48(%%rax), %%xmm1   \n\t"  // tmp1     = _mm_load_pd(A+6)
    "                            \n\t"
    "                            \n\t"
    "                            \n\t"
    "                            \n\t"  // 2. update
    "addpd     %%xmm3,  %%xmm12  \n\t"  // ab_02_13 = _mm_add_pd(ab_02_13, tmp3)
    "movaps  48(%%rbx), %%xmm3   \n\t"  // tmp3     = _mm_load_pd(B+6)
    "addpd     %%xmm6,  %%xmm13  \n\t"  // ab_22_33 = _mm_add_pd(ab_22_33, tmp6)
    "movaps    %%xmm2,  %%xmm6   \n\t"  // tmp6     = tmp2
    "pshufd $78,%%xmm2, %%xmm4   \n\t"  // tmp4     = _mm_shuffle_pd(tmp2, tmp2,
    "                            \n\t"  //                   _MM_SHUFFLE2(0, 1))
    "mulpd     %%xmm0,  %%xmm2   \n\t"  // tmp2     = _mm_mul_pd(tmp2, tmp0);
    "mulpd     %%xmm1,  %%xmm6   \n\t"  // tmp6     = _mm_mul_pd(tmp6, tmp1);
    "                            \n\t"
    "                            \n\t"
    "addpd     %%xmm5,  %%xmm14  \n\t"  // ab_03_12 = _mm_add_pd(ab_03_12, tmp5)
    "addpd     %%xmm7,  %%xmm15  \n\t"  // ab_23_32 = _mm_add_pd(ab_23_32, tmp7)
    "movaps    %%xmm4,  %%xmm7   \n\t"  // tmp7     = tmp4
    "mulpd     %%xmm0,  %%xmm4   \n\t"  // tmp4     = _mm_mul_pd(tmp4, tmp0)
    "mulpd     %%xmm1,  %%xmm7   \n\t"  // tmp7     = _mm_mul_pd(tmp7, tmp1)
    "                            \n\t"
    "                            \n\t"
    "addpd     %%xmm2,  %%xmm8   \n\t"  // ab_00_11 = _mm_add_pd(ab_00_11, tmp2)
    "movaps  64(%%rbx), %%xmm2   \n\t"  // tmp2     = _mm_load_pd(B+8)
    "addpd     %%xmm6,  %%xmm9   \n\t"  // ab_20_31 = _mm_add_pd(ab_20_31, tmp6)
    "movaps    %%xmm3,  %%xmm6   \n\t"  // tmp6     = tmp3
    "pshufd $78,%%xmm3, %%xmm5   \n\t"  // tmp5     = _mm_shuffle_pd(tmp3, tmp3,
    "                            \n\t"  //                   _MM_SHUFFLE2(0, 1))
    "mulpd     %%xmm0,  %%xmm3   \n\t"  // tmp3     = _mm_mul_pd(tmp3, tmp0)
    "mulpd     %%xmm1,  %%xmm6   \n\t"  // tmp6     = _mm_mul_pd(tmp6, tmp1)
    "                            \n\t"
    "                            \n\t"
    "addpd     %%xmm4,  %%xmm10  \n\t"  // ab_01_10 = _mm_add_pd(ab_01_10, tmp4)
    "addpd     %%xmm7,  %%xmm11  \n\t"  // ab_21_30 = _mm_add_pd(ab_21_30, tmp7)
    "movaps    %%xmm5,  %%xmm7   \n\t"  // tmp7     = tmp5
    "mulpd     %%xmm0,  %%xmm5   \n\t"  // tmp5     = _mm_mul_pd(tmp5, tmp0)
    "movaps  64(%%rax), %%xmm0   \n\t"  // tmp0     = _mm_load_pd(A+8)
    "mulpd     %%xmm1,  %%xmm7   \n\t"  // tmp7     = _mm_mul_pd(tmp7, tmp1)
    "movaps  80(%%rax), %%xmm1   \n\t"  // tmp1     = _mm_load_pd(A+10)
    "                            \n\t"
    "                            \n\t"
    "                            \n\t"  // 3. update
    "addpd     %%xmm3,  %%xmm12  \n\t"  // ab_02_13 = _mm_add_pd(ab_02_13, tmp3)
    "movaps  80(%%rbx), %%xmm3   \n\t"  // tmp3     = _mm_load_pd(B+10)
    "addpd     %%xmm6,  %%xmm13  \n\t"  // ab_22_33 = _mm_add_pd(ab_22_33, tmp6)
    "movaps    %%xmm2,  %%xmm6   \n\t"  // tmp6     = tmp2
    "pshufd $78,%%xmm2, %%xmm4   \n\t"  // tmp4     = _mm_shuffle_pd(tmp2, tmp2,
    "                            \n\t"  //                   _MM_SHUFFLE2(0, 1))
    "mulpd     %%xmm0,  %%xmm2   \n\t"  // tmp2     = _mm_mul_pd(tmp2, tmp0);
    "mulpd     %%xmm1,  %%xmm6   \n\t"  // tmp6     = _mm_mul_pd(tmp6, tmp1);
    "                            \n\t"
    "                            \n\t"
    "addpd     %%xmm5,  %%xmm14  \n\t"  // ab_03_12 = _mm_add_pd(ab_03_12, tmp5)
    "addpd     %%xmm7,  %%xmm15  \n\t"  // ab_23_32 = _mm_add_pd(ab_23_32, tmp7)
    "movaps    %%xmm4,  %%xmm7   \n\t"  // tmp7     = tmp4
    "mulpd     %%xmm0,  %%xmm4   \n\t"  // tmp4     = _mm_mul_pd(tmp4, tmp0)
    "mulpd     %%xmm1,  %%xmm7   \n\t"  // tmp7     = _mm_mul_pd(tmp7, tmp1)
    "                            \n\t"
    "                            \n\t"
    "addpd     %%xmm2,  %%xmm8   \n\t"  // ab_00_11 = _mm_add_pd(ab_00_11, tmp2)
    "movaps  96(%%rbx), %%xmm2   \n\t"  // tmp2     = _mm_load_pd(B+12)
    "addpd     %%xmm6,  %%xmm9   \n\t"  // ab_20_31 = _mm_add_pd(ab_20_31, tmp6)
    "movaps    %%xmm3,  %%xmm6   \n\t"  // tmp6     = tmp3
    "pshufd $78,%%xmm3, %%xmm5   \n\t"  // tmp5     = _mm_shuffle_pd(tmp3, tmp3,
    "                            \n\t"  //                   _MM_SHUFFLE2(0, 1))
    "mulpd     %%xmm0,  %%xmm3   \n\t"  // tmp3     = _mm_mul_pd(tmp3, tmp0)
    "mulpd     %%xmm1,  %%xmm6   \n\t"  // tmp6     = _mm_mul_pd(tmp6, tmp1)
    "                            \n\t"
    "                            \n\t"
    "addpd     %%xmm4,  %%xmm10  \n\t"  // ab_01_10 = _mm_add_pd(ab_01_10, tmp4)
    "addpd     %%xmm7,  %%xmm11  \n\t"  // ab_21_30 = _mm_add_pd(ab_21_30, tmp7)
    "movaps    %%xmm5,  %%xmm7   \n\t"  // tmp7     = tmp5
    "mulpd     %%xmm0,  %%xmm5   \n\t"  // tmp5     = _mm_mul_pd(tmp5, tmp0)
    "movaps  96(%%rax), %%xmm0   \n\t"  // tmp0     = _mm_load_pd(A+12)
    "mulpd     %%xmm1,  %%xmm7   \n\t"  // tmp7     = _mm_mul_pd(tmp7, tmp1)
    "movaps 112(%%rax), %%xmm1   \n\t"  // tmp1     = _mm_load_pd(A+14)
    "                            \n\t"
    "                            \n\t"
    "                            \n\t"  // 4. update
    "addpd     %%xmm3,  %%xmm12  \n\t"  // ab_02_13 = _mm_add_pd(ab_02_13, tmp3)
    "movaps 112(%%rbx), %%xmm3   \n\t"  // tmp3     = _mm_load_pd(B+14)
    "addpd     %%xmm6,  %%xmm13  \n\t"  // ab_22_33 = _mm_add_pd(ab_22_33, tmp6)
    "movaps    %%xmm2,  %%xmm6   \n\t"  // tmp6     = tmp2
    "pshufd $78,%%xmm2, %%xmm4   \n\t"  // tmp4     = _mm_shuffle_pd(tmp2, tmp2,
    "                            \n\t"  //                   _MM_SHUFFLE2(0, 1))
    "mulpd     %%xmm0,  %%xmm2   \n\t"  // tmp2     = _mm_mul_pd(tmp2, tmp0);
    "mulpd     %%xmm1,  %%xmm6   \n\t"  // tmp6     = _mm_mul_pd(tmp6, tmp1);
    "                            \n\t"
    "addq      $32*4,   %%rax    \n\t"  // A += 16;
    "                            \n\t"
    "addpd     %%xmm5,  %%xmm14  \n\t"  // ab_03_12 = _mm_add_pd(ab_03_12, tmp5)
    "addpd     %%xmm7,  %%xmm15  \n\t"  // ab_23_32 = _mm_add_pd(ab_23_32, tmp7)
    "movaps    %%xmm4,  %%xmm7   \n\t"  // tmp7     = tmp4
    "mulpd     %%xmm0,  %%xmm4   \n\t"  // tmp4     = _mm_mul_pd(tmp4, tmp0)
    "mulpd     %%xmm1,  %%xmm7   \n\t"  // tmp7     = _mm_mul_pd(tmp7, tmp1)
    "                            \n\t"
    "                            \n\t"
    "addpd     %%xmm2,  %%xmm8   \n\t"  // ab_00_11 = _mm_add_pd(ab_00_11, tmp2)
    "movaps 128(%%rbx), %%xmm2   \n\t"  // tmp2     = _mm_load_pd(B+16)
    "addpd     %%xmm6,  %%xmm9   \n\t"  // ab_20_31 = _mm_add_pd(ab_20_31, tmp6)
    "movaps    %%xmm3,  %%xmm6   \n\t"  // tmp6     = tmp3
    "pshufd $78,%%xmm3, %%xmm5   \n\t"  // tmp5     = _mm_shuffle_pd(tmp3, tmp3,
    "                            \n\t"  //                   _MM_SHUFFLE2(0, 1))
    "mulpd     %%xmm0,  %%xmm3   \n\t"  // tmp3     = _mm_mul_pd(tmp3, tmp0)
    "mulpd     %%xmm1,  %%xmm6   \n\t"  // tmp6     = _mm_mul_pd(tmp6, tmp1)
    "                            \n\t"
    "addq      $32*4,   %%rbx    \n\t"  // B += 16;
    "                            \n\t"
    "                            \n\t"
    "addpd     %%xmm4,  %%xmm10  \n\t"  // ab_01_10 = _mm_add_pd(ab_01_10, tmp4)
    "addpd     %%xmm7,  %%xmm11  \n\t"  // ab_21_30 = _mm_add_pd(ab_21_30, tmp7)
    "movaps    %%xmm5,  %%xmm7   \n\t"  // tmp7     = tmp5
    "mulpd     %%xmm0,  %%xmm5   \n\t"  // tmp5     = _mm_mul_pd(tmp5, tmp0)
    "movaps    (%%rax), %%xmm0   \n\t"  // tmp0     = _mm_load_pd(A+16)
    "mulpd     %%xmm1,  %%xmm7   \n\t"  // tmp7     = _mm_mul_pd(tmp7, tmp1)
    "movaps  16(%%rax), %%xmm1   \n\t"  // tmp1     = _mm_load_pd(A+18)
    "                            \n\t"
    "                            \n\t"
    "decq      %%rsi             \n\t"  // --l
    "jne       .DLOOP%=          \n\t"  // if l>= 1 go back
    "                            \n\t"
    "                            \n\t"
    ".DCONSIDERLEFT%=:           \n\t"
    "testq     %%rdi,   %%rdi    \n\t"  // if kl==0 writeback to AB
    "je        .DPOSTACCUMULATE%=\n\t"
    "                            \n\t"
    ".DLOOPLEFT%=:               \n\t"  // for l = kl,..,1 do
    "                            \n\t"
    "addpd     %%xmm3,  %%xmm12  \n\t"  // ab_02_13 = _mm_add_pd(ab_02_13, tmp3)
    "movapd  16(%%rbx), %%xmm3   \n\t"  // tmp3     = _mm_load_pd(B+2)
    "addpd     %%xmm6,  %%xmm13  \n\t"  // ab_22_33 = _mm_add_pd(ab_22_33, tmp6)
    "movapd    %%xmm2,  %%xmm6   \n\t"  // tmp6     = tmp2
    "pshufd $78,%%xmm2, %%xmm4   \n\t"  // tmp4     = _mm_shuffle_pd(tmp2, tmp2,
    "                            \n\t"  //                   _MM_SHUFFLE2(0, 1))
    "mulpd     %%xmm0,  %%xmm2   \n\t"  // tmp2     = _mm_mul_pd(tmp2, tmp0);
    "mulpd     %%xmm1,  %%xmm6   \n\t"  // tmp6     = _mm_mul_pd(tmp6, tmp1);
    "                            \n\t"
    "                            \n\t"
    "addpd     %%xmm5,  %%xmm14  \n\t"  // ab_03_12 = _mm_add_pd(ab_03_12, tmp5)
    "addpd     %%xmm7,  %%xmm15  \n\t"  // ab_23_32 = _mm_add_pd(ab_23_32, tmp7)
    "movapd    %%xmm4,  %%xmm7   \n\t"  // tmp7     = tmp4
    "mulpd     %%xmm0,  %%xmm4   \n\t"  // tmp4     = _mm_mul_pd(tmp4, tmp0)
    "mulpd     %%xmm1,  %%xmm7   \n\t"  // tmp7     = _mm_mul_pd(tmp7, tmp1)
    "                            \n\t"
    "                            \n\t"
    "addpd     %%xmm2,  %%xmm8   \n\t"  // ab_00_11 = _mm_add_pd(ab_00_11, tmp2)
    "movapd  32(%%rbx), %%xmm2   \n\t"  // tmp2     = _mm_load_pd(B+4)
    "addpd     %%xmm6,  %%xmm9   \n\t"  // ab_20_31 = _mm_add_pd(ab_20_31, tmp6)
    "movapd    %%xmm3,  %%xmm6   \n\t"  // tmp6     = tmp3
    "pshufd $78,%%xmm3, %%xmm5   \n\t"  // tmp5     = _mm_shuffle_pd(tmp3, tmp3,
    "                            \n\t"  //                   _MM_SHUFFLE2(0, 1))
    "mulpd     %%xmm0,  %%xmm3   \n\t"  // tmp3     = _mm_mul_pd(tmp3, tmp0)
    "mulpd     %%xmm1,  %%xmm6   \n\t"  // tmp6     = _mm_mul_pd(tmp6, tmp1)
    "                            \n\t"
    "                            \n\t"
    "addpd     %%xmm4,  %%xmm10  \n\t"  // ab_01_10 = _mm_add_pd(ab_01_10, tmp4)
    "addpd     %%xmm7,  %%xmm11  \n\t"  // ab_21_30 = _mm_add_pd(ab_21_30, tmp7)
    "movapd    %%xmm5,  %%xmm7   \n\t"  // tmp7     = tmp5
    "mulpd     %%xmm0,  %%xmm5   \n\t"  // tmp5     = _mm_mul_pd(tmp5, tmp0)
    "movapd  32(%%rax), %%xmm0   \n\t"  // tmp0     = _mm_load_pd(A+4)
    "mulpd     %%xmm1,  %%xmm7   \n\t"  // tmp7     = _mm_mul_pd(tmp7, tmp1)
    "movapd  48(%%rax), %%xmm1   \n\t"  // tmp1     = _mm_load_pd(A+6)
    "                            \n\t"
    "                            \n\t"
    "addq      $32,     %%rax    \n\t"  // A += 4;
    "addq      $32,     %%rbx    \n\t"  // B += 4;
    "                            \n\t"
    "decq      %%rdi             \n\t"  // --l
    "jne       .DLOOPLEFT%=      \n\t"  // if l>= 1 go back
    "                            \n\t"
    ".DPOSTACCUMULATE%=:         \n\t"  // Update remaining ab_*_* registers
    "                            \n\t"
    "addpd    %%xmm3,   %%xmm12  \n\t"  // ab_02_13 = _mm_add_pd(ab_02_13, tmp3)
    "addpd    %%xmm6,   %%xmm13  \n\t"  // ab_22_33 = _mm_add_pd(ab_22_33, tmp6)
    "                            \n\t"  //
    "addpd    %%xmm5,   %%xmm14  \n\t"  // ab_03_12 = _mm_add_pd(ab_03_12, tmp5)
    "addpd    %%xmm7,   %%xmm15  \n\t"  // ab_23_32 = _mm_add_pd(ab_23_32, tmp7)
    "                            \n\t"
//
//  Update C <- beta*C + alpha*AB
//
//
    "movsd  %4,                  %%xmm0 \n\t"  // load alpha
    "movsd  %5,                  %%xmm1 \n\t"  // load beta 
    "movq   %6,                  %%rcx  \n\t"  // Address of C stored in %rcx

    "movq   %7,                  %%r8   \n\t"  // load incRowC
    "leaq   (,%%r8,8),           %%r8   \n\t"  //      incRowC *= sizeof(double)
    "movq   %8,                  %%r9   \n\t"  // load incColC
    "leaq   (,%%r9,8),           %%r9   \n\t"  //      incRowC *= sizeof(double)
    "                                   \n\t"
    "leaq (%%rcx,%%r9),          %%r10  \n\t"  // Store addr of C01 in %r10
    "leaq (%%rcx,%%r8,2),        %%rdx  \n\t"  // Store addr of C20 in %rdx
    "leaq (%%rdx,%%r9),          %%r11  \n\t"  // Store addr of C21 in %r11
    "                                   \n\t"
    "unpcklpd %%xmm0,            %%xmm0 \n\t"  // duplicate alpha
    "unpcklpd %%xmm1,            %%xmm1 \n\t"  // duplicate beta
    "                                   \n\t"
    "                                   \n\t"
    "movlpd (%%rcx),             %%xmm3 \n\t"  // load (C00,
    "movhpd (%%r10,%%r8),        %%xmm3 \n\t"  //       C11)
    "mulpd  %%xmm0,              %%xmm8 \n\t"  // scale ab_00_11 by alpha
    "mulpd  %%xmm1,              %%xmm3 \n\t"  // scale (C00, C11) by beta
    "addpd  %%xmm8,              %%xmm3 \n\t"  // add results

    "movlpd %%xmm3,        (%%rcx)       \n\t"  // write back (C00,
    "movhpd %%xmm3,        (%%r10,%%r8)  \n\t"  //             C11)
    "                                   \n\t"
    "movlpd (%%rdx),             %%xmm4 \n\t"  // load (C20,
    "movhpd (%%r11,%%r8),        %%xmm4 \n\t"  //       C31)
    "mulpd  %%xmm0,              %%xmm9 \n\t"  // scale ab_20_31 by alpha
    "mulpd  %%xmm1,              %%xmm4 \n\t"  // scale (C20, C31) by beta
    "addpd  %%xmm9,              %%xmm4 \n\t"  // add results
    "movlpd %%xmm4,        (%%rdx)       \n\t"  // write back (C20,
    "movhpd %%xmm4,        (%%r11,%%r8)  \n\t"  //             C31)
    "                                   \n\t"
    "                                   \n\t"
    "movlpd (%%r10),             %%xmm3 \n\t"  // load (C01,
    "movhpd (%%rcx,%%r8),        %%xmm3 \n\t"  //       C10)
    "mulpd  %%xmm0,              %%xmm10\n\t"  // scale ab_01_10 by alpha
    "mulpd  %%xmm1,              %%xmm3 \n\t"  // scale (C01, C10) by beta
    "addpd  %%xmm10,             %%xmm3 \n\t"  // add results
    "movlpd %%xmm3,        (%%r10)      \n\t"  // write back (C01,
    "movhpd %%xmm3,        (%%rcx,%%r8) \n\t"  //             C10)
    "                                   \n\t"
    "movlpd (%%r11),             %%xmm4 \n\t"  // load (C21,
    "movhpd (%%rdx,%%r8),        %%xmm4 \n\t"  //       C30)
    "mulpd  %%xmm0,              %%xmm11\n\t"  // scale ab_21_30 by alpha
    "mulpd  %%xmm1,              %%xmm4 \n\t"  // scale (C21, C30) by beta
    "addpd  %%xmm11,             %%xmm4 \n\t"  // add results
    "movlpd %%xmm4,        (%%r11)      \n\t"  // write back (C21,
    "movhpd %%xmm4,        (%%rdx,%%r8) \n\t"  //             C30)
    "                                   \n\t"
    "                                   \n\t"
    "leaq   (%%rcx,%%r9,2),      %%rcx  \n\t"  // Store addr of C02 in %rcx
    "leaq   (%%r10,%%r9,2),      %%r10  \n\t"  // Store addr of C03 in %r10
    "leaq   (%%rdx,%%r9,2),      %%rdx  \n\t"  // Store addr of C22 in $rdx
    "leaq   (%%r11,%%r9,2),      %%r11  \n\t"  // Store addr of C23 in %r11
    "                                   \n\t"
    "                                   \n\t"
    "movlpd (%%rcx),             %%xmm3 \n\t"  // load (C02,
    "movhpd (%%r10,%%r8),        %%xmm3 \n\t"  //       C13)
    "mulpd  %%xmm0,              %%xmm12\n\t"  // scale ab_02_13 by alpha
    "mulpd  %%xmm1,              %%xmm3 \n\t"  // scale (C02, C13) by beta
    "addpd  %%xmm12,             %%xmm3 \n\t"  // add results
    "movlpd %%xmm3,        (%%rcx)      \n\t"  // write back (C02,
    "movhpd %%xmm3,        (%%r10,%%r8) \n\t"  //             C13)
    "                                   \n\t"
    "movlpd (%%rdx),             %%xmm4 \n\t"  // load (C22,
    "movhpd (%%r11, %%r8),       %%xmm4 \n\t"  //       C33)
    "mulpd  %%xmm0,              %%xmm13\n\t"  // scale ab_22_33 by alpha
    "mulpd  %%xmm1,              %%xmm4 \n\t"  // scale (C22, C33) by beta
    "addpd  %%xmm13,             %%xmm4 \n\t"  // add results
    "movlpd %%xmm4,             (%%rdx) \n\t"  // write back (C22,
    "movhpd %%xmm4,        (%%r11,%%r8) \n\t"  //             C33)
    "                                   \n\t"
    "                                   \n\t"
    "movlpd (%%r10),             %%xmm3 \n\t"  // load (C03,
    "movhpd (%%rcx,%%r8),        %%xmm3 \n\t"  //       C12)
    "mulpd  %%xmm0,              %%xmm14\n\t"  // scale ab_03_12 by alpha
    "mulpd  %%xmm1,              %%xmm3 \n\t"  // scale (C03, C12) by beta
    "addpd  %%xmm14,             %%xmm3 \n\t"  // add results
    "movlpd %%xmm3,        (%%r10)      \n\t"  // write back (C03,
    "movhpd %%xmm3,        (%%rcx,%%r8) \n\t"  //             C12)
    "                                   \n\t"
    "movlpd (%%r11),             %%xmm4 \n\t"  // load (C23,
    "movhpd (%%rdx,%%r8),        %%xmm4 \n\t"  //       C32)
    "mulpd  %%xmm0,              %%xmm15\n\t"  // scale ab_23_32 by alpha
    "mulpd  %%xmm1,              %%xmm4 \n\t"  // scale (C23, C32) by beta
    "addpd  %%xmm15,             %%xmm4 \n\t"  // add results
    "movlpd %%xmm4,        (%%r11)      \n\t"  // write back (C23,
    "movhpd %%xmm4,        (%%rdx,%%r8) \n\t"  //             C32)
    : // output
    : // input
        "m" (kb),       // 0
        "m" (kl),       // 1
        "m" (A),        // 2
        "m" (B),        // 3
        "m" (alpha),    // 4
        "m" (beta),     // 5
        "m" (C),        // 6
        "m" (incRowC),  // 7
        "m" (incColC)   // 8
    : // register clobber list
        "rax""rbx""rcx""rdx""rsi""rdi""r8""r9""r10""r11",
        "xmm0""xmm1""xmm2""xmm3",
        "xmm4""xmm5""xmm6""xmm7",
        "xmm8""xmm9""xmm10""xmm11",
        "xmm12""xmm13""xmm14""xmm15"
    );
}

//
//  Compute Y += alpha*X
//
static void
dgeaxpy(int           m,
        int           n,
        double        alpha,
        const double  *X,
        int           incRowX,
        int           incColX,
        double        *Y,
        int           incRowY,
        int           incColY)
{
    int i, j;


    if (alpha!=1.0) {
        for (j=0; j<n; ++j) {
            for (i=0; i<m; ++i) {
                Y[i*incRowY+j*incColY] += alpha*X[i*incRowX+j*incColX];
            }
        }
    } else {
        for (j=0; j<n; ++j) {
            for (i=0; i<m; ++i) {
                Y[i*incRowY+j*incColY] += X[i*incRowX+j*incColX];
            }
        }
    }
}

//
//  Compute X *= alpha
//
static void
dgescal(int     m,
        int     n,
        double  alpha,
        double  *X,
        int     incRowX,
        int     incColX)
{
    int i, j;

    if (alpha!=0.0) {
        for (j=0; j<n; ++j) {
            for (i=0; i<m; ++i) {
                X[i*incRowX+j*incColX] *= alpha;
            }
        }
    } else {
        for (j=0; j<n; ++j) {
            for (i=0; i<m; ++i) {
                X[i*incRowX+j*incColX] = 0.0;
            }
        }
    }
}

//
//  Macro Kernel for the multiplication of blocks of A and B.  We assume that
//  these blocks were previously packed to buffers _A and _B.
//
static void
dgemm_macro_kernel(int     mc,
                   int     nc,
                   int     kc,
                   double  alpha,
                   double  beta,
                   double  *C,
                   int     incRowC,
                   int     incColC)
{
    int mp = (mc+MR-1) / MR;
    int np = (nc+NR-1) / NR;

    int _mr = mc % MR;
    int _nr = nc % NR;

    int mr, nr;
    int i, j;

    for (j=0; j<np; ++j) {
        nr    = (j!=np-1 || _nr==0) ? NR : _nr;

        for (i=0; i<mp; ++i) {
            mr    = (i!=mp-1 || _mr==0) ? MR : _mr;

            if (mr==MR && nr==NR) {
                dgemm_micro_kernel(kc, alpha, &_A[i*kc*MR], &_B[j*kc*NR],
                                   beta,
                                   &C[i*MR*incRowC+j*NR*incColC],
                                   incRowC, incColC);
            } else {
                dgemm_micro_kernel(kc, alpha, &_A[i*kc*MR], &_B[j*kc*NR],
                                   0.0,
                                   _C, 1, MR);
                dgescal(mr, nr, beta,
                        &C[i*MR*incRowC+j*NR*incColC], incRowC, incColC);
                dgeaxpy(mr, nr, 1.0, _C, 1, MR,
                        &C[i*MR*incRowC+j*NR*incColC], incRowC, incColC);
            }
        }
    }
}

//
//  Compute C <- beta*C + alpha*A*B
//
void
ULMBLAS(dgemm_nn)(int            m,
                  int            n,
                  int            k,
                  double         alpha,
                  const double   *A,
                  int            incRowA,
                  int            incColA,
                  const double   *B,
                  int            incRowB,
                  int            incColB,
                  double         beta,
                  double         *C,
                  int            incRowC,
                  int            incColC)
{
    int mb = (m+MC-1) / MC;
    int nb = (n+NC-1) / NC;
    int kb = (k+KC-1) / KC;

    int _mc = m % MC;
    int _nc = n % NC;
    int _kc = k % KC;

    int mc, nc, kc;
    int i, j, l;

    double _beta;

    if (alpha==0.0 || k==0) {
        dgescal(m, n, beta, C, incRowC, incColC);
        return;
    }

    for (j=0; j<nb; ++j) {
        nc = (j!=nb-1 || _nc==0) ? NC : _nc;

        for (l=0; l<kb; ++l) {
            kc    = (l!=kb-1 || _kc==0) ? KC   : _kc;
            _beta = (l==0) ? beta : 1.0;

            pack_B(kc, nc,
                   &B[l*KC*incRowB+j*NC*incColB], incRowB, incColB,
                   _B);

            for (i=0; i<mb; ++i) {
                mc = (i!=mb-1 || _mc==0) ? MC : _mc;

                pack_A(mc, kc,
                       &A[i*MC*incRowA+l*KC*incColA], incRowA, incColA,
                       _A);

                dgemm_macro_kernel(mc, nc, kc, alpha, _beta,
                                   &C[i*MC*incRowC+j*NC*incColC],
                                   incRowC, incColC);
            }
        }
    }
}

Benchmark Results

We run the benchmarks

$shell> cd bench                                                         
$shell> ./xdl3blastst > report                                           

and filter out the results for the demo-sse-all-asm branch:

$shell> grep PASS report > demo-sse-all-asm                              

With the gnuplot script

set terminal svg size 1140,480
set output "bench13.svg"
set title "Compute C + A*B"
set xlabel "Matrix dimensions N=M=K"
set ylabel "MFLOPS"
set yrange [0:9600]
set key outside
plot "refBLAS" using 4:13 with linespoints lt 2 title "Netlib RefBLAS", "demo-pure-c" using 4:13 with linespoints lt 4 title "demo-pure-c", "demo-naive-sse-with-intrinsics" using 4:13 with linespoints lt 5 title "demo-naive-sse-with-intrinsics", "demo-naive-sse-with-intrinsics-unrolled" using 4:13 with linespoints lt 6 title "demo-naive-sse-with-intrinsics-unrolled", "demo-sse-intrinsics" using 4:13 with linespoints lt 7 title "demo-sse-intrinsics", "demo-sse-intrinsics-v2" using 4:13 with linespoints lt 8 title "demo-sse-intrinsics-v2", "demo-sse-asm" using 4:13 with linespoints lt 9 title "demo-sse-asm", "demo-sse-asm-unrolled" using 4:13 with linespoints lt 10 title "demo-sse-asm-unrolled", "demo-sse-asm-unrolled-v2" using 4:13 with linespoints lt 11 title "demo-sse-asm-unrolled-v2", "demo-sse-asm-unrolled-v3" using 4:13 with linespoints lt 12 title "demo-sse-asm-unrolled-v3", "demo-sse-all-asm" using 4:13 with linespoints lt 13 title "demo-sse-all-asm"

we feed gnuplot

$shell> gnuplot bench13.gps                                              

and get

Code Size of kb-Loop Body

On Mac OS X you can use otool to get at assembler code from an object file. Moreover you directly can see how many bytes each instruction takes. We will use this to determine the code size of the kb-loop in the micro kernel. So we have to look at the label generated from

".DLOOP%=:                   \n\t"  // for l = kb,..,1 do

and the jump

"jne       .DLOOP%=          \n\t"  // if l>= 1 go back

Here the complete dump of otool

$shell> cd ../src/level3                                                 
$shell> otool -dtV dgemm_nn.o                                            
dgemm_nn.o:
(__TEXT,__text) section
_ULM_dgemm_nn:
0000000000000000	pushq	%rbp
0000000000000001	pushq	%r15
0000000000000003	pushq	%r14
0000000000000005	pushq	%r13
0000000000000007	pushq	%r12
0000000000000009	pushq	%rbx
000000000000000a	subq	$0x288, %rsp
0000000000000011	movsd	%xmm1, 0x208(%rsp)
000000000000001a	movl	%r9d, 0x44(%rsp)
000000000000001f	movsd	%xmm0, 0x1d0(%rsp)
0000000000000028	leal	0xfff(%rsi), %eax
000000000000002e	movq	%rsi, %rbx
0000000000000031	movq	%rdx, %rsi
0000000000000034	sarl	$0x1f, %eax
0000000000000037	shrl	$0x14, %eax
000000000000003a	xorps	%xmm2, %xmm2
000000000000003d	ucomisd	%xmm2, %xmm0
0000000000000041	movl	0x2e8(%rsp), %r12d
0000000000000049	movl	0x2e0(%rsp), %r15d
0000000000000051	movq	0x2d8(%rsp), %r14
0000000000000059	jne	0x5d
000000000000005b	jnp	0x65
000000000000005d	testl	%esi, %esi
000000000000005f	jne	0x110
0000000000000065	testl	%ebx, %ebx
0000000000000067	setg	%cl
000000000000006a	testl	%edi, %edi
000000000000006c	setg	%al
000000000000006f	andb	%cl, %al
0000000000000071	ucomisd	%xmm2, %xmm1
0000000000000075	jne	0x7b
0000000000000077	jp	0x7b
0000000000000079	jmp	0xc8
000000000000007b	testb	%al, %al
000000000000007d	je	.DPOSTACCUMULATE1+1037
0000000000000083	xorl	%eax, %eax
0000000000000085	xorl	%ecx, %ecx
0000000000000087	nopw	_ULM_dgemm_nn(%rax,%rax)
0000000000000090	movl	%eax, %edx
0000000000000092	movl	%edi, %esi
0000000000000094	nopw	%cs:_ULM_dgemm_nn(%rax,%rax)
00000000000000a0	movslq	%edx, %rdx
00000000000000a3	movsd	_ULM_dgemm_nn(%r14,%rdx,8), %xmm0
00000000000000a9	mulsd	%xmm1, %xmm0
00000000000000ad	movsd	%xmm0, _ULM_dgemm_nn(%r14,%rdx,8)
00000000000000b3	addl	%r15d, %edx
00000000000000b6	decl	%esi
00000000000000b8	jne	0xa0
00000000000000ba	incl	%ecx
00000000000000bc	addl	%r12d, %eax
00000000000000bf	cmpl	%ebx, %ecx
00000000000000c1	jne	0x90
00000000000000c3	jmpq	.DPOSTACCUMULATE1+1037
00000000000000c8	testb	%al, %al
00000000000000ca	je	.DPOSTACCUMULATE1+1037
00000000000000d0	xorl	%eax, %eax
00000000000000d2	xorl	%ecx, %ecx
00000000000000d4	nopw	%cs:_ULM_dgemm_nn(%rax,%rax)
00000000000000e0	movl	%eax, %edx
00000000000000e2	movl	%edi, %esi
00000000000000e4	nopw	%cs:_ULM_dgemm_nn(%rax,%rax)
00000000000000f0	movslq	%edx, %rdx
00000000000000f3	movq	$_ULM_dgemm_nn, _ULM_dgemm_nn(%r14,%rdx,8)
00000000000000fb	addl	%r15d, %edx
00000000000000fe	decl	%esi
0000000000000100	jne	0xf0
0000000000000102	incl	%ecx
0000000000000104	addl	%r12d, %eax
0000000000000107	cmpl	%ebx, %ecx
0000000000000109	jne	0xe0
000000000000010b	jmpq	.DPOSTACCUMULATE1+1037
0000000000000110	movq	%rcx, 0x130(%rsp)
0000000000000118	testl	%ebx, %ebx
000000000000011a	jle	.DPOSTACCUMULATE1+1037
0000000000000120	movq	%rdi, %r9
0000000000000123	movq	%r9, 0x90(%rsp)
000000000000012b	leal	0x17f(%r9), %ecx
0000000000000132	movslq	%ecx, %rcx
0000000000000135	imulq	$0x2aaaaaab, %rcx, %r11
000000000000013c	movq	%r11, %rcx
000000000000013f	shrq	$0x3f, %rcx
0000000000000143	sarq	$0x26, %r11
0000000000000147	addl	%ecx, %r11d
000000000000014a	movq	%r11, 0x118(%rsp)
0000000000000152	movq	%rbx, %rdi
0000000000000155	leal	0xfff(%rdi,%rax), %r10d
000000000000015d	sarl	$0xc, %r10d
0000000000000161	movq	%r10, 0x18(%rsp)
0000000000000166	leal	0x17f(%rsi), %eax
000000000000016c	cltq
000000000000016e	imulq	$0x2aaaaaab, %rax, %rbx
0000000000000175	movq	%rbx, %rax
0000000000000178	shrq	$0x3f, %rax
000000000000017c	sarq	$0x26, %rbx
0000000000000180	addl	%eax, %ebx
0000000000000182	movq	%rbx, 0x80(%rsp)
000000000000018a	movslq	%r9d, %rdx
000000000000018d	imulq	$0x2aaaaaab, %rdx, %rax
0000000000000194	movq	%rax, %rcx
0000000000000197	shrq	$0x3f, %rcx
000000000000019b	sarq	$0x26, %rax
000000000000019f	addl	%ecx, %eax
00000000000001a1	imull	$0x180, %eax, %eax
00000000000001a7	subl	%eax, %edx
00000000000001a9	movq	%rdx, 0x128(%rsp)
00000000000001b1	movl	%edi, %eax
00000000000001b3	sarl	$0x1f, %eax
00000000000001b6	shrl	$0x14, %eax
00000000000001b9	addl	%edi, %eax
00000000000001bb	andl	$0xfffff000, %eax
00000000000001c0	subl	%eax, %edi
00000000000001c2	movq	%rdi, 0x28(%rsp)
00000000000001c7	movslq	%esi, %rdx
00000000000001ca	imulq	$0x2aaaaaab, %rdx, %rax
00000000000001d1	movq	%rax, %rcx
00000000000001d4	shrq	$0x3f, %rcx
00000000000001d8	sarq	$0x26, %rax
00000000000001dc	addl	%ecx, %eax
00000000000001de	imull	$0x180, %eax, %eax
00000000000001e4	subl	%eax, %edx
00000000000001e6	movq	%rdx, 0xa0(%rsp)
00000000000001ee	movl	0x2d0(%rsp), %ebp
00000000000001f5	leal	-0x1(%r10), %eax
00000000000001f9	movl	%eax, 0x24(%rsp)
00000000000001fd	leal	-0x1(%rbx), %eax
0000000000000200	movl	%eax, 0x9c(%rsp)
0000000000000207	movslq	0x2c8(%rsp), %rdx
000000000000020f	movq	%rdx, 0xc0(%rsp)
0000000000000217	leal	_ULM_dgemm_nn(,%rbp,4), %r10d
000000000000021f	leal	-0x1(%r11), %ecx
0000000000000223	movl	%ecx, 0x124(%rsp)
000000000000022a	movslq	0x44(%rsp), %r13
000000000000022f	movq	%r13, 0x168(%rsp)
0000000000000237	movq	%r8, 0x170(%rsp)
000000000000023f	leal	_ULM_dgemm_nn(,%r8,4), %r9d
0000000000000247	movslq	%r15d, %rcx
000000000000024a	movq	%rcx, 0x1c8(%rsp)
0000000000000252	movslq	%r12d, %rcx
0000000000000255	movq	%rcx, 0x1c0(%rsp)
000000000000025d	movl	%ebp, %ecx
000000000000025f	imull	$0x180, %edx, %eax
0000000000000265	movl	%eax, 0x7c(%rsp)
0000000000000269	movslq	%r10d, %rax
000000000000026c	movq	%rax, 0x8(%rsp)
0000000000000271	leaq	_ULM_dgemm_nn(,%rax,8), %rax
0000000000000279	movq	%rax, 0xf8(%rsp)
0000000000000281	leaq	_ULM_dgemm_nn(,%rdx,8), %rdi
0000000000000289	movq	%rdi, 0x70(%rsp)
000000000000028e	leal	_ULM_dgemm_nn(%rbp,%rbp,2), %r11d
0000000000000293	leal	_ULM_dgemm_nn(%rbp,%rbp), %edx
0000000000000297	movslq	%ebp, %rbx
000000000000029a	movq	%rbx, 0x68(%rsp)
000000000000029f	imull	$0x180, %r13d, %eax
00000000000002a6	movl	%eax, 0x64(%rsp)
00000000000002aa	imull	$0x180, %r8d, %eax
00000000000002b1	movl	%eax, 0x114(%rsp)
00000000000002b8	movslq	%r9d, %rax
00000000000002bb	movq	%rax, 0xe0(%rsp)
00000000000002c3	leaq	_ULM_dgemm_nn(,%rax,8), %rax
00000000000002cb	movq	%rax, 0xd8(%rsp)
00000000000002d3	leaq	_ULM_dgemm_nn(,%r13,8), %rax
00000000000002db	movq	%rax, 0x140(%rsp)
00000000000002e3	leal	_ULM_dgemm_nn(%r8,%r8,2), %r9d
00000000000002e7	movq	%rsi, %rax
00000000000002ea	movq	%rax, 0x88(%rsp)
00000000000002f2	leal	_ULM_dgemm_nn(%r8,%r8), %esi
00000000000002f6	movslq	%r8d, %rbp
00000000000002f9	movq	%rbp, 0x138(%rsp)
0000000000000301	xorl	%ebp, %ebp
0000000000000303	movq	%rbp, 0x38(%rsp)
0000000000000308	shll	$0xc, %ecx
000000000000030b	movl	%ecx, 0x14(%rsp)
000000000000030f	xorl	%ecx, %ecx
0000000000000311	movslq	%r11d, %r11
0000000000000314	movq	%r11, 0x58(%rsp)
0000000000000319	movslq	%edx, %rdx
000000000000031c	movq	%rdx, 0xf0(%rsp)
0000000000000324	movslq	%r9d, %rdx
0000000000000327	movq	%rdx, 0x180(%rsp)
000000000000032f	movslq	%esi, %rdx
0000000000000332	movq	%rdx, 0x178(%rsp)
000000000000033a	movq	%rax, %rsi
000000000000033d	nopl	_ULM_dgemm_nn(%rax)
0000000000000340	movl	%ecx, 0x34(%rsp)
0000000000000344	movq	0x28(%rsp), %rdx
0000000000000349	testl	%edx, %edx
000000000000034b	sete	%al
000000000000034e	cmpl	0x24(%rsp), %ecx
0000000000000352	setne	%cl
0000000000000355	orb	%al, %cl
0000000000000357	movl	$0x1000, %eax
000000000000035c	cmovnel	%eax, %edx
000000000000035f	movq	%rdx, %rcx
0000000000000362	testl	%esi, %esi
0000000000000364	movq	%rdi, %rdx
0000000000000367	jle	.DPOSTACCUMULATE1+1001
000000000000036d	movl	0x34(%rsp), %edi
0000000000000371	shll	$0xc, %edi
0000000000000374	movl	%edi, %esi
0000000000000376	movl	0x2d0(%rsp), %eax
000000000000037d	imull	%eax, %esi
0000000000000380	movl	%esi, 0xa8(%rsp)
0000000000000387	movq	%rcx, %rsi
000000000000038a	movq	%rsi, 0x148(%rsp)
0000000000000392	movl	%esi, %eax
0000000000000394	sarl	$0x1f, %eax
0000000000000397	shrl	$0x1e, %eax
000000000000039a	addl	%esi, %eax
000000000000039c	movl	%eax, %ecx
000000000000039e	sarl	$0x2, %ecx
00000000000003a1	movq	%rcx, 0x100(%rsp)
00000000000003a9	andl	$-0x4, %eax
00000000000003ac	movl	%esi, %r8d
00000000000003af	subl	%eax, %r8d
00000000000003b2	movq	%r8, 0x1a0(%rsp)
00000000000003ba	negl	%eax
00000000000003bc	leal	-0x1(%rcx), %ecx
00000000000003bf	cmpl	$0x7, %esi
00000000000003c2	leaq	0x8(,%rcx,8), %rbp
00000000000003ca	movl	$0x8, %ecx
00000000000003cf	cmovleq	%rcx, %rbp
00000000000003d3	movq	%rbp, 0x50(%rsp)
00000000000003d8	imulq	0x8(%rsp), %rbp
00000000000003de	movq	%rbp, 0x48(%rsp)
00000000000003e3	movslq	%r8d, %rcx
00000000000003e6	movq	%rcx, 0xd0(%rsp)
00000000000003ee	leal	0x1(%rsi,%rax), %eax
00000000000003f2	cmpl	$0x4, %eax
00000000000003f5	movl	$0x3, %eax
00000000000003fa	cmovgl	%r8d, %eax
00000000000003fe	subl	%ecx, %eax
0000000000000400	leaq	0x8(,%rax,8), %rax
0000000000000408	movq	%rax, 0xc8(%rsp)
0000000000000410	imull	%r12d, %edi
0000000000000414	movl	%edi, 0xec(%rsp)
000000000000041b	leal	0x3(%rsi), %eax
000000000000041e	sarl	$0x1f, %eax
0000000000000421	shrl	$0x1e, %eax
0000000000000424	leal	0x3(%rsi,%rax), %eax
0000000000000428	sarl	$0x2, %eax
000000000000042b	movq	%rax, 0x190(%rsp)
0000000000000433	leal	-0x1(%rax), %eax
0000000000000436	movl	%eax, 0x18c(%rsp)
000000000000043d	xorl	%eax, %eax
000000000000043f	movq	%rax, 0xb8(%rsp)
0000000000000447	movq	0x38(%rsp), %rax
000000000000044c	movl	%eax, %r13d
000000000000044f	xorl	%edi, %edi
0000000000000451	nopw	%cs:_ULM_dgemm_nn(%rax,%rax)
0000000000000460	movq	0xa0(%rsp), %rsi
0000000000000468	testl	%esi, %esi
000000000000046a	sete	%al
000000000000046d	cmpl	0x9c(%rsp), %edi
0000000000000474	setne	%cl
0000000000000477	orb	%al, %cl
0000000000000479	movl	$0x180, %eax
000000000000047e	cmovnel	%eax, %esi
0000000000000481	movq	%rsi, 0x1b0(%rsp)
0000000000000489	testl	%edi, %edi
000000000000048b	je	0x498
000000000000048d	movsd	0x134b(%rip), %xmm0
0000000000000495	movaps	%xmm0, %xmm1
0000000000000498	imull	$0x180, %edi, %eax
000000000000049e	movl	%eax, 0x154(%rsp)
00000000000004a5	movq	%rdi, 0xb0(%rsp)
00000000000004ad	imull	0x2c8(%rsp), %eax
00000000000004b5	addl	0xa8(%rsp), %eax
00000000000004bc	movq	0x148(%rsp), %rcx
00000000000004c4	cmpl	$0x4, %ecx
00000000000004c7	cltq
00000000000004c9	movq	0x2c0(%rsp), %rcx
00000000000004d1	leaq	_ULM_dgemm_nn(%rcx,%rax,8), %r12
00000000000004d5	jl	0x5e0
00000000000004db	movq	0x1b0(%rsp), %rcx
00000000000004e3	leal	_ULM_dgemm_nn(,%rcx,4), %eax
00000000000004ea	cltq
00000000000004ec	movq	%rax, 0x238(%rsp)
00000000000004f4	testl	%ecx, %ecx
00000000000004f6	jle	0x5a0
00000000000004fc	movslq	%r13d, %rax
00000000000004ff	movq	0x2c0(%rsp), %rcx
0000000000000507	leaq	_ULM_dgemm_nn(%rcx,%rax,8), %r9
000000000000050b	movq	0x238(%rsp), %rax
0000000000000513	leaq	_ULM_dgemm_nn(,%rax,8), %r8
000000000000051b	xorl	%eax, %eax
000000000000051d	leaq	__B(%rip), %rsi
0000000000000524	nopw	%cs:_ULM_dgemm_nn(%rax,%rax)
0000000000000530	movq	%r9, %rdi
0000000000000533	xorl	%ebp, %ebp
0000000000000535	movq	0x1b0(%rsp), %rcx
000000000000053d	movq	0xf0(%rsp), %r10
0000000000000545	nopw	%cs:_ULM_dgemm_nn(%rax,%rax)
0000000000000550	movsd	_ULM_dgemm_nn(%rdi), %xmm0
0000000000000554	movsd	%xmm0, _ULM_dgemm_nn(%rsi,%rbp)
0000000000000559	movsd	_ULM_dgemm_nn(%rdi,%rbx,8), %xmm0
000000000000055e	movsd	%xmm0, 0x8(%rsi,%rbp)
0000000000000564	movsd	_ULM_dgemm_nn(%rdi,%r10,8), %xmm0
000000000000056a	movsd	%xmm0, 0x10(%rsi,%rbp)
0000000000000570	movsd	_ULM_dgemm_nn(%rdi,%r11,8), %xmm0
0000000000000576	movsd	%xmm0, 0x18(%rsi,%rbp)
000000000000057c	addq	$0x20, %rbp
0000000000000580	addq	%rdx, %rdi
0000000000000583	decl	%ecx
0000000000000585	jne	0x550
0000000000000587	incl	%eax
0000000000000589	addq	%r8, %rsi
000000000000058c	addq	0xf8(%rsp), %r9
0000000000000594	movq	0x100(%rsp), %rcx
000000000000059c	cmpl	%ecx, %eax
000000000000059e	jl	0x530
00000000000005a0	movl	%r13d, 0xac(%rsp)
00000000000005a8	movq	0x238(%rsp), %rax
00000000000005b0	movq	%rax, %rcx
00000000000005b3	imulq	0x50(%rsp), %rcx
00000000000005b9	leaq	__B(%rip), %rax
00000000000005c0	addq	%rax, %rcx
00000000000005c3	movq	%rcx, 0x238(%rsp)
00000000000005cb	addq	0x48(%rsp), %r12
00000000000005d0	movq	0x1a0(%rsp), %r8
00000000000005d8	jmp	0x5f7
00000000000005da	nopw	_ULM_dgemm_nn(%rax,%rax)
00000000000005e0	movl	%r13d, 0xac(%rsp)
00000000000005e8	leaq	__B(%rip), %rax
00000000000005ef	movq	%rax, 0x238(%rsp)
00000000000005f7	movsd	%xmm1, 0x1e8(%rsp)
0000000000000600	testl	%r8d, %r8d
0000000000000603	jle	0x6b0
0000000000000609	movq	0x1b0(%rsp), %rax
0000000000000611	testl	%eax, %eax
0000000000000613	movl	$_ULM_dgemm_nn, %r13d
0000000000000619	movq	0x238(%rsp), %rbp
0000000000000621	jle	0x6b0
0000000000000627	nopw	_ULM_dgemm_nn(%rax,%rax)
0000000000000630	movq	0xd0(%rsp), %rax
0000000000000638	leaq	_ULM_dgemm_nn(%rax,%r13,4), %rax
000000000000063c	movq	0x238(%rsp), %rcx
0000000000000644	leaq	_ULM_dgemm_nn(%rcx,%rax,8), %rdi
0000000000000648	xorl	%eax, %eax
000000000000064a	xorl	%ecx, %ecx
000000000000064c	movl	0x2d0(%rsp), %edx
0000000000000653	nopw	%cs:_ULM_dgemm_nn(%rax,%rax)
0000000000000660	cltq
0000000000000662	movsd	_ULM_dgemm_nn(%r12,%rax,8), %xmm0
0000000000000668	movsd	%xmm0, _ULM_dgemm_nn(%rbp,%rcx,8)
000000000000066e	incq	%rcx
0000000000000671	addl	%edx, %eax
0000000000000673	cmpl	%ecx, %r8d
0000000000000676	jne	0x660
0000000000000678	movq	0xc8(%rsp), %rsi
0000000000000680	movq	%r8, %rbx
0000000000000683	callq	___bzero
0000000000000688	movq	%rbx, %r8
000000000000068b	addq	$0x20, %rbp
000000000000068f	movq	0xc0(%rsp), %rax
0000000000000697	leaq	_ULM_dgemm_nn(%r12,%rax,8), %r12
000000000000069b	incq	%r13
000000000000069e	movq	0x1b0(%rsp), %rax
00000000000006a6	cmpl	%eax, %r13d
00000000000006a9	jne	0x630
00000000000006ab	nopl	_ULM_dgemm_nn(%rax,%rax)
00000000000006b0	movq	0x90(%rsp), %rax
00000000000006b8	testl	%eax, %eax
00000000000006ba	movsd	0x208(%rsp), %xmm1
00000000000006c3	movq	0x140(%rsp), %rbp
00000000000006cb	movq	0x138(%rsp), %rbx
00000000000006d3	jle	.DPOSTACCUMULATE1+900
00000000000006d9	movl	0x154(%rsp), %eax
00000000000006e0	imull	0x44(%rsp), %eax
00000000000006e5	movl	%eax, 0x154(%rsp)
00000000000006ec	movq	0x1b0(%rsp), %rax
00000000000006f4	leal	_ULM_dgemm_nn(,%rax,4), %ecx
00000000000006fb	movl	%ecx, 0x22c(%rsp)
0000000000000702	movslq	%ecx, %rcx
0000000000000705	movq	%rcx, 0x108(%rsp)
000000000000070d	movslq	%eax, %rdx
0000000000000710	movq	%rdx, %rax
0000000000000713	sarq	$0x3f, %rax
0000000000000717	shrq	$0x3e, %rax
000000000000071b	addq	%rdx, %rax
000000000000071e	movq	%rax, %rsi
0000000000000721	sarq	$0x2, %rsi
0000000000000725	movq	%rsi, 0x1d8(%rsp)
000000000000072d	andq	$-0x4, %rax
0000000000000731	subq	%rax, %rdx
0000000000000734	movq	%rdx, 0x1e0(%rsp)
000000000000073c	leaq	_ULM_dgemm_nn(,%rcx,8), %rax
0000000000000744	movq	%rax, 0x198(%rsp)
000000000000074c	movq	0xb8(%rsp), %rax
0000000000000754	movl	%eax, %esi
0000000000000756	xorl	%edi, %edi
0000000000000758	nopl	_ULM_dgemm_nn(%rax,%rax)
0000000000000760	movq	0x128(%rsp), %rdx
0000000000000768	testl	%edx, %edx
000000000000076a	sete	%al
000000000000076d	cmpl	0x124(%rsp), %edi
0000000000000774	setne	%cl
0000000000000777	orb	%al, %cl
0000000000000779	movl	$0x180, %eax
000000000000077e	cmovnel	%eax, %edx
0000000000000781	imull	$0x180, %edi, %r8d
0000000000000788	movl	%r8d, %eax
000000000000078b	movq	0x170(%rsp), %rcx
0000000000000793	imull	%ecx, %eax
0000000000000796	addl	0x154(%rsp), %eax
000000000000079d	movl	%edx, %ecx
000000000000079f	sarl	$0x1f, %ecx
00000000000007a2	shrl	$0x1e, %ecx
00000000000007a5	addl	%edx, %ecx
00000000000007a7	cmpl	$0x4, %edx
00000000000007aa	cltq
00000000000007ac	movq	0x130(%rsp), %r9
00000000000007b4	leaq	_ULM_dgemm_nn(%r9,%rax,8), %r13
00000000000007b8	jl	0x8f0
00000000000007be	movl	%r8d, 0x21c(%rsp)
00000000000007c6	movq	%rdi, 0x158(%rsp)
00000000000007ce	movslq	%esi, %rax
00000000000007d1	movl	%esi, 0x164(%rsp)
00000000000007d8	movq	%rdx, 0x1a8(%rsp)
00000000000007e0	leaq	_ULM_dgemm_nn(%r9,%rax,8), %r12
00000000000007e4	movl	%ecx, %r10d
00000000000007e7	movl	%ecx, 0x230(%rsp)
00000000000007ee	sarl	$0x2, %r10d
00000000000007f2	leal	-0x1(%r10), %eax
00000000000007f6	cmpl	$0x7, %edx
00000000000007f9	leaq	0x8(,%rax,8), %rcx
0000000000000801	movl	$0x8, %eax
0000000000000806	cmovleq	%rax, %rcx
000000000000080a	movq	%rcx, %rax
000000000000080d	imulq	0xe0(%rsp), %rax
0000000000000816	movq	%rax, 0x220(%rsp)
000000000000081e	imulq	0x108(%rsp), %rcx
0000000000000827	movq	%rcx, 0x238(%rsp)
000000000000082f	xorl	%edi, %edi
0000000000000831	leaq	__A(%rip), %rsi
0000000000000838	movq	0xd8(%rsp), %r11
0000000000000840	movq	0x1b0(%rsp), %rdx
0000000000000848	testl	%edx, %edx
000000000000084a	movq	%r12, %rcx
000000000000084d	movl	$_ULM_dgemm_nn, %eax
0000000000000852	movq	0x180(%rsp), %r8
000000000000085a	movq	0x178(%rsp), %r9
0000000000000862	jle	0x8a7
0000000000000864	nopw	%cs:_ULM_dgemm_nn(%rax,%rax)
0000000000000870	movsd	_ULM_dgemm_nn(%rcx), %xmm0
0000000000000874	movsd	%xmm0, _ULM_dgemm_nn(%rsi,%rax)
0000000000000879	movsd	_ULM_dgemm_nn(%rcx,%rbx,8), %xmm0
000000000000087e	movsd	%xmm0, 0x8(%rsi,%rax)
0000000000000884	movsd	_ULM_dgemm_nn(%rcx,%r9,8), %xmm0
000000000000088a	movsd	%xmm0, 0x10(%rsi,%rax)
0000000000000890	movsd	_ULM_dgemm_nn(%rcx,%r8,8), %xmm0
0000000000000896	movsd	%xmm0, 0x18(%rsi,%rax)
000000000000089c	addq	$0x20, %rax
00000000000008a0	addq	%rbp, %rcx
00000000000008a3	decl	%edx
00000000000008a5	jne	0x870
00000000000008a7	incl	%edi
00000000000008a9	addq	0x198(%rsp), %rsi
00000000000008b1	addq	%r11, %r12
00000000000008b4	cmpl	%r10d, %edi
00000000000008b7	jl	0x840
00000000000008b9	addq	0x220(%rsp), %r13
00000000000008c1	leaq	__A(%rip), %rax
00000000000008c8	addq	%rax, 0x238(%rsp)
00000000000008d0	movq	0x1a8(%rsp), %rdx
00000000000008d8	movl	0x230(%rsp), %ecx
00000000000008df	jmp	0x916
00000000000008e1	nopw	%cs:_ULM_dgemm_nn(%rax,%rax)
00000000000008f0	movl	%r8d, 0x21c(%rsp)
00000000000008f8	movq	%rdi, 0x158(%rsp)
0000000000000900	movl	%esi, 0x164(%rsp)
0000000000000907	leaq	__A(%rip), %rax
000000000000090e	movq	%rax, 0x238(%rsp)
0000000000000916	movq	%rdx, 0x1a8(%rsp)
000000000000091e	andl	$-0x4, %ecx
0000000000000921	subl	%ecx, %edx
0000000000000923	movq	%rdx, 0x210(%rsp)
000000000000092b	testl	%edx, %edx
000000000000092d	jle	0xa10
0000000000000933	movq	0x1b0(%rsp), %rax
000000000000093b	testl	%eax, %eax
000000000000093d	jle	0xa10
0000000000000943	movq	0x210(%rsp), %rcx
000000000000094b	movslq	%ecx, %rax
000000000000094e	movq	%rax, 0x230(%rsp)
0000000000000956	leal	0x1(%rcx), %eax
0000000000000959	cmpl	$0x4, %eax
000000000000095c	movl	$0x3, %eax
0000000000000961	cmovgl	%ecx, %eax
0000000000000964	subl	%ecx, %eax
0000000000000966	leaq	0x8(,%rax,8), %rax
000000000000096e	movq	%rax, 0x220(%rsp)
0000000000000976	movl	%ecx, %r12d
0000000000000979	xorl	%ebp, %ebp
000000000000097b	movq	0x238(%rsp), %rbx
0000000000000983	nopw	%cs:_ULM_dgemm_nn(%rax,%rax)
0000000000000990	movq	0x230(%rsp), %rax
0000000000000998	leaq	_ULM_dgemm_nn(%rax,%rbp,4), %rax
000000000000099c	movq	0x238(%rsp), %rcx
00000000000009a4	leaq	_ULM_dgemm_nn(%rcx,%rax,8), %rdi
00000000000009a8	xorl	%eax, %eax
00000000000009aa	xorl	%ecx, %ecx
00000000000009ac	movq	0x170(%rsp), %rdx
00000000000009b4	nopw	%cs:_ULM_dgemm_nn(%rax,%rax)
00000000000009c0	cltq
00000000000009c2	movsd	_ULM_dgemm_nn(%r13,%rax,8), %xmm0
00000000000009c9	movsd	%xmm0, _ULM_dgemm_nn(%rbx,%rcx,8)
00000000000009ce	incq	%rcx
00000000000009d1	addl	%edx, %eax
00000000000009d3	cmpl	%ecx, %r12d
00000000000009d6	jne	0x9c0
00000000000009d8	movq	0x220(%rsp), %rsi
00000000000009e0	callq	___bzero
00000000000009e5	addq	$0x20, %rbx
00000000000009e9	movq	0x168(%rsp), %rax
00000000000009f1	leaq	_ULM_dgemm_nn(%r13,%rax,8), %r13
00000000000009f6	incq	%rbp
00000000000009f9	movq	0x1b0(%rsp), %rax
0000000000000a01	cmpl	%eax, %ebp
0000000000000a03	jne	0x990
0000000000000a05	nopw	%cs:_ULM_dgemm_nn(%rax,%rax)
0000000000000a10	movq	0x1a8(%rsp), %rdx
0000000000000a18	leal	0x3(%rdx), %eax
0000000000000a1b	sarl	$0x1f, %eax
0000000000000a1e	shrl	$0x1e, %eax
0000000000000a21	movq	0x148(%rsp), %rcx
0000000000000a29	testl	%ecx, %ecx
0000000000000a2b	movl	0x2e8(%rsp), %r9d
0000000000000a33	movsd	0x208(%rsp), %xmm1
0000000000000a3c	jle	.DPOSTACCUMULATE1+843
0000000000000a42	movl	0x21c(%rsp), %ecx
0000000000000a49	imull	%r15d, %ecx
0000000000000a4d	addl	0xec(%rsp), %ecx
0000000000000a54	leal	0x3(%rdx,%rax), %eax
0000000000000a58	sarl	$0x2, %eax
0000000000000a5b	movq	%rax, 0x200(%rsp)
0000000000000a63	movslq	%ecx, %rcx
0000000000000a66	movq	%rcx, 0x1f0(%rsp)
0000000000000a6e	leal	-0x1(%rax), %eax
0000000000000a71	movl	%eax, 0x1fc(%rsp)
0000000000000a78	xorl	%esi, %esi
0000000000000a7a	nopw	_ULM_dgemm_nn(%rax,%rax)
0000000000000a80	movq	%rsi, 0x1b8(%rsp)
0000000000000a88	movq	0x1a0(%rsp), %rcx
0000000000000a90	testl	%ecx, %ecx
0000000000000a92	sete	%al
0000000000000a95	cmpl	0x18c(%rsp), %esi
0000000000000a9c	setne	%bl
0000000000000a9f	orb	%al, %bl
0000000000000aa1	movb	%bl, 0x230(%rsp)
0000000000000aa8	movl	%ecx, %r13d
0000000000000aab	movl	$0x4, %eax
0000000000000ab0	cmovnel	%eax, %r13d
0000000000000ab4	testl	%edx, %edx
0000000000000ab6	jle	.DPOSTACCUMULATE1+808
0000000000000abc	movq	0x1b8(%rsp), %rdx
0000000000000ac4	movl	%edx, %eax
0000000000000ac6	imull	0x22c(%rsp), %eax
0000000000000ace	cltq
0000000000000ad0	leaq	__B(%rip), %rcx
0000000000000ad7	leaq	_ULM_dgemm_nn(%rcx,%rax,8), %rax
0000000000000adb	movq	%rax, 0x220(%rsp)
0000000000000ae3	movl	%edx, %eax
0000000000000ae5	imull	%r9d, %eax
0000000000000ae9	movl	%eax, 0x21c(%rsp)
0000000000000af0	xorl	%esi, %esi
0000000000000af2	nopw	%cs:_ULM_dgemm_nn(%rax,%rax)
0000000000000b00	movq	0x210(%rsp), %rdx
0000000000000b08	testl	%edx, %edx
0000000000000b0a	sete	%al
0000000000000b0d	cmpl	0x1fc(%rsp), %esi
0000000000000b14	setne	%cl
0000000000000b17	orb	%al, %cl
0000000000000b19	movl	%edx, %r12d
0000000000000b1c	movl	$0x4, %eax
0000000000000b21	cmovnel	%eax, %r12d
0000000000000b25	andb	0x230(%rsp), %cl
0000000000000b2c	movl	%esi, %eax
0000000000000b2e	imull	0x22c(%rsp), %eax
0000000000000b36	cmpb	$0x1, %cl
0000000000000b39	cltq
0000000000000b3b	leaq	__A(%rip), %rcx
0000000000000b42	leaq	_ULM_dgemm_nn(%rcx,%rax,8), %rax
0000000000000b46	jne	.DPOSTACCUMULATE0+407
0000000000000b4c	movl	%esi, %ecx
0000000000000b4e	movq	%rsi, 0x238(%rsp)
0000000000000b56	imull	%r15d, %ecx
0000000000000b5a	addl	0x21c(%rsp), %ecx
0000000000000b61	shll	$0x2, %ecx
0000000000000b64	movslq	%ecx, %rcx
0000000000000b67	addq	0x1f0(%rsp), %rcx
0000000000000b6f	leaq	_ULM_dgemm_nn(%r14,%rcx,8), %rcx
0000000000000b73	movsd	0x1d0(%rsp), %xmm0
0000000000000b7c	movsd	%xmm0, 0x280(%rsp)
0000000000000b85	movq	%rax, 0x278(%rsp)
0000000000000b8d	movq	0x220(%rsp), %rax
0000000000000b95	movq	%rax, 0x270(%rsp)
0000000000000b9d	movsd	0x1e8(%rsp), %xmm0
0000000000000ba6	movsd	%xmm0, 0x268(%rsp)
0000000000000baf	movq	%rcx, 0x260(%rsp)
0000000000000bb7	movq	0x1c8(%rsp), %rax
0000000000000bbf	movq	%rax, 0x258(%rsp)
0000000000000bc7	movq	0x1c0(%rsp), %rax
0000000000000bcf	movq	%rax, 0x250(%rsp)
0000000000000bd7	movq	0x1d8(%rsp), %rax
0000000000000bdf	movq	%rax, 0x248(%rsp)
0000000000000be7	movq	0x1e0(%rsp), %rax
0000000000000bef	movq	%rax, 0x240(%rsp)
0000000000000bf7	movq	0x248(%rsp), %rsi
0000000000000bff	movq	0x240(%rsp), %rdi
0000000000000c07	movq	0x278(%rsp), %rax
0000000000000c0f	movq	0x270(%rsp), %rbx
0000000000000c17	movaps	_ULM_dgemm_nn(%rax), %xmm0
0000000000000c1a	movaps	0x10(%rax), %xmm1
0000000000000c1e	movaps	_ULM_dgemm_nn(%rbx), %xmm2
0000000000000c21	xorpd	%xmm8, %xmm8
0000000000000c26	xorpd	%xmm9, %xmm9
0000000000000c2b	xorpd	%xmm10, %xmm10
0000000000000c30	xorpd	%xmm11, %xmm11
0000000000000c35	xorpd	%xmm12, %xmm12
0000000000000c3a	xorpd	%xmm13, %xmm13
0000000000000c3f	xorpd	%xmm14, %xmm14
0000000000000c44	xorpd	%xmm15, %xmm15
0000000000000c49	xorpd	%xmm3, %xmm3
0000000000000c4d	xorpd	%xmm4, %xmm4
0000000000000c51	xorpd	%xmm5, %xmm5
0000000000000c55	xorpd	%xmm6, %xmm6
0000000000000c59	xorpd	%xmm7, %xmm7
0000000000000c5d	testq	%rdi, %rdi
0000000000000c60	testq	%rsi, %rsi
0000000000000c63	je	.DCONSIDERLEFT0
.DLOOP0:
0000000000000c69	addpd	%xmm3, %xmm12
0000000000000c6e	movaps	0x10(%rbx), %xmm3
0000000000000c72	addpd	%xmm6, %xmm13
0000000000000c77	movaps	%xmm2, %xmm6
0000000000000c7a	pshufd	$0x4e, %xmm2, %xmm4
0000000000000c7f	mulpd	%xmm0, %xmm2
0000000000000c83	mulpd	%xmm1, %xmm6
0000000000000c87	addpd	%xmm5, %xmm14
0000000000000c8c	addpd	%xmm7, %xmm15
0000000000000c91	movaps	%xmm4, %xmm7
0000000000000c94	mulpd	%xmm0, %xmm4
0000000000000c98	mulpd	%xmm1, %xmm7
0000000000000c9c	addpd	%xmm2, %xmm8
0000000000000ca1	movaps	0x20(%rbx), %xmm2
0000000000000ca5	addpd	%xmm6, %xmm9
0000000000000caa	movaps	%xmm3, %xmm6
0000000000000cad	pshufd	$0x4e, %xmm3, %xmm5
0000000000000cb2	mulpd	%xmm0, %xmm3
0000000000000cb6	mulpd	%xmm1, %xmm6
0000000000000cba	addpd	%xmm4, %xmm10
0000000000000cbf	addpd	%xmm7, %xmm11
0000000000000cc4	movaps	%xmm5, %xmm7
0000000000000cc7	mulpd	%xmm0, %xmm5
0000000000000ccb	movaps	0x20(%rax), %xmm0
0000000000000ccf	mulpd	%xmm1, %xmm7
0000000000000cd3	movaps	0x30(%rax), %xmm1
0000000000000cd7	addpd	%xmm3, %xmm12
0000000000000cdc	movaps	0x30(%rbx), %xmm3
0000000000000ce0	addpd	%xmm6, %xmm13
0000000000000ce5	movaps	%xmm2, %xmm6
0000000000000ce8	pshufd	$0x4e, %xmm2, %xmm4
0000000000000ced	mulpd	%xmm0, %xmm2
0000000000000cf1	mulpd	%xmm1, %xmm6
0000000000000cf5	addpd	%xmm5, %xmm14
0000000000000cfa	addpd	%xmm7, %xmm15
0000000000000cff	movaps	%xmm4, %xmm7
0000000000000d02	mulpd	%xmm0, %xmm4
0000000000000d06	mulpd	%xmm1, %xmm7
0000000000000d0a	addpd	%xmm2, %xmm8
0000000000000d0f	movaps	0x40(%rbx), %xmm2
0000000000000d13	addpd	%xmm6, %xmm9
0000000000000d18	movaps	%xmm3, %xmm6
0000000000000d1b	pshufd	$0x4e, %xmm3, %xmm5
0000000000000d20	mulpd	%xmm0, %xmm3
0000000000000d24	mulpd	%xmm1, %xmm6
0000000000000d28	addpd	%xmm4, %xmm10
0000000000000d2d	addpd	%xmm7, %xmm11
0000000000000d32	movaps	%xmm5, %xmm7
0000000000000d35	mulpd	%xmm0, %xmm5
0000000000000d39	movaps	0x40(%rax), %xmm0
0000000000000d3d	mulpd	%xmm1, %xmm7
0000000000000d41	movaps	0x50(%rax), %xmm1
0000000000000d45	addpd	%xmm3, %xmm12
0000000000000d4a	movaps	0x50(%rbx), %xmm3
0000000000000d4e	addpd	%xmm6, %xmm13
0000000000000d53	movaps	%xmm2, %xmm6
0000000000000d56	pshufd	$0x4e, %xmm2, %xmm4
0000000000000d5b	mulpd	%xmm0, %xmm2
0000000000000d5f	mulpd	%xmm1, %xmm6
0000000000000d63	addpd	%xmm5, %xmm14
0000000000000d68	addpd	%xmm7, %xmm15
0000000000000d6d	movaps	%xmm4, %xmm7
0000000000000d70	mulpd	%xmm0, %xmm4
0000000000000d74	mulpd	%xmm1, %xmm7
0000000000000d78	addpd	%xmm2, %xmm8
0000000000000d7d	movaps	0x60(%rbx), %xmm2
0000000000000d81	addpd	%xmm6, %xmm9
0000000000000d86	movaps	%xmm3, %xmm6
0000000000000d89	pshufd	$0x4e, %xmm3, %xmm5
0000000000000d8e	mulpd	%xmm0, %xmm3
0000000000000d92	mulpd	%xmm1, %xmm6
0000000000000d96	addpd	%xmm4, %xmm10
0000000000000d9b	addpd	%xmm7, %xmm11
0000000000000da0	movaps	%xmm5, %xmm7
0000000000000da3	mulpd	%xmm0, %xmm5
0000000000000da7	movaps	0x60(%rax), %xmm0
0000000000000dab	mulpd	%xmm1, %xmm7
0000000000000daf	movaps	0x70(%rax), %xmm1
0000000000000db3	addpd	%xmm3, %xmm12
0000000000000db8	movaps	0x70(%rbx), %xmm3
0000000000000dbc	addpd	%xmm6, %xmm13
0000000000000dc1	movaps	%xmm2, %xmm6
0000000000000dc4	pshufd	$0x4e, %xmm2, %xmm4
0000000000000dc9	mulpd	%xmm0, %xmm2
0000000000000dcd	mulpd	%xmm1, %xmm6
0000000000000dd1	addq	$0x80, %rax
0000000000000dd7	addpd	%xmm5, %xmm14
0000000000000ddc	addpd	%xmm7, %xmm15
0000000000000de1	movaps	%xmm4, %xmm7
0000000000000de4	mulpd	%xmm0, %xmm4
0000000000000de8	mulpd	%xmm1, %xmm7
0000000000000dec	addpd	%xmm2, %xmm8
0000000000000df1	movaps	0x80(%rbx), %xmm2
0000000000000df8	addpd	%xmm6, %xmm9
0000000000000dfd	movaps	%xmm3, %xmm6
0000000000000e00	pshufd	$0x4e, %xmm3, %xmm5
0000000000000e05	mulpd	%xmm0, %xmm3
0000000000000e09	mulpd	%xmm1, %xmm6
0000000000000e0d	addq	$0x80, %rbx
0000000000000e14	addpd	%xmm4, %xmm10
0000000000000e19	addpd	%xmm7, %xmm11
0000000000000e1e	movaps	%xmm5, %xmm7
0000000000000e21	mulpd	%xmm0, %xmm5
0000000000000e25	movaps	_ULM_dgemm_nn(%rax), %xmm0
0000000000000e28	mulpd	%xmm1, %xmm7
0000000000000e2c	movaps	0x10(%rax), %xmm1
0000000000000e30	decq	%rsi
0000000000000e33	jne	.DLOOP0
.DCONSIDERLEFT0:
0000000000000e39	testq	%rdi, %rdi
0000000000000e3c	je	.DPOSTACCUMULATE0
.DLOOPLEFT0:
0000000000000e42	addpd	%xmm3, %xmm12
0000000000000e47	movapd	0x10(%rbx), %xmm3
0000000000000e4c	addpd	%xmm6, %xmm13
0000000000000e51	movapd	%xmm2, %xmm6
0000000000000e55	pshufd	$0x4e, %xmm2, %xmm4
0000000000000e5a	mulpd	%xmm0, %xmm2
0000000000000e5e	mulpd	%xmm1, %xmm6
0000000000000e62	addpd	%xmm5, %xmm14
0000000000000e67	addpd	%xmm7, %xmm15
0000000000000e6c	movapd	%xmm4, %xmm7
0000000000000e70	mulpd	%xmm0, %xmm4
0000000000000e74	mulpd	%xmm1, %xmm7
0000000000000e78	addpd	%xmm2, %xmm8
0000000000000e7d	movapd	0x20(%rbx), %xmm2
0000000000000e82	addpd	%xmm6, %xmm9
0000000000000e87	movapd	%xmm3, %xmm6
0000000000000e8b	pshufd	$0x4e, %xmm3, %xmm5
0000000000000e90	mulpd	%xmm0, %xmm3
0000000000000e94	mulpd	%xmm1, %xmm6
0000000000000e98	addpd	%xmm4, %xmm10
0000000000000e9d	addpd	%xmm7, %xmm11
0000000000000ea2	movapd	%xmm5, %xmm7
0000000000000ea6	mulpd	%xmm0, %xmm5
0000000000000eaa	movapd	0x20(%rax), %xmm0
0000000000000eaf	mulpd	%xmm1, %xmm7
0000000000000eb3	movapd	0x30(%rax), %xmm1
0000000000000eb8	addq	$0x20, %rax
0000000000000ebc	addq	$0x20, %rbx
0000000000000ec0	decq	%rdi
0000000000000ec3	jne	.DLOOPLEFT0
.DPOSTACCUMULATE0:
0000000000000ec9	addpd	%xmm3, %xmm12
0000000000000ece	addpd	%xmm6, %xmm13
0000000000000ed3	addpd	%xmm5, %xmm14
0000000000000ed8	addpd	%xmm7, %xmm15
0000000000000edd	movsd	0x280(%rsp), %xmm0
0000000000000ee6	movsd	0x268(%rsp), %xmm1
0000000000000eef	movq	0x260(%rsp), %rcx
0000000000000ef7	movq	0x258(%rsp), %r8
0000000000000eff	leaq	_ULM_dgemm_nn(,%r8,8), %r8
0000000000000f07	movq	0x250(%rsp), %r9
0000000000000f0f	leaq	_ULM_dgemm_nn(,%r9,8), %r9
0000000000000f17	leaq	_ULM_dgemm_nn(%rcx,%r9), %r10
0000000000000f1b	leaq	_ULM_dgemm_nn(%rcx,%r8,2), %rdx
0000000000000f1f	leaq	_ULM_dgemm_nn(%rdx,%r9), %r11
0000000000000f23	unpcklpd	%xmm0, %xmm0
0000000000000f27	unpcklpd	%xmm1, %xmm1
0000000000000f2b	movlpd	_ULM_dgemm_nn(%rcx), %xmm3
0000000000000f2f	movhpd	_ULM_dgemm_nn(%r10,%r8), %xmm3
0000000000000f35	mulpd	%xmm0, %xmm8
0000000000000f3a	mulpd	%xmm1, %xmm3
0000000000000f3e	addpd	%xmm8, %xmm3
0000000000000f43	movlpd	%xmm3, _ULM_dgemm_nn(%rcx)
0000000000000f47	movhpd	%xmm3, _ULM_dgemm_nn(%r10,%r8)
0000000000000f4d	movlpd	_ULM_dgemm_nn(%rdx), %xmm4
0000000000000f51	movhpd	_ULM_dgemm_nn(%r11,%r8), %xmm4
0000000000000f57	mulpd	%xmm0, %xmm9
0000000000000f5c	mulpd	%xmm1, %xmm4
0000000000000f60	addpd	%xmm9, %xmm4
0000000000000f65	movlpd	%xmm4, _ULM_dgemm_nn(%rdx)
0000000000000f69	movhpd	%xmm4, _ULM_dgemm_nn(%r11,%r8)
0000000000000f6f	movlpd	_ULM_dgemm_nn(%r10), %xmm3
0000000000000f74	movhpd	_ULM_dgemm_nn(%rcx,%r8), %xmm3
0000000000000f7a	mulpd	%xmm0, %xmm10
0000000000000f7f	mulpd	%xmm1, %xmm3
0000000000000f83	addpd	%xmm10, %xmm3
0000000000000f88	movlpd	%xmm3, _ULM_dgemm_nn(%r10)
0000000000000f8d	movhpd	%xmm3, _ULM_dgemm_nn(%rcx,%r8)
0000000000000f93	movlpd	_ULM_dgemm_nn(%r11), %xmm4
0000000000000f98	movhpd	_ULM_dgemm_nn(%rdx,%r8), %xmm4
0000000000000f9e	mulpd	%xmm0, %xmm11
0000000000000fa3	mulpd	%xmm1, %xmm4
0000000000000fa7	addpd	%xmm11, %xmm4
0000000000000fac	movlpd	%xmm4, _ULM_dgemm_nn(%r11)
0000000000000fb1	movhpd	%xmm4, _ULM_dgemm_nn(%rdx,%r8)
0000000000000fb7	leaq	_ULM_dgemm_nn(%rcx,%r9,2), %rcx
0000000000000fbb	leaq	_ULM_dgemm_nn(%r10,%r9,2), %r10
0000000000000fbf	leaq	_ULM_dgemm_nn(%rdx,%r9,2), %rdx
0000000000000fc3	leaq	_ULM_dgemm_nn(%r11,%r9,2), %r11
0000000000000fc7	movlpd	_ULM_dgemm_nn(%rcx), %xmm3
0000000000000fcb	movhpd	_ULM_dgemm_nn(%r10,%r8), %xmm3
0000000000000fd1	mulpd	%xmm0, %xmm12
0000000000000fd6	mulpd	%xmm1, %xmm3
0000000000000fda	addpd	%xmm12, %xmm3
0000000000000fdf	movlpd	%xmm3, _ULM_dgemm_nn(%rcx)
0000000000000fe3	movhpd	%xmm3, _ULM_dgemm_nn(%r10,%r8)
0000000000000fe9	movlpd	_ULM_dgemm_nn(%rdx), %xmm4
0000000000000fed	movhpd	_ULM_dgemm_nn(%r11,%r8), %xmm4
0000000000000ff3	mulpd	%xmm0, %xmm13
0000000000000ff8	mulpd	%xmm1, %xmm4
0000000000000ffc	addpd	%xmm13, %xmm4
0000000000001001	movlpd	%xmm4, _ULM_dgemm_nn(%rdx)
0000000000001005	movhpd	%xmm4, _ULM_dgemm_nn(%r11,%r8)
000000000000100b	movlpd	_ULM_dgemm_nn(%r10), %xmm3
0000000000001010	movhpd	_ULM_dgemm_nn(%rcx,%r8), %xmm3
0000000000001016	mulpd	%xmm0, %xmm14
000000000000101b	mulpd	%xmm1, %xmm3
000000000000101f	addpd	%xmm14, %xmm3
0000000000001024	movlpd	%xmm3, _ULM_dgemm_nn(%r10)
0000000000001029	movhpd	%xmm3, _ULM_dgemm_nn(%rcx,%r8)
000000000000102f	movlpd	_ULM_dgemm_nn(%r11), %xmm4
0000000000001034	movhpd	_ULM_dgemm_nn(%rdx,%r8), %xmm4
000000000000103a	mulpd	%xmm0, %xmm15
000000000000103f	mulpd	%xmm1, %xmm4
0000000000001043	addpd	%xmm15, %xmm4
0000000000001048	movlpd	%xmm4, _ULM_dgemm_nn(%r11)
000000000000104d	movhpd	%xmm4, _ULM_dgemm_nn(%rdx,%r8)
0000000000001053	movl	0x2e8(%rsp), %r9d
000000000000105b	jmpq	.DPOSTACCUMULATE1+772
0000000000001060	testl	%r13d, %r13d
0000000000001063	setg	0x238(%rsp)
000000000000106b	movsd	0x1d0(%rsp), %xmm0
0000000000001074	movsd	%xmm0, 0x280(%rsp)
000000000000107d	movq	%rax, 0x278(%rsp)
0000000000001085	movq	0x220(%rsp), %rax
000000000000108d	movq	%rax, 0x270(%rsp)
0000000000001095	movq	$_ULM_dgemm_nn, 0x268(%rsp)
00000000000010a1	leaq	__C(%rip), %rax
00000000000010a8	movq	%rax, 0x260(%rsp)
00000000000010b0	movq	$0x1, 0x258(%rsp)
00000000000010bc	movq	$0x4, 0x250(%rsp)
00000000000010c8	movq	0x1d8(%rsp), %rax
00000000000010d0	movq	%rax, 0x248(%rsp)
00000000000010d8	movq	0x1e0(%rsp), %rax
00000000000010e0	movq	%rax, 0x240(%rsp)
00000000000010e8	movq	%rsi, %rbp
00000000000010eb	movq	0x248(%rsp), %rsi
00000000000010f3	movq	0x240(%rsp), %rdi
00000000000010fb	movq	0x278(%rsp), %rax
0000000000001103	movq	0x270(%rsp), %rbx
000000000000110b	movaps	_ULM_dgemm_nn(%rax), %xmm0
000000000000110e	movaps	0x10(%rax), %xmm1
0000000000001112	movaps	_ULM_dgemm_nn(%rbx), %xmm2
0000000000001115	xorpd	%xmm8, %xmm8
000000000000111a	xorpd	%xmm9, %xmm9
000000000000111f	xorpd	%xmm10, %xmm10
0000000000001124	xorpd	%xmm11, %xmm11
0000000000001129	xorpd	%xmm12, %xmm12
000000000000112e	xorpd	%xmm13, %xmm13
0000000000001133	xorpd	%xmm14, %xmm14
0000000000001138	xorpd	%xmm15, %xmm15
000000000000113d	xorpd	%xmm3, %xmm3
0000000000001141	xorpd	%xmm4, %xmm4
0000000000001145	xorpd	%xmm5, %xmm5
0000000000001149	xorpd	%xmm6, %xmm6
000000000000114d	xorpd	%xmm7, %xmm7
0000000000001151	testq	%rdi, %rdi
0000000000001154	testq	%rsi, %rsi
0000000000001157	je	.DCONSIDERLEFT1
.DLOOP1:
000000000000115d	addpd	%xmm3, %xmm12
0000000000001162	movaps	0x10(%rbx), %xmm3
0000000000001166	addpd	%xmm6, %xmm13
000000000000116b	movaps	%xmm2, %xmm6
000000000000116e	pshufd	$0x4e, %xmm2, %xmm4
0000000000001173	mulpd	%xmm0, %xmm2
0000000000001177	mulpd	%xmm1, %xmm6
000000000000117b	addpd	%xmm5, %xmm14
0000000000001180	addpd	%xmm7, %xmm15
0000000000001185	movaps	%xmm4, %xmm7
0000000000001188	mulpd	%xmm0, %xmm4
000000000000118c	mulpd	%xmm1, %xmm7
0000000000001190	addpd	%xmm2, %xmm8
0000000000001195	movaps	0x20(%rbx), %xmm2
0000000000001199	addpd	%xmm6, %xmm9
000000000000119e	movaps	%xmm3, %xmm6
00000000000011a1	pshufd	$0x4e, %xmm3, %xmm5
00000000000011a6	mulpd	%xmm0, %xmm3
00000000000011aa	mulpd	%xmm1, %xmm6
00000000000011ae	addpd	%xmm4, %xmm10
00000000000011b3	addpd	%xmm7, %xmm11
00000000000011b8	movaps	%xmm5, %xmm7
00000000000011bb	mulpd	%xmm0, %xmm5
00000000000011bf	movaps	0x20(%rax), %xmm0
00000000000011c3	mulpd	%xmm1, %xmm7
00000000000011c7	movaps	0x30(%rax), %xmm1
00000000000011cb	addpd	%xmm3, %xmm12
00000000000011d0	movaps	0x30(%rbx), %xmm3
00000000000011d4	addpd	%xmm6, %xmm13
00000000000011d9	movaps	%xmm2, %xmm6
00000000000011dc	pshufd	$0x4e, %xmm2, %xmm4
00000000000011e1	mulpd	%xmm0, %xmm2
00000000000011e5	mulpd	%xmm1, %xmm6
00000000000011e9	addpd	%xmm5, %xmm14
00000000000011ee	addpd	%xmm7, %xmm15
00000000000011f3	movaps	%xmm4, %xmm7
00000000000011f6	mulpd	%xmm0, %xmm4
00000000000011fa	mulpd	%xmm1, %xmm7
00000000000011fe	addpd	%xmm2, %xmm8
0000000000001203	movaps	0x40(%rbx), %xmm2
0000000000001207	addpd	%xmm6, %xmm9
000000000000120c	movaps	%xmm3, %xmm6
000000000000120f	pshufd	$0x4e, %xmm3, %xmm5
0000000000001214	mulpd	%xmm0, %xmm3
0000000000001218	mulpd	%xmm1, %xmm6
000000000000121c	addpd	%xmm4, %xmm10
0000000000001221	addpd	%xmm7, %xmm11
0000000000001226	movaps	%xmm5, %xmm7
0000000000001229	mulpd	%xmm0, %xmm5
000000000000122d	movaps	0x40(%rax), %xmm0
0000000000001231	mulpd	%xmm1, %xmm7
0000000000001235	movaps	0x50(%rax), %xmm1
0000000000001239	addpd	%xmm3, %xmm12
000000000000123e	movaps	0x50(%rbx), %xmm3
0000000000001242	addpd	%xmm6, %xmm13
0000000000001247	movaps	%xmm2, %xmm6
000000000000124a	pshufd	$0x4e, %xmm2, %xmm4
000000000000124f	mulpd	%xmm0, %xmm2
0000000000001253	mulpd	%xmm1, %xmm6
0000000000001257	addpd	%xmm5, %xmm14
000000000000125c	addpd	%xmm7, %xmm15
0000000000001261	movaps	%xmm4, %xmm7
0000000000001264	mulpd	%xmm0, %xmm4
0000000000001268	mulpd	%xmm1, %xmm7
000000000000126c	addpd	%xmm2, %xmm8
0000000000001271	movaps	0x60(%rbx), %xmm2
0000000000001275	addpd	%xmm6, %xmm9
000000000000127a	movaps	%xmm3, %xmm6
000000000000127d	pshufd	$0x4e, %xmm3, %xmm5
0000000000001282	mulpd	%xmm0, %xmm3
0000000000001286	mulpd	%xmm1, %xmm6
000000000000128a	addpd	%xmm4, %xmm10
000000000000128f	addpd	%xmm7, %xmm11
0000000000001294	movaps	%xmm5, %xmm7
0000000000001297	mulpd	%xmm0, %xmm5
000000000000129b	movaps	0x60(%rax), %xmm0
000000000000129f	mulpd	%xmm1, %xmm7
00000000000012a3	movaps	0x70(%rax), %xmm1
00000000000012a7	addpd	%xmm3, %xmm12
00000000000012ac	movaps	0x70(%rbx), %xmm3
00000000000012b0	addpd	%xmm6, %xmm13
00000000000012b5	movaps	%xmm2, %xmm6
00000000000012b8	pshufd	$0x4e, %xmm2, %xmm4
00000000000012bd	mulpd	%xmm0, %xmm2
00000000000012c1	mulpd	%xmm1, %xmm6
00000000000012c5	addq	$0x80, %rax
00000000000012cb	addpd	%xmm5, %xmm14
00000000000012d0	addpd	%xmm7, %xmm15
00000000000012d5	movaps	%xmm4, %xmm7
00000000000012d8	mulpd	%xmm0, %xmm4
00000000000012dc	mulpd	%xmm1, %xmm7
00000000000012e0	addpd	%xmm2, %xmm8
00000000000012e5	movaps	0x80(%rbx), %xmm2
00000000000012ec	addpd	%xmm6, %xmm9
00000000000012f1	movaps	%xmm3, %xmm6
00000000000012f4	pshufd	$0x4e, %xmm3, %xmm5
00000000000012f9	mulpd	%xmm0, %xmm3
00000000000012fd	mulpd	%xmm1, %xmm6
0000000000001301	addq	$0x80, %rbx
0000000000001308	addpd	%xmm4, %xmm10
000000000000130d	addpd	%xmm7, %xmm11
0000000000001312	movaps	%xmm5, %xmm7
0000000000001315	mulpd	%xmm0, %xmm5
0000000000001319	movaps	_ULM_dgemm_nn(%rax), %xmm0
000000000000131c	mulpd	%xmm1, %xmm7
0000000000001320	movaps	0x10(%rax), %xmm1
0000000000001324	decq	%rsi
0000000000001327	jne	.DLOOP1
.DCONSIDERLEFT1:
000000000000132d	testq	%rdi, %rdi
0000000000001330	je	.DPOSTACCUMULATE1
.DLOOPLEFT1:
0000000000001336	addpd	%xmm3, %xmm12
000000000000133b	movapd	0x10(%rbx), %xmm3
0000000000001340	addpd	%xmm6, %xmm13
0000000000001345	movapd	%xmm2, %xmm6
0000000000001349	pshufd	$0x4e, %xmm2, %xmm4
000000000000134e	mulpd	%xmm0, %xmm2
0000000000001352	mulpd	%xmm1, %xmm6
0000000000001356	addpd	%xmm5, %xmm14
000000000000135b	addpd	%xmm7, %xmm15
0000000000001360	movapd	%xmm4, %xmm7
0000000000001364	mulpd	%xmm0, %xmm4
0000000000001368	mulpd	%xmm1, %xmm7
000000000000136c	addpd	%xmm2, %xmm8
0000000000001371	movapd	0x20(%rbx), %xmm2
0000000000001376	addpd	%xmm6, %xmm9
000000000000137b	movapd	%xmm3, %xmm6
000000000000137f	pshufd	$0x4e, %xmm3, %xmm5
0000000000001384	mulpd	%xmm0, %xmm3
0000000000001388	mulpd	%xmm1, %xmm6
000000000000138c	addpd	%xmm4, %xmm10
0000000000001391	addpd	%xmm7, %xmm11
0000000000001396	movapd	%xmm5, %xmm7
000000000000139a	mulpd	%xmm0, %xmm5
000000000000139e	movapd	0x20(%rax), %xmm0
00000000000013a3	mulpd	%xmm1, %xmm7
00000000000013a7	movapd	0x30(%rax), %xmm1
00000000000013ac	addq	$0x20, %rax
00000000000013b0	addq	$0x20, %rbx
00000000000013b4	decq	%rdi
00000000000013b7	jne	.DLOOPLEFT1
.DPOSTACCUMULATE1:
00000000000013bd	addpd	%xmm3, %xmm12
00000000000013c2	addpd	%xmm6, %xmm13
00000000000013c7	addpd	%xmm5, %xmm14
00000000000013cc	addpd	%xmm7, %xmm15
00000000000013d1	movsd	0x280(%rsp), %xmm0
00000000000013da	movsd	0x268(%rsp), %xmm1
00000000000013e3	movq	0x260(%rsp), %rcx
00000000000013eb	movq	0x258(%rsp), %r8
00000000000013f3	leaq	_ULM_dgemm_nn(,%r8,8), %r8
00000000000013fb	movq	0x250(%rsp), %r9
0000000000001403	leaq	_ULM_dgemm_nn(,%r9,8), %r9
000000000000140b	leaq	_ULM_dgemm_nn(%rcx,%r9), %r10
000000000000140f	leaq	_ULM_dgemm_nn(%rcx,%r8,2), %rdx
0000000000001413	leaq	_ULM_dgemm_nn(%rdx,%r9), %r11
0000000000001417	unpcklpd	%xmm0, %xmm0
000000000000141b	unpcklpd	%xmm1, %xmm1
000000000000141f	movlpd	_ULM_dgemm_nn(%rcx), %xmm3
0000000000001423	movhpd	_ULM_dgemm_nn(%r10,%r8), %xmm3
0000000000001429	mulpd	%xmm0, %xmm8
000000000000142e	mulpd	%xmm1, %xmm3
0000000000001432	addpd	%xmm8, %xmm3
0000000000001437	movlpd	%xmm3, _ULM_dgemm_nn(%rcx)
000000000000143b	movhpd	%xmm3, _ULM_dgemm_nn(%r10,%r8)
0000000000001441	movlpd	_ULM_dgemm_nn(%rdx), %xmm4
0000000000001445	movhpd	_ULM_dgemm_nn(%r11,%r8), %xmm4
000000000000144b	mulpd	%xmm0, %xmm9
0000000000001450	mulpd	%xmm1, %xmm4
0000000000001454	addpd	%xmm9, %xmm4
0000000000001459	movlpd	%xmm4, _ULM_dgemm_nn(%rdx)
000000000000145d	movhpd	%xmm4, _ULM_dgemm_nn(%r11,%r8)
0000000000001463	movlpd	_ULM_dgemm_nn(%r10), %xmm3
0000000000001468	movhpd	_ULM_dgemm_nn(%rcx,%r8), %xmm3
000000000000146e	mulpd	%xmm0, %xmm10
0000000000001473	mulpd	%xmm1, %xmm3
0000000000001477	addpd	%xmm10, %xmm3
000000000000147c	movlpd	%xmm3, _ULM_dgemm_nn(%r10)
0000000000001481	movhpd	%xmm3, _ULM_dgemm_nn(%rcx,%r8)
0000000000001487	movlpd	_ULM_dgemm_nn(%r11), %xmm4
000000000000148c	movhpd	_ULM_dgemm_nn(%rdx,%r8), %xmm4
0000000000001492	mulpd	%xmm0, %xmm11
0000000000001497	mulpd	%xmm1, %xmm4
000000000000149b	addpd	%xmm11, %xmm4
00000000000014a0	movlpd	%xmm4, _ULM_dgemm_nn(%r11)
00000000000014a5	movhpd	%xmm4, _ULM_dgemm_nn(%rdx,%r8)
00000000000014ab	leaq	_ULM_dgemm_nn(%rcx,%r9,2), %rcx
00000000000014af	leaq	_ULM_dgemm_nn(%r10,%r9,2), %r10
00000000000014b3	leaq	_ULM_dgemm_nn(%rdx,%r9,2), %rdx
00000000000014b7	leaq	_ULM_dgemm_nn(%r11,%r9,2), %r11
00000000000014bb	movlpd	_ULM_dgemm_nn(%rcx), %xmm3
00000000000014bf	movhpd	_ULM_dgemm_nn(%r10,%r8), %xmm3
00000000000014c5	mulpd	%xmm0, %xmm12
00000000000014ca	mulpd	%xmm1, %xmm3
00000000000014ce	addpd	%xmm12, %xmm3
00000000000014d3	movlpd	%xmm3, _ULM_dgemm_nn(%rcx)
00000000000014d7	movhpd	%xmm3, _ULM_dgemm_nn(%r10,%r8)
00000000000014dd	movlpd	_ULM_dgemm_nn(%rdx), %xmm4
00000000000014e1	movhpd	_ULM_dgemm_nn(%r11,%r8), %xmm4
00000000000014e7	mulpd	%xmm0, %xmm13
00000000000014ec	mulpd	%xmm1, %xmm4
00000000000014f0	addpd	%xmm13, %xmm4
00000000000014f5	movlpd	%xmm4, _ULM_dgemm_nn(%rdx)
00000000000014f9	movhpd	%xmm4, _ULM_dgemm_nn(%r11,%r8)
00000000000014ff	movlpd	_ULM_dgemm_nn(%r10), %xmm3
0000000000001504	movhpd	_ULM_dgemm_nn(%rcx,%r8), %xmm3
000000000000150a	mulpd	%xmm0, %xmm14
000000000000150f	mulpd	%xmm1, %xmm3
0000000000001513	addpd	%xmm14, %xmm3
0000000000001518	movlpd	%xmm3, _ULM_dgemm_nn(%r10)
000000000000151d	movhpd	%xmm3, _ULM_dgemm_nn(%rcx,%r8)
0000000000001523	movlpd	_ULM_dgemm_nn(%r11), %xmm4
0000000000001528	movhpd	_ULM_dgemm_nn(%rdx,%r8), %xmm4
000000000000152e	mulpd	%xmm0, %xmm15
0000000000001533	mulpd	%xmm1, %xmm4
0000000000001537	addpd	%xmm15, %xmm4
000000000000153c	movlpd	%xmm4, _ULM_dgemm_nn(%r11)
0000000000001541	movhpd	%xmm4, _ULM_dgemm_nn(%rdx,%r8)
0000000000001547	movq	%rbp, %rdx
000000000000154a	movl	%edx, %eax
000000000000154c	imull	%r15d, %eax
0000000000001550	addl	0x21c(%rsp), %eax
0000000000001557	shll	$0x2, %eax
000000000000155a	testl	%r12d, %r12d
000000000000155d	setg	%cl
0000000000001560	andb	0x238(%rsp), %cl
0000000000001567	cltq
0000000000001569	movsd	0x1e8(%rsp), %xmm1
0000000000001572	ucomisd	0x26e(%rip), %xmm1
000000000000157a	jne	0x1580
000000000000157c	jp	0x1580
000000000000157e	jmp	0x15f0
0000000000001580	movq	%rdx, 0x238(%rsp)
0000000000001588	testb	%cl, %cl
000000000000158a	movl	0x2e8(%rsp), %r9d
0000000000001592	je	0x1650
0000000000001598	movq	0x1f0(%rsp), %rcx
00000000000015a0	leaq	_ULM_dgemm_nn(%rax,%rcx), %rcx
00000000000015a4	xorl	%edx, %edx
00000000000015a6	xorl	%esi, %esi
00000000000015a8	nopl	_ULM_dgemm_nn(%rax,%rax)
00000000000015b0	movl	%edx, %edi
00000000000015b2	movl	%r12d, %ebp
00000000000015b5	nopw	%cs:_ULM_dgemm_nn(%rax,%rax)
00000000000015c0	movslq	%edi, %rdi
00000000000015c3	leaq	_ULM_dgemm_nn(%rcx,%rdi), %rbx
00000000000015c7	movsd	_ULM_dgemm_nn(%r14,%rbx,8), %xmm0
00000000000015cd	mulsd	%xmm1, %xmm0
00000000000015d1	movsd	%xmm0, _ULM_dgemm_nn(%r14,%rbx,8)
00000000000015d7	addl	%r15d, %edi
00000000000015da	decl	%ebp
00000000000015dc	jne	0x15c0
00000000000015de	incl	%esi
00000000000015e0	addl	%r9d, %edx
00000000000015e3	cmpl	%r13d, %esi
00000000000015e6	jne	0x15b0
00000000000015e8	jmp	0x1650
00000000000015ea	nopw	_ULM_dgemm_nn(%rax,%rax)
00000000000015f0	movq	%rdx, 0x238(%rsp)
00000000000015f8	testb	%cl, %cl
00000000000015fa	movl	0x2e8(%rsp), %r9d
0000000000001602	je	0x1650
0000000000001604	movq	0x1f0(%rsp), %rcx
000000000000160c	leaq	_ULM_dgemm_nn(%rax,%rcx), %rcx
0000000000001610	xorl	%edx, %edx
0000000000001612	xorl	%esi, %esi
0000000000001614	nopw	%cs:_ULM_dgemm_nn(%rax,%rax)
0000000000001620	movl	%edx, %edi
0000000000001622	movl	%r12d, %ebp
0000000000001625	nopw	%cs:_ULM_dgemm_nn(%rax,%rax)
0000000000001630	movslq	%edi, %rdi
0000000000001633	leaq	_ULM_dgemm_nn(%rcx,%rdi), %rbx
0000000000001637	movq	$_ULM_dgemm_nn, _ULM_dgemm_nn(%r14,%rbx,8)
000000000000163f	addl	%r15d, %edi
0000000000001642	decl	%ebp
0000000000001644	jne	0x1630
0000000000001646	incl	%esi
0000000000001648	addl	%r9d, %edx
000000000000164b	cmpl	%r13d, %esi
000000000000164e	jne	0x1620
0000000000001650	testl	%r13d, %r13d
0000000000001653	jle	0x16c1
0000000000001655	addq	0x1f0(%rsp), %rax
000000000000165d	xorl	%r8d, %r8d
0000000000001660	xorl	%edx, %edx
0000000000001662	xorl	%esi, %esi
0000000000001664	nopw	%cs:_ULM_dgemm_nn(%rax,%rax)
0000000000001670	testl	%r12d, %r12d
0000000000001673	jle	0x16b2
0000000000001675	movslq	%r8d, %rdi
0000000000001678	leaq	__C(%rip), %rcx
000000000000167f	leaq	_ULM_dgemm_nn(%rcx,%rdi,8), %rdi
0000000000001683	movl	%edx, %ebx
0000000000001685	movl	%r12d, %ebp
0000000000001688	nopl	_ULM_dgemm_nn(%rax,%rax)
0000000000001690	movsd	_ULM_dgemm_nn(%rdi), %xmm0
0000000000001694	movslq	%ebx, %rbx
0000000000001697	leaq	_ULM_dgemm_nn(%rax,%rbx), %rcx
000000000000169b	addsd	_ULM_dgemm_nn(%r14,%rcx,8), %xmm0
00000000000016a1	movsd	%xmm0, _ULM_dgemm_nn(%r14,%rcx,8)
00000000000016a7	addl	%r15d, %ebx
00000000000016aa	addq	$0x8, %rdi
00000000000016ae	decl	%ebp
00000000000016b0	jne	0x1690
00000000000016b2	incq	%rsi
00000000000016b5	addl	%r9d, %edx
00000000000016b8	addl	$0x4, %r8d
00000000000016bc	cmpl	%r13d, %esi
00000000000016bf	jne	0x1670
00000000000016c1	movq	0x238(%rsp), %rsi
00000000000016c9	incq	%rsi
00000000000016cc	movq	0x200(%rsp), %rax
00000000000016d4	cmpl	%eax, %esi
00000000000016d6	movsd	0x208(%rsp), %xmm1
00000000000016df	jl	_ULM_dgemm_nn+2816
00000000000016e5	movq	0x1b8(%rsp), %rsi
00000000000016ed	incq	%rsi
00000000000016f0	movq	0x190(%rsp), %rax
00000000000016f8	cmpl	%eax, %esi
00000000000016fa	movq	0x1a8(%rsp), %rdx
0000000000001702	jl	_ULM_dgemm_nn+2688
0000000000001708	movq	0x158(%rsp), %rdi
0000000000001710	incq	%rdi
0000000000001713	movl	0x164(%rsp), %esi
000000000000171a	addl	0x114(%rsp), %esi
0000000000001721	movq	0x118(%rsp), %rax
0000000000001729	cmpl	%eax, %edi
000000000000172b	movq	0x140(%rsp), %rbp
0000000000001733	movq	0x138(%rsp), %rbx
000000000000173b	jl	_ULM_dgemm_nn+1888
0000000000001741	movq	0xb0(%rsp), %rdi
0000000000001749	incq	%rdi
000000000000174c	movl	0xac(%rsp), %r13d
0000000000001754	addl	0x7c(%rsp), %r13d
0000000000001759	movq	0xb8(%rsp), %rax
0000000000001761	addl	0x64(%rsp), %eax
0000000000001765	movq	%rax, 0xb8(%rsp)
000000000000176d	movq	0x80(%rsp), %rax
0000000000001775	cmpl	%eax, %edi
0000000000001777	movl	0x2e8(%rsp), %ebx
000000000000177e	movl	%ebx, %r12d
0000000000001781	movq	0x88(%rsp), %rsi
0000000000001789	movq	0x70(%rsp), %rdx
000000000000178e	movq	0x68(%rsp), %rbx
0000000000001793	movq	0x58(%rsp), %r11
0000000000001798	movq	0x1a0(%rsp), %r8
00000000000017a0	jl	_ULM_dgemm_nn+1120
00000000000017a6	movq	%rdx, %rdi
00000000000017a9	movl	0x34(%rsp), %ecx
00000000000017ad	incl	%ecx
00000000000017af	movq	0x38(%rsp), %rax
00000000000017b4	addl	0x14(%rsp), %eax
00000000000017b8	movq	%rax, 0x38(%rsp)
00000000000017bd	movq	0x18(%rsp), %rax
00000000000017c2	cmpl	%eax, %ecx
00000000000017c4	jl	_ULM_dgemm_nn+832
00000000000017ca	addq	$0x288, %rsp
00000000000017d1	popq	%rbx
00000000000017d2	popq	%r12
00000000000017d4	popq	%r13
00000000000017d6	popq	%r14
00000000000017d8	popq	%r15
00000000000017da	popq	%rbp
00000000000017db	ret

So the line right before the .DLOOP0 label is

$shell> otool -dtV dgemm_nn.o | while read line; do if test "$line" = ".DLOOP0:"; then echo $last; fi; last=$line; done                                                          
0000000000000c63 je .DCONSIDERLEFT0

And the line with jne .DLOOP0 is

$shell> otool -dtV dgemm_nn.o | grep ".DLOOP0$"                          
0000000000000e33	jne	.DLOOP0

So the code in between takes

$shell> BEGIN=`otool -dtV dgemm_nn.o | while read line; do if test "$line" = ".DLOOP0:"; then echo $last; fi; last=$line; done`                                                      
$shell> BEGIN=($BEGIN)                                                   
$shell> BEGIN=${BEGIN[0]}                                                
$shell> BEGIN=`echo $BEGIN | tr "a-f" "A-F"`                             
$shell> END=`otool -dtV dgemm_nn.o | grep ".DLOOP0$" | tr "a-f" "A-F"`   
$shell> END=($END)                                                       
$shell> END=${END[0]}                                                    
$shell> END=`echo $END | tr "a-f" "A-F"`                                 
$shell> SIZE=`dc -e "16 i $END $BEGIN - f"`                              
$shell> echo "Code size of loop body (in bytes): $SIZE"                  
Code size of loop body (in bytes): 464

bytes.