Content |
Complete Assembler Micro Kernel
The BLIS micro kernel gains another performance boost through prefetching data. In my experiments I only was able to add the feature after converting the whole mico kernel into assembler.
Status Quo: So far only the loop for computing AB was implemented in assembler. The rest of the micro kernel was left untouched and is plain C code. The bridge between the assembler and C code is the double array AB of length 16. The assembler code copies at the end its results into this array. The remaining C code uses AB to compute C <- beta*C + alpha*A*B.
Having all the micro kernel in assembler removes the need for this internal buffer AB.
Select the demo-sse-all-asm Branch
Again, we do a make clean before switching a branch:
$shell> cd ulmBLAS $shell> make clean for dir in src refblas test bench; do make -C $dir clean; done rm -f auxiliary/xerbla.o level1/dasum.o level1/daxpy.o level1/dcopy.o level1/ddot.o level1/dnrm2.o level1/drot.o level1/drotg.o level1/drotm.o level1/drotmg.o level1/dscal.o level1/dswap.o level1/idamax.o level3/dgemm.o level3/dgemm_nn.o level3/dsymm.o level3/stubs.o rm -f auxiliary/atl_xerbla.o level1/atl_dasum.o level1/atl_daxpy.o level1/atl_dcopy.o level1/atl_ddot.o level1/atl_dnrm2.o level1/atl_drot.o level1/atl_drotg.o level1/atl_drotm.o level1/atl_drotmg.o level1/atl_dscal.o level1/atl_dswap.o level1/atl_idamax.o level3/atl_dgemm.o level3/atl_dgemm_nn.o level3/atl_dsymm.o level3/atl_stubs.o rm -f ../libulmblas.a rm -f ../libatlulmblas.a rm -f caxpy.o ccopy.o cdotc.o cdotu.o cgbmv.o cgemm.o cgemv.o cgerc.o cgeru.o chbmv.o chemm.o chemv.o cher.o cher2.o cher2k.o cherk.o chpmv.o chpr.o chpr2.o crotg.o cscal.o csrot.o csscal.o cswap.o csymm.o csyr2k.o csyrk.o ctbmv.o ctbsv.o ctpmv.o ctpsv.o ctrmm.o ctrmv.o ctrsm.o ctrsv.o dasum.o daxpy.o dcabs1.o dcopy.o ddot.o dgbmv.o dgemm.o dgemv.o dger.o dnrm2.o drot.o drotg.o drotm.o drotmg.o dsbmv.o dscal.o dsdot.o dspmv.o dspr.o dspr2.o dswap.o dsymm.o dsymv.o dsyr.o dsyr2.o dsyr2k.o dsyrk.o dtbmv.o dtbsv.o dtpmv.o dtpsv.o dtrmm.o dtrmv.o dtrsm.o dtrsv.o dzasum.o dznrm2.o icamax.o idamax.o isamax.o izamax.o lsame.o sasum.o saxpy.o scabs1.o scasum.o scnrm2.o scopy.o sdot.o sdsdot.o sgbmv.o sgemm.o sgemv.o sger.o snrm2.o srot.o srotg.o srotm.o srotmg.o ssbmv.o sscal.o sspmv.o sspr.o sspr2.o sswap.o ssymm.o ssymv.o ssyr.o ssyr2.o ssyr2k.o ssyrk.o stbmv.o stbsv.o stpmv.o stpsv.o strmm.o strmv.o strsm.o strsv.o xerbla.o xerbla_array.o zaxpy.o zcopy.o zdotc.o zdotu.o zdrot.o zdscal.o zgbmv.o zgemm.o zgemv.o zgerc.o zgeru.o zhbmv.o zhemm.o zhemv.o zher.o zher2.o zher2k.o zherk.o zhpmv.o zhpr.o zhpr2.o zrotg.o zscal.o zswap.o zsymm.o zsyr2k.o zsyrk.o ztbmv.o ztbsv.o ztpmv.o ztpsv.o ztrmm.o ztrmv.o ztrsm.o ztrsv.o rm -f ../librefblas.a rm -f dblat1_ref dblat3_ref dblat1_ulm dblat3_ulm *.SUMM rm -f xdl1blastst libtstatlas.a l1blastst.o ATL_cputime.o ATL_epsilon.o ATL_f77amax.o ATL_f77asum.o ATL_f77axpy.o ATL_f77copy.o ATL_f77dot.o ATL_f77gemm.o ATL_f77nrm2.o ATL_f77rot.o ATL_f77rotg.o ATL_f77rotm.o ATL_f77rotmg.o ATL_f77scal.o ATL_f77swap.o ATL_f77symm.o ATL_f77syr2k.o ATL_f77syrk.o ATL_f77trmm.o ATL_f77trsm.o ATL_flushcache.o ATL_gediffnrm1.o ATL_gegen.o ATL_genrm1.o ATL_infnrm.o ATL_rand.o ATL_set.o ATL_synrm.o ATL_trnrm1.o ATL_vdiff.o ATL_zero.o ATL_df77wrap.o
Then we are checking out the demo-sse-all-asm branch:
$shell> git branch -a demo-naive-sse-with-intrinsics demo-naive-sse-with-intrinsics-unrolled demo-pure-c demo-sse-asm demo-sse-asm-unrolled demo-sse-asm-unrolled-v2 * demo-sse-asm-unrolled-v3 demo-sse-intrinsics demo-sse-intrinsics-v2 demo-sse-intrinsics-v3 master remotes/origin/HEAD -> origin/master remotes/origin/bench-atlas remotes/origin/bench-eigen remotes/origin/bench-mkl remotes/origin/blis-avx-microkernel remotes/origin/demo-naive-avx-with-intrinsics remotes/origin/demo-naive-sse-with-intrinsics remotes/origin/demo-naive-sse-with-intrinsics-unrolled remotes/origin/demo-pure-c remotes/origin/demo-sse-all-asm remotes/origin/demo-sse-asm remotes/origin/demo-sse-asm-for-AB-loop remotes/origin/demo-sse-asm-unrolled remotes/origin/demo-sse-asm-unrolled-v2 remotes/origin/demo-sse-asm-unrolled-v3 remotes/origin/demo-sse-asm-unrolled-with-prefetch remotes/origin/demo-sse-intrinsics remotes/origin/demo-sse-intrinsics-for-AB-loop remotes/origin/demo-sse-intrinsics-v2 remotes/origin/demo-sse-intrinsics-v3 remotes/origin/demo-with-sse-intrinsics remotes/origin/master remotes/origin/trsm-assignment remotes/origin/trsm-pure-c $shell> git checkout -B demo-sse-all-asm remotes/origin/demo-sse-all-asm Switched to a new branch 'demo-sse-all-asm' Branch demo-sse-all-asm set up to track remote branch demo-sse-all-asm from origin.
Then we compile the project
$shell> make
make -C src
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o auxiliary/xerbla.o auxiliary/xerbla.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/dasum.o level1/dasum.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/daxpy.o level1/daxpy.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/dcopy.o level1/dcopy.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/ddot.o level1/ddot.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/dnrm2.o level1/dnrm2.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/drot.o level1/drot.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/drotg.o level1/drotg.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/drotm.o level1/drotm.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/drotmg.o level1/drotmg.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/dscal.o level1/dscal.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/dswap.o level1/dswap.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/idamax.o level1/idamax.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level3/dgemm.o level3/dgemm.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level3/dgemm_nn.o level3/dgemm_nn.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level3/dsymm.o level3/dsymm.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level3/stubs.o level3/stubs.c
ar cru ../libulmblas.a auxiliary/xerbla.o level1/dasum.o level1/daxpy.o level1/dcopy.o level1/ddot.o level1/dnrm2.o level1/drot.o level1/drotg.o level1/drotm.o level1/drotmg.o level1/dscal.o level1/dswap.o level1/idamax.o level3/dgemm.o level3/dgemm_nn.o level3/dsymm.o level3/stubs.o
ranlib ../libulmblas.a
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o auxiliary/atl_xerbla.o auxiliary/xerbla.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_dasum.o level1/dasum.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_daxpy.o level1/daxpy.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_dcopy.o level1/dcopy.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_ddot.o level1/ddot.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_dnrm2.o level1/dnrm2.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_drot.o level1/drot.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_drotg.o level1/drotg.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_drotm.o level1/drotm.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_drotmg.o level1/drotmg.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_dscal.o level1/dscal.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_dswap.o level1/dswap.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_idamax.o level1/idamax.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level3/atl_dgemm.o level3/dgemm.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level3/atl_dgemm_nn.o level3/dgemm_nn.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level3/atl_dsymm.o level3/dsymm.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level3/atl_stubs.o level3/stubs.c
ar cru ../libatlulmblas.a auxiliary/atl_xerbla.o level1/atl_dasum.o level1/atl_daxpy.o level1/atl_dcopy.o level1/atl_ddot.o level1/atl_dnrm2.o level1/atl_drot.o level1/atl_drotg.o level1/atl_drotm.o level1/atl_drotmg.o level1/atl_dscal.o level1/atl_dswap.o level1/atl_idamax.o level3/atl_dgemm.o level3/atl_dgemm_nn.o level3/atl_dsymm.o level3/atl_stubs.o
ranlib ../libatlulmblas.a
make -C refblas
gfortran -fimplicit-none -O3 -c -o caxpy.o caxpy.f
gfortran -fimplicit-none -O3 -c -o ccopy.o ccopy.f
gfortran -fimplicit-none -O3 -c -o cdotc.o cdotc.f
gfortran -fimplicit-none -O3 -c -o cdotu.o cdotu.f
gfortran -fimplicit-none -O3 -c -o cgbmv.o cgbmv.f
gfortran -fimplicit-none -O3 -c -o cgemm.o cgemm.f
gfortran -fimplicit-none -O3 -c -o cgemv.o cgemv.f
gfortran -fimplicit-none -O3 -c -o cgerc.o cgerc.f
gfortran -fimplicit-none -O3 -c -o cgeru.o cgeru.f
gfortran -fimplicit-none -O3 -c -o chbmv.o chbmv.f
gfortran -fimplicit-none -O3 -c -o chemm.o chemm.f
gfortran -fimplicit-none -O3 -c -o chemv.o chemv.f
gfortran -fimplicit-none -O3 -c -o cher.o cher.f
gfortran -fimplicit-none -O3 -c -o cher2.o cher2.f
gfortran -fimplicit-none -O3 -c -o cher2k.o cher2k.f
gfortran -fimplicit-none -O3 -c -o cherk.o cherk.f
gfortran -fimplicit-none -O3 -c -o chpmv.o chpmv.f
gfortran -fimplicit-none -O3 -c -o chpr.o chpr.f
gfortran -fimplicit-none -O3 -c -o chpr2.o chpr2.f
gfortran -fimplicit-none -O3 -c -o crotg.o crotg.f
gfortran -fimplicit-none -O3 -c -o cscal.o cscal.f
gfortran -fimplicit-none -O3 -c -o csrot.o csrot.f
gfortran -fimplicit-none -O3 -c -o csscal.o csscal.f
gfortran -fimplicit-none -O3 -c -o cswap.o cswap.f
gfortran -fimplicit-none -O3 -c -o csymm.o csymm.f
gfortran -fimplicit-none -O3 -c -o csyr2k.o csyr2k.f
gfortran -fimplicit-none -O3 -c -o csyrk.o csyrk.f
gfortran -fimplicit-none -O3 -c -o ctbmv.o ctbmv.f
gfortran -fimplicit-none -O3 -c -o ctbsv.o ctbsv.f
gfortran -fimplicit-none -O3 -c -o ctpmv.o ctpmv.f
gfortran -fimplicit-none -O3 -c -o ctpsv.o ctpsv.f
gfortran -fimplicit-none -O3 -c -o ctrmm.o ctrmm.f
gfortran -fimplicit-none -O3 -c -o ctrmv.o ctrmv.f
gfortran -fimplicit-none -O3 -c -o ctrsm.o ctrsm.f
gfortran -fimplicit-none -O3 -c -o ctrsv.o ctrsv.f
gfortran -fimplicit-none -O3 -c -o dasum.o dasum.f
gfortran -fimplicit-none -O3 -c -o daxpy.o daxpy.f
gfortran -fimplicit-none -O3 -c -o dcabs1.o dcabs1.f
gfortran -fimplicit-none -O3 -c -o dcopy.o dcopy.f
gfortran -fimplicit-none -O3 -c -o ddot.o ddot.f
gfortran -fimplicit-none -O3 -c -o dgbmv.o dgbmv.f
gfortran -fimplicit-none -O3 -c -o dgemm.o dgemm.f
gfortran -fimplicit-none -O3 -c -o dgemv.o dgemv.f
gfortran -fimplicit-none -O3 -c -o dger.o dger.f
gfortran -fimplicit-none -O3 -c -o dnrm2.o dnrm2.f
gfortran -fimplicit-none -O3 -c -o drot.o drot.f
gfortran -fimplicit-none -O3 -c -o drotg.o drotg.f
gfortran -fimplicit-none -O3 -c -o drotm.o drotm.f
gfortran -fimplicit-none -O3 -c -o drotmg.o drotmg.f
gfortran -fimplicit-none -O3 -c -o dsbmv.o dsbmv.f
gfortran -fimplicit-none -O3 -c -o dscal.o dscal.f
gfortran -fimplicit-none -O3 -c -o dsdot.o dsdot.f
gfortran -fimplicit-none -O3 -c -o dspmv.o dspmv.f
gfortran -fimplicit-none -O3 -c -o dspr.o dspr.f
gfortran -fimplicit-none -O3 -c -o dspr2.o dspr2.f
gfortran -fimplicit-none -O3 -c -o dswap.o dswap.f
gfortran -fimplicit-none -O3 -c -o dsymm.o dsymm.f
gfortran -fimplicit-none -O3 -c -o dsymv.o dsymv.f
gfortran -fimplicit-none -O3 -c -o dsyr.o dsyr.f
gfortran -fimplicit-none -O3 -c -o dsyr2.o dsyr2.f
gfortran -fimplicit-none -O3 -c -o dsyr2k.o dsyr2k.f
gfortran -fimplicit-none -O3 -c -o dsyrk.o dsyrk.f
gfortran -fimplicit-none -O3 -c -o dtbmv.o dtbmv.f
gfortran -fimplicit-none -O3 -c -o dtbsv.o dtbsv.f
gfortran -fimplicit-none -O3 -c -o dtpmv.o dtpmv.f
gfortran -fimplicit-none -O3 -c -o dtpsv.o dtpsv.f
gfortran -fimplicit-none -O3 -c -o dtrmm.o dtrmm.f
gfortran -fimplicit-none -O3 -c -o dtrmv.o dtrmv.f
gfortran -fimplicit-none -O3 -c -o dtrsm.o dtrsm.f
gfortran -fimplicit-none -O3 -c -o dtrsv.o dtrsv.f
gfortran -fimplicit-none -O3 -c -o dzasum.o dzasum.f
gfortran -fimplicit-none -O3 -c -o dznrm2.o dznrm2.f
gfortran -fimplicit-none -O3 -c -o icamax.o icamax.f
gfortran -fimplicit-none -O3 -c -o idamax.o idamax.f
gfortran -fimplicit-none -O3 -c -o isamax.o isamax.f
gfortran -fimplicit-none -O3 -c -o izamax.o izamax.f
gfortran -fimplicit-none -O3 -c -o lsame.o lsame.f
gfortran -fimplicit-none -O3 -c -o sasum.o sasum.f
gfortran -fimplicit-none -O3 -c -o saxpy.o saxpy.f
gfortran -fimplicit-none -O3 -c -o scabs1.o scabs1.f
gfortran -fimplicit-none -O3 -c -o scasum.o scasum.f
gfortran -fimplicit-none -O3 -c -o scnrm2.o scnrm2.f
gfortran -fimplicit-none -O3 -c -o scopy.o scopy.f
gfortran -fimplicit-none -O3 -c -o sdot.o sdot.f
gfortran -fimplicit-none -O3 -c -o sdsdot.o sdsdot.f
gfortran -fimplicit-none -O3 -c -o sgbmv.o sgbmv.f
gfortran -fimplicit-none -O3 -c -o sgemm.o sgemm.f
gfortran -fimplicit-none -O3 -c -o sgemv.o sgemv.f
gfortran -fimplicit-none -O3 -c -o sger.o sger.f
gfortran -fimplicit-none -O3 -c -o snrm2.o snrm2.f
gfortran -fimplicit-none -O3 -c -o srot.o srot.f
gfortran -fimplicit-none -O3 -c -o srotg.o srotg.f
gfortran -fimplicit-none -O3 -c -o srotm.o srotm.f
gfortran -fimplicit-none -O3 -c -o srotmg.o srotmg.f
gfortran -fimplicit-none -O3 -c -o ssbmv.o ssbmv.f
gfortran -fimplicit-none -O3 -c -o sscal.o sscal.f
gfortran -fimplicit-none -O3 -c -o sspmv.o sspmv.f
gfortran -fimplicit-none -O3 -c -o sspr.o sspr.f
gfortran -fimplicit-none -O3 -c -o sspr2.o sspr2.f
gfortran -fimplicit-none -O3 -c -o sswap.o sswap.f
gfortran -fimplicit-none -O3 -c -o ssymm.o ssymm.f
gfortran -fimplicit-none -O3 -c -o ssymv.o ssymv.f
gfortran -fimplicit-none -O3 -c -o ssyr.o ssyr.f
gfortran -fimplicit-none -O3 -c -o ssyr2.o ssyr2.f
gfortran -fimplicit-none -O3 -c -o ssyr2k.o ssyr2k.f
gfortran -fimplicit-none -O3 -c -o ssyrk.o ssyrk.f
gfortran -fimplicit-none -O3 -c -o stbmv.o stbmv.f
gfortran -fimplicit-none -O3 -c -o stbsv.o stbsv.f
gfortran -fimplicit-none -O3 -c -o stpmv.o stpmv.f
gfortran -fimplicit-none -O3 -c -o stpsv.o stpsv.f
gfortran -fimplicit-none -O3 -c -o strmm.o strmm.f
gfortran -fimplicit-none -O3 -c -o strmv.o strmv.f
gfortran -fimplicit-none -O3 -c -o strsm.o strsm.f
gfortran -fimplicit-none -O3 -c -o strsv.o strsv.f
gfortran -fimplicit-none -O3 -c -o xerbla.o xerbla.f
gfortran -fimplicit-none -O3 -c -o xerbla_array.o xerbla_array.f
gfortran -fimplicit-none -O3 -c -o zaxpy.o zaxpy.f
gfortran -fimplicit-none -O3 -c -o zcopy.o zcopy.f
gfortran -fimplicit-none -O3 -c -o zdotc.o zdotc.f
gfortran -fimplicit-none -O3 -c -o zdotu.o zdotu.f
gfortran -fimplicit-none -O3 -c -o zdrot.o zdrot.f
gfortran -fimplicit-none -O3 -c -o zdscal.o zdscal.f
gfortran -fimplicit-none -O3 -c -o zgbmv.o zgbmv.f
gfortran -fimplicit-none -O3 -c -o zgemm.o zgemm.f
gfortran -fimplicit-none -O3 -c -o zgemv.o zgemv.f
gfortran -fimplicit-none -O3 -c -o zgerc.o zgerc.f
gfortran -fimplicit-none -O3 -c -o zgeru.o zgeru.f
gfortran -fimplicit-none -O3 -c -o zhbmv.o zhbmv.f
gfortran -fimplicit-none -O3 -c -o zhemm.o zhemm.f
gfortran -fimplicit-none -O3 -c -o zhemv.o zhemv.f
gfortran -fimplicit-none -O3 -c -o zher.o zher.f
gfortran -fimplicit-none -O3 -c -o zher2.o zher2.f
gfortran -fimplicit-none -O3 -c -o zher2k.o zher2k.f
gfortran -fimplicit-none -O3 -c -o zherk.o zherk.f
gfortran -fimplicit-none -O3 -c -o zhpmv.o zhpmv.f
gfortran -fimplicit-none -O3 -c -o zhpr.o zhpr.f
gfortran -fimplicit-none -O3 -c -o zhpr2.o zhpr2.f
gfortran -fimplicit-none -O3 -c -o zrotg.o zrotg.f
gfortran -fimplicit-none -O3 -c -o zscal.o zscal.f
gfortran -fimplicit-none -O3 -c -o zswap.o zswap.f
gfortran -fimplicit-none -O3 -c -o zsymm.o zsymm.f
gfortran -fimplicit-none -O3 -c -o zsyr2k.o zsyr2k.f
gfortran -fimplicit-none -O3 -c -o zsyrk.o zsyrk.f
gfortran -fimplicit-none -O3 -c -o ztbmv.o ztbmv.f
gfortran -fimplicit-none -O3 -c -o ztbsv.o ztbsv.f
gfortran -fimplicit-none -O3 -c -o ztpmv.o ztpmv.f
gfortran -fimplicit-none -O3 -c -o ztpsv.o ztpsv.f
gfortran -fimplicit-none -O3 -c -o ztrmm.o ztrmm.f
gfortran -fimplicit-none -O3 -c -o ztrmv.o ztrmv.f
gfortran -fimplicit-none -O3 -c -o ztrsm.o ztrsm.f
gfortran -fimplicit-none -O3 -c -o ztrsv.o ztrsv.f
ar cru ../librefblas.a caxpy.o ccopy.o cdotc.o cdotu.o cgbmv.o cgemm.o cgemv.o cgerc.o cgeru.o chbmv.o chemm.o chemv.o cher.o cher2.o cher2k.o cherk.o chpmv.o chpr.o chpr2.o crotg.o cscal.o csrot.o csscal.o cswap.o csymm.o csyr2k.o csyrk.o ctbmv.o ctbsv.o ctpmv.o ctpsv.o ctrmm.o ctrmv.o ctrsm.o ctrsv.o dasum.o daxpy.o dcabs1.o dcopy.o ddot.o dgbmv.o dgemm.o dgemv.o dger.o dnrm2.o drot.o drotg.o drotm.o drotmg.o dsbmv.o dscal.o dsdot.o dspmv.o dspr.o dspr2.o dswap.o dsymm.o dsymv.o dsyr.o dsyr2.o dsyr2k.o dsyrk.o dtbmv.o dtbsv.o dtpmv.o dtpsv.o dtrmm.o dtrmv.o dtrsm.o dtrsv.o dzasum.o dznrm2.o icamax.o idamax.o isamax.o izamax.o lsame.o sasum.o saxpy.o scabs1.o scasum.o scnrm2.o scopy.o sdot.o sdsdot.o sgbmv.o sgemm.o sgemv.o sger.o snrm2.o srot.o srotg.o srotm.o srotmg.o ssbmv.o sscal.o sspmv.o sspr.o sspr2.o sswap.o ssymm.o ssymv.o ssyr.o ssyr2.o ssyr2k.o ssyrk.o stbmv.o stbsv.o stpmv.o stpsv.o strmm.o strmv.o strsm.o strsv.o xerbla.o xerbla_array.o zaxpy.o zcopy.o zdotc.o zdotu.o zdrot.o zdscal.o zgbmv.o zgemm.o zgemv.o zgerc.o zgeru.o zhbmv.o zhemm.o zhemv.o zher.o zher2.o zher2k.o zherk.o zhpmv.o zhpr.o zhpr2.o zrotg.o zscal.o zswap.o zsymm.o zsyr2k.o zsyrk.o ztbmv.o ztbsv.o ztpmv.o ztpsv.o ztrmm.o ztrmv.o ztrsm.o ztrsv.o
ranlib ../librefblas.a
make -C test
gfortran dblat1.f -L.. -lrefblas -o dblat1_ref
dblat1.f:215.44:
CALL STEST1(DNRM2(N,SX,INCX),STEMP,STEMP,SFAC)
1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
dblat1.f:219.44:
CALL STEST1(DASUM(N,SX,INCX),STEMP,STEMP,SFAC)
1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
gfortran dblat3.f -L.. -lrefblas -o dblat3_ref
gfortran dblat1.f -L.. -lulmblas -o dblat1_ulm
dblat1.f:215.44:
CALL STEST1(DNRM2(N,SX,INCX),STEMP,STEMP,SFAC)
1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
dblat1.f:219.44:
CALL STEST1(DASUM(N,SX,INCX),STEMP,STEMP,SFAC)
1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
gfortran dblat3.f -L.. -lulmblas -o dblat3_ulm
make -C bench
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o l1blastst.o l1blastst.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_cputime.o ATL_cputime.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_epsilon.o ATL_epsilon.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77amax.o ATL_f77amax.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77asum.o ATL_f77asum.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77axpy.o ATL_f77axpy.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77copy.o ATL_f77copy.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77dot.o ATL_f77dot.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77gemm.o ATL_f77gemm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77nrm2.o ATL_f77nrm2.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77rot.o ATL_f77rot.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77rotg.o ATL_f77rotg.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77rotm.o ATL_f77rotm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77rotmg.o ATL_f77rotmg.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77scal.o ATL_f77scal.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77swap.o ATL_f77swap.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77symm.o ATL_f77symm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77syr2k.o ATL_f77syr2k.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77syrk.o ATL_f77syrk.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77trmm.o ATL_f77trmm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77trsm.o ATL_f77trsm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_flushcache.o ATL_flushcache.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_gediffnrm1.o ATL_gediffnrm1.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_gegen.o ATL_gegen.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_genrm1.o ATL_genrm1.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_infnrm.o ATL_infnrm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_rand.o ATL_rand.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_set.o ATL_set.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_synrm.o ATL_synrm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_trnrm1.o ATL_trnrm1.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_vdiff.o ATL_vdiff.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_zero.o ATL_zero.c
gfortran -c -o ATL_df77wrap.o ATL_df77wrap.f
ar r libtstatlas.a ATL_cputime.o ATL_epsilon.o ATL_f77amax.o ATL_f77asum.o ATL_f77axpy.o ATL_f77copy.o ATL_f77dot.o ATL_f77gemm.o ATL_f77nrm2.o ATL_f77rot.o ATL_f77rotg.o ATL_f77rotm.o ATL_f77rotmg.o ATL_f77scal.o ATL_f77swap.o ATL_f77symm.o ATL_f77syr2k.o ATL_f77syrk.o ATL_f77trmm.o ATL_f77trsm.o ATL_flushcache.o ATL_gediffnrm1.o ATL_gegen.o ATL_genrm1.o ATL_infnrm.o ATL_rand.o ATL_set.o ATL_synrm.o ATL_trnrm1.o ATL_vdiff.o ATL_zero.o ATL_df77wrap.o
ar: creating archive libtstatlas.a
ranlib libtstatlas.a
gfortran -o xdl1blastst l1blastst.o libtstatlas.a ../libatlulmblas.a ../librefblas.a
gfortran -o xdl3blastst l3blastst.o libtstatlas.a ../libatlulmblas.a ../librefblas.a
Outline of the Modification
-
In the micro kernel we also go from intto long. It is just simpler to use only 64 bit registers.
-
Registers %xmm8, .., %xmm15 are used to hold the result of A*B.
-
Once these registers contain the required values of A*B we begin with the update C <- beta*C + alpha*A*B:
-
%xmm0 is used to hold alpha and %xmm1 for beta.
-
The strides of matrix \(C\) must be multiplied by sizeof(double). This gives the stride in bytes.
-
For each element \(C_{i,j}\) we do the following:
-
Load \(C_{i,j}\).
-
Compute \(\left(\alpha A B\right)_{i,j}\).
-
Compute \(\beta C_{i,j}\)
-
Store \(\beta C_{i,j} + \left(\alpha A B\right)_{i,j}\).
-
-
The dgemm_nn Code
Benchmark Results
We run the benchmarks
$shell> cd bench $shell> ./xdl3blastst > report
and filter out the results for the demo-sse-all-asm branch:
$shell> grep PASS report > demo-sse-all-asm
With the gnuplot script
set output "bench13.svg"
set title "Compute C + A*B"
set xlabel "Matrix dimensions N=M=K"
set ylabel "MFLOPS"
set yrange [0:9600]
set key outside
plot "refBLAS" using 4:13 with linespoints lt 2 title "Netlib RefBLAS", "demo-pure-c" using 4:13 with linespoints lt 4 title "demo-pure-c", "demo-naive-sse-with-intrinsics" using 4:13 with linespoints lt 5 title "demo-naive-sse-with-intrinsics", "demo-naive-sse-with-intrinsics-unrolled" using 4:13 with linespoints lt 6 title "demo-naive-sse-with-intrinsics-unrolled", "demo-sse-intrinsics" using 4:13 with linespoints lt 7 title "demo-sse-intrinsics", "demo-sse-intrinsics-v2" using 4:13 with linespoints lt 8 title "demo-sse-intrinsics-v2", "demo-sse-asm" using 4:13 with linespoints lt 9 title "demo-sse-asm", "demo-sse-asm-unrolled" using 4:13 with linespoints lt 10 title "demo-sse-asm-unrolled", "demo-sse-asm-unrolled-v2" using 4:13 with linespoints lt 11 title "demo-sse-asm-unrolled-v2", "demo-sse-asm-unrolled-v3" using 4:13 with linespoints lt 12 title "demo-sse-asm-unrolled-v3", "demo-sse-all-asm" using 4:13 with linespoints lt 13 title "demo-sse-all-asm"
we feed gnuplot
$shell> gnuplot bench13.gps
and get
Code Size of kb-Loop Body
On Mac OS X you can use otool to get at assembler code from an object file. Moreover you directly can see how many bytes each instruction takes. We will use this to determine the code size of the kb-loop in the micro kernel. So we have to look at the label generated from
and the jump
Here the complete dump of otool
$shell> cd ../src/level3 $shell> otool -dtV dgemm_nn.o dgemm_nn.o: (__TEXT,__text) section _ULM_dgemm_nn: 0000000000000000 pushq %rbp 0000000000000001 pushq %r15 0000000000000003 pushq %r14 0000000000000005 pushq %r13 0000000000000007 pushq %r12 0000000000000009 pushq %rbx 000000000000000a subq $0x288, %rsp 0000000000000011 movsd %xmm1, 0x208(%rsp) 000000000000001a movl %r9d, 0x44(%rsp) 000000000000001f movsd %xmm0, 0x1d0(%rsp) 0000000000000028 leal 0xfff(%rsi), %eax 000000000000002e movq %rsi, %rbx 0000000000000031 movq %rdx, %rsi 0000000000000034 sarl $0x1f, %eax 0000000000000037 shrl $0x14, %eax 000000000000003a xorps %xmm2, %xmm2 000000000000003d ucomisd %xmm2, %xmm0 0000000000000041 movl 0x2e8(%rsp), %r12d 0000000000000049 movl 0x2e0(%rsp), %r15d 0000000000000051 movq 0x2d8(%rsp), %r14 0000000000000059 jne 0x5d 000000000000005b jnp 0x65 000000000000005d testl %esi, %esi 000000000000005f jne 0x110 0000000000000065 testl %ebx, %ebx 0000000000000067 setg %cl 000000000000006a testl %edi, %edi 000000000000006c setg %al 000000000000006f andb %cl, %al 0000000000000071 ucomisd %xmm2, %xmm1 0000000000000075 jne 0x7b 0000000000000077 jp 0x7b 0000000000000079 jmp 0xc8 000000000000007b testb %al, %al 000000000000007d je .DPOSTACCUMULATE1+1037 0000000000000083 xorl %eax, %eax 0000000000000085 xorl %ecx, %ecx 0000000000000087 nopw _ULM_dgemm_nn(%rax,%rax) 0000000000000090 movl %eax, %edx 0000000000000092 movl %edi, %esi 0000000000000094 nopw %cs:_ULM_dgemm_nn(%rax,%rax) 00000000000000a0 movslq %edx, %rdx 00000000000000a3 movsd _ULM_dgemm_nn(%r14,%rdx,8), %xmm0 00000000000000a9 mulsd %xmm1, %xmm0 00000000000000ad movsd %xmm0, _ULM_dgemm_nn(%r14,%rdx,8) 00000000000000b3 addl %r15d, %edx 00000000000000b6 decl %esi 00000000000000b8 jne 0xa0 00000000000000ba incl %ecx 00000000000000bc addl %r12d, %eax 00000000000000bf cmpl %ebx, %ecx 00000000000000c1 jne 0x90 00000000000000c3 jmpq .DPOSTACCUMULATE1+1037 00000000000000c8 testb %al, %al 00000000000000ca je .DPOSTACCUMULATE1+1037 00000000000000d0 xorl %eax, %eax 00000000000000d2 xorl %ecx, %ecx 00000000000000d4 nopw %cs:_ULM_dgemm_nn(%rax,%rax) 00000000000000e0 movl %eax, %edx 00000000000000e2 movl %edi, %esi 00000000000000e4 nopw %cs:_ULM_dgemm_nn(%rax,%rax) 00000000000000f0 movslq %edx, %rdx 00000000000000f3 movq $_ULM_dgemm_nn, _ULM_dgemm_nn(%r14,%rdx,8) 00000000000000fb addl %r15d, %edx 00000000000000fe decl %esi 0000000000000100 jne 0xf0 0000000000000102 incl %ecx 0000000000000104 addl %r12d, %eax 0000000000000107 cmpl %ebx, %ecx 0000000000000109 jne 0xe0 000000000000010b jmpq .DPOSTACCUMULATE1+1037 0000000000000110 movq %rcx, 0x130(%rsp) 0000000000000118 testl %ebx, %ebx 000000000000011a jle .DPOSTACCUMULATE1+1037 0000000000000120 movq %rdi, %r9 0000000000000123 movq %r9, 0x90(%rsp) 000000000000012b leal 0x17f(%r9), %ecx 0000000000000132 movslq %ecx, %rcx 0000000000000135 imulq $0x2aaaaaab, %rcx, %r11 000000000000013c movq %r11, %rcx 000000000000013f shrq $0x3f, %rcx 0000000000000143 sarq $0x26, %r11 0000000000000147 addl %ecx, %r11d 000000000000014a movq %r11, 0x118(%rsp) 0000000000000152 movq %rbx, %rdi 0000000000000155 leal 0xfff(%rdi,%rax), %r10d 000000000000015d sarl $0xc, %r10d 0000000000000161 movq %r10, 0x18(%rsp) 0000000000000166 leal 0x17f(%rsi), %eax 000000000000016c cltq 000000000000016e imulq $0x2aaaaaab, %rax, %rbx 0000000000000175 movq %rbx, %rax 0000000000000178 shrq $0x3f, %rax 000000000000017c sarq $0x26, %rbx 0000000000000180 addl %eax, %ebx 0000000000000182 movq %rbx, 0x80(%rsp) 000000000000018a movslq %r9d, %rdx 000000000000018d imulq $0x2aaaaaab, %rdx, %rax 0000000000000194 movq %rax, %rcx 0000000000000197 shrq $0x3f, %rcx 000000000000019b sarq $0x26, %rax 000000000000019f addl %ecx, %eax 00000000000001a1 imull $0x180, %eax, %eax 00000000000001a7 subl %eax, %edx 00000000000001a9 movq %rdx, 0x128(%rsp) 00000000000001b1 movl %edi, %eax 00000000000001b3 sarl $0x1f, %eax 00000000000001b6 shrl $0x14, %eax 00000000000001b9 addl %edi, %eax 00000000000001bb andl $0xfffff000, %eax 00000000000001c0 subl %eax, %edi 00000000000001c2 movq %rdi, 0x28(%rsp) 00000000000001c7 movslq %esi, %rdx 00000000000001ca imulq $0x2aaaaaab, %rdx, %rax 00000000000001d1 movq %rax, %rcx 00000000000001d4 shrq $0x3f, %rcx 00000000000001d8 sarq $0x26, %rax 00000000000001dc addl %ecx, %eax 00000000000001de imull $0x180, %eax, %eax 00000000000001e4 subl %eax, %edx 00000000000001e6 movq %rdx, 0xa0(%rsp) 00000000000001ee movl 0x2d0(%rsp), %ebp 00000000000001f5 leal -0x1(%r10), %eax 00000000000001f9 movl %eax, 0x24(%rsp) 00000000000001fd leal -0x1(%rbx), %eax 0000000000000200 movl %eax, 0x9c(%rsp) 0000000000000207 movslq 0x2c8(%rsp), %rdx 000000000000020f movq %rdx, 0xc0(%rsp) 0000000000000217 leal _ULM_dgemm_nn(,%rbp,4), %r10d 000000000000021f leal -0x1(%r11), %ecx 0000000000000223 movl %ecx, 0x124(%rsp) 000000000000022a movslq 0x44(%rsp), %r13 000000000000022f movq %r13, 0x168(%rsp) 0000000000000237 movq %r8, 0x170(%rsp) 000000000000023f leal _ULM_dgemm_nn(,%r8,4), %r9d 0000000000000247 movslq %r15d, %rcx 000000000000024a movq %rcx, 0x1c8(%rsp) 0000000000000252 movslq %r12d, %rcx 0000000000000255 movq %rcx, 0x1c0(%rsp) 000000000000025d movl %ebp, %ecx 000000000000025f imull $0x180, %edx, %eax 0000000000000265 movl %eax, 0x7c(%rsp) 0000000000000269 movslq %r10d, %rax 000000000000026c movq %rax, 0x8(%rsp) 0000000000000271 leaq _ULM_dgemm_nn(,%rax,8), %rax 0000000000000279 movq %rax, 0xf8(%rsp) 0000000000000281 leaq _ULM_dgemm_nn(,%rdx,8), %rdi 0000000000000289 movq %rdi, 0x70(%rsp) 000000000000028e leal _ULM_dgemm_nn(%rbp,%rbp,2), %r11d 0000000000000293 leal _ULM_dgemm_nn(%rbp,%rbp), %edx 0000000000000297 movslq %ebp, %rbx 000000000000029a movq %rbx, 0x68(%rsp) 000000000000029f imull $0x180, %r13d, %eax 00000000000002a6 movl %eax, 0x64(%rsp) 00000000000002aa imull $0x180, %r8d, %eax 00000000000002b1 movl %eax, 0x114(%rsp) 00000000000002b8 movslq %r9d, %rax 00000000000002bb movq %rax, 0xe0(%rsp) 00000000000002c3 leaq _ULM_dgemm_nn(,%rax,8), %rax 00000000000002cb movq %rax, 0xd8(%rsp) 00000000000002d3 leaq _ULM_dgemm_nn(,%r13,8), %rax 00000000000002db movq %rax, 0x140(%rsp) 00000000000002e3 leal _ULM_dgemm_nn(%r8,%r8,2), %r9d 00000000000002e7 movq %rsi, %rax 00000000000002ea movq %rax, 0x88(%rsp) 00000000000002f2 leal _ULM_dgemm_nn(%r8,%r8), %esi 00000000000002f6 movslq %r8d, %rbp 00000000000002f9 movq %rbp, 0x138(%rsp) 0000000000000301 xorl %ebp, %ebp 0000000000000303 movq %rbp, 0x38(%rsp) 0000000000000308 shll $0xc, %ecx 000000000000030b movl %ecx, 0x14(%rsp) 000000000000030f xorl %ecx, %ecx 0000000000000311 movslq %r11d, %r11 0000000000000314 movq %r11, 0x58(%rsp) 0000000000000319 movslq %edx, %rdx 000000000000031c movq %rdx, 0xf0(%rsp) 0000000000000324 movslq %r9d, %rdx 0000000000000327 movq %rdx, 0x180(%rsp) 000000000000032f movslq %esi, %rdx 0000000000000332 movq %rdx, 0x178(%rsp) 000000000000033a movq %rax, %rsi 000000000000033d nopl _ULM_dgemm_nn(%rax) 0000000000000340 movl %ecx, 0x34(%rsp) 0000000000000344 movq 0x28(%rsp), %rdx 0000000000000349 testl %edx, %edx 000000000000034b sete %al 000000000000034e cmpl 0x24(%rsp), %ecx 0000000000000352 setne %cl 0000000000000355 orb %al, %cl 0000000000000357 movl $0x1000, %eax 000000000000035c cmovnel %eax, %edx 000000000000035f movq %rdx, %rcx 0000000000000362 testl %esi, %esi 0000000000000364 movq %rdi, %rdx 0000000000000367 jle .DPOSTACCUMULATE1+1001 000000000000036d movl 0x34(%rsp), %edi 0000000000000371 shll $0xc, %edi 0000000000000374 movl %edi, %esi 0000000000000376 movl 0x2d0(%rsp), %eax 000000000000037d imull %eax, %esi 0000000000000380 movl %esi, 0xa8(%rsp) 0000000000000387 movq %rcx, %rsi 000000000000038a movq %rsi, 0x148(%rsp) 0000000000000392 movl %esi, %eax 0000000000000394 sarl $0x1f, %eax 0000000000000397 shrl $0x1e, %eax 000000000000039a addl %esi, %eax 000000000000039c movl %eax, %ecx 000000000000039e sarl $0x2, %ecx 00000000000003a1 movq %rcx, 0x100(%rsp) 00000000000003a9 andl $-0x4, %eax 00000000000003ac movl %esi, %r8d 00000000000003af subl %eax, %r8d 00000000000003b2 movq %r8, 0x1a0(%rsp) 00000000000003ba negl %eax 00000000000003bc leal -0x1(%rcx), %ecx 00000000000003bf cmpl $0x7, %esi 00000000000003c2 leaq 0x8(,%rcx,8), %rbp 00000000000003ca movl $0x8, %ecx 00000000000003cf cmovleq %rcx, %rbp 00000000000003d3 movq %rbp, 0x50(%rsp) 00000000000003d8 imulq 0x8(%rsp), %rbp 00000000000003de movq %rbp, 0x48(%rsp) 00000000000003e3 movslq %r8d, %rcx 00000000000003e6 movq %rcx, 0xd0(%rsp) 00000000000003ee leal 0x1(%rsi,%rax), %eax 00000000000003f2 cmpl $0x4, %eax 00000000000003f5 movl $0x3, %eax 00000000000003fa cmovgl %r8d, %eax 00000000000003fe subl %ecx, %eax 0000000000000400 leaq 0x8(,%rax,8), %rax 0000000000000408 movq %rax, 0xc8(%rsp) 0000000000000410 imull %r12d, %edi 0000000000000414 movl %edi, 0xec(%rsp) 000000000000041b leal 0x3(%rsi), %eax 000000000000041e sarl $0x1f, %eax 0000000000000421 shrl $0x1e, %eax 0000000000000424 leal 0x3(%rsi,%rax), %eax 0000000000000428 sarl $0x2, %eax 000000000000042b movq %rax, 0x190(%rsp) 0000000000000433 leal -0x1(%rax), %eax 0000000000000436 movl %eax, 0x18c(%rsp) 000000000000043d xorl %eax, %eax 000000000000043f movq %rax, 0xb8(%rsp) 0000000000000447 movq 0x38(%rsp), %rax 000000000000044c movl %eax, %r13d 000000000000044f xorl %edi, %edi 0000000000000451 nopw %cs:_ULM_dgemm_nn(%rax,%rax) 0000000000000460 movq 0xa0(%rsp), %rsi 0000000000000468 testl %esi, %esi 000000000000046a sete %al 000000000000046d cmpl 0x9c(%rsp), %edi 0000000000000474 setne %cl 0000000000000477 orb %al, %cl 0000000000000479 movl $0x180, %eax 000000000000047e cmovnel %eax, %esi 0000000000000481 movq %rsi, 0x1b0(%rsp) 0000000000000489 testl %edi, %edi 000000000000048b je 0x498 000000000000048d movsd 0x134b(%rip), %xmm0 0000000000000495 movaps %xmm0, %xmm1 0000000000000498 imull $0x180, %edi, %eax 000000000000049e movl %eax, 0x154(%rsp) 00000000000004a5 movq %rdi, 0xb0(%rsp) 00000000000004ad imull 0x2c8(%rsp), %eax 00000000000004b5 addl 0xa8(%rsp), %eax 00000000000004bc movq 0x148(%rsp), %rcx 00000000000004c4 cmpl $0x4, %ecx 00000000000004c7 cltq 00000000000004c9 movq 0x2c0(%rsp), %rcx 00000000000004d1 leaq _ULM_dgemm_nn(%rcx,%rax,8), %r12 00000000000004d5 jl 0x5e0 00000000000004db movq 0x1b0(%rsp), %rcx 00000000000004e3 leal _ULM_dgemm_nn(,%rcx,4), %eax 00000000000004ea cltq 00000000000004ec movq %rax, 0x238(%rsp) 00000000000004f4 testl %ecx, %ecx 00000000000004f6 jle 0x5a0 00000000000004fc movslq %r13d, %rax 00000000000004ff movq 0x2c0(%rsp), %rcx 0000000000000507 leaq _ULM_dgemm_nn(%rcx,%rax,8), %r9 000000000000050b movq 0x238(%rsp), %rax 0000000000000513 leaq _ULM_dgemm_nn(,%rax,8), %r8 000000000000051b xorl %eax, %eax 000000000000051d leaq __B(%rip), %rsi 0000000000000524 nopw %cs:_ULM_dgemm_nn(%rax,%rax) 0000000000000530 movq %r9, %rdi 0000000000000533 xorl %ebp, %ebp 0000000000000535 movq 0x1b0(%rsp), %rcx 000000000000053d movq 0xf0(%rsp), %r10 0000000000000545 nopw %cs:_ULM_dgemm_nn(%rax,%rax) 0000000000000550 movsd _ULM_dgemm_nn(%rdi), %xmm0 0000000000000554 movsd %xmm0, _ULM_dgemm_nn(%rsi,%rbp) 0000000000000559 movsd _ULM_dgemm_nn(%rdi,%rbx,8), %xmm0 000000000000055e movsd %xmm0, 0x8(%rsi,%rbp) 0000000000000564 movsd _ULM_dgemm_nn(%rdi,%r10,8), %xmm0 000000000000056a movsd %xmm0, 0x10(%rsi,%rbp) 0000000000000570 movsd _ULM_dgemm_nn(%rdi,%r11,8), %xmm0 0000000000000576 movsd %xmm0, 0x18(%rsi,%rbp) 000000000000057c addq $0x20, %rbp 0000000000000580 addq %rdx, %rdi 0000000000000583 decl %ecx 0000000000000585 jne 0x550 0000000000000587 incl %eax 0000000000000589 addq %r8, %rsi 000000000000058c addq 0xf8(%rsp), %r9 0000000000000594 movq 0x100(%rsp), %rcx 000000000000059c cmpl %ecx, %eax 000000000000059e jl 0x530 00000000000005a0 movl %r13d, 0xac(%rsp) 00000000000005a8 movq 0x238(%rsp), %rax 00000000000005b0 movq %rax, %rcx 00000000000005b3 imulq 0x50(%rsp), %rcx 00000000000005b9 leaq __B(%rip), %rax 00000000000005c0 addq %rax, %rcx 00000000000005c3 movq %rcx, 0x238(%rsp) 00000000000005cb addq 0x48(%rsp), %r12 00000000000005d0 movq 0x1a0(%rsp), %r8 00000000000005d8 jmp 0x5f7 00000000000005da nopw _ULM_dgemm_nn(%rax,%rax) 00000000000005e0 movl %r13d, 0xac(%rsp) 00000000000005e8 leaq __B(%rip), %rax 00000000000005ef movq %rax, 0x238(%rsp) 00000000000005f7 movsd %xmm1, 0x1e8(%rsp) 0000000000000600 testl %r8d, %r8d 0000000000000603 jle 0x6b0 0000000000000609 movq 0x1b0(%rsp), %rax 0000000000000611 testl %eax, %eax 0000000000000613 movl $_ULM_dgemm_nn, %r13d 0000000000000619 movq 0x238(%rsp), %rbp 0000000000000621 jle 0x6b0 0000000000000627 nopw _ULM_dgemm_nn(%rax,%rax) 0000000000000630 movq 0xd0(%rsp), %rax 0000000000000638 leaq _ULM_dgemm_nn(%rax,%r13,4), %rax 000000000000063c movq 0x238(%rsp), %rcx 0000000000000644 leaq _ULM_dgemm_nn(%rcx,%rax,8), %rdi 0000000000000648 xorl %eax, %eax 000000000000064a xorl %ecx, %ecx 000000000000064c movl 0x2d0(%rsp), %edx 0000000000000653 nopw %cs:_ULM_dgemm_nn(%rax,%rax) 0000000000000660 cltq 0000000000000662 movsd _ULM_dgemm_nn(%r12,%rax,8), %xmm0 0000000000000668 movsd %xmm0, _ULM_dgemm_nn(%rbp,%rcx,8) 000000000000066e incq %rcx 0000000000000671 addl %edx, %eax 0000000000000673 cmpl %ecx, %r8d 0000000000000676 jne 0x660 0000000000000678 movq 0xc8(%rsp), %rsi 0000000000000680 movq %r8, %rbx 0000000000000683 callq ___bzero 0000000000000688 movq %rbx, %r8 000000000000068b addq $0x20, %rbp 000000000000068f movq 0xc0(%rsp), %rax 0000000000000697 leaq _ULM_dgemm_nn(%r12,%rax,8), %r12 000000000000069b incq %r13 000000000000069e movq 0x1b0(%rsp), %rax 00000000000006a6 cmpl %eax, %r13d 00000000000006a9 jne 0x630 00000000000006ab nopl _ULM_dgemm_nn(%rax,%rax) 00000000000006b0 movq 0x90(%rsp), %rax 00000000000006b8 testl %eax, %eax 00000000000006ba movsd 0x208(%rsp), %xmm1 00000000000006c3 movq 0x140(%rsp), %rbp 00000000000006cb movq 0x138(%rsp), %rbx 00000000000006d3 jle .DPOSTACCUMULATE1+900 00000000000006d9 movl 0x154(%rsp), %eax 00000000000006e0 imull 0x44(%rsp), %eax 00000000000006e5 movl %eax, 0x154(%rsp) 00000000000006ec movq 0x1b0(%rsp), %rax 00000000000006f4 leal _ULM_dgemm_nn(,%rax,4), %ecx 00000000000006fb movl %ecx, 0x22c(%rsp) 0000000000000702 movslq %ecx, %rcx 0000000000000705 movq %rcx, 0x108(%rsp) 000000000000070d movslq %eax, %rdx 0000000000000710 movq %rdx, %rax 0000000000000713 sarq $0x3f, %rax 0000000000000717 shrq $0x3e, %rax 000000000000071b addq %rdx, %rax 000000000000071e movq %rax, %rsi 0000000000000721 sarq $0x2, %rsi 0000000000000725 movq %rsi, 0x1d8(%rsp) 000000000000072d andq $-0x4, %rax 0000000000000731 subq %rax, %rdx 0000000000000734 movq %rdx, 0x1e0(%rsp) 000000000000073c leaq _ULM_dgemm_nn(,%rcx,8), %rax 0000000000000744 movq %rax, 0x198(%rsp) 000000000000074c movq 0xb8(%rsp), %rax 0000000000000754 movl %eax, %esi 0000000000000756 xorl %edi, %edi 0000000000000758 nopl _ULM_dgemm_nn(%rax,%rax) 0000000000000760 movq 0x128(%rsp), %rdx 0000000000000768 testl %edx, %edx 000000000000076a sete %al 000000000000076d cmpl 0x124(%rsp), %edi 0000000000000774 setne %cl 0000000000000777 orb %al, %cl 0000000000000779 movl $0x180, %eax 000000000000077e cmovnel %eax, %edx 0000000000000781 imull $0x180, %edi, %r8d 0000000000000788 movl %r8d, %eax 000000000000078b movq 0x170(%rsp), %rcx 0000000000000793 imull %ecx, %eax 0000000000000796 addl 0x154(%rsp), %eax 000000000000079d movl %edx, %ecx 000000000000079f sarl $0x1f, %ecx 00000000000007a2 shrl $0x1e, %ecx 00000000000007a5 addl %edx, %ecx 00000000000007a7 cmpl $0x4, %edx 00000000000007aa cltq 00000000000007ac movq 0x130(%rsp), %r9 00000000000007b4 leaq _ULM_dgemm_nn(%r9,%rax,8), %r13 00000000000007b8 jl 0x8f0 00000000000007be movl %r8d, 0x21c(%rsp) 00000000000007c6 movq %rdi, 0x158(%rsp) 00000000000007ce movslq %esi, %rax 00000000000007d1 movl %esi, 0x164(%rsp) 00000000000007d8 movq %rdx, 0x1a8(%rsp) 00000000000007e0 leaq _ULM_dgemm_nn(%r9,%rax,8), %r12 00000000000007e4 movl %ecx, %r10d 00000000000007e7 movl %ecx, 0x230(%rsp) 00000000000007ee sarl $0x2, %r10d 00000000000007f2 leal -0x1(%r10), %eax 00000000000007f6 cmpl $0x7, %edx 00000000000007f9 leaq 0x8(,%rax,8), %rcx 0000000000000801 movl $0x8, %eax 0000000000000806 cmovleq %rax, %rcx 000000000000080a movq %rcx, %rax 000000000000080d imulq 0xe0(%rsp), %rax 0000000000000816 movq %rax, 0x220(%rsp) 000000000000081e imulq 0x108(%rsp), %rcx 0000000000000827 movq %rcx, 0x238(%rsp) 000000000000082f xorl %edi, %edi 0000000000000831 leaq __A(%rip), %rsi 0000000000000838 movq 0xd8(%rsp), %r11 0000000000000840 movq 0x1b0(%rsp), %rdx 0000000000000848 testl %edx, %edx 000000000000084a movq %r12, %rcx 000000000000084d movl $_ULM_dgemm_nn, %eax 0000000000000852 movq 0x180(%rsp), %r8 000000000000085a movq 0x178(%rsp), %r9 0000000000000862 jle 0x8a7 0000000000000864 nopw %cs:_ULM_dgemm_nn(%rax,%rax) 0000000000000870 movsd _ULM_dgemm_nn(%rcx), %xmm0 0000000000000874 movsd %xmm0, _ULM_dgemm_nn(%rsi,%rax) 0000000000000879 movsd _ULM_dgemm_nn(%rcx,%rbx,8), %xmm0 000000000000087e movsd %xmm0, 0x8(%rsi,%rax) 0000000000000884 movsd _ULM_dgemm_nn(%rcx,%r9,8), %xmm0 000000000000088a movsd %xmm0, 0x10(%rsi,%rax) 0000000000000890 movsd _ULM_dgemm_nn(%rcx,%r8,8), %xmm0 0000000000000896 movsd %xmm0, 0x18(%rsi,%rax) 000000000000089c addq $0x20, %rax 00000000000008a0 addq %rbp, %rcx 00000000000008a3 decl %edx 00000000000008a5 jne 0x870 00000000000008a7 incl %edi 00000000000008a9 addq 0x198(%rsp), %rsi 00000000000008b1 addq %r11, %r12 00000000000008b4 cmpl %r10d, %edi 00000000000008b7 jl 0x840 00000000000008b9 addq 0x220(%rsp), %r13 00000000000008c1 leaq __A(%rip), %rax 00000000000008c8 addq %rax, 0x238(%rsp) 00000000000008d0 movq 0x1a8(%rsp), %rdx 00000000000008d8 movl 0x230(%rsp), %ecx 00000000000008df jmp 0x916 00000000000008e1 nopw %cs:_ULM_dgemm_nn(%rax,%rax) 00000000000008f0 movl %r8d, 0x21c(%rsp) 00000000000008f8 movq %rdi, 0x158(%rsp) 0000000000000900 movl %esi, 0x164(%rsp) 0000000000000907 leaq __A(%rip), %rax 000000000000090e movq %rax, 0x238(%rsp) 0000000000000916 movq %rdx, 0x1a8(%rsp) 000000000000091e andl $-0x4, %ecx 0000000000000921 subl %ecx, %edx 0000000000000923 movq %rdx, 0x210(%rsp) 000000000000092b testl %edx, %edx 000000000000092d jle 0xa10 0000000000000933 movq 0x1b0(%rsp), %rax 000000000000093b testl %eax, %eax 000000000000093d jle 0xa10 0000000000000943 movq 0x210(%rsp), %rcx 000000000000094b movslq %ecx, %rax 000000000000094e movq %rax, 0x230(%rsp) 0000000000000956 leal 0x1(%rcx), %eax 0000000000000959 cmpl $0x4, %eax 000000000000095c movl $0x3, %eax 0000000000000961 cmovgl %ecx, %eax 0000000000000964 subl %ecx, %eax 0000000000000966 leaq 0x8(,%rax,8), %rax 000000000000096e movq %rax, 0x220(%rsp) 0000000000000976 movl %ecx, %r12d 0000000000000979 xorl %ebp, %ebp 000000000000097b movq 0x238(%rsp), %rbx 0000000000000983 nopw %cs:_ULM_dgemm_nn(%rax,%rax) 0000000000000990 movq 0x230(%rsp), %rax 0000000000000998 leaq _ULM_dgemm_nn(%rax,%rbp,4), %rax 000000000000099c movq 0x238(%rsp), %rcx 00000000000009a4 leaq _ULM_dgemm_nn(%rcx,%rax,8), %rdi 00000000000009a8 xorl %eax, %eax 00000000000009aa xorl %ecx, %ecx 00000000000009ac movq 0x170(%rsp), %rdx 00000000000009b4 nopw %cs:_ULM_dgemm_nn(%rax,%rax) 00000000000009c0 cltq 00000000000009c2 movsd _ULM_dgemm_nn(%r13,%rax,8), %xmm0 00000000000009c9 movsd %xmm0, _ULM_dgemm_nn(%rbx,%rcx,8) 00000000000009ce incq %rcx 00000000000009d1 addl %edx, %eax 00000000000009d3 cmpl %ecx, %r12d 00000000000009d6 jne 0x9c0 00000000000009d8 movq 0x220(%rsp), %rsi 00000000000009e0 callq ___bzero 00000000000009e5 addq $0x20, %rbx 00000000000009e9 movq 0x168(%rsp), %rax 00000000000009f1 leaq _ULM_dgemm_nn(%r13,%rax,8), %r13 00000000000009f6 incq %rbp 00000000000009f9 movq 0x1b0(%rsp), %rax 0000000000000a01 cmpl %eax, %ebp 0000000000000a03 jne 0x990 0000000000000a05 nopw %cs:_ULM_dgemm_nn(%rax,%rax) 0000000000000a10 movq 0x1a8(%rsp), %rdx 0000000000000a18 leal 0x3(%rdx), %eax 0000000000000a1b sarl $0x1f, %eax 0000000000000a1e shrl $0x1e, %eax 0000000000000a21 movq 0x148(%rsp), %rcx 0000000000000a29 testl %ecx, %ecx 0000000000000a2b movl 0x2e8(%rsp), %r9d 0000000000000a33 movsd 0x208(%rsp), %xmm1 0000000000000a3c jle .DPOSTACCUMULATE1+843 0000000000000a42 movl 0x21c(%rsp), %ecx 0000000000000a49 imull %r15d, %ecx 0000000000000a4d addl 0xec(%rsp), %ecx 0000000000000a54 leal 0x3(%rdx,%rax), %eax 0000000000000a58 sarl $0x2, %eax 0000000000000a5b movq %rax, 0x200(%rsp) 0000000000000a63 movslq %ecx, %rcx 0000000000000a66 movq %rcx, 0x1f0(%rsp) 0000000000000a6e leal -0x1(%rax), %eax 0000000000000a71 movl %eax, 0x1fc(%rsp) 0000000000000a78 xorl %esi, %esi 0000000000000a7a nopw _ULM_dgemm_nn(%rax,%rax) 0000000000000a80 movq %rsi, 0x1b8(%rsp) 0000000000000a88 movq 0x1a0(%rsp), %rcx 0000000000000a90 testl %ecx, %ecx 0000000000000a92 sete %al 0000000000000a95 cmpl 0x18c(%rsp), %esi 0000000000000a9c setne %bl 0000000000000a9f orb %al, %bl 0000000000000aa1 movb %bl, 0x230(%rsp) 0000000000000aa8 movl %ecx, %r13d 0000000000000aab movl $0x4, %eax 0000000000000ab0 cmovnel %eax, %r13d 0000000000000ab4 testl %edx, %edx 0000000000000ab6 jle .DPOSTACCUMULATE1+808 0000000000000abc movq 0x1b8(%rsp), %rdx 0000000000000ac4 movl %edx, %eax 0000000000000ac6 imull 0x22c(%rsp), %eax 0000000000000ace cltq 0000000000000ad0 leaq __B(%rip), %rcx 0000000000000ad7 leaq _ULM_dgemm_nn(%rcx,%rax,8), %rax 0000000000000adb movq %rax, 0x220(%rsp) 0000000000000ae3 movl %edx, %eax 0000000000000ae5 imull %r9d, %eax 0000000000000ae9 movl %eax, 0x21c(%rsp) 0000000000000af0 xorl %esi, %esi 0000000000000af2 nopw %cs:_ULM_dgemm_nn(%rax,%rax) 0000000000000b00 movq 0x210(%rsp), %rdx 0000000000000b08 testl %edx, %edx 0000000000000b0a sete %al 0000000000000b0d cmpl 0x1fc(%rsp), %esi 0000000000000b14 setne %cl 0000000000000b17 orb %al, %cl 0000000000000b19 movl %edx, %r12d 0000000000000b1c movl $0x4, %eax 0000000000000b21 cmovnel %eax, %r12d 0000000000000b25 andb 0x230(%rsp), %cl 0000000000000b2c movl %esi, %eax 0000000000000b2e imull 0x22c(%rsp), %eax 0000000000000b36 cmpb $0x1, %cl 0000000000000b39 cltq 0000000000000b3b leaq __A(%rip), %rcx 0000000000000b42 leaq _ULM_dgemm_nn(%rcx,%rax,8), %rax 0000000000000b46 jne .DPOSTACCUMULATE0+407 0000000000000b4c movl %esi, %ecx 0000000000000b4e movq %rsi, 0x238(%rsp) 0000000000000b56 imull %r15d, %ecx 0000000000000b5a addl 0x21c(%rsp), %ecx 0000000000000b61 shll $0x2, %ecx 0000000000000b64 movslq %ecx, %rcx 0000000000000b67 addq 0x1f0(%rsp), %rcx 0000000000000b6f leaq _ULM_dgemm_nn(%r14,%rcx,8), %rcx 0000000000000b73 movsd 0x1d0(%rsp), %xmm0 0000000000000b7c movsd %xmm0, 0x280(%rsp) 0000000000000b85 movq %rax, 0x278(%rsp) 0000000000000b8d movq 0x220(%rsp), %rax 0000000000000b95 movq %rax, 0x270(%rsp) 0000000000000b9d movsd 0x1e8(%rsp), %xmm0 0000000000000ba6 movsd %xmm0, 0x268(%rsp) 0000000000000baf movq %rcx, 0x260(%rsp) 0000000000000bb7 movq 0x1c8(%rsp), %rax 0000000000000bbf movq %rax, 0x258(%rsp) 0000000000000bc7 movq 0x1c0(%rsp), %rax 0000000000000bcf movq %rax, 0x250(%rsp) 0000000000000bd7 movq 0x1d8(%rsp), %rax 0000000000000bdf movq %rax, 0x248(%rsp) 0000000000000be7 movq 0x1e0(%rsp), %rax 0000000000000bef movq %rax, 0x240(%rsp) 0000000000000bf7 movq 0x248(%rsp), %rsi 0000000000000bff movq 0x240(%rsp), %rdi 0000000000000c07 movq 0x278(%rsp), %rax 0000000000000c0f movq 0x270(%rsp), %rbx 0000000000000c17 movaps _ULM_dgemm_nn(%rax), %xmm0 0000000000000c1a movaps 0x10(%rax), %xmm1 0000000000000c1e movaps _ULM_dgemm_nn(%rbx), %xmm2 0000000000000c21 xorpd %xmm8, %xmm8 0000000000000c26 xorpd %xmm9, %xmm9 0000000000000c2b xorpd %xmm10, %xmm10 0000000000000c30 xorpd %xmm11, %xmm11 0000000000000c35 xorpd %xmm12, %xmm12 0000000000000c3a xorpd %xmm13, %xmm13 0000000000000c3f xorpd %xmm14, %xmm14 0000000000000c44 xorpd %xmm15, %xmm15 0000000000000c49 xorpd %xmm3, %xmm3 0000000000000c4d xorpd %xmm4, %xmm4 0000000000000c51 xorpd %xmm5, %xmm5 0000000000000c55 xorpd %xmm6, %xmm6 0000000000000c59 xorpd %xmm7, %xmm7 0000000000000c5d testq %rdi, %rdi 0000000000000c60 testq %rsi, %rsi 0000000000000c63 je .DCONSIDERLEFT0 .DLOOP0: 0000000000000c69 addpd %xmm3, %xmm12 0000000000000c6e movaps 0x10(%rbx), %xmm3 0000000000000c72 addpd %xmm6, %xmm13 0000000000000c77 movaps %xmm2, %xmm6 0000000000000c7a pshufd $0x4e, %xmm2, %xmm4 0000000000000c7f mulpd %xmm0, %xmm2 0000000000000c83 mulpd %xmm1, %xmm6 0000000000000c87 addpd %xmm5, %xmm14 0000000000000c8c addpd %xmm7, %xmm15 0000000000000c91 movaps %xmm4, %xmm7 0000000000000c94 mulpd %xmm0, %xmm4 0000000000000c98 mulpd %xmm1, %xmm7 0000000000000c9c addpd %xmm2, %xmm8 0000000000000ca1 movaps 0x20(%rbx), %xmm2 0000000000000ca5 addpd %xmm6, %xmm9 0000000000000caa movaps %xmm3, %xmm6 0000000000000cad pshufd $0x4e, %xmm3, %xmm5 0000000000000cb2 mulpd %xmm0, %xmm3 0000000000000cb6 mulpd %xmm1, %xmm6 0000000000000cba addpd %xmm4, %xmm10 0000000000000cbf addpd %xmm7, %xmm11 0000000000000cc4 movaps %xmm5, %xmm7 0000000000000cc7 mulpd %xmm0, %xmm5 0000000000000ccb movaps 0x20(%rax), %xmm0 0000000000000ccf mulpd %xmm1, %xmm7 0000000000000cd3 movaps 0x30(%rax), %xmm1 0000000000000cd7 addpd %xmm3, %xmm12 0000000000000cdc movaps 0x30(%rbx), %xmm3 0000000000000ce0 addpd %xmm6, %xmm13 0000000000000ce5 movaps %xmm2, %xmm6 0000000000000ce8 pshufd $0x4e, %xmm2, %xmm4 0000000000000ced mulpd %xmm0, %xmm2 0000000000000cf1 mulpd %xmm1, %xmm6 0000000000000cf5 addpd %xmm5, %xmm14 0000000000000cfa addpd %xmm7, %xmm15 0000000000000cff movaps %xmm4, %xmm7 0000000000000d02 mulpd %xmm0, %xmm4 0000000000000d06 mulpd %xmm1, %xmm7 0000000000000d0a addpd %xmm2, %xmm8 0000000000000d0f movaps 0x40(%rbx), %xmm2 0000000000000d13 addpd %xmm6, %xmm9 0000000000000d18 movaps %xmm3, %xmm6 0000000000000d1b pshufd $0x4e, %xmm3, %xmm5 0000000000000d20 mulpd %xmm0, %xmm3 0000000000000d24 mulpd %xmm1, %xmm6 0000000000000d28 addpd %xmm4, %xmm10 0000000000000d2d addpd %xmm7, %xmm11 0000000000000d32 movaps %xmm5, %xmm7 0000000000000d35 mulpd %xmm0, %xmm5 0000000000000d39 movaps 0x40(%rax), %xmm0 0000000000000d3d mulpd %xmm1, %xmm7 0000000000000d41 movaps 0x50(%rax), %xmm1 0000000000000d45 addpd %xmm3, %xmm12 0000000000000d4a movaps 0x50(%rbx), %xmm3 0000000000000d4e addpd %xmm6, %xmm13 0000000000000d53 movaps %xmm2, %xmm6 0000000000000d56 pshufd $0x4e, %xmm2, %xmm4 0000000000000d5b mulpd %xmm0, %xmm2 0000000000000d5f mulpd %xmm1, %xmm6 0000000000000d63 addpd %xmm5, %xmm14 0000000000000d68 addpd %xmm7, %xmm15 0000000000000d6d movaps %xmm4, %xmm7 0000000000000d70 mulpd %xmm0, %xmm4 0000000000000d74 mulpd %xmm1, %xmm7 0000000000000d78 addpd %xmm2, %xmm8 0000000000000d7d movaps 0x60(%rbx), %xmm2 0000000000000d81 addpd %xmm6, %xmm9 0000000000000d86 movaps %xmm3, %xmm6 0000000000000d89 pshufd $0x4e, %xmm3, %xmm5 0000000000000d8e mulpd %xmm0, %xmm3 0000000000000d92 mulpd %xmm1, %xmm6 0000000000000d96 addpd %xmm4, %xmm10 0000000000000d9b addpd %xmm7, %xmm11 0000000000000da0 movaps %xmm5, %xmm7 0000000000000da3 mulpd %xmm0, %xmm5 0000000000000da7 movaps 0x60(%rax), %xmm0 0000000000000dab mulpd %xmm1, %xmm7 0000000000000daf movaps 0x70(%rax), %xmm1 0000000000000db3 addpd %xmm3, %xmm12 0000000000000db8 movaps 0x70(%rbx), %xmm3 0000000000000dbc addpd %xmm6, %xmm13 0000000000000dc1 movaps %xmm2, %xmm6 0000000000000dc4 pshufd $0x4e, %xmm2, %xmm4 0000000000000dc9 mulpd %xmm0, %xmm2 0000000000000dcd mulpd %xmm1, %xmm6 0000000000000dd1 addq $0x80, %rax 0000000000000dd7 addpd %xmm5, %xmm14 0000000000000ddc addpd %xmm7, %xmm15 0000000000000de1 movaps %xmm4, %xmm7 0000000000000de4 mulpd %xmm0, %xmm4 0000000000000de8 mulpd %xmm1, %xmm7 0000000000000dec addpd %xmm2, %xmm8 0000000000000df1 movaps 0x80(%rbx), %xmm2 0000000000000df8 addpd %xmm6, %xmm9 0000000000000dfd movaps %xmm3, %xmm6 0000000000000e00 pshufd $0x4e, %xmm3, %xmm5 0000000000000e05 mulpd %xmm0, %xmm3 0000000000000e09 mulpd %xmm1, %xmm6 0000000000000e0d addq $0x80, %rbx 0000000000000e14 addpd %xmm4, %xmm10 0000000000000e19 addpd %xmm7, %xmm11 0000000000000e1e movaps %xmm5, %xmm7 0000000000000e21 mulpd %xmm0, %xmm5 0000000000000e25 movaps _ULM_dgemm_nn(%rax), %xmm0 0000000000000e28 mulpd %xmm1, %xmm7 0000000000000e2c movaps 0x10(%rax), %xmm1 0000000000000e30 decq %rsi 0000000000000e33 jne .DLOOP0 .DCONSIDERLEFT0: 0000000000000e39 testq %rdi, %rdi 0000000000000e3c je .DPOSTACCUMULATE0 .DLOOPLEFT0: 0000000000000e42 addpd %xmm3, %xmm12 0000000000000e47 movapd 0x10(%rbx), %xmm3 0000000000000e4c addpd %xmm6, %xmm13 0000000000000e51 movapd %xmm2, %xmm6 0000000000000e55 pshufd $0x4e, %xmm2, %xmm4 0000000000000e5a mulpd %xmm0, %xmm2 0000000000000e5e mulpd %xmm1, %xmm6 0000000000000e62 addpd %xmm5, %xmm14 0000000000000e67 addpd %xmm7, %xmm15 0000000000000e6c movapd %xmm4, %xmm7 0000000000000e70 mulpd %xmm0, %xmm4 0000000000000e74 mulpd %xmm1, %xmm7 0000000000000e78 addpd %xmm2, %xmm8 0000000000000e7d movapd 0x20(%rbx), %xmm2 0000000000000e82 addpd %xmm6, %xmm9 0000000000000e87 movapd %xmm3, %xmm6 0000000000000e8b pshufd $0x4e, %xmm3, %xmm5 0000000000000e90 mulpd %xmm0, %xmm3 0000000000000e94 mulpd %xmm1, %xmm6 0000000000000e98 addpd %xmm4, %xmm10 0000000000000e9d addpd %xmm7, %xmm11 0000000000000ea2 movapd %xmm5, %xmm7 0000000000000ea6 mulpd %xmm0, %xmm5 0000000000000eaa movapd 0x20(%rax), %xmm0 0000000000000eaf mulpd %xmm1, %xmm7 0000000000000eb3 movapd 0x30(%rax), %xmm1 0000000000000eb8 addq $0x20, %rax 0000000000000ebc addq $0x20, %rbx 0000000000000ec0 decq %rdi 0000000000000ec3 jne .DLOOPLEFT0 .DPOSTACCUMULATE0: 0000000000000ec9 addpd %xmm3, %xmm12 0000000000000ece addpd %xmm6, %xmm13 0000000000000ed3 addpd %xmm5, %xmm14 0000000000000ed8 addpd %xmm7, %xmm15 0000000000000edd movsd 0x280(%rsp), %xmm0 0000000000000ee6 movsd 0x268(%rsp), %xmm1 0000000000000eef movq 0x260(%rsp), %rcx 0000000000000ef7 movq 0x258(%rsp), %r8 0000000000000eff leaq _ULM_dgemm_nn(,%r8,8), %r8 0000000000000f07 movq 0x250(%rsp), %r9 0000000000000f0f leaq _ULM_dgemm_nn(,%r9,8), %r9 0000000000000f17 leaq _ULM_dgemm_nn(%rcx,%r9), %r10 0000000000000f1b leaq _ULM_dgemm_nn(%rcx,%r8,2), %rdx 0000000000000f1f leaq _ULM_dgemm_nn(%rdx,%r9), %r11 0000000000000f23 unpcklpd %xmm0, %xmm0 0000000000000f27 unpcklpd %xmm1, %xmm1 0000000000000f2b movlpd _ULM_dgemm_nn(%rcx), %xmm3 0000000000000f2f movhpd _ULM_dgemm_nn(%r10,%r8), %xmm3 0000000000000f35 mulpd %xmm0, %xmm8 0000000000000f3a mulpd %xmm1, %xmm3 0000000000000f3e addpd %xmm8, %xmm3 0000000000000f43 movlpd %xmm3, _ULM_dgemm_nn(%rcx) 0000000000000f47 movhpd %xmm3, _ULM_dgemm_nn(%r10,%r8) 0000000000000f4d movlpd _ULM_dgemm_nn(%rdx), %xmm4 0000000000000f51 movhpd _ULM_dgemm_nn(%r11,%r8), %xmm4 0000000000000f57 mulpd %xmm0, %xmm9 0000000000000f5c mulpd %xmm1, %xmm4 0000000000000f60 addpd %xmm9, %xmm4 0000000000000f65 movlpd %xmm4, _ULM_dgemm_nn(%rdx) 0000000000000f69 movhpd %xmm4, _ULM_dgemm_nn(%r11,%r8) 0000000000000f6f movlpd _ULM_dgemm_nn(%r10), %xmm3 0000000000000f74 movhpd _ULM_dgemm_nn(%rcx,%r8), %xmm3 0000000000000f7a mulpd %xmm0, %xmm10 0000000000000f7f mulpd %xmm1, %xmm3 0000000000000f83 addpd %xmm10, %xmm3 0000000000000f88 movlpd %xmm3, _ULM_dgemm_nn(%r10) 0000000000000f8d movhpd %xmm3, _ULM_dgemm_nn(%rcx,%r8) 0000000000000f93 movlpd _ULM_dgemm_nn(%r11), %xmm4 0000000000000f98 movhpd _ULM_dgemm_nn(%rdx,%r8), %xmm4 0000000000000f9e mulpd %xmm0, %xmm11 0000000000000fa3 mulpd %xmm1, %xmm4 0000000000000fa7 addpd %xmm11, %xmm4 0000000000000fac movlpd %xmm4, _ULM_dgemm_nn(%r11) 0000000000000fb1 movhpd %xmm4, _ULM_dgemm_nn(%rdx,%r8) 0000000000000fb7 leaq _ULM_dgemm_nn(%rcx,%r9,2), %rcx 0000000000000fbb leaq _ULM_dgemm_nn(%r10,%r9,2), %r10 0000000000000fbf leaq _ULM_dgemm_nn(%rdx,%r9,2), %rdx 0000000000000fc3 leaq _ULM_dgemm_nn(%r11,%r9,2), %r11 0000000000000fc7 movlpd _ULM_dgemm_nn(%rcx), %xmm3 0000000000000fcb movhpd _ULM_dgemm_nn(%r10,%r8), %xmm3 0000000000000fd1 mulpd %xmm0, %xmm12 0000000000000fd6 mulpd %xmm1, %xmm3 0000000000000fda addpd %xmm12, %xmm3 0000000000000fdf movlpd %xmm3, _ULM_dgemm_nn(%rcx) 0000000000000fe3 movhpd %xmm3, _ULM_dgemm_nn(%r10,%r8) 0000000000000fe9 movlpd _ULM_dgemm_nn(%rdx), %xmm4 0000000000000fed movhpd _ULM_dgemm_nn(%r11,%r8), %xmm4 0000000000000ff3 mulpd %xmm0, %xmm13 0000000000000ff8 mulpd %xmm1, %xmm4 0000000000000ffc addpd %xmm13, %xmm4 0000000000001001 movlpd %xmm4, _ULM_dgemm_nn(%rdx) 0000000000001005 movhpd %xmm4, _ULM_dgemm_nn(%r11,%r8) 000000000000100b movlpd _ULM_dgemm_nn(%r10), %xmm3 0000000000001010 movhpd _ULM_dgemm_nn(%rcx,%r8), %xmm3 0000000000001016 mulpd %xmm0, %xmm14 000000000000101b mulpd %xmm1, %xmm3 000000000000101f addpd %xmm14, %xmm3 0000000000001024 movlpd %xmm3, _ULM_dgemm_nn(%r10) 0000000000001029 movhpd %xmm3, _ULM_dgemm_nn(%rcx,%r8) 000000000000102f movlpd _ULM_dgemm_nn(%r11), %xmm4 0000000000001034 movhpd _ULM_dgemm_nn(%rdx,%r8), %xmm4 000000000000103a mulpd %xmm0, %xmm15 000000000000103f mulpd %xmm1, %xmm4 0000000000001043 addpd %xmm15, %xmm4 0000000000001048 movlpd %xmm4, _ULM_dgemm_nn(%r11) 000000000000104d movhpd %xmm4, _ULM_dgemm_nn(%rdx,%r8) 0000000000001053 movl 0x2e8(%rsp), %r9d 000000000000105b jmpq .DPOSTACCUMULATE1+772 0000000000001060 testl %r13d, %r13d 0000000000001063 setg 0x238(%rsp) 000000000000106b movsd 0x1d0(%rsp), %xmm0 0000000000001074 movsd %xmm0, 0x280(%rsp) 000000000000107d movq %rax, 0x278(%rsp) 0000000000001085 movq 0x220(%rsp), %rax 000000000000108d movq %rax, 0x270(%rsp) 0000000000001095 movq $_ULM_dgemm_nn, 0x268(%rsp) 00000000000010a1 leaq __C(%rip), %rax 00000000000010a8 movq %rax, 0x260(%rsp) 00000000000010b0 movq $0x1, 0x258(%rsp) 00000000000010bc movq $0x4, 0x250(%rsp) 00000000000010c8 movq 0x1d8(%rsp), %rax 00000000000010d0 movq %rax, 0x248(%rsp) 00000000000010d8 movq 0x1e0(%rsp), %rax 00000000000010e0 movq %rax, 0x240(%rsp) 00000000000010e8 movq %rsi, %rbp 00000000000010eb movq 0x248(%rsp), %rsi 00000000000010f3 movq 0x240(%rsp), %rdi 00000000000010fb movq 0x278(%rsp), %rax 0000000000001103 movq 0x270(%rsp), %rbx 000000000000110b movaps _ULM_dgemm_nn(%rax), %xmm0 000000000000110e movaps 0x10(%rax), %xmm1 0000000000001112 movaps _ULM_dgemm_nn(%rbx), %xmm2 0000000000001115 xorpd %xmm8, %xmm8 000000000000111a xorpd %xmm9, %xmm9 000000000000111f xorpd %xmm10, %xmm10 0000000000001124 xorpd %xmm11, %xmm11 0000000000001129 xorpd %xmm12, %xmm12 000000000000112e xorpd %xmm13, %xmm13 0000000000001133 xorpd %xmm14, %xmm14 0000000000001138 xorpd %xmm15, %xmm15 000000000000113d xorpd %xmm3, %xmm3 0000000000001141 xorpd %xmm4, %xmm4 0000000000001145 xorpd %xmm5, %xmm5 0000000000001149 xorpd %xmm6, %xmm6 000000000000114d xorpd %xmm7, %xmm7 0000000000001151 testq %rdi, %rdi 0000000000001154 testq %rsi, %rsi 0000000000001157 je .DCONSIDERLEFT1 .DLOOP1: 000000000000115d addpd %xmm3, %xmm12 0000000000001162 movaps 0x10(%rbx), %xmm3 0000000000001166 addpd %xmm6, %xmm13 000000000000116b movaps %xmm2, %xmm6 000000000000116e pshufd $0x4e, %xmm2, %xmm4 0000000000001173 mulpd %xmm0, %xmm2 0000000000001177 mulpd %xmm1, %xmm6 000000000000117b addpd %xmm5, %xmm14 0000000000001180 addpd %xmm7, %xmm15 0000000000001185 movaps %xmm4, %xmm7 0000000000001188 mulpd %xmm0, %xmm4 000000000000118c mulpd %xmm1, %xmm7 0000000000001190 addpd %xmm2, %xmm8 0000000000001195 movaps 0x20(%rbx), %xmm2 0000000000001199 addpd %xmm6, %xmm9 000000000000119e movaps %xmm3, %xmm6 00000000000011a1 pshufd $0x4e, %xmm3, %xmm5 00000000000011a6 mulpd %xmm0, %xmm3 00000000000011aa mulpd %xmm1, %xmm6 00000000000011ae addpd %xmm4, %xmm10 00000000000011b3 addpd %xmm7, %xmm11 00000000000011b8 movaps %xmm5, %xmm7 00000000000011bb mulpd %xmm0, %xmm5 00000000000011bf movaps 0x20(%rax), %xmm0 00000000000011c3 mulpd %xmm1, %xmm7 00000000000011c7 movaps 0x30(%rax), %xmm1 00000000000011cb addpd %xmm3, %xmm12 00000000000011d0 movaps 0x30(%rbx), %xmm3 00000000000011d4 addpd %xmm6, %xmm13 00000000000011d9 movaps %xmm2, %xmm6 00000000000011dc pshufd $0x4e, %xmm2, %xmm4 00000000000011e1 mulpd %xmm0, %xmm2 00000000000011e5 mulpd %xmm1, %xmm6 00000000000011e9 addpd %xmm5, %xmm14 00000000000011ee addpd %xmm7, %xmm15 00000000000011f3 movaps %xmm4, %xmm7 00000000000011f6 mulpd %xmm0, %xmm4 00000000000011fa mulpd %xmm1, %xmm7 00000000000011fe addpd %xmm2, %xmm8 0000000000001203 movaps 0x40(%rbx), %xmm2 0000000000001207 addpd %xmm6, %xmm9 000000000000120c movaps %xmm3, %xmm6 000000000000120f pshufd $0x4e, %xmm3, %xmm5 0000000000001214 mulpd %xmm0, %xmm3 0000000000001218 mulpd %xmm1, %xmm6 000000000000121c addpd %xmm4, %xmm10 0000000000001221 addpd %xmm7, %xmm11 0000000000001226 movaps %xmm5, %xmm7 0000000000001229 mulpd %xmm0, %xmm5 000000000000122d movaps 0x40(%rax), %xmm0 0000000000001231 mulpd %xmm1, %xmm7 0000000000001235 movaps 0x50(%rax), %xmm1 0000000000001239 addpd %xmm3, %xmm12 000000000000123e movaps 0x50(%rbx), %xmm3 0000000000001242 addpd %xmm6, %xmm13 0000000000001247 movaps %xmm2, %xmm6 000000000000124a pshufd $0x4e, %xmm2, %xmm4 000000000000124f mulpd %xmm0, %xmm2 0000000000001253 mulpd %xmm1, %xmm6 0000000000001257 addpd %xmm5, %xmm14 000000000000125c addpd %xmm7, %xmm15 0000000000001261 movaps %xmm4, %xmm7 0000000000001264 mulpd %xmm0, %xmm4 0000000000001268 mulpd %xmm1, %xmm7 000000000000126c addpd %xmm2, %xmm8 0000000000001271 movaps 0x60(%rbx), %xmm2 0000000000001275 addpd %xmm6, %xmm9 000000000000127a movaps %xmm3, %xmm6 000000000000127d pshufd $0x4e, %xmm3, %xmm5 0000000000001282 mulpd %xmm0, %xmm3 0000000000001286 mulpd %xmm1, %xmm6 000000000000128a addpd %xmm4, %xmm10 000000000000128f addpd %xmm7, %xmm11 0000000000001294 movaps %xmm5, %xmm7 0000000000001297 mulpd %xmm0, %xmm5 000000000000129b movaps 0x60(%rax), %xmm0 000000000000129f mulpd %xmm1, %xmm7 00000000000012a3 movaps 0x70(%rax), %xmm1 00000000000012a7 addpd %xmm3, %xmm12 00000000000012ac movaps 0x70(%rbx), %xmm3 00000000000012b0 addpd %xmm6, %xmm13 00000000000012b5 movaps %xmm2, %xmm6 00000000000012b8 pshufd $0x4e, %xmm2, %xmm4 00000000000012bd mulpd %xmm0, %xmm2 00000000000012c1 mulpd %xmm1, %xmm6 00000000000012c5 addq $0x80, %rax 00000000000012cb addpd %xmm5, %xmm14 00000000000012d0 addpd %xmm7, %xmm15 00000000000012d5 movaps %xmm4, %xmm7 00000000000012d8 mulpd %xmm0, %xmm4 00000000000012dc mulpd %xmm1, %xmm7 00000000000012e0 addpd %xmm2, %xmm8 00000000000012e5 movaps 0x80(%rbx), %xmm2 00000000000012ec addpd %xmm6, %xmm9 00000000000012f1 movaps %xmm3, %xmm6 00000000000012f4 pshufd $0x4e, %xmm3, %xmm5 00000000000012f9 mulpd %xmm0, %xmm3 00000000000012fd mulpd %xmm1, %xmm6 0000000000001301 addq $0x80, %rbx 0000000000001308 addpd %xmm4, %xmm10 000000000000130d addpd %xmm7, %xmm11 0000000000001312 movaps %xmm5, %xmm7 0000000000001315 mulpd %xmm0, %xmm5 0000000000001319 movaps _ULM_dgemm_nn(%rax), %xmm0 000000000000131c mulpd %xmm1, %xmm7 0000000000001320 movaps 0x10(%rax), %xmm1 0000000000001324 decq %rsi 0000000000001327 jne .DLOOP1 .DCONSIDERLEFT1: 000000000000132d testq %rdi, %rdi 0000000000001330 je .DPOSTACCUMULATE1 .DLOOPLEFT1: 0000000000001336 addpd %xmm3, %xmm12 000000000000133b movapd 0x10(%rbx), %xmm3 0000000000001340 addpd %xmm6, %xmm13 0000000000001345 movapd %xmm2, %xmm6 0000000000001349 pshufd $0x4e, %xmm2, %xmm4 000000000000134e mulpd %xmm0, %xmm2 0000000000001352 mulpd %xmm1, %xmm6 0000000000001356 addpd %xmm5, %xmm14 000000000000135b addpd %xmm7, %xmm15 0000000000001360 movapd %xmm4, %xmm7 0000000000001364 mulpd %xmm0, %xmm4 0000000000001368 mulpd %xmm1, %xmm7 000000000000136c addpd %xmm2, %xmm8 0000000000001371 movapd 0x20(%rbx), %xmm2 0000000000001376 addpd %xmm6, %xmm9 000000000000137b movapd %xmm3, %xmm6 000000000000137f pshufd $0x4e, %xmm3, %xmm5 0000000000001384 mulpd %xmm0, %xmm3 0000000000001388 mulpd %xmm1, %xmm6 000000000000138c addpd %xmm4, %xmm10 0000000000001391 addpd %xmm7, %xmm11 0000000000001396 movapd %xmm5, %xmm7 000000000000139a mulpd %xmm0, %xmm5 000000000000139e movapd 0x20(%rax), %xmm0 00000000000013a3 mulpd %xmm1, %xmm7 00000000000013a7 movapd 0x30(%rax), %xmm1 00000000000013ac addq $0x20, %rax 00000000000013b0 addq $0x20, %rbx 00000000000013b4 decq %rdi 00000000000013b7 jne .DLOOPLEFT1 .DPOSTACCUMULATE1: 00000000000013bd addpd %xmm3, %xmm12 00000000000013c2 addpd %xmm6, %xmm13 00000000000013c7 addpd %xmm5, %xmm14 00000000000013cc addpd %xmm7, %xmm15 00000000000013d1 movsd 0x280(%rsp), %xmm0 00000000000013da movsd 0x268(%rsp), %xmm1 00000000000013e3 movq 0x260(%rsp), %rcx 00000000000013eb movq 0x258(%rsp), %r8 00000000000013f3 leaq _ULM_dgemm_nn(,%r8,8), %r8 00000000000013fb movq 0x250(%rsp), %r9 0000000000001403 leaq _ULM_dgemm_nn(,%r9,8), %r9 000000000000140b leaq _ULM_dgemm_nn(%rcx,%r9), %r10 000000000000140f leaq _ULM_dgemm_nn(%rcx,%r8,2), %rdx 0000000000001413 leaq _ULM_dgemm_nn(%rdx,%r9), %r11 0000000000001417 unpcklpd %xmm0, %xmm0 000000000000141b unpcklpd %xmm1, %xmm1 000000000000141f movlpd _ULM_dgemm_nn(%rcx), %xmm3 0000000000001423 movhpd _ULM_dgemm_nn(%r10,%r8), %xmm3 0000000000001429 mulpd %xmm0, %xmm8 000000000000142e mulpd %xmm1, %xmm3 0000000000001432 addpd %xmm8, %xmm3 0000000000001437 movlpd %xmm3, _ULM_dgemm_nn(%rcx) 000000000000143b movhpd %xmm3, _ULM_dgemm_nn(%r10,%r8) 0000000000001441 movlpd _ULM_dgemm_nn(%rdx), %xmm4 0000000000001445 movhpd _ULM_dgemm_nn(%r11,%r8), %xmm4 000000000000144b mulpd %xmm0, %xmm9 0000000000001450 mulpd %xmm1, %xmm4 0000000000001454 addpd %xmm9, %xmm4 0000000000001459 movlpd %xmm4, _ULM_dgemm_nn(%rdx) 000000000000145d movhpd %xmm4, _ULM_dgemm_nn(%r11,%r8) 0000000000001463 movlpd _ULM_dgemm_nn(%r10), %xmm3 0000000000001468 movhpd _ULM_dgemm_nn(%rcx,%r8), %xmm3 000000000000146e mulpd %xmm0, %xmm10 0000000000001473 mulpd %xmm1, %xmm3 0000000000001477 addpd %xmm10, %xmm3 000000000000147c movlpd %xmm3, _ULM_dgemm_nn(%r10) 0000000000001481 movhpd %xmm3, _ULM_dgemm_nn(%rcx,%r8) 0000000000001487 movlpd _ULM_dgemm_nn(%r11), %xmm4 000000000000148c movhpd _ULM_dgemm_nn(%rdx,%r8), %xmm4 0000000000001492 mulpd %xmm0, %xmm11 0000000000001497 mulpd %xmm1, %xmm4 000000000000149b addpd %xmm11, %xmm4 00000000000014a0 movlpd %xmm4, _ULM_dgemm_nn(%r11) 00000000000014a5 movhpd %xmm4, _ULM_dgemm_nn(%rdx,%r8) 00000000000014ab leaq _ULM_dgemm_nn(%rcx,%r9,2), %rcx 00000000000014af leaq _ULM_dgemm_nn(%r10,%r9,2), %r10 00000000000014b3 leaq _ULM_dgemm_nn(%rdx,%r9,2), %rdx 00000000000014b7 leaq _ULM_dgemm_nn(%r11,%r9,2), %r11 00000000000014bb movlpd _ULM_dgemm_nn(%rcx), %xmm3 00000000000014bf movhpd _ULM_dgemm_nn(%r10,%r8), %xmm3 00000000000014c5 mulpd %xmm0, %xmm12 00000000000014ca mulpd %xmm1, %xmm3 00000000000014ce addpd %xmm12, %xmm3 00000000000014d3 movlpd %xmm3, _ULM_dgemm_nn(%rcx) 00000000000014d7 movhpd %xmm3, _ULM_dgemm_nn(%r10,%r8) 00000000000014dd movlpd _ULM_dgemm_nn(%rdx), %xmm4 00000000000014e1 movhpd _ULM_dgemm_nn(%r11,%r8), %xmm4 00000000000014e7 mulpd %xmm0, %xmm13 00000000000014ec mulpd %xmm1, %xmm4 00000000000014f0 addpd %xmm13, %xmm4 00000000000014f5 movlpd %xmm4, _ULM_dgemm_nn(%rdx) 00000000000014f9 movhpd %xmm4, _ULM_dgemm_nn(%r11,%r8) 00000000000014ff movlpd _ULM_dgemm_nn(%r10), %xmm3 0000000000001504 movhpd _ULM_dgemm_nn(%rcx,%r8), %xmm3 000000000000150a mulpd %xmm0, %xmm14 000000000000150f mulpd %xmm1, %xmm3 0000000000001513 addpd %xmm14, %xmm3 0000000000001518 movlpd %xmm3, _ULM_dgemm_nn(%r10) 000000000000151d movhpd %xmm3, _ULM_dgemm_nn(%rcx,%r8) 0000000000001523 movlpd _ULM_dgemm_nn(%r11), %xmm4 0000000000001528 movhpd _ULM_dgemm_nn(%rdx,%r8), %xmm4 000000000000152e mulpd %xmm0, %xmm15 0000000000001533 mulpd %xmm1, %xmm4 0000000000001537 addpd %xmm15, %xmm4 000000000000153c movlpd %xmm4, _ULM_dgemm_nn(%r11) 0000000000001541 movhpd %xmm4, _ULM_dgemm_nn(%rdx,%r8) 0000000000001547 movq %rbp, %rdx 000000000000154a movl %edx, %eax 000000000000154c imull %r15d, %eax 0000000000001550 addl 0x21c(%rsp), %eax 0000000000001557 shll $0x2, %eax 000000000000155a testl %r12d, %r12d 000000000000155d setg %cl 0000000000001560 andb 0x238(%rsp), %cl 0000000000001567 cltq 0000000000001569 movsd 0x1e8(%rsp), %xmm1 0000000000001572 ucomisd 0x26e(%rip), %xmm1 000000000000157a jne 0x1580 000000000000157c jp 0x1580 000000000000157e jmp 0x15f0 0000000000001580 movq %rdx, 0x238(%rsp) 0000000000001588 testb %cl, %cl 000000000000158a movl 0x2e8(%rsp), %r9d 0000000000001592 je 0x1650 0000000000001598 movq 0x1f0(%rsp), %rcx 00000000000015a0 leaq _ULM_dgemm_nn(%rax,%rcx), %rcx 00000000000015a4 xorl %edx, %edx 00000000000015a6 xorl %esi, %esi 00000000000015a8 nopl _ULM_dgemm_nn(%rax,%rax) 00000000000015b0 movl %edx, %edi 00000000000015b2 movl %r12d, %ebp 00000000000015b5 nopw %cs:_ULM_dgemm_nn(%rax,%rax) 00000000000015c0 movslq %edi, %rdi 00000000000015c3 leaq _ULM_dgemm_nn(%rcx,%rdi), %rbx 00000000000015c7 movsd _ULM_dgemm_nn(%r14,%rbx,8), %xmm0 00000000000015cd mulsd %xmm1, %xmm0 00000000000015d1 movsd %xmm0, _ULM_dgemm_nn(%r14,%rbx,8) 00000000000015d7 addl %r15d, %edi 00000000000015da decl %ebp 00000000000015dc jne 0x15c0 00000000000015de incl %esi 00000000000015e0 addl %r9d, %edx 00000000000015e3 cmpl %r13d, %esi 00000000000015e6 jne 0x15b0 00000000000015e8 jmp 0x1650 00000000000015ea nopw _ULM_dgemm_nn(%rax,%rax) 00000000000015f0 movq %rdx, 0x238(%rsp) 00000000000015f8 testb %cl, %cl 00000000000015fa movl 0x2e8(%rsp), %r9d 0000000000001602 je 0x1650 0000000000001604 movq 0x1f0(%rsp), %rcx 000000000000160c leaq _ULM_dgemm_nn(%rax,%rcx), %rcx 0000000000001610 xorl %edx, %edx 0000000000001612 xorl %esi, %esi 0000000000001614 nopw %cs:_ULM_dgemm_nn(%rax,%rax) 0000000000001620 movl %edx, %edi 0000000000001622 movl %r12d, %ebp 0000000000001625 nopw %cs:_ULM_dgemm_nn(%rax,%rax) 0000000000001630 movslq %edi, %rdi 0000000000001633 leaq _ULM_dgemm_nn(%rcx,%rdi), %rbx 0000000000001637 movq $_ULM_dgemm_nn, _ULM_dgemm_nn(%r14,%rbx,8) 000000000000163f addl %r15d, %edi 0000000000001642 decl %ebp 0000000000001644 jne 0x1630 0000000000001646 incl %esi 0000000000001648 addl %r9d, %edx 000000000000164b cmpl %r13d, %esi 000000000000164e jne 0x1620 0000000000001650 testl %r13d, %r13d 0000000000001653 jle 0x16c1 0000000000001655 addq 0x1f0(%rsp), %rax 000000000000165d xorl %r8d, %r8d 0000000000001660 xorl %edx, %edx 0000000000001662 xorl %esi, %esi 0000000000001664 nopw %cs:_ULM_dgemm_nn(%rax,%rax) 0000000000001670 testl %r12d, %r12d 0000000000001673 jle 0x16b2 0000000000001675 movslq %r8d, %rdi 0000000000001678 leaq __C(%rip), %rcx 000000000000167f leaq _ULM_dgemm_nn(%rcx,%rdi,8), %rdi 0000000000001683 movl %edx, %ebx 0000000000001685 movl %r12d, %ebp 0000000000001688 nopl _ULM_dgemm_nn(%rax,%rax) 0000000000001690 movsd _ULM_dgemm_nn(%rdi), %xmm0 0000000000001694 movslq %ebx, %rbx 0000000000001697 leaq _ULM_dgemm_nn(%rax,%rbx), %rcx 000000000000169b addsd _ULM_dgemm_nn(%r14,%rcx,8), %xmm0 00000000000016a1 movsd %xmm0, _ULM_dgemm_nn(%r14,%rcx,8) 00000000000016a7 addl %r15d, %ebx 00000000000016aa addq $0x8, %rdi 00000000000016ae decl %ebp 00000000000016b0 jne 0x1690 00000000000016b2 incq %rsi 00000000000016b5 addl %r9d, %edx 00000000000016b8 addl $0x4, %r8d 00000000000016bc cmpl %r13d, %esi 00000000000016bf jne 0x1670 00000000000016c1 movq 0x238(%rsp), %rsi 00000000000016c9 incq %rsi 00000000000016cc movq 0x200(%rsp), %rax 00000000000016d4 cmpl %eax, %esi 00000000000016d6 movsd 0x208(%rsp), %xmm1 00000000000016df jl _ULM_dgemm_nn+2816 00000000000016e5 movq 0x1b8(%rsp), %rsi 00000000000016ed incq %rsi 00000000000016f0 movq 0x190(%rsp), %rax 00000000000016f8 cmpl %eax, %esi 00000000000016fa movq 0x1a8(%rsp), %rdx 0000000000001702 jl _ULM_dgemm_nn+2688 0000000000001708 movq 0x158(%rsp), %rdi 0000000000001710 incq %rdi 0000000000001713 movl 0x164(%rsp), %esi 000000000000171a addl 0x114(%rsp), %esi 0000000000001721 movq 0x118(%rsp), %rax 0000000000001729 cmpl %eax, %edi 000000000000172b movq 0x140(%rsp), %rbp 0000000000001733 movq 0x138(%rsp), %rbx 000000000000173b jl _ULM_dgemm_nn+1888 0000000000001741 movq 0xb0(%rsp), %rdi 0000000000001749 incq %rdi 000000000000174c movl 0xac(%rsp), %r13d 0000000000001754 addl 0x7c(%rsp), %r13d 0000000000001759 movq 0xb8(%rsp), %rax 0000000000001761 addl 0x64(%rsp), %eax 0000000000001765 movq %rax, 0xb8(%rsp) 000000000000176d movq 0x80(%rsp), %rax 0000000000001775 cmpl %eax, %edi 0000000000001777 movl 0x2e8(%rsp), %ebx 000000000000177e movl %ebx, %r12d 0000000000001781 movq 0x88(%rsp), %rsi 0000000000001789 movq 0x70(%rsp), %rdx 000000000000178e movq 0x68(%rsp), %rbx 0000000000001793 movq 0x58(%rsp), %r11 0000000000001798 movq 0x1a0(%rsp), %r8 00000000000017a0 jl _ULM_dgemm_nn+1120 00000000000017a6 movq %rdx, %rdi 00000000000017a9 movl 0x34(%rsp), %ecx 00000000000017ad incl %ecx 00000000000017af movq 0x38(%rsp), %rax 00000000000017b4 addl 0x14(%rsp), %eax 00000000000017b8 movq %rax, 0x38(%rsp) 00000000000017bd movq 0x18(%rsp), %rax 00000000000017c2 cmpl %eax, %ecx 00000000000017c4 jl _ULM_dgemm_nn+832 00000000000017ca addq $0x288, %rsp 00000000000017d1 popq %rbx 00000000000017d2 popq %r12 00000000000017d4 popq %r13 00000000000017d6 popq %r14 00000000000017d8 popq %r15 00000000000017da popq %rbp 00000000000017db ret
So the line right before the .DLOOP0 label is
$shell> otool -dtV dgemm_nn.o | while read line; do if test "$line" = ".DLOOP0:"; then echo $last; fi; last=$line; done
0000000000000c63 je .DCONSIDERLEFT0
And the line with jne .DLOOP0 is
$shell> otool -dtV dgemm_nn.o | grep ".DLOOP0$"
0000000000000e33 jne .DLOOP0
So the code in between takes
$shell> BEGIN=`otool -dtV dgemm_nn.o | while read line; do if test "$line" = ".DLOOP0:"; then echo $last; fi; last=$line; done` $shell> BEGIN=($BEGIN) $shell> BEGIN=${BEGIN[0]} $shell> BEGIN=`echo $BEGIN | tr "a-f" "A-F"` $shell> END=`otool -dtV dgemm_nn.o | grep ".DLOOP0$" | tr "a-f" "A-F"` $shell> END=($END) $shell> END=${END[0]} $shell> END=`echo $END | tr "a-f" "A-F"` $shell> SIZE=`dc -e "16 i $END $BEGIN - f"` $shell> echo "Code size of loop body (in bytes): $SIZE" Code size of loop body (in bytes): 464
bytes.