First Steps with SSE3 and AVX

How do I find out what my system supports?
A simple example with SSE3
What happens when data is not properly aligned in memory?
Extending the example to AVX
BLAS Level 1 with SSE3 and AVX

What does my machine support?

The command

$shell> curl -s0 http://repnop.org/releases/cpuid-0.2.tar.gz | tar xz && (mv cpuid-0.2 cpuid && cd cpuid && make )
gcc -I. -std=c99 -pedantic -Wall -W -Wundef -Wendif-labels -Wshadow -Wpointer-arith -Wbad-function-cast -Wcast-align -Wwrite-strings -Wstrict-prototypes -Wmissing-prototypes -Wnested-externs -Winline -Wdisabled-optimization -fstrict-aliasing -O2 -pipe -Wno-parentheses -o cpuid cpuid.c tool.c

downloads and compiles a small tool that can then be used to query the hardware:

$shell> cd cpuid                                                            
$shell> ./cpuid -b                                                          
Intel(R) Core(TM)2 Duo CPU     P8600  @ 2.40GHz
$shell> ./cpuid -f                                                          
FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH 
DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 DTES64 MONITOR DSCPL VMX SMX EST 
TM2 SSSE3 CMPXCHG16B XTPR PDCM SSE4.1 NX SYSCALL X86_64

A simple example with SSE3

day07/add.c

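#include <stdio.h>
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_load_pd, _mm_add_pd, ... */

/* Both arrays are 16-byte aligned, as _mm_load_pd and _mm_store_pd require. */
double x[4] __attribute__ ((aligned (16))) = {1.0, 2.0, 3.0, 4.0};
double y[4] __attribute__ ((aligned (16))) = {1.0, 2.0, 3.0, 4.0};

void
foo(void)
{
    /* Each 128-bit register holds two doubles. */
    __m128d x01 = _mm_load_pd(x);
    __m128d x23 = _mm_load_pd(x+2);
    __m128d y01 = _mm_load_pd(y);
    __m128d y23 = _mm_load_pd(y+2);

    y01 = _mm_add_pd(x01, y01);
    y23 = _mm_add_pd(x23, y23);

    /* Write the sums back to the aligned addresses y was loaded from. */
    _mm_store_pd(y, y01);
    _mm_store_pd(y+2, y23);
}

int
main(void)
{
    foo();
    printf("y = (%lf, %lf, %lf, %lf)\n", y[0], y[1], y[2], y[3]);
    return 0;
}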

Exercises:

Compile the program and inspect the machine code with otool. Try different optimization levels (-O0, -Os and -O3), for example gcc-4.8 -Os add.c. Running strip a.out replaces the symbols for the variables with numeric addresses. Work out what is being referenced in the data segment.
Modify the program so that data that is not properly aligned is loaded into the SSE registers. What happens when you run the program? (On my machine this partly depends on the optimization level used for compiling.)
How can you check the alignment of an address? Write a macro isAligned such that, for example, isAligned(x, 16) yields 1 if x is an address with 16-byte alignment and 0 otherwise.

A simple example with AVX

Modify the example above so that it uses AVX.

BLAS Level 1 with SSE3 and AVX

Add an SSE3 or AVX variant to a BLAS Level 1 routine of your choice. Which variant is used should be controlled by a macro:

#if defined(WITH_SSE)
// SSE variant
#elif defined(WITH_AVX)
// AVX variant
#else
// reference implementation
#endif

Remarks:
- This only works in cases where all vector increments are one.
- For daxpy, for example, either both vectors must be aligned or neither.
- Naturally, you should test this with the test and benchmark suite.
- The benchmarks should also be visualized with Gnuplot.

Example with non-aligned vectors

day07/add_unaligned.c

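#include <stdio.h>
#include <emmintrin.h>

/* No alignment attribute; five elements so that x+3 still covers two doubles. */
double x[5] = {1.0, 2.0, 3.0, 4.0, 4.0};
double y[5] = {1.0, 2.0, 3.0, 4.0, 4.0};

void
foo(void)
{
    /* If x and y start on a 16-byte boundary, then x+1, x+3, y+1 and y+3
       do not, and the aligned loads and stores below may fault at run time. */
    __m128d x01 = _mm_load_pd(x+1);
    __m128d x23 = _mm_load_pd(x+3);
    __m128d y01 = _mm_load_pd(y+1);
    __m128d y23 = _mm_load_pd(y+3);

    y01 = _mm_add_pd(x01, y01);
    y23 = _mm_add_pd(x23, y23);

    _mm_store_pd(y+1, y01);
    _mm_store_pd(y+3, y23);
}

int
main(void)
{
    foo();
    printf("y = (%lf, %lf, %lf, %lf, %lf)\n", y[0], y[1], y[2], y[3], y[4]);
    return 0;
}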