High Performance Computing I

Session 1

First steps with vectors in C

Session 2

First steps with matrices in C

Session 3

  • Some BLAS Level 1 functions

  • Benchmarks and Gnuplot

Session 4

Simple cache optimizations

Session 5

Simple cache optimizations for GEMM

Session 6

Cache optimizations for GEMV

Session 7

First steps with C++

Session 8

  • C++ tools for managing memory buffers

  • Namespaces in C++

  • Some integer arithmetic: Rounding up a division

Session 9

Packing matrix blocks for an efficient GEMM (matrix product) implementation

Session 10

  • GEMM micro kernel (reference implementation)

  • GEMM macro kernel

  • GEMM frame routine

Session 11

Generic classes, template functions, and static polymorphism

Session 12

Function objects and lambda expressions

Session 13

Unblocked LU factorization

Session 14

More on vector and matrix classes

Session 15

First steps with threads in C++

Session 16

Mutex and condition variables

Session 17

Thread pools (part one)

Session 18

Thread pools (part two)

Session 19

GEMM with AVX-optimized micro kernels

Session 20

  • Another unblocked LU factorization

  • Blocked LU factorization

Session 21

  • Using MKL-BLAS for LU factorization

  • Improved blocked LU factorization (divide and conquer)

Session 22

Introduction to OpenMP

Session 23

Introduction to MPI

Session 24

Transfer of vectors and matrices using MPI

Session 25

  • Scatter and gather operations

  • Asynchronous communication

  • Two-dimensional grids

Session 26

Distributed matrices (with scatter and gather operations)

Session 27

Distributed GEMM

Session 28

Introduction to CUDA

Session 29

Virtual vs. physical GPU architecture, matrices

Session 30

Global synchronization and two-dimensional aggregation

Session 31

A simple multigrid solver