High Performance Computing I
First steps with vectors in C
First steps with matrices in C
Simple cache optimizations
Simple cache optimizations for GEMM
Cache optimizations for GEMV
First steps with C++
Packing matrix blocks for an efficient GEMM (matrix product) implementation
Generic classes, template functions, and static polymorphism
Function objects and lambda expressions
Unblocked LU factorization
More on vector and matrix classes
First steps with threads in C++
Mutex and condition variables
Thread pools (part one)
Thread pools (part two)
GEMM with AVX-optimized micro kernels
Using MKL-BLAS for LU factorization, improved blocked LU factorization (divide and conquer)
Introduction to OpenMP
Introduction to MPI
Transfer of vectors and matrices using MPI
Scatter and gather operations, asynchronous communication, two-dimensional grids
Distributed matrices (with scatter and gather operations)
Distributed GEMM
Introduction to CUDA
Virtual vs. physical GPU architecture, matrices
Global synchronization and two-dimensional aggregation
A simple multigrid solver