================================
Simple Jacobi solver for the GPU
================================

[TOC]

We now revisit the Jacobi solver for the GPU. We start with a very simple
approach where we operate with just one block on the GPU. This eases
synchronization, as the threads of a block can use `__syncthreads()` to keep
in sync. For the sake of simplicity we initially work with a fixed number of
iterations. We will see later how this can be improved.

Exercise
========

Develop a kernel function for the Jacobi solver that operates on one block
only. The number of iterations is to be passed as a parameter. Each thread
shall operate on just one $A_{i,j}$.

Try to solve this with one matrix only. This can indeed be done with proper
synchronization. Make sure that the kernel operates on the inner part of $A$
only.

Think about how you access the matrix $A$ within the kernel function. Which
approach is more cache-friendly for the GPU? Or, alternatively, try both
variants of matrix storage organisation and compare their times. You can
simply profile your application using the `nvprof` utility, i.e. invoke
`nvprof ./jacobi1` instead of simply `./jacobi1`. Look for the first section
titled `GPU activities`, where you will find `void jacobi...` – of interest
are the total time and the average time (`Avg`). Explain the difference.
Consider that within a warp `threadIdx.y` is identical for all threads while
the values of `threadIdx.x` are numbered consecutively.

The lecture library is available at `/home/numerik/pub/pp/ss19/lib`.

Skeleton
========

:import:session09/jacobi0.cu

Makefile
========

:import:session09/Makefile

:navigate: up -> doc:index
           back -> doc:session09/page02
           next -> doc:session09/page04
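To illustrate the synchronization pattern the exercise asks for, here is a
minimal sketch of a single-block Jacobi kernel that works in place on one
matrix. It is not the official solution: the kernel name, the row-major
layout of `A`, and the launch configuration are assumptions. Each thread
first reads its four neighbours into a register, then all threads
synchronize before anyone writes, so reads of iteration `it` can never see
writes of iteration `it`:

```cuda
// Hypothetical sketch of a single-block, in-place Jacobi kernel.
// Assumes A is an n x n matrix stored in row-major order and that
// the kernel is launched with one block of (n-2) x (n-2) threads.
template<typename T>
__global__ void jacobi(std::size_t n, T* A, unsigned int nofiterations) {
   // map each thread to one inner element A(i,j)
   std::size_t i = threadIdx.y + 1;
   std::size_t j = threadIdx.x + 1;
   for (unsigned int it = 0; it < nofiterations; ++it) {
      // read the four neighbours into a register first
      T value = T(0.25) * (A[(i-1)*n + j] + A[(i+1)*n + j] +
                           A[i*n + (j-1)] + A[i*n + (j+1)]);
      __syncthreads(); // all reads of this sweep are finished ...
      A[i*n + j] = value;
      __syncthreads(); // ... and all writes, before the next sweep starts
   }
}
```

A possible invocation, assuming `device_A` already holds the matrix on the
GPU: `jacobi<<<1, dim3(n-2, n-2)>>>(n, device_A, nofiterations);`. Note that
with row-major storage consecutive values of `threadIdx.x` access
consecutive memory addresses, so the loads within a warp can be coalesced –
this is the cache-friendliness question raised above.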