================================
Simple Jacobi solver for the GPU
================================

[TOC]

We now revisit the Jacobi solver for the GPU. We start with a very simple
approach where we operate with just one block on the GPU. This eases
synchronization, as the threads of a block can use `__syncthreads()` to keep
in sync. For the sake of simplicity we initially work with a fixed number of
iterations. We will see later how this can be improved.

Exercise
========

Develop a kernel function for the Jacobi solver that operates on one block
only. The number of iterations is to be passed as a parameter. Each thread
shall operate on just one $A_{i,j}$.

Try to solve this with one matrix only. This can indeed be done with proper
synchronization. Make sure that the kernel operates on the inner part of $A$
only.

Think about how you access the matrix $A$ within the kernel function. Which
approach is more cache-friendly for the GPU? Or, alternatively, try both
variants of matrix storage organisation and compare their times. You can
simply profile your application using the `nvprof` utility, i.e. invoke
`nvprof ./jacobi1` instead of simply `./jacobi1`. Look for the first section
titled `GPU activities`, where you will find `void jacobi...` – of interest
are the total time and the average time (`Avg`). Explain the difference.
Consider that within a warp `threadIdx.y` is identical for all threads while
the values of `threadIdx.x` are numbered consecutively.

The lecture library is available at `/home/numerik/pub/pp/ss19/lib`.

Skeleton
========

:import:session09/jacobi0.cu

Makefile
========

:import:session09/Makefile

:navigate: up -> doc:index
           back -> doc:session09/page02
           next -> doc:session09/page04
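To illustrate the synchronization pattern the exercise asks for, here is a
minimal sketch of a single-block Jacobi kernel that works in place on one
matrix. It is not the official solution: the kernel name, the row-major
layout of `A`, and the launch configuration are assumptions. Each thread
first reads its four neighbours into a register, then all threads
synchronize before anyone writes, so reads of iteration `it` can never see
writes of iteration `it`:

```cuda
// Hypothetical sketch of a single-block, in-place Jacobi kernel.
// Assumes A is an n x n matrix stored in row-major order and that
// the kernel is launched with one block of (n-2) x (n-2) threads.
template<typename T>
__global__ void jacobi(std::size_t n, T* A, unsigned int nofiterations) {
   // map each thread to one inner element A(i,j)
   std::size_t i = threadIdx.y + 1;
   std::size_t j = threadIdx.x + 1;
   for (unsigned int it = 0; it < nofiterations; ++it) {
      // read the four neighbours into a register first
      T value = T(0.25) * (A[(i-1)*n + j] + A[(i+1)*n + j] +
                           A[i*n + (j-1)] + A[i*n + (j+1)]);
      __syncthreads(); // all reads of this sweep are finished ...
      A[i*n + j] = value;
      __syncthreads(); // ... and all writes, before the next sweep starts
   }
}
```

A possible invocation, assuming `device_A` already holds the matrix on the
GPU: `jacobi<<<1, dim3(n-2, n-2)>>>(n, device_A, nofiterations);`. Note that
with row-major storage consecutive values of `threadIdx.x` access
consecutive memory addresses, so the loads within a warp can be coalesced –
this is the cache-friendliness question raised above.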