======================
Global synchronization                                          [TOC]
======================

As soon as we move from a single block to multiple blocks, we are
no longer able to synchronize globally among all threads of the
GPU. However, we can synchronize on the host because, by default,
a sequence of kernel function calls is serialized: a kernel starts
only after the previous one has finished. Hence, we can move the
loop over the Jacobi steps to the host, where each kernel function
call acts as a global synchronization point.
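
The following is a minimal sketch of this idea, not a Jacobi solver.
The kernel `add_one`, the host function `run_steps`, and the helper
`check` are hypothetical names chosen for illustration. It assumes
that all kernels are launched into the default stream, so each launch
sees the complete result of the previous one even though several
blocks are involved.

```cuda
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA call failed.
static void check(cudaError_t err) {
   if (err != cudaSuccess) {
      std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
      std::exit(1);
   }
}

// Hypothetical kernel: every thread increments its own element once.
__global__ void add_one(unsigned int* data, unsigned int n) {
   unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < n) {
      data[i] += 1;
   }
}

// Launch add_one `steps` times from a loop on the host and
// return the final contents of the device array.
std::vector<unsigned int> run_steps(unsigned int n, unsigned int steps) {
   assert(n > 0);
   std::size_t bytes = n * sizeof(unsigned int);
   unsigned int* data = nullptr;
   check(cudaMalloc(&data, bytes));
   check(cudaMemset(data, 0, bytes));

   const unsigned int threads_per_block = 64;
   unsigned int blocks = (n + threads_per_block - 1) / threads_per_block;

   // The loop lives on the host. Launches into the default stream are
   // executed one after another, so no two steps overlap.
   for (unsigned int step = 0; step < steps; ++step) {
      add_one<<<blocks, threads_per_block>>>(data, n);
   }
   check(cudaGetLastError());

   // This copy waits until all preceding kernels have finished.
   std::vector<unsigned int> result(n);
   check(cudaMemcpy(result.data(), data, bytes, cudaMemcpyDeviceToHost));
   check(cudaFree(data));
   return result;
}
```

With this sketch, `run_steps(1000, 50)` should return 1000 elements
that are all equal to 50, as every launch completes before the next
one begins.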

We can no longer do this in place on a single matrix $A$, as that
trick relied on per-block synchronization. Instead, we now work with
two matrices $A$ and $B$. In the first step we compute $B$ from $A$,
in the next step $A$ from $B$, and so on. We must take care to
initialize not just $A$ but also at least the border of $B$ (the
interior of $B$ is overwritten during the first Jacobi step).
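
The alternation between two buffers can be sketched with a
one-dimensional analogue, which is deliberately not the
two-dimensional solution asked for in the exercise. The names
`relax_1d`, `run_relaxation`, and `check` are hypothetical. The
sketch assumes a vector with fixed values at both ends whose interior
elements are repeatedly replaced by the average of their two
neighbours. Note that only the border of the second buffer is
initialized.

```cuda
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA call failed.
static void check(cudaError_t err) {
   if (err != cudaSuccess) {
      std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
      std::exit(1);
   }
}

// One relaxation step from `in` to `out`. The two border elements
// are never written, so they must be valid in both buffers.
__global__ void relax_1d(const double* in, double* out, unsigned int n) {
   unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i > 0 && i + 1 < n) {
      out[i] = 0.5 * (in[i - 1] + in[i + 1]);
   }
}

std::vector<double> run_relaxation(unsigned int n, double left,
      double right, unsigned int steps) {
   assert(n >= 3);
   std::vector<double> host(n, 0.0);
   host[0] = left; host[n - 1] = right;
   std::size_t bytes = n * sizeof(double);

   double* a = nullptr; double* b = nullptr;
   check(cudaMalloc(&a, bytes));
   check(cudaMalloc(&b, bytes));
   // The first buffer is initialized entirely ...
   check(cudaMemcpy(a, host.data(), bytes, cudaMemcpyHostToDevice));
   // ... while for the second buffer the border suffices, as its
   // interior gets written during the first step.
   check(cudaMemcpy(b, &host[0], sizeof(double),
      cudaMemcpyHostToDevice));
   check(cudaMemcpy(b + (n - 1), &host[n - 1], sizeof(double),
      cudaMemcpyHostToDevice));

   const unsigned int threads_per_block = 16;
   unsigned int blocks = (n + threads_per_block - 1) / threads_per_block;
   for (unsigned int step = 0; step < steps; ++step) {
      relax_1d<<<blocks, threads_per_block>>>(a, b, n);
      std::swap(a, b); // the output becomes the input of the next step
   }
   check(cudaGetLastError());

   // After the final swap, `a` points to the most recent result.
   check(cudaMemcpy(host.data(), a, bytes, cudaMemcpyDeviceToHost));
   check(cudaFree(a));
   check(cudaFree(b));
   return host;
}
```

Swapping the two pointers on the host avoids copying data between
the steps. For example, `run_relaxation(64, 0.0, 1.0, 20000)` should
approach a linear ramp from 0 to 1 while both border values remain
unchanged.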

Exercise
========

 * Develop a kernel function `init_matrix_border` that
   works like `init_matrix` but initializes the border
   only. Think about how this kernel function should
   be configured: as only the border has to be initialized,
   far fewer threads are needed than for initializing the
   entire matrix.

   Test this separately by copying $B$ back to the host and
   generating an image from it.

 * Develop a kernel function `jacobi_iteration` that
   performs a single Jacobi step. Make sure that it
   does not change the fixed border.
   Invoke the kernel function within a loop on the
   host with at least 1941 Jacobi steps.

Sources
=======

Program text from the last session:

:import:session09/jacobi3.cu [fold]

Generic _Makefile_ for this session:

:import:session10/Makefile [fold]

:navigate: up    -> doc:index
           next  -> doc:session10/page02