Two-dimensional aggregation
===========================
[TOC]

Assume we want to run the Jacobi iteration until

---- LATEX --------------------------------------------------------------------
\max_{i,j = 1 \dots N-1} \left| {A_{k+1}}_{i,j} - {A_{k}}_{i,j} \right| \le \epsilon
-------------------------------------------------------------------------------

for some given threshold $\epsilon$. Then we need to compute the global maximum of all the differences. This can be done by applying a global aggregation operator. As CUDA provides no convenient support for aggregation, this has to be programmed explicitly. Fortunately, it is possible to do this using lambda expressions, which have experimental support in CUDA and need to be switched on using the `--expt-extended-lambda` flag for _nvcc_ (already included in the generic _Makefile_ for this session). Such a self-written aggregation operator could then be used as follows:

---- CODE (type=cpp) ----------------------------------------------------------
T maxdiff = hpc::cuda::aggregate(global_diff,
   [] __device__ (T x, T y) {
      return x > y ? x : y;
   });
-------------------------------------------------------------------------------

Note the `__device__` specifier between `[]` and the parameter list. It causes the entire lambda expression to be compiled for the device. The operator is then applied in some order to all elements of the matrix.

The _aggregate_ function can be declared as follows:

---- CODE (type=cpp) ----------------------------------------------------------
template<
   template<typename> class Matrix,
   typename T,
   typename Aggregator,
   Require< DeviceGe<Matrix<T>> > = true
>
T aggregate(const Matrix<T>& A, Aggregator&& aggregator) {
   /* ... */
}
-------------------------------------------------------------------------------

How is it possible to parallelize the aggregation? One simple option is to let one thread run for each element of the matrix.
But this wastes most of the computing capacity, as quite a number of the threads will idle. Another option is to aggregate either row-wise or column-wise, whereby each thread aggregates a row or a column within a block, respectively. This needs more passes, switching from row-wise to column-wise aggregation and vice versa, until just one element remains, which is returned. The second approach is faster despite the greater number of kernel function calls.

Exercise
========

* Develop an aggregation operator as described above and test it with the test program below.
* Employ the aggregator for the Jacobi solver. As the aggregation does not come cheap, you do not want to run it after every step.

Sources
=======

Test program and a trivial implementation:

:import:session10/test-aggregate.cu
:import:session10/trivial-aggregate.hpp

:navigate: up -> doc:index
           back -> doc:session10/page02
           next -> doc:session10/page04