Two-dimensional aggregation
===========================
[TOC]

Assume we want to run the Jacobi iteration until

---- LATEX --------------------------------------------------------------------
\max_{i,j = 1 \dots N-1} \left| {A_{k+1}}_{i,j} - {A_{k}}_{i,j} \right| \le \epsilon
-------------------------------------------------------------------------------

for some given threshold $\epsilon$. Then we need to compute the global maximum of all the differences. This can be done by applying a global aggregation operator. As CUDA provides no convenient support for aggregation, this has to be programmed explicitly. Fortunately, it is possible to do this using lambda expressions, which have experimental support in CUDA and need to be switched on using the `--expt-extended-lambda` flag for _nvcc_ (already included in the generic _Makefile_ for this session). Such a self-written aggregation operator could then be used as follows:

---- CODE (type=cpp) ----------------------------------------------------------
T maxdiff = hpc::cuda::aggregate(global_diff,
   [] __device__ (T x, T y) {
      return x > y ? x : y;
   });
-------------------------------------------------------------------------------

Note the `__device__` specifier between `[]` and the parameter list. It causes the entire lambda expression to be compiled for the device. The operator is then applied in some order to all elements of the matrix.

The _aggregate_ function can be declared as follows:

---- CODE (type=cpp) ----------------------------------------------------------
template<
   template<typename> class Matrix,
   typename T,
   typename Aggregator,
   Require< DeviceGe<Matrix<T>> > = true
>
T aggregate(const Matrix<T>& A, Aggregator&& aggregator) {
   /* ... */
}
-------------------------------------------------------------------------------

How is it possible to parallelize the aggregation? One simple option is to let one thread run for each element of the matrix.
But this wastes most of the computing capacity, as quite a number of the threads will idle. Another option is to aggregate either row-wise or column-wise, whereby each thread aggregates a row or a column within a block, respectively. This needs more passes, switching from row-wise to column-wise aggregation and vice versa, until just one element remains, which is returned. The second approach is faster despite the greater number of kernel function calls.

Exercise
========

* Develop an aggregation operator as described above and test it with the test program below.
* Employ the aggregator for the Jacobi solver. As the aggregation does not come cheap, you do not want to run it after every step.

Sources
=======

Test program and a trivial implementation:

:import:session10/test-aggregate.cu
:import:session10/trivial-aggregate.hpp

:navigate: up -> doc:index
           back -> doc:session10/page02
           next -> doc:session10/page04