========================
Synchronized aggregation                                          [TOC]
========================

Within a block it is possible to aggregate results synchronously.
This is possible by

 * using common data structures that are to be declared
   as `_``_shared_``_` and by
 * the use of the synchronization operation `_``_syncthreads()`
   which blocks the caller until all threads of the same block
   have invoked `_``_syncthreads()`.

Assume you want to compute the total sum of the individual
values computed by each thread of a block. This can be done

 * in the first step by letting every second thread combine
   its sum with that of its neighbor,

 * in the second step by letting every fourth thread combine
   its sum from the first step with that of its
   second-next neighbor,

 * etc.

until you have one sum per block. This works out if the
number of threads per block is a power of two like 128
or 256.
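
The combination steps above can be traced on the host before
writing the kernel. The following is a minimal sketch in plain C++,
not a kernel: each index of `values` plays the role of one thread
of a block, and the function name `simulate_block_sum` is
hypothetical. It assumes that the number of values is a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side simulation of the aggregation within one block.
// Assumption: values.size() is a power of two (e.g. 128 or 256).
template<typename T>
T simulate_block_sum(std::vector<T> values) {
   std::size_t n = values.size();
   assert(n > 0 && (n & (n - 1)) == 0);
   // stride == 1: every second thread adds the sum of its neighbor
   // stride == 2: every fourth thread adds the sum of its
   //              second-next neighbor, etc.
   for (std::size_t stride = 1; stride < n; stride *= 2) {
      for (std::size_t i = 0; i < n; i += 2 * stride) {
         values[i] += values[i + stride];
      }
      // in a kernel, all threads of the block would have to pass
      // a synchronization point here before the next step starts
   }
   return values[0]; // one sum per block
}
```

In a kernel, the inner loop disappears as each thread handles its
own index, and the array would have to reside in memory shared by
the threads of the block.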

An aggregation that spans multiple blocks can be done
only with atomic operations or after the kernel has
finished. In the latter case, the per-block results can
be combined either on the CPU or by another kernel
function that operates with one block only.

Exercise
========

Develop a kernel function for the scalar product that
aggregates the sum within each block:

---- CODE (type=cpp) ----------------------------------------------------------
template<typename TX, typename TY, typename T>
__global__ void dot(std::size_t n,
      const TX* x, std::ptrdiff_t incX, TY* y, std::ptrdiff_t incY, T* sums) {
   /* ... */
}
-------------------------------------------------------------------------------

The size of the array `sums` shall equal the number of blocks
such that the $i$-th block can store its sum in `sums[i]`. The
overall sum is then to be computed on the CPU.
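
The host-side part could look like the following minimal sketch.
Both helper names, `blocks_needed` and `aggregate_block_sums`, are
hypothetical. The sketch assumes that the per-block sums have
already been copied from the GPU into a `std::vector` on the host,
and that enough blocks are launched to give each of the `n`
elements its own thread.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Number of blocks required to cover n elements if each block
// has threads_per_block threads (rounded up); this is also the
// required length of the array of per-block sums.
inline std::size_t blocks_needed(std::size_t n,
      std::size_t threads_per_block) {
   return (n + threads_per_block - 1) / threads_per_block;
}

// Final aggregation on the CPU: adds up the per-block sums.
// Assumption: sums[i] holds the result of the i-th block.
template<typename T>
T aggregate_block_sums(const std::vector<T>& sums) {
   return std::accumulate(sums.begin(), sums.end(), T{});
}
```

For example, 1000 elements with 256 threads per block require
4 blocks and hence an array `sums` with 4 elements.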

:navigate: up    -> doc:index
           back  -> doc:session08/page02
           next  -> doc:session08/page04