Synchronized aggregation
Content 
Within a block, results can be aggregated synchronously. This is possible by

using common data structures that are to be declared as __shared__ and by

the use of the synchronization operation __syncthreads(), which blocks the caller until all threads of the same block have invoked __syncthreads().
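The interplay of the two mechanisms can be sketched with a small kernel in which each thread writes one value into a __shared__ array and, after the barrier, reads a value written by another thread of the same block. The names reverse_in_block, reverse_demo and BLOCK_DIM are assumptions made for this illustration, and error checking of the CUDA calls is omitted:

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

constexpr unsigned int BLOCK_DIM = 256; // assumed number of threads per block

// Each thread publishes one value in shared memory, waits at the barrier,
// and then reads the value published by a different thread of its block.
__global__ void reverse_in_block(const int* in, int* out) {
   __shared__ int buffer[BLOCK_DIM]; // one instance per block
   unsigned int me = threadIdx.x;
   unsigned int base = blockIdx.x * blockDim.x;
   buffer[me] = in[base + me];
   __syncthreads(); // from here on, all writes to buffer are visible
   out[base + me] = buffer[blockDim.x - 1 - me];
}

// Hypothetical host helper; CUDA error checking is omitted for brevity.
std::vector<int> reverse_demo(unsigned int num_blocks) {
   assert(num_blocks > 0);
   std::size_t n = std::size_t(num_blocks) * BLOCK_DIM;
   std::vector<int> host(n);
   for (std::size_t i = 0; i < n; ++i) host[i] = int(i);
   int* d_in = nullptr;
   int* d_out = nullptr;
   cudaMalloc(&d_in, n * sizeof(int));
   cudaMalloc(&d_out, n * sizeof(int));
   cudaMemcpy(d_in, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);
   reverse_in_block<<<num_blocks, BLOCK_DIM>>>(d_in, d_out);
   cudaMemcpy(host.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
   cudaFree(d_in);
   cudaFree(d_out);
   return host;
}
```

Without the call to __syncthreads(), a thread could read a slot of buffer before the thread responsible for it has written it.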
Assume you want to compute the total sum of the individual values computed by each thread of a block. This can be done

in the first step by letting every second thread combine its sum with that of its neighbor,

in the second step by letting every fourth thread combine its sum from the first step with that of its next-but-one neighbor,

etc.
until you have one sum per block. This scheme works as described if the number of threads per block is a power of two such as 128 or 256, because the number of partial sums is halved in each step.
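The stepwise scheme can be sketched as follows for plain per-thread values. This is a minimal illustration, not a solution for the exercise: the names block_sum, block_sum_demo and BLOCK_DIM are assumptions, the kernel expects to be launched with exactly BLOCK_DIM threads per block, and error checking is omitted:

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

constexpr unsigned int BLOCK_DIM = 256; // assumed block size, a power of two

// Sums the BLOCK_DIM values belonging to each block and stores the
// result of block i in sums[i].
__global__ void block_sum(const float* values, float* sums) {
   __shared__ float partial[BLOCK_DIM];
   unsigned int me = threadIdx.x;
   partial[me] = values[blockIdx.x * blockDim.x + me];
   // stride 1: every second thread adds its neighbor,
   // stride 2: every fourth thread adds its next-but-one neighbor, etc.
   for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
      __syncthreads(); // results of the previous step must be complete
      if (me % (2 * stride) == 0) {
         partial[me] += partial[me + stride];
      }
   }
   if (me == 0) sums[blockIdx.x] = partial[0];
}

// Hypothetical host helper: every input value is 1, so each block sum
// equals BLOCK_DIM. CUDA error checking is omitted for brevity.
std::vector<float> block_sum_demo(unsigned int num_blocks) {
   assert(num_blocks > 0);
   std::size_t n = std::size_t(num_blocks) * BLOCK_DIM;
   std::vector<float> values(n, 1.0f);
   std::vector<float> sums(num_blocks);
   float* d_values = nullptr;
   float* d_sums = nullptr;
   cudaMalloc(&d_values, n * sizeof(float));
   cudaMalloc(&d_sums, num_blocks * sizeof(float));
   cudaMemcpy(d_values, values.data(), n * sizeof(float),
      cudaMemcpyHostToDevice);
   block_sum<<<num_blocks, BLOCK_DIM>>>(d_values, d_sums);
   cudaMemcpy(sums.data(), d_sums, num_blocks * sizeof(float),
      cudaMemcpyDeviceToHost);
   cudaFree(d_values);
   cudaFree(d_sums);
   return sums;
}
```

Note that __syncthreads() is invoked by all threads in every iteration, including those that no longer add anything. Placing the barrier inside the if statement would leave some threads of the block without a matching call.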
An aggregation that spans multiple blocks can be done only with atomic operations or after the invocation of the kernel has completed. In the latter case, the per-block results can be combined either on the CPU or by invoking another kernel function that operates with one block only.
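The variant with atomic operations might look like the following sketch, in which each block first aggregates in shared memory and thread 0 then adds the block's result to a single global total using the CUDA built-in atomicAdd. The names total_sum, total_sum_demo and BLOCK_DIM are assumptions for this illustration, integer values are used to keep the result independent of the order of the additions, and error checking is omitted:

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

constexpr unsigned int BLOCK_DIM = 256; // assumed block size, a power of two

// Each block aggregates its values in shared memory; thread 0 of each
// block then adds the block's sum atomically to *total.
__global__ void total_sum(const int* values, int* total) {
   __shared__ int partial[BLOCK_DIM];
   unsigned int me = threadIdx.x;
   partial[me] = values[blockIdx.x * blockDim.x + me];
   for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
      __syncthreads();
      if (me % (2 * stride) == 0) {
         partial[me] += partial[me + stride];
      }
   }
   if (me == 0) atomicAdd(total, partial[0]);
}

// Hypothetical host helper: sums the values 0, 1, ..., n-1 where
// n = num_blocks * BLOCK_DIM. CUDA error checking is omitted for brevity.
int total_sum_demo(unsigned int num_blocks) {
   assert(num_blocks > 0);
   std::size_t n = std::size_t(num_blocks) * BLOCK_DIM;
   std::vector<int> values(n);
   for (std::size_t i = 0; i < n; ++i) values[i] = int(i);
   int* d_values = nullptr;
   int* d_total = nullptr;
   cudaMalloc(&d_values, n * sizeof(int));
   cudaMalloc(&d_total, sizeof(int));
   cudaMemcpy(d_values, values.data(), n * sizeof(int),
      cudaMemcpyHostToDevice);
   cudaMemset(d_total, 0, sizeof(int)); // the total must start at zero
   total_sum<<<num_blocks, BLOCK_DIM>>>(d_values, d_total);
   int total = 0;
   cudaMemcpy(&total, d_total, sizeof(int), cudaMemcpyDeviceToHost);
   cudaFree(d_values);
   cudaFree(d_total);
   return total;
}
```

With four blocks, total_sum_demo(4) adds up the values 0 to 1023, which gives 1024 * 1023 / 2 = 523776. Only one atomic operation per block is issued here, which keeps contention on the shared total low compared with one atomic operation per thread.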
Exercise
Develop a kernel function for the scalar product that aggregates the sum within each block:
template<typename TX, typename TY, typename T> __global__ void dot(std::size_t n, const TX* x, std::ptrdiff_t incX, TY* y, std::ptrdiff_t incY, T* sums) { /* ... */ }
The size of the array sums shall equal the number of blocks so that the \(i\)-th block can store its sum in sums[i]. The overall sum is then to be computed on the CPU.