========================
Synchronized aggregation [TOC]
========================

Within a block it is possible to aggregate results synchronously. This is
possible by

 * using common data structures that are declared as `__shared__` and by
 * using the synchronization operation `__syncthreads()`, which blocks the
   caller until all threads of the same block have invoked `__syncthreads()`.

Assume you want to compute the total sum of the individual values computed by
each thread of a block. This can be done

 * in the first step by letting every second thread combine its sum with that
   of its neighbor,
 * in the second step by letting every fourth thread combine its sum from the
   first step with that of its second-next neighbor,
 * etc. until you have one sum per block.

This works out if the number of threads per block is a power of two like 128
or 256.

An aggregation spanning multiple blocks can be done only by using atomic
operations or after the kernel invocation has completed. In the latter case,
the per-block results can be combined either on the CPU or by invoking another
kernel function that operates with one block only.

Exercise
========

Develop a kernel function for the scalar product that aggregates the sum within
each block:

---- CODE (type=cpp) ----------------------------------------------------------
template<typename TX, typename TY, typename T>
__global__ void dot(std::size_t n,
      const TX* x, std::ptrdiff_t incX, TY* y, std::ptrdiff_t incY,
      T* sums) {
   /* ... */
}
-------------------------------------------------------------------------------

The size of the array `sums` shall equal the number of blocks such that the
$i$-th block can store its sum in `sums[i]`. The overall sum is then to be
computed on the CPU.

:navigate: up -> doc:index
           back -> doc:session08/page02
           next -> doc:session08/page04
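The pairwise combination steps described in this section can be tried out on the host before writing any kernel code. The following is a minimal sketch in plain C++, not a kernel: the function name `block_sum` is hypothetical, the vector stands in for a `__shared__` array, each iteration of the inner loop plays the role of one participating thread, and the end of each outer iteration marks the point where a kernel would have to call `__syncthreads()`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: sequentially simulates the step-wise aggregation
// of one block. `values` holds one partial result per simulated thread.
template<typename T>
T block_sum(std::vector<T> values) {
   const std::size_t n = values.size();
   if (n == 0) {
      return T{};
   }
   // As in the text, the number of threads is assumed to be a power of two.
   assert((n & (n - 1)) == 0);
   // stride = 1: every second element adds its neighbor,
   // stride = 2: every fourth element adds its second-next neighbor, etc.
   for (std::size_t stride = 1; stride < n; stride *= 2) {
      // Each iteration of this loop corresponds to one active thread.
      for (std::size_t i = 0; i + stride < n; i += 2 * stride) {
         values[i] += values[i + stride];
      }
      // In a kernel, all threads of the block would synchronize here
      // before the next step reads the results of this one.
   }
   return values[0]; // the sum of the block ends up in the first element
}
```

For eight values 1 to 8, the first element holds 3 after the first step, 10 after the second, and 36 after the third, so `block_sum(std::vector<int>{1, 2, 3, 4, 5, 6, 7, 8})` returns 36. With $n$ threads, $\log_2 n$ steps are needed.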