Synchronized aggregation

Content

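Within a block it is possible to aggregate results synchronously. This is possible by using shared memory, i.e. variables declared with __shared__ that are visible to all threads of a block, and by synchronizing the threads of a block with __syncthreads().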

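Assume you want to compute the total sum of the individual values computed by each thread of a block. This can be done by storing each value in a shared array and then repeatedly halving the number of active threads, with each active thread adding the value of a partner thread from the now inactive half to its own, until you have one sum per block. This works out if the number of threads per block is a power of two like 128 or 256.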
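A minimal sketch of such a block-wise summation, assuming a shared array and a loop that halves the number of active threads in each step. The names block_sum, sum_per_block and THREADS_PER_BLOCK are hypothetical and chosen for illustration only; error checking of the CUDA calls is omitted for brevity.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical block size; the halving loop below assumes a power of two.
constexpr unsigned int THREADS_PER_BLOCK = 256;

// Each block sums up to THREADS_PER_BLOCK values and stores
// its result in sums[blockIdx.x].
__global__ void block_sum(std::size_t n, const double* values, double* sums) {
   __shared__ double partial[THREADS_PER_BLOCK];
   unsigned int me = threadIdx.x;
   std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + me;
   // threads beyond the end of the input contribute 0
   partial[me] = i < n ? values[i] : 0.0;
   __syncthreads();
   // halve the number of active threads in each step
   for (unsigned int active = blockDim.x / 2; active > 0; active /= 2) {
      if (me < active) {
         partial[me] += partial[me + active];
      }
      // outside the if: every thread of the block must reach the barrier
      __syncthreads();
   }
   if (me == 0) {
      sums[blockIdx.x] = partial[0];
   }
}

// Hypothetical host helper: returns one sum per block.
std::vector<double> sum_per_block(const std::vector<double>& values) {
   std::size_t n = values.size();
   unsigned int blocks = static_cast<unsigned int>(
      (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK);
   std::vector<double> sums(blocks);
   if (blocks == 0) return sums;
   double* d_values = nullptr;
   double* d_sums = nullptr;
   cudaMalloc(&d_values, n * sizeof(double));
   cudaMalloc(&d_sums, blocks * sizeof(double));
   cudaMemcpy(d_values, values.data(), n * sizeof(double),
      cudaMemcpyHostToDevice);
   block_sum<<<blocks, THREADS_PER_BLOCK>>>(n, d_values, d_sums);
   // this copy waits until the kernel has finished
   cudaMemcpy(sums.data(), d_sums, blocks * sizeof(double),
      cudaMemcpyDeviceToHost);
   cudaFree(d_values);
   cudaFree(d_sums);
   return sums;
}
```

Calling sum_per_block with 512 values of 1.0 should yield two block sums of 256.0 each. If the block size were not a power of two, the halving loop would skip some elements of the shared array, which is why the restriction matters.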

An aggregation spanning multiple blocks can be done only with atomic operations or after the kernel has finished. In the latter case, the per-block results can be combined either on the CPU or by another kernel function that is invoked with one block only.
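The atomic variant could look roughly like the following sketch: each block first reduces its values in shared memory, and then one thread per block adds the block sum to a single global total using atomicAdd. The names total_sum, sum_all and BLOCK_DIM are hypothetical. The sketch uses float because atomicAdd for double requires newer devices (compute capability 6.0 or later); error checking is again omitted.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical block size; must be a power of two.
constexpr unsigned int BLOCK_DIM = 128;

__global__ void total_sum(std::size_t n, const float* values, float* total) {
   __shared__ float partial[BLOCK_DIM];
   unsigned int me = threadIdx.x;
   std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + me;
   partial[me] = i < n ? values[i] : 0.0f;
   __syncthreads();
   // synchronized aggregation within the block
   for (unsigned int active = blockDim.x / 2; active > 0; active /= 2) {
      if (me < active) {
         partial[me] += partial[me + active];
      }
      __syncthreads();
   }
   // aggregation across blocks: one atomic update per block
   if (me == 0) {
      atomicAdd(total, partial[0]);
   }
}

// Hypothetical host helper: returns the sum of all values.
float sum_all(const std::vector<float>& values) {
   std::size_t n = values.size();
   if (n == 0) return 0.0f;
   unsigned int blocks = static_cast<unsigned int>(
      (n + BLOCK_DIM - 1) / BLOCK_DIM);
   float* d_values = nullptr;
   float* d_total = nullptr;
   cudaMalloc(&d_values, n * sizeof(float));
   cudaMalloc(&d_total, sizeof(float));
   cudaMemcpy(d_values, values.data(), n * sizeof(float),
      cudaMemcpyHostToDevice);
   // all-zero bytes represent 0.0f
   cudaMemset(d_total, 0, sizeof(float));
   total_sum<<<blocks, BLOCK_DIM>>>(n, d_values, d_total);
   float total = 0.0f;
   cudaMemcpy(&total, d_total, sizeof(float), cudaMemcpyDeviceToHost);
   cudaFree(d_values);
   cudaFree(d_total);
   return total;
}
```

Note that the order in which the blocks perform their atomic updates is not fixed, so floating-point rounding may differ slightly from run to run. Collecting one sum per block and adding them up afterwards avoids this, at the cost of an extra step.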

Exercise

Develop a kernel function for the scalar product that aggregates the sum within each block:

template<typename TX, typename TY, typename T>
__global__ void dot(std::size_t n,
      const TX* x, std::ptrdiff_t incX, TY* y, std::ptrdiff_t incY, T* sums) {
   /* ... */
}

The array sums shall have as many elements as there are blocks so that the \(i\)-th block can store its sum in sums[i]. The overall sum is then to be computed on the CPU.