Invoking multiple kernel functions from the GPU
Content 
Multiple kernel functions can be invoked one after another and operate on results left in GPU memory by previous invocations. Hence, it is not necessary to copy data back and forth between the CPU and the GPU between two calls.
Invocations of kernel functions are serialized within a so-called stream. When a kernel is configured, you are free to specify a stream. If none is given, the default stream is used, which causes all kernel function invocations to be processed sequentially.
Note that the invocation of a kernel function returns immediately, long before the kernel finishes. Copying functions like cudaMemcpy are likewise associated with a stream. Hence, it is the invocation of cudaMemcpy that, by default, blocks until the previous invocation of a kernel function has completed. We will see later how this can be parallelized by working with multiple streams.
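A small sketch may illustrate this implicit serialization; the kernels scale and offset are made-up example names, and the pointers are assumed to have been allocated with cudaMalloc:

```cuda
/* both launches go into the default stream, so offset runs
   only after scale has finished; each launch returns immediately */
scale<<<blocks, threads>>>(n, x);
offset<<<blocks, threads>>>(n, x);
/* cudaMemcpy is associated with the default stream as well:
   it blocks until both kernels above have completed */
cudaMemcpy(hx, x, n * sizeof(double), cudaMemcpyDeviceToHost);
```

An explicit stream could be passed as the fourth configuration parameter, e.g. `scale<<<blocks, threads, 0, stream>>>(n, x)`.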
Exercise

Write a kernel function named asum that aggregates, per block, a partial sum of the vector \(\vec{x}\) and stores the results in sums.
template<typename T>
__global__ void asum(std::size_t n, const T* x, T* sums) {
   /* ... */
}
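One possible approach (a sketch, not the only valid solution) is a tree reduction in shared memory; it assumes that the block size is a power of two and that the shared memory size is passed as the third configuration parameter at launch:

```cuda
template<typename T>
__global__ void asum(std::size_t n, const T* x, T* sums) {
   /* dynamically sized shared memory; the launch must supply
      blockDim.x * sizeof(T) bytes as third config parameter */
   extern __shared__ char sbuf[];
   T* partial = reinterpret_cast<T*>(sbuf);
   std::size_t i = blockIdx.x * std::size_t(blockDim.x) + threadIdx.x;
   partial[threadIdx.x] = i < n? x[i]: T(0);
   __syncthreads();
   /* tree reduction within the block; assumes blockDim.x
      is a power of two */
   for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
      if (threadIdx.x < stride) {
         partial[threadIdx.x] += partial[threadIdx.x + stride];
      }
      __syncthreads();
   }
   if (threadIdx.x == 0) {
      sums[blockIdx.x] = partial[0];
   }
}
```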

Adapt the previous solution such that the total sum is no longer computed by the CPU. Instead, the kernel function asum is to be invoked repeatedly until only one block is left, which computes the overall sum. Then just a single double value has to be transferred from the GPU to the CPU.
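The host-side driver loop could be sketched as follows; THREADS and the device pointers are illustrative names, and it is assumed that two device buffers are available that may be overwritten and swapped between passes:

```cuda
std::size_t len = n;
double* in = device_x;     /* assumed device buffers; both may */
double* out = device_sums; /* be overwritten by the reduction  */
while (len > 1) {
   std::size_t blocks = (len + THREADS - 1) / THREADS;
   /* third config parameter: shared memory per block */
   asum<<<blocks, THREADS, THREADS * sizeof(double)>>>(len, in, out);
   /* the partial sums of this pass become the input of the next */
   std::swap(in, out);
   len = blocks;
}
/* just one double value is copied back to the CPU;
   the final result resides in in[0] after the last swap */
double sum;
cudaMemcpy(&sum, in, sizeof sum, cudaMemcpyDeviceToHost);
```

As all launches and the final cudaMemcpy use the default stream, no explicit synchronization between the passes is required.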