===============================================
Invoking multiple kernel functions from the GPU [TOC]
===============================================

Multiple kernel functions can be invoked in succession and work with
results left in GPU memory by previous invocations. Hence, it is not
necessary to copy data from and to the GPU between two calls.

Invocations of kernel functions are serialized within a so-called
stream. When a kernel is configured, you are free to specify a stream.
If none is given, the default stream is used, which causes all kernel
function invocations to be processed sequentially. Note that the
invocation of a kernel function returns immediately, long before the
kernel finishes. Copying functions like `cudaMemcpy` are likewise
associated with a stream. Hence, it is the invocation of `cudaMemcpy`
that by default blocks until the previous invocation of a kernel
function is completed. We will later see how this can be parallelized
by working with multiple streams.

Exercise
========

* Write a kernel function named `asum` that aggregates per block the
  sums of the vector $\vec{x}$ and stores the results in `sums`.

  ---- CODE (type=cpp) -------------------------------------------------------
  template<typename T>
  __global__ void asum(std::size_t n, const T* x, T* sums) {
     /* ... */
  }
  ----------------------------------------------------------------------------

* Adapt the previous solution such that the total sum is no longer
  computed by the CPU. Instead the kernel function `asum` is to be
  invoked repeatedly until we have only one block left that computes
  the overall sum. Then just one `double` value has to be returned
  from the GPU to the CPU.

:navigate: up -> doc:index
           back -> doc:session08/page04
           next -> doc:session08/page06
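
As a small illustration of the default stream, two dependent kernels
can be launched back to back: the second one reads the results the
first one left in GPU memory, and the concluding `cudaMemcpy`
implicitly blocks until both kernels have finished. The kernels
`scale` and `shift` below are hypothetical examples invented for this
sketch; they are not part of the exercise.

---- CODE (type=cpp) -------------------------------------------------------
template<typename T>
__global__ void scale(std::size_t n, T* x, T factor) {
   std::size_t i = threadIdx.x + blockIdx.x * blockDim.x;
   if (i < n) x[i] *= factor; /* multiply each component in place */
}

template<typename T>
__global__ void shift(std::size_t n, T* x, T delta) {
   std::size_t i = threadIdx.x + blockIdx.x * blockDim.x;
   if (i < n) x[i] += delta; /* add delta to each component in place */
}

/* ... assuming device_x points to n doubles already copied to the GPU:
   both launches go to the default stream and are executed in order;
   device_x stays in GPU memory in-between, no intermediate copy needed */
std::size_t nof_blocks = (n + 255) / 256;
scale<<<nof_blocks, 256>>>(n, device_x, 2.0);
shift<<<nof_blocks, 256>>>(n, device_x, 1.0);
/* cudaMemcpy blocks by default until both kernels have completed */
cudaMemcpy(x, device_x, n * sizeof(double), cudaMemcpyDeviceToHost);
----------------------------------------------------------------------------

Both launches return immediately; only the `cudaMemcpy` at the end
synchronizes the CPU with the GPU.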