===============================================
Invoking multiple kernel functions from the GPU [TOC]
===============================================

Multiple kernel functions can be invoked in succession and work with
results left in GPU memory by previous invocations. Hence, it is not
necessary to copy data from and to the GPU between two calls.

Invocations of kernel functions are serialized within a so-called
stream. When a kernel is configured, you are free to specify a stream.
If none is given, the default stream is used, which causes all kernel
function invocations to be processed sequentially. Note that the
invocation of a kernel function returns immediately, long before the
kernel finishes. Copying functions like `cudaMemcpy` are likewise
associated with a stream. Hence, it is the invocation of `cudaMemcpy`
that by default blocks until the previous invocation of a kernel
function is completed. We will later see how this can be parallelized
by working with multiple streams.

Exercise
========

* Write a kernel function named `asum` that aggregates per block the
  sums of the vector $\vec{x}$ and stores the results in `sums`.

  ---- CODE (type=cpp) -------------------------------------------------------
  template<typename T>
  __global__ void asum(std::size_t n, const T* x, T* sums) {
     /* ... */
  }
  ----------------------------------------------------------------------------

* Adapt the previous solution such that the total sum is no longer
  computed by the CPU. Instead the kernel function `asum` is to be
  invoked repeatedly until we have only one block left that computes
  the overall sum. Then just one `double` value has to be returned
  from the GPU to the CPU.

:navigate: up -> doc:index
           back -> doc:session08/page04
           next -> doc:session08/page06
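
As a small illustration of the default stream, two dependent kernels
can be launched back to back: the second one reads the results the
first one left in GPU memory, and the concluding `cudaMemcpy`
implicitly blocks until both kernels have finished. The kernels
`scale` and `shift` below are hypothetical examples invented for this
sketch; they are not part of the exercise.

---- CODE (type=cpp) -------------------------------------------------------
template<typename T>
__global__ void scale(std::size_t n, T* x, T factor) {
   std::size_t i = threadIdx.x + blockIdx.x * blockDim.x;
   if (i < n) x[i] *= factor; /* multiply each component in place */
}

template<typename T>
__global__ void shift(std::size_t n, T* x, T delta) {
   std::size_t i = threadIdx.x + blockIdx.x * blockDim.x;
   if (i < n) x[i] += delta; /* add delta to each component in place */
}

/* ... assuming device_x points to n doubles already copied to the GPU:
   both launches go to the default stream and are executed in order;
   device_x stays in GPU memory in-between, no intermediate copy needed */
std::size_t nof_blocks = (n + 255) / 256;
scale<<<nof_blocks, 256>>>(n, device_x, 2.0);
shift<<<nof_blocks, 256>>>(n, device_x, 1.0);
/* cudaMemcpy blocks by default until both kernels have completed */
cudaMemcpy(x, device_x, n * sizeof(double), cudaMemcpyDeviceToHost);
----------------------------------------------------------------------------

Both launches return immediately; only the `cudaMemcpy` at the end
synchronizes the CPU with the GPU.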