==================== Introduction to CUDA [TOC] ====================

A CUDA source consists of program text that is to be executed by the CPU and program text that is to be run on the GPU. So-called kernel functions are called by the CPU but actually executed by the GPU. A trivial kernel function _axpy_ that computes $\vec{y} \leftarrow \alpha \vec{x} + \vec{y}$ could be implemented as follows:

---- CODE (type=cpp) ----------------------------------------------------------
__global__ void axpy(std::size_t len, double alpha, double* x, double* y) {
   for (std::size_t i = 0; i < len; ++i) {
      y[i] += alpha * x[i];
   }
}
-------------------------------------------------------------------------------

But this solution would fall short of utilizing the parallelization potential of the GPU. The GPU provides a SIMD architecture (_single instruction multiple data_) where the same instruction can be executed on an array of data (like the vectors in this example). A simple parallelization could be done as follows:

---- CODE (type=cpp) ----------------------------------------------------------
__global__ void axpy(double alpha, double* x, double* y) {
   std::size_t tid = threadIdx.x;
   y[tid] += alpha * x[tid];
}
-------------------------------------------------------------------------------

The for-loop is gone; each invocation of _axpy_ computes just one element of $\vec{y}$. How many threads on the GPU are actually started in parallel for _axpy_ is decided at the invocation of the kernel function:

---- CODE (type=cpp) ----------------------------------------------------------
axpy<<<1, N>>>(2.0, device_x, device_y);
-------------------------------------------------------------------------------

The construct `<<<...>>>` configures a kernel function. In this case we invoke the kernel function with one block (more about blocks later) consisting of `N` threads (where `N` is the dimension of `device_x` and `device_y`). The individual threads see values of 0 to `N - 1` in `threadIdx.x`.

The number of threads per block is limited (to 1024 in the case of Livingstone). If the vector is larger, you need to split it across multiple blocks. As each of the blocks operates with the same number of threads, you have to add a check whether the index is in range:

---- CODE (type=cpp) ----------------------------------------------------------
__global__ void axpy(std::size_t n, double alpha, double* x, double* y) {
   std::size_t tid = threadIdx.x + blockIdx.x * blockDim.x;
   if (tid < n) {
      y[tid] += alpha * x[tid];
   }
}
-------------------------------------------------------------------------------

How is `N` best spread over multiple blocks? The following points should be kept in mind:

 * All computations are done in warps or half-warps. A warp typically has a size of 32 threads. If the dimension of a block is not a multiple of 32, some threads are simply wasted.
 * A block must not have more than 1024 threads. But going to the full limit of 1024 is less preferable as it limits the number of registers available to each thread.
 * If the problem size is not too small, we should seek to employ all available multiprocessors. Each block can be processed by one multiprocessor only.
 * It is good to have more than one warp per block as the multiprocessor switches between different warps whenever an active warp is waiting for memory operations.

If you are interested in finding the optimal configuration, you will need benchmarks as this depends on the actual GPU. A concrete launch example for the range-checked kernel is sketched below.
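To make the index computation concrete, here is a small launch example for the range-checked kernel above. The vector length of 1000 is made up for illustration, and `device_x` and `device_y` are assumed to have been allocated in GPU memory and filled already:

---- CODE (type=cpp) ----------------------------------------------------------
/* illustration with a made-up problem size:
   1000 elements do not fit into a single block of 256 threads */
std::size_t N = 1000;
std::size_t threads_per_block = 256; /* a multiple of the warp size 32 */
std::size_t nof_blocks = (N + threads_per_block - 1) / threads_per_block; /* = 4 */
/* 4 * 256 = 1024 threads are started; the thread with blockIdx.x == 3 and
   threadIdx.x == 231 computes tid = 231 + 3 * 256 = 999, while the last
   24 threads (tid >= 1000) fail the range check and do nothing */
axpy<<<nof_blocks, threads_per_block>>>(N, 2.0, device_x, device_y);
-------------------------------------------------------------------------------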
In practice, the best performance is usually found with a medium block size like 256. The following kernel invocation tries to take the points above into account:

---- CODE (type=cpp) ----------------------------------------------------------
std::size_t warp_size = hpc::cuda::get_warp_size(); /* typically 32 */
std::size_t nof_warps = ceildiv(N, warp_size);
std::size_t warps_per_block =
   hpc::cuda::get_max_threads_per_block() / warp_size / 4; /* typically 8 */
std::size_t nof_blocks = ceildiv(nof_warps, warps_per_block);
std::size_t threads_per_block;
if (nof_blocks == 1) {
   threads_per_block = N;
} else {
   threads_per_block = warps_per_block * warp_size;
}
axpy<<<nof_blocks, threads_per_block>>>(N, 2.0, device_a, device_b);
-------------------------------------------------------------------------------

The functions of `hpc::cuda` are available through `#include `. The function `ceildiv` can be defined as follows:

---- CODE (type=cpp) ----------------------------------------------------------
std::size_t ceildiv(std::size_t x, std::size_t y) {
   /* note that we expect x > 0 && y > 0;
      not safe against overflows but we expect y to be small */
   return (x + y - 1) / y;
}
-------------------------------------------------------------------------------

As a warp size of 32 and a maximum of 1024 threads per block are quite common, this is usually simplified:

---- CODE (type=cpp) ----------------------------------------------------------
std::size_t nof_blocks = ceildiv(N, 256);
std::size_t threads_per_block;
if (nof_blocks == 1) {
   threads_per_block = N;
} else {
   threads_per_block = 256;
}
axpy<<<nof_blocks, threads_per_block>>>(N, 2.0, device_a, device_b);
-------------------------------------------------------------------------------

All CUDA library functions like `cudaMalloc` return an error code. This error code does not just indicate errors of the actual call but possibly also an error arriving asynchronously from an operation (like a kernel function) that was started earlier. Hence, it is advisable to check the return value of every CUDA function. Otherwise you might miss important error indications. We use a `CHECK_CUDA` macro from `#include ` that is wrapped around each invocation of a CUDA library function. Whenever an error is detected, `CHECK_CUDA` prints the error message, the source file, the line number, and the operation for which the error occurred, and terminates the execution of the program.
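For intuition, a minimal sketch of how such a macro could be implemented follows. This is an assumption for illustration, not the actual macro from the lecture library; it merely shows the general pattern of checking the returned `cudaError_t` and printing a diagnostic with `cudaGetErrorString`:

---- CODE (type=cpp) ----------------------------------------------------------
/* hypothetical sketch, not the actual macro from the lecture library */
#include <cstdio>
#include <cstdlib>

#define CHECK_CUDA(cuda_function, ...) \
   do { \
      cudaError_t rc = cuda_function(__VA_ARGS__); \
      if (rc != cudaSuccess) { \
         std::fprintf(stderr, "%s:%d: %s failed: %s\n", \
            __FILE__, __LINE__, #cuda_function, cudaGetErrorString(rc)); \
         std::exit(1); \
      } \
   } while (false)
-------------------------------------------------------------------------------

With this calling convention, `CHECK_CUDA(cudaMalloc, (void**)&device_a, N * sizeof(double))` expands to a checked call of `cudaMalloc`, as used in the steps below.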
The invocation of a kernel function typically follows these steps:

 * Allocate memory on the GPU using `cudaMalloc`. Example for a vector named `device_a` with $N$ elements:

   ---- CODE (type=cpp) -------------------------------------------------------
   double* device_a;
   CHECK_CUDA(cudaMalloc, (void**)&device_a, N * sizeof(double));
   ----------------------------------------------------------------------------

 * Copy data from CPU memory to GPU memory. Example for copying the vector `a` from CPU memory to the vector `device_a` in GPU memory using the function `cudaMemcpy`:

   ---- CODE (type=cpp) -------------------------------------------------------
   CHECK_CUDA(cudaMemcpy, device_a, a, N * sizeof(double), cudaMemcpyHostToDevice);
   ----------------------------------------------------------------------------

 * The actual invocation of the kernel function, where the wanted total number of threads is partitioned into multiple blocks. The number of threads per block is limited. In our first example we limit the number of threads per block to 256, a multiple of the warp size of 32:

   ---- CODE (type=cpp) -------------------------------------------------------
   std::size_t nof_blocks = ceildiv(N, 256);
   std::size_t threads_per_block;
   if (nof_blocks == 1) {
      threads_per_block = N;
   } else {
      threads_per_block = 256;
   }
   axpy<<<nof_blocks, threads_per_block>>>(N, 2.0, device_a, device_b);
   ----------------------------------------------------------------------------

 * Copy the results from GPU memory back to CPU memory. You will need the direction `cudaMemcpyDeviceToHost` for `cudaMemcpy`:

   ---- CODE (type=cpp) -------------------------------------------------------
   CHECK_CUDA(cudaMemcpy, a, device_a, N * sizeof(double), cudaMemcpyDeviceToHost);
   ----------------------------------------------------------------------------

 * Release the allocated GPU memory:

   ---- CODE (type=cpp) -------------------------------------------------------
   CHECK_CUDA(cudaFree, device_a);
   ----------------------------------------------------------------------------

Preparations
============

You will need to login on Livingstone to work with CUDA. Please check that _cuda_ is in your file _~/.options_. Add it when necessary or add _/usr/local/cuda/bin_ manually to your _PATH_. Livingstone is an Ubuntu system as Debian is not supported by Nvidia CUDA.

---- CODE (type=sh) -----------------------------------------------------------
if grep '^cuda$' ~/.options >/dev/null
then
   : ok
else
   echo cuda >>~/.options
fi
-------------------------------------------------------------------------------

Exercise
========

Develop a CUDA kernel function `scal` that scales a vector `x` with `n` elements: $\vec{x} \leftarrow \alpha \vec{x}$. Write a small test program that allocates the vector in CPU memory, initializes it, copies it to GPU memory, invokes the kernel function, transfers the result back to CPU memory, and prints it.

Note that the `CHECK_CUDA` macro is available in the header file `` within the lecture library below `/home/numerik/pub/pp/ss19`. Your program can be compiled using the generic Makefile as follows:

---- CODE (type=sh) -----------------------------------------------------------
cp /home/numerik/pub/pp/ss19/Makefile-cuda Makefile
make depend
make
-------------------------------------------------------------------------------

Whenever you add sources or change `#include` directives in your sources in the current directory, you should invoke `make depend` before invoking `make`.

Source code
===========

The code snippets above came from `axpy.cu`:

:import:session08/axpy.cu

:navigate: up -> doc:index
           next -> doc:session08/page02