Introduction to CUDA

Content

A CUDA source consists of program text that is to be executed by the CPU and program text that is to be run on the GPU. So-called kernel functions are to be called by the CPU but actually executed by the GPU.

A trivial kernel function axpy that computes \(\vec{y} \leftarrow \alpha \vec{x} + \vec{y}\) could be implemented as follows:

__global__ void axpy(std::size_t len, double alpha,
      double* x, double* y) {
   for (std::size_t i = 0; i < len; ++i) {
      y[i] += alpha * x[i];
   }
}

But this solution would fall short of utilizing the possible parallelization on the GPU. The GPU provides a SIMD architecture (single instruction multiple data) where the same instruction can be executed on an array of data (like vectors in this example). A simple parallelization could be done as follows:

__global__ void axpy(double alpha, double* x, double* y) {
   std::size_t tid = threadIdx.x;
   y[tid] += alpha * x[tid];
}

The for-loop is gone, each invocation of axpy computes just one element of \(\vec{y}\). The decision how many threads on the GPU are actually started in parallel for axpy is done in the invocation of the kernel function:

axpy<<<1, N>>>(2.0, device_x, device_y);

The construct <<<...>>> configures a kernel function. In this case we invoke the kernel function with one block (more about blocks later) consisting of N threads (which is the dimension of device_x and device_y). The individual threads have values of 0 to N - 1 for threadIdx.x.

The number of threads per block is limited (to 1024 in case of Livingstone). If the vector is larger, you need to split the vector across multiple blocks. As each of the blocks is operating with the same number of threads, you have to add a check whether the index is in range:

__global__ void axpy(std::size_t n, double alpha, double* x, double* y) {
   std::size_t tid = threadIdx.x + blockIdx.x * blockDim.x;
   if (tid < n) {
      y[tid] += alpha * x[tid];
   }
}

How could N best be spread over multiple blocks? Following points should be kept in mind:

All computations are done in warps or half-warps. A warp has typically the size of 32 threads. If we do not use a multiply of 32 for the dimension of a block, some threads are simply wasted.
A block must not have more than 1024 threads. But going to the full limit of 1024 is less preferable as this limits the number of registers for each thread.
If the problem size is not too small we should seek to employ all available multiprocessors. Each block can be processed by one multiprocessor only.
It is good to have more than one warp per block as the multiprocessor switches between different warps whenever an active warp is waiting for memory operations.

If you are interesting in finding the optimal configuration, you will need benchmarks as this depends on the actual GPU. However, the best performance is usually found in a medium block size like 256.

Following kernel invocation tries to do this:

std::size_t warp_size = hpc::cuda::get_warp_size(); /* typically 32 */
std::size_t nof_warps = ceildiv(N, warp_size);
std::size_t warps_per_block =
   hpc::cuda::get_max_threads_per_block() / warp_size / 4; /* typically 8 */
std::size_t nof_blocks = ceildiv(nof_warps, warps_per_block);
std::size_t threads_per_block;
if (nof_blocks == 1) {
   threads_per_block = N;
} else {
   threads_per_block = warps_per_block * warp_size;
}
axpy<<<nof_blocks, threads_per_block>>>(N, 2.0, device_a, device_b);

The functions of hpc::cuda are available through #include <hpc/cuda/properties.hpp>. The function ceildiv can be usually defined as follows:

std::size_t ceildiv(std::size_t x, std::size_t y) {
   /* note that we expect x > 0 && y > 0;
      not safe against overflows but we expect y to be small */
   return (x + y - 1) / y;
}

As the warp size of 32 and the maximal number of threads being 1024 appears quite commonplace, this is usually simplified:

std::size_t nof_blocks = ceildiv(N, 256);
std::size_t threads_per_block;
if (nof_blocks == 1) {
   threads_per_block = N;
} else {
   threads_per_block = 256;
}
axpy<<<nof_blocks, threads_per_block>>>(N, 2.0, device_a, device_b);

All CUDA library functions like cudaMalloc etc. return possibly an error code. This error code does not just indicate errors by the actual call but also perhaps an error coming asynchronously from an operation (like a kernel function) that started before. Hence, it is advisable to check all CUDA functions for their return value. Otherwise you might miss important error indications. We use a CHECK_CUDA macro from #include <hpc/cuda/check.hpp> that is wrapped around each invocation of a CUDA library function. Whenever an error is detected, CHECK_CUDA prints the error message, specifies the source file, the line number, and the operation for which it occured, and terminates the execution of the program.

The invocation of a kernel function follows typically following steps:

Allocate memory on the GPU using cudaMalloc. Example for a vector named device_a with \(N\) elements:
```
double* device_a;
CHECK_CUDA(cudaMalloc, (void**)&device_a, N * sizeof(double));
```
Copy data from CPU memory to GPU memory. Example for copying the vector a from CPU memory to the vector device_a in GPU memory using the function cudaMemcpy:
```
CHECK_CUDA(cudaMemcpy, device_a, a, N * sizeof(double),
   cudaMemcpyHostToDevice);
```
Actual invocation of the kernel function where the wanted total number of threads is partitioned in multiple blocks. The number of threads per block is limited. In our first example we limit the number of threads per block to 128, a multiple of the warp size of 32:
```
std::size_t nof_blocks = ceildiv(N, 256);
std::size_t threads_per_block;
if (nof_blocks == 1) {
   threads_per_block = N;
} else {
   threads_per_block = 256;
}
axpy<<<nof_blocks, threads_per_block>>>(N, 2.0, device_a, device_b);
```
Copy results from GPU memory to CPU memory. You will need the direction cudaMemcpyDeviceToHost for cudaMemcpy:
```
CHECK_CUDA(cudaMemcpy, a, device_a, N * sizeof(double),
   cudaMemcpyDeviceToHost);
```
Release allocated GPU memory:
```
CHECK_CUDA(cudaFree, device_a);
```

Preparations

You will need to login on Livingstone to work with CUDA. Please check that cuda is in your file ~/.options. Please add it when necessary or add manually /usr/local/cuda/bin to your PATH. Livingstone is an Ubuntu system as Debian is not supported by Nvidia CUDA.

if grep '^cuda$' ~/.options >/dev/null
then : ok
else echo cuda >>~/.options
fi

Exercise

Develope a CUDA kernel function scal that scales a vector x with n elements: \(\vec{x} \leftarrow \alpha \vec{x}\). Write a small test program that allocates the vector in CPU memory, initializes it, copies it to GPU memory, invokes the kernel function, transfers the result back to the GPU and prints it.

Note that the CHECK_CUDA macro is available in the header file <hpc/cuda/check.hpp> within the lecture library below /home/numerik/pub/pp/ss19. Your program can be compiled using the generic Makefile as follows:

cp /home/numerik/pub/pp/ss19/Makefile-cuda Makefile
make depend
make

Whenever you add sources or change #include directives in your sources in the current directory, you should invoke make depend before invoking make.

Source code

The code snippets above came from axpy.cu:

session08/axpy.cu

#include <cstdlib>
#include <iostream>
#include <printf.hpp>
#include <hpc/cuda/check.hpp>
#include <hpc/cuda/properties.hpp>

constexpr std::size_t N = 8192;

__global__ void axpy(std::size_t n, double alpha, double* x, double* y) {
   std::size_t tid = threadIdx.x + blockIdx.x * blockDim.x;
   if (tid < n) {
      y[tid] += alpha * x[tid];
   }
}

std::size_t ceildiv(std::size_t x, std::size_t y) {
   /* note that we expect x > 0 && y > 0;
      not safe against overflows but we expect y to be small */
   return (x + y - 1) / y;
}

int main() {
   double a[N]; double b[N];
   for (std::size_t i = 0; i < N; ++i) {
      a[i] = i; b[i] = i * i;
   }

   /* transfer vectors to GPU memory */
   double* device_a;
   CHECK_CUDA(cudaMalloc, (void**)&device_a, N * sizeof(double));
   CHECK_CUDA(cudaMemcpy, device_a, a, N * sizeof(double),
      cudaMemcpyHostToDevice);
   double* device_b;
   CHECK_CUDA(cudaMalloc, (void**)&device_b, N * sizeof(double));
   CHECK_CUDA(cudaMemcpy, device_b, b, N * sizeof(double),
      cudaMemcpyHostToDevice);

   /* execute kernel function on GPU */
   std::size_t warp_size = hpc::cuda::get_warp_size(); /* typically 32 */
   std::size_t nof_warps = ceildiv(N, warp_size);
   std::size_t warps_per_block =
      hpc::cuda::get_max_threads_per_block() / warp_size / 4; /* typically 8 */
   std::size_t nof_blocks = ceildiv(nof_warps, warps_per_block);
   std::size_t threads_per_block;
   if (nof_blocks == 1) {
      threads_per_block = N;
   } else {
      threads_per_block = warps_per_block * warp_size;
   }
   axpy<<<nof_blocks, threads_per_block>>>(N, 2.0, device_a, device_b);

   /* transfer result vector from GPU to host memory */
   CHECK_CUDA(cudaMemcpy, b, device_b, N * sizeof(double),
      cudaMemcpyDeviceToHost);
   /* free space allocated at GPU memory */
   CHECK_CUDA(cudaFree, device_a); CHECK_CUDA(cudaFree, device_b);

   /* print result */
   for (std::size_t i = 0; i < N; ++i) {
      fmt::printf(" %g", b[i]);
      if (i % 10 == 9) fmt::printf("\n");
   }
   fmt::printf("\n");
}