==================== Introduction to CUDA [TOC] ====================

A CUDA source consists of program text that is to be executed by the CPU and program text that is to be run on the GPU. So-called kernel functions are called by the CPU but actually executed by the GPU. A trivial kernel function _axpy_ that computes $\vec{y} \leftarrow \alpha \vec{x} + \vec{y}$ could be implemented as follows:

---- CODE (type=cpp) ----------------------------------------------------------
__global__ void axpy(std::size_t len, double alpha, double* x, double* y) {
   for (std::size_t i = 0; i < len; ++i) {
      y[i] += alpha * x[i];
   }
}
-------------------------------------------------------------------------------

But this solution would fall short of utilizing the parallelization potential of the GPU. The GPU provides a SIMD architecture (_single instruction multiple data_) where the same instruction can be executed on an array of data (like the vectors in this example). A simple parallelization could be done as follows:

---- CODE (type=cpp) ----------------------------------------------------------
__global__ void axpy(double alpha, double* x, double* y) {
   std::size_t tid = threadIdx.x;
   y[tid] += alpha * x[tid];
}
-------------------------------------------------------------------------------

The for-loop is gone; each invocation of _axpy_ computes just one element of $\vec{y}$. How many threads on the GPU are actually started in parallel for _axpy_ is decided at the invocation of the kernel function:

---- CODE (type=cpp) ----------------------------------------------------------
axpy<<<1, N>>>(2.0, device_x, device_y);
-------------------------------------------------------------------------------

The construct `<<<...>>>` configures a kernel function. In this case we invoke the kernel function with one block (more about blocks later) consisting of `N` threads (where `N` is the dimension of `device_x` and `device_y`). The individual threads see values of 0 to `N - 1` in `threadIdx.x`.

The number of threads per block is limited (to 1024 in the case of Livingstone). If the vector is larger, you need to split it across multiple blocks. As each of the blocks operates with the same number of threads, you have to add a check whether the index is in range:

---- CODE (type=cpp) ----------------------------------------------------------
__global__ void axpy(std::size_t n, double alpha, double* x, double* y) {
   std::size_t tid = threadIdx.x + blockIdx.x * blockDim.x;
   if (tid < n) {
      y[tid] += alpha * x[tid];
   }
}
-------------------------------------------------------------------------------

How is `N` best spread over multiple blocks? The following points should be kept in mind:

 * All computations are done in warps or half-warps. A warp typically has a size of 32 threads. If the dimension of a block is not a multiple of 32, some threads are simply wasted.
 * A block must not have more than 1024 threads. But going to the full limit of 1024 is less preferable as it limits the number of registers available to each thread.
 * If the problem size is not too small, we should seek to employ all available multiprocessors. Each block can be processed by one multiprocessor only.
 * It is good to have more than one warp per block as the multiprocessor switches between different warps whenever an active warp is waiting for memory operations.

If you are interested in finding the optimal configuration, you will need benchmarks as this depends on the actual GPU. A concrete launch example for the range-checked kernel is sketched below.
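To make the index computation concrete, here is a small launch example for the range-checked kernel above. The vector length of 1000 is made up for illustration, and `device_x` and `device_y` are assumed to have been allocated in GPU memory and filled already:

---- CODE (type=cpp) ----------------------------------------------------------
/* illustration with a made-up problem size:
   1000 elements do not fit into a single block of 256 threads */
std::size_t N = 1000;
std::size_t threads_per_block = 256; /* a multiple of the warp size 32 */
std::size_t nof_blocks = (N + threads_per_block - 1) / threads_per_block; /* = 4 */
/* 4 * 256 = 1024 threads are started; the thread with blockIdx.x == 3 and
   threadIdx.x == 231 computes tid = 231 + 3 * 256 = 999, while the last
   24 threads (tid >= 1000) fail the range check and do nothing */
axpy<<<nof_blocks, threads_per_block>>>(N, 2.0, device_x, device_y);
-------------------------------------------------------------------------------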
In practice, the best performance is usually found with a medium block size like 256. The following kernel invocation tries to take the points above into account:

---- CODE (type=cpp) ----------------------------------------------------------
std::size_t warp_size = hpc::cuda::get_warp_size(); /* typically 32 */
std::size_t nof_warps = ceildiv(N, warp_size);
std::size_t warps_per_block =
   hpc::cuda::get_max_threads_per_block() / warp_size / 4; /* typically 8 */
std::size_t nof_blocks = ceildiv(nof_warps, warps_per_block);
std::size_t threads_per_block;
if (nof_blocks == 1) {
   threads_per_block = N;
} else {
   threads_per_block = warps_per_block * warp_size;
}
axpy<<<nof_blocks, threads_per_block>>>(N, 2.0, device_a, device_b);
-------------------------------------------------------------------------------

The functions of `hpc::cuda` are available through `#include `. The function `ceildiv` can be defined as follows:

---- CODE (type=cpp) ----------------------------------------------------------
std::size_t ceildiv(std::size_t x, std::size_t y) {
   /* note that we expect x > 0 && y > 0;
      not safe against overflows but we expect y to be small */
   return (x + y - 1) / y;
}
-------------------------------------------------------------------------------

As a warp size of 32 and a maximum of 1024 threads per block are quite common, this is usually simplified:

---- CODE (type=cpp) ----------------------------------------------------------
std::size_t nof_blocks = ceildiv(N, 256);
std::size_t threads_per_block;
if (nof_blocks == 1) {
   threads_per_block = N;
} else {
   threads_per_block = 256;
}
axpy<<<nof_blocks, threads_per_block>>>(N, 2.0, device_a, device_b);
-------------------------------------------------------------------------------

All CUDA library functions like `cudaMalloc` return an error code. This error code does not just indicate errors of the actual call but possibly also an error arriving asynchronously from an operation (like a kernel function) that was started earlier. Hence, it is advisable to check the return value of every CUDA function. Otherwise you might miss important error indications. We use a `CHECK_CUDA` macro from `#include ` that is wrapped around each invocation of a CUDA library function. Whenever an error is detected, `CHECK_CUDA` prints the error message, the source file, the line number, and the operation for which the error occurred, and terminates the execution of the program.
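For intuition, a minimal sketch of how such a macro could be implemented follows. This is an assumption for illustration, not the actual macro from the lecture library; it merely shows the general pattern of checking the returned `cudaError_t` and printing a diagnostic with `cudaGetErrorString`:

---- CODE (type=cpp) ----------------------------------------------------------
/* hypothetical sketch, not the actual macro from the lecture library */
#include <cstdio>
#include <cstdlib>

#define CHECK_CUDA(cuda_function, ...) \
   do { \
      cudaError_t rc = cuda_function(__VA_ARGS__); \
      if (rc != cudaSuccess) { \
         std::fprintf(stderr, "%s:%d: %s failed: %s\n", \
            __FILE__, __LINE__, #cuda_function, cudaGetErrorString(rc)); \
         std::exit(1); \
      } \
   } while (false)
-------------------------------------------------------------------------------

With this calling convention, `CHECK_CUDA(cudaMalloc, (void**)&device_a, N * sizeof(double))` expands to a checked call of `cudaMalloc`, as used in the steps below.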
The invocation of a kernel function typically follows these steps:

 * Allocate memory on the GPU using `cudaMalloc`. Example for a vector named `device_a` with $N$ elements:

   ---- CODE (type=cpp) -------------------------------------------------------
   double* device_a;
   CHECK_CUDA(cudaMalloc, (void**)&device_a, N * sizeof(double));
   ----------------------------------------------------------------------------

 * Copy data from CPU memory to GPU memory. Example for copying the vector `a` from CPU memory to the vector `device_a` in GPU memory using the function `cudaMemcpy`:

   ---- CODE (type=cpp) -------------------------------------------------------
   CHECK_CUDA(cudaMemcpy, device_a, a, N * sizeof(double), cudaMemcpyHostToDevice);
   ----------------------------------------------------------------------------

 * The actual invocation of the kernel function, where the wanted total number of threads is partitioned into multiple blocks. The number of threads per block is limited. In our first example we limit the number of threads per block to 256, a multiple of the warp size of 32:

   ---- CODE (type=cpp) -------------------------------------------------------
   std::size_t nof_blocks = ceildiv(N, 256);
   std::size_t threads_per_block;
   if (nof_blocks == 1) {
      threads_per_block = N;
   } else {
      threads_per_block = 256;
   }
   axpy<<<nof_blocks, threads_per_block>>>(N, 2.0, device_a, device_b);
   ----------------------------------------------------------------------------

 * Copy the results from GPU memory back to CPU memory. You will need the direction `cudaMemcpyDeviceToHost` for `cudaMemcpy`:

   ---- CODE (type=cpp) -------------------------------------------------------
   CHECK_CUDA(cudaMemcpy, a, device_a, N * sizeof(double), cudaMemcpyDeviceToHost);
   ----------------------------------------------------------------------------

 * Release the allocated GPU memory:

   ---- CODE (type=cpp) -------------------------------------------------------
   CHECK_CUDA(cudaFree, device_a);
   ----------------------------------------------------------------------------

Preparations
============

You will need to login on Livingstone to work with CUDA. Please check that _cuda_ is in your file _~/.options_. Add it when necessary or add _/usr/local/cuda/bin_ manually to your _PATH_. Livingstone is an Ubuntu system as Debian is not supported by Nvidia CUDA.

---- CODE (type=sh) -----------------------------------------------------------
if grep '^cuda$' ~/.options >/dev/null
then
   : ok
else
   echo cuda >>~/.options
fi
-------------------------------------------------------------------------------

Exercise
========

Develop a CUDA kernel function `scal` that scales a vector `x` with `n` elements: $\vec{x} \leftarrow \alpha \vec{x}$. Write a small test program that allocates the vector in CPU memory, initializes it, copies it to GPU memory, invokes the kernel function, transfers the result back to CPU memory, and prints it.

Note that the `CHECK_CUDA` macro is available in the header file `` within the lecture library below `/home/numerik/pub/pp/ss19`. Your program can be compiled using the generic Makefile as follows:

---- CODE (type=sh) -----------------------------------------------------------
cp /home/numerik/pub/pp/ss19/Makefile-cuda Makefile
make depend
make
-------------------------------------------------------------------------------

Whenever you add sources or change `#include` directives in your sources in the current directory, you should invoke `make depend` before invoking `make`.

Source code
===========

The code snippets above came from `axpy.cu`:

:import:session08/axpy.cu

:navigate: up -> doc:index
           next -> doc:session08/page02