# Introduction to CUDA

#### Content

A CUDA source consists of program text that is to be executed by the CPU and program text that is to be run on the GPU. So-called kernel functions are to be called by the CPU but actually executed by the GPU.

A trivial kernel function axpy that computes $$\vec{y} \leftarrow \alpha \vec{x} + \vec{y}$$ could be implemented as follows:

__global__ void axpy(std::size_t len, double alpha,
double* x, double* y) {
for (std::size_t i = 0; i < len; ++i) {
y[i] += alpha * x[i];
}
}


But this solution would fall short of utilizing the possible parallelization on the GPU. The GPU provides a SIMD architecture (single instruction multiple data) where the same instruction can be executed on an array of data (like vectors in this example). A simple parallelization could be done as follows:

__global__ void axpy(double alpha, double* x, double* y) {
y[tid] += alpha * x[tid];
}


The for-loop is gone, each invocation of axpy computes just one element of $$\vec{y}$$. The decision how many threads on the GPU are actually started in parallel for axpy is done in the invocation of the kernel function:

axpy<<<1, N>>>(2.0, cuda_x, cuda_y);


The construct <<<...>>> configures a kernel function. In this case we invoke the kernel function with one block (more about blocks later) consisting of N threads (which is the dimension of cuda_x and cuda_y). The individual threads have values of 0 to N - 1 for threadIdx.x.

The number of threads per block is limited (to 1024 in case of Donegal). If the vector is larger, you need to split the vector across multiple blocks. As each of the blocks is operating with the same number of threads, you have to add a check whether the index is in range:

__global__ void axpy(std::size_t n, double alpha, double* x, double* y) {
std::size_t tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < n) {
y[tid] += alpha * x[tid];
}
}


How could N best be spread over multiple blocks? Following points should be kept in mind:

• All computations are done in warps or half-warps. A warp has typically the size of 32 threads. If we do not use a multiply of 32 for the dimension of a block, some threads are simply wasted.

• A block must not have more than 1024 threads. But going to the full limit of 1024 is less preferable as this limits the number of registers for each thread.

• If the problem size is not too small we should seek to employ all available multiprocessors. Each block can be processed by one multiprocessor only.

• It is good to have more than one warp per block as the multiprocessor switches between different warps whenever an active warp is waiting for memory operations.

If you are interesting in finding the optimal configuration, you will need benchmarks as this depends on the actual GPU. However, the best performance is usually found in a medium block size like 256.

Following kernel invocation tries to do this:

std::size_t warp_size = hpc::cuda::get_warp_size(); /* typically 32 */
std::size_t nof_warps = ceildiv(N, warp_size);
std::size_t warps_per_block =
hpc::cuda::get_max_threads_per_block() / warp_size / 4; /* typically 8 */
std::size_t nof_blocks = ceildiv(nof_warps, warps_per_block);
if (nof_blocks == 1) {
} else {
}


The functions of hpc::cuda are available through #include <hpc/cuda/properties.hpp>. The function ceildiv can be included using #include <hpc/aux/ceildiv.hpp>.

As the warp size of 32 and the maximal number of threads being 1024 appears quite commonplace, this is usually simplified:

std::size_t nof_blocks = ceildiv(N, 256);
if (nof_blocks == 1) {
} else {
}


All CUDA library functions like cudaMalloc etc. return possibly an error code. This error code does not just indicate errors by the actual call but also perhaps an error coming asynchronously from an operation (like a kernel function) that started before. Hence, it is advisable to check all CUDA functions for their return value. Otherwise you might miss important error indications. We use a CHECK_CUDA macro from #include <hpc/cuda/check.hpp> that is wrapped around each invocation of a CUDA library function. Whenever an error is detected, CHECK_CUDA prints the error message, specifies the source file, the line number, and the operation for which it occured, and terminates the execution of the program.

The invocation of a kernel function follows typically following steps:

• Allocate memory on the GPU using cudaMalloc. Example for a vector named cuda_a with $$N$$ elements:

double* cuda_a;
CHECK_CUDA(cudaMalloc, (void**)&cuda_a, N * sizeof(double));

• Copy data from CPU memory to GPU memory. Example for copying the vector a from CPU memory to the vector cuda_a in GPU memory using the function cudaMemcpy:

CHECK_CUDA(cudaMemcpy, cuda_a, a, N * sizeof(double),
cudaMemcpyHostToDevice);

• Actual invocation of the kernel function where the wanted total number of threads is partitioned in multiple blocks. The number of threads per block is limited. In our first example we limit the number of threads per block to 128, a multiple of the warp size of 32:

std::size_t nof_blocks = ceildiv(N, 256);
if (nof_blocks == 1) {
} else {
}

• Copy results from GPU memory to CPU memory. You will need the direction cudaMemcpyDeviceToHost for cudaMemcpy:

CHECK_CUDA(cudaMemcpy, a, cuda_a, N * sizeof(double),
cudaMemcpyDeviceToHost);

• Release allocated GPU memory:

CHECK_CUDA(cudaFree, cuda_a);


## Preparations

Firstly, you should add “cuda” to ~/.options, if you haven't done this already:

theon$grep -q '^cuda$' ~/.options || echo cuda >>~/.options
theon$ Secondly, you will need to login to our host livingstone which is equipped with a graphic card that can be used through CUDA: theon$ ssh livingstone


Now, you are free to use the lecture library using the compilation flag -I/home/numerik/pub/hpc/ws19/session28. You will find a generic Makefile for this session at /home/numerik/pub/hpc/ws19/session28/misc/Makefile.

## Exercise

Develope a CUDA kernel function scal that scales a vector x with n elements: $$\vec{x} \leftarrow \alpha \vec{x}$$. Write a small test program that allocates the vector in CPU memory, initializes it, copies it to GPU memory, invokes the kernel function, transfers the result back to the GPU and prints it.

Note that the CHECK_CUDA macro is available in the header file <hpc/cuda/check.hpp> within the lecture library below /home/numerik/pub/hpc/ws19/session28. Your program can be compiled using the generic Makefile as follows:

livingstone$ls axpy.cu livingstone$ cp /home/numerik/pub/hpc/ws19/session28/misc/Makefile .
livingstone$make depend gcc-makedepend -std=c++14 -I/home/numerik/pub/hpc/ws19/session28 -D__CUDACC__ -x c++ axpy.cu livingstone$ make
nvcc -c -std=c++14 -I/home/numerik/pub/hpc/ws19/session28  axpy.cu
nvcc -w -o axpy -std=c++14  axpy.o
livingstone$./axpy 0 3 8 15 24 35 48 63 80 99 120 143 168 195 224 255 288 323 360 399 440 483 528 575 624 675 728 783 840 899 960 1023 1088 1155 1224 1295 1368 1443 1520 1599 1680 1763 1848 1935 2024 2115 2208 2303 2400 2499 2600 2703 2808 2915 3024 3135 3248 3363 3480 3599 3720 3843 3968 4095 4224 4355 4488 4623 4760 4899 5040 5183 5328 5475 5624 5775 5928 6083 6240 6399 6560 6723 6888 7055 7224 7395 7568 7743 7920 8099 8280 8463 8648 8835 9024 9215 9408 9603 9800 9999 10200 10403 10608 10815 11024 11235 11448 11663 11880 12099 12320 12543 12768 12995 13224 13455 13688 13923 14160 14399 14640 14883 15128 15375 15624 15875 16128 16383 16640 16899 17160 17423 17688 17955 18224 18495 18768 19043 19320 19599 19880 20163 20448 20735 21024 21315 21608 21903 22200 22499 22800 23103 23408 23715 24024 24335 24648 24963 25280 25599 25920 26243 26568 26895 27224 27555 27888 28223 28560 28899 29240 29583 29928 30275 30624 30975 31328 31683 32040 32399 32760 33123 33488 33855 34224 34595 34968 35343 35720 36099 36480 36863 37248 37635 38024 38415 38808 39203 39600 39999 40400 40803 41208 41615 42024 42435 42848 43263 43680 44099 44520 44943 45368 45795 46224 46655 47088 47523 47960 48399 48840 49283 49728 50175 50624 51075 51528 51983 52440 52899 53360 53823 54288 54755 55224 55695 56168 56643 57120 57599 58080 58563 59048 59535 60024 60515 61008 61503 62000 62499 63000 63503 64008 64515 65024 65535 66048 66563 67080 67599 68120 68643 69168 69695 70224 70755 71288 71823 72360 72899 73440 73983 74528 75075 75624 76175 76728 77283 77840 78399 78960 79523 80088 80655 81224 81795 82368 82943 83520 84099 84680 85263 85848 86435 87024 87615 88208 88803 89400 89999 90600 91203 91808 92415 93024 93635 94248 94863 95480 96099 96720 97343 97968 98595 99224 99855 100488 101123 101760 102399 103040 103683 104328 104975 105624 106275 106928 107583 108240 108899 109560 110223 110888 111555 112224 112895 113568 114243 114920 115599 116280 116963 117648 118335 119024 119715 120408 121103 121800 122499 123200 123903 124608 125315 126024 126735 127448 128163 128880 129599 130320 131043 131768 132495 133224 133955 134688 135423 136160 136899 137640 138383 139128 139875 140624 141375 142128 142883 143640 144399 145160 145923 146688 147455 148224 148995 149768 150543 151320 152099 152880 153663 154448 155235 156024 156815 157608 158403 159200 159999 160800 161603 162408 163215 164024 164835 165648 166463 167280 168099 168920 169743 170568 171395 172224 173055 173888 174723 175560 176399 177240 178083 178928 179775 180624 181475 182328 183183 184040 184899 185760 186623 187488 188355 189224 190095 190968 191843 192720 193599 194480 195363 196248 197135 198024 198915 199808 200703 201600 202499 203400 204303 205208 206115 207024 207935 208848 209763 210680 211599 212520 213443 214368 215295 216224 217155 218088 219023 219960 220899 221840 222783 223728 224675 225624 226575 227528 228483 229440 230399 231360 232323 233288 234255 235224 236195 237168 238143 239120 240099 241080 242063 243048 244035 245024 246015 247008 248003 249000 249999 251000 252003 253008 254015 255024 256035 257048 258063 259080 260099 261120 262143 livingstone$ 

Whenever you add sources or change #include directives in your sources in the current directory, you should invoke make depend before invoking make.

## Source code

The code snippets above came from axpy.cu:

#include <cstdlib>
#include <hpc/aux/ceildiv.hpp>
#include <hpc/cuda/check.hpp>
#include <hpc/cuda/properties.hpp>
#include <printf.hpp>

constexpr std::size_t N = 512;

__global__ void axpy(std::size_t n, double alpha, double* x, double* y) {
std::size_t tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < n) {
y[tid] += alpha * x[tid];
}
}

int main() {
double a[N]; double b[N];
for (std::size_t i = 0; i < N; ++i) {
a[i] = i; b[i] = i * i;
}

/* transfer vectors to GPU memory */
double* cuda_a;
CHECK_CUDA(cudaMalloc, (void**)&cuda_a, N * sizeof(double));
CHECK_CUDA(cudaMemcpy, cuda_a, a, N * sizeof(double),
cudaMemcpyHostToDevice);
double* cuda_b;
CHECK_CUDA(cudaMalloc, (void**)&cuda_b, N * sizeof(double));
CHECK_CUDA(cudaMemcpy, cuda_b, b, N * sizeof(double),
cudaMemcpyHostToDevice);

/* execute kernel function on GPU */
std::size_t warp_size = hpc::cuda::get_warp_size(); /* typically 32 */
std::size_t nof_warps = hpc::aux::ceildiv(N, warp_size);
std::size_t warps_per_block =
hpc::cuda::get_max_threads_per_block() / warp_size / 4; /* typically 8 */
std::size_t nof_blocks = hpc::aux::ceildiv(nof_warps, warps_per_block);
if (nof_blocks == 1) {
} else {
}

/* transfer result vector from GPU to host memory */
CHECK_CUDA(cudaMemcpy, b, cuda_b, N * sizeof(double),
cudaMemcpyDeviceToHost);
/* free space allocated at GPU memory */
CHECK_CUDA(cudaFree, cuda_a); CHECK_CUDA(cudaFree, cuda_b);

/* print result */
for (std::size_t i = 0; i < N; ++i) {
fmt::printf(" %g", b[i]);
if (i % 10 == 9) fmt::printf("\n");
}
fmt::printf("\n");
}