Two-dimensional organization of threads

Content

Introduction to CUDA
Sample solution
Synchronized aggregation
Sample solution
Invoking multiple kernel functions from the GPU
Sample solution
Data types for matrices and vectors on the GPU
Sample solution
Two-dimensional organization of threads
- Exercise
Sample solution

It appears straightforward to let CUDA blocks operate on corresponding blocks of a matrix. Hence, we need a two-dimensional organization of GPU threads.

When we organize threads in two dimensions we must take care that

we need a multiple of 16 in both dimensions (as this can be nicely processed by warps and half-warps) and that
the maximal number of threads per block on our GPU is 1024.

In consequence, we can work with \(16 \times 16\) or \(32 \times 32\) blocks. However, when we work with \(32 \times 32\) blocks, we will have less space left for registers. Hence, \(16 \times 16\) is usually chosen.

For the partitioning in up to three dimensions we need a dim3 object. The following example shows how a matrix can be partitioned in blocks of size \(16 \times 16\):

std::size_t dim_per_block = 16;
std::size_t num_blocks_per_row = (A.numRows() + dim_per_block - 1) /
   dim_per_block;
std::size_t num_blocks_per_col = (A.numCols() + dim_per_block - 1) /
   dim_per_block;
dim3 block(dim_per_block, dim_per_block);
dim3 grid(num_blocks_per_row, num_blocks_per_col);
kernel_function<<grid, block>>>(/* ... */);

Exercise

Develop a kernel function gescal that computes \(A \leftarrow \alpha A\), i.e. all elements are to be scaled by \(\alpha\). To test this it is recommended to use the class hpc::cuda::GeMatrix from <hpc/cuda/gematrix.hpp> and the copy operation from <hpc/cuda/copy.hpp>.