Two-dimensional organization of threads


It suggests itself to let CUDA blocks operate on corresponding blocks of a matrix. For this we need a two-dimensional organization of GPU threads.

When we organize threads in two dimensions, we must take care that the total number of threads per block is limited to 1024 and that the block size should be a multiple of the warp size of 32.

Consequently, we can work with \(16 \times 16\) or \(32 \times 32\) blocks. However, with \(32 \times 32\) blocks we reach the limit of 1024 threads per block, which leaves fewer registers available per thread. Hence, \(16 \times 16\) is usually chosen.
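The actual limits of a device can be queried at run time through the CUDA runtime API. The following sketch (querying device 0) prints the maximal number of threads per block, the maximal block dimensions, and the warp size:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    std::printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    std::printf("max block dims: %d x %d x %d\n",
                prop.maxThreadsDim[0], prop.maxThreadsDim[1],
                prop.maxThreadsDim[2]);
    std::printf("warp size: %d\n", prop.warpSize);
}
```

On current devices this reports 1024 threads per block and a warp size of 32, which is why \(32 \times 32\) is the largest square block.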

For the partitioning in up to three dimensions CUDA provides the dim3 type. The following example shows how a matrix can be partitioned into blocks of size \(16 \times 16\):

std::size_t dim_per_block = 16;
/* round up so that a partial block at the border is also covered */
std::size_t num_blocks_per_row = (A.numRows() + dim_per_block - 1) /
   dim_per_block;
std::size_t num_blocks_per_col = (A.numCols() + dim_per_block - 1) /
   dim_per_block;
dim3 block(dim_per_block, dim_per_block);
dim3 grid(num_blocks_per_row, num_blocks_per_col);
kernel_function<<<grid, block>>>(/* ... */);


Develop a kernel function gescal that computes \(A \leftarrow \alpha A\), i.e., scales every element of \(A\) by \(\alpha\). For testing, it is recommended to use the class hpc::cuda::GeMatrix from <hpc/cuda/gematrix.hpp> and the copy operation from <hpc/cuda/copy.hpp>.
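A starting point for such a kernel could look as follows. The raw-pointer interface with row and column increments is an assumption chosen to keep the sketch self-contained; with hpc::cuda::GeMatrix one would pass the corresponding members instead:

```cuda
#include <cstddef>

// scales each element of an m x n matrix A by alpha;
// incRowA/incColA are the strides between consecutive rows/columns
// (an assumption, to stay independent of hpc::cuda::GeMatrix)
template <typename T>
__global__ void gescal(std::size_t m, std::size_t n, T alpha,
                       T* A, std::size_t incRowA, std::size_t incColA) {
    std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    std::size_t j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < m && j < n) {  // the grid may overhang the matrix
        A[i * incRowA + j * incColA] *= alpha;
    }
}
```

The guard is essential: since the matrix dimensions are in general not multiples of 16, some threads of the border blocks fall outside the matrix and must simply do nothing.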