Content

It appears straightforward to let CUDA blocks operate on corresponding blocks of a matrix. This requires a two-dimensional organization of the GPU threads.

When we organize threads in two dimensions we must take care that

• the number of threads per dimension is a multiple of 16 (as such blocks can be nicely processed by warps and half-warps) and that

• the maximal number of threads per block on our GPU is 1024.

In consequence, we can work with $$16 \times 16$$ or $$32 \times 32$$ blocks. However, a $$32 \times 32$$ block consists of 1024 threads, which all have to share the register file of the streaming multiprocessor, leaving fewer registers per thread. Hence, $$16 \times 16$$ blocks are usually chosen.

For the partitioning in up to three dimensions CUDA provides the dim3 type. The following example shows how a matrix can be partitioned into blocks of size $$16 \times 16$$:

std::size_t dim_per_block = 16;
std::size_t num_blocks_per_row = (A.numRows() + dim_per_block - 1) / dim_per_block;
std::size_t num_blocks_per_col = (A.numCols() + dim_per_block - 1) / dim_per_block;
dim3 block(dim_per_block, dim_per_block);
dim3 grid(num_blocks_per_row, num_blocks_per_col);
kernel_function<<<grid, block>>>(/* ... */);
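Because the block counts are rounded up, the grid may cover more index positions than the matrix has elements, so each thread must check its indices against the matrix dimensions. A minimal sketch of this pattern (the kernel name init_matrix and the row-major raw-pointer layout are assumptions for illustration, not part of the hpc::cuda interface):

```cuda
// Sketch: each thread handles at most one matrix element.
// Assumes a row-major m x n matrix passed as a raw device pointer.
template <typename T>
__global__ void init_matrix(std::size_t m, std::size_t n, T* A) {
    std::size_t i = blockIdx.x * blockDim.x + threadIdx.x; // row index
    std::size_t j = blockIdx.y * blockDim.y + threadIdx.y; // column index
    if (i < m && j < n) {   // guard: the grid may overhang the matrix
        A[i * n + j] = T(0);
    }
}
```

The bounds check is essential: without it, threads in the overhanging parts of the last block row and column would write outside the allocated matrix.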

Exercise

Develop a kernel function gescal that computes $$A \leftarrow \alpha A$$, i.e. all matrix elements are scaled by $$\alpha$$. For testing, it is recommended to use the class hpc::cuda::GeMatrix from <hpc/cuda/gematrix.hpp> and the copy operation from <hpc/cuda/copy.hpp>.