=======================================
Two-dimensional organization of threads                                   [TOC]
=======================================

It appears straightforward to let CUDA blocks operate on corresponding blocks
of a matrix. Hence, we need a two-dimensional organization of GPU threads.
When we organize threads in two dimensions we must take care that

 * we need a multiple of 16 in both dimensions (as this can be nicely
   processed by warps and half-warps) and that
 * the maximal number of threads per block on our GPU is 1024.

Consequently, we can work with $16 \times 16$ or $32 \times 32$ blocks.
However, when we work with $32 \times 32$ blocks, fewer registers remain
available per thread. Hence, $16 \times 16$ is usually chosen.

For a partitioning in up to three dimensions we need a `dim3` object. The
following example shows how a matrix can be partitioned into blocks of size
$16 \times 16$:

---- CODE (type=cpp) ----------------------------------------------------------
std::size_t dim_per_block = 16;
/* round up such that matrices whose dimensions are not a multiple
   of 16 are still covered completely by the grid */
std::size_t num_blocks_per_row = (A.numRows() + dim_per_block - 1) / dim_per_block;
std::size_t num_blocks_per_col = (A.numCols() + dim_per_block - 1) / dim_per_block;
dim3 block(dim_per_block, dim_per_block);
dim3 grid(num_blocks_per_row, num_blocks_per_col);
kernel_function<<<grid, block>>>(/* ... */);
-------------------------------------------------------------------------------

Exercise
========

Develop a kernel function `gescal` that computes $A \leftarrow \alpha A$,
i.e. all elements of $A$ are to be scaled by $\alpha$. For testing it is
recommended to use the class `hpc::cuda::GeMatrix` from
`<hpc/cuda/gematrix.h>` and the `copy` operation from `<hpc/cuda/copy.h>`.

:navigate: up -> doc:index
           back -> doc:session08/page08
           next -> doc:session08/page10
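
As a starting point for the exercise, here is one possible sketch of such a
kernel. It is not the official solution: it assumes a BLAS-style interface
where the matrix is passed as a raw device pointer together with its
dimensions and strides (the parameter names `m`, `n`, `incRowA`, and
`incColA` are illustrative assumptions, not taken from the lecture
material). Each thread derives its row and column index from `blockIdx`,
`blockDim`, and `threadIdx` and scales at most one element:

---- CODE (type=cpp) ----------------------------------------------------------
#include <cstddef>

template <typename T>
__global__ void gescal(std::size_t m, std::size_t n, T alpha,
                       T* A, std::size_t incRowA, std::size_t incColA) {
   /* map the two-dimensional thread organization onto matrix indices;
      as in the partitioning above, the x dimension walks along the rows */
   std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
   std::size_t j = blockIdx.y * blockDim.y + threadIdx.y;
   /* the grid was rounded up to full blocks, so threads that fall
      outside the matrix must not touch memory */
   if (i < m && j < n) {
      A[i * incRowA + j * incColA] *= alpha;
   }
}
-------------------------------------------------------------------------------

With the partitioning shown above such a kernel would be launched as
`gescal<<<grid, block>>>(...)`; the guard in the kernel keeps the launch
correct even when the matrix dimensions are not multiples of 16.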