=======================================
Two-dimensional organization of threads                                   [TOC]
=======================================

It appears straightforward to let CUDA blocks operate on corresponding blocks
of a matrix. Hence, we need a two-dimensional organization of GPU threads.
When we organize threads in two dimensions we must take care that

 * we need a multiple of 16 in both dimensions (as this can be nicely
   processed by warps and half-warps) and that
 * the maximal number of threads per block on our GPU is 1024.

Consequently, we can work with $16 \times 16$ or $32 \times 32$ blocks.
However, when we work with $32 \times 32$ blocks, fewer registers remain
available per thread. Hence, $16 \times 16$ is usually chosen.

For a partitioning in up to three dimensions we need a `dim3` object. The
following example shows how a matrix can be partitioned into blocks of size
$16 \times 16$:

---- CODE (type=cpp) ----------------------------------------------------------
std::size_t dim_per_block = 16;
/* round up such that matrices whose dimensions are not a multiple
   of 16 are still covered completely by the grid */
std::size_t num_blocks_per_row = (A.numRows() + dim_per_block - 1) / dim_per_block;
std::size_t num_blocks_per_col = (A.numCols() + dim_per_block - 1) / dim_per_block;
dim3 block(dim_per_block, dim_per_block);
dim3 grid(num_blocks_per_row, num_blocks_per_col);
kernel_function<<<grid, block>>>(/* ... */);
-------------------------------------------------------------------------------

Exercise
========

Develop a kernel function `gescal` that computes $A \leftarrow \alpha A$,
i.e. all elements of $A$ are to be scaled by $\alpha$. For testing it is
recommended to use the class `hpc::cuda::GeMatrix` from
`<hpc/cuda/gematrix.h>` and the `copy` operation from `<hpc/cuda/copy.h>`.

:navigate: up -> doc:index
           back -> doc:session08/page08
           next -> doc:session08/page10
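
As a starting point for the exercise, here is one possible sketch of such a
kernel. It is not the official solution: it assumes a BLAS-style interface
where the matrix is passed as a raw device pointer together with its
dimensions and strides (the parameter names `m`, `n`, `incRowA`, and
`incColA` are illustrative assumptions, not taken from the lecture
material). Each thread derives its row and column index from `blockIdx`,
`blockDim`, and `threadIdx` and scales at most one element:

---- CODE (type=cpp) ----------------------------------------------------------
#include <cstddef>

template <typename T>
__global__ void gescal(std::size_t m, std::size_t n, T alpha,
                       T* A, std::size_t incRowA, std::size_t incColA) {
   /* map the two-dimensional thread organization onto matrix indices;
      as in the partitioning above, the x dimension walks along the rows */
   std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
   std::size_t j = blockIdx.y * blockDim.y + threadIdx.y;
   /* the grid was rounded up to full blocks, so threads that fall
      outside the matrix must not touch memory */
   if (i < m && j < n) {
      A[i * incRowA + j * incColA] *= alpha;
   }
}
-------------------------------------------------------------------------------

With the partitioning shown above such a kernel would be launched as
`gescal<<<grid, block>>>(...)`; the guard in the kernel keeps the launch
correct even when the matrix dimensions are not multiples of 16.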