================================
RAII storage objects for the GPU
[TOC]
================================

When we work with vectors and matrices on the GPU we are in the unusual
situation that these objects are allocated and released on the host but can
be accessed only on the device. Hence, the associated RAII classes for
maintaining device storage are divided into host and device parts.

The following shows an excerpt of `` which provides `DeviceBuffer`, a
host-based RAII class for maintaining arrays on the device:

---- CODE (type=cpp) ----------------------------------------------------------
template<typename T>
struct DeviceBuffer {
   void* const devptr;
   T* const aligned_devptr;

   DeviceBuffer(std::size_t length, std::size_t alignment = alignof(T)) :
         devptr(cuda_malloc(compute_aligned_size<T>(length, alignment))),
         aligned_devptr(align_ptr<T>(devptr, alignment)) {
   }
   ~DeviceBuffer() {
      CHECK_CUDA(cudaFree, devptr);
   }
   T* data() const {
      return aligned_devptr;
   }
   DeviceBuffer(DeviceBuffer&&) = default;
   DeviceBuffer(const DeviceBuffer&) = delete;
   DeviceBuffer& operator=(const DeviceBuffer&) = delete;
   DeviceBuffer& operator=(DeviceBuffer&&) = delete;
};
-------------------------------------------------------------------------------

Based on `DeviceBuffer`, the lecture library provides the class
`DeviceGeMatrix` in ``. There are also corresponding _copy_ functions in ``
which, however, are based on `cudaMemcpy` and must therefore insist that the
storage to be copied is a contiguous block with an identical layout on both
sides.

This leads to the following simple steps: create a matrix on the host,
initialize it, copy it to the device, operate on it within a kernel
function, and copy it back to examine the result:

---- CODE (type=cpp) ----------------------------------------------------------
GeMatrix<double> A(M, N, Order::RowMajor);
// fill A ...
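// (a hypothetical way to fill A, assuming the library's element access
//  A(i, j) -- any other fill pattern would do just as well:)
//    for (std::size_t i = 0; i < A.numRows(); ++i) {
//       for (std::size_t j = 0; j < A.numCols(); ++j) {
//          A(i, j) = i * A.numCols() + j;
//       }
//    }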
DeviceGeMatrix<double> devA(A.numRows(), A.numCols(), Order::RowMajor);
copy(A, devA); // copy A to devA
// work on devA using a kernel function ...
copy(devA, A); // copy devA to A
-------------------------------------------------------------------------------

But how do we pass `devA` to the kernel function and how is the kernel
function to be declared? Previously, on the host, we had constructs like the
following where we passed the matrices (or vectors) by reference:

---- CODE (type=cpp) ----------------------------------------------------------
template<template<typename> class Matrix, typename T,
   Require< Ge<Matrix<T>> > = true>
void f(Matrix<T>& A) {
   // ...
}
-------------------------------------------------------------------------------

Think about how this can be done when we pass vector or matrix parameters
from host to device:

 * By reference as before on the host?
 * By value?
 * By ...?

Write it down on a piece of paper together with the program text where the
kernel function is called and where the kernel function is declared (which
can be templated as well). Within the `Require` clause, device matrices can
be recognized by `DeviceGe` instead of `Ge` (using ``). If you are not sure
how this can be done using our library, just write down how you would like
it to work, provided it is a feasible solution.

:navigate: up -> doc:index
           next -> doc:session09/page02
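As food for thought, here is one direction such a solution can take (this is
a sketch under assumptions, not the lecture library's actual interface):
since a host reference cannot be dereferenced on the device, a common
approach is to pass a small non-owning _view_ object *by value*. The view
carries only a raw pointer and the layout parameters, so copying it into the
kernel's parameter space is cheap, while element accesses through the copy
still reach the original storage. The following plain C++ sketch (no CUDA
required, all names hypothetical) illustrates why a by-value view behaves
like a reference to the underlying elements:

---- CODE (type=cpp) ----------------------------------------------------------
#include <cassert>
#include <cstddef>

// hypothetical non-owning view: a pointer plus layout parameters;
// copying the view copies the handle, not the elements
template<typename T>
struct GeMatrixView {
   std::size_t numRows, numCols, incRow, incCol;
   T* data;

   T& operator()(std::size_t i, std::size_t j) const {
      return data[i*incRow + j*incCol];
   }
};

// the view is received *by value* (as a CUDA kernel would receive it),
// yet updates through it modify the original storage
template<typename T>
void scale(GeMatrixView<T> A, T alpha) {
   for (std::size_t i = 0; i < A.numRows; ++i) {
      for (std::size_t j = 0; j < A.numCols; ++j) {
         A(i, j) *= alpha;
      }
   }
}

int main() {
   double storage[6] = {1, 2, 3, 4, 5, 6};
   GeMatrixView<double> A{2, 3, 3, 1, storage}; // 2x3, row-major
   scale(A, 2.0);                               // by value, yet effective
   assert(storage[0] == 2 && storage[5] == 12);
}
-------------------------------------------------------------------------------

In CUDA the function would be declared as a `__global__` kernel taking the
view by value and launched with the usual `<<<grid, block>>>` syntax;
whether and how the lecture library hands out such views (e.g. a `view()`
member of `DeviceGeMatrix`) is an assumption here, not a given.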