RAII storage objects for the GPU


When we work with vectors and matrices on the GPU, we face the unusual situation that these objects are allocated and released on the host but can be accessed only on the device. Hence, the associated RAII classes for maintaining device storage are split into host and device parts.

The following shows an excerpt of <hpc/cuda/buffer.hpp> which provides DeviceBuffer, a host-based RAII class for maintaining arrays on the device:

template<typename T>
struct DeviceBuffer {
   void* const devptr;
   T* const aligned_devptr;

   DeviceBuffer(std::size_t length, std::size_t alignment = alignof(T)) :
	 devptr(cuda_malloc(compute_aligned_size<T>(length, alignment))),
	 aligned_devptr(align_ptr<T>(devptr, alignment)) {
   }

   ~DeviceBuffer() {
      CHECK_CUDA(cudaFree, devptr);
   }

   T* data() const {
      return aligned_devptr;
   }

   DeviceBuffer(DeviceBuffer&&) = default;
   DeviceBuffer(const DeviceBuffer&) = delete;
   DeviceBuffer& operator=(const DeviceBuffer&) = delete;
   DeviceBuffer& operator=(DeviceBuffer&&) = delete;
};
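The helper functions cuda_malloc, compute_aligned_size, and align_ptr are not shown in the excerpt. The following is a sketch of what the two alignment helpers might look like; the bodies are assumptions for illustration only, not the library's actual implementation:

```cpp
#include <cstddef>
#include <cstdint>

/* Hypothetical sketch: number of bytes a raw allocation needs so that
   an array of `length` elements of T can start at an alignment-aligned
   address somewhere within it. */
template<typename T>
std::size_t compute_aligned_size(std::size_t length, std::size_t alignment) {
   return length * sizeof(T) + alignment - 1;
}

/* Hypothetical sketch: round ptr up to the next multiple of alignment. */
template<typename T>
T* align_ptr(void* ptr, std::size_t alignment) {
   auto addr = reinterpret_cast<std::uintptr_t>(ptr);
   auto aligned = (addr + alignment - 1) / alignment * alignment;
   return reinterpret_cast<T*>(aligned);
}
```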

Based on this class, the lecture library provides DeviceGeMatrix in <hpc/cuda/gematrix.hpp>. We also have corresponding copy functions in <hpc/cuda/copy.hpp> which, however, are based on cudaMemcpy and must therefore insist that the storage to be copied is a contiguous block with identical layouts on both sides.

This leads to the following simple steps: create a matrix on the host, initialize it, copy it to the device, operate on it within a kernel function, and copy it back to examine the result:

   GeMatrix<double> A(M, N, Order::RowMajor);
   // fill A...
   DeviceGeMatrix<double> devA(A.numRows(), A.numCols(), Order::RowMajor);
   copy(A, devA); // copy A to devA
   // work on devA using a kernel function...
   copy(devA, A); // copy devA to A

But how do we pass devA to the kernel function and how is the kernel function to be declared?

Previously, on the host, we had constructs like the following, where we passed the matrices (or vectors) by reference:

template <template<typename> class Matrix, typename T,
   Require< Ge<Matrix<T>> > = true>
void f(Matrix<T>& A) {
   // ...
}

Think about how this can be done when we pass vector or matrix parameters from the host to the device:

Write it down on a piece of paper: the program text where the kernel function is called and where the kernel function is declared (which can be templated as well). Within the Require clause, device matrices can be recognized by DeviceGe instead of Ge (using <hpc/cuda/traits.hpp>).

If you are not sure how this can be done using our library, just write down how you would like it to work, provided it is a feasible solution.
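For comparison after you have written down your own idea: one feasible shape (a sketch under assumptions, not necessarily how the lecture library does it) is to pass a lightweight non-owning view object by value, since the owning DeviceGeMatrix must not be copied and a host reference is meaningless on the device. The view type and the view() method in the call-site comment are assumptions here:

   // Sketch only: Matrix<T> is assumed to be a non-owning view type
   // that merely wraps the device pointer and the dimensions, so
   // copying it by value to the device is cheap and safe.
   template<template<typename> class Matrix, typename T,
      Require< DeviceGe<Matrix<T>> > = true>
   __global__ void kernel(Matrix<T> A) { // note: passed by value
      // ...
   }

   // hypothetical call site: obtain a view of devA and pass it by value
   // kernel<<<grid_dim, block_dim>>>(devA.view());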