Data types for matrices and vectors on the GPU

Content

Following the RAII principle, it appears easiest to let an object manage the allocation and release of GPU memory. In principle, the same class could be used on both sides. If you want to do this, you need to specify for each method, including the constructors and destructors, from which side it may be invoked. This is supported by CUDA: by default, all methods can be invoked from the CPU only. Methods with a __device__ specifier in front of their declaration can be called on the GPU side. If a method is to be usable from both sides, the keywords __device__ and __host__ have to be combined.
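
As an illustration, a shared class along these lines could look as follows. This is just a minimal sketch: the class name and its members are hypothetical and do not represent the actual interface of the course library, and the error checking of the CUDA calls is omitted for brevity:

#include <cstddef>

template<typename T>
class Vector {
   public:
      /* host-only by default: the constructor allocates GPU memory ... */
      Vector(std::size_t n) : n(n), ptr(nullptr) {
         cudaMalloc(&ptr, n * sizeof(T));
      }
      /* ... which is released by the destructor (RAII) */
      ~Vector() {
         cudaFree(ptr);
      }
      Vector(const Vector&) = delete; /* no implicit copies */
      Vector& operator=(const Vector&) = delete;
      /* callable from both, the CPU and the GPU */
      __host__ __device__ std::size_t length() const {
         return n;
      }
      /* element access works on the GPU only */
      __device__ const T& operator()(std::size_t i) const {
         return ptr[i];
      }
   private:
      std::size_t n;
      T* ptr;
};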

The following approach appears straightforward: instead of sharing one class between both sides, we mirror the CPU types of hpc::matvec with separate GPU classes in the hpc::cuda namespace, such as the DenseVector class from <hpc/cuda/densevector.hpp>.

To support transfers between GPU and CPU objects, it is advisable to offer corresponding copy operations. However, as we have no efficient means of reorganizing the contents during a transfer, both sides shall share the same organization on a contiguous stretch of memory. Then the transfer can be done with a single invocation of cudaMemcpy:

/* as provided by <hpc/cuda/copy.hpp>; the unqualified DenseVector
   refers to the GPU vector class of <hpc/cuda/densevector.hpp> */

/* transfer a vector from CPU memory to GPU memory */
template<typename T>
void
copy(const hpc::matvec::DenseVector<T>& x, DenseVector<T>& y) {
   assert(x.length() == y.length());
   CHECK_CUDA(cudaMemcpy, y.data(), x.data(), x.length() * sizeof(T),
      cudaMemcpyHostToDevice);
}

/* transfer a vector from GPU memory back to CPU memory */
template<typename T>
void
copy(const DenseVector<T>& x, hpc::matvec::DenseVector<T>& y) {
   assert(x.length() == y.length());
   CHECK_CUDA(cudaMemcpy, y.data(), x.data(), x.length() * sizeof(T),
      cudaMemcpyDeviceToHost);
}

As views may refer to non-contiguous memory (e.g. a strided slice of a vector), they cannot be transferred with a single cudaMemcpy. In consequence, we do not support copy operations for views.
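
A minimal usage sketch, assuming that hpc::cuda::DenseVector allocates device memory for the given length and that both vector classes take the length as constructor argument:

#include <cstddef>
#include <hpc/cuda/copy.hpp>
#include <hpc/cuda/densevector.hpp>
#include <hpc/matvec/densevector.hpp>

int main() {
   std::size_t n = 1024;
   hpc::matvec::DenseVector<double> x(n);   /* resides in CPU memory */
   hpc::cuda::DenseVector<double> dev_x(n); /* resides in GPU memory */
   for (std::size_t i = 0; i < n; ++i) {
      x(i) = i; /* initialize on the host */
   }
   copy(x, dev_x); /* CPU -> GPU */
   /* ... run kernels operating on dev_x ... */
   copy(dev_x, x); /* GPU -> CPU */
}

Note that the unqualified calls of copy are resolved through argument-dependent lookup, as dev_x belongs to the hpc::cuda namespace.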

Exercise

Adapt your solution for the scalar product such that the vector class from <hpc/cuda/densevector.hpp> and the copy operations from <hpc/cuda/copy.hpp> are used. Your solution shall no longer invoke cudaMalloc, cudaMemcpy, or cudaFree directly.

The kernel function can be left untouched. However, you need to reconsider how pointers are passed to it. Until now we did this through the access operator, as in &x(0). Why does this no longer work?
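
As a hint, consider a hypothetical kernel (unrelated to the scalar product) that is launched with the raw device pointer obtained through the data() accessor used above; the grid configuration is just a placeholder:

#include <cstddef>

template<typename T>
__global__ void scale(std::size_t n, T alpha, T* x) {
   std::size_t i = threadIdx.x + blockIdx.x * blockDim.x;
   if (i < n) {
      x[i] *= alpha;
   }
}

/* on the host side, with dev_x as in the sketch above: */
scale<<<(n + 255) / 256, 256>>>(n, 2.0, dev_x.data());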