Data types for matrices and vectors on the GPU

Content

Following the RAII principle, it appears easiest to let an object manage the allocation and release of GPU memory. In principle, the same class could be used on both sides. If you want to do this, you need to specify for each method, including the constructors and destructors, from which side it may be invoked. This is supported by CUDA: by default, all methods can be invoked from the CPU only. Methods with a __device__ specifier in front of their declaration can be called on the GPU side. If a method is to be usable from both sides, the keywords __device__ and __host__ have to be combined.
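
As an illustration, a shared class along these lines could look as follows. This is just a minimal sketch: the class name and its members are hypothetical and do not represent the actual interface of the course library, and the error checking of the CUDA calls is omitted for brevity:

#include <cstddef>

template<typename T>
class Vector {
   public:
      /* host-only by default: the constructor allocates GPU memory ... */
      Vector(std::size_t n) : n(n), ptr(nullptr) {
         cudaMalloc(&ptr, n * sizeof(T));
      }
      /* ... which is released by the destructor (RAII) */
      ~Vector() {
         cudaFree(ptr);
      }
      Vector(const Vector&) = delete; /* no implicit copies */
      Vector& operator=(const Vector&) = delete;
      /* callable from both, the CPU and the GPU */
      __host__ __device__ std::size_t length() const {
         return n;
      }
      /* element access works on the GPU only */
      __device__ const T& operator()(std::size_t i) const {
         return ptr[i];
      }
   private:
      std::size_t n;
      T* ptr;
};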

The following approach appears straightforward: instead of sharing one class between both sides, we mirror the CPU types of hpc::matvec with separate GPU classes in the hpc::cuda namespace, such as the DenseVector class from <hpc/cuda/densevector.hpp>.

To support transfers between GPU and CPU objects, it is advisable to offer corresponding copy operations. However, as we have no efficient means of reorganizing the contents during a transfer, both sides shall share the same organization on a contiguous stretch of memory. Then the transfer can be done with a single invocation of cudaMemcpy:

/* as provided by <hpc/cuda/copy.hpp>; the unqualified DenseVector
   refers to the GPU vector class of <hpc/cuda/densevector.hpp> */

/* transfer a vector from CPU memory to GPU memory */
template<typename T>
void
copy(const hpc::matvec::DenseVector<T>& x, DenseVector<T>& y) {
   assert(x.length() == y.length());
   CHECK_CUDA(cudaMemcpy, y.data(), x.data(), x.length() * sizeof(T),
      cudaMemcpyHostToDevice);
}

/* transfer a vector from GPU memory back to CPU memory */
template<typename T>
void
copy(const DenseVector<T>& x, hpc::matvec::DenseVector<T>& y) {
   assert(x.length() == y.length());
   CHECK_CUDA(cudaMemcpy, y.data(), x.data(), x.length() * sizeof(T),
      cudaMemcpyDeviceToHost);
}

As views may refer to non-contiguous memory (e.g. a strided slice of a vector), they cannot be transferred with a single cudaMemcpy. In consequence, we do not support copy operations for views.
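
A minimal usage sketch, assuming that hpc::cuda::DenseVector allocates device memory for the given length and that both vector classes take the length as constructor argument:

#include <cstddef>
#include <hpc/cuda/copy.hpp>
#include <hpc/cuda/densevector.hpp>
#include <hpc/matvec/densevector.hpp>

int main() {
   std::size_t n = 1024;
   hpc::matvec::DenseVector<double> x(n);   /* resides in CPU memory */
   hpc::cuda::DenseVector<double> dev_x(n); /* resides in GPU memory */
   for (std::size_t i = 0; i < n; ++i) {
      x(i) = i; /* initialize on the host */
   }
   copy(x, dev_x); /* CPU -> GPU */
   /* ... run kernels operating on dev_x ... */
   copy(dev_x, x); /* GPU -> CPU */
}

Note that the unqualified calls of copy are resolved through argument-dependent lookup, as dev_x belongs to the hpc::cuda namespace.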

Exercise

Adapt your solution for the scalar product such that the vector class from <hpc/cuda/densevector.hpp> and the copy operations from <hpc/cuda/copy.hpp> are used. Your solution shall no longer invoke cudaMalloc, cudaMemcpy, or cudaFree directly.

The kernel function can be left untouched. However, you need to reconsider how pointers are passed to it. Until now we did this through the access operator, as in &x(0). Why does this no longer work?
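
As a hint, consider a hypothetical kernel (unrelated to the scalar product) that is launched with the raw device pointer obtained through the data() accessor used above; the grid configuration is just a placeholder:

#include <cstddef>

template<typename T>
__global__ void scale(std::size_t n, T alpha, T* x) {
   std::size_t i = threadIdx.x + blockIdx.x * blockDim.x;
   if (i < n) {
      x[i] *= alpha;
   }
}

/* on the host side, with dev_x as in the sketch above: */
scale<<<(n + 255) / 256, 256>>>(n, 2.0, dev_x.data());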