==============================================
Data types for matrices and vectors on the GPU [TOC]
==============================================

Following the RAII principle, it is easiest to let an object manage the
allocation and the release of GPU memory. In principle, the same class can
be used on both sides. If you want to do this, you need to specify for each
method, including the constructors and destructors, from which side it may
be invoked. This is supported by CUDA: by default, all methods can be
invoked from the CPU only. Methods with a `__device__` specifier in front
of the declaration can be called on the GPU side. If a method is to be
usable from both sides, the keywords `__device__` and `__host__` have to
be combined.

The following approach appears straightforward:

* Constructors and destructors of classes that require dynamic storage
  remain CPU methods, as the allocation and release can be done by the
  CPU side only.

  Example:

  ---- CODE (type=cpp) -------------------------------------------------------
  DenseVector(std::size_t length)
     : length(length), inc(1) {
     CHECK_CUDA(cudaMalloc, (void**)&data, length * sizeof(T));
  }
  ~DenseVector() {
     CHECK_CUDA(cudaFree, data);
  }
  ----------------------------------------------------------------------------

* View objects that do not need dynamic memory can be constructed on both
  sides.

  Example:

  ---- CODE (type=cpp) -------------------------------------------------------
  __device__ __host__
  DenseVectorView(std::size_t length, T *data, std::ptrdiff_t inc)
     : length(length), inc(inc), data(data) {
  }
  ----------------------------------------------------------------------------

* The access methods for dynamic GPU storage are restricted to the GPU side.
  Example:

  ---- CODE (type=cpp) -------------------------------------------------------
  __device__ const T& operator()(std::size_t i) const {
     return data[i];
  }
  ----------------------------------------------------------------------------

To support the transfer between GPU and CPU objects, it is advisable to
offer corresponding `copy` operations. However, as we have no efficient
means of reorganizing content during a transfer, both sides shall share
the same organization on a contiguous stretch of memory. Then the transfer
can be done with one invocation of `cudaMemcpy`:

---- CODE (type=cpp) ----------------------------------------------------------
template <typename T>
void copy(const hpc::matvec::DenseVector<T>& x, DenseVector<T>& y) {
   assert(x.length() == y.length());
   CHECK_CUDA(cudaMemcpy, y.data(), x.data(),
      x.length() * sizeof(T), cudaMemcpyHostToDevice);
}

template <typename T>
void copy(const DenseVector<T>& x, hpc::matvec::DenseVector<T>& y) {
   assert(x.length() == y.length());
   CHECK_CUDA(cudaMemcpy, y.data(), x.data(),
      x.length() * sizeof(T), cudaMemcpyDeviceToHost);
}
-------------------------------------------------------------------------------

In consequence, we do not support copy operations for views.

Exercise
========

Adapt your solution for the scalar product such that the vector class out
of `` and the copy operations from `` are used. Your solution shall no
longer use `cudaMalloc`, `cudaMemcpy`, or `cudaFree` directly. The kernel
function can be left untouched. However, you need to reconsider how
pointers are to be passed. Until now we did it using the access function
as, for example, in `&x(0)`. Why does this no longer work?

:navigate: up -> doc:index
           back -> doc:session08/page06
           next -> doc:session08/page08
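The interplay of the pieces above can be sketched in a host-only analogue:
a minimal RAII vector class whose constructor acquires the buffer and
whose destructor releases it, plus a bulk `copy` between two vectors of
identical layout. Here `new[]`/`delete[]` and `std::copy` stand in for
`cudaMalloc`, `cudaFree`, and `cudaMemcpy`; the member names and the
`data()`/`length()` accessors are assumptions modeled on the copy
operations shown above, not the course's actual class.

---- CODE (type=cpp) ----------------------------------------------------------
#include <algorithm>
#include <cassert>
#include <cstddef>

// Host-only RAII sketch: the object owns its buffer for its whole
// lifetime. In the CUDA variant, new[]/delete[] would be replaced by
// cudaMalloc/cudaFree and operator() would carry __device__.
template <typename T>
class DenseVector {
public:
   explicit DenseVector(std::size_t length)
      : length_(length), inc_(1), data_(new T[length]) {
   }
   ~DenseVector() {
      delete[] data_;
   }
   // copying is disabled so the buffer is released exactly once
   DenseVector(const DenseVector&) = delete;
   DenseVector& operator=(const DenseVector&) = delete;

   std::size_t length() const { return length_; }
   T* data() { return data_; }
   const T* data() const { return data_; }

   T& operator()(std::size_t i) { return data_[i * inc_]; }
   const T& operator()(std::size_t i) const { return data_[i * inc_]; }

private:
   std::size_t    length_;
   std::ptrdiff_t inc_;
   T*             data_;
};

// analogue of the copy operations: one bulk transfer between two
// vectors sharing the same contiguous layout (std::copy instead of
// a single cudaMemcpy invocation)
template <typename T>
void copy(const DenseVector<T>& x, DenseVector<T>& y) {
   assert(x.length() == y.length());
   std::copy(x.data(), x.data() + x.length(), y.data());
}
-------------------------------------------------------------------------------

Note that deleting the copy constructor and copy assignment mirrors the
restriction above that views, not full vectors, are the objects passed
around freely: an accidental copy of the owning class would otherwise
release the same buffer twice.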