=========================================
Improving the two-dimensional aggregation [TOC]
=========================================

A close look at the results of _test-aggregate_ reveals asymmetric
results. For example, an aggregation for a $2048 \times 666$ matrix takes
significantly more time than an aggregation of a $666 \times 2048$ matrix:

---- CODE (type=sh) -----------------------------------------------------------
livingstone$ test-aggregate
   1 x    1: 0.01370 ms 0.02131 ms
   1 x    2: 0.03846 ms 0.04419 ms
   1 x    7: 0.03693 ms 0.03171 ms
   1 x   42: 0.05731 ms 0.06666 ms
   1 x  666: 0.05744 ms 0.06387 ms
   1 x 2048: 0.07869 ms 0.07488 ms
   2 x    1: 0.03853 ms 0.04323 ms
   2 x    2: 0.05802 ms 0.06134 ms
   2 x    7: 0.05686 ms 0.06131 ms
   2 x   42: 0.07523 ms 0.07894 ms
   2 x  666: 0.07443 ms 0.07859 ms
   2 x 2048: 0.09142 ms 0.08006 ms
   7 x    1: 0.03901 ms 0.04282 ms
   7 x    2: 0.05699 ms 0.06122 ms
   7 x    7: 0.05869 ms 0.06016 ms
   7 x   42: 0.07238 ms 0.08000 ms
   7 x  666: 0.07642 ms 0.05878 ms
   7 x 2048: 0.06275 ms 0.05734 ms
  42 x    1: 0.03834 ms 0.04189 ms
  42 x    2: 0.05040 ms 0.05379 ms
  42 x    7: 0.04902 ms 0.05162 ms
  42 x   42: 0.06358 ms 0.05978 ms
  42 x  666: 0.06307 ms 0.05315 ms
  42 x 2048: 0.07718 ms 0.07482 ms
 666 x    1: 0.04595 ms 0.05021 ms
 666 x    2: 0.05926 ms 0.04915 ms
 666 x    7: 0.05142 ms 0.04714 ms
 666 x   42: 0.06275 ms 0.05507 ms
 666 x  666: 0.48006 ms 0.33491 ms
 666 x 2048: 0.60163 ms 0.47082 ms
2048 x    1: 0.05120 ms 0.05216 ms
2048 x    2: 0.06333 ms 0.04723 ms
2048 x    7: 0.06115 ms 0.05517 ms
2048 x   42: 0.09219 ms 0.08624 ms
2048 x  666: 1.05210 ms 0.92387 ms
2048 x 2048: 0.97139 ms 0.84058 ms
livingstone$
-------------------------------------------------------------------------------

A close look at `aggregate` reveals that we select `aggregate_col_wise` or
`aggregate_row_wise` without any consideration of the storage organization
of _A_:

---- CODE (type=cpp) ----------------------------------------------------------
template<
   template<typename> class Matrix,
   typename T,
   typename Aggregator,
   Require< DeviceGe<Matrix<T>> > = true
>
T aggregate(const Matrix<T>& A, Aggregator&& aggregator) {
   using namespace aggregate_impl;
   if (A.numRows() > 1 || A.numCols() > 1) {
      const auto source = A.view();
      dim3 grid_dim(ceildiv(A.numRows(), BLOCK_DIM),
         ceildiv(A.numCols(), BLOCK_DIM));
      if (A.numRows() > A.numCols()) {
         DeviceGeMatrix<T> B(ceildiv(A.numRows(), BLOCK_DIM), A.numCols());
         auto target = B.view();
         dim3 block_dim(1, BLOCK_DIM);
         aggregate_col_wise<<<grid_dim, block_dim>>>(source, target,
            aggregator);
         return aggregate(target, std::move(aggregator));
      } else {
         DeviceGeMatrix<T> B(A.numRows(), ceildiv(A.numCols(), BLOCK_DIM));
         auto target = B.view();
         dim3 block_dim(BLOCK_DIM, 1);
         aggregate_row_wise<<<grid_dim, block_dim>>>(source, target,
            aggregator);
         return aggregate(target, std::move(aggregator));
      }
   } else {
      T result;
      CHECK_CUDA(cudaMemcpy, &result, A.data(), sizeof(T),
         cudaMemcpyDeviceToHost);
      return result;
   }
}
-------------------------------------------------------------------------------

Let us consider `aggregate_col_wise`. Does it deliver better performance if
_A_ is stored in row-major or in col-major order?

---- CODE (type=cpp) ----------------------------------------------------------
template<
   template<typename> class MatrixA,
   template<typename> class MatrixB,
   typename T,
   typename Aggregator,
   Require<
      DeviceGe<MatrixA<T>>, DeviceView<MatrixA<T>>,
      DeviceGe<MatrixB<T>>, DeviceView<MatrixB<T>>
   > = true
>
__global__ void aggregate_col_wise(const MatrixA<T> A, MatrixB<T> B,
      Aggregator aggregator) {
   std::size_t i = threadIdx.x + blockIdx.x * BLOCK_DIM;
   std::size_t j = threadIdx.y + blockIdx.y * BLOCK_DIM;
   if (j < A.numCols()) {
      T val = A(i, j);
      std::size_t maxoffset = BLOCK_DIM;
      if (i + maxoffset >= A.numRows()) {
         maxoffset = A.numRows() - i;
      }
      for (std::size_t offset = 1; offset < maxoffset; ++offset) {
         val = aggregator(val, A(i + offset, j));
      }
      B(blockIdx.x, j) = val;
   }
}
-------------------------------------------------------------------------------

The relevant access operation is `A(i + offset, j)`. Within a warp, only
`threadIdx.y` varies, as `blockDim.x` is 1 while `blockDim.y` is 32.
The only variable depending on `threadIdx.y` is `j`. Hence, while each
thread runs through an individual column, a warp accesses a row during each
iteration of the loop. Consequently, we can expect the best performance for
a matrix in row-major organization. Likewise, `aggregate_row_wise` operates
best on a matrix in col-major organization.

Exercise
========

 * How can the solution be amended such that we always operate with the
   most favorable storage organization?
 * A minor performance penalty comes with the handling of small matrices
   that would fit into a single block. Currently, this requires two kernel
   invocations. How can this be optimized?

:navigate: up -> doc:index
           back -> doc:session10/page04
           next -> doc:session10/page06