Scatter and gather operations

Content

One common communication pattern is

Scatter and gather operations could be done using MPI_Send and MPI_Recv where a master process (usually that with rank 0) repeatedly uses MPI_Send at the beginning to send out the individual partitions of a vector or matrix and which at the joining phase repeatedly uses MPI_Recv to collect the results.

This is, however, inefficient as this causes the latency periods to add up. It is better to parallelize the send and receive operations at the side of the master. Fortunately, MPI offers the operations MPI_Scatter and MPI_Gather which perform a series of send or receive operations in parallel:

int MPI_Scatter(const void* sendbuf, int sendcount, MPI_Datatype sendtype,
                void* recvbuf, int recvcount, MPI_Datatype recvtype,
                int root, MPI_Comm comm);

int MPI_Gather(const void* sendbuf, int sendcount, MPI_Datatype sendtype,
               void* recvbuf, int recvcount, MPI_Datatype recvtype,
               int root, MPI_Comm comm);

In both cases, root is the rank of the process that distributes or collects partitions of a matrix. This is usually 0.

In case of MPI_Scatter, each of the processes (including root!) receives exactly sendcount elements. More precisely, the process with rank i receives the elements \(i \cdot sendcount, \dots, (i+1) \cdot sendcount - 1\). Typically, sendcount equals recvcount. But as with the counts of MPI_Send and MPI_Recv, recvcount may be larger than sendcount. As usual, compatibility between sendtype and recvtype is required. Similarly, recvcount specifies for MPI_Gather how many elements are to be received by the process with rank root from each of the processes (including itself).

Much like as in the case of MPI_Bcast, the same call of MPI_Scatter can be executed by all processes. However, if it appears wasteful to allocate a large matrix for each of the processes, the calls can be separated. MPI_Scatter ignores the parameters sendbuf, sendcount, and sendtype for non-root processes. Likewise MPI_Gather considers the parameters recvbuf, revcount, and recvtype for the root process only. If the program text is separated for these two cases, you are free to use nullptr (in C++) or 0 (in C) for the unused parameters.

Exercise

Following test program initializes a matrix in process 0 and distributes it among the processes where each of them receives share rows. After each of the processes has updated its partition of the matrix, the updated rows are to be collected and printed.

Add the necessary MPI_Scatter and MPI_Gather invocations.

Source code

#include <mpi.h>
#include <hpc/matvec/gematrix.hpp>
#include <hpc/matvec/iterators.hpp>
#include <hpc/matvec/print.hpp>
#include <hpc/mpi/matrix.hpp>

int main(int argc, char** argv) {
   MPI_Init(&argc, &argv);

   int nof_processes; MPI_Comm_size(MPI_COMM_WORLD, &nof_processes);
   int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);

   using namespace hpc::matvec;
   using namespace hpc::mpi;

   using Matrix = GeMatrix<double>;
   int share = 3;
   int num_rows = nof_processes * share;
   int num_cols = 5;

   Matrix B(share, num_cols, Order::RowMajor); /* individual share */

   if (rank == 0) {
      Matrix A(num_rows, num_cols); /* entire matrix */
      for (auto [i, j, Aij]: A) {
	 Aij = i * 100 + j;
      }

      /* using MPI_Scatter: scatter A / receive our share into B */

      for (auto [i, j, Bij]: B) {
	 Bij += 10000 * (rank + 1);
	 (void) i; (void) j; // suppress gcc warning
      }

      /* using MPI_Gather: gather into A / send our share from B */

      print(A, " %6g");
   } else {

      /* using MPI_Scatter: receive our share into B */

      for (auto [i, j, Bij]: B) {
	 Bij += 10000 * (rank + 1);
	 (void) i; (void) j; // suppress gcc warning
      }

      /* using MPI_Gather: send our share from B */
   }
   MPI_Finalize();
}