Introduction to the MPI interface
MPI (Message Passing Interface) is an open standard for a library interface for parallel programs. In contrast to threads and OpenMP, MPI does not rely on communication over shared memory. Instead, every unit of execution runs in its own process with its own virtual memory. This makes it possible to run MPI applications on a cluster (like Pacioli) or an arbitrary network of computers.
The first release of the standard (1.0) was published in 1994, followed by 2.0 in 1997. The current standard, 3.1, was published in June 2015. It is fully supported by our installation on Theon, whereas the Debian hosts in our pool support MPI 2.1. The standards are publicly available at http://www.mpi-forum.org/.
The standard includes language-specific interfaces for Fortran and C. We use the C interface for our C++ programs.
Multiple open source implementations are available:
- OpenMPI: http://www.open-mpi.org/ (on Theon and on our Debian hosts in E.44)
- MPICH: http://www.mpich.org/
- MVAPICH: http://mvapich.cse.ohio-state.edu/ (supports InfiniBand)
We use OpenMPI in our lecture and our sessions.
For compiling and linking, the commands mpiCC or mpic++ have to be used instead of g++. MPI programs are not run directly but with the help of the mpirun tool, which configures the execution environment. mpirun needs to know how many processes are to be started on which hosts; this is specified on the command line. The total number of processes is thereby fixed by mpirun and cannot be changed while the MPI application runs.
Previously, in our thread-based solutions and in OpenMP, our application started with a single thread and parallelization was always explicit. In MPI, all processes are started right from the beginning by mpirun and execute (by default) the same program.
As in OpenMP, every process can learn the total number of processes and its own number (named rank in MPI). The process with rank 0 has a special role: it should be the only process that considers command line arguments and accesses standard input/output. In case of a master/worker pattern, the process with rank 0 naturally assumes the role of the master.
The following small example demonstrates how the workers send their ranks to the master using MPI_Send, while the master receives these messages using MPI_Recv and prints them to standard output.
#include <mpi.h>
#include <printf.hpp>

int main(int argc, char** argv) {
   MPI_Init(&argc, &argv);

   int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   int nof_processes; MPI_Comm_size(MPI_COMM_WORLD, &nof_processes);

   if (rank) {
      MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
   } else {
      for (int i = 0; i + 1 < nof_processes; ++i) {
         MPI_Status status;
         int msg;
         MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE,
            0, MPI_COMM_WORLD, &status);
         int count;
         MPI_Get_count(&status, MPI_INT, &count);
         if (count == 1) {
            fmt::printf("%d\n", msg);
         }
      }
   }

   MPI_Finalize();
}
Remember that all n processes run this program concurrently. MPI_Init must be invoked before any other MPI function is used. This function connects to the execution environment. At the end, the function MPI_Finalize is to be invoked which severs the link of the process to its networked execution environment.
The function MPI_Comm_rank returns the rank of the calling process, and MPI_Comm_size delivers the total number of processes. In both cases, the first parameter refers to the communication domain. In our trivial examples this is always MPI_COMM_WORLD, which refers to all processes. We will see later how new communication domains can be created.
The most important operations for communication are MPI_Send and MPI_Recv. Both have a similar sequence of parameters:
- void* buf: a pointer to an array whose values are to be transferred.
- int count: the size of the array (if there is just a single value, 1 is to be given).
- MPI_Datatype datatype: the data type of an element of the array. For elementary data types, predefined types can be used, e.g. MPI_INT for int.
- int dest (for MPI_Send) and int source (for MPI_Recv): the recipient or the sender of a message.
- int tag: messages can be tagged. An incoming message is received only if its tag matches the tag given to MPI_Recv, unless the receiver passes MPI_ANY_TAG.
- MPI_Comm comm: the communication domain that is to be used for this transfer; this is always MPI_COMM_WORLD in trivial cases.
- MPI_Status* status: an extra parameter for MPI_Recv. Together with MPI_Get_count, it allows the receiver to determine how many elements of an array have actually been sent. This can be 0.
Compilation and execution:
theon$ mpiCC -g -std=c++17 -o mpi-test mpi-test.cpp
theon$ mpirun -np 4 mpi-test
3
2
1
theon$

heim$ mpiCC -g -std=c++11 -o mpi-test mpi-test.cpp -Wno-literal-suffix
heim$ mpirun -np 4 mpi-test
3
1
2
heim$

heim$ OMPI_CXX=g++-8.3 mpiCC -g -std=c++17 -o mpi-test mpi-test.cpp -Wno-literal-suffix
/usr/local/libexec/gcc/x86_64-pc-linux-gnu/8.3.0/cc1plus: error while loading shared libraries: libmpfr.so.4: cannot open shared object file: No such file or directory
heim$ mpirun -np 4 mpi-test
3
2
1
heim$
You will get a harmless warning on our Debian machines in E.44 unless you suppress it using the option -Wno-literal-suffix, as the OpenMPI headers are somewhat outdated for C++11. The OMPI_CXX environment variable allows you to switch to another C++ compiler.
Exercise
Extend the trivial example such that a numerical integral is computed for the function f on the interval \([a, b]\) by using \(n\) subintervals:
\[S(f,a,b,n) ~=~ \frac{h}{3} \left( \frac{1}{2} f(x_0) + \sum_{k=1}^{n-1} f(x_k) + 2 \sum_{k=1}^{n} f\left(\frac{x_{k-1} + x_k}{2}\right) + \frac{1}{2} f(x_n) \right)\]
with \(h ~=~ \frac{b-a}{n}\) and \(x_k ~=~ a + k \cdot h\).
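The formula above can be sketched as a plain, serial C++ function before any MPI calls are added. This is only one possible reading of the formula, not a reference solution; the name simpson and the template parameter F (any callable taking and returning double) are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Composite Simpson rule S(f, a, b, n) following the formula above.
// Assumption of this sketch: f is any callable mapping double to double.
template<typename F>
double simpson(F&& f, double a, double b, std::size_t n) {
    assert(n > 0 && a <= b);
    double h = (b - a) / n;
    // boundary terms: 1/2 f(x_0) + 1/2 f(x_n)
    double sum = 0.5 * (f(a) + f(b));
    // interior grid points x_1 .. x_{n-1}
    for (std::size_t k = 1; k < n; ++k) {
        sum += f(a + k * h);
    }
    // midpoints (x_{k-1} + x_k) / 2 = a + (k - 1/2) h, weighted by 2
    for (std::size_t k = 1; k <= n; ++k) {
        sum += 2 * f(a + (k - 0.5) * h);
    }
    return h / 3 * sum;
}
```

Called as simpson([](double x) { return 4 / (1 + x * x); }, 0, 1, 100), the result should be close to \(\pi\).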
If you use the function \(\frac{4}{1 + x^2}\) on the interval \([0, 1]\), you can compare the result against \(\pi\). All processes (including the one with rank 0) shall compute the numerical integral on their respective subinterval (which is to be divided further). Afterwards, all processes with a positive rank shall send their results to the master process with rank 0, which sums up the individual results and prints the sum to standard output. MPI_DOUBLE is to be used as the MPI data type for double.
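One way to assign the respective subintervals is to cut \([a, b]\) into pieces of equal width, one per process. The following is a minimal sketch under that assumption; the helper subinterval is hypothetical and not part of MPI, and other ways of dividing the work are equally valid.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper (not part of MPI): returns the subinterval of
// [a, b] assigned to the process with the given rank when [a, b] is
// cut into nof_processes pieces of equal width.
std::pair<double, double> subinterval(double a, double b,
        int rank, int nof_processes) {
    assert(nof_processes > 0 && 0 <= rank && rank < nof_processes);
    double width = (b - a) / nof_processes;
    double lower = a + rank * width;
    // let the last rank end exactly at b to avoid a rounding gap
    double upper = rank + 1 == nof_processes ? b : a + (rank + 1) * width;
    return {lower, upper};
}
```

In an MPI program, rank and nof_processes would come from MPI_Comm_rank and MPI_Comm_size. Each process would then integrate over its own subinterval, and the partial results would add up to the integral over \([a, b]\).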