Parallelization using OpenMP

Content

OpenMP is an open standard introduced in 1997 that provides pragma-based language extensions for Fortran, C, and C++ for thread-based parallelization. It is best to imagine OpenMP as an extension where a thread pool is present right from the beginning and can be used to parallelize selected code sections. OpenMP always implicitly applies the fork-and-join pattern, i.e. all sequential parts of the program are executed by a single thread. However, when a parallelized section is encountered, the work is semi-automatically or manually split into subtasks that are submitted to the thread pool. The end of the parallelized section is reached as soon as all subtasks are completed and their results have been returned to the main task.

Note that OpenMP is pragma-based. Pragmas are directives to the compiler that are embedded in the program text but are not part of the programming language itself. They are used to pass implementation-dependent directives to the compiler. Sometimes special comments are used for this purpose, but some languages support generic pragma directives. C and C++ support the latter through the preprocessor-like directive #pragma, which always occupies an input line of its own and is followed by a selector that identifies the kind of extension (always omp in the case of OpenMP); the rest of the line depends on the particular pragma extension.

The nice property of pragmas is that they can be ignored. In principle, OpenMP programs can also be compiled and run by compilers that are not aware of OpenMP. In this case there is simply no parallelization.
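
The following minimal sketch (not part of the session sources) may help to visualize the fork-and-join pattern and the pragma syntax:

#include <cstdio>

int main() {
   std::printf("sequential part: executed by a single thread\n");
   #pragma omp parallel // fork: the following block is run by every thread of the pool
   {
      std::printf("parallelized section\n");
   }                    // implicit join: wait until all threads have finished the block
   std::printf("sequential part again: back to a single thread\n");
}

Compiled with OpenMP support, the message of the parallelized section is printed once per thread of the pool; compiled by a compiler that ignores the pragma, the same program runs entirely sequentially and prints it just once.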

The website of the OpenMP ARB (Architecture Review Board), which owns and publishes the standard, is at http://www.openmp.org/. The most recent standard is OpenMP 5.0 from November 2018. gcc and g++ have supported OpenMP 4.0 since release 4.9. The compilers we currently use (7.2 or 7.3) support OpenMP 4.5. Summary reference cards are available for OpenMP 4.0 and 4.5. The upcoming release of gcc 9 is expected to support OpenMP 5.0 partially.

To enable the OpenMP extensions, gcc and g++ require the option “-fopenmp” for compiling and for linking. Whenever functions of the OpenMP library are to be used, #include <omp.h> is required. The size of the thread pool is configured through the environment variable OMP_NUM_THREADS.
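
As a small sketch (not part of the session sources) of how the environment variable and the library functions interact, the following program reports the size of the thread pool using omp_get_max_threads():

#include <cstdio>
#include <omp.h> // required for the OpenMP library functions

int main() {
   // size of the thread pool as configured through OMP_NUM_THREADS
   // (by default typically the number of available cores)
   std::printf("thread pool size: %d\n", omp_get_max_threads());
}

Compiled with “-fopenmp” and started with OMP_NUM_THREADS=4, this prints 4.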

We begin with the simplest but also the most common construct in OpenMP, which parallelizes a for loop:

template <
   template<typename> class MatrixA, typename T,
   typename RandomEnginePool,
   Require< Ge<MatrixA<T>> > = true
>
void mt_randomInit(MatrixA<T>& A, RandomEnginePool& repool) {
   #pragma omp parallel for
   for (std::size_t row = 0; row < A.numRows(); ++row) {
      auto A_ = A.block(row, 0).dim(1, A.numCols());
      randomInit(A_, repool);
   }
}

You note the pragma directive #pragma omp parallel for immediately in front of the for loop. This asks for an implicit parallelization of the for loop under the assumption that the body of the for-loop can be executed independently for each value of the for-range. OpenMP partitions the for-range among the threads of the thread pool and inserts an implicit join operation at the end of the for loop that waits until all subtasks are completed.

The loop variable (row in this example) becomes private to each thread, but all other variables (and parameters) outside the for-loop are shared among all threads. As when working with threads directly, you must be careful to avoid conflicts.
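
The following sketch (using a plain std::vector instead of the matrix classes of this session) illustrates these sharing rules: the loop variable is private, all other variables are shared, and the loop is only safe because each iteration writes to a distinct element:

#include <cstddef>
#include <vector>

void scale(std::vector<double>& v, double alpha) {
   // a counter like "int conflicts = 0;" would be shared as well:
   // incrementing it inside the loop would be a data race
   #pragma omp parallel for
   for (std::size_t i = 0; i < v.size(); ++i) { // i is private to each thread
      v[i] *= alpha; // v and alpha are shared; each iteration touches a distinct element
   }
}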

An automatic parallelization of a for-loop is possible only if the for-loop has one of the following so-called canonical forms:

\[ \mbox{for (index = start; index } \left\{ \begin{array}{c} \mbox{<} \\ \mbox{<=} \\ \mbox{>=} \\ \mbox{>} \end{array} \right\} \mbox{end; } \left\{ \begin{array}{l} \mbox{index++} \\ \mbox{++index} \\ \mbox{index--} \\ \mbox{--index} \\ \mbox{index += inc} \\ \mbox{index -= inc} \\ \mbox{index = index + inc} \\ \mbox{index = inc + index} \\ \mbox{index = index - inc} \end{array} \right\} \mbox{)}\]

As seen before, the loop variable may be declared within the for-statement.
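
For example, a strided loop like the following sketch (again not from the session sources) is still in canonical form and can therefore be parallelized:

#include <cstddef>

// canonical form: "<" as relational operator, "i += inc" as increment,
// and the loop variable declared within the for-statement
void scale_even(double* x, std::size_t n, double alpha) {
   #pragma omp parallel for
   for (std::size_t i = 0; i < n; i += 2) {
      x[i] *= alpha;
   }
}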

Additional options can be added to the pragma directive. If you need, for example, an aggregation like a sum of partial results, you can do this with the following construct:

   double sum = 0;
   #pragma omp parallel for reduction(+:sum)
   for (...) {
      sum += /* ... */;
   }
   /* total sum now in sum */

In this case each thread gets its own copy of the sum variable, initialized with the neutral element of the reduction operator (0 in the case of +). During the join phase, the per-thread results are collected and combined using the operator given in the reduction clause. The result is then assigned to the original sum variable.
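
A concrete, self-contained variant of this pattern (a sketch with a plain std::vector rather than the matrix classes of this session) could look as follows:

#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
   std::vector<double> x(1000, 1.5);

   double sum = 0;
   #pragma omp parallel for reduction(+:sum)
   for (std::size_t i = 0; i < x.size(); ++i) {
      sum += x[i]; // each thread accumulates into its own private copy of sum
   }
   // after the implicit join the per-thread sums have been combined
   std::printf("sum = %g\n", sum); // prints 1500
}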

Exercise

Extend the code below with a parallelized version of asum that delivers the sum of the absolute values of all elements of a matrix. The program can be compiled as follows:

theon$ g++ -O3 -g -I/home/numerik/pub/hpc/ws18/session22 -std=c++17 -fopenmp -o random_init16 random_init16.cpp
theon$ time env OMP_NUM_THREADS=1 ./random_init16

real	0m0.11s
user	0m0.09s
sys	0m0.00s
theon$ time env OMP_NUM_THREADS=4 ./random_init16

real	0m0.06s
user	0m0.16s
sys	0m0.00s
theon$ 
heim$ g++-7.2 -O3 -g -I/home/numerik/pub/hpc/ws18/session22 -std=c++17 -fopenmp -o random_init16 random_init16.cpp
heim$ /usr/bin/time -p env OMP_NUM_THREADS=1 ./random_init16
real 0.03
user 0.02
sys 0.00
heim$ /usr/bin/time -p env OMP_NUM_THREADS=4 ./random_init16
real 0.01
user 0.02
sys 0.00
heim$ 

Source code

#include <hpc/aux/repool.hpp>
#include <hpc/aux/slices.hpp>
#include <hpc/matvec/asum.hpp>
#include <hpc/matvec/gematrix.hpp>
#include <hpc/matvec/iterators.hpp>
#include <hpc/matvec/print.hpp>
#include <hpc/matvec/traits.hpp>

using namespace hpc;
using namespace hpc::aux;
using namespace hpc::matvec;

template <
   template<typename> class MatrixA, typename T,
   typename RandomEnginePool,
   Require< Ge<MatrixA<T>> > = true
>
void randomInit(MatrixA<T>& A, RandomEnginePool& repool) {
   using EngineType = typename RandomEnginePool::EngineType;
   RandomEngineGuard<EngineType> guard(repool);
   std::uniform_real_distribution<double> uniform(-100, 100);
   auto& engine = guard.get();

   for (auto [i, j, Aij]: A) {
      Aij = uniform(engine);
      (void) i; (void) j; // suppress gcc warning
   }
}

template <
   template<typename> class MatrixA, typename T,
   typename RandomEnginePool,
   Require< Ge<MatrixA<T>> > = true
>
void mt_randomInit(MatrixA<T>& A, RandomEnginePool& repool) {
   #pragma omp parallel for
   for (std::size_t row = 0; row < A.numRows(); ++row) {
      auto A_ = A.block(row, 0).dim(1, A.numCols());
      randomInit(A_, repool);
   }
}

int main() {
   RandomEnginePool<std::mt19937> repool(4);

   GeMatrix<double> A(1051, 781);
   mt_randomInit(A, repool);
}