Let’s talk about parallel loops. In parallel computing, we’ve been designing, describing, implementing and using parallel loops almost since the beginning. The advantage of parallel loops (over almost all other forms of parallelism) is that the parallelism scales up with the data set size or loop trip count (number of iterations).

So what exactly is a parallel loop? A sequential loop has a loop body and some kind of loop control: the program typically executes the loop body once, then the control code will determine whether to loop back to the top of the loop body and execute it again, perhaps updating an index or control variable. Similarly, a parallel loop has a loop body and some control. However, if the program executes all the loop iterations in parallel, for instance, there is no loop per se, no jump to the top of the loop. It’s not really a loop at all. However, I can’t change the common nomenclature and a parallel loop often resembles a sequential loop, so parallel loop it is.

Let’s define what we mean by the word parallel in parallel loop. I define a parallel loop as a code body surrounded by parallel loop control such that the body of the loop can be or is executed with parallelism. There are many styles or levels of parallelism in current hardware. Writing and executing a parallel loop means expressing and exploiting the parallelism in the program with the goal of better performance. Note the two steps. Expressing the parallelism is the job of writing the program, which can be more or less convenient, depending on the features of the chosen language. Exploiting the parallelism depends on how the implementation can map or chooses to map the parallelism expressed in the program onto the parallelism available in the target machine. Some implementations make that mapping entirely the job of the implementation (the compiler or library), others make it entirely the job of the programmer (making it explicit in the language or the specification).

Consider an MPI program, which maps a data-parallel operation across the MPI ranks. The expression of the parallelism is in the use of the MPI rank index to determine on which part of the data set each rank operates. In that case, the mapping of the data parallelism to the ranks is entirely the job of the programmer. Some applications use an intermediate application framework that manages the MPI interface. The framework, when given the characteristics of the application and the job environment, determines which rank does which part of the work. In that case, the framework does that mapping. In either case, the actual mapping from the MPI rank down to the node or core is done by the scheduling manager and operating system, and is usually completely hidden from the programmer.
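As a rough sketch of what "using the rank index to determine which part of the data set each rank operates on" can look like, the helper below computes a contiguous block of iterations for one rank. The name block_range and the choice of a block distribution are assumptions made for illustration; in a real MPI program the rank and the number of ranks would come from MPI_Comm_rank and MPI_Comm_size.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper (not part of MPI): returns the half-open range
// [lo, hi) of iterations that `rank` handles when n iterations are split
// into contiguous blocks over `nranks` ranks. The first n % nranks ranks
// each receive one extra iteration.
std::pair<long, long> block_range(long n, int rank, int nranks) {
    assert(n >= 0 && nranks > 0 && rank >= 0 && rank < nranks);
    long base = n / nranks;
    long extra = n % nranks;
    long lo = rank * base + (rank < extra ? rank : extra);
    long hi = lo + base + (rank < extra ? 1 : 0);
    return {lo, hi};
}
```

Each rank would then run an ordinary sequential loop from lo to hi over its own part of the data. The programmer, not the implementation, has decided the mapping.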

For another example, let’s consider the relatively new Fortran do concurrent construct.

    do concurrent(i = 1:n)
      a(i) = a(i+m)+b(i)
    end do

The do concurrent is a promise or guarantee from the programmer to the implementation (the compiler) that all the iterations are data-independent: that is, no iteration reads from or writes to any variable or array element that any other iteration writes to. It’s a promise that the implementation can exploit to run the iterations in any order, and in particular, run them in parallel. It can spread the iterations across threads, or pipeline the loop, or vectorize (or SIMDize) the loop, or unroll-and-jam, or do any other loop transformation that makes for faster execution.

Compare this to the OpenMP omp parallel do construct:

    !$omp parallel do
    do i = 1, n
      a(i) = a(i+m) + b(i)
    end do

The omp parallel do is a command from the programmer to split the iterations of the loop into chunks, and to run those chunks across the OpenMP threads that were either previously created, or that will be created for this construct. It doesn’t say that the loop is a parallel loop. The compiler can’t use the omp parallel do directive as a hint that the loop iterations are all independent and therefore the loop can be vectorized, for instance. The directive is a promise that there are no unprotected race conditions across the threads. But there is no promise that there are no race conditions across iterations. An OpenMP programmer can, and often does, allow for data races either by explicitly scheduling them onto the same thread, or by including synchronization (such as a critical section) to resolve the race.

So, in one case, we have expressed a true parallel loop with no hints from the programmer as to how to exploit that parallelism, leaving the implementation to choose from among the many forms of parallelism available on today’s processors. In the other case, we have directed the implementation how to exploit parallelism, without actually expressing that this is a parallel operation. The implementation has no flexibility as to how to map the parallelism onto the hardware because it is explicit in the language.

While researching for this article, I learned that many parallel loop constructs are defined by their implementations. The OpenMP parallel do and parallel for are examples of this. In some cases, the meaning of the loop construct itself isn’t specified at all, or not well-specified. It’s not clear whether the given implementation is the only legal implementation, or one of many possible legal implementations, and how one could determine whether a different implementation is legal.

Current Parallel Loops

What makes a good parallel loop? This is a complex question, and anyone who has designed a parallel language of any kind will have strong opinions, usually favoring their own design. Automatic parallelizing compilers can be applied to any program, even sequential programs, to achieve even more parallelism than is expressed, but here I’m only evaluating the language, not the compiler. I’m going to use metrics that I think make for better parallel loops.

It should interoperate with existing languages. Current HPC applications are mostly written in Fortran, C++ or C. Some of the parallel loop patterns below support all three languages, but most are designed only for one. In all cases, the program can call existing precompiled procedures inside a parallel loop, though there may be data access restrictions since those precompiled procedures would not know about any parallelism or data distribution. There’s a lot of good HPC work in other languages, such as Ruby on Rails, Python and Matlab, but I don’t see them as quite ready to replace the more traditional languages.

It should be syntactically close to the common sequential loop. This makes it easier to migrate from sequential to parallel execution, and makes the program look more familiar and hence easier to understand. All cases shown here (except the array assignments) satisfy this metric.

It should accommodate the two most common parallel loop idioms: independent loop iterations, and a parallel reduction to a scalar (sum, max, min or similar). All of the parallel loop patterns below can handle a loop where all iterations are independent, but not all allow for a reduction within such a loop. Some require the reduction to be rewritten, or extracted from the loop as a separate operation. Additional scalable parallel primitives, such as atomic operations or nested parallelism, are extra credit.

It should allow the implementation to use as much of the hardware parallelism as is available. As suggested above, some machines have SIMD registers and instructions, or use software pipelining for additional performance.

It should allow an implementation to efficiently execute the parallel loop sequentially as well.

These last two points make up what we call dynamic scalability, scaling up to use more parallelism or down to use less parallelism as necessary to get the best overall performance on the target hardware. Scaling down is useful when running on a system with less parallelism, or when running nested parallel operations where the outer parallel loop already exploits all the available hardware parallelism.
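To make the two common idioms named in the metrics concrete, here is a minimal sketch of each, written with OpenMP directives only because that notation appears elsewhere in this article. The function names scaled_add and sum are assumptions for illustration. The directives are guarded so that a compiler without OpenMP enabled simply runs both loops sequentially, which is the scaled-down case.

```cpp
#include <cassert>
#include <vector>

// Idiom 1: independent iterations. Iteration i reads x[i] and y[i] and
// writes only y[i], so no iteration touches data written by another.
void scaled_add(double alpha, const std::vector<double>& x,
                std::vector<double>& y) {
    assert(x.size() == y.size());
    const long n = static_cast<long>(x.size());
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (long i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}

// Idiom 2: reduction to a scalar. Every iteration updates `total`, so the
// loop is only parallel because the update is declared as a sum reduction.
double sum(const std::vector<double>& x) {
    const long n = static_cast<long>(x.size());
    double total = 0.0;
#ifdef _OPENMP
#pragma omp parallel for reduction(+:total)
#endif
    for (long i = 0; i < n; ++i)
        total += x[i];
    return total;
}
```

Note that a parallel floating-point reduction may add the elements in a different order than the sequential loop, so results can differ in the last bits for general data.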

With that as a starting point, I’m going to list a number of parallel loop constructs or designs that are or have been implemented or proposed, and score each of them on these metrics. You may have different metrics, or may score these differently, and that’s another good discussion.

I’ve divided the various parallel loop patterns into four groups: Thread-centric, array-oriented, C++ parallel iterators, and true parallel loops. A thread-centric parallel loop defines the loop execution as spreading the loop iterations across a set of threads managed by the implementation. Array-oriented parallel loops are designed to efficiently execute matrix and vector operations. C++ parallel iterators deserve their own group because they all benefit and suffer from the necessary C++ implementation hurdles. True parallel loops, like the Fortran do concurrent, are the most flexible, but generally leave less control to the programmer. Within each group, I discuss each language roughly chronologically. In the interest of length, I’ve removed description of most of the historical languages, such as The Force, the PCF parallel do, the doall and the Myrias pardo, to a separate addendum, which the more academically inclined may find interesting. If you want to skip over the details of the various parallel loops, you can jump to my predictions of the future.

Thread-Centric Parallel Loops

A thread-centric parallel loop is generally defined by creating or reusing a set of threads (or processes), then splitting or sharing the loop iterations over these threads. The method used to decide which thread executes which iteration is called scheduling, and different examples have different scheduling mechanisms. Thread-centric parallel loops are not true parallel loops because they generally don’t require that the loop iterations be data-independent, only that there are no data races between the threads. If the programmer knows the scheduling algorithm, the program can be written to avoid data races between threads even if there are dependences between iterations on the same thread.

These loops also often allow for explicit synchronization between the threads to avoid data races.

OpenMP parallel loop

The OpenMP API is very thread-centric. An OpenMP program starts as a single-threaded program which then creates a team of threads that execute the parallel regions of code. In a parallel region, all threads start out executing the code redundantly. Worksharing constructs (like a specially identified loop) in the parallel region divide the work (loop iterations) across the threads. The OpenMP specification explicitly states that data races across threads generate undefined behavior. However conflicts across loop iterations that are scheduled to the same thread are supported and well-defined. OpenMP supports common reductions in parallel loops, as well as a very rich set of synchronization constructs, such as critical sections, ordered critical sections, locks, and more. OpenMP 4 has added a simd loop clause, to direct the compiler to use SIMD or vector parallelism for the iterations of this loop.

    #pragma omp parallel for
    for( i = 0; i < n; ++i ){
      int row = rowndx[i];
      int ncols = rowndx[i+1] - rowndx[i];
      double val = 0.0;
      for( int j = 0; j < ncols; ++j )
        val += a[row+j] * v[colndx[row+j]];
      r[i] = val;
    }

Comments:

+ Support for C, C++ and Fortran.

+ Sequential execution is legal.

+ Support for reductions and atomic operations.

+ Allows nested parallelism.

+ OpenMP 4 explicitly supports SIMD parallelism, but only when the programmer adds the simd directive. The simd directive might be used on the inner loop above, for instance.

- No way to declare that the loop iterations are data independent, so can’t take advantage of extra hardware parallelism.

upc_forall

Unified Parallel C (UPC) is one of several partitioned global address space (PGAS) languages, designed to exploit parallelism across a cluster of nodes with a thread at each node and each thread having its own address space. A UPC program uses message passing or remote memory access to communicate between the threads. Unlike OpenMP, the whole UPC program starts with all threads executing redundantly.

The key parallel construct in UPC is the upc_forall, which spreads its iterations across the threads according to the affinity expression. The affinity expression may be an integer expression (usually involving the loop index), which selects the thread as affinity mod THREADS (where THREADS is the number of threads). It may also be a pointer to a distributed data structure element (again, usually involving the loop index), which selects the thread that owns that element.

    upc_forall( i = 0; i < n; ++i; i ){
      int row = rowndx[i];
      int ncols = rowndx[i+1] - rowndx[i];
      double val = 0.0;
      for( int j = 0; j < ncols; ++j )
        val += a[row+j] * v[colndx[row+j]];
      r[i] = val;
    }

As with OpenMP, the upc_forall does not state that the loop iterations are independent. Also, similar to OpenMP, any side effects that affect another thread result in undefined behavior. UPC allows for a small set of synchronization operations between threads as well. UPC is, as the name implies, only defined for C. I could not find an obvious shorthand for reduction operations.
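The integer form of the affinity expression can be modeled in a few lines. This is only a sketch of the "affinity mod THREADS" rule described above, not UPC code; the names owner_thread and my_iterations are assumptions for illustration, and the affinity expression is taken to be the loop index itself, as in the example loop.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of an integer affinity expression: an iteration whose
// affinity value is `affinity` runs on thread (affinity mod threads).
int owner_thread(long affinity, int threads) {
    assert(affinity >= 0 && threads > 0);
    return static_cast<int>(affinity % threads);
}

// Lists the iterations of a loop over [0, n) that thread `me` would execute
// when the affinity expression is simply the loop index i.
std::vector<long> my_iterations(long n, int me, int threads) {
    std::vector<long> mine;
    for (long i = 0; i < n; ++i)
        if (owner_thread(i, threads) == me)
            mine.push_back(i);
    return mine;
}
```

With the loop index as the affinity expression, the iterations are dealt out cyclically, so a programmer who knows this rule knows exactly which iterations share a thread.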

Comments:

+ Sequential execution is legal.

- Only supports C.

- No obvious support for reductions.

- No way to declare that the loop iterations are data independent, so can’t take advantage of extra hardware parallelism.

There is recent work on UPC++, which is intended to provide a similar framework for C++. UPC++ is implemented using a templated library framework (a compiler-free implementation), so would belong with the C++ Parallel Iterators below.

cilk_for

Cilk was designed at MIT as a faithful parallel extension to C, meaning that eliminating the parallel constructs by replacing them with the sequential C analog was always a valid implementation in sequential C. Cilk was commercialized by Cilk Arts, which was subsequently acquired by Intel. The Cilk technology is now part of the Intel C/C++ compiler toolkit, renamed Cilk Plus. Intel has published a Cilk Plus language specification, with the goal of promoting other compliant implementations.

The Cilk implementation is based on a runtime implementing low-overhead task creation, with a scheduler that allows a small number of threads to cooperatively schedule a large number of tasks inexpensively. When tasks are spawned, the spawning thread can either execute the spawned task immediately, or put it on a work list to be stolen by another thread. However I didn’t see that the language prescribed this implementation. Cilk has a parallel loop construct, cilk_for (or _Cilk_for in the Cilk Plus specification).

The implementation of the cilk_for is for the thread to compute the trip count, then to recursively spawn tasks to execute the two halves of the iteration space until a minimum grain size is reached. At that point, the thread executes grain size iterations sequentially, then looks for another task (section of loop iterations). A recursion-based implementation is quite general, meaning it can implement parallel loops as well as other parallel operations such as tasks, but generality comes with a penalty, since the implementation isn’t optimized for the common parallel loop. However, the language specification does not require this specific implementation. In fact, it doesn’t specify anything about the behavior, except that sequential execution is legal, and that the cilk_for keyword permits loop iterations to run in parallel. It also doesn’t say that the loop iterations are data-independent, and allows data conflicts as long as they are resolved to not be actual race conditions. Thus, a compiler cannot use the cilk_for keyword to vectorize the loop body.

    cilk_for( int i = 0; i < n; ++i ){
      int row = rowndx[i];
      int ncols = rowndx[i+1] - rowndx[i];
      double val = 0.0;
      for( int j = 0; j < ncols; ++j )
        val += a[row+j] * v[colndx[row+j]];
      r[i] = val;
    }
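The recursive splitting described above can be sketched in plain C++. This is not the Cilk runtime: the name split_for is an assumption, and both halves are simply executed one after the other here, where a task-based runtime would make one half available for another thread to steal.

```cpp
#include <cassert>
#include <functional>

// Sketch of recursive range splitting over the half-open range [lo, hi).
// Ranges no larger than `grain` are executed as a plain sequential loop.
void split_for(long lo, long hi, long grain,
               const std::function<void(long)>& body) {
    assert(grain > 0);
    if (hi - lo <= grain) {
        for (long i = lo; i < hi; ++i)
            body(i);
        return;
    }
    long mid = lo + (hi - lo) / 2;
    split_for(lo, mid, grain, body);  // a runtime could spawn this half
    split_for(mid, hi, grain, body);  // and continue with this half itself
}
```

Replacing the two recursive calls with direct execution, as done here, is exactly the sequential implementation that the language permits.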

Comments:

+ Sequential execution is legal.

+ Allows nested parallel loops or tasks.

- Only supports C and C++.

- Reductions require more changes to the code.

- No way to declare that the loop iterations are data independent, so can’t take advantage of extra hardware parallelism.

Array-Oriented Parallel Loops

An array-oriented parallel loop is designed to allow behavior like the Fortran array assignment or APL expression, but with loop-like syntax. It is very different from other parallel loops in that it doesn’t specify any parallelism between two iterations of different statements, only between the iterations of a single statement.

Fortran forall

The Fortran forall construct was copied from High Performance Fortran to Fortran 95. A forall is semantically very similar to the Fortran array assignment. The array assignment:

a(2:n-1) = b(1:n-2) + c(3:n)

can be written using forall as:

    forall(i=2:n-1)
      a(i) = b(i-1) + c(i+1)
    end forall

The reverse is not always true: there are assignments that you can write using forall that are very hard or impossible to write using array assignments without using Fortran array pointers. A forall is defined as being parallel statement-by-statement. That is, in a forall construct, each assignment can be executed by computing the right-hand side in parallel, then doing the assignment to the left-hand side in parallel. For a simple example like that above, this is easy to understand.

However, look at a slightly modified example:

    forall(i=2:n-1)
      a(i) = a(i-1) + a(i+1)
    end forall

In this case, the right-hand side for all values of i must be computed before doing any assignment to the left-hand side array. Compilers often create a temporary array and implement this as two assignments:

    forall(i=2:n-1)
      temp(i) = a(i-1) + a(i+1)
    end forall
    forall(i=2:n-1)
      a(i) = temp(i)
    end forall

In fact, some compilers do exactly this transformation for all forall assignments (and Fortran array assignments). Combining the two loops and removing the temporary array becomes an optimization problem, which is sometimes trivial, but often intractable.
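A small C++ sketch shows why the temporary matters for the modified example. The function names are assumptions for illustration, and indices are 0-based, so the updated elements are 1 through n-2. The first function follows the forall rule (all right-hand sides first, then all assignments); the second is the naive scalarized loop that updates in place.

```cpp
#include <cstddef>
#include <vector>

// forall-style semantics: compute every right-hand side into a temporary,
// then perform all the assignments.
void update_forall_style(std::vector<double>& a) {
    std::vector<double> temp(a.size());
    for (std::size_t i = 1; i + 1 < a.size(); ++i)
        temp[i] = a[i-1] + a[i+1];
    for (std::size_t i = 1; i + 1 < a.size(); ++i)
        a[i] = temp[i];
}

// Naive sequential loop: a[i-1] has already been overwritten when
// iteration i reads it, so the result differs from the forall meaning.
void update_in_place(std::vector<double>& a) {
    for (std::size_t i = 1; i + 1 < a.size(); ++i)
        a[i] = a[i-1] + a[i+1];
}
```

Starting from the values 1, 2, 3, 4, 5, the forall-style version produces 1, 4, 6, 8, 5, while the in-place loop produces 1, 4, 8, 13, 5. This is the sense in which scalarizing a forall is nontrivial.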

Comments:

- Only supports Fortran, and is being considered for removal from Fortran as well.

- No support for reductions across the forall iterations.

- Sequential execution (scalarizing) is nontrivial.

- No way to declare that the loop iterations are data independent, so can’t take advantage of extra hardware parallelism.

The Fortran array assignment could be considered as shorthand for a forall, except array assignments allow for reduction operations.

C Extensions for Array Notation

Intel recently introduced array notation extensions in their C/C++ compilers. Unlike Fortran array assignments, assignments using these C array extensions are undefined if there is any overlap between the left-hand side array and any right-hand side array, unless the overlap is exact (element by element).

No ordering among the elements being computed or assigned is guaranteed, and there is no guarantee that the right-hand side array elements will be fetched before some of the left-hand side elements are assigned. These restrictions mean that scalar, parallel or vector execution are legal. As with Fortran array notation, there are a number of reduction functions.

In the example shown below, I assume that each row has the same number of nonzeros (in ncols).

    r[0:n] = 0;
    for( int j = 0; j < ncols; ++j )
      r[0:n] += a[rowndx[0:n]+j] * v[colndx[rowndx[0:n]+j]];

Comments:

+ Allows for reductions as well as array parallelism.

+ Unlike Fortran array assignments, sequential execution is always legal.

- Only supported in C and C++, though similar in many ways to Fortran array syntax.

- Parallel execution is specified only within a single statement or expression.

C++ Parallel Iterators

C++ programmers and programs benefit from the very expressive templating and overloading features of the language, allowing more abstraction and more code reuse than is available in other languages. However, that template processing and overloading is typically expanded and evaluated in the C++ front end. By the time the rest of the compiler sees the program, it’s just a C program. The only unique feature in C++ that remains after the front end is exception handling. All the templating is extremely useful for reducing the number of times you have to write an axpy kernel for different data types (float, double, complex, …), but it doesn’t increase your expressivity, that is, it doesn’t let you write anything you couldn’t have written in a C library, however painfully.

The parallel programming constructs I’ve seen expressed in C++ have often been an encapsulation of a library, or some limited set of parallel primitives. The reason to design and use these constructs is that developing languages and compilers is difficult, whereas writing libraries is relatively easy. Using templated functions, iterators, lambdas, functors, and other C++ features, these constructs actually look like an extended programming language. The fact is that these advanced language features allow much more convenient development and use of new methods and libraries, and they can be specialized and retargeted much more quickly than doing the same work in a compiler, or getting new syntax added to a language standard.

These parallel constructs are not part of the core language; to the compiler, it’s really just a bunch of procedure calls. If your compiler is good enough, most of those calls will get inlined and you won’t have too much of an abstraction penalty. However, the benefit comes from the target-specialization of the parallel methods. One might expect a C++ parallel loop method for a multicore to include an OpenMP parallel for, whereas the same method for some other target to include some other target-specific extension. This isolates the target-specific parallel code to those constructs. One downside of such a scheme is that the body of the loop is opaque to the parallel method, and vice versa. The compiler can do no feasibility, correctness or profitability analysis, and there is no mechanism for the implementation to do so either. Classical vectorizing compilers, for instance, were quite successful in large part because they could report the success or failure of their analysis to the programmer. There will be no such feedback cycle from these class libraries.

Fortran has avoided template metaprogramming so far. The abstraction methods in Fortran aren’t nearly as rich, and so far have not been able to replace language and compiler development.

parallel_for

There are several parallel class libraries that include some sort of parallel_for construct. These include Microsoft Visual Studio Parallel Patterns Library, Intel’s Threading Building Blocks (TBB), and the Kokkos kernels package being developed at the Department of Energy national labs. All three parallel_for constructs are quite similar in many respects. They allow a programmer to replace a for loop by a parallel construct, replacing the loop body by a functor or lambda. The parallel_for implementation splits the iteration space across the available processors or cores, invoking the functor or lambda for each associated iteration. There is no support for reductions in parallel_for, but reductions and other collective operations such as scan and search are supported as separate patterns.

Since the parallel_for construct is not actually part of the C++ language, it’s not clear that talking about the meaning of the construct makes much sense. The definition of parallel_for really is its implementation. In most cases, the implementation allows dependences or synchronization between iterations, with limitations to allow the support library to avoid deadlock or other hazards.

    parallel_for( 0, n, [=](int i){
      int row = rowndx[i];
      int ncols = rowndx[i+1] - rowndx[i];
      double val = 0.0;
      for( int j = 0; j < ncols; ++j )
        val += a[row+j] * v[colndx[row+j]];
      r[i] = val;
    });

Comments:

+ Sequential execution is legal.

+ A parallel_for body could contain another parallel_for, for nested parallelism.

- Only supported in C++ (some might view this as a plus).

- No support for reductions or atomic operations within a parallel loop; reductions require a separate method invocation.

- No way to declare that the loop iterations are data independent, so can’t take advantage of extra hardware parallelism.

C++17 execution policies

There is a proposal in the C++ language committee to add execution policies to the C++ standard algorithms library. There are about 75 of these algorithms, including sort, finding the maximum element, and so on. For this discussion the key algorithm is for_each, which functions like a loop: invoking a function or lambda operation on each value of an index range. A for_each call can look and function much like a native loop. I’m not a C++ expert, but the following use of for_each will increment all the values of the vector x.

    vector<float> x ..., y ... ;
    ...
    for_each( x.begin(), x.end(), [](float& xx){ xx += 1.0; } );

The implementation of for_each is often just a function containing a loop that calls the third argument (leaving out many details):

    for_each(first, last, f){
      for (; first != last; ++first)
        f(*first);
    }

The proposal adds an optional execution policy argument to the for_each algorithm to tell the implementation whether to use the default version, or a parallel version, or a parallel+vector version. Different implementations can provide optimized versions for new targets, even targets not envisioned by the compiler writers. There is some evidence that good performance can be achieved in many cases.

However, while these algorithms would be in the C++ specification, they would be a part of the library, not the core programming language. I believe one could implement these with the current C++ language, meaning no language changes are needed at all. This proposal is simply a convention for specifying a common name for algorithms that can be executed in parallel. As mentioned above, one way to implement them would be to use OpenMP or some other target-specific mechanism, such as using the omp simd to generate SIMD or vector code.

The example below uses a simple auxiliary class, intval, which has one integer data member and the appropriate operators defined to allow for_each to work. There’s probably a much better way to write this using an existing C++ class, but I wanted to keep the code as similar to the other examples as possible:

    constexpr parallel_execution_policy par{};
    ...
    intval i0(0), in(n);
    for_each( par, i0, in, [=](int i){
      int row = rowndx[i];
      int ncols = rowndx[i+1] - rowndx[i];
      double val = 0.0;
      for( int j = 0; j < ncols; ++j )
        val += a[row+j] * v[colndx[row+j]];
      r[i] = val;
    });
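The intval class itself is not shown in the example, so the following is only a guess at what such a helper might look like: one integer member plus the operators that for_each needs to step through an index range. Every member here, and the squares_by_index function that exercises it, is an assumption for illustration. The sketch uses the ordinary sequential std::for_each, so it does not depend on any particular execution policy implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical intval-style helper: an iterator-like wrapper around one
// integer, so that a pair of intval values denotes an index range.
class intval {
public:
    // An input iterator is enough for the sequential std::for_each used here.
    using iterator_category = std::input_iterator_tag;
    using value_type = int;
    using difference_type = std::ptrdiff_t;
    using pointer = const int*;
    using reference = int;

    intval() : value_(0) {}
    explicit intval(int v) : value_(v) {}
    int operator*() const { return value_; }            // current index
    intval& operator++() { ++value_; return *this; }    // next index
    intval operator++(int) { intval old(*this); ++value_; return old; }
    bool operator==(const intval& o) const { return value_ == o.value_; }
    bool operator!=(const intval& o) const { return value_ != o.value_; }
private:
    int value_;
};

// Fills r[i] = i*i for i in [0, n); each iteration writes a distinct element.
std::vector<int> squares_by_index(int n) {
    assert(n >= 0);
    std::vector<int> r(static_cast<std::size_t>(n));
    intval i0(0), in(n);
    std::for_each(i0, in, [&r](int i) { r[i] = i * i; });
    return r;
}
```

A version meant for a parallel policy would pass the policy as an extra first argument to for_each, as in the example above, and would likely need a stronger iterator category than the one used in this sketch.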

Comments:

+ Sequential execution is legal.

+ Nested parallelism should be supported, assuming the implementation allows for it.

+ Allows the user to explicitly control SIMD or vector parallelism as well, assuming the implementation supports it.

- Only supported in C++ (again, some might view this as a plus).

- Reduction operations are implemented in a separate algorithm.

- No way to declare that the loop iterations are data independent, so can’t take advantage of extra hardware parallelism.

True Parallel Loops

HPF independent

High Performance Fortran was one of the first PGAS languages. The language required the compiler to determine or to generate code to determine which processor owned the data for each loop iteration, and assign that iteration to the proper processor. It also required that the compiler determine if the iterations could be executed in parallel, or generate synchronization between the processors to satisfy any dependences. When compiler analysis failed to find the parallelism, the user could add an independent directive to declare that the loop iterations were data-independent and could safely be executed in parallel. It also allowed the programmer to declare reduction variables in the loop.

    !hpf$ independent
    do i = 1, n
      row = rowndx(i)
      ncols = rowndx(i+1) - rowndx(i)
      val = 0.0
      do j = 1, ncols
        icol = colndx(row+j)
        val = val + a(row+j) * v(icol)
      enddo
      r(i) = val
    enddo

Comments:

+ Sequential execution is legal.

+ By definition, the loop iterations are data-independent.

+ Support for common simple reductions across the loop iterations.

- Was only supported in Fortran.

do concurrent

Fortran 2008 added a new parallel loop construct, do concurrent. One hope was to fix some of the problems with forall. The iterations must be data-independent, meaning any variable or element assigned in one iteration may not be read or assigned in another iteration. This allows the iterations to be executed in any order, including in parallel. It has no mechanism to define a parallel reduction across the iterations, though Fortran has a number of array reduction intrinsic functions. The syntax for multidimensional index sets and conditional execution are different than for a sequential loop, and variables that need to be private to a parallel loop iteration need to be declared in a Fortran block inside the loop.

    do concurrent( i = 1:N )
      block
        real :: val
        integer :: row, ncols, j, icol
        row = rowndx(i)
        ncols = rowndx(i+1) - rowndx(i)
        val = 0.0
        do j = 1, ncols
          icol = colndx(row+j)
          val = val + a(row+j) * v(icol)
        enddo
        r(i) = val
      end block
    end do

Comments:

+ Sequential execution is legal.

+ By definition, the loop iterations are data-independent.

- Only available in Fortran.

- No support for reductions across the do concurrent iterations.

OpenACC parallel loop

The OpenACC API is like OpenMP in many respects, but the parallelism and execution model are quite different. In particular, OpenACC does not have long-lived threads and is not thread-centric. An OpenACC program starts as a single-threaded host program. At a parallel region, the host can create one or more gangs of one or more workers, each of which has vector capability. The number of gangs or number of workers in each gang can change from one region to the next, to support hardware with low cost thread creation. Workshared loops will divide the iterations across the gangs, workers and vector lanes. OpenACC allows for loops where the user specifies the parallelism, in which the loop iterations must be data-independent except for identified reduction operations. It also allows for loops where the compiler will use parallelization analysis to discover parallelism and map that to the gangs, etc., in which case the compiler must be able to prove data-independence. OpenACC has almost no support for synchronization or locks.

    #pragma acc parallel loop
    for( i = 0; i < n; ++i ){
      int row = rowndx[i];
      int ncols = rowndx[i+1] - rowndx[i];
      double val = 0.0;
      for( int j = 0; j < ncols; ++j )
        val += a[row+j] * v[colndx[row+j]];
      r[i] = val;
    }

Comments:

+ Supported in C, C++ and Fortran.

+ Sequential execution is legal.

+ Supports reductions and atomic operations.

+ By definition, the loop iterations are data-independent.

+ Support for parallelism in nested loops, for instance, using gang parallelism on the outer loop and vector parallelism in the inner loop.

- Supports nested parallel regions in the language, but no compiler implements that yet.
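The nested-parallelism and reduction points above can be sketched on the same sparse matrix-vector product. This is only an illustration under stated assumptions: the function name spmv_acc_nested, the argument nv (the length of v), and the explicit data clauses are not part of the original example, and which level of parallelism works best will depend on the target and the compiler.

```c
/* Sketch: sparse matrix-vector product r = A*v with A in CSR form.
   Hypothetical names: spmv_acc_nested, nv (length of v).
   rowndx has n+1 entries; rowndx[n] is the number of stored nonzeros. */
void spmv_acc_nested(int n, int nv, const int *rowndx, const int *colndx,
                     const double *a, const double *v, double *r)
{
    int nnz = rowndx[n];

    /* Outer loop: rows are spread across gangs. */
    #pragma acc parallel loop gang \
        copyin(rowndx[0:n+1], colndx[0:nnz], a[0:nnz], v[0:nv]) \
        copyout(r[0:n])
    for (int i = 0; i < n; ++i) {
        int row = rowndx[i];
        int ncols = rowndx[i+1] - rowndx[i];
        double val = 0.0;

        /* Inner loop: vector lanes cooperate on one row; the sum into
           val is identified as a '+' reduction. */
        #pragma acc loop vector reduction(+:val)
        for (int j = 0; j < ncols; ++j)
            val += a[row+j] * v[colndx[row+j]];

        r[i] = val;
    }
}
```

A compiler without OpenACC enabled ignores the directives and runs both loops sequentially, which is legal and gives the same result up to floating-point summation order.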

The Future

Parallel loops will become even more important in the coming decade. Machines now being designed and built have an order of magnitude more parallelism on each node than was common just a few years ago. A machine like the NERSC-8 Cori will have more than 60 cores per node, four threads per core, and AVX-512 instructions with 8-wide double-precision operands, for a parallelism factor of about 2,000 (60 cores × 4 threads × 8 vector lanes = 1,920). Your program will need about 2,000 parallel operations at a time just to keep one node busy. A machine like OLCF Summit will have multiple POWER9 CPUs and multiple NVIDIA (Volta architecture) GPUs on each node, supporting (requiring) even more parallelism from the application. We want programming language constructs that can deliver scalable parallelism that maps efficiently to the high levels of parallelism and the different parallelism modes that these systems provide. At an absolute minimum, any parallel loop must be able to express parallelism across iterations as well as common reductions. Synchronization across iterations can be important, but any synchronization inherently reduces scalability; scalable (lock-free) alternatives, such as atomic operations or small transactions, should be preferred. A parallel loop should also allow the programmer to specify that the loop iterations are truly data-independent, allowing the compiler or implementation to use all the parallelism provided by the hardware to execute the loop.
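As a sketch of the lock-free alternative mentioned above, the loop below builds a histogram with an OpenACC atomic update instead of a lock or critical section. The names histogram_acc, key and count are hypothetical, and the keys are assumed to lie in the range 0 to nbins-1.

```c
/* Sketch: count how many of the n keys fall into each of nbins bins.
   Assumption: every key[i] satisfies 0 <= key[i] < nbins. */
void histogram_acc(int n, const int *key, int nbins, int *count)
{
    for (int b = 0; b < nbins; ++b)
        count[b] = 0;

    #pragma acc parallel loop copyin(key[0:n]) copy(count[0:nbins])
    for (int i = 0; i < n; ++i) {
        /* Different iterations may update the same bin; the atomic
           update makes each increment indivisible without a lock. */
        #pragma acc atomic update
        count[key[i]] += 1;
    }
}
```

The iterations are not data-independent here, since two of them can touch the same bin, but the only interaction between them is a single atomic increment, so no iteration ever waits on a lock held by another.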

Given that large machines typically live for years whereas large applications often live for decades, it’s important to design and implement your application in a manner that will scale to the next two or more generations of machine. You probably don’t want to go back and refactor your program for each generation or each variation of machine.

What parallel loop constructs will we be using 10-20 years from now? We will have continuing use of the legacy OpenMP thread-centric parallel loop. However, while its focus on an efficient, mechanical translation for small numbers of processors or cores was appropriate 15 years ago, that model is going to be hard to scale efficiently to a parallelism factor of a thousand and more. OpenMP needs updating to promote scalable programming. OpenMP 4 added teams and distribute constructs, for a new level in the parallelism hierarchy. Now an OpenMP program can have a league of teams of threads, where each thread has SIMD capability. However, I agree with the authors of the article “Early Experiences with the OpenMP Accelerator Model,” presented at IWOMP 2013, that requiring programmers to add teams and distribute is cumbersome and may not be portable.
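To give a sense of what the league-of-teams hierarchy looks like in source code, here is a sketch of a sparse matrix-vector loop written with OpenMP 4 directives. The function name spmv_omp_target, the argument nv (the length of v), and the map clauses are assumptions for illustration; this is one plausible spelling, not a recommended or tuned mapping.

```c
/* Sketch: r = A*v with A in CSR form, using the OpenMP 4 hierarchy.
   Hypothetical names: spmv_omp_target, nv (length of v).
   rowndx has n+1 entries; rowndx[n] is the number of stored nonzeros. */
void spmv_omp_target(int n, int nv, const int *rowndx, const int *colndx,
                     const double *a, const double *v, double *r)
{
    int nnz = rowndx[n];

    /* teams + distribute: rows are split across the league of teams;
       parallel for: each team's rows are split across its threads. */
    #pragma omp target teams distribute parallel for \
        map(to: rowndx[0:n+1], colndx[0:nnz], a[0:nnz], v[0:nv]) \
        map(from: r[0:n])
    for (int i = 0; i < n; ++i) {
        int row = rowndx[i];
        int ncols = rowndx[i+1] - rowndx[i];
        double val = 0.0;

        /* simd: each thread uses its vector lanes on one row. */
        #pragma omp simd reduction(+:val)
        for (int j = 0; j < ncols; ++j)
            val += a[row+j] * v[colndx[row+j]];

        r[i] = val;
    }
}
```

The programmer has to name every level of the hierarchy explicitly, which is the burden the paragraph above describes. Compiled without OpenMP support, the directives are ignored and the loop runs sequentially.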

We will also see growing use of C++ parallel iterators and algorithms. C++ parallel operations are attractive to application developers because anyone can define and implement them without having to invest in compiler development. This is exactly the same argument as for the development of high performance scientific libraries. If you write your program using these iterators or algorithms (or library routines), and someone else optimizes them for each target, you benefit from that optimization when you move to a new target without having to change your program. The parallelism is hidden in some method instantiation, and that method can include target-specific (or even compiler-specific) features to expose the parallelism to the compiler itself.

But I’m betting on the true parallel loop. A true parallel loop is inherently scalable and flexible, allowing the most options for exploiting the program parallelism on a variety of system designs. My belief is that most parallel loops, regardless of language or syntax, are in fact true parallel loops.

The problem is that only a few languages allow you to express that. We should focus our attention on making that possible in whatever language we want to use, and on designing our applications and algorithms with true parallelism. A language with a rich set of synchronization primitives may be useful, but every synchronization point is a potential performance-killing penalty when scaled up to thousand-way parallelism and more. True parallelism and dynamic scalability are going to be important in the coming generation of supercomputers, and will become key for the generations beyond that. Start planning now.

Summary

In the high performance computing world, there has always been interest in making loops run in parallel. Loops and recursion are the only scalable sources of parallelism in applications. Other forms of parallelism, while useful, scale only to the length of the program, not to the length of the data. Early attempts to automatically find all the parallelism in a program using compiler analysis have been useful, particularly for vector parallelism, but are in no way sufficient. To augment this, we have been designing and adding parallel loops to programming languages for the past 40 years or more.

In this article, I reviewed roughly a dozen existing and proposed parallel loop constructs, and discussed and predicted what we need for scalable parallelism in the future. I also included my summary of the information I could find about each construct; I may have made errors, and corrections are welcome. I have also included my opinions, with which some will disagree, so if you add a comment, distinguish whether I’m factually wrong or whether you disagree.

We’re at an inflection point in HPC system design, where a system has many nodes and a very high degree of parallelism on each node. We want programming constructs that dynamically scale up or down to use as much parallelism as is available with minimal abstraction penalty. We might be able to reliably predict what HPC systems will look like for the next five to eight years, but we don’t know what HPC systems will look like five years beyond that. The constraints of power and cost prohibit the system we really want (an infinite number of homogeneous cores with infinitely fast, conflict-free access to an infinitely large shared memory), so every system design is a compromise. The applications we work on today need to express enough flexible parallelism to efficiently exploit whatever designs come in the future.



About the Author

Michael Wolfe has developed compilers for over 35 years in both academia and industry, and is now a PGI compiler engineer at NVIDIA Corporation. The opinions stated here are those of the author, and do not represent opinions of NVIDIA. Follow Michael’s tweek (weekly tweet) @pgicompilers.