The OpenMP standard was formulated in 1997 as an API for writing portable, multi-threaded applications. It started as a Fortran-based standard but later grew to include C and C++. While the current version is OpenMP Version 3.0, this article is based on OpenMP Version 2.5, which supports Fortran, C, and C++. The Intel C++ and Fortran compilers support the OpenMP Version 2.5 standard. The OpenMP programming model provides a platform-independent set of compiler pragmas, directives, function calls, and environment variables that explicitly instruct the compiler how and where to use parallelism in the application. Many loops can be threaded by inserting only one pragma right before the loop, as demonstrated by examples in this article. By leaving the nitty-gritty details to the compiler and the OpenMP runtime library, you can spend more time determining which loops should be threaded and how best to restructure the algorithms for performance on multi-core processors. The full potential of OpenMP is realized when it is used to thread the most time-consuming loops; that is, the hot spots.

Tackling the topic of OpenMP in a single article is an intimidating task. This article therefore serves as a bridge: it aims to give you a fundamental understanding of threading with OpenMP, from which you can build broader practical knowledge. The power and simplicity of OpenMP are best demonstrated by looking at an example. The following loop converts each 32-bit RGB (red, green, blue) pixel in an array into an 8-bit grayscale pixel. The one pragma, inserted immediately before the loop, is all that is needed for parallel execution under OpenMP.

#pragma omp parallel for
for (i = 0; i < numPixels; i++)
{
    pGrayScaleBitmap[i] = (unsigned char)
        (pRGBBitmap[i].red   * 0.299 +
         pRGBBitmap[i].green * 0.587 +
         pRGBBitmap[i].blue  * 0.114);
}

Let's take a closer look at the loop. First, the example uses work-sharing, the general term OpenMP uses for distributing work across threads. When work-sharing is used with the for construct, as shown in this example, the iterations of the loop are distributed among multiple threads. The OpenMP implementation determines how many threads to create and how best to manage them. All the programmer needs to do is tell OpenMP which loop should be threaded. There is no need to add code to create, initialize, manage, and kill threads in order to exploit parallelism; the OpenMP compiler and runtime library take care of these and many other details behind the scenes.

In the current OpenMP specification Version 2.5, OpenMP places the following five restrictions on which loops can be threaded:

The loop variable must be of type signed integer. Unsigned integers will not work. Note: this restriction is to be removed in the future OpenMP specification Version 3.0.

The comparison operation must be in the form loop_variable <, <=, >, or >= loop_invariant_integer.

The third expression, or increment portion, of the for loop must be either integer addition or integer subtraction of the loop variable by a loop-invariant value.

If the comparison operation is < or <=, the loop variable must increment on every iteration; conversely, if the comparison operation is > or >=, the loop variable must decrement on every iteration.

The loop must be a single-entry, single-exit loop, meaning no jumps from the inside of the loop to the outside, or from the outside to the inside, are permitted, with the exception of the exit statement, which terminates the whole application. If the statements goto or break are used, they must jump within the loop, not outside it. The same goes for exception handling: exceptions must be caught within the loop.

Although these restrictions may sound somewhat limiting, most loops can easily be rewritten to conform to them. The restrictions listed above must be observed so that the compiler can parallelize loops via OpenMP. However, even when the compiler parallelizes the loop, you must still ensure the loop is functionally correct by watching out for the issues in the next section.

Challenges in Threading a Loop

Threading a loop means converting independent loop iterations into threads and running these threads in parallel. In some sense, this is a re-ordering transformation in which the original order of loop iterations may be converted into an undetermined order. In addition, because the loop body is not an atomic operation, statements in two different iterations may run simultaneously. In theory, it is valid to convert a sequential loop to a threaded loop if the loop carries no dependence. Therefore, your first challenge is to identify or restructure the hot loop to make sure it has no loop-carried dependence before adding OpenMP pragmas.

Even if the loop meets all five loop criteria and the compiler threads the loop, it may still not work correctly, given the existence of data dependencies that the compiler ignores due to the presence of OpenMP pragmas. The theory of data dependence imposes two requirements that must be met for a statement S2 to be data dependent on statement S1.

There must exist a possible execution path such that statements S1 and S2 both reference the same memory location L.

The execution of S1 that references L occurs before the execution of S2 that references L.

In order for S2 to depend upon S1, it is necessary for some execution of S1 to write to a memory location L that is later read by an execution of S2. This is also called flow dependence. Other dependencies exist when two statements write the same memory location L, called an output dependence, or when a read occurs before a write, called an anti-dependence. This pattern can occur in one of two ways:

S1 can reference the memory location L on one iteration of a loop; on a subsequent iteration S2 can reference the same memory location L.

S1 and S2 can reference the same memory location L on the same loop iteration, but with S1 preceding S2 during execution of the loop iteration.

The first case is an example of loop-carried dependence, since the dependence exists when the loop is iterated. The second case is an example of loop-independent dependence; the dependence exists because of the position of the code within the loops. Table 1 shows three cases of loop-carried dependencies with dependence distance d, where 1 ≤ d ≤ n, and n is the loop upper bound.

Table 1: The Different Cases of Loop-carried dependencies.

Let's take a look at the following example, where d = 1 and n = 99. S1 writes to location x[k] at iteration k, and S2 reads from it at iteration k+1, so a loop-carried flow dependence occurs. Similarly, S2 writes to location y[k] at iteration k, and S1 reads it (as y[k-1]) at iteration k+1, creating a second loop-carried dependence through y. In this case, if a parallel for pragma is inserted to thread this loop, you will get incorrect results.

// Do NOT do this. It will fail due to loop-carried
// dependencies.

x[0] = 0;
y[0] = 1;

#pragma omp parallel for private(k)
for (k = 1; k < 100; k++) {
    x[k] = y[k-1] + 1;  // S1
    y[k] = x[k-1] + 2;  // S2
}

Because OpenMP directives are commands to the compiler, the compiler will thread this loop. However, the threaded code will fail because of the loop-carried dependence. The only way to fix this kind of problem is to rewrite the loop or to pick a different algorithm that does not contain the loop-carried dependence. With this example, you can first predetermine the values of x[49] and y[49]; then, you can apply the loop strip-mining technique to create a loop-carried-dependence-free outer loop m. Finally, you can insert the parallel for pragma to parallelize loop m. By applying this transformation, the original loop can be executed by two threads on a dual-core processor system.

// Effective threading of the loop using strip-mining
// transformation.

x[0] = 0;
y[0] = 1;
x[49] = 74;  // derived from the equation x(k) = x(k-2) + 3
y[49] = 74;  // derived from the equation y(k) = y(k-2) + 3

#pragma omp parallel for private(m, k)
for (m = 0; m < 2; m++) {
    for (k = m*49 + 1; k < m*50 + 50; k++) {
        x[k] = y[k-1] + 1;  // S1
        y[k] = x[k-1] + 2;  // S2
    }
}

Besides using the parallel for pragma, for the same example you can also use the parallel sections pragma to parallelize the original loop, which has a loop-carried dependence, for a dual-core processor system.

// Effective threading of a loop using parallel sections
#pragma omp parallel sections private(k)
{
    {
        x[0] = 0;
        y[0] = 1;
        for (k = 1; k < 49; k++) {
            x[k] = y[k-1] + 1;  // S1
            y[k] = x[k-1] + 2;  // S2
        }
    }
    #pragma omp section
    {
        x[49] = 74;
        y[49] = 74;
        for (k = 50; k < 100; k++) {
            x[k] = y[k-1] + 1;  // S3
            y[k] = x[k-1] + 2;  // S4
        }
    }
}

From the process of parallelizing this simple loop with loop-carried dependences, you can learn several effective methods. Sometimes, besides simply adding OpenMP pragmas, a simple code restructuring or transformation is necessary before your code can be threaded to take advantage of dual-core and multi-core processors.