Getting top performance out of modern processors requires both SIMD and thread parallelism. As I discussed in SIMD Parallelism using Array Notation for C/C++, fork-join and SIMD parallelism can be combined to solve a computational problem:

fork-join parallelism can specify which subproblems can be solved in parallel by different hardware threads.

SIMD parallelism can be applied to each subproblem, which helps the compiler exploit SIMD instructions.

In fork-join parallelism, control flow forks into separate flows, and later the flows join back together. Recursive fork-join can be particularly effective. For example, 10 levels of two-way splits create 1024-way potential parallelism. Running 1024 separate hardware threads would be inefficient, but systems like Cilk Plus (and Threading Building Blocks) are careful to turn potential parallelism into actual parallelism only when necessary. Indeed, with these systems you should specify as much fork-join parallelism as you can, and let the system determine when, where, and how much to exploit.

An important point is that the subproblems should fit in cache; otherwise, multiple threads can easily run up against memory bandwidth limitations. Sometimes it is hard to determine what a cache-sized subproblem is. When in doubt, go as small as practical. The subproblems are too small if the overhead of scheduling a chunk becomes significant compared to the work of solving the subproblem serially. Because Cilk Plus is tightly integrated into the compiler and has highly structured parallelism, it tends to have lower chopping overhead than systems like TBB, which are less structured and supplied as libraries. So you can often afford to chop into finer subproblems than you might with a system such as TBB.

A particularly effective form of chopping is exhibited by cache oblivious algorithms. These optimize for all levels of cache that might exist (including virtual memory!), oddly enough without knowing anything about the levels. See the link for more information.

Quick Introduction to Cilk Notation

There are two ways to express fork-join parallelism in Cilk:

cilk_spawn / cilk_sync

cilk_for

For example, the statement cilk_spawn f(); asynchronously calls a function f. Here asynchronous means that the caller keeps going without waiting for the callee to return. This is a neat way to express parallelism because most of what you learned about subroutine calls still applies. The actual arguments are evaluated, bound to formal arguments, and the callee's body is invoked. The only thing that changes is that the caller continues to execute after the callee is entered, if there is a hardware thread available to continue execution of the caller.

Eventually a caller must wait on its callees. To do so, the caller invokes cilk_sync. For example, the following code lets f(x), g(y), and h(z) run in parallel if there are sufficient hardware resources. It waits for them to complete before printing their results.

a = cilk_spawn f(x);
b = cilk_spawn g(y);
c = h(z);
cilk_sync;
cout << a << b << c << "\n";

Figure 1 shows the execution order.


As a matter of good Cilk style, the last call is not spawned. It could be, but doing so is considered poor style, because there is no other work to do between that call and the cilk_sync. Doing nothing in parallel with doing something is pointless overhead.
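The same shape can be written in standard C++ with std::async playing the role of cilk_spawn, which may help if you want to experiment without a Cilk compiler. The functions f, g, and h below are hypothetical stand-ins for those in the example:

```cpp
#include <future>
#include <iostream>
#include <tuple>

// Hypothetical stand-ins for the f, g, and h of the example.
int f(int x) { return x * 2; }
int g(int y) { return y + 10; }
int h(int z) { return z * z; }

// Same shape as the Cilk fragment: two asynchronous calls, one serial
// call, then a join before the results are used.
std::tuple<int, int, int> spawn_sync_demo(int x, int y, int z) {
    auto fa = std::async(std::launch::async, f, x);  // ~ a = cilk_spawn f(x);
    auto fb = std::async(std::launch::async, g, y);  // ~ b = cilk_spawn g(y);
    int c = h(z);                                    // last call stays serial
    int a = fa.get();                                // the two get() calls play
    int b = fb.get();                                // the role of cilk_sync
    std::cout << a << b << c << "\n";
    return std::make_tuple(a, b, c);
}
```

Note how the last call runs in the current thread, mirroring the style point above.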

The semantics of cilk_sync is that it waits for all cilk_spawn calls that were issued by the current Cilk block. A Cilk block is the body of a function, or the body of a cilk_for loop. Complex control flow is allowed between cilk_spawn and cilk_sync. For example, this code:

for( list<T>::iterator i=x.begin(); i!=x.end(); ++i )
    cilk_spawn f(*i);
cilk_sync;

walks a linked list x sequentially and issues calls f(*i) that run concurrently. Execution waits for all of them to complete at the cilk_sync.
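A standard-C++ approximation of this pattern collects one future per spawned call and waits on all of them, which is what the single cilk_sync accomplishes. The function square below is a hypothetical stand-in for f:

```cpp
#include <future>
#include <list>
#include <vector>

int square(int v) { return v * v; }   // hypothetical stand-in for f

// Walk the list serially, launching work for each element; then wait
// for everything, as the cilk_sync does in the Cilk version.
std::vector<int> process_list(const std::list<int>& x) {
    std::vector<std::future<int>> pending;
    for (auto i = x.begin(); i != x.end(); ++i)      // sequential walk
        pending.push_back(std::async(std::launch::async, square, *i));
    std::vector<int> results;
    for (auto& fut : pending)                        // the "sync": join all
        results.push_back(fut.get());
    return results;
}
```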

There is always an implicit cilk_sync when a routine returns. Hence a callee cannot accidentally leave "dangling tasks" running after it returns. This feature makes it much easier to reason about Cilk parallelism than it is to reason about raw threads.

The callee can be a function object (functor), too. Since a C++11 lambda expression returns a function object, you can also spawn a block of work like this:

cilk_spawn [&]{for( int i=0; i<5; ++i ) cout<<"quack ";} ();
other_work();    // Do other work while quacking.
cilk_sync;

In the example, the lambda expression [&]{...} returns a function object. The cilk_spawn...() causes it to execute asynchronously. Do not forget the trailing parentheses, or the compiler will balk.
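For comparison, here is the same quacking block spawned with std::async. One difference worth noticing: std::async invokes the callable itself, so no trailing parentheses are needed, whereas cilk_spawn spawns a call expression. The string-building here is only so the result can be observed:

```cpp
#include <future>
#include <sstream>
#include <string>

// Spawn a lambda asynchronously, do other work, then join before
// reading the result.
std::string quack_five_times() {
    std::ostringstream out;
    auto task = std::async(std::launch::async, [&] {
        for (int i = 0; i < 5; ++i) out << "quack ";
    });
    // ...other work could happen here, in parallel with the quacking...
    task.get();                  // the join: wait before reading the stream
    return out.str();
}
```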

The notation cilk_for is like a C/C++ for, except that iterations run in parallel if resources permit. For example:

cilk_for( vector<T>::iterator i=x.begin(); i!=x.end(); ++i ) f(*i);

allows, but does not mandate, each loop iteration to run in parallel. A cilk_for loop is usually more efficient than a serial for loop wrapped around cilk_spawn requests. The reason is that a cilk_for can distribute the work across processors in parallel, much more efficiently than the combination of for and cilk_spawn. However, a cilk_for is more limited than a plain for. The iteration variable must be of an integral type or a random access iterator, and the cilk_for must fit a certain pattern, which permits the number of iterations to be computed before the loop iterations commence.
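A rough sketch of what cilk_for arranges, in standard C++: precompute the iteration count (which is why the loop must fit the pattern described above), carve the index range into contiguous chunks, and run the chunks on separate threads. Cilk's scheduler does this recursively and with work stealing; the fixed partition below is only the simplest analog:

```cpp
#include <thread>
#include <vector>

// Apply f(i) for i in [0, n), split across 'workers' threads.
// The iteration count n is known up front, as cilk_for requires.
template <typename Func>
void parallel_for(int n, int workers, Func f) {
    std::vector<std::thread> pool;
    int chunk = (n + workers - 1) / workers;       // ceiling division
    for (int w = 0; w < workers; ++w) {
        int begin = w * chunk;
        int end = begin + chunk < n ? begin + chunk : n;
        if (begin >= end) break;
        pool.emplace_back([=] {                    // each thread owns [begin, end)
            for (int i = begin; i < end; ++i) f(i);
        });
    }
    for (auto& t : pool) t.join();                 // implicit barrier at loop end
}
```

As with cilk_for, all iterations have completed when the call returns, so there are no dangling tasks.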

Another key feature of Intel Cilk Plus is hyperobjects. They deal with the "join" part of fork-join parallelism when there is data to be joined, such as in reductions. Hyperobjects are a topic for another day.