The quest for efficient, convenient, and widely usable higher-level parallelization constructs in C++ continues. With the standardization document N4411, a group of authors from Intel and Microsoft proposes the task_block facility, which formalizes fork-join parallelism. From the paper (slightly edited):

The model of parallelism supported by the constructs in this paper is called strict fork-join task parallelism, which has decades of research behind it and is the form of structured parallelism supported by all of the prominent parallel languages, including Cilk, X10, Habanero, and OpenMP. Consider an example of a parallel traversal of a tree, where a user-provided function f is applied to each node of the tree, returning the sum of the results:

```cpp
template <typename Func>
int traverse(node& n, Func&& f)
{
    int left = 0, right = 0;

    define_task_block([&](task_block& tb) {
        if (n.left)
            tb.run([&] { left = traverse(*n.left, f); });
        if (n.right)
            tb.run([&] { right = traverse(*n.right, f); });
    });

    return f(n) + left + right;
}
```

The example above demonstrates the use of two of the functions proposed in N4411: define_task_block and task_block::run. The define_task_block function delineates a region in the program code potentially containing invocations of tasks spawned by the run member function of the task_block class. The run function spawns a task, a unit of work that is allowed to execute in parallel with respect to the caller. Any parallel tasks spawned by run within the define_task_block are joined back to a single thread of execution on return from define_task_block. The function run takes a user-provided function object f and starts it asynchronously, i.e. it may return before the execution of f completes. The implementation's scheduler may choose to run this function immediately or delay running it until compute resources become available.

While we understand that fork-join parallelism has certain disadvantages – it introduces a forced barrier into the computation, after all – we decided to support the proposed construct in HPX as well, since the concept of fork-join parallelism is widely known and easy to understand. We have, however, added our own extensions to the proposal, which a) improve the integration of our implementation of task_block with other higher-level parallel facilities, such as our parallel algorithms, and b) help to mitigate the detrimental effects of the forced barrier.

The parallel algorithms extensions to the standard library described in the Parallelism TS (N4354) introduce the concept of execution policies. These allow the user to describe the concurrency guarantees for their code (e.g. that the lambda passed to an algorithm is safe to execute concurrently). In HPX, we decided to enable specifying the very same execution policies when creating a task block. In addition to the syntax shown in the example above, we now support writing something like:

```cpp
template <typename Func>
int traverse(node& n, Func&& f)
{
    int left = 0, right = 0;

    define_task_block(
        par,    // specify an execution policy for the created task block
        [&](task_block<>& tb) {
            if (n.left)
                tb.run([&] { left = traverse(*n.left, f); });
            if (n.right)
                tb.run([&] { right = traverse(*n.right, f); });
        });

    return f(n) + left + right;
}
```

The main differences from the snippet shown at the beginning are the additional (optional) first parameter passed to define_task_block and the fact that task_block is now a template (where the template argument is the type of the passed-in execution policy, defaulting to parallel_execution_policy). This change is especially interesting in light of executors (see HPX and C++ Executors): our implementation of task blocks allows specifying execution policies in combination with executors. This enables a very flexible and adaptable way of defining how and where the tasks inside a task block will be run.

The second addition we introduced to the proposal is based on our experience with implementing the parallel algorithms (see HPX and the C++ Standard). For the parallel algorithms we implemented a set of special execution policies which make the invocation of the algorithm itself asynchronous. For instance, by specifying par(task) as the execution policy, the invocation of the algorithm just kicks off the required operation asynchronously, returning a future object which becomes ready once the algorithm has finished executing. For task blocks we decided to implement a similar extension. As you can see, making the execution of a task block asynchronous helps mitigate the effects of the forced barrier at the end of the fork-join operation:

```cpp
template <typename Func>
int traverse(node& n, Func&& f)
{
    int left = 0, right = 0;

    future<void> done = define_task_block(
        par(task),    // specify an asynchronous execution policy
        [&](auto& tb) {
            if (n.left)
                tb.run([&] { left = traverse(*n.left, f); });
            if (n.right)
                tb.run([&] { right = traverse(*n.right, f); });
        });

    int result = f(n);
    // do other stuff

    done.wait();      // wait for the task block to finish
    return result + left + right;
}
```

Now, other work can easily be performed while the task block is still executing – nifty! Note the use of C++14 generic lambda syntax, which allows us to ignore the concrete type of the generated task block instance.

As always, if you want to try things out, please fork HPX from our GitHub site.