It is time to make Parallelism a full First Class Citizen in C and C++. Hardware is once again ahead of software, and we need to close the gap so that application development is better able to utilize the hardware without low level programming.



The time has come for high level constructs for task and data parallelism to be explicitly added to C and C++. This will enable Parallel Programming in C and C++ to be fully portable, easily intelligible, and consistently decipherable by a compiler.



Language solutions are superior to library solutions. Library solutions provide fertile grounds for exploration. Intel Threading Building Blocks (TBB) has been the most popular library solution for parallelism for C++. For more than five years, TBB has grown and proven itself. It is time to take it to the next level, and move to language solutions to support what we can be confident is needed in C and C++.



We need mechanisms for parallelism that have strong space-time guarantees, simple program understanding, and serialization semantics – all things we do not have in C and C++ today.



We should tackle task and data parallelism both, and as an industry we know how.

Task parallelism: the goal is to abstract this enough that a programmer is not explicitly mapping work to individual processor cores. Mapping should be the job of tools (including run time schedulers), not explicit programming. Hence, the goal is to shift all programming to tasks, not threads. This has many benefits that have been demonstrated often. A simple fork/join support is fundamental (spawn/sync in Cilk Plus terminology). Looping with a “parallel for” avoids looping to spawn iterations serially and thereby expresses parallelism well.

Data parallelism: the goal is to abstract this enough that a programmer is not explicitly mapping work to SIMD instructions vs. multiple processor cores vs. attached computing (GPUs or co-processors). Mapping should be the job of tools, not explicit programming. Hence, the goal is to shift all programming back to mathematical expressions, not intrinsics or explicitly parallel algorithm decompositions.

The solutions that show the most promise are documented in the Cilk™ Plus open specification (http://cilkplus.org). They are as follows:



For data parallelism:

Explicit array syntax , to eliminate the need for explicit looping in a program to loop (serially) across elements to do the same operation multiple times

, to eliminate the need for explicit looping in a program to loop (serially) across elements to do the same operation multiple times Elemental functions , to eliminate the need for authors of functions to worry about explicitly writing anything other than the simple, single wide, version of functions. This leaves a compiler to create wider versions for efficiency by coalescing operations in order to match the width of SIMD instructions.

, to eliminate the need for authors of functions to worry about explicitly writing anything other than the simple, single wide, version of functions. This leaves a compiler to create wider versions for efficiency by coalescing operations in order to match the width of SIMD instructions. Support for reduction operations in a way that makes semantic sense. For instance, this can be done via an explicit way to have private copies of data to shadow global data that needs to be used in parallel tasks that are created (the number of which is not known to the program explicitly). Perhaps this starts to overlap with task parallelism…

For task parallelism:

spawn/join to spawn a function, and to wait for spawns to complete

to spawn a function, and to wait for spawns to complete parallel for, to specifically avoid the need to serially spawn individual loop body instances – and make it very clear that all iterations are ready to spawn concurrently.

Like other popular programming languages, neither C nor C++ was designed as parallel programming languages. Parallelism is always hidden from a compiler and needs “discovery.” Compilers are not good at “complex discovery” – they are much better at optimizing and packaging up things that are explicit. Explicit constructs for parallelism solve this and make compiler support more likely. The constructs do not need to be numerous, just enough for other constructs to build upon… fewer is better!



For something as involved, or complex, as parallelism, incorporating parallelism semantics into the programming language improves both the expressability of the language, as well as the efficiency by which the compiler can implement parallelism.



Years of investigation and experimentation have had some great results. Compiler writers have found they can offer substantial benefits for ease of programming, performance, debugging and portability. These have appeared in a variety of papers and talks over the years, and could be the topic of future blogs.



Top of mind thoughts are:



Both C and C++ are important . No solution should be specific to only one.

. No solution should be specific to only one. There is strong value in adding some basic task parallelism and data parallel support as a first class citizen into both C and C++. The time has come.

as a first class citizen into both C and C++. The time has come. Maintaining serial semantics (see four-reasons-why-parallel-programs-should-have-serial-semantics ) is important. A program should be able to be understood (and executed) as a serial program. This is critical for making it easy to reason about a program that uses parallelism.

(see four-reasons-why-parallel-programs-should-have-serial-semantics ) is important. A program should be able to be understood (and executed) as a serial program. This is critical for making it easy to reason about a program that uses parallelism. Task parallelism: nothing is more proven than the simple spawn/sync and parallel for of Cilk Plus.

and of Cilk Plus. Data parallelism: nothing is more simple than extending syntax to make data parallel operation explicit via array operations such as a[:]=b[:]+c[:] Fortran 90 added similar capabilities over two decades ago!

such as a[:]=b[:]+c[:] Fortran 90 added similar capabilities over two decades ago! Data parallelism: elemental functions have an important role and should be included

have an important role and should be included Task parallelism goal: shift parallel programming to explicit language features , making parallelism easy to express and exploit task parallelism that can be optimized by a compiler, and can be more easily tested and debugged for data races/deadlock.

, making parallelism easy to express and exploit task parallelism that can be optimized by a compiler, and can be more easily tested and debugged for data races/deadlock. We need strong space-time guarantees in parallelism constructs.

in parallelism constructs. Data parallelism goal: shift programming to EXPLICIT and EASY-TO-FIND (exploit) data parallelism , so all varieties of hardware can be addressed

, so all varieties of hardware can be addressed Everything proposed here is incredibly easy to teach. The power that can be placed underneath via a compiler is a big bonus, of course! That power makes all these very compelling. The data parallelism is immediately convincing by its compact form, but the task parallel constructs are convincing as well.



None of this is radical – and none of it need be proprietary.



If we don’t get carried away adding other stuff, KISS, we can add these and make a fundamental and important advance for task and data parallelism in C and C++… the languages that lead the evolution to parallelism, and are becoming more (not less) important in the future.



- james

