One of the biggest and most impactful changes C++11 heralds is a standardized threading library, along with a documented memory model for the language. While extremely useful and obviating the dilemma of non-portable code vs. third-party libraries for threading, this first edition of the threading libraries is not without kinks. This article is a brief overview of how C++11 tries to enable a "task-based parallelism" idiom with the introduction of std::async, and the challenges it runs into.

Warning: this article is opinionated, especially its last third or so. I'll be happy to get corrections and suggestions in comments or email.

Background - threads vs. tasks

When I'm talking about "thread-based parallelism", I mean manual, low-level management of threads. Something like using pthreads or the Windows APIs for threads directly. You create threads, launch them, "join" them, etc. Even though threads are an OS abstraction, this is as close as you can get to the machine. In such cases, the programmer knows (or had better know!) exactly how many threads are running at any given time, and has to take care of load-balancing the work between them.

"Task-based parallelism" refers to a higher level of abstraction, where the programmer manages "tasks" - chunks of work that have to be done - while the library (or language) presents an API to launch these tasks. It is then the library's job to launch threads, make sure there are not too few or too many of them, make sure the work is reasonably load-balanced, and so on. For better or worse, this gives the programmer less low-level control over the system, but also higher-level, more convenient and safer APIs to work with. Some will claim that this also leads to better performance, though this really depends on the application.

Threads and tasks in C++11

The C++11 thread library gives us a whole toolbox for working at the thread level. We have std::thread along with a horde of synchronization and signaling mechanisms, a well-defined memory model, thread-local data and atomic operations right there in the standard.

C++11 also tries to provide a set of tools for task-based parallelism, revolving around std::async. It succeeds in some respects, and fails in others. I will go ahead and say in advance that I believe std::async is a very nice tool to replace direct std::thread usage on the low level. On the other hand, it is not really a good task-based parallelism abstraction. The rest of the article will cover these claims in detail.

Using std::async as a smarter std::thread

While it's great to have std::thread in standard C++, it's a fairly low-level construct. As such, its usage is often more cumbersome and more error-prone than we'd want. Therefore, an experienced programmer would sit down and come up with a slightly higher-level abstraction that makes C++ threading a bit more pleasant and also safer. The good news is that someone has already written this abstraction, and even made it standard. It's called std::async.

Here's a simple example of using a worker thread to perform some work - in this case adding up the integers in a vector:

    #include <cstddef>
    #include <future>
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    // Sums `count` integers starting at `data` and stores the sum in *result.
    void accumulate_block_worker(int* data, size_t count, int* result) {
      *result = std::accumulate(data, data + count, 0);
    }

    void use_worker_in_std_thread() {
      std::vector<int> v{1, 2, 3, 4, 5, 6, 7, 8};
      int result;
      std::thread worker(accumulate_block_worker, v.data(), v.size(), &result);
      worker.join();
      std::cout << "use_worker_in_std_thread computed " << result << "\n";
    }

Straightforward enough. The thread is created and then immediately joined (waited upon to finish in a blocking manner). The result is communicated back to the caller via a pointer argument, since a std::thread cannot have a return value. This already points at a potential issue: when we write computation functions in C++ we usually employ the return value construct, rather than taking results by reference/pointer. Say we already had a function that did the work and was used in serial code, and we want to launch it in a std::thread. Since that function most likely returns its value, we'd need to either write a new version of it, or create some sort of wrapper.

Here's an alternative using std::async and std::future:

    // Same computation, but the sum is simply returned.
    int accumulate_block_worker_ret(int* data, size_t count) {
      return std::accumulate(data, data + count, 0);
    }

    void use_worker_in_std_async() {
      std::vector<int> v{1, 2, 3, 4, 5, 6, 7, 8};
      std::future<int> fut = std::async(
          std::launch::async, accumulate_block_worker_ret, v.data(), v.size());
      std::cout << "use_worker_in_std_async computed " << fut.get() << "\n";
    }

I'm passing the std::launch::async policy explicitly - more on this in the latter part of the article. The main thing to note here is that now the actual function launched in a thread is written in a natural way, returning the value it computed; no by-pointer output arguments in sight. std::async takes the return type of the function and returns it wrapped in a std::future, which is another handy abstraction. Read more about futures and promises in concurrent programming on Wikipedia. In the code above, the waiting for the computation thread to finish happens when we call get() on the future.

I like how the future decouples the task from the result. In more complex code, you can pass the future somewhere else, and it encapsulates both the thread to wait on and the result you'll end up with. The alternative of using std::thread directly is more cumbersome, because there are two things to pass around.

Here is a contrived example, where a function launches threads but then wants to delegate waiting for them and getting the results to some other function. It represents many realistic scenarios where we want to launch tasks in one place but collect results in some other place. First, a version with std::thread:

    // Demonstrates how to launch two threads and return two results to the caller
    // that will have to wait on those threads. Gives half the input vector to
    // one thread, and the other half to another.
    std::vector<std::thread> launch_split_workers_with_std_thread(
        std::vector<int>& v, std::vector<int>* results) {
      // The second half takes the extra element when v.size() is odd.
      const size_t half = v.size() / 2;
      std::vector<std::thread> threads;
      threads.emplace_back(accumulate_block_worker, v.data(), half,
                           &((*results)[0]));
      threads.emplace_back(accumulate_block_worker, v.data() + half,
                           v.size() - half, &((*results)[1]));
      return threads;
    }

    ...

    {
      // Usage
      std::vector<int> v{1, 2, 3, 4, 5, 6, 7, 8};
      std::vector<int> results(2, 0);
      std::vector<std::thread> threads =
          launch_split_workers_with_std_thread(v, &results);
      for (auto& t : threads) {
        t.join();
      }
      std::cout << "results from launch_split_workers_with_std_thread: "
                << results[0] << " and " << results[1] << "\n";
    }

Note how the thread objects have to be propagated back to the caller (so the caller can join them). Also, the storage for the results has to be provided by the caller, because otherwise it would go out of scope when the launching function returns.

Now, the same operation using std::async and futures:

    using int_futures = std::vector<std::future<int>>;

    int_futures launch_split_workers_with_std_async(std::vector<int>& v) {
      // The second half takes the extra element when v.size() is odd.
      const size_t half = v.size() / 2;
      int_futures futures;
      futures.push_back(std::async(std::launch::async,
                                   accumulate_block_worker_ret, v.data(), half));
      futures.push_back(std::async(std::launch::async,
                                   accumulate_block_worker_ret, v.data() + half,
                                   v.size() - half));
      return futures;
    }

    ...

    {
      // Usage
      std::vector<int> v{1, 2, 3, 4, 5, 6, 7, 8};
      int_futures futures = launch_split_workers_with_std_async(v);
      std::cout << "results from launch_split_workers_with_std_async: "
                << futures[0].get() << " and " << futures[1].get() << "\n";
    }

Once again, the code is cleaner and more concise. Bundling the thread handle with the result it's expected to produce just makes more sense.

If we want to implement more complex result sharing schemes, things get even trickier. Say we want two different threads to wait on the computation result. You can't just call join on a thread from multiple other threads. Or at least, not easily. Calling join() on a thread that was already joined throws an exception (std::system_error). With futures, we have std::shared_future, which wraps a std::future and permits concurrent access from multiple threads that may want to get the future's result.

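To make the std::shared_future point a bit more concrete, here is a minimal sketch of two threads waiting on the same result. The function name share_result_between_two_waiters and the way the results are collected are my own assumptions for illustration; only std::async, std::future::share() and std::shared_future::get() come from the standard library.

```cpp
#include <future>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sums a vector on a worker thread, then lets two
// consumer threads read the same result through copies of one
// std::shared_future. Returns the value each consumer observed.
std::vector<int> share_result_between_two_waiters(const std::vector<int>& v) {
  // share() converts the std::future returned by std::async into a
  // copyable std::shared_future.
  std::shared_future<int> shared =
      std::async(std::launch::async, [&v] {
        return std::accumulate(v.begin(), v.end(), 0);
      }).share();

  std::vector<int> seen(2, 0);
  // Each consumer captures its own copy of the shared_future by value and
  // writes to a distinct element of `seen`, so there is no data race.
  std::thread first([shared, &seen] { seen[0] = shared.get(); });
  std::thread second([shared, &seen] { seen[1] = shared.get(); });
  first.join();
  second.join();
  return seen;
}
```

Unlike std::future::get(), which can only be called once, get() on a std::shared_future can be called repeatedly, and each thread working on its own copy is the usage the standard intends.
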
Setting a timeout on retrieving task results

Say we launched a thread to do a computation. At some point we'll have to wait for it to finish in order to obtain the result. The wait may be trivial if we set the program up in a certain way, but it can actually take time in some situations. Can we set a timeout on this wait so that we don't block for too long? With the pure std::thread solution, it won't be easy. You can't set a timeout on the join() method, and other solutions are convoluted (such as setting up a "cooperative" timeout by sharing a condition variable with the launched thread).

With futures returned from std::async, nothing could be easier, since std::future has a wait_for() method that takes a timeout:

    #include <chrono>
    #include <future>
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    int accumulate_block_worker_ret(int* data, size_t count) {
      // Simulate a slow computation.
      std::this_thread::sleep_for(std::chrono::seconds(3));
      return std::accumulate(data, data + count, 0);
    }

    int main(int argc, const char** argv) {
      std::vector<int> v{1, 2, 3, 4, 5, 6, 7, 8};
      std::future<int> fut = std::async(
          std::launch::async, accumulate_block_worker_ret, v.data(), v.size());
      // Poll once a second until the result is available.
      while (fut.wait_for(std::chrono::seconds(1)) != std::future_status::ready) {
        std::cout << "... still not ready\n";
      }
      std::cout << "use_worker_in_std_async computed " << fut.get() << "\n";
      return 0;
    }

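For contrast, here is roughly what the "cooperative" condition-variable timeout mentioned above could look like with a raw std::thread. This is only a sketch under my own assumptions: the WorkerState struct and the sum_with_timeout function are hypothetical names, and the design (a done flag guarded by a mutex) is just one of several ways to do it.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical state shared between the launching thread and the worker.
struct WorkerState {
  std::mutex m;
  std::condition_variable cv;
  bool done = false;
  int result = 0;
};

// Returns true and fills *out if the worker finished within `timeout`.
bool sum_with_timeout(const std::vector<int>& v,
                      std::chrono::milliseconds timeout, int* out) {
  WorkerState state;
  std::thread worker([&v, &state] {
    int sum = std::accumulate(v.begin(), v.end(), 0);
    {
      std::lock_guard<std::mutex> lock(state.m);
      state.result = sum;
      state.done = true;
    }
    state.cv.notify_one();
  });

  bool finished;
  {
    std::unique_lock<std::mutex> lock(state.m);
    // The predicate overload of wait_for handles spurious wakeups and
    // returns the final value of the predicate.
    finished = state.cv.wait_for(lock, timeout, [&state] { return state.done; });
    if (finished) *out = state.result;
  }
  // The thread still has to be joined before `state` is destroyed, so in this
  // sketch the timeout only limits how long we wait for the result itself.
  worker.join();
  return finished;
}
```

Even in this small form there is a mutex, a flag, a condition variable and a lifetime question to get right, which is the bookkeeping that std::future::wait_for() hides.
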
Propagating exceptions between threads

If you're writing C++ code with exceptions enabled, you are kinda "living on the edge". You always have to keep a mischievous imaginary friend on your left shoulder who will remind you that at any point in the program an exception can be thrown and then "how are you handling it?". Threads add another dimension to this (already difficult) problem. What happens when a function launched in a std::thread throws an exception?

    void accumulate_block_worker(int* data, size_t count, int* result) {
      // Always fails; the line after the throw is intentionally unreachable.
      throw std::runtime_error("something broke");
      *result = std::accumulate(data, data + count, 0);
    }

    ...

    {
      // Usage.
      std::vector<int> v{1, 2, 3, 4, 5, 6, 7, 8};
      int result;
      std::thread worker(accumulate_block_worker, v.data(), v.size(), &result);
      worker.join();
      std::cout << "use_worker_in_std_thread computed " << result << "\n";
    }

This:

    terminate called after throwing an instance of 'std::runtime_error'
      what():  something broke
    Aborted (core dumped)

Ah, silly me, I didn't catch the exception. Let's try this alternative usage:

    try {
      std::thread worker(accumulate_block_worker, v.data(), v.size(), &result);
      worker.join();
      std::cout << "use_worker_in_std_thread computed " << result << "\n";
    } catch (const std::runtime_error& error) {
      std::cout << "caught an error: " << error.what() << "\n";
    }

Nope:

    terminate called after throwing an instance of 'std::runtime_error'
      what():  something broke
    Aborted (core dumped)

What's going on? Well, the C++ standard specifies that if the function run by a std::thread exits with an uncaught exception, std::terminate() is called. The exception is thrown on the worker thread's stack and never reaches the try block in the launching thread, so trying to catch it in another thread won't help.

While the example shown here is synthetic, there are many real-world cases where code executed in a thread can throw an exception. In a regular, non-threaded call, we may reasonably expect that this exception will be handled somewhere higher up the call stack. If the code runs in a thread, however, this assumption is broken.

It means that we should wrap the function running in the new thread in additional code that will catch all exceptions and somehow transfer them to the calling thread. Yet another "result" to return, as if returning the actual result of the computation wasn't cumbersome enough.

Once again, std::async to the rescue! Let's try this again:

    int accumulate_block_worker_ret(int* data, size_t count) {
      // Always fails; the line after the throw is intentionally unreachable.
      throw std::runtime_error("something broke");
      return std::accumulate(data, data + count, 0);
    }

    ...

    {
      // Usage.
      std::vector<int> v{1, 2, 3, 4, 5, 6, 7, 8};
      try {
        std::future<int> fut = std::async(
            std::launch::async, accumulate_block_worker_ret, v.data(), v.size());
        std::cout << "use_worker_in_std_async computed " << fut.get() << "\n";
      } catch (const std::runtime_error& error) {
        std::cout << "caught an error: " << error.what() << "\n";
      }
    }

Now we get:

    caught an error: something broke

The exception was propagated to the calling thread through the std::future and re-thrown when its get() method was called.

This is also the place to mention that the C++11 thread library provides many low-level building blocks for implementing high-level threading and task constructs. Returning a std::future from std::async is a fairly high-level abstraction, tailored for a specific kind of task management. If you want to implement something more advanced, like a special kind of concurrent queue that manages tasks, you'll be happy to hear that tools like std::promise and std::packaged_task are right there in the standard library to make your life more convenient. They let you associate functions with futures, and set exceptions separately from real results on those futures. I'll leave a deeper treatment of these topics to another day.

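Without going into that deeper treatment, here is a small sketch of what the two building blocks mentioned above look like in use. The function names sum_with_packaged_task and message_from_failed_promise are hypothetical, chosen for this illustration; std::packaged_task, std::promise, set_exception and std::current_exception are the standard pieces.

```cpp
#include <cstddef>
#include <exception>
#include <future>
#include <numeric>
#include <stdexcept>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Associates a function with a future via std::packaged_task, then runs the
// task on a plain std::thread. The return value travels through the future.
int sum_with_packaged_task(std::vector<int>& v) {
  std::packaged_task<int(int*, std::size_t)> task(
      [](int* data, std::size_t count) {
        return std::accumulate(data, data + count, 0);
      });
  std::future<int> fut = task.get_future();
  // packaged_task is move-only, so it is moved into the thread.
  std::thread worker(std::move(task), v.data(), v.size());
  int result = fut.get();
  worker.join();
  return result;
}

// Stores an exception in a std::promise by hand and returns the message seen
// by the thread that calls get() on the matching future.
std::string message_from_failed_promise() {
  std::promise<int> prom;
  std::future<int> fut = prom.get_future();
  std::thread worker([&prom] {
    try {
      throw std::runtime_error("something broke");
    } catch (...) {
      // Transfer the in-flight exception to the future instead of letting it
      // escape the thread function (which would call std::terminate).
      prom.set_exception(std::current_exception());
    }
  });
  std::string message;
  try {
    fut.get();  // Re-throws the stored exception in this thread.
  } catch (const std::runtime_error& error) {
    message = error.what();
  }
  worker.join();
  return message;
}
```

The second function is essentially the "catch everything and transfer it to the calling thread" wrapper described earlier, written out by hand; std::async and std::packaged_task do the equivalent for you.
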
... but is this real task-based parallelism?

So we've seen how std::async helps us write robust threaded programs with less code compared to "raw" std::threads. If your threading needs are covered by std::async, you should definitely use it instead of toiling to re-implement the same niceties with raw threads and other low-level constructs. But does std::async enable real task-based parallelism, wherein you can nonchalantly hand it functions and expect it to distribute them for you over some existing thread pool to use OS resources efficiently? Unfortunately, no. Well, at least in the current version of the C++ standard, not yet.

There are many problems. Let's start with the launch policy.

In all the samples shown above, I'm explicitly passing the async policy to std::async to circumvent the issue. async is not the only policy it supports. The other one is deferred, and the default is actually async | deferred, meaning that we leave it to the runtime to decide. Except that we shouldn't.

The deferred policy means that the task will run lazily, on whichever thread first calls get() (or wait()) on the future it returns. This is dramatically different from the async policy in many respects, so letting the runtime choose between the two complicates programming. Consider the wait_for example I've shown above. Let's modify it to launch the accumulation task with a deferred policy:

    #include <chrono>
    #include <future>
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    int accumulate_block_worker_ret(int* data, size_t count) {
      std::this_thread::sleep_for(std::chrono::seconds(3));
      return std::accumulate(data, data + count, 0);
    }

    int main(int argc, const char** argv) {
      std::vector<int> v{1, 2, 3, 4, 5, 6, 7, 8};
      std::future<int> fut = std::async(
          std::launch::deferred, accumulate_block_worker_ret, v.data(), v.size());
      // With a deferred task this loop never terminates.
      while (fut.wait_for(std::chrono::seconds(1)) != std::future_status::ready) {
        std::cout << "... still not ready\n";
      }
      std::cout << "use_worker_in_std_async computed " << fut.get() << "\n";
      return 0;
    }

Running it:

    $ ./using-std-future
    ... still not ready
    ... still not ready
    ... still not ready
    ... still not ready
    ... still not ready
    ... still not ready
    ... still not ready
    ^C

Oops, what's going on? The problem is that with the deferred policy, the call to wait_for on the future doesn't actually run the task; it returns std::future_status::deferred right away. Only get() (or wait()) runs it. So we're stuck in an infinite loop. This can be fixed, of course (by also checking for a std::future_status::deferred status from wait_for()), but requires extra thinking and extra handling. It's not just a matter of not getting stuck in a loop, it's also a matter of what we should do in case the task is deferred. Handling both async and deferred tasks in the same caller code becomes tricky. When we use the default policy, we let the runtime decide when it wants to use deferred instead of async, so bugs like this may be difficult to find since they will only manifest occasionally under certain system loads.

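The fix mentioned above, checking for std::future_status::deferred, could be sketched like this. The helper name poll_until_ready and its return value (the number of timeouts that elapsed) are my own assumptions for the sake of the example, not anything the standard prescribes.

```cpp
#include <chrono>
#include <future>

// Hypothetical helper: polls `fut` until it is ready and returns how many
// polling intervals timed out. A deferred task will never become ready on its
// own, so in that case we stop polling and leave it to the caller's get() to
// run the task.
template <typename T>
int poll_until_ready(std::future<T>& fut, std::chrono::milliseconds interval) {
  int timeouts = 0;
  for (;;) {
    const std::future_status status = fut.wait_for(interval);
    if (status == std::future_status::ready) {
      return timeouts;
    }
    if (status == std::future_status::deferred) {
      // wait_for does not run a deferred task; only get() or wait() does.
      return timeouts;
    }
    ++timeouts;  // status == std::future_status::timeout
  }
}
```

A caller would invoke poll_until_ready(fut, std::chrono::milliseconds(1000)) and then call fut.get() as before. Note that this only avoids the infinite loop; with a deferred task the subsequent get() still blocks the caller for the full duration of the computation, which is exactly the kind of policy-dependent behavior that makes the default policy awkward.
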
Tasks and TLS

The C++11 standard also added TLS (thread-local storage) support with the thread_local keyword, which is great because TLS is a useful technique that had not been standardized before. Let's try a synthetic example showing how it mixes with std::async's launch policies:

    #include <future>
    #include <iostream>

    // Every thread gets its own zero-initialized copy of tls_var.
    thread_local int tls_var;

    int read_tls_var() {
      return tls_var;
    }

    int main(int argc, const char** argv) {
      tls_var = 50;

      std::future<int> fut = std::async(std::launch::deferred, read_tls_var);
      std::cout << "got from read_tls_var: " << fut.get() << "\n";
      return 0;
    }

When run, this shows the value 50, because read_tls_var runs in the calling thread. If we change the policy to std::launch::async, it will instead show 0. That's because read_tls_var now runs in a new thread where tls_var wasn't set to 50 by main. Now imagine the runtime decides whether your task runs in the same thread or in another thread. How useful are TLS variables in this scenario? Not very, unfortunately. Well, unless you love non-determinism and multi-threading Heisenbugs :-)

Tasks and mutexes

Here's another fun example, this time with mutexes. Consider this piece of code:

    #include <functional>
    #include <future>
    #include <iostream>
    #include <mutex>

    int task(std::recursive_mutex& m) {
      // lock_guard releases the mutex on return; destroying a mutex that is
      // still locked would be undefined behavior.
      std::lock_guard<std::recursive_mutex> guard(m);
      return 42;
    }

    int main(int argc, const char** argv) {
      std::recursive_mutex m;
      std::lock_guard<std::recursive_mutex> guard(m);

      std::future<int> fut = std::async(std::launch::deferred, task, std::ref(m));
      std::cout << "got from task: " << fut.get() << "\n";
      return 0;
    }

It runs and shows 42 because the same thread can lock a std::recursive_mutex multiple times. If we switch the launch policy to async, the program deadlocks: a different thread cannot lock a std::recursive_mutex while the calling thread is holding it, and the calling thread is blocked in get() waiting for that thread. Contrived? Yes. Can this happen in real code? Yes, of course. If you're thinking to yourself "he's cheating, what is this weird std::recursive_mutex example specifically tailored to show a problem...", I assure you that a regular std::mutex has its own problems. It has to be unlocked in the thread it was locked in. So if task unlocked a regular std::mutex that was locked by main instead, we'd also have an issue. Unlocking a mutex in a different thread is undefined behavior. With the default launch policy, this undefined behavior would happen just sometimes. Lovely.

Bartosz Milewski has some additional discussion of these problems here and also here. Note that they will haunt more advanced thread strategies as well. Thread pools reuse the same threads for different tasks, so they'll also have to face TLS and mutex thread-locality issues. Whatever the adopted solution ends up being, some additional constraints will have to be introduced to make sure it's not too easy to shoot yourself in the foot.

Is std::async fundamentally broken?

Due to the problems highlighted above, I'd consider the default launch policy of std::async broken and would never use it in production code. I'm not the only one thinking this way. Scott Meyers, in his "Effective Modern C++", recommends the following wrapper to launch tasks:

    // Requires C++14 for the deduced (auto) return type.
    template <typename F, typename... Ts>
    inline auto reallyAsync(F&& f, Ts&&... params) {
      return std::async(std::launch::async, std::forward<F>(f),
                        std::forward<Ts>(params)...);
    }

Use this instead of raw std::async calls to ensure that the tasks are always launched in fresh threads, so that we can reason about our program more deterministically.

The authors of gcc came to realize this as well, and switched the libstdc++ default launch policy to std::launch::async in mid-2015. In fact, as the discussion in that bug highlights, std::async came close to being deprecated in the next C++ standard, since the standards committee realized it's not really possible to implement real task-based parallelism with it without non-deterministic and undefined behavior in some corner cases. And it's the role of the standards committee to ensure all corners are covered.

It's evident from online sources that std::async was a bit rushed into the C++11 standard, when the committee didn't have enough time to standardize a more comprehensive library solution such as thread pools. std::async was put there as a compromise, as part of a collection of low-level building blocks that could be used to build higher-level abstractions later. But actually, it can't. Or at least not easily. "Real" task-based parallel systems feature things like task migration between threads, task-stealing queues, etc. Anything built on top of std::async will just keep hitting the problems highlighted above (TLS, mutexes, etc.) in real user code. A more comprehensive overhaul is required. Luckily, this is exactly what the standards committee is toiling on - robust high-level concurrency primitives for the C++17 version of the standard.
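
As a quick sanity check of the reallyAsync wrapper discussed above, here is a minimal sketch that uses it and verifies the task ran on a different thread than its caller. The wrapper is repeated so the snippet stands on its own; runs_on_other_thread is a hypothetical helper name of mine, and the snippet assumes a C++14 (or later) compiler.

```cpp
#include <future>
#include <thread>
#include <utility>

// The wrapper from the text, repeated here so the snippet is self-contained.
template <typename F, typename... Ts>
inline auto reallyAsync(F&& f, Ts&&... params) {
  return std::async(std::launch::async, std::forward<F>(f),
                    std::forward<Ts>(params)...);
}

// Hypothetical check: reports whether a task launched through reallyAsync
// executed on a thread other than the one that launched it.
bool runs_on_other_thread() {
  const std::thread::id caller = std::this_thread::get_id();
  std::future<std::thread::id> fut =
      reallyAsync([] { return std::this_thread::get_id(); });
  return fut.get() != caller;
}
```

With std::launch::deferred the same check would report false, since the lambda would run inside get() on the calling thread; pinning the policy is what makes the answer predictable.
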