C++11 async tutorial

For a few years now, we have been living in a multiprocessor world, from the phone in my pocket to the quad-core beast I have on my desk. Today, you can easily buy a six or twelve core machine that rivals the supercomputers of a couple of decades ago.

As programmers, we need to be able to use the available computing power at full capacity. You can’t buy a new computer and expect your serial code to run faster. You need to write code that can run on multicore machines, or you will deliver a low quality product to your potential clients.

The new C++11 standard allows us to maximize the use of the available hardware directly from the language. Today, you can write portable multithreaded code using only the standard library.

std::async lets you write code that can potentially run in one or more threads separate from the main thread of your program. It takes a callable object, a function for example, as its argument and returns a std::future that will store either the result returned by your function or the exception it threw.

std::async can be seen as a high-level interface to std::thread. Let’s see a simplified example of async usage:

```cpp
#include <future>
#include <iostream>

void called_from_async() {
    std::cout << "Async call" << std::endl;
}

int main() {
    //called_from_async launched in a separate thread if possible
    std::future<void> result(std::async(called_from_async));

    std::cout << "Message from main." << std::endl;

    //ensure that called_from_async is launched synchronously
    //if it wasn't already launched
    result.get();

    return 0;
}
```

Running the above code a few times gives some interesting outputs:

```
Sols sol$ ./a.out
Message from main.
Async call
Sols sol$ ./a.out
MAessysnacg ec aflrlom main.
```

What just happened? In the first run, the message from the main thread was printed before called_from_async was executed. In the second run, the main thread and the thread in which called_from_async was launched tried to write to the screen at the same time, so their characters were interleaved.

Let’s try another example that actually returns a value. To make things more interesting, I’m going to use a lambda this time:

```cpp
#include <future>
#include <iostream>

int main() {
    //the lambda is launched in a separate thread if possible
    std::future<int> result(std::async([](int m, int n) { return m + n; }, 2, 4));

    std::cout << "Message from main." << std::endl;

    //retrieve and print the value stored in the future
    std::cout << result.get() << std::endl;

    return 0;
}
```

Please note the use of std::future<int> to indicate that the return value of the lambda is an integer and the way in which we pass arguments, 2 and 4, to the lambda in std::async.
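Besides a value, the future can also carry an exception: if the callable throws, the exception is stored in the future and rethrown when you call get(). Here is a small sketch of this behavior. The functions safe_divide and divide_async are hypothetical helpers written only for this illustration.

```cpp
#include <future>
#include <stdexcept>
#include <string>

// Hypothetical function for this sketch: throws when the divisor is zero.
int safe_divide(int m, int n) {
    if (n == 0) {
        throw std::invalid_argument("division by zero");
    }
    return m / n;
}

// Runs safe_divide through std::async and reports what came back:
// the quotient as text, or the message of the exception rethrown by get().
std::string divide_async(int m, int n) {
    std::future<int> result(std::async(safe_divide, m, n));
    try {
        return std::to_string(result.get());
    } catch (const std::invalid_argument &e) {
        return std::string("error: ") + e.what();
    }
}
```

With these definitions, divide_async(8, 2) returns "4", while divide_async(1, 0) returns "error: division by zero". The exception thrown inside the asynchronous call surfaces at the point where get() is called.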

Usually, you will need to launch more than one asynchronous call in a row. Giving a separate name to each returned future is tedious and error prone. An elegant way to avoid this problem is to store the futures in a vector:

```cpp
#include <future>
#include <iostream>
#include <vector>

int twice(int m) {
    return 2 * m;
}

int main() {
    std::vector<std::future<int>> futures;

    for (int i = 0; i < 10; ++i) {
        futures.push_back(std::async(twice, i));
    }

    //retrieve and print the values stored in the futures
    for (auto &e : futures) {
        std::cout << e.get() << std::endl;
    }

    return 0;
}
```

We could simplify the above code a bit by using a lambda instead of the twice function:

```cpp
#include <future>
#include <iostream>
#include <vector>

int main() {
    std::vector<std::future<int>> futures;

    for (int i = 0; i < 10; ++i) {
        futures.push_back(std::async([](int m) { return 2 * m; }, i));
    }
    ...
}
```
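How these calls are scheduled depends on the launch policy. When no policy is given, std::async uses std::launch::async | std::launch::deferred and the implementation is free to choose between running the task in a new thread or deferring it until get() is called. You can pass a policy as the first argument to be explicit. The sketch below uses two illustrative helper functions, runs_on_calling_thread and twice_all, which are not part of the original code.

```cpp
#include <future>
#include <thread>
#include <vector>

// Returns true when the task ran on the same thread that called this function.
bool runs_on_calling_thread(std::launch policy) {
    const std::thread::id caller = std::this_thread::get_id();
    std::future<std::thread::id> worker(
        std::async(policy, [] { return std::this_thread::get_id(); }));
    return worker.get() == caller;
}

// Variation of the vector example that explicitly asks for a new thread per task.
std::vector<int> twice_all(int count) {
    std::vector<std::future<int>> futures;
    for (int i = 0; i < count; ++i) {
        futures.push_back(std::async(std::launch::async, [](int m) { return 2 * m; }, i));
    }

    // Collect the results in the order the tasks were launched.
    std::vector<int> results;
    for (auto &e : futures) {
        results.push_back(e.get());
    }
    return results;
}
```

With std::launch::deferred the task runs lazily inside get(), on the thread that calls it, so runs_on_calling_thread returns true. With std::launch::async the task runs as if in a new thread, so it returns false.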

Now, it’s time to try a slightly more ambitious project, something that will allow us to measure the performance of serial code versus the same code with std::async. For this, I’m going to use a Perlin noise generator I built some time ago. The idea is that I can use it to make a set of pictures in parallel. We’ll use std::chrono to time the performance of the code.

The reference Perlin noise function is a 3D function that, given three numbers in the unit cube, will always generate the same result. In this article, we will use the third dimension as a seed to generate 2D images of 1280px width and 720px height. For our purposes, we will treat this as a black box function that, given a real number in the [0, 1] interval as input, will generate and save a 2D image. If you want more information about how I’ve implemented this function, you can read my Perlin noise in C++11 article.

We’ll generate 1800 images from the values of z (the third dimension) in the interval [0, 1]. Each image will be generated by the make_perlin_noise function in a std::async call:

```cpp
#include <chrono>
#include <future>
#include <iostream>
#include <vector>

//defined in the complete source code; declaration inferred from the call below
void make_perlin_noise(int id, int id_width, double z);

int main() {
    std::vector<std::future<void>> futures;
    int frames = 1800;
    int id_width = 4;
    double delta = 1.0 / (double) frames;

    auto start = std::chrono::steady_clock::now();
    //id runs from 0 to frames inclusive, so z covers the closed interval [0, 1]
    for (int id = 0; id <= frames; ++id) {
        double z = (double) id * delta;
        futures.push_back(std::async(make_perlin_noise, id, id_width, z));
    }

    for (auto &e : futures) {
        e.get();
    }
    auto end = std::chrono::steady_clock::now();

    auto diff = end - start;
    std::cout << std::chrono::duration<double, std::milli>(diff).count() << " ms" << std::endl;

    return 0;
}
```

The two calls to std::chrono::steady_clock::now(), one before the first loop and one after the second, let us measure the execution time of the code.

The complete source code for the above example is on GitHub: https://github.com/sol-prog/async_tutorial

Running this example on a dual-core MacBook Pro with 64-bit Mountain Lion and Xcode 4.5 with Clang 4.1 gives some interesting results:

The serial version of the above code uses approximately 3.2 MB of memory and runs for about 14 minutes with no optimization. With full optimization it runs for 4.5 minutes, which is about a 3x speedup from compiler optimization alone.

The parallel version of the code uses about 700 MB of memory and runs for about 7.3 minutes with no optimization. With full optimization it runs for about 2.65 minutes, which is about a 2.75x speedup over the unoptimized parallel build.

Comparing the serial with the parallel code without optimization, we note a 1.9x speedup for the parallel version. With optimizations we have a 1.7x speedup for the parallel version.
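If you want to try this kind of serial versus std::async comparison without the Perlin noise code, the sketch below may help. It is only an illustration: fake_frame is a made-up CPU-bound stand-in for the image generation, and elapsed_ms, run_serial and run_async are hypothetical helper names. The timings you get will depend on your machine and will not match the numbers above.

```cpp
#include <chrono>
#include <future>
#include <vector>

// Times any callable with std::chrono::steady_clock and returns milliseconds.
template <typename F>
double elapsed_ms(F &&work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

// Hypothetical stand-in for the image generation: some CPU-bound work per frame.
long long fake_frame(int id) {
    long long acc = 0;
    for (int i = 0; i < 200000; ++i) {
        acc += (static_cast<long long>(i) * id) % 7;
    }
    return acc;
}

// Processes the frames one after another on the calling thread.
long long run_serial(int frames) {
    long long total = 0;
    for (int id = 0; id < frames; ++id) {
        total += fake_frame(id);
    }
    return total;
}

// Launches one std::async call per frame, then collects the results.
long long run_async(int frames) {
    std::vector<std::future<long long>> futures;
    for (int id = 0; id < frames; ++id) {
        futures.push_back(std::async(fake_frame, id));
    }
    long long total = 0;
    for (auto &e : futures) {
        total += e.get();
    }
    return total;
}
```

From main, you could time both variants with something like elapsed_ms([] { run_serial(64); }) and elapsed_ms([] { run_async(64); }), then divide the first result by the second to get the speedup.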

As a side note, the parallel version uses about 280 threads on my machine versus a single thread for the serial version. This explains the difference in memory usage between the two versions.
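If the number of threads or the memory usage becomes a problem, one possible approach is to launch the asynchronous calls in batches instead of all at once. This is not what the code in this article does. It is a rough sketch under the assumption that the task takes an int id and returns void, and the names run_in_batches and default_batch_size are invented for the example.

```cpp
#include <algorithm>
#include <future>
#include <thread>
#include <vector>

// Runs task(id) for every id in [0, count) in batches, so that at most
// batch_size asynchronous calls are in flight at any moment.
// The task is assumed to accept an int and return void.
template <typename Task>
void run_in_batches(int count, unsigned batch_size, Task task) {
    if (batch_size == 0) {
        batch_size = 1;
    }
    const int step = static_cast<int>(batch_size);
    for (int first = 0; first < count; first += step) {
        const int last = std::min(count, first + step);
        std::vector<std::future<void>> batch;
        for (int id = first; id < last; ++id) {
            batch.push_back(std::async(std::launch::async, task, id));
        }
        // Wait for the whole batch before starting the next one.
        for (auto &e : batch) {
            e.get();
        }
    }
}

// hardware_concurrency() may return 0 when the value cannot be determined,
// so fall back to an arbitrary small number in that case.
unsigned default_batch_size() {
    const unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 2 : n;
}
```

A call such as run_in_batches(1800, default_batch_size(), some_task) would keep the number of simultaneous tasks close to the number of hardware threads. The price for this simplicity is that one slow task holds up its whole batch.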

Using the resulting 1800 images, I’ve made a short, one-minute movie that shows how the image evolves for z in the interval [0, 1].

If you are interested in learning more about the new C++11 std::async, I would recommend reading C++ Concurrency in Action: Practical Multithreading by Anthony Williams, or The C++ Standard Library: A Tutorial and Reference (2nd Edition) by N. M. Josuttis.