C++11 multithreading tutorial - part 2

The code for this tutorial is on GitHub: https://github.com/sol-prog/threads.

In my last tutorial about using threads in C++11 we saw that the new C++11 thread syntax is remarkably clean compared with the POSIX pthreads syntax. Using a few simple concepts we were able to build a fairly complex image processing example while avoiding the subject of thread synchronization. In this second part of the introduction to multithreaded programming in C++11 we are going to see how we can synchronize a group of threads running in parallel.

We’ll start with a quick reminder of how we can create a group of threads in C++11. In the last tutorial we saw that we can store a group of threads in a classical C-style array. It is also entirely possible to store our threads in a std::vector, which is more in the spirit of C++11 and avoids the pitfalls of dynamic memory allocation with new and delete:

    #include <iostream>
    #include <thread>
    #include <vector>

    // This function will be called from a thread
    void func(int tid) {
        std::cout << "Launched by thread " << tid << std::endl;
    }

    int main() {
        std::vector<std::thread> th;

        int nr_threads = 10;

        // Launch a group of threads
        for (int i = 0; i < nr_threads; ++i) {
            th.push_back(std::thread(func, i));
        }

        // Join the threads with the main thread
        for (auto &t : th) {
            t.join();
        }

        return 0;
    }

To compile the above program on Mac OS X Lion with clang++ or with gcc-4.7 (gcc-4.7 was compiled from source):

    clang++ -Wall -std=c++11 -stdlib=libc++ file_name.cpp

    g++-4.7 -Wall -std=c++11 file_name.cpp

On a modern Linux system with gcc-4.7.x and up we can compile the code with:

    g++ -std=c++11 -pthread file_name.cpp

Some real-life problems are embarrassingly parallel by nature and can be managed well with the simple syntax presented in the first part of this tutorial. Adding two arrays, multiplying an array by a scalar and generating the Mandelbrot set are classical examples of embarrassingly parallel problems.
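As a minimal sketch of an embarrassingly parallel problem, the code below multiplies an array by a scalar. The names scale and parallel_scale and the way the array is cut into slices are assumptions made for this illustration; they are not taken from the tutorial's repository. Each thread writes only to its own slice, so no synchronization is needed:

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Multiply the elements in [L, R) by factor; every thread owns a distinct slice
void scale(std::vector<int> &v, int factor, int L, int R) {
    for (int i = L; i < R; ++i) {
        v[i] *= factor;
    }
}

// Hypothetical helper: scale v in place using nr_threads threads
void parallel_scale(std::vector<int> &v, int factor, int nr_threads) {
    assert(nr_threads > 0);
    const int n = static_cast<int>(v.size());
    const int chunk = n / nr_threads;
    std::vector<std::thread> threads;

    for (int i = 0; i < nr_threads; ++i) {
        int L = i * chunk;
        // The last slice also takes the elements left over by the division
        int R = (i == nr_threads - 1) ? n : L + chunk;
        threads.push_back(std::thread(scale, std::ref(v), factor, L, R));
    }

    // Join the threads with the calling thread
    for (auto &t : threads) {
        t.join();
    }
}
```

Calling parallel_scale(v, 2, 3) on a vector filled with the value 3 should leave every element equal to 6, regardless of the order in which the threads run.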

Other problems by their nature require some level of synchronization between threads. Take for example the dot product of two vectors: take two vectors of equal length, multiply them element by element and add the result of each multiplication to a scalar variable. A naive parallelization of this problem is presented in the next code snippet:

     1  #include <iostream>
     2  #include <thread>
     3  #include <vector>
     4  #include <functional> // std::ref
     5  ...
     6
     7  void dot_product(const std::vector<int> &v1, const std::vector<int> &v2, int &result, int L, int R) {
     8      for (int i = L; i < R; ++i) {
     9          result += v1[i] * v2[i];
    10      }
    11  }
    12
    13  int main() {
    14      int nr_elements = 100000;
    15      int nr_threads = 2;
    16      int result = 0;
    17      std::vector<std::thread> threads;
    18
    19      // Fill two vectors with some constant values for a quick verification
    20      // v1={1,1,1,1,...,1}
    21      // v2={2,2,2,2,...,2}
    22      // The result of the dot_product should be 200000 for this particular case
    23      std::vector<int> v1(nr_elements, 1), v2(nr_elements, 2);
    24
    25      // Split nr_elements into nr_threads parts
    26      std::vector<int> limits = bounds(nr_threads, nr_elements);
    27
    28      // Launch nr_threads threads:
    29      for (int i = 0; i < nr_threads; ++i) {
    30          threads.push_back(std::thread(dot_product, std::ref(v1), std::ref(v2), std::ref(result), limits[i], limits[i + 1]));
    31      }
    32
    33
    34      // Join the threads with the main thread
    35      for (auto &t : threads) {
    36          t.join();
    37      }
    38
    39      // Print the result
    40      std::cout << result << std::endl;
    41
    42      return 0;
    43  }

The result of the above code should obviously be 200000; however, running the code a few times gives results that are wrong and that differ from run to run:

    sol $g++-4.7 -Wall -std=c++11 cpp11_threads_01.cpp
    sol $./a.out
    138832
    sol $./a.out
    138598
    sol $./a.out
    138032
    sol $./a.out
    140690
    sol $

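What happened? Look carefully at line 9 of the C++ code, the statement result += v1[i] * v2[i]; which accumulates the products of v1[i] and v2[i] in the shared variable result. Line 9 is a typical example of a race condition: the code runs in two parallel, unsynchronized threads, and updating result is not a single indivisible step. Each thread reads the current value, adds its product and writes the sum back, so when both threads do this at the same time one update can overwrite the other and part of the sum is lost.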
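To see how an update can get lost, the sketch below replays, step by step and in a single thread, one possible interleaving of two threads that each execute result += 2. The function name lost_update_demo and the variables reg_a and reg_b (standing in for each thread's private copy of the value) are hypothetical. This is only a simplified model: formally, such unsynchronized access is a data race, which is undefined behavior in C++11.

```cpp
#include <cassert>

// Single-threaded replay of one interleaving two threads could produce
int lost_update_demo() {
    int result = 0;      // the shared variable

    int reg_a = result;  // thread A reads result (0)
    int reg_b = result;  // thread B reads result (0) before A writes back

    reg_a += 2;          // thread A adds its product
    reg_b += 2;          // thread B adds its product

    result = reg_a;      // thread A writes 2
    result = reg_b;      // thread B writes 2, overwriting A's update

    assert(result == 2); // two additions of 2 were performed, yet result is not 4
    return result;
}
```

In the dot product program the same thing happens many times over, which is why the printed totals fall short of 200000.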

We can avoid this problem by making sure that only one thread at a time accesses this variable. For this we can use a mutex, a special-purpose variable that acts like a barrier in front of the code that modifies the result variable: only one thread at a time is let through, and the others wait until it has finished:

     1  #include <iostream>
     2  #include <thread>
     3  #include <vector>
     4  #include <mutex>
     5
     6  static std::mutex barrier;
     7
     8  ...
     9
    10  void dot_product(const std::vector<int> &v1, const std::vector<int> &v2, int &result, int L, int R) {
    11      int partial_sum = 0;
    12      for (int i = L; i < R; ++i) {
    13          partial_sum += v1[i] * v2[i];
    14      }
    15      std::lock_guard<std::mutex> block_threads_until_finish_this_job(barrier);
    16      result += partial_sum;
    17  }
    18  ...

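Line 6 creates a global mutex variable, barrier. Line 15 creates a std::lock_guard that locks barrier and releases it automatically when the function returns, so only one thread at a time can execute the update of result on line 16. Notice that this time each thread first accumulates its products in a local variable, partial_sum, without any locking, and takes the lock only once, to add partial_sum to result. The rest of the code is unchanged.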
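Putting the pieces together, here is a self-contained sketch of the mutex version. The listings above elide the bounds helper, so the bounds function below is a hypothetical stand-in written for this example (it assumes the helper returns nr_threads + 1 boundaries); the wrapper parallel_dot is also an assumed name, used here in place of main so that the result can be checked:

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

static std::mutex barrier;

// Hypothetical stand-in for the elided helper: returns parts + 1 boundaries,
// so that chunk i covers the index range [limits[i], limits[i + 1])
std::vector<int> bounds(int parts, int nr_elements) {
    std::vector<int> limits;
    int chunk = nr_elements / parts;
    for (int i = 0; i < parts; ++i) {
        limits.push_back(i * chunk);
    }
    limits.push_back(nr_elements); // the last chunk absorbs the remainder
    return limits;
}

void dot_product(const std::vector<int> &v1, const std::vector<int> &v2, int &result, int L, int R) {
    int partial_sum = 0;
    for (int i = L; i < R; ++i) {
        partial_sum += v1[i] * v2[i];
    }
    // Only one thread at a time may update the shared result
    std::lock_guard<std::mutex> lock(barrier);
    result += partial_sum;
}

// Assumed wrapper: launch the threads, join them and return the dot product
int parallel_dot(const std::vector<int> &v1, const std::vector<int> &v2, int nr_threads) {
    assert(v1.size() == v2.size() && nr_threads > 0);
    int result = 0;
    std::vector<std::thread> threads;
    std::vector<int> limits = bounds(nr_threads, static_cast<int>(v1.size()));

    for (int i = 0; i < nr_threads; ++i) {
        threads.push_back(std::thread(dot_product, std::cref(v1), std::cref(v2),
                                      std::ref(result), limits[i], limits[i + 1]));
    }
    for (auto &t : threads) {
        t.join();
    }
    return result;
}
```

With v1 filled with 1 and v2 filled with 2, both of length 100000, parallel_dot(v1, v2, 2) should return 200000 on every run.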

For this particular case we can actually find a simpler and more elegant solution: an atomic type, a special kind of variable that allows safe concurrent reading and writing; basically, the synchronization is done under the hood. As a side note, on an atomic type we can apply only the atomic operations defined in the atomic header:

    #include <iostream>
    #include <thread>
    #include <vector>
    #include <atomic>

    void dot_product(const std::vector<int> &v1, const std::vector<int> &v2, std::atomic<int> &result, int L, int R) {
        int partial_sum = 0;
        for (int i = L; i < R; ++i) {
            partial_sum += v1[i] * v2[i];
        }
        result += partial_sum;
    }

    int main() {
        int nr_elements = 100000;
        int nr_threads = 2;
        std::atomic<int> result(0);
        std::vector<std::thread> threads;

        ...

        return 0;
    }
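As a small illustration of what an atomic operation looks like, the sketch below lets several threads increment a shared std::atomic<int> counter. The function name count_with_atomic and its parameters are assumptions made for this example. It uses fetch_add, which performs the read, the addition and the write as one indivisible step, and load to read the final value:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Each of nr_threads threads adds 1 to a shared counter, increments times
int count_with_atomic(int nr_threads, int increments) {
    assert(nr_threads > 0 && increments >= 0);
    std::atomic<int> counter(0);
    std::vector<std::thread> threads;

    for (int t = 0; t < nr_threads; ++t) {
        threads.push_back(std::thread([&counter, increments]() {
            for (int i = 0; i < increments; ++i) {
                counter.fetch_add(1); // same effect as ++counter or counter += 1
            }
        }));
    }

    // Join the threads before reading the final value
    for (auto &t : threads) {
        t.join();
    }
    return counter.load();
}
```

Note that writing counter = counter + 1 instead would not be safe: the read and the write would each be atomic on their own, but another thread could still modify counter between them.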

At the time of writing, atomic types and atomic operations are not available in Apple’s current clang++; however, you can use atomic types if you are willing to compile the latest clang++ from source, or you can use the latest gcc-4.7, also compiled from source.

If you are interested in learning more about the new C++11 syntax I would recommend reading Professional C++ by M. Gregoire, N. A. Solter and S. J. Kleper or, if you are a C++ beginner, C++ Primer (5th Edition) by S. B. Lippman, J. Lajoie and B. E. Moo.