C++11 multithreading tutorial

The code for this tutorial is on GitHub: https://github.com/sol-prog/threads.

In my previous tutorials I’ve presented some of the newest C++11 additions to the language: regular expressions, raw strings and lambdas.

Perhaps one of the biggest changes to the language is the addition of multithreading support. Before C++11, it was possible to target multicore computers using OS facilities (pthreads on Unix-like systems) or libraries like OpenMP and MPI.

This tutorial is meant to get you started with C++11 threads and not to be an exhaustive reference of the standard.

Creating and launching a thread in C++11 is as simple as including the <thread> header and passing a function to a std::thread object. Let’s see how we can create a simple HelloWorld program with threads:

    #include <iostream>
    #include <thread>

    //This function will be called from a thread
    void call_from_thread() {
        std::cout << "Hello, World" << std::endl;
    }

    int main() {
        //Launch a thread
        std::thread t1(call_from_thread);

        //Join the thread with the main thread
        t1.join();

        return 0;
    }

On Linux you can compile the above code with g++:

    g++ -std=c++11 -pthread file_name.cpp

On a Mac with Xcode you can compile the above code with clang++:

    clang++ -std=c++11 -stdlib=libc++ file_name.cpp

On Windows you could use a commercial library, just::thread, to compile multithreaded code. Unfortunately they don’t supply a trial version of the library, so I wasn’t able to test it.

In a real-world application the “call_from_thread” function will do some work independently of the main function. In this particular code, the main function creates a thread and waits for it to finish at t1.join(). If you forget to join (or detach) a thread, the destructor of a still-joinable std::thread calls std::terminate, so the program aborts regardless of whether “call_from_thread” has finished or not.

Compare the relative simplicity of the std::thread version with equivalent code that uses POSIX threads:

    #include <iostream>
    #include <pthread.h>

    //This function will be called from a thread
    void *call_from_thread(void *) {
        std::cout << "Launched by thread" << std::endl;
        return NULL;
    }

    int main() {
        pthread_t t;

        //Launch a thread
        pthread_create(&t, NULL, call_from_thread, NULL);

        //Join the thread with the main thread
        pthread_join(t, NULL);
        return 0;
    }

Usually we will want to launch more than one thread at once and do some work in parallel. To do this we can create an array of threads instead of a single thread as in our first example. In the next example the main function creates a group of 10 threads that do some work, then waits for them to finish (there is also a POSIX version of this example in the GitHub repository for this article):

    ...
    static const int num_threads = 10;
    ...
    int main() {
        std::thread t[num_threads];

        //Launch a group of threads
        for (int i = 0; i < num_threads; ++i) {
            t[i] = std::thread(call_from_thread);
        }

        std::cout << "Launched from the main\n";

        //Join the threads with the main thread
        for (int i = 0; i < num_threads; ++i) {
            t[i].join();
        }

        return 0;
    }

Remember that the main function also runs in a thread, usually named the main thread, so the above code actually runs 11 threads. This allows us to do some work in the main thread after we have launched the threads and before joining them; we will see this in an image processing example at the end of this tutorial.

What about using a function with parameters in a thread? C++11 lets us pass as many parameters as we need in the thread call. For example, we could modify the above code to receive an integer as a parameter (you can see the POSIX version of this example in the GitHub repository for this article):

    #include <iostream>
    #include <thread>

    static const int num_threads = 10;

    //This function will be called from a thread
    void call_from_thread(int tid) {
        std::cout << "Launched by thread " << tid << std::endl;
    }

    int main() {
        std::thread t[num_threads];

        //Launch a group of threads
        for (int i = 0; i < num_threads; ++i) {
            t[i] = std::thread(call_from_thread, i);
        }

        std::cout << "Launched from the main\n";

        //Join the threads with the main thread
        for (int i = 0; i < num_threads; ++i) {
            t[i].join();
        }

        return 0;
    }

The result of the above code on my system is:

    Sol$ ./a.out
    Launched by thread 0
    Launched by thread 1
    Launched by thread 2
    Launched from the main
    Launched by thread 3
    Launched by thread 5
    Launched by thread 6
    Launched by thread 7
    Launched by thread Launched by thread 4
    8L
    aunched by thread 9
    Sol$

You can see in the above result that threads do not run in any particular order once created. It is the programmer’s job to ensure that a group of threads won’t interfere with each other when they modify the same data. Also, the last lines are mangled because thread 4 hadn’t finished writing to stdout when thread 8 started. If you run the above code on your system you can get a completely different result, or even some mangled characters. This is because all 11 threads of this program compete for the same resource, stdout.

You can avoid some of the above problems by using locks in your code (std::mutex), which let you synchronize the way a group of threads shares a resource, or you could try to use separate data structures for your threads, if possible. We will talk about thread synchronization using atomic types and mutexes in the next tutorial.

In principle, the above syntax is all we need in order to write more complex parallel code.

In the next example I will try to illustrate the power of parallel programming by tackling a slightly more complex problem: removing the noise from an image with a blur filter. The idea is that we can dissipate the noise from an image by using some form of weighted average of a pixel and its neighbours.

This tutorial is not about optimal image processing, nor is the author an expert in this domain, so we will take a rather simple approach here. Our purpose is to illustrate how to write parallel code, not how to efficiently read/write images or convolve them with filters. For example, I’ve used the definition of the spatial convolution instead of the more performant, but slightly more difficult to implement, convolution in the frequency domain by use of the Fast Fourier Transform.

For simplicity we will use a simple non-compressed image file format like PPM. Next we present the header file of a simple C++ class that allows you to read/write PPM images and to store them in memory as three arrays (for the R,G,B colours) of unsigned characters:

    class ppm {
        bool flag_alloc;
        void init();
        //info about the PPM file (height and width)
        unsigned int nr_lines;
        unsigned int nr_columns;

    public:
        //arrays for storing the R,G,B values
        unsigned char *r;
        unsigned char *g;
        unsigned char *b;
        //
        unsigned int height;
        unsigned int width;
        unsigned int max_col_val;
        //total number of elements (pixels)
        unsigned int size;

        ppm();
        //create a PPM object and fill it with data stored in fname
        ppm(const std::string &fname);
        //create an "empty" PPM image with a given width and height;the R,G,B arrays are filled with zeros
        ppm(const unsigned int _width, const unsigned int _height);
        //free the memory used by the R,G,B vectors when the object is destroyed
        ~ppm();
        //read the PPM image from fname
        void read(const std::string &fname);
        //write the PPM image in fname
        void write(const std::string &fname);
    };

A possible way to structure our code is:

1. Load an image into memory.
2. Split the image into as many chunks as the maximum number of threads supported by your system, e.g. on a quad-core computer we could use 8 threads.
3. Launch number of threads - 1 threads (7 for a quad-core system); each one will process its chunk of the image.
4. Let the main thread deal with the last chunk of the image.
5. Wait until all threads have finished and join them with the main thread.
6. Save the processed image.

Next we present the main function that implements the above algorithm (many thanks to wicked for suggesting some code improvements):

    int main() {
        std::string fname = std::string("your_file_name.ppm");

        ppm image(fname);
        ppm image2(image.width, image.height);

        //Number of threads to use (the image will be divided between threads)
        int parts = 8;

        std::vector<int> bnd = bounds(parts, image.size);

        std::thread *tt = new std::thread[parts - 1];

        time_t start, end;
        time(&start);
        //Launch parts-1 threads
        for (int i = 0; i < parts - 1; ++i) {
            tt[i] = std::thread(tst, &image, &image2, bnd[i], bnd[i + 1]);
        }

        //Use the main thread to do part of the work !!!
        for (int i = parts - 1; i < parts; ++i) {
            tst(&image, &image2, bnd[i], bnd[i + 1]);
        }

        //Join parts-1 threads
        for (int i = 0; i < parts - 1; ++i)
            tt[i].join();

        time(&end);
        std::cout << difftime(end, start) << " seconds" << std::endl;

        //Save the result
        image2.write("test.ppm");

        //Clear memory and exit
        delete [] tt;

        return 0;
    }

Please ignore the hard-coded name of the image file and the number of threads to launch; in a real-world application you should allow the user to enter these parameters interactively.

Now, in order to see parallel code at work we need to give it a significant amount of work, otherwise the overhead of creating and destroying threads will nullify our effort to parallelize this code. The input image should be large enough to actually see an improvement in performance when the code is run in parallel. For this purpose I’ve used an image of 16000x10626 pixels, which occupies about 512 MB in PPM format.

I’ve added some noise over the above image in Gimp. The effect of the noise addition can be seen in the next detail of the above picture:

Let’s see the above code in action:

As you can see from the above image the noise level was dissipated.

The results of running the last example code on a dual-core MacBook Pro from 2010 are presented in the next table:

    Compiler   Optim   Threads   Time   Speedup
    clang++    none    1         40 s
    clang++    none    4         20 s   2x
    clang++    -O4     1         12 s
    clang++    -O4     4         6 s    2x

On a dual-core machine this code achieves a perfect 2x speedup when run in parallel versus in serial mode (a single thread).

I’ve also tested the code on a quad-core Intel i7 machine with Linux. These are the results:

    Compiler   Optim   Threads   Time   Speedup
    g++        none    1         33 s
    g++        none    8         13 s   2.54x
    g++        -O4     1         9 s
    g++        -O4     8         3 s    3x

Apparently Apple’s clang++ is better at scaling a parallel program. However, this could be due to a combination of compiler and machine characteristics; it could also be because the MacBook Pro used for the tests has 8 GB of RAM versus only 6 GB for the Linux machine.

Read the second part of this tutorial: C++11 multithreading tutorial - part 2.

If you are interested in learning more about the new C++ syntax I would recommend reading Professional C++ (4th Edition) by M. Gregoire, N. A. Solter, S. J. Kleper or, if you are a C++ beginner, C++ Primer (5th Edition) by S. B. Lippman, J. Lajoie, B. E. Moo.

A good book for learning about C++11 multithreading support is C++ Concurrency in Action: Practical Multithreading by Anthony Williams.