C++17 STL Parallel Algorithms - with GCC 9.1 and Intel TBB on Linux and macOS

Thanks to Bartlomiej Filipek for the suggestion to try C++17 parallel algorithms with GCC 9 and Intel TBB on Linux.

GCC 9.1 has support for C++17 parallel algorithms by using the Intel TBB library. In this article, I will show you how to build Intel TBB from sources on your machine and how to sort a vector of random numbers in parallel using C++17 std::sort.

At the time of this writing, the latest stable version of Intel TBB is version 2019 Update 8. You can find the latest release of TBB on their GitHub repo. If a newer version was released, change the version number from the next instructions accordingly.

I have instructions about building GCC 9.1 from source on Linux or macOS on other articles on my website.

Let’s start by downloading Intel TBB:

1 # On Linux: 2 3 wget https://github.com/intel/tbb/archive/2019_U8.tar.gz 4 tar zxvf 2019_U8.tar.gz 5 rm 2019_U8.tar.gz 6 7 # On macOS 8 9 curl -L https://github.com/intel/tbb/archive/2019_U8.tar.gz | tar xf -

Next, we are going to duplicate one of the existing build files and change the compiler from generic GCC to GCC 9.1:

1 cd tbb-2019_U8 2 3 # On Linux: 4 5 cp build/linux.gcc.inc build/linux.gcc-9.1.inc 6 nano build/linux.gcc-9.1.inc 7 8 # On macOS 9 10 cp build/macos.gcc.inc build/macos.gcc-9.1.inc 11 nano build/macos.gcc-9.1.inc

At about line 15 you will find these two lines:

1 CPLUS ?= g++ 2 CONLY ?= gcc

change them to:

1 CPLUS ?= g++-9.1 2 CONLY ?= gcc-9.1

save and close the file (with nano you need to press CTRL-X and answer Y when asked if you want to save it).

Next, we can build the library:

1 make compiler=gcc-9.1 stdver=c++17 tbb_build_prefix=my_tbb_build

Once the build is finished, you can install it with the next instructions:

1 sudo mkdir /usr/local/tbb-2019_U8 2 sudo cp -r include /usr/local/tbb-2019_U8/include 3 sudo ln -s /usr/local/tbb-2019_U8/include/tbb /usr/local/include/tbb 4 sudo cp -r build/my_tbb_build_release /usr/local/tbb-2019_U8/lib 5 6 # On Linux: 7 sudo ln -s /usr/local/tbb-2019_U8/lib/libtbb.so.2 /usr/local/lib/libtbb.so 8 sudo ln -s /usr/local/tbb-2019_U8/lib/libtbbmalloc.so.2 /usr/local/lib/libtbbmalloc.so 9 sudo ln -s /usr/local/tbb-2019_U8/lib/libtbbmalloc_proxy.so.2 /usr/local/lib/libtbbmalloc_proxy.so 10 echo 'export LD_LIBRARY_PATH=/usr/local/tbb-2019_U8/lib:$LD_LIBRARY_PATH' >> ~/.bashrc 11 source ~/.bashrc 12 13 # On macOS: 14 sudo ln -s /usr/local/tbb-2019_U8/lib/libtbb.dylib /usr/local/lib/libtbb.dylib 15 sudo ln -s /usr/local/tbb-2019_U8/lib/libtbbmalloc.dylib /usr/local/lib/libtbbmalloc.dylib 16 sudo ln -s /usr/local/tbb-2019_U8/lib/libtbbmalloc_proxy.dylib /usr/local/lib/libtbbmalloc_proxy.dylib 17 echo 'export DYLD_LIBRARY_PATH=/usr/local/tbb-2019_U8/lib:$DYLD_LIBRARY_PATH' >> ~/.bash_profile 18 source ~/.bash_profile

At this point, you should be able to use the C++17 parallel algorithms.

Here is a small C++17 test program, that will fill a vector with 5 millions random doubles, with values from 0 to 1. Next, we sort the vector sequentially and in parallel. Each test is repeated 10 times and the elapsed time, in milliseconds, is printed on the screen:

1 #include <iostream> 2 #include <vector> 3 #include <random> 4 #include <algorithm> 5 #include <chrono> 6 #include <execution> 7 8 void printDuration ( std :: chrono :: steady_clock :: time_point start , std :: chrono :: steady_clock :: time_point end , const char * message ) { 9 auto diff = end - start ; 10 std :: cout << message << ' ' << std :: chrono :: duration < double , std :: milli > ( diff ). count () << " ms

" ; 11 } 12 13 template < typename T > 14 void test ( const T & policy , const std :: vector < double > & data , const int repeat , const char * message ) { 15 for ( int i = 0 ; i < repeat ; ++ i ) { 16 std :: vector < double > curr_data ( data ); 17 18 const auto start = std :: chrono :: steady_clock :: now (); 19 std :: sort ( policy , curr_data . begin (), curr_data . end ()); 20 const auto end = std :: chrono :: steady_clock :: now (); 21 printDuration ( start , end , message ); 22 } 23 std :: cout << '

' ; 24 } 25 26 int main () { 27 // Test samples and repeat factor 28 constexpr size_t samples { 5 ' 000 ' 000 }; 29 constexpr int repeat { 10 }; 30 31 // Fill a vector with samples numbers 32 std :: random_device rd ; 33 std :: mt19937_64 mre ( rd ()); 34 std :: uniform_real_distribution < double > urd ( 0.0 , 1.0 ); 35 36 std :: vector < double > data ( samples ); 37 for ( auto & e : data ) { 38 e = urd ( mre ); 39 } 40 41 // Sort data using different execution policies 42 std :: cout << "std::execution::seq

" ; 43 test ( std :: execution :: seq , data , repeat , "Elapsed time" ); 44 45 std :: cout << "std::execution::par

" ; 46 test ( std :: execution :: par , data , repeat , "Elapsed time" ); 47 }

You can build the code, assuming that you saved the file as t0.cpp, by using one of the next lines:

1 g++-9.1 -std=c++17 -Wall -Wextra -pedantic t0.cpp -o t0 -ltbb 2 g++-9.1 -std=c++17 -Wall -Wextra -pedantic -O2 t0.cpp -o t0_opt -ltbb

in the above, t0_opt is optimized for speed, while t0 uses the default compiler settings.

Here is what I see on my machine (macOS MacBook Air 2018):

1 ~/work $ g++-9.1 -std=c++17 -Wall -Wextra -pedantic t0.cpp -o t0 -ltbb 2 ~/work $ ./t0 3 std::execution::seq 4 Elapsed time 2446.64 ms 5 Elapsed time 2468.2 ms 6 Elapsed time 2488.72 ms 7 Elapsed time 2536.5 ms 8 Elapsed time 2571.8 ms 9 Elapsed time 2625.19 ms 10 Elapsed time 2636.66 ms 11 Elapsed time 2698.23 ms 12 Elapsed time 2726.94 ms 13 Elapsed time 2766.53 ms 14 15 std::execution::par 16 Elapsed time 1296.66 ms 17 Elapsed time 1329.46 ms 18 Elapsed time 1320.61 ms 19 Elapsed time 1357.26 ms 20 Elapsed time 1474.66 ms 21 Elapsed time 1376.8 ms 22 Elapsed time 1366.08 ms 23 Elapsed time 1372.55 ms 24 Elapsed time 1390.08 ms 25 Elapsed time 1393.47 ms 26 27 ~/work $ 28 ~/work $ g++-9.1 -std=c++17 -Wall -Wextra -pedantic -O2 t0.cpp -o t0_opt -ltbb 29 ~/work $ ./t0_opt 30 std::execution::seq 31 Elapsed time 441.889 ms 32 Elapsed time 441.553 ms 33 Elapsed time 450.44 ms 34 Elapsed time 453.351 ms 35 Elapsed time 448.148 ms 36 Elapsed time 440.537 ms 37 Elapsed time 451.15 ms 38 Elapsed time 449.112 ms 39 Elapsed time 444.223 ms 40 Elapsed time 445.78 ms 41 42 std::execution::par 43 Elapsed time 215.877 ms 44 Elapsed time 215.243 ms 45 Elapsed time 212.281 ms 46 Elapsed time 217.507 ms 47 Elapsed time 206.876 ms 48 Elapsed time 209.786 ms 49 Elapsed time 217.279 ms 50 Elapsed time 210.217 ms 51 Elapsed time 213.432 ms 52 Elapsed time 217.352 ms 53 54 ~/work $

I’ve seen similar results on a Linux machine. The parallel, optimized version, is 2 - 2.5x faster than the serial version. For the non optimized version the difference between the parallel and the sequential version is slightly larger.