

Author: “No Bugs” Hare
Job Title: Sarcastic Architect
Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek





Ubi nihil vales, ibi nihil velis

Where you are worth nothing, there you should want nothing — Arnold Geulincx, XVII century —

Disclaimer: if you’re doing parallel programming for a living – please ignore this post (this stuff will be way way way too obvious for you). However, if you’re just about to believe claims that parallel programming can be made easy-as-pie – make sure to read it.

With C++17 supporting parallel versions of the std:: algorithms, there are quite a few people saying “hey, it became really simple to write parallel code!”.

Just as one example, [MSDN] wrote: “Only a few years ago, writing parallel code in C++ was a domain of the experts.” (implying that these days, to write parallel code, you don’t need to be an expert anymore).

I always had my extremely strong suspicions about this position being deadly wrong, and recently I made an experiment which demonstrates the Big Fat Dangers(tm) of implying that parallelization can be made as simple as just adding a policy parameter to your std:: call.

Task at Hand

Let’s consider the following very simple scenario: we have to calculate the sum of the elements in a million-element integer array. The code we’re starting from is as simple as

template<class Iterator>
size_t seq_calc_sum(Iterator begin, Iterator end) {
  size_t x = 0;
  std::for_each(begin, end, [&](int item) {
    x += item;
  });
  return x;
}

When running this code on my pretty old Windows box (compiled with MSVC VS2017 in Release mode), it takes about 450us.



Adding parallelism: Take 1 (badly broken)

First, we have to note that simply adding std::execution::par to our std::for_each() call will NOT work. If we try to write something like

//DON'T USE: WILL CAUSE WRONG RESULT
template<class Iterator>
size_t par_calc_sum_deadly_broken(Iterator begin, Iterator end) {
  size_t x = 0;
  std::for_each(std::execution::par, begin, end, [&](int item) {
    x += item;//data race; often causes wrong calculation result(!)
  });
  return x;
}

– it will compile and run, but we’ll easily get a wrong result (in my experiments with a million-element array, the result was wrong each and every time, but YMMV, which only makes things worse <sad-face />).

Adding parallelism: Take 2 (90x performance hit)

IF we were observant enough to notice this problem – and found a neat recommendation in [CppReference] – we’ll realize that in addition to specifying std::execution::par, we also have to use a std::mutex (or std::atomic) to make our program correct.

Ok, so our next (still very naive BTW) version would be along the following lines:

//DON'T USE: DEADLY SLOW
template<class Iterator>
size_t par_calc_sum_mutex(Iterator begin, Iterator end) {
  size_t x = 0;
  std::mutex m;
  std::for_each(std::execution::par, begin, end, [&](int item) {
    std::lock_guard<std::mutex> guard(m);
    x += item;
  });
  return x;
}

This does work correctly – and if we take a look at taskman, we’ll see that it DOES use all the cores (4 physical x 2-due-to-HT = 8 logical ones in my case). And if we didn’t measure the performance against the sequential version, we could think that everything is good here. Not so fast <sad-face />.

Measurements have shown that the function above takes about 40 milliseconds of wall-clock time; so instead of the expected 4x-8x speedup, we got about a 90x slowdown compared to the sequential version.

Adding parallelism: Take 3. Atomics to the rescue? (still 50x performance hit)

To make things even worse, the code above is written strictly along the lines recommended in [CppReference] (actually, it is almost exactly the same code). BTW, if you have doubts and want to run it yourself – the code is available at [demo].

Ok, as a next step we could think “hey, mutexes are bad for performance, so we should use atomics instead”. So, we rewrite it as

//DON'T USE: DEADLY SLOW
template<class Iterator>
size_t par_calc_sum_atomic(Iterator begin, Iterator end) {
  std::atomic<size_t> x = 0;
  std::for_each(std::execution::par, begin, end, [&](int item) {
    x += item;//changing it to x.fetch_add(item) doesn't change anything - nor should it
  });
  return x;
}

Well, it is still correct, AND (as expected) it is faster than our previous mutex-based version. The problem is that



It is still 50x slower than the sequential one

Oh, and BTW – replacing std::execution::par with std::execution::par_unseq (with an assertion on x.is_lock_free() to prevent potential deadlocks due to vectorization) didn’t make it any better.



Results

| Box | non-parallelized | std::execution::par with std::mutex | std::execution::par with std::atomic | std::execution::par_unseq with std::atomic |
|---|---|---|---|---|
| #1 (4 physical, 8 logical cores) | 470±4us | 41200±900us (90x slower, 600x+ less power-efficient) | 23400±140us (50x slower, 300x+ less power-efficient) | 23400±140us (50x slower, 300x+ less power-efficient) |
| #2 (2 physical, 4 logical cores) | 900±150us | 52500±6000us (60x slower, 200x+ less power-efficient) | 25100±4500us (30x slower, 100x+ less power-efficient) | 21800±2800us (25x slower, 100x+ less power-efficient) |

As we can see, not only is our naive parallel code hopelessly inefficient, greatly increasing the CO2 footprint for absolutely no reason,



it also punishes users with more powerful boxes

(more strictly, it seems that the more cores we have – the more penalty we get; this will start making sense in the second part of this mini-series).

Intermediate Conclusions and To Be Continued

As we have seen from the results above, naive attempts to make our code parallel (while having no clue about the way parallel code works) can easily cause HUGE problems (ranging from wrong results and even crashes, to having even correct programs slow down by factors of 50x-90x).

In other words (arguing with the quote from [MSDN] cited in the very beginning of this post):



Writing parallel code in C++ is still a domain of the experts.

OTOH, the point of the exercise above is NOT to say that it is impossible to write efficient code with parallel std:: functions (it is possible). However, to do it – it is still necessary to understand what we’re doing. An explanation of what-is-wrong-with-the-code-above is planned for the next post of mine (spoiler: it is all about overhead, which happens to be waaay too high in the code above).

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.