

Author: “No Bugs” Hare
Job Title: Sarcastic Architect
Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek





Disclaimer: if you’re doing parallel programming for a living – please ignore this post (this stuff will be way way way too obvious for you). However, if you’re just about to believe claims that parallel programs can be achieved just by adding a few things here and there to your usual single-threaded code – make sure to read it.

In my previous post, we observed pretty bad results for calculations as we tried to use mutexes and even atomics to do things in parallel. OTOH, it was promised to show how parallel <algorithm> CAN be used both correctly and efficiently (that is, IF you need it, which is a separate question); this is what we’ll start discussing within this post.

Parallel Code Which Made It 90x Slower on 8-Core Box

As was mentioned in the previous post, the code below – while being correct – was observed to be REALLY REALLY slow (90x slower than the original, that’s while using all 8 cores instead of one):

```cpp
//DON'T USE: DEADLY SLOW
template<class Iterator>
size_t par_calc_sum_mutex(Iterator begin, Iterator end) {
  size_t x = 0;
  std::mutex m;
  std::for_each(std::execution::par, begin, end,
    [&](int item) {
      std::lock_guard<std::mutex> guard(m);
      x += item;
    });
  return x;
}
```

And BTW, while replacing the mutex with an atomic did improve things, it was still 50x slower than the original serialized code.

The Big Fat Overhead Problem

The problem we’re facing here is actually a problem of overhead. It is just that the cost of a thread context switch (which is likely to happen pretty often as we’re trying to obtain lock_guard) is sooo huge,1 that the addition (which costs about 1 CPU cycle2) happens to be about zero-cost compared to the costs of synchronization. That’s exactly what happens here – and in this extreme case, we have this generic problem amplified sooooo badly that ALL the calculations are actually serialized by this mutex (so effectively, there is no parallelism at all3).

Tackling The Overhead Manually

Let’s see what we can do to speed things up; note that it is NOT intended to be code-for-real-world-use – but rather code-to-provide-some-“feeling”-of-what-is-going-on. While performance-wise it is a huge improvement over the previous code, compared to the alternatives-which-we’ll-see-later it is certainly NOT the best option.

```cpp
//DON'T USE: UGLY, ERROR-PRONE, AND A-BIT-SLOWER-THAN-OPTIMAL
size_t TestFuncManualParSum::run(RandomAccessIterator begin, RandomAccessIterator end) {
  std::atomic<size_t> x = 0;
  constexpr int NCHUNKS = 128;
  assert( (end-begin) % NCHUNKS == 0 );//too lazy to handle it for sample code
  RandomAccessIterator starts[NCHUNKS];
  size_t sz = (end - begin) / NCHUNKS;
  for (int i = 0; i < NCHUNKS; ++i) {
    starts[i] = begin + sz * i;
    assert(starts[i] < end);
  }
  std::for_each(std::execution::par, starts, starts + NCHUNKS,
    [&](RandomAccessIterator start) {
      size_t y = 0;
      for (auto it = start; it < start + sz; ++it) {
        y += *it;//NO synchronization here
      }
      x += y;//synchronization
    });
  return x;
}
```

Associative Property: Within an expression containing two or more occurrences in a row of the same associative operator, the order in which the operations are performed does not matter as long as the sequence of the operands is not changed.

— Wikipedia —

The basic idea here is to avoid synchronization on each and every element; instead – we’re splitting our array into 128 chunks – and running for_each over these large chunks (and within each chunk – we’re calculating a partial sum in a local variable y, avoiding any synchronization until the very end of the chunk).4 Let’s note that in the code above, we’re actually relying on the sum being associative to achieve the speed improvement (we DO change the order of additions compared to sequential code).

If we run this code – we’ll see that it is ACTUALLY FASTER than the sequential one (in my measurements – see also the table below – it was by a factor of almost 2x on an 8-core box, while using all 8 cores).

This is a HUGE improvement over our previous 90x-slower-than-sequential code – AND is the first time we’ve got it faster than sequential.



Enter std::reduce()

Well, while we did manage to improve speed by using multiple cores in our previous code <phew />, it is NOT what we really want to write in our app-level code (that’s to put it mildly). Fortunately, there is a function which will do pretty much the same thing (actually, even a bit better) for us: it is std::reduce().

The point of std::reduce() is that it – just like our code above – exploits the fact that the operation-which-is-used-for-reduce (default is ‘+’) is associative.5 Then, the whole thing we wrote above can be written as follows (it is MUCH less ugly, MUCH less error-prone, and a tiny bit more efficient):

```cpp
//DO USE
size_t TestFuncReduceParSum::run(Iterator begin, Iterator end) {
  return std::reduce(std::execution::par, begin, end, (size_t)0);
  //NB: (size_t)0 specifies not only an initial value, but also the TYPE of the accumulator
  //    which ensures that accumulation is done in terms of size_t, and NOT in terms of *begin
  //    This, in turn, is necessary to keep our calculation EXACTLY the same as our previous ones
}
```

Interestingly enough, when I ran the code with std::reduce(), I observed that it is merely 2-3% faster6 than the manual parallelization we used above. While it is certainly not a proof, it is an indication that the methods used within std::reduce() are conceptually similar to the stuff-we-wrote-ourselves above.7 This approximation, in turn, can be useful to “feel” what we can realistically expect from std::reduce(). After all, there is NO magic involved, and everything-that-happens-in-the-world can (and IMO SHOULD) be explained at least at a very high level of abstraction (and BTW, if you have a better explanation – without going into “we’re experts, just trust us to do things right” – please LMK).

Per-Element Mutations

In terms of std::reduce() we can express LOTS of parallelizable things (a bit more on related trickery in the next post), but all of them are inherently limited to our array-being-processed being unmodified in the process (in other words – std::reduce() can be seen as a “pure” function). However, as in C++ we’re NOT confined to the world of “pure” functions such as std::reduce(), we MAY consider certain mutation operations over our large collection. Still, having learned from our experience with mutexes, we WON’T try to modify more-than-one-single-element of our collection in our parallel algorithm. And as long as all our mutations are limited to only one element of our array – we are good to go without mutexes / atomics / etc. <phew />:

```cpp
void TestFuncForEachParModify::run(Iterator begin, Iterator end) {
  std::for_each(std::execution::par, begin, end,
    [](int& item) {
      item += 2;//modifies ONLY the current element
    });
}
```

In general, this is The Right Way(tm) of doing things in parallel; however – in this particular case of ultra-simple ops (which happen to make our app consume all the memory bandwidth, at least on my box) – the parallel version still happens to be slower than the sequential one(!).8 This is NOT to say that parallel mutations are always slower than serial ones – but rather that while with some experience it MIGHT be possible to predict which parallelization WON’T work (one such example is our mutex-based calculation above) – it is next-to-impossible to predict which parallelization WILL work for sure; hence – whatever we think about our parallelization, it is necessary to test its performance (and under conditions which are as close as possible to real-world ones).

Results

“Pure” (mutation-free) Calculation

| Method | Cores Used | Wall-Clock Time (us) | CPU Time (us) | Wall-Clock Compared to Sequential | Power Consumption Compared to Sequential (guesstimate) |
|---|---|---|---|---|---|
| sequential (for) | 1 | 2960±20 | 2960 | 1x | 1x |
| sequential (for_each) | 1 | 4480±50 [9] | 4480 | 1.5x slower | 1.5x higher |
| std::accumulate() [10] | 1 | 2940±30 | 2940 | 1x | 1x |
| std::reduce(seq) | 1 | 2960±60 | 2960 | 1x | 1x |
| naive std::mutex [11] | 8 | 201’000±4000 | 1’600’000 | 70x slower | 300x higher |
| naive std::atomic [11] | 8 | 68’600±20 | 550’000 | 25x slower | 100x higher |
| manual chunking | 8 | 1620±50 | 13000 | 1.82x faster | 2.5x higher |
| std::reduce(par) | 8 | 1580±50 | 12600 | 1.87x faster | 2.5x higher |

Mutation

| Method | Cores Used | Wall-Clock Time (us) | CPU Time (us) | Wall-Clock Compared to Sequential | Power Consumption Compared to Sequential (guesstimate) |
|---|---|---|---|---|---|
| sequential (for) | 1 | 3310±20 | 3310 | 1x | 1x |
| sequential (for_each) | 1 | 3300±4 | 3300 | 1x | 1x |
| sequential (in-place transform) | 1 | 4510±30 | 4510 | 1.36x slower [12] | 1.3x higher |
| manual chunking | 8 | 3540±90 | 28300 | 1.07x slower | 4x higher |
| for_each(par) | 8 | 3520±100 | 28100 | 1.06x slower | 4x higher |

Hey fellas, don’t be jealous

When they made him they broke the mould

So charismatic with an automatic

Never prematurely shooting his load — 'Man for All Seasons' song from Johnny English film —

My observations from the table above:

All sequential methods are equivalent (all the differences are within measurement error). The only exceptions are sequential for_each() and in-place transform() – but hopefully these are merely a flaw in MSVC’s optimizing-out of lambdas; OTOH, the for_each() case does highlight the risks of using lambdas in performance-critical code even in 2018(!).

Out of the parallel methods – those which serialize too often (“naive” versions above) fail BADLY performance-wise.

Even when parallelization does speed things up, it comes at the cost of a significantly larger CPU-time (and hence power consumption) footprint. Which in turn means that – as long as serial code provides good-enough response times from the end-user perspective – DON’T PARALLELIZE; you’ll do a favour BOTH to your user (she will be able to use the cores for something else, AND will pay less in electricity bills), AND to the planet. Or from a bit different perspective – out of all the premature optimizations, premature parallelization is the most premature one.

Mutexes are evil;13 even atomics are to be avoided as long as possible.

Even IF we’re mutex-free, parallelization MAY slow things down. So after you found the bottleneck and parallelized – make sure to re-test performance – and under close-to-real-world conditions too.

FWIW, the effects of our manual parallelization look very close to those of std::reduce(); this is NOT to argue for doing things manually – but to get a very-simplified idea of how std::reduce() does its magic internally.


Intermediate Conclusions and To Be Continued

In the previous post, we took a look at “how NOT to write parallel programs” (which is apparently where LOTS of parallel programming tutorials will lead you). In this one, we have taken a look at “how efficient parallel programs CAN be written”. In the next (and hopefully last-in-this-mini-series) post, we’ll discuss some guidelines-and-hints for NOVICE parallel programmers, which MAY allow speeding up their programs here and there, without the risk of jeopardizing program correctness – and without the risk of slowing things down by 50x too. Sure, it won’t be the-very-best optimization – but it MAY bring SOME benefits to the table (and it is rather low-risk too, so as a whole it MAY fly).

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.