

Author: “No Bugs” Hare Follow: Job Title: Sarcastic Architect Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs,

Calling a Spade a Spade, Keeping Tongue in Cheek





Disclaimer: if you’re doing parallel programming for living – please ignore this post (this stuff will be way way way too obvious for you). However, if you’re just about to believe claims that parallel programs can be achieved just by adding a few things here and there to your usual single-threaded code – make sure to read it.

In my previous post, we have observed the whole spectrum of parallel coding results – ranging from “awfully bad” to “reasonably improving”. So, the Big Fat Question(tm) we’re facing, is the following:

How parallel stuff should be coded if you’re NOT a parallel expert? 1

Do we REALLY need to parallelize?

“if we do not care for latencies - we MUST NOT parallelize.First of all, we should ask ourselves: do we really need to parallelize? The ONE AND ONLY case for parallelization exists when we have to meet certain latency requirements. In other words, if we do not care for latencies – we MUST NOT parallelize. Any parallel code will be both more error-prone, and less energy-efficient than its serial counterpart, so unless we’re tasked with making our customers unhappy – or are paid to do it by an oil-producing company – we MUST NOT parallelize unless we have proven that we cannot live without it.

TBH, even if we’re latency-bound, I still suggest taking a look at our algorithm before going parallel. There is no point in parallelizing inherently poor algorithms; for example, there is no need to parallelize our recursive calculation of Fibonacci – it is orders of magnitude better to do it iteratively and without any parallelization. In another example, before trying to parallelize calculations of big-number exponents (which BTW won’t parallelize well), you should certainly consider using Montgomery multiplication first.2

Real-world example: once upon a time, I was involved in development of (potentially parallelizable) universal hash function, which calculations were performed modulo 2^32+15; apparently, 2^32+15 can be seen as a “pseudo+Mersenne” prime (similar to pseudo-Mersenne primes which are represented as 2^N-epsilon), which allows calculations modulo 2^32+15 to be sped up significantly using methods similar to optimizations used for Mersenne and pseudo-Mersenne primes. Overall, comparing our final code to the original one (which was based on the simplistic use of Boost.Multiprecision), we got performance improvement of about 100x – that’s without any parallelization.

This is not to say that parallelization can’t possibly make sense – but rather that

Parallelization MUST be used ONLY as a last resort – when all the algorithmic improvements are exhausted

Parallelization on the Server-Side

When speaking about parallelization on the Server-Side, we should be even more careful than usual.

“seriously loaded web servers are already heavily parallelized (just by having lots of requests from different users being handled in parallel)One example of a horrible misuse of parallelization is to try parallelizing things in a naive manner when processing HTTP requests which come to a web server. Let’s imagine that we decided to parallelize such a thing, just by requesting more threads whenever we feel like it. We can parallelize to our heart’s content, and even tests on our development box will show significant improvements – but when we try to deploy such a contraption to at least somewhat-loaded web server – it will fall apart very quickly. The problem here is that seriously loaded web servers are already heavily parallelized (just by having lots of requests from different users being handled in parallel) – and any further parallelization will at best do nothing, and at worst – will create NCORES threads for each request, which can easily lead to thousands of threads running – with these threads causing TONS of completely unnecessary context switches, and consuming TONS of resources, easily bringing down your whole website because of such naive Server-Side parallelization.

This is not to say that Server-Side parallelization is inherently evil – BUT you have to be extra careful when parallelizing on the Server-Side. In particular, you MUST understand how many cores you have – and how your setup will use them. Examples when Server-Side parallelization DOES work, include:

HPC-like stuff (I don’t want to argue whether HPC qualifies as a Server-Side, but more importantly, HPC and parallelization do work very well together).

Queued web requests. If you have some of your web server requests processed in a special time-consuming manner – you MAY want to have a queue, processing all such requests. And as soon as you have such a queue – you MAY dedicate some of your cores (or whole boxes) to processing requests in such a queue, and as soon as you know this – you MAY parallelize without the risk of thrashing your whole systems with too many threads.

Two scenarios above are the most common ones, but there are, of course, other parallelization cases which do work. The most important thing is to remember to avoid creating threads on per-request basis; instead – you should make sure that the number of threads you’re using for parallelization, is related not to the number of requests, but to the number of cores you really have.

Reduce rulezz!

One observation which we can make based on results which were already seen in the previous post, tells that

reduce (in particular, std::reduce()) is a very good way to achieve parallelization (that is, IF we need it).

Reduce allows us to calculate certain associative operation over a huge set of data (with pre-filtering if necessary). In general, reduce works very well (among other things, it scales to Shared-Nothing architectures, which is why it is so popular for Big Data DBs). However, reduce is not without its quirks. We won’t go into advanced stuff here, but will restrain ourselves to three most obvious observations.

Reduce Caveat #1: Operations on floats are NOT 100% Associative

As it was noted above, parallel reduce (in particular, std::reduce()) relies on the operation to be associative. While this may seem a minor restriction, it becomes much more significant if we realize that

Operations over floats/doubles are not really associative

Nonassociativity of floating point calculation addition and multiplication of floating point numbers is not associative, as rounding errors are introduced when dissimilar-sized values are joined together— Wikipedia —Indeed, every time we’re adding two floats, we also have an implicit rounding operation after the sum-in-math-sense is calculated. Moreover, this rounding is non-linear with relation to all the arithmetic operations, which means that the result of the addition of 3 float numbers MAY depend on the order in which we’re doing it. One very practical manifestation of this property is an observation that well-known catastrophic cancellation can be often avoided by reordering additions, but it is much more general than that.

When applied to reduce, it means that

If we’re calling reduce() over floats, the end result is NOT deterministic

In practice, it is often not that bad (=”reduce is still usable”), but you still to keep in mind at least two things:

unit tests which are based on exact comparisons won’t work anymore (or worse – they might on the few first runs, but may fail at some point later).

in some relatively rare cases, it may lead your algorithm to become unstable (starting to diverge, etc.). Usually, it is just a sign of a time bomb which is ready to fire anyway, but sometimes it can happen when your algorithm is implicitly relying on certain unwritten properties (such as an order of elements in the array).

When using reduce with floats (and regardless of the programming language), we have to always keep these in mind <sad-face />.

Reduce Caveat #2: Different Semantics for Non-Commutative Calculations than std::accumulate()

The second caveat is actually C++-specific, and goes to certain assumptions which are easy to make – and which (as quite a few of assumptions out there) will lead to mortgage-crisis-size ummm… hiccups.

The story of this particular assumption unfolds as follows. When using std::accumulate(), it is common to write something along the following lines:

double add_sq(double acc, double item) { return acc + item * item; } //... double sum = std::reduce(v.begin(), v.end(),0.,add_sq);

This way, we can calculate things such as sum of squares. However,

with std::reduce() such code, while it compiles at least under MSVC, MAY occasionally provide wrong results

The thing here is that for std::reduce(), it is not one function, but four of them which we have to provide (as we’ve seen in the previous post, reduce needs to accumulate data in separate threads separately – and then needs to combine accumulators; this is where the 2nd function comes from; other two actually can be avoided, but as the almighty standard requires us to implement all four – libraries are allowed to use any of these, and unless we implement all four, we’ll be in Big Trouble(tm)).

These four overloads which we have to provide to a functor we’re feeding to std::reduce(), are the following (special thanks to Billy O’Neal from MSVC Lib Team for explaining these things and underlying mechanics to me):

struct Acc { double sum_sq; Acc() : sum_sq(0.) {} Acc(double sum_sq_) : sum_sq(sum_sq_) {} }; struct add_sq { Acc operator()(Acc acc, double item) { return Acc(acc.sum_sq + item * item); } Acc operator()(double item, Acc acc) { return Acc(acc.sum_sq + item * item); } Acc operator()(double item1, double item2) { return Acc(item1*item1 + item2*item2); } Acc operator()(Acc a, Acc b) { return Acc(a.sum_sq + b.sum_sq); } }; Acc sum = std::reduce(v.begin(), v.end(),0.,add_sq);

IF we’re specifying only a function add_sq() rather than the whole functor-with-four-overloads – we’re effectively saying that all four overloads are the same(!). This is fine for “symmetric” accumulators like simple sum,

but leads to invalid results when dealing with non-commutative accumulation function such as calculating sum of squares

Overall, I am arguing for writing four overloads above in pretty much any case (ok, we can exclude simple addition); it is when you write all four overloads and see that they’re identical – you may switch back to using single function.

NB: similar thing should be doable with std::transform_reduce(), which is probably more appropriate way of doing things; still, if going from std::accumulate() side, the way described above may feel more “natural”. As we’ll see below, it is also more generic, as it allows to avoid unnecessary multiple passes over the same container.

Reduce Caveat #3: Avoiding Multiple Passes

Another thing to keep in mind is that when shoehorning your program to use reduce(), it is easy to make a performance mistake of replacing one single pass over your collection, with two calls to reduce() – merely to calculate two different things over the same collection. Two-pass version will usually take longer (in many cases – 2x longer!) – often negating any benefit from going parallel.

Let’s consider a very simple case when you had to calculate both sum of the elements – and sum of squares of the elements – over the same container. Your original code looked as follows:

for(double item:v) { sum += item; sum_sq += item * item; }

(that is, assuming that you didn’t bend to not-so-wise voices of “hey, just run std::accumulate() twice” in the first place3).

When going parallel, it does become tempting to rewrite it as

//using add_sq from above double sum = std::reduce(v.begin(), v.end(),double(0.)); //yes, I know that 0. would suffice, // but I still prefer to be very explicit in such critical places double sum_sq = std::reduce(v.begin(), v.end(),double(0.),add_sq());

As a Big Fat Rule of Thumb(tm), this counts as a pretty bad idea – as we’ll be making two passes over the same data, at the very least we’re causing a double hit on the memory bus. However, there is a way to write the same thing as one single reduce():

struct Acc2 { double sum; double sum_sq; Acc2() : sum(0.), sum_sq(0.) {} Acc2(double sum_, double sum_sq_) : sum(sum_), sum_sq(sum_sq_) { } }; struct add_item_and_sq { Acc2 operator()(Acc2 acc, double item) { return Acc2(acc.sum+item, acc.sum_sq + item * item); } Acc2 operator()(double item, Acc acc) { return Acc2(acc.sum+item,acc.sum_sq + item * item); } Acc2 operator()(double item1, double item2) { return Acc2(item1+item2, item1*item1 + item2*item2); } Acc2 operator()(Acc2 a, Acc2 b) { return Acc2(a.sum+b.sum,a.sum_sq + b.sum_sq); } }; //... Acc2 acc = std::reduce(v.begin(), v.end(),Acc2(),add_item_and_sq());

Bingo! We got our parallel code – and in one pass too!

IMPORTANT: at least under MSVC, results of using one-pass vs two-pass calculations varied greatly: I’ve got anything from single-pass being 2x faster, to it being 2x slower, so make sure to performance-test your code. I tend to attribute it to optimization flaws in MSVC (FWIW, from my experience they tend to affect both MSVC and GCC, but not Clang – and I am going to research this phenomenon further), so there is a chance that in some distant future one-pass calculations will become faster than two-pass as they should.

Independent Per-Item Modifications Are Ok

Reduce is a very powerful mechanism – that is, IF we have to calculate something over a large collection without modifying it; in other words, modifications are out of scope for reduce()4. If we have to modify something – we MAY still do it using something like parallel for_each() – but we MUST be sure that ALL parallel modifications are going over independent elements, without ANY interaction between any two elements involved. In other words, it is perfectly ok to modify all the elements in your array as item = f(item, params), where f is an arbitrarily complex function, and params are exactly the same for all the elements. Such an approach guarantees that we can run our parallel for_each without any mutexes – and it will be correct too <smile />.

However, at the very moment when we’ll try to use any other item in our modification code – we’re badly out of luck <sad-face />. Not only demons will fly out of our noses,5 but even if we manage to place mutexes correctly – it will be extremely difficult to define the semantics of what we’re really trying to do here (and we didn’t even start to discuss performance, which is likely to be atrocious when using mutexes).

Beyond That – RTFM, RTFM, and even more RTFM

“But if you happen to need something beyond that - take a deep breath and allocate at least several months to understand how does parallelism really work under the hood.If you can limit your parallelism efforts to (a) reduce() or transform_reduce() for calculating values over constant-at-the-moment arrays, and (b) to modifications which are strictly per-item – you’ll be fine. But if you happen to need something beyond that – take a deep breath and allocate at least several months6 to understand how does parallelism really work under the hood. Just to get you started in this direction, you can read two books (make sure to read both!): Anthony William’s “C++ Concurrency in Action”, and Paul McKenney’s “Is Parallel Programming Hard, And, If So, What Can You Do About It?”. [[TODO: links]] Believe me7 – this is very likely to become the most difficult part of software engineering to master <sad-face />.

[[NOTE TO PARALLEL PROS: IF you can point out some other simplistic patterns which can be done without mutexes/atomics – please LMK, I’ll be happy to include them in my list-of-patterns-for-parallel-non-experts.]]

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.