There was a very interesting discussion on the OpenMP mailing list last week about a for loop and how to parallelize and optimize it. I will comment on the problems and solutions presented there and also have an interesting observation to add at the end of this article. Although the examples are in OpenMP, the problem and solutions are applicable to any thread-based parallel programming system.

But let’s not get ahead of ourselves and start at the very beginning. The original mail from Hicham Mouline shows the following code snippet:

```c
const double up = 1.1;
double Sn = 1000.0;
double opt[N+1];
int n;

for (n = 0; n <= N; ++n) {
    opt[n] = Sn;
    Sn *= up;
}
```

He wanted to parallelize the for loop with OpenMP, but could not because of the Sn variable, which depends on its value from the previous loop iteration. This case is so common in parallel programming that it even has a name: Loop-Carried Dependency. I would not have written about it here if that was the end of the story and there was no way to parallelize the loop, though. Closely examining it, you will notice that the dependency is breakable, and helpful as I am, I went and implemented a solution. After some testing to make sure I got the exact same results, I wanted to submit it to the mailing list, only to realize that I had been beaten by Clay Breshears (who, by the way, also maintains a fine blog about parallel programming – I have been meaning to write about that for some time and will in a future post), who had just posted a solution very similar to what I had done – don’t you just hate it when that happens ;-). This is what he came up with:

```c
const double up = 1.1;
double Sn, origSn = 1000.0;
double opt[N+1];
int n;

opt[0] = origSn;

#pragma omp parallel for private(n) lastprivate(Sn)
for (n = 1; n <= N; ++n) {
    Sn = origSn * pow(up, n);
    opt[n] = Sn;
}
Sn *= up;
```

Actually, this is not quite what he posted, as his original solution produced a different result than the original code – a slight off-by-one error had obviously slipped in (my old computer science teacher used to say: “programmers are always off by one” – and I can support that thesis from my own experience, as this happens a lot to me as well), which I took the liberty of correcting here. I bet this is why he had his solution ready faster than I did :P. The difference between his solution and mine is that he used lastprivate, while I pulled the update of Sn out of the loop entirely. It does not make a big difference either way, though, therefore I am not posting it here.

Anyways, what’s important to note here is that with a simple reorganization the loop is suddenly parallelizable and the loop-carried dependency has disappeared. Unfortunately, these dependencies are not always as easy to break as in this case. This solution has another problem, though – it uses the expensive power function in each and every iteration. A few posts later, Jakub Jelinek (from Red Hat, judging from his email address – the very helpful people who have built OpenMP support into the next gcc release [4.2]) therefore stepped up to post an even better solution, which is sketched below:

```c
const double up = 1.1;
double Sn, origSn = 1000.0;
double opt[N+1];
int n, lastn;

lastn = -2;

#pragma omp parallel for private(n) firstprivate(lastn) lastprivate(Sn)
for (n = 0; n <= N; ++n) {
    if (lastn != n - 1) {
        Sn = origSn * pow(up, n);
    } else {
        Sn *= up;
    }
    opt[n] = Sn;
    lastn = n;
}
Sn *= up;
```

A very nice solution: since OpenMP divides the iteration space into chunks, each thread has to call the expensive power function only once per chunk and can simply keep multiplying for the consecutive iterations that follow. And it even works for all kinds of schedules, although I suspect it is fastest with a static one. So, are we done now? This particular loop-carried dependency has been done away with, a solution has been posted, and the thread petered out after a few more posts. Unfortunately, I still had some doubts. I could hear the words of the wise Sanjiv ringing in my head, who had told me in his interview:

Michael: What is the worst mistake you have ever encountered in an OpenMP program?

Sanjiv: The most common mistake: parallelizing loops that are too fine-grained or consume a lot of bandwidth and contribute negatively to overall scaling. Even very experienced OpenMP programmers make this mistake. Programmers must understand the performance and scaling of every parallel region to avoid such mistakes. This is one of the worst mistakes because people are just shooting themselves in the foot.

I suspected this might be the case here. To make a not so long story short, I turned all code snippets into full programs, made N as big as possible (it won’t go much higher than 7000 on the architectures I tried this on before variables start overflowing), threw in some well-placed time-measurement functionality (only the actual loop was measured, not the supporting code around it), fired up our trusted 4-way Opteron system and let the horses run, as they say. Here are the results for 4 threads:

Clay’s version: 0.637ms

Jakub’s version: 0.330ms

Original (non-parallel) version: 0.054ms

What a bummer! I usually try to get well into the seconds when measuring performance, but unfortunately the overflows started much sooner for this example, and a millisecond is an eternity for current processors anyways, so these numbers will have to do. Since I have mentioned Sanjiv’s quote so prominently, the results should come as no surprise at this point: by parallelizing the code, we have made it an order of magnitude slower. Even Jakub’s version (which is twice as fast as Clay’s for this architecture and number of threads) is still way behind the original one.

What are the reasons here? I suspect it’s the overhead of creating the threads, coupled with the very expensive power function, which still needs to be calculated at least once per thread. But without examining the example more closely, e.g. with a parallel profiler, it’s hard to tell.

Let me summarize the moral of the story: loop-carried dependencies look like a show stopper for parallelism, but sometimes they are not and can be worked around. Unfortunately, that alone is no guarantee for good performance. Sometimes the smartest optimization there is when it comes to parallelism is not to use it at all. And this is a lesson well worth learning. I find it so useful that sometimes my students get assignments from me where no satisfactory parallel solution that is faster than the sequential one can be found. This is always very confusing for them. After all, we have told them so much about parallelism and performance, only for them to discover that it is not the cure for everything. “Life is not a lollipop”, as my colleague Bjoern likes to say in moments like these ;).