The Message Passing Interface, or MPI (not to be confused with the Max Planck Institutes), is in the peculiar situation of being one of the most widely used technologies in HPC and supercomputing, despite having been declared dead for decades. Lately, however, my nose has been picking up some smells that trouble me. And others.

tl;dr: MPI is still standing strong, but its inflexibility is giving away its age. Contenders are flexing their muscles.

To make one thing clear: I'm not jumping on the "MPI is dead" bandwagon here. My point of view is that MPI will continue to be the standard interface for moving data on HPC machines, at least for a very long time. MPI has outlived so many other technologies: it was here before InfiniBand, before the multi-core revolution, before the ubiquity of accelerators... and it just won't die. Instead, it continues to adapt:

Modern implementations can move data directly from GPU RAM to the NIC via PCIe peer-to-peer transfers.

Shared memory communication on multi-core machines is, thanks to tricks like cross memory attach (CMA), efficient (it saves redundant copies) and blazingly fast. Performance-wise, this leaves hybrid codes, which combine OpenMP and MPI, with a very thin potential margin.

Thanks to the plugin-based architecture of most implementations, newer interconnects are easily supported. Often vendors will even provide a custom implementation of the ibverbs library (even if the new interconnect is not related to InfiniBand -- Cray did this for Gemini IIRC), thereby removing the need to adapt MPI altogether.

Anyway. Thanks to a sequence of lucky coincidences, I had the pleasure of attending the 2014 Mardi Gras Conference earlier this year. It was organized by CCT at LSU, Baton Rouge, LA. Pavan Balaji from Argonne was giving a talk there on MPI for exascale. Pavan is deeply involved in MPI in general, but also in the MPICH project -- just the guy to ask about the guts of MPI.

Supported Thread-Levels

My first question was related to why so many MPI implementations are struggling to support MPI_THREAD_MULTIPLE well. Having multiple threads in one MPI process can still be valuable, despite CMA and such, as these may harness the CPU's shared caches -- especially useful for memory bound codes.

As it turned out, the strategy most vendors chose to implement in this case was to wrap all MPI calls in one big lock. The more threads hammer MPI, the more this lock will hurt you -- contention is bad. But these locks add overhead even if just one thread ends up calling MPI. There are smarter ways of going about this: don't use one giant lock, but several, each for a smaller scope. And we know lock-free algorithms, which are especially useful for managing queues of transfers. And much more. Given that multi-core CPUs aren't exactly new, I'm surprised that this is still not resolved.

Interestingly, this also explains why most MPI variants have good support for MPI_THREAD_SERIALIZED and MPI_THREAD_FUNNELED. The currently accepted workaround for users is to funnel all MPI calls through one single thread. The application developer needs to take care not to overburden this single thread. Oh, and of course the user who's submitting the job also needs to take care of starting a sufficient number of processes per node, if nodes have lots of cores. And the sysadmin needs to provide presets for facilitating different allocation and pinning schemes. It's a pain.

Beyond the 2 GB Barrier

If you dig around the MPICH and Open MPI user mailing lists, you will encounter multiple posts complaining about not being able to send/receive more than 2 GB en bloc. 2 GB is not much, considering that our machines at last year's Student Cluster Competition came with 128 GB of RAM per node. One source of these errors is genuine bugs in the MPI implementation (e.g. using an int to track a message size, instead of a long). These are easily fixed.
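To make the "int instead of a long" class of bug concrete, here is a small sketch of what happens to a byte count once it no longer fits into a 32-bit signed integer. The helper names `as_c_int` and `message_bytes` are made up for this illustration, and the wrap-around shown is what a typical two's-complement platform does; it is not taken from any particular MPI implementation.

```python
import ctypes

INT_MAX = 2**31 - 1  # largest value of a 32-bit signed C int


def as_c_int(value):
    """Return what a 32-bit signed int holds after storing `value`.

    ctypes performs no overflow checking, so out-of-range values wrap
    around, as they do on common two's-complement platforms.
    """
    return ctypes.c_int32(value).value


def message_bytes(count, element_size):
    """Exact message size in bytes (Python ints do not overflow)."""
    return count * element_size


if __name__ == "__main__":
    # 2^27 doubles are 1 GiB, three times that is 3 GiB.
    for count in (2**27, 3 * 2**27):
        exact = message_bytes(count, 8)
        stored = as_c_int(exact)
        status = "ok" if stored == exact else "WRAPPED"
        print(count, exact, stored, status)
```

A 3 GiB message ends up with a negative size, which is the kind of value that later surfaces as a truncated transfer or an obscure error far away from the actual cause.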

The other source is apparently harder to get rid of: the MPI standard mandates that the number of items to be sent is specified as a C int. And that locks you down to at most 2^31 - 1 elements. If you're sending chars, you've just lost. It's not much better to be limited to 16 GB when sending doubles, though. Yes, you'll rarely send so many doubles for now. But it creates nasty glitches in user code, which are hard to hunt down.

Now, the textbook solution would be to optionally replace int by size_t. But this would require all MPI functions to be specified twice in the standard: once for ints, once for size_t. A huge, but rather simple and mechanical change to the standard. I don't know why, but according to Pavan, this solution was instantly rejected by the MPI Standard Committee. Another solution would be to use derived datatypes to pack elements into batches, e.g. instead of sending 2^35 doubles, I could send them in 2^25 batches of 2^10 doubles each. So convenient! Not. Finally, one could imagine creating a meta-MPI-implementation which is not part of the standard, but only provides 64-bit enabled variants of the MPI API. Internally this meta-implementation would map all calls onto the original 32-bit API and make sure buffer sizes etc. are set correctly. Sounds like a huge PITA, and a giant waste of time? Well, according to Pavan, work on this is already in progress. The name of the project escapes me though.

Update: the project is called BigMPI. At the time of writing the last commit was on 2013.09.24.

Update^2: Jeff Hammond, the author of BigMPI, got back to me to let me know that the project was rather meant as a tutorial to show users how to skirt this current limitation of MPI. He did however hint at the possibility of developing a fully fledged wrapper library at a later point in time.

C++ Bindings Removed from MPI-3

...with the rationale being that the original bindings were basically a fig leaf on top of the original C code, and no one was using them anyway. So, no one is using code which isn't going to benefit them anyway? Color me impressed. Today folks are flocking to Boost.MPI, and rightfully so. Boost.MPI brings many features direly missing in vanilla MPI (e.g. support for STL types). If you ask me: Boost.MPI is what the C++ bindings of MPI should have been. This goes to prove that it is possible to bridge from MPI to C++ well.

Asynchronous vs. Non-blocking Communication

A lot of users think of MPI_Isend/MPI_Irecv and friends as asynchronous counterparts of MPI_Send/MPI_Recv etc. Implementers generally call them non-blocking, and for a good reason: often MPI will make progress (e.g. actually send the data) only while you're blocking in a call to MPI_Wait, or similar. The reason for this is simple: even if the interconnect supports RDMA and bus mastering, MPI still needs to provide it with new addresses, move memory to pinned pages and so on and so on. It's complicated. Still, asynchronous communication can be hugely advantageous, especially in strong scaling setups. So, how to achieve asynchronous progress?

Regularly ping MPI, e.g. via MPI_Test(). The frequency of these calls needs to be carefully chosen though. Too few calls, and MPI won't have enough cycles to make good progress. Too many calls and you'll incur overhead. You might need to determine optimum parameters not just for every new machine, but even for every problem size.

A pacemaker: some architectures, e.g. IBM's Blue Gene/Q, come with a core dedicated to MPI pacing. That's nice. But what if you're st(r)uck with a machine that doesn't?

Victim threads, e.g. nemesis engine: an elegant solution, which comes at a price. Considering that your company just spent a gigantic sum on procuring a new machine, people from accounting might not be super happy if you told them that you're wasting 10% of the cores on just waiting for communication.
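Of the three options above, the polling approach is the only one that lives entirely in user code, so here is a minimal sketch of it. Everything in it is an assumption for illustration: `FakeRequest` is a stand-in that only advances when poked (mimicking a library without asynchronous progress), and `compute_with_progress` and `poll_every` are names I made up. mpi4py's request objects expose `Test()` and `Wait()` methods with the same names, so the loop should carry over, but I have not run it against a real MPI here.

```python
class FakeRequest:
    """Hypothetical stand-in for a non-blocking request.

    It makes progress only inside Test() or Wait() and completes after
    `polls_needed` calls to Test().
    """

    def __init__(self, polls_needed):
        self.polls_left = polls_needed
        self.test_calls = 0
        self.waited = False

    def Test(self):
        self.test_calls += 1
        if self.polls_left > 0:
            self.polls_left -= 1
        return self.polls_left == 0

    def Wait(self):
        self.waited = True
        self.polls_left = 0
        return True


def compute_with_progress(request, work_items, do_work, poll_every):
    """Process work_items, poking the request every `poll_every` items.

    Returns True if the transfer completed while we were computing, and
    False if we had to block in Wait() at the end.
    """
    done = False
    for i, item in enumerate(work_items, start=1):
        do_work(item)
        if not done and i % poll_every == 0:
            done = request.Test()
    if not done:
        request.Wait()
        return False
    return True


if __name__ == "__main__":
    for poll_every in (1, 10, 100):
        req = FakeRequest(polls_needed=20)
        overlapped = compute_with_progress(
            req, range(200), lambda item: None, poll_every)
        print(poll_every, req.test_calls, overlapped)
```

The single `poll_every` knob is exactly the parameter that has to be re-tuned per machine and per problem size: set it too high and the transfer is still pending when the computation ends, set it too low and the loop spends its time in `Test()`.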

Pavan's opinion regarding these issues was: it's not MPI's task to make writing any parallel program easy. It's about making writing trivial programs easy and writing hugely complex programs feasible. He said it was his opinion that no end user (i.e. domain scientist, e.g. a physicist writing a new simulation code) should ever touch MPI. They should use computational libraries, which are easier to use and will deliver crucial performance optimizations. A surprising and interesting point of view. One I can sympathize with. After all, that's why I'm working on LibGeoDecomp. This raises the question, though, of whether there is a way to express parallelism in a generic, yet user-friendly way.

Feature Regressions in Open MPI

From time to time RFCs pop up on the Open MPI devel list. These are used to discuss potentially disruptive changes to the code base with the larger developer community. Usually they're concerned with adding new features, but sometimes they also deal with cleaning up code or removing outdated, unmaintained code. That's fine. Open MPI is an active research project, and as members join and leave the project, portions of the code which are not actively used may be orphaned.

A while ago one of my colleagues, Adrian Knoth, added IPv6 support to Open MPI. Sounds like a trivial change, right? After all, IPv6 is like IPv4, just with longer addresses, right? Well, no. Today IPv6 support is disabled by default, as it has been broken for five years and no one is maintaining it.

Recently IBM has committed to opening their Power chips to collaborators. Simultaneously, Open MPI developers are discussing whether support for heterogeneous runs should be removed. MPI's hugely complicated type system is usually motivated by stating that if MPI could understand the structure of the data being sent, then it could translate between different architectures. If it can't do this anyway, why bother with defining MPI datatypes?

The Gist of It

None of the issues I've presented are catastrophic. Cleaning up code and removing rotting passages is part of a healthy software engineering process. And yet it stinks:

MPI is not becoming easier to use, but harder. The voodoo dance an ordinary user has to complete to max out e.g. perfectly ordinary two-socket, 16-core nodes is inane: polling MPI for asynchronous progress, using a custom locking regime to funnel MPI calls into one thread, and packing data into arbitrary chunks to skirt the limitations of 32-bit ints.

Previously usable and useful features are being removed, sometimes confusing, sometimes even alienating users.

Trivial changes (e.g. the use of size_t) seem next to impossible to implement.
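As a footnote to the chunk-packing part of that dance: the bookkeeping itself is simple arithmetic, which is what makes it so annoying that every user has to redo it. The sketch below only plans the split; `plan_chunked_send` is a hypothetical helper of mine, not part of MPI or BigMPI. In real code I would expect the chunk to become a contiguous derived datatype (MPI_Type_contiguous) and the remainder to go out as a second, ordinary message, but that mapping is an assumption and not shown here.

```python
INT_MAX = 2**31 - 1  # largest count a C int argument can carry


def plan_chunked_send(total_elements, chunk_elements):
    """Split a large transfer into full chunks plus a remainder.

    Returns (num_chunks, remainder): `num_chunks` chunks of
    `chunk_elements` elements each, followed by `remainder` single
    elements. Both the chunk size and the chunk count must fit into an int.
    """
    if not 0 < chunk_elements <= INT_MAX:
        raise ValueError("chunk size must fit into a C int")
    num_chunks, remainder = divmod(total_elements, chunk_elements)
    if num_chunks > INT_MAX:
        raise ValueError("chunk count overflows a C int; use larger chunks")
    return num_chunks, remainder


if __name__ == "__main__":
    # 2^35 doubles in batches of 2^10 give 2^25 batches and no remainder.
    print(plan_chunked_send(2**35, 2**10))
    # A count that is not a multiple of the chunk size leaves a tail.
    print(plan_chunked_send(2**35 + 5, 2**10))
```

Note the second failure mode: pick the chunk size too small and the chunk count itself no longer fits into an int, so the user now has two limits to keep track of instead of one.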