Runtime Support for Multicore Haskell (ICFP’09) was awarded the SIGPLAN ten-year most-influential paper (MIP) award in 2019. In this blog post we reflect on the journey that led to the paper, and what has happened since.

The promise of parallel functional programming

The background for our work was the shift towards multicore computing that started in the early 2000s. In 2004 Herb Sutter published his famous blog post “The Free Lunch Is Over”, and computing research began focusing on how to exploit parallel hardware. In the Haskell realm, we thought we had a head start: after all, parallel computing requires understanding and controlling side-effects, so Haskell’s uncompromising approach to controlling side-effects in the type system should be ideal. In practice, however, there’s a lot more to realising good parallel speedup than just choosing the right language, especially if your language requires a complex runtime system: there are many opportunities for unexpected bottlenecks to sneak into the implementation.

Haskell was not new to the parallelism game. There had been earlier implementations, such as the GUM system (PLDI’96) that ran Haskell programs on top of a parallel library such as PVM. In fact, there had even been custom hardware designs developed specifically to run parallel Haskell programs (e.g. GRIP FPCA’87). These earlier systems were based on the idea of running multiple instances of the runtime system in separate processes, and even on separate machines. This design achieves good performance by eliminating bottlenecks that arise from sharing resources, and allows the parallel program to exploit clusters of machines. However, the separate-processes model has the downside that communication between parallel tasks is expensive, and the performance bottlenecks tended to center around sharing of data between parallel parts of the program. Another downside is that it isn’t suitable for running Concurrent Haskell programs, a programming model based around shared state — either in the form of shared mutable variables (MVars) or Software Transactional Memory (STM), an idea that also emerged in the early 2000s and quickly became very popular in the research community. We’ll say a bit more about STM later.

Haskell on a shared memory multiprocessor

Thus motivated, we wanted to build a parallel implementation of Haskell based on a shared-heap model, partly because this model would directly support Concurrent Haskell programs, and partly because we thought the benefits of zero-cost communication between threads and parallel tasks might outweigh the penalties for contention of shared resources (and if not, perhaps there would be ways to modify the design to relieve the contention). Other parallel, managed languages were doing similar things: Java ran multiple threads on a single runtime, for example.

The 2005 paper Haskell on a Shared-Memory Multiprocessor (Haskell’05) added a parallel runtime system to GHC, making it possible to run parallel programs directly on a multicore. At this point it was tempting to claim that we were done: there were some programs for which we could demonstrate close to perfect speedup, and for programs that didn’t speed up well at first, it was often possible to tune things to achieve better speedup. The tuning would take the form of tweaking the program (perhaps to adjust the granularity of parallel tasks) or modifying runtime system parameters such as the garbage collector settings. The need for tuning led to a question: how much tuning is it reasonable to expect the programmer to do? After all, there are always some programs that will never achieve good parallel speedup no matter how good the language implementation is: programs that don’t expose enough parallelism for the implementation to exploit, or where the granularity of parallel tasks is too small for even the most efficient scheduler, for example.

Making it run fast

We took the view that we should do more to reduce the tuning burden, and that is the story told in our ICFP’09 paper Runtime Support for Multicore Haskell. The approach we took was to choose a set of benchmark programs and, while resisting the temptation to tune the programs themselves, optimise the compiler and runtime system to achieve better parallel speedup. The programs we chose were existing parallel Haskell programs that had been developed for earlier parallel implementations of Haskell, but not tuned for the multicore implementation. We took measurements of the unmodified programs running on the base implementation before our work. Some already achieved good speedup (one benchmark was 6.3x faster on 7 cores), but others actually slowed down when run in parallel. Then we used various tools (such as ThreadScope) to profile the programs and identify the bottlenecks that were preventing good parallel speedup on these benchmarks, and set about fixing them. The improvements we implemented spanned several areas of the runtime, and included exploiting better locality in the garbage collector, using a better load-balancing strategy in the scheduler, and using a faster method for context switching, along with several other techniques.

After we were done, all of the benchmarks achieved better parallel speedup than before (see the table). The most dramatic improvements were “partree” (from 0.68 to 3.18 speedup on 7 cores) and “ray” (from 0.82 to 3.48 speedup on 7 cores).

What happened subsequently

What has happened in the ten years since Runtime Support for Multicore Haskell was published?

Improving GC Performance

In the paper’s future work section we noted that a promising direction for further performance improvements would be to build a garbage collector that could manage a separate allocation area (or nursery) on each processor core, and collect them independently. This would drastically reduce the number of times we needed to synchronise all the cores, which for some of our benchmarks was happening hundreds of times per second.

Indeed we did design and build such a garbage collector and reported on it in Multicore Garbage Collection with Local Heaps (ISMM’11). However, the implementation didn’t end up being folded into GHC proper — the added complexity was not justified by the performance gain, which in many cases could be had just by increasing the heap size. Moreover, the implementation would have complicated building an incremental or concurrent GC, needed to reduce pause times. At the time of this writing (late 2019), a new incremental GC implementation has recently been merged into GHC.

Programming Models

The paper focused on programs written using the “Algorithms + Strategies = Parallelism” approach (JFP’98), based around the idea that lazy evaluation enables the separation of program logic and parallelism. The programmer uses semantics-preserving annotations in the form of “par” and “seq” combinators, and higher-level combinators built in terms of these, to indicate to the runtime where to use parallel evaluation. While this is quite a neat trick, it turns out to be quite hard to use in practice, because the programmer needs a deep understanding of the evaluation model, something that Haskell programmers find difficult even in the absence of parallelism.
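To make the annotation style concrete, here is a minimal sketch, not taken from the paper. It uses par together with pseq (a variant of seq that guarantees evaluation order), imported from GHC.Conc in base; the parallel package’s Control.Parallel module exports the same two combinators. The names fib, parFib, parMapList and the cutoff parameter are illustrative assumptions.

```haskell
import GHC.Conc (par, pseq)

-- Naive Fibonacci, used here only as a CPU-bound workload.
fib :: Int -> Integer
fib n
  | n < 2     = toInteger n
  | otherwise = fib (n - 1) + fib (n - 2)

-- Same result as fib, but with annotations: spark one branch with par
-- while the current thread evaluates the other with pseq. Below the
-- (hypothetical) cutoff we fall back to sequential code so that each
-- spark carries a worthwhile amount of work.
parFib :: Int -> Int -> Integer
parFib cutoff n
  | n <= cutoff = fib n
  | otherwise   = x `par` (y `pseq` (x + y))
  where
    x = parFib cutoff (n - 1)
    y = parFib cutoff (n - 2)

-- A higher-level combinator built from par and pseq: spark the
-- evaluation of each element (to weak head normal form) while the
-- rest of the list is being built.
parMapList :: (a -> b) -> [a] -> [b]
parMapList _ []       = []
parMapList f (x : xs) = y `par` (ys `pseq` (y : ys))
  where
    y  = f x
    ys = parMapList f xs
```

The annotations never change the result: parFib cutoff n always equals fib n. Whether they help depends on details such as the cutoff and how far each spark evaluates its argument, which is exactly the kind of evaluation-model reasoning described above. Sparks only run in parallel when the program is built with ghc -threaded and run with +RTS -N.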

Other purely-functional parallel programming models in Haskell (such as the Par Monad) have also had mixed success, and today we can’t really say that there are any widely-used parallel programming models in Haskell that exploit purity.

In contrast, explicit concurrency in Haskell has been wildly successful (Concurrent Haskell (POPL’96) also won a ten-year most influential paper award). By “explicit concurrency” we mean explicitly-spawned lightweight threads, each able to perform I/O, synchronising with each other using MVars or Software Transactional Memory (STM). Concurrent Haskell assumes that threads are extremely lightweight (sometimes called “green threads”), so there may be tens of thousands (or many more) threads, each costing only a few hundred bytes of memory.
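As a rough illustration of this style, the sketch below spawns one lightweight thread per work item with forkIO and synchronises through MVars, using only modules from base. The function name sumWithThreads is an assumption made for the example.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar
  (modifyMVar_, newEmptyMVar, newMVar, putMVar, readMVar, takeMVar)
import Control.Monad (forM_, replicateM_)

-- Spawn n lightweight threads. Each one adds its index to a shared
-- counter held in an MVar, then signals completion on a second MVar.
sumWithThreads :: Int -> IO Int
sumWithThreads n = do
  total <- newMVar 0
  done  <- newEmptyMVar
  forM_ [1 .. n] $ \i -> forkIO $ do
    modifyMVar_ total (\t -> pure $! t + i)  -- exclusive update of the counter
    putMVar done ()                          -- tell the parent we are finished
  replicateM_ n (takeMVar done)              -- wait for all n children
  readMVar total
```

Calling sumWithThreads 10000 creates ten thousand threads, which is unremarkable for GHC’s runtime but would be expensive with one operating-system thread per task.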

We attribute this success to several factors:

Concurrency in Haskell is ridiculously easy, especially when used with an abstraction layer like the async library. For many simple tasks, achieving parallelism is as simple as calling mapConcurrently instead of mapM. Here is where purity pays off: the programmer only has to worry about code in the IO monad to determine whether mapConcurrently is safe.

Direct support for extremely lightweight concurrency makes it particularly easy to write highly asynchronous applications. In a world of heavyweight threads we are driven instead to a variety of “reactive” or “async” programming models that variously multiplex units of work onto relatively few threads.

Unlike in imperative languages (see Joe Duffy’s retrospective), Software Transactional Memory in Haskell (see Composable Memory Transactions, PPoPP’05) has proved extremely successful, and is used all over the place, including in production settings. For example, STM is used extensively in IOHK’s implementation of their blockchain protocol, and in the implementation of the widely-used async library mentioned above.

To our knowledge, Haskell is the only concurrent language to provide usable Asynchronous Exceptions (PLDI’01). This is a tremendous win in production environments; some of the benefits are documented in the blog post Asynchronous Exceptions in practice.
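The first and third factors above can be sketched together. The example assumes the async and stm packages are installed; mapConcurrently, atomically, check and the TVar operations are real exports of those packages, while fetchLength, lengthsSequential, lengthsConcurrent and transfer are hypothetical names chosen for illustration.

```haskell
import Control.Concurrent.Async (mapConcurrently)
import Control.Concurrent.STM
  (STM, TVar, atomically, check, modifyTVar', newTVarIO, readTVar, readTVarIO)

-- Stand-in for an I/O-bound task (hypothetical): count the call in a
-- shared TVar, then return a result.
fetchLength :: TVar Int -> String -> IO Int
fetchLength counter name = do
  atomically (modifyTVar' counter (+ 1))
  pure (length name)

-- Sequential version, using mapM.
lengthsSequential :: TVar Int -> [String] -> IO [Int]
lengthsSequential counter = mapM (fetchLength counter)

-- Concurrent version: the only change is mapConcurrently, which runs
-- each call in its own lightweight thread and keeps the result order.
lengthsConcurrent :: TVar Int -> [String] -> IO [Int]
lengthsConcurrent counter = mapConcurrently (fetchLength counter)

-- A composable transaction: it blocks (via check) until the source
-- balance is large enough, then moves the amount atomically.
transfer :: TVar Int -> TVar Int -> Int -> STM ()
transfer from to amount = do
  balance <- readTVar from
  check (balance >= amount)
  modifyTVar' from (subtract amount)
  modifyTVar' to (+ amount)
```

Because transfer has type STM (), it can be combined with other transactions and run as a single atomic step with atomically, for example atomically (transfer a b 30 >> transfer b c 10).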

So, while the runtime that we built and optimised in the work that culminated in this paper is largely unchanged in its architecture today, the workloads that run on it are mostly based on explicit parallelism in Concurrent Haskell (using STM) and DSLs like Haxl, rather than on implicit parallelism in purely functional programs.

Moving the scheduler into Haskell

The beating heart of Multicore Haskell is the runtime system. It is a complex beast, written in C, that implements a garbage collector, thread scheduler, software transactional memory, load balancing and work stealing, asynchronous exceptions, weak pointers, and more. An attractive idea is to move as much as possible from a hard-to-modify C runtime system into a more malleable and changeable Haskell library. A prime candidate for this is the implementation of lightweight threads, and its associated scheduler and load balancer.

There is plenty of prior art on this, and we got as far as a paper on Composable Scheduler Activations for Haskell (JFP’16). However, Haskell presents unique challenges: lazy evaluation interacts with concurrency, because a thread may become blocked on a thunk that is being evaluated by another thread. That in turn makes the job of disentangling the scheduler from the rest of the runtime a complex one. The prototype implementation had poor performance in some cases, and it remains to be seen whether a workable design can be found.

Influence

Parallel programming in Haskell is in use at scale in industry today. Probably the largest example is Facebook’s Sigma system which employs Haskell as a DSL for detecting abuse, and is running on many thousands of multicore servers in data centres around the world. The programming model in this case is Haxl, a form of Parallel Haskell particularly suited to applications that fetch data from network services, which itself is implemented using Concurrent Haskell.

In the research community, the work on the multicore GHC runtime led to some interesting developments. In Mio: a high-performance multicore IO manager for GHC, the authors made applications that perform I/O on many thousands of lightweight threads run efficiently, which led to high-performance web servers developed in Haskell. The design of multicore GHC influenced the later design of Multicore OCaml.

The work on the Par Monad was further developed by Kuper et al. to produce a programming model for quasi-deterministic parallel programming with LVars.

Although it wasn’t influenced directly by GHC, the Go language, with its Goroutines, has a design that is strikingly similar to GHC’s. It also has very lightweight threads multiplexed onto a few heavyweight threads with an M:N scheduling strategy, and has similar mechanisms for handling foreign calls.

Conclusion

Claims have long been made about the promise of pure functional programs executing efficiently on parallel hardware. In practice, it has proved difficult to achieve good wall-clock speedups from completely-implicit parallelism, with no user help to ensure decent granularity or locality. Adding annotations (such as par and seq) to parallel programs often gives useful gains, but Multicore Haskell has proved most effective for explicitly-concurrent programs, using Concurrent Haskell and its many variants, especially programs that use software transactional memory.

Multicore Haskell has been remarkably successful in practice. Its implementation, GHC, comes with parallelism that “just works” out of the box, delivers worthwhile performance gains, and is routinely used in production. It is a low-pain/decent-gain solution, a very useful point on the cost/benefit spectrum, especially in a world where multicores are ubiquitous whether you use them or not.

Author biographies

Simon Marlow is a Software Engineer at Facebook in London. Simon is a co-author of the Glasgow Haskell Compiler, author of the book “Parallel and Concurrent Programming in Haskell”, and has a string of research publications in functional programming, language design, compilers, and language implementation.

Simon Peyton Jones, FRS, is a Principal Researcher at Microsoft Research (Cambridge), where his main research interest is the design and implementation of functional programming languages. Simon has been a co-author of the Glasgow Haskell Compiler since its inception, and is chair of Computing at School, the grass-roots organisation that was at the epicentre of the 2014 reform of the English computing curriculum.

Satnam Singh is a software engineer at Google Research working on the formal verification of hardware to support secure and private computing. Satnam received his bachelor’s degree and his PhD from the University of Glasgow.
