Recently we released a new version of the parallel package on Hackage, version 3.1.0.0. This synchronises the API to that described in our Haskell Symposium 2010 paper, “Seq no More: Better Strategies for Parallel Haskell“. If you don’t know what strategies are, I recommend the paper: it does have plenty of introductory material, as well as explaining the new improvements we made.

In this post I don’t want to focus on strategies, though. What has been bugging me about strategies is that, while the abstraction has some nice properties (compositionality, modularity), it does that at the expense of transparency. The problem is that it can be quite hard to reason about performance when using strategies, because you have to think quite carefully about when things are evaluated. In order to use a strategy, the input to the strategy has to be a data structure with some unevaluated components, that the strategy can exploit to create parallelism. Using laziness in this way is the key trick that allows strategies to be modular – indeed it’s an instance of the modularity insight from John Hughes’ famous “Why Functional Programming Matters” – but it can render the resulting program somewhat opaque to understanding the parallel execution.

What I plan to do in this post is introduce a simpler, more explicit, but less abstract, parallelism API that is implemented in a few lines on top of primitives provided by the parallel package. The new interface is still deterministic and pure: your parallel programs produce the same answer every time guaranteed, they don’t deadlock, and they don’t suffer from race conditions.

Let’s start by looking at one of the nice additions in the new strategies API, the Eval monad:

data Eval a = Done a instance Monad Eval where return x = Done x Done x >>= k = k x runEval :: Eval a -> a runEval (Done x) = x

I’ve included the implementation too, so you can see how simple it is. The Eval monad is just a “strict identity monad” – it doesn’t actually do anything in the usual sense of monads (there’s no state, error handling, or reading/writing), but what it does do is order things. When you write

do x <- m f x

you are guaranteed that m happens before f is called. This is all very well, but what does it mean for m to “happen”? You can’t do anything useful in this monad exept pass values around. The strategies library adds two useful primitives to the monad:

rseq :: a -> Eval a rpar :: a -> Eval a

these are what you need to describe parallel execution: rseq forces its argument to be evaluated, and rpar begins evaluation of its argument in parallel. The whole purpose of the Eval monad is to let you express an ordering between uses of rseq and rpar , so you can make them happen in the order you want. This is quite nice: we can be explicit about ordering by staying within the monad, and monads are nice and compositional. To illustrate this, when we introduce parallel Haskell we often start with an example like this:

let a = primes !! 999 b = nfib 45 in a `par` b `pseq` a + b

this is written using the “old” par and pseq operators, to say that a should be evaluated in parallel with b , and finally the result a + b should be returned. We can rewrite this using the Eval monad:

let a = primes !! 999 b = nfib 45 in runEval $ do a' <- rpar a b' <- rseq b return (a' + b')

Ok, it’s longer, but it expresses more clearly that we intend to do the following things in order:

start evaluation of a in parallel,

in parallel, evaluate b

return the result

and with monads being compositional, we can build up larger strictly-ordered computations by composing smaller parts. This was difficult with par/pseq alone, because the ordering was not explicit.

So the Eval monad takes us in a profitable direction: it allows you to be more explicit about parallel execution, and thereby enable parallel programming to be a bit less hit-and-miss, at the expense of making the programmer write more code. But as it stands, the Eval monad doesn’t go quite far enough, for two reasons:

while it’s clear where the fork point for a is, it’s not so clear where the join point is, that is the point in the ordered computation where the value of a is eventually demanded. You can use rseq to express a join point, but in the interests of being more explicit it would be better if the API forced you to write a join point and give an ordering between the joins for each parallel task.

is, it’s not so clear where the join point is, that is the point in the ordered computation where the value of is eventually demanded. You can use to express a join point, but in the interests of being more explicit it would be better if the API forced you to write a join point and give an ordering between the joins for each parallel task. the API doesn’t seem quite right for expressing nested parallel tasks: the argument to rpar is just a polymorphic value, so if we want to use the Eval monad in the parallel task we need to write another runEval (or use the dot operator of the strategies API, which essentially embeds runEval ).

So the API I’m going to propose that addresses these two issues is this:

fork :: Eval a -> Eval (Future a) join :: Future a -> Eval a

You might be familiar with this API: it’s a common parallelism abstraction, used in other languages such as Manticore and similar to what you can do in Cilk. The fork operation creates a parallel task, and join requests the result of a previously forked task. A forked task may or may not actually be evaluated in parallel – it depends on how many CPUs are actually available at runtime – but if not, then the evaluation is performed by join . Either way, the result of join is the same: since the Eval monad may not perform side effects, the API is deterministic.

It’s trival to implement this API on top of what we already have. Here you go:

data Future a = Future a fork :: Eval a -> Eval (Future a) fork a = do a' <- rpar (runEval a); return (Future a') join :: Future a -> Eval a join (Future a) = a `pseq` return a

We also need the rseq and rdeepseq operators from strategies; the point of the Eval monad is that we can say when we want things to be evaluated (in parallel).

I’ve written some examples using this API, and it works rather nicely. Here’s a snippet of an implementation of a parallel Black-Scholes implementation, that I modified from the version in the Haskell CnC distribution:

blackscholes :: Int -> Int -> Eval Float blackscholes numOptions granularity = do fs <- forM [0, granularity .. numOptions-1] $ \t -> fork (return (executeStep t granularity)) foldM (\ acc f -> do x <- join f return (acc + (x ! 0))) 0 fs

The first part, beginning forM , forks parallel tasks to evaluate each step of the computation, and the second part beginning foldM joins each parallel task, and combines the results. This implementation scales well, achieving a 7.5 speedup on the 8-core machine I’m using here, with the latest GHC HEAD (the full code is here).

The fork/join API I’ve described here is comparable in expressivity to Haskell CnC, indeed many of the Haskell CnC examples translate without too much difficulty and give similar performance. The main difference in the programming models is the join: in Haskell CnC you don’t have to thread the result of the fork to the join point, instead they both share an “Item collection” which is written by the parallel task, and read from to join.

However, this isn’t a replacement for CnC for (at least) two reasons:

Haskell CnC is based its own scheduling abstraction which allows more flexibility in choosing a good scheduler for the application. In Haskell CnC you have more control over the runtime scheduling, and there is more scope to experiment with different schedulers. The fork/join API on the other hand is built directly on top of par , and hence is tied to the GHC runtime’s scheduler for sparks.

, and hence is tied to the GHC runtime’s scheduler for sparks. fork/join is limited by the size of the spark pool, which by default is 4096, whereas Haskell CnC is not limited in this way. The limitation doesn’t affect the result, only the number of outstanding simultaneous parallel tasks. If more parallel tasks are spawned, earlier ones will be discaded.

I think where fork/join loses in flexibility and abstraction, it gains in simplicity and accessibility. If you’re learning parallelism in Haskell it might well be a good place to start.

Here’s the API in full:

module Future (Eval(..), Future, runEval, fork, join, rseq, rdeepseq) where import Control.DeepSeq import Control.Parallel import Control.Parallel.Strategies data Future a = Future a fork :: Eval a -> Eval (Future a) fork a = do a' <- rpar (runEval a); return (Future a') join :: Future a -> Eval a join (Future a) = a `pseq` return a