Sparking imperatives

Here is a fun hack I came up with while working on vector and repa. One way to express parallelism in Haskell is to write x `par` y (this comes from the parallel package). This is equivalent to y but tells the compiler that it might be a good idea to start evaluating x in parallel with y. For example, this code says that the two expensive computations should be executed concurrently:

```haskell
let x = expensive computation 1
    y = expensive computation 2
in (x `par` y) `pseq` (x+y)
```

The pseq is necessary because we have to tell the compiler to evaluate x `par` y before x+y. This is all explained in Algorithm + Strategy = Parallelism.
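To make this concrete, here is a minimal, self-contained sketch (my own example, not from the paper): par and pseq are re-exported by GHC.Conc in base, so no extra packages are needed, and a deliberately naive Fibonacci stands in for the expensive computations.

```haskell
import GHC.Conc (par, pseq)

-- Deliberately slow Fibonacci, standing in for an expensive computation.
fib :: Int -> Integer
fib n | n < 2     = fromIntegral n
      | otherwise = fib (n-1) + fib (n-2)

main :: IO ()
main =
  let x = fib 30    -- expensive computation 1
      y = fib 31    -- expensive computation 2
  in (x `par` y) `pseq` print (x + y)
```

Compile with -threaded and run with +RTS -N2 to actually get the two computations onto separate cores; without that, the spark is simply evaluated when x+y demands it.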

Internally, when x `par` y is evaluated, the runtime system creates a spark for x. A spark is a bit like a thread but much cheaper. The RTS maintains a queue of sparks and evaluates them whenever it has a spare CPU or core. The scheduling algorithm is based on work stealing and is really quite sophisticated. A detailed description of the RTS is given in Runtime Support for Multicore Haskell.
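You can poke at the spark pool yourself: GHC.Conc in base exports numSparks, which reports how many sparks currently sit in the local pool. A hedged sketch (the number printed depends entirely on timing and GC, so treat it as a diagnostic, not a guarantee; the SPARKS line under +RTS -s gives the full statistics):

```haskell
import GHC.Conc (par, pseq, numSparks)

main :: IO ()
main = do
  let chunks = [ sum [1 .. n * 10000] | n <- [1 .. 100 :: Int] ]
  -- foldr par () creates one spark per list element:
  -- c1 `par` (c2 `par` (... `par` ())).
  foldr par () chunks `pseq` return ()
  n <- numSparks
  putStrLn ("sparks in the local pool: " ++ show n)
```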

Let’s try to abuse this mechanism a bit by sparking ST computations rather than pure ones. Here is how:

```haskell
parST :: ST s a -> ST s a
parST m = x `par` return x
  where
    x = runST (unsafeIOToST noDuplicate >> unsafeCoerce m)
```

First, we create a thunk x which, when evaluated, runs the ST computation m. The parallel RTS will sometimes evaluate thunks twice, which is fine for pure computations (it just duplicates work) but could be disastrous for stateful ones, since it would duplicate side effects. The call to noDuplicate (from GHC.IO) ensures that this doesn't happen. Then, we spark x and return it. Note that return is lazy and doesn't evaluate x. This is absolutely crucial.

It is quite possible to implement parST in terms of forkIO. However, sparks are much cheaper than threads (yes, even than GHC threads), which means that this implementation ought to support much more fine-grained parallelism.
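For comparison, here is what a thread-based version might look like for IO (a sketch of my own, not necessarily the implementation the author had in mind): fork the computation, and hand back a lazily deferred takeMVar so that demanding the result synchronises with the forked thread.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import System.IO.Unsafe (unsafeInterleaveIO)

-- Thread-based analogue of parST for IO: the returned value is a thunk
-- that blocks on the forked thread's result when it is first demanded.
parIO :: IO a -> IO a
parIO m = do
  v <- newEmptyMVar
  _ <- forkIO (m >>= putMVar v)
  unsafeInterleaveIO (takeMVar v)

main :: IO ()
main = do
  x <- parIO (return (2 + 2 :: Int))
  -- Demanding x synchronises with the forked thread.
  print x
```

Every call here pays for a thread and an MVar (and an exception in the forked thread would leave the consumer blocked forever), whereas a spark is just a pointer pushed onto the spark pool.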

So how do we use parST? The basic idea is to spark computations, do something else for a while and then synchronise by demanding the results:

```haskell
do x <- parST $ foo
   bar
   x `seq` return ()
```

The last line ensures that foo is executed one way or another: either in parallel with bar, or after bar when its result is demanded by seq. We can capture an instance of this pattern in a combinator:

```haskell
(|||) :: ST s () -> ST s () -> ST s ()
p ||| q = do
  u <- parST p
  q
  u `seq` return ()
```

This is quite straightforward to use. Here is a rather simple-minded version of in-place Quicksort:

```haskell
qsort :: (MVector v a, Ord a) => v s a -> ST s ()
qsort v
  | n < 2     = return ()
  | otherwise = do
      x <- unsafeRead v (n `div` 2)
      i <- unstablePartition (< x) v
      qsort (unsafeSlice 0 i v) ||| qsort (unsafeSlice (max i 1) (n-i) v)
  where
    n = length v
```

This looks just like sequential Quicksort except that the two recursive calls are potentially executed in parallel.

Interestingly, the programming model that parST gives us is very well known (e.g., as lazy threads). In particular, Cilk, which I rather like, is based on a very similar approach. It is quite amazing that we can get this with basically 2 lines of Haskell code.

This leaves two questions. Firstly: is it safe? I think (but I'm not sure) that the answer is yes for IO (we can define parIO similarly to parST) but it is definitely not safe for ST. Here is an example that produces different results depending on scheduling and compiler optimisations:

```haskell
let x = runST (do { r <- newSTRef 0; writeSTRef r 1 ||| writeSTRef r 2; readSTRef r })
    y = runST (do { r <- newSTRef 0; writeSTRef r 1 ||| writeSTRef r 2; readSTRef r })
in x == y
```

It ought to be possible to build a safe library on top of it, though.

So what about performance? Alas, I don’t have time to pursue this at the moment so I have absolutely no idea. The only real benchmark I tried is Introsort from Dan Doel’s vector-algorithms which I parallelised like Quicksort above (that is, I changed exactly one line). The chart below shows the running times for sorting 10M random elements (in seconds) vs. the number of cores on an 8-core XServe. Clearly, it does things in parallel but it doesn’t really scale. I have my suspicions as to why this is the case (I blame noDuplicate ) but I need to investigate more. It is quite encouraging, however, that the very naive parallel algorithm is barely slower than the sequential one on 1 core (1.98s vs. 2.08s).