27th November 2012, 03:39 pm

I’ve been thinking much more about parallel computation for the last couple of years, especially since starting to work at Tabula a year ago. Until getting into parallelism explicitly, I’d naïvely thought that my pure functional programming style was mostly free of sequential bias. After all, functional programming lacks the implicit accidental dependencies imposed by the imperative model. Now, however, I’m coming to see that designing parallel-friendly algorithms takes attention to minimizing the depth of the remaining, explicit data dependencies.

As an example, consider binary addition, carried out from least to most significant bit (as usual). We can immediately compute the first (least significant) bit of the result, but in order to compute the second bit, we’ll have to know whether or not a carry resulted from the first addition. More generally, the ( n + 1 ) th sum & carry require knowing the n th carry, so this algorithm does not allow parallel execution. Even if we have one processor per bit position, only one processor will be able to work at a time, due to the linear chain of dependencies.

One general technique for improving parallelism is speculation—doing more work than might be needed so that we don’t have to wait to find out exactly what will be needed. In this post, we’ll see a progression of definitions for bitwise addition. We’ll start with a linear-depth chain of carry dependencies and end with logarithmic depth. Moreover, by making careful use of abstraction, these versions will be simply different type specializations of a single polymorphic definition with an extremely terse definition.

A full adder

Let’s start with an adder for two one-bit numbers. Because of the possibility of overflow, the result will be two bits, which I’ll call “sum” and “carry”. So that we can chain these one-bit adders, we’ll also add a carry input.

addB ∷ ( Bool , Bool ) → Bool → ( Bool , Bool )

In the result, the first Bool will be the sum, and the second will be the carry. I’ve curried the carry input to make it stand out from the (other) addends.

There are a few ways to define addB in terms of logic operations. I like the following definition, as it shares a little work between sum & carry:

addB (a,b) cin = (axb ≠ cin, anb ∨ (cin ∧ axb)) where axb = a ≠ b anb = a ∧ b

I’m using (≠) on Bool for exclusive or.

A ripple carry adder

Now suppose we have not just two bits, but two sequences of bits, interpreted as binary numbers arranged from least to most significant bit. For simplicity, I’d like to assume that these sequences to have the same length, so rather than taking a pair of bit lists, let’s take a list of bit pairs:

add ∷ [( Bool , Bool )] → Bool → ([ Bool ], Bool )

To implement add , traverse the list of bit pairs, threading the carries:

add [] c = ([] , c) add (p : ps) c = (s : ss, c'') where (s ,c' ) = addB p c (ss,c'') = add ps c'

State

This add definition contains a familiar pattern. The carry values act as a sort of state that gets updated in a linear (non-branching) way. The State monad captures this pattern of computation:

newtype State s a = State (s → (a,s))

By using State and its Monad instance, we can shorten our add definition. First we’ll need a new full adder definition, tweaked for State :

addB ∷ ( Bool , Bool ) → State Bool Bool addB (a,b) = do cin ← get put (anb ∨ cin ∧ axb) return (axb ≠ cin) where anb = a ∧ b axb = a ≠ b

And then the multi-bit adder:

add ∷ [( Bool , Bool )] → State Bool [ Bool ] add [] = return [] add (p : ps) = do s ← addB p ss ← add ps return (s : ss)

We don’t really need the Monad interface to define add . The simpler and more general Applicative interface suffices:

add [] = pure [] add (p : ps) = liftA2 ( : ) (addB p) (add ps)

This pattern also looks familiar. Oh — the Traversable instance for lists makes for a very compact definition:

add = traverse addB

Wow. The definition is now so simple that it doesn’t depend on the specific choice of lists. To find out the most general type add can have (with this definition), remove the type signature, turn off the monomorphism restriction, and see what GHCi has to say:

add ∷ Traversable t ⇒ t ( Bool , Bool ) → State Bool (t Bool )

This constraint is very lenient. Traversable can be derived automatically for all algebraic data types, including nested/non-regular ones.

For instance,

data Tree a = Leaf a | Branch ( Tree a) ( Tree a) deriving ( Functor , Foldable , Traversable )

We can now specialize this general add back to lists:

addLS ∷ [( Bool , Bool )] → State Bool [ Bool ] addLS = add

We can also specialize for trees:

addTS ∷ Tree ( Bool , Bool ) → State Bool ( Tree Bool ) addTS = add

Or for depth-typed perfect trees (e.g., as described in From tries to trees):

addTnS ∷ IsNat n ⇒ T n ( Bool , Bool ) → State Bool ( T n Bool ) addTnS = add

Binary trees are often better than lists for parallelism, because they allow quick recursive splitting and joining. In the case of ripple adders, we don’t really get parallelism, however, because of the single-threaded (linear) nature of State . Can we get around this unfortunate linearization?

Speculation

The linearity of carry propagation interferes with parallel execution even when using a tree representation. The problem is that each addB (full adder) invocation must access the carry out from the previous (immediately less significant) bit position and so must wait for that carry to be computed. Since each bit addition must wait for the previous one to finish, we get linear running time, even with unlimited parallel processing available. If we didn’t have to wait for carries, we could instead get logarithmic running time using the tree representation, since subtrees could be added in parallel.

A way out of this dilemma is to speculatively compute the bit sums for both possibilities, i.e., for carry and no carry. We’ll do more work, but much less waiting.

State memoization

Recall the State definition:

newtype State s a = State (s → (a,s))

Rather than using a function of s , let’s use a table indexed by s . Since s is Bool in our use, a table is simply a uniform pair, so we could replace State Bool a with the following:

newtype BoolStateTable a = BST ((a, Bool ), (a, Bool ))

Exercise: define Functor , Applicative , and Monad instances for BoolStateTable .

Rather than defining such a specialized type, let’s stand back and consider what’s going on. We’re replacing a function by an isomorphic data type. This replacement is exactly what memoization is about. So let’s define a general memoizing state monad:

newtype StateTrie s a = StateTrie (s ⇰ (a,s))

Note that the definition of memoizing state is nearly identical to State . I’ve simply replaced “ → ” by “ ⇰ ”, i.e., memo tries. For the (simple) source code of StateTrie , see the github project. (Poking around on Hackage, I just found monad-memo, which looks related.)

The full-adder function addB is restricted to State , but unnecessarily so. The most general type is inferred as

addB ∷ MonadState Bool m ⇒ ( Bool , Bool ) → m Bool

where the MonadState class comes from the mtl package.

With the type-generalized addB , we get a more general type for add as well:

add ∷ ( Traversable t, Applicative m, MonadState Bool m) ⇒ t ( Bool , Bool ) → m (t Bool ) add = traverse addB

Now we can specialize add to work with memoized state:

addLM ∷ [( Bool , Bool )] → StateTrie Bool [ Bool ] addLM = add addTM ∷ Tree ( Bool , Bool ) → StateTrie Bool ( Tree Bool ) addTM = add

What have we done?

The essential tricks in this post are to (a) boost parallelism by speculative evaluation (an old idea) and (b) express speculation as memoization (new, to me at least). The technique wins for binary addition thanks to the small number of possible states, which then makes memoization (full speculation) affordable.

I’m not suggesting that the code above has impressive parallel execution when compiled under GHC. Perhaps it could with some par and pseq annotations. I haven’t tried. This exploration helps me understand a little of the space of hardware-oriented algorithms.

The conditional sum adder looks quite similar to the development above. It has the twist, however, of speculating carries on blocks of a few bits rather than single bits. It’s astonishingly easy to adapt the development above for such a hybrid scheme, forming traversable structures of sequences of bits:

addH ∷ Tree [( Bool , Bool )] → StateTrie Bool ( Tree [ Bool ]) addH = traverse (fromState ∘ add)

I’m using the adapter fromState so that the inner list additions will use State while the outer tree additions will use StateTrie , thanks to type inference. This adapter memoizes and rewraps the state transition function:

fromState ∷ HasTrie s ⇒ State s a → StateTrie s a fromState = StateTrie ∘ trie ∘ runState

I’ve been thinking much more about parallel computation for the last couple of years, especially since starting to work at Tabula a year ago. Until getting into parallelism explicitly, I’d...