24th May 2011, 12:31 pm

My last few blog posts have been on the theme of scans, and particularly on parallel scans. In Composable parallel scanning, I tackled parallel scanning in a very general setting. There are five simple building blocks out of which a vast assortment of data structures can be built, namely constant (no value), identity (one value), sum, product, and composition. The post defined parallel prefix and suffix scan for each of these five "functor combinators", in terms of the same scan operation on each of the component functors. Every functor built out of this basic set thus has a parallel scan. Functors defined more conventionally can be given scan implementations simply by converting to a composition of the basic set, scanning, and then back to the original functor. Moreover, I expect this implementation could be generated automatically, similarly to GHC’s DerivingFunctor extension.

Now I’d like to show two examples of parallel scan composition in terms of binary trees, namely the top-down and bottom-up variants of perfect binary leaf trees used in previous posts. (In previous posts, I used the terms "right-folded" and "left-folded" instead of "top-down" and "bottom-up".) The resulting two algorithms are expressed nearly identically, but have differ significantly in the work performed. The top-down version does Θ ( n log n ) work, while the bottom-up version does only Θ ( n ) , and thus the latter algorithm is work-efficient, while the former is not. Moreover, with a very simple optimization, the bottom-up tree algorithm corresponds closely to Guy Blelloch’s parallel prefix scan for arrays, given in Programming parallel algorithms. I’m delighted with this result, as I had been wondering how to think about Guy’s algorithm.

Edit:

2011-05-31: Added Scan and Applicative instances for T2 and T4 .

Scanning via functor combinators

In Composable parallel scanning, we saw the Scan class:

class Scan f where

prefixScan, suffixScan ∷ Monoid m ⇒ f m → (m, f m)

Given a structure of values, the prefix and suffix scan methods generate the overall fold (of type m ), plus a structure of the same type as the input. (In contrast, the usual Haskell scanl and scanr functions on lists yield a single list with one more element than the source list. I changed the interface for generality and composability.) The post gave instances for the basic set of five functor combinators.

Most functors are not defined via the basic combinators, but as mentioned above, we can scan by conversion to and from the basic set. For convenience, encapsulate this conversion in a type class:

class EncodeF f where

type Enc f ∷ * → *

encode ∷ f a → Enc f a

decode ∷ Enc f a → f a

and define scan functions via EncodeF :

prefixScanEnc, suffixScanEnc ∷

( EncodeF f, Scan ( Enc f), Monoid m) ⇒ f m → (m, f m)

prefixScanEnc = second decode ∘ prefixScan ∘ encode

suffixScanEnc = second decode ∘ suffixScan ∘ encode

Lists

As a first example, consider

instance EncodeF [] where

type Enc [] = Const () + Id × []

encode [] = InL ( Const ())

encode (a : as ) = InR ( Id a × as )

decode ( InL ( Const ())) = []

decode ( InR ( Id a × as )) = a : as

And declare a boilerplate Scan instance via EncodeF :

instance Scan [] where

prefixScan = prefixScanEnc

suffixScan = suffixScanEnc

I haven’t checked the details, but I think with this instance, suffix scanning has okay performance, while prefix scan does quadratic work. The reason is the in the Scan instance for products, the two components are scanned independently (in parallel), and then the whole second component is adjusted for prefixScan , while the whole first component is adjusted for suffixScan . In the case of lists, the first component is the list head, and second component is the list tail.

For your reading convenience, here’s that Scan instance again:

instance ( Scan f, Scan g, Functor f, Functor g) ⇒ Scan (f × g) where

prefixScan (fa × ga) = (af ⊕ ag, fa' × ((af ⊕) <$> ga'))

where (af,fa') = prefixScan fa

(ag,ga') = prefixScan ga



suffixScan (fa × ga) = (af ⊕ ag, ((⊕ ag) <$> fa') × ga')

where (af,fa') = suffixScan fa

(ag,ga') = suffixScan ga

The lop-sidedness of the list type thus interferes with parallelization, and makes the parallel scans perform much worse than cumulative sequential scans.

Let’s next look at a more balanced type.

Binary Trees

We’ll get better parallel performance by organizing our data so that we can cheaply partition it into roughly equal pieces. Tree types allows such partitioning.

Top-down trees

We’ll try a few variations, starting with a simple binary tree.

data T1 a = L1 a | B1 ( T1 a) ( T1 a) deriving Functor

Encoding and decoding is straightforward:

instance EncodeF T1 where

type Enc T1 = Id + T1 × T1

encode ( L1 a) = InL ( Id a)

encode ( B1 s t) = InR (s × t)

decode ( InL ( Id a)) = L1 a

decode ( InR (s × t)) = B1 s t



instance Scan T1 where

prefixScan = prefixScanEnc

suffixScan = suffixScanEnc

Note that these definitions could be generated automatically from the data type definition.

For balanced trees, prefix and suffix scan divide the problem in half at each step, solve each half, and do linear work to patch up one of the two halves. Letting n be the number of elements, and W ( n ) the work, we have the recurrence W ( n ) = 2 W ( n / 2 ) + c n for some constant factor c . By the Master theorem, therefore, the work done is Θ ( n log n ) . (Use case 2, with a = b = 2 , f ( n ) = c n , and k = 0 .)

Again assuming a balanced tree, the computation dependencies have logarithmic depth, so the ideal parallel running time (assuming sufficient processors) is Θ ( log n ) . Thus we have an algorithm that is depth-efficient (modulo constant factors) but work-inefficient.

Composition

A binary tree as defined above is either a leaf or a pair of binary trees. We can make this pair-ness more explicit with a reformulation:

data T2 a = L2 a | B2 ( Pair ( T2 a)) deriving Functor

where Pair , as in Composable parallel scanning, is defined as

data Pair a = a :# a deriving Functor

or even

type Pair = Id × Id

For encoding and decoding, we could use the same representation as with T1 , but let’s instead use a more natural one for the definition of T2 :

instance EncodeF T2 where

type Enc T2 = Id + Pair ∘ T2

encode ( L2 a) = InL ( Id a)

encode ( B2 st) = InR ( O st)

decode ( InL ( Id a)) = L2 a

decode ( InR ( O st)) = B2 st

Boilerplate scanning:

instance Scan T2 where

prefixScan = prefixScanEnc

suffixScan = suffixScanEnc

for which we’ll need an applicative instance:

instance Applicative T2 where

pure = L2

L2 f <*> L2 x = L2 (f x)

B2 (fs :# gs) <*> B2 (xs :# ys) = B2 ((fs <*> xs) :# (gs <*> ys))

_ <*> _ = error "T2 (<*>): structure mismatch"

The O constructor is for functor composition.

With a small change to the tree type, we can make the composition of Pair and T more explicit:

data T3 a = L3 a | B3 (( Pair ∘ T3 ) a) deriving Functor

Then the conversion becomes even simpler, since there’s no need to add or remove O wrappers:

instance EncodeF T3 where

type Enc T3 = Id + Pair ∘ T3

encode ( L3 a) = InL ( Id a)

encode ( B3 st) = InR st

decode ( InL ( Id a)) = L3 a

decode ( InR st) = B3 st

Bottom-up trees

In the formulations above, a non-leaf tree consists of a pair of trees. I’ll call these trees "top-down", since visible pair structure begins at the top.

With a very small change, we can instead use a tree of pairs:

data T4 a = L4 a | B4 ( T4 ( Pair a)) deriving Functor

Again an applicative instance allows a standard Scan instance:

instance Scan T4 where

prefixScan = prefixScanEnc

suffixScan = suffixScanEnc



instance Applicative T4 where

pure = L4

L4 f <*> L4 x = L4 (f x)

B4 fgs <*> B4 xys = B4 (liftA2 h fgs xys)

where h (f :# g) (x :# y) = f x :# g y

_ <*> _ = error "T4 (<*>): structure mismatch"

or a more explicitly composed form:

data T5 a = L5 a | B5 (( T5 ∘ Pair ) a) deriving Functor

I’ll call these new variations "bottom-up" trees, since visible pair structure begins at the bottom. After stripping off the branch constructor, B4 , we can get at the pair-valued leaves by means of fmap , fold , or traverse (or variations). For B5 , we’d also have to strip off the O wrapper (functor composition).

Encoding is nearly the same as with top-down trees. For instance,

instance EncodeF T4 where

type Enc T4 = Id + T4 ∘ Pair

encode ( L4 a) = InL ( Id a)

encode ( B4 t) = InR ( O t)

decode ( InL ( Id a)) = L4 a

decode ( InR ( O t)) = B4 t

Scanning pairs

We’ll need to scan on the Pair functor. If we use the definition of Pair above in terms of Id and (×) , then we’ll get scanning for free. For using Pair , I find the explicit data type definition above more convenient. We can then derive a Scan instance by conversion. Start with a standard specification:

data Pair a = a :# a deriving Functor

And encode & decode explicitly:

instance EncodeF Pair where

type Enc Pair = Id × Id

encode (a :# b) = Id a × Id b

decode ( Id a × Id b) = a :# b

Then use our boilerplate Scan instance for EncodeF instances:

instance Scan Pair where

prefixScan = prefixScanEnc

suffixScan = suffixScanEnc

We’ve seen the Scan instance for (×) above. The instance for Id is very simple:

newtype Id a = Id a



instance Scan Id where

prefixScan ( Id m) = (m, Id ∅)

suffixScan = prefixScan

Given these definitions, we can calculate a more streamlined Scan instance for Pair :

prefixScan (a :# b)

≡ {- specification -}

prefixScanEnc (a :# b)

≡ {- prefixScanEnc definition -}

(second decode ∘ prefixScan ∘ encode) (a :# b)

≡ {- (∘) -}

second decode (prefixScan (encode (a :# b)))

≡ {- encode definition for Pair -}

second decode (prefixScan ( Id a × Id b))

≡ {- prefixScan definition for f × g -}

second decode

(af ⊕ ag, fa' × ((af ⊕) <$> ga'))

where (af,fa') = prefixScan ( Id a)

(ag,ga') = prefixScan ( Id b)

≡ {- Definition of second on functions -}

(af ⊕ ag, decode (fa' × ((af ⊕) <$> ga')))

where (af,fa') = prefixScan ( Id a)

(ag,ga') = prefixScan ( Id b)

≡ {- prefixScan definition for Id -}

(af ⊕ ag, decode (fa' × ((af ⊕) <$> ga')))

where (af,fa') = (a, Id ∅)

(ag,ga') = (b, Id ∅)

≡ {- substitution -}

(a ⊕ b, decode ( Id ∅ × ((a ⊕) <$> Id ∅)))

≡ {- fmap/(<$>) for Id -}

(a ⊕ b, decode ( Id ∅ × Id (a ⊕ ∅)))

≡ {- Monoid law -}

(a ⊕ b, decode ( Id ∅ × Id a))

≡ {- decode definition on Pair -}

(a ⊕ b, (∅ :# a))

Whew! And similarly for suffixScan .

Now let’s recall the Scan instance for Pair given in Composable parallel scanning:

instance Scan Pair where

prefixScan (a :# b) = (a ⊕ b, (∅ :# a))

suffixScan (a :# b) = (a ⊕ b, (b :# ∅))

Hurray! The derivation led us to the same definition. A "sufficiently smart" compiler could do this derivation automatically.

With this warm-up derivation, let’s now turn to trees.

Scanning trees

Given the tree encodings above, how does scan work? We’ll have to consult Scan instances for some of the functor combinators. The product instance is repeated above. We’ll also want the instances for sum and composition. Omitting the suffixScan definitions for brevity:

data (f + g) a = InL (f a) | InR (g a)



instance ( Scan f, Scan g) ⇒ Scan (f + g) where

prefixScan ( InL fa) = second InL (prefixScan fa)

prefixScan ( InR ga) = second InR (prefixScan ga)



newtype (g ∘ f) a = O (g (f a))



instance ( Scan g, Scan f, Functor f, Applicative g) ⇒ Scan (g ∘ f) where

prefixScan = second ( O ∘ fmap adjustL ∘ zip )

∘ assocR

∘ first prefixScan

∘ unzip

∘ fmap prefixScan

∘ unO

This last definition uses a few utility functions:

zip ∷ Applicative g ⇒ (g a, g b) → g (a,b)

zip = uncurry (liftA2 (,))



unzip ∷ Functor g ⇒ g (a,b) → (g a, g b)

unzip = fmap fst &&& fmap snd



assocR ∷ ((a,b),c) → (a,(b,c))

assocR ((a,b),c) = (a,(b,c))



adjustL ∷ ( Functor f, Monoid m) ⇒ (m, f m) → f m

adjustL (m, ms) = (m ⊕) <$> ms

Let’s consider how the Scan (g ∘ f) instance plays out for top-down vs bottom-up trees, given the functor-composition encodings above. The critical definitions:

type Enc T2 = Id + Pair ∘ T2



type Enc T4 = Id + T4 ∘ Pair

Focusing on the branch case, we have Pair ∘ T2 vs T4 ∘ Pair , so we’ll use the Scan (g ∘ f) instance either way. Let’s consider the work implied by that instance. There are two calls to prefixScan , plus a linear amount of other work. The meanings of those two calls differ, however:

For top-down trees ( T2 ), the recursive tree scans are in fmap prefixScan , mapping over the pair of trees. The first prefixScan is a pair scan and so does constant work. Since there are two recursive calls, each working on a tree of half size (assuming balance), plus linear other work, the total work Θ ( n log n ) , as explained above.

), the recursive tree scans are in , mapping over the pair of trees. The is a pair scan and so does constant work. Since there are two recursive calls, each working on a tree of half size (assuming balance), plus linear other work, the total work , as explained above. For bottom-up trees ( T4 ), there is only one recursive recursive tree scan, which appears in first prefixScan . The prefixScan in fmap prefixScan is pair scan and so does constant work but is mapped over the half-sized tree (of pairs), and so does linear work altogether. Since there only one recursive tree scan, at half size, plus linear other work, the total work is then proportional to n + n / 2 + n / 4 + … ≈ 2 n = Θ ( n ) . So we have a work-efficient algorithm!

Looking deeper

In addition to the simple analysis above of scanning over top-down and over bottom-up, let’s look in detail at what transpires and how each case can be optimized. This section may well have more detail than you’re interested in. If so, feel free to skip ahead.

Top-down

Beginning as with Pair ,

prefixScan t

≡ {- specification -}

prefixScanEnc t

≡ {- prefixScanEnc definition -}

(second decode ∘ prefixScan ∘ encode) t

≡ {- (∘) -}

second decode (prefixScan (encode t))

Take T2 , with T3 being quite similar. Now split into two cases for the two constructors of T2 . First leaf:

prefixScan ( L2 m)

≡ {- as above -}

second decode (prefixScan (encode ( L2 m)))

≡ {- encode for L2 -}

second decode (prefixScan ( InL ( Id m)))

≡ {- prefixScan for functor sum -}

second decode (second InL (prefixScan ( Id m)))

≡ {- prefixScan for Id -}

second decode (second InL (m, Id ∅))

≡ {- second for functions -}

second decode (m, InL ( Id ∅))

≡ {- second for functions -}

(m, decode ( InL ( Id ∅)))

≡ {- decode for L2 -}

(m, L2 ∅)

Then branch:

prefixScan ( B2 (s :# t))

≡ {- as above -}

second decode (prefixScan (encode ( B2 (s :# t))))

≡ {- encode for L2 -}

second decode (prefixScan ( InR ( O (s :# t))))

≡ {- prefixScan for (+) -}

second decode (second InR (prefixScan ( O (s :# t))))

≡ {- property of second -}

second (decode ∘ InR ) (prefixScan ( O (s :# t)))

Focus on the prefixScan application:

prefixScan ( O (s :# t)) =

≡ {- prefixScan for (∘) -}

( second ( O ∘ fmap adjustL ∘ zip ) ∘ assocR ∘ first prefixScan

∘ unzip ∘ fmap prefixScan ∘ unO ) ( O (s :# t))

≡ {- unO/O -}

( second ( O ∘ fmap adjustL ∘ zip ) ∘ assocR ∘ first prefixScan

∘ unzip ∘ fmap prefixScan ) (s :# t)

≡ {- fmap on Pair -}

(second ( O ∘ fmap adjustL ∘ zip ) ∘ assocR ∘ first prefixScan ∘ unzip )

(prefixScan s :# prefixScan t)

≡ {- expand prefixScan -}

(second ( O ∘ fmap adjustL ∘ zip ) ∘ assocR ∘ first prefixScan ∘ unzip )

((ms,s') :# (mt,t'))

where (ms,s') = prefixScan s

(mt,t') = prefixScan t

≡ {- unzip -}

(second ( O ∘ fmap adjustL ∘ zip ) ∘ assocR ∘ first prefixScan)

((ms :# mt), (s' :# t')) where ⋯

≡ {- first -}

(second ( O ∘ fmap adjustL ∘ zip ) ∘ assocR)

(prefixScan (ms :# mt), (s' :# t')) where ⋯

≡ {- prefixScan for Pair -}

(second ( O ∘ fmap adjustL ∘ zip ) ∘ assocR)

((ms ⊕ mt, (∅ :# ms)), (s' :# t')) where ⋯

≡ {- assocR -}

(second ( O ∘ fmap adjustL ∘ zip ))

(ms ⊕ mt, ((∅ :# ms), (s' :# t'))) where ⋯

≡ {- second -}

( ms ⊕ mt

, ( O ∘ fmap adjustL ∘ zip ) ((∅ :# ms), (s' :# t')) ) where ⋯

≡ {- zip -}

( ms ⊕ mt

, ( O ∘ fmap adjustL) ((∅,s') :# (ms,t')) ) where ⋯

≡ {- fmap for Pair -}

( ms ⊕ mt

, O (adjustL (∅,s') :# adjustL (ms,t')) ) where ⋯

≡ {- adjustL -}

( ms ⊕ mt

, O (((∅ ⊕) <$> s') :# ((ms ⊕) <$> t')) ) where ⋯

≡ {- Monoid law (left identity) -}

( ms ⊕ mt

, O (( id <$> s') :# ((ms ⊕) <$> t')) ) where ⋯

≡ {- Functor law (fmap id) -}

( ms ⊕ mt

, O (s' :# ((ms ⊕) <$> t')) )

where (ms,s') = prefixScan s

(mt,t') = prefixScan t

Continuing from above,

prefixScan ( B2 (s :# t))

≡ {- see above -}

second (decode ∘ InR ) (prefixScan ( O (s :# t)))

≡ {- prefixScan focus from above -}

second (decode ∘ InR )

( ms ⊕ mt

, O (s' :# ((ms ⊕) <$> t')) )

where (ms,s') = prefixScan s

(mt,t') = prefixScan t

≡ {- definition of second on functions -}

(ms ⊕ mt, (decode ∘ InR ) ( O (s' :# ((ms ⊕) <$> t')))) where ⋯

≡ {- (∘) -}

(ms ⊕ mt, decode ( InR ( O (s' :# ((ms ⊕) <$> t'))))) where ⋯

≡ {- decode for B2 -}

(ms ⊕ mt, B2 (s' :# ((ms ⊕) <$> t'))) where ⋯

This final form is as in Deriving parallel tree scans, changed for the new scan interface. The derivation saved some work in wrapping & unwrapping and method invocation, plus one of the two adjustment passes over the sub-trees. As explained above, this algorithm performs Θ ( n log n ) work.

I’ll leave suffixScan for you to do yourself.

Bottom-up

What happens if we switch from top-down to bottom-up binary trees? I’ll use T4 (though T5 would work as well):

data T4 a = L4 a | B4 ( T4 ( Pair a))

The leaf case is just as with T2 above, so let’s get right to branches.

prefixScan ( B4 t)

≡ {- as above -}

second decode (prefixScan (encode ( B4 t)))

≡ {- encode for L2 -}

second decode (prefixScan ( InR ( O t)))

≡ {- prefixScan for (+) -}

second decode (second InR (prefixScan ( O t)))

≡ {- property of second -}

second (decode ∘ InR ) (prefixScan ( O t))

As before, now focus on the prefixScan call.

prefixScan ( O t) =

≡ {- prefixScan for (∘) -}

( second ( O ∘ fmap adjustL ∘ zip ) ∘ assocR ∘ first prefixScan

∘ unzip ∘ fmap prefixScan ∘ unO ) ( O t)

≡ {- unO/O -}

( second ( O ∘ fmap adjustL ∘ zip ) ∘ assocR ∘ first prefixScan

∘ unzip ∘ fmap prefixScan ) t

≡ {- prefixScan on Pair (derived above) -}

(second ( O ∘ fmap adjustL ∘ zip ) ∘ assocR ∘ first prefixScan ∘ unzip )

fmap (λ (a :# b) → (a ⊕ b, (∅ :# a))) t

≡ {- unzip/fmap -}

(second ( O ∘ fmap adjustL ∘ zip ) ∘ assocR ∘ first prefixScan)

( fmap (λ (a :# b) → (a ⊕ b)) t

, fmap (λ (a :# b) → (∅ :# a)) t )

≡ {- first on functions -}

(second ( O ∘ fmap adjustL ∘ zip ) ∘ assocR)

( prefixScan ( fmap (λ (a :# b) → (a ⊕ b)) t)

, fmap (λ (a :# b) → (∅ :# a)) t )

≡ {- expand prefixScan -}

(second ( O ∘ fmap adjustL ∘ zip ) ∘ assocR)

((mp,p'), fmap (λ (a :# b) → (∅ :# a)) t)

where (mp,p') = prefixScan ( fmap (λ (a :# b) → (a ⊕ b)) t)

≡ {- assocR -}

(second ( O ∘ fmap adjustL ∘ zip ))

(mp, (p', fmap (λ (a :# b) → (∅ :# a)) t))

where ⋯

≡ {- second on functions -}

(mp, ( O ∘ fmap adjustL ∘ zip ) (p', fmap (λ (a :# b) → (∅ :# a)) t))

where ⋯

≡ {- fmap/zip/fmap -}

(mp, O (liftA2 tweak p' t))

where tweak s (a :# _) = adjustL (s, (∅ :# a))

(mp,p') = prefixScan ( fmap (λ (a :# b) → (a ⊕ b)) t)

≡ {- adjustL, then simplify -}

(mp, O (liftA2 tweak p' t))

where tweak s (a :# _) = (s :# s ⊕ a)

(mp,p') = prefixScan ( fmap (λ (a :# b) → (a ⊕ b)) t)

Now re-introduce the context of prefixScan (O t) :

prefixScan ( B4 t)

≡ {- see above -}

second (decode ∘ InR ) (prefixScan ( O t))

≡ {- see above -}

second (decode ∘ InR )

(mp, O (liftA2 tweak p' t))

where ⋯

≡ {- decode for T4 -}

(mp, B4 (liftA2 tweak p' t))

where p = fmap (λ (e :# o) → (e ⊕ o)) t

(mp,p') = prefixScan p

tweak s (e :# _) = (s :# s ⊕ e)

Notice how much this bottom-up tree scan algorithm differs from the top-down algorithm derived above. In particular, there’s only one recursive tree scan (on a half-sized tree) instead of two, plus linear additional work, for a total of Θ ( n ) work.

Guy Blelloch’s parallel scan algorithm

In Programming parallel algorithms, Guy Blelloch gives the following algorithm for parallel prefix scan, expressed in the parallel functional language NESL:

function scan(a) =

if # a ≡ 1 then [ 0 ]

else

let es = even_elts(a);

os = odd_elts(a);

ss = scan({e + o : e in es; o in os})

in interleave(ss,{s + e : s in ss; e in es})

This algorithm is nearly identical to the T4 scan algorithm above. I was very glad to find this route to Guy’s algorithm, which had been fairly mysterious to me. I mean, I could believe that the algorithm worked, but I had no idea how I might have discovered it myself. With the functor composition approach to scanning, I now see how Guy’s algorithm emerges as well as how it generalizes to other data structures.

Nested data types and parallelism

Most of the recursive algebraic data types that appear in Haskell programs are regular, meaning that the recursive instances are instantiated with the same type parameter as the containing type. For instance, a top-down tree of elements of type a is either a leaf or has two trees whose elements have that same type a . In contrast, in a bottom-up tree, the (single) recursively contained tree is over elements of type (a,a) . Such non-regular data types are called "nested". The two tree scan algorithms above suggest to me that nested data types are particularly useful for efficient parallel algorithms.

My last few blog posts have been on the theme of scans, and particularly on parallel scans. In Composable parallel scanning, I tackled parallel scanning in a very general setting....