Notes on fusion

I've been writing about some Haskell black magic lately, and I'm sorry to say that continues here: we're going to implement a couple of different sorts of fusion.

Fusion is one of those words that made me recoil for quite a few years, since it sounds extremely complicated and arcane. A few years into building applications in Haskell, I still hadn't forced the thunk.

Fusion powers libraries Haskell developers use all day, like vector, conduit, pipes, and text. It's clearly a valuable technique that library authors should understand, and that application programmers can blissfully ignore.

This is a post in which I find that it's not so scary, though there is a lot of overwhelming material out there.

Disclaimer

As per usual, I am writing to formalise my own notes and understanding.

These notes are even rougher than usual.

I have compiled a list of superior resources on this topic. I recommend you consult these before taking my notes too seriously.

Please do get in touch if I'm extremely wrong about something, but don't be a pedant.

Fusion you've been usin'

When you first pick up Haskell, it's common to spend some time implementing the list functions from the Prelude. For instance, the canonical implementation of my favourite function, list append, uses plain old recursion:

(++) :: [a] -> [a] -> [a]
(++) []       ys = ys
(++) (x : xs) ys = x : xs ++ ys

That little function, the first I ever wrote, is in the Prelude verbatim. This is great! Little general combinators are right at home in our standard library.

If you look a little closer at the Prelude's source code, however, some terrifying magic sits alongside our good friend (++):

{-# NOINLINE [1] (++) #-}    -- We want the RULE to fire first.

{-# RULES
"++" [~1] forall xs ys. xs ++ ys = augment (\c n -> foldr c n xs) ys
  #-}

When optimisations are enabled, a rewrite rule swaps out our simple implementation for a completely different one, built from foldr and augment. There's also a NOINLINE pragma for the (++) symbol. The integer subscripts and the tilde are key to understanding what GHC will do here:

GHC's simplifier runs in phases, which decrease from N to 0

(++) will not be inlined until phase 1. It may be inlined during phase 1 or 0.

The rewrite rule is encouraged to fire up until phase 1. It will not fire in phase 1 or 0.

In sum, we make quite an effort to replace the list append operator with a fold. Rewrite rules are subject to GHC heuristics, and are not guaranteed to fire. At the last second, if it has not been replaced with an inline fold, (++) is treated like an ordinary function again.
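To make the phase dance concrete, here's the same pattern applied to a toy append in a standalone module. Everything here is my own sketch: myAppend is a made-up name, and I substitute a plain foldr for base's augment to stay self-contained.

```haskell
module Main where

-- Naive recursive append, just like (++).
myAppend :: [a] -> [a] -> [a]
myAppend []       ys = ys
myAppend (x : xs) ys = x : myAppend xs ys

-- Don't inline until phase 1, so the rule gets first crack.
{-# NOINLINE [1] myAppend #-}

-- Up until phase 1, rewrite calls into a fold instead.
{-# RULES
"myAppend" [~1] forall xs ys.
  myAppend xs ys = foldr (:) ys xs
  #-}

main :: IO ()
main = print (myAppend [1, 2, 3 :: Int] [4, 5])
```

Compile with -O and -ddump-rule-firings to watch the rule fire; with -O0 the naive definition simply survives.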

OK, so GHC has been programmed to favour a different implementation of the function when optimisations are enabled. It does this for most of the list functions defined in base. Why?

A forest

Function composition is very intuitive. Every composition saves us from reimplementing features. Unfortunately, naive function composition chains can lead to inefficiency: each function must produce intermediate values to pass to the next.

Let's choose a function chain to benchmark, and compile with -O0. We should be able to demonstrate excessive allocation, especially when compared against -O2:

main :: IO ()
main = print expr

expr :: [Int]
expr = take 100000 . map (+ 1) . filter (\x -> x `mod` 2 == 0) $ [1 .. 1000000]

$ ghc -rtsopts -prof -O0 Plain.hs
$ ./Plain +RTS -pa
$ head Plain.prof
<snip>
	total time  =  0.58 secs   (578 ticks @ 1000 us, 1 processor)
	total alloc =  78,136,000 bytes  (excludes profiling overheads)

$ ghc -rtsopts -prof -O2 Plain.hs
$ ./Plain +RTS -pa
$ head Plain.prof
<snip>
	total time  =  0.56 secs   (563 ticks @ 1000 us, 1 processor)
	total alloc =  39,735,832 bytes  (excludes profiling overheads)

Optimisations are cutting our allocation load in half, though the time isn't affected much, since this is a simple linear-time program.

This is a pretty coarse metric, and it's hard to know whether this change is entirely thanks to short-cut fusion. Before I run off to decipher the Core, let's redefine all the functions we're using, so we can use -O2 without the rewrite rules in question:

import Prelude hiding (map, take, filter)

main :: IO ()
main = print expr

expr :: [Int]
expr = take 100000 . map (+ 1) . filter (\x -> x `mod` 2 == 0) $ [1 .. 1000000]

map :: (a -> b) -> [a] -> [b]
map _ []       = []
map f (x : xs) = f x : map f xs

take :: Int -> [a] -> [a]
take n _ | n <= 0 = []
take _ []         = []
take n (x : xs)   = x : take (n - 1) xs

filter :: (a -> Bool) -> [a] -> [a]
filter _pred [] = []
filter pred (x : xs)
  | pred x    = x : filter pred xs
  | otherwise = filter pred xs

$ ghc -rtsopts -prof -O2 Plain.hs
$ ./Plain +RTS -pa
$ head Plain.prof
<snip>
	total time  =  0.56 secs   (561 ticks @ 1000 us, 1 processor)
	total alloc =  67,736,504 bytes  (excludes profiling overheads)

Better than -O0, but with roughly 28MB more allocation than the -O2 run with the Prelude's fusion rules in play.

Looking at the Core, our recursive functions pasted from the Haskell Report have not been inlined. GHC is relatively powerless to optimise naively recursive functions. Inlining recursive definitions is usually a bad idea, since it will either fail to terminate or replace a finite loop with a huge unrolled function. If we can't inline, we can't take full advantage of GHC's wonderful optimisations. This compels us toward more interesting techniques.

Short-cut fusion

The fusion scheme used in base for list functions is known as short-cut fusion. It originated in the 1993 paper A Short Cut to Deforestation and can be found written up on pretty much every Haskell-inclined blog in the world.

foldr/build

This technique involves rewriting all our list consumer functions in terms of foldr, rewriting all our list producers in terms of a dual combinator called build, and letting GHC rewrite foldr/build chains into smaller functions via a known identity.

foldr :: (a -> b -> b) -> b -> [a] -> b
foldr f z []       = z
foldr f z (x : xs) = f x (foldr f z xs)

build :: ((a -> [a] -> [a]) -> [a] -> c) -> c
build g = g (:) []

It's usually straightforward to rewrite our consumer functions in terms of foldr:

map f xs    = foldr (\a b -> f a : b) [] xs
filter f xs = foldr (\a b -> if f a then a : b else b) [] xs

... though take is a little messy, so I'll elide it and point you at GHC.List.

If you look carefully at the type signature, the first argument to build is a specialised right fold. build takes this fold function and pushes cons through it:

> build foldr [1 .. 10]
[1,2,3,4,5,6,7,8,9,10]

In other words, build finalises a suspended fold. If we see a build on the right-hand side of a foldr, we can eliminate the foldr entirely, perpetuating the more interesting suspended one instead. This forms an identity that is expressed in GHC as a rewrite rule:

{-# RULES
"my foldr/build" forall g k z.
  foldr k z (build g) = g k z
  #-}
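To see the identity earn its keep, here's the rule applied by hand to a sum over a made-up producer. upto is my own stand-in for a build-wrapped enumerator; both sides compute the same number, but the right-hand side never allocates a list.

```haskell
module Main where

build :: ((a -> [a] -> [a]) -> [a] -> c) -> c
build g = g (:) []

-- A producer: the numbers c..d as a suspended right fold.
upto :: Int -> Int -> (Int -> r -> r) -> r -> r
upto c d k n = if c > d then n else k c (upto (c + 1) d k n)

lhs, rhs :: Int
lhs = foldr (+) 0 (build (upto 1 10)) -- what we write
rhs = upto 1 10 (+) 0                 -- what the rule leaves behind

main :: IO ()
main = print (lhs, rhs) -- both 55
```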

It is also helpful to reimplement the enumeration function, from, in terms of build. from takes two bounds, and produces a specialised fold, wrapped in a build. It is the simplest list producer. Note that it is defined recursively.

from :: Int -> Int -> [Int]
from a b = build (from' a b)
  where
    from' c d = \k n ->
      if c > d
        then n
        else k c (from' (c + 1) d k n)
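As a quick sanity check, this from agrees with the Prelude's enumeration (the definitions are repeated here so the block stands alone):

```haskell
module Main where

build :: ((a -> [a] -> [a]) -> [a] -> c) -> c
build g = g (:) []

-- The enumerator from the post: bounds in, suspended fold out.
from :: Int -> Int -> [Int]
from a b = build (from' a b)
  where
    from' c d = \k n -> if c > d then n else k c (from' (c + 1) d k n)

main :: IO ()
main = print (from 1 5 == [1 .. 5] && null (from 3 1))
```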

The last part of the puzzle is inlining. If our foldr-based functions don't get inlined, the rewrite rule cannot be applied, since rewriting is entirely syntactic.

{-# INLINE map #-}
{-# INLINE filter #-}

Now, our chain of list functions is inlined into a chain of foldr, terminated by a build. The rewrite rule is able to eliminate all but the outer foldr. We have a single loop body that GHC is free to optimise, and we do not perform redundant allocation.
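One wrinkle I've glossed over: for a whole chain to fuse, functions in the middle like map have to be good producers as well as good consumers, so their foldr bodies get wrapped in a build too. A sketch of the idea, modelled loosely on base's mapFB (the wrapping is the point; GHC's actual definition differs in detail):

```haskell
module Main where

import Prelude hiding (map)

build :: ((a -> [a] -> [a]) -> [a] -> c) -> c
build g = g (:) []

-- map as a good consumer (foldr over its input) and a good
-- producer (build around its output), so fusion can continue
-- both upstream and downstream of it.
map :: (a -> b) -> [a] -> [b]
map f xs = build (\k n -> foldr (\a b -> k (f a) b) n xs)

main :: IO ()
main = print (map (+ 1) [1, 2, 3 :: Int])
```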

Rules are based on heuristics, so they don't always fire. If short-cut fusion does not take place, these foldr-based functions are likely to be worse than the naive recursive versions. This is one motivation for the sneaky rewrite-based redefinition of functions like (++): we get to keep the naive implementation when things don't go our way.

destroy/unfoldr

foldr/build was not the last word in fusion. It proved very practical and is still in use, but it suits some functions better than others.

It was very unintuitive to implement take in terms of foldr and build. We didn't try to implement zip, but rest assured it is both difficult and disappointing: intermediate lists still get produced.

To this end, an alternative scheme for short-cut fusion was devised. This scheme is dual to foldr/build, and is known as destroy/unfoldr. It is implemented via similar rewrite mechanisms.

From Shortcut Fusion for Accumulating Parameters and Zip-like Functions by Svenningsson:

unfoldr :: (b -> Maybe (a, b)) -> b -> [a]
unfoldr f b = case f b of
  Nothing      -> []
  Just (a, b') -> a : unfoldr f b'

destroy :: (forall a. (a -> Maybe (b, a)) -> a -> c) -> [b] -> c
destroy g xs = g listpsi xs
  where
    listpsi :: [a] -> Maybe (a, [a])
    listpsi []       = Nothing
    listpsi (x : xs) = Just (x, xs)

If you squint at the type signatures of unfoldr and destroy, you will see a familiar thing at play: destroy expects a generalised unfold as its first argument, much like build expected a specialised fold. It takes that suspended unfold and applies it to the least interesting unfold possible, the list unravel:

> destroy unfoldr [1 .. 10]
[1,2,3,4,5,6,7,8,9,10]

The identity we use to eliminate redundant destroy/unfoldr pairs will also look a little familiar, though inverted:

{-# RULES
"my destroy/unfoldr" forall g psi e.
  destroy g (unfoldr psi e) = g psi e
  #-}

... and functions implemented via these combinators look a little bit vexing:

from n m = unfoldr (\i -> if i > m then Nothing else Just (i, succ i)) n

zip xs ys =
  destroy (\psi1 e1 ->
    destroy (\psi2 e2 ->
      unfoldr (zipDU psi1 psi2) (e1, e2)) ys) xs
  where
    zipDU psi1 psi2 (e1, e2) =
      case psi1 e1 of
        Nothing      -> Nothing
        Just (x, xs) ->
          case psi2 e2 of
            Nothing      -> Nothing
            Just (y, ys) -> Just ((x, y), (xs, ys))

Though that implementation of zip looks complicated, the destroy/unfoldr rewrite rules do their thing, and GHC is able to eliminate the Maybe entirely. Skim the paper; it's concise.
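To convince myself the scheme actually computes the right thing, here's a self-contained version of the paper's zip. step is my renaming of zipDU, and I use Maybe's Monad instance to flatten the nested case analysis:

```haskell
{-# LANGUAGE RankNTypes #-}

module Main where

import Prelude hiding (zip)

unfoldr :: (b -> Maybe (a, b)) -> b -> [a]
unfoldr f b = case f b of
  Nothing      -> []
  Just (a, b') -> a : unfoldr f b'

destroy :: (forall a. (a -> Maybe (b, a)) -> a -> c) -> [b] -> c
destroy g xs = g listpsi xs
  where
    listpsi :: [a] -> Maybe (a, [a])
    listpsi []       = Nothing
    listpsi (y : ys) = Just (y, ys)

zip :: [a] -> [b] -> [(a, b)]
zip xs ys =
  destroy (\psi1 e1 ->
    destroy (\psi2 e2 ->
      unfoldr (step psi1 psi2) (e1, e2)) ys) xs
  where
    -- Thread both seeds, stopping when either side runs dry.
    step psi1 psi2 (e1, e2) = do
      (x, xs') <- psi1 e1
      (y, ys') <- psi2 e2
      pure ((x, y), (xs', ys'))

main :: IO ()
main = print (zip [1, 2, 3 :: Int] "ab")
```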

GHC did not adopt destroy/unfoldr.

I am extremely unqualified to elaborate on why GHC uses foldr/build over unfoldr/destroy , or stream fusion for that matter. I would love to read a folklore piece to that effect.

Church me to take

Given that all the loop-fusion frameworks we come across seem to be based on either folds or unfolds, it seems inevitable that there's some more general theoretical framework tying it all together.

There's a lot of work on the correctness of short-cut fusion. Of course, I have read none of it, so I cannot point you in the right direction. I'm so tired, friends. I can't fuse any more streams today.

I did, however, quite enjoy reading Thomas Harper and Ralf Hinze's treatment. In Theory and Practice of Fusion, foldr/build and destroy/unfoldr are presented as dual Church and co-Church encodings in a uniform setting.

Stream fusion

Stream fusion is a newer framework. It emerged from libraries like bytestring and vector about ten years ago. It's easy enough to implement on your own; I just wore myself out trying to produce nice canned examples. I recommend reading From Lists to Streams to Nothing at All, which lays it all out pretty clearly.

I had originally set out to write about stream fusion today, since it is so easy to implement. I've decided instead to talk about short-cut fusion, and sit on the stream fusion post until I properly explain how it generalises destroy/unfoldr. It does!

Roman Leshchinskiy has a nice article on that front.

Powerful

We've fused loops in Haskell before! We've rolled our folds together, and we've mashed up our fmaps. However, fusion is very different to runtime constructions like Yoneda and Foldl. There's no runtime overhead, the results can be much more dramatic, and the user does not need to know about it.

Compile-time fusion is a form of metaprogramming. We use GHC directives to control the way functions are substituted and inlined. We are free to subvert our own function definitions, replacing them with equivalent representations that compose well and can be optimised down to nothing at all.

RULES and NOINLINE pragmas can be used in any Haskell module. Very aggressive optimisation schemes can be pursued by library authors without any need to poke around inside GHC. As with our original (++), any non-optimising Haskell compiler can ignore the pragmas as comments and still end up with a functional program.

Doing this well requires intimate knowledge of GHC's optimisation passes, though there is significant prior art to learn from, especially in the field of streams.

There is also risk and frustration in this approach, due to its reliance on GHC heuristics. Users may hold your library incorrectly, producing code that GHC can't optimise for complex reasons. Failure to fuse in a performance-sensitive library may be a catastrophic outcome. A more explicit metaprogramming approach may be necessary for extreme applications.

Superior resources

Primary sources

Blog posts and conjecture