A co-worker of mine had a random thought on IRC:

Is data type “unrolling” a valid performance optimization? Used anywhere? Something like: data List a = Nil | Cons a {-# Unroll 10 #-} (List a)

I thought, that sounds like the vec package I wrote. One module there uses the GADT definition

data Vec ( n :: Nat ) a where ) a VNil :: Vec 'Z a (:::) :: a -> Vec n a -> Vec ( 'S n) a n an) a

and a clever way to define functions so GHC unrolls them. Another module uses the data family

data family Vec ( n :: Nat ) a ) a data instance Vec 'Z a = VNil data instance Vec ( 'S n) a = a ::: ! ( Vec n a) n) an a)

which in some cases optimizes better than the GADT version. However in neither version we can {-# UNPACK #-} the tail. GHC simply says

Ignoring unusable UNPACK pragma on the second argument of ‘:::’

It’s not “obvious” to the compiler that tail is a single constructor data type: technically it’s not even a single constructor in the GADT; and it’s not clear in data family case either, because of the degenerate Vec ('S Any) case. But a single contructor data type is a requirement for UNPACK to work.

There’s however another trick we can use: Backpack.

Backpack is a system for retrofitting Haskell with an applicative, mix-in module system. In short, we are able to write more efficient monomorphic code. For example, the description of unpacked-containers says:

This backpack mixin package supplies unpacked sets and maps exploiting backpack’s ability to unpack through signatures.

Unpacking is exactly what we want to do.

Backpack

Backpack is an underdocumented feature of GHC and Cabal. There are Edward Yang’s blog posts, his PhD thesis (which I only scanned through), and in addition to what’s in the source code, that’s all. We won’t let that stop us from experimenting!

The idea of Backpack is that you can write modules which are parametrised by other modules.

Short glossary:

Module is what you think it is.

is what you think it is. Signature is a “type signature of module”.

is a “type signature of module”. Unit is a collection of modules and signatures. Think library .

is a collection of modules and signatures. Think . Units can be open , contain “holes” (specified by signatures) to be filled by other modules. See OpenUnitId in Cabal docs.

, contain “holes” (specified by signatures) to be filled by other modules. See in docs. Mixin is a unit with “inputs” and “outputs” possibly renamed.

One advice I can give: give types and signatures different names (i.e. avoid signature Str where data Str ), until you are comfortable with Backpack concepts. That’s similar to giving your types and constructors different names (i.e. avoid data Unit = Unit ). Rather say:

Sig where signature data S module Impl where data S = MkS

Let’s write the Backpack version of Vec next …

Nil

List-like structure have two cases: nil and cons. We’ll start with a simple one: nil. We create an ordinary Haskell-module in an internal library (unit) of the package:

library v0 exposed-modules: V0

And the Haskell source is below:

module V0 where data Vec a = Nil deriving ( Eq , Ord , Show , Functor , Foldable , Traversable ) instance Applicative Vec where pure _ = Nil _ <*> _ = Nil

Nil is essentially a new name for Proxy , if you are familiar with that.

Cons

The cons-case is a recursive case. The cons unit requires a module, and exposes a module. In other words it’s a mixin.

I suggestively named the signature of the required input module In , and exported output module Out .

library cons signatures: In exposed-modules: Out

Signature In is like a module without any implementations or definitions. The module Out is not different from an ordinary Haskell module. We import In and can use whatever is defined by its signature. To be used for vectors, we are careful to implement Out in such a way that it itself satisfies the In signature again, but that fact is not checked until the later instantiation.

In.hsig

The In signature specifies an API that we expect the tail of the vector, and therefore the whole vector to satisfy. In this example it as simple as specifying:

the type of vectors, Vec ,

, a list of instances that Vec should satisfy.

In where signature data Vec a instance Eq a => Eq ( Vec a) a) instance Ord a => Ord ( Vec a) a) instance Show a => Show ( Vec a) a) instance Functor Vec instance Foldable Vec instance Traversable Vec instance Applicative Vec

Out.hs

The output of the mixin is implementation of cons on vectors.

Note: at this stage nothing checks that Out implements In signature. The compatibility checks are done at instantiation phase. This reminds me of C++ templates, which are properly checked only when instantiated too.

Also note that we use UNPACK on the tail field !(In.Vec a) . This actually works, as we’ll see later.

module Out where import qualified In data Vec a = ! a ::: {-# UNPACK #-} ! ( In.Vec a) a) deriving ( Eq , Ord , Functor , Foldable , Traversable ) infixr 5 ::: instance Applicative Vec where pure x = x ::: pure x f ::: g <*> x ::: y = f x ::: (g <*> y) f x(gy)

We also write a Show instance manually for a bit prettier output:

instance Show a => Show ( Vec a) where a) showsPrec d (x ::: xs) = showParen (d > 5 ) d (xxs)(d $ showsPrec 6 x . showString " ::: " . showsPrec 5 xs xs

Non-empty vectors: v1, v2, v3

Now it’s time to instantiate the cons mixin. We already have a v0 vector. We can instantiate cons with V0 to get V1 :

library v1 mixins: cons (Out as V1) requires (In as V0) reexported-modules: V1 build-depends: cons, v0

The v1 library depends on v0 (one smaller vector), and cons . The mixins field says:

take cons

rename its exported module from Out to V1

to and instantiate the required signature In to V0 .

Then, the Cabal Backpack magic will connect the dots, as v0 exposes V0 .

And finally, we make V1 available by re-exporting it, using reexported-modules .

Note: for larger vectors, we require no more Haskell code to be written; we are only connecting existing units.

While larger vectors follow the same pattern, we need to pre-instantiate them manually: they cannot be created at the run-time.

library v2 mixins: cons (Out as V2) requires (In as V1) reexported-modules: V2 build-depends: cons, v1 library v3 mixins: cons (Out as V3) requires (In as V2) reexported-modules: V3 build-depends: cons, v2

Example: Some calculations

Before an actual example, we can define a library which will re-export all the modules we defined:

library build-depends: v0, v1, v2, v3 reexported-modules: V0, V1, V2, V3

Finally we can write a program that will use all these vectors:

executable main main-is: Main.hs build-depends: , backpack-vector , base

Main.hs

First the necessary imports:

module Main (main) where (main) import qualified V0 import qualified V1 import qualified V2 import qualified V3 import Control.Applicative (liftA2) (liftA2) import Data.List (mapAccumL) (mapAccumL)

Next we define a few values. We use pure and mapAccumL . Using actual constructors would be even less pretty, as we would have to use qualified names.

v0 :: V0.Vec Double = pure 0.0 v0 v1 :: V1.Vec Double = pure 1.0 v1 v2 :: V2.Vec Double = pure 2.0 v2 -- 0, 1, 2 v3 :: V3.Vec Double = snd $ mapAccumL (

_ -> ( succ n, n)) 0 ( pure ()) v3mapAccumL (

_n, n))())

main :: IO () () = do main print v0 v0 print v1 v1 print v2 v2 print $ liftA2 ( * ) v3 v3 liftA2 () v3 v3

Running this will print

Nil 1.0 ::: Nil 2.0 ::: 2.0 ::: Nil 0.0 ::: 1.0 ::: 4.0 ::: Nil

If we dump the Core of this module, we’ll see bindings like

-- RHS size: {terms: 4, types: 0, coercions: 0, joins: 0/0} main11 main11 = $w$sshowSignedFloat $fShowDouble2 $fShowVec3 0.0## -- RHS size: {terms: 4, types: 0, coercions: 0, joins: 0/0} main10 main10 = $w$sshowSignedFloat $fShowDouble2 $fShowVec3 1.0## -- RHS size: {terms: 4, types: 0, coercions: 0, joins: 0/0} main9 main9 = $w$sshowSignedFloat $fShowDouble2 $fShowVec3 4.0##

A strong hint that computations are unrolled, and perfomed at compile-time.

Excursion: Prettier syntax

In example above we had to jump through a hoop to define an interesting (= non-uniform value) V3.Vec . It’s not nice.

One solution is to use pattern synonyms. Pattern synonyms cannot be defined in a type-class, but the helper functions could. Here’s a type-class:

class ConsPattern v t | v -> t, t -> v where v tt, t match :: v a -> (a, t a) v a(a, t a) build :: a -> t a -> v a t av a

Then we can add an instance of this class to each cons case. Note, we shouldn’t require the instance in the signature In (think why!)

instance ConsPattern Vec In.Vec where = ( ::: ) build ::: xs) = (x, xs) match (xxs)(x, xs)

Given ConsPattern type-class we can define an actual bidirectional pattern synonym.

pattern (:::) :: ConsPattern v t => a -> t a -> v a v tt av a pattern x ::: xs <- (match -> (x, xs)) xs(match(x, xs)) where x ::: xs = build x xs xsbuild x xs

With the pattern synonym we have recovered a nice syntax for constructing vectors:

v3a :: V3.Vec Double = 1 ::: 2 ::: 3 ::: Nil v3a

But why stop here? Using source plugins, we can overload the list syntax! This is similar to what the OverloadedLists extension does but using a different desugaring. This approach is implemented in my overloaded library.

By adding few more instances

-- in V0.hs instance Overloaded.Lists.Nil ( Vec a) where a) = Nil nil -- in Out.hs of cons instance Overloaded.Lists.Cons a ( In.Vec a) ( Vec a) where a (a) (a) = ( ::: ) cons

and enabling the source plugin in Main module, we can write a vector as simply as

v3b :: V3.Vec Double = [ 5 , 6 , 7 ] v3b

This small syntax digression is here to show that modern GHC has ways to improve the syntax of certain constructs considerably. Pattern synonyms are great and worth using where appropriate, but I’d not recommend using overloaded library outside of blog posts.

Memory representation

GHC starting with version 8.6 bundles a library called ghc-heap which allows to look into heap representation of the values. ghc-heap documentation for a version bundled with GHC-8.8.1 can be found next to GHC documentation. We are particularly interested in GHC.Exts.Heap module.

Using the module functionality we can write a small function to compute a size of fully evaluated values. From all of possible Closure constructors, we are only interested in ConstrClosure – data constructors. Data constructor closures can have data directly there (non-pointer arguments, dargs ), or data referred to by pointers (pointer-arguments, pargs ). Also the info table takes one Word .

calculateSize :: HasHeapRep a => a -> IO ( Maybe Int ) = do calculateSize val c <- getClosureData val getClosureData val runMaybeT (go c) where go :: Closure -> MaybeT IO Int ConstrClosure _ pargs dargs _ _ _) = do go (_ pargs dargs _ _ _) <- traverse (liftIO . getBoxedClosureData >=> go) pargs ds(liftIOgetBoxedClosureDatago) pargs return $ 1 -- info table + length dargs -- data words dargs + length pargs -- pointers pargs + sum ds -- size of pointed data ds = do go x $ print x -- something unexpected liftIO MaybeT ( return Nothing )

If we ask for sizes of v0 , v1 , v2 and v3 we get following results:

value size in words v0 :: V0 Double 2 v1 :: V1 Double 4 v2 :: V2 Double 7 v3 :: V3 Double 10

To understand these numbers, let’s draw a memory representation of V3.Vec :

V3.Vec a +-------+-------+-------+-------+ | | | | | | hdr | * | * | * | | | | | | | | | +-------+---|---+---|---+---|---+ | | | v v v a a a

V3.Vec Double has

1 word for info table

3 pointers (= words) to

3 boxed doubles, 2 words each: info table and an actual number.

In total 10 words. Similarly, V2.Vec Double takes 1 + 2 * 3 = 7 words, and V1.Vec Double 1 + 1 * 3 = 4 words. V0.Vec Double is 2 words, as all GHC heap values have to be at least 2 words in size (header + 1 word of payload). It’s not a problem here, as V0.Vec a should be a static value, i.e. pre-allocated and shared.

I wish there were a library which would draw that kind of diagrams, using ghc-heap for inspection.

We should also ask how many words the GADT or data family representations take. The answer is 17 for both. For each element, there are three words as with Backpack version, but also each cell has own info table and a pointer to the next one: it’s a linked list. Thus 5 words per element, and additional 2 words for the nil. The nil should be shared value, so effectively the size of the V3.Vec Double is “only” 15 words.

V3.Vec a V2.Vec a +-------+-------+-------+ +-------+-------+-------+ | | | | | | | | | hdr | * | *------> | hdr | * | *------> V1.Vec a... | | | | | | | | | | +-------+---|---+-------+ +-------+---|---+-------+ | | v v a a

More memory, more pointers to chase, those must be slower. Next some benchmarks.

Benchmarks

We might have made strong claims that Backpack’d structure is efficient. Let’s compare it with other similar data-structures in a toy benchmark.

Given a hundred of vectors of size 3, sum them element-wise and multiply the elements at the end. For V3 defined above, that would be trivially:

benchV3 :: [ V3.Vec Double ] -> Double = foldl' ( * ) 1.0 . foldl' (liftA2 ( + )) ( pure 0.0 ) benchV3foldl' (foldl' (liftA2 ()) (

where the right foldl' is over the list, and the left foldl' is over the Vec structure.

We can compare that with using bare lists, and maintaining the length variant. The code is still quite the same. We only change liftA2 to zipWith .

benchList :: [[ Double ]] -> Double [[]] = foldl' ( * ) 1.0 . foldl' ( zipWith ( + )) [ 0 , 0 , 0 ] benchListfoldl' (foldl' ()) [

Another comparison is with Vec from the vec library. There are two variants, GADT- and data-family-based. We need to cheat a bit and use foldr while folding over vector itself. There is no foldl' , as foldr should be as fast when the size is known.

benchGADT :: [ GADT Double ] -> Double = GADT.foldr ( * ) 1.0 . foldl' (GADT.zipWith ( + )) ( pure 0.0 ) benchGADTGADT.foldr (foldl' (GADT.zipWith ()) ( benchDF :: [ DF Double ] -> Double = DF.foldr ( * ) 1.0 . foldl' (DF.zipWith ( + )) ( pure 0.0 ) benchDFDF.foldr (foldl' (DF.zipWith ()) (

And the final contestant is manually written vector, the definition is straight-forward.

data Manual a = Manual ! a ! a ! a deriving ( Functor , Foldable , Traversable ) instance Applicative Manual where pure x = Manual x x x x x x Manual x1 x2 x3) ( Manual y1 y2 y3) = liftA2 f (x1 x2 x3) (y1 y2 y3) Manual (f x1 y1) (f x2 y2) (f x3 y3) (f x1 y1) (f x2 y2) (f x3 y3)

benchManual :: [ Manual Double ] -> Double = foldl' ( * ) 1.0 . foldl' (liftA2 ( + )) ( pure 0.0 ) benchManualfoldl' (foldl' (liftA2 ()) (

Next the results.

Results

The results of criterion benchmarks are below, we can observe the following:

Backpack run takes 500 nanoseconds, this is the baseline.

run takes 500 nanoseconds, this is the baseline. List version is about 10 times slower. The benchmark is chosen so GHC cannot fuse lists away, we see their costs.

version is about 10 times slower. The benchmark is chosen so GHC cannot fuse lists away, we see their costs. GADT version is a little faster than List version, but still 8x slower than backpack version.

version is a little faster than List version, but still 8x slower than backpack version. Data-family version is only 35% slower than Backpack, and 6 times faster than GADT version. I suspect that index-driven code-generation just works better for data-families. Note: that backpack and manual version have strict fields, where GADT and data-family version have lazy ones. Maybe the data-family could be made a little faster.

version is only 35% slower than Backpack, and 6 times faster than GADT version. I suspect that index-driven code-generation just works better for data-families. Note: that backpack and manual version have strict fields, where GADT and data-family version have lazy ones. Maybe the data-family could be made a little faster. And finally, the manual version … exactly the same performance as with Backpack.

benchmarking Backpack time 501.2 ns (500.9 ns .. 501.5 ns) 1.000 R² (1.000 R² .. 1.000 R²) mean 501.3 ns (501.0 ns .. 501.9 ns) std dev 1.257 ns (675.9 ps .. 2.228 ns) benchmarking List time 4.714 μs (4.710 μs .. 4.719 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 4.724 μs (4.716 μs .. 4.746 μs) std dev 41.35 ns (16.52 ns .. 80.19 ns) benchmarking GADT time 4.023 μs (4.021 μs .. 4.025 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 4.030 μs (4.027 μs .. 4.037 μs) std dev 15.62 ns (6.974 ns .. 29.17 ns) benchmarking DF time 677.9 ns (677.3 ns .. 678.7 ns) 1.000 R² (1.000 R² .. 1.000 R²) mean 678.7 ns (677.7 ns .. 680.5 ns) std dev 4.253 ns (2.624 ns .. 7.070 ns) benchmarking Manual time 500.8 ns (498.5 ns .. 505.0 ns) 1.000 R² (0.999 R² .. 1.000 R²) mean 500.7 ns (498.9 ns .. 506.0 ns) std dev 9.150 ns (1.279 ns .. 18.06 ns)

Conclusion

We have now seen how one can use Backpack to implement Vec . The approach is different from GADTs and data family way, as the tail can be UNPACK ed. One other variant, is to backpack the element, like Key is backpacked in unpacked-containers .

There are non-Backpack implementations of V0 … V4 in linear package. If you’d ever need V5 , Backpack approach would give it with less copy & paste. Would you ever need V6 ?

One downside of Backpack approach is that we have to explicitly instantiate structures. With the GADT or data-family approach, whatever the run-time size argument is, the program will run (maybe very slowly, but still …). With the Backpack approach, there’s some maximum limit determined at compile time, with no way to surpass it at run-time.

The accompanying code for this post is on GitHub at phadej/vec-backpack.