Introduction

Multi-dimensional aggregation and grouping are common forms of analysis traditionally done with SQL databases, OLAP tools (Online Analytical Processing) and/or Excel pivot tables. These traditional tools use a flat table representation, whereby tuples are mapped to measures without any nesting. Increasingly, users are using general purpose programming languages for further offline analysis: languages like R, Python, Scala and Haskell. These often give us the opportunity to work with nested (inductive) structures, which functional languages like Haskell are particularly good at dealing with. Hopefully this post will prove that point.

Flat versus nested

Let’s start with an example flat dataset:





TradeId Location Date Value 10001 London 2017-04-01 9.99 10003 London 2017-04-01 4.99 10002 Shanghai 2017-04-02 86.38 10004 Shanghai 2017-04-03 43.15





Before we model the above in Haskell, let’s import some modules and define some types:

{-# LANGUAGE ConstraintKinds #-} {-# LANGUAGE TypeOperators #-}

module Cube where

import Prelude hiding (sum) (sum) import Data.Foldable ( Foldable , foldMap) , foldMap) import Data.Map ( Map ) import qualified Data.Map as Map import Data.Monoid import Data.Time.Calendar

newtype TradeId = TradeId { unTradeId :: String } deriving ( Eq , Ord , Show ) newtype Location = Location { unLocation :: String } deriving ( Eq , Ord , Show ) newtype Year = Year { unYear :: Integer } deriving ( Eq , Ord , Show ) newtype Month = Month { unMonth :: Int } deriving ( Eq , Ord , Show )

data Trade = Trade { tradeTradeId :: TradeId , tradeLocation :: Location , tradeDate :: Day , tradeValue :: Double }

It is idiomatic in Haskell to use a nominal record type for a row, we’ll call it ‘Trade’. We can then model the flat table structure above as a list of Trade: [Trade]. Even better would be to also model the uniqueness constraint using a primary key: Map TradeId Trade.

The flat structure does make a good canonical representation; and we could always add additional structures for efficiency (e.g. indexes). However, a nested representation offers quite a few advantages:

Nesting makes grouping explicit and assigns priorities (i.e. semantics) to the different dimensions (which I concede you may or may not want). For example, a common hierarchy of dimensions are the time units: year, month and day.

For large cubes, nesting permits efficient lookup and filtering without needing additional indexes.

Nesting allows us to separate grouping and aggregation phases. Becuase SQL databases do not permit nested structures, they must perform grouping and aggregation in a single logical phase. This is why ‘GROUP BY’ needs its own ‘HAVING’ clause.

A better monoid instance for Data.Map

Before we create a nested dataset from the above rows, we should fix the monoid instance for Data.Map. The newtype MMap below has a strictly more general monoid instance. The monoid behaviour for Data.Map can be recovered by using e.g. the First Monoid from Data.Monoid. MMap deserves its own Module and qualified import, but here I’ll just use the Data.Map API with explict wrapping/unwrapping to and from MMap.

Note that the complexity of unionWith is important. Unfortunately, Data.HashMap from the unordered-containers package states a complexity of O(n+m) for unionWith, which would always give us a quadratic implementation of mconcat.

type Key = Ord

-- A more general and useful monoid instance for maps newtype MMap k m = MMap { unMMap :: Map k m } k mk m }

instance Functor ( MMap k) where k) fmap f = MMap . fmap f . unMMap unMMap

instance ( Key k, Monoid m) => Monoid ( MMap k m) where k,m)k m) mempty = MMap mempty -- Note the complexity of unionWith is O(m*log(n/m + 1)), m <= n mappend a b = MMap $ Map.unionWith mappend (unMMap a) (unMMap b) a bMap.unionWith(unMMap a) (unMMap b) -- We must then have foldr here to get satisfactory complexity mconcat = foldr mappend mempty

instance Foldable ( MMap k) where k) foldMap f = foldMap f . unMMap unMMap

Nested groupings

To create a nested dataset, we’ll need a grouping operator. GHC calls this groupWith for lists, which it uses for the generalized list comprehensions extension (TransformListComp). It’s defined in GHC.Exts and the type signature is:

groupWith :: Ord b => (a -> b) -> [a] -> [[a]] (ab)[a][[a]]

Let’s define an equivalent version for MMap.

-- | Grouping for potential aggregation. -- Note: this implementation is sequential, but could be parallelized. groupWith :: ( Key k, Key k') => k,k') -> k') -> MMap k v -> MMap k' ( MMap k v) (vk')k vk' (k v) = MMap . fmap MMap . Map.foldrWithKey group Map.empty . unMMap groupWith fMap.foldrWithKeyMap.emptyunMMap where group k v = Map.insertWith Map.union (f v) (Map.singleton k v) k vMap.insertWith Map.union (f v) (Map.singleton k v)

This does require an initial MMap. A convenient choice here is: MMap TradeId Trade, which models the primary key. It is not yet a grouping, as the element can only be a single Trade record. Perhaps a more satisfiy choice, albeit less convenient in Haskell, would be to use an HList or extensible record for the keys. This would allow us to use compound-keys with various associated operations on those keys.

Let’s go with groupWith and perform a grouping operation to create a level of nesting. For example, to group according to ‘Location’, we would use:

λ> :t groupWith tradeLocation :: Key k => MMap k Trade -> MMap Location (MMap k Trade)

This would group our test data into the structure shown below:





Location London → TradeId Date Value 10001 → 2017-04-01 9.99 10003 → 2017-04-01 4.99 Shanghai → TradeId Date Value 10002 → 2017-04-02 86.38 10004 → 2017-05-03 43.15





A more satisifying ‘groupWith’ is perhaps the similar ‘nest’ operator below, which splits-up the key. However, going down this path really does require HLists or extensible records, as nested tuples quickly get ugly.

nest :: ( Key k, Key k') => k,k') MMap (k, k') v -> MMap k ( MMap k' v) (k, k') vk (k' v) = MMap . fmap MMap . Map.foldrWithKey group Map.empty . unMMap nestMap.foldrWithKeyMap.emptyunMMap where group (k, k') v = Map.insertWith Map.union k (Map.singleton k' v) (k, k') vMap.insertWith Map.union k (Map.singleton k' v)

Another advantage of the ‘nest’ grouping operator, is that it becomes easy to define the categorical dual: ‘unnest’. If MMap was an ‘indexed monad’ it would be the monadic join:

-- the dual to nest unnest :: ( Key k, Key k') => MMap k ( MMap k' v) -> MMap (k, k') v k,k')k (k' v)(k, k') v = MMap $ Map.fromList [ ((k, k'), v) unnest mmMap.fromList [ ((k, k'), v) | (k, m) <- Map.toList (unMMap mm) (k, m)Map.toList (unMMap mm) <- Map.toList (unMMap m) , (k', v)Map.toList (unMMap m) ]

With deep nestings, it becomes convenient to define an infix operator for our MMap type:

-- | Represent mappings using a right-associated infix operator type k :. v = MMap k v k v infixr 8 :.

Let’s now compose groupWith with fmap, to obtain grouping at various levels:

groupWith2 :: ( Key a, Key b, Key k) => a,b,k) -> k) -> a :. b :. v -> a :. k :. b :. v (vk) = fmap . groupWith groupWith2groupWith

groupWith3 :: ( Key a, Key b, Key c, Key k) => a,b,c,k) -> k) -> a :. b :. c :. v -> a :. b :. k :. c :. v (vk) = fmap . fmap . groupWith groupWith3groupWith

Using the above grouping operators, we can now obtain nested groupings, such as ‘timeDimensions’ below. However, such an API can difficult to use. We need to keep track of the type (level of nesting) within the expression. This could get awkward, once we start including aggregations.

timeDimensions :: ( Key k) => MMap k Trade k) -> Year :. Month :. Day :. k :. Trade = groupWith3 tradeDate timeDimensionsgroupWith3 tradeDate . groupWith2 (month . tradeDate) groupWith2 (monthtradeDate) . groupWith (year . tradeDate) groupWith (yeartradeDate)

A nice alternative, is to define a grouping function that also applies a function parameter to each resultant group:

-- | An alternative grouping function which can optionally aggregate -- or further group each resultant group. groupBy :: ( Key k, Key k') => k,k') -> k') -> ( MMap k v -> v') -> MMap k v -> MMap k' v' (vk')k vv')k vk' v' = fmap g . groupWith f groupBy f ggroupWith f

This function parameter can be another grouping operation, to perform at the next level of nesting. We can then therefore compose ‘groupBy’ to get a very readable (positional) definition of a grouping:

timeDimensions :: ( Key k) => (k :. Trade -> m) k)(km) -> k :. Trade -> Year :. Month :. Day :. m = groupBy (year . tradeDate) timeDimensionsgroupBy (yeartradeDate) . groupBy (month . tradeDate) groupBy (monthtradeDate) . groupBy tradeDate groupBy tradeDate

Such definitions are easily composed further, without worrying about the level of nesting (let GHC figure out the type!).

myCube :: ( Key k) => (k :. Trade -> m) k)(km) -> k :. Trade -> Location :. Year :. Month :. Day :. m = groupBy tradeLocation myCubegroupBy tradeLocation . timeDimensions timeDimensions

Of course, our grouping/cube definitions above still do not actually do any aggregation. They remain parameterized with a function to apply to the inner-most level of nesting.

Aggregation

To aggregate, we need to fold or reduce the elements (possibly inner groupings) of a map into a single value (not necessarily an atomic value). The function we need is ‘foldMap’, but I’m going to create a new fancy name for it:

rollup :: ( Monoid m) => (v -> m) -> MMap a v -> m m)(vm)a v = foldMap rollup

Now we have the necessary operator to rollup the innermost grouping of our cube:

myCubeSum :: ( Key k) => k :. Trade k) -> Location :. Year :. Month :. Day :. Sum Double = myCube (rollup $ Sum . tradeValue) myCubeSummyCube (rolluptradeValue)

Applying the above aggregation ‘myCubeSum’ to our test data (of type ‘MMap TradeId Trade’), would yield the following structure:





Location London → Year 2017 → Month 04 → Day Value 2017-04-01 → 14.98 Shanghai → Year 2017 → Month 04 → Day Value 2017-04-02 86.38 05 → Day Value 2017-05-03 43.15





Note that aggregations can have types that look similar to unnesting/ungrouping, but of course they do not form isomorphisms with our grouping operators. In general, we cannot undo an aggregation.

-- the type looks similar to unnest, but this function aggregates! rollup1 :: ( Monoid m, Key k, Key k') => (v -> m) -> MMap k ( MMap k' v) -> MMap k m m,k,k')(vm)k (k' v)k m = fmap . foldMap rollup1

Parallel Aggregation

Since MMap has the general monoid instance we defined earlier, our cube aggregations will also all have monoid instances (by induction). This means we can compute partial cubes for input chunks in parallel and then mconcat/reduce them to obtain the final cube.

mapReduce :: ( Monoid m) => (v -> m) -> [v] -> m m)(vm)[v] = mconcat . map f -- parallel map mapReduce f

λ > : t mapReduce myCubeSum t mapReduce myCubeSum mapReduce myCubeSum :: Key k => :. Trade ] [k -> Location :. Year :. Month :. Day :. Sum Double

Subcubes

I’m using the word “subcube” here to describe cubes built from aggregating larger/deeper grouping definitions.

Instead of creating a series of operators to aggregate over each successive level of nesting, we will use a positional style, as we did with the grouping definitions.

First, let’s create a new name for fmap:

keep :: ( Monoid m) => (v -> m) -> MMap a v -> MMap a m m)(vm)a va m = fmap keep

We can now just specify in the term, what we want to happen to each dimension as it appears in the type. For example to rollup just the Year and Month, we can write the following:

-- a "subcube" definition, parameterized by an aggregation function locationByMonth :: ( Monoid m) => (v -> m) m)(vm) -> Location :. Year :. Month :. Day :. v -> Location :. Month :. m = keep . rollup . keep . rollup locationByMonthkeeprollupkeeprollup

I think the above is rather nice!





Note: the literal haskell for this entire post can be found here.

References

[1] Map Comprehensions

[2] TransformListComp





Appendix

Auxiliary functions

year :: Day -> Year = let (y, _, _) = toGregorian d in Year y year d(y, _, _)toGregorian d