My understanding of reducers after EuroClojure

12 Jun 2012

At EuroClojure Rich Hickey gave a second, unscheduled talk about the new reducers library. The talk clarified a few points that I didn't initially get from the explanatory blog posts on clojure.com/blog. For the Amsterdam Clojure Meetup I wrote down these notes:

Reducers vs the seq/lazy-seq API

This is mostly explained in the first blogpost: Reducers - A Library and Model for Collection Processing.

The reducers are partly about de-complecting representation, order and mechanism. For example: the map reducer is only a recipe to apply a function to each item in the collection. Compare the following:

```clojure
(= [2 3 4]
   (vec (map inc [1 2 3]))
   (into [] (r/map inc [1 2 3]))
   (reduce conj [] (r/map inc [1 2 3])))
;; => true
```

(here map is clojure.core/map; r/map is the new clojure.core.reducers/map)

A reducer is only the recipe step of transforming a collection. Building a collection or other result is an explicit, separate reduce step. What to do with the input collection and how to build the output are kept separate. When you replace r/map with a composition of several reducers, the actual construction of the result is still only the final step. Compare this to the intermediate lazy seqs that are produced when composing the old seq functions.
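A minimal sketch of that contrast (the values here are my own illustration, not from the talk): both pipelines below give the same result, but the seq version realizes an intermediate lazy seq after each step, while the reducer version only composes recipes until the final `into`.

```clojure
(require '[clojure.core.reducers :as r])

;; Seq version: each step produces an intermediate lazy seq.
(->> [1 2 3 4 5]
     (map inc)
     (filter even?)
     (into []))
;; => [2 4 6]

;; Reducer version: r/map and r/filter only compose recipes;
;; nothing is built until the final reduce (here done by `into`).
(->> [1 2 3 4 5]
     (r/map inc)
     (r/filter even?)
     (into []))
;; => [2 4 6]
```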

Reducers and fold for parallelism

This is mostly explained in the second blogpost: Anatomy of a Reducer.

The split between building results and the reducer recipes also opens up some nice possibilities for parallelism, through the fork/join framework.

The example Rich gave was (not verbatim this, but close enough):

```clojure
(def v (vec (range 100000000000)))

(reduce + (map inc v))    ;; seq-based map
(reduce + (r/map inc v))  ;; reducer-based map, sequential reduce
(r/fold + (r/map inc v))  ;; reducer-based map, parallel fold
```

I think the timings were: reduce with r/map was twice as fast as reduce with map, and r/fold with r/map was four times as fast as reduce with map.

Fold is simply the name for the new parallel reduce function; it is not meant to refer to functions with the same name in other languages.

In the example the fold function is not passed a combiner; the plus function is used as the combiner as well. The seed for the reduce at the leaves is the combiner function called with zero arguments.
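A small sketch of that (my own illustration on a smaller vector): the one-argument arity of r/fold is shorthand for passing the same function as both combiner and reducer, and the leaf seed comes from calling the combiner with no arguments.

```clojure
(require '[clojure.core.reducers :as r])

(def v (vec (range 1000)))

;; One-function arity: + serves as both reducing fn and combiner.
(r/fold + (r/map inc v))
;; => 500500

;; Equivalent explicit form: combiner first, then reducing fn.
(r/fold + + (r/map inc v))
;; => 500500

;; The seed for the reduce at each leaf is the combiner
;; called with zero arguments:
(+)
;; => 0
```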

Fold will try to do the computation in parallel using fork/join, when both the input collection (which is asked to apply the reducer to itself) and the reducer support it. The check for support is done through protocols. For the data structures: PersistentVector extends CollFold. For the reducers: r/map is defined as a folder, while r/take-while is defined as a reducer and does not support fold, because partitioning makes no sense for that computation. When parallel fold is not supported, fold simply falls back to a sequential reduce. See the implementation for details: reducers.clj
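The fallback is transparent to the caller; this sketch (my own example, not from the talk) folds both a foldable and a non-foldable pipeline and gets an answer either way:

```clojure
(require '[clojure.core.reducers :as r])

;; PersistentVector extends CollFold and r/map yields a folder,
;; so this fold can run in parallel via fork/join:
(r/fold + (r/map inc (vec (range 1000))))
;; => 500500

;; r/take-while yields only a reducer (no CollFold support),
;; so fold silently falls back to a sequential reduce:
(r/fold + (r/take-while #(< % 100) (vec (range 1000))))
;; => 4950  (sum of 0..99)
```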

A common approach to parallel processing is map+reduce: the work is divided into partitions, map is applied to each partition, and the intermediate results are combined with reduce. In fold the approach is reduce+combine: the work on the smallest partition is done with a reduce rather than with a map. Compare the two approaches by trying to express filter in map+reduce versus reduce+combine. It appeals to functional programming sensibilities that reduce is a better fit for most operations than map.
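To make the comparison concrete, here is filter expressed with plain reduce (a sketch of my own): the accumulator simply absorbs or skips each element, whereas a per-element map would have to return some placeholder for rejected items.

```clojure
;; filter via reduce: no placeholder value needed, the
;; accumulator decides what to keep.
(defn filter-via-reduce [pred coll]
  (reduce (fn [acc x]
            (if (pred x) (conj acc x) acc))
          []
          coll))

(filter-via-reduce even? [1 2 3 4 5 6])
;; => [2 4 6]
```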

Fold works nicely with Clojure data structures. Many data structures in Clojure are trees, which can be easily partitioned. As Rich explained: explicit parallel data structures (as found in other languages) make no sense, because being parallel is a property of an algorithm, not of the data.

The ultimate benefit of the reducers library is this: in the past you could make your programs faster by simply waiting a while for faster hardware (Moore's law). Fold and reduce+combine bring back that promise for data processing, through more cores rather than faster CPUs.