I am a fan of both concepts in the title of this post.

I’m an engineer on Cloud Dataflow, which is a couple of system generations and levels of abstraction ahead of the original MapReduce but is based largely on the same primitives and implementation principles (our team is also a superset of the MapReduce team); and I’ve been an active advocate of functional programming: I’ve written a bunch of articles, given a bunch of talks, and taught a bunch of classes about it.

Disclaimer: this is not an official opinion of Google, this is my own rant :)

Despite my love of both of these concepts, I cringe when I see MapReduce presented as the poster child of functional programming.

I’m guilty of having done this myself as well.

People will point out that map is the good old map from every functional language ever, applying a function to every element of the input; and that reduce is the good old reduce, computing an aggregation over a list using a binary operator.

However, this analogy is neither quite true nor quite useful, and I believe it does justice to neither MapReduce nor functional programming.

Map

For map, the analogy is not very useful. map is a trivial concept. It’s basically SELECT from SQL, and you need zero knowledge or admiration for FP to understand or apply it. As a library function in a language, map is nice, but for an imperative language, it’s merely syntactic sugar for a loop.

Also, map in MapReduce is more of a concatMap or flatMap — the function produces multiple values (more precisely, multiple key-value pairs).
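In code, the distinction looks something like this — a toy Python sketch, with helper names of my own invention rather than any real framework’s API. The mapper emits multiple key-value pairs per input record, which is flatMap behavior, not plain map:

```python
def word_count_mapper(line):
    """Emit one (word, 1) pair per word -- multiple outputs per input record."""
    for word in line.split():
        yield (word, 1)

def flat_map(mapper, records):
    """Apply the mapper to each record in isolation, concatenating the outputs."""
    for record in records:
        yield from mapper(record)

pairs = list(flat_map(word_count_mapper, ["to be or", "not to be"]))
# Two input lines produced six key-value pairs: ("to", 1), ("be", 1), ...
```

A plain map would have produced exactly one output per input; here each line fans out into several pairs.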

The only functional part about map in MapReduce is, really, that the function will be applied to elements in isolation, so you can’t share state or mutate the input or do other dirty tricks. However, you can perform side effects (as long as you’re ok with them being performed more than once in case of retries).

By the way, this is a remarkably common case of using MapReduce: often people will skip the reduce phase entirely, and just use MR as a coordination framework to do some stuff in parallel: e.g., copy a bunch of data from one system to another, or convert it from one format to another, or perform some very heavy operation on every element of a very small dataset, or even delete a bunch of data. Most MapReduce jobs, in fact, are boring like that.

Reduce

For reduce, the analogy is just wrong. reduce in MapReduce is a GROUP BY operator: it groups the output of map by key and applies a second, different map to each (key, [stream of values]) pair.
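A toy sketch of this view, again with invented helper names (this is not Hadoop’s or Dataflow’s API): the shuffle groups by key, and the “reduce” phase is just a second map over the grouped pairs.

```python
from collections import defaultdict

def group_by_key(pairs):
    """The 'shuffle': group the mapper output by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(reducer, grouped):
    """The 'reduce' is simply a second map over (key, [values]) pairs."""
    for key, values in grouped:
        yield reducer(key, values)

pairs = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
counts = dict(reduce_phase(lambda k, vs: (k, sum(vs)), group_by_key(pairs)))
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```

Note that nothing here requires the reducer to be a fold with a binary operator; it is an arbitrary function of a key and its stream of values.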

Indeed, if the second map is an aggregation, such as SUM, expressible as an associative and commutative operator (who said “monoid”?), it can be executed more efficiently, e.g. the framework can push some aggregation into the first map. In Hadoop, such an aggregation is called a combiner.
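A minimal illustration of why associativity and commutativity matter here (helper names are mine, not Hadoop’s): pre-aggregating each mapper shard locally and then merging the partial results gives the same answer as one global aggregation, which is exactly what lets the framework push work into the map phase.

```python
from collections import Counter

def combine(pairs):
    """Map-side partial aggregation (a 'combiner' in Hadoop terms)."""
    partial = Counter()
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())

# Two mapper shards, each pre-aggregated locally...
shard1 = combine([("to", 1), ("be", 1), ("to", 1)])
shard2 = combine([("be", 1), ("or", 1)])

# ...then merged at the reducer. Because + is associative and commutative,
# the result equals a single global aggregation over all the pairs.
merged = combine(shard1 + shard2)
assert sorted(merged) == sorted(
    combine([("to", 1), ("be", 1), ("to", 1), ("be", 1), ("or", 1)]))
```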

People very often compute a count, a sum, a min/max or an average in the reduce phase, but more complex combiners are rare, and almost nobody in their right mind seriously sweats over commutativity/associativity of the combiner (because usually it obviously holds), and again, you need zero knowledge or admiration for FP to understand SUM(A) GROUP BY B. Moreover, without a combiner, there isn’t even a binary operator involved, and these fancy mathematical properties do not come into play at all.

The power of MapReduce

…lies not in bringing the straightforward concepts of map and reduce to the masses — the concepts are trivial and have been known to the masses all along. It lies in:

Making the observation that these two operations, “parallel apply” and “group by key”, are sufficient for a great number of typical big data processing tasks; and

Taking care of all the devilish, mind-bending complexity of performing them in a distributed system reliably and efficiently.

It lies in making sure that you can apply the same system to datasets both 1 KB and 10 PB in size; which are in files, or in enormous distributed key-value stores with an unpredictable key distribution, or in a custom system; using 5 machines or 50,000; where reading data is slow, writing data is slow, or processing data is slow, or this varies by many orders of magnitude depending on what record you’re processing; where you write a lot of data or a little; where you have a lot of values per key, or a few, or it greatly varies by key, or where the values for a key don’t fit in memory; where your machines are crashing, unresponsive, unexpectedly slow, or producing corrupted data; making it possible to debug your program when it’s slow or produces wrong results; etc. (I can talk about this stuff for hours, which is why I love working on this team.)

To put it shortly, there’s almost no usage of nontrivial functional programming concepts in a MapReduce system or in a typical program written using it. All the complexity and all the power is elsewhere.

I should also note that, though classical MapReduce consists of three stages (parallel apply, group-by a.k.a. shuffle, and another parallel apply), this is merely an arbitrary restriction of the original implementation.

People quickly realized that you can assemble more complex networks of parallel applies and group-bys, which you can see in the FlumeJava paper, Spark, Apache Crunch, etc., and finally, of course, in Dataflow. The word “MapReduce” is better used to describe these two distributed primitives, not the particular three-stage pipeline assembled from them.
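As a toy illustration of composing just these two primitives into a multi-stage pipeline (helper names are mine, not FlumeJava’s or Dataflow’s): count words, then regroup the counts by frequency — two group-bys chained with parallel applies.

```python
from collections import defaultdict

def parallel_apply(fn, records):
    """Stage 1 primitive: apply fn to each record (in parallel, in a real system)."""
    for record in records:
        yield from fn(record)

def group_by_key(pairs):
    """Stage 2 primitive: group values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

# Pipeline: tokenize -> group by word -> count -> regroup by count.
lines = ["to be or", "not to be"]
counts = [(word, sum(ones)) for word, ones in
          group_by_key(parallel_apply(lambda l: ((w, 1) for w in l.split()), lines))]
by_freq = {count: sorted(words) for count, words in
           group_by_key((count, word) for word, count in counts)}
# by_freq == {2: ["be", "to"], 1: ["not", "or"]}
```

The same two building blocks, wired into a longer network — which is exactly the generalization FlumeJava, Spark, Crunch, and Dataflow made first-class.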

The power of FP

On the other hand, the power of functional programming lies not in enabling people to understand the mighty map and reduce primitives — as said above, they’re simple enough.

It lies in using much higher levels of abstraction than previously thought possible (or practical) to achieve much higher levels of modularity and generality, which of course comes at a price and carries with it the need for tools for organizing programs written at such levels of abstraction (such as powerful type systems, immutability, an admiration for equational reasoning about programs, etc).

P.S. Rumor has it that the name MapReduce is not even derived from these functions in FP, but rather from “reduction of maps”, as most of the datasets processed by the original Google system were maps (sets of key-value pairs).

An older API even referred to a MapReduce program as a “MapReduction”.