Apache Beam provides a number of transformations, most of which are straightforward to choose between:

- ParDo — parallel processing

- Flatten — merging PCollections of the same type

- Partition — splitting one PCollection into many

- CoGroupByKey — joining PCollections by key

Then there are GroupByKey and Combine.perKey. At first glance they serve different purposes.

GroupByKey groups all elements with the same key and produces multiple collections. The next stage receives an Iterable containing all elements with the same key. An important note: this Iterable is evaluated lazily, at least when GroupByKey is executed on the Dataflow runner. This means that elements are loaded into memory on demand, when requested from the iterator.

Combine.perKey, on the other hand, also groups all elements with the same key, but performs an aggregation before emitting a single value.

Let’s take a look at an example. Our input collection consists of postal codes and street names — key-value pairs like the ones below:

00-018 — ul. Zgoda

00-019 — ul. Złota

00-020 — ul. Chmielna

00-020 — ul. Szpitalna

Here we have four entries but only three unique keys, since several streets may share the same code, as in the case of 00-020.
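To make the difference concrete, here is a plain-Java sketch, deliberately not using the Beam API, of what the two transformations conceptually produce for the sample above:

```java
import java.util.*;

public class GroupVsCombine {
    // What GroupByKey conceptually produces: every key maps to all of its values.
    static Map<String, List<String>> groupByKey(List<Map.Entry<String, String>> input) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> e : input) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return grouped;
    }

    // What Combine.perKey conceptually produces: one aggregated value per key (here: a count).
    static Map<String, Integer> combinePerKey(List<Map.Entry<String, String>> input) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, String> e : input) {
            combined.merge(e.getKey(), 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> input = List.of(
                Map.entry("00-018", "ul. Zgoda"),
                Map.entry("00-019", "ul. Złota"),
                Map.entry("00-020", "ul. Chmielna"),
                Map.entry("00-020", "ul. Szpitalna"));
        System.out.println(groupByKey(input));
        // {00-018=[ul. Zgoda], 00-019=[ul. Złota], 00-020=[ul. Chmielna, ul. Szpitalna]}
        System.out.println(combinePerKey(input));
        // {00-018=1, 00-019=1, 00-020=2}
    }
}
```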

When applying a GroupByKey transformation, the next stage is called three times, once per unique key. It has access to an Iterable and can cycle through all the streets. For the key 00-020 there will be two of them: ul. Chmielna and ul. Szpitalna. Now comes the interesting part.

What do you do in a stage following GroupByKey ?

If you just emit every element without intensive computations, you’re probably fine. But if you aggregate these values (for example, counting the streets, collecting them in a Set, or calling a third-party service or using a side input to enrich the data into a single value), then you should listen carefully now.

The problem with this approach is that the aggregating stage has to process all the values for a given key. It receives an Iterable and has to deal with it alone: one machine, one thread. That may not be a problem, since other workers will process elements from other groups. It will, however, blow up once you start processing hot keys, i.e. keys which occur much more often than the others. With two workers, for example, one will be busy with the hot key while the other, after having processed the remaining keys, sits idle. Adding workers does not help at all: they cannot take over part of the hot key’s work from the occupied worker.

Besides not being able to scale out, you may face another issue: the famous OutOfMemoryError. A single worker may simply not be able to hold the aggregated collection because of physical memory constraints.

Roger that

Can we fix that? Sure, we can fix that. Will it be easy? Nope, it’s not gonna be easy. A GroupByKey transformation followed by a ParDo can be replaced with Combine.perKey. The benefit of this transformation is that combining the elements of a particular key can be split across workers and run in parallel. Since there is no shared memory between workers, there will be more network overhead due to serializing, sending and deserializing partial results between worker nodes, and, more importantly, it’s us who need to develop the merging code. 😱

Let’s take a look at the documentation of the combiner we have to write:

A CombineFn<InputT, AccumT, OutputT> specifies how to combine a collection of input values of type InputT into a single output value of type OutputT. It does this via one or more intermediate mutable accumulator values of type AccumT.

To achieve this we have to implement four methods:

- createAccumulator() which creates an initialized accumulating object,

- addInput() responsible for adding one element of a grouped collection into the accumulator,

- mergeAccumulators() doing the main work of combining the efforts of all the different workers,

- extractOutput() being called at the end to output one single aggregated value.

The createAccumulator() method is called for every subset of data that is going to be processed on a single worker. It returns a container that will store the outcome of processing elements. It starts out empty, but may be initialized with values that are used later in the computations.

These computations may be performed in addInput(), which receives an accumulator and an element. Each element of the subset mentioned above has to find its way into the accumulator. This may be as simple as incrementing a counter, but it may also involve business logic: calculating values, extracting only the meaningful data, or enriching it through third-party API calls or an explicitly provided side input.

The trickiest part is implementing the mergeAccumulators() method, which receives an Iterable of accumulators to be combined. We have to create a new accumulator that is the result of merging the given ones. As mentioned, elements of a collection with the same key are split into subsets which are routed to different workers; that is the whole point of using Combine.perKey. This method reverses that split, and it will be called until there is only one accumulator left.

The final method is extractOutput(). Its purpose is to produce a single value from one given accumulator, the last one remaining after merging. It may be a computed value or an enriched object.
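Putting the four methods together: below is a minimal, plain-Java sketch of a counting combiner that mirrors the CombineFn contract without using the Beam SDK. The main method simulates the distributed execution by hand: two “workers” each accumulate a subset of one key’s elements, and the partial accumulators are merged at the end. The class name and the harness are mine, not Beam’s:

```java
import java.util.*;

public class CountFnSketch {
    // Mirrors CombineFn<String, long[], Long>: String input, long[] accumulator, Long output.
    static long[] createAccumulator() {
        return new long[] {0}; // fresh accumulator: nothing counted yet
    }

    static long[] addInput(long[] acc, String element) {
        acc[0]++; // as simple as incrementing a counter
        return acc;
    }

    static long[] mergeAccumulators(Iterable<long[]> accumulators) {
        long[] merged = createAccumulator();
        for (long[] acc : accumulators) {
            merged[0] += acc[0]; // combine the efforts of all the workers
        }
        return merged;
    }

    static Long extractOutput(long[] acc) {
        return acc[0]; // one single aggregated value
    }

    public static void main(String[] args) {
        List<String> streets = List.of("ul. Zgoda", "ul. Złota", "ul. Chmielna", "ul. Szpitalna");

        // Simulate two workers, each processing a subset of the same key's elements.
        List<long[]> partials = new ArrayList<>();
        for (List<String> subset : List.of(streets.subList(0, 2), streets.subList(2, 4))) {
            long[] acc = createAccumulator();
            for (String s : subset) {
                acc = addInput(acc, s);
            }
            partials.add(acc);
        }

        Long result = extractOutput(mergeAccumulators(partials));
        System.out.println(result); // 4
    }
}
```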

Let’s compare both solutions in a real-life example. The Apache Beam pipeline consists of an input stage reading a file and an intermediate transformation mapping every line into a data model. Then, in the first case, we’ll use a GroupByKey followed by a ParDo transformation, and in the second case a Combine.perKey transformation. The final stage is a logger.

The input file is again a key-value pair of postal codes and street names. The mapping, however, uses only the first digit of the postal code as the key. Given an input of:

00-018 — ul. Zgoda

we get a new key-value pair of:

0 — ul. Zgoda

The postal codes start with 00-001 and end with 19-520, which leaves us with only two unique keys: 0 and 1.
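The mapping step itself can be sketched as a tiny helper (hypothetical name, plain Java):

```java
public class PostalKey {
    // Maps a postal code such as "00-018" to the coarse key "0".
    static String coarseKey(String postalCode) {
        return postalCode.substring(0, 1);
    }

    public static void main(String[] args) {
        System.out.println(coarseKey("00-018")); // 0
        System.out.println(coarseKey("19-520")); // 1
    }
}
```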

To make it more fun, I’ve ended up with 10 files, all in all 146 131 200 records, around 16 GB.

GroupByKey with a ParDo transformation

Let’s start with GroupByKey; below you see the executed pipeline.

The Combiner does a simple thing: it iterates through all the elements with a particular key and outputs one value.
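For one key, that stage boils down to a single loop over the Iterable it receives. Here is a plain-Java sketch of the idea (the names are mine, not from the original pipeline):

```java
import java.util.List;

public class IterableCounter {
    // Runs on one worker, in one thread, for the whole key.
    static long countStreets(Iterable<String> streets) {
        long count = 0;
        for (String street : streets) {
            count++; // the Iterable may be evaluated lazily on the Dataflow runner
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countStreets(List.of("ul. Chmielna", "ul. Szpitalna"))); // 2
    }
}
```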

Let’s run it in Google’s Dataflow.

Metrics for one worker (on the left), and four workers (mid and right)

Once four workers were running, the mapping to a data model was done at a rate of 700k elements per second. Then GroupByKey started its work of grouping the elements into two groups, one with key=0 and another with key=1. I’m not sure why this stage took so long; the Combiner, however, did its job quite fast. Sure, it was just incrementing a counter, but if you did intensive computations, called third-party APIs or collected all the data, this stage would take much longer or would not finish at all due to an OutOfMemoryError. The process() method was called twice, since we only had two keys.

Combine.perKey transformation

The pipeline with a Combine.perKey transformation is more complex, since it requires a Combiner:

This pipeline is written with scalability in mind. For a few records it is slower, but with the test input it outperforms the GroupByKey transformation. Running it with --autoscalingAlgorithm=THROUGHPUT_BASED reveals that during the first couple of minutes it processed data at a rate of 350k elements per second. Later a second worker was launched and the throughput went up to around 700k elements per second. The logs also confirmed that accumulators were created on different instances and that the merging was performed on several workers.

Metrics for one worker (on the left) and two workers (mid and right)

Scaling history

Always watch out!

A blog post without a kitten wouldn’t be a blog post!

As already mentioned, Combine.perKey introduces some overhead related to network traffic, and there is also a bit of serialization and deserialization going on. It is a bad idea to keep a List or any other Collection that grows with every element in an accumulator; in such a case you may be better off using GroupByKey.

Another issue mentioned previously is the presence of hot keys. In this case the key 0 occurs roughly four times as often as 1. Adding a fanout function, however, did not make any difference:
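Conceptually, fanout works in two phases: hot keys are salted into sub-keys, each sub-key is pre-combined in parallel, and the partial results are then combined under the original key. A plain-Java illustration of the idea (the salting scheme below is made up; Beam handles this internally):

```java
import java.util.*;

public class FanoutSketch {
    // Phase 1: salt each key into one of `fanout` sub-keys and pre-combine (here: count).
    // Phase 2: strip the salt and combine the partial results under the original key.
    static Map<String, Integer> fanoutCount(List<String> keys, int fanout) {
        Map<String, Integer> partial = new TreeMap<>();
        int i = 0;
        for (String key : keys) {
            String salted = key + "#" + (i++ % fanout); // spreads a hot key over `fanout` sub-keys
            partial.merge(salted, 1, Integer::sum);
        }
        Map<String, Integer> total = new TreeMap<>();
        for (Map.Entry<String, Integer> e : partial.entrySet()) {
            String original = e.getKey().substring(0, e.getKey().indexOf('#'));
            total.merge(original, e.getValue(), Integer::sum);
        }
        return total;
    }

    public static void main(String[] args) {
        // The hot key "0" occurs four times as often as "1".
        List<String> keys = List.of("0", "0", "0", "0", "0", "0", "0", "0", "1", "1");
        System.out.println(fanoutCount(keys, 4)); // {0=8, 1=2}
    }
}
```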

There is a lot more to consider when writing effective pipelines. I’ve found this blog post very informative and also this one about fanout.