Building data pipelines is a common task for almost every company nowadays. However, the way these pipelines are built differs quite a lot from one company to another.

We, as a company, are no different from others, so we build these data paths too; still, the tooling always varies from one place to the next.

What I am about to describe can be achieved with any of the available big data tools, such as Apache Spark, of which I am also a huge fan. However, our team has invested heavily in Akka, and we have built several internal systems with it, so Akka is what we are going to use to tackle the task at hand.

To simplify the problem, we will say that our task consists of calculating a value that uniquely represents each line of a given file. For example, given a file A with the following content:

This is just a

Test file

For this example

The result of processing this file using a function f(s: String): Int is:

2342342

12341234151245

234

Please, don’t worry too much about the function f; let’s focus on the overall view of the process.
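For illustration only, here is one possible shape for f. The real function is irrelevant to this article, so this sketch simply hashes the line with Scala’s built-in MurmurHash3 — an assumption of mine, not the actual implementation (and it will not reproduce the sample values above):

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical stand-in for f: any deterministic function that maps a
// line to an Int would do for the purposes of this article.
def f(s: String): Int = MurmurHash3.stringHash(s)
```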

Using Akka Streams we could create a simple pipeline to process these files. Let’s see how.

First, we define the building blocks of our stream. Then, we have to put them together.
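A minimal sketch of what this could look like. The file path, the framing parameters, and the use of the f function from above are placeholders, not our production code:

```scala
import java.nio.file.Paths

import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.scaladsl.{FileIO, Flow, Framing, Sink}
import akka.util.ByteString

implicit val system: ActorSystem = ActorSystem("pipeline")

// Source: read the file and split its bytes into lines.
val lines =
  FileIO.fromPath(Paths.get("fileA.txt"))
    .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 8192, allowTruncation = true))
    .map(_.utf8String)

// Flow: calculate the weight of each line.
val weights: Flow[String, Int, NotUsed] = Flow[String].map(f)

// Sink: write each weight down.
val writer = Sink.foreach[Int](println)

// Put the blocks together and run the stream.
lines.via(weights).runWith(writer)
```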

And now we are basically done. We have built a stream that will process each file, line by line, calculating the weight of each of them and finally writing the value down.

The Real Problem

The described stream does its job, but it takes long hours to finish because the files can be huge.

First Attempt. Increasing Stream Parallelism

Our first try is to increase the number of concurrent mappers in the stream. We can redefine parts of the pipeline to do more work at any given moment in time.
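For instance, the mapping stage could be replaced with an unordered, parallel one. A sketch — the parallelism factor of 8 is arbitrary, not a tuned value:

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

import akka.NotUsed
import akka.stream.scaladsl.Flow

// Compute up to 8 weights concurrently. Results are emitted as soon as
// they are ready, possibly out of order.
val parallelWeights: Flow[String, Int, NotUsed] =
  Flow[String].mapAsyncUnordered(parallelism = 8)(line => Future(f(line)))
```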

Because we have big machines, we can increase the parallelism. Also, we don’t care too much about ordering: since the calculated value is unique per line, we can easily identify the exact (file, line) pair given the line’s weight.

As you might expect, performance is better now, but still not good enough given the amount of data waiting to be processed. Even though we are doing more processing at any given time, we cannot go beyond the power of a single machine: Akka Streams does not support distributed execution for now.

Second Attempt. Going Back to the Basics

Working with streams is easy, nice, and clean, but sometimes we need to go back to the basic abstraction: actors.

We are going to use Akka Cluster and Distributed Pub / Sub to move our data through the pipeline.
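A sketch of the two processing actors, using the classic Distributed Pub/Sub mediator. The topic name in is my own choice here; out is the topic the article uses later for the Writer:

```scala
import akka.actor.Actor
import akka.cluster.pubsub.DistributedPubSub
import akka.cluster.pubsub.DistributedPubSubMediator.{Publish, Subscribe, SubscribeAck}

// Consumes lines from the "in" topic and publishes their weights on "out".
class WeightCalculator extends Actor {
  private val mediator = DistributedPubSub(context.system).mediator
  mediator ! Subscribe("in", self)

  def receive: Receive = {
    case line: String    => mediator ! Publish("out", f(line))
    case _: SubscribeAck => // subscribed, nothing to do
  }
}

// Consumes weights from the "out" topic and writes them down.
class Writer extends Actor {
  private val mediator = DistributedPubSub(context.system).mediator
  mediator ! Subscribe("out", self)

  def receive: Receive = {
    case weight: Int     => println(weight)
    case _: SubscribeAck => // subscribed, nothing to do
  }
}
```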

Here, we have created two actors that do the same work we had before with streams. Still, we are missing the one that reads our dataset.
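A push-based source actor might look like this; the in topic name is an assumption on my part:

```scala
import akka.actor.Actor
import akka.cluster.pubsub.DistributedPubSub
import akka.cluster.pubsub.DistributedPubSubMediator.Publish

import scala.io.Source

// Reads the dataset and publishes every line on the "in" topic.
class Streamer(path: String) extends Actor {
  private val mediator = DistributedPubSub(context.system).mediator

  override def preStart(): Unit =
    Source.fromFile(path).getLines().foreach(line => mediator ! Publish("in", line))

  def receive: Receive = Actor.emptyBehavior
}
```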

This actor will take a stream and publish its elements (the file’s lines) to the cluster so WeightCalculator can do its corresponding job.

Let’s create an app that runs this pipeline to see how everything fits together.
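A minimal wiring, assuming the Writer, WeightCalculator, and Streamer actors described in this article, and an application.conf that enables clustering (akka.actor.provider = cluster plus seed nodes):

```scala
import akka.actor.{ActorSystem, Props}

object Pipeline extends App {
  // Requires akka.actor.provider = "cluster" in application.conf.
  val system = ActorSystem("pipeline")

  // Start the consumers first, then the producer.
  system.actorOf(Props[Writer], "writer")
  system.actorOf(Props[WeightCalculator], "weight-calculator")
  system.actorOf(Props(new Streamer("fileA.txt")), "streamer")
}
```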

At this point we have a runnable application that does the same thing we had using streams. However, this one can scale much more easily by deploying more actors of a certain type (WeightCalculator or Writer) in the same JVM or in a different one. We can even deploy them remotely on different machines across the network. All these deployment strategies are backed by Akka Cluster.

There is no limit on the number of processors we could have for each step, but this approach has an interesting problem.

In the streams world, we have built-in backpressure, so we don’t have to worry about flooding actors with too many messages. With our new approach, however, we are pushing a lot of messages into the cluster, which, in our experience, results in a lot of JVM crashes.

Third Attempt. Pulling vs Pushing

It would be nice if we could keep the Akka Cluster deployment and at the same time have control over how fast messages are propagated from the source. With some changes to our actors, we can achieve this.

We need to change our source first, so that consumers pull from it.
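A sketch of a pull-based Streamer. The SendNext(to) message is the one described below; the demand topic name and the StreamCompleted marker are assumptions of mine:

```scala
import akka.actor.{Actor, ActorRef}
import akka.cluster.pubsub.DistributedPubSub
import akka.cluster.pubsub.DistributedPubSubMediator.{Subscribe, SubscribeAck}

import scala.io.Source

// Demand protocol: a consumer publishes SendNext(self) when it is ready
// for one more element.
final case class SendNext(to: ActorRef)
case object StreamCompleted

class Streamer(path: String) extends Actor {
  private val mediator = DistributedPubSub(context.system).mediator
  mediator ! Subscribe("demand", self)

  private val lines = Source.fromFile(path).getLines()

  def receive: Receive = {
    // Send the next element directly to whoever asked for it.
    case SendNext(to) =>
      if (lines.hasNext) to ! lines.next()
      else to ! StreamCompleted
    case _: SubscribeAck => // subscribed, nothing to do
  }
}
```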

Here, Streamer doesn’t push anything by default. It waits for SendNext(to) messages, published on the cluster by some consumer, where the value of to is the address of that consumer. Streamer then sends the next element of the stream directly to it. This mechanism guarantees that elements are sent only to actors that requested them.

On the other hand, any actor can request elements by publishing a request on the cluster. We need to change WeightCalculator so it requests elements when it needs them.
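A sketch of the pull-based WeightCalculator. SendNext(to) is the demand message just described; the demand topic name and the StreamCompleted marker are my own assumptions:

```scala
import akka.actor.Actor
import akka.cluster.pubsub.DistributedPubSub
import akka.cluster.pubsub.DistributedPubSubMediator.Publish

class WeightCalculator extends Actor {
  private val mediator = DistributedPubSub(context.system).mediator

  // Ask for the first element as soon as we start.
  override def preStart(): Unit = mediator ! Publish("demand", SendNext(self))

  def receive: Receive = {
    case line: String =>
      mediator ! Publish("out", f(line))
      // Request the next element only once the current one is done.
      mediator ! Publish("demand", SendNext(self))
    case StreamCompleted =>
      context.stop(self)
  }
}
```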

As you can see, WeightCalculator requests a new element only when it has finished processing the current one. This way we avoid flooding it with more messages than it can process at once.

After WeightCalculator has finished processing a message, it publishes the result to the cluster on the out topic so it can be written down by the Writer.