Please read Part I before reading this. This blog has a code repository with runnable examples here: https://github.com/actions/dp_fundamentals

In Part I we made an analogy between the sci-fi concept of multidimensional expansion and data processing systems. We introduced iterators as a way to build a data processing plan in the form of transformations, as well as actions that execute these transformations and produce results.

In this part of the series we will expand into the second dimension and start processing our data simultaneously on multiple cores and systems.

Just as cloth is made of many threads stitched together, our expansion into the second dimension of data processing needs multiple iterators. The processing will run simultaneously on multiple computational threads, utilizing the available CPU cores.

Partitioning

Just as you can't make cloth out of a single thread, we can't parallelize the processing of one giant blob of data. To process the data in parallel, we first have to partition it. Last time we processed "The Complete Works of William Shakespeare". We already know how to count words in the whole file, so why don't we try to scale that processing? To do this, we need to split the file into multiple parts. I can think of many ways to split it. We can split it by page and have thousands of small files; we can split it by work and have several meaningful files of different sizes; we can simply cut the large file into several equal pieces. Which approach works best? The long answer is "it depends".

The quick rule of thumb for data partitioning: go with a fixed number of equally sized partitions.

Later we will examine the different approaches and gradually arrive at this conclusion.

How partitions are formed in real life

In reality we rarely get to partition giant (or not so giant) files. Usually the data comes as a continuous stream of clicks, messages, IoT metrics, or lines of text written by William Shakespeare. The partitions are formed as the data arrives. In later parts we will discuss how to handle continuous streams using the machinery we will build.

For now we will assume that our data is already partitioned. Further discussion of partitioning is available in the appendix to this article.

Parallel data processing

Now we have built up the basis to start talking about parallel processing.

The idea of multi-threading always raises a lot of questions:

- How many threads do I need?

- How do I do synchronization?

- How do I get the results of my computations?

Fork-join common thread pool

The fork-join model is exactly what we need to start with. It helps to split the work into smaller tasks. Each task does its work serially, while multiple tasks execute in parallel. The main thread executing the whole job waits for the tasks to complete and collects the results.

The fork-join metaphor goes like this:

- when a task starts execution, it forks into a separate thread;

- when a task is done, it joins the main thread and returns its result.

Source: Wikipedia, Fork–join model

To make this easier to imagine, picture the common thread pool as a highway with several lanes. Each lane is a running thread, and each task is a car traveling in one of the lanes. As a car travels, the CPU is processing the elements of our data iterator. The car exits the highway when the iterator is drained.

The number of threads in the pool, or the number of lanes on the highway, equals the number of cores in the CPU. The laptop I'm currently working on has two cores with hyperthreading, for a total of four logical cores. My common fork-join pool has four threads, one for each core. We don't really need more, and having fewer threads would leave some cores idle. A well-designed data processing program continuously utilizes all cores.

In Java, the common pool is accessed via the ForkJoinPool.commonPool() API call.
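A tiny standalone snippet shows it (note that the common pool's default target parallelism is Runtime.getRuntime().availableProcessors() - 1):

import java.util.concurrent.ForkJoinPool;

public class CommonPoolDemo {
    public static void main(String[] args) {
        ForkJoinPool pool = ForkJoinPool.commonPool();
        // Default target parallelism is one less than the number of available processors.
        System.out.println("parallelism: " + pool.getParallelism());
    }
}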

Processing one partition

Here we define our task, which processes one partition and returns a partial word count. We will reuse the wordCount function we built in Part I.
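The full task lives in the repository; here is a minimal sketch, assuming the Part I wordCount has the signature Map<String, Long> wordCount(Path file):

import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.RecursiveTask;

// Processes a single partition file and returns its partial word count.
class PartitionProcessingTask extends RecursiveTask<Map<String, Long>> {

    private final Path partitionFile;

    PartitionProcessingTask(Path partitionFile) {
        this.partitionFile = partitionFile;
    }

    @Override
    protected Map<String, Long> compute() {
        // wordCount is the serial method from Part I: it builds an iterator
        // over the file, attaches the transformations, and reduces to counts.
        return wordCount(partitionFile);
    }
}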

Bringing it all together

Let's create another task, which kicks off parallel processing of the partitions and brings the results together.

In the constructor of the task we iterate over our partitions. We create a processing task for each partition and immediately start running it on one of the available threads using the ForkJoinTask::fork method. In the compute method we join the tasks and aggregate the partition counts into a final count.
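A sketch of this coordinating task (the merging is written out with Map.merge; details may differ from the repository version):

import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.RecursiveTask;

class WordCountTask extends RecursiveTask<Map<String, Long>> {

    private final List<PartitionProcessingTask> forked = new ArrayList<>();

    WordCountTask(List<Path> partitions) {
        for (Path partition : partitions) {
            PartitionProcessingTask task = new PartitionProcessingTask(partition);
            task.fork(); // starts running on one of the available pool threads
            forked.add(task);
        }
    }

    @Override
    protected Map<String, Long> compute() {
        Map<String, Long> total = new HashMap<>();
        for (PartitionProcessingTask task : forked) {
            // join() waits for the forked task and returns its partial count
            task.join().forEach((word, count) -> total.merge(word, count, Long::sum));
        }
        return total;
    }
}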

All that is left to run the job is to submit our task to the pool:
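For example (partitionFiles stands for the list of partition paths):

Map<String, Long> counts = ForkJoinPool.commonPool()
        .invoke(new WordCountTask(partitionFiles)); // invoke submits the task and waits for the result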

Basic computational graph

If we visualize our code, we can see that we just built a very simple computational graph, where one WordCountTask spins off multiple tasks for parallel processing and coordinates the results. The concept of computational graphs will be important as we build the machinery for more complex processing.

Towards RDD

Looking at our parallelization code above leaves me unsatisfied. The fork-join model gives a good abstraction for running tasks on threads, but it's not really aligned with the way we reason about data processing. Dealing directly with thread pools and scheduling tasks can get cumbersome and error-prone. Here is a small example of a simple code shuffle that changes the behavior.

Here is a version of WordCountTask which looks neat.
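A sketch of the tidy-looking variant; the constructor now only stores the partitions, and fork and join are chained inside compute:

class WordCountTask extends RecursiveTask<Map<String, Long>> {

    private final List<Path> partitions;

    WordCountTask(List<Path> partitions) {
        this.partitions = partitions;
    }

    @Override
    protected Map<String, Long> compute() {
        Map<String, Long> total = new HashMap<>();
        for (Path partition : partitions) {
            // fork().join() blocks immediately, so each partition finishes
            // before the next one even starts
            new PartitionProcessingTask(partition)
                    .fork()
                    .join()
                    .forEach((word, count) -> total.merge(word, count, Long::sum));
        }
        return total;
    }
}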

While the code looks like it's doing the same thing, the significant difference is that it executes serially. One partition is processed after another, and all we did was misplace the join call. It would be hard to spot this difference, and the problem would not become pronounced until we deployed to production and realized that the job takes four times longer to complete.

We need a better model for parallel data processing, preferably one that does the task scheduling for us.

In Part I we built iterators as machinery to express an execution plan as a chain of transformations and then execute the plan using an action.

Let's look at the components of our code and try to build an abstraction that helps us expand our data processing into the second dimension and parallelize it to fully utilize our computational capabilities.

Our first class processed one partition. Looking closer at our code, all we needed to know was the partition number, partitionId. We were able to build the processing chain from the partitionId alone. Yet we might need more information later.

Next we processed one partition using the wordCount method. Yet the wordCount method does a lot of things. Looking inside, it did the following:

1. Built an iterator from a partition file.

2. Attached a chain of transformations to this iterator.

3. Prepared to run an action to get the partition level result.

Next we used the WordCountTask, which:

4. Built a computational graph and executed it on our computational thread pool.

5. Merged the partial results (another action).

Now let's imagine a new structure that wraps a collection of partition iterators. This structure would build the chain of processing for us just like an iterator, and we would be able to run our actions without thinking about multithreading and task scheduling. This is pretty much what the team behind Apache Spark has done. Their basic idea was called RDD, the Resilient Distributed Dataset. While we will not rebuild Apache Spark in this blog, we will examine the fundamental ideas behind this machinery.

This is the first step. Our abstract RDD:

- Is aware of its partitions.

- Has the ability to access partition data via the compute method.

- Is aware of its dependencies in order to be able to build a computational graph.

- Has access to a ForkJoinPool to run the job.

In our overly simplified RDD we equate the ForkJoinPool with the sparkContext; we will fix this later on.
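A minimal sketch of the abstract class; Partition and Dependency are pared down to what we have described so far, and Iterator is the custom iterator type we built in Part I:

import java.util.List;
import java.util.concurrent.ForkJoinPool;

interface Partition {
    int getPartitionId();
}

interface Dependency {
    RDD rdd();
}

public abstract class RDD<T> {

    protected ForkJoinPool sparkContext;      // our stand-in for the Spark context
    protected List<Dependency> dependencies;  // parents, forming the computational graph

    protected RDD() {
        // source RDDs fill the fields in themselves
    }

    // The partitions this dataset is split into.
    protected abstract List<Partition> getPartitions();

    // Builds an iterator (the Part I kind) over the data of one partition.
    protected abstract Iterator<T> compute(Partition partition);

    public ForkJoinPool getSparkContext() {
        return sparkContext;
    }
}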

Let's see how it can work for us and create an RDD over a partitioned file:
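A sketch with hypothetical details (the repository has the real implementation); linesOf stands in for a Part I helper that wraps a file's lines in our iterator:

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.Stream;

// Refers to the actual file one partition's data resides in.
class FilePartition implements Partition {

    final int partitionId;
    final Path file;

    FilePartition(int partitionId, Path file) {
        this.partitionId = partitionId;
        this.file = file;
    }

    @Override
    public int getPartitionId() {
        return partitionId;
    }
}

class PartitionedFileRDD extends RDD<String> {

    private final Path directoryPath;

    PartitionedFileRDD(Path directoryPath) {
        this.directoryPath = directoryPath;
        this.sparkContext = ForkJoinPool.commonPool();
        this.dependencies = Collections.emptyList(); // a source RDD has no parents
    }

    @Override
    protected List<Partition> getPartitions() {
        // One partition per file in the directory.
        try (Stream<Path> files = Files.list(directoryPath)) {
            List<Partition> partitions = new ArrayList<>();
            files.sorted().forEach(f -> partitions.add(new FilePartition(partitions.size(), f)));
            return partitions;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    protected Iterator<String> compute(Partition partition) {
        // linesOf is a hypothetical helper returning our Part I iterator
        // over the lines of the given file.
        return linesOf(((FilePartition) partition).file);
    }
}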

FilePartition is a custom implementation of Partition which refers to the actual file the data resides in. We implemented an RDD which gives us access to the partition file data by building iterators from partitions.

Next, all we need to do is add the fromFile method to our RDD class.

static RDD<String> fromFile(Path directoryPath) {
    return new PartitionedFileRDD(directoryPath);
}

MapPartitions

In our parallel wordCount code we learned how to process one partition, and then processed the partitions simultaneously. The mapPartitions method generalizes this process: we transform our RDD by mapping a partition iterator of one RDD into a partition iterator of another RDD.

First, note that most of the time we will have exactly one dependency, so writing a couple of helper methods around this specific case will help.

public RDD(RDD parent) {
    this.sparkContext = parent.getSparkContext();
    this.dependencies = Arrays.asList(new Dependency() {
        @Override
        public RDD rdd() {
            return parent;
        }
    });
}

public <P> RDD<P> getParent() {
    return (RDD<P>) this.dependencies.get(0).rdd();
}

Now we are ready to build our concrete RDD:

private class MapPartitionsRDD<T, O> extends RDD<O> {

    Function<Iterator<T>, Iterator<O>> converter;

    public MapPartitionsRDD(RDD parent, Function<Iterator<T>, Iterator<O>> converter) {
        super(parent);
        this.converter = converter;
    }

    @Override
    protected List<Partition> getPartitions() {
        return getParent().getPartitions();
    }

    @Override
    protected Iterator<O> compute(Partition partition) {
        Iterator<T> preComputed = (Iterator<T>) getParent().compute(partition);
        return converter.apply(preComputed);
    }
}

And finally, our new method for RDD:

public <O> RDD<O> mapPartitions(Function<Iterator<T>, Iterator<O>> converter) {
    return new MapPartitionsRDD<T, O>(this, converter);
}

Let's prepare to rewrite our wordCount method using RDDs. Here is how it would look.
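The full rewrite is in the repository; a sketch of its shape, assuming splitToWords is the line-splitting helper from Part I:

RDD<String> words = RDD
        .fromFile(directoryPath)
        .mapPartitions(lines -> lines.flatMap(line -> splitToWords(line)));
// every partition iterator now yields individual words; what is left is
// an action that counts them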

We now have a simple way to process partitions, with function calls quite similar to those we used when dealing with iterators. Yet the code still looks a bit messy: mapPartitions requires us to work with partition iterators.

Let's introduce a couple of familiar methods to our RDD to make it cleaner. These methods will help us stop thinking about partitions and let us work with a distributed dataset the same way we work with iterators.

FlatMap

public <O> RDD<O> flatMap(Function<T, Iterator<O>> splitter) {
    return this.mapPartitions(pi -> pi.flatMap(splitter));
}

Map

public <O> RDD<O> map(Function<T, O> mapper) {
    return this.mapPartitions(pi -> pi.map(mapper));
}

What's missing now is our action, which collects the results from the iterators. We need some analog of the reduce method for RDDs. It's called "aggregate" in the world of RDDs.

Aggregate

Let's first design how this method should work. If you look at our original implementation of parallel word count, you will see that we made two calls to the reduce method. One reduce was done for each partition, as part of the serial wordCount method executed by our PartitionProcessingTask. The second reduce brought the partition results together in the WordCountTask. Now that we have machinery for dealing with partitions, we can bring these two calls together in one method:
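Something like this (a sketch; the exact aggregate signature appears below):

Map<String, Long> counts = words.aggregate(
        HashMap::new, // a fresh start value for every reduce
        // partition reducer: counts the words within one partition
        (acc, word) -> { acc.merge(word, 1L, Long::sum); return acc; },
        // result aggregator: merges the partial counts from all partitions
        (total, partial) -> { partial.forEach((w, c) -> total.merge(w, c, Long::sum)); return total; });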

We defined the partition reducer just as it was in the serial wordCount method; it counts the words in each partition. Next we defined the result aggregator, which brings together the results from each partition. This is the familiar code we have already used in different components.

Now we need the aggregate method itself to set things in motion. The good news is that we pretty much already built it when we discussed parallel data processing.
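Here is a sketch of the implementation, assuming the Part I iterator exposes reduce(start, accumulator):

public <A> A aggregate(Supplier<A> start,
                       BiFunction<A, T, A> partitionReducer,
                       BinaryOperator<A> combiner) {
    // Fork one task per partition, just like in our first parallel word count.
    List<ForkJoinTask<A>> tasks = new ArrayList<>();
    for (Partition partition : getPartitions()) {
        tasks.add(sparkContext.submit(
                // reduce this partition's iterator with its own fresh start value
                () -> compute(partition).reduce(start.get(), partitionReducer)));
    }
    // Join the partial results and combine them into the final answer.
    A result = start.get();
    for (ForkJoinTask<A> task : tasks) {
        result = combiner.apply(result, task.join());
    }
    return result;
}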

Our aggregate method just generalizes the work we did while building our first parallel word count. It neatly combines the RDD and the ForkJoinPool to wrap both the processing of partition iterators and the aggregation of the results. Let's discuss what is going on here.

We take a supplier for the start parameter because we are dealing with multiple calls to reduce, one per partition iterator, and we need a fresh copy of the start value for each of them. So we need the ability to supply a new starting point upon request. The partition reducer is what goes into the reduce method of each partition iterator. Unlike with the iterator's reduce, we are not done yet: the combiner merges the results we got from processing each partition.

The new code is not bad. We don't have to deal with task scheduling or think about parallel processing. All we are doing is writing our logic and letting the RDD machinery handle the processing.

Appendix

What partitioning scheme is the best?

Now that we have seen how parallel processing works, we can think about partitioning schemes.

Everything looks really nice if we have equally sized partitions and their number equals the number of cores. It looks just like the diagram above.

Unfortunately the world is not ideal. Imagine that we have 5 partitions and 4 cores.

The fifth partition has to wait for a core to free up, so it runs in a second wave after the first four complete. Our total processing time immediately doubles, while the data size goes up by only 25%.

Now let's imagine that we split the text by pages and have thousands of partitions. This scheme is very flexible, but we would pay an overhead for scheduling so many tasks and handling so many files.

Ultimately there is no perfect answer. The partitioning scheme needs to take into consideration the underlying technology and the processing requirements.

Partitioning code

This is the partitioning code for our composite file: https://github.com/actions/dp_fundamentals/blob/master/src/main/java/fundamentals/llprocessing/DataSplitting.java