Distributing your dataset/database joins can be a daunting task, to say the least. Not only do you need to think about how your data is sharded, indexed, etc., you need to think about what types and sizes of resources you need to allocate based on the scale of your data. For many, these considerations cause them to retreat to brute forcing their data transformations on increasingly beefy boxes.

However, data engineers and data scientists should be able to use the tooling they are familiar with, such as Python pandas or simple SQL, to perform data transformations, and still be able to seamlessly distribute that processing on a cluster. Take, for example, the following join:

An example join

Let’s imagine that we have data in two data sets called users and events . As the name implies, the users data set includes some data about our users:

username, occupation, birthday

bob1234, dentist, 07/01/1972

…

sallymcgee, software engineer, 03/28/88

and the events data set includes some data about things our users have done in our fictitious system:

timestamp, username, event_type

2017–01–15 03:02:11, sallymcgee, refund

…

2017–01–30 14:33:52, bob1234, purchase

If we wanted to fill in the rest of the user information for each of the events in events , we could join users to events on username . For our purposes, let’s imagine that we have a simple python script join.py (see here for a concrete example script) that performs this join and outputs:

timestamp, username, event_type, occupation, birthday

2017–01–15 03:02:11, sallymcgee, refund, software engineer, 03/28/88

…

2017–01–30 14:33:52, bob1234, purchase, dentist, 07/01/1972

Distributing the join

Naively applying this simple python script to large data sets at production scale will result in problems. We are potentially pulling both data sets fully into memory and creating a third data set that is the combination. That being said, maintaining the simplicity of the above approach would be wonderful if it could be distributed across our data.

With Pachyderm, we can easily create a data pipeline with a single worker that executes our python script, performing the join, and outputs the joined data:

However, we can also instruct Pachyderm to increase the parallelism of our data pipeline (which is controlled in a “pipeline specification”). When we increase the parallelism, Pachyderm will begin to distribute the join over multiple workers, each of the workers still running our same simple python script:

Pachyderm is smart enough to know how to distribute the data in users and events so that you get the same joined results, without having to modify your naive join logic! Under the hood, Pachyderm will partition the data sets and expose portions of the data to each worker:

In all of these scenarios (1 worker, 2 workers, etc.), every combination of the users data and events data is considered by at least one worker, so we don’t miss any joins. This logic of extends seamlessly to multi-table joins as well. In which case, the diagrams above would be 3-dimensional (for three data sets), 4-dimensional (for four data sets), etc.

Also note, here we have considered only one type of join, but there are many different types (inner, left, right, etc.). For many of these joins, Pachyderm offers additional optimizations that are discussed further in our docs.

Conclusions/resources

It’s as easy as that! Write the code you want to write, and let Pachyderm think about how to split up the data and distribute the transformation.

Be sure to: