Distributed Approach

We can use the same approach taken with Dask to scale to a cluster with Spark: partition the data into independent subsets, and then distribute feature engineering with these subsets to multiple workers. This follows the general framework of breaking one large problem up into easier sub-problems, each of which can be run by a single worker.

When one problem is too big, make lots of little problems.

Partitioning Data

To partition the data, we take the customer id represented as a string and:

1. Hash the id to an integer using the MD5 message-digest algorithm
2. Modulo divide this integer by the number of partitions

A minimal version of this id_to_hash function, using Python's built-in hashlib for the MD5 hash, might look like the following:
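```python
from hashlib import md5

def id_to_hash(customer_id, num_partitions=1000):
    """Map a customer id string to a partition number."""
    # Hash the string id to an integer using MD5; the same string
    # always produces the same integer
    hashed_id = int(md5(customer_id.encode('utf-8')).hexdigest(), 16)
    # Modulo divide by the number of partitions to get the partition number
    return hashed_id % num_partitions
```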

Using the hashing function ensures that a string will always map to the same integer, so the same customer will always end up in the same partition. The assignment of customers to partitions is random, but that is not an issue because each customer is independent of the others.

The end result of partitioning is that each partition holds all the data necessary to build a feature matrix for a subset of customers. Each partition is independent of all the others: we can calculate the feature matrix for one partition without worrying about the customers in any other partition. This will allow us to run the feature engineering in parallel since workers do not need to communicate with one another.

Using the id_to_hash function, we take our three individual large dataframes (representing transactions, user logs, and membership info) and convert all of the customer ids to a partition number. To actually partition the data, the most efficient approach is to use a groupby on the partition number and then iteratively save each partition. For example, in the members dataframe, where the msno column is the customer id, the following code partitions the dataframe into 1000 separate files and saves them.
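A sketch of that partition-and-save step is below, assuming members has already been read into a pandas dataframe; the partitions/ output directory and CSV file layout are assumptions rather than the exact paths from the original code:

```python
import os

# Assign each customer in the members dataframe to a partition
members['partition'] = members['msno'].apply(id_to_hash)

# Group by partition number and save each partition to its own file
for partition, grouped in members.groupby('partition'):
    partition_dir = f'partitions/p{partition}'
    os.makedirs(partition_dir, exist_ok=True)
    grouped.drop(columns='partition').to_csv(
        os.path.join(partition_dir, 'members.csv'), index=False
    )
```

Dropping the temporary partition column before saving keeps each partition file identical in schema to the original dataframe.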

Wrapping this code in a function, we can then partition all of the dataframes. If we have a large file that cannot fit into memory, such as the user_logs, then we can read and partition it in chunks using the chunksize argument of pandas' pd.read_csv, as sketched below.
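A rough sketch of the chunked approach, reusing the id_to_hash function from above; the file name, chunk size, and append-to-CSV writing scheme are assumptions:

```python
import os
import pandas as pd

# Read the large file in manageable chunks rather than all at once
for chunk in pd.read_csv('user_logs.csv', chunksize=10_000_000):
    # Assign each row to a partition based on its customer id
    chunk['partition'] = chunk['msno'].apply(id_to_hash)
    for partition, grouped in chunk.groupby('partition'):
        partition_dir = f'partitions/p{partition}'
        os.makedirs(partition_dir, exist_ok=True)
        out_path = os.path.join(partition_dir, 'user_logs.csv')
        # Append because rows for the same partition appear in many chunks;
        # only write the header the first time the file is created
        grouped.drop(columns='partition').to_csv(
            out_path, mode='a', header=not os.path.exists(out_path), index=False
        )
```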

Working with a 30 GB file, this code ran in about 3 minutes. Partitioning data is a common approach when working with large datasets.