Performance Debugging Mongo ETL at Scale

A real-world investigation into long-running mgo processes

At Plaid we are building the world’s largest understanding of financial information, and it’s crucial that we have the right set of tools internally to process, understand, and track changes and growth of our datasets over time.

We use AWS Redshift for analytics on internal data, which allows us to bring in data from many different sources and query, join, and merge it all in one place. One of the datasets we expose is a copy of our production Mongo cluster, which houses tens of terabytes of data.

Airflow schedules a golang process to move data into S3. We then COPY it into Redshift.

There are some large collections in Mongo for which we run a daily ETL script that exports data to S3 so that we can pick it up on the other side and load it into Redshift. These "Mongo Dump" jobs were written over a year ago, and the volume of data they are responsible for moving from Mongo to S3 had grown dramatically along with the business. We began to notice that the export job was taking a long time in wall-clock terms: when we were only running it once a day, no one much minded whether it took ten minutes or two hours, but in recent weeks we had observed a more troubling set of characteristics:

- It had grown so long-running that it was difficult to respond to short-term user requests (for example, to dump a new field from the collection).
- The runtime had become highly variable, with the maximum running time almost twice the minimum.
- In some instances, the export script would time out and the entire process would fail, leaving only a partial data dump.

Running times had grown ~30%, and the job would sometimes fail once before succeeding on retry

We set out to improve the process. First and foremost we wanted to ensure that it succeeded every time, but we also wanted to investigate performance. To improve the process's long-term health we needed to understand why it had slowed down and what we could do about it. Being able to receive an internal request for data, write any necessary code, and run the updated script within a single day was a crucial benchmark: it ensured engineers at Plaid felt empowered with data access rather than blocked by long delays in seeing their product's usage patterns.

Our goal is to ensure that engineers like Carver feel empowered with analytical data access

Mongo Dump uses mgo to talk to Mongo. It sets up N threads (in practice, 16 or 32) that each create an Iterator, then partitions the workload across the threads and lets each iterate through a range of IDs. The goal is for each thread to do 1/Nth of the work.
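Roughly, each thread's job looks like the following (a minimal sketch against mgo v2; the `dumpRange` name, the query shape, and the `out` channel are illustrative rather than the actual Mongo Dump code):

```go
package dump

import (
	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// dumpRange sketches the per-thread work: iterate a half-open range of
// ObjectIds and hand each document to the exporter that writes to S3.
func dumpRange(coll *mgo.Collection, lo, hi bson.ObjectId, out chan<- bson.M) error {
	iter := coll.Find(bson.M{"_id": bson.M{"$gte": lo, "$lt": hi}}).Iter()
	var doc bson.M
	for iter.Next(&doc) {
		out <- doc
		doc = bson.M{} // don't reuse the decoded map between documents
	}
	return iter.Close() // surfaces cursor/network errors such as timeouts
}
```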

What we wanted to happen — each thread should handle an equal distribution of the workload.

The original implementation did this in a clever way that looked at the data. Mongo BSON Object Ids are not auto-incrementing counters: they encode a timestamp within the id.
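For illustration, mgo's bson package exposes this directly; the snippet below is a small demonstration rather than anything from Mongo Dump:

```go
package dump

import (
	"fmt"
	"time"

	"gopkg.in/mgo.v2/bson"
)

// ObjectIds lead with a 4-byte creation timestamp, so an id maps back to
// a time, and a time can be turned into a synthetic id that sorts
// correctly against real ids (useful as a range boundary).
func objectIdTimestamps() {
	id := bson.NewObjectId()
	fmt.Println(id.Time()) // approximate creation time encoded in the id

	boundary := bson.NewObjectIdWithTime(time.Date(2017, 1, 1, 0, 0, 0, 0, time.UTC))
	fmt.Println(boundary) // synthetic id that sorts at the start of Jan 1, 2017
}
```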

Because ObjectIds encode creation time, if there is a temporal pattern to the data, splitting the id space evenly between the minimum and maximum can lead to huge skew. To get around this, we were taking a sample of ids, ordering them, and creating ranges of work between the samples. Specifically, we took a sample of size 3*N and used every third sampled id as a range boundary.

Our original implementation would dynamically sample k*N IDs from the mongo collection, and then keep every kth ID as a partition boundary
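In code, the old scheme looked roughly like this. The helper name and the use of the `$sample` aggregation stage are assumptions for the sketch, not necessarily how the original job sampled:

```go
package dump

import (
	"sort"

	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// sampleBoundaries sketches the original scheme: sample k*n ids, sort
// them, and keep every kth one as a partition boundary (k was 3 for us).
func sampleBoundaries(coll *mgo.Collection, n, k int) ([]bson.ObjectId, error) {
	var docs []struct {
		ID bson.ObjectId `bson:"_id"`
	}
	pipe := coll.Pipe([]bson.M{{"$sample": bson.M{"size": n * k}}})
	if err := pipe.All(&docs); err != nil {
		return nil, err
	}

	ids := make([]bson.ObjectId, len(docs))
	for i, d := range docs {
		ids[i] = d.ID
	}
	// ObjectIds sort lexicographically in timestamp order.
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })

	// Every kth id becomes a range boundary, giving roughly n ranges.
	var bounds []bson.ObjectId
	for i := k - 1; i < len(ids); i += k {
		bounds = append(bounds, ids[i])
	}
	return bounds, nil
}
```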

This setup worked well in the beginning, but as the process grew less efficient we discovered a few issues:

- All of the threads were sharing the same database connection. The version of the mgo library we were using had known issues with long-running database connections, and calling Refresh() on the session didn't seem to work as intended.
- The sampling methodology could (and frequently did) result in an uneven workload. In the most extreme cases, the largest chunks were 8x as large as the smallest, and our performance was dominated by the runtimes of the slowest 4 threads.

Our sampling methodology was creating ID ranges that were untenably disproportionate.

We set about addressing these issues, deciding to focus on architectural changes to Mongo Dump:

- Replace the shared database connection with a per-thread connection. Whenever a thread encounters a timeout-related error (the tell-tale failure), it closes its existing database connection and establishes a new one (see the sketch after this list).
- Instead of sampling, create a static, known partitioning of the workload space that leverages the temporal growth of the dataset.
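For the first change, the reconnect-on-timeout behavior can be sketched like this (the helper names, the retry bound, and the error check are simplifications, not the exact production logic):

```go
package dump

import (
	"net"
	"strings"

	"gopkg.in/mgo.v2"
)

// isTimeout sketches the tell-tale failure check; matching on the error
// string is a simplification.
func isTimeout(err error) bool {
	if netErr, ok := err.(net.Error); ok && netErr.Timeout() {
		return true
	}
	return err != nil && strings.Contains(err.Error(), "i/o timeout")
}

// withFreshSession runs fn with a worker-local session, replacing it with
// a brand-new connection whenever fn fails with a timeout-style error.
func withFreshSession(mongoURL string, fn func(*mgo.Session) error) error {
	session, err := mgo.Dial(mongoURL)
	if err != nil {
		return err
	}

	for attempt := 0; ; attempt++ {
		err := fn(session)
		if err == nil || !isTimeout(err) || attempt >= 3 {
			session.Close()
			return err
		}
		// Tell-tale timeout: drop the stale connection and dial a new one.
		session.Close()
		if session, err = mgo.Dial(mongoURL); err != nil {
			return err
		}
	}
}
```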

Given a known start date and bucket width, create ranges of mongo IDs for threads to iterate through.
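Generating those ranges is straightforward once you lean on the timestamp embedded in each ObjectId; a sketch, with a hypothetical `task` type standing in for our unit of work:

```go
package dump

import (
	"time"

	"gopkg.in/mgo.v2/bson"
)

// task is one bucket of the id space; the struct and helper names are
// illustrative, not the actual Mongo Dump types.
type task struct {
	Lo, Hi bson.ObjectId
}

// staticPartitions builds fixed ObjectId ranges from a known start date
// up to now, one per `width` (e.g. 14 days). Because ObjectIds lead with
// a timestamp, these synthetic ids are valid range boundaries.
func staticPartitions(start time.Time, width time.Duration) []task {
	var tasks []task
	for lo := start; lo.Before(time.Now()); lo = lo.Add(width) {
		tasks = append(tasks, task{
			Lo: bson.NewObjectIdWithTime(lo),
			Hi: bson.NewObjectIdWithTime(lo.Add(width)),
		})
	}
	return tasks
}
```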

The big change was to migrate to small, statically partitioned workloads and have the threads pull tasks from a shared queue: instead of having N workloads and N threads, we create around N*5 tasks by splitting the id space into 14-day buckets and let each thread pull the next task from the queue when it finishes its current one. This structure both mitigates unbalanced workloads and allows the threads to organically correct for skew: some threads handle a few big ID ranges while others handle many small ones.
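A minimal sketch of that structure, reusing the hypothetical `task` type and the per-range export idea from the earlier snippets (error handling is simplified):

```go
package dump

import (
	"log"
	"sync"

	"gopkg.in/mgo.v2"
)

// runWorkers pushes roughly N*5 tasks onto a channel and lets N workers
// drain it: workers that land small buckets simply pick up more of them.
// dumpTask stands in for the per-range export logic sketched earlier.
func runWorkers(n int, tasks []task, mongoURL string, dumpTask func(*mgo.Session, task) error) {
	queue := make(chan task)
	var wg sync.WaitGroup

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			session, err := mgo.Dial(mongoURL) // per-worker connection
			if err != nil {
				log.Printf("worker could not connect: %v", err)
				return
			}
			defer session.Close()
			for t := range queue {
				if err := dumpTask(session, t); err != nil {
					log.Printf("task %v-%v failed: %v", t.Lo, t.Hi, err)
				}
			}
		}()
	}

	for _, t := range tasks {
		queue <- t
	}
	close(queue)
	wg.Wait()
}
```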

Exporting hundreds of millions of rows of data in under an hour, achieving our goal.

Our changes worked! We recovered a lot of the lost parallelism, and more effectively recovered from errors related to long-running mgo connections. Being able to dump the entire dataset in a "short-enough" time frame helps our team push new code and see the results of those changes at scale without having to wait for overnight ETL to run.

Plaid is the bedrock of the entire fintech ecosystem, and we have a world of data challenges ahead of us. You should check out our career page or tweet me if you’re excited about building a generational company that has an impact on the financial life of (almost) everyone in America.