Scaling Data Ingestion Systems: From Perl to Go, Part 1

A consequence of MediaMath’s astronomical growth over the past few years has been dealing with huge growth in service usage. Rapid growth sometimes means that systems are built quickly, without firm plans for the future. A system built with plenty of headroom can become insufficient in as little as six months, and technical debt becomes a tough challenge to address. We regularly face the question: “Do we try to rewrite this, or do we modify what we already have to scale with the load we expect to see?”

Nowhere has this been clearer than in ingesting user data, which since 2011 has more than tripled in volume to greater than 500k QPS. In this post I’ll discuss MediaMath’s history of ingesting user data, and how we lowered latency, simplified concurrency, and increased capacity by taking an experimental leap and switching to Go.

The Initial Problem

An important component of advertising, online or offline, is finding an audience. Advertisers use different strategies here:

- Finding users they know are interested in their products
- Discovering and classifying new users who they think will be interested in their offerings

A number of companies tackle this problem in different ways. For instance, one company may claim to accurately determine users’ genders and income ranges, while another may claim to match offline users to online cookies. Further, advertisers and brands often have their own CRM systems that they’d like to bring on board.

To provide this functionality, we needed a system for customers and partners to upload user IDs and segment-membership information. Ideally, we’d offer a few interfaces into this system for different use cases: batch, real-time, remodeling, etc. The solution needed to scale to meet the data ingestion load and handle slightly different formats, while still honoring a common specification.

Designing a System

To interact with the system, we decided on two methods: FTP and HTTP. FTP would be the primary interface, taking care of batch processing. For most offline data, ingesting batch files periodically is sufficient, as the user classification and data update cycle has a longer period (on the time scale of hours). HTTP would be for lower-latency data, where a time scale of hours is insufficient. The processing itself was handled by a series of translation steps to convert user-friendly representations to a common loader format. Then a loader application would load that data into our user database. This data would get pushed to our globally distributed bidders.

Since MediaMath’s earliest days, we’ve loved Perl. The architects behind many of our systems, including our campaign management API and this particular data ingestion system, loved the language’s tools, community, and style. Since the core of this translation process is processing streams of text, we went with Perl.

This worked fairly well for some time. We were handling gigabytes of data, processing batch files a few times per day and HTTP data a few times per hour. Fortunately (or unfortunately), ingesting offline data caught on like wildfire. Next thing we knew, we were taking in data from numerous sources. Some of these partners had specifications of their own and couldn’t match our standard format, so the system had to handle many different setups, which made it extremely difficult to scale.

Evolution of a Prototype

Scaling and extending the system suggested two seemingly contradictory approaches:

- Should we add run-time flexibility into a single system?
- Should we write focused, independent processors to make sure each is optimized?

The first option is a common way of handling this in a dynamic language, so this is what we went with. Using extensive configuration files of options, we could fine-tune the behavior used to handle any given input. The architecture basically looked like:

However, this brought scaling challenges. Even after trying to parallelize workflows using Linux streams (more on that later), we started to fall behind on processing, with ingestion times taking anywhere from six to 24 hours. Further, it didn’t look like the incoming data volume would taper off any time soon.

So we tried splitting up the processes themselves. This made the architecture look more like:

With limited controls for concurrency, we designed the system to be line-atomic and used Linux process controls to parallelize it. Driven by nightmarish configuration files, each file was piped into parallel to split it up, then into the translator (with flags specifying how it was to be processed), and finally into pigz (parallel gzip). It ended up looking something like:

$ zcat ${processing_dir}/${filename}${extension} | \
    parallel --gnu --halt 2 -j8 -k --pipe --block 20M \
      ${translator} --source=${source} \
        --config_file=${root}/etc/${source}.conf \
        --report_filename=${i} - | \
    pigz > ${output}${extension}

Yikes. If that doesn’t look terrifying, talk to your nearest DevOps expert.

Pretty soon, our data volume was once again pushing the limits of our architecture. We were falling behind, throwing more hardware at the problem. Splitting this over many machines brought deployment problems. We needed to re-think this.

Going Go

MediaMath has a strong culture of experimentation. If you’ve got an itch to solve a problem we have, and you come up with a good solution, we try to minimize the barriers to seeing the results of your efforts. This was a system that was staggering under its load. I had spent some time learning Go, and was curious whether it could help us redesign this system to scale more effectively.

So, I began re-writing this process. I had two main goals at the outset:

- Simplify concurrency and parallelism
- Scale without complicating deployment

Using Go here intrigued me because I wondered if I could solve the above goals using, respectively:

- Goroutines and channels, Go’s built-in concurrency primitives
- Interfaces

I wondered if this duality of run-time dynamism vs. deployment terror could be solved by clever use of Go’s offerings. In next week’s post, I’ll explain how we deployed concurrency in Go and the (expected and unexpected) gains we experienced.