Scaling Data Ingestion Systems: From Perl to Go Part 2

In part one of this post, I explored the scaling problems we encountered with MediaMath’s user data delivery system, which was built initially in Perl without the headroom necessary to scale to our current size. In this post, I outline how we used goroutines and channels (Go’s built-in concurrency primitives) along with interfaces to simplify concurrency and parallelism, and to scale without complicating deployment.

Concurrency in Go, or “Let’s just add more workers!”

Tackling our concurrency issues proved to be the easiest part of this exercise, thanks to Go’s fantastic primitives and the fan-out and fan-in patterns.

By fanning out, we can send all lines from a file into a channel:

```go
func StringLines(scanner *bufio.Scanner, buffer int) <-chan string {
	outChan := make(chan string, buffer)
	go func() {
		for scanner.Scan() {
			outChan <- scanner.Text()
		}
		close(outChan)
	}()
	return outChan
}
```

Notice here how we specify a second argument buffer. When making a channel in Go, you can include a second argument to specify how many items can be pending in the channel. That is, suppose we have a buffer of 1. If this goroutine sends one string into the channel, it will immediately continue and prepare a second string. That second string won’t be sent into the channel until the first string gets received. If we have a buffer of 0, on the other hand, the goroutine will be blocked on the first string until a receiver is ready to receive it.

We can then send the data from this channel to multiple workers by creating a pool:

```go
pool := make([]<-chan []string, workers)
for i := 0; i < workers; i++ {
	pool[i] = t.ProcessLines(lines)
}
```

The workers argument is a command-line flag that indicates the number of workers we add to the pool. This is a boon to Ops teams: by tuning the workers within our application, we remove the need for another dependency (GNU Parallel) and its intricacies. Even nicer, it allows us to test more easily and benchmark more accurately: developers can test the impact different settings have on execution profiles on their local machines, and benchmark against standard settings.

Finally, we complete the fan-in by sending the data from all channels back into one:

```go
output := make(chan []string)
go func() {
	for outputLine := range MergeChannels(pool) {
		output <- outputLine
	}
	close(output)
}()
return output
```

MergeChannels here spins up a goroutine for each worker in the pool, sending their output to a single channel that it returns. The entire function looks something like:

```go
func Process(lines <-chan string, workers int, t Tokenizer) <-chan []string {
	pool := make([]<-chan []string, workers)
	for i := range pool {
		pool[i] = t.ProcessLines(lines)
	}
	output := make(chan []string)
	go func() {
		for outputLine := range MergeChannels(pool) {
			output <- outputLine
		}
		close(output)
	}()
	return output
}
```

Taking advantage of our command-line flags, the resulting architecture is a fan-out from the file reader to a tunable pool of workers, followed by a fan-in back to a single output channel.

Sidebar on GOMAXPROCS

One fairly common misconception surrounds GOMAXPROCS and how it affects your code. For example, I often see this snippet in source code:

```go
runtime.GOMAXPROCS(runtime.NumCPU())
```

This is not necessarily an optimal way to run your code. GOMAXPROCS controls how many logical processors the Go runtime will schedule goroutines against, and each logical processor is bound to an operating system thread. (In early versions of Go the default was 1; since Go 1.5 it defaults to the number of available cores.) Increasing the number of logical processors with GOMAXPROCS increases the number of OS threads the runtime will schedule against. However, this does not automatically mean that your code will run in parallel. If you have many processes running on a single machine, each calling runtime.GOMAXPROCS(runtime.NumCPU()), you can very easily run up against OS thread contention and actually slow down your code.

Further, if your program is I/O-bound anyway, increasing GOMAXPROCS is unlikely to yield substantial results. The Go runtime does a pretty good job of managing this for you. But if your program can benefit from increased logical processors, the option is there. Deferring GOMAXPROCS to another command-line argument allowed us to fine-tune not only how many goroutines are pulling from the channel, but also how many logical processors we bind against.

But how do you know what value is best? The answer, of course, is to always profile your code. And on that note, Go has a fantastic built-in profiler in runtime/pprof:

*Note that this profile output is actually from the HTTP server we run to handle real-time requests, not this batch processor.

Let’s take inventory so far. We have:

A method to handle our concurrency and parallelism issues without the use of additional tools

A way to understand how these primitives interact and how we can fine-tune them within our application

But how can we solve the extensibility problem?

The Value of Interfaces

If you jump back to the code sample above, you’ll notice I included a type I hadn’t yet explained: Tokenizer. And here we get to the second part of making Go work for us.

```go
// The Tokenizer interface exists to have a struct that tokenizes
// the input to a format that a common processor can understand.
type Tokenizer interface {
	ProcessLines(<-chan string) <-chan []string
}
```

The Tokenizer interface allows us to respond to different input formats and parse each one into a token stream. We then have a common processing function that acts on this token stream to produce an output. I omitted it here for brevity, but the Tokenizer interface works with an internal function to parse this token stream into an output. The Tokenizer.ProcessLines method sends the result of that into the output channel.

HTTP Streaming

While I went into detail on how we handled the batch processing case, I didn’t touch much upon the HTTP streaming case. The reason is that the big benefits we saw in HTTP streaming came not so much from Go’s feature set as from Go’s speed, which let us invest in new features. Our original HTTP server had the following characteristics:

Time from HTTP request to data being targetable: in the time scale of minutes

Authentication/authorization: primitive, only allowing for certain “namespaces”

Data “richness” (custom timestamps, additions/removals): limited

We followed a similar process to the batch processing method. By using the HTTP server included in the standard library, we simplified our dependencies and put the application directly in front of real traffic. Further, http.Server’s default behavior of spinning up a goroutine to handle each request means that we get concurrent request handling built in. Again, we used runtime/pprof to ensure that we were finding our bottlenecks and effectively enabling parallelism.

But the benefits really showed in the features. In addition to replacing an HTTP server with one that could handle requests faster and more efficiently, we were able to add another component into the system: sending requests to our bidding infrastructure in real time. By allowing a user to specify the region (effectively, a continent) they are in, we can bypass any minute-scale batch-processing system and send data directly to the bidders, which use it to target users within a <10-second timeframe.

Further, the simplicity and speed with which we built this meant that we could adapt it to a microservice-oriented architecture and offer this as a service that other systems could use. The way this was implemented also allowed us to support richer forms of data, including support for custom removal logic and timestamps that took effect in near-real-time.

We then tackled authentication and authorization. The previous system effectively put up a wall between data sent by data providers and data sent by clients, bridged by an internal mapping process. Instead, we simplified the process to check natively against an authorization service, allowing providers to send data on behalf of clients.

Sympathy for the DevOps

But what our Ops team loves most about Go is that it compiles to a statically linked binary. I discussed previously the dependency management issue we faced in our old method of ingesting data, but there was another wrinkle. By writing our code in an interpreted, dynamically-run language, Perl, we necessarily introduced a runtime dependency into our deployment process. While this can be straightforward to manage, it doesn’t handle cases like code written against an old version of the runtime. If code is written against an old version and not updated to a new one, complicated measures need to be taken to ensure that production servers are running the right versions of these runtimes. This can lead (and actually has led) to hard-to-debug issues in production.

Summary and Results

So how did all of this net out?

Batch processing times went down from 6–24 hours to under two hours

Batch processing capacity rose an order of magnitude

Streaming activation times went down from a time scale of minutes to a time scale of seconds

“TTL” from development to deployment went down

Parity issues in production went down

LOC went down

Wait, what? LOC went down? Yes! It turns out that by splitting our processing apart to scale with extensible formats, we had been dramatically duplicating code. By using Go’s primitives to add concurrency and extensibility, we were able to reduce lines of code while increasing speed, efficiency, and functionality.

OK, so what does all of this mean? Well, your mileage may vary. But we have found tremendous gains by restructuring our data ingestion infrastructure, with relatively little effort. Hopefully you can find similar gains.