Pipeline Update Requirements

Here's what we needed the pipeline to handle:

- Deliver up-to-the-minute updates to our podcasters' analytics dashboards and to our internal aggregations.

- Keep our stack simple, with as few dependencies as possible. AWS has had two major outages in the last six months, and we want to limit the possibility of failure.

- Cheap! We're a startup; we need to watch our AWS bill and use those sweet credits (thanks AWS!) in moderation.

- Handle 100M events/month, roughly 50x our current load.

- SQL-based, so we can run custom queries and business analytics.

- Nice to have, but not essential: JavaScript-friendly development. Our entire stack is written in JavaScript, and every engineer can easily jump on any part of the application. I'd like to keep it that way.

Picking a new pipeline

Now that we have our requirements, we can pick the system that will serve us best. A time-series database would suit this type of application particularly well. However, the managed options are limited, noticeably more costly, and offer little portability.

Kinesis' offering is very appealing; let's explore…

Kinesis Streams or Firehose?

Streams offers data in close to real time, but it's a bit more tedious to set up and requires “special” consumers. Firehose, on the other hand, is a managed service that delivers batches at intervals of at least 60 seconds, and it connects to standard services we're already familiar with.

I think we have a winner: let’s start with Firehose.
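On the producer side, sending an event to Firehose is a single SDK call. Here's a rough sketch of what that could look like in JavaScript; the stream name and payload shape are placeholders, not our actual setup:

```javascript
const AWS = require('aws-sdk');

const firehose = new AWS.Firehose({ region: 'us-east-1' }); // region is illustrative

// Hypothetical helper: pushes one analytics event onto the delivery stream.
function track(event, callback) {
  firehose.putRecord({
    DeliveryStreamName: 'listening-events', // placeholder stream name
    // Firehose concatenates records as-is into its S3 batches, so we
    // newline-delimit the JSON to make the batches easy to split later.
    Record: { Data: JSON.stringify(event) + '\n' },
  }, callback);
}
```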

Kinesis Firehose > S3 > Redshift

This setup could work really well: just load the incremental batches into Redshift and run our queries directly against it.

On the bright side, this would require only a minimal rewrite of our SQL queries, but there are a few “gotchas”… Redshift is built to process huge amounts of data, not to deliver consistently low latencies; it's best suited to BI tools. Also, keeping even a tiny cluster up continuously costs at least $180/month, and the scaling options are limited.

Maybe we could move some pieces around.

Kinesis Firehose > S3 > EMR > Redshift

We could have a winner here. This satisfies all the technical requirements, but we could do better on pricing: we've already used up all of the EC2 free-tier Elastic MapReduce (EMR) credit, and we don't intend to do much processing on the logs anyway. JavaScript is also not an option on EMR, which isn't ideal.

Kinesis Firehose > S3 > Lambda > RDS

Lambda's free tier is pretty generous! We're already running a few Lambda functions in production, so we're quite familiar with the setup, and we still have plenty of credit left. Did I mention Lambda supports Node 6.x out of the box? We've also built up a lot of experience with RDS that we can capitalize on. The scaling options are good here too, better than Redshift's.
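Concretely, the Lambda side could look something like the sketch below: an S3 event fires when Firehose delivers a batch, and the handler parses it and writes the rows to RDS. The `pg` driver, table, and columns are all assumptions for illustration, not our actual schema:

```javascript
const AWS = require('aws-sdk'); // bundled with the Lambda runtime
const { Client } = require('pg'); // hypothetical choice of Postgres driver

const s3 = new AWS.S3();

exports.handler = (event, context, callback) => {
  // Firehose just delivered a batch file to S3; the S3 event tells us where.
  const record = event.Records[0].s3;

  s3.getObject({
    Bucket: record.bucket.name,
    Key: decodeURIComponent(record.object.key),
  }).promise()
    .then((data) => {
      // One newline-delimited JSON event per line (see the producer sketch).
      const events = data.Body.toString('utf8')
        .split('\n')
        .filter(Boolean)
        .map(JSON.parse);

      const client = new Client(); // connection settings come from PG* env vars
      return client.connect()
        .then(() => events.reduce(
          // Row-by-row inserts for clarity; batching them into one multi-row
          // INSERT is what would actually keep us under the 5-minute timeout.
          (chain, e) => chain.then(() => client.query(
            'INSERT INTO plays (episode_id, played_at) VALUES ($1, $2)',
            [e.episodeId, new Date(e.timestamp)]
          )),
          Promise.resolve()
        ))
        .then(() => client.end());
    })
    .then(() => callback(null, 'done'))
    .catch(callback);
};
```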

The main concern now is hitting the 5-minute timeout on the Lambda function.

There’s only one way to know how many events we can squeeze into those 5 minutes: let’s do some load testing.

There is one final element we need to integrate to make the pipeline completely asynchronous: our listeners' playback sessions. They were previously stored in RDS, but that's no longer an option for us. Fortunately, we happen to have a small, underutilized Redis instance already running in production. Bingo!
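A rough sketch of what the session store could look like on Redis; the key naming, 24-hour TTL, and payload shape are all hypothetical:

```javascript
const redis = require('redis');

const client = redis.createClient(process.env.REDIS_URL);

// Keep a playback session around for 24 hours, then let Redis evict it.
function saveSession(sessionId, session, callback) {
  client.setex('session:' + sessionId, 86400, JSON.stringify(session), callback);
}

function getSession(sessionId, callback) {
  client.get('session:' + sessionId, (err, raw) => {
    if (err) return callback(err);
    callback(null, raw ? JSON.parse(raw) : null);
  });
}
```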

Load Testing

I wrote a small agent that fires an event with pseudo-random data every 40ms (1,500 events/minute, or about 65M/month).
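The agent itself boils down to a timer around the same kind of putRecord call as the producer sketch; the stream name and payload fields are again placeholders:

```javascript
const AWS = require('aws-sdk');

const firehose = new AWS.Firehose({ region: 'us-east-1' }); // region is illustrative

setInterval(() => {
  firehose.putRecord({
    DeliveryStreamName: 'listening-events', // placeholder stream name
    Record: {
      Data: JSON.stringify({
        episodeId: Math.floor(Math.random() * 1000), // pseudo-random payload
        timestamp: Date.now(),
      }) + '\n',
    },
  }, (err) => { if (err) console.error(err); });
}, 40); // 25 events/sec ≈ 1,500/minute ≈ 65M/month
```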

We'll run the tests against an RDS db.m4.large instance provisioned with a 100GB SSD drive, which gives us 300 IOPS. (We don't want to use a t2 instance's burst mode, since that would skew the measurements.)

Test 1: 100M events per month?

Running 2 agents, 3,000 events/minute for 5 minutes.

Result: The Lambda function was triggered after the 1-minute buffer elapsed. Processing time: 60,419ms (~1 min), and the CPU utilization chart didn't even move! That's pretty damn great, and it gives us plenty of room to grow.
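That 1-minute trigger comes straight from the delivery stream's buffering hints: Firehose flushes a batch to S3 as soon as either the time or the size threshold is reached, whichever comes first. Configured via the SDK, it would look roughly like this (stream, bucket, and role names are placeholders):

```javascript
const AWS = require('aws-sdk');

const firehose = new AWS.Firehose({ region: 'us-east-1' });

firehose.createDeliveryStream({
  DeliveryStreamName: 'listening-events', // placeholder stream name
  ExtendedS3DestinationConfiguration: {
    RoleARN: 'arn:aws:iam::123456789012:role/firehose-delivery', // placeholder
    BucketARN: 'arn:aws:s3:::listening-events-batches', // placeholder
    BufferingHints: {
      IntervalInSeconds: 60, // Firehose's minimum: flush at least every minute...
      SizeInMBs: 1,          // ...or as soon as 1MB of records has accumulated
    },
  },
}, (err) => { if (err) console.error(err); });
```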

Test 2: 250M events per month?

Running 4 agents, 6,000 events/minute for 5 minutes.

Result: The Lambda function was triggered once the 1MB buffer was reached, and multiple Lambda instances were invoked. Processing time for the longest execution: 121,900ms (~2 min); CPU stayed well below 10%.

Let’s push this even further.

Test 3: 400M events per month?

Running 6 agents, 9,000 events/minute for 5 minutes.

Result: Processing time was identical to the previous test, but the RDS metrics clearly show that the max IOPS had been reached. We got this pipeline to handle almost 400M events/month, way more than all the podcasts consumed in the USA: mind-blowing!