Mike Doherty (u/Kaitaan)

Senior Engineer, Data Engineering

This post is the first installment of a three-part mini-series on data architecture at Reddit, starting with Reddit’s data “origin story” and finishing with our present-day practices.

Last year, we shared a few blog posts explaining some of the interesting technical challenges we face here at Reddit. From fixing search (again), to how we route requests to different stacks, to (one of my favourites) counting the number of people who’ve viewed a piece of content, among several others, we’ve covered a very small part of what our technical teams do here. One area we haven’t covered much yet is our data systems.

There’s always room for improvement in our systems, whether it’s adding more functionality, scaling to handle more throughput, or throwing everything out the window and starting from scratch. I’ve been here for all of it. And as Reddit’s very first dedicated data engineering hire, I get credit for more than my fair share of mistakes and bad decisions along the way.

This blog post will be the first of three taking you through the history of data engineering at Reddit. Over the course of the series, I’m going to talk a bit about what we did, and why we did it. More importantly, I’ll talk a bit about what problems we ran into along the way, and what we did to solve them.

During my on-site interview at Reddit, I remember asking what the current data architecture looks like. What technologies are used? What do the data pipelines and ETL systems look like? How many people are working on this, and how does it integrate into product decisions? The answers astounded me: Reddit used the free tier of Google Analytics. There were no pipelines or ETL. There was one PM manually running a small collection of Pig scripts on EMR, parsing HAProxy logs to get DAU/MAU numbers. This was a golden opportunity to build up data systems from scratch for a company with a massive user base, and it was an opportunity I didn’t plan to miss out on.

The early days

I still remember the first time I met our co-founder Alexis Ohanian. As I explained what I was hoping we could accomplish by having product instrumentation in place, he said to me, “Back in the early days, we just had to make all our decisions based on gut instinct.” That was a pretty good summary of how things were done back then: gut instinct. Reddit had to be careful about changes; if a change was rolled out and users complained, we had no effective way to determine whether those complaints came from a vocal minority or were representative of all users.

The new data team at Reddit had our work cut out for us: make Reddit a data-driven company. Change product development from a world of gut-instinct changes and reacting solely to user feedback, to a world where data drives not only product ideas, but implementation, experimentation, and iteration as well. Make Reddit a data-literate company: a place where any employee can find top-line metrics and analyze everything relevant to their space.

Digging through access logs

Initially, we had two primary sources of data from which to pull insights (besides the aforementioned Google Analytics): HAProxy logs and pixel-tracking logs. Between them, these logs would eventually tell us which posts and comments receive the most engagement, as well as where, when, and on what device people use Reddit (with a handy breakdown of third-party app traffic vs. web traffic), along with other critical insights into our community and products.

Pig scripts were a great start, and while they provided invaluable data at the time, we quickly realized we were going to need a more robust approach to get the insights we wanted. At the time, Reddit had very limited engineering resources to dedicate to this work (only one engineer), so we needed an approach that would let us iterate on changes quickly, add new data sources with minimal effort, and scale to Reddit’s traffic volumes right out of the gate (for context, HAProxy logs alone are more than 500 GB per day). To this end, we opted to start with a simple MapReduce job to create our first iteration of a data warehouse. This would let us generate basic but far more usable output than the raw logs, and, since the raw logs were deleted after 90 days, it would let us keep data indefinitely.
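To give a flavor of what that first job looked like (the original code is long gone, so the log fields, names, and parsing below are assumptions rather than the real thing), here’s a minimal sketch in Python using mrjob that counts requests per path per day from raw HAProxy lines:

```python
import re

from mrjob.job import MRJob

# HAProxy's HTTP log format varies by configuration; this regex, which grabs
# the accept date and the quoted request line, is illustrative only.
LINE_RE = re.compile(r'\[(\d{2}/\w{3}/\d{4}):[^\]]*\].*"(\S+)\s+(\S+)')


class DailyPathCounts(MRJob):
    """Count requests per (day, path) from raw HAProxy log lines."""

    def mapper(self, _, line):
        match = LINE_RE.search(line)
        if match:
            day, method, path = match.groups()
            yield (day, path), 1

    def reducer(self, key, counts):
        yield key, sum(counts)


if __name__ == "__main__":
    DailyPathCounts.run()
```

mrjob can run a job like this locally for testing or against EMR (for example, `python daily_path_counts.py -r emr s3://bucket/haproxy-logs/ --output-dir s3://bucket/warehouse/daily_path_counts/`, with made-up bucket paths), which matches the read-from-S3, write-back-to-S3 pattern described next.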

I had some experience with maintaining a Hadoop cluster in my previous job and had no desire to start down that road just yet. To do so would add a lot of operational overhead that we didn’t have the manpower to support. Furthermore, keeping a cluster up at all times when it’s only being used in bursts (daily or hourly, as logs became available to process) was also an expensive proposition. Using Amazon’s Elastic MapReduce allowed us to bring up and shut down clusters on demand, and since the data would come from and be written back out to S3, we had no need to keep a cluster up to retain data in HDFS.
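Boto didn’t look then the way it does now, but as a rough sketch of that on-demand pattern (using today’s boto3, with made-up bucket names, instance types, and script paths), spinning up a transient EMR cluster that runs one streaming step and then terminates might look something like this:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Everything below (names, sizes, S3 paths) is illustrative only.
response = emr.run_job_flow(
    Name="hourly-haproxy-processing",
    ReleaseLabel="emr-6.15.0",
    LogUri="s3://example-bucket/emr-logs/",
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        # Terminate the cluster as soon as the steps finish; S3 holds the data,
        # so nothing needs to stay resident in HDFS.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "daily-path-counts",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", "s3://example-bucket/jobs/mapper.py,s3://example-bucket/jobs/reducer.py",
                "-mapper", "mapper.py",
                "-reducer", "reducer.py",
                "-input", "s3://example-bucket/haproxy-logs/2014-01-01/",
                "-output", "s3://example-bucket/warehouse/daily_path_counts/2014-01-01/",
            ],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])
```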

Making logs queryable

In order to do any real analysis on this data, we had to be able to query it. Writing a script or a MapReduce job every time you want to dig into the data obviously wouldn’t scale, so we needed a simple way to write queries. Additionally, with the volume of raw data available, it made no sense to run every query against the raw logs.

Enter Apache Hive.

With Hive, we were able to define a warehouse and tables, and to write SQL queries that would be transformed into MapReduce jobs and run across our EMR cluster. Hive relies on a metastore to keep all information about tables (schemas, data storage locations, etc.). By default, this metastore lives on the Hadoop cluster itself, but you can configure it to live in an external location. By putting our metastore on a separate MySQL instance, we kept the ability to spin up and shut down clusters on demand, which is something we do to this day (though not in EMR anymore).
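As a sketch of what that looks like in practice (the host, schema, and S3 locations here are invented, and the real table definitions were more involved), you can point an external Hive table at data sitting in S3 and query it like any other table, for example via PyHive:

```python
from pyhive import hive

# Hypothetical HiveServer2 host; the metastore behind it is the external MySQL
# instance described above, so table definitions survive the cluster being torn down.
conn = hive.connect(host="hive-master.example.internal", port=10000, username="etl")
cursor = conn.cursor()

# An external table is just metadata: dropping it leaves the S3 data untouched.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS haproxy_requests (
        ts        STRING,
        client_ip STRING,
        method    STRING,
        path      STRING,
        status    INT
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION 's3://example-bucket/warehouse/haproxy_requests/'
""")

# This SQL gets compiled into MapReduce jobs that run across the EMR cluster.
cursor.execute("""
    SELECT path, COUNT(*) AS requests
    FROM haproxy_requests
    WHERE dt = '2014-01-01'
    GROUP BY path
    ORDER BY requests DESC
    LIMIT 20
""")
for path, requests in cursor.fetchall():
    print(path, requests)
```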

Extract, Transform, Load (ETL): V1

While getting our HAProxy and pixel logs into a queryable format in Hive was the first big step, we needed some tools to run queries on a schedule, build reporting and aggregation tables, and visualize the data. Our first version of ETL used Azkaban for dependency management and Jenkins for scheduling. Jenkins is typically used for scheduling builds, but we were able to use it to talk to an Azkaban server and trigger ETL jobs. On an hourly or daily basis, Jenkins would run a Python script that constructed a SQL query (or series of queries) from a set of input parameters, then submit the job to Azkaban, which would run the jobs and manage the interdependencies between them. Once this system was up and running, Jenkins was also configured to trigger the HAProxy and pixel processing jobs.
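In spirit, each of those Python scripts was a small query-templating-and-submission wrapper. Here’s a simplified sketch (hostnames, table names, flow names, and the `hive_query` property are invented; the Azkaban call follows its documented AJAX API rather than our exact setup):

```python
import datetime

import requests

AZKABAN_URL = "https://azkaban.example.internal"  # hypothetical host

QUERY_TEMPLATE = """
INSERT OVERWRITE TABLE daily_pageviews PARTITION (dt='{dt}')
SELECT path, COUNT(*) AS views
FROM haproxy_requests
WHERE dt = '{dt}'
GROUP BY path
"""


def build_query(run_date: datetime.date) -> str:
    """Fill the date parameter into the query, as the Jenkins-driven scripts did."""
    return QUERY_TEMPLATE.format(dt=run_date.isoformat())


def trigger_flow(session_id: str, project: str, flow: str, query: str) -> dict:
    """Kick off an Azkaban flow via its AJAX API, passing the rendered query along."""
    resp = requests.get(
        f"{AZKABAN_URL}/executor",
        params={
            "ajax": "executeFlow",
            "session.id": session_id,
            "project": project,
            "flow": flow,
            # Flow properties can be overridden per execution; the property
            # name "hive_query" is made up for this sketch.
            "flowOverride[hive_query]": query,
        },
    )
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    yesterday = datetime.date.today() - datetime.timedelta(days=1)
    print(trigger_flow("my-session-id", "etl", "daily_pageviews", build_query(yesterday)))
```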

After so many years without a robust data system, having answers to key questions at our fingertips was great. Our new system gave us a lot of power to determine whether we were building the right product features for our community and to estimate the potential impact of new ideas.

And yet, there was still one missing piece. We needed to be able to visualize the data. After all, charts and graphs are way easier to interpret at a glance than sequences of numbers!

Who wants a book with no pictures?

When we started our search, we spoke with several vendors and looked at a variety of data-visualization tools. In the end, we settled on a third-party, closed-source product, largely to lighten the load of maintaining and supporting something internally with our limited resources. The tool we chose did not support Hive as a backend, and given how slow even relatively basic Hive queries were, Hive wouldn’t have made much sense for reporting metrics anyway. A dashboard that takes 10 minutes to load isn’t a particularly useful dashboard! Fortunately, a MySQL database on Amazon’s RDS was a simple solution: since the volume of data in a reporting table is small, it would scale for the foreseeable future, and our ETL jobs could write out to it with a simple modification.
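That “simple modification” amounted to having jobs push their final aggregates into the reporting database. A minimal sketch (the endpoint, table, metric names, and values are all placeholders) of what that write might look like with PyMySQL:

```python
import pymysql

# Hypothetical daily aggregates produced by an ETL job; values are placeholders.
rows = [
    ("2014-06-01", "daily_pageviews", 123456789),
    ("2014-06-01", "posts_created", 123456),
]

conn = pymysql.connect(
    host="reporting.example.rds.amazonaws.com",  # made-up RDS endpoint
    user="etl",
    password="REPLACE_ME",
    database="reporting",
)
try:
    with conn.cursor() as cur:
        # REPLACE (assuming a unique key on (dt, metric)) keeps re-runs idempotent.
        cur.executemany(
            "REPLACE INTO topline_metrics (dt, metric, value) VALUES (%s, %s, %s)",
            rows,
        )
    conn.commit()
finally:
    conn.close()
```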

This was a great start, but one major piece was still missing: making it easy for the less technical folks at the company to query data. With the above system, you had to SSH into the Hive cluster to run your queries. Wide-open access aside, that’s a painful approach for anyone. It also meant that we couldn’t easily scale to a huge number of queries. We could spin up a new EMR cluster every time, but there’s a nontrivial amount of technical overhead involved in that.

Fortunately, we found a third-party vendor that provides a nice centralized system for running queries, configuring and managing clusters (Hive, in our case, though they support a number of technologies), access controls, and more. This allowed us to create a number of clusters, and anyone could log in to run their queries.

And lo, Reddit entered the Bronze Age of Data! Tune in next time for our rocky trip into the Iron Age!

Interested in joining Reddit’s growing team of engineers? Check out our Careers page for a list of open positions.