Transcript

What is Reddit?

I will get started. If you have used Reddit before, you have probably seen this a few times. I am hoping you will see it less and less, but you have definitely seen it. And this talk will hopefully help you understand why. If you have not used Reddit, then how about a quick explanation? Reddit is the front page of the internet: it is a community hub, and a place for people to talk about everything they are interested in.

Reddit by the Numbers

But, more importantly, for this topic, Reddit is a really big website. We are currently the 4th largest in the U.S., according to Alexa, and serve 320 million users every month, doing all sorts of stuff, like posting a million times a day, and casting 75 million votes. It kind of adds up.

Major Components

So, let's dig into what the site looks like. This is a very high-level overview of the architecture of Reddit, it is focused only on the parts of the site that are involved with the core experience of the site. So, I'm leaving out some really interesting stuff, like all of our data analysis and the ad stack, all that kind of stuff. But this is the core of the Reddit experience.

The other thing to know about this diagram is that it is very much a work in progress. I made a diagram like this a year ago, and it looked nothing like this. And this also tells you a whole lot about our engineering organization, as much as it tells you about the tech that we use.

r2: The Monolith

So, that's actually really interesting here. In the middle here, the giant blob, is r2. That's the original monolithic application that is Reddit and has been Reddit since 2008. That's a big Python blob and we will talk about that in a bit more detail.

Node.js Frontend Applications

The front-end engineers at Reddit got tired of the pretty outdated stuff we have in r2, so they are building out these modern front-end applications. These are all in Node, and they share code between the server and the client. They each act as an API client themselves, so they talk to APIs provided by our API gateway, or by r2 itself, acting just like your mobile phone or any other API client out there.

New Backend Services

We are also starting to split up r2 into various back-end services, and these are all highlighted here. The core thing here is that they are the focus of individual teams. So you can imagine that there's an API team, there's a listing team, there's a Thing team. I will explain a little bit more what those mean later. They are written in Python, which has helped us split stuff out of the existing Python monolith, and they are built on a common library that lets us not reinvent the wheel every time we do this, and it comes with monitoring and tracing and that kind of stuff built in. On the back-end, we use Thrift; Thrift gives us nice, strong schemas. And we allow for HTTP on the front-end, through the API gateway, so we can still talk to them from the outside.

CDN

Finally, we have the CDN in front; that's Fastly. And if you saw the talk earlier, they do some pretty cool stuff. One thing we use it for is doing a lot of decision logic out at the edge, figuring out which stack we're going to send that request to, based on the domain that is coming in, the path on the site, and any cookies the user has, including, perhaps, experiment bucketing. That's how we can have all of these multiple stacks that we're starting to split out, and still have one reddit.com.
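The routing decision described above can be sketched roughly like this. This is a toy illustration of the idea in Python, not Fastly's actual VCL or Reddit's real rules; the stack names, paths, and cookie are all made up:

```python
# Hypothetical sketch of edge routing logic: pick an origin stack based on
# the request's host, path, and cookies. All names here are illustrative.

def choose_backend(host, path, cookies):
    """Decide which origin stack should serve this request."""
    # Experiment bucketing via a cookie can override everything else.
    if cookies.get("frontend_beta") == "1":
        return "node-frontend"
    # Some paths are served by the newer Node applications.
    if path.startswith("/about"):
        return "node-frontend"
    # The API subdomain goes straight to the API gateway.
    if host == "api.reddit.com":
        return "api-gateway"
    # Everything else falls through to the r2 monolith.
    return "r2"
```

The point is that because this logic lives at the edge, multiple back-end stacks can sit behind a single domain without any of them knowing about the others.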

r2 Deep Dive

So, since r2 is the big blob, and it is complicated and old, we will dig into its details. The giant monolith here is a very complicated beast with its own weird diagram. We run the same code on every one of the servers here; it is monolithic. Each server might run different parts of that code, but the same stuff is deployed everywhere. The load balancer is in the front; we use HAProxy. The point of that is to take the request the user makes and split it into various pools of application servers. We do that to isolate different kinds of request paths, so that if, say, a comments page is going slow today because of something going on, it doesn't affect the front page for other people. That has been very useful for us for containing these weird issues that happen.

We also do a lot of expensive operations when a user does things like vote or submit a link, and we defer those to an asynchronous job queue via RabbitMQ: we put the message in the queue and the processors handle it later, usually pretty quickly. Then there is the memcached and Postgres section. We have a core data model, which I will talk about, called Thing, and that is what you would consider most of the guts of Reddit. Accounts, links, subreddits, comments: all of that is stored in this data model called Thing, which lives in Postgres with memcached in front of it.
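The deferral pattern described here can be sketched with Python's standard library, using an in-process queue and thread as stand-ins for RabbitMQ and the job processors; the message shape is made up for illustration:

```python
import queue
import threading

# Toy sketch of deferring expensive work to an async job queue. The
# queue.Queue stands in for RabbitMQ; the worker thread stands in for
# a pool of job processors.

jobs = queue.Queue()
processed = []

def enqueue_vote(user_id, link_id, direction):
    # The web request just drops a message on the queue and returns
    # immediately, keeping the request path fast.
    jobs.put({"user": user_id, "link": link_id, "dir": direction})

def worker():
    # A processor picks messages up later, usually quite quickly.
    while True:
        msg = jobs.get()
        if msg is None:          # sentinel: shut the worker down
            break
        # Real work would update scores, listings, anti-cheat, etc.
        processed.append(msg)
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()
enqueue_vote("u1", "t3_125", +1)
jobs.put(None)
t.join()
```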

And finally, we use Cassandra very heavily. It has been in the stack for seven years now; it has been used for a lot of the new features ever since it came on board, and it has been very nice for its ability to stay up with one node down, that kind of thing. Cool. So, that was a bit about the structure of the site itself. Let's talk about how some of the parts of the site work, starting with listings.

Listings

So, a listing is kind of the foundation of Reddit. It is an ordered list of links. You could naïvely think about this as selecting links out of the database with a sort. You will see these on the front page, in subreddits, and so on.

Cached Results

The way we actually do it is not by running that select against the database every time. Initially, what would happen is the select would run once, and the result would be cached as a list of IDs in memcached. That way you can fetch the list of IDs easily, and then look up the links by primary key, which is very easy as well. So that was a nice system, and it worked great.
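A minimal sketch of that scheme, with plain dicts standing in for memcached and Postgres (the data and cache key are made up for illustration):

```python
# Sketch of the original cached-listing scheme: run the sort query once,
# cache the resulting IDs, then serve the listing via cheap primary-key
# lookups. Dicts stand in for memcached and Postgres.

cache = {}
links_by_id = {
    123: {"id": 123, "score": 50},
    124: {"id": 124, "score": 90},
    125: {"id": 125, "score": 70},
}

def run_sort_query():
    # The expensive part: SELECT id FROM links ORDER BY score DESC
    return [link["id"] for link in
            sorted(links_by_id.values(), key=lambda l: l["score"], reverse=True)]

def get_listing(key="hot"):
    ids = cache.get(key)
    if ids is None:              # cache miss: run the query, cache the IDs
        ids = run_sort_query()
        cache[key] = ids
    # Hydrate the listing with fast primary-key lookups.
    return [links_by_id[i] for i in ids]
```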

Vote Queues

Those listings need to be invalidated whenever you change something in them, which happens when you submit something but, most frequently, when you vote on something. So the vote queues end up updating the listings very frequently. We also have to do some other stuff in the vote processors, such as anti-cheat processing.

Mutate in Place

So, it turns out that running that select query, even only occasionally when you invalidate it, is expensive. But when you do something like voting, you have all the information you need to go and update that cached listing directly; you don't really need to re-run the query. What we do instead is store not just the ID, but the ID paired with the sort information for that thing. Then, when we process a vote, we fetch the current cached listing and modify it. In this example, we vote up on link 125, which moves it up in the list and changes the score it has there. And then we write it back. That is a read/mutate/write operation, which has the potential for race conditions, so we lock around it.
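The read/mutate/write step can be sketched like this, with a local threading lock standing in for the distributed lock around the listing (the data and key are illustrative):

```python
import threading

# Sketch of mutating a cached listing in place on a vote: the listing is a
# list of (link_id, sort_value) pairs; we read it, adjust one entry,
# re-sort, and write it back, all under a lock to avoid races.

cache = {"hot": [(124, 90), (125, 70), (123, 50)]}
lock = threading.Lock()   # stands in for a distributed lock on the listing

def apply_vote(listing_key, link_id, delta):
    with lock:                      # read/mutate/write must not interleave
        listing = cache[listing_key]
        updated = [(lid, score + delta if lid == link_id else score)
                   for lid, score in listing]
        updated.sort(key=lambda pair: pair[1], reverse=True)
        cache[listing_key] = updated   # write back the whole listing
```

Voting up link 125 by 30 points moves it above link 124, just as in the example from the talk: no query re-run, only the cached structure changes.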

"Cache"

So you will notice that, once we are doing that, we are not actually running the queries anymore. It is not really a cache anymore; it is its own first-class thing that we are storing. It is a persisted, denormalized index, really. So, at that point, they started being stored durably, and nowadays they live in Cassandra.

Vote Queue Pile-ups

So, I'm going to talk a little bit about something that went wrong. I mentioned that the queues usually process pretty quickly. Not always. Back in the middle of 2012, we started seeing the vote queues get really backed up in the middle of the day, particularly at peak traffic, when everybody is around. This would delay the processing of those votes, which is visible to users on the site: a submission would not get its score updated promptly. It would be sitting on the front page, but its score would be going up very slowly when it should have been much higher, only catching up many hours later when the queue finally processed.

Observability

So, why not add more scale, more processors? That made it worse. We had to dig in. We didn't have great observability at the time, so we couldn't figure out what was going on. We saw that the whole processing time for a vote was longer than before but, beyond that, who knows.

Lock Contention

So we started adding a bunch of timers, and when we narrowed it down, we realized it was the lock I mentioned that was causing the problems. The very popular subreddits on the site get a lot of votes; it makes sense, they are popular. When you have a bunch of votes happening at the same time, you are trying to update the same listing from all of them at once, and they are all just waiting on the lock. So adding more processors just added more of them waiting on the lock, and it didn't help at all.

Partitioning

So what we did to fix this is we partitioned the vote queues. This is really dead simple: we took the subreddit ID of the link being voted on, did modulo 10 on it, and used that to put the vote into one of 10 different queues. So if you are voting on a link in subreddit 111, you go to vote queue 1; for another subreddit, you might go to vote queue 7. We had the same total number of processors in the end, but they were divided across the partitions, so there were fewer vying for any single lock at the same time. It worked really well. Smooth sailing forever. Not really.
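The partitioning function is about as simple as it sounds; a sketch, with the queue-naming convention invented for illustration:

```python
# Sketch of partitioning the vote queues: take the subreddit ID of the link
# being voted on, modulo the number of queues, so all votes on links in the
# same subreddit land in the same partition (and contend only with each
# other for the listing lock).

NUM_PARTITIONS = 10

def vote_queue_for(subreddit_id):
    return "vote_queue_%d" % (subreddit_id % NUM_PARTITIONS)
```

For example, subreddit 111 maps to vote queue 1, and subreddit 7 maps to vote queue 7, matching the example in the talk.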

Slow Again

So, in late 2012, after just a few months of respite, we saw the vote queues slowing down again. The lock contention and processing time looked okay on average, but then we looked at the p99s, the 99th percentile of the timers, and we saw that some votes were going really poorly, and that was interesting. So we had to dig in and just start putting print statements in to see what was going on when a vote took that long. Pretty dumb, but it worked.

An Outlier

And what we found is that there was a domain listing. We have listings on the site for all of the links submitted to a given domain (sorry, the dangers of a touch screen), and the domain listing was the point of contention now. So, the votes were partitioned and no longer vying for the same lock for a subreddit, but votes in different partitions could still be vying for the same domain listing. This was not great, and it was causing a lot of issues.

Split up Processing

So, long story short, much later we comprehensively fixed this by splitting up the processing altogether. Instead of handling an entire vote in one single job processor, we now do a little bit of up-front work and then emit a bunch of messages that deal with different parts of the job. Those each go to the right partition for themselves, so they are not vying across partitions.

Learnings

The interesting stuff here is that you really need to have nice, granular timers in your code, and they also give you a cross-section. You get a lot of info from your p99s, and tracing, or some way of getting info out of the p99 cases, is really important as well for figuring out what's going on in the weird cases. This is also kind of obvious, but locks are really bad news for throughput. If you have to use them, you should be partitioning on the right thing.

Lockless Cached Queries

And so, going forward, we've got some new data models that we're trying out for storing those cache queries in a lockless way. It is pretty interesting, and it has been promising so far. But we haven't -- we haven't committed fully to it yet.

The Future of Listings

And, more importantly, we're starting to split out listings altogether. So we have this listing service, and the goal of it, with a team working on it, is to make relevant listings for users. The relevant listings can come from all sorts of sources, including the data analysis pipeline, machine learning, and these normal old listings like the rest of the site. So that's kind of the future here: listings extracted out into their own service, so r2 doesn't even need to know where they come from anymore. Cool.

Things

So you have listings of things, but what about the things themselves? As I said earlier, Thing lives in Postgres and memcached. This is the oldest data model in Reddit's r2, and it is a pretty interesting one. I'm getting some smiles. The thing to know about it is that it is designed to take away certain pain points, and also to make it hard to accidentally do something expensive, like joins. It is vaguely schemaless and very key-value.

Tables

There's one Thing per noun on the site, like a subreddit Thing, and each is represented by a pair of tables in Postgres. The Thing table looks like this (abbreviated); the idea is that there is one row for each Thing, or object, that exists. And there is a set of fixed columns there, which covered everything that Reddit in the early days needed to run the basic select queries that power the site.

So, that's all the stuff you would sort and filter on to make a listing back in the day. The data table, however, has many rows per Thing object. Each row has a key and a value, and together they make up a kind of bag of properties for that Thing. This has been pretty neat for Reddit in terms of the ability to make changes and new additions to the site without having to go in and alter a table in production. It is cool in that way, but it also comes with a lot of performance issues. So it is interesting.
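The two-table pattern can be sketched like this; the column names and data are illustrative, not the exact production schema:

```python
# Sketch of the Thing data model: one fixed-column "thing" row per object,
# plus a bag of key/value rows in the companion "data" table. Reassembling
# an object means merging the fixed columns with its property bag.

thing_rows = {
    # thing_id -> the fixed columns every Thing has (used to sort/filter)
    125: {"ups": 100, "downs": 2, "deleted": False, "date": "2017-04-01"},
}

data_rows = [
    # (thing_id, key, value): schemaless properties added over the years,
    # no ALTER TABLE needed to introduce a new one
    (125, "title", "TIL about Thing"),
    (125, "url", "https://example.com"),
]

def load_thing(thing_id):
    # Merge fixed columns with the property bag for one object.
    obj = dict(thing_rows[thing_id])
    for tid, key, value in data_rows:
        if tid == thing_id:
            obj[key] = value
    return obj
```

The flexibility is clear, and so is the cost: every object load touches many rows, which is part of where the performance issues come from.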

Thing in PostgreSQL

Thing in Postgres is done as a set of tables living in a single database cluster. The primary in the cluster handles writes, and then we have a number of read-only replicas that we replicate to asynchronously. r2 connects to the databases and prefers to use the replicas for read operations, so that reads scale out better. At the time, it would also do this thing where, if a query failed, it would guess that the server was down and try not to use it again in the future.

Thing in Memcached

Thing also works with memcached, which reduces load on the read-only replicas. The object is serialized and popped into memcached; r2 reads from the cache first and hits the database on a miss. And we write to memcached when making a change, rather than deleting the key and allowing it to be re-populated on the next read.
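That read-through/write-through pattern looks roughly like this; dicts stand in for memcached and the Postgres replicas, and the key format is made up:

```python
# Sketch of Thing's caching: read memcached first, fall back to the
# database on a miss, and on a change write the new value straight into
# the cache instead of just invalidating it.

mc = {}                              # stands in for memcached
db = {"t3_125": {"score": 70}}       # stands in for Postgres

def read_thing(key):
    obj = mc.get(key)
    if obj is None:                  # miss: hit a replica, populate cache
        obj = db[key]
        mc[key] = obj
    return obj

def write_thing(key, obj):
    db[key] = obj
    mc[key] = obj                    # write-through: no delete-and-refill
```

Writing through, rather than deleting, avoids a stampede of readers all missing the cache and hitting the replicas at once after every change.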

Incident

So in 2011, we were waking up a lot with these errors. We would wake up suddenly with an alert saying that replication to one of the secondaries had crashed. This meant that that database was getting more and more out of date as time went on, so we had to fix it. The immediate thing to do is to take it out of replication and out of usage on the site, start rebuilding it, and go back to bed.

But then, the next day when we woke up, we started seeing that some of the cached listings were referring to items that didn't exist in Postgres, which is a little terrifying. You can see the cached listing here, 1, 2, 3, 4, but the Thing table only has 1, 2, and 4. Not a great thing to see. This caused the pages that needed that data to crash: they were looking for it, it wasn't there, and they just died. We built a lot of tooling at the time to clean up the listings, to find the bad data that shouldn't be there and remove it from them. This was obviously really painful; we were looking into a lot of things going on there.

And there were not a whole lot of us, either; about five people at the time. The issue started with the primary saturating its disk: it was running out of IOPS and everything was going slow. What did we do about that? Got beefier hardware. That's pretty good, right? Problem solved. Not really.

A Clue

So, a few months later, everything had been nice and quiet for a while, but we were doing some pretty routine maintenance and accidentally bumped the primary offline, and suddenly we see the replication lag alert firing. And looking at the logs from the application at the time, a lightbulb went off. I mentioned that r2 will try to remove a dead database from the connection pools. The code looks like this, very pseudo-code-y: we have, in configuration, a list of databases, and we consider the first in the list to be the primary. So when we are going to decide which database to use for a query, we take the list of databases, filter it down to the ones that are alive, take the first one off the list as the primary, and use the rest as secondaries. Then we choose, based on the query type, which server to use, and go for it. There's a bug there.

So what happens when it thinks the primary is down? It takes the primary out of the list, and now a secondary is the first item in the list, and we try writing to that secondary. Well, we didn't have proper permissions set up, so that worked. You can write to the secondary. Oops. So, you wrote to the secondary, it created the Thing, we wrote it into the cached listing, and all looked good. Then we take the secondary out and rebuild it, and the data is gone. Not good.
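A rough reconstruction of that selection logic, in the pseudo-code spirit of the talk (this is not Reddit's actual source; the names are invented). The bug is that nothing checks whether `alive[0]` is really the configured primary:

```python
# Reconstruction of the buggy database-selection logic. The first entry in
# the configured list is assumed to be the primary; when it is marked dead,
# a read-only replica silently takes its place at the head of the list.

DATABASES = ["db-primary", "db-replica-1", "db-replica-2"]  # first = primary
dead = set()   # servers the app has decided are down

def pick_database(query_type):
    alive = [db for db in DATABASES if db not in dead]
    primary, secondaries = alive[0], alive[1:]
    if query_type == "write":
        return primary            # BUG: may actually be a replica!
    return secondaries[0] if secondaries else primary
```

Once the real primary lands in `dead`, writes quietly go to `db-replica-1`, and with no write permissions denying them, they succeed and are later destroyed when the replica is rebuilt.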

Learnings

So, yeah, use permissions. They are really useful. They are not just there to annoy people. They are very helpful. And if you do denormalize things like the cached listings, it is really important to have tooling for healing them.

Discovery

And, going forward, some changes we're making: new services use our service discovery system, which is pretty standard across our whole stack, to find databases. So they don't have to implement all of this logic themselves, which reduces complexity and makes it a battle-tested component.

Thing Service

And also, finally, we're starting to move the whole Thing model out into its own service. The initial reason for this was that we had other new services coming online, and they wanted to know about the core data in Reddit. So we had this service that started out just being able to read the data, and now it is starting to take over the writes as well and move it all out of r2. A huge upside to this is that all of that code in r2 had a lot of legacy and a lot of weird, twisted ways it had been used. So by pulling it out and doing this exercise, it is now going to be separate and clean and something that we can completely re-think as necessary.

Comment Trees

Cool. So, another major thing. I said that Reddit is a place for people to talk about stuff. They do that in comment trees. An important thing to know about comments on Reddit is that they are threaded. This means that replies are nested, so you can see the structure of a conversation. They can also be linked to deep within that structure. This makes it a bit more complicated to render these trees. It is pretty expensive to go and say, okay, there are 10,000 comments on this thread, I need to look up all 10,000 comments, find out what their parents are, etc.

So we store (hey, another denormalized listing) the parent relationships of that whole tree in one place, so we can figure out ahead of time which subset of comments we're going to show right now and only look at those. This is also kind of expensive to do, so we defer it to offline job processing.

One advantage of the comment tree stuff is that we can batch up messages and mutate the trees in one big batch, which allows for more efficient operations on them. A more important thing to note is that the tree structure is sensitive to ordering. If I insert a comment and its parent doesn't exist, that is weird in a tree structure, right? So that needs to be watched out for, because things happen, and the system had some stuff to try to heal itself in that situation, where it will re-compute the tree, or try to fix it up.
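The denormalized parent structure, and its sensitivity to ordering, can be sketched like this (a toy model; the real storage and healing logic are more involved):

```python
# Sketch of the denormalized comment-tree structure: every comment's parent
# is stored in one place, so picking the subset of comments to render does
# not require loading the whole thread. Inserting a child before its parent
# leaves the tree inconsistent, which is what the self-healing deals with.

parents = {}   # comment_id -> parent_id (None for top-level comments)

def add_comment(comment_id, parent_id):
    # Ordering matters: a child must not arrive before its parent.
    if parent_id is not None and parent_id not in parents:
        raise ValueError("parent %s missing; tree needs recompute" % parent_id)
    parents[comment_id] = parent_id

def children_of(parent_id):
    return [c for c, p in parents.items() if p == parent_id]

add_comment(1, None)   # top-level comment
add_comment(2, 1)      # reply to comment 1
add_comment(3, 1)      # another reply to comment 1
```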

Fastlane

The processing for that also has an issue: sometimes you end up with a megathread on the site. Some news event is happening, the Super Bowl is on, whatever. People like to comment. You have 50,000 comments in one thread, and that thread is now processing pretty slowly, which affects the rest of the site. So we developed a thing that allows us to manually mark a thread and say: this thread gets dedicated processing. It just goes off to its own queue, called the fast lane. Well, that caused issues.

Incident

In early 2016, there was a major news event, pretty sad stuff, and a lot of people were talking about it. The thread was making the processing of comments pretty slow on the site, and so we fast-laned it. And then everything died. Basically, what happened immediately was that the fast lane queue started filling up with messages very, very quickly, and rapidly filled up all of the memory on the message broker. This meant we couldn't add any new messages anymore, which is not great. In the end, the only thing we could do was restart the message broker, and lose all of those messages, to get back to steady state. So all of the other actions on the site that are deferred to queues were also messed up.

Self-"Healing"

What turned out to be the cause was that self-healing stuff for dealing with a missing parent. When we fast-laned the thread, the earlier comments in the thread, the ones from before the fast-laning, were sitting in the old queue, not yet processed. New messages in the fast lane skipped ahead of them, the site recognized that the tree was inconsistent, and every page view of the thread put a new message on the queue: please recompute me, I'm broken. Not good.

Start over

So we restarted Reddit, everything went back to normal, and now we use queue quotas, which are very nice. Resource limits let you prevent one thing from hogging all of your resources; it is important to use things like quotas. If we had had quotas turned on at the time, the fast lane queue would have started dropping messages. That thread would have been a bit more inconsistent, or comments wouldn't have shown up there for a little while, but the rest of the site would have kept working.
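A toy illustration of why a quota contains the damage: a bounded queue sheds its oldest messages once full, so a runaway producer degrades only its own queue instead of exhausting the whole broker's memory. (RabbitMQ expresses this as per-queue length limits; the `deque` here is just a stand-in.)

```python
from collections import deque

# Bounded queue standing in for a fast-lane queue with a quota. Once it is
# full, the oldest messages are dropped, so memory use stays bounded no
# matter how fast the "please recompute me" storm arrives.

QUOTA = 3
fastlane = deque(maxlen=QUOTA)   # drops from the head when full

for i in range(10):              # runaway recompute storm
    fastlane.append("recompute-%d" % i)

# The queue never grows past its quota; the broker stays healthy and every
# other queue on it keeps working.
```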

Auto-scaler

Cool, all right. This is a little more meta. We have a bunch of servers to power all of this stuff, and we need to scale them up and down. This is kind of what traffic to Reddit looks like over the course of a week. It is definitely very seasonal; you can see that it is about half at night what it is in the day. There are a couple of weird humps for different time zones, but it is pretty consistent overall.

So what we want to do with the auto-scaler is save money off-peak and deal with situations where we need to scale up because something crazy is going on. It does a pretty good job of that. What it does is watch the utilization metrics reported by our load balancers, and it automatically increases or decreases the number of servers that we are requesting from AWS. We offloaded the actual logic of launching and terminating instances to AWS auto-scaling groups, and that works pretty well. The way the auto-scaler knows what is going on out there is that each host has a daemon on it that registers its existence in a ZooKeeper cluster. This is a rudimentary health check.

And, of note, we were also using this system for memcached. It was not really auto-scaled, in that we were never scaling it up and down, but it was there so that if a server died, it would be replaced automatically.

Incident

So, in mid-2016, something went pretty wrong with the auto-scaler that taught us a lot. We were in the midst of making our final migration from EC2 Classic into VPC, which had a huge number of benefits for us in networking and security and that kind of thing, and the final component to move was the ZooKeeper cluster. This was the ZooKeeper cluster being used for the auto-scaler, so it was pretty important.

The Plan

The plan for this migration was to launch the new cluster into VPC, stop the auto-scaler services so they don't mess with anything during the migration, re-point the auto-scaler agents on each of the servers out in the fleet at the new ZooKeeper cluster, then re-point the auto-scaler services themselves, restart them, and act like nothing ever happened.

The Reality

What actually happened was a little different, unfortunately. We launched the new ZooKeeper cluster: working great. We stopped the auto-scaler services: cool. We started re-pointing things, and we got about a third of the way through that when suddenly about a third of our servers got terminated. It took us a moment to realize why, and then we all facepalmed rather heavily.

What Happened?

So what happened is that Puppet was running on the auto-scaler server. When it did its half-hourly run, it decided to restart the auto-scaler daemons, which were still pointing at the old cluster, and they saw all of the servers that had migrated to the new cluster as unhealthy and terminated them. Very helpful, thanks auto-scaler.

Recovery

And so, what could we do about that? Well, just wait a minute and the auto-scaler will bring a bunch of whole new servers up. That was relatively easy, except that the cache servers were also managed by the same system. And they have state, and the new ones that come up don't have that state.

So what this meant was that suddenly all of the production traffic was hammering the Postgres replicas, which could not handle it; they were not used to that many caches being gone at the same time. Yeah. Pretty reasonable.

Learnings

So what we learned from that was that things that do destructive actions, like terminating servers, should have sanity checks in them. Am I terminating a large percentage of the servers? I should not do that. We also needed to improve the process around the migration itself; having a peer-reviewed checklist, with extra layers of defense built in, would have been really helpful.

And the other important thing is that stateful services are pretty different from stateless services, you should treat them differently. So using the same auto-scaler technology for that is probably not the best idea.

Auto-scaler v2

The next-gen auto-scaler we are building has a lot of this stuff. It has tooling to recognize when it is affecting a larger number of servers than it really should and stop itself, and it uses our service discovery system to determine health instead of just an agent on the boxes. So we actually get a little more fine-grained detail, like whether the actual service is okay, not just the host. But, yeah. It will refuse to take action on large numbers of servers at once. Cool!
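The kind of sanity check described here can be sketched simply; the threshold and function names are invented for illustration, not the real implementation:

```python
# Sketch of a destructive-action sanity check: refuse to terminate more
# than a small fraction of the fleet in one pass. If that many servers
# look unhealthy at once, the fault is more likely in the health data
# (e.g. a mispointed ZooKeeper cluster) than in the servers themselves.

MAX_TERMINATE_FRACTION = 0.1   # illustrative threshold

def plan_terminations(fleet, unhealthy):
    doomed = [host for host in fleet if host in unhealthy]
    if len(doomed) > MAX_TERMINATE_FRACTION * len(fleet):
        # Something is probably wrong with us, not with a third of the fleet.
        raise RuntimeError("refusing to terminate %d of %d servers"
                           % (len(doomed), len(fleet)))
    return doomed
```

With a check like this, the mid-2016 incident would have stopped at a refused action and an alert instead of a third of the fleet disappearing.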

Remember the Human

So, in summary, observability is key here. People make mistakes, so use multiple layers of safeguards. Like, if we had turned off the auto-scaler and also put an exit command at the top of the script, nothing would have happened when Puppet brought it back up. Little extra things, so that no single failure can cause new issues.

And it is really important to make sure that your system is simple and easy to understand. Cool, thank you. So this is just the beginning for us. We are building a ton of stuff, and hiring. And also, on Thursday, the entire infraops team, many of whom are right there, are doing an AMA if you would like to ask some questions.

Live captioning by Lindsay @stoker_lindsay at White Coat Captioning @whitecoatcapx.