Let’s talk about DNS. After all, what could go wrong? It’s just cache invalidation and naming things.

tl;dr

This blog post is about how Stack Overflow and the rest of the Stack Exchange network approach DNS:

By benchmarking different DNS providers and choosing between them

By implementing multiple DNS providers

By deliberately breaking DNS to measure its impact

By validating our assumptions and testing implementations of the DNS standard

The good stuff in this post is in the middle, so feel free to scroll down to “The Dyn Attack” if you want to get straight into the meat and potatoes of this blog post.

The Domain Name System

DNS had its moment in the spotlight in October 2016, with a major Distributed Denial of Service (DDoS) attack launched against Dyn, which affected the ability of Internet users to connect to some of their favourite websites, such as Twitter, CNN, imgur, Spotify, and literally thousands of other sites.

But for most systems administrators or website operators, DNS is mostly kept in a little black box, outsourced to a 3rd party, and mostly forgotten about. And, for the most part, this is the way it should be. But as you start to grow to 1.3+ billion pageviews a month with a website where performance is a feature, every little bit matters.

In this post, I’m going to explain some of the decisions we’ve made around DNS in the past, and where we’re going with it in the future. I will eschew deep technical details and gloss over low-level DNS implementation in favour of the broad strokes.

In the beginning

So first, a bit of history: In the beginning, we ran our own DNS on-premises using artisanally crafted zone files with BIND. It was fast enough when we were doing only a few hundred million hits a month, but eventually hand-crafted zone files were too much hassle to maintain reliably. Then we moved to Cloudflare as our CDN. Their service is intimately coupled with DNS, so we demoted our BIND boxes out of production and handed off DNS to Cloudflare.

The search for a new provider

Fast forward to early 2016 and we moved our CDN to Fastly. Fastly doesn’t provide DNS service, so we were back on our own in that regard and our search for a new DNS provider began. We made a list of every DNS provider we could think of, and ended up with a shortlist of 10:

Dyn

NS1

Amazon Route 53

Google Cloud DNS

Azure DNS (beta)

DNSimple

GoDaddy

EdgeCast (Verizon)

Hurricane Electric

DNS Made Easy

From this list of 10 providers, we did our initial investigations into their service offerings, and started eliminating services that were either not suited to our needs, outrageously expensive, had insufficient SLAs, or didn’t offer services that we required (such as a fully featured API). Then we started performance testing. We did this by embedding a hidden iframe for 5% of the visitors to stackoverflow.com, which forced a request to a different DNS provider. We did this for each provider until we had some pretty solid performance numbers.

Using some basic analytics, we were able to measure the real-world performance, as seen by our real-world users, broken down into geographical area. We built some box plots based on these tests which allowed us to visualise the different impact each provider had.

If you don’t know how to interpret a boxplot, here’s a brief primer for you. For the data nerds, these were generated with R’s standard boxplot functions, which means the upper and lower whiskers are min(max(x), Q_3 + 1.5 * IQR) and max(min(x), Q_1 - 1.5 * IQR), where IQR = Q_3 - Q_1.
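If you would rather see the whisker formula as code, here is a minimal Python sketch of it. The function name and the sample timings are made up for illustration, and the quartile convention used by `statistics.quantiles` is an assumption that may differ slightly from the one R uses, so treat the numbers as approximate rather than a reproduction of our charts.

```python
import statistics


def boxplot_whiskers(samples):
    """Return (lower, upper) whisker positions using the formula quoted above."""
    # n=4 yields the three quartile cut points: Q1, median, Q3.
    q1, _median, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    iqr = q3 - q1
    # Whiskers never extend past the data, nor past 1.5 * IQR from the box.
    upper = min(max(samples), q3 + 1.5 * iqr)
    lower = max(min(samples), q1 - 1.5 * iqr)
    return lower, upper


# Hypothetical DNS timings in milliseconds, with one slow outlier.
timings_ms = [10, 12, 14, 16, 18, 20, 22, 24, 300]
print(boxplot_whiskers(timings_ms))
```

The 300ms outlier sits above the upper whisker, which is why a chart's scale can run much higher than its whiskers do.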

These are the results of our tests as seen by our users in the United States:

You can see that Hurricane Electric had a quarter of requests return in < 16ms and a median of 32ms, with the three “cloud” providers (Azure, Google Cloud DNS and Route 53) being slightly slower (being around 24ms first quarter and 45ms median), and DNS Made Easy coming in 2nd place (20ms first quarter, 39ms median).

You might wonder why the scale on that chart goes all the way to 700ms when the whiskers go nowhere near that high. This is because we have a worldwide audience, so just looking at data from the United States is not sufficient. If we look at data from New Zealand, we see a very different story:

Here you can see that Route 53, DNS Made Easy and Azure all have healthy 1st quarters, but Hurricane Electric and Google have very poor 1st quarters. Try to remember this, as this becomes important later on.

We also have Stack Overflow in Portuguese, so let’s check the performance from Brazil:

Here we can see Hurricane Electric, Route 53 and Azure being favoured, with Google and DNS Made Easy being slower.

So how do you reach a decision about which DNS provider to choose, when your main goal is performance? It’s difficult, because regardless of which provider you end up with, you are going to be choosing a provider that is sub-optimal for part of your audience.

You know what would be awesome? If we could have two DNS providers, each one servicing the areas that they do best! Thankfully this is something that is possible to implement with DNS. However, time was short, so we had to put our dual-provider design on the back-burner and just go with a single provider for the time being.

Our initial rollout of DNS used Amazon Route 53 as our provider: they had acceptable performance figures over a large number of regions and very cost-effective pricing (on that note, Route 53, Azure DNS, and Google Cloud DNS are all priced identically for basic DNS services).

The Dyn attack

Roll forwards to October 2016. Route 53 had proven to be a stable, fast, and cost-effective DNS provider. We still had dual DNS providers on our backlog of projects, but like a lot of good ideas it got put on the back-burner until we had more time. Then the Internet ground to a halt. The DNS provider Dyn had come under attack, knocking a large number of authoritative DNS servers off the Internet, and causing widespread issues with connecting to major websites. All of a sudden DNS had our attention again. Stack Overflow and Stack Exchange were not affected by the Dyn outage, but this was pure luck.

We knew if a DDoS of this scale happened to our DNS provider, the solution would be to have two completely separate DNS providers. That way, if one provider gets knocked off the Internet, we still have a fully functioning second provider who can pick up the slack. But there were still questions to be answered and assumptions to be validated:

What is the performance impact for our users in having multiple DNS providers, when both providers are working properly?

What is the performance impact for our users if one of the providers is offline?

What is the best number of nameservers to be using?

How are we going to keep our DNS providers in sync?

These were pretty serious questions. For some of them we had hypotheses that needed to be checked; others were answered in the DNS standards, but we know from experience that DNS providers in the wild do not always obey the DNS standards.

What is the performance impact for our users in having multiple DNS providers, when both providers are working properly?

This one should be fairly easy to test. We’ve already done it once, so let’s just do it again. We fired up our tests, as we did in early 2016, but this time we specified two DNS providers:

Route 53 & Google Cloud

Route 53 & Azure DNS

Route 53 & our internal DNS

We did this simply by listing nameservers from both providers in our domain registration (and obviously we set up the same records in the zones for both providers).

Running with Route 53 and Google or Azure was fairly common sense – Google and Azure had good coverage of the regions that Route 53 performed poorly in. Their pricing is identical to Route 53, which would make forecasting for the budget easy. As a third option, we decided to see what would happen if we took our formerly demoted, on-premises BIND servers and put them back into production as one of the providers. Let’s look at the data for the three regions from before: United States, New Zealand and Brazil:

United States

New Zealand

Brazil

There is probably one thing you’ll notice immediately from these boxplots, but there’s also another not-so obvious change:

Azure is not in there (the obvious one)

Our 3rd quarters are measurably slower (the not-so obvious one)

Azure

Azure has a fatal flaw in their DNS offering, as of the writing of this blog post. They do not permit the modification of the NS records in the apex of your zone:

You cannot add to, remove, or modify the records in the automatically created NS record set at the zone apex (name = “@”). The only change that’s permitted is to modify the record set TTL.

These NS records are what your DNS provider says are authoritative DNS servers for a given domain. It’s very important that they are accurate and correct, because they will be cached by clients and DNS resolvers and are more authoritative than the records provided by your registrar.

Without going too much into the actual specifics of how DNS caching and NS records work (it would take me another 2,500 words to describe this in detail), what would happen is this: Whichever DNS provider you contact first would be the only DNS provider you could contact for that domain until your DNS cache expires. If Azure is contacted first, then only Azure’s nameservers will be cached and used. This defeats the purpose of having multiple DNS providers: if the provider you’ve landed on (a roughly 50:50 chance) goes offline, you will have no other DNS provider to fall back to.

So until Azure adds the ability to modify the NS records in the apex of a zone, they’re off the table for a dual-provider setup.
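To make the requirement concrete: in a dual-provider setup, each provider's apex NS record set should list the nameservers of both providers, matching what is registered for the domain. Below is a rough Python sketch of that check. The function name, provider labels and hostnames are all hypothetical, and the record sets are passed in as plain data, so fetching them from real nameservers is left out.

```python
def apex_ns_mismatches(delegated, apex_ns_by_provider):
    """Return {provider: (missing, extra)} for providers whose apex NS
    record set differs from the nameservers registered for the domain."""

    def normalise(names):
        # Hostnames are case-insensitive and may carry a trailing dot.
        return {name.lower().rstrip(".") for name in names}

    wanted = normalise(delegated)
    problems = {}
    for provider, names in apex_ns_by_provider.items():
        served = normalise(names)
        if served != wanted:
            problems[provider] = (sorted(wanted - served), sorted(served - wanted))
    return problems


# Hypothetical delegation: two nameservers from each of two providers.
delegated = [
    "ns1.provider-a.example.", "ns2.provider-a.example.",
    "ns1.provider-b.example.", "ns2.provider-b.example.",
]
apex = {
    "provider-a": delegated,  # lists both providers: fine
    "provider-b": ["ns1.provider-b.example.", "ns2.provider-b.example."],  # only itself
}
print(apex_ns_mismatches(delegated, apex))
```

In this made-up example, provider-b is flagged because it only advertises its own nameservers, which is the situation a provider forces on you when it won't let you edit the apex NS records.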

The 3rd quarter

What the third quarter represents here is the impact of latency on DNS. You’ll notice that in the results for ExDNS (which is the internal name for our on-premises BIND servers) the box plot is much taller than the others. This is because those servers are located in New Jersey and Colorado – far, far away from where most of our visitors come from. So as expected, a service with only two points of presence in a single country (as opposed to dozens worldwide) performs very poorly for a lot of users.

Performance conclusions

So our choices were narrowed for us to Route 53 and Google Cloud, thanks to Azure’s lack of ability to modify critical NS records. Thankfully, we have the data to back up the fact that Route 53 combined with Google is a very acceptable combination.

Remember earlier, when I said that the performance of New Zealand was important? This is because Route 53 performed well, but Google Cloud performed poorly in that region. But look at the chart again. Don’t scroll up, I’ll show you another chart here:

See how Google on its own performed very poorly in NZ (its 1st quarter is 164ms versus 27ms for Route 53)? However, when you combine Google and Route 53 together, the performance basically stays the same as when there was just Route 53.

Why is this? Well, it’s due to a technique called Smoothed Round Trip Time (SRTT). Basically, DNS resolvers (namely certain versions of BIND and PowerDNS) keep track of which DNS servers respond faster, and weight queries towards those DNS servers. This means that the faster provider should be queried more often than the slower providers. There’s a nice presentation over here if you want to learn more about this. The short version is that if you have many DNS servers, DNS cache servers will favour the fastest ones. As a result, if one provider is fast in Auckland but slow in London, and another provider is the reverse, DNS cache servers in Auckland will favour the first provider and DNS cache servers in London will favour the other. This is a very little-known feature of modern DNS servers, but our testing shows that enough ISPs support it that we are confident we can rely on it.
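Here is a toy Python simulation of the idea, to show why the combined result tracks the faster provider. It is a simplified model and not the actual algorithm used by BIND or PowerDNS: the smoothing factor, the decay step (which lets the slow server get retried occasionally) and the fixed round-trip times are all assumptions of this sketch. The 27ms and 164ms figures are borrowed from the New Zealand first-quarter numbers above purely as example inputs.

```python
def simulate_srtt(rtts_ms, queries, alpha=0.3, decay=0.98):
    """Count how often each server is picked by a resolver that always
    queries the server with the lowest smoothed round trip time."""
    # Unknown servers start at 0 so that every server is tried at least once.
    srtt = {name: 0.0 for name in rtts_ms}
    counts = {name: 0 for name in rtts_ms}
    for _ in range(queries):
        chosen = min(srtt, key=srtt.get)
        counts[chosen] += 1
        # Exponentially weighted moving average of the observed RTT.
        srtt[chosen] = (1 - alpha) * srtt[chosen] + alpha * rtts_ms[chosen]
        # Slowly forget old measurements for servers we did not query,
        # so a slow server is eventually given another chance.
        for name in srtt:
            if name != chosen:
                srtt[name] *= decay
    return counts


counts = simulate_srtt({"route53": 27.0, "google": 164.0}, queries=1000)
print(counts)
```

In this model the slower server still receives the occasional query, which is how a resolver would notice if it became the faster option later on.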

What is the performance impact for our users if one of the providers is offline?

This is where having some on-premises DNS servers comes in very handy. What we can essentially do here is send a sample of our users to our on-premises servers, get a baseline performance measurement, then break one of the servers and run the performance measurements again. We can also measure in multiple places: We have our measurements as reported by our clients (what the end user actually experienced), and we can look at data from within our network to see what actually happened. For network analysis, we turned to our trusted network analysis tool, ExtraHop. This would allow us to look at the data on the wire, and get measurements from a broken DNS server (something you can’t do easily with a pcap on that server, because, you know. It’s broken).

Here’s what healthy performance looked like on the wire (as measured by ExtraHop), with two DNS servers, both of them fully operational, over a 24-hour period (this chart is additive for the two series):

Blue and brown are the two different, healthy DNS servers. As you can see, there’s a very even 50:50 split in request volume. Because both of the servers are located in the same datacenter, Smoothed Round Trip Time had no effect, and we had a nice even distribution – as we would expect.

Now, what happens when we take one of those DNS servers offline, to simulate a provider outage?

In this case, the blue DNS server was offline, and the brown DNS server was healthy. What we see here is that the blue, broken, DNS server received the same number of requests as it did when the DNS server was healthy, but the brown, healthy, DNS server saw twice as many requests. This is because those users who were hitting the broken server eventually retried their requests to the healthy server and started to favour it. So what does this look like in terms of actual client performance?

I’m only going to share one chart with you this time, because they were all essentially the same:

What we see here is that a substantial number of our visitors saw a performance decrease. For some it was minor, for others, quite major. This is because the 50% of visitors who hit the faulty server need to retry their request, and the amount of time it takes to retry that request seems to vary. You can see again a large increase in the long tail, which indicates that there are clients who took over 300 milliseconds to retry their request.

What does this tell us?

What this means is that in the event of a DNS provider going offline, we need to pull that DNS provider out of rotation to provide best performance, but until we do our users will still receive service. A non-trivial number of users will be seeing a large performance impact.

What is the best number of nameservers to be using?

Based on the previous performance testing, we can assume that the worst-case number of requests a client may have to make is N/2+1, where N is the number of nameservers listed and half of them belong to the provider that is offline. So if we list eight nameservers, with four from each provider, the client may potentially have to make 5 DNS requests before they finally get a successful message (the four failed requests, plus a final successful one). A statistician better than I would be able to tell you the exact probabilities of each scenario you would face, but the short answer here is:

Four.

We felt that, based on our use case and the performance penalty we were willing to take, we would list a total of four nameservers: two from each provider. This may not be the right decision for those who have a web presence orders of magnitude larger than ours. For comparison, Facebook provides two nameservers on IPv4 and two on IPv6, Twitter provides eight (four from Dyn and four from Route 53), and Google provides four.
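For the curious, here is a back-of-the-envelope Python sketch of those probabilities. It rests on a strong assumption that real resolvers do not necessarily follow: the client tries the listed nameservers in a uniformly random order and never retries one that has already failed. The function names are made up for this example.

```python
from fractions import Fraction


def request_distribution(total, offline):
    """Map k -> probability that exactly k requests are needed, when
    `offline` of the `total` listed nameservers do not answer."""
    healthy = total - offline
    dist = {}
    p_all_failed_so_far = Fraction(1)
    for k in range(1, offline + 2):
        untried = total - (k - 1)
        # The k-th request succeeds if it lands on a healthy server.
        dist[k] = p_all_failed_so_far * Fraction(healthy, untried)
        # Otherwise it lands on one of the offline servers still untried.
        p_all_failed_so_far *= Fraction(offline - (k - 1), untried)
    return dist


def expected_requests(total, offline):
    """Average number of requests before the first successful answer."""
    return sum(k * p for k, p in request_distribution(total, offline).items())


print(expected_requests(8, 4))  # 9/5, i.e. 1.8 requests on average
print(expected_requests(4, 2))  # 5/3, i.e. about 1.67 requests on average
```

Under these assumptions the averages are close, but the worst case is not: with eight nameservers a client can need five requests (a 1 in 70 chance), while with four nameservers it never needs more than three.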

How are we going to keep our DNS providers in sync?

DNS has built-in ways of keeping multiple servers in sync. You have zone transfers (IXFR, AXFR), which are usually triggered by a NOTIFY packet sent to all the servers listed as NS records in the zone. But these are not used in the wild very often, and have limited support from DNS providers. They also come with their own headaches, like maintaining an IP whitelist ACL that could contain hundreds of servers (all the different points of presence from multiple providers), none of which you control. You also lose the ability to audit who changed which record, as they could be changed on any given server.

So we built a tool to keep our DNS in sync. We actually built this tool years ago, once our artisanally crafted zone files became too troublesome to edit by hand. The details of this tool are out of scope for this blog post though. If you want to learn about it, keep an eye out around March 2017 as we plan to open-source it. The tool lets us describe the DNS zone data in one place and push it to many different DNS providers.
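To give a feel for the "describe once, push to many" idea, here is a tiny Python sketch. It is not our tool and says nothing about how that tool works internally. The record format, function names, provider labels and example data are all hypothetical, and actually calling each provider's API is left out.

```python
# Single source of truth: (name, type, value) tuples. The domain and
# addresses are documentation examples, not real records.
DESIRED = {
    ("example.com.", "A", "192.0.2.10"),
    ("example.com.", "MX", "10 mail.example.com."),
    ("www.example.com.", "CNAME", "example.com."),
}


def plan_changes(desired, current):
    """Return (to_create, to_delete) that would make `current` match `desired`."""
    return sorted(desired - current), sorted(current - desired)


def plan_all(desired, current_by_provider):
    """Compute a change plan for every provider from the same desired state."""
    return {
        provider: plan_changes(desired, current)
        for provider, current in current_by_provider.items()
    }


# Hypothetical current state: provider-a is in sync, provider-b has drifted.
current = {
    "provider-a": set(DESIRED),
    "provider-b": {
        ("example.com.", "A", "192.0.2.99"),
        ("www.example.com.", "CNAME", "example.com."),
    },
}
for provider, (to_create, to_delete) in plan_all(DESIRED, current).items():
    print(provider, "create:", to_create, "delete:", to_delete)
```

Because every provider is compared against the same description, the zone data lives in one place where changes can be reviewed and audited, rather than being edited separately at each provider.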

So what did we learn?

The biggest takeaway from all of this is that even if you have multiple DNS servers, DNS is still a single point of failure if they are all with the same provider and that provider goes offline. Until the Dyn attack this was pretty much “in theory” if you were using a large DNS provider, because until the first successful attack no large DNS provider had ever had an extended outage on all of its points of presence.

However, implementing multiple DNS providers is not entirely straightforward. There are performance considerations. You need to ensure that both of your zones are serving the same data. There can be such a thing as too many nameservers.

Lastly, we did all of this whilst following DNS best practices. We didn’t have to do any weird DNS trickery, or write our own DNS server to do non-standard things. When the DNS standards we still rely on were published in 1987, I wonder if the authors knew the importance of what they were creating. I don’t know, but their design still stands strong and resilient today.

Attributions