How to implement the perfect failover strategy using Amazon Route53

Using Route53 Health Checks to fail fast

Amazon Route53 is the go-to option for DNS if you’re using AWS. Many users, including us, have only scratched the surface with the power that Route53 can provide — it’s much more than a simple nameserver.

The setup

Every application should run in at least two regions for high availability and fault tolerance. This should also be an active-active architecture, rather than active-passive. In an active-active system, all regions are used at the same time, whereas in active-passive a back-up region is only used in a failure scenario.

So, let’s draw an architecture diagram to show what we intend to achieve.

Route53 routing requests across any number of load balancers in different AWS regions

We have our application running in any number of AWS regions, with a load balancer as the entry-point. We want Route53 to work as a kind of “pre-load balancer” to distribute requests and point users to the correct region.

Route53 Routing Policies

We have two (or more) regions running in active-active. Great! But how can Route53 distribute the load between these regions? Enter Route53 Routing Policies. AWS offers many different policies:

Simple routing policy: Point a domain to a single, simple resource.

Failover routing policy: Designed for active-passive failover. Routes to a primary resource unless it’s unhealthy, in which case traffic goes to the backup.

Geolocation routing policy: Route traffic based on the country or continent of your users.

Geoproximity routing policy: Route traffic based on the physical distance between the region and your users.

Latency routing policy: Route traffic to the AWS region that provides the best latency.

Multivalue answer routing policy: Respond to DNS queries with up to eight healthy records selected at random.

Weighted routing policy: Route traffic to resources in proportions that you specify.

For an active-active system, Geolocation, Geoproximity, Latency, Multivalue, or Weighted policies would work. So which one should we choose?

So long as our application scales horizontally, latency routing will provide the best experience for users. DNS queries will return the lowest-latency healthy region based on the users’ IP address. Geoproximity routing will have higher latencies as it only takes physical distances into account. Multivalue routing can be used to slightly improve availability and add some basic load-balancing, but DNS load-balancing isn’t reliable and another policy is almost always better. Weighted routing is great for testing new versions and allows blue-green deployments. If there are geographic requirements, for example, users in the UK need to be routed to a region in the UK, geolocation routing is a viable option.
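
To illustrate the weighted option: a blue-green split is just two records for the same name with different weights. This is a minimal sketch; the hosted zone ID, record name, and targets are placeholders rather than part of our setup.

# Hypothetical blue-green split: ~90% of traffic to the current (blue) stack, ~10% to the new (green) one.
resource "aws_route53_record" "blue" {
  zone_id        = "Z0123456789ABCDEFGHIJ" # replace with your hosted zone ID
  name           = "api.example.com"       # illustrative record name
  type           = "CNAME"
  ttl            = 60
  records        = ["blue.example.com"]
  set_identifier = "blue"

  weighted_routing_policy {
    weight = 90
  }
}

resource "aws_route53_record" "green" {
  zone_id        = "Z0123456789ABCDEFGHIJ"
  name           = "api.example.com"
  type           = "CNAME"
  ttl            = 60
  records        = ["green.example.com"]
  set_identifier = "green"

  weighted_routing_policy {
    weight = 10
  }
}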

The simple routing policy is the only one that doesn’t support Route53 Health Checks. So even if an application can’t use latency routing, any of the other policies can still be combined with health checks to implement a great failover strategy.
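
Health checks are cheap to define and can be attached to any record (other than simple) via health_check_id. Here’s a minimal sketch of one tuned to fail fast; the endpoint, path, and thresholds are illustrative assumptions:

# Probe the region's public endpoint every 10 seconds over HTTPS;
# mark it unhealthy after 2 consecutive failed checks.
resource "aws_route53_health_check" "eu_central_1" {
  fqdn              = "eu.api.example.com" # hypothetical per-region endpoint
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  request_interval  = 10 # the "fast" interval; the default is 30
  failure_threshold = 2

  tags = {
    Name = "eu-central-1-api"
  }
}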

Latency-based routing

The application we’re building has no geographic requirements, so we can use latency routing. How do we set it up?

To use latency-based routing, a record set needs to be created for each region the application is hosted in. Here’s how to set up basic latency routing using Terraform:

Example of Terraform for latency-routing across load-balancers in two regions
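
A minimal sketch of that setup, assuming an existing hosted zone and one application load balancer per region (the domain and variable names here are illustrative):

# Existing hosted zone and per-region ALBs are assumed to be created elsewhere.
variable "zone_id"         { type = string }
variable "eu_alb_dns_name" { type = string }
variable "eu_alb_zone_id"  { type = string }
variable "us_alb_dns_name" { type = string }
variable "us_alb_zone_id"  { type = string }

# Both records answer for the same name; Route53 returns the lowest-latency healthy region.
resource "aws_route53_record" "eu_central_1" {
  zone_id        = var.zone_id
  name           = "api.example.com" # illustrative record name
  type           = "A"
  set_identifier = "eu-central-1"

  latency_routing_policy {
    region = "eu-central-1"
  }

  alias {
    name                   = var.eu_alb_dns_name
    zone_id                = var.eu_alb_zone_id
    evaluate_target_health = true # or attach a dedicated check via health_check_id
  }
}

resource "aws_route53_record" "us_east_1" {
  zone_id        = var.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "us-east-1"

  latency_routing_policy {
    region = "us-east-1"
  }

  alias {
    name                   = var.us_alb_dns_name
    zone_id                = var.us_alb_zone_id
    evaluate_target_health = true
  }
}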

We were quickly able to verify that Route53 would direct users to the best region based on their latency. Great! We’re all set up, so let’s run some load tests.

Load tests

There’s no need to load test Route53 itself — it handles some of the largest sites in the world (Instagram, Amazon, Netflix). However, we wanted to prove that latency routing itself would work as expected with millions of requests. Specifically, we wanted to test failover scenarios to see how Route53 would direct traffic.

We’re testing our WebSocket service. WebSocket services are incredibly sensitive to instability due to the nature of persistent connections — a single user is connected to a single host in a single region for a long time. If a host (or region) fails, all users connected to that host need to reconnect. This sensitivity makes the service a great candidate for testing Route53 routing: if any connections are dropped, we know something has gone wrong.

After verifying our service could handle millions of concurrent open connections, we moved on to the failover tests.

We decided to run a load test of 3 million connections, with eu-central-1 limited to ~2 million connections. Beyond that limit, the hosts would become overloaded: they’d serve requests with increased latency and ultimately become unhealthy. To keep things simple, we’re running in just two AWS regions. Here’s what we expected to happen:

eu-central-1 hits about 2m open connections

eu-central-1 becomes overloaded and starts serving requests slowly

Route53 detects this increased latency and starts routing new requests to us-east-1

Existing connections to eu-central-1 are kept open as, without the incoming requests, CPU usage is reduced

us-east-1 handles the remaining 1m connections, reaching a total of 3m

Here’s how we expected the number of open connections to change during the load test.