A couple of weeks before writing this post, AWS had a single-region failure of S3. It was the worst S3 failure ever, and it took down many services. We at IOpipe survived fairly well: our dashboard went offline, but our APIs and metrics ingestion were unaffected.

Still, we’d like to prevent these problems from happening in the future, and it turned out that solving this was really, really easy. Yet I didn’t find much documentation, tutorials, or how-tos on configuring S3 multi-region failover.

This tutorial will show you how to configure a URL that provides multi-region failover for S3 buckets, protecting against regional S3 failures. We use Route 53 for this. If you do not use Route 53 for your authoritative DNS servers, you may set up a subdomain and use CNAMEs to delegate specific records to Route 53. This tutorial does not cover remediation of potential Route 53 failures.
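As a sketch, that delegation might look like the following in a BIND-style zone file, assuming hypothetical names (`static.example.com` is the public hostname, and `static.example-r53.net` is a record hosted in a Route 53 zone where the failover policy lives):

```
; in the example.com zone, hosted outside Route 53:
; point the public name at a record in a Route 53-hosted zone,
; which is where the failover routing policy is applied
static.example.com.  IN  CNAME  static.example-r53.net.
```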

Bucket names

You will need two buckets, existing or new. The bucket whose name does NOT match your DNS CNAME should be the primary, placed behind CloudFront. The other bucket should be named exactly the full DNS hostname that will be used as a CNAME in Route 53.
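For example, assuming a hypothetical CNAME of `static.example.com` and a primary bucket named `example-static-primary`, creating the two buckets with the AWS CLI might look like this (requires AWS credentials; names and regions are placeholders):

```shell
# Primary bucket: the name need not match the CNAME, since CloudFront fronts it.
aws s3 mb s3://example-static-primary --region us-east-1

# Secondary bucket: the name MUST exactly match the CNAME, because during a
# failover DNS will point straight at this bucket's website endpoint.
aws s3 mb s3://static.example.com --region us-west-2
```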

The trick is that CloudFront can serve a CNAME backed by any origin, but when DNS points directly at S3, the bucket name must match the CNAME. Unfortunately, this strategy only works for plain HTTP, not HTTPS.

Another option is to point the failover DNS record at Cloudflare or another non-Amazon CDN. The core issue is that Amazon does not allow two CloudFront distributions to share a CNAME.

Cross-region Replication

First, we need our S3 data replicated. Amazon offers a replication feature out of the box, but it only takes effect for new or updated objects. This replication comes with several downsides:

It is a primary–secondary architecture; you cannot maintain three or four copies

Versioning must be enabled on the bucket, so old object versions are archived and retained in the bucket

For many websites, simple two-region replication will be sufficient, and objects will be small enough that versioning will not be an issue. However, it would be relatively easy to use an S3 Lambda trigger instead, allowing non-versioned copies to multiple regions, as long as the data can be copied within Lambda’s five-minute execution limit.
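A minimal sketch of that Lambda alternative, assuming hypothetical destination buckets and regions, and a standard S3 put-notification event (boto3 is imported lazily inside the handler so the pure helper can be used without it installed):

```python
from urllib.parse import unquote_plus

# Hypothetical (region, bucket) destinations; real values would come from
# configuration or environment variables.
TARGETS = [("us-west-2", "example-static-usw2"),
           ("eu-west-1", "example-static-euw1")]

def replication_targets(key, targets):
    """Pure helper: pair each destination region/bucket with the object key."""
    return [(region, bucket, key) for region, bucket in targets]

def handler(event, context):
    """S3-triggered Lambda: copy each new object to every target region."""
    import boto3  # imported lazily so replication_targets() works without boto3

    for record in event["Records"]:
        src_bucket = record["s3"]["bucket"]["name"]
        # object keys in S3 event notifications are URL-encoded
        key = unquote_plus(record["s3"]["object"]["key"])
        for region, dst_bucket, dst_key in replication_targets(key, TARGETS):
            # server-side copy; destination buckets do not need versioning
            boto3.client("s3", region_name=region).copy_object(
                Bucket=dst_bucket, Key=dst_key,
                CopySource={"Bucket": src_bucket, "Key": key})
```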

At IOpipe, we chose to use the built-in replication.

S3 Bucket Replication Configuration:
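That configuration can also be expressed as a JSON document for `aws s3api put-bucket-replication` (the account ID, role name, and bucket names below are hypothetical, and versioning must first be enabled on both buckets):

```json
{
  "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
  "Rules": [
    {
      "ID": "failover-copy",
      "Status": "Enabled",
      "Prefix": "",
      "Destination": {
        "Bucket": "arn:aws:s3:::static.example.com"
      }
    }
  ]
}
```

An empty `Prefix` replicates every object in the bucket; the role must grant S3 permission to read from the source and write to the destination.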

CloudFront

One S3 bucket is placed behind CloudFront. This provides TLS termination, caching, and other resiliency features.

Route 53 Health Checks

This is where things get interesting. We set up three health checks for CloudFront+S3: one checks the health of the S3 bucket, another checks the health of the CloudFront distribution, and a third, calculated health check requires that BOTH of the previous checks be green. This last health check is what we monitor from Route 53.

We use simple “Basic” health checks of AWS endpoints. As configured, each health check costs $0.50/mo; with three checks, that is a total of $1.50/mo per multi-region bucket.

While we can also configure health checks for the secondary bucket, they cannot be used from inside Route 53. They may still be useful for triggering Lambdas or other alerting/monitoring.

Health Check Configuration:
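The equivalent CLI commands might look like the following (requires AWS credentials; every hostname and ID is a hypothetical placeholder, including the primary bucket's website endpoint and the CloudFront domain):

```shell
# 1) Basic HTTP check against the primary S3 bucket's website endpoint:
aws route53 create-health-check --caller-reference s3-bucket-check \
  --health-check-config '{"Type": "HTTP", "FullyQualifiedDomainName": "example-static-primary.s3-website-us-east-1.amazonaws.com", "ResourcePath": "/", "Port": 80}'

# 2) Basic HTTP check against the CloudFront distribution:
aws route53 create-health-check --caller-reference cloudfront-check \
  --health-check-config '{"Type": "HTTP", "FullyQualifiedDomainName": "d1234abcd.cloudfront.net", "ResourcePath": "/", "Port": 80}'

# 3) Calculated check: healthy only when BOTH children are healthy.
#    A HealthThreshold equal to the number of children acts as an AND.
aws route53 create-health-check --caller-reference combined-check \
  --health-check-config '{"Type": "CALCULATED", "HealthThreshold": 2, "ChildHealthChecks": ["<s3-check-id>", "<cloudfront-check-id>"]}'
```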

Route 53 Routing Policies

For the DNS name pointing to these S3 buckets, we created a CNAME pointing to the CloudFront distribution, then enabled a “Failover” routing policy with it as the Primary record. We enabled “Associate with Health Check”, specifying the calculated health check.

A second CNAME should be created as the Secondary record, pointing directly to the S3 bucket whose name matches your CNAME.

Primary:
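As a sketch, both records can be expressed as a change batch for `aws route53 change-resource-record-sets` (hostnames, the CloudFront domain, and the health check ID are hypothetical placeholders):

```json
{
  "Changes": [
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "static.example.com.",
        "Type": "CNAME",
        "TTL": 60,
        "SetIdentifier": "primary-cloudfront",
        "Failover": "PRIMARY",
        "HealthCheckId": "<calculated-check-id>",
        "ResourceRecords": [{ "Value": "d1234abcd.cloudfront.net" }]
      }
    },
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "static.example.com.",
        "Type": "CNAME",
        "TTL": 60,
        "SetIdentifier": "secondary-s3",
        "Failover": "SECONDARY",
        "ResourceRecords": [{ "Value": "static.example.com.s3-website-us-west-2.amazonaws.com" }]
      }
    }
  ]
}
```

Route 53 answers with the Primary record while the associated health check passes, and falls back to the Secondary record when it fails.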

Conclusion

This configuration might, honestly, be overkill given how rarely S3 has serious outages, but it is also not difficult or costly to configure and maintain.