Like many websites and service providers, we use and depend on Amazon S3. We primarily use S3 as a data store for uploaded artifacts like JavaScript source maps and iOS debug symbols, which are a critical part of our event processing pipeline. Yesterday, S3 experienced an outage that lasted three hours, but the impact on our processing pipeline was minimal.

Last week, we set out to solve a problem we had been experiencing: fetching data out of S3 was neither as fast nor as reliable as we would hope. Our servers are in Dallas, while our S3 buckets are in Virginia (us-east-1). This means we see an average of 32ms ping times, and around 100ms for a full TLS handshake. Additionally, S3 throttles bandwidth to servers outside of its network, which limits our ability to fetch our largest assets in a timely manner. While increasing performance was our primary goal, this project turned out to be extremely beneficial during the S3 outage and kept our processing pipeline chugging along for the duration.
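Those two numbers are consistent with each other. A fresh HTTPS connection needs roughly three network round trips before any payload moves: one for the TCP handshake and two for a full TLS 1.2 handshake (assuming no session resumption). A quick back-of-the-envelope check:

```python
# Back-of-the-envelope check (assumption: TLS 1.2, no session resumption):
# a fresh HTTPS connection costs ~3 round trips before the first byte of
# payload: 1 RTT for the TCP handshake + 2 RTTs for the TLS handshake.
rtt_ms = 32         # measured ping time from Dallas to us-east-1
round_trips = 3     # TCP (1) + full TLS 1.2 handshake (2)
handshake_ms = rtt_ms * round_trips
print(handshake_ms)  # → 96, in line with the ~100ms we observed
```

Paying that tax on every fetch is exactly what a local cache avoids.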

We started by putting together a quick S3 proxy cache in our datacenter that would cache full assets from S3, allowing us to serve them on our local network without a round trip to Virginia every single time. The requirements of this experiment were:

- minimize risk to production-facing traffic
- avoid introducing single points of failure
- prove the concept without increasing our hardware bill
- avoid committing changes to application code

We completed the task with two popular services and no application code required: nginx acts as an S3 cache, while HAProxy routes requests straight back to S3 if nginx fails.

The goal of the nginx server was to leverage proxy_cache and store all of our S3 assets on disk as they are requested, backed by a large 750GB disk cache so that a very large set of assets stays actively cached.

Our new proposed infrastructure would look like this:

The setup should look familiar to anyone who has worked with service discovery. Each application server running our Sentry code has an HAProxy process on localhost. HAProxy is tasked with directing traffic to our cache server, which proxies upstream to Amazon. If the cache server fails, HAProxy fails over and talks directly to Amazon, bypassing our cache without any interruption.

Configuring HAProxy is quick, taking only eight lines:

```
# Define a DNS resolver for S3
resolvers dns
    # Which nameserver do we want to use?
    nameserver google 8.8.8.8
    # Cache name resolutions for 300s
    hold valid 300s

listen s3
    bind 127.0.0.1:10000
    mode http
    # Define our s3cache upstream server
    server s3cache 10.0.0.1:80 check inter 500ms rise 2 fall 3
    # With actual Amazon S3 as a backup host, using our DNS resolver
    server amazon s3.amazonaws.com:443 resolvers dns ssl verify required check inter 1000ms backup
```
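The `check inter 500ms rise 2 fall 3` part is what makes the failover automatic: HAProxy probes the cache every 500ms, marks it down after three consecutive failed checks, and brings it back after two consecutive successes. A small sketch of that counting logic (this is an illustration of the documented rise/fall semantics, not HAProxy's actual code):

```python
def track_health(results, rise=2, fall=3, up=True):
    """Simulate HAProxy's rise/fall check counting: a server is marked
    DOWN after `fall` consecutive failed checks and UP again after
    `rise` consecutive successful ones."""
    streak = 0
    for ok in results:
        if ok == up:
            streak = 0              # current state confirmed; reset counter
            continue
        streak += 1
        if up and streak == fall:
            up, streak = False, 0   # 3 straight failures: mark DOWN
        elif not up and streak == rise:
            up, streak = True, 0    # 2 straight successes: mark UP
    return up

# Two failed checks are not enough to take s3cache out of rotation...
print(track_health([False, False, True]))   # → True
# ...but three in a row are, and traffic shifts to the amazon backup.
print(track_health([False, False, False]))  # → False
```

With checks every 500ms, a dead cache server is out of rotation in roughly a second and a half, which is why the failover is invisible to the application.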

On each application server tasked with communicating with S3, the HAProxy admin page ends up displaying this:

This gives us a live view of the cache's health from the perspective of a single application server. It also came in handy when S3 went down, clearly showing exactly two hours and 57 minutes of downtime during which we could not communicate with Amazon.
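That admin view comes from HAProxy's built-in stats page, which takes only a small additional fragment to enable (the listener address and refresh interval here are illustrative, not our actual values):

```
listen stats
    bind 127.0.0.1:8404
    mode http
    stats enable
    stats uri /
    stats refresh 10s
```

Pointing a browser at that listener shows the state, check history, and downtime counters for each backend server.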

Configuring nginx was a little more involved, since it's doing the heavy lifting:

```
http {
    gzip off;
    resolver 8.8.8.8;
    resolver_timeout 5s;

    proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=default:500m max_size=750g inactive=30d;

    server {
        listen 10.0.0.1:80 default_server;
        keepalive_timeout 3600;

        location / {
            proxy_http_version 1.1;
            proxy_set_header Host s3.amazonaws.com;
            proxy_set_header Authorization $http_authorization;
            proxy_set_header Connection '';

            proxy_cache default;
            proxy_cache_valid 200 30d;
            proxy_cache_use_stale error timeout invalid_header updating http_500 http_502 http_503 http_504;
            proxy_cache_lock on;

            proxy_ssl_verify on;
            proxy_ssl_session_reuse on;

            add_header X-Cache-Status $upstream_cache_status always;

            set $s3_host 's3.amazonaws.com';
            proxy_pass https://$s3_host;
        }
    }
}
```

This configuration gives us a 750GB disk cache for our S3 objects, as configured by proxy_cache_path.
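On disk, nginx names each cached object after the MD5 of its cache key (by default `$scheme$proxy_host$request_uri`), and `levels=1:2` shards those files into subdirectories taken from the tail of that hash so no single directory grows huge. A sketch of that layout (the bucket path below is a made-up example):

```python
import hashlib

def nginx_cache_file(cache_dir, key, levels=(1, 2)):
    """Mimic how nginx lays out files under proxy_cache_path with
    levels=1:2: the file name is md5(cache_key) in hex, and the
    subdirectories are the last 1 character, then the previous 2
    characters, of that hash."""
    h = hashlib.md5(key.encode()).hexdigest()
    parts, pos = [], len(h)
    for width in levels:
        parts.append(h[pos - width:pos])
        pos -= width
    return "/".join([cache_dir, *parts, h])

# Default proxy_cache_key is $scheme$proxy_host$request_uri.
key = "https://s3.amazonaws.com/our-bucket/some-debug-file"
print(nginx_cache_file("/var/cache/nginx", key))
```

The `add_header X-Cache-Status` line is also worth calling out: every response carries a `HIT`, `MISS`, or `STALE` header, which made it easy to verify the cache was doing its job.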

This proxy service had been running for a week while we watched our bandwidth and S3 bill drop, but yesterday morning it got an unexpected test: