At 5:44pm on January 4th, our bellies still stretched from the recent holiday feasting, we started to receive automated messages in our Slack #alerts channel. It was a horde of roughly 100 angry error messages knocking on our channel’s door.

The first were 500 error notices sent by Logentries: the infamous “Request timeout H12”. Those of you who are hosted on Heroku know it means your response time is higher than 30 seconds. A few moments later, we received New Relic alerts notifying us that the error rate had passed the 5% threshold we had set up. Pingdom alerts quickly followed to tell us our homepage was unavailable. Our office lights would have flashed red if we had connected them to Pingdom!

Actions

We all scrambled to check whether we could connect to our platform. None of us could. Additional error messages eventually came, providing more details and designating Redis as the culprit:

ERR max number of clients reached

We immediately scaled down a couple of web dynos and workers. At 5:51pm, everything went back to normal just as quickly as it came. The outage lasted 7 minutes.

Impact

About 5% of all requests failed, which made this a major outage. We cannot be completely certain of that figure, because our Logentries plan does not offer enough log retention. What we do know is that the New Relic error rate never exceeded 5%, which leaves two possibilities:

either our error rate was higher than 5% but New Relic didn’t record it;

or it was around 5% and the remaining 95% of requests completed normally.

The web UI was unavailable for the whole Appaloosa team. As we were scrambling to fix the problem, we failed to check API calls originating from our mobile clients. This would have provided valuable information about the proportion of impacted users.

Our best guess at this time is that only dynos that failed to connect to Redis were impacted. The problem was amplified by Heroku’s random routing, which kept sending requests to the failing dynos.

Our Redis conf

At Appaloosa, we run a Rails backend with Sidekiq to process complex tasks.

Redis is used by Sidekiq and Puma. Here’s an excerpt of our configuration files as they were before the outage:

#### sidekiq.yml

```yaml
production:
  :concurrency: 10
```

#### sidekiq.rb

```ruby
Sidekiq.configure_client do |config|
  config.redis = { size: 3 }
end

Sidekiq.configure_server do |config|
  # no redis settings
end
```

What happened?

The day of the outage, a combination of two events occurred:

Traffic was higher than usual, so our HireFire autoscaler added more web dynos and workers to our Heroku app;

In the last few months, new features involving the creation of Sidekiq jobs had been introduced in Appaloosa.

Thanks to HireFire’s scaling history, we can see that at the beginning of the outage, we had:

4 web dynos

6 workers for our different Sidekiq queues

We asked Sidekiq’s creator Mike Perham to help us calculate the number of connections we were using and he very nicely answered the following:

web dynos: 4 dynos * 3 client connections per process = 12

workers: 6 workers * (5 redis connections per worker (by default) + concurrency set to 10) = 90

That gives us 102 connections, well over our Heroku Redis plan’s limit of 80.
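Mike’s figures can be checked with a few lines of arithmetic. Here is a quick Ruby sketch (the pool sizes come from the configuration above; the extra 5 server connections are the Sidekiq default he mentioned):

```ruby
# Connection budget at the time of the outage.
web_dynos      = 4   # from HireFire's scaling history
client_pool    = 3   # config.redis = { size: 3 } in configure_client
workers        = 6   # Sidekiq worker dynos
concurrency    = 10  # :concurrency: 10 in sidekiq.yml
server_padding = 5   # Sidekiq's default extra server connections

web_connections    = web_dynos * client_pool                  # 12
worker_connections = workers * (concurrency + server_padding) # 90
total              = web_connections + worker_connections

plan_limit = 80  # Heroku Redis plan limit at the time
puts total              # → 102
puts total > plan_limit # → true: over the limit, hence the ERR
```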

What did we learn?

First, one of our mistakes was waiting too long before going to Logentries to study the issue. To address the retention problem, we chose to configure Logentries to perform daily backups to AWS S3 instead of upgrading our plan for the time being.

Second, we added Pingdom checks to monitor our mobile API endpoints, so we will be able to determine which population is impacted during an outage.

Last, we should regularly check whether our Redis plan’s max-clients limit gives us enough leeway to scale up with our platform’s workload.

To solve our Redis “ERR max number of clients reached” issue, we had to reassess how many connections our backend needed at most. We realized our current plan did not cover our new needs, so we simply chose to upgrade our Redis plan for now. A longer-lasting solution could be to fine-tune our Sidekiq and Redis configurations by lowering concurrency or setting a lower number of worker connections to Redis.
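For illustration, such tuning could look like the following sketch. The pool sizes here are hypothetical examples, not the configuration we actually deployed:

```ruby
# sidekiq.rb — illustrative tuning only; these sizes are hypothetical.
Sidekiq.configure_client do |config|
  config.redis = { size: 2 }   # fewer client connections per web process
end

Sidekiq.configure_server do |config|
  # Sidekiq needs at least concurrency plus a couple of connections,
  # so size this pool against the :concurrency value in sidekiq.yml.
  config.redis = { size: 12 }
end
```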

Tips

The Heroku Redis dashboard might be a bit inaccurate. As their customer support confirmed over chat, the dashboard only provides an overview, not detailed information: it never showed us that we had reached our max connections.

Use Sidekiq for Redis connection information. As Mike Perham mentioned, the number of used connections is visible in Sidekiq’s web interface:

Note that this number includes Redis connections used by Sidekiq as well as by any other services relying on Redis.
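That figure ultimately comes from Redis’s INFO command, which reports stats as `key:value` lines. A minimal sketch of where the number comes from (the sample INFO text below is illustrative, not a real capture; against a live server, `Redis#info("clients")` in redis-rb returns the same data as a hash):

```ruby
# Parse the "Clients" section of a Redis INFO reply; connected_clients
# is the figure surfaced in Sidekiq's dashboard. Sample text only.
info_text = "connected_clients:102\r\nblocked_clients:0\r\n"
info = info_text.split("\r\n").map { |line| line.split(":", 2) }.to_h
puts info["connected_clients"]  # → 102
```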

Many thanks to @dmathieu and @mperham for helping!

This article was written by Appaloosa’s dev team:

Benoît Tigeot, Robin Sfez, Christophe Valentin

Want to be part of Appaloosa? Head to Welcome to the jungle.