Debugging mysterious Sidekiq shutdowns

Andrew Berls

Sidekiq is our background job processor of choice at Sutro. It's perfect for sending emails asynchronously or performing long-running tasks (for example, jobs with lots of network I/O). However, we recently ran into an issue where the Sidekiq process on our server was repeatedly crashing, and the cause turned out to be something we'd never encountered before.

The bug

We use Capistrano for deploying code to our servers (this project happened to use a new droplet on DigitalOcean). As part of our configuration, we have a set of custom Capistrano tasks that restart Sidekiq on each deploy. Our Sidekiq configuration specifies a pidfile where Sidekiq writes its process ID on startup, and our Capistrano tasks use the contents of this file to call kill -s USR1 on the process when we deploy.
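The restart step boils down to reading the pidfile and signaling the process. Here's a rough sketch of that logic as a shell function (the function name and pidfile path are illustrative, not our exact Capistrano configuration):

```shell
# signal_sidekiq: read a pidfile and quiet the Sidekiq process it names.
# Checking with `kill -0` first avoids the "no such process" failure mode.
signal_sidekiq() {
  pidfile="$1"
  [ -f "$pidfile" ] || { echo "no pidfile"; return 1; }
  pid=$(cat "$pidfile")
  if kill -0 "$pid" 2>/dev/null; then
    # USR1 tells (this era of) Sidekiq to stop accepting new jobs
    kill -s USR1 "$pid"
    echo "signalled $pid"
  else
    echo "stale pidfile for $pid"
    return 1
  fi
}

# usage (path is a placeholder):
#   signal_sidekiq /var/www/app/shared/tmp/pids/sidekiq.pid
```

Our original tasks skipped the liveness check, which is why a dead process broke the deploy.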

One day, our Capistrano deployments started failing within the Sidekiq tasks with a message like:

kill: kill 18576 failed: no such process

Upon ssh'ing into the server, we could see that the pidfile still existed and contained the process ID, but no Sidekiq process was running! Trying to run kill on this non-existent process was failing, and breaking our deploys.

At first, we attempted to brush this aside - we removed the now-obsolete pidfile and restarted Sidekiq. Things appeared to run smoothly for anywhere from 5 to 30 minutes, and then the Sidekiq process would die again. We carefully watched the Sidekiq logs and web dashboard, but never saw any sign of an exception or a crash; it simply processed jobs for a while and then shut down without a peep.

This was a complete mystery - a crash inside a worker should never bring down the main Sidekiq process (the failed job is simply retried or sent to the Dead Job Queue), and we were pretty confident that Sidekiq correctly removes its own pidfile on shutdown.

Now, based on the clues, a more experienced sysadmin than I am might immediately recognize the issue here. However, it took us a fair amount of hair-pulling to eventually hypothesize that since our jobs perform large numbers of external HTTP requests, Sidekiq was eventually just choking and running out of memory. This seemed plausible, and it took a bit more googling to find out how to debug out-of-memory (OOM) conditions. This StackOverflow answer ended up being our savior - here's the magic line:

dmesg | egrep -i 'killed process'

dmesg is a Linux/UNIX command that prints the kernel message buffer (we're running Ubuntu 14.04).

We ran this and immediately saw:

[102335.319388] Killed process 6567 (ruby) total-vm:1333004kB, anon-rss:355088kB, file-rss:688kB

Aha! This confirmed our hypothesis - on startup, Sidekiq would pick up a batch of jobs, the workers would fire off a huge number of HTTP requests, the process would run out of memory, and the Linux OOM Killer would terminate it.
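The numbers in that log line are worth decoding: total-vm is the process's virtual size and anon-rss is the anonymous resident memory the kernel actually reclaimed by killing it. As a quick sketch (the parsing is ours, using the sample line from our server):

```shell
# Pull the pid and memory figures out of an OOM killer log line.
line='[102335.319388] Killed process 6567 (ruby) total-vm:1333004kB, anon-rss:355088kB, file-rss:688kB'

pid=$(echo "$line"    | sed -n 's/.*Killed process \([0-9]*\).*/\1/p')
vm_kb=$(echo "$line"  | sed -n 's/.*total-vm:\([0-9]*\)kB.*/\1/p')
rss_kb=$(echo "$line" | sed -n 's/.*anon-rss:\([0-9]*\)kB.*/\1/p')

# Convert kB to MB for readability
echo "pid=$pid virtual=$((vm_kb / 1024))MB resident=$((rss_kb / 1024))MB"
```

So our ruby process had grown to roughly 1.3 GB of virtual memory with ~350 MB resident - far too much for the box it was on.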

The OOM Killer

If your system starts running desperately low on memory, it's the OOM Killer's job to step in and choose a process to shoot in the head using a set of heuristics (the "badness score") in an attempt to recover memory. This is why we were seeing the "zombie" Sidekiq pidfile - our Sidekiq process had the highest badness score, and was being forcibly shut down without a chance to clean up after itself. We highly recommend this article for more details about the OOM Killer.
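You can peek at the kernel's current badness scores yourself: on Linux each process exposes one at /proc/<pid>/oom_score. A quick sketch to list the most killable processes (higher score = more likely to be chosen):

```shell
# List the five processes the OOM Killer would most likely target,
# by reading each process's current badness score from /proc.
for p in /proc/[0-9]*; do
  printf '%s %s\n' "$(cat "$p/oom_score" 2>/dev/null)" "${p#/proc/}"
done | sort -rn | head -5
```

On our box, the Sidekiq ruby process would have topped this list right before each crash.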

As mentioned, we used dmesg to find the log message from the OOM killer. You can find more verbose output in syslog at /var/log/syslog - the full output is too long to post here, but here's ours in a gist for those interested.

Defeating the OOM Killer

Unfortunately, we don't have a concrete list of suggestions to prevent this from ever happening. In our case, we found that our server had accidentally been provisioned with a woefully small amount of RAM, so we'll be re-provisioning it and tuning the concurrency setting in our Sidekiq configuration. Our jobs still need to perform large numbers of HTTP requests - while we don't have a drop-in solution yet, we at least know how to debug out-of-memory conditions in the future ;)
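For reference, the concurrency setting lives in config/sidekiq.yml and controls how many worker threads (and thus how many simultaneous jobs) one Sidekiq process runs; fewer threads means a smaller memory ceiling. The values below are illustrative, not a recommendation:

```yaml
# config/sidekiq.yml -- illustrative values, tune for your workload
:concurrency: 5            # default is 25; fewer threads = less concurrent memory use
:pidfile: tmp/pids/sidekiq.pid
:queues:
  - default
  - mailers
```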

In any case, a process monitoring system like supervisord or God may be useful, although that's an area we don't have much experience with. If you have a favorite, let us know!
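For the curious, a minimal supervisord program entry for Sidekiq might look something like this - we haven't run this in production ourselves, and the paths and app layout are placeholders:

```ini
; /etc/supervisor/conf.d/sidekiq.conf -- sketch only; paths are placeholders
[program:sidekiq]
command=bundle exec sidekiq -e production -C config/sidekiq.yml
directory=/var/www/app/current
autostart=true
autorestart=true           ; bring Sidekiq back if it dies (e.g. OOM-killed)
stopsignal=TERM
stopwaitsecs=60            ; give in-flight jobs time to finish on shutdown
stdout_logfile=/var/log/sidekiq.log
redirect_stderr=true
```

Note that this only restarts the process after a kill - it doesn't address the underlying memory pressure.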