At VTS we use Sidekiq to do much of our work in the background. Occasionally one of our queues gets too large and we need to quickly look into the problem. A large backlog of jobs on a queue degrades the experience for users who may be waiting on emails, PDF exports, or any other operation we run in the background.

Getting Notified

The first step is getting an alert when a queue backs up. We follow Sidekiq’s recommendation and use a queue status endpoint that exposes queue size and queue latency. Here’s the Rails controller we created to monitor our queue size:
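As a rough illustration rather than the exact VTS code, a size check could look like the sketch below. The decision logic is kept framework-free so it runs on its own; the method name `queue_size_status`, the parameter names, and the controller shown in the comments are assumptions made for this example. `Sidekiq::Queue#size` is the real Sidekiq call that would supply the number.

```ruby
# Framework-free core of a queue-size check. `queue` is anything that
# responds to #size; with Sidekiq that would be Sidekiq::Queue.new(name).
def queue_size_status(queue, max_size)
  size = queue.size
  if size > max_size
    # A 5xx status makes any HTTP monitor treat the check as failed.
    { status: 503, body: "FAIL: #{size} jobs queued (max #{max_size})" }
  else
    { status: 200, body: "OK: #{size} jobs queued (max #{max_size})" }
  end
end

# Hypothetical Rails wiring (requires the rails and sidekiq gems):
#
#   class QueueSizeController < ApplicationController
#     def show
#       queue  = Sidekiq::Queue.new(params.fetch(:queue_name, 'default'))
#       result = queue_size_status(queue, Integer(params.fetch(:max_size, 100)))
#       render plain: result[:body], status: result[:status]
#     end
#   end
```

Returning a non-200 status when the threshold is exceeded means the monitor only needs to look at the response code, not parse the body.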

Here’s the controller we created to monitor queue latency:
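A latency check can follow the same shape. The sketch below is an assumption about how it might be written, not the original listing: `queue_latency_status` and its parameter are hypothetical names, while `Sidekiq::Queue#latency` is the real Sidekiq method that reports how many seconds the oldest job in the queue has been waiting.

```ruby
# Framework-free core of a queue-latency check. `queue` is anything that
# responds to #latency (seconds the oldest job has waited); with Sidekiq
# that would be Sidekiq::Queue.new(name).
def queue_latency_status(queue, max_latency_seconds)
  latency = queue.latency.to_f
  ok = latency <= max_latency_seconds
  {
    status: ok ? 200 : 503,
    body: format('%s: oldest job has waited %.1fs (max %.1fs)',
                 ok ? 'OK' : 'FAIL', latency, max_latency_seconds)
  }
end

# Hypothetical Rails action (requires the rails and sidekiq gems):
#
#   def show
#     queue  = Sidekiq::Queue.new(params.fetch(:queue_name, 'default'))
#     result = queue_latency_status(queue, Float(params.fetch(:max_latency, 30)))
#     render plain: result[:body], status: result[:status]
#   end
```

Latency is often the more telling signal of the two: a queue can hold many fast jobs and still drain quickly, whereas a high latency means jobs are actually waiting.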

Notice how each controller allows for passing in both the queue name and max queue size or latency. This allows us to monitor our Sidekiq queues with StillAlive using a script like:
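The StillAlive script itself is not reproduced here, and the sketch below does not use StillAlive's syntax. It is a plain Ruby stand-in that shows the same idea: one list of queues and thresholds, each turned into a request against the status endpoints. The `/queue_status/...` paths, the `queue_name` and `max` parameter names, the host, and the threshold values are all hypothetical.

```ruby
require 'net/http'
require 'uri'

# Build the URL for one check. The path layout and the parameter names
# are hypothetical, not the actual VTS routes.
def check_url(base_url, check, queue_name, max)
  uri = URI.join(base_url, "/queue_status/#{check}")
  uri.query = URI.encode_www_form(queue_name: queue_name, max: max)
  uri
end

# Return the checks whose endpoint did not answer 200. `fetch` is
# injectable so the logic can be exercised without a network call.
def failing_checks(base_url, checks,
                   fetch: ->(uri) { Net::HTTP.get_response(uri).code })
  checks.reject do |c|
    fetch.call(check_url(base_url, c[:check], c[:queue], c[:max])).to_s == '200'
  end
end

# Illustrative thresholds only. Adding a queue is a one-line change.
CHECKS = [
  { check: 'size',    queue: 'default', max: 100 },
  { check: 'latency', queue: 'default', max: 30 },
  { check: 'size',    queue: 'mailers', max: 500 }
].freeze

# Example use against a hypothetical host:
#   failing_checks('https://app.example.com', CHECKS).each do |c|
#     warn "#{c[:queue]} queue exceeded its #{c[:check]} limit of #{c[:max]}"
#   end
```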

We can easily change our max thresholds or add a queue to monitor by logging into StillAlive and editing our script.

Finding the Bottleneck

The first step is to determine which queue is over its limits. In our case the StillAlive script will fail on the problematic queue, but you can also glance at your Sidekiq console.

Engineers often try to debug the situation by looking at the jobs on the queue within the Sidekiq UI. This might give you a hint, but it can be misleading for a couple of reasons. First, by definition of the problem, there are probably far too many jobs on the queue to browse through. The Sidekiq UI is simply not built to debug a situation like this.

Browsing Jobs on Queue

Second, the key to debugging this issue is not necessarily the number of jobs on the queue but how the Sidekiq process consuming this queue has been spending its time. The queue has grown because the process has been taking too long to work through its jobs. Often one job type in particular is responsible, and the key is to figure out which one. For this, we use NewRelic Insights, which allows us to query which jobs from a specific queue have been taking the most time.

By default, the queue name is not available within Insights, but it’s easy to add it as a custom attribute using Sidekiq middleware:
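A minimal sketch of such a middleware follows. It is not the original listing. The class name `QueueAttributeMiddleware` and the injectable `recorder` argument are assumptions added so the example can run without the gems installed. The `call(worker, job, queue)` signature is Sidekiq's server middleware interface, and `NewRelic::Agent.add_custom_attributes` is the Ruby agent's call for attaching attributes to the current transaction. The sketch assumes the middleware runs inside the transaction that the agent opens for the job.

```ruby
# Sidekiq server middleware: any object whose #call(worker, job, queue)
# yields to run the job.
class QueueAttributeMiddleware
  # `recorder` is injectable for testing. By default it forwards the
  # attributes to the New Relic agent (requires the newrelic_rpm gem).
  def initialize(recorder = nil)
    @recorder = recorder ||
                ->(attrs) { ::NewRelic::Agent.add_custom_attributes(attrs) }
  end

  def call(_worker, _job, queue)
    # Tag the current transaction with the name of the queue being served.
    @recorder.call({ queue: queue })
    yield
  end
end

# Registration (requires the sidekiq gem), e.g. in an initializer:
#
#   Sidekiq.configure_server do |config|
#     config.server_middleware do |chain|
#       chain.add QueueAttributeMiddleware
#     end
#   end
```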

Now the queue is available to query within NewRelic Insights:

    SELECT SUM(duration) AS total_time
    FROM Transaction
    WHERE transactionSubType = 'SidekiqJob' AND queue = 'default'
    SINCE 60 minutes ago
    FACET name

With NRQL, we sum the duration of every run of each job on a specific queue, which reveals the jobs from that queue that consume the most processing time.
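To make the aggregation concrete, here is the same computation sketched in plain Ruby over a handful of made-up events. The job names, queue names, and durations are invented for illustration and are not real measurements. `total_time_by_job` is a hypothetical helper that mirrors the filter on queue, the grouping by name, and the sum of duration.

```ruby
# Sum durations per job name for one queue, largest total first.
# Mirrors: SUM(duration) ... WHERE queue = '<queue>' ... FACET name
def total_time_by_job(transactions, queue:)
  transactions
    .select { |t| t[:queue] == queue }
    .group_by { |t| t[:name] }
    .transform_values { |runs| runs.sum { |t| t[:duration] } }
    .sort_by { |_name, total| -total }
    .to_h
end

# Made-up sample events, for illustration only.
sample = [
  { name: 'PdfExportJob', queue: 'default', duration: 12.0 },
  { name: 'EmailJob',     queue: 'default', duration: 0.5 },
  { name: 'PdfExportJob', queue: 'default', duration: 8.0 },
  { name: 'EmailJob',     queue: 'default', duration: 0.5 },
  { name: 'ReportJob',    queue: 'low',     duration: 30.0 }
]

p total_time_by_job(sample, queue: 'default')
```

In this sample, PdfExportJob accounts for 20.0 seconds on the default queue and EmailJob for 1.0, so the export job is the one to investigate even though both ran the same number of times.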

Sidekiq Job Duration Grouped By Name

Now that we know our problematic job, we can use NewRelic RPM to drill into the bottlenecks on that specific job.

Drilling into the performance of a specific job on NewRelic RPM

Summary

When a queue is backed up, it’s important to act quickly because the user experience suffers. At VTS we use StillAlive to monitor simple queue status endpoints. Once we receive an alert, we use NewRelic Insights to find the problematic jobs and NewRelic RPM to drill into those jobs.