If you are running highly intensive Unicorn deployments, you will be familiar with the following problems.

How to find out what Ruby processes are doing?

Occasionally some Ruby processes run pedal to the metal, possibly burning CPU like crazy, when you know they shouldn't be, and you just can't tell what's happening. They get stuck, but you can't tell where or why - it's frustrating.

Tracing sometimes works. Run strace -s 999 -fp <pid> on a suspicious process, and you may see some useful output. It may end with something like this, with no further movement:

```
[pid  7990] select(18, [16 17], NULL, [7 8], {133, 947558} <unfinished ...>
[pid 17728] select(19, [14 18], NULL, [7 8], {22, 500000} <unfinished ...>
```

The worker is blocked in select on some file descriptors, and you can use lsof to try to find out what they point to:

```
$ lsof -p 17728
COMMAND   PID  USER  FD   TYPE     DEVICE  SIZE/OFF  NODE  NAME
...
ruby    17728  ruby  19u  IPv4  808319634       0t0   TCP  web1.infra.net:35682->db3.infra.net:mysql (ESTABLISHED)
```

In this case it's waiting for MySQL, so you can go look at what's happening over there. But sometimes you see that the worker is hanging on a select from a pipe or socket:

```
$ lsof -p 17728
COMMAND   PID  USER  FD  TYPE  DEVICE  SIZE/OFF        NODE  NAME
...
ruby    58610  ruby  12r  FIFO    0,8       0t0  3435272094  pipe
ruby    58610  ruby  13u  sock    0,6       0t0  3435234779  can't identify protocol
ruby    58610  ruby  14w  FIFO    0,8       0t0  3435270598  pipe
```

When the FD turns out to be a pipe or an unidentified socket, it's difficult to dig further.
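One thing you can still do on Linux is read /proc directly. This sketch is my own illustration, not part of the original workflow: it lists a process's open descriptors, and a pipe shows up as pipe:[<inode>], so grepping other processes' fd directories for the same inode can reveal who holds the other end.

```ruby
# Sketch: list a process's open descriptors via /proc (Linux only).
# We inspect our own pid for demonstration; substitute a worker's pid.
pid = Process.pid
Dir.glob("/proc/#{pid}/fd/*").each do |fd|
  target = File.readlink(fd) rescue next # a descriptor may close mid-scan
  puts "fd #{File.basename(fd)} -> #{target}" # pipes appear as pipe:[<inode>]
end
```

The same inode number appears in the lsof NODE column, which gives you a way to cross-reference the two tools.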

Another case is when a Ruby process is endlessly spilling its guts into the trace output and you can see a repeating pattern, meaning there is some sort of infinite loop going on, but it's hard to tell where.

There are more advanced techniques, but they are pretty cumbersome, and your process may get killed before you manage to find out anything useful.

What is Ruby doing at any given moment?

Wouldn't it be nice if you could find out what line of code your Ruby process is executing at any given moment? The good news is that you can. Combine Signal.trap with Thread.list and Thread#backtrace, and you're set.

```ruby
Signal.trap('USR2') do
  pid = Process.pid
  puts "[#{pid}] Received USR2 at #{Time.now}. Dumping threads:"
  Thread.list.each do |t|
    trace = t.backtrace.join("\n[#{pid}] ")
    puts "[#{pid}] #{trace}"
    puts "[#{pid}] ---"
  end
  puts "[#{pid}] -------------------"
end
```

Place this code in Unicorn's after_fork, Resque's after_fork, the beginning of Rake tasks, or wherever it fits. When something is stuck or otherwise suspicious, find the pid and then do:

```
$ kill -USR2 26749
```

Take a look at the Unicorn log. It should show something like this:

```
[26749] Received USR2 at 2014-01-19 04:28:35 +0000. Dumping threads:
[26749] /srv/app.intra.net/releases/shared/config/unicorn-config.rb:291:in `backtrace'
[26749] /srv/app.intra.net/releases/shared/config/unicorn-config.rb:291:in `block (3 levels) in reload'
[26749] /srv/app.intra.net/releases/shared/config/unicorn-config.rb:290:in `each'
[26749] /srv/app.intra.net/releases/shared/config/unicorn-config.rb:290:in `block (2 levels) in reload'
[26749] /srv/app.intra.net/releases/20140118185816/app/controllers/welcome_controller.rb:8:in `call'
[26749] /srv/app.intra.net/releases/20140118185816/app/controllers/welcome_controller.rb:8:in `sleep'
[26749] /srv/app.intra.net/releases/20140118185816/app/controllers/welcome_controller.rb:8:in `index'
[26749] /usr/local/lib/ruby/gems/2.0.0/gems/actionpack-3.2.16/lib/action_controller/metal/implicit_render.rb:4:in `send_action'
[26749] /usr/local/lib/ruby/gems/2.0.0/gems/actionpack-3.2.16/lib/abstract_controller/base.rb:167:in `process_action'
[26749] ...
[26749] ---
[26749] -------------------
```

You can see that it's hanging on sleep in welcome_controller.rb, line 8. Bullseye.

Dumping Ruby threads when killing lazy Unicorn workers

You can't sit around all day watching those processes manually. Unicorn has a timeout configuration option that kills workers running longer than the given number of seconds.
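For reference, the timeout is set in the Unicorn configuration file. A minimal fragment; the 45-second value is my assumption, chosen to match the log excerpts in this post:

```ruby
# unicorn.rb -- kill any worker that takes longer than 45s on a request
timeout 45
```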

The trouble is that Unicorn sends SIGKILL when terminating stuck workers, and SIGKILL cannot be trapped. Workers get killed, and all you get are these lousy messages:

```
E, [2013-11-19T12:30:08.905908 #61893] ERROR -- : worker=75 PID:62991 timeout (46s > 45s), killing
E, [2013-11-19T12:30:08.939120 #61893] ERROR -- : reaped #<Process::Status: pid 62991 SIGKILL (signal 9)> worker=75
```

That's not very helpful. It just tells us that something went wrong, and that the perpetrator got a bullet in the head.

But we can patch things up by overriding Unicorn::HttpServer#murder_lazy_workers. Here's a monkey patch that will send SIGTERM 2 seconds before sending SIGKILL:

```ruby
class Unicorn::HttpServer
  # forcibly terminate all workers that haven't checked in in timeout seconds.
  # The timeout is implemented using an unlinked File
  def murder_lazy_workers
    next_sleep = @timeout - 1
    now = Time.now.to_i
    WORKERS.dup.each_pair do |wpid, worker|
      tick = worker.tick
      0 == tick and next # skip workers that haven't processed any clients
      diff = now - tick
      tmp = @timeout - diff

      # monkey patch begins here
      if tmp < 2
        logger.error "worker=#{worker.nr} PID:#{wpid} running too long " \
                     "(#{diff}s), sending TERM"
        kill_worker(:TERM, wpid)
      end
      # end of monkey patch

      if tmp >= 0
        next_sleep > tmp and next_sleep = tmp
        next
      end
      next_sleep = 0
      logger.error "worker=#{worker.nr} PID:#{wpid} timeout " \
                   "(#{diff}s > #{@timeout}s), killing"
      kill_worker(:KILL, wpid) # take no prisoners for timeout violations
    end
    next_sleep <= 0 ? 1 : next_sleep
  end
end
```

Place this code somewhere where it gets loaded on Unicorn startup - it can go next to Unicorn's before_fork. You may want to adjust your Signal.trap to be a little more universal:

```ruby
['TERM', 'USR2'].each do |sig|
  Signal.trap(sig) do
    pid = Process.pid
    puts "[#{pid}] Received #{sig} at #{Time.now}. Dumping threads:"
    Thread.list.each do |t|
      trace = t.backtrace.join("\n[#{pid}] ")
      puts "[#{pid}] #{trace}"
      puts "[#{pid}] ---"
    end
    puts "[#{pid}] -------------------"
    exit unless sig == 'USR2'
  end
end
```

Restart your unicorns (this change may require a full stop and a clean start) and wait for a worker to hit the timeout. When one does, read your logs and rejoice:

```
E, [2014-01-19T04:17:19.340983 #26722] ERROR -- : worker=5 PID:27033 running too long (45s), sending TERM
[27033] Received USR2 at 2014-01-19 04:17:19 +0000. Dumping threads:
[27033] /srv/app.intra.net/releases/shared/config/unicorn-config.rb:291:in `backtrace'
[27033] /srv/app.intra.net/releases/shared/config/unicorn-config.rb:291:in `block (3 levels) in reload'
[27033] /srv/app.intra.net/releases/shared/config/unicorn-config.rb:290:in `each'
[27033] /srv/app.intra.net/releases/shared/config/unicorn-config.rb:290:in `block (2 levels) in reload'
[27033] ... your code ...
```

And you can still use kill -USR2 <pid> on your Unicorn workers (or any other Ruby process, for that matter) to get a backtrace on demand. Just don't send USR2 to the Unicorn master process expecting a thread dump - it will reload your application instead.

Everything is so much simpler when you can pinpoint problems in your code.
