Today I noticed one of our customer containers was running really high on CPU.

```
# top -bn1
  190 discour+  20   0 2266292 205128  15656 S  86.7  0.3   9377:35 ruby

# ps aux
discour+   190 19.4  0.3 2266292 207096 ?  Sl   2017 9364:38 sidekiq 5.0.5 discourse [1 of 5 busy]
```

Looks like Sidekiq is stuck on a job.

Where is it stuck?

Usually, this is where the story ends and another series of questions starts:

- Can we reproduce this on staging or development?
- What code changed recently?
- Why is perf trace not giving me anything I can work with?
- How awesome is my Sidekiq logging?
- Where is my divining rod?

Julia Evans is working on a fantastic profiler that will allow us to answer this kind of question real quick. But in the meantime, is there anything we can do?

rbtrace + stackprof

At Discourse we include rbtrace and stackprof in our Gemfile.

```ruby
gem 'rbtrace'
gem 'stackprof', require: false
```

We always load rbtrace in production; it enables a wide variety of production-level debugging. stackprof is loaded on demand.

In this particular case I simply ran:

```
# rbtrace -p 190 -e 'Thread.new{ require "stackprof"; StackProf.start(mode: :cpu); sleep 2; StackProf.stop; StackProf.results("/tmp/perf"); }'
```

This injects a new thread into the process that enables stackprof globally, samples CPU for two seconds, and finally writes the performance data to /tmp/perf.
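As an aside, stackprof's block form, StackProf.run with an out: path, rolls the start/stop/write bookkeeping into a single call; a sketch of the same two-second capture:

```
# rbtrace -p 190 -e 'Thread.new{ require "stackprof"; StackProf.run(mode: :cpu, out: "/tmp/perf"){ sleep 2 } }'
```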

This dump can easily be analyzed:

```
# stackprof /tmp/perf
==================================
  Mode: cpu(1000)
  Samples: 475 (0.63% miss rate)
  GC: 0 (0.00%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
       473  (99.6%)         473  (99.6%)     #<Class:0x00007fac08218020>::Typography.to_html_french
         4   (0.8%)           1   (0.2%)     #<Module:0x00007fac080c3620>.reap_connections
```
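Nearly every sample lands in Typography.to_html_french. To zoom in on the exact line inside a hot method, stackprof's --method flag prints per-line sample counts (it matches frame names by regexp, so a partial name should do); a hypothetical invocation against this dump:

```
# stackprof /tmp/perf --method to_html_french
```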

In fact, we can even collect raw backtraces and generate flamegraphs with stackprof, so I strongly recommend reading through the README.
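As a sketch of that flamegraph flow, assuming the same stuck process: capture with raw: true (flamegraphs need raw backtraces), then run the dump through the CLI:

```
# rbtrace -p 190 -e 'Thread.new{ require "stackprof"; StackProf.run(mode: :cpu, raw: true, out: "/tmp/perf-raw"){ sleep 2 } }'
# stackprof --flamegraph /tmp/perf-raw > /tmp/flamegraph
# stackprof --flamegraph-viewer=/tmp/flamegraph
```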

So there you have it, my culprit is:

```ruby
text.gsub(/(\s|)+([!?;]+(\s|\z))/, ' \2\3')
```
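If you want to convince yourself that a pattern like this can pin a CPU, time it against a made-up pathological input. This is a hypothetical reproduction, not the customer's actual text, and on Ruby 3.2+ the regexp engine's backtrack memoization may hide the slowdown entirely:

```ruby
require 'benchmark'

# hypothetical worst case: a long run of whitespace with none of
# "!?;" anywhere, so the pattern can never match and the engine
# has to exhaust its options at every starting position
text = (' ' * 20_000) + 'a'

puts Benchmark.realtime {
  text.gsub(/(\s|)+([!?;]+(\s|\z))/, ' \2\3')
}
```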

A regular expression consuming 100% CPU in production; this has got to be a first!