Ruby 2.0 is just out the door. My favorite new feature in Ruby 2.0 is the new, highly efficient mechanism for gathering stack traces.

In Ruby 1.9.3 I would struggle really hard to gather more than 5 stack traces per 100 ms of execution. In Ruby 2.0 I am able to gather a stack trace every millisecond using the brand new caller_locations API.

This increased efficiency makes it possible to write “sampling” profilers in pure Ruby and even run them, safely, in production.
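To sketch what that makes possible (the method and constant names here are my own, not MiniProfiler's): a background thread can snapshot another thread's stack every millisecond using Thread#backtrace_locations, the per-thread sibling of caller_locations, also new in 2.0.

```ruby
# Minimal pure-Ruby sampling profiler sketch (hypothetical names, not
# MiniProfiler's). A background thread snapshots the target thread's
# stack every `interval` seconds via Thread#backtrace_locations.
def with_stack_samples(interval = 0.001)
  target  = Thread.current
  samples = []
  sampler = Thread.new do
    loop do
      frames = target.backtrace_locations
      samples << frames.map(&:label) if frames
      sleep interval
    end
  end
  yield
  sampler.kill
  samples
end
```

Counting how often a method's label shows up across the samples approximates its share of wall-clock time, which is exactly the data a flame graph lays out side by side.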

Putting it all together

While researching the brand new DTrace support in Ruby 2.0 I noticed a few people were mentioning this concept of “flame graphs”. The idea is quite simple. Lay all the sampled stack traces side by side, so you can visually identify “hot” areas where the code can be optimised.

Here is an example from Brendan Gregg’s blog.

Unfortunately, getting this to hang together in Ruby, using Ruby stack traces, was fairly impractical:

- Rails stack traces are massively deep.
- The code was all written in Perl.
- There appears to be no API to gather Ruby call stacks from DTrace.
- Ruby DTrace probes do not even run on Linux; Ruby will not compile with DTrace enabled on Linux.

After I discovered that stack traces are super fast in Ruby 2.0, making this all work became a reality. I quickly read up on the excellent d3js library and got going writing something similar, but tailored to Ruby and Rails.

I added a special option to MiniProfiler so you can generate flame graphs for your pages. Just add ?pp=flamegraph to the end of your URL and you will get a flame graph, even in production.

Demo and profiling session

Here is what happened when I ran the flame graph on the Discourse front page:

This report is informationally very dense; hovering over areas allows you to zero in on various problems. Let me describe a few things I can glean from this graph.

The report is zoomable, just like Google Maps; zoom in enough and you start seeing method names.

Creating slugs is slow

12.75% of all the samples we collected have the preceding call in the stack trace.

The code is quite short.

```ruby
def slug
  Slug.for(title).presence || "topic"
end
```

This slug method is triggering the transliterate method on the inflector; method calls there are not cheap. If we really want, we can get an exact cost by wrapping the code with a counter.

```ruby
def slug
  Rack::MiniProfiler.counter("slug") do
    Slug.for(title).presence || "topic"
  end
end
```

To resolve we can either improve the performance of the underlying transliterator, cache slugs in the DB or cache lookups in memory.
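As a sketch of the "cache lookups in memory" option: memoize the result per topic so the expensive transliteration runs at most once per instance. Slugifier below is a hypothetical stand-in for the real Slug class, with faked inflection logic; the point is the memoization shape.

```ruby
# Hypothetical stand-in for Slug.for; the real one goes through the
# inflector's transliterate, which is the expensive part.
module Slugifier
  def self.for(title)
    title.to_s.downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-+|-+\z/, "")
  end
end

class Topic
  attr_reader :title

  def initialize(title)
    @title = title
  end

  # Memoized: the expensive call runs once, later calls hit @slug.
  def slug
    @slug ||= begin
      s = Slugifier.for(title)
      s.empty? ? "topic" : s
    end
  end
end
```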

The inflector overall is slow

The preceding image describes a fairly extensive area of the execution, where active model serializers are turning in-memory data into JSON. The surprising fact there is that a large amount of time is spent pluralizing.

We can again add MiniProfiler instrumentation to see the actual impact:

8% of our time is spent pluralizing stuff; there are a ton of things we can do to improve this:

- A big culprit is the include! method in AM::Serializers; we can easily improve that.
- We could look at adding an internal cache inside the pluralizer.
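An internal cache inside the pluralizer could be as simple as a hash keyed by the word, so each unique input is inflected once. This is a sketch with a fake inflection rule; naive_pluralize is a stand-in, not ActiveSupport's real implementation.

```ruby
# Fake inflection rule standing in for ActiveSupport::Inflector;
# the point here is the cache shape, not the inflection logic.
def naive_pluralize(word)
  word.end_with?("s") ? word : word + "s"
end

# Each unique word is inflected once; repeat lookups are hash hits.
PLURAL_CACHE = Hash.new do |cache, word|
  cache[word] = naive_pluralize(word)
end

def pluralize_cached(word)
  PLURAL_CACHE[word]
end
```

A real patch would also need to key on locale and any user-defined inflection rules, and cap the cache size if words can come from user input.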

Since we are good citizens and AM::Serializers is getting a lot of attention at the moment, let's patch it up.

A fair amount of time is spent translating

We can notice that 6% of the samples happen to originate from the localization gem. The majority is originating from fallback logic. It probably makes sense to add a cache on top of translations; lookups really should be a no-op.

Object initialization is slow

I noticed this trace once on this page, but repeatedly on our topic page:

The code in question is:

```ruby
def add_trust_level
  # there is a possibility we did not load the trust level column, skip it
  return unless attributes.key? :trust_level
  self.trust_level ||= SiteSetting.default_trust_level
end
```

Funny thing is, I wrote this code to correct a more serious perf hole. Originally this code was raising exceptions and handling them. In Rails we have massive call stacks; exceptions during normal operation are a huge pitfall.

However, it appears that attributes is being materialized during this call. A very expensive operation.

Our trivial fix is:

```ruby
def add_trust_level
  # there is a possibility we did not load the trust level column, skip it
  return unless has_attribute? :trust_level
  self.trust_level ||= SiteSetting.default_trust_level
end
```

Call stacks are really deep

I love Rack. However, one side-effect of using it is that call stacks get really, really deep. In fact, we are 90 or so frames deep by the time we even start executing our application code. This makes gathering stack traces slower, makes exceptions slower, and complicates execution.

20 or so frames are spent inside omniauth; this may indicate it makes sense to have some sort of router inside omniauth for dispatch, as opposed to the layered Rack approach. We only have a handful of strategies in place now.

Don’t forget that the GC is totally invisible to this particular sampling profiler, at the moment.

We have found that the Ruby GC, untuned, is a huge performance problem with Discourse. I discussed this at length here: http://meta.discourse.org/t/tuning-ruby-and-rails-for-discourse/4126. With MiniProfiler you can easily add ?pp=profile-gc-times to the end of any request. If you are noticing multiple GCs for a single request, odds are your GC could do with a fair amount of tuning.
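Outside MiniProfiler, a quick way to see how many GC runs a block of work triggers is the stdlib GC.stat counter; the helper name below is my own, a rough stand-in for what ?pp=profile-gc-times reports.

```ruby
# Count how many GC runs happen while a block executes, using the
# stdlib GC.stat hash. Several runs per request is a hint that the
# GC needs tuning.
def gc_runs_during
  before = GC.stat[:count]
  yield
  GC.stat[:count] - before
end
```

Usage: `gc_runs_during { handle_request }` returns the number of collections the request triggered.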

Future work may include integrating GC execution times into this graph.

How you can diagnose issues

The flame graph tool is very powerful, but also can present you with information overload. My general recommendations would be:

Look for repeated visual patterns, stuff like the inflector above. If a pattern repeats over and over, it could indicate a library is inefficient or a call is doing too much work.

Look for areas you expected to be short but that span multiple frames. For example, if you notice a user.fullname call is taking 20 frames, it's possible a very inefficient algorithm is generating the fullname data.

Are there any gems you are spending way more time in, than you expected?

I love this, how can I help?

The code for the flame graph was written yesterday; it is my first foray into d3js and SVG, and I am sure there are many ways to improve and optimise it. MiniProfiler is open source; we would love your help improving this. All the code in the demo here is already checked in.

It is also possible that this technique can be used on other platforms. Maybe there is some ninja technique that allows JavaScript to grab stack traces for the running thread. I bet the code could be shared with the recent Google Go port of MiniProfiler. Unfortunately, I don't think you can gather stack traces from existing running threads in .NET short of attaching with ICorProfiler or ICorDebugger.

In Summary

I only touched the surface here; much is left to be done, and the rendering JS engine needs a lot of tidying up and refactoring. As always, pull requests are more than welcome.

Ruby has a fair number of options when it comes to profiling: the excellent perftools.rb, the long-running ruby-prof project, and the for-pay New Relic.

MiniProfiler is now adding yet another elegant way to make sense of Ruby execution, in production. Be sure to try it out.