I was recently profiling a production Shopify application server using perf and noticed a fair amount of time being spent in a particular function, st_lookup , which is used by Ruby’s MRI implementation for hash table lookups:

Hash tables are used all over MRI, and not just for the Hash object; global variables, instance variables, classes, and the garbage collector all use MRI’s internal hash table implementation, st_table . Unfortunately, what this profile did not show were the callers of st_lookup . Is this some application code that has gone wild? Is this an inefficiency in the VM?

perf is a great sampling profiler for Linux — it’s low overhead and can safely be used in production. However, up until a few years ago, in order to use the call-graph feature of perf you had to recompile an application with -fno-omit-frame-pointer to get usable stack traces. This gives the compiler one less register to work with, but I believe the trade-off is worth it in most cases. As of Linux 3.7, perf now supports recording call graphs even when code is compiled without -fno-omit-frame-pointer . This works by recording a snapshot of stack memory on each sample. When analyzing the profile later with perf report , the stack data from each sample is combined with DWARF debugging information in order to build a call graph. This increases the amount of data included in the profile, but is a reasonable compromise when compared with having to recompile everything with -fno-omit-frame-pointer .

Now when running perf and collecting call graphs, I was able to see the callers of st_lookup :

From this profile, we can see that a large percentage of time is being spent in rb_method_entry_get_without_cache . At first, I suspected this was being caused by global method cache invalidation or clearing. After using SystemTap on the method__cache__clear probe and seeing a relatively low count, it was clear that this was not the case. At this point, I started digging into the MRI source, trying to understand what exactly happens when a method is called and how method caching actually works. Looking through the MRI source shows that effectively the only place that rb_method_entry_get_without_cache is called is via rb_method_entry :

The idea here is that the global method cache will be checked first (via the GLOBAL_METHOD_CACHE macro), and if the method is found, the method entry will be returned from the cache. If the method is not in the cache, a more expensive method lookup needs to be performed. This involves walking the class hierarchy to find the requested method, which can cause multiple invocations of st_lookup to look up the method from a class method table.

The GLOBAL_METHOD_CACHE macro performs a basic hash lookup on the provided key to locate the entry in the global method cache. Looking at this code, I was surprised to see that the global method cache only had room for 2048 entries. Shopify is a large Rails application with millions of lines of Ruby (when you include gems), and defines hundreds of thousands of methods. A 2048-entry global method cache is nowhere near adequate for an application of our size.

Analyzing method cache hit rates

One of the first things I wanted to do was determine the existing method cache hit-to-miss ratio. One way of doing this would be to add tracing code to our MRI to log this data. This would require us to deploy a custom Ruby build to our app servers, which is not an ideal solution. There is a better way: ftrace user probes.

ftrace is a lightweight tracing framework that is built into the Linux kernel. One of the best features of ftrace is the ability to create dynamic user-mode event probes (uprobes). By adding a probe, ftrace will fire an event whenever a specified function is called in a process. Better yet, you can even fire an event on the execution of a specific instruction. This was all I needed to collect method cache hit-to-miss numbers.

To configure a uprobe, you need to specify the path to the binary and the address of the instruction you want to trace. How do you figure out the address of the instruction? Using objdump :

The first field objdump prints is the address of the function. Using this, I can now create a probe:

Now when I run Ruby, I’ll get an event every time rb_method_entry is called:

What this tells me is that running Ruby with a simple “hello world” caused rb_method_entry to be called 4441 times.

This is really cool! But what I originally wanted to figure out was the global method cache hit to miss ratio. By disassembling rb_method_entry , we can figure out what code paths are executed in both hit and miss cases, which we can use to set the appropriate probes:

If you look at the C implementation of the function, you can see how it maps to the generated assembly. The relevant bits here are:

0x00000000001aa16a is the address that jumps to rb_method_entry_get_without_cache . This is the path we take on a cache miss.

is the address that jumps to . This is the path we take on a cache miss. 0x00000000001aa194 is the address of the return instruction that we’ll execute on a cache hit. Because cache misses perform an unconditional jump to rb_method_entry_get_without_cache , we can infer that this instruction will only ever be executed on a hit.

With this information, we can set up our two probes, one for global method cache misses and one for hits:

And here are the results running the simple “hello world” script:

This is what we wanted to see! So for a simple hello world script, there were 727 method cache misses and 3714 hits, giving us a hit ratio of about 84%. Now that we have a method of calculating the hit rate, let’s see what it looks like in production!

Rather than have to execute these commands every time I wanted to take a sample, I wrote a small shell script to do it for me:

The script takes a binary of the ruby you want to profile and the number of seconds to collect data. After the number of seconds has elapsed, it will display the number of cache hits and misses, along with the hit rate.

So, how does this look in production? With the default method cache size of 2048 entries, we get about a 90% hit rate. This is pretty good, but according to the results I saw in perf , it could definitely be better.

Introducing RUBY_GLOBAL_METHOD_CACHE_SIZE

The global method cache size is configured as a compile-time #define , GLOBAL_METHOD_CACHE_SIZE . I thought it would be useful to be able to configure the method cache size at runtime, like you can do with garbage collector parameters ( RUBY_GC_* environment variables). To that end, I added a new environment variable, RUBY_GLOBAL_METHOD_CACHE_SIZE , that can be used to configure the global method cache size at process start time. This code has been committed upstream and is available in Ruby 2.2.

Now that I had a way to dynamically configure the method cache size, I could run some tests to see how Shopify performed using various cache sizes. I ran the trace_rb_method_cache.sh script on one of our production servers for 60 seconds with a few different cache sizes and collected the results:

Cache Size Hits Misses Hit Rate 2K (Default) 10,227,750 1,114,933 90.17% 16K 11,697,582 516,446 95.77% 32K 13,596,388 425,919 96.96% 64K 14,263,747 324,277 97.78% 128K 13,590,568 289,948 97.91% 256K 11,532,080 275,162 97.67% 512K 12,403,522 256,329 97.98%

From these numbers, once we get around 64K it looks like the hit rates begin to level off.

Now, if we do a perf before and after tuning, we can see some respectable results:

2K method cache

128K method cache

This change gives us a cycle savings of about 3%. Not bad for changing one configuration value!

Summary

Using the system-level profiling tools, perf and ftrace , I was able to find a performance issue that would have never been visible with Ruby-level tooling. These tools are available on any modern Linux distribution, and I encourage you to experiment with them on your own application to see what kind of benefits can be had!

Resources

perf wiki

ftrace