TL;DR

We focused on improving Rails application performance with the Ruby 2.7 JIT, but last year's assumption was wrong and the Ruby 2.7 JIT didn't meet that goal. We'll change the approach in Ruby 3.0.

Ruby 2.7 is released!

Merry Christmas! The new Ruby is out. Since I've promised several times to make performance improvements in the Ruby 2.7 JIT, let me explain how it went. See also The method JIT compiler for Ruby 2.6 and Ruby 2.6 JIT - Progress and Future in case you're not familiar with the context.

Since my original motivation for working on the JIT is to improve Rails application performance, I focused on improving the JIT for that workload. I prepared some more benchmarks using Rack, Sinatra, and Rails, and used them to experiment with patches I believed would be effective for Rails.

However, none of them had a major impact on the Rails benchmarks. I spent a particularly large amount of time on "Optimized JIT-ed code dispatch", which is explained later, because last year I thought it was very important. I eventually changed my plan, stopped thinking about it, and decided to optimize the generated code more, but it was too late: time ran out before I could improve the Rails benchmark in Ruby 2.7.

Now I'm hoping to improve things in Ruby 3.0. To share the current progress towards that, let me summarize what I was doing in Ruby 2.7 development.

What's implemented in Ruby 2.7?

Default option changes

The easiest change, but maybe the one with the largest impact on some benchmarks. Here's what changed:

--jit-max-cache: 1,000 → 100

--jit-min-calls: 5 → 10,000

--jit-max-cache limits the number of methods to be JIT-ed. This is related to the "Optimized JIT-ed code dispatch" topic below, because having a lot of JIT-ed code tends to increase the JIT-ed code dispatch overhead, which significantly affects the performance of long-running applications.

--jit-min-calls is the number of times a method must be called before the JIT worker considers compiling it. Because running a C compiler is very expensive, Ruby's JIT performs badly in most existing Ruby benchmarks simply because of the C compiler's resource consumption. While my focus is on performance after all compilations have finished, I thought raising this value would give an easy win in such situations, at least by skipping methods that are called only on startup.

These values are kind of influenced by OpenJDK's C2 compiler options.
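As a rough illustration of how the two options interact, here is a toy model in plain Ruby. It is not how MJIT is implemented: `ToyJitPolicy` and its methods are hypothetical names invented for this sketch, and the real JIT may treat a full cache differently. It only shows the idea that a method becomes a compilation candidate after enough calls, and that the cache size caps how many methods are compiled.

```ruby
# Toy model only. ToyJitPolicy is a hypothetical class for illustration,
# not part of Ruby or MJIT.
class ToyJitPolicy
  attr_reader :compiled

  def initialize(min_calls: 10_000, max_cache: 100)
    @min_calls = min_calls      # models --jit-min-calls
    @max_cache = max_cache      # models --jit-max-cache
    @call_counts = Hash.new(0)
    @compiled = []
  end

  # Record one call of `method_name`.
  # Returns true if the method counts as "compiled" after this call.
  def record_call(method_name)
    return true if @compiled.include?(method_name)

    @call_counts[method_name] += 1
    if @call_counts[method_name] >= @min_calls && @compiled.size < @max_cache
      @compiled << method_name
      return true
    end
    false
  end
end

# Small numbers so the effect is visible.
policy = ToyJitPolicy.new(min_calls: 3, max_cache: 1)
3.times { policy.record_call(:hot) }          # reaches the threshold, gets compiled
3.times { policy.record_call(:also_hot) }     # reaches the threshold, but the cache is full
1.times { policy.record_call(:startup_only) } # never reaches the threshold
p policy.compiled # => [:hot]
```

With the Ruby 2.7 defaults, a method called only a handful of times during boot never reaches the threshold, which is the "skipping methods called only on startup" effect described above.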

Deoptimized recompilation

https://speakerdeck.com/k0kubun/rubyrussia-2019?slide=44

The JIT has its own deoptimization mechanism. When an assumption like "this JIT-ed method is used only for this class" is not met, the JIT-ed code falls back to VM interpretation, because the code was optimized under that assumption.

While Ruby 2.6 keeps falling back forever, Ruby 2.7's JIT marks such methods for recompilation and generates less-optimized code that doesn't fall back to the VM for the same assumption.

This was implemented in the original MJIT, but it was not merged in Ruby 2.6. During Ruby 2.7 development, we thought it would be valuable for reducing the number of VM fallbacks, especially because Rails uses classes polymorphically in many places.
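To make "polymorphic" concrete, here is a minimal sketch of a single call site that sees one receiver class in one run and two in another. `Square`, `Circle`, and `total_area` are hypothetical names for illustration; whether this exact code triggers deoptimization or recompilation depends on JIT internals and is an assumption here.

```ruby
# Hypothetical classes for illustration only.
class Square
  def initialize(side)
    @side = side
  end

  def area
    @side * @side
  end
end

class Circle
  def initialize(radius)
    @radius = radius
  end

  # Integer approximation of pi to keep the numbers simple.
  def area
    3 * @radius * @radius
  end
end

# `shape.area` is one call site shared by every element of `shapes`.
def total_area(shapes)
  shapes.sum { |shape| shape.area }
end

# Monomorphic use: the call site only ever sees Square, so a
# "receiver is always Square" style assumption would hold.
p total_area([Square.new(2), Square.new(2), Square.new(2)]) # => 12

# Polymorphic use: the same call site now also sees Circle, so code
# specialized for a single class would have to fall back.
p total_area([Square.new(2), Circle.new(1)]) # => 7
```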

Frame-omitted method inlining

https://speakerdeck.com/k0kubun/rubyrussia-2019?slide=56

When we call a method, we push a frame to a call stack, evaluate the method, and pop the frame. The push and the pop of the frame have some overhead, which matters especially for very fast operations like Integer#+. Therefore the VM already eliminates such frame manipulation for some popular methods.

This "frame-omitted" method inlining brings the same optimization to many more methods through the JIT. To do this, we need to know that an inlined method has no chance of relying on the call stack. The CRuby VM developers and I have developed VM instruction attributes to check this kind of property on methods, and the JIT uses that mechanism.

However, not many methods are considered frame-omittable for now. While we did spend time in Ruby 2.7 making some instructions "pure" (no side effects), which this optimization relies on, we didn't make the optimization available for methods with a normal level of complexity. There is still room for improvement here.

What's not committed in Ruby 2.7?

Optimized JIT-ed code dispatch

https://speakerdeck.com/k0kubun/rubyrussia-2019?slide=64

This was the most promising idea for improving Rails performance on the JIT. As mentioned in the --jit-max-cache explanation, JIT-ed method dispatch does become slow when there are many JIT-ed methods. Because I had benchmarked Rails with all methods compiled (up to --jit-max-cache), I thought JIT-ed code dispatch could be a bottleneck in the benchmark.

I managed to make a patch which offsets JIT-ed code dispatch overhead up to 100 methods, but in reality it didn't improve the JIT's performance on the Rails benchmark. So the bottleneck seems to be somewhere else.

Given that, I stopped working on eliminating low-level overhead. My plan for Ruby 3.0 is now to just focus on optimizing compiled code instead.

Method inlining for more methods

The most important thing missing in the above "frame-omitted method inlining" now is that a method is not inlined if the method has local variables.

This is not super hard to solve, but I prioritized "Optimized JIT-ed code dispatch" after implementing the method inlining capability and making some important instructions pure.
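To make the limitation concrete, here is a sketch of two equivalent methods. `Point` and its method names are hypothetical. The only claim taken from the text above is that having local variables currently prevents inlining; whether the first method is actually inlined depends on other conditions inside the JIT, which this example does not try to reproduce.

```ruby
# Hypothetical class for illustration only.
class Point
  def initialize(x, y)
    @x = x
    @y = y
  end

  # No parameters and no local variables: per the description above,
  # only this shape of method could even be a candidate for
  # frame-omitted inlining.
  def manhattan
    @x.abs + @y.abs
  end

  # Same result, but `ax` and `ay` are local variables, which the text
  # says blocks inlining for now.
  def manhattan_with_locals
    ax = @x.abs
    ay = @y.abs
    ax + ay
  end
end

point = Point.new(3, -4)
p point.manhattan             # => 7
p point.manhattan_with_locals # => 7
```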

Another important topic is writing core-class methods in Ruby. For example, if we define Numeric#zero? as `self == 0`, the JIT can consider the method as pure and it can perform frame-omitted method inlining on it, which can be a resolution for [Bug #15589].

Because there was a "Write a Ruby interpreter in Ruby for Ruby 3" project by ko1, I was just waiting for his work. But it turned out that his work, in its current form, basically inlines C code into Ruby code, so the JIT is blocked from inlining such methods because it cannot tell whether the C code is pure.

We're planning to mark C code with attributes like "pure" to make that happen. I thought that would happen in Ruby 2.7, but it didn't.

Stack-based object allocation

https://speakerdeck.com/k0kubun/rubyrussia-2019?slide=74

This optimization, allocating objects on the stack instead of the heap, is known to require a hard piece of implementation: escape analysis. Still, I thought it was possible to implement it for some very basic cases, and I made an experimental implementation.

But even in such a very easy case, we need to know whether a method's receiver or arguments escape, and this was blocked on the C code attributes work by ko1. I could have implemented a tentative, JIT-only version of it, but my time went to the "Optimized JIT-ed code dispatch" project and ko1 was working on many other things.
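As a sketch of what "escaping" means here, compare the two methods below. `Vec`, `REGISTRY`, `length_squared`, and `remember` are hypothetical names for illustration. Whether any JIT would actually stack-allocate the first object is an assumption; as explained above, it also depends on knowing that every method called on the object (including C methods) does not let it escape.

```ruby
# Hypothetical names for illustration only.
Vec = Struct.new(:x, :y)
REGISTRY = []

# `vec` does not escape: it is never returned or stored anywhere, and
# only its own accessors are called. In principle its lifetime ends
# with this method, which is the basic case escape analysis looks for.
def length_squared(x, y)
  vec = Vec.new(x, y)
  vec.x * vec.x + vec.y * vec.y
end

# `vec` escapes twice: it is stored in a longer-lived array and also
# returned to the caller, so it must outlive this method's frame.
def remember(x, y)
  vec = Vec.new(x, y)
  REGISTRY << vec
  vec
end

p length_squared(3, 4) # => 25
kept = remember(1, 2)
p kept.equal?(REGISTRY.last) # => true
```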

What will k0kubun do in Ruby 3.0?

I still believe the JIT can optimize Rails. While achieving Ruby 3x3 on the Optcarrot benchmark would be important for some workloads, I'll continue to work on the JIT in Ruby 3.0 to avoid slowing down Rails and to optimize that workload further.

Instead of focusing on the "Optimized JIT-ed code dispatch" project, I'll try to introduce optimizations which work well on Rails benchmarks, with the following strategies:

More per-instruction optimizations

As you know, more sophisticated JITs like HotSpot's C2 compiler or Graal perform very complicated optimizations. They optimize across multiple instructions or methods. At some point Ruby's JIT should gain what it needs to reach the same level of optimization.

However, even at the single-instruction level, the current Ruby JIT has special optimizations for only a very limited set of instructions. It optimizes normal method calls and instance variables, but most other instructions are untouched and get no per-instruction optimization, while C compilers can perform multi-instruction optimizations only to some limited extent.

Once per-instruction optimizations are done well, C compilers will have more chances to perform multi-instruction optimizations. Even for C code manipulated by the JIT, it would be harder to generate simple code if the per-instruction code were not already simplified.

The things I'm currently planning to try are:

Revisit type-profiling instruction operands and generate type-specific code; The last attempt did not go well, maybe because it increased the number of VM instructions. Maybe we'd be able to profile the same thing without having many instructions.

Instance variable optimization for core-classes and their subclasses (VM-level preparation is done in Ruby 2.7, but JIT-level support is not done yet)

Optimize getconstant; We haven’t touched this in JIT yet and maybe we can do something. To make constant folding happen in the future, we should be able to optimize constant references at the per-instruction level first.

More inlining

Some more places are theoretically inlinable but not yet implemented to be inlined. Examples are:

`super`; We attempted to implement VM-level method cache (prerequisite for JIT-level inlining) in Ruby 2.7, but we got a bug and abandoned it. We should debug the bug and introduce it again.

C methods; Because it's hard to just rewrite C methods to Ruby methods without degrading the method's performance on VM, we'd need to continue to maintain things in C. Then we should consider supporting method inlining for C methods too, at least for methods used very often.

Inlining Ruby methods with local variables; Explained above. It's just waiting to be implemented.

Not only does inlining unblock the C compiler's optimizations, but specializing code for inlining can itself give faster code, because some values are embedded in the generated code instead of being read from memory.

More C method's metadata

Optimizations like allocating objects on the stack and omitting frames for inlined C methods are blocked by the inability to know the behavior of C code. To unblock them, Ruby 3.0 should have a way to identify a C method's behavior automatically.

As the first step, I'd like to have the following attributes on C methods:

pure (a.k.a. "leaf" in YARV instructions); It doesn't call another arbitrary method.

noescape; We'd like to know whether a specific receiver or argument is not escaped by calling a C method.

Just adding the attributes is easy, but to avoid providing false information, we need to somehow verify the information automatically. @shyouhei did a good job of verifying the "leaf" attribute of VM instructions on CI, and I hope we can do something similar for these new attributes too.