Have you seen the MJIT benchmark results? Amazing, aren’t they? MJIT basically blows the other implementations out of the water! What were they doing all these years? That’s it, we’re done here right?

Well, not so fast as you can infer from the title. But before we can get to what I take issue with in these particular benchmarks (you can of course jump ahead to the nice diagrams) we gotta get some introductions and some important benchmarking basics out of the way.

MJIT? Truffle Ruby? WTF is this?

MJIT currently is a branch of ruby on github by Vladimir Makarov, GCC developer, that implements a JIT (Just In Time Compilation) on the most commonly used Ruby interpreter/CRuby. It’s by no means final, in fact it’s in a very early stage. Very promising benchmarking results were published on the 15th of June 2017, which are in major parts the subject of this blog post.

TruffleRuby is an implementation of Ruby on the GraalVM by Oracle Labs. It poses impressive performance numbers as you can see in my latest great “Ruby plays Go Rumble”. It also implements a JIT, is known to take a bit of a warmup but comes out being ~8 times faster than Ruby 2.0 in the previously mentioned benchmark.

Before we go further…

I have enormous respect for Vladimir and think that MJIT is an incredibly valuable project. Realistically it might be one of our few shots to get a JIT into mainstream ruby. JRuby had a JIT and great performance for years, but never got picked up by the masses (topic for another day).

I’m gonna critique the way the benchmarks were done, but there might be reasons for that, that I’m missing (gonna point out the ones I know). After all, Vladimir has been programming for way longer than I’m even alive and also knows more about language implementations than I do obviously.

Plus, to repeat, this is not about the person or the project, just the way we do benchmarks. Vladimir, in case you are reading this 💚💚💚💚💚💚

What are we measuring?

When you see a benchmark in the wild, first you gotta ask “What was measured?” – the what here comes in to flavors: code and time.

What code are we benchmarking?

It is important to know what code is actually being benchmarked, to see if that code is actually relevant to us or a good representation of a real life Ruby program. This is especially important if we want to use benchmarks as an indication of the performance of a particular ruby implementation.

When you look at the list of benchmarks provided in the README (and scroll up to the list what they mean or look at them) you can see that basically the top half are extremely micro benchmarks:

What’s benchmarked here are writes to instance variables, reading constants, empty method calls, while loops and the like. This is extremely micro, maybe interesting from a language implementors point of view but not very indicative of real world ruby performance. The day looking up a constant will be the performance bottle neck in Ruby will be a happy day. Also, how much of your code uses while loops?

A lot of the code (omitting the super micro ones) there isn’t exactly what I’d call typical ruby code. A lot of it is more a mixture of a script and C-code. Lots of them don’t define classes, use a lot of while and for loops instead of the more typical Enumerable methods and sometimes there’s even bitmasks.

Some of those constructs might have originated in optimizations, as they are apparently used in the general language benchmarks. That’s dangerous as well though, mostly they are optimized for one specific platform, in this case CRuby. What’s the fastest Ruby code on one platform can be way slower on the other platforms as it’s an implementation detail (for instance TruffleRuby uses a different String implementation). This puts the other implementations at an inherent disadvantage.

The problem here goes a bit deeper, whatever is in a popular benchmark will inevitably be what implementations optimize for and that should be as close to reality as possible. Hence, I’m excited what benchmarks the Ruby 3×3 project comes up with so that we have some new more relevant benchmarks.

What time are we measuring?

This is truly my favorite part of this blog post and arguably most important. For all that I know the time measurements in the original benchmarks were done like this: /usr/bin/time -v ruby $script which is one of my favorite benchmarking mistakes for programming languages commonly used for web applications. You can watch me go on about it for a bit here.

What’s the problem? Well, let’s analyze the times that make up the total time you measure when you just time the execution of a script: Startup, Warmup and Runtime.

Startup – the time until we get to do anything “useful” aka the Ruby Interpreter has started up and has parsed all the code. For reference, executing an empty ruby file with standard ruby takes 0.02 seconds for me, MJIT 0.17 seconds and for TruffleRuby it takes 2.5 seconds (there are plans to significantly reduce it though with the help of Substrate VM). This time is inherently present in every measured benchmark if you just time script execution.

– the time until we get to do anything “useful” aka the Ruby Interpreter has started up and has parsed all the code. For reference, executing an empty ruby file with standard ruby takes 0.02 seconds for me, MJIT 0.17 seconds and (there are plans to significantly reduce it though with the help of Substrate VM). This time is inherently present in every measured benchmark if you just time script execution. Warmup – the time it takes until the program can operate at full speed. This is especially important for implementations with a JIT. On a high level what happens is they see which code gets called a lot and they try to optimize this code further. This process takes a lot of time and only after it is completed can we truly speak of “peak performance”. Warmup can be significantly slower than runtime. We’ll analyze the warmup times more further down.

– the time it takes until the program can operate at full speed. This is especially important for implementations with a JIT. On a high level what happens is they see which code gets called a lot and they try to optimize this code further. This process takes a lot of time and only after it is completed can we truly speak of “peak performance”. We’ll analyze the warmup times more further down. Runtime – what I’d call “peak performance” – run times have stabilized. Most/all code has already been optimized by the runtime. This is the performance level that the code will run at for now and the future. Ideally, we want to measure this as 99.99%+ of the time our code will run in a warmed up already started state.

Interestingly, the startup/warmup times are acknowledged in the original benchmark but the way that they are dealt with simply lessens their effect but is far from getting rid of them: “MJIT has a very fast startup which is not true for JRuby and Graal Ruby. To give a better chance to JRuby and Graal Ruby the benchmarks were modified in a way that Ruby MRI v2.0 runs about 20s-70s on each benchmark”.

I argue that in the greater scheme of things, startup and warmup don’t really matter when we are talking about benchmarks when our purpose is to see how they perform in a long lived process.

Why is that, though? Web applications for instance are usually long lived, we start our web server once and then it runs for hours, days, weeks. We only pay the cost of startup and warmup once in the beginning, but run it for a much longer time until we shut the server down again. Normally servers should spend 99.99%+ of their time in the warmed up runtime “state”. This is a fact, that our benchmarks should reflect as we should look for what gives us the best performance for our hours/days/weeks of run time, not for the first seconds or minutes of starting up.

A little analogy here is a car. You wanna go 300 kilometers as fast as possible (straight line). Measuring as shown above is the equivalent of measuring maybe the first ~500 meters. Getting in the car, accelerating to top speed and maybe a bit of time on top speed. Is the car that’s fastest on the first 500 meters truly the best for going 300 kilometers at top speed? Probably not. (Note: I know little about cars)

What does this mean for our benchmark? Ideally we should eliminate startup and warmup time. We can do this by using a benchmarking library written in ruby that also runs the benchmark for a couple of times before actually taking measurements (warmup time). We’ll use my own little library as it means no gem required and it’s well equipped for the rather long run times.

But does startup and warmup truly never matter? It does matter. Most prominently it matters during development time – starting the server, reloading code, running tests. For all of those you gotta “pay” startup and warmup time. Also, if you develop a UI application or a CLI tool for end users startup and warmup might be a bigger problem, as startup happens way more often. You can’t just warm it up before you take it into the load balancer. Also, running tasks periodically as a cronjob on your server will have to pay theses costs.

So are there benefits to measuring with startup and warmup included? Yes, for one for the use cases mentioned above it is important. Secondly, measuring with time -v gives you a lot more data:

tobi@speedy $ /usr/bin/time -v ~/dev/graalvm-0.25/bin/ruby pent.rb Command being timed: "/home/tobi/dev/graalvm-0.25/bin/ruby pent.rb" User time (seconds): 83.07 System time (seconds): 0.99 Percent of CPU this job got: 555% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:15.12 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 1311768 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 57 Minor (reclaiming a frame) page faults: 72682 Voluntary context switches: 16718 Involuntary context switches: 13697 Swaps: 0 File system inputs: 25520 File system outputs: 312 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0

You get lots of data, among which there’s memory usage, CPU usage, wall clock time and others which are also important for evaluating language implementations which is why they are also included in the original benchmarks.

Setup

Before we (finally!) get to the benchmarks, the obligatory “This is the system I’m running this on”:

The ruby versions in use are MJIT as of this commit from 25th of August compiled with no special settings, graalvm 25 and 27 (more on that in a bit) as well as CRuby 2.0.0-p648 as a baseline.

All of this is run on my Desktop PC running Linux Mint 18.2 (based on Ubuntu 16.04 LTS) with 16 GB of memory and an i7-4790 (3.6 GHz, 4 GHz boost).

tobi@speedy ~ $ uname -a Linux speedy 4.10.0-33-generic #37~16.04.1-Ubuntu SMP Fri Aug 11 14:07:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

I feel it’s especially important to mention the setup in here, as when I first did these benchmarks for Polyconf on my dual core notebook TruffleRuby had significantly worse results. I think graalvm benefits from the 2 extra cores for warmup etc, as the CPU usage across cores is also quite high.

You can check out the benchmarking script used etc. as part of this repo.

But… you promised benchmarks, where are they?

Sorry, I think the theory is more important than the benchmarks themselves, although they undoubtedly help illustrate the point. We’ll first get into why I chose the pent.rb benchmark as a subject and why I run it with a slightly old versions of graalvm (no worries, current version coming in later on). Then, finally, graphs and numbers.

Why this benchmark?

First of all, the original benchmarks were done with graalvm-0.22. Attempting to reproduce the results with the (at the time current) graalvm-0.25 proved difficult as a lot of them had already been optimized (and 0.22 contained some genuine performance bugs).

One that I could still reproduce the performance problems with was pent.rb and it also seemed like a great candidate to show that something is flawed. In the original benchmarks it is noted down as 0.33 times the performance of Ruby 2.0 (or well, 3 times slower). All my experience with TruffleRuby told me that this is most likely wrong. So, I didn’t choose it because it was the fastest one on TruffleRuby, but rather the opposite – it was the slowest one.

Moreover, while a lot of it isn’t exactly idiomatic ruby code to my mind (no classes, lots of global variables) it uses quite a lot Enumerable methods such as each, collect, sort and uniq while refraining from bitmaskes and the like. So I also felt that it’d make a comparatively good candidate from here.

The way the benchmark is run is basically the original benchmark put into a loop so it is repeated a bunch of times so we can measure the times during warmup and later runtime to get an average of the runtime performance.

So, why am I running it on the old graalvm-0.25 version? Well, whatever is in a benchmark is gonna get optimized making the difference here less apparent.

But seriously though, who uses a global variable instead of an argument to pass a value to a method? https://t.co/TLUGr8KVUY — Benoit Daloze (@eregontp) July 10, 2017

We’ll run the new improved version later.

MJIT vs. graalvm-0.25

So on my machine the initial execution of the pent.rb benchmark (timing startup, warmup and runtime) on TruffleRuby 0.25 took 15.05 seconds while it just took 7.26 seconds with MJIT. Which has MJIT being 2.1 times faster. Impressive!

What’s when we account for startup and warmup though? If we benchmark just in ruby startup time already goes away, as we can only start measuring inside ruby once the interpreter has started. Now for warmup, we run the code to benchmark in a loop for 60 seconds of warmup time and 60 seconds for measuring the actual runtime. I plotted the execution times of the first 15 iterations below (that’s about when TruffleRuby stabilizes):

As you can clearly see, TruffleRuby starts out a lot slower but picks up speed quickly while MJIT stay more or less consistent. What’s interesting to see is that iteration 6 and 7 of TrufleRuby are slower again. Either it found a new optimization that took significant time to complete or a deoptimization had to happen as the constraints of a previous optimization were no longer valid. TruffleRuby stabilizes from there and reaches peak performance.

Running the benchmarks we get an average (warm) time for TruffleRuby of 1.75 seconds and for MJIT we get 7.33 seconds. Which means that with this way of measuring, TruffleRuby is suddenly 4.2 times faster than MJIT.

We went from 2.1 times slower to 4.2 times faster and we only changed the measuring method.

I like to present benchmarking numbers in iterations per second/minute (ips/ipm) as here “higher is better” so graphs are far more intuitive, our execution times converted are 34.25 iterations per minute for TruffleRuby and 8.18 iterations per minute for MJIT. So now have a look at our numbers converted to iterations per minute compared for the initial measuring method and our new measuring method:

You can see the stark contrast for TruffleRuby caused by the hefty warmup/long execution time during the first couple of iterations. MJIT on the other hand, is very stable. The difference is well within the margin of error.

Ruby 2.0 vs MJIT vs. graalvm-0.25 vs. graalvm-0.27

Well, I promised you more data and here is more data! This data set also includes CRuby 2.0 as the base line as well as the new graalvm.

initial time (seconds) ipm of initial time average (seconds) ipm of average after warmup Standard Deviation as part of total CRuby 2.0 12.3 4.87 12.34 4.86 0.43% TruffleRuby 0.25 15.05 3.98 1.75 34.25 0.21% TruffleRuby 0.27 8.81 6.81 1.22 49.36 0.44% MJIT 7.26 8.26 7.33 8.18 2.39%

We can see that TruffleRuby 0.27 is already faster than MJIT in the first iteration, which is quite impressive. It’s also lacking the weird “getting slower” around the 6th iteration and as such reaches peak performance much faster than TruffleRuby 0.25. It also gets faster overall as we can see if we compare the “warm” performance of all 4 competitors:

So not only did the warmup get much faster in TruffleRuby 0.27 the overall performance also increased quite a bit. It is now more than 6 times faster than MJIT. Of course, some of it is probably the TruffleRuby team tuning it to the existing benchmark, which reiterates my point that we do need better benchmarks.

As a last fancy graph for you I have the comparison of measuring the runtime through time versus giving it warmup time, then benchmarking multiple iterations:

CRuby 2 is quite consistent as expected, TruffleRuby already manages a respectable out of the box performance but gets even faster. I hope this helps you see how the method of measuring can achieve drastically different results.

Conclusion

So, what can we take away? Startup time and warmup are a thing and you should think hard about whether those times are important for you and if you want to measure them. For web applications, most of the time startup and warmup aren’t that important as 99.99%+ you’ll run with a warm “runtime” performance.

Not only what time we measure is important, but also what code we measure. Benchmarks should be as realistic as possible so that they are as significant as possible. What a benchmark on the Internet check most likely isn’t directly related to what your application does.

ALWAYS RUN YOUR OWN BENCHMARKS AND QUESTION BOTH WHAT CODE IS BENCHMARKED, HOW IT IS BENCHMARKED AND WHAT TIMES ARE TAKEN

(I had this in my initial draft, but I ended up quite liking it so I kept it around)

edit1: Added CLI tool specifically to where startup & warmup counts as well as a reference to Substrate VM for how TruffleRuby tries to combat it 🙂

edit2: Just scroll down a little to read an interesting comment by Vladimir