The first day of the advent calendar remarked that during 2014, MoarVM became the de facto standard virtual machine to run Perl 6 on. And, with Perl 6 on course to hit 6.0.0 in 2015, MoarVM is what most people adopting Perl 6 at this point will be working with. In this post, we’ll take a look at what MoarVM brings to the table, and where it’s heading from here.

But first – why a VM?

Many modern languages depend on a good deal of runtime support. Perl 6 is very much in this camp, and the same goes for Perl 5. A VM is in the business of providing that support. With Perl 5, the word “interpreter” tends to be used over VM, but the goal is the same: provide the things needed to execute programs in the language across disparate hardware and operating system environments. The difference is mostly that we tend to use VM when there’s a clean separation of the language compiler and the runtime support – meaning that the VM might also be used to execute other languages. This has happened with the JVM, for example. MoarVM is squarely aimed at being a good host for the Rakudo Perl 6 compiler, and the NQP subset of Perl 6 that Rakudo is mostly implemented in. If any other language finds it useful, that’s fine, but it’s not a primary design consideration.

So why care so much over a clean interface between compiler and runtime? Because, in a project the size of “implement Perl 6”, managing implementation complexity is crucial. Establishing a clean boundary between the compiler and the runtime, and knowing what each side is responsible for, is one strategy for doing that. (Rakudo’s compilation process is also broken up into a pipeline of stages that only communicate by passing well-defined data structures from one to the next, which is another form of architectural boundary.)

But why interpreters and VMs at all? Why not go straight down to the metal? For any language where we can reasonably figure out a lot about the program statically, that is a fine strategy. The information needed to eliminate many forms of late binding, checks, and duplicate work is to hand at compile time, and a decent optimizer will be able to use that information. Of course, this approach restricts us to what can be seen at compile time. Some compilers as a result support profile-guided optimization, where we run the program on typical workloads, collect information about where time is spent and what code-paths are common and uncommon, and then compile the program again taking that data into account.

Profile-guided optimization hints at a strategy for efficiently executing languages where we can know rather less at compile time: put off most optimization until we observe the program running, and dynamically replace parts of the program with optimized versions. This isn’t a new idea, and there has been a massive amount of work done in this area. At its heart is the general principle that most programs are, if I may pun on a term from the database field, “eventually static”. Even though a Perl 6 program may have a huge number of points of potential dynamism and genericity, in most programs just a handful of them actually ever exploit it. Of course, different runs of the program on different inputs may tickle different bits of that dynamism, not to mention that as a program grows and evolves, these flexible points will be exploited here and there to (hopefully cleanly) add new functionality.

A modern VM aiming to support a language like Perl 6 is not, therefore, just there to run the code and provide services like garbage collection, OS abstraction, and so on. Its job is to take a program in all its dynamic glory, watch what code paths it actually takes, see what types really show up – and then produce optimized versions of carefully selected parts of the program, with the unused dynamism ruthlessly torn out, and much of the remaining dynamism massaged to more restricted and faster forms.

Bytecode specialization

So what exactly does MoarVM do with a program? First, it needs to find code worth considering optimizing. This is done by seeking the hot parts of a program: routines that are called again and again, and loops that are doing a lot of iterations. Since time to optimize a routine is proportional to the size of the routine, the threshold for “it’s worth considering” goes with code size. A small routine – such as a simple operator or an accessor method – has rather little code and will become hot quickly. A larger routine heats up more slowly. It’s just like our kettle: a little water boils quickly, but fill it up and we’re going to be waiting a bit for that cuppa. This is in part about risk management: we would prefer to avoid investing time optimizing code that’s never going to give a positive return on investment. We can’t predict the future, but some cheap, simple heuristics are at least a good start.

So, we’ve got some hot code. What now? Well, if it’s a call, we start by looking at the callsites that are invoking it. The callsite tells us how many arguments are being passed, whether they are native values or objects, and – for named arguments – what their names are. We can then start to produce bytecode specialized by callsite. Know that the object parameter is passed an object argument? Then replace the op that checks if we need to box the incoming argument with a simpler op that just copies the pointer. Know that an optional parameter actually doesn’t get passed? Then strip out the code to look for it, toss the branch, and just always run the code that sets the default value. Know that the named parameter “config” is always the first passed named argument? Then retrieve it by indexed access (so a cheap pointer copy), rather than messing about with strings.

Next up: types. If we are passed arguments, what types are they? If we look up constants, what types to those have? Knowing these is relatively easy: we just peek at the args buffer to see what’s being passed, or look up the constant and see what type it is. However, knowing them also presents us with a huge range of opportunities. See that method call? It doesn’t need to look up the method in a table each call any more, since we know where it’s going. And that attribute access? Since we know the object layout, it can just become a pointer offset from the object’s memory address. That type check? We often know the answer statically now – and can eliminate entire branches that won’t be taken. Oh, and that multiple dispatch? Let’s just resolve it once now we know the types we’re dealing with, not every single call. The list goes on.

Specialization works at the bytecode level, because that’s what the VM gets. And a specialization of a given piece of bytecode comes with a set of guards: conditions that must be met for it to be applicable. Those can constrain it by callsite and argument types. And we can produce multiple specializations for a given piece of original bytecode.

Of course, choosing the appropriate specialization is a cost too. It would eat into our winnings if we had to do that on every call. Thankfully, there are ways to avoid that. If we are optimizing a call to something and know the argument types being passed, we can pick the specialization of the callee and optimize the caller to always call that specialization, rather than having to hunt for the right one each time. And, if the callee is rather small, we can often do away with the call entirely and simply inline the callee into the caller – eliminating the need to create a callframe and keeping the code tighter, which CPU caches like.

Optimization speculation

So, all this is nice, but we want MOAR MOAR MOAR. Because there are lots of other things that we can’t statically figure out the types of, but that may end up having very stable types come runtime. Some examples are lexical variables that we close over, object attributes, and return values of our callees. So how do we get hold of these types?

When code becomes hot, we don’t actually optimize it right away. Well, we do for the simple callsite-related transformations. But before going and doing the type-driven stuff, we take a closer look at what’s actually going on in the program. The specializer quickly produces an instrumented version of the program that records what types show up where. This runs for the next bunch of invocations, and logs what it sees. After a threshold is hit, we have a look at the types that showed up and do a little analysis. If a given logging site consistently logs the same type, we’re onto something. If we see more than one type show up there, we consider that part of the program as being typically dynamic and leave it be.

The thing is, we didn’t actually prove a property of the program here, just profiled it and figured out what tends to happen. So how can we use this information? The trick is to insert a guard into the specialized program that cheaply asserts that we got the expected type. Beyond that guard, we can optimize assuming we really did. (We also make sure that we actually do an optimization based upon what the guard checks, and don’t bother to insert it otherwise.)

Which is all well and good when the guard is met, but what about when the assertion fails? In fact, this is a more general problem. Even the far simpler type-based optimizations on incoming parameters assume that an object’s type will not change – and of course, thanks to mixins, they can. All of these situations trigger deoptimization: replacing the optimized code we’re in the midst of running with the unoptimized, naive, but safe version that is dynamic enough to handle all the cases.

Deoptimization is quite a pain to get right. For one, if we were inside of inlined code, we have to go back and create all the callframes we skipped. Since inlining can be many levels deep, that can be a bunch of callframes to wire up. Then we have to make sure that our optimizations don’t throw away code that produces a value the deoptimized code relies on being there, but the optimized code would otherwise not. Of course, we still need to do dead code elimination in general; it’s just that it has to be aware of deoptimization.

Therefore, it’s not only important that MoarVM can optimize code. It’s also critical to its ability to do so that it can also deoptimize, falling back to slow paths when needed.

Down to machine code

Interpreting specialized bytecode can yield, in some programs, a significant improvement. The simple program:

my $i = 0; while ++$i <= 100000000 { }

Got a factor of 2.5 improvement with what we had several months ago, and there are still a lot of opportunities left (some of which we’ve already explored, and with others yet to go). However, interpreting – that is, looping over an instruction stream and choosing what to do for each instruction – has its overhead. Before the specialization process eliminates a lot of the dynamism, interpretation is only so bad; some ops have some checks to do, and so they cost a bit. But specialized code tends to have a lot of very cheap operations that just play with pointers – and then the overhead of interpreting can quickly come to account for the majority of the execution time.

Therefore, on x64, we can often take the further step of producing machine code to run, instead of specialized bytecode to interpret. Of course, we have to be ready to deoptimize from that too. I’ll save talking about the machine code part too much, since the guy who worked on it is, I believe, plotting an advent post about it in the near future. Needless to say, it makes a dramatic difference; for the benchmark above it more than halved the time again relative to the specialized bytecode.

We often referred to the “produce machine code” part of the work as “the JIT”. But really, the entire set of optimizations described in this article are part of the typical work a JIT compiler does. Producing machine code is just one small part, and it’s interesting to note that the win from eliminating dynamism was a bigger one than eliminating interpretation overhead.

What next?

The work so far has had a dramatic impact on performance. However, there’s still plenty of areas for improvement.

For one, the number of specializations produced and how we choose them can certainly use some tweaking. It’s possible for a routine to saturate its number of allowed specializations with ones that, in the long run, are not entirely strategic. Or we may be able to reasonably argue that it should be allowed more. At the same time, specialized bytecode is a notable memory overhead at this point. So, we need to produce both more and less specializations. Of course, what that really means is we need a better strategy for picking what is worth doing.

Another clear issue is that some speculative optimizations turn out to be bad bets, and we have no way of recovering from that. We can deoptimize pretty fast (in fact, some trade-offs have been made to have it be quite affordable). But it’s still a cost, and it’s possible a gamble that looked good based on available data may lead to code that has to be deoptimized a majority of the time – and thus run slower than if we’d never bothered. Again, it’s about being smarter and keeping on learning as we run the program.

Then there’s a bunch of things that we don’t specialize or compile to machine code in a good way yet. Array access is an obvious one, and big integer operations – which are at the heart of Perl 6’s Int type – are another. The specializer is also not as aware as it really should be of closures. These are all areas where we could do a whole lot better – and they are nice, incremental advances on what we already have.

A further interesting area for exploration is doing escape analysis, and being able to eliminate a bunch of GC allocations in favor of allocating the memory as part of the call frame, because we know it can never end up being used beyond it. This is a fair amount of work, but also highly relevant to Perl 6: many scalar containers for lexical variables would be great candidates.

Then there’s the whole idea of trace JIT. MoarVM so far has a routine-level JIT, often called a method JIT. There, we optimize at routine level first, and may do some inlining to flatten away the call tree. This is a pretty decent strategy for Perl 6, in so far as we’re often very interested in inlining primitive operators that, naively interpreted in the original dynamic program, would be multiple dispatch calls. Trace JITs, by contrast, just follow the flow of instructions, over however many callframes, and every single conditional branch becomes a guard. Their strength is loops, which today we cope with using On Stack Replacement (detect we’re in a hot loop, compile an optimized version of it, and hot-swap it). While it’s easy to talk about trace JIT vs. method JIT as a “pick one”, I’m much more interested in the “pick both” track. They both, after all, share a lot of infrastructural needs – and they have their strengths and weaknesses. There’s more than one way to dynamically optimize it, and, this being Perl, we might as well just do all the ways – spotting all the nice unifications and re-use opportunities along the way.