Fast Enough VMs in Fast Enough Time

Updated (February 07 2013): If you enjoy this article, you may also find the more academically-inclined article The Impact of Meta-Tracing on VM Design and Implementation interesting.





[This is a long article because it covers a lot of ground that is little known by the computing mainstream, and easily misunderstood. If you're familiar with the area, you may find yourself wanting to skip certain explanatory sections.]





If you can stomach the smell, put yourself briefly in the shoes of a programming language designer. What you want to do is create new programming languages, combining new and old ideas into a fresh whole. It sounds like a fun, intellectually demanding job, and occasionally it is. However, we know from experience that languages which exist solely in the mind or on paper are mostly worthless: it is only when they are implemented, and we can try them out, that we can evaluate them. As well as a language design, therefore, we need a corresponding language implementation. This need leads to what I think of as the language designer's dilemma: how much implementation is needed to show that a language design is good? If too little, the language will be dismissed out of hand as unusably slow: the kiss of death for many a language. Especially if the language design is adventurous, the language designer may not even be sure that it can be made adequately efficient.

If too much, low-level fiddling will have consumed most of the energy that should have gone into design. Even worse, low-level implementation concerns can easily end up dictating higher-level design choices, to the disadvantage of the latter. Finding a balance between these two points is difficult. When I was designing Converge, I came down on the side of trying to get the design right (though the end result has more than its fair share of mistakes and hacks). That decision had consequences, as I shall now describe.

Putting VM effort into context

I implemented two Virtual Machines (VMs) for Converge, both in C: the first was epically terrible; the second (introduced in Converge 1.0) merely awful. Even the second is extremely, sometimes almost unusably, slow: it is roughly 5-10x slower than CPython, tending towards the slower end of that range. This partly reflects my lack of experience and knowledge of VM implementation when I started; but it also reflects the fact that the Converge VM was a secondary concern to the language design. Despite that, I estimate that I spent around 18 man-months on the second VM (intertwined with a lot of non-VM work). If this seems a lot of effort to spend on such a poor quality VM, it's worth bearing a few things in mind: Converge has a hard-to-optimise expression evaluation system based on Icon (see this paper for more details); C, whilst a fun language in many ways, doesn't lend itself to the fast writing of reliable programs; and most real-world VMs have had a lot more effort put into them. As rough comparisons, CPython (the de-facto standard for Python, so named because it is written in C) has probably had a couple of orders of magnitude more effort put into it, and Java's HotSpot roughly three orders of magnitude more. It's not surprising that the second C Converge VM doesn't do well in such company: it's not so much a minnow as a single-celled organism.
What all this means is that the second C Converge VM is so slow that, during demos of some of Converge's advanced features, I learned to make winding-up gestures at the side of my laptop to amuse the audience. Even I was not sure whether its combination of features could ever be made adequately efficient.

Other approaches

Why did I write my own VM and not use someone else's? The choices are instructive. The traditional route for big compilers (e.g. gcc) is to output machine code (perhaps via assembler). The output is efficient, but the compiler itself requires a large amount of effort. For example, simply learning the intricacies of a processor like the x86 sufficiently well to generate efficient code isn't for the faint of heart. In short, the amount of effort this approach demands is generally prohibitive. An alternative to generating machine code directly is to generate C code and have that compiled into machine code. Several compilers have taken this approach over the years (from Cfront to the original Sather compiler). While still relatively difficult, it is certainly much easier than generating machine code directly. However, it can often lead to a poor experience for users: not only must they pay the costs of double compilation, but the translation typically loses a large quantity of useful debugging information (something which Converge pays special attention to). With few exceptions (one of which we'll see later), this approach is now rare. Perhaps the obvious choice is to use an existing VM. The two Goliaths are the JVM (e.g. HotSpot) and Microsoft's CLR (the .NET VM). When I started work on Converge, the latter (in the form of Mono) didn't run on OpenBSD (my platform of choice), so it was immediately discounted. HotSpot, however, remained a possibility because of its often stunning performance levels. The reason that I couldn't use it isn't really HotSpot-specific.
Rather, it is something inherent to VMs: they reflect the language, or group of languages, they were designed for. If a language fits within an existing VM's mould, that VM will probably be an excellent choice; if not, the semantic mismatch between the two can be severe. Of course, given sufficient will-power, any programming language can be translated to any other: in practice, the real issues are the ease of the translation and the efficiency of programs run using it. Two examples at opposite ends of the spectrum highlight this. Jython (Python running on the JVM) is a faithful implementation of Python; but even with the power of HotSpot behind it, Jython almost never exceeds CPython in performance, and is generally slower, because some features (mostly relating to Python's highly dynamic customisability) do not lend themselves to an efficient implementation on the JVM. Scala, on the other hand, was designed specifically for the JVM: to ensure reasonable performance, Scala's language design has in some parts had to be compromised (e.g. due to type erasure). Whether the semantic mismatch is manageable depends on the particular combination of language design and VM. Converge's unusual expression evaluation mechanism was enough on its own to rule out a practical JVM implementation (Jcon, an implementation of Icon for the JVM, is still slower than the old C Icon interpreter, which is itself no speed demon). As Converge's development progressed, a number of other features (e.g. its approach to tracing information) made it increasingly difficult to imagine how it could be practically wedged atop an existing VM.

A language for JITing VMs

What all the above means is that the options for implementing a language are generally unpalatable: they either require an undue amount of work, or compromises in language design, and, too often, both.
My suspicion is that this status quo has severely inhibited programming language research: few groups have had sufficient resources to implement unusual languages well enough to prove them usable. And then I came across PyPy. To be more accurate, after a few years of vaguely hearing of PyPy, 6 months ago I unexpectedly bumped into a PyPy developer who convinced me that PyPy's time had come. After porting PyPy to OpenBSD, I investigated further. What I've come to realise is that PyPy is two separate things. First, PyPy is a new VM for Python, which is often substantially faster than CPython. If you're a Python user (and I often am), this is interesting; otherwise it may be of only minor interest. Second, RPython is a language for writing VMs. This is of interest to every language designer and language implementer, current or would-be. Unfortunately, the current literature uses PyPy to cover both, which has confused many a reader (myself included). Henceforth, I shall unambiguously use RPython to refer to the language for writing VMs in, and PyPy for the new Python VM written in RPython. So, what is RPython? The obvious facts about it are that it is a strict subset of Python whose programs are translated to C. Every RPython program is a valid Python program (which can be run using a normal Python interpreter), but not vice versa. However, RPython is suitably restricted to allow meaningful static analysis. Most obviously, static types (with a type system roughly comparable to Java's) are inferred and enforced. In addition, extra analysis is performed, e.g. to ensure that list indices don't become negative. Users can influence the analysis with assert statements, but otherwise it is fully automatic. RPython would be of relatively little interest if all it did was subset Python and output C. Though a full programming language, RPython is unlikely to be the next big language for web programming, for example: it's too restricted for the mainstream.
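To give a concrete flavour of those restrictions, consider the pair of functions below (my own hypothetical example, not taken from the RPython documentation). Both are ordinary Python, but a type inferencer of the kind described above would accept only the first:

```python
# Acceptable in the RPython style: every variable keeps a single,
# statically inferable type ('int' throughout).
def sum_to(n):
    i = 0
    total = 0
    while i < n:
        total = total + i
        i = i + 1
    return total

# Valid Python, but the sort of thing static type inference rejects:
# 'x' would need to be an int on one branch and a str on the other.
def int_or_str(flag):
    if flag:
        x = 1
    else:
        x = "one"
    return x

print(sum_to(10))  # prints 45 under a normal Python interpreter
```

Under plain Python both functions run happily; the point is that only code shaped like the first can be translated to efficient C.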
Indeed, there is little chance of it becoming a widely used alternative to full Python (I would say the same of similar restricted approaches such as Shed Skin). However, in addition to outputting optimised C code, RPython automatically creates a second representation of the user's program. Assuming RPython has been used to write a VM for language L, one gets not only a traditional interpreter, but also an optimising Just-In-Time (JIT) compiler for free. In other words, when a program written in L executes on an appropriately written RPython VM, hot loops (i.e. those which are executed frequently) are automatically turned into machine code and executed directly. This is RPython's unique selling point, as I'll now explain.

Traditional VM implementation

Because RPython is unique, it's easy to overlook what's interesting about it: it took me a couple of months of using it before I had built up an accurate understanding. Looking at the traditional approach to VM implementation is perhaps the easiest way to explain what's interesting about RPython. First, let me make a couple of assumptions explicit. Languages like C are well suited to being translated directly to machine code, as they are little more than a thin layer over it: everything that is C about a program is dealt with during compilation. In other words, the C language maintains no run-time infrastructure: once a program is translated into machine code, it's on its own. More complex languages (Java, Python, etc.) permit complex run-time interactions and changes. This has many advantages from the programmer's point of view, but makes life difficult for the language implementer: most static optimisations are impossible, because the user can often change fundamental behaviour at run-time. The best VMs defer optimisations until run-time, when information about the particular way the program is being executed can be used to dynamically optimise it.
The insight here, in part, is that while we can't statically optimise a program for all the possible ways it might be run, we can dynamically optimise it for the way a particular user is running it. Take a JITing VM such as HotSpot or V8. Such a VM will initially use an interpreter to execute the user's program. Unfortunately for us, "interpreter" is a vague term, used to mean many different things (and often used as a pejorative): let us assume that it is a slow-ish way of loading in a program and executing it step by step (perhaps by loading in a bytecode file or by operating on an Abstract Syntax Tree (AST)), with few, if any, dynamic optimisations. Though slow to execute, interpreters have the advantage that they are easy to write and have low start-up costs. For code that executes infrequently, they are a reasonable choice. When, during execution, the interpreter in a VM such as HotSpot or V8 spots a frequently executed piece of code, it will hand that chunk of code off to a JIT compiler (along with information about the context within which that code is used) to convert it into machine code. The JIT compiler (henceforth just "the JIT") is entirely separate from the interpreter. In other words, such a VM contains two completely separate implementations of the intended language's semantics. This has several consequences: JITs are much harder to create than interpreters, and involve much more work. Most interested people can write an interpreter; writing a JIT requires much more specialist understanding and time.

Tiny differences between the interpreter and the JIT (which are almost inevitable in such complex systems) lead to divergences in a program's observable behaviour. From a programmer's point of view, the program behaves one way when interpreted and another when JITted; unfortunately, it is often extremely difficult to work out which parts of a program have been JITted, or when.

Because JITs are much more complex than interpreters, languages with complex run-times are disproportionately hard to create JITs for. In other words, the more complex the language, the harder it is to create a complete JIT, and the more likely that the JIT and interpreter will provide subtly different results. For example, Lua, a relatively small language, has a complete JIT in LuaJIT, which has evolved alongside the traditional interpreter over many years. In contrast, Python, a much more complex language (for better or worse) than Lua, had, until recently, only an incomplete JIT in Psyco, which struggled to match CPython's often rapid evolution. Of course, all such problems can be solved with sufficient resources, but they explain in large part why major open-source languages like Python and Ruby currently ship with JIT-less VMs (i.e. they have only an interpreter).

JITs for free

What RPython allows one to do is profoundly different to the traditional route. In essence, one writes an interpreter and gets a JIT for free. I suggest rereading that sentence: it fundamentally changes the economics of language implementation for many of us. To give a rough analogy, it is like moving from manual memory management to automatic garbage collection. RPython is able to do this because of the particular nature of interpreters. An interpreter, whether it be operating on bytecode or ASTs, is simply a large loop: load the next instruction, perform the associated actions, go back to the beginning of the loop. In order to switch from interpretation to JITing, RPython needs to know when a hot loop has been encountered, so that it can generate machine code for that loop and use it for subsequent executions. In essence, one need only add two function calls to an RPython program to add a JIT.
The first function call (can_enter_jit) is used to inform RPython that a loop in the user's program has been encountered, and that it might want to start generating machine code if that loop has been encountered often. The second (jit_merge_point) is used to indicate to RPython when it can switch to an existing machine code version of a loop. There are a few other details to attend to, but altogether fewer than 10 lines of code do the job. To get the most out of the JIT, extra hints and certain programming practices help, but that's icing on the cake rather than anything fundamental. Now is a good time to get an idea of how RPython generates a JIT from just an interpreter (what RPython calls the language interpreter). As mentioned earlier, RPython automatically creates, alongside the C code, a second representation of the interpreter (the tracing interpreter). The details of how the tracing interpreter is stored are irrelevant to us, except to note that it's in a form that a JIT can manipulate (conceptually, it could be an AST-like structure). RPython's JIT is a tracing JIT. When a hot loop is detected, a marker is left such that the next time the loop is about to run, the JIT enters tracing mode. During tracing mode, a complete execution of the loop is performed and all the actions it takes are traced (i.e. recorded) using the tracing interpreter (which is much, much slower than the language interpreter). After the loop has finished, the trace is analysed, optimised, and converted into machine code. All subsequent executions of the loop will then call the machine code version. Since subsequent executions may diverge from the recorded trace, RPython automatically inserts guards into the machine code to detect divergence from the conditions under which the trace was recorded. If a guard fails at any point, execution falls back to the interpreter. At this point, it's worth taking a brief side-tour to introduce tracing JITs.
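To make the shape of this concrete, here is a minimal sketch of where the two hints sit in an interpreter's dispatch loop. The three-instruction bytecode is invented for illustration, and the two hint functions are stubbed out as no-ops so the sketch runs under plain Python; in real RPython they are supplied by the translation framework, and it is the translator that gives them meaning:

```python
def can_enter_jit(pc):    # stub: in RPython, a hint supplied by the framework
    pass

def jit_merge_point(pc):  # stub: likewise
    pass

# An invented bytecode: each instruction is an (opcode, operand) pair.
DECR, JUMP_IF_NONZERO, HALT = range(3)

def interpret(bytecode, acc):
    pc = 0
    while True:
        # "If you have machine code for the program at pc, run it here."
        jit_merge_point(pc)
        op, arg = bytecode[pc]
        if op == DECR:
            acc -= 1
            pc += 1
        elif op == JUMP_IF_NONZERO:
            if acc != 0:
                # A backward jump closes a user-level loop: tell the JIT
                # it may want to start counting (and, later, tracing).
                can_enter_jit(arg)
                pc = arg
            else:
                pc += 1
        elif op == HALT:
            return acc

prog = [(DECR, 0), (JUMP_IF_NONZERO, 0), (HALT, 0)]
print(interpret(prog, 5))  # counts 5 down to 0, then prints 0
```

The placement is the important part: jit_merge_point marks where execution can jump into already-compiled machine code for the current pc, while can_enter_jit marks the close of a loop in the user's program, which is where hot-loop detection naturally happens.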
Although the concept of tracing JITs dates back to around 1980, it didn't achieve any kind of prominence until around 2006, when Andreas Gal modernised it. It has since been used in several places, most notably in Mozilla's TraceMonkey JavaScript VM, which was used in Firefox (we'll see later why the past tense is used). Tracing JITs record a specific path of execution in a running program and convert it to machine code. This is quite different to the method-based approach of most JITs, and therefore requires explanation.

User's end program:

    if x < 0:
        x = x + 1
    else:
        x = x + 2
    x = x + 3

Trace when x is set to 6:

    guard_type(x, int)
    guard_not_less_than(x, 0)
    x = int_add(x, 2)
    x = int_add(x, 3)

Optimised trace:

    guard_type(x, int)
    guard_not_less_than(x, 0)
    x = int_add(x, 5)

Figure 1: Example code and high-level traces.

Figure 1 shows a high-level example of a tracing JIT for a dynamically typed, Python-esque language. Let us assume that the code in the first column is part of a hot loop that the tracing JIT decides is worth converting into machine code. On the next execution of the loop, the initial value of x is set to 6 (i.e. an integer). The VM then enters the tracing interpreter, completing an iteration of the loop as well as recording a trace of what happened. The first thing the tracing JIT will create is a guard to ensure that the generated machine code is only executed if x is an integer on subsequent executions; this allows subsequent operations to be optimised for integers. Note that the specific value of x is not recorded in the trace (if it were 1 or 2 or 383, the resulting trace would be the same; but if it were a negative number, the trace would be different), because otherwise the trace would be so narrow as to be almost useless. Next, the condition if x < 0 is false, so the else branch is taken.
The trace first records a guard, to ensure that subsequent runs follow the same logic as the trace, and then records the else branch, completely ignoring the first branch. The middle column thus shows the full trace from our simple hot loop. With the trace complete, we can then optimise it, compressing the two separate additions of integer constants into one, as shown in the final column. That final trace is then translated into machine code. On subsequent executions of the hot loop, the (fast) machine code version will be executed: if either of the guards fails, the machine code version will exit back to the interpreter.

Optimising traces

While the example in Figure 1 gives a reasonable high-level idea about tracing JITs, it doesn't really explain how the trace is created. RPython badges itself as a meta-tracing system, meaning that the user's end program isn't traced directly (which is what Figure 1 suggests); rather, the interpreter itself is traced. Using the same example code from Figure 1, Figure 2 shows a snippet of the interpreter and the trace of the interpreter that it leads to. This trace (though simplified somewhat to make it readable) is indicative of the traces that RPython records.

Interpreter:

    program_counter = 0
    stack = []
    vars = {...}
    while True:
        jit_merge_point(program_counter)
        instr = load_instruction(program_counter)
        if instr == INSTR_VAR_GET:
            stack.push(vars[read_var_name_from_instruction()])
            program_counter += 1
        elif instr == INSTR_VAR_SET:
            vars[read_var_name_from_instruction()] = stack.pop()
            program_counter += 1
        elif instr == INSTR_INT:
            stack.push(read_int_from_instruction())
            program_counter += 1
        elif instr == INSTR_LESS_THAN:
            rhs = stack.pop()
            lhs = stack.pop()
            if isinstance(lhs, int) and isinstance(rhs, int):
                if lhs < rhs:
                    stack.push(True)
                else:
                    stack.push(False)
            else:
                ...
            program_counter += 1
        elif instr == INSTR_IF:
            result = stack.pop()
            if result == True:
                program_counter += 1
            else:
                program_counter += read_jump_from_if_instruction()
        elif instr == INSTR_ADD:
            lhs = stack.pop()
            rhs = stack.pop()
            if isinstance(lhs, int) and isinstance(rhs, int):
                stack.push(lhs + rhs)
            else:
                ...
            program_counter += 1

Initial trace when x is set to 6:

    v0 = program_counter
    v1 = stack
    v2 = vars
    v3 = load_instruction(v0)
    guard_eq(v3, INSTR_VAR_GET)
    v4 = dict_get(v2, "x")
    list_append(v1, v4)
    v5 = add(v0, 1)
    v6 = load_instruction(v5)
    guard_eq(v6, INSTR_INT)
    list_append(v1, 0)
    v7 = add(v5, 1)
    v8 = load_instruction(v7)
    guard_eq(v8, INSTR_LESS_THAN)
    v9 = list_pop(v1)
    v10 = list_pop(v1)
    guard_type(v9, int)
    guard_type(v10, int)
    guard_not_less_than(v9, v10)
    list_append(v1, False)
    v11 = add(v7, 1)
    v12 = load_instruction(v11)
    guard_eq(v12, INSTR_IF)
    v13 = list_pop(v1)
    guard_false(v13)
    v14 = add(v11, 2)
    v15 = load_instruction(v14)
    guard_eq(v15, INSTR_VAR_GET)
    v16 = dict_get(v2, "x")
    list_append(v1, v16)
    v17 = add(v14, 1)
    v18 = load_instruction(v17)
    guard_eq(v18, INSTR_INT)
    list_append(v1, 2)
    v19 = add(v17, 1)
    v20 = load_instruction(v19)
    guard_eq(v20, INSTR_ADD)
    v21 = list_pop(v1)
    v22 = list_pop(v1)
    guard_type(v21, int)
    guard_type(v22, int)
    v23 = add(v22, v21)
    list_append(v1, v23)
    v24 = add(v19, 1)
    v25 = load_instruction(v24)
    guard_eq(v25, INSTR_VAR_SET)
    v26 = list_pop(v1)
    dict_set(v2, "x", v26)
    v27 = add(v24, 1)
    v28 = load_instruction(v27)
    guard_eq(v28, INSTR_VAR_GET)
    v29 = dict_get(v2, "x")
    list_append(v1, v29)
    v30 = add(v27, 1)
    v31 = load_instruction(v30)
    guard_eq(v31, INSTR_INT)
    list_append(v1, 3)
    v32 = add(v30, 1)
    v33 = load_instruction(v32)
    guard_eq(v33, INSTR_ADD)
    v34 = list_pop(v1)
    v35 = list_pop(v1)
    guard_type(v34, int)
    guard_type(v35, int)
    v36 = add(v35, v34)
    list_append(v1, v36)
    v37 = add(v32, 1)
    v38 = load_instruction(v37)
    guard_eq(v38, INSTR_VAR_SET)
    v39 = list_pop(v1)
    dict_set(v2, "x", v39)
    v40 = add(v37, 1)

Figure 2: An interpreter fragment and a full trace of that interpreter for the end-user program from Figure 1.

Hopefully the interpreter code in Figure 2 is mostly self-explanatory, being a simple-minded stack-based interpreter. One thing that needs explanation is the jit_merge_point call: this is the point where the interpreter tells RPython "if you have any machine code to execute for the program beginning at position program_counter, here's where to start it". Initially the trace might seem rather difficult to read, but if you think of it as a flattened record of all of the interpreter's actions while executing the user's code, it becomes rather easier. The astute reader will notice that the traces are in Static Single Assignment (SSA) form: every assignment is to a previously unused variable. While one would probably not want to write programs in this style, it has many advantages for optimisers, because it trivially exposes the data flow. In normal programs, we are often unable to determine whether a variable x may be assigned to at a later point (because, for example, one branch of an if assigns to it, but the other doesn't); in SSA form we know for sure. With this knowledge, it should hopefully be fairly simple to see that the trace on the right is simply a record of the instructions the interpreter on the left performed while executing our example program. It's worth thinking about this relationship, as it's key to RPython's approach. Once one has a handle on the trace, its sheer size should be a concern: there's a lot of stuff in there, and if it were converted to machine code as-is, one would see only disappointing performance gains (around 40%). It is at this point that RPython's trace optimiser kicks in. I'll now try to give an idea of what RPython's trace optimiser can do. The first thing that's obviously pointless about the above is the continual reading of bytecode instructions.
If we start from a specific point in the program, and all the guards succeed, we know that the sequence of instructions read will always be the same in a given trace: checking that we've really got an INSTR_VAR_GET followed by an INSTR_INT is pointless. Fortunately, the trace optimiser can be influenced by the user. One thing the trace optimiser knows is that, because program_counter is passed to jit_merge_point as a way of identifying the current position in the user's end program, any calculations based on it must be constant. These are thus easily optimised away, leaving the trace looking as in Figure 3.

Constant folded trace:

    v1 = stack
    v2 = vars
    v4 = dict_get(v2, "x")
    list_append(v1, v4)
    list_append(v1, 0)
    v9 = list_pop(v1)
    v10 = list_pop(v1)
    guard_type(v9, int)
    guard_type(v10, int)
    guard_not_less_than(v9, v10)
    list_append(v1, False)
    v13 = list_pop(v1)
    guard_false(v13)
    v16 = dict_get(v2, "x")
    list_append(v1, v16)
    list_append(v1, 2)
    v21 = list_pop(v1)
    v22 = list_pop(v1)
    guard_type(v21, int)
    guard_type(v22, int)
    v23 = add(v22, v21)
    list_append(v1, v23)
    v26 = list_pop(v1)
    dict_set(v2, "x", v26)
    v29 = dict_get(v2, "x")
    list_append(v1, v29)
    list_append(v1, 3)
    v34 = list_pop(v1)
    v35 = list_pop(v1)
    guard_type(v34, int)
    guard_type(v35, int)
    v36 = add(v35, v34)
    list_append(v1, v36)
    v39 = list_pop(v1)
    dict_set(v2, "x", v39)

Figure 3: The trace with calculations related to the program counter optimised away.

Our trace has now become a fair bit smaller, but we need it to get a lot smaller still if we want good performance. Fortunately, the SSA form of the trace now comes to the fore. We can follow the flow of operations on a given list l: if an append of an object to a list l is followed by a pop, without any other operations on l in the interim, we can remove both calls.
We can also do something similar for the dictionary stores and lookups in the middle of the trace: since the intermediate values are not used until the end, storing them is pointless. This is a bigger win, as dictionary lookups are much more expensive than list operations. The resulting optimisations are shown in Figure 4.

List folded trace:

    v1 = stack
    v2 = vars
    v4 = dict_get(v2, "x")
    guard_type(v4, int)
    guard_not_less_than(v4, 0)
    v16 = dict_get(v2, "x")
    guard_type(v16, int)
    v23 = add(v16, 2)
    dict_set(v2, "x", v23)
    v29 = dict_get(v2, "x")
    guard_type(v29, int)
    v36 = add(v29, 3)
    dict_set(v2, "x", v36)

Dict folded trace:

    v1 = stack
    v2 = vars
    v4 = dict_get(v2, "x")
    guard_type(v4, int)
    guard_not_less_than(v4, 0)
    v23 = add(v4, 2)
    guard_type(v23, int)
    v36 = add(v23, 3)
    dict_set(v2, "x", v36)

Figure 4: The trace with list and dictionary operations folded away.

Our trace is now looking much smaller, but we can still make two further, easy optimisations. First, we know that if v4 is an int then it will still be an int after addition: we can thus remove the second type check. Once that is done, we can easily collapse the two additions of constants (since x + 2 + 3 ≜ x + 5). Figure 5 shows both optimisations.

Type folded trace:

    v1 = stack
    v2 = vars
    v4 = dict_get(v2, "x")
    guard_type(v4, int)
    guard_not_less_than(v4, 0)
    v23 = add(v4, 2)
    v36 = add(v23, 3)
    dict_set(v2, "x", v36)

Addition folded trace:

    v1 = stack
    v2 = vars
    v4 = dict_get(v2, "x")
    guard_type(v4, int)
    guard_not_less_than(v4, 0)
    v23 = add(v4, 5)
    dict_set(v2, "x", v23)

Figure 5: The trace with type checks and constant additions folded away.

At last, we have a highly optimised trace which is suitable for conversion to machine code.
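The stack and arithmetic rewrites just described can be mimicked in a few lines of code. The sketch below is my own simplification, not RPython's actual optimiser: it represents trace operations as tuples in an invented encoding and makes two passes, one cancelling matched list_append/list_pop pairs by symbolically executing the stack, and one folding adjacent constant additions:

```python
def fold_stack_ops(trace):
    """Remove matched ('list_append', list, value) / ('list_pop', list,
    result_var) pairs by tracking the stack symbolically, substituting
    each popped result with the value that was appended."""
    sym_stack = []  # values appended but not yet popped
    env = {}        # pop result variable -> known value
    out = []
    for op in trace:
        # Substitute operands whose value we already know.
        op = tuple(env.get(x, x) for x in op)
        if op[0] == "list_append":
            sym_stack.append(op[2])
        elif op[0] == "list_pop" and sym_stack:
            env[op[2]] = sym_stack.pop()
        else:
            out.append(op)
    return out

def fold_constant_adds(trace):
    """Collapse ('add', vb, va, c1) followed by ('add', vc, vb, c2)
    into ('add', vc, va, c1 + c2)."""
    out = []
    for op in trace:
        if (op[0] == "add" and isinstance(op[3], int) and out
                and out[-1][0] == "add" and isinstance(out[-1][3], int)
                and op[2] == out[-1][1]):
            prev = out.pop()
            out.append(("add", op[1], prev[2], prev[3] + op[3]))
        else:
            out.append(op)
    return out

# A fragment in the spirit of Figure 3, then of Figure 5's left column:
print(fold_stack_ops([("list_append", "v1", "v4"),
                      ("list_pop", "v1", "v9"),
                      ("guard_type", "v9", "int")]))
print(fold_constant_adds([("add", "v23", "v4", 2),
                          ("add", "v36", "v23", 3)]))
```

A real trace optimiser must also respect escaping values and side effects (e.g. a dict_set between two dict_gets), but the pattern of symbolic execution plus substitution is the same idea the figures illustrate.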
Not only is it much smaller than the original trace, but it contains far fewer complicated, slow function calls: the resulting machine code will run massively faster than the original interpreter. Producing small traces like this is one of the skills of writing a VM with a tracing JIT. The above example should give you a flavour of how this is done in RPython, though there are many low-level details that can make doing so difficult. As we shall see later, we often need to help the JIT to produce small traces.

A new Converge VM

After becoming intrigued by the possibilities of RPython, I decided to use it to implement a new VM for Converge. To allow an apples-to-apples comparison, my initial goal was to maintain 100% compatibility with the old VM, so that the same bytecode could run on both the C and RPython versions of the VM. That goal wasn't quite met, although I came extremely close: for quite some time, bytecode for the RPython VM could be run on the C VM (but not vice versa). First, it's probably useful to give some simple stats about the C VM that I was aiming to replace. It is about 13KLoC (thousand lines of code; I exclude blank lines and purely commented lines from this count). It contains a simple mark-and-sweep garbage collector that is mostly accurate but conservatively collects the C stack (so that VM code doesn't need to be clever when dealing with references to objects). It implements full continuations at the C level, copying the C stack to the heap and back again as necessary (this turned out to be much more portable than I had initially feared), so that code at both the VM and Converge program level can be written in a similar(ish) style. The VM is ported to a variety of 32- and 64-bit little-endian systems: OpenBSD, Linux, OS X, Cygwin, and (native binary) Windows. Overall, the VM works, but is very slow and, in parts, rather hard to understand.
There are no precise records to determine the effort put into it, but I estimate it took between 12 and 24 man-months; let's call it 18. I started work on the new VM on September 1st 2011. Before then, I had never used RPython, nor had anyone (to the best of my knowledge) outside the core PyPy group used RPython for a VM of this size (at the time, the Happy VM, which implements a subset of PHP, was the closest comparison). Though the Converge VM is obviously not a big VM, it is beyond merely a toy. It also has some unusual aspects (as touched upon earlier in this article) that make it an interesting test for RPython. By December 19th 2011, I had a feature-compatible version of the Converge VM, in the sense that it could run every Converge program I could lay my hands on (which, admittedly, is not a huge number). After an initially slow period of development, mostly because of my unfamiliarity with RPython, progress became rapid towards the end. The resulting VM is about 5.5KLoC (compared to 13KLoC for the C VM). I estimate I was able to dedicate around half of my time to the VM during those 4 months (I started a new job on September 1st and then taught a course on a largely unfamiliar subject). Although the two time estimates (18 man-months for the C VM vs. 2-3 man-months for the RPython VM) aren't fully comparable, they are useful. While many parts of the RPython VM were a simple translation from the C VM, the C VM was itself partially a reimplementation of a previous VM (though to a lesser extent). The RPython VM's structure is also substantially different from the C VM's (it's far cleaner and easier to understand), so some aspects of the translation were hardly simple. My best guess is that moving from C (a language which I enjoy a great deal, despite its flaws) to RPython was the single biggest factor.
If nothing else, large amounts of the C VM involve faffing about with memory resizing; RPython, as a fully garbage collected language, sweeps all that under the carpet.

Status of the VM

The new Converge VM's source is freely downloadable, as are binaries for most major platforms (other than Windows, at the time of writing). Eventually this VM will form part of a Converge 2.0 release, although more testing will be needed before it reaches that point. Before you form your opinions about the new VM, it's worth knowing what it is and isn't. It's not meant to be an industrial strength VM, at least not in its current form. Converge is a language for exploring compile-time meta-programming and domain specific languages. Some things which mainstream programming languages need to care greatly about (e.g. overflow checking; Unicode) are partly or wholly ignored. Such matters are a problem for a later day. It's also worth knowing that I haven't spent a huge amount of time optimising the new VM. As soon as it reached a fast enough speed, I was happy. Whereas before I often had to explain away Converge's slow performance, it's now sufficient for the experiments I and others want to do in Converge. It's worth exploring that performance in more detail.

Performance

So, the RPython VM was created in roughly 1/6 of the time it took to create the C VM. What is the performance like? This section will try to give a flavour of the performance, though please note that it's not totally scientific. The December 19th version of Converge (git hash 84bb9d6064 if you wish to look at it) was already usefully faster than the old C VM. One of my simple benchmarks has long been to time make regress in the Converge system (excluding the time to build the VM itself).
The December 19th version of the VM isn't directly comparable to the old C VM (it has a little more Converge code, and the parsing algorithm, which occupies a lot of the execution time, is slightly different), but it's close enough to be useful. The following timings are from an Ubuntu server with a 3.33GHz Xeon (I'll explain why Linux later); since Linux support was added a day later, it's necessary to check out a December 20th version for this test (git hash 00e290ccbb). The C VM runs make regress in 67s; the RPython VM (translated with --opt=3 since the JIT wasn't functional at that point) in 32s. Looking at output from gprof on the two VMs quickly shows up 2 main reasons for this speedup: first, RPython has an infinitely better garbage collector (with obvious consequences); second, RPython has much better dictionaries (also known as hash tables). Given that dynamically typed OO languages like Converge spend an awful lot of time looking up slot names, optimised dictionaries can play an unexpectedly large part in performance. make regress gives a good idea of the lower bound of performance, since it is torture for a JIT. It has a large body of code which runs for just enough time for the JIT to warm up (i.e. to have identified some hot spots, traced them, and converted them into machine code), but not enough time to benefit from its work (as we shall see later, the compiler is particularly punishing for a tracing JIT). Since December 19th, the old and new systems have diverged sufficiently that an accurate comparison of make regress is now hard to make. That said, the system has continued to get faster, and even on the hard case of the compiler, it gives a 2-3x speed-up over the old VM, which is rather useful. How does it perform on more general types of code? In one sense, this is an impossible question, because no two people share the same definition of "general".
It's even harder when a JIT is involved, as JITs often give surprising performance results: given two seemingly similar programs, it's quite common for one to perform substantially better with a JIT than the other. So, as a proxy, I'll use a few simple benchmarks — the limitations of this approach are obvious, but it's better than nothing. I'll start with the stone benchmark. This simple benchmark started life as an Ada program, but we'll take as our starting point the Python version. The Converge version is a straight-forward translation. The old VM lacks certain timing functionality, so the timings for Converge 1.2 are ever so slightly off (being a few ms higher than they should really be): as you'll soon see, such a small difference doesn't make much difference in the overall scheme of things. By default, stone will perform 50000 iterations of the benchmark. To show how JIT warmup times can affect things, I've also included timings for 500000 iterations. All timings are on the Xeon machine described earlier; all tests were run 3 times and the best figure taken.

  VM                 stone (50000)   stone (500000)
  CPython 2.7.2      0.45s           4.50s
  PyPy 1.7           0.07s           0.28s
  Converge 1.2       2.39s           24.1s
  Converge-current   0.14s           0.44s

Figure 6: The Stone benchmark.

Although stone is venerable, it's also so small and artificial that it doesn't necessarily tell us very much. The Richards benchmark is a much more interesting benchmark that models task dispatching (Mario Wolczko explains Richards in depth and also provides various other language implementations). By default it performs 10 iterations; I've also included timings for 100 iterations. Figure 7 shows the timings.

  VM                 Richards (10)   Richards (100)
  CPython 2.7.2      2.00s           19.9s
  PyPy 1.7           0.34s           0.79s
  Converge 1.2       12.4s           126s
  Converge-current   0.93s           4.9s

Figure 7: The Richards benchmark.
One thing worthy of note in Figure 7 is the better all-round performance of PyPy compared to Converge: it is substantially faster. As a final benchmark, and an example of something which programmers need to do frequently, I chose something for which neither the old nor the new Converge VM has had any sort of optimisations: sorting. Part of the reason why I expected them to do badly is that neither optimises list accesses. x[0] in Converge is translated to x.get(0), whereas many other VMs (including CPython and PyPy) special-case this common operation. 100000 random strings were placed into a file and sorted. The same sorting algorithm is used in both the Python version and the Converge version (indeed, I translated the Converge version into Python to ensure a fair comparison). Figure 8 shows the timings.

  VM                 sorting (100000)   sorting (1000000)
  CPython 2.7.2      1.38s              17.3s
  PyPy 1.7           0.22s              3.17s
  Converge 1.2       13.40s             678s
  Converge-current   0.42s              4.23s

Figure 8: The sorting benchmark.

The terrible performance of the old Converge VM in Figure 8 surprised even me. The most likely explanation is that the large number of elements overloads the garbage collector: at a certain point, it can overflow its stack, and performance then degrades non-linearly. I was also surprised by the PyPy figures, with a larger than expected slowdown on the larger number of elements. This appears to be fixed in the nightly PyPy build I downloaded (which is very close to what will be PyPy 1.8): the timings were 0.21s and 2.22s for the small and large datasets respectively. Although this section has contained a number of very hard figures, one should be careful about making strong claims about performance from such a small set of benchmarks.
My gut feeling is that they are over-generous to the new Converge VM, mostly because there are many areas in the VM which have received no optimisation attention at all: if one of those were used repeatedly, performance would suffer disproportionately. I suspect that, rather than being much faster than CPython 2.7.2 (as above), its performance on a wider set of benchmarks would probably be roughly on a par with it. Even so, that would still be a huge improvement on the old VM. The interesting thing is that most of the performance gains are from RPython itself: I only made a few relatively easy changes to increase performance, as we shall now see.

Optimising an RPython JIT

Some RPython VMs lead to a much more efficient JIT than others. The trace optimiser, while clever, is not magic, and certain idioms prevent it from working to its full potential. The early versions of the Converge VM were naive and JIT-unfriendly: interestingly, I found that a surprisingly small number of tactics hugely improved the JIT. The first tactic is to remove as many instances of arbitrarily resizable lists as possible. The JIT can never be sure when appending an item to such a list might require a resize, and is thus forced to add (opaque) calls to internal list operations to deal with this possibility. Such calls prevent many optimisations from being applicable (and are relatively slow). When this was first pointed out to me, I was horrified: my RPython VM was fairly sizeable and used such lists extensively. Most noticeably, the Converge stack was a global, resizable list. After a little bit of thought, I realised that it's possible to statically calculate how much stack space each Converge function requires (this patch started the ball rolling). I was then able to move from a global resizable stack to a fixed-size stack per function frame (i.e. the frame created upon each function call; these are called continuation frames in Converge, though that need not concern us here).
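The idea above can be sketched in a few lines of plain Python (the opcode names and classes here are illustrative, not the Converge VM's actual code): because every opcode has a known net effect on the stack, a single pass over a function's bytecode yields its maximum stack depth, so each frame can preallocate a fixed-size stack whose length the JIT can treat as a constant.

```python
# Net stack effect of each (hypothetical) opcode.
STACK_EFFECT = {"PUSH_CONST": 1, "ADD": -1, "POP": -1}

def max_stack_depth(bytecode):
    """Statically compute the stack space a function needs by walking
    its bytecode once, tracking the running stack depth."""
    depth = max_depth = 0
    for op in bytecode:
        depth += STACK_EFFECT[op]
        max_depth = max(max_depth, depth)
    return max_depth

class Frame:
    def __init__(self, bytecode, parent=None):
        # Fixed-size stack: no resizes can ever occur, so the JIT need
        # not emit opaque calls to list-resizing machinery.
        self.stack = [None] * max_stack_depth(bytecode)
        self.stackpos = 0
        # Pointer to the calling frame, removing the need for a global,
        # resizable list of frames (the second change described above).
        self.parent = parent

    def push(self, v):
        self.stack[self.stackpos] = v
        self.stackpos += 1

    def pop(self):
        self.stackpos -= 1
        return self.stack[self.stackpos]

bc = ["PUSH_CONST", "PUSH_CONST", "ADD"]  # e.g. compiling "2 + 3"
f = Frame(bc)                             # stack of exactly 2 slots
f.push(2); f.push(3)
f.push(f.pop() + f.pop())
```

Note that the depth calculation happens once, at bytecode-loading (or compile) time, not on every call; at run-time a frame's stack size is simply a known integer.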
At this point, the relative ease of developing in a fairly high-level language became obvious. If I had tried to make such a far-reaching change in the C VM, it would have taken at least a week. In RPython, it took less than a day. Some other arbitrarily resizable lists took a little more thought. After a while it became clear that, even though each function now had its own fixed-size stack, the global stack of function frames, stored of course in a resizable array, was becoming a bottleneck. That seemed hard to fix: unlike the stack size needed by a function frame, there is no way to statically determine how deeply function calls might nest. A simple solution soon presented itself: having each function frame store a pointer to its parent removed the need for a list of function frames (see this patch). The second tactic is to tell the JIT when it doesn't need to include a calculation in a trace at all. The basic idea here is that when creating a trace, we often know that certain pieces of information are fairly unlikely to change in that context. We can then tell the JIT that these are constant for that trace: it will insert an appropriate guard to ensure that this is true, often allowing subsequent calculations to be optimised away. The use of the word "constant" here can mislead: it's not a static constant in the sense that it's fixed at compile-time. Rather, it is a dynamic value that, in the context of a particular trace, is unlikely to change. Promoting values and eliding functions are the main tools in this context: Carl Friedrich Bolz described examples in a series of blog posts. The new Converge VM, for example, uses maps (which date back to Self), much as outlined by Carl Friedrich. The interesting thing is that I haven't really spent that long optimising the Converge JIT: perhaps a man week in the early days (when I was trying to get a high level picture of RPython) and around two man weeks more recently.
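The Self-style maps technique mentioned above can be sketched in plain Python (all names here are illustrative, not the Converge VM's actual code). Objects with the same set of slot names share one immutable Map, which translates names to storage indices. In an RPython VM, the name-to-index lookup would be marked elidable and the map itself promoted to a trace-constant: the trace optimiser can then guard once on the map and collapse each slot access into a single indexed read.

```python
class Map:
    """Maps slot names to storage indices; shared between objects."""
    def __init__(self):
        self.indexes = {}      # slot name -> storage index
        self.transitions = {}  # slot name -> successor Map

    def index_of(self, name):
        # In RPython this would be elidable: for a given (map, name)
        # pair the answer never changes, so traces can cache it.
        return self.indexes.get(name, -1)

    def with_slot(self, name):
        # Adding the same slot to the same map always yields the same
        # successor map, so objects built the same way share maps.
        if name not in self.transitions:
            new = Map()
            new.indexes = dict(self.indexes)
            new.indexes[name] = len(self.indexes)
            self.transitions[name] = new
        return self.transitions[name]

EMPTY_MAP = Map()

class Obj:
    def __init__(self):
        self.map = EMPTY_MAP
        self.storage = []      # slot values, indexed via the map

    def set_slot(self, name, val):
        i = self.map.index_of(name)
        if i == -1:
            self.map = self.map.with_slot(name)
            self.storage.append(val)
        else:
            self.storage[i] = val

    def get_slot(self, name):
        # A trace would promote self.map, guard on it, then elide
        # index_of away, leaving just: self.storage[<constant index>].
        i = self.map.index_of(name)
        return self.storage[i] if i != -1 else None

a, b = Obj(), Obj()
a.set_slot("x", 1); a.set_slot("y", 2)
b.set_slot("x", 3); b.set_slot("y", 4)
assert a.map is b.map  # same shape, so one shared map
```

The pay-off is that the per-object dictionary disappears: slot values live in a compact list, and the dictionary lookup is shared, cacheable, and (in a trace) optimised away entirely.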
As a rough metric, I found that each JIT optimisation I made gave a roughly 5-10% speedup (though the per-function stack change was much more profitable): the cumulative effect was quite pronounced. Admittedly, I suspect that I've now picked most of the low-hanging fruit; improving performance further will require increasingly drastic action (much of it in the Converge compiler, which is likely to prove rather harder to change than the VM). Fortunately, I'm adequately happy with performance as it is. The end result of these optimisations is that the traces produced by the Converge VM are often very efficient: see this optimised trace (randomly chosen — I'm not even sure what piece of code it represents). What's astonishing is that between promoting values, eliding function calls, and optimising traces, many bytecodes now have little or no code attached to them. What particularly amazes me is how one of Converge's most crippling features, from an efficiency point of view, is now handled. Failure is how an Icon-based expression evaluation system allows limited backtracking. In my paper on Converge's Icon inheritance, I noted that the ADD_FAILURE_FRAME and REMOVE_FAILURE_FRAME instructions (which surround nearly every Converge expression) accounted for around 25-30% of all executed opcodes. What's more annoying is that the vast majority of the time, failure frames are created and discarded without being used. My suggestion at the time was that maybe a register-based VM (in the sense of modern non-JITing Lua VMs) might be able to lower this cost somewhat. What I hadn't anticipated was a tracing JIT. If you compare the unoptimised trace to the optimised one, you'll notice that the former has lots of code in ADD_FAILURE_FRAME / REMOVE_FAILURE_FRAME instructions, while in the latter such code has almost entirely disappeared. In large part this explains why the optimised trace is 4 times smaller than the unoptimised trace.
The best thing of all is that RPython's trace optimiser does this for free: I didn't raise a single finger to make it happen. In one fell swoop, RPython has given Converge the fastest VM for Icon-esque evaluation ever created.

Tracing JIT issues

Tracing JITs are relatively new and have some limitations, at least based on what we currently know. Mozilla, for example, removed their tracing JIT a few months back, because while it's sometimes blazingly fast, it's sometimes rather slow. This is due to a tracing JIT optimising a single code-path at a time: if a guard fails, execution falls back to the (very slow) tracing interpreter for the remainder of that bytecode (which could be quite long), and then back to the language interpreter for subsequent bytecodes. Code which tends to take the same path time after time benefits hugely from tracing; code which tends to branch unpredictably can take considerable time to derive noticeable benefits from the JIT. The real issue is that we have no way of knowing which code is likely to branch unpredictably until it actually does so. A real program which does this is the Converge compiler: at several points it walks over an AST, calling a function _preorder(x) which, using reflection, dispatches to a function which can handle that AST type (see e.g. the AST well formedness checker for an example of this idiom). Though it makes perfect sense to us, from the point of view of a tracer, _preorder(x) branches completely unpredictably.

User's end program:

    def f(x):
        return 2 + g(x) + 3

    def g(x):
        if x < 0:
            return 0
        else:
            return 1

Trace without inlining:

    return 2 + g(x) + 3

Trace with inlining:

    guard_not_less_than(x, 0)
    return 2 + 1 + 3

Figure 9: Inlining code with a call f(6).

In my limited experience, the inherent ability of a tracing JIT to inline code can exacerbate this issue. Consider the simple program in Figure 9.
If the tracing JIT disables inlining, the trace looks like the "Trace without inlining" version in Figure 9: the call to g remains as is. As we saw earlier, raw function calls are not only expensive but their opaqueness prevents the trace optimiser from performing much of its magic. By default, therefore, a tracing JIT will inline such functions, resulting in the "Trace with inlining" version. In this case, the inlined version will typically be much faster since the guard check is extremely quick and much of the other machinery surrounding function calling can be optimised away. However, while inlining is generally a big win, it can sometimes be a big loss. This is chiefly because inlining leads to significantly longer traces. Traces are slow to create (due to the tracing interpreter), so the longer we trace, the greater the overhead we impose on the program. If the trace is later used frequently, and in the exact manner it was recorded, the relative cost of the overhead will reduce over time. If the trace is little used, or if guards fail regularly within it, the overhead can easily outweigh the gains. Unfortunately, there's no obvious way to predict when inlining will be a win or a loss, because we can't see into a program's future execution patterns. As a heuristic to somewhat counter this problem, RPython has an (adjustable) upper limit on trace length: traces that are too long are aborted, and the subsequent trace will turn inlining off. This helps somewhat, but what a good value for "too long" might be isn't obvious to me; it's likely to vary from language to language, and even program to program. Indeed, in an experiment on a mixed tracing / method-based Java VM, Inoue et al. found that long traces are, overall, a win. With luck, future research will start to whittle away at tracing JITs' weaknesses. However, it seems likely that, in the medium term at least, most hand-crafted VMs will remain method-based (referred to hereafter as method JITs).
That is, when a method is called repeatedly, the entire method is converted to machine code. Where tracing JITs are optimistic – recording a particular path, removing branching possibilities, and assuming that it will mostly be used in the same way on subsequent executions – method JITs are more pessimistic – converting an entire method to machine code, with most of its branches left in place (though a few trace-like optimisations are likely to be performed, this isn't that important from our point of view). Although a tracing JIT can often beat a method JIT, the latter's performance is much more consistent. That said, as PyPy shows, the performance of a tracing JIT isn't bad in the general scheme of things. Furthermore, it's not obvious to me how an "RPython for method JITs" system could be created: tracing JITs seem to me to be much better suited to automatic creation.

How fast can it go?

Something that's currently unclear is how fast one can reasonably expect an RPython VM to go. The best guide we currently have to the achievable speed of RPython VMs is PyPy itself. Although it seems that most of the easy wins have now been applied, it's still getting faster (albeit with the rate of gains slowing down), and, more importantly, is increasingly giving good performance for a range of real programs. The PyPy speed centre is an instructive read. At the time of writing PyPy is a bit over 5 times faster than CPython for a collection of real-world programs; for micro-benchmarks it can be a couple of orders of magnitude quicker. It's clear that, in general, an RPython VM won't reach the performance of something like HotSpot, which has several advantages: the overall better performance of method-based JITs; the fact that it's hand-coded for one specific class of languages; and the sheer amount of effort put into it. But I'd certainly expect RPython VMs to get comfortably within an order of magnitude of HotSpot's performance.
Time will tell, and as people write RPython VMs for languages like Java, we'll have better points of comparison.

RPython issues

From what I've written above, I hope you get some sense of how interesting and exciting RPython is for the language design and implementation community. I also hope you get a sense of how impressed I am with RPython. Because of that, I feel able to be frank and honest about the limitations and shortcomings of the approach. The major problem anyone creating a VM in RPython currently faces is documentation or, more accurately, a lack of it. There are various papers and fragments of documentation on RPython, but they're not yet pulled together into a coherent whole. New users will struggle to find either a coherent high-level description or descriptions of vital low-level details. Indeed, the lack of documentation is currently enough to scare off all but the most dedicated of language enthusiasts. As such a dedicated enthusiast, I got a long way with grep, but as my VM got more sophisticated, I had to resort to the IRC channel (something I've never needed to do before) to check the semantics of low-level details such as eliding, for which guesswork is simply too dangerous. Part of the problem probably stems from the strict adherence of the PyPy / RPython development process to Test Driven Development (TDD). PyPy / RPython has roughly the same amount of code for the main system as for the tests. Although this would not have been my personal choice, it appears to have served the PyPy / RPython development process extremely well. It's impossible not to admire the astonishing evolution of the project and the concepts it has developed; TDD must have played an important part in this. Indeed, although the Converge VM has inevitably uncovered a few bugs in PyPy, they have been surprisingly few and far between. Unfortunately, the prioritisation of tests seems to have been at the expense of documentation.
As I rapidly discovered – to the initial bemusement of the RPython developers – tests are a poor substitute for documentation, typically combining a wealth of low-level detail with a lack of any obvious high-level intent. In short, with so many tests, it's often impossible to work out what is really being tested or why. I certainly struggled with this lack of documentation: I suspect I could have shaved at least a third off my development time if RPython were as well documented as other language projects. Fortunately the PyPy chaps are aware of this, and there are now open issues to resolve it; I hope my experiences will feed into that. As alluded to above, PyPy / RPython have, since their original conception, changed to a degree unmatched by any other project I can think of. The PyPy project started off as an "implement Python in Python" project to allow people to easily get their heads around the nuts-and-bolts details of a Python VM. Only gradually, over the years, did it turn into an "implement a fast Python VM" project, with several dead-ends on the route (including an attempt to do traditional partial evaluation). As the desire to implement a faster Python VM grew, the need to define a smaller language became clearer, hence the gradual appearance of RPython. The Python part of RPython's name has consequences. In essence, RPython is a statically typed language roughly akin to Java; static types allow RPython to statically optimise an RPython program in ways that would be impossible for full Python. RPython relies almost exclusively on type inference for its static types, with RPython programs rarely making them explicit (far less often than in, say, a typical Haskell program). This means that type errors, when they do occur, can result in inscrutable error messages (as is typical with type inference).
While RPython's error messages have become better in recent months, they are still obtuse; identifying the real cause is harder than in most other type inferred languages. It took quite some time before I felt confident writing more than a few lines at a time. If RPython had more explicit static typing, the scope over which the type inferencer would operate would be smaller, and errors probably less baffling. The massive evolution PyPy / RPython have undergone also has implications for the implementation. In short, RPython does not just have an experimental past, it has an experimental present. The translator is littered with a vast number of assertions, many of which can be triggered by seemingly valid user programs. These are often hard to resolve: one is left wondering what a particular assertion has to do with the program that triggered it. Occasionally, if one is really lucky, an assertion has an associated comment which says something like "if you got here, it's probably because of X". As RPython matures, I expect such assertions to be triggered less often, but I suspect they are likely to remain a pest for some time to come. Because every RPython program is also a valid Python program, RPython VMs can also be run using CPython or PyPy — run "untranslated" in the RPython lingo. From my point of view this facility has been of little use, because running a VM this way is slow and memory intensive. Unfortunately, despite the drawbacks of untranslated execution, it is sometimes the lesser of two evils, as we shall see. Most noticeably, PyPy's extensive test suite is run untranslated. The alternative is full translation. RPython is a whole program translator. It slurps in a program and statically analyses the whole thing afresh before converting it to C. Unfortunately, because of the quantity of work it has to do, the translator is extremely slow.
On a fairly fast machine, translation of Converge takes 3 or 4 minutes with --opt=3; that time more than doubles with --opt=jit. PyPy, on a fast machine, can easily take 45-60 minutes to translate with --opt=jit. Any change, no matter how small, requires a full rerun of the translator. Frankly, I had forgotten what it was like to have such a slow edit-compile-run cycle: it makes experimentation, particularly for those unfamiliar with RPython, extremely painful. At some points, when a bug was far too deep in a VM run to be reachable by running untranslated, I almost tore my hair out due to frequent retranslations. On the plus side, the down-time that translation induces all but rules out repetitive strain injuries. Whole program translation may seem an odd decision, but it has a vital use: RPython uses Python as its compile-time meta-programming language. Basically, the RPython translator loads in an RPython VM and executes it as a normal Python program for as long as it chooses. Once that has finished, the translator is given a reference to the entry point of the RPython VM, and it is from that point that translation occurs. Everything that is referenceable from the entry point must be "RPython enough" (this vague term is from the current documentation) to be translatable; things not reachable are ignored (and may use arbitrary Python features). What this means is that one is able to use normal Python to make decisions about e.g. portability which could not be deferred until actual translation time. The ability to do some sort of pre-translation meta-programming is absolutely necessary for software that needs to be customisable and portable. However, the fact that most VM files are in fact mixed Python and RPython programs is a headache. Some functions in the normal Python libraries are RPython compatible; some aren't; sometimes RPython alternatives are provided; sometimes they aren't.
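The pre-translation meta-programming style described above can be sketched as follows (a minimal illustration with hypothetical function names, not the Converge VM's actual code). Everything at module level runs as ordinary Python when the translator loads the file; only what is reachable from the entry point need be "RPython enough".

```python
import sys

# Ordinary Python, executed at load time, i.e. before translation:
# pick a platform-specific implementation once and for all. By the
# time translation starts, path_sep is a single, fixed function.
if sys.platform.startswith("win"):
    def path_sep():
        return "\\"
else:
    def path_sep():
        return "/"

def entry_point(argv):
    # Only this function, and everything it (transitively) calls, is
    # translated; unreachable code may use arbitrary Python features.
    print("path separator: " + path_sep())
    return 0

def target(driver, args):
    # The translator calls a function like this to obtain the entry
    # point from which whole-program analysis begins.
    return entry_point, None
```

Run under plain CPython this behaves like any Python module (which is exactly what "untranslated" execution is); fed to the translator, the `if` has already been resolved and only one `path_sep` exists to be analysed.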
This mixing and matching between the two is arbitrary, confusing, and not, I suspect, resolvable. As a final matter, RPython is not restricted to generating C: at various points it has also had JVM, CLR, and LLVM backends (though, at the time of writing, none of these is usable). RPython has thus tried to create its own type system to abstract away the details of these backends, not entirely successfully. This is not a fault unique to RPython. As anyone who's tried porting a C program to a number of platforms will attest, there is no simple set of integer types which works across platforms. Unfortunately, RPython's history of multiple backends and only semi-successful attempts to abstract away low-level type systems means that it has at least 5 type systems for various parts (some of which, admittedly, are hidden from the user). Not only does each have different rules, but the most common combination (rffi and rarithmetic) has different rules for types of the same name. The precise relationship between the varying type systems remains a mystery to me. I suspect that the current Converge VM does not use appropriate types in some places because of this.

The future

RPython, to my mind, is an astonishing project. It has, almost single-handedly, opened up an entirely new approach to VM implementation. As my experience shows, creating a decent RPython VM is not a huge amount of work (despite some frustrations). In short: never again need new languages come with unusably slow VMs. That the PyPy / RPython team have shown that these ideas scale up to a fast implementation of a large, real-world language (Python) is another feather in their cap. An important question is whether the approach that RPython takes is so unique that it is the only possible tool one can imagine using for the job. As my experience with RPython has grown, the answer is clearly "no", because RPython is not magic.
In other words, it is the first member of a new class of tool, but I do not expect it to be the last member of that class. If nothing else, RPython probably isn't the ideal language for such purposes, as I showed in the previous section. My best guess is that a new Java-like language with compile-time meta-programming (as found in Converge, as it so happens) might be more appropriate, but I could well be wrong. In the meantime, there is no reason not to embrace RPython — it works and it's here, right now. If you've got this far, congratulations: it's been a long read, I know! This article is so long because its subject is so worthy. I am a curmudgeon and I find most new developments in software to be thoroughly uninteresting. RPython is different. It's the most interesting thing I've seen in well over a decade. Exactly what its ramifications will be is something that only time can tell, but I think they will be twofold. First, I think new languages will suddenly find themselves able to compete well enough with existing languages that they will be given a chance: I hope this will encourage language designers to experiment more than they have previously felt able. Second, the "one or two VMs should be enough for all purposes" mindset is now severely challenged: RPython shows that it's possible to create a custom VM for a language which substantially outperforms mashing it atop an existing VM. In summary, the future for programming languages has just got a lot brighter: the language designer's dilemma is no more.

Acknowledgements: The RPython developers have been consistently helpful and the new VM wouldn't have got this far without valuable help from Carl Friedrich Bolz and Armin Rigo in particular; Maciej Fijalkowski and others on the PyPy IRC channel have also been extremely helpful. Martin Berger, Carl Friedrich Bolz, and Armin Rigo also gave insightful comments on this article. Any remaining errors and infelicities are, of course, my own.