Over the years, there have been many, many failed attempts to create alternative VMs for Python, in the hopes of increasing program performance. Even if we ignore the many half-finished Python-to-Parrot translator projects still lurching erratically onward like a half-decayed zombie army, the road to better VM performance is lined on both sides by the gravestones of colorfully-named projects like Mamba, Rattlesnake, and Vyper, all lying untended and forgotten.

Meanwhile, newer projects like pycore and ShedSkin are announced all the time, with a hopeful optimism all too similar to that of their predecessors. (Announced a little over a year ago with much fanfare in the blogosphere, pycore is already missing in action, without a single release yet.)

Making Python run fast, it seems, is a lot harder than it looks. You don’t have to be a compiler or VM design expert to look at CPython’s implementation and say, “Doing X is wasteful. I’ll bet you could make that faster by doing Y.” The problem is that at least 9 times out of 10, somebody already tried doing Y, and got maybe 80% of Python to work with their design before they hit a wall.

That wall basically amounts to this: 80% of Python is not Python, because everybody uses some part of that remaining 20%. The only reason ShedSkin isn’t already in the land of the dead with all the other projects is that it neatly sidesteps this issue by not pretending to be anything but a “Python-like” language. However, that’s sort of like the Black Knight in Monty Python and the Holy Grail, insisting that his lack of arms and legs is “only a flesh wound”. True, in other words, but not very useful.

On the other side we see the alternative VMs that actually implement the Python language, but don’t (for the most part) try to outdo CPython for speed. Jython and IronPython implement reasonably complete forms of the Python language, but from a practical perspective they are different platforms. Trying to target an application to work across CPython, Jython, and IronPython would be rather pointless, since only pure-Python libraries are portable across the implementations in any case.

But it’s the impure libraries that give (C/J/Iron)Python most of its current value! Be it database access, number crunching, interfaces to GUI toolkits, or any of a thousand other uses, it’s the C, Java, or CLR libraries that make Python useful. CPython is basically a glue language for assembling programs from C libraries, and to the extent that Jython and IronPython are successful, it’s because they’re glue languages for assembling Java or CLR components. What’s more, since their value equation lies elsewhere, Jython and IronPython don’t have to fully implement CPython’s semantics, although they do try to come fairly close.

And IronPython actually manages to improve on some Python performance microbenchmarks, although I’d say the jury is still out on whether IronPython programs perform better in general. Of course, it’s difficult to measure this well because IronPython is a different platform. A heavy number-crunching program using NumPy isn’t going to run on IronPython, for example, so how would you compare them?

And that leads us to the very heart of the issue with CPython. If the value of CPython comes from all the things that work with it today, then CPython is very close to being at a dead end for further performance improvement. Most proposed performance enhancements these days get rejected because they change the Python C API in backwards-incompatible ways. If a change requires that everybody rewrite their C code, the language might as well not be Python any more. In short, CPython isn’t just a language implementation; it’s a platform API, not unlike the Java VM and libraries.

It used to be that we held out a hope for Python 3000 – Guido’s bold vision of a Python rethought from the ground up, unburdened by the need for backward compatibility. Here we could break with the C API of the past, and explore new territory – or so we thought.

But more recently, Guido has pulled back from the original plan, citing the ongoing vaporware status of Perl 6, and Joel Spolsky’s arguments against rewriting your flagship product. Python 3000 has become Python 3.0, instead. Not a complete rewrite, but a still somewhat vague plan for tuning up the existing language, and tossing out a few things Guido considers mistakes in retrospect. Backwards incompatibility will be allowed, but Guido has pronounced that there will be no from-scratch rewrite of the CPython implementation. It’s not yet clear whether that means we can refactor in ways that would require third-party extensions to be rewritten. Perhaps this will be decided on a case-by-case basis.

But arguably the single biggest mistake in the CPython platform as it exists today is the lack of a foreign function interface, defined by the language and expressible by Python code. Instead, CPython has always relied on a fixed C API to express foreign interfaces. For its original intended purpose – an embedded scripting language for the Amoeba OS – that was probably okay. But the lack of a C FFI has meant that tools like SWIG, Pyrex, ctypes, Boost::Python, etc. had to spring up to fill the gap. None of them is “standard” to Python, so a given CPython extension could be written in any of them, or none of the above. Thus, today’s backward-compatibility ball-and-chain: the Python/C API.

What’s more, few of these tools are designed to be independent of the existing CPython implementation. All but ctypes tend to have quirks that are a function of their intended code-generation target. But a Python language-defined FFI would have allowed the CPython API to be a mere implementation detail, able to be changed with little consequence. Indeed, such an FFI could conceivably have been usable even with Jython and IronPython, allowing even greater portability.
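To make the idea a little more concrete, here is a minimal sketch of a foreign interface expressed entirely in Python code, using ctypes (one of the tools named above). It assumes a POSIX system where the C library can be loaded at runtime, and the choice of `strlen` is purely illustrative. This is ctypes as it exists today, not the language-defined FFI wished for above.

```python
import ctypes
import ctypes.util

# Locate and load the C library. Assumption: a POSIX platform; if
# find_library returns None, CDLL(None) falls back to the symbols
# already loaded into the running process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature in Python: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

length = libc.strlen(b"glue language")
print(length)
```

Note that nothing in this snippet mentions the Python/C API: the foreign signature is described as data in Python, which is what would let an implementation change its internals without breaking the caller.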

But, it’s too late to fix all that now. Or is it?

Enter PyPy. Two months ago, PyPy 0.7 was released. A major milestone, PyPy 0.7 is the first self-hosting Python implementation. That is, an implementation of Python, written in Python, that can interpret itself. What’s more, part of PyPy is a translation system that allows Python code to be translated to other languages, and it includes a kind of foreign function interface, although not a standardized one blessed by Guido. The PyPy developers have now done the work of rewriting all but a minimum of platform-specific C code as high-level Python code. In short, PyPy has already taken the most important step for us to escape from the CPython “gravity well” of needing a backward-compatible C API.

It’s hard to overstress how important this is. The current CPython implementation is locked into a host of design decisions that PyPy is not. As a simple example, PyPy can generate threads-supporting and non-threads-supporting versions of itself, refcounting and garbage collection versions of itself, and so on. Essentially, PyPy is completely virtual with respect to the underlying VM, even though it uses CPython bytecode. So, in the next few years it will be possible to experiment with radical redesigns of the VM, without getting bogged down in the “last 20%” issues experienced by projects of the past. Heck, it should be possible to use custom-tuned VMs on an application-by-application basis!

Further, because PyPy is implemented in Python, hacking on it to change the actual Python language or its semantics will be easier than hacking CPython. In short, we are almost on the doorstep of a renaissance in the development of the Python language, and on the way out of the alternative-implementations graveyard.

But what about speed? PyPy is currently described as 200-300 times slower than CPython, depending on what you’re doing, and what VM you translate it to. This sounds ludicrously bad, until you look at the fact that the untranslated PyPy, running on top of CPython, runs 2000 times slower. Which means – if you’re paying attention – that PyPy’s translator is already able to turn Python code into C that runs roughly 7 to 10 times faster!
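For anyone who wants to check the arithmetic, the speedup from translation is just the ratio of the two slowdown figures quoted above. The variable names below are made up for illustration; the numbers are the ones in the text.

```python
# Slowdown factors relative to CPython, as quoted in the text.
untranslated_slowdown = 2000        # PyPy interpreted on top of CPython
translated_slowdowns = (200, 300)   # PyPy after translation to C

# Speedup contributed by translation = untranslated / translated.
speedups = [untranslated_slowdown / s for s in translated_slowdowns]
best, worst = max(speedups), min(speedups)
print(f"translation speedup: {worst:.1f}x to {best:.1f}x")
```

The factor of 10 corresponds to the 200x figure; at 300x the gain is closer to 7.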

That is one heck of an improvement, folks. Granted, the code in question is technically “RPython” – a restricted subset of Python that eschews the use of certain more-dynamic features. But it doesn’t need type declarations in order to get speed, like Pyrex does. And this technology could be available for practical use soon, if Stackless guru Christian Tismer has his way, by creating an RPython-to-CPython extension module translator.
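As a rough illustration of what “restricted” means here, the sketch below is ordinary Python with no type declarations, yet every variable keeps a single type throughout, which is the kind of property that lets a translator work out static types on its own. The function is hypothetical, and the precise RPython rules are defined by the PyPy project, not by this example.

```python
def sum_of_squares(n):
    # 'total' and 'i' are always integers, so their types can be
    # inferred from the code itself, with no declarations needed.
    total = 0
    for i in range(n):
        total += i * i
    return total

print(sum_of_squares(10))
```

By contrast, code that rebinds the same variable to an integer in one branch and a string in another is the sort of dynamic behavior such a subset has to give up.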

So, if it’s possible to create efficient C from a subset of Python, does that now mean that PyPy is finished? Can’t we just take that translation process and go on our way? Unfortunately, no. Although we could certainly take those fast modules back to the CPython platform, the translation process is still quite slow, and needs some accelerating of its own. Also, it still doesn’t really make CPython any faster – it just means that we can compile some individual modules and make them faster.

To reach the promised land, then, PyPy has to first get close to CPython speed. As it gets closer and closer to this goal, more and more people with an idea or two about speeding things up will say to themselves, “I wonder if I can get PyPy to do Y instead of X?” And, unlike the situation with CPython now, they won’t need to be both a Python guru and a CPython VM expert to have a prayer of implementing it.

So, instead of entirely new VMs springing up and dying incomplete, it may be that we will soon see the opposite trend: existing VMs fading away, consolidated and replaced by an ever-more flexible PyPy. With any luck, we may yet see PyPy become the One Python to Rule Them All, replacing CPython, Jython, and IronPython with C, Java, and C# translator backends respectively.

Update: Just after I posted this, I found a message that appears to be saying that as of September, PyPy is now only 20 times slower than CPython. If that’s the case, things are moving quickly indeed. 2000, 200, 20… How much longer till 2, and 0.2 (five times faster than CPython)? Unfortunately, each new order of magnitude from this point on will probably be more difficult than the last. Too bad they can’t just feed the output back to the input and make it ten times faster as many times as they want. 🙂