A brief experiment with PyPy

While one might ordinarily think of the PyPy project as an experiment in implementing the Python runtime in Python itself, there is really more to it than that. PyPy is, in a sense, a toolbox for the creation of just-in-time compilers for dynamic languages; Python is just the start, but it's an interesting start. It has been almost exactly one year since LWN first looked at PyPy and a few weeks since the 1.5 release, so the time seemed right to actually play with this tool a bit. The results were somewhat eye-opening.

LWN uses a lot of tools written in Python; one of them is the gitdm data miner, which is used to generate kernel development statistics. It is a simple program which reads the output of "git log" and generates a big in-memory data structure reflecting the relationships between developers, their employers, and the patches they are somehow associated with. Very little of its work is done in the kernel, and it uses no extension modules written in C. These features make gitdm a natural first test for PyPy; there is little to trip things up.
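For readers who have not seen this kind of program, the general shape of the workload can be sketched in a few lines. This is not gitdm's code; the function name, the sample log, and the use of the email domain as a crude stand-in for an employer are all assumptions made for illustration.

```python
from collections import defaultdict


def summarize_log(log_lines):
    """Count changes per author and per email domain from `git log` text.

    Illustrative only: this is not gitdm's code, and treating the email
    domain as an employer is an assumption made for this sketch.
    """
    by_author = defaultdict(int)
    by_domain = defaultdict(int)
    for line in log_lines:
        # Default `git log` output has one "Author: Name <email>" line
        # per commit; commit message lines are indented, so they are skipped.
        if not line.startswith("Author:"):
            continue
        author = line[len("Author:"):].strip()
        by_author[author] += 1
        # Pull the domain out of "<user@domain>" when an address is present.
        if "<" in author and "@" in author:
            email = author[author.index("<") + 1:].rstrip(">")
            by_domain[email.split("@", 1)[1]] += 1
    return dict(by_author), dict(by_domain)


# Made-up sample data in the shape of default `git log` output.
SAMPLE_LOG = """\
commit 1111111111111111111111111111111111111111
Author: Alice Example <alice@example.com>
Date:   Mon May 2 10:00:00 2011 -0600

    First hypothetical change

commit 2222222222222222222222222222222222222222
Author: Bob Sample <bob@example.org>
Date:   Mon May 2 11:00:00 2011 -0600

    Second hypothetical change

commit 3333333333333333333333333333333333333333
Author: Alice Example <alice@example.com>
Date:   Mon May 2 12:00:00 2011 -0600

    Third hypothetical change
"""

authors, domains = summarize_log(SAMPLE_LOG.splitlines())
```

The point is that the work is string handling and dictionary updates in pure Python, which is the sort of code a JIT compiler has a chance to speed up.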

The test was to stash the git log output from the 2.6.36 kernel release through the present - some 31,000 changes - in a file on a local SSD. The file, while large, should still fit in memory with nothing else running; I/O effects should, thus, not figure into the results. Gitdm was run on the file using both the CPython 2.7.1 interpreter and PyPy 1.5.

When switching to an entirely different runtime for a non-trivial program, it is natural to expect at least one glitch. In this case, there were none; gitdm ran without complaint and produced identical output. There was one significant difference, though: while the CPython runs took an average of about 63 seconds, the PyPy runs completed in about 21 seconds. In other words, for the cost of changing the "#!" line at the top of the program, the run time was cut to one third of its previous value. One might conclude that the effort was justified; plans are to run gitdm under PyPy from here on out.
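A comparison like this one can be repeated with a small timing harness. The sketch below is an assumption about how one might do it, not the procedure actually used for the numbers above; the helper names are invented, and the interpreter and script paths in the trailing comment are placeholders.

```python
import subprocess
import time


def average_wall_time(argv, runs=3, stdin_path=None):
    """Run a command `runs` times and return the mean wall-clock seconds.

    `stdin_path`, if given, names a file to feed to the command on its
    standard input for every run.
    """
    total = 0.0
    for _ in range(runs):
        stdin = open(stdin_path, "rb") if stdin_path else None
        try:
            start = time.perf_counter()
            # Discard the program's output so terminal speed is not measured.
            subprocess.run(argv, stdin=stdin, stdout=subprocess.DEVNULL,
                           check=True)
            total += time.perf_counter() - start
        finally:
            if stdin is not None:
                stdin.close()
    return total / runs


def speedup(baseline_seconds, new_seconds):
    """How many times faster the new run is than the baseline."""
    return baseline_seconds / new_seconds


# Hypothetical usage; the paths below are placeholders, not real locations:
#   cpython = average_wall_time(["python", "gitdm"], stdin_path="log.txt")
#   pypy = average_wall_time(["pypy", "gitdm"], stdin_path="log.txt")
#   print(speedup(cpython, pypy))

# With the averages reported above:
reported = speedup(63.0, 21.0)
```

With the reported averages of 63 and 21 seconds, the ratio comes out to 3, which is where the "one third" figure comes from.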

To dig just a little deeper, the perf tool was used to generate a few statistics of the differing runs:

                 CPython    PyPy
Cycles           124B       42B
Cache misses     14M        45M
Syscalls         55,000     28,000

As would be expected from the previous result, running with CPython took about three times as many processor cycles as running with PyPy. On the other hand, CPython reliably incurred less than 1/3 as many cache misses; it would be hard to say why. Somehow, the code generated by the PyPy JIT produces more widely spread-out memory references; that may be related to garbage collection strategies. CPython uses reference counting, which can improve cache locality, while PyPy does not.
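The ratios behind those statements are easy to check from the rounded perf figures. The snippet below simply redoes the arithmetic; the dictionary layout and variable names are choices made for this sketch.

```python
# Rounded figures from the perf comparison (B = billion, M = million).
cpython = {"cycles": 124e9, "cache_misses": 14e6, "syscalls": 55000}
pypy = {"cycles": 42e9, "cache_misses": 45e6, "syscalls": 28000}

# CPython's count divided by PyPy's count for each metric.
ratios = {name: cpython[name] / pypy[name] for name in cpython}

for name, value in sorted(ratios.items()):
    print("%-13s CPython/PyPy = %.2f" % (name, value))
```

A cycles ratio just under 3 matches the wall-clock result, the cache-miss ratio falls just below 1/3, and the system call ratio is a little under 2.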

One other interesting thing to note is that PyPy only made half as many system calls. That called for some investigation. Since gitdm is just reading data and cranking on it, almost every system call it makes is read(). Sure enough, the CPython runtime was issuing twice as many read() calls. Understanding why would require digging into the code; it could be as simple as PyPy using larger buffers in its file I/O implementation.
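The buffer-size hypothesis is easy to illustrate, though not to confirm, without looking at either runtime. The sketch below uses Python's io module with a raw stream that counts how often it is asked for data, standing in for read() system calls. The class name, function name, and buffer sizes are assumptions chosen for the demonstration; nothing here reflects the actual buffer sizes in CPython 2.7.1 or PyPy 1.5.

```python
import io


class CountingRaw(io.RawIOBase):
    """In-memory raw stream that counts low-level read requests.

    Each readinto() call stands in for one read() system call.
    """

    def __init__(self, data):
        super().__init__()
        self._src = io.BytesIO(data)
        self.calls = 0

    def readable(self):
        return True

    def readinto(self, buf):
        self.calls += 1
        return self._src.readinto(buf)


def count_raw_reads(data, buffer_size):
    """Read `data` line by line through a buffer; return raw read count."""
    raw = CountingRaw(data)
    reader = io.BufferedReader(raw, buffer_size=buffer_size)
    for _line in reader:
        pass
    return raw.calls


# 8192 lines of 80 bytes each, a stand-in for a large log file.
data = (b"x" * 79 + b"\n") * 8192

small_buffer_reads = count_raw_reads(data, buffer_size=4096)
large_buffer_reads = count_raw_reads(data, buffer_size=65536)
print(small_buffer_reads, large_buffer_reads)
```

The same data read through the larger buffer needs far fewer low-level reads, so a factor-of-two difference in buffer size would be enough to explain a factor-of-two difference in read() calls.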

Given results like this, one might well wonder why PyPy is not much more widely used. There may be numerous reasons, including a simple lack of awareness of PyPy among Python developers and users of their programs. But the biggest issue may be extension modules. Most non-trivial Python programs will use one or more modules which have been written in C for performance reasons, or because it's simply not possible to provide the required functionality in pure Python. These modules do not just move over to PyPy the way Python code does. There is a short list of modules supported by PyPy, but it's insufficient for many programs.
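One common way for a program to cope, where a pure-Python equivalent exists, is to treat the C module as an optional accelerator. The sketch below shows the pattern; the module name `_hypothetical_fastcount` and the function `count_words` are invented for illustration and do not refer to any real package.

```python
try:
    # Hypothetical C accelerator; the module name is invented for this
    # sketch and is not expected to exist anywhere.
    from _hypothetical_fastcount import count_words
    USING_C_EXTENSION = True
except ImportError:
    USING_C_EXTENSION = False

    def count_words(text):
        """Pure-Python fallback: count whitespace-separated words."""
        counts = {}
        for word in text.split():
            counts[word] = counts.get(word, 0) + 1
        return counts


print(USING_C_EXTENSION, count_words("a b a"))
```

This only helps when the functionality can be written in pure Python at all; it does nothing for modules that wrap a C library.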

Fixing this problem would seem to be one of the most urgent tasks for the PyPy developers if they want to increase their user base. In other ways, PyPy is ready for prime time; it implements the (Python 2.x) language faithfully, and it is fast. With better support for extensions, PyPy could easily become the interpreter of choice for a lot of Python programs. It is a nice piece of work.

