Progress on the Gilectomy

At the 2016 Python Language Summit, Larry Hastings introduced Gilectomy, his project to remove the global interpreter lock (GIL) from CPython. The GIL serializes access to the Python interpreter, so it severely limits the performance of multi-threaded Python programs. At the 2017 summit, Hastings was back to update attendees on the progress he has made and where Gilectomy is headed.

He started out by stating his goal for the project. He wants to be able to run existing multi-threaded Python programs on multiple cores. He wants to break as little of the existing C API as possible. And he will have achieved his goal if those programs run faster than they do with CPython and the GIL—as measured by wall time. To that end, he has done work in four areas over the last year.

He noted that "benchmarks are impossible" by putting up a slide that showed the different CPU frequencies that he collected from his system. The Intel Xeon system he is using is constantly adjusting how fast the cores run for power and heat considerations, which makes it difficult to get reliable numbers. An attendee suggested he look into CPU frequency pinning on Linux.

Atomicity and reference counts

CPU cores have a little bus that runs between them, which is used for atomic updates among other things, he said. The reference counts used by Gilectomy's garbage collection are updated with atomic increment and decrement instructions; those frequent atomic operations cause a performance bottleneck because of the inter-core traffic needed to keep the caches consistent.

So he looked for another mechanism to maintain the reference counts without all of that overhead. He consulted The Garbage Collection Handbook, which had a section on "buffered reference counting". The idea is to push all of the reference count updating to its own thread, which is the only entity that can look at or change the reference counts. Threads write their reference count changes to a log that the commit thread reads and reflects those changes to the counts.

That works, but there is contention for the log between the threads. So he added a log per thread, but that means there is an ordering problem between operations on the same reference count. It turns out that three of the four possible orderings can be swapped without affecting the outcome, but an increment followed by a decrement needs to be done in order. If the decrement is processed first, it could reduce the count to zero, which might result in the object being garbage collected even though there should still be a valid reference.

He solved that with separate increment and decrement logs; the decrement log can only be processed after all of the increments. This implementation of buffered reference counting has been in Gilectomy since October and is now working well. He also did some work on the Py_INCREF() and Py_DECREF() macros that are used all over the CPython code; the intent was to cache the thread-local storage (TLS) pointer and reuse it over multiple calls, rather than looking it up on each call.
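The scheme can be sketched as a toy model in pure Python (the names here are hypothetical; the real Gilectomy implementation lives in C inside the interpreter). The key property is that a commit pass applies all buffered increments before any decrements, so a count can never dip to zero while a live reference still has a pending increment:

```python
import threading
from collections import defaultdict

class BufferedRefCounter:
    """Toy model of buffered reference counting: worker threads append
    events to per-thread increment and decrement logs; only the commit
    pass ever touches the actual counts."""

    def __init__(self):
        self.counts = defaultdict(int)      # object id -> refcount
        self.inc_logs = defaultdict(list)   # thread id -> increment log
        self.dec_logs = defaultdict(list)   # thread id -> decrement log
        self.lock = threading.Lock()        # protects the logs during commit

    def incref(self, obj_id):
        self.inc_logs[threading.get_ident()].append(obj_id)

    def decref(self, obj_id):
        self.dec_logs[threading.get_ident()].append(obj_id)

    def commit(self):
        """One pass of the 'commit thread': increments first, then
        decrements, so objects with live references never hit zero."""
        freed = []
        with self.lock:
            for log in self.inc_logs.values():
                for obj_id in log:
                    self.counts[obj_id] += 1
                log.clear()
            for log in self.dec_logs.values():
                for obj_id in log:
                    self.counts[obj_id] -= 1
                    if self.counts[obj_id] == 0:
                        freed.append(obj_id)  # safe: all incs applied already
                log.clear()
        return freed

rc = BufferedRefCounter()
rc.incref("obj")            # create a reference ...
rc.decref("obj")            # ... drop it ...
rc.incref("obj")            # ... but a second reference is still live
freed = rc.commit()
print(rc.counts["obj"], freed)   # 1 [] -- the object survives the commit
```

If the decrement were processed before the two increments, the count would momentarily read zero and the object could be freed prematurely; processing the decrement log last avoids exactly the problem ordering described above.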

Buffered reference counts have a weakness: they cannot provide realtime reference counts. It could be as long as a second or two before the reference count actually has the right value. That's fine for most code in Gilectomy, because that code cannot look at the counts directly.

But there are places that need realtime reference counts, the weakref module in particular. Weak references do not increment the reference count but can be used to reference an object (e.g. for a cache) until it is garbage collected because it has no strong references. Hastings tried to use a separate reference count to support weakref, but isn't sure that will work. Mark Shannon may have convinced him that resurrecting objects in __del__() methods will not work under that scheme; that may be a fundamental limitation that could kill Gilectomy, Hastings said.
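The behavior that depends on an accurate, immediate count is easy to see in standard CPython, where the count reaching zero clears weak references right away; under buffered counts that clearing could lag by seconds:

```python
import weakref

class Node:
    pass

obj = Node()
ref = weakref.ref(obj)    # does not bump the reference count
assert ref() is obj       # the target is alive via the strong reference

del obj                   # the last strong reference goes away; CPython's
                          # realtime refcount hits zero immediately ...
assert ref() is None      # ... so the weak reference is cleared at once
```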

More performance

Along the way, he came to the conclusion that the object allocation routines in obmalloc.c were too slow. The object allocation scheme has different classes for different sizes of objects, so he added per-class locks. When that was insufficient, he added two kinds of locks: a "fast" lock for when an object exists on the free list and a "heavy" lock when the allocation routines need to go back to the pool for more memory. He also added per-thread, per-class free lists. As part of that work, he added a fair amount of statistics-gathering code but went to some lengths to ensure that it had no performance impact when it was disabled.
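A rough Python model of the per-thread, per-size-class free-list idea may make it clearer (the size classes and names here are illustrative; the real code is C in obmalloc.c):

```python
import threading

SIZE_CLASSES = (16, 32, 64)   # illustrative size classes

_tls = threading.local()      # one set of free lists per thread

def _free_lists():
    # lazily create this thread's per-class free lists
    if not hasattr(_tls, "free"):
        _tls.free = {sc: [] for sc in SIZE_CLASSES}
    return _tls.free

def alloc(size):
    """Serve a block from this thread's free list (the 'fast' path) or
    fall back to a fresh allocation (standing in for the 'heavy' path
    that goes back to the shared pool under a lock)."""
    sc = next(s for s in SIZE_CLASSES if size <= s)
    free = _free_lists()[sc]
    if free:
        return free.pop()     # fast path: no lock, no sharing
    return bytearray(sc)      # heavy path stand-in

def dealloc(block):
    _free_lists()[len(block)].append(block)   # recycle locally

b = alloc(20)            # served by the 32-byte class
dealloc(b)
assert alloc(20) is b    # reused from the thread-local free list
```

Because each thread recycles its own blocks, the common allocate/free cycle never touches shared state, which is the point of the "fast" lock path described above.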

There are lots of places where things are being pulled out of TLS; profiling the code showed 370 million calls to get the TLS pointer over a seven-to-eight-second run of his benchmark. To minimize that, he has added parameters to pass the TLS pointer down into the guts of the interpreter.
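The same refactoring pattern can be shown in Python itself: resolving thread-local state once in the caller and passing it down as a parameter, rather than refetching it on every use. This is only an analogy for the C-level change, not Gilectomy code:

```python
import threading

tls = threading.local()
tls.counter = 0

def hot_loop_refetch(n):
    # looks up the thread-local state on every iteration
    for _ in range(n):
        tls.counter += 1

def hot_loop_cached(n, state):
    # the caller resolves the thread-local state once and passes it down
    for _ in range(n):
        state.counter += 1

hot_loop_refetch(1000)
hot_loop_cached(1000, tls)
print(tls.counter)   # 2000
```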

An attendee asked if it made sense to do that for the CPython mainline, but Hastings pointed out those calls come from what he has added; CPython with a GIL does not have that performance degradation. Another attendee thought it should only require one assembly instruction to get the TLS pointer and that there is a GCC extension to use that. Hastings said that he tried that, but could not get it to work; he would be happy to have help as it should be possible to make it faster.

The benchmark that he always uses is a "really bad recursive Fibonacci". He showed graphs of how various versions of Gilectomy fare versus CPython. Gilectomy is getting better, but is still well shy of CPython speed in terms of CPU time. But that is not what he is shooting for; when looking at wall time, the latest incarnation of Gilectomy is getting quite close to CPython's graph line. The "next breakthrough" may show Gilectomy as faster than CPython, he said.
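Hastings's actual script was not shown, but a deliberately bad recursive Fibonacci spread across threads would look something like this hypothetical reconstruction; with the GIL, wall time grows roughly linearly with the thread count, while a working Gilectomy should keep it nearly flat up to the core count:

```python
import threading
import time

def fib(n):
    # intentionally naive: an exponential number of calls, all of it
    # pure interpreter work that hammers reference counting
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def bench(num_threads, n=25):
    """Wall time to run fib(n) concurrently in num_threads threads."""
    start = time.perf_counter()
    threads = [threading.Thread(target=fib, args=(n,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

print(f"fib(10) = {fib(10)}")          # 55
print(f"4 threads: {bench(4):.2f}s")
```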

Next breakthrough

He has some ideas for ways to get that next breakthrough. For one, he could go to a fully per-thread object-allocation scheme. Thomas Wouters suggested looking at Thread-Caching Malloc (TCMalloc), but Hastings was a bit skeptical. The small-block allocator in Python is well tuned for the language, he said. But Wouters said that tests have been done and TCMalloc is no worse than Python's existing allocator, but has better fragmentation performance and is multi-threaded friendly. Hastings concluded that it was "worth considering" TCMalloc going forward.

He is thinking that storing the reference count separately from the object might improve performance. Changing object locking might also help, since most objects never leave the thread they are created in. Objects could be "pre-locked" to the thread that creates them, with a mechanism for threads to register their interest in other threads' objects.

The handbook he consulted for buffered reference counting actually says little about reference counting; it is mostly focused on tracing garbage collection. So one thought he has had is to do a "crazy rewrite" of the Python garbage collector. That would be a major pain and would break the C API, but he has ideas on how to fix that as well.

Guido van Rossum thought that working on a GIL-less Python and C API would be much easier in PyPy, rather than CPython. Hastings said that he thought having a multi-threaded Python would be easier to do using CPython. Much of the breakage in the C API simply comes from adding multi-threading into the mix at all; if you want multi-core performance, those things are going to have to be fixed no matter what.

But Van Rossum is concerned that all of the C-based Python extensions will be broken in Gilectomy. Hastings thinks that overstates things and has some ideas on how to make things better. Someone had suggested only allowing one thread into a C extension at a time (so, a limited GIL, in effect), which might help.

The adoption of PyPy "has not been swift", Hastings said; he thinks that since CPython is the reference implementation of Python, it will be the winner. He does not know how far he can take Gilectomy, but he is sticking with it; he asked Van Rossum to "let me know if you switch to PyPy". But Van Rossum said that he is happy with CPython as it is. On the other hand, Wouters pointed out one good reason to stick with experimenting with CPython; since the implementation is similar to what the core developers are already knowledgeable about, they will be able to offer thoughts and suggestions.

Hastings also gave a talk about Gilectomy status a few days later at PyCon; a YouTube video is available for those interested.

[I would like to thank the Linux Foundation for travel assistance to Portland for the summit.]

