Modifying the Python object model


At the 2018 Python Language Summit, Carl Shapiro described some of the experiments that he and others at Instagram did to look at ways to improve the performance of the CPython interpreter. The talk was somewhat academic in tone and built on what has been learned in other dynamic languages over the years. By modifying the Python object model fairly substantially, they were able to roughly double the performance of the "classic" Richards benchmark.

Shapiro said that Instagram is a big user of Python and has been looking for ways to improve the performance of the CPython interpreter for its workloads. So the company started looking at the representation of data in the interpreter to see if there were gains to be made there. It wanted to stick with CPython in order to preserve the existing API, ecosystem, and developer experience.

He is just a "casual user" of Python, but has a background in language implementations; he has worked on the Java virtual machine (JVM) and on garbage collection for Go, among other language work. He said he is "not afraid of dynamic languages". He started by looking at the Python runtime. His initial impression was that Python is relatively inefficient compared to other high-level languages. But nothing in Python is fundamentally different from those languages; it shares many common operations with other language runtimes. However, Python has never evolved a faster implementation the way other languages have (e.g. V8 for JavaScript, HotSpot for Java).

He did some data collection with a heavily instrumented CPython interpreter. He compared different kinds of workloads and other languages on those workloads. He also ran the modified interpreter against real traffic data gathered from Instagram. The tests gathered counts of bytecodes used as well as CPU instructions for handling the bytecodes.

To a first approximation, the breakdown of which bytecodes were used the most tracks closely with other languages, which is what would be expected. Roughly 20% of the bytecodes executed are either CALL_FUNCTION or LOAD_ATTR, which are used to call functions and retrieve object attributes, as the names would imply. But handling those bytecodes required nearly 50% of the CPU instructions that were executed. Those two opcodes turn into hundreds of CPU instructions each (the average was 498 instructions for CALL_FUNCTION and 240 for LOAD_ATTR); those counts are much higher than those for the highly optimized equivalents in other languages.
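Those two opcodes are easy to see in ordinary code. As a quick illustration (the class and function here are made up for the example), the dis module shows how attribute access and calls compile down:

```python
import dis

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

def magnitude_squared(p):
    # p.x and p.y each compile to a LOAD_ATTR opcode; the abs() call
    # becomes a CALL-family opcode (CALL_FUNCTION on 3.6-era bytecode,
    # CALL on newer interpreters).
    return abs(p.x * p.x + p.y * p.y)

# Collect the opcode names used by the function's bytecode.
opnames = {instr.opname for instr in dis.Bytecode(magnitude_squared)}
```

Running `dis.dis(magnitude_squared)` prints the full disassembly; every `p.x`-style access shows up as a LOAD_ATTR.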

In addition, when Python is dispatching to a method call, 85% of the time there is only one type involved at a given call site. The interpreter is set up to handle generic method dispatch, but a cache of the most recently used method will work in the vast majority of cases. That percentage rises to 97% if you cache four type/method pairs for the call site. It is not uncommon for high-level languages to have different strategies for a single type versus either four or eight types; call sites that fall outside of those constraints are handled a third way, he said.
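The monomorphic caching idea can be sketched in pure Python; this is a toy model of the technique, not CPython's actual mechanism, and all of the names are illustrative:

```python
# A toy monomorphic inline cache for one call site: remember the last
# (type, method) pair seen, and only fall back to a full lookup when
# the receiver's type changes.

class CallSiteCache:
    def __init__(self, name):
        self.name = name           # method name looked up at this site
        self.cached_type = None    # type seen on the previous call
        self.cached_method = None  # method resolved for that type
        self.hits = 0
        self.misses = 0

    def lookup(self, obj):
        tp = type(obj)
        if tp is self.cached_type:       # fast path: cache hit
            self.hits += 1
            return self.cached_method
        self.misses += 1                 # slow path: full lookup
        method = getattr(tp, self.name)
        self.cached_type, self.cached_method = tp, method
        return method

class Greeter:
    def hello(self):
        return "hello"

site = CallSiteCache("hello")
g = Greeter()
results = [site.lookup(g)(g) for _ in range(100)]
# Only the first lookup misses; the other 99 calls hit the cache,
# mirroring the ~85% monomorphic call sites in the measured workload.
```

A production version would cache up to four pairs (the 97% case) before degrading to generic dispatch.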

Beyond that, the comparison and binary operation implementations were more general than needed by the vast majority of the calls in the Instagram workload. Comparisons and binary operations are normally done on the built-in types (int, dict, str), rather than on user-defined classes. But the code to handle those operations does a lot of extra work to accommodate the dynamic types that are rarely used.
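A fast-path dispatcher of the kind Shapiro describes might look roughly like the following; this is a hypothetical sketch in Python rather than the interpreter's C code:

```python
# Sketch of a binary-add handler that checks the common built-in types
# first and only falls back to generic __add__/__radd__ dispatch for
# the rare dynamic cases.

def binary_add(a, b):
    ta, tb = type(a), type(b)
    if ta is int and tb is int:     # common case: plain integer math
        return a + b
    if ta is str and tb is str:     # common case: string concatenation
        return a + b
    # Generic slow path: honor user-defined operator methods.
    result = NotImplemented
    if hasattr(a, "__add__"):
        result = a.__add__(b)
    if result is NotImplemented and hasattr(b, "__radd__"):
        result = b.__radd__(a)
    if result is NotImplemented:
        raise TypeError(f"unsupported operand types: {ta} and {tb}")
    return result
```

The point of the structure is that the built-in cases exit after a couple of type checks, while the bookkeeping for user-defined classes is only paid when it is actually needed.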

Some of what Shapiro presented did not sit well with Guido van Rossum, who loudly objected to Shapiro's tone, which he found condescending. Van Rossum thought that Shapiro did not really mean to be condescending, but that was how he came across and it was not appreciated. The presentation made it sound like Shapiro and his colleagues were the first to think about these issues and to recognize the inefficiencies, but that is not the case. Shapiro was momentarily flustered by the outburst and its vehemence, but got back on track fairly quickly.

Shapiro's overall point was that he felt Python sacrificed its performance for flexibility and generality, but the dynamic features are typically not used heavily in performance-sensitive production workloads. So he believes it makes sense to optimize for the common case at the expense of the less-common cases. But Shapiro may not be aware that the Python core developers have often preferred simpler, more understandable code that is easier to read and follow, over more complex algorithms and data structures in the interpreter. Some performance may well have been sacrificed for readability.

Continuing, Shapiro said that the most interesting statistic for him in the gathered data was on object construction and "monkey patching" (e.g. adding new attributes to an object after its creation). The instrumented interpreter found that 70% of objects have all of their attributes set in the object's __init__() method. Another percent or so had attributes added from the function where the object is initialized. Most of the rest were instances of a single class that is frequently used at Instagram. The implication is that adding object attributes post-initialization is actually not frequent.
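The pattern the instrumentation measured looks like this in ordinary code (the class and attribute names here are invented for illustration):

```python
class Session:
    def __init__(self, user):
        # The ~70% case: all attributes are set during __init__().
        self.user = user
        self.logged_in = False

s = Session("alice")

# The rare "monkey patching" case: a brand-new attribute added after
# the object has already been constructed.
s.last_seen = "2018-05-09"

attrs = sorted(vars(s))  # the instance dict now holds all three
```

CPython's per-instance dict makes the second case cheap, but the data suggests that flexibility is paid for on every object even though it is seldom used.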

In keeping with what other high-level languages have found, Python code does not use the dynamic features of the language all that much. But the data structures used in the interpreter overemphasize those dynamic features, he said. Better native code generation, a common place to look for better performance, does not directly address the performance of data structures that are not optimized for the 90% case. What would the object model look like if it were optimized for the most frequent operations, and how much performance would be gained?

A method lookup and call takes two orders of magnitude more instructions than he would like to see. So instead of optimizing the call frames for environment introspection (for stack traces and the like), as is done today, he proposed optimizing them for call speed. Also, the garbage collector is reference-count based, which has poor locality. Reference counting is part of the C API, though, so it must be maintained for C extensions; in the experiment, the Instagram developers moved to a tracing garbage collector everywhere else.
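The difference between the two collection strategies is observable from Python. Under CPython's reference counting, an object is reclaimed the instant its count hits zero, so a finalizer runs deterministically; under a tracing collector (as in the experiment, or in PyPy), reclamation happens later, in batches. A small demonstration, assuming a refcounting CPython:

```python
import weakref

class Resource:
    pass

events = []
r = Resource()
# Register a callback that fires when the object is reclaimed.
weakref.finalize(r, events.append, "collected")

before = list(events)   # object still referenced: nothing collected
del r                    # last reference dropped: refcount hits zero
after = list(events)     # under refcounting, the finalizer ran at once
```

Keeping that deterministic behavior for C extensions while using a tracing collector internally is exactly the split the Instagram experiment made.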

The representation of an instance in CPython has a hash table for attributes that is optimized for adding and removing attributes. Since that is not done that often in practice, flattening the representation to an array (with an overflow hash table for additional attributes) was tried. The experiments also featured aggressive caching for attributes.
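CPython already offers an opt-in version of the flattened layout via __slots__, which trades the per-instance dict for a fixed array of attribute storage; the experiment's representation is similar in spirit, though it adds an overflow table rather than forbidding extra attributes outright. A small comparison:

```python
class DictPoint:
    # Default CPython behavior: attributes live in a per-instance
    # dict (a hash table optimized for adding and removing keys).
    def __init__(self, x, y):
        self.x, self.y = x, y

class SlotPoint:
    # __slots__ gives a fixed, array-like layout with no per-instance
    # dict -- but unlike the experiment's design, there is no overflow
    # table, so extra attributes are simply rejected.
    __slots__ = ("x", "y")
    def __init__(self, x, y):
        self.x, self.y = x, y

d, s = DictPoint(1, 2), SlotPoint(1, 2)
try:
    s.z = 3                      # no "z" slot, so this fails
    slot_add_ok = True
except AttributeError:
    slot_add_ok = False
```

The slotted instance has no `__dict__` at all, which is what makes its attribute access both smaller and faster.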

The changes made to the interpreter were all at the data-structure level; they were conservative changes, overall, Shapiro said. The developers did not go out of their way to optimize any of the changes they made, so there is still a "lot of headroom" in the implementation. They raced the code against CPython 3.6. Caching for attributes and for methods were the most powerful optimizations; each roughly halved the run time. The experimental interpreter started out slower, at 3.62s versus CPython's 1.32s; attribute caching dropped that to 1.8s, and adding method caching brought it down to 0.72s. Other optimizations had a smaller effect, but did get the run time down to 0.65s.

He concluded that the current object data model seems to benefit the uncommon cases more than the common cases—at least for the Instagram workload. He wondered if others had seen that also. There has been a lot of effort on compilation to native code over the years, but just changing the object model provided some big wins; perhaps that is where the efforts should be focused.

Eric Snow asked whether the dictionary versioning was being used to invalidate caches. Shapiro said that the experimental interpreter relies less on dictionaries than CPython does. The instances are now arrays instead of dictionaries, but there is an equivalent signal so that the cache entries can be invalidated. Thomas Wouters asked if he had looked at PyPy. Shapiro said the company had, but there was only a modest bump in performance for its workload. He was not the one who did the work, however. Wouters noted that PyPy is more than "Python with a JIT" because it has its own data model as well.

Mark Shannon said that Python 3.7 has added a feature that should provide a similar boost as the method-lookup caching used in the experiment. Shapiro said he had looked at those changes but still believed his proposed mechanism would provide more benefit. Attribute lookup still requires five lookups in CPython, while it is only one lookup in the experimental version. Shannon did not sound entirely convinced of that, however.

These changes are not "all or nothing", Shapiro said. The experiment showed that there is a lot of headroom in the data structures themselves. Some parts of the changes could be adopted if they proved compelling. One attendee asked how serious Facebook/Instagram is about getting some or all of the changes into CPython. Shapiro said that the company has not released the code for the experiment (yet, though it was a bit unclear when that might change) but that it did try to feed its changes back upstream. It does not want to have a fork of CPython that it runs in production.