February 24, 2019

nullprogram.com/blog/2019/02/24/

This article was discussed on Hacker News.

Due to sheer coincidence of several unrelated tasks converging on Python at work, I recently needed to brush up on my Python skills. So far for me, Python has been little more than a fancy extension language for BeautifulSoup, though I also used it to participate in the recent tradition of writing one’s own static site generator, in this case for my wife’s photo blog. I’ve been reading through Fluent Python by Luciano Ramalho, and it’s been quite effective at getting me up to speed.

As I write Python, like with Emacs Lisp, I can’t help but consider what exactly is happening inside the interpreter. I wonder if the code I’m writing is putting undue constraints on the bytecode compiler and limiting its options. Ultimately I’d like the code I write to drive the interpreter efficiently and effectively. The Zen of Python says there should be “only one obvious way to do it,” but in practice there’s a lot of room for expression. Given multiple ways to express the same algorithm or idea, I tend to prefer the one that compiles to the more efficient bytecode.

Fortunately CPython, the main and most widely used implementation of Python, is very transparent about its bytecode, which is easy to inspect and reason about. The disassembly listing is easy to read and understand, and I can always follow it without consulting the documentation. This contrasts sharply with modern JavaScript engines and their opaque use of JIT compilation, where performance is guided by obeying certain patterns (hidden classes, etc.), helping the compiler understand my program’s types, and being careful not to unnecessarily constrain the compiler.

So, besides just catching up with Python the language, I’ve been studying the bytecode disassembly of the functions that I write. One fact has become quite apparent: the CPython bytecode compiler is pretty dumb. With a few exceptions, it’s a very literal translation of a Python program, and there is almost no optimization. Below I’ll demonstrate a case where it’s possible to detect one of the missed optimizations without inspecting the bytecode disassembly thanks to a small abstraction leak in the optimizer.

To be clear: This isn’t to say CPython is bad, or even that it should necessarily change. In fact, as I’ll show, dumb bytecode compilers are par for the course. In the past I’ve lamented how the Emacs Lisp compiler could do a better job, but CPython and Lua are operating at the same level. There are benefits to a dumb and straightforward bytecode compiler: the compiler itself is simpler, easier to maintain, and more amenable to modification (e.g. as Python continues to evolve). It’s also easier to debug Python (pdb) because it’s such a close match to the source listing.

Update: Darius Bacon points out that Guido van Rossum himself said, “Python is about having the simplest, dumbest compiler imaginable.” So this is all very much by design.

The consensus seems to be that if you want or need better performance, use something other than Python. (And if you can’t do that, at least use PyPy.) That’s a fairly reasonable and healthy goal. Still, if I’m writing Python, I’d like to do the best I can, which means exploiting the optimizations that are available when possible.

Disassembly examples

I’m going to compare three bytecode compilers in this article: CPython 3.7, Lua 5.3, and Emacs 26.1. Each of these languages is dynamically typed, is primarily executed on a bytecode virtual machine, and makes it easy to access its disassembly listings. One caveat: CPython and Emacs use a stack-based virtual machine while Lua uses a register-based virtual machine.

For CPython I’ll be using the dis module. For Emacs Lisp I’ll use M-x disassemble, and all code will use lexical scoping. For Lua I’ll use luac -l on the command line.
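For instance, a minimal dis session looks something like this (the sample function here is my own illustration, not one from the comparisons below):

```python
import dis

def add(a, b):
    return a + b

# Print the human-readable disassembly listing to stdout.
dis.dis(add)

# The same instructions are also available programmatically:
ops = [ins.opname for ins in dis.get_instructions(add)]
print(ops)
```

The exact opcodes vary between CPython versions, but the listing format itself has stayed stable and readable.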

Local variable elimination

Will the bytecode compiler eliminate local variables? Keeping the variable around potentially involves allocating memory for it, assigning to it, and accessing it. Take this example:

def foo():
    x = 0
    y = 1
    return x

This function is equivalent to:

def foo():
    return 0

Despite this, CPython completely misses this optimization for both x and y:

  2           0 LOAD_CONST               1 (0)
              2 STORE_FAST               0 (x)

  3           4 LOAD_CONST               2 (1)
              6 STORE_FAST               1 (y)

  4           8 LOAD_FAST                0 (x)
             10 RETURN_VALUE

It assigns both variables, and even loads again from x for the return. These are missed optimizations, but, as I said, keeping these variables around makes debugging more straightforward: users can always inspect them.
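The missed elimination is even visible without reading the listing: both locals still occupy slots in the compiled code object. (This introspection is my own addition, using standard code-object attributes.)

```python
def foo():
    x = 0
    y = 1
    return x

# Both x and y survive compilation as local variable slots, even
# though y is never used and x could be folded into the constant 0.
print(foo.__code__.co_varnames)  # ('x', 'y')
print(foo.__code__.co_nlocals)   # 2
```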

How about Lua?

function foo()
    local x = 0
    local y = 1
    return x
end

It also misses this optimization, though it matters a little less due to its architecture (the return instruction references a register regardless of whether or not that register is allocated to a local variable):

        1       [2]     LOADK           0 -1    ; 0
        2       [3]     LOADK           1 -2    ; 1
        3       [4]     RETURN          0 2
        4       [5]     RETURN          0 1

Emacs Lisp also misses it:

(defun foo ()
  (let ((x 0)
        (y 1))
    x))

Disassembly:

0       constant  0
1       constant  1
2       stack-ref 1
3       return

All three are on the same page.

Constant folding

Does the bytecode compiler evaluate simple constant expressions at compile time? This is simple and everyone does it.

def foo():
    return 1 + 2 * 3 / 4

Disassembly:

  2           0 LOAD_CONST               1 (2.5)
              2 RETURN_VALUE

Lua:

function foo()
    return 1 + 2 * 3 / 4
end

Disassembly:

        1       [2]     LOADK           0 -1    ; 2.5
        2       [2]     RETURN          0 2
        3       [3]     RETURN          0 1

Emacs Lisp:

(defun foo ()
  (+ 1 (/ (* 2 3) 4.0)))

Disassembly:

0       constant  2.5
1       return

That’s something we can count on so long as the operands are all numeric literals (or, in Python’s case, also string literals) visible to the compiler. Don’t count on your operator overloads to work here, though.
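In CPython the folded result even shows up in the function’s constants table, which gives a quick way to confirm the optimization without a full disassembly (my own check, not part of the comparison above):

```python
def folded():
    return 1 + 2 * 3 / 4

# The whole expression was evaluated at compile time, so 2.5 is
# stored as a ready-made constant in the code object.
print(folded.__code__.co_consts)
```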

Allocation optimization

Optimizers often perform escape analysis, to determine if objects allocated in a function ever become visible outside of that function. If they don’t, these objects could potentially be stack-allocated (instead of heap-allocated) or even be eliminated entirely.

None of the bytecode compilers are this sophisticated. However CPython does have a trick up its sleeve: tuple optimization. Since tuples are immutable, in certain circumstances CPython will reuse them and avoid both the constructor and the allocation.

def foo():
    return (1, 2, 3)

Check it out, the tuple is used as a constant:

  2           0 LOAD_CONST               1 ((1, 2, 3))
              2 RETURN_VALUE

We can detect this by evaluating foo() is foo(), which is True. Deviate from this too much, though, and the optimization is disabled. Remember how CPython can’t optimize away variables, and how they break constant folding? They break this, too:

def foo():
    x = 1
    return (x, 2, 3)

Disassembly:

  2           0 LOAD_CONST               1 (1)
              2 STORE_FAST               0 (x)

  3           4 LOAD_FAST                0 (x)
              6 LOAD_CONST               2 (2)
              8 LOAD_CONST               3 (3)
             10 BUILD_TUPLE              3
             12 RETURN_VALUE

This function might document that it always returns a simple tuple, but we can tell whether it’s being optimized using is like before: foo() is foo() is now False! In some future version of Python with a cleverer bytecode compiler, that expression might evaluate to True. (Unless the Python language specification is specific about this case, which I didn’t check.)
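Here’s the whole leak condensed into a few lines, assuming current CPython behavior rather than anything the language guarantees (the function names are my own):

```python
def const_tuple():
    return (1, 2, 3)

def built_tuple():
    x = 1
    return (x, 2, 3)

# The literal tuple is compiled to a single LOAD_CONST, so every
# call returns the very same object...
print(const_tuple() is const_tuple())  # True on CPython
# ...while routing one element through a variable forces
# BUILD_TUPLE, allocating a fresh tuple on each call.
print(built_tuple() is built_tuple())  # False on CPython
```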

Note: Curiously PyPy replicates this exact behavior when examined with is . Was that deliberate? I’m impressed that PyPy matches CPython’s semantics so closely here.

Putting a mutable value, such as a list, in the tuple will also break this optimization. But that’s not the compiler being dumb. That’s a hard constraint on the compiler: the caller might change the mutable component of the tuple, so it must always return a fresh copy.
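A quick sketch of that constraint (my own example): because the list inside is mutable, each call must build a fresh tuple, and mutating one result cannot leak into the next.

```python
def fresh():
    return (1, [2, 3])

a = fresh()
b = fresh()
print(a is b)      # False: a new tuple (and list) on every call
a[1].append(4)     # mutate one result...
print(b)           # ...the other is untouched: (1, [2, 3])
```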

Neither Lua nor Emacs Lisp has a language-level equivalent of an immutable tuple, so there’s nothing to compare.

Other than the tuple situation in CPython, none of the bytecode compilers eliminate unnecessary intermediate objects.

def foo():
    return [1024][0]

Disassembly:

  2           0 LOAD_CONST               1 (1024)
              2 BUILD_LIST               1
              4 LOAD_CONST               2 (0)
              6 BINARY_SUBSCR
              8 RETURN_VALUE

Lua:

function foo()
    return ({1024})[1]
end

Disassembly:

        1       [2]     NEWTABLE        0 1 0
        2       [2]     LOADK           1 -1    ; 1024
        3       [2]     SETLIST         0 1 1   ; 1
        4       [2]     GETTABLE        0 0 -2  ; 1
        5       [2]     RETURN          0 2
        6       [3]     RETURN          0 1

Emacs Lisp:

(defun foo ()
  (car (list 1024)))

Disassembly:

0       constant  1024
1       list1
2       car
3       return

Don’t expect too much

I could go on with lots of examples, looking at loop optimizations and so on, and each case is almost certainly unoptimized. The general rule of thumb is to simply not expect much from these bytecode compilers. They’re very literal in their translation.

Working so much in C has put me in the habit of expecting all obvious optimizations from the compiler. This frees me to be more expressive in my code. Lots of things are cost-free thanks to these optimizations, such as breaking a complex expression up into several variables, naming my constants, or not using a local variable to manually cache memory accesses. I’m confident the compiler will optimize away my expressiveness. The catch is that clever compilers can take things too far, so I’ve got to be mindful of how it might undermine my intentions — i.e. when I’m doing something unusual or not strictly permitted.

These bytecode compilers will never truly surprise me. The cost is that being more expressive in Python, Lua, or Emacs Lisp may reduce performance at run time because it shows in the bytecode. Usually this doesn’t matter, but sometimes it does.