Instagram employs Python in one of the world’s largest settings, using it to implement the “business logic” needed to serve 800 million monthly active users. We use the reference implementation of Python, known as CPython, as the runtime that executes our code. As we’ve grown, the number of machines required to serve our users has become a significant contributor to our growing infrastructure needs. These machines are CPU bound, and as a result, we keep a close eye on the efficiency of the code we write and deploy, and have focused on building tools to detect and diagnose performance regressions. This continues to serve us well, but the projected growth of our web tier led us to investigate sources of inefficiency in the runtime itself.

Getting started

To determine what we needed to optimize and which techniques we should use, we had to figure out:

What the Instagram workload looked like from the interpreter’s perspective, and

Where the interpreter spent its time when it was running our code.

We started by collecting various bits of data from CPython as it executed Instagram server code. Unsurprisingly, from the interpreter’s perspective, the Instagram workload looks like an object-oriented, web server workload:

90% of instructions executed by the interpreter dealt with operand stack manipulation, control flow, and attribute access.

Our workload wasn’t loopy — 94% of loops terminated after four or fewer iterations.

Function call overhead was quite high and accounted for roughly 30% of the time spent by the interpreter.

Attribute access was the next most resource-intensive category of opcodes, and accounted for another 28% of time spent in the interpreter. Attribute loads were not particularly polymorphic; 85% of loads occurred at monomorphic sites.

Based on the data above, we decided to work to reduce the computational cost of function calls and attribute access.

Data Collection Methodology

We collected all of our data by instrumenting an interpreter running in our lab environment, InstaLab. InstaLab replays production traffic to web servers that are running in an isolated environment using a request mix taken from the top five views (by CPU instructions). It provides us with a way to collect representative performance data without fear of negatively impacting user experience. For each different data set, we instrumented CPython, built a new interpreter, and collected the relevant data from InstaLab.

Bytecode frequency distribution

From the perspective of the interpreter, Instagram is just a sequence of bytecode instructions to be executed. As a first step in understanding what the interpreter was doing, we instrumented CPython to collect bytecode execution frequencies. The top ten instructions by frequency are shown in the graph below:

Immediately, we could see that LOAD_FAST dominated. In fact, LOAD_FAST, STORE_FAST and LOAD_CONST accounted for roughly 40% of all opcodes executed. This was expected — CPython is a stack machine and each of these instructions moves values to/from the operand stack. Although these instructions are cheap, they are executed quite frequently, and generate memory traffic. This suggested techniques to eliminate loads and stores (e.g. switching to a register based bytecode) would be an effective optimization.
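To get a feel for why loads and stores are so common on a stack machine, it can help to look at the bytecode for a small function. The sketch below is only a rough stand-in for the instrumentation described in this post: it counts how often each opcode appears in a function's compiled bytecode (a static count), not how often it is executed. The helper name `static_opcode_counts` is hypothetical, and opcode names differ between CPython versions.

```python
import dis
from collections import Counter


def static_opcode_counts(func):
    """Count how many times each opcode appears in func's compiled bytecode.

    This is a static count of instructions in the code object, not a
    dynamic execution count like the one collected from the interpreter.
    """
    return Counter(instr.opname for instr in dis.get_instructions(func))


def add(a, b):
    total = a + b
    return total


if __name__ == "__main__":
    # Exact names and counts depend on the CPython version in use.
    for opname, count in static_opcode_counts(add).most_common():
        print(f"{opname:<30} {count}")
```

Even in this tiny function, most instructions exist only to move local variables to and from the operand stack.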



When we excluded the opcodes above, we could get a better sense of the “real work” done by the interpreter when it executed Instagram code. The following graph shows the distribution of execution frequency for the top twenty opcodes (of 120 total).

These instructions accounted for 90% of the remaining distribution. We grouped them broadly into categories:

Attribute access: LOAD_ATTR, STORE_ATTR, LOAD_GLOBAL

Control flow: CALL_FUNCTION, RETURN_VALUE, POP_JUMP_IF_FALSE, COMPARE_OP, FOR_ITER, JUMP_ABSOLUTE, POP_JUMP_IF_TRUE, POP_BLOCK, JUMP_FORWARD, YIELD_VALUE, CALL_FUNCTION_KW, SETUP_EXCEPT

Container manipulation: BINARY_SUBSCR, BUILD_TUPLE, UNPACK_SEQUENCE

Operand stack manipulation: POP_TOP, LOAD_DEREF

This conformed to what we expected to see from an object-oriented, web server workload (as opposed to a numeric workload).
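The grouping above can be expressed as a small lookup table. The sketch below is illustrative only: the category membership comes from the list above, while `CATEGORIES`, `group_by_category`, and the sample counts are hypothetical.

```python
from collections import Counter

# Category membership as listed in this post.
CATEGORIES = {
    "attribute access": {"LOAD_ATTR", "STORE_ATTR", "LOAD_GLOBAL"},
    "control flow": {
        "CALL_FUNCTION", "RETURN_VALUE", "POP_JUMP_IF_FALSE", "COMPARE_OP",
        "FOR_ITER", "JUMP_ABSOLUTE", "POP_JUMP_IF_TRUE", "POP_BLOCK",
        "JUMP_FORWARD", "YIELD_VALUE", "CALL_FUNCTION_KW", "SETUP_EXCEPT",
    },
    "container manipulation": {"BINARY_SUBSCR", "BUILD_TUPLE", "UNPACK_SEQUENCE"},
    "operand stack manipulation": {"POP_TOP", "LOAD_DEREF"},
}


def group_by_category(opcode_counts):
    """Sum per-opcode counts into the categories above; unknown opcodes go to 'other'."""
    grouped = Counter()
    for opname, count in opcode_counts.items():
        category = next(
            (name for name, ops in CATEGORIES.items() if opname in ops),
            "other",
        )
        grouped[category] += count
    return grouped


if __name__ == "__main__":
    # Made-up counts, purely to show the shape of the input and output.
    sample = {"LOAD_ATTR": 5, "CALL_FUNCTION": 3, "POP_TOP": 2, "NOP": 1}
    print(group_by_category(sample))
```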

Time spent executing bytecode

Looking at execution frequencies only told part of the story. In addition to knowing what the interpreter was doing, we also wanted to know where the interpreter spent most of its time. To that end, we measured the number of CPU instructions that were retired while executing each opcode using the perf_event APIs. We bracketed each opcode body in the interpreter dispatch loop with a read of a hardware instruction counter, then subtracted the two values to compute the “time” spent executing the opcode. For cases where calls were made back into the interpreter, or out to C functions (e.g. the CALL_FUNCTION family of opcodes), we deducted the amount of “time” spent in the callee from the cost attributed to the opcode.
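The callee-deduction bookkeeping described above can be sketched in a few lines of Python. This is an illustration of the accounting idea only, not the instrumentation that was added to CPython's dispatch loop. `ExclusiveCostTracker` and `read_counter` are hypothetical names, and `read_counter` stands in for a read of a hardware instruction counter.

```python
from collections import defaultdict


class ExclusiveCostTracker:
    """Attribute counter deltas to a label, excluding cost spent in nested spans."""

    def __init__(self, read_counter):
        # read_counter: hypothetical callable returning a monotonically
        # increasing count (a stand-in for a hardware instruction counter).
        self.read_counter = read_counter
        self.stack = []  # each entry: [label, start_count, callee_cost]
        self.cost = defaultdict(int)

    def enter(self, label):
        """Start a span, e.g. just before an opcode body runs."""
        self.stack.append([label, self.read_counter(), 0])

    def exit(self):
        """End the innermost span and charge it only its own cost."""
        label, start, callee_cost = self.stack.pop()
        elapsed = self.read_counter() - start
        self.cost[label] += elapsed - callee_cost
        if self.stack:
            # The enclosing span must deduct everything spent in this one.
            self.stack[-1][2] += elapsed


if __name__ == "__main__":
    ticks = iter([0, 10, 50, 70])  # fake counter readings
    tracker = ExclusiveCostTracker(lambda: next(ticks))
    tracker.enter("CALL_FUNCTION")
    tracker.enter("LOAD_ATTR")   # work done inside the callee
    tracker.exit()
    tracker.exit()
    print(dict(tracker.cost))
```

With the fake readings above, the nested span is charged 40 units and the outer span only the remaining 30, rather than the full 70.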



The graph below shows the top 10 opcodes by percentage of cumulative CPU instructions retired.

This data painted a slightly different picture, as function calls and attribute loading jumped to the head of the distribution. In terms of “time spent,” the most resource-intensive categories of opcodes were:

Control flow: CALL_FUNCTION, CALL_FUNCTION_KW, COMPARE_OP

Attribute loading: LOAD_ATTR, STORE_ATTR, LOAD_GLOBAL

Operand stack manipulation: LOAD_FAST, STORE_FAST, LOAD_CONST

Based on the data above, as a first step we decided to spend time optimizing attribute access and reducing function call overhead.

Accelerating attribute access

One time-honored technique to accelerate attribute access and method dispatch in dynamic language runtimes is the polymorphic inline cache. Inline caches, in combination with dense attribute storage (using something like hidden classes), can dramatically speed up attribute lookup. However, their efficacy depends on the degree of polymorphism seen at the call sites.
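As a rough illustration of the idea, a per-site cache can remember the result of the slow lookup for each receiver type it has seen. The sketch below is a toy model written in Python, not how a runtime would implement it: `InlineCache` and `load_method` are hypothetical names, only attributes defined on the class (such as methods) are handled, and cache invalidation when a class is mutated is ignored entirely.

```python
class InlineCache:
    """Toy per-site cache mapping receiver type to the attribute found on that type."""

    def __init__(self, name, max_entries=4):
        self.name = name
        self.max_entries = max_entries
        self.entries = {}  # type -> class-level attribute (e.g. a function)
        self.hits = 0
        self.misses = 0

    def _slow_lookup(self, tp):
        # Walk the method resolution order, as an uncached lookup would.
        for klass in tp.__mro__:
            if self.name in vars(klass):
                return vars(klass)[self.name]
        raise AttributeError(self.name)

    def load_method(self, obj):
        tp = type(obj)
        try:
            attr = self.entries[tp]
            self.hits += 1
        except KeyError:
            self.misses += 1
            attr = self._slow_lookup(tp)
            if len(self.entries) < self.max_entries:
                self.entries[tp] = attr
        # Bind the cached function to this particular receiver.
        return attr.__get__(obj, tp)


class A:
    def f(self):
        return "a"


class B(A):
    def f(self):
        return "b"


if __name__ == "__main__":
    site = InlineCache("f")
    print([site.load_method(o)() for o in (A(), A(), B(), A())])
    print(site.hits, site.misses)
```

A site that only ever sees one receiver type pays for the slow lookup once and hits the cache afterwards, which is why the degree of polymorphism matters.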



We instrumented CPython to record the types seen for each execution of LOAD_ATTR in order to gauge the potential of inline caching. The graph below shows the distribution of LOAD_ATTRs by degree of polymorphism.

Based on the data above, inline caching looked like it would be an effective technique to accelerate attribute access. Roughly 85% of LOAD_ATTRs occurred at monomorphic sites; an inline cache of size four should have been able to handle roughly 96% of attribute loads.
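One way to turn recorded (site, type) observations into this kind of distribution is sketched below. It assumes the raw data is a stream of pairs, one per attribute load, which is an assumption about the shape of the data rather than a description of the actual instrumentation. The function name and sample observations are hypothetical.

```python
from collections import Counter, defaultdict


def polymorphism_distribution(observations):
    """Return {degree: fraction of loads} from (site, type_name) observations.

    The degree of a site is the number of distinct types seen there; each
    load is weighted equally, so busy sites count for more.
    """
    types_at_site = defaultdict(set)
    loads_at_site = Counter()
    for site, type_name in observations:
        types_at_site[site].add(type_name)
        loads_at_site[site] += 1

    total = sum(loads_at_site.values())
    loads_by_degree = Counter()
    for site, loads in loads_at_site.items():
        loads_by_degree[len(types_at_site[site])] += loads
    return {degree: loads / total for degree, loads in sorted(loads_by_degree.items())}


if __name__ == "__main__":
    # Made-up observations: site "s1" is monomorphic, site "s2" sees two types.
    sample = [("s1", "A")] * 3 + [("s2", "A"), ("s2", "B")]
    print(polymorphism_distribution(sample))
```

Summing the fractions for degrees one through four gives an upper bound on the share of loads that a four-entry cache could serve.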

Workload loopiness

The final question we wanted to answer was: “how loopy is our code?” Answering this question helped us determine how effective optimizations targeted at reducing loop overhead would be at accelerating our workload.



We instrumented CPython to record the number of iterations performed by each instance of a loop. The graph below shows the distribution of loop instances grouped by the number of iterations performed.

As we can see, our workload was not loopy. In fact, 94% of loops terminated after four or fewer iterations.
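The same kind of measurement can be approximated at the Python level by wrapping an iterable. This is a sketch under that assumption, not the interpreter instrumentation used for the numbers above. `counted` and `fraction_at_most` are hypothetical helpers.

```python
from collections import Counter


def counted(iterable, histogram):
    """Yield items from iterable, then record how many iterations were started."""
    iterations = 0
    try:
        for item in iterable:
            iterations += 1
            yield item
    finally:
        # Runs when the iterable is exhausted or the generator is closed early.
        histogram[iterations] += 1


def fraction_at_most(histogram, limit):
    """Fraction of loop instances that ran for `limit` or fewer iterations."""
    total = sum(histogram.values())
    short = sum(count for n, count in histogram.items() if n <= limit)
    return short / total


if __name__ == "__main__":
    histogram = Counter()
    for data in (range(3), range(10), []):
        for _ in counted(data, histogram):
            pass
    print(dict(histogram), fraction_at_most(histogram, 4))
```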

Conclusion

As a result of this exercise we identified two major sources of inefficiency for our workload in CPython: function call overhead and attribute access requirements. The data we’ve collected suggests that well-known techniques should be effective at mitigating the inefficiency. As a next step, we’re using this information to guide our efforts to optimize CPython.

Matt Page is a software engineer on Instagram’s efficiency and reliability team.