Elvis Pranskevichus, Pinterest engineer, Core Experience

Making Pinterest faster and more reliable is a constant focus for our engineering team, and using hardware resources more efficiently is a major part of this effort. Improving efficiency and reliability requires good diagnostic tools. That’s why today we’re announcing our newest tracing tool, ptracer, which provides granular syscall tracing in Python programs. In this post we’ll cover background on Pinterest’s codebase, why we needed a better tracer and how ptracer can help solve certain engineering problems.

Background

Pinterest is powered by a large Python codebase. From a CPU-utilization standpoint, the most efficient way to run a Python server has traditionally been to spawn multiple worker processes, with the number of workers usually close to the total number of CPU cores. The multi-process, shared-nothing architecture is safe and efficient: there's no GIL contention and none of the other issues associated with multithreaded programming. That's why the multi-process configuration is the preferred way of running things at Pinterest.

However, using multiple processes comes at a price: increased memory usage. In a large codebase, the memory allocated just for Python code objects and strings can reach into the hundreds of megabytes, with a similar amount needed for application data. Multiplied across several dozen workers, total memory usage can quickly get out of hand.

Notably, a large portion of data across workers is the same, like the Python code objects and other static bits. The solution to decrease memory usage is to make the worker processes share that data. Thankfully, there’s a way to do just that.

In POSIX systems, new processes are created by invoking the fork() system call which creates an almost identical copy of the calling process. Most importantly, the memory of the process isn’t physically copied right away. Instead, all memory pages shared between the parent and the child processes are marked with the special copy-on-write bit which triggers a copy only when a memory page is modified by one of the processes.

Figure 1. Memory sharing in forked processes

The worker processes start with almost perfect memory sharing but will gradually diverge.

The main source of memory divergence in Python programs is reference counting. In CPython, every Python object is represented by a structure which includes a reference counter that’s incremented whenever a reference to the object is made. Thus, in Python, just accessing an object is enough to trigger the memory page copy.

Similarly, the CPython garbage collector maintains a special record in every tracked Python object and modifies that record whenever a collection runs. There are remedies to this problem, and Python 3.7 will provide a standard solution in the form of a gc.freeze() call.
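On Python 3.7+, the pre-fork sequence looks roughly like this (a sketch, not our production code):

```python
import gc

# ... import application code and build all shared data first ...

gc.disable()   # prevent a collection between collect() and fork()
gc.collect()   # dispose of whatever garbage exists right now
gc.freeze()    # move all surviving objects into a permanent generation
               # that the collector will never examine (or write to) again

# ... fork() the workers here: frozen objects keep their GC headers
# untouched, so their pages remain shared ...
```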

Despite these limitations, a large proportion of memory can stay shared, leading to significant overall savings.

From supervisord to fork-server

An external process manager like supervisord knows very little about the program it runs, and it spawns workers as independent processes which share almost no memory. To reap the benefits of fork(), the Python program must call it itself at the most appropriate moment: just before the server starts accepting connections, and after all shared code and data have been loaded.

Besides memory, a child process created by fork() also shares all open file descriptors with its parent and siblings. A file descriptor includes the current offset within the file, and sharing it between processes leads to interleaved reads and writes, which most programs do not expect and cannot handle.

The usual solution to this problem is to immediately close all file descriptors (except stdin, stdout and stderr) after fork() returns in the child process. In our case this wouldn’t work, because the code that opened the file might rely on it to continue being open. Instead, we have to ensure no files or sockets are left open when we call fork() to spawn the workers.

Manually finding all places that open files without closing them is a daunting task. In a large codebase like ours it’s downright impossible, especially considering numerous third-party modules we depend on. Furthermore, fixing this issue once isn’t enough. We need to make sure new code doesn’t introduce regressions.

The only workable solution to this problem is to automate the checks and incorporate them into the regression test suite. To do this we need a mechanism to trace the program and intercept all system calls that create new file descriptors.

Many systems, including Linux, provide the ptrace() system call, which allows one process to trace and interrupt the execution of another. Debuggers like gdb use it, as does the venerable strace tool. But we couldn't simply use strace: while it does report that a certain file was opened or closed, it doesn't show where in the program this happened, which makes finding the culprit unnecessarily hard.

That’s why we built our own tracer–ptracer–to solve this problem.

ptracer

ptracer is a module that makes tracing system calls in Python programs trivial.

Under the hood, ptracer uses a combination of threads and subprocesses to make this possible.

Figure 2. ptracer architecture

When a ptracer context is entered, ptracer spawns a thread in which the syscall callback will be executed, another thread used to extract Python tracebacks from the traced threads, and a subprocess which performs the actual tracing.

Whenever the tracing subprocess detects a system call, it checks the syscall against the specified filter (if there is one). If the system call matches, the traceback thread extracts the call stack of the thread that made it; the traceback is attached to the syscall record, which is then put into a queue. The callback is invoked for each item in the syscall queue.

Solving the problem of open files pre-fork()

With ptracer, we’re now able to reliably find code locations that leave behind unclosed files.

Future work

ptracer is not limited to tracing file-related syscalls. At Pinterest we rely on the non-blocking I/O model, and our next goal is to make it possible to reliably identify blocking calls and measure their latency.

We welcome any issues, feedback and pull requests. For all our open source projects, check out our GitHub.

Acknowledgements: I’d like to thank Yury Selivanov, Jon Parise, Sam Meder and Joe Gordon for their help and feedback on the project.