Rationalizing Python's APIs


CPython is the reference implementation of Python, so it is, unsurprisingly, the target for various language-extension modules. But the API and ABI it provides to those extensions ends up limiting what alternative Python implementations—and even CPython itself—can do, since those interfaces must continue to be supported. Beyond that, though, the interfaces are not clearly delineated, so changes can unexpectedly affect extensions that have come to depend on them. A recent thread on the python-ideas mailing list looks at how to clean that situation up.

On July 11, Victor Stinner floated a draft of an as yet unnumbered Python Enhancement Proposal (PEP) entitled "Hide implementation details in the C API". The idea is to remove CPython implementation choices from the API so that different experimental choices can be made while still supporting the C-based extensions (NumPy and SciPy in particular). As he noted, other attempts to provide an alternate Python implementation (e.g. PyPy), which are typically created to enhance the language's performance, have largely run aground because they cannot directly support these all-important extensions.

In the draft, he mentioned a few possible options that could be tried if the C API were modified, including switching to indirect reference counts, removing the global interpreter lock (GIL), or changing the garbage-collection scheme. He described some of the history of Python forks and alternate implementations, many of which were blocked by the C API exposing too much of CPython's internals. The pre-PEP then went on to list some concrete steps toward splitting and rationalizing the Python C API.

To start, the Include/ directory for CPython would be split into three, one for each API: python for the existing C API, core for the internal API for CPython, and stable for the existing stable ABI (that extensions can rely on staying unchanged, though it is not really used, according to Stinner). Next up, the packaging tools would get an option for extensions to choose the API to use when they are built.

The final three steps would slowly move implementation details out of python, while still ensuring that extensions will build and function. That will require something of an iterative approach: alternately removing things from python and fixing the extensions. Eventually, the new restricted python API would be the default for all extensions. He also included an alternate path: leave the existing core API as the default, but provide an alternate API as an option at build time. That would mean that two Python binaries would be distributed for each release, one using the compatible API and another that would be faster but not compatible with all existing extensions.

Stinner prefaced his draft with some performance-related justifications, including a link to coverage of his 2017 Python Language Summit session. He is concerned about Python's performance and believes that the C API blocks various optimizations that might be applied to speed it up. He said:

This is the first draft of a big (?) project to prepare CPython to be able to "modernize" its implementation. Proposed changes should allow to make CPython more efficient in the future. The optimizations [themselves] are out of the scope of the PEP, but some examples are listed to explain why these changes are needed.

Nick Coghlan took issue with the use of "needed" with regard to performance improvements. He suggested that the status quo is a result of people not recognizing one of the best ways to increase the performance of a Python application: rewriting the performance-critical pieces in another language. "[...] So folks mistakenly think they need to rewrite their whole application in something else, rather than just selectively replacing key pieces of it." He pointed to Cython (which is used in parts of SciPy and elsewhere) as a known way to get C-level performance from Python. So there are differences of opinion about how necessary these potential performance enhancements are, he said.

However, the reorganization of the API to more clearly specify what is (and is not) an external interface is "an admirable goal", Coghlan said, which will allow more experimentation as long as there is no "hard compatibility break". The C API has "enabled the Python ecosystem to become the powerhouse that it is", but it is difficult to maintain consistently. He continued:

Those kinds of use cases are more than enough to justify changes to the way we manage our public header files - you don't need to dress it up in "sky is falling" rhetoric founded in the fear of other programming languages. Yes, Python is a nice language to program in, and it's great that we can get jobs where we can get paid to program in it. That doesn't mean we have to treat it as an existential threat that we aren't always going to be the best choice for everything :)

There was general agreement in the thread that reorganizing the header files and API was beneficial. Eric Snow pointed to some work he has done to consolidate the global variables in CPython into a single structure; it could perhaps be used as a starting point for the core API work that Stinner described. Barry Scott, who created PyCXX for writing Python extensions in C++, also liked the idea; he suggested adding a PyCXX-based extension into Stinner's testing regime.

Coghlan posted again, this time looking at more of the details in Stinner's proposal, rather than just the wording in his preamble. He reiterated some of his points about performance not being the best rationale for the initial cleanup work that Stinner is talking about. There is enough confusion in the APIs to justify the cleanup:

We're not sure which APIs other projects (including extension module generators and helper libraries like Cython, Boost, PyCXX, SWIG, cffi, etc) are *actually* relying on.

It's easy for us to accidentally expand the public C API without thinking about it, since Py_BUILD_CORE guards are opt-in and Py_LIMITED_API guards are opt-out.

We haven't structured our header files in a way that makes it obvious at a glance which API we're modifying (internal API, public API, stable ABI).

The guards that Coghlan refers to are supposed to restrict the symbols available to programs; Py_BUILD_CORE is for the interpreter and related tools (effectively what Stinner would put in core) and Py_LIMITED_API is for the stable ABI (and is badly named, according to several in the thread). Coghlan suggested making all of that more clear before tackling further questions:

In particular, better segmenting our APIs into "solely for CPython's internal use", "ABI is specific to a CPython version", "API is portable across Python implementations", "ABI is portable across CPython versions (and maybe even Python implementations)" allows tooling developers and extension module authors to make more informed decisions about how closely they want to couple their work to CPython specifically. And then *after* we've done that API clarification work, *then* we can ask the question about what the default behaviour of "#include <Python.h>" should be, and perhaps introduce an opt-in Py_CPYTHON_API flag to request access to the full traditional C API for extension modules and embedding applications that actually need it.

In the proposal, Stinner said that, instead of including files between the different API directories, declarations should be duplicated in order to avoid mistakes in exposing declarations incorrectly. Duplication has its own set of dangers, however; Coghlan and others in the thread suggested a strict hierarchy of the APIs and their include files such that no duplication was needed but that definitions could not leak out into the other APIs incorrectly.

Along the way, it became clear that the "API" and "ABI" terms were being tossed around without a clear description of what the pieces are. Brett Cannon took a stab at defining the various levels:

1. The stable A**B**I which is compatible across versions
2. A stable A**P**I which hides enough details that if we change a struct your code won't require an update, just a recompile
3. An API that exposes CPython-specific details such as structs and other details that might not be entirely portable to e.g. PyPy easily but that we try not to break
4. An internal API that we use for implementing the interpreter but don't expect anyone else to use, so we can break it between feature releases (although if e.g. Cython chooses to use it they can)

Coghlan mostly agreed with that, but thought that the portable API (#2 above) should still be able to change over time, subject to the standard Python deprecation policy. He sees the portable API as only exposing interfaces that are genuinely portable to, at least, PyPy. It would also stay as close to the stable ABI as possible, "with additions made *solely* to support the building of existing popular extension modules". So he sees the levels as follows:

1. stable ABI (strict extension module compatibility policy)
2. portable API (no ABI stability guarantees, normal deprecation policy)
3. public CPython API (no cross-implementation portability guarantees)
4. internal-only CPython core API (arbitrary changes, no deprecation warnings)

While Stinner's motivation may be different from others', it would seem that there is broad agreement that API rationalization is needed. How it all might look at a high level is also fairly non-controversial. An actual PEP that focuses strictly on the API clarification would seem to be the next step. Once that happens, assuming that it does, Stinner and others can start working on ways to make the portable API even more portable in support of various performance optimization experiments.