The glibc s390 ABI break


The GNU C library (glibc) project has long lived up to a reputation for conservatism; glibc developers know that an ill-chosen change can create a great deal of pain downstream, so they proceed with caution. Even so, mistakes can happen. A recent slip-up involving the s390 architecture makes it clear how one of those mistakes can cascade into a significant mess that is hard to clean up afterward.

The setjmp() and longjmp() functions have been part of the standard C library since something close to the beginning. They can be used to perform stack unwinding: a sort of "long return" from a function that skips over any number of intervening function calls. Both of these functions take an opaque jmp_buf data structure as an argument. The caller provides the buffer to setjmp(), which fills it with the information needed to make another return to the location of that call. A later call to longjmp() with that buffer will then cause setjmp() to appear to have returned a second time.
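As a minimal sketch of that "second return" behavior, the fragment below saves a location with setjmp() and then jumps back to it from several calls deeper. The names recovery_point, descend() and run_with_recovery() are invented for this illustration and come from no real project.

```c
#include <setjmp.h>
#include <stdio.h>

/* Hypothetical names, for illustration only. */
static jmp_buf recovery_point;

/* Recurse `depth` levels; at the bottom, optionally bail out with longjmp(). */
static void descend(int depth, int should_fail)
{
    if (depth == 0) {
        if (should_fail)
            longjmp(recovery_point, 1);  /* does not return to this frame */
        return;
    }
    descend(depth - 1, should_fail);
}

/*
 * Returns 0 if descend() came back through ordinary returns, or -1 if
 * control arrived here through longjmp().
 */
int run_with_recovery(int depth, int should_fail)
{
    if (setjmp(recovery_point) != 0) {
        /* setjmp() has "returned" a second time, via longjmp(). */
        puts("recovered via longjmp()");
        return -1;
    }
    descend(depth, should_fail);
    return 0;
}
```

Calling run_with_recovery(3, 1) takes the longjmp() path and skips the three intervening returns, while run_with_recovery(3, 0) unwinds normally. Everything needed to make that jump is stored in the jmp_buf, which is why its size and layout are part of the ABI.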

Back in April, developers from IBM committed a patch that changed the size of the jmp_buf structure on the s390 architecture; this change, which subsequently became part of the 2.19 release, was apparently needed to enable better hardware support for setjmp() and longjmp(). Since jmp_buf is a type that is visible to applications, this was a clear ABI change, with all of the possible problems that can go with it. For example, newer glibc releases expect the larger jmp_buf size, but they may be linked (at run time) against applications that have not been rebuilt and, thus, are still working with the older version of jmp_buf.

This possibility was taken into account, though. Symbol versioning was used to provide compatible versions of setjmp() and longjmp() for these older applications. So, in theory, things should Just Work without additional problems. This particular theory did not last long after its encounter with the real world.

The problem is that jmp_buf structures are often embedded into other structures, so a change in the size of jmp_buf changes the size of the containing structures, and the offsets of any fields that follow it, too. To find victims, one need not even look outside of glibc; it turns out that glibc's POSIX threads (pthreads) implementation embeds a jmp_buf structure into its own __pthread_unwind_buf_t structure which, in turn, is visible to applications. As a result, a number of pthreads functions need to become versioned as well.
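The effect of embedding can be sketched with two stand-in buffer types. The sizes used here (8 and 16 longs) are invented for the illustration and are not the actual s390 jmp_buf sizes; struct interp_old and struct interp_new are likewise hypothetical, loosely modeled on a library that keeps a jmp_buf in the middle of a public structure.

```c
#include <stddef.h>

/* Stand-ins for the old and new jmp_buf; the sizes are invented. */
typedef struct { long regs[8];  } old_jmp_buf;
typedef struct { long regs[16]; } new_jmp_buf;

/* The same hypothetical public structure, compiled against each version. */
struct interp_old { int id; old_jmp_buf env; int status; };
struct interp_new { int id; new_jmp_buf env; int status; };

/* Where code built against the old headers expects to find `status`. */
size_t status_offset_old(void)
{
    return offsetof(struct interp_old, status);
}

/* Where code built against the new headers expects to find `status`. */
size_t status_offset_new(void)
{
    return offsetof(struct interp_new, status);
}
```

A module built against the old layout and one built against the new layout disagree about where status lives, so passing the structure between them reads or writes the wrong memory. No versioning of setjmp() itself can repair that disagreement, since it is compiled into the callers.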

Versioning does not work, though, for problems that pop up outside of glibc. Consider, for example, the Perl interpreter, which embeds a jmp_buf in its main "this is a running Perl instance" structure. That has caused various Perl modules to fail and can only really be fixed by rebuilding the entire Perl environment. The PNG image format library (libpng) also has an embedded jmp_buf, in a structure that is used by all PNG-using applications.

Debian's developers, who were trying to clean up this mess, considered rebuilding all of Perl and then, perhaps, all (500 or so) packages depending on the PNG library. By this point, though, it had become clear that the ripples from this change spread widely indeed and that playing whack-a-mole might never get all of them fixed. So the Debian developers have concluded that they may have to "do like Red Hat, ie just rebuild everything and warn the users their system might break during upgrade." Needless to say, this approach lacks appeal, especially in the Debian world, where mass rebuilds are a rare event.

Even then, of course, there is the problem of end-user applications. Distributors cannot rebuild those; even worse, the user may not be able to either. So some things might just be broken.

One might be thinking that there is a mechanism in place for this kind of incompatible ABI change. Shared libraries have a shared-object name ("soname") built into them; applications linked against those libraries also contain that name. For glibc on your editor's system, for example, the soname is "libc.so.6". The runtime linker will not link an application against a shared object if the sonames do not match. In this way, the system can disallow running against a library that will not work. It also enables, in theory, the parallel installation of multiple versions of the library; older applications would continue to use the older library, while newly built binaries would use the current version.
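The theory behind soname matching can be modeled in a few lines. This is a toy model only, not how the real runtime linker is implemented: the installed[] table, the abi_note field and resolve_needed() are all hypothetical, and "libc.so.6.1" is used purely as an example of a bumped name.

```c
#include <stddef.h>
#include <string.h>

/* Toy model of libraries installed side by side; not real loader code. */
struct installed_lib {
    const char *soname;    /* name recorded inside the shared object */
    const char *abi_note;  /* illustrative label only */
};

static const struct installed_lib installed[] = {
    { "libc.so.6",   "old jmp_buf layout" },
    { "libc.so.6.1", "new jmp_buf layout" },
};

/*
 * Return the installed library whose soname matches the name recorded in
 * the application, or NULL if nothing matches and the program cannot run.
 */
const struct installed_lib *resolve_needed(const char *needed)
{
    for (size_t i = 0; i < sizeof installed / sizeof installed[0]; i++) {
        if (strcmp(installed[i].soname, needed) == 0)
            return &installed[i];
    }
    return NULL;
}
```

In this model, an old binary that recorded "libc.so.6" keeps getting the old layout, a rebuilt binary that recorded the new name gets the new one, and a name that matches nothing resolves to NULL. The model covers a single binary in isolation; it says nothing about a process that pulls in both libraries through its dependencies.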

So the glibc project could consider making a point release with a different soname (libc.so.6.1, say); distributors could then install the result alongside an older version of the library and, in theory, things should work. Except that glibc developer Carlos O'Donell tried it and concluded that:

It's unsupportable as a solution for glibc. The SO name bump in a mixed-ABI environment like debian results in two libc's being loaded and competing for effectively the same namespace of symbols with resolution (and therefore selection of the ABI) being determined by ELF interposition and scope rules. It's a nightmare. It's possible a worse solution than just telling everyone to rebuild and get on with their lives.

It also turns out to be painful to bootstrap a system with a new, ABI-incompatible version of the C library. So it seems that the soname change will not happen and that, on s390, a lot of rebuilding is going to have to go on. It will also become impossible to move affected applications between systems with pre- and post-change libraries. Not fun, but, as David Miller put it:

Therefore, on the negative side, we might be stuck with this. But, on the positive side, we can refer to this incident next time a similar incident arises. We now know exactly what the ramifications are for not handling this properly.

That leads to the obvious question: what can be done to avoid this kind of problem in the future? Carlos plans to put together a policy on how to manage ABI changes, with "don't break ABI ever" as the first item. There has been talk of improving the testing tools in an attempt to catch this kind of ABI break in the future.

In the end, though, nothing can replace a high level of care on the part of the developers involved. Glibc developers have always shown that care, which is why stories like this one are rare. In the aftermath of this mistake, one can assume that they will be doubly careful in the future. That, along with some testing support, should help to ensure that upcoming glibc releases are free of this kind of issue.

