The inherent fragility of seccomp()

Benefits for LWN subscribers The primary benefit from subscribing to LWN is helping to keep us publishing, but, beyond that, subscribers get immediate access to all site content and access to a number of extra site features. Please sign up today!

seccomp()

seccomp()

Kernel developers have worried for years that tracepoints could lead to applications depending on obscure implementation details; the consequent need to preserve existing behavior to avoid causing regressions could end up impeding future development. A recent report shows that thesystem call is also more prone to regressions than users may expect — but kernel developers are unlikely to cause these regressions and, indeed, have little ability to prevent them. Programs usingwill have an inherently higher risk of breaking when software is updated.

seccomp() allows the establishment of a filter that will restrict the set of system calls available to a process. It has obvious uses related to sandboxing; if an application has no need for, say, the open() system call, blocking access to that call entirely can reduce the damage that can be caused if that application is compromised. As more systems and programs are subjected to hardening, the use of seccomp() can be expected to continue to increase.

Michael Kerrisk recently reported that upgrading to glibc 2.26 broke one of his demonstration applications. That program was using seccomp() to block access to the open() system call. The problem that he ran into comes down to the fact that applications almost never invoke system calls directly; instead, they call wrappers that have been defined by the C library.

The glibc open() wrapper has, since the beginning, been a wrapper around the kernel's open() system call. But open() is an old interface that has long been superseded by openat() . The older call still exists because applications expect it to be there, but it is implemented as a special case of openat() within the kernel itself. In glibc 2.26, the open() wrapper was changed to call openat() instead. This change was not visible to ordinary applications, but it will break seccomp() filters that behave differently for open() and openat() .

Kerrisk was not really complaining about the change, but he did want to inform the glibc developers that there were user-visible effects from it: "I want to raise awareness that these sorts of changes have the potential to possibly cause breakages for some code using seccomp, and note that I think such changes should not be made lightly or gratuitously". The developers should, he was suggesting, keep the possibility of breaking seccomp() filters in mind when making changes, and they should document such changes when they cannot be avoided.

Florian Weimer, however, disagreed:

I have the opposite view: We should make such changes as often as possible, to remind people that seccomp filters (and certain SELinux and AppArmor policies) are incompatible with the GNU/Linux model, where everything is developed separately and not maintained within a single source tree (unlike say OpenBSD). This means that you really can't deviate from the upstream Linux userspace ABI (in the broadest possible sense) and still expect things to work.

Another way of putting this might be: seccomp() filters are not considered to be a part of the ABI that is provided by glibc, so incompatible changes there are not considered regressions. They are, instead, a consequence of filtering below the glibc level while expecting behavior above that level to remain unchanged.

Weimer's point of view would appear to be the one that will govern glibc development going forward. So Kerrisk has proposed some man-page changes to make the fragility of seccomp() filters a bit less surprising to developers. Playing the game at this level will require a fairly deep understanding of what is going on and the ability to adapt to future C-library changes.

This outcome could be seen as an argument in favor of a filtering interface like OpenBSD's pledge() . Like seccomp() , pledge() is used to limit the set of system calls available to a process, but pledge() is defined in terms of broad swathes of functionality rather than working at the level of individual system calls. It can be used to allow basic file I/O, for example, while disabling the opening (or creation) of new files. pledge() is far less expressive than seccomp() and cannot implement anything close to the same range of policies but, for basic filtering, it seems far less likely to generate surprises after a kernel or library update.