Using systemd for more secure services in Fedora

The AF_PACKET local privilege escalation (also known as CVE-2016-8655) has been fixed by most distributions at this point; stable kernels addressing the problem were released on December 10. But, as a discussion on the fedora-devel mailing list shows, systemd now provides options that could help mitigate CVE-2016-8655 and, more importantly, other vulnerabilities that remain undiscovered or have yet to be introduced. The genesis for the discussion was a blog post from Lennart Poettering about the RestrictAddressFamilies directive, but recent systemd versions have other sandboxing features that could be used to head off the next vulnerability.

Fedora project leader Matthew Miller noted the blog post and wondered if the RestrictAddressFamilies directive could be more widely applied in Fedora. That directive allows administrators to restrict access to the network address families a service can use. For example, most services do not require the raw packet access that AF_PACKET provides, so turning off access to that will harden those services to some extent. But Miller was also curious if there were other systemd security features that the distribution should be taking advantage of.

He suggested perhaps having a Rawhide flag day where the network address families were restricted by default using the directive (to, say, only AF_INET, AF_INET6, and AF_UNIX); that would allow enough time to find services that needed a less (or more) restrictive set of address families and override those defaults in their unit files. An alternative to changing the defaults might be to file bugs for each affected package, Miller said, though Tomasz Torcz thought those bugs should be filed with the upstream project, rather than in the Fedora bug tracker, to reduce divergence with upstream.
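As a rough illustration of what such a restriction looks like for a single service, the following drop-in limits a unit to the three address families mentioned above. The service name (example.service) and the drop-in file name are hypothetical, and this is a per-unit sketch rather than the distribution-wide default that Miller was proposing:

```ini
# /etc/systemd/system/example.service.d/address-families.conf
# (hypothetical service and file name)
[Service]
# Allow only IPv4, IPv6, and Unix-domain sockets; attempts to create
# sockets of any other family, including AF_PACKET, are refused.
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
```

A service that legitimately needs raw packet access would list AF_PACKET as well in its own unit file.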

Ratcheting down the availability of unneeded address families, which systemd does using the kernel's seccomp facility, appeared to be a reasonable idea based on the responses of those participating in the thread. The need for a flag day was less clear. With his "upstream hat on", as a Libreswan developer, Paul Wouters asked about ways to support the feature without requiring systemd (since Libreswan runs on non-Linux systems too). While Poettering seems to have misinterpreted Wouters's initial message (so Wouters clarified it), Poettering did describe the mechanism to do so:

Of course, you can also set up seccomp filters yourself, in your daemon, in C code, by using libseccomp. It's great if you do, and that's totally possible, and can be functionality-wise entirely equivalent. The only difference is: systemd makes all of this trivially easy to use, by making this a single-line change in a unit file without involving C hacking.

There are other systemd features that might be used by services in Fedora, though; Poettering listed fourteen separate sandboxing directives that Fedora "should enable wherever we can". They are well documented, he said, though not well publicized at this point. Most are available in the systemd that ships with the currently maintained Fedora versions, though not all are.

Some of the directives he listed include:

ProtectSystem: This allows services to choose to mount some filesystems read-only for their processes. It defaults to "false"; setting it to "true" will create a new mount namespace and mount the /usr and /boot directories read-only in it. The "full" setting adds /etc to that list, while "strict" does it for the entire filesystem hierarchy except for /dev, /proc, and /sys (which can be individually protected in various ways using the PrivateDevices, ProtectKernelTunables, and ProtectControlGroups directives).

ProtectHome: This directive will set the /home, /root, and /run/user directories either as inaccessible and empty (if set to "true") or as read-only (if set to "read-only") for all processes in the unit. It defaults to "false".

ProtectKernelModules: If set to "true", kernel module loading will be disabled for the service. It removes the CAP_SYS_MODULE capability, installs a system call filter to block module-loading system calls, and makes /usr/lib/modules inaccessible. None of that stops automatic loading of kernel modules, though; that can be disabled system-wide using /proc/sys/kernel/modules_disabled.

PrivateTmp: This will cause systemd to create a new mount namespace for the unit with a private /tmp and /var/tmp that are not shared with any processes outside of the unit.

PrivateNetwork: If enabled, systemd sets up a private network namespace for the service with only the loopback device available to it. This effectively turns off all network access (except to Unix sockets available in the filesystem) for all of the unit's processes.

MemoryDenyWriteExecute: This will install a system call filter that removes the ability of the service and any of its children to create memory mappings that are both writable and executable. It intercepts attempts to use mmap() with both PROT_WRITE and PROT_EXEC, mprotect() with PROT_EXEC, and shmat() with SHM_EXEC. The idea is to restrict a program's ability to modify the code it runs, which can be exploited in various ways, but it is incompatible with programs that are meant to do that (e.g. for just-in-time compilation) so it can only be enabled for some services.

RestrictRealtime: This boolean directive can block any attempt by a process in the unit to enable realtime scheduling, such as SCHED_FIFO, SCHED_RR, and SCHED_DEADLINE.
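To give a sense of how little is involved, here is a sketch of a unit file fragment that turns on several of the directives described above. The unit and the daemon path are hypothetical, the values are only one plausible combination, and some of these settings need a fairly recent systemd, so the fragment may not work as-is on older releases:

```ini
# Fragment of a hypothetical example.service unit file
[Service]
# Hypothetical daemon path, for illustration only
ExecStart=/usr/bin/example-daemon
# Mount /usr, /boot, and /etc read-only for the service
ProtectSystem=full
# Make /home, /root, and /run/user inaccessible and empty
ProtectHome=true
# Give the unit its own /tmp and /var/tmp
PrivateTmp=true
# Block explicit kernel module loading by the service
ProtectKernelModules=true
# Refuse memory mappings that are both writable and executable
# (not suitable for programs doing just-in-time compilation)
MemoryDenyWriteExecute=true
# Refuse attempts to enable realtime scheduling
RestrictRealtime=true
```

PrivateNetwork is left out here because it only makes sense for a service that needs no network access at all.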

There are others, of course. Another useful pointer that came out of the discussion was the systemd.directives man page that Poettering noted, which has entries for each systemd directive. Those entries are linked to the proper spot elsewhere in the systemd man pages to get the full description of the directive.

There was some discussion of trying to enable some of the sandboxing options by default, but Poettering thinks that ship has sailed:

If we'd globally say that all services now run with ProtectSystem=strict by default, then we'd break pretty much all services that want to write something anywhere, until they get updated with the right ReadWritePaths= statements... And I have the suspicion that this kind of churn would upset quite a few people... I mean, I am all for breaking eggs to make an omelette, but not maybe not break all eggs in the egg carton at once ;-)

But Japheth Cleaver thought that efforts to use systemd's sandboxing facilities would be better spent elsewhere: "I'd much rather that effort be put into good SELinux policy evangelization, documentation, and perhaps additional admin-controllable booleans." No one really objected to that idea, exactly, but the SELinux complexity problem reared its head. As Poettering put it:

Yeah, this is really what it boils down to: the goal with the systemd directives is to make things easy to grok and easy to change. I can probably explain to most Linux admins who have administered a current Fedora in 5min what ProtectSystem=strict and ReadWritePaths=/var/lib/myservice does, and why it's a good thing. And afterwards he can easily add this to his own services. With SELinux that's not that easy: the concepts are much more complex (at least in my opinion, but I am sure many will agree), and as the selinux policy is packaged centrally making a change is not trivially easy to do. That said, SELinux and the systemd sandboxing directives are very different concepts. I don't think they are in competition really, and I am pretty sure everybody would benefit if both the SELinux policy and the systemd unit files would be improved.
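The two settings Poettering mentions in that message could be combined in a drop-in along the following lines. This is a minimal sketch: the state directory is the one from the quote, while the unit name (myservice.service) and the drop-in file name are assumptions made for illustration:

```ini
# /etc/systemd/system/myservice.service.d/hardening.conf
# (hypothetical drop-in; unit name assumed from the path in the quote)
[Service]
# Mount the whole filesystem hierarchy read-only for this service,
# apart from /dev, /proc, and /sys ...
ProtectSystem=strict
# ... then make only the service's own state directory writable again
ReadWritePaths=/var/lib/myservice
```

After adding a file like this, running "systemctl daemon-reload" and restarting the service should make the settings take effect.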

Certainly, providing better protections for system services of various sorts can only lead to more secure systems. Systemd has been adding many features to make it easier (and, it should be said, more understandable) for software projects, system administrators, distributions, and others to enable those protections in a fairly straightforward way. In addition, doing it that way has the advantage of spreading the protections throughout the mainstream Linux distribution ecosystem. All of that added together makes it a project worth tackling.