In the refereed track of the 2016 Linux Plumbers Conference, Lennart Poettering presented a new type of service for systemd that he calls a "portable system service". It is a relatively new idea that he had not talked about publicly until systemd.conf in late September. Portable system services borrow some ideas from various container managers and projects like Docker, but target a more secure environment than most services (and containers) run in today.

There is no real agreement on what a "container" is, Poettering said, but most accept that they combine a way to bundle up resources and to isolate the programs in the bundle from the rest of the system. There is also typically a delivery mechanism for getting those bundles running in various locations. There may be wildly different implementations, but they generally share those traits.

Portable system services are meant to provide the same kind of resource bundling, but to run the programs in a way that is integrated with the rest of the system. Sandboxing would be used to limit the problems that a compromised service could cause.

If you look at the range of ways that a service can be run, he said, you can put it on an axis from integrated to isolated. The classic system services, such as the Apache web server or NGINX, are fully integrated with the rest of the system. They can see all of the other processes, for example. At the other end of the scale are virtual machines, like those implemented by KVM, which are completely isolated. In between, moving from more integrated to more isolated, are portable system services, Docker-style micro-services, and full operating system containers such as LXC.

Portable system services combine the traditional, integrated services with some ideas from containers, Poettering said. The idea is to consciously choose what gets shared and what doesn't. Traditional services share everything: the network, filesystems, process IDs, init system, devices, and logging. Some of those things will be walled off for portable system services.

This is the next step for system services, he said. The core idea behind systemd is service management; not everything is Docker-ized yet, but everything has a systemd service file. Administrators are already used to using systemd services, so portable services will just make them more powerful. In many cases, users end up creating super-privileged containers by dropping half of the security provided by the container managers and mostly just using the resource bundling aspect. He wants to go the other direction and take the existing services and add resource bundling.

Integration is good, not bad, Poettering said; having common logging and networking is often a good thing. Systemd currently recognizes two types of services: System V and native. A new "portable" service type will be added to support this new idea. It will be different from the other service types by having resource bundling and sandboxing.

To start with, unlike Docker, systemd does not want to be in the business of defining its own resource bundling format, so it will use a simple directory tree in a tarball, subvolume, or GUID Partition Table (GPT) image with, say, SquashFS inside. Services will run directly from that directory, which will be isolated using chroot().
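As a rough sketch (not an example from the talk; the service name and paths are made up), a unit file could point at such a bundled tree with the RootDirectory= directive, which runs the service chroot()ed into that directory; later systemd releases also added RootImage= for pointing directly at a GPT disk image:

```ini
# portable-foo.service — hypothetical unit running from a bundled tree
[Service]
# chroot() into the unpacked bundle before starting the service:
RootDirectory=/var/lib/portables/foo
# Later systemd versions can mount a GPT/SquashFS image directly:
# RootImage=/var/lib/portables/foo.raw
# The path below is resolved inside the new root:
ExecStart=/usr/bin/foo-daemon
```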

The sandboxing he envisions is mostly concerned with taking away the visibility of, and preventing access to, various system resources. He went through a long list of systemd directives that could be used to effect the sandboxing. For example, PrivateDevices and PrivateNetwork are booleans that restrict access to all but a minimal set of devices in /dev (e.g. /dev/null, /dev/urandom) and provide only a loopback interface for networking. PrivateTmp gives the service its own /tmp, which removes a major attack surface. There is a setting to give the service a private user database containing only three users: root, a service-specific user, and nobody, to which all other user IDs are mapped. Some other settings protect various directories in the system by mounting them read-only for the service; there is a setting to disallow realtime scheduling priority, another to restrict kernel-module loading, and so on.
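The directives he listed correspond to unit-file settings like the following (a hand-written illustration rather than a unit shown in the talk; the service name and binary path are invented, and some of these options only landed in systemd releases around the time of the talk):

```ini
# sandboxed.service — illustrative sandboxing options
[Service]
ExecStart=/usr/bin/example-daemon
PrivateDevices=yes        # only /dev/null, /dev/urandom, etc. are visible
PrivateNetwork=yes        # loopback interface only
PrivateTmp=yes            # service gets its own private /tmp
ProtectSystem=full        # mount /usr, /boot, and /etc read-only
RestrictRealtime=yes      # no realtime scheduling priority
ProtectKernelModules=yes  # no kernel-module loading
```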

Many of those are already present in systemd and more will be added. The systemd project has been working to make sandboxing more useful for services, Poettering said. He would like to see a distribution such as Fedora turn these features on for the services that it ships. Another area systemd will be working on is per-service firewalls and accounting.

Unlike native or System V services, portable services will have to opt out of these sandboxing features if they don't support them. In fact, he said, if systemd were just starting out today, native services would have been opt-out for the sandboxing options, but it is too late for that now.

There are some hard problems that need to be solved to make all of this work. One is that Unix systems are not ready to handle dynamic user IDs. When a portable service gets started, an unprivileged user for the service gets created, but is not put into the user database (e.g. passwd file). If a file is created by this user and then the service dies, the file lingers with a user ID that is unknown to the system.

One way to handle that is to prevent services with a dynamic user ID from writing to any of the filesystems, so using that feature will require the ProtectSystem feature that mounts the system directories read-only. Those services will get a private /tmp and a directory under /run that they can use, but those will be tied to the lifecycle of the service. That way, the files with user IDs unknown to the system will go away when the service and user ID are gone.

Dynamic users are currently working in systemd, which is a big step forward, Poettering said. Right now, new users are installed when an RPM for a service is installed, which doesn't really scale. Dynamic users make that problem go away since the user ID is consumed only while the service is running.
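The dynamic-user mechanism is enabled with a single directive; combined with RuntimeDirectory=, the service gets a per-run user ID and a /run subdirectory that both disappear when it stops. A minimal sketch, with a made-up service name:

```ini
# dynuser.service — sketch of a dynamic-user service
[Service]
ExecStart=/usr/bin/example-daemon
DynamicUser=yes           # allocate a transient UID for this run only
RuntimeDirectory=example  # creates /run/example, removed when the service stops
```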

Another problem that people encounter is that installing a service in a chroot() environment requires copying the user database into the chroot(). The idea behind the PrivateUsers setting is to make chroot() work right. That setting restricts the service to only three users: the dynamic user for the service, root, and nobody. Most distributions agree on the user IDs for root and nobody, so that will help make portable services able to run on various distributions.

D-Bus is incompatible with a chroot() environment because there is a need to drop policy files into the host filesystem. For now, that is an unsolved problem, but the goal is to move D-Bus functionality into systemd itself. That is something the project should have done a long time ago, Poettering said. The systemd D-Bus server would then use a different policy mechanism that doesn't require access to the host filesystem.

He stressed that systemd is not building its own Docker-like container manager; it is, instead, providing building blocks to take a native service and turn it into a portable one. So systemd will only have a simple delivery mechanism that is meant to be used by developers for testing, not for production use. Things like orchestration and cluster deployment are out of scope for systemd, Poettering declared.

He showed a few examples of his vision of using the systemctl command to start, stop, and monitor portable services on the local host or a remote one, though it was not a demo. It was not entirely clear from the talk how far along things are for portable system services. The overall goal is to take existing systemd services, add resource bundling and sandboxing, and make "natural extensions to service management" to support the portable services idea, he said.

Various questions were asked at the end. For example, updating the services is out of scope for systemd; that can be handled with some other tool, but the low-level building blocks will be provided by systemd. Another question concerned configuration: different configurations of a service would require building a different bundle with those changes, as the assumption is that configuration gets shipped in the bundle.

Having the security be opt-out is useful, but how will additions to the security restrictions be handled? Existing services could break under stricter rules. Poettering said that it was something he was aware of, but had not come up with a solution for yet. He wants to start with a powerful set of restrictions out of the box, but perhaps defining a target security level for a particular service could help deal with this backward incompatibility problem.

[ Thanks to LWN subscribers for supporting my travel to Santa Fe for LPC. ]


Dave Täht has been working to save the Internet for the last six years (at least). Recently, his focus has been on improving the performance of networking over WiFi — performance that has been disappointing for as long as anybody can remember. The good news, as related in his 2016 Linux Plumbers Conference talk, is that WiFi can be fixed, and the fixes aren't even all that hard to do. Users with the right hardware and a willingness to run experimental software can have fast WiFi now, and it should be available for the rest of us before too long.

Networking, Täht said, has been going wrong for over a decade; it turns out that queuing theory has not properly addressed the problem of matching data rates to the bandwidth that the hardware can provide. Developers have tended to optimize for the fastest rates possible, but those rates are rarely seen in the real world when WiFi is involved. The "make WiFi fast" effort, involving a number of developers, seeks to change the focus and to optimize both throughput and latency at all data rates.

He has been working on the bufferbloat problem for the last six years. Hundreds of people have been involved in this effort, much of which has centered on the Linux networking stack. Many changes were merged, starting with byte queue limits in 3.3 and culminating (so far) with the BBR congestion-control algorithm, which was merged for 4.9. At this point, all network protocols can be debloated — with the exception of WiFi and LTE. But, he said, a big dent has just been made in the WiFi problem.

For the rest of the talk, Täht enlisted the aid of Ham the mechanical monkey. Ham, it seems, works in the marketing department. He only cares about benchmarks; if the numbers are big, they will help to sell products. Ham has been his nemesis for years, driving the focus in the wrong direction. The right place to focus is on use cases, where the costs of bufferbloat are felt. That means paying much more attention to latency, and focusing less on the throughput numbers that make Ham happy.

As an example, he noted that the Slashdot home page can, when latency is near zero, be loaded in about eight seconds (the LWN page, he said, was too small to make an interesting example). If the Flent tool is used to add one second of latency to the link, that load takes nearly four minutes. We have all been in that painful place at one point or another. The point is that latency and round-trip times matter more than absolute throughput.

Unfortunately, the worst latency-causing bufferbloat is often found on high-rate connections deep within the Internet service provider's infrastructure. That, he said, should be fixed first, and WiFi will start to get better for free. But that is only the start. WiFi need not always be slow; its problems are mostly to be found in its queuing, not in external factors like radio interference. The key is eliminating bufferbloat from the WiFi subsystem.

To get there, Täht and his collaborators had to start by developing a better set of benchmarks to show what is going on in real-world situations. The most useful tool, he said, is Flent, which is able to do repeatable tests under network load and show the results in graphical form. Single-number benchmark results are not particularly helpful; one needs to look at performance over time to see what is really going on. It is also necessary to get out of the testing lab and test in the field, in situations with lots of stations on the net.
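Flent drives standard load generators (such as netperf) and plots latency under load over time. A typical invocation against a test server might look like this (the host name is a placeholder, and netserver must already be running on the target machine):

```shell
# Run the "realtime response under load" (rrul) test for 60 seconds
# and plot all results to a PNG file.
flent rrul -p all_scaled -l 60 -H netperf.example.org \
    -t "wifi-baseline" -o wifi-baseline.png
```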

What they found was that the multiple-station case is where things fall down in the WiFi stack. If you have a single device on a WiFi network, things will work reasonably well. But as soon as there is contention for air time, the problems show up.

How to improve WiFi

The WiFi stack in current kernels has four major layers of interest, when it comes to queuing:

- At the top, the queuing discipline accepts packets and feeds them into the driver layer. The amount of buffering there is huge; it can hold ten seconds of WiFi data.

- The mac80211 layer does high-level WiFi work, and adds some queuing and latency of its own.

- The driver for the WiFi adapter maintains several queues of its own, perhaps holding several seconds of data. This level is where aggregation is done; aggregation groups a set of packets into a single transmitted frame to improve throughput, at the cost of increased latency.

- The firmware in the adapter itself can hold another ten seconds of data in its queues.

That adds up to a lot of queuing in the WiFi subsystem, with all of the associated problems. The good news is that fixing it required no changes to the WiFi protocols at all. So those fixes can be applied to existing networks and existing adapters.

The first step was to add a "mac80211 intermediate queue" that handles all packets for a given device, reducing the amount of queuing overall, especially since the size of this queue is strictly limited. It is meant to hold no more data than can be sent in two "transmission opportunities" (slots in which an aggregate of packets can be transmitted). The fq_codel queue management algorithm was generalized to work well in this setting.

The queuing discipline layer was removed entirely, eliminating a massive amount of buffering. Instead, there is a simple per-station queue, and round-robin fair queuing between the stations. The goal is to have one aggregated frame in the hardware for transmission, and another one queued, ready to go as soon as the hardware gets to it. Only having two packets queued at this layer may not scale to the very highest data rates, he said, but, in the real world, nobody ever sees those rates anyway.

There should be a single aggregate under preparation in the mac80211 layer; all other packets should be managed in the (short) per-station queues. In current kernels, mac80211 pushes packets into the low-level driver, where they may accumulate. In the new model, instead, the driver calls back into the mac80211 layer when it needs another packet; that gives mac80211 a better view into when transmission actually happens. The total latency imposed by buffering in this scheme is, he said, limited to 2-12ms, and there is no need for intelligence in the network hardware.
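The scheduling model described above — short per-station queues, round-robin between stations, and at most one aggregate's worth of packets pulled at a time — can be sketched in a few lines of user-space Python. This is a toy illustration of the idea, not the actual kernel code; the class and parameter names are invented:

```python
from collections import OrderedDict, deque

class StationRoundRobin:
    """Toy model of per-station queuing with round-robin aggregate
    dispatch, as described in the talk.  Each station gets a short
    queue; the "driver" pulls one aggregate per station per turn."""

    def __init__(self, max_queue=16, aggregate_size=4):
        self.max_queue = max_queue            # short per-station queue limit
        self.aggregate_size = aggregate_size  # packets per transmission opportunity
        self.queues = OrderedDict()           # station -> deque of packets

    def enqueue(self, station, packet):
        """Queue a packet for a station; drop rather than buffer deeply."""
        q = self.queues.setdefault(station, deque())
        if len(q) >= self.max_queue:
            return False                      # queue full: drop the packet
        q.append(packet)
        return True

    def next_aggregate(self):
        """Pull one aggregate from the next station, round-robin.
        Returns (station, [packets]) or None if everything is empty."""
        while self.queues:
            station, q = next(iter(self.queues.items()))
            self.queues.move_to_end(station)  # rotate this station to the back
            if q:
                n = min(self.aggregate_size, len(q))
                return station, [q.popleft() for _ in range(n)]
            del self.queues[station]          # forget stations with nothing queued
        return None
```

In this model, as in the pull-based driver design, packets accumulate only in the short per-station queues, and fairness between stations falls out of the rotation rather than from deep buffering.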

Results and future directions

The result of all this work is WiFi latencies that are less than 40ms, down from a peak of 1-2 seconds before they started, and much better handling of multiple stations running at full rate. Before the changes, a test involving 100 flows all starting together collapsed entirely, with at most five flows getting going; all the rest failed due to TCP timeouts caused by excessive buffering latency. Afterward, all 100 could start and run with reasonable latency and bandwidth. All this work, in the end, comes down to a patch that removes a net 200 lines of code.

There are some open issues, of course. The elimination of the queuing discipline layer took away a number of useful network statistics. Some of these have been replaced with information in the debugfs filesystem. There is, he said, some sort of unfortunate interaction with TCP small queues; Eric Dumazet has some ideas for fixing this problem, which only arises in single-station tests. There is an opportunity to add better air-time fairness to keep slow stations from using too much transmission time. Some future improvements, he said, might come at a cost: latency improvements might reduce the peak bandwidth slightly. But latency is what almost all users actually care about, so that bandwidth will not be missed — except by Ham the monkey.

At this point, the ath9k WiFi driver fully supports these changes; the code can be found in the LEDE repository and daily snapshots. Work is progressing on the ath10k driver; it is nearly done. Other drivers have not yet been changed. Expanding the work may well require some more thought on the driver API within the kernel but, for the most part, the changes are not huge.

WiFi is, Täht said, the only wireless technology that is fully under our control. We should be taking more advantage of that control to make it work as well as it possibly can; he wishes that there were more developers working in this area. Even a relatively small group has been able to make some significant progress in making WiFi work as it should, though; we will all be the beneficiaries of this work in the coming years.

[Your editor thanks LWN subscribers for supporting his travel to LPC.]
