This edition contains the following feature content:

This week's edition also includes these inner pages:

Brief items: Brief news items from throughout the community.

Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.


As all Python developers discover sooner or later, Python is a rapidly evolving language whose community occasionally makes changes that can break existing programs. The switch to Python 3 is the most prominent example, but minor releases can include significant changes as well. The CPython interpreter can emit warnings for upcoming incompatible changes, giving developers time to prepare their code, but those warnings are suppressed and invisible by default. Work is afoot to make them visible, but doing so is not as straightforward as it might seem.

In early November, one sub-thread of a big discussion on preparing for the Python 3.7 release focused on the await and async identifiers. They will become keywords in 3.7, meaning that any code using those names for any other purpose will break. Nick Coghlan observed that Python 3.6 does not warn about the use of those names, calling it "a fairly major oversight/bug". In truth, though, Python 3.6 does emit warnings in that case — but users rarely see them.

The reason for that comes down to the configuration of a relatively obscure module called warnings. The Python interpreter can generate quite a few warnings in various categories, many of which are likely to be seen as noise by users. The warnings module is used to emit warnings, but it also gives developers a way to bring back some silence by establishing a filter controlling which warnings will actually be printed out to the error stream. The default filter is:

ignore::DeprecationWarning
ignore::PendingDeprecationWarning
ignore::ImportWarning
ignore::BytesWarning
ignore::ResourceWarning

The first line filters out DeprecationWarning events, such as the warnings regarding await and async in the 3.6 release. Those warnings were also present in 3.5 as longer-term PendingDeprecationWarning events, which are also invisible by default.
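The effect of that filter is easy to demonstrate from Python itself. The sketch below (illustrative; old_api() is an invented name) records warnings instead of printing them, showing a DeprecationWarning being silently dropped under an "ignore" filter and surfacing again once the category is re-enabled, much as a test framework would do:

```python
import warnings

def old_api():
    # A library function warning about its own upcoming removal.
    warnings.warn("old_api() is deprecated", DeprecationWarning, stacklevel=2)

# Under an "ignore" filter (the default for DeprecationWarning),
# the warning is silently dropped:
with warnings.catch_warnings(record=True) as suppressed:
    warnings.simplefilter("ignore", DeprecationWarning)
    old_api()

# Re-enabling the category, as test frameworks typically do:
with warnings.catch_warnings(record=True) as visible:
    warnings.simplefilter("always", DeprecationWarning)
    old_api()

print(len(suppressed), len(visible))   # 0 1
```

The same re-enabling can be done for a whole program with the -W command-line option or the PYTHONWARNINGS environment variable, which is how developers who want to see these warnings today generally get them.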

As it happens, things were not always this way. While PendingDeprecationWarning has always been filtered, DeprecationWarning was visible by default until the Python 2.7 and 3.2 releases. In a 2009 thread discussing the change, Python benevolent dictator for life Guido van Rossum argued that the deprecation warnings, while being useful to some developers, were more often just "irrelevant noise", especially for anybody who does not actually work on the code in question:

If you download and install some 3rd party code, deprecation warnings about that code are *only noise* since they are not yours to fix. Warnings in a dynamic language work very different than warnings in a compiled language.

The idea was fairly intensely debated, but silencing those warnings by default won out in the end.

In 2017, it has become evident that this decision has kept some important warnings out of the sight of people who should see them, with the result that many people may face an unpleasant surprise when an upgrade to 3.7 abruptly breaks previously working programs. That was the cue for a new intensely debated thread over whether deprecation warnings should be enabled again.

Neil Schemenauer started things off with a suggestion that the warnings should be re-enabled by default; Coghlan subsequently proposed reverting to the way things were. He went on to say that, if application developers don't want their users to see deprecation warnings, they should disable those warnings explicitly. The invisibility of deprecation warnings has hurt users, he said, and some classes of users in particular:

We've been running the current experiment for 7 years, and the main observable outcome has been folks getting surprised by breaking changes in CPython releases, especially folks that primarily use Python interactively (e.g. for data analysis), or as a scripting engine (e.g. for systems administration).

Application developers are, one hopes, using testing frameworks for their modules, and those frameworks typically turn the warnings back on. But the above-mentioned users will not be performing such testing and will be unnecessarily surprised if Python 3.7 breaks their scripts.

The proposal led to some familiar complaints, though. Van Rossum worried that it would inflict a bunch of warning noise on users of scripts who are in no position to fix them. Antoine Pitrou suggested that small-script developers would be deluged by warnings originating in modules that they import — warnings that, once again, they cannot fix. Over time, the thread seemed to coalesce on the idea that the warnings should not be re-enabled unconditionally; they should, instead, remain disabled for "third-party" code that the current user is unlikely to have control over.

That is a fine idea, with only one little problem: how does one define "third-party code" in this setting? There were a few ideas raised, such as emitting warnings for all code located under the directory containing the initial script, but the search for heuristics threatened to devolve into a set of complex special cases that nobody would be able to remember. So the solution that was written up by Coghlan as PEP 565 was rather simpler: enable DeprecationWarning in the __main__ module, while leaving it suppressed elsewhere. In essence, any code run directly by a user would have warnings enabled, while anything imported from another module would not.
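The PEP 565 behavior can be approximated with the warnings module's existing filter machinery. The sketch below is illustrative rather than the PEP's actual implementation (install_pep565_filters() is an invented name); it puts a "default" rule for code whose __name__ is __main__ in front of the blanket "ignore" rule, and uses exec() with a synthetic __name__ to simulate where each warning originates:

```python
import warnings

def install_pep565_filters():
    # Approximation of the PEP 565 default filter set:
    #   default::DeprecationWarning:__main__
    #   ignore::DeprecationWarning
    # filterwarnings() prepends by default, so the blanket ignore goes in
    # first and the __main__ rule ends up in front of it.
    warnings.resetwarnings()
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    warnings.filterwarnings("default", category=DeprecationWarning,
                            module="__main__")

# A warning raised from code running as __main__ (e.g. a user's script):
with warnings.catch_warnings(record=True) as from_main:
    install_pep565_filters()
    exec("import warnings; warnings.warn('deprecated', DeprecationWarning)",
         {"__name__": "__main__"})

# The same warning raised from a (simulated) imported module:
with warnings.catch_warnings(record=True) as from_module:
    install_pep565_filters()
    exec("import warnings; warnings.warn('deprecated', DeprecationWarning)",
         {"__name__": "thirdparty"})

print(len(from_main), len(from_module))   # 1 0
```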

This change will almost certainly not bring deprecation warnings to the attention of everybody who needs to see them. But it will cause them to be emitted for users who are running single-file scripts or typing commands directly at the Python interpreter. That solves what Coghlan sees as the biggest problem: casual Python scripters who will otherwise be unpleasantly surprised when a distribution upgrade causes their scripts to fail. It is a partial solution that appears to be better than the status quo.

Van Rossum agreed with that assessment. He acknowledged that it's "not going to make everyone happy", but said that it's an improvement and that he intends to approve it in the near future in the absence of more objections. Naturally, such a pronouncement brought out some objections, but none of them would appear to have the strength to keep PEP 565 from being a part of the Python 3.7 release. The 3.7 interpreter's usurpation of await and async may be an unpleasant surprise to some users, but hopefully future changes will be less surprising.


The quest to find a free-software replacement for the QuickBooks accounting tool continues. In this episode, your editor does his best to put Tryton through its paces. Running Tryton proved to be a trying experience, though; this would not appear to be the accounting tool we are searching for.

Tryton is a Python 3 application distributed under the GPLv3 license. Its home page mentions that it is based on PostgreSQL, but there is support for MySQL and SQLite as well. Tryton, it is said, is "a three-tier high-level general purpose application platform" that is "the core base of a complete business solution providing modularity, scalability and security". The "core base" part of that claim is relevant: Tryton may well be a solid base for the creation of a small-business accounting system, but it is not, out of the box, such a system itself.

Running Tryton

The Tryton documentation is not especially friendly to the first-time user. The installation instructions suggest going with what one's distribution provides. One can see why; following the links for a source installation leads to a lengthy directory listing with dozens of independent tarballs. There is a Mercurial repository out there, but one has to search for it and there is no documentation on how to build or install from a copy of the repository.

Your editor opted for the Fedora Tryton packages — of which there are 54 to choose from. Tryton is broken up into a lot of modules, so there is naturally a package for each. Nobody has documented this, but getting the Fedora packages running requires creating a PostgreSQL database and user, editing the trytond.conf configuration file to point there, adding the tryton group to one's account, and running the trytond-admin application to initialize the database. Once that is done, the tryton application will consent to run and put up a simple window.

In the process of figuring this out, it became clear that the Tryton developers have not put a huge amount of effort into error handling. The usual response when something goes wrong is a Python traceback, which tends not to be particularly helpful.

Getting the tryton application to use one's local database requires messing around with "profiles", even though the configuration file specifying the database setup was passed on the application's command line. Things have to be just right or access simply fails to work. Once that obstacle was passed, the result was a general interface describing "records" of various types. The accounting module was installed and provided its own record types. A basic chart of accounts was set up. But there was nothing resembling an interface to do even basic things like creating a bank account or entering a transaction. Your editor tried installing more modules (all of them, actually) to get more functionality, but the result was an application that wouldn't run at all — it died with a traceback due to apparently missing Tryton module dependencies. Some of the Fedora module packages simply don't work, in other words.

As it happens, the version of Tryton packaged with Fedora 27 is 4.0, which is somewhat behind the current release (4.6). It seems reasonable to believe that a more recent release might yield better results. The openSUSE Tumbleweed distribution packages 4.2 instead, but it never proved possible to get past the profile screen with those packages installed. One might plausibly claim that support for Tryton is not the highest-priority objective for some distributors, at least.

It is also alleged to be possible to install Tryton directly from the Python Package Index using pip. Your newly hopeful editor duly created a virtualenv and populated it with a set of packages, but the 4.6 tryton application would not even start. It is, it would seem, still tied to the GTK+ 2 toolkit, which is not all that well supported on current distributions, especially for a Python 3 application.

Moving on

There is little doubt that somebody with greater skills and patience could find a way to make a current Tryton release work on a current Linux distribution. The result would likely be gratifying in a number of ways; Tryton appears to have a well-designed and well-documented base that one could build a good accounting application on top of. Integration with a business's other processes (one of the key criteria in this search) would seem to be relatively straightforward. But even an easily installed, perfectly working Tryton would fall far short of what is needed here.

The point is this: it's a rare small-business owner who feels the urge to build an accounting system on top of anything. Accounting is a task that needs to be done, not an objective in its own right. Any accounting system that arrives as a box of small parts with "some assembly required" written on the outside does not meet the needs of this kind of user. For all its faults, Intuit understood that when it created QuickBooks. The developers behind the other systems reviewed so far (GnuCash and Odoo) also understand that.

Chances are, the Tryton developers understand that too, but creating an easily usable small-business accounting system would appear not to be at the top of their to-do list. There are a number of free-software business-management systems available. Many of these, your editor has long believed, are developed primarily as platforms for consultants. A system that is highly capable, but which is complex, minimally documented, and in need of a lot of setup work suits that business model well.

Such a system is also, of course, simply easier to implement and maintain. Creating an interactive accounting system that is usable by people with no inherent interest in accounting systems is a difficult task. It is, seemingly, not an itch that many developers feel the need to scratch without some sort of additional incentive. Nobody has the right to criticize developers for this, but the result is predictable: like many types of free software, free accounting systems tend to lack the user-level work needed to make them truly competitive with proprietary alternatives.

Still, it is not yet time to give up on this search; the list of candidate systems is not yet empty. Stay tuned as the quest to find a free accounting system that can displace the proprietary alternatives continues.


Linux containers are something of an amorphous beast, at least with respect to the kernel. There are lots of facilities that the kernel provides (namespaces, control groups, seccomp, and so on) that can be composed by user-space tools into containers of various shapes and colors; the kernel is blissfully unaware of how user space views that composition. But there is interest in having the kernel be more aware of containers and for it to be able to distinguish what user space considers to be a single container. One particular use case for the kernel managing container identifiers is the audit subsystem, which needs unforgeable IDs for containers that can be associated with audit trails.

Back in early October, Richard Guy Briggs posted the second version of his RFC for kernel container IDs that can be used by the audit subsystem. The first version was posted in mid-September, but his is not the only proposal out there. David Howells proposed turning containers into full-fledged kernel objects back in May, but seemingly ran aground on objections that the proposal "muddies the waters and makes things more brittle", in the words of namespaces maintainer Eric W. Biederman.

Briggs's proposal, however, is focused on the needs of the audit subsystem rather than trying to solve any larger problem. He described some of the problems for the audit subsystem in a 2016 Linux Security Summit talk. In addition, he laid out some of the requirements for container tracking in response to a query from Carlos O'Donell about the first RFC:

ability to filter unwanted, irrelevant or unimportant messages before they fill queue so important messages don't get lost. This is a certification requirement.

ability to make security claims about containers, require tracking of actions within those containers to ensure compliance with established security policies.

ability to route messages from events to relevant audit daemon instance or host audit daemon instance or both, as required or determined by user-initiated rules

As proposed, audit container IDs would be handled as follows. A container orchestration system would register the ID of a container (a 16-byte UUID) by writing to a special file in the /proc directory for the container's initial process. Briggs proposes a new capability (CAP_CONTAINER_ADMIN) that would be required for a process to be able to register a container ID, but no process would be able to change its own container ID even with the capability.

Registering the container ID would associate the process ID (PID) of the first process (in the initial PID namespace) and all of that process's namespaces (using the namespace filesystem device and inode numbers) with the ID in an AUDIT_CONTAINER record that gets logged. The container IDs would then be used in various audit log messages to associate auditable events with the container that performed them. Any child processes would inherit the container ID of their parent so that all of the processes and threads in a container would be associated with its ID. If the first process has already forked or created threads, the registration would either fail or all of the child processes/threads would be associated with the ID; the right course will be determined as part of the RFC and implementation process.
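The registration and inheritance rules can be expressed as a small user-space model. The sketch below is a toy model of the semantics described in the RFC — Task, register_container_id(), and the capability strings are invented for illustration, and it takes the "registration fails after fork" branch of the open question; none of this is kernel code:

```python
import uuid

class Task:
    """Toy stand-in for a kernel task."""
    def __init__(self, parent=None, caps=()):
        self.caps = set(caps)
        # Children inherit their parent's audit container ID.
        self.container_id = parent.container_id if parent else None
        self.children = []

    def fork(self):
        child = Task(parent=self)
        self.children.append(child)
        return child

def register_container_id(orchestrator, target, cid):
    """Model the RFC's rules for registering an audit container ID."""
    if "CAP_CONTAINER_ADMIN" not in orchestrator.caps:
        raise PermissionError("registration requires the capability")
    if orchestrator is target:
        raise PermissionError("a process may not set its own container ID")
    if target.children:
        # One proposed resolution: refuse once the target has forked.
        raise RuntimeError("target has already created children")
    target.container_id = cid

orchestrator = Task(caps={"CAP_CONTAINER_ADMIN"})
init_proc = Task()
register_container_id(orchestrator, init_proc, uuid.uuid4())
child = init_proc.fork()
print(child.container_id == init_proc.container_id)   # True
```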

Audit events would be generated for all namespace creation and destruction operations. Creation events would be associated with the container ID of the process performing the action; destruction events occur when there are no more references to a namespace, so just the device and inode of the destroyed namespace would be logged. Changes to a process's namespaces would also generate an audit event that records the new and old namespace information.

The new capability for container IDs was one of the first things questioned about the proposal. Casey Schaufler asked how there could be a kernel container capability when the RFC clearly states that the kernel knows nothing about containers. Briggs likened container IDs to login user IDs and session IDs "that the kernel tracks for the convenience of userspace". He suggested that if the CAP_CONTAINER_ADMIN name was the problem, he would be fine with something like CAP_AUDIT_CONTAINERID, but that was not the core of Schaufler's complaint:

Sorry, but what aspect of the kernel security policy is this capability supposed to protect? That's what capabilities are for, not the undefined support of undefined user-space behavior. If it's audit behavior, you want CAP_AUDIT_CONTROL. If it's more than audit behavior you have to define what system security policy you're dealing with in order to pick the right capability. We get this request pretty regularly. "I need my own capability because I have a niche thing that isn't part of the system security policy but that is important!" Fit the containerID into the system security policy, and if that results in using CAP_SYS_ADMIN, oh well.

There already are two capabilities for the audit subsystem (CAP_AUDIT_CONTROL and CAP_AUDIT_WRITE) but, as Paul Moore explained, neither is quite right to govern the ability to register container IDs:

CAP_AUDIT_WRITE exists to control which applications can submit userspace generated audit records to the kernel, CAP_AUDIT_CONTROL exists to control which applications can manage the in-kernel audit configuration (e.g. filter rules) and the current task's loginuid value. Reusing CAP_AUDIT_WRITE here would allow any application that can submit userspace audit records the ability to change the audit container ID; this would be bad, we don't allow CAP_AUDIT_WRITE to change the loginuid, it would be even worse to allow it to change the audit container ID. Reusing CAP_AUDIT_CONTROL is less worse than than CAP_AUDIT_WRITE, but it gets sticky once we get to the part where we want to auditd instances in containers, complete with their own queues, filtering rules, etc.. Perhaps we could use CAP_AUDIT_CONTROL to guard the audit container ID value, but we would always want to do that check in the init userns in order to prevent container bound processes from manipulating their own audit container ID.

James Bottomley suggested sidestepping the capability question by making the container ID a write-once attribute; once set, nothing could change it. The idea of nested containers came up several times, though, which would require some way to change these container IDs. Bottomley suggested simply allowing appends to the container ID, so that the hierarchy would be inherent in the chain of IDs. Moore agreed that write-once would work for the non-nested case:

Richard [Briggs] and I have talked about a write once approach, but the thinking was that you may want to allow a nested container orchestrator (Why? I don't know, but people always want to do the craziest things.) and a write-once policy makes that impossible. If we punt on the nested orchestrator, I believe we can seriously think about a write-once policy to simplify things.

But Aleksa Sarai pointed out that nested containers are a fairly common use case, for LXC system containers in particular (which will often have other container runtimes running inside them). Biederman noted that there is not, as yet, a solution for running the audit daemon in containers, so it may be premature to worry about nested container IDs at this point.

Schaufler is concerned that adding an ID for auditing containers is heading down the wrong path. He suggested the ptags Linux Security Module as a way forward; it would allow arbitrary tags with values to be set for a process.

Then you want Jose Bollo's PTAGS. It's insane to add yet another arbitrary ID to the task for a special purpose. Add a general tagging mechanism instead. We could add a gazillion new id's, each with [its] own capability if we head down this road.

Moore stressed that the effort was not aimed at a more general mechanism, but simply to address the needs of the audit subsystem at this point. He said that the ID is meant to be an "audit container ID" and not a more general "container ID". Using the audit ID for other purposes risks opening up problems in other areas (such as container migration), so he and Briggs are attempting to restrict the use cases.

We would love to have a generic kernel facility that the audit subsystem could use to identify containers, but we don't, and previous attempts have failed, so we have to create our own. We are intentionally trying to limit its scope in an attempt to limit problems. If a more general solution appears in the future I think we would make every effect to migrate to that; keeping this initial effort small should make that easier.

At this point, there is no code on the table; it is purely a discussion of where things should go. Adding a new capability for registering these IDs seems to be a non-starter; the write-once scheme governed by one of the existing audit capabilities seems like it might plausibly pass muster. As Moore said, there seems to be a bigger need here, but more general solutions have so far been hard to come by. Adding IDs willy-nilly may be suboptimal but, until something more general comes along, might just be the right way forward.


The kernel's module mechanism allows the building of a kernel with a wide range of hardware and software support without requiring that all of that code actually be loaded into any given running system. The availability of all of those modules in a typical distributor kernel means that a lot of features are available — but also, potentially, a lot of exploitable bugs. There have been numerous cases where the kernel's automatic module loader has been used to bring buggy code into a running system. An attempt to reduce the kernel's exposure to buggy modules shows how difficult some kinds of hardening work can be.

Module autoloading

There are two ways in which a module can be loaded into the kernel without explicit action on the administrator's part. On most contemporary systems, it happens when hardware is discovered, either by a bus driver (on buses that support discovery) or from an external description like a device tree. Discovery causes an event to be sent to user space, where a daemon like udev applies whatever policies have been configured and loads the appropriate modules. This mechanism is driven by the available hardware and is relatively hard for an attacker to influence.

Within the kernel, though, lurks an older mechanism, in the form of the request_module() function. When a kernel function determines that a needed module is missing, it can call request_module() to send a request to user space to load the module in question. For example, if an application opens a char device with a given major and minor number and no driver exists for those numbers, the char device code will attempt to locate a driver by calling:

request_module("char-major-%d-%d", MAJOR(dev), MINOR(dev));

If a driver module has declared an alias with matching numbers, it will be automatically loaded into the kernel to handle the open request.

There are hundreds of request_module() calls in the kernel. Some are quite specific; one will load the ide-tape module should the user be unfortunate enough to have such a device. Others are more general; there are many calls in the networking subsystem, for example, to locate modules implementing specific network protocols or packet-filtering mechanisms. While the device-specific calls have been mostly supplanted by the udev mechanism, modules for features like network protocols still rely on request_module() for user-transparent automatic loading.

Autoloading makes for convenient system administration, but it can also make for convenient system exploitation. The DCCP protocol vulnerability disclosed in February, for example, is not exploitable if the DCCP module is not loaded in the kernel — which is normally the case, since DCCP has few users. But the autoloading mechanism allows any user to force that module to be loaded simply by creating a DCCP socket. Autoloading thus widens the kernel's attack surface to include anything in a module that unprivileged users can cause to be loaded — and there are a lot of modules in a typical distributor kernel.

Tightening the system

Djalal Harouni has been working on a patch set aimed at reducing the exposure from autoloading; the most recent version was posted on November 27. Harouni's work takes inspiration from the hardening found in the grsecurity patch set, but takes no code from there. In this incarnation (it has changed somewhat over time), it adds a new sysctl knob (/proc/sys/kernel/modules_autoload_mode) that can be used to restrict the kernel's autoloading mechanism. If this knob is set to zero (the default), autoloading works as it does in current kernels. Setting it to one restricts autoloading to processes with specific capabilities: processes with CAP_SYS_MODULE can cause any module to be loaded, while those with CAP_NET_ADMIN can autoload any module whose alias starts with netdev-. Setting this knob to two disables autoloading entirely. Once this value has been raised above zero, it cannot be lowered during the lifetime of the system.
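The knob's semantics reduce to a small decision function. The following model is purely illustrative — the real logic lives in kernel C, and autoload_allowed() is an invented name — but it captures the three modes as described:

```python
def autoload_allowed(mode, caps, alias=""):
    """Toy model of the proposed modules_autoload_mode policy."""
    if mode == 0:                       # unrestricted: current kernel behavior
        return True
    if mode == 2:                       # autoloading disabled entirely
        return False
    # mode == 1: capability-restricted autoloading
    if "CAP_SYS_MODULE" in caps:        # may cause any module to be loaded
        return True
    if "CAP_NET_ADMIN" in caps and alias.startswith("netdev-"):
        return True                     # networking module aliases only
    return False

print(autoload_allowed(1, {"CAP_NET_ADMIN"}, "netdev-dummy"))   # True
print(autoload_allowed(1, {"CAP_NET_ADMIN"}, "char-major-42"))  # False
```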

The patch set also implements a per-process flag that could be set with the prctl() system call. This flag (which takes the same values as the global flag) could restrict autoloading for a specific process and all of its descendants without changing module-loading behavior in the system overall.

It is safe to say that this patch set will not be merged in its current form for a simple reason: Linus Torvalds strongly disliked it. Disabling autoloading is likely to break a lot of systems, meaning that distributors will be unwilling to enable this option and it will not see much use. "A security option that people can't use without breaking their system is pointless," he said. The discussion got heated at times, but Torvalds is not opposed to the idea of reducing the kernel's exposure to autoloaded vulnerabilities. It was just a matter of finding the right solution.

The per-process flag looks like it could be a part of that solution. It could be used, for example, to restrict autoloading for code running within a container while leaving the system as a whole unchanged. It is not uncommon to create a process within a container with the CAP_NET_ADMIN capability to configure that container's networking while wanting most of the code running in the container to be unable to force module loading.

But, Torvalds said, a single flag will never be able to properly control all of the situations where autoloading comes into play. Some modules should perhaps always be loadable, while others may need a specific capability. So he suggested retaining the request_module_cap() function added by Harouni's patch set (which performs the load only if a specific capability is present) and using it more widely. But he did have a couple of changes to request.

The first is that request_module_cap() shouldn't actually block module loading if the needed capability is absent — at least not initially. Instead, it should log a message. That will allow a study of where module autoloading is actually needed that would, with luck, point out the places where autoloading could be restricted without breaking existing systems. He also suggested that the capability check is too simplistic. For example, the "char-major-" autoload described above only happens if a process is able to open a device node with the given major and minor numbers. In such cases, a permission test (the ability to open that special file) has already been passed and the module should load unconditionally. So there may need to be other variants of request_module() to describe settings where capabilities do not apply.

Finally, Torvalds had another thought, related to the notion that the worst bugs tend to lurk in modules that are poorly maintained at best. The DCCP module mentioned above, for example, is known to be little used and nearly unmaintained. If the modules that are well maintained were marked with a special flag, it might be possible to restrict unprivileged autoloading to those modules only. That would prevent the autoloading of some of the cruftier modules while not breaking autoloading in general. This idea does raise one question that nobody asked, though: when a module ceases being maintained, who will maintain it well enough to remove the "well maintained" flag?

In any case, that flag will probably not be added right away, if this proposed plan from Kees Cook holds. He suggested starting with the request_module_cap() approach with warnings enabled. The per-process flag would be added for those who can use it, but the global knob to restrict autoloading would not. Eventually it might be possible to get rid of unprivileged module loading, but that will be a goal for the future. The short-term benefit would be better information about how autoloading is actually used and the per-process option for administrators who want to tighten things down now.

This conversation highlights one of the fundamental tensions that can be found around kernel hardening work. Few people are opposed to a more secure kernel, but things get much more difficult as soon as the hardening work can break existing systems — and that is often the case. Security-oriented developers often get frustrated with the kernel community's resistance to hardening changes with user-visible impacts, while kernel developers have little sympathy for changes that will lead to bug reports and unhappy users. Some of those frustrations surfaced in this discussion, but most of the developers involved were chiefly interested in converging on a solution that works for everybody.


In his linux.conf.au 2017 talk [YouTube] on the eBPF in-kernel virtual machine, Brendan Gregg proclaimed that "super powers have finally come to Linux". Getting eBPF to that point has been a long road of evolution and design. While eBPF was originally used for network packet filtering, it turns out that running user-space code inside a sanity-checking virtual machine is a powerful tool for kernel developers and production engineers. Over time, new eBPF users have appeared to take advantage of its performance and convenience. This article explains how eBPF evolved, how it works, and how it is used in the kernel.

The evolution of eBPF

The original Berkeley Packet Filter (BPF) [PDF] was designed for capturing and filtering network packets that matched specific rules. Filters are implemented as programs to be run on a register-based virtual machine.

The ability to run user-supplied programs inside of the kernel proved to be a useful design decision, but other aspects of the original BPF design didn't hold up so well. For one, the design of the virtual machine and its instruction set architecture (ISA) were left behind as modern processors moved to 64-bit registers and gained new instructions required for multiprocessor systems, like the atomic exchange-and-add instruction (XADD). BPF's focus on providing a small number of RISC instructions no longer matched the realities of modern processors.

So, Alexei Starovoitov introduced the extended BPF (eBPF) design to take advantage of advances in modern hardware. The eBPF virtual machine more closely resembles contemporary processors, allowing eBPF instructions to be mapped more closely to the hardware ISA for improved performance. One of the most notable changes was a move to 64-bit registers and an increase in the number of registers from two to ten. Since modern architectures have far more than two registers, this allows parameters to be passed to functions in eBPF virtual machine registers, just like on native hardware. Plus, a new BPF_CALL instruction made it possible to call in-kernel functions cheaply.

The ease of mapping eBPF to native instructions lends itself to just-in-time compilation, yielding improved performance. The original patch that added support for eBPF in the 3.15 kernel showed that eBPF was up to four times faster on x86-64 than the old classic BPF (cBPF) implementation for some network filter microbenchmarks, and most were 1.5 times faster. Many architectures support the just-in-time (JIT) compiler (x86-64, SPARC, PowerPC, ARM, arm64, MIPS, and s390).

Originally, eBPF was only used internally by the kernel and cBPF programs were translated seamlessly under the hood. But with commit daedfb22451d in 2014, the eBPF virtual machine was exposed directly to user space.

What can you do with eBPF?

An eBPF program is "attached" to a designated code path in the kernel. When the code path is traversed, any attached eBPF programs are executed. Given its origin, eBPF is especially suited to writing network programs and it's possible to write programs that attach to a network socket to filter traffic, to classify traffic, and to run network classifier actions. It's even possible to modify the settings of an established network socket with an eBPF program. The XDP project, in particular, uses eBPF to do high-performance packet processing by running eBPF programs at the lowest level of the network stack, immediately after a packet is received.

Another type of filtering performed by the kernel is restricting which system calls a process can use. This is done with seccomp BPF.

eBPF is also useful for debugging the kernel and carrying out performance analysis; programs can be attached to tracepoints, kprobes, and perf events. Because eBPF programs can access kernel data structures, developers can write and test new debugging code without having to recompile the kernel. The implications are obvious for busy engineers debugging issues on live, running systems. It's even possible to use eBPF to debug user-space programs by using Userland Statically Defined Tracepoints.

The power of eBPF flows from two advantages: it's fast and it's safe. To fully appreciate it, you need to understand how it works.

The eBPF in-kernel verifier

There are inherent security and stability risks with allowing user-space code to run inside the kernel. So, a number of checks are performed on every eBPF program before it is loaded. The first test ensures that the eBPF program terminates and does not contain any loops that could cause the kernel to lock up. This is checked by doing a depth-first search of the program's control flow graph (CFG). Unreachable instructions are strictly prohibited; any program that contains unreachable instructions will fail to load.

The second stage is more involved and requires the verifier to simulate the execution of the eBPF program one instruction at a time. The virtual machine state is checked before and after the execution of every instruction to ensure that register and stack state are valid. Out of bounds jumps are prohibited, as is accessing out-of-range data.

The verifier doesn't need to walk every path in the program; it's smart enough to know when the current state of the program is a subset of one it's already checked. Since all previous paths must be valid (otherwise the program would already have failed to load), the current path must also be valid. This allows the verifier to "prune" the current branch and skip its simulation.

The verifier also has a "secure mode" that prohibits pointer arithmetic. Secure mode is enabled whenever a user without the CAP_SYS_ADMIN privilege loads an eBPF program. The idea is to make sure that kernel addresses do not leak to unprivileged users and that pointers cannot be written to memory. If secure mode is not enabled, then pointer arithmetic is allowed but only after additional checks are performed. For example, all pointer accesses are checked for type, alignment, and bounds violations.

Registers with uninitialized contents (those that have never been written to) cannot be read; doing so causes the program load to fail. The contents of registers R0-R5 are marked as unreadable across function calls by storing a special value to catch any reads of an uninitialized register. Similar checks are done for reading variables on the stack and to make sure that no instructions write to the read-only frame-pointer register.

Lastly, the verifier uses the eBPF program type (covered later) to restrict which kernel functions can be called from eBPF programs and which data structures can be accessed. Some program types are allowed to directly access network packet data, for example.

The bpf() system call

Programs are loaded using the bpf() system call with the BPF_PROG_LOAD command. The prototype of the system call is:

int bpf(int cmd, union bpf_attr *attr, unsigned int size);

The bpf_attr union allows data to be passed between the kernel and user space; the exact format depends on the cmd argument. The size argument gives the size of the bpf_attr union object in bytes.

Commands are available for creating and modifying eBPF maps; maps are the generic key/value data structure used for communicating between eBPF programs and the kernel or user space. Additional commands allow attaching eBPF programs to a control-group directory or socket file descriptor, iterating over all maps and programs, and pinning eBPF objects to files so that they're not destroyed when the process that loaded them terminates (the latter is used by the tc classifier/action code so that eBPF programs persist without requiring the loading process to stay alive). The full list of commands can be found in the bpf() man page.

Though there appear to be many different commands, they can be broken down into three categories: commands for working with eBPF programs, commands for working with eBPF maps, and commands for working with both programs and maps (collectively known as objects).

eBPF program types

The type of program loaded with BPF_PROG_LOAD dictates four things: where the program can be attached, which in-kernel helper functions the verifier will allow to be called, whether network packet data can be accessed directly, and the type of object passed as the first argument to the program. In fact, the program type essentially defines an API. New program types have even been created purely to distinguish between different lists of allowed callable functions ( BPF_PROG_TYPE_CGROUP_SKB versus BPF_PROG_TYPE_SOCKET_FILTER , for example).

The current set of eBPF program types supported by the kernel is:

BPF_PROG_TYPE_SOCKET_FILTER : a network packet filter

BPF_PROG_TYPE_KPROBE : determine whether a kprobe should fire or not

BPF_PROG_TYPE_SCHED_CLS : a network traffic-control classifier

BPF_PROG_TYPE_SCHED_ACT : a network traffic-control action

BPF_PROG_TYPE_TRACEPOINT : determine whether a tracepoint should fire or not

BPF_PROG_TYPE_XDP : a network packet filter run from the device-driver receive path

BPF_PROG_TYPE_PERF_EVENT : determine whether a perf event handler should fire or not

BPF_PROG_TYPE_CGROUP_SKB : a network packet filter for control groups

BPF_PROG_TYPE_CGROUP_SOCK : a network packet filter for control groups that is allowed to modify socket options

BPF_PROG_TYPE_LWT_* : a network packet filter for lightweight tunnels

BPF_PROG_TYPE_SOCK_OPS : a program for setting socket parameters

BPF_PROG_TYPE_SK_SKB : a network packet filter for forwarding packets between sockets

BPF_PROG_TYPE_CGROUP_DEVICE : determine if a device operation should be permitted or not

As new program types were added, kernel developers discovered a need to add new data structures too.

eBPF data structures

The main data structure used by eBPF programs is the eBPF map, a generic data structure that allows data to be passed back and forth within the kernel or between the kernel and user space. As the name "map" implies, data is stored and retrieved using a key.

Maps are created and manipulated using the bpf() system call. When a map is successfully created, a file descriptor associated with that map is returned. Maps are normally destroyed by closing the associated file descriptor. Each map is defined by four values: a type, a maximum number of elements, a value size in bytes, and a key size in bytes. There are different map types and each provides a different behavior and set of tradeoffs:

BPF_MAP_TYPE_HASH : a hash table

BPF_MAP_TYPE_ARRAY : an array map, optimized for fast lookup speeds, often used for counters

BPF_MAP_TYPE_PROG_ARRAY : an array of file descriptors corresponding to eBPF programs; used to implement jump tables and sub-programs to handle specific packet protocols

BPF_MAP_TYPE_PERCPU_ARRAY : a per-CPU array, used to implement histograms of latency

BPF_MAP_TYPE_PERF_EVENT_ARRAY : stores pointers to struct perf_event , used to read and store perf event counters

BPF_MAP_TYPE_CGROUP_ARRAY : stores pointers to control groups

BPF_MAP_TYPE_PERCPU_HASH : a per-CPU hash table

BPF_MAP_TYPE_LRU_HASH : a hash table that only retains the most recently used items

BPF_MAP_TYPE_LRU_PERCPU_HASH : a per-CPU hash table that only retains the most recently used items

BPF_MAP_TYPE_LPM_TRIE : a longest-prefix match trie, good for matching IP addresses to a range

BPF_MAP_TYPE_STACK_TRACE : stores stack traces

BPF_MAP_TYPE_ARRAY_OF_MAPS : a map-in-map data structure

BPF_MAP_TYPE_HASH_OF_MAPS : a map-in-map data structure

BPF_MAP_TYPE_DEVMAP : for storing and looking up network device references

BPF_MAP_TYPE_SOCKMAP : stores and looks up sockets and allows socket redirection with BPF helper functions

All maps can be accessed from eBPF or user-space programs using the bpf_map_lookup_elem() and bpf_map_update_elem() functions. Some map types, such as socket maps, work with additional eBPF helper functions that perform special tasks.

How to write an eBPF program

Historically, it was necessary to write eBPF assembly by hand and use the kernel's bpf_asm assembler to generate BPF bytecode. Fortunately, the LLVM Clang compiler has grown support for an eBPF backend that compiles C into bytecode. Object files containing this bytecode can then be directly loaded with the bpf() system call and BPF_PROG_LOAD command.

You can write your own eBPF program in C by compiling with Clang using the -target bpf parameter. There are plenty of eBPF program examples in the kernel's samples/bpf/ directory; the majority have a " _kern.c " suffix in their file name. The object file (eBPF bytecode) emitted by Clang needs to be loaded by a program that runs natively on your machine (these samples usually have " _user.c " in their filename). To make it easier to write eBPF programs, the kernel provides the libbpf library, which includes helper functions for loading programs and creating and manipulating eBPF objects. For example, the high-level flow of an eBPF program and user program using libbpf might go something like:

1. Read the eBPF bytecode into a buffer in your user application and pass it to bpf_load_program() .

2. The eBPF program, when run by the kernel, calls bpf_map_lookup_elem() to find an element in a map and store a new value in it.

3. The user application calls bpf_map_lookup_elem() to read out the value stored by the eBPF program in the kernel.

However, all of the sample code suffers from one major drawback: you need to compile your eBPF program from within the kernel source tree. Luckily, the BCC project was created to solve this problem. It includes a complete toolchain for writing eBPF programs and loading them without linking against the kernel source tree.

BCC is covered in the next article in this series.

Comments (20 posted)

Voice computing has long been a staple of science fiction, but it has only relatively recently made its way into fairly common mainstream use. Gadgets like mobile phones and "smart" home assistant devices (e.g. Amazon Echo, Google Home) have brought voice-based user interfaces to the masses. The voice processing for those gadgets relies on various proprietary services "in the cloud", which generally leaves the free-software world out in the cold. There have been FOSS speech-recognition efforts over the years, but Mozilla's recent announcement of the release of its voice-recognition code and voice data set should help further the goal of FOSS voice interfaces.

There are two parts to the release, DeepSpeech, which is a speech-to-text (STT) engine and model, and Common Voice, which is a set of voice data that can be used to train voice-recognition systems. While DeepSpeech is available for those who simply want to do some kind of STT task, Common Voice is meant for those who want to create their own voice-recognition system—potentially one that does even better (or better for certain types of applications) than DeepSpeech.

DeepSpeech

The DeepSpeech project is based on two papers from Chinese web-services company Baidu; it uses a neural network implemented using Google's TensorFlow. As detailed in a blog post by Reuben Morais, who works in the Machine Learning Group at Mozilla Research, several data sets were used to train DeepSpeech, including transcriptions of TED talks, LibriVox audio books from the LibriSpeech corpus, and data from Common Voice; two proprietary data sets were also mentioned, but it is not clear how much of that was used in the final DeepSpeech model. The goal was to have a word error rate of less than 10%, which was met; "Our word error rate on LibriSpeech's test-clean set is 6.5%, which not only achieves our initial goal, but gets us close to human level performance."

The blog post goes into a fair amount of detail that will be of interest to those who are curious about machine learning. It is clear that doing this kind of training is not for the faint of heart (or those with small wallets). It is a computationally intensive task that takes a fairly sizable amount of time even using specialized hardware:

Deep Speech has over 120 million parameters, and training a model this large is a very computationally expensive task: you need lots of GPUs if you don't want to wait forever for results. We looked into training on the cloud, but it doesn't work financially: dedicated hardware pays for itself quite quickly if you do a lot of training. The cloud is a good way to do fast hyperparameter explorations though, so keep that in mind. We started with a single machine running four Titan X Pascal GPUs, and then bought another two servers with 8 Titan XPs each. We run the two 8 GPU machines as a cluster, and the older 4 GPU machine is left independent to run smaller experiments and test code changes that require more compute power than our development machines have. This setup is fairly efficient, and for our larger training runs we can go from zero to a good model in about a week.

A "human level" word error rate is 5.83%, according to the Baidu papers, Morais said, so 6.5% is fairly impressive. Running the model has reasonable performance as well, though getting it to the point where it can run on a Raspberry Pi or mobile device is desired.

On a MacBook Pro, using the GPU, the model can do inference at a real-time factor of around 0.3x, and around 1.4x on the CPU alone. (A real-time factor of 1x means you can transcribe 1 second of audio in 1 second.)

Common Voice

Because the machine-learning group had trouble finding quality data sets for training DeepSpeech, Mozilla started the Common Voice project to help create one. The first release of data from the project is the subject of a blog post from Michael Henretty. The data, which was collected from volunteers and has been released into the public domain, is quite expansive: "This collection contains nearly 400,000 recordings from 20,000 different people, resulting in around 500 hours of speech." In fact, it is the second largest publicly available voice data set; it is also growing daily as people add and validate new speech samples.

The initial release is only for the English language, but there are plans to support adding speech in other languages. The announcement noted that a diversity of voices is important for Common Voice:

Too often existing speech recognition services can't understand people with different accents, and many are better at understanding men than women — this is a result of biases within the data on which they are trained. Our hope is that the number of speakers and their different backgrounds and accents will create a globally representative dataset, resulting in more inclusive technologies. To this end, while we've started with English, we are working hard to ensure that Common Voice will support voice donations in multiple languages beginning in the first half of 2018.

The Common Voice site has links to other voice data sets (also all in English, so far). There is also a validation application on the home page, which allows visitors to listen to a sentence to determine if the speaker accurately pronounced the words. There are no real guidelines for how forgiving one should be (and just simple "Yes" and "No" buttons), but crowdsourcing the validation should help lead to a better data set. In addition, those interested can record their own samples on the web site.

A blog post announcing the Common Voice project (but not the data set, yet) back in July outlines some of the barriers to entry for those wanting to create STT applications. Each of the major browsers has its own API for supporting STT applications; as might be guessed, Mozilla is hoping that browser makers will instead rally around the W3C Web Speech API. That post also envisions a wide array of uses for STT technology:

Voice-activated computing could do a lot of good. Home hubs could be used to provide safety and health monitoring for ill or elderly folks who want to stay in their homes. Adding Siri-like functionality to cars could make our roads safer, giving drivers hands-free access to a wide variety of services, like direction requests and chat, so eyes stay on the road ahead. Speech interfaces for the web could enhance browsing experiences for people with visual and physical limitations, giving them the option to talk to applications instead of having to type, read or move a mouse. It's fun to think about where this work might lead. For instance, how might we use silent speech interfaces to keep conversations private? If your phone could read your lips, you could share personal information without the person sitting next to you at a café or on the bus overhearing. Now that's a perk for speakers and listeners alike.

While applications for voice interfaces abound (even if only rarely used by ever-increasing Luddites such as myself), there are, of course, other problems to be solved before we can throw away our keyboard and mouse. Turning speech into text is useful, but there is still a need to derive meaning from the words. Certain applications will be better suited than others to absorb voice input, and Mozilla's projects will help them do so. Text to speech has been around for some time, and there are free-software options for that, but full-on, general purpose voice interfaces will probably need a boost from artificial intelligence—that is likely still a ways out.

Comments (22 posted)