
At last year's X.Org Developers Conference (XDC), James Jones began the process of coming up with an API for allocating memory so that it is accessible to multiple different graphics devices in a system (e.g. GPUs, hardware compositors, video decoders, display hardware, cameras, etc.). At XDC 2017 in Mountain View, CA, he was back to update attendees on the progress that has been made. He has a prototype in progress, but there is plenty more to do, including working out some of the problems he has encountered along the way.

Jones has been at NVIDIA for 13 years and has been working on this problem in various forms for most of that time, he said. Allocating buffers and passing them around between multiple drivers is a complicated problem. The allocator will sit in the same place as the Generic Buffer Management (GBM) component is today; it will be used both by applications and by various user-space driver components. The allocator will support both vendor-agnostic (e.g. Android ION) and vendor-specific back-ends, as well as combinations of the two.

Design

The allocator is designed around a number of different objects. An "assertion" is an object that describes the width, height, and format of the surface desired. Many applications may simply assert what they need, without adding any extra parameters; that request will either succeed or fail directly. More complicated parameters can be specified with a "usage" object that describes how the surface will be used; that could be a single kind of use (e.g. rendering) or a multi-use surface (e.g. display and texturing).

A "constraint" object is returned from the allocator to describe the limitations inherent in a given assertion and usage combination. It is a description of the limitations of the surfaces that can be provided, including things like the pitch (or stride) and address alignment. Constraints are defined for the allocator library as a whole, so if a device has a strange constraint, it must be defined in the library and cannot be hidden in the allocator back-end library or driver. The reason is that constraints are non-trivial restrictions and it is not clear how to merge them if they are device specific; that code will need to live in the common allocator library.

The other object returned is a "capability" object that describes features that the driver can support for a given assertion and usage. Most commonly, these refer to memory layouts, such as the standard pitch linear layout or a vendor-specific layout (e.g. a tiling format). They will also refer to memory placement, which is what kind of memory (e.g. system memory or device-local memory) is required. Constraints and capabilities have dependencies between them, so what will be returned is a list of "capability sets" that pair compatible constraint and capability objects into valid combinations. Capability sets can be intersected with others (perhaps returned from a different device) to find a common denominator.

Jones then stepped through the allocator workflow, based on the USAGE document in his GitHub repository. The first step is to initialize an allocator device object from a device file descriptor. It is not yet clear how that allocator object will be defined. Those trying to use a single surface with multiple devices will likely initialize multiple allocator objects.

The next step is to query capability sets from the device(s) given an assertion and list of usages. After that, capability sets can be merged to find common capabilities; it is possible that there may be no commonality, so the application will need to have its own fall-back logic. Trying to allocate a surface on the available devices is next, which might also fail, in which case the application could fall back to a different capability set. Once the allocation succeeds, the surface can be imported into graphics, mode-setting, video, or other APIs.

Prototype

His goal was to have a demo for XDC, but that didn't work out. He has parts of the prototype working, though, as detailed in his slides [PDF] (slide eight). So far, he can create devices, query and merge capabilities and constraints, and create allocations. Exporting and importing allocations to other APIs and using the allocations, either for Vulkan and OpenGL or for Direct Rendering Manager (DRM) and non-graphics devices, remain to be done.

The core of the allocator is the capability set math; it is the "value add" for the new allocator, Jones said. The idea is to take two sets, potentially from two different devices, and to create a set that works for both devices. He thought it would be straightforward to implement that, but it took several weeks to get it right. It works well for all of the NVIDIA use cases, but he would like to see more testing from others with different devices and use cases.

He gave two examples of the capability set math using three different device capability sets, two of which could not be combined because there was no overlap in the format capabilities. The other two could be combined by choosing a common format capability and intersecting the address-alignment constraints. Devices can specify a "required capability"—one that will cause the merge to fail if it must be removed.

Capabilities are effectively opaque to the allocator. They are compared using a simple memcmp(), but they are typed by vendor, so there will be no confusion when doing the comparison. For common capabilities, like pitch-linear layout, there will be a vendor-neutral type so that they can be shared by all of the back-ends. In answer to a question, Jones said that it is fairly easy to add constraints: an ID needs to be added to a header file, and a merge/intersect function must be added to a table.

Problems encountered

There are a number of "gotchas" he has found so far. For one thing, a device file does not necessarily uniquely identify a logical device, at least for NVIDIA devices. Creating allocators from a file descriptor implies that there is one unique device file corresponding to the logical device of interest. It would be nice if the UUIDs from the Vulkan/OpenGL APIs could be used to enumerate the available devices, he said.

There are some capabilities that only apply to a particular device (device-local capabilities), such as a GPU with an on-chip cache. A capability could describe the use of that cache, but other devices will neither know nor care about it. When intersecting capabilities with other devices' sets, the local-cache capability will end up being removed, which is not what is desired. There may be a need for another flag (similar to the required flag) that tells devices to ignore a capability they do not know about. That would mean capability sets get handed to devices containing capabilities that are not understood, which may be problematic in other ways.

The way to specify formats is still up in the air. There are "a billion ways to do it", Jones said, and he doesn't really care which is chosen. Last year, the Khronos data format specification was suggested, as was FOURCC. The prototype supports any format as long as it is RGBA 32-bit, he said with a grin. Whatever is chosen will need to support high dynamic range (HDR) formats. There is also an open question on the need for format enumeration; which formats are supported may depend on the intended usage of the surface.

An important missing piece is how to use these allocations with the Vulkan and OpenGL import APIs. Those APIs expect some metadata to be associated with the allocations that describe them, but there are various elements of the allocator metadata that do not apply to Vulkan or OpenGL, such as the device-local capabilities. A query could be added to allow applications to retrieve the allocation metadata, but some of that metadata is opaque. Some developers are concerned that there is a security risk in using opaque metadata, though the specific risk was not spelled out; he does not agree that there is one and is unsure how else to solve the problem.

Another outstanding question is the relationship to DMA buffers (DMA-BUF); should the import and export API consume and produce DMA-BUF file descriptors? He is concerned that doing so would bake Linux-specific assumptions into the API; even file descriptors can be non-portable to other operating systems. He also wondered if there is any value in using a DMA-BUF when the allocation will only be used by a single device or driver stack.

Up next

Transitioning a surface from one usage type to another is something that Vulkan allows, which could be more widely applied. The API to do so would be complicated, however. Applications could request metadata from the allocator on what needs to be done for the transition (e.g. invalidate a cache) that could be passed to the driver.

A simpler approach would be to do a reallocation operation when the usage of the surface changes. The API for that is already essentially in place, and the steady state, when the usage does not change, is optimal. But allocation can be expensive, while transitions have a consistent cost. In addition, the usage may change at inconvenient times, so the allocation cost may be noticeable.

The original goal was to make memory allocation work with Wayland and other, similar compositors. That still needs to be tackled. NVIDIA introduced EGLStream to that end, and has a sample implementation that uses that mechanism. The key functionality needed to replace EGLStream with the allocator is to be able to build an EGLSurface from an allocator surface. There are multiple Wayland applications that need that ability, he said.

There is also a question of where this new allocator code should live and what it should be called. Right now, it is a standalone library called liballocator because that was easier for development. It could be moved into a new library or merged into GBM, he said. The name might be too generic if it remains as a standalone library.

He finished by putting up a slide (number 26) that listed the questions he had asked along the way. There was no immediate resolution to any of them in the talk, but it was held early on the first day of XDC. One suspects there were some hallway track discussions to try to address some or all of them.

[I would like to thank the X.Org Foundation and the Linux Foundation for travel assistance to Mountain View for XDC.]


Doing realtime processing with a general-purpose operating system like Linux can be a challenge by itself, but safety-critical realtime processing ups the ante considerably. During a session at Open Source Summit North America, Wolfgang Mauerer discussed the difficulties involved in this kind of work and what Linux has to offer.

Realtime processing, as many have said, is not synonymous with "real fast". It is, instead, focused on deterministic response time and repeatable results. Getting there involves quantifying the worst-case scenario and being prepared to handle it — a 99% success rate is not good enough. The emphasis on worst-case performance is at the core of the difference with performance-oriented processing, which uses caches, lookahead algorithms, pipelines, and more to optimize the average case.

Mauerer divided realtime processing into three sub-cases. "Soft realtime" is concerned with subjective deadlines, and is used in situations where "nobody dies if you miss the deadline", though the deadline should be hit most of the time. Media rendering was an example of this type of work. "95% Realtime" applies when the deadline must be hit most of the time, but an occasional miss can be tolerated. Data acquisition and stock trading are examples; a missed deadline might mean a missed trade, but life as a whole goes on. The 100% realtime scenario, instead, applies to areas like industrial automation and aviation; if a deadline is missed, bad things happen.

Getting to 100% realtime performance requires quite a bit of work. To the extent possible, the worst-case execution time (WCET) of any task must be determined. Statistical testing is used to test that calculation. The best approach is formal verification, where the response time is proved, but that can only be done for the smallest applications and operating systems. Formal verification has been performed for the L4 system, but that kernel only has 10,000 lines of code. Formal verification is not possible for a kernel like Linux.

100% Realtime performance is hard enough to achieve, but "safety-critical" adds another dimension of reliability requirements. You do not, he said, want to see a segmentation fault when you hit the brakes. Safety-critical computing is not the same as realtime, but the two tend to go together. There is a long list of standards covering various aspects of safety-critical computing; they all come together under the IEC 61508 umbrella standard — "10,000 pages of bureaucratic poetry" according to Mauerer.

Compliance with those standards is one of three routes to safety. The second is called "proven in use", which is essentially a way of saying that a system has been used for twenty years and hasn't shown problems yet. It is scary how far the "proven in use" claim can be pushed, he said. The final approach is "compliant non-compliant development", which is how many of these systems are actually built.

Components of a realtime safety-critical system

There is a list of design patterns used to put together safety-critical realtime systems; Mauerer described some of them:

Run a traditional realtime operating system in a dedicated "side device" to handle the safety-critical work. There are a lot of these devices and systems available; they are simple and come pre-certified. The WCET of a given task can be calculated relatively easily. These systems can be hard to extend, though; they also suffer from vendor lock-in and unusual APIs.

Use a realtime-enhanced kernel — a solution that is common in the Linux community. With this approach, it's possible to keep and use existing Linux know-how and incorporate high-level technologies. The downside is that certification is difficult, the resulting systems are complex, and only statistical assurance is possible.

Run a "separation kernel" on hardware that enforces partitioning. This solution is common in the proprietary world. It offers a clean split between the realtime and non-realtime parts of the system, and there is a lot of certification experience with these systems. But there is strong coupling between the two parts at the hardware level, and there are vendor lock-in issues.

Run a co-kernel on the same hardware — like the separation kernel, but without the hardware partitioning. Once again, there is a clean division between the two parts of the system, and this solution is resource-efficient. But the necessary code is not in the mainline kernel, leading to maintenance difficulties, and there can be implicit couplings between the two kernels.

Use asymmetric multiprocessing; this solution is becoming more popular, he said. A multiprocessor system is partitioned, with some CPUs dedicated to realtime processing. Performance tends to be good, but there can be implicit coupling between the two parts. This solution is also relatively new and fast-moving, complicating maintenance.

One common feature of all of these approaches (excepting the second) is that they use some sort of partitioning to separate the realtime processing from everything else that the system is doing. The exception (the realtime-enhanced kernel) instead achieves its goals through full preemption, deterministic timing behavior, and the avoidance of priority inversion. All that was needed to accomplish this, he said, was the world's best programmers and about two decades of time.

If you want to create a system with such a kernel, you need to start by avoiding "stupid stuff". No page faults can be taken; memory must be locked into RAM. Inappropriate system calls — those involving networking, for example — must be avoided; access to block devices is also not allowed. If these rules are followed, a realtime Linux system can achieve maximum latencies of about 50µs on an x86 processor, or 150µs on a Raspberry Pi. This is far from perfect, but it is enough for most uses.

There are a lot of advantages to using a realtime Linux kernel. The patches are readily available, as is community support. Existing engineering knowledge can be reused. Realtime Linux offers multi-core scalability and the ability to run realtime code in user space. On the other hand, the resulting system is hard to certify. If much smaller latencies are required, one needs specialized deep knowledge — it's best to have Thomas Gleixner available. It is also easy to mix up the realtime and non-realtime parts of the system; this does happen in practice.

One could, instead, use Xenomai, which Mauerer described as a skin for a traditional realtime operating system. It can run over Linux or in a co-kernel arrangement, but some patches need to be applied. I-pipe is used to dispatch interrupts to the realtime or Linux kernel as needed. Xenomai can achieve 10µs latencies on x86 systems, or 50µs on a Raspberry Pi. It offers a clean split between the parts of the system and is a lightweight solution. On the other hand, there are few developers working with Xenomai, and it tends to experience regressions when the upstream kernel changes.

Yet another approach, along the separation kernel lines, is an ARM system with a programmable realtime unit (PRU). The PRU has its own ARM-like processors and its own memory, so there is no contention with the main CPU. The main core can run Linux and communicate with the PRU via the remoteproc interface. Such systems are highly deterministic, cleanly split out the realtime work, and are simple. But they are also hardware-specific and require more maintenance.

Getting to safety-critical

There are, he said, two common approaches to the creation of safety-critical Linux systems. The first is called SIL2LinuxMP; it works by partitioning the system's CPUs and running applications in containers. Dedicated CPUs can then be used to isolate the safety-critical work from the rest of the system. This work is aiming for SIL certification, but at the SIL2 level only. SIL3 is considered to be too hard to reach with a Linux-based system.

The alternative is the Jailhouse system. It is a hypervisor that uses hardware virtualization to partition the system. Realtime code can be run in one partition, while safety-critical code can run in another. There are a couple of Jailhouse-like systems, but they have some disadvantages. SafeG relies on the Arm TrustZone mechanism, and only supports two partitions. The Quest-V kernel [PDF] is purely a research system that is not suitable for real-world use. So Jailhouse is Mauerer's preferred approach.

The Jailhouse project is focused on simplicity, he said. It lets an ordinary Linux kernel bring up the system and deal with hardware initialization issues; the partitioning is done once things are up and running. The "regular" Linux system is then shunted over into a virtual machine that is under the hypervisor's control. It works, but there are still some issues to deal with. Jailhouse cannot split up memory-mapped I/O regions, for example, leading to implicit coupling between the system parts. There are other hardware resources that cannot be partitioned; he mentioned clocks as being particularly problematic. The "unsafe" part of the system might manage to turn off a clock needed by the safety-critical partition, for example.

Overall, though, he said that he is happy with the Jailhouse approach. It is able to achieve 15µs maximum latencies in most settings. The obvious conclusion of the talk was thus a recommendation that Jailhouse is the best approach for safety-critical systems at this time.

[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting your editor's travel to the Open Source Summit.]


In the refereed track at the 2017 Linux Plumbers Conference (LPC), Jiri Kosina gave an update on the status and plans for the live kernel patching feature. It is a feature that has a long history—pre-dating Linux itself—and has had a multi-year path into the kernel. Kosina reviewed that history, while also looking at some of the limitations and missing features for live patching.

The first question that gets asked about patching a running kernel is "why?", he said. That question gets asked in the comments on LWN articles and elsewhere. The main driver of the feature is the high cost of downtime in data centers. That leads data center operators to plan outages many months in advance to reduce the cost; but in the case of a zero-day vulnerability, that time is not available. Live kernel patching is targeted at making small security fixes as a stopgap measure until the kernel can be updated during a less-hurried, planned outage. It is not meant for replacing the kernel bit by bit over time, but as an emergency measure when the kernel is vulnerable.

History

The history of the idea behind live patching goes back at least as far as the 1940s, he said. He referenced the classic Richard Feynman book Surely You're Joking, Mr. Feynman!, where Feynman described a system he used to change the program being run by early proto-computers. He color-coded certain groups of punch cards in the program. That way, he could replace a small subset of the program in a non-destructive way. That was the beginning of live patching, Kosina said.

The first implementation of live patching for Linux that he is aware of is ksplice, which was announced in 2008. It was originally a research project for a PhD thesis, and the code was released as open-source software. The mechanism used stop_machine() to stop the kernel, then inspected the stack to see if the patch would interfere with any task currently running. If the function being patched was found on the stack, ksplice refused to patch it and retried later.

One of the major contributions that ksplice made was in its automatic patch generation by comparing binary kernels, he said. The original kernel binary and the patched kernel binary were compared. Function inlining and other optimizations make it hard to know what actually will change even from a simple source code change. The ksplice project was acquired by Oracle in 2011 and the source code was closed; it is still used by the Oracle Linux distribution today.

Based on requests from SUSE customers, Kosina had been working on an alternative approach, kGraft, which was released in 2014. Around the same time, Red Hat released kpatch, which it had been working on; both were aimed at live kernel patching, but had different ways to achieve convergence to a fully patched state. Kpatch was similar to ksplice, in that it stopped the system and inspected its state, while kGraft used a lazy migration technique to slowly migrate all processes to use the new code. That lazy migration normally takes just milliseconds to complete.

The kGraft patches are commercially supported by SUSE, which violates the company's "upstream first" principle, he said. Patches are created manually, with the help of the toolchain, which has some advantages over automatic binary comparisons. Even with automatic generation, there is a need to look at the patch generated (and to possibly adjust it); for example, if a structure needs to change, existing versions of the structure need to be modified in place. There is still a need for more tooling to assist with the manual patch generation, he said.

As a side note, Kosina pointed to the checkpoint/restore in user space (CRIU) project as another potential way to do a kind of live patching. For some use cases, it might make sense to checkpoint all of the user-space processes, kexec() to the new kernel, then restore all of user space. That would allow changing to a completely new kernel, but it would not be immediate (or live). It also would reinitialize the hardware, which may not be desirable.

He went into a bit more detail on the lazy migration scheme. After the patch is made, a process that enters or leaves the kernel gets marked as now living in the "new universe", so it will always get the patched function from that point on. Anything that is running in the kernel at the time of the patch will end up running the old version of the code; a trampoline function is used to decide which of the two versions of the function to call. Kernel threads have been marked with "safe points" where the switch can be made, which, surprisingly, turned out not to be that difficult. In addition, long-sleeping processes (e.g. those blocking in get_tty()) are identified and sent a fake signal that simply has the effect of setting the new-universe flag and putting them back to sleep.

A meeting of the minds

There were competing solutions, so a meeting was held at the 2014 LPC in Düsseldorf to discuss the matter. Each solution was presented and the developers came up with a plan to try to merge one unified scheme. It would start with a minimal base on top of Ftrace, with a simple API. Live patches could be registered with a list of functions to be replaced, and it only supported a limited set of patch types that could be applied. That was merged into the mainline in February 2015.

Since then, ideas have been cherry-picked from kpatch and kGraft to be added to the kernel under the CONFIG_LIVEPATCH option. There is now a combined, hybrid consistency model that uses lazy migration by default, but falls back to stack examination for long-sleeping processes and kthreads. Originally, the feature was x86-only, but it has been added to s390 and PowerPC-64, with ARM64 in the works.

The stack examination is a crucial piece of the feature; without reliable stack unwinding, it is impossible to provide consistency. Josh Poimboeuf created the ORC unwinder to provide a reliable way to get a stack trace. In addition, objtool (formerly stacktool) has been added to ensure that assembly-language pieces of the kernel will also produce a valid stack trace.

Earlier efforts at getting reliable stack traces either used frame pointers, which had a severe performance penalty, or DWARF debugging records, which turned out to be unreliable and slow. ORC is effectively a stripped-down version of DWARF that has nothing more than is needed for reliable stack unwinding. The ORC unwinder was merged into 4.14 and will also be used for oops and panic output. So far, it is only available for x86_64, but work is in progress for other architectures; the main work is on objtool, Kosina said, as the ORC unwinder itself is straightforward to port.

Patches are currently hand-written, though tools are coming. The source for a patch is a single C file, which makes it easy to review and to store in Git. It creates new functions and declares them as replacements for existing kernel functions; that gets compiled into a loadable kernel module that has an initialization function to register the replacements and then to enable those changes.
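That structure closely follows the kernel's in-tree example, samples/livepatch/livepatch-sample.c, which replaces cmdline_proc_show(); a sketch of its shape follows. This is kernel-module code, so it only builds against a kernel tree, and the exact registration calls have varied across kernel versions.

```c
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/seq_file.h>
#include <linux/livepatch.h>

/* The replacement function; chosen here only because the in-tree
 * sample patches the same one. */
static int livepatch_cmdline_proc_show(struct seq_file *m, void *v)
{
	seq_printf(m, "this has been live patched\n");
	return 0;
}

/* Map old function names to their replacements. */
static struct klp_func funcs[] = {
	{
		.old_name = "cmdline_proc_show",
		.new_func = livepatch_cmdline_proc_show,
	}, { }
};

/* A NULL object name means the functions live in vmlinux itself. */
static struct klp_object objs[] = {
	{
		.funcs = funcs,
	}, { }
};

static struct klp_patch patch = {
	.mod = THIS_MODULE,
	.objs = objs,
};

static int livepatch_init(void)
{
	/* Register and enable the replacements via the livepatch core. */
	return klp_enable_patch(&patch);
}

static void livepatch_exit(void) { }

module_init(livepatch_init);
module_exit(livepatch_exit);
MODULE_LICENSE("GPL");
MODULE_INFO(livepatch, "Y");
```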

More to do

There are some limitations of the feature, currently. For one, there is no way to deal with data structure changes or changes to the semantics of existing elements. There may be a straightforward solution for simply adding a new field to an existing structure using shadow variables. A "lazy state transformation", analogous to lazy migration, may be another way to deal with changing data structures; new functions that can work with both the old and new structures could be created.

There are still some problems with those approaches, however. Many kernel data structures are protected by exclusive access mechanisms, such as spinlocks and mutexes, which will be problematic to handle. If the locking rules need to change as part of the patch, it will be difficult to avoid deadlocks. There is also an effort to provide ways to fix things up during the patching process using patch callbacks, though that functionality will need to be used with some care.

There are lots of traps in verifying that the patches created will still be within the consistency model; certain things just may not fit. That is currently verified through inspection and reasoning; a guide for patch authors has been started to help with that as well. There is a lot of work being done on tooling to help tame the combinatorial explosion that comes from the different optimizations that GCC will perform. For example, GCC can change the ABI of functions if it knows about all of the callers, so patches to those functions cannot be handled (or the GCC -fipa-ra option must not be used). Many of those kinds of problems could be detected automatically, Kosina said.

Kprobes are another tricky area. It is difficult to switch an existing kprobe to a new function, which may cause some surprises. There is also an inability to patch hand-written assembly code; Ftrace is not able to work with that code. User-space live patching is something that could perhaps be done, but is much more difficult. For one thing, user-space applications are often built with tools other than GCC, which expands the problem. In addition, it is harder to define a checkpoint where the consistency can be assured.

Kosina answered a few questions after the talk. The kernel address-space layout randomization (KASLR) feature has no impact on live patching. Loadable modules, on the other hand, are not easily handled. Patching the on-disk version of the module and causing a reload may be the best approach. Module signing also came up; live patches are modules, so if signed modules are required, the patch itself will need to be signed before it can be loaded.

[I would like to thank LWN's travel sponsor, The Linux Foundation, for assistance in traveling to Los Angeles for LPC.]


The "tracing and BPF" microconference was held on the final day of the 2017 Linux Plumbers Conference; it covered a number of topics relevant to heavy users of kernel and user-space tracing. Read on for a summary of a number of those discussions on topics like BPF introspection, stack traces, kprobes, uprobes, and the Common Trace Format.

Unfortunately, your editor had to leave the session before it reached its end, so this article does not reflect all of the topics discussed there. For those who are interested, this Etherpad instance contains notes taken by participants at the session.

BPF introspection

Martin Lau started the session by noting that BPF programs typically use maps to communicate with the kernel or user space. It can, however, be hard for an interested person to see what is actually in any given map. A look at a BPF program's source will reveal what it is storing in a map, but that source may not always be available. What Lau would like to have is some sort of easy way to pretty-print the contents of a map.

His proposed solution was to attach a bit of metadata to each map describing the entries found therein. It would look like a C structure definition. The proposed name for this description was the "compact C-type format" or CTF, but that name will almost certainly have to change if this work goes forward, since the acronym is already used by the Common Trace Format. The description would be created with a utility program, then passed into the kernel via the bpf() system call that creates the map. The kernel would verify the data and store it, making it available later on request.

This project may not get that far, though; there was a fair amount of doubt about whether it was really needed. If there are users who truly need a separate description of the contents of a map, it should be possible to manage that information in user space. So, while this idea may not be dead, it will clearly face some headwinds if the work goes forward.

Stack traces and kprobes

Alexei Starovoitov stood up to talk about a couple of issues that Facebook has run into; both of them come up as a result of the company's heavy use of tracing to monitor its operations. Tracing is typically running full time, and detailed tracing of specific processes can be enabled or disabled at any time, with the decision often made within the kernel. Much of the kernel's tracing support was designed around more sporadic use, so things do not always work as well as desired when tracing is done around the clock.

One trouble spot is generating stack traces associated with specific tracing events. That involves translating the address where the event happened into a symbolic address. If the address is in kernel space, Starovoitov said, that translation works most of the time, but can occasionally run into trouble if modules are loaded or removed. User-space address translation also usually works, but processes can come and go quickly, and they can also make rapid changes to the layout of their address spaces. That leads to situations where the mappings needed to do the translation no longer exist when the translation is attempted.

He had three possible solutions to discuss. The "ugly" approach involves sending an event to user space whenever tracing begins; a process there would then snapshot the address-space layouts of the tracing targets. The solution is racy, though, and thus not fully reliable.

A better (though "not pretty") alternative would be to add a BPF helper that would walk through the address space in response to events and dump the traceback info into the BPF stack. A new map type would be added to remember the needed layout information for user space when it gets around to generating the symbolic stack trace. This solution would work, but it would be expensive.

The best approach would be to have the kernel simply resolve addresses into file-and-offset pairs and generate tracebacks internally. This translation can be quickly done in the kernel, which has all of the relevant information at hand. Most tracebacks are relatively small — at least, when Java is not involved. Peter Zijlstra added that the speculative page fault patches include a lockless version of find_vma(), which would make the lookups even faster. So it seems that the "best" solution will be the one chosen here.
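The translation being proposed can be modeled in a few lines of user-space code. This sketch assumes a mapping table in the style of /proc/&lt;pid&gt;/maps (the addresses and paths are invented); the kernel-side helper would do the equivalent walk over the process's VMAs:

```python
# Hypothetical mapping table: (start, end, file_offset, path),
# in the style of /proc/<pid>/maps.
MAPS = [
    (0x400000, 0x452000, 0x0, "/usr/bin/app"),
    (0x7f3a00000000, 0x7f3a001c4000, 0x0, "/usr/lib/libc.so.6"),
]

def resolve(addr):
    """Translate a virtual address into a (file, offset) pair, as the
    proposed in-kernel helper would."""
    for start, end, file_off, path in MAPS:
        if start <= addr < end:
            return path, file_off + (addr - start)
    return None  # mapping already gone: the racy case described above

print(resolve(0x401234))
```

Doing this at event time, rather than deferring it to user space, is what closes the race: the (file, offset) pair remains meaningful even after the process has exited or remapped its address space.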

The other problem has to do with kprobes — dynamic tracing points inserted into the kernel at run time. Facebook makes heavy use of kprobes to instrument parts of the kernel that do not have a convenient tracepoint available. The problem, he said, is that kprobes are globally managed objects, and they "kind of suck". Most of the troubles come down to the text-file interface that is used to manage them.

At the top of the list of complaints is the fact that a process can insert kprobes then exit unexpectedly (by crashing, perhaps); those probes will not be automatically cleaned up by the kernel. Multiple processes can place probes at the same point, leading to name clashes and complicating the task of cleaning up after a crash. There are also mundane problems with the use of special characters in probe names.

The solution he proposed was to extend the perf events subsystem (and the perf_event_open() system call in particular) with the ability to create kprobes. Those kprobes would be tied to the file descriptor returned by perf_event_open() and would be easily cleaned up by the kernel when the descriptor is closed. There would be no naming conflicts, and kprobes could have arbitrary names.
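The lifetime rule being proposed — a probe disappears when the descriptor that created it goes away — can be sketched in miniature. This is a pure illustration of the ownership model, not kernel code; the names are invented:

```python
class ProbeRegistry:
    """Toy model of kernel-side kprobe bookkeeping."""
    def __init__(self):
        self.active = set()

    def create(self, location):
        self.active.add(location)
        return ProbeHandle(self, location)

class ProbeHandle:
    """Stands in for the file descriptor returned by perf_event_open();
    closing it removes the probe, even if the owner crashed."""
    def __init__(self, registry, location):
        self.registry, self.location = registry, location
    def close(self):
        self.registry.active.discard(self.location)
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        self.close()  # cleanup happens even on an abnormal exit

registry = ProbeRegistry()
with registry.create("do_sys_open"):
    assert "do_sys_open" in registry.active
assert not registry.active  # probe gone once the handle is closed
```

Tying the probe to a descriptor also removes the naming problem: each probe belongs to exactly one descriptor, so two processes probing the same location never collide.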

There were no conceptual objections to this proposal, but there are concerns that too much functionality has already been crammed into perf_event_open(). So Steve Rostedt suggested that it might be better to create a new system call for this purpose. He would also like a system call for enabling ftrace events. He has not done any of this work, though, out of fear of stepping on toes in the development community.

Another desired feature is "lightweight kprobes" that would have less of a runtime impact. They would avoid disabling interrupts and only save a subset of the registers. Various ideas were tossed around, but none of them exist in code at this point. Expect to see some proposals in the not-too-distant future.

Uprobe performance

Uprobes are dynamic probes placed into a user-space process; as Yonghong Song noted, these probes can create performance problems. A uprobe is implemented as a trap into the kernel but, by the time that the execution of the probe is complete, up to three traps will be required to restore the process state and avoid breaking the application. That can make uprobes too expensive to use.

Various tracing systems have found their own ways of addressing this problem. SystemTap, for example, uses ptrace() to stop the process to be probed, then inserts a jump instruction to a user-space handler, avoiding the kernel entirely. LTTng, instead, relies on tracepoints inserted into the source and a separate thread to communicate trace data to the listener. Neither approach is ideal, so Song wanted to know if anybody had a better idea.

Zijlstra suggested putting no-op instructions into the code where a probe might be placed. The actual probe could then be a simple jump instruction that need not displace any existing instructions and, as a result, needs no traps. This approach does require developers to know where probes might be placed, though.
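A toy model of this suggestion: a nop pad is reserved at the potential probe site when the program is built, and arming the probe overwrites the pad in place, so nothing around it moves. The byte values below are x86 opcodes, but the surrounding "program text" is invented for the example:

```python
NOP, JMP = 0x90, 0xE9  # x86 one-byte nop and near-jump opcodes

# Toy "program text": real instructions with a 5-byte nop pad
# left at the potential probe site (offset 2).
text = bytearray([0x55, 0x53] + [NOP] * 5 + [0x5B, 0x5D, 0xC3])

def arm_probe(text, site, rel32):
    """Overwrite the nop pad with a jump to a handler; no existing
    instruction is displaced, so no trap into the kernel is needed."""
    assert all(b == NOP for b in text[site:site + 5]), "not a nop pad"
    text[site] = JMP
    text[site + 1:site + 5] = rel32.to_bytes(4, "little", signed=True)

arm_probe(text, 2, -0x80)  # hypothetical handler 0x80 bytes back
```

Disarming the probe is the reverse operation: restore the nop bytes, and the original code path is back with no kernel involvement at all.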

An alternative would be to place a jump directly to another user-space address, shorting out the kernel entirely. Users want to run BPF programs from uprobes, but there is no reason why that couldn't be done in user space. Perhaps what is really needed is some sort of kernel-assisted mechanism to allow tracing systems to patch user-space program text. Various ideas were tossed around; which of those will turn up in code remains to be seen.

The other CTF

Matthieu Desnoyers gave a quick overview of the Common Trace Format, a specification for the representation of tracing data. There are a lot of tracers that can produce data in this format, and quite a few tools that can use it, including Trace Compass and LTTng Scope. There is, however, one missing link: there is no CTF output from ftrace. His proposed solution was to make an ftrace input module for the Babeltrace translation utility.

Zijlstra asked what CTF was good for in the end; when he was informed that it was used with graphical tracing tools, he joked that there was "no point in using it." Most of the other people in the room felt that this translator would be useful, though; the only real question is who would write it. Rostedt said that he would like this feature, but he hasn't had the time to work on it. A suggestion that an ftrace input module would be a good Google Summer of Code project was well received; that may well be the approach that is taken to get this software written.

BPF tools

Brendan Gregg gave an energetic talk about tools for tracing with BPF. The BPF Compiler Collection (BCC) now contains about one hundred individual tools. They are becoming more advanced and specialized over time; there is one to measure MySQL pool contention, for example. It seems clear that there is a limit to the number of these tools that really belong in BCC; nobody wants to see 1,000 scripts there. It may be time to look at creating some more specialized repositories for many of these scripts.

He also talked about a desire for a higher-level interface to BPF tracing functionality. The ply project was working in that direction, but it appears to have stalled. More recently, work has gone into bpftrace, but it may well be possible to do something better. This would be, he said, a good opportunity for a "language nerd" to come up with a better way of describing tracing tasks. No nerds of this type raised their hands at the session, though.

[Your editor would like to thank the Linux Foundation, LWN's travel sponsor, for supporting his travel to LPC 2017.]

Comments (3 posted)

The Fedora project's four "foundations" are named "Freedom", "Friends", "Features", and "First". Among other things, they commit the project to being firmly within the free-software camp ("we believe that advancing software and content freedom is a central goal for the Fedora Project, and that we should accomplish that goal through the use of the software and content we promote") and to providing leading-edge software, including current kernels. Given that the kernel project, too, is focused on free software, it is interesting to see a call within the Fedora community to hold back on kernel updates in order to be able to support a proprietary driver.

On September 5, Fedora kernel maintainer Laura Abbott announced that the just-released 4.13 kernel would be built for the (in-development) Fedora 27 release, and that it would eventually find its way into the Fedora 25 and 26 releases as well. That is all in line with how Fedora generally operates; new kernels are pushed out to all supported releases in relatively short order. Running current kernels by default is clearly a feature that many Fedora users find useful.

More recently, though, James Hogarth noted that the NVIDIA proprietary driver did not work with the 4.13 kernel. This kind of breakage is not all that unusual. While the user-space ABI must be preserved, the kernel project defends its right to change internal interfaces at any time. Any problems that out-of-tree code experiences as a result of such changes are deemed to be part of the cost of staying out of the mainline. There is little sympathy for those who have to deal with such issues, and none at all if the out-of-tree code in question is proprietary. Community-oriented projects like Fedora usually take a similar attitude, refusing to slow down for the sake of proprietary code.

In recent years, though, as part of an effort to attract more users, the Fedora distribution has been split into a set of "editions", each of which addresses a specific user community and has a certain amount of control over what is shipped. The "workstation" edition is the version that many Fedora users install. The developers behind that edition are, for obvious reasons, concerned with proper graphics support, and that concern, it would seem, has extended to support for proprietary drivers. Back in July, Christian Schaller wrote in an article about the Fedora 26 release:

We do plan on listing the NVidia driver in GNOME Software soon without having to manually setup the repository, so soon we will have a very smooth experience where the Nvidia driver is just a click in the Software store away for our users.

The Fedora Workstation project, in other words, has decided that the NVIDIA driver, as found in the Negativo repository, is an important component of the workstation edition. That is a distinct change from the Fedora project's previous attitude toward such drivers, and the consequences of this decision became clear in the discussion of the 4.13 kernel.

Hogarth, in his message, asked whether the broken NVIDIA driver was a sufficient reason to hold back the deployment of the 4.13 kernel. The response from Josh Boyer was clear: "Absolutely not". But Michael Catanzaro saw it differently:

If it breaks the Negativo repo, then yes it is. The Nvidia driver from that repo is supported and it needs to not break. We've been super super lenient with allowing kernel updates in the Workstation product, since it seems to be working well for everyone involved, but that would need to be reconsidered if a kernel update intentionally breaks an important subset of our users.

He later added:

But if Negativo users start complaining that their computers don't boot anymore, then we'll definitely need to stop doing major kernel updates ("taking the entire distro hostage" I guess) as the Negativo support is important for product strategy.

Boyer responded: "That's a completely untenable position. There is only one kernel for all the Editions."

Therein lies the core of the conflict. The Fedora project as a whole is dedicated to free software and lacks the resources to maintain different kernels for each of its editions. The workstation working group, instead, would appear to be focused on desktop success, and is willing to compromise somewhat on both free software and current software if it seems necessary. According to Catanzaro, the working group has been given full control over the edition, extending to the kernel, and it will use that control if need be to ensure that an important proprietary driver keeps working. He suggested that the issue should maybe be immediately escalated to the Fedora Engineering Steering Committee (FESCo) since it seemed certain to end up there eventually anyway.

This conflict probably will indeed come to FESCo's attention, but that may not happen right away. The problem with 4.13 was evidently due to an exported symbol being marked "GPL only" by default; that has been changed and the NVIDIA driver works again. As a result, the immediate conflict has seemingly been resolved to everybody's satisfaction. Beyond that, there is a fallback in place that uses the free Nouveau driver should the NVIDIA driver fail, but it seems certain that Nouveau will not prove to be a satisfactory replacement for all users of the proprietary driver. In any case, nobody seems to want to carry this fight forward at this time, so it looks likely to fade away.

But it seems certain that this issue will come back; kernel changes that break proprietary drivers are not all that uncommon. Future breaks will, once again, highlight a conflict between the Fedora project's "freedom" and "first" foundations on one side, and its desire to increase its user base on the other. Sooner or later, somebody will have to make a decision on which goal truly comes first.

Comments (56 posted)