In a 2018 Python Language Summit talk that was initially billed as "Mariatta's Topic of Mystery", Mariatta Wijaya described her reasoning for advocating moving Python away from its current bug tracker to GitHub Issues. She wanted to surprise her co-attendees with the talk topic at least partly because it is somewhat controversial. But it would complete Python's journey to GitHub that started a ways back.

Other Python projects are using GitHub Issues, she said, as are many popular open-source projects. Many people already have a GitHub account. When they run into a Python problem, they can immediately file a GitHub Issue, but might have to create a bugs.python.org (b.p.o) account before filing one there.

There is also the problem that b.p.o is maintained by just two people and its code is still in Mercurial, while most other Python projects have moved to Git (and to GitHub). There are lots of ideas about how to make b.p.o better, but the project is probably not really ready to accept a bunch of contributions if they were to materialize.

Wijaya said that she likes GitHub because of its APIs and the automation they allow. In addition, the email notifications contain useful metadata. You can also use Markdown and emoji in the text for GitHub Issues.

In order to migrate, the Contributor License Agreement (CLA) host needs to be migrated, either to something new or perhaps by using DocuSign, she said. After that, b.p.o could be frozen; it would be read-only, but not shut down. New issues would be created at GitHub.

Of course, there are still lots of bugs stored at b.p.o, with lots of comments and other useful information. The "old and languishing" bugs should be closed or ignored. For those that are still active, a button could be added to b.p.o to copy over the bug information and comments to GitHub.

Several core developers spoke up about the notifications from GitHub, which they saw as problematic. Barry Warsaw said he was concerned about getting "bombed" by mail from new issues; he likes the current model where he can decide what he wants to follow. Steve Dower said that GitHub notifications are "useless"; he loves the "nosy" list in b.p.o. Unless the notification problem can be solved properly, he is against the switch. He has yet to see a GitHub project that does notifications well; if that can be fixed, though, he would be in favor of moving bugs to GitHub Issues.

One of the problems is that b.p.o has its own mailing list where developers can scan for bugs of interest, which is not something that GitHub Issues supports. Wijaya thought it should be possible to create something similar for GitHub. She also pointed to the Octobox project as one possible way to deal with GitHub notifications.

Christian Heimes was concerned about losing the extra metadata that b.p.o provides for categories, priorities, and the like. GitHub Issues do not have anything like that. Wijaya thought that perhaps labels could be used to handle some parts of that; she also suggested that bringing the problem up with GitHub might lead to a solution. Others were not so optimistic; Nick Coghlan said that GitHub has not been responsive to the suggestions and requests from Kubernetes and other large projects, so it is not likely to make changes for Python.

Ned Deily thought there were a number of serious issues that need to be worked through before b.p.o could go away; it is not realistic to close thousands of open bugs, he said. But Guido van Rossum pointed out that Wijaya had said "old and languishing" bugs, not all of the open bugs. Copying the bugs to GitHub with a button is "great in theory", Deily said; the problem is that there are references and links to b.p.o all over the place. How would those get updated?

He believes moving to GitHub Issues is an even bigger job than moving to GitHub for repositories and pull requests. To start with, the project needs to understand how b.p.o is used in its processes today. But another attendee noted that there is a downside to not moving as well since pull requests and issues are well integrated on GitHub and that is not true with the current b.p.o setup.

It is clear that there are enough differing opinions that more study is needed, Van Rossum said. He suggested writing multiple PEPs that described the different paths forward. Those can be discussed on the mailing lists.

Van Lindberg cited some "passive-aggressive moves" by GitHub over the years that make him concerned about centralizing all of Python development there. He wondered if there is a way out for Python and its data if things go poorly at some point. Van Rossum said that as things stand, all of the conversation in the pull requests would be lost when transitioning to another system.

But Carol Willing of the Jupyter project said that b.p.o is the "single biggest hurdle for new contributors". GitHub is not perfect, but you can take data out of it, she said; Jupyter does so for all of its projects. It is worth giving GitHub a shot, because right now Python is excluding contributors by continuing to use b.p.o. There is another downside to sticking with b.p.o, Brett Cannon said; there needs to be some plan for keeping it running, which is not something that anyone is really looking at. To stay with b.p.o, people need to step up to fix the problems and to maintain it. The status quo will not suffice given that b.p.o does not yet run on Python 3.

The Python release cycle has an 18-month cadence; a new major release (e.g. Python 3.7) is made roughly on that schedule. But Łukasz Langa, who is the release manager for Python 3.8 and 3.9, would like to see things move more quickly—perhaps on a yearly cadence. In the first session after lunch at the 2018 Python Language Summit, Langa wanted to discuss that idea.

Before he got started, though, he noticed his name in Larry Hastings's schedule-display application started with the dreaded ▯ rather than Ł. That, he said with a grin, is the story of his life. Hastings dryly suggested that the font he was using predates the addition of the Ł character to Unicode, which elicited a fair bit of laughter.

Langa showed a "boring graph" that indicated the length of time that each release spent in each phase of its development. It showed that the developers spend more than a year creating the features that go into a particular major release as measured from when the branch opens until the first beta release. That means that the project is sitting on changes, which means that people cannot use them, for a long time.

He showed the proposed schedule for 3.8, which has feature development running from January 2018 (first 3.7 beta) until May 2019 (the first 3.8 beta). That puts the release of 3.8 in October 2019. "I think we can do better", Langa said.

So he would like to accelerate the development so that major releases are done yearly. He would also like to see point releases done monthly. Currently, there are release candidates for the point releases, but he thinks that no one really looks at those. Several in the audience disagreed, though. Hastings, who is the release manager for 3.4 and 3.5, said that he has had bugs reported against 3.[45].x release candidates; some caused him to do a second candidate release. Langa said that his goal is to get fixes for "stuff that is broken" into the hands of users faster.

Barry Warsaw asked if Langa was suggesting that Python move to time-based releases. It would be more predictable, which has some advantages. But Ned Deily, who is the release manager for 3.6 and 3.7, said that it makes sense to differentiate between feature releases (3.6) and maintenance releases (3.6.1). For 3.6, maintenance releases have been pretty much time-based; he plans to do the same for 3.7. For the feature releases, he has pushed for a yearly schedule, but Fedora and Ubuntu push back, he said.

Nick Coghlan said that frequent feature releases can cause distributions and other projects to have to support (and test) more releases simultaneously. If a distribution is trying to support several of its own releases, it might currently be testing with three or more Python versions; making the Python releases yearly increases that testing burden. It would also affect projects like NumPy and Django that need to ensure they work on a wide range of current Python releases.

A NumPy developer agreed that it would increase the amount of testing that needed to be done, but said it was "not completely undoable". Warsaw said that it takes a lot of work, for a lot of different projects, to roll out a new version of Python. Christian Heimes suggested that moving to yearly releases could be coupled with creating a long-term support (LTS) version of Python. For example, every other release could be an LTS.

But Brett Cannon wondered what LTS would mean and what guarantees the project would be making about that kind of release. Would there be no new features and no deprecations over the LTS time frame? Langa noted with amusement that some of his colleagues at Facebook claim that Python 2 never broke backward compatibility—that was met with a loud round of laughter. But those developers have been working with Python 2.7 for nine years at this point, so they have not seen a backward compatibility break, he said.

Overall, the idea of shortening the cycle and adding an LTS into the mix did not seem to run into any strong opposition. Langa volunteered to support 3.9 for five years as an LTS. There will presumably need to be some more discussion of the development cycle and what an LTS actually is—something that seems likely to happen on the python-dev mailing list in the not-too-distant future.

"Security is hard" is a tautology, especially in the fast-moving world of container orchestration. We have previously covered various aspects of Linux container security through, for example, the Clear Containers implementation or the broader question of Kubernetes and security, but those are mostly concerned with container isolation; they do not address the question of trusting a container's contents. What is a container running? Who built it and when? Even assuming we have good programmers and solid isolation layers, propagating that good code around a Kubernetes cluster and making strong assertions on the integrity of that supply chain is far from trivial. The 2018 KubeCon + CloudNativeCon Europe event featured some projects that could eventually solve that problem.

Image provenance

A first talk, by Adrian Mouat, provided a good introduction to the broader question of "establishing image provenance and security in Kubernetes" (video, slides [PDF]). Mouat compared software to food you get from the supermarket: "you can actually tell quite a lot about the product; you can tell the ingredients, where it came from, when it was packaged, how long it's good for". He explained that "all livestock in Europe have an animal passport so we can track its movement throughout Europe and beyond". That "requires a lot of work, and time, and money, but we decided that this was worthwhile doing so that we know [our food is] safe to eat. Can we say the same thing about the software running in our data centers?" This is especially a problem in complex systems like Kubernetes; containers have inherent security and licensing concerns, as we have recently discussed.

You should be able to easily tell what is in a container: what software it runs, where it came from, how it was created, and if it has any known security issues, he said. Mouat also expects those properties to be provable and verifiable with strong cryptographic assertions. Kubernetes can make this difficult. Mouat gave a demonstration of how, by default, the orchestration framework will allow different versions of the same container to run in parallel. In his scenario, this is because the default image pull policy (IfNotPresent) might pull a new version on some nodes and not others. This problem arises because of an inconsistency between the way Docker and Kubernetes treat image tags; the former as mutable and the latter as immutable. Mouat said that "the default semantics for pulling images in Kubernetes are confusing and dangerous." The solution here is to deploy only images with tags that refer to a unique version of a container, for example by embedding a Git hash or unique version number in the image tag. Obviously, changing the pull policy to Always (which the AlwaysPullImages admission controller can enforce) will also help in solving the particular issue he demonstrated, but will create more image churn in the cluster.
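The advice to deploy only uniquely tagged images can be checked mechanically, at least in part. What follows is a minimal sketch of such a check, not anything Mouat showed; the function name image_ref_is_pinned() is hypothetical, and the rule it applies (a digest, or any tag other than "latest") is an assumption made for illustration. It cannot tell that a tag like "stable" is itself a moving target.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical lint helper: returns true when an image reference names a
 * specific version, either by digest ("@sha256:...") or by a tag other
 * than "latest". A reference with no tag is treated as "latest". */
bool image_ref_is_pinned(const char *ref)
{
    if (strstr(ref, "@sha256:") != NULL)
        return true;                       /* content-addressed reference */

    /* Only look for a tag after the last '/', so that a registry port
     * such as "registry.example.com:5000/app" is not mistaken for one. */
    const char *last_slash = strrchr(ref, '/');
    const char *name = last_slash ? last_slash + 1 : ref;
    const char *colon = strchr(name, ':');

    if (colon == NULL || colon[1] == '\0')
        return false;                      /* no tag given */
    return strcmp(colon + 1, "latest") != 0;
}
```

A deployment pipeline could run a check of this kind over its manifests and refuse references such as "debian" or "debian:latest" while accepting "debian:stretch-20180426".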

But that's only a small part of the problem; even if Kubernetes actually runs the correct image, how can you tell what is actually in that image? In theory, this should be easy. Docker seems like the perfect tool to create deterministic images that consist exactly of what you asked for: a clean and controlled, isolated environment. Unfortunately, containers are far from reproducible and the problem begins on the very first line of a Dockerfile. Mouat gave the example of a FROM debian line, which can mean different things at different times. It should normally refer to Debian "stable", but that's actually a moving target; Debian makes new stable releases once in a while, and there are regular security updates. So what first looks like a static target is actually moving. Many Dockerfiles will happily fetch random source code and binaries from the network. Mouat encouraged people to at least checksum the downloaded content to prevent basic attacks and problems.

Unfortunately, all this still doesn't get us reproducible builds since container images include file timestamps, build identifiers, and image creation time that will vary between builds, making container images hard to verify through bit-wise comparison or checksums. One solution there is to use alternative build tools like Bazel that allow you to build reproducible images. Mouat also added that there is "tension between reproducibility and keeping stuff up to date" because using hashes in manifests will make updates harder to deploy. By using FROM debian, you automatically get updates when you rebuild that container. Using FROM debian:stretch-20180426 will get you a more reproducible container, but you'll need to change your manifest regularly to follow security updates. Once we know what is in our container, there is at least a standard in the form of the OCI specification that allows attaching annotations to document the contents of containers.

Another problem is making sure containers are up to date, a "weirdly hard" question to answer according to Mouat: "why can't I ask my registry [if] there is new version of [a] tag, but as far as I know, there's no way you can do that." Mouat literally hand-waved at a slide showing various projects designed to scan container images for known vulnerabilities, introducing Aqua, Clair, NeuVector, and Twistlock. Mouat said we need a more "holistic" solution than the current whack-a-mole approach. His company is working on such a product called Trow, but not much information about it was available at the time of writing.

The long tail of the supply chain

Verifying container images is exactly the kind of problem Notary is designed to solve. Notary is a server "that allows anyone to have trust over arbitrary collections of data". In practice, that can be used by the Docker daemon as an additional check before fetching images from the registry. This allows operators to approve images with cryptographic signatures before they get deployed in the cluster.

Notary implements The Update Framework (TUF), a specification covering the nitty-gritty details of signatures, key rotation, and delegation. It keeps signed hashes of container images that can be used for verification; it can be deployed by enabling Docker's "Content Trust" in any Docker daemon, or by configuring a custom admission controller with a web hook in Kubernetes. In another talk (slides [PDF], video) Liam White and Michael Hough covered the basics of Notary's design and how it interacts with Docker. They also introduced Portieris as an admission controller hook that can implement a policy like "allow any image from the LWN Docker registry as long as it's signed by your favorite editor". Policies can be scoped by namespace as well, which can be useful in multi-tenant clusters. The downside of Portieris is that it supports only IBM Cloud Notary servers because the images need to be explicitly mapped between the Notary server and the registry. The IBM team knows only about how to map its own images but the speakers said they were open to contributions there.

A limitation of Notary is that it looks only at the last step of the build chain; in itself, it provides no guarantees on where the image comes from, how the image was built, or what it's made of. In yet another talk (slides [PDF], video), Wendy Dembowski and Lukas Puehringer introduced a possible solution to that problem: two projects that work hand-in-hand to provide end-to-end verification of the complete container supply chain. Puehringer first introduced the in-toto project as a tool to authenticate the integrity of individual build steps: code signing, continuous integration (CI), and deployment. It provides a specification for "open and extensible" metadata that certifies how each step was performed and the resulting artifacts. This could be, at the source step, as simple as a Git commit hash or, at the CI step, a build log and artifact checksums. All steps are "chained" as well, so that you can track which commit triggered the deployment of a specific image. The metadata is cryptographically signed by role keys to provide strong attestations as to the provenance and integrity of each step. The in-toto project is supervised by Justin Cappos, who also works on TUF, so it shares some of its security properties and integrates well with the framework. Each step in the build chain has its own public/private key pair, with support for role delegation and rotation.

In-toto is a generic framework allowing a complete supply chain verification by providing "attestations" that a given artifact was created by the right person using the right source. But it does not necessarily provide the hooks to do those checks in Kubernetes itself. This is where Grafeas comes in, by providing a global API to read and store metadata. That can be package versions, vulnerabilities, license or vulnerability scans, builds, images, deployments, and attestations such as those provided by in-toto. All of those can then be used by the Kubernetes admission controller to establish a policy that regulates image deployments. Dembowski referred to this tutorial by Kelsey Hightower as an example configuration to integrate Grafeas in your cluster. According to Puehringer: "It seems natural to marry the two projects together because Grafeas provides a very well-defined API where you can push metadata into, or query from, and is well integrated in the cloud ecosystem, and in-toto provides all the steps in the chain."

Dembowski said that Grafeas is already in use at Google and it has been found useful to keep track of metadata about containers. Grafeas can keep track of what each container is running, who built it, when (sometimes vulnerable) code was deployed, and make sure developers do not ship containers built on untrusted development machines. This can be useful when a new vulnerability comes out and administrators scramble to figure out if or where affected code is deployed.

Puehringer explained that in-toto's reference implementation is complete and he is working with various Linux distributions to get them to use link metadata to have their package managers perform similar verification.



Conclusion

The question of container trust hardly seems resolved at all; the available solutions are complex and would be difficult to deploy for Kubernetes rookies like me. However, it seems that Kubernetes could make small changes to improve security and auditability, the first of which is probably setting the image pull policy to a more reasonable default. In his talk, Mouat also said it should be easier to make Kubernetes fetch images only from a trusted registry instead of allowing any arbitrary registry by default.

Beyond that, cluster operators wishing to have better control over their deployments should start looking into setting up Notary with an admission controller, maybe Portieris if they can figure out how to make it play with their own Notary servers. Considering the apparent complexity of Grafeas and in-toto, I would assume that those would probably be reserved only to larger "enterprise" deployments but who knows; Kubernetes may be complex enough as it is that people won't mind adding a service or two in there to improve its security. Keep in mind that complexity is an enemy of security, so operators should be careful when deploying solutions unless they have a good grasp of the trade-offs involved.

[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting my travel to the event.]

In April, LWN looked at the new API for zero-copy reception of TCP data that had been merged into the net-next tree for the 4.18 development cycle. After that article was written, a couple of issues came to the fore that required some changes to the API for this feature. Those changes have been made and merged; read on for the details.

This API is intended to make it possible to read data from a TCP connection without the need to copy the data between the kernel and user space. The original version was based entirely on the mmap() system call; once a socket had been marked for zero-copy reception, an mmap() call would create a mapping containing the available data — in some circumstances, anyway. The application would use munmap() to release that data once processing was complete; see the article linked above for the details.

Two problems with this interface came to light after the feature had been merged. One was that this use of mmap() was somewhat strange; an mmap() call does not normally have side effects like consuming data from a socket. The author of this patch (Eric Dumazet) was comfortable with that aspect of the interface, but he had a harder time dealing with the locking problems that came with it. Calling network-layer operations from within mmap() inverts the normal locking order around mmap_sem; there was no easy way to fix that without separating the networking operations from the mmap() code.

So, in the version that (barring more surprises) will be merged for 4.18, the call to mmap() just sets up a range of address space into which data from the network can appear via zero-copy magic. Actually getting some data into that range requires a getsockopt() call with the TCP_ZEROCOPY_RECEIVE operation. This structure is passed into that call:

    struct tcp_zerocopy_receive {
        __u64 address;
        __u32 length;
        __u32 recv_skip_hint;
    };

On entry to getsockopt(), the address field contains the address of the special mapping created with mmap(), and length is the number of bytes of data to be put into that mapping. As before, address must be page-aligned (which will happen naturally since it must also be the address returned from the mmap() call), and length must be a multiple of the page size. On return, length will be set to the number of bytes actually mapped into that range. The data will remain mapped until either the range is unmapped with munmap() or another getsockopt() call replaces the data.

In the old interface, the mmap() call would fail if the available data did not fill full pages or if there was pending urgent data. The new getsockopt() call will fail in the same way in those circumstances, but with a difference: the recv_skip_hint field of the tcp_zerocopy_receive structure will be set to the amount of data the application must consume with recv() before returning to the zero-copy mode. That should make it easier for applications to recover when things don't go as planned.
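For readers who want to see how the pieces fit together, here is a rough sketch of the calling sequence described above. It is an illustration under stated assumptions rather than code from the patch: the helper names (zc_whole_pages(), zc_reserve(), zc_receive()) are hypothetical, the mmap() protection and flags are assumed, and the sketch reads recv_skip_hint after every call because the exact way the kernel reports the hint alongside a failure is not spelled out here. It needs kernel headers that define TCP_ZEROCOPY_RECEIVE.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <linux/tcp.h>   /* struct tcp_zerocopy_receive, TCP_ZEROCOPY_RECEIVE */

/* Round a byte count down to a whole number of pages, since length
 * must be a multiple of the page size. */
size_t zc_whole_pages(size_t len, size_t page_size)
{
    return len - (len % page_size);
}

/* Reserve the address range that received data will appear in.
 * PROT_READ and MAP_SHARED are assumptions made for this sketch. */
void *zc_reserve(int fd, size_t len)
{
    return mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
}

/* Ask the kernel to map up to len bytes of pending data at map.
 * Returns the number of bytes mapped, or -1 with errno set.
 * *skip receives recv_skip_hint, the number of bytes the caller
 * should consume with recv() before trying zero-copy again. */
ssize_t zc_receive(int fd, void *map, size_t len, uint32_t *skip)
{
    struct tcp_zerocopy_receive zc;
    socklen_t optlen = sizeof(zc);
    int ret;

    memset(&zc, 0, sizeof(zc));
    zc.address = (uint64_t)(uintptr_t)map;
    zc.length = (uint32_t)len;

    ret = getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &optlen);
    *skip = zc.recv_skip_hint;   /* stays zero if the kernel wrote nothing back */
    return ret < 0 ? -1 : (ssize_t)zc.length;
}
```

An application would call zc_reserve() once, then loop on zc_receive(), processing whatever was mapped and falling back to recv() for the number of bytes given in the hint whenever it is non-zero.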

The new interface should also perform better, especially in multi-threaded applications, because it is no longer necessary to call mmap() for each new batch of data. The implementation can also avoid making some higher-order allocations that were necessary with the old API.

The end result is an interface that is less surprising, easier to use, and perhaps even faster for some use cases. The whole episode is a clear demonstration of the benefits of wider review of new features, especially those that have user-space API components. In this case, a number of the ideas behind the new implementation came from Andy Lutomirski, who seemingly only became aware of the changes once they were discussed beyond the netdev mailing list. Having many eyes on the code really does make it better in the end.

At the 2018 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM), Steve French led a discussion of various problem areas for network filesystems. Unlike previous sessions (in 2016 and 2017), there was some good news to report because the long-awaited statx() system call was released in Linux 4.11. But there is still plenty of work to be done to better support network filesystems in Linux.

French said that statx() was a great addition that would help multiple filesystems that do not use local block devices for their storage; that includes Samba using SMB 3.1.1 and NFS 4.2. The "birth time" (or creation time) attribute is "super important" for Samba, he said. The next step is to get more of the Windows attribute bits supported in statx() and also in the FS_IOC_[GS]ETFLAGS ioctl() commands.

There are numerous features that Windows provides, but Linux does not, which makes life more difficult for network filesystems. There is no way to do safe caching of file and directory data because leases and delegations are not supported on Linux servers. Also, there still is no support for rich access-control lists (RichACLs) despite lots of work and testing that went on over the years. There has not been much patch activity lately, he said, but Andreas Gruenbacher has posted 28 versions of the patch set over time. The problems that have cropped up are generally due to trying to map user IDs and the like between three separate domains (perhaps server, client, and on-disk, though French did not say).

Broader support for the variants of the fast copy operation is badly needed, he said. The cp --reflink command uses the FICLONERANGE ioctl() command, but not copy_file_range(); in fact, no utilities use copy_file_range(), though it should be the default. It will fall back to other forms of copying, if needed, but can make the copy operation complete thousands of times faster in many cases. French said he got an email from a user asking about a copy operation in the cloud that was taking an hour or so. He suggested using a different command, which was faster, but the customer asked why cp (and other tools such as rsync) did not simply use the faster operation.
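What "using copy_file_range() by default" might look like in a utility can be sketched in a few lines. This is an illustration, not code from cp or rsync; the names copy_fd() and copy_slow() are hypothetical. The sketch invokes the system call through syscall() so that it does not depend on the C library providing a wrapper (glibc added one in version 2.27), and it assumes that any failure of the fast path can be handled by retrying with an ordinary read()/write() loop, which will then report a genuine I/O error itself.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Plain read()/write() loop, used when the fast path is unavailable. */
static int copy_slow(int in_fd, int out_fd, size_t len)
{
    char buf[65536];

    while (len > 0) {
        size_t want = len < sizeof(buf) ? len : sizeof(buf);
        ssize_t n = read(in_fd, buf, want);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            break;                          /* input ended early */

        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));

            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        len -= (size_t)n;
    }
    return 0;
}

/* Copy len bytes from in_fd to out_fd at their current file offsets,
 * letting the kernel (and the filesystem or server) do the work when
 * it can. Returns 0 on success or early end of input, -1 on error. */
int copy_fd(int in_fd, int out_fd, size_t len)
{
    while (len > 0) {
        long n = syscall(SYS_copy_file_range, in_fd, NULL, out_fd, NULL, len, 0);

        if (n < 0)
            return copy_slow(in_fd, out_fd, len);   /* fast path refused */
        if (n == 0)
            break;                                  /* input ended early */
        len -= (size_t)n;
    }
    return 0;
}
```

Because the file offsets only advance when a call succeeds, the slow path can pick up exactly where the fast path stopped.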

Case-insensitive lookups are another problem area; Samba emulates them, but it is expensive to do so. Ric Wheeler noted that XFS supports doing case-insensitive lookups while preserving the case of the filenames on disk; he suggested perhaps doing the same in user space for Samba. French said that might make sense as this problem has been around for a long time.

In general, macOS and Windows are both SMB friendly, but Linux is not, he said. He did, however, describe a demo at a recent storage conference where different clients on a "bad hotel network" were all able to edit the same file using SMB. It was rather eye-opening, especially when compared to ten years ago, to see Linux, macOS, Windows, Android, and iOS all interoperating that way.

Many of the standard utilities are not transferring data in large enough chunks. For example, rsync defaults to 4KB and the largest it will use is 128KB, but NFS is able to handle much larger transfers and SMB is larger still. For the network filesystems, transferring 8MB chunks would make much more sense.
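The effect of the chunk size is easy to put numbers on. The following sketch simply counts how many requests a transfer needs at a given chunk size; the function name requests_needed() is hypothetical, and the model (one request per chunk, no pipelining or batching) is a simplifying assumption.

```c
#include <stdint.h>

/* Number of requests needed to move total bytes when each request
 * carries at most chunk bytes (rounding up for a partial last chunk). */
uint64_t requests_needed(uint64_t total, uint64_t chunk)
{
    return (total + chunk - 1) / chunk;
}
```

Under that model, a 1GB (2^30 byte) file, chosen here only as an example, takes 262,144 requests at 4KB, 8,192 at 128KB, and 128 at 8MB.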

He mentioned a double handful of other features that would make things easier for Samba, NFS, and others, but it was not clear who was working on those features or planning to do so—something that is also true for some of the features mentioned earlier. For example, Dave Chinner said that someone needs to update cp to bring it into the copy_file_range() world. French said that he had sent some patches to the rsync maintainers (who may well be easier to find than cp maintainers), but that there was no response. The upshot was that network filesystems, especially those that are meant to interoperate with Windows, are not getting the attention that they need from the Linux world.

In a filesystem-track session at the 2018 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM), Ronnie Sahlberg talked about some changes he has made to add support for compounding to the SMB/CIFS implementation in Linux. Compounding is a way to combine multiple operations into a single request that can help reduce network round-trips.

He is trying to increase the performance of the kernel's SMB/CIFS client (which he called cifs.ko). He started by describing how compounding works in various versions of SMB. Server Message Block (SMB) 1 (also known as Common Internet File System or CIFS) had no caching in the clients, which meant that attributes needed to be retrieved from the server each time they were needed. If a stat() call was done, it would do many round-trips to the server; if another was done 1ms later, it would all be done again.

SMB 2 added a mechanism to overcome this problem, but both cifs.ko and the Microsoft server just implemented SMB 1 behavior inside SMB 2 packets. Doing a statfs() in SMB using cifs.ko today requires nine round-trips. If the server is in a different city, Sahlberg said, "you are not going to have a good time".

Microsoft has started using these new features to make SMB 2 work better. If Linux did compounding, it could reduce the nine round-trips to three. Adding in the attribute caching that Steve French is working on could drop the cost to one or even zero round-trips. In reality, zero is not achievable since there are some things that should not be cached, but a reduction from nine to one is huge.

NFS versions 3 and 4 also have compounding support but, unlike SMB, there is only one NFS header in a compound operation; SMB has a protocol header per operation. For SMB, each operation is attempted in order, regardless of whether any of the earlier ones have failed. For NFS, if an operation fails, that ends the processing of the compound message.
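The difference in failure handling can be modeled in a few lines. This is a toy model of the semantics as described in the session, not protocol code; the enum and the function attempted_ops() are hypothetical names, and each operation's outcome is supplied by the caller rather than produced by a real server.

```c
#include <stdbool.h>
#include <stddef.h>

enum failure_policy {
    CONTINUE_AFTER_FAILURE,   /* SMB behavior as described above */
    STOP_AT_FIRST_FAILURE     /* NFS behavior as described above */
};

/* will_fail[i] says whether operation i fails when it is attempted.
 * Returns how many of the n operations the server attempts. */
size_t attempted_ops(const bool *will_fail, size_t n, enum failure_policy policy)
{
    for (size_t i = 0; i < n; i++) {
        if (will_fail[i] && policy == STOP_AT_FIRST_FAILURE)
            return i + 1;   /* the failing operation ran; the rest are skipped */
    }
    return n;               /* every operation was attempted */
}
```

For a four-operation compound whose second operation fails, the model reports four attempts under the SMB policy and two under the NFS policy, which hints at why shared code would have to treat error handling as protocol-specific.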

Given those differences, Sahlberg wondered if there was a way to come up with some common code that could be shared between the two. Jeff Layton said that he had tried something like that a long time ago, but it has totally bit-rotted away. He said that it is fairly hard to do code sharing in the compounding support for the two protocols.

If you look at the network traces for NFS, you will see compounded operations, French said. There are barriers to how many different operations can be collected up into a compound message, though, because of the way they are handled by the VFS layer. If a way were found to combine more operations for NFS, perhaps that could be used for SMB as well. The macOS developers have done a lot of work to reduce the round-trips by compounding six or seven operations in one message, French said. But Layton suspected that was being done from user space, since the macOS kernel and VFS are similar to Linux; it should have most of the same constraints.

Sahlberg said that his intent was to try to do better and better with the compounding over time so that we can "at least get to the point where people will not laugh at us". There is also a lot of technical debt "hanging around" with the SMB 1 protocol encapsulated into SMB 2, which he is also fixing. He is targeting the 4.18 merge window for starting to get this work upstream. Layton said that it looks like a nice cleanup in some code that is not all that easy to deal with.


At the 2018 Linux Storage, Filesystem, and Memory-Management Summit, Andiry Xu presented the NOVA filesystem, which he is trying to get into the upstream kernel. Unlike existing kernel filesystems, NOVA exclusively targets non-volatile main memory (NVMM) rather than traditional block devices (disks or SSDs). In fact, it does not use the kernel's block layer at all and instead uses persistent memory mapped directly into the kernel address space.

Xu compared NOVA to versions of the ext4 and XFS filesystems with support for the DAX direct-access mechanism. With those, only the filesystem data bypasses the page cache; the metadata still goes through the page cache. In addition, those filesystems have a much higher latency for append operations. There is also a write amplification effect. All of that makes for high journaling overhead, he said.

Beyond that, there are scalability issues for those filesystems on NVMM. He ran some tests on high-end multicore hardware to compare NOVA and tmpfs to the DAX modes of ext4 and XFS. In his tests, he emulated NVMM with RAM, since it is difficult to actually get NVMM devices at this point. In general, only tmpfs and NOVA scale reasonably—the other filesystems contend for various locks and semaphores—though there is still room for NOVA to improve as only tmpfs scaled reasonably for one of the tests.

Support for huge pages is difficult for DAX filesystems, Xu said. Huge pages require that the physical address is aligned on a huge-page-size boundary and that the memory is physically contiguous, but memory allocated by filesystems does not necessarily conform to those requirements. Dave Chinner said that XFS has an inode option to support huge-page use; another attendee said that ext4 has an analogous feature but it can only support 2MB huge pages, not 1GB.
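The two requirements Xu described can be expressed as a simple check. This is a sketch under the assumption that the extent passed in is already physically contiguous; the function and constant names are hypothetical and do not come from any kernel interface.

```python
HUGE_2MB = 2 * 1024 * 1024
HUGE_1GB = 1024 * 1024 * 1024


def can_map_huge(phys_addr: int, length: int, huge_size: int) -> bool:
    """True if a contiguous extent starts on a huge-page boundary and is
    long enough to cover at least one huge page of the given size."""
    return phys_addr % huge_size == 0 and length >= huge_size


# An extent that starts one 4KB page past a 2MB boundary does not qualify.
print(can_map_huge(4 * HUGE_2MB, HUGE_2MB, HUGE_2MB))
print(can_map_huge(4 * HUGE_2MB + 4096, HUGE_2MB, HUGE_2MB))
```

A filesystem allocator that hands out blocks without regard to these boundaries will rarely pass the check, which is the difficulty Xu was pointing at.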

Xu pointed attendees at the 2016 NOVA paper [PDF] for more information, but gave a quick overview of some of NOVA's features. It is a log-structured filesystem that is designed for NVMM. It has per-inode logging that contains only the metadata changes; the log points off to changes to the actual data. It uses a radix tree for block mappings and is copy on write (CoW) for its file data.

NOVA uses a lightweight journaling scheme that simply records the head and tail pointers for a linked list of log entries in the journal. That leads to fast garbage collection as entries are dropped from the list when they are no longer valid. There is no copying unless invalid entries make up more than half of the log, in which case a new log is created to atomically replace the old one; the metadata log entries are only copied at that point.
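A rough model of that two-level garbage collection might look like the following. It is a sketch, not NOVA's implementation: grouping entries into fixed-size pages and the page capacity of four are assumptions made for illustration, and all names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

ENTRIES_PER_PAGE = 4  # assumption: hypothetical log-page capacity


@dataclass
class Entry:
    data: str
    valid: bool = True


@dataclass
class Log:
    pages: List[List[Entry]] = field(default_factory=list)

    def append(self, data: str) -> Entry:
        # Start a new page when there is none or the last one is full.
        if not self.pages or len(self.pages[-1]) == ENTRIES_PER_PAGE:
            self.pages.append([])
        entry = Entry(data)
        self.pages[-1].append(entry)
        return entry

    def entries(self) -> List[Entry]:
        return [e for page in self.pages for e in page]


def garbage_collect(log: Log) -> Log:
    """Return the log to use from now on."""
    # Fast path: unlink pages holding no valid entries; nothing is copied.
    log.pages = [p for p in log.pages if any(e.valid for e in p)]
    entries = log.entries()
    invalid = sum(not e.valid for e in entries)
    if invalid * 2 <= len(entries):
        return log
    # Slow path: more than half invalid, so copy the live entries into a
    # fresh log; the caller swaps it in to replace the old one.
    new_log = Log()
    for e in entries:
        if e.valid:
            new_log.append(e.data)
    return new_log
```

In this model the common case only unlinks list elements, and copying is reserved for logs that have become mostly garbage.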

He showed some performance graphs comparing the DAX versions of ext4 and XFS with NOVA. Generally, NOVA performs better than either ext4 or XFS on most filebench workloads that he tested. The exception is the "web server" workload where the filesystems all performed roughly the same.

Xu said that a second RFC posting that was based on 4.16-rc4 was done in March. That post received some feedback, so he is working on those items and will be posting a v3 soon. The changes needed include 64-bit timestamps and better huge-page support.

Chinner asked about user-space tools and, in particular, whether there was an fsck for NOVA. That will be needed before the filesystem can be merged as users will need to be able to repair their filesystems. Xu said there has been a focus on performance, so there is no fsck yet. Ted Ts'o noted that NOVA also needs a tool that can verify filesystem images, which will allow more tests in xfstests to be run on it.


Case-insensitive file name lookups are a feature that is fairly frequently raised at the Linux Storage, Filesystem, and Memory-Management Summit (LSFMM). At the 2018 summit, Gabriel Krisman Bertazi proposed a new way to support the feature, though it met with a rather skeptical reception—with one notable exception. Ted Ts'o seemed favorably disposed to the idea, in part because it would potentially be a way to get rid of some longstanding Android ugliness: wrapfs.

Krisman noted that proposals for case-insensitive lookups show up on the Linux kernel mailing list periodically. He has incorporated some of that work into his proposal, including some SGI patches from 2014 that implemented Unicode support and case-folding for XFS. His patches would add support to the VFS layer with filesystem-specific hooks to actually do the insensitive hashes and lookups.

The intent is to be able to bind mount a subtree of a case-sensitive filesystem in a case-insensitive form, but there are changes needed to the directory entry cache to make it all work. David Howells asked if there would potentially be different case-folding functions for each mount point and Krisman indicated there would be.

The Android use case is to support the /sdcard directory (which was FAT-based and thus case-insensitive in the early days) on ext4, which is case-sensitive, of course. Ts'o said that it is a legacy Android feature that is "kind of ugly", but he would like to get rid of the out-of-tree hacks that are currently being used to support it. It is "legacy insanity", Dave Chinner said. The bind-mount approach being proposed is the "least insane way to implement what is, I grant, a somewhat insane thing", Ts'o said; Android apps expect that behavior and breaking user space is something that the Android project avoids.

Al Viro wanted to know how Krisman's code would handle two different directory-cache entries (dentries) that had the "same" name but with different case for some letters. Krisman said there would be a hash function that would hash to the same value for names that differ only by case, so there would be only one dentry per case-insensitive name. The exact name with its case preserved would be stored in the dentry, though.
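The idea of hashing on a folded name while keeping the original spelling can be sketched as follows. Python's str.casefold() stands in for whatever filesystem-specific folding function the patches would use, and the class is a hypothetical illustration rather than anything resembling the kernel's dentry cache.

```python
from typing import Dict, Optional


def fold(name: str) -> str:
    # Assumption: Unicode case folding as a stand-in for the per-mount
    # case-folding function discussed in the session.
    return name.casefold()


class CaseInsensitiveDcache:
    def __init__(self) -> None:
        # Keyed by folded name, so case variants land on the same entry;
        # the value keeps the name exactly as it was first stored.
        self._entries: Dict[str, str] = {}

    def add(self, name: str) -> str:
        """Insert name unless a case variant exists; return the stored name."""
        return self._entries.setdefault(fold(name), name)

    def lookup(self, name: str) -> Optional[str]:
        return self._entries.get(fold(name))


cache = CaseInsensitiveDcache()
cache.add("ReadMe.TXT")
print(cache.lookup("README.txt"))
```

Because the key is the folded form, there is at most one entry per case-insensitive name, while the preserved spelling is still available to report back.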

A problem comes from negative dentries: assertions that a given file name does not exist, which are cached when a lookup fails. He is proposing a "hard negative dentry" that would assert that there is no file that would satisfy a case-insensitive lookup. If the filesystem determines that there should be a hard negative dentry for a given name, it would invalidate all but one of the negative dentries for any other case variants.

But if there is the ability to have both foo and FOO on the disk, which would a lookup return on the case-insensitive side? That and a number of other problems with what Krisman is suggesting were discussed in a rapid-fire round-robin that was difficult to capture. The general upshot was that most in the room were fairly skeptical of the approach.

Ts'o said that for the Android case, which is much of the reason for the "case-sensitive and case-insensitive view of the same subtree" use case, 99% of the time the access is through the case-insensitive view. There are, however, certain ways to access those files in a case-sensitive way and some apps are dependent on this behavior. It would be more sane to have case-sensitivity be a property of the directory, but it is not clear to him how much that would break apps in the wild. He could start that conversation with the Android team, he said.

Chinner said that what XFS has done for case-insensitive file names "may look stupid and slow", but it is much faster than what Samba is doing. However, it requires marking a filesystem as being case-insensitive at mkfs time. A case-insensitive hash is used for the dentries and there can be no names in the filesystem that only differ by case.

Krisman concluded by noting that his code is mostly working at this point, though there are still some problems. In particular, there are difficulties with collisions of positive dentries. Returning the first dentry found is unpredictable, so it will return an exact match if it can, but if there are multiple entries and no exact match, it is not clear what to return.
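The lookup policy Krisman described, and the case it leaves open, can be sketched like this. Raising an exception for the ambiguous case is purely a placeholder, since the session did not settle what should be returned; the names are hypothetical.

```python
from typing import List, Optional


class AmbiguousName(Exception):
    """Several case variants exist and none matches exactly."""


def resolve(requested: str, on_disk: List[str]) -> Optional[str]:
    folded = requested.casefold()
    candidates = [name for name in on_disk if name.casefold() == folded]
    if requested in candidates:
        return requested  # an exact match always wins
    if not candidates:
        return None  # no case variant exists at all
    if len(candidates) == 1:
        return candidates[0]  # the only case variant
    # Placeholder: the open question from the discussion.
    raise AmbiguousName(f"{requested!r} matches {candidates}")


print(resolve("foo", ["foo", "FOO"]))
print(resolve("Foo", ["FOO"]))
```

Only the last branch is contentious; a filesystem that forbids names differing only by case, as XFS does, never reaches it.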


The bcachefs filesystem has been under development for a number of years now; according to lead developer Kent Overstreet, it is time to start talking about getting the code upstream. He came to the 2018 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM) to discuss that in a combined filesystem and storage session. Bcachefs grew out of bcache, which is a block layer cache that was merged into Linux 3.10 in mid-2013.

Five or six years ago, when he was still at Google, creating bcachefs from bcache seemed like it would take a year and 15,000 lines of code, Overstreet said. Now, six years and 50,000 lines of code later, it is a real filesystem. It "turned out really well", he said.

Bcachefs is a general-purpose copy-on-write filesystem with lots of features, including checksumming for blocks, compression, encryption, multiple device support, and, of course, caching. Jens Axboe asked if there was still a clean separation between bcachefs and bcache. Overstreet said that there was; roughly 80% of the code is shared. He has taken out the bcache interfaces in his development tree because there is no need for them as bcachefs can handle all of what bcache can do (and more).

Hannes Reinecke asked about the long-term expectation for bcache and bcachefs; will they coexist or will bcache be removed in favor of bcachefs? Overstreet said that bcache is the prototype for all of the ideas in bcachefs. As part of developing bcachefs, the B-tree code has been fleshed out and polished. Bcache was fast in most cases, but there were some corner cases where it was not; all of that has been fixed in bcachefs.

He said that he would like to get users off of bcache and onto bcachefs. The filesystem has an fsck available to detect and repair problems. A block layer cache does not get the same level of testing that a full filesystem does. By creating and upstreaming bcachefs, he will in some sense be turning it into a real project.

He would prefer not to have both the block layer and filesystem interfaces, since that doesn't really provide anything extra. One major disadvantage of bcache is that writes to the backing device are not copy on write so there are cache coherency issues. Bcache had ways to deal with those problems, but bcachefs simply eliminates them entirely.

Ted Ts'o asked how many users of bcache there are; how much of a problem is it to get rid of bcache? Axboe said that there are users and a community has formed to develop and maintain it. Ts'o said he would be in favor of eliminating bcache, but if there are users of the feature, that really cannot happen. Reinecke said that SUSE supports bcache in its distributions, so it will need to be maintained for a few years.

The on-disk format is different between bcache and bcachefs, similar to how ext2, ext3, and ext4 have evolved, Overstreet said. If he brought back the block device interfaces into bcachefs, then the filesystem could be a drop-in replacement for bcache. Ts'o noted that before ext3 and ext2 could be dropped, ext4 was able to handle the other two; if bcachefs can support the older bcache devices, the same could be done. Axboe said that perhaps an offline conversion tool could be written. Reinecke said that SUSE will still need bcache as a device for some time, but doesn't care if it is provided by the bcache code or by bcachefs.

Amir Goldstein asked about support for reflink, but Overstreet said that bcachefs does not have that yet. It is one of the easier things on the to-do list, however. Other things on that list include erasure coding and then snapshots further out. The reflink feature uses the same design as is in XFS, he said. Dave Chinner said that reflink is a major feature to be missing from a filesystem these days. Overstreet said that he has gotten much of it working, but space accounting is not right yet.

Chinner asked if there would be any on-disk format changes that would require "forklift upgrades". The snapshot feature will require on-disk format changes, Overstreet said, but the other features should not. There has not been a need to change the on-disk format for quite some time, which is part of why he thinks it is ready to go upstream.

Chinner wondered where bcachefs is aimed; what are its target users? Overstreet said that the killer feature is performance. The latency tail is "really really good", he said. In tests, it has gotten 14GB/sec writes without major CPU impact and mixed read/write workloads also do well. On every workload the project can find, bcachefs performs as fast as the hardware should go.

Both small and large users will benefit from the filesystem, he said. He has been using it as his root filesystem for several years, there are users running it on servers, and the company that is funding him to work on bcachefs is using it on NAS boxes with up to 60 spindles. He was asked about shingled magnetic recording (SMR) support; both bcache and bcachefs do file data allocation in terms of 1-2MB buckets, which they write to once. That should be fairly SMR-friendly, but he has not worked out how to deal with metadata on SMR devices yet.

Ts'o wondered about the diversity of devices that had been used in the benchmarking; that would be useful in determining what the strengths and weaknesses of bcachefs are. Has it been tried on older hardware, low-end flash devices, small disks, etc.? From what he has heard, it "starts to sound like snake oil". It has been tested on big RAID devices, high-end NVMe devices, and various other options, but has not been tested on some of the lower-end devices that were asked about, Overstreet said.

The discussion then shifted to whether it was time to get bcachefs into the mainline and how that process would work. Axboe was concerned that the on-disk format may still change to support snapshots and wondered if it made sense to wait until that work was completed. But filesystems can support multiple on-disk formats; Btrfs does it, as Josef Bacik pointed out, and XFS has been doing it for 20 years, Chinner said. Overstreet said that filesystems using the current on-disk format would still be fully supported, just that they would not be able to take snapshots.

Ts'o asked about xfstests and Overstreet said that he uses them all the time; there is a 30-line patch needed to support bcachefs. Once that is added, Ts'o said, he would be happy to add bcachefs to his automated testing regime.

Bacik said that the filesystem and storage developers need to see the code and know that Overstreet will be around to maintain it, at least until there are others who will pick it up. Overstreet had hit all the high points, Bacik said, so he was comfortable with starting the review process.

Overstreet said he would post his patches shortly after LSFMM, but that it is 50,000 lines of code. Chinner said that it needs to be broken up into sane chunks. Bacik agreed, saying that he mostly cared about the interfaces, not the internal B-tree stuff. Chinner said that the user-space APIs and the on-disk format were two places to start; people make "obvious mistakes" in those areas. Next would be the interface to the VFS; generally, reviewers are going to be most interested in things at the periphery. Ts'o suggested that since Overstreet knows the code best, he should highlight places where he is making assumptions about various other parts of the kernel (e.g. the dentry cache, the memory-management subsystem); that would allow reviewers to scrutinize that code.


If pressed, I will admit to thinking that, if NIS was good enough for Charles Babbage, it's good enough for me. I am therefore not a huge fan of LDAP; I feel I can detect in it the heavy hand of the ITU, which seems to wish to apply X.500 to everything. Nevertheless, for secure, distributed, multi-platform identity management it's quite hard to beat. If you decide to run an LDAP server on Unix, one of the major free implementations is slapd, the core engine of the OpenLDAP project. Howard Chu is the chief architect of the project, and spoke at FLOSS 2018 about the upcoming 2.5 release. Any rumors that he might have passed the time while the room filled up by giving a short but nicely rendered fiddle recital are completely true.

OpenLDAP, which will be twenty years old this August, is produced by a core team of three members, and a "random number" of additional contributors. Development has perhaps slowed down a little recently, but they still manage a feature release every 12-18 months, with maintenance releases as needed. OpenLDAP version 2.4, which was first released in 2007, is still the production release; it is theoretically feature-frozen, having had only three releases in the past two years, but the commit rate is still fairly high and fixes, particularly in documentation, continue. Chu noted that despite it being feature-frozen, 2.4.47 will have some minor new features, but this is definitely the last time this will happen and 2.4 is now "absolutely, for-sure, frozen". Probably.

The big milestone coming up is the production release of version 2.5. New features in 2.5, which were the meat of Chu's talk, fall into two camps: those that have been merged for the 2.5 release for some time and have matured, and those which are still scattered through various development branches and have yet to be pulled back into the main tree for release. Mature features coming in 2.5 include multiple thread pool queues, streamlined write waiters, offline slapmodify and slapdelete, and support for LDAP transactions in all the primary database backends.

Currently-merged features for 2.5

In all versions through 2.4, there is a single thread pool that allocates worker threads to every operation. Because this allocation is done through a single queue with a single lock, it gets bogged down pretty heavily under large workloads, and it doesn't scale well to multiple cores. So in 2.5 a configurable number of queues is permitted. In testing, this has produced considerable benefits: a 25% boost in searches per second with the back-mdb backend [PDF] on a four-core test system.

When Oracle invited the OpenLDAP developers to Oracle's Dublin office in July 2017, multiple queues were further tested on an M8 system, with 2048 virtual CPUs and 1.5TB of RAM, running a pre-release Solaris 11.3. Initially, with a test database of a million Distinguished Names (DNs, unique entities within LDAP), and a hundred clients each with ten connections, they managed 180,000 searches per second. After tuning, which included increasing the number of thread queues, they hit 930,000 searches per second, at which point they established the Solaris kernel was the new bottleneck. For multi-core servers, said Chu, multiple thread pool queues is a huge feature; his advice is to have one queue per CPU, and if necessary to further increase the number of queues so that you don't exceed 16 threads per queue.
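Chu's sizing advice reduces to a small calculation, sketched here. The function name and parameters are hypothetical, and the sketch does not reference any actual slapd configuration directive.

```python
import math

MAX_THREADS_PER_QUEUE = 16  # the ceiling suggested in the talk


def suggested_queues(cpus: int, threads: int) -> int:
    """One queue per CPU, raised if needed so that no queue would be
    shared by more than MAX_THREADS_PER_QUEUE worker threads."""
    needed_for_threads = math.ceil(threads / MAX_THREADS_PER_QUEUE)
    return max(cpus, needed_for_threads)


# Four cores with 128 worker threads: the thread limit dominates.
print(suggested_queues(cpus=4, threads=128))
```

On a machine with many CPUs the first rule usually decides the answer; the second only matters when the thread count is large relative to the core count.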

Also prior to 2.5, there was a single, central thread that was responsible for calling select() on all socket descriptors, for both reading and writing on the network. That thread becomes a bottleneck in high-throughput situations, with a lot of synchronization overhead. So, in 2.5, each worker thread is responsible for sending messages to its own clients, leaving the central thread to deal with receiving messages from all clients. This eliminates much of that overhead, and improves throughput in environments which mix busy clients with slow ones.

The best place to keep the configuration for a database-driven tool is inside the database, and OpenLDAP does this under the Common Name cn=config. This can, however, give rise to a chicken-and-egg situation: the database won't start because it needs a configuration change, and you can't change the configuration because the database is down. The new tools slapmodify and slapdelete allow these changes to be made by direct operations on a down database, and complement the extant slapcat and slapadd.

RFC 5805 for transaction support in LDAP has been around for some time now. Transaction support in OpenLDAP is complete for the three primary database backends: BDB, HDB, and MDB. There is also an LDAP backend, which essentially turns OpenLDAP into an LDAP proxy server; the project looked at adding transaction support to that backend as well, but Chu said it exposes a shortcoming in the RFC, and that the specification really needs to support two-phase commit if transactions distributed across multiple servers are to be possible.
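For readers unfamiliar with the technique Chu referred to, here is a textbook two-phase commit in miniature. It is a generic sketch, not something defined by RFC 5805 or implemented in OpenLDAP, and every class and function name is hypothetical.

```python
from typing import List


class Participant:
    """Stand-in for one server taking part in a distributed transaction."""

    def __init__(self, name: str, will_vote_yes: bool = True) -> None:
        self.name = name
        self.will_vote_yes = will_vote_yes
        self.state = "idle"

    def prepare(self) -> bool:
        # Phase one: promise to commit if asked, or refuse.
        if self.will_vote_yes:
            self.state = "prepared"
        return self.will_vote_yes

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: List[Participant]) -> bool:
    # Phase one: collect votes; any refusal aborts the whole transaction.
    for p in participants:
        if not p.prepare():
            for q in participants:
                q.abort()
            return False
    # Phase two: everyone promised, so everyone commits.
    for p in participants:
        p.commit()
    return True


servers = [Participant("a"), Participant("b", will_vote_yes=False)]
print(two_phase_commit(servers), [s.state for s in servers])
```

The point of the prepare step is that no server makes a change permanent until all of them have agreed that they can, which a single-step commit spread across several servers cannot guarantee.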

Still in the pipeline

As Chu said earlier, there are also new features coming in 2.5 that have not had quite so much testing as those listed above. OpenLDAP's synchronization and replication engine, syncrepl, does two or more full transactional writes for each incoming modification, and this puts it at a bit of a disadvantage when keeping up with a busy, bursty replication provider. In 2.5, syncrepl does "lazy commit", where those writes are queued for later injection to the underlying database, which helps it keep up with such a provider. STARTTLS and SASL interactive bind have been supported by libldap for some time, but they've been synchronous functions; as of 2.5, they are supported asynchronously, which Chu expects to make your life a little easier "if you're using libldap in some other external event loop". Elliptic-curve cryptography is now supported by OpenLDAP. There are two new database backends, one called Wired Tiger (the name being a play on the BDB backend engine's long-term maintainer, Sleepycat Software), the other called asyncmeta, which is an asynchronous version of the back-meta backend. The lload LDAP load-balancer code is being merged into the slapd code base. Many new modules are provided, including ones to support TOTP OATH and Authzid, and overlays such as adremap and usn to support Microsoft schemas.

Yet another addition for 2.5 is the autoca overlay, which is an automatic generator for both certificates and certification authorities (CAs). When slapd is started with autoca configured, it will look to see if there is a CA and a server certificate configured; if not, they will be automatically generated with appropriate contents. Then, for any user who comes along and does an LDAP search for their own DN, that is for their own user certificate, the certificate will be generated and supplied to them on the fly. Chu is "pretty happy about that one".

Getting further down into the nuts and bolts, indexing in slapd is currently based on a 32-bit hash; in larger databases Chu is starting to see excessive hash collisions. 2.5 uses 64-bit hashes, which will make false index collisions much less likely. OpenLDAP has previously had an LDIF parsing library, called libldif, which Chu found many distributions didn't package (perhaps, he speculated, because they didn't know it existed). This functionality has now been moved into libldap. Timestamps have been supported, but only with single-second granularity; as of 2.5, timestamps have microsecond resolution, and time spent in queue and time for execution are both logged. For TLS, under 2.4 the filesystem locations of the keys and certificates were stored in cn=config; as of 2.5, the keys and certificates themselves can be stored inside the database.
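The effect of widening the index hash can be estimated with the usual birthday approximation. This sketch assumes an ideal, uniformly distributed hash and a made-up figure of one million distinct indexed values; it says nothing about slapd's actual hash function.

```python
def expected_colliding_pairs(n_values: int, hash_bits: int) -> float:
    """Birthday approximation: n*(n-1)/2 pairs of values, each pair
    colliding with probability 2**-hash_bits under a uniform hash."""
    pairs = n_values * (n_values - 1) / 2
    return pairs / 2**hash_bits


N = 1_000_000  # assumption: number of distinct indexed values

for bits in (32, 64):
    print(f"{bits}-bit hash: {expected_colliding_pairs(N, bits):.3g} expected colliding pairs")
```

Under these assumptions, going from 32 to 64 bits divides the expected number of colliding pairs by 2^32, which is why false index matches become far less likely in large databases.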

Not everything that was being worked on is ready to ship in 2.5, but it's useful to know what might be coming shortly after. Logging continues to be a bottleneck for OpenLDAP; Chu describes the glibc syslog() code as "some of the worst I've ever read". The OpenLDAP test server manages about 200,000 queries per second with no logging, but only 21,000 queries/second with stats logging enabled. Chu tried writing a streamlined syslog(), but this only raised throughput to 26,000 queries/second; it was clear that some form of binary logging was needed. Initially, Chu considered writing the stats logs as BER-encoded LDAP packets, but realized he'd have to write another tool to parse those binary packets and make them human-readable. So his current thinking is to write them out in the pcap format and let people use Wireshark to read the log, which is definitely an interesting approach to logging.
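Those throughput figures imply a rough per-query cost for logging. The sketch below treats the reciprocal of throughput as the time spent per query, which is a simplifying assumption for a multi-threaded server, so the results are only indicative.

```python
def per_query_us(queries_per_second: float) -> float:
    """Average time budget per query, in microseconds."""
    return 1_000_000 / queries_per_second


def logging_overhead_us(qps_without_logging: float, qps_with_logging: float) -> float:
    """Extra time per query attributable to logging, under the assumption
    that throughput is the inverse of per-query time."""
    return per_query_us(qps_with_logging) - per_query_us(qps_without_logging)


print(f"glibc syslog():       {logging_overhead_us(200_000, 21_000):.1f} us per query")
print(f"streamlined syslog(): {logging_overhead_us(200_000, 26_000):.1f} us per query")
```

By this estimate the streamlined version removes only a modest part of the overhead, which fits Chu's conclusion that a different logging format is needed rather than a faster syslog().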

Chu intends that the project will grasp the nettle of two-phase commits, which he accepts will mean extending RFC 5805. There is, he feels, no alternative if it is going to support transactions across back-ldap, back-meta, and the like. As for timescales, Chu suggested in response to an audience question that we should expect a pre-alpha quality 2.5.0 in a couple of months' time. 2.4 is, as he said, over ten years old; 2.5 is badly needed, and it'll be good when it gets here.

[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting my travel to the event.]
