This edition contains the following feature content (including, once again, a lot of LSFMM coverage):

This week's edition also includes these inner pages:

Brief items: Brief news items from throughout the community.

Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

The amount of available data is growing larger these days, to the point that some data sets are far larger than any one company or organization can create and maintain. So companies and others want to share data in ways that are similar to how they share code. Some of those companies are members of the Linux Foundation (LF), which is part of why that organization got involved in the process of creating licenses for this data. LF VP of Strategic Programs Mike Dolan came to the 2018 Legal and Licensing Workshop (LLW) to describe how the Community Data License Agreement (CDLA) came about.

The kinds of data affected are for applications like machine learning, blockchains, AI, and open geolocation, he said. Governments, companies, and other organizations want to share their data and the model they want to follow is the one they have learned from open-source software. So the idea behind the CDLA is to share data openly using what has been learned about licensing from decades of sharing source code.

Version 1.0 of the CDLA was announced by the LF in October 2017. There are two different CDLA agreements that were inspired by the difference between permissive and copyleft licensing for software, he said. The "sharing" agreement is like a copyleft license, such as the GPL, while the "permissive" agreement is more like the MIT or BSD licenses. The difference comes into play if a recipient publishes the data or an enhanced version of it—they must release it if it was licensed under the sharing agreement. If the data is just used internally, there is no requirement to release it.

Data is not the same as source code, Dolan said. Facts are not copyrightable in many jurisdictions; only the creative expression of the data can be protected. But some data providers are trying to lock down access to their data with a variety of often ambiguous usage terms. They try to make things complex by using broad language.

The current practices for those releasing open data vary. Some release data into the public domain and others use open-source software or Creative Commons licenses. There are other open-data licenses, like the Open Database License used by OpenStreetMap, and the Canadian government has its own Open Government License.

None of those have really gained traction, for various reasons. There is a consensus that software licenses are not appropriate for data, and the public-domain and CC0 approaches concern some. For those reasons, LF members and others thought there was a need for new licenses for data. The intent is to prevent license proliferation and to keep valuable data from being released under licenses that do not allow aggregation in ways that will allow the data to be fully utilized over time.

Data can be long-lasting or even perpetual. It may also be hard or impossible to recreate the conditions under which it was gathered. If you have a data set containing oceanic temperatures over time, there is no opportunity to regather the data at some later point. That means the license under which it is released may be critical to how it can be used decades or even centuries from now.

One of the areas that took a lot of time to work out was the copyleft obligations; where do they begin and end? It ended up that any modifications or additions to the data that are published must be released under the CDLA sharing agreement. Any analysis of the data is explicitly excluded from that requirement, though those results may be included voluntarily. That exclusion includes any "computational or transformational activity", such as creating a TensorFlow model from a data set.

Dolan said that these agreements will be used by communities that are training AI and machine-learning systems, public-private infrastructure initiatives (such as for traffic data), and organizations with mutual interests that will be best served by pooling their data resources. The CDLA is already in use by Cisco on a data set of network anomalies that it has released on GitHub. In addition, data.world, which is positioning itself as the GitHub for data, recently added CDLA to its list of licenses.

Dolan concluded by answering a question that he always gets about the relationship between the CDLA and Europe's General Data Protection Regulation (GDPR). The CDLA is for data that can be shared and thus does not come under the GDPR; releasing data under a CDLA license does not magically make data shareable that would otherwise not be because of the GDPR.

An audience member asked how ocean-temperature data could even have a copyright, but Dolan noted that the CDLA is not creating any new rights. There are database rights in Europe and similar rights elsewhere that already create this situation. The CDLA simply provides a clear set of terms so that companies understand their responsibilities if there are any rights embodied in the data.

[I would like to thank the LLW Platinum sponsors, Intel, the Linux Foundation, and Red Hat, for their travel assistance support to Barcelona for the conference.]

Comments (2 posted)

In a plenary session on the second day of the Linux Storage, Filesystem, and Memory-Management Summit (LSFMM), Dave Chinner described his ideas for a virtual block address-space layer. It would allow "space accounting to be shared and managed at various layers in the storage stack". One of the targets for this work is for filesystems on thin-provisioned devices, where the filesystem is larger than the storage devices holding it (and administrators are expected to add storage as needed); in current systems, running out of space causes huge problems for filesystems and users because the filesystem cannot communicate that error in a usable fashion.

His talk is not about block devices, he said; it is about a layer that provides a managed logical-block address (LBA) space. It will allow user space to make fallocate() calls that truly reserve the space requested. Currently, a filesystem will tell a caller that the space was reserved even though the underlying block device may not actually have that space (or won't when user space goes to use it), as in a thin-provisioned scenario. He also said that he would not be talking about his ideas for a snapshottable subvolume for XFS that was the subject of his talk at linux.conf.au 2018.

The new layer will provide the address space, which is a representation of an LBA range. There will be a set of interfaces to manage the backend storage for that range. A filesystem will usually be the client of the interface, while a block device or a separate filesystem can be the supplier of the storage for the layer.

The filesystem does not treat the virtual block address layer any differently than it does a block device from a space-management perspective. The supplier provides allocation and space reservation; it could also provide copy-on-write (CoW) to the upper layer, which would allow for snapshots at that level. In order to read and write data, however, a mapping must be done to turn the virtual LBA into a real LBA and block device for the I/O. It is similar to the export blocks feature of Parallel NFS (pNFS).

When the client wants to do I/O, it first maps the virtual LBA, then does the operation directly to the block device where the data is stored. Jan Kara asked if it is simply a remapping layer for filesystems; Chinner agreed that it was. He was looking at adding this ability to XFS but realized it was more widely applicable. It is similar to what is done for loopback devices, but he has chopped some layers out of that; instead of going through the block device interface, it is going through the remapping layer.

One of the problems with space reservation is that there may be a delay between the write of data and its associated metadata. But it is important that space reserved for that metadata does not disappear when it comes time to write the metadata. The upper layer filesystem needs to be able to ensure that a later writeback does not get an ENOSPC error for something that it believes it can write.

Under this new scheme, the filesystem can ask the supplier for a reservation, which will result in an opaque cookie that the filesystem can use to indicate portions of the reservation. Every object modification has the cookie associated with it; when all of those modifications are done, the reference count on the cookie drops to zero and any extra reservation goes back to the backend.
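
To make the client/supplier split more concrete, here is a minimal sketch of what a supplier-side interface might look like; all structure and function names below are hypothetical illustrations, not taken from Chinner's actual patches.

    #include <stdint.h>

    /* Hypothetical sketch of the supplier side of the interface. */
    struct vba_supplier;                 /* the backing-store provider      */
    struct vba_reservation;              /* opaque reservation cookie       */
    struct block_device;                 /* stand-in for the kernel's type  */

    struct vba_supplier_ops {
        /* Reserve space; the returned cookie is attached to each object
         * modification covered by the reservation. */
        struct vba_reservation *(*reserve)(struct vba_supplier *sup,
                                           uint64_t nr_blocks);

        /* Translate a virtual LBA range into a real device and LBA so
         * the client can issue I/O directly to the underlying device. */
        int (*map)(struct vba_supplier *sup, uint64_t vlba, uint64_t len,
                   struct block_device **bdev, uint64_t *lba);

        /* Called when the cookie's reference count drops to zero; any
         * unused reservation is returned to the backend here. */
        void (*release)(struct vba_supplier *sup,
                        struct vba_reservation *resv);
    };

In a model like this, an out-of-space condition can be reported when reserve() fails, before the filesystem has committed to the I/O, rather than showing up as an unreportable error at writeback time.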

This allows allocation based on the I/O that the filesystem is building. It also can allow for write combining that is optimal for the thin-provisioned devices. Overall, it allows for optimal I/O for the underlying structures, he said.

The client does not know anything about what the underlying backing store actually does. Similarly, the supplier does not know what the client is doing; it is just allocating and mapping. The idea is just to create an abstraction that allows two different layers in the stack to manage blocks in a way that can report errors properly.

When the BIO is formed for a read operation, the filesystem does everything it does now, but it also calls out to the mapping layer to find out which block device to do the read on. It will issue I/O directly to the underlying device, taking a shortcut around all of the layers that a loopback device would use, he said.

A write operation would use a two-phase write that is similar to what XFS uses for direct I/O. It would get the block device and LBA from the mapping layer and it would also attach any needed reservation cookies to the BIO. If the target area is a hole, the system first allocates for those blocks; if it is a CoW supplier, it allocates new blocks and returns the mapping and reservation for those. All of that behavior would be hidden in the lower layers. The BIOs are built and sent down to the block device; when the write completes, the supplier must run its completion routines first, then the client runs its completions to finish its two-phase write.

At no time does the client know anything about what the underlying backing store actually does, Chinner reiterated. Similarly, the supplier does not know what the client is actually doing; it simply handles allocation and mapping. Anything that can provide a 64-bit address space can be used as a supplier; a file could be used, for example.

It is an abstract interface, he said, that is not specific to any filesystem or block device. It could be ext4 as a client with XFS as a supplier, or vice versa if ext4 implements the supplier interface. Ted Ts'o said that he originally thought this was all simply targeting thin provisioning, but having filesystems as the supplier "becomes interesting"; "that's neat". Chinner said his actual motivation was for XFS subvolumes, not thin provisioning.

The problem has turned out to be fairly simple to solve. It is about 1700 lines of code right now and he thinks it will grow to 3000 or so once he gets it cleaned up and ready for posting. He does think it will be interesting for other filesystems. Kara said that it resembled some things that Btrfs does; Chinner agreed that he is not really doing anything new, but is simply "repackaging and reimagining" ideas that are already out there.

One of the reasons he likes this approach is that it reuses the infrastructure already available in the filesystem layer. It can turn snapshots into regular files, for example. Chris Mason said that he uses loopback devices for some containers, but that this mechanism would be better. Chinner acknowledged that and noted that he has some "wild plans" for page-cache sharing that will make it even better. There are lots of use cases, he said, so he will get his act together and post patches soon.

Comments (9 posted)

Chris Mason and Josef Bacik led a brief discussion on the block-I/O controller for control groups (cgroups) in the filesystem track at the 2018 Linux Storage, Filesystem, and Memory-Management Summit. Mostly they were just aiming to get feedback on the approach they have taken. They are trying to address the needs of their employer, Facebook, with regard to the latency of I/O operations.

Mason said that the goal is to strictly control the latency of block I/O operations, but that the filesystems themselves have priority inversions that make that difficult. For Btrfs and XFS, they have patches to tag the I/O requests, which mostly deals with the problem. They have changes for ext4 as well, but those are not quite working yet.

Bacik said the current block-I/O controller does not work for the company's use case. Facebook wants to be able to specify a latency target for a cgroup; if at any point that target is being exceeded, other cgroups should have their I/O throttled. The throttling is done by reducing the amount of I/O that is allowed to be in-flight for the other groups.

Kent Overstreet asked why this isn't done in an I/O scheduler. Bacik said Facebook wants to protect a certain workload, at the expense of any others. Mason noted that the workloads are already put together using cgroups, so there is no reason to create an I/O scheduler. Dave Chinner said that the use case is only concerned with throttling, not scheduling.

There is an issue of throttling filesystem-initiated I/Os for metadata and swap, Bacik said. The code is inserting delays into those in order to throttle that I/O when needed. That code is not yet present in XFS, Mason said; it was simply poked into Btrfs for testing.

Jan Kara said that this code should probably only be used with the no-op scheduler or another simple I/O scheduler. That is what is recommended for XFS anyway, Chinner said. Bacik said that CFQ is not used at Facebook, even on spinning disks, as it will cause latency spikes for no apparent reason. He said that Facebook wants to be able to use writeback throttling together with latency throttling; it is not working correctly at the moment, but was earlier, so he will figure it out and fix it.

There are actually two separate use cases; one is the protected workload, but the other is for shared workloads. In the latter case, both the maximum-latency and maximum-I/O-rate settings will be used. The rate limit will be for setting expectations, Mason said. If you end up giving 100MB per second most of the time, people will come to expect that rate and applications will fail when it occasionally drops from there. But if you always give 20MB per second, the applications will scale their I/O to accommodate that.

Hearing no major objections to the idea, Bacik said he would post patches in a week or two after the summit. Chinner asked his usual question about tests; Bacik said that he had some scripts that he wrote in xfstests style. He will add those tests to the patch set.

Comments (4 posted)

The mount() system call suffers from a number of shortcomings that have led some to consider a different API. At last year's Linux Storage, Filesystem, and Memory-Management Summit (LSFMM), Miklos Szeredi led a session to discuss his ideas for a new filesystem mounting API. Since then, David Howells has been working with Szeredi and VFS maintainer Al Viro on this API; at the 2018 LSFMM, he presented that work.

He began by noting some of the downsides of the current mounting API. For one thing, you can pass a data page to the mount() call, but it is limited to a single page; if too many options are needed, or simply options with too many long parameters, they won't fit. The error messages and information on what went wrong could be better. There are also filesystems that have a bug where an invalid option will fail the mount() call but leave the superblock in an inconsistent state due to earlier options having been applied. Several in the audience were quick to note that both ext4 and XFS had fixed the latter bug along the way, though there may still be filesystems that have that behavior.

There are also problems with the in-kernel parameter passing using the data page, Howells continued. For example, a namespace cannot be turned into a string, which is what would be needed to pass a namespace option. Right now, the namespaces are inherited from the parent filesystem, but automounts should inherit the mount and network namespace from the process that caused the mount.

In the kernel, the first step of mounting is to create a filesystem context, which is represented by a struct fs_context. It is an internal kernel structure that can be initialized and used directly by in-kernel users, but will be created by the filesystem drivers for user-space callers. It contains a bunch of different fields, including operations for parsing and validating options, filesystem type, namespace and security information, and more. More information can be found in a commit in Howells's Git repository for this work.

Viro suggested that it may be useful to think of the filesystem drivers as external servers; they may actually reside in the kernel (or not), but mounting is making a request to these servers. A user-space caller would get a file descriptor by calling fsopen(), then write options and configuration information to that file descriptor, followed by a "create" command that would generate the superblock and root directory. Howells has working code for something like the following:

    fd = fsopen("nfs", 0);
    write(fd, "d server:/dir");
    write(fd, "o tcp");
    write(fd, "o intr");
    write(fd, "x create");

That would create the context for an NFS filesystem on "server" with two options (TCP transport and interruptible operation). The final write is what actually creates the context. The context can be used to mount the filesystem with a call like:

    fsmount(fd, "/mntpnt", flags);

The flags for fsmount() would govern mount options, such as nodev and noexec, and propagation attributes like "private" and "slave". Options for fsopen() might include things like UID/GID translation tables for network filesystems like NFS, which would eliminate the need for something like shiftfs.

There would also be a new system call (fspick()) for doing superblock reconfiguration for remounting, bind mounting, and so on. That is Howells's idea, anyway; Viro has suggested several new calls, such as mount_new(), mount_clone(), and mount_move(), to handle that sort of thing.

Howells was asked about what would happen with the existing mount API. It would remain available, though it would likely eventually be switched to an implementation on top of the new API. It is not likely that it could ever be removed entirely. So far, he has added filesystem-context handling for most of the internal filesystems (e.g. procfs, sysfs, and kernfs) as well as NFS and AFS. But, he warned, bikeshedding is always going to be a problem for patches of this nature.

Comments (5 posted)

In a short filesystem-only discussion at the 2018 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM), Jérôme Glisse wanted to talk about some (more) changes to support GPUs, FPGAs, and RDMA devices. In other talks at LSFMM, he discussed changes to struct page in support of these kinds of devices, but here he was looking to discuss other changes to support mapping a device's memory into multiple processes. It should be noted that I had a hard time following the discussion in this session, so there may be significant gaps in what follows.

A device driver stores the device context in the private_data field of the struct file, Glisse said, which has worked well, but is now becoming a problem. There are new devices that developers want to be able to attach to an mm_struct. In addition, though, those devices are still being used by a legacy API that needs to be preserved.

Glisse said that his first idea was to associate the device context with an mm_struct. That led various developers to try to better understand the use case. Ted Ts'o summarized what came out of that. He suggested that what Glisse wanted was for every mm_struct to have a unique ID associated with it and to store that unique ID in the device context. Any ioctl() that tried to access the device would only work if the unique ID in the mm_struct of the caller is the same as that buried in the device context. Glisse agreed that would do what he was aiming for.

Ts'o noted that you can't use the address of the mm_struct because that would vary between processes. It wouldn't necessarily be implemented as a unique ID, he said, but that is conceptually how it would work. Kent Overstreet suggested a simple global sequence number for mm_struct. Processes that shared an mm_struct would have the same sequence number, so the ioctl() enforcement could be done.
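
A minimal sketch of that scheme, assuming a global sequence number along the lines Overstreet suggested; the names here are illustrative and are not actual kernel interfaces.

    #include <stdint.h>

    /* Illustrative only: a per-mm sequence number recorded in the device
     * context when it is created, then checked on every ioctl(). */
    struct mm_example {
        uint64_t seq;          /* assigned once; shared by all processes
                                  that share this mm */
    };

    struct device_context_example {
        uint64_t owner_seq;    /* copied from the mm at context-creation time */
    };

    /* ioctl()-time check: only callers whose mm matches the one the
     * context was bound to may touch the device. */
    static int access_allowed(const struct device_context_example *ctx,
                              const struct mm_example *mm)
    {
        return mm->seq == ctx->owner_seq;
    }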

After questions about where the changes might lie, Glisse said that he had not written any code yet, but that he did not think changes to the virtual filesystem (VFS) layer would be required. VFS maintainer Al Viro did not really think it mattered where the changes would be made, his question was whether the behavior is needed. Glisse said that it is; it will allow the legacy code to continue running on GPUs, while allowing for more modern uses of the devices going forward.

Comments (14 posted)

At the 2018 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM), Allison Henderson led a session to discuss an XFS feature she has been working on: parent pointers. These would be pointers stored in extended attributes (xattrs) that would allow various tools to reconstruct the path for a file from its inode. In XFS repair scenarios, that path will help with reconstruction as well as provide users with better information about where the problems lie.

The patch set has had a "bumpy history", she said. Lots of issues were identified with earlier versions of the patch set, which have now been addressed. Historically there were problems with locking order, but now the goal is to not have to lock the parent inode when creating the parent pointer. The xattr name will be the parent inode number and generation, along with the directory offset of the file. The xattr value will be the file name.
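
As a rough illustration (the actual on-disk format in the patch set may well differ), the key/value split she described might look something like this:

    #include <stdint.h>

    /* Hypothetical sketch of a parent-pointer xattr: the name encodes
     * (parent inode, generation, directory offset); the value is the
     * file's name within that parent directory. */
    struct parent_ptr_name_example {
        uint64_t parent_ino;   /* inode number of the parent directory */
        uint32_t parent_gen;   /* generation number of that inode      */
        uint32_t dir_offset;   /* offset of the entry in the directory */
    };
    /* xattr value: the file name, e.g. "report.txt" */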

Jeff Layton said he sees how it would be useful to be able to walk the tree back to the root to recreate the path, but wondered about hard links. Dave Chinner said that each link would create its own parent pointer attribute. Al Viro asked about rename operations during the tree walk, but Chinner said there is no real problem there. The walk is done in user space (using ioctl() calls); the idea is that if there is a problem in inode X, sector Y, a reverse lookup can be done to provide the user with the path. If the path changes during the walk, the user-space program should redo it.

Henderson said that one use case is for online scrub and repair. It will allow inodes that have been orphaned to be reconnected correctly. The error reporting will also be better because there will be a path associated with the inode where problems were found. She is trying to gather information on other use cases so that she can ensure that the feature supports them. Chinner said that filesystem repair is an important use; simply dumping a million files into the lost+found directory is useless.

Ted Ts'o asked about the performance of the feature. Chinner said it simply added an xattr operation to each file create, rename, link, and unlink operation. That should be fine if the xattr fits in the inode, Ts'o said, but Chinner noted that xattrs are being used everywhere these days, so xattr operations are generally expected.

Comments (22 posted)

At a plenary session held relatively early during the 2018 Linux Storage, Filesystem, and Memory-Management Summit, the developers discussed a number of problems with the kernel's get_user_pages() interface. During the waning hours of LSFMM, a tired (but dedicated) set of developers convened again in the memory-management track to continue the discussion and try to push it toward a real solution.

Jan Kara and Dan Williams scheduled the session to try to settle on a way to deal with the issues associated with get_user_pages() — in particular, the fact that code that has pinned pages in this way can modify those pages in ways that will surprise other users, such as filesystems. During the first session, Jérôme Glisse had suggested using the MMU notifier mechanism as a way to solve these problems. Rather than pin pages with get_user_pages(), kernel code could leave the pages unpinned and respond to notifications when the status of those pages changes. Kara said he had thought about the idea, and it seemed to make some sense.

His current thinking is to audit all existing get_user_pages() callers and see which of those could be changed to use notifiers instead. Changing away from get_user_pages() would not be mandatory for device drivers (or other code) that couldn't handle that mode of operation. That leaves open the question of how to solve the problems for code that cannot be converted; in the worst case, operations on affected pages might just have to hang until all references to the pages in question are dropped.

The problem there is it's not always easy to know whether there are references to a page created by get_user_pages() or not. With memory accessed via DAX, life is relatively simple, and one can just wait until the reference count drops to one. For page-cache pages it's harder; it would be necessary to compare the reference and map counts for each page of interest. Glisse suggested just forcing get_user_pages() to lock the pages as it pins them. That would "be mean" to get_user_pages() callers, he said, but he thought that was a fine idea.

Hugh Dickins worried that this change would result in reduced performance and the introduction of kernel regressions. But Glisse said that only "legacy" code would be affected, and perhaps that is not a problem. An alternative might be to try to find some bits in struct page that could be used to track these uses, but there is not a lot of space available. Another possibility might be to create a special type of virtual memory area (VMA) for use with get_user_pages() .

One potential problem is interference with get_user_pages_fast(), which attempts to pin the pages without taking locks. Adding those locks to avoid contention with the MMU notifiers would cause it to not be fast anymore. Glisse, after trying a couple of suggestions, conceded that MMU notifiers are not going to work with get_user_pages_fast(); he said that he was "running out of bad ideas". Dave Hansen suggested creating some sort of mechanism based on read-copy-update for get_user_pages_fast() users, but agreed that the idea "sounds terrifying".

In the end, the apparent conclusion was that Kara will start by experimenting with page locks and, maybe, RCU. Patches should be forthcoming.

Comments (2 posted)

A system's page tables are organized into a tree that is as many as five levels deep. In many ways those levels are all similar, but the kernel treats them all as being different, with the result that page-table manipulations include a fair amount of repetitive code. During the memory-management track of the 2018 Linux Storage, Filesystem, and Memory-Management Summit, Kirill Shutemov proposed reworking how page tables are maintained. The idea was popular, but the implementation is likely to be tricky.

On a system with five-level page tables (which few of us have at this point, since Shutemov just added the fifth level), a traversal of the tree starts at the page global directory (PGD). From there, it proceeds to the P4D, the page upper directory (PUD), the page middle directory (PMD), and finally to the PTE level that contains information about individual 4KB pages. If the kernel wants to unmap a range of page-table entries, it may have to make changes at multiple levels. In the code, that means that a call to unmap_page_range() will start in the PGD, then call zap_p4d_range() to do the work at the P4D level. The calls trickle down through zap_pud_range() and zap_pmd_range() before ending up in zap_pte_range(). All of the levels in this traversal (except the final one) look quite similar, but each is coded separately. There is a similar cascade of functions for most common page-table operations. Some clever coding ensures that the unneeded layers are compiled out when the kernel is built for a system with shallower page tables.

Shutemov would like to replace this boilerplate with something a bit more compact. He is proposing representing a pointer into the page tables (at any level) with a structure like:

    struct pt_ptr {
        unsigned long *ptr;
        int lvl;
    };

Using this structure, page-table manipulations would be handled by a single function that would call itself recursively to work down the levels. Recursion is generally frowned upon in the kernel because it can eat up stack space, but in this case it is strictly bounded by the depth of the page tables. That one function would replace the five that exist now, but it would naturally become somewhat more complex.
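
To make the idea concrete, here is a toy sketch of a single recursive walker; the pt_ptr structure follows the proposal above, but the helper functions and the level arithmetic are assumptions made for illustration, not Shutemov's actual code.

    #include <stdint.h>

    struct pt_ptr {            /* as proposed above */
        unsigned long *ptr;    /* table at this level */
        int lvl;               /* 0 = PTE level, up to 4 for the PGD */
    };

    /* Assumed helpers: descend into the next-level table for addr, and
     * clear the leaf (PTE) entry covering addr. */
    extern struct pt_ptr pt_descend(struct pt_ptr tbl, unsigned long addr);
    extern void pt_clear_pte(struct pt_ptr tbl, unsigned long addr);

    static unsigned long entry_span(int lvl)
    {
        return 1UL << (12 + 9 * lvl);   /* 4KB, 2MB, 1GB, ... with 4KB pages */
    }

    /* One function standing in for the zap_p4d_range()/zap_pud_range()/
     * zap_pmd_range()/zap_pte_range() cascade; recursion depth is bounded
     * by the number of page-table levels. */
    static void zap_range(struct pt_ptr tbl, unsigned long start,
                          unsigned long end)
    {
        unsigned long span = entry_span(tbl.lvl);
        unsigned long addr = start;

        while (addr < end) {
            unsigned long entry_end = (addr & ~(span - 1)) + span;
            unsigned long sub_end = entry_end < end ? entry_end : end;

            if (tbl.lvl == 0)
                pt_clear_pte(tbl, addr);
            else
                zap_range(pt_descend(tbl, addr), addr, sub_end);

            addr = entry_end;
        }
    }

Because each recursion step handles one entry's span, supporting a sixth page-table level would only deepen the recursion bound rather than requiring another layer of functions.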

He asked: would this change be worth it? Michal Hocko asked just how many years of work would be required to get this change done. Among other things, it would have to touch every architecture in the system. If it proves impossible to create some sort of a compatibility layer that would let architectures opt into the new scheme, an all-architecture flag day would be required. Given that, Hocko said that he wasn't sure it would be worth the trouble.

Laura Abbott asked what problems would be solved by the new mechanism. One is that it would deal more gracefully with pages of different sizes. Some architectures (POWER, for example) can support multiple page sizes simultaneously; this scheme would make that feature easier to use and manage. Current code has to deal with a number of special cases involving the top-level table; those would mostly go away in the new scheme. And, presumably, the resulting code would be cleaner.

It was also said in jest that this mechanism would simplify the work when processors using six-level page tables show up. The subsequent discussion suggested that this is no joking matter; it seems that such designs are already under consideration. When such hardware does appear, Shutemov said, there will be no time to radically rework page-table manipulations to support it, so there will be no alternative to adding a sixth layer of functions instead. In an effort to avoid that, he is going to try to push this work forward on the x86 architecture and see how it goes.

Comments (5 posted)

Memory hotplugging is one of the least-loved areas of the memory-management subsystem; there are many use cases for it, but nobody has taken ownership of it. A similar situation exists for hardware page poisoning, a somewhat neglected mechanism for dealing with memory errors. At the 2018 Linux Storage, Filesystem, and Memory-Management Summit, Michal Hocko and Mike Kravetz dedicated a pair of brief memory-management track sessions to problems that have been encountered in these subsystems, one of which seems more likely to get the attention it needs than the other.

Memory hotplugging

When memory is added to the system, the kernel must allocate a new array of page structures to keep track of that memory. That array is currently allocated with kmalloc(), Hocko said, which is not the best thing to do. Among other things, if the kernel is running on a NUMA system, the new memory and its page structures are likely to end up on different nodes, which will not be good for performance. This is something that is happening now in real workloads.

One common use case is virtualization environments, where administrators are using hotplugging to move memory between virtual machines. The memory-management developers recommend against doing that — removing memory from machines is tricky, since there can never be a guarantee that everything can be moved out of that space — but people do it anyway. Sometimes they add quite a bit of memory, consuming a lot of local memory just for the page structures. If the receiving virtual machine is already under memory stress, finding a contiguous range of memory for those structures could be difficult.

The better solution, Hocko said, would be to just allocate the new page structures from the memory that has just been added. That memory is free, unfragmented, and obviously local. There were once concerns about "self-hosted" page structures when nonvolatile memory is involved, since those structures are written to frequently, but those concerns have faded over time. Hocko asked whether there were any concerns about implementing this approach.

Jérôme Glisse said that there would need to be an opt-out mechanism. If the new memory is based on a GPU, for example, the CPU cannot access it and thus cannot maintain page structures there. The solution seems to be to just avoid self-hosting page structures on device memory. Vlastimil Babka asked what would happen if only a portion of the new memory was later unplugged — and it was the portion containing the page structures; Hocko said he needs to work on that problem still. Otherwise, though, there were no complaints beyond the fact that this mechanism "takes some beer to understand".

Hocko's other question had to do with the size of the "sections" used to manage hotplug memory. A section contains 128MB by default on systems with a 4KB page size; it is the smallest unit of memory that can be plugged in or out. But, it seems, the "virtualization people" would like to do hotplugging with smaller units of memory.

That could be supported, he said, but it would waste some memory and be relatively tricky to implement, so he isn't sure that it is worth the effort. Dave Hansen said that there should be no problem with telling people that hotplugging smaller pieces of memory will be wasteful. The approach that seemed to win favor is to behave as if an entire section of memory had been plugged in, but mark the missing pages as being reserved and unavailable.

Huge-page poisoning

Hardware poisoning is a mechanism designed to keep a system in a functional state even if some of its memory goes bad. It responds to memory errors by locating and isolating the faulty page — essentially unplugging it from the system, though the hotplug mechanism is not used. Mike Kravetz has discovered that page poisoning doesn't work as well as one might like with huge pages, though.

The kernel will respond to an error in a huge page in the usual way: it will try to substitute a working page and take the malfunctioning one offline. This works fine for PMD-sized pages, he said. PMD stands for the increasingly misnamed "page middle directory", the second-to-last layer in the system's page-table hierarchy. PMD-sized pages are the smallest huge pages, 2MB on x86 systems. If the system is using PUD-size pages (PUD being "page upper directory", since there are only two layers above it on modern systems — 1GB on x86), though, poisoning no longer works. The page-table walker simply doesn't take poisoning into account above the PMD level. So he decided to disable poisoning for huge pages above the PMD size.

Hocko answered that the whole hardware poisoning mechanism seems to be "test driven" without a whole lot of high-level design. He has seen some "nasty changes" to keep the tests happy, such as huge pages being marked migratable so that offlining can work. Technically migrating those pages can be done, but it doesn't actually work. Allocating new storage for a huge page in the face of an error tends to be hard.

Overall, Hocko didn't seem to think much of the feature, but Hansen said that hardware poisoning is only going to grow in importance; as memory sizes increase, hardware problems will happen more frequently. He sees about two errors per month on a 2TB machine he works with. Anshuman Khandual said that migration is the only way to handle hardware errors in huge pages, but Kravetz wondered how the system could realistically migrate a 16GB gigantic page. Hocko wondered whether hardware poisoning was useful at all; Hansen replied that it had indeed been added as a "checkbox feature", but that it was hard to tell for sure because customers never call to say that their system successfully recovered from an error.

Hocko remained unimpressed, calling poisoning a "toy" that doesn't work and is easy to break. He would like to see somebody explain the design of the whole thing; that might at least help keep developers from introducing bugs like the one that motivated this session. Either that, he said, or bite the bullet and admit that it was a toy feature all along. Hansen said that it is reasonable to ask how important the feature is, but that the arrival of nonvolatile memory may change the calculation, since that memory is likely to generate more errors.

As time ran out, Kravetz said that trying to migrate pages might not be worth it; perhaps the system should just note errors and mark the pages bad. Glisse added that it would then be up to the application to cope with memory errors. Kravetz concluded that he is in favor of somebody trying to understand the design, but that he wasn't seeing any hands raised in the room; Hocko said that the recovery mechanism is in danger of being ripped out of the kernel unless a maintainer shows up.

Comments (none posted)

The memory-management subsystem is a central point that handles all of the system's memory, so it is naturally subject to scalability problems as systems grow larger. Two sessions during the memory-management track of the 2018 Linux Storage, Filesystem, and Memory-Management Summit looked at specific contention points: the zone locks and the mmap_sem semaphore.

Zone-lock optimizations

Dave Hansen ran a brief session about optimizations for the zone lock as a follow-on to the LRU lock session held on a previous day. The management of memory at the page level is handled by the zone mechanism, which maintains a set of per-CPU lists of pages that can be used to satisfy allocation requests; the zone lock serializes access to those lists when the need arises. In some workloads, the zone lock can create a significant amount of contention.

When one of the per-CPU lists is exhausted, the memory-management code moves a new batch of pages into it from the global list. The question that Hansen wanted to discuss was the number of pages that are pulled when this happens; that number has been set to 31 for a long time. Hardware has evolved considerably since that value was arrived at; perhaps it's time for a change?

He had the results of some tests run by Aaron Lu on a couple of relatively large x86 machines. Increasing the batch size to 53 yielded an 18% microbenchmark performance increase on a four-socket system; the increase on a two-socket system was about half that. Making the batch size larger yielded progressively smaller improvements, and by about 300 there was no improvement at all. So there does not appear to be a case for a huge increase, but perhaps a modest increase makes sense?

Andrew Morton asked whether there were other workloads that would be hurt by this change. Hansen replied that the worst-case latency might increase, but throughput would as well. Michal Hocko suggested asking the networking developers for their opinion, since they are highly sensitive to latency in memory-allocation functions. Hansen said that the latency could conceivably improve due to the reduced contention on the zone lock.

If the default value is going to be changed, a new value must be picked. There was some talk of trying to tune it automatically, but the tests did not show a whole lot of variation between the systems, so autotuning is probably not worth the effort. Rik van Riel suggested writing an LWN article describing the problem and asking users to test various batch sizes. The session concluded with the idea that the batch size should probably be approximately doubled, but that more tests need to be run before the change goes upstream.

mmap_sem scalability with munmap()

Yang Shi returned to the front of the room to discuss a specific problem with the often-criticized mmap_sem semaphore. When a process calls munmap() to unmap a range of memory, mmap_sem is held for the duration of the entire operation. That can be a long time for big mappings; he measured 18 seconds when undoing a 320GB mapping. Any other threads needing mmap_sem (to handle a page fault, for example) will hang while this is happening.

As a way of dealing with this problem, Shi developed a patch series changing the way munmap() operates. Rather than unmap the entire range at once, it splits the range into a number of pieces, and unmaps each piece separately, dropping and reacquiring mmap_sem after each. That change increased page-fault performance by 6-8%. The improvement is not seen for all workloads, but performance does not appear to degrade for any. That patch did not get much discussion in the room, though; instead, the developers wanted to consider alternative solutions.
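
A rough sketch of that chunked approach, with a hypothetical chunk size and stand-in helpers for the locking and unmapping steps (the actual patch series may differ):

    /* Hypothetical chunk size and stand-in helpers; illustrative only. */
    #define UNMAP_CHUNK_SIZE (1UL << 30)    /* e.g. 1GB per chunk */

    extern void mmap_sem_write_lock(void);
    extern void mmap_sem_write_unlock(void);
    extern void unmap_region(unsigned long start, unsigned long len);

    static void munmap_in_chunks(unsigned long start, unsigned long len)
    {
        while (len) {
            unsigned long chunk = len < UNMAP_CHUNK_SIZE ? len
                                                         : UNMAP_CHUNK_SIZE;

            mmap_sem_write_lock();      /* hold mmap_sem for this piece only */
            unmap_region(start, chunk);
            mmap_sem_write_unlock();    /* let blocked page faults proceed */

            start += chunk;
            len -= chunk;
        }
    }

Dropping the semaphore between chunks bounds how long any other thread can be blocked on a page fault to roughly the cost of unmapping a single chunk.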

Jérôme Glisse suggested that only the top-level page tables (the page upper directory — PUD — in particular) need to be unmapped while holding mmap_sem; after that, unmapping could drop the lock and do the rest of the work without it. That only works for ranges covering a full PUD entry, though. Hugh Dickins, instead, suggested marking the virtual memory areas (VMAs) covering the range as being deleted, then dropping mmap_sem to clean up the page-table mappings contained in those VMAs.

Hocko had a different variation on the two-phase idea. An unmap operation could acquire mmap_sem for read access, then call madvise() with MADV_DONTNEED to release the pages associated with the mapping. mmap_sem could then be upgraded to write access to finish the rest of the cleanup. There are some practical difficulties with this approach, including the fact that there is no way to upgrade mmap_sem in that way, and it would be hard to create one since a thread can hold multiple read locks simultaneously. One solution there might be to just drop the lock entirely and retake it for write access.

One possible trouble point with this approach is that an application accessing the pages in a range that is being unmapped would see a behavior change if this two-phase model were implemented. It was generally agreed, though, that this application, if it exists, is already playing with undefined behavior in a buggy way, so there shouldn't be any real trouble there. Things wound down with Hocko suggesting that this change should be done first, since it is a relatively simple approach to the problem; more complex changes can be done if the easy optimization is not enough.

Comments (none posted)

The DMA zone (ZONE_DMA) is a memory-management holdover from the distant past. Once upon a time, many devices (those on the ISA bus in particular) could only use 24 bits for DMA addresses, and were thus limited to the bottom 16MB of memory. Such devices are hard to find on contemporary computers. Luis Rodriguez scheduled the last memory-management-track session of the 2018 Linux Storage, Filesystem, and Memory-Management Summit to discuss whether the time has come to remove ZONE_DMA altogether.

Rodriguez, however, was late to his own session, so the developers started discussing the topic without him. It's not clear that any modern devices still need the DMA zone, and removing it would free one precious page flag. Any requests with the GFP_DMA flag could be redirected to the zone for the contiguous memory allocator (CMA), which, in turn, could be given the bottom 16MB of memory to manage. Matthew Wilcox asked whether the same thing could be done with ZONE_DMA32, used for devices that can only DMA to 32-bit addresses, but it is not possible to allocate all of the lowest 4GB of memory to that zone, since it would exclude kernel allocations.

It was noted in passing that the POWER architecture uses GFP_DMA extensively. It doesn't actually need it, though; the early POWER developers had misunderstood the flag and thought that it was needed for any memory that would be used for DMA.

At this point, Rodriguez arrived and presented his case. He noted that the existence of ZONE_DMA causes an extra branch to be taken in every memory allocation call. Perhaps removing the zone could improve performance by taking out the need for those branches. It's not clear that performance would improve all that much, but the developers would be happy to be rid of this ancient zone regardless.

The problem is that quite a few drivers are still using ZONE_DMA, even if a number of them don't really need it. The SCSI subsystem was mentioned as having a number of allocations using it. Wilcox suggested that perhaps the drivers still using ZONE_DMA could be moved to the staging tree; they could then either be fixed and moved back or just removed entirely. A look at the list of affected drivers (which can be found in this summary of the session posted by Rodriguez) suggests that just deleting them is probably not an option, though.

More work will be needed to determine the real effects of changing this zone, and of possibly redirecting it into the CMA zone instead. But its removal would simplify the memory-management subsystem, so there is motivation for the developers to do the necessary research.

Comments (8 posted)

The removal of an old joke from the GNU C Library manual might not seem like the sort of topic that would inspire a heated debate. At times, though, a small action can serve as an inadvertent proxy for a more significant question, one which is relevant to both the developers and the users of the project. In this case, that question would be: how is the project governed and who makes decisions about which patches are applied?

Toward the end of April, Raymond Nicholson posted a patch to the glibc manual removing a joke that he didn't think was useful to readers. The joke played on the documentation for abort() to make a statement about US government policy on providing information about abortions. As Nicholson noted: "The joke does not provide any useful information about the abort() function so removing it will not hinder use of glibc". On April 30, Zack Weinberg applied the patch to the glibc repository.

Richard Stallman, who added the joke sometime in the 1990s, asked that it not be removed. The resulting discussion touched on a number of issues. Carlos O'Donell, who has been trying hard to resolve the issue with some degree of consensus, suggested that the joke could hurt people who have had bad experiences associated with abortion. He proposed a couple of possible alternatives, including avoiding jokes entirely or discussing such issues in a different forum. Stallman, however, replied that "a GNU manual, like a course in history, is not meant to be a 'safe space'". He suggested the possibility of adding a trigger warning about functions that create child processes, since childbirth is "far more traumatic than having an abortion".

Whether the joke belongs in the glibc manual is an issue for the glibc developers to decide and wouldn't normally be of much interest beyond the project itself. But in this case, it raises the question of how the developers make this decision. The project's wiki states that the project "uses a consensus-based community-driven development model". In this case, there seems to be a fairly clear consensus among the actual glibc developers that this joke is not appropriate in the project's manual. Weinberg's application of the patch was based on this consensus.

Stallman, however, has made it clear that there are limits to the extent to which glibc is consensus-based; his response was: "My decision is to keep the joke". Weinberg stated his refusal to revert the change; Stallman answered: "I stand by my decision to keep the joke". O'Donell apologized for not contacting Stallman directly about the removal, but also stood by the decision to remove it. He asked:

A large group of developers, serious senior developers, at least 3 project stewards (GNU Developers for the project), are indicating that they do not share your same view on the joke. Please consider their input and work with me to reach a consensus position.

Weinberg defended his application of the patch:

I don't think I did anything wrong procedurally. RMS may be the project leader, but he is not a glibc maintainer. His wishes regarding glibc are perhaps to be given _some_ more weight than those of any other individual, particularly when he is also the author of text under dispute, but we have never, to my knowledge, treated them as mandates.

Stallman was unimpressed, though, and fell back to a pure authority play, saying: "As the head of the GNU Project, I am in charge of what we publish in GNU manuals. I decide the criteria to decide by, too". He later added:

I exercise my authority over Glibc very rarely -- and when I have done so, I have talked with the official maintainers. So rarely that some of you thought that you are entirely autonomous. But that is not the case. On this particular question, I made a decision long ago and stated it where all of you could see it.

O'Donell repeated that a discussion was underway and that the maintainers did not intend to revert the patch. He also asked whether the change violated any GNU policies — a question that went unanswered as of this writing. And he stated clearly that the joke would not return in any form until some sort of consensus was reached.

One could argue that the consensus is already there if one looks at the developers who actually work on glibc; it is difficult to find any of them arguing for the joke's return. The number of people arguing for the joke in general is quite small. That did not stop Alexandre Oliva, who evidently has a high opinion of Stallman's sense of humor, from reverting the change early on May 7 — his first glibc change in 2018. He did not post his change to the mailing list (and only explained it after being asked); his attempt to justify it as a return to consensus did not fly with O'Donell. This discussion, one suspects, is not done.

Each project has its own governance model. The "authoritarian leader" model is quite common in this space, with many projects subject to the will of a (hopefully benevolent) dictator who can decide to accept or reject any change. Sometimes that model works better than others; glibc itself improved its processes and inclusiveness considerably when its single leader was replaced by a more consensus-oriented model. Usually, though, such leaders are at least active developers in the projects they manage; that is not the case for the GNU projects. It can be discouraging for a developer to discover that their changes are subject to a veto from on high by somebody who is not otherwise involved with the project's development. The echoes of this action may thus persist in the glibc community for some time.

Comments (210 posted)