This week's edition includes these inner pages:

Brief items: Brief news items from throughout the community.

Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.


At the 2018 Legal and Licensing Workshop (LLW), which is a yearly gathering of lawyers and technical folks organized by the Free Software Foundation Europe (FSFE), attendees got more details on a recent hearing in a German GPL enforcement case. Marcus von Welser is a lawyer who represented the defendant, Geniatech, in a case that was brought by Patrick McHardy. In the presentation, von Welser was joined by Armijn Hemel, who helped Geniatech in its compliance efforts. The hearing was of interest for a number of reasons, not least because McHardy withdrew his request for an injunction once it became clear that the judge was leaning in favor of the defendants—effectively stopping this case dead in its tracks.

Copyright trolling

Von Welser began by noting some of the reasons that Germany is an attractive venue for copyright trolls. It is inexpensive to file cases and the process is relatively quick. The practice of the courts in Germany is to issue an interim injunction without an oral hearing. In order to avoid that interim injunction, defendants must sign a cease-and-desist declaration that then subjects them to contractual fines. There is also a lack of transparency in the German court system, which allows trolls to operate largely anonymously.

McHardy's tactic is to first notify a company of a GPL violation and ask for a cease-and-desist declaration that would subject the company to a flexible contractual fine. If that is signed, it is followed up with another letter, pointing to a different GPL violation, that asks for a second cease-and-desist declaration with a fixed contractual fine. After that is signed, further violations are alleged, each of which comes with a request for the fines (which can run to five figures per infraction), along with further cease-and-desist declarations carrying even higher fines.

A timeline of the McHardy versus Geniatech case was then presented. The company received a letter from McHardy on July 17, 2017 requesting the first cease-and-desist declaration in order to avoid an interim injunction. Instead of signing it, the company replied a week later questioning some of McHardy's claims, which appeared to anger him. On July 26, he replied by terminating Geniatech's GPLv2 license on McHardy's netfilter contributions in the kernel. Hemel noted that there were two elements here that were quite different from most of McHardy's cases: the defendant did not sign the first warning and McHardy explicitly terminated his license to the code for the defendant.

In his filings, McHardy referred to the Linux kernel in general, not to a specific version or one that was tied to the firmware in Geniatech's product. In its reply to his first letter, Geniatech raised questions about McHardy's contributions to the kernel, noting that his contributions were not integral to the kernel itself. The regional court in Cologne did not understand, however, and granted an interim injunction on August 23 that stopped Geniatech from distributing (or even hyperlinking to) any version of the kernel.

Arguments

That injunction was appealed and there was an oral hearing at the higher regional court of Cologne on March 7, 2018, where McHardy eventually withdrew his application for an injunction. The arguments that Geniatech made at that hearing seemed to resonate with the judge, according to von Welser and Hemel.

McHardy claimed joint authorship in the kernel, which would give him the right to enforce the copyright of the whole, but Geniatech argued that McHardy's contributions were simply adaptations. The kernel was authored by Linus Torvalds and released under the GPLv2, which explicitly grants the right to modify the work. Modifying the work under that grant does not give rise to joint authorship; the modifications are instead adaptations under German law. If the kernel is not a work of joint authorship, McHardy could only enforce the copyrights on his changes, not those of the whole Linux kernel.

Beyond that, McHardy's arguments that his modifications are protected by copyright and were infringed by the defendant were not proven. He alleged that he contributed 50,000 lines of code over the years and provided a CD with the changelogs corresponding to those changes, but did not show that those modifications fulfilled the requirements for copyright protection. In addition, he did not show what parts of the modifications he made were actually used by the defendant.

Another argument that Geniatech used was that McHardy was abusing his rights by approaching multiple companies with the intent of gaining monetarily. This hurts more than just the companies involved; it harms the entire Linux ecosystem by making companies insecure in their use of the operating system, it argued. Hemel said that the judge had heard from colleagues about other companies that had suffered at the hands of McHardy, which helped this argument.

The final element of the abuse-of-rights argument concerned McHardy's revocation of the GPL terms for Geniatech, which was done to raise the pressure on the defendant. In his efforts to enforce the GPL, McHardy is not following community norms, specifically "The Principles of Community-Oriented GPL Enforcement", which prioritize compliance over financial gain. Nor is he entitled to terminate his GPL grant to Geniatech, since such a termination complies with neither German law nor the GPLv2 itself, von Welser said.

The final argument, which von Welser thought was "quite striking", was that the scope of the interim injunction was too broad and not well defined. It enjoined Geniatech from distributing the Linux kernel, not some specific version or versions of the Geniatech firmware. There are more than 100 officially released versions of the kernel that do not include any contributions from McHardy, von Welser said. By ordering Geniatech not to distribute any version of the kernel, the court was covering kernels that were not even part of the dispute with McHardy.

After those arguments had been presented, McHardy and his lawyer discussed it over the telephone (McHardy was not present, according to Harald Welte's account that was linked above). At that point, the request to continue the injunction was withdrawn. McHardy is required to cover the costs of the proceedings, including court fees and lawyer costs for both sides.

Recommendations

The recommendations for companies contacted by McHardy "are quite obvious", von Welser said. They should refuse to sign the cease-and-desist declaration. They should then ensure GPL compliance for their products and prepare a legal defense strategy. That may be easier to do now, since there is more information available from this case. Those that sign the declaration generally do so in the hopes of avoiding a conflict altogether, but it is important to recognize that it likely will not succeed in doing that. Hemel added that if companies do end up signing something with McHardy, they should try to get rid of the confidentiality clause that he demands; shining more light on his tactics will help others defend against them.

In answer to a question from the audience, von Welser said that since there is no judgment, there is nothing to translate to English for others to study. He also noted that it sets no precedent since there was no ruling, but he thinks the arguments make logical sense and will be successful elsewhere.

James Bottomley noted that it was good to have fended off McHardy in this case, but wondered how legitimate (i.e. community-oriented) enforcement would be able to proceed in Germany. Von Welser said that McHardy was trying to make wide claims to make things simpler for himself. If he had shown which parts of the kernel he modified, demonstrated that these modifications are copyrightable (which is not difficult in Germany), and showed how those modifications were used by the defendant, he might have found a different reception by the court. Bottomley said that the goal should be to stop trolls like McHardy, but to still allow legitimate GPL enforcement for the kernel. Von Welser indicated that should still be possible.

[I would like to thank the LLW Platinum sponsors, Intel, the Linux Foundation, and Red Hat, for their travel assistance support to Barcelona for the conference.]


The furor over the Meltdown and Spectre vulnerabilities has calmed a bit — for now, at least — but that does not mean that developers have stopped worrying about them. Spectre variant 1 (the bounds-check bypass vulnerability) has been of particular concern because, while the kernel is thought to contain numerous vulnerable spots, nobody really knows how to find them all. As a result, the defenses that have been developed for variant 1 have only been deployed in a few places. Recently, though, Dan Carpenter has enhanced the smatch tool to enable it to find possibly vulnerable code in the kernel.

Spectre variant 1 is the result of the processor incorrectly predicting the results of a bounds check; it then speculatively executes code with a parameter (such as an array index) that falls outside of its allowed range. This problem can be mitigated by disabling speculative execution in situations where an array index is under the control of a potential attacker. In the kernel, that is done by replacing code like:

value = array[index];

with:

    index = array_index_nospec(index, ARRAY_SIZE);
    value = array[index];

That's the easy part; the hard part is finding the places in the kernel where the array_index_nospec() macro should be used. Until now, the only tool available has been the proprietary Coverity checker, which is not accessible to everybody and produces a fair number of false positives. As a result, there are only a handful of array_index_nospec() calls in current kernels.

Carpenter's addition to smatch changes that situation by providing a free tool that can search for potential Spectre variant-1 vulnerabilities. The algorithm is simple enough in concept:

What the test does is it looks at array accesses where the user controls the offset. It asks "is this a read?" and have we used the array_index_nospec() macro? If the answers are yes, and no respectively then print a warning.

This test returns a list of about 800 places where array_index_nospec() should be used. Carpenter assumes that a large percentage of these are false positives, and has asked for suggestions on how the test could be made more accurate. Instead of offering suggestions, though, both Thomas Gleixner and Peter Zijlstra confirmed that a number of the reports were accurate; Zijlstra said "I fear that many are actually things we want to fix". He followed up with a patch series fixing seven of them — nearly doubling the number of array_index_nospec() calls in the kernel.
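The kind of site the test flags can be sketched as a minimal, self-contained example. The function and array names here are hypothetical, and array_index_nospec() is stubbed with its architectural (non-speculative) effect, clamping out-of-range indexes to zero, so that the fragment compiles in user space:

```c
#include <assert.h>

/* Userspace stub for the kernel's array_index_nospec(); the real macro
   does a branchless clamp that also holds under speculation. */
#define array_index_nospec(idx, sz) ((idx) < (sz) ? (idx) : 0)

static const int table[16] = { 10, 11, 12, 13 };

/* index is assumed to arrive from user space, so it is attacker-controlled */
static int lookup(unsigned int index)
{
    if (index >= 16)
        return -1;
    /* Without the next line, the CPU may speculatively read table[index]
       out of bounds even though the bounds check failed. */
    index = array_index_nospec(index, 16);
    return table[index];
}
```

The checker looks for exactly this shape: a user-controlled index, a read through it, and no intervening array_index_nospec() call.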

Once the low-hanging fruit has been tackled, there probably will be a focus on improving the tests in smatch to filter out the inevitable false positives and to be sure that vulnerable sites are not slipping through. But, now that there is a free tool to do this checking, progress in this area can be expected to accelerate. Perhaps it will be possible to find — and fix — many of the existing Spectre vulnerabilities before the attackers get there.


In the performance-conscious world of high-speed networking, anything that can be done to avoid copying packet data is welcome. The MSG_ZEROCOPY feature added in 4.14 enables zero-copy transmission of data, but does not address the receive side of the equation. It now appears that the 4.18 kernel will include a zero-copy receive mechanism by Eric Dumazet to close that gap, at least for some relatively specialized applications.

Packet reception starts in the kernel with the allocation of a series of buffers to hold packets as they come out of the network interface. As a general rule, the kernel has no idea what will show up next from the interface, so it cannot know in advance who the intended recipient of the next packet to arrive in a given buffer will be. An implementation of zero-copy reception will thus have to map these packet buffers into user-space memory after the packets come in and are associated with an open socket.

That, in turn, implies a set of constraints that must be met. Mapping of memory into a process's address space is done on a per-page granularity; there is no way to map a fraction of a page. So inbound network data must be both page-aligned and page-sized when it ends up in the receive buffer, or it will not be possible to map it into user space. Alignment can be a bit tricky because the packets coming out of the interface start with the protocol headers, not the data the receiving process is interested in. It is the data that must be aligned, not the headers. Achieving this alignment is possible, but it requires cooperation from the network interface; in particular, it is necessary to use a network interface that is capable of splitting the packet headers into a different buffer as the packet comes in.

It is also necessary to ensure that the data arrives in chunks that are a multiple of the system's page size, or partial pages of data will result. That can be done by setting the maximum transfer unit (MTU) size properly on the interface. That, in turn, can require knowledge of exactly what the incoming packets will look like; in a test program posted with the patch set, Dumazet sets the MTU to 61,512. That turns out to be space for fifteen 4096-byte pages of data, plus 40 bytes for the IPv6 header and 32 bytes for the TCP header.
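That arithmetic can be checked directly; the constant names below are mine, not the patch set's:

```c
#include <assert.h>

/* The MTU used in Dumazet's test program: fifteen page-sized chunks of
   payload plus the fixed IPv6 and TCP header sizes (assuming 4096-byte
   pages and the 32-byte TCP header figure used in the test). */
enum {
    PAGE_SIZE_BYTES = 4096,
    DATA_PAGES      = 15,
    IPV6_HEADER     = 40,
    TCP_HEADER      = 32,
    ZC_MTU = DATA_PAGES * PAGE_SIZE_BYTES + IPV6_HEADER + TCP_HEADER,
};
```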

The core of Dumazet's patch set is the implementation of mmap() for TCP sockets. Normally, using mmap() on something other than an ordinary file creates a range of address space that can be used for purposes like communicating with a device. When it is called on a TCP socket, though, the behavior is a bit different. If the conditions are met (the next incoming data chunk is page-sized and page-aligned), the buffer(s) containing that data will be mapped into the calling process's address space, where it can be accessed directly. This operation also has the effect of consuming the incoming data, much as if it had been obtained with recvmsg() instead. That is, needless to say, an unusual side effect from an mmap() call.

When the incoming data has been processed, the process should call munmap() to release the pages and free the buffer for another incoming packet.

If things are not just right (there is only a partial page of data available, for example, or that data is not page-aligned), the mmap() call will fail, returning EINVAL . That will also happen if there is urgent data in the pipeline. In such cases, the call does not consume the data, and the application must fall back to recvmsg() to obtain it.
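A receive loop following that description might look roughly like the sketch below. It is based only on the behavior described above, not on the patch set's actual test program; the chunk size and the use of offset zero are assumptions, and on any failure the code simply falls back to the copying path:

```c
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sketch of the zero-copy receive pattern: try to map the next chunk of
   socket data; if the kernel refuses (partial page, misaligned data,
   urgent data, or no support for the feature), copy it out instead. */
static ssize_t recv_chunk(int fd, char *out, size_t chunk)
{
    void *p = mmap(NULL, chunk, PROT_READ, MAP_SHARED, fd, 0);

    if (p != MAP_FAILED) {
        /* Zero-copy path: the pages holding the data are now mapped
           into our address space and the data counts as consumed. */
        memcpy(out, p, chunk);   /* a real consumer would use it in place */
        munmap(p, chunk);        /* release the buffer for reuse */
        return (ssize_t)chunk;
    }
    /* Copy path. */
    return recv(fd, out, chunk, 0);
}
```

On kernels without the feature (or on non-TCP sockets), the mmap() call always fails, so the function degenerates to an ordinary copying receive.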

It has long been conventional wisdom in the kernel community that zero-copy schemes dependent on memory-mapping tricks will struggle to outperform implementations that simply copy the data. There is quite a bit of overhead involved in setting up and tearing down these mappings. Indeed, Dumazet cautioned in the patch introduction that there may not be a benefit if the application uses a lot of threads, since the contention for the mmap_sem lock will become too expensive. But it is still natural to wonder if performing zero-copy packet reception in this way is worth the trouble.

One way of reducing the cost would be to not call mmap() until several pages of data are available to be consumed, so that they can all be mapped in a single batch. The network stack provides a way, in the form of the SO_RCVLOWAT option, to request that the application not be notified until a certain amount of data is pending. That said, the socket() man page cautions:

The select(2) and poll(2) system calls currently do not respect the SO_RCVLOWAT setting on Linux, and mark a socket readable when even a single byte of data is available. A subsequent read from the socket will block until SO_RCVLOWAT bytes are available.

That shortcoming would make SO_RCVLOWAT useless for this purpose. That problem appears to have been fixed in 2008 for the 2.6.28 kernel, though, so the man page is a bit behind the times. Even so, there were still some shortcomings with SO_RCVLOWAT , including spurious wakeups, that Dumazet fixed as a part of this series.

In some benchmark results posted with the core patch, Dumazet shows some impressive improvements in packet-processing performance — from 129µs/MB to just 45µs/MB. Naturally, this is a tuned test running in a controlled setting, but it shows that there are indeed benefits to be had. Those benefits will be generally available before too long; networking maintainer Dave Miller has applied the series for the 4.18 merge window.


The first article in this series described the interface to the "rhashtable" resizable hash-table abstraction in Linux 4.15. While a knowledge of the interface can result in successful use of rhashtables, it often helps to understand what is going on "under the hood", particularly when those details leak out through the interface, as is occasionally the case with rhashtable. The centerpiece for understanding the implementation is knowing exactly how the table is resized. So this follow-on article will explain that operation; it will also present the configuration parameters that were skimmed over last time and discuss how they affect the implementation.

The chain gang — tables within tables

An rhashtable table ( struct rhashtable ) contains a list of bucket tables ( struct bucket_table ). Normally there is just one bucket table, or two when the table is being resized. In general there can be more, though making this happen in practice would require adding many objects more quickly than the table can be "rehashed", which is the process of moving objects from the early bucket tables to the last.

Each bucket table has an array of struct rhash_head buckets for the various hash chains and a separate array of spinlocks, each of which protects two or more of the chains during modification. The different insert, lookup, and remove operations act on each bucket table in turn, stopping when they find what they are looking for or when they get to the last bucket table. Insertion will only insert into that last table, if anywhere.

There are a few different times when a new bucket table is added to the chain. When an insert function finds that the total number of objects in the table exceeds 70% of the number of buckets in the last bucket table in the chain, it schedules an asynchronous worker thread to look into this situation. Similarly, when object removal finds that, in a table where automatic shrinking is enabled, the number of objects is less than 30% of the number of buckets, the worker is again scheduled. This worker repeats the checks and possibly allocates a bucket table that is either double or half the previous size. If there is more than one bucket table at that point, the worker will proceed to move all objects to the last.

If insertion ever finds that the number of objects exceeds 100% of the bucket count it assumes that the thread is taking too long, so it takes matters into its own hands and allocates a new bucket table directly. Since this must happen without blocking (none of the rhashtable interfaces except rhashtable_init() and rhashtable_destroy() ever block), that allocation can fail. If it doesn't fail, the new object is inserted in the new table and, hopefully, the rehashing worker thread will eventually catch up. Insertion will also trigger an immediate allocation if it finds that the bucket it wanted to insert into already has 16 objects. Since the table is resized to keep the average bucket occupancy below one, this suggests that something has gone wrong with the hash function — possibly an attacker has found a way to generate lots of objects with the same hash.

One of the per-bucket-table values is a random hash seed ( hash_rnd ) that is passed to the hash function to be included in the calculated hash. When a new bucket table is allocated, it will have a new hash_rnd , so objects will be placed in different chains. This should mean that the 17 objects that all had the same hash will now be more evenly distributed. This is a significant difference from the original "relativistic hash table" implementation, which assumed that the hash of a given object remained unchanged; the current implementation deliberately changes it for each new bucket table.

Performing the rehash

Whenever a new bucket table is allocated, whether to change the size of the table or just to change the hash seed, every object in the old table must be moved to the new table. This process seems labor-intensive, but is pleasantly simple. Since it happens rarely and is performed asynchronously with respect to all other accesses, the cost is unlikely to be a concern.

At a simple level, the rehash process takes one object from the old table and inserts it in the new table, then repeats that until no objects are left. This can only be correct if an asynchronous lookup for that object, which proceeds with no locking, will always be certain of finding the object. To provide this correctness, rhashtable always moves the last object from a hash chain, and attaches it to the start of the destination hash chain. It attaches the object to the new chain before removing it from the old chain; see the diagram to the right. In pseudocode, the full rehash sequence is:

    for each bucket in the first bucket table
        take a lock on the current bucket
        while the current bucket is not empty
            find the last object in the bucket
            determine the destination bucket in the last bucket table
            lock the destination bucket
            attach the object to the head of the new bucket
            remove the object from the tail of the old bucket
            unlock the new bucket
        unlock the current bucket
    detach the first bucket table

This insert-before-remove sequence means there is never a time when the object is not in the table, so a lookup will never fail to find the object. It also means that, for a short time, a chain in the old table is connected to a chain in the new table. Having joined chains will not hurt lookup as it already expects to see unwanted objects in the chain, so a few more cannot hurt. It also cannot hurt insert or removal as they take bucket locks that the rehash process also takes, so they will never see the joined chain. It can affect a rhashtable_walk_next() though.
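The move-last, insert-at-head step can be modeled in ordinary user-space C with singly linked lists. This is a sketch of the logic, not the kernel code (which also involves locks and RCU), but it shows both why a lockless reader can always still reach the object and the brief moment when the old chain's tail points into the new chain:

```c
#include <assert.h>
#include <stddef.h>

struct obj {
    int key;
    struct obj *next;
};

/* Move the tail object of the chain at *old to the head of *new,
   attaching it to the new chain before unlinking it from the old one. */
static void move_last(struct obj **old, struct obj **new)
{
    struct obj **pp = old;

    if (!*pp)
        return;
    while ((*pp)->next)          /* find the pointer to the last object */
        pp = &(*pp)->next;

    struct obj *last = *pp;
    last->next = *new;           /* attach to the new chain first; the two
                                    chains are momentarily joined here */
    *new = last;
    *pp = NULL;                  /* ...then unlink from the old chain */
}
```

Between the two final assignments the object is reachable from both chains, which is exactly the "joined chains" state described above; at no point is it reachable from neither.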

Walking the hash table while the table is being rehashed

As mentioned in the previous article, it is not possible to guarantee that a walk of the hash table will see objects at most once if a rehash could happen while the walk is proceeding. One reason is that, as we have just seen, the "current" object might move from one bucket table to another and be attached to the start of a hash chain. If that hash chain was not empty, then the objects already there might have been moved from a hash chain that has already been walked along. Since hash chains are kept short this is quite unlikely, but it is a real possibility.

When rhashtable_walk_next() completes the last hash chain in a bucket table, it checks if there is another in the chain of tables. Normally, there is not and rhashtable_walk_next() will return NULL to indicate completion. If there is, rhashtable_walk_next() will return ‑EAGAIN and prepare to walk through the next table. When a caller receives ‑EAGAIN , it will know that a rehash is underway and, thus, it might have already seen a duplicate. It can choose to give up and return an error or incomplete data, or it can keep going and expect to see every object in the table again, including any that it might have missed the first time as they might have been rehashed before the walker saw them.

If a caller needs to be aware of rehashing, and is using rhashtable_walk_stop() and rhashtable_walk_start() to drop the RCU read lock, it should use rhashtable_walk_start_check() instead of rhashtable_walk_start() . This alternate interface will report ‑EAGAIN if the bucket table that was being walked through was discarded while the lock wasn't held. This has exactly the same implications as rhashtable_walk_next() returning ‑EAGAIN in that subsequent calls will start walking the next bucket table.

The only way to guarantee that a walk sees all objects exactly once would be to disable the rehash process. There is no documented interface for doing this, but in the current implementation it can be done by claiming the hash table mutex:

mutex_lock(&my_objects.mutex);

This call will wait for any current rehash to complete, and will stop future rehashes from starting. If lots of objects are added while the mutex is held, the insert function will eventually allocate a new bucket table and will start inserting into that. If this memory allocation fails, the insertion will fail. This is one of a small number of ways that insertion can fail.

Failure during insertion

It must be expected that insertion into the hash table will fail if the key is already in use, unless of course you use rhashtable_insert_fast() , which doesn't check. There are a few other circumstances that can trigger errors; three error codes (beyond ‑EEXIST ) are possible:

‑ENOMEM is returned when the insert function tries to allocate a bucket table and fails. If the required table is larger than one page, it will try to fall back on allocating a single page and use that to hold pointers to other pages to be allocated later. If this works we might not get ‑ENOMEM immediately but could still get it if a subsequent single-page allocation fails.

‑EBUSY is returned if the insert function finds that there is more than one bucket table and the last one has 16 objects already in the hash chain. This is almost impossibly unlikely, unless the rhashtable is using a custom hash function that is ignoring the random seed or doing something else strange.

‑E2BIG is returned if there are already 2³¹ objects in the hash table. This prevents the element counter from overflowing. ‑E2BIG can also be returned if a maximum table size was specified when the table was created (as described below), and the number of objects has reached twice that maximum size.

The full scoop on configuration parameters

Now that we know how rehashing works, and how it affects walking the table, it remains only to see how it can be configured through values in the rhashtable_params structure. Of the twelve fields in this structure we have already met six ( key_len , key_offset , head_offset , hashfn , obj_cmpfn , and obj_hashfn ) and five are related to configuring the size for rehashing. The final parameter is nulls_base . It is essentially unused and unusable in the current implementation, so it is best ignored.

The size-related parameters are:

u16 nelem_hint; nelem_hint is documented as a hint regarding the number of elements that the table will have, but is never really used that way; in fact I cannot imagine how anyone would be able to make a meaningful guess. Some callers set it to two or three with the intention to "use a really small initial bucket table" — both of these values result in four buckets being allocated. Other callers use values like 192, 384, or 768 which mean "use a bigger initial table, maybe 256, 512, or 1024 buckets". I can understand memory conscious coders asking for a small initial number of buckets if they don't think the table will be used much, though the default of 64 buckets won't often be a burden. It is harder to understand why people would use a resizable hash table, but not just give it complete control of the size.

unsigned int max_size; The max_size is rounded down to a power of two, then used as a maximum size for the bucket table and half the maximum number of objects in the table. This maximum number of objects is a hard limit; insertion will fail when it is reached. The default maximum is 2³¹.

bool automatic_shrinking; If this flag is set, the table will be resized downwards (halving the size) when the residency drops below 30% (objects per bucket). Resizing upward is always enabled, resizing downward must be explicitly requested as in some cases it might be wasteful effort.

unsigned int min_size; min_size is rounded down to a power of two but kept to at least four (the default). This puts a lower bound on resizing the bucket table. When automatic_shrinking is not set, this value is ignored; it does not, for example, impose a minimum on the initial allocation.

u8 locks_mul; As mentioned, each bucket table has a set of locks protecting the buckets: two or more buckets per lock. On machines with few CPUs, even having only half as many locks as buckets can be wasteful when there are a great many buckets. To avoid wasting memory on useless locks, rhashtable limits the number of locks using the following formula:

    num_locks = min(num_chains/2, locks_mul * min(num_possible_cpus, 64))

If locks_mul is not given, 32 is used, so by default my eight-CPU desktop would get at most 256 locks for each bucket table. The only examples of setting this field in the current kernel involve setting it to one, which is likely to be plenty for all but the busiest tables.

That completes the set of parameters that affect resizing. My recommendation would be to not set any of them unless you feel that memory conservation is more important than raw speed, in which case set automatic_shrinking to true and locks_mul to 1 .
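As an illustration, a table keyed by a u32 might be configured as below, following that recommendation. The kernel types are stubbed here so the fragment stands alone (in real code they come from linux/rhashtable.h), and my_obj and the parameter choices are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-ins for the kernel definitions, for illustration only. */
typedef uint32_t u32;
typedef uint16_t u16;
typedef uint8_t u8;
struct rhash_head { struct rhash_head *next; };
struct rhashtable_params {
    u16 nelem_hint;
    u16 key_len;
    u16 key_offset;
    u16 head_offset;
    unsigned int max_size;
    unsigned int min_size;
    bool automatic_shrinking;
    u8 locks_mul;
};

/* A hypothetical object keyed by a u32, with the rhash_head embedded. */
struct my_obj {
    u32 key;
    struct rhash_head node;
};

/* Let the table manage its own size, but enable shrinking and keep the
   lock array small, per the recommendation above. */
static const struct rhashtable_params my_params = {
    .key_len     = sizeof(u32),
    .key_offset  = offsetof(struct my_obj, key),
    .head_offset = offsetof(struct my_obj, node),
    .automatic_shrinking = true,
    .locks_mul   = 1,
};
```

All of the size-related fields are simply left at zero, which gives the defaults discussed above.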

A word of warning

It is nice to have rhashtables properly documented, but I wouldn't expect that situation to last. While working on this article, I have been discussing possible changes with the maintainers. None of these are major and should not invalidate this documentation, but they are unlikely to be the last changes ever suggested. Rhashtables are quite different from what they were four years ago, and may be different again in four more years — the Linux kernel is ever a dynamic place. So please don't trust this description implicitly — check it against the code, just in case.


Each kernel development cycle includes a vast number of changes that are not intended to change visible behavior and which, as a result, go unnoticed by most users and developers. One such change in 4.17 is a rewiring of how system-call implementations are invoked within the kernel. The change is interesting, though, and provides an opportunity to look at the macro magic that handles system-call definitions.

In user space, system calls look like ordinary functions, albeit with the strange convention of returning error codes in the global errno variable. Those functions are indeed just that, in that they are generally wrapper functions defined in the C library (or some other language-specific implementation). Those wrappers are responsible for organizing the system-call arguments properly (placing them into a set of registers defined by the architecture ABI) and triggering a trap into the kernel, where the real work gets done.

Imagine that the application is calling read() . In the 4.16 kernel, the implementation of this system call is:

SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
    struct fd f = fdget_pos(fd);
    ssize_t ret = -EBADF;

    if (f.file) {
        loff_t pos = file_pos_read(f.file);
        ret = vfs_read(f.file, buf, count, &pos);
        if (ret >= 0)
            file_pos_write(f.file, pos);
        fdput_pos(f);
    }
    return ret;
}

The SYSCALL_DEFINE3() macro at the beginning declares an implementation for the read() system call with three arguments. This is clearly not a normal function definition; one of the first things that jumps out is that the actual declaration of those arguments is done a little strangely. The names of the arguments and their types are separated in this way so that SYSCALL_DEFINE3() can perform whatever impedance-matching is required to map those arguments from the system-call ABI into what a normal C function expects. On the x86 architecture, after quite a bit of macro substitution, the SYSCALL_DEFINE3() line turns into something like:

    asmlinkage long sys_read(int fd, char __user *buf, size_t count)
        __attribute__((alias(__stringify(SyS_read))));

    asmlinkage long SyS_read(long fd, long buf, long count)
    {
        long ret = SYSC_read((int) fd, (char __user *) buf, (size_t) count);
        return ret;
    }

    static inline long SYSC_read(int fd, char __user *buf, size_t count)
    /* SYSCALL_DEFINE3() expansion ends here, function body follows */

As can be seen, two different functions (and one alias) are declared here. SyS_read() is declared with the asmlinkage attribute (so that it expects all arguments on the stack rather than in registers), and with all arguments declared as having the long type, which is how they are passed from user space. This function casts the arguments into the expected types, then calls SYSC_read(), which is the name of the function that ends up containing the actual code implementing the system call. Note that it is declared static inline, so it will be substituted directly into SyS_read().

A pointer to the SyS_read() version is placed in the appropriate location in the sys_call_table array. Then, when the kernel handles a trap for an incoming system call, it comes down to this bit of code in do_syscall_64() (again, on x86):

    if (likely((nr & __SYSCALL_MASK) < NR_syscalls)) {
        nr = array_index_nospec(nr & __SYSCALL_MASK, NR_syscalls);
        regs->ax = sys_call_table[nr](
            regs->di, regs->si, regs->dx,
            regs->r10, regs->r8, regs->r9);
    }

(The use of array_index_nospec() prevents the processor from executing this call speculatively, thus blocking any attempts to create a speculative call to an address outside of sys_call_table.) Since all of the entries in sys_call_table are declared asmlinkage, the arguments will be copied from registers onto the stack before the call is made. Note that six registers are pushed onto the stack regardless of the number of arguments that the system call expects; the unneeded values will simply be ignored. This code reflects the maximum of six arguments that any system call may have.

In 4.17, this mechanism has changed on the x86-64 architecture, thanks to some work done by Dominik Brodowski. The new convention makes use of the fact that the pt_regs structure, created on any trap into the kernel, contains the register state of the user-space process, so it will contain the system-call arguments too. Rather than push all six registers onto the stack, the relevant line in do_syscall_64() now looks like:

    if (likely(nr < NR_syscalls)) {
        nr = array_index_nospec(nr, NR_syscalls);
        regs->ax = sys_call_table[nr](regs);
    }

The wrapper for the system-call implementation needs to change, since the calling convention has changed. In 4.17, the boilerplate for read(), after macro substitution, looks something like:

    asmlinkage long __x64_sys_read(const struct pt_regs *regs)
    {
        return __se_sys_read(regs->di, regs->si, regs->dx);
    }

    static long __se_sys_read(long fd, long buf, long count)
    {
        long ret = __do_sys_read((int) fd, (char __user *) buf, (size_t) count);
        return ret;
    }

    static inline long __do_sys_read(int fd, char __user *buf, size_t count)

The __x64_sys_read() version goes into sys_call_table; its job is to unpack the arguments from the pt_regs structure before calling __se_sys_read(), which will cast those arguments into the proper types and call the real implementation (now named __do_sys_read()).

Getting to this point required quite a bit of work, including a pass over the entire kernel to find and fix every direct call into the old system-call entry points. One might well ask why this effort was made. The new implementation is somewhat cleaner in general, but it also keeps unused, caller-supplied data from ending up on the stack. In current kernels, an attacker could call read() with six arguments; the final three would just end up on the stack, unchecked in any way, where they could conceivably turn out to be useful for any of a variety of exploits. By controlling the environment in which system-call code runs a bit more strictly, the new calling convention makes the kernel just a bit harder to compromise.

Comments (3 posted)

Every year, the Linux Storage, Filesystem, and Memory-Management Summit (LSFMM) brings together 80 or 90 core kernel developers for a few days of highly technical discussions. The 2018 LSFMM was held April 23 to 25 in Park City, Utah. As usual, LWN was there and able to attend most of the sessions.

Plenary sessions

This year, a relatively small subset of the discussions involved the entire group of developers. Topics addressed by the whole group include:

Memory-management track sessions

The discussions held by the memory-management subgroup were:

Filesystem track sessions

The filesystem developers participated in the following sessions:

Filesystem and Storage track sessions

There were a few sessions shared by participants in both the Filesystem and Storage tracks:

Supporting multi-actuator drives: a discussion about the design of the interface to devices with multiple actuators that provide multiple sets of read/write heads for parallel data access.

An update on bcachefs: a discussion on bcachefs and whether it is ready for mainline inclusion.

Handling I/O errors in the kernel: what to do about I/O errors, particularly during writeback?

Group photo

Acknowledgments

Thanks to the LSFMM 2018 program committee for inviting LWN to the event, and to the Linux Foundation, LWN's travel sponsor, for supporting our travel to Park City.

Comments (none posted)

It is a good thing that strong coffee was served at the 2018 Linux Storage, Filesystem, and Memory-Management Summit; full awareness was required from the first session, in which Josef Bacik discussed some issues that have arisen in the interaction between filesystems and the memory-management subsystem. Filesystems cache a lot of data from files, but also a lot of metadata about those files. It turns out, though, that management of the cached metadata does not work as well as one might like.

In particular, Bacik said, he has been working on better tracking of dirty metadata in memory. This is a problem that every filesystem faces, and each one has come up with its own solution. Left unchecked, the amount of dirty metadata in the system is limited only by the amount of memory available — which can be problematic, given that the owner of the system usually wants to use at least some of that memory for other purposes. The Btrfs filesystem uses a special inode to track and limit dirty metadata, but that leads to "horrifying things" around the process of evicting that metadata when the time comes. Among other things, it doesn't support sub-page-sized blocks.

The Btrfs developers would like to take an approach closer to what is done in XFS, he said. That approach works well, but still suffers from the problem that the metadata cache can grow to fill all of memory. To address this problem, he's working on adding separate counters to track metadata pages, with the idea that filesystems will get a writeback callback from the memory-management subsystem indicating the number of these pages that should be freed. This callback will come from balance_dirty_pages(), which knows how much memory it needs to free and can pass that information down to filesystem code.

Dave Chinner noted that memory accounting at this level is all based on pages, which doesn't necessarily match the problem of tracking filesystem metadata, which exists in many sizes from a fraction of a page to many pages. So tracking dirty metadata in terms of pages doesn't really solve the problem. Bacik responded that he has changed all of the counters to units of bytes to address just this problem. He asked the memory-management developers in the room whether they objected to this change; nobody seemed to.

Moving on, he said that balance_dirty_pages() has a global limit of dirty memory that encompasses both data and metadata pages. But, he asserted, it is generally better to write out data pages than metadata pages when memory must be freed, so Btrfs focuses on the data pages first. Is that behavior good enough, he asked, or should there be separate limits for data and metadata?

Johannes Weiner said that one limit should apply to both types, since the system is committed to writing back all of the dirty data in the end. Jan Kara said, though, that there is a fundamental difference between writing back data, which just has to be pushed out of the page cache, and metadata, which must be written back via calls to shrinkers. Bacik said that the shrinker interface is a part of the problem, since it is not well designed. The Btrfs changes are going to put more pressure on the shrinker interface, since there will be unknown amounts of data attached to each object that might be written back. There is, he said, a lot of memory hidden in places where the memory-management subsystem has no idea how to find it.

Rik van Riel said that, when a filesystem allocates memory, it expects the memory-management code to be able to free up enough memory to satisfy the request. But, while allocations require pages, shrinkers work on smaller objects. A shrinker may free the amount of memory requested by the memory-management code, but those objects can be scattered across memory and the end result may be to free no pages at all. As more memory is taken up by filesystem caches, he admonished, the filesystem has to free those caches or it will find its allocation requests failing. Kara responded, though, that fragmentation tends not to be a huge problem in this context, since the objects involved are relatively large.

Chinner said that XFS does the majority of its metadata writeback by way of shrinkers, but also all of its page-reclaim work. The XFS cache structure is too complicated to express in the page cache, so the page cache is not used for this purpose. So, for XFS, there are no problems resulting from the shrinkers and the page cache not working together. Bacik said, though, that this results in XFS tending to write data randomly during reclaim, which can lead to excessive latencies, so Facebook has had to patch that feature out. There needs to be a way to do XFS metadata writeback in places where latency is expected.

Mike Snitzer suggested that "bufio", a shrinker-based mechanism used by the device mapper, might be able to help here; it has recently been enhanced to handle objects with sizes that are not a power of two. Bacik said he'd looked at it, but he thinks that maybe the shrinker interface is not the best one for this problem. Chinner said the solution could be a new type of shrinker; the same mechanism could be used, but the accounting would be adapted to handle the metadata issues.

Chris Mason said that Btrfs has a set of throttling hooks that slow down allocations in places that are known to be safe; it is working well. The only caveat is that things fall apart if the kswapd thread starts performing I/O. In general, Weiner said, the entity that is issuing writes is not the one that is performing allocations, so throttling writes is the wrong way to address the problem. There should be two separate throttling points, one for writes and one for allocations; conflating the two has caused a lot of problems.

Bacik closed the session by saying that he will do the work to add a new shrinker interface with byte-based accounting for metadata.

Comments (none posted)

Ever since kernel page-table isolation (PTI) was introduced as a mitigation for the Meltdown CPU vulnerability, users have worried about how it affects the performance of their systems. Most of that concern has been directed toward its impact on computing performance, but I/O performance also matters. At the 2018 Linux Storage, Filesystem, and Memory-Management Summit, Ming Lei presented some preliminary work he has done to try to quantify how severely PTI affects block I/O operations.

This work was done by running the fio benchmark on current hardware. The initial tests, running in a virtual machine, showed a significant impact: a system that could execute just over 1 million I/O operations per second (IOPS) without PTI was reduced to 726,000 IOPS with PTI turned on. The situation changes significantly when the test is run on bare metal on the same machine; in that case, a system that could achieve 1,706,000 IOPS dropped to 1,568,000 IOPS when PTI is turned on. At a little under 10%, that is a smaller impact, but still a significant one.

It's not clear why performance regresses so severely when the test is run under virtualization. There was some theorizing that clock_gettime(), which is called frequently by fio, is not implemented properly on the guest system, but no real answers emerged.

Further tests were done using an NVMe-attached drive. In this case, the IOPS rates were about the same regardless of whether PTI was being used, but the system's CPU utilization was significantly higher in the PTI case.

Lei concluded from his tests that enabling PTI adds about 0.2µs to the execution time of every system call. Normal synchronous I/O operations can be performed with a single system call, so they slow down slightly as a result. Asynchronous I/O operations, as used in the benchmark, require two system calls — one each to io_submit() and io_getevents(). As a result, asynchronous I/O feels the PTI penalty more severely. Interrupts add a similar penalty to each operation as well.

Dave Hansen (who did much of the work to bring PTI to Linux) noted that there was nothing new in these results. There has always been a cost to both interrupts and system calls; PTI just makes those costs worse. He did note that it was nice to see that the IOPS don't drop when there is adequate CPU time available, though.

Block maintainer Jens Axboe said that fio performs three clock_gettime() calls for every I/O operation by default. So, to a great extent, Lei's tests were measuring the impact of PTI on system-call execution time. Bart Van Assche suggested using the options that reduce the number of clock_gettime() calls, just as the session wound down.

Comments (9 posted)

Dave Hansen did much of the work to get kernel page-table isolation (PTI) into the kernel in response to the Meltdown CPU vulnerability. In the memory-management track of the 2018 Linux Storage, Filesystem, and Memory-Management Summit, he ran a discussion on how PTI came about, what the costs are, and what can be done to minimize its performance impact.

Hansen started by saying that he was not going to talk about the Spectre vulnerabilities, which are not seen as being much of a problem at the memory-management level. This code is relatively far from user space, so forcing a particular speculative-execution path there is generally impractical. The PTI patches, instead, affect memory management directly.

PTI got its start in this paper by Daniel Gruss et al. [PDF], where it was called KAISER. The original purpose behind KAISER was to address threats to kernel address-space layout randomization; the Meltdown vulnerability was not known to the authors when the paper was written. Later on, when it became clear that there was something going on, the authors found Meltdown and disclosed it to Intel — they were one of four independent groups to do so. Hansen wondered how many others had also found the problem but kept it to themselves.

In any case, when Meltdown was discovered, Hansen realized that KAISER was a good way to address the problem. One of the first things he did with the code was to add support for the Intel processor context ID (PCID) feature. PCIDs, support for which was added by Andy Lutomirski around 4.14, allow page tables to be switched without flushing the translation lookaside buffer (TLB), thus mitigating some of the performance impact of PTI (which, by its nature, forces a lot of extra page-table switches).

One interesting question is: how long should PTI be enabled in the kernel? The way things stand now, the kernel will turn it off if it detects that it is running on a processor that is not vulnerable to Meltdown. The "performance folks" want that, of course, since they want to get as much performance as possible out of their CPUs. People who are more concerned with security, though, are inclined to leave it enabled indefinitely. There are a number of useful security properties that come from completely separating the kernel and user-space address spaces.

One of the consequences of PTI is, of course, that system calls, page faults, and interrupts all get slower. The PTI patches also turned off the "global pages" feature, which kept the kernel's TLB entries resident at all times. He has a patch set to restore global pages conditionally in the 4.17-rc kernels; there are still some issues with the changes, he said. Another outcome of PTI is that TLB invalidation becomes much trickier. PCIDs allow a small number of address spaces to be active at one time; when pages are shared between address spaces, they must be separately flushed for each active process. There is also a slight increase in memory use when PTI is active.

Users who are concerned about the performance impacts have a few ways to minimize those impacts, most of which are good practice even without PTI. For example, he said, minimize the number of system calls that are made. Try to avoid extra TLB flushes; he mentioned the MALLOC_TRIM_THRESHOLD environment variable that, if misconfigured, can cause glibc to make excessive address-space changes. Huge pages should be used when possible; they can reduce the number of page faults, and also the number of system calls needed to manipulate them. Try to avoid page faults in general; mapping memory with MAP_POPULATE when it will all be accessed is one way to do that.

Michal Hocko said that it might make sense to add more multiplexing system calls for address-space operations. For example, database developers would appreciate a batched version of mprotect() that would make several changes with a single system call. Hansen agreed that this might make sense; he lamented that the API for memory protection keys (which he wrote) was designed around the notion that "system calls are fast". Now they aren't quite as fast and some of those decisions don't look as good as they did before. Rik van Riel said that a vectored version of madvise() would give big performance improvements for memory allocators, and Hansen added that a batched version of munmap() would also be helpful.

Another developer asked about ways to increase locality in general; Hansen replied that huge pages are the answer there. Huge pages were "an afterthought" in the processors where they first appeared, so they were not supported all that well. There were not many TLB entries available for huge pages, for example. But huge pages are a first-class citizen now, and they should be used when possible.

Alexei Starovoitov asked about the ability to turn off PTI on a per-process basis; patches to allow this were circulating in January, but haven't been discussed recently. Hansen replied that the most security-sensitive applications are the ones that need PTI, but they are also often quite performance-sensitive. As a result, the people who need PTI the most are often the ones who most want to turn it off. A per-process disable can thus be problematic; that said, it may now be time to revisit that issue.

Laura Abbott asked about the status of older hardware; there will be users who don't want to update. Hansen replied that older hardware will always need PTI; there are no microcode updates forthcoming that will address Meltdown on that hardware. Abbott also asked about PTI for 32-bit systems; Hansen said that the 32-bit patches will go in soon; they are "not as terrifying" as had been originally thought.

The final question came from Dan Williams, who wondered whether PTI was a useful mitigation for the newer "Spectre prime" variant. Hansen replied that any attack that takes advantage of a shared address space will be mitigated by PTI. Its costs notwithstanding, using PTI to separate the spaces brings some fundamental security benefits, he said.

Comments (none posted)

Once a niche feature, memory encryption is becoming mainstream with support in both Intel and AMD processors, Kirill Shutemov said at the beginning of his session during the memory-management track of the 2018 Linux Storage, Filesystem, and Memory-Management Summit. Memory encryption can harden the system against attack, but it also presents some interesting challenges for the kernel.

Both Intel and AMD use the upper bits of the physical address to mark encrypted pages, he said. AMD processors currently only support a single encryption key, and so use a single bit to indicate that encryption is in use. Intel, instead, encrypts all of memory and uses up to six upper bits to indicate which encryption key is used for each page. The default key is marked by all zeroes in that field.

One interesting challenge is that the CPU's memory caches are based on the physical address — including the encryption bit(s). Encryption is handled by the memory controller, and the same page with two different keys will look like different addresses to the CPU. Unless due care is taken, the same page can thus appear multiple times in the cache under different encryption keys. Writing multiple competing cache lines to the same page will likely corrupt the data there, an outcome that is likely to increase the overall user disgruntlement level. Avoiding it requires carefully flushing the caches whenever the encryption status of a page changes to ensure that no entries remain under the old key.

Doing that flush turns out to be expensive. In an effort to minimize that cost, Shutemov is looking at adding encryption-key awareness to the page allocator. The key that was last used with each page would be remembered; if the page is allocated to a new use with the same key, there will be no need to flush any old cache entries. This can be implemented by putting a wrapper around the current page allocator. It is worth the effort, he said; otherwise allocation of encrypted pages can be up to three times as expensive. Since the intent is that all of memory will be encrypted, this extra cost could hurt overall performance significantly.

One question that arises, though, is: where should the key ID be stored? It needs to be associated with the page structure somehow, and it must be kept around after the page has been freed. Ross Zwisler suggested that perhaps pages could be kept in separate pools, one for each key ID. Shutemov agreed that could be done, but it would involve more significant surgery to the page allocator. There was a period of somewhat rambling exploration of ideas for solutions with no real conclusion reached.

Hugh Dickins asked how key IDs interact with the buddy allocator, which will want to join pages with different IDs into larger groupings. The buddy allocator ignores the IDs, Shutemov said. Cache flushing is done with a single-page granularity, though, so things work as expected.

For the time being, the key ID is being stored in the anon_vma structure; that means that memory encryption only works for anonymous pages. It also is not yet tracking the key ID after pages are freed. Dave Hansen said that the search for a permanent home for the key ID is likely to lead to a challenge all memory-management developers have faced at one time or another: poring over struct page in search of a few available bits that can be used for this purpose. For now, though, Shutemov is looking at storing it in the page_ext structure that is used for information that doesn't fit in struct page.

Michal Hocko worried that adding complexity to the page allocator may be a mistake, especially if memory encryption works better in future CPUs and this level of tracking is no longer needed. He also worried that encryption will add a degree of nondeterminism to the page allocator; the time required to satisfy a request will vary considerably depending on the previous encryption status of the allocated pages. The networking developers, who have been working to reduce allocation times, will complain. It would be better, he said, to ensure that the cost of using encrypted memory falls on those who choose to use it. That suggests just flushing caches when the memory is freed.

Shutemov raised another issue: the direct mapping (the portion of kernel space that, on 64-bit systems, maps directly to all of physical memory) uses the default key. But the kernel will often find itself having to access pages that are encrypted with other keys. One way to handle that would be to bring back something like the kmap() interface to create a temporary mapping to a specific page using the appropriate key. That would be somewhat less efficient than going through the direct mapping, though, and will hurt performance.

The alternative is to partition the direct mapping, creating one section for each possible key ID (of which there are up to 64 in current hardware). The promise of this approach is better, he said, but it's not working yet. The potential problem is that replicating the direct mapping in this way will use a lot of virtual address space, reducing the amount of physical memory that the machine can handle in the process. But, by grabbing a set of physical-address bits for the key ID, memory encryption already reduces the possible amount of physical memory anyway. The other possible disadvantage is that kernel address-space layout randomization would have fewer bits to play with.

The proper API for key management still needs to be worked out. With 64 keys available, they can't just be handed out to any process that wants one — at least, not without complicating context switches in unpleasant ways. The user-space API is likely to be based on the existing kernel key-management facilities. A new mprotect() call would be used to apply a key to a range of memory; doing so will destroy the current contents of the indicated range. It would also be nice to have a control-group API at some point, he said.

The final challenge discussed was DMA, which also has to work with memory encryption. On systems with an I/O memory-management unit, encryption should just work. For other systems, it depends on whether the DMA mask for any specific device can handle the full range of physical addresses; encryption will just work if that is the case. Otherwise, bounce buffers will be needed so that the kernel can handle the encryption; that is easy to implement but slow to run.

Comments (11 posted)

Using the kernel thread (kthread) freezer has been a longtime problem for a variety of reasons. It is meant as a way to suspend kthreads on the way toward system suspend, but in practice has proved problematic to the point that it came up at both the 2015 and 2016 Kernel Summits (as well as on the mailing lists over the years); the intent is to try to remove the kthread freezer entirely. To that end, Luis Rodriguez led a discussion in the filesystem track of the 2018 Linux Storage, Filesystem, and Memory-Management Summit on the problems and possible solutions.

Rodriguez has picked up the work that Jiri Kosina was doing to eliminate the kthread freezer, but is moving more cautiously than Kosina originally planned. One problem is that the kernel does not want to freeze kthreads in unexpected places, so there is a mechanism that allows the threads to block the freezing process. Part of the thinking there is that there should not be DMA in flight while the suspend is going on, Kent Overstreet said. He asked, wouldn't it be better if the drivers put themselves in a sane state for suspend?

Dave Chinner said that even if the devices are ready to suspend, the filesystems can still be making in-memory changes. A recurring problem is that suspend would sync a filesystem to make it stable, but the filesystem would still have threads and work on workqueues that were operating on the in-memory data. That led to an inconsistent state between what was on disk and what was in the memory image used by the suspend.

In general, Rodriguez said, the kernel should not be freezing kthreads. The threads want full control of where they can be frozen; it is hard to get it all right if it is imposed on them. But trying to address this problem in a generic form is "really hard"; phasing out the kthread freezer will be difficult, so he suggested a divide-and-conquer approach.

For filesystems that implement the freeze_fs() method, it should be straightforward, but there is still a problem in getting the order right. The current mechanism freezes the most recently mounted filesystems first and thaws them in the order in which they were mounted. That is simple to do using iterate_supers(), but does it work in all cases?

Al Viro said that it does not. There is a "nasty ioctl()", which he is sorry for implementing, that can break the ordering. It is quite possible that a filesystem that was mounted later shows up earlier in the list. The ordering described is also not sufficient for FUSE filesystems, Jan Kara said, though Chinner suggested those simply be skipped in the walk.

But there are filesystems that talk to several devices, such as those hosted on a RAID device or with their journal on a separate device, Viro said. These topologies can also change at run time, so he does not recommend relying on any kind of ordering.

In fact, a directed acyclic graph (DAG) could describe these relationships, Kara said. It would have nodes for filesystems and devices, with edges that describe the dependencies between them. It would be nice to build that DAG in the kernel, but it is not done today. Viro agreed that it is probably needed at some point.

Rodriguez wondered whether the DAG generation was required before making any progress on eliminating the kthread freezer. As long as the existence of the problem is kept in mind, Viro said, work can proceed. If these problem configurations can be detected, suspend could be prohibited for those systems, Rodriguez said. But that will be difficult to detect without the graph, Kara said.

A number of problem areas came up in the discussion: races between freezing and automounting, the "completely broken" control group (cgroup) freezer, problematic freezing of FUSE filesystems, and so on. It was noted that applications would like to know if the filesystem they are using is about to freeze so they can quiesce their own data to keep it consistent. Rodriguez was surprised to find out that there is no generic framework for the kernel to notify user space about an upcoming suspend: "That's insane!"

No real conclusions came out of the discussion. Rodriguez plans to post his notes to the mailing list for feedback. There was also talk about discussing it more later in the summit, though that has not been scheduled as of this writing.

Comments (7 posted)

After a session at last year's Linux Storage, Filesystem, and Memory-Management Summit (LSFMM), Jeff Layton was able to make some improvements to block-layer error handling. Those changes, which added a new errseq_t type to hold an error number and sequence number, seemed to help and were well received—except by the PostgreSQL developers. So Layton led a session at the 2018 LSFMM to discuss ways to improve things further; it would be followed later in the week with a session by one of the PostgreSQL developers to look at the specifics of the problem from their perspective.

Layton started by noting that Matthew Wilcox had come up with a patch to restore the behavior that PostgreSQL relies on, which is to receive I/O errors from periodic fsync() calls made by a checkpointing process. However, that change still could not guarantee that errors would always be reported; if the inode is evicted from memory, the error stored there will be lost. The lack of a guarantee made Layton leery of the patch, but Jan Kara noted that there never was a guarantee, even under the old behavior prior to errseq_t.

A writeback error may not actually be reported by fsync(), however; other calls, such as close(), could return it. There was some discussion about whether getting an error return from a close() call actually makes sense. Dave Chinner said that a write failure that is not noticed until well after the write has completed (during writeback, for example) could be reported when the file is closed.

But Layton mentioned Neil Brown's assertion that returning errors on close is not reasonable. For one thing, it is quite common for applications to ignore the return code from close(). But it is documented that the close() call can return errors, so some users will be dependent on that behavior, Chinner said.

Even with Wilcox's patch, there is still the problem of inodes that get evicted from memory. The bug has always been there but, since error reporting is being scrutinized and fixed, it is worth eliminating that problem. There was some talk of making the error persistent on disk, but Eric Sandeen suggested simply refusing to evict inodes that have errors associated with them. That should work well, and remove that lingering problem for PostgreSQL (and others) unless there is major memory pressure, Layton said. But Chinner does not think hanging on to a few extra inodes is going to affect the out-of-memory handling in any significant way.

Layton mentioned that there is another problem: syncfs() is "really broken" in its error reporting. He plans to fix that, probably by using another errseq_t in the superblock, since reporting from syncfs() requires a separate cursor on the error state.

As the session was wrapping up, Chinner asked about tests for the changes. Layton said that he had some to add to xfstests, but there is a need to create some specifically for the PostgreSQL fsync() problem. That will allow some assurance that the problem has really been solved and that it doesn't regress down the road.

Comments (11 posted)