This edition contains the following feature content:

This week's edition also includes these inner pages:

Brief items: Brief news items from throughout the community.

Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Refactoring the kernel means taking some part of the kernel that is showing its age and rewriting it so it works better. Thomas Gleixner has done a lot of this over the past decade; he spoke at Kernel Recipes about the details of some of that work and the lessons that he learned. By way of foreshadowing how much fun this can be, he subtitled the talk "Digging in Dust".

Gleixner's original motivation for taking up his spade was to get the realtime (RT) patches into the mainline kernel, which he found involved constantly working around the shortcomings of the mainline code base. In addition, ten years of spending every working day digging around in dust can make you quite angry, he said, which can also be a big incentive to make things better.

Hotplug support

The first concrete example he discussed was CPU hotplug support. This was added in 2002, when Rusty Russell implemented it with a notifier-based architecture. According to Gleixner, the naming scheme was poor and, as the community worked on the code, further problems were added: the locking was handled by uninstrumented magic that deliberately evaded lockdep oversight, and the code developed interesting and undocumented ordering requirements. Moreover, the handling of registered notifiers at startup and teardown was symmetric: startup reasonably did high-priority tasks first, but teardown also did high-priority tasks first. This is not reasonable: a pile should be disassembled in the reverse of its build order, he said. The choice of symmetric behavior meant that anything that was sensitive to startup/teardown ordering ended up having two notifiers. The absence of any debugging for the hotplug notifiers didn't help with figuring out what was going on.

One might expect such code to be less than completely robust; according to Gleixner it was known to be fragile for a long time. The response for most of this time was to apply successive layers of duct tape to hold it together; the attempt to add RT support made it fall apart completely, which is why Gleixner first picked up the pieces in 2012. He decided to convert the structure to a state machine and redo startup/teardown to be symmetric; he got as far as a proof-of-concept patch when he ran out of spare time. He handed the work over to a (tactfully unnamed) company that promised to pick it up, but the company merely applied yet more duct tape. So at the end of 2015, when he got some funding for the RT project, he revisited it; it has taken almost two years to complete the work.

The first step was to analyze all the notifiers and document the ordering requirements. During this process several dozen low-level bugs, such as resource leaks, memory leaks, and null-pointer problems, were found and fixed, predominantly in the error paths as these had received very little in-the-field testing over the years. Gleixner noted one advantage of a state machine that comes into play here: if startup fails halfway through, teardown is done using the regular state machine teardown code rather than some failure-scenario-specific code. Once the analysis and documentation were complete, a conversion of notifiers to hotplug states was started, gradually removing old infrastructure as it was no longer needed.

Once this was complete and the notifiers were gone, he tackled the locking problem, reimplementing the locks with semaphores. This meant that lockdep could see what was going on, revealing about 25 deadlocks lurking in the existing code. Although these were hard to trigger in the mainline kernel, which is probably why they had not been fixed before, the RT patches tend to expose them. Steven Rostedt's hotplug-stresstest script would usually lock up any machine after four or five runs.

One problem identified came from the decision to have a user-space procfs interface writing into a variable that was used, live, by the kernel. This tended to produce nasty race conditions, which were handled in about a dozen stacked patches, the net effect of which was to reduce the size of the race window until it was hard to see. Gleixner's preferred solution was to create a shadow variable that user space writes into, which the kernel's internal variable was synchronized with when (and only when) the system was in a state that allowed it to change safely.
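
The idea is simple enough to capture in a few lines. The following sketch is purely illustrative and is not Gleixner's actual code; the variable and function names are invented:

static unsigned int active_value;       /* read live by the kernel */
static unsigned int shadow_value;       /* target of procfs writes */

/* The procfs write handler only ever touches the shadow copy. */
static void store_value(unsigned int val)
{
        shadow_value = val;
}

/*
 * Called only from code paths where the system is known to be in a
 * state that allows the value to change safely.
 */
static void sync_value(void)
{
        active_value = shadow_value;
}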

Much was learned from this experience. First, if you unearth bugs in the kernel, you're expected to fix them. Gleixner and his colleagues drew the maintainers' attention to the first few bugs they found in the notifier callbacks, asking for advice on how to deal with them, and were either completely ignored or knocked back on the basis that the code had been working fine for five years, so it must be correct. Once they followed this up with patches, maintainers responded.

When digging in the kernel, "the amount of crap you find is immense; it's really amazing". He added (to nervous giggles from the audience): "when you think you've seen the worst already, no, it's getting worse next week". Patience is required. You start off with something simple; you pull at it and other bits fall out that need attention; before you know it you have to rewrite stuff all over the place. This is normal, and you mustn't give up, he said. He also learned that corporations cannot be trusted to keep their promises.

Finally, estimation of effort in kernel reworking is hard; his estimated time required was off by a factor of two and his estimated person-hours were off by a factor of three. In response to a question from the audience, he allowed that his estimate was reached by taking a reasonable guess, doubling it, and converting to the next largest unit — and still it was off by a factor of two. In the end, the total work required to rework CPU hotplugging was about two person-years, which resulted in about 1.2 patches per person-day.

Timer wheel

The next example he looked at was the timer wheel rework, which he approached by going back to the original work done in 1997. He and his colleagues reread the original paper on which the work was based, re-examined why it was implemented in the first place, and tried to work out how it could be done better. They found that, when you try to go back 20 years, everyone you ask strenuously denies remembering anything. It took them three months to do the analysis work (tracing and examining the code, figuring out the use cases and the precision each case required), followed by two months to design and write a proof-of-concept (during which they implemented several possible designs) and a month to do the final implementation, including the posting and review process.

Certain enhancements are still works-in-progress, such as the business of choosing which CPU to place a new timer on. It's desirable to have a timer fire on a busy CPU, to avoid waking a CPU from a lower-power state just to deal with it. The old timer wheel worked out which CPU would be busiest when the timer fired by putting it on the one that was busiest now. As Gleixner pointed out, that's not guaranteed to work well, though he didn't put it that kindly.

For the reworked code they noted that over 99% of timers are canceled before they fire; they therefore wished to keep the cost of setting a timer as low as possible, so they avoid a costly and complex decision-making process by always queueing it on the local CPU. If that CPU goes idle, it lets other CPUs know that it has pending timers; either they will be canceled, or if they actually trigger, a busy CPU can take over running the timer.

By his account, the rework was a success; there was minimal fallout, with tea and medals for all. Sadly, a week before Kernel Recipes 2017, and more than a year after the rework patches were merged, a user-space-visible regression surfaced. This issue is ongoing, because members of the network community (who were most affected by this regression) initially responded by arguing that the right fix was the removal of the whole timer wheel rework — until it was pointed out that the network subsystem would suffer the biggest performance hit from doing so.

Lessons learned from this rework include being prepared for paleontological research; going back 20 years in the kernel is pretty much a visit to Jurassic Park. Gleixner mentioned some projects out there that try to map tokens in the kernel code to contemporaneous discussions on mailing lists, and said how useful these would have been for a rework like this. Don't expect that anyone will know anything, or that anyone will care about what you're doing, he said. Finally, be prepared for late surprises; if you're reworking stuff this far down in the kernel, you can't burn your notes the day after the patches are merged.

A question from the audience asked how useful Git changelogs could be for reworking. For anything more than about five years old, Gleixner said, they're useless; the vast majority of changelogs either say something like "fix bug" or describe what the patch is doing instead of why. As a maintainer, Gleixner spends a lot of time trying to teach people to write good changelogs, and reckons he rewrites about half of the changelogs that go through him. Whether as a result of this or not, changelogs have been getting steadily better over the last five years and can now often be helpful.

He also finds git grep/log/blame useful, along with the history git trees (both the BitKeeper history, which he has converted to Git and made available on kernel.org, and the really ancient history tree which imported the delta patches from one version to the next). He uses Coccinelle to figure out which pieces of code touch a given data structure and finds it extremely helpful. Mail archives, where they exist, are useful in direct proportion to their searchability. He uses Quilt to keep patch queues under control in big refactoring jobs. Finally, the utility of a good espresso machine should never, he said, be underestimated. In response to an audience question about cross-reference indexes, he expressed disdain for any tool that requires a web browser.

Should you, he asked rhetorically, have a go at refactoring? Yes, but not if you're faint of heart. You will need to be stubborn, you will have to dig into some dark and evil places, but you will learn a lot about kernel history and why things are the way they are, even though they sometimes don't make any sense at all. You get used to understanding undocumented code written by authors who have vanished, which is a useful skill. You will spend a lot of time fighting the "it works fine" mentality: merely proving a bug exists isn't enough, as many people regard bugs that haven't been triggered as non-issues. It's fun, for some definition of fun; to laughter, he noted that one of the greatest rewards of refactoring is the understanding that the Linux kernel works largely by chance, not design.

[We would like to thank LWN's travel sponsor, The Linux Foundation, for assistance with travel funding for Kernel Recipes.]

Comments (25 posted)

The 4.14 kernel, due in the first half of November, is moving into the relatively slow part of the development cycle as of this writing. The time is thus ripe for a look at the changes that went into this kernel cycle and how they got there. While 4.14 is a fairly typical kernel development cycle, there are a couple of aspects that stand out this time around.

As of the 4.14-rc5 prepatch, 12,757 non-merge changesets had found their way into the mainline; that makes 4.14 slightly busier than its predecessor, but it remains a fairly normal development cycle overall. If, as some have worried, developers have pushed unready code into 4.14 so that it would be present in a long-term-support release, it doesn't show in the overall patch volume.

1,649 developers have contributed code in this development cycle, a number that will almost certainly increase slightly by the time the final 4.14 release is made. Again, that is up slightly from 4.13. Of those developers, 240 made their first contribution to the kernel in 4.14. The numbers are fairly normal, but a look at the most active developers this time around shows a couple of unusual aspects.

Most active 4.14 developers

  By changesets
    Arvind Yadav                544   4.3%
    Bhumika Goyal               195   1.5%
    Gustavo A. R. Silva         156   1.2%
    Colin Ian King              151   1.2%
    Julia Lawall                141   1.1%
    Arnd Bergmann               127   1.0%
    Mauro Carvalho Chehab       117   0.9%
    Bart Van Assche             113   0.9%
    Arnaldo Carvalho de Melo    106   0.8%
    Paul Burton                 100   0.8%
    Chris Wilson                 96   0.8%
    Markus Elfring               94   0.7%
    Thomas Gleixner              87   0.7%
    Dan Carpenter                87   0.7%
    Laurent Pinchart             87   0.7%
    Daniel Vetter                83   0.7%
    Xin Long                     82   0.6%
    Christoph Hellwig            79   0.6%
    Geert Uytterhoeven           77   0.6%
    Michael Ellerman             76   0.6%

  By changed lines
    Greg Kroah-Hartman       129421  14.9%
    Ping-Ke Shih             122912  14.1%
    Lionel Landwerlin         30289   3.5%
    Mauro Carvalho Chehab     22461   2.6%
    Daniel Scheller           17708   2.0%
    Nick Terrell              14223   1.6%
    Aviad Krawczyk            12831   1.5%
    Salil Mehta               12051   1.4%
    Juergen Gross             11036   1.3%
    Todor Tomov                9286   1.1%
    Sukadev Bhattiprolu        9248   1.1%
    Hannes Reinecke            9003   1.0%
    Arnaldo Carvalho de Melo   7790   0.9%
    Andi Kleen                 6826   0.8%
    Johannes Berg              6631   0.8%
    Masahiro Yamada            6429   0.7%
    Russell King               4573   0.5%
    John Fastabend             4412   0.5%
    Jérôme Glisse              4128   0.5%
    Vikas Shivappa             4091   0.5%

Arvind Yadav contributed 544 changesets mostly focused on making device-ID lists in the kernel constant. Unlike much "constification" work, these changes probably do not have much security significance, but they do tend to make the kernel text size a little smaller. Bhumika Goyal got her start as an Outreachy intern making structures full of function pointers const — a job that does improve security; she is continuing that work post-Outreachy with support from the Core Infrastructure Initiative (CII). Gustavo A. R. Silva also did constification work, along with contributing a number of other fixes, with CII support. Colin Ian King did, wait for it, constification work along with various other fixes and Julia Lawall also did constification work.

In other words, the top five contributors contributed nearly 1,200 changes mostly cleaning up declarations in the kernel. This work may not draw the same sort of attention as the addition of hardening mechanisms, but it is an important part of hardening the kernel overall.

Greg Kroah-Hartman got to the top of the "changed lines" column mostly by virtue of a single patch deleting the remaining old firmware files from the kernel tree. Firmware has long been maintained externally, so this cleanup was overdue. Ping-Ke Shih added yet another Realtek wireless driver to the staging tree, Lionel Landwerlin reworked parts of the Intel i915 driver in ways that allowed the removal of a lot of code, Mauro Carvalho Chehab contributed changes all over the media and documentation subsystems as usual, and Daniel Scheller added a large new media driver.

The 120,000-line Realtek wireless driver merits a bit more attention; it was the subject of some complaining about how Realtek gets its drivers into the kernel. But something of note has happened here. Numerous Realtek drivers have been merged via the staging tree, often moving on to the mainline kernel; that work has generally been done by Larry Finger on his own time. But, as he explained, he is reaching an age where he lacks the energy for this work and intends to stop soon. Thus it is encouraging that Ping-Ke Shih, the submitter of the Realtek driver patch this time around, is an actual Realtek employee. It would appear that the company has finally decided to put some resources into Linux support and, hopefully, the situation with its wireless drivers will improve over time. Meanwhile, Realtek should send Larry a nice retirement present — he has certainly earned it.

Work on 4.14 was supported by 213 companies that could be identified — a typical number that is, once again, a little higher than the 203 seen in the 4.13 cycle. The most active employers this time around were:

Most active 4.14 employers

  By changesets
    Intel                  1328  10.4%
    (None)                  813   6.4%
    Red Hat                 754   5.9%
    (Unknown)               575   4.5%
    IBM                     566   4.4%
    Motorola                544   4.3%
    Linaro                  500   3.9%
    Google                  453   3.6%
    Mellanox                425   3.3%
    SUSE                    404   3.2%
    Linux Foundation        391   3.1%
    AMD                     348   2.7%
    Renesas Electronics     319   2.5%
    Samsung                 262   2.1%
    Rockchip                257   2.0%
    Oracle                  221   1.7%
    ARM                     218   1.7%
    (Consultant)            199   1.6%
    Canonical               185   1.5%
    Broadcom                182   1.4%

  By lines changed
    Linux Foundation     131369  15.1%
    Realtek              124976  14.4%
    Intel                101671  11.7%
    (None)                47222   5.4%
    Red Hat               31888   3.7%
    SUSE                  29408   3.4%
    Huawei Technologies   28807   3.3%
    IBM                   28363   3.3%
    Linaro                25614   2.9%
    Samsung               24940   2.9%
    (Unknown)             22749   2.6%
    Mellanox              18379   2.1%
    Facebook              18345   2.1%
    Google                17257   2.0%
    AMD                   14621   1.7%
    (Consultant)          12162   1.4%
    Renesas Electronics   12004   1.4%
    ST Microelectronics    9923   1.1%
    Rockchip               9091   1.0%
    ARM                    8438   1.0%

These results are fairly typical for recent kernels; the biggest surprise, perhaps, is the appearance of Realtek as was discussed above. In general, the kernel project continues to move forward powered by a great deal of corporate support.

While the kernel project clearly depends on developers to get the code written, it also depends on those who test changes and report bugs. Some of the time, at least, that contribution is noted through the addition of tags in the patches. Looking at the Reported-by tags (619 through 4.14-rc5) and the Tested-by tags (605) yields the following results:

Testing and bug reporting in 4.14

  Reported-by tags
    Fengguang Wu          36   5.8%
    Dan Carpenter         27   4.4%
    Andrey Konovalov      17   2.7%
    Peter Zijlstra        14   2.3%
    Eric Biggers          12   1.9%
    Arnd Bergmann         11   1.8%
    Michael Ellerman      11   1.8%
    Stephen Rothwell       9   1.5%
    kbuild test robot      7   1.1%
    Eric Dumazet           6   1.0%
    Dmitry Vyukov          6   1.0%
    Jianlin Shi            6   1.0%
    Christoph Hellwig      5   0.8%
    Geert Uytterhoeven     5   0.8%
    Ingo Molnar            5   0.8%
    Tetsuo Handa           5   0.8%
    kernel test robot      5   0.8%
    Mathias Kresin         5   0.8%

  Tested-by tags
    Andrew Bowers             51   8.4%
    Richard Scobie            21   3.5%
    Arnaldo Carvalho de Melo  19   3.1%
    Laurent Pinchart          16   2.6%
    Thierry Reding            16   2.6%
    Andrey Konovalov          15   2.5%
    Laura Abbott              14   2.3%
    Eric Biggers              12   2.0%
    Jasmin Jessich            12   2.0%
    Dietmar Spingler          12   2.0%
    Manfred Knick             12   2.0%
    Aaron Brown               12   2.0%
    Marcin Wojtas             12   2.0%
    Pavel Machek               8   1.3%
    Philippe Cornu             7   1.2%
    Stan Johnson               7   1.2%
    Heiko Stuebner             6   1.0%
    Ondrej Zary                6   1.0%

In truth, the report credits for Fengguang Wu and the two "test robot" entries probably belong together, but they were credited separately in the patches.

These numbers, of course, greatly understate the amount of testing and reporting that is happening in the kernel community. The requisite tags do not always get added to patches as they should and, in the case of testing, many testers do not make themselves known in the first place if things work for them. That said, the tags that do exist record real work done, and the kernel is better for it.

Overall, the kernel development process would appear to continue to run relatively smoothly, despite the occasional hiccup. There are few projects that would be able to integrate changes at this rate and produce a result that will be the base for countless deployed systems in the coming years.

Comments (11 posted)

There is a lot of information buried in the kernel's Git repositories that, if one looks closely enough, can yield insights into how the development community works in the real world. It can show how the idealized hierarchical model of the kernel development community matches what actually happens and provide a picture of how the community's web of trust is used to verify contributions. Read on for an analysis of the merge operations that went into the 4.14 development cycle.

The diagram to the right was generated from the commits merged for the 4.14 release, through 4.14-rc5. In short, it shows all of the subsystem trees that were pulled into the mainline and the number of patches that flowed out of each.

LWN has posted these diagrams a couple of times in the past, for the 2.6.29 and 4.4 development cycles. They have always shown a structure that is far flatter than the hierarchical maintainer model would suggest. In the real world, mid-level maintainers are relatively rare; most maintainers send pull requests directly to Linus Torvalds. Doing so helps to get changes into the mainline more quickly; that is why, for example, some security-module maintainers recently decided to bypass the security maintainer and push their trees directly to Torvalds.

That said, the hierarchy shows more clearly than it has in past years. A number of subsystems are growing to the point where there needs to be some overall higher-level coordination. So there are more two- and three-level trees than there used to be. As the kernel community continues to grow, it will almost certainly need to add more mid-level maintainers.

Signing of pull requests

Diagrams like this one can be interesting to look at just to see how work is flowing through the system. But they can also be used to reveal semi-hidden aspects of how that work is being done. This time around, your editor has decided to put a focus on the security of the process.

Shortly after the 3.0 kernel was released, it was revealed that kernel.org, where many kernel developers (including Torvalds) keep their repositories, had been broken into. This episode brought the merging of patches to a halt for some time and delayed the 3.1 release by some months; it also created a great deal of concern over the possibility that somebody's repository might have been corrupted in an attempt to get malicious code into the mainline kernel. No evidence of that happening ever turned up, but the realization that it maybe could have happened drove a number of changes in the development community.

One of those changes was the establishment of a web of trust among kernel developers; at the 2011 Kernel Summit in Prague, an initial key-signing ritual was held to bootstrap that web. The ability to GPG-sign commits and tags was added to Git. One need merely tag the commit at the head of a series to be pulled with a command like:

git tag -s fixes-for-linus

and request that the fixes-for-linus tag be pulled. If the receiving maintainer pulls with the --verify-signatures option, Git will ensure that a valid signature exists before doing the merge.
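
A maintainer wanting to enforce that check might pull with something like the following (the repository URL here is hypothetical):

git pull --verify-signatures git://git.example.org/linux.git fixes-for-linus

If the tag cannot be verified against a known key, the merge will be refused.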

The idea was that developers would sign their repositories before sending pull requests, allowing upstream maintainers to verify that those pull requests corresponded to legitimate streams of development. Even if an attacker could put up a convincing copy of a developer's repository (or somehow add a malicious commit to a real repository) and send a fake pull request, the attack would not succeed because the attacker would not be able to attach a proper signature to the relevant tag.

This system has been in place for six years now, and many developers routinely sign tags for outgoing commits and verify signatures when pulling from others. But do they all do so? It is possible to find out. When a signed commit or tag is pulled into a repository, the signature is stashed into the merge commit, allowing the provenance of the changes to be verified at a later date. That also makes it possible to examine the merges in the kernel repository and see how many of them carry signature information.
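
Those embedded signatures can be inspected after the fact; for example, a command like:

git log --show-signature --merges v4.13..v4.14-rc5

will print the GPG verification status for each merge commit in that range.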

Referring back to the tree plot on the right, one will see that some repositories are shown in black boxes, while others use red boxes. The repositories in red are those from which no signed merges happened during the period in question. The results show that, while many developers do sign their tags before sending changes upstream, quite a few do not.

More to the point, the repository that sends more traffic into the mainline than any other — networking — makes almost no use of signatures anywhere in the chain. The "tip" tree (containing x86 and core-kernel work) is another significant tree that does not employ signatures, as is the linux-block tree. Neither the security tree nor the crypto tree employs cryptographic signatures. Pull requests from the graphics tree into the mainline are signed, but many of the trees feeding into graphics do not use signatures. On the other hand, some high-volume trees, such as arm-soc, have almost complete signature coverage from the leaves up to the mainline.

Years of traffic on the kernel mailing lists suggest that maintainers rarely ask for signatures to be added to pull requests that lack them. Torvalds will typically demand a signature when the tree being pulled is hosted on a public service like GitHub, but is otherwise happy to pull from unsigned tags. He does verify signatures when they do exist, though. Few other maintainers require (or even mention) signatures at all.

Your editor asked around a bit to get a sense for why some maintainers are not using signed tags. The answer was typically along the lines of "I never got around to incorporating them into my workflow". One maintainer admitted that he had probably forgotten the passphrase for his GPG key by now and would have to create a new one to be able to start signing tags. The problem, if there is one, is not any real hostility to the idea of signed commits. It is just that, since signatures are not required, many busy subsystem maintainers have not made the effort to start using them.

The result is that the kernel has a web of trust that, one might fairly conclude, is not really protecting much. It's nice to have the verification on pull requests that do carry signatures but, since those signatures seem to be almost entirely optional at present, they offer little protection against a malicious pull request.

If the intent of signed tags is limited to enabling developers to host repositories on untrusted services, then perhaps signature checking as it is practiced now is sufficient. Perhaps the threat model need not include more sophisticated attackers trying to sneak vulnerabilities into the kernel via some developer's tree on a well-run site. After all, kernel.org itself seems relatively well protected these days, and kernel developers have demonstrated that, like developers of most other projects, they are entirely capable of introducing security bugs at a sufficient rate without external assistance.

But if the intent is to make the kernel development process resilient against attacks on developers' machines or kernel.org, then there is some work yet to be done. It is worth remembering that the web of trust came about as a response to a compromise of kernel.org, after all. If we want to prepare for a recurrence of that sort of incident, the actual threat model needs to be defined, and the use of protective techniques like signed tags should probably not be optional. Partially implemented security mechanisms have a distressing tendency to fail when put to the test.

(The plot in this article was generated with the treeplot tool, which is part of the gitdm collection of hacks hosted at git://git.lwn.net/gitdm.git.)



Comments (13 posted)

One of the key values provided by an operating system like Linux is that it provides abstract interfaces to concrete devices. Though the original "character device" and "block device" abstractions have been supplemented with various others including "network device" and "bitmap display", the original two have not lost their importance. The block device interface, in particular, is still central to managing persistent storage and, even with the growth of persistent memory, this central role is likely to remain for some time. Unpacking and explaining some of that role is the goal of this pair of articles.

The term "block layer" is often used to talk about that part of the Linux kernel which implements the interface that applications and filesystems use to access various storage devices. Exactly which code constitutes this layer is a question that reasonable people could disagree on. The simplest answer is that it is all the code inside the block subdirectory of the Linux kernel source. This collection of code can be seen as providing two layers rather than just one; they are closely related but clearly distinct. I know of no generally agreed names for these sub-layers and so choose to call them the "bio layer" and the "request layer". The remainder of this article will take us down into the former while the latter will be left for a subsequent article.

Above the block layer

Before digging into the bio layer it will be useful to provide context by describing the parts of Linux that sit just above the block layer. "Above" in this sense means closer to user-space (the top) and further from hardware (the bottom) — it covers the clients that might use the services provided by the block layer and below.

Access to block devices generally happens through block special devices in /dev, which map to S_IFBLK inodes in the kernel. These inodes act a little bit like symbolic links in that they don't represent the block device directly but simply contain a pointer to the block device as a "major:minor" number pair. Internally, the i_bdev field in the inode contains a link to a struct block_device that represents the target device. This block device holds a reference to a second inode: block_device->bd_inode. This second inode is more closely involved in I/O to the block device; the original inode in /dev is just a pointer.

The main role that this second inode plays (which is implemented in fs/block_dev.c, fs/buffer.c, and elsewhere) is to provide a page cache. When the device file is opened without the O_DIRECT flag, the page cache associated with the inode is used to buffer reads, including readahead, and to buffer writes, usually delaying writes until the normal writeback process flushes them out. When O_DIRECT is used, reads and writes go directly to the block device. Similarly, when a filesystem mounts a block device, reads and writes from the filesystem usually go directly to the device, though some filesystems (particularly the ext* family) can access the same page cache (traditionally known as the buffer cache in this context) to manage some of the filesystem data.
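
As a brief illustration of the O_DIRECT path, the following user-space sketch reads one block straight from a device; it is a minimal example, assuming a hypothetical /dev/sdb with a 4096-byte logical block size:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        void *buf;
        int fd = open("/dev/sdb", O_RDONLY | O_DIRECT);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* O_DIRECT transfers must be aligned to the device's logical
         * block size, for both the buffer address and the length. */
        if (posix_memalign(&buf, 4096, 4096)) {
                close(fd);
                return 1;
        }
        if (read(fd, buf, 4096) < 0)    /* bypasses the page cache */
                perror("read");
        free(buf);
        close(fd);
        return 0;
}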

Another open() flag of particular relevance to block devices is O_EXCL. Block devices have a simple advisory-locking scheme whereby each block device can have at most one "holder". The holder is specified when activating the block device (e.g. via blkdev_get() or a similar call in the kernel); the activation will fail if a different holder has already claimed the device. Filesystems usually specify a holder when mounting a device to ensure exclusive access. When an application opens a block device with O_EXCL, the newly created struct file is used as the holder; the open will fail if a filesystem is mounted from the device. If the open is successful, it will block future mount attempts as long as the device remains open. Using O_EXCL doesn't prevent the block device from being opened without O_EXCL, so it doesn't prevent concurrent writes completely; it just makes it easy for applications to test whether the block device is in use.
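
The advisory nature of O_EXCL can be seen with a similar sketch (again assuming a hypothetical /dev/sdb):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* The open fails if a filesystem is mounted from the device
         * or another holder has already claimed it. */
        int fd = open("/dev/sdb", O_RDWR | O_EXCL);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* While fd remains open, mount attempts on the device will
         * fail, but opens without O_EXCL still succeed. */
        close(fd);
        return 0;
}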

Whichever way a block device is accessed, the primary interface involves sending read or write requests, or various other requests such as discard, and eventually getting a reply. This interface is provided by the bio layer.

The bio layer

All block devices in Linux are represented by struct gendisk — a "generic disk". This structure doesn't contain a great deal of information and largely serves as a link between the filesystem interface "above" and the lower-layer interface "below". Above the gendisk sit one or more struct block_device structures, which, as we already saw, are linked from inodes in /dev. A gendisk can be associated with multiple block_device structures when it has a partition table. There will be one block_device that represents the whole gendisk, and possibly some others that represent partitions within the gendisk.

The "bio" that gives its name to the bio layer is a data structure ( struct bio ) that carries read and write requests, and assorted other control requests, from the block_device , past the gendisk, and on to the driver. A bio identifies a target device, an offset in the linear address space of the device, a request (typically READ or WRITE), a size, and some memory where data will be copied to or from. Prior to Linux 4.14, the target device would be identified in the bio by a pointer to the struct block_device . Since then it holds a pointer to the struct gendisk together with a partition number, which can be set by bio_set_dev() . This is more natural given the central role of the gendisk structure.

Once constructed, a bio is given to the bio layer by calling generic_make_request() or, equivalently, submit_bio(). This does not normally wait for the request to complete, but merely queues it for subsequent handling. generic_make_request() can still block for short periods of time, to wait for memory to become available, for example. A useful way to think about this behavior is that it might wait for previous requests to complete (e.g. to make room on the queue), but not for the new request to complete. If the REQ_NOWAIT flag is set in the bi_opf field, generic_make_request() shouldn't wait at all if there is insufficient space and should, instead, cause the bio to complete with the status set to BLK_STS_AGAIN or, possibly, BLK_STS_NOTSUPP. As of this writing, this feature is not yet implemented correctly or consistently.
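
Putting the pieces described so far together, a sketch of building and submitting a bio with these interfaces might look like the following. It is illustrative only; my_end_io() and read_one_page() are invented names, and error handling is omitted:

#include <linux/bio.h>
#include <linux/blkdev.h>

static void my_end_io(struct bio *bio)
{
        if (bio->bi_status)             /* filled in before bio_endio() ran */
                pr_err("read failed\n");
        bio_put(bio);
}

static void read_one_page(struct block_device *bdev, sector_t sector,
                          struct page *page)
{
        struct bio *bio = bio_alloc(GFP_KERNEL, 1);     /* one segment */

        bio_set_dev(bio, bdev);                 /* gendisk plus partition number */
        bio->bi_iter.bi_sector = sector;        /* offset within the device */
        bio->bi_opf = REQ_OP_READ;
        bio->bi_end_io = my_end_io;
        bio_add_page(bio, page, PAGE_SIZE, 0);
        submit_bio(bio);                        /* queues the bio; does not wait */
}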

The interface between the bio layer and the request layer requires devices to register with the bio layer by calling blk_queue_make_request() and passing a make_request_fn() function that takes a bio. generic_make_request() will call that function for the device identified in the bio. The function must arrange things such that, when the I/O request described by the bio completes, the bi_status field is set to indicate success or failure and bio_endio() is called; that, in turn, will call the bi_end_io() function stored in the structure.
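
A skeletal bio-based driver, under the same caveats (my_make_request() and my_init_queue() are invented, and a real driver would do the I/O asynchronously), might register itself like this:

static blk_qc_t my_make_request(struct request_queue *q, struct bio *bio)
{
        /* ... perform or queue the I/O described by the bio ... */

        bio->bi_status = BLK_STS_OK;    /* success, or an error code */
        bio_endio(bio);                 /* invokes bio->bi_end_io() */
        return BLK_QC_T_NONE;
}

static struct request_queue *my_init_queue(void)
{
        struct request_queue *q = blk_alloc_queue(GFP_KERNEL);

        if (q)
                blk_queue_make_request(q, my_make_request);
        return q;
}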

The two most interesting features of the bio layer, beyond the simple handling of bio requests already described, are the recursion avoidance and the queue plugging.

Recursion avoidance

It is quite possible for the use of virtual block devices such as "md" (used for software RAID) and "dm" (used, for example, by LVM2) to result in a stack of block devices, each of which modifies a bio and sends it on to the next device in the stack. A simple implementation of this would cause a large stack of devices to result in excessive use of the kernel's call stack. In the distant past (before Linux 2.6.22) this would sometimes cause problems, particularly when the bio was submitted by a filesystem that was already using a considerable amount of the stack.

Instead of allowing this recursion, generic_make_request() detects when it is being called recursively and does not pass the bio down to the next layer. Instead it queues the bio internally (using current->bio_list in the struct task_struct for the current process) and submits it only after the parent bio submission completes. As generic_make_request() is not expected to wait for the bio to complete, it is normally safe to not even start processing immediately.

This recursion avoidance often works perfectly, but it sometimes leads to deadlocks. The key to understanding these deadlocks is the observation made above that submission of a bio (i.e. the make_request_fn() called by generic_make_request()) is permitted to wait for previously submitted bios to complete. If it waits for a bio that is still on the current->bio_list queue managed by generic_make_request(), then it will wait forever.

The dependencies that cause one bio to wait for an earlier one are often subtle and usually found through testing rather than code inspection. A simple example involves the occasional need to split a bio using a mempool. If a bio is submitted to a device that has limits on the size or alignment of I/O requests, the make_request_fn() might choose to split the bio into two parts which are handled separately. The bio layer provides functions (bio_split() and bio_chain()) that make this quite easy to do, but the operation requires that a second bio structure be allocated. Allocating memory must be done cautiously in the block layer since, when there is a shortage of free memory, a key strategy used by Linux is to write out dirty pages, through the block layer, so they can then be discarded. If that write-out needs to wait for memory to be allocated, it can cause problems. A standard mechanism is to use a mempool, which pre-allocates a small amount of memory for a specific purpose. Allocating from a mempool may wait for previous users of the mempool to return the memory they used, but will not wait for general memory reclaim to finish. When a mempool is used to allocate bios, this waiting can introduce the sort of dependency that can cause generic_make_request() to deadlock.

There have been several attempts to provide an easy way to avoid these deadlocks. One is embodied in the "bioset" processes that you might see in a ps listing. This mechanism focuses specifically on the deadlock scenario described above and allocates a "rescuer" thread for each mempool used for allocating bio structures. If an allocation attempt cannot be easily satisfied, any bios from the same bioset that are in the current->bio_list queue are handed to the bioset thread for processing. This approach is fairly heavy-handed, resulting in the creation of many threads that are almost never used, and only addresses one particular deadlock scenario. Most, if not all, deadlock scenarios involve splitting bios into two or more parts, but they don't always involve waiting on mempool allocation.

Recent kernels only depend on this for a few isolated cases and generally avoid creating the bioset thread when it isn't needed. Instead, an alternate approach, which was introduced by changes to generic_make_request() in Linux 4.11, is used. It is more general and imposes less overhead on a running system, but instead places requirements on how drivers are written.

The main requirement is that when a bio is split, one of the halves should be submitted directly to generic_make_request() so that it can be handled at the most appropriate time. The other half may be processed in whatever way is appropriate. This gives generic_make_request() a little more control over what happens. It makes use of this control by sorting all the bios based on how deep in the device stack they were submitted. It then always handles bios destined for lower devices before upper devices. This simple expedient removes all the annoying deadlocks.
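
Schematically, the pattern is the one used by the kernel's own blk_queue_split(); in this sketch, max_sectors and the enclosing driver context are assumed:

struct bio *split = bio_split(bio, max_sectors, GFP_NOIO, q->bio_split);

bio_chain(split, bio);          /* make the remainder the parent of split */
generic_make_request(bio);      /* resubmit the remainder directly */
bio = split;                    /* the driver handles the front portion now */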

Device queue plugging

Storage devices often have significant per-request overheads, so it can be more efficient to gather a batch of requests together and submit them as a unit. When the device is relatively slow it will often have a large queue of pending requests and that queue provides plenty of opportunity for identifying suitable batches. When a device is quite fast, or when a slow device is idle, there is less opportunity to find batches naturally. To address this challenge, the Linux block layer has a concept called "plugging".

Originally, plugging applied only to an empty queue. Before submitting a request to an empty queue, the queue would be plugged so that no requests could flow through to the underlying device for a while. Bios submitted by the filesystem could then queue up and allow batches to be identified. The queue would be unplugged explicitly by the filesystem requesting it, or implicitly after a short timeout. The hope was that, by that time, some suitable batches would have been found, and that the small delay in starting work would be more than compensated for by the larger batches ultimately submitted. Since Linux 2.6.39, a new plugging mechanism has been in place that works on a per-process basis rather than per-device. This scales better on multi-CPU machines.

When a filesystem or other client of a block device submits requests, it will normally bracket a collection of generic_make_request() calls with blk_start_plug() and blk_finish_plug(). This sets up current->plug to point to a data structure that can contain a list of struct blk_plug_cb entries (and also a list of struct request, which we will find out more about in the next article). As these lists are per-process, entries can be added without any locking. The make_request_fn() that is given individual bios can choose to add a bio to a list in the plug if that might allow it to work more efficiently.
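
A typical submitter, sketched with the same caveats, looks like:

struct blk_plug plug;

blk_start_plug(&plug);          /* current->plug now points at plug */
/* ... submit a batch of bios with submit_bio() ... */
blk_finish_plug(&plug);         /* process everything that was held back */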

When blk_finish_plug() is called, or whenever the process calls schedule() (such as when waiting for a mutex, or when waiting for memory allocation), each entry stored in current->plug is processed. This processing will complete everything that the driver would have done if it had not decided to add the bio to the plug list, or if no plug has been enabled.

The fact that the plug is processed from schedule() calls means that bios are only delayed while new bios are being produced. If the process blocks to wait for anything, the list assembled so far is processed immediately. This protects against the possibility that the process might be waiting for a bio that has already been submitted, but is currently queued behind the plug.

Performing the plugging at the process level like this maintains the benefit that batches of related bios are easy to detect and keep together, and adds the benefit that locking can be reduced. Without this per-process plugging a spinlock, or at least an atomic memory operation, would be needed to handle every individual bio. With per-process plugging, it is often possible to create a per-process list of bios, and then take the spinlock just once to merge them all into the common queue.

Bio and below

In summary, the bio layer is a thin layer that takes I/O requests in the form of bio structures and passes them directly to the appropriate make_request_fn() function. It provides various support functions to simplify splitting bios and scheduling the sub-bios, and to allow plugging of the queue. It also performs some other simple tasks such as updating the pgpgin and pgpgout statistics in /proc/vmstat, but mostly it just lets the next level down get on with its work.

Sometimes the next layer down is the final driver, as with drbd (the Distributed Replicated Block Device) or brd (a RAM-based block device). More often the next layer is an intermediate layer, such as the virtual devices provided by md and dm. Probably the most common case is when that intermediate layer is the remainder of the block layer, which I have chosen to call the "request layer". Some of the intricacies of that layer will be the topic of the second part of this overview.

Comments (16 posted)

Academics generate enormous amounts of software, some of which inspires commercial innovations in networking and other areas. But little academic software gets released to the public and even less enters common use. Is some vast "dark matter" being overlooked in the academic community? Would the world benefit from academics turning more of their software into free and open projects?

I asked myself these questions a few months ago when Red Hat, at its opening of a new innovation center in Boston's high-tech Fort Point neighborhood, announced a unique partnership with the goal of tapping academia. Red Hat is joining with Boston-area computer science departments—starting with Boston University—to identify promising software developed in academic projects and to turn it into viable free-software projects. Because all software released by Red Hat is under free licenses, the partnership suggests a new channel by which academic software could find wider use.

This article looks at some successful academic projects that have entered mainstream use—ranging across computer history from the Berkeley Software Distribution (BSD) to Jupyter notebooks—and looks for the factors that might help make that transition work. The projects that I covered suggest the following rules of thumb:

Academics, working in the reward system of academia, are not likely to carry through the conversion of software from a research project to a viable product of interest to a broad community.

Funding, usually from government agencies or foundations, is key to the creation of high-quality products that can be widely adopted. This funding can help support the conversion to a useful free-software project.

It helps a lot if the target users share some of the values and technical knowledge of the project leaders.

Infrastructure software tends to succeed more than application-level projects, perhaps because it has a broader appeal.

We'll start off with some observations from free software advocates about why it's so hard to derive production-ready software from academic research.

Software success ≠ academic success

Academics are obsessed with publication. And almost universally, the academic publishers are interested only in research findings, such as: "Packets of a certain size maximize transmission throughput under such-and-such conditions." The publications do not include related data or source code, which are considered, at best, as ancillary and, all too often, as junk. (Things are changing a bit here, especially for government-funded projects, but mostly in the demand for open data, not open source code.)

The irony—and even tragedy—of this disregard for the infrastructure that makes their findings possible is that academics skimp on software quality measures, such as testing. Quite often, bugs in the software cause researchers to publish incorrect results.

In order to get code widely adopted by people outside the academic setting where it is invented, professors or students must first develop the code conscientiously to ensure that it's robust, extendable, and correctly solves the problem at hand. Then they must iron out the idiosyncrasies of the project for which the code was developed, generalizing it for a broader range of domains and purposes. They have to maintain a repository, solicit contributions, and vet those contributions. Ultimately, someone has to run a community that can debate changes and choose new directions. None of those time-consuming tasks has anything to do with publication or tenure. Academics are pressured to get interesting results, put out a paper, and move on to the next project.

James Vasile, a programmer and consultant in the open-source space, notes that academics are biased toward secrecy in their code, as in their research. Woe to them if they release code early in their research that helps competing scientists reach conclusions earlier and get published first. To prevent this career killer, they hold on to the code until they publish their paper or present their conference session. That could be years after they wrote the code, which is pretty late to create a public repository and develop a community around it.

Generally, Vasile told me, it's more likely for academics to share tools that enable their research as infrastructure, but that aren't the main goal of their research. The possibility of turning infrastructure into open source has parallels in commercial firms in a phenomenon I labeled "closed core" six years ago. Companies often want to keep their essential business software secret much like academics want to hold the code for their experiments close to the vest.

Vasile mentioned several other barriers that make it hard to develop open-source projects from academic code. Academics work slowly on code, not at the pace of a professional team. When organizations try to help through donations, universities take a big bite—often half—out of the funds. Finally, many of the tasks required to make software robust and useful are not academically interesting.

Marshall Kirk McKusick, one of the early developers and maintainers of BSD, furnished the additional insight that students usually haven't had the time to develop the skills of maintainable, extendable coding, so their work is quick-and-dirty and unsuitable for reuse. It may also contain useless stub code for features that were never fully developed and that probably never will be.

Additional barriers to freeing code were pointed out by Jeffrey Spies, co-founder and CTO of the Center for Open Science. Researchers rarely think about the traits of software that make it suitable for widespread adoption, such as maintainability or documentation. And even those who make the code available in a public repository have little incentive to foster a community around the software.

High-quality software development requires hiring high-quality software developers, which is difficult in academic settings. Good developers want to work in an environment that appreciates their contributions, a trait unlikely to attract them to academic environments that discount the importance of software quality.

Vasile, McKusick, and Spies presented daunting prospects for successful deployment of academic code. But some projects manage to surmount the hurdles and become free-software successes. Let's look at a few, and try to tease out what helped them succeed.

BSD

BSD, which came out of the University of California, Berkeley, is the crowning success of academically conceived software. McKusick provided a history of BSD's growth and adoption for the O'Reilly Media book Open Sources: Voices from the Open Source Revolution.

One factor that probably allowed BSD to gain wide adoption was its audience of system administrators, who had the skills to install a complicated and sophisticated piece of software on bare metal. Many in the community also submitted contributions to the code. For instance, improvements by the community made 4.2BSD networking more efficient and robust than the code that was originally contributed by Bolt, Beranek, and Newman, which had achieved fame by developing the foundations of the Internet for ARPA (the original name of DARPA). McKusick told me that hundreds of contributors were involved in developing BSD.

Was BSD denied access to sufficient capital? It seems to have been mostly a project of the Berkeley computer science department, although McKusick's article cites funding from DARPA during the critical transition from 3BSD to 4BSD. I see no record of Sun Microsystems, which used BSD as the basis of its SunOS operating system, ever giving money to the project, although it contributed a good deal of code and bug fixes.

Spark

We move now to another project that began at UC Berkeley, but in a very different time and context. A successor to the ground-breaking Hadoop, the Apache Spark cluster-computing engine is part of most "big data" strategies now. Among the projects I've researched for this article, Spark is probably closest to the kind of project that Red Hat will sponsor.

I spoke to one of the early organizers of the Spark project, Patrick Wendell, who left UC Berkeley along with some other team members to found the company Databricks, where he is now VP of Engineering. Wendell told me that Spark was a brainchild of an atypical research group at Berkeley called the AMPLab, where five or six faculty work with about 35 students at any one time on big data processing tools. The AMPLab had both public and private sponsorship, and researchers there were expected to produce software of use to a wide industry audience—as Wendell said, it's "baked into their philosophy." Although projects don't have to be released as free software, many researchers do so to gain the benefits of wide adoption and contributions from the field. For instance, the Berkeley team donated Spark to the Apache Software Foundation in 2013 and built a community of developers outside the AMPLab.

Hence, Wendell said, academics in the AMPLab can have impacts in ways that go beyond publishing papers. They measure success by the broad adoption of their work, not only by insights that get into conferences or journals. He agreed that building community and fixing bugs were not the most efficient path to publication, but for some academics that's fine. Working in the AMPLab does not preclude academic success either—for instance, the Spark project has generated lots of academic publications. Matei Zaharia, founder of the project, took a sabbatical to co-found Databricks but then returned to academia, where he is an assistant professor in the Stanford CS department.

PostgreSQL

This database-management system, perhaps as much as any free software, demonstrates how much can be achieved by developers in an open community. The long and complex history of the project, summarized on a project web page, involved a couple of failed commercialization efforts. But the code of the current PostgreSQL seems to be derived entirely from academic and community efforts, where a Berkeley database project called Ingres inspired another called Postgres, the genesis of modern PostgreSQL. The original developers were part of the same constellation of Berkeley researchers responsible for BSD.

But, as the history page notes: "In 1996, Postgres95 departed from academia". And in a podcast interview, Bruce Momjian suggested that not much of the work on modern PostgreSQL was done at Berkeley, and that PostgreSQL was really a community project from 1996 onward (2:42 into the podcast). He also highlighted the role played by college professors in the PostgreSQL community.

Paradoxically, the academicians don't seem to contribute much to the code, a recalcitrance that Momjian attributes to their lack of interest in practical use (6:30 into the podcast), echoing my conversations with Vasile and Spies. Major funding seems to have come late in the project. Recently, according to Momjian, a number of "big players" have offered support, including IBM, Amazon, and Microsoft (12:57 into the podcast).

Jupyter

Supple enough to be valuable to educators, conference presenters, and general authors alike in the computer field, Jupyter emerged from academic researchers at Cal Poly State University, San Luis Obispo and UC Berkeley. It has brought information presentation into the modern era of multimedia, interactivity, and collaboration. Originally designed to display and run Python code (and called IPython), it was eventually extended so that other computer programming languages could be supported, and its name was changed to Jupyter (still keeping "py" in the name to honor its Python roots and implementation). Jupyter is a central tool in use at my own employer, O'Reilly Media, as was described in a video keynote; it has many other users as well.

In an interview with one of Jupyter's earliest developers, Brian E. Granger, I learned that the Python origins of the project were crucial for historical reasons. Scientists, after years of using proprietary tools such as Mathematica and MATLAB, were turning to powerful Python libraries such as SciPy, NumPy, and the many modules that rely on them. According to Granger, these libraries were developed in the early 2000s but were not ready for production use until later in the decade. Two other advantages enhanced their popularity: being cost-free and being easy to mix with other Python libraries for other tasks. Once the Python libraries became fixtures of many fields in science and engineering—particularly the new field that came to be known as data science—their users were open to the interactive educational tools offered by IPython.

It wasn't hard for people outside academia to appreciate IPython. Everybody in the field teaches a course sometimes, or just gives a conference presentation. IPython, and then Jupyter, cut hours from the time it took to put one's code and text into a spiffy presentation form. The project solved several scientific needs at once: repeating experiments in a reliable way, reproducibility of results by other researchers, and teaching or giving talks.

Granger makes no bones about the importance of funding for the success of his project. It benefited quite early from support by Joshua M. Greenberg of the Sloan Foundation, and now it is additionally funded by the Moore Foundation and Helmsley Charitable Trust. The project also has numerous sponsors and institutional partners and gets significant code contributions from about 25 full-time developers.

Conclusion

Each research project that experienced success in the larger software world has found its own path forward. The examples I cited in this article are by no means the end of the story. For instance, the co-founder of the R statistical language, Ross Ihaka, suggested (in a paper) that the developers maintained R for years in a "relatively closed process" and stumbled by necessity onto basic open-source practices such as establishing mailing lists and a group of core committers. This project looks like another example of academic software that could be quickly understood and adopted because the target audience closely resembled the developers and was technically adept.

The Mosaic browser, another historic project, started as a government-funded project of the National Center for Supercomputing Applications (NCSA) at University of Illinois Urbana-Champaign. The triumph of Mosaic was short-lived, however, because the leader of the Mosaic team, Marc Andreessen, soon started the Netscape company and created a far superior browser based on Mosaic's principles.

I haven't covered the Linux kernel or GNU project here, because (in addition to them already being famous) they weren't academic projects, even though Linus Torvalds and Richard Stallman happened to be associated with universities when they launched the projects.

Combining what I heard from project leaders and from the other free-software leaders I interviewed, I suggest that an effort like the Red Hat one I mentioned at the beginning of the article would have the best chance of succeeding by following a few overarching principles. First, choose a project whose value can be quickly understood and embraced by its intended users. Bring in outside experts to evaluate the code for quality to make sure it's worth using; if not, it may make sense to launch a new code base with similar goals. The code must also be easy to generalize and extend. Finally, the project should be taken out of the academic environment as soon as possible (with a payoff to the university, if necessary) and assigned to a project leader who has experience building communities around projects and in recruiting companies or individuals to develop code and all the other infrastructure a free-software project needs.

I'll end with some optimism. Professors and students have routinely turned their ideas into proprietary software. But, given the ease of coding these days, and the resulting commoditization of software, some of these academics are likely to consider making the software free. Apache Spark, discussed earlier, is one example. Another is MapD, a major database project that benefited from advice by Michael Stonebraker, one of the field's leading researchers and entrepreneurs. This company open sourced its core product and has been funded to the tune of 25 million dollars. Fledgling organizations can now turn to organizations such as the Apache Foundation and the Software Freedom Conservancy for organizational advice. In a decade or so, we may know much more about what motivates researchers to open their code, and how they can do so successfully.

[This article is also available in a Portuguese translation by homeyou.]

Comments (28 posted)