Back in August 2016, we noted that the LWN Weekly Edition format had changed little since it was first created in early 1998. We also promised some experimentation at that time. Eight months later, the time has come to roll out one of our bigger experiments. The hope is that the changes will serve the two goals of making LWN more useful to its readers and making it more sustainable to produce. We hope readers will let us know what they think — after reading the reasoning behind the changes.

There are a few issues that we are trying to address at this time. One is that there is an inherent tension between the concept of a "weekly edition" and the need to publish news in a frequent and timely manner. One thing that we have clearly observed over the years is that articles we publish immediately in our daily news stream are more visible and are circulated more widely than those that we reserve for the weekly edition. So anything that runs only in the weekly edition is essentially being hidden from many of our actual and potential readers.

The edition format also tends to de-emphasize our core product: our feature content. Feature articles placed toward the rear of the edition are far less visible than those found in the front. Many readers are clearly happy to page through screens of security updates and kernel patches to get to the next feature article, but that's not the case for everybody. The need to page through content that may have already been seen in the daily stream to find the material that is new in the edition is also not helpful.

One of the objectives behind the original weekly edition design was to force ourselves to cover a wide range of topics every week. That remains as strong a goal as ever, but experience has shown that a bit more flexibility also makes sense. Different parts of our community generate news at different times; the need to fill a set of fixed "slots" can make it hard to focus on the most interesting events at any given time.

Finally, recent experience has shown that it is indeed difficult to find authors who are willing to jump onto the LWN treadmill and commit to writing articles that meet the standards that our readers expect. As a result, we still have an open position here at LWN. While our search has not yet yielded a full-time editor, it has brought some welcome additions to our set of freelance authors, as can be seen in, for example, the recent, well-received coverage from the Netdev and Cloud-Native Computing conferences. We are more than happy to have this coverage, but accommodating it will be easier with a more flexible edition format.

With those thoughts in mind we are, on an experimental basis, making some changes to how the LWN Weekly Edition is published. They include:

Rather than build up a bunch of content and dump it all out on Thursdays, we will work to publish our work steadily over the week as it becomes ready. We will, in other words, become a bit more like most other news-oriented sites in that we'll stop delaying our content and hiding some of it in the weekly edition.

That said, we have a lot of readers who appreciate the weekly edition and use it as their primary means of access to LWN content. So the edition is not going anywhere; it will continue to be published every Thursday, and it will continue to offer an overall view of what has happened in the Linux and free-software community over the previous week. If you only come to LWN on Thursdays, you can continue to do so and will not miss a thing.

The layout of the edition will change, though. We will lead with our strongest product: the feature content that our authors work so hard to create every week. The feature page will be followed by a page of briefer items, many of which, as they always have, consist of pointers to interesting material elsewhere on the net. Finally, the back page will hold the announcements that we have always carried: newsletters, conferences, security updates, kernel patches, etc. We are not there yet, but the intent is for the back page to be the only weekly edition page that carries content which has not yet been seen elsewhere on the site. So readers who follow us every day will eventually be able to skip all but this page without fear of missing something new.

One important thing to note here is that LWN's content mix is not changing — we are not dropping any coverage areas. All that is changing is how that content is organized and presented.

This new format is an experiment; if it truly fails to work out, we can go back to the way things used to be. But we want to run the experiment for a while to see how it works and, undoubtedly, there will be changes to make on the way. There are also internal workflow changes that will have to be made in the coming weeks as we figure out the best way to work in the new mode. Suggestions from readers, in the form of comments to this article or direct email, will be appreciated.

(The curious may be wondering about the decline in subscribers that was reported in the August article. We are happy to say that the situation changed after the publication of that article, and subscriptions are growing again. Thanks to all of you who signed up to support LWN.)

Next January, LWN will complete its 20th year of publication. That is far longer than we ever thought we would be doing this but, at the same time, it often feels like we are just getting started. We will almost certainly be making other changes in the future, aimed at making LWN better and keeping it strong for the next 20 years. But one thing will certainly not change: we remain dedicated to creating the best writing about Linux and free software for the best reader community on the planet. Thanks to all of you for your support; that is what has kept us going for so long.


Linus Torvalds recently let it be known that the 4.11-rc7 kernel prepatch had a good chance of being the last for this development series. So the time has come to look at this development cycle and the contributors who made it happen.

As of this writing, 12,546 non-merge changesets have been pulled into the mainline repository for 4.11, making this cycle more-or-less average for recent kernels. Those changesets were contributed by 1,723 developers and grew the kernel by nearly 300,000 lines. Note that the current record for the most developers participating is 1,729 for 4.9; if another half-dozen developers put in a fix for 4.11, that record could yet fall. Of the developers contributing to 4.11, 278 made their first contribution ever in this cycle.
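Tallies like the ones in this article boil down to counting commit authors. As a rough illustration of the idea (not LWN's actual tooling), one might extract author names with something like git log --no-merges v4.10..v4.11 --pretty=%an and feed them to a short script:

```python
from collections import Counter

def tally_authors(author_names):
    """Count changesets per author and compute each author's share
    of the total, sorted from most to least active."""
    counts = Counter(author_names)
    total = len(author_names)
    return [(name, n, 100.0 * n / total)
            for name, n in counts.most_common()]

# Toy input; a real run would use the full 4.11 author list.
authors = ["Chris Wilson", "Arnd Bergmann", "Chris Wilson", "Ingo Molnar"]
for name, n, pct in tally_authors(authors):
    print(f"{name:20} {n:4} {pct:4.1f}%")
```

The same counting pattern, with a mapping from author to employer, produces the per-company numbers shown later.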

The most active developers in the 4.11 cycle were:

Most active 4.11 developers

By changesets:
    Chris Wilson           226  1.8%
    Arnd Bergmann          160  1.3%
    Ingo Molnar            158  1.3%
    Christoph Hellwig      115  0.9%
    Takashi Iwai           110  0.9%
    Guenter Roeck          101  0.8%
    Bart Van Assche         94  0.7%
    Bhumika Goyal           89  0.7%
    Geert Uytterhoeven      87  0.7%
    Ville Syrjälä           86  0.7%
    James Hogan             83  0.7%
    Johan Hovold            80  0.6%
    Nikolay Borisov         79  0.6%
    Andy Shevchenko         77  0.6%
    Colin Ian King          77  0.6%
    Laurent Pinchart        75  0.6%
    Julia Lawall            74  0.6%
    Florian Fainelli        72  0.6%
    Eric Dumazet            71  0.6%
    Daniel Vetter           70  0.6%

By changed lines:
    James Smart          14288  2.2%
    Ard Biesheuvel       14215  2.2%
    Selvin Xavier        13393  2.1%
    Greg Kroah-Hartman   11705  1.8%
    David VomLehn        11085  1.7%
    Rob Rice              9539  1.5%
    Eric Anholt           9460  1.5%
    Jakub Kicinski        9024  1.4%
    Chris Wilson          8986  1.4%
    Chad Dupuis           8829  1.4%
    Ingo Molnar           7753  1.2%
    Balbir Singh          7451  1.1%
    Maxime Ripard         7110  1.1%
    Ursula Braun          6666  1.0%
    Christoph Hellwig     5740  0.9%
    Rex Zhu               5719  0.9%
    Paul E. McKenney      5497  0.8%
    Jiri Pirko            5407  0.8%
    Quinn Tran            5331  0.8%
    Takashi Iwai          4892  0.8%

In the "by changesets" column, Chris Wilson ended up on top with a body of work mostly focused on the Intel i915 driver. Arnd Bergmann continues to apply fixes all over the tree, Ingo Molnar (primarily) contributed a massive reworking of the sched.h header file, Christoph Hellwig did significant work all over the block I/O subsystem, and Takashi Iwai added many patches as part of his role as maintainer of the audio subsystem.

In the "by changed lines" column, James Smart did a lot of work on the lpfc SCSI driver. Ard Biesheuvel's work was mostly in optimized crypto algorithms for the ARM architecture, Selvin Xavier contributed three patches adding the "bnxt_re" RDMA-over-converged-Ethernet driver, Greg Kroah-Hartman deleted some unwanted staging code, and David VomLehn added the AQtion network driver.

The developers contributing to 4.11 were supported by at least 225 employers, the most active of which were:

Most active 4.11 employers

By changesets:
    Intel                     1608  12.8%
    (Unknown)                 1071   8.5%
    Red Hat                    955   7.6%
    (None)                     798   6.4%
    Linaro                     624   5.0%
    IBM                        493   3.9%
    SUSE                       482   3.8%
    Google                     375   3.0%
    (Consultant)               349   2.8%
    Samsung                    303   2.4%
    Broadcom                   296   2.4%
    Mellanox                   269   2.1%
    Oracle                     231   1.8%
    AMD                        207   1.6%
    Renesas Electronics        197   1.6%
    Huawei Technologies        197   1.6%
    Facebook                   187   1.5%
    Imagination Technologies   160   1.3%
    Canonical                  156   1.2%
    Code Aurora Forum          146   1.2%

By lines changed:
    Intel                    80069  12.3%
    Broadcom                 48163   7.4%
    (Unknown)                44046   6.8%
    (None)                   43311   6.6%
    Linaro                   37531   5.8%
    IBM                      33286   5.1%
    Red Hat                  31955   4.9%
    Cavium                   28312   4.3%
    (Consultant)             24974   3.8%
    SUSE                     15020   2.3%
    ST Microelectronics      14709   2.3%
    Mellanox                 14474   2.2%
    Google                   12185   1.9%
    Linux Foundation         11789   1.8%
    AMD                      11397   1.7%
    Samsung                  10845   1.7%
    Free Electrons           10258   1.6%
    Code Aurora Forum         9455   1.5%
    Netronome Systems         9172   1.4%
    Facebook                  9023   1.4%

This table has gotten pretty boring over the years; it tends not to change much from one cycle to the next.

Developing code is important, but so are reviewing, testing, and reporting bugs. The kernel process has long had the mechanisms to track these contributions, though they are not as heavily used as they could be. The Reviewed-by tag records a reviewer's contribution explicitly; the reviewers most credited in this way in 4.11 were:

Top reviewers in 4.11
    Hannes Reinecke        126
    Joonas Lahtinen        125
    Alex Deucher           106
    Christoph Hellwig      106
    Chris Wilson            92
    Johannes Thumshirn      77
    Geert Uytterhoeven      68
    Andreas Dilger          68
    Christian König         63
    Tvrtko Ursulin          60
    Daniel Vetter           57
    Tomas Henzl             56
    Andy Shevchenko         55
    Laurent Pinchart        50
    Oleg Drokin             50
    Doug Oucharek           46
    Linus Walleij           45
    Liu Bo                  42
    Greg Kroah-Hartman      38
    Darrick J. Wong         38
    Chao Yu                 38
    Josh Triplett           37

Hannes Reinecke's review work was focused on the SCSI subsystem, while Joonas Lahtinen reviewed i915 patches. Each of them managed to review 1% of the total patch flow going into the 4.11 kernel — two patches per day for (what will probably be) a 63-day development cycle.

Of the 12,546 changesets merged for 4.11, 3,099 contained Reviewed-by tags. Needless to say, those tags do not document all of the review work that happened during this development cycle. Much review activity does not result in the addition of a tag of any type. When patches are reviewed by the subsystem maintainer that ultimately applies them, the result is usually a Signed-off-by tag instead. If one looks at those tags when applied by developers who were not the author of the patch, the result is this:

Top non-author signoffs in 4.11
    David S. Miller          1473  12.4%
    Greg Kroah-Hartman        961   8.1%
    Andrew Morton             408   3.4%
    Mark Brown                313   2.6%
    Martin K. Petersen        308   2.6%
    Ingo Molnar               272   2.3%
    Kalle Valo                241   2.0%
    Mauro Carvalho Chehab     231   1.9%
    Doug Ledford              207   1.7%
    Jens Axboe                203   1.7%
    Michael Ellerman          187   1.6%
    Linus Walleij             172   1.4%
    Herbert Xu                166   1.4%
    Alex Deucher              164   1.4%
    Daniel Vetter             152   1.3%
    Jonathan Cameron          129   1.1%
    Rafael J. Wysocki         129   1.1%
    David Sterba              127   1.1%
    Ralf Baechle              126   1.1%
    Ulf Hansson               124   1.0%

Here, it is hard to separate review activity from the overall level of activity in the relevant subsystems. But a rough correlation will certainly exist, meaning that the developers above are looking at a huge number of patches. This work often goes unsung, but it is a crucial part of the kernel development process; without it, the process would not run as smoothly or as quickly as it does now.
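The non-author signoff numbers come down to scanning each commit's tags and discarding the signoff that matches the commit author. A simplified sketch of that logic (the real scripts must also cope with name and email address variations):

```python
from collections import Counter

def count_nonauthor_signoffs(commits):
    """commits: iterable of (author_name, commit_message) pairs.
    Counts Signed-off-by lines whose name differs from the author's."""
    counts = Counter()
    for author, message in commits:
        for line in message.splitlines():
            line = line.strip()
            if line.startswith("Signed-off-by:"):
                # Drop the tag prefix and any trailing <email> part.
                name = line[len("Signed-off-by:"):].split("<")[0].strip()
                if name != author:
                    counts[name] += 1
    return counts

commits = [
    ("Jane Dev",
     "fix foo\n\n"
     "Signed-off-by: Jane Dev <jd@example.com>\n"
     "Signed-off-by: David S. Miller <davem@example.com>"),
]
print(count_nonauthor_signoffs(commits))  # Counter({'David S. Miller': 1})
```

The Reviewed-by, Reported-by, and Tested-by tables below are produced by the same kind of scan, just matching a different tag.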

Testing and bug reporting can also be tracked by tags in the associated patches. Looking at those tags for 4.11 yields this table:

Testing and bug reporting in 4.11

Reported-by tags:
    Dmitry Vyukov          47  6.9%
    Dan Carpenter          47  6.9%
    kbuild test robot      31  4.5%
    Greg Kroah-Hartman     12  1.8%
    Andrey Konovalov       10  1.5%
    Al Viro                 9  1.3%
    Colin Ian King          9  1.3%
    Ben Hutchings           7  1.0%
    Bart Van Assche         7  1.0%
    Jay Vana                7  1.0%
    Russell King            6  0.9%
    Marc Dionne             6  0.9%
    Linus Torvalds          6  0.9%
    David Binderman         6  0.9%
    Geert Uytterhoeven      5  0.7%
    Mike Galbraith          5  0.7%
    Dave Jones              5  0.7%
    Ville Syrjälä           4  0.6%
    Dexuan Cui              4  0.6%
    Anton Blanchard         4  0.6%

Tested-by tags:
    Andrew Bowers              70  10.0%
    Tomasz Nowicki             17   2.4%
    Bharat Bhushan             15   2.2%
    Arnaldo Carvalho de Melo   14   2.0%
    Krishneil Singh            14   2.0%
    Larry Finger               13   1.9%
    Mark Rutland               12   1.7%
    Florian Vaussard           10   1.4%
    Stan Johnson               10   1.4%
    Aaron Brown                10   1.4%
    David Lechner               9   1.3%
    Omar Sandoval               8   1.1%
    Jarkko Sakkinen             8   1.1%
    Xiaolong Ye                 8   1.1%
    Jeremy McNicoll             8   1.1%
    Neil Armstrong              8   1.1%
    Geert Uytterhoeven          7   1.0%
    Stefan Wahren               7   1.0%
    Y.C. Chen                   7   1.0%
    Laurent Pinchart            6   0.9%

Again, most of the activity out there does not result in explicit tags, especially when it comes to testing — many testers will say nothing if they do not run into problems. Regardless of whether testing and reporting are credited or not, they are critical to our ability to deliver a solid kernel over a nine- or ten-week development cycle.


At the 2017 Vault conference for Linux storage, Ted Ts'o and Abutalib Aghayev gave a talk on some work they have done to make the ext4 filesystem work better on shingled magnetic recording (SMR) drives. Those devices have some major differences from conventional drives, such that random writes perform quite poorly. Some fairly small changes that were made to ext4 had a dramatic effect on its performance on SMR drives—and provided a performance boost for metadata-heavy workloads on conventional magnetic recording (CMR) media as well.

Ts'o said that he and Aghayev, who is a student at Carnegie Mellon University (CMU), had developed the ext4 changes; two professors, Garth Gibson at CMU and Peter Desnoyers at Northeastern, also provided useful input and advice.

SMR basics

SMR drives pack more data into the same space as CMR drives by overlapping the tracks on the platter. Sequential writes will work well with SMR, while overwriting existing data will require copying data from adjacent tracks and rewriting it in a sequential fashion. SMR is targeted at backups, large media files, and similar use cases.

To the extent that rotational drives will stick around, SMR will be with us, Ts'o said. There are additional technologies "coming down the pike" that will be compatible with SMR. Millions of SMR drives have shipped and even consumers are using the technology for backups while using SSDs for data they need faster access to.

There are two kinds of SMR drives, drive-managed and host-managed; the talk (and work) focused on drive-managed SMR rather than host-managed, where the operating system must actively manage the storage device. For drive-managed SMR, there is a shingle translation layer (STL) that is akin to the flash translation layer (FTL) for SSDs. The STL hides the various zones in an SMR device, which might be 256MB in size and must be written sequentially; it presents a block interface that masks the device's constraints.

SMR disks typically have a persistent cache that is a lot faster than a CMR disk. The theory is that if there is idle time, and most disks in enterprise settings will have some, data can be moved from the persistent cache to the disk itself in a sequential manner at that time, Ts'o said. In addition, idle time allows for various cleaning and housekeeping tasks. As long as there is room in the persistent cache, writes to the device are extremely fast, but once it fills up, throughput drops off drastically.

The persistent cache is invisible to the kernel unless the vendor provides some magic command to query its size and other characteristics. The exact behavior of the STL is vendor specific and subject to change, much like the situation with FTL implementations. But flash is so fast that it is hard to notice the difference when the translation layer chooses to write in different locations; for the STL, writing to the persistent cache is so much faster than to disk that it is quite noticeable.

The STL will try to recognize sequential writes and bypass the persistent cache for those. In some ways the persistent cache is like the ext4 journal, Ts'o said. With a random write workload, once the persistent cache is full, each write becomes a large read-modify-write operation. The exact details of the persistent cache, how much there is and where it is located on the disk, for example, will vary; some drives they tested had 25GB of persistent cache, others were different.
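The performance cliff Ts'o described can be captured in a toy cost model: writes land in the persistent cache at full speed until it fills, after which each random write triggers a zone-sized read-modify-write. This sketch is purely illustrative — the constants come from the talk, but the cost function is an invented simplification, not a measurement of any real drive:

```python
def write_cost(bytes_written_so_far, cache_bytes, zone_bytes, write_size):
    """Toy model of a drive-managed SMR disk: a write absorbed by the
    persistent cache costs only its own size; once the cache is full,
    a random write forces the STL to read and rewrite a whole zone."""
    if bytes_written_so_far <= cache_bytes:
        return write_size                  # absorbed by the cache
    return 2 * zone_bytes + write_size     # read-modify-write of a zone

ZONE = 256 * 1024**2    # 256MB zone, as described in the talk
CACHE = 25 * 1024**3    # 25GB persistent cache on one tested drive

# A 4KB metadata write before and after the cache fills up:
print(write_cost(1 * 1024**3, CACHE, ZONE, 4096))   # 4096
print(write_cost(30 * 1024**3, CACHE, ZONE, 4096))  # 536875008
```

The five-orders-of-magnitude jump in the second number is the "throughput drops off drastically" behavior described above.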

Small changes

The work that he and Aghayev did was to make fairly small changes to ext4 (40 lines of code modified, 600 lines added in new files) that made a dramatic difference. Those changes improved the performance of metadata-light workloads by 1.7-5.4x. For metadata-heavy workloads, the improvement was 2-13x.

The way that ext4 uses the disk is particularly bad for SMR devices, he said, because the metadata is spread across the disk. Metadata writes are 4KB random writes, which is the worst possible thing for SMR. Those writes can dominate the work that the STL has to do even when you are storing large video streams that are SMR-friendly. If there is lots of idle time, the change is not all that dramatic but, if not, performance drops off substantially while the STL turns the random writes into sequential ones.

Ext4 uses write-ahead logging: it writes metadata to a journal, which is sequential, but then does random writes to put the metadata in its final location once the journal fills or the dirty-writeback timeout is reached. That means every metadata block is written twice, once to the journal and once to its final destination; why not treat the journal copy as the authoritative one? The block in memory can be mapped to its journal location and marked clean in the page cache; if it gets evicted and is needed again, it can be looked up in the journal.

When the journal gets full, something needs to be done, however. Many of the blocks in the journal are no longer important because they have been superseded by a later entry in the journal. Those that are still valid could be copied to their final location on disk, or simply to a new journal as a sequential write. "If you squint", he said, it kind of looks like a log-structured filesystem for metadata.
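The bookkeeping behind this scheme can be sketched as a map from block number to journal location, with cleaning keeping only the latest copy of each block. This is an illustrative model of the idea, not the actual ext4-lazy code:

```python
class LazyJournal:
    """Toy model of a journal used as the authoritative location for
    metadata: a map records where the latest copy of each block sits."""
    def __init__(self):
        self.entries = []   # append-only list of (block_no, data)
        self.jmap = {}      # block_no -> index of latest journal entry

    def write(self, block_no, data):
        self.jmap[block_no] = len(self.entries)
        self.entries.append((block_no, data))

    def read(self, block_no):
        # A clean block evicted from memory is re-read from the
        # journal via the map.
        return self.entries[self.jmap[block_no]][1]

    def clean(self):
        # Entries superseded by a later write of the same block are
        # dropped; live ones are rewritten sequentially to a new journal.
        live = [(b, self.entries[i][1]) for b, i in self.jmap.items()]
        self.entries = []
        self.jmap = {}
        for block_no, data in live:
            self.write(block_no, data)

j = LazyJournal()
j.write(7, "old")
j.write(9, "dirent")
j.write(7, "new")    # supersedes the first entry for block 7
j.clean()
print(len(j.entries), j.read(7))   # 2 new
```

The cleaning pass is what makes the "log-structured filesystem for metadata" comparison apt: stale copies are garbage-collected and live data is compacted sequentially.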

In order to make this all work, they grew the journal from 128MB to 10GB; "on an 8TB drive, what's 10GB between friends?" They tried smaller journals, which worked, but the journal fills more quickly, requiring more copying.

Results

Aghayev then took over to report on the performance of the changes. They tested ext4 versus the new filesystem, which they call "ext4-lazy", on an i7 processor system with 16GB of memory. He started by presenting the performance on CMR drives.

The first benchmark used eight threads to create 800,000 directories with a directory-tree depth of ten. Ext4 took four minutes to complete, while ext4-lazy only took two minutes. When looking at the I/O distribution, ext4 wrote 4.5GB of metadata, with roughly the same amount of journal writes. Since ext4-lazy eliminates the metadata writes with only a small increase in journal writes, it makes sense that the benchmark only took half the time.

The second test was for the "notoriously slow rm -rf on a massive directory" case. That is slow for all filesystems, Aghayev said, not just ext4. To delete the directory tree created in the first test took nearly 40 minutes for ext4, but less than ten for ext4-lazy. Looking at the I/O distribution, ext4-lazy skips the metadata writes that ext4 does, but that is a fairly small part of the I/O for the test; most of the I/O is in metadata reads and journal writes for both filesystems. But the metadata reads for ext4 require seeking all over the disk, while ext4-lazy reads them all from the journal.

For a metadata-light workload, with less than 1% of the I/O involving metadata, ext4-lazy shows a much more modest gain. Running a benchmark that emulated a file server showed a 5% performance increase for ext4-lazy. He recommended reading the paper [PDF] from the USENIX File and Storage Technologies (FAST) conference for more information.

He then turned to the benchmarks on SMR devices. For those devices, ext4-lazy benefits from the fact that it does not require much cleaning time, while ext4 results in extra work that needs to be done after the benchmark is finished. The directory-creation benchmark shows a smaller gain for ext4-lazy (just under two minutes for ext4 versus just over one minute) on SMR, but that doesn't take into account the cleaning time, which is zero for ext4-lazy, but a whopping 14 minutes for ext4.

Measuring cleaning time is not straightforward, however. They used a vendor tool in some cases, but also cut a hole into an SMR disk drive so they could observe what it was doing. Aghayev's advisors thought the hole idea would never work, but the drive is still working a year after doing so, he said. "You're lucky" said one audience member. It is difficult to get vendors to give out information about the inner workings of the drive, Aghayev said, so cutting a hole was what they were left with.

On SMR, the directory-removal benchmark took 55 minutes for ext4 and around 4 minutes for ext4-lazy but, once again, cleaning time is significant as well. Ext4 required ten minutes of cleaning, while ext4-lazy needed only 20 seconds. The file-server benchmark showed similar results, though with a twist. Two different SMR devices showed different characteristics for ext4 and for ext4-lazy. Both devices showed roughly 2x performance for ext4-lazy, but the performance of ext4 on the two devices also showed a nearly 2x difference. The same is true for ext4-lazy between the two devices, but the order is reversed; the device that performed much better for ext4 performed nearly 2x worse for ext4-lazy when compared to the other device. That reflects the different ways that the STL handles cleaning; one does all or most of it when the cache gets full, while the other interleaves it with regular I/O.

In conclusion, Aghayev said, ext4-lazy separates metadata from data and manages the former as a log, which is not a new idea. Spreading metadata across the disk was introduced some 30 years ago; maybe it is time to revisit that decision, he said. It is based on the explicit assumption that the cost of random reads is high, but also the implicit assumption that random reads cost the same as random writes, which does not hold for SMR.

Someone from the audience asked about ext4-lazy on SSDs. Aghayev said he thought there would be a performance increase, but has not done the experiments. Ts'o said he thought it would be better on the FTLs used by low-end devices like those on mobile handsets. But if the CPU pushes hard enough, high-end devices may also benefit, one attendee said. There were various suggestions for ways to make ext4-lazy better, but Ts'o noted that an explicit goal was to make minimal changes to an existing production filesystem so that users would have confidence in running it.

[I would like to thank the Linux Foundation for travel assistance to Cambridge, MA for Vault.]


With the speed of network hardware now reaching 100 Gbps and distributed denial-of-service (DDoS) attacks going into the Tbps range, Linux kernel developers are scrambling to optimize key network paths in the kernel to keep up. Many efforts are actually geared toward getting traffic out of the costly Linux TCP stack. We have already covered the XDP (eXpress Data Path) patch set, but two new ideas surfaced during the Netconf and Netdev conferences held in Toronto and Montreal in early April 2017. One is a patch set reworking the af_packet subsystem, which aims at extracting raw packets from the kernel as fast as possible; the other is the idea of implementing in-kernel layer-7 proxying. There are also user-space network stacks like Netmap, DPDK, or Snabb (which we previously covered).

This article aims to clarify what all those components do and to provide a short status update on the tools we have already covered. We will focus on in-kernel solutions: user-space tools have a fundamental limitation in that, if they need to re-inject packets onto the network, they must pay the expensive cost of crossing the kernel barrier again, so their performance is effectively bounded by that design. We will start from the lowest part of the stack, the af_packet patch set, and work our way up to layer 7 and in-kernel proxying.

af_packet v4

John Fastabend presented a new version of a patch set that was first published in January regarding the af_packet protocol family, which is currently used by tcpdump to extract packets from network interfaces. The goal of this change is to allow zero-copy transfers between user-space applications and the NIC (network interface card) transmit and receive ring buffers. Such optimizations are useful for telecommunications companies, which may use it for deep packet inspection or running exotic protocols in user space. Another use case is running a high-performance intrusion detection system that needs to watch large traffic streams in realtime to catch certain types of attacks.

Fastabend presented his work during the Netdev network-performance workshop, but also brought the patch set up for discussion during Netconf. There, he said he could achieve line-rate extraction (and injection) of packets, with packet rates as high as 30Mpps. This performance gain is possible because user-space pages are directly DMA-mapped to the NIC, which is also a security concern. The other downside of this approach is that a complete pair of ring buffers needs to be dedicated for this purpose; whereas before packets were copied to user space, now they are memory-mapped, so the user-space side needs to process those packets quickly; otherwise they are simply dropped. Furthermore, it's an "all or nothing" approach; while NIC-level classifiers could be used to steer part of the traffic to a specific queue, once traffic hits that queue, it is only accessible through the af_packet interface and not the rest of the regular stack. If done correctly, however, this could actually improve the way user-space stacks access those packets, providing projects like DPDK a safer way to share pages with the NIC, because it is well defined and kernel-controlled. According to Jesper Dangaard Brouer (during review of this article):

This proposal will be a safer way to share raw packet data between user space and kernel space than what DPDK is doing, [by providing] a cleaner separation as we keep driver code in the kernel where it belongs.
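The "process quickly or lose packets" semantics of a shared ring can be illustrated with a toy model. This is not the af_packet v4 interface itself (which is memory-mapped and driver-specific), just a sketch of the drop behavior described above:

```python
from collections import deque

class PacketRing:
    """Toy model of a fixed-size receive ring shared with the NIC:
    the producer never blocks, so when the consumer lags and the
    ring is full, new packets are simply dropped."""
    def __init__(self, slots):
        self.slots = slots
        self.ring = deque()
        self.dropped = 0

    def produce(self, pkt):
        if len(self.ring) >= self.slots:
            self.dropped += 1    # consumer too slow: packet lost
        else:
            self.ring.append(pkt)

    def consume(self):
        return self.ring.popleft() if self.ring else None

ring = PacketRing(slots=2)
for pkt in ("p1", "p2", "p3"):
    ring.produce(pkt)            # p3 arrives before any consume
print(ring.dropped, ring.consume())   # 1 p1
```

At 30Mpps, even a brief consumer stall overflows a ring of any realistic size, which is why the user-space side must keep up in real time.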

During the Netdev network-performance workshop, Fastabend asked if there was a better data structure to use for such a purpose. The goal here is to provide a consistent interface to user space regardless of the driver or hardware used to extract packets from the wire. af_packet currently defines its own packet format that abstracts away the NIC-specific details, but there are other possible formats. For example, someone in the audience proposed the virtio packet format. Alexei Starovoitov rejected this idea because af_packet is a kernel-specific facility while virtio has its own separate specification with its own requirements.

The next step for af_packet is the posting of the new "v4" patch set, although David Miller warned that it won't be merged until proper XDP support lands in the Intel drivers. The concern, of course, is that the kernel would otherwise have multiple incomplete bypass solutions available at once. Hopefully, Fastabend will present the (by then) merged patch set at the next Netdev conference in November.

XDP updates

Higher up in the networking stack sits XDP. The af_packet feature differs from XDP in that it does not perform any sort of analysis or mangling of packets; its objective is purely to get the data into and out of the kernel as fast as possible, completely bypassing the regular kernel networking stack. XDP also sits before the networking stack except that, according to Brouer, it is "focused on cooperating with the existing network stack infrastructure, and on use-cases where the packet doesn't necessarily need to leave kernel space (like routing and bridging, or skipping complex code-paths)."

XDP has evolved quite a bit since we last covered it in LWN. It seems that most of the controversy surrounding the introduction of XDP in the Linux kernel has died down in public discussions, under the leadership of David Miller, who heralded XDP as the right solution for a long-term architecture in the kernel. He presented XDP as a fast, flexible, and safe solution.

Indeed, one of the controversies surrounding XDP was the question of the inherent security challenges with introducing user-provided programs directly into the Linux kernel to mangle packets at such a low level. Miller argued that whatever protections are expected for user-space programs also apply to XDP programs, comparing the virtual memory protections to the eBPF (extended BPF) verifier applied to XDP programs. Those programs are actually eBPF that have an interesting set of restrictions:

- they have a limited size
- they cannot jump backward (and thus cannot loop), so they execute in predictable time
- they do only static allocation, so they are also limited in memory

XDP is not a one-size-fits-all solution: netfilter, the TC traffic shaper, and other normal Linux utilities still have their place. There is, however, a clear use case for a solution like XDP in the kernel.

For example, Facebook and Cloudflare have both started testing XDP and, in Facebook's case, deploying XDP in production. Martin Kafai Lau, from Facebook, presented the tool set the company is using to construct a DDoS-resilience solution and a level-4 load balancer (L4LB), which got a ten-times performance improvement over the previous IPVS-based solution. Facebook rolled out its own user-space solution called "Droplet" to detect hostile traffic and deploy blocking rules in the form of eBPF programs loaded in XDP. Lau demonstrated the way Facebook deploys a three-part chained eBPF program: the first part allows debugging and dumping of packets, the second is Droplet itself, which drops undesirable traffic, and the last segment is the load balancer, which mangles the packets to tweak their destination according to internal rules. Droplet can drop DDoS attacks at line rate while keeping the architecture flexible, which were two key design requirements.

Gilberto Bertin, from Cloudflare, presented a similar approach: Cloudflare has a tool that processes sFlow data generated from iptables in order to generate cBPF (classic BPF) mitigation rules that are then deployed on edge routers. Those rules are created with a tool called bpfgen, part of Cloudflare's BSD-licensed bpftools suite. For example, it could create a cBPF bytecode blob that would match DNS queries to any example.com domain with something like:

bpfgen dns *.example.com

Originally, Cloudflare would deploy those rules to plain iptables firewalls with the xt_bpf module, but this led to performance issues. It then deployed a proprietary user-space solution based on Solarflare hardware, but this has the performance limitations of user-space applications — getting packets back onto the wire involves the cost of re-injecting packets back into the kernel. This is why Cloudflare is experimenting with XDP, which was partly developed in response to the company's problems, to deploy those BPF programs.

A concern that Bertin identified was the lack of visibility into dropped packets. Cloudflare currently samples some of the dropped traffic to analyze attacks; this is not currently possible with XDP unless you pass the packets down the stack, which is expensive. Miller agreed that the lack of monitoring for XDP programs is a large issue that needs to be resolved, and suggested creating a way to mark packets for extraction to allow analysis. Cloudflare is currently in a testing phase with XDP and it is unclear if its whole XDP tool chain will be publicly available.

While those two companies are starting to use XDP as-is, there is more work needed to complete the XDP project. As mentioned above and in our previous coverage, massive statistics extraction is still limited in the Linux kernel and introspection is difficult. Furthermore, while the existing actions (XDP_DROP and XDP_TX; see the documentation for more information) are well implemented and used, another action may be introduced, called XDP_REDIRECT, which would allow redirecting packets to different network interfaces. Such an action could also be used to accelerate bridges as packets could be "switched" based on the MAC address table. XDP also requires network driver support, which is currently limited. For example, the Intel drivers still do not support XDP, although that should come pretty soon.

Miller, in his Netdev keynote, focused on XDP and presented it as the standard solution that is safe, fast, and usable. He identified the next steps of XDP development to be the addition of debugging mechanisms, better sampling tools for statistics and analysis, and user-space consistency. Miller foresees a future for XDP similar to the popularization of the Arduino chips: a simple set of tools that anyone, not just developers, can use. He gave the example of an Arduino tutorial that he followed where he could just look up a part number and get easy-to-use instructions on how to program it. Similar components should be available for XDP. For this purpose, the conference saw the creation of a new mailing list called xdp-newbies where people can learn how to create XDP build environments and how to write XDP programs.

In-kernel layer-7 proxying

The third approach that struck me as innovative is the idea of doing layer-7 (application) proxying directly in the kernel. This comes from the idea that, traditionally, we build firewalls to segregate traffic and apply controls, but as most services move to HTTP, those policies become ineffective.

Thomas Graf presented this idea during Netconf using a Star Wars allegory: what if the Death Star were a server with an API? You would have endpoints like /dock or /comms that would allow you to dock a ship or communicate with the Death Star. Those API endpoints should obviously be public, but then there is this /exhaust-port endpoint that should never be publicly available. In order for a firewall to protect such a system, it must be able to inspect traffic at a higher level than the traditional address-port pairs. Graf presented a design where the kernel would create an in-kernel socket that would negotiate TCP connections on behalf of user space and then be able to apply arbitrary eBPF rules in the kernel.

In this scenario, instead of doing the traditional transfer from Netfilter's TPROXY to user space, the kernel directly decapsulates the HTTP traffic and passes it to BPF rules that can make decisions without doing expensive context switches or memory copies in the case of simply wanting to refuse traffic (e.g. issue an HTTP 403 error). This, of course, requires the inclusion of kTLS to process HTTPS connections. HTTP/2 support may also prove problematic, as it multiplexes connections and is harder to decapsulate. This design was described as a "pure pre-accept() hook". Starovoitov also compared the design to the kernel connection multiplexer (KCM). Tom Herbert, KCM's author, agreed that it could be extended to support this, but would require some extensions in user space to provide an interface between regular socket-based applications and the KCM layer.

In any case, if the application does TLS (and many do), kTLS gets tricky because it breaks the end-to-end nature of TLS, in effect becoming a man in the middle between the client and the application. Eric Dumazet argued that HAProxy already does things like this: it uses splice() to avoid copying too much data around, but it still does a context switch to hand over processing to user space, something that could be fixed in the general case.

Another similar project that was presented at Netdev is the Tempesta firewall and reverse proxy. The speaker, Alex Krizhanovsky, explained that the Tempesta developers took one person-month to port the mbed TLS stack to the Linux kernel to allow an in-kernel TLS handshake. Tempesta also implements rate limiting, cookies, and JavaScript challenges to mitigate DDoS attacks. The argument behind the project is that "it's easier to move TLS to the kernel than it is to move the TCP/IP stack to user space". Graf explained that he is familiar with Krizhanovsky's work and he is hoping to collaborate. In effect, the design Graf is working on would serve as a foundation for Krizhanovsky's in-kernel HTTP server (kHTTP). In a private email, Graf explained that:

The main differences in the implementation are currently that we foresee to use BPF for protocol parsing to avoid having to implement every single application protocol natively in the kernel. Tempesta likely sees this less of an issue as they are probably only targeting HTTP/1.1 and HTTP/2 and to some [extent] JavaScript.

Neither project is really ready for production yet. There didn't seem to be any significant pushback from key network developers against the idea, which surprised some people, so it is likely we will see more and more layer-7 intelligence move into the kernel sooner rather than later.

Conclusion

All of this work aims at replacing a rag-tag bunch of proprietary solutions that recently came up to bypass the Linux kernel TCP/IP stack and improve performance for firewalls, proxies, and other key edge network elements. The idea is that, unless the kernel improves its performance, or at least provides a way to bypass its more complex code paths, people will work around it. With this set of solutions in place, engineers will now be able to use standard APIs to hook high-performance systems into the Linux kernel.

[The author would like to thank the Netdev and Netconf organizers for travel assistance, Thomas Graf for a review of the in-kernel proxying section of this article, and Jesper Dangaard Brouer for review of the af_packet and XDP sections.]

Comments (7 posted)

Linux usage in networking hardware has been on the rise for some time. During the latest Netdev conference held in Montreal this April, people talked seriously about Linux running on high-end, "top of rack" (TOR) networking equipment. Those devices have long been the realm of proprietary hardware and software companies like Cisco or Juniper, but Linux seems to be making some significant headway into the domain. According to Shrijeet Mukherjee, VP of Engineering at Cumulus Networks: "we are seeing a 28% adoption rate in the Fortune 50" companies.

As someone who has worked in system administration and networking for over a decade, I was surprised by this: switches have always been black boxes of proprietary hardware that we barely get a shell into. But as more and more TOR hardware gets Linux support, some of that work trickles down outside of that niche. Are we really seeing the rise of Linux in high-end networking hardware?

Linux as the standard interface

During his keynote at Netdev, Jesse Brandeburg explained that traffic is exploding on the Internet: "From 2006 to 2016 the compound annual growth rate was 78% of network traffic. So the network traffic is growing like crazy. In 2010 to 2023, it's going to grow by a thousand times."

He also mentioned Intel was working on devices that could do up to 400 Gbps. In his talk, he argued that Linux has a critical place in the modern world by repeating the mantra that "everything is on the network, and the network runs Linux", but does it really? Through Android, Linux has become the most popular end-user device operating system and is also dominant on the server — but what about the actual fabric of the network: the routers and switches that interconnect all those machines?

Mukherjee, in his own keynote, argued that even though Linux has achieved dominance in the server and virtualization markets, there is a "discontinuity [...] at the boundary where the host and the network meet" and argued "that the host without the network will not survive". In other words, proprietary hardware and software in the network threaten free software in the server. Even though some manufacturers are already providing a "Linux interface" in their hardware, it is often only some sort of compatibility shell which might be compared with the Ubuntu compatibility layer in Windows: it's not a real Linux.

Mukherjee pushed this idea further by saying that those companies are limiting themselves by not using the full Linux stack. He presented Linux as the "top vehicle for innovation" that provides a featureful network stack, citing VXLAN, eBPF, and Quagga as prime features used on switches. Mukherjee also praised the diversity of the user-space Linux ecosystem as something that commercial alternatives can't rival; he compared the faster Linux development to the commercial sector where similar top features stay in the beta stage for up to 3 years.

Because of its dominance in the server market, consumers now expect a Linux-like interface on their networking gear, which means Linux could become the standard interface all providers strive toward. As a Debian developer, I can't help but smile at the thought; if there's one thing we have not been able to do among Linux distributions, it's standardize user space into a consistent interface. POSIX is old and incomplete, the FHS is showing its age, and most distributions have abandoned the LSB. Yet the idea is certainly enticing: there is a base set of tools and applications, especially around the Linux kernel, that is standardized: iproute2, ethtool, and iptables are generally consistent across distributions, even though each distribution has its own way of using them.

Why, then, is Linux not yet dominant in networking hardware? Mukherjee identified the problem as "packaging issues" and listed a set of features he would like Linux to improve:

Standardization of the ethtool interface. The idea is to make ethtool a de-facto standard to manage switches and ports. Mukherjee gave the example that data centers spend more money on cables than any other hardware and explained that making it easier to identify cables is therefore a key functionality. Getting consistent interface naming was also a key problem identified by numerous participants at the conference. While systemd tried to fix this with the predictable network interface names specification, the interface names are not necessarily predictable across virtual machines or on special hardware; in fact, this was the topic of the first talk of the conference. ethtool also needs to support interfaces that run faster than 1 Gbps, something that still has limited support in Linux at the moment.

Scaling of the Linux bridge. Through the rise of "software defined networking", we are likely to see multi-switch virtual environments that need to scale to hundreds of interfaces and devices easily. This is something the Linux bridge was never designed to do and it's showing scalability issues. During the conference, there was hope that the new XDP and eBPF developments could help, but also concerns this would create yet another bridge layer inside the kernel.

Cumulus's goal seems to be to establish Linux as the industry standard for this new era of networking and it is not alone. Through its Open Compute project, Facebook is sharing open designs of data center products and, while we have yet to see commercial off-the-shelf (COTS) 24 and 48-port gigabit switches trickle down to the consumer market, the company is definitely deploying new hardware based on those specifications in its own data centers, and those devices are often based on Linux.

The Linux switch implementation

So how exactly do switches work in Linux?

The Linux kernel manipulates switches with three different operation structures: switchdev_ops, which we previously covered, ethtool_ops, and netdev_ops. Certain switches, however, also need distributed switch architecture (DSA) features to be properly handled. DSA is a more obscure part of the network stack that allows Linux to represent hardware switches or chains of switches using regular Linux tools like bridge, ifconfig, and so on. While switchdev is a new layer, DSA has been in the kernel since 2.6.28 in 2008. Originally developed to support Marvell switches, DSA is now a generic layer deployed in WiFi access points, set-top boxes, on-board flight entertainment systems, trains, and other industrial equipment. Switches that have an Ethernet controller need DSA, whereas the kernel can support switches without Ethernet controllers directly with switchdev drivers.

The first years of DSA's development consisted only of basic maintenance but, in the last three years, DSA has seen a resurgence of contributions, as part of the Linux networking stack's push to support hardware offloading and network switches. Between 2014 and 2015, DSA added support for Broadcom hardware, wake-on-LAN, and hardware port bridging, among other features.

DSA's development was parallel to swconfig, written by the OpenWrt project to support the small office and home office (SOHO) routers that the project is targeting. The main difference between swconfig and DSA is that DSA-supported switches show one network interface per port, whereas swconfig-configured switches show up as a single port, which limits the amount of information that can be extracted from the switch. For example, you cannot have per-port traffic statistics with swconfig. That limitation is what led to the creation of the switchdev framework, when swconfig was proposed (then refused) for inclusion in mainline. Another goal of switchdev was to support bridge hardware offloading and network interface card (NIC) virtualization.

Also, whereas swconfig uses virtual LAN (VLAN) tagging to address ports, DSA enables the use of device-specific tagging headers to address different ports, which gives DSA better control over the switches. This allows, for example, DSA to do internet group management protocol (IGMP) snooping or implement the spanning tree protocol, whereas swconfig doesn't have those features. Some switches are actually connected to the host CPU through an Ethernet interface instead of a regular PCI-Express interface, and DSA supports this as well.

One advantage that remains in the swconfig approach is that it treats the internal switch as a simple external switch, and addresses ports with standard VLAN tags. This is something DSA could do, as well, but no one has bothered implementing this just yet. For now, DSA drivers use device-specific tagging mechanisms that limit the number of supported devices. Other areas of future improvement for DSA are better multi-chip support, IGMP snooping, and bonding, as well as firewall, NAT, and TC offloading.

Where is the freedom?

Given all those technical improvements, you might rightfully wonder if your own wireless router or data center switch runs Linux.

In recent years, we have seen more and more networking devices, especially SOHO routers, shipped with Linux and sometimes even OpenWrt (e.g. the Turris Omnia, which we previously covered), but that sometimes means a crippled operating system that offers only a proprietary web interface. At least those efforts make it easier to deploy free operating systems on those devices.

Based on my experience running OpenWrt on wireless routers to build the Montreal mesh network, deploying Linux on routers and switches has always been a manual process. The Ubiquiti hardware being used in the mesh network comes with an OpenWrt derivative, but it includes proprietary drivers and a proprietary web interface. To use the mesh networking protocol that was chosen, it was necessary to deploy custom OpenWrt images by hand. For years, it was a continuous struggle for OpenWrt developers to liberate generation after generation of proprietary hardware with companies like Cisco locking down the venerable WRT platform in 2006 and the US Federal Communications Commission (FCC) rules that forced TP-Link to block free software on its routers, a change that was later reverted.

Most hardware providers are obviously not dedicated to software freedom: deploying Linux on their hardware is, for them, an economic choice rather than a political one. As you might expect, a "Linux router" these days often means a closed device and operating system, using Linux as the only free component. I had the privilege of doing some reverse engineering on the SmartRG SR603n VDSL modem, which also doubles as a WiFi router and VoIP phone adapter. I was able to get a shell on that machine, and what I found was a custom operating system built on top of the Linux kernel. I wrote a detailed report about this two years ago and the situation then was pretty grim.

My experience working in data centers for over a decade tells an even worse story: most switches and routers there run no free software at all. We deployed HP ProCurve switches that provide free (as in beer) software updates and struggled for years to find free (as in speech) alternatives for them. We built our own routers from COTS server hardware, at a significant performance cost compared to the dedicated application-specific integrated circuits (ASICs) in commercial routers, because those routers did not offer the trust and flexibility we were looking for.

But Linux is definitely making some headway, and has been for a while. When we covered switchdev in February 2016, it was just getting started, but now vendors like Mellanox, Broadcom, Cumulus, and Intel are writing and shipping code using the framework. Cumulus, in particular, is developing a Debian-based distribution (Cumulus Linux) that it deploys for clients on targeted hardware. Most of the hardware in that list, however, is not actually open in the more radical sense: these are devices that can run free software but are not generally open-source hardware. There are some exceptions, but they sit at the higher end of the spectrum: most organizations probably don't need 100 Gbps ports, let alone the 128 ports in the Backpack switch that Cumulus is shipping.

How much this translates to actual freedom for the end-user is therefore questionable. While I have seen encouraging progress on the high end of the hardware spectrum at Netdev, I'm not sure this will translate into immediate wins in the data center or even for home users in the short term. In the long term, however, we will hopefully see some progress in Linux's rise in general-purpose networking hardware following its dominance in general-purpose computing.

[The author would like to thank the Netdev organizers for travel assistance. Also, thanks to Andrew Lunn for a technical review of this article.]

Comments (6 posted)

Every conference venue has problems with the mix of room sizes, but I don't recall ever going to a talk that so badly needed to be in a bigger room as Jessie Frazelle and Alex Mohr's talk at CloudNativeCon/KubeCon Europe 2017 on securing Kubernetes. All the chairs filled up half an hour before the start; then much of the remaining volume of the room filled up, and still people were trying to get in. Had there been a fire at this point, most of western Europe's Kubernetes clusters might have had to go without care and feeding for a while. The cause of the enthusiasm was the opportunity to get "best practice" information on securing Kubernetes, and how Kubernetes might be evolving to assist with this, directly from the source. Mohr and Frazelle work for Google; Mohr is currently technical lead and manager for Google Seattle's Kubernetes and Container Engine teams, and Frazelle has been a Docker core maintainer and is famous for containerizing everything in sight.

Security evolution

Historically, they said, Kubernetes's security model was pretty flat and simple. As of 1.5, there is a "single tenant" model: one trust boundary, being the cluster itself, and everyone inside that is effectively an administrator. Authorization is granted at cluster level; the nodes all have the same authenticated identity, and the pods all have the same permissions and full network access. If team one and team two will play nicely together, their pods can run on the same cluster; but if separation is needed, each team needs its own cluster.

With the release of 1.6, the concept of multiple users has been added, but Frazelle said there is "not much" enforcement, so the cluster is still the only truly effective trust boundary. With 1.7, they said (adding a small question mark to indicate that we're now talking about the future, so anything could happen), we get "cooperative soft multi-tenancy", meaning that we get fine-grained authentication but that it's not fully hardened, so with adequate trust and audit, it could work for some environments. In this future, namespaces and resources (initially, nodes) become the boundaries; what a pod can do becomes controlled by the permissions granted to the namespace in which it runs, and to the node on which it runs.

With that kind of change, some kind of identity and access management is needed. The "three pillars" of this are authentication for users, system components, and workloads. For users, Kubernetes doesn't seem to be welding itself to any one mechanism, embracing such diverse solutions as basic HTTP authentication, static bearer tokens, X.509 client certificates, service account tokens, OpenID Connect (this raising a small cheer from me), and having hooks to build a custom solution. Whichever mechanism you opt for, Kubernetes will use it to produce a username, a UID, and an optional list of groups.

A slide showed a PKI-based example of system component authentication, where nodes and the master authenticate each other via a cluster-specific certificate authority. For workload authentication, Kubernetes will use JSON Web Tokens (JWTs) handled via Kubernetes service accounts (SAs). The SA controller will sign JWTs for all SAs, which will be stored as secrets using the Kubernetes API; pods acquire SAs by virtue of their configuration, then use the associated JWTs as bearer tokens when requesting resources from the master.

Once you have authenticated, Kubernetes authorizes you to do things based on the new Role-Based Access Control (RBAC), which is present for the first time in Kubernetes 1.6. As Mohr said, RBAC is only in beta in 1.6, but they expect this will be the primary method for authorization in Kubernetes going forward, so attendees were encouraged to try it out and log bug reports and feature requests if it's not what they want.

This was followed by a brief RBAC example based around the get and list commands for pods and nodes, a pair of roles (one of which could get and list pods but not nodes, and one of which could get and list both), and a user account that was bound to the former role (thus gaining the ability to get and list pods, but not nodes) while a service account and a group were bound to the latter (thus gaining the ability to get and list both pods and nodes). Mohr's single most helpful observation at this point was that he thought that the online documentation was already pretty complete, so those wishing to dig down into it should be well catered for.

Mohr said that if you're performing authorization, then audit is important. The Kubernetes APIserver currently logs an ID, timestamp, requesting IP, a user and namespace, the request URI, and in a separate line, the response code. This capability is expected to improve over time.

Anything hosted inside Kubernetes that expects to interact meaningfully with the rest of the world will need to be able to authenticate outside Kubernetes, which requires secrets: database logins, SSL keys, GitHub access tokens, and so on. This took Mohr neatly onto the question of secure storage of these secrets, which is expected to continue much as it's done now: the secret is Base64-encoded then placed in a YAML file. A pod can then use that secret if its YAML configuration permits; the secret is made available to the container via either an environment variable or a file mount, from which the containerized application can use it. The latter is more complex to use but better adapted to distributing changes in your external secrets, assuming anyone ever rotates their keys.

Runtime security

Frazelle went on to talk about runtime security, noting that containers are structures built out of Linux primitives including namespaces and control groups (cgroups), the former generally controlling what you can see and the latter generally controlling what you can use. Combined, they do much to provide isolation for the container, but she noted that it is not enough. If you need isolation that you can have real confidence in, it is necessary to add hardening on top. AppArmor uses the kernel's security module interface to control and audit access to various system resources, including file access and some system functions (mounting, accessing the network, and so on). Docker ships with sane AppArmor defaults, including blocking writing to /proc and /sys and preventing the mounting of filesystems. She gave a terse example of spinning up a containerized NGINX that took advantage of these defaults.

Useful as AppArmor is, it doesn't allow control of all system functions. Seccomp is another supported hardening tool that gives control over all system calls: you define exactly which system calls your application is allowed to run, and it terminates a process that tries to step outside that set. Again, Kubernetes now contains sensible default policies for seccomp, though they are in alpha in 1.6, and Frazelle gave an example of running NGINX subject to those defaults. The system call whitelist in Kubernetes apparently took some time to write, and has been subject to a lot of testing to ensure that it didn't gratuitously break any standard applications. The (in)famous SELinux is also supported; there are hooks to set SELinux contexts for volumes, for example.

In addition, a number of security context options now ship in Kubernetes, including one that requires a container not to run as root and one that requires a read-only root filesystem (if your application can support it, which is apparently quite difficult); there are also easy hooks for adding and/or dropping particular Linux capabilities.

At this point, Frazelle gave a demo showing the use of seccomp in Kubernetes with its default settings to prevent breaking out from a container via the Dirty COW exploit against a (demonstrably) vulnerable kernel. She felt that one important part of the demo was that the application (which again was NGINX) was unaffected by being run under the shipped default seccomp policy. She encouraged attendees to start using it for everything that didn't specifically need to do privileged things like mounting filesystems.

Mohr's summary of where Kubernetes security is now, and is going, is that he wants to get to hard multi-tenancy — the point where he can be comfortable running code from multiple third parties, with the potential for malice that implies, in the same cluster. Anyone doing this now should note how far into the future he sees this arriving.

Currently, with 1.6, we get single tenancy but multiple identities with RBAC. Pods generally do not need to run as root. The supplied (though optional) hardening mechanisms described above can be used to increase isolation of containers and minimize the risk of breakout.

With 1.7, he expects to see soft multi-tenancy. A process that manages to escape from a container would be limited to the privileges of the node on which it was running. Secrets management and audit logging will improve. He expects that seccomp will become active by default sometime around then.

With 1.8, due September 2017, we will move toward hard multi-tenancy. Tools for binary verification may be provided. Resource isolation between containers will improve, to prevent things like cycle stealing. Support for organizationally mandated security policies will likely be added.

So, to recap: to secure your Kubernetes cluster as best you can, get to 1.6 as soon as is convenient, and start using the RBAC and hardening tools provided to ensure your containers start running with the least needed privileges, and stay that way. The next couple of versions of Kubernetes are likely to provide expanded versions of these tools and default to using them, so moving to them now will leave you best-placed to take full advantage of the security features to come.

[Thanks to the Linux Foundation, LWN's travel sponsor, for assistance in getting to Berlin for CNC and KubeCon.]

Comments (5 posted)

It has been said that Documentation/memory-barriers.txt can be used to frighten small children, and perhaps this is true. But even if it is true, it is woefully inefficient. After all, there is a huge number of children in this world, so a correspondingly huge amount of time and effort would be required in order to read it to them all.

This situation clearly calls for automated tooling, which is now available in prototype form; it is now possible to frighten small children at scale. This tool takes short fragments of concurrent C code as input, and exhaustively analyzes the possible results. In other words, instead of perusing memory-barriers.txt to find the answer to a memory-ordering question, you can get your answer by writing a small test case and feeding it to the tool, at least for test cases within the tool's current capabilities and limitations. This two-part series gives an introduction to the tool, describing how to use it and how it works.

To the best of our knowledge, this tool is the first realistic automated representation of Linux-kernel memory ordering, and is also the first to incorporate read-copy-update (RCU) ordering properties.

This article is organized as follows, with the intended audience for each section in parentheses:

The second article in this series will look into the issues raised by cycles and how they are important for memory models. It also demonstrates how the herd tool can be used for testing memory models.

Those wishing to skip the preliminaries found in these two articles and dive directly into the strong model will find it here. (Yes, there is also a weak model, but it is described in terms of how it differs from the strong model. So you should start with the strong model.)

Why formal memory models?

Even before Linux, kernel hacking tended to rely more on intuition than on formal methods. Formal methods can nevertheless be useful for providing definite answers to difficult questions. For example, how many different behaviors can a computer program exhibit? Particularly one that uses only values in memory, with no user input or output?

Computers being the deterministic automata they are, most people would say only one, and for uniprocessor systems they would be basically correct. But multiprocessor systems can give rise to a much wider range of behaviors, owing to subtle variations in the relative timing of the processors and the signals transmitted between them, their caches, and main memory. Memory models try to bring some order to the picture, first and foremost by characterizing exactly which outcomes are possible for a symmetric multiprocessor (SMP) system running a certain (small) program.

Even better, a formal memory model enables tools to automatically analyze small programs, as described here and here. However, those tools are specialized for specific CPU families. For analyzing the Linux kernel, what we need is a tool targeted at a higher level, one that will be applicable to every CPU architecture supported by the kernel.

Formal memory models require extreme precision, far beyond what the informal discussion in memory-barriers.txt can possibly provide. To bridge this gap in the best way possible, we have formulated the guiding principles listed in the following section.

Guiding principles

Our memory model is highly constrained because it must match the kernel's behavior (or intended behavior). However, there are numerous choices to be made, so we formulated the following principles to guide those choices:

Strength preferred to weakness

When all else is equal, a stronger memory model is clearly better, but this raises the question of what is meant by “stronger”. For our purposes, one memory model is considered to be stronger than another if it rules out a larger set of behaviors. Thus, the weakest possible memory model is one that would allow a program to behave in any way at all (as exemplified by the “undefined behavior” so common in programming-language standards), whereas the strongest possible memory model is one that says no program can ever do anything. Of course, neither of these extremes is appropriate for the Linux kernel, or for much of anything else.

The strongest memory model typically considered is sequential consistency (SC), and the weakest is release consistency process consistency (RCpc). SC prohibits any and all reordering, so that all processes agree on a single global order of all processes' accesses; this is theoretically appealing but expensive, so much so that no mainstream microprocessor provides SC by default. In contrast, RCpc is fairly close to the memory models we propose for the Linux kernel, courtesy of the Alpha, ARM, Itanium, MIPS, and PowerPC hardware that the Linux kernel supports.

On the other hand, we don't want to go overboard. Although strength is preferred over weakness as a general rule, small increases in strength are not worth order-of-magnitude increases in complexity.

Simplicity preferred to complexity

Simpler is clearly better; however, simplicity will always be a subjective notion. A formal-methods expert might prefer a model with a more elegant definition, while a kernel hacker might prefer the model that best matched his or her intuition. Nevertheless, simplicity remains a useful decision criterion. For example, assuming all else is equal, a model with a simpler definition that better matched the typical kernel hacker's intuition would clearly be preferred over a complex counterintuitive model.

Support existing non-buggy Linux-Kernel code

The memory model must support existing non-buggy code in the Linux kernel. However, our model (in its current form) is rather limited in scope. Because it is not intended to be a replacement for either hardware emulators or production compilers, it does not support:

Any number of compiler optimizations. For example, our model currently does not account for compiler optimizations that hoist identical stores from both branches of an if statement to precede that statement. (On the other hand, the model also does not cover normal variable accesses, instead requiring at least READ_ONCE() or WRITE_ONCE(), each of which greatly limits the compiler's ability to optimize. This restriction is therefore less of a problem than it might at first appear.)

Arithmetic. Not even integer arithmetic!

Multiple access sizes.

Partially overlapping accesses.

Nontrivial data, including arrays and structures. However, trivial linked lists are supported.

Dynamic memory allocation.

Complete modeling of read-modify-write atomic operations. Currently, only atomic exchange is supported.

Locking, though some subset of the Linux kernel's numerous locking primitives is likely to be added in a future version. In the meantime, locking may be emulated using atomic exchange.

Exceptions and interrupts.

I/O, including DMA.

Self-modifying code, as found in the kernel's alternative mechanism, function tracer, Berkeley Packet Filter JIT compiler, and module loader.

Complete modeling of RCU. For example, we currently exclude asynchronous grace-period primitives such as call_rcu() and rcu_barrier(). However, we believe that this work includes the first comprehensive formal model of the interaction of RCU readers and synchronous grace periods with memory accesses and memory-ordering primitives.

Quick Quiz 1: But my code contains simple unadorned accesses to shared variables. So what possible use is this memory model to me?

Answer: See the answer to the Quick Quiz at the end of this article.

As always, adding more detail and functionality to the model slows it down, so the goal is to balance the needs for speed and for functionality. The current model is a starting point, and we hope to incorporate additional functionality over time. We also hope that others will incorporate this memory model into their tools.

Be compatible with hardware supported by the Linux kernel

The memory model must be compatible with the hardware that the Linux kernel runs on. Although the memory model can be (and is) looser than any given instance of hardware, it absolutely must not be more strict. In other words, the memory model must in some sense provide the least common denominator of the guarantees of all memory models of all CPU families that run the Linux kernel. This requirement is ameliorated, to some extent, by the ability of the compiler and the Linux kernel to mask hardware weaknesses. For example:

The Alpha port of the Linux kernel provides memory-barrier instructions as needed to compensate for the fact that Alpha does not respect read-to-read address dependencies.

The Itanium port of GCC emits ld.acq for volatile loads and st.rel for volatile stores, which compensates for the fact that Itanium does not guarantee read-to-read ordering for normal loads from the same variable.

Nevertheless, the memory model must be sufficiently weak that it does not rule out behaviors exhibited by any of the CPU architectures the Linux kernel has been ported to. Different CPU families can have quite divergent properties, so that each of Alpha, ARM, Itanium, MIPS, and PowerPC has required special attention at some point or another. In addition, hardware memory models are subject to change over time, as are the use cases within the Linux kernel. The Linux-kernel memory model must therefore evolve over time to accommodate these changes, which means that the version presented in this series should be considered an initial draft rather than being set in stone. It seems likely that this memory model will change at much the same rate as Documentation/memory-barriers.txt.

Providing compatibility with all the SMP systems supporting Linux is one of the biggest memory-model challenges, especially given that some systems' memory models have not yet been fully defined and documented. In each case, we have had to take our best guess based on existing documentation, consultation with those hardware architects willing to consult, formal memory models (for those systems having them), and experiments on real hardware, for those systems we have access to. In at least one case, this might someday involve a computer museum.

Thankfully, this situation has been improving. For example, although formal memory models have been available for quite some time (such as here [400-page PDF]), tools that apply memory models to litmus tests have only appeared much more recently. We most certainly hope that this trend toward more accessible and better-defined memory models continues, but in the meantime we will continue to work with whatever is available.

Support future hardware, within reason

The memory model should support future hardware, within reason. Linux-kernel ports to new hardware must supply their own code for the various memory barriers, and might one day also need to supply their own code for similar common-code primitives. But since common code is valuable, an architecture wishing to supply its own code for (say) READ_ONCE() will need a very good reason for doing so.

This proposal assumes that future hardware will not deviate too far from current practice. For example, if you are porting Linux to a quantum supercomputer, the memory model is likely to be the least of your worries.

Be compatible with the C11 memory model, where prudent and reasonable

Where possible, the model should be compatible with the existing C and C++ memory models. However, there are a number of areas where it is necessary to depart from these memory models:

The smp_mb() full memory barrier is stronger than that of C and C++. But let's face it, smp_mb() was there first, and there is a lot of code in the kernel that relies on smp_mb()'s current semantics.

The Linux kernel's value-returning read-modify-write atomics feature ordering properties that are not found in their C/C++ counterparts.

The smp_mb__before_atomic(), smp_mb__after_atomic(), and smp_mb__after_unlock_lock() barrier-amplification APIs have no counterparts in the C/C++ API.

The smp_read_barrier_depends() macro does not have a direct equivalent in the C/C++ memory model.

The Linux kernel's notion of control dependencies does not exist in C/C++. However, control dependencies are an important example of instruction ordering, so the memory model must account for them.

The Linux-kernel notion of RCU grace periods does not exist in C/C++. (However, the RCU-related proposals P0190R3 [PDF], P0461R1 [PDF], P0462R1 [PDF], P0561R0 [PDF], and P0566R0 [PDF] are being considered by the committee.)

On the positive side, the Linux kernel has recently been adding functionality that is closer to that of C and C++ atomics, with the ongoing move from ACCESS_ONCE() to READ_ONCE() and WRITE_ONCE() being one example and the addition of smp_load_acquire() and smp_store_release() being another.

Expose questions and areas of uncertainty

Defining a memory model inevitably uncovers interesting questions and areas of uncertainty. For example:

The Linux-kernel memory model is more strict than that of C11. It is useful to flag the differences in order to alert people who might otherwise be tempted to rely solely on C11. It is also quite possible that some of the Linux kernel's strictness is strictly historical, in which case it might (or might not) be worth considering matching C11 semantics for those specific situations.

Release-acquire chains are required to provide ordering to those tasks participating in the chain. Failure to provide such ordering would have many problematic consequences, not least being that locking would not work correctly. For tasks external to the chain, ordering cannot be provided for a write preceding the first release and a read following the last acquire due to hardware limitations. For example, if one process writes to variable x while holding a lock and a later critical section for that same lock reads from variable y, the read of y might execute before the write of x has propagated to an unrelated process not holding the lock.

It turns out that release-acquire chains can be implemented using READ_ONCE() instead of smp_load_acquire(). (However, substituting WRITE_ONCE() for smp_store_release() does not work on all architectures.) Should the model require the use of smp_load_acquire()?

Some architectures can “erase” writes, so that ordering specific to a given write might not apply to a later write to that same variable by that same task, even though coherence ordering would normally order the two writes. This can give rise to bizarre results, such as the possible outcomes of a code sequence depending on the code that follows it. (However, such results appear to be restricted to litmus tests that can best be described as “rather strange”.)

One interesting corner case of hardware memory models is that weak barriers (i.e., smp_wmb()) suffice to provide transitive ordering when all accesses are writes. However, we were unable to come up with reasonable use cases, and furthermore, the things that looked most reasonable proved to be attractive nuisances. Should the memory model nevertheless provide ordering in this case? (If you know of some reason why this ordering should be respected by the memory model, please don't keep it a secret.)

In a perfect world, we would resolve each and every area of uncertainty, then produce a single model reflecting full knowledge of all the hardware that the Linux kernel supports. However, astute readers might have noticed that the world is imperfect. Furthermore, rock-solid certainties can suddenly be cast into doubt, either with the addition of an important new architecture or with the uncovering of a misunderstanding or an error in documentation of some existing architecture. It will therefore be sometimes necessary for the Linux kernel memory model to say “maybe”.

Unfortunately, existing software tools are unable to say “maybe” in response to a litmus test. We therefore constructed not one but two formal models, one strong and the other less strong. These two models will disagree in “maybe” cases. Kernel hackers should feel comfortable relying on ordering only in cases where both models agree that ordering should be provided, and hardware architects should feel the need to provide strong ordering unless both models agree that strong ordering need not be provided. (Currently these models are still very much under development, so it is still unwise to trust either model all that much.)

Causality and ordering

Causality is an important property of memory models, in part because causality looms large in most peoples' intuitive understanding of concurrent code. However, causality is a generic term, lacking the precision required for a formal memory model. In this series we will therefore use the terms “causality” and “causal relationship” quite sparingly, instead defining precise terms that will be used directly within the memory model. But a brief discussion now will help illuminate the topic and will introduce some important relationships between causality, ordering, and memory models.

Causality is simply the principle that a cause happens before its effect, not after. It is therefore a statement about ordering of events in time. Let's start with the simplest and most direct example. If CPU A writes a value to a shared variable in memory, and CPU B reads that value back from the shared variable, then A's write must execute before B's read. This truly is an example of a cause-and-effect relation; the only way B can possibly know the value stored by A is to receive some sort of message sent directly or indirectly by A (for example, a cache-coherence protocol message).

Messages take time to propagate from one CPU or cache to another, and they cannot be received before they have been sent. (In theory, B could guess the value of A's write, act on that guess, check the guess once the write message arrived, and if the guess was wrong, cancel any actions that were inconsistent with the actual value written. Nevertheless, B could not be entirely certain that its guess is correct until the message arrives—and our memory models assume that CPUs do not engage in this sort of guessing, at least not unless they completely hide its effects from the software they are running.)

On the other hand, if B does not read the value stored by A but rather an earlier value, then there need not be any particular temporal relation between A's write and B's read. B's read could have executed either before or after A's write, as long as it executed before the write message reached B. In fact, on some architectures, the read could return the old value even if it executed a short time after the message's arrival. A fortiori, there would be no cause-and-effect relation.

Another example of ordering also involves the propagation of writes from one CPU to another. If CPU A writes to two shared variables, these writes need not propagate to CPU B in the same order as the writes were executed. On some architectures it is entirely possible for B to receive the messages conveying the new values in the opposite order. In fact, it is even possible for the writes to propagate to CPU B in one order and to CPU C in the other order. The only portable way for the programmer to enforce write propagation in the order given by the program is to use appropriate memory barriers or barrier-like constructs, such as smp_mb(), smp_store_release(), or C11 non-relaxed atomic operations.

A third example of ordering involves events occurring entirely within a single CPU. Modern CPUs can and do reorder instructions, executing them in an order different from the order they occur in the instruction stream. There are architectural limits to this sort of thing, of course. Perhaps the most pertinent for memory models is the general principle that a CPU cannot execute an instruction before it knows what that instruction is supposed to do.

For example, consider the statement “x = y;”. To carry out this statement, a CPU must first load the value of y from memory and then store that value to x. It cannot execute the store before the load; if it tried, it would not know what value to store. This is an example of a data dependency. There are also address dependencies (for example, “a[n] = 3;”, where the value of n must be loaded before the CPU can know where to store the value 3). Finally, there are control dependencies (for example, “if (i == 0) y = 5;”, where the value of i must be loaded before the CPU can know whether to store anything into y). In the general case where no dependency is present, however, the only portable way for the programmer to force instructions to be executed in the order given by the program is to use appropriate memory barriers or barrier-like constructs.

Finally, at a higher level of abstraction, source code statements can be reordered or even eliminated entirely by an optimizing compiler. We won't discuss this very much here; memory-barriers.txt contains a number of examples demonstrating the sort of shenanigans a compiler can get up to when translating a program from source code to object code.

To be continued

The second half of this series focuses on the specific problem of cycles.

Acknowledgments

We owe thanks to H. Peter Anvin, Will Deacon, Andy Glew, Derek Williams, Leonid Yegoshin, and Peter Zijlstra for their patient explanations of their respective systems' memory models. We are indebted to Peter Sewell, Susmit Sarkar, and their groups for their seminal work formalizing many of these same memory models. We all owe thanks to Dmitry Vyukov, Boqun Feng, and Peter Zijlstra for their help making this human-readable. We are also grateful to Michelle Rankin and Jim Wasko for their support of this effort.

Answer to the Quick Quiz

Quick Quiz 1: But my code contains simple unadorned accesses to shared variables. So what possible use is this memory model to me?

Answer: You are of course free to use simple unadorned accesses to shared variables in your code, but you are then required to make sure that the compiler isn't going to trip you up—as has always been the case. Once you have made sure that the compiler won't trip you up, simply translate those accesses to use READ_ONCE() and WRITE_ONCE() when using the model. Of course, if your code gives the compiler the freedom to rearrange your memory accesses, you may need multiple translations, one for each possible rearrangement.

Back to Quick Quiz 1.
