This edition contains the following feature content:

This week's edition also includes these inner pages:

Brief items: Brief news items from throughout the community.

Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

At the end of June, Zachary Fouts noticed something on his Ubuntu system that surprised him a bit: an entry in the "message of the day" (motd) that looked, at least to some, like an advertisement. That is, of course, not what anyone expects from their free-software system; it turns out that it wasn't an ad at all, though it was worded ambiguously and could be (and was) interpreted that way. As the discussion in the bug Fouts filed shows, the "ad" came about from a useful feature that may or may not have been somewhat abused—that determination depends on the observer.

It is a longstanding Unix tradition to print a message of the day when users log in; in ages past, administrators would often note upcoming software upgrades and/or maintenance downtime that way. Typically that message has come from the /etc/motd file, but Ubuntu has long had a way to dynamically generate messages from local system information (e.g. the number of package updates available or whether a reboot is needed) using scripts in the /etc/update-motd.d/ directory. In Ubuntu 17.04, a new script was added that reaches out to a URL and displays what it finds there as part of the motd.

By default, this "motd-news" feature is enabled and checks https://motd.ubuntu.com for updates. That check is not done at login time; it is performed periodically (every twelve hours or so) and the result is cached. At the time of this writing, the message there is reminding users that Ubuntu 16.10 reaches its end of life (EOL) on July 20. But at the time Fouts filed the bug, it carried a rather different message:

* How HBO's Silicon Valley built "Not Hotdog" with mobile TensorFlow, Keras & React Native on Ubuntu - https://ubu.one/HBOubu

In the bug, Fouts said that the news item was targeted poorly: "Instead, https://motd.ubuntu.com should show relevant items to those that use Ubuntu Server (relevant security issues, etc), instead of items for desktop users." Others were quick to wonder whether it was an ad of some sort. Andrew Starr-Bochicchio was disappointed to see it:

I can understand the desire to be able to communicate directly to users and present timely, relevant information, but linking out to content marketing in what seems to be one of its first uses is self-sabotage. This type of behavior will lead to it being disabled and the "important security messages" to not be seen.

He pointed to the /etc/default/motd-news file as a way to disable the feature for those who wanted to do that. Others followed suit; Mikko Tanner said: "Advertising has absolutely no place in motd." No one really defended the content itself, though several commenters considered it to be a mix-up of some kind. Simos Xenitellis asked: "Is it really necessary to conflate this into some conspiracy to display ads in the Ubuntu Server motd?" The post being "advertised" is actually technical in nature and has little to do with the "Silicon Valley" TV show (the app that is built was evidently featured in an episode), but it does namedrop Ubuntu. That is presumably why it was chosen to appear as part of the news stream.
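For those wanting to opt out, the /etc/default/motd-news file mentioned above is a simple shell-style settings file; a sketch of the disabling change (the variable name is an assumption based on Ubuntu's packaging):

```
# /etc/default/motd-news
# Set to 0 to stop fetching news items from motd.ubuntu.com
ENABLED=0
```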

Ubuntu Product Manager Dustin Kirkland, who is the author of the original dynamic motd as well as the new motd-news feature, soon arrived in the bug thread (after commenting in a related Hacker News thread). In a lengthy comment, he explained how motd-news works and gave some history of the dynamic motd feature he developed back in 2009. He described how Ubuntu is using the feed and how it can be configured to consult a local URL for news items that would be displayed instead of (or in addition to) the official feed. There are several categories of messages that will be added, including internet-wide problems (such as Heartbleed) or important information about Ubuntu itself (like an EOL date reminder). But there is a third category:

And sometimes, it's just a matter of presenting a fun fact. News from the world of Ubuntu. Or even your own IT department. Such was the case with the Silicon Valley / HBO message. It was just an interesting tidbit of potpourri from the world of Ubuntu. Last week's message actually announced an Ubuntu conference in Latin America. The week before, we linked to an article asking for feedback on Kubuntu.

While Kirkland was not apologizing for the news item—he clearly believes it is a reasonable use of the facility—he did say that new messages would be reviewed by the ubuntu-motd team before going live. He invited those reading to submit their own messages to the repository for potential inclusion in the motd-news stream.

Some still objected to fun facts being intermingled with critical information such that users could not get one without also getting the other. Timothy R. Chavez suggested splitting out fun facts into their own stream that could be disabled by default for server installations. Markus Ueberall thought that applying tags to the messages would allow the client side to decide what it displays, which would presumably alleviate the concerns.

But Kirkland does not see a problem with the message: "Moreover, the HBO link wasn't even an advertisement!" He wondered whether those complaining were also opposed to paid Google search results and to the Google Doodles that appear on its home page. Those are imperfect analogies at best, of course. Some disagreed with Kirkland's characterization of the news item, however; Nicola Heald said:

I think the thing that made me feel uneasy is that the motd read like an advertisement. And so did parts of the article, specifically saying that we should watch Silicon Valley. I appreciate that it was not meant that way though. But maybe people are so sick of seeing clickbait advertorial content when they browse the internet that the message brought up some bad reactions.

Beyond the advertising angle, though, is a question of privacy. The "user agent" string used to contact the motd-news server sends a small amount of potentially sensitive information, including the uptime for the server, according to Chavez. It is believed that the uptime might be used to determine what news item to return (e.g. if the system has been up so long it could not have applied a particular update), but it is not clear whether that information is tracked by Canonical. It is a fairly minor privacy breach, potentially, but one that concerns a few, including Fouts, the original reporter, who had some further thoughts:

Fun facts are indeed fun, but this feature should be reserved for important information regarding EOL, Security Patches, etc. If the administrator of ${system} wants a fun fact, they can install something else. Cow Say, Fortune, whatever to display that. Not trying to stir anything up, it's a great feature but that feature should be used wisely so people do not disable it.

So far, there is no indication of any plans to change things. Kirkland changed the importance of the bug to "Wishlist" and its status to "Opinion" on June 29. He seemed to indicate that more care would be taken in choosing fun facts in the future (perhaps reviewing the wording to reduce the perception that it is an ad), but the feature itself will not be changing.

While there is some element of a "sad Twitter storm in a tea cup" regarding the bug, as Xenitellis put it, there are some reasonable concerns that it has surfaced. Clearly the news item in question was aimed at doing a bit of marketing regarding Ubuntu—people have varying reactions to that kind of message, especially in unexpected locations. And, while it is hard to imagine that Canonical has some nefarious plan that uses system uptimes, sending that kind of information anywhere seems like it should be opt-in. Overall, that seems to be the failing here: not getting permission before making these kinds of changes. There are a number of ways that could be fixed, of course, but it would seem that Ubuntu/Canonical are not particularly interested in doing so, at least yet.

Comments (47 posted)

A recent paper [PDF] by a group of eight cryptography researchers shows, once again, how cryptographic breakthroughs are made. They often start small, with just a reduction in the strength of a cipher or in the key search space, say, but then grow over time to reach the point of a full-on breaking of a cipher or an implementation of one. In this case, the RSA implementation in Libgcrypt for 1024-bit keys has been fully broken using a side-channel attack against the operation of the library—2048-bit keys are also susceptible, but not with the same reliability, at least using this exact technique.

The RSA cryptosystem involves a great deal of modular arithmetic on large numbers, in particular exponentiation with sizable exponents. For efficiency reasons, these operations are usually implemented with a square-and-multiply algorithm. Libgcrypt is part of the GNU Privacy Guard (GnuPG or GPG) project and underlies the cryptography in GPG 2.x; it uses a sliding-window mechanism as part of its square-and-multiply implementation. It is this sliding-window technique that was susceptible to analysis of the side channel and, thus, allowed for the break.

The cryptographers who wrote the paper (Daniel J. Bernstein, Joachim Breitner, Daniel Genkin, Leon Groot Bruinderink, Nadia Heninger, Tanja Lange, Christine van Vredendaal, and Yuval Yarom from multiple universities across the globe) note that the Libgcrypt maintainers at one point rejected a fix that would have thwarted the extraction of key information: "However, the maintainers refused a patch to switch from sliding windows to fixed windows; they said that this was unnecessary to stop the attacks." The reason was that even though sliding windows can reveal some parts of the key to local attackers, Libgcrypt's window was such that only 40% of a 1024-bit key (and 33% of a 2048-bit key) was exposed, which was insufficient to recover the full key efficiently, or so it was thought. The researchers found that reasoning did not hold up.

If an attacker can observe the pattern of squarings and multiplications by way of the cache, which is an established technique for an attacker running on the same hardware, they can extract some parts of the key. It turns out that there are two ways to implement the sliding window algorithm: right to left (i.e. starting with the least-significant bit) or left to right. It is slightly more efficient to use the left-to-right direction and that is what is recommended in several reference texts, so it is not surprising that Libgcrypt chose that direction. But the researchers found that the left-to-right calculation leaks many more bits of the key.
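The left-to-right pattern at issue can be sketched in C; this is a simplified illustration only (no sliding window, and word-sized operands rather than the multi-precision integers Libgcrypt actually uses):

```c
#include <assert.h>
#include <stdint.h>

/* Left-to-right square-and-multiply: the exponent is scanned from the
 * most-significant bit down.  Every iteration squares; only iterations
 * where the exponent bit is 1 also multiply.  An attacker who can
 * distinguish the two operations through the cache learns which bits
 * are set.  Assumes mod < 2^32 so the products fit in 64 bits. */
uint64_t modexp_l2r(uint64_t base, uint64_t exp, uint64_t mod)
{
    uint64_t result = 1 % mod;
    int i;

    base %= mod;
    for (i = 63; i >= 0; i--) {
        result = (result * result) % mod;    /* square: happens for every bit */
        if ((exp >> i) & 1)
            result = (result * base) % mod;  /* multiply: reveals a 1 bit */
    }
    return result;
}
```

A right-to-left variant performs the same squarings and multiplications in a different order; the paper's point is that the left-to-right order, combined with sliding windows, reveals more exponent bits to a cache-timing observer.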

To verify the results, the researchers monitored particular memory locations in the RSA signature code path to extract the needed sequence of squarings and multiplications. That resulted in exposing 48% of the key bits, but 50% or more is needed for the technique used to reconstruct the full key. By analyzing the algorithm used by Libgcrypt, the researchers were able to find patterns and rules that could be used to add more known bits to the key.

In the paper, they say that most 1024-bit keys can be recovered by searching through 10,000 candidates, though some require searching up to 1,000,000 candidates. For 2048-bit keys, 13% could be found by searching only 2,000,000 possible keys. Since the public key is known, it should be straightforward to use any signatures produced to verify which of the possibilities is the proper key.

As might be guessed, the paper goes into great detail about the algorithms and how the information provided by the FLUSH+RELOAD side channel was used to extract enough bits to break the keys.

On June 29, the Libgcrypt project released version 1.7.8 to address the problem (which is also known as CVE-2017-7526). The change made was not to switch to right-to-left operation or to a fixed window as mentioned in the paper, but to instead use blinding on the exponent to obscure the actual bits of the key. Blinding applies a reversible transformation to the input of a calculation such that the output can be converted back into the result the original input would have produced. Attackers observing the sequence of calculations will be unable to extract the actual value of interest.
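The idea behind exponent blinding can be shown with toy numbers (an illustration only, not Libgcrypt's code; the real library works on multi-precision integers and picks a fresh random blinding value per operation):

```c
#include <assert.h>
#include <stdint.h>

/* Toy RSA modulus: n = 33 = 3 * 11, so phi(n) = 20; private exponent
 * d = 7.  Because m^phi(n) = 1 (mod n) for m coprime to n, exponentiating
 * with d + r*phi(n) yields the same answer as using d, while the bit
 * pattern fed to square-and-multiply differs for every value of r. */
static uint64_t modexp(uint64_t base, uint64_t exp, uint64_t mod)
{
    uint64_t result = 1 % mod;

    base %= mod;
    while (exp) {
        if (exp & 1)
            result = (result * base) % mod;
        base = (base * base) % mod;
        exp >>= 1;
    }
    return result;
}

uint64_t blinded_modexp(uint64_t m, uint64_t d, uint64_t phi,
                        uint64_t n, uint64_t r)
{
    return modexp(m, d + r * phi, n);   /* r would be randomly chosen */
}
```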

A note in the commit message makes it clear that blinding is simply a quick fix, though it is not clear if something more substantial is in the works. "Exponent blinding is a kind of workaround to add noise. Signal (leak) is still there for non-constant-time implementation." But the release announcement does downplay the significance of the bug somewhat:

Note that this side-channel attack requires that the attacker can run arbitrary software on the hardware where the private RSA key is used. Allowing execute access to a box with private keys should be considered as a game over condition, anyway. Thus in practice there are easier ways to access the private keys than to mount this side-channel attack. However, on boxes with virtual machines this attack may be used by one VM to steal private keys from another VM.

The research is definitely an important result, but the practical implications, at least for those not running on virtual machines alongside those of attackers, would seem to be fairly small. On the other hand, it only takes one security hole that lets an attacker's code onto a system that regularly uses a private key of interest for that key to be at fairly serious risk. This kind of incident helps remind us that attacks against cryptography only get better over time—that's something worth keeping in mind.

While it is not necessarily directly related, it should be pointed out that the GnuPG project is currently looking for financial support. Both GPG itself and Libgcrypt are used by a wide variety of other tools; it is worrisome that the project is not better supported financially. While there is no reason to believe there is Heartbleed-level bitrot going on within GnuPG, there should be concern that the project may not be able to keep up with the threats it faces in today's internet.

Comments (7 posted)

Linus Torvalds released the 4.12 kernel on July 2, marking the end of one of the busiest development cycles in the kernel project's history. Tradition requires that LWN publish a look at this kernel release and who contributed to it. 4.12 was, in many ways, a fairly normal cycle, but it shows the development community's continued growth.

The 4.12 kernel includes 14,821 non-merge changesets contributed by 1,825 developers. That is not the highest changeset count we've ever seen — 4.9 is likely to hold that record for some time — but it comes in at a solid #2. The 4.12 kernel did set a new record for the number of developers participating and for the number of first-time contributors (334), though. This was also a significant release for the growth of the kernel code base: 4.12 has just over one million lines of code more than its predecessor.

The most active developers in the 4.12 cycle were:

Most active 4.12 developers

By changesets:

    Chris Wilson                365   2.5%
    Al Viro                     143   1.0%
    Christoph Hellwig           136   0.9%
    Tobin C. Harding            134   0.9%
    Johan Hovold                124   0.9%
    Colin Ian King              116   0.8%
    Geert Uytterhoeven          116   0.8%
    Jan Kara                    115   0.8%
    Arnd Bergmann               113   0.8%
    Hans de Goede               102   0.7%
    Daniel Vetter               100   0.7%
    Dan Carpenter                98   0.7%
    Arnaldo Carvalho de Melo     92   0.6%
    Alex Deucher                 91   0.6%
    Markus Elfring               89   0.6%
    Mauro Carvalho Chehab        86   0.6%
    Ville Syrjälä                83   0.6%
    Yan-Hsuan Chuang             83   0.6%
    Javier Martinez Canillas     80   0.5%
    Marc Zyngier                 78   0.5%

By changed lines:

    Alex Deucher             369179  25.2%
    Alan Cox                 209556  14.3%
    Hans de Goede            112114   7.7%
    Hans-Christian Egtvedt    27100   1.9%
    Gilad Ben-Yossef          17593   1.2%
    Chris Wilson              15670   1.1%
    Eric Huang                10851   0.7%
    Steven J. Hill            10837   0.7%
    Paolo Valente             10505   0.7%
    Yan-Hsuan Chuang          10289   0.7%
    Geert Uytterhoeven         9580   0.7%
    Mauro Carvalho Chehab      8887   0.6%
    Christoph Hellwig          8285   0.6%
    Javier González            8211   0.6%
    Ioana Radulescu            8123   0.6%
    Benjamin Herrenschmidt     8016   0.5%
    Boris Brezillon            7943   0.5%
    Jie Deng                   7741   0.5%
    Ken Wang                   6904   0.5%
    Neil Armstrong             6887   0.5%

For the second cycle in a row, Chris Wilson contributed the most changesets; almost all of them were changes to the Intel i915 graphics driver. Al Viro worked as usual in the virtual filesystem layer, but the bulk of his patches this time around were a reworking of the low-level user-space access code — a job that required changing a fair amount of architecture-specific machinery. Christoph Hellwig made a number of improvements in the block and filesystem layers, Tobin Harding focused on staging fixes, and Johan Hovold worked extensively in the USB subsystem and beyond.

In a cycle where the kernel grows by a million lines, one can expect to see some developers adding a lot of code. Alex Deucher added more AMD graphics driver register definitions; drivers/gpu/drm/amd/include now contains over 800,000 lines of such definitions. Alan Cox added the Intel "atomisp" camera drivers to the staging tree. Hans de Goede added the rtl8723bs WiFi driver (plus a bunch of other work), Hans-Christian Egtvedt bucked the trend by removing the unloved AVR32 architecture, and Gilad Ben-Yossef added the ARM TrustZone CryptoCell C7XX crypto accelerator drivers.

Work on the 4.12 kernel was supported by at least 233 employers, a number which is pretty much in line with previous releases. The most active of those employers were:

Most active 4.12 employers

By changesets:

    Intel                  2340  13.9%
    (Unknown)              1447   8.6%
    Red Hat                1257   7.5%
    (None)                 1173   7.0%
    IBM                     876   5.2%
    Linaro                  570   3.4%
    AMD                     526   3.1%
    Google                  515   3.1%
    SUSE                    482   2.9%
    (Consultant)            458   2.7%
    Samsung                 348   2.1%
    ARM                     338   2.0%
    Renesas Electronics     303   1.8%
    Mellanox                284   1.7%
    Oracle                  238   1.4%
    Broadcom                230   1.4%
    Free Electrons          221   1.3%
    NXP Semiconductors      212   1.3%
    Huawei Technologies     199   1.2%
    Texas Instruments       191   1.1%

By lines changed:

    AMD                  406009  25.8%
    Intel                330637  21.0%
    Red Hat              171069  10.9%
    IBM                   50198   3.2%
    Linaro                43525   2.8%
    (Unknown)             39629   2.5%
    (None)                31731   2.0%
    ARM                   30795   2.0%
    Cisco                 30016   1.9%
    Cavium                29737   1.9%
    Samsung               25442   1.6%
    Google                22814   1.5%
    NXP Semi.             20767   1.3%
    (Consultant)          17941   1.1%
    Renesas Electronics   17663   1.1%
    Mellanox              16638   1.1%
    Free Electrons        16636   1.1%
    Realtek               12414   0.8%
    Synopsys              12201   0.8%
    SUSE                  11929   0.8%

As has been the case in recent years, there are not a lot of surprises to be found in this table. Kernel development may move quickly, but the commercial ecosystem surrounding it changes rather more slowly.

Another way of looking at things is to ask what the companies above are actually working on. Taking the data from after the 4.7 release (essentially one year's worth) and restricting it to Intel's contributions, we see something like this:

Intel (9192 total)

    Percent  Directory             Notes
    38.3%    drivers/gpu
    32.0%    drivers/gpu/drm/i915
    10.2%    include
     9.6%    drivers/net
     5.4%    drivers/staging       Mostly the Lustre filesystem
     4.5%    arch/x86
     4.0%    drivers/infiniband
     3.5%    sound
     3.4%    drivers/usb
     3.1%    tools

Intel's work, thus, is mostly focused on support for Intel hardware — not a huge surprise, really. The company is routinely the kernel's largest single contributor, but it leaves core-kernel development to others.

The results for Red Hat look rather different (once again, looking at patches after 4.7):

Red Hat (4947 total)

    Percent  Directory     Notes
    15.8%    include
    14.8%    fs
    11.8%    tools         Mostly perf
    10.6%    net
    10.3%    arch/x86
     9.3%    drivers/gpu
     8.1%    kernel
     5.5%    drivers/net
     4.0%    drivers/md
     2.6%    arch/arm

Red Hat clearly has a more generalist role in kernel development, making changes all over the tree and throughout the core.

The next two rows in the table are for the hobbyists and the unknowns. The corresponding maps of where they are working are:

Unknown affiliation (5080 total)

    Percent  Directory        Notes
    22.6%    drivers/staging
     7.8%    net
     7.2%    include
     6.6%    drivers/net
     5.3%    arch/arm
     5.3%    drivers/gpu
     5.3%    Documentation    Mostly device-tree bindings
     4.7%    sound

No affiliation (4277 total)

    Percent  Directory        Notes
    14.5%    drivers/net
    12.1%    drivers/staging
    10.7%    net              Mostly netfilter and batman-adv
     7.6%    include
     6.7%    drivers/media
     5.8%    Documentation
     5.4%    arch/arm
     4.9%    drivers/gpu
     3.4%    fs

To complete the set, here are the results for some of the other top companies:

IBM (2605 total)

    Percent  Directory      Notes
    35.4%    arch/powerpc
    17.0%    arch/s390
     7.7%    drivers/s390
     5.9%    tools
     5.7%    include
     5.5%    drivers/net
     5.2%    kernel

AMD (1788 total)

    Percent  Directory               Notes
    82.7%    drivers/gpu/drm/amd
     4.6%    drivers/gpu/drm/radeon
     4.6%    include
     2.6%    arch/x86

Linaro (4084 total)

    Percent  Directory        Notes
    31.7%    drivers/staging  Mostly greybus
     7.7%    arch/arm
     6.7%    include
     5.4%    arch/arm64
     4.3%    drivers/net
     4.0%    Documentation    Device-tree bindings
     3.6%    drivers/gpu
     2.6%    drivers/mmc

Google (1956 total)

    Percent  Directory        Notes
    17.4%    net              Core and ipv4, mainly
    14.7%    include
    11.1%    drivers/staging  Greybus
    10.2%    drivers/pci
     9.1%    drivers/net
     8.6%    fs
     5.6%    arch/x86
     4.5%    drivers/input
     4.3%    Documentation
     3.7%    mm

SUSE (1896 total)

    Percent  Directory      Notes
    28.4%    fs             15% btrfs
    16.5%    include
    11.1%    mm
     8.4%    sound
     8.3%    arch/x86
     6.8%    drivers/scsi
     6.3%    drivers/md
     4.2%    kernel
     4.0%    Documentation

Clearly, each company is contributing to the kernel for its own reasons, and each focuses its effort accordingly. Hardware-oriented companies have a tendency to not look much beyond supporting their own products, while companies that deal more directly with the end users have a more general focus. Somehow, they all manage to work together and keep the kernel process going and the community growing in a consistent and predictable way.

Comments (14 posted)

The kernel's file capabilities mechanism is a bit of an awkward fit with user namespaces, in that all namespaces have the same view of the capabilities associated with a given executable file. There is a patch set under consideration that adds awareness of user namespaces to file capabilities, but it has brought forth some disagreement on how such a mechanism should work. The question is, in brief: how should a set of file capabilities be picked for any given user namespace?

The Linux capabilities mechanism is meant to allow privileges to be granted to processes in a manner that is more fine-grained than the classic Unix "root can do anything" approach. So, for example, an otherwise unprivileged program that needs to be able to send a signal to an unrelated process could be given CAP_KILL rather than full root privileges. Capabilities have not revolutionized privilege management as had once been hoped, but they can still have their uses.

In a typical Unix system, privileged operations are made available to ordinary users by way of setuid programs. In a system with capabilities, it is natural to want to associate capabilities with an executable program instead, once again in the hope of limiting the amount of privilege that must be granted. File capabilities, added for the 2.6.24 kernel, provide that feature.

User namespaces allow a set of processes to run as root within the namespace, while mapping the root ID (and possibly others) to normal IDs for actions (such as filesystem access) involving the rest of the system. A process running as root within a user namespace can create a setuid-root binary that will only work as intended within that namespace; it will not be usable to escalate privileges outside of the namespace. The same is not true of file capabilities, though; all user namespaces have the same view of the capabilities associated with an executable file and, since processes in a user namespace lack privilege in the root namespace, they cannot change those capabilities.

File capabilities are implemented using extended attributes; in particular, they are stored in the security.capability attribute. The kernel handles the security.* extended-attribute namespace specially; only a privileged program (one possessing the CAP_SYS_ADMIN capability in particular) can change those attributes. So it is not possible for an unprivileged container running within a user namespace to add capabilities to a file; there is, in any case, no way to store extended attributes such that they are only visible within a given user namespace.
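Reading that attribute from user space is a single getxattr() call; a minimal sketch (the helper name is hypothetical):

```c
#include <assert.h>
#include <errno.h>
#include <sys/xattr.h>

/* Fetch a file's security.capability extended attribute, where file
 * capabilities are stored.  Returns the attribute's length, or -1 with
 * errno set (ENODATA if the file carries no capabilities).  Writing
 * the attribute, by contrast, requires CAP_SYS_ADMIN. */
ssize_t read_file_caps(const char *path, void *buf, size_t size)
{
    return getxattr(path, "security.capability", buf, size);
}
```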

The proposed patch set, posted by Stefan Berger, aims to change that by extending the extended-attribute syntax. This is done by decorating attributes with syntax describing the user ID (in the root namespace) associated with UID zero within a user namespace. So, for example, if a user with UID 1000 starts a user namespace, processes running as root within that namespace will access the filesystem with the original ID of 1000. If that user adds capabilities to a file within the user namespace, those capabilities will actually be stored in an extended attribute named:

security.capability@uid=1000

Outside the namespace, this new attribute will have no effect. Within any namespace mapped to UID 1000, though, that attribute will appear as simply security.capability, so the program contained within that file will run with those capabilities in its masks.

This mechanism does not apply to extended attributes in general; it is, instead, restricted to a specific set of attributes that the kernel cares about. In the patch set, security.capability is obviously one of those attributes; the other is security.selinux, allowing for namespace-specific SELinux labels on files. The SELinux attribute was later removed, though, after SELinux maintainer Stephen Smalley pointed out that it would not work as intended.

Casey Schaufler objected to this mechanism, noting that if two user namespaces are both running mapped to UID 1000 and sharing a directory tree, file capabilities set in one of those namespaces will be visible in the other. He argued that the user ID is the wrong key to use for file capabilities; instead, he said, there should be some sort of persistent ID associated with the user namespace itself. Serge Hallyn (who had posted a namespaced file-capabilities patch of his own that had served as inspiration for Berger's work) disagreed, though, saying that the feature was working as designed.

James Bottomley, instead, objected that this mechanism will work poorly on systems where user IDs for containers are allocated dynamically. He asked for a simple @uid suffix, which would be picked up in any user namespace. Hallyn indicated openness to adding that suffix as an additional feature.

It would seem that most of the concerns about the feature itself have been headed off, so this patch set may be well on its way toward acceptance. That does, of course, leave out the biggest point of contention of all, one that was inevitable in retrospect: the proper formatting of the namespace-specific extended-attribute names. So the final form of the attribute may be something like security.ns@uid=1000@@.capability when the dust settles. Otherwise, though, namespaced file capabilities may be a kernel feature in the relatively near future.

Comments (10 posted)

In many performance-oriented settings, the number of times that data is copied puts an upper limit on how fast things can go. As a result, zero-copy algorithms have long been of interest, even though the benefits achieved in practice tend to be disappointing. Networking is often performance-sensitive and is definitely dominated by the copying of data, so an interest in zero-copy algorithms in networking comes naturally. A set of patches under review makes that capability available, in some settings at least.

When a process transmits a buffer of data, the kernel must format that data into a packet with all of the necessary headers and checksums. Once upon a time, this formatting required copying the data into a single kernel-space buffer. Network hardware has long since gained the ability to do scatter/gather I/O and, with techniques like TCP segmentation offloading, the ability to generate packets from a buffer of data. So support for zero-copy operations has been available at the hardware level for some time.

On the software side, the contents of a file can be transmitted without copying them through user space using the sendfile() system call. That works well when transmitting static data that is in the page cache, but it cannot be used to transmit data that does not come directly from a file. If, as is often the case, the data to be transmitted is the result of some sort of computation — the application of a template in a content-management system, for example — sendfile() cannot be used, and zero-copy operation is not available.
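The sendfile() path can be sketched with a hypothetical helper that moves bytes between two descriptors without a user-space copy (the output is usually a socket, but since Linux 2.6.33 a regular file works too, which keeps the example self-contained):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/sendfile.h>
#include <unistd.h>

/* Copy up to count bytes from in_fd (a regular file, typically page
 * cache resident) to out_fd, entirely inside the kernel.  Returns the
 * number of bytes transferred, or -1 on error. */
ssize_t kernel_copy(int in_fd, int out_fd, size_t count)
{
    off_t offset = 0;   /* start reading in_fd from the beginning */

    return sendfile(out_fd, in_fd, &offset, count);
}
```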

The MSG_ZEROCOPY patch set from Willem de Bruijn is an attempt to make zero-copy transmission available in such settings. Making use of it will, naturally, require some changes in user space, though.

Requesting zero-copy operation is a two-step process. Once a socket has been established, the process must call setsockopt() to set the new SOCK_ZEROCOPY option. Then a zero-copy transmission can be made with a call like:

status = send(socket, buffer, length, MSG_ZEROCOPY);

One might wonder why the SOCK_ZEROCOPY step is required. It comes down to a classic API mistake: the send() system call doesn't check for unknown flag values. The two-step ritual is thus needed to avoid breaking any programs that might have been accidentally setting MSG_ZEROCOPY for years and getting away with it.
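The opt-in step is a plain setsockopt() call; a sketch under stated assumptions (the article describes the option as SOCK_ZEROCOPY; the constant name and value used here are guesses at how the patch set exposes it, and kernels without the feature simply return an error):

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60   /* assumption: numeric value of the new option */
#endif

/* Opt a socket in to zero-copy transmission; send() calls made with
 * MSG_ZEROCOPY are treated as ordinary sends unless this has been done. */
int enable_zerocopy(int fd)
{
    int one = 1;

    return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
}
```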

If all goes well, a transmission with MSG_ZEROCOPY will lock the given buffer into memory and start the transmission process. Transmission will almost certainly not be complete by the time that send() returns, so the process must take care to not touch the data in the buffer while the operation is in progress. That immediately raises a question: how does the process know when the data has been sent and the buffer can be used again? The answer is that the zero-copy mechanism will place a notification message in the error queue associated with the socket. That notification can be read with something like:

status = recvmsg(socket, &message, MSG_ERRORQUEUE);

The socket can be polled for an error status, of course. When an "error" packet whose origin is SO_EE_ORIGIN_ZEROCOPY shows up, it can be examined to determine the status of the operation, including whether the transmission succeeded and whether it was able to run in the zero-copy mode. These status packets contain a sequence number that can be used to associate them with the operation they refer to; the fifth zero-copy send() call will generate a status packet with a sequence number of five. These status packets can be coalesced in the kernel, so a single packet can report on the status of multiple operations.

The mechanism is designed to allow traditional and zero-copy operations to be freely intermixed. The overhead associated with setting up a zero-copy transmission (locking pages into memory and such) is significant, so it makes little sense to do it for small transmissions where there is little data to copy in the first place. Indeed, the kernel might decide to use copying for a small operation even if MSG_ZEROCOPY is requested but, in that case, it must still go to the extra effort of generating the status packet. So the developers of truly performance-oriented programs will want to take care to only request zero-copy behavior for large buffers; just where the cutoff should be is not entirely clear, though.

Sometimes, zero-copy operation is not possible regardless of the buffer size. For example, if the network interface cannot generate checksums, the kernel will have to perform a pass over the data to do that calculation itself; at that point, copying the data as well is nearly free. Anytime that the kernel must transform the data — when IPSec is being used to encrypt the data, for example — it cannot do zero-copy transmission. But, for most straightforward transmission cases, zero-copy operation should be possible.

Readers might be wondering why the patch does not support zero-copy reception; while the patch set itself does not address this question, it is possible to make an educated guess. Reading is inherently harder because it is not generally known where a packet is headed when the network interface receives it. In particular, the interface itself, which must place the packet somewhere, is probably not in a position to know that a specific buffer should be used. So incoming packets end up in a pile and the kernel sorts them out afterward. Fancier interfaces have a fair amount of programmability, to the point that zero-copy reception is not entirely infeasible, but it remains a more complex problem. For many common use cases (web servers, for example), transmission is the more important problem anyway.

As was noted in the introduction, the benefits from zero-copy operation are often less than one might hope. Copying is expensive, but the setup required to avoid a copy operation also has its costs. In this case, the author claims that a simple benchmark (netperf blasting out data) runs 39% faster, while a more realistic production workload sees a 5-8% improvement. So the benefit for real-world systems is not huge, but it may well be enough to be worth going for on highly-loaded systems that transmit a lot of data.

The patch set is in its fourth revision as of this writing, and the rate of change has slowed considerably. There do not appear to be any fundamental objections to its inclusion at this point. For those wanting more details, this paper [PDF] by De Bruijn is worth a read.

Comments (29 posted)

Network acceleration has always been a subject that attracts the interest of network-device vendors and developers. Kernel network-acceleration techniques that require, for example, caching kernel networking data structures inside the network driver (or maintaining a private, modified kernel for a specific device) are naturally frowned upon and bound to be rejected by the kernel networking community. There are also user-space kernel-bypass solutions, including the Data Plane Development Kit (DPDK).

Among the most popular open-source projects providing user-space network acceleration are Snabb, netmap, and DPDK. With Jim Zemlin's announcement this April that the DPDK project has moved to the Linux Foundation, this seems like a good time to get an overview of the project's current status and its roadmap.

The DPDK project

DPDK was created by Intel in 2010 as a suite of tools that enable the efficient transfer of packets through a server. In 2013, the project web site, www.dpdk.org, was created by 6Wind and, recently, it moved to the Linux Foundation. DPDK is a set of libraries and drivers written in C providing I/O acceleration for network and cryptographic devices. It is a fully open-source (BSD-licensed) project, and it runs on Linux and FreeBSD. The project maintainer is Thomas Monjalon.

DPDK is used by more than 20 open-source projects, including OPNFV, OvS-DPDK, the Fast Data project (FD.io), Rump, dpdk-nginx, OpenDaylight, Contrail Virtual Router, and more. It supports a wide variety of platforms and over 20 types of interface cards, and it runs on a variety of CPU architectures. It includes contributions from over 400 individuals from 70 different organizations. Starting in April 2016, the project adopted the Ubuntu version-numbering scheme, where each release is tagged YY.MM; thus the most recent release is DPDK 17.05, from May 2017, and the next will be DPDK 17.08, due in August 2017, reflecting the project's quarterly release cadence.

Among the interesting new features added in DPDK 17.05 is the new event-driven programming model library (rte_eventdev). In this model, as opposed to the polling model, the cores call the DPDK scheduler, which selects packets for them. This model adds support for dynamic load balancing, automatic multi-core scaling, and more. Until 17.05, the DPDK cryptodev API had supported only Intel hardware accelerators; a new poll-mode driver was added by NXP for its Data Path Acceleration Architecture Gen2 cryptographic accelerators.

One of the more interesting features introduced in the previous release, DPDK 17.02, is the generic flow API (rte_flow), which provides a generic means to configure hardware to match specific ingress or egress traffic. In the upcoming 17.08 release, one can expect to see features like support for a generic quality-of-service API, generic receive offload support, and more.

A simple DPDK application

Before delving into the details, let's take a look at a simple layer 2 (L2) forwarding DPDK application; becoming familiar with it will help to understand and develop more advanced DPDK applications. With this program, packets arriving at one port will be forwarded back via a second port after switching the source and destination MAC addresses.

After initializing ports, queues, and other settings via generic calls like rte_eth_dev_configure(), the program enters the following loop:

    struct rte_mbuf *m;
    /* ... */
    while (!force_quit) {
        /* ... */
        nb_rx = rte_eth_rx_burst((uint8_t) portid, 0,
                                 pkts_burst, MAX_PKT_BURST);
        port_statistics[portid].rx += nb_rx;
        for (j = 0; j < nb_rx; j++) {
            m = pkts_burst[j];
            /* ... */
            l2fwd_simple_forward(m, portid);
        }
    }

In this loop, we read received packets (represented by the rte_mbuf structure) from the incoming port in bursts of up to MAX_PKT_BURST, update the statistics, and then process each packet with l2fwd_simple_forward(), which switches the source and destination MAC addresses of the packet and transmits it via the outgoing port by invoking rte_eth_tx_buffer().
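The address swap itself is straightforward. As a minimal, hypothetical illustration (using a hand-rolled Ethernet header and a made-up mac_swap() helper rather than DPDK's rte_ether.h definitions), the core of the forwarding step amounts to:

```c
#include <stdint.h>
#include <string.h>

/* Minimal Ethernet header; DPDK defines an equivalent structure in
 * rte_ether.h.  This hand-rolled version is for illustration only. */
struct eth_hdr {
    uint8_t  dst[6];
    uint8_t  src[6];
    uint16_t ether_type;
};

/* Swap the source and destination MAC addresses in place, as the
 * forwarding step described above does before retransmitting. */
static void mac_swap(struct eth_hdr *eth)
{
    uint8_t tmp[6];

    memcpy(tmp, eth->dst, sizeof(tmp));
    memcpy(eth->dst, eth->src, sizeof(eth->dst));
    memcpy(eth->src, tmp, sizeof(eth->src));
}
```

In the real application, the header would be obtained from the rte_mbuf holding the packet data rather than from a standalone structure.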

This example, like other DPDK applications, uses a high-level DPDK API that does not depend on the implementation details of any specific DPDK network driver. Those who want to delve into the full source code for this example can find it here. More information can also be found in the Sample Applications User Guides.

DPDK components

Those who want to learn and explore DPDK could start with the many sample applications on the examples page. There are over 40 of them, starting from a simple "hello world" and proceeding to more complex applications like IP pipelining and an IPSec gateway. All these examples are well documented. It is also worth learning the testpmd tool, which enables you to start and stop packet forwarding, display statistics, configure various settings, and more.

For those who want to become familiar with the DPDK API, it is recommended to explore the Programmer's Guide and the fundamental data structures. Those structures include the rte_mbuf structure (representing a packet) and the rte_ethdev structure (representing a network device). One should also learn the Environment Abstraction Layer API.

For more advanced DPDK knowledge, it is worth learning the memory-pool implementation (the rte_mempool object and the librte_mempool library). Those seeking familiarity with the cryptographic layer can explore the rte_cryptodev structure, which represents a cryptographic device. See also the cryptodev API, which provides cryptographic poll-mode drivers as well as a standard API that supports all these drivers and can be used to perform cipher, authentication, and symmetric cryptographic operations. The library also enables migration between hardware and software cryptographic accelerators. One should also become familiar with the dpdk-devbind script, which is used to bind and unbind devices and to view the status of the NICs.

The DPDK web site contains a set of open-source tools such as the dpdk-ci continuous-integration suite and the DPDK test suite (DTS), which is a Python-based testing framework. DTS works with software traffic generators like Scapy and pktgen-dpdk; it can also be used with the IXIA hardware traffic generator. DTS is easy to set up and run; it contains over 90 test modules for various networking scenarios. Here, again, one can start with a simple "hello world" test and end up with complex tests including SR-IOV and live migration. Currently DTS supports Intel and Mellanox NICs, and patches for Cavium Networks NICs are circulating on the DTS mailing list. DTS provides both functional and benchmarking tests.

The DPDK site also hosts pktgen-dpdk, a DPDK-based traffic generator. There are other open-source, DPDK-based traffic generators, including TRex, which has both a stateful mode (helpful when testing load balancers and NATs, for example) and a stateless mode, and the LuaJIT-based MoonGen project.

Work has been done to add DPDK plugins to collectd, which is a popular system statistics collection daemon. Two DPDK plugins have been merged into collectd: dpdkevents and dpdkstat. The dpdkevents plugin retrieves the DPDK link status and the DPDK forwarding core's status. The dpdkstat plugin polls statistics from DPDK drivers.

DPDK at higher layers

While DPDK applications are focused mostly on layer 2, there are several interesting projects under FD.io that use DPDK as their primary I/O layer, including VPP. Also worth a mention is the Transport Layer Development Kit (TLDK) project, which implements a set of libraries for layer-4 protocol processing. Those interested in learning more about TLDK may want to watch Ray Kinsella's FOSDEM 2017 talk: Accelerating TCP with TLDK.

DPDK and the community

All DPDK development is done over the public dev@dpdk.org mailing list. The guidelines for contributing code to DPDK are described here. Long-term support releases are available, with support for two years. Governance for DPDK is provided by two boards: a Governing Board (budget, marketing, etc.) and a Technical Board (technical issues, including the approval of new sub-projects, deprecation of old sub-projects, etc.).

The DPDK project is a community-driven project and, as such, there are several DPDK events across the globe. The most recent DPDK Summits were held in Bangalore in April 2017 (the first DPDK Summit to be held in India) and in Shanghai in June. Many videos from past events are available; there is also more information in the Intel Developer Zone and in the Intel Network Builders University Program.

Summary

The DPDK project has become a popular open-source, user-space network and cryptographic acceleration solution based on bypassing the kernel. This project is gaining momentum, especially with the recent move to the Linux Foundation; it is worth following, experimenting with, and contributing to.

Comments (1 posted)