This edition contains the following feature content:

This week's edition also includes these inner pages:

Brief items: Brief news items from throughout the community.

Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

The combination of an "unsuspecting library employee" and a bunch of bored children has created a popular program using the Raspberry Pi and other tools to teach coding to kids. Qumisha Goss is a librarian at the Parkman branch of the Detroit Public Library; she started the "Parkman Coders" program and came to PyCon 2018 in Cleveland, Ohio to tell the assembled Pythonistas all about it. She also had some thoughts on ways to make the Python community a more diverse place, along with some concerns for her students that are much bigger than the diversity topic.

Coding for kids

Goss said that it all started when some other library employees mentioned that teaching coding was becoming popular and that they thought a program to do that for the children who visit the library would be great. She agreed that it would be nice to have and applauded them for taking it on. But it turned out that she was the one taking it on, even though she didn't know anything about programming. They assured her: "You're young, you can learn it", she said with a laugh.

They started with Tynker.com and other Scratch-based environments. "The kids had a ton of fun". The first class had 26 students when they only expected 20; the next month's class was similar in size. But then they started offering the class during the school year and attendance started to trail off. She asked the students why; their answer was that it was "baby stuff" and they were tired of just moving blocks around on the screen. "They wanted to do something 'real'"—at seven years old, she said to laughter.

So she started looking into the Raspberry Pi, in part because the library administration was concerned that students might learn enough to start attacking the library computers. In "true librarian fashion", Goss read a lot of books; Python for Kids is one of her favorites. In 2015, she went to PyCon in Montreal and to the education summit held there. She got lots of good information along with a suggestion to go to PyOhio. She did that and recommended it: "Go to PyOhio, it's really great". From some of the folks she met at those conferences, she got some micro:bit devices: "I'm really getting into this now", she said with a chuckle.

The library also paid for her to go to the first Picademy in North America, which allowed her to become a certified Raspberry Pi educator. She went through all these steps to get herself to a level where she could get out ahead of these kids and "teach them something that wasn't 'baby stuff'".

Since that time, she and the kids have done all sorts of different projects. Recently, during Teen Tech Week, one of her students created a video game with her own images and set it to music, "which was really cool". Goss laughingly admitted that her project "didn't look anything like that".

Her students have also done a lot with Minecraft: Pi; "I'm sure you've heard this a million times: all kids love Minecraft". They start off by playing Minecraft, but that's just an enticement. After that, she introduces them to IDLE ("don't groan") and lets them program in their Minecraft game. That was exciting, in part because you could make flowing lava and exploding TNT. "I don't know why they never get bored of seeing things explode."
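
For readers unfamiliar with it, programming the game is done through a small Python API; a session in IDLE might look something like the sketch below. This is only an illustration (it assumes the mcpi module that ships with Minecraft: Pi Edition; the coordinates and block choices are invented, not taken from the program).

    # A rough sketch of the kind of Minecraft: Pi Edition script the kids write
    # in IDLE, using the mcpi module that ships with the game.  Coordinates and
    # block choices are illustrative only.
    from mcpi.minecraft import Minecraft
    from mcpi import block

    mc = Minecraft.create()            # connect to the running game
    mc.postToChat("Hello from Python!")

    pos = mc.player.getTilePos()       # where the player is standing
    # Place a block of TNT next to the player...
    mc.setBlock(pos.x + 1, pos.y, pos.z, block.TNT.id)
    # ...and some flowing lava a little further away.
    mc.setBlock(pos.x + 3, pos.y + 2, pos.z, block.LAVA_FLOWING.id)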

After that, the students started doing robot cars, but the program had expanded so she had students from ages six through seventeen. A hard lesson learned about six-year-olds is that they are not ready to learn about Python. One kid wrote two lines in Minecraft then ran off to roll around in the hallway for the rest of the session, she said. So she started having a "junior project" for the younger kids. For the robot project, the junior version involved building and coloring a paper robot that would have some LEDs added so it would light up. The senior project was building a robot car with a Raspberry Pi.

The next project they will be tackling is greenhouse monitoring. The kids will be writing code to make a time-lapse camera that will record the growth of plants in a mini-greenhouse. In addition, they will write code to record moisture levels in the soil.
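
A minimal version of that project, assuming a Pi camera module and a simple digital moisture sensor on a GPIO pin (the pin number and file names here are made up for illustration), could be sketched as:

    # A sketch of the greenhouse monitor: capture a time-lapse frame and log the
    # soil-moisture reading every 15 minutes.  Pin number and file names are
    # illustrative; real sensors and wiring may differ.
    import time
    from picamera import PiCamera
    import RPi.GPIO as GPIO

    SENSOR_PIN = 17                     # digital output of the moisture sensor
    GPIO.setmode(GPIO.BCM)
    GPIO.setup(SENSOR_PIN, GPIO.IN)

    camera = PiCamera()
    frame = 0
    while True:
        camera.capture('plants-%04d.jpg' % frame)   # one time-lapse frame
        dry = GPIO.input(SENSOR_PIN)                # many cheap sensors read high when dry
        with open('moisture.log', 'a') as log:
            log.write('%d %d\n' % (frame, dry))
        frame += 1
        time.sleep(15 * 60)                         # one frame every 15 minutes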

Parkman Coders has been going for about three years now. She showed two pictures of participants, one from when it got started and one from last year. One of the achievements she is most proud of is that three of the students are in both pictures; there are, in total, six kids that have been with her since the beginning. One day "they'll be here, or I'll be working for them, or you'll be working for them", she said to a round of applause.

Diversity and beyond

She then turned to a big issue in the technology world: diversity. "How do we get women and minorities? Where is the diversity?" are questions that are being asked by many organizations. "Women and minorities are not unicorns", she said. They are magical, but there is no real mystery about what it takes to sustain them.

There are three basic things that it takes to sustain any great relationship, starting with "respect". It comes down to simply acknowledging that everyone is a human being and respecting everyone on that basis.

"Here's the tricky part", she said: engaging with these other people. That means greeting them, asking a question, and listening to the answer they give you; if they ask you a question, answer it.

The final piece is "value". People are not going to stay with companies or organizations that do not value them. "I don't feel valued if I make less money" than someone else doing the same work, for example. If you can stick with those three things (respect, engage, and value), you can attract and retain people.

But diversity problems are just a "middle issue" for her. The big issues she wanted to talk about are those affecting her students. Poverty is a real problem for these kids; they are poor and they know they are poor. It really hurts their progress that their families cannot afford a computer so they can work on Python at home.

Even though she can help them get a Raspberry Pi, many can't afford the things needed to hook up to it—a keyboard, mouse, and monitor. Or, worse yet, they don't have the internet at home. These are real problems that are much bigger to her than the diversity problems. We need to help minorities get to the point where they can even participate in our communities.

Illiteracy is another problem her kids face. She can be teaching someone to program and realize she needs to backtrack to make sure they can read properly. She had a nine-year-old come to her for help the week before her talk; in the process of helping him, she realized he couldn't read. She suggested that he use his last name as part of a username, but he did not know how to spell it. "That was devastating to me." She has already started working with him and enlisted the aid of other librarians in order to help teach him to read. "These are boundaries to them."

Hunger, crime, and violence are all things that these children experience. She has had students ask for snacks to take with them because they had no food at home. She has also been asked if she has ever been to jail; one child's cousin was headed there for burglary and the kid was wondering what jail was like. That is just a normal part of their lives, unfortunately.

The kids face a lack of resources and a lot of adversity, but that leads to them being resourceful and resilient, she said. That in turn leads to innovation, both for good and not-so-good. Detroit is experiencing a revival these days, which leads to lots of "cool hipster things" downtown, such as electric bikes. She learned that kids in the neighborhood were stealing lawnmowers and mounting the engines on bikes to create their own powered bikes. She was amazed that they could get that to work—and that no one has been hurt from what she has heard—though obviously stealing lawnmowers is not what she wants to see.

The library allows using its computers for one hour per day with a library card. One kid noticed that you can apply for library cards online with just a Detroit address. One day she noticed lots of kids playing games on the computer for longer than an hour; she realized that the kid had exploited the system to generate multiple cards that he handed out to his friends so they could play together longer.

These kids need an opportunity, she said. She is teaching them Python because it provides worldwide opportunities; "exposure to Python is exposure to the world". Learning to program is also empowering. The first thing many want to do when they learn to code is break someone else's code because "you are empowered to mess stuff up or to make stuff better". It is important that the kids know that they can do more than simply what someone tells them they can do. "You can use your superpower for good or evil, then we let them test it, but then we encourage them to do good."

At this point, the kids are mostly consumers. They all had a homework assignment to spend a million dollars, but they had to use some of it to go to college. They had to research what it cost for the college of their choice, but after that, the spending all looked the same: a house (usually a mansion), a car or two, a house for their mom, an iPhone, and several new pairs of shoes. She wants these kids to see that their goals should not be to own an iPhone, they should aspire to create a better iPhone. They do not just have to be consumers, they can be the next innovators.

The goal is for greatness, she said; these kids have it in them, but "they just need an opportunity to develop that greatness". She suggested that PyCon attendees not be selfish and to "cultivate greatness in others". She continued: "No one has ever become less great by helping someone else become great". That statement was met with much applause.

She also suggested asking questions if you need help and answering questions if you are asked. In both cases, though, you should listen ("don't hear it, actually listen and take it in") to the other person. Her personal favorite suggestions were to support public libraries ("we're awesome", she said with a laugh). Also, she asked the attendees to support educators and "support people, remember this is about people; what is Python without people?"

Her final suggestion was to "mentor someone". She suggested that person be "someone who doesn't look like you, someone who is from somewhere different than you—go out on a limb and help somebody else". With that, she was done with her talk, which was well-received—attendees gave her a standing ovation, no doubt for her work and her keynote.

A YouTube video of the talk is available.

[I would like to thank LWN's travel sponsor, the Linux Foundation, for assistance in traveling to Cleveland for PyCon.]

Comments (48 posted)

Python 3 adoption has clearly picked up over the last few years, though there is still a long way to go. Big Python-using companies tend to have a whole lot of Python 2.7 code running on their infrastructure and Facebook is no exception. But Jason Fried came to PyCon 2018 to describe what has happened at the company over the last four years or so—it has gone from using almost no Python 3 to it becoming the dominant version of Python in the company. He was instrumental in helping to make that happen and his talk [YouTube video] may provide other organizations with some ideas on how to tackle their migration.

Fried started working at Facebook in 2011 and he quickly found that he needed to teach himself Python because it was much easier to get code reviewed if it was in Python. At some point later, he found that he was the driving force behind Python 3 adoption at Facebook. He never had a plan to do that, it just came about as he worked with Python more and more.

He started out by being active in the internal Python group. He was often the first to answer questions that came up. He eventually became famous ("or maybe infamous") with the Pythonistas at Facebook because, when he saw a problem with how the language was being used, he didn't ask permission, he simply fixed it. That works at Facebook because there is no real top-down hierarchy of control; anyone has as much power to back out a change as its author had to make it in the first place. Over time, his changes built up credibility within the Facebook Python community that would serve him well in the migration process.

Changing something like the Python language version at "Facebook scale" was going to take some time and a lot of diplomacy, he said. He wanted to tell the "story about how I and a couple of engineers used our free time, with no authority whatsoever, and made Python 3 the dominant version at Facebook."

In 2013, there was rudimentary support for Python 3.3 at Facebook. It was there as part of a task for adding Python 3 support to the build system. But that task was blocked on Python 3 support in the Facebook libraries, which was in turn blocked by no Python 3 support in the build system. It was something of a catch-22; Python 3 was "available" but nothing in the Facebook environment supported it.

In addition, there was lots of negative sentiment about Python 3 at Facebook in 2013. The overall thinking was that the company would simply stay on Python 2.7 forever. There was talk of jumping ship to another language entirely. Even he said (in an internal group) that Python 3 would never happen at Facebook. Only one person challenged him on that statement and suggested that he do something about it; at the time, he ignored the suggestion, but it did stick in his head.

Some hope

There was, actually, some hope, he said. In January 2013, the four imports from __future__ (print_function, division, absolute_import, and unicode_literals) were required by a "linter" that was being used. The requirement was an attempt to extend the life of the Python 2 code base; adding the imports everywhere in order to quiet the linter ended up making it easier to convert modules to Python 3.
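
For reference, those imports amount to a single line at the top of each Python 2 module:

    # Required by the internal linter at the top of every module; each feature
    # makes Python 2 behave more like Python 3, which eased later conversion.
    from __future__ import absolute_import, division, print_function, unicode_literals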

The Apache Thrift framework for serialization and remote procedure calls is "used everywhere" at Facebook. Since it was Python 2-only, it was a core blocker. But adding Python 3 support was popular in a poll for new Thrift features that the Facebook Thrift group had run. He voted for it, but not because he was on the Python 3 bandwagon at that point; he thought the Python 2 interface needed a refactor as it looked like it had come from Java.

His thinking started to switch when he saw Guido van Rossum give a talk at Yelp in San Francisco on something called "Tulip", which is what eventually became the asyncio module. He had always been a fan of asynchronous programming in Python, but found that it was fragmented because of the differences between the frameworks (e.g. Twisted, gevent) that provided it. Tulip looked like it would make asynchronous I/O interoperable rather than fragmented. Before that talk was even over, he was communicating with the Facebook Thrift team, suggesting that Thrift should simply support Tulip for Python 3, rather than wait for Twisted, gevent, and others to port to Python 3. A few days later, the Thrift team published a roadmap that showed Python 3 and Tulip support coming.

Both of those arrived in early 2014, but then nothing happened for six months; users did not show up, they had no plans to show up, and they, in fact, did not know about the changes at all.

A new project

In August 2014, he started a project to rewrite a service that he had inherited. He started planning to do it using gevent and Python 2, but then realized it would be obsolete at the time it was written if he did so. In order for something to change, someone needs to be the first one; for Facebook and Python 3, that was him. "For Python 3 in your organization, I think that person should be you."

So he started his project using Python 3 and "everything was broken"; it was no wonder that no one was using Python 3. The build system would not even build his code and all of the third-party wheel packages were only available for Python 2. When he finally fixed enough things to allow his service to be built, it would immediately fail when it was run—someplace deep in the guts of the code that sets up service entry points in the Facebook system.

So in order to get his code running, he had to fix everything else; he rebuilt hundreds of third-party wheels so that they would work with both Python versions and he had to make the internal libraries 2/3 compatible. Every day, though, someone would commit a Python 2-only change into one of his dependencies. Not surprisingly, he got tired of fixing regressions. One solution would be to force Python 3 compliance within the organization, but Facebook is not a place where that is possible. But, if you act like you have some authority, people will start to believe that you do, indeed, have that authority.

He used up a lot of his social capital to add Pyflakes linting into the build process. He was able to justify adding it because there already was a PEP 8 linter, but Pyflakes would address other code quality issues; in addition, Pyflakes had few false positives so it did not overly irritate the developers. He set things up so that Pyflakes would run on all code that was put up for review, first for Python 2 and then for Python 3. That helped spread the job of keeping Python 3 compatibility out to all of the developers and not just him, which allowed him to make progress with his project.

Early on, he had to be responsive to help people understand that "no, the linter is not wrong" and that there was value to making the code work with Python 3. If the developers had started believing that moving to Python 3 was difficult, they would fall back on the "let's stay with Python 2 forever" mindset. He made it easy for developers to do the right thing with respect to keeping the code running on Python 3. It was easier to just "shut the linter up" and, by extension, him, than it was to complain about it, so most developers just did so.

Education

With all that in place, he had stopped the bleeding, but little or no progress toward running more Python 3 at Facebook was being made. He joined the team that did training on Python programming for new employees at Facebook. The linters already complained if the code was not compatible with 2 and 3, but he wanted to get to a point where 2/3-compatible code was only written for legacy projects and new code was written in Python 3. Once again he took matters into his own hands: in 2015, he changed the slides for the new employee Python class to make that statement. The idea was that at some unknown point in the future Facebook would want to switch to Python 3, so writing Python 2-only code made no sense since it would have to be rewritten someday. He taught new hires that all of this should just work with the Facebook infrastructure and build systems and that if it didn't, they should file a bug or try to fix it themselves. "Strangely enough, that's what happened."

In January 2015, he "finally shipped" his project. He spent most of the rest of the year telling people how much better it was and why they should switch to using Python 3 where they could. Over the year, various allies in the effort to switch to Python 3 at Facebook made themselves known.

One of those allies was Łukasz Langa, who had "somehow convinced Instagram to move to Python 3". In 2016, he and Langa formed a brand new team in Facebook to shepherd Python within the company, which they dubbed "The Ministry of Silly Walks". Because they were "the Python team", the "perceived authority" he mentioned earlier worked; people assumed they could make decisions about Python at Facebook.

In 2016, he was seeing slow but steady growth in the amount of Python 3 that was being run at the company. There was mention of it in meetings and he regularly heard of new projects that were using it. The tide of opinion had changed at Facebook even though Python 3 was not the default and projects needed to actively choose to use it. By May 2016, he signaled his intention to switch the build system default to Python 3, which was overwhelmingly supported so he made that switch a few days later—with no ill effects.

Toward the end of 2016, there was a post from a project team that reported its results in switching to Python 3. The developers simply ran 2to3 on the code and fixed a few things that it complained about. When they ran the resulting code, they found it was 40% faster and used half the memory. This points to a persistent myth that Fried has heard: Python 3 is slower than Python 2. That may have been true for earlier releases of Python 3, but it is definitely not true now, he said.

Nice things

In early 2017, Instagram finished its migration to Python 3 and Facebook was reaping the benefits of this "glorious future where you can have nice things". Upgrades of Python versions were not particularly scary and brought new features that could be used. Facebook developers now focus on problems like using the new static typing features or migrating services to use asyncio. "Python at Facebook is fun again."
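
As a taste of those "nice things", a Python 3-only snippet can combine type annotations with asyncio; the example below is purely illustrative (it is not Facebook code, and it needs Python 3.7 or later for asyncio.run()):

    # An illustrative Python 3 snippet combining static typing with asyncio.
    import asyncio
    from typing import List

    async def fetch_lengths(urls: List[str]) -> List[int]:
        async def fake_fetch(url: str) -> int:
            await asyncio.sleep(0.1)        # stand-in for real network I/O
            return len(url)
        return await asyncio.gather(*(fake_fetch(u) for u in urls))

    print(asyncio.run(fetch_lengths(["https://lwn.net", "https://www.python.org"])))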

The problem now is that everyone is asking when Python 2 support can be retired. When there are regressions in the Python 2 support for a library or module, it is common to hear developers ask if those users can simply move to Python 3. It is the reverse of the problem he had a few years prior. "Oh what a wonderful world in which I live."

He showed a graph of the Facebook service entry points in Python over time, starting in Q3 of 2015, when there were four Python 3 entry points in total. By the time the build-system default switched to Python 3 in mid-2016, Facebook already had 4% of its entry points running Python 3. In March 2018, it crossed the 50% line; in mid-May, when he gave the talk, 55% of the "tens of thousands of Facebook entry points" were running Python 3. At Facebook it is now embarrassing to have code that only runs on Python 2, Fried said.

He then reviewed the process. He noted that you have to do more than just build something new; you have to lead developers to it by "being the change you want to see". You should get other people to help, even if they don't know they are helping, which is where linters and unit tests come into play. It is important to educate new hires for where you are heading. Once you get there, or some of the way there, celebrate by enjoying the "nice things": write some "awesome stuff in Python 3". Seeing how the new features can be used will make others want to convert.

He fielded some questions from the audience. One asked how they might make this happen in a more traditional, hierarchical organization. Fried thought that might actually be easier since, instead of thousands of developers that need to be convinced, it should be possible to work up the management chain starting with a manager who recognizes the benefits. It could also be harder if the culture is conservative, but focusing on code quality improvements may help there. Another question focused on code that is monolithic, rather than broken up into multiple entry points; for that Fried suggested looking at the Instagram keynote [YouTube video] from PyCon 2017.

There was a lot in the talk that other organizations can use, but it is clear that having an advocate and shepherd with a lot of perseverance will be important. Companies that are planning a conversion of this sort will likely want to have someone like Fried on board.

[I would like to thank LWN's travel sponsor, the Linux Foundation, for assistance in traveling to Cleveland for PyCon.]

Comments (13 posted)

As attackers have lost the easy ability to execute code stored in writable memory, they have increasingly turned to return-oriented programming (ROP) and related techniques to compromise vulnerable systems. ROP attacks use the code that is present in the program under attack and are hard to defend against in software. In response, hardware vendors are developing ways to defeat ROP-like techniques at a lower level. One of the results is Intel's Control-Flow Enforcement Technology (CET) [PDF], which adds two mechanisms (shadow stacks and indirect-branch tracking) that are intended to resist these attacks. Yu-cheng Yu recently posted a set of patches showing how this technology is to be used to defend Linux systems.

The patches adding CET support were broken up into four separate groups: CPUID support and documentation, some memory-management work, shadow stacks, and indirect-branch tracking (IBT). The current patches support 64-bit systems only, and they only support CET for user-space code. Future versions are supposed to lift both of those restrictions.

ROP attacks generally work by loading a set of fabricated call frames onto the stack, each of which "returns" into a carefully chosen fragment of code. By stringing these "ROP gadgets" together, the attacker is able to execute enough useful code to take control of the system. Gadgets are plentiful in any large program; the ability to "return" into the middle of a multi-byte instruction to get an entirely different sequence of operations makes them even more available on x86 systems. The stack is, of course, writable by the running program; it contains a mixture of control-flow information (return addresses, for example) and other data. It is that mixing that has made ROP attacks possible.

One way to thwart such attacks is to move the return addresses to another context where they are not so easy to mess with; that is the core idea behind the shadow-stack functionality. Briefly, when shadow stacks are enabled, a function call will push the return address onto both the regular stack and a special shadow stack. When a return instruction is encountered, the return address is popped from both stacks and compared; if they do not match, a fault results. Both the push and pop operations are handled by the processor. As long as the attacker is unable to tamper with the shadow stack, it should prevent the use of a return instruction to divert the flow of control.
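
The check can be modeled in a few lines of Python; this is only a toy model of the behavior described above (the real mechanism lives in the processor, not in software):

    # Toy model of the shadow-stack check: every call pushes the return address
    # onto both stacks, and every return compares the two copies.
    class ControlFlowViolation(Exception):
        pass

    call_stack = []      # the ordinary stack, writable by the program
    shadow_stack = []    # the shadow stack, which the program cannot modify

    def call(return_address):
        call_stack.append(return_address)
        shadow_stack.append(return_address)

    def ret():
        addr = call_stack.pop()
        if addr != shadow_stack.pop():   # mismatch means the stack was tampered with
            raise ControlFlowViolation("return address does not match shadow stack")
        return addr

    call(0x400123)
    call_stack[-1] = 0xbadc0de           # a ROP-style overwrite of the return address
    try:
        ret()
    except ControlFlowViolation as e:
        print("blocked:", e)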

Preventing that tampering requires some special treatment for the shadow stack. It is allocated from a virtual-memory range, and the base address is stored into a model-specific register (MSR). The pages within the shadow stack must have a strange combination of bits set: read-only but dirty. Until now, the dirty state has been used almost exclusively by the kernel to track pages that must be written to backing store, but shadow stacks won't work without it. As a result, a new "software dirty" bit must be allocated in the page tables to fill the role that the hardware dirty bit handled previously.

The read-only protection on the shadow stack should prevent attackers from adding their own special entries — if that protection cannot be changed. To that end, shadow stacks are allocated in a special type of virtual-memory area (VMA) marked with the new VM_SHSTK flag. System calls like madvise(), mprotect(), mremap(), and munmap() will refuse to operate on a shadow-stack VMA. There is a new set of arch_prctl() operations that will operate on shadow stacks; they are described in this documentation patch. These calls, which are unprivileged, are meant to be used at program startup to set up the stack; one of them (ARCH_CET_LOCK) can be used to prevent disabling of shadow stacks (and IBT).

One interesting issue with shadow stacks is how they will interact with retpolines, which are used to thwart Spectre variant-2 attacks. Retpolines replace indirect function calls (those where the address of the function is determined at run time) with an instruction sequence that looks a lot like a ROP attack; they will not work when a shadow stack is in use. Intel claims (in section 4.3 of this document [PDF]) that retpolines will be unneeded on processors that support CET. Hopefully there will be no surprises that will force a choice between these two protective technologies.

Jump-oriented programming is a ROP-like technique that exploits indirect jumps and function calls. One way to severely restrict such exploits is to prevent jumps to any location that was not actually intended to be jumped to. IBT does this by adding a new pair of instructions (endbr32 and endbr64) that function as no-ops but which indicate a possible target for an indirect jump. These instructions will be treated as no-ops by older processors that lack CET support. When IBT is enabled, the processor will require that an endbr instruction is the first one encountered after an indirect jump; if something else is encountered, a fault will result.
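
The indirect-branch side of CET can be modeled in the same toy fashion; again, this is just an illustration of the rule being enforced, not of the hardware:

    # Toy model of indirect-branch tracking: an indirect jump or call may only
    # land on an address that carries an endbr marker.
    class ControlFlowViolation(Exception):
        pass

    endbr_targets = {0x401000, 0x402000}    # addresses of endbr64 instructions

    def indirect_jump(target):
        if target not in endbr_targets:
            raise ControlFlowViolation("jump to %#x, which is not an endbr" % target)
        return target

    indirect_jump(0x401000)      # fine: a marked function entry point
    try:
        indirect_jump(0x401007)  # blocked: landing in the middle of a function
    except ControlFlowViolation as e:
        print("blocked:", e)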

Shadow stacks should be largely transparent to any program that is not, itself, doing strange things with return addresses on the stack. IBT is different, though; if it is enabled, the entire program must have been compiled with the necessary options to insert the endbr instructions in the right places. If a program has been so compiled, but it requires a library that has not, then IBT cannot be enabled without breaking the program. One of the jobs of the ELF loader on a CET-enabled system will be to check the CET-readiness of each library and only enable CET if all components are ready for it.

That leaves one interesting case uncovered, though. A program may need only CET-ready libraries to get started, but it might at some later point call dlopen() to load a library that has not been built for CET. At that point, there are only two options: turn off CET for that process, or fail the operation. If the ARCH_CET_LOCK operation described above has been invoked, only the latter option will be available. So locking can only be done at the cost of introducing a real chance of breaking programs when IBT is enabled.

That led to a long discussion about whether ARCH_CET_LOCK makes sense at all. Kees Cook argued that, in its absence, attackers will focus all of their energies on finding a way to turn CET off before carrying out the real attack. Andy Lutomirski responded that, by the time an attacker can disable CET, they are already in control and there's not much CET can do anyway. How that will be resolved is unclear at this time.

Disagreements over details like that notwithstanding, there appear to be no concerns (outside of grsecurity land anyway) about the CET features overall. They should make the system far more resistant to some common attack techniques with, seemingly, little in the way of performance or convenience costs. Chances are, though, that this technology won't be accepted until it is able to cover kernel code as well, since that is where a lot of attacks are focused. So CET support in Linux won't happen in the immediate future — but neither will the availability of CET-enabled processors.

Comments (21 posted)

One of the new features merged for the 4.18 kernel is a new polling interface using the asynchronous I/O mechanism. As part of this work, the internal implementation of how the various polling-related system calls (poll(), select(), and epoll_wait()) work was significantly changed. The reporting of a significant performance regression has now put all of that work into doubt, though. While it could be reverted, the more likely outcome would appear to be another set of changes to how polling works in the kernel.

As a reminder, kernels prior to 4.18 expect filesystems and device drivers to provide a single poll() method in the file_operations structure:

int (*poll) (struct file *file, struct poll_table_struct *table);

This function's job is twofold: add the wait queue(s) on which I/O-readiness events may be reported to table, and return a bitmask describing the I/O operations that could be performed immediately without blocking. In 4.18, these tasks have been split out into separate methods:

struct wait_queue_head *(*get_poll_head)(struct file *file, int mask);
int (*poll_mask) (struct file *file, int mask);

Now get_poll_head() returns a single wait queue on which events will be reported, while poll_mask() indicates the operations that can be performed immediately. The old poll() interface remains in the kernel because many drivers have not been converted, but the long-term intent is to get rid of it.

On June 22, a problem with this new interface came to light in the form of a performance-regression report from the kernel test robot. In particular, a test that exercises poll() heavily regressed by 8.8%, which is a significant performance hit. Linus Torvalds was quick to put his finger on the problem: the new polling interface is slowing things down. In particular, the replacement of the single invocation of poll() with calls to two other methods added another indirect call to the polling path.

An additional function call may not seem like that heavy a cost, but indirect calls (where the address of the function to be called is computed at run time) are relatively expensive. The advent of Spectre has made that situation worse, since indirect calls must use retpolines on affected processors; that makes them quite a bit more expensive than they were before. So the new scheme has made polling significantly more expensive and, since polling is a performance-critical operation in many workloads, that is a real problem.

Torvalds was unimpressed with the changes in general; he said that he was "inclined to just revert the whole mess". Christoph Hellwig responded with a quick patch that attempted to eliminate some of the extra overhead, but that didn't win applause from anybody. Some more serious changes were clearly called for.

The direction of the most likely fix was suggested by Torvalds in that same message. The introduction of get_poll_head() already limits drivers to using a single wait queue to signal I/O-related events — a change that is not universally popular, but it is only a problem for a small number of drivers. Rather than provide a callback to obtain a pointer to that queue, Torvalds suggested, that queue pointer could just be stored in the file structure, where it would be immediately available when needed.

Hellwig noted that this solution would not work for every case:

People are doing weird things with their poll heads, so we can't do that unconditionally. We could however offer a waitqueue pointer in struct file and most users would be very happy with that.

One of those cases turned out to be in the networking code, and in the ability to perform busy waiting in particular. Hellwig ended up reworking some of that code before writing a patch adding a new field:

struct wait_queue_head *f_poll_head;

to struct file and removing the get_poll_head() method entirely. The entire set of patches has been posted in Hellwig's Git tree.

The changes simplify the polling code somewhat, and they should remove the 4.18 performance regression (though no benchmark results confirming that have been posted yet). The cost comes in the form of adding another pointer to struct file, of which there can be many instances on a busy system. The fixes are also the sort of change that is normally not seen as desirable after the close of the merge window, and the networking changes have not yet been approved by the networking developers.

An argument could thus be made in favor of reverting the polling changes entirely and trying again in 4.19. That may be exactly what happens if the networking developers resist the changes in their subsystem. The more likely outcome, though, is that these changes will receive whatever additional fixes prove to be necessary and will be merged in the near future. The new polling mechanism brings significant performance improvements for users who can take advantage of the asynchronous I/O interface, and they would prefer not to wait for it if possible.

Comments (8 posted)

While there has been quite a bit of work on various aspects of networking performance, including bufferbloat reduction, queue management, and more, much of that work has been oriented toward the needs of high-end users. But there is more to the Internet than data centers and high-speed links. A large number of Internet-connected devices can be found behind consumer-level routers on relatively slow broadband links. For some time, a group of developers has been working on the "Common Applications Kept Enhanced" (CAKE) queuing discipline, which is aimed directly at the needs of those users.

Home networks face a number of challenges not found in many commercial settings. Bufferbloat can cause significant latencies, but can often be difficult to address by end users. The links themselves are relatively slow, and they are often highly asymmetric — download speeds can be an order of magnitude higher than upload speeds. The result of all this can be significant domestic tension when, for example, one household member is pushing a large Git tree while the other is engaged in a high-stakes raid. Given the special features of home networks, it would seem to make sense to tune the behavior of the network stack to match. That is where CAKE is meant to come in.

CAKE ingredients

The CAKE patches, posted by Toke Høiland-Jørgensen (though the principal author is said to be Jonathan Morton), are an attempt to better meet the needs of home-network users; the 18th revision of this patch set was posted at the end of May. CAKE takes the form of a queuing discipline, meaning that it sits between the higher-level protocol code and the network interface and decides which packets to dispatch at any given time. It has four different components designed to make things work on home links.

The first of those is a rate-based bandwidth shaper. LWN readers are relatively likely to have a home router running a recent kernel and entirely free drivers; on such systems, the problems with bufferbloat have mostly been solved. Others, though, are not so lucky. There is often buffering to be found within proprietary drivers or the hardware itself that cannot be fixed just by installing a current OpenWrt release. And, in any case, there may be buffering problems in external components — a cable modem, for example — that are not under the user's control at all.

In such cases, sending too much data through the link at any given time will almost certainly lead to excessive buffering and the resulting latency problems. There is still, though, one way to avoid such problems: don't send data faster than the upstream link can carry it. That is where the bandwidth shaper comes in. It regulates outgoing traffic to cap it at just below the bandwidth that the link can handle, preventing buffering at that link. In essence, it takes control of buffering away from the downstream components, solving bufferbloat problems in settings where the code itself cannot be changed.

One potential problem with this kind of shaping is that it can, if configured conservatively, waste a portion of the available bandwidth. The shaper goes to great lengths to try to account for all of the overhead that will be applied to packets over the link (including things like DOCSIS framing), with the result that, it is claimed, the speed limit can be configured to over 99% of the actual link speed. What seems to be missing, though, is an automatic way to determine what the link speed actually is.

For the parts of the system that are under the control of the networking code, some sort of queue-management algorithm is needed to prevent the overfilling of buffers. CAKE includes a variant of the FQ-CoDel algorithm (called "Cobalt") for that purpose. Among other things, FQ-CoDel performs packet scheduling to ensure fairness between the various flows (connections) that are transmitting at any given time. Cobalt adds to that, though, in that it can also ensure fairness between the various hosts that are sending packets through the router. If host H1 has a single connection running, while H2 has four, FQ-CoDel will allocate 20% of the available bandwidth to each, with the result that H2 is able to use 80% of the total. Cobalt, instead, will give 50% to the flow from H1, then allocate 12.5% to each of the flows from H2. This feature can help to ensure that all devices on the network get reasonable access to the available bandwidth.
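
A quick calculation reproduces those numbers for the two-host example above:

    # Bandwidth shares for the example above: H1 has one flow, H2 has four.
    flows = {"H1": 1, "H2": 4}
    total_flows = sum(flows.values())

    # Plain flow fairness: every flow gets the same share of the link.
    per_host_flow_fair = {host: n / total_flows for host, n in flows.items()}
    print(per_host_flow_fair)          # {'H1': 0.2, 'H2': 0.8}

    # Host fairness: split the link per host first, then per flow within a host.
    host_share = 1 / len(flows)
    per_flow = {host: host_share / n for host, n in flows.items()}
    print(host_share, per_flow)        # 0.5 {'H1': 0.5, 'H2': 0.125}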

The differentiated services (or "DiffServ") specification uses a field in the IP packet header to classify the data contained within that packet. Some packets can be marked as being high priority or latency sensitive, while others might be low-priority bulk traffic. CAKE implements DiffServ in the bandwidth shaper with a small number of priority-ordered queues. The highest-priority queues are serviced first, but only to a point; the latency-sensitive queue is given a maximum of 25% of the available bandwidth, for example. If a given queue does not use all of its allocation, that bandwidth is naturally made available to the other classes of service.

This approach to DiffServ is meant to enable priority handling for traffic like video conferencing without letting an abusive host use DiffServ to crowd out all other users. There are several different mappings of DiffServ classes to priorities available in CAKE.

The last major component of CAKE is ACK filtering. A stream of data flowing in one direction over a TCP connection will generate a corresponding stream of acknowledgment (ACK) packets heading the other way. The ACK traffic is much smaller than the actual data, but it can still reach problematic levels on asymmetric links like those found in many home links. Much of that data will be redundant: if an ACK packet for the first 10,000 bytes is immediately followed by an ACK for the first 20,000 bytes, the first can often be dropped with no ill effect.

Since CAKE maintains per-flow queues of packets, it is relatively easy for it to tell when a newly queued ACK packet makes an earlier one redundant. Some care must be taken, though, to only drop ACK packets that contain no other data, or bad things will happen. The ACK filtering also will not touch packets that contain unknown headers; that is an attempt to avoid protocol ossification that could break future extensions.
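
The decision itself can be sketched in a few lines of Python; the real filter is C code inside the qdisc and must be far more careful about SACK blocks, ECN, and unknown options, but the idea is simply this:

    # Much-simplified sketch of ACK filtering: a newer pure ACK for the same flow
    # makes older queued pure ACKs that acknowledge less data redundant.
    from collections import namedtuple

    Pkt = namedtuple("Pkt", "flow is_pure_ack ack_seq")

    def ack_filter(queue, new_pkt):
        if new_pkt.is_pure_ack:
            queue[:] = [p for p in queue
                        if not (p.flow == new_pkt.flow
                                and p.is_pure_ack
                                and p.ack_seq <= new_pkt.ack_seq)]
        queue.append(new_pkt)

    q = [Pkt("a", True, 10000)]
    ack_filter(q, Pkt("a", True, 20000))
    print(q)      # only the ACK covering 20,000 bytes remains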

The baking process

CAKE first appeared on the netdev list in April; since then it has been, to put it politely, subjected to a great deal of discussion ranging from serious technical criticism to the inevitable requirement to put variable declarations in "reverse Christmas tree" ordering. It is probably fair to say that many developers would have given up on getting this code merged by now. Høiland-Jørgensen appears to be a persistent and good-humored developer, though, with the result that this patch set is now up to version 18 and is, with luck, close to being merged.

Early on, networking developer Eric Dumazet's reaction was: "Oh my god. Cake became a monster". He questioned a number of things, with special attention for the ACK filter which, he said, should be in the TCP stack itself if it exists at all. He has since then let it be known that he will indeed be adding ACK filtering to TCP as a whole. Many other issues raised by Dumazet have been fixed in subsequent versions of the patch set.

One other significant objection has to do with how the host-based shaping works. When network address translation (NAT) is in use — as it often is on home networks — a queuing discipline loses the information about where packets originally came from. To get around that, the CAKE patches reach into the netfilter subsystem to obtain that information and keep it with packets as they pass through the system. This is seen as a layering violation, and it makes CAKE dependent on the netfilter connection-tracking mechanism. Better solutions have not been offered, though, so that feature remains in the current patch set.

One slice of CAKE that remains lacking is good documentation on how to actually use it. The intent is to create something that is "simple enough that even an ISP can configure it", but it is still not entirely straightforward. The CAKE page on bufferbloat.net has some information on how to get started, including instructions on how to get it running on an OpenWrt system. Detailed information on CAKE, including performance numbers, can be found in this paper [PDF]. With luck, the baking process will be finished soon and we will all be able to have CAKE.

Comments (24 posted)