This edition contains the following feature content:

This week's edition also includes these inner pages:

Brief items: Brief news items from throughout the community.

Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

LWN has been running articles for years to the effect that the end of Python 2 is nigh and that code should be ported to Python 3 immediately. So, naturally, one might expect that our own site code, written in Python, had been forward-ported long ago. Strangely enough, that didn't actually happen. It has mostly happened now, though. In the process of doing this work, your editor has noticed a few things that don't necessarily appear in the numerous porting guides circulating on the net.

One often-heard excuse for delaying this work is that one or more dependencies have not yet been ported to Python 3. For almost everybody, that excuse ran out of steam some time ago; if a module has not been forward-ported by now, it probably never will be and other plans need to be made. In our case, the final dependency was the venerable Quixote web framework which, due to the much appreciated work of Neil Schemenauer, was forward-ported at the end of 2017. Quixote never really took the world by storm, but it makes the task of creating a code-backed site easy; we would have been sad to have to leave it behind.

Much of the anxiety around moving to Python 3 is focused on how that language handles strings. The ability to work with Unicode was kind of bolted onto Python 2, but it was designed into Python 3 from the beginning. The result is a strict separation between the string type (str), which holds text as Unicode code points, and bytes, which contains arbitrary data — including text in a specific encoding. Python 2 made it easy to be lazy and ignore that distinction much of the time; Python 3 requires a constant awareness of which kind of data is being dealt with.

In practice, for LWN at least, Unicode is not where the problems arose. The standard advice is to use bytes for encoded strings originating from (or exiting to) the world outside a program, while converting to (or from) str at the boundary, thus using only str internally. That forces a focus on how one is communicating with the environment — a focus that really needs to be there anyway. It is not a hard discipline to acquire, and it leads to more robust code overall.
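A minimal illustration of that boundary discipline (the values here are made up for the example):

s = 'naïve'                     # str: a sequence of Unicode code points
b = s.encode('utf-8')           # bytes: the same text in a specific encoding
print(len(s), len(b))           # 5 6 — the ï takes two bytes in UTF-8
assert b.decode('utf-8') == s   # convert back to str at the boundary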

So text encodings aren't a big challenge except — in your editor's experience — for a couple of places, one of which is the email module, which has proved to be the reason for the most version-dependent code in this particular project. Much of that is due to API changes in that module, most of which are probably justified for proper email handling even if they are annoying in the short term. But there is also the simple problem that one cannot hide the text-encoding issue when dealing with email. It's not just that a message can arrive in an arbitrary encoding: a single message can contain text in multiple encodings — in a single header line. Properly processing such email is arguably easier and more correct in Python 3, but it's different from Python 2 in subtle ways that took a while to figure out.
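As an example of what Python 3 can do once the incantation has been worked out, the email.header module will decode a single header containing multiple encodings; a small sketch (the header value is a made-up example):

from email.header import decode_header, make_header

# A made-up Subject: value mixing UTF-8 and ISO-8859-1 encoded words
raw = '=?utf-8?b?zrHOss6z?= and =?iso-8859-1?q?na=EFve?='
parts = decode_header(raw)        # a list of (bytes, charset) chunks
print(str(make_header(parts)))    # αβγ and naïve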

Another problem has put your editor in a pickle — literally. The Python pickle module is a convenient way to serialize objects, but it has always been loaded with traps for the unwary. Pickle in Python 2 could be relied upon to generate pickles that could be treated as strings, especially if the oldest "protocol" was used. In Python 3, pickles are bytes, and they are not friendly toward any attempt to treat them as strings. Even the "human readable" protocol=0 mode will produce distinctly non-readable output for some types; these include things like NUL bytes that trip up even the relatively oblivious Latin-1 decoder. The datetime type is prone to this kind of problem, for example.
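The trap is easy to demonstrate (a quick sketch; the exact bytes vary with the Python version):

import pickle
import datetime

p = pickle.dumps(datetime.datetime(2018, 8, 6), protocol=0)
print(type(p))    # <class 'bytes'> — even the "human readable" protocol
# The packed date is embedded as raw binary; the zeroed time-of-day
# fields come out as NUL bytes, so this is not text by any stretch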

One solution is to paint "PICKLES ARE NOT STRINGS" on one's monitor and resolve never to be so sloppy again. But pickles have other problems, including sometimes surprising behavior when one pickles an object under Python 2, then tries to unpickle it under Python 3, where the definition of the object's class may have changed considerably. Your editor has concluded that pickles are an attractive way to avoid defining a proper persistence mechanism for Python objects, but that taking that shortcut leads to problems in the long run.

Yet another inspiration for high levels of grumpiness is the change in how module importing works. In Python 2, a line like:

import mydamnmodule

would find mydamnmodule.py in the same directory as the module doing the import. That behavior was evidently too convenient to survive into Python 3, so it was taken out. The documentation gives some lame excuse about confusion between modules located this way and standard-library modules, but your editor knows that a more mean-spirited motive must have driven such a change.

Now, one can try to fix such code with an explicit relative import:

from . import mydamnmodule

In many situations, though, that will lead to the dreaded "attempted relative import in non-package" exception that has been the cause of a seemingly infinite series of Stack Overflow postings. Once again, the rules must make sense to somebody, but they make this kind of relative import nearly impossible to use.

So there was nothing for it but to actually get a handle on the namespaces in use and change all the import statements into proper absolute form. Doing so revealed some interesting things. The lazy way in which we had set up our hierarchy was silently causing modules to be imported multiple times — as foo , lwn.foo , and even lwn.lwn.foo , for example — unnecessarily bloating the size of the running program. Such imports can also create difficult-to-debug havoc if any modules maintain module-level state that will also be duplicated and, naturally, become inconsistent.
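For the curious, here is a sketch of how that duplication can happen, assuming a hypothetical layout where both the source directory and the package directory end up on sys.path:

# Hypothetical layout, with both /src and /src/lwn on sys.path:
#     /src/lwn/__init__.py
#     /src/lwn/foo.py
import foo        # found via /src/lwn
import lwn.foo    # the same file, imported again via /src

import sys
print(sys.modules['foo'] is sys.modules['lwn.foo'])   # False: two copies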

Moving to well-defined absolute imports fixed those issues, but revealed another that had been hidden: the presence of a number of import loops in the code. These loops, where module A imports B which, in turn (and possibly through several layers of indirection) tries to import A , lead to a "can't import" exception. They are almost always an indication of code structure that, to put it charitably, could use a little more thought. Fixing those required a fair amount of refactoring, profanity, and slanderous thoughts about the Python developers.

The truth, though, is that these issues should have been fixed long ago; the end result of the import change is a much improved code structure here.

Some of the more annoying language changes really do seem like gratuitous attacks on people who have to maintain code over the long term, though. Python 2 did the Right Thing with source files containing both spaces and tabs, for example, while Python 3 throws a fit. The problem is easily fixed, but it seems like it didn't need to be a problem in the first place. Since time immemorial, octal constants have been written with a preceding zero — 0777, for example. Python 3 requires one to write 0o777 instead, for reasons that are not particularly clear. But JavaScript made that change too, so it must be the right thing to do.
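The octal fix is mechanical:

mode = 0o777     # Python 3 (and 2.6+) octal notation
# mode = 0777    # the traditional form is a SyntaxError in Python 3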

At least old-style octal constants will generate a syntax error in Python 3, so there is no chance of subtle problems resulting from those constants being interpreted as decimal. The same is not true of integer division. Python 2 defined integer division as originally intended by $DEITY and implemented by almost every processor: the result is rounded down to an integer value. So 3/2 == 1. In Python 3, instead, dividing integers yields a floating-point result: 3/2 == 1.5. That is a change that could silently create subtle problems. In the LWN code, integer division is used for tasks like subscription management and money calculations; these are not places where mistakes can be afforded.

The fix is easy enough on its face: use //, which performs floor (integer) division. But that requires finding every place that needs to be fixed. Grepping for " / " in a large code base is not particularly fun, especially if said code base also includes a lot of HTML. This work has been done, but it is going to take a lot of testing before your editor is confident in the results.
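For the record, the behavioral difference is easy to show:

print(3 / 2)     # Python 3: 1.5; Python 2 (without __future__): 1
print(3 // 2)    # 1 under both versions — explicit floor division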

There are numerous other little incompatibilities that one stumbles across, naturally. Some library modules have changed or are no longer present. The syntax of the except statement is different. Dictionaries no longer have has_key(). And so on. Most of these are relatively easy to catch and fix, though — just part of a day's work.
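A small sampler of such fixes (illustrative, not taken from the LWN code):

d = {'a': 1}
if 'a' in d:                  # replaces d.has_key('a')
    pass
try:
    int('x')
except ValueError as e:       # replaces "except ValueError, e:"
    pass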

One might wonder about the various tools that are available to help with this transition. The 2to3 tool can be useful for finding some issues, but it wants to translate the code outright, generating a result that no longer runs under Python 2. That is a bigger jump than your editor would like to take; the strategy has very much been to get the code working under both versions of the language before making the big switch. 2to3 also chokes on the Quixote template syntax that is used by much of LWN's Python code. So it was of limited use overall.

An alternative is the six compatibility library, which can be useful for writing code that works under both Python versions. Your editor steered away from six instinctively, though, due to a kernel programmer's inherent dislike for low-level, behind-the-scenes magic. It reworks the module namespace, overrides functionality in surprising places, and requires coding in a version of the language that is neither 2 nor 3. Various versions of six bundled with dependencies have already led to problems even in the Python 2 version of the code. It is better, in your editor's opinion, to have the transitional compatibility code be in one's face, where it can be left behind once the changeover is complete. The increasing number of Python 3 features added to 2.7 make it easier to write portable code, in any case.
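The kind of in-your-face shim that results might look like this (a sketch, not LWN's actual code):

import sys

PY3 = sys.version_info[0] >= 3
if PY3:
    text_type = str
else:
    text_type = unicode   # only defined under Python 2

# Callers use text_type explicitly; the branch is easy to find
# and delete once Python 2 support is finally dropped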

All told, the Python 3 transition has been an adventure — one that is not yet complete. It has taken a lot of time that was already in short supply. The end result, though, is cleaner code written in a better version of the language, or so your editor believes, anyway. The Python 2 code base put in over 16 years of service; hopefully the next version will be good for at least that long.

Comments (74 posted)

A PEP that has been around for a while, without being either accepted or rejected, was reintroduced recently on the python-ideas mailing list. PEP 505 ("None-aware operators") would provide some syntactic sugar, in the form of new operators, to handle cases where variables might be the special None value. It is a feature that other languages support, but has generally raised concerns about being "un-Pythonic" over the years. At this point, though, the Python project still needs to figure out how it will be governed—and how PEPs can be accepted or rejected.

None-awareness

The basic idea is fairly straightforward. Patterns like the following:

if x is None: x = 42

could be replaced with a more compact, and possibly more readable, version:

x = x ?? 42

or the even more compact:

x ??= 42

So, two new operators would be defined. The first is "??", which returns the left-hand side if it is not None and the right-hand side otherwise; importantly, it does not evaluate the right-hand side at all unless the left side is None. As with other "augmented assignment" operators (e.g. +=), "??=" simply applies the ?? operator to the two arguments and assigns the result to the left-hand side.

In that same vein, None-aware indexing and attribute-access operators can change code like the following:

if x is not None and x['foo'] == 'bar': ...
if y is not None and y.foo == 'bar': ...

to:

if x?['foo'] == 'bar': ...
if y?.foo == 'bar': ...

The operators can, of course, be combined, as the following example from the PEP shows. The first line is an expression from the email/generator.py file in the standard library, while the second is a rewrite using the new operators:

mangle_from_ = True if policy is None else policy.mangle_from_
mangle_from_ = policy?.mangle_from_ ?? True

Readers who think some or all of that looks more like Perl—or, more pejoratively, like line noise—than Python are not alone. That is probably one of the most common complaints heard after Steve Dower reintroduced the PEP in mid-July (as well as in previous discussions of the PEP). In that message, he noted that it might not be the best time to bring up the PEP, given the fact that there was no agreed-upon way for the PEP to be accepted (or, for that matter, rejected) since the recent resignation of Guido van Rossum; "but since we're likely to argue for a while anyway it probably can't hurt (and maybe this will become the test PEP for whoever takes the reins?)."

"Argue for a while" is what followed. Many were simply opposed to any of the operators, largely because of their supposed un-Pythonic nature. But some were more open to just adding ?? (and ??= ); the None -aware indexing and attribute-access operators ( ?[] and ?. , also known as "maybe-subscript" and "maybe-dot") can be strung together in ways that seem potentially confusing so folks seemed more wary of them. There are also differing opinions on the readability of the resulting "one-liners"—even advocates of the new operators do not seem inclined toward using them willy-nilly.

Spelling

Many of the arguments against the operators boil down to how they are "spelled"; this is the root of the "un-Pythonic/Perl-like/line-noise" argument. But that's only part of the discussion—or should be. If there is utility to replacing the canonical way to default a variable, how to spell the operator (or keyword) should be a secondary consideration; as Steven D'Aprano put it:

*Its just spelling*. If it is a useful and well-defined feature, we'll get used to the spelling soon enough. That's not to say that spelling is not important *at all*, or that we should never prefer words to symbols. But if the only objection we have is "this is useful but I don't like the spelling so -1" then that's usually a pretty weak argument against the feature.

Giampaolo Rodolà is concerned that most of the new operators are not "explicit", effectively arguing that they run aground on the "explicit is better than implicit" guideline in The Zen of Python. As he put it:

I personally don't find "a ?? b" too bad (let's say I'm -0 about it) but idioms such as "a?.b", "a ??= b" and "a?[3] ?? 4" look too Perl-ish to me, non pythonic and overall not explicit, no matter what the chosen symbol is gonna be. It looks like they want to do too much for the sole reason of allowing people to write more compact code and save a few lines. Compact code is not necessarily a good thing, especially when it comes at the expense of readability and explicitness, as I think is this case.

D'Aprano argued that any new operators are going to have to look somewhat strange since all of the non-strange operators are already in use. Rodolà said that was part of his point: "Not to state the obvious but it's not that we *have to* use the remaining unused symbols just because they're there." David Mertz went even further, arguing that the PEP 572 approval has pushed things too far:

I think the bar has been much too low for introducing new features over the last 5 years or so. Internal changes like the new dictionary implementation are fine, but user-facing changes should be exceedingly rare in the base language. This proposal doesn't come remotely close to such a good standard. I was consistently +0 on the 572 idea, as long as its worst excesses were trimmed, as in the final PEP. But after reading this discussion, I almost reconsider that opinion since its social effect seems to be a move towards accepting wild and unnecessary changes that "might be useful" for a few unusual programming patterns.

The argument over the explicitness of the new operators spawned a huge sub-thread, where it was pretty clear that no minds were being changed. Rodolà argued that the implicit check for None used by the operators makes them non-explicit. On the other hand, D'Aprano argued that if the operators are defined to do a particular thing, then using them that way is explicit. The whole argument caused him to proclaim:

Comments like the above is why I think that "explicit" and "implicit" are used to mean "I like it" and "I don't like it" rather than being objective arguments, or indeed having anything to do with explicitness or implicitness.

Nicholas Chammas observed that perhaps it was not a difference of explicit versus implicit, but instead related to something that C++ inventor Bjarne Stroustrup once said:

This reminds me of something I read about once called Stroustrup's Rule:

> For new features, people insist on LOUD explicit syntax.
> For established features, people want terse notation.

I think the "explicit vs. implicit" part of this discussion is probably better expressed as a discussion about "loud vs. terse" syntax. None of the operators in PEP 505 have implicit behavior, to the best of my understanding. It's just that the operators are new and have terse spellings.

Limbo

Along the way, Dower tried to refocus the discussion by presenting some of the arguments being made that he found lacking, along with a rebuttal of each. That apparently did not have the desired effect, as he bowed out of the discussion shortly thereafter:

I had assumed that because my emails are not the PEP that people would realise that they are not the PEP. I'm going to duck out of the discussions here now, since they are not as productive as I'd hoped, and once we have a BDFL-replacement I'll reawaken it and see what is required at that point.

For some, that may be just as well. Raymond Hettinger said that he was worried that it was not the right time for a discussion of PEP 505 as it might introduce "divisiveness when we most need to be focusing on coming together". Furthermore, he was not a fan of the feature for a number of reasons:

This PEP also shares some traits with PEP 572 in that it solves a somewhat minor problem with new syntax and grammar changes that affect the look and feel of the language in a way that at least some of us (me for example) find to be repulsive. This PEP is one step further away from Python reading like executable pseudo-code. That trait is currently a major draw to the language and I don't think it should get tossed away just to mitigate a minor irritant.

He also noted that other implementations of Python are having a hard time keeping up with the "ferocious rate of change" in recent Python releases, so he thinks a moratorium on language changes makes sense. He listed off a long string of features that have been added and wondered how many folks—including CPython core developers—were on top of all of them. "We've been putting major changes in faster than anyone can keep up with them. We really need to take a breath."

Informally, a moratorium is more or less what the participants in a thread on the python-committers mailing list (where only core developers can post) have recognized to be in effect. There is a de facto moratorium because there is currently no way to decide on a PEP but, even once that logjam has cleared (currently planned to happen early in 2019), it seems likely that any major changes will be pushed out to Python 3.9 (currently targeted for 2021) — at the very least.

The discussion of PEP 505 may serve another purpose, as well. Discussions of PEPs in python-ideas are supposed to be more freewheeling and potentially fractious, but the scars of PEP 572 linger in many ways. This discussion may fuel the efforts to find other ways to discuss PEPs once they reach the point where they might be decided upon (and would typically then move to the python-dev mailing list). Presumably, discussions on python-ideas would stay the same, but what happens after that might be in for some changes—much like the Python project itself at this point.

Comments (38 posted)

The O'Reilly Open Source Conference (OSCON) returned to Portland, Oregon this July for the 20th convocation of this venerable gathering. While some of the program focused on retrospectives, there were also talks and tutorials on multiple technical topics and open-source community management. To give you a feel for the whole conference, we will explore it in a two-part article. This installment will cover a retrospective of open source and some presentations on releasing projects as open source at your organization. A second article will include a few of the technical topics at the conference.

20 Years of open source

OSCON launched in 1999, less than a year after the founding of the Open Source Initiative (OSI) and the drafting of the Open Source Definition. The first conference was a tiny gathering in Monterey, California; it was held side-by-side with the Third Perl Conference and included the board of the young OSI. Ever since, the histories of OSCON and the OSI have been intertwined, including OSI board meetings at most OSCONs.

Accordingly, the OSI ran a "20 Years of Open Source" day, covering much of the last two decades of changes. This included histories of Debian, Red Hat, Mozilla, FreeBSD, and other parts of our ecosystem. This retrospective continued in the "Open Source Past, Present and Future" track in the main program, with talks on how the roles of maintainers have changed, past heroes of open source, and managing projects. One thing that might have startled the attendees of OSCON 1999 was the emphasis on corporate and commercial open source.

One of the more interesting retrospective talks was by Deb Bryant on the history of open source in Oregon, since she was personally there for a lot of it. OSCON itself moved from California to Portland in 2003; 13 of the 20 events have been at the Oregon Convention Center. Bryant attributes Oregon's early adoption of open-source software to its history as a "pioneer state" and its high proportion of small businesses.

Bryant shared a brochure with the audience, circulated in 2004 by the city of Beaverton, Oregon, announcing a $1.2 million project to attract open-source business. This campaign enticed the Intel Open Source Technology Center and the IBM Linux Technology Center to start operating there, and persuaded the Linux non-profit Open Source Development Labs to relocate there. It also attracted early open-source luminaries to move to the Portland area, including Linus Torvalds, Dirk Hohndel, Greg Stein, and Larry Wall.

This success encouraged the Oregon state legislature to vote on the first bill in the US mandating consideration of open-source software for state procurement. At the time, Bryant was the assistant CIO for Oregon and had to straighten out the mess created by the bill. In particular, it was poorly written and required the state to define what open source was, which is how Bryant came to be involved with the OSI. Ultimately, the bill failed to pass in the state senate, partly because of the difficulty of explaining open source to legislators who didn't even own computers, she said.

"One advocate tried to explain open source as picking mixed fruit from an orchard", she recalled. "To which an elderly legislator remarked, 'If you pick out of my orchard, you're getting a backside full of buckshot!'"

While the state failed to mandate open source, several agencies did adopt Linux. In 2009, Portland became the first city with a municipal open-source and open-data policy. Bryant also listed off some of the many open-source and open-culture projects started in Oregon, including the Personal Telco Project, the Government Open Source Conference (GOSCON), the Drupal Association, and DemocracyLab.

Later in her career, Bryant managed the Open Source Lab (OSL) at Oregon State University (OSU). This institution was created in 2004 as a place for students to work with and learn open-source tools and system administration skills. Since OSU had a fiber optic internet connection at a time when only universities had them, it hosted kernel.org, Firefox, and other high-traffic projects. More amusingly, Alex Polvi and some other OSU students created a Firefox crop circle in an Oregon field to celebrate the millionth download of Firefox.

Today, the Portland area is home to offices for Intel, IBM, VMware, Google, Puppet, Urban Airship, New Relic, and many other companies that are heavily involved in various open-source projects. With the return of OSCON and all of the many projects and companies active in the city, open source in Portland seems likely to continue to grow.

Community and project management

Another quarter of OSCON was devoted to community management topics. This started with Jono Bacon's Community Leadership Summit the weekend before OSCON and continued with talks in the main program. Many of these talks were practical instructions on how to promote and manage open-source software in a corporate environment, like VM Brasseur's "How to open source an internal project". This talk was given as a kind of preview of Forge Your Future with Open Source, Brasseur's upcoming book on contributing to open-source projects.

Brasseur's core piece of advice to organizations looking to release a software project as open source is to "identify your goals for the project". Companies need to know why they are opening up a project before deciding anything else. It's completely acceptable to want something in return for giving away your software, she said, since it will end up costing you significant time and money to do it.

"You know what I like as much as free software?", she asked. "Solvent companies."

Your return from releasing a project as open source is unlikely to be monetary, but there are other things you can get in return, such as good publicity, competitive advantage, or better hiring prospects. Whatever your reason is, the external community should expect to benefit as well. Open source isn't altruism, it's cooperation, she said.

Once your organization has a goal, figuring out the answers to other questions becomes a lot easier. These include which audiences to approach, which metrics to track, and how to determine if your new community is succeeding.

After you've decided to open a project, there is a lot of "pre-release hygiene" you need to do, including scrubbing the code of credentials, trademark mentions, and profane or rude comments. You'll also need to audit your code for compliance with any licensed content your team has included or linked, whether open source or proprietary. The OSI's ClearlyDefined website can help with this.

You'll need to decide if you need to have a Contributor License Agreement (CLA) or a Developer Certificate of Origin (DCO) for new contributors to sign. You'll want to create a "contributors" file, as well as style-sheets for your code and your documentation to avoid arguments over tabs versus spaces on your first outside contribution. All projects will also need a Code of Conduct (CoC), she said. The one at the Contributor Covenant site is good enough for most projects.

Choosing a license should be the last thing you do, then, instead of the first thing you argue about. For many projects, their licenses are determined by their dependencies. For example, a project that depends on GPL-licensed libraries will probably be GPL-licensed as well. Brasseur recommended GPLv3 for most projects, but cautioned that "there is no one true license". Then you need to ensure that every code file in the project has a license header. She suggested adding a test for that to your continuous-integration/continuous-delivery (CI/CD) system or code linter.
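Such a check can be a handful of lines in any language; here is a hypothetical Python version (the SPDX marker is an assumption, not something Brasseur prescribed):

import pathlib
import sys

MARKER = 'SPDX-License-Identifier:'   # assumed header convention
missing = [str(p) for p in pathlib.Path('.').rglob('*.py')
           if MARKER not in p.read_text(errors='ignore')[:1024]]
if missing:
    print('Files missing a license header:', *missing, sep='\n  ')
    sys.exit(1)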

Brasseur finished with instructions on building community, emphasizing above all patience and two-way communication. "Your community needs to feel like stakeholders, not like a free labor force", she warned.

Speaking of stakeholders, Red Hat community team manager Stormy Peters led an interactive exercise for project managers later in the program called "Do You Know Who Your Stakeholders Are?". By "stakeholders", Peters means people who care enough to help the project and would care if it changed or went away. Identifying these people is critical at every stage of managing an open-source project, or an effort to release a piece of software as open source.

There are multiple different kinds of stakeholders for any software project. While companies and other projects that depend on your software or resell it are the obvious ones, they are hardly the only ones. Other types include internal users and other departments, other corporate sponsors, contributors, and even users. If you hold conferences or events, a look at who attends these and who sponsors them can be a good way to identify them. Stakeholders aren't just people and organizations that you have to avoid upsetting; they are also your primary advocates and contributors. Peters recalled an occasion of receiving a full walk-through of a One Laptop Per Child (OLPC) device from a woman who was simply an enthusiastic user.

When you are assessing the health or success of an open-source project, the evaluation that matters is how the stakeholders think the project is doing, not what others think. This is why she recommends doing a regular review of the project with them, either in a large meeting or in one-on-one interviews. These are also instrumental in setting project goals.

Next for OSCON 2018

Of course, OSCON isn't just a community management conference, or even primarily so. There were many technical sessions, tutorials, and keynotes over the four days, a few of which we will present in the next article. Stay tuned for message brokers, container security, and some new technology from IBM.

[Josh Berkus is an employee of Red Hat.]

Comments (3 posted)

The kernel's out-of-memory (OOM) killer is summoned when the system runs short of free memory and is unable to proceed without killing one or more processes. As might be expected, the policy decisions around which processes should be targeted have engendered controversy for as long as the OOM killer has existed. The 4.19 development cycle is likely to include a new OOM-killer implementation that targets control groups rather than individual processes, but it turns out that there is significant disagreement over how the OOM killer and control groups should interact.

To simplify a bit: when the OOM killer is invoked, it tries to pick the process whose demise will free the most memory while causing the least misery for users of the system. The heuristics used to make this selection have varied considerably over time — it was once remarked that each developer who changes the heuristics makes them work for their use case while ruining things for everybody else. In current kernels, the heuristics implemented in oom_badness() are relatively simple: sum up the amount of memory used by a process, then scale it by the process's oom_score_adj value. That value, found in the process's /proc directory, can be tweaked by system administrators to make specific processes more or less attractive as an OOM-killer target.
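In rough Python pseudocode (a sketch of the idea, not the kernel's actual C code):

from collections import namedtuple

Proc = namedtuple('Proc', 'name rss swap page_tables oom_score_adj')
totalpages = 8 << 20    # e.g. 32GB of RAM in 4KB pages

def badness(proc):
    # Memory footprint in pages: resident, swapped, and page tables
    points = proc.rss + proc.swap + proc.page_tables
    # oom_score_adj runs from -1000 to 1000; scale it by total memory
    points += proc.oom_score_adj * totalpages // 1000
    return max(points, 1)

procs = [Proc('a', 1 << 20, 0, 4096, 0),
         Proc('b', 2 << 20, 0, 8192, -500)]
victim = max(procs, key=badness)   # 'a' — b's negative adjustment shields it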

No OOM-killer implementation is perfect, and this one is no exception. One problem is that it does not pay attention to how much memory a particular user has allocated; it only looks at specific processes. If user A has a single large process while user B has 100 smaller ones, the OOM killer will invariably target A's process, even if B is using far more memory overall. That behavior is tolerable on a single-user system, but it is less than optimal on a large system running containers on behalf of multiple users.

Control-group awareness

To address this issue, Roman Gushchin has introduced the control-group-aware OOM killer. It modifies the OOM-kill algorithm in a fairly straightforward way: first, the control group with the largest memory consumption is found, then the largest process running within that group is killed. There is also a new knob added to control groups called memory.oom_group; if it is set to a non-zero value, the OOM killer will kill all processes running within the targeted group instead of just the largest one. This flag is useful for cases where the processes in a group depend on each other and the whole set will fail to function properly if one is killed.
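Using the new flag would presumably be a matter of writing to the new control file; a sketch (the cgroup path is illustrative):

# Make the OOM killer treat this group's processes as a single unit
with open('/sys/fs/cgroup/containers/app1/memory.oom_group', 'w') as f:
    f.write('1')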

This patch set is in the -mm tree (and thus in linux-next) now, so it is on the path for merging during the next merge window. It has proved to be a relatively controversial feature, though. There are no real objections to teaching the OOM killer about control groups, but there is significant disagreement over just how the OOM killer should treat those groups. Most of these complaints can be found summarized in this message from David Rientjes.

The first of these is that processes in the root control group are treated differently from those in any other group. The memory-size computation is different and, importantly, the oom_score_adj value is not used for processes running inside of (non-root) control groups. That can lead to surprising results when it comes time for the OOM killer to choose a victim. Rientjes says that the solution is to use the same heuristic for all processes and groups.

Perhaps surprisingly, a number of memory-management developers seem to disagree with this position. In a system dedicated to container workloads, they say, there should be no significant processes running in the root control group; there should be little in the root beyond kernel threads and maybe some system-level daemons. The oom_score_adj knob can still be used to ensure that the OOM killer will leave those processes alone. As Johannes Weiner put it:

You don't have any control and no accounting of the stuff situated inside the root cgroup, so it doesn't make sense to leave anything in there while also using sophisticated containerization mechanisms like this group oom setting.

Rientjes finds this argument unconvincing, however.

Another issue Rientjes pointed out is that the new OOM killer is not hierarchical; each control group is considered as a separate entity. Imagine the following simple hierarchy, with the memory usage of each group shown:

[Diagram: Group 1 and Group 2 side by side under the root, with Group 2 using more memory than Group 1.]

If the OOM killer is brought forth, it will quickly conclude that Group 2 is the problem and will target a process found there. Thus far, things are as one might expect. But if the container running in Group 2 creates some subgroups of its own and splits its workload between them, the result could look something like this:

[Diagram: the same hierarchy, but Group 2's 24GB is now divided among subgroups, each individually smaller than Group 1.]

Now, Group 1 will look like the biggest group in the system, and Group 2 will escape the OOM killer's attention. A truly hierarchical view of the control-group hierarchy (which is generally how things are supposed to work) would see the 24GB of memory used by Group 2 and kill a process there instead.

Once again, there is disagreement over whether there is really a problem here or not. Many users of control groups may not want the fully hierarchical behavior. If one were to substitute "Group 1" and "Group 2" with "Accounting" and "Scientists", for example, it might well seem right that the latter group would use more memory overall. Besides, accountants are always fair game, so the current system behaves as it should.

With regard to the deliberate dodging of the OOM killer by creating subgroups, the response is that such gaming of the system is possible now. Small processes will be passed over, while large processes are targeted, so a clever user could split a task into a large number of processes and get away with using more memory. The control-group-aware mechanism doesn't enable anything new in that regard.

Finally, Rientjes also complained that, since the oom_score_adj value is ignored within control groups, there is no longer any way for users to influence how the OOM-killing decision is made. The answer here seems to be that the oom_score_adj mechanism is unwieldy and not particularly useful anyway. As Michal Hocko put it:

oom_score_adj is basically unusable for any fine tuning on the process level for most setups except for very specialized ones. The only reasonable usage I've seen so far was to disable OOM killer for a process or make it a prime candidate. Using the same limited concept for cgroups sounds like repeating the same error to me.

Rientjes, naturally, disagreed, saying: "The ability to protect important cgroups and bias against non-important cgroups is vital to any selection implementation". He further argued that this feature should be incorporated before the new OOM killer goes upstream to avoid changing user-visible behavior in future kernel releases.

Next steps

These concerns notwithstanding, the control-group-aware OOM-killer patches have landed in the -mm tree. That is not an absolute guarantee that they will go into the mainline; -mm maintainer Andrew Morton often puts interesting work there to see what problems turn up. Rientjes has not given up, though; he has been working on a patch series of his own adding the features he would like to see in the new OOM killer. The changes he makes include:

A new memory.oom_policy knob is added to control groups. Setting it to "none" causes the current largest-process heuristic to be used. A setting of "cgroup" will cause the OOM killer to pick the single group with the largest memory usage and kill a process within it; setting the root group's policy to "cgroup" reproduces the behavior of Gushchin's patch set. Finally, a setting of "tree" enables a fully hierarchical mode. With this knob, the hierarchical mode is available for those who want it; it is also possible to use different modes for different subtrees of the control-group hierarchy.

The same heuristic is used to compare processes across all groups, including the root group. When control groups are in use for OOM-killer control, the oom_score_adj value is ignored with one exception: setting it to -999 (still) makes the associated process unkillable.

This patch set is not yet in -mm, but there does not appear to be any real opposition to it at this point. It preserves the behavior of the original control-group-aware OOM killer for those who want it while making other modes available "for general use" of the feature. So chances are good that it will be included when the new OOM killer finds its way into the mainline. Of course, chances are equally good that many users will still be unhappy with how the OOM killer works and will be looking for yet another set of heuristics to use — it's a traditional part of Linux kernel development, after all.

Comments (11 posted)

One might think that memory allocation during system startup should not be difficult: almost all of memory is free, there is no concurrency, and there are no background tasks that will compete for memory. Even so, boot-time memory management is a tricky task. Physical memory is not necessarily contiguous, its extents change from system to system, and the detection of those extents may not be trivial. With NUMA, things are even more complex because, in order to satisfy allocation locality, the exact memory topology must be determined. To cope with this, sophisticated mechanisms for memory management are required even during the earliest stages of the boot process.

One could ask: "so why not use the same allocator that Linux uses normally from the very beginning?" The problem is that the primary Linux page allocator is a complex beast and it, too, needs to allocate memory to initialize itself. Moreover, the page-allocator data structures should be allocated in a NUMA-aware way. So another solution is required to get to the point where the memory-management subsystem can become fully operational.

In the early days, Linux didn't have an early memory allocator; in the 1.0 kernel, memory initialization was not as robust and versatile as it is today. Every subsystem initialization call, or simply any function called from start_kernel(), had access to the starting address of the single block of free memory via the global memory_start variable. If a function needed to allocate memory, it just increased memory_start by the desired amount. By the time v2.0 was released, Linux had already been ported to five more architectures, but boot-time memory management remained as simple as in v1.0, with the only difference being that the extents of the physical memory were detected by the architecture-specific code. It should be noted, though, that hardware in those days was much simpler and memory configurations could be detected more easily.
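In effect, the 1.0-era allocator was a simple bump pointer; a sketch in Python (illustrative only; the kernel's version was C):

memory_start = 0x00100000        # start of the single free block

def alloc(size, align=8):
    # Boot-time allocation, v1.0 style: round up, then bump the pointer
    global memory_start
    memory_start = (memory_start + align - 1) & ~(align - 1)
    addr = memory_start
    memory_start += size
    return addr                  # there is no way to free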

Up until version 2.3.23pre3, all early memory allocations used global variables indicating the beginning and end of free memory and adjusted them accordingly. Luckily, the page and slab allocators were available early, so heavy memory users, such as buffers_init() and page_cache_init(), could use them. Still, as hardware evolved and became more sophisticated, the architecture-specific code dealing with memory had grown quite a bit of complex cruft.

The 2.3.23pre3 patch set included the first bootmem allocator implementation, which used a bitmap to represent the status of each physical memory page. Cleared bits identified available pages, while set bits meant that the corresponding memory pages were busy or absent. All the generic functions that tweaked memory_start and the i386 initialization code were converted to use bootmem, but other architectures were left behind. They were converted by the time version 2.3.48 was ready. Meanwhile, Linux was ported to Itanium (ia64), which was the first architecture to start off using bootmem.

Over time, memory detection has evolved from simply asking the BIOS for the size of the extended memory block to dealing with complex tables, pieces, banks, and clusters. In particular, the Power64 architecture came prepared, bringing with it the Logical Memory Block allocator (or LMB). With LMB, memory is represented as two arrays of regions. The first array describes the physically contiguous memory areas available in the system, while the second array tracks allocated regions. The LMB allocator made its way into 32-bit PowerPC when the 32-bit and 64-bit architectures were merged. Later on it was adopted by SPARC. Eventually LMB made its way to other architectures and became what is now known as memblock.

The memblock allocator provides two basic primitives that are used as the base for more complex allocation APIs: memblock_add() for registering a physical memory range, and memblock_reserve() to mark a range as busy. Both of these are based, in the end, on memblock_add_range(), which adds a range to either of the two arrays.
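A Python sketch of the two-array model (illustrative; the real implementation also keeps the arrays sorted and merges adjacent regions):

memory = []      # physically contiguous RAM present in the system
reserved = []    # ranges that are busy (allocated or otherwise in use)

def memblock_add_range(array, base, size):
    # The real code merges overlapping regions; this sketch just appends
    array.append((base, size))

def memblock_add(base, size):        # register available physical memory
    memblock_add_range(memory, base, size)

def memblock_reserve(base, size):    # mark a range as busy
    memblock_add_range(reserved, base, size)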

The major drawback of bootmem is the bitmap initialization. To create this bitmap, it is necessary to know the physical memory configuration. What is the correct size of the bitmap? Which memory bank has enough contiguous physical memory to store the bitmap? And, of course, as memory sizes increase, so does the bootmem bitmap. For a system with 32GB of RAM, the bitmap will require 1MB of that memory (32GB is 8M 4KB pages and, at one bit per page, the map takes 8M bits, or 1MB). Memblock, on the other hand, can be used immediately, as it is based on static arrays large enough to accommodate, at least, the very first memory registrations and allocations. If a request to add or reserve more memory would overflow a memblock array, the array is doubled in size. There is an underlying assumption that, by the time that may happen, enough memory will have been added to memblock to sustain the allocation of the new arrays.

The design of memblock relies on the assumption that there will be relatively few allocation and deallocation requests before the primary page allocator is up and running. It does not need to be especially smart, since its lifetime is limited before it hands off all the memory to the buddy page allocator.

To ease the pain of transition from bootmem to memblock, a compatibility layer called nobootmem was introduced. Nobootmem provides (most of) the same interfaces as bootmem, but instead of using the bitmap to mark busy pages it relies on memblock reservations. As of v4.17, only five out of 24 architectures are still using bootmem as the only early memory allocator; 14 use memblock with nobootmem. The remaining five use memblock and bootmem at the same time.

Currently there is ongoing work on enabling the use of memblock with nobootmem on all architectures. Several architectures that use device trees have been converted as a consequence of recent changes in early memory management in the device-tree drivers. Patches for alpha, c6x, m68k, and nios2 have already been published; some have been merged by the architecture maintainers, while others are still under review.

Hopefully, by the 4.20 merge window all architectures will cease using bootmem; after that it will be possible to start a major cleanup of the early memory management code. That work would include removing the bootmem allocator and several kernel configurations associated with it. That, in turn, should make it possible to start moving more early-boot functionality from the architecture-specific subtrees into common code. There is never a lack of problems to solve in the memory-management subsystem.

Comments (4 posted)

Memory allocation for applications is a bit of a balancing act between various factors including CPU performance, memory efficiency, and how the memory is actually being allocated and deallocated by the application. Different programs may have diverse needs, but it is often the kind of workload that the application is expected to handle that determines which memory allocator performs best. That argues for a diversity of memory allocators (and allocation strategies) but, on the other hand, that complicates things for Linux distributions. As a result, Fedora is discussing ways to rein in the spread of allocators used by its packages.

Florian Weimer raised the issue on the Fedora devel mailing list on July 26. He wanted to change the packaging guidelines to ask that packagers not replace the glibc malloc allocator with other choices (chiefly jemalloc and TCMalloc). He listed three reasons for that:

We have resources to support glibc malloc, but not for other mallocs.

Other mallocs do not follow ABI and provide insufficient alignment.

Choosing a malloc is workload-dependent and forcing a non-default malloc takes options away from system administrators.

Jason L. Tibbitts III wondered whether Weimer was looking for a ban on switching away from malloc or just a strong guideline. Weimer called an absolute ban "overly broad", but did want to make it clear that packagers switching away from malloc may be making it harder for users to switch to a different allocator using LD_PRELOAD. Tibbitts said that maintainers may have difficulty moving away from a particular allocator if the upstream project does not make it easy to do so. Weimer acknowledged that as a possibility.
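That mechanism is just environment-based interposition; for example (the library path and program are illustrative):

import os
import subprocess

# Run a program with jemalloc interposed ahead of glibc's malloc
env = dict(os.environ, LD_PRELOAD='/usr/lib64/libjemalloc.so.2')
subprocess.run(['/usr/bin/myapp'], env=env)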

Tibbitts mentioned performance as one reason that an upstream project might want to switch allocators, but Daniel P. Berrangé pointed out that sometimes those assumptions should be checked:

Yeah it is often linked to a supposed performance improvement. QEMU supports choice of native, jemalloc or tcmalloc and in Fedora we used to use tcmalloc for QEMU for a while. Then we checked performance again and found that the delta to glibc's native malloc had essentially gone, so we've stopped using tcmalloc. IOW, even if it was done for performance reasons originally, we should not assume that is still valid today as glibc has improved its impl[ementation].

One basic problem, as Carlos O'Donell said, is that the Fedora project lacks people who have experience with allocators other than the one in glibc, so it makes sense to standardize on that one as much as possible.

I think a key point here is to reduce the number of allocators being used by the distribution so we can keep the quality high and help our users when they have problems. Granted some people will need to use jemalloc because upstream links against it directly, or is deeply integrated with it. I don't think we should block that. We should however, avoid it where possible, and standardize on an allocator that works well by default across a lot of workloads, and let the system admins / DevOps people choose allocators in the light of feedback from performance on production workloads (not chosen by us, package managers, or upstream).

Jan Kratochvil wondered why the glibc implementation had not simply been replaced with TCMalloc. But DJ Delorie, who has made some changes to the glibc allocator with an eye toward improving its performance, noted that it undergoes extensive testing that may be lacking for other choices:

As for replacing it, I/we are not against that in principle, although that's a glibc topic and not a Fedora topic. However, keep in mind that glibc's allocator has been tested against a HUGE collection of software, compared to other mallocs that might have a much smaller testing breadth. To replace glibc's allocator would require a huge testing effort, and careful consideration of EVERY glibc-specific feature, hack, hook, and historical divot that Fedora apps might rely on (I'm looking at you, Emacs). So if you can prove that some alternate allocator can serve as a *general* purpose *system-wide* default allocator, and has better performance (speed, RSS, VSZ, etc) all the time, for all apps in Fedora (and other Linux distros, and Hurd, etc) that use it, with fewer security bugs and no missing features... yeah, we're listening. Patches welcome :-)

A bug report for the Ruby language that suggested moving to jemalloc by default is instructive. It was mentioned in the thread as a place to perhaps gather information to improve malloc. O'Donell commented in the bug entry pointing out that allocator performance is highly workload dependent. Subsequent comments show that jemalloc is not the panacea that some in the Ruby community thought it was, so Ruby may well stay with malloc as its default.

No one seemed truly opposed to the proposal, though there was some grumbling about the address-alignment constraints imposed by the glibc allocator, which is, of course, not a Fedora-specific issue either. Allocators are complex beasts and, as Delorie and O'Donell both said, need to be tested in a wide variety of different scenarios. Trying to encourage a default to the allocator that has extensive testing—and in-project expertise—seems like a good choice for Fedora.

Comments (39 posted)