
While some aspects of the kernel's defenses against the Meltdown and Spectre vulnerabilities were more-or-less in place when the problems were disclosed on January 3, others were less fully formed. Additionally, many of the mitigations (especially for the two Spectre variants) had not been seen in public prior to the disclosure, meaning that there was a lot of scope for discussion once they came out. Many of those discussions are slowing down, and the kernel's initial response has mostly come into focus. The 4.15 kernel will include a broad set of mitigations, while some others will have to wait for later; read on for details on where things stand.

This article from January 5 gives an overview of the defenses for all three vulnerability variants. That material will not be repeated here, so those who have not read it may want to take a quick look before proceeding.

Variant 1

On its surface, the mitigation for Spectre variant 1 (speculative bounds-check bypass) hasn't changed much. In the latest patch set from Dan Williams, the proposed nospec_array_ptr() macro has been renamed to just array_ptr():

array_ptr(array, index, size)

Its function remains the same: it returns a pointer that is either within the given array or NULL and prevents the processor from speculating with values that are outside the array. The implementation of this macro has been the subject of some debate, though.

The initial implementation used the Intel-blessed mechanism of inserting an lfence instruction as a barrier to prevent speculation past the bounds check. But barriers are relatively expensive, so this approach generated a fair amount of concern about its performance impacts, though few actual measurements were posted. In response, a different approach, which appears to have originated with Alexei Starovoitov, is being explored. It takes a different tack; rather than disabling speculation, it tries to ensure that any speculation that does occur remains within the bounds of the array being accessed.

The trick is to AND the pointer value with a mask that is generated in the following way, given a constant size and a possibly hostile index:

mask = ~(long)(index | (size - 1 - index)) >> (BITS_PER_LONG - 1);

If index is greater than or equal to size, the subtraction at the core of the macro will generate a negative number. ORing in the index again ensures that the sign bit will also be set for the largest index values, which might otherwise cause the subtraction to wrap back around to a positive number. The subsequent right-shift by BITS_PER_LONG-1 replicates the sign bit through the entire word, yielding a value that is either all zeros or all ones; the latter case happens when the index is too large. Finally, the "~" at the beginning inverts all the bits. The result: a mask that is all ones for a valid index, all zeros otherwise. (Note that there is an x86 implementation of this computation that comes down to two instructions.)
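A user-space sketch of the same computation may make the logic easier to follow. This is not the kernel's actual array_ptr() implementation; index_mask() and checked_ptr() are illustrative names:

```c
#include <stddef.h>

#define BITS_PER_LONG ((int)(sizeof(long) * 8))

/* All ones when index < size, all zeros otherwise.  Like the kernel
 * code, this relies on the compiler emitting an arithmetic right
 * shift for signed values. */
static unsigned long index_mask(unsigned long index, unsigned long size)
{
	return ~(long)(index | (size - 1 - index)) >> (BITS_PER_LONG - 1);
}

/* A stand-in for array_ptr(): returns a pointer into the array for a
 * valid index, NULL otherwise.  ANDing with the mask keeps even a
 * mis-speculated access from dereferencing beyond the array. */
static const int *checked_ptr(const int *array, unsigned long index,
			      unsigned long size)
{
	unsigned long p = (unsigned long)(array + index);

	return (const int *)(p & index_mask(index, size));
}
```

Because the mask is computed from the same index value that the processor would speculate the load with, a speculated out-of-bounds index yields a zero mask and thus a NULL pointer, rather than a pointer outside the array.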

The key point here is that, if the processor speculates a load from the array with a given index value, it will speculate the mask generation from the same value. That should ensure that the mask is appropriate to the index used and cause the right thing to happen when the mask is ANDed with the pointer value, heading off any attempts to force speculative loads outside of the bounds of the array. There seems to be a high level of confidence that the processor will not speculate on any of the data values used in the masking operation — speculation is almost entirely limited to control decisions, not data values. Normal speculation and reordering can continue, though, retaining the performance of the code overall.

This would appear to be an optimal solution to the problem. It seems that some developers are not yet fully comfortable with this approach, though; they worry that there is still room for the processor to mis-speculate the calculation of the mask, perhaps abetted by optimizations done by the compiler. The fact that the processor vendors have not given any assurances to the contrary gives weight to those concerns. Linus Torvalds, instead, believes that the masking approach is actually safer than using barriers. Even so, some would like to stick with the barrier-based approach. The current patches, as posted, offer both approaches, controlled by a configuration option.

The other significant problem — finding the places where this macro needs to be used — remains unsolved. The current patch set leaves out most of the locations that had been protected in previous versions, since a number of them proved to be controversial. As of this writing, the variant-1 defenses have not yet found their way into the mainline, but that could yet change in this rather atypical development cycle.

Variant 2

Variant 2 (poisoning of the branch prediction buffer) is primarily protected against using the "retpoline" mechanism, which replaces indirect jumps and calls with a dance that defeats speculation. This mechanism was merged into the mainline for the 4.15-rc8 release — a late date indeed for a change of this magnitude — and there are still a few small pieces missing. Given the short time involved and the number of questions needing answers, though, it would not have been possible to get this work done any sooner.

There were various discussions about implementation details, and ongoing uncertainty over whether retpolines are a sufficient protection against variant 2 on Intel Skylake processors. The problem on Skylake has to do with another data structure internal to the processor: the return stack buffer (RSB). Normally, this buffer is used to predict the address used in a "return" instruction, but there are situations where this buffer can run out of entries. That generally happens when the call stack is made deeper without the processor knowing about it; just about any sort of context switch can cause that to happen, for example. On a Skylake processor, an RSB underflow will cause a fallback to the branch prediction buffer instead, turning any "return" into a possible attack point.

It may also be possible, on some other processors, for user space to populate the RSB with hostile values, once again enabling the wrong kind of speculation. The answer in either case is the same: stuff the RSB full of well-known values in places (like context switches) where things could go wrong. The RSB-stuffing patches have been circulating for a while; they have not yet been merged but that should happen in the near future.

One other issue with retpolines remains somewhat unresolved, though: using them requires support from the compiler, and almost nobody has a compiler with that support available. Support for GCC was only posted by H.J. Lu on January 7; those patches were then subjected to a fair amount of ... discussion ... on the details that threatened to delay their merging indefinitely. Richard Biener finally jumped in to request that the process be expedited a bit:

And I'd also like people not to bikeshed too much on this given we're in the situation of having exploitable kernels around for which we need a cooperating compiler. So during the time we bikeshed this (rather than reviewing the actual patches) we have to "backport" the current non-upstream state anyway to deliver fixed kernels to our customer.

That seems to have been enough to at least bring about agreement that this feature would be requested with the -mindirect-branch=thunk-extern compiler option. The GCC developers did force a name change for the retpoline thunk itself, though, breaking the existing kernel patches and making the compiler (when released) incompatible with the version that distributors have been using to create fixed kernels thus far. If that change sticks, it will require more 4.15 patches in the immediate future.

Meanwhile, the IBRS feature is being added to the microcode for some processors to defend against variant-2 attacks, but the degree to which the kernel will use it is still unclear. Setting the IBRS bit in a model-specific register acts as a sort of barrier, preventing bad values placed in the branch prediction buffer from being used when speculating the execution of code in the kernel. IBRS is generally considered inferior to retpolines because it has a much higher performance impact, though that cost is lower on the newest CPUs. An extensive mailing-list discussion made it clear that few people truly understand how IBRS is meant to work or when it should be used. A rather frustrated series of questions from Thomas Gleixner elicited some answers, but only after a considerable amount of contradictory information had been passed around.

Work on IBRS seems to have slowed for now, though, perhaps because retpolines are now seen as being good enough for Skylake processors — for now, at least. As Gleixner put it, the IBRS question can now be resolved in a non-emergency mode:

The further RSB vs. IBRS discussion has to be settled in the way we normally work. We need full documentation, proper working micro code and actual comparisons of the two approaches vs. performance, coverage of attack vectors and code complexity/ugliness.

The remaining concern on Skylake processors would appear to be system-management interrupts (SMIs), which can cause unprotected code to be run in kernel context. There does not appear to be a consensus that SMIs are exploitable in the real world, though, and no known proof of concept exists. Still, David Woodhouse has stated his intent to eventually have Skylake processors use IBRS by default, with retpolines as a boot-time option. But, as he pointed out, this outcome has been slowed by the lack of anybody pushing the IBRS patches forward at an acceptable rate. 4.15 looks set to be released without IBRS support, but that support will almost certainly show up in the relatively near future.

Variant 3

Variant 3 (the "Meltdown" vulnerability) allows a user-space process to read the contents of kernel memory on a vulnerable system. The defense against this problem is kernel page-table isolation (KPTI), which has been developed in public since early November. It was merged for the 4.15-rc5 release and has remained mostly unchanged since then — if one doesn't count a rather large number of bug fixes. Such a fundamental memory-management change was never going to be without glitches, but they are being found and dealt with, one at a time.

The biggest upcoming change to KPTI is certainly the ability to control its use on a per-process basis. KPTI is an expensive mitigation, with overheads of 30% or more reported for some specific workloads (though most workloads will not see an impact of that magnitude). The nopti command-line option can be used to disable KPTI entirely, but there are likely to be settings where an administrator wishes to exempt specific performance-critical processes from KPTI while retaining that protection for the system as a whole. Willy Tarreau has been working on a patch set to provide that capability, but there are some remaining differences of opinion on how it should work.

Tarreau's patch set adds a couple of new commands to the arch_prctl() system call: ARCH_DISABLE_PTI_NOW and ARCH_DISABLE_PTI_NEXT. The first immediately disables KPTI for the calling process, while the latter merely sets a flag that causes KPTI to be disabled after the process makes a call to execve(). The CAP_SYS_RAWIO capability is required to disable KPTI. There is also a sysctl knob (/proc/sys/vm/pti_adjust) that can be used to disable these operations, either temporarily or permanently.

Many aspects of this interface have been discussed without a whole lot of conclusions. The current proposal works at the process level, for example; it is not possible for different threads within a process to have a different KPTI state. Some developers, though, think that thread-level control makes more sense. Another point of discussion was whether both the "now" and "next" modes are needed, but there is, naturally, disagreement over which of the two should go. Linus Torvalds was adamant that the "next" mode is the right one, because the natural place to disable KPTI is in an external wrapper program:

Processes should never say "I'm so important that I'm disabling PTI". That's crazy talk, and wrong. It's wrong for all the usual reasons - everybody always thinks that _their_ own work is so important and bug-free, and that things like PTI are about protecting all those other incompetent people.

Instead, he said, the decision to disable KPTI should be made by an external program run by the administrator. As might be expected for this group, the first use case for such a wrapper would be a nopti wrapper that could be used to run kernel builds without KPTI.

Andy Lutomirski has proposed that a new capability (CAP_DISABLE_PTI) should control access to this functionality rather than CAP_SYS_RAWIO. That would make a lot of the existing privilege checks just work without the need to add a bunch of new infrastructure. The idea is somewhat controversial, though, and it's not clear whether it will make it into the final version of this feature.

All told, there are a number of unresolved issues around how per-process KPTI control should work, even though everybody involved seems to agree that the feature itself should exist. The 4.15 kernel will be released without the per-process KPTI feature, and it would not be surprising to see it miss 4.16 as well.

In conclusion

After all of this work, it would appear that the 4.15 kernel will be released with fairly complete Meltdown and Spectre protection, though a number of sharp edges are sure to remain. But, quoting Gleixner again, the time has come to slow down a bit:

Surely we all know there is room for improvements, but we also have reached a state where the remaining issues are not longer to be treated in full emergency and panic mode. We're good now, but not perfect. [...] We all are exhausted and at our limits and I think we can agree that having the most problematic stuff covered is the right point to calm down and put the heads back on the chickens. Take a break and have a few drinks at least over the weekend!

Those of us who know Gleixner can be fairly well assured that he will have taken his own advice.

All told, this set of vulnerabilities has been an intense death march for a number of kernel developers, most (or all) of whom were not informed of the problems until months after their discovery. Many of them were doing this work as part of their normal job, but others jumped in just because the work needed to be done. All of them were working to address issues that were not of their making in any way. As a result of their effort, Linux systems are reasonably well protected from these problems. We are all very much in their debt.


Many techniques in software security are complicated and require a deep understanding of the internal workings of the computer and the software under test. Some techniques, though, are conceptually simple and do not rely on knowledge of the underlying software. Fuzzing is a useful example: running a program with a wide variety of junk input and seeing if it does anything abnormal or interesting, like crashing. Though it might seem unsophisticated, fuzzing is extremely helpful in finding the parsing and input processing problems that are often the beginning of a security vulnerability.

Many common types of security vulnerabilities occur when something goes wrong while processing input — for example, the classic buffer overflow. These are interesting in that they tend to manifest first as instability: when input too long for the buffer is read, the program will probably misbehave and simply crash. With careful design of the too-long input, it might be possible to turn this crash into arbitrary code execution. The goal of fuzzing is to find any situations where a program crashes due to unusual input. While fixing these bugs makes the software more stable, it also closes the door on any security issues that could result from them.

Fuzzing tools

In practice, fuzzing tools work by generating files, text, and other kinds of input that have all kinds of unusual properties, a simple example being excessive length. The software is then automatically tested against all of these inputs and any abnormal terminations are logged for later analysis. Any crashes represent at best a minor bug and at worst a serious vulnerability.

One of the best-known fuzzing tools goes by the tongue-in-cheek name american fuzzy lop, or AFL. This tool uses instrumentation compiled into the program under test to detect how many code paths are being exercised by its input, and then uses a genetic algorithm to design input that results in maximum coverage. This ensures that the testing covers even the most rarely used paths through the code, which are often the most likely to contain vulnerabilities.

One of the most powerful features of AFL is its very low setup effort, as AFL can use its algorithms to design a fuzzing regimen unattended. After compiling a program with AFL's test instrumentation, you need only provide AFL with a command to invoke the program and a sample input file. AFL tests modifications to the sample input to discover the minimum acceptable input to the program, and then invents progressively more complicated inputs to find as many execution paths as possible.

AFL has a suite of rules used to generate different input for testing. At the simplest, it changes individual bits in the input. In a more complex situation, it can use a provided dictionary file to invent strings of keywords that are meaningful to the tested program. Finally, AFL's test data doesn't need to be ephemeral. The test inputs that AFL designs can be saved for use with other testing and auditing tools.

Another popular open-source fuzzer is honggfuzz, which is similar in many ways to AFL, but with the important difference that it can use features built into processors, such as the performance management unit and Intel's Processor Trace mechanism, to detect different code paths in the tested program. This makes honggfuzz a better choice for testing software that cannot be rebuilt from source to add instrumentation. Also worth mentioning is libFuzzer, a relatively new tool included with Clang and tightly integrated with other LLVM testing features.
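As a concrete illustration, a libFuzzer harness is little more than a single entry point. In this sketch, parse_header() and its "FUZZ" magic number are invented stand-ins for the real code under test; only LLVMFuzzerTestOneInput() is part of the libFuzzer interface:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical code under test: accept a buffer that starts with a
 * four-byte "FUZZ" magic number, reject anything else. */
static int parse_header(const uint8_t *data, size_t size)
{
	if (size < 4 || memcmp(data, "FUZZ", 4) != 0)
		return -1;
	return 0;
}

/* The libFuzzer entry point: the fuzzing driver calls this repeatedly
 * with generated inputs; crashes and sanitizer reports are recorded. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
{
	parse_header(data, size);
	return 0;	/* values other than 0 are reserved by libFuzzer */
}
```

Building this file with "clang -fsanitize=fuzzer,address" produces a self-running fuzzer binary; the address-sanitizer instrumentation makes out-of-bounds reads in the parsing code fail loudly instead of passing silently.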

To make the process more efficient, fuzzing is often augmented with other test instrumentation such as the Sanitizers from Google. Perhaps the most important part of this collection is the Address Sanitizer, or ASan. ASan is built into an executable and then detects and reports on common memory addressing problems including buffer overflows and use-after-free. Also available are ThreadSanitizer, which detects certain race conditions, and MemorySanitizer, which detects use of uninitialized memory. These types of tools make for a perfect combination with a fuzzer, as they can detect problems produced by the fuzzing input even when those bugs don't directly cause a crash.

Fuzzing is now a best practice for evaluating new software for stability and security issues, but many common open-source software packages predate fuzzing as a common technique.

OSS-Fuzz

Google launched the OSS-Fuzz project in late 2016 to spread automated fuzzing as a best practice in open-source development. Developers must build the necessary test infrastructure, providing entry points for the fuzzer to execute and a simple automatable build process. They can then submit their project to OSS-Fuzz for inclusion. Once accepted, Google's ClusterFuzz system runs extensive fuzzing with multiple tools on the project.

Any crashes found are filed in a dedicated issue tracker to avoid premature disclosure of potential security issues, and the project maintainers are notified of the finding. Once they report that the issue is fixed, ClusterFuzz automatically verifies that the fix works.

OSS-Fuzz has not disappointed. The project's initial announcement included a newly detected heap buffer overflow in the FreeType font-rendering library, which is widely used across multiple platforms. On May 8, 2017, Google reported that OSS-Fuzz had identified over one thousand bugs, 264 of them potential security vulnerabilities. Today, the project has logged over 3,500 findings as resolved.

Bugs found through fuzzing are most common in software that reads complicated input formats and applies multiple processing steps, particularly when that processing involves a lot of memory manipulation and math. There are few types of software that fit this description better than video, audio, and image codecs, so it's unsurprising that OSS-Fuzz findings are concentrated in media software such as FFmpeg. Another major source of findings is LibreOffice, with 90 confirmed bugs found by OSS-Fuzz.

Input-handling bugs found by fuzzing are most concerning when they occur in software that often handles untrusted input. The GnuTLS cryptography library is often used to secure network traffic, the canonical example of untrusted input, and 41 bugs have been found since its enrollment in OSS-Fuzz. There have been three bugs in OpenSSL, one of which resulted in CVE-2017-3735.

Even more than an impressive set of bug findings, OSS-Fuzz's main achievement may be bringing a high level of automation to the fuzzing of open-source projects. The availability of automatic testing running on Google infrastructure makes it significantly easier for open-source projects to incorporate this kind of testing. Today, OSS-Fuzz is still in beta, but it covers about 110 projects and is actively enrolling more.

The Fuzzing Project

Open-source contributor Hanno Böck has taken a different approach to strengthening open-source security through fuzzing. He has been frustrated by what he sees in many open-source projects:

Right now if you pick up a random tool from a Linux system that does file parsing and fuzz it chances are high that you'll immediately hit some segfaults. This is a pretty dismal state.

In response, he launched the Fuzzing Project. In addition to making a few fuzzing tutorials available, the project runs fuzzing tools on a number of popular open-source packages and reports the results back to the projects. The findings are also presented in a simple report card showing how well various projects stand up to fuzzing.

Some of the most interesting findings of the Fuzzing Project are those that have been reported to the project maintainers but have not been resolved. For example, the common archiving tool cpio includes an apparent out-of-bounds write bug that poses a potential security concern, but the GNU cpio project now sees minimal development and the bug has not been resolved. The venerable GNU calculators bc and dc are in the same situation.

More significantly, the Fuzzing Project has led to fixes and three CVE-number assignments in the internationalized domain name encoding library, libidn, half a dozen bugs in the PDF rendering library Poppler, and issues related to file parsing in OpenSSH. Multiple bugs were found and fixed in the file tool, an especially important area for security testing because Linux users are likely to run file on any unusual data they come across.

Working for stability and security

On the surface, fuzzing appears to serve mostly as a tool for edge-case stability testing, locating crashes in extremely abnormal situations. However, many major security vulnerabilities are first discovered as minor bugs. Software that is more stable is less likely to enter exceptional situations that it was not designed for and tested against, and this directly equates to an improvement in security.


Realtime systems are computing systems that must react to events within precise time constraints. In such systems, correct behavior depends not only on the logical result, but also on the time at which it is produced. In other words, a response to a request is correct only if the logical result is right and is delivered within a deadline. If the system fails to provide the response within the deadline, it is exhibiting a defect. In a multitasking operating system such as Linux, a realtime scheduler is responsible for coordinating access to the CPU, ensuring that all realtime tasks in the system complete their jobs within their deadlines.

The deadline scheduler enables the user to specify a task's requirements using well-defined realtime abstractions, allowing the system to make the best scheduling decisions and guaranteeing the scheduling of realtime tasks even on highly loaded systems.

This article provides an introduction to realtime scheduling and some of the theory behind it. The second installment will be dedicated to the Linux deadline scheduler in particular.

Realtime schedulers in Linux

Realtime tasks differ from non-realtime tasks by the constraint of having to produce a response for an event within a deadline. To schedule a realtime task to accomplish its timing requirements, Linux provides two realtime schedulers: the POSIX realtime scheduler (henceforth called the "realtime scheduler") and the deadline scheduler.

The POSIX realtime scheduler, which provides the FIFO (first-in-first-out) and RR (round-robin) scheduling policies, schedules each task according to its fixed priority. The task with the highest priority will be served first. In realtime theory, this scheduler is classified as a fixed-priority scheduler. The difference between the FIFO and RR schedulers can be seen when two tasks share the same priority. In the FIFO scheduler, the task that arrived first will receive the processor, running until it goes to sleep. In the RR scheduler, the tasks with the same priority will share the processor in a round-robin fashion. Once an RR task starts to run, it will run for a maximum quantum of time. If the task does not block before the end of that time slice, the scheduler will put the task at the end of the round-robin queue of the tasks with the same priority and select the next task to run.

In contrast, the deadline scheduler, as its name says, schedules each task according to the task's deadline. The task with the earliest deadline will be served first. Each scheduler requires a different setup for realtime tasks. In the realtime scheduler, the user needs to provide the scheduling policy and the fixed priority. For example:

chrt -f 10 video_processing_tool

With this command, the video_processing_tool task will be scheduled by the realtime scheduler, with a priority of 10, under the FIFO policy (as requested by the -f flag).
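The same setup can be requested from within a program using the POSIX sched_setscheduler() call. This is a sketch (the function name switch_to_fifo() is illustrative); the call needs CAP_SYS_NICE, which in practice means root:

```c
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>

/* Programmatic equivalent of "chrt -f 10 ..." for the current process. */
static int switch_to_fifo(int priority)
{
	struct sched_param sp = { .sched_priority = priority };

	if (sched_setscheduler(0 /* this process */, SCHED_FIFO, &sp) == -1) {
		fprintf(stderr, "SCHED_FIFO(%d) failed: %s\n",
			priority, strerror(errno));
		return -1;
	}
	return 0;
}
```

On Linux, valid SCHED_FIFO priorities run from sched_get_priority_min(SCHED_FIFO) to sched_get_priority_max(SCHED_FIFO), which are 1 and 99 respectively.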

In the deadline scheduler, instead, the user has three parameters to set: the period, the run time, and the deadline. The period is the activation pattern of the realtime task. In a practical example, if a video-processing task must process 60 frames per second, a new frame will arrive roughly every 16.6 milliseconds, so the period is 16.6 milliseconds.

The run time is the amount of CPU time that the application needs to produce the output. In the most conservative case, the runtime must be the worst-case execution time (WCET), which is the maximum amount of time the task needs to process one period's worth of work. For example, a video processing tool may take, in the worst case, five milliseconds to process the image. Hence its run time is five milliseconds.

The deadline is the maximum time in which the result must be delivered by the task, relative to the period. For example, if the task needs to deliver the processed frame within ten milliseconds, the deadline will be ten milliseconds.

It is possible to set deadline scheduling parameters using the chrt command. For example, the above-mentioned tool could be started with the following command:

chrt -d --sched-runtime 5000000 --sched-deadline 10000000 \
     --sched-period 16666666 0 video_processing_tool

Where:

--sched-runtime 5000000 is the run time, specified in nanoseconds

--sched-deadline 10000000 is the relative deadline, specified in nanoseconds

--sched-period 16666666 is the period, specified in nanoseconds

0 is a placeholder for the (unused) priority, required by the chrt command

In this way, the task will have a guarantee of 5ms of CPU time every 16.6ms, and all of that CPU time will be available for the task before the 10ms deadline passes.
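The same parameters can also be set from inside the application with the Linux-specific sched_setattr() system call. glibc has traditionally provided no wrapper for it, so this sketch declares the structure by hand, following the layout documented in sched(7), and invokes the call via syscall(); switch_to_deadline() is an illustrative name:

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE	6	/* value from the kernel UAPI headers */
#endif

/* Layout as documented in sched(7). */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;		/* SCHED_OTHER/SCHED_BATCH only */
	uint32_t sched_priority;	/* SCHED_FIFO/SCHED_RR only */
	uint64_t sched_runtime;		/* SCHED_DEADLINE, nanoseconds */
	uint64_t sched_deadline;
	uint64_t sched_period;
};

/* Equivalent of the chrt -d invocation above, for the calling thread. */
static int switch_to_deadline(uint64_t runtime_ns, uint64_t deadline_ns,
			      uint64_t period_ns)
{
	struct sched_attr attr = {
		.size           = sizeof(attr),
		.sched_policy   = SCHED_DEADLINE,
		.sched_runtime  = runtime_ns,
		.sched_deadline = deadline_ns,
		.sched_period   = period_ns,
	};

	return syscall(SYS_sched_setattr, 0 /* this thread */, &attr, 0);
}
```

As with SCHED_FIFO, switching to SCHED_DEADLINE requires privilege (CAP_SYS_NICE); the call also fails if the requested reservation would fail the kernel's admission test.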

Although the deadline scheduler's configuration looks complex, it is not. By providing the correct parameters, which depend only on the application itself, the user does not need to be aware of all the other tasks in the system to be sure that the application will deliver its results before the deadline. When using the realtime scheduler, instead, the user must take all of the system's tasks into account in order to determine the correct fixed priority for each task.

Since the deadline scheduler knows how much CPU time each deadline task will need, it knows when the system can (or cannot) admit new tasks. So, rather than allowing the user to overload the system, the deadline scheduler denies the addition of more deadline tasks, guaranteeing that all deadline tasks will have enough CPU time to complete their work with, at worst, a bounded tardiness.

In order to further discuss the benefits of the deadline scheduler, it is necessary to take a step back and look at the bigger picture. To that end, the next section explains a little bit about realtime scheduling theory.

A realtime scheduling overview

In scheduling theory, realtime schedulers are evaluated by their ability to schedule a set of tasks while meeting the timing requirements of all realtime tasks. In order to provide deterministic response times, realtime tasks must have a deterministic timing behavior. The task model describes the deterministic behavior of a task.

Each realtime task is composed of N recurrent activations; a task activation is known as a job. A task is said to be periodic when a job takes place after a fixed offset of time from its previous activation. For instance, a periodic task with period of 2ms will be activated every 2ms. Tasks can also be sporadic. A sporadic task is activated after, at least, a minimum inter-arrival time from its previous activation. For instance, a sporadic task with a 2ms period will be activated after at least 2ms from the previous activation. Finally, a task can be aperiodic, when there is no activation pattern that can be established.

Tasks can have an implicit deadline, when the deadline is equal to the activation period, or a constrained deadline, when the deadline can be less than (or equal to) the period. Finally, a task can have an arbitrary deadline, where the deadline is unrelated to the period.

Using these patterns, realtime researchers have developed ways to compare scheduling algorithms by their ability to schedule a given task set. It turns out that, for uniprocessor systems, the Earliest Deadline First (EDF) algorithm was found to be optimal. A scheduling algorithm is optimal when it fails to schedule a task set only if no other scheduler can schedule it. EDF is optimal for periodic and sporadic tasks with deadlines less than or equal to their periods on uniprocessor systems. In fact, for periodic or sporadic tasks with implicit deadlines, the EDF scheduler can schedule any task set as long as that set does not use more than 100% of the CPU time. The Linux deadline scheduler implements the EDF algorithm.

Consider, for instance, a system with three periodic tasks with deadlines equal to their periods:

    Task    Runtime (WCET)    Period
    T1      1                 4
    T2      2                 6
    T3      3                 8

The CPU time utilization (U) of this task set is less than 100%:

U = 1/4 + 2/6 + 3/8 = 23/24

For such a task set, the EDF scheduler would present the following behavior:

However, it is not possible to use a fixed-priority scheduler to schedule this task set while meeting every deadline; regardless of the assignment of priorities, one task will not run in time to get its work done. The resulting behavior will look like this:

The main advantage of deadline scheduling is that, once you know each task's parameters, you do not need to analyze all of the other tasks to know that your tasks will all meet their deadlines. Deadline scheduling often results in fewer context switches and, on uniprocessor systems, is able to schedule more tasks than fixed-priority scheduling while meeting every task's deadline. However, the deadline scheduler also has some disadvantages.

The deadline scheduler guarantees that each task's deadline will be met, but it cannot ensure a minimum response time for any given task. With a fixed-priority scheduler, the highest-priority task always has the minimum response time; that is not possible to guarantee with the deadline scheduler. The EDF scheduling algorithm is also more complex than fixed-priority scheduling, which can be implemented with O(1) complexity; the deadline scheduler is O(log(n)). On the other hand, fixed-priority scheduling requires an "offline computation", by the user, of the best set of priorities, which can be as complex as O(N!).

If, for some reason, the system becomes overloaded, for instance due to the addition of a new task or a wrong WCET estimation, it is possible to face a domino effect: once one task misses its deadline by running for more than its declared run time, all other tasks may miss their deadlines as shown by the regions in red below:

In contrast, with fixed-priority scheduling, only the tasks with lower priority than the task which missed the deadline will be affected.

In addition to the prioritization problem, multi-core systems add an allocation problem: on a multi-core system, the scheduler also needs to decide where tasks can run. Schedulers are generally classified as one of the following:

Global: When a single scheduler manages all M CPUs of the system. In other words, tasks can migrate to all CPUs.

Clustered: When a single scheduler manages a disjoint subset of the M CPUs. In other words, tasks can migrate to just a subset of the available CPUs.

Partitioned: When each scheduler manages a single CPU, so no migration is allowed.

Arbitrary: Each task can run on an arbitrary set of CPUs.

In multi-core systems, global, clustered, and arbitrary deadline schedulers are not optimal. The theory for multi-core scheduling is more complex than for single-core systems due to many anomalies. For example, in a system with M processors, it is possible to schedule M tasks with a run time equal to the period. For instance, a system with four processors can schedule four "BIG" tasks with both run time and period equal to 1000ms. In this case, the system will reach the maximum utilization of:

4 * 1000/1000 = 4

The resulting scheduling behavior will look like:

It is intuitive to think that a system with a lower load will be schedulable too, as is the case for single-processor systems. Consider, though, a system with four processors running a task set composed of four small tasks with a minimal runtime, let's say 1ms, every 999ms period, plus one BIG task with runtime and period of one second. The load of this system is:

4 * (1/999) + 1000/1000 = 1.004

As 1.004 is smaller than four, one might intuitively say that the system is schedulable. But that is not true for global EDF scheduling: if all tasks are released at the same time, the M small tasks will be scheduled on the M available processors, so the big task will be able to start only after the small tasks have run, finishing its computation after its deadline, as illustrated below. This is known as Dhall's effect.

Distribution of tasks to processors turns out to be an NP-hard problem (a bin-packing problem, essentially) and, due to other anomalies, there is no dominance of one scheduling algorithm over any others.

With this background in place, we can turn to the details of the Linux deadline scheduler and the best ways to take advantage of its capabilities while avoiding the potential problems. See the second half of this series, to be published soon, for the full story.

Comments (6 posted)

The Linux kernel's generic power domain (genpd) subsystem has been extended to support active state management of the power domains in the 4.15 development cycle. Power domains were traditionally used to enable or disable power to a region of a system on chip (SoC) but, with the recent updates, they can control the clock rate or amount of power supplied to that region as well. These changes improve the kernel's ability to run the system's hardware at the optimal power level for the current workload.

SoCs have become increasingly complex and power-efficient over the years. Most of the IP blocks in an SoC have independent power-control logic that can be turned on or off to reduce the power they consume. But there is also a significant amount of static current leakage that can't be controlled using the IP-block-specific power logic. SoCs are normally divided into several regions depending on which IP blocks are generally used together, so that an unused region can be completely powered off to eliminate this leakage. These regions of the chip, called "power domains", can be present in a hierarchy and thus can be nested; a nested domain is called a subdomain of the master domain. Powering down a power domain results in disabling all the IP blocks and subdomains controlled by the domain and also stopping any static leakage in that region of the chip.

The Linux kernel's generic power domains are used to group devices that share clock or other power resources and are all enabled or disabled together, though these devices may further have fine-grained control over individual resources. Generic power domains support a limited number of operations today, most of which eventually come down to enabling or disabling the power domain to avoid static leakage.

Powering down a power domain can have a penalty, though, as powering it back up later may take a significant amount of time. Additionally, the power-domain controller registers are often only accessible via the SPI and I2C buses, which are quite slow. For that reason, some of the more advanced SoCs have implemented several idle states for their power domains. A deeper idle state saves more power for the region the power domain controls, but raises the penalty to restore power to the domain. It is thus important to avoid taking the power domain to a deeper idle state if it is already known that the domain will be needed again after a short amount of time. Idle-state support was recently added to the generic power domains in the Linux kernel.

Similar to idle states, some advanced SoCs have implemented various active states for power domains. The active states control the clock rate, voltage, or power that the power domain provides to the region it controls. These active states are called "performance states" within the Linux kernel. The higher the performance state, the higher the dynamic power consumption and the static power leakage of the region controlled by the domain.

Each device controlled by the power domain can request that the power domain be configured to a performance state that satisfies the current performance requirements of the device; the power domain will be configured to the highest performance state requested by all of its devices. The performance states (within the genpd core) are identified by positive integer values; a lower value represents a lower performance state. The performance state zero is special; devices can request this state if they do not want to be considered when the next performance state for the power domain is calculated.

Linux doesn't enforce a policy on what the values of the performance states should be. Platforms can choose any range of consecutive or non-consecutive values, from 1-10 or 500-550 or anything else they want. The genpd core only compares these values against each other to find the highest integer value and passes that value to the platform-specific genpd callback (described later); that callback should have knowledge about the valid performance-state ranges for that platform.

Internals

The genpd core provides the following helper for devices to request a performance state for their power domain:

int dev_pm_genpd_set_performance_state(struct device *dev, unsigned int state);

Here, dev is the pointer to the device structure and state is the requested performance state for the power domain that controls the device. This function updates the performance state constraint of the device on its domain. The genpd core then finds the new performance state for the domain based on the current requests from the various devices the domain controls, then updates the performance state, if required, of the power domain in a platform-dependent way. This happens synchronously and the performance state of the power domain is updated before this helper returns. dev_pm_genpd_set_performance_state() returns zero on success and an error number otherwise. The return value -ENODEV is special; it is returned if the power domain of the device doesn't support performance states.

On a call to dev_pm_genpd_set_performance_state() , the genpd core calls the set_performance_state() callback of the power domain if the performance state of the power domain needs to be updated. This callback must be supplied by the power-domain drivers that support performance states.

    struct generic_pm_domain {
	int (*set_performance_state)(struct generic_pm_domain *genpd,
				     unsigned int state);

	/* Many other fields... */
    };

Here, genpd is the generic power domain and state is the target performance state based on the requests from all the devices managed by the genpd . As pointed out earlier, if the domain doesn't have this callback set, the helper dev_pm_genpd_set_performance_state() will return -ENODEV .

The mechanism by which the performance state of a power domain is updated is left for the implementation and is platform dependent. For some platforms, the set_performance_state() callback may directly configure some regulator(s) and/or clock(s) that are managed by Linux, while in other cases the set_performance_state() callback may end up informing the firmware running on an external processor (not managed by Linux) about the target performance state, which eventually may program the power resources locally.

Also note that, in the current implementation, performance-state updates aren't propagated to the master domains from the subdomains; only devices (i.e. no subdomains) directly controlled by the power domain are considered when finding its effective performance state. The reason is that none of the current hardware designs have a configuration that would need this feature, and more thought needs to be put into it for various reasons. For example, there may not be a one-to-one mapping between the performance states of subdomains and those of their master domains. There can also be multiple master domains for a subdomain, and the master domains may need to be configured to different performance states for a single performance state of the subdomain.

Interaction with the OPP layer

We have discussed how a device requests a performance-state change and how that happens internally in the genpd core, but we haven't discussed how the device drivers know which performance state to request based on their own performance requirements. Ideally, this information should come from the device tree (DT) but, after several rounds of discussions on the linux-kernel mailing list, it was decided to merge a non-DT solution first and then attempt to add DT bindings for the power-domain performance states later. The DT bindings are being reviewed currently on the mailing list.

The devices with power-domain performance-state requirements fall broadly into two categories:

Devices with fixed performance requirements that will always request the same performance state for their power domain. Drivers of such devices can hard-code the performance-state requirement in the driver or its platform data until DT bindings are in place. Devices with fixed performance-state requirements can call dev_pm_genpd_set_performance_state() just once, when they are enabled by their drivers; they don't need to worry about power-domain performance states after that, as genpd will always consider them while reevaluating the power domain's performance state.

Devices with varying performance requirements, based on their own operating performance state. An example of such a device would be a Multi-Media Card (MMC) controller or a CPU. The rest of this section discusses such devices.

The discrete tuples, consisting of frequency and voltage pairs, that the device supports are called "operating performance points" (OPPs). These were explained in detail in this article.

Devices can have different performance-state requirements than their power domain, based on which OPP the devices are currently configured for. For example, a device may need performance state three for running at 800MHz and performance state seven to run at 1.2GHz. These devices would need to call dev_pm_genpd_set_performance_state() whenever they change their OPP if the performance state of the previous OPP is different than the new OPP.

The OPP core has been enhanced to store a performance state corresponding to each OPP of the device and can convert an OPP to the corresponding performance state of the device's power domain. The OPP core helper dev_pm_opp_set_rate() has also been updated to handle performance-state updates automatically, along with clock and regulator updates.

In the absence of DT bindings to get the performance state corresponding to each OPP of the device, the OPP core has gained a pair of new helpers to link a device's OPPs to its power domain's performance states. Note that these helpers have been added temporarily to the OPP core to support initial platforms that need to configure the performance states of power domains. These helpers will be removed once the proposed DT bindings (and corresponding kernel code) are merged.

struct opp_table *dev_pm_opp_register_get_pstate_helper(struct device *dev, int (*get_pstate)(struct device *dev, unsigned long rate));

Here, dev is the pointer to the device structure and get_pstate() is the platform-specific callback that returns the performance state corresponding to the device's rate on success or an error number on failure. dev_pm_opp_register_get_pstate_helper() returns a pointer to the OPP table on success and an error number (cast as a pointer) on failure. It must be called before any OPPs are added for the device, as the OPP core invokes this callback when OPPs are added to get the performance state corresponding to those OPPs (and hence target frequencies). dev_pm_opp_register_get_pstate_helper() takes a reference to the OPP table, which must be released (so that the table can be freed once it is no longer needed) with the help of the following function:

void dev_pm_opp_unregister_get_pstate_helper(struct opp_table *opp_table);

Here, opp_table is the pointer to the OPP table, earlier returned by dev_pm_opp_register_get_pstate_helper() .

The basic infrastructure is in place now to implement platform-specific power-domain drivers that allow configuring performance states. If you want to implement performance states for your power domains, then all you need to do is:

Implement a power domain driver (which you would do anyway, with or without performance states).

Implement the set_performance_state() callback for the power domain.

Call dev_pm_opp_register_get_pstate_helper() from platform-specific code and register your helper routine that can convert device OPPs to performance states. Note that this step is only required for devices that have OPPs of their own.

Hard-code the performance-state requirements in platform data or drivers for devices that do not have changing performance-state requirements.

The DT bindings proposal is already under review, and code updates will be sent once the DT bindings are merged. In the future, we may also want to drive the devices controlled by a power domain at the highest OPP permitted by the current performance state of the domain. For example, a device may have requested performance state five because it currently needs to run at 900MHz but, because of the votes from other devices controlled by the same power domain, the effective performance state selected is eight. At this point it may be better, power- and performance-wise, to run the device at 1.3GHz (the highest device OPP supported at performance state eight): since the power domain is already configured for state eight, the higher frequency may not cost much additional power. More thought is needed in this area, though.

Comments (4 posted)

Prometheus is a monitoring tool built from scratch by SoundCloud in 2012. It works by pulling metrics from monitored services and storing them in a time series database (TSDB). It has a powerful query language to inspect that database, create alerts, and plot basic graphs. Those graphs can then be used to detect anomalies or trends for (possibly automated) resource provisioning. Prometheus also has extensive service discovery features and supports high availability configurations. That's what the brochure says, anyway; let's see how it works in the hands of an old grumpy system administrator. I'll be drawing comparisons with Munin and Nagios frequently because those are the tools I have used for over a decade in monitoring Unix clusters.

Monitoring with Prometheus and Grafana

What distinguishes Prometheus from other solutions is the relative simplicity of its design: for one, metrics are exposed over HTTP using a special URL ( /metrics ) and a simple text format. Here is, as an example, some network metrics for a test machine:

    $ curl -s http://curie:9100/metrics | grep node_network_.*_bytes
    # HELP node_network_receive_bytes Network device statistic receive_bytes.
    # TYPE node_network_receive_bytes gauge
    node_network_receive_bytes{device="eth0"} 2.720630123e+09
    # HELP node_network_transmit_bytes Network device statistic transmit_bytes.
    # TYPE node_network_transmit_bytes gauge
    node_network_transmit_bytes{device="eth0"} 4.03286677e+08

In the above example, the metrics are named node_network_receive_bytes and node_network_transmit_bytes . They have a single label/value pair ( device="eth0" ) attached to them, along with the value of the metrics themselves. These are just two of the few hundred metrics (usage of CPU, memory, disk, temperature, and so on) exposed by the "node exporter", a basic stats collector running on monitored hosts. Metrics can be counters (e.g. per-interface packet counts), gauges (e.g. temperature or fan sensors), or histograms. The latter allow, for example, 95th-percentile analysis, something that has been missing from Munin forever and is essential for billing networking customers. Another popular use for histograms is maintaining an Apdex score, to make sure that N requests are answered in X time. The various metric types are carefully analyzed before being stored to correctly handle conditions like overflows (which occur surprisingly often on gigabit network interfaces) or resets (when a device restarts).

Those metrics are fetched from "targets", which are simply HTTP endpoints, added to the Prometheus configuration file. Targets can also be automatically added through various discovery mechanisms, like DNS, that allow having a single A or SRV record that lists all the hosts to monitor; or Kubernetes or cloud-provider APIs that list all containers or virtual machines to monitor. Discovery works in real time, so it will correctly pick up changes in DNS, for example. It can also add metadata (e.g. IP address found or server state), which is useful for dynamic environments such as Kubernetes or container orchestration in general.

Once collected, metrics can be queried through the web interface, using a custom language called PromQL. For example, a query showing the average bandwidth over the last minute for interface eth0 would look like:

rate(node_network_receive_bytes{device="eth0"}[1m])

Notice the "device" label, which we use to restrict the search to a single interface. This query can also be plotted into a simple graph on the web interface:

What is interesting here is not really the node exporter metrics themselves, as those are fairly standard in any monitoring solution. But in Prometheus, any (web) application can easily expose its own internal metrics to the monitoring server through regular HTTP, whereas other systems would require special plugins, on both the monitoring server and the application side. Note that Munin follows a similar pattern, but uses its own text protocol on top of TCP, which means it is harder to implement for web apps and diagnose with a web browser.

However, coming from the world of Munin, where all sorts of graphics just magically appear out of the box, this first experience can be a bit of a disappointment: everything is built by hand and ephemeral. While there are ways to add custom graphs to the Prometheus web interface using Go-based console templates, most Prometheus deployments generally use Grafana to render the results using custom-built dashboards. This gives much better results, and allows graphing multiple machines separately, using the Node Exporter Server Metrics dashboard:

All this work took roughly an hour of configuration, which is pretty good for a first try. Things get tougher when extending those basic metrics: because of the system's modularity, it is difficult to add new metrics to existing dashboards. For example, web or mail servers are not monitored by the node exporter. So monitoring a web server involves installing an Apache-specific exporter that needs to be added to the Prometheus configuration. But it won't show up automatically in the above dashboard, because that's a "node exporter" dashboard, not an Apache dashboard. So you need a separate dashboard for that. This is all work that's done automatically in Munin without any hand-holding.

Even then, Apache is a relatively easy one; monitoring some arbitrary server for which no exporter exists will require installing a program like mtail, which parses the server's logfiles to expose some metrics to Prometheus. There doesn't seem to be a way to write quick "run this command to count files" plugins that would allow administrators to write quick hacks. The options available are writing a new exporter using client libraries, which seems to be a rather large undertaking for non-programmers. You can also use the node exporter textfile option, which reads arbitrary metrics from plain text files in a directory. It's not as direct as running a shell command, but may be good enough for some use cases. Besides, there are a large number of exporters already available, including ones that can tap into existing Nagios and Munin servers to allow for a smooth transition.

Unfortunately, those exporters will only give you metrics, not graphs. To graph metrics from a third-party Postfix exporter, a graph must be created by hand in Grafana, with a magic PromQL formula. This may involve too much clicking around in a web browser for grumpy old administrators. There are tools like Grafanalib to programmatically create dashboards, but those also involve a lot of boilerplate. When building a custom application, however, creating graphs may actually be a fun and distracting task that some may enjoy. The Grafana/Prometheus design is certainly enticing and enables powerful abstractions that are not readily available with other monitoring systems.

Alerting and high availability

So far, we've worked only with a single server, and did only graphing. But Prometheus also supports sending alarms when things go bad. After working over a decade as a system administrator, I have mixed feelings about "paging" or "alerting" as it's called in Prometheus. Regardless of how well the system is tweaked, I have come to believe it is basically impossible to design a system that will respect workers and not torture on-call personnel through sleep-deprivation. It seems it's a feature people want regardless, especially in the enterprise, so let's look at how it works here.

In Prometheus, you design alerting rules using PromQL. For example, to warn operators when a network interface is close to saturation, we could set the following rule:

    alert: HighBandwidthUsage
    expr: rate(node_network_transmit_bytes{device="eth0"}[1m]) > 0.95*1e+09
    for: 5m
    labels:
      severity: critical
    annotations:
      description: 'Unusually high bandwidth on interface {{ $labels.device }}'
      summary: 'High bandwidth on {{ $labels.instance }}'

Those rules are regularly checked; matching rules fire alerts to an alertmanager daemon that can receive alerts from multiple Prometheus servers. The alertmanager then deduplicates multiple alerts, groups them (so that a single notification is sent even if multiple alerts are received), and sends the actual notifications through various services like email, PagerDuty, Slack, or an arbitrary webhook.

The Alertmanager has a "gossip protocol" to enable multiple instances to coordinate notifications. This design allows you to run multiple Prometheus servers in a federation model, all simultaneously collecting metrics, and sending alerts to redundant Alertmanager instances to create a highly available monitoring system. Those who have struggled with such setups in Nagios will surely appreciate the simplicity of this design.

The downside is that Prometheus doesn't ship a set of default alerts and exporters do not define default alerting thresholds that could be used to create rules automatically. The Prometheus documentation also lacks examples that the community could use, so alerting is harder to deploy than in classic monitoring systems.

Issues and limitations

Prometheus is already well-established: Cloudflare, Canonical and (of course) SoundCloud are all (still) using it in production. It is a common monitoring tool used in Kubernetes deployments because of its discovery features. Prometheus is, however, not a silver bullet and may not be the best tool for all workloads.

In particular, Prometheus is not designed for long-term storage. By default, it keeps samples for only two weeks, which seems rather small to old system administrators who are used to RRDtool databases that efficiently store samples for years. As a comparison, my test Prometheus instance is taking up as much space for five days of samples as Munin, which has samples for the last year. Of course, Munin only collects metrics every five minutes while Prometheus samples all targets every 15 seconds by default. Even so, this difference in sizes shows that Prometheus's disk requirements are much larger than traditional RRDtool implementations because it lacks native down-sampling facilities. Therefore, retaining samples for more than a year (which is a Munin limitation I was hoping to overcome) will be difficult without some serious hacking to selectively purge samples or adding extra disk space.

The project documentation recognizes this and suggests using alternatives:

Prometheus's local storage is limited in its scalability and durability. Instead of trying to solve long-term storage in Prometheus itself, Prometheus has a set of interfaces that allow integrating with remote long-term storage systems.

Prometheus in itself delivers good performance: a single instance can support over 100,000 samples per second. When a single server is not enough, servers can federate to cover different parts of the infrastructure. And when that is not enough, sharding is possible. In general, performance is dependent on avoiding variable data in labels, which keeps the cardinality of the dataset under control, but the dataset size will grow with time regardless. So long-term storage is not Prometheus' strongest suit. But starting with 2.0, Prometheus can finally write to (and read from) external storage engines that can be more efficient than Prometheus. InfluxDB, for example, can be used as a backend and supports time-based down-sampling that makes long-term storage manageable. This deployment, however, is not for the faint of heart.

Also, security freaks can't help but notice that all this is happening over a clear-text HTTP protocol. Indeed, that is by design, "Prometheus and its components do not provide any server-side authentication, authorisation, or encryption. If you require this, it is recommended to use a reverse proxy." The issue is punted to a layer above, which is fine for the web interface: it is, after all, just a few Prometheus instances that need to be protected. But for monitoring endpoints, this is potentially hundreds of services that are available publicly without any protection. It would be nice to have at least IP-level blocking in the node exporter, although this could also be accomplished through a simple firewall rule.

There is a large empty space for Prometheus dashboards and alert templates. Whereas tools like Munin or Nagios had years to come up with lots of plugins and alerts, and to converge on best practices like "70% disk usage is a warning but 90% is critical", those things all need to be configured manually in Prometheus. Prometheus should aim at shipping standard sets of dashboards and alerts for built-in metrics, but the project currently lacks the time to implement those.

The Grafana list of Prometheus dashboards shows one aspect of the problem: there are many different dashboards, sometimes multiple ones for the same task, and it's unclear which one is the best. There is therefore space for a curated list of dashboards and a definite need for expanding those to feature more extensive coverage.

As a replacement for traditional monitoring tools, Prometheus may not be quite there yet, but it will get there and I would certainly advise administrators to keep an eye on the project. Besides, Munin and Nagios feature-parity is just a requirement from an old grumpy system administrator. For hip young application developers smoking weird stuff in containers, Prometheus is the bomb. Just take for example how GitLab started integrating Prometheus, not only to monitor GitLab.com itself, but also to monitor the continuous-integration and deployment workflow. By integrating monitoring into development workflows, developers are immediately made aware of the performance impacts of proposed changes. Performance regressions can therefore be identified quickly, which is a powerful tool for any application.

Whereas system administrators may want to wait a bit before converting existing monitoring systems to Prometheus, application developers should certainly consider deploying Prometheus to instrument their applications; it will serve them well.

Comments (15 posted)

GnuBee is the brand name for a line of open hardware boards designed to provide Linux-based network-attached storage. Given the success of the crowdfunding campaigns for the first two products, the GB-PC1 and GB-PC2 (which support 2.5-inch and 3.5-inch drives, respectively), there appears to be a market for these devices. Since Linux is quite good at attaching storage to a network, it seems likely they will perform their core function more than adequately. My initial focus when exploring my GB-PC1 is not performance but openness: just how open is it, really? The best analogy I can come up with is that of a door with rusty hinges: it can be opened, but doing so requires determination.

A mainline kernel on the GnuBee?

Different people look for different things when assessing how open or free some device is, so I should be clear about what my own metrics are. I am interested primarily in the pragmatics of openness: whether I can examine, understand, and modify the behavior of my device with no informational, technological, or legal impediments — cognitive and temporal impediments I'll take responsibility for myself. A good first measurement is: can I run the latest upstream kernel on the device? I can, but there is plenty of room for improvement.

The heart of the GB-PC is the MT7621 SoC (System On a Chip) from Mediatek. It provides a dual-core, 32-bit MIPS processor together with controllers for memory, flash ROM, serial ports, USB, SD cards, audio, Ethernet, and most of the connections you might expect. It doesn't control SATA drives directly, but it provides a PCI Express interface to which the GnuBee board connects an ASM1061 SATA controller. This SoC is mostly used in WiFi routers and similar network hardware and it is supported by Linux distributions focused on those devices, such as OpenWrt, LEDE, and libreCMC, but this support is only partially upstream.

There are some specifications and documentation available for the MT7621 on the web, but most of the PDFs have a watermark saying "Confidential", so their legality seems unclear and, while useful, they are incomplete. The main source of driver code for this SoC appears to be a software development kit (SDK) from Mediatek. The OpenWrt-based distribution that GnuBee provides as one source for a bootable kernel builds Linux from a GitHub repository provided by mqmaker Inc. This is not a clone of the upstream Git tree with patches added, but is rather a code-dump for Linux 3.10.14 with lots of changes. It seems a reasonable guess that this code was part of the Mediatek SDK. This code appears to be completely functional; all the MT7621 hardware works as expected when using this kernel. It is a little old though.

GnuBee also provides a 4.4.87-based kernel as part of a libreCMC distribution. This contains the MT7621 support broken out as individual patches — 83 in total, though several of those are not specific to the hardware. This is a much easier starting point when aiming to use the latest kernel, and John Crispin, author of many of the patches, deserves thanks. He has not been idle, and several of these patches have already landed upstream; they are not enough to boot a working kernel, but they are a useful start. One remaining weakness in this set of patches is that the driver for the MMC interface, which is used to access the microSD card, isn't reliable: it can read data from the card, but reads sometimes fail. Given that the 3.10.14 code is reliable, this should be fixable given time and patience.

It should be possible to start with the latest mainline kernel, apply those patches from libreCMC that seem relevant and haven't already been applied upstream, fix merge conflicts and compiler errors, and get a working kernel. Unfortunately this didn't quite work. The resulting kernel did nothing — nothing on the console at all, so no indication of what might be wrong, just that something was wrong.

The standard approach to analyzing this sort of problem is to use git-bisect. I had a 4.4 kernel that worked and a 4.15-rc kernel that didn't; I just needed to try the kernel in the middle, then continue searching earlier or later depending on how things turned out. While git-bisect is an invaluable tool, it can be a bit of a painful process even when working with the upstream kernel. When you have a pile of patches to apply to each kernel before testing, and when that set is different for each kernel (as some have already been included upstream at different points), it requires real determination.

I was lucky and tested 4.5 early and found that it didn't work, thus narrowing my search space more quickly than I had any reason to expect. I eventually discovered commit 3af5a67c86a3 ("MIPS: Fix early CM probing"). This commit makes a tiny change that probably makes sense to people who understand what the code is supposed to do, but which breaks booting of the GnuBee board. Reverting this patch gave me a kernel that booted enough to print useful error messages about the next problem it hit, which was then easy to fix (something to do with maximum transfer sizes for the SPI bus driver).

With a 4.15-rc kernel that boots and a minimal initramfs that can find the device holding the root filesystem, I only have two kernel issues left to fix. One is the MMC controller, which is still unreliable; the other is the Ethernet interface.

The Ethernet interface in the MT7621 comes with an integrated six-port switch. One port connects to the processor and the other five can be wired externally (the GB-PC1 only provides connectors for two, while the GB-PC2 has three). The switch understands VLANs, so different ports on the switch can be on different logical networks, and the MT7621 port can send packets to any VLAN.

When the SDK was created, Linux didn't have an API for integrated switches, so Mediatek used an out-of-tree implementation called "swconfig". Linux now has "switchdev", which was introduced in late 2014. When Crispin posted his patches to support the Mediatek Ethernet controller in early 2016, with only minimal support for the switch and no switchdev integration, he hit a roadblock. Dave Miller made it quite clear that the patches could not be accepted without proper switchdev integration, and reminded readers that help was available from various people who were quite familiar with the inner workings of switchdev. There has been no further progress on these patches since.

I could simply include Crispin's latest patches and I suspect I could get the network working with minimal pain, but I don't think that is the right way forward. If I do ever dig into those patches, it will be as part of learning switchdev and creating a proper upstreamable solution. For now, I dug through my drawer and found a USB-attached Ethernet adapter that only needed a little bit of convincing to work. This sits nicely beside my USB-attached card reader that holds the microSD card and my root filesystem. This isn't the most elegant solution, but hopefully it will be temporary.

How about mainline U-Boot?

So I have the door open now, but it still squeaks a bit. I can run a mainline kernel (important if I want to benefit from the latest filesystem developments, or avoid the better-known exploits) and I have code and some documentation, which should be enough to develop and test new kernels. The only remaining barrier is that testing new kernels isn't quite as easy as I would like. My bar for "easy" is rather high here. When I'm doing a git-bisect to find the cause of a regression (and experience assures me I'll need to do that again one day), I need every step to be as smooth and automatic as possible: if I have to do anything manually I will get it wrong occasionally and end up late for dinner.

The easiest method for installing a new kernel is to copy the kernel file onto a USB storage device as "gnubee.bin", plug that in, and turn on the board. The U-Boot firmware will notice this, write the kernel to flash memory, then ask you to unplug and power-cycle. When you do that, the new kernel will boot. This is conceptually easy enough, but I don't really want to write the kernel to flash (which takes several seconds); I just want to boot it. It is possible to do this by interrupting the boot (type "4" on the console) and issuing a couple of commands at the U-Boot CLI (which I can copy/paste with the mouse instead of typing), but this is still manual interaction that I would like to avoid. U-Boot can load a new kernel over the network (it even supports a simple HTTP server for uploading firmware) but all this still requires manual interaction.
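The manual sequence at the U-Boot prompt looks something like the following; the load address here is an assumption for this board (the real value comes from the vendor U-Boot's environment), and a TFTP server exporting the kernel image is assumed to be reachable:

```
# at the U-Boot prompt, after interrupting the boot:
setenv serverip 192.0.2.1          # hypothetical TFTP server address
tftpboot 0x80800000 gnubee.bin     # fetch the uImage over the network
bootm 0x80800000                   # boot the image just loaded
```

It is exactly this kind of typed-at-the-console interaction that I want to eliminate from the edit-build-boot cycle.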

The obvious answer would be to replace U-Boot; it is open source, after all. Updating U-Boot carries a different risk, though: the fear of turning my NAS into a brick.

U-Boot, for those not familiar with the term, is a suite of code designed to fill a role similar to that of the BIOS on a traditional PC. It is stored in flash memory that the processor can read directly (or at least can copy directly to RAM); it performs all early configuration (such as enabling and mapping the DRAM, setting up various clock signals, and so on), then finds a storage device to load your kernel from. Modern U-Boot can be quite sophisticated; it is able to read from USB, IDE, SATA, or the network, and can understand a variety of filesystems (FAT, ZFS, ext4, etc.) and network protocols (such as TFTP or NFS).

The U-Boot that is provided with the GnuBee looks like it was probably part of the Mediatek SDK. It is fairly old, contains a lot of hackish code, and looks like a separate development path; the HTTP server it contains is not present in the mainline code, for example. The ideal way forward would be to extract all the hardware drivers from the old U-Boot installation and add them upstream. There will undoubtedly be bugs at first, which, just as with the kernel, will be hard to analyze. A bad kernel can easily be replaced because U-Boot is still working. If you break U-Boot, instead, your hardware becomes what we like to call a "brick". There is no easy mechanism to replace that broken U-Boot code.

By "easy" here I mean easy for the home hacker. A professional with a fully kitted-out lab will have a device that can use a JTAG port to take control of the processor and load anything into memory directly. Even a fairly sophisticated home hacker might be able to attach a secondary flash ROM to the board as shown in these two pictures and boot from that. But as I prefer working with software, I would like a software solution: the main U-Boot would normally jump straight into a secondary U-Boot but, if that is missing or the reset button is held (for example), it would fall back to the default behavior, which doesn't need to be sophisticated; it just needs to work.

The U-Boot in the GnuBee does have something that looks just enough like this functionality that it might be usable. The "uImage" file (a container format used by U-Boot) that it expects to load the kernel from can, instead, hold a standalone program. This will be copied from flash into RAM and run. It even has access to some of the functionality of the original U-Boot so that I don't have to get new code working for everything at once.

There are a few challenges with this approach. One of the more frustrating so far is that it doesn't work for small test programs, only for larger ones. Like many processors, MIPS has separate instruction and data caches. If you write some bytes to memory, they will be stored in the d-cache; if you then try to execute those bytes as code, they will be read from main memory through the i-cache. If the d-cache is not flushed out to main memory between writing the code and running it, stale code will be executed. The old U-Boot on the GnuBee doesn't do that flush, so I spent quite a few hours wondering what could possibly be wrong. Fortunately I don't have enough hair for it to be worth pulling out. This was eventually fixed by the simple expedient of adding 32K of zeros to the end of the test program, that being the size of the d-cache.
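The padding workaround itself is trivial to script. This sketch pads a (stand-in) test program with 32KB of zeros, on the assumption stated above that the d-cache is 32KB, so that by the time the loader has copied the padding, the real code bytes have been evicted to main memory; the file name and contents are made up:

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"
# stand-in for the real standalone test program
printf 'pretend-this-is-MIPS-code' > test.bin
orig=$(stat -c %s test.bin)
# append 32 KiB of zeros (assumed d-cache size) to the end of the image
dd if=/dev/zero bs=1024 count=32 >> test.bin 2>/dev/null
new=$(stat -c %s test.bin)
echo "padded $orig bytes to $new bytes"
```

A proper fix would be to flush the d-cache (and invalidate the i-cache) after copying the program into RAM, but that requires code I control, which the old U-Boot is not.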

Another challenge is that the USB code in the main U-Boot is a bit unreliable and I cannot seem to get it to work from a separate standalone program, so booting straight from USB is not an option until I can include my own USB driver. Similarly, booting directly from an SD card won't work as U-Boot doesn't have an MMC driver. That leaves the network. I suspect it will be possible to write a standalone program, loaded by U-Boot, which uses the TFTP functionality from U-Boot to load a kernel and then boot it. This will allow me to build a kernel on my workstation, then boot it by simply turning the GnuBee on. I'm not yet desperate enough to have arranged for a remotely controllable power switch, but I have thought about it.

Is this open enough?

While I would like proper SoC documentation that doesn't have "Confidential" watermarks and that is reasonably complete, code that works reliably and isn't deliberately obfuscated is at least a good start. Working code and sketchy documentation is probably enough to keep me happy. Having an easy path to experiment with U-Boot code without the risk of creating a brick would be ideal, but I suspect I can get close enough with what I have. And who knows, maybe one day I'll be brave enough to try flashing a whole new U-Boot.

Certainly I can run a current mainline kernel (with just a few patches) and I can run the Linux distribution of my choice, so that is open enough for now.

Comments (29 posted)