This edition contains the following feature content:

This week's edition also includes these inner pages:

Brief items: Brief news items from throughout the community.

Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Back in mid-1997, your editor (Jonathan Corbet) and Liz Coolbaugh were engaged in a long-running discussion on how to trade our nice, stable, reliably paying jobs for a life of uncertainty, poverty, and around-the-clock work. Not that we thought of it in those terms, naturally. We eventually settled on joining Red Hat's nascent "support partner" program; while we were waiting for it to get started, we decided to start a weekly newsletter as a side project — not big and professional like the real press — to establish ourselves in the community. Thus began an amazing journey that has just completed its 20th year.

After some time thinking about what we wanted to do and arguing about formats, we published our first edition on January 22, 1998. It covered a number of topics, including the devfs controversy, the pesky 2GB file-size limit on the ext2 filesystem, the use of Linux on Alpha to render scenes in the film "Titanic", the fact that Red Hat had finally hired a full-time quality-assurance person and launched the Red Hat Advanced Development Labs, and more. We got almost no feedback on this issue, though, perhaps because we didn't tell anybody that we had created it.

We were determined not to repeat that mistake after publishing the January 29 edition so, showing the high level of marketing skill that has characterized LWN all along, we sent a brief note to the linux-announce mailing list. We didn't say much about what we were doing, and we had a URL that nobody could spell, but the traffic came and LWN was well and truly launched. We reported on the devfs controversy and predicted, correctly, that the big kernel lock would probably not be completely removed from the kernel before the 2.2 release. See, your editor's predictions aren't always wrong.

We were arguably helped by the lead news in that edition, though: Netscape's decision to open-source its "Communicator" web browser. That quickly brought the world's attention to open-source software, though that term would not be invented for a few months yet, and to Linux in particular. LWN was a shadow of what it is now, but it was evidently good enough to ride on that wave and establish itself as a part of the Linux community.

Over the following years we have borne witness to a long series of events that none of us could really have predicted. Linux got caught up in the dotcom boom and, with the VA Linux Systems IPO, came to epitomize its excesses, but when that boom went boom, Linux was still there, stronger than ever. The SCO Group tried to steal our community's work and turn it into its own rent-generating machine; in the process of fending them off it was made clear that the Linux kernel had one of the cleanest code bases around. Companies discovered our little hobbyist system and invested billions into it, massively accelerating development at all levels of the system. We learned how to scale development communities from dozens of developers up to many thousands of developers. The security environment, which was initially defending against script kiddies playing their own form of Capture the Flag, became a fight against spammers, organized criminals, and nation states with vast resources. Google bought an obscure phone operating system called Android and used it to dominate the phone market; as a result, we got mobile devices that are far more open than they would otherwise have been. Linux became the base software supporting the bulk of the Internet economy; some of our biggest contributors do not distribute Linux at all, but they use it internally and want to help make it work better.

And so on. It has been quite a ride. We in the free-software community set out to change the world, and we succeeded beyond our wildest expectations.

Through all of this, we also got to learn some lessons about successfully running a community information source on the net. We were acquired during the dotcom days and unacquired after those days came to their abrupt end. The advertising business failed utterly to work for us, leading to that sad day in 2002 when we announced that we could no longer continue and that LWN would shut down. A flood of donations from our readers convinced us to give the subscription model a try; after a couple of months of frantic site-code hacking, we adopted the subscription model that, with only minor tweaks, sustains us to this day.

Our model remains nearly unique, but it suits the site well. Relying on subscriptions aligns our interests firmly with those of our readers. Keeping content behind the paywall for a relatively short period seems to be enough to motivate subscriptions while allowing our content to quickly become part of the community record (though it seems that relatively few people realize that this content becomes CC-BY-SA licensed after the subscription period ends). The "subscriber link" mechanism, suggested by our readers, has become one of our most powerful marketing tools. All told, it is not a model that has made any of us rich, but working for LWN is not an exercise in poverty anymore either. It has kept us going to reach a point none of us ever thought we would see — the 20th anniversary of our first Weekly Edition.

Along the way, various people have come and gone within LWN itself; only your editor is crazy enough to have been here the whole time, though Rebecca Sobol, who is still part of the crew, has been around since nearly the beginning. The current staff is rounded out by Jake Edge; we are still looking to hire more, so please contact us if you would like to become part of the LWN writer/editor team. Meanwhile, we thank Elizabeth Coolbaugh, Forrest Cook, Dennis Tenney, Dave Whitinger, Michael Hammel, Michael Kerrisk, and Nathan Willis for being a part of LWN over the years. You are all missed.

We have had the privilege of traveling to events all over the world; in the process we have met — and become friends with — many of our readers and many people in the community as a whole. This community is an amazing group of people; it has been an honor and a joy to be a part of it. In the process, we hope that we have helped to knit this community together a bit more tightly, to help it be a real community. We also hope to have the privilege of continuing to do so for some time yet. The free-software community's work is not done, and neither is ours. Thanks to all of you for being with us these last 20 years! We're looking forward to what is yet to come.

Comments (101 posted)

2017 was a big year for the Prometheus project, as it published its 2.0 release in November. The new release ships numerous bug fixes, new features and, notably, a new storage engine that brings major performance improvements. This comes at the cost of incompatible changes to the storage and configuration-file formats. An overview of Prometheus and its new release was presented to the Kubernetes community in a talk held during KubeCon + CloudNativeCon. This article covers what changed in this new release and what is brewing next in the Prometheus community; it is a companion to this article, which provided a general introduction to monitoring with Prometheus.

What changed

Orchestration systems like Kubernetes regularly replace entire fleets of containers for deployments, which means rapid changes in parameters (or "labels" in Prometheus-talk) like hostnames or IP addresses. This created significant performance problems in Prometheus 1.0, which wasn't designed for such changes. To correct this, Prometheus 2.0 ships a new storage engine that was specifically designed to handle continuously changing labels. This was tested by monitoring a Kubernetes cluster where 50% of the pods would be swapped every 10 minutes; the new design proved to be much more effective. The new engine boasts a hundred-fold I/O performance improvement, a three-fold improvement in CPU usage, a five-fold improvement in memory usage, and increased space efficiency. This matters most for container deployments, but it means improvements for any configuration. Anecdotally, there was no noticeable extra load on the servers where I deployed Prometheus, at least nothing that the previous monitoring tool (Munin) could detect.

Prometheus 2.0 also brings new features like snapshot backups. The project has a longstanding design wart regarding data volatility: backups are deemed unnecessary in Prometheus because metrics data is considered disposable. According to Goutham Veeramachaneni, one of the presenters at KubeCon, "this approach apparently doesn't work for the enterprise". Backups were possible in 1.x, but they involved using filesystem snapshots and stopping the server to get a consistent view of the on-disk storage. This implied downtime, which was unacceptable for certain production deployments. Thanks again to the new storage engine, Prometheus can now perform fast and consistent backups, triggered through the web API.
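
For example, once the (optional) administrative API has been enabled with the --web.enable-admin-api flag, a snapshot can be requested with a single HTTP call; a minimal sketch, assuming a server listening on localhost:9090:

$ curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot

The server replies with the name of a snapshot directory created under its data directory, which can then be copied away with ordinary backup tools.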

Another improvement is a fix for the longstanding staleness-handling bug, where it would take up to five minutes for Prometheus to notice that a target had disappeared. In that case, when polling for new values (or "scraping", as it's called in Prometheus jargon), a failure would make Prometheus reuse the older, stale value, which meant that downtime would go undetected for too long and alerts would fail to trigger properly. This bug also caused double-counting of some metrics when labels varied within the same measurement.

Another limitation related to staleness is that Prometheus wouldn't work well with scrape intervals above two minutes (instead of the default 15 seconds). Unfortunately, that is still not fixed in Prometheus 2.0, as the problem is more complicated than originally thought, which means there is still a hard limit on how slowly you can fetch metrics from targets. This, in turn, means that Prometheus is not well suited for devices that cannot support sub-minute refresh rates, which, to be fair, is rather uncommon. For slower devices or statistics, a solution might be the node exporter "textfile support", which we mentioned in the previous article, or the pushgateway daemon, which allows pushing results from the targets instead of having the collector pull samples from them.
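
To give an idea of what pushing looks like, a slow job can hand a sample to the pushgateway with plain curl; a minimal sketch, assuming a pushgateway on localhost:9091 and a hypothetical metric name:

$ echo "room_temperature_celsius 21.5" | \
      curl --data-binary @- http://localhost:9091/metrics/job/sensors

Prometheus then scrapes the pushgateway at its own pace, decoupling slow producers from the scrape interval.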

The migration path

One downside of this new release is that the upgrade path from the previous version is bumpy: since the storage format changed, Prometheus 2.0 cannot use the previous 1.x data files directly. In his presentation, Veeramachaneni justified this change by saying it was consistent with the project's API-stability promises: the major release was the time to "break everything we wanted to break". For those who can't afford to discard historical data, a possible workaround is to replicate the older 1.8 server to a new 2.0 replica, as the network protocols are still compatible. The older server can then be decommissioned when the retention window (which defaults to fifteen days) closes. While there is some work in progress to provide a way to convert 1.8 data storage to 2.0, new deployments should probably use the 2.0 release directly to avoid this peculiar migration pain.

Another key point in the migration guide is a change in the rules-file format. While 1.x used a custom file format, 2.0 uses YAML, matching the other Prometheus configuration files. Thankfully, the promtool command handles this migration automatically. The new format also introduces rule groups, which improve control over rule-execution order. In 1.x, alerting rules were run sequentially but, in 2.0, the groups are executed sequentially and each group can have its own interval. This fixes a longstanding race condition between dependent rules, which could produce inconsistent results when rules reused the same queries. The problem should be fixed between groups, but rule authors still need to be careful about that limitation within a rule group.
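
For illustration, a converted rules file looks something like the following sketch (the group name, rule name, and interval are hypothetical):

groups:
  - name: example
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

The rules inside the example group run in order, every 30 seconds, independently of any other group.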

Remaining limitations and future

As we saw in the introductory article, Prometheus may not be suitable for all workflows because of its limited default dashboards and alerts, but also because of the lack of data-retention policies. There are, however, discussions about variable per-series retention in Prometheus and native down-sampling support in the storage engine, although this is a feature some developers are not really comfortable with. When asked on IRC, Brian Brazil, one of the lead Prometheus developers, stated that "downsampling is a very hard problem, I don't believe it should be handled in Prometheus".

It is already possible, however, to selectively delete old series using the new 2.0 API. But Veeramachaneni warned that this approach "puts extra pressure on Prometheus and unless you know what you are doing, its likely that you'll end up shooting yourself in the foot". Short of native archival facilities, a more common approach is to use recording rules to aggregate samples and collect the results in a second server with a slower sampling rate and a different retention policy. And, of course, the new release supports external storage engines that can better support archival features. Those solutions are obviously not suitable for smaller deployments, which therefore need to make hard choices about discarding older samples or getting more disk space.

As part of the staleness improvements, Brazil also started working on "isolation" (the "I" in the ACID acronym) so that queries wouldn't see "partial scrapes". This hasn't made the cut for the 2.0 release and is still a work in progress, with some performance impact (about 5% CPU and 10% RAM). This work would also be useful when heavy contention occurs in certain scenarios where Prometheus gets stuck on locking; some of the performance impact could therefore be offset under heavy load.

Another performance improvement mentioned during the talk is an eventual query-engine rewrite. The current query engine can sometimes cause excessive load for certain expensive queries, according to the Prometheus security guide. The goal would be to optimize the current engine so that those expensive queries wouldn't harm performance.

Finally, another issue I discovered is that 32-bit support is limited in Prometheus 2.0. The Debian package maintainers found that the test suite fails on i386, which led Debian to remove the package from the i386 architecture. It is currently unclear if this is a bug in Prometheus: indeed, it is strange that the Debian tests actually pass on other 32-bit architectures like armel. Brazil, in the bug report, argued that "Prometheus isn't going to be very useful on a 32bit machine". The position of the project is currently that "'if it runs, it runs' but no guarantees or effort beyond that from our side".

I had the privilege to meet the Prometheus team at the conference in Austin and was happy to see different consultants and organizations working together on the project. It reminded me of my golden days in the Drupal community: different companies cooperating on the same project in a harmonious environment. If Prometheus can keep that spirit together, it will be a welcome change from the drama that affected certain monitoring software. This new Prometheus release could light a bright path for the future of monitoring in the free software world.

[We would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to attend KubeCon + CloudNativeCon.]

Comments (6 posted)

This is the second article of a series discussing various methods of reducing the size of the Linux kernel to make it suitable for small environments. The first article provided a short rationale for this topic and covered link-time garbage collection, also called the ld --gc-sections method. We saw that, though it is pretty straightforward, link-time garbage collection has issues of its own when applied to the kernel, making optimal results more difficult to achieve than it is worth. In this article we'll look at what the compiler itself can do using link-time optimization.

Please note that most examples presented here were produced using the ARM architecture; however, the principles themselves are architecture-independent.

Dead-code elimination

Kernel developers often rely on a compiler feature called "dead-code elimination". This is an important optimization that results in unreachable code simply being dropped from the final binary. Unlike the linker garbage-collection feature, dead-code elimination can be performed by the compiler both within and across functions inside a compilation unit or file.

Let's reuse the example code we used previously as test.c to illustrate it:

int foo(void)
{
	return 1;
}

int bar(void)
{
	return foo() + 2;
}

int main(void)
{
	return foo() + 4;
}

Again, the compiler generates the following (simplified) assembly output for that code:

.text
.type	foo, %function
foo:
	mov	r0, #1
	bx	lr

.type	bar, %function
bar:
	push	{r3, lr}
	bl	foo
	adds	r0, r0, #2
	pop	{r3, pc}

.type	main, %function
main:
	push	{r3, lr}
	bl	foo
	adds	r0, r0, #4
	pop	{r3, pc}

Despite bar() not being called, it is still part of the compiled output because there is no way for the compiler to actually know whether code in some other file might call it. But the author of that code often knows that, and can tell the compiler about it with the static qualifier as follows:

static int foo(void)
{
	return 1;
}

static int bar(void)
{
	return foo() + 2;
}

int main(void)
{
	return foo() + 4;
}

By marking foo() and bar() static, the developer renders them unreachable from other source files. The compiler is then free to perform more optimizations on the compiled code. Of course, the entry point (main()) must remain externally accessible and therefore cannot be static.

The above compiles to this:

.text
.type	main, %function
main:
	mov	r0, #5
	bx	lr

Boom! Not only did the compiler get rid of the unused bar() with dead-code elimination, but it also merged foo() directly into main() due to automatic inlining. In addition, it performed the arithmetic operation up front since all of the operands are constants, so that all we have left in the compiled code is the load of the resulting value and the return instruction. Instant code-size reduction that already works better than link-time garbage collection!

As mentioned, this dead-code elimination is heavily relied upon in the Linux kernel source tree, so that large portions of the code can be optimized away at compile time. For example, let's consider the following from include/linux/mmzone.h:

static inline int is_highmem_idx(enum zone_type idx)
{
#ifdef CONFIG_HIGHMEM
	return (idx == ZONE_HIGHMEM ||
		(idx == ZONE_MOVABLE && zone_movable_is_highmem()));
#else
	return 0;
#endif
}

When CONFIG_HIGHMEM is not defined, is_highmem_idx() (and PageHighMem(), which is derived from it) returns zero unconditionally. Any code within functions that follows the "if (PageHighMem(page))" pattern will be automatically optimized away as dead code.
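
As a minimal sketch (with hypothetical helper names), a caller following that pattern loses the whole branch when CONFIG_HIGHMEM is disabled:

static void prepare_page(struct page *page)
{
	/*
	 * Without CONFIG_HIGHMEM, PageHighMem() is a constant zero, so
	 * the compiler silently drops this entire block, including the
	 * call within it.
	 */
	if (PageHighMem(page))
		fix_up_highmem_mapping(page);	/* hypothetical helper */

	do_common_work(page);			/* hypothetical helper */
}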

But this works only because is_highmem_idx() is marked static; to avoid duplicating that function everywhere mmzone.h is included, it has to be marked inline too. These optimizations only work within a single compilation unit, missing out on dead-code-elimination opportunities across different compilation units. So, what can we do, short of concatenating all source files into a single one and making everything static, to achieve the full benefit?

As mentioned previously, the core-kernel APIs are split into different C files for ease of maintenance. Those files may provide functions that are not called when some unwanted feature is configured out. It could be argued that the unused core functions should be #ifdef'd in or out along with their call sites, but this gets hairy when multiple features sharing the same core API may be configured in and out independently. To complicate things further, those core functions might be called within "if (PageHighMem(page))" blocks showing no directly visible relationship with a configuration option. So there are limits to how much unused code can be removed by the compiler; doing a more thorough job requires a tool like link-time optimization.

Link-time optimization (LTO)

What is it? LTO is a compilation mode that instructs the compiler to parse the code into an abstract internal representation as usual, but to store that representation directly into the resulting object file without any optimization, rather than optimizing and assembling it into final machine instructions. Then, at link time, when all the different object files are gathered together, the compiler intercepts the link process to reload that internal representation from all of the object files at once; only then does it perform its optimization passes — on the whole program — before the actual link. It is basically as if all the source files had been concatenated into a single one with everything made static. Great, that is exactly what we wished for.

Let's see how this works in practice with our little example program by having each function in its own file:

$ gcc -O2 -flto -c foo.c
$ gcc -O2 -flto -c bar.c
$ gcc -O2 -flto -c main.c
$ gcc -O2 -flto -o test foo.o bar.o main.o
$ nm test | grep "foo\|bar\|main"
000102c0 T main
$ objdump -d test
[...]
000102c0 <main>:
   102c0:	e3a00005	mov	r0, #5
   102c4:	e12fff1e	bx	lr

As expected, the result is the same as our earlier test despite having separate source files containing non-static functions.

LTO and the kernel

LWN first covered LTO for the kernel more than five years ago. Since then, things have improved a lot. LTO still isn't supported in the mainline, but Andi Kleen's kernel LTO patchset has become much simpler as basic code correctness issues, which LTO is pickier about, have been merged upstream, and many LTO bugs in GCC have been fixed.

One of the biggest LTO showstoppers for the kernel had to do with the fact that a special version of binutils was required. The kernel used to rely solely on partial linking (ld -r) when recursively gathering subdirectory build results; however, ld -r doesn't support objects with LTO data unless binutils is patched to do so. And it was rather unlikely that the necessary patch would ever be merged into the upstream binutils tree. Nowadays the kernel build system can use thin archives instead of ld -r, making LTO of the kernel possible with the upstream tools that most distributions ship.

LTO also has a big advantage over link-time garbage collection in that it does not require separate linker sections for each exception-table entry and does not suffer from the "backward reference" and "missing forward reference" problems described in the previous article. When the compiler optimizes code away, the exception-table entries instantiated by that code are simply dropped along with it.

Numbers please!

Let's not forget that our end goal is to fit Linux into tiny systems. So it is about time we looked at actual kernel-size numbers. Let's pick the STM32 target, which represents the kind of tiny system we're aiming for. The advantage here is that mainline Linux already runs on most STM32 microcontrollers, albeit with external RAM. The baseline kernel version is v4.15-rc1 plus the LTO patches.

First, with LTO disabled:

$ make stm32_defconfig
$ make vmlinux
$ size vmlinux
   text	   data	    bss	    dec	    hex	filename
1704024	 144732	 117660	1966416	 1e0150	vmlinux

And with LTO enabled:

$ ./scripts/config --enable CONFIG_LTO_MENU
$ make vmlinux
$ size vmlinux
   text	   data	    bss	    dec	    hex	filename
1281644	 142492	 112985	1537121	 177461	vmlinux

This is a 22% size reduction right there. For completeness, let's see how link-time garbage collection as described in the previous article fares:

$ [hacks for CONFIG_LD_DEAD_CODE_DATA_ELIMINATION]
$ make vmlinux
$ size vmlinux
   text	   data	    bss	    dec	    hex	filename
1304516	 141672	 113108	1559296	 17cb00	vmlinux

Here we get a 21% size reduction. However, this comes with a big disclaimer due to the following hacks:

No KEEP() statements were added to the ARM linker file as required. Worse: the ASSERT() statements about missing processor and architecture tables have been disabled for the sake of successful compilation. This means important pieces of code and data are missing from this kernel.

The ARM unwinding facility needed for function backtraces has been forcefully disabled as it also contained a reference to every function, making garbage collection ineffective. So, unlike the LTO-built kernel, this one would lack an important debugging facility.

Of course, those hacks produce a non-functional kernel. Still, the size reduction is slightly lower than what LTO produces, and it would be even lower if proper link-time garbage-collection support were implemented. And optimal link-time garbage collection, as described in the previous article, is far more invasive than LTO. We therefore have a clear winner here.

One could wonder if size reduction could improve further by combining both link-time optimization and link-time garbage collection. The answer is no since, once LTO has removed every piece of dead code, there is simply nothing left to garbage-collect.

More numbers

So LTO seems to be the best thing since sliced bread, right? Well, it has drawbacks of its own. The most significant is build time. Let's repeat the above kernel compilation sequence to see what we get.

First with LTO disabled:

$ make clean
$ make stm32_defconfig
$ time make -j8 vmlinux

real	0m36.645s
user	3m59.252s
sys	0m21.026s

And with LTO enabled:

$ make clean
$ ./scripts/config --enable CONFIG_LTO_MENU
$ time make -j8 vmlinux

real	1m24.774s
user	8m4.143s
sys	0m31.902s

LTO requires 1.9x more CPU time and 2.3x more wall-clock time to build the kernel. Performing code optimization at the very end creates a bigger serialization point than traditional builds, where individual source files are compiled and optimized concurrently.

But the most annoying case, at least for a kernel developer, is partial rebuild time after some source-code modifications. Without LTO we get:

$ touch init/main.c
$ time make -j8 vmlinux

real	0m3.686s
user	0m5.803s
sys	0m1.819s

And with LTO enabled this becomes:

$ touch init/main.c
$ time make -j8 vmlinux

real	0m58.283s
user	5m6.089s
sys	0m12.732s

A partial build with LTO takes about 15x longer than in the non-LTO case, and is not very far from the full build time. LTO is clearly not suitable for frequent debug/rebuild/test cycles.

And for completeness:

$ make clean
$ [hacks for CONFIG_LD_DEAD_CODE_DATA_ELIMINATION]
$ time make -j8 vmlinux

real	0m37.572s
user	3m58.826s
sys	0m21.616s

More or less the same result as our initial build. Clearly, link-time garbage collection is basically free in terms of build time, which is its biggest (perhaps only) advantage.

Test-build environment details:

GCC version 6.3.1 20170404 (Linaro GCC 6.3-2017.05)

Intel® Core™ i7-4770R CPU @ 3.20GHz

Samsung SSD 850 EVO 500GB

Conclusion

We have two approaches for automatic kernel-size reduction at our disposal, each with a different set of compromises. However, the advantage is clearly on the LTO side when considering maintenance costs and intrusiveness, and the build time becomes tolerable when building very small kernels anyway. But did we manage to get a "very small kernel"? Kernels that cross the one-megabyte mark cannot realistically be qualified as "very small" or even "tiny" yet. Clearly, automatic size reduction alone won't be sufficient; more assertive approaches will be required to achieve our goal. That will be the subject of the next article.

Meanwhile, anybody wanting to play with LTO on their own kernel in the short term should start with these instructions found in Kleen's patch set.

The next article in this series is Shrinking the kernel with an axe.

Comments (26 posted)

Linux's deadline scheduler is a global earliest-deadline-first scheduler for sporadic tasks with constrained deadlines. These terms were defined in the first part of this series. In this installment, the details of the Linux deadline scheduler and how it can be used will be examined.

The deadline scheduler prioritizes the tasks according to the task’s job deadline: the earliest absolute deadline first. For a system with M processors, the M earliest deadline jobs will be selected to run on the M processors.

The Linux deadline scheduler also implements the constant bandwidth server (CBS) algorithm, which is a resource-reservation protocol. CBS is used to guarantee that each task will receive its full run time during every period. At every activation of a task, the CBS replenishes the task's run time. As the job runs, it consumes that time; if the run time is exhausted, the task will be throttled and descheduled. In this case, the task will be able to run only after the next replenishment, at the beginning of the next period. Therefore, CBS is used both to guarantee each task's CPU time based on its timing requirements and to prevent a misbehaving task from running for more than its run time and causing problems for other jobs.

In order to avoid overloading the system with deadline tasks, the deadline scheduler implements an acceptance test, which is done every time a task is configured to run with the deadline scheduler. This test guarantees that deadline tasks will not use more than the maximum amount of the system's CPU time, which is specified using the kernel.sched_rt_runtime_us and kernel.sched_rt_period_us sysctl knobs. The default values are 950000 and 1000000, respectively, limiting realtime tasks to 950,000µs of CPU time every 1s of wall-clock time. For a single-core system, this test is both necessary and sufficient. It means that the acceptance of a task guarantees that the task will be able to use all the run time allocated to it before its deadline.
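
These knobs can be inspected (and changed) with sysctl; a quick sketch showing the defaults mentioned above:

# sysctl kernel.sched_rt_runtime_us kernel.sched_rt_period_us
kernel.sched_rt_runtime_us = 950000
kernel.sched_rt_period_us = 1000000

Writing -1 to kernel.sched_rt_runtime_us disables the limit entirely, which is generally not recommended.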

However, it is worth noting that this acceptance test is necessary, but not sufficient, for global scheduling on multiprocessor systems. As Dhall's effect (described in the first part of this series) shows, the global deadline scheduler can be unable to schedule a task set even though there is CPU time available. Hence, the current acceptance test does not guarantee that, once accepted, the tasks will be able to use all of their assigned run time before their deadlines. The best the current acceptance test can guarantee is bounded tardiness, which is a good guarantee for soft realtime systems. If users want to guarantee that all tasks will meet their deadlines, they must either use a partitioned approach or use a necessary and sufficient acceptance test, defined by:

Σ(WCET_i / P_i) ≤ M − (M − 1) × U_max

Or, expressed in words: the sum of each task's run time divided by its period must be less than or equal to the number of processors, minus the largest utilization multiplied by the number of processors minus one. It turns out that the bigger U_max is, the less load the system is able to handle.
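
To make that concrete: on a system with M=8 processors where the largest task utilization is U_max = 0.9, the bound is 8 − (8 − 1) × 0.9 = 1.7, so the admitted tasks' total utilization must not exceed 1.7, even though eight CPUs' worth of time is physically available.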

In the presence of tasks with a big utilization, one good strategy is to partition the system and isolate some high-load tasks in a way that allows the small-utilization tasks to be globally scheduled on a different set of CPUs. Currently, the deadline scheduler does not enable the user to set the affinity of a thread, but it is possible to partition a system using control-group cpusets.

For example, consider a system with eight CPUs. One big task has a utilization close to 90% of one CPU, while a set of many other tasks have a lower utilization. In this environment, one recommended setup would be to isolate CPU 0 to run the high-utilization task while allowing the other tasks to run on the remaining CPUs. To configure this environment, the user must take the following steps:

1. Enter the cpuset directory and create two cpusets:

       # cd /sys/fs/cgroup/cpuset/
       # mkdir cluster
       # mkdir partition

2. Disable load balancing in the root cpuset to create two new root domains in the cpusets:

       # echo 0 > cpuset.sched_load_balance

3. Enter the directory for the cluster cpuset, set the CPUs available to 1-7, set the memory node the set should run on (in this case the system is not NUMA, so it is always node zero), and put the cpuset into exclusive mode:

       # cd cluster/
       # echo 1-7 > cpuset.cpus
       # echo 0 > cpuset.mems
       # echo 1 > cpuset.cpu_exclusive

4. Move all tasks to this cpuset:

       # ps -eLo lwp | while read thread; do echo $thread > tasks; done

   It is then possible to start deadline tasks in this cpuset.

5. Configure the partition cpuset:

       # cd ../partition/
       # echo 1 > cpuset.cpu_exclusive
       # echo 0 > cpuset.mems
       # echo 0 > cpuset.cpus

6. Finally, move the shell to the partition cpuset:

       # echo $$ > tasks

The final step is to run the deadline workload.

With this setup, the task isolated in the partitioned cpuset will not interfere with the tasks in the cluster cpuset, increasing the system’s maximum load while meeting the deadline of real-time tasks.

The developer’s perspective

There are three ways to use the deadline scheduler: as a constant bandwidth server, as a periodic/sporadic server waiting for an event, or as a periodic task waiting for replenishment. The most basic parameter for the deadline scheduler is the period, which defines how often a task is activated. When a task does not have an activation pattern, it is possible to use the deadline scheduler in an aperiodic mode by using only the CBS features.

In the aperiodic case, the best thing the user can do is to estimate how much CPU time the task needs in a given period of time to accomplish the expected result. For instance, if one task needs 200ms each second to accomplish its work, the run time would be 200,000,000ns and the period would be 1,000,000,000ns. The sched_setattr() system call is used to set the deadline-scheduling parameters. The following code is a simple example of how to set those parameters in an application (note that glibc provides no wrapper for sched_setattr(), so a small syscall()-based wrapper is assumed):

int main (int argc, char **argv)
{
	int ret;
	int flags = 0;
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);

	/* This creates a 200ms / 1s reservation */
	attr.sched_policy   = SCHED_DEADLINE;
	attr.sched_runtime  = 200000000;
	attr.sched_deadline = attr.sched_period = 1000000000;

	ret = sched_setattr(0, &attr, flags);
	if (ret < 0) {
		perror("sched_setattr failed to set the priorities");
		exit(-1);
	}

	do_the_computation_without_blocking();
	exit(0);
}

In the aperiodic case, the task does not need to know when a period starts, and so the task just needs to run, knowing that the scheduler will throttle the task after it has consumed the specified run time.

Another use case is to implement a periodic task that starts to run at every periodic run-time replenishment, runs until it finishes its processing, then goes to sleep until the next activation. Using the parameters from the previous example, the following code sample uses the sched_yield() system call to notify the scheduler of the end of the current activation. The task will be awakened by the next run-time replenishment. Note that the semantics of sched_yield() are a bit different for deadline tasks; they will not be scheduled again until the run-time replenishment happens.

Code working in this mode would look like the example above, except that the actual computation looks like:

for (;;) {
	do_the_computation();
	/*
	 * Notify the scheduler of the end of the computation.
	 * This syscall will block until the next replenishment.
	 */
	sched_yield();
}

It is worth noting that the computation must finish within the given run time. If the task does not finish, it will be throttled by the CBS algorithm.

The most common use case for a realtime task is to wait for an external event to take place. In this case, the task waits in a blocking system call; that call will wake up the realtime task with, at least, a minimum interval between each activation. That is, it is a sporadic task. Once activated, the task will do its computation and provide the response. Once the task provides the output, it goes back to sleep, blocking until the next event.

for (;;) {
	/*
	 * Block in a blocking system call, waiting for data
	 * to be processed.
	 */
	process_the_data();
	produce_the_result();
	block_waiting_for_the_next_event();
}

Conclusion

The deadline scheduler is able to provide guarantees for realtime tasks based only on the tasks' timing constraints. Although global multi-core scheduling faces Dhall's effect, it is possible to configure the system to achieve high utilization by using cpusets to partition the system. Developers can also benefit from the deadline scheduler by designing their applications to interact with the scheduler, simplifying control of the tasks' timing behavior.

Deadline-scheduler tasks have a higher priority than realtime-scheduler tasks; even the highest fixed-priority task will be delayed by deadline tasks. Thus, deadline tasks do not need to consider interference from realtime tasks, but realtime tasks must consider interference from deadline tasks.

The deadline scheduler and the PREEMPT_RT patch play different roles in improving Linux's realtime features. While the deadline scheduler allows scheduling tasks in a more predictable way, the PREEMPT_RT patch set improves the kernel by reducing and limiting the amount of time a lower-priority task can delay the execution of a realtime task. It works by reducing the amount of time a processor runs with preemption and IRQs disabled, and the amount of time in which a lower-priority task can delay the execution of a task by holding a lock.

For example, since a realtime task can suffer an activation latency of more than 5ms when running on a non-realtime kernel, such a kernel cannot handle deadline tasks with deadlines shorter than 5ms. In contrast, the realtime kernel guarantees, on well-tuned and certified hardware, that the start of the highest-priority task will not be delayed by more than 150µs, so it is possible to handle realtime tasks with deadlines much shorter than 5ms. You can find more about the realtime kernel here.

Acknowledgment: this series of articles was reviewed and improved with comments from Clark Williams, Beth Uptagrafft, Arnaldo Carvalho de Melo, Luis Claudio R. Gonçalves, Oleksandr Natalenko, Jiri Kastner and Tommaso Cucinotta.

Comments (5 posted)

Sometimes, a data structure proves to be inadequate for its intended task. Other times, though, the problem may be somewhere else — in the API used to access it, for example. Matthew Wilcox's presentation during the 2018 linux.conf.au Kernel miniconf made the case that, for the kernel's venerable radix tree data structure, the latter situation holds. His response is a new approach to an old data structure that he is calling the "XArray".

The kernel's radix tree is, he said, a great data structure, but it has far fewer users than one might expect. Instead, various kernel subsystems have implemented their own data structures to solve the same problems. He tried to fix that by converting some of those subsystems and found that the task was quite a bit harder than it should be. The problem, he concluded, is that the API for radix trees is bad; it doesn't fit the actual use cases in the kernel.

Part of the issue is that the "tree" terminology is confusing in this case. A radix tree isn't really like the classic trees that one finds in data-structure texts. Addition of an item to a tree has been called "insertion" for decades, for example, but an "insert" doesn't really describe what happens with a radix tree, especially if an item with the given key is already present there. Radix trees also support concepts like "exception entries" that users find scary just because of the naming that was used.

So Wilcox decided to fix the interface. He has kept the existing radix-tree data structure unchanged; there are, he said, few problems with it. But the metaphor describing its operation has been changed from a tree to an array. It behaves much like an automatically resizing array; fundamentally, it is an array of pointer values indexed by an unsigned long. This view better describes how the structure is actually used.

The radix tree requires users to do their own locking; the XArray, instead, handles locking itself by default, simplifying the task of using it. The "preload" mechanism, which allows users to pre-allocate memory before acquiring locks, has been removed; it added significant complexity to the interface for almost no real value.

The actual XArray API has been split into two pieces, the normal API and the advanced API. The latter provides much more control to the caller; it can be used to explicitly manage locking, for example. This API will be used at call sites with special needs; the page cache is one example where it is needed. The normal API is entirely implemented on top of the advanced API, so it serves as a demonstration of how the advanced API can be used.

The page cache has been converted to use the XArray, he said, and there are no bugs remaining that he knows of. His plan is to "plead" for inclusion during the 4.16 merge window.

A quick look at the XArray API

The current version of the XArray patch set, as of this writing, is version 6, posted on January 17. It is a 99-patch series and, thus, not for the faint of heart, but an introduction to its operation can be found in the documentation patch in the series. One starts by defining an array with:

#include <linux/xarray.h>

DEFINE_XARRAY(array_name);

/* or */

struct xarray array;
xa_init(&array);

Storing a value into an XArray is done with:

void *xa_store(struct xarray *xa, unsigned long index, void *entry, gfp_t gfp);

This function will store the given entry at the requested index; if memory must be allocated, the given gfp flags will be used. The return value on success is the previous value (if any) that was stored at index. An entry can be removed from the array by storing NULL there, or by calling:

void *xa_erase(struct xarray *xa, unsigned long index);

Other variants include xa_insert() to store without overwriting an existing entry, and xa_cmpxchg():

void *xa_cmpxchg(struct xarray *xa, unsigned long index, void *old, void *entry, gfp_t gfp);

In this case, entry will be stored at index, but only if the current value stored there matches old. Either way, the current value stored at index is returned.

Fetching a value from an XArray is done with xa_load():

void *xa_load(struct xarray *xa, unsigned long index);

The return value is the value stored at index. In an XArray, an empty entry is the same as an entry that has had NULL stored into it, so xa_load() will not behave specially for empty entries.

Up to three single-bit tags can be set on any non-null XArray entry; they are managed with:

void xa_set_tag(struct xarray *xa, unsigned long index, xa_tag_t tag);
void xa_clear_tag(struct xarray *xa, unsigned long index, xa_tag_t tag);
bool xa_get_tag(struct xarray *xa, unsigned long index, xa_tag_t tag);

The tag values here are one of XA_TAG_0, XA_TAG_1, and XA_TAG_2. A call to xa_set_tag() will set the given tag on the entry at index, while xa_clear_tag() will remove that tag. xa_get_tag() returns true if the given tag is set on the entry at index.

As a general rule, an XArray is sparsely populated; that means that looping through all of the possible entries looking for the non-null ones would be rather inefficient. Instead, this helper macro should be used:

xa_for_each(xa, entry, index, max, filter) {
	/* Process "entry" */
}

Before entering this loop, index should be set to the beginning of the range to be iterated over, while max indicates the largest index that should be returned. The filter value can specify tag bits which will be used to filter out uninteresting entries. During loop execution, index will be set to match the current entry. It is possible to change the iteration by changing index within the loop; it is also allowed to make changes to the array itself.
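
Putting the pieces together, a minimal sketch (hypothetical code, based on the version-6 API described above) that stores a few sparse entries, tags one of them, and then visits only the tagged entries might look like:

#include <linux/xarray.h>

static DEFINE_XARRAY(things);

/* thing_a, thing_b, thing_c, and process() are hypothetical */
static void example(void)
{
	void *entry;
	unsigned long index = 0;

	xa_store(&things, 0, thing_a, GFP_KERNEL);
	xa_store(&things, 1000, thing_b, GFP_KERNEL);
	xa_store(&things, 1000000, thing_c, GFP_KERNEL);

	/* Mark one entry for later processing */
	xa_set_tag(&things, 1000, XA_TAG_0);

	/* Visit only the tagged entries in [0, ULONG_MAX] */
	xa_for_each(&things, entry, index, ULONG_MAX, XA_TAG_0) {
		process(entry);
	}
}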

There are many other functions in the normal API that provide other ways to access an XArray; there is also the entire advanced API for the special cases. The API as a whole is reasonably large and complex, but it would appear to be rather easier to work with than the radix-tree API. The current patch set converts a number of radix-tree users to XArrays, but some still remain. If all goes according to Wilcox's plan, though, those will be converted in the near future and the radix-tree API will head toward removal.

[Your editor would like to thank the Linux Foundation, LWN's travel sponsor, and linux.conf.au for assisting with his travel to the event.]

Comments (20 posted)

BPF is an increasingly capable tool for instrumenting and tracing the operation of the kernel; it has enabled the creation of the growing set of BCC tools. Unfortunately, BCC has no support for a cross-development workflow where the development machine and the target machine running the developed code are different. Cross-development is favored by embedded-systems kernel developers who tend to develop on an x86 host and then flash and test their code on SoCs (System on Chips) based on the ARM architecture. In this article, I introduce BPFd, a project to enable cross-development using BPF and BCC.

The BPF compiler collection (BCC) is a suite of kernel tracing tools that allow systems engineers to efficiently and safely get a deep understanding into the inner workings of a Linux system. Because they can't crash the kernel, they are safer than kernel modules and can be used in production. Brendan Gregg has written several nice tools, and has given talks showing the full power of eBPF-based tools; see also this introduction to BCC published on LWN.

In the Android kernel team, we work mostly on ARM64 systems, since most Android devices use this architecture. Support for BCC tools on ARM64 systems had been broken for years. One of the reasons for this is ARM64 inline assembler statements. Unavoidably, kernel-header includes in BCC tools result in the inclusion of architecture-specific headers which, in the case of ARM64, can spew inline ARM64 assembly instructions, causing major pain for LLVM's BPF backend. Recently, this issue was fixed by adding BPF inline-assembly support to the compiler (these LLVM commits) and folks could finally run BCC tools on ARM64, but that turns out to not be the only problem.

In order for BCC tools to work at all, they need kernel sources. This is because most tools need to register callbacks on the ever-changing kernel API in order to get their data. Such callbacks are registered using the kprobe infrastructure. When a BCC tool is run, BCC switches its current directory into the kernel source directory before compilation starts, then compiles the C program that embodies the BCC tool's logic. The C program is free to include kernel headers for kprobes to work and to use kernel data structures.

Even if one were not to use kprobes, BCC also implicitly adds a common helpers.h include directive whenever an eBPF C program is being compiled; that file is found at src/cc/export/helpers.h in the BCC source. This header uses the LINUX_VERSION_CODE macro to create a "version" section in the compiled output. LINUX_VERSION_CODE is available only in the source of the specific kernel being targeted; it is used during eBPF program loading to make sure the BPF program is being loaded into a kernel with the right version. As you can see, kernel sources quickly become mandatory for compiling eBPF programs.
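
The effect is roughly the following (a simplified sketch, not the exact helpers.h source): the compiled object carries a "version" section whose contents the kernel compares against its own version at load time.

#include <linux/version.h>

/*
 * Sketch of what BCC's helpers.h arranges for: the object file gets a
 * "version" section holding LINUX_VERSION_CODE, which is checked when
 * the eBPF program is loaded.
 */
__attribute__((section("version"), used))
static unsigned int _version = LINUX_VERSION_CODE;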

In some sense, this build process is similar to how external kernel modules are built. Kernel sources are large and can take up a lot of space on the system being debugged. They can also get out of sync with the running kernel, which may make the tools misbehave.

The other issue is that the Clang and LLVM libraries need to be available on the target being traced, because the tools compile the needed BPF bytecode, which is then loaded into the kernel. These libraries take up a lot of space. It seems overkill to require a full-blown compiler infrastructure on a system when the BPF code can be compiled elsewhere, and maybe even compiled just once. Further, these libraries need to be cross-compiled to run on the architecture you're tracing. That's possible, but why would anyone want to do that if they didn't need to? Cross-compiling compiler toolchains can be tedious and stressful.

Instead of loading up all the tools, compiler infrastructure and kernel sources onto the remote targets being traced and running BCC that way, I decided to write a proxy program named BPFd that receives commands and performs them on behalf of whoever is requesting them. All the heavy lifting (compilation, parsing of user input, parsing of the hash maps, presentation of results, etc.) is done by BCC tools on the host machine, with BPFd running on the target as the interface to the target kernel. BPFd encapsulates all the needs of BCC and performs them; this includes loading a BPF program, creating, deleting and looking up maps, attaching an eBPF program to a kprobe, polling for new data that the eBPF program may have written into a perf buffer, etc. If it's woken up because the perf buffer contains new data, it'll inform BCC tools on the host about it, or it can return map data whenever requested, which may contain information updated by the target eBPF program.

Simple design

Before this work, the BCC tools architecture was as follows:

BPFd-based invocations partition this architecture, thus making it possible to do cross-development and execution of the tools across machine and architecture boundaries. For instance, the kernel sources that the BCC tools depend on can be on a development machine, with eBPF code being loaded onto a remote machine. This partitioning is illustrated in the following diagram:

The design of BPFd is quite simple: it expects commands on stdin (standard input) and provides the results over stdout (standard output). Every command is a single line, no matter how big the command is. This allows easy testing using cat: one can simply cat a file full of commands and check whether BPFd's stdout contains the expected results. Results from a command, however, can span multiple lines.

BPF maps are data structures that a BPF program uses to store data, which can be retrieved at a later time. Maps are represented by a file descriptor returned by the bpf() system call once the map has been successfully created. For example, the following is a command to BPFd for creating a BPF hash-table map:

BPF_CREATE_MAP 1 count 8 40 10240 0

And the result from BPFd is:

bpf_create_map: ret=3

Since BPFd is proxying the map creation, the file descriptor (3 in this example) is mapped into BPFd's file-descriptor table. The command tells BPFd to create a map named count with map type 1 (a hash table), a key size of eight bytes, a value size of 40 bytes, a maximum of 10240 entries, and no special flags. In response, BPFd created a map that is identified by file descriptor 3.
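
On the target, BPFd turns that line into an ordinary bpf() system call. A rough C equivalent (a sketch, not BPFd's actual code) would be:

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

int create_count_map(void)
{
	union bpf_attr attr = {
		.map_type    = BPF_MAP_TYPE_HASH,	/* map type 1 */
		.key_size    = 8,			/* bytes */
		.value_size  = 40,			/* bytes */
		.max_entries = 10240,
		.map_flags   = 0,
	};

	/* On success, returns a new file descriptor (3 in the example above) */
	return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
}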

With the standard-input/output design, it's possible to write wrappers around BPFd to handle more advanced communication methods such as USB or networking. As a part of my analysis work in the Android kernel team, I am communicating these commands over the Android Debug Bridge (adb), which interfaces with the target device over either USB or TCP/IP. I have shared several demos below.

Changes to BCC tools

A number of changes have been made to the BCC tools repository to enable it to work with BPFd; some of the more significant changes are described here. These changes can be found in this branch of the BPFd repository.

A new remotes module has been added to BCC with an abstraction that different remote-access types, such as networking or USB, must implement. This keeps code duplication to a minimum: by implementing the functions needed for a remote, a new communication method can easily be added. Currently, an adb remote and a process remote are provided. The adb remote is for communication with the target device over USB or TCP/IP using the Android Debug Bridge. With the process remote, which is probably only useful for local testing, BPFd is forked on the same machine running BCC and communicates with it over stdin and stdout.

libbpf.c is the main C file in the BCC project that talks to the kernel for all things BPF. This is illustrated in the diagram above. In order to make BCC perform BPF operations on the remote machine instead of the local machine, the parts of BCC that make calls to the local libbpf.c are now instead channeled to the remote BPFd on the target. BPFd on the target then performs the commands on behalf of BCC running locally, by calling into its copy of libbpf.c.

One of the tricky parts to making this work is that certain other paths need to be channeled to the remote machine as well. For example, to attach to a tracepoint, BCC needs a list of all available tracepoints on the system. This list has to be obtained on the remote system and is the reason for the GET_TRACE_EVENTS command in BPFd.

When BCC compiles the C program encapsulated in a BCC tool into eBPF instructions, it assumes that the eBPF program will run on the same processor architecture that BCC is running on. This is incorrect when building an eBPF program for a different target. Some time ago, before I started this project, I changed this assumption for the building of in-kernel eBPF samples (which are simple standalone samples and unrelated to BCC). Now, I have had to make a similar change to BCC so that it compiles the C program correctly for the target architecture.

Installation and running

To try it out for yourself, follow the detailed or simple instructions. Also, apply this kernel patch (currently submitted upstream) to make it faster to run tools like offcputime.

As an example, consider filetop , which is a BCC tool that shows you all read/write I/O operations with a similar experience to the top tool. It refreshes every few seconds, giving you a live view of these operations. To run filetop remotely with BPFd, start by going to your BCC directory and setting the environment variables needed. Something like the following will do:

export ARCH=arm64
export BCC_KERNEL_SOURCE=/home/joel/sdb/hikey-kernel/
export BCC_REMOTE=adb

You could also use the bcc-set script provided in the BPFd sources to set these environment variables for you. Check the INSTALL.md file in BPFd sources for more information.

Next, start filetop :

# ./tools/filetop.py 5

This tells the tool to monitor file I/O every 5 seconds. While filetop was running, I started the stock email app in Android and the output looked like:

Tracing... Output every 5 secs. Hit Ctrl-C to end
13:29:25 loadavg: 0.33 0.23 0.15 2/446 2931

TID    COMM             READS  WRITES R_Kb   W_Kb   T FILE
3787   Binder:2985_8    44     0      140    0      R profile.db
3792   m.android.email  89     0      130    0      R Email.apk
3813   AsyncTask #3     29     0      48     0      R EmailProvider.db
3808   SharedPreferenc  1      0      16     0      R AndroidMail.Main.xml
3811   SharedPreferenc  1      0      16     0      R UnifiedEmail.xml
3792   m.android.email  2      0      16     0      R deviceName
3815   SharedPreferenc  1      0      16     0      R MailAppProvider.xml
3813   AsyncTask #3     8      0      12     0      R EmailProviderBody.db
3809   AsyncTask #1     8      0      12     0      R suggestions.db
2434   WifiService      4      0      4      0      R iface_stat_fmt
3792   m.android.email  66     0      2      0      R framework-res.apk

Note the Email.apk file being read by Android to load the email application, followed by various other reads related to the email app. Finally, WifiService continuously reads iface_stat_fmt to get network statistics for Android accounting.

Other use cases for BPFd

While the main use case at the moment is easier use of BCC tools in cross-development situations, another potential use case that's gaining interest is easy loading of a BPF program locally. A compiled BPF program can be stored on disk in base64 format and sent to bpfd using something as simple as:

# cat my_bpf_prog.base64 | bpfd

In the Android kernel team, we are also experimenting with loading a program with a forked BPFd instance, creating maps, pinning them for use at a later time after BPFd exits, and then killing the BPFd fork, since it's done. Creating a separate process and having it load the eBPF program for you has the distinct advantage that the runtime fixing-up of map file descriptors isn't needed in the loaded eBPF code. In other words, the eBPF program's instructions can be predetermined and statically loaded.

Conclusion

Building code for instrumentation on a different machine than the one actually running the debugging code is beneficial; BPFd makes this possible. Alternately, one could also write tracing code in their own kernel module on a development machine, copy it over to a remote target, and do similar tracing/debugging. However, this is quite unsafe since kernel modules can crash the kernel. On the other hand, eBPF programs are verified before they're run and are guaranteed to be safe when loaded into the kernel, unlike kernel modules. Furthermore, the BCC project offers great support for parsing the output of maps, processing them, and presenting results, all using the friendly Python programming language. BCC tools are quite promising and could be the future for easier and safer deep-tracing endeavors. BPFd can hopefully make it even easier to run these tools for folks such as embedded system and Android developers who typically compile their kernels on their local machine and run them on a non-local target machine.

If you have any questions, feel free to reach out to me or drop me a note in the comments section.

Comments (11 posted)