This edition contains the following feature content:

This week's edition also includes these inner pages:

Brief items: Brief news items from throughout the community.

Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.


Social networking is often approached by the free-software community with a certain amount of suspicion—rightly so, since commercial social networks almost always generate revenue by exploiting user data in one way or another. While attempts at a free-software approach to social networking have so far not met widespread success, the new ActivityPub federation protocol and its implementation in the free-software microblogging system Mastodon are gaining popularity and already show some of the advantages of a community-driven approach.

While a community-run, open-source social network would avoid many of the concerns raised by commercial social networks, it's difficult for such a platform to gain widespread adoption because of the "network effect": social networks become more valuable as they gain more users, and so centralization tends to come about naturally. Few people are excited about having an account on yet another social network with few of their friends.

A technical solution to this social problem is federation. In a federated system, multiple independent services use standard protocols to exchange data, so that you don't need to use the same social network as a friend in order to communicate with them. Email is a federated system in which many independent mail servers interact via SMTP, but so far no clear "SMTP for social" has emerged. There are a few contenders, though, and one is on track to become a W3C standard. Before getting to it, let's take a look at the first major attempt.

OStatus

Most federated social systems aim to implement "microblogging" as popularized by Twitter. The first open-source system for federated microblogging to gain traction was identi.ca, a microblogging platform launched in 2008 by Canadian startup Control Yourself, Inc. The company later launched status.net, a service offering hosted instances of the identi.ca software, which was called Laconica; the protocol the instances used to communicate among themselves came to be known as OStatus.

OStatus is fairly simple, consisting of an Atom feed published by each server containing the actions taken by its users: things like publishing new status messages, posting comments, sharing photos, deleting previously shared objects, and more—in general, everything you would do on a social network. OStatus was augmented with the rather verbosely named PubSubHubbub, which allows OStatus services to publish and subscribe to intermediary servers that actively push out new changes, avoiding the load of constantly polling other servers. In the meantime, the related WebFinger protocol allowed OStatus services to query each other for user profiles and other information.
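A WebFinger lookup, as used by OStatus services, is just an HTTPS GET against a well-known path on the user's home server. This sketch only builds the query URL; the account name is made up for illustration:

```python
from urllib.parse import urlencode

def webfinger_url(account):
    """Build the WebFinger discovery URL for a user@host account."""
    host = account.rpartition("@")[2]
    query = urlencode({"resource": f"acct:{account}"})
    return f"https://{host}/.well-known/webfinger?{query}"

print(webfinger_url("alice@example.com"))
# https://example.com/.well-known/webfinger?resource=acct%3Aalice%40example.com
```

The server responds with a JSON document listing, among other things, the URLs of the user's profile and feeds.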

OStatus is now implemented by a number of software packages, the best known being the identi.ca software itself, which went through several organizational changes before ultimately joining the GNU project as GNU social. The original identi.ca and status.net have since fallen quiet, but GNU social lives on as a functional federated microblogging service, although it noticeably lags behind the current state of the art.

The OStatus protocol itself was submitted to the W3C standards track in 2012, but the process stalled for several reasons. The major ones were the formation of the W3C Social Web Working Group and the fact that OStatus's own creator, Evan Prodromou, had turned his attention to a more sophisticated protocol he called pump.io. These two efforts ultimately converged, with the W3C working group drafting a standard called ActivityPub, directly based on pump.io.

ActivityPub

ActivityPub entered W3C "Proposed Recommendation" status on December 5, 2017. This is the last step before full W3C Recommendation status; the comment period prior to adoption runs until January 2, 2018. The ActivityPub specification includes a number of enhancements over OStatus and is, in general, a more complete approach to building a standardized microblogging platform.

The most significant change is that ActivityPub standardizes both the client-to-server interface and the server-to-server interface. Client-to-server standardization will allow for desktop clients (as used to be quite popular with Twitter) that will work with multiple free-software social platforms, something that has previously been implemented by just duplicating the Twitter API.

Like OStatus, ActivityPub uses HTTP as the underlying protocol. Unlike OStatus, ActivityPub makes heavy use of JSON and allows servers to push messages directly to other servers, removing the need for a third-party publish/subscribe (pub/sub) service. Note that although ActivityPub has removed the need for PubSubHubbub, the pub/sub protocol is used in a number of other applications and is also on the W3C standards track under the more concise name WebSub.

ActivityPub distributes status messages, photos, comments, and other types of content collectively referred to as "activities". These activities are expressed in a standardized format called ActivityStreams, which makes use of the JSON for Linking Data (JSON-LD) format; JSON-LD extends JSON with more complete support for object relationships. ActivityStreams is quite flexible and makes ActivityPub a good fit for many different types of social sharing.
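A minimal activity looks something like the following, here built as a Python dict. The field names come from the ActivityStreams 2.0 vocabulary; the URLs and the actor are invented for illustration:

```python
import json

# A "Create" activity wrapping a short Note, addressed to the author's
# followers collection (the recipient list ActivityPub servers must honor).
activity = {
    "@context": "https://www.w3.org/ns/activitystreams",
    "type": "Create",
    "id": "https://social.example/users/alice/statuses/1",
    "actor": "https://social.example/users/alice",
    "to": ["https://social.example/users/alice/followers"],
    "object": {
        "type": "Note",
        "content": "Hello, fediverse!",
    },
}
print(json.dumps(activity, indent=2))
```

The "to" list is what makes the privacy scoping discussed below possible; a fully public post would instead address the special public collection.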

Conceptually, ActivityPub is designed around the concept of an inbox and outbox, much like email. When a user posts an activity, their server initially places it in their outbox. Their server then uses a simple POST request to submit that activity to the servers of each of their followers, which store the activity in an inbox for delivery to the receiving user next time their client checks for updates. In a break from email, though, a user's outbox is directly browsable by other servers, albeit likely after filtering based on the permissions of the browsing server or user. Because servers will often have multiple users, and potentially multiple users that follow the same person, ActivityPub also allows for a shared inbox that allows the poster's server to only POST an update to a federated server once for delivery to all relevant followers.
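The delivery step described above amounts to one POST per destination inbox; the shared-inbox optimization is simply deduplication of those destinations. A small sketch, with invented inbox URLs (real delivery would also need request signing, which the specification leaves to implementations):

```python
def delivery_inboxes(followers):
    """Collapse a follower list to the set of inboxes needing one POST each.

    Each follower is a dict with an "inbox" URL and, optionally, a
    "sharedInbox" covering every follower on that server.
    """
    return {f.get("sharedInbox") or f["inbox"] for f in followers}

followers = [
    {"inbox": "https://a.example/u/bob/inbox",
     "sharedInbox": "https://a.example/inbox"},
    {"inbox": "https://a.example/u/eve/inbox",
     "sharedInbox": "https://a.example/inbox"},
    {"inbox": "https://b.example/u/mallory/inbox"},
]
# Two POSTs instead of three: a.example's shared inbox covers both of
# its users, while mallory's personal inbox is contacted directly.
print(sorted(delivery_inboxes(followers)))
```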

One of the most interesting features of ActivityPub is its support for privacy scopes on activities. OStatus was built with the assumption that all content posted by users was completely public; it provided no mechanism for an activity to have a limited distribution scope. ActivityPub, on the other hand, includes a recipient list as part of an activity and requires that servers respect that list.

Despite its advances, ActivityPub continues to have a number of limitations. Probably the greatest is that ActivityPub allows for authentication but does not address the actual mechanism, leaving it largely up to individual implementers. This somewhat limits the value of the privacy features in the protocol, as extensions to it are still required to, for example, protect private messages from being requested by servers other than that of the recipient. The Social Web Working Group intends to take this on in future work, with signed HTTP requests as the most likely direction for server-to-server communications.

ActivityPub is a fairly new specification and hasn't been widely adopted yet. The most popular project to adopt ActivityPub so far is the federated, free-software microblogging platform Mastodon, which originally implemented OStatus and added ActivityPub alongside in version 1.6, which was released in September 2017. While Mastodon has not implemented the client-server API, it does use ActivityPub for server-to-server communications when supported. This change was motivated most of all by support for better privacy features.

Mastodon

Mastodon, first released in 2016 and recently reaching version 2.0, is a microblogging system strongly reminiscent of Twitter or GNU social, but with a more sophisticated user interface inspired by the popular third-party Twitter client TweetDeck. Mastodon is AGPL-licensed and implemented as a Ruby on Rails application with source available on GitHub.

Mastodon has a somewhat lengthy but well documented install process; there are also official Docker containers and, of course, a directory of community-run instances ready for use. The total Mastodon community, a major part of the "fediverse" of federated social software, consists of somewhat over one million users across 1,231 publicly listed instances.

The Mastodon web interface shows three columns [YouTube], one with a personal timeline (consisting of posts from those you follow), one with notifications, and one that can be made to show posts from a specific user, other users on the same server, or all posts your server is aware of. This last option is called the "federated timeline" and is seen as one of Mastodon's killer features, since it allows the kind of serendipitous discovery of other users that few federated platforms have been able to offer. Mastodon posts, which it jokingly calls "toots", are limited to 500 characters. This encourages more in-depth content than Twitter while still keeping to the conversational style of microblogging.

While Mastodon itself is an impressive project, with a modern UI and strong feature set, much of its appeal is its socially progressive community and tools oriented toward more effective community policing. Twitter has faced enormous controversy recently due to harassment and hate speech on its platform; Mastodon aspires to avoid this problem by giving users the freedom to choose an instance with moderation policies that reflect their interests—whether that be a complete "anything goes" attitude or a tightly regulated community for polite users only. This is central to Mastodon's marketing:

Mastodon isn't one place and one set of rules: it's thousands of unique, interconnected communities to choose from.... Don't like the rules? You're free to join any community you like, or better yet: you can host your own, on your own terms!

In the Mastodon fediverse, administrators of each instance set their own moderation policies and community standards. Mastodon then equips instance operators with the tools to enforce those rules, both against users and other instances—if a user on a different instance violates the rules of your instance, then you can silence or suspend that user without affecting their activity on their home instance. You can even sever federation entirely with another instance that has a completely incompatible social climate but, again, without any effect on the users of the other instance. Under the federated model, moderation is a local matter rather than a global one.

This approach has worked well for Mastodon. Unlike other federated social platforms, which have typically gained little traction outside of the free-software community, Mastodon is often mentioned in completely disjoint communities, with headlines like "Mastodon 101: A Queer-Friendly Social Network You're Gonna Like a Lot" on Autostraddle, a lesbian and queer community. This appeal to an audience far removed from the privacy-minded free-software community demonstrates some of the power of a federated system: while centralized communities will always struggle with conflicting goals in moderation, federation offers an opportunity to balance a large social network with localized content policies.

Beyond microblogging

One of the most exciting aspects of ActivityPub is that its flexible definition of an "activity" allows it to serve as the federated messaging layer for a variety of social applications. One interesting example is PeerTube, which combines ActivityPub federation with WebTorrent, an in-browser peer-to-peer file transfer implementation, to build a decentralized video sharing service. In this case, the activities exchanged between instances are simply references to videos that are retrieved directly from other peers. While PeerTube is still in early development, the current implementation is quite promising and it's easy to imagine it succeeding in many of the same ways as Mastodon.

The ActivityPub protocol has great potential for decentralized social applications of a variety of types, and the Mastodon implementation is already a promising example of how a free-software, decentralized approach can have real advantages over the dominant commercial services. With the upcoming completion of ActivityPub as a W3C Recommendation, we can look forward to more implementations of this flexible standard.


MAP_FIXED

The MAP_FIXED option to the mmap() system call allows a process to specify that a mapping should be placed at a given virtual address if at all possible. It turns out, though, that "if at all possible" can involve a bit more collateral damage than some would like, and can even lead to exploitable vulnerabilities. A new, safer option is in the works but, as is often the case, it has run into a bit of non-technical difficulty.

Any mmap() call allows the calling process to specify an address for the mapping. In normal operation, though, this address is simply a hint that the kernel is free to ignore. MAP_FIXED exists for cases where the mapping really has to be placed at the requested address or the application will fail to work. The kernel takes this flag seriously, to the point that, if there is already another mapping in the given address range, the existing mapping will be destroyed to make room for the new one. This seems like a strange semantic; if an application wants a mapping at a given area, it should probably be able to take responsibility for making room for that mapping. But mmap() is specified to work that way, so that is what happens.
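The clobbering behavior is easy to demonstrate from user space. This sketch calls the C library's mmap() through ctypes; the flag values are the x86-64 Linux ones and are hardcoded here as an assumption, since Python's mmap module does not expose MAP_FIXED:

```python
import ctypes
import mmap as M

libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]

PROT = M.PROT_READ | M.PROT_WRITE
MAP_PRIVATE, MAP_ANONYMOUS, MAP_FIXED = 0x02, 0x20, 0x10  # x86-64 Linux
MAP_FAILED = ctypes.c_void_p(-1).value

# Let the kernel choose an address for the first mapping.
addr = libc.mmap(None, M.PAGESIZE, PROT, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)
assert addr != MAP_FAILED

# Now map over the same range with MAP_FIXED: the call succeeds at the
# requested address, silently destroying the mapping that was there.
addr2 = libc.mmap(addr, M.PAGESIZE, PROT,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0)
assert addr2 == addr
```

No error is reported at any point; whatever data lived in the original mapping is simply gone.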

Needless to say, that can be problematic if the application wasn't aware of the conflicting mapping — something that could occur as the result of a bug, address-space layout randomization, disagreements between libraries, or deliberate manipulation by an attacker. The data contained within that mapping (or the overlapping part of it, at least) will be silently dropped on the floor and the new mapping will show up in its place. The chances of things working correctly after that are likely to be fairly small. In some cases, security vulnerabilities can result; see, for example, CVE-2017-1000253. In that case, the kernel's internal use of MAP_FIXED to load programs into memory was exploited to corrupt the stack.

A solution can be found in Michal Hocko's MAP_FIXED_SAFE patch set. It adds a new mmap() flag called, unsurprisingly, MAP_FIXED_SAFE, which behaves like MAP_FIXED with one exception: the operation will fail if the targeted address range is not free. The kernel's ELF loader is modified to use this new flag when mapping programs into memory; that will cause program loading to fail if two mappings collide, but that is better than the alternative. It is expected that new code would use this new flag in almost all cases, and that older programs would eventually be switched as well.

Some had suggested adding a separate flag to modify the behavior of MAP_FIXED, so that applications would pass something like MAP_FIXED|MAP_SAFE to mmap(). The problem with that approach is that mmap() is one of those system calls that never checked for unknown flags. A program using that construction would, as a result, silently fall back to MAP_FIXED on older kernels that lacked support for the new MAP_SAFE flag. Using a new flag means that, while the application will not get the desired failure status on an older kernel if the address range is not available, it also will not clobber any existing mappings (because the specified address will be treated as a hint by the kernel).
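That hint-fallback behavior also gives user space a way to probe whether the kernel understands a standalone no-clobber flag: request one on a known-busy range and see whether the call fails. In this sketch the flag value 0x100000 and the other constants are x86-64 Linux assumptions, not taken from the patch set itself:

```python
import ctypes
import mmap as M

libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]

PROT = M.PROT_READ | M.PROT_WRITE
MAP_PRIVATE, MAP_ANONYMOUS = 0x02, 0x20
MAP_FIXED_NOCLOBBER = 0x100000  # assumed value for the new flag
MAP_FAILED = ctypes.c_void_p(-1).value

# Occupy an address range...
addr = libc.mmap(None, M.PAGESIZE, PROT, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)

# ...then ask for a no-clobber fixed mapping at the same place. A kernel
# that knows the flag fails the call; an older kernel ignores the unknown
# bit, treats the address as a hint, and maps somewhere else instead.
r = libc.mmap(addr, M.PAGESIZE, PROT,
              MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOCLOBBER, -1, 0)
print("kernel honors the no-clobber flag:", r == MAP_FAILED)
```

Either way, the existing mapping at addr survives, which is exactly the property the separate MAP_FIXED|MAP_SAFE construction would have lost.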

This change is pretty much ready to go, and Hocko has requested that it be merged. There is, however, the vital issue that has caused the most discussion about this patch series: the naming of MAP_FIXED_SAFE. Various developers, for a variety of reasons, wanted a different name. Suggestions included MAP_FIXED_UNIQUE, MAP_FIXED_NOREPLACE, MAP_FIXED_NO_CLOBBER, MAP_TANTRUM, MAP_EXACT, MAP_NOFORCE, and quite a few others. It was just the sort of discussion that results when the technical issues are resolved, but everybody wants to put their stamp on the final result.

After enduring a fair amount of that discussion, Hocko made his own decision on the naming:

I am afraid we can bikeshed this to death and there will still be somebody finding yet another better name. Therefore I've decided to stick with my original MAP_FIXED_SAFE. Why? Well, because it keeps the MAP_FIXED prefix which should be recognized by developers and _SAFE suffix should also be clear that all dangerous side effects of the old MAP_FIXED are gone.

He also stated that anybody who was truly unhappy with the name was welcome to block the patch and somehow build a consensus around a better one, but that he was done with it. So, naturally, somebody objected, and Hocko wished him luck carrying the patch set forward.

Given the personalities involved, one might think that a useful patch will end up simply blocked at this point. Your editor would wager, though, that the MAP_FIXED_SAFE patches will be merged in something close to their current form. They address a real problem; holding them up while waiting for the perfect name does not seem like an approach that will do anybody any good.


For various reasons related to accounting and security, there is recurring interest in having the kernel identify the container that holds any given process. Attempts to implement that functionality tend to run into the same roadblock, though: the kernel has no concept of what a "container" is, and there is seemingly little desire to change that state of affairs. A solution to this problem may exist in the form of a neglected patch called "ptags", which enables the attachment of arbitrary tags to processes.

Given that containers are at the receiving end of a lot of attention currently, it is natural to wonder why the kernel refuses to recognize them. The kernel does provide the features needed to implement containers: namespaces for isolation, control groups for resource management, seccomp and security modules to implement security policies, etc. But there is little agreement over what actually constitutes a container, and there is still a lot of experimentation going on with interesting new ways of implementing the container concept. When, as part of the recent discussion on container IDs for auditing, it was suggested that use of namespaces identified a container, Casey Schaufler responded:

You might think so, but I am assured that you can have a container without using namespaces. Intel's "Clear Containers", which use virtualization technology, are one example. I have considered creating "Smack Containers" using mandatory access control technology, more to press the point that "containers" is a marketing concept, not technology.

An attempt to codify such a diverse and rapidly evolving concept (be it a "marketing concept" or not) into a kernel API is likely to end in tears. It would have a strong chance of either stifling ongoing container development or just proving to not be useful with next year's idea of what a container should be. So there is indeed a good case to be made for not recognizing the "container" concept inside the kernel.

That position may be entirely logical, but it doesn't make the use cases for identifying containers and associating processes with them go away. More than once, Schaufler has suggested that a module called "ptags" is a better solution to this problem, so your editor decided to go take a look.

Ptags is a proposed security module that was posted to the LSM list a few times by José Bollo in late 2016. It received little attention at the time and appears to have disappeared into that place where unloved kernel patches go. There is a GitLab repository for the project, but it has not seen any commits since early February. Ptags has clearly stalled; perhaps what the project needs is some wider attention and more feedback.

As one might expect, ptags enables the addition of tags to processes. Those tags can be seen and manipulated through a new /proc file: /proc/PID/attr/ptags. Individual threads of a process can have their own tags in /proc/PID/tasks/TID/attr/ptags. Tags are UTF-8 strings (up to 4000 bytes in length, which may be a bit excessive), optionally associated with a string value (32,700 bytes or less — ditto). There are some limitations on control characters, but just about anything goes, so valid tags would include:

IS_EVIL CONTAINER_ID=ae883c कंटेनर=विपणन

The colon character has a special meaning: it is used as a sort of namespace separator. So, for example, if a system were running the Ultimate Marketing Container Manager (UMCM), it might tag processes with their container IDs using something like:

UMCM:CONTAINER_ID=foo

If a process is allowed to change some other process's tags (more on that below), such changes are effected by writing to the appropriate ptags file. Preceding a tag with "+" adds that tag to a process, while "-" removes it. Normally a process's tags will be stripped if it calls execve(), but that behavior can be changed by prepending "@" (the "keep flag") to the tag name. Tags are copied when a process calls clone() or fork(), though. There is a simple glob mechanism for deleting tags or changing keep flags in bulk.
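The write syntax can be wrapped in a small helper. Since ptags is out of tree and your editor has not run it, this is a sketch of the syntax as described in the patch posting, with the target path parameterized rather than hardwired to /proc:

```python
def ptags_entry(tag, value=None, remove=False, keep=False):
    """Build one entry for a ptags attr file: [+|-][@]TAG[=VALUE]."""
    op = "-" if remove else "+"
    name = ("@" + tag) if keep else tag   # "@" sets the keep flag
    return op + name + (f"={value}" if value is not None else "")

def write_ptag(path, *args, **kwargs):
    # On a live system, path would be f"/proc/{pid}/attr/ptags".
    entry = ptags_entry(*args, **kwargs)
    with open(path, "w") as f:
        f.write(entry)
    return entry

# Tag a container member so the tag survives execve():
print(ptags_entry("UMCM:CONTAINER_ID", "foo", keep=True))
```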

By default, unprivileged processes cannot change tags — neither their own nor another process's. Permissions to change tags with a specific namespace prefix can be delegated using the tag system itself. If the administrator wanted the UMCM process to be able to control tags starting with UMCM: on other processes, the UMCM process would be given one or more of these tags:

ptags:UMCM:add ptags:UMCM:sub ptags:UMCM:set ptags:UMCM:others

The first tag allows the UMCM process to add tags starting with UMCM: to itself. The "sub" tag allows removing those tags from itself, and "set" allows changing existing tags. The "others" tag is different, in that it causes any other permissions on the UMCM: namespace to apply globally. If a process's tags include both ptags:UMCM:add and ptags:UMCM:others, it can add tags in the UMCM: namespace to any other process in the system. That permission does also require that the process in question can write to the target process's ptags file, which may be restricted by access permissions or another security module.

Other than the special ptags: tags, nothing in the kernel uses or cares about process tags in any way. They are maintained as a service for user space, making it easy to associate information with processes in a way that those processes cannot change. It would seem that this sort of mechanism would work well for the container use case; a container manager could tag processes in a way that matches its particular scheme. Meanwhile, the kernel need not know anything about any particular conception of what a container is.

One drawback to this scheme, beyond the fact that it's not in the mainline and doesn't appear to be headed that way, is that, according to Schaufler: "PTAGS unfortunately needs module stacking, but how hard could that be?" The answer to that question would be "fairly hard", but there is another question that is worth asking: does the ptags mechanism need to be a security module at all? The usual point of security modules is to restrict access to system resources in some way, but ptags doesn't do that.

If the ptags approach looks like the right solution to the container-ID problem, it might be worth implementing it as a core kernel feature. Processes have a long list of attributes in a Linux system; the tags would just be more of the same. That would ensure that tags would be available on the systems that need them, eliminate the stacking problem and, in general, reduce the potential for unfortunate interactions with other security modules. "Container" might not be appropriate as a core-kernel concept, but "process tags" might be.

But that, of course, would require somebody to either push the existing module forward or implement a similar scheme in another way. But, as Schaufler asked, how hard can that be? As the pressure to solve the container-ID problem continues to grow, some developer may well be motivated to give this approach a try.


High-bandwidth Digital Content Protection (or HDCP) is an Intel-designed copy-protection mechanism for video and audio streams. It is a digital rights management (DRM) system of the type disliked by many in the Linux community. But does that antipathy mean that Linux should not support HDCP? That question is being answered — probably in favor of support — in a conversation underway on the kernel mailing lists.

HDCP is based on encryption and authentication. An HDCP-compliant device is not allowed to send high-quality media streams to any other device that cannot authenticate itself under the HDCP protocol and show that it contains a suitable key. In theory, HDCP prevents the extraction of digital media streams from a chain of devices using it; the practice is, as is often the case, a bit less certain. That notwithstanding, various content providers require HDCP to be present before making their offerings available.

Many of the devices implementing HDCP — set-top boxes, televisions, etc. — run Linux, but the kernel itself does not currently have HDCP support. That may be about to change with this patch set from Sean Paul implementing HDCP for Intel i915 graphics. One part of the patch set in particular provides a generic capability in the direct-rendering layer to enable user space to turn on the content protection feature of the hardware; the application can also verify whether the graphics subsystem was able to establish an authenticated connection with the device at the other end of the cable. Said application is likely to use that information to refuse to play content in the absence of an HDCP-compliant device on the line.

This patch is not new; it was first posted in 2014. Paul noted that: "We [Google] have been using this in ChromeOS across exynos, mediatek, and rockchip over that time". So, in a sense, the kernel has had HDCP support for some time, it just hasn't found its way into the mainline.

Naturally, some developers might prefer that such a feature not get into the mainline now either. Pavel Machek, in particular, questioned the value of HDCP support, asking when the user of a machine would ever want to turn this feature on. Alex Deucher replied that it could be used to protect sensitive video streams in governmental offices (though Alan Cox questioned that idea). Paul said: "We have a lot of Chrome OS users who would really like to enjoy premium hd content on their tvs". There is truth in that latter claim. Just like many users didn't see a Linux system as being usable without the ability to play MP3 files regardless of that format's patent issues, others want to use their systems to play content that is only available through HDCP-protected streams.

The defenders of HDCP also made the point that the hardware has the content-protection capability already; the kernel patches are merely exposing that capability to user space. They bring the feature in from the cold and put it where it can be seen; as Paul put it: "Having all of the code in the open allows users to see what is happening with their hardware, how is this a bad thing?" The alternative, he noted, might be to hide the feature entirely in the firmware or in a user-space binary blob.

The real complaint, arguably, is not that the patch makes an undesirable feature available. User space can instruct current kernels to do no end of unpleasant things already. Instead, consider this reply from Daniel Vetter explaining that the HDCP feature is not a full content-protection implementation:

If you want to actually lock down a machine to implement content protection, then you need secure boot without unlockable boot-loader and a pile more bits in userspace. If you do all that, only then do you have full content protection. And yes, then you don't really own the machine fully, and I think users who are concerned with being able to update their kernels and be able to exercise their software freedoms already know to avoid such locked down systems.

Machek complained that having the feature available would lead to the creation of more systems that are locked down in just this manner. "That is evil, and [a] direct threat to [the] free software movement", he said. No doubt, others will agree with that sentiment.

The problem with this argument, of course, is that those systems already exist, and the kernel patches that enable hardware DRM capabilities already exist. Hardware vendors have not proved to be even slightly reluctant to apply out-of-tree patches to their kernels, so keeping this feature out of the mainline is unlikely to have any measurable effect on the number of locked-down systems in the market. The technology exists, and content providers seem to be of the opinion that it will prevent their precious shows from being pirated, so it will be shipped in consumer devices regardless of the state of mainline kernel support.

What keeping this code out of the kernel would do is to make a statement that digital rights management technologies are unwelcome in general. That is an unlikely position for the kernel community to take, though. It is worthwhile to take a look back at Linus Torvalds's 2003 "DRM is perfectly OK with Linux!" posting for the policy that he applies to kernel code implementing this sort of feature. Beyond that, one should remember that the bulk of kernel development is supported by companies working in this area; such companies have proved remarkably reluctant to take this kind of philosophical position.

So the end result is almost certainly that this patch set will go in without a whole lot of fuss. In the distant future, when consumer-electronics device vendors upgrade their kernels, they'll have one less out-of-tree patch to apply in the process. Beyond that, it is unlikely that anybody will notice any difference.


"Load tracking" refers to the kernel's attempts to track how much load each running process will put on the system's CPUs. Good load tracking can yield reasonable predictions about the near-future demands on the system; those, in turn, can be used to optimize the placement of processes and the selection of CPU-frequency parameters. Obviously, poor load tracking will lead to less-than-optimal results. While achieving perfection in load tracking seems unlikely for now, it appears that it is possible to do better than current kernels do. The utilization estimation patch set from Patrick Bellasi is the latest in a series of efforts to make the scheduler's load tracking work well with a wider variety of workloads.

Until relatively recently, the kernel had no notion of how much load any process was putting on the system at all. It tracked a process's total CPU utilization, but that is different from — and less useful than — tracking how much of the available CPU time that process has been using recently. In 2013, the per-entity load-tracking (PELT) mechanism was merged; it maintains a running average of each process's CPU demands. That average decays quickly over time, so that a process's recent behavior is weighted much more heavily than its distant past. The PELT values are maintained (and continue to decay) while processes are blocked, giving a better overall view of their utilization.

The addition of PELT improved the scheduler considerably. It became possible to estimate just how much CPU a given mix of processes is likely to need and to distribute those processes across the system in a way that loads all CPUs equally. The addition of the "schedutil" CPU-frequency governor enabled the kernel to set the operating frequencies of the CPUs at the level needed to service the current load, but no higher. In short, PELT is regarded as a clear step forward for the kernel's CPU scheduler.

That does not mean that PELT is perfect, though; indeed, developers have been running into its limitations almost since it was merged. The mobile and embedded community seems to complain the loudest. The biggest concern is almost always responsiveness: PELT can take too long to respond to changes in the system workload. A user who starts a browser on a mobile device wants it to respond quickly, but PELT will take a few 32ms measurement cycles to fully understand the load that the browser is placing on the system. During that time, the browser may be scheduled inappropriately (alongside other CPU-intensive tasks, for example) and the CPU it is running on may not be operating at as high a frequency as it should be. In fact, running such a task on a CPU that is running at a slower frequency will cause PELT to take even longer to generate a realistic estimate.
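That ramp-up delay can be quantified with the same kind of toy decay model (again a Python simplification, assuming the signal's weight halves every 32 periods): a task that suddenly becomes fully CPU-bound needs more than three full 32-period windows before its tracked utilization reaches 90% of its real demand.

```python
# How long does a PELT-like average take to catch up with a task that
# suddenly becomes fully CPU-bound? Toy model: per-period decay factor
# chosen so that the signal's weight halves every 32 periods.

DECAY = 0.5 ** (1 / 32)

avg = 0.0
periods = 0
while avg < 0.9:               # until 90% of the true demand (1.0)
    avg = avg * DECAY + (1.0 - DECAY)
    periods += 1
print(periods)                 # over 100 periods: several full windows
```

With 32ms windows, that is on the order of 100ms of mis-estimation, which is an eternity when frames are being rendered every 16ms.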

In the first posting of the utilization estimation patch set (in August 2017), Bellasi expressed the problem another way:


In the mobile world, where some of the most important tasks are synchronized with the frame buffer refresh rate, it's quite common to run tasks on a 16ms period. This 16ms window is the time in which everything happens for the generation of a new frame, thus it's of paramount importance to know exactly how much CPU bandwidth is required by every task running in such a time frame.

PELT operates on a rather longer time scale than 16ms, so several frames will have gone by before it gets a handle on the load presented by such a process. One can, of course, change PELT's accumulation periods, but that still leaves an unwanted ramp-up period and doesn't address some of the related issues. For example, the load estimates from PELT tend to vary over time as a result of the decay algorithm, even when the processes involved are running regularly. If a process sleeps for a period of time without work to do, its load estimate will quickly decay toward zero, meaning that the scheduler no longer has useful information about its needs once it starts to run again.

Various attempts have been made over time to improve the performance of PELT in this setting. The window-assisted load tracking (WALT) algorithm works mostly by eliminating the decay and only looking at recent behavior. WALT has shipped in some devices, but has not found its way into the mainline, perhaps out of fear of worsening load tracking for other use cases. Qualcomm went further by replacing much of the scheduler entirely with its out-of-tree variant tuned for its systems. This code has not even been posted to the kernel mailing lists, much less seriously considered for mainline inclusion.

The current utilization estimation work takes a simpler approach that has a better chance of working across all use cases. It is based on the observation that, while PELT may struggle to properly characterize processes that have not been running for long, its measurement of how much CPU a process needed by the time it stops running and goes back to sleep is good. But PELT quickly decays that information away and has to start over the next time the process begins running. If the kernel were to track those end-of-run measurements, it would have a better idea of what the process will need the next time it starts running.

So the utilization estimation patches do not change the PELT algorithm at all. Instead, whenever a process becomes non-runnable, the current utilization value is added into a new running average that represents the kernel's best guess for what the process will need the next time it runs. That average is designed to change relatively slowly, and it is not decayed while a process is not runnable, so the full value will still be there even after a long sleep.

Whenever the system needs to look at the load created by a given running process, either to calculate overall CPU loads or to set CPU frequencies, it will take the greater of the saved estimate or the current load as calculated by PELT. The estimate, in other words, is used as a lower bound when calculating a process's load; if PELT comes up with a higher value, that value will be used. When a given process becomes runnable, its load will be immediately set to this saved estimate, giving the scheduler the information it needs to properly place the task and set CPU operating parameters.
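The mechanism can be sketched as follows; this is an illustrative Python simplification based on the description above, and the names and the 0.25 averaging weight are assumptions, not the patch set's actual code:

```python
# Sketch of the utilization-estimation idea: fold the PELT value into
# a slow-moving estimate at dequeue time, then use that estimate as a
# lower bound on the task's utilization. Names and the 0.25 weight
# are illustrative assumptions.

EST_WEIGHT = 0.25   # the estimate moves slowly toward new samples

class Task:
    def __init__(self):
        self.pelt = 0.0      # fast-moving PELT utilization (decays)
        self.util_est = 0.0  # slow-moving estimate (kept across sleeps)

    def dequeue(self):
        """Task stops running: remember what it needed this time."""
        self.util_est += EST_WEIGHT * (self.pelt - self.util_est)

    def utilization(self):
        """What load balancing and CPU-frequency selection consult:
        the estimate is a floor; PELT can only raise the value."""
        return max(self.pelt, self.util_est)

t = Task()
t.pelt = 0.8            # the task ran hard; PELT eventually caught up
t.dequeue()             # the estimate absorbs part of that run
t.pelt = 0.01           # a long sleep decays PELT almost to zero
print(t.utilization())  # the estimate survives the sleep
```

When the task wakes again, the scheduler starts from the surviving estimate rather than from a decayed near-zero value, avoiding the ramp-up delay described earlier.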

The cost of the new estimation code is approximately a 1% performance hit when running the perf bench sched messaging benchmark (also known as "hackbench"), which stresses context-switch performance. That may be a hit that users with long-running, throughput-oriented workloads don't want to take, so the patch set leaves utilization estimation off by default. Enabling it requires setting the UTIL_EST scheduler feature bit.

The patch set has received little in the way of review comments as of this writing. Getting scheduler changes into the mainline is always difficult because the chances of regressing somebody's workload tend to be high. In this case, though, the existing load-tracking code is left carefully untouched, so the probability of regressions should be quite low. Perhaps that will be enough to make some progress on this longstanding scheduler issue in the mainline.

Comments (2 posted)

The Cloud Native Computing Foundation (CNCF) held its conference, KubeCon + CloudNativeCon, in December 2017. The 4,000 attendees at this gathering in Austin, Texas outnumbered those of any previous KubeCon, which shows the rapid growth of the community building around Kubernetes, the tool that was announced by Google in 2014. Large corporations are also taking a larger part in the community, with major players in the industry joining the CNCF, which is a project of the Linux Foundation. The CNCF now counts among its members three of the largest cloud hosting businesses (Amazon, Google, and Microsoft), as well as emerging companies from Asia like Baidu and Alibaba.

In addition, KubeCon saw an impressive number of diversity scholarships, which "include free admission to KubeCon and a travel stipend of up to $1,500, aimed at supporting those from traditionally underrepresented and/or marginalized groups in the technology and/or open source communities", according to Neil McAllister of CoreOS. The diversity team raised an impressive $250,000 to bring 103 attendees to Austin from all over the world.

We have looked into Kubernetes in the past but, considering the speed at which things are moving, it seems like a good time for an update on the projects surrounding this newly formed ecosystem.

The CNCF and its projects

The CNCF was founded, in part, to manage the Kubernetes software project, which was donated to it by Google in 2015. From there, the number of projects managed under the CNCF umbrella has grown quickly. It first added the Prometheus monitoring and alerting system, and then quickly went up from four projects in the first year, to 14 projects at the time of this writing, with more expected to join shortly. The CNCF's latest additions to its roster are Notary and The Update Framework (TUF, which we previously covered), both projects aimed at providing software verification. Those add to the already existing projects which are, bear with me, OpenTracing (a tracing API), Fluentd (a logging system), Linkerd (a "service mesh", which we previously covered), gRPC (a "universal RPC framework" used to communicate between pods), CoreDNS (DNS and service discovery), rkt (a container runtime), containerd (another container runtime), Jaeger (a tracing system), Envoy (another "service mesh"), and Container Network Interface (CNI, a networking API).

This is an incredible diversity, if not fragmentation, in the community. The CNCF made this large diagram depicting Kubernetes-related projects, so large that you will have a hard time finding a monitor that can display the whole graph without scaling it down. The diagram shows hundreds of projects, and it is hard to comprehend what all those components do, whether they are all necessary, or how they overlap. For example, Envoy and Linkerd are similar tools, yet both are under the CNCF umbrella; and that ignores two more such projects presented at KubeCon (Istio and Conduit). You could argue that all these tools have different focuses and functionality, but it still means you need to learn about all of them to pick the right one, which may discourage and confuse new users.

You may notice that containerd and rkt are both projects of the CNCF, even though they overlap in functionality. There is also a third Kubernetes runtime, CRI-O, built by Red Hat. This kind of fragmentation leads to significant confusion within the community as to which runtime to use, or whether to even care. We'll run a separate article about CRI-O and the other runtimes shortly to try to clarify this.

Regardless of this complexity, the space does seem to be maturing. In his keynote, Dan Kohn, executive director of the CNCF, announced 1.0 releases for four projects: CoreDNS, containerd, Fluentd, and Jaeger. Prometheus also had a major 2.0 release, which we will cover in a separate article.

There were significant announcements at KubeCon for projects that are not directly under the CNCF umbrella. Most notable for operators concerned about security is the introduction of Kata Containers, which is basically a merge of runV from Hyper.sh and Intel's Clear Containers projects. Kata Containers, introduced during a keynote by Intel's VP of the software and services group, Imad Sousou, are virtual-machine-based containers, or, in other words, containers that run in a hypervisor instead of under the supervision of the Linux kernel. The rationale here is that containers are convenient but all run on the same kernel, so the compromise of a single container can leak into all containers on the same host. This may be unacceptable in certain environments, for example for multi-tenant clusters where containers cannot trust each other.

Kata Containers promises the "best of both worlds" by providing the speed of containers and the isolation of VMs. It does this by using minimal custom kernel builds, to speed up boot time, and parallelizing container image builds and VM startup. It also uses tricks like same-page memory sharing across VMs to deduplicate memory across virtual machines. It currently works only on x86 and KVM, but it integrates with Kubernetes, Docker, and OpenStack. There was a talk explaining the technical details; that page should eventually feature video and slide links.

Industry adoption

As hinted earlier, large cloud providers like Amazon Web Services (AWS) and Microsoft Azure are adopting the Kubernetes platform, or at least its API. The keynotes featured AWS prominently; Adrian Cockcroft (AWS vice president of cloud architecture strategy) announced the Fargate service, which introduces containers as "first-class citizens" in the Amazon infrastructure. Fargate should run alongside, and potentially replace, the existing Amazon EC2 Container Service (ECS), currently the standard way to deploy containers on Amazon: running Docker containers on EC2 (Elastic Compute Cloud) VMs.

This move by Amazon has been met with skepticism in the community. The concern is that Amazon could pull the plug on Kubernetes whenever it hinders the bottom line, as it did when it stopped selling Google's Chromecast products. This seems to be part of a changing strategy by the corporate sector in its adoption of free-software tools. While historically companies like Microsoft or Oracle have been hostile to free software, they now not only use free software but also release it. Oracle, for example, released what it called "Kubernetes Tools for Serverless Deployment and Intelligent Multi-Cloud Management", named Fn. Large cloud providers are getting certified by the CNCF for compliance with the Kubernetes API and other standards.

One theory to explain this adoption is that free-software projects are becoming on-ramps to proprietary products. In this strategy, as explained by InfoWorld, open-source tools like Kubernetes are merely used to bring consumers over to proprietary platforms. Sure, the client and the API are open, but the underlying software can be proprietary. The data and some magic interfaces, especially, remain proprietary. Key examples of this include the "serverless" services, which are currently not standardized at all: each provider has its own incompatible framework that could be a deliberate lock-in strategy. Indeed, a common definition of serverless, from Martin Fowler, goes as follows:

Serverless architectures refer to applications that significantly depend on third-party services (known as Backend as a Service or "BaaS") or on custom code that's run in ephemeral containers (Function as a Service or "FaaS").

By designing services that explicitly require proprietary, provider-specific APIs, providers ensure customer lock-in at the core of the software architecture. One of the upcoming battles in the community will be exactly how to standardize this emerging architecture.

And, of course, Kubernetes can still be run on bare metal in a colocation facility, but doing so only pays off at scale. In an enlightening talk, Dmytro Dyachuk explained that unless cloud costs hit $100,000 per month, users may be better off staying in the cloud. Indeed, that is where a lot of applications end up. During an industry roundtable, Hong Tang, chief architect at Alibaba Cloud, posited that the "majority of computing will be in the public cloud, just like electricity is produced by big power plants".

The question, then, is how that market will be split between the large providers. Indeed, according to a CNCF survey of 550 conference attendees: "Amazon (EC2/ECS) continues to grow as the leading container deployment environment (69%)". The CNCF also notes that on-premise deployment decreased for the first time in the five surveys it has run, to 51%, "but still remains a leading deployment". On-premise deployment, meaning in a company's own data center or a colocation facility, is the market these cloud companies are targeting: by getting users to run Kubernetes, the industry is betting that applications and content become more portable, and thus easier to migrate into the proprietary cloud.

Next steps

As the Kubernetes tools and ecosystem stabilize, major challenges emerge. Monitoring is a key issue as people realize it may be more difficult to diagnose problems in a distributed system than in the previous monolithic model, which people at the conference often referred to as "legacy" or the "old OS paradigm". Scalability is another challenge: while Kubernetes can easily manage thousands of pods and containers, you still need to figure out how to organize all of them and make sure they can talk to each other in a meaningful way.

Security is a particularly sensitive issue as deployments struggle to isolate TLS certificates or application credentials from applications. Kubernetes makes big promises in that regard and it is true that isolating software in microservices can limit the scope of compromises. The solution emerging for this problem is the "service mesh" concept pioneered by Linkerd, which consists of deploying tools to coordinate, route, and monitor clusters of interconnected containers. Tools like Istio and Conduit are designed to apply cluster-wide policies to determine who can talk to what and how. Istio, for example, can progressively deploy containers across the cluster to send only a certain percentage of traffic to newly deployed code, which allows detection of regressions. There is also work being done to ensure standard end-to-end encryption and authentication of containers in the SPIFFE project, which is useful in environments with untrusted networks.

Another issue is that Kubernetes is just a set of nuts and bolts to manage containers: users get all the parts and it's not always clear what to do with them to get a platform matching their requirements. It will be interesting to see how the community moves forward in building higher-level abstractions on top of it. Several tools competing in that space were featured at the conference: OpenShift, Tectonic, Rancher, and Kasten, though there are many more out there.

The 1.9 Kubernetes release should be coming out in early 2018; it will stabilize the Workloads API that was introduced in 1.8 and add Windows containers (for those who like .NET) in beta. There will also be three KubeCon conferences in 2018 (in Copenhagen, Shanghai, and Seattle). Stay tuned for more articles from KubeCon Austin 2017 ...

[We would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to attend KubeCon + CloudNativeCon.]

Comments (13 posted)