This edition contains the following feature content:

This week's edition also includes these inner pages:

Brief items: Brief news items from throughout the community.

Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

In one sense, the Stack Clash vulnerability that was announced on June 19 has not had a huge impact: thus far, at least, there have been few (if any) stories of active exploits in the wild. At other levels, though, this would appear to be an important vulnerability, in that it has raised a number of questions about how the community handles security issues and what can be expected in the future. The indications, unfortunately, are not all positive.

A quick review for those who are not familiar with this vulnerability may be in order. A process's address space is divided into several regions, two of which are the stack and the heap areas. The stack contains short-lived data tied to the running program's call chain; it is normally placed at a high address and it grows automatically (toward lower addresses) if the program accesses memory below the stack's current lower boundary. The heap, instead, contains longer-lived memory and grows upward. The kernel, as of 2010, places a guard page below the stack in an attempt to prevent the stack from growing into the heap area.

The "Stack Clash" researchers showed that it is possible to jump over this guard page with a bit of care. The result is that programs that could be fooled into using a lot of stack space could be made to overwrite heap data, leading to their compromise; setuid programs are of particular concern in this scenario. The fix that has been adopted is to turn the guard page into a guard region of 1MB; that, it is hoped, is too much to be readily jumped over.
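The jump can be pictured with a toy address model. The sketch below is not kernel code; the function name, the addresses, and the 64KB allocation are all invented for illustration, and it assumes the stack has already grown as close to the heap as the guard gap allows. A single access that lands past the gap reaches the heap without any fault being raised.

```python
# Toy model only: addresses are plain integers, not real memory, and the
# policy below is a simplification of what a kernel does.
KB = 1024
MB = 1024 * KB

def touch(addr, stack_low, heap_top, guard_size):
    """Classify one memory access made just below the stack.

    stack_low  -- lowest address currently mapped for the stack
    heap_top   -- first address above the heap mapping
    guard_size -- gap that must remain between stack and heap
    """
    if addr >= stack_low:
        return "stack"        # ordinary stack access
    if addr < heap_top:
        return "heap"         # silent clash: nothing notices the jump
    if addr - heap_top >= guard_size:
        return "stack grows"  # growth still leaves the full gap
    return "fault"            # growth would eat into the gap

def first_write(guard_size, alloc_size, heap_top=0x7000_0000):
    # Assumption: the stack already sits exactly one gap above the heap.
    stack_low = heap_top + guard_size
    # A large local buffer moves the stack pointer down in one step;
    # the first write goes to the buffer's lowest address.
    return touch(stack_low - alloc_size, stack_low, heap_top, guard_size)

small_gap = first_write(guard_size=4 * KB, alloc_size=64 * KB)
large_gap = first_write(guard_size=1 * MB, alloc_size=64 * KB)
```

In this model, the 64KB step clears a 4KB gap entirely, while the same step lands inside a 1MB gap and is caught. An allocation larger than 1MB would still clear the bigger gap, which is why the fix is described as making the jump hard rather than impossible.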

Not a new problem

There are certain developers working in the security area who are quite fond of saying "I told you so". In this case, it would appear that they really told us so. An early attempt to deal with this problem can be found in this 2004 patch from Andrea Arcangeli, which imposed a gap of a configurable size between the stack and the heap. Despite being carried in SUSE's kernels for some time, this patch never found its way into the mainline.

In 2010, an X server exploit took advantage of the lack of isolation between the stack and the heap, forcing a bit of action in the kernel community; the result was a patch from Linus Torvalds adding a single guard page (with no configurability) at the bottom of the stack. It blocked the X exploit, and many people (and LWN) proclaimed the problem to be solved. Or, at least, it seemed solved once the various bugs introduced by the initial fix were dealt with.

In the comments to the above-linked LWN article (two days after it was published), Brad Spengler and "PaX Team" claimed that a single-page gap was insufficient. More recently, Spengler posted a blog entry in his classic style on how they told us about this problem but it never got fixed because nobody else knows what they are doing. The thing they did not do, but could have done if they were truly concerned about the security of the Linux kernel, was to post a patch fixing the problem properly.

Of course, nobody else posted such a patch either; the community can only blame itself for not having fixed this problem. Perhaps LWN shares part of that blame for presenting the problem as being fixed when it was not; if so, we can only apologize and try to do better in the future. But we might argue that the real problem is a lack of people who are focused on the security of the kernel itself. There are few developers indeed whose job requires them to, for example, examine and address stack-overrun threats. Ensuring that this problem was properly fixed was not anybody's job, so nobody did it.

The corporate world supports Linux kernel development heavily, but there are ghetto areas that, seemingly, every company sees as being somebody else's problem; security is one of those. The situation has improved a little in recent times, but the core problem remains.

Meanwhile, one might well ask: has the stack problem truly been fixed this time? One might answer with a guarded "yes" — once the various problems caused by the new patch are fixed, at least; a 1MB gap is likely to be difficult for an attacker to jump over. But it is hard to be sure, anymore.

Embargoes

Alexander "Solar Designer" Peslyak is the manager of both the open oss-security and the closed "distros" list; the latter is used for the discussion of vulnerabilities that have not yet been publicly disclosed. The normal policy for that list is that a vulnerability disclosed there can only be kept under embargo for a period of two weeks; it is intended to combat the common tendency for companies to want to keep problems secret for as long as possible while they prepare fixes.

As documented by Peslyak, the disclosure of Stack Clash did not follow that policy. The list was first notified of a problem on May 3, with the details disclosed on May 17. The initial disclosure date of May 30 was pushed back by Qualys until the actual disclosure date of June 19. Peslyak made it clear that he thought the embargo went on for too long, and that the experience would not be repeated in the future.

The biggest problem with the extended embargo, perhaps, was that it kept the discussion out of the public view for too long. The sheer volume on the (encrypted) distros list was, evidently, painful to deal with after a while. But the delay also kept eyes off the proposed fix, with the result that the patches merged by the disclosure date contained a number of bugs. The urge to merge fixes as quickly as possible is not really a function of embargo periods, but long embargoes fairly clearly delay serious review of those fixes. Given the lack of known zero-day exploits, it may well have been better to disclose the problem earlier and work on the fixes in the open.

That is especially true since, according to Qualys, the reason for the embargo extension was that the fixes were not ready. The longer embargo clearly did not result in readiness. There was a kernel patch of sorts, but the user-space side of the equation is in worse shape. A goal like "recompile all userland code with GCC's -fstack-check option" was never going to happen in a short period anyway, even if -fstack-check were well suited to this application — which it currently is not.

There is a related issue in that OpenBSD broke the embargo by publicly committing a patch to add a 1MB stack guard on May 18 — one day after the private disclosure of the problem. This has raised a number of questions, including whether OpenBSD (which is not a member of the distros list) should be included in embargoed disclosures in the future. But perhaps the most interesting point to make is that, despite this early disclosure, all hell stubbornly refused to break loose in its aftermath. Peslyak noted that:

This matter was discussed, and some folks were unhappy about OpenBSD's action, but in the end it was decided that since, as you correctly say, the underlying issue was already publicly known, OpenBSD's commits don't change things much.

As was noted above, the "underlying issue" has been known for many years. A security-oriented system abruptly making a change in this area should be a red flag for those who follow commit streams in the hope of finding vulnerabilities. But there appears to be no evidence that this disclosure — or the other leaks that apparently took place during the long embargo — led to exploits being developed before the other systems were ready. So, again, it's not clear that the lengthy embargo helped the situation.

Offensive CVE assignment

Another, possibly discouraging, outcome from this whole episode was a demonstration of the use of CVE numbers as a commercial weapon. It arguably started with this tweet from Kurt Seifried, reading: "CVE-2017-1000377 Oh you thought running GRsecurity PAX was going to save you?". CVE-2017-1000377, filed by Seifried, states that the grsecurity/PaX patch set also suffers from the Stack Clash vulnerability — a claim which its developers dispute. Seifried has not said whether he carried out these actions as part of his security work at Red Hat, but Spengler, at least, clearly sees a connection there.

Seifried's reasoning appears to be based on this text from the Qualys advisory sent to the oss-security list:

In 2010, grsecurity/PaX introduced a configurable stack guard-page: its size can be modified through /proc/sys/vm/heap_stack_gap and is 64KB by default (unlike the hard-coded 4KB stack guard-page in the vanilla kernel). Unfortunately, a 64KB stack guard-page is not large enough, and can be jumped over with ld.so or gettext().

The advisory is worth reading in its entirety. It describes an exploit against sudo under grsecurity, but that exploit depended on a second vulnerability and disabling some grsecurity protections. With those protections enabled, Qualys says, a successful exploit could take thousands of years.

It is thus not entirely surprising that Spengler strongly denied that the CVE number was valid; in his unique fashion, he made it clear that he believes the whole thing was commercially motivated; "this taints the CVE process," he said. Seifried defended the CVE as "legitimate" but suggested that he was getting tired of the whole show and might give up on it.

Meanwhile Spengler, not to be outdone, filed for a pile of CVE numbers against the mainline kernel, to the befuddlement of Andy Lutomirski, the author of much of the relevant code. Spengler made it appear that this was a retaliatory act and suggested that Lutomirski talk to Seifried about cleaning things up. "I am certain he will treat a member of upstream Linux the same as I've been treated, as he is a very professional and equitable person."

The CVE mechanism was created as a way to make it easier to track and talk about specific vulnerabilities. Some have questioned its value, but there does seem to be a real use for a unique identifier for each problem. If, however, the CVE assignment mechanism becomes a factory for mudballs to be thrown at the competition, it is likely to lose whatever value it currently has. One can only hope that the community will realize that turning the CVE database into a cesspool of fake news will do no good for anybody and desist from this kind of activity.

In conclusion

Our community's procedures for dealing with security issues have been developed over decades and, in many ways, they have served us well over that time. But they are also showing some signs of serious strain. The lack of investment in the proactive identification and fixing of security issues before they become an emergency has hurt us a number of times and will continue to do so. The embargo processes we have developed are clearly not ideal and could use improvement — if we only knew what form that improvement would take.

It is becoming increasingly apparent to the world as a whole that our industry's security is not at the level it needs to be. Hopefully, that will create some commercial incentives to improve the situation. But it also creates incentives to attack others rather than fixing things at home. That is going to lead to some increasingly ugly behavior; let us just hope that our community can figure out a way to solve problems without engaging in divisive and destructive tactics. Our efforts are much better placed in making Linux more secure for all users than in trying to take other approaches down a notch.

In his PyCon 2017 talk, Miguel Grinberg wanted to introduce asynchronous programming with Python to complete beginners. There is a lot of talk about asynchronous Python, especially with the advent of the asyncio module, but there are multiple ways to create asynchronous Python programs, many of which have been available for quite some time. In the talk, Grinberg took something of a step back from the intricacies of those solutions to look at what asynchronous processing means at a higher level.

He started by noting that while he does a lot of work on the Flask Python-based web microframework, this talk would not be about Flask. He did write the Flask Mega-Tutorial (and a book on Flask), but he would be trying to mention it less than ten times during the talk—a feat that he managed admirably. He has also developed a Python server for Socket.IO that started out as something for "that framework", but has since "taken on a life of its own".

He asked attendees if they had heard people say that "async makes your code go fast". If so, he said, his talk would explain why people say that. He started with a simple definition of "async" (as "asynchronous" is often shortened). It is one way of doing concurrent programming, which means doing many things at once. He is not only referring to asyncio here as there are many ways to have Python do more than one thing at once.

He then reviewed those mechanisms. First up was multiple processes, where the operating system (OS) does all the work of multi-tasking. For CPython (the reference Python implementation), that is the only way to use all the cores in the system. Another way to do more than one thing at once is by using multiple threads, which is also a way to have the OS handle the multi-tasking, but Python's Global Interpreter Lock (GIL) prevents multi-core concurrency. Asynchronous programming, on the other hand, does not require OS participation. There is a single process and thread, but the program can get multiple things done at once. He asked: "what's the trick?"

Chess

He turned to a real-world example of how this works: a chess exhibition, where a chess master takes on, say, 24 opponents simultaneously. "Before computers killed the fun out of chess", these kinds of exhibitions were done regularly, but he is not sure if they still are. If each game takes around 30 move pairs to complete, the master would require twelve hours to finish the matches if they were played consecutively (at one minute per move pair). By sequentially making moves in each game, though, the whole exercise can be completed in an hour. The master simply makes a move at a board (in, say, five seconds) and then goes on to the next, leaving the opponent lots of time to move before the master returns (after making 23 other moves). The master will "cream everyone" in that time, Grinberg said.
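The arithmetic behind those two figures can be checked in a few lines. The variable names are invented here, and the calculation assumes that every opponent has moved by the time the master returns to the board.

```python
OPPONENTS = 24
MOVE_PAIRS = 30            # move pairs per game
SECONDS_PER_PAIR = 60      # one minute per move pair, one game at a time
MASTER_MOVE_SECONDS = 5    # time the master spends at each board

# Playing the games one after another: every move pair costs a full minute.
sequential_hours = OPPONENTS * MOVE_PAIRS * SECONDS_PER_PAIR / 3600

# Playing simultaneously: one lap of all boards costs 24 * 5 seconds, and
# each lap advances every game by one move pair.
lap_seconds = OPPONENTS * MASTER_MOVE_SECONDS
simultaneous_hours = MOVE_PAIRS * lap_seconds / 3600

print(sequential_hours, simultaneous_hours)
```

Each opponent gets nearly two minutes to reply while the master walks the other 23 boards, so the master never has to wait.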

It is "this kind of fast" that people are talking about for async programming. The chess master is not optimized to go faster; rather, the work is arranged so that they do not waste time waiting. "That is the complete secret" to asynchronous programming, he said, "that's how it works". In that case, the CPU is the chess master and it waits the least amount of time possible.

But attendees are probably wondering how that can be done using just one process and one thread. How is async implemented? One thing that is needed is a way for functions to suspend and resume their execution. They will suspend when they are waiting and resume when the wait is over. That sounds like a hard thing to do, but there are four ways to do that in Python without involving the OS.

The first way is with callback functions, which is "gross", he said; so gross, in fact, that he was not even going to give an example of that. Another is using generator functions, which have been a part of Python for a long time. More recent Pythons, starting with 3.5, have the async and await keywords, which can be used for async programs. There is also a third-party package, greenlet, that has a C extension to Python to support suspend and resume.
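As a small illustration of the generator approach (not one of Grinberg's examples), the sketch below shows a function suspending at a yield and being resumed later by its caller. The events list is only there to make the ordering visible.

```python
events = []

def hello():
    events.append("Hello")
    yield                      # suspend here; control returns to the caller
    events.append("World!")

task = hello()                 # nothing runs yet; this only creates the generator
next(task)                     # run up to the yield
events.append("caller does other work")

finished = False
try:
    next(task)                 # resume right after the yield
except StopIteration:
    finished = True            # the function ran to completion

print(events)
```

The caller decides when the function continues, which is the hook that a scheduler needs.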

There is another piece needed to support asynchronous programming: a scheduler that keeps track of suspended functions and resumes them at the right time. In the async world, that scheduler is called an "event loop". When a function suspends, it returns control to the event loop, which finds another function to start or resume. This is not a new idea; it is effectively the same as "cooperative multi-tasking" that was used in old versions of Windows and macOS.
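A toy scheduler along these lines fits in a couple of dozen lines. This is only a sketch of the idea, not how asyncio is implemented; it assumes that each task is a generator that yields the number of seconds it wants to wait, and the names run() and hello() are made up for the example.

```python
import heapq
import time

def run(tasks):
    """Toy event loop: resume each generator once its wait has expired."""
    ready = []  # heap of (wake_time, sequence_number, generator)
    for seq, task in enumerate(tasks):
        heapq.heappush(ready, (time.monotonic(), seq, task))
    seq = len(ready)
    while ready:
        wake, _, task = heapq.heappop(ready)
        delay = wake - time.monotonic()
        if delay > 0:
            time.sleep(delay)      # nothing is runnable; idle until the next wake-up
        try:
            wait = next(task)      # start or resume the task
        except StopIteration:
            continue               # the task finished; drop it
        # The task suspended itself; remember when to resume it.
        heapq.heappush(ready, (time.monotonic() + wait, seq, task))
        seq += 1

log = []

def hello(name):
    log.append(f"Hello {name}")
    yield 0.1                      # "sleep" by handing control back to the loop
    log.append(f"World {name}")

run([hello("a"), hello("b"), hello("c")])
print(log)
```

All three tasks print their greeting before any of them finishes, and the whole run takes about as long as a single wait.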

Examples

Grinberg created examples of a simple "hello world" program using some of the different mechanisms. He did not get to all of them in the presentation and encouraged the audience to look at the rest. He started with a simple synchronous example that had a function that slept for three seconds between printing "Hello" and "World!". If he called that in a loop ten times, it would take 30 seconds to complete since each function would run back to back.

He then showed two examples using asyncio. They were essentially the same, but one used the @coroutine decorator for the function and yield from in the body (the generator function style), while the other used async def for the function and await in the body. Both used the asyncio version of the sleep() function to sleep for three seconds between the two print() calls. Beyond those differences, and some boilerplate to set up the event loop and call the function from it, the two functions had the same core as the original example. The non-boilerplate differences are by design; asyncio makes the places where code suspends and resumes "very explicit".

The two programs are shown below:

    # async/await version
    import asyncio

    loop = asyncio.get_event_loop()

    async def hello():
        print('Hello')
        await asyncio.sleep(3)
        print('World!')

    if __name__ == '__main__':
        loop.run_until_complete(hello())

    # @coroutine decorator version
    import asyncio

    loop = asyncio.get_event_loop()

    @asyncio.coroutine
    def hello():
        print('Hello')
        yield from asyncio.sleep(3)
        print('World!')

    if __name__ == '__main__':
        loop.run_until_complete(hello())

Running the program gives the expected result (three seconds between the two strings), but it gets more interesting if you wrap the function call in a loop. If the loop is for ten iterations, the result will be ten "Hello" strings, a three-second wait, then ten "World!" strings.
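One way to write that ten-iteration version is sketched below. It is not taken from the talk: it uses asyncio.gather() and asyncio.run(), the latter of which only arrived in Python 3.7, and it collects the strings in a list rather than printing them so that the ordering is easy to inspect.

```python
import asyncio

async def hello(log, delay):
    log.append("Hello")
    await asyncio.sleep(delay)     # suspend; the loop runs the other calls
    log.append("World!")

async def main(count=10, delay=3):
    log = []
    # Schedule all of the calls at once and wait for every one to finish.
    await asyncio.gather(*(hello(log, delay) for _ in range(count)))
    return log
```

Calling asyncio.run(main()) returns ten "Hello" entries followed by ten "World!" entries after roughly three seconds, rather than the 30 seconds that the synchronous version needs.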

There are other examples for mechanisms beyond asyncio, including for greenlet and Twisted. The greenlet examples look almost exactly the same as the synchronous example, just using a different sleep(). That is because greenlet tries to make asynchronous programming transparent, but hiding those differences can be a blessing and a curse, Grinberg said.

Pitfalls

There are some pitfalls in asynchronous programming and people "always trip on these things". If there is a task that requires heavy CPU use, nothing else will be done while that calculation is proceeding. In order to let other things happen, the computation needs to release the CPU periodically. That could be done by sleeping for zero seconds, for example (using await asyncio.sleep(0)).
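A minimal sketch of that technique follows, with an invented crunch() task that does its work in chunks and a ticker() task that stands in for everything else the program wants to do. Neither comes from the talk.

```python
import asyncio

async def crunch(log, chunks):
    total = 0
    for chunk in range(chunks):
        total += sum(i * i for i in range(10_000))   # one slice of CPU-bound work
        log.append(f"crunch {chunk}")
        await asyncio.sleep(0)     # release the CPU so other tasks can run
    return total

async def ticker(log, ticks):
    for tick in range(ticks):
        log.append(f"tick {tick}")
        await asyncio.sleep(0)

async def main():
    log = []
    await asyncio.gather(crunch(log, 3), ticker(log, 3))
    return log
```

With the zero-second sleep in place, the two tasks take turns; without it, crunch() would run all of its chunks before ticker() got to run at all.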

Much of the Python standard library is written in blocking fashion, however, so the socket, subprocess, and threading modules (and other modules that use them) and even simple things like time.sleep() cannot be used in async programs. All of the asynchronous frameworks provide their own non-blocking replacements for those modules, but that means "you have to relearn how to do these things that you already know how to do", Grinberg said.
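The effect of a blocking call is easy to see by running the same pair of tasks twice, once with time.sleep() and once with asyncio.sleep(). The worker() and run_pair() names below are illustrative only.

```python
import asyncio
import time

async def worker(name, log, blocking):
    log.append(f"{name} start")
    if blocking:
        time.sleep(0.05)           # blocks the whole thread, event loop included
    else:
        await asyncio.sleep(0.05)  # suspends only this task
    log.append(f"{name} end")

async def run_pair(blocking):
    log = []
    await asyncio.gather(worker("A", log, blocking), worker("B", log, blocking))
    return log
```

With the blocking sleep, task B cannot even start until A has finished; with the asyncio version, both start before either one ends.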

Eventlet and gevent, which are built on greenlet, both monkey patch the standard library to make it async compatible, but that is not what asyncio does. It is a framework that does not try to hide the asynchronous nature of programs. asyncio wants you to think about asynchronous programming as you design and write your code.

Comparison

He concluded his talk with a comparison of processes, threads, and async in a number of different categories. All of the techniques optimize the waiting periods; processes and threads have the OS do it for them, while async programs and frameworks do it for themselves. Only processes can use all cores of the system, however; threads and async programs cannot. That leads some to write programs that combine one process per core with threads and/or async functions, which can work quite well, he said.

Scalability is "an interesting one". Running multiple processes means having multiple copies of Python, the application, and all of the resources used by both in memory, so the system will run out of memory after a fairly small number of simultaneous processes (tens of processes are a likely limit), Grinberg said. Threads are more lightweight, so there can be more of those, on the order of hundreds. But async programs are "extremely lightweight", such that thousands or tens of thousands of simultaneous tasks can be handled.
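The async end of that scale can be tried out directly. The sketch below starts ten thousand tasks in one process and one thread; the client() coroutine is a stand-in that merely waits, so it says nothing about how a real server would behave under that load.

```python
import asyncio

async def client(done):
    await asyncio.sleep(0.1)       # stand-in for waiting on a network peer
    done.append(1)

async def main(count=10_000):
    done = []
    # Every task is suspended at the same time, all within a single thread.
    await asyncio.gather(*(client(done) for _ in range(count)))
    return len(done)
```

Running asyncio.run(main()) completes in well under a second on ordinary hardware, since each waiting task costs little more than a small Python object.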

The blocking standard library functions can be used from both processes and threads, but not from async programs. The GIL only interferes with threads; processes and async can coexist with it just fine. But, he noted, there is only "some" interference from the GIL even for threads in his experience; when threads are blocked on I/O, they will not be holding the GIL, so the OS will give the CPU to another thread.
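That last point can be observed with a handful of threads that do nothing but block. In this sketch, time.sleep() stands in for a blocking read; like real blocking I/O in CPython, it releases the GIL while it waits. The function names are invented for the example.

```python
import threading
import time

def blocked_on_io(seconds):
    time.sleep(seconds)            # stand-in for a blocking read; the GIL is released

def run_threads(count, seconds):
    threads = [threading.Thread(target=blocked_on_io, args=(seconds,))
               for _ in range(count)]
    start = time.monotonic()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.monotonic() - start

elapsed = run_threads(count=5, seconds=0.2)
print(f"5 threads, 0.2s each: {elapsed:.2f}s elapsed")
```

The waits overlap, so the elapsed time is close to a single wait rather than five of them. Replacing the sleep with a pure-Python calculation would show the GIL serializing the threads instead.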

There are not many things that are better for async in that comparison. The main advantage to asynchronous programs for Python is the massive scaling they allow, Grinberg said. So if you have servers that are going to be super busy and handle lots of simultaneous clients, async may help you avoid going bankrupt from buying servers. The async programming model may also be attractive for other reasons, which is perfectly valid, but looking strictly at the processing advantages shows that scaling is where async really wins.

A YouTube video of Grinberg's talk is available; the Speaker Deck slides are similar to, but not the same as, what he used.

[I would like to thank The Linux Foundation for travel assistance to Portland for PyCon.]

The default apps on a mobile platform like Android are familiar targets for replacement, especially for developers concerned about security. But while messaging and voice apps (which can be replaced by Signal and Ostel, for instance) may be the best known examples, the non-profit Guardian Project has taken up the cause of improving the security features of the camera app. Its latest such project is ProofMode, an app to let users take photos and videos that can be verified as authentic by third parties.

Media captured with ProofMode is combined with metadata about the source device and its environment at capture time, then signed with a device-specific private PGP key. The result can be used to attest that the contents of the file have not been retouched or otherwise tampered with, and that the capture took place when and where the user says it did. For professional reporters or even citizen journalists capturing sensitive imagery, such an attestation provides a defense against accusations of fakery — an all-too-common response when critiquing those in positions of power. But making that goal accessible to real-world users has been a bit of a challenge for the Guardian Project.

CameraV

It is widely accepted that every facet of digital photography has both an upside and a downside. Digital cameras are cheap and carry no film or development costs, but digital images are impermanent and easily erased. Instant cloud storage and online sharing make media distribution easy, but do so at the cost of privacy and individual ownership. Perhaps nowhere is the dichotomy more critical, however, than in the case of news photography. Activists, journalists, and ordinary citizens have documented important world events using the cameras in their mobile devices, capturing everything from political uprisings to sudden acts of unspeakable violence. The flipside, though, is that the authenticity of digital photos and videos is hard to prove, and detractors are wont to dismiss any evidence that they don't like as fake.

Improving the verification situation was the goal of the Guardian Project's 2015 app, CameraV. The app provided a rather complex framework for attesting to the untampered state of recorded images, which the team eventually decided was inhibiting its adoption by journalists, activists, and other potential users. ProofMode is an attempt to whittle the CameraV model back to its bare essentials. Nevertheless, a quick look at CameraV is useful for understanding the approach.

CameraV attests to the unmodified state of an image by taking a snapshot of the device's sensor readings the same instant that the photograph is taken. The sensor data recorded for the snapshot is user-configurable, consisting of geolocation data (including magnetometer readings, GPS, and network location information), accelerometer readings, and environmental sensors (such as ambient light, barometric pressure, and temperature). Network device state, such as the list of visible Bluetooth devices and WiFi access points, can optionally be included as well. In addition, the standard Exif image tags (which include the make and model of the device as well as camera settings) are recorded. A full list is provided in the CameraV user's guide.

All of this metadata is stored in JSON Mobile Media Metadata (J3M) format and is appended to the image file, a process termed "notarization". The file is then MD5-hashed and the result signed with the user's OpenPGP key. CameraV provides another Android Intent service to let users verify the hash on any CameraV-notarized image they receive.
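The append-then-hash step can be sketched with Python's standard library. This is not CameraV's code: the metadata keys are invented rather than taken from the J3M schema, the way the metadata is attached to the file is an assumption, and the OpenPGP signing step is left out entirely.

```python
import hashlib
import json

def notarize(media_bytes, metadata):
    """Append JSON metadata to the media and hash the combined file."""
    blob = json.dumps(metadata, sort_keys=True).encode("utf-8")
    notarized = media_bytes + blob
    digest = hashlib.md5(notarized).hexdigest()   # MD5, as the article describes
    return notarized, digest

# Hypothetical sensor snapshot; real J3M field names will differ.
snapshot = {"latitude": 45.52, "longitude": -122.68, "ambient_light": 310}
photo = b"\xff\xd8 stand-in for JPEG data \xff\xd9"

notarized_file, digest = notarize(photo, snapshot)
print(digest)
```

A recipient who recomputes the hash over the received file and gets a different value knows that either the pixels or the metadata changed after notarization; the signature over the hash is what ties it to a particular key.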

The signature can be published with the image, enabling third parties to verify that the metadata matches what the photographer claims about the location and context of the image. In theory, some of that metadata (such as nearby cell towers) could also be verified by an outside source. The app can also generate a short SHA-1 fingerprint of the signed file intended to be sent out separately. This fingerprint is short enough to fit into an SMS message, so that users can immediately relay proof of their recording, even if they do not have a means to upload the image itself until later. Users can share their digitally notarized images to public services or publish them over Tor to a secure server that the user controls.
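The size claim is easy to confirm: a SHA-1 digest is 20 bytes, or 40 characters when written in hexadecimal, against the 160 characters available in a single standard SMS. The sketch below assumes hexadecimal encoding, which the article does not specify.

```python
import hashlib

SMS_LIMIT = 160   # characters in one standard (GSM 7-bit) text message

def fingerprint(signed_file_bytes):
    """Return the SHA-1 digest of a file as a hexadecimal string."""
    return hashlib.sha1(signed_file_bytes).hexdigest()

fp = fingerprint(b"stand-in for a signed, notarized image file")
print(fp, len(fp))
```

That leaves plenty of room in the message for a short note alongside the fingerprint.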

CameraV takes a number of steps to ensure that images are not altered while on the user's device, lest the app then be used to create phony attestations and undermine trust in the system. First, the MD5 hash of the image or video that is saved alongside the device-sensor metadata is computed over the raw pixel data (or raw video frames), as a mechanism to protect against the image being faked using some other on-camera app before the user publishes it for consumption. Second, the full internal file path of the raw image file is saved with the metadata, which serves as a record that the CameraV app is the source of the file. Third, app-specific encrypted storage is used for the device's local file storage, including the media, the metadata, and key material. Finally, the OpenPGP key used is specific to the app itself. The key is generated when the user first sets up CameraV; the installer prompts the user to take a series of photos that are used as input for the key-generation step.

Rethinking the complexity issues

CameraV's design hits a lot of the bullet points that security-conscious developers care about, but it certainly never gained a mass following. Among other stumbling blocks, the user had to decide in advance to use the CameraV app to record any potentially sensitive imagery. That might be fine for someone documenting human rights violations as a full-time job, but is less plausible for a spur-of-the-moment incident — and it does not work for situations where the user only realizes the newsworthiness of a photo or video after the fact. In addition, there may be situations where it is genuinely harmful to have detailed geolocation information stored in a photo, so using CameraV for all photos might frighten off some potential users.

Consequently, in 2016 the Guardian Project began working on a sequel of sorts to CameraV. That effort is what became ProofMode, which was first announced to the public on the project's blog in February 2017. The announcement describes ProofMode as a "reboot," but it is worth noting that CameraV remains available (through the Google Play Store as well as through the F-Droid repository) and is still being updated.

ProofMode essentially turns CameraV's metadata-recording process into a background service and makes it available to the user as a "share" action (through Android's Intent API). When any media is captured with any camera app, ProofMode takes a snapshot of the device sensor readings. The user then has the option of choosing "Share Proof" from their camera app's sharing menu.

At present, ProofMode offers three sharing options: "Notarize Only" (which shares only the SHA-1 fingerprint code), "Share Proof Only" (which shares a signed copy of the metadata files), and "Share Proof with Media" (which appends the metadata to the media file and signs the result, as in the CameraV case). Whichever option the user chooses, selecting it immediately brings up another "share" panel so the user can pick an app to finalize the action — thus directing the ProofMode file to email, SMS, a messaging app, Slack, or any other option that supports attaching files.

In March, Bruce Schneier posted about ProofMode on his blog, which spawned a series of in-depth questions in the comment section. As might be expected on such a public forum, the comments ranged from complaints about the minutiae of the app's approach to security to bold assertions that true authentication on a mobile device is unattainable.

Among the more specific issues, though, the commenters criticized ProofMode's use of unencrypted storage space, its practice of extracting the PGP private key into RAM with the associated passphrase, and how the keys are generated on the device. There were also some interesting questions about how a malicious user might be able to generate a fake ProofMode notary file by hand.

The Guardian Project's Nathan Freitas responded at length to the criticism in the comment thread, and later reiterated much of the same information on the Guardian Project blog. As to the lower-level security steps, he assured commenters that the team knew what it was doing (citing the fact that Guardian Project ported Tor to Android, for example) and pointed to open issues on the ProofMode bug tracker for several of the enhancements requested (such as the use of secure storage for credentials).

On other issues, Freitas contended that there may simply be a valid difference of opinion. For example, the on-device generation of key pairs may seem less than totally secure, but Freitas noted that the keys in question are app-specific and not designed for use as a long-term user identity. "Our thinking was more focused on integrity through digital signatures, with a bit of lightweight, transient identity added on." Nevertheless, he added, the project does have an issue open to port key storage to the Android Keystore system service.

Android also provides some APIs that can protect against tampering. Freitas said that the project has already integrated the SafetyNet API, which is used to detect if the app is running in an emulator (although ProofMode does not block this behavior; it simply notes it in the metadata store). In the longer term, the team is also exploring implementing stronger security features, such as more robust hashing mechanisms or the blockchain-based OpenTimestamps.

Ultimately, however, complexity is the enemy of growing a broad user base, at least from the Guardian Project's perspective. Freitas told the Schneier commenters that the goal is to provide notarization and security for "every day activists around the world, who may only have a cheap smartphone as their only computing device" rather than cryptographers. In an email, he also noted that ProofMode requires little to no training for users to understand, which is a stark contrast to the complexity of CameraV.

Verification versus anonymity

Given all the talk about recording sensor input and geolocation information, a privacy-conscious user might well ask whether or not CameraV and ProofMode take a step backward for those users who are interested in recording sensitive events but are also legitimately worried about being identified and targeted for their trouble. This is a real concern, and the Guardian Project has several approaches to addressing it.

The first is that CameraV and ProofMode both provide options for disabling some of the more sensitive metadata that can be captured. For now, that includes the network information and geolocation data. Second, potentially identifiable metadata like Bluetooth device MAC addresses are not recorded in the clear, but only in hashed form. And the project has an issue open to allow wiping ProofMode metadata files from a device.

For the extreme case, however — when a user might want to completely sanitize an image of all traceable information before publishing it — there is too little overlap with the intent of ProofMode, but the project has published a separate app that may fit the bill.

That anonymizing app is called ObscuraCam. It automatically removes geolocation data and the device make and model metadata from any captured photo. It also provides a mechanism for the user to block out or pixelate faces, signs, or other areas of the image that might be sensitive.

At the moment, it is not possible to use ObscuraCam in conjunction with ProofMode (attempting to do so crashes the ProofMode app), but the precise interplay between the two security models likely would require some serious thought anyway. Nevertheless, if anonymity is of importance, it is good to know there is an option.

In the pudding

In the final analysis, neither CameraV nor ProofMode is of much value if it remains merely a theoretical service: it has to be usable to real-world, end-user human beings. In my own personal tests, CameraV is complex enough that it is little surprise that it has not been adopted en masse. The first step after installation requires the user to set up a "secure database," the preferences screen is not particularly user-friendly, and the sharing features are high on detail but light on interface polish.

On the other hand, ProofMode makes serious strides forward in ease-of-use but, at present, it lacks the built-in documentation that a new user might require in order to make the right choices. If one has not read the ProofMode blog posts, the sharing options ("Notarize Only" and "Share Proof Only") might not be easy to decipher. Obviously, the project is still in pre-release mode, though, so there is plenty of reason to believe that the final version will hit the right notes.

Readers with long memories might also recall that the CameraV–ProofMode saga marks the second time that the Guardian Project developed a security app only to later refactor the code into a system service. The first instance was PanicKit, a framework for erasing device data from multiple apps that grew out of the project's earlier storage-erasing app InTheClear.

Freitas calls this a coincidence, however, rather than a development trend. With PanicKit, he said, the goal was to develop a service that third-party app developers would find useful, too. ProofMode, in contrast, was merely a simplification of the original concept designed to meet the needs of a broader audience. Regardless of how one looks at it, though, most will likely agree that if security features come built into the operating system at a lower level, eliminating the need to choose between "secure apps" and "insecure apps", then end users will benefit.

Comments (7 posted)

daxctl()

Persistent memory promises high-speed, byte-addressable access to storage, with consequent benefits for all kinds of applications. But realizing those benefits has turned out to present a number of challenges for the Linux kernel community. Persistent memory is neither ordinary memory nor ordinary storage, so traditional approaches to memory and storage are not always well suited to this new world. A proposal for a new system call, along with the ensuing discussion, shows how hard it can be to get the most out of persistent memory.

The "DAX" mechanism allows an application to map a file in persistent-memory storage directly into its address space, bypassing the kernel's page cache. Thereafter, data in the file can be had via a pointer, with no need for I/O operations or copying the data through RAM. So far, so good, but there is a catch: this mode really only works for applications that are reading data from persistent memory. As soon as the time comes to do a write, things get more complicated. Writes can involve the allocation of blocks on the underlying storage device; they also create metadata updates that must be managed by the filesystem. If those metadata updates are not properly flushed out, the data cannot be considered properly written.

The end result is that applications performing writes to persistent memory must call fsync() to be sure that those writes will not be lost. Even if the developer remembers to make those calls in all the right places, fsync() can create an arbitrary amount of I/O and, thus, impose arbitrary latencies on the calling application. Developers who go to the trouble of using DAX are doing so for performance reasons; such developers tend to respond to ideas like "arbitrary latencies" with poor humor at best. So they have been asking for a better solution.
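The pattern described above can be sketched in a few lines of C. This is a minimal illustration, assuming an ordinary file and a hypothetical helper name (write_via_mapping()); it runs on any filesystem, DAX or not, and simply shows where the flushing calls sit relative to the store through the mapping.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: store msg at the start of a file through a shared
 * mapping, then force data and metadata out.  Returns 0 on success. */
int write_via_mapping(const char *path, const char *msg)
{
    size_t len = strlen(msg);
    assert(len > 0);                    /* mmap() rejects zero-length maps */

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy(map, msg, len);              /* plain store, no write() call */

    int rc = 0;
    if (msync(map, len, MS_SYNC) < 0)   /* flush the dirty mapped range */
        rc = -1;
    if (fsync(fd) < 0)                  /* flush file data and metadata */
        rc = -1;

    munmap(map, len);
    close(fd);
    return rc;
}
```

The fsync() call is the step whose cost cannot be bounded by the application, which is the complaint that motivates the rest of the discussion.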

daxctl()

That is why Dan Williams wrote in the introduction to this patch series that "the full promise of byte-addressable access to persistent memory has only been half realized via the filesystem-dax interface". Realizing the other half requires getting the filesystem out of the loop when it comes to write access. If, say, a file could be set up so that no metadata changes would be needed in response to writes, the problem would simply go away. Applications would be able to write to DAX-mapped memory and, as long as they ensured that their own writes were flushed to persistent store (which can be done in user space with a couple of special instructions), there should be no concerns about lost metadata.

Williams's proposal to implement this approach requires a couple of steps. The first is that the application needs to call fallocate() to ensure that the file of interest actually has blocks allocated in persistent memory. Then it has to tell the kernel that the file is to be accessed via DAX and that the existing block allocations cannot be changed under any circumstances. That is done with a new system call:

int daxctl(char *path, int flags, int align);

Here, path indicates the file of interest, flags indicates the desired action, and align is a hint regarding the size of pages that the application would like to use. The DAXFILE_F_STATIC flag, if present, will put the file into the "no changes allowed" mode; if the flag is absent, the file becomes an ordinary file once again. While the static mode is active, any operation on the file that would force metadata changes (changing its length with truncate(), for example) will fail with an error code.
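The first of the two steps can be tried today; the second exists only in the patch series. The sketch below is an illustration under assumptions: it uses posix_fallocate() as a portable stand-in for the fallocate() call mentioned in the text, the helper name open_preallocated() is made up, and the daxctl() call appears only in a comment since no released kernel provides it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open path and make sure blocks are allocated for
 * its first size bytes.  Returns an open descriptor, or -1 on failure. */
int open_preallocated(const char *path, off_t size)
{
    assert(size > 0);

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* posix_fallocate() returns an error number directly, not -1/errno. */
    if (posix_fallocate(fd, 0, size) != 0) {
        close(fd);
        return -1;
    }

    /* Step 2 would be the proposed system call, roughly:
     *     daxctl(path, DAXFILE_F_STATIC, 0);
     * It is not invoked here because it is only a proposal. */
    return fd;
}
```

Preallocating first matters because, once the file is in the static mode, a write that needed a new block could no longer be satisfied.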

The implementation of this new mode would seem to require significant changes at the filesystem level, but it turns out that this functionality already exists. It is used by the swap subsystem which, when swapping to an ordinary file, needs to know where the blocks allocated to the file reside on disk. There are two pieces to this mechanism, the first of which is this address_space_operations method:

/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
sector_t (*bmap)(struct address_space *s, sector_t sector);

A call to bmap() will return the physical block number on which the given sector is located; the swap subsystem uses this information to swap pages directly to the underlying device without involving the filesystem. To ensure that the list of physical blocks corresponding to the swap file does not change, the swap subsystem sets the S_SWAPFILE inode flag on the file. Tests sprinkled throughout the virtual filesystem layer (and the filesystems themselves) will block any operation that would change the layout of a file marked with this flag.

This functionality is a close match to what DAX needs to make direct writes to persistent memory safe. So the daxctl() system call has simply repurposed this mechanism, putting the file into the no-metadata-changes mode while not actually swapping to it.

MAP_SYNC

Christoph Hellwig was not slow to register his opposition to this idea. He would rather not see the bmap() method used anywhere else in the kernel; it is, in his opinion, broken in a number of ways. Its use in swapping is also broken, he said, though "we manage to paper over the fact". He suggested that development should be focused instead on making DAX more stable before adding new features.

An alternative approach, proposed by Andy Lutomirski, has been seen before: it was raised (under the name MAP_SYNC) during the "I know what I'm doing" flag discussion in early 2016. The core idea here is to get the filesystem to transparently ensure that any needed metadata changes are always in place before an application is allowed to write to a page affected by those changes. That would be done by write-protecting the affected pages, then flushing any needed changes as part of the process of handling a write fault on one of those pages. In theory, this approach would allow for a lot of use cases blocked by the daxctl() technique, including changing the length of files, copy-on-write semantics, concurrent access, and more. It's a seemingly simple idea that hides a lot of complexity; implementing it would not be trivial.

Beyond implementation complexity, MAP_SYNC has another problem: it runs counter to the original low-latency goal. Flushing out the metadata changes to a filesystem can be a lengthy and complex task, requiring substantial amounts of CPU time and I/O. Putting that work into the page-fault handler means that page faults can take an arbitrarily long amount of time. As Dave Chinner put it:

Prediction for the MAP_SYNC future: frequent bug reports about huge, unpredictable page fault latencies on DAX files because every so often a page fault is required to sync tens of thousands of unrelated dirty objects because of filesystem journal ordering constraints.

There was some discussion about how the impact of doing metadata updates in the page-fault handler could be reduced, but nobody has come forth with an idea that would reduce it to zero. Those (such as Hellwig) who support the MAP_SYNC approach acknowledge that cost, but see it as being preferable to adding a special-purpose interface that brings its own management difficulties.

On the other hand, this work could lead to improvements to the swap subsystem as well, making it more robust and more compatible with filesystems (like Btrfs) whose copy-on-write semantics work poorly with the "no metadata changes" idea. There is another use case for this functionality: high-speed DMA directly to persistent memory also requires that the filesystem not make any unexpected changes to how the file is mapped. That, and the relative simplicity of Williams's patch, may help to push the daxctl() mechanism through, even though it is not universally popular.

Arguably, the real lesson from this discussion is that persistent memory is not a perfect match to the semantics provided by the Unix API and current filesystems. It may eventually become clear that a different type of interface is needed, at least for applications that want to get maximum performance from this technology. Nobody really knows what that interface should look like yet, though, so the current approach of trying to retrofit new mechanisms onto what we have now would appear to be the best way forward.

Comments (15 posted)

The CentOS distribution has long been a boon to those who want an enterprise-level operating system without an enterprise-level support contract—and the costs that go with it. In keeping with its server orientation, CentOS has been largely focused on x86 systems, but that has been changing over the last few years. Jim Perrin has been with the project since 2004 and his talk at Open Source Summit Japan (OSSJ) described the process of making CentOS available for the ARM server market; he also discussed the status of that project and some plans for the future.

Perrin is currently with Red Hat and is the maintainer of the CentOS 64-bit ARM (aarch64) build. CentOS is his full-time job; he works on building the community around CentOS as well as on some of the engineering that goes into it. His background is as a system administrator, including stints consulting for the defense and oil industries; with a bit of a grin, he said that he is "regaining a bit of my humanity" through his work at Red Hat on CentOS.

The initial work on CentOS for ARM started back in the CentOS 6 days targeting 32-bit ARMv6 and ARMv7 CPUs. That distribution is now six or seven years old and it was already old when the developers started working on an ARM version of it. The software in CentOS 6 was simply too old to effectively support ARM, Perrin said. The project ended up with a distribution that mostly worked, but not one it was happy to publish. It improperly mixed Fedora and RHEL components and was not up to the project's standards, so that build was buried.

In January 2015, which was after CentOS 7 was released, the project restarted using that base but targeting aarch64. There was "lots more support for ARM" in that code base, he said. After about six months, there was a working version of the distribution that he and other project members were happy with, so it was time to give the community access to it. Unfortunately, 64-bit ARM chips were not widely available in July 2015, so the project needed to decide where it wanted to go with the distribution.

Community

There are multiple parts of the CentOS community, each of which has its own needs. Hardware vendors are the first and foremost members of the community, because they must create the hardware that all of the others will use. If the hardware does not work well—or CentOS doesn't work well on it—no one will be interested in it.

The second group is the business partners of the hardware vendors. These are early adopters that get the hardware from the vendors and want to "kick things around" to see that the hardware is working for their use cases. CentOS needs to be able to provide help and support for these companies.

There are also early adopters who are not affiliated with the hardware vendors. They buy and break new hardware and are particularly vocal on social media. They will let it be known that they have this new hardware and what software is or isn't being supported on it. They have opinions and a project needs to take care of their needs, he said.

A group that is somewhat similar to early adopters is the maker community. The difference is that early adopters are going to try out business use cases using the system, while the makers will "blast it into space" or run it at the bottom of a lake. Any folks that do "that level of weird things with the hardware" deserve their own group, Perrin said.

Then there are the slower-moving parts of the community. Businesses will typically allow others to work out the bugs in the hardware and software before starting to use it; they have "a more cautious approach", he said. The last group is the end users, who are system administrators and others whose boss bought the hardware; they may not be particularly pleased about using the distribution, but they need to get work done so it is important to try to make their jobs easier.

Some of these communities are "more equal than others", which sounds backwards or odd coming from a community person, Perrin said. But what he is really talking about is timing; you don't need to worry about makers, say, until there is working hardware available. So CentOS needed to take a tiered approach to supporting its various communities.

It all starts with the hardware, naturally. Working with some of the larger vendors on building the distribution for their aarch64 server prototypes was the first step. That was facilitated by the unannounced arrival of hardware at his house. That was "fantastic, but really surprising". From the audience, Jon Masters, who had arranged for some of those shipments, jokingly warned attendees: "don't tell me your address". With a grin, Perrin said: "my electric bill does not thank you".

CentOS started by working with AppliedMicro; that was used as the reference platform starting in March 2015. After that, the project also worked with Cavium, Qualcomm, AMD, and some other vendors that are not public.

Once the hardware is supported, it is time to move on to the early adopters. It was not practical to work with the hardware vendors' business partners as it is not his job to manage those kinds of relationships, he said. But early adopters are different; CentOS wanted to work with folks who are going to be loud about using the distribution. From those efforts, the project learned about some optimizations for aarch64 that were not working well for some users, for example.

More packages

One of the biggest things that helped with that was working with the Fedora and Extra Packages for Enterprise Linux (EPEL) communities to get the EPEL packages working for aarch64. Those packages are valuable for day-to-day work on servers, he said. The CentOS project focused on making the hardware work and making a base set of packages, then getting out of the way. The EPEL group has been "fantastic at packaging up things they think people will need".

Part of the process of creating the distribution is figuring out what software the community wants. The short answer turns out to be that it wants "containers and virtualization". So one of the early projects was to get docker (with a small "d", "not with a large 'D' that is now trademarked", he said) running on aarch64. Docker is written in Go, which meant that the golang package needed to be built.

When the process started, though, the released version of golang was 1.4, which did not support aarch64. The project had to bootstrap a build of 1.5 beta using the 1.4 compiler in a container on an x86_64 system. That "failed immediately" because of a calculation done by docker to determine the page size. It is 4KB on x86_64, but 64KB on CentOS aarch64. That got fixed (and upstreamed) and CentOS was able to build docker by late 2015 or early 2016.
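The talk did not say exactly what docker's calculation looked like, so the following is only a generic illustration of the portable alternative: ask the running system for its page size rather than assuming 4096. The helper names are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Ask the running kernel for its page size instead of assuming 4KB. */
size_t runtime_page_size(void)
{
    long sz = sysconf(_SC_PAGESIZE);
    assert(sz > 0);
    return (size_t)sz;
}

/* Round len up to a whole number of pages; page must be a power of two. */
size_t round_up_to_page(size_t len, size_t page)
{
    assert(page != 0 && (page & (page - 1)) == 0);
    return (len + page - 1) & ~(page - 1);
}
```

A buffer of 4097 bytes rounds up to 8192 with 4KB pages but to 65536 with 64KB pages, which is the kind of difference that breaks code written with only x86_64 in mind.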

The availability of docker started to accelerate other development on CentOS for ARM. For example, Kubernetes is being ported. The same page-size problem cropped up there, but the Kubernetes developers are quite receptive to patches. Kubernetes 1.4 is "not 100% baked yet" for the distribution but is getting there.

On the virtualization side, users wanted OpenStack. They wanted to be able to do virtualization and virtualization management on ARM. As it turns out, though, the bulk of the need for OpenStack was for network function virtualization (NFV), rather than wanting OpenStack for its own sake. OpenStack is just a stepping stone to NFV, he said. The process of porting OpenStack is under active development right now.

Boring

The overall goal for the CentOS on ARM project is to "get to boring". The idea is that the distribution works just like every other distribution on every other architecture. For some platforms, it has been difficult at times to get to that point. There is a mindset of how software works in the embedded world that doesn't translate well to the server world. If the story is that "this system boots this way, this other one that way", it will not sit well with customers.

So a lot of work was put into community building within the hardware vendor community regarding standards. The idea is that ARM, Intel, AMD, and others all need to work the same way, install the same way, boot the same way, and so on. That means support for PXE, UEFI, ACPI, and so on. There is something of a balance required, though, because at the same time he is beating on the vendors to standardize, he is also asking them to provide hardware for makers and other early adopters.

At this point, there is a functional base distribution that matches what there is on the x86 side. The next step is to get things like Kubernetes and OpenStack working; after that is less clear. He is no longer a system administrator, so he is not attuned to what users may want and need. Part of coming to OSSJ was to hopefully gather some feedback on what users would like to see. He can take that back to Red Hat engineering as input for upcoming plans. Maybe there are tools and technologies that CentOS doesn't even know about that need to be added to the ARM ecosystem; he encouraged attendees to let him know what they are.

In answer to an audience question, Perrin said that installing CentOS for ARM is straightforward, much like the process for x86: download an ISO image, boot that from a USB drive or via PXE. Instructions are available on the wiki. That is for supported 64-bit ARM hardware; for 32-bit hardware, like Raspberry Pi, a specialized image is needed for the platform. The 64-bit Raspberry Pi 3 (RPi3) will ostensibly be supported in three months or so, he said, once the U-Boot bootloader gets UEFI support.

Masters spoke up to note that Perrin is "stuck with" some of the decisions that Masters made. One of those is the 64KB page size, which is good for servers but not as good for embedded use cases like Raspberry Pi. Red Hat (where Masters also works) is focused on the server market where the larger page size makes a lot of sense. Some ARM distributions did not think about that, he said, and will be stuck with 4KB pages that are more suited to embedded use cases.

There are some other hardware choices, which have a similar price point to the RPi3, that could be used for development and testing, Perrin said in answer to another question. The ODROID C2 and C3 boards have 64-bit ARM CPUs, but there is a "giant caution flag" for those kinds of systems. Since the changes for those boards have not been pushed to the upstream kernel, users will be running the CentOS user space with a vendor kernel. That may be just fine, but there have been occurrences in the past where vendor kernels have had problems—a remote root hole in one case.

If you want online hardware, Perrin suggested Packet, where you can get a bare-metal aarch64 system. It is "kind of like AWS" but with ARM hardware.

When asked about 96Boards, Perrin said the company has an "array of good hardware" that doesn't do what is needed for CentOS. The HiKey board is the best, but there are some implementation issues that cause CentOS difficulties and the DragonBoard 410c does not have the right bootloader for CentOS. As Masters put it, the right answer is to spend $1000 to get a real server.

The final question was about whether CentOS is talking with other ARM distributions that do not emanate from Red Hat. Perrin said there is a cross-distribution mailing list; he doesn't see eye to eye with the others on it all the time, but that's true with his colleagues at Red Hat too at times. Driving standards for the hardware helps everyone and the people on the list are trying to do that. That is part of why there has been some effort into supporting CentOS on RPi3; everyone has one, so it is a good way to open up development without having to tell interested people to go buy a $1000 server.

[I would like to thank the Linux Foundation for travel assistance to Tokyo for Open Source Summit.]

Comments (6 posted)

Recently, Lennart Poettering announced a new tool called casync for efficiently distributing filesystem and disk images. Deployment of virtual machines or containers often requires such an image to be distributed for them. These images typically contain most or all of an entire operating system and its requisite data files; they can be quite large. The images also often need updates, which can take up considerable bandwidth depending on how efficient the update mechanism is. Poettering developed casync as an efficient tool for distributing such filesystem images, as well as for their updates.

Poettering found that none of the existing system image delivery mechanisms suited his requirements. He wanted to conserve bandwidth when updating the images, minimize disk space usage on the server and on clients, make downloads work well with content delivery networks (CDNs), and keep the mechanism simple to use. Poettering considered Docker's layered tarball, OSTree's direct file delivery via HTTP with packed deltas for updates, and other systems that deliver entire filesystem images.

Docker's approach of "layers" of updates on top of an initial tarball required tracking revisions and history, which Poettering believes a deployment should not be burdened with. OSTree's method of serving individual files would be detrimental to the performance of content distribution networks if there were a plethora of small files, as synchronization will hammer the CDN with multiple HTTP GET requests. Delivering entire filesystem images repeatedly for every update would be an unacceptably high use of bandwidth and server disk space, even though the delivery would be simple to implement. In the end, Poettering decided that, while existing systems have their merits, he had to roll his own solution optimized for the use case of filesystem image delivery with frequent updates. Casync was inspired by rsync (which copies and syncs files based on deltas) and Git (which provides content-addressable storage based on hashing).

Casync can be used to distribute directory trees as well as raw disk images. When operating on directories, all data in the target directory is serialized into a stream of bytes, much like the tar utility does. Poettering created his own serialization as the output of tar varies from implementation to implementation; he required consistent output without being dependent on any particular flavor of tar.

When invoked on either a directory or a disk image, casync will create a repository of data that mirrors the original, but reorganized such that it is broken into data chunks of similar size, stored inside a directory called a "chunk store", together with an index file that stores the metadata for the repository. The directory and index file can both be served via a web server (or with another network file transfer protocol) to a client, which can use casync to reassemble the original data.

Chunking

Casync works by chunking the target data from a stream of bytes into a set of variable-sized chunks, though the sizes do not vary by much. Chunking helps reduce the bandwidth consumed when a user synchronizes their repository, since the index file can be used to determine which chunks have changed or been added, and only those chunks need to be downloaded.

The chunking algorithm will create the same chunks for the same data, even at different offsets. To accomplish this, casync makes use of a cyclic polynomial hash (also known as Buzhash) to find the offsets for identical data. Buzhash is a hash-based search algorithm that can be used to find patterns in a stream of bytes more efficiently than brute-force scanning.

The basic idea of Buzhash is that, given two strings of data where one may contain the other, it is possible to search for the target by looking at a few bytes (called the window) at every byte offset and hashing them (this is called a "rolling hash"). The resulting hash is compared against the hash of a search key of the same size as the window; a match provides a strong indicator that the rest of the string might also match the other data, and a full hash at the indicated offset can be executed to confirm it. An advantage of a rolling hash is that it can be performed quickly since the next hash can be generated from the previous one by a computationally cheap operation.
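The "computationally cheap operation" can be made concrete with a small sketch of a cyclic polynomial hash. This is not casync's code: the 32-bit width, the xorshift-generated substitution table, and all function names are assumptions chosen to keep the example short. Only the 48-byte window comes from the article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WINDOW 48u                      /* window size in bytes */

static uint32_t table[256];             /* one random word per byte value */

static uint32_t rotl32(uint32_t x, unsigned r)
{
    r &= 31u;
    return r ? (x << r) | (x >> (32u - r)) : x;
}

/* Fill the table with xorshift32 output (an arbitrary, fixed choice). */
void buzhash_init(void)
{
    uint32_t s = 0x9E3779B9u;
    for (int i = 0; i < 256; i++) {
        s ^= s << 13;
        s ^= s >> 17;
        s ^= s << 5;
        table[i] = s;
    }
}

/* Hash WINDOW bytes starting at p from scratch. */
uint32_t buzhash_full(const uint8_t *p)
{
    assert(p != NULL);
    uint32_t h = 0;
    for (size_t i = 0; i < WINDOW; i++)
        h = rotl32(h, 1) ^ table[p[i]];
    return h;
}

/* Slide the window by one byte: drop out, take in.  Constant time. */
uint32_t buzhash_roll(uint32_t h, uint8_t out, uint8_t in)
{
    return rotl32(h, 1) ^ rotl32(table[out], WINDOW) ^ table[in];
}
```

Rolling works because each byte's table entry has been rotated once per step since it entered the window; after WINDOW steps, XORing in the same entry rotated by WINDOW cancels it exactly.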

When chunking for the first time, the Buzhash algorithm is used with a 48-byte window across the stream, moving one byte at a time and calculating a hash. A chunk boundary is placed whenever the value of the calculated hash, h, satisfies the following equation:

h mod k == k - 1

The constant k is chosen to reflect the intended average chunk size. Assuming an even distribution of the hash h, the probability that the function h mod k will yield a specific value, in this case k-1, is 1/k. Therefore, on average, the equation will evaluate to true (and a boundary will be placed) once for every k bytes read. To guarantee the chunks are neither too small nor too large, there are hard limits on the chunk size enforced by the algorithm. The implementation is similar to what rsync does when chunking, except that rsync uses a smaller window, works with individual files rather than a serialized directory tree, and the hash algorithm used is the Adler-32 checksum.
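Putting the boundary test and the hard limits together gives something like the following sketch. The values of K, CHUNK_MIN, and CHUNK_MAX are illustrative and are not casync's defaults; the hash is the same assumed 32-bit cyclic polynomial construction with an arbitrary table, and the function names are invented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WINDOW    48u       /* rolling-hash window, as in the text */
#define K         4096u     /* boundary when h mod K == K - 1 */
#define CHUNK_MIN 1024u     /* hard lower limit, must be >= WINDOW */
#define CHUNK_MAX 16384u    /* hard upper limit */

static uint32_t table[256];

static uint32_t rotl32(uint32_t x, unsigned r)
{
    r &= 31u;
    return r ? (x << r) | (x >> (32u - r)) : x;
}

/* Fill the substitution table with xorshift32 output (arbitrary choice). */
void chunker_init(void)
{
    uint32_t s = 0x9E3779B9u;
    for (int i = 0; i < 256; i++) {
        s ^= s << 13;
        s ^= s >> 17;
        s ^= s << 5;
        table[i] = s;
    }
}

/* Length of the chunk starting at data[0], with len bytes remaining. */
size_t next_chunk_len(const uint8_t *data, size_t len)
{
    assert(data != NULL);
    if (len <= CHUNK_MIN)
        return len;                     /* final, possibly short, chunk */

    size_t limit = len < CHUNK_MAX ? len : CHUNK_MAX;

    /* Hash the window that ends exactly at the minimum chunk size. */
    uint32_t h = 0;
    for (size_t i = CHUNK_MIN - WINDOW; i < CHUNK_MIN; i++)
        h = rotl32(h, 1) ^ table[data[i]];

    for (size_t end = CHUNK_MIN; ; end++) {
        if (h % K == K - 1 || end == limit)
            return end;                 /* boundary found or limit reached */
        h = rotl32(h, 1)
            ^ rotl32(table[data[end - WINDOW]], WINDOW)
            ^ table[data[end]];
    }
}
```

Since the boundary decision depends only on the 48 bytes preceding it, inserting data early in a stream tends to shift later boundaries along with the content rather than redrawing them, which is what lets unchanged regions hash to the same chunks. Skipping the test below CHUNK_MIN also means the average chunk in this sketch comes out somewhat larger than K.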

Each time a chunk is created, its contents are hashed with SHA-256 to create a digest for the chunk, which is then recorded in an index together with the chunk size and a filename. The chunks are kept in compressed form in the chunk store. If a chunk arriving in the store hashes to the same digest as an existing one in the index, the chunk need not be added. This gives the chunk store deduplication for data, which is particularly efficient for filesystem images that do not differ much between versions. The chunks can then be delivered over HTTP, along with the index, and they can be reassembled on the client side.
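The deduplication step amounts to a lookup keyed by digest. The sketch below shows only that idea and makes several simplifying assumptions: a 64-bit FNV-1a hash stands in for SHA-256 so that no crypto library is needed, the index is a fixed-size in-memory array rather than casync's on-disk format, and the names are invented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STORE_CAP 1024

struct chunk_index {
    uint64_t digest[STORE_CAP];         /* digests of chunks already stored */
    size_t   count;
};

/* FNV-1a, 64-bit: a stand-in for SHA-256, far too weak for real use. */
static uint64_t digest_of(const uint8_t *p, size_t n)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Returns 1 if the chunk was new and recorded, 0 if it was a duplicate. */
int index_put(struct chunk_index *idx, const uint8_t *chunk, size_t len)
{
    uint64_t d = digest_of(chunk, len);

    for (size_t i = 0; i < idx->count; i++)
        if (idx->digest[i] == d)
            return 0;                   /* already stored; skip it */

    assert(idx->count < STORE_CAP);
    idx->digest[idx->count++] = d;
    return 1;
}
```

A client can apply the same test in reverse: any digest in a new index that is already present locally identifies a chunk that does not need to be downloaded.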

Serializing directories requires the preservation of metadata such as ownership and access control lists (ACLs). A user can specify what metadata to save in the chunk archive when running the tool. Casync will also store extended attributes, file capabilities, ACLs, Linux chattr file attributes, and FAT file attributes. Casync recognizes pseudo-filesystems such as /proc and sysfs; it will not include them when creating an archive. Additionally, if the underlying filesystem supports reflinks, which save space by sharing disk blocks for identical files (with copy-on-write semantics), then casync can take advantage of this; instead of creating identical files, it will reflink them instead. Casync supplies a FUSE facility for read-only mounting of filesystem or disk images directly from the HTTP source.
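On Linux, a whole-file reflink is requested with the FICLONE ioctl(). The following sketch is not taken from casync; it shows one plausible way to try a reflink and fall back to an ordinary copy when the filesystem does not support it. The function name is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/fs.h>                   /* FICLONE */
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if dst shares blocks with src, 0 if copied, -1 on error. */
int clone_or_copy(const char *src, const char *dst)
{
    assert(src != NULL && dst != NULL);

    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (out < 0) {
        close(in);
        return -1;
    }

    int result = 0;
    if (ioctl(out, FICLONE, in) == 0) {
        result = 1;                     /* reflink succeeded */
    } else {
        /* No reflink support here (or other failure): plain copy. */
        char buf[65536];
        ssize_t n;
        while (result == 0 && (n = read(in, buf, sizeof buf)) > 0) {
            ssize_t off = 0;
            while (off < n) {
                ssize_t w = write(out, buf + off, (size_t)(n - off));
                if (w < 0) {
                    result = -1;
                    break;
                }
                off += w;
            }
        }
        if (result == 0 && n < 0)
            result = -1;                /* read error */
    }

    close(in);
    close(out);
    return result;
}
```

On filesystems such as Btrfs or XFS (when created with reflink support), the first branch is expected to succeed; elsewhere the copy path runs and the result is the same apart from the disk space used.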

Trying it out

There are packages for casync created by third parties available for Ubuntu, Arch Linux, and Fedora. I tried it out by compiling it from the GitHub repository, which requires the Meson build system. The installed binaries let you create chunked repositories and reconstruct them, both locally and over HTTP, FTP, or SFTP. The README contains a list of commands you can run to try out the various features of casync.

Future work

Casync is not intended as a replacement for rsync or zsync, as it is more for filesystem delivery than fine-grained file-based backup. It also does not attempt to find the optimal deduplication and smallest deltas, but has a "good enough" heuristic to save bandwidth and storage. It is a welcome addition in the space of filesystem delivery, where something like rsync would be useful, but the fine-grained, per-file granularity is not required.

Poettering has stated that he has "concrete plans" for adding encryption to casync, so that it could be used as a backup tool like restic, BorgBackup, or Tarsnap. He also intends to automate GPG validation of data, so that chunks can be signed and verified without user intervention. Casync does not expose an API for third party tools, although it is designed to be able to do so eventually. This will enable things such as GNOME's GVfs to access casync repositories, and make it modular enough so that components like the HTTP delivery mechanism can be replaced with customized implementations. Other plans are support for local network caches of chunks and automated home-directory synchronization.

Casync only works on Linux at the moment, but Poettering says he is open to accepting patches for portability that do not interfere with the fundamental operation of casync. Currently, casync is developed mainly by Poettering, with a few other contributors. The project is not yet completely stable, although it is usable and has many features implemented already. There may be changes to the file formats down the road, so any index or serialized files made with the current version might break in the future.

Conclusion

Casync is a new option to complement tools like rsync, which may prove useful to anyone who needs to distribute large filesystem images that also need to be regularly updated. The granularity of "chunks" that casync uses is reminiscent of BitTorrent, but the fact that it is network protocol independent should make the distribution of data friendlier to firewalls and content distribution networks. It should be a useful tool for cloud providers, software distributions, developers sharing customized virtual machine images, and anyone else who needs an efficient way of providing large and constantly updated bundles of data.

[I would like to thank Lennart Poettering for his help in clarifying some of the inner workings of casync.]

Comments (9 posted)