LWN.net Weekly Edition for July 10, 2014

GENIVI assesses driver distraction and builds on location data

At the 2014 Automotive Linux Summit (ALS) in Tokyo, the GENIVI Alliance showcased several new open-source software projects that are slated to make their way into future in-vehicle systems. They included a framework for tracking driver attention (and, consequently, distraction level) and several new location-based services. For those who do not pay close attention to the automotive software field, these projects represent some of the first efforts to push open-source software past the existing, relatively predictable confines of navigation and entertainment—and into more experimental territory.

Driver workload management

Yusuke Nakamura from Denso Corporation presented a session about Driver Workload Assessor (DWA), GENIVI's new open-source framework for tracking the attention of a driver and adjusting the behavior of the in-vehicle infotainment (IVI) system accordingly. The need for such a system is well known, he said; the vehicle is an increasingly complex environment, and society is more and more concerned that driver distraction will result in accidents. He pointed to several studies about the increase in distraction-related crashes, noting that there is a rising trend of distractions from integrated devices—which, as opposed to accidents involving cell phones and other portable devices, is something GENIVI can address directly. On the flip side, he pointed out, drivers expect and even demand continual access to their information systems. Consequently, GENIVI's challenge is not simply to keep information away from the driver, but to design a human-machine interface (HMI) that lets drivers focus on driving when it requires high attention, yet adapts so as not to dissatisfy them on straight, low-traffic stretches of road where the required attention level drops.
The "smart" solution is to monitor and manage the driver's "workload"—roughly defined as the number and intensity of the physical, visual, and cognitive tasks the driver is engaged in. This is a broader definition than "driver distraction," he said; "distraction" is what happens when the workload exceeds the driver's capacity. Even so, some distractions are unhelpful (such as text messages), while others are beneficial (such as alerts and warnings).

The naive approach to managing driver workload, Nakamura said, is to consider only two states: stopped and in-motion. Such an IVI system might simply disable all user input and notifications while in motion, and allow everything when stopped. But this ignores the fact that driver workload goes up and down according to the driving task. DWA defines some middle states between the naive "all" and "nothing" options; the current version has three in-between states, for "low," "medium," and "high" levels of driver workload. The plan is that the IVI system would respond to the current workload level by allowing or suppressing input and output. Either individual applications could monitor the current workload level, or a management process could broker API requests, restricting or delaying them when the driver is overly occupied. GENIVI's current approach is to have a "workload manager" process handle the brokering for other applications.

The trick, in either case, is that "driver workload" is fundamentally a cognitive concept. As a result, Nakamura said, software cannot measure it directly, but it can be at least partially inferred from car and environmental conditions. DWA tracks a number of vehicle system states to approximate how busy the driver is: whether the speed is constant, accelerating, or braking, whether the steering wheel is turned, whether the windshield wipers are engaged, and so on.
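The general approach can be illustrated with a minimal Python sketch that maps a handful of observable vehicle signals onto discrete workload levels and gates notifications accordingly. The class names, signal set, and thresholds here are hypothetical, invented for illustration; they are not part of the actual DWA code.

```python
LOW, MEDIUM, HIGH = 0, 1, 2

class WorkloadEstimator:
    """Map a few observable vehicle signals onto a discrete workload level.

    Hypothetical sketch; not GENIVI's actual DWA API.
    """

    def update(self, braking, steering_turned, wipers_on, speed_constant):
        # Count how many "busy" signals are active right now.
        score = sum([braking, steering_turned, wipers_on, not speed_constant])
        # Clamp the count onto the low/medium/high scale.
        return min(score, HIGH)

class WorkloadManager:
    """Broker notifications: deliver them only when workload is low."""

    def __init__(self, estimator):
        self.estimator = estimator

    def allow_notification(self, **signals):
        return self.estimator.update(**signals) == LOW

mgr = WorkloadManager(WorkloadEstimator())

# Sudden braking while turning: suppress the incoming-call ringer.
print(mgr.allow_notification(braking=True, steering_turned=True,
                             wipers_on=False, speed_constant=False))  # False
# Constant speed, wheel straight: let the notification through.
print(mgr.allow_notification(braking=False, steering_turned=False,
                             wipers_on=False, speed_constant=True))   # True
```

A real implementation would read these signals from the vehicle's diagnostic bus and smooth the level over time rather than recomputing it from scratch on each event.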
Changes in each of these conditions increment or decrement the current driver workload level—if the driver brakes suddenly and turns the steering wheel, then clearly driving requires more attention at the moment. If a notification comes in at just such a time—for example, an incoming call on the phone paired via Bluetooth—then the workload manager might suppress the phone ringer until the steering wheel straightens out and speed returns to a constant. Such basic vehicle states are already measured by most modern cars' diagnostic buses. Nakamura demonstrated DWA with a dummy app, in which he could change the simulated vehicle speed and steering angle, and DWA would suppress output messages from the dummy app in response. Other factors could contribute to the driver-workload estimate in the future, he said, including rain sensors, other environmental factors, and even messages from nearby vehicles or infrastructure. There is clearly a lot more to be done, but the benefit is an IVI system that is considerably more responsive to changing conditions than the simplistic all-on/all-off design in use today.

Location-based services

Philippe Colliot of Peugeot Citroën presented the recent work of GENIVI's location-based services (LBS) expert group, which includes developing several API standards and a demonstration app for GENIVI-compliant IVI systems. The APIs represent the next level up from generic geolocation information, and are intended to let application developers create more complex services. The demo app is called Fuel Stop Advisor, and it represents one example use case: it builds on geolocation, point-of-interest (POI) data, and vehicle status to recommend the best times to stop and refuel. The LBS group is working on a set of APIs that work in conjunction with the W3C Geolocation API; at present there are four.
The Navigation Core API [PDF] (currently at version 3.0) provides a way of requesting routes between destinations, including multiple transportation types, breaking the route into segments, and getting "guidance" instructions that can be used as turn-by-turn directions. The Positioning API [PDF] provides dead reckoning, taking gyroscope and compass sensor readings and establishing the vehicle's orientation and motion—so that its position can be tracked on a map even when GPS lock is lost. It is currently at version 2.0. The Point-of-Interest (POI) Service API [PDF] is designed to serve as a bridge between a POI database and any of several applications that might request POI information. For example, a map application might simply need to display all of the POIs in a rectangular region, while a search application might request all of the POIs in some category (e.g., restaurants) within a given radius of the current position or some other specified location. The POI Service API was recently declared 1.0. The fourth API is a traffic information API. Colliot explained that GENIVI was attempting to "not reinvent the wheel" where possible, which led to the Traffic API being developed jointly with the European Transport Protocol Experts Group (TPEG), an existing standardization project. Colliot said that the Traffic API had also recently been declared 1.0, although it does not seem to be published on the LBS Git repository. The LBS group is also working on additional specifications in other areas, Colliot said, including a Log Replayer API that will allow for easier application testing by playing back position and sensor data. The group is also submitting its APIs to the W3C in the hopes of getting them approved as standards. The Navigation Core API has been submitted, he said, and there are already pending changes in the works based on feedback from navigation services.
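To make the category-within-radius use case concrete, here is a small, self-contained Python sketch of the kind of query a POI service brokers. The data layout and function names are invented for illustration; they are not the GENIVI interface itself.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two lat/lon points, in kilometers.
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def search_pois(db, category, lat, lon, radius_km):
    """Return POIs of `category` within `radius_km` of (lat, lon)."""
    return [p for p in db
            if p["category"] == category
            and haversine_km(lat, lon, p["lat"], p["lon"]) <= radius_km]

# A toy POI "database" around central Tokyo.
pois = [
    {"name": "Ramen shop", "category": "restaurant",
     "lat": 35.68, "lon": 139.76},
    {"name": "Fuel stop", "category": "fuel",
     "lat": 35.70, "lon": 139.78},
]

# The search-application case from the text: restaurants near the
# current position.
print(search_pois(pois, "restaurant", 35.68, 139.76, 5.0))
```

The map-application case (all POIs in a rectangular region) would be an even simpler filter on latitude/longitude bounds; the point is that one service can answer both query shapes from the same database.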
Apart from writing specifications, the LBS group has also developed its first open-source app, Fuel Stop Advisor (FSA), "to show people that it is fun to write GENIVI apps." FSA uses the Navit routing engine and Open Street Map (OSM) data. It requires that the car have an active navigation route, and calculates whether or not the current fuel level is enough to get to the destination without stopping. If there is not enough, it recommends alternate routes to stop and refuel along the way. Colliot showed a demonstration of FSA on his laptop. The user interface is "proof of concept"-level, he said, so it does not look like a finished product. But work continues; the next steps are to port the interface to Qt 5, port the graphics to use GENIVI's Layer Manager (which allows it to be composited with other running applications), and to add the ability to search for refueling stations from additional POI providers. FSA represents new ground for GENIVI in the sense that it is an end-user application, rather than a base layer. As Colliot indicated in his talk, GENIVI is not changing its mandate—it still targets a middleware layer of software that carmakers do not want to individually reimplement. But it is progress to see that the middleware has gotten to a point where usable applications can be developed. GENIVI's community manager Jeremiah Foster also gave a talk, in which he pointed to other projects that are reaching the point where application developers can use them. There is an IVI radio service, for example, that can handle AM, FM, and a variety of digital broadcast standards, and a speech output framework that can be used for anything from turn-by-turn directions to reading text alerts out loud. The Media Manager project, on which GENIVI is collaborating with Automotive Grade Linux, should have a release ready by October. 
The goal is an API for connecting to consumer electronics devices for media playback; Foster noted that the team started with the Media Player Remote Interfacing Specification (MPRIS) and has worked with developers from several existing open-source projects (like VLC) to make sure that Media Manager meets their needs as well. Foster ended his session by asking the audience to get involved in the effort; GENIVI wants to know "what is currently missing." As the other GENIVI talks suggested, the project is reaching the point where several of the low-level tasks it has been focusing on are essentially complete, and attention now turns to more user-visible software.

[The author would like to thank The Linux Foundation for travel assistance to attend ALS 2014.]

Comments (42 posted)

Yorba, the IRS, and tax-exemption

On June 30, Yorba Foundation director Jim Nelson posted a blog entry reporting that the US Internal Revenue Service (IRS) had denied Yorba's application to be registered as a tax-exempt 501(c)(3) charity. Nelson and others contend that this denial is cause for concern for other players in the FOSS arena, but some voices disagree about the implications. As Nelson's blog post explains, Yorba had filed its 501(c)(3) application in 2009. In the US, 501(c)(3) organizations are one of several types of tax-exempt nonprofit, but there are additional benefits to being a 501(c)(3) rather than, say, a 501(c)(6) trade association. Most notably, Nelson said, donations made to 501(c)(3)s are tax deductible for the donor, and that makes fundraising easier. Many high-profile organizations in the free and open-source software realm are 501(c)(3)s, including the GNOME Foundation, the Mozilla Foundation, the Apache Software Foundation, and the Linux kernel project.
Last year, news broke that the IRS had flagged "open source" tax-exemption applications for increased scrutiny, reportedly out of concern that for-profit companies might seek to run their operations out of a nonprofit organization to evade taxes. That "be on the lookout" (or "BOLO") issue, as it was known, had allegedly started in 2010, and Nelson reported that Yorba received two requests for further information from the IRS that year. Yorba's application was as a "charitable, scientific, and educational organization." Nelson reported that the IRS's rejection notice was dated May 22, 2014, and gave several reasons for the decision—reasons he called "hair-raising" statements that "could have a direct impact on the free software movement, at least here in the United States." Nelson quoted five snippets from the IRS's justification for its decision, including the fact that Yorba's software could be used "by any person for any purpose, including nonexempt purposes such as commercial, recreational, or personal purposes," that Yorba does not own all of the copyrights on its software, and that releasing the source code to software does not constitute an educational function since "anything learned by people studying the source code is incidental." The IRS also contended that developing and distributing software is not a "public work" because software is not something ordinarily provided at public expense, and that open-source software is available worldwide and therefore does not "serve a community" as 501(c)(3) rules require. Nelson pointed out that several of these rationales seem to conflict with the IRS's recognition of other open-source software foundations as 501(c)(3)s, and that several of them seem to suggest that Yorba should impose restrictions on its projects, such as limiting their usage or requiring copyright assignment from all contributors. 
"In other words," he surmised in one place, "we (and, presumably, everyone else) cannot license our software with a GNU license and meet the IRS’ requirements of a charitable organization." He added that the potential impact of these statements by the IRS would be chilling to FOSS as a whole: I doubt they’re going to start enforcing this in the future for organizations that already enjoy exemption. If they do, it will be a royal mess for those projects having to contact every author of every non-trivial contribution and get them to sign over their rights. This is all a big if, of course. and concluded by saying that Yorba does not intend to appeal the rejection, but will continue developing its application software nonetheless. The story was picked up by the general tech press in short order, many of whom paired it with the news that the OpenStack Foundation had received a rejection from the IRS for its 501(c)(6) application in March (a decision that OpenStack has already appealed). According to that blog post, the IRS listed three issues with the OpenStack application: That the foundation is producing software and thus is “carrying on a normal line of business.” That the foundation is not improving conditions for the entire industry. That the foundation is performing services for its members. The rules for 501(c)(3)s and 501(c)(6)s differ, of course, but both rejections share some common themes, like the assertion that the projects are essentially engaging in normal software development practices as many for-profit companies do. To a lot of commenters, that amounted to a rejection of the core principles of FOSS. Simon Phipps, for example, in a story titled "Are open source foundations nonprofits? The IRS says no," said "it seems that the IRS no longer thinks collaborating on open source is a public good." 
Other news outlets took their interpretations to even greater extremes, applying them to FOSS as a whole (headlines such as "IRS says free software projects can't be nonprofits" and "The IRS wages war on open source nonprofits" are easy-to-find examples). But others have pointed out that the decision in Yorba's case does not set precedent for any other FOSS project's application. Bradley Kuhn, in a comment on Nelson's post, said that the decision is the opinion of one IRS examiner, and should not be treated as broader in scope. Furthermore, it "doesn’t change the status of orgs that are already operating properly under 501(c)(3) status." Karen Sandler noted in a blog post of her own that the IRS has said more than once in the past that a decision about one nonprofit application has no effect on existing nonprofits. Concern that existing organizations will lose their tax-exempt status would seem to be overblown. Nevertheless, Yorba's multi-year wait for a decision from the IRS does seem to be the norm. Perhaps that is a good thing in and of itself; although no organization seems to be happy about the lengthy wait, vetting an organization is probably a process that ought to require some in-depth investigation, lest gaming the system be too easy. But the lengthy wait clearly has an impact on the projects and foundations in question, consuming time and resources. It is also possible that the IRS (or some portion of its reviewers) is developing an attitude toward FOSS that is fundamentally at odds with the common practice of developing and releasing free software while finding other means to fund operations. In a 2013 WIRED article, Luis Villa commented that he had heard from several projects that the IRS wanted them to put non-commercial usage restrictions into their licenses.
No doubt there are unscrupulous individuals out there who would love to be paid to write software but not have to pay taxes (and if there were none before, the idea has surely occurred to them in the wake of the Yorba story). It is a tricky problem for the IRS to sort out, determining whose work is truly in the public interest and who might be developing a standard-issue software product but putting an open-source license on it for tax purposes. For those who are genuine in their commitment to the ideals of software freedom, though, it is just one more uphill battle among many. Hopefully others will not take the Yorba rejection as a discouragement, and hopefully Yorba will not be discouraged either. Many commenters, both on Nelson's blog post and elsewhere, spoke up to offer their encouragement in general, and their encouragement that Yorba should appeal this initial rejection.

Comments (8 posted)

A speech framework and a GUI for automotive systems

At the 2014 Automotive Linux Summit (ALS) in Tokyo, several sessions highlighted new work from the Automotive Grade Linux (AGL) and Tizen IVI projects, including a flexible speech recognition and generation framework and a graphical user interface (GUI) for in-dash head units. In addition, AGL offered teasers of several upcoming new releases and put out a call for application developers interested in open-source automotive software. Formally speaking, AGL is a working group of the Linux Foundation focused on the task of increasing Linux adoption in vehicles. But as a practical matter, this has meant group members putting resources into developing open-source software. Just prior to ALS 2014, AGL announced the release of its reference Linux platform, which is built on top of the in-vehicle infotainment (IVI) version of Tizen. The AGL release contains several components not found in the contemporaneous Tizen IVI release, but there is clearly a close working relationship between the two projects.
Some of the AGL release's additions may not make it upstream into Tizen IVI in the foreseeable future, either because they are contributed by member companies that have not yet shown an interest in Tizen, because there are licensing issues, or because they are evidently intended only as proof-of-concept code with less general appeal. The exact reasons, though, are not always clear. For example, the AGL release includes support for controlling a MOST-connected audio amplifier. MOST (Media Oriented Systems Transport) is an automotive-industry-standard data bus that runs over fiber-optic cable; it provides a number of benefits compared to other vehicle buses (such as high throughput and resistance to electrical noise), but the standard is proprietary and there is reportedly no interest from MOST's governing organization in opening up the specification or loosening its licensing restrictions. There is, therefore, little chance that general-purpose MOST support will come to Tizen IVI, but AGL has an interest in demonstrating that MOST integration is possible.

The Modello user interface

On the other hand, Intel's Geoffrey van Cutsem gave a talk about Tizen IVI's new GUI project, Modello, which actually started off as an AGL add-on project but is now developed within Tizen IVI. Modello is a suite of free-software HTML5 applications that cover basic GUI functionality. There is a "home" screen, a dashboard that shows vehicle statistics and sensor readings, a media player, a heater/air-conditioner controller, a phone-tethering application for hands-free usage, and a navigation tool that connects to Google Maps. The Modello system is completely modular, Van Cutsem said; the home-screen launcher can launch any application, not just those already mentioned. But the official Modello applications are all designed to look the same; they pick up the same UI elements from a central theme.
As of right now, there are just two themes to choose from—and they differ only in color—but the theming engine is a flexible one. Someone could create a "nighttime" theme, he said, and have it activated automatically when the car's light sensors indicate that it is getting dark. The Modello applications are also designed to run on 720p portrait-orientation screens, which are not the norm in today's vehicles. Van Cutsem explained the rationale: Modello is targeting the IVI systems of the future, when larger screens are expected to be commonplace. Most center consoles are "portrait-shaped," he said, and if the screen replaces many of the physical controls in use today (including climate-control knobs), users are likely to expect the biggest screen that will fit. The Tesla Model S, he said, is a good example: it sports a 17-inch portrait display. The Modello project has also been filling in some miscellaneous missing pieces in Tizen IVI; it implements a GUI system-settings utility, which has been prominently missing from prior Tizen releases. Perhaps most importantly, it allows GUI configuration of Bluetooth and WiFi networking, which, up until now, had only been configurable with command-line tools. There is still more to come, he said. The navigation application is still quite rough; as of today it only supports pre-set destinations. Although Van Cutsem did not discuss it, navigation is in a state of flux in both Tizen and AGL at present. Tizen IVI dropped the navigation application Navit from its builds in 2013. The word around the project is that either Navit or some other free-software routing application will return in due course; the Google Maps tool may not last due to its reliance on a single, proprietary data provider. Also still to come in Modello are a port from Tizen IVI's older web runtime to the newer Crosswalk, support for localization, and integration with the Wayland-based Layer Manager. A new release is expected within the week.
Van Cutsem also noted that the Modello project would be working to add support for "twenty plus" new applications written by AGL. Jaguar Land Rover's Matt Jones provided a preview of that application collection in part of his ALS keynote talk. The new additions being developed include Modello-compatible versions of older software, such as the SmartDeviceLink mobile device tethering and screen-sharing tool. But they also include several entirely new applications, such as fingerprint recognition and voiceprint recognition utilities, a weather application, and a news carousel. Jones pointed out that Jaguar Land Rover was interested in funding open-source projects like these AGL reference applications, and told anyone interested in contributing to get in touch. The company has found working with open-source developers to be in its best interests, he said. The average time from concept to deployment in a car is 39 months, but the average software startup only has a lifespan of 18 months. So pairing with startups is not a strategic option. In contrast, he said, for every dollar that the company puts into Tizen and open source, it estimates that it generates at least 20. He now hopes that the company can start working on more interesting new applications, such as the biometric systems mentioned above. "I hope we're done with implementing Bluetooth profiles and FM radio, and can start doing the unique stuff."

At last: the talking car

Intel's Jaska Uimonen provided a look at one of those possible new developments in his presentation on Winthorpe, an open-source framework for adding speech support to Tizen IVI applications. Winthorpe supports both speech recognition for input and speech synthesis for output, and it provides both as a system-wide service. This design is distinct from most of the other speech recognition systems on the market, he said.
The others tend to either be a standalone, "assistant" application like Apple's Siri, or else each individual app (search, navigation, etc.) is its own "silo"—linked internally to a third-party provider's speech recognition module. The assistant model can be linked to other apps (such as voice dialing and web searching), but adding new features to these apps requires making changes to the assistant. The close linking approach may also mean multiple apps have speech support, but it has serious drawbacks: the apps are not aware of each other, so they cannot cooperate, and their fate depends entirely on the continued support of the third-party speech engine supplier. In addition, he said, most of the popular speech recognition services (including Google's and Apple's) rely on an active network connection to a remote cloud service. Winthorpe attempts to improve on these shortcomings. It provides a platform-level API service with multiple back-ends, so that speech-enabling an application is a one-time process—you do not need to rewrite your code to start using a different speech engine. The API also lets applications stay simple, offloading the speech processing to the service. The process of speech-enabling an application is straightforward, he said. The program registers itself with the Winthorpe process and declares a set of commands that it wants to listen for. Winthorpe listens for speech input, then notifies its registered client if it recognizes a command—delivering the notification event and, if requested, passing the speech input buffer to the application. For deciding which registered application gets "voice focus" for a recognized command when there are multiple options, Winthorpe delegates the decision to Tizen IVI's Murphy policy manager—though how Murphy makes that decision is up to the system implementor. Winthorpe is context-aware, he said. 
When the user makes or answers a phone call, all audio is sent to the phone application and speech recognition is switched off. The Winthorpe architecture is modular; there can be multiple speech-recognition plugins installed, and there are plugins for disambiguation and for speech synthesis. Currently the plugins include only one open-source recognition engine, Carnegie Mellon University's Pocketsphinx. There are two open-source speech-synthesis plugins, one based on Emacspeak and one based on Festival. The Winthorpe team has written demo extensions for media players and for simple web searching. In addition to registering for callbacks to specific commands, applications can also make use of some special Winthorpe tokens, Uimonen said. One is the wildcard operator * for free-form input. An application can use it to have Winthorpe send the raw audio input rather than having Winthorpe process it as speech. This might be useful for recording notes or calls. Another is a "dictionary switch" command, which tells Winthorpe to match speech input against a special dictionary rather than the general-purpose one. This can dramatically improve recognition quality, he said. For instance, if one knows that the speech input will be numeric, switching to a "digits" dictionary will reduce the error rate. Speech output is considerably simpler than speech recognition, Uimonen said. Winthorpe supports selecting from among multiple installed "voices" and multiple languages, and includes commands to adjust the voice's rate and pitch. One of the weaknesses of the system is how few open-source speech projects there are, he said. Pocketsphinx is the only recognition engine currently supported because there are few open-source alternatives, although he said the project is working with the Julius engine, which is designed for Japanese. Between the two synthesis engines, Festival is noticeably weaker than Emacspeak. He added that most existing IVI systems use a proprietary speech-recognition back-end.
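The register-then-notify flow described above can be sketched in a few lines of Python. The class and method names here are hypothetical stand-ins for illustration, not Winthorpe's actual client API.

```python
class SpeechService:
    """Toy system-wide speech service: clients register commands,
    the service dispatches recognized speech back to them."""

    def __init__(self):
        self.clients = {}  # recognized command -> client callback

    def register(self, commands, callback):
        # A client declares the set of commands it wants to listen for.
        for cmd in commands:
            self.clients[cmd] = callback

    def on_recognized(self, text):
        # Deliver the notification event to the registered client, if any.
        cb = self.clients.get(text)
        if cb:
            cb(text)
            return True
        return False  # no client registered for this command

events = []
svc = SpeechService()

# A media-player client registers its command set.
svc.register(["play music", "next track"], lambda cmd: events.append(cmd))

svc.on_recognized("play music")    # dispatched to the media player
svc.on_recognized("open sunroof")  # unregistered: ignored
print(events)  # ['play music']
```

In the real system, the interesting part is arbitration: when several clients could plausibly claim the same utterance, Winthorpe hands the "voice focus" decision to the Murphy policy manager rather than resolving it itself.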
Future work for the project includes Julius integration, improvements to the Murphy integration, the ability to reconfigure the speech-decoding pipeline on the fly, and tools to better pronounce unrecognized words. Together, the AGL and Tizen IVI projects appear to be making progress on multiple fronts. While some of the work (such as Winthorpe) is of interest primarily to developers, the details of the project indicate that the team is trying to improve on the status quo found in other IVI systems. And other new pieces, such as Modello, indicate that polished, end-user code is within reach for the first time, which is good news for those who are interested in seeing an open-source IVI platform reach the market.

[The author would like to thank The Linux Foundation for travel assistance to attend ALS 2014.]

Comments (2 posted)

Page editor: Nathan Willis



Security

Evaluating the LZO integer-overflow bug

In June, a security researcher disclosed an integer-overflow bug in the Lempel–Ziv–Oberhumer (LZO) compression algorithm—a bug that has persisted in the wild for roughly two decades, and is reproduced in multiple LZO implementations as well as in the related LZ4 algorithm. LZ4's author then accused the researcher of irresponsible disclosure and of over-hyping the issue for the sake of publicity. The two have subsequently argued back and forth about the proper assessment of the bug's severity, but wherever history eventually comes down on that particular question, the case holds lessons on a number of fronts for software developers. Don A. Bailey published a June 26 blog post explaining the bug, which he had discovered during a code audit. In essence, Markus Oberhumer's original LZO code included a simple integer overflow in the code block that handles uncompressed "Literal Runs" in a compressed LZO file. The overflowed variable is later used as a size parameter, which an attacker can use to overflow a pointer and potentially gain access to a protected area of system memory. Importantly, as Bailey sees it, the original LZO reference code has essentially been copied verbatim into a wide variety of later LZO implementations, including OpenVPN, MPlayer2, Libav, FFmpeg, Btrfs, squashfs, Android systems, and the Linux kernel. Furthermore, the LZ4 algorithm developed by Yann Collet also reuses Oberhumer's reference code (including the bug), and LZ4 is also used in a variety of places, including the ZFS filesystem. Bailey's original post not only described the bug in detail, but it went on to offer an assessment of the severity of the bug for real-world attacks. The Libav and FFmpeg versions of LZO are susceptible to remote code execution, he said, as are LZ4 implementations (although in the LZ4 case, such exploits are only practical on 32-bit architectures).
On the other hand, denial-of-service attacks—while arguably less serious—are plausible against all LZO and LZ4 implementations. All of the possible attacks rely on specially crafted data payloads. Collet fired back with a blog post of his own, the same day, calling Bailey's post "totally irresponsible" and an attempt "to create a flashy headline" by claiming that the bug is far more serious than it actually is. In reality, Collet said, the conditions required to exploit the bug in LZ4 are so peculiar that there is "no real-world risk," and none of the known LZ4 implementations can be targeted. On June 28, Collet retracted some of his criticisms of Bailey's disclosure methodology, but continued to argue that no known program met the conditions required to exploit the bug. What followed was an at times heated back-and-forth between the two, both about the severity of the bug and about how it was disclosed. Collet noted that the underlying issue had been reported by someone else more than a year earlier and was deemed low-priority, mostly because the hypothetical attack would require that LZ4 be called with blocks of extremely large size (8MB or larger) and because a 64-bit system would require an implausibly large amount of memory to overflow the buffer. Bailey contended that there are plenty of 32-bit systems in the wild today (including most ARM devices) and that disinterest in future-proofing 64-bit implementations was short-sighted. On July 1, Bailey posted a follow-up showing that LZ4 could be exploited with 2.47 MB of data. Collet again accused Bailey of irresponsible disclosure for publishing this follow-up without notifying the LZ4 project privately ahead of time. As a practical matter, an update for LZ4 that fixes all of the issues cited by Bailey is already available in release r119. The LZO reference implementation has also been updated with a fix, in version 2.08. Fixes have also been published for the affected downstream projects.
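The pattern at the heart of the bug, a length field accumulated without bounds checking and later used to size a copy, can be simulated in Python with explicit 32-bit wrapping (in C, unsigned arithmetic wraps silently). The function names and the check threshold here are illustrative, not the actual LZO source.

```python
MASK32 = 0xFFFFFFFF

def literal_run_length(t, extra_bytes):
    # Vulnerable pattern: keep adding to the run length without ever
    # checking that the sum stays in range. On a 32-bit target the
    # value silently wraps around to a small number.
    for b in extra_bytes:
        t = (t + b) & MASK32
    return t

def literal_run_length_checked(t, extra_bytes, limit=2**31):
    # Fixed pattern: reject oversized runs before the addition can
    # wrap (or exceed what the output buffer could ever hold).
    for b in extra_bytes:
        if t > limit - b:
            raise ValueError("literal run too long")
        t += b
    return t

# A crafted stream pushes the unchecked length past 2**32, so it wraps
# to a tiny value even though the attacker asked for ~4GB of "literal"
# data; copy loops sized from this value can then overrun buffers.
print(hex(literal_run_length(0xFFFFFF00, [255, 255])))  # 0xfe
```

The actual fixes in LZO 2.08 and LZ4 r119 differ in detail, but the general principle is the same: validate a length before using it to index memory or drive a copy.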
Assessing the real-world severity of the bug is, to a large degree, a matter on which reasonable people may never fully agree. It certainly requires separating the issue from Collet and Bailey's argument over disclosure practices—an argument that is not technical in nature. Bailey has written two posts that describe real-world attacks against LZ4 in the wild; one hinges on the fact that, while "real" users might never use LZ4 with exceptionally large block sizes, higher-level libraries often pass data down to algorithms like LZ4 without doing sanity checks. The second shows Bailey exploiting Firefox 30.0's video-playback code. Lost in all of the debate about how plausible an attack is against LZ4, though, is a separate point raised in Bailey's original blog post. Oberhumer's reference code for LZO is the original source of the integer overflow, and because that reference code is believed to be highly optimized for decompression speed (which is, after all, one of LZO's key selling points), many developers copied it—flaws and all—into their own projects. Algorithms, Bailey said, become treated like "blessed" code, with other developers assuming their correctness and not giving them the same level of scrutiny that they might to other third-party work. The potential harm of a long-standing bug or even a back-door in reference code is therefore magnified. Where the subject matter is regarded as highly specialized, things become even trickier. One can see echoes of this concern in the recent "too few independent implementations" issue that was cited as an objection to including Daniel J. Bernstein's Curve25519 cryptographic function in the W3C WebCrypto API. The odds may not be particularly high that Bernstein's code contains an exploitable bug, but the fact that so many developers implicitly trust its correctness is a cause for caution. And cryptography is far from the only subject matter where widespread code reuse is commonplace. 
It is frequently found where low-level and highly-optimized functions are required. For example, virtually all—if not literally all—free-software raw photo software is built on top of Dave Coffin's dcraw decoder, which is released as ANSI C code typically copied into downstream projects. However difficult it may be to craft a real-world exploit for LZO or LZ4, a key lesson is that the bug was replicated to a variety of downstream projects in part because the original reference code was not subjected to sufficient scrutiny sooner. A code audit did eventually uncover the flaw, but had that audit taken place years earlier, there would likely be far less outcry over the issue today. Comments (8 posted)

Brief items

New vulnerabilities

Page editor: Jake Edge



Kernel development

Brief items

Kernel development news

Anatomy of a system call, part 1

System calls are the primary mechanism by which user-space programs interact with the Linux kernel. Given their importance, it's not surprising to discover that the kernel includes a wide variety of mechanisms to ensure that system calls can be implemented generically across architectures, and can be made available to user space in an efficient and consistent way. I've been working on getting FreeBSD's Capsicum security framework onto Linux and, as this involves the addition of several new system calls (including the slightly unusual execveat() system call), I found myself investigating the details of their implementation. As a result, this is the first of a pair of articles that explore the details of the kernel's implementation of system calls (or syscalls). In this article we'll focus on the mainstream case: the mechanics of a normal syscall (read()), together with the machinery that allows x86_64 user programs to invoke it. The second article will move off the mainstream case to cover more unusual syscalls, and other syscall invocation mechanisms.

System calls differ from regular function calls because the code being called is in the kernel. Special instructions are needed to make the processor perform a transition to ring 0 (privileged mode). In addition, the kernel code being invoked is identified by a syscall number, rather than by a function address.

Defining a syscall with SYSCALL_DEFINEn()

The read() system call provides a good initial example to explore the kernel's syscall machinery. It's implemented in fs/read_write.c, as a short function that passes most of the work to vfs_read(). From an invocation standpoint the most interesting aspect of this code is the way the function is defined using the SYSCALL_DEFINE3() macro. Indeed, from the code, it's not even immediately clear what the function is called. 
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
	struct fd f = fdget_pos(fd);
	ssize_t ret = -EBADF;
	/* ... */

These SYSCALL_DEFINEn() macros are the standard way for kernel code to define a system call, where the n suffix indicates the argument count. The definition of these macros (in include/linux/syscalls.h) gives two distinct outputs for each system call.

SYSCALL_METADATA(_read, 3, unsigned int, fd, char __user *, buf, size_t, count)
__SYSCALL_DEFINEx(3, _read, unsigned int, fd, char __user *, buf, size_t, count)
{
	struct fd f = fdget_pos(fd);
	ssize_t ret = -EBADF;
	/* ... */

The first of these, SYSCALL_METADATA(), builds a collection of metadata about the system call for tracing purposes. It's only expanded when CONFIG_FTRACE_SYSCALLS is defined for the kernel build, and its expansion gives boilerplate definitions of data that describes the syscall and its parameters. (A separate page describes these definitions in more detail.) The __SYSCALL_DEFINEx() part is more interesting, as it holds the system call implementation. Once the various layers of macros and GCC type extensions are expanded, the resulting code includes some interesting features:

asmlinkage long sys_read(unsigned int fd, char __user * buf, size_t count)
	__attribute__((alias(__stringify(SyS_read))));

static inline long SYSC_read(unsigned int fd, char __user * buf, size_t count);
asmlinkage long SyS_read(long int fd, long int buf, long int count);

asmlinkage long SyS_read(long int fd, long int buf, long int count)
{
	long ret = SYSC_read((unsigned int) fd, (char __user *) buf, (size_t) count);
	asmlinkage_protect(3, ret, fd, buf, count);
	return ret;
}

static inline long SYSC_read(unsigned int fd, char __user * buf, size_t count)
{
	struct fd f = fdget_pos(fd);
	ssize_t ret = -EBADF;
	/* ... */

First, we notice that the system call implementation actually has the name SYSC_read(), but is static and so is inaccessible outside this module. 
Instead, a wrapper function, called SyS_read() and aliased as sys_read(), is visible externally. Looking closely at those aliases, we notice a difference in their parameter types — sys_read() expects the explicitly declared types (e.g. char __user * for the second argument), whereas SyS_read() just expects a bunch of (long) integers. Digging into the history of this, it turns out that the long version ensures that 32-bit values are correctly sign-extended for some 64-bit kernel platforms, preventing a historical vulnerability. The last things we notice with the SyS_read() wrapper are the asmlinkage directive and asmlinkage_protect() call. The Kernel Newbies FAQ helpfully explains that asmlinkage means the function should expect its arguments on the stack rather than in registers, and the generic definition of asmlinkage_protect() explains that it's used to prevent the compiler from assuming that it can safely reuse those areas of the stack. To accompany the definition of sys_read() (the variant with accurate types), there's also a declaration in include/linux/syscalls.h, and this allows other kernel code to call into the system call implementation directly (which happens in half a dozen places). Calling system calls directly from elsewhere in the kernel is generally discouraged and is not often seen.

Syscall table entries

Hunting for callers of sys_read() also points the way toward how user space reaches this function. For "generic" architectures that don't provide an override of their own, the include/uapi/asm-generic/unistd.h file includes an entry referencing sys_read:

#define __NR_read 63
__SYSCALL(__NR_read, sys_read)

This defines the generic syscall number __NR_read (63) for read(), and uses the __SYSCALL() macro to associate that number with sys_read(), in an architecture-specific way. For example, arm64 uses the asm-generic/unistd.h header file to fill out a table that maps syscall numbers to implementation function pointers. 
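That number is all user space ever sees of the implementation: programs name a system call by number, not by address. As an illustration (not from the article), glibc's syscall() wrapper lets a program invoke read() by its number, with <sys/syscall.h> supplying the right per-architecture value:

```c
#include <sys/syscall.h>   /* SYS_read: the per-architecture syscall number */
#include <unistd.h>        /* syscall() */

/* Invoke the kernel's read() entry point by its syscall number,
 * bypassing the C library's usual read() wrapper. */
static long raw_read(int fd, void *buf, size_t count)
{
    return syscall(SYS_read, fd, buf, count);
}
```

On x86_64, SYS_read is 0; on the generic (e.g. arm64) table it is 63, but the macro hides that difference, and the call lands in the same per-architecture syscall table either way.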
However, we're going to concentrate on the x86_64 architecture, which does not use this generic table. Instead, x86_64 defines its own mappings in arch/x86/syscalls/syscall_64.tbl, which has an entry for sys_read():

0	common	read	sys_read

This indicates that read() on x86_64 has syscall number 0 (not 63), and has a common implementation for both of the ABIs for x86_64, namely sys_read(). (The different ABIs will be discussed in the second part of this series.) The syscalltbl.sh script generates arch/x86/include/generated/asm/syscalls_64.h from the syscall_64.tbl table, specifically generating an invocation of the __SYSCALL_COMMON() macro for sys_read(). This header file is used, in turn, to populate the syscall table, sys_call_table, which is the key data structure that maps syscall numbers to sys_name() functions.

x86_64 syscall invocation

Now we will look at how user-space programs invoke the system call. This is inherently architecture-specific, so for the rest of this article we'll concentrate on the x86_64 architecture (other x86 architectures will be examined in the second article of the series). The invocation process involves a few steps. In the previous section, we discovered a table of system call function pointers; the table for x86_64 looks something like the following (using a GCC extension for array initialization that ensures any missing entries point to sys_ni_syscall()):

asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
	[0 ... __NR_syscall_max] = &sys_ni_syscall,
	[0] = sys_read,
	[1] = sys_write,
	/* ... */
};

For 64-bit code, this table is accessed from arch/x86/kernel/entry_64.S, from the system_call assembly entry point; it uses the RAX register to pick the relevant entry in the array and then calls it. Earlier in the function, the SAVE_ARGS macro pushes various registers onto the stack, to match the asmlinkage directive we saw earlier. 
Moving outwards, the system_call entry point is itself referenced in syscall_init(), a function that is called early in the kernel's startup sequence:

void syscall_init(void)
{
	/*
	 * LSTAR and STAR live in a bit strange symbiosis.
	 * They both write to the same internal register. STAR allows to
	 * set CS/DS but only a 32bit target. LSTAR sets the 64bit rip.
	 */
	wrmsrl(MSR_STAR, ((u64)__USER32_CS)<<48 | ((u64)__KERNEL_CS)<<32);
	wrmsrl(MSR_LSTAR, system_call);
	wrmsrl(MSR_CSTAR, ignore_sysret);
	/* ... */

The wrmsrl() function writes a value to a model-specific register; in this case, the address of the general system_call syscall-handling function is written to MSR_LSTAR (0xc0000082), the x86_64 model-specific register for handling the SYSCALL instruction. And this gives us all we need to join the dots from user space to the kernel code. The standard ABI for how x86_64 user programs invoke a system call is to put the system call number (0 for read) into the RAX register, and the other parameters into specific registers (RDI, RSI, and RDX for the first three parameters), then issue the SYSCALL instruction. This instruction causes the processor to transition to ring 0 and invoke the code referenced by the MSR_LSTAR model-specific register — namely system_call. The system_call code pushes the registers onto the kernel stack, and calls the function pointer at entry RAX in the sys_call_table table — namely sys_read(), which is a thin, asmlinkage wrapper for the real implementation in SYSC_read(). Now that we've seen the standard implementation of system calls on the most common platform, we're in a better position to understand what's going on with other architectures, and with less-common cases. That will be the subject of the second article in the series. Comments (18 posted)

Control groups, part 2: On the different sorts of hierarchies

Hierarchies are everywhere. 
Whether this is a deep property of the universe or simply the result of the human thought process, we see hierarchies wherever we look, from the URL bar that your browser displays (or maybe doesn't) to the pecking order in the farm yard. There is a fun fact that if you click on the first link in the main text of a Wikipedia article, and then repeat that on each following article, you eventually get to Philosophy, though this is apparently only true 94.52% of the time. Nonetheless it suggests that all knowledge can be arranged hierarchically underneath the general heading of "Philosophy". Control groups (cgroups) allow processes to be grouped hierarchically and the specific details of this hierarchy are one area where cgroups have both undergone change and received criticism. In our ongoing effort to understand cgroups enough to enjoy the debates that regularly spring up, it is essential to have an appreciation of the different ways a hierarchy can be used, so we can have some background against which to measure the hierarchy in cgroups. I find that an example from my past raises some relevant issues that we can then see play out in some more generally familiar filesystem hierarchies and that we can be prepared to look for in cgroup hierarchies.

Hierarchies in computer account privileges

In a previous role as a system administrator for a modest-sized computing department at a major Australian university, we had a need for a scheme to impose various access controls on, and provide resource allocations to, a wide variety of users: students, both undergraduate and post-graduate, and staff, both academic and professional. Already it is clear that a hierarchy is presenting itself, with room for further subdivisions between research and course-work students, and between technical and clerical professional staff. 
Largely orthogonal to this hierarchy were divisions of the school into research groups and support groups (I worked in the Computing Support Group) together with a multitude of courses that were delivered, each loosely associated with a particular program (Computer Engineering, Software Engineering, etc.) at a particular year level. Within each of the different divisions and courses there could be staff in different roles as well as students. Some privileges best aligned with the role performed by the owner of the account, so staff received a higher printing allowance than students. Others aligned with the affiliation of the account owner — a particular printer might be reserved for employees in the School Office who had physical access and used it for printing confidential transcripts. Similarly, students in some particular course had a genuine need for a much higher budget of color printouts. To manage all of this we ended up with two separate hierarchies that were named "Reason" (which included the various roles, since they were the reason a computer account was given) and "Organisation" (identifying that part of the school in which the role was active). From these two we formed a cross product such that for each role and for each affiliation there was, at least potentially, a group of user accounts. Each account could exist in several of these groups, as both staff and students could be involved in multiple courses, and some senior students might be tutors for junior courses. Various privileges and resources could be allocated to individual roles and affiliations or intersections thereof, and they would be inherited by any account in the combined hierarchy. Manageable complexity Having a pair of interconnected hierarchies was certainly more complex than the single hierarchy that I was hoping for, but it had one particular advantage: it worked. 
It was an arrangement that proved to be very flexible and we never had any trouble deciding where to attach any particular computer account. The complexity was a small price to pay for the utility. Further, the price was really quite small. While creating the cross product of two hierarchies by hand would have been error prone, we didn't have to do that. A fairly straightforward tool managed all the complexity behind the scenes, creating and linking all the intermediate tree nodes as required. While working with the tree, whether assigning permissions or resources or attaching people to various roles or affiliations, we rarely needed to think about the internal details and never risked getting them wrong. This exercise left me with a deep suspicion of simple hierarchies. They are often tempting, but just as often they are an over-simplification. So the first lesson from this tale is that a little complexity can be well worth the cost, particularly if it is well-chosen and can be supported by simple tools.

Two types of hierarchy

The second lesson from this exercise is that the two hierarchies weren't just different in detail; they had quite different characters. The "Reason" hierarchy is what might be called a "classification" hierarchy. Every individual had their own unique role but it is useful to group similar roles into classes and related classes into super-classes. A widely known hierarchy that has this same property is the Linnaean taxonomy of biological classification, which is a hierarchy of life forms with seven main ranks: Kingdom, Phylum, Class, Order, Family, Genus, and Species. With this sort of hierarchy all the members belong in the leaves. In the biological example, all life forms are members of some species. We may not know (or be able to agree) which species a particular individual belongs to, but to suggest that some individual is a member of some Family, but not of any Genus or Species, doesn't make sense. 
It would be at best an interim step leading to a final classification. The "Organisation" hierarchy has quite a different character. The different research groups did not really represent a classification of research interests, but were a way to organize people into conveniently sized groups to distribute management. Certainly the groups aligned with people's interests where possible, but it was not unheard of for someone to be assigned to a group not because they naturally belonged, but because it was most convenient. To some extent the grouping exists for some separate purpose and members are placed in groups to meet that purpose. This contrasts with a "classification" where each "class" exists only to contain its members. An organizational hierarchy has another important property: it is perfectly valid for internal nodes to contain individuals. The Head of School was the head of the whole school, and so belonged at the top of the hierarchy. Similarly, a program director could reasonably be associated with the program as a whole without being specifically associated with each of the courses in the program. In many organizations, the leader or head of each group is a member of the group one step up in the organizational hierarchy, which affirms this pattern. These two different types of hierarchy are quite common and often get mingled together. Two places that we can find them that will be familiar to many readers are the "sysfs" filesystem in Linux, and the source code tree for the Linux kernel.

Devices in /sys

The "sysfs" filesystem (normally mounted at /sys) is certainly a hierarchy — as that is how filesystems work. While sysfs currently contains a range of different objects including modules, firmware information, and filesystem details, it was originally created for devices and it is only the devices that will concern us here. 
There are, in fact, three separate hierarchical arrangements of devices that all fit inside sysfs, suggesting that each device should have three parents. As devices are represented as directories, this is clearly not possible, since Unix directories may have only one parent. This conundrum is resolved through the use of symbolic links (or "symlinks") with implicit, rather than explicit, links to parents. We will start exploring with the hierarchies that are held together with symlinks. The hierarchy rooted at /sys/dev could be referred to as the "legacy hierarchy". From the early days of Unix there have been two sorts of devices: block devices and character devices. These are represented by the various device-special-files that can normally be found in /dev. Each such file identifies as either a block device or a character device and has a major device number indicating the general class of device (e.g. serial port, parallel port, disk or tape drive) and a minor number that indicates which particular device of that class is the target. This three-level hierarchy is exactly what we find under /sys/dev, though a colon is used, rather than a slash, to separate the last two levels. So /sys/dev/block/8:0 (block device with major number 8 and minor number 0) is a symbolic link to a directory representing the device also known as "sda". If we start in that directory and want to find the path from /sys/dev, we can find the last two components ("8:0") by reading the "dev" file. Determining that it is a block device is less straightforward, though the presence of a "bdi" (block device info) directory is a strong hint. This hierarchy is particularly useful if all you have is the name of a device file in /dev, or an open file descriptor on such a device. 
The stat() or fstat() system calls will report the device type and the major and minor numbers, and these can trivially be converted to a path name in /sys/dev, which can lead to other useful information about the device. The second symlink-based hierarchy is probably the most generally useful. It is rooted at /sys/class and /sys/bus, suggesting that there really should be another level in there to hold both of these. There are plans to combine both of these into a new /sys/subsystem tree, though as those plans are at least seven years old, I'm not holding my breath. One valuable aspect of these plans that is already in place is that each device directory has a subsystem symlink that points back to either the class or bus tree, so you can easily find the parent of any device within this hierarchy. The /sys/class hierarchy is quite simple, containing a number of device classes each of which contains a number of specific devices with links to the real device directory. As such, it is conceptually quite similar to the legacy hierarchy, just with names instead of numbers. The /sys/bus hierarchy is similar, though the devices are collected into a separate devices subdirectory allowing each bus directory to also contain drivers and other details. The third hierarchy for organizing devices is a true directory-based hierarchy that doesn't depend on symlinks. It is found in /sys/devices and has a structure that, in all honesty, is rather hard to describe. The overriding theme to the organization is that it follows the physical connectedness of devices, so if a hard drive is accessed via a USB port with the USB controller attached to a PCI bus, then the path through the hierarchy to that hard drive will first find the PCI bus, and then the USB port. After the hard drive will be the "block" device that provides access to the data on the drive, and then possibly subordinate devices for partitions. 
This is an arrangement that seems like a good idea until you realize that some devices get control signals from one place (or maybe two if there is a separate reset line) and power supply from another place, so a simple hierarchy cannot really describe all the interconnectedness. This is an issue that was widely discussed in preparation for this year's Kernel Summit. When examining these hierarchies from the perspective of "classification" versus "organization", some fairly clear patterns emerge. The /sys/dev hierarchy is a simple classification hierarchy, though possibly overly simple as many devices (e.g. network interfaces) don't appear there. The /sys/class part of the subsystem hierarchy is similarly a simple classification, though it is more complete. The /sys/bus part of the subsystem hierarchy is also a simple two-level classification, though the presence of extra information for each bus type, such as the drivers directory, confuses this a little. Devices in the class hierarchy are classified by what functionality they provide (net, sound, watchdog, etc.). Devices in the bus hierarchy are classified by how they are accessed and represent different addressable units rather than different functional units. The extra entries in the /sys/bus subtree allow some control over what functionality (represented by a driver and realized as a class device) is requested of each addressable unit. With this understood, it is hierarchically a simple two-level classification. The /sys/devices hierarchy is indisputably an organizational hierarchy. It contains all the class devices and all the bus devices in a rough analog of the physical organization of devices. When there is no physical device, or it is not currently represented on any sort of bus, devices are organized into /sys/devices/virtual. Here again we see that both a classification hierarchy and an organization hierarchy for the same objects can be quite useful, each in its own way. 
There can be some complexity to working with both, but if you follow the rules, it isn't too bad.

The Linux kernel source tree

For a significantly different perspective on hierarchies, we can look at the Linux kernel source code tree, though many evolving source code trees could provide similar examples. This hierarchy is more about organization than classification, though, as with the research groups discussed earlier, there is generally an attempt to keep related things together when convenient. There are two aspects of the hierarchy that are worth highlighting, as they illustrate choices that must be made — consciously or unconsciously. At the top level, there are directories for various major subsystems, such as fs for filesystems (and also file servers like nfsd), mm for memory management, sound, block, crypto, etc. These all seem like reasonable classifications. And then there is kernel. Given that all of Linux is an operating system kernel, maybe this bit is the kernel of the kernel? In reality, it is various distinct bits and pieces that don't really belong to any particular subsystem, or they are subsystems that are small enough to only need one or two files. In some cases, like the time and sched directories, they are subsystems which were once small enough to belong in kernel and have grown large enough to need their own directory, but not bold enough to escape from the kernel umbrella. The fs subtree has a similar set of files. Most of fs is the different filesystems and there are a few support modules that get their own subdirectory, such as exportfs, which helps various file servers, and dlm, which supports locking for cluster filesystems. However, in fs is also an ad hoc collection of C files providing services to filesystems, or implementing the higher-level system call interfaces. These are exactly like the code that appears in kernel (and possibly lib) at the top level. 
However, in fs there is no subdirectory for miscellaneous things; it all just stays in the top level of fs. There is not necessarily a right answer as to whether everything should be classified into its own leaf directory (following the kernel model), or whether it is acceptable to store source code in internal directories (as is done in fs). However, it is a choice that must be made, and is certainly something to hold an opinion on when debating hierarchies in cgroups. The kernel source tree also contains a different sort of classification: scripts live in the scripts directory, firmware lives in the firmware directory, and header files live in the include directory — except when they don't. There has been a tendency in recent years to move some header files out of the include directory tree and closer to the C source code files that they are related to. To make this more concrete, let's consider the example of the NFS and ext3 filesystems. Each of these filesystems consists of some C language files, some C header files, and assorted other files. The question is: should the header files for NFS live with the header files for ext3 (header files together), or should the header files for NFS live with the C language files for NFS (NFS files together)? To put this another way, do we need to use the hierarchy to classify the header files as different from the other files, or are the different names sufficient? There was a time when most, if not all, header files were in the include tree. Today, it is very common to find include files mixed with the C files. For ext3, a big change happened in Linux 3.4, when all four header files were moved from include/linux/ into a single file with the rest of the ext3 code: fs/ext3/ext3.h. The point here is that classification is quite possible without using a hierarchy. Sometimes hierarchical classification is perfect for the task. Sometimes it is just a cumbersome inconvenience. 
Being willing to use hierarchy when, but only when, it is needed, makes a lot of sense.

Hierarchies for processes

Understanding cgroups, which is the real goal of this series of articles, will require some understanding of how to manage groups of processes and what role hierarchy can play in that management. None of the above is specifically about processes, but it does raise some useful questions or issues that we can consider when we start looking at the details of cgroups:

Does the simplicity of a single hierarchy outweigh the expressiveness of multiple hierarchies, whether they are separate (as in sysfs) or interconnected (as in the account management example)?

Is the overriding goal to classify processes, or simply to organize them? Or are both needs relevant, and, if so, how can we combine them?

Could we allow non-hierarchical mechanisms, such as symbolic links or file name suffixes, to provide some elements of classification or organization?

Does it ever make sense for processes to be attached to internal nodes in the hierarchy, or should they be forced into leaves, even if that leaf is simply a miscellaneous leaf?

In the hierarchy of process groups we looked at last time, we saw a single simple hierarchy that classified processes, first by login session, and then by job group. All processes that were in the hierarchy at all were in the leaves, but many processes, typically system daemons that never opened a tty at all, were completely absent from the hierarchy. To begin to find answers to these questions in a more modern setting, we need to understand what cgroups actually does with processes and what the groups are used for. In the next article we will start answering that question by taking a close look at some of the cgroups "subsystems", which include resource controllers and various other operations that need to treat a set of processes as a group. Comments (1 posted)

Filesystem notification, part 1: An overview of dnotify and inotify

Filesystem notification APIs provide a mechanism by which applications can be informed when events happen within a filesystem—for example, when a file is opened, modified, deleted, or renamed. Over time, Linux has acquired three different filesystem notification APIs, and it is instructive to look at them to understand what the differences between the APIs are. It's also worthwhile to consider what lessons have been learned during the design of the APIs—and what lessons remain to be learned. This article is thus the first in a series that looks at the Linux filesystem notification APIs: dnotify, inotify, and fanotify. To begin with, we briefly describe the original API, dnotify, and look at its limitations. We'll then look at the inotify API, and consider the ways in which it improves on dnotify. In a subsequent article, we'll take a look at the fanotify API. 
Filesystem notification use cases

In order to compare filesystem notification APIs, it's useful to consider some of the use cases for those APIs. Some of the common use cases are the following:

Caching a model of filesystem objects: The application wants to maintain an internal representation that accurately reflects the current set of objects in a filesystem, or some subtree of that filesystem. An example of such an application is a file manager, which presents the user with a graphical representation of the objects in a filesystem.

Logging filesystem activity: The application wants to record all of the events (or some subset of event types) that occur for the monitored filesystem objects.

Gatekeeping filesystem operations: The application wants to intervene when a filesystem event occurs. The classic example of such an application is an antivirus system: when another program tries to (for example) execute a file, the antivirus system first checks the contents of the file for malware, and then either allows the execution to proceed if the file contents are benign, or prevents execution if a virus is detected.

In the beginning: dnotify

Without a kernel-supported filesystem notification API, an application must resort to techniques such as polling the state of directories and files using repeated invocations of system calls such as stat() and the readdir() library function. Such polling is, of course, slow and inefficient. Furthermore, this approach allows only a limited range of events to be detected, for example, creation of a file, deletion of a file, and changes of file metadata such as permissions and file size. By contrast, operations such as file renames are difficult to identify. Those problems led to the creation of the first in-kernel implementation of a filesystem notification API, dnotify, which was implemented by Stephen Rothwell (these days, the maintainer of the linux-next tree) and which first appeared in Linux 2.4.0 (in 2001). Because it was the first attempt at implementing a filesystem notification API, done at a time when the problem was less well understood and when some of the pitfalls of API design were less easily recognized, the dnotify API has a number of peculiarities. To begin with, the interface is multiplexed on the existing fcntl() system call. (By contrast, the later inotify and fanotify APIs were each implemented using new system calls.)
To enable monitoring, one makes a call of the form:

    fcntl(fd, F_NOTIFY, mask);

Here, fd is a file descriptor that specifies a directory to be monitored, and this brings us to the second oddity of the API: dnotify can be used to monitor only whole directories; monitoring individual files is not possible. The mask specifies the set of events to be monitored in the directory. These include events for file access, modification, creation, deletion, and attribute changes (e.g., permission and ownership changes); they are fully listed in the fcntl(2) man page.

A further dnotify oddity is its method of notification. When an event occurs, the monitoring application is sent a signal (SIGIO by default, but this can be changed). The signal on its own does not identify which directory had the event, but if we use sigaction() to establish the handler with the SA_SIGINFO flag, then the handler receives a siginfo_t argument whose si_fd field contains the file descriptor associated with the directory. The application then needs to rescan the directory to determine which file has changed. (In typical usage, the application would maintain a data structure that caches a mapping of file descriptors to directory names, so that it can map si_fd back to a directory name.) A simple example of the use of dnotify can be found here.

Problems with dnotify

As is probably clear, the dnotify API is cumbersome and has a number of limitations. As already noted, we can monitor only entire directories, not individual files. Furthermore, dnotify provides notification for a rather modest range of events. Most notably, by comparison with inotify, dnotify can't tell us when a file was opened or closed. However, there are also some other serious limitations of the API. The use of signals as a notification method causes a number of difficulties. The first of these is that signals are delivered asynchronously: catching signals with a handler can be racy and error-prone.
One way around that particular difficulty is to instead accept signals synchronously using sigwaitinfo(). The use of SIGIO as the default notification signal is also undesirable, because it is one of the traditional signals that does not queue. This means that if events are generated more quickly than the application can process the signals, then some notifications will be lost. (This difficulty can be circumvented by changing the notification signal to one of the so-called realtime signals, which can be queued.) Signals are also problematic because they convey little information: at most, we get a signal number (it is possible to arrange for different directories to notify using different signals) and a file descriptor number. We get no information about which particular file in a directory triggered an event, or indeed what kind of event occurred. (One can play tricks such as opening multiple file descriptors for the same directory, each of which notifies a different set of events, but this adds complexity to the application.) One further reason that using signals as a notification method can be a problem is that an application that uses dnotify might also make use of a library that employs signals: the use of a particular signal by dnotify in the main program may conflict with the library's use of the same signal (or vice versa).

A final significant limitation of the dnotify API is the need to open a file descriptor for each directory that is monitored. This is problematic for two reasons. First, an application that monitors a large number of directories may quickly run out of file descriptors. Second, and more serious: holding file descriptors open on a filesystem prevents that filesystem from being unmounted.

Notwithstanding these API problems, dnotify did provide an efficiency improvement over simply polling a filesystem, and it came to be employed in some widely used tools such as the Beagle desktop search tool.
However, it soon became clear that a better API would make life easier for user-space applications.

Enter inotify

The inotify API was developed by John McCutchan with support from Robert Love. First released in Linux 2.6.13 (in 2005), inotify aimed to address all of the obvious problems with dnotify. The API employs three dedicated system calls—inotify_init(), inotify_add_watch(), and inotify_rm_watch()—and makes use of the traditional read() system call as well.

inotify_init() creates an inotify instance—a kernel data structure that records which filesystem objects should be monitored and maintains a list of events that have been generated for those objects. The call returns a file descriptor that is employed by the rest of the API to refer to this inotify instance. The diagram at right summarizes the operation of an inotify instance.

inotify_add_watch() allows us to modify the set of filesystem objects monitored by an inotify instance. We can add new objects (files and directories) to the monitoring list, specifying which events are to be notified, and change the set of events that are notified for an object that is already in the monitoring list. Unsurprisingly, inotify_rm_watch() is the converse of inotify_add_watch(): it removes an object from the monitoring list.

The three arguments to inotify_add_watch() are an inotify file descriptor, a filesystem pathname, and a bit mask:

    int inotify_add_watch(int fd, const char *pathname, uint32_t mask);

The mask argument specifies the set of events to be notified for the filesystem object referred to by pathname and can include some additional bits that affect the behavior of the call.
As an example, the following code allows us to monitor file creation and deletion events inside the directory mydir, as well as monitor for deletion of the directory itself:

    int fd, wd;

    fd = inotify_init();
    wd = inotify_add_watch(fd, "mydir", IN_CREATE | IN_DELETE | IN_DELETE_SELF);

A full list of the bits that can be included in the mask argument is given in the inotify(7) man page. The set of events notified by inotify is a superset of that provided by dnotify. Most notably, inotify provides notifications when filesystem objects are opened and closed, and provides much more information for file rename events, as we outline below.

The return value of inotify_add_watch() is a "watch descriptor", which is an integer value that uniquely identifies the specified filesystem object within the inotify monitoring list. An inotify_add_watch() call that specifies a filesystem object that is already being monitored (possibly via a different pathname) will return the same watch descriptor number as was returned by the inotify_add_watch() that first added the object to the monitoring list.

When events occur for objects in the monitoring list, they can be read from the inotify file descriptor using read(). (The inotify file descriptor can also be monitored for readability using select(), poll(), and epoll.) Each read() returns one or more structures of the following form to describe an event:

    struct inotify_event {
        int      wd;       /* Watch descriptor */
        uint32_t mask;     /* Bit mask describing event */
        uint32_t cookie;   /* Unique cookie associating related events */
        uint32_t len;      /* Size of name field */
        char     name[];   /* Optional null-terminated name */
    };

The wd field is a watch descriptor that was previously returned by inotify_add_watch(). By maintaining a data structure that maps watch descriptors to pathnames, the application can determine the filesystem object for which this event occurred. mask is a bit mask that describes the event that occurred.
In most cases, this field will include one of the bits from the mask specified when the watch was established. For example, given the inotify_add_watch() call that we showed earlier, if the directory mydir was deleted, read() would return an event whose mask field has the IN_DELETE_SELF bit set. (By contrast, dnotify does not generate an event when a monitored directory is deleted.)

In addition to the various events for which an application may request notification, there are certain events for which inotify always generates automatic notifications. The most notable of these is IN_IGNORED, which is generated whenever inotify ceases to monitor an object. This can occur, for example, because the object was deleted or the filesystem on which it resides was unmounted. The IN_IGNORED event can be used by the application to adjust its internal model of what is currently being monitored. (Again, dnotify has no analog of this event.)

The name field is used (only) when an event occurs for a file inside a monitored directory: it contains the null-terminated name of the file that triggered this event. The len field indicates the total size of the name field, which may be terminated by multiple null bytes in order to pad out the inotify_event structure to a size that allows successive structures in the read buffer to be aligned at architecture-appropriate byte boundaries (typically, multiples of 16 bytes).

The cookie field exists to help applications interpret rename events. When a file is renamed inside (or between) monitored directories, two events are generated: an IN_MOVED_FROM event for the directory from which the file is moved, and an IN_MOVED_TO event for the directory to which the file is moved. The first event contains the old name of the file, and the second event contains the new name.
Both events have the same unique cookie value, allowing the application to connect the two events, and thus work out the old and new name of the file (a task that is rather difficult with dnotify). We'll say rather more about rename events in the next article in this series.

Inotify does not provide recursive monitoring. In other words, if we are monitoring the directory mydir, then we will receive notifications for that directory as well as all of its immediate descendants, including subdirectories. However, we will not receive notifications for events inside the subdirectories. But, with some effort, it is possible to perform recursive monitoring by creating watches for each of the subdirectories in a directory tree. To assist with this task, when a subdirectory is created inside a monitored directory (or indeed, when any event is generated for a subdirectory), inotify generates an event that has the IN_ISDIR bit set. This provides the application with the opportunity to add watches for new subdirectories.

Example program

The code below demonstrates the basic steps in using the inotify API. The program first creates an inotify instance and adds watches for all possible events for each of the pathnames specified on its command line. It then sits in a loop reading events from the inotify file descriptor and displaying information from those events (using our displayInotifyEvent(), shown in the full version of the code here).

    int main(int argc, char *argv[])
    {
        struct inotify_event *event
        ...
        inotifyFd = inotify_init();         /* Create inotify instance */

        for (j = 1; j < argc; j++) {
            wd = inotify_add_watch(inotifyFd, argv[j], IN_ALL_EVENTS);
            printf("Watching %s using wd %d\n", argv[j], wd);
        }

        for (;;) {                          /* Read events forever */
            numRead = read(inotifyFd, buf, BUF_LEN);
            ...
            /* Process all of the events in buffer returned by read() */
            for (p = buf; p < buf + numRead; ) {
                event = (struct inotify_event *) p;
                displayInotifyEvent(event);
                p += sizeof(struct inotify_event) + event->len;
            }
        }
    }

Suppose that we use this program to monitor two subdirectories, xxx and yyy:

    $ ./inotify_demo xxx yyy
    Watching xxx using wd 1
    Watching yyy using wd 2

If we now execute the following command:

    $ mv xxx/aaa yyy/bbb

we see the following output from our program:

    Read 64 bytes from inotify fd
    wd = 1; cookie = 140040; mask = IN_MOVED_FROM
        name = aaa
    wd = 2; cookie = 140040; mask = IN_MOVED_TO
        name = bbb

The mv command generated an IN_MOVED_FROM event for the xxx directory (watch descriptor 1) and an IN_MOVED_TO event for the yyy directory (watch descriptor 2). The two events contained, respectively, the old and new name of the file. The events also had the same cookie value, thus allowing an application to connect them.

How inotify improves on dnotify

Inotify improves on dnotify in a number of respects. Among the more notable improvements are the following: Both directories and individual files can be monitored.

Instead of signals, applications are notified of filesystem events by reading structured data from a file descriptor created using the API. This approach allows an application to deal with notifications synchronously, and also allows for richer information to be provided with notifications.

Inotify does not require an application to open file descriptors for each monitored object. Instead, it uses an API-specific handle (the watch descriptor). This avoids the problems of file-descriptor exhaustion and open file descriptors preventing filesystems from being unmounted.

Inotify provides more information when notifying events. First, it can be used to detect a wider range of events. Second, when the subject of an event is a file inside a monitored directory, inotify provides the name of that file as part of the event notification.

Inotify provides richer information in its notification of rename events, allowing an application to easily determine the old and new name of the renamed object.

IN_IGNORED events make it (relatively) easy for an inotify application to maintain an internal model of the currently monitored set of filesystem objects.

Concluding remarks

We've briefly seen how inotify improves on dnotify. In the next article in this series, we look in more detail at inotify, considering how it can be used in a robust application that monitors a filesystem tree. This will allow us to see the full capabilities of inotify, while at the same time discovering some of its limitations. Comments (26 posted)

Patches and updates

Kernel trees

Architecture-specific

Core kernel code

Development tools

Device drivers

Device driver infrastructure

Documentation

Filesystems and block I/O

Memory management

Security-related

Miscellaneous

Page editor: Jake Edge



Distributions

Debian and the PHP license

Unclear or idiosyncratic licenses on projects can often be problematic for distributions. In particular, Debian seems to struggle with more of these license issues than most other distributions, largely because of the project's notorious attention to that kind of detail. Even so, it is a bit surprising to see the distribution wrestling with the PHP license. One might have guessed that any problems with it would have been worked out long ago, but a problem with that license, as it applies to PHP extensions, reared its head (again) at the end of June. The problem has been present for years.

The PHP License, version 3.01—the most recent as of this writing—contains statements about the software it covers that are specific to distributing PHP itself. According to Ondřej Surý, any package that uses the license but does not come from the "PHP Group" does not have a valid license:

I did have a quite long and extensive chat with FTP Masters and our conclusion was that PHP License (any version) is suitable only for software that comes directly from "PHP Group", that basically means only PHP (src:php5) itself.

In fact, the Debian FTP masters, who serve as the gatekeepers on what packages are allowed into the distribution, specifically mention PHP in a Reject FAQ that lists reasons the team may reject packages. For PHP extensions, it says:

You have a PHP add-on package (any php script/"app"/thing, not PHP itself) and it's licensed only under the standard PHP license. That license, up to the 3.x which is actually out, is not really usable for anything else than PHP itself. I've mailed our -legal list about that and got only one response, which basically supported my view on this. Basically this license talks only about PHP, the PHP Group, and includes Zend Engine, so its not applicable to anything else.
Given that the mail referenced is from 2005, this is clearly a longstanding problem, though little seems to have been done about it over the years. PHP has updated its license and removed some of the problematic wording (the "Zend Engine" wording in particular), but there is still a belief that PHP extensions shouldn't be using the PHP license. There are a number of possible solutions to that problem, which Surý outlined. Debian could get the extension upstreams to relicense under the BSD or MIT licenses (for example), show that the software does actually come from the PHP Group, or remove the affected packages from Debian entirely. He also updated a pile of bugs that were filed against various PHP add-on modules.

It's a complicated question and, unsurprisingly, there are multiple interpretations of the license. That is unfortunate, but it is something that only the PHP Group can address—something it seems unwilling to do. There are some who think that anything distributed from PEAR (PHP Extension and Application Repository) that uses the PHP license (version 3.01 or greater) should be considered to have a reasonable license, while others would add code that comes from PECL (PHP Extension Community Library) to that list as well. But the use of the PHP license is pervasive throughout the PHP ecosystem, well beyond just PEAR and PECL. For example, Mike Gabriel wondered what the problem was for the LGPL-covered Smarty 3 template engine. As Surý pointed out, though, Smarty 3 also uses four separate PHP files that are under the PHP license.

Surý's email subject said that the extensions covered by the PHP license were "not distributable", but others took exception to that claim. The license text says that the software is being distributed by the PHP Group, which is clearly not the case when Debian (or anyone else) distributes it.
Other, similar language essentially requires the distributor to lie, as Steve Langasek said:

There is nothing in these licenses that makes the software undistributable; it just requires the distributor to attach *false statements* to it as part of distribution. I have no objection to the ftp team's decision to treat this as an automatic reject on this basis - I don't think a license that requires us to make false statements is suitable for main - but it's wrong to claim that these works are undistributable.

But Marco d'Itri thought that none of that mattered. PHP support for certain packages is critical:

Reality check #1: it is quite obvious that even if anybody else accepts this interpretation then nobody cares.

Reality check #2: Debian would not be viable without major packages like PHP support for imagemagick or memcached, if we do we may as well remove the whole language.

Matthias Urlichs piled on to the "reality check" theme. He agreed that the problem is one that no other distribution cares about and noted that Debian has had these extensions in its repositories for years. Furthermore:

Thus, reality check #3: This license contains some strange terms that make it look like it doesn't really apply to the software it's distributed with, but QUITE OBVIOUSLY the author of the software in question thought otherwise, and there is no actual legal problem (nobody else is complaining about the license, much less threatening to revoke permissions, much less suing somebody). Thus, while we're in a reasonably good position to convince Upstream to fix that problem, filing RC bugs and thus making PHP [unusable] in Debian is certainly going to be regarded as typical Debian principles-above-all overkill but unlikely to be helpful to anybody.

Later in the thread, Urlichs summarized the situation. It is clear, he said, that PHP doesn't care about the misuse of its license and the misusers don't understand that they are making a mistake. Any efforts by Debian to change that just make the extension authors "consider us quite strange for even mentioning" a license change. He outlined three options: removing the modules ("I'd be for this in a heartbeat if it would make people switch to a saner programming language, but that's wishful thinking"), getting all of the upstreams to change their licenses ("Fat chance"), or biting the bullet and just living with the status quo. That last option seems to be winning the day (or else everyone ran out of steam to keep arguing).
As Russ Allbery put it:

I don't see this as a matter of principle unless the principle is "we refuse to deal with even major software packages that do dumb and self-contradictory things with licenses but without any intent to actually restrict the freedom of the software covered by them." And I don't actually agree with that principle. For stuff not already in Debian, sure, let's stick to a simple policy because we can usually get people to change upstream and make the world a better place, and we don't lose much if we fail. But that doesn't really apply to PHP.

For his part, Surý plans to start closing bugs for those packages that are distributed from PEAR and PECL, which covers most of the affected packages. While PHP is able to have an unclear license that gets wrongly applied to its extensions (at least in Debian's view), it can only do so because of its popularity—lesser packages may find it much harder to find their way into distributions with oddly constructed licenses. It is important that projects choose their licenses carefully, which is something that many of these extension developers seem to have skipped. It is possible that Debian is being overly critical of the terms, but anyone reading that license may find it to be rather informal, and it certainly makes life difficult for distributors. Perhaps that's what the PHP project wants, but one gets the sense that what most project members really want is just to ignore licensing issues altogether. Comments (3 posted)

Brief items

Distribution quote of the week

These days Gentoo is sort of a “background” distro that has been around for ages, has loads of users but new people don’t get excited about anymore, kind of like Debian. -- Patrick McLean

Comments (3 posted)

Release for CentOS-7

The CentOS project has released CentOS 7.0-1406. This release is the first to be built with sources hosted at git.centos.org. All source rpms are signed with the same key used to sign their binary counterparts. This release also introduces the new numbering scheme. "The 0 component maps to the upstream release, whose code this release is built from. The 1406 component indicates the monthstamp of the code included in the release (in this case, June 2014). By using a monthstamp we are able to respin and reissue updated media for things like container and cloud images, that are regularly refreshed, while still retaining a connection to the base distro version." The release notes also mention that this is the first release to have a supported upgrade path, from CentOS 6.5 to CentOS 7. (Thanks to Scott Dowdle) Full Story (comments: none)

Distribution News

Debian GNU/Linux

Updating the list of Debian Trusted Organizations

Lucas Nussbaum presents an updated list of Debian Trusted Organizations. "Historically, SPI was the sole organization authorized to hold assets for the Debian Project. Over the years, a number of other organizations started to hold assets on behalf of Debian, but we did not enforce the process defined in our constitution to officially maintain a list of Trusted Organizations. I would like to use this opportunity to stress the importance of the work of such supporting organizations for Debian, and for the Free Software community in general. The legal and financial framework they provide is a crucial contribution to ensure healthy and functional projects." Full Story (comments: none)

Debian RT News - New member, freeze reminder and last Squeeze release

The Debian release team welcomes new members, talks about the Jessie release schedule and the upcoming final point release for Squeeze, and more. "For users who wish to stay with Squeeze a bit longer, we recommend that you use and support the Squeeze LTS project. Please keep in mind that Squeeze LTS is only provided for a limited set of architectures (i386 and amd64), and that you need to update your sources.list to use Squeeze LTS." Full Story (comments: none)

Newsletters and articles of interest

Page editor: Rebecca Sobol



Development

The future of Ardour

Ardour is likely the most compelling open-source digital audio workstation (DAW) for music professionals. But a recent blog post by Ardour's lead developer, Paul Davis, revealed that he will likely need to shift his focus due to a lack of financial support for the project:

I really don't like writing articles about Ardour and money. I like to think that successful and worthy projects will magically fund themselves, and obviously, I like to think that Ardour is successful and worthy. This is wrong thinking, however.

Given the support from users and companies for the project, that news comes as a bit of a shock. Users have stated that Ardour can hold its own against Pro Tools, a proprietary DAW used by professionals throughout the music industry. It has also been used in the Mixbus DAW product from Harrison Audio Consoles, sales of which provide some income for the project.

Davis is not completely stopping development work on Ardour; he has "the option of working for a digital audio company that is developing new projects based on Ardour." However, this would focus development on the needs of that particular business over Ardour's end users:

If I do this, I will still be working on Ardour's codebase, but my focus will cease being what I perceive the needs and desires of Ardour users to be, and will be dominated by what another company thinks I should be doing. I don't particularly want to go down this route, but given the current "curve" of the income trend, it appears to me that I will probably have to.

Davis concluded the post with uncertainty about the prospects of the community picking up the burden, and an insistence that his message is not a request for funding. He also emphasized in a comment on the post that he is not abdicating his role as lead developer. However, the message is clear: Ardour's lead developer will likely shift gears on that work in the near future. What happened?
How did a project that received both commercial attention and high praise from users not find the means to fund one developer full-time? To help figure that out, we will need to take a look at Ardour's history.

Davis had started the project in 2000, working full-time on Ardour for several years, buoyed by a windfall from his work for Amazon.com in its early days. But the issue of financial sustainability would rear its head before Ardour's initial "stable" 0.99 release in September 2005 (developers made a decision to skip the 1.0 release and make major changes for a 2.0 release). In a post to the development mailing list in May 2005, Davis explained that he had only earned $6,000 over the past five years to support his work. As a result, for a time, he would have to take an unrelated development job that would consume most of his time during the week.

In 2007, the project started a subscription-based funding model. Web site visitors could only download binaries with a paid subscription, which currently costs $1, $4, or $10 per month, with a $50/month option for institutions. Full source code remains available for free download. However, with the project only targeting Linux and Mac OS X users, this seems not to have led to a sustainable model, as popular Linux distributions packaged Ardour for their users. For example, in an October 2007 LWN review of Ardour 2, Forrest Cook reported that the multimedia-focused Ubuntu Studio came with Ardour out-of-the-box.

The community remained concerned about sustainable funding, with the topic dominating discussion on the development mailing list in January 2009. Patrick Shirkey of Boost Hardware then suggested a number of possible funding sources, ranging from seeking grants, to a music CD featuring artists who use Ardour, to celebrity endorsements.
While Davis's income from Ardour had improved by June 2009, an interview with Linux Journal around that time revealed that he was still very concerned about his personal financial situation. Positive attention toward Ardour, coupled with concern for its self-sustainability, remained during the years to come. In a 2010 episode of Jono Bacon and Stuart Langridge's "Shot of Jaq" podcast [.ogg], they noted that software projects relying on funding directly from end users to finance long-term development have struggled in comparison to the proven model of corporations financing free software projects while selling related services (e.g. Red Hat and SUSE). There have also been marketing issues with the subscription model, with one potential user complaining in 2011 about having to pay for binaries for this particular open-source project when other open-source software is available free in both source and executable forms.

In 2012, we finally saw an Ardour-based release for Windows users: a proprietary, closed-source product named Harrison Mixbus. Windows users arriving at the Ardour download page were directed to the Harrison Consoles web site to purchase the product: it currently costs $149 outright (discounted from $219), or $49 plus $9/month for a subscription. There was also discussion in 2012 about charging $10-20 for Ardour on the Ubuntu Software Center, but the discussion was short-lived and the idea did not seem to be taken seriously by the community. Ardour's 3.0 release came in 2013, with many new features, such as complete support for MIDI.

It appears that Ardour's subscription model has one major technical flaw, which may have cost the project some money. Some formerly subscribing Ardour users, who were concerned about the future of the project, commented on Davis's recent post noting that they had various technical difficulties with the subscription mechanism.
Davis replied that PayPal, which is used for subscriptions, does not provide a programmatic interface to view canceled subscriptions, leaving him to manually update the subscription database by downloading a CSV file from PayPal. Addressing that issue may pull in some extra cash, but likely not enough to cause Davis to reconsider his decision to take the new job.

Perhaps an opportunity was missed with the project's refusal to consider "average" Windows users (i.e. those who couldn't or wouldn't pay the high cost for Harrison Mixbus) as a potential userbase worth targeting. In 2006, Davis noted, albeit without providing examples, that several open source projects offering Windows ports have not had the capacity to manage the increased demands:

Many other *nix open source projects have been overwhelmed when they have ported to windows - a huge, sudden influx of users with zero background in software development, and no infrastructure