The efficiency of Operating Systems (OSes) has been in the spotlight of systems research ever since Dijkstra’s seminal THE multiprogramming system in the 1960s. But the reason for this obsession is not entirely obvious. While the OS is commonly perceived as a layer between applications and hardware, such a view is somewhat misleading, at least with respect to the CPU: application code runs on the CPU without any OS mediation. The OS intervenes only for management events (e.g., process preemption) and for accesses to shared resources (e.g., file or network I/O). So one may wonder how much OS code actually gets executed at all.

It turns out that in data center workloads about 15-20% of CPU cycles are spent in the OS kernel. Most of these workloads are I/O intensive, so they stress multiple OS components, from device drivers, through the networking and file I/O stacks, to the OS scheduler managing thousands of threads.

Taken in isolation, each piece of OS code is usually highly optimized. However, the system as a whole suffers from several inherent performance issues that are a direct consequence of the design principles underlying traditional OSes:

Protection: the execution path from the application to the OS crosses two protection domains (or more for virtual machines), and thus incurs a world switch (user-to-kernel mode and back) with its associated overheads.

Modularity: numerous layers of abstraction in the OS design are essential to tame its complexity. Crossing those layers can be expensive, as they often require extra data copies and impose other overheads associated with software encapsulation.

Generality: the OS is mostly oblivious to the application logic and runtime state, so it offers only general mechanisms and policies that are not well suited to any specific application. For example, an application might benefit from a custom page table structure rather than a radix tree.
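To make the page table example concrete, here is a minimal Python sketch that models two translation structures and counts the logical lookup steps each one needs. The class names, the 4-level/9-bit layout (borrowed from x86-64 paging), and the step counters are illustrative assumptions, not part of any real OS. The sketch also assumes a setting where the application is free to choose its own structure, and it counts logical steps rather than real memory accesses.

```python
PAGE_SHIFT = 12                       # 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1
LEVEL_BITS = 9                        # 512 entries per node (x86-64-like)
LEVELS = 4


class RadixPageTable:
    """Multi-level radix tree: one node visit per level on every lookup."""

    def __init__(self):
        self.root = {}
        self.nodes_touched = 0        # logical steps taken by translate()

    def _indices(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        mask = (1 << LEVEL_BITS) - 1
        # Most significant index first, as a hardware walker would use them.
        return [(vpn >> (LEVEL_BITS * lvl)) & mask
                for lvl in reversed(range(LEVELS))]

    def map(self, vaddr, frame):
        idx = self._indices(vaddr)
        node = self.root
        for i in idx[:-1]:
            node = node.setdefault(i, {})
        node[idx[-1]] = frame

    def translate(self, vaddr):
        node = self.root
        for i in self._indices(vaddr):
            self.nodes_touched += 1
            node = node.get(i)
            if node is None:          # unmapped at this level
                return None
        return (node << PAGE_SHIFT) | (vaddr & PAGE_MASK)


class HashedPageTable:
    """Hypothetical application-specific alternative: one hash probe."""

    def __init__(self):
        self.entries = {}
        self.probes = 0

    def map(self, vaddr, frame):
        self.entries[vaddr >> PAGE_SHIFT] = frame

    def translate(self, vaddr):
        self.probes += 1
        frame = self.entries.get(vaddr >> PAGE_SHIFT)
        if frame is None:
            return None
        return (frame << PAGE_SHIFT) | (vaddr & PAGE_MASK)
```

For a sparse address space, the radix walk touches four nodes per translation while the hashed table needs a single probe. The trade-off (range operations, memory sharing, and so on) is exactly the kind of decision a general-purpose OS makes once for all applications.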

Overcoming these limitations requires a radically different OS design which gives developers the power to cut through abstraction layers to implement application-specific core OS abstractions, all without compromising state protection and performance isolation.

Exokernel and libOS

In their highly influential work, Dawson Engler et al. introduced an exokernel OS architecture that “exterminates all OS abstractions”, and empowers application developers to build their own in a secure and efficient way. These application-specific OS services are encapsulated in a library OS (libOS) running in user mode as part of the application address space. There is no sharing of libOSes among the applications, so shared services such as a file system must be implemented as shared servers (as in microkernel design).

Underneath, the privileged exokernel core provides a minimalistic set of hardware-level interfaces for multiplexing hardware among applications. The exokernel plays the “trust, but verify” game – it allocates resources off the application critical path, but enforces allocation at runtime with low overhead. The management within the boundaries of the allocated resources is left to the application-specific libOS that implements all the basic abstractions and services, including virtual memory, scheduler, file system and network stack. The end-result is a remarkably extensible OS architecture that minimizes or completely eliminates the performance downsides of traditional OS designs.
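The division of labor described above can be sketched as a toy model in Python. Everything here is hypothetical: `Exokernel`, `LibOS`, `check_access`, and the LIFO frame policy are illustrative names and choices, not the interfaces of the original exokernel. The point is only the split: allocation is a rare, slow-path call, the runtime check is a cheap ownership lookup, and the policy for using the allocated frames lives entirely in the libOS.

```python
class Exokernel:
    """Privileged core: tracks who owns each physical frame, nothing more."""

    def __init__(self, num_frames):
        self.owner = [None] * num_frames

    def allocate(self, app_id, count):
        # Slow path, off the application's critical path.
        free = [f for f, o in enumerate(self.owner) if o is None][:count]
        if len(free) < count:
            raise MemoryError("not enough free frames")
        for f in free:
            self.owner[f] = app_id
        return free

    def check_access(self, app_id, frame):
        # Fast path: "verify" is a single ownership comparison.
        if self.owner[frame] != app_id:
            raise PermissionError(f"{app_id} does not own frame {frame}")


class LibOS:
    """Unprivileged library: manages its own frames with its own policy."""

    def __init__(self, kernel, app_id, num_frames):
        self.kernel = kernel
        self.app_id = app_id
        self.free_frames = kernel.allocate(app_id, num_frames)
        self.mapping = {}             # virtual page number -> frame

    def map_page(self, vpn):
        # Application-specific policy (assumed here: reuse the most
        # recently freed frame first). The kernel never sees this choice.
        frame = self.free_frames.pop()
        self.kernel.check_access(self.app_id, frame)
        self.mapping[vpn] = frame
        return frame


kernel = Exokernel(num_frames=8)
lib_a = LibOS(kernel, "A", 4)
lib_b = LibOS(kernel, "B", 4)
frame_a = lib_a.map_page(0x10)        # A decides which of its frames to use
```

If application B tried to touch `frame_a`, `kernel.check_access("B", frame_a)` would raise `PermissionError`, while A's internal choice of frame never involves the kernel.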

New generation of single-application libOSes

Despite the overwhelming changes in the hardware and software landscape since the paper’s publication in 1995, the concepts introduced in Exokernel are more relevant than ever. The emergence of high-speed, low-latency I/O devices shifts the performance bottleneck from hardware to the OS software management layers, and motivates unmediated application access to hardware. For example, user-space networking libraries such as DPDK give applications direct access to the NIC I/O buffers, bypassing the OS entirely, and are broadly used in performance-critical applications such as packet processing.

One of the key enabling technologies that makes single-application libOSes practical is hardware-assisted virtualization, which simplifies the task of providing strong isolation among multiple applications for both the CPU and peripherals. LibOSes also turn out to be a good match for several other domains. The most common use cases are briefly outlined below.

High performance servers. IX and Arrakis (both earned best paper awards at OSDI ’14) are two recent examples of OSes that implement exokernel design principles on modern hardware with virtualization support. The availability of sophisticated QoS and isolation mechanisms in modern I/O devices, originally introduced for virtualization, makes it possible to remove the OS logic from the data plane completely. These papers provide a number of interesting insights into the missing hardware features necessary to fully realize the vision of an OS-free data plane.

Lightweight virtualization. A unikernel is a design in which a highly specialized libOS is compiled together with an application, so that the resulting fat binary can be launched directly on a bare-metal physical (or virtual) machine. Unikernels are lightweight by design, include only a small subset of OS logic, and are best suited for applications that require the strong isolation of Virtual Machines but with radically faster deployment and a tiny OS memory footprint.

Compatibility layer. Intel Software Guard Extensions (SGX) implement a shielded execution environment, called an enclave, which by design does not allow an application to perform system calls, so unmodified applications cannot run in enclaves. SGX libOSes such as SCONE and Graphene-SGX introduce compatibility layers inside the enclave that implement the missing OS interfaces either natively (e.g., user-level thread scheduling), or by transparently forwarding the in-enclave system calls to the OS outside the enclave. Another example of a native enclave service is our recent Eleos library, which provides user-space virtual memory management for SGX to speed up enclave execution of applications with a large memory footprint.
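The "native or forwarded" split can be illustrated with a small Python model. This is not the API of SCONE, Graphene-SGX, or the SGX SDK; `EnclaveShim`, `register_native`, and the host callback are hypothetical names, and the sketch ignores the argument copying and validation that a real enclave boundary requires.

```python
class EnclaveShim:
    """Toy in-enclave compatibility layer for system-call-like requests."""

    def __init__(self, forward_to_host):
        # forward_to_host(name, args) stands in for the costly exit to the
        # untrusted OS outside the enclave (an assumption of this sketch).
        self._forward_to_host = forward_to_host
        self._native = {}
        self.stats = {"native": 0, "forwarded": 0}

    def register_native(self, name, handler):
        """Serve `name` entirely inside the enclave, with no exit."""
        self._native[name] = handler

    def syscall(self, name, *args):
        handler = self._native.get(name)
        if handler is not None:
            self.stats["native"] += 1
            return handler(*args)
        # No in-enclave implementation: forward transparently.
        self.stats["forwarded"] += 1
        return self._forward_to_host(name, args)


host_log = []

def untrusted_host(name, args):
    # Placeholder for the real OS; it only records what crossed the boundary.
    host_log.append(name)
    return 0

shim = EnclaveShim(untrusted_host)
shim.register_native("sched_yield", lambda: "switched user-level thread")

shim.syscall("sched_yield")           # handled inside the enclave
shim.syscall("write", 1, b"hello")    # leaves the enclave
```

After these two calls, `shim.stats` records one native and one forwarded request, and only `write` appears in `host_log`. The more calls a libOS can serve natively, the fewer expensive enclave exits the application pays for.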

Accelerated systems. In our prior work on native GPU OS abstractions we demonstrated the performance and programmability benefits of I/O calls made directly from GPU kernels. For example, the GPUfs and GPUnet libraries implement file system and networking layers on the GPU. These device-specific libOSes are designed to efficiently accommodate numerous concurrent I/O requests from thousands of GPU threads. More recent work extends this idea beyond GPUs, introducing an accelerator-centric OS architecture. Its goal is to eliminate the CPU control overheads in multi-accelerator applications by building a device-optimized libOS on each accelerator, with all the libOSes working in concert to provide the application with a coherent view of the system.

Hardware implications

The adoption of libOSes is driven by the need for maximum efficiency, and becomes practical thanks to the addition of hardware virtualization support in both CPUs and I/O devices. So what is missing? It is an open question, but here are some thoughts. First, for some devices, e.g., GPUs and storage, virtualization support is quite limited, so direct unprivileged access to them cannot be fully supported. In addition, hardware support for low-overhead multiplexing of high-speed devices among multiple applications is a non-trivial problem. Further, hardware virtualization in I/O devices does not scale well today (on the order of hundreds of virtual functions in NICs, for example), so the number of libOSes that can run concurrently is limited. More importantly, because virtual devices are directly accessed by untrusted user code, they are exposed to attacks that would otherwise have been blocked by the OS. Last, some of the traditional OS responsibilities, such as performance isolation and resource partitioning, should be moved into hardware, but how to maintain the flexibility of the original software mechanisms is often unclear.

In summary, libOSes provide an appealing alternative to traditional OSes, in particular in high performance servers, but they pose new requirements and provide an opportunity to revisit the existing architectural support for OSes in order to make libOSes more robust, usable and efficient.

About the author

Mark Silberstein is an Assistant Professor in the Electrical Engineering Department at the Technion – Israel Institute of Technology, where he heads the Accelerated Computer Systems Lab.

Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.