Linux on the OSF Mach3 microkernel

1. Introduction

Beyond the advanced features already present in the Mach microkernel, such as SMP support, network-transparent IPC and support for application-specific paging policies, the promise of a microkernel architecture itself was very appealing. OSF's customers have interests in many different operating systems, and we needed a way to pursue a program of operating systems research that was not unduly tied to any one of them: operating-system-specific behavior had to be separated from OS-independent functionality. Our hope was that a microkernel-based architecture would prove more portable and modular than the monolithic systems prevalent at the time. OSF currently provides two versions of OSF/1, each hosted on top of the microkernel. OSF/1 1.3 is a system suitable for workstations and minicomputers; its performance is very competitive with other versions of Unix. OSF/1 AD, available to OSF RI customers, is intended for massively parallel processing supercomputers and clusters.

We are interested in promoting the use of our systems technology within the industrial and research communities. The OSF MK kernel is freely available, but the OSF/1 server is encumbered by commercial licenses, including the SVR2 Unix license. For many organizations this is not a problem because they already hold an SVR2 license, but for others the licenses have proved to be serious obstacles. In an effort to remove these obstacles, we decided to produce a free UNIX-like server that would suit the needs of Mach developers. We chose Linux for several reasons. It is one of the most popular free implementations of UNIX, and some of our members had expressed an interest in a Linux server. Linux is efficient and has very competitive performance. It provides a very attractive and effective development environment (the GNU tools). There are several research projects based on Linux that could potentially benefit from a microkernel. Finally, it was an opportunity to create and experiment with an operating system server that was not derived from BSD: CMU's UX, BSD-Lite and OSF/1 are all descended from BSD.

In 1995 we began a project to port OSF MK to the Apple PowerMac and to create a Linux server that could run on top of OSF MK. The server was hosted on both Intel x86 and PowerMac platforms. Our goal is to produce a free system that has competitive performance, is usable on multiple, inexpensive hardware platforms and is of interest to both our members and the research community. In this paper we will describe the project in detail, along with some of our future plans. We will also describe some of the important differences between OSF MK and Mach 3.0.

2. OSF MK, the OSF microkernel

Our original interest in Mach was due to the powerful abstractions provided by the kernel, its operating system neutrality and the promise of greater portability and modularity. To a large extent, the microkernel has lived up to our expectations. OSF and some of our customers have ported the kernel to several platforms without undue difficulty. We and our collaborators have hosted different operating system personalities on top of the kernel. And for the last few years we have had an active research program that exploits and extends the abstractions provided by the microkernel.

Portability

The kernel itself is somewhat complex, yet the task of porting it to a new hardware platform is fairly straightforward. We support our version of the microkernel on several different hardware platforms, including the Intel x86 family, the Intel i860, the DEC Alpha, the HP PA-RISC and the Apple/IBM/Motorola PowerPC. The microkernel has a clean separation between hardware-dependent and hardware-independent functionality. Writing the hardware-dependent code for the microkernel usually requires approximately 4 to 6 months. The difficulty of creating an adequate suite of device drivers can easily exceed the effort needed to port the rest of the kernel; in some cases the effort can be reduced by converting pre-existing drivers. We will discuss the effort to port the kernel to the PowerMac later in the paper.

Server Performance

In addition to different versions of Unix, MacOS and MS-DOS have been hosted on Mach, and IBM has hosted OS/2 on its own version of the Mach microkernel. Early efforts to layer OS personality servers on top of the microkernel suffered disappointing performance due to the extra message-based communication between the system components; performance costs as high as 40% have been reported.
Our recent experience has led us to believe that most of these problems were due to a lack of attention to performance-related issues. After creating the OSF/1 server and noting its disappointing performance, we embarked on an effort to improve it.

Thread Migration

Thread migration was derived from work done on Mach 4.0 at the University of Utah [3]. It aims at reducing the cost of switching context between the sender and the receiver of an RPC.

The ``thread'' abstraction has been split into two new entities:

- the shuttle, which carries the scheduling state and migrates from task to task along an RPC path;
- the thread, which represents the execution context within a task.

Short-Circuited RPC

Collocation

Combined Effects

Used together, thread migration, short-circuited RPC and collocation can reduce an RPC almost to a simple procedure call. Both the kernel-to-server system call exception RPC and the server-to-kernel system calls benefit from this optimization, greatly improving overall system performance. The performance of OSF/1 as a kernel task is very competitive with other Unix systems, including the original OSF/1. In a subsequent experiment we ported OSF/1 1.3 and the microkernel to an HP PA-RISC workstation, then layered an HP-UX (HP's version of Unix) emulation library on top of OSF/1. This allows us to run HP-UX applications unchanged on our system. We then ran a variety of performance tests, including AIM III (a common multi-user Unix benchmark), TTCP and others. Most of the tests indicate that OSF/1 with the HP-UX emulation library has performance equal or superior to HP-UX [11]. This result came as a pleasant surprise and is generally considered a vindication of our performance efforts.

A Real-Time Microkernel

In addition to its portability and its operating system neutrality, we were interested in the microkernel as a foundation for research into real-time and distributed computing issues. Mach was not designed as a real-time operating system; the original design focus was on scalable, multiprocessing timesharing systems, and lazy evaluation techniques were used extensively in the design of the VM and scheduling subsystems. For OSF MK to be a suitable foundation for real-time applications, we had to make enhancements to the microkernel ranging from the prosaic, such as pre-emption, clocks and alarms, to the innovative, such as real-time RPC, the CORDS framework for network protocols and ``paths''.

Pre-emption

Before the microkernel can be suitable for real-time applications it must provide reasonable, predictable behavior. Complex real-time operating systems, such as the various real-time Unix systems and the OSF microkernel, all use pre-emption to avoid indeterminate event latencies, because certain features, though desirable, are inherently unpredictable. Our pre-emption strategy exploits the fine-grained locks already present in the kernel to provide Symmetric Multi-Processing (SMP) support. This naturally led to a fully preemptible system [5].

Mach 3.0 had simple and complex locks [6]: simple locks provided mutual exclusion and complex locks provided multiple-reader, single-writer semantics. In OSF MK, simple locks were enhanced to prevent pre-emption. This resulted in a working system, but because of the original code's extensive use of simple locks, the resulting system had unacceptable event latencies. To deal with this problem we added a new type of lock, the mutex: an inexpensive, mutual-exclusion blocking lock. The difference between a simple lock and a mutex lock is that a kernel thread can be preempted while holding a mutex. Most of the algorithms that used simple locks were converted to the new mutex locks.
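
The distinction can be illustrated with a minimal sketch. The code below is hypothetical (the names and the preemption counter are illustrative, not the actual OSF MK internals); it shows the essential difference: taking a simple lock disables pre-emption on the current processor, while a mutex merely blocks contenders and leaves its holder preemptible.

```c
/* Hypothetical sketch of the two lock disciplines described above;
 * names are illustrative, not the actual OSF MK internals. */
#include <stdatomic.h>

static _Thread_local int preempt_disable_count; /* per-CPU in a real kernel */

typedef struct { atomic_flag held; } simple_lock_t;

static void preemption_check(void)
{
    /* In the kernel: if a higher-priority fixed-priority thread became
     * runnable while pre-emption was disabled, switch to it now. */
}

static void simple_lock(simple_lock_t *l)
{
    preempt_disable_count++;                 /* holder may NOT be preempted */
    while (atomic_flag_test_and_set(&l->held))
        ;                                    /* spin; critical section is short */
}

static void simple_unlock(simple_lock_t *l)
{
    atomic_flag_clear(&l->held);
    if (--preempt_disable_count == 0)
        preemption_check();                  /* a pre-emption point */
}

/* A mutex, by contrast, blocks contenders and leaves its holder
 * preemptible; it must record the holder so that a blocked
 * higher-priority thread can boost it (see Priority Inversion). */
typedef struct {
    atomic_flag held;
    struct thread *holder;
} mutex_t;
```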

In the initial version of the system, pre-emption could cause unwarranted context switching between timesharing threads. This problem was corrected by a simple modification to the pre-emption code: pre-emption occurs only if the higher-priority thread is a fixed-priority thread. With this change, the cost of enabling pre-emption in an SMP environment is negligible when measured by a standard benchmark such as AIM III. In a uni-processor environment, the cost of pre-emption is identical to the cost of enabling SMP locks, i.e., approximately 10%. Since our pre-emption mechanisms are so closely integrated with the SMP locking mechanisms, this is not surprising.

Priority Inversion

Kernel pre-emption created a new problem: a type of scheduling anomaly sometimes referred to as a priority inversion [7]. Priority inversions can occur when a high-priority thread becomes dependent on, or blocked by, a lower-priority, preempted thread. We designed a straightforward priority boosting protocol inside the kernel to deal with priority inversions, sketched below. Priority boosting propagates across dependencies, not just locks: if a thread blocks and becomes dependent on another thread, then the thread controlling the dependency is boosted, and if the boosted thread is itself blocked by another dependency, the boost propagates down the dependency chain. A thread remains boosted until it releases its last dependency.
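
A minimal sketch of the boosting rule just described follows. The field and function names are hypothetical (the real kernel covers more dependency types than this sketch shows), but the logic matches the protocol: boost every controlling thread down the chain, and unboost only when the last dependency is released.

```c
/* Hypothetical sketch of priority boosting across a dependency chain;
 * names are illustrative. */

struct thread {
    int priority;                /* effective (possibly boosted) priority */
    int base_priority;           /* priority before any boost */
    struct thread *depends_on;   /* thread controlling the dependency we
                                    are blocked on, or NULL */
    int dependencies_held;       /* dependencies we currently control */
};

static void boost_chain(struct thread *controller, int priority)
{
    /* Walk down the dependency chain, raising every controlling
       thread to at least the blocked thread's priority. */
    for (struct thread *t = controller; t != 0; t = t->depends_on)
        if (t->priority < priority)
            t->priority = priority;
}

static void thread_block_on(struct thread *self, struct thread *controller)
{
    self->depends_on = controller;
    controller->dependencies_held++;
    boost_chain(controller, self->priority);
}

static void thread_release_dependency(struct thread *self)
{
    /* A thread stays boosted until it releases its last dependency. */
    if (--self->dependencies_held == 0)
        self->priority = self->base_priority;
}
```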

This algorithm is not perfect, in that some threads remain boosted longer than strictly necessary, but it is very simple and inexpensive.

Real-Time RPC

The real-time RPC [8] is not layered on top of message-based IPC. Implementing RT-RPC as a new kernel service had important advantages:

- RPC-specific optimizations can be made along the entire RPC path (our RPC is twice as fast as the optimized Mach 3.0 RPC).
- Real-time RPC-specific behaviors, such as alerts, orphan detection, predictable delivery and nested time-constraint propagation, become possible.
- An efficient, unified programming model becomes possible for invoking operations across module boundaries within a task, across the task/kernel boundary, or across task boundaries.

The client side of an RT RPC is very similar to a message-based RPC: in both cases threads invoke RPCs using ports, and the client thread waits for the server to process the request and reply. The server side is somewhat different. Instead of a pool of threads waiting in a message receive loop, a server creates a pool of ``empty threads''. These threads have no scheduling state. When a client invokes a server and a server thread is available, the kernel chains the client and server threads together and upcalls the server immediately. Many client thread attributes, such as scheduling attributes, are propagated along with the normal RPC parameters associated with the server's operational interface. When the server completes the requested service and replies, the server thread becomes empty again and a candidate for future upcalls. The reply parameters are propagated back to the client, which returns from the invocation and resumes execution. If no server thread is available, the client thread blocks. When a server thread eventually becomes available, the scheduling policy selects the appropriate client thread based on its scheduling attributes, and it is chained to the server thread. In this way, client threads are serviced in the correct order. This avoids the scheduling anomalies introduced by Mach IPC's port queues and ordered message delivery guarantees, and it makes it possible for servers to provide service according to client-specified time constraints.

Alerts

Sometimes it is important to signal or generate an exception at the head of an RPC chain rather than at a thread somewhere in the middle; one reason could be the expiration of a deadline specified by one of the threads in the chain. Alerts are the mechanism used, by either the kernel or an application, to generate a timely exception at the head of an RPC chain. When an alert is posted to thread A, which is not the head of a chain (suppose an A->B->C->D RPC chain), the kernel propagates the alert towards the head. When the thread that is the current head of the chain is located, in this case thread D, it is suspended and an alert exception upcall is made to the thread's exception port. This gives the task containing thread D a chance to respond in a timely fashion to the event that triggered the alert. Because upcalls are used instead of messages, the time constraints of the target thread can be propagated to the exception handler, so exception processing can proceed without risk of a scheduling anomaly. The return value in the reply from the exception upcall indicates to the kernel whether the alert was handled successfully. If it was not, thread D is terminated (just as with any unsuccessfully handled exception) and the alert is raised on thread C, the new head of the chain. Alerts are back-propagated up the chain in this way until the thread originally alerted is reached.
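
The propagation just described can be summarized in a short sketch. Everything below is illustrative C (the structures, the upcall stub and its behavior are invented, not the OSF MK interfaces); it shows an alert chasing the head of the chain and then walking back up as unhandled alerts terminate threads.

```c
#include <stdio.h>

/* Hypothetical sketch of alert delivery on an RPC chain; all names
 * are illustrative, not the actual OSF MK interfaces. */

struct rt_thread {
    const char *name;
    struct rt_thread *next;   /* thread we are invoking, or NULL (head) */
    struct rt_thread *prev;   /* thread that invoked us, or NULL */
    int handles_alert;        /* stand-in for the task's exception handler */
};

/* Stand-in for the alert exception upcall to the thread's exception
 * port; returns nonzero if the task handled the alert in time. */
static int alert_exception_upcall(struct rt_thread *t)
{
    printf("alert upcall to %s\n", t->name);
    return t->handles_alert;
}

static void post_alert(struct rt_thread *alerted)
{
    /* Propagate the alert to the current head of the chain. */
    struct rt_thread *head = alerted;
    while (head->next)
        head = head->next;

    /* An unhandled alert is fatal: the head is terminated and the
     * alert is raised on the next thread up, until it is handled or
     * the originally alerted thread is reached. */
    while (head != alerted && !alert_exception_upcall(head)) {
        printf("terminating %s\n", head->name);
        head = head->prev;
        head->next = 0;       /* becomes the new head of the chain */
    }
}

int main(void)
{
    struct rt_thread a = {"A", 0, 0, 1}, b = {"B", 0, &a, 0},
                     c = {"C", 0, &b, 1}, d = {"D", 0, &c, 0};
    a.next = &b; b.next = &c; c.next = &d;
    post_alert(&a);           /* upcalls D; D fails, so C is tried next */
    return 0;
}
```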
Orphans

Node failures or other events, such as task or thread termination, can result in broken chains. Detecting and eliminating the orphaned chain fragment in a timely fashion is important in real-time systems: responding to whatever failure or event caused the chain to break is as important as responding to any other external event, and in some systems timely response to failures matters more than the processing of ordinary events. When a chain is broken, the rooted portion of the chain is immediately restarted; it returns from the RPC invocation with an error indicating the chain was broken and takes whatever action the application deems appropriate. An orphan alert is posted to the orphaned chain and propagated to its head, where an orphaned exception is generated. This is a fatal alert that cannot be handled successfully, i.e., when the orphaned exception handler returns, the thread is terminated and the alert is raised on the next thread up the chain. This gives each thread on the chain a chance to clean up (release locks, undo or perform compensating actions) before being terminated.

Characterization Tools (ETAP)

ETAP (Event Trace Analysis Package) [14] is a tool for characterizing the performance and behavior of real-time applications as well as the system software. ETAP is straightforward in design. The kernel reserves a block of memory as a circular message buffer whose size is configurable. The kernel has been instrumented with a variety of probes; when activated, these probes create entries in the circular buffer. Probe entries contain a type field, a time-stamp, a thread ID tuple and probe-specific information. Probes can capture a wide range of information, such as context switching events, system calls, lock events, device events, etc. There are global and per-thread probes, and any subset of the probes can be dynamically activated or deactivated. Applications with probes write to the buffer; a second task reads the buffer and records it on disk, where the information can subsequently be analyzed using different report-generation programs. When configured into the kernel, inactive probes incur an insignificant overhead: approximately 1 percent when running AIM III.
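
A probe entry and the buffer write path might look like the following minimal sketch. The record layout and names are hypothetical (the real ETAP format is not reproduced here); the point is the shape of the mechanism: a fixed circular buffer, a cheap enable test, and a record carrying the fields listed above.

```c
#include <stdint.h>

/* Hypothetical sketch of an ETAP-style circular trace buffer; the
 * record layout and names are illustrative, not the real ETAP format. */

typedef struct {
    uint16_t type;           /* event kind: context switch, syscall, lock... */
    uint16_t cpu;
    uint64_t timestamp;
    uint32_t thread_id;      /* the thread ID tuple: the thread itself... */
    uint32_t chain_id;       /* ...and its shuttle / RPC chain */
    uint64_t info[2];        /* probe-specific information */
} trace_entry_t;

#define TRACE_ENTRIES 4096   /* the buffer size is configurable */

static trace_entry_t trace_buf[TRACE_ENTRIES];
static unsigned      trace_head;   /* next free slot, wraps around */
static uint32_t      probe_mask;   /* probes dynamically (de)activated */

static void trace_write(const trace_entry_t *e)
{
    trace_buf[trace_head] = *e;
    trace_head = (trace_head + 1) % TRACE_ENTRIES;   /* circular buffer */
    /* A separate task drains the buffer to disk, where the
       report-generation programs analyze it off-line. */
}

/* An inactive probe costs only this test, which is why configured-in
 * but inactive probes stay around 1 percent overhead. */
#define PROBE(bit, entry)                     \
    do {                                      \
        if (probe_mask & (1u << (bit)))       \
            trace_write(entry);               \
    } while (0)
```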

The thread ID tuple identifies the thread and its shuttle or RPC chain. The RPC chain identifier allows us to determine which client thread a server is acting for. This is an invaluable tool for tracking the causal dependency of events in a client/server system, and it has also become a valuable tool for debugging the kernel and applications.

Miscellaneous

In addition to the work already described, we have made a variety of relatively small changes and additions to the microkernel. These include:

Networking with CORDS

The support for networks in Mach 3.0 was limited to the packet filter; protocols and network-transparent IPC were expected to be implemented as user-space servers. This had a negative effect on the performance of most uses of the network, and the architecture also presented significant obstacles to a correct implementation of network IPC. In OSF MK we have added an object-oriented framework for network protocols, Communication Objects for Reliable, Distributed Systems (CORDS) [9]. CORDS is derived from the x-kernel, developed at the University of Arizona [10].

The CORDS framework has many features that simplify the task of implementing network protocols. Complex protocols can be decomposed into a graph of ``micro-protocols'', and a protocol graph can be extended across protection boundaries, permitting portions of a protocol graph to exist in a task while other parts exist in the kernel. There is a notion of a ``path'' that describes the route a message will take through the protocol graph. Resources such as message buffers and threads can be attached to paths, allowing the protocol designer to manage the resources needed to guarantee the end-to-end quality of service required by the application. Paths also provide a natural means for handling protocol parallelism. We have used the framework to develop, among others, a real-time distributed clock protocol based on Cristian's algorithms, a node-alive protocol, ordered reliable broadcast protocols, and Mach IPC and RPC protocols.
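
The flavor of the path abstraction can be conveyed with a small sketch. Everything below is hypothetical (CORDS' real interfaces are richer and differ in detail); it merely shows a message descending a graph of micro-protocols along a path that carries its own pre-allocated resources.

```c
/* Hypothetical sketch of a CORDS-style protocol graph with paths;
 * all names and interfaces are illustrative only. */

struct path;                      /* forward declaration */
struct msg { char *data; unsigned len; };

/* A micro-protocol: one node of the protocol graph. */
struct protocol {
    const char *name;
    int (*push)(struct protocol *self, struct path *p, struct msg *m);
    struct protocol *below;       /* next micro-protocol toward the wire */
};

/* A path: a pre-computed route through the graph, carrying the
 * resources needed to guarantee end-to-end quality of service. */
struct path {
    struct protocol *top;         /* where messages enter the graph */
    void *buffers;                /* pre-allocated message buffers */
    void *threads;                /* threads dedicated to this path */
};

/* Send a message down the path: each micro-protocol performs its own
 * processing (adding headers, keeping retransmission state, etc.). */
int path_send(struct path *p, struct msg *m)
{
    int err = 0;
    for (struct protocol *pr = p->top; pr != 0 && err == 0; pr = pr->below)
        err = pr->push(pr, p, m);
    return err;
}
```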

Multi-Computers and Clusters

DIPC (Distributed IPC) and XMM (eXtended Memory Management) [13] provide transparent internode communication and shared memory on NORMA (No Remote Memory Access) architectures. DIPC extends Mach IPC in a way that permits applications running on any node to view Mach abstractions such as tasks, threads, memory objects and ports transparently. XMM supports distributed shared memory. The OSF/1 AD system uses these two Mach subsystems to provide a scalable, single-system image of UNIX. It is intended for massively parallel processing environments, such as the Intel Paragon MPP, but also for clusters of interconnected workstations.

Configurable Kernel

With all these extensions, the microkernel has grown unacceptably large. We want the microkernel to run on low-end machines and we also want to target embedded systems, so we embarked on a program to make most of the microkernel's features configurable [12]. The target is a minimal microkernel that could run on a compute-only node and would perform only basic scheduling and IPC.
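
Configurability of this sort is typically achieved with compile-time options. The fragment below is purely illustrative (the option names and init functions are invented, not OSF MK's actual configuration symbols); it shows optional subsystems compiled out around an always-present scheduling and IPC core.

```c
/* Hypothetical illustration of compile-time kernel configuration;
 * the option names and init functions are invented. */

#define CONFIG_RT_RPC  0    /* real-time RPC subsystem */
#define CONFIG_CORDS   0    /* in-kernel protocol framework */
#define CONFIG_ETAP    0    /* event tracing probes */
#define CONFIG_DIPC    0    /* distributed IPC for NORMA systems */

static void sched_init(void) { /* ... */ }
static void ipc_init(void)   { /* ... */ }

void kernel_bootstrap(void)
{
    /* Scheduling and IPC are always configured in: together they
       form the minimal kernel for a compute-only node. */
    sched_init();
    ipc_init();

#if CONFIG_RT_RPC
    rt_rpc_init();
#endif
#if CONFIG_CORDS
    cords_init();
#endif
#if CONFIG_ETAP
    etap_init();
#endif
#if CONFIG_DIPC
    dipc_init();
#endif
}
```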

Miscellaneous

We are developing or planning other projects on OSF MK which are less relevant to the Linux server project, because they are not integrated into the mainline microkernel or are not freely available. These projects include:

3. Port of Mach to PowerMac

Introduction

The Early Stages of a Kernel's Life

Once the minimal kernel functionality is complete, there remains the issue of device drivers. Device drivers are more platform-dependent than they are processor-dependent, and on many platforms device drivers may be reused or easily adapted from those written for previous ports. This was the case for both the serial line driver and the SCSI controller driver on the PowerMacs: writing a small stub of PowerMac-dependent DMA code meant that the original drivers could be used with little modification.

Trade-offs

In any first implementation of an operating system, simplicity and debuggability are traded off against performance and functionality. Once an initial version of the kernel is working, more time can be spent filling out stub routines and optimizing those routines that were written for simplicity rather than performance (routines such as bcopy and the bit testing and setting routines start out written in C before being optimized into hand-coded assembler). Another trade-off was to concentrate on obtaining functionality on the available test hardware. Minimal effort was made to cater for other processors in the PowerPC family or for other PowerMac machine architectures, since testing on these machines was not possible. However, the assembly code and low-level exception handlers have been written so that it should be simple to incorporate the behavior of the other PowerPC processors (the 603 and 604 in particular). Additional device drivers and interrupt-handling code will have to be written for the other machine architectures when porting to those machines.

4. Linux Server Architecture

A Single Server

Our Linux server is a ``single server'': the entire Linux functionality resides in one single Mach task. The alternative to this design is the ``multi-server'' design, where functionality is split between smaller, specialized tasks communicating through Mach RPC. The multi-server design takes better advantage of the Mach architecture and allows more code to be re-used between different OS personalities: a generic terminal server could be shared by most OS servers, for example. The drawbacks are performance and complexity. Performance is impacted by the cost of the extra communication between the various servers. Because servers can call each other in arbitrary ways, complex RPC chains are created, making it hard to implement some aspects of the OS functionality, such as interrupting a system call: one has to implement potentially complex mechanisms to chase the system call through the various servers and abort it in a sensible way. Although we are convinced that multi-servers are the way to go to produce high-quality operating systems on top of Mach, this strategy was not applicable in our case, because we started from an existing monolithic kernel and wanted to maximize the code reuse ratio, both to make it easier to track new releases of Linux and to leverage the Linux community effort.

A Multi-Threaded Server

A server on top of Mach simply receives and replies to requests from user tasks or from the microkernel. It has no explicit control over scheduling or hardware interrupts, so it cannot decide what it needs or wants to do at a given time. We did not want to add code everywhere to check whether there is something more important to do (like receiving an incoming network packet or disk block) or to manage explicit context switches, when we can rely instead on Mach threads and the user-mode ``cthreads'' library. This library offers various synchronization primitives (simple locks, mutexes and condition variables) and hides most of the details of managing the underlying Mach kernel threads.
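
As a minimal illustration of these primitives, a dedicated server thread might be structured as follows. The work queue is invented for the example; cthread_fork, mutex_lock and condition_wait are the standard cthreads interfaces, though the server's actual threads are of course more elaborate.

```c
#include <cthreads.h>

/* Minimal sketch of one dedicated server thread built on cthreads;
 * the work queue is invented for the example. */

static struct mutex     queue_lock;       /* protects the queue */
static struct condition queue_nonempty;   /* signaled when work arrives */
static int              pending_work;     /* stand-in for a real queue */

static any_t service_loop(any_t arg)
{
    mutex_lock(&queue_lock);
    for (;;) {
        /* Sleep until another thread queues work and signals us. */
        while (pending_work == 0)
            condition_wait(&queue_nonempty, &queue_lock);
        pending_work--;
        mutex_unlock(&queue_lock);
        /* ... service one request (e.g., a device interrupt) ... */
        mutex_lock(&queue_lock);
    }
    return (any_t)0;                      /* not reached */
}

void start_service_thread(void)
{
    mutex_init(&queue_lock);
    condition_init(&queue_nonempty);
    cthread_detach(cthread_fork(service_loop, (any_t)0));
}
```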
The Linux server has dedicated threads to handle the following tasks:

System Call Redirection

System calls issued by user processes must somehow be redirected to the Linux server. One classic approach, used by earlier Mach single servers, is to redirect system calls into an emulation library mapped into the address space of every user process, which services calls locally where possible and otherwise communicates with the server. Although this provides excellent performance, it means that the server functionality is shared between the emulation library and the server itself, leading to extra complexity and consistency problems; the server cannot really treat the emulation library like the rest of the user code, especially with respect to signal handling. The emulation library is not protected from user access and is therefore a potential Trojan horse for a malicious user: it is extremely complex (and inefficient) to protect the server against malicious use of the emulation library's privileged communication channel to the server. Furthermore, multi-threaded applications imply even more complexity for the emulation library, which has to be fully re-entrant and has to identify the user threads. For these reasons, we decided against the use of such an emulation library in our servers. Instead, we extended the Mach exception mechanism to be more flexible and efficient [17]. With OSF MK, a system call from a user task raises an exception and enters the microkernel, which sends an exception RPC to the server, providing the user thread's state. This is similar to the way a system call enters a traditional UNIX system.
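
On the server side, handling such an exception RPC amounts to decoding the trap and dispatching the corresponding Linux system call. The sketch below is hypothetical: it borrows the shape of the standard Mach 3.0 exception interface (catch_exception_raise, thread_get_state), but the linux_dispatch table is invented and the real server uses OSF MK's extended, faster exception mechanism.

```c
#include <mach/mach.h>

/* Hypothetical sketch of system-call redirection via exception RPC on
 * the x86; the dispatch table is invented, and the actual server uses
 * OSF MK's extended exception interface rather than this plain one. */

extern long (*linux_dispatch[])(long a1, long a2, long a3,
                                long a4, long a5);    /* invented */

kern_return_t catch_exception_raise(mach_port_t exception_port,
                                    mach_port_t thread,
                                    mach_port_t task,
                                    exception_type_t exception,
                                    exception_data_t code,
                                    mach_msg_type_number_t code_count)
{
    struct i386_thread_state state;
    mach_msg_type_number_t count = i386_THREAD_STATE_COUNT;

    /* Fetch the user thread's registers to find the system call
       number and its arguments (eax, ebx, ecx... per the Linux ABI). */
    thread_get_state(thread, i386_THREAD_STATE,
                     (thread_state_t)&state, &count);

    long nr = state.eax;
    state.eax = linux_dispatch[nr](state.ebx, state.ecx, state.edx,
                                   state.esi, state.edi);

    /* Store the return value and resume the user thread. */
    thread_set_state(thread, i386_THREAD_STATE,
                     (thread_state_t)&state, count);
    return KERN_SUCCESS;
}
```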

Combined with the collocation, thread migration and short-circuited RPC microkernel improvements, this method has proved to have competitive performance.

User Memory Access

Having the Linux server run as a regular user task makes it harder for it to access the memory of its user processes. The monolithic Linux kernel just uses segment registers to get inexpensive access to the user address space, but the Linux server has to use the Mach VM interfaces. Since this is also critical to overall system performance, we cannot afford the overhead of a switch into the microkernel for each access to user memory. Instead, we map the necessary user memory areas into the server's address space using the Mach VM services. Once the mapping is established, the server can access the memory without any performance penalty. When the Linux server is collocated in the microkernel's address space, it can even avoid setting up the mapping and use the microkernel's copyin and copyout mechanisms, which are similar to the monolithic Linux memcpy_fromfs and memcpy_tofs interfaces. There is still an extra cost, because the server does not have direct access to the microkernel routines (it is a separately linked task) and has to go through short-circuited-RPC-like interfaces. This overhead is not a problem in itself, but it is amplified by Linux's habit of doing many very small (byte or word) copies at a time. By reorganizing some pieces of code in critical places (mainly in the exec path), we managed to get reasonable performance.

Device Access

The device drivers are in the microkernel, but the server has to access them and let its processes use them. Linux handles device numbers and uses its own device operation routines, whereas Mach names its devices with regular names (``console'', ``hd0a'', ``fd0a'', ``sd0a'', etc.) and offers its own device interfaces. In the Linux server, we just added a generic emulation layer, replacing the bottom half of most Linux device drivers. The device emulation code fulfills two tasks: translating Linux device numbers into the corresponding Mach device names, and mapping the Linux device operations onto the Mach device interfaces.
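
A sketch of such a layer is shown below. It is illustrative only: the name table and request routine are invented, and only the standard Mach device calls (device_open, device_read, device_close) are assumed, not the server's actual code.

```c
#include <stdio.h>
#include <mach/mach.h>
#include <device/device.h>

/* Hypothetical sketch of the Linux-to-Mach device emulation layer:
 * translate a Linux major/minor device number to a Mach device name,
 * then drive the device through the Mach device interface. */

struct dev_map { int linux_major; const char *mach_name; };

static const struct dev_map dev_names[] = {
    { 3, "hd" },       /* Linux IDE disk  -> Mach "hd..." devices */
    { 8, "sd" },       /* Linux SCSI disk -> Mach "sd..." devices */
    { 4, "console" },
};

extern mach_port_t device_server_port;   /* obtained at bootstrap */

/* Replace the bottom half of a Linux block driver: read one block. */
kern_return_t emul_block_read(int major, int unit, recnum_t block,
                              int bytes, io_buf_ptr_t *data,
                              mach_msg_type_number_t *count)
{
    char name[16];
    mach_port_t dev;
    kern_return_t kr;

    /* 1. Translate the Linux device number into a Mach device name. */
    const char *base = 0;
    for (unsigned i = 0; i < sizeof dev_names / sizeof dev_names[0]; i++)
        if (dev_names[i].linux_major == major)
            base = dev_names[i].mach_name;
    if (!base)
        return KERN_INVALID_ARGUMENT;
    sprintf(name, "%s%d", base, unit);           /* e.g. "sd0" */

    /* 2. Map the Linux operation onto the Mach device interface. */
    kr = device_open(device_server_port, D_READ, name, &dev);
    if (kr != KERN_SUCCESS)
        return kr;
    kr = device_read(dev, 0, block, bytes, data, count);
    device_close(dev);
    return kr;
}
```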

Scheduling

Fake Interrupts

Linux Jiffies Emulation

Linux VM Emulation

VM Mappings

Memory Map

External Inode Pager

Dynamic Buffer Cache

Advisory Page Out

Avoiding Double Paging

5. Linux Server on the PowerMac

Differences with the native Linux

6. Performance

Conclusion

7. Status and Future Work

Portability

Status on the Intel Platform

Status on the PowerMac Platform

Linux Code Base

Linux Device Drivers

Development Environment

Availability

8. Related Work

BSD-Lite Server

GNU HURD

9. Conclusions

Acknowledgments

References
