Ghosts of Unix past, part 3: Unfixable designs

LWN.net needs you! Without subscribers, LWN would simply not exist. Please consider signing up for a subscription and helping to keep LWN publishing

In the second installment of this series, we documented two designs that were found to be imperfect and have largely (though not completely) been fixed through ongoing development. Though there was some evidence that the result was not as elegant as we might have achieved had the original mistakes not been made, it appears that the current design is at least adequate and on a path towards being good.

However, there are some designs mistakes that are not so easily corrected. Sometimes a design is of such a character that fixing it is never going to produce something usable. In such cases it can be argued that the best way forward is to stop using the old design and to create something completely different that meets the same need. In this episode we will explore two designs in Unix which have seen multiple attempts at fixes but for which it isn't clear that the result is even heading towards "good". In one case a significant change in approach has produced a design which is both simpler and more functional than the original. In the other case, we are still waiting for a suitable replacement to emerge. After exploring these two "unfixable designs" we will try to address the question of how to distinguish an unfixable design from a poor design which can, as we saw last time, be fixed.

Unix signals

Our first unfixable design involves the delivery of signals to processes. In particular it is the registration of a function as a "signal handler" which gets called asynchronously when the signal is delivered. That this design was in some way broken is clear from the fact that the developers at UCB (The University of California at Berkeley, home of BSD Unix) found the need to introduce the sigvec() system call, along with a few other calls, to allow individual signals to be temporarily blocked. They also changed the semantics of some system calls so that they would restart rather than abort if a signal arrived while the system call was active.

It seems there were two particular problems that these changes tried to address. Firstly there is the question of when to re-arm a signal handler. In the original Unix design a signal handler was one-shot - it would only respond the first time a signal arrived. If you wanted to catch a subsequent signal you would need to make the signal handler explicitly re-enable itself. This can lead to races, such as, if a signal is delivered before the signal handler is re-enabled it can be lost forever. Closing these races involved creating a facility for keeping the signal handler always available, and blocking new deliveries while the signal was being processed.

The other problem involves exactly what to do if a signal arrives while a system call is active. Options include waiting for the system call to complete, aborting it completely, allowing it to return partial results, or allowing it to restart after the signal has been handled. Each of these can be the right answer in different contexts; sigvec() tried to provide more control so the programmer could choose between them.

Even these changes, however, were not enough to make signals really usable, so the developers of System V (at AT&T) found the need for a sigaction() call which adds some extra flags to control the fine details of signal delivery. This call also allows a signal handler to be passed a " siginfo_t " data structure with information about the cause of the signal, such as the UID of the process which sent the signal.

As these changes, particularly those from UCB, were focused on providing "reliable" signal delivery, one might expect that at least the reliability issues would be resolved. Not so it seems. The select() system call (and related poll() ) did not play well with signals so pselect() and ppoll() had to be invented and eventually implemented. The interested reader is encouraged to explore their history. Along with these semantic "enhancements" to signal delivery, both teams of developers chose to define more signals generated by different events. Though signal delivery was already problematic before these were added, it is likely that these new demands stretched the design towards breaking point.

An interesting example is SIGCHLD and SIGCLD, which are sent when a child exits or is otherwise ready for the parent to wait() for it. The difference between these two (apart from the letter "H" and different originating team) is that SIGCHLD is delivered once per event (as is the case with other signals) while SIGCLD would be delivered constantly (unless blocked) while any child is ready to be waited for. In the language of hardware interrupts, SIGCHLD is edge triggered while SIGCLD is level triggered. The choice of a level-triggered signal might have been an alternate attempt to try to improve reliability. Adding SIGCLD was more than just defining a new number and sending the signal at the right time. Two of the new flags added for sigaction() are specifically for tuning the details of handling this signal. This is extra complexity that signals didn't need and which arguably did not belong there.

In more recent years the collection of signal types has been extended to include "realtime" signals. These signals are user-defined signals (like SIGUSR1 and SIGUSR2) which are only delivered if explicitly requested in some way. They have two particular properties. Firstly, realtime signals are queued so the handler in the target process is called exactly as many times as the signal was sent. This contrasts with regular signals which simply set a flag on delivery. If a process has a given (regular) signal blocked and the signal is sent several times, then, when the process unblocks the signal, it will still only see a single delivery event. With realtime signals it will see several. This is a nice idea, but introduced new reliability issues as the depth of the queue was limited, so signals could still be lost. Secondly (and this property requires the first), a realtime signal can carry a small datum, typically a number or a pointer. This can be sent explicitly with sigqueue() or less directly with, e.g., timer_create() .

It could be thought that this addition of more signals for more events is a good example of the "full exploitation" pattern that was discussed at the start of this series. However, when adding new signal types require significant changes to the original design, it could equally seem that the original design wasn't really strong enough to be so fully exploited. As can be seen from this retrospective, though the original signal design was quite simple and elegant, it was fatally flawed. The need to re-arm signals made them hard to use reliably, the exact semantics of interrupting a system call was hard to get right, and developers repeatedly needed to significantly extend the design to make it work with new types of signals.

The most recent step in the saga of signals is the signalfd() system call which was introduced to Linux in 2007 for 2.6.22. This system call extends "everything has a file descriptor" to work for signals too. Using this new type of descriptor returned by signalfd() , events that would normally be handled asynchronously via signal handlers can now be handled synchronously just like all I/O events. This approach makes many of the traditional difficulties with signals disappear. Queuing becomes natural so re-arming becomes a non-issue. Interaction with system calls ceases to be interesting and an obvious way is provided for extra data to be carried with a signal. Rather than trying to fix a problematic asynchronous delivery mechanism, signalfd() replaces it with a synchronous mechanism that is much easier to work with and which integrates well into other aspect of the Unix design - particularly the universality of file descriptors.

It is a fun, though probably pointless, exercise to imagine what the result might have been had this approach been taken to signals when problems were first observed. Instead of adding new signal types we might have new file descriptor types, and the set of signals that were actually used could have diminished rather than grown. Realtime signals might instead be a general and useful form of interprocess communication based on file descriptors.

It should be noted that there are some signals which signalfd() cannot be used for. These include SIGSEGV, SIGILL, and other signals that are generated because the process tried to do something impossible. Just queueing these signals to be processed later cannot work, the only alternatives are switching control to a signal handler, or aborting the process. These cases are handled perfectly by the original signal design. They cannot occur while a system call is active (system calls return EFAULT rather than raising a signal) and issues with when to re-arm the signal handler are also less relevant.

So while signal handlers are perfectly workable for some of the early use cases (e.g. SIGSEGV) it seems that they were pushed beyond their competence very early, thus producing a broken design for which there have been repeated attempts at repair. While it may now be possible to write code that handles signal delivery reliably, it is still very easy to get it wrong. The replacement that we find in signalfd() promises to make event handling significantly easier and so more reliable.

The Unix permission model

Our second example of an unfixable design which is best replaced is the owner/permission model for controlling access to files. A well known quote attributed to H. L. Mencken is "there is always a well-known solution to every human problem - neat, plausible, and wrong." This is equally true of computing problems, and the Unix permissions model could be just such a solution. The initial idea is deceptively simple: six bytes per file gives simple and broad access control. When designing an operating system to fit in 32 kilobytes of RAM (or less), such simplicity is very appealing, and thinking about how it might one day be extended is not a high priority, which is understandable though unfortunate.

The main problems with this permission model is that it is both too simple and too broad. The breadth of the model is seen in the fact that every file stores its own owner, group owner, and permission bits. Thus every file can have distinct ownership or access permissions. This is much more flexibility than is needed. In most cases, all the files in a given directory, or even directory tree have the same ownership and much the same permissions. This fact was leveraged by the Andrew filesystem which only stores ownership and permissions on a per-directory basis, with little real loss of functionality.

When this only costs six bytes per file it might seem a small price to pay for the flexibility. However once more than 65,536 different owners are wanted, or more permission bits and more groups are needed, storing this information begins to become a real cost. However the bigger cost is in usability.

While a computer may be able to easily remember six bytes per file, a human cannot easily remember why various different settings might have been assigned and so are very likely to create sets of permission settings which are inconsistent, inappropriate, and hence not particularly secure. Your author has memories from University days of often seeing home directories given "0777" permissions (everyone has any access) simply because a student wanted to share one file with a friend, but didn't understand the security model.

The excessive simplicity of the Unix permission model is seen in the fixed, small number of permission bits, and, particularly, that there is only one "group" that can have privileged access. Another maxim from computer engineering, attributed to Alan Kay, is that "Simple things should be simple, complex things should be possible." The Unix permission model makes most use cases quite simple but once the need exceeds that common set of cases, further refinement becomes impossible. The simple is certainly simple, but the complex is truly impossible.

It is here that we start to see real efforts to try to "fix" the model. The original design gave each process a "user" and a "group" corresponding to the "owner" and "group owner" in each file, and they were used to determine access. The "only one group" limit is limiting on both sides; the Unix developers at UCB saw that, for the process side at least, this limit was easy to extend. They allowed a process to have a list of groups for checking filesystem access against. (Unfortunately this list originally had a firm upper limit of 16, and that limit made its way into the NFS protocol where it was hard to change and is still biting us today.)

Changing the per-file side of this limit is harder as that requires changing the way data is encoded in a filesystem to allow multiple groups per file. As each group would also need its own set of permission bits a file would need a list of groups and permission bits and these became known quite reasonably as "access control lists" or ACLs. The Posix standardization effort made a couple of attempts to create a standard for ACLs, but never got past draft stage. Some Unix implementations have implemented these drafts, but they have not been widely successful.

The NFSv4 working group (under the IETF umbrella) were tasked with creating a network filesystem which, among other goals, would provide interoperability between POSIX and WIN32 systems. As part of this effort they developed yet another standard for ACLs which aimed to support the access model of WIN32 while still being usable on POSIX. Whether this will be more successful remains to be seen, but it seems to have a reasonable amount of momentum with an active project trying to integrate it into Linux (under the banner of "richacls") and various Linux filesystems.

One consequence of using ACLs is that the per-file storage space needed to store the permission information is not only larger than six bytes, it is not of a fixed length. This is, in general, more challenging than any fixed size. Those filesystems which implement these ACLs do so using "extended attributes" and most impose some limit on the size of these - each filesystem choosing a different limit. Hopefully most ACLs that are actually used will fit within all these arbitrary limits.

Some filesystems - ext3 at least - attempt to notice when multiple files have the same extended attributes and just store a single copy of those attributes, rather than one copy for each file. This goes some way to reduce the space cost (and access-time cost) of larger ACLs that can be (but often aren't) unique per file, but does nothing to address the usability concerns mentioned earlier. In that context, it is worth quoting Jeremy Allison, one of the main developers of Samba, and so with quite a bit of experience with ACLs from WIN32 systems and related interoperability issues. He writes: "But Windows ACLs are a nightmare beyond human comprehension :-). In the 'too complex to be usable' camp." It is worth reading the context and follow up to get a proper picture, and remembering that richacls, like NFSv4 ACLs, are largely based on WIN32 ACLs.

Unfortunately it is not possible to present any real example of replacing rather than fixing the Unix permission model. One contender might be that part of "SELinux" that deals with file access. This doesn't really aim to replace regular permissions but rather tries to enhance them with mandatory access controls. SELinux follows much the same model of Unix permissions, associating a security context with every file of interest, and does nothing to improve the usability issues.

There are however two partial approaches that might provide some perspective. One partial approach began to appear in Level 7 Unix with the chroot() system call. It appears that chroot() wasn't originally created for access control but rather to have a separate namespace in which to create a clean filesystem for distribution. However it has since been used to provide some level of access control, particularly for anonymous FTP servers. This is done by simply hiding all the files that the FTP server shouldn't access. Anything that cannot be named cannot be accessed.

This concept has been enhanced in Linux with the possibility for each process not just to have its own filesystem root, but also to have a private set of mount points with which to build a completely customized namespace. Further it is possible for a given filesystem to be mounted read-write in one namespace and read-only in another namespace, and, obviously, not at all in a third. This functionality is suggestive of a very different approach to controlling access permissions. Rather than access control being per-file, it allows it to be per-mount. This leads to the location of a file being a very significant part of determining how it can be accessed. Though this removes some flexibility, it seems to be a concept that human experience better prepares us to understand. If we want to keep a paper document private we might put it in a locked drawer. If we want to make it publicly readable, we distribute copies. If we want it to be writable by anyone in our team, we pin it to the notice board in the tea room.

This approach is clearly less flexible than the Unix model as the control of permissions is less fine grained, but it could well make up for that in being easier to understand. Certainly by itself it would not form a complete replacement, but it does appear to be functionality that is growing - though it is too early yet to tell if it will need to grow beyond its strength. One encouraging observation is that it is based on one of those particular Unix strengths observed in our first pattern, that of "a hierarchical namespace" which would be exploited more fully.

A different partial approach can be seen in the access controls used by the Apache web server. These are encoded in a domain-specific language and stored in centralized files or in " .htaccess s" files near the files that are being controlled. This method of access control has a number of real strengths that would be a challenge to encode into anything based on the Unix permission model:

The permission model is hierarchical, matching the filesystem model. Thus controls can be set at whichever point makes most sense, and can be easily reviewed in their entirety. When the controls set at higher levels are not allowed to be relaxed at lower levels it becomes easy to implement mandatory access controls.

The identity of the actor requesting access can be arbitrary, rather than just from the set of identities that are known to the kernel. Apache allows control based on source IP address or username plus password. Using plug-in modules almost anything else that could be available.

Access can be provided indirectly through a CGI program. Thus, rather than trying to second-guess all possible access restrictions that might be desirable and define permission bits for them in a new ACL, the model can allow any arbitrary action to be controlled by writing a suitable script to mediate that access.

It should be fairly obvious that this model would not be an easy fit with kernel-based access checking and, in any case, would have a higher performance cost than a simpler model. As such it would not be suitable to apply universally. However it could be that such a model would be suitable for that small percentage of needs that do not fit in a simple namespace based approach. There the cost might be a reasonable price for the flexibility.

While an alternate approach such as these might be appealing, it would face a much bigger barrier to introduction than signalfd() did. signalfd() could be added as a simple alternate to signal handlers. Programs could continue to use the old model with no loss, while new programs can make use of the new functionality. With permission models, it is not so easy to have two schemes running in parallel. People who make serious use of ACLs will probably already have a bunch of ACLs carefully tuned to their needs and enabling an alternate parallel access mechanism is very likely to break something. So this is the sort of thing that would best be trialed in a new installation rather than imposed on an existing user-base.

Discerning the pattern

If we are to have a convincing pattern of "unfixable designs" it must be possible to distinguish them from fixable designs such as those that we found last time. In both cases, each individual fix appears to be a good idea addressing a real problem without obviously introducing more problems. In some case this series of small steps leads to a good result, in others these steps only help you get past the small problems enough to be able to see the bigger problem.

We could use mathematical terminology to note that a local maximum can be very different from a global maximum. Or, using mountain-climbing terminology, it is hard to know the true summit from a false summit which just gives you a better view of the mountain. In each case the missing piece is a large scale perspective. If we can see the big picture we can more easily decide if a particular path will lead anywhere useful or if it is best to head back to base and start again.

Trying to move this discussion back to the realm of software engineering, it is clear that we can only head off unfixable designs if we can find a position that can give us a clear and broad perspective. We need to be able to look beyond the immediate problem, to see the big picture and be willing to tackle it. The only known source of perspective we have for engineering is experience, and few of us have enough experience to see clearly into the multiple facets and the multiple levels of abstraction that are needed to make right decisions. Whether we look for such experience by consulting elders, by researching multiple related efforts, or finding documented patterns that encapsulate the experience of others, it is vitally important to leverage any experience that is available rather than run the risk of simply adding bandaids to an unfixable design.

So there is no easy way to distinguish an unfixable design from a fixable one. It requires leveraging the broad perspective that is only available through experience. Having seen the difficulty of identifying unfixable designs early we can look forward to the final part of this series, where we will explore a pernicious pattern in problematic design. While unfixable designs give a hint of deeper problems by appearing to need fixing, these next designs do not even provide that hint. The hints that there is a deeper problem must be found elsewhere.

Exercises

Though we found that signal handlers had been pushed well beyond their competence, we also found at least one area (i.e. SIGSEGV) when they were still the right tool for the job. Determine if there are other use cases that avoid the observed problems, and so provide a balanced assessment of where signal handlers are effective, and where they are unfixable. Research problems with " /tmp ", attempts to fix them, any unresolved issues, and any known attempts to replace rather than fix this design. Describe an aspect of the IP protocol suite that fits the pattern of an "Unfixable design". It has been suggested that dnotify, inotify, fanotify are all broken. Research and describe the problems and provide an alternate design that avoids all of those issues. Explore the possibility of using fanotify to implement an "apache-like" access control scheme with decisions made in user-space. Identify enhancements requires to fanotify for this to be practical.

Next article

Ghosts of Unix past, part 4: High-maintenance designs

