Joel Spolsky came up with the term “Leaky Abstraction” and its associated law:

All non-trivial abstractions, to some degree, are leaky.

As programmers and system builders, we leverage abstraction at every turn. When an abstraction leaks, some characteristic of the underlying implementation spoils the purity of the abstraction. The programmer then has to tune their use of the abstraction to account for these messy details, or the program breaks or performs badly.

One can quibble with the equivocation in this law even while agreeing with the basic principle. It’s hard to disprove: if the abstraction doesn’t leak, one can claim it is trivial. Alternatively, leaking “to some degree” covers a lot of ground. But the general examples Joel provides resonate. Whether it is abstracting a reliable connection over an unreliable transport, or something as basic as abstracting a flat memory address space in the face of processor cache locality effects, every developer has wrestled with the characteristics of the implementation underlying the abstractions they build on.

Joel’s original post is a long lament with little guidance. Abstraction is key to managing complexity in a large system and yet these leaky abstractions mean things really just keep getting more and more complex. A developer needs to master more and more technologies. The best engineers embrace this leakiness, working to deeply understand why it arises and how they need to account for it. Lamenting it is really a failure of end-to-end design.

I wanted to consider a couple of issues related to leakiness. In some cases, the problem with the abstraction and the cause of leakiness is that the abstraction tries to abstract or hide “the wrong thing”. A common type of design problem is trying to abstract something that can’t successfully be abstracted to begin with. The most common and perhaps fundamental example is latency: wrapping something in a synchronous API to make it look local and fast when it is really remote and slow.

While I was working on Office, we actually shipped a version of Excel that, after saving a file, would query the file handle for the new modified timestamp. For a local NTFS file, this information was effectively cached in memory and so the call was fast and cheap. When running against the “DAV redirector”, which abstracted the file system across the remote DAV protocol, asking for the last modified time would download the entire file in order to retrieve the metadata associated with it (even though the underlying DAV protocol implementation had provided the new modified time as part of the response to the save operation). So effectively on every save we would upload the entire file and then immediately download it again in order to retrieve a 64-bit value.

In Word 2003, 5% of all hangs were caused by a synchronous call to test for the existence of a file prior to displaying the MRU list. For files on network drives, this call could take a long time to time out (especially when moving between home and work networks), and the user would kill the application while it sat there hung, waiting for the timeout.
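As a rough illustration of the alternative to the MRU hang, here is a minimal Python sketch that makes latency part of the API rather than hiding it. The names `file_exists`, `mru_status`, `timeout_s` and `probe` are hypothetical and chosen for this example; this is not how Word was written or later fixed. The check runs off the calling thread and the caller gets a three-valued answer: exists, does not exist, or unknown because the check did not finish in time.

```python
import asyncio
import os
from typing import Callable, Dict, List, Optional


async def file_exists(
    path: str,
    timeout_s: float = 0.5,
    probe: Callable[[str], bool] = os.path.exists,
) -> Optional[bool]:
    """Return True/False if the probe finishes in time, None if it does not.

    The blocking probe runs on a worker thread, so a slow network path
    cannot freeze the caller. Note the worker thread itself keeps running
    until the underlying call returns; only the wait is abandoned.
    """
    loop = asyncio.get_running_loop()
    try:
        return await asyncio.wait_for(
            loop.run_in_executor(None, probe, path), timeout_s
        )
    except asyncio.TimeoutError:
        return None  # "unknown": the caller decides how to present this


async def mru_status(
    paths: List[str],
    timeout_s: float = 0.5,
    probe: Callable[[str], bool] = os.path.exists,
) -> Dict[str, Optional[bool]]:
    """Check all MRU entries concurrently instead of one after another."""
    results = await asyncio.gather(
        *(file_exists(p, timeout_s, probe) for p in paths)
    )
    return dict(zip(paths, results))
```

The design choice is that "unknown" is a legitimate result. A UI built on this can show the list immediately and gray out entries whose status has not resolved, instead of blocking on the slowest path.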

One way of dealing with leakiness is simply not trying to abstract something where the failure of the abstraction is so profound that it impacts (or should impact) the overall structure of how the abstraction is used. Often the best approach is to adapt the abstraction itself, not the usage of it. You can’t “abstract” that P = NP (at least I don’t think so!).

A surprisingly hard problem is how to design a system that is “intentionally leaky” — where you can provide higher level functionality while still exposing internal layers that allow someone building on top of your component direct access to those lower layers. This is a common problem in building UI application frameworks. Office worked closely with the XAML and Windows composition teams to come up with the right design that both provided access to high level XAML functionality while allowing direct and highly performant access to composition and rendering. This seems like an eternal problem since every new windowing / control framework seems to start with the perspective that “windows are going to be really cheap so I don’t need to expose a lower-level primitive” and then they keep getting heavier and more expensive as more and more functionality is accreted on. Eventually the framework ends up going down a path of “window-less controls” or other mechanisms to reduce cost. Ultimately the real solution is to add explicit layering — which is the hard design problem.

One of the long-running layering controversies inside Microsoft was the topic of “Exchange on SQL”. Exchange runs on “ESENT”, a database engine that also underlies many other technologies in the company (e.g. Active Directory) but is separate from the commercial SQL Server engine. There were elements of the company that pushed hard to re-platform Exchange on top of SQL (and in fact they long had working prototypes of such an implementation). The real controversy was not which technology to use for low-level IO. The argument was whether (or how) to expose the underlying data model that Exchange implements (as well as how to change that data model so that it was easier to program against). The argument was that building on top of SQL would allow direct access to the Exchange data model through all SQL’s powerful programming technologies. That is, the intent was that the implementation on top of SQL would be “intentionally leaky”. The controversy was whether that was a feature or a bug.

The view of the Exchange team is that Exchange is an app. The “app semantics” are implemented through a combination of specific database table design as well as the code that operates over that data. Some of that code enforces semantics that are not explicit in the data design (as well as not present in database triggers and other mechanisms embedded in the database itself to enforce integrity). The Exchange code is also tuned to operate with that data design to deliver the performance and responsiveness required by the “app”. Virtually every complex application has these characteristics.

Exposing the underlying data representation has three problems that are universal to this kind of layering problem.

Semantic Integrity. Some of the semantics are implemented at the app level. In fact, the “verbs” (how the data is modified) can be just as important to application semantics as the “nouns” (the structure of the data itself). Accessing and modifying the data directly through the data access level can break assumptions that are made throughout the rest of the app code. Databases provide mechanisms to define and enforce certain integrity characteristics of the data schema, but often properties are more easily and efficiently enforced at the app level.

Performance. The design of the data model is typically carefully tuned for the access patterns and requirements of the app built on top of it. Allowing third-party code access at this level exposes significant performance risk. The risk shows up in two ways. Applications that access the lower layers may use them in ways that are unoptimized, bringing the performance of the overall system down. Alternatively, the wider set of use cases that the lower levels need to support either prevents or dilutes efforts to optimize for the more limited behaviors of the app itself.

Agility. The detailed data model is effectively a hidden implementation detail of the application. As soon as it is exposed to third parties, it becomes an explicit public contract with rigid requirements around how and when it can be changed. This is a huge constraint on future evolution of an application whose performance is highly linked to the careful design of the data model and the ability to tune it for the load patterns of the application and its continually evolving feature set.
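To make the semantic integrity point concrete, here is a small Python sketch. The `Mailbox` class and its fields are invented for illustration and are not Exchange's actual data model. The unread count per folder is kept as denormalized data for speed, and only the app-level "verbs" keep it in sync with the messages.

```python
from collections import Counter


class Mailbox:
    """Toy mail store: the 'nouns' are plain data, the 'verbs' keep them coherent."""

    def __init__(self):
        self.messages = {}             # msg_id -> {"folder": str, "unread": bool}
        self.unread_count = Counter()  # folder -> cached count, maintained by the verbs

    def deliver(self, msg_id, folder):
        self.messages[msg_id] = {"folder": folder, "unread": True}
        self.unread_count[folder] += 1

    def mark_read(self, msg_id):
        msg = self.messages[msg_id]
        if msg["unread"]:
            msg["unread"] = False
            self.unread_count[msg["folder"]] -= 1

    def is_consistent(self):
        """Recompute the counts from scratch and compare with the cache."""
        actual = Counter(
            m["folder"] for m in self.messages.values() if m["unread"]
        )
        folders = set(self.unread_count) | set(actual)
        return all(self.unread_count[f] == actual[f] for f in folders)


box = Mailbox()
box.deliver(1, "Inbox")
box.deliver(2, "Inbox")
box.mark_read(1)                   # goes through the verb: cache stays correct
box.messages[2]["unread"] = False  # direct write to the data: cache is now stale
```

After the last line, `box.is_consistent()` returns `False`: nothing in the data layout itself forbade the write, because the rule lived only in the code. The same sketch hints at the agility problem, since any third party that reads `unread_count` directly would break if the app stopped caching it.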

These challenges and questions often also play out internally to the implementation of an application. A data model is often designed with layers of more restrictive constraints and semantics. For example, at one level the FrontPage data model is just a text buffer and associated ordered array of nodes with pointers into the buffer. At the next level, those nodes define an interval tree over the text buffer. At another level, it is a full HTML tree with significant semantic constraints on how and where nodes can be nested. Where exactly to enforce those constraints and whether to expose simpler low-level capabilities at higher levels is an interesting challenge. For example, it was convenient to implement the “blockquote” action by simply inserting a BLOCKQUOTE node into the node array and setting its start and end pointers to straddle the selected paragraphs. It was a simple and expressive way of implementing the action — in fact significantly less complex than the code it replaced that did explicit tree manipulations. But it meant that the integrity of the data structure at the full semantic level was now dependent on the correctness of those 6 lines of code (and the hundreds of other examples of similar manipulations sprinkled through the code base). A significant ongoing source of instability was tree consistency, although typically after doing messy merging or pruning of complex trees in actions like Cut and Paste rather than smaller actions like blockquote.
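The layering described above can be sketched in a few lines of Python. This is a simplified reconstruction under stated assumptions, not FrontPage's actual code: the `Node` type, the half-open offsets, and the single nesting rule in `FORBIDDEN_CHILDREN` are all invented for the example. The low-level `blockquote` edit is short and expressive, while `is_consistent` shows what the higher levels silently rely on.

```python
from dataclasses import dataclass


@dataclass
class Node:
    tag: str
    start: int  # offset into the text buffer (inclusive)
    end: int    # offset into the text buffer (exclusive)


# Assumed, heavily simplified semantic rule: a P may not contain block elements.
FORBIDDEN_CHILDREN = {"P": {"P", "BLOCKQUOTE"}}


def _order(n):
    # Document order: earlier start first; for equal starts, the longer span first.
    return (n.start, -n.end)


def blockquote(nodes, start, end):
    """Low-level edit: drop a BLOCKQUOTE node straddling [start, end)."""
    # Insert at the front so the stable sort places it outside equal spans.
    nodes.insert(0, Node("BLOCKQUOTE", start, end))
    nodes.sort(key=_order)


def is_consistent(nodes):
    """Check the higher-level invariants: proper nesting plus the tag rule."""
    stack = []
    for n in sorted(nodes, key=_order):
        while stack and stack[-1].end <= n.start:
            stack.pop()  # close ancestors that end before this node starts
        if stack:
            parent = stack[-1]
            if n.end > parent.end:
                return False  # partial overlap: the intervals do not form a tree
            if n.tag in FORBIDDEN_CHILDREN.get(parent.tag, ()):
                return False  # a tree, but not a legal one
        stack.append(n)
    return True


def paragraphs():
    # Three paragraphs over the buffer "First para.Second para.Third."
    return [Node("P", 0, 11), Node("P", 11, 23), Node("P", 23, 29)]


good = paragraphs()
blockquote(good, 0, 23)  # straddles the first two paragraphs exactly

bad = paragraphs()
blockquote(bad, 5, 15)   # offsets cut through the middle of two paragraphs
```

Here `is_consistent(good)` is `True` and `is_consistent(bad)` is `False`. Nothing in `blockquote` prevents the second call, which is the trade-off the paragraph describes: the simple edit is only as safe as the offsets each caller passes in.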

In an evolving system, you can have two alternate sources of layering stress, both tied to this same dynamic of more and more code deeply “correlated” with the underlying data representation. In the first case, you decide not to expose these internal layers because the overhead of going through these thin extra layers is low. But as these layers become more and more functional and the lower layers become bound to the semantics and load of the higher layers, the promise of transparent access to those lower layers becomes less and less true. In the second case, you start by exposing these lower layers (maybe because initially your mapping from data model to semantic model is quite straightforward), but over time allowing access to these lower layers is a significant ongoing constraint and overhead on the evolution of the system. There is no free lunch.