I have always been somewhat obsessed with the end-to-end argument. I see its application in systems everywhere. The paper I linked to above by Saltzer, Reed and Clark appeared in 1981, which seems a century ago in computer terms. Its genesis is even earlier; the paper traces motivations and examples back through the fifties. One can easily argue that the main premise is really independent of computers, since any complex layered system can be examined in this way. The immediate focus and motivation for writing the paper, though, was the early days of the Internet and the communication systems and applications being developed at the time.

The challenge was how to divide functionality between a distributed application and the underlying communication subsystem it is layered on top of. Their succinct articulation was:

The function in question can completely and correctly be implemented only with the knowledge and help of the application standing at the end points of the communication system. Therefore, providing that questioned function as a feature of the communication system itself is not possible. (Sometimes an incomplete version of the function provided by the communication system may be useful as a performance enhancement.)

These were the days of the canonical “seven layers” of the ISO communication stack. As a young engineer, I found it liberating that the challenge of where to place functionality did not arise from a poor understanding of this beautiful layering system but was fundamental to the structure of any system. That was especially true for someone like me who was focused primarily on applications, as I remained for most of my career.

In fact, the end-to-end argument defines a way of looking at virtually any layered system — which really includes any non-trivial system. It is a way of thinking — sort of an Occam’s razor for systems design.

The power of the end-to-end argument comes from being willing to accept failure. You recognize that you have no way of being perfect or solving some problem in certain layers of the system — which then opens up the opportunity for those layers to be simple, fast or great in some other dimension rather than “perfect”.

One broadly applied recent example is the rise of systems that use the concept of eventual consistency. Amazon is the prototypical example. You have multiple users interacting with this large distributed real-time system to buy a book. A classic centralized system has a single transactional database that ensures that you don’t sell the last copy of a book in your warehouse to more than one user. But you have an end-to-end problem. The “end” is the book being shipped from the warehouse and arriving in good condition at the buyer’s house. Even if the database system is perfect, that last book might get run over by a forklift in the warehouse and destroyed. You need some other end-to-end mechanism to ensure the buyer is satisfied — for example some way of putting the book on backorder and notifying the user.

Once you have such an end-to-end mechanism, you have much greater flexibility in the internal layers of the system. You can accept other types of “failure” that can be mitigated by your end-to-end mechanism. You can scale out that database across multiple boxes and use asynchronous mechanisms to bring them into consistency. You accept that you might sell that last book to multiple customers. You then leverage these end-to-end resolution mechanisms you already built to deal with potential anomalies and still meet your end-to-end goals.
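
A minimal sketch of this shape, with hypothetical names (Warehouse, place_order, reconcile) that stand in for no real Amazon system: sales are accepted optimistically with no cross-replica coordination, and an asynchronous reconciliation step reuses the backorder path to compensate when the last copy gets sold twice.

```python
class Warehouse:
    def __init__(self, stock: int):
        self.stock = stock            # brought into consistency lazily
        self.pending = []             # orders accepted but not yet reconciled

    def place_order(self, order_id: str) -> None:
        # Optimistically accept the sale; no global lock across replicas.
        self.pending.append(order_id)

    def reconcile(self, notify) -> None:
        # Runs asynchronously: ship what we can, compensate for the rest.
        for order_id in self.pending:
            if self.stock > 0:
                self.stock -= 1       # fulfill normally
            else:
                # The end-to-end mechanism we needed anyway, e.g. for the
                # book the forklift ran over: backorder and notify.
                notify(order_id, "backordered")
        self.pending.clear()

wh = Warehouse(stock=1)
wh.place_order("A")                   # two buyers both got "the last copy"
wh.place_order("B")
wh.reconcile(lambda oid, status: print(oid, status))  # prints: B backordered
```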

This pattern occurs over and over again. You accept that guaranteeing some result in internal layers is impossible (you revel in that impossibility), and that frees up those layers to focus on other key attributes like performance, scalability, or “merely” simplicity. A characteristic of these kinds of systems is that they exhibit a more dynamic robustness that makes them more tunable and responsive to underlying changes in the performance of their components.

A non-distributed example comes from the FrontPage HTML editor. One “end-to-end” goal is to ensure that the editor does not generate “stupid” HTML, that is, HTML structure that is awkward or inefficient in some way. Examples include unnecessary empty tags (e.g. <b></b>) or tags that abut or subsume each other and could be more gracefully joined into a single tag. The challenge is that the editor needs to be able to read virtually any existing HTML input and then perform edits on that content. You would rather have the hundreds of operations that edit the existing HTML structure focus on the transformation they need to perform, not on these broader issues of end-to-end correctness. The approach taken in FrontPage was to have each manipulation routine focus on making its local edit and then have a single separate routine that knew how to take existing structure and make semantically equivalent transformations to render it “non-stupid”. This kept the large number of editing routines relatively simple while concentrating the knowledge of how to transform stupid into non-stupid in a single location in the system. Ultimate responsibility rested with this one function rather than being scattered throughout the system.
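
As a sketch of that division of labor (illustrative only, not FrontPage’s actual code or data structures): each editing operation just mutates the tree locally, and one normalize routine owns all the knowledge about removing empty tags and merging adjacent duplicates.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    tag: str = ""                       # "" marks a text node
    text: str = ""
    children: List["Node"] = field(default_factory=list)

def normalize(node: Node) -> None:
    """Single cleanup pass: editing routines may leave local messes
    because this one routine knows how to remove them all."""
    for child in node.children:
        normalize(child)
    # Drop empty formatting tags like <b></b>.
    node.children = [c for c in node.children if c.children or not c.tag]
    # Merge adjacent identical tags: <b>Hel</b><b>lo</b> -> <b>Hello</b>.
    merged: List[Node] = []
    for c in node.children:
        if merged and c.tag and c.tag == merged[-1].tag:
            merged[-1].children.extend(c.children)
        else:
            merged.append(c)
    node.children = merged

para = Node("p", children=[
    Node("b"),                                   # empty <b></b>
    Node("b", children=[Node(text="Hel")]),
    Node("b", children=[Node(text="lo")]),
])
normalize(para)
assert len(para.children) == 1 and para.children[0].tag == "b"
```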

This is another characteristic of end-to-end approaches. You can concentrate responsibility for meeting the end-to-end goals in a more isolated part of the system rather than requiring a careful, and fragile, handoff of responsibility layer by layer.

One consequence of the end-to-end argument that is not discussed much in the literature is how the principle interacts with large organizational structures. The challenge is that internal layers in a system have a constituency: a team that is responsible for the layer and motivated to “improve” it in some way release over release. These improvements inevitably have costs: more code loaded, more memory used, more performance consumed. The team can easily lose sight of the end-to-end goals. The team leaders have established the “release goals”; of course they require additional code to be written and memory to be consumed! Why else would you need a team?

Butler Lampson’s old but still fabulous “Hints for Computer System Design” makes a couple of points about layering in system design that are really motivated by these end-to-end arguments. His dicta “make it fast” and “don’t hide power” are both about recognizing that the application is in a much better position to understand the tradeoffs in resource and performance consumption. The more decisions internal layers make on behalf of the application, the more flexibility is lost.
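
A tiny sketch of “don’t hide power”, with entirely hypothetical names: the cheap primitive stays exposed, the convenient version is visibly built on top of it, and the tradeoff stays in the application’s hands.

```python
# Hypothetical key-value backend standing in for some lower layer.
_STORE = {"users/alice/email": b"alice@example.com"}

def read_raw(key: str) -> bytes:
    """The fast primitive: one lookup, no interpretation."""
    return _STORE[key]

def read_text(key: str, encoding: str = "utf-8") -> str:
    """The convenience layer, openly built on the primitive.
    Its cost over read_raw is exactly one decode, and callers can
    always drop down to read_raw when even that is too much."""
    return read_raw(key).decode(encoding)

assert read_raw("users/alice/email") == b"alice@example.com"
assert read_text("users/alice/email") == "alice@example.com"
```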

The writers of these early papers had the advantage of dealing with much smaller systems — where the designers understood every layer and in some cases had written all the layers. They were intimately familiar with the details and tradeoffs. With today’s systems, most developers walk up to an existing layer and need to accept it for what it is. It is not always clear what tradeoffs or motivations have driven the developers of a layer you depend on.

There was a case in Office development that I found especially illustrative of these dynamics. The application code was making what looked like an innocuous call to read a string from a directory lookup. Internally, the layer ended up communicating with the remote Active Directory system to build up a large tree describing the various fields and types that had been configured as part of the overall Active Directory domain. This type information could then be used to perform automatic type transformations when accessing fields from the directory. Building the tree meant querying the remote AD system and then constructing a data structure that consumed several megabytes just to cache the information and make future calls faster. In our case, we had no need for the type transformations and were only making a single call; the whole mechanism was entirely unnecessary for us. The team had an answer for us: we were using the “slow, easy-to-use API” and should have been using the “fast, complex API”. This was purely an internal distinction the team had used to rationalize these tradeoffs; nothing in the external documentation described the difference.
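
The shape that would have avoided the cost looks something like this sketch (hypothetical stand-ins, not the real Active Directory API): the expensive schema cache is built lazily, and only on the typed path, so a plain string read never pays for machinery it does not use.

```python
import functools

# Stand-ins for the remote directory; not the real Active Directory API.
_DIRECTORY = {("cn=alice", "title"): "Editor"}

def _remote_read(dn: str, attr: str) -> str:
    return _DIRECTORY[(dn, attr)]

def _remote_schema() -> dict:
    print("querying AD and building the multi-megabyte type tree...")
    return {"title": str}               # attr -> type converter

def fetch_string(dn: str, attr: str) -> str:
    """Single round trip; never touches the schema machinery."""
    return _remote_read(dn, attr)

@functools.lru_cache(maxsize=1)
def _schema() -> dict:
    """The expensive part, built lazily and paid for only once,
    and only by callers who actually want typed access."""
    return _remote_schema()

def fetch_typed(dn: str, attr: str):
    return _schema()[attr](fetch_string(dn, attr))

fetch_string("cn=alice", "title")       # cheap: schema never built
fetch_typed("cn=alice", "title")        # first typed call pays the cost once
```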

A common approach that teams working on these middleware layers take is to make these costly features optional. The challenge is that over time the fact that there is a fast way to navigate through the functionality is lost to both the users of the API and the team that owns the layer. The API ends up being a hand grenade that can blow up in a developer’s face. The Mac Outlook client provides another good example. We received a bunch of complaints that the mail synchronization process was much slower than in simpler mail clients. An investigation found that the Mac Outlook client was requesting mail messages with a flag that told the Exchange service endpoint to fully resolve all email addresses to user names, which involved an uncached call back to the Active Directory server for each recipient in each message. It was a feature just waiting to blow up.
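
A sketch of what keeping that cost visible might look like, with hypothetical names rather than the actual Exchange or Outlook interfaces: the expensive behavior defaults to off, and its per-recipient cost is documented right where the caller chooses it.

```python
import time

def _resolve_recipient(address: str) -> str:
    time.sleep(0.05)              # stand-in for an uncached AD round trip
    return address.split("@")[0].title()

def sync_messages(messages, resolve_names: bool = False):
    """Fetch messages for sync. resolve_names costs one directory round
    trip per recipient per message; leave it off for bulk sync."""
    for msg in messages:
        if resolve_names:
            msg = dict(msg, recipients=[_resolve_recipient(r)
                                        for r in msg["recipients"]])
        yield msg

inbox = [{"subject": "hi", "recipients": ["alice@example.com"]}]
list(sync_messages(inbox))                       # fast bulk path
list(sync_messages(inbox, resolve_names=True))   # explicitly opt in to cost
```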

Another recent example comes from the work to build the Microsoft Graph API, which provides a consistent REST-based API across Office 365 and other Microsoft services. It is definitely worthwhile work; the APIs for the individual products grew up separately, and a consistent structure and set of endpoints will make them much easier to use. The challenge that arose (and is still being resolved) is that the team had a “mission”, and to the team that mission clearly justified whatever costs were needed to deliver on it. In particular, they added a layer to deliver this consistency that introduced additional latency and failure modes. A consequence was that internal teams frequently found these costs unacceptable and would continue to use the more baroque service-specific APIs. These internal teams were good proxies for external customers. You get a better result if the team building a layer understands it has no special right to eat performance in the service of some secondary goal. Make it fast! Following through on this ends up being both technically and organizationally difficult; it is so much easier if the organization gets to own its own code and service!

I like Lampson’s “Make it fast!” dictum because it serves as a much simpler analytic tool than actually producing and analyzing specific applications and their end-to-end requirements. That exercise is messy and open to argument; looking at a layer and recognizing where it has consumed performance unnecessarily is not. Just say no! No argument you make will justify the additional costs you have introduced.

There are lots of other examples that arose over the years in struggles between Windows and Office over how to build new functionality in “the right way”. Windows would want to make functionality easy to program (and in some cases automatically slipped into a layer in a way that did not require applications to write new code). The Office developers owned the end-to-end experience and wanted control. Solving for both is really difficult. I alluded to some of these challenges in my post Leaky by Design. One thing that helps is understanding that this dynamic plays out at both the system level and the organizational level. No one is acting in bad faith, but there are dynamics that can result in the wrong thing being built if you do not actively work against them.