On September 23, 1999, NASA's unmanned Mars Climate Orbiter reached Mars after cruising for ten months and 416 million miles. It fired its rockets to maneuver into orbit around Mars in preparation for a planned 687-day mission. Instead, the spacecraft swung behind Mars and was never heard from again.

A Simple Math Error

The $125 million orbiter disappeared because of a simple math error: the spacecraft’s software did not convert English units to metric.

The navigation team at the Jet Propulsion Laboratory (JPL) used the metric system in its calculations, while Lockheed Martin Astronautics, which designed and built the spacecraft, provided crucial acceleration data in English units of inches, feet, and pounds. The error affected the orbiter mission from launch, yet the problem was never caught and corrected.

Too Little Testing? Or Too Much to Test?

NASA performed an immediate review, identified the causes of the failure, and made several recommendations. The error that caused the failure was trivial. The need to use consistent units was always part of the specification and was well understood. What turned it into a cause of failure was the sheer complexity of the entire system, not the difficulty of the individual task.

There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.

— Sir Tony Hoare, The Emperor’s Old Clothes

The traditional approach to flight software for spacecraft at JPL was to program very conservatively and test everything very thoroughly because the recovery options are very limited if a program crashes 55 million miles away. Accounting for every possible risk in such a dynamic system takes a bit of magic, a lot of wisdom, experience, confidence, creativity, and attention to detail. It’s an art that is very difficult to teach.

Guide Rails

For many years most spacecraft programs were written in C. To reduce risk, teams identified features that were more prone to lead to bugs and created coding guidelines restricting their use. George Fairbanks calls these sorts of self-imposed architectural constraints guide rails (see Architectural Hoisting, IEEE Software, vol. 31, no. 4, pp. 12-15, July-Aug. 2014, doi:10.1109/MS.2014.82).

The rules said things like: no dynamic memory allocation, no recursion, and every switch statement must have a default case. With code reviews, tool-based compliance checkers, and similar measures to enforce the guide rails, the intent was that sufficient vigilance would reduce the risk to tolerable levels.

Vigilance is an effective technique for reducing risk. It’s ingrained into software developers’ defensive programming habits: always checking return values, validating parameter values, handling exceptions, and so on.

Eternal Vigilance is Exhausting

While vigilance is often sufficient when programming in the small, it scales poorly as the primary method of reducing risk. It has to be continuously sustained, and it only grows more difficult as the system gets more complex. A developer has limited cognitive bandwidth, and every additional item the developer must be vigilant about increases the mental burden of maintaining that vigilance and decreases the bandwidth remaining for implementing features.

Also, while coding guidelines can act as guide rails, they aren’t easily visible in the code because nothing in the code represents them explicitly. A new developer joining the team might not easily figure out what code the previous developers refrained from writing in order to comply with the guidelines.

Architectural Hoisting

The Mission Data System (MDS) project at NASA JPL exemplifies architecture hoisting in its methodologies, originally designed for spacecraft flight control software. These systems have multiple sensors and control systems that must be monitored and reacted to. Developers would expend considerable effort ensuring that one activity, say transmitting a block of data back to Earth, didn’t interfere with other critical tasks whose code was often distributed across many different modules.

What they did instead was hoist into the system architecture a model of the spacecraft’s sensors and components, along with the constraints that had to be followed. The model could then be transformed into code with enforcement of the constraints explicitly built in. The system itself would then enforce the priorities, so it would never allow a situation where, say, the spacecraft is busy taking pictures when it should be making a course correction.

Architectural Hoisting:

A design technique where the responsibility for a guide rail is moved away from developer vigilance into code, with the goal of achieving a global property on the system.

Example: Hoisting Memory Management

To take a simple example, garbage collection and smart pointers in C++ (like std::unique_ptr) can be seen as hoisting memory management, making the task easier for developers to handle. Hoisting generally comes with constraints and costs: hoisting memory management into the architecture with automatic garbage collection, for example, might make it more difficult to achieve the same level of performance.

Example: Hoisting Scalability and Concurrency

A recent Wired article, Why WhatsApp Only Needs 50 Engineers for Its 900M Users, explains why the company builds its service using a programming language called Erlang. David Chisnall explains (What Language I Use for… Building Scalable Servers: Erlang) some of the reasons why Erlang is well-suited for building highly scalable systems.

There is a pattern in Chisnall’s explanation. He describes the various ways Erlang limits the developer and how those limits make it easier to write concurrent programs that scale. For example, “If you want to write scalable, maintainable, parallel code, there is one rule that you must abide by: No data may be both shared and mutable. Erlang enforces this because within a process it has an (almost) purely functional model. All variables are immutable, with just one exception: the process dictionary.”

This pattern where the architecture enforces a guide rail in order to achieve a desired system attribute is the essence of architectural hoisting.

Hoisting Everywhere!

Once you get the idea of architecture hoisting, you start to realize that the various frameworks and libraries used for building large-scale systems are all examples of architecture hoisting in action. For example, Rest.li is a “framework for building robust, scalable RESTful architectures using type-safe bindings and asynchronous, non-blocking IO.” Each of the italicized terms is a quality attribute that Rest.li hoists for the developer. You can try the same exercise with the overview descriptions of other frameworks.

So That's What That's For

Learning to view a software framework as a way of applying various kinds of architecture hoisting to a system is like discovering that each tool in your toolbox has a little label on it like, “Hammer: Use flat end for pounding in nails and the other for removing them.” Each tool has characteristic abilities, constraints and tradeoffs. The more tools one knows how to use, the better a developer is able to select and use the best tool for the job.

A Common Language

Also, being able to summarize a complicated framework or component in terms of what it hoists and what constraints it imposes is a potent shorthand: it communicates to every developer on the team the rationale behind the design choices, so that the rest of the implementation will fulfill the desired quality attributes.

Technical Debt

Software developers have a great advantage when they are aware of the architecture they are using and the qualities it hoists for them. As a system evolves, a portion of what is called technical debt is an accretion of decisions to manage risk through vigilance instead of refactoring the handling of that risk into the design and the architecture. The more a system depends on vigilance, the more fragile it becomes and the harder it is to maintain.

As a system evolves, developers should be looking out for situations where the reliance on vigilance is slowing progress and raising the risk of introducing bugs. Before the risk gets too large, seek ways to establish guide rails or refactor the system to hoist those quality attributes into the design.

For Further Research...

View George Fairbanks's talk (Mar 18, 2012) about guide rails, vigilance and architecture hoisting.

Please join the conversation...

Have you found this to be true in your experience? Please comment below.

Thanks for reading. Please like and share. You can find my previous LinkedIn articles here (https://www.linkedin.com/today/author/davidpmax).

PHOTO: The Miriam and Ira D. Wallach Division of Art, Prints and Photographs: Photography Collection, The New York Public Library. "Construction workers and crane seen from below" The New York Public Library Digital Collections. 1931. http://digitalcollections.nypl.org/items/510d47d9-a902-a3d9-e040-e00a18064a99