First, a little backstory…

Reasoning about distributed systems is challenging.

That’s the first big lesson I learned when I started working on distributed systems in 2009. With single-threaded code, life is pretty simple. Control flow is easy to reason about, at least if you avoid code that’s riddled with goto statements. With concurrent code, things get trickier. Different threads are working on different things, and their interleaving is not always easy to predict or reason about.

Distributed systems are even worse. Because the individual servers are now separated by a network, no one gets to see the overall state of the system. By the time a server receives a message, its contents may already be stale, but that server has to make a decision anyway. Plus, distributed systems have to deal with partial failures. Servers can fail and messages can be delayed or dropped, and the system must recover. So distributed systems really open up a whole new level of complexity (maybe that’s why I like them).

The challenge with distributed systems is managing this complexity. Most people strive to make their distributed systems as simple as possible. That’s not to say the resulting systems will be trivial; there will still be a lot of essential complexity. The goal is to minimize any additional, accidental complexity. Until we master building “simple” systems, I think that’s solid advice.

During my PhD, I co-authored a paper titled “In Search of an Understandable Consensus Algorithm”, which describes how we designed the Raft algorithm specifically for understandability. The paper argues that understandability should be a primary design objective for distributed systems.

Even if we strive to simplify our distributed systems, they are still challenging to communicate about and understand at a design level. I’ve explored many techniques to help with distributed systems design, including formal specification, model checking, formal and informal proofs, simulation, and visualization. I’ve found all of these things valuable to learn about, and I think some of them are entirely practical to use. However, standard practice in industry is to use none of them. We still use low-tech design tools like whiteboards and back-of-the-envelope calculations, and we write out design documents in prose or pseudocode. The perception seems to be that doing anything more sophisticated would take too long. I hope to change that.

Meet Runway

Visualization of the Too Many Bananas model in Runway. It’s a simple concurrency problem. You’ve run out of bananas, so you leave for the store to get more. Before you return, your roommates have already gone out to get more bananas. You should never have more than 8 bananas at home, since bananas go bad over time.

Runway is a new tool for distributed systems design that I’ve been working on at Salesforce. It combines specification, model checking, simulation, and visualization, all centered around the idea of a system model:

Specification: In Runway, a specification is code that serves as a precise description of a model for others to review, as well as being executable by Runway for model checking and simulation. Specifications are written in a domain-specific language, where the model’s entire state is explicit and transition rules advance that state forward.

Model checking: A Runway specification can include invariants, properties on the state that must always be true. The model checker explores all reachable states for a model, up to some size limit. If the invariants are ever broken, it reports the history of state transitions that led to the bad state. This can be used to find errors in the design.

Simulation: Many aspects of distributed system designs that we’d like to evaluate aren’t hard invariants but are metrics on hard-to-predict or emergent behaviors. Runway’s randomized simulator runs the specification with user-provided assumptions about timing and external events, and it produces realistic executions. Data from these executions can be used to analyze the design’s availability, performance, scalability, efficiency, etc.

Visualization: Visualization is a great way to communicate about designs that can help others build intuition very quickly. When given a view for a model, Runway can produce visualizations or animations of the model. Runway generates an execution using the simulator and displays the model’s state graphically as it changes over time. The user can also interact with a visualization, to change the state of the model and see how it reacts.
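To make the specification, model checking, and simulation ideas concrete, here is a minimal sketch in plain Python. It is not Runway's specification language and not how Runway is implemented; it only mimics the shape of the idea on the Too Many Bananas problem. The names (`State`, `transitions`, `check`, `simulate`), the choice of two roommates, and the five bananas bought per trip are all assumptions made for illustration. Only the bound of 8 bananas comes from the problem statement. The simulator here picks rules uniformly at random, which is a much cruder assumption than user-provided timing.

```python
import random
from collections import deque
from typing import Iterator, List, NamedTuple, Optional, Tuple

MAX_BANANAS = 8    # invariant bound, from the problem statement
BUY_COUNT = 5      # assumption: bananas bought per store trip
NUM_ROOMMATES = 2  # assumption: number of roommates in the model


class State(NamedTuple):
    """The model's entire state, made explicit."""
    bananas: int                 # bananas currently at home
    at_store: Tuple[bool, ...]   # at_store[i] is True while roommate i is out


def initial_state() -> State:
    return State(bananas=0, at_store=(False,) * NUM_ROOMMATES)


def transitions(state: State) -> Iterator[Tuple[str, State]]:
    """Yield (label, next_state) for every rule enabled in `state`."""
    for i, out in enumerate(state.at_store):
        def moved(value: bool) -> Tuple[bool, ...]:
            return state.at_store[:i] + (value,) + state.at_store[i + 1:]
        if out:
            # Rule: a roommate comes home with a fresh bunch.
            yield (f"roommate {i} returns with {BUY_COUNT}",
                   State(state.bananas + BUY_COUNT, moved(False)))
        elif state.bananas == 0:
            # Rule: a roommate at home sees no bananas and leaves for the store.
            yield f"roommate {i} leaves for the store", State(0, moved(True))
        else:
            # Rule: a roommate at home eats one banana.
            yield (f"roommate {i} eats a banana",
                   State(state.bananas - 1, state.at_store))


def invariant(state: State) -> bool:
    return state.bananas <= MAX_BANANAS


def check(max_states: int = 10_000) -> Optional[List[str]]:
    """Breadth-first search over reachable states, up to a size limit.

    Returns the shortest list of transition labels leading to a state that
    breaks the invariant, or None if no such state was found.
    """
    start = initial_state()
    parent = {start: None}  # state -> (previous state, label), None for start
    queue = deque([start])
    while queue and len(parent) <= max_states:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while parent[state] is not None:
                state, label = parent[state]
                trace.append(label)
            return trace[::-1]
        for label, nxt in transitions(state):
            if nxt not in parent:
                parent[nxt] = (state, label)
                queue.append(nxt)
    return None


def simulate(steps: int, seed: int = 0) -> float:
    """Random walk over the rules; return the fraction of steps above the bound."""
    rng = random.Random(seed)
    state = initial_state()
    over = 0
    for _ in range(steps):
        _, state = rng.choice(list(transitions(state)))
        over += not invariant(state)
    return over / steps


if __name__ == "__main__":
    for step in check() or []:
        print(step)
    print("fraction of steps over the bound:", simulate(10_000))
```

Under these assumptions, `check()` returns a four-step history: both roommates leave before either returns, and then each comes home with five bananas. That is the race the model is meant to expose. `simulate()` answers a different kind of question: not whether the bad state is reachable, but how often a random execution spends time in it.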

I recently gave a talk on Runway at CoreOS Fest 2016 in Berlin. It includes a demo of Runway (starting at 13:25) with three example models: the Too Many Bananas problem, an elevator system, and Raft.

We hope Runway might one day be widely adopted as a distributed systems design tool, both in industry and in academia. It seems other tools haven’t gained widespread use so far, but we think Runway has a good chance for a few reasons:

Integrating many components together might tip the cost-benefit calculation in Runway’s favor: you can write one model and get a lot of value from it.

We hope to make Runway approachable with only a small learning curve. The interactive visualizations help, where people can learn about a design with no special knowledge of Runway. For model developers, Runway’s specification language is designed to be familiar to most industry developers and encourages simple code without many abstractions.

Runway visualizations run in a web browser, which enables people to share their models easily. We’re currently designing a registry to help people discover other models, as well as a component system to enable using one model within another (for example, you’ll be able to plug a wide area network model into a Raft model). We hope a community will emerge around sharing Runway models and learning from each other’s designs.

Three weeks ago, we open-sourced Runway (MIT license), and there’s a live version running at https://runway.systems. The project is still in the early stages of development, but we wanted to start growing the community now. We invite others to browse the existing models, build and share their own models, and help contribute to Runway’s development. We’re looking for help in a variety of areas, including compilers and programming languages, model checking optimizations, front-end, UX, and documentation. But most of all, we’re really excited to see what models you build and share with the world!