The main result of the Paxos Made Simple paper by Leslie Lamport is an algorithm for transforming any finite-state machine (FSM) into a fault-tolerant distributed system.

It sounds complicated, but with the step-by-step visualization and this post I'm going to demystify it.

Let us start with the FSM. It's a fancy name for a pretty wide class of applications. How wide? If you can model your computation as a pure function from a state and an event to a new state, and there is a fixed limit on the size of each state and event, then your computation (your application) belongs to the FSM class.
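To make the definition concrete, here is a minimal Python sketch of the "pure function from a state and an event to a new state" idea; the names and the counter example are mine, not from the paper:

```python
def transition(state: dict, event: dict) -> dict:
    # Pure: no I/O, no mutation of the inputs; the result depends
    # only on the arguments. This determinism is what lets every
    # replica reach the same state from the same event sequence.
    if event["type"] == "increment":
        return {**state, "counter": state.get("counter", 0) + 1}
    return state

state = {}
state = transition(state, {"type": "increment"})
state = transition(state, {"type": "increment"})
print(state["counter"])  # 2
```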

You can guess that almost any web application using three-tier architecture matches the description: the stateless servers play the role of the pure function while the size of the database is limited by the storage capacity.

Now it should be clear why Paxos is important: it lets us convert almost any application into a distributed system that achieves more reliability than any individual node provides.

The paper was written in 2001 and it contains several ground-breaking ideas:

Leslie Lamport demonstrated how to build two linearizable distributed data structures:

- a distributed write-once register (based on his Synod protocol)
- a distributed append-only log (known as Multi-Paxos, built as a sequence of distributed write-once registers)

He suggested using a distributed append-only log to store the sequence of all events leading a system to its current state, instead of storing the state itself (later this idea was rediscovered as the event sourcing pattern).

In 2016 it feels natural to use logs for FSMs (see ZooKeeper, Etcd, Raft). But logs aren't the first thing I want to work with as an engineer, because they force me to implement garbage collection and snapshots (nasty stuff; see the Paxos Made Live paper by Google). It's also possible to imagine distributed data structures other than logs modeling an FSM. I wonder whether they really exist, and why FSMs implemented on top of them aren't well known...

They do exist. For example, TreodeDB used a Synod-based distributed rewritable register. Interestingly, I haven't seen any academic paper dedicated to implementing an FSM on top of a distributed rewritable register (a distributed variable), but there are industrial systems like riak_ensemble and TreodeDB which use one. So I practiced my Math-Fu and described and proved FSM over Single Decree Paxos (a distributed variable flavor) in the How Paxos Works post.

IMHO the log-based solutions are more popular because they had the advantage of time. Until Vertical Paxos was published in 2009, log replication was the only consensus protocol capable of cluster membership change, and thus the only practical option for highly available services. But now we have a few more algorithms to choose from; Vertical Paxos and Raft's joint consensus membership change are among them. Take a look at the Best of both worlds: Raft's joint consensus + Single Decree Paxos post if you're interested in how dynamic membership works with a distributed variable.

Visualization

Now that we understand there is a place for alternative distributed FSMs, let's discuss the visualization. We're going to visualize the work of a distributed FSM managing an intercontinental ballistic missile launch system.

FSM

Imagine that we want to model an intercontinental ballistic missile launch system with the following rules:

- The missile is launched when at least two generals have signed off on the launch.
- Until the missile is launched, a general can revoke their sign-off.

The (non-distributed) FSM engine may look like
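The engine's actual code is embedded in the visualization; as a rough Python sketch (the class and method names are my own, not the original's), it boils down to holding the current state and folding incoming events over the transition function:

```python
class FSM:
    """A minimal, hypothetical non-distributed FSM engine."""

    def __init__(self, transition, initial_state):
        self.transition = transition  # pure: (state, event) -> new state
        self.state = initial_state    # the only mutable thing in the engine

    def execute(self, event):
        # Apply one event and keep the resulting state.
        self.state = self.transition(self.state, event)
        return self.state

# Usage with a trivial counter transition (illustrative only):
fsm = FSM(lambda s, e: s + 1 if e == "tick" else s, 0)
fsm.execute("tick")
fsm.execute("tick")
print(fsm.state)  # 2
```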

And the state transition functions are:
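The real functions live in the second code column of the visualization; a hypothetical Python rendering of the two rules above might look like this (the state layout and function names are my assumptions):

```python
def sign(state, general):
    # The launch is irreversible, so signing after it is a no-op.
    if state["launched"]:
        return state
    signed = state["signed"] | {general}
    # Rule 1: the missile is launched once at least two generals sign off.
    return {"signed": signed, "launched": len(signed) >= 2}

def unsign(state, general):
    # Rule 2: a sign-off can be revoked only until the launch.
    if state["launched"]:
        return state
    return {"signed": state["signed"] - {general}, "launched": False}

initial = {"signed": frozenset(), "launched": False}

s = sign(initial, "Lee")   # one signature: not launched yet
s = sign(s, "Ross")        # the second signature triggers the launch
print(s["launched"])  # True
```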

The rules define the possible states of the system, while the actual state is determined by the generals' behavior. It's quite simple in our example:

1. unsign the launch
2. sign the launch
3. goto 1
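This endless unsign/sign loop can be sketched as a generator of events each general feeds to the engine (a hypothetical rendering; the event shape is my assumption):

```python
def general_behavior(name):
    # Step 1: unsign; step 2: sign; step 3: goto 1.
    while True:
        yield {"type": "unsign", "general": name}
        yield {"type": "sign", "general": name}

lee = general_behavior("Lee")
events = [next(lee) for _ in range(4)]
print([e["type"] for e in events])  # ['unsign', 'sign', 'unsign', 'sign']
```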

Distributed FSM

A distributed FSM has several endpoints (think of nodes running the execute REST API), but it doesn't matter which endpoint you choose to work with: the system behaves exactly like several endpoints leading to the same (non-distributed) FSM. The only difference is that the distributed version runs on 2n+1 nodes and tolerates up to n failures.

Visualization

The visualization consists of two code columns and a side view with a) the persistent state, b) a watch list with all variables from the stack frames, and c) a network interface to control the ordering and loss of cross-node messages.

The first column contains the code of a distributed Single Decree Paxos FSM engine (37 lines of code), while the second column is dedicated to the state transition functions and the generals' behavior (29 lines of code).

You can step through the visualization by clicking the start/step buttons in the upper-right corner next to the 'General Lee' and 'General Ross' labels. General Lee works with the 0th proposer and Ross uses the 1st.