My name is Marc Brooker. I've been writing code, reading code, and living vicariously through computers for as long as I can remember. I like to build things that work. I also dabble in brewing, cooking and skiing.I'm currently an engineer at Amazon Web Services (AWS) in Seattle, where I lead engineering on AWS Lambda and our other serverless products. Before that, I worked on EC2 and EBS.All opinions are my own.

Distributed Consensus: Beating Impossibility with Probability One

Reading Nancy Lynch's 1989 paper A Hundred Impossibility Proofs for Distributed Computing was the first time I came to a real understanding of the value of impossibility proofs. Before reading it, I was aware of many of the famous impossibility proofs, including Brewer's CAP Theorem, FLP impossibility and the lower bounds of number of rounds needed for consensus, but I'd always held existence proofs to be somehow more important. My attitude was along these lines:

What good are impossibility results, anyway? They don’t seem very useful at first, since they don’t allow computers to do anything they couldn’t previously.

Following that question (in Section 3.5 of A Hundred Impossibility Proofs), Lynch goes on to justify the importance of impossibility proofs. The whole case is worth reading, but the one that resonates with me most strongly as a practitioner is:

... the effect of the impossibility result might be to make a systems developer clarify his/her claims about what the system accomplishes.

Nearly 25 years have passed since the publication of this paper, and that remains something of a hopeful dream. Despite the efforts of Lynch, Lamport, Aphyr, Daniel Abadi and many others, there's still a long way to go in having distributed systems developers clearly state the guarantees their systems are making.

Another effect of impossibility proofs, and the clear definition of the models in which they exist, has been research into how little it is possible to change the model to get around the impossibility result. Easily my personal favorite result in this area is another paper from the 1980s, Michael Ben-Or's Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols from 1983 (sadly, I can't seem to find an open-access version of that paper), and a similar result by Rabin in the same year. Ben-Or looked at the FLP impossibility result, and discovered an algorithm which can achieve consensus with probability one in a slightly modified system model.

The first two sections of the paper lay out the problem to be solved, describe the properties of the solution and present the system model. The system model is the standard asynchronous message passing one, with the additional ability of each process to make non-deterministic decisions. This includes the key difference between the problem Ben-Or is solving, and the problem FLP proves is impossible. At each step (i.e. after receiving a message), a process can make a decision based on its internal state, the message state, and some probabilistic state. FLP's processes can't do the last of these: the decisions they make must be deterministic based only on their internal state, and the message state. This is a great illustration of the importance of models in distributed systems proofs. A slight variation of the model turns the problem from an impossible one, to one that is both possible and not particularly complex.

The other key point from section 2, which the correctness of the whole algorithm depends on, is this one:

If for all P, x_p = v, then the decision must be v.

In the paper's language, x_p is the original binary input made by process P. This is different from the majority wins model which is frequently used when informally talking about consensus. The algorithm is correct if it chooses 1, as long as at least one of the original processes originally proposed 1. In a system with five processes, if four propose 0 and one proposes 1, then 1 is still a correct solution. If all five propose 0, only 0 is the correct solution. This definition of correctness becomes critical when we look at the algorithm itself.

The algorithm proceeds in rounds, with four steps per round. In the first step of each round, each process broadcasts it's x_p, along with the round number. It then waits until it receives N - t of these first-step messages, where N is the number of processes, and t is the number of faulty processes (more on t later). The second step then depends on the set of messages received.

If more than N/2 messages have the same v, then the process broadcasts a message the paper calls a D-message, basically just a message indicating that the process has seen a majority of the same value. Obviously if there have been no failures, this happens on the first round (because it's binary consensus, and there's always a majority). Similarly, in the trivial case, where all x_p were the same, all processes will send D-messages. On the other hand, if a process has seen a split vote, it sends a message indicating that it's still unsure.

In the third step, each process waits for N-t of the step 2 messages, and counts how many of those were D-messages. If it gets only one D-message it sets x_p to the v in that message for future rounds. If a process gets more than t D-messages, we're done and can decide on v. In this case, all the D-messages will have the same v, because it's not possible in step 2 for more than one v to be in more than N/2 messages. At this point, the algorithm may be feeling oddly similar to Paxos' Synod protocol. Finally, if no D-messages were received, the process does something interesting - it randomly selects a new v with probability 0.5.

This is where things start to get interesting for the correctness criterion. If a process gets to this random selection part of step 3 in the first round, it must be true that x_p didn't have the same value for all P. If that isn't the case, all the processes could chose a different v, and break the correctness of the protocol. For this protocol to be true, it must decide in a single round in the trivial case, and not allow random re-selection. This protocol does two things, random re-selection and non-triviality, which are not obviously compatible at first glance.

The number of rounds used by this algorithm, and it's Byzantine fault tolerant counterpart, is surprisingly low. For many executions, consensus can be reached on the first round, and the number of rounds increases as slowly as you would expect the number of randomly selected ties to increase. Here's the number of rounds needed for each of 100k runs of the N=5 t=1 case based on a simulation of the algorithm:

I find this paper particularly interesting for a few reasons. The first reason is that it demonstrates how sensitive the FLP result is to the problem statement and model in which it is proven. As distributed systems practitioners who use academic research and formal models to inform our designs (as we should), we need to be careful to not over- or understate what various results actually mean. It's possible, and actually extremely common, to read the CAP and FLP results to mean something like distributed consensus is impossible, when they actually mean exactly this problem is impossible in exactly this system model. These results should only be extended to other problems and other models with care.

The second reason is that it's a very creative solution to a tricky problem. Backed into a corner by FLP, Ben-Or found a very creative solution that still solves a useful problem in a meaningful system model. For practitioners like me, that's the dream. I want to solve real problems in real systems, and I really admire solutions like this. The third reason is that it's a great reminder, when faced with a claim that a system is solving an apparently impossible problem, that we should ask exactly what problem is being solved, and in exactly what system model. It would be easy to package up Ben-Or's result in a press release titled New Algorithm Proves FLP Wrong, but that would be missing the point entirely.