My name is Marc Brooker. I've been writing code, reading code, and living vicariously through computers for as long as I can remember. I like to build things that work. I also dabble in brewing, cooking and skiing.I'm currently an engineer at Amazon Web Services (AWS) in Seattle, where I lead engineering on AWS Lambda and our other serverless products. Before that, I worked on EC2 and EBS.All opinions are my own.

Exploring TLA+ with two-phase commit

There are very few distributed algorithms more widely known by working programmers than the two-phase commit atomic commit protocol. It's a great algorithm to use for teaching purposes: two-phase commit is both extremely simple to write down, and has significant caveats. Some of these shortcomings are obvious, and easily noticed by most students, and some are much more subtle. At a high level, two-phase commit is an atomic commit protocol: it ensures that changes across multiple database systems are either applied to all the systems or to none of them. Assuming a serial stream of transactions, two-phase commit ensures atomicity - the transaction is either fully applied or not applied at all.

A single coordinator (let's call her Alice) runs a group of fried chicken restaurants, and wants each restaurant manager (the literature calls them cohorts, let's them Bob and Chuck) to paint their green restaurant blue. Alice really cares that her customers get a consistent fried chicken experience, so wants to make sure that all the managers do the work or none of them to do it. If Alice simply asked Bob to do the work, then asked Chuck, she'd be in trouble. If Bob went ahead and did the work, then Chuck couldn't (say he didn't have enough paint), Alice would need to ask Bob undo his work. If Bob was then out of green paint, Alice would be stuck with inconsistent restaurant colors. In Alice's world, that's a catastrophe.

Instead, Alice uses two-phase commit. First, she calls Bob and Chuck and asks them to check if they can repaint today. When both acknowledge they can, Alice calls them and asks them to go ahead. For this to work, she doesn't have to get both of them on the same conference call. She just needs to call them one after the other. Alice also needs to be sure that Bob and Chuck won't lie to her about being able to do the work, and that Bob and Chuck will keep answering their phones. If Bob leaves work early after he's acknowledged that he can do the work, but before he does it, Chuck will be left with the cans open and ladders up, and Alice won't be sure if Bob did the painting or not. She doesn't know what to tell Chuck.

Even for such a simple protocol, two-phase commit has some subtle downsides and the distributed nature of the algorithm makes it exceptionally hard to reason about in prose. We could make little dolls of Alice, Bob and Chuck and act out every possible scenario, but that would take a really long time. Even if we managed to do that (and not screw up), we'd need to start the whole exercise again if Alice opened a third chicken frying location. What if we could have a computer do that checking for us? What if we could write down the protocol clearly and precisely, then write down everything we need to make sure is true, then have a computer run through every possible scenario and tell us if it works. That would be good, right?

Leslie Lamport's TLA+ tools allow us to do exactly that - write pseudocode implementations of complex algorithms, and ask the computer to exhaustively check them. Going through every possible path in a code base is a painstaking and time consuming process without any creativity required - the exact kind of problem that computers excel at. Let's see how we can use TLA+ to ask the computer to solve Alice's problem. I've used the PlusCal algorithm language here, because I find it much easier to write and understand than raw TLA+. First, let's define some things about the world:

variables managers = { "bob", "chuck", "dave" }; restaurant_stage = [ i \in managers |-> "start" ];

Here, we're telling PlusCal that there are three managers (Bob, Chuck and Dave), and creating an array of states (one per restaurant) with the initial state of each set to "start". Next, we need to define how each manager behaves:

process (Restaurant \in managers) { c: await restaurant_stage[self] = "propose";

Each manager waits for a call from Alice, proposing that they repaint their restaurant. They'll be happy to wait for ever in this stage, patiently staring at the phone while their employees cut, spice, fry and sell chicken after chicken.

either { restaurant_stage[self] := "accept"; } or { restaurant_stage[self] := "refuse"; };

In the next stage, the managers are allowed to do one of two things - either accept the work that's been given to them, or refuse to do the work. Using either tells PlusCal that we can go down either of these paths non-deterministically.

c1: await (restaurant_stage[self] = "commit") \/ (restaurant_stage[self] = "abort");

They then wait for the next call from Alice, giving them the go ahead to paint, or telling them to put away the ladders.

if (restaurant_stage[self] = "commit") { restaurant_stage[self] := "committed"; } else { restaurant_stage[self] := "aborted"; } }

Finally, they act on Alice's orders - either painting or aborting. Next, we have to specify how Alice behaves. To simplify that code substantially, we can use PlusCal's handy macro feature:

macro SetAll(state, k) { while (k # {}) { with (p \in k) { restaurant_stage[p] := state; k := k \ {p}; }; }; }

This macro loops over every restaurant (in non-deterministic order), and sends them a message. Let's use it to define Alice's behavior:

process (Controller = "alice") variable k, aborted = FALSE; { n: k := managers; n2: SetAll("propose", k);

First up, create the process and define the local variables. Then, send a message to each manager proposing the change.

k := managers; n3: while (k # {}) { with (p \in k) { await (restaurant_stage[p] = "accept") \/ (restaurant_stage[p] = "refuse"); if (restaurant_stage[p] = "refuse") { aborted := TRUE; }; k := k \ {p}; }; };

Wait for each manager to return the call (checking in non-deterministic order), and write down whether anybody wants to abort the operation.

k := managers; if (aborted = TRUE) { n6: SetAll("abort", k); } else { n4: SetAll("commit", k); }

If all the managers were happy to continue, then tell everybody to continue. That's the end of the specification of Alice's behavior, and the end of our PlusCal program. Writing down the program like this is valuable already. The precision of the PlusCal language, and the way it ignores many of the other challenges that would complicate real code, forces you to think clearly and completely about the behavior of each player. Programmers are all aware that fuzzy thinking doesn't last long when you have to translate it to code, and this is even more true of PlusCal. Just the act of writing the program this way is valuable. In terms of value, though, we're only just getting started.

TLA+ includes a model checker called TLC. In short, it runs through every possible path of the code and checks some invariants at each stage. Remember all of those non-deterministic steps in the code? When it hits those, it takes all possible paths. To make TLC useful, we need to tell it what it should check, both invariants (things that are true in every state) and properties (things that must become true). The simplest check is one at PlusCal generates itself:

Termination == <>(\A self \in ProcSet: pc[self] = "Done")

In the TLA+ languages, this means "for all self in the set of processes (alice, bob, chuck and dave), check that the program counter eventually reaches Done". The Done state is a magic state that means the code has fallen off the end of our process. This is a valuable thing to check, because it makes sure that all the process run the entire algorithm. Next, we define an invariant:

StateOK == /\ (\A i \in managers: restaurant_stage[i] \in {"start", "propose", "accept", "commit", "abort", "committed", "aborted", "refuse"})

This simply makes sure that restaurant_stage, the variable we have used to simulate the telephone, never goes off into a state we don't know about. Then, we want to check if all the restaurants either get painted or don't:

Committed == /\ \/ <>(\A i \in managers: restaurant_stage[i] = "committed") \/ <>(\A i \in managers: restaurant_stage[i] = "aborted")

Running the code through the handy TLC model checker will check that all of these things are true. Even for this little program, TLC found 718 states the program can be in, 296 of them unique. If Alice opened another two restaurants, these numbers would increase to 21488 states, and 5480 unique states. Long before the time Alice runs a multinational chicken empire, we'd have no chance of enumerating all these states by hand - let alone doing it correctly. To further illustrate the value of TLA+, let's introduce a subtle bug into the system, one that allows Alice to ignore a refuse message from Bob (in the real world, this could be a poorly handled timeout). Replace this line:

if (restaurant_stage[p] = "refuse") {

with this one:

if ((restaurant_stage[p] = "refuse") /\ (p # "bob")) {

That change lets Alice ignore the 'refuse' message from Bob. Running the model checker TLC again reveals something odd. The protocol still works when it shouldn't. We need to tell TLC to check one other invariant: that everybody aborts when somebody asks to. There are several ways to do this, including adding an explicit invariant. Another way to do this is to use the assert functionality in PlusCal. We can track whether each restaurant asked for an abort like this:

or { restaurant_stage[self] := "refuse"; refused := TRUE; };

Then assert that when we are asked to commit:

if (restaurant_stage[self] = "commit") { assert(refused = FALSE); restaurant_stage[self] := "committed";

Running TLC again reveals that the program is broken and that reverting Alice's behavior to always care about Bob's opinion fixes the issue. This reveals the most enlightening thing I've found about playing with TLA+. It's extremely easy to write a TLA+ specification and a set of invariants that work. What's much harder is coming up with a set of invariants that cover all the cases we actually care about, and making sure that modifications to the specification break those invariants. This is a great lesson about writing unit tests too - you have to be very honest to avoid studying to the test if you write the code and write the tests.

In a larger sense, it would be really cool to have a tool that does for TLA+ what Jester does for Java: make random modifications to the specification and show cases where the invariants are not violated. This would be very interesting for building quality invariants, but also for automated exploration of the space of algorithms which meet a given set of invariants.