United States – 2013

For fundamental contributions to the theory and practice of distributed and concurrent systems, notably the invention of concepts such as causality and logical clocks, safety and liveness, replicated state machines, and sequential consistency.

If we could travel back in time to 1974, perhaps we would have found Leslie Lamport at his busy local neighborhood bakery, grappling with the following issue. The bakery had several cashiers, but if more than one person approached a single cashier at the same time, that cashier would try to talk to all of them at once and become confused. Lamport realized that there needed to be some way to guarantee that people approached cashiers one at a time. This problem reminded Lamport of one posed in an earlier article by computer scientist Edsger Dijkstra on another mundane matter: how to share dinner utensils around a dining table. One of the coordination challenges there was to guarantee that each utensil was used by at most one diner at a time. This challenge came to be generalized as the mutual exclusion problem, exactly the one Lamport faced at the bakery.

One morning in 1974, an idea came to Lamport on how the bakery customers could solve mutual exclusion among themselves, without relying on the bakery for help. It worked roughly like this: people choose numbers when they enter the bakery, and then get served at the cashier according to their number ordering. To choose a number, a customer asks for the number of everyone around her and chooses a number higher than all the others.

This simple idea became an elegant algorithm for solving the mutual exclusion problem without requiring any lower-level indivisible operations. It also was a rich source of future ideas, since many issues had to be worked out. For example, some bakery customers took a long time to check other numbers, and meanwhile more customers arrived and selected additional numbers. Another time, the manager of the bakery wanted to get a snapshot of all the customer numbers in order to prepare enough pastries. Lamport later said "For a couple of years after my discovery of the bakery algorithm, everything I learned about concurrency came from studying it." [1]

The Bakery Algorithm and Lamport's other pioneering works -- many with amusing names and associated parables -- have become pillars of computer science. Together they form the foundation of broad areas in concurrency, and they have influenced the specification, development, and verification of concurrent systems.

Lamport received a PhD in mathematics from Brandeis University in 1972. He worked as a computer scientist at Massachusetts Computer Associates from 1970 to 1977, at SRI International from 1977 to 1985, and at the Digital Equipment Corporation Systems Research Center (later owned by Compaq) from 1985 to 2001. In 2001 he joined Microsoft Research in Mountain View, California.

Spending his research career in industrial research environments was not an accident. "I like working in an industrial research lab, because of the input", Lamport said. "If I just work by myself and come up with problems, I’d come up with some small number of things, but if I go out into the world, where people are working on real computer systems, there are a million problems out there. When I look back on most of the things I worked on—Byzantine Generals, Paxos—they came from real-world problems.” [2]

His works shed light on fundamental issues of concurrent programs, for which there was no formal theory at the time. He grappled with fundamental concepts such as causality and logical time, atomic and regular shared registers, sequential consistency, state machine replication, Byzantine agreement and wait-freedom. He worked on algorithms which have become standard engineering practice for fault-tolerant distributed systems. He also developed a substantial body of work on the formal specification and verification of concurrent systems, and contributed to the development of automated tools for applying these methods. We will touch on only some of his contributions.

1. Mutual Exclusion Solutions and the Bakery Algorithm

Lamport's influential works from the 1970s and 1980s came at a time when there was little understanding of the fundamental issues of programming for multiple concurrent processors.

For example, it was known that correct execution may require parallel activities to exclude one another during "critical sections", when they manipulate the same data, in order to prevent undesired interleaving of operations. The origins of this mutual exclusion problem lie in Edsger Dijkstra's pioneering work, which also presented a solution. [3] Dijkstra's algorithm, while correct, depends on shared memory accesses being atomic -- that a processor reading while another is writing will be made to wait, rather than return a possibly garbled value. In a sense, it constructs a high-level solution out of low-level mutual exclusion already implemented by the hardware.

Lamport's remarkably elegant and intuitive "Bakery Algorithm" [4] doesn't do that. His solution arranges contending processes in an implicit queue according to their arrival order, much like a wait-queue in a Bakery. Yet it doesn't matter if a processor reading data that is being updated by another processor gets garbage. The algorithm still works.
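To make the idea concrete, here is a minimal sketch of the Bakery Algorithm in Python (the thread count, variable names, and use of CPython threads are our own illustrative choices, not Lamport's original presentation). Each thread picks a ticket larger than any it currently sees, then waits behind every thread holding a smaller (ticket, id) pair:

```python
import sys
import threading

sys.setswitchinterval(1e-4)  # encourage frequent thread interleaving

N = 3                    # number of contending threads
choosing = [False] * N   # True while thread i is picking its ticket
number = [0] * N         # 0 means "not interested"

def lock(i):
    # Doorway: pick a ticket larger than any ticket currently visible.
    # The reads of other tickets need not be atomic -- that is the point.
    choosing[i] = True
    number[i] = 1 + max(number)
    choosing[i] = False
    # Wait behind every thread with a smaller (ticket, id) pair.
    for j in range(N):
        while choosing[j]:
            pass
        while number[j] != 0 and (number[j], j) < (number[i], i):
            pass

def unlock(i):
    number[i] = 0

counter = 0

def worker(i):
    global counter
    for _ in range(100):
        lock(i)
        counter += 1     # critical section: at most one thread here at a time
        unlock(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 300: no increment is lost
```

Ties between equal tickets are broken by thread id, which is why the algorithm compares (ticket, id) pairs rather than tickets alone.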

The Bakery algorithm has become textbook material, and most undergraduates in computer science encounter it in the course of their studies.

2. Foundations of Concurrent Programming

Several important new concepts emanated from the Bakery Algorithm work, a trend which recurred several times in Lamport's career. The experience of devising concurrent algorithms and verifying correctness caused him to focus on the basic foundations that would make multiprocessors behave in a manner about which programmers can reason mathematically. While working on a solution to a specific concrete problem, Lamport invented the abstractions and general rules needed to reason about its correctness, and these conceptual contributions then became theoretical pillars of concurrent programming.

Loop-freedom: The Bakery Algorithm work introduced an important concept called "loop freedom". Some obvious solutions that come to mind for the mutual exclusion problem pre-assign "turns" in rotation among the processes. But this forces processes to wait for others that are slow and have not yet even reached the point of contention. Using the bakery analogy, it would be akin to arriving at an empty bakery and being asked to wait for a customer who has not even arrived yet. In contrast, loop-freedom expresses the ability of processes to make progress independently of the speed of other processes. Because the Bakery Algorithm assigns turns to processes in the order of their arrival, it has loop-freedom. This crucial concept has been used in the design of many subsequent algorithms and in the design of memory architectures. Wait-freedom, a condition requiring independent progress despite failures, has its clear roots in the notion of loop-freedom and the Bakery doorway concept. It was later extensively explored by others, including Maurice Herlihy [5].

Sequential consistency: Working with a multiprocessor architecture that had distributed cache memory led Lamport to create formal specifications for coherent cache behavior in multiprocessor systems. That work brought some order to the chaos of this field by inventing sequential consistency [6], which has become the gold standard for memory consistency models. This simple and intuitive notion provides just the right level of “atomicity” to allow software to work. Today we design hardware systems with timestamp ordering or partial-store ordering, with added memory fence instructions, which allow programmers to make the hardware appear sequentially consistent. Programmers can then implement algorithms that provide strong consistency properties. This is key to the memory consistency models of Java and C++. Our multicore processors run today based on principles described by Leslie Lamport in 1979.
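What sequential consistency rules out can be illustrated with the classic store-buffering litmus test. The sketch below (our own illustrative model, not Lamport's formulation) enumerates every interleaving of two threads that each write one variable and then read the other; under sequential consistency, the two reads can never both return 0, because some write comes first in any interleaving:

```python
def interleavings(a, b):
    """All merges of sequences a and b that preserve each one's internal order."""
    if not a:
        yield list(b); return
    if not b:
        yield list(a); return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

# Thread 1: x := 1; r1 := y      Thread 2: y := 1; r2 := x
t1 = [("write", "x", 1), ("read", "y", "r1")]
t2 = [("write", "y", 1), ("read", "x", "r2")]

outcomes = set()
for schedule in interleavings(t1, t2):
    mem = {"x": 0, "y": 0}   # shared memory, initially zero
    regs = {}                # per-thread read results
    for op in schedule:
        if op[0] == "write":
            mem[op[1]] = op[2]
        else:
            regs[op[2]] = mem[op[1]]
    outcomes.add((regs["r1"], regs["r2"]))

print(sorted(outcomes))  # [(0, 1), (1, 0), (1, 1)] -- never (0, 0)
```

Real hardware with store buffers can produce the forbidden (0, 0) outcome, which is exactly why fence instructions are needed to restore the sequentially consistent appearance described above.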

Atomic and regular registers: The Bakery Algorithm also led Lamport to wonder about the precise semantics of memory when multiple processes interact to share data. It took almost a decade to formalize, and the result is the abstraction of regular and atomic registers [7].

His theory gives each operation on a shared register an explicit duration, starting with an invocation and ending with a result. The registers can be implemented by a variety of techniques, such as replication. Nevertheless, the interactions of processes with an atomic register are supposed to “look like” serial accesses to actual shared memory. The theory also includes weaker semantics of interaction, like those of a regular register. A regular register captures situations in which processes read different replicas of the register while it is being updated. At any moment in time, some replicas may be updated while others are not, but eventually all replicas hold the updated value. Importantly, these weaker semantics suffice to support mutual exclusion: the Bakery algorithm works correctly even if a reader overlapping a writer obtains an arbitrary value.

This work initiated a distinct subfield of research in distributed computing that is still thriving. Lamport’s atomic objects supported only read and write operations, that is, they were atomic registers. The notion was generalized to other data types by Maurice Herlihy and Jeannette Wing [8], and their term "linearizability" became synonymous with atomicity. Today, essentially all non-relational storage systems developed by companies like Amazon, Google, and Facebook adopt linearizability and sequential consistency for their data coherence guarantees.

3. Foundations of Distributed Systems

A special type of concurrent system is a distributed system, characterized by having processes that use messages to interact with each other. Leslie Lamport has had a huge impact on the way we think about distributed systems, as well as on the engineering practices of the field.

Logical clocks: Many people realized that a global notion of time is not natural for a distributed system. Lamport was the first to make precise an alternative notion of "logical clocks", which impose a partial order on events based on the causal relation induced by sending messages from one part of the system to another [9]. His paper on "Time, Clocks, and the Ordering of Events in a Distributed System" has become the most cited of Lamport’s works, and in computer science parlance logical clocks are often nicknamed Lamport timestamps. His paper won the 2000 Principles of Distributed Computing Conference Influential Paper Award (later renamed the Edsger W. Dijkstra Prize in Distributed Computing), and it won an ACM SIGOPS Hall of Fame Award in 2007.
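The mechanism itself is strikingly small. The sketch below (class and method names are our own) implements the two rules of a Lamport clock: increment on every local event, and on receipt of a message jump past the sender's timestamp. The result is that a send is always ordered before the matching receive:

```python
class LamportClock:
    """Logical clock: a counter advanced by local events and message receipt."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Rule 1: every local event advances the clock.
        self.time += 1
        return self.time

    def send(self):
        # Sending is a local event; the timestamp travels with the message.
        return self.tick()

    def receive(self, ts):
        # Rule 2: jump past the sender's timestamp, then advance.
        self.time = max(self.time, ts) + 1
        return self.time

a, b = LamportClock(), LamportClock()
a.tick()                 # event at A: A's clock is now 1
ts = a.send()            # A sends a message stamped 2
b.tick()                 # unrelated event at B: B's clock is now 1
t_recv = b.receive(ts)   # B receives: max(1, 2) + 1 = 3
print(ts, t_recv)        # 2 3 -- the receive is ordered after the send
```

The ordering is partial: two events with no causal path between them may carry timestamps in either order, which is precisely the point of the paper.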

To understand why that work has become so influential, recognize that at the time of the invention there was no good way to capture the communication delay in distributed systems except by using real time. Lamport realized that the communication delay made those systems very different from a shared-memory multiprocessor system. The insight came when reading a paper on replicated databases [10] and realizing that its logical ordering of commands might violate causality.

Using ordering of events as a way of proving system correctness is mostly what people do today for intuitive proofs of concurrent synchronization algorithms. Another powerful contribution of this work was to demonstrate how to replicate a state machine using logical clocks, which is explained below.

Distributed Snapshots: Once you define causal order, the notion of consistent global states naturally follows. That led to another insightful work. Lamport and Mani Chandy invented the first algorithm for reading the state (taking a "snapshot") of an arbitrary distributed system [11]. This is such a powerful notion that others later used it in different domains, like networking, self-stabilization, debugging, and distributed systems. This paper received the 2013 ACM SIGOPS Hall of Fame Award.

4. Fault Tolerance and State Machine Replication

"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable" is a famous Lamport quip. Much of his work is concerned with fault tolerance.

State Machine Replication (SMR): Perhaps the most significant of Lamport's many contributions is the State Machine Replication paradigm, which was introduced in the famous paper "Time, Clocks, and the Ordering of Events in a Distributed System" and further developed soon after [12]. The abstraction captures any service as a centralized state machine -- a kind of universal computing engine similar to a Turing machine. It has an internal state, and it processes commands in sequence, each resulting in a new internal state and producing a response. Lamport realized that the daunting task of replicating a service over multiple computers can be made remarkably simple if you present the same sequence of input commands to all replicas and they proceed through an identical succession of states.
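The core of the idea fits in a few lines. In this minimal sketch (the key-value machine and command names are our own illustration), three replicas of a deterministic state machine apply the same command log in the same order and necessarily end in identical states with identical responses:

```python
class KVStateMachine:
    """A deterministic key-value state machine: same commands -> same state."""

    def __init__(self):
        self.state = {}

    def apply(self, cmd):
        op, key, *rest = cmd
        if op == "put":
            self.state[key] = rest[0]
            return "ok"
        if op == "get":
            return self.state.get(key)

# The same command log, delivered in the same order to every replica...
log = [("put", "x", 1), ("put", "y", 2), ("put", "x", 3), ("get", "x")]
replicas = [KVStateMachine() for _ in range(3)]
outputs = [[r.apply(cmd) for cmd in log] for r in replicas]

# ...yields identical states and identical responses on all replicas.
assert all(r.state == replicas[0].state for r in replicas)
assert all(out == outputs[0] for out in outputs)
print(replicas[0].state)  # {'x': 3, 'y': 2}
```

The hard part, of course, is getting every replica to agree on that single command order despite failures, which is exactly the agreement problem discussed below.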

This insightful SMR paradigm underlies many reliable systems, and is considered a standard approach for building replicated distributed systems due to its elegance. But before Lamport developed a full solution for SMR, he needed to address a core ingredient, agreement, which he tackled in his next work.

Byzantine Agreement: While state machine approaches that are resilient to crash faults are sufficient for many applications, more mission-critical systems, such as for avionics, need an even more extreme model of fault-tolerance that is impervious to nodes that might disrupt the system from within.

At Stanford Research Institute (later called SRI International) in the 1970's, Lamport was part of a team that helped NASA design a robust avionics control system. Formal guarantees were an absolute necessity because of the mission-critical nature of the task. Safety had to be guaranteed against the most extreme system malfunction one could imagine. One of the first challenges the team at SRI was asked to take on was to prove the correctness of a cockpit control scheme, which NASA had designed, with three computer systems that use majority-voting to mask any faulty component.

The result of the team's work was several foundational concepts and insights regarding these stringent types of robust systems. It included a fundamental definition of robustness in this setting, an abstraction of the coordination problem which underlies any replicated system to this date, and a surprising revelation that systems with three computers can never safely run a mission-critical cockpit!

Indeed, in two seminal works ("LPS") published with Pease and Shostak [13] [14], the team first identified a somewhat peculiar vulnerability. LPS posited that "a failed component may exhibit a type of behavior that is often overlooked -- namely, sending conflicting information to different parts of the system". More generally, a malfunctioning component could function in a manner completely inconsistent with its prescribed behavior, and might appear almost malicious.

The new fault model needed a name. At the time there was a related classical challenge of coordinating two communicating computers, introduced in a 1975 paper [15] and referred to by Jim Gray in [16] as the "Two Generals Paradox". This led Lamport to think of the control computers in a cockpit as an army of Byzantine Generals, with the army trying to form a coordinated attack while inside traitors sent conflicting signals. The name "Byzantine Failures" was adopted for this fault model, and a flurry of academic work followed. The Byzantine fault model is still in use for capturing the worst kind of mishaps and security flaws in systems.

Byzantine Failures analyze the bad things that may happen. But what about the good things that need to happen? LPS also gave an abstract formulation of the problem of reaching coordination despite Byzantine failures; this is known as the "Byzantine Agreement" problem. This succinct formulation expresses the control coordination task as the problem of forming an agreement decision on an individual bit, starting with potentially different bits input to each component. Once you have agreement on a single bit, it is possible to use it repeatedly in order to keep an entire complex system coordinated. The paper shows that four computers are needed to form agreement on a single bit in the face of a single malfunction. Three are not enough, because with three units, a faulty unit may send conflicting values to the other two units, and form a different majority with each one. More generally, they showed that 3F+1 units are needed in order to overcome F simultaneously faulty components. To prove this, they used a beautiful symmetry argument which has become known as the "hexagon argument". This archetypal argument has found other uses whenever one argues that a malfunctioning unit that sends conflicting information to different parts of the system looks indistinguishable from a symmetric situation in which the correct and faulty roles are reversed.

LPS also demonstrated that 3F+1 units are enough, and they presented a solution for reaching Byzantine Agreement among the 3F+1 units in F+1 synchronous communication rounds. They also showed that if you use digital signatures, just 2F+1 units are sufficient and necessary.

The Byzantine Agreement problem and its solutions have become the hallmark of fault tolerant systems. Most systems constructed with redundancy make use of it internally for replication and for coordination. Lamport himself later used it in realizing the State Machine Replication paradigm discussed above, whose algorithmic foundation is the subject of the Paxos work described next.

The 1980 paper was awarded the 2005 Edsger W. Dijkstra Prize in Distributed Computing, and the 1982 paper received the Jean-Claude Laprie Award in Dependable Computing.

Paxos: With a growing understanding of the agreement problem for distributed computing, it was time for Lamport to go back to State Machine Replication and address failures there. The first SMR solution, presented in his 1978 paper, assumed there were no failures, and it used logical time to step replicas through the same command sequence. In 1989, Lamport designed a fault tolerant algorithm called Paxos [17] [18]. Continuing his trend of humorous parable-telling, the paper presents the imaginary story of a government parliament on an ancient Greek island named Paxos, where the absence of any number of its members, or possibly all of them, can be tolerated without losing consistency.

Unfortunately, the framing as a Greek parable made the paper difficult for most readers to comprehend, and it took nine years from submission to publication in 1998. But the 1989 DEC technical report did get noticed. Lamport's colleague Butler Lampson evangelized the idea to the distributed computing community [19]. Shortly after the publication of Paxos, Google's Chubby system and Apache's open-source ZooKeeper offered State Machine Replication as an external, widely-deployed service.

Paxos stitches together a succession of agreement decisions into a sequence of state-machine commands in an optimized manner. Importantly, the first phase of the agreement component given in the Paxos paper (called Synod) can be avoided when the same leader presides over multiple decisions; that phase needs to be performed only when a leader needs to be replaced. This insightful breakthrough accounts for much of the popularity of Paxos, and was later called Multi-Paxos by the Google team [20]. Lamport's Paxos paper won the ACM SIGOPS (Special Interest Group on Operating Systems) Hall of Fame Award in 2012.
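The agreement component can be sketched compactly. The following single-decree Synod sketch (an illustrative in-memory model with our own names; it omits messaging, failures, and leader election) shows the essential invariant: a later proposer must first ask a majority of acceptors what they have already accepted, and must adopt the highest-ballot value it hears before proposing its own:

```python
class Acceptor:
    """One Synod acceptor: remembers its promise and its last accepted value."""

    def __init__(self):
        self.promised = 0      # highest ballot promised so far
        self.accepted = None   # (ballot, value) last accepted, if any

    def prepare(self, ballot):               # phase 1
        if ballot > self.promised:
            self.promised = ballot
            return self.accepted             # promise, reporting any prior value
        return "nack"

    def accept(self, ballot, value):         # phase 2
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return "ok"
        return "nack"

def propose(acceptors, ballot, value):
    quorum = len(acceptors) // 2 + 1
    # Phase 1: collect promises from a majority of acceptors.
    replies = [a.prepare(ballot) for a in acceptors]
    promises = [r for r in replies if r != "nack"]
    if len(promises) < quorum:
        return None
    # Adopt the highest-ballot value already accepted, if there is one.
    prior = [p for p in promises if p is not None]
    if prior:
        value = max(prior)[1]
    # Phase 2: ask the acceptors to accept the (possibly adopted) value.
    acks = [a.accept(ballot, value) for a in acceptors].count("ok")
    return value if acks >= quorum else None

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, ballot=1, value="A"))  # A   (chosen by a majority)
print(propose(acceptors, ballot=2, value="B"))  # A   (later ballots preserve it)
```

The Multi-Paxos optimization mentioned above amounts to running phase 1 once per leader rather than once per decision, then driving a sequence of phase-2 rounds for successive state-machine commands.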

SMR and Paxos have become the de facto standard framework for designing and reasoning about consensus and replication methods. Many companies building critical information systems, including Google, Yahoo, Microsoft, and Amazon, have adopted the Paxos foundations.

5. Formal Specification and Verification of Programs

In the early days of concurrency theory the need surfaced for good tools to describe solutions and prove their correctness. Lamport has made central contributions to the theory of specification and verification of concurrent programs. For example, he was the first to articulate the notions of safety properties and liveness properties for asynchronous distributed algorithms. They generalized the “partial correctness” and “total correctness” properties previously defined for sequential programs. Today, safety and liveness properties form the standard classification for correctness properties of asynchronous distributed algorithms.

Another work, with Martin Abadi [21], introduced a special abstraction called prophecy variables to an algorithm model, to handle a situation where an algorithm resolves a nondeterministic choice before the specification does. Abadi and Lamport pointed out situations where such problems arise, and developed the theory needed to support this extension. Moreover, they proved that whenever a distributed algorithm meets a specification, where both are expressed as state machines, the correspondence between them can be proved using a combination of prophecy variables and previous notions such as history variables. This work won the 2008 LICS Test-Of-Time award.

Formal Modeling Languages and Verification Tools: In addition to developing the basic notions above, Lamport has developed the language TLA (Temporal Logic of Actions) and the TLA+ toolset, for modeling and verifying distributed algorithms and systems.

TLA and TLA+ support specification and proof of both safety and liveness properties, using notation based on temporal logic. Lamport has supervised the development of verification tools based on TLA+, notably the TLC model checker built by Yuan Yu. TLA+ and TLC have been used to describe and analyze real systems. For example, these tools were used to find a major error in the coherence protocol used in the hardware for Microsoft’s Xbox 360 prior to its release in 2005. At Intel, they were used for the analysis of a cache-coherence protocol of the Intel Quick Path Interconnect as implemented in the Nehalem core processor. To teach engineers how to use his formal specification tools, Lamport has written a book [22]. More recently, Lamport has developed the PlusCal formal language and tools for use in verifying distributed algorithms; this work builds upon TLA+.

6. LaTeX

When creating such a vast collection of impactful papers, it is natural to wish for a convenient typesetting tool. Lamport did not just wish for one, he created one for the entire community. Outside the field of concurrency is Lamport’s LaTeX system [23], a set of macros for use with Donald Knuth’s TeX typesetting system [24]. LaTeX added three important things to TeX:

- The concept of ‘typesetting environment’, which had originated in Brian Reid’s Scribe system.
- A strong emphasis on structural rather than typographic markup.
- A generic document design, flexible enough to be adequate for a wide variety of documents.

Lamport did not originate these ideas, but by pushing them as far as possible he created a system that provides the quality of TeX and much of its flexibility, but is much easier to use. LaTeX became the de facto standard for technical publishing in computer science and many other fields.

There are many other important papers by Leslie Lamport -- too many to describe here. They are listed in chronological order on Lamport's home page [1], accompanied by historical notes that describe the motivation and context of each result.

Any time you access a modern computer, you are likely to be impacted by Leslie Lamport's algorithms. And all of this work started with the quest to understand how to organize a queue at a local bakery.

Author: Dahlia Malkhi

Additional contributors: Martin Abadi, Hagit Attiya, Idit Keidar, Nancy Lynch, Nir Shavit, George Varghese, and Len Shustek