1. My name is Charles Humble and I am here at QCon San Francisco 2014 with Leslie Lamport. Leslie, could you introduce yourself to the InfoQ community? My name is Leslie Lamport. I have worked for Microsoft for the last dozen or so years, and before that I was in several industrial research labs, doing computer science for the past 30 or 40 years.

2. Great, thank you. So, what sparked your initial interest in computer science? Well, my initial interest was in computers. There was no concept of computer science when I started, as near as I knew at any rate. I almost fell into programming through some summer jobs when I was in college, and I partially supported myself in grad school by programming. I just somehow wound up doing computer science.

3. You worked on LaTeX. How did that come about? Well, I was going to write the great American concurrency book, but in order to do that I realized I needed to write some TeX macros and I decided that, with a little extra effort, I could make those macros usable by other people and, needless to say, the great American concurrency book never got written, but LaTeX did.

4. Excellent. So, your Paxos algorithm is very widely used today. Could you describe it for us? Well, I don’t want to get into the details, but if you ask a reasonably good engineer to solve the problem that Paxos does - which is to build a reliable, fault-tolerant system that can handle a particular class of faults, namely the most common kind: computers just stopping and messages just getting lost - then they would come up with an algorithm that works pretty much like Paxos in the normal case, when there are no failures. The difference is that when a failure happened, the engineer would have not an algorithm but a bunch of code, and we would never be quite clear what it did. Paxos, on the other hand, provably will never do anything wrong under that class of failures, although the worst that can happen is that a failure prevents it from making progress.
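As a rough illustration of the behavior described here - and of the rule that keeps Paxos safe even when proposals overlap - the single-decree protocol can be sketched in Python. This is only an illustrative sketch, not code from any Paxos paper: the class and function names are mine, plain method calls stand in for network messages, and learners, retries, and real failure handling are all omitted.

```python
# Minimal single-decree Paxos sketch (illustrative names, no real networking).

class Acceptor:
    def __init__(self):
        self.promised = -1    # highest ballot this acceptor has promised
        self.accepted = None  # (ballot, value) last accepted, or None

    def prepare(self, ballot):
        # Phase 1b: promise to ignore lower ballots, and report any
        # previously accepted value so the proposer can adopt it.
        if ballot > self.promised:
            self.promised = ballot
            return self.accepted
        return "nack"

    def accept(self, ballot, value):
        # Phase 2b: accept unless a higher ballot has been promised.
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False

def propose(acceptors, ballot, value):
    # Phase 1a/1b: gather promises from a majority of acceptors.
    promises = [a.prepare(ballot) for a in acceptors]
    granted = [p for p in promises if p != "nack"]
    if len(granted) <= len(acceptors) // 2:
        return None  # no majority; a real proposer would retry higher
    # If any acceptor already accepted a value, we must propose the value
    # with the highest ballot -- this is the rule that makes Paxos safe.
    prior = [p for p in granted if p is not None]
    if prior:
        value = max(prior)[1]
    # Phase 2a/2b: a value accepted by a majority is chosen.
    acks = sum(a.accept(ballot, value) for a in acceptors)
    return value if acks > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(3)]
chosen = propose(acceptors, ballot=1, value="A")
```

Once a majority has accepted "A", a later proposer with a higher ballot is forced to re-propose "A" rather than its own value, which is why the chosen value can never change.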

5. How did your paper on “Time, Clocks, and the Ordering of Events in a Distributed System” come about? I received a pre-print of a paper by Bob Thomas and, I think, Paul Johnson describing an algorithm, and when I read it I realized that they hadn’t gotten it quite right: their algorithm allowed the system using it to apparently violate causality. I made some fairly simple changes to their algorithm, and that was my algorithm. The additional thing my paper did - and I am not sure to what extent Johnson and Thomas realized the significance of what they were trying to do, since the particular problem they were addressing was maintenance of a distributed database - was to recognize that the fundamental problem was implementing a state machine: having a collection of separate processors, communicating by sending messages, cooperate to implement the same state machine, so that they all have the same notion of what the state machine does. I realized that if you could do that, then you could essentially solve any problem.
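The clock rules at the heart of the “Time, Clocks” paper can be sketched in a few lines. This is an illustrative Python sketch with my own naming, not code from the paper: tick before every local event or send, stamp outgoing messages, and on receipt jump past the sender's timestamp.

```python
# Sketch of Lamport's logical-clock rules (illustrative names):
#   1. Increment the local clock before each local event or send.
#   2. Stamp each outgoing message with the sender's clock value.
#   3. On receive, set the clock to max(local, message) + 1.

class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        # Ticking before sending makes the send later than every
        # earlier local event on this process.
        self.time += 1
        return self.time  # timestamp carried by the message

    def receive(self, msg_time):
        # Jumping past the sender's timestamp preserves causality:
        # if the send happened before this receive, the receive
        # gets a strictly larger logical time.
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
a.local_event()   # a.time == 1
t = a.send()      # a.time == 2; message stamped 2
b.receive(t)      # b.time == max(0, 2) + 1 == 3
```

Ordering events by these timestamps (with a tie-breaker such as a process ID) yields the total order that lets every processor apply the same state-machine commands in the same sequence.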

6. You extended the “Time, Clocks[….]” paper to handle arbitrary faults. Can you describe that? Well, after writing the “Time, Clocks[….]” paper, which assumed no failures, it was clear what the next step was: since you have a collection of processors, some of them may fail, and in some respects the reason for building a distributed system is that, by distributing it, you make sure that the failure of one processor does not physically imply the failure of any other processor, and so it becomes possible to build a system that will tolerate the failure of some processors. In those days there was no clear notion of what constituted a typical failure, so I assumed that a failed process could do absolutely anything, and wrote an algorithm that essentially generalized the “Time, Clocks[….]” paper to arbitrary failures. Very shortly after doing that I went to SRI, and I discovered that they were actually considering the same problem, in the context of a NASA contract to build a fault-tolerant system for, essentially, flying an airplane; and one of the things they had done was in fact to prove that the algorithm I had invented could not work.



Well, it was not quite like that, because the algorithm I had invented used digital signatures. At that time digital signatures were known to very few people, but Whit Diffie, one of the inventors of modern cryptography, was a friend of mine, so I was aware of the problem of digital signatures and had actually come up with the first scheme, at the time highly impracticable, for implementing them. So in my algorithm I used digital signatures, and their theorem said that to tolerate a single failure you needed four processors. Basically, the use of digital signatures was cheating in terms of the assumptions of their theorem, so with digital signatures you could get around that theorem and actually do it with three processors. So in the original paper, what became known as the Byzantine Generals paper, the algorithms without digital signatures were due to Shostak and Pease, the other two authors, and the algorithm that used digital signatures was mine.
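To make the role of signatures concrete, here is a toy Python sketch of the signed-messages idea with three generals and at most one traitor - the very case that is impossible without signatures. The function names, the tuple-based "signature" chains, and the use of min() as a deterministic stand-in for the paper's choice() function are all my own; real signatures would of course be cryptographic. The key property modeled is that a traitor may send different values, but cannot forge another general's signature on a value.

```python
# Toy sketch of signed-messages Byzantine agreement with one commander "C"
# and two lieutenants, tolerating one traitor. A signed message is a pair
# (value, chain_of_signers); chains stand in for unforgeable signatures.

def sm1(commander_sends, lieutenants=("L1", "L2")):
    # commander_sends maps each lieutenant to the signed value the
    # commander actually sent it -- a traitorous commander may send
    # different (but validly signed) values to different lieutenants.
    received = {l: {commander_sends[l]} for l in lieutenants}
    # Round 2: each lieutenant countersigns what it received and relays
    # it to the other lieutenant; signatures cannot be altered.
    for l in lieutenants:
        v, chain = commander_sends[l]
        for other in lieutenants:
            if other != l:
                received[other].add((v, chain + (l,)))
    # Each loyal lieutenant applies the same deterministic choice to the
    # set of distinct validly-signed values it saw (min() here).
    return {l: min(v for v, _ in received[l]) for l in lieutenants}

# Loyal commander: both lieutenants see only "attack" and agree on it.
loyal = sm1({"L1": ("attack", ("C",)), "L2": ("attack", ("C",))})

# Traitorous commander sends conflicting signed orders; after relaying,
# both loyal lieutenants see the same set {attack, retreat} and still
# reach the same decision.
traitor = sm1({"L1": ("attack", ("C",)), "L2": ("retreat", ("C",))})
```

The point of the sketch is that the relayed signatures expose the commander's conflicting orders to both lieutenants, so they can agree; without signatures, a lieutenant cannot tell whether the commander or the relayer lied, which is why four processors are needed in that model.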



Charles: Right. OK. So, we mentioned earlier that Paxos has almost been rediscovered: it seems to be used everywhere now. I am curious as to why you think your work on distributed systems has survived, given the changes in hardware that have happened since you did the work originally.



Well, in some sense it is like asking why Turing machines have survived, despite all the changes in computers, and it is because Turing machines are talking about something very fundamental in computing. Now, what I did is nowhere near the significance of what Turing did and I am just thrilled to have my name used in the same sentences as Alan Turing’s. But, like what Turing did, it is talking about fundamental aspects of computation, not tied to any particular technology.

7. What are your current research interests? Currently, there is not much that I do that I would label as research. My current interest is in getting people to think better about how they write software and build systems. I think most of the ideas needed for that are now known, and it is a matter of improving the tools based on them but, I think even more, of convincing people of the importance of doing it.

8. Do you think we’ve got better at thinking as an industry, or worse, or are we about the same as we always were? I wouldn’t really know how to measure it, but I think you have a situation where the need for programmers is getting further and further ahead of the supply of programmers and, as a result, the general quality of programming is declining.

9. Right. So, you’ve worked on TLA and the TLA+ specification language, and the PlusCal algorithm language, and the various tools associated with them. Can you give us some examples of where that has been used in Microsoft, maybe? Well, in Microsoft it has been used in a few places, in different projects, as I mentioned in my talk. There was an intern, hired by Chuck Thacker actually in our lab, to write a TLA+ spec of the memory system for the chip that was going to be used in the Xbox 360 and, as Chuck tells the story, even before he had finished the spec, just in the act of writing it, he had discovered a problem, and they sent it to the IBM engineers - did I mention that IBM were the people who were building that chip? - who sat and puzzled over it; and two weeks later they came back and said “Yes, that is a bug in the chip. That bug would not have been caught by any of our tests and, had it reached production, every Xbox in the world would have crashed after four hours of continuous use”.