The Byzantine Generals Problem

In their 1982 paper titled “The Byzantine Generals Problem”, Lamport, Shostak and Pease devised an illustrative analogy to demonstrate the severity of this problem.

The situation they came up with involves a Byzantine army camped around an enemy city. The army consists of many units, each with its own commanding general.

Having carefully assessed the strength of the enemy defending the city, the generals know that they can only win if all units attack at the same time. If they do not, defeat is certain.

Since the city is quite large and the units are encamped far from one another, the generals can only communicate by messenger.

The essential task is to agree on a common strategy and execute it together: attack or retreat? Note that the content of the plan does not matter at all. What matters is that the generals reach consensus on one common plan.

So far, so good. Now we introduce some complications. What happens when not all of the generals or their messengers are loyal and honest?

For one, imagine a traitorous general who sends contradictory messages to the other generals, telling some to attack and others to retreat.

Another factor arguably makes the situation worse: potentially traitorous messengers. Since the generals can’t communicate directly with one another, they rely on their messengers. Even if every general is honest, a messenger may still deliver a falsified message.

How can the Byzantine army make sure that all the generals still accept the same plan in spite of this challenge?

Faults of this type are called Byzantine faults: any fault that presents different symptoms to different observers. In the case described above, the generals receive contradictory messages because of either a traitorous general or a traitorous messenger.
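As a minimal sketch of what "different symptoms to different observers" means, the Python snippet below models one general sending orders to three others. The names `send_orders` and `observers_disagree` and the alternating-order behaviour of the traitor are assumptions made for illustration, not something taken from the paper.

```python
ATTACK, RETREAT = "attack", "retreat"


def send_orders(recipients, traitorous):
    """Return the order each recipient receives from one sending general."""
    orders = {}
    for i, name in enumerate(recipients):
        if traitorous:
            # Assumed traitor behaviour: alternate orders between recipients.
            orders[name] = ATTACK if i % 2 == 0 else RETREAT
        else:
            # A loyal general sends the same order to everyone.
            orders[name] = ATTACK
    return orders


def observers_disagree(orders):
    """True when different observers received different orders."""
    return len(set(orders.values())) > 1


generals = ["B", "C", "D"]
honest = send_orders(generals, traitorous=False)
faulty = send_orders(generals, traitorous=True)
print(observers_disagree(honest))  # False
print(observers_disagree(faulty))  # True
```

Note that no single recipient can spot the problem from its own message alone; the inconsistency only shows up when observations are compared.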

If a Byzantine fault causes the loss of a system service that depends on consensus, the result is called a Byzantine failure. Here, the failed service is the agreement among the generals on a common plan that they all carry out.

A system that can tolerate some of these faults and still come to a consensus is called Byzantine fault tolerant.
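To make the idea of tolerance concrete, here is a rough sketch loosely modelled on a single round of message relaying among one commander and three lieutenants, with at most one traitor. Each lieutenant forwards the order it received to the others and then decides by majority. The function names, the "traitor relays the opposite order" behaviour, and the retreat-on-tie default are all assumptions made for this example; it is not a general-purpose consensus implementation.

```python
from collections import Counter

ATTACK, RETREAT = "attack", "retreat"


def majority(values):
    """Most common value; ties fall back to RETREAT as an assumed default."""
    top = Counter(values).most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        return RETREAT
    return top[0][0]


def one_round_relay(commander_orders, traitor_lieutenant=None):
    """Return the decision of every loyal lieutenant.

    commander_orders maps each lieutenant to the order it received.
    Loyal lieutenants relay that order unchanged; the traitorous
    lieutenant (if any) is assumed to relay the opposite order.
    """
    lieutenants = list(commander_orders)
    decisions = {}
    for me in lieutenants:
        if me == traitor_lieutenant:
            continue  # only loyal lieutenants need to agree
        seen = [commander_orders[me]]  # what the commander told me
        for other in lieutenants:
            if other == me:
                continue
            relayed = commander_orders[other]
            if other == traitor_lieutenant:
                relayed = RETREAT if relayed == ATTACK else ATTACK
            seen.append(relayed)
        decisions[me] = majority(seen)
    return decisions


# Traitorous commander sends conflicting orders; lieutenants are loyal.
print(one_round_relay({"L1": ATTACK, "L2": RETREAT, "L3": ATTACK}))
# Loyal commander orders an attack; lieutenant L3 is the traitor.
print(one_round_relay({"L1": ATTACK, "L2": ATTACK, "L3": ATTACK}, "L3"))
```

In both scenarios the loyal lieutenants end up with the same decision, which is the property the generals need. The sketch only covers this small case of four participants and one traitor.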

This problem relates directly to modern-day distributed systems: the computers are the generals, and the communication systems they rely on are the messengers.

In the next section we will explore the implications of all this for distributed computer systems.