Distributed systems

“At 1:14pm Pacific time, May 15th, the Stellar network halted for 67 minutes due to an inability to reach consensus.”

This is the opening line of a recent blog post explaining the issue the Stellar DLT experienced, and it highlights something important about distributed systems and the trade-offs involved in their design.

Some people jumped on the outage and claimed it showed that Stellar is not decentralized, but this is simply incorrect. What is actually at play here is something called the CAP theorem. In short, it states that a distributed system's design is a trade-off between these three attributes:

Consistency

Availability

Partition tolerance
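The trade-off becomes concrete when a partition actually happens. The toy sketch below (illustrative only; the `Replica` class and its behavior are my own invention, not any real database's API) shows the two extreme responses: a consistency-first node refuses writes it cannot replicate to its peer, while an availability-first node accepts them and risks temporary divergence.

```python
class Replica:
    """A toy replica of a key-value store, parameterized by CAP stance."""

    def __init__(self, mode):
        assert mode in ("CP", "AP")  # consistency-first vs availability-first
        self.mode = mode
        self.data = {}
        self.peer_reachable = True

    def write(self, key, value):
        if not self.peer_reachable and self.mode == "CP":
            # Consistency first: we cannot replicate, so refuse the write.
            # The node is temporarily unavailable but never inconsistent.
            return False
        # Availability first (or no partition): accept the write locally.
        # Under a partition, the AP replica may now diverge from its peer.
        self.data[key] = value
        return True

cp, ap = Replica("CP"), Replica("AP")
cp.peer_reachable = ap.peer_reachable = False  # simulate a network partition
print(cp.write("balance", 10))  # False: unavailable but consistent
print(ap.write("balance", 10))  # True: available but possibly inconsistent
```

Stellar's halt is the CP branch of this sketch playing out at network scale: when the nodes could not agree, the protocol stopped accepting writes rather than risk a split ledger.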

Stellar and some other DLTs, like Ripple, are designed to ensure consistency. Once a transaction is confirmed, you should have confidence that it will not change. This allows for a quick turnaround when determining that a transaction is confirmed, usually within a few seconds.

But the cost of this consistency guarantee is reduced availability. This particular Stellar outage came down to what you might call operational issues, something that can be improved with better tooling and more experience running the network among the various participants. It is important to understand that this was not a protocol issue: the protocol worked exactly as designed, halting in order to maintain the consistency of the ledger.

So if you prioritize consistency, you reduce your availability guarantees, because you do not tolerate network partitioning. In the context of DLTs and blockchains, a network partition manifests as a fork of the chain.

Prioritizing availability?

There exist other DLTs that prioritize availability over consistency, with Bitcoin being one of them.

Over time the Bitcoin ledger converges to consistency, but this is not guaranteed immediately; this is what we refer to as eventual consistency. The benefit is that the network is more tolerant of individual node failures, so the remaining members can keep making progress. The downside is that you cannot be sure a transaction is confirmed after the first block. You need to wait, which is why we often say it takes an hour to confirm a transaction on Bitcoin: as time passes, we can be increasingly certain that our view of the confirmed transaction will remain.
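The "wait for confirmations" intuition can be made quantitative. The Bitcoin whitepaper (section 11) derives the probability that an attacker controlling a fraction q of the hash power ever overtakes the honest chain once a transaction is z blocks deep. The sketch below is a Python transcription of that calculation; variable names follow the whitepaper.

```python
import math

def attacker_success_probability(q: float, z: int) -> float:
    """Probability that an attacker with hash-power share q ever catches
    up from z blocks behind (Bitcoin whitepaper, section 11)."""
    p = 1.0 - q                 # honest hash-power share
    lam = z * (q / p)           # expected attacker progress while honest nodes mine z blocks
    total = 1.0
    for k in range(z + 1):
        poisson = math.exp(-lam) * lam**k / math.factorial(k)
        total -= poisson * (1.0 - (q / p) ** (z - k))
    return total

# With a 10% attacker, each extra confirmation sharply shrinks the
# chance of a reversal, which is the basis for the folk rule of
# waiting about an hour (roughly six blocks).
for z in (0, 2, 6):
    print(z, attacker_success_probability(0.1, z))
```

At z = 0 the probability is 1 (an unconfirmed transaction offers no guarantee), and by six confirmations against a 10% attacker it is on the order of 10⁻⁴, matching the table in the whitepaper.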

What happens on the network is that nodes are designed to accept the longest chain of blocks they have seen. Initially, with just one block appended after the previous one, it is rather uncertain whether this particular block will win out over a competing block computed by some other node. That other block might not contain our transaction.

As more nodes accept a blockchain in which the block with our transaction is present, and more new blocks are appended after it, the more certain we can be that this is the longest surviving chain. Put differently, the blockchain is constantly forking, with one of these forks eventually emerging as the longest surviving chain.
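The fork-resolution rule described above can be sketched in a few lines. This is illustrative only, not real Bitcoin code: chains are just lists of block labels, and a node switches chains only when it sees a strictly longer one.

```python
def adopt(current_chain, candidate_chain):
    """Longest-chain rule: switch only if the candidate is strictly longer.

    On a tie, a node keeps the chain it saw first, which is why two
    competing forks can coexist until one of them gains another block.
    """
    if len(candidate_chain) > len(current_chain):
        return candidate_chain
    return current_chain

# Two forks sharing the common prefix ["g", "a"]:
fork_1 = ["g", "a", "b1"]
fork_2 = ["g", "a", "b2"]

node = fork_1
node = adopt(node, fork_2)          # same length: keep the chain seen first
assert node == fork_1
node = adopt(node, fork_2 + ["c"])  # fork_2 grew: reorganize onto it
assert node == ["g", "a", "b2", "c"]
```

A transaction that only appeared in block `b1` is dropped by the reorganization at the end, which is exactly why a single confirmation is weak evidence and more appended blocks make a reversal ever less likely.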

You could ask: if you still need to wait to be sufficiently sure your transaction is on the ledger, is it really a design where availability is prioritized? Or is it just a very slow method of reaching consistency?