Consensus Algorithms for Distributed Systems

Consensus is the process by which multiple nodes agree on a single result to guarantee consistency among them. In the paper "Impossibility of Distributed Consensus with One Faulty Process", Fischer, Lynch and Paterson show that no deterministic asynchronous protocol can guarantee reaching consensus in bounded time if even a single node may fail.

1. Two-Phase Commit Protocol

The manager collects the values in the voting phase and passes down a result in the commit phase which all participants must agree upon.

Perhaps one of the most widely used distributed consensus algorithms, two-phase commit has two phases: the voting phase and the commit phase.

In the voting phase, the transaction manager sends the proposed value to every participant and asks each one to vote on whether it can commit. If every participant votes yes, the transaction manager contacts every participant again in the commit phase to tell them to commit the value. Otherwise, it contacts every participant to abort. The participants do not choose the value themselves; they only agree on whether or not to accept the value provided by the transaction manager.
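The two phases can be sketched in a few lines of Python. This is a minimal in-memory sketch, assuming the usual rule that the transaction commits only when every participant votes yes; the `Participant` class and `two_phase_commit` function are hypothetical names for illustration, and timeouts, logging and crash recovery are left out.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical participant that votes on whether it can commit."""
    name: str
    can_commit: bool = True
    state: str = "init"
    value: object = None

    def vote(self, value):
        # Voting phase: report whether this node is able to commit the value.
        self.state = "prepared" if self.can_commit else "refused"
        return self.can_commit

    def commit(self, value):
        # Commit phase: apply the value.
        self.value = value
        self.state = "committed"

    def abort(self):
        # Commit phase (negative outcome): discard the proposal.
        self.state = "aborted"


def two_phase_commit(participants, value):
    """Return True if the value was committed everywhere, False if aborted."""
    # Phase 1 (voting): collect a vote from every participant.
    votes = [p.vote(value) for p in participants]
    if all(votes):
        # Phase 2 (commit): everyone voted yes, so tell everyone to commit.
        for p in participants:
            p.commit(value)
        return True
    # Phase 2 (abort): a single "no" vote aborts the whole transaction.
    for p in participants:
        p.abort()
    return False
```

Calling `two_phase_commit([Participant("a"), Participant("b", can_commit=False)], 7)` aborts on every node, which is the all-or-nothing behaviour the protocol is meant to provide.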

2. PAXOS

A fictional legislative council on the Greek island of Paxos.

PAXOS is a method of achieving consensus in a network with unreliable nodes. It defines the following roles: a Client, which issues a request to the distributed system; Acceptors, which form a quorum; a Proposer, which advocates for the client request and tries to convince the acceptors to accept it; a Learner, which takes action once a request has been accepted; and a Leader, which acts as both a Distinguished Proposer and a Distinguished Learner. The leader chooses the “valid result” and sends it to all the nodes, and each node can reply with an accept or a reject. If a majority of nodes accept, the value is committed. Google Spanner uses the PAXOS algorithm to achieve consensus, and Apache ZooKeeper uses Zab, a closely related protocol.
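To make the accept/reject exchange concrete, here is a rough sketch of single-value ("single-decree") PAXOS. It is a simplification under stated assumptions: everything runs in one process with no message loss, learners are omitted, and the `Acceptor` class and `propose` function are illustrative names rather than any library's API.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1          # highest proposal number promised so far
        self.accepted_n = -1        # number of the proposal accepted so far
        self.accepted_value = None  # value of that proposal

    def prepare(self, n):
        # Phase 1: promise to ignore proposals numbered below n.
        if n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_value
        return False, -1, None

    def accept(self, n, value):
        # Phase 2: accept unless a higher-numbered promise was made.
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False


def propose(acceptors, n, value):
    """Run one proposal round; return the chosen value, or None on rejection."""
    majority = len(acceptors) // 2 + 1

    # Phase 1: collect promises from the acceptors.
    replies = [a.prepare(n) for a in acceptors]
    promises = [(an, av) for ok, an, av in replies if ok]
    if len(promises) < majority:
        return None

    # If some acceptor already accepted a value, the proposer must adopt
    # the one with the highest proposal number instead of its own.
    prior_n, prior_value = max(promises, key=lambda p: p[0])
    if prior_n >= 0:
        value = prior_value

    # Phase 2: ask the acceptors to accept the value.
    accepted = sum(a.accept(n, value) for a in acceptors)
    return value if accepted >= majority else None
```

With three fresh acceptors, `propose(acceptors, 1, "x")` chooses `"x"`, and a later `propose(acceptors, 2, "y")` still returns `"x"`: once a value is chosen, higher-numbered proposals can only re-propose it.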

3. RAFT

With a strong leader elected through heartbeats and random timeouts, RAFT is often considered more practical to implement than PAXOS.

The RAFT consensus algorithm offers guarantees and efficiency similar to PAXOS. It has three roles: leader, follower and candidate. Leader election happens through heartbeats and random timeouts. The leader accepts log entries from clients and replicates them to the followers in the cluster. YugaByte uses RAFT to achieve consensus and to add, remove or change nodes in the cluster in an online fashion. It can also tolerate the failure of a minority of nodes, as long as a majority remains online to hold a leader election.
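The election step can be sketched as follows. This is a toy model, not a full RAFT implementation: the node whose randomized timeout fires first becomes the candidate and wins if a majority of the whole cluster votes for it. The 150 to 300 ms timeout range, the `Node` class and `elect_leader` are assumptions made for illustration, and the log up-to-date check that real RAFT performs before granting a vote is omitted.

```python
import random


class Node:
    def __init__(self, node_id, rng):
        self.node_id = node_id
        self.term = 0
        self.voted_for = None
        self.alive = True
        # Randomized election timeout in milliseconds (range is an assumption).
        self.timeout_ms = rng.uniform(150, 300)

    def request_vote(self, term, candidate_id):
        # Grant at most one vote per term; dead nodes never answer.
        if not self.alive:
            return False
        if term > self.term:
            self.term = term
            self.voted_for = None
        if term == self.term and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False


def elect_leader(nodes):
    """Return the id of the elected leader, or None if no majority is reachable."""
    live = [n for n in nodes if n.alive]
    if not live:
        return None
    # The live node whose timeout fires first becomes the candidate.
    candidate = min(live, key=lambda n: n.timeout_ms)
    term = candidate.term + 1
    # The candidate votes for itself and asks every other node for a vote.
    votes = sum(n.request_vote(term, candidate.node_id) for n in nodes)
    # A majority of the full cluster is required, not just of the live nodes.
    return candidate.node_id if votes > len(nodes) // 2 else None
```

In a five-node cluster this keeps electing leaders with two nodes down, and returns `None` once a third node fails, which matches the minority-failure tolerance described above.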

4. PBFT (Practical Byzantine Fault Tolerance)

The Byzantine general trying to kill the consensus.

PBFT is a consensus protocol that can tolerate Byzantine faults. Byzantine faults occur due to system failures or when not all of the actors behave honestly. For example, a Byzantine fault may occur when an actor returns no result, returns a deliberately misleading result, or returns different results to different parts of the system.
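A small sketch of the arithmetic behind this tolerance may help. The bounds used here come from the PBFT paper rather than from this article: tolerating f Byzantine replicas takes 3f + 1 replicas in total, and a value counts as agreed once 2f + 1 replicas report it. The function names are hypothetical, and the message phases of the real protocol (pre-prepare, prepare, commit) are not modelled.

```python
from collections import Counter


def replicas_needed(f):
    """Total replicas required to tolerate f Byzantine replicas."""
    return 3 * f + 1


def quorum_size(f):
    """Matching messages required before a value is considered agreed."""
    return 2 * f + 1


def agreed_value(replies, f):
    """Return the value backed by a quorum, or None.

    `replies` holds one entry per replica; None means the replica stayed
    silent, and a faulty replica may report any value it likes.
    """
    counts = Counter(v for v in replies if v is not None)
    if not counts:
        return None
    value, count = counts.most_common(1)[0]
    return value if count >= quorum_size(f) else None
```

With f = 1 and four replicas, one lying replica cannot block agreement (`["commit", "commit", "commit", "abort"]` yields `"commit"`), but one silent replica plus one lying replica exceeds the budget of a single fault and no value is agreed.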

5. Proof Of Work

Miners compete to mine the next block by solving a computationally challenging hash puzzle.

In a POW-type algorithm, miners compete to mine the next block by spending large amounts of computing resources on a hash puzzle. The block is then accepted and added to the chain. The majority is decided by the longest chain. The difficulty of mining is adjusted regularly to control the block generation rate. Bitcoin, Ethereum and countless other digital currencies achieve consensus amongst nodes via POW. Since this is a 1-CPU-1-vote model, an attacker needs 51% of the computing power to compromise the network.
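The hash puzzle itself is simple to sketch. The version below is a toy, assuming a puzzle of the form "find a nonce so that the SHA-256 hash of the block data plus the nonce starts with a given number of zero hex digits"; real currencies compare a hash against a numeric target and hash a structured block header, and the `mine` and `verify` names are illustrative.

```python
import hashlib


def block_hash(block_data, nonce):
    """SHA-256 of the block data concatenated with the nonce, as hex."""
    return hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()


def mine(block_data, difficulty):
    """Search for a nonce whose hash starts with `difficulty` zero hex digits."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = block_hash(block_data, nonce)
        if digest.startswith(prefix):
            return nonce, digest
        nonce += 1


def verify(block_data, nonce, difficulty):
    """Checking a solution takes a single hash, unlike finding one."""
    return block_hash(block_data, nonce).startswith("0" * difficulty)
```

Each extra zero digit multiplies the expected number of attempts by 16, which is the knob a network can turn to keep the block generation rate steady as total computing power changes.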

6. Proof Of Stake

Everyone lays out their stake in the network and the greater the stake the greater the probability of mining the next block.

The more stake a candidate has, the greater their chances of mining the next block. POS implementations can be either chain-based or BFT-based. Neo and Cardano are two currencies using POS to achieve consensus in the network. There are also proposals to move Ethereum to a POS consensus system in the Casper update.
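Stake-weighted selection can be illustrated with a weighted random draw. This is only a sketch of the probability idea; the `pick_validator` name and the stake figures are made up, and real POS systems derive their randomness from the chain itself rather than from a local generator.

```python
import random


def pick_validator(stakes, rng):
    """Pick the next block producer with probability proportional to stake.

    `stakes` maps a (hypothetical) validator name to the amount it has staked.
    """
    names = list(stakes)
    weights = [stakes[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


# Example: alice holds 70% of the stake, so she is picked about 70% of the time.
rng = random.Random(1)
stakes = {"alice": 70, "bob": 30, "carol": 0}
picks = [pick_validator(stakes, rng) for _ in range(2000)]
```

A validator with no stake, like `carol` here, is never selected.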

Isolation

Each transaction must proceed as if it were the only one.

Snapshot Isolation

This isolation level guarantees that all the reads made at a particular timestamp are consistent across nodes, and that the transaction itself will successfully commit only if no conflicting updates have been made since that snapshot. MongoDB version 4.2 is likely to support Snapshot Isolation for consistent reads across documents, culminating a multi-year engineering effort.

Serializable Isolation

This isolation level guarantees that all transactions behave as if they ran in a serial (one-at-a-time) schedule.

YugaByte supports snapshot isolation, with serializable isolation support on the roadmap.
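The snapshot-isolation commit rule described earlier can be sketched on a single node. This is a toy model and not how YugaByte or MongoDB implement it: each transaction reads from a frozen copy of the data taken when it began, and its commit fails if another transaction has since committed a write to one of the same keys (often called "first committer wins"). `SnapshotStore` and `Transaction` are hypothetical names.

```python
class SnapshotStore:
    def __init__(self):
        self.data = {}   # key -> (value, commit timestamp)
        self.clock = 0   # logical commit counter

    def begin(self):
        # Every transaction gets a frozen view of the data as of now.
        return Transaction(self, self.clock, dict(self.data))


class Transaction:
    def __init__(self, store, start_ts, snapshot):
        self.store = store
        self.start_ts = start_ts
        self.snapshot = snapshot
        self.writes = {}

    def read(self, key):
        # A transaction sees its own writes first, then its snapshot.
        if key in self.writes:
            return self.writes[key]
        entry = self.snapshot.get(key)
        return entry[0] if entry else None

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Abort if any key we wrote was committed by someone else
        # after our snapshot was taken.
        for key in self.writes:
            entry = self.store.data.get(key)
            if entry and entry[1] > self.start_ts:
                return False
        self.store.clock += 1
        for key, value in self.writes.items():
            self.store.data[key] = (value, self.store.clock)
        return True
```

Two transactions that start from the same snapshot and both write `"x"` cannot both commit; the second one to reach `commit()` gets `False` and has to retry.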

Fine grained isolation locks for serializability (Source)

Durability

All state changes must be persistent upon commit.

For durability in distributed systems, the timestamp (hybrid or atomic) of each update needs to be recorded, so that it is possible to recover the state of the document at any point in the past. These versioned updates subsequently need to be garbage collected once no transactions are reading a snapshot at which the old value would be visible.
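A rough sketch of timestamp-versioned updates with garbage collection is shown below. It assumes integer timestamps that are unique per key and keeps everything in memory, so it illustrates only the versioning idea and not persistence; `VersionedStore` is a hypothetical name.

```python
import bisect


class VersionedStore:
    def __init__(self):
        # key -> list of (timestamp, value), kept sorted by timestamp.
        # Timestamps are assumed to be unique per key.
        self.versions = {}

    def put(self, key, timestamp, value):
        bisect.insort(self.versions.setdefault(key, []), (timestamp, value))

    def get(self, key, timestamp):
        """Return the value that was visible at `timestamp`, or None."""
        history = self.versions.get(key, [])
        timestamps = [ts for ts, _ in history]
        index = bisect.bisect_right(timestamps, timestamp)
        return history[index - 1][1] if index else None

    def garbage_collect(self, oldest_active_snapshot):
        """Drop versions that no snapshot at or after the given one can see."""
        for key, history in list(self.versions.items()):
            timestamps = [ts for ts, _ in history]
            index = bisect.bisect_right(timestamps, oldest_active_snapshot)
            # Keep the newest version at or before the oldest snapshot,
            # plus every later version.
            self.versions[key] = history[max(index - 1, 0):]
```

After writes at timestamps 10, 20 and 30, calling `garbage_collect(25)` drops only the version from timestamp 10, because a reader at snapshot 25 still needs the version written at 20.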

YugaByte uses a Log-Structured Merge-tree based key-value store, with updates replicated via the RAFT protocol. Mutations of records are versioned using timestamps, and the store uses bloom filters and range-query optimizations for faster reads.

Distributed Transactions in modern scale-out databases

In the article Practical Tradeoffs in Google Cloud Spanner, Azure Cosmos DB and YugaByte DB, the author explains well the trade-offs to consider when choosing between the three scale-out databases. Here I will briefly summarize some highlights from the text:

Google Spanner internally divides the data into chunks called splits and assigns each split to a different node. It then replicates each split and distributes the replicas amongst the nodes, keeping them in agreement using the PAXOS consensus algorithm.

Global clock synchronization is done via TrueTime, which exposes explicit bounds on clock uncertainty. By using commit-wait to wait out those error bounds, Spanner achieves external consistency, meaning all transactions follow a global serializable schedule even on nodes in different regions. Google Spanner, however, is proprietary, runs on specialized GCP hardware (cloud lock-in) and only supports a single API (SQL).
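The commit-wait idea can be sketched with a simulated clock. This follows the description in the Spanner paper as I understand it: the clock returns an interval guaranteed to contain the true time, the transaction takes the upper end of that interval as its commit timestamp, and it then waits until even the lower end has passed that timestamp. `SimulatedTrueTime` is a hypothetical stand-in, not Google's API, and time is advanced manually instead of by actually sleeping.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    earliest: float
    latest: float


class SimulatedTrueTime:
    """Hypothetical clock whose reading is only known to within +/- epsilon."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.real_time = 0.0  # simulated time in seconds

    def now(self):
        return Interval(self.real_time - self.epsilon,
                        self.real_time + self.epsilon)

    def sleep(self, seconds):
        # Advance simulated time instead of blocking.
        self.real_time += seconds


def commit_with_wait(clock, step=0.001):
    """Pick a commit timestamp, then wait until it is definitely in the past."""
    commit_ts = clock.now().latest
    waited = 0.0
    while clock.now().earliest <= commit_ts:
        clock.sleep(step)
        waited += step
    return commit_ts, waited
```

The wait comes out at roughly twice the uncertainty bound, which is why tight clock bounds matter so much for commit latency in this design.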

Azure Cosmos DB is a multi-model, polyglot distributed data solution with five different consistency levels to choose from. It supports the Cassandra, MongoDB and Graph APIs amongst others and uses proprietary replication and consensus algorithms. To prevent write unavailability in the case of partitions in the strong-consistency mode of operation, it limits accounts with strong consistency guarantees to a single Azure region. It is not open-source and works only on Azure hardware (cloud lock-in).

YugaByte is an infrastructure-agnostic, distributed data store which can be deployed across public cloud and on-premises data centers. YugaByte DB shards the data into a configurable number of tablets, which are then distributed evenly amongst the nodes in the cluster. Each tablet is further replicated on other nodes via the RAFT consensus algorithm.
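Sharding into tablets and spreading the replicas can be sketched like this. It is an illustration of the general idea only and not YugaByte's actual placement logic: keys are hashed to a tablet, and each tablet's replicas go to consecutive nodes in round-robin order. The function names and the replication factor of 3 are assumptions.

```python
import hashlib


def tablet_for_key(key, num_tablets):
    """Map a key to a tablet by hashing it (stable across runs)."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_tablets


def assign_tablets(num_tablets, nodes, replication_factor=3):
    """Place each tablet's replicas on distinct nodes, round-robin."""
    if replication_factor > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for tablet in range(num_tablets):
        placement[tablet] = [
            nodes[(tablet + replica) % len(nodes)]
            for replica in range(replication_factor)
        ]
    return placement
```

With 8 tablets on 4 nodes and 3 replicas each, every node ends up holding 6 replicas, and losing any single node still leaves two copies of every tablet.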

It provides strong consistency guarantees across regions and zones, uses hybrid timestamps for global timekeeping, and provides snapshot isolation via MVCC, with serializable isolation in the works. The codebase is open-source and compatible with the Redis, Cassandra and Gremlin APIs, with SQL support in the works.

Because what matters is the system behaving in consistent, predictable ways.

Any distributed data store, be it a database, a messaging queue or a blockchain, must behave in consistent and predictable ways even in cases of network failures. As application developers, we must be mindful of the choices we make when choosing a data model, because it will come to define the life of the product we are building and everything surrounding it.

Further Reading

2. Distributed Transactions- ACID and BASE.

3. Isolations levels in Distributed databases.

4. Transactional IO using provisional records.

5. A brief tour of FLP impossibility

6. Time, Clock and order of execution of events in a distributed system.

7. Honeybee democracy — actor behaviour when given equally lucrative choices