THE BIG PICTURE:

Now that we have introduced the basic algorithms that distributed databases use to synchronize data across the system, we can look at the big picture and pinpoint where each algorithm is actually used. Let's start by tracing an operation from the client program, through the database, and back to the client.

Client sends a write to N3

The client for a distributed database usually connects to a single node in the cluster and uses that node to communicate with the entire system. To start off, the client sends a write to N3 with RF set to quorum (in this case, 2 of the 3 nodes). N3 replicates this write to N2 and returns a success message to the client. At this point, both N3 and N2 hold the correct, most recent state of the system. Furthermore, both N3 and N2 can tell that their state is the most recent by comparing their vector clocks against the other vector clocks in the system.
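
To make the vector-clock comparison concrete, here is a minimal sketch. The dominates helper and the literal clocks are invented for this example; real databases typically track a version per key rather than one clock per node.

    # A vector clock maps node names to counters. Clock `a` dominates clock `b`
    # when it has seen everything `b` has seen, plus at least one newer event.
    def dominates(a, b):
        keys = set(a) | set(b)
        return (all(a.get(k, 0) >= b.get(k, 0) for k in keys)
                and any(a.get(k, 0) > b.get(k, 0) for k in keys))

    # After the quorum write: N3 bumped its counter and N2 adopted the new clock,
    # but N1 never saw the write.
    clock_n3 = {"N3": 1}
    clock_n2 = {"N3": 1}
    clock_n1 = {}

    assert dominates(clock_n3, clock_n1)       # N3 knows its state is newer
    assert not dominates(clock_n1, clock_n3)   # and N1 can tell it is behind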

The client now reconnects to the database, this time using N1 as its entry point, and sends a read to N1 with RF at quorum. N1 forwards the read to N3 and, by comparing vector clocks, realizes that its own state is stale. At this point it has determined the most recent state among two nodes in the system, so it can return N3's state to the client.
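
The read path can be sketched the same way: the coordinator collects a value and a clock from enough replicas, returns the version whose clock is not dominated by any other response, and treats the rest as stale. Names like quorum_read below are made up for illustration; real databases also repair the stale copies.

    def dominates(a, b):
        keys = set(a) | set(b)
        return (all(a.get(k, 0) >= b.get(k, 0) for k in keys)
                and any(a.get(k, 0) > b.get(k, 0) for k in keys))

    # Responses gathered for the read: N1 (the coordinator) is stale, N3 is current.
    responses = {
        "N1": {"clock": {},        "value": None},
        "N3": {"clock": {"N3": 1}, "value": "latest write"},
    }

    def quorum_read(responses):
        for name, reply in responses.items():
            # Pick a reply that no other reply dominates; with quorum reads and
            # writes, at least one such reply holds the most recent state.
            if not any(dominates(other["clock"], reply["clock"])
                       for o, other in responses.items() if o != name):
                stale = [o for o, other in responses.items()
                         if o != name and dominates(reply["clock"], other["clock"])]
                return reply["value"], stale

    value, stale = quorum_read(responses)
    print(value, "-- stale replicas:", stale)   # latest write -- stale replicas: ['N1']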

The above example illustrates why consistent reads and writes are essential when using a distributed database: if N3 or N1 had not replicated the operation, the client could have received stale data. It is important to note that inconsistent behavior is not always a problem; a distributed database that is serving as a cache, for example, may not need to provide any consistency guarantees at all.

WHEN THINGS GO TO SHIT:

So far, we have only discussed distributed databases under perfect conditions, without network partitions or node failures. These failures happen regularly in practice and must be accounted for when designing distributed algorithms.

Let's consider the same system of three nodes; this time, however, one of our system admins didn't have enough coffee and screwed up N1's routing table. Now, N1 can no longer reach N2 or N3, but the client can still communicate with N1. In this state, the client can still perform operations on the database if it connects to N2 or N3 (two nodes are enough for quorum), but connecting to N1 will cause every quorum request to hang until a timeout is reached (or indefinitely, if no timeout is configured).
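
A rough sketch of what the coordinator on N1 runs into (the names and the reachable rule are invented for this scenario): it can only gather its own acknowledgement, which falls short of the quorum of 2, so the request can never succeed.

    # Toy model of the partitioned cluster: N1 cannot talk to N2 or N3.
    def reachable(a, b):
        return a == b or "N1" not in (a, b)

    def quorum_request(coordinator, nodes, quorum=2):
        acks = sum(1 for node in nodes if reachable(coordinator, node))
        if acks >= quorum:
            return "ok"
        # A real client would hang here until its timeout fires; we just report it.
        return "timeout"

    print(quorum_request("N1", ["N1", "N2", "N3"]))  # timeout -- only N1 acknowledges
    print(quorum_request("N2", ["N1", "N2", "N3"]))  # ok -- N2 and N3 form a quorum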

GLOBAL CHANGES:

In many systems there are operations that require global consensus before they can be performed; for example, schema changes and configuration changes (like adding a new node to the cluster) must be ordered identically by all nodes before the operation can be applied. Supporting globally ordered operations is significantly more difficult than guaranteeing consistency for individual operations, because it requires a distributed consensus algorithm such as Raft or Paxos. These algorithms describe a protocol for reaching agreement on the ordering of a log: a sequence of causally ordered events. As I mentioned earlier, no algorithm can guarantee that this agreement is reached within a bounded amount of time, so Raft and Paxos can only promise that consensus will eventually be reached, not when.
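
Raft and Paxos are well beyond the scope of a short sketch, but the thing they protect is simple: a single log that every node applies in the same order, where nothing is committed without a majority. The toy below (invented names, no leader election, no failure handling) only illustrates that end state, not the protocols themselves.

    # Every node keeps the same totally ordered log; entries commit only when a
    # majority of the cluster can accept them.
    class Node:
        def __init__(self, name):
            self.name = name
            self.log = []

    CLUSTER = [Node(f"N{i}") for i in range(1, 4)]
    MAJORITY = len(CLUSTER) // 2 + 1   # 2 of 3

    def propose(entry, reachable_nodes):
        """Commit `entry` to the next log slot only if a majority is reachable."""
        if len(reachable_nodes) < MAJORITY:
            return False
        for node in reachable_nodes:
            node.log.append(entry)
        return True

    propose("add column email", CLUSTER)        # committed: all three logs agree
    propose("remove node N2",   CLUSTER[:1])    # rejected: only one node reachable
    print([node.log for node in CLUSTER])       # the same ordered log everywhere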

Split-Brain Schema Example

Let's go over an example that requires Raft or Paxos. Your database is running on a cluster of four nodes, all of which are accepting writes. N2 is exhibiting write failures and needs to be taken down and replaced with a new database node. If any of the nodes aren't made aware of this change, the system could exhibit erratic behavior (like replicating operations to a node that no longer exists). Now, at the same time, we hit another hardware issue with N4 and need to replace it as well. If we try to remove these nodes by concurrently applying two separate schema changes, we can end up in a "split-brain" state: the two schema changes have no causal ordering, so the database cannot determine whether Schema A follows Schema B or vice versa, and the system cannot make progress.
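
To see why there is nothing for the database to go on, here is a small sketch that tags each change with the vector clock of the node it was issued through (the clocks and names are invented for the example). Neither clock precedes the other, so the two changes are concurrent.

    a_clock = {"N1": 1}   # "remove N2", issued through N1
    b_clock = {"N3": 1}   # "remove N4", issued through N3

    def happened_before(a, b):
        # True if every counter in `a` is <= the matching counter in `b` and the
        # clocks differ, i.e. `a` causally precedes `b`.
        keys = set(a) | set(b)
        return a != b and all(a.get(k, 0) <= b.get(k, 0) for k in keys)

    concurrent = (not happened_before(a_clock, b_clock)
                  and not happened_before(b_clock, a_clock))
    print("concurrent (no causal order)?", concurrent)   # True -- only a consensus
                                                          # round can impose an order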

WHY MASTER-SLAVE SYSTEMS ARE NOT DISTRIBUTED DATABASES:

Many systems are built using two relational databases, where one serves as the master and the other as a slave. This setup can work well in many cases, but it often lacks the consistency guarantees you might expect. If you rely on one of these systems in production and the master fails, there is generally no guarantee that the slave holds the most recent state of the system. By contrast, the same two machines running a distributed database can be shown to serve consistent reads after either machine fails, provided writes were performed at quorum (which, with two nodes, means both of them). Additionally, a distributed database can accept client connections on both machines, whereas a master-slave system can only perform writes on the master.
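
The consistency claim boils down to the standard quorum-overlap argument: if the number of replicas a write must reach plus the number a read must reach exceeds the total number of nodes, every read overlaps the latest write. A quick sanity check for the two-node case (W, R, and N are just the usual shorthand, not settings from any specific product):

    # Quorum overlap: a read and a write are guaranteed to share a replica
    # whenever W + R > N. For two nodes with writes at quorum (W = 2), any
    # single surviving node (R = 1) has seen the latest write.
    N, W, R = 2, 2, 1
    assert W + R > N            # overlap guaranteed -> reads stay consistent

    # A master-slave pair with asynchronous replication behaves more like W = 1:
    # the slave may be missing the latest writes when the master dies.
    W_async = 1
    print("overlap guaranteed?", W_async + R > N)   # False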

WORDS OF CAUTION:

Distributed databases are wonderful data stores in theory, but in practice they tend to be far buggier than established relational databases. Some products claim to provide strong consistency guarantees but exhibit non-deterministic, inconsistent behavior that can result in data loss. Another issue is that many products do not make their default settings evident in their APIs. MongoDB is an example of a product that has struggled with both a lack of clarity and long-lived consistency bugs (one of these bugs was in the product for almost two years). This cautionary note is not intended to discourage you from using a distributed database, but rather to encourage you to thoroughly vet the one you decide to go with.

Thanks for reading! Please leave a comment if you have any questions.