I picked four modern Raft-based distributed systems (Etcd, CockroachDB, RethinkDB and TiDB) and tested them with their default settings to demonstrate that all of them are affected by the issue described above.

|             | Crashed leader | Isolated leader | Version |
|-------------|----------------|-----------------|---------|
| Etcd        | 2s             | 2s              | 3.1.0 (8ba2897) |
| RethinkDB   | 0.6s           | 15s             | rethinkdb 2.3.5~0xenial (GCC 5.3.1) |
| CockroachDB | 12s            | 10s             | beta-20170223 |
| TiDB        | 18s            | 2m 40s (bug)    | pd: f5744d7, tikv: eb185b3, tidb: a8d185d |

- Crashed leader: the case when I stopped the leader with "kill -9"
- Isolated leader: the case when I isolated the leader with iptables
- Time interval: the duration of unavailability, when no node of the storage was able to serve requests
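
As a rough illustration of these two failure modes, the sketch below shows how a test harness could inject them from Python with standard OS tools. The function names are made up for this post, and the exact iptables rules (run with root privileges on the machines that should stop talking to the leader) are an assumption rather than the commands used in the actual tests.

```python
import subprocess

def crash_leader(pid: int) -> None:
    # "Crashed leader": stop the leader process abruptly with kill -9.
    subprocess.run(["kill", "-9", str(pid)], check=True)

def isolate_leader(leader_ip: str) -> None:
    # "Isolated leader": on the machine where this runs, drop all traffic to and
    # from the leader's IP, so the leader keeps running but is cut off from it.
    subprocess.run(["iptables", "-A", "INPUT", "-s", leader_ip, "-j", "DROP"], check=True)
    subprocess.run(["iptables", "-A", "OUTPUT", "-d", leader_ip, "-j", "DROP"], check=True)

def heal(leader_ip: str) -> None:
    # Remove the isolation rules once the unavailability window has been measured.
    subprocess.run(["iptables", "-D", "INPUT", "-s", leader_ip, "-j", "DROP"], check=True)
    subprocess.run(["iptables", "-D", "OUTPUT", "-d", leader_ip, "-j", "DROP"], check=True)
```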

I tested the systems with their default settings. Of course, a test with the default settings is a test of the default settings, so it can't be used to decide which system handles leader failures better. Each system can be tuned to have a smaller unavailability window. However, a shorter leader election timeout means greater sensitivity to network glitches. The point of the test is to demonstrate that all Raft implementations share this tradeoff.

Later in the post I'll show that this tradeoff isn't essential to the problem of consensus and that it can be avoided with a different consensus algorithm.

The systems were tested on a cluster of four machines. Three nodes hosted the storage, and the fourth was a client. The client ran three coroutines; each of them opened a connection to its dedicated storage node and executed the following loop:

1. read a value by a key
2. if the value wasn't set, use 0
3. increment the value
4. write it back
5. increment the number of successful iterations
6. repeat the loop

Each coroutine used its own key to avoid collisions. If there was an error during the loop, the coroutine closed the current connection, opened a new one and began the next iteration of the loop.
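
A minimal sketch of one such coroutine, written with asyncio and a hypothetical key-value client exposing connect/get/put/close (the actual client code in the tests repository may look different):

```python
import asyncio

async def worker(node_addr: str, key: str, counters: dict) -> None:
    # One coroutine per storage node; each coroutine uses its own key.
    client = await connect(node_addr)        # hypothetical client factory
    while True:
        try:
            value = await client.get(key)    # read a value by a key
            if value is None:                # if the value wasn't set, use 0
                value = 0
            await client.put(key, value + 1) # increment the value and write it back
            counters[node_addr] += 1         # one more successful iteration
        except Exception:
            # On any error: close the current connection, open a new one,
            # and begin the next iteration of the loop.
            await client.close()
            client = await connect(node_addr)
```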

Once a second (every 100 ms in the 10x case) the app dumped the number of successful iterations since the last dump, for the whole cluster and for each node. This simple metric helped to analyze the behavior of the cluster when its leader was disturbed.
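
The dumping itself can be a separate coroutine that snapshots and resets the shared counters on a fixed period; below is a sketch, assuming the same counters dictionary as in the worker above (1 s by default, 0.1 s for the 10x runs):

```python
import asyncio

async def reporter(counters: dict, period: float = 1.0) -> None:
    # Every `period` seconds print the successful iterations since the last dump:
    # the second of the experiment, the cluster-wide total, then the per-node
    # counts, and reset the counters.
    second = 0
    while True:
        await asyncio.sleep(period)
        second += 1
        per_node = [counters[node] for node in sorted(counters)]
        print(second, sum(per_node), *per_node)
        for node in counters:
            counters[node] = 0
```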

I identified the leader empirically, as the node with the highest rate of successful iterations.

Etcd (10x)

| second | total | node 1 | node 2 | node 3 |
|--------|-------|--------|--------|--------|
| 261    | 51    | 16     | 16     | 19     |
| 262    | 44    | 14     | 13     | 17     |

RethinkDB (10x)

| second | total | node 1 | node 2 | node 3 |
|--------|-------|--------|--------|--------|
| 175    | 74    | 23     | 23     | 28     |
| 176    | 72    | 22     | 21     | 29     |

CockroachDB

| second | total | node 1 | node 2 | node 3 |
|--------|-------|--------|--------|--------|
| 146    | 489   | 217    | 134    | 138    |
| 147    | 512   | 234    | 133    | 145    |

TiDB

| second | total | node 1 | node 2 | node 3 |
|--------|-------|--------|--------|--------|
| 498    | 426   | 120    | 171    | 135    |
| 499    | 436   | 126    | 171    | 139    |

In each storage's subtable:

- 1st column: the nth second of the experiment
- 2nd column: the number of successful iterations during the last second across all nodes
- 3rd, 4th and 5th columns: the number of successful iterations per second via the 1st (2nd or 3rd) node

For example, in the Etcd row for the 261st second the per-node counts 16 + 16 + 19 add up to the cluster-wide total of 51.

In the case of Etcd and RethinkDB the 3rd node is the leader; in the case of CockroachDB it's the 1st node, and with TiDB it's the 2nd. Let's kill the leader and see how it affects the health of the cluster.

Etcd (10x)

| second | total | node 1 | node 2 | node 3 |
|--------|-------|--------|--------|--------|
| 266    | 39    | 13     | 12     | 14     |
| 267    | 4     | 1      | 1      | 2      |
| 268    | 0     | 0      | 0      | 0      |
| ...    | ...   | ...    | ...    | ...    |
| 286    | 0     | 0      | 0      | 0      |
| 287    | 23    | 13     | 10     | 0      |
| 288    | 28    | 15     | 13     | 0      |

RethinkDB (10x)

| second | total | node 1 | node 2 | node 3 |
|--------|-------|--------|--------|--------|
| 179    | 68    | 21     | 21     | 26     |
| 180    | 61    | 20     | 19     | 22     |
| 181    | 0     | 0      | 0      | 0      |
| ...    | ...   | ...    | ...    | ...    |
| 186    | 0     | 0      | 0      | 0      |
| 187    | 41    | 23     | 18     | 0      |
| 188    | 42    | 23     | 19     | 0      |

CockroachDB

| second | total | node 1 | node 2 | node 3 |
|--------|-------|--------|--------|--------|
| 150    | 549   | 250    | 143    | 156    |
| 151    | 410   | 186    | 109    | 115    |
| 152    | 0     | 0      | 0      | 0      |
| ...    | ...   | ...    | ...    | ...    |
| 161    | 0     | 0      | 0      | 0      |
| 162    | 106   | 0      | 106    | 0      |
| 163    | 221   | 0      | 167    | 54     |
| 164    | 310   | 0      | 188    | 122    |

TiDB

| second | total | node 1 | node 2 | node 3 |
|--------|-------|--------|--------|--------|
| 501    | 476   | 134    | 192    | 150    |
| 502    | 133   | 37     | 54     | 42     |
| 503    | 0     | 0      | 0      | 0      |
| ...    | ...   | ...    | ...    | ...    |
| 517    | 0     | 0      | 0      | 0      |
| 518    | 57    | 56     | 0      | 1      |
| 519    | 211   | 146    | 0      | 65     |
| 520    | 294   | 163    | 0      | 131    |

You can see that the durations of the unavailability windows correspond to the data provided in the first table.

For more information see the tests repository.