Some months ago I was reading the enlightening Diego Ongaro paper on the Raft consensus algorithm and thought that it would be great to have an option like that in the Red Hat open source portfolio. In the past I messed with Paxos many times and decided it was really too complex to implement correctly, at least for me. To tell the truth, we already support Apache ZooKeeper, which is currently part of JBoss Fuse, and JGroups, which is the basis of everything else clusterable in the JBoss family, from the WildFly application Server itself to Infinispan, which both solve similar problems.

ZooKeeper is a very mature project implementing a consensus algorithm, (which has been published in some detail), and is also cited in the Raft paper. JGroups is arguably even more mature, dating back to 2002, and provides many low level messaging features, but doesn’t implement a distributed consensus algorithm. JGroups is more of a toolset for writing these kinds of algorithms rather than just an implementation of any one of those.

Related Sponsored Content Demystifying Microservices for Jakarta EE & Java EE Developers

Chatting with Bela Ban, JGroups benevolent dictator for life, we decided that it would be interesting to implement the Raft algorithm on top of JGroups, both because JGroups already has many features that could be useful to a robust Raft implementation, and on the other side because a consistent consensus based algorithm could be really a nice addition in many different use cases. Seeing that Bela implemented 99% of the code in just a few days, I guess it was a valid choice!

Wait, what’s Raft?

Diego Ongaro and John Ousterhout described Raft in detail in their paper “In Search of an Understandable Consensus Algorithm (Extended Version). In their own words, “it is a consensus algorithm that is designed to be easy to understand and equivalent to Paxos in fault-tolerance and performance”. The goal of Raft is to make consensus available to a wider audience, and that this wider audience will be able to develop a variety of higher quality consensus-based systems than are available today. I’m part of the wider audience, so that’s definitely a great start!

Another very important fact about Raft is that it’s been proven through the use of formal methods, specifically with the help of TLA+: you won’t find hidden corner cases in the algorithm, though obviously there could be bugs in implementations!

Consensus is a well known problem in distributed computing. It basically consists of strategies for making different systems agree on something. When you have different (networked) systems, a simple command like “SET MEANING_OF_LIFE to 42” must be agreed on to make sure that the state can be highly available. The consensus approach is to have an agreement of the majority of nodes, and the Raft algorithm relies on a leader election and a per-node persistent log to achieve that.

A leader is elected by consensus and all changes happen through it, which then replicates them to all nodes and adds them to their persistent log. Raft guarantees that there’s at most one leader at any time, so all state machines receive the same ordered stream of updates and thus have the exact same state.

Raft favors consistency over availability: in terms of the Cap theorem, jgroups-raft is a C-P system, meaning that if it can’t get a majority of nodes agreeing, it won’t be available but it will maintain its consistency. If for example we have a cluster of 5 nodes, 3 is the majority, so it will be possible to read/write on the system even with 2 nodes failures. With more than 2 failures it’s impossible to get a majority so the system won’t be available (though it’s possible to have some read-only features in this case).

In summary, at a very high level Raft consists of a leader election, (which requires a majority), as well as nodes being coordinated by the leader, each having one persistent log detailing what they are doing.

An excellent graphic explanation of how Raft algorithm works in detail is available here.

Why jgroups-raft

There already were many Raft implementations available in many different languages, but there were many reasons to implement Raft in JGroups too: JGroups has many ready-to-use building blocks which have been battle tested in fifteen years of wide usage and the code required for a robust implementation is much smaller than starting everything from scratch and therefore less buggy. Besides that, JGroups can be extremely useful to expand the basic Raft features too, for example adding reliability over UDP, multicasting features for large clusters and smarter cluster membership changes ("view changes" in JGroups parlance, we will see an example later)

Having a reliable consensus algorithm “inside” JGroups is also important for JGroups itself, because it means that Wildfly and/or Infinispan will be able in the future to offer different clustering features out of the box, for example a cluster-aware singleton that is capable of maintaining its unicity even in the presence of a network partition (providing it can still reach consensus): a “normal” Wildfly cluster-aware singleton in fact would break its unicity in case of a Network partition, ending up with a singleton per partition (a “partitionton”?).

Testing jgroups-raft

To test jgroups-raft, we will download it from GIT (see below) and use the included CounterServiceDemo. This is a distributed counter that is simple enough to analyze at the Raft level. You will need Java 8 installed on your preferred OS, though this tutorial has been tested only on OSX and Linux.

For this example I will assume you’re testing jgroups-raft on your laptop, using a loopback interface.

To start open a terminal and clone the git repo somewhere on your disk:

> git clone git@github.com:belaban/jgroups-raft.git > export JGROUPS_RAFT_HOME=jgroups-raft > cd jgroups-raft

we will reference this directory as $JGROUPS_RAFT_HOME in the rest of this tutorial.

> ./bin/counter.sh -name A

You should see some output from JGroups, similar to:

------------------------------------------------------------------- GMS: address=A, raft-id=A, cluster=cntrs, physical address=127.0.0.1:62162 ------------------------------------------------------------------- -- view: [A, raft-id=A|0] (1) [A, raft-id=A] [1] Increment [2] Decrement [3] Compare and set [4] Dump log [8] Snapshot [9] Increment N times [x] Exit first-appended=0, last-appended=4, commit-index=4, log size=40b -- changed role to Candidate

the last line means that this node is now a candidate to become leader, but it’s still not a leader. In Raft, at any given time each node can be in exactly one of three states: leader, follower, or candidate. Being a candidate means that the node is trying to become a leader, while being a follower means that the node will just follow requests from leaders and candidates. The default raft.xml (which you can find in the $JGROUPS_RAFT_HOME/conf) is configured with three Raft nodes, so the majority is 2, meaning that to finalize an election we need to add a second node.

Open a second terminal:

> ./bin/counter.sh -name B

Now that we started a second node, one of the two nodes will be able to become the leader, and the other one will become a follower. In Raft the leader election is based on a heartbeat mechanism: if the followers don’t receive the heartbeat in a period called the election timeout, they’ll assume there is no leader and they’ll begin an election. The election consists roughly of becoming a candidate (everyone start as a follower) voting for itself and broadcasting a request for vote to the other nodes in the cluster to get a majority. Election timeouts are randomized and thus slightly different for each node, so split votes are rare and quickly resolved by the algorithm.

You should now see the voting happening between nodes and see an output similar to the following in the two consoles:

-- changed role to Follower -- changed role to Leader

If this doesn’t happen, it’s because the second Raft instance can’t see the other one at the networking level: the default JGroups configuration is UDP based, so it probably means that your UDP network configuration is not correctly configured for the loopback interface, which typically happens on BSD based Operating Systems like OSX: you will see in the output that the second Raft instance creates a new JGroups view, containing just itself.

-- view: [B, raft-id=B|0] (1) [B, raft-id=B]

If this is the case, to fix this misconfiguration, add this line in the routing table to make UDP work on the loopback interface:

> sudo route add -net 224.0.0.0/5 127.0.0.1

Start again the two Raft instances and you should finally see the election starting, the two nodes changing role and the JGroups view correctly containing both nodes:

-- view: [A, raft-id=A|1] (2) [A, raft-id=A, B, raft-id=B]

If everything is fine, open a third terminal to start the third Raft node:

> ./bin/counter.sh -name C

Now you will have three different Raft nodes and you can start to play with the example.

If it’s still not working, it could be that for some reason you are binding to the wrong network interface and UDP traffic is dropped. The advice is to have a look at the troubleshooting section of JGroups Manual. Or, if you are in a hurry, just contact us on IRC (#jgroups-raft) or Google group.

The distributed counter has a very simple command line interface that you can use to increment/decrement the distributed counter in different ways and see the information contained in the persistent log of the node.

[1] Increment [2] Decrement [3] Compare and set [4] Dump log [8] Snapshot [9] Increment N times [x] Exit first-appended=0, last-appended=7, commit-index=7, log size=70b

You can for example modify the counter in the different nodes consoles and see that each node has a consistent view of the counter.

It’s interesting now to look at what happens if you kill a single node: you will still have the majority so the system will continue working, but if you kill two nodes the system will become unavailable: the remaining node will vote for itself but it won’t be able to get the remaining vote needed to reach the majority. Remember that Raft is a C-P system and sacrifices availability if it detects that it won’t be able to guarantee consistency, which is the case when there is just one node alive in a cluster of three. Restarting one of the two dead nodes will make the overall system work again.

You’ll see last-appended and commit-index values grow together: to understand what they are, you’ll need more information on how the Raft persistent log is structured and how exactly Raft manages the availability of the cluster when it can’t reach consensus.

Note that when you kill and restart a node, its initial state is initialized from its persistent log.

What’s in the Persistent log

In Raft, once the leader is up and running, it begins answering all client requests, that contain a command to be executed by the replicated state machine, which in this case is just an action (inc/dec) on our distributed counter. The leader appends the command to its own log and then asks the other cluster members to do the same: the leader returns the control to the client only if it’s been able to replicate the command to a majority of the followers.

If you look at the persistent log you’ll get an output similar to the following:

index (term): command --------------------- 1 (200): incrementAndGet(counter) 2 (200): incrementAndGet(counter) 3 (200): incrementAndGet(counter) 4 (200): incrementAndGet(counter) 5 (200): decrementAndGet(counter) 6 (200): incrementAndGet(counter) 7 (200): compareAndSet(counter, 4, 10)

The log contains the operations that have been applied to the replicated state machine. The term is an important concept in Raft, because it’s basically its unit of time. The term begins with the election, and when a candidate wins it it will become the leader for the rest of that term (potentially, forever: the term has an arbitrary length). Raft ensures that there is at most one leader in a given term: in this example we are at term 200, and to see this value grow you just have to make Raft start a new election.

You can see the last-appended and commit-index values in the log: the first one is the index of the last command appended to the persistent log, while the latter is the last committed index: they are usually the same, but if for example on the leader you have last-appended at 100 and commit-index at 90, that means that the leader hasn’t been able to replicate 10 commands on a majority, so these commands are not yet applied to the state machine.

To show this behaviour in action, let’s kill the C node and have a cluster with just two of the three nodes alive, A and B. Let’s say for example that A is the leader: if you kill it, B - the follower - won’t be able to become a leader (no majority) and won’t accept any command: if you try to do something on the counter, you’ll get a Java Exception saying that the leader is no longer part of the cluster view, so the command can’t be applied. But if instead of killing the leader A you kill the follower B, you’ll end up with a view containing just the leader.

In this case, if you try to apply commands to it, you will see a slightly different behaviour; you could expect the leader to step down and become a candidate, but in fact you’ll just get a timeout on your client (a ConcurrentTimeOutException to be precise). The command will then be appended to the persistent log of the leader without being committed (since there isn’t anyone answering to the poor leader), so you’ll see the last-appended and commit-index values of the leader diverge. If you then restart the B node, you’ll eventually see the commit-index catching up.

This is the default behaviour of Raft and it’s just fine, but with the help of JGroups we could just pick up the change in the cluster view, observe that the size of the view is less than the required majority and make the leader step down: this is not implemented at the moment but will be a configurable option in the future. This is one example of the possible Raft improvements that can be easily applied to the basic algorithm thanks to the power of JGroups.

Jgroups-raft also implements a snapshotting system as suggested in the Raft paper: the leader can take snapshots of its persistent log, so it can transfer a snapshot to a member that might be out of sync for example , after being down for a considerable amount of time and then started again. Note that this is just an optimization, because even without snapshots the algorithm is able to eventually catch up.

jgroups-raft roadmap

The implementation of jgroups-raft is currently at 0.2 release, so it’s not really ready for prime time, but is almost feature complete. Its codebase is currently living “outside” of the main JGroups project because it’s evolving much more quickly, but sometime in the future it will be merged back. Nonetheless it’s definitely ready to be tested, and any feedback is - as usual in open source projects - more than welcome. Please submit feedback in the IRC Group #jgroups-raft or Google group.

Conclusions

In this first part of the tutorial we just scratched the surface of Raft and had a look at jgroups-raft implementation with the provided DistributedCounter example. In the second part of this tutorial, we will (finally) write some Java code, looking at what is needed to implement your own jgroups-raft enabled partition-proof cluster.

About the Author

Ugo Landini takes the sentence “The only difference between men and boys is the size of their shoes and the cost of their toys” very seriously. He works as a software architect at Red Hat. He dedicates the rest of his time to what’s new in the IT field and is strongly convinced that sharing knowledge is not only a must but also an opportunity for personal growth. The cofounder of the JUG Roma, Ugo is an Apache committer, develops games for mobile devices and is convinced he can still play a decent football game (soccer for American people). He is also the cofounder and chair of technical committee at Codemotion.