Authors:

Alexey Vanin, alexey@nspcc.ru

Sergei Liubich, sergei@nspcc.ru

This article describes approaches to constructing a distributed message log in a decentralized system. The system is still under heavy development, so more details and numerical metrics will be provided later.

Distributed Log Usage

Communication is an important component of any distributed system. While designing and developing our storage system, we researched how the nodes of a distributed system can log certain events. An external entity requires a consistent log of all these events, which allows it to evaluate the state of the entire system.

One approach is for this external entity to send separate requests to each node in the system and arrange the responses into an ordered log of events.

Another approach is for the nodes to communicate with each other and maintain a consistent log of events themselves. The external entity then needs to send a request to only one node in the system.
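The first approach can be sketched as a fan-out query followed by a merge. This is a minimal illustration with hypothetical types: `Event` and its logical `Seq` timestamp are assumptions, not part of our actual protocol.

```go
package main

import (
	"fmt"
	"sort"
)

// Event is a hypothetical log record reported by a node.
type Event struct {
	Node string
	Seq  uint64 // logical timestamp assigned by the node
	Data string
}

// pollAll models the first approach: the external entity queries every
// node separately and merges the responses into a single ordered log.
func pollAll(nodes map[string][]Event) []Event {
	var log []Event
	for _, events := range nodes {
		log = append(log, events...)
	}
	// Order the merged log by logical timestamp, breaking ties by node ID.
	sort.Slice(log, func(i, j int) bool {
		if log[i].Seq != log[j].Seq {
			return log[i].Seq < log[j].Seq
		}
		return log[i].Node < log[j].Node
	})
	return log
}

func main() {
	nodes := map[string][]Event{
		"n1": {{"n1", 1, "put"}, {"n1", 3, "del"}},
		"n2": {{"n2", 2, "get"}},
	}
	for _, e := range pollAll(nodes) {
		fmt.Println(e.Seq, e.Node, e.Data)
	}
}
```

Note that every call to `pollAll` touches every node, which is exactly why this approach stops scaling as the number of nodes and the query frequency grow.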

Figure 1. Different approaches for obtaining a distributed log

The first approach is simple to implement. However, its effectiveness decreases as the number of polled nodes grows, and a high frequency of consistent log queries also hurts performance. Since it does not scale well enough, we decided to look closer at the second approach to obtaining a consistent event log.

From Raft to Reliable Broadcast

Raft log

Maintaining a consistent event log is an essential part of consensus algorithms. In the early stages of the storage system design, we chose Raft as the consensus algorithm: it is much simpler than Paxos and has a consistent event log at its core.

Figure 2. Replicated state machine architecture in the Raft protocol [1]

However, during development, the initial problem statement began to change. As the system grows, issues and tasks are distributed over the nodes. Therefore, the nodes of the system are divided into subsets (groups), each responsible for its own small range of tasks. Now external entities require not the state of the whole system but only the state of certain subsets. Each node may host one or more groups.

Figure 3. Multiple log groups in the nodes

Multi Raft

There are many implementations of the Raft protocol, and some of them add new features to the original. One such feature is support for Multi Raft groups [2]. Groups independently run the entire protocol: they elect a leader, send heartbeats, and so on. Multi Raft supports heartbeat coalescing, which means that a heartbeat is sent only once between two nodes even if they share multiple groups. We took CockroachDB's implementation of the Multi Raft protocol as a reference and evaluated the capabilities of this solution for our storage system.
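Heartbeat coalescing can be illustrated with a small sketch. This is not CockroachDB's code; the `pair` and `coalesce` names are our illustration of the idea that one heartbeat message per peer link carries all shared group IDs.

```go
package main

import "fmt"

// pair identifies a directed link between two nodes.
type pair struct{ from, to string }

// coalesce models Multi Raft heartbeat coalescing: instead of one
// heartbeat per group, a node sends a single heartbeat per peer and
// piggybacks all shared group IDs on it.
func coalesce(leader string, groups map[string][]string) map[pair][]string {
	beats := make(map[pair][]string) // peer link -> group IDs in one heartbeat
	for id, members := range groups {
		for _, m := range members {
			if m == leader {
				continue
			}
			p := pair{leader, m}
			beats[p] = append(beats[p], id)
		}
	}
	return beats
}

func main() {
	// Node n1 leads three groups; n2 is a member of all of them,
	// yet n1 sends n2 only a single heartbeat message.
	beats := coalesce("n1", map[string][]string{
		"g1": {"n1", "n2"},
		"g2": {"n1", "n2"},
		"g3": {"n1", "n2", "n3"},
	})
	fmt.Println(len(beats)) // prints 2: one heartbeat per peer, not per group
}
```

Without coalescing, the example above would require four heartbeats (three to n2, one to n3) instead of two.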

Testing showed that this approach works remarkably well with a small number of simultaneously working groups (from 1 000 to 10 000). However, the design of our solution involves a much larger number of groups (up to 1 000 000).

Reliable Broadcast

In search of a lightweight solution, we turned our attention to the Reliable Broadcast protocol. In particular, it is used in the Honey Badger BFT protocol implementation, which we use for consensus tasks. Although the protocol describes only a message delivery method, we can build a consistent log on top of it: messages are accompanied by Lamport timestamps [3] in our solution, so we can restore their order.
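The Lamport clock rules [3] that make this ordering possible are simple enough to sketch. This is a generic illustration of the textbook rules, not our production code: a clock ticks on every local event or send, and on receive it jumps past the received timestamp.

```go
package main

import "fmt"

// Clock is a Lamport logical clock [3].
type Clock struct{ t uint64 }

// Tick advances the clock for a local event or before sending a message.
func (c *Clock) Tick() uint64 {
	c.t++
	return c.t
}

// Observe merges a timestamp received with a message: the local clock
// jumps past it, so causally later events always carry larger timestamps.
func (c *Clock) Observe(remote uint64) uint64 {
	if remote > c.t {
		c.t = remote
	}
	c.t++
	return c.t
}

func main() {
	var a, b Clock
	t1 := a.Tick()      // node a broadcasts at timestamp 1
	t2 := b.Observe(t1) // node b receives it: max(0, 1) + 1 = 2
	t3 := b.Tick()      // node b broadcasts next at timestamp 3
	fmt.Println(t1, t2, t3) // prints 1 2 3
}
```

Sorting delivered messages by (timestamp, sender ID) then yields a total order consistent with causality, which is what the log needs.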

Figure 4. Reliable broadcast messages

Multi Reliable Broadcast

The implementation of Reliable Broadcast that we use is presented as a finite state machine, which produces three types of messages (Fig. 4). One finite state machine handles a single broadcast group. To maintain several broadcast groups, we extended the implementation with support for multiple groups in one broadcast instance. We call the proposed protocol Multi Reliable Broadcast.
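The multiplexing idea can be sketched as follows. This is an illustrative skeleton, not our actual implementation: `fsm` stands in for the real per-group Reliable Broadcast state machine, and the message type names follow the Honey Badger BFT convention.

```go
package main

import "fmt"

// MsgType enumerates the three kinds of messages a Reliable Broadcast
// state machine produces (value, echo, ready in Honey Badger BFT terms).
type MsgType int

const (
	Val MsgType = iota
	Echo
	Ready
)

// Msg is a broadcast message extended with a GroupID so that one
// instance can serve many groups (field names are illustrative).
type Msg struct {
	GroupID uint64
	Type    MsgType
	Payload []byte
}

// fsm is a stand-in for the per-group Reliable Broadcast state machine.
type fsm struct {
	groupID uint64
	handled int
}

func (f *fsm) Handle(m Msg) { f.handled++ }

// Mux captures the core idea of Multi Reliable Broadcast: one instance
// lazily creates a state machine per group and routes incoming messages
// by group ID, so a single thread can serve many groups.
type Mux struct {
	groups map[uint64]*fsm
}

func NewMux() *Mux { return &Mux{groups: make(map[uint64]*fsm)} }

func (m *Mux) Dispatch(msg Msg) {
	f, ok := m.groups[msg.GroupID]
	if !ok {
		f = &fsm{groupID: msg.GroupID}
		m.groups[msg.GroupID] = f
	}
	f.Handle(msg)
}

func main() {
	mux := NewMux()
	for _, msg := range []Msg{
		{GroupID: 1, Type: Val},
		{GroupID: 1, Type: Echo},
		{GroupID: 2, Type: Val},
	} {
		mux.Dispatch(msg)
	}
	fmt.Println(len(mux.groups)) // prints 2: one state machine per active group
}
```

Because idle groups cost only a map entry rather than a goroutine or a heartbeat timer, this layout is what lets one instance hold a very large number of groups.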

The final implementation is still in progress. First tests have shown that Multi Reliable Broadcast can handle up to 1 000 000 concurrent groups in one instance. It also gives better control over CPU (one thread may handle messages from several groups) and memory: for 10 000 groups, Multi Reliable Broadcast allocated 130 MiB, while our implementation of Multi Raft allocated around 600 MiB.

References

[1] D. Ongaro, J. Ousterhout. In Search of an Understandable Consensus Algorithm.

[2] S. Turukin. What is a Multiraft?

[3] L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System.