Distributed monitoring is an alternative to conventional, centralised monitoring, promising robustness and lower failure detection times. We will discuss the trade-offs.

Conventional server and website monitors are (semi) centralised, which can suffer from a single point of failure. When the monitoring system fails, you cannot make sure your website is still available. Distributed monitoring could solve this issue. When monitoring a distributed infrastructure, you could use this infrastructure to monitor itself. Each server runs a client that acts as a node in the distributed monitoring network and contributes in monitoring the other servers.

The protocol that the nodes follow to communicate determines the effectiveness and efficiency of the system. We will explain an efficient protocol, that is described by four researchers from the University of Florida in their paper GEMS (gossip-enabled monitoring service) and used by us to create a proof-of-concept distributed monitor.

Gossiping servers

The monitoring algorithm works by creating a network of nodes. Each server that is monitored represents a node in this network and also participates in monitoring the other nodes, by gossiping to them. Each node will periodically gossip its knowledge about the other nodes to a random node in the network. This so-called gossip message contains the node's gossip list, suspect list and suspect matrix, and will be sent at a fixed gossip interval. Because gossip messages are sent to a random other node, each node will (on average) receive one gossip message each round. Therefore, this approach scales linearly with the number of nodes.

The gossip list contains a heartbeat value for each node. This number indicates how many gossip rounds have passed since the last gossip message has been received. It is stored for each node and will be incremented after each gossip round. For the local node, this number is fixed at zero.

The suspect list reflects if the node suspects the other nodes. A node suspects another node when the heartbeat value for that node is greater than a specified threshold value. The threshold value is named the cleanup time and is typically 20 to 40 gossip rounds.

The suspect matrix is a two-dimensional Boolean matrix and contains for each node if it suspects the other nodes. For example, if cell (3, 4) is true in the suspect matrix, it implies Node 3 suspects Node 4 to have failed. Note that the suspect matrix also contains a row for the local node, which is equal to the suspect list.

This list summarises the concepts used by the algorithm:

Heartbeat value: indicates the number of gossip rounds passed since the last communication with that node. This value is defined zero for the node itself.

Gossip list: contains a heartbeat value for each node.

Gossip interval: time between two gossip messages sent by a node.

Cleanup time: threshold value of the heartbeat before a node will be suspected to have failed.

Suspect list: contains, for each node, if the node is suspected to have failed.

Suspect matrix: two-dimensional matrix where each row is the suspect list of a node.

Processing gossip messages

When a node receives a gossip message, it will update its own gossip list and suspect matrix if the received data is more recent. We will demonstrate this process using an example. In the example, Node 1 receives a gossip message from Node 2. The cleanup time in this example is defined at 20, meaning that Node 1 suspects nodes 3 and 4, and Node 2 just suspects Node 4. This figure shows the initial state of both nodes:

Update gossip list

The first step consists of comparing the gossip lists and replacing the heartbeat value if the received value is lower. Node 1 therefore updates its heartbeat values for nodes 2, 3 and 4. Note that, for these nodes, the received suspect matrix contains more recent suspect lists than Node 1 currently has in it's own suspect matrix. This insight is used when updating the suspect matrix in third step. The first step is illustrated in the following figure:

Update suspect list

Second, Node 1 updates its suspect list based on the new gossip list. Each heartbeat value is compared to the cleanup time, which is the threshold value for nodes to be suspected. The result is that Node 3 is no longer suspected, as the new heartbeat value (5) is lower than the cleanup time:

Update suspect matrix

Following that, Node 1 updates its suspect matrix. This step uses the insight that a lower heartbeat value indicates more recent communication. Further, each row in the suspect matrix corresponds to the suspect list of the respective node. This means that the first row in the suspect matrix is equal to the suspect list of Node 1. Also, each row in the received suspect matrix is more up to date if the heartbeat value for that node was lower and has been updated.

Because the heartbeat values of nodes 2, 3 and 4 were updated, the second, third and fourth rows are updated from the received suspect matrix. This is illustrated by the following figure:

Consensus check

After these steps, Node 1 will check for consensus about all suspected nodes. In this case, Node 1 checks the column of Node 4. Because all live nodes suspect Node 4, consensus is reached and the failure is detected. The following figure shows the final state of Node 1:

Additions and changes

The paper discussed earlier used the system in a high-performance computing environment. As this environment differs from our application environment, in which we use servers communicating over the Internet, some additions and changes were made to improve the results.

Alerting

When a node detects that consensus is reached about a failure, it broadcasts this across the network. However, this can happen multiple times for a single failure, as the consensus can be detected by multiple nodes simultaneously. To prevent multiple alerts when this happens, only the live node with the lowest identifier will send the consensus broadcast and alert. When this node fails, the next live node will take this role. The other nodes will know an alert has been sent when they receive the consensus broadcast.

Layering

Layering of nodes allows for improved scalability by grouping nodes together. Nodes within one group communicate using the gossip protocol as described. In addition to that, the groups of nodes form higher-level layers, which gossip less frequently with each other. Each round, one node from each group gossips to a node in another group. This way, the groups get liveness information about each other and failures of whole groups are detected.

We decided that layering is beyond the scope of this project. Implementing support for layering will increase the complexity of the software, while the advantage of scalability is not needed. The network size we are interested in can work within one group, making layering unnecessary.

Results

We made a proof-of-concept implementation and ran benchmarks with it. The results were compared with the current monitoring system. We found the failure detection time can be as low as 30 seconds, where our current monitor has a minimum detection time of 1 minute. However, the improved detection time did show to have the disadvantage of increased resource usage on the servers running the monitor.

Another disadvantage is the development effort required to implement and maintain the distributed monitoring software. Compared to an external, centralised monitoring solution, distributed monitors takes more effort to implement, deploy and maintain, because of the higher complexity of the system.

Finally, because the monitoring software is running on the server instead of externally, the resource usage of the server can be monitored as well. This can provide more information in case of a failure, and from monitoring the system health, possible failure causes can be detected early, helping to prevent downtime.

Summary

Using distributed monitoring as an alternative to centralised monitoring can be beneficial. We discussed an algorithm that can be used for distributed monitoring and showed the lower failure detection time trades off in increased resource usage and system complexity.

These differences, as well as the infrastructure that will be monitored, determine which solution is best. From April to July, I have been researching this subject, as part of my Master's degree in Software Engineering at the University of Amsterdam. My thesis should help making an informed decision when choosing for a monitoring solution. It can be found in the UvA digital library and contains more in depth information on this subject and the benchmarking measurement that lead to these conclusions.