Monitoring is an essential part of every production system. It helps us to understand performance characteristics and spot current or potential future problems. This is especially valuable when it comes to distributed systems, which are often a good deal trickier to keep an eye on.

Fortunately, once you know how to monitor one distributed Java application, you will have a good idea of how to monitor others.

So, in this post, I am going to look at one application I am particularly familiar with: CrateDB, a distributed SQL database. The lessons learned here should nonetheless be broadly applicable to any other distributed system written in Java, such as Spark, Elasticsearch, or HDFS.

In this post, I will use CrateDB to demonstrate the most important metrics when monitoring a distributed Java application, and explain why those metrics are important. I will also do a quick round-up of some of the tools you might want to consider using.

Important Metrics

There are four main areas that should be monitored:

Java Virtual Machine (JVM)

CPU utilization

Disk utilization

Network utilization

For bonus points you can also monitor CrateDB cluster integrity.

Let's take a look at each area.

Java Virtual Machine (JVM)

CrateDB is written in Java and runs in a Java Virtual Machine (JVM) that is provided by the Java Runtime Environment (JRE).

The three most critical metrics for any Java application are:

Heap usage

Garbage collection time

Thread count

Heap Usage

When running CrateDB in production, the heap size should be fixed in order to prevent the JVM from paging out to disk. Paging to disk (aka "swapping") is very slow and will have a significant impact on performance.

The JVM maintains a memory heap (i.e. system memory that is reserved by the JVM) and dynamically allocates this to the application as needed.

Later, when that memory is no longer needed by the application, the JVM garbage collector will free it up (i.e. put it back on the heap for future use).

The utilization of heap memory over time is an important indicator of the health of CrateDB. CrateDB exposes that information directly in the sys.nodes table, which we can query like so:

SELECT heap['probe_timestamp'] AS ts, heap['max'] AS heap_max, heap['used'] AS heap_used, 100.0 * heap['used'] / heap['max'] AS heap_percent FROM sys.nodes

When you draw a graph from this data, heap memory usage in a healthy CrateDB cluster should look like a sawtooth pattern. Heap usage should gradually increase as CrateDB requires more memory, and should drop suddenly when garbage collection happens.

Ideally, the normal operation of CrateDB should vary between 33% heap usage and 66% heap usage.

An ever-increasing line would indicate a memory leak and would eventually lead to an OutOfMemoryError. To avoid this scenario, it makes sense to set a trigger threshold that will alert you when heap usage gets too high, e.g. above 90% of available heap.
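As a minimal sketch of such a threshold check, the Python below computes heap usage per node and reports the nodes above 90%. It assumes you have already fetched rows of (node name, heap used, heap max) from sys.nodes with a client of your choice. The function names and the sample values are made up for illustration.

```python
def heap_percent(used_bytes, max_bytes):
    """Return heap usage as a percentage of the maximum heap size."""
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    return 100.0 * used_bytes / max_bytes


def nodes_over_threshold(rows, threshold=90.0):
    """Return (node_name, percent) for every node above the threshold.

    Each row is (node_name, heap_used, heap_max). How the rows are
    fetched is left open; this function only does the arithmetic.
    """
    alerts = []
    for name, used, maximum in rows:
        percent = heap_percent(used, maximum)
        if percent > threshold:
            alerts.append((name, round(percent, 1)))
    return alerts


# Hypothetical sample rows; the numbers are invented for illustration.
sample = [
    ("crate1", 3_000_000_000, 8_000_000_000),
    ("crate2", 7_600_000_000, 8_000_000_000),
]
print(nodes_over_threshold(sample))
```

In practice you would run a check like this on every sample and only alert when the value stays high for several samples in a row, since a single peak just before a garbage collection is part of the normal sawtooth.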

Garbage Collection Time

A healthy garbage collection should be running regularly, but not so often that it impacts performance. Garbage collection should also be quick, and the time it takes should not vary too much.

An increasing garbage collection time, due to constant high load, may be the first sign of an impending node failure, where the node becomes unresponsive and drops out of the cluster. Sometimes an overly large heap is the reason for high garbage collection times, because allocated memory can pile up and not be released in time.

At the moment, garbage collection times are not exposed directly by CrateDB, but slow garbage collection times are logged. A garbage collection log line in CrateDB looks like this:

[2018-02-19T14:52:30,798][INFO ][o.e.m.j.JvmGcMonitorService] [crate1] [gc][89] overhead, spent [...s] collecting in the last [...s]

However, if you want to monitor garbage collection times, you will need to use a third-party tool to do so.
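If you only have the log files to work with, a small parser can at least turn the "overhead" lines into numbers. The following Python is a rough sketch, not an official format specification: it assumes the two bracketed values are durations with a unit of ms, s, or m, and the sample line uses invented durations because the line shown above elides them.

```python
import re

# Matches the two bracketed durations in a "[gc][N] overhead" log line.
# The accepted units (ms, s, m) are an assumption.
GC_OVERHEAD = re.compile(
    r"\[gc\]\[\d+\] overhead, spent \[(?P<spent>[\d.]+)(?P<spent_unit>ms|s|m)\] "
    r"collecting in the last \[(?P<window>[\d.]+)(?P<window_unit>ms|s|m)\]"
)

UNIT_TO_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0}


def gc_overhead(line):
    """Return (seconds_spent, seconds_window, fraction), or None if no match."""
    match = GC_OVERHEAD.search(line)
    if match is None:
        return None
    spent = float(match["spent"]) * UNIT_TO_SECONDS[match["spent_unit"]]
    window = float(match["window"]) * UNIT_TO_SECONDS[match["window_unit"]]
    fraction = spent / window if window > 0 else None
    return spent, window, fraction


# Hypothetical log line: the durations are made up for illustration.
example = (
    "[2018-02-19T14:52:30,798][INFO ][o.e.m.j.JvmGcMonitorService] "
    "[crate1] [gc][89] overhead, spent [750ms] collecting in the last [1s]"
)
print(gc_overhead(example))
```

The fraction of the window spent collecting is the number worth graphing or alerting on, since it is comparable across lines regardless of the window length.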

Thread Count

CrateDB uses multiple thread pools of different sizes for specific tasks, such as indexing or search. When it starts, CrateDB creates fixed-size thread pools (though you can change this to dynamic), sized mostly according to the number of available CPU cores.

If the thread pools are constantly full, it may indicate that CrateDB is overloaded and that there is too much "pressure" on the CPU.

CrateDB exposes statistics about its own thread pools via the sys.nodes table. You can query this table to get the number of currently running and queued threads for each pool on each node, like so:

SELECT thread_pools['active'] AS active_threads, thread_pools['queue'] AS queued_threads, thread_pools['threads'] AS max_threads FROM sys.nodes
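The query above returns one array per column for each node, with one entry per thread pool. Here is a small Python sketch of how you might spot saturated pools in such a row. It assumes you also select the pool names alongside the other columns. The pool names and numbers in the sample are hypothetical.

```python
def saturated_pools(names, active, queued, max_threads):
    """Return the names of pools whose threads are all busy and that
    also have work waiting in the queue.

    The four arguments are parallel lists, one entry per thread pool,
    as returned for a single node.
    """
    busy = []
    for name, a, q, m in zip(names, active, queued, max_threads):
        if m > 0 and a >= m and q > 0:
            busy.append(name)
    return busy


# Hypothetical values for one node; names and numbers are invented.
names = ["search", "write", "generic"]
active = [4, 2, 0]
queued = [12, 0, 0]
max_threads = [4, 4, 8]
print(saturated_pools(names, active, queued, max_threads))
```

A pool that shows up in this list once is not a problem. A pool that shows up in every sample is the "constantly full" situation described above.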

CPU Utilization

Some tasks in CrateDB are memory intensive, whereas others are more CPU intensive.

The most CPU intensive tasks are: table indexing and shard recovery at startup. Additionally, handling a large number of client connections can be CPU intensive.

There are three aspects of CPU utilization:

Operating system CPU usage

Process CPU usage

Load

Operating System CPU Usage

When CrateDB is the only computation intensive application running on the host, overall operating system CPU usage gives you a decent indication of how the CPU cores are being utilized.

CrateDB exposes this metric via the sys.nodes table, which you can query, like so:

SELECT os['cpu']['used'] FROM sys.nodes

If there are also other CPU intensive services running (e.g. a client application hosted on the same machine), this metric will be less useful.

Process CPU Usage

If you want to monitor how much CPU CrateDB is using, as distinct from the overall operating system CPU usage, this metric is also exposed via the sys.nodes table, which you can query, like so:

SELECT process['cpu']['percent'] FROM sys.nodes

Load

On Linux systems, the system load is an indication of how many processes are waiting for resources like CPU or disk. This is a good high-level metric. However, as with operating system CPU usage, this will become less useful for monitoring CrateDB when other resource intensive services are running on the same host.
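Load is easier to interpret when it is normalized by the number of CPU cores. The sketch below uses Python's standard os.getloadavg() and os.cpu_count(). The helper names are my own, and treating a value above 1.0 per core as a sign of contention is a rule of thumb rather than a hard limit.

```python
import os


def load_per_core(load_average, cores):
    """Normalize a load average by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_average / cores


def current_load_per_core():
    """Read the 1-minute load average of this host (Unix only)."""
    one_minute, _five, _fifteen = os.getloadavg()
    return load_per_core(one_minute, os.cpu_count() or 1)


# Example: a load of 8.0 on a 4-core machine means roughly two
# runnable or waiting processes per core.
print(load_per_core(8.0, 4))
```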

Disk Utilization

Disk utilization has two components:

Disk usage

Disk input/output

Disk Usage

The disk (or disks) on which CrateDB stores its data (defined by the path.data setting) need to be monitored if you want to make sure you never run out of disk space.

Additionally there are two thresholds in CrateDB, known as the low and high disk watermarks:

The low disk watermark (configured by cluster.routing.allocation.disk.watermark.low) is the threshold at which CrateDB will stop allocating any more replica shards on that node.

The high disk watermark (configured by cluster.routing.allocation.disk.watermark.high) is the threshold at which CrateDB will try to actively relocate replica shards away from that node.

If you are monitoring disk usage, it makes sense to set up some sort of alerting that takes these values into consideration.
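One way to do that is to compare the current disk usage against both watermarks. The following Python sketch uses shutil.disk_usage from the standard library. The 85% and 90% defaults are placeholder assumptions, so use the values your cluster is actually configured with, and the path you pass in should be whatever path.data points to.

```python
import shutil


def used_percent(total_bytes, used_bytes):
    """Return disk usage as a percentage of total capacity."""
    return 100.0 * used_bytes / total_bytes


def watermark_status(percent_used, low=85.0, high=90.0):
    """Classify disk usage against the low and high watermarks.

    The default thresholds are assumptions for illustration only.
    """
    if percent_used >= high:
        return "high"
    if percent_used >= low:
        return "low"
    return "ok"


def check_path(path, low=85.0, high=90.0):
    """Check the file system that contains the given path."""
    usage = shutil.disk_usage(path)
    return watermark_status(used_percent(usage.total, usage.used), low, high)


print(watermark_status(87.0))
```

It usually makes sense to alert somewhat below the low watermark, so that you have time to react before CrateDB changes how it allocates shards.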

Disk Input/Output

Disk input/output, or disk I/O, is how often and how much data is being written to or read from disk.

Disk I/O is often a performance bottleneck for CrateDB clusters, and monitoring it can verify whether this is the case.

Additionally, extremely high amounts of disk reads, in combination with slow queries, may indicate that CrateDB does not have enough memory, and so disk reads are not being cached often enough.

CrateDB exposes two disk I/O statistics (bytes read and bytes written) in the sys.nodes table, but we recommend that you use a third-party tool for a more complete picture of disk I/O health and so that you can continue to collect metrics even if CrateDB becomes non-responsive. Prometheus is one good option for this.

The sys.nodes table can be queried like so:

SELECT fs['total']['bytes_read'] / 1024.0 / 1024.0 AS read_mb, fs['total']['bytes_written'] / 1024.0 / 1024.0 AS write_mb FROM sys.nodes

While these figures represent running totals, sampling them regularly allows you to calculate time period read/write values.
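Turning running totals into rates is a matter of dividing the difference between two samples by the time between them. Here is a minimal Python sketch. It assumes each sample is a (timestamp in seconds, total bytes) pair, and it treats a decreasing total as a counter reset (for example after a node restart) and skips that interval. Both of these are my assumptions, not documented behavior.

```python
def throughput(samples):
    """Convert running totals into per-interval rates.

    samples: list of (timestamp_seconds, total_bytes), sorted by time.
    Returns a list of (timestamp, bytes_per_second), one entry per
    usable interval.
    """
    rates = []
    for (t0, b0), (t1, b1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0 or b1 < b0:
            # Skip intervals with a bad timestamp or a counter reset.
            continue
        rates.append((t1, (b1 - b0) / elapsed))
    return rates


# Hypothetical samples; the last one simulates a counter reset.
samples = [(0, 1000), (10, 6000), (20, 6000), (30, 500)]
print(throughput(samples))
```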

One additional disk-related metric you might want to consider is the number of open file descriptors. Most operating systems impose an upper limit on this number, so it may be a good idea to monitor this.

Network Utilization

Network utilization metrics are not as important as the previously mentioned metrics, but they can help you debug some problems. This is especially true when you are running CrateDB on hardware that you do not control (e.g. in cloud environments), where network performance is also outside your control.

With a distributed system like CrateDB, operations that require the involvement of multiple nodes are limited by the slowest network connection between those nodes. The more network latency or packet loss in a cluster, the slower the cluster. In fact, in some poor network performance situations, you may find that nodes are dropping out of the cluster because they are not able to respond quickly enough.

Monitoring the amount of data that is sent and received on each node as well as the number of sent, received, and retransmitted packets, will help you understand how stable your network performance is.
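As with disk I/O, packet counters are usually running totals, so the interesting number is the change between two samples. The Python sketch below computes the share of retransmitted packets for one interval. The dictionary keys are hypothetical and depend entirely on the tool you use to collect the counters.

```python
def retransmit_ratio(previous, current):
    """Return the fraction of sent packets that were retransmitted
    between two samples, or None if nothing was sent.

    Both arguments are dicts with the (hypothetical) keys "sent" and
    "retransmitted", holding running totals.
    """
    sent = current["sent"] - previous["sent"]
    retransmitted = current["retransmitted"] - previous["retransmitted"]
    if sent <= 0:
        return None
    return retransmitted / sent


# Hypothetical counter samples taken some time apart.
before = {"sent": 1000, "retransmitted": 10}
after = {"sent": 3000, "retransmitted": 50}
print(retransmit_ratio(before, after))
```

A ratio that is stable over time is less of a concern than one that suddenly jumps, which is why graphing it per node tends to be more useful than a fixed alert threshold.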

Another network-related metric to consider is the number of open connections. Again, most operating systems impose an upper limit on this number, so it may be a good idea to monitor this.

Since CrateDB does not expose any of these metrics, network statistics need to be gathered with a third-party tool.

Cluster Integrity

So far we have only looked at external metrics, i.e. metrics that tell you about the host machine that CrateDB is running on.

External metrics will give you a decent picture of how healthy your cluster is. However, there is still the possibility that CrateDB may experience internal issues.

The two most important internal metrics for CrateDB are data health and cluster health. Both of these can be found in the status bar at the top of the screen in the CrateDB administration UI.

There are three possible statuses for data health:

Green: All data is replicated and available

Yellow: There are unreplicated records

Red: Some data is unavailable

And three possible statuses for cluster health:

Green: Good configuration

Yellow: Some configuration warnings

Red: Some configuration errors

At the time of writing, CrateDB unfortunately does not yet provide a way to get either value via SQL. However, if the (unsupported) Elasticsearch API is enabled, you can query the cluster health API via HTTP.

We plan to expose this information via SQL in a future release.

Important Tools

There are many third-party monitoring tools. Some of them are hosted solutions, and others are run on-premises.

Hosted Solutions

Hosted solutions often provide a proprietary collection daemon that can collect a wide range of host metrics as well as Java application metrics via the JMX interface. These metrics are then sent to and stored at the service provider, who also provides web-based dashboards to analyze the data.

In no particular order, some options you might want to consider are:

On-premises Monitoring

There are plenty of open source monitoring tools, if you'd prefer to run the monitoring software yourself.

In no particular order, some examples are:

Wrap Up

Monitoring and alerting are a vital part of any production deployment of a distributed Java application, of which CrateDB is just one example.

There are three important principles:

While the application itself might expose metrics, it is always better to gather those metrics from the host system directly

Fewer metrics are better than a lot of metrics, if you choose them well

How you monitor is less important than what you monitor

In future posts, we will take a closer look at setting up monitoring and alerting using individual third-party tools.