Lessons learned about monitoring the JVM in the era of containers

Quick and easy things you can do to monitor JVM when it is being used in containers

Java Virtual Machine (JVM) has been around for a long time (1994) so it’s hardly a surprise there are hundreds of tools to monitor its health and performance. We could use profiler tools (like VisualVM) to track memory leaks, CPU usage, and memory consumption but they require a direct connection to the JVM: this makes it hard to execute in a containerized production environment with autoscaling (like ours).

Other tools (like Universal GC Log Analyzer) expect to use GC log analysis to auto-detect GC problems and recommend solutions to it, but it requires to enable GC verbose logs and this might have a cost and performance impact in a production environment.

So we faced the challenge of finding a simple way to monitor JVM instances and detect problems in the regular use of our applications or during special occasions such as peak load events. We also wanted to group metrics from multiple container instances of the same type to get an overview and also to be able to compare different types at a glance.

In a previous article, we described how we implemented our internal monitoring. This is a good starting point to use some JVM collectors provided to us by Prometheus.

GC compromise (latency vs throughput)

Latency represents the time that the user of the application waits for a response. In the specific case of GC tuning, this is the time that the JVM is paused because of garbage collection and cannot run the application. The duration of the garbage collection pause directly affects the way your application is perceived by end-users, so decreasing it results in a greater perception of responsiveness.

Throughput is the amount of time spent in productive work. Productive work is the time spent on processing customer transactions, non-productive work is the time spent on managing the application (es garbage collection). If we have evidence of low throughput, it means that the application is not optimally using resources.

The JVM garbage collectors can be configured to preferentially meet one between latency or throughput. If the preferred goal is met, the collectors will try to maximize the other.

Our containers run on Oracle JRE 8 where the default collector is Parallel GC. This GC is designed to be used by applications that need to do a lot of work (high throughput) and where long pause times are acceptable. This can cause latency, affecting the responsiveness of the application, not ideal in our case where our containers are API services so we switched to G1 Collector.

Garbage Collector G1 defaults have been balanced differently than either of the other collectors. G1’s goals in the default configuration are neither maximum throughput nor lowest latency, but to provide relatively small, uniform pauses at high throughput.

I think you should use G1 if you have no special needs, then tune to optimize performances. At the time of writing, we exclude new experimental GCs like Shenandoah GC in the production environment or other JVM like OpenJ9 (with Generational Concurrent policy) because we found no significant differences in our applications.

How to tune Latency and Throughput

To reduce latency, maximum pause time duration can be specified as a hint with the command-line option -XX:MaxGCPauseMillis=nnn (G1 default value is 200). In this case, the heap size and other parameters related to garbage collection are adjusted in an attempt to keep garbage collection pauses shorter than the specified value. These adjustments may cause the garbage collector to reduce the overall throughput of the application, and the desired pause time goal cannot always be met.

To improve throughput, the ratio of the time spent outside the garbage collection vs the time spent in the garbage collection can be specified with the command-line option -XX:GCTimeRatio=nnn (G1 default value is 19). The ratio of garbage collection time to application time is 1/ (1+nnn). The default value sets a goal of 5% of the total time for GC and throughput goal of 95%.

If the throughput goal isn’t being met, then one possible action for the garbage collector is to increase the size of the heap so that the time spent in the application between collection pauses can be longer. In a containerized application where only one java app runs, we need to adjust heap size and consequently the container memory in proportion 1/4 (https://docs.oracle.com/javase/9/gctuning/parallel-collector1.htm#JSGCT-GUID-74BE3BC9-C7ED-4AF8-A202-793255C864C4).

Variation of throughput of our containerized application with G1 with -Xmx 256m (at the left of the blue line) and G1 with -Xmx 128m (at the right of the blue line)

High throughput may accompany long, yet infrequent pause times.

Variation of GC duration of our containerized application with G1 with -Xmx 256m (at the left of the blue line) and G1 with -Xmx 128m (at the right of the blue line). As said before, left part with high throughput present some long pause times.

Aggregate metrics to make them understandable

If you are monitoring an environment with many instances, placing the metrics as they are provided into a chart can lead to a very difficult to read the result.

an example of the CPU time usage metric chart, hard to read, isn’t it?

One series per instance makes the graph unreadable, too much information, too much noise. For the same job, I suggest to pick the maximum value across all instances, to plot the worst-case scenario.

same data as before but worst case metrics only, much easier to detect trends and peaks

As we can see, trends and peaks are easier to track. Of course, this comes at a cost: this graph’s objective is to quickly detect high usage but it will not be useful to “debug” an issue.

Key performance indicators

GC Throughput

So, with the default configuration, we can consider an acceptable throughput any value above 90/95%.

In our sample data, there are low variations in most applications.

Throughput of our containerized app named “cdn-delivery”. The red line is the 95% threshold

The following graph shows a metric hinting that something is not working optimally:

Throughput of another containerized app “core-xadmin” is variable and well below 95%

This means there is an efficiency gain to be achieved here, we need to investigate! Or this type of graph it’s worth exploring the GC Frequency metric.

GC type

There are three types of algorithms for collectors:

serial, which use a single thread to perform all garbage collection work, best suited to single processor machines for application with small data sets (up to 100MB)

parallel, which use many threads to get the work done faster, to use when application performance is the priority

concurrent, perform mostly expensive work concurrently to the application, best suited when response time is important

More info in https://docs.oracle.com/javase/9/gctuning/available-collectors.htm and here for info about new experimental GCs https://blogs.oracle.com/javamagazine/understanding-the-jdks-new-superfast-garbage-collectors

Do not underestimate GC type impact, as seen in the following real example where we changed from Parallel GC to G1 GC:

Near 12:00 we switched to G1 GC from Parallel GC. Throughput of our batch application changes

Allocation size for new objects is different on Parallel GC (blue line) and G1 GC (orange line)

here the evidence how different GCs works on memory (green line show Old generation allocation with Parallel GC, red line show Tenured generation with G1 GC)

Near 12:00 we switched to G1 GC from Parallel GC and heap memory usage increase

GC Duration

To get an idea of the impact of the GC pause and possible problems, it is useful to compare the GC pause metric with the metric of the duration of the requests.

Latency increases as the duration of GC increases

The following graph shows a real case we encountered. We were plotting GC Duration for several different applications and found unexpected spikes.

GC duration. Yellow series is Full GC

From image, we see some spikes (yellow series) related to metric named “G1 Old Generation core-xadmin”. This means that the application is not healthy because this metric represents a full GC and ideally full GC should never happen in G1 because this type of collection performs in-place compaction of the entire heap. This might be very slow. The reason that a Full GC occurs is because the application allocates too many objects that can’t be reclaimed quickly enough. Often concurrent marking has not been able to complete in time to start a space-reclamation phase. The probability to run into a Full GC can be compounded by the allocation of many humongous objects. Due to the way these objects are allocated in G1, they may take up much more memory than expected.

Young and Mixed collections are counted in “G1 Young Generation”. G1 pauses the application during all three phases.

If GC pause time starts to increase, then it’s an indication that the app is suffering from memory problems.

For G1 you can specify the size of the young generation on the command line as with the other garbage collectors, but doing so may hamper the ability of G1 to attain the target pause time.

GC Frequency

Frequency is how many times the garbage collector runs in a given amount of time. Collection frequency impacts throughput and can be related to object allocation/promotion rate and the size of the generation.

The selected region shows an increment of GC frequency. This coincides with the decrease in throughput

Conclusions

The metrics described are a first step in highlighting performance issues. To analyze in detail and understand how to act to carry out the tuning of the JVM it is necessary to enable the logs of the GC which provides a detailed overview during and out of the breaks on the activity of waste collection.