With tens of thousands of Java servers running in production across the enterprise, Java has become a language of choice for building production systems. If our machines are to exhibit acceptable performance, they require regular tuning. This article takes a detailed look at techniques for tuning a Java server.

Measuring Performance

In order for our tuning to be meaningful, we need a way to measure the performance improvement. Let's understand two key performance metrics: latency and throughput.

Latency measures the end-to-end processing time for an operation. In a distributed environment we determine latency by measuring the time for the full round-trip between sending the request and receiving the response. In those cases, latency is measured from the client machine and includes the network overhead as well.

Throughput measures the number of messages that a server processes during a specific time interval (e.g. per second). Throughput is calculated using the equation:

Throughput = number of requests / time to complete the requests
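The two metrics can be computed directly from a timed run. The following is a minimal sketch (not from the article; the workload is a stand-in for a real server call) of measuring throughput and average latency:

```java
// Minimal sketch of measuring throughput and average latency for a
// batch of requests. Math.sqrt stands in for a real server request.
public class ThroughputDemo {
    // Throughput in requests per second, given elapsed time in nanoseconds.
    static double throughput(int requests, long elapsedNanos) {
        return requests / (elapsedNanos / 1_000_000_000.0);
    }

    // Average latency per request in milliseconds.
    static double avgLatencyMs(int requests, long elapsedNanos) {
        return (elapsedNanos / 1_000_000.0) / requests;
    }

    public static void main(String[] args) {
        int requests = 100_000;
        long start = System.nanoTime();
        for (int i = 0; i < requests; i++) {
            Math.sqrt(i);   // stand-in for a real request
        }
        long elapsed = System.nanoTime() - start;
        System.out.printf("throughput=%.0f req/s, avg latency=%.6f ms%n",
                throughput(requests, elapsed), avgLatencyMs(requests, elapsed));
    }
}
```

Note that in a real distributed test, the timing would be done from the client side so that network overhead is included, as described above.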

Ideally we would like maximum throughput with minimal latency. However, in a well-designed server there is a tradeoff between throughput and latency. For example, as shown by the graph below, if you need more throughput you have to run the system with more concurrency, but then average latency also increases. Usually, you need to achieve maximum throughput while keeping latency within some acceptable limit. For example, you might choose the maximum throughput in the range where latency is less than 10ms.

The following graphs capture the characteristic behavior of a server. As the graphs show, server performance is gauged by measuring latency and throughput against concurrency.

In the graph, the “ideal path” is the theoretical performance curve for a perfectly written server. In practice, most servers crash or degrade in performance when hit with too much concurrency.

Tuning/Profiling Servers

In a profiling session, we try to look deep into a server and understand its behavior, and either verify that it runs at its potential or find a way to improve its performance. We use profiling to achieve three goals:

Improve throughput - maximize the number of messages processed by the system

Improve latency - ensure we are hitting our response-time SLAs

Find and fix leaks (e.g. memory, file, thread, or connection leaks)

As we said, our goal is to achieve maximum throughput while keeping latencies within acceptable limits. If you have not done any tuning at all, the best place to start is throughput, so let's look at that first.

Tuning throughput

Before we start, let's understand what limits performance. It is helpful to think of a server as a system of pipes carrying water; increasing concurrency is like increasing the amount of water we pour into the system. Adding more water does not ensure that more water will flow through the pipes; the flow is limited by the slowest part of the system.

Similarly, application performance is decided by the scarcest resource in the system. A computer system has many types of resources: CPU, memory, disk, and network I/O. Any of these may limit the performance of our system.

While studying performance in vitro, we can start by increasing the load until at least one resource is exhausted (since we may not be hitting the server hard enough). That helps identify the limiting resource, and we can then either allocate more of that resource or change the system to use it more carefully.

Next, let’s set up the system and run a representative workload. (For more details about how to do so, please see my blog). While load is running, we should next determine utilization of the various resources.

Based on the scarcest resource, we categorize server performance degradation into three classes:

CPU bound - the server is blocked on CPU

IO bound - the server is blocked on disk or network bandwidth

Latency bound - the server is waiting for some activity to happen (for example, waiting for data to pass from the disk to the network)

Let’s investigate what resource is scarce in a sample system. To start, let’s run the top command on Unix/Linux.


One mistake engineers often make is profiling the CPU without first determining whether the workload is truly CPU bound. Even when the CPU utilization shown by top is low, the machine may be busy doing IO (e.g. reading from disk or writing to the network). Load average is a much better metric for determining whether the machine is loaded.

Load average represents the number of processes waiting in the OS scheduler queue. Unlike CPU, load average will increase when any resource is limited (e.g. CPU, network, disk, memory etc.). Please refer to the blog “Understanding Linux Load Average” for more details.

We can use the following load average values to decide whether a machine is loaded:

If load average < number of cores, then the machine is not loaded

If load average == number of cores, then the machine is in full use

If load average >= 4X number of cores, then the machine is highly loaded

If load average >= around 40X number of cores, then the machine is unusable
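These thresholds can be checked from inside the JVM via its management beans. The following is a hedged sketch (the classification labels are mine, mapping to the ranges above); note that getSystemLoadAverage() returns the 1-minute load average, or a negative value on platforms such as Windows where it is unavailable:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Sketch: read the system load average from the JVM and classify it
// using the thresholds described in the article.
public class LoadCheck {
    static String classify(double loadAvg, int cores) {
        if (loadAvg < 0) return "unavailable";       // platform doesn't report it
        if (loadAvg < cores) return "not loaded";
        if (loadAvg >= 40.0 * cores) return "unusable";
        if (loadAvg >= 4.0 * cores) return "highly loaded";
        return "fully used";
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        int cores = os.getAvailableProcessors();
        double load = os.getSystemLoadAverage();     // 1-minute load average
        System.out.printf("cores=%d load=%.2f -> %s%n",
                cores, load, classify(load, cores));
    }
}
```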

If the machine is not loaded, that generally means it is idling. A number of factors can cause this, and it can often be corrected as follows:

Try increasing the load (often this means increasing concurrency). For example, testing the server with just one or two clients will not generally load it; it often takes hundreds of clients to load a server.

Check the lock profile (if most threads are locked or deadlocked, throughput will drop). Find and fix those locks where you can, using non-blocking data structures when possible. We will discuss lock profiles in more detail in the “Tuning for latency” section below.

Try tuning the thread pools (sometimes the system is configured with too few threads, causing it to run slowly). For example, I frequently find that increasing the number of threads in the Tomcat thread pool increases throughput.

Ensure the network is not saturated (if the network is saturated, the effective load reaching the machine can be very small). On most machines you can check this with the Linux iftop command.
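The effect of thread-pool size can be explored empirically. The following sketch (all numbers are illustrative, not from the article) times an IO-like workload, simulated with a short sleep, under different pool sizes; for blocking workloads, larger pools usually finish sooner:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: time a batch of IO-like (blocking) tasks under different
// fixed thread-pool sizes to see how pool size affects throughput.
public class PoolSizeExperiment {
    static long runWith(int poolSize, int tasks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        long start = System.nanoTime();
        for (int i = 0; i < tasks; i++) {
            pool.submit(() -> {
                try { Thread.sleep(5); }   // simulate blocking IO
                catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return (System.nanoTime() - start) / 1_000_000;   // elapsed ms
    }

    public static void main(String[] args) throws Exception {
        for (int size : new int[]{2, 8, 32}) {
            System.out.printf("pool=%d -> %d ms for 100 tasks%n",
                    size, runWith(size, 100));
        }
    }
}
```

Beyond some point, adding threads stops helping and starts hurting (context-switch and lock overhead), which is why measuring, rather than guessing, is worthwhile.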

On the other hand, if the machine is loaded, then clearly it is doing something, but you still must ensure it is busy doing useful work!

Make sure your application is the only heavy process running on the machine(s); you don't want other applications skewing your test results.

Then check CPU utilization via top. If CPU usage is high, use a profiler to look at the CPU profile.


For example, the picture above shows a CPU profile taken using JProfiler. It shows the execution tree (Java methods) of the server, annotated with how much time is spent in each method minus the time spent in its children. Check for methods that take a lot of CPU, and make sure they are doing useful work. (Note: in this article we use JProfiler, but similar views are available in other products, such as the popular YourKit profiler and the Java Flight Recorder bundled with JDK 1.7 and later.)

Determine how much time the application is spending on garbage collection (GC). If it is more than 10%, you need to tune the JVM GC. (You can use a tool like the VisualVM GC plugin to check the overhead. See my blog post on GC tuning analysis for more information.) If GC is a problem, the allocation view of the profiler can help you find allocation hotspots and fix them. This talk by Kirk Pepperdine is a good resource on GC tuning.
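GC overhead can also be estimated without an external tool, via the JVM's own management beans. A hedged sketch (the 10% threshold is from the article; the code itself is illustrative):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch: sum the collection times reported by the garbage-collector
// MXBeans and compare against JVM uptime to estimate GC overhead.
public class GcOverhead {
    static double overheadPercent(long gcTimeMs, long uptimeMs) {
        return uptimeMs == 0 ? 0.0 : 100.0 * gcTimeMs / uptimeMs;
    }

    public static void main(String[] args) {
        long gcTimeMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();   // -1 if the collector doesn't report it
            if (t > 0) gcTimeMs += t;
        }
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        // If this exceeds ~10%, GC tuning is worth a look.
        System.out.printf("GC overhead: %.2f%%%n", overheadPercent(gcTimeMs, uptimeMs));
    }
}
```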

The next step is to check the IO profiles for network and disk (assuming your program writes to the disk). The following screenshot shows an IO profile taken with JProfiler. Verify that the nodes with heavy IO are indeed locations where IO is expected to happen. Do the same for database access as well.


Finally, check whether your machine is paging (e.g. check swap usage in Linux). Generally, you need to avoid swapping for any Java server, as it will significantly slow the server down. If there is swapping, either reduce the memory usage of the server or add more physical memory to the host machine.

If you have done all of the above and the server still doesn't reach an acceptable level of performance, chances are the server has hit its limits. You need to either scale out by adding more servers or do a complete redesign of its architecture.

Tuning for latency

High latency is caused by long-running operations in processing a request. Disk access, network access, and locks are common causes of long-running operations.

When tuning for latency, first check the network and disk profiles, as we discussed in the previous section, and try to find and reduce the number of IO operations. Following are some possible remedies:

Avoid unnecessary IO operations. Eliminate them if possible, or replace them with a cache.

Try to batch many IO operations together, which reduces the overhead to that of a single IO operation.

If you can guess what data will be needed in advance, try to prefetch it.
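The batching remedy can be as simple as buffering a stream: wrapping a stream in a BufferedOutputStream turns many small writes into a few large ones. A small illustrative sketch (class and record names are mine):

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Sketch: buffer many small writes so the OS sees far fewer
// underlying IO calls.
public class BatchedWrites {
    static void writeRecords(OutputStream raw, int count) throws IOException {
        try (BufferedOutputStream out = new BufferedOutputStream(raw, 8192)) {
            for (int i = 0; i < count; i++) {
                out.write(("record-" + i + "\n").getBytes("UTF-8"));
            }
        }   // close() flushes the final partial buffer
    }

    public static void main(String[] args) throws IOException {
        // Write into memory for demonstration; a real server would
        // write to a file or socket stream.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        writeRecords(sink, 3);
        System.out.print(sink.toString("UTF-8"));
    }
}
```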

Next, let's check a thread view of the JVM. The graph below plots the number of threads and their status over time. The red band in the graph shows that many threads are blocked on locks.


If the thread view reveals many waiting threads, you can explore the “Monitors and Locks” views to find out which threads are causing the blocking.


The following screenshot from JProfiler reveals which code was blocked the longest and which code owned the lock during those times. Here are two rules of thumb:

Avoid synchronized blocks and locks as much as possible. As an alternative, you are usually better off using the concurrent data structures from the java.util.concurrent package.

When you have to take a lock or write a synchronized block, release it as soon as possible. Avoid long-running operations, such as IO, while holding a lock, and avoid grabbing other locks from within a lock or synchronized block.
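The first rule of thumb can often be applied with a one-line change. An illustrative sketch (the counter use case is mine) that replaces a synchronized counter with ConcurrentHashMap.merge, so no explicit lock is held at all:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: lock-free per-key counters via ConcurrentHashMap.merge --
// no synchronized block is needed for this common pattern.
public class CounterExample {
    private final Map<String, Long> counts = new ConcurrentHashMap<>();

    void increment(String key) {
        // Atomically inserts 1 or adds 1 to the existing value.
        counts.merge(key, 1L, Long::sum);
    }

    long get(String key) {
        return counts.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        CounterExample c = new CounterExample();
        c.increment("requests");
        c.increment("requests");
        System.out.println(c.get("requests"));   // prints 2
    }
}
```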

Next, check the network behaviour between client and server. You can start with tools like ping and iftop, but you might want help from the network administrators for a detailed investigation.

One last option is to introduce more servers to the system, reducing the concurrency seen by each server and hence the latency, as we discussed with the server performance curve earlier in this article. A more drastic measure is to run two copies of the server and use whichever result arrives first, as Jeff Dean explained in his talk “Taming Latency Variability”. Also, if you require very low latencies, consider tools like the LMAX Disruptor that are designed for such use cases.

Performance Tuning Checklist

We have discussed tuning for throughput and latency. As you will appreciate by now, tuning Java servers is a tricky business, and there is a lot of ice hidden below the waterline of that iceberg. Following is a summary checklist of the steps we discussed. It may not tell you how to profile everything, but it can save you from many pitfalls.

Check the load average of the machine. If it is more than 4X the number of cores, your machine is running at full throttle and you can skip to the CPU-tuning part.

Are you hitting the system hard enough? Try simulating more clients by increasing the number of threads in your client programs. If that improves throughput, increase the load until throughput is maximized.

Are your threads idling? If you have too many locks or too few threads, the system might not deliver full throughput. Use a profiler to view the lock profile and try to remove locks; while some locks are needed, most are not.

Try increasing the number of threads in your thread pools, and check whether that improves throughput.

Now check the CPU profile for hotspots. Study the tree view and make sure any hotspots are where you expect. For example, an XML parser would be expected to consume a lot of CPU, but if you see something unexpected taking too much CPU, there is something to fix.

Check memory and GC. If GC throughput is less than 90%, it is time to tune the GC. If memory runs on the brink (check for paging), add more and see if that helps.

Look at DB profiles, and ensure that whatever takes the most load is expected to do so.

Look at the network IO profile. Go to the hotspots and make sure all major writes are what you expect. If there is unnecessary IO, this will show it.

Verify that resources such as the underlying network and disk mounts are in place.

Tuning a Java server can be tricky but the fruits are tasty. Sometimes it seems like more art than engineering, but following the steps I have outlined should get you very far.

About the Author

Srinath Perera is the Director of Research at WSO2 Inc. He is a long-standing open source contributor. Srinath is a co-founder of Apache Axis2 (an open source Web Service engine), a member of the Apache Software Foundation, and a committer for Apache Geronimo (a J2EE engine) and Apache Airavata. Srinath has been working with large-scale distributed systems and parallel computing for about 10 years and is a co-architect behind WSO2's Complex Event Processing engine.