I recently had a group of developers troubleshooting a problem-riddled application in my performance workshop. After dispensing with a couple of easy wins, the group was faced with a CPU that was running very hot. The group reacted exactly as I see most teams react when faced with a hot CPU: they fired up an execution profiler, hoping it would help them sort things out. In this particular case, the problem was related to how the application was burning through memory. While an execution profiler can find such problems, a memory profiler paints a much clearer picture. My group had somehow missed a key metric telling them that they should have been using a memory profiler. Let's run through a similar exercise here so that we can see when and why it is better to use a memory profiler.

Profilers work either by sampling the top of the stack or by instrumenting the code with probes, or by a combination of both. These techniques are very good at finding computations that happen frequently or take a long time. As my group experienced, the information gathered by an execution profiler often correlates well with the source of a memory inefficiency. However, because that information points to an execution problem, it can sometimes be confusing.

The code in Listing 1 defines the method API findCustomer(String,String). The problem here isn't so much in the API itself as in how the String parameters are treated by the method. The code concatenates the two strings to form a key that is used to look up the data in a map. This misuse of strings is a code smell, in that it indicates a missing abstraction. As we will see, that missing abstraction is not only at the root of the performance problem; adding it also improves the readability of the code. In this case the missing abstraction is a CompositeKey<String,String>, a class that wraps the two strings and implements both the equals(Object) and hashCode() methods.

public class CustomerList {

    private final Map customers = new ConcurrentHashMap();

    public Customer addCustomer(String firstName, String lastName) {
        Customer person = new Customer(firstName, lastName);
        customers.put(firstName + lastName, person);
        return person;
    }

    public Customer findCustomer(String firstName, String lastName) {
        return (Customer) customers.get(firstName + lastName);
    }
}

Listing 1. Source for CustomerList

Another downside to the style of API used in this example is that it limits scalability because of the amount of data the CPU is required to write to memory. In addition to the extra work needed to create the data, the volume of data being written to memory creates back pressure that forces the CPU to slow down. Though this benchmark is artificial in how it presents the problem, the problem itself is not so uncommon in applications that use the popular logging frameworks. That said, don't be fooled into thinking that only String concatenation can be at fault. Memory pressure can be created by any application that is churning through memory, regardless of the underlying data structure.
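To see why the concatenation in Listing 1 churns memory, consider this minimal sketch (my own illustration, not the workshop's benchmark): every runtime concatenation of two String variables allocates a brand-new String, with its own backing character array, even when the resulting text is identical to a previous key.

```java
// Minimal sketch of the allocation churn described above (an
// illustration, not the article's benchmark). Runtime concatenation
// of String variables allocates a new String, and a new backing
// char array, on every call.
public class ConcatChurn {

    static String makeKey(String firstName, String lastName) {
        return firstName + lastName; // allocates a new String each call
    }

    public static void main(String[] args) {
        String k1 = makeKey("Jane", "Doe");
        String k2 = makeKey("Jane", "Doe");

        // Equal content, but distinct objects: a lookup-heavy loop
        // therefore allocates one throwaway key per lookup.
        System.out.println(k1.equals(k2)); // true
        System.out.println(k1 == k2);      // false
    }
}
```

Multiply one throwaway key per lookup by millions of lookups per second, and the memory pressure described above follows directly.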

The easiest way to determine whether our application is burning through memory is to examine the garbage collection logs. Garbage collection logs report heap occupancy before and after each collection. Subtracting the occupancy after the previous collection from the occupancy before the current collection yields the amount of memory allocated between the two collections. If we do this for many records, we get a pretty clear picture of the application's memory needs. Moreover, getting the needed GC log is cheap and, with the exception of a couple of edge cases, has no impact on the performance of your application. I used the flags -Xloggc:gc.log and -XX:+PrintGCDetails to create a GC log with a sufficient level of detail. I then loaded the GC log file into Censum, jClarity's GC log analysis tool.
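The subtraction described above can be sketched in a few lines. Here is a minimal illustration of the arithmetic (the occupancy figures are made up for the example; a real tool such as Censum parses them from the log):

```java
// Minimal sketch of the GC-log arithmetic described above. The
// occupancy values are invented for illustration; a real analysis
// tool would parse them from the log. Values are in megabytes.
public class AllocationRate {

    // Memory allocated between collection i-1 and collection i is the
    // occupancy before collection i minus the occupancy remaining
    // after collection i-1.
    static long[] allocatedBetweenCollections(long[] before, long[] after) {
        long[] allocated = new long[before.length - 1];
        for (int i = 1; i < before.length; i++) {
            allocated[i - 1] = before[i] - after[i - 1];
        }
        return allocated;
    }

    public static void main(String[] args) {
        long[] before = {512, 498, 520}; // occupancy before each GC
        long[] after  = {64, 70, 66};    // occupancy after each GC
        long[] alloc  = allocatedBetweenCollections(before, after);
        // 498 - 64 = 434 MB allocated between the first two collections,
        // 520 - 70 = 450 MB between the next two.
        System.out.println(alloc[0] + " " + alloc[1]); // 434 450
    }
}
```

Divide each allocated amount by the wall-clock time between the two collections and you have the allocation rate that the charts below report.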


Table 1. Garbage Collection Activity Summary

Censum provides a whole host of statistics (see Table 1), of which we’re interested in the “Collection Type Breakdown” (at the bottom). The “% Paused” column (the sixth column in Table 1) tells us that the total time paused for GC was 0.86%. In general we’d like GC pause time to be less than 5%, and it is. So this number suggests that the collectors are able to reclaim memory without too much effort. Keep in mind, however, that when it comes to performance a single measure rarely tells the whole story. In this case we also need to see the allocation rates, and Chart 1 shows just that.


Chart 1. Allocation Rates

If we look at this chart we can see that the allocation rate initially starts out at about 2.75 GB per second. Under ideal conditions, the laptop I used to run this benchmark can sustain an allocation rate of about 4 GB per second, so 2.75 GB per second represents a significant portion of the total memory bandwidth. In fact, the machine is not able to sustain that rate, as evidenced by the drop in allocation rates over time. While your production servers may have a larger capacity to consume memory, it is my experience that any machine trying to sustain object creation rates greater than 500 MB per second will spend a significant amount of time allocating memory, and will also have a very limited ability to scale. Since memory efficiency is the overriding bottleneck in our application, the biggest wins will come from making the application more memory efficient.

Execution Profiling

It should go without saying that if we’re looking to improve memory efficiency we should be using a memory profiler. However, when faced with a hot CPU, our group decided to use execution profiling, so let's start with that and see where it leads. I used the NetBeans profiler, running in VisualVM in its default configuration, to produce the profile in Chart 2.

Chart 2. Execution Profile

Looking at the chart we can see that, outside of the Worker.run() method, most of the time is spent in CustomerList.findCustomer(String,String). If the source code were a bit more complex, you could imagine it being difficult to understand why this code is a problem or what you should do to improve its performance. Let's contrast this view with the one presented by memory profiling.

Memory Profiling

Ideally I would like my memory profiler to show me how much memory is being consumed and how many objects are being created. I would also like to know the causal execution paths; that is, the paths through the source code that are responsible for churning through memory. I can get these statistics using the NetBeans profiler, once again running in VisualVM. However, I will need to configure the profiler to collect allocation stack traces. This configuration can be seen in Figure 1.


Figure 1. Configuring NetBeans memory profiler

Note that the profiler will not collect a stack trace for every allocation, but only for every 10th allocation. Sampling in this manner should produce the same result as capturing data for every allocation, but with much less overhead. The resulting profile is shown in Chart 3.


Chart 3. Memory Profile

The chart identifies char[] as the most popular object. With this information in hand, the next step is to take a snapshot and then look at the allocation stack traces for char[]. The snapshot can be seen in Chart 4.


Chart 4. char[] allocation stack traces

The chart shows three major sources of char[] creation, one of which is opened up so that you can see the details. In all three cases the root can be traced back to the firstName + lastName operation. At this point the group proposed numerous alternatives; however, none of them was as efficient as the code produced by the compiler. It was clear that to make the application run faster we were going to have to eliminate the concatenation altogether. The solution that eventually solved the problem was to introduce a Pair-style class that took the first and last names as arguments. We called this class CompositeKey, as it introduced the missing abstraction. The improved code can be seen in Listing 2.

public class CustomerList {

    private final Map customers = new ConcurrentHashMap();

    public Customer addCustomer(String firstName, String lastName) {
        Customer person = new Customer(firstName, lastName);
        customers.put(new CompositeKey(firstName, lastName), person);
        return person;
    }

    public Customer findCustomer(String firstName, String lastName) {
        return (Customer) customers.get(new CompositeKey(firstName, lastName));
    }
}

Listing 2. Improved Implementation using CompositeKey abstraction

CompositeKey implements both hashCode() and equals(), eliminating the need to concatenate the strings. While the first benchmark completed in ~63 seconds, the improved version ran in ~21 seconds, a 3x improvement. The garbage collector ran only 4 times, making it impossible to get an accurate picture, but what can be said is that the application consumed just under 3 GB of data in aggregate, as opposed to the more than 141 GB consumed by the first implementation.
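The article doesn't show CompositeKey itself, but a minimal sketch of such a class might look like the following (the workshop's actual implementation may differ in detail):

```java
// A minimal sketch of the CompositeKey abstraction described above;
// the workshop's actual implementation may differ. Wrapping the two
// strings lets the map hash and compare them without ever building
// a concatenated key.
public final class CompositeKey {

    private final String first;
    private final String second;

    public CompositeKey(String first, String second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CompositeKey)) return false;
        CompositeKey other = (CompositeKey) o;
        return first.equals(other.first) && second.equals(other.second);
    }

    @Override
    public int hashCode() {
        // Computed without allocating, in keeping with the goal of
        // reducing memory churn.
        return 31 * first.hashCode() + second.hashCode();
    }
}
```

As a bonus, the abstraction also removes a subtle correctness hazard of concatenated keys: with concatenation, the pairs ("ab", "c") and ("a", "bc") collapse into the same key "abc", whereas CompositeKey keeps them distinct.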

Two ways to fill a water tower

A colleague of mine once said that you can fill a water tower one teaspoon at a time. This example proves that you certainly can. However, it's not the only way to fill the tower; you could also open a large hose and fill it very quickly. In such cases it's unlikely that an execution profiler would pick up on the problem, but the garbage collector will see the allocations and the recoveries, and the memory profiler will certainly see the allocations in sheer byte count. In one application where these large allocations were predominant, the development team had exhausted the vast majority of the gains they were going to get from an execution profiler, yet they still needed to squeeze more out of the app. At that point we turned on the memory profiler, and it exposed one allocation hot spot after another; with that information we were able to extract a number of significant performance gains. What that team learned is that memory profiling was not only giving them the right view, it was giving them the only view into the problem. This is not to say that execution profiling isn't productive. It is to say that sometimes it cannot tell you where your application is spending all of its time, and in those cases getting a different perspective on the problem can make all the difference in the world.

About the Author

Kirk Pepperdine has worked in high performance and distributed computing for nearly 20 years. Since 1998 Kirk has been working on all aspects of performance and tuning in each phase of a project's life cycle. In 2005 he helped author the foremost Java performance tuning workshop, which has been presented to hundreds of developers worldwide. Author, speaker, and consultant, Kirk was recognized in 2006 as a Java Champion for his contributions to the Java community. He was the first non-Sun employee to present a technical lab at JavaOne, an achievement that opened up the opportunity for others in the industry to do so. He was named a JavaOne Rockstar in 2011 and 2012 for his talks on garbage collection. You can reach him by email at kirk@kodewerk.com or on Twitter at @kcpeppe.