When trying to optimize algorithms implemented in java, not just cputime, and memory footprint are important, but often we are interested in the amount of generated garbage. If your service is bounded by some availability agreements having a deep knowledge about garbage collection is a must.

Tuning garbage collection is a complicated task. If you are interested in that, I’m proposing you to read these brilliant articles for start: [1], [2], [3]. Also there are some tools to poke around with GC at a deeper level: VisualVM, jol, Eclipse MAT, and possibly many others. It’s also noteworthy to mention that there is a promising ongoing pauseless GC development (named Shenandoah) for OpenJDK.

Scope of the analyzis

In this post I will compare 4 different implementations for an algorithmic problem, by checking their cputime, and garbage footprint.

The problem I’m trying to examine, is the following: we have a stream of (id, score) pairs, from which we want to select the ids of the top N best scoring pairs, without needing them to be returned in order.

We will evaluate the algorithms in two scenarios:

fixed size of input, variable size of the selected top N elements

fixed size of the selected top N elements, variable size of input

The algorithms implement this interface, where the sink function is responsible to add a new (id, score) pair from the stream, while the getTopN is responsible to get the top N best scoring elements:

public interface TopNSelectAlg { void sink(int index, double score); int[] getTopN(int topN); }

Implementations

I’ve created the following implementations:

Simple priority queue based

This is a simple naive implementation. We collect the (id, score) pairs represented by a simple POJO class named Entry in a java.util.PriorityQueue , and fetch the top N elements by calling the poll() function of the priority queue N times in a row. Just a note: this implementation will return the top N elements ordered, which was not required by the problem specification. Addition of one element is O(log(current_size)) complexity, calculating the topN element is topN * O(log(input_size)).

[source]

Native priority queue based

This implementation is based on a native priority queue (by native i mean here that we use scalar typed arrays for ids, and scores, repectively int[] and double[] ). The top N elements are calculated on the fly: if the new (id, score) pair added to the queue is worst scoring than the minimum element in the queue, then its discarded, otherwise we remove the minimum element from the priorityqueue and add the new one. Fetching the top N elements is simply returning the internal id vector of the queue. Addition of one element to the queue is O(1) or O(log(topN)), complexity based on the score of the new element, calculating the top N is O(1).

[source]

Quickselect

Quickselect is an inplace selection algorithm. This means we need to have the full input in an array, and the algorithm moves the topN elements to the beginning of the array. Its working mechanism is similar to Quicksort, but it only needs to handle one pivot element, and because we don’t need to have the array fully sorted, we need to do the recursive descent only in one partition of the data. We collect the elements in a dynamically sized array, and apply the algorithm to it when the getTopN function is called. The algorithms average time complexity is O(input_size), while in the worst case it can be O(input_size2).

[source]

Quickselect with fixed size buffer

Since as one can suspect that the default quickselect algorithm requires a lot of memory (it needs to hold the full stream in memory), we can slightly modify it: instead of dynamically sized arrays we use fixed size arrays, allocated to the size of topN * loadFactor , and run the quickselect algorithm if the array gets full. This allows us to calculate the topN elements of the partial data, and after each calculation we have (loadFactor - 1) * topN space available to add new elements before we need to rerun the quickselect algorithm.

[source]

Measuring generated garbage

Our goal is to measure the generated garbage between two points in the program execution flow. We can query garbage collector statistics through the garbage collector MXBean, from which in certain circumstances we can infer information about the generated garbage. The idea is originating from one of my colleagues (@pila), credits for this go to him.

Note: experiments were performed using the latest version of Sun JRE 7, using default garbage collector settings which means PS Scavange for young collection and PS MarkSweep for old collection.

We need to make the following restrictions:

During the evaluation of the algorithm, the size of the old generation shouldn’t change, or at least shouldn’t change too much: we can achieve this by starting the JVM with big eden space, for example specifying -XX:MaxNewSize=1G

During the evaluation of the algorithm, the size of the eden space should not change. This can be achieved by specifying the minimum size of the eden space to the same as the maximum using the following JVM option -XX:NewSize=1G , and also disabling adaptive size policy -XX:-UseAdaptiveSizePolicy

, and also disabling adaptive size policy We need to disable TLABs (Thread Local Allocation Buffer). TLABs are used by the JVM to speed up object allocation in the eden space, by preallocating bigger chunks of memory, and serve object allocation requests from this preallocated memory area. By disabling it the performance of the application slightly degrades, but we will be able to measure memory allocations more precizely. To disable TLAB, use the -XX:-UseTLAB JVM option.

JVM option. Make sure no other threads are running. Since the garbage collector statistics is global, if other threads are running, the garbage generated by them can not be distinguished from the garbage generated by the thread we are trying to examine.

We use the following method to derive the generated garbage between two points in the execution flow of the program:

save garbage collector statistics at both of the points

check that the prerequisites are met between the two point

if no garbage collection happened estimate the generated garbage by the difference in eden space usage size

if there was at least one garbage collection cycle, calculate the generated garbage by assuming that at each garbage collection cycle cleared out the whole eden space (which is eden space committed size)

Check the provided implementation in GCMeasure.java

Results

I’ve divided the algorithms into three parts, and measured cputime, and generated garbage for each of them. These parts are:

CREATE : creating the algorithm

: creating the algorithm ADD : adding all the elements in the stream

: adding all the elements in the stream TOP: fetching the top N best elements

Fixed input, variable topN

In these measurements, i’ve used a random input of 1000000 elements, while I varied topN from 40000 to 960000.

For the simple priority queue based algorithm we see that the addition takes the same cputime and generates the same amount of garbage independent from the topN size, since we need to add all the elements to the priorityqueue. Fetching the topN elements can be very slow especially if topN is big (compared to the size of the input).

For the native priority queue we can see that most of the memory allocations happen, while instantiating the algorithm, and also when we copy the result from this internal structure, to the caller of getTopN. We also can see that all the cpu time is consumed by the addition of the elements, which is also plausible, since the queue always holds the topN elements of the partially added stream, so all the work is done during the add operation. We also can see, that this algorithm is efficient if topN is relatively small or relatively big compared to the size of the input.

The quickselect algorithm has nearly the same memory footprint as the simple priorityqueue based variant, because it also needs to keep the whole structure in memory, before it can apply the algorithm. We can also see, that the execution times got better, and are somewhat independent from topN.

The modified quickselect with fixed buffer, has a really nice memory footprint, and has execution times which are competing with the standard quickselect algorithm. It is interesting to see, that the execution graph changes charactersitics around topN = 700k. This is because the loadfactor of this config is 1.5, so if topN is bigger than (1000000/1.5 = ~660k), then the quickselect algorithm is only runned once, because we never hit the end of the allocated array, while adding new elements from the stream. This also explains, why the execution times are getting bigger as topN approaches 0, because quickselect needs to be executed several times more to calculate the partial state.

These statements can be verified, if we set the loadFactor to a bigger value (3.0 in this example), which increases performance, and also the garbage footprint:



Fixed topN, variable input

In these series of experiments, I’m interested in the top 1000 elements on a varied input size ranging from 40000 to 96000 items.

For the simple priority queue based implementation we can see that both response times and garbage footprint are proportional to the size of the input. Compared to the previous case, the time for querying the topN items disappeares, compared to the time used by the add function. From the steps of the garbage footprint graph we can infer that the priority queue implementation in java is baked by some dynamical sized array.

For the native array based priority queue we can see that in this use case it produces the best response times, while the generated garbage is also pretty low. I would expected different proportions on the generated garbage graph, for example i would expect fetching the top, which is simply an array copy should be the same size as create. I didn’t dig more deeper into this, but I think its because we measure globally generated garbage, and when the garbage footprint is this small, garbage produced by other parts of the code are not negligible, compared to the other cases when the garbage generated by the algorithm is a lot of times bigger.

Nothing surprizing in the results for quickselect, except that we can see, as the dynamically sized arrays double the size of their internal native array at certain points.

For the fixed size quickselect implementation we can see that the garbage footprint is similar to the native priority queue, while the cputime is somewhat worst.



But this can be also optimized by changing the loadfactor parameter to a bit higher, which of course increases the memory footprint too:



Sources

Sources are available at github.