Java performance optimization tips: How to avoid common pitfalls

By Taylor | 13 min read (2,703 words)

In this post, I’m going to take you through some Java performance optimization tips. I’ll specifically look at certain operations in your Java programs. These tips are only really applicable in specific high-performance scenarios, so there’s no need to go writing all your code in this approach as the difference in speed will be minor. In hot code paths, however, they could make a considerable difference.


I’ll be covering the following topics:

- Use a profiler!
- Taking a step back to think about the approach to the problem
- Streams API vs the trusty for loop
- Date parsing and formatting
- String operations

Use a profiler!

Before performing any optimizations, the first task any developer must do is check that their assumptions about the performance are correct. Maybe the portion of code they believe is the slow part is, in fact, masking the true slow part, resulting in any improvements having a negligible effect. They must also have a comparison point to be able to know if their improvements have improved anything, and if so, by how much.

The easiest way to achieve both these goals is to use a profiler. The profiler will give you the tools to find which portion of the code is actually slow and how long it is taking. Some profilers that I can recommend are VisualVM (free) and JProfiler (paid – and totally worth it).

Armed with that knowledge you can be assured that you are optimizing the correct portion of the code and your changes have a measurable effect.

Taking a step back to think about the approach to the problem

Before attempting to micro-optimize a specific code path, it’s worth thinking about the approach it is currently taking. Sometimes the fundamental approach is flawed, meaning that even if you expend great effort and manage to make it run 25% faster by performing all the optimizations possible, changing the approach (using a better algorithm) could yield an order-of-magnitude or greater performance increase. This often happens when the scale of data being operated on changes: it’s straightforward to write a solution that works well enough now, but when it meets real data, it starts falling over.
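To make the “better algorithm” point concrete, here’s a small sketch (my own illustration, not from the original post’s benchmarks): repeatedly checking membership with List.contains is O(n) per lookup, while building a HashSet once up front makes each lookup O(1) on average.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MembershipCheck {
    // O(n) per lookup: scans the whole list for every query
    static int countHitsList(List<String> haystack, List<String> queries) {
        int hits = 0;
        for (String q : queries) {
            if (haystack.contains(q)) hits++;
        }
        return hits;
    }

    // O(1) average per lookup: one upfront pass builds the hash set,
    // then every query is a constant-time probe
    static int countHitsSet(List<String> haystack, List<String> queries) {
        Set<String> set = new HashSet<>(haystack);
        int hits = 0;
        for (String q : queries) {
            if (set.contains(q)) hits++;
        }
        return hits;
    }
}
```

Both methods return identical results; only the asymptotic cost differs, and the gap grows with the size of the haystack.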

Sometimes this can be as simple as changing the data structure your data is stored in. To use a contrived example: if your data access pattern is mostly random access and you’re using a LinkedList, just switching to an ArrayList could be a significant speed boost. For large data sets and performance-sensitive work, it’s critical that you select the right data structure for the shape of the data and the operations being performed on it.
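As a sketch of why that swap matters (my own illustration, not part of the original benchmarks): ArrayList.get(i) is O(1) array indexing, while LinkedList.get(i) has to walk the chain of nodes from one end, so the same indexed loop degrades from linear to roughly quadratic just by changing the list implementation.

```java
import java.util.List;

public class RandomAccessDemo {
    // Sums every 1000th element by index. On an ArrayList each get(i)
    // is O(1); on a LinkedList each get(i) is O(n) node traversal, so
    // this identical loop is dramatically slower on a LinkedList.
    static long sumByIndex(List<Integer> list) {
        long sum = 0;
        for (int i = 0; i < list.size(); i += 1000) {
            sum += list.get(i);
        }
        return sum;
    }
}
```

The results are identical for either implementation; profiling the two against each other on your real access pattern is what settles the choice.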

It’s always worth taking a step back and thinking about whether the code you are optimizing is already efficient and is just slow because of how it’s written, or whether it’s slow because the approach it’s taking is sub-optimal.

Streams API vs the trusty for loop

Streams are a great addition to the Java language, letting you easily lift error-prone patterns from for loops into generic, more reusable blocks of code with consistent guarantees. But this convenience doesn’t come for free; there is a performance cost associated with using streams. Thankfully this cost doesn’t appear to be too high – anywhere from a few percent faster to 10-30% slower for common operations – but it is something to be aware of.

99% of the time, the loss of performance from using streams is more than made up for by the increased clarity of the code. But for that 1% of cases where you’re using a stream inside a hot loop, it’s worth being aware of the performance trade-off. This is especially true for very high-throughput applications, where the increased memory allocations from the Streams API (according to this StackOverflow post, each filter adds 88 bytes of used memory) can create enough memory pressure to require more frequent GC runs, causing a heavy hit to performance.

Parallel streams are another matter. Despite their ease of use, they should only be used in rare scenarios, and only after you’ve profiled both the parallel and serial operations to confirm the parallel one is in fact faster. On smaller data sets (where the cost of the stream computation determines what counts as “smaller”), the cost of splitting up the work, scheduling it on other threads, and stitching it back together once the stream has been processed will dwarf any speedup from running the computations in parallel.

You must also consider the type of execution environment your code is running in. If it’s running in an already heavily parallelized environment (a web server, for example), then it’s unlikely you will even see a speedup from running the stream in parallel. In fact, under load, this might be worse than non-parallel execution: the parallel nature of the workload is most likely already making as much use of the remaining CPU cores as it can, meaning you’re paying the cost of splitting the data up without the benefit of any extra computation power.

A sample of the benchmarks I performed. The testList is a 100,000 element array of the numbers 1 to 100,000 converted to a String, shuffled.

```java
// ~1,500 op/s
public void testStream(ArrayState state) {
    List<String> collect = state.testList.stream()
            .filter(s -> s.length() > 5)
            .map(s -> "Value: " + s)
            .sorted(String::compareTo)
            .collect(Collectors.toList());
}

// ~1,500 op/s
public void testFor(ArrayState state) {
    ArrayList<String> results = new ArrayList<>();
    for (int i = 0; i < state.testList.size(); i++) {
        String s = state.testList.get(i);
        if (s.length() > 5) {
            results.add("Value: " + s);
        }
    }
    results.sort(String::compareTo);
}

// ~8,000 op/s
// Note: with an array size of 10,000 and more variable load on my CPU,
// this was 1/3rd as fast as testStream
public void testStreamParrallel(ArrayState state) {
    List<String> collect = state.testList.stream()
            .parallel()
            .filter(s -> s.length() > 5)
            .map(s -> "Value: " + s)
            .sorted(String::compareTo)
            .collect(Collectors.toList());
}
```

In summary, streams are a great win for code maintenance and readability, with a negligible performance impact in the vast majority of cases, but it pays to be aware of the overhead for the rare case where you really need to wring extra performance out of a tight loop.

Date parsing and formatting

Don’t underestimate the cost of parsing a date string into a date object, or of formatting a date object back into a date string. Imagine a scenario where you have a list of a million objects (either strings directly, or objects representing some item with a date field backed by a string), and you have to perform an adjustment to the date on each of them. If the date is represented as a string, you first have to parse it into a Date object, update the Date object, and then format it back into a string. If the date was instead represented as a Unix timestamp (or a Date object, which is effectively just a wrapper around a Unix timestamp), all you’d have to do is perform a simple addition or subtraction.

Per my test results, it is up to 500x faster to just manipulate the date object compared to having to parse and format it from/to a string. Even just cutting out the parsing step results in a speedup of ~100x. This may seem like a contrived example, but I’m sure you’ve seen cases where dates were being stored as strings in the database or returned as strings in API responses.

```java
// ~800,000 op/s
public void dateParsingWithFormat(DateState state) throws ParseException {
    Date date = state.formatter.parse("20-09-2017 00:00:00");
    date = new Date(date.getTime() + 24 * state.oneHour);
    state.formatter.format(date);
}

// ~3,200,000 op/s
public void dateLongWithFormat(DateState state) {
    long newTime = state.time + 24 * state.oneHour;
    state.formatter.format(new Date(newTime));
}

// ~400,000,000 op/s
public long dateLong(DateState state) {
    long newTime = state.time + 24 * state.oneHour;
    return newTime;
}
```

In summary, always be conscious of the cost of parsing and formatting date objects and unless you need the string right then, it’s much better to represent it as a Unix timestamp.

String operations

String manipulation is probably one of the most common operations in any program. However, it can be an expensive operation if done incorrectly, which is why I’ve focused on string manipulation in these Java performance optimization tips. I’ll list some of the common pitfalls below, though I’d like to point out that these problems only present themselves in very hot code paths or with a considerable number of strings; none of the following will matter in 99% of cases. But when they do, they can be a performance killer.

Using String.format when a simple concatenation would have worked

A very simple String.format call is on the order of 100x slower than manually concatenating the values into a string. This is fine most of the time, since we’re still talking about over a million operations per second on my machine, but inside a tight loop operating on millions of elements, the loss of performance could be substantial.

One instance where you should use string formatting instead of concatenation in high-performance environments, however, is debug logging. Take the following two debug logging calls:

```java
logger.debug("the value is: " + x);

logger.debug("the value is: %d", x);
```

The second instance, which may seem counter-intuitive at first, can be faster in a production environment. Since it’s unlikely you will have debug logging enabled on your production servers, the first call allocates a new string that is then never used (as the log line is never output). The second only loads a constant string and passes a reference; the formatting step is skipped entirely.
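The JDK’s own logging API offers a third option worth knowing about (a sketch using java.util.logging, not necessarily the logging framework from the snippet above): the Supplier overloads defer building the message entirely, so nothing is allocated or formatted unless the level is actually enabled.

```java
import java.util.logging.Logger;

public class LazyLogging {
    static final Logger logger = Logger.getLogger(LazyLogging.class.getName());

    // Counts how often the message is actually built, to demonstrate
    // that the Supplier is skipped when FINE is disabled (the default).
    static int buildCount = 0;

    static void logValue(int x) {
        // An eager call like fine("the value is: " + x) would allocate
        // the string even when FINE logging is off. This Supplier only
        // runs if the logger's effective level makes FINE loggable.
        logger.fine(() -> {
            buildCount++;
            return "the value is: " + x;
        });
    }
}
```

With the default INFO level, logValue allocates nothing; only after something like logger.setLevel(Level.FINE) does the supplier run and the message get built.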

```java
// ~1,300,000 op/s
public String stringFormat() {
    String foo = "foo";
    String formattedString = String.format("%s = %d", foo, 2);
    return formattedString;
}

// ~115,000,000 op/s
public String stringConcat() {
    String foo = "foo";
    String concattedString = foo + " = " + 2;
    return concattedString;
}
```

Not using a string builder inside of a loop

If you’re not using a string builder inside of a loop, you’re throwing away a lot of potential performance. The naive approach to appending to a string inside a loop is to use += to join the new portion onto the old string. The problem is that this allocates a new string on every iteration of the loop and requires copying the old string into the new one. That is an expensive operation in and of itself, even before accounting for the extra garbage collection pressure from creating and discarding so many strings. Using a StringBuilder limits the number of memory allocations, resulting in a large speedup – in my testing, greater than 500x. If you can at least make a good guess at the size of the resulting string when constructing the StringBuilder, setting the correct initial capacity (so the internal buffer never needs to be resized, which causes an allocation and a copy each time) can yield a further 10% speedup.

As a further note, (almost) always use StringBuilder instead of StringBuffer. StringBuffer is designed for use in multi-threaded environments and as such has internal synchronization; the cost of that synchronization must be paid even when it’s only used from a single thread. If you do need to append to a string from multiple threads (say, in a logging implementation), that’s one of the few situations where a StringBuffer should be used instead of a StringBuilder.

```java
// ~11 operations p/s
public String stringAppendLoop() {
    String s = "";
    for (int i = 0; i < 10_000; i++) {
        if (s.length() > 0) s += ", ";
        s += "bar";
    }
    return s;
}

// ~7,000 operations p/s
public String stringAppendBuilderLoop() {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 10_000; i++) {
        if (sb.length() > 0) sb.append(", ");
        sb.append("bar");
    }
    return sb.toString();
}
```
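The ~10% win from pre-sizing mentioned above corresponds to the stringAppendBuilderLoopSized entry in the benchmark output at the end of the post. Here is a sketch of what that variant looks like (the 50,000 capacity is my own estimate of the final length, not a value from the original):

```java
public class SizedBuilder {
    // Same loop as above, but the builder starts with an estimated final
    // capacity, so its internal buffer never has to grow (each growth is
    // an allocation plus a copy of the existing contents).
    public static String stringAppendBuilderLoopSized() {
        // 10,000 "bar"s plus 9,999 ", " separators = 49,998 chars,
        // so 50,000 comfortably covers the final length.
        StringBuilder sb = new StringBuilder(50_000);
        for (int i = 0; i < 10_000; i++) {
            if (sb.length() > 0) sb.append(", ");
            sb.append("bar");
        }
        return sb.toString();
    }
}
```

If the estimate is too low nothing breaks; the builder simply falls back to growing as needed, so a generous guess is safe.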

Using a StringBuilder outside of a loop

This is something I have seen recommended on the internet, and it seems like it would make sense. But my testing showed that using a StringBuilder was in fact 3x slower than using += – even outside of a loop. Even though += in this context is translated into StringBuilder calls by javac, it was much faster than using a StringBuilder directly, which surprised me.

If anyone has any idea why this is, I’d love to hear about it in the comments.

```java
// ~20,000,000 operations p/s
public String stringAppend() {
    String s = "foo";
    s += ", bar";
    s += ", baz";
    s += ", qux";
    s += ", bar";
    s += ", bar";
    s += ", bar";
    s += ", bar";
    s += ", bar";
    s += ", bar";
    s += ", baz";
    s += ", qux";
    s += ", baz";
    s += ", qux";
    s += ", baz";
    s += ", qux";
    s += ", baz";
    s += ", qux";
    s += ", baz";
    s += ", qux";
    s += ", baz";
    s += ", qux";
    return s;
}

// ~7,000,000 operations p/s
public String stringAppendBuilder() {
    StringBuilder sb = new StringBuilder();
    sb.append("foo");
    sb.append(", bar");
    sb.append(", bar");
    sb.append(", baz");
    sb.append(", qux");
    sb.append(", baz");
    sb.append(", qux");
    sb.append(", baz");
    sb.append(", qux");
    sb.append(", baz");
    sb.append(", qux");
    sb.append(", baz");
    sb.append(", qux");
    sb.append(", baz");
    sb.append(", qux");
    sb.append(", baz");
    sb.append(", qux");
    sb.append(", baz");
    sb.append(", qux");
    sb.append(", baz");
    sb.append(", qux");
    sb.append(", baz");
    sb.append(", qux");
    return sb.toString();
}
```

In summary, string creation has a definite overhead and should be avoided in loops where possible. This is easily achieved by using a StringBuilder inside of the loop instead.

I hope this post has provided you with some useful Java performance optimization tips. Once again, I’d like to stress that none of the information in this post matters for most code: it makes no difference whether you can format a string 1 million times a second or 80 million times a second if you’re only doing it a few times. But in those critical hot paths where you may in fact be doing it millions of times, an 80x speedup could save hours on a long-running piece of work.

This article is just a taste of the deep world of optimizing Java applications for high performance.

I’ve attached a zip file containing all the benchmarks and data that I used to write this post; see below for the output of the full benchmark run. All of these results were produced on a desktop with an i5-6500, using JDK 1.8.0_144 (VM 25.144-b01) on Windows 10.

```
Benchmark                                      Mode  Cnt            Score         Error  Units
DateBenchmark.dateLong                        thrpt   10  398,180,671.307 ± 861,095.156  ops/s
DateBenchmark.dateLongWithFormat              thrpt   10    3,205,290.937 ±  17,495.643  ops/s
DateBenchmark.dateParsingWithFormat           thrpt   10      829,339.030 ±   5,872.319  ops/s
StreamBenchmark.testFor                       thrpt   10        1,676.135 ±      33.174  ops/s
StreamBenchmark.testForInLoop                 thrpt   10          543.856 ±       2.759  ops/s
StreamBenchmark.testLargeStream               thrpt   10           95.626 ±       0.480  ops/s
StreamBenchmark.testLargeStreamParallel       thrpt   10          128.419 ±       1.185  ops/s
StreamBenchmark.testStream                    thrpt   10        1,637.747 ±      11.793  ops/s
StreamBenchmark.testStreamInLoop              thrpt   10          432.207 ±       1.098  ops/s
StreamBenchmark.testStreamParrallel           thrpt   10        8,068.937 ±     260.249  ops/s
StringBenchmark.stringAppend                  thrpt   10   22,572,323.301 ± 181,309.750  ops/s
StringBenchmark.stringAppendBuilder           thrpt   10    6,983,217.796 ±  54,212.734  ops/s
StringBenchmark.stringAppendBuilderLoop       thrpt   10        7,202.648 ±     117.500  ops/s
StringBenchmark.stringAppendBuilderLoopSized  thrpt   10        8,106.539 ±     135.040  ops/s
StringBenchmark.stringAppendLoop              thrpt   10           11.111 ±       0.373  ops/s
StringBenchmark.stringAppendLoopPlus          thrpt   10           11.624 ±       0.056  ops/s
StringBenchmark.stringAppendLoopPlusDouble    thrpt   10           11.593 ±       0.084  ops/s
StringBenchmark.stringConcat                  thrpt   10  116,474,840.767 ± 766,799.557  ops/s
StringBenchmark.stringCreateLoop              thrpt   10        6,404.057 ±      97.817  ops/s
StringBenchmark.stringFormat                  thrpt   10    1,384,875.999 ±   7,025.733  ops/s
```

All of the benchmark code can be found here on GitHub

