Over the past year there have been a couple of changes in how Strings are represented in Java. In this post, I will summarize these changes, help you understand how it can influence the performance of your code, and give you a couple of ideas for tuning your code.

Before Java 7 update 6 (1.7.0_06)

Representation

Up until Java 7u6 the String class had four instance fields:

value – this was a reference to a character array (char[]) containing the actual characters which make up the String (remember that char in Java is 16 bit/two bytes wide unsigned integer type).

offset – an integer specifying the index of the first character belonging to the string in the array. This was 0 most of the time.

count – an integer specifying the number of characters belonging to the string (the length of the string)

hash – an integer representing the hash code of the String. This was being lazily populated and had some tricky logic applied to it (it wasn’t volatile but still was thread safe because the Java Memory Model guarantees the atomicity of 32 bit reads / writes. It also had an issue with strings that had hashCode zero – in this case it was being recalculated needlessly because 0 was being used as a marker for “not yet calculated”)

This representation meant that creating a substring of an existing string was a rather cheap operation (since the actual characters weren’t being copied – just a new reference to the existing backing array was created). In fact this was the reason why this implementation was chosen – for Java to perform well in situations where a lot of substrings of larger strings were created (for example when parsing source code, CSV files or other text files). The danger of this representation is that you can retain more memory than you actually need:

In the above code we’re retaining 100MB of memory, even though we only need ten characters. The solution is to force the copying of the underlying char array by using the String(String) constructor:

You can pretty quickly diagnose this situation if you know what you’re looking for. Also, some profilers have dedicated diagnostic tools which detects such problems.

Deduplication

Another issue which arises frequently in practice when working with strings is Java are duplicate strings – distinct instances of String which contain the same sequence of characters. While strings are immutable and instances can be freely interchanged (or deduplicated to reduce memory consumption), Java only does deduplication for String constants.

This is wholly understandable since doing deduplication for all strings would be very time consuming and it would slow down the program without much benefit (in the general case). Still, if we know that our dataset contains a lot of repeating strings, we can reduce the memory footprint by deduplicating our strings manually.

One way to do this is to call intern on the String instance. This returns an identical string from the internal string pool (if one was already present there) or adds the current string to the internal string pool and returns it. There were a couple of issues with intern however in the pre-Java 7 days which discouraged its use:

it was slow. Very slow.

the string pool was located in a special memory area (the PermGen – permanent generation – space) which wasn’t being affected by the “maximum memory” (-Xmx) command line switch and had to be tuned separately using the -XX:MaxPermSize=… switch (or risk running out of it).

In conclusion if you needed deduplication under Java 1.7.0_05 and earlier, it was best if you rolled your own solution using WeakHashMap<String, WeakReference<String>>. Even better, if you knew all the possible values you can have before hand, you could use an Enum to represent your data.

After Java 7 update 6

Representation

Starting in Java 7u6 Oracle went back to the simpler representation of String. It now contains only 3 instance fields:

value – as before, the actual characters

hash – the cached hash code of the String (calculated lazily, still with the same issues for Strings with a zero hash code)

hash32 – a new hash code based on Murmur hash, which gives a better dispersion of hash values

A couple more words about the alternative hash code:

it isn’t exposed publicly through the String class. You can access it using the (unofficial) sun.misc.Hashing.stringHash32 method

unlike the original hash code, hash32 for two strings containing the same characters but running in different JVMs (on the same machine or on different machines) isn’t guaranteed to be the same (in fact most likely it won’t be, since a “HASHING_SEED” value is included in the calculation which is initialized on JVM startup using the current time)

the purpose of alternative hash code is to give better performance for HashMap and related classes with String keys and to thwart hash-collision denial of service attacks

Its usage isn’t enabled by default. You need to set the “jdk.map.althashing.threshold” property to enable it. If you set this to a value X, then HashMap and related classes with a capacity at least X will use the alternative hashing algorithm.

A word of caution if you want to enable alternative hashing: prior to Java 7u40 (ie. all versions between Java 7u6 and Java 7u39) had a performance issue which meant that HashMap creation while alternative hashing was enabled was slower than needed to be. Thus if you want to enable alternative hashing, ensure that you have the latest Java 7 runtime.

As you can see substring works now “as expected”, creating a completely independent copy of the requested characters. This eliminates the possible memory retention issues from the previous versions.

This new behavior can cause performance regressions in some special cases when we have a large string and extract many substrings from it. A canonical example for this is parsing of source code (or any other text for that matter) where we create a substring for each token. You can find a possible solution for this a little bit later on in this post.

Yet another change which was made in Java 7 was the removal of compressed strings. The compressed string option allowed for a more compact representation of strings (using byte[] instead of char[] – 50% less memory needed) which contained only ASCII characters (as is the case in English environments).

In practice the memory savings were 10-30% and the additional code complexity was not deemed justifiable. Again, read on for a possible solution.

Deduplication

In Java 7 the string pool used by the intern call was relocated to the heap. This means that setting the maximum memory size will also make it possible to have a larger string pool. Its performance was also improved which means that we can get rid of our manual string pools.

If you have an application which uses intern for string deduplication, take a look at the -XX:StringTableSize=… tuning parameter. This determines the number of “buckets” the hash table containing the strings has. By default this is 1009 in Java 7, which is insufficient when we have many interned strings. You might want to tune it to something like 60013 (come Java 8 this will be the new default value).

You can also use the -XX:+PrintStringTableStatistics option which prints out a statistic about the string cache to the console during shutdown and the jmap tool which shows the total number of strings in the string cache.

Tuning for many substrings

If you are in one of those special situations where you need a lot of relatively long-lived substrings from a single source (like parsing source code or other large text files), you can create a custom CharSequence implementation which mimics the old substring behavior (and also uses byte[] if you know that you’re only dealing with ASCII data for example).

CharSequence is an interface introduced in Java 1.4 to work-around the fact that the String class is final and we can’t have alternative implementations. String itself implements CharSequence (as does StringBuilder) and if you wish to have a flexible API, you should expect CharSequence for parameters rather than Strings.

In our benchmark code we provide an alternative implementation for CharSequence (BBCS – which stands for BlobBackedCharSequence) that uses a byte array to store characters (thus being space efficient if we only use ASCII characters) and uses the “cheap substring” trick from Java 6. You can find the source code under my GitHub account.

Lets see what this would look like:

The benchmark does a pseudo-tokenizing of the JDK 8 source code (you can get early access builds of JDK 8 here to test your code now) and reports the runtime and the consumed memory.

The benchmark supports different implementations for extracting the String tokens:

null – this is used to create a baseline (to see how much the Token instances occupy)

String – uses String.substring

for Java 6 we also test String with compressed strings enabled

cached String – uses a simple opportunistic cache (ie. it doesn’t guarantee that all the duplicates are detected, but it’s fast)

interned String – uses String.intern as a cache

Blob – uses the custom CharSequence implementation

cached Blob – the custom CharSequence implementation combined with opportunistic caching

Tests were run with both Java 6 and Java 7. The results are:

Java Version Tokenizer implementation Consumed memory (MB) Runtime (seconds) Java 6u38 null (baseline) 285 Java 6u38 String 639 (+354) 5 Java 6u38 Chached String 461 (+176) 4 Java 6u38 Interned String 291 (+6) 11 Java 6u38 Compressed String 461 (+176) 5 Java 6u38 Blob 508 (+223) 4 Java 6u38 Cached Blob 375 (+90) 4 Java 7u45 null (baseline) 279 Java 7u45 String 618 (+339) 8 Java 7u45 Cached String 328 (+49) 5 Java 7u45 Interned String 304 (+25) 6 Java 7u45 Blob 507 (+228) 5 Java 7u45 Cached Blob 375 (+96) 5

Conclusion