<edit date="December 2010"> This blog post describes an old Java bug that has since (to the best of my knowledge) been fixed and all affected versions EOL'd. Regardless, it remains a cautionary tale about the problem of leaky abstractions, and why it's important for developers to have some idea of what's going on under the hood. </edit>

Every Java standard library I have seen uses the same internal representation for String objects: a char[] array holding the string data, and two ints to represent the offset from the start of the array that the String starts, and the length of the String. So a String with an array of [ 't', 'h', 'e', ' ' , 'f', 'i', 's', 'h' ] , an offset of 4 and a length of 4 would be the string "fish".

This gives the JDK a certain amount of flexibility in how it handles Strings: for example it could efficiently create a pool of constant strings backed by just a single array and a bunch of different pointers. It also leads to some potential problems.

In the String source I looked at (and I'm pretty sure this is consistent across all Java standard library implementations), all of the major String constructors do the 'safe' thing when creating a new String object - they make a copy of only that bit of the incoming char[] array that they need, and throw the rest away. This is necessary because String objects must be immutable, and if they keep hold of a char[] array that may be modified outside the string, interesting things can happen.

String has, however, a package-private "quick" constructor that just takes a char array, offset and length and blats them directly into its private fields with no sanity checks, saving the time and memory overhead of array allocation/copying. One situation this constructor is used in is String#substring() . If you call substring() , you will get back a new String object containing a pointer to the same char[] array as the original string, just with a new offset and length to match the chunk you were after.

As such, substring() calls are incredibly fast: you're just allocating a new object and copying a pointer and two int values into it. On the other hand, it means that if you use substring() to extract a small chunk from a large string and then throw the large string away, the full data of that large string will continue to hang around in memory until all its substrings have been garbage-collected.

Which could mean you carrying around the complete works of Shakespeare in memory, even though all you wanted to hang on to was "What a piece of work is man!"

Another place this constructor is called from is StringBuffer. StringBuffer also stores its internal state as a char[] array and an integer length, so when you call StringBuffer#toString() , it sneaks those values directly into the String that is produced. To maintain the String's immutability, a flag is set so that any subsequent operation on the StringBuffer causes it to regenerate its internal array.

(This makes sense because the most common case is toString() being the last thing called on a StringBuffer before it is thrown away, so most of the time you save the cost of regenerating the array.)

The potential problem again lies in the size of the char[] array being passed around. The size of the array isn't bound by the size of the String represented by the buffer, but by the buffer's capacity. So if you initialise your StringBuffer to have a large capacity, then any String generated from it will occupy memory according to that capacity, regardless of the length of the resulting string.

How did this become relevant?

Well, some guys at work were running a profiler against Jira to work out why a particular instance was running out of memory. It turned out that the JDBC drivers of a certain commercial database vendor (who shall not be named because their license agreement probably prohibits the publishing of profiling data) were consistently producing strings that contained arrays of 32,768 characters, regardless of the length of the string being represented.

Our assumption is that because 32k is the largest size these drivers comfortably support for character data, they allocate a StringBuffer of that size, pour the data into it, and then toString() it to send it into the rest of the world.

Just to put the icing on the cake, if you have data larger than 32k characters, you overload the StringBuffer. When a StringBuffer overloads, it automatically doubles its capacity.

As a result, every single String retrieved from the database takes up some multiple of 64KB of memory (Java uses two-byte Unicode characters internally), most of it empty, wasted bytes.

Ouch.

The first computer I owned had 64KB of memory, and almost half of that was read-only. Which means every String object coming out of that driver is at least twice the size of a game of Wizball.

This turned out to be false. According to a reddit comment: “The c64 used bank switching to allow for a full 64KB of RAM and still provide for ROM and memory-mapped I/O.”

One possible workaround is that the constructor new String(String s) seems to "do the right thing", trimming down the internal array to the right size during the construction. So all you have to do is make an immediate copy of every String you get from the drivers, and make the artificially bloated string garbage as soon as possible.

Still, ouch.