230 String Substring

Author: Dr. Heinz M. Kabutz
Date: 2015-06-30
Java Version: 8
Category: Tips and Tricks

Abstract: Java 7 quietly changed the structure of String. Instead of a char[] plus an offset and a count, the String now only contained a char[]. This had some harmful effects for those expecting that substring() would always share the underlying char[].

Welcome to the 230th edition of The Java(tm) Specialists' Newsletter, written on the Island of Crete in GREECE. By now, you will have heard about the ATMs drying up. It is surprisingly calm here. Some petrol stations ran out of gas as my fellow Cretans decided for the first time in years to completely fill their tanks. But that was resolved within a day. ATMs are issuing only 60 EURO per day to us locals (no limit for visitors), but there seems to be no limit inside the supermarkets and restaurants. It's also easy to find those ATMs with money - just check for a bunch of people standing in a queue. As one friend of mine said - thanks to these lines of people, he's now discovering ATMs that he didn't even know existed ;-)

Chania seems as deserted as it is during winter, with a noticeable reduction in tourists. This is the best possible time for you to come visit Crete! The Cretan hospitality is shining through even more than usual. If you can speak Greek and you take the time to sit with a grandfather in his 70s, you will hear the reality of what he's been going through. As a tourist, you won't see any of that. You'll just be invited to drink a glass of tsikoudia with a hearty smack on your back. You can probably find excellent deals at this time with flights and hotels. The beaches are exactly the same as last year, albeit with fewer drunken louts. The food is still delicious. The weather is fine and warm. Come. You won't regret it. And by the way, Chania is the best part of Crete, especially the Akrotiri :-)


String Substring

String is ubiquitous in Java programs. It has changed in quite a few ways over the last generations of Java. For example, in very early versions, the generated code for appending several non-constant Strings together would be either a call to concat() or a StringBuffer. In Java 1.0 and 1.1, the hashCode() function would check the size of the String and, if it was too long, would add up every 8th character instead of every one. Of course, considering memory layout, that optimization would not have been all that effective anyway. In Java 2, they changed that to always use every character, and in Java 3, they cached the hash code. Whilst this sounds sensible, it wasn't. There are almost no cases where it helps in real code, and because a cached value of zero means "not yet calculated", it introduces an assumption that a real hash code is unlikely to be zero. That assumption is easy to break. Once you find one combination of characters that has a zero hash code, you can produce an arbitrarily long series, since concatenating zero-hash Strings gives another zero-hash String. For such Strings, hashCode() is recalculated on every call, so what should be a constant time read of the cached value potentially becomes O(n). They tried to fix this in Java 7 with the hash32() calculation, which would never allow a zero value. However, I see that is also gone again in Java 8.
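To make the zero hash code point concrete, here is a minimal sketch. The class name ZeroHash and the helper repeat() are made up for this illustration, and "f5a5a608" is simply one known String whose hash code works out to zero under the documented String.hashCode() formula. Since the hash of a concatenation a + b is hash(a) * 31^length(b) + hash(b), gluing zero-hash Strings together always yields another zero-hash String:

```java
public class ZeroHash {
    /** Joins 'times' copies of s (hypothetical helper; Java 8 has no String.repeat()). */
    public static String repeat(String s, int times) {
        StringBuilder sb = new StringBuilder(s.length() * times);
        for (int i = 0; i < times; i++) {
            sb.append(s);
        }
        return sb.toString();
    }

    public static void main(String... args) {
        String zero = "f5a5a608";
        System.out.println(zero.hashCode());   // prints 0

        // Concatenating zero-hash Strings gives another zero-hash String
        String longer = repeat(zero, 100_000);
        System.out.println(longer.length());   // prints 800000
        System.out.println(longer.hashCode()); // prints 0
    }
}
```

On the String implementations discussed in this newsletter, a cached value of 0 is indistinguishable from "not yet calculated", so every hashCode() call on the long String would walk all 800,000 characters again. Later JDKs may handle this case differently.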

Recently, my co-trainer Maurice Naftalin (author of Mastering Lambdas) and I taught our Extreme Java course together, which focuses on concurrency and performance. I always spend a bit of time on String, as it is used so much and does tend to appear near the top of many a profile. From Java 1.0 up to 6, String tried to avoid creating new char[]'s. The substring() method would share the same underlying char[], with a different offset and length. For example, in StringChars we have two Strings, with "hello" a substring of "hello_world". However, they share the same char[]:

    import java.lang.reflect.*;

    public class StringChars {
      public static void main(String... args)
          throws NoSuchFieldException, IllegalAccessException {
        // Peek at the private array inside String
        Field value = String.class.getDeclaredField("value");
        value.setAccessible(true);

        String hello_world = "Hello world";
        String hello = hello_world.substring(0, 5);
        System.out.println(hello);

        // Printing an array shows its type and identity hash code,
        // so identical lines mean the same char[] instance
        System.out.println(value.get(hello_world));
        System.out.println(value.get(hello));
      }
    }

In Java 1 through 6, we would see output like this:

    Hello
    [C@721cdeff
    [C@721cdeff

However, in Java 7 and 8, it would instead produce output with a different char[]:

    Hello
    [C@49476842
    [C@78308db1

"Why this change?", you may ask. It turns out that too many programmers used substring() as a memory saving method. Let's say that you have a 1 MB String, but you actually only need the first 5 KB. You could then create a substring, expecting the rest of that 1 MB String to be thrown away. Except it wasn't. Since the new String would share the same underlying char[], you would not save any memory at all. The correct code idiom was therefore to append the substring to an empty String, which would have the side effect of always producing a new unshared char[] in the case that the String length did not correspond to the char[] length:

    String hello = "" + hello_world.substring(0, 5);

During our course, the customer remarked that they had a real issue with this new Java 7 and 8 approach to substrings. In the past they assumed that a substring would generate a minimum of garbage, whereas nowadays the cost can be quite high. In order to measure how many bytes exactly are being allocated, I wrote a little Memory class that uses a little-known ThreadMXBean feature. The details will be the subject of another newsletter:

    import javax.management.*;
    import java.lang.management.*;

    public class Memory {
      // Returns the number of bytes allocated so far by the calling thread
      public static long threadAllocatedBytes() {
        try {
          return (Long) ManagementFactory.getPlatformMBeanServer()
              .invoke(
                  new ObjectName(
                      ManagementFactory.THREAD_MXBEAN_NAME),
                  "getThreadAllocatedBytes",
                  new Object[]{Thread.currentThread().getId()},
                  new String[]{long.class.getName()}
              );
        } catch (Exception e) {
          throw new IllegalArgumentException(e);
        }
      }
    }
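As an aside, the same counter can also be read without going through the MBean server, by casting the platform ThreadMXBean to the HotSpot-specific com.sun.management.ThreadMXBean interface. This is only a sketch of an alternative, not the approach used in the rest of this newsletter. It assumes an Oracle or OpenJDK style JVM, and the class name AllocationProbe, the sink field and the method measureArrayAllocation() are invented for this example:

```java
import java.lang.management.ManagementFactory;

public class AllocationProbe {
    // Kept in a static field so that the allocation cannot be optimised away
    static byte[] sink;

    /** Bytes allocated so far by the calling thread, or -1 if not supported. */
    public static long threadAllocatedBytes() {
        java.lang.management.ThreadMXBean bean =
            ManagementFactory.getThreadMXBean();
        // Assumption: a HotSpot-based JVM; other JVMs may not implement this
        if (bean instanceof com.sun.management.ThreadMXBean) {
            com.sun.management.ThreadMXBean hotspotBean =
                (com.sun.management.ThreadMXBean) bean;
            if (hotspotBean.isThreadAllocatedMemorySupported()
                    && hotspotBean.isThreadAllocatedMemoryEnabled()) {
                return hotspotBean.getThreadAllocatedBytes(
                    Thread.currentThread().getId());
            }
        }
        return -1;
    }

    /** Measures the bytes allocated by creating one byte[] of the given size. */
    public static long measureArrayAllocation(int size) {
        long before = threadAllocatedBytes();
        sink = new byte[size];
        return threadAllocatedBytes() - before;
    }

    public static void main(String... args) {
        // Expect a little over one megabyte where the feature is supported
        System.out.printf("%,d%n", measureArrayAllocation(1 << 20));
    }
}
```

The cast avoids the string-based invoke() call, at the price of compiling against a com.sun class. Which of the two is preferable is a matter of taste.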

Let's say that I have a large string that I would like to break up into smaller chunks:

    import java.util.*;

    public class LargeString {
      public static void main(String... args) {
        // One String of 10 million characters
        char[] largeText = new char[10 * 1000 * 1000];
        Arrays.fill(largeText, 'A');
        String superString = new String(largeText);

        // Measure only what the substring loop allocates
        long bytes = Memory.threadAllocatedBytes();
        String[] subStrings = new String[largeText.length / 1000];
        for (int i = 0; i < subStrings.length; i++) {
          subStrings[i] = superString.substring(
              i * 1000, i * 1000 + 1000);
        }
        bytes = Memory.threadAllocatedBytes() - bytes;
        System.out.printf("%,d%n", bytes);
      }
    }

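In Java 6, the LargeString class generates 360,984 bytes, but in Java 7, it goes up to a whopping 20,441,536 bytes. That's quite a jump! The difference is the copying: each of the 10,000 substrings now gets its own 1,000-character char[] of roughly 2 KB, which adds up to about 20 MB. You can run this code yourself to try it out on your machine.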
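The two measurements can be roughly reconstructed with some back-of-the-envelope arithmetic. The sketch below is only an estimate. The object sizes are my assumptions for a 64-bit HotSpot JVM with compressed references (16 byte array headers, 4 byte references, 32 bytes for a Java 6 String and 24 bytes for a Java 7 String), and the class name AllocationEstimate is invented for this example:

```java
public class AllocationEstimate {
    // Assumed sizes in bytes (64-bit HotSpot, compressed references).
    // These are estimates for illustration, not measured values.
    static final int ARRAY_HEADER = 16;
    static final int REFERENCE = 4;
    static final int JAVA6_STRING = 32; // header, value, offset, count, hash, padding
    static final int JAVA7_STRING = 24; // header, value, hash, hash32

    /** Objects are assumed to be aligned to 8 bytes. */
    static long align8(long bytes) {
        return (bytes + 7) / 8 * 8;
    }

    static long referenceArray(int length) {
        return align8(ARRAY_HEADER + (long) REFERENCE * length);
    }

    static long charArray(int length) {
        return align8(ARRAY_HEADER + 2L * length);
    }

    /** Java 6: each substring is just a new String sharing the char[]. */
    public static long java6Estimate(int subStrings) {
        return referenceArray(subStrings) + (long) subStrings * JAVA6_STRING;
    }

    /** Java 7: each substring is a new String plus its own char[] copy. */
    public static long java7Estimate(int subStrings, int charsEach) {
        return referenceArray(subStrings)
            + (long) subStrings * (JAVA7_STRING + charArray(charsEach));
    }

    public static void main(String... args) {
        System.out.println(java6Estimate(10_000));        // prints 360016
        System.out.println(java7Estimate(10_000, 1_000)); // prints 20440016
    }
}
```

Under these assumptions the estimates land within about 1,500 bytes of the measured 360,984 and 20,441,536, which suggests that practically all of the extra allocation in Java 7 is the copied char[]s.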

Unfortunately, if we want to have the memory allocation saving of Java 6, we need to write our own String class. Fortunately, that is not too hard with the CharSequence interface. Please note that my SubbableString is not thread safe, nor is it meant to be. I used Brian Goetz's annotation, albeit in a comment:

    //@NotThreadSafe
    public class SubbableString implements CharSequence {
      private final char[] value;
      private final int offset;
      private final int count;

      // Note: the array is not copied, so the caller must not modify it
      public SubbableString(char[] value) {
        this(value, 0, value.length);
      }

      private SubbableString(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
      }

      public int length() {
        return count;
      }

      public String toString() {
        return new String(value, offset, count);
      }

      public char charAt(int index) {
        if (index < 0 || index >= count)
          throw new StringIndexOutOfBoundsException(index);
        return value[index + offset];
      }

      // Shares the underlying char[], like substring() did up to Java 6
      public CharSequence subSequence(int start, int end) {
        if (start < 0) {
          throw new StringIndexOutOfBoundsException(start);
        }
        if (end > count) {
          throw new StringIndexOutOfBoundsException(end);
        }
        if (start > end) {
          throw new StringIndexOutOfBoundsException(end - start);
        }
        return (start == 0 && end == count) ? this :
            new SubbableString(value, offset + start, end - start);
      }
    }

If we now use CharSequence instead of String in the test, we can avoid creating all those unnecessary char[]s. Here is the revised test:

    import java.util.*;

    public class LargeSubbableString {
      public static void main(String... args) {
        char[] largeText = new char[10 * 1000 * 1000];
        Arrays.fill(largeText, 'A');
        CharSequence superString = new SubbableString(largeText);

        // Measure only what the subSequence loop allocates
        long bytes = Memory.threadAllocatedBytes();
        CharSequence[] subStrings = new CharSequence[
            largeText.length / 1000];
        for (int i = 0; i < subStrings.length; i++) {
          subStrings[i] = superString.subSequence(
              i * 1000, i * 1000 + 1000);
        }
        bytes = Memory.threadAllocatedBytes() - bytes;
        System.out.printf("%,d%n", bytes);
      }
    }

With that improvement, we now allocate roughly 281,000 bytes on Java 6, 7 and 8. For Java 7 and 8, that is about 72 times less than the String version!
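SubbableString is hand-rolled, but the JDK itself also ships a CharSequence whose subSequence() shares its backing array: java.nio.CharBuffer. The sketch below only demonstrates that sharing. I have not measured its allocation in the test above, and the class name SharedCharBuffer and the method demo() are invented for this example:

```java
import java.nio.CharBuffer;

public class SharedCharBuffer {
    /** Returns the subsequence before and after changing the backing array. */
    public static String[] demo() {
        char[] text = "Hello world".toCharArray();
        CharSequence whole = CharBuffer.wrap(text);
        CharSequence hello = whole.subSequence(0, 5);

        String before = hello.toString();
        // Changing the array shows through in the subsequence,
        // which proves that no copy of the characters was made
        text[0] = 'J';
        String after = hello.toString();
        return new String[]{before, after};
    }

    public static void main(String... args) {
        String[] result = demo();
        System.out.println(result[0]); // prints Hello
        System.out.println(result[1]); // prints Jello
    }
}
```

The flip side of sharing is visible here too: whoever holds the char[] can change what the CharSequence contains, so this is only suitable where the array is never modified after wrapping. The same caveat applies to SubbableString.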

Please keep this new "feature" in mind when you do your migration from Java 6 to Java 8. I know, too many of my customers are stuck on 6 and are finding it hard to find a business case for funding the move. Besides the syntactic advantages in Java 7 and 8, you will also want to move away from the bugs still stuck in Java 6. The sooner the better!

Kind regards

Heinz

