How the JVM compares your strings using the craziest x86 instruction you've never heard of

We’ve all probably seen Java’s String comparison function before. It compares strings by the first differing character, falling back to the length difference when they are identical up to the end of the shorter string:

    public int compareTo(String anotherString) {
        int len1 = value.length;
        int len2 = anotherString.value.length;
        int lim = Math.min(len1, len2);
        char v1[] = value;
        char v2[] = anotherString.value;

        int k = 0;
        while (k < lim) {
            char c1 = v1[k];
            char c2 = v2[k];
            if (c1 != c2) {
                return c1 - c2;
            }
            k++;
        }
        return len1 - len2;
    }
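To make the two cases concrete, here is a small sketch that mirrors the loop above and runs it on a few inputs. The class and method names (`CompareToDemo`, `reference`) are mine, not the JDK's, and it goes through `charAt` because the `value` field is private.

```java
class CompareToDemo {
    // Mirrors the JDK loop via charAt(); `reference` is a hypothetical name.
    static int reference(String a, String b) {
        int lim = Math.min(a.length(), b.length());
        for (int k = 0; k < lim; k++) {
            char c1 = a.charAt(k);
            char c2 = b.charAt(k);
            if (c1 != c2) {
                return c1 - c2;              // first differing character decides
            }
        }
        return a.length() - b.length();      // shared prefix: length difference
    }

    public static void main(String[] args) {
        System.out.println(reference("abc", "abd"));    // 'c' - 'd' = -1
        System.out.println(reference("abc", "abcde"));  // 3 - 5 = -2
        System.out.println(reference("abc", "abc"));    // 0
    }
}
```

The sign is what most callers care about, but the magnitude is well defined too: it is either a character difference or a length difference.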

But did you know there is also a secret second implementation? String.compareTo is one of a few methods that are important enough to also get a special hand-rolled assembly version. On my machine, it looks something like this:

    # {method} 'compare' '(Ljava/lang/String;Ljava/lang/String;)I' in 'Test'
    # parm0:    rsi:rsi   = 'java/lang/String'
    # parm1:    rdx:rdx   = 'java/lang/String'
    #           [sp+0x20]  (sp of caller)
    7fe3ed1159a0: mov    %eax,-0x14000(%rsp)
    7fe3ed1159a7: push   %rbp
    7fe3ed1159a8: sub    $0x10,%rsp
    7fe3ed1159ac: mov    0x10(%rsi),%rdi
    7fe3ed1159b0: mov    0x10(%rdx),%r10
    7fe3ed1159b4: mov    %r10,%rsi
    7fe3ed1159b7: add    $0x18,%rsi
    7fe3ed1159bb: mov    0x10(%r10),%edx
    7fe3ed1159bf: mov    0x10(%rdi),%ecx
    7fe3ed1159c2: add    $0x18,%rdi
    7fe3ed1159c6: mov    %ecx,%eax
    7fe3ed1159c8: sub    %edx,%ecx
    7fe3ed1159ca: push   %rcx
    7fe3ed1159cb: cmovle %eax,%edx
    7fe3ed1159ce: test   %edx,%edx
    7fe3ed1159d0: je     0x00007fe3ed115a6f
    7fe3ed1159d6: movzwl (%rdi),%eax
    7fe3ed1159d9: movzwl (%rsi),%ecx
    7fe3ed1159dc: sub    %ecx,%eax
    7fe3ed1159de: jne    0x00007fe3ed115a72
    7fe3ed1159e4: cmp    $0x1,%edx
    7fe3ed1159e7: je     0x00007fe3ed115a6f
    7fe3ed1159ed: cmp    %rsi,%rdi
    7fe3ed1159f0: je     0x00007fe3ed115a6f
    7fe3ed1159f6: mov    %edx,%eax
    7fe3ed1159f8: and    $0xfffffff8,%edx
    7fe3ed1159fb: je     0x00007fe3ed115a4f
    7fe3ed1159fd: lea    (%rdi,%rax,2),%rdi
    7fe3ed115a01: lea    (%rsi,%rax,2),%rsi
    7fe3ed115a05: neg    %rax
    7fe3ed115a08: vmovdqu (%rdi,%rax,2),%xmm0
    7fe3ed115a0d: vpcmpestri $0x19,(%rsi,%rax,2),%xmm0
    7fe3ed115a14: jb     0x00007fe3ed115a40
    7fe3ed115a16: add    $0x8,%rax
    7fe3ed115a1a: sub    $0x8,%rdx
    7fe3ed115a1e: jne    0x00007fe3ed115a08
    7fe3ed115a20: test   %rax,%rax
    7fe3ed115a23: je     0x00007fe3ed115a6f
    7fe3ed115a25: mov    $0x8,%edx
    7fe3ed115a2a: mov    $0x8,%eax
    7fe3ed115a2f: neg    %rax
    7fe3ed115a32: vmovdqu (%rdi,%rax,2),%xmm0
    7fe3ed115a37: vpcmpestri $0x19,(%rsi,%rax,2),%xmm0
    7fe3ed115a3e: jae    0x00007fe3ed115a6f
    7fe3ed115a40: add    %rax,%rcx
    7fe3ed115a43: movzwl (%rdi,%rcx,2),%eax
    7fe3ed115a47: movzwl (%rsi,%rcx,2),%edx
    7fe3ed115a4b: sub    %edx,%eax
    7fe3ed115a4d: jmp    0x00007fe3ed115a72
    7fe3ed115a4f: mov    %eax,%edx
    7fe3ed115a51: lea    (%rdi,%rdx,2),%rdi
    7fe3ed115a55: lea    (%rsi,%rdx,2),%rsi
    7fe3ed115a59: dec    %edx
    7fe3ed115a5b: neg    %rdx
    7fe3ed115a5e: movzwl (%rdi,%rdx,2),%eax
    7fe3ed115a62: movzwl (%rsi,%rdx,2),%ecx
    7fe3ed115a66: sub    %ecx,%eax
    7fe3ed115a68: jne    0x00007fe3ed115a72
    7fe3ed115a6a: inc    %rdx
    7fe3ed115a6d: jne    0x00007fe3ed115a5e
    7fe3ed115a6f: pop    %rax
    7fe3ed115a70: jmp    0x00007fe3ed115a73
    7fe3ed115a72: pop    %rcx
    7fe3ed115a73: add    $0x10,%rsp
    7fe3ed115a77: pop    %rbp
    7fe3ed115a78: test   %eax,0x17ed6582(%rip)
    7fe3ed115a7e: retq

The code that generates this, MacroAssembler::string_compare in macroAssembler_x86.cpp, is well-documented for the curious. It’s worth noting that there is an even fancier version for modern systems using AVX2 (with its 256-bit vector registers) that I’m not going to cover here.

PCMPESTRIwhat?

Introduced in SSE4.2, pcmpestri is a member of the pcmpxstrx family of vectorized string comparison instructions. With a control byte to specify options for their complex functionality, they are complicated enough to get their own subsection in the x86 Instruction Set Reference. Intel even provides a flow diagram for our viewing pleasure:

Now that’s really putting the C in CISC!

The option bits for the control byte are specified as follows:

    -------0b  128-bit sources treated as 16 packed bytes.
    -------1b  128-bit sources treated as 8 packed words.
    ------0-b  Packed bytes/words are unsigned.
    ------1-b  Packed bytes/words are signed.
    ----00--b  Mode is equal any.
    ----01--b  Mode is ranges.
    ----10--b  Mode is equal each.
    ----11--b  Mode is equal ordered.
    ---0----b  IntRes1 is unmodified.
    ---1----b  IntRes1 is negated (1’s complement).
    --0-----b  Negation of IntRes1 is for all 16 (8) bits.
    --1-----b  Negation of IntRes1 is masked by reg/mem validity.
    -0------b  Index of the least significant, set, bit is used (regardless of corresponding input element validity). IntRes2 is returned in least significant bits of XMM0.
    -1------b  Index of the most significant, set, bit is used (regardless of corresponding input element validity). Each bit of IntRes2 is expanded to byte/word.
    0-------b  This bit currently has no defined effect, should be 0.
    1-------b  This bit currently has no defined effect, should be 0.

If you want to learn more, Section 4.1 of the Instruction Set Reference covers these options in detail.
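The option table can be read off mechanically, bit by bit. Here is a rough sketch of a decoder that does just that. `ControlByteDecoder` and `decode` are names I made up for illustration, and the wording of the output is mine rather than Intel's.

```java
class ControlByteDecoder {
    private static final String[] MODES = {"equal any", "ranges", "equal each", "equal ordered"};

    // Turns a pcmpxstrx control byte into a human-readable summary,
    // following the option table bit by bit.
    static String decode(int imm8) {
        String elements = (imm8 & 0x01) != 0 ? "8 packed words" : "16 packed bytes";
        String sign = (imm8 & 0x02) != 0 ? "signed" : "unsigned";
        String mode = MODES[(imm8 >> 2) & 0x3];
        String polarity;
        if ((imm8 & 0x10) == 0) {
            polarity = "IntRes1 unmodified";
        } else if ((imm8 & 0x20) == 0) {
            polarity = "IntRes1 negated (all bits)";
        } else {
            polarity = "IntRes1 negated (masked by validity)";
        }
        String index = (imm8 & 0x40) != 0 ? "most significant set bit" : "least significant set bit";
        return String.join(", ", elements, sign, mode, polarity, index);
    }

    public static void main(String[] args) {
        // 0x19 is the immediate on the vpcmpestri instructions in the listing above.
        System.out.println(decode(0x19));
    }
}
```

For 0x19 this reports 8 packed words, unsigned, equal each, negated for all bits, and the least significant set bit as the index.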

compareTo uses 0x19, which means doing the “equal each” (aka string comparison) operation across 8 unsigned words (thanks UTF-16!) with a negated result. This monster of an instruction takes 4 inputs: the 2 strings themselves as operands, plus their lengths in %rax and %rdx (‘e’ meaning explicit length; pcmpistri and pcmpistrm instead look for terminating nulls). The result (the index generated from IntRes2) is placed in %ecx. And just in case that wasn’t enough, the pcmpxstrx instructions repurpose the flags as well:

    CFlag – Reset if IntRes2 is equal to zero, set otherwise
    ZFlag – Set if absolute-value of EDX is < 16 (8), reset otherwise
    SFlag – Set if absolute-value of EAX is < 16 (8), reset otherwise
    OFlag – IntRes2[0]
    AFlag – Reset
    PFlag – Reset
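If it helps to see that as ordinary code, here is a rough Java emulation of what a single pcmpestri with control byte 0x19 computes. It is a sketch of my reading of the manual, not a reference model. `PcmpestriSketch` and its methods are invented names, and the handling of lanes past the end of a string (equal only if both are past the end) is my interpretation of the validity rules for “equal each”.

```java
class PcmpestriSketch {
    static final int WORDS = 8;   // 128 bits / 16-bit chars

    // Explicit lengths are read as absolute values and saturate at 8 words.
    static int validCount(int lengthRegister) {
        return (int) Math.min(Math.abs((long) lengthRegister), WORDS);
    }

    /**
     * Emulates the index result (%ecx) for control byte 0x19.
     * Returns the first lane where the operands differ (CFlag would be set),
     * or 8 when no lane differs (CFlag would be reset).
     */
    static int equalEachNegated(char[] a, int lenA, char[] b, int lenB) {
        int validA = validCount(lenA);
        int validB = validCount(lenB);
        for (int i = 0; i < WORDS; i++) {
            boolean aValid = i < validA;
            boolean bValid = i < validB;
            boolean equal;
            if (aValid && bValid) {
                equal = a[i] == b[i];
            } else {
                // Assumption: past the end of both counts as equal, past only one does not.
                equal = !aValid && !bValid;
            }
            if (!equal) {
                return i;   // lowest set bit of the negated result
            }
        }
        return WORDS;
    }

    public static void main(String[] args) {
        char[] a = "abcdefgh".toCharArray();
        char[] b = "abcdXfgh".toCharArray();
        System.out.println(equalEachNegated(a, -8, b, 8));  // 4: lanes differ at 'e' vs 'X'
        System.out.println(equalEachNegated(a, 8, a, 8));   // 8: no difference, CFlag reset
    }
}
```

Returning 8 for "no difference" matches the word-mode convention that the index equals the element count when no bit is set in IntRes2.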

With all of that out of the way, let’s look at the main loop in detail, with some setup before it for context:

    7fe3ed1159f6: mov    %edx,%eax
    7fe3ed1159f8: and    $0xfffffff8,%edx
    7fe3ed1159fd: lea    (%rdi,%rax,2),%rdi
    7fe3ed115a01: lea    (%rsi,%rax,2),%rsi
    7fe3ed115a05: neg    %rax
    7fe3ed115a08: vmovdqu (%rdi,%rax,2),%xmm0
    7fe3ed115a0d: vpcmpestri $0x19,(%rsi,%rax,2),%xmm0
    7fe3ed115a14: jb     0x00007fe3ed115a40
    7fe3ed115a16: add    $0x8,%rax
    7fe3ed115a1a: sub    $0x8,%rdx
    7fe3ed115a1e: jne    0x00007fe3ed115a08

Going in, %rax is the minimum of the strings’ lengths, and %rdx is that minimum masked by ~0x7 (so 8 times the number of loop iterations). The code then bumps the pointers into the character arrays (%rsi and %rdi) by the minimum length and negates %rax, so the main loop indexes each array with a negative offset from that end point, counting up toward zero. After loading 8 characters of the first string into %xmm0, it does the comparison against 8 characters of the second, jumping out if CFlag is set (which means the index of the differing character is in %ecx). Otherwise it adjusts the 2 length registers and checks whether this was the last iteration (which would make %rdx 0). How does a negative number make a valid length? Oops, almost forgot to mention that pcmpestri actually considers the lengths to be the absolute value:

The length of each input is interpreted as being the absolute-value of the value in the length register.

Following the main loop, there is a fallthrough case to check the remaining characters when the minimum length isn’t a multiple of 8, and then the final case of diffing the lengths when the strings are identical up to the shorter string’s length. Phew!
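Putting the pieces together, the overall control flow can be sketched in plain Java. This is my reading of the listing rather than a translation of the HotSpot source. `VectorizedCompareSketch`, `firstMismatch` and `compare` are hypothetical names, the 8-lane helper stands in for vpcmpestri, and the early checks on the first character and on pointer equality are left out for brevity. In the listing, the leftover characters appear to be handled by re-comparing the last 8 characters as one overlapping window, so the sketch does the same.

```java
class VectorizedCompareSketch {
    static final int LANES = 8;   // chars per 128-bit register

    // Stand-in for vpcmpestri $0x19 with all lanes valid:
    // index of the first differing char in the window, or 8 if none.
    static int firstMismatch(char[] a, char[] b, int start) {
        for (int i = 0; i < LANES; i++) {
            if (a[start + i] != b[start + i]) {
                return i;
            }
        }
        return LANES;
    }

    static int compare(char[] a, char[] b) {
        int min = Math.min(a.length, b.length);
        int lengthDiff = a.length - b.length;

        if (min < LANES) {
            // Too short for a full window: plain scalar loop.
            for (int k = 0; k < min; k++) {
                if (a[k] != b[k]) {
                    return a[k] - b[k];
                }
            }
            return lengthDiff;
        }

        int vectorized = min & ~(LANES - 1);   // minimum length masked by ~0x7
        for (int k = 0; k < vectorized; k += LANES) {
            int idx = firstMismatch(a, b, k);
            if (idx < LANES) {
                return a[k + idx] - b[k + idx];
            }
        }

        if (vectorized != min) {
            // Leftover chars: one overlapping window over the last 8 chars.
            int start = min - LANES;
            int idx = firstMismatch(a, b, start);
            if (idx < LANES) {
                return a[start + idx] - b[start + idx];
            }
        }
        return lengthDiff;   // identical up to the shorter length
    }

    public static void main(String[] args) {
        String s = "the quick brown fox";
        String t = "the quick brown fix";
        System.out.println(compare(s.toCharArray(), t.toCharArray()) == s.compareTo(t));
    }
}
```

The overlapping window is safe because everything before the leftover characters has already been found equal, so any mismatch it reports must lie in the part not yet checked.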

More matching fun

If this wasn’t complicated enough for you, have a quick gander at the indexOf implementations (there are 2, depending on the size of the matching string), which use control byte 0x0d , which does “equal ordered” (aka substring) matching.

As always, if you are crazy enough to find weird JVM internals interesting, you should totally follow me on Twitter.