Curious case of an Android compiler bug, a[(int) longValue] goes kaboom.

It’s never fun to wake up to weird bugs that are impossible and didn’t happen on one of your test devices. This is a case of an impossible condition made possible by a compiler bug. (Ack, not again, I don’t like framework bugs…)

One of the pitfalls in Android is that Context.obtainStyledAttributes(int[] attrs) requires that the attributes are sorted in ascending order, which is done by aapt if you use <declare-stylable>, but if you need to lookup a few colors in runtime, you need something to sort the attributes and remap the return values.

We just added one as we tend to do that quite often, and the “algorithm” is simple — just allocate a long[], put the value in the top 32 bits, the index in the lower 32, Arrays.sort and unpack again. But on certain devices, it fails with a mysterious exception:

Which is impossible in the code flow — relevant parts are pasted into this gist.

First, a temporary array is allocated, then sorted into a long array with (((long) value << 32) | index). After that, sortedAttrs(sorted, output) returns output, with only the attribute ids (int)(long >> 32), and sortedIndices(sorted, output) does the same, but with ((int) output). Then it crashes in that function, saying that output which is just used by sortedAttrs is suddenly null…

What. How does tmp suddenly become null on certain devices, and how do you fix this without a device that can reproduce this issue? You know what, the tmp int[] can be allocated twice, I’ll just deoptimize it so it can never become null, in both sortedAttrs and sortedIndices. Here comes 1.7.3d!

Package for release, check in the artifact and remapping table, upload, resume staged roll out, monitor ACRA backend for a few hours, no more errors, big success!

Next morning:

Oh you’re kidding me… What the hell? ACRA! <nothing there>. Uh oh, I’ve a bad baaad feeling about this. Dev console:

<Impossible Java Exception> has evolved to… <Native Crash>!

Uh oh. ACRA cannot catch native crashes so this went unnoticed, and worse, instead of weird devices that always have issues like Elephone, Coolpad, Oneplus*, it’s more widespread than expected.

Okay, at least we now know that a tiny tiny diff is causing this issue. Wait a minute, did I see a Huawei ALE-L21 in the list? Yay, we have one of those, purchased for testing because they tend to cause so many headaches with device/rom specific issues.

Deploy. Launch. Crash. Hm. Baksmali time. We have a working version in debug, we have a broken version in release, what the heck is the difference? Turns out, in release build, out[indexOf(sorted[i])] is optimized into out[(int) sorted[i]], which is correct in smali and works fine on all arm32, arm64 @ M+ builds, but not on arm64 @ L.

Root cause analysis

After a while, we narrowed it down to a single line of code

out[indexOf(sorted[i])] = i

Yup. That’s it. The cause of all evil. This works:

This crashes:

So when you issue a statement for array[(int) 0xffffffff00000001], instead of array[1], it fails to clear the upper bits and writes to a memory address waaay out of bounds, causing a SIGSEGV. Now we know what it is, the fix was stupidly easy. @Keep the indexOf function. The call and return clears the upper half of the 64-bit register correctly so you no longer get segfaults.

Apply workaround, create build, check in artifacts, upload to store, apologize to all the users complaining. After that, I distilled the bug into a small repro gist for the compiler team, ready for bug submission.

What also fixed it was rewriting the smali manually — allocate one extra local, and long-to-int v4, v0 then aput v3, v2, v4, but that kind of patches are hard to apply consistently on build, we’ll just take the invoke-static performance hit.

Rooter cause analysis

But there’s still a nagging feeling, how about we disassemble whatever dex2oat generated on a bugged device to get a look inside? Siebe dumped the oat and extracted the relevant part:

0x0025: aget-wide v2, v6, v0

0x0027: long-to-int v2, v2

0x0028: aput v0, v1, v2

And in arm64 assembly:

0x0025: Looks fine. Read 2x 32 bit words into w0, w1, call array get

0x0027: Oi, where did the cast go.

0x0028: Huh. Seems like I cannot read assembly that well, but it looks like a call to a store-in-array subroutine. I stick to Java and smali for a reason.

Either a force cast to int is missing, or the subroutine at #1048 implementing aput incorrectly uses a whole 64 bit register instead of the relevant 32 bit only. On M, the generated code is way longer at 49 lines, but oh well. We have our workaround, and if the ART guys want to publish a patch for 5.x, they can with the test case — my work is done here.

Note to self: keep rolling out to 5% max, this is the second native / JIT / ARM bomb we stepped on this month…

Rootest cause analysis

Eh, we still have no clue how the “array is null” exception is created, but if you’re really really interested, I can give you a copy of 1.7.3c so you can debug the heck out of it.

*: yes really, they sure know how to do marketing but their devices generates the craziest stack traces. Do not want. Or well, I’d want one to make sure our stuff doesn’t crash, but not to use as a phone.