Android phones have been growing ever more powerful with time, with the Nexus 5 sporting a quad-core 2.3 GHz Krait 400; this is a very powerful CPU for a mobile phone. With most Android apps being written in Java, does Java allow us to access all of that power? Or, put another way, is Java efficient enough, allowing tasks to complete more quickly and allowing the CPU to idle more, saving precious battery life?

(Note: An updated version of this comparison is available at A Performance Comparison Redux: Java, C, and Renderscript on the Nexus 5, along with source code).

In this post, I will take a look at a DSP filter adapted from coefficients generated with mkfilter, and compare three different implementations: one in C, one in Java, and one in Java with some manual optimizations. The source for these tests can be downloaded at the end of this post.

To compare the results, I ran the filter over an array of random data on the Nexus 5, and the compared the results to the fastest implementation. In the following table, a lower runtime is better, with the fastest implementation getting a relative runtime of 1.

Execution environment Options Relative runtime (lower is better) gcc 4.8 1.00 gcc 4.8 (LOCAL_ARM_NEON := true) -ffast-math -O3 1.02 gcc 4.8 -ffast-math -O3 1.05 clang 3.4 (LOCAL_ARM_NEON := true) -ffast-math -O3 1.27 clang 3.4 -ffast-math -O3 1.42 clang 3.4 1.43 ART (manually-optimized) 2.22 Dalvik (manually-optimized) 2.87 ART (normal code) 7.99 Dalvik (normal code) 17.78

The statically-compiled C code gave the best execution times, followed by ART and then by Dalvik. The C code uses JNI via GetShortArrayRegion and SetShortArrayRegion to marshal the data from Java to C, and then back from C to Java once processing has completed.

The best performance came courtesy of GCC 4.8, with little variation between the different additional optimization options. Clang’s ARM builds are not quite as optimized as GCC’s; toggling LOCAL_ARM_NEON := true in the NDK makefile also makes a clear difference in performance.

Even the slowest native build using clang is not more than 43% slower than the best native build using gcc. Once we switch to Java, the variance starts to increase significantly, with the best runtime about 2.2x slower than native code, and the worst runtime a staggering 17.8x slower.

What explains the large difference? For one, it appears that both ART and Dalvik are limited in the amount of static optimizations that they are capable of. This is understandable in the case of Dalvik, since it uses a JIT and it’s also much older, but it is disappointing in the case of ART, since it uses ahead-of-time compilation.

Is there a way to speed up the Java code? I decided to try it out, by applying the same static optimizations I would have expected the compiler to do, like converting modulo to bit masks and inlining function calls. These changes resulted in one massive and hard to read function, but they also dramatically improved the runtime performance, with Dalvik speeding up from a 17.8x penalty to 2.9x, and ART speeding up from an 8.0x penalty to 2.2x.

The downside of this is that the code has to be abused to get this additional performance, and it still doesn’t come close to matching the ahead-of-time code generated by gcc and clang, which can surpass that performance without similar abuse of the code. The NDK is still a viable option for those looking for improved performance and more efficient code which consumes less battery over time.

Just for fun, I decided to try things out on a laptop with a 2.6 GHz Intel Core i7. For this table, the relative results are in the other direction, with 1x corresponding to the best time on the Nexus 5, 2x being twice as fast, and so on. The table starts with the best results first, as before.

Execution environment Options Relative speed (higher is better) clang 3.4 -O3 -ffast-math -flto 8.38x clang 3.4 -O3 -ffast-math 6.09x Java SE 1.7u51 (manually-optimized) -XX:+AggressiveOpts 5.25x Java SE 1.6u65 (manually-optimized) 3.85x Java SE 1.6 (normal code) 2.44x

As on the Nexus 5, the C code runs faster, but to Java’s credit, the gap between the best & worst result is less than 4x, which is much less variance than we see with Dalvik or ART. Java 1.6 and 1.7 are very close to each other, unless “-XX:+AggressiveOpts” is used; with that option enabled, 1.7 is able to pull ahead.

There is still an unfortunate gap between the “normal” code and the manually-optimized code, which really should be closable with static analysis and inlining.

The other interesting result is that the gap between mobile and PC is closing over time, and even more so if you take power consumption into account. It’s quite impressive to see that as far as single-core performance goes, the PC and smartphone are closer than ever.

Conclusion

Recent Android devices are getting very powerful, and with the new ART runtime, common Java code can be executed quickly enough to keep user interfaces responsive and users happy.

Sometimes, though, we need to go further, and write demanding code that needs to run quickly and efficiently. With the latest Android devices, these algorithms may be able to run quickly enough in the Dalvik VM or with ART, but then we have to ask ourselves: is the benefit of using a single language worth the cost of lower performance? This isn’t just an academic question: lower performance means that we need to ask our users to give us more CPU cycles, which shortens their device’s battery life, heats up their phones, and makes them wait longer for results, and all because we didn’t want to write the code in another language.

For these reasons, writing some of our code in C/C++, FORTRAN, or another native language can still make a lot of sense.

For more reading on this topic, check out How Powerful is Your Nexus 7?

Source

dsp.c

#include "dsp.h" #include <algorithm> #include <cstdint> #include <limits> static constexpr int int16_min = std::numeric_limits<int16_t>::min(); static constexpr int int16_max = std::numeric_limits<int16_t>::max(); static inline int16_t clamp(int input) { return std::max(int16_min, std::min(int16_max, input)); } static inline int get_offset(const FilterState& filter_state, int relative_offset) { return (filter_state.current + relative_offset) % filter_state.size; } static inline void push_sample(FilterState& filter_state, int16_t sample) { filter_state.input[get_offset(filter_state, 0)] = sample; ++filter_state.current; } static inline int16_t get_output_sample(const FilterState& filter_state) { return clamp(filter_state.output[get_offset(filter_state, 0)]); } static inline void apply_lowpass(FilterState& filter_state) { double* x = filter_state.input; double* y = filter_state.output; y[get_offset(filter_state, 0)] = ( 1.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state, -10)] + x[get_offset(filter_state, -0)])) + ( 10.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state, -9)] + x[get_offset(filter_state, -1)])) + ( 45.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state, -8)] + x[get_offset(filter_state, -2)])) + (120.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state, -7)] + x[get_offset(filter_state, -3)])) + (210.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state, -6)] + x[get_offset(filter_state, -4)])) + (252.0 * (1.0 / 6.928330802e+06) * x[get_offset(filter_state, -5)]) + ( -0.4441854896 * y[get_offset(filter_state, -10)]) + ( 4.2144719035 * y[get_offset(filter_state, -9)]) + ( -18.5365677633 * y[get_offset(filter_state, -8)]) + ( 49.7394321983 * y[get_offset(filter_state, -7)]) + ( -90.1491003509 * y[get_offset(filter_state, -6)]) + ( 115.3235358151 * y[get_offset(filter_state, -5)]) + (-105.4969191433 * y[get_offset(filter_state, -4)]) + ( 68.1964705422 * y[get_offset(filter_state, -3)]) + ( -29.8484881821 * y[get_offset(filter_state, -2)]) + ( 8.0012026712 * y[get_offset(filter_state, -1)]); } void apply_lowpass(FilterState& filter_state, const int16_t* input, int16_t* output, int length) { for (int i = 0; i < length; ++i) { push_sample(filter_state, input[i]); apply_lowpass(filter_state); output[i] = get_output_sample(filter_state); } }

dsp.h

#include <cstdint> struct FilterState { static constexpr int size = 16; double input[size]; double output[size]; unsigned int current; FilterState() : input{}, output{}, current{} {} }; void apply_lowpass(FilterState& filter_state, const int16_t* input, int16_t* output, int length);

Here is the Java adaptation of the C code:

package com.example.perftest; import com.example.perftest.DspJavaManuallyOptimized.FilterState; public class DspJava { public static class FilterState { static final int size = 16; final double input[] = new double[size]; final double output[] = new double[size]; int current; } static short clamp(short input) { return (short) Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, input)); } static int getOffset(FilterState filterState, int relativeOffset) { return ((filterState.current + relativeOffset) % FilterState.size + FilterState.size) % FilterState.size; } static void pushSample(FilterState filterState, short sample) { filterState.input[getOffset(filterState, 0)] = sample; ++filterState.current; } static short getOutputSample(FilterState filterState) { return clamp((short) filterState.output[getOffset(filterState, 0)]); } static void applyLowpass(FilterState filterState) { final double[] x = filterState.input; final double[] y = filterState.output; y[getOffset(filterState, 0)] = ( 1.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState, -10)] + x[getOffset(filterState, -0)])) + ( 10.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState, -9)] + x[getOffset(filterState, -1)])) + ( 45.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState, -8)] + x[getOffset(filterState, -2)])) + (120.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState, -7)] + x[getOffset(filterState, -3)])) + (210.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState, -6)] + x[getOffset(filterState, -4)])) + (252.0 * (1.0 / 6.928330802e+06) * x[getOffset(filterState, -5)]) + ( -0.4441854896 * y[getOffset(filterState, -10)]) + ( 4.2144719035 * y[getOffset(filterState, -9)]) + ( -18.5365677633 * y[getOffset(filterState, -8)]) + ( 49.7394321983 * y[getOffset(filterState, -7)]) + ( -90.1491003509 * y[getOffset(filterState, -6)]) + ( 115.3235358151 * y[getOffset(filterState, -5)]) + (-105.4969191433 * y[getOffset(filterState, -4)]) + ( 68.1964705422 * y[getOffset(filterState, -3)]) + ( -29.8484881821 * y[getOffset(filterState, -2)]) + ( 8.0012026712 * y[getOffset(filterState, -1)]); } public static void applyLowpass(FilterState filterState, short[] input, short[] output, int length) { for (int i = 0; i < length; ++i) { pushSample(filterState, input[i]); applyLowpass(filterState); output[i] = getOutputSample(filterState); } } }

Since all of the Java runtimes tested don’t exploit static optimization opportunities as well as it seems that they could, here is an optimized version that has been inlined and has the modulo replaced with a bit mask:

package com.example.perftest; public class DspJavaManuallyOptimized { public static class FilterState { static final int size = 16; final double input[] = new double[size]; final double output[] = new double[size]; int current; } public static void applyLowpass(FilterState filterState, short[] input, short[] output, int length) { for (int i = 0; i < length; ++i) { filterState.input[(filterState.current + 0) & (FilterState.size - 1)] = input[i]; ++filterState.current; final double[] x = filterState.input; final double[] y = filterState.output; y[(filterState.current + 0) & (FilterState.size - 1)] = ( 1.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -10) & (FilterState.size - 1)] + x[(filterState.current + -0) & (FilterState.size - 1)])) + ( 10.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -9) & (FilterState.size - 1)] + x[(filterState.current + -1) & (FilterState.size - 1)])) + ( 45.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -8) & (FilterState.size - 1)] + x[(filterState.current + -2) & (FilterState.size - 1)])) + (120.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -7) & (FilterState.size - 1)] + x[(filterState.current + -3) & (FilterState.size - 1)])) + (210.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -6) & (FilterState.size - 1)] + x[(filterState.current + -4) & (FilterState.size - 1)])) + (252.0 * (1.0 / 6.928330802e+06) * x[(filterState.current + -5) & (FilterState.size - 1)]) + ( -0.4441854896 * y[(filterState.current + -10) & (FilterState.size - 1)]) + ( 4.2144719035 * y[(filterState.current + -9) & (FilterState.size - 1)]) + ( -18.5365677633 * y[(filterState.current + -8) & (FilterState.size - 1)]) + ( 49.7394321983 * y[(filterState.current + -7) & (FilterState.size - 1)]) + ( -90.1491003509 * y[(filterState.current + -6) & (FilterState.size - 1)]) + ( 115.3235358151 * y[(filterState.current + -5) & (FilterState.size - 1)]) + (-105.4969191433 * y[(filterState.current + -4) & (FilterState.size - 1)]) + ( 68.1964705422 * y[(filterState.current + -3) & (FilterState.size - 1)]) + ( -29.8484881821 * y[(filterState.current + -2) & (FilterState.size - 1)]) + ( 8.0012026712 * y[(filterState.current + -1) & (FilterState.size - 1)]); output[i] = (short) Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, (short) filterState.output[(filterState.current + 0) & (FilterState.size - 1)])); } } }