Arm servers are already deployed in some datacenters, but they are still fairly new compared to their Intel counterparts, so at this stage software is not always as well optimized on Arm as on Intel.

Vlad Krasnov of Cloudflare found one such unoptimized case while testing jpegtran, a utility that performs lossless transformations of JPEG files, on one of the company's Xeon Silver 4116 servers:



vlad@xeon:~$ time ./jpegtran -outfile /dev/null -progressive -optimise -copy none test.jpg

real    0m2.305s
user    0m2.059s
sys     0m0.252s



and comparing it with a server based on the Qualcomm Centriq 2400 Arm SoC:



vlad@arm:~$ time ./jpegtran -outfile /dev/null -progressive -optimise -copy none test.jpg

real    0m8.654s
user    0m8.433s
sys     0m0.225s



That is nearly four times slower on a single core. It falls short of the company's target of at least 50% of the Xeon's per-core performance, a threshold chosen because the Arm processor has twice as many cores.
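The 50% target follows from simple arithmetic: if a chip has twice the cores, each core needs at least half the per-core speed for the whole chip to keep up. Here is a minimal sketch in Python using the two timings above. It assumes throughput scales linearly with core count, which is an idealization and not something these single-core runs demonstrate. The function names are illustrative.

```python
XEON_SECONDS = 2.305  # "real" time on the Xeon Silver 4116, from the run above
ARM_SECONDS = 8.654   # "real" time on the Centriq 2400, from the run above
CORE_RATIO = 2.0      # the article's "double the number of cores"


def per_core_ratio(reference_s: float, candidate_s: float) -> float:
    """Single-core performance of the candidate relative to the reference."""
    return reference_s / candidate_s


def whole_chip_ratio(per_core: float, core_ratio: float) -> float:
    """Idealized whole-chip throughput ratio, assuming linear scaling (assumption)."""
    return per_core * core_ratio


if __name__ == "__main__":
    ratio = per_core_ratio(XEON_SECONDS, ARM_SECONDS)
    print(f"slowdown: {1 / ratio:.2f}x")
    print(f"per-core performance: {ratio:.1%} of the Xeon")
    print(f"idealized whole-chip ratio: {whole_chip_ratio(ratio, CORE_RATIO):.2f}")
```

With the unoptimized build, the per-core ratio comes out at roughly 27%, so even with twice the cores the idealized whole-chip ratio stays well below 1.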

Vlad had previously optimized the code for Intel processors using SSE instructions, so he decided to look into optimizing the Arm code with NEON instructions.

The first step was to use perf to find which functions slow the process down the most:



perf record ./jpegtran -outfile /dev/null -progressive -optimise -copy none test.jpeg
perf report
71.24% lt-jpegtran libjpeg.so.9.1.0 [.] encode_mcu_AC_refine
15.24% lt-jpegtran libjpeg.so.9.1.0 [.] encode_mcu_AC_first



encode_mcu_AC_refine and encode_mcu_AC_first are the main culprits. He first optimized encode_mcu_AC_refine, which consists of two loops, with NEON instructions (see the post on Cloudflare for technical details, or check the source on GitHub), which more than doubled performance:



vlad@arm:~$ time ./jpegtran -outfile /dev/null -progressive -optimise -copy none test.jpg

real    0m4.008s
user    0m3.770s
sys     0m0.241s
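The profile and the new timing can be cross-checked with Amdahl's law. The sketch below assumes that the perf percentages approximate each function's share of the original 8.654 s run, and that only encode_mcu_AC_refine changed between the two runs. Both are simplifications, so the results are rough estimates rather than measurements. The helper names are illustrative.

```python
def amdahl_speedup(fraction: float, factor: float) -> float:
    """Overall speedup when `fraction` of the runtime becomes `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


def speedup_ceiling(fraction: float) -> float:
    """Upper bound on the overall speedup if that fraction took no time at all."""
    return 1.0 / (1.0 - fraction)


def implied_factor(fraction: float, overall: float) -> float:
    """Invert Amdahl's law: how much faster the optimized part must have become."""
    remaining = 1.0 / overall - (1.0 - fraction)
    if remaining <= 0:
        raise ValueError("overall speedup exceeds the ceiling for this fraction")
    return fraction / remaining


REFINE_SHARE = 0.7124  # encode_mcu_AC_refine, from the perf report
FIRST_SHARE = 0.1524   # encode_mcu_AC_first, from the perf report

if __name__ == "__main__":
    observed = 8.654 / 4.008  # unoptimized vs. optimized "real" time on Arm
    print(f"observed overall speedup: {observed:.2f}x")
    print(f"ceiling, refine only: {speedup_ceiling(REFINE_SHARE):.2f}x")
    print(f"ceiling, both functions: {speedup_ceiling(REFINE_SHARE + FIRST_SHARE):.2f}x")
    print(f"implied refine speedup: {implied_factor(REFINE_SHARE, observed):.1f}x")
```

Under those assumptions, the hot function itself would have become roughly four times faster, and optimizing both functions bounds the achievable overall gain at about 7.4x.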



This meets his requirement of at least 50% of the Intel Xeon's performance. He then optimized the second function (encode_mcu_AC_first) and used some NEON instructions that have no equivalent in SSE, such as the TBL lookup across four registers (vqtbl4q_u8 as an intrinsic), writing them in assembly because the compiler would not generate optimal code. With these changes, he managed to transform the test image in just 2.756 seconds, even closer to the 2.305 seconds achieved on the Intel Xeon.
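For readers unfamiliar with TBL: the four-register form treats four 16-byte registers as one 64-byte table and, for each byte of an index vector, returns the table entry at that index, or zero when the index is out of range. The Python function below is only a scalar reference model of that behavior, written to make the semantics concrete. It is not Vlad's code and says nothing about how the optimized jpegtran uses the instruction. The name `tbl4_model` is made up for this illustration.

```python
def tbl4_model(table: bytes, indices: bytes) -> bytes:
    """Scalar model of a 4-register TBL lookup (vqtbl4q_u8 in NEON intrinsics).

    table   -- 64 bytes, standing in for four consecutive 16-byte registers
    indices -- 16 bytes, one lookup index per output lane
    Indices of 64 or more select nothing and produce 0.
    """
    if len(table) != 64 or len(indices) != 16:
        raise ValueError("expected a 64-byte table and 16 index bytes")
    return bytes(table[i] if i < 64 else 0 for i in indices)


if __name__ == "__main__":
    table = bytes(range(100, 164))  # entry n holds the value 100 + n
    indices = bytes([0, 63, 16, 200] + [1] * 12)  # 200 is out of range
    print(list(tbl4_model(table, indices)))
```

By comparison, the SSSE3 byte shuffle PSHUFB looks up within a single 16-byte register, so a 64-byte table cannot be covered by one such instruction.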

The final two charts compare the performance of the various stages of optimization on a set of 34,159 images. Here the Qualcomm processor is faster than the Intel Xeon, with a single worker handling close to 50 images per second against close to 40 images per second.

When scaling to more workers, the Centriq processor keeps a slight performance advantage, but what is really impressive is that it does so at much lower power consumption.

When using all available cores/threads, the Arm processor can “reduce” around 24 images per second per watt, while two Intel Xeon processors manage only 10 images per second per watt, which makes the Arm processor roughly 2.4 times as efficient.