The speedup for most non-dc-only dct functions is around 9-12x over the C code generated by GCC 7.3.

Relative speedups vs C for a few functions:

Cortex A53 A72 A73 inv_txfm_add_4x4_dct_dct_0_8bpc_neon: 3.90 4.16 5.65 inv_txfm_add_4x4_dct_dct_1_8bpc_neon: 7.20 8.05 11.19 inv_txfm_add_8x8_dct_dct_0_8bpc_neon: 5.09 6.73 6.45 inv_txfm_add_8x8_dct_dct_1_8bpc_neon: 12.18 10.80 13.05 inv_txfm_add_16x16_dct_dct_0_8bpc_neon: 7.31 9.35 11.17 inv_txfm_add_16x16_dct_dct_1_8bpc_neon: 14.36 13.06 15.93 inv_txfm_add_16x16_dct_dct_2_8bpc_neon: 11.00 10.09 12.05 inv_txfm_add_32x32_dct_dct_0_8bpc_neon: 4.41 5.40 5.77 inv_txfm_add_32x32_dct_dct_1_8bpc_neon: 13.84 13.81 18.04 inv_txfm_add_32x32_dct_dct_2_8bpc_neon: 11.75 11.87 15.22 inv_txfm_add_32x32_dct_dct_3_8bpc_neon: 10.20 10.40 13.13 inv_txfm_add_32x32_dct_dct_4_8bpc_neon: 9.01 9.21 11.56 inv_txfm_add_64x64_dct_dct_0_8bpc_neon: 3.84 4.82 5.28 inv_txfm_add_64x64_dct_dct_1_8bpc_neon: 14.40 12.69 16.71 inv_txfm_add_64x64_dct_dct_4_8bpc_neon: 10.91 9.63 12.67

Some of the specialcased identity_identity transforms for 32x32 give insane speedups over the generic C code:

inv_txfm_add_32x32_identity_identity_0_8bpc_neon: 225.26 238.11 247.07 inv_txfm_add_32x32_identity_identity_1_8bpc_neon: 225.33 238.53 247.69 inv_txfm_add_32x32_identity_identity_2_8bpc_neon: 59.60 61.94 64.63 inv_txfm_add_32x32_identity_identity_3_8bpc_neon: 26.98 27.99 29.21 inv_txfm_add_32x32_identity_identity_4_8bpc_neon: 15.08 15.93 16.56

The total decoding speedup for one particular test with Chimera ( -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o n.null --framethreads 4 --tilethreads 4 --skip 120 --limit 1000 on a Snapdragon 835) takes it from 109 fps to 118 fps.

Also mentioning @janne for potential reviews.