Performance numbers, measured on Skylake-X:

Before:                          After:
cdef_filter_4x4_8bpc_c: 1217.0   cdef_filter_4x4_8bpc_c:  885.2
cdef_filter_4x8_8bpc_c: 2355.1   cdef_filter_4x8_8bpc_c: 1710.1
cdef_filter_8x8_8bpc_c: 2669.5   cdef_filter_8x8_8bpc_c: 1439.7

For 10-bit (which currently uses C DSP code), overall decoding performance improves by around 20%.

The asm can also be optimized using the same approach, although the benefit will likely be a bit smaller.