A naive approach is to convert the code from using the mm registers directly to using xmm registers. This can usually be done with minimal changes, paying attention only to packs, unpacks, and moves. It can make things faster on Skylake and related Intel microarchitectures. A discussion of why is beyond the scope of this post; the point is that you can measure that functions are faster when they use xmm registers.

This does provide some speedup for this function. However, the IDCT could make much better use of the wider xmm registers, and of the larger number of registers available on 64-bit. The data block is 128 bytes, an 8×8 array of int16_t. Always look at your specific problem.
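To make the register arithmetic concrete, here is a small sketch in C. The names `MM_BYTES`, `XMM_BYTES` and `regs_for_block` are made up for illustration and are not taken from FFmpeg; the register widths (8 bytes for mm, 16 bytes for xmm) are the architectural ones.

```c
#include <stddef.h>
#include <stdint.h>

/* Register widths in bytes: mm (MMX) is 64 bits, xmm (SSE2) is 128 bits. */
enum { MM_BYTES = 8, XMM_BYTES = 16 };

/* Hypothetical helper: how many registers of a given width are needed
 * to hold one whole 8x8 block of int16_t coefficients (128 bytes). */
static size_t regs_for_block(size_t reg_bytes)
{
    return sizeof(int16_t[8][8]) / reg_bytes;
}
```

With these numbers, `regs_for_block(XMM_BYTES)` is 8, so the whole block fits in half of the 16 xmm registers on x86-64 and leaves the other half for temporaries. `regs_for_block(MM_BYTES)` is 16, which is twice the 8 mm registers that exist, so an MMX version has to keep going back to memory.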

After I tried and failed to implement this directly myself, Ronald pointed me to the existing 10-bit IDCT, which does use xmm registers, and all 16 of those available on x86-64.

Comparing that with the C function, and comparing the 8- and 10-bit C functions with each other, I thought it would be as simple as calling the existing macros with the right parameters. If only. It was pretty close but didn’t pass testing.

Ronald pointed out that I should be using different rounding at one point and, very helpfully, contributed the value. Even with that incorporated, the output still wasn’t completely accurate. At first I thought this was because of tiny differences in the coefficients FFmpeg uses between the 8- and 10-bit functions: minor differences of +/- 1, for example 16383 vs 16384 and 19265 vs 19266, but the former is used most often. This isn’t too hard to correct. Do a little more macro magic and the code can be made to use the right ones. This is a little messy but keeps the functional code clean. This is commit 8221c71703.

After boxing this all up into a few neat commits I submitted the patches for review. Almost immediately, Michael found a problem. Most of the problem came from using one but not all of the 3 functions I wanted to add. Simply stated: it left the internal state of the decoder broken. It is expected that the IDCT functions all have the same permutation and that add/put functions exist.

While addressing this problem we kicked off a lengthy and slightly heated discussion about the legacy of the IDCT and why it uses the coefficients it does. That discussion is covered below.

There was one final problem with my new code which more thorough testing revealed. Unfortunately I don’t remember who first highlighted this last one as we were discussing the legacy of the IDCT and other issues.

Ronald identified that when there is only a DC coefficient in a row, the C code uses a shortcut to avoid doing a full IDCT. Instead of multiplying by 16383, it only shifts, which is the equivalent of multiplying by 16384. After telling me the specifics he said that the code would need to perform a full IDCT and then merge in the “shortcut” values.
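A small C sketch shows why the two paths can disagree. It is not FFmpeg's code. The shift amounts `ROW_SHIFT = 11` and `DC_SHIFT = 3`, the rounding term, and the function names are assumptions chosen for illustration; only the 16383 vs 16384 distinction comes from the discussion above.

```c
/* Assumed constants for illustration only; check the real source. */
enum { W4 = 16383, ROW_SHIFT = 11, DC_SHIFT = 3 };

/* Full row pass for a DC-only row: multiply by W4, add the rounding
 * term, then shift down. Negative inputs would rely on an arithmetic
 * right shift, which C leaves implementation-defined. */
static int row_dc_full(int dc)
{
    return (W4 * dc + (1 << (ROW_SHIFT - 1))) >> ROW_SHIFT;
}

/* Shortcut: scaling by 1 << DC_SHIFT is the same as multiplying by
 * 16384 (1 << 14) and shifting right by ROW_SHIFT (14 - 11 = 3). */
static int row_dc_shortcut(int dc)
{
    return dc * (1 << DC_SHIFT);
}
```

Under these assumptions the two agree for small inputs (both give 8000 for a DC of 1000) but drift apart by one for larger ones (16383 vs 16384 for a DC of 2048). That off-by-one is the kind of mismatch a bit-exact test catches, and it is why the SIMD version has to merge in the shortcut result for DC-only rows.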

He must have been bored because while I was working on some other corner of the code he suddenly presented me with the appropriate patch. That can be seen in commit 8b19467d07.

Other than a few mostly minor cosmetic patches which precede it, the commit which adds the new functions is d7246ea9f2.