TL;DR: dav1d 0.3.0 decodes AV1 videos 24% faster on SSSE3, 26% faster on SSE4.1 and 4% faster on AVX2 (all PC), and 12% faster on Arm64 (mobile).

The open-source AV1 decoder dav1d was updated yesterday to version 0.3.0. With the third release, new assembly code provides some serious performance gains on both the PC and mobile platforms.

Previously:

PC

On the x86 side, this release mostly improves the SSSE3 performance of dav1d. Xuefeng Jiang contributed chroma-from-luma and Paeth intra prediction functions, delivering 0.8% and 0.4% improvements in overall performance.
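For readers unfamiliar with Paeth prediction, here is a minimal scalar sketch in C of the idea behind it: each pixel is predicted from whichever of its left, top or top-left neighbours lies closest to the gradient estimate left + top - topleft. This is not dav1d's code and not the new SSSE3 assembly. The names `paeth_pixel` and `paeth_predict_block` are made up for illustration, and 8-bit pixels are assumed.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: return the neighbour closest to
 * base = left + top - topleft. Ties prefer left, then top. */
static uint8_t paeth_pixel(int left, int top, int topleft)
{
    const int base   = left + top - topleft;
    const int ldiff  = abs(left - base);
    const int tdiff  = abs(top - base);
    const int tldiff = abs(topleft - base);

    if (ldiff <= tdiff && ldiff <= tldiff)
        return (uint8_t)left;
    return (uint8_t)(tdiff <= tldiff ? top : topleft);
}

/* Hypothetical helper: fill a w x h block (8-bit pixels assumed).
 * top[x] is the row above the block, left[y] the column to its left,
 * topleft the corner pixel shared by both edges. */
static void paeth_predict_block(uint8_t *dst, ptrdiff_t stride,
                                const uint8_t *top, const uint8_t *left,
                                uint8_t topleft, int w, int h)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            dst[y * stride + x] = paeth_pixel(left[y], top[x], topleft);
}
```

Every output pixel depends only on the block's edge pixels, not on previously predicted pixels, which is part of what makes this kind of function a good fit for SIMD: many pixels in a row can be computed in parallel.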

Liwei Wang continued his work on the inverse transforms, adding the larger block sizes: 8x32, 32x16, 32x32 and up to 64x64. This provides the largest speedup of this release, well over 10% on some videos.

dav1d 0.3.0 also introduces the first SSE4.1 assembly. In most cases the instructions added by SSE4.1 offer little benefit over SSSE3, but Victorien Le Couviour--Tuffet found a use case where they do. He optimized the CDEF filter, resulting in a 1.15x speedup at the module level and around 1.5% overall.

Meanwhile, Henrik Gramner wrote some very clever SSE2 code to speed up entropy decoding and bitstream reading, which had started to take up a large proportion of decode time, especially on AVX2. The assembly code resulted in a speedup on all 64-bit x86 platforms, measured at around 4% for AVX2 and 2% for SSSE3 and SSE4.1.