Since we have this wonderful situation in Europe and I need to stay at home, why not do something useless and comment on the features of AV1, especially since a nice paper from (some of?) the original authors is available here. In this post I’ll try to review it and give my comments on various details presented there.

First of all I’d like to note that the paper has 21 authors for a review that could be done by a single person. I guess this was done to give academic credit to the people involved and I have no problems with that (also I should note that even if two of the fourteen pages are short authors’ biographies, they were probably the most interesting part of the paper to me).



The abstract starts with “In 2018, the Alliance for Open Media (AOMedia) finalized its first video compression format AV1” and it fails to continue with the word twice. Not a big deal but I still shan’t forget how well the standardisation process went.

Introduction

It starts with what I can describe as usual marketing words that amount to nothing and then it has this passage:

As a codec for production usage, libvpx-VP9 considerably outperforms x264, a popular open source encoder for the most widely used format H.264/AVC, while is also a strong competitor to x265, the open source encoder of the state-of-the-art royalty-bearing format H.265/HEVC codec on HD content.

While this is technically true, this sentence looks like a PR department helped to word it. “A codec for production usage” is not an exact metric (and in production you have not just a quality/bitrate trade-off but rather quality/bitrate/encoding time). VP9 is an H.265 rip-off, so beating an H.264 encoder and being on par with an H.265 encoder is a given (unless you botch your encoder). And the cherry on top is “royalty-bearing format H.265”. H.264 is a “royalty-bearing format” too, and if Sisvel succeeds so are VP9 and AV1, and claims like those make me want such organisations to achieve that goal. Not because I like the current situation with greedy IP owners and patent trolls but rather because you have the choice between paid and “shitty but free” (does anybody miss Theora BTW?).

Then you have

The focus of AV1 development includes, but is not limited to achieving: consistent high-quality real-time video delivery, scalability to modern devices at various bandwidths, tractable computational footprint, optimization for hardware, and flexibility for both commercial and non-commercial content.

Somehow I interpret it as “our focus was all these nice things but it’s not limited by the actual requirement to achieve all this”. And it makes me wonder what image features make the content non-commercial.

Since the soft freeze of coding tools in early 2018, libaom has made significant progress in terms of productization-readiness, by a radical acceleration mainly achieved via machine learning powered search space pruning and early termination, and extra bitrate savings via adaptive quantization and improved future reference frame generation/structure.

I’m sorry, do humans write such sentences? I have no problems with what is done for extra bitrate savings (though I’d expect that any modern advanced encoder would do the same) but what’s written before that makes me want to cry Bingo! for some reason.

The rest of the introduction is dedicated to mentioning various AV1 encoders and decoders and ranting about tests. That’s perfectly normal since most of those tests are conducted by entities with their own interests and biases. And then you have “…30% reduction in average bitrate compared with the most performant libvpx VP9 encoder at the same quality”. As the most sceptical Kostya who blogs on multimedia.cx, I find this phrase funny.

AV1 Coding Tools

Now for the part that describes how AV1 achieves better compression than its predecessors. This is the reason why I looked at the paper in the first place: there have been many presentations about those coding tools already but having it all in a single paper instead of multiple videos is much better.

Coding block partition. Essentially “the H.265 tree partitioning from VP9 remains but we extend it a bit to allow larger and smaller sizes and more partitioning ways”. It also sounds like VP9 had some limitations that H.265 had not and now they are relaxed.

Intra prediction. Some of it sounds like “VP9 had intra prediction like in H.264, AV1 borrowed more from H.265” (H.264 has 8 directional modes plus DC and plane, VP9 had the same, H.265 had 35 modes most of them defined via universal directional predictor, AV1 has 56 directional modes most of them defined via universal directional predictor).

AV1 also has two new gradient prediction modes: smooth gradient (calculated from either top, left or both neighbours and using quadratic interpolation; this sounds like a logical development of plane prediction mode that was not used before for computational considerations) and Paeth prediction (you might remember it from PNG).
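For those who do not remember it from PNG, the Paeth predictor fits in a few lines. This is a minimal sketch following the PNG definition (left, above and upper-left neighbours); the function names are mine, and the tie-breaking order is the PNG one, which I have not checked against the AV1 specification.

```python
def paeth(left: int, above: int, upper_left: int) -> int:
    """Return the neighbour closest to left + above - upper_left."""
    base = left + above - upper_left
    d_left = abs(base - left)
    d_above = abs(base - above)
    d_upper_left = abs(base - upper_left)
    # PNG tie-breaking order: left, then above, then upper-left.
    if d_left <= d_above and d_left <= d_upper_left:
        return left
    if d_above <= d_upper_left:
        return above
    return upper_left


def paeth_block(top, left_col, top_left):
    """Predict a whole block from its top row, left column and corner pixel."""
    return [[paeth(l, t, top_left) for t in top] for l in left_col]
```

Every predicted pixel is simply a copy of one of three neighbours, which is why it is so cheap.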

There’s also a recursive-filtering-based intra predictor which predicts a 4×2 pixel block from its top and left neighbours, and the whole block is predicted by essentially filling it with those predicted blocks (somewhat like intra prediction in general without residue coding and with a fixed prediction mode). This is an interesting idea and makes me wonder if H.266 uses it as well.

Of course there’s this proposal for H.265 rejected for computational complexity but picked up in AV1: Chroma-from-Luma mode. The idea is rather simple: you usually have correlation between luma and chroma, so by multiplying luma values and adding some bias you can predict chroma values quite well in some cases.
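The "multiply and add some bias" idea can be sketched as below. This is only an illustration under my own assumptions: the scaling factor `alpha` and the chroma DC value `chroma_dc` are taken as given inputs, the luma block is assumed to be already subsampled to the chroma resolution, and the function name is made up.

```python
def cfl_predict(luma, alpha, chroma_dc):
    """Predict chroma as chroma_dc + alpha * (luma - mean(luma))."""
    flat = [v for row in luma for v in row]
    mean = sum(flat) / len(flat)
    # Only the "AC" part of luma (its deviation from the block mean) is scaled;
    # the bias comes from the chroma DC value.
    return [[chroma_dc + alpha * (v - mean) for v in row] for row in luma]
```

With `alpha` set to zero this degenerates into plain DC prediction, so the mode can only help when luma and chroma actually correlate.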

Another method is palette mode. This reminds me a lot of the screen coding extensions for ITU H.EVC.

And finally, intra block copy. This mode does what really old codecs did and copies data from already decoded part of the frame. Its description also features this passage:

To facilitate hardware implementations, there are certain additional constraints on the reference areas. For example, there is a 256-horizontal-pixel-delay between current superblock and the most recent superblock that IntraBC may refer to. Another constraint is that when the IntraBC mode is enabled for the current frame, all the in-loop filters, including deblocking filters, loop-restoration filters, and the CDEF filters, must be turned off.

I’d argue that only the first limitation is for hardware implementation. Having various features enabled would result in decoding the frame in a very specific order, e.g. usually you may decode the frame and deblock rows that are not used for intra prediction while you decode other rows, or you may decode the whole frame and only then perform deblocking. With the IntraBC feature and deblocking you’d need to perform deblocking at a specific stage of decoding or you won’t have the right input. Same with the constraint that IntraBC should reference only the data in the same tile: obviously it would do wonders to multi-threaded decoding otherwise (and you need multi-threaded decoding for fast decoding of large frames packed with modern codecs). If you wonder if this feature is a bit alien to the codec, the paper puts it straight: “Despite all these constraints, the IntraBC mode still brings significant compression improvement for screen content videos.” And thus it was accepted (why shouldn’t it?).
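The copy itself is trivial, all the fun is in deciding what you are allowed to reference. Here is a rough sketch with a deliberately simplified "already decoded" rule of my own; it ignores the 256-pixel delay, the tile constraint and everything else the real format demands, and the function name and arguments are hypothetical.

```python
def intra_block_copy(frame, x, y, size, dx, dy):
    """Fill the size x size block at (x, y) from the block at (x+dx, y+dy)."""
    sx, sy = x + dx, y + dy
    if sx < 0 or sy < 0 or sx + size > len(frame[0]):
        raise ValueError("reference block is outside the frame")
    # Simplified rule: the source must end above the current block,
    # or lie fully to its left without extending below it.
    above = sy + size <= y
    left = sx + size <= x and sy <= y
    if not (above or left):
        raise ValueError("reference block is not decoded yet")
    for row in range(size):
        for col in range(size):
            frame[y + row][x + col] = frame[sy + row][sx + col]
```

Note that the source pixels are taken from the same frame that is being written, so any in-loop filter touching them before or after the copy would change the result.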

Inter prediction. The paper lists five features: extended reference frames, dynamic spatial and temporal motion vector referencing, overlapped block motion compensation, warped motion compensation, and advanced compound prediction.

Extended reference frames means mimicking B-pyramid with VPx approach. I.e. instead of coded B-frames that may reference other B-frames (so you need to carry around the lists of which frames you may want to reference from the current frame) you have some previous frames and some future frames (coded but not displayed) like you had since VP7 times.

Dynamic spatial and temporal motion vector referencing means that co-located motion vectors from H.264 and H.265 got to AV1 at last.

Overlapped block motion compensation is a novel technique that some of us still remember from MPEG-4 ASP times, Snow and “that thing that Daala tried but decided not to use”.

Warped motion compensation also reminds me of MPEG-4 ASP (and a bit of the exotic VP7 motion compensation modes).

Advanced compound prediction means combining two sources, usually at an angle (an interesting idea IMO, not sure how much it complicates decoding though). It also seems to include the familiar weighted motion compensation from H.264.
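To make the "at an angle" part a bit more concrete, here is a minimal sketch of blending two predictions with a per-pixel mask. The 0..64 weight scale, the hard-edged diagonal mask and the names `blend` and `diagonal_mask` are my assumptions for illustration; as far as I understand, the real codec selects smoother wedge masks from a predefined set.

```python
def blend(pred_a, pred_b, mask, scale=64):
    """Per-pixel weighted average: mask/scale of pred_a, the rest of pred_b."""
    return [
        [(m * a + (scale - m) * b + scale // 2) // scale
         for a, b, m in zip(row_a, row_b, row_m)]
        for row_a, row_b, row_m in zip(pred_a, pred_b, mask)
    ]


def diagonal_mask(size, scale=64):
    """pred_a above the main diagonal, pred_b below it, an even mix on it."""
    return [
        [scale if c > r else 0 if c < r else scale // 2 for c in range(size)]
        for r in range(size)
    ]
```

A constant mask turns this into the plain weighted prediction, so both flavours end up in the same blending routine.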

Transform coding. Essentially “we use the same DCT and ADST from VP9 but we got to separable transforms now”.

Entropy coding. There are two radical things here: moving to old-fashioned multi-symbol adaptive arithmetic coding (which went out of favour because adapting large models is much slower than adaptive binary coding) and using coefficient coding in a way that resembles H.264 (but with the new feature of multiple symbol coding it looks different enough).
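To show why adapting a multi-symbol model costs more than adapting a binary one, here is a toy sketch of a cumulative-distribution update. The 15-bit precision, the fixed adaptation `rate` and the table layout are assumptions of mine for illustration and not the exact AV1 procedure.

```python
TOTAL = 1 << 15  # probabilities as 15-bit fixed point (assumed for the sketch)


def uniform_cdf(n):
    """cdf[i] = P(symbol <= i) * TOTAL for i in 0..n-2; the last entry is implicit."""
    return [TOTAL * (i + 1) // n for i in range(n - 1)]


def adapt(cdf, symbol, rate=5):
    """Nudge every entry towards the step function of the coded symbol."""
    for i in range(len(cdf)):
        if i >= symbol:
            cdf[i] += (TOTAL - cdf[i]) >> rate
        else:
            cdf[i] -= cdf[i] >> rate


def probability(cdf, symbol):
    """Current probability estimate of a symbol."""
    hi = cdf[symbol] if symbol < len(cdf) else TOTAL
    lo = cdf[symbol - 1] if symbol > 0 else 0
    return (hi - lo) / TOTAL
```

Each coded symbol touches every entry of the table, while a binary coder updates a single probability per decision; on the other hand one multi-symbol decode replaces several binary ones.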

In-loop filtering tools and post-processing. Those tools include: constrained directional enhancement filter (IIRC the brilliant thing made by a collaboration of Xiph and Cisco), loop restoration filters (a bit more complex than usual deblocking filters), frame super-resolution (essentially the half-done scalability feature present along with full scalability; it also reminds me of a VC1 feature), film grain synthesis (a postprocessing option that has nothing to do with the codec itself but Some flix company made it normative).

Tiles and multi-threading. Nothing remarkable except for large-scale tiles described as

The large-scale tile tool allows the decoder to extract only an interesting section in a frame without the need to decompress the entire frame.

The description is vague and the specification is not any better. And it does not seem to be a dedicated feature by itself but rather “if you configure the encoder to tile frames in a certain way you can then use the decoder to decode just a part of the frame efficiently”.

Performance Evaluation

Here I have two questions. The first question is why PSNR was chosen as the metric. As most people should know, the current “objective” metrics do not reflect visual quality as perceived by humans well (so encoded content with high PSNR can look worse to humans than different content with lower PSNR). IIRC (and I’m not following it closely) SSIM has been used as a fairer metric for a long time (look at the famous MSU codec comparison reports for example) and PSNR is synonymous with people who either don’t know what they’re doing or are trying to massage data to present their encoder favourably.
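For reference, this is all PSNR does: it takes the mean squared error between the two pictures and puts it on a logarithmic scale. A minimal sketch for a single 8-bit plane follows (the function name and the list-of-rows input format are my choice).

```python
import math


def psnr(reference, distorted, max_value=255):
    """Peak signal-to-noise ratio in dB between two equally sized planes."""
    ref = [v for row in reference for v in row]
    dis = [v for row in distorted for v in row]
    mse = sum((a - b) ** 2 for a, b in zip(ref, dis)) / len(ref)
    if mse == 0:
        return math.inf  # identical pictures
    return 10 * math.log10(max_value ** 2 / mse)
```

There is nothing in it that knows where the errors are or whether a human would notice them, which is exactly the problem.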

The second question is the presented encoding and decoding complexity. The numbers look okay at first glance and then you spot that x265 and HM have comparable encoding time and HM and libvpx have comparable decoding time. And for the record HM is the reference implementation that has just plain C++ code while the others (maybe not VTM at the time of benchmarking) have a lot of platform-specific SIMD code. I can understand x265 wasting more power on encoding (by trying more combinations and such) but decoding time still looks very fishy.

Conclusion

Nothing to say here, it simply reiterates what’s been said before and concludes that the goal of having a format with 30% bitrate savings at the same (PSNR?) quality over VP9 has been reached.

Author biographies

As I said, still probably the most interesting part for me since it tells you enough information about people involved in creating AV1 and what they are responsible for. Especially the four people from Duck (so I don’t have to wonder why there are segments named WILK_DX in the old binary decoders from On2 era).

And now my conclusion.

The paper is good and worth reading for all those technical details of the format but it would be even better if it didn’t try to sell AV1 to me. The abstract and introduction look like they were written by marketing or PR division with some technical details inserted afterwards. The performance evaluation leaves a slight hint of dishonesty like on the one hand they wanted to conduct a fair benchmarking but on the other hand they were told to demonstrate AV1 superiority no matter what and thus the final report was compiled from whatever they could produce without outright lying. But if you ignore that it’s still a good source of some technical information.