This weekend on #libav-devel we once more discussed the problems with the current core avcodec API.

Current situation

Decoding

We have 3 decoding functions, one for each of the supported media types: Audio, Video and Subtitles.

Subtitles are already a sore thumb since they do not use AVFrame but a specialized structure; let’s ignore them for now. Audio and Video share pretty much the same signature:

int avcodec_decode_something(AVCodecContext *avctx, AVFrame *f,
                             int *got_frame, AVPacket *p);

It takes a context pointer containing the decoder state, consumes a demuxed packet and optionally outputs a decoded frame containing raw data in a certain format (audio samples, a video frame).

The usage model is quite simple: it takes packets and, whenever it has enough encoded data to emit a frame, it emits one; the got_frame pointer signals whether a frame is ready or more data is needed.

Problem:

What if a single AVPacket is nearly always enough to output 2 or more frames of raw data?

This happens with MVC (Multiview Video Coding) and other real-world scenarios.

In general our current API cannot cope with it cleanly.
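The awkwardness can be sketched with a toy mock of the one-in/one-out signature. Nothing below is real libav API — plain ints stand in for AVPacket and AVFrame — but it shows the contortion a two-frames-per-packet decoder is forced into: buffer the second frame internally and hope the caller keeps calling (here with a NULL packet) to drain it.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative mock, NOT libav API: a "decoder" whose every packet
 * carries two frames. */
typedef struct MockDecoder {
    int pending_frame; /* 0 means no buffered frame */
} MockDecoder;

/* Mirrors the avcodec_decode_something() shape: returns bytes consumed,
 * *got_frame tells whether *frame holds valid output. */
static int mock_decode(MockDecoder *d, int *frame, int *got_frame,
                       const int *pkt)
{
    if (d->pending_frame) {          /* emit the buffered second frame */
        *frame           = d->pending_frame;
        d->pending_frame = 0;
        *got_frame       = 1;
        return 0;                    /* no input consumed */
    }
    if (pkt) {                       /* one packet decodes to two frames */
        *frame           = *pkt * 2 - 1;
        d->pending_frame = *pkt * 2;
        *got_frame       = 1;
        return 1;                    /* packet consumed */
    }
    *got_frame = 0;                  /* nothing buffered, nothing fed */
    return 0;
}
```

Nothing in the signature tells the caller that a second call without new input is required, which is exactly why the API cannot express this case cleanly.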

While working with Intel’s MediaSDK interface and now with MMAL for the Raspberry Pi, similar problems arose due to the natural parallelism of the underlying hardware.

Encoding

Again we have 3 functions; again Subtitles are somewhat different, while Audio and Video are nicely uniform.

int avcodec_encode_something(AVCodecContext *avctx, AVPacket *p,
                             const AVFrame *f, int *got_packet);

It is pretty much the dual of the decoding function: the context pointer is the same, a frame of raw data goes in and a packet of encoded data may come out. Again we have a pointer that signals whether enough data was available and an encoded packet has been output.

Problem:

Again, multiple AVPackets might be produced from a single AVFrame fed in.

This happens when the HEVC “workaround” for encoding interlaced content makes the encoder output the two separate fields as separate encoded frames.

Again, the API cannot cope with this cleanly, and threaded or otherwise parallel encoding fits the model just barely.

Decoupling the process

To fix this issue (and make our users’ lives simpler) the idea is to split the function feeding data from the one actually providing the processed data.

int avcodec_decode_push(AVCodecContext *avctx, AVPacket *packet);
int avcodec_decode_pull(AVCodecContext *avctx, AVFrame *frame);
int avcodec_decode_need_data(AVCodecContext *avctx);
int avcodec_decode_have_data(AVCodecContext *avctx);

int avcodec_encode_push(AVCodecContext *avctx, AVFrame *frame);
int avcodec_encode_pull(AVCodecContext *avctx, AVPacket *packet);
int avcodec_encode_need_data(AVCodecContext *avctx);
int avcodec_encode_have_data(AVCodecContext *avctx);

A single function becomes 4, so why is this simpler?

The current workflow is more or less like this:

while (get_packet_from_demuxer(&pkt)) {
    ret = avcodec_decode_something(avctx, frame, &got_frame, &pkt);
    if (got_frame) {
        render_frame(frame);
    }
    if (ret < 0) {
        manage_error(ret);
    }
}

get_packet_from_demuxer() is a function that either dequeues the encoded data from some queue or calls the demuxer directly (beware: having your I/O-intensive demuxer function block your CPU-intensive decoding function isn’t nice). render_frame() likewise either talks directly to some kind of I/O subsystem or enqueues the data so that the actual rendering (including format conversion, overlaying and scaling) happens in another thread.
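As a rough illustration of that decoupling, a bounded packet queue sitting between a demuxer thread and a decoder thread could look like the sketch below. It is a hypothetical helper, not part of libav; slow I/O then blocks only the demuxer thread, never the decode call itself.

```c
#include <pthread.h>

/* Hypothetical bounded FIFO between a demuxer thread (producer) and a
 * decoder thread (consumer). */
#define QUEUE_SIZE 16

typedef struct PacketQueue {
    void *items[QUEUE_SIZE];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} PacketQueue;

static void queue_init(PacketQueue *q)
{
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

/* Called by the demuxer thread; blocks only when the queue is full. */
static void queue_put(PacketQueue *q, void *pkt)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_SIZE)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[q->tail] = pkt;
    q->tail = (q->tail + 1) % QUEUE_SIZE;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Called by the decoder thread; blocks only when the queue is empty. */
static void *queue_get(PacketQueue *q)
{
    void *pkt;
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    pkt = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_SIZE;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return pkt;
}
```

A real implementation would also need a way to signal end of stream and to drop packets on flush, omitted here for brevity.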

The new API makes it much easier to keep the multiple areas of concern separated, so they won’t trip over each other, while the casual user would have something like this:

while (ret >= 0) {
    while ((ret = avcodec_decode_need_data(avctx)) > 0) {
        ret = get_packet_from_demuxer(&pkt);
        if (ret < 0)
            ...
        ret = avcodec_decode_push(avctx, &pkt);
        if (ret < 0)
            ...
    }
    while ((ret = avcodec_decode_have_data(avctx)) > 0) {
        ret = avcodec_decode_pull(avctx, frame);
        if (ret < 0)
            ...
        render_frame(frame);
    }
}

That is probably only a few lines longer.
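To make the need_data()/have_data() contract concrete, here is a toy, self-contained mock of such a state machine. The names and the ints standing in for AVPacket/AVFrame are illustrative only; the point is that with an internal frame FIFO, one packet producing two frames (the MVC case above) flows through naturally.

```c
/* Illustrative mock, NOT the proposed libav functions: decoder state
 * with an internal frame FIFO. */
#define FIFO_SIZE 8

typedef struct MockCtx {
    int fifo[FIFO_SIZE];
    int head, tail;
} MockCtx;

static int mock_have_data(MockCtx *c) { return c->head != c->tail; }
static int mock_need_data(MockCtx *c) { return !mock_have_data(c); }

static void fifo_put(MockCtx *c, int v)
{
    c->fifo[c->tail] = v;
    c->tail = (c->tail + 1) % FIFO_SIZE;
}

/* push: one "packet" legitimately expands to two "frames" */
static int mock_push(MockCtx *c, int pkt)
{
    fifo_put(c, pkt * 2 - 1);  /* e.g. base view */
    fifo_put(c, pkt * 2);      /* e.g. dependent view */
    return 0;
}

/* pull: hand out one buffered frame at a time */
static int mock_pull(MockCtx *c, int *frame)
{
    if (!mock_have_data(c))
        return -1;
    *frame  = c->fifo[c->head];
    c->head = (c->head + 1) % FIFO_SIZE;
    return 0;
}
```

The caller never has to guess how many frames a packet produced: it simply pulls while have_data() says so.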

Asynchronous API

Since the decoupled API is that simple, it is possible to craft something more immediate for the casual user.

typedef struct AVCodecDecodeCallback {
    int (*pull_packet)(void *priv, AVPacket *pkt);
    int (*push_frame)(void *priv, AVFrame *frame);
    void *priv_data;
} AVCodecDecodeCallback;

int avcodec_register_decode_callbacks(AVCodecContext *avctx,
                                      AVCodecDecodeCallback *cb);

int avcodec_decode_loop(AVCodecContext *avctx)
{
    AVCodecDecodeCallback *cb = avctx->cb;
    int ret;

    while ((ret = avcodec_decode_need_data(avctx)) > 0) {
        ret = cb->pull_packet(cb->priv_data, &pkt);
        if (ret < 0)
            return ret;
        ret = avcodec_decode_push(avctx, &pkt);
        if (ret < 0)
            return ret;
    }
    while ((ret = avcodec_decode_have_data(avctx)) > 0) {
        ret = avcodec_decode_pull(avctx, frame);
        if (ret < 0)
            return ret;
        ret = cb->push_frame(cb->priv_data, frame);
    }
    return ret;
}

So the actual minimum decoding loop can be just 2 calls:

ret = avcodec_register_decode_callbacks(avctx, cb);
if (ret < 0)
    ...

while ((ret = avcodec_decode_loop(avctx)) >= 0);

Cute, isn’t it?
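A toy, self-contained version of the callback pattern, again with plain ints standing in for AVPacket/AVFrame since the API above is only a proposal, could look like this: a source callback hands input in, a sink callback takes output, and the loop owns the control flow.

```c
/* Illustrative mock of the callback-driven pattern; names are
 * hypothetical. */
typedef struct DecodeCallbacks {
    int (*pull_packet)(void *priv, int *pkt); /* <0 on end of stream */
    int (*push_frame)(void *priv, int frame);
    void *priv_data;
} DecodeCallbacks;

typedef struct LoopState {
    int next_pkt, last_pkt; /* mock input stream */
    int frames_out;         /* mock render counter */
} LoopState;

static int mock_pull_packet(void *priv, int *pkt)
{
    LoopState *s = priv;
    if (s->next_pkt > s->last_pkt)
        return -1;          /* end of stream */
    *pkt = s->next_pkt++;
    return 0;
}

static int mock_push_frame(void *priv, int frame)
{
    LoopState *s = priv;
    (void)frame;
    s->frames_out++;        /* pretend to render */
    return 0;
}

/* Drives the callbacks until the source runs dry; the "decoder" here is
 * the identity, one packet in, one frame out. */
static int mock_decode_loop(DecodeCallbacks *cb)
{
    int pkt, ret;
    while ((ret = cb->pull_packet(cb->priv_data, &pkt)) == 0)
        cb->push_frame(cb->priv_data, pkt);
    return ret;
}
```

The user supplies only the two callbacks and their private state; everything else stays inside the library.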

Theory is simple …

… the practice not so much:

– there are plenty of implementation issues to take into account.

– LOTS of tedious work converting all the codecs to the new API.

– lots of details to iron out (e.g. should have_data() and need_data() block or not?)

We did radical overhauls before, such as introducing reference-counted AVFrames thanks to Anton, so we aren’t much scared of reshaping and cleaning the codebase once more.

If you like the ideas posted above or you want to discuss them more, you can join the Libav IRC channel or mailing list to discuss and help.