As you may well know, RetroArch has embedded video player support on platforms such as Windows and Linux. Just like VLC, Kodi, mpv and other video players out there, it accomplishes this by leveraging the ffmpeg project.

Up until now, all video decoding was performed entirely in software. This meant that the CPU had to do all the decoding instead of being able to delegate it to the GPU, so video playback could be too slow on systems with an underpowered CPU. This happens to be the case on many ARM SoC devices out there, such as the Raspberry Pi and Odroids.

Now, we finally support hardware video decoding through ffmpeg’s own APIs! This should really help on systems where there is a CPU bottleneck and the GPU happens to support hardware decoding. However, whether you are able to decode 1080p, 1440p or 4K in hardware depends entirely on your GPU’s capabilities.

In addition to hardware decoding, frame-based multithreading is now enabled for software (SW) video decoders, though its actual effectiveness hasn’t been proven yet.

The core switches back to SW-based decoding if HW-based decoding can’t be initialized.
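The fallback behaviour can be sketched roughly as follows. This is a minimal illustration and not the core's actual code: the `decoder_backend` enum, the `hw_init_fn` callback, `choose_decoder` and the two stub initialisers are all hypothetical names made up for this example. In a real ffmpeg integration, the initialisation step would presumably go through ffmpeg's hardware device API (for instance `av_hwdevice_ctx_create`), which is replaced here by a stub so the sketch stays self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical backend identifiers for illustration only;
 * these are not ffmpeg's or RetroArch's real enums. */
typedef enum {
    DECODER_SOFTWARE = 0,
    DECODER_DXVA2,
    DECODER_D3D11VA,
    DECODER_VDPAU,
    DECODER_VAAPI
} decoder_backend;

/* Hypothetical initialiser: returns 0 on success and a negative
 * value on failure, mirroring ffmpeg's error convention. */
typedef int (*hw_init_fn)(decoder_backend backend);

/* Try each hardware backend in the given order and return the first
 * one that initialises. If none does, fall back to software. */
decoder_backend choose_decoder(const decoder_backend *candidates,
                               size_t count, hw_init_fn try_init)
{
    assert(try_init != NULL);

    for (size_t i = 0; i < count; i++) {
        if (candidates[i] != DECODER_SOFTWARE &&
            try_init(candidates[i]) == 0)
            return candidates[i];
    }
    return DECODER_SOFTWARE;
}

/* Stub: simulates a system with no working hardware decoder. */
int init_always_fails(decoder_backend backend)
{
    (void)backend;
    return -1;
}

/* Stub: simulates a system where only VAAPI can be initialised. */
int init_only_vaapi(decoder_backend backend)
{
    return backend == DECODER_VAAPI ? 0 : -1;
}
```

Calling `choose_decoder` with the candidates `{ DECODER_VDPAU, DECODER_VAAPI }` and `init_only_vaapi` selects VAAPI, while the same list with `init_always_fails` ends up on the software path, which is the behaviour described above.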

The following backends have been tested:

DXVA2 [Windows]

D3D11VA [Windows] (used when running with the D3D11 driver)

VDPAU [Linux] (tested on an AMD system with a VDPAU-to-VAAPI layer)

VAAPI

We have performed the following tests so far:

Nvidia Titan XP/RTX 2080 Ti – Can hardware decode 1080p/1440p/4K content.

Intel UHD 630 – Can hardware decode 1080p/1440p/4K content.

AMD Radeon R9 290x – This is a slightly older card from 2014. It only supports 1080p hardware video decoding at best. 1440p and 4K content therefore fall back to software video decoding. This means that if your CPU is not up to the task, you won’t be able to run this content at full speed.

As a stress test, we picked a 4K video (3840×2160) with a total bitrate of 29561 kb/s (h264/AVC1, YUV420P), running at 30 frames per second. The CPU we’re using for this test is an Intel Core i7 7700K. With such a CPU, we don’t really have a CPU bottleneck, so we are only GPU-bound when it comes to rendering the content.

With software decoding (the current default in RetroArch), we averaged around 55fps with the 2080 Ti. CPU load averaged around 15%, with GPU load averaging around 11%.

With hardware decoding (the 2080 Ti defaults to DXVA2 for this test), we averaged 77fps with the 2080 Ti. CPU load averaged around 11%, with GPU load averaging around 20%.

NOTE: The above is long since out of date. The same video now runs at 256fps with hardware decoding and at 224fps with threaded video decoding using an automatically determined number of threads. Quite the improvement from 55fps, I’m sure you’ll agree.
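To put these numbers side by side, here is a small sketch of the arithmetic. The helper names `fps_speedup` and `fps_percent_gain` are made up for this example. Relative to the original 55fps software result, the first hardware decoding result of 77fps is a 1.4x speedup (a 40% gain), while the newer figures of 256fps and 224fps work out to roughly 4.65x and 4.07x.

```c
#include <assert.h>

/* Ratio of the new frame rate to the baseline, e.g. 77 / 55 = 1.4. */
double fps_speedup(double baseline_fps, double new_fps)
{
    assert(baseline_fps > 0.0);
    return new_fps / baseline_fps;
}

/* Gain over the baseline expressed as a percentage of the baseline. */
double fps_percent_gain(double baseline_fps, double new_fps)
{
    assert(baseline_fps > 0.0);
    return (new_fps - baseline_fps) / baseline_fps * 100.0;
}
```

Keep in mind that these ratios only compare the averages quoted above, all measured on the same 4K test video with the 2080 Ti.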

What remains to be done

We will still need to gather tests for the following backends:

CUDA

VideoToolbox

DRM

OpenCL

MediaCodec

Future plans

In short, we hope this will really help RetroArch’s video playback capabilities not only on desktop platforms such as Windows and Linux, but also on ARM SoCs, and specifically on our own Linux distribution, Lakka.

But hardware video decoding is not the be-all and end-all. There is certainly a lot of room for further speedups, and these are being investigated. But that’s the subject of another blog post somewhere down the line.

For now, rest assured that big things are coming up for the next version of RetroArch!