Okay, I’ve made some changes so hopefully the server will withstand the curiosity of more than two people if it will go like the last time.

So, after implementing Indeo 4/5 decoders for NihAV I nano-benchmarked it and my decoder was about twice as slow compared to libavcodec . And since neither has SIMD optimisations they should be good enough to compare.

The tested file was 00186002.avi — Indeo 4 sample with scalability feature(i.e. luma is split into four bands and uses Haar wavelet to compose the output plane) and duration over ten minutes. The results I got will be given in Linux perf sample counts as those should be representative enough.

avconv — 13.4 seconds, 10K cycles. About 24% spent in luma plane recombination (with Haar wavelet), about 40% of time is taken by bitstream decoding and the rest is mostly transforms and motion compensation.

nihav-tool — 31.6 seconds, 20K cycles. 30% spend in luma plane recombination, 48% of time is taken by bitstream decoding, 11% is for motion compensation and the rest is mostly transforms. Or in samples: recombination — 9900 (against 3300 in libavcodec), bitstream decoding (dirty estimate, it includes some DSP functions inlined) — 15800 against

5600. Motion compensation — 3500 against 1700. Transforms — 1300 against 1500 (they are not equivalent though, my code only transforms the block and output costs are hidden in bitstream decoding). Overall, my code is consistently worse. Is there any way to optimise it a bit?



Step 1. Committing explicit murder.

My DSP function for motion compensation for case 0 (no interpolation) uses this code:

for _ in 0..h { for x in 0..w { dst[didx + x] = src[sidx + x]; } sidx += sstride; didx += dstride; }

Replacing it with

for _ in 0..h { let mut dest = &mut dst[didx..didx+w]; dest.copy_from_slice(&src[sidx..sidx+w]); sidx += sstride; didx += dstride; }

gives 2800 samples (new function plus memcpy time) instead of 3500 in the old function.

Conclusion: rustc is too stupid to recognize copies.

Step 2. Going after the biggest function.

The plane recombination function is very simple:

for _ in 0..(h/2) { for x in 0..(w/2) { let p0 = src[idx0 + x]; let p1 = src[idx1 + x]; // idx0 + w/2 let p2 = src[idx2 + x]; // idx0 + (h/2) * stride let p3 = src[idx3 + x]; // idx0 + w/2 + (h/2) * stride // oidx1 = oidx0 + stride dst[oidx0 + x * 2 + 0] = clip8(((p0 + p1 + p2 + p3 + 2) >> 2) + 128); dst[oidx0 + x * 2 + 1] = clip8(((p0 + p1 - p2 - p3 + 2) >> 2) + 128); dst[oidx1 + x * 2 + 0] = clip8(((p0 - p1 + p2 - p3 + 2) >> 2) + 128); dst[oidx1 + x * 2 + 1] = clip8(((p0 - p1 - p2 + p3 + 2) >> 2) + 128); } idx0 += sstride; idx1 += sstride; idx2 += sstride; idx3 += sstride; oidx0 += dstride * 2; oidx1 += dstride * 2; }

Using intermediates (p0±p2, p1±p3) cuts it from 9900 samples to 9200.

But I like to live dangerously so I rewrite it into an unsafe mess with raw pointers like *d0_ptr.offset(1) = clip8(((d0 + d1 + 2) >> 2) + 128); d0_ptr = d0_ptr.offset(2); . Now it’s only 6700 samples. I try wrapping_add() instead of sum — 6600, no significant change. I roll back and cast p0-p3 to i32 . 4700 cycles! I try wrapping (i.e. without checks) arithmetics now. 4600 cycles. Maybe a statistical deviation, maybe a real small improvement.

Then I try a bit smarter clipping function:

#[inline(always)] fn mclip8(a: i32) -> u8 { if (a as u16) > 255 { !(a >> 16) as u8 } else { a as u8 } }

4300 samples. Not libavcodec ‘s 3300 but also not the original 9900.

Conclusions:

the compiler is not smart enough yet; maybe there is/there should be a way to tell compiler “hey, this slice is guaranteed to be this size, drop the unneeded checks”; non-checking arithmetic is a bit too inconvenient to write (by design, I suppose) but it may help to squeeze some performance; having optimised routines for basic type downconversion with saturation would make life a bit easier; the compiler sucks at doing basic calculations with smaller types.

Step 3. Strategically placed inlines.

I strategically put some #[inline(always)] on certain bitstream reading functions. Overall bitstream reading time reduces to 14600 samples (from 15800).

Let’s see how well it works with constant arguments—I force motion compensation function to be called with explicit block size (it’s either 4 or 8). 2300 samples instead of 2800, not bad. In total I’ve managed to shave off about 8 seconds.

Conclusion: compiler does inlining okayish but some hints would improve the performance significantly.

Step 4. Improving bitstream reading.

Indeo 4/5 codebooks are represented as unary prefix and optional bits to read that may vary for each prefix (e.g. first predefined codebook for macroblock data starts with codes 0 , 10xxxx , 110xxxxx and 1110xxxx ). So after caching offsets for each possible prefix and using quick routine for determining prefix ( (!bitread.peek(16).trailing_zeros() since the actual codes come LSB-first) instead of a loop I got bitstream handling time reduced to 11200 samples (from 14600). It’s nice that Rust has those count leading/trailing zeroes functions. This part is done differently in my decoder (it doesn’t build LUT for the codebook) so it’s hard to compare it directly to libavcodec but there’s still a lot of room for improvement.

Step 5. More pointers!

I rewrite some other DSP functions using pointers. Overall decoding time drops to 17.3 seconds (from 31.6 seconds originally and versus 13.4 seconds for avconv ) so I call it a day.

Overall conclusion

It is possible to write fast code in Rust but the compiler is not smart enough to do all possible optimisations and you have to make sacrifices if you want your code to be safer by default (and I’m fine with it). Also it’s obvious how Rust makes the code that sacrifices checks for speed uglier (compare a + b and a.wrapping_add(b) or arr[i] and *ptr.offset(i as isize) ). But in most cases it’s better to have a function in pure assembly language to perform the task instead. Maybe I’ll get to that stage eventually, maybe not. I still have NihAV to design and implement and piglets can shave themselves.