http://en.wikipedia.org/wiki/Entropy_encoding

http://en.wikipedia.org/wiki/Arithmetic_coding

The first process in “decoding order” is the entropy decoder. Let’s paraphrase Wikipedia:

One of the main types of entropy coding creates and assigns a unique prefix-free code to each unique symbol that occurs in the input. These entropy encoders then compress data by replacing each fixed-length input symbol with the corresponding variable-length prefix-free output codeword. The length of each codeword is approximately proportional to the negative logarithm of its probability; therefore, the most common symbols use the shortest codes.

In information theory an entropy encoding is a lossless data compression scheme that is independent of the specific characteristics of the medium.

The VP8 codec uses an entropy coder called the “boolean entropy decoder”, a form of arithmetic coding which uses only two symbols: 0 and 1. It is faster than typical arithmetic coding and allows a purely integer implementation.

The only thing we need to know is that entropy decoding is a strictly sequential process and cannot be parallelized. The symbols are decompressed one by one (in the case of arithmetic coding the bitstream is processed bit by bit, although I/O is buffered), and each decoded symbol updates the probability context of the decoder. Lose one bit, lose one frame.
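To give an idea of why this is so sequential, here is a greatly simplified sketch of a boolean decoder in the spirit of the one described in the VP8 specification (RFC 6386); the field widths and the byte-refill logic are illustrative, not the exact reference code:

#include <stdint.h>

typedef struct {
    const uint8_t *buf;   /* compressed input */
    const uint8_t *end;   /* end of the input buffer */
    uint32_t value;       /* current window into the arithmetic code, pre-shifted by 8 */
    uint32_t range;       /* size of the current interval, kept in [128, 255] */
    int bit_count;        /* bits shifted in since the last byte refill */
} bool_decoder;

/* Decode one binary symbol whose probability of being 0 is prob/256. */
static int decode_bool(bool_decoder *d, int prob)
{
    uint32_t split = 1 + (((d->range - 1) * (uint32_t)prob) >> 8);
    uint32_t big_split = split << 8;
    int bit;

    if (d->value >= big_split) {   /* the symbol is 1: keep the upper part of the interval */
        bit = 1;
        d->range -= split;
        d->value -= big_split;
    } else {                       /* the symbol is 0: keep the lower part */
        bit = 0;
        d->range = split;
    }

    /* Renormalize so that range stays in [128, 255].  Note the serial
     * dependency: the state updated here is needed to decode the next symbol. */
    while (d->range < 128) {
        d->value <<= 1;
        d->range <<= 1;
        if (++d->bit_count == 8) { /* a whole byte has been shifted out: refill */
            d->bit_count = 0;
            if (d->buf < d->end)
                d->value |= *d->buf++;
        }
    }
    return bit;
}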

Conclusion? Entropy decoding stays on the CPU. What we need to do, though, is decode the bitstream content for a whole frame in one pass, instead of starting the other processes as soon as enough data is available. That way, we will have all the data needed to start the other processes in parallel.

4/ Inverse transformation

http://en.wikipedia.org/wiki/Discrete_cosine_transform

http://en.wikipedia.org/wiki/Walsh-Hadamard_transform

The DCT, short for Discrete Cosine Transform, is the operation at the heart of today’s video compression formats. The DCT is a reversible, lossless operation, meaning that the decoder applies the exact inverse of what the encoder did. It does not directly compress the data; it just reorganizes the coefficients of a block by frequency, with the low frequencies at the beginning of the block and the high frequencies at the end.

Note that the first coefficient of the block is called the “DC” coefficient and represents the average of all the pixel values of the block. The other coefficients are called “AC” coefficients.



(original picture – original coefficients – transformed coefficients – “zigzag scan pattern”)

The VP8 codec uses a 2-D DCT-II for the AC coefficients of each 4×4 block of luma pixels in a frame. The DC coefficients are then gathered in groups of 16 (one DC coefficient from each of the 16 4×4 blocks of a macroblock) and a secondary WHT (Walsh–Hadamard transform) is applied to them. As the DCT and WHT are basically matrix calculus, they fit well inside shaders. Moreover, the blocks of data to process are independent of each other, which makes parallelization easy, and the same DCT is applied to every block of the picture. But there is a major drawback that we will see in chapter 6.
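To illustrate why this maps well to shaders, here is a minimal CPU-side sketch of a generic 2-D separable 4×4 inverse transform expressed as two matrix products (the basis matrix B and the floating-point arithmetic are illustrative; the real VP8 transforms are fixed-point with their own rounding rules):

/* out = B * coeffs * B^T : every block is the same small, independent
 * matrix operation, so blocks can be handled by independent shader units. */
static void inverse_transform_4x4(const float B[4][4],
                                  const float coeffs[4][4],
                                  float out[4][4])
{
    float tmp[4][4];

    /* tmp = B * coeffs (transform the columns) */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            float s = 0.0f;
            for (int k = 0; k < 4; k++)
                s += B[i][k] * coeffs[k][j];
            tmp[i][j] = s;
        }

    /* out = tmp * B^T (transform the rows) */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            float s = 0.0f;
            for (int k = 0; k < 4; k++)
                s += tmp[i][k] * B[j][k];
            out[i][j] = s;
        }
}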

5/ Inverse quantization

http://en.wikipedia.org/wiki/Quantization_(image_processing)

The quantization stage reduces the amount of information by dividing each transformed block of coefficients (a 2-D matrix) by a “quantization matrix” and then by a “quantization parameter” (used to adjust the final quality of the block), reducing the number of possible values each coefficient can take. All near-zero values are zeroed. Because this makes the values fall into a narrower range, it allows entropy coding to express them more compactly. The decoder performs the inverse operation: each decoded coefficient is multiplied back by the corresponding quantization factors.



(transformed coefficients – transformed & quantized coefficients – “zigzag scan pattern”)

Here is a quick overview of the amount of data in a block before and after transformation and quantization:

A simple way to gain a lot of speed using shaders is to process several 4×4 blocks of data simultaneously, as there is no dependency between blocks (within the limits of the available shader units), and to process four values at a time inside each block.
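As an illustration, a minimal dequantization sketch could look like this (the dc_factor and ac_factor parameters are hypothetical stand-ins for the real VP8 quantization factors); since every block is independent, each block, or even each coefficient, can be handled by a separate shader invocation:

#include <stdint.h>

static void dequantize_block(const int16_t quantized[16], int16_t coeffs[16],
                             int dc_factor, int ac_factor)
{
    coeffs[0] = (int16_t)(quantized[0] * dc_factor);   /* DC coefficient */
    for (int i = 1; i < 16; i++)                        /* AC coefficients */
        coeffs[i] = (int16_t)(quantized[i] * ac_factor);
}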

6/ Intra-prediction

www.vcodex.com/files/h264_intrapred.pdf

Intra-prediction uses the spatial redundancy inside a frame to reduce the amount of data needed to encode adjacent blocks. It is a major compression feature, able to greatly reduce the size of a single picture. VP8 has 10 available intra-prediction schemes, which consist of propagating the edge pixels of the adjacent block(s) across the block being predicted, as seen in the following picture:
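As a concrete example of one of the simpler schemes, here is a minimal sketch of “vertical” prediction, where the row of already-reconstructed pixels just above a 4×4 block is copied straight down into it (other schemes propagate the left column, an average of both edges, or directional mixes of the edge pixels):

#include <stdint.h>

static void predict_vertical_4x4(uint8_t *dst, int stride, const uint8_t *above)
{
    /* 'above' points to the 4 reconstructed pixels of the row on top of the block. */
    for (int row = 0; row < 4; row++)
        for (int col = 0; col < 4; col++)
            dst[row * stride + col] = above[col];
}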

The major problem with intra-prediction is that it reads the edges of adjacent reconstructed blocks. To reconstruct a block, the inverse transformation, inverse quantization and intra-prediction must already have been performed on that block. As intra-prediction sits in the middle of the operations needed to reconstruct a block, we cannot process the blocks in parallel, because each block can depend on the blocks to its left, above, above-left and above-right. So basically, each time we want to intra-predict a block, we have to wait for the result of the previous intra-prediction. This negates the gain of doing the block transformation and quantization in parallel, because we would have to wait for the intra-prediction results to reconstruct the blocks anyway. In a CPU implementation we do things like this:

– for each macroblock of a picture

* entropy decode a macroblock

– for each of the 16 blocks of the macroblock

* entropy decode a block

* transformation

* quantization

* intra-prediction

* block reconstruction

Once a set of data is loaded, it is better to perform all of the operations on that set while the data is still “hot” in the CPU cache, avoiding cache misses and expensive memory reads. In a GPU implementation, we would instead try to do this:

– entropy decode everything in a frame

– upload the data to the GPU memory

– transform all the blocks (in parallel)

– quantize all the blocks (in parallel)

– intra-predict all the blocks

– reconstruct all the blocks (in parallel)

So what do we do with the intra-prediction stage? When you start several intra-prediction processes at different places in the picture, ignoring the dependencies between blocks is not an option. Make one mistake during decoding, and that mistake will propagate itself:

Another obstacle is that, as each block can use a different intra-prediction scheme, we would need to set up a shader program for each scheme. Each of these shader programs would need a list of the blocks it can handle, and would have to wait for the reconstructed adjacent blocks of each block in its list to become available before performing its task. But doing so (that is, doing linear processing on a parallel processor) could be counterproductive performance-wise. The other solution would be to download the transformed and dequantized data (after parallel processing) back to main memory, do the intra-prediction on the CPU, and re-upload the data for the reconstruction stage and further processing. So far, my decision is to leave all the processing before block reconstruction on the CPU, and to concentrate on parallelizing the heavier stages. The SIMD CPU optimizations (MMX, SSE…) will be reintroduced.

7/ Inter-prediction (motion compensation)

http://en.wikipedia.org/wiki/Motion_compensation

Inter-prediction, or motion compensation, is a practical way to use the temporal redundancy between frames. As two consecutive frames have a great chance of being very similar, each macroblock of the current picture can simply be copied from a previously decoded picture instead of being re-encoded. If needed, the copied macroblock can be slightly moved and interpolated to match the destination picture. This process can be quite intensive, up to 70% of the CPU time, depending of course on the number of inter-predicted blocks in the frame. The VP8 specification allows up to 3 frames to be used as references.
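As an illustration, copying a macroblock from a reference frame with a whole-pixel motion vector boils down to a simple block copy (this sketch ignores sub-pixel interpolation, which is precisely the part where GPU texture filtering helps):

#include <stdint.h>
#include <string.h>

static void copy_macroblock(const uint8_t *ref, int ref_stride,
                            uint8_t *dst, int dst_stride,
                            int x, int y, int mv_x, int mv_y)
{
    /* Source position = destination position shifted by the motion vector.
     * Bounds checking and edge extension are omitted in this sketch. */
    const uint8_t *src = ref + (y + mv_y) * ref_stride + (x + mv_x);

    for (int row = 0; row < 16; row++)
        memcpy(dst + (y + row) * dst_stride + x, src + row * ref_stride, 16);
}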

If the reference frames are already loaded into GPU memory, recomposing a frame by copying macroblocks between GPU textures is faster than doing the same thing across CPU memory. Another advantage is that sometimes an interpolation filter must be applied to the copied block of texture, and texture filtering is one thing a GPU does fast, or at least faster than a CPU.

8/ Loop filter

http://en.wikipedia.org/wiki/Deblocking_filter_(video)

As discussed in this post, most of the processing during video decoding is done on small blocks of data. One side effect is that discontinuities can appear between blocks when they are compressed with slightly different parameters.

To counter this “blocking effect”, a “loop filter”, or “deblocking filter”, must be applied to the final picture. This kind of filter is usually applied as a post-processing filter for low-bitrate videos. In VP8, as in H.264, the loop filter is a mandatory feature: it allows the encoder to lower the bitrate by about 10%, the quality being regained by applying the loop filter in both the encoding and decoding paths. The only problem is that the loop filter is the most expensive operation of the VP8 decoding process, taking from 30% up to 60% of the total decoding time, especially with high-definition videos.

VP8 has two filter modes, a low-complexity “simple filter” and a high-complexity “normal filter” (used in almost every video). The mode can be chosen on a per-frame basis. The filter is somewhat adaptive, as some filtering parameters can be adjusted for each macroblock. The goal is to smooth the rough edges between blocks, on the three planes (Y, Cb and Cr) for the normal filter and only on the Y plane for the simple filter. The process is not exactly the same on block and macroblock boundaries: since a 16×16 macroblock packs parameters shared by its group of sixteen 4×4 blocks, bigger differences can appear at macroblock edges, while adjacent block edges within a macroblock may be concatenated and processed at once in their entirety.

Edges are filtered in this order:

– the left macroblock edge (16 pixels wide)

– each left block edge (4 pixels wide) for the 16 blocks inside the macroblock

– the top macroblock edge (16 pixels wide)

– each top block edge (4 pixels wide) for the 16 blocks inside the macroblock

For each pixel position along an edge, two or three pixels adjacent to either side of the edge (perpendicular to the edge orientation) are examined and possibly modified if the difference between the values on either side falls below a certain threshold.
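A heavily simplified sketch of the idea (not the exact VP8 filter math, which uses more taps and clamped fixed-point arithmetic) would be:

#include <stdint.h>
#include <stdlib.h>

/* Look at one pixel on either side of a vertical edge and, if the step
 * across the edge is small enough to be a compression artifact rather
 * than a real image feature, smooth it out. */
static void filter_vertical_edge_pixel(uint8_t *row, int edge_x, int threshold)
{
    uint8_t *p0 = &row[edge_x - 1];   /* last pixel left of the edge   */
    uint8_t *q0 = &row[edge_x];       /* first pixel right of the edge */
    int diff = *q0 - *p0;

    if (abs(diff) <= threshold) {     /* only touch small steps */
        *p0 = (uint8_t)(*p0 + diff / 4);
        *q0 = (uint8_t)(*q0 - diff / 4);
    }
}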

The deblocking process reads and then writes back the same values, whereas shader programs usually read from and write to different textures. The idea would be to create a new texture holding only a delta value for each pixel, and then to compose this “delta texture” with the “frame texture”. The VP8 deblocking process is linear: each macroblock must be fully processed before starting the next one. In order to parallelize it, it should be safe to start several deblocking processes at different macroblocks (chosen depending on their filter strength), at the expense of leaving some particular edges unfiltered.

9/ Color space conversion

http://www.itu.int/rec/R-REC-BT.601-7-201103-I/en

Color space conversion is the last process in “decoding order”. As we’ve seen in chapter 1, video formats don’t store their pixels in RGB, but in a subsampled YCbCr format. To convert a pixel between the two color models, VP8 uses ITU-R Recommendation BT.601, which defines, among other things, how to convert an analog RGB signal to a digital YCbCr signal. From there, an equation to convert digital YCbCr to digital RGB can be derived:

Where R, G and B are the three components of a 24-bit RGB color space, and Y, Cb and Cr are the three components of a 24-bit YCbCr color space. The prime denotes a gamma-corrected component. The YCbCr-to-RGB conversion can be heavy on the CPU: for each pixel, 9 divisions, 13 multiplications and 11 additions/subtractions. With some rounding and factorisation, we can find a faster implementation that still requires 7 multiplications, 7 bit shifts and 7 additions/subtractions:
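As an illustration, here is a sketch using one widely used fixed-point approximation of the BT.601 conversion (the constants are the common 8-bit fixed-point factors, which may not match the exact factorisation referred to above):

#include <stdint.h>

static uint8_t clamp255(int v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

static void ycbcr_to_rgb(uint8_t y, uint8_t cb, uint8_t cr,
                         uint8_t *r, uint8_t *g, uint8_t *b)
{
    /* "Studio swing" input: Y in [16, 235], Cb/Cr in [16, 240]. */
    int c = y - 16;
    int d = cb - 128;
    int e = cr - 128;

    *r = clamp255((298 * c + 409 * e + 128) >> 8);
    *g = clamp255((298 * c - 100 * d - 208 * e + 128) >> 8);
    *b = clamp255((298 * c + 516 * d + 128) >> 8);
}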