And I thought my implementation of Deflate was bad, part 2

A few years ago I posted about poor compression ratios from the .NET Framework's System.IO.Compression library, specifically files getting larger when compressed with the .NET Framework 2.0 version of the library. Fast forward to the more recent .NET Framework 4.0, whose changelog includes the following:

Compression Improvements: The compression algorithms for the System.IO.Compression.DeflateStream and System.IO.Compression.GZipStream classes have improved so that data that is already compressed is no longer inflated. This results in much better compression ratios.

Uh huh. Let's test that, shall we? :)

We'll use a revised version of the C# test program I used the last time, one which feeds an entire buffer to the compressor. This is necessary since GZipStream is unfortunately sensitive to write block sizes:

using System;
using System.IO;
using System.IO.Compression;

namespace VCZipTest {
    class Program {
        static void Main(string[] args) {
            using (FileStream src = new FileStream(args[0], FileMode.Open)) {
                using (FileStream dst = new FileStream(args[1], FileMode.Create)) {
                    using (GZipStream def = new GZipStream(dst, CompressionMode.Compress, true)) {
                        byte[] buf = new byte[(int)src.Length];
                        int act = src.Read(buf, 0, buf.Length);
                        def.Write(buf, 0, act);
                        def.Flush();
                    }

                    System.Console.WriteLine("{0} ({1} bytes) -> {2} ({3} bytes)", args[0],
                        src.Length, args[1], dst.Length);
                }
            }
        }
    }
}



Some background

First, I should probably explain a bit about the gzip/Deflate algorithm and clear up some misconceptions.

The compression algorithm itself is called Deflate and is described by RFC 1951. Deflate compressed data can be wrapped in either a minimalistic gzip stream described by RFC 1952, or in the more well known .ZIP format introduced and made popular by PKWARE's PKZIP tool, which includes multiple compressed files along with directory information. Of concern here is only the implementation of the Deflate algorithm itself, the method by which the raw bits are compressed.

Deflate consists primarily of two layers of compression. The first is LZ77 style sliding window compression, where repeated strings of bytes are replaced by references to the previous instance. What is not so obvious is that this layer is also capable of compressing runs of identical byte values or repeated patterns of byte values through self-overlapping runs. (This fact is rediscovered each time a C programmer sees the forward byte copy loop in the decoder and tries to "fix" it with memmove or "optimize" it with memcpy.) The second layer uses Huffman encoding to compress based on the probability of codes, which include both literal bytes and copy length codes. Thus, Deflate is able to compress not only based on patterns, but also based on some byte values occurring more often than others or not occurring at all.

It is provably true that no lossless compression algorithm can compress all incoming streams; for some streams to shrink, others must grow. However, the enlargement required to satisfy this requirement is far smaller than the 50%+ expansion seen with the .NET Framework 2.0 version of System.IO.Compression. One way to bound it is simply to add a single byte holding a compression flag. I've seen discussions where people said that .NET's DeflateStream/GZipStream is unable to take advantage of this because the compression flag is stored in the .ZIP container and not in the compressed stream itself, which is all that the .NET code handles. This is incorrect. While the .ZIP container does have a compressed/stored flag in the header for each file, the Deflate stream itself is also composed of blocks, and each block can independently be encoded as stored or compressed. A stored block of up to 64K costs about five bytes of overhead (the block header bits plus the LEN/NLEN fields), which still gives an upper bound far below a 150% ratio.

The other big misconception I've read is that .NET's implementation is hampered by supporting a stream instead of seeing the whole file. The main problem with this explanation is that stream compression doesn't preclude optimization on a per block basis; it simply requires that enough buffering be used to accommodate a block at a time, which is what zlib does. What would preclude per-block optimization is performing compression under such tight latency constraints that bits need to be shoved out the door almost as soon as they arrive and can be compressed, such as with a modem. Even then, it's possible to mitigate the issue by dynamically monitoring the compression ratio and switching encoding modes as needed.

The easy test

Let's start with a simple test, 100 null bytes. That should be easy for just about any compressor. First, .NET 2.0:

testnulls.bin (100 bytes) -> testnulls.prev (124 bytes)

Well, expectations were low for that one. This includes the small gzip stream header, which is about 10 bytes, but it's still bigger than the ~123 bytes a pure stored-block encoding would have taken, header and trailer included. Let's try .NET 4.0:

testnulls.bin (100 bytes) -> testnulls.gz (128 bytes)

That's... not promising. Remember, this is just about the easiest case possible for any compression algorithm. gzip -9 is able to get this down to 38 bytes, and its header is slightly bigger due to inclusion of a filename. What happened? Well, let's dump the .NET 2.0 output first:

New block: dynamic Huffman trees

Code 0: 14 bits

Codes 1-6: 14 bits

Code 7: 14 bits

Code 8: 14 bits

Code 9: 12 bits

Code 10: 6 bits

Code 11: 14 bits

Codes 12-17: 14 bits

Codes 18-23: 14 bits

Codes 24-29: 14 bits

...

Dynamic block header took 790 bits

Literal <00>

Reference <1, 99>: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

EOB

To review, a Deflate stream is composed of a series of blocks, where each block can have a different encoding mode. This stream consists of a single block in "dynamic" mode, where Huffman trees tailored for the data in the block are included. The main tree contains codes for literal values (encodings for single byte values) as well as an end of block (EOB) code and length codes that indicate repeated sections to copy. After the trees comes the actual data, which consists of a mix of literal codes and references to data to copy, and ends with an EOB code.

In this case, the 2.0 encoder was correctly able to encode a run for the nulls, but it spent a whopping 790 bits on encoding a full Huffman tree that it didn't need. Not only that, but the tree is completely mismatched for this data: the only codes present are a literal 00, a single copy length code, and an end of block (EOB) code. Spending 14 bits on a null byte, in a file consisting of nothing else, definitely isn't a good choice, and neither is allocating nodes for all of the other bytes that never occur. We'll get back to this in a second.

Now, the 4.0 encoder... why did it produce a bigger file than the 2.0 encoder?

Dynamic block header took 790 bits

Literal <00>

Reference <1, 99>: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

EOB

New block: stored mode (0 bytes)

Strange, the 4.0 encoder added an extra zero-byte stored block at the end of the file. A "stored" block is just raw data encoded as straight bytes, but a zero-byte stored block is useless. I can't think of why you'd do this in an encoder.

How about gzip?

New block: static Huffman trees

Literal <00>

Literal <00>

Reference <1, 98>: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

EOB

There is a third block mode in Deflate that I didn't mention earlier: static tree mode. In this mode, compression is used, but the encoding trees aren't transmitted; instead, the decoder assumes a known, mostly flat static tree that allocates 7-9 bits to the literal and length codes, with a slight bias toward the ranges used by ASCII text. The gzip encoder used this mode to avoid spending bits on encoding trees: for this file, the cost of transmitting custom trees, likely far less than the 790 bits GZipStream spent, would still have exceeded whatever a more targeted set of trees could have saved. The two literal codes are slightly odd here; I don't see why gzip couldn't have encoded a single literal like the .NET encoders did.

A 6-bit test

The trees encoded by the .NET encoder were certainly strange and not very appropriate. Let's do another test, which is to encode a 1K block of 6-bit pseudorandom data produced by this C++ program:

#include <stdio.h>

int main() {
    FILE *f = fopen("rand6.bin", "wb");
    unsigned seed = 1;

    for(int i=0; i<1024; ++i) {
        putc(seed & 0x3f, f);
        seed = (seed << 8) + (((seed >> (28 - 8)) ^ (seed >> (31 - 8))) & 0xff);
    }

    fclose(f);
}

With only 6 of 8 bits pertinent, and with the pseudorandom data mostly suppressing the pattern-loving sliding window compression, we should expect a compression ratio of about 75%. Reverse order this time, gzip first:

New block: dynamic Huffman trees

Code 0: 5 bits

Code 1: 4 bits

Code 2: 6 bits

Codes 3-7: 6 bits

Code 8: 5 bits

...

Code 273: 3 bits

Code 274: 5 bits

Code 275: 3 bits

Code 276: 5 bits

Code 277: 3 bits

Dynamic block header took 293 bits

Literal <01>

Literal <00>

Literal <00>

Literal <00>

Literal <12>

Literal <00>

Literal <00>

Literal <01>

Literal <04>

Reference <7, 3>: 00 00 12

...

Total output, including the 20-byte stream header: 791 bytes. The LFSR I used to generate the data isn't that great, so gzip was able to find some redundancy to exploit, but that was balanced by the bits it had to spend on encoding the trees, which show the expected allocations of around 6 bits/value. This is the kind of behavior you'll see when a compressor gets mostly incompressible data, like an MP3 file or already compressed data: the Huffman trees are mostly balanced, but the sliding window compression starts finding tiny sections of repeated data here and there.

Next, .NET Framework 4.0:

rand6.bin (1024 bytes) -> rand6.gz (1052 bytes)



New block: stored mode (1024 bytes)

New block: stored mode (0 bytes)

Again, GZipStream doesn't do as good a job as gzip, but at least this time it uses a stored block and doesn't make everything much worse. Now, .NET Framework 2.0:

New block: dynamic Huffman trees

Code 0: 14 bits

Codes 1-6: 14 bits

Code 7: 14 bits

Code 8: 14 bits

Code 9: 12 bits

Code 10: 6 bits

Code 11: 14 bits

Codes 12-17: 14 bits

...

Dynamic block header took 790 bits

Literal <01>

Literal <00>

Literal <00>

Literal <00>

Literal <12>

Literal <00>

Literal <00>

Literal <01>

Literal <04>

Reference <7, 3>: 00 00 12

Uh oh... this tree seems familiar. Let's do another test, and compress the source code to the test program itself (.NET 4.0):

..netzip.cs (682 bytes) -> foo.gz (448 bytes)



New block: dynamic Huffman trees

Code 0: 14 bits

Codes 1-6: 14 bits

Code 7: 14 bits

Code 8: 14 bits

Code 9: 12 bits

Code 10: 6 bits

Code 11: 14 bits

Codes 12-17: 14 bits

Codes 18-23: 14 bits

Codes 24-29: 14 bits

...

Dynamic block header took 790 bits

Very different data, same trees.

As I noted the last time in the comments, the .NET Framework 2.0 compressor appears to use a predefined Huffman tree tailored for text, allocating 14 bits for most control codes and 6 bits for spaces, instead of generating an optimal tree from the actual symbol probabilities. This has not changed in 4.0: no matter what data I gave the encoder, it used this same tree whenever it emitted dynamic blocks. Doing so simplifies and speeds up the encoder, but it produces suboptimal results on both text and binary data. For text, coding space is wasted on many bytes that never actually appear but have to be in the tree in case they do; for binary data it is seriously bad, because the most common byte values suffer terrible expansion (8 bits to 14 bits). The result is that the Huffman portion of .NET's Deflate implementation is largely ineffective and sometimes counterproductive.

I find this design decision odd. Granted, by not generating an optimal set of Huffman trees, the compressor can emit codes almost as soon as they are produced, with no need to buffer output data. But the encoder already has to handle bit packing for its variable-length codes, so the only additional work for optimal trees is measuring the data: a histogram and a quick tree node merging pass, neither of which takes long. I suspect most .NET programmers would gladly trade a small delay and an extra 64K of memory for properly optimized trees.

Buffer size issues

Another oddity of GZipStream is its sensitivity to buffer sizes. With the test program above, where the entire buffer is fed to the encoder, the output consists of a series of nearly 4K blocks:

New block: stored mode (4090 bytes)

New block: stored mode (4090 bytes)

New block: stored mode (4090 bytes)

...

I have no idea why 4090 bytes: the overhead for a stored block is only about five bytes, and a stored block can hold up to 65535 bytes. That's suboptimal, but a very minor transgression. Writing blocks of 4096 bytes, however, produces worse output:

New block: stored mode (4090 bytes)

New block: stored mode (6 bytes)

New block: stored mode (4090 bytes)

New block: stored mode (6 bytes)

...

What appears to be happening is that the compressor doesn't use its buffer as a sliding window, but rather keeps resetting it. This fragments the output into more blocks and also has the undesirable property of making the compression ratio dependent upon the buffer size used. Furthermore, occasionally the compressor will do really odd things like this:

New block: stored mode (4090 bytes)

New block: dynamic Huffman trees

Code 0: 14 bits

Codes 1-6: 14 bits

Code 7: 14 bits

Code 8: 14 bits

Code 9: 12 bits

...

Code 313: 5 bits

Code 314: 5 bits

Code 315: 5 bits

Dynamic block header took 790 bits

Literal <b4>

Literal <34>

Literal <65>

Literal <30>

Literal <6b>

Literal <4e>

EOB

It appears in this case that the compressibility check failed; the encoder emitted a stored block of 4090 bytes, and then used a dynamic block for the rest... using 106 bytes to encode 6 bytes. This pattern occurs several times throughout the file.

Conclusion

Given all of this, the conclusion we can draw is that GZipStream and DeflateStream are not significantly improved in .NET Framework 4.0. The compressor now has checks to prevent major expansion of data, but you can expect compression ratios across the board to be significantly worse than standard implementations like zlib. As such, I highly suggest using another library for your compression needs.