How It Works

Have you ever tried compressing the string “hello world” with gzip? Let’s do it now:

$ echo "hello world" | gzip -c | wc -c
32

So the output is actually larger than the input, which is only 12 bytes (the string plus the newline that echo appends). And gzip is quite good with short input: xz produces an output size of 68 bytes. Of course, compressing short strings is not what these tools are made for, because you rarely need to make small strings even smaller, except when you do. That’s why shoco was written.
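The same measurement can be repeated without a shell. The following is a minimal sketch using Python’s standard gzip and lzma modules; the helper name compressed_sizes is made up for this example, and the exact byte counts may differ slightly from the command-line tools depending on library versions and default settings.

```python
import gzip
import lzma


def compressed_sizes(data: bytes) -> dict:
    """Return the input size and the gzip/xz output sizes, in bytes."""
    return {
        "input": len(data),
        "gzip": len(gzip.compress(data)),
        "xz": len(lzma.compress(data, format=lzma.FORMAT_XZ)),
    }


if __name__ == "__main__":
    # echo appends a newline, so the shell example compresses 12 bytes.
    print(compressed_sizes(b"hello world\n"))
```

Both formats add fixed headers and checksums, which is why very short inputs come out larger than they went in.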

shoco works best if your input is ASCII. In fact, the most remarkable property of shoco is that the compressed size will never exceed the size of your input string, provided it is plain ASCII. What is more, a plain ASCII string is itself valid input for the decompressor, which returns it unchanged. That property comes at a cost, however: if your input string is not entirely (or mostly) ASCII, the output may grow, and for some inputs it can grow quite a lot. That is especially true for multibyte encodings such as UTF-8. Latin-1 and comparable encodings fare better, but will still increase your output size if you don’t happen to hit a common character. Why is that so?

In every language, some characters are used more often than others, and English is no exception. So if one simply makes a list of the, say, sixteen most common characters, four bits are sufficient to refer to each of them (as opposed to the eight bits, one byte, that an ASCII character occupies). But what if the input string includes an uncommon character that is not in this list? Here’s the trick: we use the first bit of a char to indicate whether the following bits are short indices into the list of common characters or a normal ASCII byte. Since the first bit in plain ASCII is always 0, setting the first bit to 1 says “the next bits represent short indices for common chars”. But what if our character is not ASCII (meaning the first bit of the input char is not 0)? Then we insert a marker that says “copy the next byte over as-is”, and we’re done. That explains the growth for non-ASCII characters: this marker takes up a byte, doubling the effective size of the character.
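The marker idea can be sketched in a few lines of Python. This is a toy layout invented for illustration, not shoco’s actual format: it assumes a hypothetical list of eight common characters (3-bit indices), packs two of them into one byte whose top two bits are 11, and uses the byte 0x80 as the “copy the next byte as-is” marker. The names COMMON, ESCAPE, toy_compress and toy_decompress are made up for this sketch.

```python
# Hypothetical list of eight "common" characters; each gets a 3-bit index.
COMMON = b"etaoins "
COMMON_INDEX = {byte: index for index, byte in enumerate(COMMON)}

# Toy marker byte meaning "copy the next byte over as-is".
ESCAPE = 0x80


def toy_compress(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        first = COMMON_INDEX.get(data[i])
        second = COMMON_INDEX.get(data[i + 1]) if i + 1 < len(data) else None
        if first is not None and second is not None:
            # Two common characters fit into one byte: 11 aaa bbb.
            out.append(0xC0 | (first << 3) | second)
            i += 2
        elif data[i] & 0x80:
            # Non-ASCII byte: emit the marker, then the byte itself.
            out.append(ESCAPE)
            out.append(data[i])
            i += 1
        else:
            # Plain ASCII byte (first bit 0): copy it unchanged.
            out.append(data[i])
            i += 1
    return bytes(out)


def toy_decompress(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        if (byte & 0xC0) == 0xC0:
            # Packed pair: unpack both 3-bit indices.
            out.append(COMMON[(byte >> 3) & 0x07])
            out.append(COMMON[byte & 0x07])
            i += 1
        elif byte == ESCAPE:
            # Marker: the next byte is a literal.
            out.append(data[i + 1])
            i += 2
        else:
            # Plain ASCII byte: pass through.
            out.append(byte)
            i += 1
    return bytes(out)


if __name__ == "__main__":
    for sample in (b"test", b"hello world", "\u00e9".encode("utf-8")):
        packed = toy_compress(sample)
        print(sample, len(sample), "->", len(packed))
        assert toy_decompress(packed) == sample
```

Even in this simplified form, the properties described above are visible: ASCII input never grows, an uncompressed ASCII string passes through the decompressor unchanged, and every non-ASCII byte costs two bytes of output.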

How shoco actually marks these packed representations is a bit more complicated than that (e.g., we also need to specify how many packed characters follow, so a single leading bit won’t be sufficient), but the principle still holds.

But shoco is a bit smarter than just abbreviating characters based on absolute frequency, because languages have more regularities than that. Some characters are more likely to be encountered next to others; the canonical example is q, which is almost always followed by a u. In English, the, she, he, then are all very common words, and all have an h followed by an e. So if we assemble a list of common characters following common characters, we can make do with even fewer bits to represent these successor characters, and still have a good hit rate. That’s the idea of shoco: provide short representations of characters based on the previous character.
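As a rough illustration of such a list, the sketch below counts character pairs in a training text and keeps the most frequent successors for each of the most frequent characters. The function name build_successor_table and its parameters are assumptions made for this example; how shoco itself derives and stores its tables is not shown here.

```python
from collections import Counter, defaultdict


def build_successor_table(corpus: str, lead_count: int = 8,
                          successors_per_lead: int = 4) -> dict:
    """Map each frequent character to its most frequent successors."""
    char_counts = Counter(corpus)

    # pair_counts[prev][nxt] = how often nxt directly follows prev.
    pair_counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        pair_counts[prev][nxt] += 1

    leads = [char for char, _ in char_counts.most_common(lead_count)]
    return {
        lead: [char for char, _ in
               pair_counts[lead].most_common(successors_per_lead)]
        for lead in leads
    }


if __name__ == "__main__":
    table = build_successor_table("the then she he the")
    print(table["h"])
```

With four successors per leading character, a successor can be referenced with a 2-bit index, provided the decoder knows the previous character and holds the same table.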

This is far from optimal compression. But if one carefully aligns the representation packs to byte boundaries, and uses the ASCII-first-bit trick above to encode the indices, it works well enough. Moreover, it is blazingly fast. You wouldn’t want to use shoco for strings larger than, say, a hundred bytes, because at that size the fixed overhead of a full-blown compressor like gzip begins to be dwarfed by the advantages of the much more efficient algorithms it uses.

If one wanted to classify shoco, it would be an entropy encoder, because the length of the representation of a character is determined by the probability of encountering it in a given input string. This is in contrast to dictionary coders, which maintain a dictionary of common substrings. An optimal compression for short strings could probably be achieved using an arithmetic coder (also a type of entropy encoder), but most likely one could not achieve the same kind of performance that shoco delivers.
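To get a feeling for what an entropy coder could achieve at best, one can estimate the entropy of a text in bits per character, once from plain character frequencies and once conditioned on the previous character. The sketch below does this empirically; the function names are made up for this example, and the numbers only describe the sample text they are computed from, not shoco’s actual compression ratio.

```python
import math
from collections import Counter


def order0_entropy(text: str) -> float:
    """Bits per character if each character is coded independently."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def order1_entropy(text: str) -> float:
    """Bits per character if each character is coded given its predecessor."""
    pairs = Counter(zip(text, text[1:]))
    predecessors = Counter(text[:-1])
    total = len(text) - 1
    entropy = 0.0
    for (prev, _), n in pairs.items():
        # n / predecessors[prev] estimates P(next | prev).
        entropy -= (n / total) * math.log2(n / predecessors[prev])
    return entropy


if __name__ == "__main__":
    sample = "the then she he the"
    print(order0_entropy(sample), order1_entropy(sample))
```

For natural-language text the second figure is typically lower than the first, which is the regularity that coding characters based on the previous character exploits.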