The result reduces the data by just a few bytes, which is not nearly as good as our tailored bit-packing result. This is an important lesson: using entropy coding alone to pack bits is NOT a silver bullet. In fact, with real-world data and poor predictions, entropy coding will probably make your data bigger. Why? Entropy coding works based on accurate predictions. Guess correctly and you get a little benefit; get it wrong by the same amount and you are punished much more.

The number of bits required to encode a given symbol can be calculated as -log2(Psym). Let's say you figure you have a 70% chance of encoding a 1 bit, but the next bit you read turns out to be a 0. That bit will cost -log2(1 - 0.70), about 1.74 bits, in the output stream. Had you been lucky and read a 1 bit instead, you would have encoded it in just -log2(0.70), about 0.51 bits. In other words, the correct prediction saves you 0.49 bits compared to storing the bit raw, while the mis-prediction costs you an additional 0.74 bits.
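To make the arithmetic concrete, here is a quick sketch in Python (the 70% figure comes from the example above; the helper name `bits_to_encode` is ours, purely for illustration):

```python
import math

def bits_to_encode(p):
    """Ideal entropy-coded cost, in bits, of a symbol predicted with probability p."""
    return -math.log2(p)

p_one = 0.70                       # we predict a 1 bit with 70% confidence
hit   = bits_to_encode(p_one)      # the 1 actually arrives: ~0.51 bits
miss  = bits_to_encode(1 - p_one)  # a 0 arrives instead:    ~1.74 bits

print(f"correct guess: {hit:.2f} bits (saves {1 - hit:.2f} vs raw)")
print(f"wrong guess:   {miss:.2f} bits (costs an extra {miss - 1:.2f} vs raw)")
```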

Think of it this way: the better you can predict the future, the better entropy coding works.

Let's try another way to compress our data. This time, rather than being adaptive, we'll come up with some fixed probabilities that work well, computed ahead of time. The above dataset has 128 bits. 10 of these bits 'transition', that is, each is the last of a run of 1's or 0's, and the following bit will be in the opposite state. This means that the probability of the next bit being in the same state as the current bit is about 92% (118/128).
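As a sketch of how you might derive that figure, the snippet below builds a stand-in bitstream with the same shape as the dataset (128 bits, 10 transitions; the run lengths are made up, not the actual data) and counts the transitions:

```python
def same_state_probability(bits):
    """Fraction of bits whose successor is in the same state,
    using the text's convention of dividing by the total bit count."""
    transitions = sum(1 for a, b in zip(bits, bits[1:]) if a != b)
    return 1 - transitions / len(bits)

# Stand-in stream: 11 runs -> 10 transitions across 128 bits.
bits, state = [], 0
for run in [20, 10, 15, 8, 12, 10, 14, 9, 10, 10, 10]:
    bits += [state] * run
    state ^= 1

print(f"P(next bit matches current) ~ {same_state_probability(bits):.0%}")  # ~92%
```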

The implementation works like this: we encode bits using two static models. One model is used when encoding runs of 0's; it predicts a 1 with a low probability (8%), because while you're encoding 0's there is little chance you will encounter a 1. The other model is the opposite: it is used when encoding runs of 1's and predicts a 1 with high probability (92%), because there you are unlikely to encounter a 0.

Another important rule is that you encode the current bit with the last model you used, regardless of the bit's value. So if you are encoding 0's and suddenly encounter a 1, you encode that 1 with the same model as the previous 0's, and only then switch to the 1-biased model. The reason is that the decoder won't know ahead of time what the next decoded symbol will be, so it must be able to follow the same rules. This switching of models during encoding and decoding is called a "context switch". What is very interesting here is that, as long as you preserve the encoder state, you can dynamically swap out the probability models as you go. The only requirement is that whatever you do during encoding, you can duplicate during decoding.
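Putting it together, here is a sketch that simulates the two-model scheme by charging each bit its ideal entropy cost, -log2 of its predicted probability. It is a cost model rather than a real arithmetic coder, and the bitstream is the same stand-in used above, but it follows the exact switching rule just described: encode under the current context first, then switch.

```python
import math

# Static models from the text: P(next bit is 1) in each context.
P_ONE = {0: 0.08,   # zero-run model
         1: 0.92}   # one-run model

def ideal_cost(bits):
    """Ideal compressed size, in bits, under two static models
    with context switching."""
    context = 0                 # encoder and decoder agree on the starting model
    total = 0.0
    for bit in bits:
        p = P_ONE[context] if bit == 1 else 1 - P_ONE[context]
        total += -math.log2(p)  # encode with the CURRENT model first...
        context = bit           # ...then context-switch to match the bit
    return total

bits, state = [], 0
for run in [20, 10, 15, 8, 12, 10, 14, 9, 10, 10, 10]:
    bits += [state] * run
    state ^= 1

print(f"{ideal_cost(bits):.1f} bits vs {len(bits)} bits raw")  # ~50.6 vs 128
```

Because the context depends only on bits that have already been coded, a decoder can reproduce every switch exactly.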

For this example, we are going to set aside our adaptive probability model for now and just use the simpler static probability models with context switching: