There are plenty of existing error metrics, but our primary concern is with minimizing the amount of information we have to send. Both of these models reduce our problem to two parameters, number teeth and a probability (though we really only need the number of teeth for the uniform distribution). The best test of which is better is to ask which distribution preserves the most information from our original data source. This is where Kullback-Leibler Divergence comes in.

The entropy of our distribution

KL Divergence has its origins in information theory. The primary goal of information theory is to quantify how much information is in data. The most important metric in information theory is called Entropy, typically denoted as \(H\). The definition of Entropy for a probability distribution is:

$$H = -\sum_{i=1}^{N} p(x_i) \cdot \text{log }p(x_i)$$

If we use \(log_2\) for our calculation we can interpret entropy as "the minimum number of bits it would take us to encode our information". In this case, the information would be each observation of teeth counts given our empirical distribution. Given the data that we have observed, our probability distribution has an entropy of 3.12 bits. The number of bits tells us the lower bound for how many bits we would need, on average, to encode the number of teeth we would observe in a single case.

What entropy doesn't tell us is the optimal encoding scheme to help us achieve this compression. Optimal encoding of information is a very interesting topic, but not necessary for understanding KL divergence. The key thing with Entropy is that, simply knowing the theoretical lower bound on the number of bits we need, we have a way to quantify exactly how much information is in our data. Now that we can quantify this, we want to quantify how much information is lost when we substitute our observed distribution for a parameterized approximation.

Measuring information lost using Kullback-Leibler Divergence

Kullback-Leibler Divergence is just a slight modification of our formula for entropy. Rather than just having our probability distribution \(p\) we add in our approximating distribution \(q\). Then we look at the difference of the log values for each:

$$D_{KL}(p||q) = \sum_{i=1}^{N} p(x_i)\cdot (\text{log }p(x_i) - \text{log }q(x_i))$$



Essentially, what we're looking at with the KL divergence is the expectation of the log difference between the probability of data in the original distribution with the approximating distribution. Again, if we think in terms of \(log_2\) we can interpret this as "how many bits of information we expect to lose". We could rewrite our formula in terms of expectation:

$$D_{KL}(p||q) = E[\text{log } p(x) - \text{log } q(x)]$$

The more common way to see KL divergence written is as follows:

$$D_{KL}(p||q) = \sum_{i=1}^{N} p(x_i)\cdot log\frac{p(x_i)}{q(x_i)}$$

since \(\text{log }a - \text{log b} = \text{log }\frac{a}{b}\).

With KL divergence we can calculate exactly how much information is lost when we approximate one distribution with another. Let's go back to our data and see what the results look like.

Comparing our approximating distributions

Now we can go ahead and calculate the KL divergence for our two approximating distributions. For the uniform distribution we find:

$$D_{kl}(\text{Observed } || \text{ Uniform}) = 0.338$$

And for our Binomial approximation:

$$D_{kl}(\text{Observed } || \text{ Binomial}) = 0.477$$

As we can see the information lost by using the Binomial approximation is greater than using the uniform approximation. If we have to choose one to represent our observations, we're better off sticking with the Uniform approximation.

Divergence not distance

It may be tempting to think of KL Divergence as a distance metric, however we cannot use KL Divergence to measure the distance between two distributions. The reason for this is that KL Divergence is not symmetric. For example we if used our observed data as way of approximating the Binomial distribution we get a very different result:

$$D_{kl}( \text{Binomial} || \text{Observed}) = 0.330$$

Intuitively this makes sense as in each of these cases we're doing a very different form of approximation.

Optimizing using KL Divergence

When we chose our value for the Binomial distribution we chose our parameter for the probability by using the expected value that matched our data. But since we're optimizing for minimizing information loss, it's possible this wasn't really the best way choose the parameter. We can double check our work by looking at the way KL Divergence changes as we change our values for this parameter. Here's a chart of how those values change together: