Suppose you are talking with three patients in the waiting room of a doctor’s office. All three of them have just completed a medical test which, after some processing, yields one of two possible results: the disease is either present or absent. Let’s assume we are dealing with curious and data-oriented individuals. They’ve researched the probabilities for their specific risk profiles in advance and are now eager to find out the result.

Patient A knows that, statistically, there is a 95% chance that he has the disease in question. For Patient B, the probability of having the illness is 30%. Patient C, in contrast, faces a 50/50 probability.

Uncertainty in the waiting room

I would like to focus on a simple question. All other things being equal, which of the three patients is confronted with the greatest degree of uncertainty?

I think the answer is clear: patient C. He is not merely experiencing “a lot of uncertainty”. What he is going through is the greatest degree of uncertainty possible under the circumstances: a dramatic medical version of a fair coin flip.

Compare this with patient A. Sure, the overall situation looks quite grim, but at least this patient is experiencing little uncertainty with regard to his medical prospects.

Intuitively speaking, what can we say about patient B? Perhaps that her situation falls “somewhere in the middle”?

This is where entropy comes in. Describing a situation as “somewhere in the middle” might be good enough for waiting room talk, but it’s certainly too coarse a description for machine learning purposes.

Measuring uncertainty

Entropy allows us to make precise statements and perform computations with regard to one of life’s most pressing issues: not knowing how things will turn out.

Entropy, in other words, is a measure of uncertainty.

(It is also a measure of information, but, personally, I prefer the uncertainty interpretation. It might just be me, but things seemed a lot clearer when I no longer attempted to impose my preconceived notion of information on the equations.)

In a way, saying that entropy is “a measure of uncertainty” is an understatement. Given certain assumptions (and foreshadowing an important result mentioned below), entropy is the measure of uncertainty.

By the way, when I use the term entropy, I’m referring to Shannon entropy. There are quite a few other entropies, but I think it’s safe to assume that Shannon entropy is the one that is used most frequently in natural language processing and machine learning.

So without further ado, here it is, the entropy formula for an event X with n possible outcomes and probabilities p_1, …, p_n:

H(X) = −(p_1 · log₂(p_1) + p_2 · log₂(p_2) + … + p_n · log₂(p_n))
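To make the formula concrete, here is a minimal Python sketch (the function name entropy is my own choice; the p > 0 filter encodes the usual convention that 0 · log 0 = 0):

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits: H = -(p_1*log2(p_1) + ... + p_n*log2(p_n)).
    Outcomes with probability 0 contribute nothing (convention: 0*log 0 = 0)."""
    return -sum(p * log2(p) for p in probs if p > 0)

# The three patients from the waiting room:
print(entropy([0.95, 0.05]))  # patient A: low uncertainty (~0.29 bits)
print(entropy([0.30, 0.70]))  # patient B: ~0.88 bits
print(entropy([0.50, 0.50]))  # patient C: maximum uncertainty, 1.0 bit
```

The printed values match the intuition from the waiting room: patient C's 50/50 situation carries the most uncertainty, patient A's the least.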

Basic properties

If you are anything like me when I first looked at this formula, you might be asking yourself questions such as: Why the logarithm? Why is this a good measure of uncertainty at all? And, of course, why the letter H? (Apparently, the use of the English letter H evolved from the Greek capital letter Eta, although the history appears to be quite complicated.)

One thing I’ve learned over time is that a good starting point — here and in many other cases — is to ask two questions: (1) Which desirable properties does the mathematical construct I’m trying to understand have? And (2) Are there any competing constructs that have all of these desirable properties?

In short, the answers for Shannon entropy as a measure of uncertainty are: (1) many and (2) no.

Let’s proceed with a wish list.

Basic property 1: Uniform distributions have maximum uncertainty

If your goal is to minimize uncertainty, stay away from uniform probability distributions.

Quick reminder: A probability distribution is a function that assigns a probability to every possible outcome such that the probabilities add up to 1. A distribution is uniform when all of the outcomes have the same probability. For example, fair coins (50% heads, 50% tails) and fair dice (1/6 probability for each of the six faces) follow uniform distributions.

Uniform distributions have maximum entropy for a given number of outcomes.

A good measure of uncertainty achieves its highest values for uniform distributions. Entropy satisfies this criterion. Given n possible outcomes, entropy is maximized when all outcomes are equiprobable:

H(p_1, …, p_n) ≤ H(1/n, …, 1/n)

Here is the plot of the entropy function as applied to Bernoulli trials (events with two possible outcomes and probabilities p and 1 − p):

In the case of Bernoulli trials, entropy reaches its maximum value for p=0.5
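The shape of this curve is easy to verify numerically. A quick sketch (function name mine):

```python
from math import log2

def bernoulli_entropy(p):
    # H(p, 1 - p), with the convention that 0 * log2(0) = 0
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

# Sample the curve at p = 0.1, 0.2, ..., 0.9: it rises to 1 bit
# at p = 0.5 and falls off symmetrically on either side.
print([round(bernoulli_entropy(p / 10), 3) for p in range(1, 10)])
# → [0.469, 0.722, 0.881, 0.971, 1.0, 0.971, 0.881, 0.722, 0.469]
```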

Basic property 2: Uncertainty is additive for independent events

Let A and B be independent events. In other words, knowing the outcome of event A does not tell us anything about the outcome of event B.

The uncertainty associated with both events — this is another item on our wish list — should be the sum of the individual uncertainties:

H(A, B) = H(A) + H(B)

Let’s use the example of flipping two coins to make this more concrete. We can either flip both coins simultaneously or first flip one coin and then flip the other one. Another way to think about this is that we can either report the outcome of the two coin flips at once or separately. The uncertainty is the same in either case.

To make this even more concrete, consider two particular coins. The first coin lands heads (H) up with an 80% probability and tails (T) up with a probability of 20%. The probabilities for the other coin are 60% and 40%. If we flip both coins simultaneously, there are four possible outcomes: HH, HT, TH and TT. The corresponding probabilities are given by [ 0.48, 0.32, 0.12, 0.08 ].

The joint entropy (green) for the two independent events is equal to the sum of the individual events (red and blue).

Plugging the numbers into the entropy formula, we see that:

H(0.48, 0.32, 0.12, 0.08) = H(0.8, 0.2) + H(0.6, 0.4) ≈ 0.72 + 0.97 ≈ 1.69

Just as promised.
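The arithmetic is easy to check in code. A small sketch (variable names mine):

```python
from math import log2
from itertools import product

def entropy(probs):
    # Shannon entropy in bits; the p > 0 filter encodes 0 * log 0 = 0.
    return -sum(p * log2(p) for p in probs if p > 0)

coin_1 = [0.8, 0.2]  # P(heads), P(tails)
coin_2 = [0.6, 0.4]

# Joint distribution of two independent flips: HH, HT, TH, TT.
joint = [p * q for p, q in product(coin_1, coin_2)]

print(round(entropy(joint), 4))                     # joint entropy
print(round(entropy(coin_1) + entropy(coin_2), 4))  # sum of the parts
```

Both lines print the same value (about 1.69 bits), confirming additivity for this pair of independent events.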

Basic property 3: Adding an outcome with zero probability has no effect

Suppose (a) you win whenever outcome #1 occurs and (b) you can choose between two probability distributions, A and B. Distribution A has two outcomes: say, 80% and 20%. Distribution B has three outcomes with probabilities 80%, 20% and 0%.

Adding a third outcome with zero probability doesn’t make a difference.

Given the options A and B, which one would you choose? An appropriate reaction at this point would be to shrug your shoulders or roll your eyes. The inclusion of the third outcome neither increases nor decreases the uncertainty associated with the game. A or B, who cares. It doesn’t matter.

The entropy formula agrees with this assessment:

H(0.8, 0.2, 0) = H(0.8, 0.2)

In words, adding an outcome with zero probability has no effect on the measurement of uncertainty.
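In code, this amounts to skipping zero-probability terms. A sketch (the filter implements the standard convention that 0 · log 0 = 0):

```python
from math import log2

def entropy(probs):
    # Outcomes with p == 0 are skipped: by convention, 0 * log2(0) = 0,
    # which is also the limit of p * log2(p) as p approaches 0.
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.8, 0.2]))       # distribution A
print(entropy([0.8, 0.2, 0.0]))  # distribution B: exactly the same value
```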

Basic property 4: The measure of uncertainty is continuous in all its arguments

The last of the basic properties is continuity.

Famously, the intuitive explanation of a continuous function is that there are no “gaps” or “holes”. More precisely, arbitrarily small changes in the output (uncertainty, in our case) should be achievable through sufficiently small changes in the input (probabilities).

Logarithm functions are continuous at every point at which they are defined, and sums and products of finitely many continuous functions are themselves continuous. Together with the convention that 0 · log(0) = 0 (the limit of p · log(p) as p approaches 0), it follows that the entropy function is continuous in its probability arguments.

The Uniqueness Theorem

Khinchin (1957) showed that the only family of functions satisfying the four basic properties described above is of the following form:

H(p_1, …, p_n) = −λ · (p_1 · log(p_1) + … + p_n · log(p_n))

where λ is a positive constant. Khinchin referred to this as the Uniqueness Theorem. Setting λ = 1 and using the binary logarithm gives us the Shannon entropy.

To reiterate, entropy is used because it has desirable properties and is the natural choice among the family of functions that satisfy all items on the basic wish list (properties 1–4). (I might discuss the proof in a separate article in the future.)

Other properties

Entropy has many other properties beyond the four basic ones used in Khinchin’s Uniqueness Theorem. Let me just mention some of them here.

Property 5: Uniform distributions with more outcomes have more uncertainty

Suppose you have the choice between two options: (A) a fair coin and (B) a fair die:

Fair coin or fair die?

And let’s say you win if the coin lands heads up or the die lands on face 1.

Which of the two options would you choose? A if you are a profit maximizer (a 1/2 chance of winning beats 1/6) and B if you prefer more variety and uncertainty.

As the number of equiprobable outcomes increases, so should our measure of uncertainty.

And this is exactly what entropy does: H(1/6, 1/6, 1/6, 1/6, 1/6, 1/6) > H(0.5, 0.5).

And, in general, if we let L(k) be the entropy of a uniform distribution with k possible outcomes, we have

L(m) > L(n)

for m > n.
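A quick numerical illustration (L here is just a helper name I've chosen; for k equiprobable outcomes the entropy works out to log₂(k)):

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def L(k):
    # Entropy of a uniform distribution with k outcomes; equals log2(k).
    return entropy([1 / k] * k)

print(round(L(2), 4))    # fair coin: 1.0 bit
print(round(L(6), 4))    # fair die: ~2.585 bits
print(round(L(100), 4))  # more equiprobable outcomes, more uncertainty
```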

Property 6: Events have non-negative uncertainty

Do you know what negative uncertainty is? Neither do I.

A user-friendly measure of uncertainty should always return a non-negative quantity, no matter what the input is.

This is yet another criterion that is satisfied by entropy. Let’s take another look at the formula:

H(X) = −(p_1 · log₂(p_1) + p_2 · log₂(p_2) + … + p_n · log₂(p_n))

Probabilities are, by definition, in the range between 0 and 1 and, therefore, non-negative. The logarithm of a probability is non-positive. Multiplying the logarithm of a probability with a probability doesn’t change the sign. The sum of non-positive products is non-positive. And finally, the negative of a non-positive value is non-negative. Entropy is, thus, non-negative for every possible input.

Property 7: Events with a certain outcome have zero uncertainty

Suppose you are in possession of a magical coin. No matter how you flip the coin, it always lands head up.

A magical coin

How would you quantify the uncertainty about the magical coin or any other situation in which one outcome is certain to occur? Well, there is none. So the natural answer, I think you will agree, is 0.

Does entropy agree with this intuition? Of course.

Suppose that outcome i is certain to occur. It follows that p_i, the probability of outcome i, is equal to 1, and that all other outcomes have probability 0. H(X) thus simplifies to:

H(X) = −1 · log₂(1) = 0

The entropy for events with a certain outcome is zero.
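The same simplification can be seen numerically (a sketch, using the same zero-probability convention as before):

```python
from math import log2

def entropy(probs):
    # By convention, 0 * log2(0) = 0, so zero-probability outcomes drop out.
    return -sum(p * log2(p) for p in probs if p > 0)

# The magical coin: heads is certain, tails never happens.
magical_coin = [1.0, 0.0]
print(entropy(magical_coin) == 0)  # True: no uncertainty at all
```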

Property 8: Flipping the arguments has no effect

This is another obviously desirable property. Consider two cases. In the first case, the probabilities of heads and tails are 80% and 20%. In the second case, the probabilities are reversed: heads 20%, tails 80%.

Both coin flips are equally uncertain and have the same entropy: H(0.8, 0.2) = H(0.2, 0.8).

In more general terms, for the case of two outcomes, we have:

H(p, 1 − p) = H(1 − p, p)

This fact applies to any number of outcomes. We can position the arguments (i.e., the probabilities of a distribution) in any order we like. The result of the entropy function is always the same.
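Permutation invariance is also easy to confirm numerically. A sketch:

```python
from math import log2
from itertools import permutations

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.8, 0.2]) == entropy([0.2, 0.8]))  # True

# All 24 orderings of four probabilities give the same entropy
# (rounding guards against tiny float summation differences).
probs = [0.5, 0.25, 0.15, 0.1]
values = {round(entropy(list(p)), 10) for p in permutations(probs)}
print(len(values))  # 1
```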

Summary

To recap, Shannon entropy is a measure of uncertainty.

It is widely used because it satisfies certain criteria (and because life is full of uncertainty). The Uniqueness Theorem tells us that only one family of functions has all four of the basic properties we’ve mentioned. Shannon entropy is the natural choice among this family.

Among other properties, entropy is maximal for uniform distributions (property #1), additive for independent events (#2), increasing in the number of outcomes with non-zero probabilities (#3 and #5), continuous (#4), non-negative (#6), zero for certain outcomes (#7) and permutation-invariant (#8).