Information theory kicked off with a bang in 1948 when Claude Shannon penned an impressively accessible article titled A Mathematical Theory of Communication. One year later Shannon and a colleague expounded on the work in a book titled The Mathematical Theory of Communication, noting its generality.

Information theory is an elegant framework for quantifying communication in the abstract, and it is useful in both analytical and numerical settings. Often the information in question can be thought of as a sequence, or a set of sequences, whose elements are drawn from some discrete or continuous variable.

To get a sense for the numerical utility of information theory let’s generate some random data representing the outcome of 500 trials of a fair-coin.

# From IPython
# Assuming numpy is installed
import numpy as np

trials = np.random.randint(0, 2, 500)

The variable trials contains 500 random samples. We can think of 0 as representing heads and 1 as representing tails. Note that the range for the elements chosen by randint is half-open: the low value (0) is included and the high value (2) is excluded, so each trial is either a 0 or a 1.
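
As a quick sanity check (a small snippet of my own, not from the original post, assuming the numpy import and trials array from above), np.unique confirms that only those two values appear:

# Confirm that only 0s and 1s were drawn,
# since randint's upper bound (2) is exclusive
np.unique(trials)
# -> array([0, 1])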

Suppose we want a single number to quantify the fairness of the coin. One of the first steps would be to generate counts of the outcomes of the trials, such as:

# Count how many times each outcome (0 or 1) occurs
pmf = np.bincount(trials)

On my machine,

In : pmf

Out : array([255, 245])

There are 255 0’s and 245 1’s in trials. This matches our intuition that there should be approximately as many heads as tails. However, this form of the data doesn’t generalize well. If we convert pmf to frequencies, we have a universal representation that is naturally comparable to other experiments with different sample sizes.

# First convert the integer array to a floating point array
pmf = pmf.astype(float)
# Next divide by its sum
# (this is sugar for pmf = pmf / pmf.sum())
pmf /= pmf.sum()

On my machine,

In : pmf

Out : array([ 0.51, 0.49])

This quantity represents the observed frequencies. The variable name pmf is meant to reflect that this quantity is associated with the probability mass function, though this is technically incorrect. The real probability mass function for our fair-coin trials would be given by

p(X = 0) = p(X = 1) = 1/2.

Let’s define a term, information entropy, for a probability mass function with probabilities pᵢ over all outcomes i to be

H = -Σᵢ pᵢ log(pᵢ)

The base of the logarithm determines the unit. Base 2 is referred to as bits. Base e is referred to as nats.
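
To make the relationship between the two units concrete, here is a small illustration of my own (not from the original post, assuming the numpy import from above); the two values differ only by a factor of log(2):

# Entropy of a fair coin in bits and in nats
p = np.array([0.5, 0.5])
h_bits = -(p * np.log2(p)).sum()   # 1.0 bit
h_nats = -(p * np.log(p)).sum()    # ~0.693 nats
# h_nats == h_bits * np.log(2), since log_e(x) = log_2(x) * log_e(2)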

In the case of our sample data the observed frequencies are 0.51 and 0.49, and, given sufficiently many trials of a fair coin, we expect

p(0) = p(1) = 1/2.

Our entropy term would then reduce to

H = -(1/2 · log₂(1/2) + 1/2 · log₂(1/2)) = 1 bit.

In Python, computing the entropy of our observed frequencies:

# NumPy's log is the natural log (base e)
# Dividing by log(2) converts it to base 2
h = -(pmf * np.log(pmf) / np.log(2)).sum()

On my machine,

In : h

Out : 0.9997114417528099

We can define this as a function.

def entropy(pmf):
    return -(pmf * np.log(pmf) / np.log(2)).sum()
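
For example (a usage sketch of my own, not from the original post), applying entropy to a few hand-built distributions gives the values we would expect. Note that this simple version produces a NaN if any bin has zero probability, because log(0) is undefined.

entropy(np.array([0.5, 0.5]))    # -> 1.0 bit, a fair coin
entropy(np.array([0.9, 0.1]))    # -> ~0.469 bits, a heavily biased coin
entropy(np.array([0.51, 0.49]))  # -> ~0.9997 bits, our sample frequencies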

Let’s step back and consider what happened.

We computed

H = -Σᵢ pᵢ log₂(pᵢ)    Eq(1)

Because this is a probability mass function we know that

Σᵢ pᵢ = 1

and

0 ≤ pᵢ ≤ 1 for every i.

In our two-outcome case we can rewrite p₁ in terms of p₀, since p₁ = 1 - p₀. Writing p for p₀, Eq(1) can then be rewritten as a function of a single probability:

H(p) = -p log₂(p) - (1 - p) log₂(1 - p)    Eq(2)

Equation 2 is known as the binary entropy function, and it has a pretty plot.

# Assuming pylab is installed
import pylab as pl

# Probabilities of heads: [start, stop) with the given step
x = np.arange(0.01, 1, .01)

# vstack is used to zip up two n-length vectors
# into a (2, n) shaped array, which is then transposed
pmfs = np.vstack((x, 1 - x))
pmfs = pmfs.T

# Define the output
binaryEntropy = []

# Loop over the rows and get the entropy at each step
for i in range(pmfs.shape[0]):
    binaryEntropy.append(entropy(pmfs[i, :]))

# Plot our output, black and thicker than the default
pl.plot(binaryEntropy, 'black', linewidth=12)

# Give it some titles
pl.title('Binary Entropy Function')
pl.xlabel('Frequency of heads')
pl.ylabel('Entropy (bits)')

# Get the current xticks
xloc, xticklabels = pl.xticks()

# Make the x-axis labels go from [0, 1]
pl.xticks(xloc, pl.arange(0, 1.2, .2))

# Turn on the grid
pl.grid()

# Show it
pl.show()

This produces the following:
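
As an aside (a vectorized sketch of my own, not from the original post), the same curve can be computed without the Python loop by evaluating Eq(2) directly on the x array defined above:

# Evaluate the binary entropy function H(p) for the whole x array at once
binary_entropy = -(x * np.log2(x) + (1 - x) * np.log2(1 - x))
# binary_entropy[i] matches entropy(pmfs[i, :]) for every i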

Stay tuned. Next week I’ll go over entropy calculations for the n-bin and continuous cases. The week after that I’ll go over joint entropy and mutual information.
