One of the most important quantities in information theory is the mutual information between two random variables. When the joint distribution of the random variables is known, the mutual information can be calculated with numerical integration, evaluating the double integral in the definition of mutual information. However, the dimension of the random variables is often high, making the integration very hard. Additionally, sometimes we don't know the distribution at all, but we have access to samples from the joint distribution of the two random variables. We would then like to estimate the mutual information directly from the samples. In a recent paper ("MINE: Mutual Information Neural Estimation") [1], the authors show how to approximate the mutual information from samples even if the distribution of the random variables is unknown, with an approach that also scales to higher dimensions. This is done by combining the maximization of a lower bound on the mutual information with Monte Carlo integration. Since the lower bound to be maximized depends on finding a function that maximizes an objective, we can simply parametrize this function as a neural network and optimize it via gradient descent. Here I will try to summarize the idea of the paper by illustrating it with a simple example and code.

Mutual information traces back to Claude Shannon's [2] famous 1948 paper "A Mathematical Theory of Communication", and I will use the continuous Gaussian channel as an example for explaining the MINE idea. For the purpose of communicating over a channel, if we model a discrete-time channel input signal as a random variable $X$ and the received signal as a random variable $Y$, the upper limit on the amount of information that can be transported is $I(X;Y)$, the mutual information between $X$ and $Y$, as shown by Shannon. The mutual information is defined as

$$I(X;Y) = \int_{\mathcal{Y}} \int_{\mathcal{X}} p_{XY}(x,y) \log \frac{p_{XY}(x,y)}{p_X(x)\,p_Y(y)} \, dx \, dy$$
Note that the mutual information can be written as the KL divergence between the joint distribution and the product of the marginals, in the form $I(X;Y) = D_{KL}(p_{XY} \,\|\, p_X p_Y)$, since

$$D_{KL}(p_{XY} \,\|\, p_X p_Y) = \int_{\mathcal{Y}} \int_{\mathcal{X}} p_{XY}(x,y) \log \frac{p_{XY}(x,y)}{p_X(x)\,p_Y(y)} \, dx \, dy = I(X;Y)$$
One way to avoid the double integral is Monte Carlo integration. The idea is to take $K$ samples from the joint distribution and calculate

$$I(X;Y) \approx \frac{1}{K} \sum_{k=1}^{K} \log \frac{p_{XY}(x_k, y_k)}{p_X(x_k)\,p_Y(y_k)}$$
where $(x_k, y_k)$ are samples from the joint distribution. Note that here we can estimate the mutual information directly from the samples, but we still need to know how to evaluate the joint probability of each sample as well as the marginals.
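As a quick illustration of this Monte Carlo approach, here is a minimal sketch (my own, not from the paper) for two correlated unit-variance Gaussians, where all densities are known in closed form and the true value is $-\frac{1}{2}\ln(1-\rho^2)$ nats; the correlation rho and sample count are arbitrary choices:

import numpy as np
from scipy.stats import norm, multivariate_normal

# Monte Carlo estimate of I(X;Y) for correlated Gaussians with known densities
rho = 0.8
K = 100000
cov = [[1.0, rho], [rho, 1.0]]
samples = np.random.multivariate_normal([0.0, 0.0], cov, K)
x, y = samples[:, 0], samples[:, 1]

p_xy = multivariate_normal([0.0, 0.0], cov).pdf(samples)  # joint density
p_x = norm.pdf(x)  # marginal density of X
p_y = norm.pdf(y)  # marginal density of Y

mi_mc = np.mean(np.log(p_xy / (p_x * p_y)))
print(mi_mc, -0.5 * np.log(1 - rho**2))  # both around 0.51 nats

This works only because the densities can be evaluated; the point of MINE is to remove exactly that requirement.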

One way the authors propose to avoid evaluating these densities is to use the Donsker-Varadhan representation of the KL divergence,

$$D_{KL}(P \,\|\, Q) = \sup_{T : \Omega \to \mathbb{R}} \mathbb{E}_P[T] - \log\!\left(\mathbb{E}_Q\!\left[e^{T}\right]\right),$$

where any fixed function $T$ gives a lower bound on the KL divergence. The idea is to have an auxiliary function $T_\theta$, parametrized by $\theta$, that can be optimized to make this lower bound as high as possible. If we apply this bound to the KL divergence between the joint and the product of marginals, we obtain a lower bound on the mutual information

$$I(X;Y) \geq \mathbb{E}_{p_{XY}}\!\left[T_\theta(x,y)\right] - \log\!\left(\mathbb{E}_{p_X p_Y}\!\left[e^{T_\theta(x,y)}\right]\right)$$
Additionally, to avoid computing the integrals we can again use Monte Carlo integration, so the lower bound can be estimated from samples as

$$I(X;Y) \geq \frac{1}{K} \sum_{k=1}^{K} T_\theta(x_k, y_k) - \log\!\left(\frac{1}{K} \sum_{k=1}^{K} e^{T_\theta(x_k, \bar{y}_k)}\right)$$
where the pairs $(x_k, y_k)$ come from the joint distribution and $\bar{y}_k$ comes from the marginal of $Y$ ($x_k$ and $\bar{y}_k$ do not come from the same realization, so together they follow the product of marginals; in practice $\bar{y}_k$ is obtained by shuffling the $y$ samples). The idea of MINE (mutual information neural estimator) is to define the function $T_\theta$ as a neural network whose weights are $\theta$, and to maximize this lower bound, which can be made very tight, by gradient ascent (in practice, gradient descent on the negated bound).

We can write this differentiable lower bound to be maximized in TensorFlow. Here we assume that the auxiliary function $T_\theta$ is a neural network with one hidden layer of 10 neurons and a ReLU activation function. The whole code defining the lower bound:

import tensorflow as tf

# placeholders: x and y are sampled jointly, y_ is a shuffled (marginal) copy of y
x  = tf.placeholder(tf.float32, [None, 1])
y  = tf.placeholder(tf.float32, [None, 1])
y_ = tf.placeholder(tf.float32, [None, 1])

# one hidden layer with 10 neurons and ReLU activation
n_hidden = 10
Wx = tf.Variable(tf.random_normal(stddev=0.1, shape=[1, n_hidden]))
Wy = tf.Variable(tf.random_normal(stddev=0.1, shape=[1, n_hidden]))
b  = tf.Variable(tf.constant(0.1, shape=[n_hidden]))
hidden_joint = tf.nn.relu(tf.matmul(x, Wx) + tf.matmul(y, Wy) + b)   # T on pairs from the joint
hidden_marg  = tf.nn.relu(tf.matmul(x, Wx) + tf.matmul(y_, Wy) + b)  # T on pairs from the marginals

# linear output layer producing the scalar T
Wout = tf.Variable(tf.random_normal(stddev=0.1, shape=[n_hidden, 1]))
bout = tf.Variable(tf.constant(0.1, shape=[1]))
out_joint = tf.matmul(hidden_joint, Wout) + bout
out_marg  = tf.matmul(hidden_marg, Wout) + bout

# Donsker-Varadhan lower bound (in nats, since tf.log is the natural log);
# maximizing the bound is implemented as minimizing its negative
lower_bound = tf.reduce_mean(out_joint) - tf.log(tf.reduce_mean(tf.exp(out_marg)))
train_step = tf.train.AdamOptimizer(0.005).minimize(-lower_bound)

Now let's see if this approach can compute the mutual information for some simple cases. Let's test it on binary phase shift keying (BPSK) modulation over a Gaussian channel. The input of the channel is a random variable $X$ that takes two equiprobable values, $+1$ and $-1$. After transmission over a channel that adds white Gaussian noise $N$ with zero mean and variance $\sigma^2$, we obtain the received signal represented by the random variable $Y = X + N$. The mutual information can be calculated using Monte Carlo integration as

$$I(X;Y) \approx \frac{1}{K} \sum_{k=1}^{K} \log_2 \frac{p_{Y|X}(y_k \,|\, x_k)}{\frac{1}{2}\, p_{Y|X}(y_k \,|\, {+1}) + \frac{1}{2}\, p_{Y|X}(y_k \,|\, {-1})}$$
where $p_{Y|X}(y \,|\, x)$ is the density of a Gaussian centered at $x$ with variance $\sigma^2$. Using 100,000 samples to calculate the mutual information, we obtain 0.95 bit when the variance equals 0.2 and 0.48 bit when the variance equals 1. Note that the mutual information is the upper bound on the amount of information that can be transmitted over a channel, so when the noise variance is small we can operate our channel close to the maximum achievable rate of 1 bit per channel use (the BPSK limit); however, when the variance is high (variance of 1), only 0.48 bit per channel use can be transmitted, meaning that at least 52% of the transmitted bits must be reserved for redundancy (channel coding). Here is the code to calculate the mutual information for this example (discrete binary input, continuous output, Gaussian noise channel):

import numpy as np

N = 100000
sig2 = 0.2  # noise variance (use 1.0 for the second case)
x = np.sign(np.random.normal(0., 1., [N, 1]))        # equiprobable BPSK symbols +1/-1
y = x + np.random.normal(0., np.sqrt(sig2), [N, 1])  # AWGN channel output

# Gaussian likelihoods; the normalization constants cancel in the ratio below
p_y_x = np.exp(-(y - x)**2 / (2 * sig2))        # p(y|x) for the transmitted symbol
p_y_x_minus = np.exp(-(y + 1)**2 / (2 * sig2))  # p(y|x = -1)
p_y_x_plus = np.exp(-(y - 1)**2 / (2 * sig2))   # p(y|x = +1)

# Monte Carlo estimate of I(X;Y) in bits
mi = np.average(np.log2(p_y_x / (0.5 * p_y_x_minus + 0.5 * p_y_x_plus)))

Before calculating the mutual information using the MINE method, let's see if we can do better on this channel. The question Shannon asked was how to maximize the mutual information for a given channel, and one of his results is that we can choose the channel input distribution so as to maximize it. In the case of the Gaussian channel, the input distribution that maximizes the mutual information is also Gaussian. So we can recalculate the mutual information using a zero-mean Gaussian with unit variance as the input. Note that with unit variance we keep the average energy of the input unchanged compared to the BPSK case (both have unit average energy). This mutual information has the analytical solution $I(X;Y) = \frac{1}{2}\log_2\!\left(1 + \frac{1}{\sigma^2}\right)$. So we can increase the mutual information from 0.95 to 1.29 bits when the variance is 0.2, and from 0.48 to 0.5 bits when the variance is 1.
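As a quick sanity check (a sketch of my own, not code from the original post), the same Monte Carlo recipe with a Gaussian input reproduces this analytical value, using the fact that the marginal of $Y$ is a zero-mean Gaussian with variance $1 + \sigma^2$:

import numpy as np

N = 100000
sig2 = 0.2
x = np.random.normal(0., 1., [N, 1])                 # Gaussian input, unit power
y = x + np.random.normal(0., np.sqrt(sig2), [N, 1])  # AWGN channel output

# here the normalization constants do NOT cancel (different variances), so keep them
p_y_x = np.exp(-(y - x)**2 / (2 * sig2)) / np.sqrt(2 * np.pi * sig2)              # p(y|x)
p_y = np.exp(-y**2 / (2 * (1 + sig2))) / np.sqrt(2 * np.pi * (1 + sig2))          # marginal p(y)

mi = np.average(np.log2(p_y_x / p_y))
print(mi, 0.5 * np.log2(1 + 1 / sig2))  # both around 1.29 bits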

Now we use the MINE idea to estimate these four mutual information values (two variances and two input distributions). The training loop corresponding to the TensorFlow implementation:

sig2 = 0.2  # noise variance (use 1.0 for the second case)
N = 20000   # batch size

sess = tf.Session()
sess.run(tf.global_variables_initializer())

for i in range(1000):
    # BPSK input; replace with np.random.normal(0., 1., [N, 1]) for the Gaussian input
    x_sample = np.sign(np.random.normal(0., 1., [N, 1]))
    y_sample = x_sample + np.random.normal(0., np.sqrt(sig2), [N, 1])
    # shuffling y breaks the pairing, giving samples from the product of marginals
    y_shuffle = np.random.permutation(y_sample)
    sess.run(train_step, feed_dict={x: x_sample, y: y_sample, y_: y_shuffle})
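To read off the estimate (an addition of mine, reusing the graph defined above), note that tf.log is the natural logarithm, so the bound is computed in nats and should be divided by $\ln 2$ to compare against the values in bits:

# evaluate the bound on a fresh batch; divide by log(2) to convert nats to bits
mi_nats = sess.run(lower_bound, feed_dict={x: x_sample, y: y_sample, y_: y_shuffle})
print('Estimated I(X;Y) >= %.3f bits' % (mi_nats / np.log(2)))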

The results are shown with dashed lines corresponding to the theoretical mutual information (Gaussian input, black curves) and the Monte Carlo mutual information (blue curve, BPSK input), while the solid lines represent the results obtained by MINE as a function of the number of training steps. We can see that in all cases the estimator converges to the target value, which is very promising since the inputs to the algorithm are just sampled pairs! There is no need to know the distributions. I didn't test higher dimensions here, but I'll keep this algorithm in my toolbox.

So the algorithm works for a toy example in low dimension, and the last question before finishing the post is why we need to calculate the mutual information at all. In communication systems the answer is clear: the mutual information defines the achievable transmission rate over a channel, and it was essential for understanding the limits of communication over noisy channels. In machine learning there are several uses, but one of the most recent that caught my attention is the one described in "Opening the Black Box of Deep Neural Networks via Information" [3], where the authors track the evolution of the mutual information between the input and the hidden layers, as well as between the hidden layers and the output, to try to understand what happens during learning. Their hypothesis, after observing the evolution of the mutual information calculated via binning, is that learning happens in two phases: memorization, then compression.
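For completeness, a binning estimator of the kind mentioned above can be sketched in a few lines (my own minimal illustration; the bin count is an arbitrary choice): discretize the samples with a 2-D histogram and plug the empirical probabilities into the definition of mutual information.

import numpy as np

def binned_mi(x, y, bins=30):
    # joint probabilities from a 2-D histogram of the samples
    p_xy, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy /= p_xy.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal of x, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal of y, shape (1, bins)
    mask = p_xy > 0                        # avoid log(0) on empty bins
    return np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask]))

Binning is simple but sensitive to the number of bins and impractical in high dimensions, which is part of what makes a neural estimator like MINE attractive.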

I hope this short post helps people start calculating mutual information in their own problems.

Citations

[1] Belghazi, I., Rajeswar, S., Baratin, A., Hjelm, R. D., & Courville, A. (2018). MINE: Mutual Information Neural Estimation. arXiv preprint arXiv:1801.04062.
[2] Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379-423.
[3] Shwartz-Ziv, R., & Tishby, N. (2017). Opening the Black Box of Deep Neural Networks via Information. arXiv preprint arXiv:1703.00810.