Given two finite sets $X$ and $Y$, a discrete memoryless channel is a specification of probability distributions $P(y | x)$, for $x \in X$ and $y \in Y$.

Informally, this notion is motivated as follows: a sender chooses a message $x \in X$ to send, which after going through the communication channel will result in a receiver receiving some $y \in Y$. In an ideal world, we would be certain of which $y$ would be received after sending $x$ (in other words, $y$ would be a function of $x$); but unfortunately, the channel adds random noise to the signal, such that after having chosen to send $x$, we only know the probability distribution $P(y | x)$ of what might be received.

Because the channel consists of $|X|$ probability distributions over a set of $|Y|$ elements, it can be represented by a $|Y| \times |X|$ transition matrix $Q$:

$$ Q_{yx} := P(y | x) $$

Therefore, each cell of $Q$ contains a value between 0 and 1, and each column of $Q$ sums to 1.
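
As a concrete illustration, here is a minimal sketch of such a matrix in NumPy; the binary symmetric channel with crossover probability 0.1 is just an assumed example, not anything fixed by the discussion above:

```python
import numpy as np

# Assumed example: a binary symmetric channel over X = Y = {0, 1},
# where the input bit is flipped with probability 0.1.
# Rows are indexed by y, columns by x, so Q[y, x] = P(y | x).
Q = np.array([
    [0.9, 0.1],  # P(y=0 | x=0), P(y=0 | x=1)
    [0.1, 0.9],  # P(y=1 | x=0), P(y=1 | x=1)
])

# Each entry lies in [0, 1], and each column sums to 1.
assert np.all((Q >= 0) & (Q <= 1))
assert np.allclose(Q.sum(axis=0), 1.0)
```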

If we assume a probability distribution $P_X$ over $X$, the channel $Q$ gives us a probability distribution $P_Y$ over $Y$, by $P(y) = \sum_{x}{P(y | x)P(x)}$. Representing $P_X$ and $P_Y$ as vectors, this relation can be written as a matrix multiplication:

$$ P_Y = Q P_X $$
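
Continuing the sketch above with an assumed input distribution $P_X$, this relation is a single matrix-vector product:

```python
# Assumed input distribution P_X over {0, 1}.
p_x = np.array([0.75, 0.25])

# P(y) = sum_x P(y | x) P(x), i.e. P_Y = Q P_X as a matrix-vector product.
p_y = Q @ p_x
print(p_y)  # [0.7 0.3] -- a valid probability distribution over Y
```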

The mutual information $I[X;Y]$ between $X$ and $Y$ represents the average amount of information about $X$ that is gained by learning $Y$; phrased differently, it is the amount of uncertainty (entropy) about $X$ that goes away by learning $Y$. Formally:

$$ I[X;Y] = H[X] - H[X | Y] = H[Y] - H[Y | X] $$

In the above formula, $H[X]$ is the entropy of the random variable $X$, and $H[X | Y]$ is the conditional entropy of $X$ given $Y$. Applied to the channel output $Y$, these quantities read:

$$ H[Y] = \sum_{y}{P(y) \log(\frac{1}{P(y)})} $$

$$ H[Y | X] = \sum_{x}{P(x) H[Y | X = x]} = \sum_{x}{P(x) \sum_{y}{P(y | x) \log(\frac{1}{P(y | x)})}} $$
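
Here is a small numerical sketch of these formulas, reusing the channel and input distribution above; the `entropy` and `mutual_information` helpers are hypothetical names of my own, and logarithms are taken in base 2 so that the results are in bits:

```python
def entropy(p):
    # H[p] = sum_i p_i * log2(1 / p_i), in bits; zero-probability outcomes contribute nothing.
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

def mutual_information(Q, p_x):
    # I[X;Y] = H[Y] - H[Y|X], where column x of Q is the distribution P(. | x).
    h_y = entropy(Q @ p_x)
    h_y_given_x = sum(p_x[x] * entropy(Q[:, x]) for x in range(len(p_x)))
    return h_y - h_y_given_x

print(mutual_information(Q, p_x))  # about 0.41 bits for the channel above
```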

What we want to prove

The result we want to prove is the following:

Given a fixed probability distribution $P_X$ over $X$, the function $I: Q \mapsto I[X;Y]$ is convex.

In detail, this means that given two channels $Q^{(0)}$ and $Q^{(1)}$, and a weight $\lambda \in [0,1]$, we can define a 'mixed' channel $Q^{(\lambda)} := \lambda Q^{(0)} + (1 - \lambda)Q^{(1)}$, with the property that $I(Q^{(\lambda)}) \leq \lambda I(Q^{(0)}) + (1 - \lambda)I(Q^{(1)})$.
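
As a quick numerical sanity check of this inequality (a sketch only, reusing the hypothetical `mutual_information` helper from above; the two channels and the input distribution are arbitrary examples):

```python
# Two arbitrary example channels over the same alphabets, and an arbitrary input distribution.
Q0 = np.array([[0.9, 0.2],
               [0.1, 0.8]])
Q1 = np.array([[0.6, 0.5],
               [0.4, 0.5]])
p_x = np.array([0.5, 0.5])

for lam in np.linspace(0.0, 1.0, 11):
    Q_lam = lam * Q0 + (1 - lam) * Q1
    mixed = mutual_information(Q_lam, p_x)
    bound = lam * mutual_information(Q0, p_x) + (1 - lam) * mutual_information(Q1, p_x)
    assert mixed <= bound + 1e-12  # I(Q^(lambda)) <= lambda I(Q^(0)) + (1 - lambda) I(Q^(1))
```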

Intuitive interpretation

In terms of capacity to convey information, mixing two channels is never better than using the more informative of the two channels we started with.

In particular, mixing two fully deterministic channels (which are maximally informative) can result in a noisy channel!
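
For instance, reusing the helpers sketched above: mixing the identity channel and the bit-flip channel (both deterministic) with equal weights yields a channel whose output is independent of its input.

```python
# Two deterministic channels on {0, 1}: the identity channel and the bit-flip channel.
Q_id = np.array([[1.0, 0.0],
                 [0.0, 1.0]])
Q_flip = np.array([[0.0, 1.0],
                   [1.0, 0.0]])
p_x = np.array([0.5, 0.5])

print(mutual_information(Q_id, p_x))    # 1.0 bit: the output determines the input
print(mutual_information(Q_flip, p_x))  # 1.0 bit: still deterministic, just relabelled
Q_mix = 0.5 * Q_id + 0.5 * Q_flip       # every entry is 0.5: pure noise
print(mutual_information(Q_mix, p_x))   # 0.0 bits: the output is independent of the input
```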