The purpose of this post is to serve as a very basic introduction to Concentration Inequalities. Concentration inequalities deal with deviations of functions of independent random variables from their expectation. These inequalities are used extensively in Machine Learning Theory for deriving PAC like results.

You have an \(n\)-sided die, but you do not know if it is a fair one or not ie, you do not know the exact probability distribution of the die. Your job is to roll the die \(k\) times and estimate how close your estimated distribution is to the true distribution of the die.

Assume the faces are numbered \(1,2,..n\). Let the true probability that face \(i\) comes up when the die is rolled be \(P_i\). Let \(X_{i,t}\) be the indicator variable defined as follows:

\[X_{i,t}= \begin{cases} 1, & \mbox{if } \mbox{face } i \mbox{ comes up on roll } t \\ 0, & \mbox{otherwise} \end{cases}\]

The empirical estimate of \(P_i\) after \(k\) rolls would be:

\[\hat{P_i} = \frac{\sum_{t=1}^kX_{i,t}}{k}\]

This is just the average of \(X_{i,1},X_{i,2},..,X_{i,k}\). Lets calculate the expectation and variance of \(\hat{P_i}\).

The expected value of \(X_{i,t}\) is:

\[E[X_{i,t}] = 1\cdot P_i + 0 \cdot (1-P_i)=P_i\]

The expected value of \(\hat{P_i}\) is:

\[\begin{align*} E[\hat{P_i}] &= \frac{\sum_{t=1}^kE[X_{i,t}]}{k}\\ &=\frac{k \cdot P_i}{k}\\ &=P_i \end{align*}\]

The variance of \(X_{i,t}\) is:

\[Var[X_{i,t}] = E[X_{i,t}^2]-E[X_{i,t}]^2=P_i-P_i^2=P_i(1-P_i)\]

The variance of \(\hat{P_i}\) is:

\[\begin{align*} Var[\hat{P_i}]&=Var\bigg{[} \frac{\sum_{t=1}^kX_{i,t}}{k}\bigg{]}\\ &=\frac{1}{k^2}\sum_{t=1}^kVar[X_{i,t}] & \mbox{As }X_{i,t}\mbox{ are drawn i.i.d}\\ &=\frac{P_i(1-P_i)}{k} \end{align*}\]

Now we are ready to apply Hoeffding’s inequality. Let \(P=(P_1,P_2,..,P_n)\) and \(\hat{P}=(\hat{P_1},\hat{P_2},..,\hat{P_n})\). We have for all \(i\in\{1,2..,n\}\):

\[\Pr(\mid \hat{P_i}-P_i\mid \geq \epsilon)\leq 2 \exp (-2k\epsilon^2)\]

Using union bound:

\[\Pr(\bigcup\limits_{i=1}^{n} \mid \hat{P_i}-P_i\mid \geq \epsilon)\leq 2n \exp (-2k\epsilon^2)\\ \implies 1-\Pr(\bigcup\limits_{i=1}^{n} \mid \hat{P_i}-P_i\mid \geq \epsilon) \geq 1- 2n \exp (-2k\epsilon^2)\\ \implies Pr(\bigcap\limits_{i=1}^{n}\mid \hat{P_i}-P_i\mid \leq \epsilon) \geq 1- 2n \exp (-2k\epsilon^2)\\ \implies Pr(\|\hat{P} - P \|_\infty \leq \epsilon) \geq 1- 2n \exp (-2k\epsilon^2)\]

Let \(\delta = 2n \exp (-2k\epsilon^2)\).

\[k = \frac{1}{2\epsilon^2} \log(\frac{2n}{\delta})\]

Substituting for \(k\) we get :

\[\Pr(\|\hat{P} - P \|_\infty \leq \epsilon) \geq 1- \delta\]

This means for \(\hat{P}\) to get \(\epsilon\)-close to \(P\) in the infinity norm with probability atleast \(1-\delta\), we need to roll the die \(\frac{1}{2\epsilon^2} \log(\frac{2n}{\delta})\) number of times. To know the exact probability distribution of the die, we need to keep rolling the die forever.

Reference: