One of the provided references was T. K. Moon's introductory paper The Expectation-Maximization Algorithm, published in the November 1996 issue of the IEEE Signal Processing Magazine. You can download it from behind the IEEE Xplore paywall here or for free here. According to Google Scholar, it has been cited 1634 times. I like the paper: it is a light read that explains its concepts clearly and concisely, and it at least seems to give a very intuitive overview of the underlying mathematics.

The problem is just that it took me a substantial amount of time to get past the motivating example, which is given on the first few pages. It comes with a funny story (I recommend reading the article for more details on the tale of D. T. Ector and S. Tim Hatter) and is supposed to give the reader some intuition about the workings of the algorithm. The problem is posed as follows:

Suppose that in an image pattern-recognition problem, there are two general classes to be distinguished: a class of dark objects and a class of light objects. The class of dark objects may be further subdivided into two shapes: round and square. Using a pattern recognizer, it is desired to determine the probability of a dark object.

The author then informs us that the data samples are assumed to be drawn from a multinomial distribution, parameterized by some parameter \(p\), which we wish to estimate. For now, all that matters to us is that we can describe the distribution of the data with some pdf \(f(x_1, x_2, x_3\, |\, p)\), where, again, \(p\) is the unknown parameter. The problem statement continues with

A feature extractor is employed that can distinguish which objects are light and which are dark, but cannot distinguish shape. Let \([y_1,\ y_2]^{\mathsf{T}} =\, \mathbf{y}\) be the number of dark objects and number of light objects detected, respectively, so that \(y_1 = x_1 + x_2\) and \(y_2 = x_3\).

You get the issue: We want to obtain the ML estimate of the parameter \(p\), but the data we're observing is missing some information:

There is a many-to-one mapping between \(\{x_1,\ x_2\}\) and \(y_1\). For example, if \(y_1 = 3\), there is no way to tell from the measurements whether \(x_1 = 1\) and \(x_2 = 2\) or \(x_1 = 2\) and \(x_2 = 1\).

The EM algorithm is now supposed to help us obtain an estimate for the parameter \(p\), even though we cannot observe the \(x\)'s directly. Based on the definition of the algorithm and the known underlying multinomial distribution of the data, the author then derives equations for the two iteration steps and provides us with a table of the estimated values of \(x_1\), \(x_2\), and \(p\) for the first 10 steps of the iteration. In the E-step, we are supposed to compute estimates for \(x_1\) and \(x_2\) as

\begin{equation} x_{1}^{[k+1]} = y_{1} \frac{1/4}{1/2 + p^{[k]}/2} \end{equation}

and

\begin{equation} x_{2}^{[k+1]} = y_{1} \frac{1/4 + p^{[k]}/2}{1/2 + p^{[k]}/2} \end{equation}

and in the M-step, we are supposed to compute the estimate of the parameter \(p\) as

\begin{equation} p^{[k+1]} = \frac{2x_{2}^{[k+1]} - x_{3}}{x_{2}^{[k+1]} + x_{3}}. \end{equation}
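For completeness, since the introduction does not spell this step out: the M-step maximizes the expected complete-data log-likelihood with respect to \(p\). Using the category probabilities of the underlying trinomial distribution (given further below), the only terms that depend on \(p\) are \(x_{2}^{[k+1]} \ln(1/4 + p/4) + x_{3} \ln(1/2 - p/4)\), and setting the derivative with respect to \(p\) to zero yields

\begin{equation} \frac{x_{2}^{[k+1]}}{1+p} - \frac{x_{3}}{2-p} = 0 \quad\Longrightarrow\quad p = \frac{2x_{2}^{[k+1]} - x_{3}}{x_{2}^{[k+1]} + x_{3}}, \end{equation}

which is exactly the M-step above. (This is my own quick derivation, so treat it as a sanity check rather than a quote from the paper.)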

So far so good. We are also informed that the algorithm is initialized with \(p^{[0]} = 0\) and that out of 100 drawn samples, 100 are dark, i.e. \(y_1 = 100\), and that the true values of \(x_1\) and \(x_2\) are 25 and 38, respectively.

What?

This is where it starts to get messy. Of course, if we draw 100 samples, 25 of them are dark and square, and 38 of them are dark and round, it would be hard to get me to agree that the number of dark objects is 100. The right answer is of course \(y_1 = 63\). Obviously a typo, but a very disruptive one: I had to go back and forth to reassure myself that it actually was an error, which broke my reading flow.

The article also gives a table of results, in which we can clearly see how the estimated values of \(x_1\) and \(x_2\) converge to 25 and 38.

Given this data and the equations used to obtain it, I figured it would be a good idea to reproduce those results myself, since writing out the algorithm takes no more than a few lines of Python. So I did, and you can inspect the code below.

def em_table(y1, x3, kk):
    # EM iteration using the E-step equations exactly as printed in the article
    p_k = 0  # initialization p^[0] = 0, as given in the paper
    table = [['k', 'x1', 'x2', 'p']]
    for k in range(1, kk + 1):
        # E-step (equations as printed in the article)
        x1_k = y1 * (1/4) / (1/2 + p_k/2)
        x2_k = y1 * (1/4 + p_k/2) / (1/2 + p_k/2)
        # M-step
        p_k = (2*x2_k - x3) / (x2_k + x3)
        table.append([str(k)] + ['{:10.6f}'.format(s) for s in [x1_k, x2_k, p_k]])
    return table

n = 100       # total number of samples
y1 = 63       # dark objects (the article's typo fixed)
x3 = n - y1   # light objects
for row in em_table(y1, x3, kk=10):
    print('{:>2} {:>10} {:>10} {:>10}'.format(*row))

Running this small simulation gives us the following results:

 k          x1          x2           p
 1   31.500000   31.500000    0.379562
 2   22.833333   40.166667    0.561555
 3   20.172199   42.827801    0.609507
 4   19.571211   43.428789    0.619897
 5   19.445679   43.554321    0.622048
 6   19.419896   43.580104    0.622489
 7   19.414618   43.585382    0.622579
 8   19.413539   43.586461    0.622597
 9   19.413318   43.586682    0.622601
10   19.413273   43.586727    0.622602

This doesn't look like the table from the article at all! The numbers are converging to some values, but definitely not the values that we are expecting. Something is obviously wrong here. So what's the problem?

The problem is that while the simulation results presented in the article are correct, the equations claimed to have produced them are wrong. Take a look at the article. A few pages later (in Box 2), the author describes how to derive the expressions for \(x_{1}^{[k+1]}\) and \(x_{2}^{[k+1]}\) (the E-step) from the underlying trinomial distribution.

I won't reiterate the entire derivation here, but its end result is that given three discrete random variables \(X_1, X_2, X_3\) drawn from a trinomial distribution with probabilities \(p_1, p_2, p_3\), we can compute the conditional expectation \(E[X_2\, |\, X_1 + X_2 = y]\) as

\begin{equation} E[X_2\, |\, X_1 + X_2 = y] = y\frac{p_2}{p_1 - p_2}. \end{equation}

When this expression is used in the motivating example, we arrive at the equations for \(x_{1}^{[k+1]}\) and \(x_{2}^{[k+1]}\) that I gave earlier in the post (and which also appear in the paper).

This is incorrect. After going through the derivation by hand and verifying each of the steps, I realized that the cause of my frustration (and I was very frustrated at this point, because as a student my first assumption is always that I am the one in the wrong) was a simple sign error that seems to have propagated from here back to the example given in the introduction. You can check this yourself by working through the given derivation, but it turns out that an evil ghost must have flipped the plus sign in the denominator of the expression above. The correct answer is

\begin{equation} E[X_2\, |\, X_1 + X_2 = y] = y\frac{p_2}{p_1 + p_2}. \end{equation}
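One quick way to see this (my reasoning, not spelled out in the paper): conditioned on \(X_1 + X_2 = y\), the variable \(X_2\) follows a binomial distribution with \(y\) trials and success probability \(p_2/(p_1 + p_2)\), so its conditional mean is \(y\, p_2/(p_1 + p_2)\). If you prefer a numerical check, a few lines of Python will do; the value \(p = 0.52\) below is my arbitrary choice for the check, not a number taken from the paper.

import numpy as np

# Empirically verify E[X2 | X1 + X2 = y] = y * p2 / (p1 + p2)
# for the example's trinomial, evaluated at p = 0.52.
rng = np.random.default_rng(0)
p1, p2, p3 = 1/4, 1/4 + 0.52/4, 1/2 - 0.52/4
draws = rng.multinomial(100, [p1, p2, p3], size=200_000)
mask = draws[:, 0] + draws[:, 1] == 63   # condition on y = 63
print(draws[mask, 1].mean())             # empirical conditional mean
print(63 * p2 / (p1 + p2))               # analytical value: 38.0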

Cool. After almost losing my mind over this for a short while, I finally regained my sanity and plugged the corrected equations into my simulation:

\begin{equation} x_{1}^{[k+1]} = y_{1} \frac{1/4}{1/2 + p^{[k]}/4} \end{equation}

and

\begin{equation} x_{2}^{[k+1]} = y_{1} \frac{1/4 + p^{[k]}/4}{1/2 + p^{[k]}/4}. \end{equation}
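As a quick consistency check (my observation, not the paper's): the two numerators sum to the denominator, so \(x_{1}^{[k+1]} + x_{2}^{[k+1]} = y_1\) at every iteration. In other words, the E-step simply splits the observed count of dark objects between the two shapes according to the probabilities implied by the current estimate \(p^{[k]}\).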

For the sake of completeness: The underlying trinomial distribution given in the example was

\begin{equation} P(X_1=x_1, X_2=x_2, X_3=x_3\, |\,p) = \left(\frac{n!}{x_1!x_2!x_3!}\right)\left(\frac{1}{4}\right)^{x_1} \left(\frac{1}{4}+\frac{p}{4}\right)^{x_2}\left(\frac{1}{2}-\frac{p}{4}\right)^{x_3}. \end{equation}
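As a sanity check (my own arithmetic, not from the paper), the corrected iteration has its fixed point exactly at the values reported in the article's table: for \(p = 0.52\), the E-step gives

\begin{equation} x_1 = 63\,\frac{1/4}{1/2 + 0.52/4} = \frac{15.75}{0.63} = 25, \qquad x_2 = 63\,\frac{1/4 + 0.52/4}{1/2 + 0.52/4} = \frac{23.94}{0.63} = 38, \end{equation}

and the M-step maps these straight back to \(p = (2 \cdot 38 - 37)/(38 + 37) = 39/75 = 0.52\). This is also the direct ML estimate from the incomplete data: maximizing \(y_1 \ln(1/2 + p/4) + x_3 \ln(1/2 - p/4)\) with \(y_1 = 63\) and \(x_3 = 37\) gives \(p = 52/100\).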

Great! The correct simulation and its results can be seen below.

def em_table(y1, x3, kk):
    # EM iteration using the corrected E-step equations (p/4 instead of p/2)
    p_k = 0  # initialization p^[0] = 0, as given in the paper
    table = [['k', 'x1', 'x2', 'p']]
    for k in range(1, kk + 1):
        # E-step (corrected equations)
        x1_k = y1 * (1/4) / (1/2 + p_k/4)
        x2_k = y1 * (1/4 + p_k/4) / (1/2 + p_k/4)
        # M-step
        p_k = (2*x2_k - x3) / (x2_k + x3)
        table.append([str(k)] + ['{:10.6f}'.format(s) for s in [x1_k, x2_k, p_k]])
    return table

n = 100       # total number of samples
y1 = 63       # dark objects
x3 = n - y1   # light objects
for row in em_table(y1, x3, kk=10):
    print('{:>2} {:>10} {:>10} {:>10}'.format(*row))

Lo and behold, these results look a lot better: