
One of the topics I have been trying to get a handle on is the Expectation Maximization (EM) algorithm. In this post I want to explain what the EM algorithm is used for and motivate the algorithm's importance. Lastly, I would like to tie the EM algorithm in with mixtures of Gaussians and Hidden Markov Models (HMMs).

One important task in machine learning is to estimate the parameters of the model that supposedly generated your data. One way to estimate these parameters is by computing the Maximum Likelihood (ML) estimate $\hat{\theta} = \arg\max_{\theta} p(X \mid \theta)$, where X is the observed data and $\theta$ represents the parameters of the model. Maximizing this likelihood means we find the value of $\theta$ that maximizes the probability of generating our data X.

For example, if you believed your data came from a Gaussian distribution, you could set up an expression for the likelihood and then maximize it. It turns out that for a Gaussian distribution the ML estimate of the mean is the sample mean, and the ML estimate of the variance/covariance is the sample variance/covariance. The main point is: sometimes there is a nice closed-form solution for finding the ML estimate.
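As a quick illustration, here is a minimal sketch in Python (using NumPy, on synthetic data of my own choosing) of the closed-form ML estimates for a Gaussian:

```python
# Minimal sketch: the ML estimates for a Gaussian are just the
# sample mean and the sample covariance (divided by N, not N-1).
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[2.0, 0.3], [0.3, 0.5]],
                            size=1000)

mu_ml = X.mean(axis=0)                          # ML estimate of the mean
sigma_ml = np.cov(X, rowvar=False, bias=True)   # ML estimate of the covariance

print("ML mean:", mu_ml)
print("ML covariance:\n", sigma_ml)
```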

As you may have guessed, it's not always easy to compute an ML estimate. Cases in which the data depends on hidden variables are good examples of when finding the ML estimate is not straightforward. For example, consider the case where you model the data as having come from a mixture of Gaussians. With a mixture of Gaussians we model the data as having originated from k Gaussians. So now, in addition to estimating the parameters of the k Gaussians, we also need to decide which data points came from which Gaussian. This unknown information is represented by a hidden (also called latent) variable, which we will name Z.
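For reference, the mixture of Gaussians writes the density of a point as a weighted sum of K Gaussian components (this is the standard formulation, e.g. in Bishop's PRML):

$$p(x \mid \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \ge 0,$$

where the mixing weights $\pi_k$ act as priors on which component generated a point.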

What our dilemma boils down to is that our likelihood $p(X \mid \theta)$, which can be rewritten as $p(X \mid \theta) = \sum_{Z} p(X, Z \mid \theta)$, depends on the hidden variable Z. Without information on Z we are stuck when it comes to maximizing this likelihood. [It is important to convince yourself that if we did know Z then maximizing the likelihood would indeed be a lot easier!]
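To see concretely why Z makes things hard, write out the log likelihood: the sum over Z sits inside the logarithm,

$$\ln p(X \mid \theta) = \ln \left( \sum_{Z} p(X, Z \mid \theta) \right),$$

so the logarithm no longer acts directly on the joint distribution and the nice closed-form solutions disappear. If Z were observed, we could maximize $\ln p(X, Z \mid \theta)$ directly instead.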

This is where the EM algorithm saves the day. It will help us maximize the likelihood iteratively.



The EM algorithm is an iterative algorithm that works as follows:

1. Initialize the parameters $\theta^{old}$ of the model using some technique. This is likely done with some knowledge of the application domain you are in.
2. E step: Evaluate $p(Z \mid X, \theta^{old})$ and substitute it into the expectation formula $Q(\theta, \theta^{old}) = \sum_{Z} p(Z \mid X, \theta^{old}) \ln p(X, Z \mid \theta)$.
3. M step: Maximize the expectation formula $Q(\theta, \theta^{old})$ with respect to $\theta$ to obtain $\theta^{new}$.
4. If the difference between $\theta^{new}$ and $\theta^{old}$ is below a threshold, stop. Otherwise set $\theta^{old}$ to be $\theta^{new}$ and go back to step 2.

With the EM algorithm, $Q(\theta, \theta^{old})$ is guaranteed to keep increasing every time you do an E step (2) followed by an M step (3). Although you are not guaranteed a global maximum, you will converge. Thus if $Q(\theta, \theta^{old})$ keeps increasing, so will our likelihood $p(X \mid \theta)$ (which is obtained by marginalizing over Z: $p(X \mid \theta) = \sum_{Z} p(X, Z \mid \theta)$).
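One way to see why this works (this is the decomposition Bishop uses in PRML, Chapter 9): for any distribution q(Z) over the hidden variables,

$$\ln p(X \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}\big(q(Z) \,\|\, p(Z \mid X, \theta)\big),$$

where $\mathcal{L}(q, \theta)$ is a lower bound on the log likelihood and the KL divergence is non-negative. The E step sets $q(Z) = p(Z \mid X, \theta^{old})$, which makes the bound tight, and the M step then increases $\mathcal{L}$, so the log likelihood can never decrease.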

To help you understand the EM algorithm I will explain it with the mixture of Gaussians application in mind. To make this concrete, let's say there are 3 Gaussians that generated our data. We would then do the following:

1. Initialize the mean and covariance of each Gaussian (and the mixture weights).
2. Compute $p(z_k = 1 \mid x_n)$. This means for a particular data point $x_n$ we want to know the probability that it came from Gaussian 1, Gaussian 2, or Gaussian 3. To find the probability that it came from each Gaussian, we take the prior probability of a Gaussian generating a data point, multiply it by the probability that this particular point came from that Gaussian, and normalize: $p(z_1 = 1 \mid x_n) = \frac{\pi_1 \, \mathcal{N}(x_n \mid \mu_1, \Sigma_1)}{\sum_{j=1}^{3} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$. This says that the probability that $x_n$ came from Gaussian 1 is the prior probability of Gaussian 1 ($\pi_1$) times the Gaussian 1 distribution governed by $\mu_1, \Sigma_1$ evaluated at $x_n$, divided by a normalizing constant.
3. Plug $p(Z \mid X, \theta^{old})$ into the expectation formula.
4. Take the derivative of the expectation formula with respect to the means, with respect to the variances/covariances, and with respect to the priors (mixture weights). Find the equations for these parameters that maximize the expectation by setting the derivatives equal to 0. This yields new values for the parameters $\theta^{new}$.
5. Repeat steps 2-4 if convergence is not yet reached.
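To make the steps above concrete, here is a compact sketch of EM for a mixture of Gaussians in Python, using NumPy and SciPy's multivariate_normal for the component densities. The initialization strategy and variable names here are my own choices, not prescribed by the algorithm:

```python
# A compact EM sketch for a mixture of K Gaussians (here K = 3).
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K=3, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape

    # Step 1: initialize means, covariances, and mixing weights (priors).
    mu = X[rng.choice(N, size=K, replace=False)]   # random data points as means
    sigma = np.array([np.cov(X, rowvar=False)] * K)
    pi = np.full(K, 1.0 / K)

    prev_ll = -np.inf
    for _ in range(n_iter):
        # E step: responsibilities gamma[n, k] = p(z_k = 1 | x_n).
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=sigma[k])
            for k in range(K)
        ])
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M step: closed-form updates from setting the derivatives to zero.
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pi = Nk / N

        # Check convergence via the (incomplete-data) log likelihood.
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, sigma, gamma
```

Calling em_gmm(X) on an (N, D) data array returns the mixture weights, means, covariances, and the final responsibilities; in practice you would also guard against singular covariances (e.g. by adding a small ridge to the diagonal).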

[Credit for figures above goes to Christopher Bishop’s PRML text.]

Figures 1a-c depict the EM algorithm being performed on mixture of Gaussians data. Figure 1a shows the ground-truth data, where the colors indicate which Gaussian gave rise to each data point. Figure 1b depicts what we actually observe: we are not told the Gaussian to which each data point belongs, so this information is hidden. Assume all we know is that 3 Gaussians generated the data. Figure 1c shows the result of running EM on the data for some number of iterations. In Figure 1c, the EM algorithm has labeled each point by a distribution over how likely it is to have come from each Gaussian. You can see that points near other Gaussians are colored as a mixture.

Lastly, to conclude the post: like the relationship between EM and mixtures of Gaussians, there is a relationship between EM and HMMs. With HMMs we want to estimate the parameters of the HMM given a sequence of observations. In this case the hidden variable Z is the sequence of states $z_1, \dots, z_T$ that led to an observation sequence $X = (x_1, \dots, x_T)$. If you were to apply EM to HMMs you would obtain the parameters of the HMM; this is known as the Baum-Welch algorithm.
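For an HMM, the complete-data likelihood that EM works with factorizes over the state sequence (a standard result, stated here only as a sketch):

$$p(X, Z \mid \theta) = p(z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1}) \prod_{t=1}^{T} p(x_t \mid z_t),$$

and the E step computes the posteriors over the hidden states using the forward-backward recursions.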

In this post I did not go into detail on why EM works. For example, you might be wondering why we maximize that particular expectation function above. To get answers to these questions I recommend reading a great tutorial by Sean Borman, available here.

Thanks for reading, and as usual if I made any mistakes or you have any comments, let me know! Hope this helped some!