by mish4 in Machine Learning / Pattern Recognition Tags: hidden markov models, markov models, probability, viterbi algorithm

I wanted to write about Hidden Markov Models (HMMs) as they are interesting and fun to learn about. They are used in many more ways than I know of, but one such way is in speech recognition. In this post I will tell you what HMMs are, what questions we want to ask about HMMs, and how we answer these questions.

We start by talking about Markov models. A Markov model describes the probabilistic relationship between different states. As an example, suppose we have a discrete Markov model with three states: ‘happy’, ‘sad’, and ‘angry’. Over the course of a day (let’s say every hour) a person will be in a particular state, namely the one that corresponds to their emotion. A person can stay in one state for several hours and then potentially change to a different state.

To model the probability of transitioning from one state to another we can create a matrix A whose entry a_ij describes the probability of being in state i at time t and in state j at time t+1. With this Markov model we think of a sequence of states where each state corresponds to a particular instant in time (e.g. a certain hour of the day). So in our example a_12 would correspond to the probability of going from ‘happy’ at some time instant to ‘sad’ in the next time instant. In an idealized model where people are happy most of the time, we could choose this probability to be low; for example 0.1. Note that the probabilities in each row of our transition matrix A must sum to 1. This ensures that when you are in one state, you either stay in that state or move to some other state.

Now one question we may ask given a Markov model is: what is the probability that our model generated a particular sequence of states? In our example we could ask for the probability of {‘happy’, ‘happy’, ‘sad’, ‘angry’}. If we use q_t to represent the state that occurred at time t, then we would write the probability of such a sequence as P(q_1, q_2, ..., q_T). Using the rules of probability (the chain rule!) we can rewrite this as P(q_T | q_{T-1}, ..., q_1) P(q_{T-1} | q_{T-2}, ..., q_1) ··· P(q_1). For a first order Markov model we make the simplifying assumption that the state at time t only depends on the state at time t-1. This simplification is depicted in Figure 1 using a graphical model. The circles (nodes) in Figure 1 represent the states at different time instants, and the arrows show the dependencies between those states. For example the arrow leaving state q_{t-1} and arriving at q_t means that q_t depends on q_{t-1}.

With this simplification our joint probability becomes P(q_1) ∏_{t=2}^{T} P(q_t | q_{t-1}). You can also have higher order Markov models; for example in a second order Markov model the state at time t would depend on the states at both time t-1 and time t-2.

If you are given the transition matrix describing the probabilities of going from one state to another, and you are given the prior probabilities of starting in a particular state, then you can compute the probability of a particular sequence of states by evaluating the probability expressions above. This extends naturally to a model with any number of states and a state sequence of arbitrary length.
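As a small sketch of this computation, here is the happy/sad/angry example in Python. All the probability values below are made up for illustration; they are not from the post.

```python
# Toy first-order Markov model for the happy/sad/angry example.
states = {"happy": 0, "sad": 1, "angry": 2}

# pi[i]: probability of starting in state i.
pi = [0.6, 0.2, 0.2]

# A[i][j]: probability of moving from state i at time t to state j at t+1.
# Each row sums to 1.
A = [
    [0.8, 0.1, 0.1],   # from happy
    [0.3, 0.5, 0.2],   # from sad
    [0.2, 0.3, 0.5],   # from angry
]

def sequence_probability(seq):
    """P(q_1, ..., q_T) = pi[q_1] * prod over t of A[q_{t-1}][q_t]."""
    q = [states[s] for s in seq]
    p = pi[q[0]]
    for prev, cur in zip(q, q[1:]):
        p *= A[prev][cur]
    return p

p = sequence_probability(["happy", "happy", "sad", "angry"])
# p = 0.6 * 0.8 * 0.1 * 0.2 = 0.0096
```

With these invented numbers, the sequence {‘happy’, ‘happy’, ‘sad’, ‘angry’} has probability 0.6 · 0.8 · 0.1 · 0.2 = 0.0096.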

Now that I have provided a very brief intro to Markov models you may be wondering why we even need to talk about Hidden Markov Models. Does this have anything to do with hiding a Russian named Markov?

Hidden Markov models are used because in practice we do not observe the states of a system directly. That is, in real life we will not have the sequence of states our system went through; instead we will have observations. Suppose now that our system is a poker player and the states, as before, are the emotions. The true states of the person are hidden, since the poker player will not directly tell you that he/she is happy, sad, or angry. Instead the only information you get is an observation, which in this example could be a facial expression such as smiling. Likewise, in the context of speech recognition the states represent the words spoken by a person. In reality we can only observe the speech waveform, which is not the same thing as the state.

In a HMM there are several assumptions that are made. For a first order HMM there is the same assumption as in Markov models: the state at time t only depends on the state at time t-1. Additionally, for a HMM there is the assumption that an observation at time t only depends on the state at time t. That is, the observation at time t conditioned on the state at time t is independent of all other states and all other observations. Figure 2 shows a graphical model representing HMMs. This model is exactly the same as the Markov model in Figure 1, but now we have observation nodes that only depend on the state at the time the observation was obtained.

The assumptions mentioned above simplify the probabilistic expressions used with HMMs. To specify a HMM we will have, just as before, a matrix A that describes the transition probabilities from one state to another, and a prior distribution π on the states that tells us the probability of starting in any particular state. Furthermore, because this is a HMM we need to specify a matrix B that gives the probability b_j(o_t) of getting observation o_t at time t given that the state at time t is j.

With the understanding of what a HMM is, we can pose three meaningful questions:

1. What is the probability of a sequence of observations O given a particular HMM?
2. Given a sequence of observations, what is the most likely sequence of states to give rise to those observations?
3. What should the parameters of a HMM (A, B, and the prior distribution π) be, given a training sequence of observations O?

In this post we will answer the first two of these questions. The third question is more difficult and may appear in a separate post. UPDATE: The third question will also be answered!

What is the probability of a sequence of observations given a particular HMM?

We will denote the parameters A, B, π for a particular HMM by the variable λ. So what we want is P(O | λ). One way to arrive at this probability is to see that it can be obtained by marginalizing the joint distribution P(O, Q | λ) over Q. That is, if we sum over all possible sequences of states Q we can obtain the probability we want from the joint probability. The joint probability can also be written as the product P(O | Q, λ) P(Q | λ). This says that the joint probability is the product of two distributions. The first gives us the conditional distribution on the set of observations given the set of states and a model. The second is a distribution on obtaining a particular sequence of states given a model.

Now, we can write the distribution for getting a particular sequence of states (given a model) just as we did with a Markov model. It is simply: P(Q | λ) = π_{q_1} ∏_{t=2}^{T} a_{q_{t-1} q_t}. The conditional distribution on a sequence of observations can be expressed as the product P(O | Q, λ) = ∏_{t=1}^{T} b_{q_t}(o_t). That is, the conditional distribution simplifies to a product of smaller probabilities where each probability of an observation depends only on the current state. In other words, by considering the assumptions we made about HMMs and applying the appropriate conditional independence rules of probability we are left with the above equation. So now the joint probability P(O, Q | λ) is equal to π_{q_1} b_{q_1}(o_1) ∏_{t=2}^{T} a_{q_{t-1} q_t} b_{q_t}(o_t).

So we have the joint distribution! All we have to do now is sum over ALL possible sequences of states, and that will give us the probability of our observation sequence given the model: P(O | λ) = Σ_Q P(O, Q | λ). If there are N states in the model and the sequence has length T, then there are N^T possible state sequences to sum over. Since this is exponential in T, the probability is far too expensive to compute directly. Luckily, there is a recursive approach that allows us to compute this probability in polynomial time.
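To see why the direct summation blows up, here is the brute-force computation written out for a tiny invented model (two states, two observation symbols, three observations), where enumerating all N^T sequences is still feasible:

```python
from itertools import product

# Brute-force P(O | lambda): sum the joint P(O, Q | lambda) over ALL
# N**T state sequences Q. Fine for a toy example, hopeless for large T.
# All probability values here are invented for illustration.
pi = [0.6, 0.4]                      # prior over the 2 hidden states
A = [[0.7, 0.3], [0.4, 0.6]]         # A[i][j] = P(state j at t+1 | state i at t)
B = [[0.9, 0.1], [0.2, 0.8]]         # B[i][k] = P(observation k | state i)
O = [0, 1, 1]                        # observation sequence, T = 3

def brute_force_likelihood(O, pi, A, B):
    N, T = len(pi), len(O)
    total = 0.0
    for Q in product(range(N), repeat=T):          # all N**T sequences
        # Joint: pi_{q1} b_{q1}(o1) * prod_t a_{q(t-1) qt} b_{qt}(ot)
        p = pi[Q[0]] * B[Q[0]][O[0]]
        for t in range(1, T):
            p *= A[Q[t-1]][Q[t]] * B[Q[t]][O[t]]
        total += p
    return total

total = brute_force_likelihood(O, pi, A, B)
```

Here N^T is only 2^3 = 8 terms, but with, say, N = 10 states and T = 100 time steps the sum would have 10^100 terms.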

The recursive solution inductively computes α_t(i) = P(o_1, ..., o_t, q_t = i | λ), which is the probability of observing the first t observations and having the current state equal to state i. The idea is that if we have the probabilities α_t(i) for every state i, then it is easy to compute α_{t+1}(j). In words, to determine how likely state j is at time t+1 we compute how likely it is for state i at time t to transition to state j at time t+1, weighted by the probability α_t(i) that at time t the state was i. The contribution from all possible states i is given by summing over i: Σ_i α_t(i) a_ij. Lastly we multiply this sum by the probability b_j(o_{t+1}) that state j gave rise to the latest observation, which gives α_{t+1}(j) = [Σ_i α_t(i) a_ij] b_j(o_{t+1}). If we do this for all j, we now have a distribution over the states at t+1: for each state j we have the probability of the observations so far together with being in state j at time t+1. Proceeding inductively we can eventually compute the probabilities α_T(i). Summing these probabilities over all i gives the total probability P(O | λ) of observing the observation sequence O. One way perhaps to think about this is that different state sequences can yield the observation sequence with different probabilities. By weighting all possible ways of getting observation sequence O by how likely those state sequences are, we obtain a total probability for observing O.
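The forward recursion can be sketched in a few lines of Python. The two-state model and all its probability values are invented for illustration:

```python
# Toy HMM: 2 hidden states, 2 observation symbols. Invented numbers.
pi = [0.6, 0.4]                      # prior over states
A = [[0.7, 0.3], [0.4, 0.6]]         # A[i][j] = P(state j at t+1 | state i at t)
B = [[0.9, 0.1], [0.2, 0.8]]         # B[i][k] = P(observation k | state i)

def forward_likelihood(O, pi, A, B):
    """Compute P(O | lambda) in O(N^2 T) time using the forward variable."""
    N = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][O[0]] for i in range(N)]
    # Induction: alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(o_{t+1})
    for o in O[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha)

p = forward_likelihood([0, 1, 1], pi, A, B)
```

Instead of N^T terms, each of the T steps costs only N^2 multiplications, which is what makes the computation tractable.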

[Now would be a good time to take a stretch break before proceeding to next question.]

Given a sequence of observations what is the most likely sequence of states to give rise to those observations?

Alright, so we want to determine the best sequence of states that yielded the observation sequence. One way to define best is to find the sequence of states Q that maximizes the probability P(Q | O, λ). This is equivalent to maximizing the joint probability P(Q, O | λ), because P(Q | O, λ) = P(Q, O | λ) / P(O | λ) and the denominator P(O | λ) does not depend on Q, so the two quantities differ only by a constant factor.

It turns out that the Viterbi algorithm (an instance of dynamic programming) finds the sequence of states that maximizes the probability of getting the observations O.

We define δ_t(i) = max over q_1, ..., q_{t-1} of P(q_1, ..., q_{t-1}, q_t = i, o_1, ..., o_t | λ). This says that we store, for each state i, the probability of the best sequence of states up until time t that ends in state i and accounts for the first t observations.

Now, assuming we have δ_t(i) for every state i, it becomes straightforward to compute δ_{t+1}(j) = [max_i δ_t(i) a_ij] b_j(o_{t+1}). What this says is: to compute the probability that at time t+1 the state will be j and the observations will be O up to and including time t+1, we take the most likely predecessor state i (by considering δ_t(i) and the transition probabilities) and multiply by the probability of state j yielding the most recent observation. Each time we decide which state i is jointly the most likely to be the state at time t and to lead to state j, we remember it. This allows us to later reconstruct the sequence of states we are looking for. Eventually this iterative process stops and you are left with δ_T(j), which tells you the probability of the best state sequence ending in state j at time T. Picking the state at time T with the largest δ_T(j) gives us the most likely last state. Then, by going backwards through the states that we stored, we can recreate the most likely sequence of states. This works because for each possible state at t+1 we stored the most likely state at t to lead to it. This is exactly the Viterbi algorithm, which answers the question of which sequence of states is most likely responsible for the observations. This algorithm still seems a bit magical to me.
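The recursion and the backtracking step can be sketched as follows, again on an invented two-state toy model; `backpointers` plays the role of the remembered best predecessors:

```python
# Toy HMM with invented numbers: 2 hidden states, 2 observation symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]

def viterbi(O, pi, A, B):
    """Return the most likely state sequence for observations O."""
    N, T = len(pi), len(O)
    delta = [pi[i] * B[i][O[0]] for i in range(N)]   # delta_1(i)
    backpointers = []
    for t in range(1, T):
        psi, new_delta = [], []
        for j in range(N):
            # Best predecessor i for landing in state j at time t.
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            psi.append(best_i)
            # delta_{t+1}(j) = [max_i delta_t(i) a_ij] * b_j(o_{t+1})
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][O[t]])
        delta = new_delta
        backpointers.append(psi)
    # Most likely final state, then walk the backpointers in reverse.
    state = max(range(N), key=lambda i: delta[i])
    path = [state]
    for psi in reversed(backpointers):
        state = psi[state]
        path.append(state)
    return list(reversed(path))

path = viterbi([0, 1, 1], pi, A, B)
```

Note the only change from the forward recursion is that the sum over predecessors i becomes a max, plus the bookkeeping needed to recover the winning path.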

Let us quickly motivate the answer to the second question. Suppose that in speech recognition the states are different words. Then given a speech waveform you want to recover the sequence of states that most likely caused the observations. This is speech recognition!!! It works pretty well too! Have you ever called a phone number and talked to a machine? It probably uses HMMs.

What should the parameters of a HMM (A, B, and the prior distribution π) be, given a training sequence of observations O?

Essentially, to answer this question we manipulate several variables into a form where they give us useful information. For example, to estimate a_ij we would want to know how often we transition from state i to state j compared to the total number of transitions we make from state i. Likewise, to estimate b_j(k) we want to estimate the number of times being in state j yields observation k out of the total number of times we are in state j. Lastly, the prior π_i will be the probability that we expect to be in state i at t = 1. Once we have these quantities we will iterate until the parameters converge.

Now we define the variables we need:

Equation 1: α_t(i) = P(o_1, ..., o_t, q_t = i | λ)

Equation 2: β_t(i) = P(o_{t+1}, ..., o_T | q_t = i, λ)

Equation 3: ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / P(O | λ)

Equation 4: γ_t(i) = P(q_t = i | O, λ) = α_t(i) β_t(i) / P(O | λ)

Equation 1 tells us the probability of having state i at time t along with the observations up to time t. Equation 2 tells us, given that the state at time t is i, the probability of seeing the rest of the observations from time t+1 up to time T. Equation 3 tells us the probability of the state at time t being i and the state at time t+1 being j, given all observations. You should convince yourself that this expression in terms of Equations 1 and 2 makes sense. Equation 4 is just the probability that the state at time t is i, given all observations.

Now we will try and relate these four equations to the parameters λ representing our HMM.

The sum Σ_{t=1}^{T-1} γ_t(i) represents the expected number of transitions out of state i.

The sum Σ_{t=1}^{T-1} ξ_t(i, j) represents the expected number of transitions from state i to state j.

Following from this we get the re-estimation formulas:

a_ij = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)

b_j(k) = Σ_{t : o_t = k} γ_t(j) / Σ_{t=1}^{T} γ_t(j)

π_i = γ_1(i)

So what we have is that the transition probability a_ij can be expressed as the expected number of times we transition from state i to state j, divided by the expected number of times we transition out of state i. b_j(k) can be expressed as the expected number of times we are in state j and see observation k, divided by the expected number of times we are in state j. Lastly, π_i is the probability that at time t = 1 we will be in state i.

We have shown how the parameters of a HMM depend on the four equations introduced above. With an initial guess or estimate of the parameters of the HMM we can compute the variables associated with those four equations. Then we recompute the parameters of the HMM, and it has been proven (not by me) that these new parameters are ‘better’ than the old ones, in the sense that they cannot decrease the likelihood of the observations. In fact, each time we iterate by recomputing the four equations and then the HMM parameters, we get a better estimate of the HMM parameters. After iterating sufficiently long the parameters shouldn’t change much, and thus the algorithm has converged.
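To make this concrete, here is a sketch of one re-estimation iteration (one Baum-Welch step) on an invented two-state toy model. The `forward`, `backward`, and `baum_welch_step` names are my own; in practice you would loop the step until the parameters stop changing.

```python
# One Baum-Welch re-estimation step for a toy HMM with invented numbers.

def forward(O, pi, A, B):
    """alpha[t][i] = P(o_1..o_t, q_t = i | lambda)."""
    N, T = len(pi), len(O)
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][O[0]]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t-1][i] * A[i][j] for i in range(N)) * B[j][O[t]]
    return alpha

def backward(O, pi, A, B):
    """beta[t][i] = P(o_{t+1}..o_T | q_t = i, lambda)."""
    N, T = len(pi), len(O)
    beta = [[1.0] * N if t == T - 1 else [0.0] * N for t in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][O[t+1]] * beta[t+1][j] for j in range(N))
    return beta

def baum_welch_step(O, pi, A, B):
    N, T, M = len(pi), len(O), len(B[0])
    alpha, beta = forward(O, pi, A, B), backward(O, pi, A, B)
    PO = sum(alpha[T-1][i] for i in range(N))            # P(O | lambda)
    # gamma[t][i] = P(q_t = i | O);  xi[t][i][j] = P(q_t = i, q_{t+1} = j | O)
    gamma = [[alpha[t][i] * beta[t][i] / PO for i in range(N)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][O[t+1]] * beta[t+1][j] / PO
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # Re-estimation formulas: ratios of expected counts.
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if O[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(M)] for j in range(N)]
    return new_pi, new_A, new_B

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
new_pi, new_A, new_B = baum_welch_step([0, 1, 1, 0], pi, A, B)
```

A quick sanity check on any implementation like this: the re-estimated π must sum to 1, and every row of the new A and B must sum to 1, since they are all probability distributions.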

That folks, is how you iteratively estimate the parameters of a HMM given only a sequence of observations. The reason why this iterative method works (though it isn’t guaranteed that you will get the optimal HMM parameters, you might get a local maximum) is based on the Expectation Maximization algorithm which I will cover in another post. You can convince yourself that estimating the parameters of a HMM is useful as there may be many applications in which you only have a sequence of observations.

Conclusion:

There is a lot of probability involved with Markov models and HMMs. By thinking about the assumptions of a HMM and how states and observations relate to each other we can solve interesting questions. My explanations in this post may not be sufficient to learn HMMs in detail. However, I hope that the explanations are a good introduction to the subject and give you, the reader, a sense of the math and thinking involved. It is interesting and cool that certain HMM problems can be solved recursively.

In order to learn the material I have heavily relied on Rabiner’s ‘A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition’ paper. My post here is essentially my own understanding and summary of the material I read in this paper. It is a great paper and for the interested reader who is looking for details I suggest you go there.

Furthermore, I would like to mention that the Viterbi algorithm is pretty interesting in and of itself. One way to see the math actually work is by applying it. The Wikipedia page on the Viterbi algorithm provides some Python code which lets you compute the most likely sequence of states that resulted in a particular observation sequence. You can also look at a trellis diagram on the Wikipedia page, which helps illustrate the Viterbi algorithm.

Thanks for reading, this was quite a difficult post to write because there is a lot of material to keep straight.