P(x_i | y_i) is the probability of observing a given dice roll given the current dice label. For example, P(x_i | y_i) = 1/6 if y_i = fair, since every face of the fair dice is equally likely. The other term, T(y_i | y_{i-1}), is the cost of having transitioned from the previous dice label to the current one, and we can read this cost straight off the transition matrix.
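Putting those two pieces together, the conditional probability we're modeling has (up to the exact parameterization) this linear-chain form:

$$P(y \mid x) = \frac{\exp\left(\sum_i \log P(x_i \mid y_i) + T(y_i \mid y_{i-1})\right)}{\sum_{y'} \exp\left(\sum_i \log P(x_i \mid y'_i) + T(y'_i \mid y'_{i-1})\right)}$$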

Notice how in the denominator we’re computing a sum over all possible sequences of labels y'. In a traditional logistic regression for a two-class classification problem, we’d have two terms in the denominator. But now we’re dealing with sequences: for a sequence of length 15, there are 2¹⁵ possible label sequences, so the number of terms in the denominator is huge. The “secret sauce” of the CRF is that it exploits the fact that the current dice label only depends on the previous one to compute that huge sum efficiently.

This secret sauce algorithm is called the forward-backward algorithm*. Covering it in depth is out of scope for this blog post, but I’ll point you to helpful resources below.

Sequence Prediction

Once we estimate our transition matrix, we can use it to find the most likely sequence of dice labels for a given sequence of dice rolls. The naive way to do this is to compute the likelihood of every possible sequence, but that becomes intractable even for sequences of moderate length. Just like we did for parameter estimation, we’ll use a special algorithm to find the most likely sequence efficiently. That algorithm is closely related to the forward-backward algorithm and is called the Viterbi algorithm.

Code

PyTorch is a Python library built for training deep learning models. Although we’re not doing deep learning, PyTorch’s automatic differentiation will let us train our CRF model via gradient descent without computing any gradients by hand, which saves us a lot of work. Using PyTorch also forces us to implement the forward part of the forward-backward algorithm and the Viterbi algorithm ourselves, which is more instructive than using a specialized CRF Python package.

Let’s start by envisioning what the result needs to look like. We need a method for computing the log likelihood for an arbitrary sequence of rolls, given the dice labels. Here is one way it could look:
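Here’s a minimal sketch of such a class (treat the names other than _data_to_likelihood as placeholders rather than a fixed API):

import torch
import torch.nn as nn

class CRF(nn.Module):
    """Linear-chain CRF for the fair/biased dice problem."""

    def __init__(self, n_states, log_likelihood_matrix):
        super().__init__()
        self.n_states = n_states                    # 2: fair and biased
        self.loglikelihood = log_likelihood_matrix  # fixed 6 x 2 matrix of log P(roll | dice)
        # learned transition scores: rows = current dice label, columns =
        # previous dice label, plus an extra column for the start of a sequence
        self.transition = nn.Parameter(torch.randn(n_states, n_states + 1))

    def log_likelihood(self, rolls, states):
        """Log P(states | rolls) for one observed sequence."""
        loglikelihoods = self._data_to_likelihood(rolls)              # step 1
        numerator = self._compute_numerator(loglikelihoods, states)   # step 2
        denominator = self._compute_log_partition(loglikelihoods)     # step 3
        return numerator - denominator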

This method does three main things: 1) maps the value on the dice to a likelihood, 2) computes the numerator of the log likelihood term, 3) computes the denominator of the log likelihood term.

Let’s first tackle the _data_to_likelihood method, which handles step 1. We’ll create a matrix of dimension 6 x 2 where the first column is the log likelihood of rolls 1–6 for the fair dice, and the second column is the log likelihood of rolls 1–6 for the biased dice. This is what the matrix looks like for our problem:

array([[-1.79175947, -3.21887582],
       [-1.79175947, -3.21887582],
       [-1.79175947, -3.21887582],
       [-1.79175947, -3.21887582],
       [-1.79175947, -3.21887582],
       [-1.79175947, -0.22314355]])

Now, if we see a roll of 4, we can just select the fourth row of the matrix. The first entry of that row is the log likelihood of a four under the fair dice (log(1/6)) and the second entry is the log likelihood of a four under the biased dice (log(0.04)). This is what the code looks like:
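Something along these lines (a sketch; the exact variable names are placeholders):

import numpy as np
import torch

# fixed emission log likelihoods: rows = faces 1 through 6, columns = [fair, biased]
fair = np.log(np.full(6, 1.0 / 6.0))
biased = np.log(np.array([0.04, 0.04, 0.04, 0.04, 0.04, 0.8]))
loglikelihood_matrix = torch.tensor(np.stack([fair, biased], axis=1), dtype=torch.float32)

# a method on the CRF class sketched above
def _data_to_likelihood(self, rolls):
    """Look up log P(roll | dice) for each 0-indexed roll; returns a (seq_len, 2) tensor."""
    rolls = torch.as_tensor(rolls, dtype=torch.long)
    return self.loglikelihood[rolls]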

Next, we’ll write the methods to compute the numerator and denominator of the log likelihood.
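Roughly, they could look like this, again as methods on the class above; the denominator is where the forward algorithm does its work:

def _compute_numerator(self, loglikelihoods, states):
    """Score of the observed label sequence: emissions plus transitions along the path."""
    # the last column of the transition matrix scores the first dice label
    score = self.transition[states[0], -1] + loglikelihoods[0, states[0]]
    for i in range(1, len(states)):
        score = score + self.transition[states[i], states[i - 1]] + loglikelihoods[i, states[i]]
    return score

def _compute_log_partition(self, loglikelihoods):
    """Forward algorithm: log of the summed scores of all possible label sequences."""
    # alpha[j] = log sum of scores of all prefixes ending in dice label j
    alpha = self.transition[:, -1] + loglikelihoods[0]
    for i in range(1, loglikelihoods.shape[0]):
        # rows index the current label, columns the previous label
        scores = self.transition[:, :-1] + alpha.unsqueeze(0) + loglikelihoods[i].unsqueeze(1)
        alpha = torch.logsumexp(scores, dim=1)
    return torch.logsumexp(alpha, dim=0)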

That’s it! We have all the code we need to start learning our transition matrix. But if we want to make predictions after training our model, we’ll have to code the Viterbi algorithm:
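A sketch of it as one more method on the class (the name viterbi is a placeholder):

def viterbi(self, rolls):
    """Most likely sequence of dice labels for an observed sequence of rolls."""
    loglikelihoods = self._data_to_likelihood(rolls)
    # best score of any label path ending in each state, plus backpointers
    path_score = self.transition[:, -1] + loglikelihoods[0]
    backpointers = []
    for i in range(1, loglikelihoods.shape[0]):
        scores = self.transition[:, :-1] + path_score.unsqueeze(0) + loglikelihoods[i].unsqueeze(1)
        path_score, best_prev = scores.max(dim=1)
        backpointers.append(best_prev)
    # trace the best path backwards from the best final state
    best_state = path_score.argmax().item()
    best_path = [best_state]
    for best_prev in reversed(backpointers):
        best_state = best_prev[best_state].item()
        best_path.append(best_state)
    return list(reversed(best_path))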

There’s more to our implementation but I’ve only included the big functions we discussed in the theory section.
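For completeness, once the log likelihood is in place, training with PyTorch’s autograd comes down to a loop along these lines (a sketch: the optimizer choice and the sequences variable holding (rolls, labels) pairs are assumptions):

model = CRF(n_states=2, log_likelihood_matrix=loglikelihood_matrix)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(20):
    for rolls, labels in sequences:                  # sequences: list of (rolls, labels) pairs
        optimizer.zero_grad()
        loss = -model.log_likelihood(rolls, labels)  # minimize the negative log likelihood
        loss.backward()
        optimizer.step()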

Evaluating on Data

I evaluated the model on some data I simulated using the following probabilities:

P(first dice in sequence is fair) = 0.5
P(current dice is fair | previous dice is fair) = 0.8
P(current dice is biased | previous dice is biased) = 0.35

Check out the notebook I made to see how I generated the data and trained the CRF.
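For intuition, generating data under those probabilities looks roughly like this (a sketch, not the notebook’s exact code):

import numpy as np

def simulate_sequence(length, seed=None):
    """Return (rolls, labels), with rolls 0-indexed and label 0 meaning fair."""
    rng = np.random.default_rng(seed)
    fair_probs = np.full(6, 1.0 / 6.0)
    biased_probs = np.array([0.04, 0.04, 0.04, 0.04, 0.04, 0.8])
    rolls, labels = [], []
    state = 0 if rng.random() < 0.5 else 1          # P(first dice is fair) = 0.5
    for _ in range(length):
        probs = fair_probs if state == 0 else biased_probs
        rolls.append(rng.choice(6, p=probs))
        labels.append(state)
        if state == 0:
            state = 0 if rng.random() < 0.8 else 1   # P(fair | previous fair) = 0.8
        else:
            state = 1 if rng.random() < 0.35 else 0  # P(biased | previous biased) = 0.35
    return np.array(rolls), np.array(labels)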

The first thing we’ll do is look at the estimated transition matrix. The model learned that I am more likely to roll the fair dice on the current roll if I used the fair dice on the previous roll (-1.38 < -0.87). It also learned that I am more likely to use the fair dice after using the biased dice, but not by a lot (-0.59 < -0.41). And it assigns roughly equal cost to both dice on the first roll (-0.51 ~ -0.54).

array([[-0.86563134, -0.40748784, -0.54984874],
       [-1.3820231 , -0.59524935, -0.516026  ]], dtype=float32)

Next, we’ll see what the predictions look like for a particular sequence of rolls:

# observed dice rolls
array([2, 3, 4, 5, 5, 5, 1, 5, 3, 2, 5, 5, 5, 3, 5])

# corresponding labels. 0 means fair
array([0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1])

# predictions
array([0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0])

The model recognizes long runs of 6’s (these are the 5’s, since the rolls are 0-indexed) as coming from the biased dice, which makes sense. Notice that the model doesn’t assign every 6 to the biased dice, though (see the eighth roll). This is because prior to that 6 we’re pretty confident we’re using the fair dice (we rolled a 2), and transitioning from the fair dice to the biased dice is less likely. I’m ok with that mistake; I’d say our model is successful!

Conclusion

I’ve shown you a little bit of the theory behind CRFs as well as how one can be implemented for a simple problem. There’s certainly a lot more to them than I’ve been able to cover here, so I encourage you to check out the sources I’ve linked below.

Further Reading:

An Introduction to Conditional Random Fields: Overview of CRFs, Hidden Markov Models, as well as derivation of forward-backward and Viterbi algorithms.

Using CRFs for named entity recognition in PyTorch: Inspiration for this post. Shows how a CRF can be applied to a more complex application in NLP.

Footnotes

*To be precise, we’re covering a linear-chain CRF, a special case of the CRF in which the inputs and outputs are arranged in a linear chain. Like I said before, this topic is deep.

*Since we’re using PyTorch to compute gradients for us, we technically only need the forward part of the forward-backward algorithm.