I’ve been thinking about linear classifiers, and while it’s well known how to convert multinomial naive Bayes classifiers into linear models (just treat log p(cat) as the intercept and log p(word|cat) as the coefficients for the word dimensions), it’s less well known how to do this with higher-order n-grams. One use is an n-gram language-model-based search engine implemented on top of a vector-based retrieval engine (I’m not sure Lucene’s scoring is flexible enough for this).
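To make the naive Bayes case concrete, here is a minimal sketch of that conversion; the parameter values and the `<intercept>` feature name are made up for illustration:

```python
import math
from collections import Counter

# Made-up multinomial naive Bayes parameters for a single category.
log_prior = math.log(0.3)                                   # log p(cat)
log_p_word = {"win": math.log(0.6), "now": math.log(0.4)}   # log p(word|cat)

# Linear view: one coefficient per word plus a constant intercept feature.
beta = dict(log_p_word)
beta["<intercept>"] = log_prior

def nb_log_joint(tokens):
    # Drops the multinomial coefficient, which is constant across categories
    # for a fixed text and so cancels in classification.
    return log_prior + sum(log_p_word[w] for w in tokens)

def linear(tokens):
    # The same quantity as a dot product of word counts with beta.
    feats = Counter(tokens)
    feats["<intercept>"] = 1
    return sum(n * beta[f] for f, n in feats.items())
```

The two functions compute exactly the same score, which is all "naive Bayes is a linear model" means.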

Well, it turns out to be easy if you use interpolated smoothing like Witten-Bell. In the bigram case, the smoothed estimate is:

p(a|b) = lam(b) * p'(a|b) + (1 - lam(b)) * p(a)
p(a) = lam() * p'(a) + (1 - lam()) * unif

where lam(b) in [0,1] is the interpolation coefficient for context b and lam() is the interpolation coefficient for the empty n-gram. p'(a|b) and p'(a) are maximum likelihood estimates, and unif is the uniform probability of a token.

For Witten-Bell smoothing, interpolation is defined as:

lam(b) = count(b,_) / (count(b,_) + numTypes(b,_))

where count(b,_) is the number of times the token b is seen before another token in training, and numTypes(b,_) is the number of distinct tokens seen following b. This makes lam(b) grow toward 1 as the context b becomes better attested.
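Putting the two interpolation equations and the Witten-Bell coefficients together, a minimal Python sketch might look like this (the class name, the toy corpus in the usage below, and the fixed vocabulary size feeding unif are my own assumptions):

```python
import math
from collections import Counter

class WittenBellBigramLM:
    """Interpolated bigram LM sketch; vocab_size feeds the uniform estimate."""
    def __init__(self, tokens, vocab_size):
        self.unif = 1.0 / vocab_size
        self.uni = Counter(tokens)
        self.bi = Counter(zip(tokens, tokens[1:]))
        self.ctx = Counter()    # count(b,_): times b is seen before a token
        self.followers = {}     # distinct tokens seen following b
        for (b, a), n in self.bi.items():
            self.ctx[b] += n
            self.followers.setdefault(b, set()).add(a)
        self.total = sum(self.uni.values())
        # lam() for the empty context: total tokens vs. distinct types.
        self.lam0 = self.total / (self.total + len(self.uni))

    def lam(self, b):
        c = self.ctx[b]
        return c / (c + len(self.followers[b])) if c else 0.0

    def p_uni(self, a):
        ml = self.uni[a] / self.total
        return self.lam0 * ml + (1 - self.lam0) * self.unif

    def p_bi(self, a, b):
        ml = self.bi[(b, a)] / self.ctx[b] if self.ctx[b] else 0.0
        return self.lam(b) * ml + (1 - self.lam(b)) * self.p_uni(a)

    def log_prob(self, text):
        lp = math.log(self.p_uni(text[0]))
        for b, a in zip(text, text[1:]):
            lp += math.log(self.p_bi(a, b))
        return lp
```

Because the maximum likelihood term is zero for an unseen bigram, `p_bi` falls back to (1 - lam(b)) * p(a) automatically, and the uniform floor keeps every probability positive.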

Let count[X] be the count for n-gram X in the text being classified (where X may have length 0, 1, or 2). For instance, the text “John likes John” has counts for the following dimensions:

[] : 3
[John] : 2
[likes] : 1
[John,likes] : 1
[likes,John] : 1
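Extracting those count dimensions is straightforward; a sketch (the function name is mine, and the empty n-gram is counted once per token, matching the table above):

```python
from collections import Counter

def ngram_features(tokens, max_order=2):
    """Count all n-grams up to max_order as tuple-keyed features."""
    feats = Counter()
    feats[()] = len(tokens)  # the empty n-gram fires once per token
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            feats[tuple(tokens[i:i + n])] += 1
    return feats
```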

Then define a coefficient vector beta, with one dimension for each n-gram seen in training (after pruning):

beta[] = log (1 - lam()) + log unif
beta[a] = log p(a) + log (1 - lam(a)) - beta[]
beta[a,b] = log p(b|a) - log p(b) - log (1 - lam(a))

The idea is that every token is charged beta[] plus its unigram correction; every occurrence of a token a is also pre-charged the backoff cost log (1 - lam(a)) it would incur as a context; and each seen bigram [a,b] cancels that backoff charge and replaces the unigram estimate of b with the bigram estimate. Unseen n-grams get coefficient 0.

If we look at the contribution from the terms in our example and sum them, we can see how it works, assuming all bigrams other than [likes,John] were seen in training:

[] : 3 * beta[] = 3 * (log (1 - lam()) + log unif)
[John] : 2 * beta[John] = 2 * (log p(John) + log (1 - lam(John)) - beta[])
[likes] : 1 * beta[likes] = log p(likes) + log (1 - lam(likes)) - beta[]
[John,likes] : 1 * beta[John,likes] = log p(likes|John) - log p(likes) - log (1 - lam(John))
[likes,John] : 1 * beta[likes,John] = 0 (unseen in training)

Doing the sums, the beta[] terms cancel and we are left with the log estimate:

log p([John,likes,John]) = log p(John) + log p(likes|John) + log (1 - lam(likes)) + log p(John) + log (1 - lam(John))

The first four terms are exactly the smoothed chain estimate, with the unseen bigram [likes,John] backing off to (1 - lam(likes)) * p(John). The trailing log (1 - lam(John)) is a boundary artifact: the final token is charged as a context it never fills. Appending an end-of-sequence token to every text makes the identity exact.
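To check this end to end, here is a sketch that trains a tiny Witten-Bell bigram model, builds a coefficient vector by charging each context's backoff cost to its unigram weight, and compares the dot product against the model's log probability; the toy corpus and the vocabulary size are made up:

```python
import math
from collections import Counter

# Toy corpus and vocabulary size: made-up values for illustration.
train_toks = "John likes Mary and Mary likes John".split()
V = 10
unif = 1.0 / V

uni = Counter(train_toks)
bi = Counter(zip(train_toks, train_toks[1:]))
ctx = Counter()        # count(b,_): times b is seen before another token
followers = {}         # distinct tokens seen following b
for (b, a), n in bi.items():
    ctx[b] += n
    followers.setdefault(b, set()).add(a)
total = sum(uni.values())
lam0 = total / (total + len(uni))   # Witten-Bell lam() for the empty context

def lam(b):
    return ctx[b] / (ctx[b] + len(followers[b])) if ctx[b] else 0.0

def p_uni(a):
    return lam0 * (uni[a] / total) + (1 - lam0) * unif

def p_bi(a, b):
    ml = bi[(b, a)] / ctx[b] if ctx[b] else 0.0
    return lam(b) * ml + (1 - lam(b)) * p_uni(a)

# Coefficients: each token occurrence is pre-charged the backoff cost of
# its role as a context; each seen bigram cancels that charge.
beta = {(): math.log(1 - lam0) + math.log(unif)}
for a in uni:
    beta[(a,)] = math.log(p_uni(a)) + math.log(1 - lam(a)) - beta[()]
for (b, a) in bi:
    beta[(b, a)] = math.log(p_bi(a, b)) - math.log(p_uni(a)) - math.log(1 - lam(b))

def linear_score(toks):
    feats = Counter({(): len(toks)})
    feats.update((a,) for a in toks)
    feats.update(zip(toks, toks[1:]))
    return sum(n * beta.get(f, 0.0) for f, n in feats.items())

def lm_log_prob(toks):
    return math.log(p_uni(toks[0])) + sum(
        math.log(p_bi(a, b)) for b, a in zip(toks, toks[1:]))

# The two scores differ by exactly log(1 - lam(final token)), the
# boundary charge for the final token, which is never a context.
text = "John likes John".split()
boundary = math.log(1 - lam(text[-1]))
```

The discrepancy is the same constant for any text ending in the same token, and vanishes if an end-of-sequence marker is appended during both training and scoring.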

It’s no problem extending this to higher-order n-grams, and many other kinds of interpolated smoothing will work too.
