A while back I wrote a Complete guide for training your own Part-Of-Speech Tagger. If you are new to Part-Of-Speech Tagging (POS Tagging) make sure you follow that tutorial first. This article is more of an enhancement of the work done there.

What is a CRF?

A Conditional Random Field (CRF for short) is a discriminative sequence labelling model. It’s a fairly easy model to explain (compared to Hidden Markov Models). Basically, given:

- some feature extractors (these need to output real numbers)
- weights associated with the features (which are learned)
- the previous labels

the model predicts the current label.
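To make this concrete, here is a minimal sketch of how a linear-chain CRF could score label sequences and pick the best one. Everything in it is an assumption made for illustration: the weights are invented (a trained model learns them), the tag set is cut down to three labels, and the names `emission_weights`, `transition_weights`, `token_features` and `viterbi` are hypothetical, not part of any library.

```python
# Toy linear-chain CRF decoding. All weights are made up for illustration;
# a real CRF learns them from training data.

LABELS = ["DT", "NN", "VB"]

# Weight of (feature, label) pairs: how strongly a feature votes for a label.
emission_weights = {
    ("word=the", "DT"): 2.0,
    ("word=dog", "NN"): 1.5,
    ("word=barks", "VB"): 1.0,
    ("word=barks", "NN"): 0.8,
    ("suffix=s", "NN"): 0.5,
}

# Weight of (previous label, current label) pairs.
transition_weights = {
    ("DT", "NN"): 1.0,
    ("NN", "VB"): 1.2,
    ("NN", "NN"): 0.2,
}


def token_features(word):
    """Return the active (binary) features for one word."""
    feats = ["word=" + word]
    if word.endswith("s"):
        feats.append("suffix=s")
    return feats


def emission_score(word, label):
    """Sum the weights of the word's active features for this label."""
    return sum(emission_weights.get((f, label), 0.0) for f in token_features(word))


def viterbi(words):
    """Return the highest-scoring label sequence and its score."""
    # best[i][label] = (score of the best path ending in label at i, previous label)
    best = [{label: (emission_score(words[0], label), None) for label in LABELS}]
    for i in range(1, len(words)):
        column = {}
        for label in LABELS:
            candidates = [
                (best[i - 1][prev][0] + transition_weights.get((prev, label), 0.0), prev)
                for prev in LABELS
            ]
            score, prev = max(candidates)
            column[label] = (score + emission_score(words[i], label), prev)
        best.append(column)

    # Follow the back-pointers from the best final label.
    last = max(LABELS, key=lambda label: best[-1][label][0])
    total = best[-1][last][0]
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return list(reversed(path)), total


tags, score = viterbi(["the", "dog", "barks"])
print(tags, score)
```

With these toy weights, the features of "barks" alone slightly prefer NN over VB, but the NN → VB transition weight tips the decision to VB. That is the role the previous labels play.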

You probably just realized that CRFs seem well suited to POS tagging. That’s true, and they are also appropriate for other NLP tools such as named-entity extractors and chunkers.

Building the tagger

In the previous tutorial, we used the nltk.corpus.treebank corpus. Let’s do the same here so the results are comparable. Remember that we didn’t use any historical features in the previous tutorial: our previous classifier knew nothing about its earlier decisions.

Let’s check the data:

    import nltk

    # The corpus must be available locally: nltk.download('treebank')
    tagged_sentences = nltk.corpus.treebank.tagged_sents()

    print(tagged_sentences[0])
    print("Tagged sentences: ", len(tagged_sentences))
    print("Tagged words:", len(nltk.corpus.treebank.tagged_words()))

Let’s also use the exact same feature extraction function:

    def features(sentence, index):
        """ sentence: [w1, w2, ...], index: the index of the word """
        return {
            'word': sentence[index],
            'is_first': index == 0,
            'is_last': index == len(sentence) - 1,
            'is_capitalized': sentence[index][0].upper() == sentence[index][0],
            'is_all_caps': sentence[index].upper() == sentence[index],
            'is_all_lower': sentence[index].lower() == sentence[index],
            'prefix-1': sentence[index][0],
            'prefix-2': sentence[index][:2],
            'prefix-3': sentence[index][:3],
            'suffix-1': sentence[index][-1],
            'suffix-2': sentence[index][-2:],
            'suffix-3': sentence[index][-3:],
            'prev_word': '' if index == 0 else sentence[index - 1],
            'next_word': '' if index == len(sentence) - 1 else sentence[index + 1],
            'has_hyphen': '-' in sentence[index],
            'is_numeric': sentence[index].isdigit(),
            'capitals_inside': sentence[index][1:].lower() != sentence[index][1:]
        }

Let’s build the dataset:

    from nltk.tag.util import untag

    # Split the dataset for training and testing
    cutoff = int(.75 * len(tagged_sentences))
    training_sentences = tagged_sentences[:cutoff]
    test_sentences = tagged_sentences[cutoff:]

    def transform_to_dataset(tagged_sentences):
        X, y = [], []

        for tagged in tagged_sentences:
            X.append([features(untag(tagged), index) for index in range(len(tagged))])
            y.append([tag for _, tag in tagged])

        return X, y

    X_train, y_train = transform_to_dataset(training_sentences)
    X_test, y_test = transform_to_dataset(test_sentences)

    print(len(X_train))
    print(len(X_test))
    print(X_train[0])
    print(y_train[0])

    # 2935
    # 979
    # [{'word': 'Pierre' ...
    # ['NNP', 'NNP', ',', 'CD', 'NNS', 'JJ', ',', 'MD', 'VB', 'DT', 'NN', 'IN', 'DT', 'JJ', 'NN', 'NNP', 'CD', '.']

Notice how each row in the dataset is a sequence, not a single word. CRFs learn sequences.
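Here is a small sketch of that sequence shape. It uses two hand-written tagged sentences instead of the treebank corpus, and a deliberately tiny stand-in feature function. `toy_tagged`, `toy_features` and `to_sequences` are hypothetical names used only for this illustration.

```python
# Two hand-written tagged sentences standing in for the treebank corpus.
toy_tagged = [
    [("I", "PRP"), ("am", "VBP"), ("Bob", "NNP")],
    [("Dogs", "NNS"), ("bark", "VBP")],
]


def toy_features(words, index):
    """A tiny stand-in for a real feature extractor."""
    return {"word": words[index], "is_first": index == 0}


def to_sequences(tagged_sentences):
    """Build one list of feature dicts and one list of tags per sentence."""
    X, y = [], []
    for tagged in tagged_sentences:
        words = [word for word, _ in tagged]
        X.append([toy_features(words, i) for i in range(len(words))])
        y.append([tag for _, tag in tagged])
    return X, y


X, y = to_sequences(toy_tagged)

print(len(X), len(y))            # one entry per sentence
print([len(seq) for seq in X])   # one feature dict per token
print(y)                         # one tag per token, grouped by sentence
```

Each element of `X` is a whole sentence (a list of feature dicts), and the matching element of `y` is that sentence's list of tags, so the two always have the same length.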

Let’s now install the CRF library we’ll be using:

    pip install sklearn-crfsuite

sklearn-crfsuite is a wrapper over the python-crfsuite library that provides a scikit-learn-compatible API. Let’s train a model:

    from sklearn_crfsuite import CRF

    model = CRF()
    model.fit(X_train, y_train)

Here’s how to make predictions using our model:

    sentence = ['I', 'am', 'Bob', '!']

    def pos_tag(sentence):
        sentence_features = [features(sentence, index) for index in range(len(sentence))]
        return list(zip(sentence, model.predict([sentence_features])[0]))

    print(pos_tag(sentence))
    # [('I', 'PRP'), ('am', 'VBP'), ('Bob', 'NNP'), ('!', '.')]

Let’s compute the performance of our model:

    from sklearn_crfsuite import metrics

    y_pred = model.predict(X_test)
    print(metrics.flat_accuracy_score(y_test, y_pred))
    # 0.9602683593122289
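The "flat" in that metric's name refers to flattening the per-sentence tag lists into a single list of tokens before scoring. The plain-Python sketch below shows the idea, assuming the metric simply counts matching tags over all tokens. `flat_accuracy` is a hypothetical helper written for this illustration, not the library's function, and the gold and predicted tags are made up.

```python
from itertools import chain


def flat_accuracy(y_true, y_pred):
    """Token-level accuracy over lists of per-sentence tag sequences."""
    true_flat = list(chain.from_iterable(y_true))
    pred_flat = list(chain.from_iterable(y_pred))
    if len(true_flat) != len(pred_flat):
        raise ValueError("y_true and y_pred must contain the same number of tags")
    correct = sum(t == p for t, p in zip(true_flat, pred_flat))
    return correct / len(true_flat)


# Made-up gold and predicted tags: one of the six tokens is wrong.
gold = [["PRP", "VBP", "NNP", "."], ["NNS", "VBP"]]
pred = [["PRP", "VBP", "NN", "."], ["NNS", "VBP"]]

print(flat_accuracy(gold, pred))
```

Because the score is computed per token, a long sentence counts for more than a short one, and a sentence with a single wrong tag still contributes all of its correct tokens.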

We achieved a whopping 0.96 accuracy on the POS tagging task. In the previous tutorial, we only achieved 0.90 using a DecisionTreeClassifier.