When you train a NER system, the most typical evaluation method is to measure precision, recall and f1-score at the token level. These metrics are indeed useful for tuning a NER system. But when using the predicted named entities in downstream tasks, it is more useful to evaluate with metrics at the full named-entity level. In this post I will go through some metrics that go beyond simple token-level performance.

You can find the complete code associated with this blog post on this repository:

You can find more about Named-Entity Recognition here:





Comparing NER system output and golden standard

When comparing the golden standard annotations with the output of a NER system, different scenarios might occur:

I. Surface string and entity type match

| Gold surface string | Gold entity type | Predicted surface string | Predicted entity type |
|---|---|---|---|
| in | O | in | O |
| New | B-LOC | New | B-LOC |
| York | I-LOC | York | I-LOC |
| . | O | . | O |

II. System hypothesized an entity

| Gold surface string | Gold entity type | Predicted surface string | Predicted entity type |
|---|---|---|---|
| an | O | an | O |
| Awful | O | Awful | B-ORG |
| Headache | O | Headache | I-ORG |
| in | O | in | O |

III. System misses an entity

| Gold surface string | Gold entity type | Predicted surface string | Predicted entity type |
|---|---|---|---|
| in | O | in | O |
| Palo | B-LOC | Palo | O |
| Alto | I-LOC | Alto | O |
| , | O | , | O |

Note that considering only these three scenarios, and discarding every other possible scenario, we have a simple classification evaluation that can be measured in terms of true positives (scenario I), false positives (scenario II) and false negatives (scenario III), from which we can compute precision, recall and f1-score for each named-entity type.

But of course we are discarding partial matches, and other scenarios in which the NER system gets the named-entity surface string correct but the type wrong. We might also want to evaluate these scenarios, again at the full-entity level.

IV. System assigns the wrong entity type

| Gold surface string | Gold entity type | Predicted surface string | Predicted entity type |
|---|---|---|---|
| I | O | I | O |
| live | O | live | O |
| in | O | in | O |
| Palo | B-LOC | Palo | B-ORG |
| Alto | I-LOC | Alto | I-ORG |
| , | O | , | O |

V. System gets the boundaries of the surface string wrong

| Gold surface string | Gold entity type | Predicted surface string | Predicted entity type |
|---|---|---|---|
| Unless | O | Unless | B-PER |
| Karl | B-PER | Karl | I-PER |
| Smith | I-PER | Smith | I-PER |
| resigns | O | resigns | O |

VI. System gets the boundaries and entity type wrong

| Gold surface string | Gold entity type | Predicted surface string | Predicted entity type |
|---|---|---|---|
| Unless | O | Unless | B-ORG |
| Karl | B-PER | Karl | I-ORG |
| Smith | I-PER | Smith | I-ORG |
| resigns | O | resigns | O |
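All six scenarios compare whole entities rather than single tokens, so the token-level tags first have to be grouped into entity spans. Below is a minimal sketch of that grouping, assuming BIO tags as in the tables above. The function name `bio_to_entities` is hypothetical, and the `Entity` tuple only mirrors the field names that appear in the outputs later in this post; it is not necessarily how the helper in the repository is written.

```python
from collections import namedtuple

# Hypothetical container; field names mirror the Entity tuples printed later in the post.
Entity = namedtuple("Entity", "e_type start_offset end_offset")


def bio_to_entities(tags):
    """Group a sequence of BIO tags into entity spans with inclusive token offsets."""
    entities = []
    start, e_type = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        # Close the open span on 'O', on a new 'B-', or when the type changes.
        if start is not None and (prefix != "I" or label != e_type):
            entities.append(Entity(e_type, start, i - 1))
            start, e_type = None, None
        # Open a span on 'B-' (or on a stray 'I-' when no span is open).
        if prefix in ("B", "I") and start is None:
            start, e_type = i, label
    if start is not None:
        entities.append(Entity(e_type, start, len(tags) - 1))
    return entities


# Scenario V: "Unless Karl Smith resigns"
gold = bio_to_entities(["O", "B-PER", "I-PER", "O"])
pred = bio_to_entities(["B-PER", "I-PER", "I-PER", "O"])
print(gold)
print(pred)
```

Here the gold entity covers tokens 1 to 2 while the predicted one covers tokens 0 to 2: same type, different boundaries, which is exactly what a token-level report cannot express as a single error.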

How can we incorporate these scenarios into evaluation metrics?

Different Evaluation Schemas

Throughout the years different NER forums proposed different evaluation metrics:

CoNLL: Computational Natural Language Learning

The Language-Independent Named Entity Recognition task introduced at CoNLL-2003 measures the performance of the systems in terms of precision, recall and f1-score, where:

“precision is the percentage of named entities found by the learning system that are correct. Recall is the percentage of named entities present in the corpus that are found by the system. A named entity is correct only if it is an exact match of the corresponding entity in the data file.”

So basically it only considers scenarios I, II and III; the other scenarios described above are not considered in the evaluation.
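As a rough illustration of this exact-match criterion, the sketch below treats each entity as a `(type, first_token, last_token)` tuple and counts a prediction as correct only if the identical tuple appears in the gold annotations. The function name `exact_match_scores` is an assumption made for this example and is not the official CoNLL evaluation script. The entities are the ones from the test sentence inspected later in this post.

```python
def exact_match_scores(gold, pred):
    """Precision, recall and F1 where only exact (type, start, end) matches count."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # entities found with the right type and boundaries
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


gold = [("MISC", 12, 12), ("LOC", 15, 15), ("PER", 37, 39), ("ORG", 45, 46)]
pred = [("MISC", 12, 12), ("LOC", 15, 15), ("PER", 37, 39), ("LOC", 45, 46)]
print(exact_match_scores(gold, pred))
```

The last prediction has the right boundaries but the wrong type, so under this criterion it counts both as a false positive and as a missed gold entity.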

Automatic Content Extraction (ACE)

The ACE challenges use a more complex evaluation metric which includes a weighting schema. I will not go into detail here, and will just point to the papers about it:

I kind of gave up on trying to understand the results and replicate the experiments and baselines from ACE, since the datasets and results are not open and free, so I guess the results and experiments from this challenge will fade away with time.

Message Understanding Conference (MUC)

MUC introduced detailed metrics in an evaluation that considers different categories of errors. These metrics can be defined in terms of comparing the response of a system against the golden annotation:

Correct (COR): both are the same;

Incorrect (INC): the output of a system and the golden annotation don’t match;

Partial (PAR): system and the golden annotation are somewhat “similar” but not the same;

Missing (MIS): a golden annotation is not captured by a system;

Spurious (SPU): system produces a response which doesn’t exist in the golden annotation;

These metrics already go beyond simple strict classification and consider, for instance, partial matching. They also come close to covering the scenarios defined at the beginning of this post; we just need a way to consider the differences between NER output and golden annotations along two axes: the surface string and the entity type.

An implementation of the MUC evaluation metrics can be found here:

International Workshop on Semantic Evaluation (SemEval)

SemEval’13 introduced four different ways to measure precision/recall/f1-score results, based on the metrics defined by MUC:

Strict : exact boundary surface string match and entity type;

Exact : exact boundary match over the surface string, regardless of the type;

Partial : partial boundary match over the surface string, regardless of the type;

Type : the entity type must match, and some overlap between the system tagged entity and the gold annotation is required;

Each of these measures accounts for correct, incorrect, partial, missed and spurious annotations in a different way. Let’s look in detail at how each of the categories defined by MUC applies to the scenarios described above:

| Scenario | Gold entity type | Gold surface string | Predicted entity type | Predicted surface string | Type | Partial | Exact | Strict |
|---|---|---|---|---|---|---|---|---|
| III | brand | TIKOSYN | | | MIS | MIS | MIS | MIS |
| II | | | brand | healthy | SPU | SPU | SPU | SPU |
| V | drug | warfarin | drug | of warfarin | COR | PAR | INC | INC |
| IV | drug | propranolol | brand | propranolol | INC | COR | COR | INC |
| I | drug | phenytoin | drug | phenytoin | COR | COR | COR | COR |
| I | Drug | theophylline | drug | theophylline | COR | COR | COR | COR |
| VI | group | contraceptives | drug | oral contraceptives | INC | PAR | INC | INC |
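The rows of this table can be reproduced with a few comparisons. The following is a minimal sketch, not the code from the repository: `classify` is a hypothetical helper, entities are `(type, start, end)` tuples, the pair is assumed to be already aligned (the two spans overlap, or one side is absent), and the token offsets in the examples are made up for illustration.

```python
SCHEMAS = ("type", "partial", "exact", "strict")


def classify(gold, pred):
    """MUC category of one aligned (gold, pred) pair under each SemEval'13 schema.

    Entities are (type, start, end) tuples with inclusive offsets; None means absent.
    """
    if pred is None:  # scenario III: nothing predicted for a gold entity
        return dict.fromkeys(SCHEMAS, "MIS")
    if gold is None:  # scenario II: prediction without a gold entity
        return dict.fromkeys(SCHEMAS, "SPU")
    same_type = gold[0] == pred[0]
    same_span = gold[1:] == pred[1:]
    return {
        "type": "COR" if same_type else "INC",
        "partial": "COR" if same_span else "PAR",
        "exact": "COR" if same_span else "INC",
        "strict": "COR" if same_type and same_span else "INC",
    }


# Offsets below are invented; only their relation (equal vs. overlapping) matters.
examples = {
    "III": (("brand", 3, 3), None),
    "II": (None, ("brand", 9, 9)),
    "V": (("drug", 5, 5), ("drug", 4, 5)),
    "IV": (("drug", 2, 2), ("brand", 2, 2)),
    "I": (("drug", 7, 7), ("drug", 7, 7)),
    "VI": (("group", 8, 8), ("drug", 7, 8)),
}
for scenario, (gold, pred) in examples.items():
    print(scenario, classify(gold, pred))
```

Note that only the partial schema ever produces PAR: under the type schema an overlapping pair is either correct or incorrect depending on the type alone.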

Then precision/recall/f1-score are calculated for each evaluation schema. To do this, two more quantities need to be calculated:

Number of gold-standard annotations contributing to the final score:

$\text{POSSIBLE} (POS) = COR + INC + PAR + MIS = TP + FN$

Number of annotations produced by the NER system:

$\text{ACTUAL} (ACT) = COR + INC + PAR + SPU = TP + FP$

Then we can compute precision/recall/f1-score where, roughly speaking, precision is the percentage of named entities found by the NER system that are correct, and recall is the percentage of the named entities in the golden annotations that are retrieved by the NER system. This is computed in two different ways, depending on whether we want an exact match (i.e., strict and exact) or a partial match (i.e., partial and type) scenario:

Exact Match (i.e., strict and exact)

$\text{Precision} = \frac{COR}{ACT} = \frac{TP}{TP+FP}$

$\text{Recall} = \frac{COR}{POS} = \frac{TP}{TP+FN}$

Partial Match (i.e., partial and type)

$\text{Precision} = \frac{COR\ +\ 0.5\ \times\ PAR}{ACT} = \frac{TP}{TP+FP}$

$\text{Recall} = \frac{COR\ +\ 0.5\ \times\ PAR}{POS} = \frac{TP}{TP+FN}$

Putting it all together:

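| Measure | Type | Partial | Exact | Strict |
|---|---|---|---|---|
| Correct | 3 | 3 | 3 | 2 |
| Incorrect | 2 | 0 | 2 | 3 |
| Partial | 0 | 2 | 0 | 0 |
| Missed | 1 | 1 | 1 | 1 |
| Spurious | 1 | 1 | 1 | 1 |
| Precision | 0.5 | 0.67 | 0.5 | 0.33 |
| Recall | 0.5 | 0.67 | 0.5 | 0.33 |
| F1 | 0.5 | 0.67 | 0.5 | 0.33 |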
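The numbers in this table follow directly from the formulas above. Here is a short sketch that recomputes them from the five counts; `muc_scores` and its `partial_credit` flag are names chosen for this example, and F1 is taken as the usual harmonic mean of precision and recall.

```python
def muc_scores(cor, inc, par, mis, spu, partial_credit=False):
    """Precision, recall and F1 from the five MUC counts."""
    possible = cor + inc + par + mis  # gold annotations
    actual = cor + inc + par + spu    # system annotations
    # Partial-match schemas give half credit to partial matches.
    hits = cor + 0.5 * par if partial_credit else cor
    precision = hits / actual if actual else 0.0
    recall = hits / possible if possible else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


counts = {
    "type": dict(cor=3, inc=2, par=0, mis=1, spu=1),
    "partial": dict(cor=3, inc=0, par=2, mis=1, spu=1),
    "exact": dict(cor=3, inc=2, par=0, mis=1, spu=1),
    "strict": dict(cor=2, inc=3, par=0, mis=1, spu=1),
}
for schema, c in counts.items():
    p, r, f1 = muc_scores(**c, partial_credit=schema in ("type", "partial"))
    print(f"{schema:8s} P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

In this example POSSIBLE and ACTUAL are both 6 for every schema, which is why precision and recall coincide in each column.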

Code

I did a small experiment using the sklearn-crfsuite wrapper around CRFsuite to train a NER model on the CoNLL 2002 Spanish data. Next, I evaluate the trained CRF on the test data and show the performance with the different metrics:

Note you can find the complete code for this blog post on this repository:

Example

    import nltk
    import sklearn_crfsuite

    from copy import deepcopy
    from collections import defaultdict

    from sklearn_crfsuite import metrics

    from ner_evaluation import collect_named_entities
    from ner_evaluation import compute_metrics

Train a CRF on the CoNLL 2002 NER Spanish data

    nltk.corpus.conll2002.fileids()
    train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
    test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))

    def word2features(sent, i):
        word = sent[i][0]
        postag = sent[i][1]

        features = {
            'bias': 1.0,
            'word.lower()': word.lower(),
            'word[-3:]': word[-3:],
            'word[-2:]': word[-2:],
            'word.isupper()': word.isupper(),
            'word.istitle()': word.istitle(),
            'word.isdigit()': word.isdigit(),
            'postag': postag,
            'postag[:2]': postag[:2],
        }
        if i > 0:
            word1 = sent[i - 1][0]
            postag1 = sent[i - 1][1]
            features.update({
                '-1:word.lower()': word1.lower(),
                '-1:word.istitle()': word1.istitle(),
                '-1:word.isupper()': word1.isupper(),
                '-1:postag': postag1,
                '-1:postag[:2]': postag1[:2],
            })
        else:
            features['BOS'] = True

        if i < len(sent) - 1:
            word1 = sent[i + 1][0]
            postag1 = sent[i + 1][1]
            features.update({
                '+1:word.lower()': word1.lower(),
                '+1:word.istitle()': word1.istitle(),
                '+1:word.isupper()': word1.isupper(),
                '+1:postag': postag1,
                '+1:postag[:2]': postag1[:2],
            })
        else:
            features['EOS'] = True

        return features


    def sent2features(sent):
        return [word2features(sent, i) for i in range(len(sent))]


    def sent2labels(sent):
        return [label for token, postag, label in sent]


    def sent2tokens(sent):
        return [token for token, postag, label in sent]

Feature Extraction

    %%time
    X_train = [sent2features(s) for s in train_sents]
    y_train = [sent2labels(s) for s in train_sents]

    X_test = [sent2features(s) for s in test_sents]
    y_test = [sent2labels(s) for s in test_sents]

    CPU times: user 1.12 s, sys: 98.2 ms, total: 1.22 s
    Wall time: 1.22 s

Training

    %%time
    crf = sklearn_crfsuite.CRF(
        algorithm='lbfgs',
        c1=0.1,
        c2=0.1,
        max_iterations=100,
        all_possible_transitions=True
    )
    crf.fit(X_train, y_train)

    CPU times: user 34.1 s, sys: 197 ms, total: 34.3 s
    Wall time: 34.4 s

Performance per label type per token

    y_pred = crf.predict(X_test)

    labels = list(crf.classes_)
    labels.remove('O')  # remove 'O' label from evaluation

    # group B and I results
    sorted_labels = sorted(labels, key=lambda name: (name[1:], name[0]))

    print(sklearn_crfsuite.metrics.flat_classification_report(
        y_test, y_pred, labels=sorted_labels, digits=3))

                 precision    recall  f1-score   support

          B-LOC      0.810     0.784     0.797      1084
          I-LOC      0.690     0.637     0.662       325
         B-MISC      0.731     0.569     0.640       339
         I-MISC      0.699     0.589     0.639       557
          B-ORG      0.807     0.832     0.820      1400
          I-ORG      0.852     0.786     0.818      1104
          B-PER      0.850     0.884     0.867       735
          I-PER      0.893     0.943     0.917       634

    avg / total      0.809     0.787     0.796      6178

Performance over full named-entity

    test_sents_labels = []
    for sentence in test_sents:
        sentence = [token[2] for token in sentence]
        test_sents_labels.append(sentence)

    index = 2
    true = collect_named_entities(test_sents_labels[index])
    pred = collect_named_entities(y_pred[index])

true

[Entity(e_type='MISC', start_offset=12, end_offset=12), Entity(e_type='LOC', start_offset=15, end_offset=15), Entity(e_type='PER', start_offset=37, end_offset=39), Entity(e_type='ORG', start_offset=45, end_offset=46)]

pred

[Entity(e_type='MISC', start_offset=12, end_offset=12), Entity(e_type='LOC', start_offset=15, end_offset=15), Entity(e_type='PER', start_offset=37, end_offset=39), Entity(e_type='LOC', start_offset=45, end_offset=46)]

    compute_metrics(true, pred)

({'ent_type': {'actual': 4, 'correct': 3, 'incorrect': 1, 'missed': 0, 'partial': 0, 'possible': 4, 'precision': 0.75, 'recall': 0.75, 'spurius': 0}, 'strict': {'actual': 4, 'correct': 3, 'incorrect': 1, 'missed': 0, 'partial': 0, 'possible': 4, 'precision': 0.75, 'recall': 0.75, 'spurius': 0}}, {'LOC': {'ent_type': {'correct': 1, 'incorrect': 1, 'missed': 0, 'partial': 0, 'spurius': 0}, 'strict': {'correct': 1, 'incorrect': 1, 'missed': 0, 'partial': 0, 'spurius': 0}}, 'MISC': {'ent_type': {'correct': 1, 'incorrect': 0, 'missed': 0, 'partial': 0, 'spurius': 0}, 'strict': {'correct': 1, 'incorrect': 0, 'missed': 0, 'partial': 0, 'spurius': 0}}, 'ORG': {'ent_type': {'correct': 0, 'incorrect': 0, 'missed': 0, 'partial': 0, 'spurius': 0}, 'strict': {'correct': 0, 'incorrect': 0, 'missed': 0, 'partial': 0, 'spurius': 0}}, 'PER': {'ent_type': {'correct': 1, 'incorrect': 0, 'missed': 0, 'partial': 0, 'spurius': 0}, 'strict': {'correct': 1, 'incorrect': 0, 'missed': 0, 'partial': 0, 'spurius': 0}}})

    to_test = [2, 4, 12, 14]

    index = 2

    true_named_entities_type = defaultdict(list)
    pred_named_entities_type = defaultdict(list)

    for true in collect_named_entities(test_sents_labels[index]):
        true_named_entities_type[true.e_type].append(true)

    for pred in collect_named_entities(y_pred[index]):
        pred_named_entities_type[pred.e_type].append(pred)

true_named_entities_type

defaultdict(list, {'LOC': [Entity(e_type='LOC', start_offset=15, end_offset=15)], 'MISC': [Entity(e_type='MISC', start_offset=12, end_offset=12)], 'ORG': [Entity(e_type='ORG', start_offset=45, end_offset=46)], 'PER': [Entity(e_type='PER', start_offset=37, end_offset=39)]})

pred_named_entities_type

defaultdict(list, {'LOC': [Entity(e_type='LOC', start_offset=15, end_offset=15), Entity(e_type='LOC', start_offset=45, end_offset=46)], 'MISC': [Entity(e_type='MISC', start_offset=12, end_offset=12)], 'PER': [Entity(e_type='PER', start_offset=37, end_offset=39)]})

    true_named_entities_type['LOC']

[Entity(e_type='LOC', start_offset=15, end_offset=15)]

    pred_named_entities_type['LOC']

[Entity(e_type='LOC', start_offset=15, end_offset=15), Entity(e_type='LOC', start_offset=45, end_offset=46)]

    compute_metrics(true_named_entities_type['LOC'], pred_named_entities_type['LOC'])

({'ent_type': {'actual': 2, 'correct': 1, 'incorrect': 0, 'missed': 0, 'partial': 0, 'possible': 1, 'precision': 0.5, 'recall': 1.0, 'spurius': 1}, 'strict': {'actual': 2, 'correct': 1, 'incorrect': 0, 'missed': 0, 'partial': 0, 'possible': 1, 'precision': 0.5, 'recall': 1.0, 'spurius': 1}}, {'LOC': {'ent_type': {'correct': 1, 'incorrect': 0, 'missed': 0, 'partial': 0, 'spurius': 1}, 'strict': {'correct': 1, 'incorrect': 0, 'missed': 0, 'partial': 0, 'spurius': 1}}, 'MISC': {'ent_type': {'correct': 0, 'incorrect': 0, 'missed': 0, 'partial': 0, 'spurius': 0}, 'strict': {'correct': 0, 'incorrect': 0, 'missed': 0, 'partial': 0, 'spurius': 0}}, 'ORG': {'ent_type': {'correct': 0, 'incorrect': 0, 'missed': 0, 'partial': 0, 'spurius': 0}, 'strict': {'correct': 0, 'incorrect': 0, 'missed': 0, 'partial': 0, 'spurius': 0}}, 'PER': {'ent_type': {'correct': 0, 'incorrect': 0, 'missed': 0, 'partial': 0, 'spurius': 0}, 'strict': {'correct': 0, 'incorrect': 0, 'missed': 0, 'partial': 0, 'spurius': 0}}})

Results over all messages

    metrics_results = {'correct': 0, 'incorrect': 0, 'partial': 0,
                       'missed': 0, 'spurius': 0, 'possible': 0, 'actual': 0}

    # overall results
    results = {'strict': deepcopy(metrics_results),
               'ent_type': deepcopy(metrics_results)}

    # results aggregated by entity type
    evaluation_agg_entities_type = {e: deepcopy(results) for e in ['LOC', 'PER', 'ORG', 'MISC']}

    for true_ents, pred_ents in zip(test_sents_labels, y_pred):

        # compute results for one message
        tmp_results, tmp_agg_results = compute_metrics(
            collect_named_entities(true_ents), collect_named_entities(pred_ents))

        # aggregate overall results
        for eval_schema in results.keys():
            for metric in metrics_results.keys():
                results[eval_schema][metric] += tmp_results[eval_schema][metric]

        # aggregate results by entity type
        for e_type in ['LOC', 'PER', 'ORG', 'MISC']:
            for eval_schema in tmp_agg_results[e_type]:
                for metric in tmp_agg_results[e_type][eval_schema]:
                    evaluation_agg_entities_type[e_type][eval_schema][metric] += tmp_agg_results[e_type][eval_schema][metric]

results

{'ent_type': {'actual': 3518, 'correct': 2909, 'incorrect': 564, 'missed': 111, 'partial': 0, 'possible': 3584, 'spurius': 45}, 'strict': {'actual': 3518, 'correct': 2779, 'incorrect': 694, 'missed': 111, 'partial': 0, 'possible': 3584, 'spurius': 45}}

evaluation_agg_entities_type

{'LOC': {'ent_type': {'actual': 0, 'correct': 861, 'incorrect': 180, 'missed': 32, 'partial': 0, 'possible': 0, 'spurius': 5}, 'strict': {'actual': 0, 'correct': 840, 'incorrect': 201, 'missed': 32, 'partial': 0, 'possible': 0, 'spurius': 5}}, 'MISC': {'ent_type': {'actual': 0, 'correct': 211, 'incorrect': 46, 'missed': 33, 'partial': 0, 'possible': 0, 'spurius': 7}, 'strict': {'actual': 0, 'correct': 173, 'incorrect': 84, 'missed': 33, 'partial': 0, 'possible': 0, 'spurius': 7}}, 'ORG': {'ent_type': {'actual': 0, 'correct': 1181, 'incorrect': 231, 'missed': 34, 'partial': 0, 'possible': 0, 'spurius': 31}, 'strict': {'actual': 0, 'correct': 1120, 'incorrect': 292, 'missed': 34, 'partial': 0, 'possible': 0, 'spurius': 31}}, 'PER': {'ent_type': {'actual': 0, 'correct': 656, 'incorrect': 107, 'missed': 12, 'partial': 0, 'possible': 0, 'spurius': 2}, 'strict': {'actual': 0, 'correct': 646, 'incorrect': 117, 'missed': 12, 'partial': 0, 'possible': 0, 'spurius': 2}}}
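In the per-type output above, `possible` and `actual` stay at 0 because the per-type dictionaries returned for each sentence only carry the five basic counts. They can be derived afterwards from the POSSIBLE and ACTUAL definitions given earlier. The sketch below does this for the strict counts copied from the output above; `add_scores` is a hypothetical helper, and how incorrect matches are attributed to a type depends on `compute_metrics`, so the per-type precision and recall should be read with that caveat.

```python
# Strict counts per entity type, copied from the aggregated output above.
strict_counts = {
    "LOC": {"correct": 840, "incorrect": 201, "partial": 0, "missed": 32, "spurius": 5},
    "MISC": {"correct": 173, "incorrect": 84, "partial": 0, "missed": 33, "spurius": 7},
    "ORG": {"correct": 1120, "incorrect": 292, "partial": 0, "missed": 34, "spurius": 31},
    "PER": {"correct": 646, "incorrect": 117, "partial": 0, "missed": 12, "spurius": 2},
}


def add_scores(c):
    """Return a copy of the counts with possible, actual, precision and recall filled in."""
    out = dict(c)
    out["possible"] = c["correct"] + c["incorrect"] + c["partial"] + c["missed"]
    out["actual"] = c["correct"] + c["incorrect"] + c["partial"] + c["spurius"]
    # Strict schema: only fully correct entities count as hits.
    out["precision"] = c["correct"] / out["actual"] if out["actual"] else 0.0
    out["recall"] = c["correct"] / out["possible"] if out["possible"] else 0.0
    return out


scored = {e_type: add_scores(c) for e_type, c in strict_counts.items()}
for e_type, s in scored.items():
    print(f"{e_type:5s} possible={s['possible']:5d} actual={s['actual']:5d} "
          f"P={s['precision']:.3f} R={s['recall']:.3f}")
```

As a sanity check, the derived per-type values add up to the overall strict totals reported in `results`: 3584 possible and 3518 actual.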

References