One of the difficulties inherent in machine learning techniques is that the most accurate algorithms refuse to tell a story: we can discuss the confusion matrix, testing and training data, accuracy and the like, but it’s often hard to explain in simple terms what’s really going on.

Practically speaking this isn’t a big issue from an engineering perspective, but in a general political sense it is: highly accurate machine learning models are often considered creepy, especially when it’s not apparent how they figured something out.

A simple case of this is part-of-speech tagging – you can read a book on how it works and inspect the output, but it’s hard to judge whether the results are “good” and to develop an intuition for the personality of each algorithm. To that end, I’ve experimented with comparing the output of two taggers on common pieces of text, below.

The first tagger is the POS tagger included in NLTK (Python). This is presented in some detail in “Natural Language Processing with Python” (read my review), which has lots of motivating examples for natural language processing built around NLTK, a natural language processing library maintained by the authors. The second toolkit is the Stanford NLP tagger (Java). Conveniently, these use a similar set of tags.
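For reproducibility, here is a minimal sketch of how both taggers can be driven from a single Python session. The wrapper class name, model file, and jar path are assumptions that vary with the NLTK and Stanford tagger versions installed:

import nltk
from nltk.tag.stanford import StanfordPOSTagger

# Stanford's tagger shells out to Java; point it at your local install.
stanford = StanfordPOSTagger(
    "models/english-bidirectional-distsim.tagger",  # assumed model path
    "stanford-postagger.jar",                       # assumed jar path
)

tokens = nltk.word_tokenize("This Court has jurisdiction.")
print(nltk.pos_tag(tokens))   # NLTK's built-in tagger
print(stanford.tag(tokens))   # Stanford's tagger, via NLTK's wrapper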

For the first example, we’ll take a simple sentence and compare the output of the two products. In this case, you can see the formatting is quite different, but the tags are the same.

nltk.pos_tag(nltk.word_tokenize("""This Court has jurisdiction to consider the merits of the case."""))

[('This', 'DT'), ('Court', 'NNP'), ('has', 'VBZ'), ('jurisdiction', 'NN'), ('to', 'TO'), ('consider', 'VB'), ('the', 'DT'), ('merits', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('case', 'NN'), ('.', '.')]

This_DT Court_NNP has_VBZ jurisdiction_NN to_TO consider_VB the_DT merits_NNS of_IN the_DT case_NN ._.
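If you want to diff the two outputs directly, the list-of-tuples form is easy to reshape into Stanford’s word_tag convention; a one-line sketch:

tagged = nltk.pos_tag(nltk.word_tokenize("This Court has jurisdiction."))
print(" ".join("%s_%s" % pair for pair in tagged))
# => This_DT Court_NNP has_VBZ jurisdiction_NN ._.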

For reference, there are quite a few possible tags in a POS tagger – far more than what you learn in high school English class – and this finer granularity helps later processing stages produce more accurate results. Here is the full list, from the Penn Treebank documentation:

CC - Coordinating conjunction
CD - Cardinal number
DT - Determiner
EX - Existential there
FW - Foreign word
IN - Preposition or subordinating conjunction
JJ - Adjective
JJR - Adjective, comparative
JJS - Adjective, superlative
LS - List item marker
MD - Modal
NN - Noun, singular or mass
NNS - Noun, plural
NNP - Proper noun, singular
NNPS - Proper noun, plural
PDT - Predeterminer
POS - Possessive ending
PRP - Personal pronoun
PRP$ - Possessive pronoun (prolog version PRP-S)
RB - Adverb
RBR - Adverb, comparative
RBS - Adverb, superlative
RP - Particle
SYM - Symbol
TO - to
UH - Interjection
VB - Verb, base form
VBD - Verb, past tense
VBG - Verb, gerund or present participle
VBN - Verb, past participle
VBP - Verb, non-3rd person singular present
VBZ - Verb, 3rd person singular present
WDT - Wh-determiner
WP - Wh-pronoun
WP$ - Possessive wh-pronoun (prolog version WP-S)
WRB - Wh-adverb
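If you don’t feel like memorizing these, NLTK can look them up for you (this requires the “tagsets” data package):

import nltk
# Prints the definition and example sentences for a tag;
# regular expressions also work, e.g. nltk.help.upenn_tagset('NN.*')
nltk.help.upenn_tagset('VBZ')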

The following are some more involved examples, rendered side by side. I’ve edited the output to facilitate comparison. NLTK is on the left; Stanford NLP on the right.
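The alignment itself is mechanical. Here is a minimal sketch of one way to produce this layout, assuming both taggers emit one tag per token in the same order (side_by_side is a name of my own invention):

def side_by_side(nltk_tags, stanford_tags):
    # Both arguments are lists of (token, tag) pairs, token-aligned.
    for (token, left), (_, right) in zip(nltk_tags, stanford_tags):
        flag = "" if left == right else "   <-- differs"
        print("%-14s %-5s %-5s%s" % (token + ":", left, right, flag))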

For the first example, I’ve chosen a sonnet, for which the two tools produce surprisingly similar output.

nltk.pos_tag(nltk.word_tokenize(""" Can my love excuse the slow offence, Of my dull bearer, when from thee I speed, From where thou art, why should I haste me thence? Till I return of posting is no need. """))

Can: NNP        Can: MD
my: PRP$        my: PRP$
love: NN        love: NN
excuse: NN      excuse: NN
the: DT         the: DT
slow: JJ        slow: JJ
offence: NN     offence: NN
,: ,            ,: ,
Of: IN          Of: IN
my: PRP$        my: PRP$
dull: NN        dull: JJ
bearer: NN      bearer: NN
,: ,            ,: ,
when: WRB       when: WRB
from: IN        from: IN
thee: NN        thee: NN
I: PRP          I: PRP
speed: VBP      speed: VBP
,: ,            ,: ,
From: NNP       From: IN
where: WRB      where: WRB
thou: PRP       thou: JJ
art: VBP        art: NN
,: ,            ,: ,
why: WRB        why: WRB
should: MD      should: MD
I: PRP          I: PRP
haste: VB       haste: NN
me: PRP         me: PRP
thence: NN      thence: VB
?: .            ?: .
Till: NNP       Till: IN
I: PRP          I: PRP
return: VBP     return: VBP
of: IN          of: IN
posting: VBG    posting: VBG
is: VBZ         is: VBZ
no: DT          no: DT
need: NN        need: NN
.: .            .: .

For a second example, I’ve chosen a very wordy sentence from a recent Supreme Court case. This type of text is interesting because it’s a common target for analysis, and it has entity names in it. Even for such a long sentence, there is very little deviation – it makes me wonder whether these aren’t two versions of the same code or model.

nltk.pos_tag(nltk.word_tokenize("""Roy Koontz, Sr., whose estate is represented here by petitioner, sought permits to develop a section of his property from respondent St. Johns River Water Management District (District), which, consistent with Florida law, requires permit applicants wishing to build on wetlands to offset the resulting environmental damage."""))

Roy: NNP            Roy: NNP
Koontz: NNP         Koontz: NNP
,: ,                ,: ,
Sr.: NNP            Sr.: NNP
,: ,                ,: ,
whose: WP$          whose: WP$
estate: NN          estate: NN
is: VBZ             is: VBZ
represented: VBN    represented: VBN
here: RB            here: RB
by: IN              by: IN
petitioner: NN      petitioner: NN
,: ,                ,: ,
sought: VBD         sought: VBD
permits: NNS        permits: NNS
to: TO              to: TO
develop: VB         develop: VB
a: DT               a: DT
section: NN         section: NN
of: IN              of: IN
his: PRP$           his: PRP$
property: NN        property: NN
from: IN            from: IN
respondent: NN      respondent: NN
St.: NNP            St.: NNP
Johns: NNP          Johns: NNP
River: NNP          River: NNP
Water: NNP          Water: NNP
Management: NNP     Management: NNP
District: NNP       District: NNP
(: NNP              -LRB-: -LRB-
District: NNP       District: NNP
): NNP              -RRB-: -RRB-
,: ,                ,: ,
which: WDT          which: WDT
,: ,                ,: ,
consistent: VBD     consistent: JJ
with: IN            with: IN
Florida: NNP        Florida: NNP
law: NN             law: NN
,: ,                ,: ,
requires: VBZ       requires: VBZ
permit: NN          permit: NN
applicants: NNS     applicants: NNS
wishing: VBG        wishing: VBG
to: TO              to: TO
build: VB           build: VB
on: IN              on: IN
wetlands: NNS       wetlands: NNS
to: TO              to: TO
offset: VB          offset: VB
the: DT             the: DT
resulting: VBG      resulting: VBG
environmental: JJ   environmental: JJ
damage: NN          damage: NN
.: .                .: .
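To make “very little deviation” concrete, one can count how often the two taggers agree, token by token. A sketch, again assuming identical tokenization:

def agreement_rate(nltk_tags, stanford_tags):
    # Fraction of tokens on which the two taggers assign the same tag.
    pairs = list(zip(nltk_tags, stanford_tags))
    matches = sum(1 for (_, a), (_, b) in pairs if a == b)
    return matches / float(len(pairs))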

Now, for a really interesting example: gibberish made to look like English. For this test, we’re going to use Lewis Carroll’s Jabberwocky:

nltk.pos_tag(nltk.word_tokenize(""" Twas bryllyg, and ye slythy toves Did gyre and gymble in ye wabe: All mimsy were ye borogoves; And ye mome raths outgrabe. """))

At last, we have something where the output varies. One obvious lesson is that these algorithms are more than happy to guess in order to improve accuracy, even where they have no idea what’s going on – a strategy familiar from multiple-choice tests. It may be prudent to develop a class of algorithms that lose points for consistently guessing wildly incorrectly (similar to the scoring method once used on the SAT); a sketch of such an abstaining tagger follows the output below.

Twas: NNP       Twas: NNP
bryllyg: NN     bryllyg: NN
,: ,            ,: ,
and: CC         and: CC
ye: VB          ye: NN
slythy: JJ      slythy: NN
toves: NNS      toves: VBZ
Did: NNP        Did: VBD
gyre: NN        gyre: NN
and: CC         and: CC
gymble: JJ      gymble: NN
in: IN          in: IN
ye: NN          ye: JJ
wabe: NN        wabe: NN
:: :            :: :
All: DT         All: DT
mimsy: NN       mimsy: NN
were: VBD       were: VBD
ye: NN          ye: JJ
borogoves: NNS  borogoves: NNS
;: :            ;: :
And: CC         And: CC
ye: NN          ye: VB
mome: NN        mome: FW
raths: NNS      raths: FW
outgrabe: VBP   outgrabe: FW
.: .            .: .
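For contrast, a tagger that abstains rather than guesses is easy to sketch in NLTK: a unigram tagger with no backoff returns None for any token it never saw in training, which covers most of Carroll’s nonsense words (this assumes the bundled treebank corpus has been downloaded):

import nltk
from nltk.corpus import treebank

# Train on the Penn Treebank sample; unseen tokens get the tag None
# instead of a guess.
abstaining_tagger = nltk.UnigramTagger(treebank.tagged_sents())
print(abstaining_tagger.tag(nltk.word_tokenize("All mimsy were ye borogoves")))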

For a final sample, we have a commonly cited passage from Winnie the Pooh: while completely decipherable as English, it’s excruciatingly long. It may be worth noting that while this is verbose for modern tastes, many legal documents are written in the form of a single long sentence, joined by conjunctions (whereas a, whereas b, …) – a form that also bears a strong resemblance to the writings of Victor Hugo:

In after-years [Piglet] liked to think that he had been in Very Great Danger during the Terrible Flood, but the only danger he had really been in was in the last half-hour of his imprisonment, when Owl, who had just flown up, sat on a branch of his tree to comfort him, and told him a very long story about an aunt who had once laid a seagull’s egg by mistake, and the story went on and on, rather like this sentence, until Piglet who was listening out of his window without much hope, went to sleep quietly and naturally, slipping slowly out of the window towards the water until he was only hanging on by his toes, at which moment luckily, a sudden loud squawk from Owl, which was really part of the story, being what his aunt said, woke Piglet up and just gave him time to jerk himself back into safety and say, “How interesting, and did she?” when-well, you can imagine his joy when at last he saw the good ship, The Brain of Pooh (Captain, C. Robin; 1st Mate, P. Bear) coming over the sea to rescue him.

It’s worth noting here that the two tools tokenize this passage slightly differently – one handles a set of unicode characters (such as the curly apostrophe) more gracefully, and the other inserts extra token breaks.
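You can see this kind of divergence with NLTK’s tokenizer alone; a small probe (the exact behavior varies by NLTK version):

import nltk

# The Treebank-style tokenizer splits the ASCII apostrophe contraction,
# but may leave the unicode curly apostrophe attached, depending on version.
print(nltk.word_tokenize("a seagull's egg"))   # straight apostrophe
print(nltk.word_tokenize("a seagull’s egg"))   # curly apostrophe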