NLTK Regular Expression Parser (RegexpParser)

The Natural Language Toolkit (NLTK) provides a variety of tools for dealing with natural language. One such tool is the Regular Expression Parser. If you’re familiar with regular expressions, it can be a useful tool in natural language processing.

Background Information

You need to be familiar with regular expressions to make full use of the RegexpParser/RegexpChunkParser. If you need to learn about regular expressions, here is a site with an abundance of information to get you started: http://www.regular-expressions.info. You also need to know how to use a tagger and what its tags mean. A tagger is a tool that marks each word in a sentence with its part of speech. Here is a small comparison I did of Python taggers: NLTK vs MontyLingua Part of Speech Taggers. The NLTK RegexpParser works by running regular expressions over the part-of-speech tags added by a tagger. The Brown Corpus tags will be the tags used throughout the rest of this post, and are commonly used by taggers in general. On a side note, the RegexpParser can be used with either the NLTK or MontyLingua tagger.
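To make the tagger output concrete: NLTK represents a tagged sentence as a list of (word, tag) tuples, and that list is what the RegexpParser consumes. Here is a minimal sketch, assuming hand-written tags rather than real tagger output (the sample text and its tags are my own illustration):

```python
from nltk.tag import str2tuple

# Hand-tagged text in word/TAG form. These tags are written by hand for
# illustration; they were not produced by the NLTK or MontyLingua tagger.
tagged_text = "an/DT elephant/NN is/VBZ larger/JJR than/IN a/DT dog/NN"

# str2tuple splits each token on its last "/" into a (word, tag) pair.
tagged_tokens = [str2tuple(token) for token in tagged_text.split()]

print(tagged_tokens)
```

Building the input by hand like this is also a convenient way to test a grammar without depending on a particular tagger.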

Basic RegexpParser Usage

Let me start by going over the “how to” provided in the NLTK documentation. The source of this information is here: NLTK RegexpParser HowTo. The documentation goes through how you could use the RegexpParser/RegexpChunkParser to do a traditional parse of a sentence.

The RegexpParser/RegexpChunkParser works by defining rules for grouping words together. A simple example would be: “NP: {<DT>? <JJ>* <NN>*}”. This rule groups words into a noun phrase: an optional determiner (usually an article), then zero or more adjectives, followed by zero or more nouns. In the how-to, they go over prepositions and build prepositional phrases from a preposition and a noun phrase. It’s important to note that chunks defined by earlier rules can be used in later ones. Also, regular expression syntax can be used inside a tag (as in <V.*>) or applied to whole tags (as in <JJ>*).
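As a quick illustration of the noun phrase rule, here is a minimal sketch that applies it to a hand-tagged sentence. The sentence and its tags are assumptions made up for the example, not output from a tagger:

```python
from nltk import RegexpParser

# A single-rule grammar: optional determiner, any adjectives, any nouns.
np_parser = RegexpParser("NP: {<DT>? <JJ>* <NN>*}")

# Hand-tagged sentence (assumed tags, for illustration only).
sentence = [("the", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD")]

np_tree = np_parser.parse(sentence)
print(np_tree)

# The first child of the root is the NP chunk; the verb stays outside it.
noun_phrase = np_tree[0]
print(noun_phrase.label(), noun_phrase.leaves())
```

The determiner, adjective, and noun end up inside one NP subtree, while the verb remains a plain (word, tag) tuple under the root.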

Here is the example from the NLTK website:

from nltk import RegexpParser

parser = RegexpParser('''
    NP: {<DT>? <JJ>* <NN>*}     # NP
    P: {<IN>}                   # Preposition
    V: {<V.*>}                  # Verb
    PP: {<P> <NP>}              # PP -> P NP
    VP: {<V> <NP|PP>*}          # VP -> V (NP|PP)*
    ''')
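To see how the later rules build on the earlier ones, here is a small sketch that runs this grammar over a hand-tagged sentence. The sentence and tags are my own assumptions for illustration:

```python
from nltk import RegexpParser

parser = RegexpParser('''
    NP: {<DT>? <JJ>* <NN>*}     # NP
    P: {<IN>}                   # Preposition
    V: {<V.*>}                  # Verb
    PP: {<P> <NP>}              # PP -> P NP
    VP: {<V> <NP|PP>*}          # VP -> V (NP|PP)*
    ''')

# Hand-tagged sentence (assumed tags, not real tagger output).
tagged = [("the", "DT"), ("cat", "NN"), ("sat", "VBD"),
          ("on", "IN"), ("the", "DT"), ("mat", "NN")]

tree = parser.parse(tagged)
print(tree)

# Labels of the chunks directly under the root.
top_level = [child.label() for child in tree]
print(top_level)
```

The rules run in order, so the PP rule can refer to the P and NP chunks created before it, and the VP rule can then absorb the PP.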

Alternative RegexpParser Usage

I call this an alternative usage because the parser can find patterns that aren’t necessarily related to grammatical phrases in English. It can be used to find any pattern in the tags of a sentence. Let me start by showing the regular expression grammar from my program.

grammar = """
    NP: {<PRP>?<JJ.*>*<NN.*>+}
    CP: {<JJR|JJS>}
    VERB: {<VB.*>}
    THAN: {<IN>}
    COMP: {<DT>?<NP><RB>?<VERB><DT>?<CP><THAN><DT>?<NP>}
    """
self.chunker = RegexpParser(grammar)

I was using it to look for a specific pattern in a sentence. The first part, NP, looks for a noun phrase. The <PRP>? is there because of a bug in the tagger I was using: it marked “An” with a capital ‘A’ as a PRP (pronoun) rather than a DT (determiner/article). I found another workaround for the bug, but left the PRP in there to catch anything that might have slipped through.
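The effect of the <PRP>? workaround can be sketched with just the NP rule. The two inputs below are hand-tagged assumptions that mimic the mis-tagged and correctly tagged cases:

```python
from nltk import RegexpParser

# Only the NP rule from the grammar above.
np_chunker = RegexpParser("NP: {<PRP>?<JJ.*>*<NN.*>+}")

# "An" mis-tagged as PRP (the tagger bug described above):
# the optional <PRP> pulls it into the noun phrase.
mis_tagged = np_chunker.parse([("An", "PRP"), ("elephant", "NN")])

# "an" correctly tagged as DT: the determiner stays outside the NP,
# where the COMP rule's <DT>? can pick it up later.
well_tagged = np_chunker.parse([("an", "DT"), ("elephant", "NN")])

print(mis_tagged)
print(well_tagged)
```

Either way the noun is chunked as an NP, so the later COMP rule still has something to match.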

Then it moves on to the CP, which is the comparison word. JJR-tagged words are comparative adjectives, such as bigger, smaller, and larger. JJS-tagged words are superlative adjectives, which signify the most or chief; they include biggest, smallest, and largest.

The next two are simply the VERB and the word THAN. The VERB could be a compound verb, so there would be one or more verbs present. The IN tag denotes a preposition. In this case, I was looking specifically for the word than, although the rule as written matches any word tagged IN.

The last line is COMP. This is the regular expression that puts it all together. This was looking for a size comparison of two objects. It might be easier to look at the output of this part of the expression than trying to explain it piece by piece. The only tag not explained above is RB, which is an adverb.

Here is the parse for the sentence “Everyone knows an elephant is larger than a dog.”:

(S
  (NP everyone/NN)
  (VERB knows/VBZ)
  (COMP
    an/DT
    (NP elephant/NN)
    (VERB is/VBZ)
    (CP larger/JJR)
    (THAN than/IN)
    a/DT
    (NP dog/NN))
  ./.)
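Here is a rough sketch of how a parse like this could be produced and mined for the two objects being compared. The tokens are hand-tagged to match the parse above rather than run through a tagger, and extract_comparisons is a hypothetical helper written for this example, not part of NLTK or my original program:

```python
from nltk import RegexpParser, Tree

grammar = """
    NP: {<PRP>?<JJ.*>*<NN.*>+}
    CP: {<JJR|JJS>}
    VERB: {<VB.*>}
    THAN: {<IN>}
    COMP: {<DT>?<NP><RB>?<VERB><DT>?<CP><THAN><DT>?<NP>}
    """
chunker = RegexpParser(grammar)

# Hand-tagged tokens matching the parse shown above (assumed, not tagger output).
tagged = [("everyone", "NN"), ("knows", "VBZ"), ("an", "DT"),
          ("elephant", "NN"), ("is", "VBZ"), ("larger", "JJR"),
          ("than", "IN"), ("a", "DT"), ("dog", "NN"), (".", ".")]


def words_of(subtree):
    """Join the words of a chunk, dropping the tags."""
    return " ".join(word for word, tag in subtree.leaves())


def extract_comparisons(tree):
    """Hypothetical helper: return (first NP, comparison word, last NP)
    for every COMP chunk found in the tree."""
    results = []
    for comp in tree.subtrees(lambda t: t.label() == "COMP"):
        chunks = [child for child in comp if isinstance(child, Tree)]
        nps = [words_of(c) for c in chunks if c.label() == "NP"]
        cps = [words_of(c) for c in chunks if c.label() == "CP"]
        results.append((nps[0], cps[0], nps[-1]))
    return results


tree = chunker.parse(tagged)
comparisons = extract_comparisons(tree)
print(comparisons)
```

Filtering subtrees by label keeps the extraction independent of where in the sentence the comparison appears.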

The output is a simple tree, which makes data extraction easy. Many possibilities open up when looking for patterns in English text. May this help you in your data mining endeavors.