Some basic NLP techniques:

Now that we know what NLP is and is not, and what problems it faces, we can start learning the most basic NLP techniques. In the next post we will apply these techniques using a Python NLP library called SpaCy. This post focuses on the concepts.

a. Stemming and Lemmatizing: these tasks consist of reducing different forms of a word to a common base form. For example:

In the sentence “I am a student” the process would result in “I be a student”.

In the sentence “My dog’s fur is dark” the process would result in “My dog fur be dark”.

Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes (the obtained element is known as the stem).

On the other hand, lemmatization consists of doing things properly, using a vocabulary and a morphological analysis of words to return the base or dictionary form of a word, which is known as the lemma.

If we stem the sentence “I saw an amazing thing”, we might obtain ‘s’ instead of ‘saw’; but if we lemmatize it, we obtain ‘see’, which is the lemma.

As already mentioned, both techniques can remove important information, but they also help us normalize our corpus (although lemmatization is the one usually applied).
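To make the “crude process that chops off the ends of words” concrete, here is a deliberately naive stemmer sketch. It is not any standard algorithm (such as Porter’s); the suffix list is made up for illustration:

```python
# A deliberately crude stemmer: chop off a few common English suffixes.
# This is only a sketch of the idea, not a real algorithm such as Porter's.
SUFFIXES = ["ing", "ly", "ed", "es", "s"]

def crude_stem(word):
    for suffix in SUFFIXES:
        # Only strip if enough of the word remains to be a plausible stem
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(crude_stem("playing"))  # play
print(crude_stem("dogs"))     # dog
print(crude_stem("saw"))      # saw -- chopping suffixes cannot recover 'see';
                              # for that we need a vocabulary, i.e. lemmatization
```

Note how the last example shows the limitation: no amount of suffix chopping turns ‘saw’ into ‘see’, which is exactly why lemmatization needs a vocabulary and morphological analysis.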

b. Coreference resolution: consists of finding the expressions in our corpus that refer to the same entity (for example, resolving the pronoun ‘she’ to ‘Mary’ in “Mary said she was tired”). This can also be thought of as a normalizing or preprocessing task.
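As a toy illustration of what a coreference resolver outputs, here is a naive heuristic that links each pronoun to the most recent preceding capitalized word. Real coreference systems are far more sophisticated; this is only a sketch of the input/output of the task:

```python
# Toy coreference heuristic: replace each pronoun with the most recent
# preceding capitalized word (a very rough stand-in for a named mention).
# Real coreference resolvers use much richer linguistic information.
PRONOUNS = {"he", "she", "it", "him", "her", "they", "them"}

def naive_coref(tokens):
    resolved = []
    last_mention = None
    for tok in tokens:
        if tok.lower() in PRONOUNS and last_mention is not None:
            resolved.append(last_mention)  # substitute the resolved mention
        else:
            if tok[0].isupper() and tok.lower() not in PRONOUNS:
                last_mention = tok  # remember the latest candidate mention
            resolved.append(tok)
    return resolved

print(naive_coref("Mary said she was tired".split()))
# ['Mary', 'said', 'Mary', 'was', 'tired']
```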

c. Part-of-speech (POS) Tagging: a POS tagger marks each word in a corpus by assigning a syntactic category such as:

Open class types (those that readily accept new members): noun, verb, adjective, adverb.

Closed class types (those with relatively fixed membership): preposition, determiner, pronoun, conjunction, auxiliary verb, particle, numeral.

For example, given the sentence “I want to play the piano” a POS tagger should return:

I (Pronoun) want (Verb) to (Particle) play (Verb) the (Determiner) piano (Noun)

d. Dependency Parsing: sometimes, instead of the category (POS tag) of a word, we want to know the role that word plays in a specific sentence of our corpus; this is the task of dependency parsers. The objective is to obtain the dependencies or relations between words in the form of a dependency tree.

The dependencies considered are, in general terms, subject, object, complement and modifier relations.

As an example, given the sentence “I want to play the piano” a dependency parser would produce the following tree:

SpaCy’s Dependency Tree visualized with displaCy

Here we can see that the dependency parser that I use (SpaCy’s dependency parser) also outputs the POS tags. If you think about it, it makes sense because we first need to know the category of each word to extract dependencies.

We will see the types of dependencies in detail later, but in this case we have:

want — I: nominal subject
want — play: open clausal complement
play — to: auxiliary verb
play — the piano: direct object

where a — b: R means “b is R of a”. For example, “the piano is direct object of play” (which is play — the piano: direct object from above).

e. Named Entity Recognition (NER): in the real world, in our daily conversations, we don’t work directly with the categories of words. Instead, for example, if we want to build a Netflix chatbot we want it to recognize both ‘Batman’ and ‘Avatar’ as instances of the same group, which we call ‘films’, but ‘Steven Spielberg’ as a ‘director’. This concept of a semantic field that depends on a context is what we define as an entity. The role of a named entity recognizer is to detect the relevant entities in our corpus.

For example, if our NER knows the entities ‘film’, ‘location’ and ‘director’, given the sentence “James Cameron filmed part of Avatar in New Zealand”, it will output:

James Cameron: DIRECTOR
Avatar: FILM
New Zealand: LOCATION

Note that in the example, instances of entities can be a single word (‘Avatar’) or several words (‘New Zealand’ or ‘James Cameron’).