
Fine-tuning I: Dimensionality reduction

To create a good classifier with the model described in Part I, we need a large, properly labelled corpus in order to compute a comprehensive word-sentiment occurrence table. The training corpus should contain enough examples of each word in different contexts so that the occurrences computed in the table yield a good approximation of their true probabilities (frequencies).

There are several techniques aimed at reducing the dimensionality of the problem to make it more manageable. Essentially, they consist of reducing the table size (the number of distinct words or tokens in our vocabulary) so that the requirements on the size of the training corpus are lowered.

If we pay attention only to the words as they appear in the text (what is called the form), we will learn different weights for like, likes and liked, and we will need a much bigger corpus to learn accurate weights for all of them. Alternatively, we can collapse all these words into the single token like (called the lemma), so that we combine everything we have learnt from the different forms. This process is called lemmatization.
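As a minimal sketch of the idea, a lemmatizer can be seen as a lookup from inflected forms to their lemma. The LEMMAS table below is a hypothetical toy example; a real system would use a proper lemmatizer library (such as spaCy or NLTK's WordNetLemmatizer), which covers the whole language rather than a handful of words.

```python
# Toy lemmatizer: a lookup table mapping inflected forms to their lemma.
# This tiny table is illustrative only; real lemmatizers use full
# morphological dictionaries.
LEMMAS = {
    "like": "like", "likes": "like", "liked": "like", "liking": "like",
    "break": "break", "breaks": "break", "breaking": "break", "broken": "break",
}

def lemmatize(tokens):
    """Collapse every known form to its lemma; leave unknown words as-is."""
    return [LEMMAS.get(tok, tok) for tok in tokens]

print(lemmatize(["she", "likes", "apples"]))  # ['she', 'like', 'apples']
```

After this step, everything learnt about likes and liked accumulates in the single row for like.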

If a lemmatizer is not available for your language, there is another approach, called stemming, which tries to obtain the root of a word by removing some morphological parts. For example, a stemming algorithm will trim adorable, adores, adore and adorably to “ador”, so all of these words are treated as the same token (a single row in the vocabulary table). The best-known stemming algorithm is the Porter stemmer.
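The following is a deliberately naive suffix-stripping stemmer, just to illustrate the mechanism; the suffix list is an assumption for this example, and the real Porter stemmer applies a much more careful, multi-step set of rules.

```python
# Minimal suffix-stripping stemmer (illustration only; use NLTK's
# PorterStemmer or similar in practice).
SUFFIXES = ["ably", "able", "ing", "es", "ed", "e", "s"]  # longest first

def stem(word):
    """Strip the first matching suffix, keeping at least a 3-letter root."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["adorable", "adores", "adore", "adorably"]])
# ['ador', 'ador', 'ador', 'ador']
```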

However, it is important to note that this process makes the model simpler (we have to deal with fewer different words) but it’s not guaranteed to help in all situations; it will mostly depend on your use case.

Keep in mind

Consider the case where you want to detect the sentiment of product reviews and you decide to use lemmatization. A “breaking product” is very different from a “broken product”, yet your classifier will treat both cases identically: break + product.

Fine-tuning II: Negations

Now we will face a much more complex problem for sentiment analysis: negations. For multiple NLP algorithms, negation has always been a difficult situation to deal with because, with just a single tiny word, we can completely invert the meaning of a whole sentence.

Let’s see the following example:

I really do like every time that we all gather to have lunch together around the table at your grandparents’ house.

I really don’t like every time that we all gather to have lunch together around the table at your grandparents’ house.

These two sentences are almost identical, differing in just one word, yet they have completely opposite meanings.

Even worse, we can expect both tokens, do and don’t, to appear roughly evenly in positive and negative documents, so they won’t help us calculate the sentiment scores.

A quick trick is to transform the documents before the training step (also known as preprocessing) as follows:

If we find negations like don’t or won’t followed by a verb, we can remove them and prepend the verb with the special flag NOT_. Then a phrase like “I don’t love apples” would be transformed into “I NOT_love apples”, and the token NOT_love would be different from the token love (different rows in the positive-negative table).
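A minimal sketch of this preprocessing step, assuming a small hand-picked set of negation tokens (real systems determine the negation scope more carefully, e.g. distinguishing verbs from adjectives):

```python
# Tokens we treat as negations; this set is an assumption for the example.
NEGATIONS = {"don't", "doesn't", "won't", "isn't", "didn't", "not"}

def mark_negations(tokens):
    """Drop each negation token and prefix the following word with NOT_."""
    out = []
    negate_next = False
    for tok in tokens:
        if tok.lower() in NEGATIONS:
            negate_next = True
        elif negate_next:
            out.append("NOT_" + tok)
            negate_next = False
        else:
            out.append(tok)
    return out

print(mark_negations("I don't love apples".split()))
# ['I', 'NOT_love', 'apples']
```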

Hopefully this trick will help us to build a table like:

+----------+--------------+--------------+
|          | Positive     | Negative     |
+----------+--------------+--------------+
| ...      | ...          | ...          |
| love     | large number | small number |
| NOT_love | small number | large number |
| ...      | ...          | ...          |
+----------+--------------+--------------+

For other negations, like “the movie isn’t good enough” the negation should be applied to the following adjective (NOT_good).

Keep in mind

When you analyze new documents, you will have to apply exactly the same preprocessing that you applied to the documents in the training set.

Fine-tuning III: N-gram models

So far we have been working with single words (or tokens, in case we have performed some stemming), which, combined with the bag-of-words model, implies the assumption that words are independent. One way to reduce the effect of this simplification, by adding some context to the words, is to use n-grams.

An n-gram is a sequence of n consecutive words, and we can use n-grams as the building blocks of our model: the rows of the table we need to compute. In fact, we have already been using the n-gram model for the specific case of n equal to one (n=1), also called unigrams (for n=2 they are called bigrams, for n=3 trigrams, then four-grams and so on). When dealing with n-grams, special tokens are often used to denote the beginning and end of a sentence.

For instance, given the sentence “we enjoyed the meal”, the n-grams would be:

Bigrams:

(_, we) (we, enjoyed) (enjoyed, the) (the, meal) (meal, _)

Trigrams:

(_, _, we) (_, we, enjoyed) (we, enjoyed, the) (enjoyed, the, meal) (the, meal, _) (meal, _, _)
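The examples above can be generated with a small helper that pads the sentence with n-1 boundary tokens on each side, here written as "_" to match the notation used in the text:

```python
def ngrams(tokens, n, pad="_"):
    """Return the n-grams of a sentence, padded with n-1 boundary tokens."""
    padded = [pad] * (n - 1) + tokens + [pad] * (n - 1)
    return [tuple(padded[i : i + n]) for i in range(len(padded) - n + 1)]

sentence = "we enjoyed the meal".split()
print(ngrams(sentence, 2))  # 5 bigrams, from (_, we) to (meal, _)
print(ngrams(sentence, 3))  # 6 trigrams, from (_, _, we) to (meal, _, _)
```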

Ideally, this could also help with negations, even without the previously explained preprocessing, since you could end up with bigrams like (don’t, love). We may also mix multiple tricks if we find that they work for our use case. For example, with lemmatization + negations + bigram model, we would get the following:

He doesn’t really like ice-creams

(_, he) (he, really) (really, NOT_like) (NOT_like, ice-cream) (ice-cream, _)

The larger the n-grams are, the more context you add, but your vocabulary (the number of different word combinations) grows too, which requires a larger dataset to train on. For instance, if you have V distinct words in your training set, your table will need V rows for the unigram model, roughly V² rows for the bigram model, V³ for trigrams, etc.

Combining n-grams

We can also combine multiple n-gram models, which is called backoff: for a given document we use trigrams when they are found in the training table. Otherwise, if a trigram is not found, we try the bigrams, or fall back directly to unigrams.

Normally, when we have to back off to a lower-order n-gram model, we discount the weights as a way to give more confidence to the higher-order n-grams compared to the lower-order ones. This scaling factor can be constant (e.g. 0.8 for bigrams and 0.6 for unigrams), or it can depend on the specific n-grams we had to back off from.
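A sketch of backoff with constant discount factors. The score tables and the 0.8/0.6 factors below are hypothetical values for illustration; in practice the scores would come from the training counts, and the discounts could instead depend on each n-gram.

```python
# Hypothetical score tables; in practice these come from training counts.
trigram_scores = {("we", "enjoyed", "the"): 0.9}
bigram_scores = {("enjoyed", "the"): 0.7, ("the", "meal"): 0.6}
unigram_scores = {"enjoyed": 0.5, "the": 0.1, "meal": 0.4}

# Constant discount factors for the lower-order models (an assumption here).
BIGRAM_DISCOUNT = 0.8
UNIGRAM_DISCOUNT = 0.6

def backoff_score(w1, w2, w3):
    """Score a trigram, backing off to discounted bigrams, then unigrams."""
    if (w1, w2, w3) in trigram_scores:
        return trigram_scores[(w1, w2, w3)]
    if (w2, w3) in bigram_scores:
        return BIGRAM_DISCOUNT * bigram_scores[(w2, w3)]
    return UNIGRAM_DISCOUNT * unigram_scores.get(w3, 0.0)

print(backoff_score("we", "enjoyed", "the"))  # trigram found: 0.9
print(backoff_score("i", "enjoyed", "the"))   # bigram backoff: 0.8 * 0.7
print(backoff_score("i", "ate", "meal"))      # unigram backoff: 0.6 * 0.4
```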

Wrap up

Apart from these tricks, you may consider implementing other, more complex models. We may cover some of them in a future article. Stay tuned by following this publication.