To perform a sentiment analysis, all we need is a dictionary and a text. In plain words, the idea is: pick a word from the text and check whether it is in the dictionary. If it is, the dictionary tells us whether the word is positive or negative, and how strongly, by adding or subtracting points.

Based on that, what is the most important variable in a sentiment analysis? The dictionary, of course: how complete it is, and the values assigned to each word. The negative value of the word “suicide” is not the same as that of the word “hit”.
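The lookup described above can be sketched in a few lines of Python. This is only a minimal illustration: the toy dictionary, its words, and its scores are made up for the example and are not taken from any published lexicon.

```python
import re

# Hypothetical toy dictionary: words and scores are illustrative only,
# not taken from AFINN or any other published lexicon.
TOY_LEXICON = {"good": 3, "love": 3, "hit": -1, "suicide": -4}

def tokenize(text):
    """Lowercase the text and split it into words made of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())

def score_text(text, lexicon):
    """Sum the scores of the words found in the lexicon; unknown words add 0."""
    return sum(lexicon.get(word, 0) for word in tokenize(text))

print(score_text("A good hit", TOY_LEXICON))         # 3 - 1 = 2
print(score_text("Love, not suicide", TOY_LEXICON))  # 3 - 4 = -1
```

Words that are missing from the dictionary contribute nothing, which is why the completeness of the dictionary matters as much as the values it assigns.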

Ideas to validate

Do dictionaries with a greater number of words have better performance? Are more accurate results obtained when the dictionary and the text to be analyzed are in the same language, or is it preferable to consider a larger dictionary and the translated text?

Analysis

1. Texts

Let’s analyze the book Don Quixote of La Mancha, which has 381,104 words, with 52 chapters in the first part and 74 chapters in the second part. For a sentiment analysis like the one proposed here, it is convenient to format the text like this:

chapter_n | part | chapter_n_o | chapter_title | chapter_text

1 | 1 | 1 | Capítulo primero. Que trata de la condición y ejercicio del famoso hidalgo don Quijote de la Mancha | En un lugar de la Mancha, de cuyo nombre no quiero acordarme, no ha mucho tiempo que vivía un hidalgo de los de lanza en astillero, adarga antigua, rocín flaco y galgo corredor...

2 | 1 | 2 | Capítulo II. Que trata de la primera salida que de su tierra hizo el ingenioso don Quijote | Hechas, pues, estas prevenciones, no quiso aguardar más tiempo a poner en efeto su pensamiento, apretándole a ello la falta que él pensaba que hacía en el mundo su tardanza....

3 | 1 | 3 | Capítulo III. Donde se cuenta la graciosa manera que tuvo don Quijote en armarse caballero | Y así, fatigado deste pensamiento, abrevió su venteril y limitada cena; la cual acabada, llamó al ventero, y, encerrándose con él en la caballeriza, se hincó de rodillas ante él, diciéndole:
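A table like this is easiest to score once it is broken into one word per row. The sketch below shows one possible way to do that in Python. The `Chapter` class and the `unnest_words` helper are hypothetical names introduced for illustration, the reading of `chapter_n` as a running number and `chapter_n_o` as the number within the part is an assumption, and the sample titles and texts are abbreviated.

```python
import re
from dataclasses import dataclass

@dataclass
class Chapter:
    chapter_n: int      # assumed: running chapter number across both parts
    part: int           # 1 or 2
    chapter_n_o: int    # assumed: chapter number within its part
    chapter_title: str
    chapter_text: str

def unnest_words(chapters):
    """Yield one (chapter_n, word) pair for every word in chapter_text."""
    for ch in chapters:
        for word in re.findall(r"[^\W\d_]+", ch.chapter_text.lower()):
            yield ch.chapter_n, word

# Abbreviated sample rows, for illustration only.
chapters = [
    Chapter(1, 1, 1, "Capítulo primero.", "En un lugar de la Mancha"),
    Chapter(2, 1, 2, "Capítulo II.", "Hechas, pues, estas prevenciones"),
]

rows = list(unnest_words(chapters))
print(rows[:3])   # [(1, 'en'), (1, 'un'), (1, 'lugar')]
print(len(rows))  # 10
```

Keeping the chapter number next to every word is what later allows the scores to be grouped by chapter.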



2. Dictionaries

Using the tidytext package, we find three general-purpose lexicons for English: AFINN, bing, and nrc.

All three of these lexicons are based on unigrams, i.e., single words. These lexicons contain many English words and the words are assigned scores for positive/negative sentiment, and also possibly emotions like joy, anger, sadness, and so forth. The nrc lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The bing lexicon categorizes words in a binary fashion into positive and negative categories. The AFINN lexicon assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment. All of this information is tabulated in the sentiments dataset, and tidytext provides a function get_sentiments() to get specific sentiment lexicons without the columns that are not used in that lexicon.
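Because the three lexicons store sentiment in different shapes, comparing them usually means bringing them to a common form first. The following Python sketch shows one possible normalization to a signed value per word. The entries are toy examples that only mimic the three shapes, and the helper names are hypothetical; none of this is tidytext's own API.

```python
# Toy entries that mimic the three shapes described above. The words and
# values are illustrative assumptions, not copied from the real lexicons.
nrc_like = [
    ("abandon", "fear"), ("abandon", "negative"),
    ("joyful", "joy"), ("joyful", "positive"),
]
bing_like = [("abandon", "negative"), ("joyful", "positive")]
afinn_like = [("abandon", -2), ("joyful", 3)]

def polarity_from_labels(entries):
    """Map (word, label) entries to +1 or -1, skipping emotion-only labels.

    If a word carries both labels, the last one seen wins in this sketch.
    """
    sign = {"positive": 1, "negative": -1}
    return {word: sign[label] for word, label in entries if label in sign}

def polarity_from_scores(entries):
    """Keep the numeric score of (word, score) entries as it is."""
    return dict(entries)

print(polarity_from_labels(nrc_like))    # {'abandon': -1, 'joyful': 1}
print(polarity_from_labels(bing_like))   # {'abandon': -1, 'joyful': 1}
print(polarity_from_scores(afinn_like))  # {'abandon': -2, 'joyful': 3}
```

Note that the label-based lexicons count every matching word as one point, while a score-based lexicon weights words differently, so absolute totals from the two kinds are not directly comparable.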

We can also use one lexicon for Spanish:

Sentiment Lexicons in Spanish from Veronica Perez-Rosas, Carmen Banea and Rada Mihalcea

A comparison between the four dictionaries shows us the number of words that compose them:

Lexicon | Words
AFINN | 2,477
Bing et al. | 6,800
NRC | 14,182
Sentiment Lexicons in Spanish | 1,347
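The size of a dictionary is only part of the picture: what matters for a given book is how many of its words the dictionary actually recognizes. A minimal Python sketch of that coverage measure follows. The `coverage` function, the sample words, and the two toy dictionaries are assumptions made for illustration.

```python
def coverage(words, lexicon):
    """Return the fraction of words that appear in the lexicon (0.0 if empty)."""
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in lexicon)
    return hits / len(words)

# Illustrative data only: six tokens and two toy dictionaries of different size.
words = ["a", "good", "knight", "with", "great", "love"]
small_lexicon = {"good"}
large_lexicon = {"good", "great", "love"}

print(round(coverage(words, small_lexicon), 3))  # 0.167
print(round(coverage(words, large_lexicon), 3))  # 0.5
```

A larger dictionary only helps when its extra entries occur in the text being analyzed, so coverage is worth checking alongside the raw word counts.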

3. Visualization

We will compare the performance of the dictionaries throughout the work, dividing it into chapters.
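One way to obtain the values to plot is to add up the word scores within each chapter. The Python sketch below assumes the text is already available as (chapter number, word) pairs and that the dictionary maps each word to a signed value. The function name and the sample data are hypothetical.

```python
from collections import defaultdict

def sentiment_by_chapter(rows, lexicon):
    """Sum the lexicon values of the words in each chapter.

    rows is an iterable of (chapter_n, word) pairs. Words missing from
    the lexicon count as 0, so every chapter still gets an entry.
    """
    totals = defaultdict(int)
    for chapter_n, word in rows:
        totals[chapter_n] += lexicon.get(word, 0)
    return dict(sorted(totals.items()))

# Illustrative data only.
rows = [
    (1, "good"), (1, "sad"), (1, "knight"),
    (2, "love"), (2, "good"),
    (3, "sad"),
]
toy_lexicon = {"good": 1, "love": 1, "sad": -1}

print(sentiment_by_chapter(rows, toy_lexicon))  # {1: 0, 2: 2, 3: -1}
```

Running the same function once per dictionary gives one series per lexicon, which can then be drawn against the chapter number to compare their trajectories.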

4. Analysis of the result

The four different lexicons for calculating sentiment give results with fairly similar relative trajectories through the chapters.

In general, the dips and peaks fall in very similar places, but the absolute values are significantly different.

NRC, Bing et al. and AFINN have very similar relative trajectories.

The Sentiment Lexicons in Spanish share some dips and peaks with the AFINN lexicon.

It appears that the NRC lexicon finds more positive sentiment than the AFINN lexicon.

Bing et al. finds more negative sentiment.

5. Top words per feeling according to each dictionary

A second analysis that we can perform is a ranking of the negative and positive words according to each dictionary:

6. Analysis of the use of different dictionaries

Comparing the top 10 words with positive and negative feelings we find:

There are some words common to every dictionary: good, gran, bueno, god, love.

Translation errors: when working with a translated work, some words are assigned a feeling that they do not have. For example, the NRC dictionary treats the word “Don” as an extraordinary gift or skill and hence gives it a positive value, but in the Spanish of this book, “Don Quijote de la Mancha”, written more than 400 years ago, “Don” is a courtesy title for gentlemen.
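A ranking like the one compared above, together with a simple guard against misread words such as “Don”, can be sketched in Python as follows. The `top_words` function, its `exclude` parameter, and the sample data are hypothetical; ranking by frequency is an assumption about how the top words were chosen.

```python
from collections import Counter

def top_words(words, lexicon, n=10, exclude=frozenset()):
    """Return the n most frequent positive and negative words.

    Words listed in exclude are skipped, which lets titles or names
    that the lexicon misreads be left out of the ranking.
    """
    counts = Counter(w for w in words if w in lexicon and w not in exclude)
    positive = Counter({w: c for w, c in counts.items() if lexicon[w] > 0})
    negative = Counter({w: c for w, c in counts.items() if lexicon[w] < 0})
    return positive.most_common(n), negative.most_common(n)

# Illustrative data only.
words = ["don", "good", "don", "love", "good", "don", "battle", "battle", "sad"]
toy_lexicon = {"don": 1, "good": 1, "love": 1, "battle": -1, "sad": -1}

pos, neg = top_words(words, toy_lexicon)
print(pos)  # [('don', 3), ('good', 2), ('love', 1)]
print(neg)  # [('battle', 2), ('sad', 1)]

pos_filtered, _ = top_words(words, toy_lexicon, exclude={"don"})
print(pos_filtered)  # [('good', 2), ('love', 1)]
```

An exclusion list is a blunt tool, since it has to be written by hand for each text, but it makes the effect of a single misread word on the ranking easy to see.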

Conclusions

Based on the previous analysis, the dictionaries with the most words perform better: they give more accurate results and detect more feelings. Even when the work is analyzed in its original language, a dictionary with fewer terms gives an imprecise result.