Next, we present comparison results for the twenty-four methods considered in this paper based on the eighteen considered gold standard datasets.

Experimental details

At least three distinct approaches have been proposed to deal with sentiment analysis of sentences. The first, applied by OpinionFinder and Pattern.en, for instance, splits the task into two steps: (i) identifying sentences with no sentiment, also called objective or neutral sentences, and then (ii) detecting the polarity (positive or negative) of the subjective sentences only. Another common way to detect sentence polarity considers three distinct classes (positive, negative and neutral) in a single task, an approach used by VADER, SO-CAL, USent and others. Finally, some methods, such as SenticNet and LIWC, classify a sentence as positive or negative only, assuming that only polarized sentences are presented, given the context of the application. For example, product reviews are expected to contain only polarized opinions.

Aiming at providing a more thorough comparison among these distinct approaches, we perform two rounds of tests. In the first, we evaluate how well the methods identify three classes (positive, negative and neutral). The second considers only positive and negative as output and assumes that a previous step removing the neutral messages has already been executed. In the 3-class experiments we used only datasets containing a considerable number of neutral messages (which excludes Tweets_RND_II, Amazon, and Reviews_II). Although LIWC, Emoticons and SenticNet are 2-class methods, as highlighted in Table 2, we decided to include them in the 3-class experiments to present a full set of comparative experiments. For some sentences, these three methods cannot determine a positive or negative polarity and report it as undefined. This occurs when the sentence contains no emoticons (in the case of the Emoticons method) or no words belonging to the method’s sentiment lexicon. As neutral (objective) sentences do not contain sentiments, we assumed, in the case of these 2-class methods, that sentences with undefined polarities are equivalent to neutral sentences.
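
The treatment of undefined outputs in the 3-class experiments can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: the function name `to_three_class` and the use of `None` to encode an undefined polarity are assumptions made here.

```python
from typing import Optional


def to_three_class(label: Optional[str]) -> str:
    """Map the output of a 2-class method onto the 3-class label set."""
    # A 2-class method returns "positive" or "negative"; None is a
    # hypothetical encoding for "polarity could not be determined".
    if label in ("positive", "negative"):
        return label
    # Undefined polarity is treated as equivalent to neutral.
    return "neutral"


print([to_three_class(x) for x in ["positive", None, "negative"]])
# ['positive', 'neutral', 'negative']
```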

The 2-class experiments, on the other hand, were performed with all datasets described in Table 3, excluding the neutral sentences. We also included all methods in these experiments, even those that produce neutral outputs. As discussed before, when 2-class methods cannot detect the polarity (positive or negative) of a sentence, they usually assign it an undefined polarity. As we know that all sentences in the 2-class experiments are positive or negative, we define the coverage metric to determine the percentage of sentences a method can in fact classify as positive or negative. For instance, suppose that the Emoticons method can classify only 10% of the sentences in a dataset, corresponding to the actual percentage of sentences with emoticons. This means that the coverage of this method on this specific dataset is 10%. Note that coverage is an important metric for a more complete evaluation in the 2-class experiments: even though Emoticons presents high accuracy for the sentences it classifies, it was not able to make a prediction for 90% of the sentences. More formally, coverage is calculated as the total number of sentences minus the number of undefined sentences, divided by the total number of sentences, where the number of undefined sentences includes neutral outputs for 3-class methods.

$$\mathit{Coverage} = \frac{\#\ \mathit{Sentences} - \#\ \mathit{Undefined}}{\#\ \mathit{Sentences}}. $$
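
As an illustration of the coverage formula, the following minimal sketch counts every output other than positive or negative as undefined, so that neutral outputs of 3-class methods are included. The function name, the label strings and the use of `None` for undefined outputs are assumptions made for this example.

```python
from typing import Optional, Sequence


def coverage(predictions: Sequence[Optional[str]]) -> float:
    """Fraction of sentences that received a positive or negative label."""
    if not predictions:
        raise ValueError("coverage is undefined for an empty dataset")
    # Undefined (None) and neutral outputs both count as undefined.
    undefined = sum(1 for p in predictions if p not in ("positive", "negative"))
    return (len(predictions) - undefined) / len(predictions)


# Hypothetical outputs of an emoticon-based method on ten sentences,
# only one of which contains an emoticon.
outputs = ["positive"] + [None] * 9
print(coverage(outputs))  # 0.1
```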

Comparison metrics

Considering the 3-class comparison experiments, we used the traditional Precision, Recall, and F1 measures for the automated classification.

Each letter in Table 4 represents the number of instances which are actually in class X and predicted as class Y, where \(X, Y \in \{\mbox{positive}, \mbox{neutral}, \mbox{negative}\}\). The recall (R) of a class X is the ratio of the number of elements correctly classified as X to the number of known elements in class X. The precision (P) of a class X is the ratio of the number of elements correctly classified as X to the total number of elements predicted as class X. For example, the precision of the negative class is computed as \(P(\mathit{neg}) = i/(c+f+i)\) and its recall as \(R(\mathit{neg}) = i/(g+h+i)\); the F1 measure is the harmonic mean of precision and recall. In this case, \(F1(\mathit{neg})=\frac{2P(\mathit{neg})\cdot R(\mathit{neg})}{P(\mathit{neg})+R(\mathit{neg})}\).

Table 4 Confusion matrix for experiments with three classes

We also compute the overall accuracy as: \(A = \frac {a+e+i}{a+b+c+d+e+f+g+h+i}\). It considers equally important the correct classification of each sentence, independently of the class, and basically measures the capability of the method to predict the correct output. A variation of F1, namely, Macro-F1, is normally reported to evaluate classification effectiveness on skewed datasets. Macro-F1 values are computed by first calculating F1 values for each class in isolation, as exemplified above for negative, and then averaging over all classes. Macro-F1 considers equally important the effectiveness in each class, independently of the relative size of the class. Thus, accuracy and Macro-F1 provide complementary assessments of the classification effectiveness. Macro-F1 is especially important when the class distribution is very skewed, to verify the capability of the method to perform well in the smaller classes.

The described metrics can be easily computed for the 2-class experiments by just removing neutral columns and rows as in Table 5.

Table 5 Confusion matrix for experiments with two classes

In this case, the precision of positive class is computed as: \(P(\mathit{pos}) = a/(a+c)\); its recall as: \(R(\mathit{pos}) = a/(a+b)\); while its F1 is \(F1(\mathit{pos})=\frac{2P(\mathit{pos})\cdot R(\mathit{pos})}{P(\mathit{pos})+R(\mathit{pos})}\).
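
The metrics defined above can be computed from a confusion matrix as in the following sketch. It assumes that rows hold the actual class and columns the predicted class, which is the layout implied by the formulas for P and R; the class order (positive, neutral, negative), the function names, the example counts and the convention of returning 0 when a denominator is zero are assumptions made here, not details taken from the experiments.

```python
from typing import Dict, List

Matrix = List[List[int]]  # cm[x][y]: actual class x, predicted class y


def per_class_metrics(cm: Matrix, k: int) -> Dict[str, float]:
    """Precision, recall and F1 for the class with index k."""
    correct = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum
    actual_k = sum(cm[k])                    # row sum
    p = correct / predicted_k if predicted_k else 0.0
    r = correct / actual_k if actual_k else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return {"precision": p, "recall": r, "f1": f1}


def accuracy(cm: Matrix) -> float:
    """Share of sentences on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total


def macro_f1(cm: Matrix) -> float:
    """Unweighted average of the per-class F1 values."""
    return sum(per_class_metrics(cm, k)["f1"] for k in range(len(cm))) / len(cm)


# Hypothetical 3-class counts, order: positive, neutral, negative.
cm3 = [[50, 10, 5],
       [8, 30, 12],
       [4, 6, 25]]
neg = per_class_metrics(cm3, 2)  # i/(c+f+i) and i/(g+h+i)
print(neg["precision"], neg["recall"], neg["f1"])
print(accuracy(cm3), macro_f1(cm3))

# Hypothetical 2-class counts, order: positive, negative.
cm2 = [[40, 10],
       [5, 45]]
pos = per_class_metrics(cm2, 0)  # a/(a+c) and a/(a+b)
print(pos["precision"], pos["recall"], pos["f1"])
```

Because the functions only rely on row and column sums, the same code covers the 2-class case once the neutral row and column are removed.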

As we have a large number of combinations among the base methods, metrics and datasets, a global analysis of the performance of all these combinations is not an easy task. We propose a simple but informative measure to assess the overall performance ranking. The Mean Ranking is basically the sum of ranks obtained by a method in each dataset divided by the total number of datasets, as below:

$$\mathit{MR} = \frac{\sum_{i=1}^{\mathit{nd}} r_{i}}{\mathit{nd}}, $$

where nd is the number of datasets and \(r_{i}\) is the rank of the method for dataset i. It is important to note that the rank was calculated based on Macro-F1.
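
A minimal sketch of the Mean Ranking computation is given below. It assumes that every method has a Macro-F1 value on every dataset and that ties are broken by the order in which methods are listed; the text does not specify a tie-breaking rule, and the method names and scores are hypothetical.

```python
from typing import Dict, List


def rank_methods(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank methods on one dataset by decreasing Macro-F1 (1 = best)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {method: pos for pos, method in enumerate(ordered, start=1)}


def mean_rank(per_dataset_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Sum of the ranks of each method divided by the number of datasets."""
    nd = len(per_dataset_scores)
    totals: Dict[str, int] = {}
    for scores in per_dataset_scores:
        for method, rank in rank_methods(scores).items():
            totals[method] = totals.get(method, 0) + rank
    return {method: total / nd for method, total in totals.items()}


# Hypothetical Macro-F1 values for three methods on two datasets.
macro_f1_scores = [
    {"A": 0.8, "B": 0.7, "C": 0.6},  # ranks: A=1, B=2, C=3
    {"A": 0.5, "B": 0.9, "C": 0.6},  # ranks: B=1, C=2, A=3
]
print(mean_rank(macro_f1_scores))  # {'A': 2.0, 'B': 1.5, 'C': 2.5}
```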

The last evaluation metric we exploit is Friedman’s Test [71]. It allows one to verify whether, in a specific experiment, the observed values are globally similar. We used this test to tell whether the methods present similar performance across different datasets. More specifically, suppose that k expert raters evaluated n items; the question that arises is: are the ratings provided by the judges consistent with each other or do they follow completely different patterns? The application in our context is very similar: the datasets are the judges and the Macro-F1 achieved by a method is the rating from the judges.

Friedman’s Test is applied to rankings. Therefore, to proceed with this statistical test, we sort the methods in decreasing order of Macro-F1 for each dataset. More formally, Friedman’s rank test in our experiment is defined as:

$$F_{R} = \Biggl(\frac{12}{rc(c+1)} {\sum_{j=1}^{c}R^{2}_{j}}\Biggr) -3r(c+1), $$

where

$$\begin{aligned}& R^{2}_{j} = \mbox{square of the sum of rank positions of method }j\quad (j = 1,2,\ldots, c), \\& r = \mbox{number of datasets}, \\& c = \mbox{number of methods}. \end{aligned}$$

As the number of datasets increases, the test statistic can be approximated by the chi-square distribution with \(c-1\) degrees of freedom [72]. Then, if the computed \(F_{R}\) value is larger than the critical value of the chi-square distribution, the null hypothesis is rejected. This null hypothesis states that the ranks obtained per dataset are globally similar. Accordingly, rejecting the null hypothesis means that there are significant differences in the ranks across datasets. It is important to note that, in general, the critical value is obtained with significance level \(\alpha= 0.05\). In summary, the null hypothesis should be rejected if \(F_{R} > \chi^{2}_{\alpha}\), where \(\chi^{2}_{\alpha}\) is the critical value taken from the chi-square distribution table with \(c-1\) degrees of freedom and α equal to 0.05.
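
The statistic \(F_{R}\) and the decision rule can be sketched as follows. The function name and the rank matrix are hypothetical, and ties are assumed to be absent. The critical value 5.991 is the tabulated chi-square value for 2 degrees of freedom at \(\alpha = 0.05\), which matches the three methods of this toy example; the number of datasets here is far too small for the approximation to be reliable, so the example only illustrates the arithmetic.

```python
from typing import List


def friedman_statistic(ranks: List[List[int]]) -> float:
    """F_R for ranks[i][j] = rank of method j on dataset i."""
    r = len(ranks)     # number of datasets
    c = len(ranks[0])  # number of methods
    # R_j: sum of the rank positions of method j over all datasets.
    rank_sums = [sum(row[j] for row in ranks) for j in range(c)]
    sum_sq = sum(R * R for R in rank_sums)
    return 12.0 / (r * c * (c + 1)) * sum_sq - 3 * r * (c + 1)


# Hypothetical ranks of three methods on four datasets.
ranks = [[1, 2, 3],
         [1, 2, 3],
         [1, 2, 3],
         [1, 2, 3]]
f_r = friedman_statistic(ranks)
critical_value = 5.991  # chi-square table, c - 1 = 2 degrees of freedom, alpha = 0.05
print(f_r, f_r > critical_value)  # 8.0 True
```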

Comparing prediction performance

We start the analysis of our experiments by comparing the results of all previously discussed metrics for all datasets. Table 6 and Table 7 present accuracy, precision, and Macro-F1 for all methods considering four datasets for the 2-class and 3-class experiments, respectively. For simplicity, we choose to discuss results only for these datasets as they come from different sources and help us to illustrate the main findings from our analysis. Results for all the other datasets are presented in Additional file 1. There are many interesting observations we can make from these results, summarized next.

Table 6 2-classes experiments results with 4 datasets

Table 7 3-classes experiments results with 4 datasets

Methods’ prediction performance varies considerably from one dataset to another: First, we note that the same social media text can be interpreted very differently depending on the choice of sentiment method. Overall, all methods showed large variations in performance across the different datasets. By analyzing Table 6 we can note that VADER works well for Tweets_RND_II, appearing in first place, but it performs poorly on Tweets_STF, Comments_BBC, and Comments_DIGG, achieving the eleventh, thirteenth and tenth place, respectively. Although the first two datasets contain tweets, they belong to different contexts, which affects the performance of some methods like VADER. Another important aspect to be analyzed in this table is the coverage. Although SentiStrength has presented good Macro-F1 values, its coverage is usually low as this method tends to classify a high number of instances as neutral. Note that some datasets provided by SentiStrength’s authors, as shown in Table 3, especially the Twitter datasets, have more neutral sentences than positive and negative ones. Another expected result is the good Macro-F1 values obtained by Emoticons, especially in the Twitter datasets. It is important to highlight that, in spite of achieving high accuracy and Macro-F1, the coverage of many methods, such as PANAS, VADER, and SentiStrength, is low (e.g. below 30%) as they only infer the polarity of part of the input sentences. Thus, the choice of a sentiment analysis method is highly dependent on the data and application, suggesting that researchers and practitioners need to take into account this tradeoff between prediction performance and coverage.

The same high variability in the methods’ prediction performance can be noted for the 3-class experiments, as presented in Table 7. Umigon, the best method in five Twitter datasets, fell to eighteenth place in the Comments_NYT dataset. We can also note that the lower Macro-F1 values for some methods, like Emoticons, are due to the high number of sentences without emoticons in the datasets. Methods like Emoticons DS and PANAS tend to classify only a small part of the instances as neutral and also performed poorly in the 3-class experiments. Methods like SenticNet and LIWC were not originally developed for detecting neutral sentences and also achieved low Macro-F1 values. However, they also do not appear among the best methods in the 2-class experiments, which is the task they were originally designed for. This observation about LIWC is not valid for the newest version, as LIWC15 appears among the top five methods for the 2-class and 3-class experiments (see Table 8).

Table 8 Mean rank table for all datasets

Finally, Table 9 presents the Friedman’s test results, showing that there are significant differences in the mean rankings observed for the methods across all datasets. It indicates statistically that, in terms of accuracy and Macro-F1, there is no single method that always achieves a consistent rank position across different datasets, which is similar to the well-known ‘no free lunch theorem’ [16]. So, overall, before using a sentiment analysis method on a novel dataset, it is crucial to test different methods on a sample of the data rather than simply choosing one that is accepted by the research community.

Table 9 Friedman’s test results

This last result suggests that, even with the insights provided by this work about which methods perform better in each context, a preliminary investigation needs to be performed when sentiment analysis is applied to a new dataset in order to guarantee reasonable prediction performance. When prior tests are not feasible, this benchmark presents valuable information for researchers and companies that are planning to develop research and solutions on sentiment analysis.

Existing methods leave room for improvement: We can note that the performance of the evaluated methods is acceptable, but there is considerable room for improvement. For example, if we look at the Macro-F1 values only for the best method on each dataset (see Table 6 and Table 7), we can note that the overall prediction performance of the methods is still low, i.e. Macro-F1 values are around 0.9 only for methods with low coverage in the 2-class experiments and only 0.6 for the 3-class experiments. Considering that we are looking at the performance of the best methods out of 24 unsupervised tools, these numbers suggest that current sentence-level sentiment analysis methods still leave a lot of room for improvement. Additionally, we noted that the best method varies considerably from one dataset to another. This might indicate that each method complements the others in different ways.

Most methods are better at classifying positive than negative or neutral sentences: Figure 2 presents the average F1 score for the 3-class experiments. It is easy to notice that twelve out of twenty-four methods are more accurate when classifying positive than negative or neutral messages, suggesting that some methods may be biased towards positivity. Neutral messages proved to be even harder to detect for most methods.

Figure 2 Average F1 score for each class. This figure presents the average F1 of the positive and negative classes; as we can see, methods tend to achieve better prediction performance on positive messages.

Interestingly, recent efforts show that human language has a universal positivity bias ([73] and [74]). Naturally, part of this bias is observed in sentiment prediction, an intrinsic property of some methods due to the way they are designed. For instance, [32] developed a lexicon in which positive and negative values are associated with words, hashtags, and any other sort of token according to the frequency with which these tokens appear in tweets containing positive and negative emoticons. This method proved to be biased towards positivity due to the larger amount of positivity in the data used to build the lexicon. The overall poor performance of this specific method is credited to its lack of treatment of neutral messages and its focus on Twitter messages.

Some methods are consistently among the best ones: Table 8 presents the mean rank value, detailed before, for the 2-class and 3-class experiments. The elements are sorted by the overall mean rank each method achieved based on Macro-F1 for all datasets. The top nine methods based on Macro-F1 for the 2-class experiments are: SentiStrength, Sentiment140, Semantria, OpinionLexicon, LIWC15, SO-CAL, AFINN, VADER and Umigon. With the exception of SentiStrength, replaced by Pattern.en, the other eight methods produce the best results across several datasets for both 2-class and 3-class tasks. These methods would be preferable in situations in which no preliminary evaluation is possible. The mean rank for the 2-class experiments is accompanied by the coverage metric, which is very important to avoid misinterpretation of the results. Observe that SentiStrength and Sentiment140 exhibited the best mean ranks for these experiments; however, both present very low coverage, around 30% and 40%, a very poor result compared with Semantria and OpinionLexicon, which achieved worse mean ranks (4.61 and 6.62, respectively) but considerably better coverage, above 60%. Note also that SentiStrength and Sentiment140 present poor results in the 3-class experiments, which can be explained by their bias towards the neutral class, as mentioned before.

Another interesting finding is that VADER, the best method in the 3-class experiments, did not achieve the first position on any of the datasets. It reaches second place five times, third place twice, seventh place three times, and fourth, fifth and sixth place once each. It is a special case of consistency across all datasets. Tables 10 and 11 present the best method for each dataset in the 2-class and 3-class experiments, respectively.

Table 10 Best method for each dataset - 2-class experiments

Table 11 Best method for each dataset - 3-class experiments

Methods are often better on the datasets on which they were originally evaluated: We also note that those methods often perform better on the datasets on which they were originally validated, which is somewhat expected due to fine-tuning procedures. We could make this comparison only for SentiStrength and VADER, whose authors kindly allowed full reproducibility of their work by sharing both methods and datasets. To understand this difference, we calculated the mean rank for these methods without their ‘original’ datasets and put the results in parentheses. Note that in some cases the rank order changes slightly, towards a lower value, but this does not imply major changes. Overall, these observations suggest that initiatives like SemEval are key for the development of the area, as they allow methods to compete in a contest on a specific dataset. More importantly, they highlight that a standard sentiment analysis benchmark is needed and that it needs to be constantly updated. We also emphasize that it is possible that other methods, such as paid software, make use of some of the datasets used in this benchmark to improve their performance, as most of the gold standard data used in this work is available on the Web or upon request to the authors.

Some methods proved to be better for specific contexts: In order to better understand the prediction performance of the methods on different types of data, we divided all datasets into three specific contexts (Social Networks, Comments, and Reviews) and calculated the mean rank of the methods for each of them. Table 12 presents the contexts and the respective datasets.

Table 12 Contexts’ groups

Tables 13, 14 and 15 present the mean rank for each context separately. In the context of Social Networks the best method for the 3-class experiments was Umigon, followed by LIWC15 and VADER. In the 2-class case the winner was SentiStrength, with a coverage of around 30%, and the third and sixth places went to Emoticons and PANAS-t, with about 18% and 6% coverage, respectively. This highlights the importance of analyzing the 2-class results together with the coverage. Overall, when there is an emoticon in the text or a word from the PANAS psychometric scale, these methods are able to tell the polarity of the sentence, but they cannot identify the polarity of the large majority of the input texts. Recent efforts suggest these properties are useful for combining methods [20]. Sentiment140, LIWC15, Semantria, OpinionLexicon and Umigon proved to be the best alternatives for detecting only positive and negative polarities in social network data due to their high coverage and prediction performance. It is important to highlight that LIWC 2007, a very popular method in this community, appears in the 16th and 21st positions in the 3-class and 2-class mean rank results for the social network datasets. On the other hand, the newest version of LIWC (2015) showed a considerable improvement, obtaining the second and fourth places on the same datasets.

Table 13 Mean rank table for datasets of social networks

Table 14 Mean rank table for datasets of comments

Table 15 Mean rank table for datasets of reviews

Similar analyses can be performed for the Comments and Reviews contexts. SentiStrength, VADER, Semantria, AFINN, and OpinionLexicon proved to be the best alternatives for the 2-class and 3-class experiments on datasets of comments, whereas Sentiment140, SenticNet, Semantria and SO-CAL proved to be the best for the 2-class experiments on the datasets containing short reviews. Note that for the latter, the 3-class experiments have no results, since the datasets containing reviews have neither neutral sentences nor a representative number of sentences without subjectivity.

We also calculated the Friedman’s value for each of these specific contexts. Even after grouping the datasets, we still observe that there are significant differences in the observed ranks across the datasets. Although the values obtained for each context were considerably smaller than the global Friedman’s value, they are still above the critical value. Table 16 presents the results of Friedman’s test for the individual contexts in both the 2-class and 3-class experiments. Recall that for the 3-class experiments, datasets with no neutral sentences or with an unrepresentative number of neutral sentences were not considered. For this reason, no Friedman’s results are reported for the 3-class experiments in the Reviews context.