Experimental design

In this study, we train models using various algorithms (described in the “Experimental design and evaluation” section) along with different feature representations (presented in the “Discussion” section) and compare their performance. In addition, we evaluate the models against two baselines. The first is a keyword-based classifier (KBC). We choose the KBC because it is a quick and easy approach for software developers to implement and, therefore, feasible in practice. We use the list of keywords developed by Salminen et al. [72]; this list contains 200 manually curated hateful phrases and is available online. The KBC checks whether a comment contains hateful phrases defined in the dictionary and classifies the comment as hateful or non-hateful accordingly. We then compare the system's predictions with the ground truth to calculate the performance. The second baseline is the Bag of Communities (BOC) model by Chandrasekharan et al. [64], which is publicly available as a downloadable Pickle file and, equally importantly, was trained on data from several platforms, achieving solid performance when applied to an independent social media platform whose comments the model had not previously seen (accuracy = 75.27%, precision = 77.49%, recall = 71.24%).
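A minimal sketch of such a keyword-based classifier; the phrase list below is an illustrative placeholder, not the curated 200-phrase list of Salminen et al. [72]:

```python
# HATEFUL_PHRASES is an illustrative placeholder, not the curated
# 200-phrase list of Salminen et al. [72].
HATEFUL_PHRASES = ["you idiot", "go die", "stupid moron"]

def kbc_predict(comment: str) -> int:
    """Return 1 (hateful) if any dictionary phrase occurs in the comment."""
    text = comment.lower()
    return int(any(phrase in text for phrase in HATEFUL_PHRASES))

predictions = [kbc_predict(c) for c in
               ["You idiot, nobody asked you", "Nice weather today"]]  # → [1, 0]
```

Substring matching keeps the approach cheap, which is exactly why it is feasible in practice, but it also explains the baseline's tendency to flag benign comments that merely contain a listed phrase.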

In addition to studying the importance of different feature representations along with different algorithms, we also evaluate and present the results of our two best trained classifiers for each targeted social media platform: Wikipedia, Twitter, Reddit, and YouTube. For comparison, we group the instances in the test set according to their platform (the share of instances in each group, along with its class distribution, is shown in Table 6) and present the results accordingly. For completeness, we also include previously published results for each platform. However, due to differences in training/test distribution between the source papers and our work, the results are not entirely comparable.

Evaluation metrics

The classifier performance is measured using the test set (~ 25% of the total dataset) with two metrics: (a) the F1 score and (b) the area under the receiver operating characteristic curve (ROC-AUC). The F1 score is the harmonic mean of precision and recall at a decision threshold of 0.50. The ROC curve plots the true positive rate against the false positive rate at all potential decision thresholds, so the area under it is an appropriate metric for overall model performance. Equation 1 shows the formula for calculating the F1 score.

$$F_{1} = 2 \times \frac{p \times r}{p + r}$$ (1)

where p = precision (i.e., positive predictive value) and r = recall (i.e., true positive rate). In this research, we report only the F1 measure and ROC-AUC. In addition, the models are compared for statistically significant performance differences using McNemar’s test with significance level α = 0.01.
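The F1 computation and McNemar's test can be sketched in pure Python as follows; the continuity-corrected chi-square form of the test is one common variant, and the paper does not specify which variant was used:

```python
import math

def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (Eq. 1)."""
    return 2 * precision * recall / (precision + recall)

def mcnemar_test(y_true, pred_a, pred_b):
    """McNemar's test comparing two classifiers on the same test set.
    b = cases only model A classifies correctly; c = cases only model B does."""
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    if b + c == 0:
        return 0.0, 1.0  # the models never disagree
    stat = (abs(b - c) - 1) ** 2 / (b + c)  # continuity-corrected chi-square
    # Survival function of chi-square with 1 df: P(X > stat) = erfc(sqrt(stat/2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

print(round(f1_score(0.8, 0.5), 3))  # → 0.615
```

The test conditions only on the disagreement cells of the paired contingency table, which is why it suits comparing two classifiers evaluated on the same test instances.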

Experimental results

To evaluate different algorithms and features (RQ1), Table 9 (evaluation measure: F1) and Table 10 (evaluation measure: ROC-AUC) present the results obtained using different lexical feature representations and classification algorithms. Regarding the model families, XGBoost outperforms the other models on all feature sets except the BOW features, where the FFNN performs slightly better; the FFNN is a close second on all other feature spaces. The other three model types (LR, NB, and SVM) perform worse on all feature subsets. Across the different feature sets, NB and LR rank in varying orders, while the SVM always comes last. As expected, the KBC baseline performed the worst.

Table 9 F1 scores (the highest scores italicized)

Table 10 ROC-AUC scores (the highest scores italicized)

Comparing the feature representations, we observe a linear trend (see Fig. 2) in classifier performance when moving from simpler features to more advanced ones, with BERT giving the best results among individual feature sets. While the TF-IDF and BOW features perform much worse, their performance is still considerably higher than a random guess. The fact that the TF-IDF models are only marginally better than the BOW models indicates that term frequency is not critical for the predictions. Likely, the most substantial information gain comes from the presence of certain words like “fuck”, which can be detected by BOW features as well as by TF-IDF features.

Fig. 2 There is a linear trend (exemplified by the dotted line), with almost all classifiers performing better with more advanced features. In the case of SVM and NB, BERT features outperformed all other feature sets

The results indicate that XGBoost outperforms the other algorithms most of the time, and XGBoost with all features is the highest-performing model. This linearly combined feature set significantly outperforms XGBoost using only the BERT features. In contrast, the results using the FFNN show no significant difference between using BERT features only and using all features.

Baseline comparison shows that the KBC has drastically lower performance than any of the developed models, with an accuracy of 41.4% and an F1 score of 0.388. The poor performance results from a high number of false positives (Type I error), amounting to 112,581 comments (57% of the test set; recall = 0.25). In other words, the KBC considers many non-hateful comments as hateful, conforming with its known limitations [16]. Conversely, the problem of false negatives (Type II error) is much smaller for the KBC, as its precision is 0.919. This stems from the unbalanced dataset. In comparison, the false positive rate for the XGBoost model with all features is 2.0%, and the false negative rate is only 1.0%.
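The false positive and false negative rates quoted here can be derived from a confusion matrix as follows (a sketch with toy labels, where 1 = hateful):

```python
def error_rates(y_true, y_pred):
    """False positive rate (Type I) and false negative rate (Type II),
    with 1 = hateful as the positive class."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    negatives = sum(1 for t in y_true if t == 0)
    positives = sum(1 for t in y_true if t == 1)
    return fp / negatives, fn / positives

# Toy labels for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
fpr, fnr = error_rates(y_true, y_pred)  # fpr = 0.2, fnr ≈ 0.333
```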

Surprisingly, the BOC model performs even worse than the KBC, obtaining an F1 score of 0.084, precision of 0.085, and recall of 0.083. Its accuracy is better than the KBC's (63.4% vs. 41.4%), but the results clearly indicate that the BOC model does not generalize well to these datasets. Unfortunately, we cannot test whether our model generalizes to the original data of the BOC model [64], because the researchers share only the Pickle file of the model, not their data. This also means that retraining with a sample of our data to improve their model is not possible. In terms of platform-specific performance, the BOC provides better-than-chance (> 50%) accuracy for Wikipedia (70.4%) and Reddit (71.7%), but worse-than-chance accuracy for Twitter (19.6%) and YouTube (25.5%). The raw accuracy of the XGBoost model with all features is 97.0%, which implies a 94% improvement over a random model (κ = (0.97 − 0.50)/(1 − 0.50) = 0.94). Conversely, the improvement over a random model is negative for the KBC (− 17.2%) and modestly positive (+ 26.8%) for the BOC model.
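The improvement-over-chance correction used above can be expressed directly; the values reproduce the figures reported in this section:

```python
def improvement_over_chance(accuracy: float, chance: float = 0.5) -> float:
    """Kappa-style correction: (accuracy − chance) / (1 − chance)."""
    return (accuracy - chance) / (1 - chance)

kappa_xgb = improvement_over_chance(0.97)   # ≈ 0.94   (XGBoost, all features)
kappa_kbc = improvement_over_chance(0.414)  # ≈ −0.172 (KBC)
kappa_boc = improvement_over_chance(0.634)  # ≈ 0.268  (BOC)
```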

Platform-specific analysis

For the platform-specific analysis (see Table 11), we use the XGBoost (All and BERT) models to predict the hatefulness of comments from each social media platform separately to assess the model’s generalizability. The F1 score of XGBoost (All) significantly outperforms XGBoost (BERT) for Wikipedia and Twitter; however, the same cannot be said for the other two platforms. The features we use reflect the language used on the social media platforms, so the difference in performance implies that the use of hateful language differs somewhat by platform (see the “Linguistic variable analysis” section for more). Overall, we consider the generalizability to be fair, as we achieve solid F1 and ROC-AUC scores (> 0.70) for each platform using XGBoost with BERT and all features (see Table 11).

Table 11 Generalizability of our best models (XGBoost with All and BERT features) across social media platforms

Interestingly, the best model (XGBoost with all features) performs particularly well for YouTube (F1 = 0.910) and Twitter (F1 = 0.980), which implies that hateful language on these platforms is easier for the model to decipher. In contrast, the model performs worse on Reddit (F1 = 0.776) and Wikipedia (F1 = 0.861). On these two platforms, users may be more likely to engage in syntactically and semantically complex discussions, which makes it more difficult for the model (and perhaps for humans, too) to discern the hateful intent in their comments. Regarding the errors of the best model, many of the falsely classified comments can be seen as difficult for a human to classify as well. For example, the comment “As usual, Jews and turks try to make famous Lebanese arab christian belong to them, he is Mexican Lebanese Arab christian and thats all!!” (from Wikipedia) is labelled as ‘not hateful’ in the ground truth, but the model classifies it as hateful, probably because it detects a racist sentiment in the comment. Among the false negatives, there are many similar examples, such as “You WERE NOT REVERTING ALL THOSE TIMES EM. You deleted other contributions, every time something was added you didn’t like. Get lost.” (from Wikipedia). This comment is clearly annoyed, or even angry, but it is not clear whether it crosses the line into ‘hateful’.

The highest risk of false positives with our model is on the Reddit platform (precision = 0.813), and the lowest when the model is applied to Twitter (precision = 0.984). The highest risk of missing hateful comments (false negatives) is again on Reddit (recall = 0.779), and the lowest on Twitter (recall = 0.978). Overall, the model is slightly more likely to classify non-hateful comments as hateful than to classify hateful comments as non-hateful (+ 3.9% relative difference).

We also analyze the possibility of overfitting by plotting the log loss of the XGBoost model on training and test sets. The model converges after about 75 trees (Fig. 3), after which the test error remains constant, indicating that there is little risk of overfitting.
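The log loss tracked in this analysis is the binary cross-entropy; a pure-Python sketch follows (the actual curves in Fig. 3 come from XGBoost's per-iteration evaluation log):

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy averaged over the sample."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)

# A model that grows more confident in correct labels lowers its loss;
# a widening gap between train and test loss would signal overfitting.
assert log_loss([1, 0], [0.9, 0.1]) < log_loss([1, 0], [0.6, 0.4])
```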

Fig. 3 Log loss of XGBoost (All features) model for training and test sets

Linguistic variable analysis

To investigate how well the best model (XGBoost with all features) learns the linguistic characteristics of hateful and non-hateful language (RQ3), we extracted scores for all linguistic variables available in the LIWC (Linguistic Inquiry and Word Count) [97] software. The LIWC taxonomy contains 93 categories that reflect language use at various levels, ranging from simple variables (word count, use of negations) to more complex ones (anxiety, tone). To investigate the LIWC properties of the predicted comments, we applied the following procedure:

1. Extract LIWC variable scores for all hateful and non-hateful comments in the train and test sets.
2. Create four comment sets:
   - hateful ground (ground truth value == hateful),
   - hateful predicted (predicted value == hateful),
   - non-hateful ground (ground truth value == non-hateful),
   - non-hateful predicted (predicted value == non-hateful).
3. Calculate the average score for each LIWC variable in each set.
4. Calculate the relative difference D between the average scores of ground-truth and predicted comments, i.e., (hateful predicted − hateful ground)/hateful ground and (non-hateful predicted − non-hateful ground)/non-hateful ground, for each LIWC variable.
5. Sort D by highest value and examine (a) which linguistic features are replicated well by the predictive model (i.e., their relative difference is small) and (b) which features are not well captured (i.e., their relative difference to ground truth is high).
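Steps 3–5 of the procedure above can be sketched as follows; the average scores are illustrative placeholders chosen to echo the magnitudes discussed in this section:

```python
def relative_differences(ground_scores, predicted_scores):
    """Relative difference D per LIWC variable:
    D = (mean_predicted − mean_ground) / mean_ground."""
    diffs = {
        var: (predicted_scores[var] - ground_mean) / ground_mean
        for var, ground_mean in ground_scores.items()
    }
    # Sort by absolute difference, worst-replicated variables first.
    return dict(sorted(diffs.items(), key=lambda kv: abs(kv[1]), reverse=True))

# Illustrative average scores, not real LIWC output.
hateful_ground = {"Anger": 2.50, "WC": 41.0, "Parenth": 0.51}
hateful_predicted = {"Anger": 2.61, "WC": 39.0, "Parenth": 0.44}
D = relative_differences(hateful_ground, hateful_predicted)
# D ranks Parenth (≈ −13.7%) as the least well-replicated variable.
```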

Results (see Fig. 4) indicate that the model’s predictions replicate the linguistic characteristics of both the hateful and non-hateful comments reasonably well (i.e., the difference scores are centered around zero). The average difference across all LIWC categories is M = 0.011 (SD = 0.240) for the hateful paired comments and M = 0.002 (SD = 0.020) for the non-hateful paired comments. Thus, hateful language is replicated more poorly than non-hateful language, with a considerably higher standard deviation across LIWC categories.

Fig. 4 Differences of aggregated mean scores of predicted labels’ LIWC scores and ground-truth labels’ LIWC scores

When examining the difference between predicted hateful comments and ground-truth hateful comments, seven of the 93 LIWC categories are classified as outliers (see Table 12). The predictions show less frequent use of (a) parentheses (− 13.7%, i.e., fewer parentheses in predicted hateful than in ground-truth hateful comments), (b) quotation marks (− 8.6%), (c) dashes (− 7.8%), and (d) question marks (− 5.6%). Moreover, the word count (WC) score was 4.9% lower for predicted hateful comments relative to ground-truth hateful comments, which may indicate that the model learns Twitter’s short-messaging format well (see the “Experimental results” section). In contrast, the predicted comments had a higher use of words from the Friends category (+ 6.9% relative to ground-truth hateful comments); this category contains, for example, references to ‘pal’, ‘buddy’, and ‘coworker’. Similarly, the relatively higher scores for the Body (+ 5.3%; e.g., ‘ache’, ‘heart’, ‘cough’), Swear words (+ 5.2%), Sexuality (+ 5.1%; e.g., ‘horny’, ‘love’, ‘incest’), and Anger (+ 4.4%; e.g., ‘hate’, ‘kill’, ‘pissed’) categories imply that the model over-emphasizes these categories when predicting hatefulness. Biological processes (e.g., ‘eat’, ‘blood’, ‘pain’) and “netspeak”, consisting of shorthand interpersonal communication (e.g., “lol”, “4ever”) [98], are also over-emphasized (+ 4.7% and + 3.6%, respectively). For reasons unclear to us, semi-colons (SemiC) appear more often (+ 4.0%) in the predicted hateful comments than in the ground-truth hateful comments.

Table 12 Relative differences of linguistic variables between comments predicted as hateful by XGBoost + All and those labeled as hateful in the ground truth

Feature importance analysis

Addressing RQ2 (the impact of features on the predictions), we carry out a feature importance analysis using Shapley values. Shapley values originate from game theory, where they are used to distribute a reward among the players in cooperative games [99]. When applying this concept to machine learning models, the game is the model’s prediction, and the players are the different features. Important features, i.e., those that have a large influence on the model output, have large absolute Shapley values.
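For a toy two-feature “game”, exact Shapley values can be computed by averaging each feature's marginal contribution over all orderings; the subset rewards below are hypothetical numbers, not results from our models:

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: each player's marginal contribution,
    averaged over every ordering in which the coalition can form."""
    orders = list(permutations(players))
    shap = {p: 0.0 for p in players}
    for order in orders:
        coalition = []
        for p in order:
            before = value(frozenset(coalition))
            coalition.append(p)
            shap[p] += value(frozenset(coalition)) - before
    return {p: s / len(orders) for p, s in shap.items()}

# Toy "game": the reward is model quality reached by each feature subset
# (hypothetical numbers for illustration).
reward = {
    frozenset(): 0.50,
    frozenset({"bert"}): 0.95,
    frozenset({"bow"}): 0.80,
    frozenset({"bert", "bow"}): 0.97,
}
phi = shapley_values(["bert", "bow"], reward.get)
# phi["bert"] ≈ 0.31 and phi["bow"] ≈ 0.16; they sum to the total gain 0.47.
```

In practice, SHAP-style tools approximate these values per prediction rather than enumerating all orderings, which is infeasible for hundreds of features.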

Figure 5 shows the 30 most important features from test-set predictions and their contribution to the predicted class. Dots with negative values on the x-axis are predictions where the specific feature had a negative contribution (pushing the prediction towards non-hateful), and vice versa. The most important takeaway is that, out of the 30 most important features, 29 come from the BERT model; only one other feature type (Word2Vec, on the second-to-last row) is present. This outcome illustrates the importance of the BERT model for the classifier. Even though we only trained the last three layers of the model, it still produces better features than all the other approaches in this analysis. Unfortunately, the BERT features are not human-interpretable, so it cannot be determined why high values of, e.g., the bert_322 feature are so strongly correlated with non-hateful comments.

Fig. 5 Feature importance of the XGBoost model. The vertical axis represents the value of the feature, ranging from low to high. The horizontal axis represents the feature’s impact on the model output. For example, a high value of “bert_322” (top-ranking feature, with the high value represented by red color) has a high negative impact across model predictions, with most SHAP values ranging between − 0.50 and − 1.00. The feature analysis shows the usefulness of BERT for online hate detection

For additional interpretability, we analyze the LR results using TF-IDF features, as this model performed best (F1 = 0.768, see Table 9) among the models that provide easily interpretable features (i.e., coefficients for individual words). Table 13 shows the most impactful terms for hate prediction using LR. The coefficients indicate the importance of a given feature for the model’s predictions; a high positive coefficient implies that the feature is a strong predictor of a hateful prediction.

Table 13 Most impactful words for hate prediction using LR and TF-IDF

Figure 6 shows the overlaps among the top-15 hateful words of each platform (according to the LR classifier). On average, the pairwise intersections contain 1.36 overlapping top hateful words. Top hateful words unique to Twitter mostly reflect racism and sexism (e.g., ‘hoes’, ‘hoe’, ‘nigga’). Top hateful words unique to YouTube emphasize the news context and associated topics (‘media’, ‘world’, ‘country’). Interestingly, for Reddit, the unique top hateful terms show the least signs of aggression when interpreted in isolation (‘god’, ‘reading’, ‘people’, ‘seriously’). Hateful words on Wikipedia largely coincide with those on the other platforms, as Wikipedia has only one unique word (‘die’) emerging from the analysis.