Identifying political extremist groups in the U.S. and their followers

Determining which political ideologies are extreme and which are not is a challenging, context-dependent, and often controversial task; any definition leaves considerable room for interpretation. In this paper, we limit our scope to white supremacy and neo-Nazi ideologies as right-wing extremist (RWE) ideologies and Antifa as the left-wing extremist (LWE) ideology in the current U.S. political arena. We identified 25 white supremacy and neo-Nazi groups with active Twitter accounts by consulting the Southern Poverty Law Center website (SPLC 2018) and list their names and Twitter handles in Table A1 in the Appendix. For Antifa groups, we relied on manual search on Twitter to identify popular official and local chapters of the movement and compiled a list of 16 verified Antifa accounts. The verification was performed by cross-checking our list against the accounts listed at blocktogether.org, a crowdsourcing web application intended to share a list of fake Antifa accounts.

To obtain more validated Antifa accounts, so that we have an equal number of left- and right-wing extremist seed accounts, we collected the 4527 friends (i.e. accounts being followed) and 5,639,256 friends of friends of our initial 16 Antifa accounts and built their friendship network. We then performed k-core decomposition on the friendship network to obtain its main core. The k-core of a graph is formally defined as the maximal subgraph whose nodes all have degree at least k. The main core is the non-empty k-core with the maximum value of k and can be used to identify the most influential nodes of a given network (e.g. Kitsak et al. [55]).
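The main-core extraction described above can be sketched in a few lines of pure Python; this is an illustrative peeling implementation (repeatedly strip nodes of degree below k, increasing k until the graph empties), not the tooling used in the study.

```python
# Minimal sketch of k-core decomposition by iterative peeling.
# adj: dict node -> set of neighbours (undirected graph).
def main_core(adj):
    """Return (k, nodes) of the main core: the non-empty k-core with maximum k."""
    nodes = set(adj)
    k, last = 0, set(nodes)
    while nodes:
        last, k = set(nodes), k + 1
        changed = True
        while changed:  # peel nodes whose degree within the surviving set is < k
            changed = False
            for n in list(nodes):
                if len(adj[n] & nodes) < k:
                    nodes.discard(n)
                    changed = True
    return k - 1, last

# Toy example: a 4-clique (main core, k = 3) with two pendant nodes attached
adj = {0: {1, 2, 3, 4}, 1: {0, 2, 3, 5}, 2: {0, 1, 3}, 3: {0, 1, 2},
       4: {0}, 5: {1}}
k, core = main_core(adj)
print(k, sorted(core))  # 3 [0, 1, 2, 3]
```

The pendant nodes survive the k=1 peel but fall out at k=2, leaving the clique as the main core.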

Previous research has also used k-core decomposition to characterize the efficiency of the spread of information (Conover et al. [21]) or disinformation (Shao et al. [82]). From the obtained main core, we take the 9 users with the highest degrees and manually check their Twitter pages to make sure they are associated with Antifa. The manual check looks for either a mention of “Antifa” in the name, Twitter handle, or bio description of the account, or high-volume sharing of content posted by other known Antifa pages (more than half of the 50 most recent tweets), and requires more than 5000 followers. Finally, we add these influential users to our initial list of 16 LWE accounts to form our final list of 25 Antifa accounts and report their names and Twitter handles in Table A2 in the Appendix. We call these two lists of LWE and RWE individuals and organizations our “seed accounts”. We collect the followers of these seed accounts, and among the users who pass the preprocessing step (see Sect. 3.3), we consider a follower a supporter or sympathizer of an extreme political ideology if s/he follows at least three of the corresponding seed accounts.
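The at-least-three-seed-accounts rule can be expressed as a small set-intersection filter; the handles and user names below are hypothetical, purely for illustration.

```python
# Hedged sketch of the supporter/sympathizer rule: a user qualifies if they
# follow at least three of an ideology's seed accounts.
# `follows` maps user -> set of followed handles (illustrative names).
def sympathizers(follows, seed_accounts, min_seeds=3):
    return {u for u, fset in follows.items()
            if len(fset & seed_accounts) >= min_seeds}

seeds = {"@seedA", "@seedB", "@seedC", "@seedD"}  # hypothetical seed handles
follows = {
    "alice": {"@seedA", "@seedB", "@seedC"},      # follows 3 seeds -> qualifies
    "bob":   {"@seedA", "@other"},                # follows 1 seed  -> excluded
}
print(sympathizers(follows, seeds))  # {'alice'}
```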

Control groups

To test our hypotheses, we compare the psychological profiles of political extremists with qualified followers of the top five most liberal and top five most conservative U.S. Senators according to their 2018 DW-NOMINATE scores (Poole and Rosenthal [74]). Table 1 lists the Senator names in each political category. We perform all preprocessing steps mentioned in Sect. 3.3 on the followers of these Senators. In addition, we exclude mutual followers between each pair of political ideologies. Finally, we only consider users who follow at least three Senators from the corresponding category.

Table 1 List of the five most liberal and conservative U.S. senators according to their 2018 DW-NOMINATE scores

Preprocessing of followers

We crawled all followers of the seed accounts (Table 2). In the case of the liberal or left-wing (LW) and conservative or right-wing (RW) followers, after collecting the total number of unique followers (Table 2), we uniformly sample 10,000 followers at random from each of the seed accounts (i.e. 50,000 LW and 50,000 RW in total) and perform the rest of the preprocessing and analysis on these samples. We only keep users whose language is English, who are from the U.S., and who are not “verified”. We impose the “not verified” constraint to exclude potential journalists, news anchors, and celebrities. We further exclude users who mention “journalist” or “\(\mbox{RT} \neq \mbox{Endorsement}\)” in their bio. We also exclude mutual followers between political groups (i.e. those who follow at least one seed account from each of at least two political groups). After these preprocessing steps, we use Botometer (Davis et al. [26], Varol et al. [98]) to identify bots. Using more than a thousand features covering friends, tweet content, tweet sentiment, network properties, and temporal patterns, Botometer provides two scores between zero and one, one for English-speaking users and one for universal users, where zero indicates the highest classifier confidence that the account is a human and one that it is a bot. Any score in between means the classifier is uncertain about the account and we have to make a decision. We use a threshold of 0.7 on the English-speaker score and exclude all users with scores above it from our data. Next, to make sure the followers of the seed accounts are really affiliated with the political groups, following Barberá [12], we only consider followers who follow at least three of the seed accounts of a political group.
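The bot-removal step amounts to thresholding a per-user score; the sketch below uses made-up scores and assumes accounts with an English-speaker score above 0.7 are dropped (in practice the scores come from the Botometer API).

```python
# Illustrative filter for the bot-removal step. Scores here are invented;
# a real pipeline would obtain them from the Botometer service.
BOT_THRESHOLD = 0.7

def filter_bots(scores, threshold=BOT_THRESHOLD):
    """scores: dict user_id -> English-speaker bot score in [0, 1].
    Keep users at or below the threshold (lower = more human-like)."""
    return {u for u, s in scores.items() if s <= threshold}

scores = {"u1": 0.12, "u2": 0.93, "u3": 0.70}  # hypothetical scores
print(sorted(filter_bots(scores)))  # ['u1', 'u3']
```

Whether an account scoring exactly at the threshold is kept is a design choice; here it is retained.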

Table 2 Data summary

Finally, to control for any potential differences between followers of seed accounts that belong to groups and those that belong to individuals, we removed all followers who only followed individual RWE accounts. In the case of the LWE seed accounts, there is only one individual account. This results in 9400 LW, 12,034 RW, 7665 LWE, and 10,983 RWE affiliated Twitter users (Table 2). We collected up to 3200 tweets per user and analyzed their timestamps. We limit our analysis to tweets posted in the three months prior to the date of our data collection, March 15, 2018. We exclude users who have not posted in this time period, as well as those whose oldest available tweet in our data set is after December 15, 2017, and call the remaining users “qualified users” in Table 2. The latter restriction ensures that the users’ tweets are representative of their temporal changes over the course of three months. Finally, we uniformly take 5000 users at random from the qualified users and estimate their text-based psychological and moral variables (see Sects. 3.5 and 3.6 on how these indicators are estimated from tweets). A summary of the users at each step of the data collection process is given in Table 2.
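The two date restrictions defining a "qualified user" can be sketched as a simple predicate; the function name and input format are assumptions for illustration.

```python
# Sketch of the qualified-user rule: the user must have posted inside the
# three-month window ending at collection, AND their oldest collected tweet
# must predate the window's start (so the window is fully covered).
from datetime import date

COLLECTED = date(2018, 3, 15)      # data collection date
WINDOW_START = date(2017, 12, 15)  # three months prior

def is_qualified(tweet_dates):
    """tweet_dates: dates of a user's collected tweets."""
    if not tweet_dates:
        return False
    has_recent = any(WINDOW_START <= d <= COLLECTED for d in tweet_dates)
    covers_window = min(tweet_dates) <= WINDOW_START
    return has_recent and covers_window

print(is_qualified([date(2017, 11, 1), date(2018, 2, 2)]))  # True
print(is_qualified([date(2018, 1, 5)]))   # False: oldest tweet after Dec 15
```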

Preprocessing of tweets

Before we proceed to estimating the psychological and moral profiles of extremists and non-extremists, we need to perform some preprocessing on our text data. First, we convert all tweet texts to lowercase and remove all URLs, user mentions, and punctuation. We further remove retweets from our corpus, since retweets are not the authors’ original posts and should not be treated as the users’ own emotional expressions. To control for temporal variations, we only consider tweets posted within the prior three months.
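The cleaning steps above can be sketched with the standard library; the exact regexes are illustrative choices, not the authors' code.

```python
# Minimal sketch of the tweet-cleaning pipeline: drop retweets, lowercase,
# strip URLs, user mentions, and punctuation, then normalize whitespace.
import re
import string

def clean_tweet(text):
    if text.startswith("RT @"):            # retweets are discarded entirely
        return None
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    text = re.sub(r"@\w+", "", text)          # remove user mentions
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())             # collapse extra whitespace

print(clean_tweet("Check THIS out @user http://t.co/abc !!"))  # 'check this out'
print(clean_tweet("RT @user: something"))                      # None
```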

Inferring and validating psychological indicators

A rich body of research has shown the relationship between linguistic usage and emotion (Pennebaker et al. [71]). We use the well-validated Linguistic Inquiry and Word Count (LIWC) lexicon to measure the set of psychological variables mentioned in H1–H3. LIWC uses frequency percentages to gauge individuals’ preferences for specific “function” words as well as the “content” words chosen to convey semantic information. We constructed psychological language profiles using the LIWC2015 lexicon (Pennebaker et al. [69]). To measure certainty (H1), we use the certainty word list from the cognitive processes category; examples include “always”, “never”, and “certain”. To quantify anxiety (H2) and happiness (H3), we use the anxiety, positive emotion, and negative emotion word lists from the affective processes category; examples include “love”, “nice”, and “sweet” for positive emotion, “hurt”, “ugly”, and “nasty” for negative emotion, and “worried” and “fearful” for anxiety. For each of these affective and cognitive word lists, we count the number of hits across all tweets in a user’s Twitter “timeline” and express it as a proportion of the user’s total word count. For a review of the use of Twitter data in health and well-being research, see Sinnenberg et al. [85].
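The proportion-of-total-word-count scoring can be sketched as follows; the tiny word lists are illustrative stand-ins, not the actual LIWC2015 lexicon.

```python
# Hedged sketch of LIWC-style scoring: each construct's score is the number
# of dictionary hits divided by the user's total word count.
LEXICON = {  # toy word lists, NOT the real LIWC categories
    "certainty": {"always", "never", "certain"},
    "anxiety":   {"worried", "fearful"},
    "posemo":    {"love", "nice", "sweet"},
    "negemo":    {"hurt", "ugly", "nasty"},
}

def liwc_scores(words):
    total = len(words)
    return {cat: sum(w in vocab for w in words) / total
            for cat, vocab in LEXICON.items()}

words = "i always love this never worried".split()
scores = liwc_scores(words)
print(scores["certainty"])  # 2 hits ('always', 'never') out of 6 words
```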

Although LIWC has been validated and used in many contexts, to the best of our knowledge it has never been applied to text originating from political extremists, so the question remains whether it can properly estimate their text-based psychological profiles. Therefore, before using LIWC, we need to validate its performance on extremists’ tweets. To do so, we uniformly sample 100 tweets at random from the corpus of extremist-generated tweets. The first author then rated the extent to which each sampled tweet communicated each of the four psychological constructs (i.e. anxiety, certainty, negative and positive emotion) on a 7-point Likert-type scale. Next, we run LIWC on the tweets and compute the ratio of hits for each of the four psychological measures. Finally, we compute the Pearson correlation coefficient between the hand-coded and LIWC-generated scores and report the results in Table 3. As can be seen, there are significant and strong positive correlations between the hand-coded and LIWC-generated scores across all four psychological constructs, which indicates that the LIWC dictionary words and terms for anxiety, certainty, positive emotion, and negative emotion are sufficiently robust to detect the corresponding psychological constructs in tweets published by the American left- and right-wing political extremists identified in this study.

Table 3 Correlation statistics between LIWC-generated and hand-coded psychological scores for 100 political extremist-written tweets

Inferring moral foundations

Graham et al. [38] developed the Moral Foundations Dictionary (MFD), which contains word lists associated with each of the five moral foundations introduced in MFT. Examples include “safe”, “peace”, and “endanger” for harm avoidance and care; “fair”, “equal”, and “disproportion” for fairness and reciprocity; “together”, “nation”, and “traitor” for in-group loyalty; “obey”, “law”, “tradition”, and “illegal” for authority and respect; and “piety”, “innocent”, and “trashy” for purity and sanctity. Graham et al. [38] applied the MFD to sermons in text form and obtained results consistent with MFT. Using the MFD to analyze 12 years of news content related to stem cell research, Clifford and Jerit [20] found results consistent with MFT with respect to the harm avoidance and purity foundations. Through content analysis of a small number of randomly selected articles, they further showed that word lists related to the other three foundations rarely appeared in their dataset.

As with the LIWC dictionary, we could not find any previous study validating the application of the MFD in the political extremism context. Therefore, we follow the same procedure described in Sect. 3.5 and report the validity statistics in Table 4. The results show significant and strong correlations between hand-coded and MFD-generated scores across all five moral foundations.

Table 4 Correlation statistics between MFD-generated and hand-coded moral scores for 100 political extremist-written tweets

Confounding covariates

Many variables might contribute to the text-based psychological indicators of Twitter users; without controlling for common causes, our results would be confounded. In a word-count-based language analysis of Twitter users, an analyst should select variables that might affect the distribution of words among individuals. Table 5 lists the set of covariates we measured as potential confounders of the psychological language of the different groups. For example, one could hypothesize that users who publish more tweets are more likely to match LIWC and MFD dictionary entries and thus receive higher scores. To illustrate the covariate imbalance across the groups, the distributions of the covariates listed in Table 5 (except topic) are plotted in Fig. A1 in the Appendix.

Table 5 List of covariates

One important latent confounding variable that could bias our results is the topics of the tweets. Since different political groups might talk disproportionately more or less about certain topics, some words are more or less likely to be used by members of particular political groups. If those frequently used words are also associated with some of the LIWC or MFD categories, our results would be in doubt, because the observed psychological differences between political groups would then be driven in part by those highly topic-related words rather than by the political ideology or extremity of the users. For example, it would be hard to discuss gun control without using terms that appear in the LIWC and MFD dictionaries, such as “control”, “own”, and “power”. Therefore, we should control for these topics before comparing the text-based psychological/moral profiles of different political users. Controlling for topics effectively conditions out the average level of the psychological/moral constructs within those topics.

Latent Dirichlet Allocation (LDA) is a popular method for topic modeling on text data. However, standard LDA does not work well for tweets because they are short and a single tweet usually covers only one topic. Therefore, unlike standard LDA, which yields a distribution over topics for each document, we use Twitter-LDA (Zhao et al. [103]), which assigns each tweet to exactly one topic. Since we are interested in controlling for general topics (e.g. elections, gun control, hate speech), not specific events and stories, we set the number of topics to 20 and the number of iterations to 1000 and otherwise used Twitter-LDA’s default settings (Zhao et al. [103]) to estimate the topics of the tweets. The word distributions of the topics, along with their suggested names, are listed in Table A3 in the Appendix.
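To illustrate the one-topic-per-tweet output format that distinguishes Twitter-LDA from standard LDA, the stand-in below hard-assigns each tweet to the topic whose word list it overlaps most. This is only a mnemonic for the output shape, not the Twitter-LDA inference algorithm, and the topic names and word lists are invented.

```python
# Hedged stand-in for Twitter-LDA's hard assignment: one topic per tweet.
TOPIC_WORDS = {  # toy topics, NOT the 20 topics estimated in the study
    "election": {"vote", "ballot", "senate", "campaign"},
    "sport":    {"game", "team", "score", "season"},
}

def assign_topic(tweet_words):
    """Pick the single topic with the largest word overlap with the tweet."""
    overlaps = {t: len(ws & tweet_words) for t, ws in TOPIC_WORDS.items()}
    return max(overlaps, key=overlaps.get)

print(assign_topic({"vote", "senate", "tonight"}))  # 'election'
```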

Figure 1 shows the distribution of topics across the four political groups. While the frequency differences of some topics are small across the groups (e.g. entertainment, photography, and social media activity), other topics differ substantially in frequency between the four groups (e.g. election, sport, racial, religious, community events, black lives matter (BLM), and environment). These results further emphasize the importance of adjusting for topics in our analysis. However, not all topics are eligible or required to be adjusted for. We should control for a topic only if it:

1. Is semantically meaningful (i.e. it is not a noisy outcome of the LDA);
2. Does not overlap too much with the word categories of interest in LIWC and MFD.

Figure 1 Distribution of Twitter-LDA-generated topics across the political groups. The number of topics is set to 20. While the frequency of occurrence of some topics is similar among the groups (e.g. entertainment and photography), other topics differ substantially in frequency between groups (e.g. sport and racial). We control for a topic if it is semantically meaningful (i.e. not a noisy outcome of the LDA) and does not overlap too much with the word categories of interest in LIWC and MFD. As a result, we remove the Noise 1, 2, & 3, Feelings, and Pleasure topics and control for the remaining 15 topics

As a result, we do not control for the “Noise 1”, “Noise 2”, and “Noise 3” topics simply because they do not represent semantically meaningful topics. Furthermore, we do not adjust for “Pleasure” and “Feelings” because they overlap too much with the positive emotion category of LIWC.

Covariates adjustment

Since users are not randomly assigned to the four groups, our observational study of their social media activities suffers from selection bias. Therefore, we should identify confounding variables and control for them, so that the mean differences we characterize are more likely to reflect the link between political orientation, political extremity, and text-based indicators of psychological and moral constructs. In addition, we have a multi-valued treatment with four levels, each representing a different group of political users; reducing covariate imbalance between them is therefore not trivial, since most existing approaches and tools are designed for binary treatments.

According to Rosenbaum and Rubin [80], if we have relevant information on a set of covariates X, and the potential outcomes are independent of the treatment given X, then we can construct an unbiased estimator using only the propensity score and the observed outcome. The propensity score is the conditional probability of being treated at a point in covariate space, \(P(T = 1 | X)\), where T is the treatment status, with \(T = 1\) meaning treated and \(T = 0\) meaning nontreated. Here, the four group labels (i.e. LW, RW, LWE, and RWE) serve as the treatment indicator.

However, there are two main difficulties with using propensity scores: (1) even a slight misspecification of the propensity score model can yield biased estimates (e.g. Smith and Todd [86], Kang and Schafer [54]); and (2) balancing covariates across more than two groups is not trivial. To tackle these issues, Imai and Ratkovic [46] introduced the Covariate Balancing Propensity Score (CBPS) methodology, which estimates the propensity score for each observation while optimizing covariate balance. It also generalizes well to multi-valued treatments. Once the propensity scores are computed, they can be used for weighting, matching, regression, stratification, or a combination thereof (Imai and Ratkovic [46]). See Imbens [47] and Stuart [89] for extensive reviews of propensity score methods.

In this paper, we use the inverse of the estimated propensity scores as weights to create a balanced sample of treated and control observations; this method is known as Inverse Probability Weighting. An important advantage of weighting over other possible approaches is that we do not lose any subjects. Let \(T _{i,j}\) be an indicator variable denoting whether user i received the jth treatment (i.e. whether s/he belongs to LWE, LW, RW, or RWE), and let \(e _{i,j}\) denote the propensity score associated with user i receiving treatment j. Then, for multi-valued treatments, the weights are obtained from Eq. (1):

$$ w_{i} = \sum_{j=0}^{J-1} \frac{T_{i, j}}{e_{i, j}}. $$ (1)
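Because \(T_{i,j}\) is 1 only for the treatment user i actually received, the sum collapses to one term: the inverse of the propensity for the received treatment. A minimal sketch:

```python
# Sketch of Eq. (1): inverse probability weight for a multi-valued treatment.
def ipw_weight(t_indicators, propensities):
    """t_indicators: 0/1 over the J groups (exactly one 1);
    propensities: estimated e_{i,j} over the same J groups."""
    return sum(t / e for t, e in zip(t_indicators, propensities))

# User assigned to group 2 of 4, with estimated propensity 0.25 for that group
w = ipw_weight([0, 1, 0, 0], [0.4, 0.25, 0.2, 0.15])
print(w)  # 4.0, i.e. 1 / 0.25
```

Users unlikely to be in their observed group (small propensity) receive large weights, rebalancing the sample.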

Figure 2 compares the covariate imbalance, measured as the difference in means between our four treatment groups (LW is coded as group 1, LWE as 2, RW as 3, and RWE as 4), before and after weighting. Each point on the plot is a covariate, and each boxplot shows the median, minimum and maximum, and upper and lower quartiles of the covariates for each contrast. Comparing the covariate imbalance before (upper panel) and after (lower panel) weighting in Fig. 2 shows that applying the weights obtained from the CBPS method significantly reduced the covariate imbalance, measured as the absolute difference of standardized means, across all four treatment groups.
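The balance diagnostic behind Figure 2 can be sketched for one covariate and one pair of groups; the numbers are toy values, and this unweighted version uses population variances for the pooled standard deviation (one of several common conventions).

```python
# Sketch of the absolute standardized mean difference between two groups
# for a single covariate (e.g. tweet count); toy data.
from statistics import mean, pvariance
from math import sqrt

def std_mean_diff(x_a, x_b):
    pooled_sd = sqrt((pvariance(x_a) + pvariance(x_b)) / 2)
    return abs(mean(x_a) - mean(x_b)) / pooled_sd

a = [10, 12, 11, 13]   # covariate values in one group (toy numbers)
b = [14, 16, 15, 17]   # same covariate in a contrasted group
print(std_mean_diff(a, b))
```

After inverse probability weighting, the same statistic is recomputed with weighted means and variances; values near zero indicate balance.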