This section first describes the machine learning methodology used for sentiment classification. Then, it presents the methods used for the correlation analysis and Granger causality. Finally, it describes the event study methodology, by presenting the detection of events, the categorization of events based on Twitter sentiment, and the statistical validation of the cumulative abnormal returns.

When preprocessing tweets, we removed URLs because they normally do not represent relevant content but rather point to it. We also removed cash-tags (e.g., “$NKE”) and user mentions (e.g., “@johndoe”) to make a tweet independent of a specific stock (company) and of the users involved in the discussion, and thus take a first step towards generalizing our model. Finally, we collapsed letter repetitions (e.g., “coooool” becomes “cool”); this step is easy to implement and has proven useful for sentiment classification tasks [46]. After these steps, we followed a typical bag-of-words procedure by applying tokenization (based on relatively simple regular expressions), lemmatization (using LemmaGen [51]), n-gram construction (unigrams and bigrams were included in the feature set), and the TF-IDF weighting scheme [52]. Note that we did not remove stop words, such as “not”, as this would in some cases change the sentiment polarity of a tweet.
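The cleaning steps above can be sketched in a few lines of Python. The exact regular expressions, the LemmaGen lemmatization, and the TF-IDF weighting are not reproduced here, so this is only a minimal illustration of the pipeline:

```python
import re

def clean_tweet(text):
    """Apply the preprocessing steps described above."""
    text = re.sub(r"https?://\S+", "", text)    # URLs point to content rather than carry it
    text = re.sub(r"\$[A-Za-z]+", "", text)     # cash-tags, e.g. "$NKE"
    text = re.sub(r"@\w+", "", text)            # user mentions, e.g. "@johndoe"
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)  # collapse letter repetitions: "coooool" -> "cool"
    return text

def tokens_and_bigrams(text):
    """Simple regex tokenization; stop words such as "not" are deliberately kept."""
    tokens = re.findall(r"[a-z']+", text.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

features = tokens_and_bigrams(clean_tweet("@johndoe coooool earnings $NKE https://t.co/xyz"))
print(features)  # ['cool', 'earnings', 'cool earnings']
```

In the actual pipeline, lemmatization and TF-IDF weighting would be applied on top of these tokens before training.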

In this paper, as is common in the sentiment analysis literature [48], we approximate the sentiment of tweets with an ordinal scale of three values: negative (−), neutral (0), and positive (+). Sentiment classification is thus an ordinal classification task, a special case of multi-class classification where there is a natural ordering between the classes but no meaningful numeric difference between them [49]. Our classifier is based on the Support Vector Machine (SVM), a widely used, state-of-the-art supervised learning algorithm, well suited for large-scale text categorization and robust on large feature spaces. We implemented the wrapper approach described in [50], which constructs two linear-kernel SVM [42] classifiers. Since the classes are ordered, two classifiers suffice to partition the space of tweets into the three sentiment areas. The two SVM classifiers were trained to distinguish between positive and negative-or-neutral tweets, and between negative and positive-or-neutral tweets, respectively. During prediction, if the two classifiers disagree and the target class cannot be determined (which happens rarely), the tweet is labeled as neutral.
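The decision rule of the two-classifier wrapper can be sketched as follows. The SVM training itself is omitted, and the two classifiers are assumed to expose boolean predictions, which is an illustrative simplification:

```python
def ordinal_label(is_positive, is_negative):
    """Combine two binary classifier outputs into an ordinal sentiment label.
    is_positive: prediction of the classifier trained on positive vs. negative-or-neutral.
    is_negative: prediction of the classifier trained on negative vs. positive-or-neutral.
    """
    if is_positive and not is_negative:
        return +1
    if is_negative and not is_positive:
        return -1
    # Both silent (a neutral tweet) or both firing (the rare disagreement case):
    # the tweet is labeled neutral.
    return 0

print(ordinal_label(True, False))   # 1  (positive)
print(ordinal_label(False, True))   # -1 (negative)
print(ordinal_label(True, True))    # 0  (disagreement -> neutral)
```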

Our approach to automatic sentiment classification of tweets is based on supervised machine learning. The procedure consists of the following steps: (i) a sample of tweets is manually annotated with sentiment, (ii) the labeled set is used to train and tune a classifier, (iii) the classifier is evaluated by cross-validation and compared to the inter-annotator agreement, and (iv) the classifier is applied to the whole set of collected tweets. There is a lot of related work on the automatic construction of Twitter sentiment classifiers. The main difficulty is obtaining sentiment labels for a large enough set of tweets which are then used for training. One of the first approaches was by Go et al. [44] who used smileys as a proxy for sentiment labels of 1.6 million tweets. Collecting high quality human-labeled tweets is considerably more expensive. Saif et al. [45] give a survey of eight manually annotated datasets having from 500 to 14,000 labeled tweets, considerably less than our dataset. A more exhaustive review of different approaches to Twitter sentiment analysis, both lexicon-based and machine learning, is in [46]. A methodology which combines the lexicon-based and machine learning approaches is given in [47].

Determining the sentiment polarity of tweets is not an easy task. Financial experts often disagree on whether a given tweet represents a buy or a sell signal, and individual annotators are not always consistent with themselves. We argue that the upper bound that any automated sentiment classification procedure can achieve is determined by the level of agreement between the human experts. To approach the performance of human experts, a large enough set of tweets has to be manually annotated: in our case, over 100,000. To measure the agreement between the experts, a substantial fraction of tweets has to be annotated by two different experts: in our case, over 6,000 tweets were annotated twice.
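The inter-annotator agreement on the doubly annotated tweets can be quantified, for example, with Cohen's κ. The choice of κ here is an assumption for illustration, as this excerpt does not specify the agreement statistic used:

```python
from collections import Counter

def cohens_kappa(ann1, ann2):
    """Chance-corrected agreement between two annotators labeling the same tweets.
    Note: Cohen's kappa is one common choice; the paper excerpt does not fix
    a particular agreement statistic."""
    n = len(ann1)
    p_observed = sum(a == b for a, b in zip(ann1, ann2)) / n
    c1, c2 = Counter(ann1), Counter(ann2)
    p_expected = sum(c1[label] * c2[label] for label in set(c1) | set(c2)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# Toy example with the three sentiment labels; the annotators differ on one tweet of six.
a = ["+", "+", "0", "0", "-", "+"]
b = ["+", "0", "0", "0", "-", "+"]
print(round(cohens_kappa(a, b), 3))  # 0.739
```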

For an initial investigation of the relation between the Twitter sentiment and stock prices, we apply the Pearson correlation and Granger causality tests. We use the Pearson correlation to measure the linear dependence between P_d and R_d. Given two time series, X_t and Y_t, the Pearson correlation coefficient is calculated as:

ρ(X, Y) = (⟨XY⟩ − ⟨X⟩⟨Y⟩) / (σ_X σ_Y),  (2)

where ⟨⋅⟩ is the time average and σ_X = √(⟨X²⟩ − ⟨X⟩²) is the standard deviation. The correlation ρ(X, Y) quantifies the linear contemporaneous dependence.

We also perform the Granger causality test [33] to check whether the Twitter variables help in the prediction of the price returns, following the procedure described in [53].
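The two tests can be sketched as follows. The Granger test below is a minimal one-lag F-test version; a production analysis would typically use a dedicated implementation and select the lag order as in [53]:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation via time averages, as in Eq (2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = (x * y).mean() - x.mean() * y.mean()
    sx = np.sqrt((x ** 2).mean() - x.mean() ** 2)
    sy = np.sqrt((y ** 2).mean() - y.mean() ** 2)
    return cov / (sx * sy)

def granger_f(y, x, lag=1):
    """F-statistic testing whether lagged x improves an autoregression of y."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    yt, ylag, xlag = y[lag:], y[:-lag], x[:-lag]
    ones = np.ones_like(yt)
    rss = lambda X: np.sum((yt - X @ np.linalg.lstsq(X, yt, rcond=None)[0]) ** 2)
    rss_r = rss(np.column_stack([ones, ylag]))        # restricted model: y_t ~ y_{t-1}
    rss_u = rss(np.column_stack([ones, ylag, xlag]))  # unrestricted model: + x_{t-1}
    n, q, k = len(yt), 1, 3
    return ((rss_r - rss_u) / q) / (rss_u / (n - k))

# Synthetic check: y is driven by lagged x, so x should "Granger-cause" y.
rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = np.concatenate([[0.0], 0.8 * x[:-1]]) + 0.1 * rng.normal(size=300)
print(granger_f(y, x) > granger_f(x, y))  # True
```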

Event study

The method used in this paper is based on an event study, as defined in financial econometrics [41]. This type of study analyzes the abnormal price returns observed around external events. It requires that a set of events for each stock is first identified (using prior knowledge or automatic detection), and that the events are then grouped according to some measure of “polarity” (whether the event should have a positive, negative, or no effect on the valuation of the stock). The price returns for the events of each group are then analyzed. In order to focus only on isolated events affecting a particular stock, the method removes the fluctuations (influences) of the market to which the stock belongs. This is achieved by using the market model, i.e., the price returns of a selected index.

Event window. The initial task of conducting an event study is to define the events of interest and identify the period over which the stock prices of the companies involved in this event will be examined: the event window, as shown in Fig 1. For example, if one is looking at the information content of an earnings announcement on day d, the event will be the earnings announcement and the event window (T_1, T_2] might be (d−1, d+1]. The reason for considering one day before and after the event is that the market may acquire information about the earnings prior to the actual announcement, and one can investigate this possibility by examining pre-event returns.

Normal and abnormal returns. To appraise the event’s impact, one needs a measure of the abnormal return. The abnormal return is the actual ex-post return of the stock over the event window minus the normal return of the stock over the same window, where the normal return is defined as the return that would be expected if the event did not take place. For each company i and event date d, we have:

AR_i,d = R_i,d − E[R_i,d],  (3)

where AR_i,d, R_i,d, and E[R_i,d] are the abnormal, actual, and expected normal returns, respectively. There are two common choices for modeling the expected normal return: the constant-mean-return model and the market model. The constant-mean-return model, as the name implies, assumes that the mean return of a given stock is constant through time. The market model, used in this paper, assumes a stable linear relation between the overall market return and the stock return.

Estimation of the normal return model. Once a normal return model has been selected, the parameters of the model must be estimated using a subset of the data known as the estimation window. The most common choice, when feasible, is to use the period prior to the event window for the estimation window (cf. Fig 1). For example, in an event study using daily data and the market model, the market model parameters could be estimated over the 120 days prior to the event. Generally, the event period itself is not included in the estimation period to prevent the event from influencing the normal return model parameter estimates.

Statistical validation. With the estimated parameters of the normal return model, the abnormal returns can be calculated. The null hypothesis, H_0, is that external events have no impact on the returns. It has been shown that under H_0, the abnormal returns are normally distributed, AR_i,τ ∼ 𝓝(0, σ²(AR_i,τ)) [34]. This forms the basis for a procedure which tests whether an abnormal return is statistically significant.

Event detection from Twitter data. The following subsections first define the algorithm used to detect Twitter activity peaks which are then treated as events. Next, a method to assign a polarity to the events, using the Twitter sentiment, is described. Finally, we discuss a specific type of events for the companies studied, called earnings announcement events (abbreviated EA), which are already known to produce abnormal price jumps.

Detection of Twitter peaks. To identify Twitter activity peaks, for every company we use the time series of its daily Twitter volume, TW_d. We use a sliding window of 2L + 1 days (L = 5) centered at day d_0, and let d_0 slide along the time line. Within this window we estimate the baseline volume activity TW_b as the median of the window [54]. Then, we define the outlier fraction ϕ(d_0) of the central time point d_0 as the relative difference of the activity TW_d_0 with respect to the median baseline TW_b: ϕ(d_0) = (TW_d_0 − TW_b)/max(TW_b, n_min). Here, n_min = 10 is a minimum activity level used to regularize the definition of ϕ(d_0) for low activity values. We say that there is an activity peak at d_0 if ϕ(d_0) > ϕ_t, where ϕ_t = 2. The threshold ϕ_t determines the number of detected peaks and the number of overlaps between the event windows; both decrease as ϕ_t increases. One should maximize the number of detected peaks and minimize the number of overlaps [41]. We have analyzed the effects of varying ϕ_t from 0.5 to 10 (as in [54]). The decrease in the number of overlaps is substantial for ϕ_t ranging from 0.5 to 2; for larger values the decrease is slower. Therefore, we settled for ϕ_t = 2. As a final step, we apply filtering which removes detected peaks that are less than 21 days (the size of the event window) apart from other peaks.
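The peak-detection procedure can be sketched as follows. The tie-breaking rule in the final filtering step (keeping the earlier of two nearby peaks) is one possible reading, as the text does not specify it:

```python
from statistics import median

def detect_peaks(tw, L=5, phi_t=2.0, n_min=10, min_sep=21):
    """Twitter-volume peak detection: sliding-median baseline, outlier
    fraction, thresholding, and a minimum separation between peaks."""
    peaks = []
    for d0 in range(L, len(tw) - L):
        baseline = median(tw[d0 - L : d0 + L + 1])
        phi = (tw[d0] - baseline) / max(baseline, n_min)
        if phi > phi_t:
            peaks.append(d0)
    # Drop peaks closer than min_sep days to an already accepted peak
    # (one possible reading of the filtering step).
    filtered = []
    for p in peaks:
        if not filtered or p - filtered[-1] >= min_sep:
            filtered.append(p)
    return filtered

volume = [12] * 60
volume[30] = 80  # a single activity spike
print(detect_peaks(volume))  # [30]
```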

As an illustration, the resulting activity peaks for the Nike company are shown in Fig 2. After the peak detection procedure, we treat all the peaks detected as events. These events are then assigned polarity (from Twitter sentiment) and type (earnings announcement or not).

Polarity of events. Each event is assigned one of three polarities: negative, neutral, or positive. The polarity of an event is derived from the sentiment polarity P_d of the tweets on the peak day. From our data we detected 260 events. The distribution of the P_d values for the 260 events is not uniform, but prevailingly positive, as shown in Fig 3. To obtain three sets of events of approximately the same size, we select thresholds that define the event polarity as follows:

If P_d ∈ [−1, 0.15), the event is a negative event;

If P_d ∈ [0.15, 0.7], the event is a neutral event;

If P_d ∈ (0.7, 1], the event is a positive event.

Putting thresholds on a signal is always somewhat arbitrary, and there is no systematic treatment of this issue in the event study literature [41]. The justification for our approach is that sentiment should be regarded in relative terms, in the context of related events. Sentiment polarity has no absolute meaning; it merely provides an ordering of events on the scale from −1 (negative) to +1 (positive). The most straightforward choice is then to distribute all the events uniformly between the three classes. Conceptually similar approaches, i.e., treating the sentiment in relative terms, were already applied to compare the sentiment leaning of network communities towards different environmental topics [39], and to compare the emotional reactions to conspiracy and science posts on Facebook [38]. Additionally, in the closely related work by Sprenger et al. [36], the authors use the percentage of positive tweets on a given day d to determine the event polarity. Since they also report an excess of positive tweets, they use the median share of positive tweets as the threshold between positive and negative events.
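A data-driven way to obtain such balanced classes is to cut the P_d distribution at its terciles. The fixed cut-offs 0.15 and 0.7 used above were chosen by inspecting the actual distribution (Fig 3), so the tercile computation below is an illustrative stand-in:

```python
def polarity_thresholds(p_values):
    """Tercile cut-offs of the peak-day sentiment P_d, so that the negative,
    neutral, and positive event classes are of approximately equal size.
    (Illustrative: the paper fixes the cut-offs at 0.15 and 0.7.)"""
    s = sorted(p_values)
    return s[len(s) // 3], s[2 * len(s) // 3]

def event_polarity(p_d, lo, hi):
    """Assign an event polarity from P_d, mirroring the three intervals above."""
    if p_d < lo:
        return "negative"
    if p_d <= hi:
        return "neutral"
    return "positive"

lo, hi = polarity_thresholds([0.0, 0.1, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9, 1.0])
print(lo, hi)  # 0.4 0.8
```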

Event types. For a specific type of events in finance, in particular quarterly earnings announcements (EA), it is known that the price return of a stock abnormally jumps in the direction of the earnings [34, 35]. In our case, the Twitter data show high posting activity during the EA events, as expected. However, there are also other peaks in the Twitter activity, which do not correspond to EA, abbreviated as non-EA events. See Fig 2 for an example of Nike.

The total number of peaks that our procedure detects in the period of the study is 260. Manual examination reveals that in the same period there are 151 EA events (obtained from http://www.zacks.com/). Our event detection procedure detects 118 of them; the remaining detected peaks are non-EA events. This means that the recall (the fraction of all EA events that were correctly detected) of our peak detection procedure is 78%. In contrast, Sprenger et al. [36] detect 224 out of 672 EA events, yielding a recall of 33%. They apply a simpler peak detection procedure: a Twitter peak is defined as a one-standard-deviation increase of the tweet volume over the previous five days.

The number of detected peaks indicates that there is a large number of interesting events on Twitter which cannot be explained by earnings announcements. The impact of the EA events on price returns is already known in the literature, and our goal is to reconfirm these results. On the other hand, the impact of the non-EA events is not known, and it is interesting to verify whether they have a similar impact on prices as the EA events.

Therefore, we perform the event study in two scenarios, with explicit detection of the two types of events: all the events (including EA), and non-EA events only:

1. Detecting all events from the complete time interval of the data, including the EA days. In total, 260 events are detected, 118 of which are EA events.

2. Detecting non-EA events from a subset of the data. For each of the 151 EA events, where d is the event day, we first remove the interval [d − 1, d + 1], and then perform the event detection again. This results in 182 detected non-EA events.

We report all the detected peaks, with the dates and their polarity, in S1 Appendix. The EA events are in Table 1, and the non-EA events are in Table 2.

The first scenario allows us to compare the results based on the Twitter sentiment with the existing literature in financial econometrics [34]. It is worth noting, however, that the variable used there to infer the “polarity” of the events is the difference between the expected and announced earnings. The analysis of the non-EA events in the second scenario tests whether the Twitter sentiment data contain useful information about the behavior of investors for other types of events, in addition to the already well-known EA events.

Estimation of normal returns. Here we briefly explain the market model procedure for the estimation of normal returns. Our methodology follows the one presented in [34] and [55]. The market model is a statistical model which relates the return of a given stock to the return of the market portfolio. The model’s linear specification follows from the assumed joint normality of stock returns. We use the DJIA index as the market portfolio in the normal market model. This choice helps us avoid adding too many variables to the model and simplifies the computation. The aggregated DJIA index is computed from the mean weighted prices of all the stocks in the index. For any stock i and date d, the market model is:

R_i,d = α_i + β_i R_DJIA,d + ϵ_i,d,  (4)
E[ϵ_i,d] = 0,  (5)
Var(ϵ_i,d) = σ²_ϵ_i,  (6)

where R_i,d and R_DJIA,d are the returns of stock i and the market portfolio, respectively, and ϵ_i,d is the zero-mean disturbance term. α_i, β_i, and σ²_ϵ_i are the parameters of the market model. To estimate these parameters for a given event and stock, we use an estimation window of L = 120 days, following the guideline provided in [34]. Using the notation presented in Fig 1 for the time line, the estimated normal return is:

E[R_i,d] = α̂_i + β̂_i R_DJIA,d,  (7)

where α̂_i and β̂_i are the parameters estimated by the OLS procedure [34]. The abnormal return for company i at day d is the residual:

AR_i,d = R_i,d − α̂_i − β̂_i R_DJIA,d.  (8)
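A minimal numerical sketch of the OLS estimation in Eqs (4)-(8), on synthetic returns (the parameter values and the event-day jump are illustrative):

```python
import numpy as np

def market_model_fit(stock_returns, market_returns):
    """OLS estimates of (alpha_i, beta_i) in Eq (4) over the estimation window."""
    beta_hat, alpha_hat = np.polyfit(market_returns, stock_returns, 1)
    return alpha_hat, beta_hat

def abnormal_return(r_stock, r_market, alpha_hat, beta_hat):
    """Residual of the market model on the event day, Eq (8)."""
    return r_stock - alpha_hat - beta_hat * r_market

# Synthetic estimation window of 120 daily returns with known alpha and beta.
rng = np.random.default_rng(1)
r_mkt = rng.normal(0.0, 0.01, 120)
r_stock = 0.0005 + 1.2 * r_mkt + rng.normal(0.0, 0.001, 120)
alpha_hat, beta_hat = market_model_fit(r_stock, r_mkt)

# Event day with a +1% abnormal jump on top of the normal return.
ar = abnormal_return(0.0005 + 1.2 * 0.002 + 0.01, 0.002, alpha_hat, beta_hat)
print(round(ar, 3))  # ~0.01
```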

Statistical validation. Our null hypothesis, H_0, is that external events have no impact on the behavior of returns (mean or variance). The distributional properties of the abnormal returns can be used to draw inferences over any period within the event window. Under H_0, the distribution of the sample abnormal return of a given observation in the event window is normal:

AR_i,τ ∼ 𝓝(0, σ²(AR_i,τ)).  (9)

Eq (9) takes into account the aggregation of the abnormal returns.
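Under Eq (9), a single abnormal return can be tested with a two-sided z-test, standardizing by its estimated standard deviation. This is a minimal sketch of the validation step; the aggregation of abnormal returns over event windows and across events is omitted:

```python
from math import erfc, sqrt

def ar_significance(ar, sigma_ar):
    """Two-sided p-value of an abnormal return under H0: AR ~ N(0, sigma_ar^2).
    sigma_ar would be estimated, e.g., from the market-model residuals in the
    estimation window."""
    z = ar / sigma_ar
    p = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p

# An abnormal return twice its standard deviation is significant at the 5% level.
z, p = ar_significance(0.02, 0.01)
print(round(z, 2), round(p, 3))  # 2.0 0.046
```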