In a widely cited study, Johan Bollen, Huina Mao and Xiao-Jun Zeng claim that

… collective mood states derived from large-scale Twitter feeds are correlated to the value of the Dow Jones Industrial Average (DJIA) over time. … We find an accuracy of 87.6% in predicting the daily up and down changes in the closing values of the DJIA …

The media have responded to this study with a mix of adulation and credulity. Perhaps the narrative presented by Bollen et al. is appealing because it assuages our suspicions that Twitter is a frivolous waste of time; or perhaps it fits with the ‘needle in the haystack’ technophiliac fantasy; or perhaps it empowers the hoi polloi, who can now claim the mantle of controlling the Dow Jones Index by the vagaries of their mood.

Whatever the appeal of the paper, the story continues to resonate in the internet echo chamber, with ripples still appearing now, some 18 months after the original publication. Among those reporting this paper without any hint of skepticism are The Telegraph, The Daily Mail, USA Today, The Atlantic, Wired Magazine, Time Magazine, CNBC, CNN, All Things Considered, On the Media, and a long tail of blogs, newspapers, etc.

Given the entirely unskeptical reception the Bollen paper has received, there is a clear need for a critical evaluation of it, expressed in terms that can be understood by those with no formal statistical training.

The principal problems with this paper are:

1. The authors exhibit a level of sloppiness that taints the integrity of the results. From basic accounting mistakes to appalling methodological flaws, these errors call into question whether any of their results can be trusted.
2. The advertised results, e.g. the purported forecast accuracy of their system, are biased ‘by selection’. Effectively, the authors have picked winners after the race is run, citing the results of the race as unbiased estimates of true merit, without untangling the effects of luck.
3. The advertised 86.7% forecast accuracy is suspect in vacuo, since it would yield the greatest quantitative strategy ever discovered. That this would have been discovered by newcomers to the world of quantitative finance, and that the strategy depends only on two- to six-day-old public information, beggars the imagination.
4. There is no sensible physical model of how such a large effect could exist, nor any reason it would have passed undetected until now.

There are roughly three parts of this paper beyond the introductory material:

1. The sentiment analysis tools employed are not insane, in that they correctly detect that people are, in general, happy around Thanksgiving, and uneasy before an election, for example.
2. Some of the mood scores found by the sentiment analysis tools are purportedly correlated with changes in the DJIA, according to a ‘Granger causality’ analysis.
3. The raw mood scores are turned into forecasts of daily DJIA movements. Accuracy of the system is ‘confirmed’ by looking at some extra data (a ‘hold out’ set) which was not used in the training of the predictive models.

I will tackle the last two findings in turn; the first finding makes few specific predictive claims. Because the paper gives only loose technical details, and the data used are not widely available (collecting all Twitter feeds over a one-year period is a technically challenging feat), it is impossible to definitively refute the claims; rather, they can only be cast into serious doubt.

the Granger Causality tests and Table II

The second ‘finding’ of Bollen et al. is that of purported statistical significance in a Granger causality test. This is supposed to establish the ability of raw Twitter mood data to forecast changes in the DJIA index. There are numerous technical reasons why such an analysis might malfunction. However, none of them need be invoked here, because the authors make a much more basic statistical blunder: that of not correcting for multiple hypothesis testing.

A classical statistical test spits out a ‘p-value’: roughly, the probability of seeing evidence at least as strong as what was actually observed, computed under the assumption of some condition that you would like to rule out. A p-value balances the amount of evidence and its strength. If the resultant p-value is indeed small, smaller than some ‘sacred value’, usually taken to be 0.05, one claims that the ‘null hypothesis’, the condition assumed as part of the test, is unlikely, or is ‘rejected’. In the Granger causality analysis being performed here, the ‘null hypothesis’, the hypothesis the authors wish to reject, is that the Twitter-based signal has no forecasting ability on the DJIA, and is effectively independent from it.

Bollen et al., in Table II of their paper, commit the statistical sin of performing many such tests (49 of them), and then attributing statistical significance to those that have a small p-value (they display the p-values in boldface if they are less than 0.10, attach two stars to those less than 0.05, etc.). If one were to perform ten million such tests, and the null hypothesis were true (i.e. if Twitter did not predict DJIA in any way), one would expect to have one million resultant p-values less than 0.1, printed in boldface in one’s enormous table. Similarly, one would expect to have one million p-values between 0.317 and 0.417, a hundred thousand between 0.8349 and 0.8449, etc. The presence of many small p-values in this scenario is simply due to chance ‘bad luck’ under the null hypothesis.
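The same reasoning can be scaled down from ten million tests to the 49 in Table II. The following is a minimal sketch, not anything taken from the paper: it assumes the 49 tests are independent (they are not quite, since they reuse lags of the same series), and the function names are hypothetical. It computes how many p-values below a cutoff one expects under the null, and checks the answer by drawing uniform random p-values.

```python
import random

N_TESTS = 49  # number of Granger tests reported in Table II


def expected_count_below(threshold, n_tests=N_TESTS):
    """Expected number of p-values below `threshold` when the null is true."""
    return threshold * n_tests


def prob_at_least_one_below(threshold, n_tests=N_TESTS):
    """Chance that at least one of n independent null p-values falls below `threshold`."""
    return 1.0 - (1.0 - threshold) ** n_tests


def simulate_mean_count_below(threshold, n_tests=N_TESTS, n_tables=20000, seed=1):
    """Average count of uniform (null) p-values below `threshold` over many simulated tables."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_tables):
        total += sum(rng.random() < threshold for _ in range(n_tests))
    return total / n_tables


if __name__ == "__main__":
    print("expected boldface entries (p < 0.10):", expected_count_below(0.10))
    print("simulated average:", simulate_mean_count_below(0.10))
    print("P(at least one p < 0.05):", prob_at_least_one_below(0.05))
```

Under these assumptions, roughly five boldface entries are expected in a table of 49 even if Twitter mood carries no information at all about the DJIA.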

For comparison, here is a plot of the empirical distribution of the p-values from Table II. Under the null hypothesis, as one performs more and more statistical tests one expects the p-values to be ‘uniformly distributed’, and thus the empirical CDF plot would fall on the \(y=x\) line, plotted in red here. If the null were violated, i.e. if the Twitter mood data exhibited ‘causality’ on the DJIA movement, we should see a lot of p-values on the left side of the plot, and the empirical CDF would hug the left side and top of the plot, bowing away from the diagonal. However, by my eye, the data are consistent with the null hypothesis, and the 7 p-values less than 0.10 are no more remarkable than the 13 that are greater than 0.90.

Performing a Bonferroni correction for multiple tests, none of the p-values from Bollen’s Table II are considered statistically significant at the 0.10 level. For the layman, the conclusion to be drawn is that the evidence is not inconsistent with all the Twitter moods and lags being independent from movements of the DJIA, and some of them looking better than others due to chance.
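The Bonferroni correction itself is a one-line calculation: divide the significance level by the number of tests. A minimal sketch follows, using the 49 tests and the smallest reported p-value of 0.013 from Table II; the function names are hypothetical.

```python
N_TESTS = 49        # tests reported in Table II
SMALLEST_P = 0.013  # smallest p-value reported in Table II


def bonferroni_threshold(alpha, n_tests):
    """Per-test cutoff that keeps the family-wise error rate at or below `alpha`."""
    return alpha / n_tests


def survives_bonferroni(p_value, alpha, n_tests):
    """True if a single p-value is still significant after the correction."""
    return p_value < bonferroni_threshold(alpha, n_tests)


if __name__ == "__main__":
    print("cutoff at the 0.10 level:", bonferroni_threshold(0.10, N_TESTS))
    print("smallest p-value survives:", survives_bonferroni(SMALLEST_P, 0.10, N_TESTS))
```

The cutoff works out to about 0.002, so even the best entry in the table, at 0.013, falls well short of it.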

the Forecast Model

The third, and perhaps most galvanizing, ‘finding’ of Bollen et al. is of an “accuracy of 87.6% in predicting the daily up and down changes in the closing values of the DJIA.” This is formulated in terms of cross-validation of a Neural Net model, using training and test (or ‘hold out’) sets of data. The goal is to simulate how this model would be used in the real world, trading real money:

1. Train the model using all the data you have up to this very minute.
2. Going forward, input each day’s new Twitter data into the model to get predictions to make trades.
3. Repeat this process, retraining the model as is expedient or necessary, and trading the forecasts every day.

Typically when one trains a model on data, the model’s own estimate of how well it understands or can predict that data is optimistic. This is why one tests a model methodology by training a model, then validating its predictive ability on data that was not used in building the model.

This is commonly accepted practice. However, Bollen’s finding is broken in so many ways:

They got the number wrong. They report an accuracy of 87.6% in the abstract and twice in the paper; they report the same figure as 86.7% twice, including in Table III. Since the accuracy estimates are based on 15 (!) days of test data, the correct value is the smaller one, 86.7%, corresponding to the fraction 13 / 15. The incorrect figure is widely quoted in the media, and was used by Johan Bollen during his interview with CNN. Not that it matters, because …

The forecast accuracy is reported with far too many significant figures. If the model had correctly predicted 12 or 14 days’ directions, instead of the 13 it did, the number would change by plus or minus 7 percentage points. For the technically minded, the standard error on the accuracy figure is around 9%, and the lower end of a one-sided 95% confidence interval on the accuracy figure is 72%. For the layman, the upshot is that it is not inconceivable that the accuracy of the system is as small as 72%, but it looked better in this experiment simply due to random luck. In all, reporting two significant figures is unwarranted, much less three. The effect is perhaps minor, but it does not instill confidence in the authors’ attention to detail.

The accuracy figure is biased upward. The reported 86.7% accuracy is the maximal accuracy achieved for the 8 models listed in Table III of the paper. As in Part II, where the smallest p-values were reported as ‘significant’ when they could be explained by chance, here there is an (upward) bias in the sample accuracy numbers when selecting based on those same quantities. As an analogy, imagine if the 8 models listed in Table III truly had no predictive ability, and thus a forecast accuracy of 50%. You can view them as fair coins. The probability that a single fair coin would land heads 13 or more times out of 15 is 0.37%. This probability is so small it makes us doubt the assumption that the models in Table III are really non-predictive.
However, if one were to flip 8 fair coins, the probability that at least one of them would land heads 13 or more times out of 15 is 2.9%. While this is still small, it is less damning of the assumption of non-predictive models. A similar problem exists with selecting the ‘best’ model based on some sample statistic, then using that same sample statistic as an estimate of a population parameter. Here the forecast accuracy of 86.7% is inflated by the fact that we selected the model based on the estimate.

And this is only the bias that we can observe from the paper. There is the very real possibility of unobservable bias, i.e. datamining bias and publication bias. That is, the authors might have tried numerous different data treatments and algorithms, evaluating the purported out-of-sample accuracy, before settling on one where the results were considered sufficiently ‘interesting’. Continuing the coin flip analogy, if one were to flip 50 fair coins 15 times each, the probability that at least one of them would land heads 13 or more times is 17%. Now the results seem much less interesting. One cannot prove that the authors biased their results in this way; it just provides a plausible alternative explanation for the observed ‘effect’. The authors also did themselves no favors by using such a tiny sample size: if their model had correctly predicted the direction of the DJIA on 130 out of 150 days instead, the possible effect of this kind of bias would be much smaller.

The model accuracy seems high compared to the Granger causality results. The forecast accuracy of 86.7% seems rather high compared to the unconvincing p-values reported in Bollen’s Table II. To test this, I perform some Monte Carlo experiments. For one realization of the Monte Carlo experiment, I take the returns of the DJIA index over the period February 28, 2008 to November 3, 2008, and generate a random -1/+1 variable which has the sign of the next day’s DJIA log return with probability \(13 / 15\).
I then feed it to R’s grangertest function, with 2 lags, and record the p-value. I repeat this experiment 200 times. The point of this experiment is to get some kind of feeling for what a binary signal with the purported accuracy would yield in a Granger analysis. The maximum p-value from 200 Monte Carlo realizations is 2e-05. Compare this to the smallest of the 49 p-values reported in Table II, 0.013. This is something of an apples-to-oranges comparison because, in general, you cannot just compare p-values, and the Neural Net model can capture non-linear relationships that the inherently linear Granger model does not. However, it is very suspicious to me that such an accurate forecast could be made from raw data about which the Granger tests were so ambivalent.

An 86.7% forecast accuracy on DJIA’s daily movement would represent the greatest quantitative strategy ever discovered. As an illustration, here I perform a Monte Carlo simulation of the historical performance of a system with the purported forecasting ability. With probability \(13 / 15\), the strategy gains the absolute return of DJIA, and otherwise loses that amount. It trades at 1x leverage on the DJIA from 1970-01-02 to 2012-04-13. Here are the performance plots showing, respectively, the cumulative return, the daily return, and the drawdown from peak. Note that under the random seed chosen here, the simulation is on the wrong side of Black Monday, and thus the results are mildly pessimistic. However, the annualized Sharpe ratio of this backtest is \(9.2\mbox{yr}^{-½}\), with 95% confidence interval \([8.9\mbox{yr}^{-½},9.5\mbox{yr}^{-½}]\). It doubles its money every 26 weeks.
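The order of magnitude of that Sharpe ratio can be sanity-checked without any market data. The sketch below is a toy version of the simulation, not the backtest described above: it assumes daily returns are Gaussian with zero mean (real DJIA returns are not), annualizes with 252 trading days per year, and uses hypothetical function names. The Sharpe ratio is taken as the mean daily strategy return divided by its standard deviation, scaled by the square root of the number of trading days. Because of the Gaussian assumption it will not reproduce the 9.2 figure, which was computed on actual DJIA returns.

```python
import math
import random
import statistics

TRADING_DAYS_PER_YEAR = 252  # assumption: a common annualization convention


def gaussian_sharpe(hit_rate):
    """Annualized Sharpe ratio of a sign-timing strategy on zero-mean Gaussian returns.

    The strategy earns |r| with probability `hit_rate` and -|r| otherwise, so its
    mean is (2 * hit_rate - 1) * E|r| and its second moment equals that of r.
    """
    mean_in_vol_units = (2.0 * hit_rate - 1.0) * math.sqrt(2.0 / math.pi)
    daily_sharpe = mean_in_vol_units / math.sqrt(1.0 - mean_in_vol_units ** 2)
    return daily_sharpe * math.sqrt(TRADING_DAYS_PER_YEAR)


def simulate_strategy_sharpe(hit_rate, n_days=10000, daily_vol=0.01, seed=0):
    """Monte Carlo estimate of the same quantity from simulated daily returns."""
    rng = random.Random(seed)
    strategy_returns = []
    for _ in range(n_days):
        market_move = abs(rng.gauss(0.0, daily_vol))
        direction = 1.0 if rng.random() < hit_rate else -1.0
        strategy_returns.append(direction * market_move)
    mean = statistics.fmean(strategy_returns)
    stdev = statistics.stdev(strategy_returns)
    return mean / stdev * math.sqrt(TRADING_DAYS_PER_YEAR)


if __name__ == "__main__":
    print("closed form, hit rate 13/15:", gaussian_sharpe(13 / 15))
    print("simulated,   hit rate 13/15:", simulate_strategy_sharpe(13 / 15))
    print("simulated,   hit rate 1/2:  ", simulate_strategy_sharpe(0.5))
```

Even in this crude setting, a hit rate of \(13 / 15\) produces an annualized Sharpe ratio in the low double digits, while a coin-flip hit rate produces one near zero.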

For the layman, the Sharpe ratio is the metric (other than ex post returns!) by which trading strategies are measured. To put these figures into context, an achieved (i.e. in real trading, not backtesting) Sharpe ratio of \(1\mbox{yr}^{-½}\) is considered ‘good’; an achieved value of \(2\mbox{yr}^{-½}\) is ‘excellent’; anything north of \(3\mbox{yr}^{-½}\) is the stuff of legend. I have read dozens of papers on quantitative strategies and market timing, and, to the best of my recollection, have never seen one claim a Sharpe ratio higher than \(4\mbox{yr}^{-½}\). Shen’s analysis of timing strategies, for example, lists ‘successful’ market-timing strategies with Sharpe ratios on the order of \(0.5\mbox{yr}^{-½}\) to \(0.7\mbox{yr}^{-½}\). None of the tin-foil-hat purveyors of market timing signals one can find on the web claim Sharpe ratios higher than \(2\mbox{yr}^{-½}\), nor do they promise 100% returns in 26 weeks. Bollen was apparently unaware he had found the philosopher’s stone when he was quoted as saying: “… we are hopeful to find … better improvements for more sophisticated market models,” i.e. we hope to make the model even better.

The putative mechanism for the forecast defies all common sense. Part of the authors’ argument is that the ‘Calm’ signal from Twitter is predictive of the DJIA at a two- to six-day lag, and thus they use lagged data from this signal as input to their forecast model. Somewhat paradoxically, the one-day lag of ‘Calm’ does not give significant Granger p-values in Table II. Somehow, we are to believe, the information content ‘skips a day’ (or more). This is contrary to common sense, and to the common practice of downweighting older observations as less relevant. It is particularly hard to imagine how using two- or three-day-old tweets would give one the best market timing model of all time.
Furthermore, because the daily movements of the DJIA are ‘high frequency’ (autocorrelation would be ‘arbed out’), a gap such as this could cause the signal to appear ‘out of sync’. For example, let P and C stand for ‘panic’ and ‘calm’ in the Twitter ‘Calm’ signal, and let + and - mean up and down days for the DJIA. Imagine the following stream of days, where the DJIA moves exactly as suggested by the ‘Calm’ signal two (market) days prior.

Calm: P P C C P C P P C ...
DJIA: ... - - + + - + - ...

Because of the delay effect and the DJIA’s high-frequency nature, in this example the DJIA often has down days when the ‘Calm’ signal is calm, and up days when it panics. We would have to believe that market participants pay more attention to how the Twitterverse felt two or three days ago than to how it feels today.
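The ‘out of sync’ effect in the example stream can be counted directly. The sketch below is purely illustrative and uses hypothetical names: it rebuilds the DJIA row from the Calm row with a two-day lag, then compares how often the DJIA agrees with the lagged signal versus the same-day signal.

```python
CALM = "PPCCPCPPC"  # 'P' = panic, 'C' = calm, one letter per market day
LAG = 2             # the DJIA follows the Calm signal from two days earlier


def djia_from_lagged_calm(calm, lag):
    """Map day index -> '+' or '-', where the DJIA copies Calm from `lag` days before."""
    return {day: "+" if calm[day - lag] == "C" else "-" for day in range(lag, len(calm))}


def agreement_rate(calm, djia, lag):
    """Fraction of DJIA days whose direction matches Calm observed `lag` days earlier."""
    hits = sum((calm[day - lag] == "C") == (move == "+") for day, move in djia.items())
    return hits / len(djia)


djia = djia_from_lagged_calm(CALM, LAG)
djia_row = "".join(djia[day] for day in sorted(djia))

if __name__ == "__main__":
    print("DJIA row:", djia_row)                                   # --++-+-
    print("agreement with 2-day-old mood:", agreement_rate(CALM, djia, LAG))
    print("agreement with same-day mood: ", agreement_rate(CALM, djia, 0))
```

In this stream the DJIA matches the two-day-old mood on all 7 days by construction, but matches the same-day mood on only 2 of the 7.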

Are we to believe that Twitter users are trading on how they felt two or three (but not one) days prior, and thus moving the market? Or are they predicting the state of the world two or three days ahead of time, without being able to predict tomorrow? Both of these models are nonsensical. A more reasoned interpretation of the results is that the two- to six-day lags in the ‘Calm’ signal looked better due to datamining bias, and any justification for their existence (I have seen none) is ex post storytelling.

Moreover, given that the putative effect leads to the best market timing model of all time, and the signal is based on people’s expression of mood, one would think that people, in general, would be good at market timing, i.e. do significantly better than random. There is no evidence that this is the case.

The form of the accuracy claim is almost impossibly general. The forecast accuracy is quoted in terms of the predictive accuracy of the “daily up and down changes in the closing values of the DJIA,” full stop. Are we to accept that this accuracy claim holds in both bear and bull markets? In periods of high volatility and low? Regardless of whether tomorrow’s DJIA return is, in absolute value, 2 percent or 0.05 percent? It is not clear how such a broad claim could be extrapolated from performance during 15 trading days in December 2008.
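The binomial arithmetic quoted in the points above (the 9% standard error, the 72% lower bound, and the 0.37%, 2.9% and 17% coin-flip probabilities) can be reproduced in a few lines. This is a minimal sketch with hypothetical names; it assumes a normal approximation for the confidence bound and independent coins for the multiple-model calculation.

```python
from math import comb, sqrt

N_DAYS = 15       # test days in the hold-out set
N_CORRECT = 13    # days on which the best model called the direction correctly
Z_ONE_SIDED_95 = 1.645  # standard normal quantile for a one-sided 95% bound


def tail_probability(n, k, p=0.5):
    """P(at least k successes in n independent trials with success probability p)."""
    return sum(comb(n, j) * p ** j * (1.0 - p) ** (n - j) for j in range(k, n + 1))


def at_least_one(single_prob, n_models):
    """P(at least one of n independent models reaches the same lucky result)."""
    return 1.0 - (1.0 - single_prob) ** n_models


accuracy = N_CORRECT / N_DAYS
standard_error = sqrt(accuracy * (1.0 - accuracy) / N_DAYS)
lower_bound = accuracy - Z_ONE_SIDED_95 * standard_error

one_coin = tail_probability(N_DAYS, N_CORRECT)
eight_coins = at_least_one(one_coin, 8)
fifty_coins = at_least_one(one_coin, 50)

if __name__ == "__main__":
    print(f"accuracy {accuracy:.3f}, standard error {standard_error:.3f}, lower bound {lower_bound:.3f}")
    print(f"one coin {one_coin:.4f}, eight coins {eight_coins:.3f}, fifty coins {fifty_coins:.3f}")
```

Changing `N_DAYS` and `N_CORRECT` to 150 and 130 shows how much a larger hold-out set would have tightened these figures.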

Employing Hanlon’s Razor, I conclude that Bollen, Mao and Zeng are statistical naifs. This is consistent with the egregious methodological flaws evidenced in their paper. If it were merely a matter of the authors’ reputation, we could agree that mistakes were made and move on. However, Bollen and Mao have teamed up with a hedge fund to ‘capitalize’ on this market timing model. Thus unsuspecting real investors can lose real money if the advertised forecast accuracy fails to exist in the real world. Moreover, given the fee structure of hedge funds, investors in said fund are probably signing up for ‘random walk minus costs’, which seems like a bad deal.

It would be too simple to fault the media’s fawning reaction to this paper. After all, the whole story is stuffed full of new-technology catnip, and there has apparently been no accessible critical debate of its merits. In my opinion, the peer-review process has failed miserably here, and journalists can choose only to either re-report the finding as gospel fact or ignore it entirely.

Disclosure: the author has no holdings in Twitter, holds broad market ETFs which intersect with the DJIA, has never made money in market-timing, and would short the Twitter hedge fund if shorting it were possible.