People love to talk about the mood of a franchise, or the collective feeling of its fanbase. Are they dispirited, optimistic? Ecstatic following a World Series win, or broken after an agonizing walkoff loss? For the most part, we leave it to the beat writers to gauge mood (which is not necessarily a bad thing), without any kind of backing for their proclamations (which might be a bad thing).

Hypothetically, fans are a reservoir of great wisdom (collectively, although perhaps not individually). So tapping into the mood of a fanbase could be more than interesting; it could be useful. But beyond asking potentially biased observers, there has been little we could do to measure a fanbase’s mood objectively or quantitatively.

In this article, I’m going to present one way to gauge the happiness of a fanbase, using a text analysis of the website Reddit. Reddit is an aggregation engine, to which individual users can submit links to other websites or original content, which is then upvoted, downvoted, and commented upon. Importantly, Reddit self-organizes into communities of like-minded individuals, one category of which is fans of a sports team. As a result, there is one team-specific subreddit (community) for each MLB team’s fans, along with a huge body of text from that team’s fans.

I used a freely available program[1] to harvest Reddit comments and posts en masse, over a month-long time period (roughly Jan. 5-Feb. 5). The program spits out a list of words, along with the number of times each word occurs. So, for example, the Yankees subreddit used the word “money” 25 times in the past month. The subreddit of the small-market Rays, on the other hand, used the same word merely five times.
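To make the counting step concrete, here is a minimal sketch of what a word tally over a pile of comments might look like. It is not the harvesting script itself: the function name `count_words`, the tokenizing rule, and the two sample comments are all assumptions made up for illustration.

```python
import re
from collections import Counter

def count_words(comments):
    """Return a Counter mapping each lowercased word to how often it occurs."""
    counts = Counter()
    for text in comments:
        # Assumption: a "word" is a run of letters, optionally with
        # internal apostrophes (so "can't" stays one word).
        counts.update(re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower()))
    return counts

# Hypothetical stand-in for a month of harvested comments.
sample_comments = [
    "Money can't buy a rotation.",
    "They have the money, so spend the money!",
]
word_counts = count_words(sample_comments)
print(word_counts["money"])
```

A real harvest would feed in thousands of comments per subreddit, but the output has the same shape: one count per word.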

To figure out how happy each team’s fanbase is, I did what’s called ‘sentiment analysis’ on each list of words. The idea is like this: Some words tend to be used in positive situations, and indicate that the writer is happier, while others are more negative in connotation, and suggestive of despair. For example, ‘excellence’ is a very positive word, and ‘deception’ an unpleasant one. If a team’s comments are filled with words like excellence, and bereft of words like deception, they are probably happy, and vice versa.

To do the sentiment analysis, I used a list of words (called AFINN-111[2]) which had been manually assigned levels of positivity from -5 to 5. To give you an idea of how it works, the word ‘excellence’ is rated a +3 on this list, while ‘deception’ is rated -3. Then I matched up words from the Reddit analysis with the sentiment list, multiplied each word’s rating by the number of times it was used in each subreddit, and added up the results. The higher the total score, which I called the total affect rating, the happier the fanbase[3].
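The scoring described above can be sketched in a few lines. The two word ratings come from the examples in the text; the word counts are hypothetical, and the exact formula for the affect ratio (summed positive scores divided by summed negative scores, as described in footnote [3]) is my reading rather than a confirmed implementation.

```python
def total_affect(word_counts, valence):
    """Sum rating * count over every word that appears in the sentiment list."""
    return sum(valence[word] * count
               for word, count in word_counts.items() if word in valence)

def affect_ratio(word_counts, valence):
    """Assumed definition: summed positive score over summed negative score."""
    positive = sum(valence[w] * n for w, n in word_counts.items()
                   if valence.get(w, 0) > 0)
    negative = sum(-valence[w] * n for w, n in word_counts.items()
                   if valence.get(w, 0) < 0)
    return positive / negative if negative else float("inf")

# Ratings for these two words are the ones quoted in the article.
valence = {"excellence": 3, "deception": -3}

# Hypothetical counts for one subreddit; "bullpen" is not on the list,
# so it contributes nothing to either statistic.
counts = {"excellence": 4, "deception": 1, "bullpen": 10}

print(total_affect(counts, valence))  # 3*4 + (-3)*1 = 9
print(affect_ratio(counts, valence))  # 12 / 3 = 4.0
```

Note that the total grows with the sheer volume of text, while the ratio does not, which is why the two can tell different stories for very active or very quiet subreddits.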

Here’s what I found, for all 30 teams, sorted by total affect rating, our proxy for fanbase happiness.

It’s Always Sunny in {Insert City Here}

First of all, let’s get this out of the way: Fanbases are all, without exception, pretty optimistic compared to other subreddits. On average, every fanbase maintains a substantially positive total affect. This finding makes a lot of sense, when you take into account the powerful selection bias involved in contributing to a team-specific subreddit—you probably aren’t going to do it unless you have some positive feelings (or at least hope) for the team of interest.

But perhaps these fanbases aren’t any happier than the rest of the internet. To check that, I looked at a few other subreddits, and calculated their levels of positive affect. For example, I scrutinized a collection of texts from city-based subreddits (/r/Chicago, /r/Miami, etc.). No city I looked at had an affect ratio higher than the lowest ratio among the team-specific subreddits. All in all, this makes a lot of sense: baseball is an optional hobby, so if someone doesn’t like participating in it, they probably won’t.

The Causes of Fan Happiness

Next, I was curious about what factors correlate with the happiness of the redditors. The first and most obvious factor that might influence the happiness of a fanbase is its past performance. The Tigers, for example, are perennial contenders and finished last year with 90 wins. They’ve been to a World Series recently, and are known as a great organization. How much does that contribute to their mood? As a rough proxy for past success, I used last year’s number of wins.

Last year’s wins contribute surprisingly little to total happiness. The correlation is there (r=.3[4]), but it falls short of statistical significance.
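For readers who want to see what a rank-order (Spearman) correlation does, here is a small self-contained sketch. It ranks both variables and then takes the ordinary Pearson correlation of the ranks. The win totals and affect ratings below are invented to show why ranks help when a relationship is monotonic but not linear; they are not the real team data.

```python
from math import sqrt

def ranks(values):
    """Assign 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Ordinary linear correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def spearman(x, y):
    """Rank-order correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: affect rises with wins, but one team is a huge outlier.
wins = [66, 73, 77, 90, 98]
affect = [300, 400, 900, 1200, 12000]

print(round(pearson(wins, affect), 2))
print(round(spearman(wins, affect), 2))
```

Because the made-up affect numbers always rise with wins, the rank correlation is perfect even though the linear one is dragged down by the outlier.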

Another possibility is that the fanbase is less concerned about the past performance, and more with the future. It’s possible that fans are already over the results of last season, and have moved on in their mood to thinking about next season. We can check this by going to PECOTA, which objectively projects the performance of every team for the next year. PECOTA stands in here for the conventional wisdom, reflecting what we think we know about next year’s likely performance.

Here, there is a slightly more substantial (r=.39) and also significant (p=.032) relationship. So it seems, on the surface at least, that Reddit fanbases are more concerned with the future than with dwelling on their past success.

Individually, past performance and future projections contribute relatively little to explaining a fanbase’s mood. But perhaps together, there are some synergistic effects that can explain more of the variation. I put both predictors into a combined regression, and checked to see how well I could predict the resulting total affect rating.

Surprisingly, when combining the variables together[5], a very substantial improvement is possible. Using the complete model[6], I can predict the total affect rating astoundingly well (r=.7). So maybe fan happiness is, in aggregate and to a first approximation, a simple function of past success and future expectations.
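The general shape of that check (fit on half the teams, predict the other half, then swap) can be sketched as follows. To keep the example free of outside libraries, it swaps in plain least-squares regression for the support-vector machine described in footnote [6], and it leaves out the word-count predictor from footnote [5]. All function names and all numbers are hypothetical.

```python
def fit_ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

def predict(coefs, row):
    """Apply fitted coefficients to one row of predictors."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))

def two_fold_predictions(X, y):
    """Predict every point from a model fit only on the other half of the data."""
    n = len(y)
    folds = [list(range(0, n, 2)), list(range(1, n, 2))]
    predictions = [0.0] * n
    for test_idx in folds:
        train_idx = [i for i in range(n) if i not in test_idx]
        coefs = fit_ols([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            predictions[i] = predict(coefs, X[i])
    return predictions

# Hypothetical teams: (last year's wins, projected wins) and affect ratings.
predictors = [(98, 91), (90, 83), (88, 85), (79, 80), (77, 84),
              (73, 78), (70, 72), (66, 74), (64, 70), (96, 88)]
affect = [4100, 3900, 5200, 2500, 3300, 1900, 1200, 1500, 800, 4600]

predicted = two_fold_predictions(predictors, affect)
for actual, guess in zip(affect, predicted):
    print(actual, round(guess))
```

Because each prediction comes from a model that never saw that team, a strong match between the two columns is harder to achieve by overfitting, although with so few teams it is still far from a guarantee.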

Irrational Exuberance

Doing the predictions in this way allows us to also look at fanbases that are irrationally happy or sad. Here are the top five fanbases that are happier than their performances suggest that they should be:

Name                   Total Affect Rating   Predicted Affect Rating   Difference
San Francisco Giants   12082                 8008                      4074
Seattle Mariners       4172                  3338                      834
Atlanta Braves         6967                  6522                      445
Chicago White Sox      2214                  1846                      368
New York Mets          8087                  7814                      273

There’s no surprise in number one. The Giants’ total happiness is off the charts, which I think must be the result of winning the World Series (again and again and again, in all even-numbered years since 2010). The magnitude of the effect is kind of incredible: The Giants fans have a total affect number about 50 percent higher than the next happiest fanbase.

The other teams are a bit more surprising. The Seattle Mariners were relevant to the playoff picture last year for the first time in a few seasons, and they project to be above average this year as well. Maybe this excess happiness is the side effect of that return to relevancy. A similar argument could be made for the White Sox, whose shrewd offseason has seen their postseason odds increase substantially. The Braves confuse me, both at the organizational and fanbase levels. The team is not projected to be competitive, nor was it last year, and yet hope springs eternal enough for the club to invest $44 million in the dubious defense of Nick Markakis. On top of that, the team is embroiled in an ugly controversy over its publicly funded stadium, with allegations of political corruption. How the fans remain so optimistic is anybody’s guess.

And the reverse, the fanbases that are most groundlessly unhappy:

Name                   Total Affect Rating   Predicted Affect Rating   Difference
San Diego Padres       1540                  1813                      -273
New York Yankees       684                   962                       -278
Los Angeles Angels     433                   1163                      -730
Tampa Bay Rays         320                   1183                      -863
Toronto Blue Jays      4263                  5183                      -920

Three of these five are in the AL East, and that might be more than coincidence. It must be frustrating to see your team regularly compete with great teams outside of the division, only to contend for division titles and wild cards with two of the richest teams in baseball, along with three less wealthy but exceedingly well-run teams (one of whom possesses occult powers). Beyond them, we have the Angels, who are as puzzling as the Braves above. They are good, young, and projected to win 91 games after pacing all of baseball with 98 wins last year. Their continuing despair is mysterious.
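The "irrationally happy or sad" ranking is just the actual rating minus the predicted one, sorted. Here is a short sketch using the ten teams listed in the two tables above, with the predicted ratings rounded to whole numbers; the function name `rank_by_surprise` is mine.

```python
# (team, total affect rating, predicted affect rating) from the two tables.
teams = [
    ("San Francisco Giants", 12082, 8008),
    ("Seattle Mariners", 4172, 3338),
    ("Atlanta Braves", 6967, 6522),
    ("Chicago White Sox", 2214, 1846),
    ("New York Mets", 8087, 7814),
    ("San Diego Padres", 1540, 1813),
    ("New York Yankees", 684, 962),
    ("Los Angeles Angels", 433, 1163),
    ("Tampa Bay Rays", 320, 1183),
    ("Toronto Blue Jays", 4263, 5183),
]

def rank_by_surprise(rows):
    """Sort teams from most unexpectedly happy to most unexpectedly unhappy."""
    scored = [(name, actual - predicted) for name, actual, predicted in rows]
    return sorted(scored, key=lambda item: item[1], reverse=True)

ranked = rank_by_surprise(teams)
for name, difference in ranked:
    print(f"{name:22s} {difference:+d}")
```

A positive difference means the fanbase is happier than the model expects, and a negative one means it is gloomier.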

There could be a variety of reasons why a fanbase’s mood deviates from what the model expects, some of which I’ve explained above. I have a faint and probably baseless hope that some of the deviations in expected happiness are the result of the fanbases being able to weigh and take into account factors beyond PECOTA’s considerable purview, like changes in coaching staff (the Rays and the Cubs) or other positive or negative indications from their organization. If that’s the case, then maybe the teams with exceptionally happy or sad redditors (relative to expectations) might be able to tell us something about the accuracy of the projections.

To that end, as the season goes on, I’m hoping to continue tracking the mood of the redditors, checking back in a few times during the year to see how their sentiment scores have changed. It would be fun to see when each fanbase gives up on a team, or if they simply don’t until the very last gasp; or how they react to winning or losing streaks, injuries to their core players, and so on. On top of that, although it’s a very long shot, maybe the mood of the fans will be able to tell us something PECOTA doesn’t know.



[1] Thanks to GitHub user rhiever for making this script.

[2] Check out this paper for some details about the word sentiment list.

[3] Fanbases also differed in terms of their levels of Reddit participation, so in addition to the total affect rating, I calculated the ratio of positive to negative affect scores, which I term the affect ratio. The latter statistic corrects for the variation in participation, and could be used as another measure of fanbase ‘happiness’. Surprisingly, however, affect ratio was not correlated with the total number of words in a subreddit, indicating that participation and happiness are somewhat decoupled. The other results also mostly hold if I look at affect ratio instead of total, although some of the surprisingly happy/unhappy teams change.

[4] For these correlations, I am using the Spearman, i.e. rank-order, correlation coefficient, because the relationships don’t look linear to me.

[5] Along with the total number of words on each subreddit, to account for the level of participation.

[6] To guard against overfitting, I built a support-vector machine model with 2-fold cross-validation, because that’s all this small sample of data could bear. However, there still exists the possibility of overfitting, with so few data points. I would like to have more data than just the 30 teams, but unfortunately I am not yet able to harvest subreddit information from earlier than a year ago.