Sports Fans on Reddit

I enjoyed this recent article on fivethirtyeight.com analysing Trump followers on Reddit.com. It covered an interesting range of technical concepts:

Working with a very large dataset on Google’s BigQuery

Applying a natural language processing technique in R

Exploring the behaviour of a defined group of users online

I was interested to see if using a similar approach would allow insight into sports fans on Reddit, so I forked the code from fivethirtyeight’s GitHub repo and went about adapting it to sports content. In this post I will explain the analysis, and provide some interactive tables and charts for readers to play with. Overall I found that:

US-specific sports generated the most activity (unsurprisingly given a strong bias on Reddit towards US users), although Soccer was up there too

The top Premier League clubs were on a par with NFL and NBA teams in terms of activity, particularly Manchester United, but lower-profile teams registered very little activity

Individual sports subreddits tended to cluster, in behavioural terms, around geographies (e.g. high profile US sports) or the type of sport (e.g. boxing and mma)

Supporters of teams in the same league behaved very similarly to one another, and more so when their teams are close geographically (in the case of the NFL) or competitively (in the case of the Premier League)

Reddit.com

Part of the motivation in writing this post was to showcase some of the analysis and web development tools and techniques we use day-to-day in our client work. In this spirit, the SQL and R code to recreate all results and static versions of the plots can be found on a forked repo on my GitHub page, and the code for the interactive plots below, created using Chart.js and ECharts 3, is visible in the page source.

Before we dive into the dataset, some quick notes on Reddit. While public sources disagree about traffic to the site (as is often the case), Reddit is consistently ranked among the largest websites in the US (4th and 11th) and globally (7th and 26th). The audience is heavily skewed to large English-speaking markets, with almost half of visits over the last 2 years from the US, and two thirds from the top 4 markets (US, Canada, UK and Australia), according to comScore. This regional bias can also be observed in relative Google search volume (see the Google Trends result below, noting that the numbers refer to the relative proportion of search within a market as opposed to the absolute number of searches, which is why the US is not top) and should be kept in mind when considering results.

The Dataset

I have taken the Reddit comments data from a collection hosted on Google’s BigQuery. This contains a total of 1.7 billion comments from January 2015 to February 2017. A dataset of this size would be challenging to download, host and analyse locally, but BigQuery allows us to query directly using normal SQL code. I was therefore able to generate some summary statistics before exporting a smaller and more manageable dataset for additional analysis.

First up, I created a set of the largest property/sport subreddits and queried the following for each:

total number of unique authors over the range

number of comments they made

average score (on Reddit, users can upvote or downvote each comment; the score is the difference between the two)

number of comments per author

| Subreddit | Authors | Comments | Average Score | Comments Per Author |
| --- | --- | --- | --- | --- |
| /r/nfl | 255811 | 18974717 | 9.2 | 74.2 |
| /r/nba | 236271 | 16762887 | 9.7 | 70.9 |
| /r/soccer | 226709 | 12657108 | 10.3 | 55.8 |
| /r/CFB | 115487 | 9442045 | 6.8 | 81.8 |
| /r/baseball | 114909 | 4274549 | 8.3 | 37.2 |
| /r/MMA | 111017 | 6297874 | 6.7 | 56.7 |
| /r/formula1 | 51442 | 2162709 | 6.5 | 42 |
| /r/CollegeBasketball | 50796 | 2303007 | 5 | 45.3 |
| /r/olympics | 44612 | 310416 | 6.5 | 7 |
| /r/ufc | 33586 | 153850 | 2.9 | 4.6 |
| /r/MLS | 28381 | 1741721 | 5.2 | 61.4 |
| /r/Boxing | 24812 | 637345 | 4.2 | 25.7 |
| /r/Cricket | 24316 | 2237176 | 3.7 | 92 |
| /r/tennis | 21579 | 640938 | 4.3 | 29.7 |
| /r/nhl | 20536 | 93103 | 3.2 | 4.5 |
| /r/rugbyunion | 19649 | 910415 | 3.8 | 46.3 |
| /r/mlb | 16582 | 65493 | 3 | 3.9 |
| /r/NASCAR | 14491 | 1317135 | 3.3 | 90.9 |
| /r/AFL | 10088 | 898840 | 4.9 | 89.1 |
| /r/nrl | 7340 | 1362770 | 3.5 | 185.7 |
| /r/CFL | 3293 | 106551 | 2.6 | 32.4 |
| /r/ProGolf | 2760 | 20469 | 2.8 | 7.4 |
| /r/PremierLeague | 2715 | 7466 | 2.1 | 2.7 |
| /r/GAA | 848 | 6763 | 2.3 | 8 |
| /r/LaLiga | 317 | 1244 | 1.5 | 3.9 |
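To make the four metrics concrete, here is a minimal Python sketch of the same aggregation. The original queries were written in SQL and run on BigQuery, so this is only an illustration: the `comments` rows and the `summarise` helper are invented for the example and are not part of the original code.

```python
from collections import defaultdict

# Hypothetical sample rows: (subreddit, author, score).
# These values are invented; the real data lives in BigQuery.
comments = [
    ("nfl", "alice", 12),
    ("nfl", "alice", 3),
    ("nfl", "bob", -1),
    ("nba", "carol", 7),
    ("nba", "alice", 5),
]


def summarise(rows):
    """Return authors, comments, average score and comments per author by subreddit."""
    authors = defaultdict(set)
    n_comments = defaultdict(int)
    score_sum = defaultdict(int)
    for subreddit, author, score in rows:
        authors[subreddit].add(author)   # unique authors
        n_comments[subreddit] += 1       # total comments
        score_sum[subreddit] += score    # running total of scores
    return {
        sub: {
            "authors": len(authors[sub]),
            "comments": n_comments[sub],
            "average_score": round(score_sum[sub] / n_comments[sub], 1),
            "comments_per_author": round(n_comments[sub] / len(authors[sub]), 1),
        }
        for sub in n_comments
    }


summary = summarise(comments)
for sub, stats in sorted(summary.items(), key=lambda kv: -kv[1]["comments"]):
    print(sub, stats)
```

Note that comments per author is simply the ratio of the two counts, so it can be recomputed from any row of the table.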

There were over 80 million comments in the above subreddits alone over a period of just over two years, corresponding to over 100,000 comments per day, and tens of millions more in the team subreddits described below. Reddit clearly represents a significant platform for sports fans to get together. In pure volume terms, the US sports dominated, with the NFL and NBA comfortably ahead of the pack. /r/soccer was the next highest, highlighting the fact that most soccer activity goes through a general subreddit, as opposed to league-specific domains like /r/PremierLeague and /r/LaLiga. A similar situation arises for /r/ufc and /r/mlb, with more activity in /r/MMA and /r/baseball respectively.

The comments per author metric gives an indication of the volume of activity by subreddit contributors. The high engagement in Australian sports stands out, accounting for three of the top four subreddits: /r/AFL, /r/nrl and /r/Cricket (although clearly UK and Indian users will be significant with cricket). Just 7,340 NRL authors contributed almost 1.4m comments. By contrast, /r/nhl, with almost triple the number of authors, generated just 7% of the number of comments (this may be due to greater engagement in team-specific subreddits, see below). US-sport focused subreddits with similarly high engagement were /r/CFB and /r/NASCAR.
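The engagement ranking can be checked directly from the table. The short Python sketch below copies the authors and comments figures for a handful of rows and recomputes comments per author; the variable names are my own and the selection of rows is just for illustration.

```python
# (authors, comments) copied from the summary table above
table = {
    "nrl": (7340, 1362770),
    "Cricket": (24316, 2237176),
    "NASCAR": (14491, 1317135),
    "AFL": (10088, 898840),
    "CFB": (115487, 9442045),
    "nfl": (255811, 18974717),
    "nhl": (20536, 93103),
}

# Comments per author, rounded to one decimal place as in the table
per_author = {sub: round(c / a, 1) for sub, (a, c) in table.items()}
ranked = sorted(per_author, key=per_author.get, reverse=True)
print(ranked[:4])

# Compare /r/nhl with /r/nrl: ratio of authors and share of comments
author_ratio = table["nhl"][0] / table["nrl"][0]
comment_share = table["nhl"][1] / table["nrl"][1]
print(round(author_ratio, 1), f"{comment_share:.0%}")
```

The top four by this metric are /r/nrl, /r/Cricket, /r/NASCAR and /r/AFL, and /r/nhl has roughly 2.8 times the authors of /r/nrl but about 7% of its comments.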

Average scores are less interpretable since they will be naturally higher for larger subreddits, and the dataset does not record upvotes and downvotes separately so I could not normalise. Despite this, /r/soccer stands out as having particularly high average scores relative to its size, although this could be due to more ‘lurkers’ – users voting but not commenting.

Leagues and Teams

Next I focused in on the three largest leagues/sports, looking at summary statistics for individual team subreddits in the NFL, NBA and Premier League.

NFL

| Subreddit | Authors | Comments | Average Score | Comments Per Author |
| --- | --- | --- | --- | --- |
| /r/Patriots | 40085 | 1307903 | 4.8 | 32.6 |
| /r/GreenBayPackers | 26345 | 769848 | 5 | 29.2 |
| /r/Seahawks | 25352 | 702994 | 5.3 | 27.7 |
| /r/eagles | 20264 | 770894 | 4.6 | 38 |
| /r/cowboys | 20013 | 649778 | 4.3 | 32.5 |
| /r/minnesotavikings | 17753 | 631414 | 4.5 | 35.6 |
| /r/DenverBroncos | 17463 | 563280 | 4.3 | 32.3 |
| /r/steelers | 15373 | 408690 | 3.9 | 26.6 |
| /r/49ers | 14746 | 466076 | 4.3 | 31.6 |
| /r/detroitlions | 14310 | 382013 | 4.1 | 26.7 |
| /r/falcons | 13890 | 319984 | 4.8 | 23 |
| /r/panthers | 13330 | 381969 | 4.5 | 28.7 |
| /r/CHIBears | 13304 | 396000 | 4.7 | 29.8 |
| /r/NYGiants | 13180 | 373919 | 4.2 | 28.4 |
| /r/Browns | 12262 | 489700 | 3.4 | 39.9 |
| /r/Texans | 11658 | 339026 | 4.5 | 29.1 |
| /r/Chargers | 10960 | 279111 | 4 | 25.5 |
| /r/oaklandraiders | 10652 | 365600 | 3.9 | 34.3 |
| /r/bengals | 10617 | 222847 | 4.5 | 21 |
| /r/buffalobills | 10147 | 282441 | 4 | 27.8 |
| /r/ravens | 9921 | 294932 | 4.1 | 29.7 |
| /r/Colts | 9606 | 237727 | 4 | 24.7 |
| /r/nyjets | 9542 | 273689 | 3.9 | 28.7 |
| /r/Redskins | 9454 | 273847 | 4 | 29 |
| /r/miamidolphins | 9314 | 361395 | 3.7 | 38.8 |
| /r/KansasCityChiefs | 8841 | 209136 | 4 | 23.7 |
| /r/Tennesseetitans | 6667 | 172603 | 3.7 | 25.9 |
| /r/AZCardinals | 6629 | 121659 | 4.4 | 18.4 |
| /r/Saints | 6281 | 147491 | 3.7 | 23.5 |
| /r/LosAngelesRams | 5842 | 78286 | 3.9 | 13.4 |
| /r/Jaguars | 5525 | 147997 | 3.6 | 26.8 |
| /r/buccaneers | 5383 | 188094 | 3.5 | 34.9 |
| /r/StLouisRams | 4867 | 57683 | 3.6 | 11.9 |

Unsurprisingly the /r/Patriots came out top (again), with the volume of activity for other teams seemingly driven by a combination of recent and historical success. I was surprised to see the highest number of comments per user in /r/Browns – fans were seemingly not deterred by poor performance on the field, or maybe they just had a lot to complain about.

NBA

| Subreddit | Authors | Comments | Average Score | Comments Per Author |
| --- | --- | --- | --- | --- |
| /r/warriors | 23309 | 537913 | 4.7 | 23.1 |
| /r/clevelandcavs | 17028 | 579492 | 4.1 | 34 |
| /r/lakers | 16344 | 433834 | 4.2 | 26.5 |
| /r/chicagobulls | 13741 | 570859 | 3.8 | 41.5 |
| /r/torontoraptors | 12717 | 275571 | 4 | 21.7 |
| /r/bostonceltics | 11580 | 484028 | 3.7 | 41.8 |
| /r/NYKnicks | 10244 | 365284 | 3.8 | 35.7 |
| /r/sixers | 8717 | 365950 | 4.1 | 42 |
| /r/rockets | 8702 | 257372 | 4.1 | 29.6 |
| /r/Thunder | 8630 | 125711 | 3.6 | 14.6 |
| /r/NBASpurs | 7813 | 119889 | 4.6 | 15.3 |
| /r/heat | 7238 | 353505 | 3.6 | 48.8 |
| /r/timberwolves | 6693 | 171573 | 4.2 | 25.6 |
| /r/AtlantaHawks | 6341 | 120987 | 4.7 | 19.1 |
| /r/ripcity | 6039 | 112817 | 3.7 | 18.7 |
| /r/Mavericks | 5671 | 91048 | 3.7 | 16.1 |
| /r/kings | 4801 | 133729 | 3.5 | 27.9 |
| /r/MkeBucks | 4716 | 154518 | 4 | 32.8 |
| /r/LAClippers | 4010 | 219454 | 3.8 | 54.7 |
| /r/suns | 3919 | 88187 | 3.5 | 22.5 |
| /r/CharlotteHornets | 3756 | 58145 | 3.7 | 15.5 |
| /r/washingtonwizards | 3698 | 64839 | 3.8 | 17.5 |
| /r/DetroitPistons | 3628 | 94115 | 3.4 | 25.9 |
| /r/pacers | 3078 | 47022 | 3.2 | 15.3 |
| /r/OrlandoMagic | 2925 | 76144 | 3.4 | 26 |
| /r/GoNets | 2655 | 38977 | 3 | 14.7 |
| /r/NOLAPelicans | 2627 | 38824 | 3 | 14.8 |
| /r/denvernuggets | 2554 | 90911 | 3.5 | 35.6 |
| /r/UtahJazz | 2494 | 43195 | 4.1 | 17.3 |
| /r/memphisgrizzlies | 2297 | 27310 | 4.2 | 11.9 |

A similar story for the NBA, with the two most recent champs topping the table, followed by the historically strong Lakers and Bulls. /r/torontoraptors at 5th is perhaps a consequence of the higher relative Reddit interest in Canada.

Premier League

| Subreddit | Authors | Comments | Average Score | Comments Per Author |
| --- | --- | --- | --- | --- |
| /r/reddevils | 32185 | 1857705 | 6 | 57.7 |
| /r/Gunners | 29602 | 1682461 | 5.3 | 56.8 |
| /r/LiverpoolFC | 25392 | 1525531 | 5.6 | 60.1 |
| /r/chelseafc | 15137 | 597838 | 4.8 | 39.5 |
| /r/coys | 11106 | 538429 | 5.5 | 48.5 |
| /r/MCFC | 5925 | 221853 | 4.5 | 37.4 |
| /r/Everton | 3885 | 155632 | 3.7 | 40.1 |
| /r/Hammers | 2462 | 127591 | 3.8 | 51.8 |
| /r/SaintsFC | 1359 | 55625 | 3.5 | 40.9 |
| /r/lcfc | 1297 | 11231 | 3 | 8.7 |
| /r/swanseacity | 982 | 32301 | 3.8 | 32.9 |
| /r/crystalpalace | 816 | 24877 | 2.9 | 30.5 |
| /r/safc | 620 | 13347 | 2.7 | 21.5 |
| /r/Watford_FC | 228 | 1412 | 2.1 | 6.2 |
| /r/WBAfootball | 223 | 4883 | 3.2 | 21.9 |
| /r/AFCBournemouth | 213 | 1122 | 1.8 | 5.3 |
| /r/HullCity | 201 | 1338 | 2.6 | 6.7 |
| /r/StokeCityFC | 201 | 1442 | 1.8 | 7.2 |
| /r/Burnley | 153 | 1547 | 2 | 10.1 |
| /r/Middlesbrough | 64 | 129 | 1.4 | 2 |

It felt neater to consider Premier League clubs, although arguably the top European clubs would have been a better set since the smaller Premier League teams have relatively low levels of activity. The historical strength and high profile of Manchester United (/r/reddevils) meant they topped the table, despite a lack of titles over the period. Behind them were the remainder of the current ‘top six’. Of note is that Leicester (/r/lcfc) had a very low number of comments per author – relatively unengaged fans jumping on the championship bandwagon perhaps?

It was interesting to see that the top Premier League clubs had similar numbers of authors and higher numbers of comments than top NFL and NBA teams, despite the strong bias towards US users – Manchester United, Liverpool and Arsenal all saw more comments than any NFL or NBA team, and only the Patriots had more authors than Manchester United. Having said that, the drop off for other English clubs is stark.

This could be partially explained by the long-term financial advantages enjoyed by top Premier League clubs, in contrast to the forced egalitarianism of US leagues. This may also explain why the relatively smaller UK market manages to generate high levels of top team activity relative to US counterparts – the smaller number of fans is distributed much more unevenly across teams. Finally, as we know from our analysis of TV audience data, the Premier League is incredibly international in its fanbase, and international fans are far more likely to follow a high profile club like Manchester United than lower level clubs like West Bromwich Albion, whose presence in the Premier League is likely to be transient.

Subreddit Similarity

Finally I stepped into a more detailed analysis of behaviour using latent semantic analysis (LSA), a natural language processing technique often used to analyse text and speech. You can read the fivethirtyeight article for more detail, but the basic concept and steps were:

Use BigQuery to generate the author cross-over for every pair of subreddits (i.e. the number of authors who commented at least 10 times in both subreddits of each pair, across the 50,323 subreddits)

Export this dataset to R and calculate a vector for every subreddit comprising its cross-over with 2,133 of the most important subreddits (i.e. each subreddit is defined by a 2,133 dimensional vector of author cross-overs)

Write functions in R to calculate the geometrical similarity between subreddits (i.e. the angle between each subreddit vector), and run so-called subreddit algebra
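The cross-over and angle steps above can be sketched on toy data as follows. This is a Python illustration rather than the R code used for the analysis: the `comment_counts` data, the two reference subreddits and all function names are invented, and the real pipeline may apply additional weighting to the cross-over counts that is not shown here.

```python
import math
from itertools import combinations

MIN_COMMENTS = 10  # threshold quoted in the steps above

# Hypothetical comment counts per author and subreddit, invented for illustration.
comment_counts = {
    "ann": {"nfl": 12, "CFB": 10, "news": 15},
    "ben": {"nfl": 30, "CFB": 11, "news": 10},
    "cat": {"nfl": 10, "funny": 10},
    "dan": {"soccer": 40, "funny": 12},
    "eve": {"soccer": 11, "funny": 20},
    "fay": {"soccer": 15, "news": 10, "CFB": 3},
}

# Hypothetical stand-ins for the 2,133 reference subreddits.
REFERENCE = ["news", "funny"]


def active_authors(counts, min_comments):
    """Map each subreddit to the set of authors with enough comments in it."""
    active = {}
    for author, per_sub in counts.items():
        for sub, n in per_sub.items():
            if n >= min_comments:
                active.setdefault(sub, set()).add(author)
    return active


def crossover_vector(sub, active, reference):
    """Author cross-over between `sub` and each reference subreddit."""
    return [len(active[sub] & active.get(ref, set())) for ref in reference]


def cosine(u, v):
    """Cosine of the angle between two vectors (1 means the same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


active = active_authors(comment_counts, MIN_COMMENTS)
vectors = {s: crossover_vector(s, active, REFERENCE) for s in ["nfl", "CFB", "soccer"]}
for a, b in combinations(vectors, 2):
    print(a, b, round(cosine(vectors[a], vectors[b]), 3))
```

In this toy example the nfl and CFB vectors point in nearly the same direction, while CFB and soccer are further apart, which is the kind of contrast the similarity score is meant to capture.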

Memory limitations on my machine meant I was not able to run the analysis locally in R. Instead I spun up a higher memory Google Cloud Compute Engine instance, installed R, and carried out all analysis remotely through SSH.

I was interested in the relationships between my sets of sports subreddits, so I created an additional function to calculate a matrix of similarities for a set of subreddits, and plot the results in the form of a heatmap. The resulting plots are shown below, with lighter squares indicating greater similarity.
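A minimal Python sketch of such a similarity matrix is given below. The vectors are invented, the function names are my own, and the average is taken over off-diagonal pairs only, which is an assumption about how the averages quoted in the plot titles were computed. Plotting the heatmap itself is left out.

```python
import math

# Hypothetical cross-over vectors, invented for illustration.
vectors = {
    "nfl": [2, 1, 0],
    "nba": [2, 1, 1],
    "Cricket": [0, 1, 3],
}


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(names, vecs):
    """Square matrix of pairwise similarities, in the order given by `names`."""
    return [[cosine(vecs[a], vecs[b]) for b in names] for a in names]


def average_similarity(matrix):
    """Mean of the off-diagonal entries (assumption: self-similarity is excluded)."""
    n = len(matrix)
    off_diag = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(off_diag) / len(off_diag)


names = list(vectors)
matrix = similarity_matrix(names, vectors)
for name, row in zip(names, matrix):
    print(f"{name:8s}", [round(x, 2) for x in row])
print("average similarity:", round(average_similarity(matrix), 2))
```

Each row of the printed matrix corresponds to one row of squares in a heatmap, with values near 1 being the lighter squares.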

Sports Subreddits - Average Similarity = 0.53

Light patches indicate pairs or groups of similar subreddits. Groups tended to emerge along geographical and sporting lines:

/r/nfl, /r/nba, /r/baseball, /r/cfb and /r/collegebasketball – similarities of 0.80 to 0.91 (maximum is 1)

/r/afl and /r/nrl – similarity of 0.79

/r/boxing and /r/mma – 0.79

Conversely, some sports subreddits emerged as being particularly different to others, most notably Cricket and Formula 1, with average similarities of 0.46 and 0.4 to all other subreddits, driven largely by their low similarity to the US based properties.

What does this mean exactly? It means that authors on /r/collegebasketball are very similar to authors on /r/cfb in terms of what other subreddits they comment on. My initial assumption was that this was driven by the same authors posting on each subreddit, but this is only partially true – for /r/collegebasketball and /r/cfb there was a cross-over of 8,500 authors (just 17% of authors from the smaller subreddit). This would not completely explain the high level of similarity, and suggests that different users are behaving in similar ways on the site.
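To see why a high similarity does not require shared authors, consider the toy Python sketch below. The first lines reproduce the 17% figure from the numbers quoted above; the second part uses invented subreddits (`sub_a`, `sub_b`, `ref_x`, `ref_y`) whose author sets do not overlap at all, yet whose authors spread across the reference subreddits in the same proportions.

```python
import math

# Figures quoted in the paragraph above and in the summary table
shared_authors = 8500
smaller_subreddit_authors = 50796  # /r/CollegeBasketball
overlap_pct = round(100 * shared_authors / smaller_subreddit_authors)
print(overlap_pct)

# Invented example: sub_a and sub_b have no authors in common.
active = {
    "sub_a": {"a1", "a2", "a3"},
    "sub_b": {"b1", "b2", "b3"},
    "ref_x": {"a1", "a2", "b1", "b2"},
    "ref_y": {"a3", "b3"},
}


def crossover_vector(sub, reference):
    """Author cross-over between `sub` and each reference subreddit."""
    return [len(active[sub] & active[ref]) for ref in reference]


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vec_a = crossover_vector("sub_a", ["ref_x", "ref_y"])
vec_b = crossover_vector("sub_b", ["ref_x", "ref_y"])
print(len(active["sub_a"] & active["sub_b"]), round(cosine(vec_a, vec_b), 3))
```

Here the two subreddits share zero authors but have a similarity of 1, because the measure compares where each subreddit's authors also comment rather than who those authors are.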

NFL Subreddits - Average Similarity = 0.73

Performing the same analysis for NFL teams, the first thing to note is that the overall average similarity was much higher than for the sports subreddits above. In other words NFL fans of different teams behave very similarly. Not too surprising, but I’m not sure it’s something rival fans would be happy to admit. The light spots tended to arise for teams located close to each other geographically e.g. /r/oaklandraiders and /r/chargers, /r/minnesotavikings and /r/GreenBayPackers, /r/Saints and /r/Tennesseetitans. Unsurprisingly /r/StLouisRams (a defiantly still-active subreddit with the tagline “Never Forget”) was very closely related to /r/LosAngelesRams.

A few teams did emerge that were strikingly different from most other teams (darker lines on the plot), namely /r/Seahawks, /r/CHIBears and /r/49ers.

NBA Subreddits - Average Similarity = 0.73

Overall similarity was the same for NBA teams as NFL teams, although NBA team subreddits appeared to be less driven by geography. The NBA is often described as more of a ‘lifestyle’ league, corroborated by the finding in the fivethirtyeight article that /r/sneakers is closely related to /r/nba. Perhaps this leads fans to be less defined by their region, and more by a broader national culture. Again, curiously, a Chicago team is noticeably different from most other teams.

Premier League Subreddits - Average Similarity = 0.58

Finally, the Premier League. Interestingly, the 'top six' formed a tight-knit set, with /r/reddevils, /r/MCFC, /r/LiverpoolFC, /r/Gunners, /r/coys and /r/chelseafc all having similarities above 0.8. For all these fans' apparent differences, they behave in remarkably similar ways on Reddit. Despite this, the overall similarity for the league is significantly lower than for the NFL and NBA, although this is driven largely by three of the least followed teams (Hull City, Burnley and Bournemouth).

This may be another function of the international nature of Premier League fans, particularly on Reddit, where the skew away from the UK will accentuate the bias. The smaller team subreddits are more likely to be populated by local fans who are significantly different to international fans, driving the overall similarity down.

Final Thoughts

It was interesting to get an idea of some macro trends across sports subreddits – the volume and share of activity across sports and teams, and some insight into the behaviour of different groups of users – but equally exciting (for me, at least…) was being able to tap into a massive dataset and produce meaningful results relatively easily. If you were interested in the analysis, please feel free to build on it (there is scope to go much deeper, even encompassing the content of each comment), or reach out with any questions or comments.