This turned out to be a relatively simple piece, if extremely slow. It can be broken down into three main stages:

1. Data collection
2. Analyses
3. Visualization

1. Data collection

The first step was to find a list of the top subreddits. They are nicely curated here in one table, ready to extract.

Now that I had the list of subreddits I wanted to target, I needed to scrape the top posts and comments from each. Luckily, Reddit makes this easy through API calls with PRAW. It is possible to extract up to 500 post IDs from each subreddit, and once you have a post ID you can extract all of the comments on that post.

So all it took was to iterate through:

Each of the top subreddits (100)

Through each of hot post (500)

Through each comment (loads)
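The nested loop above can be sketched with PRAW roughly as follows. This is a minimal sketch, not the post's actual code: the `client_id`/`client_secret` values are placeholders you would fill from your own Reddit app credentials, and `subreddit_names` stands in for the scraped top-100 list.

```python
def collect_comments(subreddit_names, post_limit=500):
    """Yield (subreddit, post_id, comment_body) for every comment
    on each hot post in each of the given subreddits."""
    import praw  # imported lazily so the sketch reads without PRAW installed

    # Placeholder credentials - substitute your own Reddit app details.
    reddit = praw.Reddit(
        client_id="...",
        client_secret="...",
        user_agent="top-subreddit-scraper",
    )
    for name in subreddit_names:                              # each top subreddit
        for post in reddit.subreddit(name).hot(limit=post_limit):  # each hot post
            post.comments.replace_more(limit=0)  # drop "load more comments" stubs
            for comment in post.comments.list():              # each comment (loads)
                yield name, post.id, comment.body
```

Because this is a generator, comments stream out as they are fetched rather than piling up in memory first.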

Seems simple right?

The complication comes from the fact that the API (or my shitty code) is not particularly fast. Each post took ~2 seconds to extract all the comments, so for 50,000 posts (100 subreddits x 500 posts/subreddit) it took almost 28 hours of run-time to collect it all. This process was not helped by numerous false starts, including bugs in the code and laptop issues.

I will share the code and data in the GitHub repo at the end.

2. Analyses

Once the data was collected, I did some preprocessing of the comment corpus to strip special characters ([.;:!\'?,\"()\[\]]) and formatting.
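The preprocessing step can be sketched with a simple regex. The character class is taken from the post; the helper name and the choice to also collapse whitespace are my own assumptions about what "formatting" cleanup means here.

```python
import re

# Punctuation class quoted in the post.
SPECIAL_CHARS = re.compile(r"[.;:!\'?,\"()\[\]]")

def clean_comment(text: str) -> str:
    """Strip special characters and collapse leftover whitespace."""
    text = SPECIAL_CHARS.sub(" ", text)  # replace punctuation with spaces
    return " ".join(text.split())        # collapse runs of whitespace/newlines
```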

From here on out it was quite simple: I used TextBlob to extract the sentiment from each comment corpus. While this might not be the most accurate package, it is useful when I have unlabeled data and don’t want to go into training something myself.
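Getting a score out of TextBlob is a one-liner. A minimal sketch, assuming TextBlob is installed; the wrapper function is my own, and polarity is TextBlob's pattern-based score in [-1, 1], not a model trained on this data.

```python
def comment_polarity(text: str) -> float:
    """Return TextBlob's sentiment polarity (-1 to 1) for a cleaned comment."""
    from textblob import TextBlob  # lazy import so the sketch reads without TextBlob
    return TextBlob(text).sentiment.polarity
```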

One thing I did notice was that it tended to keep everything tightly centred around 0 (neutral). This might be because each post has a large number of comments that average out - or it could be a quirk of the package. Either way, I decided to counteract this by transforming the data to be a bit more extreme.

The formula I used was

Sentiment(skew)= Sentiment*(1+abs(Sentiment))

This was useful to pull the data away from narrow peaks at 0.
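The transform above is a pure function of the raw score, so it can be written directly. Note that for inputs in [-1, 1] the output range widens to [-2, 2]; the function name is my own.

```python
def skew_sentiment(s: float) -> float:
    """Push a polarity score away from the neutral peak at 0.

    Implements Sentiment * (1 + abs(Sentiment)): small scores move
    little, while scores near +/-1 are amplified towards +/-2.
    """
    return s * (1 + abs(s))
```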