For our final Data Science Lab project, we wanted to classify which subreddits various posts belonged to. Since the mid-2000s, we have seen internet culture spill into and overtake popular culture. Due to the speed and accessibility of the internet, websites such as Reddit can reach millions of people every day. With a subreddit for almost every topic, Reddit enjoys mass appeal from “normies” and internet veterans alike.

Although the uses for our project may seem superficial at first, classifying posts to subreddits could be very helpful for targeted advertising, as one could find out which topics most interest a certain user. One recent example is the fast-food chain Wendy’s, which gained a lot of mainstream attention through its “savage” tweets replying to Twitter users with snarky comebacks. Another is Arby’s, which successfully ran a social media campaign referencing “nerd culture” such as video games and anime. By classifying posts to subreddits, we hope to gain insight into users and how their interests lead them to follow certain topics.

SCRAPING REDDIT

Reddit provides an API that allows you to send a request and receive a JSON response containing crucial information about a post. However, this did not come without issues. One hindrance is that Reddit throttles requests to one every two seconds, which led to a lot of sleep statements in our code. Because of this, we had to reduce the scope of our scraping: originally we wanted the top 1000 posts from the top 1000 subreddits, but we cut this down to the top 1000 posts from the top 100 subreddits. We also ran into frequent disconnections from the Reddit servers, so we distributed the scraping across all team members’ computers. Additionally, the Reddit API documentation has not been updated in over two years, and in that time Reddit has added multiple JSON fields to its API, making it unclear what information some fields carry.
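
The core of such a rate-limited scraping loop looks roughly like the sketch below. The endpoint shown is Reddit’s public JSON listing format; the exact code we ran differed.

```python
import time
import requests

HEADERS = {"User-Agent": "subreddit-classifier-scraper/0.1"}  # Reddit requires a descriptive user agent

def scrape_top_posts(subreddit, limit=100):
    """Fetch the top posts of a subreddit as JSON, respecting the rate limit."""
    url = f"https://www.reddit.com/r/{subreddit}/top.json"
    response = requests.get(url, headers=HEADERS, params={"limit": limit, "t": "all"})
    response.raise_for_status()
    time.sleep(2)  # Reddit throttles requests to roughly one every two seconds
    return response.json()["data"]["children"]

posts = scrape_top_posts("nba")
```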

Once the data was fully scraped, we ran into other issues. First, some JSONs had missing values that had to be accounted for. For example, a post may have had a lot of upvotes, but the user may have deleted their account; our scraper initially crashed multiple times because we forgot to account for these edge cases. Second, Reddit did not remove special character encodings from the posts, so we had to filter those encodings out via regex. Likewise, many of the top-voted posts linked to one another in the post body, so additional regex cleaning was required to remove links. Lastly, some subreddits were notorious for extremely long posts (r/nosleep was a repeat offender) that overflowed the character limit of an Excel cell; the extra text spilled over into other rows, ruining the formatting of our CSV. After a lot of trial and error, including eight hours of manually attempting to clean thousands of offending posts and dealing with multiple Excel crashes, we simply truncated the overflowing posts. An example is included below.

Example of overflowed text
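
The cleaning described above boils down to a few regex passes plus a hard cut-off. A rough sketch; the exact patterns we used differed:

```python
import re

MAX_CELL_LENGTH = 32767  # Excel's per-cell character limit

def clean_post_text(text):
    """Strip special character encodings and links, then truncate to fit an Excel cell."""
    text = re.sub(r"&amp;|&lt;|&gt;|&#\w+;", " ", text)  # leftover HTML entity encodings
    text = re.sub(r"https?://\S+", " ", text)            # links embedded in the post body
    text = re.sub(r"\s+", " ", text).strip()             # collapse whitespace
    return text[:MAX_CELL_LENGTH]
```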

On top of text, we also wanted to scrape images. Although it was simple to scrape images that were part of the Reddit posts themselves, many images were linked within a post (such as from Imgur) without being part of it, so we used a separate script to download the images from these links. We scraped data from the top 10 image-based subreddits for a total of 8586 training images and 6964 validation images. After all the images were downloaded, we cleaned the set by deleting reposts and dead links, like the one shown below.

FEATURE EXPLORATION

As part of exploring our data set, we used the wordcloud Python library to create word-cloud graphics of the most common words in the titles and texts of the Reddit posts we scraped. We created word clouds of the most common words in the entire data set, as well as for certain subreddits, to find interesting patterns. Shown below are the word clouds for all text posts on the left and for the NBA subreddit on the right.
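
Generating one of these clouds takes only a few lines with the wordcloud library. A minimal sketch, assuming the scraped posts live in a pandas DataFrame df with subreddit and title columns:

```python
from wordcloud import WordCloud

# Join every scraped title for one subreddit into a single string
text = " ".join(df.loc[df["subreddit"] == "nba", "title"])

cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
cloud.to_file("nba_wordcloud.png")
```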

FEATURE ENGINEERING

As stated before, we ran into NaN values in some of our JSONs. To account for NaN booleans, we computed the distribution of true and false among the non-NaN booleans and sampled a Bernoulli random variable with the same distribution to fill in the NaN values. We also scraped comment karma and upvotes, but these were sometimes empty; we replaced those empty values with the median karma value.
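
A sketch of this imputation, with illustrative column names:

```python
import numpy as np

def fill_nan_boolean(col):
    """Fill NaN booleans by sampling a Bernoulli variable matching the observed rate."""
    p_true = col.dropna().astype(float).mean()        # observed fraction of True
    col = col.copy()
    mask = col.isna()
    col[mask] = np.random.rand(mask.sum()) < p_true   # Bernoulli(p_true) draws
    return col.astype(bool)

# Column names here are illustrative, not Reddit's exact field names
df["over_18"] = fill_nan_boolean(df["over_18"])
df["karma"] = df["karma"].fillna(df["karma"].median())   # median-fill numeric fields
```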

Another feature we added was whether a post came from one of the top 25 accounts on Reddit. Certain users tend to post in particular categories of subreddits more than others, so knowing who authored a post hints at which subreddit it was posted to. However, the account name is a categorical value, and most users appear only a handful of times, so we did not want to dummy-code all of the users: this would result in a DataFrame too large to run models on, and many of the rarer users’ posts are noise with no predictive value. After looking through the data, we found that those 25 accounts were responsible for 8% of all the posts we gathered, and we believed these frequent posters would favor certain subreddits.
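
The feature itself reduces to a membership test. A sketch, assuming the DataFrame has an author column:

```python
# Placeholder names; the real list was the top 25 accounts on Reddit
TOP_25_ACCOUNTS = {"account_1", "account_2", "account_25"}

# Binary feature: was the post authored by one of the top 25 accounts?
df["from_top25_account"] = df["author"].isin(TOP_25_ACCOUNTS)
```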

The final feature we added was the sentiment of the title of the posts. To do this, we used a tool from TextBlob that took in a string and output two features: subjectivity and polarity. Subjectivity ranged from 0.0 to 1.0 and measured how opinionated a post was. Polarity ranged from -1.0 to 1.0 and measured how negative or positive a post was.
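
A minimal sketch of extracting these two features with TextBlob:

```python
from textblob import TextBlob

def title_sentiment(title):
    """Return (polarity, subjectivity) for a post title."""
    sentiment = TextBlob(title).sentiment
    return sentiment.polarity, sentiment.subjectivity

df["polarity"], df["subjectivity"] = zip(*df["title"].map(title_sentiment))
```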

We plotted the average sentiment for all of the subreddits and found that there was indeed a general trend between sentiment and subreddit. For example, the most negative subreddit was r/mildlyinfuriating and the most positive was r/wholesomememes. In terms of subjectivity, r/me_irl was the most objective (as it consists mostly of images) and r/showerthoughts was the most subjective (it indeed consists of random opinions). We eventually incorporated these features into a stacking model.

METRICS AND HEURISTICS

Because we had 100 possible classifications, many of them with overlapping themes, we decided to create a scheme to cluster these subreddits based on the similarity of themes. For example, r/pcmasterrace and r/leagueoflegends would probably have a lot of overlapping content since they’re both video game related, and therefore should not be penalized as much if posts are classified as either one. However, r/wholesomememes and r/nsfw would probably not have a lot of overlapping content and therefore should be penalized heavily.

A fraction of the loss matrix; the full matrix can be found here: https://drive.google.com/open?id=1jqcFiCFlBYR0yZWjsfwdvEo_Xyx8wGy4

To implement this idea, one of our team members manually grouped the subreddits into 13 clusters and created a 100x100 loss matrix. The figure below depicts a subjective “distance” between clusters: whiter values represent a far distance, while darker values represent little to no distance. One thing to note about how we built this matrix is that we gave NSFW a high misclassification penalty, because if anything else gets mistaken for NSFW, then clearly the model isn’t working. We designed the loss so that the matrix would have an average value of about 0.5, a “good” model would incur about 0.2 average loss, and a perfectly classified post would incur 0. The average value of the matrix ended up being 0.55.

Visualization of loss based on our subjective clustering of the subreddits
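
Scoring a model against this matrix is then a simple lookup. A minimal sketch, assuming the subreddits are integer-encoded so they can index the matrix:

```python
import numpy as np

# loss_matrix: our hand-built 100x100 matrix, where loss_matrix[i, j] is the
# penalty for predicting subreddit j when the true subreddit is i
# (0 on the diagonal, small within a cluster, large across clusters)

def average_loss(y_true, y_pred, loss_matrix):
    """Mean penalty of a batch of predictions under the custom loss matrix."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return loss_matrix[y_true, y_pred].mean()
```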

MODELS

Natural Language Processing

All posts on Reddit must have a title, so while many of our other features had missing values for some (if not most) entries, we were able to scrape and clean the text of every post title. From here, we vectorized the cleaned titles, creating a very sparse word-count matrix containing over 60,000 features. Afterward, we normalized the matrix using sklearn’s TfidfTransformer, making our data ready to input into a model. We first created a Multinomial Naive Bayes model and tuned its one parameter, which controls smoothing, to our data set. We were relatively happy with this model’s performance but wanted to improve it further, so we created a Support Vector Machine trained with Stochastic Gradient Descent. This model gave us many more parameters to work with, and we tuned its iteration, loss, and regularization parameters to improve our accuracy. After training this multi-class classifier, we analyzed the accuracy of its predict_proba output and found that its top 10 predictions captured the correct subreddit 85% of the time. We then stacked this result onto our other features to train our final model.
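
A condensed sketch of this pipeline, assuming titles_train and y_train hold the cleaned titles and subreddit labels; the hyperparameter values shown are illustrative, not our tuned settings:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Naive Bayes baseline: alpha is the single smoothing parameter to tune
nb_model = Pipeline([
    ("counts", CountVectorizer()),       # sparse word-count matrix, 60k+ features
    ("tfidf", TfidfTransformer()),       # TF-IDF normalization
    ("clf", MultinomialNB(alpha=0.1)),
])

# Linear SVM trained with stochastic gradient descent; a probabilistic loss
# such as modified_huber is needed for predict_proba to be available
svm_model = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(loss="modified_huber", penalty="l2", alpha=1e-4)),
])

nb_model.fit(titles_train, y_train)
svm_model.fit(titles_train, y_train)
probs = svm_model.predict_proba(titles_val)   # one probability per subreddit
```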

To get a better interpretation of our results, we scored the predictions of the Naive Bayes model and the linear SVM against the loss matrix we created, with the following results.

The picture below indicates the loss for the Naive Bayes Model.

The picture below indicates the loss for the linear SVM model.

The first thing we noticed is that the simple Naive Bayes model, which assumes each word is independent given the prediction, performed far better than a random guess, which would have an expected loss of .99. We also noticed that a relatively large percentage of our misclassifications fall within the same cluster as the true label. Since there are 13 clusters, any within-cluster rate significantly larger than .07 indicates that the model is not only able to classify exact labels, but even when it misses, it predicts a class we deem similar a large percentage of the time.

The picture below indicates the coverage of our model using its top-n guesses

We can see that our model predicts correctly just above 50% of the time. However, if we take the output probability vector and look at its top 5 predictions, the correct answer is contained in those 5 (out of 100) guesses over 75% of the time. This means that, a majority of the time, our model can provide its top 5–10 guesses, and from there we can stack additional features and models knowing, with about 85% confidence, that the correct answer is among the guesses our SVM has made.
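
Computing this top-k coverage from the probability matrix is straightforward. A sketch, assuming integer-encoded labels that match the probability columns:

```python
import numpy as np

def top_k_coverage(probs, y_true, k):
    """Fraction of posts whose true subreddit is among the model's top-k guesses."""
    # argsort ascending, so the last k columns are the k highest-probability classes
    top_k = np.argsort(probs, axis=1)[:, -k:]
    return np.mean([y in row for y, row in zip(y_true, top_k)])

# e.g. top_k_coverage(probs, y_val, 5) -> ~0.75, top_k_coverage(probs, y_val, 10) -> ~0.85
```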

Another model we tried training on our dataset was Facebook’s fastText. fastText is a text classifier that can use English word vectors pretrained on Wikipedia and runs many times faster than other NLP classification methods. It uses a hierarchical softmax classifier with bag-of-words and bag-of-n-grams representations of the input text. We implemented fastText using both the original subreddits as targets and our subjective clusters as targets, with only post titles as inputs. Unsurprisingly, the model performed better using the clusters that the subreddits belonged to rather than the subreddits themselves, as this decreases the size of the target set by an order of magnitude. The model trained on clusters achieved an F1 score of 0.1650, indicating poor precision, recall, or both. We manually tested the model by giving it inputs typical of certain subreddits, for example, “IAMA data science student, AMA”. Its accuracy was poor on short titles but improved on longer ones.
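
A minimal sketch of training such a model with the fasttext Python package; file names and hyperparameters are illustrative, not our exact settings:

```python
import fasttext

# Training file format: one example per line, "__label__<cluster> <title text>"
model = fasttext.train_supervised(
    input="titles.train",
    epoch=25,
    wordNgrams=2,     # bag of bigrams on top of bag of words
    loss="hs",        # hierarchical softmax
)

print(model.predict("IAMA data science student, AMA"))
print(model.test("titles.valid"))   # (N, precision@1, recall@1)
```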

Stacking

On our first attempt at stacking our SVM model with the features we had scraped and engineered directly from the Reddit posts, we output the entire prediction probability matrix, creating 100 new features to concatenate onto our existing feature matrix. However, a Random Forest classifier run on this new DataFrame achieved only .15 accuracy, less than a third of what NLP achieved alone. We therefore decided to reduce the number of features added via stacking: after more exploration, we found that we could capture 75% accuracy (out of a possible 89% from the non-zero probabilities of the prediction vector) by including only the top 5 predictions of our NLP model. This greatly reduced the noise, and we were able to train a Random Forest classifier up to about .32 accuracy, significantly better than before but still short of NLP alone. At this point, we realized that without further cleaning of the post text, and given our inability to scrape data from embedded links, we were still a long way from stacking all of our models into one master model that outperforms any individual model. The images required separate scraping for image classification, and in doing so we lost the ability to port those findings back to our text-based scraping. We still believe that the model we achieved is significant: it performs 50x better than the baseline and shows a propensity to minimize gross misclassifications.
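
A sketch of the reduced stacking step; the exact representation of the top-5 columns here (indices plus probabilities) is illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# probs: (n_posts, 100) predict_proba output from the SGD SVM
# X: the scraped/engineered feature matrix (sentiment, top-25 author, etc.)
k = 5
top_k_idx = np.argsort(probs, axis=1)[:, -k:]            # indices of the 5 likeliest subreddits
top_k_prob = np.take_along_axis(probs, top_k_idx, axis=1)

X_stacked = np.hstack([X, top_k_idx, top_k_prob])        # 10 new columns instead of 100
forest = RandomForestClassifier(n_estimators=100).fit(X_stacked, y_train)
```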

Image Processing

The 8586 training images we scraped for image classification were taken from the top 1000 all-time posts from the top 10 image-based subreddits by subscriber count: r/pics, r/funny, r/getmotivated, r/earthporn, r/gaming, r/oldschoolcool, r/aww, r/food, r/art, and r/photoshopbattles. The 6964 validation images were taken from the 1000 most recent hot posts from each of these subreddits. Since our goal was to build a model on the most representative data from each subreddit and then classify typical posts, the training = top, validation = hot split made the most sense to us.

Some of the images in the set are easy to classify by human standards. “Oldschoolcool” posts generally consist of an older portrait of a person, “getmotivated” posts consist of encouraging text set on a background of nature or a person, “aww” posts are pictures of cute pets, and (the unfortunately named) “earthporn” subreddit consists of pictures of natural beauty. “Food” and “art” are self-explanatory.

Below is one of the top posts from r/getmotivated, and a good example of a typical post.

The other subreddits, however, present a bigger challenge. r/pics and r/funny posts are almost random in the type of content they include, and funny images in particular mean little without context from the post title or comments. r/photoshopbattles is a subreddit where users post images to be manipulated in photo-editing software by commenters, usually for comedic effect; because of the subreddit’s open-ended nature, the image content often does not correlate with past content. Finally, the gaming subreddit sounds like it should be relatively straightforward to classify, but Reddit’s userbase has a higher-than-average number of gamers, so r/pics, r/funny, and r/photoshopbattles are rife with gaming-related content.

To illustrate the confusion between subreddits, below is the second most upvoted post of all time from r/pics. Without knowing any better, one might guess that this image was taken from r/earthporn.

Another example, shown below, is the top all-time post from r/funny.

Just kidding. This image was posted on r/photoshopbattles. Even a human would be hard pressed to correctly classify this picture.

The examples above illustrate how difficult subreddit image classification can be even for human subjects. To evaluate the performance of our models, we set a baseline accuracy that a rational human can achieve. Given that the six subreddits we discussed first (r/aww, r/earthporn, r/oldschoolcool, r/food, r/art, and r/getmotivated) are generally easier to classify than the others, our naive assumption is that a human subject would correctly classify all posts from those subreddits and misclassify all posts from the other subreddits, for a classification accuracy of 60%. Of course, this figure is only a crude estimate, and the true human accuracy rate is most likely lower.

We decided to use a convolutional neural network to classify the images, as CNNs are the most robust and well-developed computer vision classification method. We trained two separate models based on this Keras implementation of a dog-vs-cat classifier: https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html. In both instances, we preprocessed the images using Keras’ ImageDataGenerator to rescale and randomly transform the images.

The first model we built was a shallow five-layer network with three convolutional layers. We used a softmax output layer and trained the model with categorical cross-entropy loss. Over several training iterations, we varied the learning rate between 0.1 and 0.001, the batch size between 16 and 40, and the number of epochs between 25 and 100. The results of this simple model weren’t stellar: the best configuration, batch size 16 over 50 epochs with a learning rate of 0.001, achieved a final validation accuracy of 36.98%. Since this type of model did not seem promising, we turned to a neural net that had already been trained.
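
A sketch of this kind of shallow network in Keras, following the tutorial linked above; the layer widths and directory layout are assumptions, not our exact architecture:

```python
from keras import optimizers
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from keras.preprocessing.image import ImageDataGenerator

# Three conv layers plus two dense layers
model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(150, 150, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(32, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(64, activation="relu"),
    Dropout(0.5),
    Dense(10, activation="softmax"),              # one output per image subreddit
])
model.compile(optimizer=optimizers.RMSprop(lr=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Directory layout (one folder per subreddit) is an assumption
datagen = ImageDataGenerator(rescale=1. / 255)
train_gen = datagen.flow_from_directory("data/train", target_size=(150, 150),
                                        batch_size=16, class_mode="categorical")
val_gen = datagen.flow_from_directory("data/validation", target_size=(150, 150),
                                      batch_size=16, class_mode="categorical")
model.fit_generator(train_gen, epochs=50, validation_data=val_gen)
```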

The second model we built was a simple two-layer classifier on top of VGG16. Since VGG16 is a former state-of-the-art CNN trained on many classes of natural images, the bottleneck features from the end of the network contain high-level descriptive attributes for many types of objects and shapes, and should generalize well to new classes of images. We could replace VGG16’s old output layer with our classifier and train only that last piece of the net, using the bottleneck features of the images in our dataset as inputs. This method is known as transfer learning and is useful for obtaining good classification results with a short amount of training by building on an already-established CNN.

The classifier we trained had two layers with a final softmax output. Once again, we used cross-entropy loss to train. Instead of manually adjusting the learning rate, we used Keras’ Adam optimizer, which adapts the learning rate automatically and achieves good results. We once again tweaked the batch size and number of epochs. One of the nets, with batch size 16 and 50 epochs, achieved a final validation accuracy of 47.80%.
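
A sketch of the transfer-learning setup; the dense layer width and directory layout are assumptions, and the generators must not shuffle so that the saved features stay aligned with the labels (y_train and y_val are assumed to be one-hot label arrays in the same order):

```python
from keras.applications import VGG16
from keras.models import Sequential
from keras.layers import Flatten, Dense, Dropout
from keras.preprocessing.image import ImageDataGenerator

# VGG16's convolutional base, without its old fully connected top
base = VGG16(include_top=False, weights="imagenet")

datagen = ImageDataGenerator(rescale=1. / 255)
train_gen = datagen.flow_from_directory("data/train", target_size=(150, 150),
                                        batch_size=16, class_mode=None, shuffle=False)
val_gen = datagen.flow_from_directory("data/validation", target_size=(150, 150),
                                      batch_size=16, class_mode=None, shuffle=False)

# Run every image through VGG16 exactly once and keep the bottleneck features
bottleneck_train = base.predict_generator(train_gen)   # shape (n, 4, 4, 512) for 150x150 inputs
bottleneck_val = base.predict_generator(val_gen)

# Our small two-layer classifier, trained on the saved features only
classifier = Sequential([
    Flatten(input_shape=bottleneck_train.shape[1:]),
    Dense(256, activation="relu"),
    Dropout(0.5),
    Dense(10, activation="softmax"),
])
classifier.compile(optimizer="adam", loss="categorical_crossentropy",
                   metrics=["accuracy"])
classifier.fit(bottleneck_train, y_train, epochs=50, batch_size=16,
               validation_data=(bottleneck_val, y_val))
```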

Clearly the model started to overfit, as we can see in the bottom chart of validation loss vs. epochs. To decrease our model’s variance, we introduced L2 regularization and tweaked the lambda value. The model with the best mix of high validation accuracy and low cross-entropy used 25 epochs, batch size 48, and a lambda of 0.00075, achieving a validation accuracy of 51.46%.
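
Continuing the sketch above, the L2 penalty amounts to attaching a kernel regularizer to the dense layers (placing it on both layers is an assumption):

```python
from keras import regularizers
from keras.models import Sequential
from keras.layers import Flatten, Dense, Dropout

classifier = Sequential([
    Flatten(input_shape=bottleneck_train.shape[1:]),
    Dense(256, activation="relu",
          kernel_regularizer=regularizers.l2(0.00075)),   # lambda = 0.00075
    Dropout(0.5),
    Dense(10, activation="softmax",
          kernel_regularizer=regularizers.l2(0.00075)),
])
classifier.compile(optimizer="adam", loss="categorical_crossentropy",
                   metrics=["accuracy"])
classifier.fit(bottleneck_train, y_train, epochs=25, batch_size=48,
               validation_data=(bottleneck_val, y_val))
```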

The downward trend in the validation loss before settling was a good indication that our model indeed had not overfitted to the training set. The normalized loss matrix we obtained from this model on the validation data is shown below.

From the matrix, the main mislabeling occurred when mistaking aww for photoshopbattles, pics for earthporn, and funny for getmotivated. Looking through the images for each of these classes, it is clear that there is a significant overlap in the content of these subreddits. The validation accuracy of just above 50% is significantly better than our earlier shallow model and approaches our assumed naive human rate.

CONCLUSION

Classifying subreddits is a task that involves many variables, most of them hidden in the text and metadata of the posts. On the surface, the task appears trivial, but when we analyze the data, we find that much of the post content is very similar across disparate subreddits. By classifying both text and image posts, we set out to build parts of a classification system that, when complete, could classify any subreddit within its domain with better-than-human accuracy. In the future, we plan to train GANs to simulate posts from specific subreddits, classify them, and uncover whether they are real or generated.