At QuillBot, we are building an artificial intelligence that can rephrase full sentences. We call it a smart thesaurus. During the development of this tool, we needed to extract large amounts of natural language data, analyze it, and determine which data is best for training our algorithms. Part of this process was analyzing the personality of different subreddits and their growth over time. We made a nifty, interactive data visualization for our fans that shows how subreddits have grown over time, clustered with other subreddits in the same genre.

This post is the first in a series of experiments that we are performing to see if we can use Reddit comments and machine learning to uncover linguistic patterns that differentiate internet communities, as well as how internet vocabulary (aka memes) spawns and reproduces.

We generated the data by counting comments per subreddit in the comment data provided by the good people at pushshift. After building a calendar of counts, we used r/ListOfSubreddits to group subs together. We made a handful of tweaks to the list to make the groups more equal in size. There are likely to be a few mishaps, and the list only covers a small portion of all the subs on this site. If you are interested in helping us expand our classification skeleton, we have a GitHub repo. We will post more aggregated data on the pushshift archives throughout the month.
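
For anyone who wants to try something similar, here is a minimal sketch of the counting and grouping steps. It is not our actual pipeline. It assumes the comments arrive as newline-delimited JSON with `subreddit` and `created_utc` fields, and the function names, the sample records, and the `sub_to_group` mapping are all made up for illustration.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def count_comments_by_month(lines):
    """Count comments per (subreddit, "YYYY-MM") pair.

    Assumes each line is one JSON object with `subreddit` and
    `created_utc` (Unix seconds) fields; this layout is an assumption.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        created = datetime.fromtimestamp(int(record["created_utc"]), tz=timezone.utc)
        counts[(record["subreddit"], created.strftime("%Y-%m"))] += 1
    return counts


def group_counts(counts, sub_to_group):
    """Roll per-subreddit counts up into per-group counts.

    `sub_to_group` is a hypothetical mapping from lowercase subreddit
    name to a genre label; unknown subs land in "ungrouped".
    """
    grouped = Counter()
    for (sub, month), n in counts.items():
        group = sub_to_group.get(sub.lower(), "ungrouped")
        grouped[(group, month)] += n
    return grouped


# Made-up sample records, only to show the shape of the input.
sample_lines = [
    '{"subreddit": "AskReddit", "created_utc": 1420070400}',
    '{"subreddit": "AskReddit", "created_utc": 1420070500}',
    '{"subreddit": "science", "created_utc": "1422748800"}',
    '{"subreddit": "askscience", "created_utc": 1422748900}',
]
sample_groups = {"science": "Science", "askscience": "Science"}

monthly = count_comments_by_month(sample_lines)
by_group = group_counts(monthly, sample_groups)
for key, n in sorted(by_group.items()):
    print(key, n)
```

Bucketing by month keeps the count calendar small enough to hold in memory. A real run over full archives would stream the compressed files rather than load lines from a list.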