With the president’s slowness to condemn neo-Nazis, his general comments about violence on “both sides” after Charlottesville, and the alt-right’s general support of the current president (at least on 4chan and Reddit), I was interested in seeing how these two digital conversations overlap. But how could we measure that overlap in a meaningful way? One idea we had was to look at the links that were being discussed. Reddit is a place of sharing: jokes, images, and URLs. Conversations, or boards, revolve around hyper-specific topics, and we wondered how much overlap there might be between the links that are shared and discussed in the two communities.

Our data was sourced from a database of public Reddit comments originally compiled by redditor u/Stuck_In_the_Matrix and subsequently migrated to BigQuery by another user, u/fhoffa.

We exported all available posts and comments from r/altright, which had its first post on Dec. 4, 2015, and r/The_Donald, which was created on June 27, 2015. The export runs from the start of each subreddit through April 2017 for r/The_Donald and through January 2017 for r/altright, which is when r/altright was banned for violating Reddit’s terms of use. (In retrospect, we should have limited our analysis to June 2015 through January 2017, the period when both forums were active simultaneously.) At the time of our analysis, in May 2017, r/The_Donald had 396,526 subscribers. Because r/altright was taken down and banned, a lot of its data was lost, and the final subscriber count was incredibly hard to find. What we did find was that in November 2016, r/altright had 7,625 subscribers. The BigQuery database is mostly complete, but there are occasional gaps, which we filled using a secondary source, a database maintained by the same redditor, u/Stuck_In_the_Matrix.

Francis wrote a series of Python scripts that process all post and comment text and extract any links referenced within. The scripts go over each post and its comments, look for strings of text that look like URLs, and save those strings to a separate file.
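To make the extraction step concrete, here is a minimal sketch of how URL-like strings could be pulled out of comment text. It is not Francis's actual script: the pattern `URL_PATTERN`, the function `extract_urls`, and the sample comment are all assumptions made for illustration, and a regular expression this simple will miss or truncate some links (for example, URLs that contain parentheses).

```python
import re

# Hypothetical pattern: an http(s) scheme followed by characters that are
# unlikely to end a link (whitespace, quotes, and closing brackets stop the match).
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text):
    """Return every URL-like string found in text, in order of appearance."""
    # Trailing punctuation usually belongs to the sentence, not the link.
    return [match.rstrip(".,;:!?") for match in URL_PATTERN.findall(text)]


# Made-up comment text, for illustration only.
comment = "See https://en.wikipedia.org/wiki/Meme and (http://i.imgur.com/abc.jpg)."
urls = extract_urls(comment)
```

In a full pipeline, a function like this would run over every post and comment, with the results appended to a separate file.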

We distinguished image links by looking for image extensions in each URL: if any common image file extension ("jpg", "jpeg", "gif", "png", or "gifv") was present in the URL, we tagged it as an image. There were so many images in the archive that we turned that image data set into another project focused specifically on studying visual culture and memes. Our Knight Prototype Fund Grant will fund research on those images. More on that later.
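The image check can be sketched in a few lines. The function name `is_image_link` is hypothetical, and this version makes one assumption beyond the description above: it only looks at the end of the URL's path, which is stricter than checking whether the extension appears anywhere in the URL.

```python
from urllib.parse import urlparse

# The extensions listed in the text above.
IMAGE_EXTENSIONS = ("jpg", "jpeg", "gif", "png", "gifv")


def is_image_link(url):
    """Return True if the URL's path ends in a common image extension."""
    # Parse out the path so query strings such as "?size=large" are ignored.
    path = urlparse(url).path.lower()
    return path.endswith(tuple("." + ext for ext in IMAGE_EXTENSIONS))
```

Checking the path ending avoids tagging a page such as a Wikipedia article titled "PNG" as an image, at the cost of missing image links that carry no extension at all.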

We extracted the top level domains (TLDs) of each link we collected. We expanded more popular TLDs, such as youtube.com (and variants like youtu.be), wikipedia.org, and reddit.com, so that we could group them by article or subreddit. These particular sites are so large and vary so much in content that the TLD alone provides little context. What we found most interesting were the links to other subreddits and to Wikipedia.
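As a rough illustration of this grouping step, the sketch below reduces a link to its host name (what the text above calls the TLD) and expands Wikipedia and Reddit links to the article or subreddit level. The function `link_group` and its output format are assumptions, not the project's actual code. How YouTube links were expanded is not described, so this sketch only folds youtu.be into youtube.com.

```python
from urllib.parse import urlparse


def link_group(url):
    """Map a URL to a grouping key: a domain, a Wikipedia article, or a subreddit."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Non-empty path segments, e.g. ["r", "some_subreddit", "comments", ...].
    parts = [segment for segment in parsed.path.split("/") if segment]

    # Treat the short-link variant as the main site.
    if host == "youtu.be":
        host = "youtube.com"

    is_wikipedia = host == "wikipedia.org" or host.endswith(".wikipedia.org")
    is_reddit = host == "reddit.com" or host.endswith(".reddit.com")

    if is_wikipedia and len(parts) >= 2 and parts[0] == "wiki":
        return "wikipedia.org/wiki/" + parts[1]
    if is_reddit and len(parts) >= 2 and parts[0] == "r":
        return "reddit.com/r/" + parts[1]
    return host
```

With this key, two comments linking to different threads in the same subreddit fall into one group, while links to different Wikipedia articles stay separate.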

Finally, we grouped and counted the outbound links and computed summary statistics on how they are used, namely the number of references made to specific URLs. The most important measure for this analysis was popularity, which we determined by the number of links referencing a specific site. For example, the more links to Wikipedia, the more ‘popular’ Wikipedia became in the graphs.
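The counting step can be sketched with Python's `collections.Counter`. The helper names `count_references` and `shared_groups` and the sample data are made up for illustration. The comparison of two communities shown here is one plausible way to look at overlap, not necessarily the statistic used in this analysis.

```python
from collections import Counter


def count_references(groups):
    """Count how many times each link group was referenced."""
    return Counter(groups)


def shared_groups(counts_a, counts_b):
    """Return the groups referenced in both communities, with each side's count."""
    return {
        group: (counts_a[group], counts_b[group])
        for group in counts_a.keys() & counts_b.keys()
    }


# Made-up sample data: each entry is the grouping key of one outbound link.
forum_a = count_references(
    ["wikipedia.org/wiki/Meme", "youtube.com", "wikipedia.org/wiki/Meme"]
)
forum_b = count_references(["youtube.com", "example.com"])

most_popular = forum_a.most_common(1)
overlap = shared_groups(forum_a, forum_b)
```

Here `most_common` ranks groups by how often they were linked, which matches the notion of popularity described above.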