Detecting Bots on Reddit

Overview

Here is my official report. Here is my code on GitHub.

In my last semester at the Hong Kong University of Science and Technology, I took an independent research course for credit under the supervision of Professor David Rossiter, which let me lead a semester-long solo project. The focus of my project was detecting Russian bots on Reddit. I built a classifier that analyzed many thousands of posts, comments, and user metadata records from a list of known Russian accounts. The classifiers performed well, with accuracy and precision often well above 0.80 (see my official report for a more detailed analysis). My paper is published on Prof. Rossiter's website.

Collecting Data

The first step was to collect the user data from Reddit. I had a list of 944 known Russian accounts from Reddit's 2017 Transparency Report, which I later used as the ground truth for my classifiers. These accounts made posts and comments starting in approximately April 2015, and some continued to make submissions as late as April 2018. I selected normal user accounts that were active during the same period as the Russian accounts. I extracted the following data for each user:

username: Username of the account.

created_utc: Day of account creation.

comments: All of the comments from the account. Each comment has body text and a timestamp.

posts: All of the posts from the account. Each post has a title, a description, and a timestamp.

comment_karma: The total number of upvotes for all of the comments from the user.

link_karma: The total number of upvotes for all of the posts from the user.

I wrote the extraction scripts in Python. To collect user metadata I used PRAW, a popular Python wrapper for the Reddit API. To collect user posts and comments I used a third-party API called Pushshift, which had no limit on how many comments and posts could be extracted (PRAW was limited to 1000). Finally, I stored all of the data locally in MongoDB, where I created the collections and data objects for User, Comment, Post, etc.
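The per-user record described above can be sketched as a few plain Python data classes. This is only an illustration of the shape of the data, not the project's actual schema: the field names username, created_utc, comments, posts, comment_karma, and link_karma come from the list above, while the subreddit and is_bot fields, the to_document helper, and the sample values are assumptions added for the example.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import List


@dataclass
class Comment:
    body: str
    created_utc: float   # Unix timestamp in seconds
    subreddit: str = ""  # assumed field, useful for subreddit-based features


@dataclass
class Post:
    title: str
    description: str
    created_utc: float
    subreddit: str = ""  # assumed field


@dataclass
class User:
    username: str
    created_utc: float
    comment_karma: int = 0
    link_karma: int = 0
    is_bot: bool = False  # assumed ground-truth label
    comments: List[Comment] = field(default_factory=list)
    posts: List[Post] = field(default_factory=list)


def to_document(user: User) -> dict:
    """Flatten a User into a plain dict that a document store could accept."""
    doc = asdict(user)  # nested Comment and Post objects become dicts too
    doc["created_date"] = (
        datetime.fromtimestamp(user.created_utc, tz=timezone.utc).date().isoformat()
    )
    return doc


# Made-up sample account, not real data.
example = User(username="example_user", created_utc=1430438400,
               comment_karma=12, link_karma=340)
example.comments.append(Comment(body="Hello", created_utc=1430524800))
doc = to_document(example)
print(doc["created_date"])  # 2015-05-01
```

A dict shaped like this is the kind of object a MongoDB collection stores, which is one reason a document database is a comfortable fit for nested comments and posts.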

Classification

Once I had collected the user data, I could build a classifier. I built classifiers on four attributes: post title, comment text, post subreddit, and comment subreddit. The comment text classifier saw mixed results, while all of the other methods had very high accuracy and precision. Detailed classification results are in my official report. Each of the visualizations below links to an interactive web page of my results; note that these pages load very slowly.
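To make the idea of classifying on post titles concrete, here is a minimal sketch of a bag-of-words text classifier together with the accuracy and precision metrics mentioned above. This page does not say which model the project used, so the multinomial Naive Bayes below is a stand-in chosen for brevity, not the method from the report. The class and function names and the toy titles are all invented for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / total)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in tokenize(text):
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy_and_precision(y_true, y_pred, positive="bot"):
    """Accuracy over all samples; precision for the positive class."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return correct / len(pairs), precision


# Toy, made-up titles purely for illustration.
train_titles = [
    "breaking news election scandal",
    "election fraud exposed news",
    "my cat learned a new trick",
    "help with python homework",
]
train_labels = ["bot", "bot", "normal", "normal"]
model = NaiveBayes().fit(train_titles, train_labels)

test_titles = ["election news today", "my cat homework"]
test_labels = ["bot", "normal"]
predictions = [model.predict(t) for t in test_titles]
accuracy, precision = accuracy_and_precision(test_labels, predictions)
```

The same structure would apply to the subreddit attributes by treating each subreddit name as a single token instead of splitting a title into words.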

Post Title Visualization: This graph shows the words in a post title that most strongly indicate whether the user is a bot or a normal user. The blue dots signify a normal user and the red dots signify a bot. The further to the right a word is, the more characteristic it is of the corpus.



Comment Text Visualization: This graph shows the words in the text of a comment that most strongly indicate whether the user is a bot or a normal user. The blue dots signify a normal user and the red dots signify a bot. The further to the right a word is, the more characteristic it is of the corpus.



Post Subreddit Visualization: This graph shows the subreddits in which a post is likely to originate from a bot or from a normal user. The blue dots are indicative of a normal user and the red dots are indicative of a bot. The further to the right a subreddit name is, the more common it is in the corpus.



Comment Subreddit Visualization: This graph shows the subreddits in which a comment is likely to originate from a bot or from a normal user. The blue dots are indicative of a normal user and the red dots are indicative of a bot. The further to the right a subreddit name is, the more common it is in the corpus.



Account Activity Analysis

These graphics show that the Reddit bot accounts were active during business hours in Moscow, while the activity of the normal Reddit accounts roughly follows the time zones of America. America has by far the most Reddit accounts.

These two graphs show the time of day at which comments were made by the bot accounts and the normal accounts. Times are in the GMT time zone (London).

(Figures: Normal users, Bot users)

These two graphs show the time of day at which posts were made by the bot accounts and the normal accounts. Times are in the GMT time zone (London).

(Figures: Normal users, Bot users)
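The hour-of-day comparison can be sketched as a simple histogram over timestamps. This is a minimal illustration rather than the code behind the graphs: the hourly_histogram function and the sample timestamps are made up, and Moscow time is modeled as a fixed UTC+3 offset, which holds for the 2015 to 2018 period covered here because Moscow did not observe daylight saving time in those years.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumption: fixed offset for Moscow (UTC+3, no daylight saving in 2015-2018).
MSK = timezone(timedelta(hours=3), "MSK")


def hourly_histogram(timestamps, tz=timezone.utc):
    """Return a 24-element list counting submissions per local hour in tz."""
    counts = Counter(datetime.fromtimestamp(ts, tz=tz).hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


# Made-up Unix timestamps: 07:00, 08:00 and 08:30 UTC on 2015-05-01.
sample_timestamps = [1430463600, 1430467200, 1430469000]

utc_hist = hourly_histogram(sample_timestamps)       # as plotted in GMT
msk_hist = hourly_histogram(sample_timestamps, MSK)  # shifted to Moscow time

print(utc_hist[7], utc_hist[8])    # 1 2
print(msk_hist[10], msk_hist[11])  # 1 2
```

Shifting the same timestamps into a candidate time zone makes it easy to check whether a peak in GMT lines up with local business hours.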

Number of Comments and Posts Per Account

These graphics show that the Reddit bot accounts typically have a higher number of posts than comments.

These two graphs show the number of comments per account for bots and normal users.

(Figures: Normal users, Bot users)

These two graphs show the number of posts per account for bots and normal users.

(Figures: Normal users, Bot users)
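The per-account counts behind graphs like these can be computed with a pair of counters. The sketch below assumes each submission has been reduced to a (username, kind) pair; that representation, the activity_per_account function, and the sample records are hypothetical and only illustrate the counting.

```python
from collections import Counter


def activity_per_account(records):
    """Map each username to a (post_count, comment_count) tuple.

    records is an iterable of (username, kind) pairs, where kind is
    either "post" or "comment"; other kinds are ignored.
    """
    posts, comments = Counter(), Counter()
    for username, kind in records:
        if kind == "post":
            posts[username] += 1
        elif kind == "comment":
            comments[username] += 1
    users = sorted(set(posts) | set(comments))
    return {user: (posts[user], comments[user]) for user in users}


# Made-up records for two hypothetical accounts.
records = [
    ("acct_a", "post"), ("acct_a", "post"), ("acct_a", "comment"),
    ("acct_b", "comment"), ("acct_b", "comment"),
    ("acct_b", "post"), ("acct_b", "comment"),
]
activity = activity_per_account(records)

# Accounts that post more than they comment, the pattern noted for bots above.
post_heavy = [user for user, (p, c) in activity.items() if p > c]
print(post_heavy)  # ['acct_a']
```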