Introduction

News about Russian Twitter bots influencing American elections has been on the front page for over a year now. There have been many studies and investigations of the Russian Twitter botnets that aim to identify, categorize and understand their network.

Twitter bots influencing American elections on the news

However, we hear very little about Twitter bots engaging in propaganda activities in other countries. Elections are high stakes in whichever country they occur, and we would expect to see similar propaganda activities conducted on social media. I set out to investigate whether this occurred in a recent presidential election in Indonesia, and indeed, I found propaganda bots accounting for up to about 57% of the accounts participating in a political topic.

In this article, I describe the methodology by which I identify Twitter bots, followed by an analysis of the results I obtained.

Methodology

Early on in my investigation, I realized that identifying Twitter bots by hand would not scale. A scalable method would be to use a machine learning model to identify bots; however, this would require a certain number of true labels (bot accounts vs human accounts) to train the model on.

In my first iteration, I decided to use the Cresci 2017 dataset to train a machine learning model to identify bots. I then built a Twitter scraper to collect account data and used the trained model to identify bots among the scraped accounts. Unfortunately, this method quickly ran into a couple of problems.

First, the Cresci 2017 dataset did not contain all the data that the Twitter API returns. This meant that if I were to create new features based on that data, the Cresci 2017 dataset could not support them, and a model trained on it would not be able to make use of them.

Second, the Cresci 2017 dataset seemingly had erroneous labels. In my testing, I found a few accounts that were labelled as bots in Cresci’s dataset but, on examination, appeared completely human by my standards.

Given these problems, I decided to drop the Cresci 2017 dataset. Instead, I proceeded with a bootstrapping method to build up my own labels and machine learning model in an efficient way.

To achieve that, I first built a platform that displays the scraped data for each Twitter account in an intuitive, easy-to-interpret fashion. The UI shows not only the basic data available on the account's Twitter page, but also derived features such as retweet ratio and posting activity by hour of day and day of week. This allowed me to quickly and accurately judge whether each account was a bot or a human, and optionally to assign it a label for the type of bot.
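To make the derived features more concrete, here is a minimal sketch of how retweet ratio and posting activity by hour and weekday could be computed from an account's scraped tweets. The function name `derive_features`, the dictionary keys `created_at` and `is_retweet`, and the sample tweets are all assumptions made for illustration; they are not the platform's actual code or the Twitter API's exact field names.

```python
from collections import Counter
from datetime import datetime, timezone


def derive_features(tweets):
    """Compute simple per-account features from a list of tweets.

    Each tweet is assumed (hypothetically) to be a dict with:
      'created_at': a datetime of when the tweet was posted
      'is_retweet': True if the tweet is a retweet
    """
    total = len(tweets)
    if total == 0:
        # No tweets: return empty histograms and a zero ratio.
        return {"retweet_ratio": 0.0, "by_hour": [0] * 24, "by_weekday": [0] * 7}

    retweets = sum(1 for t in tweets if t["is_retweet"])
    hour_counts = Counter(t["created_at"].hour for t in tweets)
    # datetime.weekday(): Monday is 0, Sunday is 6.
    weekday_counts = Counter(t["created_at"].weekday() for t in tweets)

    return {
        "retweet_ratio": retweets / total,
        "by_hour": [hour_counts.get(h, 0) for h in range(24)],
        "by_weekday": [weekday_counts.get(d, 0) for d in range(7)],
    }


# Made-up sample data, for illustration only.
sample_tweets = [
    {"created_at": datetime(2020, 1, 6, 9, 15, tzinfo=timezone.utc), "is_retweet": True},
    {"created_at": datetime(2020, 1, 6, 9, 40, tzinfo=timezone.utc), "is_retweet": True},
    {"created_at": datetime(2020, 1, 7, 21, 5, tzinfo=timezone.utc), "is_retweet": False},
]

features = derive_features(sample_tweets)
print(features["retweet_ratio"])
print(features["by_hour"])
print(features["by_weekday"])
```

An account that almost only retweets, or that posts evenly across all 24 hours, would stand out in these numbers, which is the kind of signal such a view is meant to surface at a glance. Note that the hour buckets depend on the timezone of the timestamps, so in practice they would likely need converting to the account's local time first.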