In late August, as tension grew in Hong Kong following months of street protests against the government and Beijing, Twitter, Facebook and Google announced that they had discovered and dealt with disinformation campaigns by Chinese state trolls on their platforms.

The campaigns were aimed at “the protest movement and their calls for political change”, Twitter said on August 19 as it released a trove of 3.6 million state troll tweets from what the company called “a coordinated state-backed operation”. Facebook and Google issued press statements but have yet to release datasets related to Beijing’s disinformation campaigns.

This is the second part of my examination of the troll tweets released by Twitter, following a quick exploratory dive last month. Here I’ll try to sift out the key words and rhetoric in the Chinese troll tweets, using a range of NLP and visualization tools.

KEY ASSUMPTIONS AND REPO

First things first: Here’s the repo for the project with the latest notebooks. The CSV files are too large to upload to GitHub. Download them directly from Twitter instead.

This dataset has over 3.6 million rows and tweets in 59 languages. And it is extremely noisy — cluttered with sports and porn-related tweets, and a running war of words between the state trolls and fugitive Chinese billionaire Guo Wengui.

The state troll accounts, now suspended, were also tweeting in multiple languages. For example, a troll who had set the account language to English could tweet in both English and Chinese, or in more languages still. Many of the accounts had also been dormant for a long time.

To make this project more manageable, I took the following steps:

Focus only on English and Chinese tweets (since they are primarily targeted at Hong Kongers), but making sure that I capture the English tweets from accounts with Chinese language settings, and vice versa.

Set 2017 as the start point for analysis, seeing that many regard the Russian 2016 disinformation campaign in the US as having inspired significant tactical changes in recent state disinformation efforts. The Chinese trolls have, of course, been active on Twitter far earlier than 2017.

Use obvious key terms, such as “Hong Kong” and “police”, as anchor points for active filtering. This clearly introduces selection bias into the analysis and visualization. But given the amount of noise in the dataset, I feel that this is an acceptable compromise.
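The three filtering steps above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code from the notebooks. The column names (`tweet_language`, `tweet_time`, `tweet_text`), the timestamp format, the language codes and the Chinese anchor terms are all assumptions, so check them against the header of the CSV files you download. The sample rows are made up for the demonstration.

```python
from datetime import datetime

# Assumed values: verify against the actual CSV before relying on them.
KEEP_LANGUAGES = {"en", "zh"}
START = datetime(2017, 1, 1)
# Illustrative anchor terms only; the Chinese terms are my own additions.
ANCHOR_TERMS = ("hong kong", "police", "香港", "警察")


def keep_tweet(row):
    """Return True if a row (a dict, e.g. from csv.DictReader) passes all three filters."""
    # 1. Filter on the language of the tweet itself, not the account setting,
    #    so English tweets from Chinese-language accounts (and vice versa) survive.
    if row["tweet_language"] not in KEEP_LANGUAGES:
        return False
    # 2. Keep only tweets from 2017 onwards (timestamp format is assumed).
    if datetime.strptime(row["tweet_time"], "%Y-%m-%d %H:%M") < START:
        return False
    # 3. Keep only tweets that mention at least one anchor term.
    text = row["tweet_text"].lower()
    return any(term in text for term in ANCHOR_TERMS)


# Made-up rows for illustration; they are not taken from the dataset.
sample = [
    {"account_language": "zh-cn", "tweet_language": "en",
     "tweet_time": "2019-06-14 09:30", "tweet_text": "Hong Kong police showed restraint"},
    {"account_language": "en", "tweet_language": "zh",
     "tweet_time": "2019-07-02 12:00", "tweet_text": "香港需要秩序"},
    {"account_language": "en", "tweet_language": "en",
     "tweet_time": "2015-03-01 08:15", "tweet_text": "Hong Kong police news"},
    {"account_language": "en", "tweet_language": "ru",
     "tweet_time": "2019-06-14 09:30", "tweet_text": "Hong Kong"},
    {"account_language": "en", "tweet_language": "en",
     "tweet_time": "2019-06-14 09:30", "tweet_text": "Big match tonight"},
]

filtered = [row for row in sample if keep_tweet(row)]
```

Filtering on the tweet's own language rather than the account setting is what keeps the cross-language tweets in scope. The anchor-term check is also where the selection bias mentioned above enters, since anything that avoids those terms is dropped.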

THE SIGNAL AND THE NOISE: KEY FINDINGS

#1. Very low signal-to-noise ratio

Without aggressive filtering, it is hard to discern much from the dataset. This is true of both the English and Chinese tweets (as well as the retweets). Take, for example, the chart below, which shows the frequency distribution of the 50 most common terms after filtering lightly for tweets involving Guo: