We used a combination of machine and human translation to translate the keywords to English and analyzed the context behind each one. Based on interpreting these translations with contextual information, we coded each keyword into content categories grouped under six general themes according to a code book we developed in previous work (see Table 5).

Theme | Example Categories
Event | Scheduled events, recurring events, current events
Political | Communist Party of China, religious movements, ethnic groups
People | Government officials, dissidents
Social | Gambling, illicit goods and services, prurient interests
Technology | General technical terms, URLs, applications and services
Misc | Keywords with no clear context that cannot be classified under other themes

Table 5: Content Themes and Related Categories

Figure 5 shows the distribution of themes across the three applications (normalized by total number of keywords on each app). In the following sections, we examine each theme in detail.

Figure 5: Distribution of themes across the three applications
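The normalization behind Figure 5 can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the pipeline used in the study: the function name `theme_distribution` and the toy input are hypothetical.

```python
from collections import Counter


def theme_distribution(coded_keywords):
    """Return each theme's share (in percent) of one app's coded keywords.

    `coded_keywords` is an iterable of (keyword, theme) pairs for a single app.
    """
    counts = Counter(theme for _keyword, theme in coded_keywords)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Normalize by the total number of keywords on this app.
    return {theme: 100.0 * n / total for theme, n in counts.items()}


# Hypothetical toy input, not data from the study.
toy_app = [
    ("kw1", "Social"),
    ("kw2", "Social"),
    ("kw3", "Event"),
    ("kw4", "Political"),
]
shares = theme_distribution(toy_app)
print(shares)
```

Normalizing per app in this way makes the themes comparable across applications whose keyword lists differ greatly in size.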

Social

The Social theme is divided into three categories: gambling (e.g., online casinos), illicit goods and services (e.g., narcotics, weapons, counterfeit products), and prurient interests (e.g., sexuality, pornography, prostitution). Figure 6 shows the percentage of Social theme keywords by category (normalized by the total number of Social theme keywords in each app).

The Social theme accounts for the highest percentage of keywords on each application relative to other themes (Sina Show: 59%, 9158: 50%, YY: 44%). The focus on this theme may reflect a reaction of the companies to the new regulatory campaigns that specifically target pornography, drugs, and weapons.

Figure 6: Percentage of Social theme keywords by category

Events

The Event theme includes references to 20 distinct events. We correlate the timing of keyword list updates with events that happened within our collection period and find reactive censorship driven by current events.

Reporting on current events in China is tightly controlled by government authorities. Media organizations are routinely provided directives on how to report the news. China Digital Times, an independent media group, occasionally publishes leaked directives sent to Chinese news organizations, which provide a glimpse into how this system works. There have also been leaks from social media companies, such as Sina Weibo, which describe censorship instructions from company managers that purportedly correspond to state directives. However, it is unclear in what form or at what frequency directives are provided and if companies receive the same ones.

YY has the largest number of event-related keywords (632 keywords) compared to 9158 (33 keywords) and Sina Show (31 keywords). YY also referenced more unique events (15) than 9158 (8) or Sina Show (8). We found similar results in our previous collection period, in which YY also referenced the most events and included the highest number of Event keywords.

We compare unique keywords (those that have not previously appeared on the keyword lists) across the platforms by Jaccard similarity and by our own similarity metric (see Table 6). Our results show no overlap in event keywords between YY and the Tian Ge-operated applications, which suggests that either no common directives are provided to these companies or compliance with directives varies. However, we do see close similarity between the Event keywords on Sina Show and 9158.

Metric | 9158 versus Sina Show | YY versus 9158 | YY versus Sina Show
Jaccard similarity | 73.52% | 0% | 0%
max(% of x in y, % of y in x) | 92.59% | 0% | 0%

Table 6: Additions to Event keyword lists compared by Jaccard similarity and similarity(x, y) = max(% of x in y, % of y in x)
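The two measures reported in Table 6 can be expressed as a short Python sketch. The function names `jaccard` and `max_overlap` and the toy keyword sets are hypothetical and serve only to illustrate the definitions; the handling of empty lists is an assumption.

```python
def jaccard(x, y):
    """Jaccard similarity |x ∩ y| / |x ∪ y|, as a percentage."""
    x, y = set(x), set(y)
    union = x | y
    # Assumption: two empty lists are treated as having 0% similarity.
    return 100.0 * len(x & y) / len(union) if union else 0.0


def max_overlap(x, y):
    """similarity(x, y) = max(% of x in y, % of y in x)."""
    x, y = set(x), set(y)
    if not x or not y:
        return 0.0
    shared = len(x & y)
    return 100.0 * max(shared / len(x), shared / len(y))


# Hypothetical toy lists, not data from the study.
list_a = {"k1", "k2", "k3", "k4"}
list_b = {"k3", "k4", "k5"}

print(jaccard(list_a, list_b))      # 2 shared out of 5 distinct keywords
print(max_overlap(list_a, list_b))  # larger of 2/4 and 2/3
```

The second measure stays high when one list is largely contained in the other, even if the lists differ in size, whereas Jaccard similarity is lowered by every keyword that appears on only one list.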

Only three events are referenced by all applications (the June 4, 1989 Tiananmen Square Massacre, the sentencing of Zhou Yongkang, and the Hague Verdict on the South China Sea arbitration).

Sina Show and 9158 reference the same 7 events. The only difference between them is that 9158 references the Tianjin Explosion and Sina Show references the Cultural Revolution. Event updates on the two applications are often made within the same period and sometimes on the same day. The close similarities between these applications can be explained by common ownership. However, the lack of complete overlap in event-related keywords between the platforms shows that they still do not share an identical list.

Figure 7: Percentage of Event theme keywords by category

Below we examine the three events that all three applications referenced.

The June 4, 1989 Tiananmen Square Massacre remains one of the most taboo events in China. Reactive censorship on social media in China often accompanies the anniversary, and the Chinese government continues to push revisionist narratives of what happened.

Between late May and the first week of June 2015, leading up to the 27th anniversary of the Tiananmen Square Massacre, YY added 525 keywords related to the event. Comparatively, 9158 and Sina Show each added 3 keywords on dates that did not fall close to the anniversary.

In our previous data collection period, YY keyword lists also had a heavy focus on June 4, accounting for over 90% of YY’s event keywords and 32% of YY’s lists overall. June 4 related keywords on YY’s lists include a number of ways to refer to the event, including numerals ("89VIIV"); homonyms (陆4, “Land 4”; the character 陆 (Lù) sounds similar to six, 六 (Liù), in Chinese); locations of annual memorial events (维园烛光, “Victoria Park Candle”); and references to recent discussion of the event such as “Trump June 4” (川普六四), which is likely related to Donald Trump referring to Tiananmen Square as a “riot” in an election debate.

On June 11, 2015, Zhou Yongkang, who was once one of China's most powerful political figures, was sentenced to life in prison on corruption charges. On June 11, YY added 23 keywords related to the sentencing (e.g., 無期徒刑 "life imprisonment"). Prior to the date of the sentencing, Sina Show and 9158 added references to associates of Zhou who were also implicated in his corruption case, including former PLA general Xu Caihou (徐才厚) and former Party official Ling Jihua (令计划).

In a case known as the South China Sea Arbitration, the Philippines, under provisions of the United Nations Convention on the Law of the Sea, brought complaints against China over territorial claims in the South China Sea.

On July 12, 2016, an international tribunal in the Hague ruled in favour of the Philippines and concluded that China has no legal basis to claim historical rights in the South China Sea. China rejected the ruling. On the same day, Sina Show added two keywords (南海仲裁 "South China Sea Arbitration", 海牙 “Hague”) and 9158 added one (南海仲裁 “South China Sea Arbitration”). On July 13, YY added two keywords, one referencing China’s rejection of the ruling (习总的拒绝 “President Xi's rejection”) and another related to a fake news story that went viral on Chinese social media following the verdict, which claimed China and the Philippines had declared war on each other and the Chinese army successfully wiped out a unit of the Philippine Air Force (全歼菲方空军 “Wipe out the Philippine Air Force”).

YY keyword lists include references to 11 events that do not appear on the other applications. Some of these events are clearly sensitive topics for the government, as shown by leaked directives.

Wukan is a fishing village in southern Guangdong that has earned renown for activism. In 2016, villagers took to the streets calling for the release of detained democratically-elected local leader Lin Zuluan and the resolution of a long-simmering dispute over land sales. China Digital Times published a leaked directive that was issued to news organizations on June 21, 2016 (China Digital Times does not disclose the issuing bodies to protect the sources of the leaks):

"Regarding former village committee chief of Wukan, Guangdong, Lin Zuluan being investigated and admitting his guilt, websites are strictly prohibited from releasing or re-publishing any news, photos, video, or information related to the mass incident in the village"

On June 22, YY added one keyword (林祖銮 "Lin Zuluan") followed by the addition of two keywords on June 23 (林祖戀 “Lin Zulian”, 還我書記 “Return our secretary”). It is unclear if YY received similar directives for handling the Wukan protests. While it is plausible, the lack of any Wukan-related keywords on Sina Show or 9158 suggests that the distribution of these directives, or compliance with them, varies.

We observe a similar pattern in the censorship of President Xi’s gaffe during his opening speech at the 2016 G20 summit in Hangzhou. During the September 4, 2016 speech, Xi mistakenly said "reduce taxes and make roads easy [to travel on], facilitate commerce and loosen clothing" (轻关易道通商宽衣), when he should have read “reduce taxes and make roads easy [to travel on], facilitate commerce and be lenient to farmers” (轻关易道通商宽农).

This slip of the tongue was clearly embarrassing for Xi. China Digital Times published a leaked directive dated September 4 that instructed online media to "filter and intercept content" related to "tongshang kuannong [通商宽农]," and to “strictly delete comments, photos, videos, and related information”. On September 5, YY added 17 keywords including “Xi undress” (習寬衣), “loosen the clothing and undo the belt” (寬衣解帶), and other references to the speech.

Events like the Wukan protest and G20 speech gaffe are clearly sensitive to Chinese authorities, and it is surprising to see them only referenced on one application. Other keywords found on YY are related to sensitive events specific to the application. In September 2015, YY added 6 keywords referencing an August 2015 incident during which a YY user apparently forgot to turn off her webcam and had sex with her partner while live streaming (yy出事视频 “yy accident video”, 忘关视频被啪 “forgot to turn off the video while having sex”). Videos of the incident circulated on Chinese social media causing a scandal. In this case, it is obvious that YY would be motivated to attempt damage control over the incident as it brings unwanted attention from authorities.

Overall, we find that censorship of events is dynamic and reactive and in some cases can be correlated with directives sent to media organizations by government propaganda offices. However, we observe a lack of overlap in the unique events censored by different companies, suggesting either that no centralized directives are given to the companies or that levels of compliance differ. YY is registered in Guangzhou whereas Tian Ge is registered in Hangzhou. Each company has to follow its respective municipal and provincial regulations. The companies may therefore be given different directives based on the location of their registration, which other studies have suggested may account for variance in how censorship is implemented. These results demonstrate that events are catalysts for censorship but that the ways in which they are managed are not uniform.
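The correlation between leaked directives and keyword additions can be made concrete by computing the lag, in days, between the two. The Python sketch below uses only the dates and counts reported above for the Wukan and G20 cases on YY; the year 2016 for the YY additions and for the G20 directive is assumed from context, and the function name `lags_in_days` is hypothetical. It is an illustration, not the analysis procedure used in the study.

```python
from datetime import date

# Directive dates as reported by China Digital Times (year assumed from context).
directives = {
    "Wukan": date(2016, 6, 21),
    "G20 speech": date(2016, 9, 4),
}

# (event, date of addition, number of keywords added) observed on YY.
yy_additions = [
    ("Wukan", date(2016, 6, 22), 1),
    ("Wukan", date(2016, 6, 23), 2),
    ("G20 speech", date(2016, 9, 5), 17),
]


def lags_in_days(directives, additions):
    """Map each event to the sorted day offsets between its directive and each addition."""
    lags = {}
    for event, added_on, _count in additions:
        if event in directives:
            offset = (added_on - directives[event]).days
            lags.setdefault(event, []).append(offset)
    return {event: sorted(days) for event, days in lags.items()}


lags = lags_in_days(directives, yy_additions)
print(lags)
```

In both cases the first YY additions follow the leaked directive by one day, which is consistent with reactive censorship but does not by itself show that YY received the same directive.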

Political

The Political theme includes 18 categories related to issues including the Communist Party of China (CPC), ethnic minority groups in China, religious movements, and terrorism.

Figure 8: Percentage of Political theme keywords by category

All three applications have keywords related to the CPC. This content includes general references to the structure of the party and its various departments (e.g., 中央政治局 "politburo", 中共中央 “CPC Central Committee”); allusions to factional struggles within the party (e.g., 习近平阵营和江派 “Xi and Jiang faction camp”); and pejoratives (e.g., 共匪 “Communist bandits”).

Keywords related to the Uyghur ethnic minority are also present on all of the applications. These keywords appear in Chinese and in the Uyghur language in both Arabic and Latin script. The content of the keywords ranges from religion (东突穆斯林 “East Turkestan Muslim”) and violence (partila “explode”) to separatism (تۈركىستان ئىسلام پار تىيىسى “Turkestan Islamic Party”, an Islamic separatist organization founded by Uyghur militants). Other keywords are more cryptic and lack clear context, such as “cloudy weather” (“بۇلۇتلۇق ھاۋا”) and “sweet potato” (“ياڭيۇ تاتلىق”). In our previous data collection period, Uyghur keywords were also present on all three applications and represented the largest percentage of keywords within the political theme for Sina Show (45%) and YY (25%).

Titles of books dealing with sensitive topics that have been banned in China also appear on each application. These books, predominantly published in Hong Kong and Taiwan, include discussions of power struggles within the CPC (e.g., 老江气杀习大, “Old Jiang Enrages Uncle Xi”) and fiction critical of communist rule (e.g., 黄祸, “Yellow Peril”, written by Wang Lixiong). China has strict regulations on the publishing industry, pushing dissident and tabloid authors to Hong Kong and Taiwan to publish on sensitive topics. The sale of banned books was highlighted in 2015 after five employees of a book shop and publishing firm in Hong Kong specializing in taboo titles went missing, only to later emerge in custody in mainland China. Their disappearances had a chilling effect on publishers in Hong Kong, who pulled sensitive titles from their shelves. One of the booksellers, Lam Wing-kee, revealed details of his detention at a press conference in Hong Kong on June 16, 2016. His name (林荣基) is included in the keyword lists on YY under the Event theme.

Technology

The Technology theme has five categories: censorship circumvention tools, URLs, hardware devices, Chinese software and websites, and phone numbers.

Figure 9: Percentage of Technology theme keywords by category

The hardware category includes 25 references to drones and other unmanned aircraft (e.g., 四旋翼无人机 "quadcopter"). While it is unclear why these keywords are censored, there is rising concern in China regarding safety, privacy, and national security issues, along with increasing regulation of drone technology.

In the "Chinese websites and software" category we see instances of what may be the companies using censorship to gain competitive advantage. YY lists include 25 keywords that reference competing live streaming services in China (e.g., 美拍直播, “Mei Pai Live,” 熊猫TV, “Panda TV”), and 9158 includes two keywords (e.g., 六间房 “Six Room”). We found similar content in our previous collection period, with all three applications adding names and URLs of competitors. The addition of these keywords may be an attempt to prevent users from being lured away from the provider’s platform.

People

The People theme includes two categories: names of CPC officials and names of dissidents.

Figure 10: Percentage of People theme keywords by category

References to dissidents include the renowned artist Ai WeiWei (艾未未), Chinese human rights lawyer Guo Feixiong (郭飛雄), and gender activist Ye Haiyan (referred to by her nickname “rogue yan” 流氓燕).

References to officials include current and former leaders (e.g., 李克强 “Li Keqiang”, current Premier of the State Council of the PRC; 胡锦涛 “Hu Jintao”, former Chinese President).

There are also numerous examples of playful, derogatory, and creative ways to refer to party leaders in the keyword lists. Keywords related to President Xi Jinping include an endearing nickname, “Daddy Xi” or “Uncle Xi” (习大大), which has been used in state propaganda but has reportedly been banned from official use recently to tone down Xi’s populist image. Other nicknames are more derogatory, such as “Bun Ruthless” (包子心狠手辣). The word for steamed bun (包子) is used to refer to Xi following the circulation of a photo showing him ordering lunch at a steamed bun shop, which was subsequently criticized as a political show, whereas “ruthless” (心狠手辣) criticizes Xi’s hardline rule over China. Chinese netizens often make creative use of the Chinese language in efforts to evade censorship. We see examples of this practice in references to Xi that reverse the order of the characters in his name (平近习) or use homoglyphs (刁近乎, diāo jìn hū) that appear similar to the characters of his name (习近平, xí jìn píng).

Researchers have argued that automated keyword censorship is ineffective because users find means to circumvent the filters through creative use of language. The keyword lists we collected show that censors are clearly picking up on these practices and are engaged in a cat-and-mouse game with users. The censors will never be able to comprehensively censor speech through keyword filtering, nor will users always be able to evade these controls.
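The dynamic can be illustrated with a minimal Python sketch. It assumes a naive filter that blocks any message containing a listed keyword as a substring; this section does not describe the applications' actual matching logic, so the function `is_blocked` and the two lists are hypothetical. The example strings are the variants of Xi's name discussed above.

```python
def is_blocked(message, keywords):
    """Naive filter (an assumption): block if any keyword occurs as a substring."""
    return any(keyword in message for keyword in keywords)


name = "习近平"
reversed_name = name[::-1]   # characters in reverse order: 平近习
homoglyph_name = "刁近乎"     # visually similar characters

# A list containing only the original name misses both variants.
base_list = {name}
print(is_blocked(reversed_name, base_list))
print(is_blocked(homoglyph_name, base_list))

# Once censors add the observed variants, those variants are caught too.
updated_list = base_list | {reversed_name, homoglyph_name}
print(is_blocked(reversed_name, updated_list))
print(is_blocked(homoglyph_name, updated_list))
```

Each new variant evades the filter until it is added to the list, so exact-match lists of this kind always trail the variants users invent.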