One of the main challenges in building multilingual machine learning models is collecting enough relevant data. To help, we at Lionbridge have compiled a list of the best publicly available Chinese language datasets. These datasets cover a wide range of use cases, from handwritten data for Chinese OCR to labeled text data for sentiment analysis.

If you missed our previous language dataset compilations, be sure to check out our other dataset articles. Without further ado, here are the best Chinese data sources for machine learning projects.

Best Chinese Datasets for Machine Learning

Chinese Text Datasets

Chinese Treebank: This treebank contains 1.5 million words of annotated and parsed text from Chinese news, government documents, and magazine articles.

Mandarin Chinese News Text: From the Linguistic Data Consortium (LDC), this corpus contains over 250 million Chinese characters of news text from People’s Daily, Xinhua newswire, and China Radio International.

Tencent AI Lab Embedding Corpus of Chinese Words and Phrases: Released by Chinese multinational conglomerate Tencent, this corpus provides 200-dimensional vector representations for over 8 million Chinese words and phrases.

Large Scale Chinese Short Text Summarization Dataset: This corpus consists of over 2 million real Chinese short texts with short summaries given by the author of each text.
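For embedding corpora such as the Tencent one listed above, a common first step is to load a subset of the vectors and compare words by cosine similarity. The sketch below assumes the file uses the word2vec-style text layout (a header line with the entry count and dimension, then one token and its values per line); check the corpus documentation before relying on that. The function names and the tiny 4-dimensional sample are hypothetical stand-ins for the real 200-dimensional data.

```python
import io
import math


def load_text_embeddings(stream, limit=None):
    """Parse word2vec-style text: a 'count dim' header, then 'token v1 ... vd' lines.

    The file layout is an assumption, not something confirmed by this article.
    """
    header = stream.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in stream:
        # Split from the right so a phrase containing spaces stays in one piece.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip malformed or empty lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break  # stop early; the full corpus may not fit in memory
    return dim, vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up sample data standing in for the real embedding file.
SAMPLE = (
    "3 4\n"
    "北京 0.9 0.1 0.0 0.2\n"
    "上海 0.8 0.2 0.1 0.2\n"
    "苹果 0.0 0.9 0.7 0.1\n"
)

dim, vectors = load_text_embeddings(io.StringIO(SAMPLE))
city_similarity = cosine(vectors["北京"], vectors["上海"])
fruit_similarity = cosine(vectors["北京"], vectors["苹果"])
```

With a real file, you would pass an open file handle (for example `open(path, encoding="utf-8")`) instead of the in-memory sample, and set `limit` to keep only as many entries as you need.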

Chinese OCR & Handwriting Datasets

Chinese Characters: A dataset of 909,818 images of handwritten Chinese characters, corresponding to the text of about 10 news articles.

Chinese Characters Generator: This font file can be used to generate Chinese character images for training a Chinese OCR system.

Text in the Wild: Built from street view images, this dataset contains about one million Chinese characters annotated by experts across more than 30,000 pictures. Each annotation includes the underlying character, its bounding box, and 6 attributes.
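Character-level annotations like those in Text in the Wild (a character, a bounding box, and a set of attributes) map naturally onto a small record type. The sketch below is one possible in-memory representation, not the dataset's actual schema: the field names, the (x, y, width, height) box layout, and the attribute names in the sample are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CharAnnotation:
    """One annotated character; field names are illustrative, not the real schema."""
    char: str
    bbox: tuple  # assumed layout: (x, y, width, height) in pixels
    attributes: dict = field(default_factory=dict)  # attribute name -> bool


def bbox_area(annotation):
    """Area of the bounding box in square pixels."""
    _, _, width, height = annotation.bbox
    return width * height


def select(annotations, min_area=0.0, exclude=()):
    """Keep characters that are large enough and have none of the excluded attributes."""
    return [
        a for a in annotations
        if bbox_area(a) >= min_area
        and not any(a.attributes.get(name, False) for name in exclude)
    ]


# Made-up sample records; the attribute names are placeholders.
sample = [
    CharAnnotation("店", (10, 20, 40, 42), {"occluded": False}),
    CharAnnotation("路", (60, 22, 8, 9), {"occluded": False}),
    CharAnnotation("口", (80, 20, 38, 40), {"occluded": True}),
]

usable = select(sample, min_area=100, exclude=("occluded",))
```

A filter like this is a typical preprocessing step before training an OCR model, since very small or heavily obscured characters are often handled separately.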

Chinese Translation & Parallel Text Datasets

Chinese-English Emails: Contains 15,000 Chinese characters (equivalent to 10,000 words) from emails, along with a reference translation in English.

OntoNotes: An annotated corpus containing various genres of text – news, conversational telephone speech, weblogs, Usenet newsgroups, broadcast, talk shows – in Chinese, English, and Arabic.

NUS Corpus: This corpus was created for social media text normalization and translation. The researchers randomly selected 2,000 messages from the NUS English SMS corpus and translated them into formal Chinese.

Chinese-French Text: This dataset contains French translations of approximately 30,000 characters from Chinese Broadcast News.

GALE Phase 1 Chinese Blog Parallel Text: Also from the LDC, this dataset contains 277 Chinese blog posts translated into English.

Chinese Sentiment Analysis Datasets

Ren-CECps: This dataset includes 1,500 blog posts (11k paragraphs, 35k sentences) with annotations of emotion and sentiment at the document, paragraph, and sentence levels.

Microblog PCU: From researchers at Xi’an Jiaotong University, this dataset has 50,000 posts from Sina Weibo, along with user metadata such as following-follower information.

Still can’t find what you need? Lionbridge AI provides custom multilingual datasets in 300 languages. Our community of over 1 million certified contributors can quickly collect, create, and annotate training data for your machine learning model.