Tencent AI Lab has announced an open-source NLP dataset comprising vector representations for eight million Chinese words and phrases. The dataset aims to provide large-scale and high-quality support for deep learning-based Chinese language NLP research in both academic and industrial applications.

Chinese Vocabularies Vector Dataset characteristics

Tencent AI Lab’s new dataset represents each Chinese word or phrase as a 200-dimensional vector. According to Tencent, it significantly outperforms existing datasets in coverage, freshness, and accuracy.
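
To make the idea of "one vector per word" concrete, here is a minimal sketch of reading such vectors from text. It assumes the common plain-text word2vec layout (a header line with the entry count and dimension, then one word and its values per line); that layout, the function name read_text_vectors, and the three-dimensional toy values are assumptions for illustration, not details confirmed by the announcement.

```python
import io


def read_text_vectors(stream):
    """Parse word2vec-style text: a 'count dim' header, then 'word v1 ... v_dim' lines."""
    header = stream.readline().split()
    _declared_count, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in stream:
        line = line.rstrip()
        if not line:
            continue
        # Split from the right so a phrase that contains spaces stays in one piece.
        parts = line.rsplit(" ", dim)
        if len(parts) != dim + 1:
            raise ValueError("malformed line: " + line[:40])
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


# Toy stand-in using 3 dimensions instead of 200; the numbers are made up.
sample = io.StringIO("2 3\n喀拉喀什河 0.1 0.2 0.3\n因吹斯汀 -0.5 0.0 0.25\n")
dim, vectors = read_text_vectors(sample)
print(dim, len(vectors))
```

For the real file, the same function could be called on an opened file object with UTF-8 encoding. Holding millions of 200-dimensional vectors in Python lists would be memory-hungry, so a dedicated library is likely the better choice in practice.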

Coverage

The dataset not only covers Chinese vernacular and slang but also supports synonym lookup. Take “喀拉喀什河” (China’s “Karakash River”) as an example. By comparing word vectors, the dataset can return different Chinese words with the same or a closely related meaning: 墨玉河, 和田河, 玉龙喀什河, 白玉河, and 喀什河 are alternative or closely related river names from the same region.
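
The "vector calculation" behind this kind of lookup is usually a nearest-neighbor search by cosine similarity. The sketch below shows that technique on a toy table; the three-dimensional vectors are invented for illustration (the real vectors have 200 dimensions), and the helper names cosine and most_similar are hypothetical rather than part of any Tencent tool.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, top_n=3):
    """Return the top_n (word, similarity) pairs closest to the query word."""
    q = vectors[query]
    scored = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Made-up toy vectors: the river names point in similar directions,
# while the unrelated slang term points elsewhere.
toy_vectors = {
    "喀拉喀什河": [0.9, 0.1, 0.0],
    "墨玉河": [0.8, 0.2, 0.1],
    "和田河": [0.7, 0.3, 0.0],
    "因吹斯汀": [0.0, 0.1, 0.9],
}

neighbours = most_similar("喀拉喀什河", toy_vectors, top_n=2)
for word, score in neighbours:
    print(word, round(score, 3))
```

A linear scan like this is fine for a toy table. With eight million entries, an approximate nearest-neighbor index would be the more practical design.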

Freshness

The dataset also includes web vocabulary, transliterated words, and neologisms from the past few years. Take “因吹斯汀” (a transliteration of “interesting”) as an example: vector similarity matches it with terms that share its context and form, such as 一颗赛艇 (“exciting”) and 因吹斯听 (“interesting”).

Accuracy

Lastly, the dataset returns accurate matches for a given input word, including idioms and popular expressions. For example, “兴高采烈” and “欢天喜地”, which both translate as “happiness,” are matched with each other.

Chinese Vocabularies Vector Dataset creation

Creation of the Tencent AI Lab Chinese vocabularies vector dataset involved three different phases:

Corpus Collection

The corpus used to train the word vectors comes from Tencent News, internet novels, and webpages. This large-scale, multi-source corpus enables the generated word vectors to cover multiple types of vocabulary.

Thesaurus Creation

The thesaurus, meanwhile, draws not only on Baidu Baike and Wikipedia but also on large-scale webpage data, which is mined with a corpus-based semantic class mining method.

Algorithm Training

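The dataset’s training algorithm is Tencent AI Lab’s self-developed Directional Skip-Gram model. Unlike the standard skip-gram model, it also takes into account the relative position of each context word (to the left or right of the target word), which helps the resulting vectors identify the right related words.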
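
As a rough illustration of what direction-aware training examples look like, the sketch below generates (target, context, direction) triples from a pre-segmented sentence. It shows only the training examples, not the model or its loss function; the function name directional_pairs, the window size, and the sample sentence are assumptions for illustration and are not taken from Tencent's implementation.

```python
def directional_pairs(tokens, window=2):
    """Build skip-gram style pairs, each tagged with the side the context word is on."""
    triples = []
    for t, target in enumerate(tokens):
        start = max(0, t - window)
        stop = min(len(tokens), t + window + 1)
        for c in range(start, stop):
            if c == t:
                continue  # a word is not its own context
            direction = "left" if c < t else "right"
            triples.append((target, tokens[c], direction))
    return triples


# Hypothetical pre-segmented sentence: "Tencent / releases / Chinese / word vectors".
tokens = ["腾讯", "发布", "中文", "词向量"]
triples = directional_pairs(tokens, window=1)
for triple in triples:
    print(triple)
```

A standard skip-gram model would use only the first two fields of each triple. The direction tag is the extra signal a direction-aware model can learn from.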

Tencent AI Lab continues to invest in text processing. Its latest research results will be published at the upcoming ACL, EMNLP, and IJCAI conferences.

The Tencent AI Lab Chinese Vocabularies Vector Dataset is available at https://ai.tencent.com/ailab/nlp/embedding.html