We at Lionbridge AI are continuing our article series on machine learning datasets, and in this blog post, we’ll share 13 free Japanese language text datasets for machine learning.

Dataset Finders

DATA GO JP: The Japanese government’s catalogue site provides public datasets as part of its mission to improve the economy and standard of living for Japanese citizens.

National Information Research Data Repository: This site includes datasets that Japan’s National Information Research Group is currently working on, or preparing to work on in the near future.

Link Data: A support site where you can convert tabular data into RDF files and publish them.

Japanese Datasets for Natural Language Processing

Resources for Natural Language Processing: Datasets for natural language processing, provided by Kyoto University. For example, there are annotated datasets of online text literature and articles from Mainichi Shinbun, a major Japanese newspaper.

Aozora Book Collection: Online books in plain text, XHTML, and HTML formats, published with the author’s permission. You can also download the dataset on GitHub.

Aozora Book Collection Morphological Analysis Data: Dataset of 11,176 works from the Aozora Book Collection that have undergone morphological analysis. It is released under a CC license that allows business use.

Kanjivg-radical: Dataset of Japanese kanji and the component parts that make up each kanji. For example, the kanji [脳] is made up of three parts: [月] [⺍] [凶]. You can use this dataset to look up kanji that you don’t know how to read, based on their parts.
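To make the idea of searching by parts concrete, here is a minimal Python sketch. The table below is a tiny hand-written example, not the dataset's actual file format, and the names `KANJI_PARTS`, `build_index`, and `find_kanji` are hypothetical. It builds a reverse index from each part to the kanji that contain it, then intersects the results for the parts you recognize.

```python
from collections import defaultdict

# Toy, hand-written decomposition table for illustration only.
# Assumption: this is NOT the real file format of the Kanjivg-radical dataset.
KANJI_PARTS = {
    "脳": ["月", "⺍", "凶"],
    "胸": ["月", "勹", "凶"],
    "明": ["日", "月"],
}


def build_index(kanji_parts):
    """Map each part to the set of kanji that contain it."""
    index = defaultdict(set)
    for kanji, parts in kanji_parts.items():
        for part in parts:
            index[part].add(kanji)
    return index


def find_kanji(index, parts):
    """Return the set of kanji that contain every part in `parts`."""
    candidate_sets = [index.get(part, set()) for part in parts]
    if not candidate_sets:
        return set()
    return set.intersection(*candidate_sets)


INDEX = build_index(KANJI_PARTS)

# Look up kanji that contain both 月 and 凶.
matches = find_kanji(INDEX, ["月", "凶"])
print(sorted(matches))
```

With the real dataset you would replace the toy table with mappings loaded from its files; adding more known parts narrows the candidate set.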

Japanese Parallel Text Datasets

Japanese Parallel Text Data: List of language resources that you can use to train a Japanese machine translation system. Most of the resources are for Japanese/English translation, with several multilingual resources listed at the end.

SNOW T15 Japanese Simplified Corpus with Core Vocabulary: The creators took a Japanese/English parallel text corpus and rewrote the Japanese side in plain, easy-to-understand Japanese.

Japanese Datasets for Sentiment Analysis

Japanese Tweets Evaluation Dataset: The creators used crowdsourcing to evaluate and analyze Japanese tweets, and the resulting dataset is available here.

SNOW D18 Japanese Emotional Expression Dictionary: Japanese dataset of emotional expressions that covers 48 different emotions and 2,000 expressions.

Other Japanese Language Datasets

Livedoor News Corpus: Livedoor is a Japanese news site. The dataset includes news articles from nine different categories.

Japan Meteorological Agency: Download past Japanese meteorological data in CSV format.
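Once you have a CSV file, a short Python sketch like the one below can load it. Everything specific here is an assumption: the `shift_jis` default encoding (common for Japanese CSV exports, but check your file), the number of header rows to skip, and the sample column names are all hypothetical and not taken from the agency's site.

```python
import csv
import io


def read_rows(text_stream, skip_rows=0):
    """Parse CSV rows from an open text stream, skipping leading header rows."""
    reader = csv.reader(text_stream)
    for _ in range(skip_rows):
        next(reader, None)
    # Drop completely empty lines, which sometimes separate header blocks.
    return [row for row in reader if row]


def load_csv(path, encoding="shift_jis", skip_rows=0):
    """Open a downloaded CSV file with an assumed encoding and parse it."""
    with open(path, newline="", encoding=encoding) as f:
        return read_rows(f, skip_rows)


# Small in-memory example with made-up columns and values.
sample = io.StringIO("date,avg_temp_c\n2020-01-01,5.5\n2020-01-02,6.1\n")
rows = read_rows(sample, skip_rows=1)
print(rows)
```

If the text comes out garbled, try a different `encoding` value such as `utf-8` or `cp932`, and adjust `skip_rows` to match the number of header lines in your file.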

Still can’t find what you’re looking for? With 20 years of translation experience and 500,000 qualified translators around the world, language datasets are Lionbridge’s strength. We provide custom datasets tailored to your machine learning needs in 300 different languages. Contact us to find out how we can support your machine learning project.