For all the geeks, nerds, and otaku (Japanese for ‘geek’) out there, we at Lionbridge AI have compiled a list of 25 anime, manga, comics, and video game datasets. Most of the datasets on this list are public and free to use. They include both text and image data, and some even contain annotated images. Please note, however, that some of them are strictly for non-commercial research use, so be sure to read each dataset’s terms of use before downloading.

The majority of these datasets are in English. If you’re looking for data in other languages, get in touch to learn how Lionbridge AI’s crowd of 500,000 multilingual annotators can help you get the data you need.

Anime Datasets for Machine Learning

Where can I find anime datasets for machine learning?

1. Anime Data (score, staff synopsis, and genre) – With data taken from the Anime News Network, this dataset contains information on 4,029 anime divided into five CSV files. The files are separated by content as follows: anime title, anime synopsis, anime genre, anime staff, and anime scores.

2. Anime Faces – This is a simple anime image dataset with over 21,000 images of anime character faces taken from getchu.com. The images have been cropped and all resized to 64 x 64 pixels.

3. Anime Recommendations Database – With information taken from myanimelist.com, this dataset includes anime ratings and user data of over 73,000 users and 12,000 anime.

4. Between Our Worlds: An Anime Ontology – With data on over 390,000 anime titles, this dataset is composed of linked open data. It is available as both a CSV file and an N-Triples file.

5. MyAnimeList Dataset – Similar to the Anime Recommendations Database, this dataset takes information from myanimelist.com. However, this dataset is much more detailed, with information on over 302,000 users, including demographic data along with their anime ratings.

6. Safebooru Anime Image Metadata – The anime images in this dataset are all taken from the Safebooru website, a site where explicit content is banned. Therefore, the images in this dataset should all be safe for work. This is a large dataset with 1.9 million rows of metadata.

7. Tagged Anime Illustrations – This anime dataset contains a large amount of labeled Japanese anime artwork. While the images have been labeled as safe for work, the websites these images were taken from do not filter explicit images, and some may have been scraped into the dataset. The dataset also includes cropped illustrated faces of anime characters.
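
For ratings-style datasets such as the Anime Recommendations Database (item 3), a common first step is to average the ratings per title. Below is a minimal Python sketch of that idea. The column names (`user_id`, `anime_id`, `rating`), the sample rows, and the use of `-1` as a "watched but not rated" marker are all assumptions made for illustration, so check the actual files before relying on them.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows. The column names and the -1 sentinel are
# assumptions for illustration, not a confirmed schema of any dataset above.
SAMPLE = """user_id,anime_id,rating
1,20,8
1,24,-1
2,20,10
3,24,7
"""


def mean_ratings(csv_file, unrated=-1):
    """Return {anime_id: mean rating}, skipping rows marked as unrated."""
    totals = defaultdict(lambda: [0, 0])  # anime_id -> [sum of ratings, count]
    for row in csv.DictReader(csv_file):
        rating = int(row["rating"])
        if rating == unrated:
            continue  # ignore "watched but not rated" rows
        entry = totals[row["anime_id"]]
        entry[0] += rating
        entry[1] += 1
    return {anime_id: total / count for anime_id, (total, count) in totals.items()}


averages = mean_ratings(io.StringIO(SAMPLE))
print(averages)
```

To run this against a downloaded file, you would pass an open file handle instead of the in-memory sample, for example `open("rating.csv", newline="")`, where the file name is also a placeholder.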

Manga & Comics Datasets for Machine Learning

Where can I find manga and comics datasets for machine learning?

8. Comic Books Images – This comics dataset has just over 52,000 RGB comic book images suitable for use with facial recognition models, classification, and more.

9. MAL Manga Ratings – This is a simple manga dataset with a list of manga titles as well as anime titles adapted from those manga. It also includes the ratings of both the manga and anime adaptations.

10. Manga109 – Compiled by the Aizawa Yamasaki Laboratory at the University of Tokyo, Manga109 is a dataset consisting of every page from 109 different manga titles. The dataset also includes annotations and the owners of the manga have given their consent to be included in Manga109. However, both the dataset and annotations must be requested via email and are only for non-commercial use.

Video Game Datasets for Machine Learning

Where can I find video game datasets for machine learning?

11. 12,000 Video Game Reviews from Vandal – With data taken from the Spanish video game site, Vandal.com, this video game dataset can double as a Spanish language dataset as well. The dataset includes the video game production information, the user rating of the game, as well as a short preview of the user’s review in Spanish.

12. 17,000 Video Game Reviews from JVC – The reviews and video game information in this dataset were taken from the French video game site jeuxvideo.com. Therefore, this dataset can also be used as a French language dataset. The dataset includes the game information as well as the user ratings and user reviews all in French.

13. 20 Years of IGN Game Scores – A simple text dataset, this CSV file includes 18,625 lines of text including the game titles, release dates, genres, IGN scores, platforms, and more. All this information is taken from a crawl of IGN’s reviews page.

14. Aerial Change Detection Images – This is a video game dataset that includes images taken from the game Virtual Battle Station 2. The images included are aerial images of the same area each with slight and major differences including changes in buildings, roads, nature, and weather.

15. Clash of Clans App Store Comments – A useful dataset for sentiment analysis models, this video game dataset contains 50,000 user comments. These comments were taken from both the iTunes App Store and Google Play and contain both the comment text and the user’s rating of the game.

16. Dota 2 Game Chats – Containing the unfiltered chat logs of nearly 1 million public matches, these chat logs include many insults, vulgar language, and even racially offensive messages. This dataset could be used for chatbot and content moderation/filtering algorithms. It should be noted that the text also includes many abbreviations and terms specific to the game.

17. Labeled Video Game Driving – This dataset contains just 2,500 traffic images from in-game driving. However, all 2,500 images come with both the originals and the semantically segmented versions.

18. Metacritic Video Game Comments – An interesting dataset that could be used for sentiment analysis, this dataset includes video game information for 5,000 games along with review scores and review comments for 3,420 games.

19. Pokémon Images – A small and simple dataset containing images of 809 pokémon from generations one to seven of the popular Nintendo game.

20. Pokémon Sun and Moon (Gen 7) Stats – This dataset contains information on all 807 pokémon in the Pokémon Sun and Moon games. The information includes each pokémon’s name along with its attacks and stats. The dataset also includes every item in the game with each item’s description.

21. PubG Match Deaths – An incredibly thorough compilation of match stats from the popular online game, PlayerUnknown’s Battlegrounds. This dataset includes information from 720,000 matches including player deaths, kills, distance traveled, position data, and more.

22. Steam User Behavior – Compiled from Steam users’ video game data, this dataset includes the following information: user ID, title of the game, purchase information, and number of hours played.

23. Tweets During Nintendo E3 2018 Conference – This dataset is a JSON file consisting of tweets scraped during the Nintendo E3 Conference of 2018. The scraped tweets included the hashtags #NintendoE3 and #NintendoDirect.

24. Video Game Sales – The information in this dataset was scraped from vgchartz.com and contains data from video games that had sold over 100,000 units. This video game dataset includes the following information for each game: sales rank, title, platform, year of release, genre, publisher, North America sales, Europe sales, Japan sales, sales from other regions, and combined total global sales.

25. Video Game Sales with Ratings – This dataset contains scraped information from the video game rating site, Metacritic. The information within the dataset includes critic scores from the Metacritic staff, number of critics, user score, number of users, developers, and ESRB rating.
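
For tabular sales data like the Video Game Sales dataset (item 24), two quick checks are often useful: totaling global sales per genre, and confirming that the regional figures add up to the reported global total. The sketch below shows one way to do both in Python. The header names (`Name`, `Genre`, `NA_Sales`, and so on) and the sample rows are hypothetical and chosen only for illustration, so adjust them to match the real CSV header.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows. Header names and figures are made up for
# illustration and are not taken from the actual dataset.
SAMPLE = """Name,Genre,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
Game A,Sports,41.0,29.0,3.5,8.5,82.0
Game B,Platform,29.0,3.5,7.0,0.5,40.0
Game C,Sports,15.0,11.0,3.0,3.0,32.0
"""

REGION_COLUMNS = ("NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales")


def sales_by_genre(rows):
    """Sum the global sales column for each genre."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["Genre"]] += float(row["Global_Sales"])
    return dict(totals)


def inconsistent_rows(rows, tolerance=0.05):
    """List titles whose regional sales do not add up to the global figure."""
    flagged = []
    for row in rows:
        regional = sum(float(row[col]) for col in REGION_COLUMNS)
        if abs(regional - float(row["Global_Sales"])) > tolerance:
            flagged.append(row["Name"])
    return flagged


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
genre_totals = sales_by_genre(rows)
flagged = inconsistent_rows(rows)
print(genre_totals, flagged)
```

The tolerance is there because scraped sales figures are usually rounded, so small differences between the regional sum and the global total are expected.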

Still looking for more data? Be sure to check out the 50 Best Free Datasets for Machine Learning and the Best 25 Datasets for Natural Language Processing.

If you have a project with unique requirements or are in need of custom AI training data, get in touch with Lionbridge AI to see how we can help you. Our crowd of 500,000 specialists fluent in 300 different languages can help you get the data you need when you need it.
