High-quality data is the fuel that keeps the AI wheel turning — and the machine learning community can’t get enough of it. Last year saw an unprecedented number of newly open-sourced datasets, including UC Berkeley’s large-scale self-driving dataset BDD100K, Stanford University’s Q&A dataset HotpotQA, and Google’s Open Images V4.

Alongside new algorithms and enhanced compute power, the open-source data movement has always been a major contributor to AI development: CIFAR-10 and ImageNet spawned the computer vision boom; COCO nurtured state-of-the-art object detection models; and SQuAD enabled language systems to answer questions accurately.

To help keep our readers abreast of the trend, Synced has identified five high-quality open-source datasets that were released this month (January 2019) and that AI researchers and engineers might find useful in their work.

Google translates search queries into Q&A data

Google Research last week introduced its Natural Questions dataset to drive NLP research by providing end-to-end training data for question-answering problems.

The dataset consists of over 300,000 question-answer pairs. The questions were collected from “real anonymized, aggregated queries issued to the Google search engine.” Human annotators were first presented with a question (for example, “What color was John Wilkes Booth’s hair?”) and a related Wikipedia page, and asked to return a long-short answer pair. A long answer is typically a paragraph containing the relevant information; a short answer is a single word or short phrase. The long and short answers have 90 percent and 84 percent accuracy respectively.
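One example in the dataset can be pictured as a record pairing a query with a long and a short answer. The snippet below is a simplified, hypothetical sketch of such a record; the field names are illustrative, not the dataset’s actual schema:

```python
# Hypothetical sketch of one Natural Questions example.
# Field names are illustrative; see the official release for the real schema.
example = {
    "question": "what color was john wilkes booth's hair",
    "wikipedia_page": "John Wilkes Booth",
    "long_answer": (
        "Contemporary accounts describe Booth as a strikingly handsome actor "
        "with jet-black hair and an athletic build."
    ),
    "short_answer": "jet-black",
}

def answer_lengths(ex):
    """Return (long, short) answer lengths in words."""
    return (len(ex["long_answer"].split()), len(ex["short_answer"].split()))

long_len, short_len = answer_lengths(example)
assert short_len < long_len  # short answers are a word or brief phrase
```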

Also included in the dataset are 16,000 special examples with 5-way annotations. Here, five different annotators answered the same question, and their responses were aggregated into a single output. The 5-way annotation examples proved more robust than those from a single annotator.
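One simple way to aggregate five annotators’ answers into a single output is a majority vote. The helper below is a hypothetical sketch of that idea, not Google’s actual aggregation logic:

```python
from collections import Counter

def aggregate_answers(annotations):
    """Majority-vote aggregation over multiple annotators' short answers.
    Hypothetical sketch -- not the actual Natural Questions pipeline."""
    counts = Counter(a for a in annotations if a is not None)
    if not counts:
        return None  # no annotator found an answer
    answer, _ = counts.most_common(1)[0]
    return answer

# Five annotators answer the same question; three agree.
print(aggregate_answers(["jet-black", "black", "jet-black", None, "jet-black"]))
# -> jet-black
```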

Click here to view the Natural Questions paper.

Andrew Ng introduces new chest X-ray dataset

Last week, a pair of Stanford University researchers led by Landing.ai Founder Andrew Ng announced CheXpert, a large dataset of chest X-rays designed for automated interpretation.

The Stanford Machine Learning Group believes deep learning could automatically detect chest abnormalities at human-expert level. Researchers, however, require large amounts of annotated chest X-ray data to train their end-to-end AI models.

CheXpert contains 224,316 chest radiographs from 65,240 patients. The data was collected from Stanford Hospital chest radiographic examinations performed between 2002 and 2017, in both inpatient and outpatient centers, along with the associated radiology reports. Ng’s research group developed an automatic labeler that translates observations in the reports into structured labels: positive, negative, or uncertain.
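The three-valued label scheme can be pictured with a toy rule-of-thumb mapper from report phrasing to labels. This is a hypothetical sketch only; the actual Stanford labeler uses a far richer NLP pipeline, and the cue words and encoding below are assumptions:

```python
from enum import Enum

class Label(Enum):
    POSITIVE = 1
    NEGATIVE = 0
    UNCERTAIN = -1  # encoding chosen for illustration; the real labeler may differ

def label_mention(sentence, observation):
    """Toy mapping from a report sentence to a structured label.
    Hypothetical sketch -- not the actual CheXpert labeler."""
    s = sentence.lower()
    if observation.lower() not in s:
        return None  # observation not mentioned at all
    if any(cue in s for cue in ("no ", "without", "absent")):
        return Label.NEGATIVE
    if any(cue in s for cue in ("may", "possible", "cannot exclude")):
        return Label.UNCERTAIN
    return Label.POSITIVE

print(label_mention("Possible pneumonia in the left lower lobe.", "pneumonia"))
# -> Label.UNCERTAIN
```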

Ng’s team also announced a model evaluation competition that will begin in February. Anyone can submit a trained model on CodaLab to run tests, and the winner will effectively represent the state of the art in chest disease image screening.

Click here to view the CheXpert dataset.

Stanford dataset focuses on visual reasoning

The Stanford NLP Group invested heavily in open-data development for question answering last year, releasing the SQuAD 2.0, HotpotQA, and CoQA datasets. This week, the group, led by Christopher Manning, announced GQA, a dataset for visual reasoning and compositional question answering on real-world images.

Visual question answering is a critical AI subfield that involves building models to answer natural language questions about visual content. Existing datasets of this type include VQA, which contains open-ended questions about images.

GQA is expected to be more comprehensive and challenging because it involves multiple reasoning skills, spatial understanding, and multi-step inference. The dataset contains 20 million questions paired with various images, each of which is associated with a scene graph of the image’s objects, attributes, and relations.
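A scene graph ties questions to structured image content: objects carry attributes, and relations link objects together. The snippet below is a hypothetical, simplified sketch of such a graph and one toy reasoning step over it, not GQA’s actual file format:

```python
# Hypothetical, simplified scene graph for one image -- illustrative only.
scene_graph = {
    "objects": {
        "o1": {"name": "woman", "attributes": ["standing"]},
        "o2": {"name": "umbrella", "attributes": ["red"]},
    },
    "relations": [
        {"subject": "o1", "predicate": "holding", "object": "o2"},
    ],
}

def objects_related_to(graph, name, predicate):
    """Names of objects linked to `name` by `predicate` -- a toy reasoning step."""
    objs = graph["objects"]
    return [
        objs[r["object"]]["name"]
        for r in graph["relations"]
        if objs[r["subject"]]["name"] == name and r["predicate"] == predicate
    ]

print(objects_related_to(scene_graph, "woman", "holding"))  # -> ['umbrella']
```

A compositional question such as “What is the standing woman holding?” can then be answered by chaining such lookups over the graph.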

Click here to view the GQA dataset.

Facebook Binary Image Selection (BISON) dataset

Facebook recently introduced an alternative evaluation task for computer vision models. In BISON (Binary Image Selection), the AI system is presented with two semantically similar images and a text description that describes one image but not the other. The system needs to select the image that best matches the caption.

Facebook also compiled a BISON dataset to complement the COCO Captions dataset. BISON-COCO is not a training dataset but an evaluation dataset for testing existing models’ ability to pair visual content with appropriate text descriptions.
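Evaluating a model on BISON-style data reduces to a forced choice: score the caption against both images and pick the higher score. The harness below is a hypothetical sketch; the `model_score` callable and the toy tag-overlap scorer are assumptions, standing in for any real image-text matching model:

```python
def bison_accuracy(model_score, pairs):
    """Fraction of (caption, true_image, distractor_image) triples where the
    model scores the true image higher. `model_score` is any callable mapping
    (image, caption) -> float; this harness is a hypothetical sketch."""
    correct = sum(
        1 for caption, true_img, distractor in pairs
        if model_score(true_img, caption) > model_score(distractor, caption)
    )
    return correct / len(pairs)

# Toy demo with a fake scorer: overlap between caption words and image "tags".
def toy_score(image_tags, caption):
    return len(image_tags & set(caption.lower().split()))

pairs = [
    ("a dog on a beach", {"dog", "beach", "sand"}, {"dog", "grass", "park"}),
    ("a dog in a park", {"dog", "grass", "park"}, {"dog", "beach", "sand"}),
]
print(bison_accuracy(toy_score, pairs))  # -> 1.0
```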

Click here to view the BISON dataset paper.

IBM raises stakes on facial diversity

In a 2018 study, IBM’s facial recognition system performed poorly on dark-skinned female subjects, achieving only 65.3 percent accuracy. Since then the tech giant has stepped up its research efforts against AI bias, last year releasing the world’s largest facial attribute dataset.

This week, IBM continued to push its progress in this area, releasing a new large dataset called Diversity in Faces (DiF). DiF comprises one million human facial images collected from the YFCC-100M Creative Commons dataset. IBM says the aim is to reduce bias and advance fairness in facial recognition technology: “the DiF dataset provides a more balanced distribution and broader coverage of facial images compared to previous datasets.”

Click here to view the DiF dataset.