The left image contains twice the number of dogs as the right image, and at least two dogs in total are standing.

true

One image shows exactly two brown acorns in back-to-back caps on green foliage.

false



Image Credit (in order left-right, top-bottom): MemoryCatcher (CC0), Calabash13 (CC BY-SA 3.0), Charles Rondeau (CC0), Andale (CC0).

We only publicly release the sentence annotations and original image URLs, and scripts that download the images from the URLs. If you would like direct access to the images, please fill out this Google Form. This form asks for your basic information and asks you to agree to our Terms of Service.

Natural Language for Visual Reasoning

NLVR contains 92,244 pairs of human-written English sentences grounded in synthetic images. Because the images are synthetically generated, this dataset can be used for semantic parsing.

Data Paper (Suhr et al. 2017)



Examples from NLVR

There is exactly one black triangle not touching any edge

true

there is at least one tower with four blocks with a yellow block at the base and a blue block below the top block

true

There is a box with multiple items and only one item has a different color.

false

There is exactly one tower with a blue block at the base and yellow block at the top

false

More examples (from the development set) are available here.



Leaderboards

Both NLVR and NLVR2 are split into training, development, and two test sets. One test set is public (Test-P) and available with the data, and the other is not released (Test-U). We maintain a leaderboard displaying accuracy and consistency on the unreleased test set (as well as accuracy on the development and public test sets). Results are ordered by accuracy on the unreleased test set; ties are broken with consistency.
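The ordering rule above can be sketched in a few lines of Python. This is only an illustration, not the code behind the leaderboard: the `Entry` type, its field names, and the `rank_entries` function are hypothetical, and the scores in the usage example are made-up placeholders rather than real results.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    """One leaderboard row (hypothetical structure, for illustration only)."""
    system: str
    test_u_acc: float   # accuracy on the unreleased test set (Test-U)
    test_u_cons: float  # consistency on the unreleased test set (Test-U)


def rank_entries(entries: List[Entry]) -> List[Entry]:
    """Order by Test-U accuracy, breaking ties with Test-U consistency.

    Both values are sorted in descending order, so the best system comes first.
    """
    return sorted(
        entries,
        key=lambda e: (e.test_u_acc, e.test_u_cons),
        reverse=True,
    )


if __name__ == "__main__":
    # Placeholder systems and scores, not actual leaderboard results.
    demo = [
        Entry("system-a", 0.70, 0.30),
        Entry("system-b", 0.80, 0.20),
        Entry("system-c", 0.70, 0.40),
    ]
    for row in rank_entries(demo):
        print(row.system, row.test_u_acc, row.test_u_cons)
```

Sorting on the tuple `(accuracy, consistency)` means consistency is only consulted when two systems have identical accuracy, which matches the tie-breaking rule described above.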

We require two months or more between runs on each leaderboard test set. We will do our best to run your system within two weeks (usually much faster). We will only post results on the leaderboard when an online description of the system is available. Testing on the leaderboard test set is meant to be the final step before publication. Under extreme circumstances, we reserve the right to limit running on the leaderboard test set to systems that are mature enough for publication. Your model should generate a prediction file in the format specified in the NLVR readme and run with the provided evaluation scripts. You can request to add your model to the leaderboard even if you don't evaluate on the unreleased test set.

For both datasets, we use two evaluation metrics: accuracy and consistency. Accuracy (Acc) is computed as the proportion of examples (sentence-image pairs) for which a model correctly predicted a truth value. Consistency (Cons) measures the generalization of a model. It is computed as the proportion of unique sentences for which a model correctly predicted the truth value for all paired images (Goldman et al., 2018).
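As a rough illustration of the two metrics, the sketch below computes both from a list of predictions. It is not the provided evaluation script: the function name `accuracy_and_consistency` and the `(sentence, gold, predicted)` tuple layout are assumptions made for this example, and it groups examples by sentence text, whereas the official scripts may group by a sentence identifier instead.

```python
from typing import Dict, Iterable, Tuple

# Assumed layout for this sketch: (sentence, gold truth value, predicted truth value).
Example = Tuple[str, bool, bool]


def accuracy_and_consistency(examples: Iterable[Example]) -> Tuple[float, float]:
    """Return (accuracy, consistency) for a collection of predictions.

    Accuracy: fraction of sentence-image pairs predicted correctly.
    Consistency: fraction of unique sentences for which every paired
    image was predicted correctly.
    """
    total = 0
    correct = 0
    # Maps each unique sentence to True while all of its examples are correct.
    all_correct: Dict[str, bool] = {}

    for sentence, gold, predicted in examples:
        is_correct = gold == predicted
        total += 1
        correct += int(is_correct)
        all_correct[sentence] = all_correct.get(sentence, True) and is_correct

    if total == 0:
        raise ValueError("no examples to evaluate")

    accuracy = correct / total
    consistency = sum(all_correct.values()) / len(all_correct)
    return accuracy, consistency


if __name__ == "__main__":
    # Toy predictions: three unique sentences, five sentence-image pairs.
    toy = [
        ("s1", True, True),
        ("s1", False, False),
        ("s2", True, False),   # one mistake makes s2 inconsistent
        ("s2", True, True),
        ("s3", False, False),
    ]
    acc, cons = accuracy_and_consistency(toy)
    print(f"Acc = {acc:.3f}, Cons = {cons:.3f}")
```

In the toy data, four of five pairs are correct, but only two of the three sentences are correct on all of their paired images, so consistency is lower than accuracy. A single error on any image paired with a sentence removes that sentence from the consistency count.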

Questions?

Please visit our Github issues page or email us at

nlvr < at > googlegroups.com

Please email us if you wish to run on an unreleased test set. To keep up to date with major changes, please subscribe:

Acknowledgments

This research was supported by the NSF (CRII-1656998), a Facebook ParlAI Research Award, an AI2 Key Scientific Challenges Award, Amazon Cloud Credits Grant, and support from Women in Technology New York. This material is based on work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1650441. We thank Mark Yatskar and Noah Snavely for their comments and suggestions, and the workers who participated in our data collection for their contributions.

Also thanks to SQuAD for allowing us to use their code to create this website!