In recent years, researchers in the deep learning, computer vision, and natural language processing communities have become increasingly interested in vision and language (V&L).

A compelling reason to study language and vision jointly is the promise of language as a universal and natural interface for visual reasoning problems, useful both for specifying a wide range of problems and for communicating AI responses. However, previous research in visually grounded language understanding has been mostly task-specific.

Researchers from Facebook AI Research, Georgia Institute of Technology, and Oregon State University found that the skills required for different V&L tasks, such as visual question answering and caption-based image retrieval, overlap significantly, thanks mainly to the rise of general V&L architectures.

The wide variety of independent V&L tasks motivated these researchers to explore ways to consolidate some of them. The result of their efforts is an all-in-one model that learns from 12 supporting datasets spanning four broad categories of V&L tasks. Compared with independently trained single-task models, the model reduces the number of parameters from some 3 billion to 270 million while improving task performance by an average of 2.05 points.
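To put the parameter figures in perspective, here is a quick back-of-the-envelope calculation. It uses only the two rounded numbers quoted above; the variable names are illustrative and do not come from the paper.

```python
# Rounded figures as quoted in the article ("some 3 billion" is approximate).
separate_models_params = 3_000_000_000  # combined size before consolidation
multitask_model_params = 270_000_000    # size of the single all-in-one model

# How many times smaller the consolidated model is.
shrink_factor = separate_models_params / multitask_model_params

# Share of the original parameter count that is no longer needed.
fraction_removed = 1 - multitask_model_params / separate_models_params

print(f"roughly {shrink_factor:.1f}x fewer parameters")
print(f"roughly {fraction_removed:.0%} of parameters removed")
```

Because the 3 billion figure is itself an approximation, both results should be read as rough estimates.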

Based on the recently proposed ViLBERT (Vision-and-Language BERT) model for learning joint representations of image content and natural language, the new model focuses on four categories — visual question answering, caption-based image retrieval, grounding referring expressions, and multi-modal verification.

Among the 12 datasets are three for vocab-based VQA (VQAv2, GQA, and VGQA), two for image retrieval (COCO and Flickr30K), five for referring expressions (RefCOCO, RefCOCO+, RefCOCOG, Visual7W, and GuessWhat), and two for multi-modal verification (NLVR2 and SNLI-VE).
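The grouping above can be written down as a simple lookup table. This is only an illustrative sketch: the dictionary name and layout are assumptions made for this example, not the configuration format used by the researchers.

```python
# Hypothetical grouping of the 12 datasets by task category,
# using the names exactly as listed in the article.
DATASETS_BY_TASK = {
    "vocab-based VQA": ["VQAv2", "GQA", "VGQA"],
    "image retrieval": ["COCO", "Flickr30K"],
    "referring expressions": [
        "RefCOCO", "RefCOCO+", "RefCOCOG", "Visual7W", "GuessWhat",
    ],
    "multi-modal verification": ["NLVR2", "SNLI-VE"],
}

# Sanity check: four categories covering 12 datasets in total.
total_datasets = sum(len(names) for names in DATASETS_BY_TASK.values())

for task, names in DATASETS_BY_TASK.items():
    print(f"{task}: {len(names)} dataset(s)")
print(f"total: {total_datasets}")
```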

Previous V&L datasets were notorious for varying in size, quality, interface, and difficulty. The new research shows not only that a single model can perform multiple tasks, but also that, even with the same architecture, training on multiple datasets can improve task metrics compared with single-task training.

The paper further demonstrates that multi-task training can be an effective pretraining step for single-task models: it led to further gains and set a new state of the art on 7 of the 12 dataset tasks.

The paper 12-in-1: Multi-Task Vision and Language Representation Learning is available on arXiv.