Asked the scene-based question “How to chop these carrots?”, almost anyone could answer “Use the knife!” A computer system, however, has never prepared a meal, and so requires vision-and-language training to interpret context before it can generate an answer. Most current research approaches in this field use separate language and vision models. But teaching systems to understand what they see and express it in natural language requires a deeper grasp of the relationships between visual and linguistic information.

A team of researchers from the Georgia Institute of Technology, Facebook AI Research and Oregon State University has proposed ViLBERT (Vision-and-Language BERT), a novel model for visual grounding that can learn joint representations of image content and natural language, and leverage those connections across various vision-and-language tasks.

The “pretrain-then-transfer” approach now common in computer vision and natural language processing (NLP) is easy to use and benefits from the strong representational power of large available models. When it comes to connecting the two modalities, however, the grounding between vision and language is typically learned only for a specific task, from paired visiolinguistic data that is often limited or biased, and this can lead to myopic groundings. To avoid this, the researchers propose shifting the approach to pretraining for visual grounding itself.

The BERT language model has significantly advanced self-supervised learning on NLP tasks. Google released the model in 2018; its large version has 24 Transformer blocks, a hidden size of 1024, and 340M parameters, and it quickly made its mark by setting records on 11 key NLP tasks. In the new study, the researchers extend BERT into a joint visual-linguistic, task-agnostic model that links separate streams for vision and language processing through co-attentional transformer layers. This accommodates the different processing needs of each modality while also allowing the two streams to interact at various representation depths.
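To make the idea of co-attention more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it assumes each stream is just a list of feature vectors of equal width, and it leaves out the learned projections, multiple attention heads, residual connections and feed-forward sublayers that a real transformer layer would include. The function names `attend` and `co_attention` are hypothetical. The point it illustrates is simply that each stream uses its own features as queries and the other stream's features as keys and values.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    # Scaled dot-product attention: every query becomes a weighted
    # average of the values, weighted by its similarity to each key.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def co_attention(visual, textual):
    # Hypothetical simplified exchange: each stream queries the other one.
    new_visual = attend(visual, textual, textual)   # image regions attend to words
    new_textual = attend(textual, visual, visual)   # words attend to image regions
    return new_visual, new_textual

# Three image-region vectors and two word vectors (made-up toy values).
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[0.5, 0.5], [1.0, -1.0]]
new_regions, new_words = co_attention(regions, words)
```

Each stream keeps its own length (three regions, two words) while its contents are now conditioned on the other modality, which is one way to read the article's description of separate streams that still interact.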

The researchers trained ViLBERT on 3.1 million image-caption pairs from the Conceptual Captions dataset using two pretraining tasks: masked multi-modal learning and multi-modal alignment prediction. They then transferred the pretrained ViLBERT model to four common vision-and-language tasks: Visual Question Answering (VQA), Visual Commonsense Reasoning (VCR), Grounding Referring Expressions, and Caption-Based Image Retrieval.
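The two pretraining tasks can be pictured as two ways of turning image-caption pairs into training examples. The sketch below is an illustration under stated assumptions, not the paper's data pipeline: the helper names, the `[MASK]` placeholder string and the `mask_prob` parameter are hypothetical, and the article does not give the masking rate actually used. It masks caption words only; masking image regions would follow the same pattern.

```python
import random

MASK = "[MASK]"  # placeholder symbol, assumed for illustration

def mask_tokens(tokens, mask_prob, rng):
    # Masked learning: hide a random subset of inputs and keep the
    # originals as targets the model must reconstruct from context.
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)
        else:
            inputs.append(tok)
            targets.append(None)  # no loss on unmasked positions
    return inputs, targets

def alignment_examples(pairs, rng):
    # Alignment prediction: for each image emit its true caption (label 1)
    # and a caption borrowed from a different pair (label 0).
    # Requires at least two pairs.
    examples = []
    for i, (image, caption) in enumerate(pairs):
        examples.append((image, caption, 1))
        j = rng.choice([k for k in range(len(pairs)) if k != i])
        examples.append((image, pairs[j][1], 0))
    return examples

rng = random.Random(0)
inputs, targets = mask_tokens(["a", "dog", "on", "grass"], 0.25, rng)
examples = alignment_examples([("img0", "c0"), ("img1", "c1"), ("img2", "c2")], rng)
```

Neither task needs human labels beyond the image-caption pairing itself, which is what makes the pretraining self-supervised.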

The full ViLBERT model outperformed task-specific state-of-the-art models across the four tasks, with the most significant accuracy gains for VQA and Grounding Referring Expressions on the RefCOCO+ dataset. Based on the results, researchers propose that ViLBERT can learn critical visual-linguistic links that could be a helpful feature for downstream vision-and-language tasks. For example, a visual representation of dog breeds can be of greater use if the downstream model can associate it with accurate phrases like “beagle” or “shepherd.”

Researchers note that transferring the ViLBERT model to different tasks only requires adding a task-specific classifier. This simple implementation shows the enormous potential of the joint model in self-supervised learning across a wide range of vision-and-language tasks.
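As a rough picture of what “adding a task-specific classifier” can mean, the sketch below puts a single linear layer with a softmax on top of one pooled image vector and one pooled text vector. Everything here is an assumption for illustration: the function name `classify`, the fusion of the two vectors by element-wise product, and the toy weights are not taken from the article, and the pretrained model that would produce `h_img` and `h_txt` is left out entirely.

```python
import math

def classify(h_img, h_txt, weights, bias):
    # Fuse the two pooled representations (assumed: element-wise product),
    # then apply one linear layer followed by a softmax over classes.
    fused = [a * b for a, b in zip(h_img, h_txt)]
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: 2-dimensional features, 3 candidate answers.
probs = classify(
    h_img=[1.0, 2.0],
    h_txt=[0.5, 0.5],
    weights=[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
    bias=[0.0, 0.0, 0.0],
)
best = probs.index(max(probs))
```

Only `weights` and `bias` are new for each task in this picture; the joint representation underneath is reused, which is why the transfer step is described as simple.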

For many NLP researchers, the BERT language model is a gift that keeps on giving. Potential applications for the new joint vision-and-language ViLBERT model are many, and could include, for example, helping visually impaired individuals better understand their surroundings in real time.

The paper ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks is on arXiv.