Machines struggle to make sense of scenes and language without detailed accompanying annotations. Unfortunately, labeling is generally time-consuming and expensive, and even the best labels convey an understanding only of scenes and not of language.

In an attempt to remedy the problem, Microsoft researchers conceived of an AI system that trains on image-text pairs in a fashion mimicking the way humans improve their understanding of the world. They say that their single-model encoder-decoder Vision-Language Pre-training (VLP) model, which can both generate image descriptions and answer natural language questions about scenes, lays the groundwork for future frameworks that could reach human parity.

A model pretrained using three million image-text pairs is available on GitHub in open source.

“Making sense of the world around us is a skill we as human beings begin to learn from an early age … The more we interact with our physical environments … the better we become at understanding and using language to explain the items that exist and the things that are happening in our surroundings,” wrote Microsoft senior researcher Hamid Palangi in a blog post. “For machines, on the other hand, scene understanding and language understanding are quite challenging to hone, especially with only weak supervision, essentially the indirect learning people are able to leverage so well.”

As Palangi and colleagues explain, image captioning and visual question answering quality algorithms usually underperform for three reasons: (1) They can’t leverage context to describe images and perform reasoning about them; (2) they’re not tapping large-scale training data for pre-training; and (3) their architecture isn’t designed to perform well on language, vision alignment, and language generation tasks. The team sought to overcome those with an architecture comprising an encoder (which learns numerical representations of data it’s given) and a decoder (which converts the encoder’s representations into human-interpretable information) pre-trained together and optimized for two kinds of predictions. They say that it created better-aligned encoder and decoder representations in the end, allowing them the use the same model for objectives as different as image captioning and visual question answering.

Image Credit: Microsoft

The researchers evaluated VLP’s ability to caption and reason over images on publicly available benchmarks, including COCO, Flickr30K, and VQA 2.0. They report that it not only outperformed state-of-the-art models on several image captioning and visual question answering metrics, but that it managed to answer questions about images (like those having to do with similarity in clothing design) with which previous models trained only on language struggled.

“With smart model design and smart data selection, we can capitalize on existing publicly available resources to reach even greater heights in language and scene understanding, as evidenced by VLP,” wrote Palangi. “With VLP, we believe we show the potential of unified models to reach the levels of language and scene understanding necessary to successfully complete a variety of distinct downstream tasks — single models that complete multiple tasks efficiently without sacrificing performance. That means more effective and capable vision-language systems without the costs of several separately trained models to achieve the same goals.”

The team leaves to future work strengthening the model’s architecture while adding more data during pretraining.