As any avid reader will attest, humans can envision even complex scenes from just a few well-chosen words. Artificial intelligence systems, however, have struggled with the task of turning text descriptions into pictures. Now, researchers from Microsoft and JD AI Labs have proposed Object-driven Attentive Generative Adversarial Networks (Obj-GAN), a new model capable of generating relatively complex scenes from a short phrase or sentence of descriptive text.

Obj-GAN’s generator attends to descriptive words and object-level information to gradually refine the synthesized image, outperforming previous state-of-the-art models on image detail and on the relationships between compositional elements.

Below are comparisons of a real-life picture and images generated from its text description using different AI techniques. The results show that as the descriptions become more complex, Obj-GAN increasingly outperforms other GANs at turning text into realistic images.

Testing Obj-GAN’s generalization ability, the researchers found that text inputs describing scenarios that make little sense in the real world would yield images with correspondingly unreasonable physics or object relationships.

One difficulty in generating an image from text is finding a way for AI systems to understand the relationships between multiple objects in a scene. Previous approaches used image-description pairs that provided coarse-grained signals for only a single object, so even the best-performing models of this type had difficulty generating images containing multiple objects arranged in reasonable configurations.

To overcome this problem, the researchers proposed a new object-driven attention mechanism that divides image generation into two steps:

First, a seq2seq attentive model converts the text into a semantic layout, consisting of bounding boxes with class labels and object shapes.
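The spirit of that first step can be sketched as follows. This is a toy, untrained stand-in written in plain NumPy, not the paper's actual LSTM-based box generator: all weights are random, and the vocabulary, dimensions, and class list are hypothetical. At each decoding step the model attends over the encoded caption words and emits one (class label, bounding box) pair.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy encoder: one random embedding per word (stand-in for a learned LSTM encoder).
words = ["a", "dog", "catching", "a", "frisbee"]
d = 16
emb = {w: rng.normal(size=d) for w in set(words)}
enc_states = np.stack([emb[w] for w in words])      # (num_words, d)

# Toy decoder parameters (all random; a real model would learn these).
W_attn = rng.normal(size=(d, d))
W_box = rng.normal(size=(4, d))
classes = ["dog", "frisbee"]                        # hypothetical label set
W_cls = rng.normal(size=(len(classes), d))

state = enc_states.mean(axis=0)                     # decoder initial state
layout = []
for _ in range(2):                                  # emit two objects
    scores = enc_states @ (W_attn @ state)          # attention logits over words
    alpha = softmax(scores)                         # attention weights
    ctx = alpha @ enc_states                        # context vector
    cls = classes[int(np.argmax(W_cls @ ctx))]      # predicted class label
    box = 1 / (1 + np.exp(-(W_box @ ctx)))          # normalized (x, y, w, h)
    layout.append((cls, box))
    state = ctx                                     # feed context back as state
```

With random weights the output layout is meaningless, but the loop shape — attend over caption words, emit a labeled box, repeat — mirrors the attentive seq2seq layout step described above.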

Then a multi-stage attentive image generator creates a low-resolution image on top of this layout and refines details in each region by attending to the most relevant words and the pre-generated class labels. The researchers also designed patch-wise and object-wise discriminators to determine whether the synthesized image matches the text description and the pre-generated layout.
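The region-refinement idea can be illustrated with a minimal NumPy sketch. Here each image region queries the caption's word embeddings, with an added bias toward the word matching that region's pre-generated class label; the resulting context vectors would drive the refinement. This is a crude, hypothetical stand-in for the paper's label-conditioned attention, with random features and made-up shapes throughout:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16

# Random stand-ins for learned features.
word_emb = rng.normal(size=(5, d))          # (num_words, d) caption embeddings
region_feats = rng.normal(size=(4, d))      # (num_regions, d) image features
region_labels = rng.integers(0, 5, size=4)  # pre-generated class word per region

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Object-driven attention: regions attend over words, with a large bonus on
# each region's own label word (a crude proxy for label conditioning).
logits = region_feats @ word_emb.T                  # (regions, words)
logits[np.arange(4), region_labels] += 10.0         # bias toward label word
attn = softmax(logits, axis=1)                      # per-region word weights
context = attn @ word_emb                           # (regions, d) refinement input
```

The point of the sketch is the conditioning: unlike plain word-level attention, each region's attention is steered by the class label assigned to it in the layout step.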

In their experiments, the researchers found that Obj-GAN outperformed previous state-of-the-art (SOTA) methods on various COCO benchmark tasks, increasing the Inception score by 27 percent.
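For context, the Inception score used in that comparison is a standard metric computed from a pretrained classifier's label predictions on generated images: IS = exp(E_x[KL(p(y|x) || p(y))]), rewarding images that are individually classifiable yet collectively diverse. A quick illustration with made-up prediction vectors (the metric itself, not the paper's evaluation code):

```python
import numpy as np

def inception_score(probs):
    """probs: (N, K) predicted class distributions for N generated images."""
    p_y = probs.mean(axis=0)                        # marginal label distribution
    kl = (probs * (np.log(probs) - np.log(p_y))).sum(axis=1)
    return float(np.exp(kl.mean()))                 # exp of mean per-image KL

# Confident, diverse predictions score higher than uninformative uniform ones.
confident = np.array([[0.98, 0.01, 0.01],
                      [0.01, 0.98, 0.01],
                      [0.01, 0.01, 0.98]])
uniform = np.full((3, 3), 1 / 3)                    # scores exactly 1.0
```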

The paper Object-driven Text-to-Image Synthesis via Adversarial Training has been accepted by CVPR 2019 and is on arXiv. The Obj-GAN model and code have been open-sourced on GitHub.