“Who Am I?” is an interactive party game in which participants ask yes-or-no questions to gain clues and guess a character’s concealed identity. When human players engage in the game, a number of generally effective strategies emerge over time. Now, researchers from Beijing University of Posts and Telecommunications have introduced a novel visual dialogue state tracking (VDST) model that performs well at a similar visual dialogue guessing game, “GuessWhat?!”

Visual dialogue is a multi-modal task involving vision, language and reasoning in a continuous conversation. Visual dialogue in a photo app can help users interactively discover items; for example, vision-impaired people can use it to identify people, buildings, landmarks, etc. in photos.

GuessWhat?! is a two-player guessing game introduced as a testbed for research on the interplay of computer vision and dialogue systems in the paper GuessWhat?! Visual object discovery through multi-modal dialogue, by researchers from the University of Montreal, Inria, DeepMind and Twitter. A Questioner and an Oracle are given an image that includes several objects. The Oracle secretly selects one target object, which the Questioner attempts to identify by asking the Oracle yes/no questions.

The Questioner has two sub-tasks: Question Generator (QGen), which generates visually grounded questions, and Guesser, which identifies the target object given the full dialogue context.
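The game loop described above can be sketched in a few lines of Python. This is only a toy illustration under strong assumptions: objects are reduced to two hand-written attributes instead of image regions, questions are (attribute, value) pairs instead of natural language, and the names Obj, oracle, qgen, guesser and play are hypothetical rather than taken from the GuessWhat?! code or paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    """Toy stand-in for an object in the image (assumed attributes)."""
    category: str
    color: str


def oracle(target, question):
    """Answer a yes/no question, given as (attribute, value), about the secret target."""
    attribute, value = question
    return getattr(target, attribute) == value


def qgen(candidates, asked):
    """Toy question generator: return an unasked question that splits the candidates."""
    for attribute in ("category", "color"):
        for value in sorted({getattr(o, attribute) for o in candidates}):
            question = (attribute, value)
            matches = sum(getattr(o, attribute) == value for o in candidates)
            if question not in asked and 0 < matches < len(candidates):
                return question
    return None  # no remaining question separates the candidates


def guesser(candidates):
    """Toy guesser: pick the first object still consistent with the dialogue."""
    return candidates[0]


def play(objects, target, max_questions=5):
    """Run one game and return the final guess plus the questions asked."""
    candidates = list(objects)
    asked = []
    while len(asked) < max_questions and len(candidates) > 1:
        question = qgen(candidates, asked)
        if question is None:
            break
        answer = oracle(target, question)
        asked.append(question)
        attribute, value = question
        # Keep only objects that agree with the Oracle's answer.
        candidates = [o for o in candidates
                      if (getattr(o, attribute) == value) == answer]
    return guesser(candidates), asked


objects = [Obj("dog", "brown"), Obj("dog", "white"), Obj("cat", "white")]
target = objects[1]
guess, asked = play(objects, target)
```

In the real task both QGen and the Guesser are learned neural models working from image features; the sketch only shows how the two sub-tasks and the Oracle interact over a dialogue.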

Previous question generation models paid little attention to the representation and tracking of dialogue states, and tended to ask low-quality and repeated questions. To tackle this problem, the research team proposed a visual dialogue state tracking (VDST) based QGen model, which includes a visual-language-visual multi-step reasoning cycle.

A visual dialogue state reflects both the representation and distribution of objects in an image. The representations are tracked and updated with changes in distribution, and an object-difference based attention is used to decode new questions. Distribution is updated by comparing the question-answer pair and the objects.

The Questioner uses the answers received from the Oracle to reevaluate where to most effectively direct its attention in the images.
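To make the idea of tracking a distribution over objects more concrete, here is a minimal sketch, not the paper's model. It assumes symbolic object attributes, a simple multiplicative reweighting in place of the learned comparison between the question-answer pair and the objects, and a pairwise-difference score in place of the learned object-difference attention. The function names and the eps parameter are hypothetical.

```python
def update_distribution(dist, objects, question, answer, eps=0.05):
    """Reweight the belief over objects by agreement with a question-answer pair.

    eps is an assumed noise level: objects that contradict the answer keep a
    small amount of probability instead of being ruled out entirely.
    """
    attribute, value = question
    scores = []
    for p, obj in zip(dist, objects):
        agrees = (obj[attribute] == value) == answer
        scores.append(p * ((1.0 - eps) if agrees else eps))
    total = sum(scores)
    return [s / total for s in scores]


def difference_attention(dist, objects, attributes):
    """Weight each attribute by the probability mass of object pairs that differ on it."""
    weights = {}
    for attr in attributes:
        weights[attr] = sum(
            dist[i] * dist[j]
            for i in range(len(objects))
            for j in range(len(objects))
            if objects[i][attr] != objects[j][attr]
        )
    total = sum(weights.values())
    return {a: (w / total if total else 0.0) for a, w in weights.items()}


objects = [
    {"category": "dog", "color": "brown"},
    {"category": "dog", "color": "white"},
    {"category": "cat", "color": "white"},
]
prior = [1.0 / len(objects)] * len(objects)

# The Oracle answered "No" to "Is it a cat?"
posterior = update_distribution(prior, objects, ("category", "cat"), False)
attention = difference_attention(posterior, objects, ["category", "color"])
```

After the "No" answer, most of the probability sits on the two dogs, so the attention shifts toward color, the attribute that still separates them. This mirrors, in a very simplified form, how an updated distribution can steer the next question away from repeats.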

This research achieves SOTA performance on the GuessWhat?! QGen task under four different training methods: Separate Supervised Learning, Joint Supervised Learning, Separate Reinforcement Learning, and Cooperative Learning. The proposed model reduces the rate of repeated questions from more than 50 percent to 21.9 percent compared with previous SOTA methods.

The researchers noted that challenges to be addressed in the future include learning more flexible question-asking policies and improving judgement about when to stop asking questions and make a guess.

The paper Visual Dialogue State Tracking for Question Generation is available on arXiv.