Providing systems with the ability to link linguistic and visual content is one of the significant achievements of computer vision. Tasks such as image captioning and retrieval were designed to test this ability, but evaluating them is complicated by confounding abilities and dataset biases. Now, a cooperative research group from Facebook and the University of Southern California has introduced an alternative evaluation task for visual-grounding systems: BISON (Binary Image Selection).

“Given a caption the system is asked to select the image that best matches the caption from a pair of semantically similar images. The system’s accuracy on this Binary Image SelectiON (BISON) task is not only interpretable, but also measures the ability to relate fine-grained text content in the caption to visual content in the images. We gathered a BISON dataset that complements the COCO Captions dataset and used this dataset in auxiliary evaluations of captioning and caption-based retrieval systems. While captioning measures suggest visual grounding systems outperform humans, BISON shows that these systems are still far away from human performance.” (arXiv).

Synced invited Hong Kong University of Science and Technology Professor Qifeng Chen to provide insights and share his thoughts on BISON.

Could you give us a brief description of the tech behind Binary Image Selection (BISON)?

BISON refers to a dataset for fine-grained visual grounding. Given a caption and a pair of semantically similar images, a system must select the image that better matches the caption.
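The task described above amounts to a simple binary-selection accuracy metric. Below is a minimal sketch of how such an evaluation could be computed; the `score` function is a hypothetical stand-in for a real caption–image matching model, and the toy keyword-overlap scorer is purely illustrative, not the authors' implementation.

```python
# Sketch of BISON-style evaluation: for each example, the system scores
# how well the caption matches each of two images and must pick the
# correct one. Accuracy on these binary choices is the metric.

def bison_accuracy(examples, score):
    """examples: list of (caption, image_a, image_b, correct_index);
    score(caption, image) returns a higher value for a better match."""
    correct = 0
    for caption, image_a, image_b, true_idx in examples:
        predicted = 0 if score(caption, image_a) >= score(caption, image_b) else 1
        correct += int(predicted == true_idx)
    return correct / len(examples)

# Toy usage: images are represented by tag sets, and the "model" is a
# trivial keyword-overlap scorer (an assumption for illustration only).
def toy_score(caption, image_tags):
    return len(set(caption.lower().split()) & set(image_tags))

examples = [
    ("a dog on the beach", {"dog", "beach", "sand"}, {"cat", "sofa"}, 0),
    ("a cat on a sofa", {"dog", "beach"}, {"cat", "sofa"}, 1),
]
print(bison_accuracy(examples, toy_score))  # 1.0 on this toy data
```

Because each prediction is simply right or wrong, the resulting accuracy is directly interpretable, which is the property the authors emphasize over caption-generation metrics.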

Why does this tech matter?

BISON provides a new automatic evaluation metric for visual grounding tasks such as image captioning and retrieval. The dataset is available on GitHub, so everyone can use it now.

What impact might this bring to the AI community?

This dataset shows that existing visual grounding methods are still far from human performance. It will provide a more reliable metric for measuring progress in visual grounding. I believe more and more breakthroughs will be proposed and evaluated on BISON.

Can you identify any bottlenecks?

BISON is not perfect, as the authors acknowledge: it does not measure language generation quality. Also, the evaluation is restricted to the images and captions provided in the BISON dataset.

Could you predict any potential future trends related to this tech?

BISON is a new dataset that connects computer vision and NLP. It may serve as the standard evaluation platform for other visual grounding tasks. Potentially, this dataset can lead to systems that reach the human level on visual grounding tasks.

The BISON validation data and evaluation code are currently open sourced. The paper Binary Image Selection (BISON): Interpretable Evaluation of Visual Grounding is on arXiv.

About Prof. Qifeng Chen

Dr. Qifeng Chen is an assistant professor of CSE and ECE at HKUST. He received his Ph.D. in computer science from Stanford University in 2017, and a bachelor’s degree in computer science and mathematics from HKUST in 2012. His research interests are computer vision, machine learning, optimization, and computer graphics. Four of his papers were selected for full oral presentations at ICCV’15, CVPR’16, ICCV’17, and CVPR’18. In 2011, he won second place worldwide at the ACM-ICPC World Finals. He also earned a gold medal at IOI 2007. He co-founded the startup Lino in 2017.

Synced Insight Partner Program

The Synced Insight Partner Program is an invitation-only program that brings together influential organizations, companies, academic experts and industry leaders to share professional experiences and insights through interviews, public speaking engagements, and more. Synced invites all industry experts, professionals, analysts, and others working in AI technologies and machine learning to participate.

Simply apply for the Synced Insight Partner Program and let us know about yourself and your focus in AI. We will give you a response once your application is approved.