In June, fourth-year UC Santa Barbara computer science PhD candidate Xin Wang received the CVPR 2019 Best Student Paper Award. Now, Wang has received three “Strong Accepts” for his new ICCV submission, VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research — a research collaboration between UCSB and the AI Lab of ByteDance, parent company of Tiktok.

The paper presents a new large-scale multilingual video description dataset, covering over 41,250 videos and 825,000 captions in both Chinese and English. In VATEX each video is accompanied by ten different English and Chinese captions contributed by a team of 20 human annotators. There are more than 206,000 English-Chinese parallel translation pairs among captions depicting over 600 human activities. VATEX is more linguistically complex regarding both video and natural language translations than other existing datasets.

The research team introduced two video-and-language tasks based on VATEX. The first, Multilingual Video Captioning, focuses on interpreting a video in numerous languages with a compact unified captioning model. The other task, Video-guided Machine Translation, uses video information as supplementary spatiotemporal context to help translate a source language description into a target language description.

Figure 1. VATEX tasks

Figure 1 is a demonstration of the VATEX tasks in which (a) shows how a compact unified video captioning model accurately describes video content in both English and Chinese. The upper part of image (b) shows that the machine translation model mistakenly describes “pull up bar” as “pulling pub” and “do pull-ups” as “do pull.” The lower part of image (b) shows how the English sentence can be more accurately translated into Chinese with the addition of relevant video context.

Figure 2. VATEX example

Above is an example showing VATEX dataset entries, with 10 English and 10 Chinese descriptions taken from the same video. The lower (highlighted) descriptions have been paired by the machine translation model.

The VATEX dataset enables machines to accurately and efficiently describe videos in both English and Chinese and enhances the performance of monolingual models. By leveraging the capabilities of VATEX, machine translation models can align source and target languages with spatiotemporal video context to take video-and-language research to a higher level.

The paper VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research is on arXiv.