Google’s Bidirectional Encoder Representations from Transformers (BERT) has demonstrated tremendous success across various language tasks. Although the model was originally tailored for single-modal data like text, this summer Google Research extended BERT to video and introduced VideoBERT, a self-supervised learning model that can learn the relationship between videos and texts. VideoBERT achieves a higher level of abstraction than previous models built mostly on GAN-based approaches.

VideoBERT’s core objective is to model text and video representation and learn their multimodal relationships, which can be extended to downstream applications such as video captioning, action classification, and even future video predictions. Given a few video frames that include a bowl of flour and cocoa powder, VideoBERT can speculate that the following video frames might involve baking a brownie or cupcake.

The first training step is data preparation. The VideoBERT model was trained on more than one million YouTube videos of cooking, gardening, and vehicle repair. The video features were extracted using S3D. Researchers also applied automatic speech recognition (ASR) to extract sentences from YouTube audio.

To train VideoBERT, both video frames and text data are first tokenized (1.5-second image frames are used as video tokens). Like BERT, VideoBERT takes the cloze test — technically predicting missing word tokens or video tokens which are randomly masked in a sequence — as its proxy task. To further learn the text-video relationship, researchers proposed a linguistic-visual alignment classification objective to predict where text is aligned with video.

Researchers tasked the pretrained VideoBERT to perform “zero-shot” classification, where the model had no prior knowledge of the actual videos except action and object labels from ground truth captions and is asked to generate the top three matching words for input nouns and verbs. In comparison with the supervised model S3D, VideoBERT showed better performance under cross-modal conditions.

Researchers also tested the feature-extraction effectiveness of VideoBERT in video captioning. Under the same setup and following standard procedures, VideoBERT outperformed S3D on all metrics. The best performance was with a combination of VideoBERT and S3D. The results are shown below:

Google Research says VideoBERT is the first research in the area of joint representation learning. The paper VideoBERT: A Joint Model for Video and Language Representation Learning is on arXiv.