Crowd Acting™: How to Grow Large-Scale Video Datasets for Deep Learning

In this post, we discuss the limitations of the traditional data collection approach and illustrate how crowd acting, an innovative alternative, can grow large-scale video datasets for deep learning. Our crowd-acted datasets Jester and Something-Something are publicly released and free for academic use.

Data is the unreasonably effective force behind the current deep learning breakthroughs. Without sufficient data, even the most intricate neural network running on the best hardware falls short of human-level performance. As video data becomes ubiquitous, we will rely on machines to reason about and extract information from the vast number of videos made available by social media and camera-enabled devices.

GIF 1: Data is essential to high-performing AI algorithms

Supervised learning will drive most commercial successes in deep learning, but its data collection process is flawed. Finding no suitable video dataset for teaching machines to understand the world, we developed crowd acting, an industrial data collection approach inspired by previous contributions, particularly Hollywood in Homes and its dataset Charades (Sigurdsson et al.). With crowd acting, we industrialized what had been a largely academic video data collection process, driving down the unit cost per sample and making video understanding commercially scalable.

Building real-world video datasets comes at a high opportunity cost, requiring a significant amount of time and resources. Nevertheless, we have built the largest industrial data factory for video applications and spearheaded the creation of the first two real-world video datasets, Jester and Something-Something, which we released to the public. The world urgently needs an innovative data collection approach for video datasets, and we believe crowd acting is the solution.

The Past: Crowdsourcing Data Collection

A high-quality dataset should have a human-centric, logical, and balanced taxonomy, featuring natural video scenes and dynamic actions performed by a large group of people of different ethnicities, genders, and backgrounds. Each data sample should be densely captioned with minimal label noise and errors. Most importantly, the dataset should be relevant to real-world challenges.
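One of the criteria above, a balanced taxonomy, is easy to quantify. Below is a minimal sketch (not code from our pipeline; the function name and toy labels are hypothetical) that measures how evenly samples are spread across classes:

```python
from collections import Counter

def class_balance(labels):
    """Return the ratio of the rarest to the most common class count.

    A value near 1.0 indicates a balanced taxonomy; values near 0
    flag classes that are heavily under-represented.
    """
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())

# Hypothetical toy labels for illustration only:
labels = ["pushing", "pulling", "pushing", "dropping", "pulling", "pushing"]
print(round(class_balance(labels), 2))  # 1 "dropping" vs 3 "pushing" -> 0.33
```

In practice one would run a check like this per demographic attribute as well (ethnicity, gender), not just per action label, before declaring a dataset balanced.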

The traditional data collection approach, however, fails to produce such high-quality deep learning datasets. Video datasets such as Kinetics and AVA have made indispensable contributions to the AI community and are steps in the right direction. But because they adopted the traditional data collection approach, these datasets cannot unlock the full potential of video understanding.

As shown in Image 1, the status quo approach consists of four unidirectional steps: taxonomy, data mining, human annotation, and training.
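The four unidirectional steps can be sketched as a simple one-way pipeline. All function names, class labels, and file names below are illustrative stubs, not an actual API:

```python
def define_taxonomy():
    # Step 1: fix the set of target classes up front.
    return ["opening a door", "closing a door"]

def mine_data(taxonomy):
    # Step 2: scrape candidate clips for each class (stubbed here).
    return {label: [f"{label}_clip_{i}.mp4" for i in range(2)]
            for label in taxonomy}

def annotate(candidates):
    # Step 3: humans confirm each mined clip's label (stub: accept all).
    return [(clip, label)
            for label, clips in candidates.items()
            for clip in clips]

def train(samples):
    # Step 4: train a model on the annotated samples (stub: count them).
    return f"trained on {len(samples)} samples"

# The steps run strictly in one direction, with no feedback loop:
print(train(annotate(mine_data(define_taxonomy()))))  # trained on 4 samples
```

The key property this sketch makes visible is the lack of any feedback: nothing learned during annotation or training flows back to refine the taxonomy or the mining step.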