Google AI has introduced a deep learning-based approach that predicts depth from videos in which both the camera and the subject are in motion.

Humans are very good at making sense of the 3D world through 2D projections. Watching a movie screen, for example, we can guesstimate the relative positions of pedestrians, buildings and boulevards. Even in complex environments with objects in motion, we can still form a fairly sound understanding of where everything is. Computer vision, however, does not do so well in this regard. Researchers in the field have long sought a way to achieve 3D world understanding by computationally reconstructing geometry and depth ordering from 2D image data.

Computer vision models struggle most when both camera and objects in a scene are in motion. The freely moving camera and objects confuse conventional 3D reconstruction algorithms since the traditional approach assumes the same object can be observed from more than one viewpoint at the same time, enabling triangulation. The assumption requires either a multi-camera array, or that all objects remain stationary while one camera moves through the scene.
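To make the triangulation assumption concrete, here is a minimal sketch in Python with NumPy. It is an illustration only, not code from the Google AI work: the camera intrinsics, the 0.5-unit sideways camera shift, and the function names are all assumptions chosen for the example. It triangulates a point from two views with the standard linear (DLT) method, first with the point held still and then with the point moving between the two frames.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views.

    P1, P2 are 3x4 projection matrices; x1, x2 are pixel coordinates.
    """
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous solution is the right singular vector belonging
    # to the smallest singular value of A.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Hypothetical pinhole camera (assumed focal length and principal point).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
# Second frame: the same camera after moving 0.5 units to the right.
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])

point = np.array([0.2, -0.1, 4.0])

# Static subject: both frames observe the same 3D point.
static_estimate = triangulate(P1, P2, project(P1, point), project(P2, point))

# Moving subject: the point shifts sideways between the two frames.
moved_point = point + np.array([0.3, 0.0, 0.0])
moving_estimate = triangulate(P1, P2, project(P1, point), project(P2, moved_point))

print("true depth:            ", point[2])
print("static-scene estimate: ", static_estimate[2])
print("moving-subject estimate:", moving_estimate[2])
```

With a static point the estimate matches the true position. When the point moves between frames, the two viewing rays no longer correspond to one 3D location, so the recovered depth is badly wrong even though the algorithm itself is unchanged.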

The Google AI researchers trained the model on 2,000 “Mannequin Challenge” YouTube videos. A viral trend in 2016, these videos show groups of people frozen in place like characters in the film The Matrix while a camera operator moves through the scene and records it. Because the people hold still, the whole scene is static and its depth can be recovered with conventional methods to serve as training data. By learning priors on human poses and shapes from this data, the model can perform accurate dense depth prediction on videos in which both the camera and the people are moving, without direct 3D triangulation. The researchers focused on depth prediction for humans because humans tend to feature prominently in related applications such as augmented reality, and because human motion is relatively difficult to model.

Depth prediction network: The input includes an RGB image (Frame t), a mask of the human region, and an initial depth for the non-human regions, computed from motion parallax (optical flow) between the input frame and another frame in the video. The model outputs a full depth map for Frame t. Training is supervised by depth maps computed with multi-view stereo methods.
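The three inputs described above can be pictured as channels stacked into a single array. The sketch below is a hedged illustration in Python with NumPy, not the researchers' implementation: the function name build_network_input, the five-channel layout, and the choice to zero out the initial depth inside the human mask are assumptions made for the example.

```python
import numpy as np

def build_network_input(rgb, human_mask, parallax_depth):
    """Stack RGB, human mask, and masked initial depth into an H x W x 5 array.

    rgb:            H x W x 3 image for Frame t
    human_mask:     H x W boolean array, True where a person is visible
    parallax_depth: H x W initial depth estimated from motion parallax
    """
    mask = human_mask.astype(np.float32)
    # Assumption: keep the parallax depth only in non-human regions,
    # since it is unreliable where people are moving.
    masked_depth = parallax_depth.astype(np.float32) * (1.0 - mask)
    return np.concatenate(
        [rgb.astype(np.float32), mask[..., None], masked_depth[..., None]],
        axis=-1,
    )

# Toy example with a small frame and a rectangular "human" region.
h, w = 4, 6
rgb = np.full((h, w, 3), 0.5)
human_mask = np.zeros((h, w), dtype=bool)
human_mask[1:3, 2:4] = True
parallax_depth = np.full((h, w), 3.0)

net_input = build_network_input(rgb, human_mask, parallax_depth)
print(net_input.shape)
```

In this layout the network sees where the people are and has no depth hint there, so it must fill in those pixels from what it has learned about human shape and pose.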

In a blog post, Google AI researchers point out the method’s innovation: “While there is a recent surge in using machine learning for depth prediction, this work is the first to tailor a learning-based approach to the case of simultaneous camera and human motion.” The research is receiving attention on social media.

The paper Learning the Depths of Moving People by Watching Frozen People is on arXiv.