Traditional methods for estimating 3D structure and camera motion from video rely heavily on hand-crafted assumptions such as continuity and planarity. Google researchers have now presented a deep learning alternative that learns such priors from unlabelled video. They succeeded in training a deep network to predict the camera's intrinsic parameters. The method is presented as the first to address occlusions geometrically, and it substantially reduces the amount of semantic understanding needed to detect moving elements in video content.

From the paper abstract:

We present a novel method for simultaneous learning of depth, egomotion, object motion, and camera intrinsics from monocular videos, using only consistency across neighboring video frames as supervision signal. Similarly to prior work, our method learns by applying differentiable warping to frames and comparing the result to adjacent ones, but it provides several improvements: We address occlusions geometrically and differentiably, directly using the depth maps as predicted during training. We introduce randomized layer normalization, a novel powerful regularizer, and we account for object motion relative to the scene. To the best of our knowledge, our work is the first to learn the camera intrinsic parameters, including lens distortion, from video in an unsupervised manner, thereby allowing us to extract accurate depth and motion from arbitrary videos of unknown origin at scale. We evaluate our results on the Cityscapes, KITTI and EuRoC datasets, establishing new state of the art on depth prediction and odometry, and demonstrate qualitatively that depth prediction can be learned from a collection of YouTube videos. (arXiv).

Synced invited Dr. Yibiao Zhao, Co-Founder and CEO of ISEE, whose research has focused on computational cognitive science, machine learning, and robotics, to share his thoughts on self-supervised depth learning for arbitrary videos.

Could you briefly describe self-supervised depth estimation?

Self-supervised depth estimation is a very powerful approach, especially for applications in autonomous driving, but the idea is not new. At CVPR 2017, the paper “Unsupervised Monocular Depth Estimation With Left-Right Consistency” laid a good foundation for self-supervised depth estimation from a pair of stereo images.

Why does this research matter?

This paper outperforms previous approaches and achieves very impressive results based on two major insights: predictive learning leverages the huge amount of available video data, and differentiable inductive knowledge about latent geometric structure has been introduced into the NN-based model.

It is known that current Neural Network models are data hungry, while detailed annotations are challenging to acquire at scale. Predictive learning emerges as a powerful way to train NNs with the self-supervised signal from an abundant amount of video sequences online without much concern about the limited coverage of annotated data and scenarios.

To take advantage of the power of predictive learning, this paper explores the latent geometric structure between two successive frames. That structure is simply the geometric transformation between the two frames ($z' p' = K R K^{-1} z p + K t$). By introducing inductive knowledge about geometric/perspective structure and motion into the NN-based model design, and by relaxing the need for known intrinsic camera parameters, the paper makes learning much more data efficient and generalizable.
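The transformation quoted above can be made concrete with a small numeric sketch. It is not the paper's implementation: it assumes a plain pinhole camera without lens distortion, reads p and p' as homogeneous pixel coordinates (u, v, 1), z and z' as depths, K as the intrinsic matrix, and R and t as the rotation and translation that carry a 3D point from the first camera frame to the second. The helper names (`matvec`, `intrinsics`, `warp_pixel`) and all numeric values are hypothetical and chosen only for illustration.

```python
def matvec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def intrinsics(fx, fy, cx, cy):
    """Return a pinhole intrinsic matrix K and its closed-form inverse."""
    K = [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]
    K_inv = [[1.0 / fx, 0.0, -cx / fx],
             [0.0, 1.0 / fy, -cy / fy],
             [0.0, 0.0, 1.0]]
    return K, K_inv


def warp_pixel(u, v, z, K, K_inv, R, t):
    """Apply z'p' = K R K^-1 z p + K t to one pixel (u, v) with depth z.

    Returns (u', v', z'): the pixel location and depth in the second frame.
    """
    zp = [z * u, z * v, z]  # z * p, with p = (u, v, 1)
    rotated = matvec(K, matvec(R, matvec(K_inv, zp)))  # K R K^-1 z p
    shifted = matvec(K, t)                             # K t
    zp_new = [rotated[i] + shifted[i] for i in range(3)]
    z_new = zp_new[2]  # the third component is the new depth z'
    return zp_new[0] / z_new, zp_new[1] / z_new, z_new


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical camera: 500 px focal length, principal point at (320, 240).
K, K_inv = intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)

# Hypothetical motion: no rotation, and the point ends up 1 unit closer
# along the optical axis (the camera moved forward).
u2, v2, z2 = warp_pixel(420.0, 240.0, 10.0, K, K_inv, IDENTITY, [0.0, 0.0, -1.0])
print(u2, v2, z2)
```

In this example the pixel drifts away from the principal point (to roughly u' ≈ 431.1) while its depth drops from 10 to 9, which is the expected behaviour when a camera moves toward a static scene. In a self-supervised setup of the kind discussed here, the depth z, the motion (R, t), and K would be network predictions, and the warped frame would be compared against the real adjacent frame.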

Can you identify any bottlenecks in the research? What are your suggestions for future development?

As a recent review paper pointed out: “We conclude that current self-supervised methods are not ‘hard’ enough to take full advantage of large scale data and do not seem to learn effective high-level semantic representations.” There is no guarantee that a self-supervised algorithm can learn any meaningful high-level structure about the scene. Instead of learning from scratch without any inductive knowledge, I see a trend in the field toward combining self-supervised learning with inductive knowledge that provides high-level supervision guidance. That’s what we develop at iSEE AI to train autonomous driving systems with some common sense (MIT Technology Review, Fortune).

One criticism of this paper is that the authors did not show many visualizations of the internal representations the network learned, for example the foreground mask and translation field. I suspect the learned representations of the foreground mask and translation field are not well regularized. Some higher-level knowledge about scene and object motion may help to further improve performance.

The paper “Depth from Videos in the Wild: Unsupervised Monocular Depth Learning from Unknown Cameras” is on arXiv.