NVIDIA and the MIT Computer Science & Artificial Intelligence Laboratory (CSAIL) have open-sourced their video-to-video synthesis model. Built on a generative adversarial learning framework, the method generates high-resolution, photorealistic, and temporally coherent video from various input formats, including segmentation masks, sketches, and poses.

Compared to image-to-image translation, video-to-video synthesis has received far less research attention. To address the low visual quality and temporal incoherence of video produced by existing image synthesis approaches, the research group proposes a novel video-to-video synthesis method capable of generating 2K-resolution street-scene videos up to 30 seconds long.
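To illustrate the general idea, here is a minimal toy sketch (not the authors' implementation) of the sequential, conditional structure such a video-to-video GAN uses: each frame is generated from the current semantic input and a window of previously generated frames, while an image discriminator scores per-frame realism and a video discriminator scores temporal coherence. All networks below are hypothetical stand-in functions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, L = 8, 8, 2  # tiny frames; condition on the L most recent outputs

def generator(seg_map, past_frames):
    """Hypothetical G: blends the semantic map with recent generated frames."""
    if len(past_frames):
        context = past_frames.mean(axis=0)
    else:
        context = np.zeros((H, W))
    return 0.7 * seg_map + 0.3 * context

def image_discriminator(frame):
    """Hypothetical per-frame realism score in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-frame.mean()))

def video_discriminator(frames):
    """Hypothetical coherence score: penalizes large frame-to-frame jumps."""
    diffs = [np.abs(frames[i + 1] - frames[i]).mean()
             for i in range(len(frames) - 1)]
    return 1.0 / (1.0 + np.mean(diffs)) if diffs else 1.0

# A short "video" of input segmentation maps (random stand-ins).
seg_maps = [rng.random((H, W)) for _ in range(5)]

generated = []
for seg in seg_maps:
    past = np.array(generated[-L:]) if generated else np.empty((0, H, W))
    generated.append(generator(seg, past))

realism = image_discriminator(generated[0])
coherence = video_discriminator(generated)
print(len(generated), realism, coherence)
```

In training, the generator would be updated to push both discriminator scores toward "real", which is what encourages photorealism and temporal stability simultaneously; the stand-in functions here only show the data flow.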

(Image: the team’s 2017 Image Synthesis and Semantic Manipulation research)

The authors performed extensive experimental validation on multiple datasets, and the model outperformed existing approaches both quantitatively and qualitatively. In addition, when the team extended the method to multimodal video synthesis, identical input data produced videos with new visual appearances in the scene while retaining high resolution and temporal coherence.

The researchers suggest the model could be improved in future work by adding 3D cues such as depth maps to better synthesize turning cars; by using object tracking to ensure an object maintains its colour and appearance throughout a video; and by training with coarser semantic labels to address issues in semantic manipulation.

The Video-to-Video Synthesis paper is on arXiv, and the team’s model and data are here.

Author: Victor Lu | Editor: Michael Sarazen
