Video-to-Video Synthesis

Summary

In this paper, realistic, high-resolution videos are generated from semantic input images, such as segmentation masks and sketch (edge) maps, by a GAN called vid2vid. A summary of the paper follows.

They propose a GAN called vid2vid to synthesize video. Compared with pix2pixHD and COVST from previous work, the generated video exhibits less flickering because each frame is generated conditioned on the previous frames. They train the model in a “spatio-temporally progressive” manner, which alternates between temporally progressive training, where the number of frames used for synthesis grows as learning progresses, and spatially progressive training, where the resolution is gradually increased as in PG-GAN.

In the example below, two high-resolution videos are generated from the segmentation mask shown at the lower left.

Problem Formulation

Consider generating a video, i.e., a set of generated images $\tilde{x}_1^T = \{\tilde{x}_1, \dots, \tilde{x}_T\}$ from time 1 to time $T$, conditioned on semantic images $s_1^T = \{s_1, \dots, s_T\}$ over the same period.

Image generation at each time $t$ is then formulated as the conditional probability of the generated image $\tilde{x}_t$ given the generated images $\tilde{x}_{t-L}^{t-1}$ from time $t-L$ to $t-1$ and the semantic images $s_{t-L}^{t}$ from time $t-L$ to $t$, i.e. $p(\tilde{x}_t \mid \tilde{x}_{t-L}^{t-1}, s_{t-L}^{t})$.

The whole image sequence (the video) is then formulated as below.
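In the paper's notation, this is a Markov assumption of order $L$, and the conditional distribution of the whole video factorizes as:

$$ p(\tilde{x}_1^T \mid s_1^T) = \prod_{t=1}^{T} p(\tilde{x}_t \mid \tilde{x}_{t-L}^{t-1},\, s_{t-L}^{t}) $$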

Generator (F)

The architecture of the generator F is shown below.

Roughly speaking, the output consists of two images blended by a mask m. The first is the image at the previous time (t−1) warped by the estimated optical flow (blue). The second is an intermediate image that synthesizes the remaining parts from scratch (red). The mask m takes continuous values between 0 and 1, and assigns each location to one of the two images. Since video is temporally continuous, it is natural to assume that each frame divides into parts that can be expressed with optical flow and parts that cannot.
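In the paper's notation, with $\odot$ denoting element-wise multiplication, $\tilde{w}_{t-1}$ the estimated optical flow from time $t-1$ to $t$, $\tilde{h}_t$ the intermediate synthesized image, and $\tilde{m}_t$ the soft mask, the generator can be written as:

$$ F(\tilde{x}_{t-L}^{t-1}, s_{t-L}^{t}) = (1 - \tilde{m}_t) \odot \tilde{w}_{t-1}(\tilde{x}_{t-1}) + \tilde{m}_t \odot \tilde{h}_t $$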

The latter intermediate image h can be further decomposed as follows, where subscript B indicates the background and subscript F indicates the foreground.
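With $m_{B,t}$ denoting the background mask at time $t$, the decomposition reads:

$$ \tilde{h}_t = m_{B,t} \odot \tilde{h}_{B,t} + (1 - m_{B,t}) \odot \tilde{h}_{F,t} $$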

The mask m_B indicates the position of the background at time t and decides which parts are generated by the foreground function h_F and which by the background function h_B.

The foreground function h_F is responsible for structures with intense motion that are difficult to express with optical flow, while the background function h_B is responsible for parts with little motion that optical flow can mostly express. In fact, since the first term of F already contains a term that warps the image using optical flow, h_B only needs to handle the parts that cannot be expressed by optical flow alone, such as background newly appearing at time t.

Rewriting F to include these terms results in the following.
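Substituting the decomposition of $\tilde{h}_t$ into the generator output gives:

$$ F(\tilde{x}_{t-L}^{t-1}, s_{t-L}^{t}) = (1 - \tilde{m}_t) \odot \tilde{w}_{t-1}(\tilde{x}_{t-1}) + \tilde{m}_t \odot \left( m_{B,t} \odot \tilde{h}_{B,t} + (1 - m_{B,t}) \odot \tilde{h}_{F,t} \right) $$

Since the composition is just element-wise arithmetic, it is easy to sketch in code. The snippet below is a minimal illustration under assumed names and shapes, not the authors' implementation.

```python
import numpy as np

def compose_frame(warped_prev, h_fg, h_bg, m_flow, m_bg):
    """Blend the flow-warped previous frame with hallucinated content.

    warped_prev: previous frame warped by the estimated flow (H x W x 3)
    h_fg, h_bg:  hallucinated foreground / background images (H x W x 3)
    m_flow:      soft mask m_t with values in [0, 1] (H x W x 1)
    m_bg:        background mask m_B,t from the semantic map (H x W x 1)
    """
    # Intermediate image: background synthesis where m_bg = 1, foreground elsewhere.
    h = m_bg * h_bg + (1.0 - m_bg) * h_fg
    # Final frame: keep the warped previous frame where m_flow is close to 0,
    # use the hallucinated image where it is close to 1.
    return (1.0 - m_flow) * warped_prev + m_flow * h

# Example with random data at a tiny 4 x 4 resolution.
H, W = 4, 4
frame = compose_frame(np.random.rand(H, W, 3), np.random.rand(H, W, 3),
                      np.random.rand(H, W, 3), np.random.rand(H, W, 1),
                      np.random.rand(H, W, 1))
```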

Discriminator

The vid2vid model introduces two Discriminators: an Image Discriminator and a Video Discriminator.

The former is a Discriminator that distinguishes the pair (real image, corresponding semantic image) from (generated image, corresponding semantic image), and judges whether an image generated from a semantic image is plausible. The latter distinguishes (real images, corresponding optical flow at the previous times) from (generated images, corresponding optical flow at the previous times), and judges whether the motion of the video is natural.
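As a rough sketch, both can be viewed as standard conditional GAN discriminators whose condition is concatenated with the input along the channel axis. The helper names below are illustrative assumptions, not the authors' API; tensors are assumed to have shape (batch, channels, height, width).

```python
import torch

def image_disc_input(image: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
    # The Image Discriminator sees a (real or generated) image paired
    # with its semantic map.
    return torch.cat([image, semantic], dim=1)

def video_disc_input(frames: list, flows: list) -> torch.Tensor:
    # The Video Discriminator sees consecutive (real or generated) frames
    # together with the optical-flow maps between them, so it can judge
    # whether the motion looks natural.
    return torch.cat(frames + flows, dim=1)
```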

In addition, it is known that Discriminators should be introduced at multiple scales to prevent mode collapse. In the vid2vid model, Image Discriminators are introduced at multiple scales.

Objective function

The objective function for optimizing the Generator (F) and the Discriminators is as follows.
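In the paper's notation, with $D_I$ the Image Discriminator, $D_V$ the Video Discriminator, and $\lambda_W$ a weighting coefficient, the objective is:

$$ \min_{F} \left( \max_{D_I} \mathcal{L}_I(F, D_I) + \max_{D_V} \mathcal{L}_V(F, D_V) \right) + \lambda_W \mathcal{L}_W(F) $$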

The first term L_I and the second term L_V in parentheses are minimax objectives between the Generator F and the Discriminators, as in the standard GAN objective function. Subscript I stands for the Image Discriminator, and subscript V stands for the Video Discriminator.

The third term L_W is a loss related to optical flow. Its first part is the difference between the ground-truth optical flow and the predicted optical flow, and its second part is the difference between the image at time t warped by the predicted flow and the image one step ahead.
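Written out, the flow loss takes the following form, where $w_t$ is the ground-truth flow, $\tilde{w}_t$ the predicted flow, and $\tilde{w}_t(x_t)$ the image at time $t$ warped by the predicted flow:

$$ \mathcal{L}_W = \frac{1}{T-1} \sum_{t=1}^{T-1} \left( \lVert \tilde{w}_t - w_t \rVert_1 + \lVert \tilde{w}_t(x_t) - x_{t+1} \rVert_1 \right) $$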

Training Method

Training is done in a “spatio-temporally progressive” manner. Simply put, learning starts with a small number of frames at a coarse resolution, and the number of frames and the resolution are then increased alternately (a sketch of such a schedule follows the figure below).

spatio-temporally progressive manner
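The exact schedule in the paper may differ, but as an illustrative assumption the alternation could look like the sketch below, where train_for is a hypothetical routine that runs one training stage at the given sequence length and resolution.

```python
def spatio_temporal_schedule(train_for, num_rounds=3):
    frames, width, height = 4, 512, 256        # start short and coarse
    for _ in range(num_rounds):
        train_for(frames, width, height)       # train at the current settings
        frames *= 2                            # temporal step: longer clips
        train_for(frames, width, height)
        width, height = width * 2, height * 2  # spatial step: finer frames
```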

Results

I introduce part of the results. First, an example of generating two videos from the same segmentation mask. Since the masks are the same, the positions of the cars and of the background (street trees and buildings) do not change, but you can see that the appearance of the buildings and street trees and the types of cars can be changed freely.

Generated videos by vid2vid

Next is a comparison with other methods. Since each image is generated conditioned on the previous frames, you can see that more natural video is generated than with the other methods.