Purging your favourite photos or videos of an unsightly trash pile, a parked car or even an ex-partner has never been easier, thanks to the rapid progress of AI models designed for such tasks.

In pursuit of better visual synthesis and inpainting approaches, researchers from Adobe Research and Stanford University have proposed an internal learning method for video inpainting, inspired by the Deep Image Prior (DIP) approach to single-image generation.

DIP is a technique that uses the structure of an untrained convolutional neural network (CNN) as a prior for plausible image statistics, and has been widely used to enhance images by solving problems like noise reduction, super-resolution, and inpainting. The new DIP-inspired inpainting approach generates content for missing regions (holes) and also jointly estimates motion (optical flow) information.

Since there’s usually no unique solution for naturally replacing removed or otherwise missing sections of a video, the goal of video inpainting is to reconstruct lost or damaged visual information in a manner that is consistent in both space and time. Previous methods have been mostly hand-crafted and patch-based, and often could not sufficiently capture natural image priors, which resulted in distortions, especially in videos with complex motion.
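The spatial half of this objective can be illustrated with a toy masked reconstruction loss: the error is measured only over known pixels, so the model is free to hallucinate anything inside the hole and is judged there only indirectly. This is an illustrative sketch, not the paper's implementation; the function name, array shapes, and mask convention are assumptions.

```python
import numpy as np

def masked_mse(prediction, target, hole_mask):
    """Mean squared error over known pixels only.

    hole_mask is 1 inside the hole (unknown) and 0 elsewhere,
    so the loss ignores whatever is generated inside the hole.
    """
    known = 1.0 - hole_mask
    diff = (prediction - target) * known
    return (diff ** 2).sum() / known.sum()

# Toy 4x4 "frame" with a 2x2 hole in the centre.
frame = np.arange(16, dtype=float).reshape(4, 4)
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0

prediction = frame.copy()
prediction[1:3, 1:3] = -999.0  # garbage inside the hole is not penalized
assert masked_mse(prediction, frame, mask) == 0.0
```

Because the loss never touches the hole, whatever the model paints there is determined entirely by its priors, which is exactly why the choice of prior matters so much.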

Although learning image priors from an external image corpus via a deep neural network can improve image inpainting performance, extending neural networks to video inpainting remains challenging because the hallucinated content in videos not only needs to be consistent within its own frame, but also across adjacent frames. Also, video sizes are generally much larger than image sizes, making it difficult to train a single model to learn all effective priors and generalize it to all videos.

That’s where DIP comes in handy. Under DIP, image statistics are captured by a convolutional image generator rather than learned from external training data, so the “knowledge” of natural images is encoded in the CNN architecture itself. This enables DIP to exploit the internal recurrence of visual patterns within a single image.

Since DIP requires no training data apart from the image itself, the DIP-inspired video inpainting algorithm can be based entirely on internal (within-video) learning without any external visual data, sidestepping the difficulty of training a single one-size-fits-all model.
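The internal-learning idea can be sketched in miniature: fit a generator by gradient descent to the known pixels of one signal only, and let the model's built-in structure decide what appears in the hole. Here a tiny smooth parametric model stands in for DIP's CNN generator (its smoothness bias plays the role of the CNN architecture's bias toward natural image statistics); all names, sizes, and the basis choice are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Target "image": a 1-D linear ramp with a hole of four missing pixels.
x = np.linspace(0.0, 1.0, 16)
target = x.copy()
known = np.ones(16)
known[6:10] = 0.0  # hole: these pixels never enter the loss

# "Generator": constant + linear + quadratic basis with shared
# coefficients -- a deliberately tiny stand-in for a CNN generator.
Phi = np.stack([np.ones_like(x), x, x**2], axis=1)  # 16 x 3 design matrix
c = np.zeros(3)                                     # generator parameters

# Internal learning: gradient descent on THIS signal's known pixels only.
lr = 0.3
for _ in range(10_000):
    out = Phi @ c
    grad = 2 * Phi.T @ ((out - target) * known) / known.sum()
    c -= lr * grad

out = Phi @ c
# The hole is filled plausibly even though it never entered the loss:
assert np.abs((out - target)[known == 0]).max() < 0.05
```

The point of the toy: nothing outside this one signal was ever seen, yet the hole is filled sensibly because the model's structure, not an external dataset, supplies the prior.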

The DIP-inspired inpainting approach (bottom row) outperforms the frame-based baseline (2nd row) even for content unseen in multiple frames (orange box). The new method can also employ natural image priors to avoid shape distortions, which often occur in patch-based methods (3rd row, red box).

The final DIP-Vid-Flow model outperformed three baselines: DIP, which directly applies the image DIP framework to videos frame-by-frame; DIP-Vid, a framework trained using only the image generation loss; and DIP-Vid-3DCN, a modified network that combines 2D and 3D convolutions within the DIP framework, also trained with the image generation loss.
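The role of the flow term that distinguishes DIP-Vid-Flow can be sketched with a toy temporal consistency check: warp the next frame back by the estimated flow and penalize disagreement with the current frame over known pixels. The function names, the 1-D frames, and the integer-shift warp are simplifying assumptions for illustration; real systems use bilinear sampling of 2-D flow fields.

```python
import numpy as np

def warp_by_flow(frame, flow_x):
    """Backward-warp a 1-D 'frame' by an integer per-pixel shift.

    A stand-in for bilinear warping with a dense 2-D flow field.
    """
    idx = np.clip(np.arange(frame.size) + flow_x, 0, frame.size - 1)
    return frame[idx]

def flow_consistency_loss(frame_t, frame_t1, flow_x, hole_mask):
    """Penalize mismatch between frame_t and the next frame warped
    back by the flow, measured only over known (non-hole) pixels."""
    known = 1.0 - hole_mask
    warped = warp_by_flow(frame_t1, flow_x)
    diff = (warped - frame_t) * known
    return (diff ** 2).sum() / known.sum()

# A bright pixel that moves one step to the right between frames.
frame_t  = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
frame_t1 = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
mask = np.zeros(6)
mask[4] = 1.0  # one hole pixel

correct_flow = np.full(6, 1, dtype=int)  # everything shifts by +1
assert flow_consistency_loss(frame_t, frame_t1, correct_flow, mask) == 0.0
assert flow_consistency_loss(frame_t, frame_t1, np.zeros(6, int), mask) > 0.0
```

A term of this kind ties adjacent frames together, which is what pushes hallucinated hole content to move coherently with the rest of the video rather than flickering frame to frame.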

The researchers note that a drawback of their method, as with other visual synthesis systems, is the long processing time required. They are confident, however, that the new approach will attract more research attention to “the interesting direction of internal learning” in video inpainting.

The paper An Internal Learning Approach to Video Inpainting is on arXiv.