Animated stickers have recently surged in popularity thanks to their widespread use in messaging applications and memes. Still, with existing tools, generating animated stickers is extremely challenging and time-consuming, making the task practically infeasible for non-experts. Removing the background of an arbitrary video (no green screen) is a tedious task that involves manually segmenting the object in every frame of the video.

Example of an animated sticker from a video

At gifs.com, we decided to tackle this problem and help people create animated stickers easily by using AI.

Challenge

Automated animated sticker generation is a challenging problem because of the complex nature of videos: they are subject to motion blur, bad composition, and occlusion. An object can be hard to segment due to its complex structure, small size (very little information), or strong similarity between background and foreground. Also, a video clip can contain multiple objects, and we need to make sure users extract the object they are interested in.

Preview of user generated stickers

Our solution

First, the user uses our interactive object segmentation tool to mark the object of interest in the first frame of the video. The result is then propagated to the remaining frames and rendered as an animated sticker. For segmenting the object, i.e. instance segmentation, we use Computer Vision techniques that can infer the full segmentation from minimal user input.

Example of using the interactive tool to annotate the first frame

Both segmentation steps (first frame and full video) rely on Convolutional Neural Networks, a type of deep learning model. Deep learning is a good fit for our problem given its recent breakthroughs in Computer Vision: Convolutional Neural Networks have shown exceptional performance for image and video recognition. These algorithms are capable of "understanding" the visual concept of an object (animal, car, …) in an image.

Next, we present the two steps of our method in more detail.

Interactive segmentation

A quick way to implement interactive segmentation is to use the GrabCut algorithm. It builds a model of the pixel distribution (colors) and performs well when the background and foreground are distinct, but outputs sub-optimal results when both are similar.

Even with multiple user annotations (left), the GrabCut result (right) is not satisfactory because the bear's fur is similar in color to the ground.

To get a beautiful sticker, we need a high-precision segmentation of the first frame. Because we were not satisfied with the GrabCut results, we decided to develop our method based on the latest research in deep learning. Inspired by recent work in interactive object segmentation with deep neural networks, we built a model that takes the image, the current segmentation result, and the user corrections as input and outputs a binary mask of the object.
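A minimal sketch of how such a network input could be assembled: the image, the current mask, and positive/negative correction maps are stacked along the channel axis before being fed to the network. The channel layout and names here are our illustrative assumptions, not the exact production code:

```python
import numpy as np

def build_network_input(image, current_mask, pos_corrections, neg_corrections):
    """Stack an RGB image (H, W, 3), the current segmentation (H, W), and the
    user's positive/negative correction maps (H, W each) into one (H, W, 6)
    input tensor. The real model may encode corrections differently
    (e.g. as distance maps)."""
    return np.concatenate(
        [
            image.astype(np.float32) / 255.0,            # RGB channels
            current_mask[..., None].astype(np.float32),  # running prediction
            pos_corrections[..., None].astype(np.float32),  # "this is object"
            neg_corrections[..., None].astype(np.float32),  # "this is background"
        ],
        axis=-1,
    )

h, w = 64, 64
x = build_network_input(
    image=np.zeros((h, w, 3), np.uint8),
    current_mask=np.zeros((h, w), np.uint8),
    pos_corrections=np.zeros((h, w), np.uint8),
    neg_corrections=np.zeros((h, w), np.uint8),
)
```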

We provide a brush tool to the user for correcting the first image of the video. Based on our production data, we have found that typical users tend to draw with a variety of patterns such as clicks, strokes or highlighting the whole object. Thus, we needed our algorithm to take into account a diversity of annotations and decided to include simulated strokes and clicks during the training phase to get the best results and give the user a great experience.

Examples of the 3 annotation types: clicks, strokes and highlights
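One way to simulate such annotations from a ground-truth mask during training is to sample points inside the object and stamp small disks around them. This is a sketch under our own assumptions about stroke shape; the production simulator may differ:

```python
import numpy as np

def simulate_clicks(gt_mask, n_clicks, radius=3, seed=0):
    """Simulate user clicks by sampling points inside the ground-truth mask
    and drawing small disks around them. Strokes and highlights can be
    simulated the same way with more points or dilated regions."""
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(gt_mask)
    clicks = np.zeros_like(gt_mask)
    if len(ys) == 0:
        return clicks
    idx = rng.choice(len(ys), size=min(n_clicks, len(ys)), replace=False)
    yy, xx = np.mgrid[: gt_mask.shape[0], : gt_mask.shape[1]]
    for y, x in zip(ys[idx], xs[idx]):
        clicks[(yy - y) ** 2 + (xx - x) ** 2 <= radius ** 2] = 1
    return clicks

gt = np.zeros((32, 32), np.uint8)
gt[8:24, 8:24] = 1  # toy square object
clicks = simulate_clicks(gt, n_clicks=4)
```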

Video segmentation

After annotating one frame and successfully segmenting the object, we use a deep learning model based on the OSVOS paper to generate the segmentation in the other frames. OSVOS (One-Shot Video Object Segmentation) is a convolutional neural network (based on VGG) that uses generic semantic information to segment objects. For each sticker, the model is fine-tuned on the annotated frame/mask pairs. We then infer the masks for all the frames in the video and combine the results into an animated sticker with a transparent background.
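The final compositing step can be sketched as follows: each frame plus its predicted mask becomes an RGBA image whose alpha channel is zero wherever the mask says background. Function and variable names here are illustrative:

```python
import numpy as np

def to_rgba(frame, mask):
    """Turn an RGB frame (H, W, 3) and a binary mask (H, W) into an RGBA
    frame whose background pixels are fully transparent."""
    alpha = (mask.astype(np.uint8) * 255)[..., None]
    return np.concatenate([frame, alpha], axis=-1)

# One RGBA frame per video frame; an encoder (e.g. for animated WebP or
# APNG) would then assemble them into the final sticker.
frames = np.zeros((4, 48, 48, 3), np.uint8)  # toy 4-frame clip
masks = np.zeros((4, 48, 48), np.uint8)
masks[:, 10:30, 10:30] = 1                   # object region in every frame
sticker_frames = [to_rgba(f, m) for f, m in zip(frames, masks)]
```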

If the object is fast-moving or changes appearance significantly towards the end of the video, results can degrade. Thus, we allow the user to refine additional frames in the video to improve the quality of the sticker.