Supervised deep learning models are effective at recognizing and classifying objects and actions in videos. However, these methods currently summarize an entire video clip with a single label, which does not always fully capture the content. To understand a multi-step process such as pitching a baseball, more than one label is required.

Frame-by-frame fine-grained labeling, however, is a time-consuming task. To tackle this problem, researchers from Google AI and DeepMind have introduced Temporal Cycle-Consistency Learning (TCC), a novel self-supervised learning method that leverages temporal alignment between videos to break down continuous actions and develop “a semantic understanding of each video frame.”

Minimizing cycle-consistency errors to learn representations for temporally fine-grained tasks.

Left: Input videos of people performing a squat exercise; the video on the top left is the reference, and the other videos show nearest neighbor frames (in the TCC embedding space) from other videos of people doing squats. Right: The corresponding frame embeddings move as the action is performed.

Actions such as “throwing a baseball” or “pouring tea” are processes with distinct steps that are present across different video instances of the action. TCC leverages the principle of cycle-consistency, in the form of a differentiable cycle-consistency loss, to find these correspondences across time in different videos.
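The core idea can be illustrated with a simple hard-assignment check (a minimal sketch, not the differentiable loss the paper trains with; the arrays `u` and `v` are hypothetical per-frame embeddings):

```python
import numpy as np

def cycle_consistent(u, v, i):
    """Check hard cycle-consistency for frame i of video A.

    u, v: (num_frames, dim) arrays of frame embeddings for two
    videos of the same action (illustrative inputs).
    """
    # Nearest neighbor of frame i's embedding among video B's frames
    j = np.argmin(np.linalg.norm(v - u[i], axis=1))
    # Cycle back: nearest neighbor of that frame among video A's frames
    k = np.argmin(np.linalg.norm(u - v[j], axis=1))
    # The cycle is consistent if we return to the frame we started from
    return k == i
```

A frame whose cycle returns to its starting point is "cycle-consistent"; TCC trains the embeddings so that as many frames as possible satisfy this property.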

TCC learns a frame encoder built on an image network architecture such as ResNet. The researchers first pass the video frames to be aligned through the encoder to produce their corresponding embeddings, then select two videos for TCC learning. Starting from a reference frame in the first video, they find its nearest neighbor frame in the second video's embedding space, cycle back to the first video, and train the embedder using the distance between the starting frame and the frame the cycle returns to as the training signal.
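To make this trainable by gradient descent, the hard nearest-neighbor lookup is replaced with a soft (softmax-weighted) one, so the distance from the cycle's end back to its start becomes a differentiable loss. The NumPy sketch below shows the cycle-back regression idea for a single reference frame; function and variable names are illustrative, and a real implementation would use an autodiff framework:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def tcc_cycle_loss(u, v, i):
    """Differentiable cycle-consistency distance for frame i of video A.

    u, v: (num_frames, dim) frame embeddings of two videos of the
    same action (hypothetical inputs for illustration).
    """
    # Soft nearest neighbor of u[i] in video B: softmax over
    # negative squared distances gives attention weights alpha
    alpha = softmax(-np.sum((v - u[i]) ** 2, axis=1))
    v_soft = alpha @ v                       # soft-matched frame in video B
    # Cycle back: softly locate v_soft among video A's frames
    beta = softmax(-np.sum((u - v_soft) ** 2, axis=1))
    mu = beta @ np.arange(len(u))            # expected return frame index
    # Penalize cycles that fail to return to the starting frame i
    return (mu - i) ** 2
```

Minimizing this loss over many frame pairs pushes semantically corresponding frames in different videos toward nearby points in the embedding space.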

In this process, the embeddings minimize the cycle-consistency loss by developing a semantic understanding of each video frame in the context of the action being performed. When trained on Penn Action Dataset video clips of people performing squats, TCC embeddings can encode the different phases of squatting without being provided explicit labels.

TCC embeddings enable multiple applications, for example transferring labels and other modalities (such as sound) between videos. Using nearest neighbor search in the embedding space, TCC can find similar frames and transfer metadata such as audio or text associated with frames in Video A to the corresponding frames in Video B.
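Once the embedding space aligns the two videos, the transfer itself reduces to a per-frame nearest-neighbor lookup. A minimal sketch (array and label names are illustrative, not from the paper's code):

```python
import numpy as np

def transfer_labels(emb_a, labels_a, emb_b):
    """Transfer per-frame labels from video A to video B by nearest
    neighbor search in a shared (e.g. TCC) embedding space.

    emb_a, emb_b: (frames, dim) embeddings; labels_a: one label
    (or any per-frame metadata, e.g. an audio snippet) per frame of A.
    """
    transferred = []
    for frame in emb_b:
        # Index of the frame in video A closest to this frame of video B
        nn = np.argmin(np.linalg.norm(emb_a - frame, axis=1))
        transferred.append(labels_a[nn])
    return transferred
```

The same lookup works for any per-frame metadata, which is how a sound or text annotation attached to one video can be replayed at the matching moment of another.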

The above demonstration shows how TCC links and transfers the sound of pouring tea into a cup from one video into a similar but different scene. Researchers suggest TCC could also be applied to few-shot action phase classification, unsupervised video alignment, per-frame retrieval, and other similar tasks.

The paper Temporal Cycle-Consistency Learning is on arXiv. Google also plans to release the TCC codebase to encourage new applications.