Since August, Google AI researchers have published or revised three different papers on using architecture search to optimize video feature representation. The aim is to improve on previous approaches in the field, which required manually designing CNN architectures to understand videos. Google’s latest contribution came last week with “Tiny Video Networks (TVN),” a new method for reducing the runtime of neural networks that analyze videos. The resulting networks are significantly faster than existing models.

In August, Google revised a previously published paper introducing EvaNet, the first automated neural architecture search algorithm for video understanding. EvaNet can be applied to extended 2D architectures and allows individual modules to evolve. The researchers applied many diverse mutations in the early stages of the evolution and limited them in later stages, in order to identify different but similarly good architectures that can be combined. To enrich the search space for video inputs, they also designed an Inflated Temporal Gaussian Mixture (iTGM) layer, based on the 1D TGM layer, to capture longer-term temporal information.
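The mutation schedule described above, heavy early and limited later, can be illustrated with a minimal sketch. This is not EvaNet's code: the module names, the linear taper in `mutations_for_round`, the tournament-style loop and the `toy_fitness` stand-in are all assumptions made for illustration, and a real search would train and validate each candidate network instead of scoring it with a toy function.

```python
import random

# Hypothetical search space: each position in an architecture picks one module.
MODULE_CHOICES = ["2d_conv", "1d_temporal_conv", "3d_conv", "itgm", "pool"]
NUM_MODULES = 6


def mutations_for_round(round_idx, total_rounds, max_mutations=4):
    """Many mutations early in the search, tapering to one near the end."""
    remaining = 1.0 - round_idx / total_rounds
    return max(1, round(max_mutations * remaining))


def mutate(arch, num_mutations, rng):
    """Return a copy of arch with num_mutations positions re-sampled."""
    child = list(arch)
    for pos in rng.sample(range(len(child)), num_mutations):
        child[pos] = rng.choice(MODULE_CHOICES)
    return child


def toy_fitness(arch):
    """Stand-in for validation accuracy (assumption, not a real metric)."""
    return sum(1 for m in arch if m in ("itgm", "1d_temporal_conv")) / len(arch)


def evolve(total_rounds=200, population_size=20, tournament_size=5, seed=0):
    """Tournament evolution; returns the three best architectures found."""
    rng = random.Random(seed)
    population = [[rng.choice(MODULE_CHOICES) for _ in range(NUM_MODULES)]
                  for _ in range(population_size)]
    for round_idx in range(total_rounds):
        contenders = rng.sample(population, tournament_size)
        parent = max(contenders, key=toy_fitness)
        child = mutate(parent, mutations_for_round(round_idx, total_rounds), rng)
        # The weakest contender makes room for the child.
        population.remove(min(contenders, key=toy_fitness))
        population.append(child)
    return sorted(population, key=toy_fitness, reverse=True)[:3]


if __name__ == "__main__":
    for arch in evolve():
        print(toy_fitness(arch), arch)
```

Returning several top candidates rather than a single winner mirrors the stated goal of finding different but similarly good architectures to combine.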

In September, Google researchers revised a paper that deals with the connectivity between network blocks. “AssembleNet” is a multi-stream neural architecture search algorithm that uses connection-learning-guided evolution and represents multi-stream CNNs as directed graphs. Coupled with an efficient evolutionary algorithm, AssembleNet can find multi-stream architectures with better connectivity and temporal resolutions for video representation learning. Google adjusted the experimental settings and updated other data in the revised paper, in which AssembleNet achieved 58.6 percent mean Average Precision (mAP) on the popular video recognition dataset Charades, compared to a previous best of 45.2.
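To make the directed-graph view concrete, here is a small sketch of how a two-stream network and a weight-guided connectivity mutation might be represented. It is an illustration under stated assumptions, not AssembleNet's implementation: the `Block` and `MultiStreamGraph` classes, the `temporal_stride` field, the 0.3 pruning threshold and the `guided_mutation` rule (drop weak connections, add one random new one) are all hypothetical.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Block:
    name: str
    level: int            # depth in the network; edges only point to higher levels
    temporal_stride: int  # hypothetical per-block temporal resolution setting


@dataclass
class MultiStreamGraph:
    blocks: dict = field(default_factory=dict)  # name -> Block
    edges: dict = field(default_factory=dict)   # (src, dst) -> connection weight

    def add_block(self, block):
        self.blocks[block.name] = block

    def connect(self, src, dst, weight=0.5):
        # Keeping edges strictly forward guarantees the graph stays acyclic.
        if self.blocks[src].level >= self.blocks[dst].level:
            raise ValueError("edges must point from a lower level to a higher one")
        self.edges[(src, dst)] = weight

    def inputs_of(self, name):
        return sorted(src for (src, dst) in self.edges if dst == name)


def guided_mutation(graph, rng, threshold=0.3):
    """Drop connections with low learned weight, then add one new forward edge."""
    kept = {e: w for e, w in graph.edges.items() if w >= threshold}
    child = MultiStreamGraph(dict(graph.blocks), kept)
    # Only propose edges the parent did not have, so pruned edges stay pruned.
    candidates = [(a, b) for a in child.blocks for b in child.blocks
                  if child.blocks[a].level < child.blocks[b].level
                  and (a, b) not in graph.edges]
    if candidates:
        src, dst = rng.choice(candidates)
        child.connect(src, dst)
    return child


# Example: an RGB stream and an optical-flow stream that fuse at the top.
g = MultiStreamGraph()
for name, level, stride in [("rgb_stem", 0, 1), ("flow_stem", 0, 1),
                            ("rgb_mid", 1, 2), ("flow_mid", 1, 4),
                            ("fusion", 2, 4)]:
    g.add_block(Block(name, level, stride))
g.connect("rgb_stem", "rgb_mid", 0.9)
g.connect("flow_stem", "flow_mid", 0.8)
g.connect("rgb_mid", "fusion", 0.7)
g.connect("flow_mid", "fusion", 0.6)
g.connect("flow_stem", "rgb_mid", 0.1)  # weak cross-stream link

child = guided_mutation(g, random.Random(0))
print(sorted(child.edges))
```

In this sketch the weights play the role of learned connection strengths: an evolutionary loop would train each candidate, read the weights back, and use them to decide which connections a mutation removes.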

Google AI’s latest paper presents Tiny Video Networks (TVN), an evolutionary algorithm that can automatically design efficient models to understand videos. The researchers describe the result as a family of tiny neural networks that run in just 37 milliseconds per video frame on a CPU and 10 milliseconds on a standard GPU. Contemporary models usually require over 2000 milliseconds per video frame on a CPU and more than 500 milliseconds on a GPU. Working with fewer convolutional layers, the search learned to prefer effective modules over computationally intensive layers such as the 3D convolutions often seen in other models. With its short runtime, TVN is suitable for dynamic real-world applications such as sports activity recognition and robotics perception.
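The latency figures above are easier to compare as throughput and speedup. The short calculation below uses only the numbers quoted in this article; the helper names are made up for illustration, and because the contemporary figures are stated as lower bounds ("over", "more than"), the resulting speedups are lower bounds as well.

```python
# Per-frame latencies quoted in the article, in milliseconds.
TVN_MS = {"cpu": 37.0, "gpu": 10.0}
# Stated as lower bounds in the article, so speedups below are minimums.
CONTEMPORARY_MS = {"cpu": 2000.0, "gpu": 500.0}


def frames_per_second(ms_per_frame):
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / ms_per_frame


def speedup(baseline_ms, new_ms):
    """How many times faster the new latency is than the baseline."""
    return baseline_ms / new_ms


for device in ("cpu", "gpu"):
    fps = frames_per_second(TVN_MS[device])
    gain = speedup(CONTEMPORARY_MS[device], TVN_MS[device])
    print(f"{device}: {fps:.1f} frames/s, at least {gain:.1f}x faster")
```

This works out to roughly 27 frames per second on a CPU and 100 on a GPU, or at least about 54 times and 50 times faster than the quoted contemporary figures.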

It is noteworthy that Google AI has taken three different approaches to building machine-optimized video architectures. EvaNet, introduced in “Evolving Space-Time Neural Architectures for Videos,” is a module-level architecture search (open-sourced on GitHub); “AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures” works on multi-stream connectivity; and Tiny Video Networks explores boosting model efficiency in real-world environments.

Google AI’s recent video understanding research aims to deliver improved accuracy and efficiency for mobile-ready models, and could also be applied to real-world challenges such as industrial robotics or real-time traffic monitoring at city intersections.