Parsing fine-grained temporal actions is vital in applications that require a detailed and precise understanding of operations over long periods, such as daily activity understanding, surgical robotics, human motion analysis and animal behavior analysis. In a new paper accepted at CVPR 2019, researchers from the Max Planck Institute for Intelligent Systems propose a learnable bilinear pooling method that extracts second-order information from time series data and captures subtler local statistics than conventional methods. They also introduce an information-lossless dimension reduction method, and report performance superior to state-of-the-art pooling methods for action parsing.

Summary from the Max Planck Institute for Intelligent Systems:

“In this paper we propose a novel bilinear pooling operation, which is used in intermediate layers of a temporal convolutional encoder-decoder net. In contrast to other work, our proposed bilinear pooling is learnable and hence can capture more complex local statistics than the conventional counterpart. In addition, we introduce exact lower-dimension representations of our bilinear forms, so that the dimensionality is reduced with neither information loss nor extra computation. We perform intensive experiments to quantitatively analyze our model and show the superior performances to other state-of-the-art work on various datasets.” (MPIIS).

The Max Planck Institute for Intelligent Systems responded to Synced questions regarding their recent paper on local temporal bilinear pooling.

How would you describe bilinear pooling?

Bilinear pooling is a natural mechanism to fuse two feature vectors, which can be from different information sources and of different dimensions. When the two input feature vectors are identical, bilinear pooling computes the (centered or un-centered) covariance and extracts the second-order information. Such a technique is widely used in fine-grained image classification, image segmentation, visual question answering, information fusion and so forth.
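The description above can be made concrete with a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's learnable layer: the function name `bilinear_pool`, the averaging over time steps, and the toy shapes are all chosen here for the example.

```python
import numpy as np

def bilinear_pool(X, Y):
    """Average outer product of paired feature vectors (hypothetical helper).

    X has shape (T, dx) and Y has shape (T, dy); the result has shape (dx, dy).
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # X.T @ Y sums the outer products x_t y_t^T over all T rows.
    return X.T @ Y / X.shape[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))  # 8 time steps, 3-dim features from one source
Y = rng.normal(size=(8, 5))  # 8 time steps, 5-dim features from another source

# Fusing two different sources of different dimensions.
fused = bilinear_pool(X, Y)

# Identical inputs: the un-centered second-order statistics.
second_moment = bilinear_pool(X, X)

# Identical, mean-centered inputs: the covariance matrix.
Xc = X - X.mean(axis=0)
covariance = bilinear_pool(Xc, Xc)
```

With identical, centered inputs the result matches `np.cov(X, rowvar=False, bias=True)`, which is the sense in which bilinear pooling extracts second-order information.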

Why do you recommend this paper?

To our knowledge, this is the first investigation of bilinear pooling for fine-grained action segmentation. Besides reporting superior action parsing performance, the paper systematically investigates the importance of second-order information and its influence on consecutive convolutional layers. In addition, the authors show that the proposed learnable bilinear pooling is actually a feature map to a reproducing kernel Hilbert space (RKHS) with a polynomial kernel. Based on this RKHS property, they propose an exact dimension reduction method that improves efficiency while preserving all the second-order information. These investigations should therefore be a useful basis for future related studies in the community. Moreover, the paper is well written and contains a comprehensive literature review in its related work section.
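The link between second-order features and a polynomial kernel can be checked numerically. The sketch below covers only the plain, non-learnable outer product and a standard symmetric-matrix compaction; it is not claimed to be the authors' construction, and the names `outer_feature` and `compact_feature` are hypothetical.

```python
import numpy as np

def outer_feature(x):
    """Flattened outer product x x^T, of dimension d * d."""
    return np.outer(x, x).ravel()

def compact_feature(x):
    """Diagonal plus sqrt(2)-scaled upper triangle, of dimension d * (d + 1) / 2.

    x x^T is symmetric, so each off-diagonal entry appears twice in the full
    feature; scaling one copy by sqrt(2) keeps inner products unchanged.
    """
    d = x.shape[0]
    M = np.outer(x, x)
    upper = np.triu_indices(d, k=1)
    return np.concatenate([np.diag(M), np.sqrt(2.0) * M[upper]])

rng = np.random.default_rng(1)
x = rng.normal(size=4)
y = rng.normal(size=4)

k_full = outer_feature(x) @ outer_feature(y)        # 16-dim features
k_compact = compact_feature(x) @ compact_feature(y)  # 10-dim features
k_poly = (x @ y) ** 2                                # degree-2 polynomial kernel
```

All three values agree up to floating-point error, so in this simplified setting the shorter feature loses no second-order information while reducing the dimension from 16 to 10.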

What impact might this work bring to the research community?

This paper shows the potential of using bilinear pooling for fine-grained action parsing. Since the proposed method is actually a neural network layer, it can potentially be used in more sophisticated net architectures to yield better performance.

This paper provides an exact dimension reduction method without sacrificing information based on RKHS theories. Such investigation is interesting and inspiring.

Can you identify any bottlenecks in the research?

Datasets for fine-grained action parsing are limited and difficult to annotate, especially surgical robot data from real operation scenarios.

Despite the information-lossless dimension reduction, the computational cost is still high.

Can you predict any potential future developments related to this research?

From this paper one can see that first-order information can blur different actions together, whereas second-order information can lead to over-segmentation. The most useful information may therefore lie at a fractional order between the two.

The paper Local Temporal Bilinear Pooling for Fine-Grained Action Parsing is here.