Current state-of-the-art convolutional architectures for object detection tasks are human-designed. In a recent paper, Google Brain researchers leveraged the advantages of Neural Architecture Search (NAS) to propose NAS-FPN, a new automatic search method for feature pyramid architecture. NAS-FPN achieves a better accuracy and latency tradeoff than current SOTA models for object detection.

From the paper’s abstract:

“Here we aim to learn a better architecture of feature pyramid network for object detection. We adopt Neural Architecture Search and discover a new feature pyramid architecture in a novel scalable search space covering all cross-scale connections. The discovered architecture, named NAS-FPN, consists of a combination of top-down and bottom-up connections to fuse features across scales. NAS-FPN, combined with various backbone models in the RetinaNet framework, achieves better accuracy and latency tradeoff compared to state-of-the-art object detection models. NAS-FPN improves mobile detection accuracy by 2 AP compared to state-of-the-art SSDLite with MobileNetV2 model in [32] and achieves 48.3 AP which surpasses Mask R-CNN [10] detection accuracy with less computation time.” (arXiv).

Synced invited Dr. Dawei Du, a Postdoctoral Researcher at the State University of New York with a research focus on visual tracking, object detection and video segmentation applications, to share his thoughts on Google Brain’s NAS-FPN.

How would you describe NAS-FPN?

FPN is a pyramid representation for deep learning that combines low-resolution, semantically strong features with high-resolution, semantically weak features via top-down and lateral connections. NAS-FPN, in turn, is an automatic neural architecture search method that focuses on finding optimal connections between different layers for pyramidal representations.
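To make the top-down pathway concrete, here is a minimal sketch in plain Python. It is an illustration under simplifying assumptions, not the paper's implementation: each feature map is a single-channel 2D list, the lateral connection is treated as an identity mapping (a real FPN would typically apply a 1x1 convolution), and the helper names `upsample2x`, `add_maps` and `top_down_fuse` are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a 2D feature map (list of rows)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat each row vertically
    return out


def add_maps(a, b):
    """Element-wise sum of two feature maps with the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down_fuse(backbone_maps):
    """Fuse maps ordered fine-to-coarse, each level half the size of the previous.

    Assumption: the lateral connection is an identity mapping.
    """
    fused = [None] * len(backbone_maps)
    fused[-1] = backbone_maps[-1]  # the coarsest level passes through unchanged
    for i in range(len(backbone_maps) - 2, -1, -1):
        # lateral input at level i + upsampled output of the coarser level
        fused[i] = add_maps(backbone_maps[i], upsample2x(fused[i + 1]))
    return fused


# Toy inputs: constant maps at three resolutions (4x4, 2x2, 1x1).
c3 = [[1.0] * 4 for _ in range(4)]
c4 = [[10.0] * 2 for _ in range(2)]
c5 = [[100.0]]
p3, p4, p5 = top_down_fuse([c3, c4, c5])
```

In this toy setup the finest output `p3` keeps its 4x4 resolution while accumulating the values propagated down from the two coarser levels, which is the basic idea behind fusing strong semantics into high-resolution features.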

Specifically, an RNN controller is trained with reinforcement learning to select the best architecture. First, child networks are sampled by combining any two different layers. The accuracy of each child network then serves as the reward, which is used to compute the policy gradient that updates the controller’s parameters. Over the training iterations, the controller learns to generate structures with better accuracy. Experiments on the COCO test set show the proposed method achieves considerable accuracy improvement over existing object detection models, e.g., MobileNetV2 and RetinaNet.
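The search loop described above can be sketched roughly as follows. This is a heavily simplified illustration, not the method from the paper: the RNN controller is replaced by a single softmax over candidate layer pairs, the reward comes from a hypothetical `toy_accuracy` function instead of training a real child network, and all names and hyperparameters (`search`, `reinforce_step`, `lr`, `steps`) are assumptions made for the example.

```python
import itertools
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, advantage, lr):
    """One policy-gradient update for a softmax policy.

    The gradient of log p(action) w.r.t. logit k is
    (1 if k == action else 0) - p_k.
    """
    probs = softmax(logits)
    return [
        x + lr * advantage * ((1.0 if k == action else 0.0) - probs[k])
        for k, x in enumerate(logits)
    ]


def toy_accuracy(pair):
    """Hypothetical reward standing in for a child network's measured accuracy."""
    return 0.9 if pair == (1, 3) else 0.3


def search(num_layers=4, steps=500, lr=0.5, seed=0):
    rng = random.Random(seed)
    # Candidate "child networks": every way of combining two different layers.
    pairs = list(itertools.combinations(range(num_layers), 2))
    logits = [0.0] * len(pairs)
    baseline = 0.0  # moving average of rewards, used to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(pairs)), weights=probs)[0]  # sample a child
        reward = toy_accuracy(pairs[action])
        logits = reinforce_step(logits, action, reward - baseline, lr)
        baseline = 0.9 * baseline + 0.1 * reward
    return pairs, softmax(logits)


pairs, probs = search()
```

Because sampling is stochastic, the final distribution depends on the seed and the step size; the point of the sketch is only the structure of the loop: sample a candidate, score it, and push probability toward candidates that beat the running baseline.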

Why does this research matter?

Deep learning dominates various tasks in computer vision. However, the majority of previous work focuses on training the parameters of networks with human-designed architectures. Recently, there has been increasing interest in designing the structure of neural networks automatically. Xie et al. explored randomly wired neural networks for image recognition. Liu et al. proposed searching the network-level structure in addition to the cell-level structure for semantic segmentation. Differing from the aforementioned work, this paper opens another research direction: searching for the optimal cross-layer connections to achieve discriminative multiscale feature representations in neural networks.

What impact might this research bring to the research community?

Combining multi-scale features from different layers is one of the important techniques in deep learning for effectively improving the performance of many computer vision tasks. However, previously proposed human-designed structures may not be optimal, resulting in limited performance. Inspired by NAS-FPN, researchers can transfer the optimal network structures to related tasks such as visual tracking and semantic segmentation.

Can you identify any bottlenecks in the research?

The computational cost of NAS is extremely high (100 TPUs were used in this paper), especially for complex backbones (e.g., ResNet-101). It is therefore very difficult for labs without substantial computational resources to follow. Besides, we still have little insight into the optimal network generated by the NAS method. Why do such layer connection combinations achieve better performance than human-designed ones? Can we learn from the network design and transfer it to other tasks (e.g., tracking, segmentation and classification)? These questions remain open because the complex cross-layer connections are hard to interpret.

Can you predict any potential future developments related to this research?

I believe there will be much more work using the NAS method in the future. Using prior knowledge of a specific task, researchers can reduce the NAS search space by pruning unnecessary connections. Besides, some effective modules will be found based on similar connections of optimal networks in different tasks. It will also be interesting to design networks that consider the tradeoff between complexity and accuracy, especially for embedded systems with limited resources.

The paper NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection is on arXiv.