ImageNet pre-training is common across a variety of CV (Computer Vision) tasks, as something of a consensus has emerged that pre-training helps a model learn transferable information that is useful for target tasks. A new FAIR (Facebook AI Research) paper, however, suggests that while many researchers believe the path to solving computer vision challenges is “paved by pre-training a ‘universal’ feature representation on ImageNet-like data” — the process may not actually be helpful at all.

The researchers point out that a model pre-trained on ImageNet can only learn approximate general image knowledge through its large-scale image classification task, so if the target task falls outside the scope of that training, pre-training is not effective.

Kaiming He, Ross Girshick, and Piotr Dollár from FAIR are the team that previously developed the highly regarded object instance segmentation framework Mask R-CNN. Their new paper, Rethinking ImageNet Pre-training, finds that even when a model is pre-trained on datasets 3,000x the size of ImageNet, improvements on specific tasks such as object detection are very limited and scale poorly. The authors ran further experiments and achieved competitive accuracy on tasks such as object detection and instance segmentation when training on the COCO dataset from random initialization, without any pre-training.

Training — Mask R-CNN with a ResNet-50 FPN and Group Norm on the COCO train2017 set; Evaluation — Bounding Box Average Precision on the val2017 set
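Group Norm matters in this setup because, unlike Batch Norm, it computes statistics per sample rather than across the batch, so it stays stable at the small batch sizes typical of detection training — one reason training from scratch is feasible at all. Below is a minimal NumPy sketch of the group normalization computation; the function name and tensor shapes are illustrative, not taken from the paper's code:

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Normalize a (N, C, H, W) tensor over per-sample channel groups."""
    n, c, h, w = x.shape
    assert c % num_groups == 0
    # Split channels into groups; statistics are computed per sample,
    # so the result does not depend on batch size (unlike Batch Norm).
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g = (g - mean) / np.sqrt(var + eps)
    return g.reshape(n, c, h, w)

# Each sample's output is identical whether it is normalized alone
# or as part of a larger batch.
x = np.random.randn(4, 8, 5, 5)
single = group_norm(x[:1], num_groups=4)
batched = group_norm(x, num_groups=4)[:1]
print(np.allclose(single, batched))  # True
```

The final check illustrates the batch-size independence: with Batch Norm, shrinking the batch to one sample would change the normalization statistics and degrade training.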

The research results show that for the target tasks of object detection and instance segmentation on the COCO dataset, a model trained from random initialization with sufficient iterations can perform on par with a model initialized with pre-training on ImageNet.

The graph above compares training time between the two methods: the model trained from random initialization needs about three times as many iterations on COCO as its fine-tuned counterpart (72 epochs vs. 24 epochs) to reach similar accuracy. But once the roughly 100 epochs of ImageNet pre-training are counted, the two schedules process a comparable amount of pixel data overall.
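That rough parity can be sanity-checked with back-of-the-envelope arithmetic. All numbers below are approximations I am assuming (ImageNet at roughly 1.28M images and a ~224x224 crop; COCO train2017 at roughly 118k images and the ~800x1333 resolution commonly used for Mask R-CNN training), not figures quoted from the paper:

```python
# Approximate total pixels processed under each training schedule.
IMAGENET_IMAGES = 1.28e6      # assumed ImageNet size
IMAGENET_PIXELS = 224 * 224   # assumed crop resolution
COCO_IMAGES = 118_000         # assumed train2017 size
COCO_PIXELS = 800 * 1333      # assumed training resolution

pretrain = 100 * IMAGENET_IMAGES * IMAGENET_PIXELS  # 100 ImageNet epochs
finetune = 24 * COCO_IMAGES * COCO_PIXELS           # 24 COCO epochs
scratch = 72 * COCO_IMAGES * COCO_PIXELS            # 72 COCO epochs

print(f"pre-train + fine-tune: {pretrain + finetune:.2e} pixels")
print(f"from scratch:          {scratch:.2e} pixels")
print(f"ratio: {scratch / (pretrain + finetune):.2f}")
```

Under these assumed numbers the two schedules come out within a few percent of each other, which is consistent with the article's point: the 3x longer COCO schedule roughly offsets the compute spent on pre-training.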

Researchers expanded the experiments to different networks with different settings, with findings as follows (from the paper):

- Training from scratch on target tasks is possible without architectural changes.
- Training from scratch requires more iterations to sufficiently converge.
- Training from scratch can be no worse than its ImageNet pre-training counterparts under many circumstances, down to as few as 10k COCO images.
- ImageNet pre-training speeds up convergence on the target task.
- ImageNet pre-training does not necessarily help reduce overfitting unless we enter a very small data regime.
- ImageNet pre-training helps less if the target task is more sensitive to localization than classification.

Based on their observations, the authors offered answers to these questions:

Is ImageNet pre-training necessary? No. Is ImageNet helpful? Yes. Do we need big data? Yes. Should we pursue universal representations? Yes.

The paper Rethinking ImageNet Pre-training is available on arXiv.