In this post I will show why state-of-the-art Deep Neural Networks can still recognise scrambled images perfectly well and how this helps to uncover a puzzlingly simple strategy that DNNs seem to use to classify natural images. These findings, published at ICLR 2019, have a number of ramifications: first, they show that solving ImageNet is much simpler than many have thought. Second, the findings allow us to build much more interpretable and transparent image classification pipelines. Third, they explain a number of phenomena observed in modern CNNs like their bias towards texture (see our other paper at ICLR 2019 and our corresponding blog post) and their neglect of the spatial ordering of object parts.

Good ol’ bag-of-features models

In the old days, before Deep Learning, object recognition in natural images used to be fairly simple: define a set of key visual features (“words”), count how often each visual feature is present in an image (“bag”) and then classify the image based on these counts. These models are therefore called “bag-of-features” models (BoF models). For illustration, say we have only two visual features, a human eye and a feather, and we want to classify images into the classes “human” and “bird”. The simplest BoF model would work as follows: for each eye in the image it increases the evidence for “human” by +1. Conversely, for each feather in the image it increases the evidence for “bird” by +1. Whichever class accumulates the most evidence across the image is the predicted one.
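To make the eye-and-feather example concrete, here is a minimal sketch in Python. The evidence table and the function name `classify_bof` are made up for illustration, and the list of detected features is assumed to come from some upstream feature detector that is not shown.

```python
# Hypothetical evidence table: each visual "word" votes +1 for exactly one class.
EVIDENCE = {
    "eye": {"human": 1, "bird": 0},
    "feather": {"human": 0, "bird": 1},
}

def classify_bof(detected_features):
    """Sum the per-feature class evidence and return (predicted class, totals)."""
    totals = {"human": 0, "bird": 0}
    for feature in detected_features:
        # Features without an entry in the table carry no evidence.
        for cls, weight in EVIDENCE.get(feature, {}).items():
            totals[cls] += weight
    predicted = max(totals, key=totals.get)
    return predicted, totals

# Two eyes and one feather: "human" collects 2 votes, "bird" collects 1.
print(classify_bof(["eye", "eye", "feather"]))
# → ('human', {'human': 2, 'bird': 1})
```

Note that the order of the detected features plays no role: the model only counts occurrences, which is exactly what makes it a "bag".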

A nice property of this simplest BoF model is its interpretability and transparent decision making: we can check exactly which image features carry evidence for a given class, the spatial integration of evidence is super simple (in contrast to the deep non-linear feature integration in deep neural networks) and so it is quite straightforward to understand how the model reaches its decisions.

Traditional BoF models were extremely popular and state-of-the-art before the onset of Deep Learning but quickly fell out of favour due to their subpar classification performance. But are we sure that Deep Neural Networks really use a fundamentally different decision strategy than BoF models?

A deep but interpretable bag-of-features network (BagNet)

To test this we combine the interpretability and transparency of BoF models with the performance of DNNs. The high-level strategy is as follows:

Split the image into small q x q image patches.

Pass patches through a DNN to get class evidences (logits) for each patch.

Sum the evidence over all patches to reach an image-level decision.

Classification strategy of BagNets: for each patch we extract class evidences (logits) using a DNN and sum up the total class evidences over all patches.
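The three steps above can be sketched in a few lines of plain Python. This is a toy illustration under loud assumptions, not the BagNet implementation: the patches here do not overlap, the image is a small binary grid, and `toy_patch_logits` is a hypothetical stand-in for the DNN that simply counts bright and dark pixels.

```python
import random

def extract_patches(image, q):
    """Step 1: split a 2D image (list of rows) into non-overlapping q x q patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - q + 1, q):
        for left in range(0, width - q + 1, q):
            patches.append([row[left:left + q] for row in image[top:top + q]])
    return patches

def toy_patch_logits(patch):
    """Step 2 (hypothetical stand-in for the DNN): two class logits per patch."""
    pixels = [value for row in patch for value in row]
    bright = sum(pixels)
    return [bright, len(pixels) - bright]  # [evidence class 0, evidence class 1]

def bag_classify(patches):
    """Step 3: sum the logits over all patches and pick the strongest class."""
    totals = [0, 0]
    for patch in patches:
        for cls, logit in enumerate(toy_patch_logits(patch)):
            totals[cls] += logit
    return totals.index(max(totals)), totals

image = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
]
patches = extract_patches(image, q=2)
print(bag_classify(patches))  # → (0, [9, 7])

# Scrambling the patch order leaves the summed evidence unchanged.
random.seed(0)
scrambled = patches[:]
random.shuffle(scrambled)
print(bag_classify(scrambled) == bag_classify(patches))  # → True
```

Because the image-level decision is a plain sum over patches, it cannot depend on where each patch sits in the image. That is the property behind the scrambled-image observation from the introduction.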

To implement this strategy in the simplest and most efficient way we take a standard ResNet-50 architecture and replace most (but not all) 3x3 convolutions with 1x1 convolutions. As a result, the hidden units in the last convolutional layer each only “see” a small part of the image (i.e. their receptive field is much smaller than the size of the image). This avoids an explicit partitioning of the image and is as close as possible to standard CNNs while still implementing the outlined strategy. We call the resulting model architecture BagNet-q where q stands for the receptive field size of the top-most layer (we test q = 9, 17 and 33). The runtime of BagNet-q is roughly 2.5 times the runtime of a ResNet-50.
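To see why swapping 3x3 for 1x1 convolutions shrinks the receptive field, the standard receptive-field recurrence can be evaluated on a small stack of layers. The two stacks below are hypothetical and far shallower than a ResNet-50; they do not reproduce the actual BagNet layer configuration and only illustrate the effect.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of (kernel, stride) conv layers."""
    rf, jump = 1, 1  # jump = distance in input pixels between neighbouring units
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # a 1x1 kernel adds nothing to the field
        jump *= stride
    return rf

# Hypothetical five-layer stacks with identical strides.
all_3x3 = [(3, 1), (3, 2), (3, 1), (3, 2), (3, 1)]
mostly_1x1 = [(3, 1), (3, 2), (1, 1), (1, 2), (1, 1)]

print(receptive_field(all_3x3))     # → 21
print(receptive_field(mostly_1x1))  # → 5
```

Only kernels larger than 1x1 grow the receptive field, so the number and placement of the remaining 3x3 convolutions determines how large a patch each top-layer unit gets to see.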

Performance of BagNets with different patch sizes on ImageNet.

The performance of BagNets on ImageNet is impressive even for very small patch sizes: image features of size 17 x 17 pixels are enough to reach AlexNet-level performance while features of size 33 x 33 pixels are sufficient to reach around 87% top-5 accuracy. Higher performance values might be achievable with a more careful placement of the 3 x 3 convolutions and additional hyperparameter tuning.

That’s our first main result: you can solve ImageNet using only a collection of small image features. Long-range spatial relationships like object shape or the relation between object parts can be completely neglected and are unnecessary to solve the task.

A great feature of the BagNets is their transparent decision-making. We can now look at which image features are most predictive for a given class (see below). For example, a tench (a very big fish) is typically recognized by fingers on top of a greenish background. Why? Because most images in this category feature a fisherman holding up the tench like a trophy. Whenever the BagNet wrongly classifies an image as a tench it’s often because there are some fingers on top of a greenish background somewhere in the image.
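Since every patch contributes its own logits, finding the most predictive image features for a class amounts to sorting patches by their evidence for that class. The sketch below assumes the per-patch logits are already available as a dictionary keyed by patch position. The numbers, the class ordering and the helper `top_patches` are invented for illustration.

```python
def top_patches(patch_logits, class_index, k=3):
    """Return the k (position, logit) pairs with the highest evidence for a class."""
    scored = [(pos, logits[class_index]) for pos, logits in patch_logits.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

# Hypothetical logits for four patches, keyed by (row, col).
# Column 0 is evidence for "tench", column 1 for some other class.
patch_logits = {
    (0, 0): [0.2, 1.1],
    (0, 1): [2.5, 0.3],
    (1, 0): [1.7, 0.4],
    (1, 1): [-0.5, 0.9],
}

print(top_patches(patch_logits, class_index=0, k=2))
# → [((0, 1), 2.5), ((1, 0), 1.7)]
```

Looking at the image content of the top-ranked positions is what reveals patterns like the fingers on a greenish background.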