Selected units are shown from three state-of-the-art network architectures trained to classify images of places (Places365). Many individual units respond to specific high-level concepts (object segmentations) that are not directly represented in the training labels (scene classifications).

Why we study interpretable units

Interpretable units are interesting because they hint that deep networks may not be completely opaque black boxes.

However, the observations of interpretability so far are only a hint: we do not yet fully understand whether, or how, interpretable units are evidence of a so-called disentangled representation.

(Example units from AlexNet-Places205 conv5: unit 138 detects heads; unit 215, castles; unit 13, lamps; unit 53, stairways.)

What is Network Dissection?

Our paper investigates three questions:

1. What is a disentangled representation, and how can its factors be quantified and detected?
2. Do interpretable hidden units reflect a special alignment of feature space, or are interpretations a chimera?
3. What conditions in state-of-the-art training lead to representations with greater or lesser entanglement?

Network Dissection is our method for quantifying the interpretability of individual units in a deep CNN (our answer to question #1). It works by measuring the alignment between each unit's response and a set of visual concepts drawn from Broden, a broad and densely labeled segmentation dataset. By identifying the concept that best matches each unit, Network Dissection can break down the types of concepts represented in a layer: for example, the 256 units of AlexNet conv5 trained on Places represent many objects and textures, as well as some scenes, parts, materials, and a color.
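The alignment score at the core of the method is an intersection-over-union (IoU) between a unit's thresholded activation map and a concept's segmentation mask. Below is a minimal sketch of that scoring step; the names (activations, concept_masks, threshold) are illustrative and not the repository's API.

```python
# Minimal sketch of the unit/concept alignment score used by Network
# Dissection: threshold a unit's upsampled activation maps and compute
# IoU against a concept's ground-truth segmentation across the dataset.
import numpy as np

def unit_concept_iou(activations, concept_masks, threshold):
    """Score one unit against one concept over a dataset.

    activations:   (N, H, W) array of the unit's upsampled activation maps.
    concept_masks: (N, H, W) boolean array of ground-truth concept pixels.
    threshold:     activation cutoff (the paper picks it per unit so a fixed
                   top quantile of that unit's activations exceeds it).
    """
    binary = activations > threshold                       # unit's "on" pixels
    intersection = np.logical_and(binary, concept_masks).sum()
    union = np.logical_or(binary, concept_masks).sum()
    return intersection / union if union > 0 else 0.0
```

Each unit is then labeled with the concept that maximizes this IoU, and it counts as a detector for that concept when the best IoU clears a small cutoff.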

Are interpretations a chimera?

Network Dissection shows that interpretable units correspond to special orientations of representation space rather than arbitrary directions. Their emergence is evidence that the network is decomposing the problem into intermediate concepts, answering question #2: interpretability drops as the basis is gradually rotated toward a random one. Contradicting the prevailing wisdom, interpretability is not isotropic in representation space, and networks do appear to learn axis-aligned decompositions.
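A rough sketch of the basis-change experiment, under the assumption that units are mixed by a random orthogonal matrix and then re-scored: the variable names here are illustrative, not the paper's code.

```python
# Rotate a layer's channels by a random orthogonal matrix Q, producing new
# "units" (linear combinations of the originals) that can be dissected with
# the same IoU scoring as the natural axes.
import numpy as np

def random_rotation(num_units, rng=np.random.default_rng(0)):
    # QR decomposition of a Gaussian matrix yields a random orthogonal basis.
    q, _ = np.linalg.qr(rng.standard_normal((num_units, num_units)))
    return q

def rotate_activations(activations, q):
    """activations: (N, C, H, W); returns the same maps in the rotated basis."""
    n, c, h, w = activations.shape
    flat = activations.reshape(n, c, -1)             # (N, C, H*W)
    rotated = np.einsum('dc,ncp->ndp', q, flat)      # mix channels by rows of Q
    return rotated.reshape(n, c, h, w)
```

Scoring each rotated channel with the IoU measure above shows the count of interpretable detectors falling as the basis moves away from the network's natural axes.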

What affects interpretability?

This brings us to question #3: what conditions lead to higher or lower levels of interpretability?

Among architectures, interpretability ranks ResNet > VGG > GoogLeNet > AlexNet, and among primary training tasks, Places365 > Places205 > ImageNet. Interpretability varies widely across self-supervised tasks, and none approaches the interpretability obtained from supervised training on ImageNet or Places.

We find that interpretable units emerge in representations of all the major vision architectures, and they also emerge under a range of training conditions, including (to a lesser degree) self-supervised tasks.

The code here lets you reproduce our interpretability benchmarks, and will allow you to measure, and find ways to improve, interpretability in your own deep CNNs.

Videos

Network Dissection also allows us to understand how emergent concepts appear while a model trains: in particular, it can quantify how representations change under fine-tuning.
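One simple way to quantify such change, sketched below under the assumption that activations have been collected for each checkpoint, is to dissect each snapshot and count units whose best-matching concept changed. best_concept() reuses unit_concept_iou from the sketch above; both are stand-ins for a full dissection pass, not the repository's API.

```python
# Compare two dissected checkpoints (e.g., before and after fine-tuning)
# by the number of units whose best-matching Broden concept changed.
def best_concept(unit_activations, concepts, threshold):
    """Return (name, iou) of the highest-IoU concept for one unit.

    concepts: dict mapping concept name -> (N, H, W) boolean mask array.
    """
    scores = {name: unit_concept_iou(unit_activations, masks, threshold)
              for name, masks in concepts.items()}
    return max(scores.items(), key=lambda kv: kv[1])

def concept_changes(acts_before, acts_after, concepts, threshold):
    """Count units whose best-matching concept changed between checkpoints.

    acts_before, acts_after: per-unit lists of (N, H, W) activation arrays.
    """
    changed = 0
    for before, after in zip(acts_before, acts_after):
        label_before, _ = best_concept(before, concepts, threshold)
        label_after, _ = best_concept(after, concepts, threshold)
        changed += label_before != label_after
    return changed
```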