Note: Part 2 of this article includes code examples for obtaining the illustrations below. Both parts of this article are available for download on GitHub.

In 1948, the psychologist B. F. Skinner described an experiment that, he argued, demonstrated the way we form superstitions. Feeding pigeons “at regular intervals with no reference whatsoever to the bird’s behavior,” he observed that the birds developed complex dances made up of whatever motions they happened to be going through when the food appeared. If Skinner fed the pigeons at close enough intervals, he’d reinforce their belief that their dances brought about feeding. At wider intervals between feedings, the birds would lose faith in their dances and “extinguish” them.

Trained on a limited set of experiences, and reinforced throughout training by the regular appearance of food, the pigeons detected causal patterns where none existed. Skinner argued that humans develop superstitions in the same way, accounting for random or difficult-to-understand events with superficially plausible explanations that aren’t grounded in first principles.

Neural networks make complex decisions in the manner of highly simplified human brains, and they can be susceptible to similar tendencies. Moreover, their reasoning is often obscure: they essentially configure themselves by searching naively for patterns across large sets of training data, and they encode those patterns in several (sometimes very many) different layers. Each layer may make some intuitive sense on its own, but taken collectively the layers are often inscrutable to humans, and the network winds up reasoning from patterns that no human would ever identify.
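To make that layered structure concrete, here is a minimal sketch (not the article’s Part 2 code) of a small multi-layer digit classifier in Keras. Each `Dense` layer is just a linear map followed by a nonlinearity, but the learned weights that connect the layers resist human inspection:

```python
# A minimal sketch of a small multi-layer digit classifier (illustrative only;
# the code that accompanies this article lives in Part 2 on GitHub).
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),    # 784 raw pixel inputs
    tf.keras.layers.Dense(128, activation="relu"),    # each layer is simple...
    tf.keras.layers.Dense(64, activation="relu"),     # ...but the learned weights
    tf.keras.layers.Dense(10, activation="softmax"),  # are collectively opaque
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))
```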

That is a profoundly promising feature: the world is full of patterns that have evaded human senses, and deep learning may illuminate vast scientific fields that rely on them. It is also a sometimes problematic one.

[figure: a handwritten digit 3]

You probably won’t be surprised to know that a fairly simple neural network can recognize this as a 3, with 99% certainty.

Surprisingly, the same neural network recognizes both of these as 3s, the left image with 100% certainty:

[figure: two images, including the noisy image discussed below, both classified as 3s]
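As a hedged sketch of where such certainty figures come from, assuming the small classifier trained above (`model`) and the MNIST test images (`x_test`), the network’s softmax output can be read directly as a confidence:

```python
import numpy as np

# Assumes `model` and `x_test` from the sketch above.
image = x_test[0]                                 # a 28x28 grayscale digit
probs = model.predict(image[np.newaxis, ...])[0]  # softmax over the 10 digits
print(f"classified as {probs.argmax()} with {probs.max():.0%} certainty")
```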

Understanding these misclassifications tells us something about how humans and neural networks learn and reason.

Whereas a human might say “a three has two semicircles that curve to the right, stacked on top of each other, but no closures on the left, which would otherwise make it an eight,” the neural network can’t describe the digit from first principles. Instead, it discovers many subtle features of threes in the thousands of images it considers during training. These features aren’t necessarily related to the actual primitive definition of a three, which makes for interesting behavior when we present the network with an image that lies far outside its training set. That noisy image above doesn’t look anything like the two rightward bulbs that a human would recognize, but it satisfies all of the subtle details that the network is looking for.
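One standard way to produce such a fooling image, sketched here under the assumption of the `model` above (this is plain gradient descent on the pixels toward the “3” class, not necessarily how the figure in this article was made), is to start from random noise and nudge each pixel in whatever direction raises the network’s confidence:

```python
import tensorflow as tf

# Start from random noise and adjust the pixels so the model grows
# ever more confident that the image is a 3.
image = tf.Variable(tf.random.uniform((1, 28, 28)))
target = tf.constant([3])

for _ in range(200):
    with tf.GradientTape() as tape:
        probs = model(image)
        # cross-entropy against the target class; lower means "more like a 3"
        loss = tf.keras.losses.sparse_categorical_crossentropy(target, probs)
    grads = tape.gradient(loss, image)
    image.assign_sub(0.01 * tf.sign(grads))          # small signed step downhill
    image.assign(tf.clip_by_value(image, 0.0, 1.0))  # stay in valid pixel range

print(model(image).numpy().max())  # the network's certainty that the noise is a 3
```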

How can we understand exactly what the network is “looking for”? This is the question of interpretability: relating the workings of neural networks to human intuition.

A very simple network can be fairly easy to interpret. Below are the weights from a simple multinomial logistic regression that classifies digits. Each pixel gets one weight per digit, which determines whether darkening that pixel makes the image more or less likely to represent that digit. Visualized in the manner suggested by TensorFlow’s documentation, blue areas are positively correlated with a particular classification and red areas are negatively correlated. (See the code below to generate these yourself.)
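A sketch along those lines, assuming TensorFlow/Keras and matplotlib (the article’s own version is in Part 2):

```python
import matplotlib.pyplot as plt
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

# Multinomial logistic regression: a single softmax layer over the raw pixels.
logreg = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
logreg.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
logreg.fit(x_train, y_train, epochs=5)

# One 28x28 weight image per digit: blue where a darkened pixel raises the
# probability of that digit, red where it lowers it.
weights = logreg.layers[-1].kernel.numpy()  # shape (784, 10)
v = abs(weights).max()                      # symmetric color scale about zero
fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for digit, ax in enumerate(axes.flat):
    ax.imshow(weights[:, digit].reshape(28, 28), cmap="bwr_r", vmin=-v, vmax=v)
    ax.set_title(str(digit))
    ax.axis("off")
plt.show()
```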