“Science often follows technology, because inventions give us new ways to think about the world and new phenomena in need of explanation.”

Or so Aram Harrow, an MIT physics professor, counter-intuitively argues in “Why now is the right time to study quantum computing”.

He suggests that the scientific idea of entropy could not really be conceived until steam engine technology made an understanding of thermodynamics necessary. Quantum computing, similarly, arose from attempts to simulate quantum mechanics on ordinary computers.

So what does all this have to do with machine learning?

Much like steam engines, machine learning is a technology built to solve specific classes of problems. Yet results from the field are offering intriguing, possibly profound, clues about how our own brains might operate, perceive, and learn. The technology of machine learning is giving us new ways to think about the science of human thought … and imagination.

Not Computer Vision, But Computer Imagination

Five years ago, deep learning pioneer Geoff Hinton (who currently splits his time between the University of Toronto and Google) published the following demo.

Hinton had trained a five-layer neural network to recognize handwritten digits when given their bitmapped images. It was a form of computer vision, one that made handwriting machine-readable.

But unlike previous work on the same topic, where the main objective was simply to recognize digits, Hinton’s network could also run in reverse. That is, given the concept of a digit, it could regenerate images corresponding to that very concept.

We are seeing, quite literally, a machine imagining an image of the concept of “8”.

The magic is encoded in the layers between inputs and outputs. These layers act as a kind of associative memory, mapping back and forth from image to concept and from concept to image, all within one neural network.
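To make the idea of one set of weights serving both directions concrete, here is a minimal sketch of a much simpler relative of Hinton’s network: a Kosko-style bidirectional associative memory (BAM). This is not the model Hinton used; it is a swapped-in, far smaller technique. A single Hebbian weight matrix W is read one way to go from a toy bitmap to a concept code, and read the other way to go from the concept code back to a bitmap. The 5x3 bitmaps, the four-element concept codes, and the function names (train, image_to_concept, concept_to_image) are all hypothetical, chosen only for illustration.

```python
# Toy bidirectional associative memory (Kosko-style BAM).
# Illustration only: NOT Hinton's network, just the same "run it backwards" idea.

def bitmap(rows):
    """Turn strings like '.#.' into a flat bipolar vector (+1 = ink, -1 = blank)."""
    return [1 if ch == "#" else -1 for row in rows for ch in row]

# Hypothetical 5x3 toy bitmaps for the digits "1" and "8".
IMAGES = {
    "1": bitmap([".#.", ".#.", ".#.", ".#.", ".#."]),
    "8": bitmap(["###", "#.#", "###", "#.#", "###"]),
}
# Hypothetical concept codes: orthogonal bipolar vectors standing in for labels.
CONCEPTS = {
    "1": [1, 1, -1, -1],
    "8": [1, -1, 1, -1],
}

def sign(v):
    # Ties are broken toward +1; with the patterns above no ties occur.
    return 1 if v >= 0 else -1

def train(pairs):
    """Hebbian learning: W[i][j] = sum over stored pairs of image[i] * concept[j]."""
    n_pix, n_con = len(pairs[0][0]), len(pairs[0][1])
    W = [[0] * n_con for _ in range(n_pix)]
    for image, concept in pairs:
        for i in range(n_pix):
            for j in range(n_con):
                W[i][j] += image[i] * concept[j]
    return W

def image_to_concept(W, image):
    """Forward pass ("vision"): concept[j] = sign(sum_i image[i] * W[i][j])."""
    return [sign(sum(image[i] * W[i][j] for i in range(len(W))))
            for j in range(len(W[0]))]

def concept_to_image(W, concept):
    """Reverse pass ("imagination"): image[i] = sign(sum_j W[i][j] * concept[j])."""
    return [sign(sum(W[i][j] * concept[j] for j in range(len(concept))))
            for i in range(len(W))]

def show(image, width=3):
    """Render a flat bipolar vector as rows of '#' and '.'."""
    return "\n".join("".join("#" if p > 0 else "." for p in image[r:r + width])
                     for r in range(0, len(image), width))

W = train([(IMAGES[k], CONCEPTS[k]) for k in IMAGES])

print(image_to_concept(W, IMAGES["8"]) == CONCEPTS["8"])  # True
print(show(concept_to_image(W, CONCEPTS["8"])))  # prints the toy "8" bitmap
```

In this toy setup, flipping one pixel of the "8" still recalls the "8" concept code, which is the associative part: a partial or noisy cue retrieves the whole stored pair. Real generative networks learn their weights very differently, but the shared-weights, two-directions structure is the point being illustrated.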

Is this how human imagination might work?

Beyond the simple, brain-inspired machine vision technology here, though, the broader scientific question is whether this is how human imagination (visualization) works. If so, it is a huge aha moment.

After all, isn’t this something our brains do quite naturally? When we see the digit 4, we think of the concept “4”. Conversely, when someone says “8”, we can conjure up in our mind’s eye an image of the digit 8.

Is it all a kind of “running backwards” by the brain, from concepts to images (or sounds, smells, textures, etc.), through the information encoded in the layers? Aren’t we watching this network create new pictures as it does so (and perhaps, in a more advanced version, even new internal connections)?

On Concepts and Intuitions

If visual recognition and imagination are indeed just back-and-forth mapping between images and concepts, what’s happening between those layers? Do deep neural networks have some insight or analogies to offer us here?

Let’s first go back 234 years, to Immanuel Kant’s Critique of Pure Reason, in which he argues that “Intuition is nothing but the representation of phenomena”.

Kant railed against the idea that human knowledge could be explained purely as empirical and rational thought. It is necessary, he argued, to consider intuitions. In his terms, “intuitions” are representations left in a person’s mind by sensory perceptions, whereas “concepts” are descriptions of empirical objects or sensory data. Together, these make up human knowledge.

Fast-forward two centuries: Berkeley CS professor Alyosha Efros, who specializes in visual understanding, pointed out that “there are many more things in our visual world than we have words to describe them with”. Using word labels to train models, Efros argues, exposes our techniques to a language bottleneck: there are many more unnamable intuitions than we have words for.

There is an intriguing mapping between ML Labels and human Concepts, and between ML Encodings and human Intuitions.

In training deep networks, such as the seminal “cat-recognition” work led by Quoc Le at Google/Stanford, we’re discovering that the activations in successive layers appear to go from lower to higher conceptual levels. An image recognition network encodes bitmaps at the lowest layer, then apparent corners and edges at the next layer, common shapes at the next, and so on. These intermediate layers don’t necessarily have any activations corresponding to explicit high-level concepts, like “cat” or “dog”, yet they do encode a distributed representation of the sensory inputs. Only the final output layer has such a mapping to human-defined labels, because it is explicitly trained to match those labels.

Is this Intuition staring us in the face?

These encodings and labels, then, seem to correspond closely to what Kant referred to as “intuitions” and “concepts”.

In yet another example of machine learning technology revealing insights about human thought, the network diagram above makes you wonder whether this is how the architecture of Intuition, albeit vastly simplified, is expressed.

The Sapir-Whorf Controversy

If, as Efros has pointed out, there are many more conceptual patterns than words can describe, do words constrain our thoughts? This question lies at the heart of the Sapir-Whorf, or Linguistic Relativity, Hypothesis, and of the debate over whether language completely determines the boundaries of our cognition or whether we are free to conceptualize anything, regardless of the languages we speak.

In its strongest form, the hypothesis posits that the structure and lexicon of languages constrain how one perceives and conceptualizes the world.

Can you pick the odd one out? The Himba, who have distinct words for the two shades of green, can pick it out instantly. Credit: Mark Frauenfelder, How Language Affects Color Perception, and Randy MacDonald for verifying the RGB values.

One of the most striking effects of this is demonstrated in the color test shown here. When asked to pick out the one square with a shade of green that’s distinct from all the others, the Himba people of northern Namibia — who have distinct words for the two shades of green — can find it almost instantly.

The rest of us, however, have a much harder time doing so.

The theory is that once we have words to distinguish one shade from another, our brains train themselves to discriminate between the shades, so the difference becomes more and more “obvious” over time. Since we see with our brains, not with our eyes, language ends up driving perception.

We see with our brains, not with our eyes.

With machine learning, we observe something similar. In supervised learning, we train our models to best match images (or text, audio, etc.) against provided labels or categories. By construction, these models learn to discriminate far more effectively between categories for which we have provided labels than between other possible categories for which we have not. From the perspective of supervised machine learning, this outcome is not at all surprising, so perhaps we shouldn’t be too surprised by the results of the color experiment above, either. Language does indeed influence our perception of the world, in the same way that labels in supervised machine learning influence a model’s ability to discriminate among categories.
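A small supervised example shows the mechanism. The sketch below uses a nearest-centroid classifier (a deliberately simple swapped-in model, not anything from the studies above) trained twice on the same colour samples: once with a single word, “green”, for every sample, and once with two distinct words for the two shades. The RGB values are hypothetical, made up for illustration, and are not the values from the colour test.

```python
import math
from collections import defaultdict

# Hypothetical RGB samples for two close shades of green
# (made up for illustration; NOT the values from the colour test image).
SHADE_A = [(0, 150, 0), (2, 152, 1), (1, 149, 3)]
SHADE_B = [(20, 150, 0), (22, 148, 2), (19, 151, 1)]

def fit_centroids(samples, labels):
    """Supervised nearest-centroid training: average the samples that share a label."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    counts = defaultdict(int)
    for x, y in zip(samples, labels):
        for k in range(3):
            sums[y][k] += x[k]
        counts[y] += 1
    return {y: tuple(s / counts[y] for s in sums[y]) for y in sums}

def predict(centroids, x):
    """Assign x to the label whose centroid is nearest in RGB space."""
    return min(centroids, key=lambda y: math.dist(x, centroids[y]))

samples = SHADE_A + SHADE_B
# A vocabulary with one word for both shades.
one_word = fit_centroids(samples, ["green"] * len(samples))
# A vocabulary with distinct words for the two shades.
two_words = fit_centroids(samples, ["shade A"] * 3 + ["shade B"] * 3)

probe_a, probe_b = (1, 151, 2), (21, 150, 1)
print(predict(one_word, probe_a), predict(one_word, probe_b))    # green green
print(predict(two_words, probe_a), predict(two_words, probe_b))  # shade A shade B
```

The data are identical in both runs; only the label vocabulary changes. With one word, the model has no reason to separate the shades, and every input maps to “green”. With two words, it learns a boundary between them, loosely mirroring the suggestion that distinct Himba words for the two greens sharpen discrimination.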

And yet, we also know that labels are not strictly required to discriminate between cues. In Google’s “cat-recognizing brain”, the network eventually discovers concepts such as “cat” and “dog” entirely on its own, without being trained against explicit labels. After this unsupervised training, whenever the network is fed an image belonging to a certain category like “Cats”, the same corresponding set of “Cat” neurons reliably fires. Simply by looking at the vast set of training images, this network has discovered the essential patterns of each category, as well as what distinguishes one category from another.
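Google’s network was a large unsupervised model trained on millions of unlabeled images, far beyond a short example. As a much simpler stand-in for the same idea, categories emerging from unlabeled data, here is a sketch of plain k-means clustering (Lloyd’s algorithm). The two-feature points are made up for illustration, and the farthest-first initialisation is chosen only to keep the result deterministic.

```python
import math

def kmeans(points, k, max_iter=100):
    """Plain k-means (Lloyd's algorithm): group unlabeled points into k clusters."""
    # Deterministic farthest-first initialisation so the example is reproducible.
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(farthest)

    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_assignment = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                          for p in points]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids

# Hypothetical, unlabeled 2-D feature vectors: no category names are given.
points = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (5.0, 5.2), (5.3, 4.9), (4.8, 5.1)]
labels, centers = kmeans(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The algorithm never sees a label; the two groups fall out of the structure of the data. The cluster indices 0 and 1 are arbitrary, though, and only acquire a name like “cat” if someone supplies one afterwards.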

In the same way, an infant who is repeatedly shown a paper cup soon recognizes the visual pattern of such a thing, even before learning the words “paper cup” to attach to that pattern. In this sense, the strong form of the Sapir-Whorf hypothesis cannot be entirely correct: we can, and do, discover concepts even without the words to describe them.

Supervised and unsupervised machine learning turn out to represent the two sides of the controversy’s coin. If we recognized them as such, perhaps Sapir-Whorf would seem less a controversy and more a reflection of supervised and unsupervised human learning.