AI researchers had a breakthrough when they learned to replicate in their machines what we believe to be one of the most basic functions of the human brain: thought generated by the combined activity of clusters of connected neurons. They’re now left in the same position as the neuroscientists who first proposed that idea: on the outside looking in, wondering just how millions of tiny components contribute to a greater whole.

But computer scientists might understand their machines before we understand ourselves. New research from MIT offers clues to how artificial neural networks process information—and points to a possible method for interpreting why they might make one decision over another. That could help us more easily figure out, for example, why a self-driving car swerved off the road after perceiving a certain object, or investigate exactly how biased an image classification algorithm was trained to be.

Buckle in, because this is about to get fairly nerdy.

First, a quick overview of how a trained deep neural network functions: the goal is for you to be able to give it a picture and have it tell you what’s in that picture.

The network takes that image and processes it through different layers, each a dense cluster of millions of tiny computations meant to model neurons in the brain. Each layer looks at the picture at different levels of abstraction, comparing patterns in it to patterns the network has seen before. The first layer might look for shapes, the next textures, and so on—but each layer also throws out data it deems unnecessary for identifying the contents of the image.
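The layer-by-layer idea can be pictured with a toy sketch (the image, filter, and pooling step here are made up for illustration, not taken from any real network): one layer matches a pattern everywhere in the image, and the next throws away everything but the strongest responses.

```python
import numpy as np

def conv2d(img, kernel):
    """Slide a small pattern-matching filter over the image (one 'layer')."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Downsample by keeping only the strongest response in each patch,
    discarding data the layer deems unnecessary."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

# A tiny 6x6 "image" with a vertical edge down the middle.
img = np.zeros((6, 6))
img[:, 3:] = 1.0

edge_kernel = np.array([[-1.0, 1.0]])            # responds to dark-to-bright jumps
fmap = np.maximum(conv2d(img, edge_kernel), 0)   # keep only positive responses
pooled = max_pool(fmap)

print(fmap.shape, pooled.shape)  # each stage is smaller and more abstract
```

Real networks stack dozens of such layers with learned filters, but the shape of the computation is the same: match patterns, keep the strongest evidence, repeat.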

The process distills the image into a “high-level representation”: data that efficiently describes the important parts of the image’s contents to the machine. The final high-level representation is what the neural network interprets to be, say, a dog or a horse.

Today’s deep-learning scientists have two competing theories about how these deep neural networks actually do this work at the level of the computational neurons: via “disentangled representations” or via “distributed representations.”

The notion of disentangled representations says that individual neurons are responsible for detecting patterns highly correlated with dogs or horses. Distributed representation is a theory suggesting that a group of neurons, tied together by a mathematical relationship, works together to identify these patterns, according to MIT researcher David Bau.

If representations in deep neural networks are disentangled, we’d have a shot at isolating the specific neurons responsible for identifying, for example, a person’s gender in a photo. But if they’re distributed, a complex set of relationships between neurons would need to be derived. Luckily, Bau’s team’s research, published in late June, indicates the former. “Now we have a quantitative tool that you can use for understanding any visual representation,” Bau told Quartz.

The MIT team wanted to find out exactly how different neural networks, trained on different data, built their own mechanisms for understanding concepts ranging from simple patterns to specific objects. To do this, the researchers compiled a specialized dataset of images, ranging from objects to landscapes to simple textures, all accurately labeled in detail down to the pixel level.

Samples from MIT’s Broden database (Image: MIT)

Since the MIT team knew exactly what the neural network was perceiving in every part of the picture, they would be able to analyze which neurons were highly active at a specific time, and trace the recognition of specific concepts in the picture back to those neurons. In tests, the team found that individual neurons could be highly correlated to high-level concepts—a strong sign that neural networks build “disentangled representations.”
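One way to picture that tracing step (a simplified sketch of the general idea, not the paper’s exact method, and with made-up numbers): threshold a single neuron’s activation map and measure how much it overlaps with the pixel-level mask for a labeled concept. High overlap suggests that neuron is highly correlated with that concept.

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection-over-union of two binary masks: 1.0 = perfect overlap."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 0.0

def concept_score(activation_map, concept_mask, threshold=0.5):
    """How well a neuron's strong activations line up with a labeled concept."""
    return iou(activation_map > threshold, concept_mask)

# Hypothetical 4x4 activation map for one neuron, and a pixel-level "dog" mask.
activation = np.array([[0.9, 0.8, 0.1, 0.0],
                       [0.7, 0.9, 0.2, 0.1],
                       [0.1, 0.0, 0.1, 0.0],
                       [0.0, 0.1, 0.0, 0.2]])
dog_mask = np.zeros((4, 4), dtype=bool)
dog_mask[:2, :2] = True  # the dog occupies the top-left corner

score = concept_score(activation, dog_mask)
print(score)  # 1.0: this neuron fires exactly where the dog is
```

Repeating a comparison like this for every neuron and every labeled concept is what lets the researchers say which individual neurons track which high-level ideas.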

(Okay, the nerdy bit is pretty much over. You made it.)

So why does this matter? Bau thinks it can be used to objectively measure bias imparted to a neural network by its training data. He offers the example of a network trained on the ImageNet dataset, an industry-standard benchmark for image-recognition networks. Specific neurons’ reactions to dogs were mathematically larger than their reactions to cats—evidence of a bias toward dogs in the ImageNet data.
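That kind of comparison is, at its core, just measuring whether a neuron responds more strongly to one category than another. A minimal sketch with entirely invented activation numbers (not ImageNet measurements):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-image responses of one neuron to 100 dog and 100 cat photos.
dog_acts = rng.normal(loc=2.0, scale=0.5, size=100)
cat_acts = rng.normal(loc=1.2, scale=0.5, size=100)

bias_gap = dog_acts.mean() - cat_acts.mean()
print(f"mean dog response exceeds mean cat response by {bias_gap:.2f}")
```

A consistently positive gap across a network’s dog-sensitive neurons would be the kind of quantitative signal Bau describes.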

We’ve seen this before in Google’s famous DeepDream experiment, in which neural networks were asked to amplify the patterns they perceived in random images. In that experiment, neural networks trained on ImageNet put dogs everywhere, because their training to see dogs was stronger than anything else they had learned.

MIT isn’t the first to address this problem. There are conferences for researchers to gather and talk about how to better explain AI, dozens of papers published each year, and even a DARPA program to further deep learning explainability—it all speaks to the work’s importance.

The ability to measure bias in neural networks could be critical in fields like healthcare, where bias inherent in an algorithm’s training data could be carried into treatment, or in understanding why self-driving cars make certain decisions on the road, making those vehicles safer.