If a doctor told you that you needed surgery, you would want to know why — and you’d expect the explanation to make sense to you, even if you’d never gone to medical school. Been Kim, a research scientist at Google Brain, believes that we should expect nothing less from artificial intelligence. As a specialist in “interpretable” machine learning, she wants to build AI software that can explain itself to anyone.

Since its ascendance roughly a decade ago, the neural-network technology behind artificial intelligence has transformed everything from email to drug discovery with its increasingly powerful ability to learn from and identify patterns in data. But that power has come with an uncanny caveat: The very complexity that lets modern deep-learning networks successfully teach themselves how to drive cars and spot insurance fraud also makes their inner workings nearly impossible to make sense of, even by AI experts. If a neural network is trained to identify patients at risk for conditions like liver cancer and schizophrenia — as a system called “Deep Patient” was in 2015, at Mount Sinai Hospital in New York — there’s no way to discern exactly which features in the data the network is paying attention to. That “knowledge” is smeared across many layers of artificial neurons, each with hundreds or thousands of connections.

As ever more industries attempt to automate or enhance their decision-making with AI, this so-called black box problem seems less like a technological quirk than a fundamental flaw. DARPA’s “XAI” project (for “explainable AI”) is actively researching the problem, and interpretability has moved from the fringes of machine-learning research to its center. “AI is in this critical moment where humankind is trying to decide whether this technology is good for us or not,” Kim says. “If we don’t solve this problem of interpretability, I don’t think we’re going to move forward with this technology. We might just drop it.”

Kim and her colleagues at Google Brain recently developed a system called “Testing with Concept Activation Vectors” (TCAV), which she describes as a “translator for humans” that allows a user to ask a black box AI how much a specific, high-level concept has played into its reasoning. For example, if a machine-learning system has been trained to identify zebras in images, a person could use TCAV to determine how much weight the system gives to the concept of “stripes” when making a decision.
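
To make the zebra example concrete, here is a toy sketch of the idea in Python. It is not Kim's implementation. The published TCAV method learns a concept direction by training a linear classifier on a real network's internal activations; this sketch swaps in a simpler difference of means and a made-up one-layer "head" whose gradient can be written by hand. Every name, shape and number below is hypothetical, and the "activations" are random data standing in for what a real image model would produce.

```python
import numpy as np


def concept_activation_vector(concept_acts, random_acts):
    """Unit vector pointing from random examples toward concept examples.

    Simplification: TCAV proper fits a linear classifier; this uses the
    difference of the two group means instead.
    """
    direction = concept_acts.mean(axis=0) - random_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)


def tcav_score(class_acts, head_weights, cav):
    """Fraction of class examples whose logit rises along the concept direction."""
    # Hypothetical toy head: logit(a) = sum(head_weights * tanh(a)),
    # so the gradient with respect to the activations a is analytic.
    grads = head_weights * (1.0 - np.tanh(class_acts) ** 2)
    sensitivities = grads @ cav  # directional derivative per example
    return float(np.mean(sensitivities > 0))


rng = np.random.default_rng(0)
dim = 8

# Assumption: the "stripes" concept shifts activations along dimension 0.
stripes = np.zeros(dim)
stripes[0] = 3.0

random_acts = rng.normal(size=(200, dim))             # unrelated images
concept_acts = rng.normal(size=(200, dim)) + stripes  # striped images
zebra_acts = rng.normal(size=(100, dim)) + stripes    # images of the class

cav = concept_activation_vector(concept_acts, random_acts)

# One toy head whose zebra logit leans on dimension 0, and one that
# leans against it.
uses_stripes = np.zeros(dim)
uses_stripes[0] = 2.0

score_for = tcav_score(zebra_acts, uses_stripes, cav)
score_against = tcav_score(zebra_acts, -uses_stripes, cav)
print(score_for, score_against)
```

A score near 1 means that nudging the activations toward "stripes" raises the zebra score for almost every zebra image; a score near 0 means the concept pushes the other way. In a real setting the activations and gradients would come from the trained network rather than from random numbers.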

TCAV was originally tested on machine-learning models trained to recognize images, but it also works with models trained on text and certain kinds of data visualizations, like EEG waveforms. “It’s generic and simple — you can plug it into many different models,” Kim says.

Quanta Magazine spoke with Kim about what interpretability means, who it’s for, and why it matters. An edited and condensed version of the interview follows.

You’ve focused your career on “interpretability” for machine learning. But what does that term mean, exactly?

There are two branches of interpretability. One branch is interpretability for science: If you consider a neural network as an object of study, then you can conduct scientific experiments to really understand the gory details about the model, how it reacts, and that sort of thing.

The second branch of interpretability, which I’ve been mostly focused on, is interpretability for responsible AI. You don’t have to understand every single thing about the model. But as long as you can understand just enough to safely use the tool, then that’s our goal.

But how can you have confidence in a system that you don’t fully understand the workings of?

I’ll give you an analogy. Let’s say I have a tree in my backyard that I want to cut down. I might have a chainsaw to do the job. Now, I don’t fully understand how the chainsaw works. But the manual says, “These are the things you need to be careful of, so as to not cut your finger.” So, given this manual, I’d much rather use the chainsaw than a handsaw, which is easier to understand, but would make me spend five hours cutting down the tree.

You understand what “cutting” is, even if you don’t exactly know everything about how the mechanism accomplishes that.

Yes. The goal of the second branch of interpretability is: Can we understand a tool enough so that we can safely use it? And we can create that understanding by confirming that useful human knowledge is reflected in the tool.

How does “reflecting human knowledge” make something like a black box AI more understandable?

Here’s another example. If a doctor is using a machine-learning model to make a cancer diagnosis, the doctor will want to know that the model isn’t picking up on some random correlation in the data that we don’t want to pick up. One way to make sure of that is to confirm that the machine-learning model is doing something that the doctor would have done. In other words, to show that the doctor’s own diagnostic knowledge is reflected in the model.

So if doctors were looking at a cell specimen to diagnose cancer, they might look for something called “fused glands” in the specimen. They might also consider the age of the patient, as well as whether the patient has had chemotherapy in the past. These are factors or concepts that the doctors trying to diagnose cancer would care about. If we can show that the machine-learning model is also paying attention to these factors, the model is more understandable, because it reflects the human knowledge of the doctors.