Today’s AI systems make weighty decisions regarding loans, medical diagnoses, parole, and more. They are also opaque, which lets bias slip in undetected. Without transparency, we will never know why AI software assesses a 41-year-old white male and an 18-year-old black woman who commit similar crimes as “low risk” and “high risk,” respectively.


For both business and technical reasons, automatically generated, high-fidelity explanations of most AI decisions are not currently possible. That's why we should be pushing for the external audit of AI systems responsible for high-stakes decision making. Automated auditing, at a massive scale, can systematically probe AI systems and uncover biases or other undesirable behavior patterns.

One of the most notorious instances of black-box AI bias is software used in judicial systems across the country to recommend sentences, bond amounts, and more. ProPublica’s analysis of one of the most widely used recidivism algorithms for parole decisions uncovered potentially significant bias and inaccuracy. When probed for more information, the software’s maker declined to share the specifics of its proprietary algorithm. Such secrecy makes it difficult for defendants to challenge these decisions in court.

AI bias has been reported in numerous other contexts, from a cringeworthy bot that tells Asians to “open their eyes” in passport photos to facial recognition systems that are less accurate in identifying dark-skinned and female faces to AI recruiting tools that discriminate against women.

In response, regulators have sought to mandate transparency through so-called "explainable AI." In the US, for example, lenders denying an individual’s application for a loan must provide “specific reasons” for the adverse decision. In the European Union, the GDPR mandates a “right to explanation” for all high-stakes automated decisions.

Unfortunately, the challenges of explainable AI are formidable. First, explanations can expose proprietary data and trade secrets. It is also extremely difficult to explain the behavior of complex, nonlinear neural network models trained over huge data sets. How do we explain a conclusion derived from a weighted, nonlinear combination of thousands of inputs, each contributing a microscopic percentage point toward the overall judgment? As a result, automatically generated explanations of AI decisions typically trade fidelity for simplicity.
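To make that concrete, here is a toy sketch, not any real lender’s model: the weights and applicant features below are random, hypothetical values. Even in a simple linear score over a few thousand inputs, no single input accounts for more than a sliver of the decision, and real neural networks layer nonlinearity on top of that.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 5000
weights = rng.normal(size=n_features)      # hypothetical learned weights
applicant = rng.normal(size=n_features)    # hypothetical applicant features

contributions = weights * applicant        # each input's share of the raw score
score = contributions.sum()
shares = np.abs(contributions) / np.abs(contributions).sum()

print(f"decision score: {score:.2f}")
print(f"largest single-input share of the decision: {shares.max():.2%}")
```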

Netflix, for instance, tries to explain each recommendation by pointing to a single show you’ve previously watched (“Because you watched Stranger Things”). In actuality, its recommendations are based on numerous factors and complex algorithms. Although simplified explanations of your Netflix recommendations are innocuous, in high-stakes situations such oversimplification can be dangerous.

Even simple predictive models can exhibit counterintuitive behavior. AI models are susceptible to a common phenomenon known as Simpson’s paradox, in which an apparent pattern is driven by an underlying, unobserved variable. In one recent case, researchers found that a model trained on patient data appeared to show that a history of asthma decreases a patient’s risk of dying from pneumonia. Taken at face value, that conclusion would have been dangerously misleading for health care practitioners and asthma patients. In reality, asthmatics fared better only because they were more likely to receive immediate care.
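A toy calculation shows how the reversal happens. The counts below are invented for illustration and are not drawn from the pneumonia study: asthmatics look safer in the aggregate solely because most of them received immediate care.

```python
import pandas as pd

# (asthma, immediate_care, patients, deaths) -- invented counts for illustration
data = pd.DataFrame(
    [(True,  True,  900, 27),   # most asthma patients get aggressive care
     (True,  False, 100, 10),
     (False, True,  300,  6),
     (False, False, 700, 49)],
    columns=["asthma", "immediate_care", "patients", "deaths"],
)

overall = data.groupby("asthma")[["patients", "deaths"]].sum()
print(overall["deaths"] / overall["patients"])    # asthma looks protective overall...

by_care = data.groupby(["immediate_care", "asthma"])[["patients", "deaths"]].sum()
print(by_care["deaths"] / by_care["patients"])    # ...yet within each care level it is not
```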

This is not an isolated incident, and such mistaken conclusions cannot be easily resolved with more data. Despite our best efforts, AI explanations can mislead as easily as they inform.

To achieve increased transparency, we advocate for auditable AI: AI systems that can be queried externally with hypothetical cases. Those hypothetical cases can be either synthetic or real, allowing automated, instantaneous, fine-grained interrogation of the model. It's a straightforward way to monitor AI systems for signs of bias or brittleness: What happens if we change the gender of a defendant? What happens if a loan applicant resides in a historically minority neighborhood?
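As a sketch of what such a probe might look like, suppose an auditor has only black-box access to a deployed model through a hypothetical predict function, along with a set of synthetic or historical cases. Flipping one protected attribute at a time and counting how often the decision changes requires no access to the model’s internals or training data.

```python
def audit_attribute_flip(predict, cases, attribute, value_a, value_b):
    """Fraction of cases whose prediction changes when `attribute` is switched
    from value_a to value_b, with every other field held fixed."""
    flips = 0
    for case in cases:
        case_a = {**case, attribute: value_a}
        case_b = {**case, attribute: value_b}
        if predict(case_a) != predict(case_b):
            flips += 1
    return flips / len(cases)

# An auditor would run this over synthetic or real historical cases, e.g.:
# rate = audit_attribute_flip(predict, historical_cases, "gender", "male", "female")
# print(f"decision changed in {rate:.1%} of otherwise-identical cases")
```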

Auditable AI has several advantages over explainable AI. First, having a neutral third party investigate these questions is a far better check on bias than explanations controlled by the algorithm's creator. Second, because audits probe only a system's behavior, the producers of the software do not have to expose the trade secrets embedded in their proprietary systems and data sets. As a result, AI audits will likely face less resistance.

Auditing is complementary to explanations. In fact, auditing can help to investigate and validate (or invalidate) AI explanations. Say Netflix recommends The Twilight Zone because I watched Stranger Things. Will it also recommend other science fiction horror shows? Does it recommend The Twilight Zone to everyone who’s watched Stranger Things?
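An auditor could test that kind of explanation with the same black-box approach. The recommend function and sample histories below are hypothetical stand-ins, not Netflix’s actual interface; the point is that the probe needs only query access to the system being audited.

```python
def explanation_holds(recommend, base_history, trigger, expected):
    """Check whether adding `trigger` to a viewing history actually causes
    `expected` to appear in the recommendations, versus the same history without it."""
    with_trigger = recommend(base_history + [trigger])
    without_trigger = recommend(base_history)
    return expected in with_trigger and expected not in without_trigger

# Run across many viewing histories to see how often the stated reason holds up:
# hits = sum(explanation_holds(recommend, h, "Stranger Things", "The Twilight Zone")
#            for h in sample_histories)
# print(f"explanation held for {hits} of {len(sample_histories)} histories")
```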