A few years ago, Joelle Pineau, a computer science professor at McGill, was helping her students design a new algorithm when they fell into a rut. Her lab studies reinforcement learning, a type of artificial intelligence that’s used, among other things, to help virtual characters (“half cheetah” and “ant” are popular) teach themselves how to move about in virtual worlds. It’s a prerequisite to building autonomous robots and cars. Pineau’s students hoped to improve on another lab’s system. But first they had to rebuild it, and their design, for reasons unknown, was falling short of its promised results. Until, that is, the students tried some “creative manipulations” that didn’t appear in the other lab’s paper.

Lo and behold, the system began performing as advertised. The lucky break was a symptom of a troubling trend, according to Pineau. Neural networks, the technique that’s given us Go-mastering bots and text generators that craft classical Chinese poetry, are often called black boxes because of the mysteries of how they work. Getting them to perform well can be like an art, involving subtle tweaks that go unreported in publications. The networks are also growing larger and more complex, with huge data sets and massive computing arrays that make replicating and studying those models expensive, if not impossible, for all but the best-funded labs.

“Is that even research anymore?” asks Anna Rogers, a machine-learning researcher at the University of Massachusetts. “It’s not clear if you’re demonstrating the superiority of your model or your budget.”

Pineau is trying to change the standards. She’s the reproducibility chair for NeurIPS, a premier artificial intelligence conference. Under her watch, the conference now asks researchers to submit a “reproducibility checklist” including items often omitted from papers, like the number of models trained before the “best” one was selected, the computing power used, and links to code and datasets. That’s a change for a field where prestige rests on leaderboards, the rankings that determine whose system is the “state of the art” for a particular task, and where researchers have great incentive to gloss over the tribulations that led to those spectacular results.

The idea, Pineau says, is to encourage researchers to offer a road map for others to replicate their work. It’s one thing to marvel at the eloquence of a new text generator or the “superhuman” agility of a videogame-playing bot. But even the most sophisticated researchers have little sense of how these systems work. Replicating those AI models is important not just for identifying new avenues of research, but also as a way to investigate algorithms as they augment, and in some cases supplant, human decision-making, in everything from who stays in jail and for how long to who is approved for a mortgage.

Others are also attacking the problem. Researchers at Google have proposed so-called “model cards” to detail how machine-learning systems have been tested, including results that point out potential bias. Others have sought to show how fragile the term “state of the art” is when systems, optimized for the data sets used in rankings, are set loose in other contexts. Last week, researchers at the Allen Institute for Artificial Intelligence, or AI2, released a paper that aims to expand Pineau’s reproducibility checklist to other parts of the experimental process. They call it “Show Your Work.”

“Starting where someone left off is such a pain because we never fully describe the experimental setup,” says Jesse Dodge, an AI2 researcher who coauthored the research. “People can’t reproduce what we did if we don’t talk about what we did.” It’s a surprise, he adds, when people report even basic details about how a system was built. A survey of reinforcement learning papers last year found only about half included code.