With standard inverse reinforcement learning, a machine tries to learn a reward function that a human is pursuing. But in real life, we might be willing to actively help it learn about us. Back at Berkeley after his sabbatical, Russell began working with his collaborators to develop a new kind of “cooperative inverse reinforcement learning” where a robot and a human can work together to learn the human’s true preferences in various “assistance games”—abstract scenarios representing real-world, partial-knowledge situations.

One game they developed, known as the off-switch game, addresses one of the most obvious ways autonomous robots can become misaligned from our true preferences: by disabling their own off switches. Alan Turing suggested in a BBC radio lecture in 1951 (the year after he published a pioneering paper on AI) that it might be possible to “keep the machines in a subservient position, for instance by turning off the power at strategic moments.” Researchers now find that simplistic. What’s to stop an intelligent agent from ignoring commands to stop increasing its reward function? In Human Compatible, Russell writes that the off-switch problem is “the core of the problem of control for intelligent systems. If we cannot switch a machine off because it won’t let us, we’re really in trouble. If we can, then we may be able to control it in other ways too.”

Uncertainty about our preferences may be key, as demonstrated by the off-switch game, a formal model of the problem involving Harriet the human and Robbie the robot. Robbie is deciding whether to act on Harriet’s behalf (whether to book her a nice but expensive hotel room, say) but is uncertain about what she’ll prefer. Robbie estimates that the payoff for Harriet could be anywhere in the range of −40 to +60, with an average of +10 (Robbie thinks she’ll probably like the fancy room but isn’t sure). Doing nothing has a payoff of 0. But there’s a third option: Robbie can query Harriet about whether she wants it to proceed or prefers to “switch it off,” that is, take Robbie out of the hotel-booking decision. Because Harriet will let the robot proceed only when the payoff to her is positive, and will otherwise switch it off for a payoff of 0, the expected payoff of asking is greater than +10. So Robbie will decide to consult Harriet and, if she so desires, let her switch it off.
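To make the arithmetic concrete, here is a minimal sketch of Robbie's three options. It rests on two assumptions the description above does not spell out: that Robbie's belief about the payoff is uniform over the range −40 to +60, and that Harriet knows her own payoff and lets Robbie proceed exactly when it is positive. The function name `expected_payoffs` is made up for illustration.

```python
from fractions import Fraction

def expected_payoffs(low, high):
    """Return (act_now, defer) expected payoffs.

    Assumes the payoff is uniform on [low, high] and that Harriet allows
    the action only when her true payoff is positive (otherwise payoff 0).
    """
    low, high = Fraction(low), Fraction(high)
    act_now = (low + high) / 2          # mean of a uniform distribution
    if high <= 0:
        defer = Fraction(0)             # Harriet always switches Robbie off
    elif low >= 0:
        defer = act_now                 # Harriet always lets Robbie proceed
    else:
        p_positive = high / (high - low)    # chance the payoff is above 0
        mean_if_positive = high / 2         # mean of the positive part
        defer = p_positive * mean_if_positive
    return act_now, defer

act_now, defer = expected_payoffs(-40, 60)
print(f"act without asking: {act_now}")
print(f"ask Harriet first:  {defer}")
print("do nothing:         0")
```

Under these assumptions, acting without asking is worth +10 on average, while asking first is worth 0.6 × 30 = +18, because Harriet screens out every case in which the room would have been a mistake.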

Russell and his collaborators proved that in general, unless Robbie is completely certain about what Harriet herself would do, it will prefer to let her decide. “It turns out that uncertainty about the objective is essential for ensuring that we can switch the machine off,” Russell wrote in Human Compatible, “even when it’s more intelligent than us.”
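The general result can be illustrated with a toy calculation, offered as a sketch rather than the proof itself. It assumes, purely for illustration, that Robbie's belief about the payoff is uniform around a fixed average of +10, and that Harriet permits the action only when her true payoff is positive. The helper `gain_from_deferring` is a hypothetical name; it compares deferring to Harriet against Robbie's best choice on its own (act or do nothing).

```python
def gain_from_deferring(mean, half_width):
    """Extra expected payoff from letting Harriet decide.

    Assumes the payoff U is uniform on [mean - half_width, mean + half_width]
    and computes E[max(U, 0)] - max(E[U], 0).
    """
    low, high = mean - half_width, mean + half_width
    best_alone = max(mean, 0.0)   # act if the average is positive, else do nothing
    if low >= 0:
        defer = mean              # Harriet would always say yes
    elif high <= 0:
        defer = 0.0               # Harriet would always say no
    else:
        defer = (high / (high - low)) * (high / 2)
    return defer - best_alone

# Widen Robbie's uncertainty while keeping the average payoff at +10.
for half_width in (0, 5, 10, 25, 50):
    print(half_width, round(gain_from_deferring(10.0, half_width), 2))
```

In this sketch the gain is zero as long as the whole range stays on one side of zero, since Robbie then already knows what Harriet would choose. It turns positive only once the range straddles zero, and it grows as the uncertainty widens.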

These and other partial-knowledge scenarios were developed as abstract games, but Scott Niekum’s lab at the University of Texas at Austin is running preference-learning algorithms on actual robots. When Gemini, the lab’s two-armed robot, watches a human place a fork to the left of a plate in a table-setting demonstration, initially it can’t tell whether forks always go to the left of plates, or always on that particular spot on the table; new algorithms allow Gemini to learn the pattern after a few demonstrations. Niekum focuses on getting AI systems to quantify their own uncertainty about a human’s preferences, enabling the robot to gauge when it knows enough to safely act. “We are reasoning very directly about distributions of goals in the person’s head that could be true,” he says. “And we’re reasoning about risk with respect to that distribution.”
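The fork example can be framed as a choice between two hypotheses, each updated as demonstrations arrive. The sketch below is a toy Bayesian update, not the lab's algorithm: the one-dimensional table coordinates, the Gaussian noise model, the 50/50 prior, and the function name `posterior_relative` are all assumptions made for illustration.

```python
import math

def posterior_relative(demos, noise=1.0, prior=0.5):
    """Posterior probability that forks go 'to the left of the plate'
    rather than 'on one fixed spot on the table'.

    demos: list of (plate_x, fork_x) positions along the table edge.
    The first demo fixes both the plate-to-fork offset and the fixed spot,
    so it cannot tell the two hypotheses apart; later demos can.
    """
    plate0, fork0 = demos[0]
    offset = fork0 - plate0
    log_rel = math.log(prior)
    log_abs = math.log(1 - prior)
    for plate, fork in demos[1:]:
        # Gaussian log-likelihood of the observed fork position (up to a constant).
        log_rel -= (fork - (plate + offset)) ** 2 / (2 * noise ** 2)
        log_abs -= (fork - fork0) ** 2 / (2 * noise ** 2)
    top = max(log_rel, log_abs)   # subtract the max for numerical stability
    w_rel, w_abs = math.exp(log_rel - top), math.exp(log_abs - top)
    return w_rel / (w_rel + w_abs)

one_demo = [(50.0, 40.0)]
two_demos = [(50.0, 40.0), (70.0, 60.0)]   # plate moved, fork moved with it
print(posterior_relative(one_demo))
print(posterior_relative(two_demos))
```

After a single demonstration the two hypotheses remain equally likely. Once the plate moves and the fork moves with it, nearly all of the probability shifts to the "left of the plate" rule. A robot could use such a posterior as one signal for when it knows enough to act.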