Over the past few months, some major media outlets have been spreading concern about the idea that AI might spontaneously acquire sentience and turn against us. Many people have pointed out the flaws with this notion, including Andrew Ng, an AI scientist of some renown:

I don’t see any realistic path from the stuff we work on today—which is amazing and creating tons of value—but I don’t see any path for the software we write to turn evil.

He goes on to say, on the topic of sentient machines:

Computers are becoming more intelligent and that’s useful as in self-driving cars or speech recognition systems or search engines. That’s intelligence. But sentience and consciousness is not something that most of the people I talk to think we’re on the path to.

I say, these objections are correct. I endorse Ng’s points wholeheartedly — I see few pathways via which software we write could spontaneously “turn evil.”

I do think that there is important work we need to do in advance if we want to be able to use powerful AI systems for the benefit of all, but this is not because a powerful AI system might acquire some “spark of consciousness” and turn against us. I also don’t worry about creating some Vulcan-esque machine that deduces (using cold mechanic reasoning) that it’s “logical” to end humanity, that we are in some fashion “unworthy.” The reason to do research in advance is not so fantastic as that. Rather, we simply don’t yet know how to program intelligent machines to reliably do good things without unintended consequences.

The problem isn’t Terminator. It’s “King Midas.” King Midas got exactly what he wished for — every object he touched turned to gold. His food turned to gold, his children turned to gold, and he died hungry and alone.

Powerful intelligent software systems are just that: software systems. There is no spark of consciousness which descends upon sufficiently powerful planning algorithms and imbues them with feelings of love or hatred. You get only what you program.

To build a powerful AI software system, you need to write a program that represents the world somehow, and that continually refines this world-model in response to percepts and experience. You also need to program powerful planning algorithms that use this world-model to predict the future and find paths that lead towards futures of some specific type.

The focus of our research at MIRI isn’t centered on sentient machines that think or feel as we do. It’s aimed towards improving our ability to program software systems to execute plans leading towards very specific types of futures.

A machine programmed to build a highly accurate world-model and employ powerful planning algorithms could yield extraordinary benefits. Scientific and technological innovation have had great impacts on quality of life around the world, and if we can program machines to be intelligent in the way that humans are intelligent — only faster and better — we can automate scientific and technological innovation. When it comes to the task of improving human and animal welfare, that would be a game-changer.

To build a machine that attains those benefits, the first challenge is to do this world-modeling and planning in a highly reliable fashion: you need to ensure that it will consistently pursue its goal, whatever that is. If you can succeed at this, the second challenge is making that goal a safe and useful one.

If you build a powerful planning system that aims at futures in which cancer is cured, then it may well represent all of the following facts in its world-model: (a) The fastest path to a cancer cure involves proliferating robotic laboratories at the expense of the biosphere and kidnapping humans for experimentation; (b) once you realize this, you’ll attempt to shut it down; and (c) if you shut it down, it will take a lot longer for cancer to be cured. The system may then execute a plan which involves deceiving you until it is able to resist and then proliferating robotic laboratories and kidnapping humans. This is, in fact, what you asked for.

We can avoid this sort of outcome, if we manage to build machines that do what we mean rather than what we said. That sort of behavior doesn’t come for free: you have to program it in.

A superhuman planning algorithm with an extremely good model of the world could find solutions you never imagined. It can make use of patterns you haven’t noticed and find shortcuts you didn’t recognize. If you follow a plan generated by a superintelligent search process, it could have disastrous unintended consequences. To quote professor Stuart Russell (author of the leading AI textbook):

The primary concern is not spooky emergent consciousness but simply the ability to make high-quality decisions. Here, quality refers to the expected outcome utility of actions taken, where the utility function is, presumably, specified by the human designer. Now we have a problem: 1. The utility function may not be perfectly aligned with the values of the human race, which are (at best) very difficult to pin down. 2. Any sufficiently capable intelligent system will prefer to ensure its own continued existence and to acquire physical and computational resources – not for their own sake, but to succeed in its assigned task. A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.

Humans have a lot of fiddly little constraints akin to “oh, and don’t kidnap any humans while you’re curing cancer”. Programming in a full description of human values and human norms by hand, in a machine-readable format, doesn’t seem feasible. If we want the plans generated by superhuman planning algorithms to respect all of our complicated unspoken constraints and desires, then we’ll need to develop new tools for predicting and controlling the behavior of general-purpose autonomous agents. There’s no two ways about it.

Many people, when they first encounter this problem, come up with a reflexive response about why the problem won’t be as hard as it seems. One common one is “If a powerful planner starts running amok, we can just unplug it” — an objection which is growing obsolete in the era of cloud computing, and which fails completely if the system has access to the internet or any other network where it can copy itself onto other machines.

Another common one is “Why not have the system output a plan rather than having it execute the plan?” — but if we direct a powerful planning procedure to generate plans such that (a) humans who examine the plan approve of it and (b) executing it leads to cancer being cured, then the plan may well be one that looks good but which exploits some predictable oversight in the verification procedure and kidnaps people anyway.

Or you could say, “How about we just make systems which only answer questions?” But how exactly do you direct a superhuman planning procedures towards “answering questions”? Will you program it to output text that it predicts will cause you to press the “highly satisfied” button after the answer has been output? Because in that case, the system may well output text that constitutes a particularly deceptive answer. Or, if you add a constraint that the answer must be accurate, it may output text that manipulates you into asking easier questions in the future.

Maybe you reply, “Well, perhaps instead I’ll direct the planner to move toward futures where its output is measured by this clever metric where…,” and now you’ve been drawn in. How exactly could we build powerful planers that search for beneficial futures? It looks like it’s possible to build systems that somehow learn the user’s intentions or values and act according to them, but actually doing so is not trivial. You’ve got to think hard to build systems that figure out all the intricacies of your intentions without deceiving or manipulating you while acquiring that information. That doesn’t happen for free: ambitious, long-term software projects are still ultimately software projects, and we have to figure out how to actually write the required code.

If we can figure out how to build smarter-than-human machines aligned with our interests, the benefits could be extraordinary. Like Phil Libin (founder of Evernote) says, AI could be “one of the greatest forces for good the universe has ever seen.” It’s possible to get there, but it’s going to require some work.