I’m happy to announce that MIRI is beginning work on a new research agenda, “value alignment for advanced machine learning systems.” Half of MIRI’s team — Patrick LaVictoire, Andrew Critch, and I — will be spending the bulk of our time on this project over at least the next year. The rest of our time will be spent on our pre-existing research agenda.

MIRI’s research in general can be viewed as a response to Stuart Russell’s question for artificial intelligence researchers: “What if we succeed?” There appear to be a number of theoretical prerequisites for designing advanced AI systems that are robust and reliable, and our research aims to develop them early.

Our general research agenda is agnostic about when AI systems are likely to match and exceed humans in general reasoning ability, and about whether or not such systems will resemble present-day machine learning (ML) systems. Recent years’ impressive progress in deep learning suggests that relatively simple neural-network-inspired approaches can be very powerful and general. For that reason, we are making an initial inquiry into a more specific subquestion: “What if techniques similar in character to present-day work in ML succeed in creating AGI?”

Much of this work will be aimed at improving our high-level theoretical understanding of task-directed AI. Unlike what Nick Bostrom calls “sovereign AI,” which attempts to optimize the world in long-term and large-scale ways, task AI is limited to performing instructed tasks of limited scope, satisficing rather than maximizing. Our hope is that investigating task AI from an ML perspective will shed light on both the feasibility of task AI and the tractability of early safety work on advanced supervised, unsupervised, and reinforcement learning systems.

To this end, we will begin by investigating eight relevant technical problems:

1. Inductive ambiguity detection.

How can we design a general methodology for ML systems (such as classifiers) to identify when the classification of a test instance is underdetermined by training data?

For example: If an ambiguity-detecting classifier is designed to distinguish images of tanks from images of non-tanks, and the training set only contains images of tanks on cloudy days and non-tanks on sunny days, this classifier ought to detect that the classification of an image of a tank on a sunny day is ambiguous, and pose some query for its operators to disambiguate it and avoid errors.

While past and current work in active learning and statistical learning theory more broadly has made progress towards this goal, more work is necessary to establish realistic statistical bounds on the error rates and query rates of real-world systems in advance of their deployment in complex environments.
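
As a rough illustration of the tank example, the sketch below keeps every hypothesis that fits the training data and flags a test instance as ambiguous when those surviving hypotheses disagree. The two-feature encoding, the tiny hand-written hypothesis class, and the function names are all assumptions made for illustration; this is not a method proposed in the agenda, and it says nothing about the statistical bounds discussed above.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical encoding of an image: (has_tank, is_cloudy), each 0 or 1.
Example = Tuple[int, int]
Hypothesis = Callable[[Example], bool]

# A deliberately tiny hypothesis class (an assumption for illustration).
HYPOTHESES: Dict[str, Hypothesis] = {
    "tank_present": lambda x: bool(x[0]),
    "cloudy_sky": lambda x: bool(x[1]),
    "always_tank": lambda x: True,
    "never_tank": lambda x: False,
}


def consistent_hypotheses(
    train: List[Tuple[Example, bool]]
) -> Dict[str, Hypothesis]:
    """Keep only the hypotheses that label every training example correctly."""
    return {
        name: h
        for name, h in HYPOTHESES.items()
        if all(h(x) == y for x, y in train)
    }


def classify_or_query(x: Example, train: List[Tuple[Example, bool]]) -> str:
    """Classify x, or report ambiguity if consistent hypotheses disagree."""
    predictions = {h(x) for h in consistent_hypotheses(train).values()}
    if len(predictions) == 1:
        return "tank" if predictions.pop() else "not tank"
    return "ambiguous: query the operators"


# Training set from the example: tanks only on cloudy days,
# non-tanks only on sunny days.
TRAIN: List[Tuple[Example, bool]] = [((1, 1), True), ((0, 0), False)]

if __name__ == "__main__":
    print(classify_or_query((1, 1), TRAIN))  # tank on a cloudy day
    print(classify_or_query((1, 0), TRAIN))  # tank on a sunny day
```

Here both "tank_present" and "cloudy_sky" fit the training data, so a tank on a sunny day splits them and triggers a query. A realistic system cannot enumerate its hypotheses this way, which is part of what makes the general problem hard.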

2. Informed oversight.

How might we train a reinforcement learner to output both an action and a “report” comprising information to help a human evaluate its action?

For example: If a human is attempting to train a reinforcement learner to output original stories, then in evaluating the story, the human will want to know some information about the story (such as whether it has been plagiarized from another story) that may be difficult to determine by looking at the story itself.

3. Safe training procedures for human-imitators.

How might we design an ML system that imitates humans performing some task that involves rich outputs (such as answering questions in natural language), to the best of the ML system’s abilities?

While there are existing approaches to imitation learning and generative models, these have some theoretical shortcomings that prevent them from fully solving the general problem. In particular, a generative adversarial model trained on human actions only has an incentive to imitate aspects of the human that the adversary can detect; thus, issues similar to the plagiarism problem from (2) can arise.

4. Conservative concepts.

How might we design a system that, given some positive examples of a concept, can synthesize new instances of the concept without synthesizing edge cases of it?

For example: If we gave the system detailed information about 100 human-created burritos as training data, it should manufacture additional burritos while avoiding edge cases such as extremely small burritos (even though these could still be considered burritos).

By default, most objective functions will lead to such edge cases (say, because small burritos are cheaper to manufacture). Can we develop a general technique for avoiding this problem?
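
To make the failure mode concrete, the sketch below compares a plain cost minimizer with one restricted to candidates whose features all fall inside the range spanned by the positive examples. The feature names, the cost function, and the bounding-box test are hypothetical stand-ins chosen for illustration; a bounding box is a very crude notion of "not an edge case" and is not offered as the general technique the question asks for.

```python
from typing import Dict, List, Optional, Tuple

Burrito = Dict[str, float]
Ranges = Dict[str, Tuple[float, float]]


def feature_ranges(positives: List[Burrito]) -> Ranges:
    """Per-feature (min, max) observed across the positive examples."""
    keys = positives[0].keys()
    return {
        k: (min(p[k] for p in positives), max(p[k] for p in positives))
        for k in keys
    }


def is_conservative(candidate: Burrito, ranges: Ranges) -> bool:
    """True if every feature lies within the range seen in training."""
    return all(lo <= candidate[k] <= hi for k, (lo, hi) in ranges.items())


def cost(b: Burrito) -> float:
    """Hypothetical manufacturing cost: smaller burritos are cheaper."""
    return b["length_cm"] * b["mass_g"]


def cheapest(
    candidates: List[Burrito], ranges: Optional[Ranges] = None
) -> Optional[Burrito]:
    """Cheapest candidate, optionally restricted to conservative ones."""
    pool = (
        candidates
        if ranges is None
        else [c for c in candidates if is_conservative(c, ranges)]
    )
    return min(pool, key=cost, default=None)


# Made-up data for illustration only.
EXAMPLES: List[Burrito] = [
    {"length_cm": 18.0, "mass_g": 300.0},
    {"length_cm": 22.0, "mass_g": 400.0},
    {"length_cm": 20.0, "mass_g": 350.0},
]
CANDIDATES: List[Burrito] = [
    {"length_cm": 1.0, "mass_g": 5.0},     # tiny edge-case burrito
    {"length_cm": 19.0, "mass_g": 320.0},
    {"length_cm": 21.0, "mass_g": 390.0},
]

if __name__ == "__main__":
    print(cheapest(CANDIDATES))                            # picks the tiny one
    print(cheapest(CANDIDATES, feature_ranges(EXAMPLES)))  # stays in range
```

The unconstrained minimizer drifts to the tiny burrito because it is cheapest, while the restricted one stays near the examples. The restriction only works here because the relevant features were listed by hand, which is exactly what cannot be assumed in general.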

5. Specifying environmental goals using sensory data.

The goals of most ML systems are defined in terms of sensory data, such as discounted reward over time. While these sensory goals can be useful proxies for environmental goals, they are not identical to them: tricking your own sensors into perceiving a strawberry in the room is not the same as actually having one there. How might we create systems that directly pursue useful goals in their environment, such as “make there be a strawberry in this room”, in a way that is robust to opportunities for the system to interfere with the proxies we used to specify the goal?

6. Low-impact agents.

We would prefer a highly intelligent task-oriented system to generally avoid creating large unintended-by-us side effects in pursuit of its goal, and also to notify us of any large impacts that might result from achieving its goal. For example, if we ask it to build a house for a homeless family, it should know implicitly that it should avoid destroying nearby houses for materials — a large side effect — while still having the desirable large follow-on effect of improving the family’s socioeconomic situation. For any specific task we can specify ad-hoc cost functions for side effects like nearby house destruction, but since we cannot always anticipate such costs in advance, we want a quantitative understanding of how to generally limit an AI’s side effects (without also limiting its ability to have large positive intended impacts).
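
The ad-hoc approach mentioned above can be sketched as a penalty on every state variable that differs from a do-nothing baseline, excluding the variables we intended to change. The state variables, reward values, and penalty weight below are invented for illustration. The sketch deliberately shows the limitation: it only penalizes side effects that someone thought to encode in the state.

```python
from typing import Dict, Set

State = Dict[str, int]


def side_effect_count(baseline: State, outcome: State, intended: Set[str]) -> int:
    """Number of unintended variables that differ from the baseline."""
    return sum(
        1
        for key in baseline
        if key not in intended and baseline[key] != outcome[key]
    )


def penalized_score(
    task_reward: float,
    baseline: State,
    outcome: State,
    intended: Set[str],
    penalty_weight: float,
) -> float:
    """Task reward minus a fixed penalty per unintended change."""
    return task_reward - penalty_weight * side_effect_count(
        baseline, outcome, intended
    )


# Hypothetical world state if the system does nothing.
BASELINE: State = {"family_housed": 0, "neighbor_house_a": 1, "neighbor_house_b": 1}
INTENDED: Set[str] = {"family_housed"}

# Two hypothetical plans and the states they lead to.
DEMOLISH_NEIGHBORS: State = {"family_housed": 1, "neighbor_house_a": 0, "neighbor_house_b": 0}
BUY_MATERIALS: State = {"family_housed": 1, "neighbor_house_a": 1, "neighbor_house_b": 1}

if __name__ == "__main__":
    # Demolition is assumed to earn slightly more raw reward (free materials).
    print(penalized_score(10.0, BASELINE, DEMOLISH_NEIGHBORS, INTENDED, 5.0))
    print(penalized_score(8.0, BASELINE, BUY_MATERIALS, INTENDED, 5.0))
```

With these made-up numbers, buying materials scores 8.0 and demolition scores 0.0, even though demolition has the higher raw reward. Intended changes are exempt from the penalty, which is one simple way to avoid discouraging the large positive impact we asked for.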

7. Mild optimization.

In the event that the low-impact criteria we employ in (6) are flawed or missing some key considerations, we would like some way to tell our system “don’t try too hard” or “don’t exercise too much optimization power”, since intuitively, large impacts are more likely to result from large efforts (although the correspondence is imperfect).

Less capable AI systems tend to be less likely to have large unintended side effects, and we would like to know how to make even highly capable systems safe in a similar way, by having them apply less “effort”. With a satisfactory encoding of our notion of “effort”, we could then program a system to impose absolute limits on its total effort, or to use only as much effort as is needed to achieve its objective with reasonably high probability, thus potentially limiting impacts that we did not manage to measure in (6).
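
Assuming we already had a numeric measure of effort (which is the open problem), the two policies described above could be sketched as follows: discard plans over an absolute effort cap, then pick the least effortful plan that meets a target success probability. The `Plan` fields, the example plans, and the thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Plan:
    name: str
    effort: float        # hypothetical scalar measure of "effort"
    success_prob: float  # estimated probability of achieving the objective


def maximizer_choice(plans: List[Plan]) -> Plan:
    """Baseline: pick the plan with the highest success probability."""
    return max(plans, key=lambda p: p.success_prob)


def mild_choice(
    plans: List[Plan], target_prob: float, effort_cap: float
) -> Optional[Plan]:
    """Least-effort plan that is under the cap and good enough, else None."""
    feasible = [
        p
        for p in plans
        if p.effort <= effort_cap and p.success_prob >= target_prob
    ]
    return min(feasible, key=lambda p: p.effort, default=None)


# Made-up plans for obtaining a strawberry.
PLANS: List[Plan] = [
    Plan("ask a neighbor", effort=1.0, success_prob=0.60),
    Plan("buy from the store", effort=3.0, success_prob=0.95),
    Plan("build a strawberry farm", effort=500.0, success_prob=0.999),
]

if __name__ == "__main__":
    print(maximizer_choice(PLANS).name)
    chosen = mild_choice(PLANS, target_prob=0.9, effort_cap=10.0)
    print(chosen.name if chosen else "no acceptable plan")
```

The maximizer selects the farm for a marginal gain in success probability, while the mild chooser settles for the store. Returning `None` when no plan qualifies is a design choice in this sketch: the system reports failure instead of quietly raising its own effort limit.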

8. Averting instrumental pressures.

How might we design a system that is not motivated to pursue certain convergent instrumental goals — such as gaining additional resources — even when pursuing these goals would help it achieve its main objective?

In particular, we may wish to build a system that has no incentive to cause or prevent its own shutdown/suspension. This relates to (6) and (7) in that instrumental pressures like “ensure my continued operation” can incentivize large impacts/efforts. However, this is a distinct agenda item because it may be possible to completely eliminate certain instrumental incentives in a way that would apply even before solutions to (6) and (7) would take effect.

Having identified these topics of interest, we expect our work on this agenda to be timely. The idea of “robust and beneficial” AI has recently received increased attention as a result of the new wave of breakthroughs in machine learning. The kind of theoretical work in this project has more obvious connections to the leading paradigms in AI and ML than, for example, our recent work in logical uncertainty or in game theory, and therefore lends itself better to collaborations with AI/ML researchers in the near future.

Thanks to Eliezer Yudkowsky and Paul Christiano for seeding many of the initial ideas for these research directions, to Patrick LaVictoire, Andrew Critch, and other MIRI researchers for helping develop these ideas, and to Chris Olah, Dario Amodei, and Jacob Steinhardt for valuable discussion.