Update Nov. 23: This post was edited to reflect Scott’s terminology change from “naturalized world-models” to “embedded world-models.” For a full introduction to these four research problems, see Scott Garrabrant and Abram Demski’s “Embedded Agency.”

Scott Garrabrant is taking over Nate Soares’ job of making predictions about how much progress we’ll make in different research areas this year. Scott divides MIRI’s alignment research into five categories:

embedded world-models — Problems related to modeling large, complex physical environments that lack a sharp agent/environment boundary. Central examples of problems in this category include logical uncertainty, naturalized induction, multi-level world models, and ontological crises.

Introductory resources: “Formalizing Two Problems of Realistic World-Models,” “Questions of Reasoning Under Logical Uncertainty,” “Logical Induction,” “Reflective Oracles”

Examples of recent work: “Hyperreal Brouwer,” “An Untrollable Mathematician,” “Further Progress on a Bayesian Version of Logical Uncertainty”

decision theory — Problems related to modeling the consequences of different (actual and counterfactual) decision outputs, so that the decision-maker can choose the output with the best consequences. Central problems include counterfactuals, updatelessness, coordination, extortion, and reflective stability.

Introductory resources: “Cheating Death in Damascus,” “Decisions Are For Making Bad Outcomes Inconsistent,” “Functional Decision Theory”

Examples of recent work: “Cooperative Oracles,” “Smoking Lesion Steelman” (1, 2), “The Happy Dance Problem,” “Reflective Oracles as a Solution to the Converse Lawvere Problem”

robust delegation — Problems related to building highly capable agents that can be trusted to carry out some task on one’s behalf. Central problems include corrigibility, value learning, informed oversight, and Vingean reflection.

Introductory resources: “The Value Learning Problem,” “Corrigibility,” “Problem of Fully Updated Deference,” “Vingean Reflection,” “Using Machine Learning to Address AI Risk”

Examples of recent work: “Categorizing Variants of Goodhart’s Law,” “Stable Pointers to Value”

subsystem alignment — Problems related to ensuring that an AI system’s subsystems are not working at cross purposes, and in particular that the system avoids creating internal subprocesses that optimize for unintended goals. A central problem in this category is benign induction.

Introductory resources: “What Does the Universal Prior Actually Look Like?”, “Optimization Daemons,” “Modeling Distant Superintelligences”

Examples of recent work: “Some Problems with Making Induction Benign”

other — Alignment research that doesn’t fall into the above categories. If we make progress on the open problems described in “Alignment for Advanced ML Systems,” and the progress is less connected to our agent foundations work and more ML-oriented, then we’ll likely classify it here.

The problems we previously categorized as “logical uncertainty” and “naturalized induction” are now called “embedded world-models”; most of the problems we’re working on in three other categories (“Vingean reflection,” “error tolerance,” and “value learning”) are grouped together under “robust delegation”; and we’ve introduced two new categories, “subsystem alignment” and “other.”

Scott’s predictions for February through December 2018 follow. 1 means “limited” progress, 2 “weak-to-modest” progress, 3 “modest,” 4 “modest-to-strong,” and 5 “sizable.” To help contextualize Scott’s numbers, we’ve also translated Nate’s 2015-2017 predictions (and Nate and Scott’s evaluations of our progress for those years) into the new nomenclature.

embedded world-models:

2015 progress: 5. — Predicted: 3.

2016 progress: 5. — Predicted: 5.

2017 progress: 2. — Predicted: 2.

2018 progress prediction: 3 (modest).

decision theory:

2015 progress: 3. — Predicted: 3.

2016 progress: 3. — Predicted: 3.

2017 progress: 3. — Predicted: 3.

2018 progress prediction: 3 (modest).

robust delegation:

2015 progress: 3. — Predicted: 3.

2016 progress: 4. — Predicted: 3.

2017 progress: 4. — Predicted: 1.

2018 progress prediction: 2 (weak-to-modest).

subsystem alignment (new category):

2018 progress prediction: 2 (weak-to-modest).

other (new category):

2018 progress prediction: 2 (weak-to-modest).

These predictions are highly uncertain, but should give a rough sense of how we’re planning to allocate researcher attention over the coming year, and how optimistic we are about the current avenues we’re pursuing.

Note that the new bins we’re using may give a wrong impression of our prediction accuracy. E.g., we didn’t expect much progress on Vingean reflection in 2016, whereas we did expect significant progress on value learning and error tolerance. The opposite occurred, which should count as multiple prediction failures. Because the failures were in opposite directions, however, and because we’re now grouping most of Vingean reflection, value learning, and error tolerance under a single category (“robust delegation”), our 2016 predictions look more accurate in the above breakdown than they actually were.

Using our previous categories, our expectations and evaluations for 2015-2018 would be:

Logical uncertainty + naturalized induction — Progress 2015-2017: 5, 5, 2. — Expectations 2015-2018: 3, 5, 2, 3.

Decision theory — Progress 2015-2017: 3, 3, 3. — Expectations 2015-2018: 3, 3, 3, 3.

Vingean reflection — Progress 2015-2017: 3, 4, 4. — Expectations 2015-2018: 3, 1, 1, 2.

Error tolerance — Progress 2015-2017: 1, 1, 2. — Expectations 2015-2018: 3, 3, 1, 2.

Value specification — Progress 2015-2017: 1, 2, 1. — Expectations 2015-2018: 1, 3, 1, 1.

In general, these predictions are based on evaluating the importance of the most important results from a given year — one large result will yield a higher number than many small results. The ratings and predictions take into account research that we haven’t written up yet, though they exclude research that we don’t expect to make public in the near future.