Why AI Safety?

MIRI is a nonprofit research group based in Berkeley, California. We do technical research aimed at ensuring that smarter-than-human AI systems have a positive impact on the world. This page outlines in broad strokes why we view this as a critically important goal to work toward today.

The arguments and concepts behind AGI safety research

Humanity’s social and technological dominance stems primarily from our proficiency at reasoning, planning, and doing science (Armstrong). We will call this capacity general intelligence (Muehlhauser) — “general” because humans didn’t need to evolve separate modules for doing theoretical physics, software engineering, and heart surgery over millions of years. Instead, a relatively small set of adaptations separating humans from chimpanzees must simultaneously enable all of these capabilities.

It is this general problem-solving ability that we have in mind when we talk about “artificial general intelligence” (AGI) or “smarter-than-human AI.” AI systems may come to surpass humans in science and engineering abilities without being particularly human-like in any other respects — artificial intelligence need not imply artificial consciousness, for example, or artificial emotions. Instead, we have in mind the capacity to model real-world environments well and identify a variety of ways to put those environments into new states.

The case for focusing on AI risk mitigation doesn’t assume much about how future AI systems will be implemented or used. Here are the claims that we think of as key:



Whatever problems/tasks/objectives we assign to advanced AI systems probably won’t exactly match our real-world objectives. Unless we put in an (enormous, multi-generational) effort to teach AI systems every detail of our collective values (to the extent there is overlap), realistic systems will need to rely on imperfect approximations and proxies for what we want (Soares, Yudkowsky).

If the system’s assigned problems/tasks/objectives don’t fully capture our real objectives, it will likely end up with incentives that catastrophically conflict with what we actually want (Bostrom, Russell, Benson-Tilsen & Soares).

AI systems can become much more intelligent than humans (Bostrom), to a degree that would likely give AI systems a decisive advantage in arbitrary conflicts (Soares, Branwen).

It’s hard to predict when smarter-than-human AI will be developed: it could be 15 years away, or 150 years (Open Philanthropy Project). Additionally, progress is likely to accelerate as AI approaches human capability levels, giving us little time to shift research directions once the finish line is in sight (Bensinger).
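The first two claims can be illustrated with a deliberately simple toy model. In the sketch below, the functions `true_value` and `proxy_value`, the search limits, and all numbers are hypothetical and chosen only for illustration; none of them come from the cited work. The proxy omits a side-effect term, so it agrees with the true objective for weak optimizers and diverges sharply as the optimizer is allowed to search more aggressively.

```python
def true_value(effort: int) -> int:
    """What we actually want (hypothetical): benefit minus side effects.

    The benefit grows linearly; the side-effect cost grows quadratically.
    """
    return 10 * effort - effort ** 2


def proxy_value(effort: int) -> int:
    """What we managed to specify (hypothetical): the benefit term only."""
    return 10 * effort


def optimize(objective, search_limit: int) -> int:
    """Return the integer action in [0, search_limit] that maximizes objective."""
    return max(range(search_limit + 1), key=objective)


def run_sweep(limits):
    """For each search limit, optimize the proxy and score the result truly."""
    rows = []
    for limit in limits:
        chosen = optimize(proxy_value, limit)
        rows.append((limit, chosen, true_value(chosen)))
    return rows


if __name__ == "__main__":
    for limit, chosen, value in run_sweep([2, 5, 10, 20]):
        print(f"search limit {limit:>2}: chooses {chosen:>2}, true value {value}")
```

With small search limits, optimizing the proxy also improves the true objective. Past the point where the omitted side-effect term dominates, a more capable optimizer does strictly worse by our real standards, even though its proxy score keeps rising.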



Stuart Russell’s Cambridge talk is an excellent introduction to long-term AI risk. Other leading AI researchers who have expressed these kinds of concerns about general AI include Francesca Rossi (IBM), Shane Legg (Google DeepMind), Eric Horvitz (Microsoft), Bart Selman (Cornell), Ilya Sutskever (OpenAI), Andrew Davison (Imperial College London), David McAllester (TTIC), and Jürgen Schmidhuber (IDSIA).

Our take-away from this is that we should prioritize early research into aligning future AI systems with our interests, if we can find relevant research problems to study. AI alignment could easily turn out to be many times harder than AI itself, in which case research efforts are currently being wildly misallocated.

Alignment research can involve developing formal and theoretical tools for building and understanding AI systems that are stable and robust (“high reliability”), finding ways to get better approximations of our values in AI systems (“value specification”), and reducing the risks from systems that aren’t perfectly reliable or value-specified (“error tolerance”).

MIRI’s approach to these problems

How does MIRI try to make progress on this issue? Loosely speaking, we can imagine the space of all smarter-than-human AI systems as extremely wide and heterogeneous, with “alignable AI designs” forming a small and narrow target within it (and “aligned AI designs” smaller and narrower still). We generally think that the most important thing a marginal alignment researcher can do today is help ensure that the first generally intelligent systems humans design are in the “alignable” region.

We expect that this is unlikely to happen unless researchers have a fairly principled understanding of how the systems they’re developing reason, and how that reasoning connects to the intended objectives. Most of our work is therefore aimed at seeding the field with ideas that may inspire more AI research in the vicinity of (what we expect to be) alignable AI designs. When the first general reasoning machines are developed, we want the developers to be sampling from a space of designs and techniques that are more understandable and reliable than what’s possible in AI today.

We focus on research that we think could help inspire new AI techniques that are more theoretically principled than current techniques. In practice, this usually involves focusing on the biggest gaps in our current theories, in the hope of developing better and more general theories to undergird subsequent engineering work (Soares).

Our approach is also distinctive in that we focus more on AI systems’ reasoning and planning than on systems’ goals, their input and output channels, or features of their environments. This is partly because of the considerations above, and partly because we expect reasoning and planning to be a key part of what makes highly capable systems highly capable. To make use of these capabilities (and do so safely), it’s likely that we’ll need a good model of how the system does its cognitive labor, and how this labor ties in to the intended objective.

Finally, we also usually avoid problems we think academic and industry researchers are well-positioned to address, focusing instead on what we expect to be the most neglected lines of research going forward (Bensinger).

Goals for the field

Researchers at MIRI are generally highly uncertain about how the field of AI will develop over the coming years, and there are many different scenarios that strike us as plausible. Conditional on a good outcome, though, we put a fair amount of probability on scenarios that more or less follow this sketch:

In the short term, a research community coalesces, develops a good in-principle understanding of what the relevant problems are, and produces formal tools for tackling these problems. AI researchers move toward a minimal consensus about best practices, more open discussions of AI’s long-term social impact, a risk-conscious security mindset (Muehlhauser), and work on error tolerance and value specification.

In the medium term, researchers build on these foundations and develop a more mature understanding. As we move toward a clearer sense of what smarter-than-human AI systems are likely to look like — something closer to a credible roadmap — we imagine the research community moving toward increased coordination and cooperation in order to discourage race dynamics (Soares).

In the long term, we would like to see AI-empowered projects used to avert major AI mishaps while humanity works towards the requisite scientific and institutional maturity for making lasting decisions about the far future (Dewey). For this purpose, we’d want to solve a weak version of the alignment problem for limited AI systems — systems just capable enough to serve as useful levers for preventing AI accidents and misuse.

In the very long term, our hope is that researchers will eventually solve the “full” alignment problem for highly capable, highly autonomous AI systems. Ideally, we want to reach a position where engineers and operators can afford to take their time to dot every i and cross every t before we risk “locking in” any choices that have a large and irreversible effect on the future.

The above is a vague sketch, and we prioritize research we think would be useful in less optimistic scenarios as well. Additionally, “short term” and “long term” here are relative, and different timeline forecasts can have very different policy implications. Still, the sketch may help clarify the directions we’d like to see the research community move in.