I spent most of July working on a project to evaluate risk from AI that is smarter than humans. I posted updates as I went [1] but intended to write up a summary with conclusions when I finished. Unfortunately things petered out at the end: I started applying to jobs, wanted to finish a bunch of house projects, and kept waiting for one final set of conversation notes. [2] The more I put off the final write-up the more I kept putting it off, and now it's September. So, in the interest of just writing the thing up, here's where I am now:

Summary: I'm not convinced that AI risk should be highly prioritized, but I'm also not convinced that it shouldn't. Highly qualified researchers in a position to have a good sense of the field have massively different views on core questions like how capable ML systems are now, how capable they will be soon, and how we can influence their development. I do think it's possible to get a better handle on these questions, but I think this would require much deeper ML knowledge than I have.

Background

First, what is the problem? This isn't entirely agreed on, with different concerned people seeing different ranges of ways a smarter-than-human AI could lead to disaster. But here's an example Paul Christiano gave in a comment:

Many tasks that humans care about ... are extremely hard to convert into precise objectives: they are inherently poorly-defined or involve very long timescales, and simple proxies can be 'gamed' by a sophisticated agent. As a result, many tasks that humans care about may not get done well; we may find ourselves in an increasingly sophisticated and complex world driven by completely alien values.

Or, here's one from Stuart Russell, pulled from his Of Myths And Moonshine:

A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.

The main idea is that it's not easy to specify what we actually care about. Now, it's also not easy to get AI systems to do anything at all, so this isn't currently a big problem. As AI systems get more capable, however, it could become one. One way people often show this is by having two people play a game: one tries to specify what they think the other should optimize for, and the other shows how optimizing single-mindedly on that description leads to a world we wouldn't actually want.
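To make Russell's point concrete, here is a toy sketch of my own; it is not from his piece or from anyone I talked to. The budget, the two variables, and both scoring functions are made up for illustration. An optimizer splits a fixed budget between making widgets and keeping up a park, but the objective we hand it only mentions widgets:

```python
from itertools import product

BUDGET = 10  # hypothetical total units of resource to allocate


def stated_objective(widgets, park_upkeep):
    """What we told the optimizer to maximize: widgets only."""
    return widgets


def what_we_care_about(widgets, park_upkeep):
    """What we actually wanted (an assumed scoring): widgets, plus a
    well-kept park, with park upkeep mattering up to 4 units."""
    return widgets + 3 * min(park_upkeep, 4)


def feasible_allocations(budget):
    """Yield every integer split of the budget between the two variables."""
    for widgets, park_upkeep in product(range(budget + 1), repeat=2):
        if widgets + park_upkeep <= budget:
            yield widgets, park_upkeep


def optimize(objective, budget=BUDGET):
    """Brute-force search for the allocation that scores highest."""
    return max(feasible_allocations(budget), key=lambda a: objective(*a))


chosen = optimize(stated_objective)       # what the optimizer picks
preferred = optimize(what_we_care_about)  # what we would have picked

print("optimizer picks (widgets, park_upkeep):", chosen)
print("we would have preferred:", preferred)
```

Because the stated objective never mentions the park, and the park competes with widgets for the same budget, the optimizer drives park upkeep to zero. Nothing here is "smart"; the point is only that the variable left out of the objective ends up at an extreme.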

People concerned about superintelligence risk differ a lot in exactly how they see this going, and in how likely they think it is to end in a human extinction-level catastrophe. I heard people with views as far apart as 'overwhelmingly likely to go badly' and 'overwhelmingly likely to go well'. Many of the things people think we should be doing now, however, don't depend strongly on how likely things are to go badly, just that it's likely enough that it's worth trying to prevent.

Highly Reliable Agents

The approach to superintelligence risk mitigation that people are most familiar with, both in the ML and EA communities, is MIRI's Agent Foundations. The best summary of this I've seen is Daniel Dewey's, from his My current thoughts on MIRI's "highly reliable agent design" (HRAD) work:

1. Advanced AI systems are going to have a huge impact on the world, and for many plausible systems, we won't be able to intervene after they become sufficiently capable.

2. If we fundamentally "don't know what we're doing" because we don't have a satisfying description of how an AI system should reason and make decisions, then we will probably make lots of mistakes in the design of an advanced AI system.

3. Even minor mistakes in an advanced AI system's design are likely to cause catastrophic misalignment.

4. Because of 1, 2, and 3, if we don't have a satisfying description of how an AI system should reason and make decisions, we're likely to make enough mistakes to cause a catastrophe. The right way to get to advanced AI that does the right thing instead of causing catastrophes is to deeply understand what we're doing, starting with a satisfying description of how an AI system should reason and make decisions.

This case does not revolve around any specific claims about specific potential failure modes, or their relationship to specific HRAD subproblems. This case revolves around the value of fundamental understanding for avoiding "unknown unknown" problems.

Before I started this project, this was the only technical approach I'd seen articulated, though I was also not following things closely. In talking to ML researchers, inasmuch as they were familiar with the idea of trying to reduce the risk of global catastrophe from smarter-than-human AI, they associated it with this line of thinking, and with Bostrom, Yudkowsky, and MIRI. ML researchers were generally pretty negative on this approach: they didn't think it was possible for this kind of theoretical work to make progress without being grounded in real systems. For example, Michael Littman, a CS professor at Brown working in ML and AI, told me he worries that the AI risk community is not solving real problems: they're making deductions and inferences that are self-consistent but not being tested or verified in the world. Since we can't tell if that's progress, he argues it probably isn't. [3]

The 9/2016 writeup from the Open Philanthropy Project on their grant to MIRI and their Program Officer Daniel Dewey's recent post both seem pretty reasonable to me, and don't leave me optimistic about this path.

(I'm generally very skeptical about theoretical work being able to advance engineering fields in ways other than consolidating knowledge, mostly because of my experience with academic CS papers.)

Differential Progress

There is also significant work in AI safety that takes a very different approach. The idea is, you look at the field of AI/ML and you try to figure out improvements to the state of the art that are likely to push in the direction of safety. For example, extending RL systems to be able to learn a reward function from humans (pdf) instead of requiring a hand-coded reward function pushes us further in the direction of having ML systems generally do what we want them to do. Another example would be better visualization showing us how a network represents concepts (DeepDream), or better descriptions of why it made the decisions it did (pdf). Several approaches along these lines are described in the Concrete Problems paper (pdf).

I think it's helpful to look at the value of this work under two perspectives, based on how far we are from getting to human-level AI.

If you think we're not that far from human-level AI, in terms of how many breakthroughs or how much hardware improvement we still need, then it's very important to make sure we understand current systems well and can get them to do what we want. In that case this work is directly relevant.

If you think that we're farther, then the case is more complex and somewhat weaker. I see several arguments:

First, working on making systems safer today helps lay the groundwork for making systems safer in the future. Things tend to build on each other, and early work can open up paths for future research. On the other hand, as we discover things in other aspects of AI and hardware keeps getting better, this sort of research gets easier, plus there's less of a risk of working on something that ends up being inapplicable. I'm not sure which of these factors is stronger, but from talking to researchers I'm leaning towards the latter: that the work gets easier and better targeted if done later.

Second, there's some value in building a field around AI safety. The idea is something like, if we get to the point where AI safety is actually extremely important to work on, it would be good to already have researchers spun up and ready to work on it, a culture of how to do this kind of work well, and a sense that this was a reasonable academic path. The cost of this would be large, except that making current systems safer is generally also good ML work.

Third, it would be really bad if the AI/ML community started thinking "Safety? That's not the kind of thing we do." Researchers I talked to often had a negative enough view of MIRI's work and Musk's recent comments that I'm worried this could grow into a strong division. [4] A robust field of people doing good ML work who also think we should be trying to mitigate loss-of-control risks seems like good protection against this sort of divide.

Then there's also the argument that forecasting is hard, and we may not get much warning between systems being clearly not very capable and being extremely capable, and so it's always worth it to have some people working on how to make current systems safer.

Conclusions

The main difficulty with this project has been just how far apart the views of very smart researchers can be, with people confidently asserting things other people think are absurd. I think it would be valuable to try to get people with different views together where they could make faster progress at understanding each other. This disagreement makes it hard to come to any sort of satisfying conclusion, but I do still need to try, at least for myself.

Overall, I think we're still pretty far from AGI in terms of both time and technical work required. I do see AI/ML having an enormous effect on society, and our current level of technology is more than sufficient to automate an enormous range of tasks. [5] All in all, I do think it's useful to have some people working on AI safety, but it's hard for me to get a sense of how many people would be good. Most of the current value from my perspective is field building; I think the work itself is both moderately more efficient to do later and will be more compelling to existing researchers later as well. This pushes me towards thinking slow consistent field growth is valuable, where the people best suited to go into it do so.

Thanks to Allison Cheney, Bronwyn Woods, Bryce Wiedenbeck, Carl Shulman, Daniel Dewey, Dario Amodei, David Chudzicki, Michael Littman, Nisan Stiennon, Jacob Steinhardt, Janos Kramar, Owen Cotton-Barratt, Paul Christiano, Tsvi Benson-Tilsen, Victoria Krakovna, and several others who spoke with me anonymously, for their help with this project. Of course I suspect none of them agree with everything I've written here, and their names shouldn't be interpreted as endorsements!



[1] Project update posts, many with good comments:

[2] I ended up not being able to publish those. With each person I interviewed I offered a choice between no writeup, published anonymously, or published under their name. Because I really didn't want to be publishing things people didn't want published, I let people make this decision at any time, even after preparing a draft writeup. In this case I wrote something, we had a lot of back and forth to get it into a place where it accurately reflected their views, and then they eventually decided they didn't want it published in any form.

[3] For more from that conversation, see Conversation with Michael Littman.

[4] A previous example of this sort of division would be the one between cryonics people and cryobiology researchers: Mike Darwin.

[5] I think at our current level of technology, if computers stopped getting faster and we stopped being able to get any new theoretical insights, we'd still be positioned for massive technological unemployment.