Web developer Maciej Cegłowski recently gave a talk on AI safety (video, text) arguing that we should be skeptical of the standard assumptions that go into working on this problem, and doubly skeptical of the extreme-sounding claims, attitudes, and policies these premises appear to lead to. I’ll give my reply to each of these points below.

First, a brief outline: this reply mirrors the structure of Cegłowski’s talk. I first put forth my understanding of the talk’s broader implications, then deal in detail with the inside-view arguments as to whether or not the core idea is right, and end by talking some about the structure of these discussions.



(i) Broader implications

Cegłowski’s primary concern seems to be that there are lots of ways to misuse AI in the near term, and that worrying about long-term AI hazards may distract from working against short-term misuse. His secondary concern seems to be that worrying about AI risk looks problematic from the outside view. Humans have a long tradition of millenarianism, or the belief that the world will radically transform in the near future. Historically, most millenarians have turned out to be wrong and behaved in self-destructive ways. If you think that UFOs will land shortly to take you to the heavens, you might make some short-sighted financial decisions, and when the UFOs don’t arrive, you are full of regrets.

I think the fear that focusing on long-term AI dangers will distract from short-term AI dangers is misplaced. Attention to one kind of danger will probably help draw more attention to other, related kinds of danger. Also, risks associated with extraordinarily capable AI systems appear to be more difficult and complex than risks associated with modern AI systems in the short term, suggesting that the long-term obstacles will require more lead time to address. If it is as easy to avert these dangers as some optimists think, then we lose very little by starting early; if it is difficult (but doable), then we lose much by starting late.

With regards to outside-view concerns, I question how much we can learn about external reality from focusing only on human psychology. Many people have thought they could fly, for one reason or another. But some people actually can fly, and the person who bets against the Wright brothers based on psychological and historical patterns of error (instead of generalizing from, in this case, regularities in physics and engineering) will lose their money. The best way to get those bets right is to wade into the messy inside-view arguments.

As a Bayesian, I agree that we should update on surface-level evidence that an idea is weird or crankish. But I also think that argument screens off evidence from authority; if someone who looks vaguely like a crank can’t provide good arguments for why they expect UFOs to land in Greenland in the next hundred years, and someone else who looks vaguely like a crank can provide good arguments for why they expect AGI to be created in the next hundred years, then once I’ve heard their arguments I don’t need to put much weight on whether or not they initially looked like a crank. Surface appearances are genuinely useful, but only to a point. And even if we insist on reasoning based on surface appearances, I think those look pretty good.

Cegłowski put forth 11 inside-view and 11 outside-view critiques that I’ll paraphrase and then address:



(ii) Inside-view arguments

1. Argument from wooly definitions

Many arguments for working on AI safety trade on definition tricks, where the sentences “A implies B” and “B implies C” both seem obvious, and this is used to argue for a less obvious claim “A implies C”; but in fact “B” is being used in two different senses in the first two sentences.

That’s true for a lot of low-grade futurism out there, but I’m not aware of any examples of Bostrom making this mistake. The best arguments for working on long-term AI safety depend on some vague terms, because we don’t have a good formal understanding of a lot of the concepts involved; but that’s different from saying that the arguments rest on ambiguous or equivocal terms. In my experience, the substance of the debate doesn’t actually change much if we paraphrase away specific phrasings like “general intelligence.”

The basic idea is that human brains are good at solving various cognitive problems, and the capacities that make us good at solving problems often overlap across different categories of problem. People who have more working memory find that this helps with almost all cognitively demanding tasks, and people who think more quickly again find that this helps with almost all cognitively demanding tasks.

Coming up with better solutions to cognitive problems also seems critically important in interpersonal conflicts, both violent and nonviolent. By this I don’t mean that book learning will automatically lead to victory in combat, but rather that designing and aiming a rifle are both cognitive tasks. When it comes to security, we already see people developing AI systems in order to programmatically find holes in programs so that they can be fixed. The implications for black hats are obvious.

The core difference between people and computers here seems to be that the returns to putting cognitive work into getting more capacity to do cognitive work are much higher for computers than for people. People can learn things, but have limited ability to improve their ability to learn things, or to improve their ability to improve their ability to learn things, etc. For computers, it seems like both software and hardware improvements are easier to make given better software and hardware options.

The loop of using computer chips to make better computer chips is already much more impressive than the loop of using people to make better people. We are only starting on the loop of using machine learning algorithms to make better machine learning algorithms, but we can reasonably expect that to be another impressive loop.

The important takeaway here is the specific moving pieces of this argument, and not the terms I’ve used. Some problem-solving abilities seem to be much more general than others: whatever cognitive features make us better than mice at building submarines, particle accelerators, and pharmaceuticals must have evolved to solve a very different set of problems in our ancestral environment, and certainly don’t depend on distinct modules in the brain for marine engineering, particle physics, and biochemistry. These relatively general abilities look useful for things like strategic planning and technological innovation, which in turn look useful for winning conflicts. And machine brains are likely to have some dramatic advantages over biological brains, in part because they’re easier to redesign (and the task of redesigning AI may itself be delegable to AI systems) and much easier to scale.



2. Argument from Stephen Hawking’s cat

Stephen Hawking is much smarter than a cat, but he isn’t overpoweringly good at predicting a cat’s behavior, and his physical limitations strongly diminish his ability to control cats. Superhuman AI systems (especially if they’re disembodied) may therefore be similarly ineffective at modeling or controlling humans.

How relevant are bodies? One might think that a robot is able to fight its captors and run away on foot, while a software intelligence contained in a server farm will be unable to escape.

This seems incorrect to me, and for non-shallow reasons. In the modern economy, an internet connection is enough. One doesn’t need a body to place stock trades (as evidenced by the army of algorithmic traders that already exist), to sign up for an email account, to email subordinates, to hire freelancers (or even permanent employees), to convert speech to text or text to speech, to call someone on the phone, to acquire computational hardware on the cloud, or to copy over one’s source code. If an AI system needed to get its cat into a cat carrier, it could hire someone on TaskRabbit to do it like anyone else.



3. Argument from Einstein’s cat

Einstein could probably corral a cat, but he would do so mostly by using his physical strength, and his intellectual advantages over the average human wouldn’t help. This suggests that superhuman AI wouldn’t be too powerful in practice.

Force isn’t needed here if you have time to set up an operant conditioning schedule.

More relevant, though, is that humans aren’t cats. We’re far more social and collaborative, and we routinely base our behavior on abstract ideas and chains of reasoning. This makes it easier to persuade (or hire, blackmail, etc.) a person than to persuade a cat, using only a speech or text channel and no physical threat. None of this relies in any obvious way on agility or brawn.



4. Argument from emus

When the Australian military attempted to massacre emus in the 1930s, the emus outmaneuvered them. Again, this suggests that superhuman AI systems are less likely to be able to win conflicts with humans.

Science fiction often depicts wars between humans and machines where both sides have a chance at winning, because that makes for better drama. I think xkcd does a better job of depicting how this would look:

[Image: xkcd comic]





Repeated encounters favor the more intelligent and adaptive party; we went from fighting rats with clubs and cats to fighting them with traps, poison, and birth control, and if we weren’t worried about possible downstream effects, we could probably engineer a bioweapon that kills them all.



5. Argument from Slavic pessimism

“We can’t build anything right. We can’t even build a secure webcam. So how are we supposed to solve ethics and code a moral fixed point for a recursively self-improving intelligence without fucking it up, in a situation where the proponents argue we only get one chance?”

This is a good reason not to try to do that. A reasonable AI safety roadmap should be designed to route around any need to “solve ethics” or get everything right on the first try. This is the idea behind finding ways to make advanced AI systems pursue limited tasks rather than open-ended goals, making such systems corrigible, defining impact measures and building systems to have a low impact, etc. “Alignment for Advanced ML Systems” and error-tolerant agent design are chiefly about finding ways to reap the benefits of smarter-than-human AI without demanding perfection.



6. Argument from complex motivations

Complex minds are likely to have complex motivations; that may be part of what it even means to be intelligent.

When discussing AI alignment, this typically shows up in two places. First, human values and motivations are complex, and so simple proposals of what an AI should care about will probably not work. Second, AI systems will probably have convergent instrumental goals, where regardless of what project they want to complete, they will observe that there are common strategies that help them complete that project.

Some convergent instrumental strategies can be found in Omohundro’s paper on basic AI drives. High intelligence probably does require a complex understanding of how the world works and what kinds of strategies are likely to help with achieving goals. But it doesn’t seem like complexity needs to spill over into the content of goals themselves; there’s no incoherence in the idea of a complex system that has simple overarching goals. If it helps, imagine a corporation trying to maximize its net present value, a simple overarching goal that nevertheless results in lots of complex organization and planning.

One core skill in thinking about AI alignment is being able to visualize the consequences of running various algorithms or executing various strategies, without falling into anthropomorphism. One could design an AI system such that its overarching goals change with time and circumstance, and it looks like humans often work this way. But having complex or unstable goals doesn’t imply that you’ll have humane goals, and simple, stable goals are also perfectly possible.

For example: Suppose an agent is considering two plans, one of which involves writing poetry and the other of which involves building a paperclip factory, and it evaluates them based on expected number of paperclips produced (instead of whatever complicated things motivate humans). Then we should expect it to prefer the second plan, even if a human can construct an elaborate verbal argument for why the first is “better.”
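To make that decision rule concrete, here is a minimal sketch in Python. The plan names, probabilities, paperclip counts, and the `eloquence_of_argument` field are all invented for illustration. The only thing taken from the example above is that plans are ranked by expected paperclips and by nothing else.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    # Hypothetical (probability, paperclips produced) pairs; illustrative numbers only.
    outcomes: tuple[tuple[float, float], ...]
    # How persuasive a human's verbal case for this plan is.
    # The agent's evaluation never reads this field.
    eloquence_of_argument: float


def expected_paperclips(plan: Plan) -> float:
    """Probability-weighted number of paperclips a plan is expected to produce."""
    return sum(p * clips for p, clips in plan.outcomes)


def choose(plans: list[Plan]) -> Plan:
    """Pick the plan with the highest expected paperclip count."""
    return max(plans, key=expected_paperclips)


poetry = Plan("write poetry", ((1.0, 0.0),), eloquence_of_argument=0.99)
factory = Plan(
    "build paperclip factory",
    ((0.9, 1_000_000.0), (0.1, 0.0)),
    eloquence_of_argument=0.01,
)

chosen = choose([poetry, factory])
print(chosen.name)
```

However high the `eloquence_of_argument` score for the poetry plan is set, the choice does not change, because that field is not an input to `expected_paperclips`.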



7. Argument from actual AI

Current AI systems are relatively simple mathematical objects trained on massive amounts of data, and most avenues for improvement look like just adding more data. This doesn’t seem like a recipe for recursive self-improvement.

That may be true, but “it’s important to start thinking about mishaps from smarter-than-human AI systems today” doesn’t imply “smarter-than-human AI systems are imminent.” We should think about the problem now because it’s important and because there’s relevant technical research we can do today to get a better handle on it, not because we’re confident about timelines.

(Also, that may not be true.)



8. Argument from Cegłowski’s roommate

“My roommate was the smartest person I ever met in my life. He was incredibly brilliant, and all he did was lie around and play World of Warcraft between bong rips.” Advanced AI systems may be similarly unambitious in their goals.

Humans aren’t maximizers. This suggests that we may be able to design advanced AI systems to pursue limited tasks and thereby avert the kinds of disasters Bostrom is talking about. However, immediate profit incentives may not lead us in that direction by default, if gaining an extra increment of safety means trading away some annual profits or falling behind the competition. If we want to steer the field in that direction, we need to actually start work on better formalizing “limited task.”

There are obvious profit incentives for developing systems that can solve a wider variety of practical problems more quickly, reliably, skillfully, and efficiently; there aren’t corresponding incentives for developing the perfect system for playing World of Warcraft and doing nothing else.

Or to put it another way: AI systems are unlikely to have limited ambitions by default, because maximization is easier to specify than laziness. Note how game theory, economics, and AI are all rooted in mathematical formalisms describing an agent which attempts to maximize some utility function. If we want AI systems that have “limited ambitions,” it is not enough to say “perhaps they’ll have limited ambitions;” we have to start exploring how to actually make them that way. For more on this topic, see the “low impact” problem in “Concrete Problems in AI Safety” and other related papers.



9. Argument from brain surgery

Humans can’t operate on the part of themselves that’s good at neurosurgery and then iterate this process.

Humans can’t do this, but this is one of the obvious ways humans and AI systems might differ! If a human discovers a better way to build neurons or mitochondria, they probably can’t use it for themselves. If an AI system discovers that, say, it can use bitshifts instead of multiplications to do neural network computations much more quickly, it can push a patch to itself, restart, and then start working better. Or it can copy its source code to very quickly build a “child” agent.

It seems like many AI improvements will be general in this way. If an AI system designs faster hardware, or simply acquires more hardware, then it will be able to tackle larger problems faster. If an AI system designs an improvement to its basic learning algorithm, then it will be able to learn new domains faster.
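As a toy illustration of the bitshift example, the sketch below shows that shifting an integer left by k bits gives the same result as multiplying it by 2**k. It assumes integer values and multipliers that are exact powers of two. The function name is hypothetical, and whether such a substitution actually speeds anything up depends on the hardware and is not something this Python snippet demonstrates.

```python
def scale_by_power_of_two(x: int, k: int) -> int:
    """Multiply integer x by 2**k using a left shift instead of a multiplication."""
    return x << k


values = [3, -7, 0, 12]

# Shifting left by 3 bits should match multiplying by 2**3 == 8.
shifted = [scale_by_power_of_two(v, 3) for v in values]
multiplied = [v * 8 for v in values]

print(shifted == multiplied)
```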



10. Argument from childhood

It takes a long time of interacting with the world and other people before human children start to be intelligent beings. It’s not clear how much faster an AI could develop.

A truism in project management is that nine women can’t have one baby in one month, but it’s dubious that this truism will apply to machine learning systems. AlphaGo seems like a key example here: it probably played about as many training games of Go as Lee Sedol did prior to their match, but was about two years old instead of 33 years old.

Sometimes, artificial systems have access to tools that people don’t. You probably can’t determine someone’s heart rate just by looking at their face and restricting your attention to particular color channels, but software with a webcam can. You probably can’t invert rank ten matrices in your head, but software with a bit of RAM can.

Here, we’re talking about something more like a person that is surprisingly old and experienced. Consider, for example, an old doctor; suppose they’ve seen twenty patients a day for 250 workdays over the course of twenty years. That works out to 100,000 patient visits, which seems to be roughly the number of people that interact with the UK’s NHS in 3.6 hours. If we train a machine learning doctor system on a year’s worth of NHS data, that would be the equivalent of fifty thousand years of medical experience, all gained over the course of a single year.
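The arithmetic behind that estimate can be checked in a few lines of Python. The patient counts and the 3.6-hour figure come from the paragraph above; the assumption that the NHS runs around the clock for a 365-day year is mine, added only to make the calculation explicit.

```python
# Figures taken from the text above.
patients_per_day = 20
workdays_per_year = 250
career_years = 20

# Patient visits seen over one doctor's career.
career_visits = patients_per_day * workdays_per_year * career_years

# The text equates one such career with about 3.6 hours of NHS-wide patient contact.
nhs_hours_per_career = 3.6

# Assumption: a 365-day calendar year of round-the-clock NHS activity.
hours_per_year = 365 * 24

# How many doctor-careers of experience fit into one year of NHS data,
# and how many years of individual practice that corresponds to.
careers_per_nhs_year = hours_per_year / nhs_hours_per_career
equivalent_years = careers_per_nhs_year * career_years

print(career_visits)
print(round(careers_per_nhs_year))
print(round(equivalent_years))
```

Under these assumptions, one year of NHS data corresponds to a bit over 2,400 such careers, or somewhat under 49,000 years of individual practice, which is where the round figure of fifty thousand years comes from.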



11. Argument from Gilligan’s Island

While we often think of intelligence as a property of individual minds, civilizational power comes from aggregating intelligence and experience. A single genius working alone can’t do much.

This seems reversed. One of the properties of digital systems is that they can integrate with each other more quickly and seamlessly than humans can. Instead of thinking about a server farm AI as one colossal Einstein, think of it as an Einstein per blade, and so a single rack can contain multiple villages of Einsteins all working together. There’s no need to go through a laborious vetting process during hiring or to wait out a talent drought; expanding to fill more hardware is just copying code.

If we then take into account the fact that whenever one Einstein has an insight or learns a new skill, that can be rapidly transmitted to all other nodes, the fact that these Einsteins can spin up fully-trained forks whenever they acquire new computing power, and the fact that the Einsteins can use all of humanity’s accumulated knowledge as a starting point, the server farm begins to sound rather formidable.



(iii) Outside-view arguments

Next, the outside-view arguments — with summaries that should be prefixed by “If you take superintelligence seriously, …”:



12. Argument from grandiosity

…truly massive amounts of value are at stake.

It’s surprising, by the Copernican principle, that our time looks as pivotal as it does. But while we should start off with a low prior on living at a pivotal time, we know that pivotal times have existed before, and we should eventually be able to believe that we are living in an important time if we see enough evidence pointing in that direction.



13. Argument from megalomania

…truly massive amounts of power are at stake.

In the long run, we should obviously be trying to use AI as a lever to improve the welfare of sentient beings, in whatever ways turn out to be technologically feasible. As suggested by the “we aren’t going to solve all of ethics in one go” point, it would be very bad if the developers of advanced AI systems were overconfident or overambitious in what tasks they gave the first smarter-than-human AI systems. Starting with modest, non-open-ended goals is a good idea — not because it’s important to signal humility, but because modest goals are potentially easier to get right (and less hazardous to get wrong).



14. Argument from transhuman voodoo

…lots of other bizarre beliefs follow immediately.

Beliefs often cluster because they’re driven by similar underlying principles, but they remain distinct beliefs. It’s certainly possible to believe that AI alignment is important and also that galactic expansion is mostly an unprofitable waste, or to believe that AI alignment is important and also that molecular nanotechnology is unfeasible.

That said, whenever we see a technology where cognitive work is the main blocker, it seems reasonable to expect that the trajectory that AI takes will have a major impact on that technology. If you were writing during the early days of the scientific method, or at the dawn of the Industrial Revolution, then an accurate model of the world would require you to make at least a few extreme-sounding predictions. We can debate whether AI will be that big of a deal, but if it is that big of a deal, it would be odd for there not to be any extreme futuristic implications.



15. Argument from Religion 2.0

…you’ll be joining something like a religion.

People are biased, and we should worry about ideas that might play to our biases; but we can’t use the existence of bias to ignore all object-level considerations and arrive at confident technological predictions. As the saying goes, just because you’re paranoid doesn’t mean that they’re not out to get you. Medical science and religion both promise to heal the sick, but medical science can actually do it. To distinguish medical science from religion, you have to look at the arguments and the results.



16. Argument from comic book ethics

…you’ll end up with a hero complex.

We want a larger share of the research community working on these problems, so that the odds of success go up — what matters is that AI systems be developed in a responsible and circumspect way, not who gets the credit for developing them. You might end up with a hero complex if you start working on this problem now, but with luck, in ten years it will just feel like normal research (albeit on some particularly important problems).



17. Argument from simulation fever

…you’ll believe that we are probably living in a simulation instead of base reality.

I personally find the simulation hypothesis deeply questionable, because our universe looks both temporally bounded and either continuous or near-continuous in spacetime. If our universe looked more like, say, Minecraft, then this would seem more likely. (It seems that a universe like ours can’t easily simulate itself, whereas a Minecraft-like universe can, with a slowdown. The “RAM constraints” that are handwaved away with the simulation hypothesis are probably the core objection.) In either case, I don’t think this is a good argument for or against AI safety engineering as a field.



18. Argument from data hunger

…you’ll want to capture everyone’s data.

This seems unrelated to AI alignment. Yes, people building AI systems want data to train their systems on, and figuring out how to get data ethically instead of just quickly should be a priority. But how would shifting one’s views on whether or not smarter-than-human AI systems will someday exist, and how much work will be necessary in order to align their preferences with ours, shift one’s view on ethical data acquisition practices?



19. Argument from string theory for programmers

…you’ll detach from reality into abstract thought.

The fact that it’s difficult to test predictions about advanced AI systems is a huge problem; MIRI, at least, bases its research around trying to reduce the risk that we’ll just end up building castles in the sky. This is part of the point of pursuing multiple angles of attack on the problem, encouraging more diversity in the field, focusing on problems that bear on a wide variety of possible systems, and prioritizing the formalization of informal and semiformal system requirements. Quoting Eliezer Yudkowsky:

Crystallize ideas and policies so others can critique them. This is the other point of asking, “How would I do this using unlimited computing power?” If you sort of wave your hands and say, “Well, maybe we can apply this machine learning algorithm and that machine learning algorithm, and the result will be blah-blah-blah,” no one can convince you that you’re wrong. When you work with unbounded computing power, you can make the ideas simple enough that people can put them on whiteboards and go, “Wrong,” and you have no choice but to agree. It’s unpleasant, but it’s one of the ways that the field makes progress.

See “MIRI’s Approach” for more on the unbounded analysis approach. The Amodei/Olah AI safety agenda uses other heuristics, focusing on open problems that are easier to address in present and near-future systems, but that still appear likely to have relevance to scaled-up systems.



20. Argument from incentivizing crazy

…you’ll encourage craziness in yourself and others.

Crazier ideas may make more headlines, but I don’t get the sense that they attract more research talent or funding. Nick Bostrom’s ideas are generally more reasoned-through than Ray Kurzweil’s, and the research community is correspondingly more interested in engaging with Bostrom’s arguments and pursuing relevant technical research. Whether or not you agree with Bostrom or think the field as a whole is doing useful work, this suggests that relatively important and thoughtful ideas are attracting more attention from research groups in this space.



21. Argument from AI cosplay

…you’ll be more likely to try to manipulate people and seize power.

I think we agree about the hazards of treating people as pawns, behaving unethically in pursuit of some greater good, etc. It’s not clear to me that people interested in AI alignment are atypical on this dimension relative to other programmers, engineers, mathematicians, etc. And as with other outside-view critiques, this shouldn’t represent much of an update about how important AI safety research is; you wouldn’t want to decide how many research dollars to commit to nuclear security and containment based primarily on how impressed you were with Leó Szilárd’s temperament.



22. Argument from the alchemists

…you’ll be acting too soon, before we understand how intelligence really works.

While it seems unavoidable that the future holds surprises about how, what, and why, there are some things that we can identify as irrelevant. For example, the mystery of consciousness seems orthogonal to the mystery of problem-solving. It’s possible that the use of a problem-solving procedure on itself is basically what consciousness is, but it’s also possible that we can make an AI system that is able to flexibly achieve its goals without understanding what makes us conscious, and without having made it conscious in the process.



(iv) Productive discussions

Now that I’ve covered those points, there’s some space to discuss how I think productive discussions work. To that end, I applaud Cegłowski for doing a good job of laying out Bostrom’s full argument, though I think he misstates some minor points. (For example, Bostrom does not claim that all general intelligences will want to self-improve in order to better achieve their goals; he merely claims that this is a useful subgoal for many goals, if feasible.)

There are some problems where we can rely heavily on experiments and observation in order to reach correct conclusions, and other problems where we need to rely much more heavily on argument and theory. For example, when building sand castles it’s low cost to test a hypothesis; but when designing airplanes, full empirical tests are more costly, in part because there’s a realistic chance that the test pilot will die in the case of sufficiently bad design. Existential risks are on an extreme end of that spectrum, so we have to rely particularly heavily on abstract argument (though of course we can still gain by testing testable predictions whenever possible).

The key property of useful verbal arguments, when we’re forced to rely on them, is that they’re more likely to work in worlds where the conclusion is true as opposed to worlds where the conclusion is false. One can level an ad hominem against a clown who says “2+2=4” just as easily as a clown who says “2+2=5,” whereas the argument “what you said implies 0=1” is useful only against the second clown. “0=1” is a useful counterargument to “2+2=5” because it points directly to a specific flaw (subtract 2 from both sides twice and you’ll get a contradiction), and because it is much less persuasive against truth than it is against falsehood.
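Written out, the subtraction step looks like this. The same two steps applied to the true claim produce nothing objectionable, which is exactly the asymmetry described above.

```latex
\begin{align*}
2 + 2 &= 5 & 2 + 2 &= 4 \\
2 &= 3 & 2 &= 2 && \text{(subtract 2 from both sides)} \\
0 &= 1 & 0 &= 0 && \text{(subtract 2 from both sides again)}
\end{align*}
```

The left column ends in a contradiction; the right column ends in a trivially true statement.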

This makes me suspicious of outside-view arguments, because they’re too easy to level against correct atypical views. Suppose that Norman Borlaug had predicted that he would save a billion lives, and this had been rejected on the outside view — after all, very few (if any) other people could credibly claim the same across all of history. What about that argument is distinguishing between Borlaug and any other person? When experiments are cheap, it’s acceptable to predictably miss every “first,” but when experiments aren’t cheap, this becomes a fatal flaw.

Insofar as our goal is to help each other have more accurate beliefs, I also think it’s important for us to work towards identifying mutual “cruxes.” For any given disagreement, are there any propositions about the world that you think are true, and that I think are false, where if you changed your mind on that proposition you would come around to my views, and vice versa?

By seeking out these cruxes, we can more carefully and thoroughly search for evidence and arguments that bear on the most consequential questions, rather than getting lost in side-issues. In my case, I’d be much more sympathetic to your arguments if I stopped believing any of the following propositions (some of which you may already agree with):

Agents’ values and capability levels are orthogonal, such that it’s possible to grow in power without growing in benevolence.

Ceteris paribus, more computational ability leads to more power.

More specifically, more computational ability can be useful for self-improvement, and this can result in a positive feedback loop with doubling times closer to weeks than to years.

There are strong economic incentives to create autonomous agents that (approximately) maximize their assigned objective functions.

Our capacities for empathy, moral reasoning, and restraint rely to some extent on specialized features of our brain that aren’t indispensable for general-purpose problem-solving, such that it would be a simpler engineering challenge to build a general problem solver without empathy than with empathy.

This obviously isn’t an exhaustive list, and we would need a longer back-and-forth in order to come up with a list that we both agree is crucial.