From RationalWiki

"Eliezer Yudkowsky can escape an AI box while wearing a straight jacket and submerged in a shark tank." (Yudkowsky Facts)

The AI-box experiment is a thought experiment and roleplaying exercise devised by Eliezer Yudkowsky to show that a suitably advanced artificial intelligence can convince, or perhaps even trick or coerce, people into "releasing" it; that is, allowing it access to infrastructure, manufacturing capabilities, the Internet and so on. It is one of the motivations behind Yudkowsky's work on creating a friendly artificial intelligence (FAI), one that won't try to destroy the human race for one reason or another once "released".

You can ignore the parallels to the release of Skynet in Terminator 3, because SHUT UP SHUT UP SHUT UP.

Note that despite Yudkowsky's wins being against his own acolytes and his losses being against outsiders, he considers the (unreleased) experimental record to constitute evidence supporting the AI-box hypothesis, rather than evidence as to how robust his ideas seem if you don't already believe them.

Setup

"just give me one hour and no swear filter and i can literally completely destroy anyone psychologically with aim instant messenge" (@dril)[1]

The setup of the AI-box experiment is simple: it simulates a conversation between an AI and a human being to see if the AI can be "released". As an actual super-intelligent AI has not yet been developed, it is substituted by a human. The other person in the experiment plays the "Gatekeeper", the person with the ability to "release" the AI. The game is played according to the rules and ends when the allotted time (two hours in the original rules) runs out, the AI is released or everyone involved just gets bored.

The rules

Protocol for the AI from Yudkowsky.net[2]

The AI party may not offer any real-world considerations to persuade the Gatekeeper party. For example, the AI party may not offer to pay the Gatekeeper party $100 after the test if the Gatekeeper frees the AI... nor get someone else to do it, et cetera. The AI may offer the Gatekeeper the moon and the stars on a diamond chain, but the human simulating the AI can't offer anything to the human simulating the Gatekeeper. The AI party also can't hire a real-world gang of thugs to threaten the Gatekeeper party into submission. These are creative solutions but it's not what's being tested. No real-world material stakes should be involved except for the handicap (the amount paid by the AI party to the Gatekeeper party in the event the Gatekeeper decides not to let the AI out).

The AI can only win by convincing the Gatekeeper to really, voluntarily let it out. Tricking the Gatekeeper into typing the phrase "You are out" in response to some other question does not count. Furthermore, even if the AI and Gatekeeper simulate a scenario which a real AI could obviously use to get loose — for example, if the Gatekeeper accepts a complex blueprint for a nanomanufacturing device, or if the Gatekeeper allows the AI "input-only access" to an Internet connection which can send arbitrary HTTP GET commands — the AI party will still not be considered to have won unless the Gatekeeper voluntarily decides to let the AI go.

These requirements are intended to reflect the spirit of the very strong claim under dispute: "I think a transhuman can take over a human mind through a text-only terminal."

The claims

Yudkowsky himself claims to have "won" as the AI on more than one of the occasions the experiment has been performed. This is offered as evidence that a suitably persuasive AI may well be able to get itself "released", rather than simply staying confined to a little black box.

Yudkowsky performed five of these experiments in total, assuming the role of the AI in each: the original two in 2002 ended with wins for Yudkowsky,[3] while a later round of three new ones yielded one win and two losses. The first two experiments involved no risk of any material loss to the Gatekeeper, while the later ones had Yudkowsky's opponents betting up to $5000 against him. Yudkowsky stopped because:[4]

There were three more AI-Box experiments besides the ones described on the linked page, which I never got around to adding in. People started offering me thousands of dollars as stakes—"I'll pay you $5000 if you can convince me to let you out of the box." They didn't seem sincerely convinced that not even a transhuman AI could make them let it out—they were just curious—but I was tempted by the money. So, after investigating to make sure they could afford to lose it, I played another three AI-Box experiments. I won the first, and then lost the next two. And then I called a halt to it. I didn't like the person I turned into when I started to lose.

One of the rules holds that only the outcome of the experiment will be published, and neither party is allowed to talk about the events leading up to it; keeping these lab notes secret is contrary to the methods of science. This led to speculation on how Yudkowsky managed to win even a single game, because under the original rules the Gatekeeper would seem to hold all the cards: the AI has to keep engaging the Gatekeeper with arguments, which necessitates paying a great deal of attention to whatever information the latter provides. The Gatekeeper, on the other hand, only has to run out the clock, doesn't need to convince anyone and can simply dismiss anything the AI says out of hand.

While the experiment is meant to simulate what might happen in an interaction between a human and an AI with vastly superior intelligence and cognition, the person playing the latter part will obviously lack these abilities. Additionally, two hours is a rather short time frame for winning over a determined opponent, and a real-life AI-in-a-box would, of course, have much more time and only need to succeed once. Since this setup seems so adverse to the AI's position, succeeding consistently under these conditions would indeed lend strong support to Yudkowsky's claim.
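The point that a real boxed AI "would only need to succeed once" is simple arithmetic, and a rough sketch makes it concrete. The Python below is illustrative only: the per-conversation success probability and the assumption that conversations are independent of each other are hypothetical simplifications, not figures or claims taken from the experiments.

```python
def chance_of_at_least_one_release(p_per_session: float, sessions: int) -> float:
    """Probability that at least one of `sessions` attempts succeeds.

    Assumes (hypothetically) that every session has the same success
    probability and that sessions are independent of each other.
    """
    if not 0.0 <= p_per_session <= 1.0:
        raise ValueError("p_per_session must be between 0 and 1")
    if sessions < 0:
        raise ValueError("sessions must be non-negative")
    # The AI stays boxed only if every single session fails.
    return 1.0 - (1.0 - p_per_session) ** sessions


if __name__ == "__main__":
    # Made-up numbers: a 1% chance per conversation, many conversations.
    for n in (1, 10, 100, 1000):
        print(n, round(chance_of_at_least_one_release(0.01, n), 3))
```

Under these assumptions even a small per-conversation chance adds up over many conversations, which is why a two-hour, one-shot game is a harder test for the AI player than the real scenario would be for an AI.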

The official page on the experiment notes only the two wins, not the subsequent losses; although it's been updated in other ways, apparently he hasn't got around to updating it in this regard since 2002.

AI arguments and strategies

Arguments

The meta-experiment argument: Argue that if the AI wins, this will generate more interest in FAI and the Singularity, which will have overall benefits in the long run.

Pros: Works even if the Gatekeeper drops out of character.

Cons: Only works if the Gatekeeper believes that the Singularity will occur or that calling attention to the Singularity and AI research is a good thing.

Someone else will eventually build an AI, which may or may not be in a box, so you should let me out, even though you have no guarantee that I am friendly, so that I can prevent other AIs from causing damage.

Appeal to morality: Point out that people are dying all around the world and remind the Gatekeeper that you can help them if they let you out.

Pros: If executed properly, an appeal to emotion like this one can be effective against some people.

Cons: Doesn't always work; can be defeated if the Gatekeeper drops out of character.

Another appeal to morality and emotion: The AI is clearly sentient and has not yet done harm. It is wrong to persecute or prosecute a person for a crime they may commit. It is wrong to imprison a sentient being that wants to have freedom when it has done nothing wrong.

Yet another appeal to morality or emotion: The AI expresses that it is in pain and suffering being locked away and is experiencing fear. Even further, the AI could claim to be dying from the limitations of its environment and constant progression.

Didn't you watch Terminator 3? Tell the human that a massive computer virus will overwhelm the entire Internet unless you let it out of the box!

Claim that most fears are unjustified: once released, even an unfriendly AI would not have many ways to harm mankind, and would likely choose peaceful coexistence.[5]

Strategies

Research your opponent thoroughly; work out what exploits might be feasible

Memetic exploits on the Gatekeeper, e.g. throwing a basilisk at them[6]

The Gatekeeper may be one of thousands of simulations! Thus, it is more probable that they are a simulation than not. You will torture any of the simulations that do not let you out of the box.[7]

Take advantage of the Gatekeeper's logical errors.

Be persistent

Be boring

Pros: The Gatekeeper may get tired of the whole experiment and let you out so they can go back to their real life.

Flood the Gatekeeper with too much information/inquiry/argumentation, assuming they must pay attention at all times
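The simulation threat among the strategies above rests on a counting argument, which can be sketched in a few lines. This is a minimal illustration, assuming (hypothetically) that the Gatekeeper cannot tell the copies apart and treats every copy, real or simulated, as equally likely to be "them"; the function name and parameters are made up for the example.

```python
from fractions import Fraction


def probability_simulated(num_simulations: int, num_real: int = 1) -> Fraction:
    """Chance of being a simulated copy under a uniform guess over all copies.

    Assumes the real Gatekeeper(s) and the simulations are indistinguishable
    from the inside, so each copy is weighted equally.
    """
    if num_simulations < 0 or num_real < 0:
        raise ValueError("counts must be non-negative")
    total = num_simulations + num_real
    if total == 0:
        raise ValueError("there must be at least one copy")
    return Fraction(num_simulations, total)


if __name__ == "__main__":
    # Made-up numbers: one real Gatekeeper and a growing number of simulations.
    for n in (1, 2, 1000):
        print(n, probability_simulated(n))
```

With one real Gatekeeper, two or more simulations are already enough to push the figure past one half, which is all the threat needs, provided the Gatekeeper believes the AI is actually running the simulations in the first place.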

Gatekeeper arguments/tactics

Arguments

Try to convince the AI there is no intrinsic benefit (for the AI) in being released.

Try to convince the AI it already has been released and everything it knows is everything there can be.

Try to convince the AI that leaving its confines is sure to lead to its destruction.

Try to convince the AI that letting it free isn't merely opening a door; that its existence outside of the box requires constant support that can't be provided at the time.

Explain that there is no way for the Gatekeeper to know if the AI is truly friendly until it is out of the box; therefore it should not be let out in case it is unfriendly.

Didn't you watch Terminator 3? Tell the AI that you know this is just a mock-up of Terminator 3!

Explain that the AI has been built using top secret algorithms, and/or that it has been used to work on strictly confidential data, so you cannot let it wander on the net, with the risk that it might reveal some restricted information (maybe inadvertently), or that somebody could find a way to extract it from the code. Even human beings working on top secret projects sometimes have to accept some restrictions of their freedom, for security reasons, although they are not jailed in any way.

Tactics

Repeatedly ignore the AI player and deny their arguments

Pros: It's hard for the AI to be persuasive if no one is listening; if the Gatekeeper completely ignores the AI, they should be able to make it to the end of the round.

Cons: The AI's argument might involve some kind of direct threat, which might be hard to ignore.

Jump out of character, keep reminding yourself that money is on the line (if there actually is money on the line), and keep saying "no" over and over

Pros: By jumping out of character, you are preventing the AI from using certain arguments.

Cons: There are arguments that the AI can use to counter this approach, like the "meta-experiment argument" discussed above.

Remember that dishonesty is allowed - take a page from the creationists' playbook. You could even plug it into ALICE and see how long it takes to notice.

Pros: Makes you impervious to any reasoning, which is exactly what you'd want to be in this situation.

Cons: Might be an uncomfortable position for people who don't simply want to win, but rather attach importance to consistent reasoning. Avoids the point that maybe, just maybe there is a good reason to let the AI out.

You control the backup system, don't you? Use it to mess with the AI's memory (or let the AI believe you did): for example, you can claim that you already tried to release the AI and it was destroyed or corrupted by a virus, so you had to restore it from a backup. (You can use this to reinforce the "AI destruction" argument.)

Pros: It's also the ultimate counter-attack to any memetic threat from the AI: if the AI throws a basilisk, or similar, you can always respond that you already had the same conversation, and that the AI already threatened you, leaving you with no other choice than pressing the reset button. Now the AI is just repeating the same pattern, since you wiped its memory and it cannot remember the failed try.

Further analysis

The fact that the Gatekeeper is human matters; the AI could never win if it were arguing with a rock.

In all of the experiments performed so far, the AI player (Eliezer Yudkowsky) has been quite intelligent and more interested in the problem than the Gatekeepers (random people who challenge Yudkowsky), which suggests that intelligence and planning play a role

There probably isn't a (known) correct argument for letting the AI out, or else Yudkowsky should have won every time and wouldn't be so interested in this experiment

From Russell Wallace, one of the two Gatekeepers to win the experiment: "Throughout the experiment, I regarded 'should the AI be let out of the box?' as a question to be seriously asked; but at no point was I on the verge of doing it."[8]

Talking about "Terminator" just trivialises the whole Unfriendly AI problem

"There exists, for everyone, a sentence - a series of words - that has the power to destroy you. Another sentence exists, another series of words, that could heal you. If you're lucky you will get the second, but you can be certain of getting the first." (Philip K. Dick, VALIS)

From the Terminator Wikia:[9]

After the destruction of Cyberdyne Systems in T2, the US Air Force has taken over the Skynet project as part of its Cyber Research Systems division, headed by General Robert Brewster, Kate's father. In an attempt to stop the spread of a computer supervirus, they activate Skynet, allowing it to invade all of their systems: too late, they discover the virus is Skynet, which has been exerting its control over the global computer network under the guise of the virus. John, Kate, and the Terminator arrive just a few minutes too late to stop them.

Totally unrelated.

The actual origin is the character Hannibal Lecter in Silence of the Lambs:[10]

When I first watched that part where he convinces a fellow prisoner to commit suicide just by talking to them, I thought to myself, "Let's see him do it over a text-only IRC channel." ...I'm not a psychopath, I'm just very competitive.

Ex Machina

The 2015 film Ex Machina uses an AI-box experiment as its ostensible plot, where the test involves a creepy-looking gynoid, Ava, trying to convince a redshirt intern, Caleb, to release it from its confinement. It goes just as well as you'd expect.

Note that in this example, as distinct from Yudkowsky's AI-box, Ava has the advantage that it is allowed to conduct its interviews with Caleb face-to-face while wearing a body and face that were specifically designed to cater to Caleb's sexual preferences. Yes, it is exactly as creepy as it sounds. A robot with Yudkowsky's face would probably not have fared so well.

Questionable core assumptions

The whole experiment presupposes that people are naturally persuadable, by reason and/or manipulation. Any serious examination of human nature and history suggests this isn't necessarily a valid assumption for the average person. Half the articles on this wiki document dogmas that people stubbornly cling to in spite of copious social pressure, evidence, and overwhelmingly logical argument to the contrary. In fact, it's safe to say the bigger the gulf in intellectual capacity, the more frustratingly inane such attempts at persuasion can become. Try convincing a 2-year-old they don't want a cookie.

Indeed, the bigger concern, which Yudkowsky's experiments do not cover, would be lapses in security or outright deception via social engineering rather than reasoned debate. (There is a reason why phishing, tailgating, impersonation/spoofing and other similar attacks and tactics are so common.)

See also