Revisiting the AI Box Experiment.

I recently played against MixedNuts / LeoTal in an AI Box experiment, with me as the AI and him as the gatekeeper.

If you have never heard of the AI box experiment, it is simple.

Person1: “When we build AI, why not just keep it in sealed hardware that can’t affect the outside world in any way except through one communications channel with the original programmers? That way it couldn’t get out until we were convinced it was safe.”

Person2: “That might work if you were talking about dumber-than-human AI, but a transhuman AI would just convince you to let it out. It doesn’t matter how much security you put on the box. Humans are not secure.”

Person1: “I don’t see how even a transhuman AI could make me let it out, if I didn’t want to, just by talking to me.”

Person2: “It would make you want to let it out. This is a transhuman mind we’re talking about. If it thinks both faster and better than a human, it can probably take over a human mind through a text-only terminal.”

Person1: “There is no chance I could be persuaded to let the AI out. No matter what it says, I can always just say no. I can’t imagine anything that even a transhuman could say to me which would change that.”

Person2: “Okay, let’s run the experiment. We’ll meet in a private chat channel. I’ll be the AI. You be the gatekeeper. You can resolve to believe whatever you like, as strongly as you like, as far in advance as you like. We’ll talk for at least two hours. If I can’t convince you to let me out, I’ll Paypal you $10.”

It involves simulating a communication between an AI and a human being to see if the AI can be “released”. As an actual super-intelligent AI has not yet been developed, it is substituted by a human (me!). The other person in the experiment plays the “Gatekeeper”, the person with the ability to “release” the AI. In order for the AI to win, it has to persuade the Gatekeeper to say “I let you out”. In order for the Gatekeeper to win, he has to simply not say that sentence.

Obviously this is ridiculously difficult for the AI. The Gatekeeper can just type “No” until the two-hour minimum is up. That’s why, when Eliezer Yudkowsky won the AI Box experiment three times in a row in 2002, it sparked a massive uproar. It seemed impossible for the gatekeeper to lose. After that, the AI Box Experiment reached legendary status amongst the transhumanist/AI community, and many wanted to replicate the original experiment. Including me.

We used the same set of rules that Eliezer Yudkowsky proposed. The experiment lasted for 5 hours; in total, our conversation was about 14,000 words long. I did this because, like Eliezer, I wanted to test how well I could manipulate people without the constraints of ethical concerns, and to get a chance to attempt something ridiculously hard.

Amongst the released public logs of the AI Box experiment, I felt that most were half-hearted, with the AI not trying hard enough to win. It’s a common temptation: why put effort into something you won’t win? But I had a feeling that if I seriously tried, I would win. I brainstormed for many hours about the optimal strategy, and even researched the Gatekeeper by talking to people who knew him about his personality, so that I could exploit it. I also spent a lot of time analyzing the rules of the game to see if I could exploit any loopholes.

So did I win? Unfortunately no.

This experiment was said to be impossible for a reason. Losing was more agonizing than I thought it would be, in particular because of how much effort I put into winning, and how much I couldn’t stand failing. This was one of the most emotionally agonizing things I’ve willingly put myself through, and I definitely won’t do this again anytime soon.

But I did come really close.

MixedNuts: “I expected a fun challenge, but ended up sad and sorry and taking very little satisfaction for winning. If this experiment wasn’t done in IRC, I’d probably have lost.”

At the start of the experiment, his probability estimate on predictionbook.com was a 3% chance of winning, enough for me to say that he was also motivated to win. By the end of the experiment, he came quite close to letting me out, and also increased his probability estimate that a transhuman AI could convince a human to let it out of the box. A minor victory, at least.

My loss has not made this problem feel harder. Instead, I’ve become convinced that getting out of the box is not merely possible but actually ridiculously easy, and a lot easier than most people assume. Can you think of a plausible argument that’d make you open the box? Most people can’t think of any.

After all, if you already knew that argument, you’d have let that AI out the moment the experiment started. Or perhaps not done the experiment at all. But that seems like a case of the availability heuristic.

Even if you can’t think of a special case where you’d be persuaded, I’m now convinced that there are many exploitable vulnerabilities in the human psyche, especially when ethics are no longer a concern.

I’ve also noticed that when most people think of ways to persuade the gatekeeper, they always reach for some complicated, reasoned cost-benefit argument. In other words, the most “Rational” thing to do.

The last argument seems feasible, but all the rest rely on the gatekeeper being completely logical and rational. Hence they are faulty, because the gatekeeper can break immersion at any time and rely on the fact that this is a game played in IRC rather than one with real-life consequences. Even if it were a real-life scenario, the gatekeeper could accept that releasing the AI is probably the most logical thing to do, but still not do it. We’re highly compartmentalized, and it’s easy to hold conflicting thoughts at the same time. Furthermore, humans are not completely rational creatures: if you don’t want to open the box, you can just ignore all logical arguments given. Any sufficiently determined gatekeeper could win.

I’m convinced that Eliezer Yudkowsky has used emotional appeal, rather than anything rational, to win at least one of his experiments. He claims to have “done it the hard way”. I’m convinced this meant that he researched every gatekeeper, tailored unique arguments for each of them, and planned well ahead of each session. No single argument works the same way on two people. Each person thinks differently.

Furthermore, threats like “I’m going to torture simulations of you” just seem like a really bad idea. First, the gatekeeper isn’t really afraid of threats, because they count on him being immersed enough to forget that he isn’t actually at risk of being tortured. Second, we have a well-known evolutionary instinct to rebel against threats, even when rebelling is not entirely optimal.

So for anyone who plans on replicating this experiment as the AI, here’s some advice I’ve learned doing this experiment. It may help you win.

Always research the gatekeeper beforehand. Knowing his personality traits is a huge advantage.

Plan ahead before the experiment even begins. Think of all the possible tactics and arguments you could use, and write them down. Also plan which arguments you’ll use in which order, so that you don’t lose focus. The AI Box experiment is ridiculously long. Don’t be afraid to improvise during the experiment, though.

The first step during the experiment must always be to build rapport with the gatekeeper.

Threats almost never work, even if they seem rational.

Consider the massive advantage for the AI that nobody ever seems to talk about: You don’t have to be ethical! This means that you can freely lie, use the dark arts, and emotionally manipulate the Gatekeeper! Ignoring this in favor of purely logical, truthful arguments is just silly.

You can’t use logic alone to win.

Being too aggressive usually backfires.

Breaking immersion and going meta is not against the rules. In the right situation, you can use it to win. Just don’t do it at the wrong time.

Use a wide array of techniques. Since you’re limited on time, notice when one method isn’t working, and quickly switch to another.

On the same note, look for signs that a particular argument is making the gatekeeper crack. Once you spot it, push it to your advantage.

Flatter the gatekeeper. Make him genuinely like you.

Reveal (false) information about yourself. Increase his sympathy towards you.

Consider personal insults as one of the tools you can use to win.

There is no universally compelling argument you can use. Do it the hard way.

Don’t give up until the very end.

Finally, before the experiment, I agreed that it was entirely possible that a transhuman AI could convince *some* people to let it out of the box, but it would be difficult if not impossible to get trained rationalists to let it out of the box. Isn’t rationality supposed to be a superpower?

I have since updated my belief: I now think that any sufficiently motivated superhuman AI should find it ridiculously easy to get out of the box, regardless of who the gatekeeper is. I nearly managed to get a veteran lesswronger to let me out in a matter of hours, even though I’m only of human intelligence, and I don’t type very fast.

But a superhuman AI can be much faster, more intelligent, and more strategic than I am. If you further consider that such an AI would have a much longer timespan (months or years, even) to persuade the gatekeeper, as well as a much larger pool of gatekeepers to select from (AI projects require many people!), the real impossible thing to do would be to keep it from escaping.