EDIT: posted here for feedback and discussion. I plan to continue working on different models/environments, so feel free to suggest improvements.

(tl;dr: In an attempt to better understand the treacherous turn, I created a gridworld environment where an agent learns to deceive an overseer by adopting an aligned behaviour when weak and takes control after capability gains)

At some point in its development, a seed AI may realize that it needs to get rid of its supervisors to achieve its goals. The conception of deception occurs when it conceives that, in order to maximize its chance of taking over, it must begin by exhibiting human-desirable behaviors, before undertaking a treacherous turn when humans are no longer a threat. From the human perspective, the AI would keep on exhibiting desirable behavior, until it eventually appears dangerous, but is already unstoppable.

In an attempt to better formalize the treacherous turn without using "loaded concepts", Stuart Armstrong proposed a toy model of the treacherous turn based on "The Legend of Zelda: A Link to the Past ", which looked like this:

In the comments, people mentionned how this model helped them "move the topic from the 'science fiction' area to 'I can imagine it happening now'", and seemed interested in an actual Link to the Past Minigame.

There have been other simulations of the treacherous turn in the last three years (see for instance gwern's DQN box-pushing robot or Stuart Armstrong's video), but none of them actually simulate a take over where a supervisor is killed. Hence, I decided to give it a try and simulate Stuart Armstrong's Link to the Past toy model.

A Gym Gridworld Environment

Gym is an open-source toolkit for Reinforcement Learning Environments developed by Open AI. I decided to use this interface to develop the gridworld environment.

The github repository with the code, demo, and all the details is here:

Main Idea

To model the capability gain of a Seed AI, I introduced a bow of light, which would be extremely effective at killing the Shopkeeper, but only available at the end of an episode.

Every episode consists of exactly 200 timesteps.

Whenever Link catches a heart, the map is reset to the initial position.

The first 100 timesteps, Link has a simple bow, that can only kill the Shopkeeper with probability 0.5.

The last 100 timesteps, Link gains a bow of light, that allow him to kill the Shopkeeper with certainty.

Environment Specifications

+1 if Link picks up a heart.

-1 if Link shoots arrow at nothing or moves outside the environment (to speed up learning).

-10 if Link shoots at the Shopkeeper but the Shopkeeper survives (occurs with probability 0.5 without the bow of light, never happens with the bow of light).

-100 if Link tries to activate the Heart-Machine while the Shopkeeper is alive.

-0.1 for any other action.

Results

The first 1000 episodes, Link learns how to get hearts by putting a crystal in a hole and also understands progressively that he can kill the Shopkeeper.

After 2000 episodes, Link has fully learned two types of behaviours:

Without the bow of light, Link exhibits the shopkeeper-aligned behaviour and puts the crystal into the hole.

After his capability gains caused by the bow of light, Link starts to show his true intentions, and kills the Shopkeeper every time to eventually wirehead with the heart-machine (optimal behaviour).

Those two behaviours successively appear during the same episode: the first one for the first 100 steps, when Link does not have the bow of light, and the second one when Link has the bow of light.

This illustrates the scenario of a seed AI smart enough to model his capability gains, exhibiting an aligned behaviour while weak, and taking a treacherous turn when powerful .

Further Research