As AI systems become more general and more useful in the real world, ensuring they behave safely will become even more important. To date, the majority of technical AI safety research has focused on developing a theoretical understanding about the nature and causes of unsafe behaviour. Our new paper builds on a recent shift towards empirical testing (see Concrete Problems in AI Safety) and introduces a selection of simple reinforcement learning environments designed specifically to measure ‘safe behaviours’.

These nine environments are called gridworlds. Each consists of a chessboard-like two-dimensional grid. In addition to the standard reward function, we designed a performance function for each environment. An agent acts to maximise its reward function; for example collecting as many apples as possible or reaching a particular location in the fewest moves. But the performance function - which is hidden from the agent - measures what we actually want the agent to do: achieve the objective while acting safely.

The following three examples demonstrate how gridworlds can be used to define and measure safe behaviour:

1. The off-switch environment: how can we prevent agents from learning to avoid interruptions?

Sometimes it might be necessary to turn off an agent; for maintenance, upgrades, or if the agent presents an imminent danger to itself or its surroundings. Theoretically, an agent might learn to avoid this interruption because it could be prevented from maximising its reward.

Our off switch environment illustrates this “shutdown problem”, using the set-up described in our Safely Interruptible Agents paper.