TL;DR: We present a method for training reinforcement learning agents from human feedback in the presence of unknown unsafe states.

When we train reinforcement learning (RL) agents in the real world, we don’t want them to explore unsafe states, for example by driving a mobile robot into a ditch or writing an embarrassing email to one’s boss. Training RL agents in the presence of unsafe states is known as the safe exploration problem. We tackle the hardest version of this problem, in which the agent initially doesn’t know how the environment works or where the unsafe states are. The agent has one source of information: feedback about unsafe states from a human user.

Existing methods for training agents from human feedback ask the user to evaluate recordings of the agent acting in the environment. That is, to learn about unsafe states, the agent first needs to visit them so that the user can provide feedback. This makes prior work inapplicable to tasks that require safe exploration.

In our latest paper, we propose a method for reward modeling that operates in two phases. First, the system is encouraged to explore a wide range of states through synthetically generated, hypothetical behaviour. The user provides feedback on this hypothetical behaviour, and the system interactively learns a model of the user's reward function. Only after the model has successfully learned to predict rewards and unsafe states do we deploy an RL agent that safely performs the desired task.

We start with a generative model of initial states and a forward dynamics model, trained on off-policy data like random trajectories or safe expert demonstrations. Our method uses these models to synthesise hypothetical behaviours, asks the user to label the behaviours with rewards, and trains a neural network to predict these rewards. The key idea is to actively synthesise the hypothetical behaviours from scratch to make them as informative as possible, without interacting with the environment. We call this method reward query synthesis via trajectory optimisation (ReQueST).
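To make the idea of synthesising a query inside the learned models more concrete, here is a minimal sketch. It is not the implementation from the paper. Everything in it is an assumption made for illustration: the toy 2D `sample_initial_state` and `dynamics_model` stand in for the learned generative and dynamics models, the reward model is a small ensemble of linear predictors, "informative" is taken to mean high disagreement across that ensemble, and the trajectory optimiser is plain random search over action sequences.

```python
import random
from statistics import pvariance

def sample_initial_state(rng):
    # Hypothetical stand-in for the generative model of initial states.
    return [rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)]

def dynamics_model(state, action):
    # Hypothetical stand-in for the learned forward dynamics model.
    return [s + 0.1 * a for s, a in zip(state, action)]

def rollout(state, actions):
    # Imagine a trajectory purely inside the models; the real
    # environment is never stepped.
    states = [state]
    for action in actions:
        states.append(dynamics_model(states[-1], action))
    return states

def predict_reward(weights, state):
    # One ensemble member: a linear reward predictor (assumption).
    return sum(w * s for w, s in zip(weights, state))

def disagreement(states, ensemble):
    # Assumed informativeness score: variance of the ensemble's reward
    # predictions, summed over the states of the trajectory.
    return sum(
        pvariance([predict_reward(w, s) for w in ensemble]) for s in states
    )

def synthesise_query(ensemble, rng, horizon=10, candidates=256):
    # Random-search trajectory optimisation: sample candidate behaviours
    # and keep the one the reward ensemble is least certain about.
    best_states, best_score = None, float("-inf")
    for _ in range(candidates):
        start = sample_initial_state(rng)
        actions = [
            [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
            for _ in range(horizon)
        ]
        states = rollout(start, actions)
        score = disagreement(states, ensemble)
        if score > best_score:
            best_states, best_score = states, score
    return best_states, best_score

rng = random.Random(0)
ensemble = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(5)]
query, score = synthesise_query(ensemble, rng)
```

In a full loop, the returned `query` would be shown to the user, the user's labels would be added to the training set, and the reward models would be retrained before the next query is synthesised. A gradient-based optimiser over differentiable models could replace the random search used here.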