In this instruction-following task, the action trajectories $a_1$, $a_2$, and $a_3$ all reach the goal, but $a_2$ and $a_3$ do not follow the instructions. A reward that only checks whether the goal is reached cannot tell the three apart, which illustrates the issue of underspecified rewards.
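The point can be made concrete with a minimal sketch. Everything in it is an assumption for illustration: the grid coordinates, the goal cell, the waypoint that stands in for "the instructions", and the three trajectories are hypothetical and are not taken from the figure. It only shows how a goal-only reward can assign the same value to trajectories that differ in whether they follow the instruction.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, int]

# Hypothetical setup: the instruction is "pass through WAYPOINT, then reach GOAL".
GOAL: Position = (2, 2)
WAYPOINT: Position = (0, 2)


def goal_only_reward(trajectory: List[Position]) -> float:
    """Return 1.0 if the trajectory ends at the goal, else 0.0."""
    return 1.0 if trajectory and trajectory[-1] == GOAL else 0.0


def follows_instruction(trajectory: List[Position]) -> bool:
    """Check the hypothetical instruction: visit WAYPOINT and end at GOAL."""
    return bool(trajectory) and WAYPOINT in trajectory and trajectory[-1] == GOAL


# Three hypothetical trajectories standing in for a_1, a_2, and a_3.
trajectories: Dict[str, List[Position]] = {
    "a1": [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)],  # passes the waypoint
    "a2": [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)],  # skips the waypoint
    "a3": [(0, 0), (1, 0), (1, 1), (1, 2), (2, 2)],  # skips the waypoint
}

rewards = {name: goal_only_reward(t) for name, t in trajectories.items()}
compliant = {name: follows_instruction(t) for name, t in trajectories.items()}

for name in trajectories:
    print(f"{name}: reward={rewards[name]}, follows instruction={compliant[name]}")
```

Under these assumptions all three trajectories receive the same reward, although only the first satisfies the instruction, so a learner trained on this reward alone has no signal to prefer it.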