
AI systems continue to get increasingly powerful, but still need far too much hand-holding by their human masters. New research from DeepMind and OpenAI suggests a mere nudge here and there at the outset can be enough to help artificial intelligence accomplish tricky tasks.

The team set up a series of experiments in which human participants were shown two short clips of an AI’s attempt at a task and asked to make a snap judgement about which clip appeared to show more promising progress. The AI itself was not told the desired outcome of the task.

One scenario involved the AI learning to play Space Invaders; another involved a virtual robot learning to do backflips.


Importantly, the humans were non-experts who were simply asked to judge the clips at face value. Most decisions took just a few seconds.

The human responses were used to train a part of the AI system called a reward predictor, which in turn trained the AI agent that was performing the task. Over time, the agent learned how to maximise the reward and improve its behaviour in line with the humans’ preferences.
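To make the idea of a reward predictor concrete, here is a minimal sketch of how pairwise human choices could be turned into a reward signal. It is an illustration under stated assumptions, not the researchers' implementation: it assumes a simple linear reward over hand-made frame features and a logistic model of human preference, where the real system would use a neural network and a full reinforcement learning agent. All function names, the two-feature clips and the simulated "human" are hypothetical.

```python
import math
import random


def clip_reward(weights, clip):
    """Predicted total reward of a clip: a linear reward summed over its frames."""
    return sum(sum(w * x for w, x in zip(weights, frame)) for frame in clip)


def preference_prob(weights, clip_a, clip_b):
    """Modelled probability that a human prefers clip_a over clip_b (assumed logistic)."""
    diff = clip_reward(weights, clip_a) - clip_reward(weights, clip_b)
    # Numerically stable sigmoid of the reward difference.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def train_reward_predictor(comparisons, n_features, lr=0.1, epochs=200):
    """Fit reward weights so predicted preferences match the human choices.

    comparisons: list of (clip_a, clip_b, label), label is 1 if the human
    preferred clip_a and 0 if they preferred clip_b.
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for clip_a, clip_b, label in comparisons:
            p = preference_prob(weights, clip_a, clip_b)
            for i in range(n_features):
                feat_diff = sum(f[i] for f in clip_a) - sum(f[i] for f in clip_b)
                # Gradient step that lowers the cross-entropy of the human's choice.
                weights[i] += lr * (label - p) * feat_diff
    return weights


def simulate_comparisons(n_pairs, n_frames=5, seed=0):
    """Hypothetical stand-in for human evaluators with a hidden preference."""
    rng = random.Random(seed)
    hidden = [1.0, -1.0]  # assumed: feature 0 looks good, feature 1 looks bad

    def make_clip():
        return [[rng.uniform(-1, 1) for _ in hidden] for _ in range(n_frames)]

    comparisons = []
    for _ in range(n_pairs):
        a, b = make_clip(), make_clip()
        label = 1 if clip_reward(hidden, a) > clip_reward(hidden, b) else 0
        comparisons.append((a, b, label))
    return comparisons
```

Calling `train_reward_predictor(simulate_comparisons(50), n_features=2)` returns weights that rank clips in roughly the same order as the simulated evaluator. In a full system, the agent would then be trained to maximise this learned reward rather than a hand-written one.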

In the acrobatic task, for instance, the AI learned to perform a perfect backflip in under an hour of the evaluators’ time. According to Dario Amodei at OpenAI, getting a human to evaluate progress every single step of the way instead, without the predictor, would have taken more than 100 times longer.

Like a toddler

Until now, reinforcement learning systems have needed a hard-coded reward function to tell them what problem they had to solve, but this new technique removes that necessity. The approach also allows a human to correct any undesirable behaviour without having to check in continuously – in fact, the evaluators only had to review 0.1 per cent of the agent’s behaviour to get it to do what they wanted.

The research shows how AI is “growing up”, says Pedro Domingos at the University of Washington. He says that DeepMind’s previous systems were like infants, trying random things until they got a reward. “This system is more like a toddler, still trying random things, but occasionally getting some feedback from its parents and learning from it.”

Miles Brundage at the University of Oxford’s Future of Humanity Institute says the new work shows how human input on what action is appropriate can be gleaned relatively quickly and easily.

“It’s an exciting paper on a number of levels,” says Brundage, noting that two of the authors – Amodei and Paul Christiano at OpenAI – also attached their names to a paper on AI safety last year. “One of the problems they highlighted was scalable oversight: as AI systems get more intelligent, how do you make sure you can oversee them?”

However, the system didn’t always produce the ideal output. In an Atari tennis game, for instance, it learned to hit the ball back with a paddle, but not that scoring a point was advantageous. The agent simply learned to take part in an endless rally.

Plus, the system raises the question of whether asking for intuitive human judgements of behaviour might pick up on unwanted biases for certain, more complex tasks. “You might imagine that being more of a problem in the future,” says Brundage.

Journal reference: arXiv, 1706.03741

We have correctly attributed quotes in the story.