I asked this question on Reddit:

As I understand it (and please correct me if I’m wrong), inverse reinforcement learning + reinforcement learning will eventually produce the same result as supervised learning/behavioural cloning. Inverse RL assumes the agent’s behaviour is optimal, so it will end up just imitating the agent. Let’s say you want to do a task better than the agent. Has there been any research on deriving a reward function from agent behaviour without assuming the agent’s behaviour is optimal?

I asked this question because I’m concerned about avoiding brittle reward functions when doing reinforcement learning in simulation for self-driving cars. (This is assuming you can accurately simulate other road users.) You have to define a reward function somehow, and a hand-crafted reward function might be as brittle as hand-crafted rules about how to drive. So, what if a reward function could be learned from demonstrations, in such a way that the agent can actually improve on the demonstrated behaviour?

u/wdabney came through with this answer:

Good timing for the question. There’s actually a very recent paper from Scott Niekum’s lab that does exactly this: https://arxiv.org/abs/1904.06387 They take an intuitively appealing approach and it yields impressive results on both MuJoCo problems and Atari.

Here’s the abstract of the paper:

A critical flaw of existing inverse reinforcement learning (IRL) methods is their inability to significantly outperform the demonstrator. This is a consequence of the general reliance of IRL algorithms upon some form of mimicry, such as feature-count matching, rather than inferring the underlying intentions of the demonstrator that may have been poorly executed in practice. In this paper, we introduce a novel reward learning from observation algorithm, Trajectory-ranked Reward EXtrapolation (T-REX), that extrapolates beyond a set of (approximately) ranked demonstrations in order to infer high-quality reward functions from a set of potentially poor demonstrations. When combined with deep reinforcement learning, we show that this approach can achieve performance that is more than an order of magnitude better than the best-performing demonstration, on multiple Atari and MuJoCo benchmark tasks. In contrast, prior state-of-the-art imitation learning and IRL methods fail to perform better than the demonstrator and often have performance that is orders of magnitude worse than T-REX. Finally, we demonstrate that T-REX is robust to modest amounts of ranking noise and can accurately extrapolate intention by simply watching a learner noisily improve at a task over time.

A key excerpt:

We only assume access to a qualitative ranking over demonstrations. Thus, we only require the demonstrator to have an internal goal or intrinsic reward. The demonstrator can rank trajectories using any method, such as giving pairwise preferences over demonstrations or by rating each demonstration on a scale.
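These pairwise preferences train the reward function through a standard binary classification loss: the learned reward should assign a higher total return to the preferred trajectory. Here is a minimal NumPy sketch of that objective for a single preference pair (the function name and array shapes are my own; in the paper the per-step rewards come from a neural network trained to minimize this loss over many pairs):

```python
import numpy as np

def trex_pairwise_loss(r_worse, r_better):
    """Ranking loss for one preference pair.

    r_worse, r_better: arrays of predicted per-step rewards for the
    lower- and higher-ranked trajectory. Under a Bradley-Terry-style
    preference model, the probability that the better trajectory is
    preferred is the softmax over the two predicted returns; the loss
    is the cross-entropy of that prediction.
    """
    ret_worse = np.sum(r_worse)
    ret_better = np.sum(r_better)
    # Log-softmax of the preferred trajectory's return, computed stably.
    m = max(ret_worse, ret_better)
    log_z = m + np.log(np.exp(ret_worse - m) + np.exp(ret_better - m))
    return -(ret_better - log_z)
```

The loss approaches zero when the predicted return of the preferred trajectory is much higher than the other's, and equals log 2 when the two predicted returns tie, so minimizing it pushes the learned reward to agree with the rankings.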

So, similar to supervised learning for computer vision, a company might need to employ a team of human labellers to sort driving demonstrations into rankings of, say, 1 to 5, or 1 to 10, or 0 to 100. As with image labelling, there could be extensive documentation on how exactly to rank demonstrations.

This isn’t just shifting the problem of defining a reward function onto the problem of defining a way to score and rank demonstrations. Similar to a grading rubric for an essay, you can write down in natural language what makes a demonstration a 1 versus a 2, which is much easier than writing down a numerical reward function. It still requires human judgment, though.

Or, to simplify the process, you could just show human labellers two examples side by side (perhaps from the same GPS location) and ask them to choose which one is better. (You could have documentation and training on how to do this as well.)

Another way to do it (I think) would be to use multiple versions of a driving agent trained with supervised learning/behavioural cloning. As you train the agent on more and more state-action pairs, and it moves closer and closer to human-level performance, you thereby produce a series of agents, each better than the last. These agents can be used to produce demonstrations that can be automatically ranked based on version number. This is time-ordered T-REX.
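A minimal sketch of that automatic ranking, assuming you have demonstrations grouped by agent version (the data structure and function name here are hypothetical, not from the paper): every trajectory from an earlier version is treated as worse than every trajectory from a later version, yielding pairwise preferences with no human labelling.

```python
import itertools

def preferences_from_versions(demos_by_version):
    """Build (worse, better) preference pairs from time-ordered agents.

    demos_by_version: dict mapping agent version number -> list of
    trajectories produced by that version. Assumes later versions
    behave better on average, so any trajectory from a lower version
    is ranked below any trajectory from a higher version.
    """
    pairs = []
    for v_lo, v_hi in itertools.combinations(sorted(demos_by_version), 2):
        for traj_lo in demos_by_version[v_lo]:
            for traj_hi in demos_by_version[v_hi]:
                pairs.append((traj_lo, traj_hi))
    return pairs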

The authors didn’t quite test this idea. Instead, they did the same thing but with an agent trained using reinforcement learning. But I think the same principle would apply. I emailed the authors to ask about this.

So, there are two ways to apply T-REX:

1. with qualitatively ranked demonstrations
2. with time-order-ranked demonstrations, given some underlying learning process

PDF: https://arxiv.org/pdf/1904.06387.pdf