Towards Reinforcement Learning Inspired By Humans Without Human Demonstrations

Leave any 12-year-old alone with an Atari video game for the afternoon and chances are, she will have mastered it before dinner. How do people learn to achieve high reward so quickly and how can we enable artificial agents to do the same? Some hypothesize that people learn and leverage structured models of how the world works (for example see [1,2]), models that represent the world in terms of objects rather than pixels, and that artificial agents could benefit from doing the same [3].

Inspired by such ideas, we present the Strategic Object-Oriented RL (SOORL) algorithm, which is, to our knowledge, the first algorithm that can achieve positive rewards on the notoriously hard Atari game Pitfall! without access to human demonstrations, and can do so within 50 episodes. SOORL uses stronger prior knowledge (access to objects in the environment and a class of potential dynamics models) than standard deep RL algorithms, but much weaker information than methods that require access to trajectories of decent human play.

Snapshot of first three rooms in Pitfall!

SOORL goes beyond prior object-oriented RL work through two key ideas:

1. The agent actively tries to choose a simple model of how the world works that makes the world look deterministic.
2. The agent uses an optimistic model-based planning approach for making decisions that explicitly assumes the agent will not have the computation to find a perfect plan for how to act, even if it knows how the world works.
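The first idea can be sketched roughly as follows. This is a minimal illustration of a "make the world look deterministic" model, not SOORL's implementation: it assumes object-level effects (such as position deltas) are observable, and the class and method names here are ours.

```python
from collections import Counter, defaultdict

class DeterministicObjectModel:
    """Illustrative sketch of a deterministic object-level model.

    For each (object class, action) pair it memorizes the single most
    frequently observed effect, so the planner sees exactly one next
    state per action -- cheap to learn, cheap to plan with, and often
    wrong in ways that do not matter for good behavior.
    """

    def __init__(self):
        # (obj_class, action) -> Counter of observed effects
        self.counts = defaultdict(Counter)

    def update(self, obj_class, action, effect):
        """Record one observed transition effect (e.g. a position delta)."""
        self.counts[(obj_class, action)][effect] += 1

    def predict(self, obj_class, action):
        """Return the single most common effect seen so far, or None
        if this pair has never been observed."""
        effects = self.counts[(obj_class, action)]
        if not effects:
            return None  # unknown: a planner can treat this optimistically
        return effects.most_common(1)[0][0]

    def agreement(self, obj_class, action):
        """Fraction of observed transitions the deterministic prediction
        explains -- a simple score for choosing among candidate models."""
        effects = self.counts[(obj_class, action)]
        total = sum(effects.values())
        if total == 0:
            return 0.0
        return effects.most_common(1)[0][1] / total
```

A model class with high agreement is one under which the world genuinely looks (nearly) deterministic, which is the selection criterion this sketch is gesturing at.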

Both ideas are inspired by the challenges faced by humans: given little experience and bounded computational capacity, humans must quickly learn to make good decisions. Towards this, our first idea observes that, in contrast to sophisticated and data-intensive deep neural network models, simple deterministic models of what happens to a player's avatar after a particular keyboard press require little experience to estimate, reduce the computational cost of planning (since only one next state is possible), and, though often wrong, may frequently be sufficient for achieving good behavior. Second, in sparse, complex video games, game play can require hundreds to thousands of steps, and performing exact planning at each decision is intractable for any agent with a reasonably bounded amount of computation, including a 12-year-old video gamer. We use a popular and powerful method for lookahead planning (Monte Carlo Tree Search), combined with object-oriented optimism, to explore strategically and guide the agent towards parts of the world it knows little about.
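The object-oriented optimism can be illustrated with a single rollout sketch. This is an assumption-laden toy, not the paper's planner (SOORL uses full MCTS); it shows only how an exploration bonus on rarely tried object-action pairs inflates the estimated return of trajectories that visit the unknown. The `model` interface (`actions`, `step`, `objects`) and the `ToyChain` environment are ours.

```python
import math
import random

def optimistic_rollout_value(state, model, visit_counts, depth, bonus=1.0):
    """One Monte Carlo rollout whose return is inflated for rarely seen
    object-action pairs, steering search towards the unknown."""
    total = 0.0
    for _ in range(depth):
        action = random.choice(model.actions(state))
        state, reward = model.step(state, action)
        total += reward
        # Optimism bonus: large for object-action pairs we have rarely
        # tried, shrinking as experience with them accumulates.
        for obj_class in model.objects(state):
            n = visit_counts.get((obj_class, action), 0)
            total += bonus / math.sqrt(n + 1)
    return total

class ToyChain:
    """A tiny deterministic environment model, for illustration only:
    states 0..4 in a line, one action ("right"), reward 10 at state 4."""

    def actions(self, state):
        return ["right"]

    def step(self, state, action):
        nxt = min(state + 1, 4)
        return nxt, (10.0 if nxt == 4 else 0.0)

    def objects(self, state):
        return ["ladder"] if state == 2 else ["player"]
```

Averaging many such rollouts inside a tree search gives optimistic value estimates: action sequences that pass novel objects score higher than their raw reward warrants, so the agent is pulled towards rooms it has not yet understood.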

As a challenge problem we consider Pitfall!, perhaps the hardest Atari video game remaining for artificial agents. The first positive rewards in Pitfall! come only after traversing multiple rooms, each reached only through careful maneuvering, making it important both to explore strategically and to reason far into the future when making decisions.

Our SOORL agent reached an average of 17 rooms in Pitfall! within 50 episodes (averaged over 100 runs), compared to DDQN [6], a strong baseline that uses pixel input and no strategic exploration, which reached an average of 6 rooms after 2000 episodes.

SOORL discovers 17 rooms on average and 25 rooms in the best run

The histogram below shows the distribution of best-episode performance during training (each training run lasts at most 50 episodes) across 100 SOORL runs with different random seeds.

Histogram of the best episode performance of 100 different runs (best during the first 50 episodes of each run)

As can be seen, SOORL most often scores no better than prior deep RL methods, which at best achieve a reward of 0 (though such methods frequently need 500 or even 5000 episodes to do so, compared to our 50). In such cases, SOORL often explores much further (reaching more rooms) than alternative approaches, but does not reach better best-episode scores than their evaluation runs. However, in several runs SOORL reaches immediate rewards of 2000 (in room -17) and rewards of 4000 (in room 6), achieving, to the best of our knowledge, the first positive scores on this game when learning without demonstrations. The best results known with human demonstrations are substantially higher (roughly 60,000) [4]; while very exciting, that work requires substantially more prior knowledge and, in particular, significantly reduces the exploration challenge by providing a trusted worked example to build on.

Below are examples of interesting maneuvers learned by the SOORL agent.