RL Weekly 21: The interplay between Experience Replay and Model-based RL

by Seungjae Ryan Lee

Search on the Replay Buffer

Benjamin Eysenbach (1, 2), Ruslan Salakhutdinov (1), Sergey Levine (2, 3)

(1) CMU, (2) Google Brain, (3) UC Berkeley

What it says

Goal-conditioned RL studies tasks in which the agent must reach a given “goal”, and the episode ends when the agent is sufficiently close to it (Section 2.2). Many previous approaches have used reward shaping or demonstrations to guide the agent. Instead, the authors propose decomposing the goal-reaching problem into a sequence of easier goal-reaching tasks.

To decompose the problem, the authors build a directed, weighted graph in which each node is an observation from the replay buffer and each edge weight is the predicted “distance” between two observations (Section 2.3). Running Dijkstra’s algorithm on this graph (Appendix A) finds the shortest path to the goal, allowing the agent to plan its trajectory.
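
The graph search described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the 2D points stand in for observations, Euclidean distance stands in for the learned distance, and the names `build_graph`, `shortest_path`, and the `max_dist` cutoff for dropping long edges are all hypothetical choices made for this example.

```python
import heapq
import math


def build_graph(observations, distance_fn, max_dist):
    """Directed, weighted graph over buffer entries (nodes are indices)."""
    graph = {i: [] for i in range(len(observations))}
    for i, a in enumerate(observations):
        for j, b in enumerate(observations):
            if i == j:
                continue
            d = distance_fn(a, b)
            # Assumption for this sketch: skip edges predicted to be too long.
            if d <= max_dist:
                graph[i].append((j, d))
    return graph


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm; returns a list of node indices, or None."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist.get(node, math.inf):
            continue  # stale heap entry
        for neighbor, weight in graph[node]:
            candidate = d + weight
            if candidate < dist.get(neighbor, math.inf):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


# Toy "replay buffer" of 2D observations; Euclidean distance is a stand-in
# for the learned distance estimate.
buffer = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (5.0, 5.0)]
graph = build_graph(buffer, math.dist, max_dist=1.5)
waypoints = shortest_path(graph, start=0, goal=3)
print(waypoints)  # [0, 1, 3]
```

In this toy buffer the search skips node 2 because the diagonal edge from node 1 to node 3 is shorter, and the far-away node 4 is unreachable, so `shortest_path(graph, 0, 4)` returns `None`.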

The authors discuss two ways to obtain distance estimates: distributional RL (Section 3.1) and value function ensembles (Section 3.2). Experiments show that these enhancements are crucial for the performance of Search on the Replay Buffer (SoRB) (Section 5.4).
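
As a rough illustration of the ensemble idea, the sketch below combines several distance estimators into one estimate. The summary above does not say how the predictions are aggregated, so taking the maximum (a pessimistic estimate) is an assumption of this example, and `make_estimator` and `ensemble_distance` are hypothetical names. The intuition for pessimism is that a shortest-path search will exploit any edge whose distance is underestimated.

```python
import math
from statistics import mean


def make_estimator(scale):
    """Hypothetical ensemble member: a scaled Euclidean distance stands in
    for one learned distance predictor."""
    def estimate(a, b):
        return scale * math.dist(a, b)
    return estimate


def ensemble_distance(estimators, a, b, aggregate=max):
    """Combine the members' predictions; max is the pessimistic choice."""
    return aggregate([f(a, b) for f in estimators])


# Three members that disagree about the same pair of observations.
estimators = [make_estimator(s) for s in (0.5, 1.0, 2.0)]
start, goal = (0.0, 0.0), (3.0, 4.0)

pessimistic = ensemble_distance(estimators, start, goal)
average = ensemble_distance(estimators, start, goal, aggregate=mean)
print(pessimistic, average)
```

With the pessimistic aggregate, a single overly optimistic member cannot create a spuriously short edge in the graph.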

In a visual navigation task, SoRB outperforms Hindsight Experience Replay (HER), C51, Semi-parametric Topological Memory (SPTM), and Value Iteration Networks (VIN) (Sections 5.2 and 5.3). It is also shown to generalize well to new navigation environments (Section 5.5).

Learning Powerful Policies by Using Consistent Dynamics Model

Shagun Sodhani (1), Anirudh Goyal (1), Tristan Deleu (1), Yoshua Bengio (1, 2), Sergey Levine (3), Jian Tang (1)

(1) Mila, (2) CIFAR, (3) UC Berkeley

What it says

Model-based RL relies on a model that represents the real environment sufficiently well, because model errors compound as the model is unrolled for multiple steps. As a result, a multi-step rollout of the dynamics model can deviate greatly from the real environment. To fix this problem, the authors propose an auxiliary loss that matches the imagined state from the model with the observed state from the real environment (Section 3.1). Compared to the baseline algorithms Mb-Mf and A2C, the Consistent Dynamics model shows superior performance in various Atari and MuJoCo tasks (Section 6).
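
To make the compounding-error argument concrete, here is a toy sketch of a multi-step consistency loss. It is not the paper's loss: the mean squared error between imagined and observed states, the one-dimensional additive dynamics, and the names `rollout`, `per_step_errors`, and `consistency_loss` are all assumptions made for illustration.

```python
def rollout(step_fn, state, actions):
    """Unroll a step function from `state`, feeding each output back in."""
    states = []
    for action in actions:
        state = step_fn(state, action)
        states.append(state)
    return states


def per_step_errors(imagined, observed):
    """Squared error between imagined and observed states at each step."""
    return [
        sum((p - o) ** 2 for p, o in zip(pred, obs))
        for pred, obs in zip(imagined, observed)
    ]


def consistency_loss(model, initial_state, actions, observed):
    """Auxiliary loss (assumed form): mean squared error over the rollout."""
    errors = per_step_errors(rollout(model, initial_state, actions), observed)
    return sum(errors) / len(errors)


def true_env(state, action):
    # Toy dynamics: the action is added to a one-dimensional state.
    return (state[0] + action,)


def biased_model(state, action):
    # Hypothetical learned model that underestimates each action by 10%.
    return (state[0] + 0.9 * action,)


actions = [1.0, 1.0, 1.0]
observed = rollout(true_env, (0.0,), actions)

errors = per_step_errors(rollout(biased_model, (0.0,), actions), observed)
loss = consistency_loss(biased_model, (0.0,), actions, observed)
print(errors, loss)
```

The per-step errors grow with the rollout length even though the one-step bias is constant, which is the deviation that an auxiliary loss of this kind penalizes.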

When to use Parametric Models?

Hado van Hasselt (1), Matteo Hessel (1), John Aslanides (1)

(1) DeepMind

What it says

Model-based and replay-based agents have different computational properties. Parametric models typically require more computation than sampling from a replay buffer, but they can achieve good accuracy with a finite memory requirement, whereas a replay memory forgets past experiences (Section 2.1). However, the two approaches also share similarities. Notably, experience generated by a perfect model is indistinguishable from experience from the real environment, so it has the same effect as experience replay (Section 2.2).
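
The equivalence under a perfect model can be checked on a toy problem. The sketch below is only an illustration under stated assumptions: a deterministic five-state chain, tabular Q-learning, and a "perfect model" that is simply the environment's own step function. None of these details come from the paper.

```python
import random


def env_step(state, action):
    """Toy deterministic chain with states 0..4; action 1 moves right,
    action 0 moves left. Reward is 1.0 whenever the next state is 4."""
    nxt = max(0, min(4, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt == 4 else 0.0)


def new_q():
    return {(s, a): 0.0 for s in range(5) for a in range(2)}


def q_update(q, s, a, r, s2, alpha=0.5, gamma=0.9):
    """One tabular Q-learning update."""
    target = r + gamma * max(q[(s2, 0)], q[(s2, 1)])
    q[(s, a)] += alpha * (target - q[(s, a)])


# Collect real experience with a random policy.
rng = random.Random(0)
replay_buffer = []
state = 0
for _ in range(200):
    action = rng.randrange(2)
    nxt, reward = env_step(state, action)
    replay_buffer.append((state, action, reward, nxt))
    state = nxt

# Assumption: the learned model is exact, so it equals the environment.
perfect_model = env_step

sample_rng = random.Random(1)
batch = [sample_rng.choice(replay_buffer) for _ in range(500)]

# Replay-based agent: learn from the stored transitions.
q_replay = new_q()
for s, a, r, s2 in batch:
    q_update(q_replay, s, a, r, s2)

# Model-based agent: keep only (s, a) and ask the model for the outcome.
q_model = new_q()
for s, a, _, _ in batch:
    s2, r = perfect_model(s, a)
    q_update(q_model, s, a, r, s2)

print(q_replay == q_model)  # True
```

Both agents end up with identical Q-tables because the model regenerates exactly the stored transitions. With an imperfect model the two tables would diverge.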

Parametric models can be helpful when the agent must plan forward into the future to improve its behavior, or when the agent plans backward to solve the credit assignment problem (Section 2.3). However, these models can lead to the catastrophic learning updates that are commonplace in algorithms that combine the “deadly triad”: function approximation, bootstrapping, and off-policy learning (Section 3).

To compare the two approaches, the authors chose SimPLe for the model-based agent and data-efficient Rainbow DQN (Section 4.1) for the replay-based agent. Their results show that Rainbow was superior to SimPLe in 17 out of 26 Atari games (Appendix E), hinting that the hypothesized instability indeed exists in model-based agents.

Why it matters

Although replay-based agents such as Rainbow are generally categorized as model-free, the experience replay mechanism has many similarities with model-based RL. Just as model-based agents use their models to improve the agent between interactions with the real environment, replay-based agents make extensive use of the replay buffer to train and improve the agent.

Many RL papers introduce new models for the sake of sample efficiency, but model-based RL is not a panacea that works in every environment and situation. It is important to understand both the power and the shortcomings of model-based approaches.

More exciting news in RL: