Reinforcement learning systems can make decisions in one of two ways. In the model-based approach, a system uses a predictive model of the world to ask questions of the form “what will happen if I do x?” to choose the best x. In the alternative model-free approach, the modeling step is bypassed altogether in favor of learning a control policy directly. Although in practice the line between these two techniques can become blurred, as a coarse guide it is useful for dividing up the space of algorithmic possibilities.



Predictive models can be used to ask “what if?” questions to guide future decisions.

The natural question to ask after making this distinction is whether to use such a predictive model. The field has grappled with this question for quite a while, and is unlikely to reach a consensus any time soon. However, we have learned enough about designing model-based algorithms that it is possible to draw some general conclusions about best practices and common pitfalls. In this post, we will survey various realizations of model-based reinforcement learning methods. We will then describe some of the tradeoffs that come into play when using a learned predictive model for training a policy and how these considerations motivate a simple but effective strategy for model-based reinforcement learning. The latter half of this post is based on our recent paper on model-based policy optimization, for which code is available here.

Model-based techniques

Below, model-based algorithms are grouped into four categories to highlight the range of uses of predictive models. For the comparative performance of some of these approaches in a continuous control setting, this benchmarking paper is highly recommended.

Analytic gradient computation

Assumptions about the form of the dynamics and cost function are convenient because they can yield closed-form solutions for locally optimal control, as in the LQR framework. Even when these assumptions are not valid, receding-horizon control can account for small errors introduced by approximated dynamics. Similarly, dynamics models parametrized as Gaussian processes have analytic gradients that can be used for policy improvement. Controllers derived via these simple parametrizations can also be used to provide guiding samples for training more complex nonlinear policies.
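As a concrete illustration of the closed-form case, here is a minimal sketch of finite-horizon, discrete-time LQR using the backward Riccati recursion. The double-integrator system, the cost matrices, and the function name `lqr_backward_pass` are assumptions chosen for illustration and do not come from any of the cited works.

```python
import numpy as np

def lqr_backward_pass(A, B, Q, R, horizon):
    """Finite-horizon discrete-time LQR via the backward Riccati recursion.

    Assumes dynamics x_{t+1} = A x_t + B u_t and per-step cost
    x_t^T Q x_t + u_t^T R u_t. Returns gains K_t with u_t = -K_t x_t.
    """
    P = Q.copy()  # terminal cost-to-go matrix
    gains = []
    for _ in range(horizon):
        # Gain that minimizes the one-step cost plus the cost-to-go.
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # Riccati update of the cost-to-go matrix.
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]  # reorder so that gains[0] applies at t = 0

# Hypothetical system: a double integrator (position, velocity).
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.eye(2)
R = 0.1 * np.eye(1)
gains = lqr_backward_pass(A, B, Q, R, horizon=50)

# Roll the controller forward from an initial offset.
x = np.array([1.0, 0.0])
for K in gains:
    u = -K @ x
    x = A @ x + B @ u
final_state_norm = float(np.linalg.norm(x))
```

In a receding-horizon variant, only the first gain would be applied before re-solving from the newly observed state, which is what lets the controller absorb small errors in the approximated dynamics.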

Sampling-based planning

In the fully general case of nonlinear dynamics models, we lose guarantees of local optimality and must resort to sampling action sequences. The simplest version of this approach, random shooting, entails sampling candidate actions from a fixed distribution, evaluating them under a model, and choosing the action that is deemed the most promising. More sophisticated variants iteratively adjust the sampling distribution, as in the cross-entropy method (CEM; used in PlaNet, PETS, and visual foresight) or path integral optimal control (used in recent model-based dexterous manipulation work).
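The sampling loop described above can be sketched as follows. This is a minimal version of the cross-entropy method with a diagonal Gaussian sampling distribution; the toy point-mass model, the quadratic cost, and all function names and hyperparameters are assumptions for illustration, not the settings used by PlaNet, PETS, or visual foresight.

```python
import numpy as np

def rollout_cost(model, cost_fn, state, actions):
    """Total cost of an action sequence evaluated under the model."""
    total = 0.0
    for action in actions:
        state = model(state, action)
        total += cost_fn(state, action)
    return total

def cem_plan(model, cost_fn, state, horizon, action_dim, rng,
             n_samples=200, n_elites=20, n_iters=5):
    """Iteratively refit a Gaussian over action sequences to the elites."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        noise = rng.standard_normal((n_samples, horizon, action_dim))
        samples = mean + std * noise
        costs = np.array([rollout_cost(model, cost_fn, state, s)
                          for s in samples])
        elites = samples[np.argsort(costs)[:n_elites]]  # lowest-cost sequences
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6  # keep the distribution from collapsing
    return mean

# Hypothetical model and cost: a 2-D point mass driven toward the origin.
def toy_model(state, action):
    return state + 0.1 * action

def toy_cost(state, action):
    return float(state @ state)

rng = np.random.default_rng(0)
start = np.array([1.0, -1.0])
plan = cem_plan(toy_model, toy_cost, start, horizon=10, action_dim=2, rng=rng)
planned_cost = rollout_cost(toy_model, toy_cost, start, plan)
idle_cost = rollout_cost(toy_model, toy_cost, start, np.zeros((10, 2)))
```

Running a single iteration and returning the lowest-cost sample instead of the elite mean would recover random shooting. In a receding-horizon controller, only `plan[0]` would be executed before replanning.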

In discrete-action settings, however, it is more common to search over tree structures than to iteratively refine a single trajectory of waypoints. Common tree-based search algorithms include MCTS, which has underpinned recent impressive results in game playing, and iterated width search. Sampling-based planning, in both continuous and discrete domains, can also be combined with structured physics-based, object-centric priors.

Model-based data generation

An important detail in many machine learning success stories is a means of artificially increasing the size of a training set. It is difficult to define a manual data augmentation procedure for policy optimization, but we can view a predictive model analogously as a learned method of generating synthetic data. The original proposal of such a combination comes from the Dyna algorithm by Sutton, which alternates between model learning, data generation under a model, and policy learning using the model data. This strategy has been combined with iLQG, model ensembles, and meta-learning; has been scaled to image observations; and is amenable to theoretical analysis. A close cousin to model-based data generation is the use of a model to improve target value estimates for temporal difference learning.
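The alternation between model learning, data generation, and policy learning can be sketched with a tabular Dyna-Q loop. This is a simplified illustration in the spirit of Dyna, assuming a deterministic environment and a lookup-table model; the chain environment, the hyperparameters, and the names `dyna_q` and `chain_step` are hypothetical.

```python
import random

def dyna_q(env_step, n_states, n_actions, start, episodes=50,
           planning_steps=20, alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    model = {}  # (state, action) -> (reward, next_state, done)

    def update(s, a, r, s_next, done):
        # One-step Q-learning update.
        target = r if done else r + gamma * max(q[s_next])
        q[s][a] += alpha * (target - q[s][a])

    for _ in range(episodes):
        s, done, steps = start, False, 0
        while not done and steps < 100:
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:  # greedy action with random tie-breaking
                best = max(q[s])
                a = rng.choice([i for i, v in enumerate(q[s]) if v == best])
            r, s_next, done = env_step(s, a)
            update(s, a, r, s_next, done)       # policy learning on real data
            model[(s, a)] = (r, s_next, done)   # model learning
            for _ in range(planning_steps):     # data generation under the model
                (ps, pa), (pr, pn, pd) = rng.choice(list(model.items()))
                update(ps, pa, pr, pn, pd)      # policy learning on model data
            s = s_next
            steps += 1
    return q

# Hypothetical 6-state chain: action 1 moves right, action 0 moves left,
# and reaching state 5 ends the episode with reward 1.
def chain_step(s, a):
    s_next = s + 1 if a == 1 else max(0, s - 1)
    done = s_next == 5
    return (1.0 if done else 0.0), s_next, done

q = dyna_q(chain_step, n_states=6, n_actions=2, start=0)
greedy = [max(range(2), key=lambda a: q[s][a]) for s in range(5)]
```

Each real transition is reused many times through the learned model, which is the sense in which the model acts as a learned data augmentation procedure.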

Value-equivalence prediction

A final technique, which does not fit neatly into the model-based versus model-free categorization, is to incorporate computation that resembles model-based planning without supervising the model’s predictions to resemble actual states. Instead, plans under the model are constrained to match trajectories in the real environment only in their predicted cumulative reward. These value-equivalent models have been shown to be effective in high-dimensional observation spaces where conventional model-based planning has proven difficult.

Trade-offs of model data

In what follows, we will focus on the data generation strategy for model-based reinforcement learning. It is not obvious whether incorporating model-generated data into an otherwise model-free algorithm is a good idea. Modeling errors could cause diverging temporal-difference updates, and in the case of linear approximation, model and value fitting are equivalent. However, it is easier to motivate model usage by considering the empirical generalization capacity of predictive models, and such a model-based augmentation procedure turns out to be surprisingly effective in practice.

The Good News

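A natural way of thinking about the effects of model-generated data begins with the standard objective of reinforcement learning:

\[ \max_{\pi} \; \mathbb{E}_{\pi, p}\left[ \sum_{t=0}^{\infty} \gamma^{t} \, r(s_t, a_t) \right] \]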

which says that we want to maximize the expected cumulative discounted rewards \(r(s_t, a_t)\) from acting according to a policy \(\pi\) in an environment governed by dynamics \(p\). It is important to pay particular attention to the distributions over which this expectation is taken. For example, while the expectation is supposed to be taken over trajectories from the current policy \(\pi\), in practice many algorithms re-use trajectories from an old policy \(\pi_\text{old}\) for improved sample-efficiency. There has been much algorithm development dedicated to correcting for the issues associated with the resulting off-policy error.
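As a small worked example of this objective, the sketch below estimates the expected discounted return by averaging over sampled trajectories. The reward sequences and the discount factor are made-up values, and the function names are hypothetical; which policy and which dynamics produced the trajectories is exactly the distributional question discussed above.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t, accumulated from the last step backward."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

def estimate_objective(trajectories, gamma):
    """Monte Carlo estimate: average discounted return over trajectories."""
    returns = [discounted_return(rewards, gamma) for rewards in trajectories]
    return sum(returns) / len(returns)

# Two hypothetical reward sequences, e.g. from two rollouts of one policy.
trajectories = [[1.0, 1.0, 1.0], [0.0, 0.0, 4.0]]
estimate = estimate_objective(trajectories, gamma=0.5)  # (1.75 + 1.0) / 2 = 1.375
```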

Using model-generated data can also be viewed as a simple modification of the sampling distribution. Incorporating model data into policy optimization amounts to swapping out the true dynamics \(p\) with an approximation \(\hat{p}\). The model bias introduced by making this substitution acts analogously to the off-policy error, but it allows us to do something rather useful: we can query the model dynamics \(\hat{p}\) at any state to generate samples from the current policy, effectively circumventing the off-policy error.

If model usage can be viewed as trading between off-policy error and model bias, then a straightforward way to proceed would be to compare these two terms. However, estimating a model’s error on the current policy’s distribution requires us to make a statement about how that model will generalize. While worst-case bounds are rather pessimistic here, we found that predictive models tend to generalize to the state distributions of future policies well enough to motivate their usage in policy optimization.



Generalization of learned models, trained on samples from a data-collecting policy \(\pi_D\), to the state distributions of future policies \(\pi\) seen during policy optimization. Increasing the training set size not only improves performance on the training distribution, but also on nearby distributions.

The Bad News

The above result suggests that the single-step predictive accuracy of a learned model can be reliable under policy shift. The catch is that most model-based algorithms rely on models for much more than single-step accuracy, often performing model-based rollouts equal in length to the task horizon in order to properly estimate the state distribution under the model. When predictions are strung together in this manner, small errors compound over the prediction horizon.
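A toy example makes the compounding effect concrete. The sketch below assumes the true dynamics are a planar rotation, which produces sinusoidal motion, and that the "learned" model is the same rotation with a 1% error in the angle. Both systems and the error size are assumptions for illustration and are unrelated to the learned probabilistic models used in our experiments.

```python
import numpy as np

def rotation(theta):
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

def rollout(A, x0, steps):
    """Apply the linear map A repeatedly, feeding each output back in."""
    states = [x0]
    for _ in range(steps):
        states.append(A @ states[-1])
    return np.array(states)

theta = 0.1
true_A = rotation(theta)           # hypothetical true dynamics
model_A = rotation(1.01 * theta)   # hypothetical model with a 1% angle error

x0 = np.array([1.0, 0.0])
true_traj = rollout(true_A, x0, 450)
model_traj = rollout(model_A, x0, 450)

# Distance between model prediction and truth at every step of the rollout.
errors = np.linalg.norm(true_traj - model_traj, axis=1)
one_step_error = errors[1]
final_error = errors[450]
```

The single-step error is on the order of 0.001, but after 450 chained predictions the trajectories are several hundred times farther apart. In this linear example the gap grows roughly linearly with the horizon; nonlinear or unstable dynamics can make it grow much faster.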



A 450-step action sequence rolled out under a learned probabilistic model, with the figure’s position depicting the mean prediction and the shaded regions corresponding to one standard deviation away from the mean. The growing uncertainty and deterioration of a recognizable sinusoidal motion underscore the accumulation of model errors.

Analyzing the trade-off

This qualitative trade-off can be made more precise by writing a lower bound on a policy’s true return in terms of its model-estimated return:



A lower bound on a policy’s true return in terms of its expected model return, the model rollout length, the policy divergence, and the model error on the current policy’s state distribution.

As expected, there is a tension involving the model rollout length. The model serves to reduce off-policy error via the terms exponentially decreasing in the rollout length \(k\). However, increasing the rollout length also brings about increased discrepancy proportional to the model error.
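To make this tension easier to see, the bound can be written schematically as follows. This is only a sketch of its structure: \(\eta[\pi]\) denotes the true return, \(\hat{\eta}_k[\pi]\) the return estimated with \(k\)-step model rollouts, \(\epsilon_\pi\) the policy divergence, \(\epsilon_m\) the model error on the current policy’s state distribution, and \(\gamma\) the discount factor. The positive constants \(c_1, c_2, c_3\) are placeholders introduced here; the exact expressions are given in the paper.

\[ \eta[\pi] \;\ge\; \hat{\eta}_k[\pi] \;-\; \Big[ \underbrace{c_1 \, \gamma^{k+1} \epsilon_\pi + c_2 \, \gamma^{k} \epsilon_\pi}_{\text{shrinks as } k \text{ grows}} \;+\; \underbrace{c_3 \, k \, \epsilon_m}_{\text{grows with } k} \Big] \]

When \(\epsilon_m\) is small relative to \(\epsilon_\pi\), a rollout length \(k > 0\) can tighten the bound; when it is large, the linear term dominates and shorter rollouts are preferable.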

Model-based policy optimization

We have two main conclusions from the above results:

(1) predictive models can generalize well enough for the incurred model bias to be worth the reduction in off-policy error, but (2) compounding errors make long-horizon model rollouts unreliable.

A simple recipe for combining these two insights is to use the model only to perform short rollouts from all previously encountered real states instead of full-length rollouts from the initial state distribution. Variants of this procedure have been studied in prior works dating back to the classic Dyna algorithm, and we will refer to it generically as model-based policy optimization (MBPO), which we summarize in the pseudo-code below.
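The overall loop can be sketched in Python as follows. This is a schematic under stated assumptions rather than the released implementation: every callable (`fit_model`, `model_step`, `policy`, `update_policy`, and the environment functions) is a hypothetical stand-in supplied by the caller, and the toy definitions at the bottom exist only to make the sketch runnable. Design decisions such as model ensembles and the choice of policy optimizer are deliberately left out.

```python
import random

def mbpo_sketch(env_reset, env_step, fit_model, model_step, policy,
                update_policy, epochs, env_steps, rollouts_per_step, k, seed=0):
    rng = random.Random(seed)
    env_buffer, model_buffer = [], []
    state = env_reset()
    for _ in range(epochs):
        model = fit_model(env_buffer)  # refit the model on all real data so far
        for _ in range(env_steps):
            # Take one real environment step with the current policy.
            action = policy(state)
            next_state, reward, done = env_step(state, action)
            env_buffer.append((state, action, reward, next_state, done))
            state = env_reset() if done else next_state
            # Branch short k-step model rollouts from previously seen real states.
            for _ in range(rollouts_per_step):
                s = rng.choice(env_buffer)[0]
                for _ in range(k):
                    a = policy(s)
                    s_next, r, d = model_step(model, s, a)
                    model_buffer.append((s, a, r, s_next, d))
                    if d:
                        break
                    s = s_next
            # Improve the policy using model-generated data.
            update_policy(model_buffer)
    return env_buffer, model_buffer

# Toy stand-ins (assumptions for illustration only).
def toy_reset():
    return 1.0

def toy_env_step(s, a):
    s_next = s + a
    return s_next, -abs(s_next), False

def toy_fit_model(buffer):
    return None  # a real implementation would fit a dynamics model here

def toy_model_step(model, s, a):
    s_next = s + a  # pretends the model matches the true dynamics
    return s_next, -abs(s_next), False

def toy_policy(s):
    return -0.5 * s

def toy_update_policy(buffer):
    pass  # a real implementation would run policy-gradient updates here

env_data, model_data = mbpo_sketch(
    toy_reset, toy_env_step, toy_fit_model, toy_model_step, toy_policy,
    toy_update_policy, epochs=2, env_steps=5, rollouts_per_step=3, k=4)
```

With these settings, 10 real transitions produce 120 model transitions, which illustrates how short branched rollouts multiply the data available to the policy optimizer without ever asking the model to predict far from a real state.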

We found that this simple procedure, combined with a few important design decisions like using probabilistic model ensembles and a stable off-policy model-free optimizer, yields the best combination of sample efficiency and asymptotic performance. We also found that MBPO avoids the pitfalls that have prevented recent model-based methods from scaling to higher-dimensional states and long-horizon tasks.



Learning curves of MBPO and five prior works on continuous control benchmarks. MBPO reaches the same asymptotic performance as the best model-free algorithms, often with only one-tenth of the data, and scales to state dimensions and horizon lengths that cause previous model-based algorithms to fail.

This post is based on the following paper:

I would like to thank Michael Chang and Sergey Levine for their valuable feedback.