Deep Planning Network (PlaNet) is a model-based agent that learns a latent dynamics model from images and selects actions by planning online in latent space.

Architecture

At a high level, the architecture can be viewed as four components. Each component will later be broken down further into smaller modules.

- Posterior: constrained to be close to the single-step prior.
- Multi-step prior: constrained to keep its latent state distribution close to the posterior's latent state distribution. Multi-step priors are used to model the dynamics.
- Observation model: constrained to reconstruct observations from the latent state.
- Controller: a simple planner that uses the cross-entropy method to maximize the sum of rewards across a trajectory of length H.

Posterior

The posterior (q) parameterizes a distribution over the stochastic latent state (s), conditioned on the previous deterministic latent state, the previous stochastic latent state, the previous action (a), and the current ground-truth observation (o). The posterior can be thought of as the observation encoder. q is a multidimensional Gaussian with a diagonal covariance matrix.

where h represents the deterministic latent state
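The equations here were likely images in the original post; as a sketch following the recurrent state-space model in the PlaNet paper (the function names f, mu_q, and sigma_q are illustrative), the deterministic state and posterior can be written as:

```latex
\begin{aligned}
h_t &= f(h_{t-1},\, s_{t-1},\, a_{t-1}) && \text{(deterministic state, recurrent network)} \\
s_t &\sim q(s_t \mid h_t, o_t) = \mathcal{N}\!\big(\mu_q(h_t, o_t),\ \operatorname{diag}(\sigma_q^2(h_t, o_t))\big) && \text{(posterior)}
\end{aligned}
```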

Multi-step prior

The multi-step priors are important for the online forward planning that PlaNet performs. The multi-step prior can be thought of as the transition model between latent states.

The multi-step prior for the latent states is a multidimensional Gaussian with diagonal covariance (similar to q):

The observation model is a multidimensional Gaussian with identity covariance. Since it reconstructs images, it uses a deconvolutional network.

The reward estimates are sampled from a scalar Gaussian with unit variance:
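The three distributions referenced above (transition prior, observation model, reward model) can be sketched in the notation of the PlaNet paper, with illustrative mean functions mu_p, mu_o, and mu_r:

```latex
\begin{aligned}
\text{transition prior:} \quad & s_t \sim p(s_t \mid h_t) = \mathcal{N}\!\big(\mu_p(h_t),\ \operatorname{diag}(\sigma_p^2(h_t))\big) \\
\text{observation model:} \quad & o_t \sim p(o_t \mid h_t, s_t) = \mathcal{N}\!\big(\mu_o(h_t, s_t),\ \mathbf{I}\big) \\
\text{reward model:} \quad & r_t \sim p(r_t \mid h_t, s_t) = \mathcal{N}\!\big(\mu_r(h_t, s_t),\ 1\big)
\end{aligned}
```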

Below is an example of unrolling the transition model to make reward and observation predictions. Performing observation predictions is expensive, but it is only necessary during training, since it has no influence on which actions to take.
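As a minimal sketch of why planning stays cheap: placeholder linear dynamics and a linear reward head stand in for the learned networks, and the unroll predicts only rewards, skipping observation reconstruction entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder parameters standing in for the learned networks.
A = 0.9 * np.eye(4)          # latent transition weights
B = rng.normal(size=(4, 2))  # action influence on the latent state
w_r = rng.normal(size=4)     # reward head

def unroll(h0, actions):
    """Unroll the transition model open-loop, predicting only rewards.

    Observation reconstruction is skipped: it is needed for the training
    loss but has no influence on which actions the planner selects."""
    h, total_reward = h0, 0.0
    for a in actions:
        h = np.tanh(A @ h + B @ a)      # latent state transition
        total_reward += float(w_r @ h)  # cheap scalar reward prediction
    return total_reward
```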

Training

The controller and the dynamics model are trained separately. There are no controller parameters in the same sense as a typical policy-gradient or Q-learning agent: the parameters (mean, covariance) of the controller's action distribution are reset after every step. This means the posterior and multi-step priors are learned offline and then used as the dynamics model for online planning via the cross-entropy method.

Dynamics Model

Below is the objective used to train the dynamics model. The reconstruction term updates parameters to ensure the latent state contains the information necessary to reconstruct the ground-truth observation.

The latent overshooting term serves two purposes. When d = 1, the parameters of both the prior and the posterior are pushed towards each other to enforce consistency. When d > 1, only the parameters of the prior are updated, to match the latent state distribution of the posterior. In the second case the posterior can be thought of as a fixed target.

D is the maximum overshooting distance used during training and T is the length of the trajectory.
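The objective described above can be sketched (following the latent overshooting formulation in the PlaNet paper; the per-distance weights beta_d are part of that formulation) as:

```latex
\frac{1}{D}\sum_{d=1}^{D}\sum_{t=1}^{T}
\mathbb{E}\Big[\,\ln p(o_t \mid s_t)
\;-\; \beta_d\,\mathrm{KL}\big[\, q(s_t \mid o_{\le t}) \,\big\|\, p(s_t \mid s_{t-d})\,\big]\Big]
```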

Planning

Planning is done via the cross-entropy method at every step. The cross-entropy method can be described in two basic steps:

1. Generate random trajectories according to a distribution p (initialized as a standard normal).
2. Update the parameters of p based on the data to produce "better" samples in the next iteration.

A trajectory of length H is first sampled from a standard normal, and p then goes through I iterations of updating its mean and variance so as to maximize the sum of rewards across the H steps. Each iteration uses the K best-performing candidates in the population of J to update the parameters.

After each environment step this process is restarted, in model predictive control fashion: the action actually taken is the first action of the maximum-reward trajectory.
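The planning loop above can be sketched as a minimal numpy implementation of the cross-entropy method. The toy reward function here stands in for rollouts of the learned dynamics model; the mean and standard deviation are re-initialized on every call, matching the "no learned controller parameters" point above.

```python
import numpy as np

def cem_plan(reward_fn, action_dim, horizon_H, iters_I=10,
             population_J=100, elites_K=10, seed=0):
    """Cross-entropy method over action sequences of length H.

    reward_fn maps an (H, action_dim) action sequence to a scalar
    return. Returns only the first action, MPC-style."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon_H, action_dim))
    std = np.ones((horizon_H, action_dim))
    for _ in range(iters_I):
        # Sample J candidate action sequences from the current distribution.
        samples = mean + std * rng.normal(size=(population_J, horizon_H, action_dim))
        returns = np.array([reward_fn(s) for s in samples])
        # Refit mean and variance to the K best-performing candidates.
        elites = samples[np.argsort(returns)[-elites_K:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    # Model predictive control: execute only the first action, then replan.
    return mean[0]

# Toy objective: prefer actions close to 0.5 at every step.
first_action = cem_plan(lambda seq: -np.sum((seq - 0.5) ** 2),
                        action_dim=1, horizon_H=5)
```

In a real agent, `reward_fn` would unroll the learned latent transition and reward models for H steps, and `cem_plan` would be called again after every environment step.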

Comparison to World Models

PlaNet is similar to World Models (summary) in that both separate learning a dynamics model from learning a policy. However, World Models further breaks the learning of the dynamics model into first learning a compressed embedding of the observation and then learning a dynamics model over only the embeddings. As we saw above, in PlaNet the dynamics model over the latent states is learned at the same time as the latent state representation.

In the future it would be interesting to see tests comparing the performance of World Models and PlaNet. One immediate drawback of PlaNet is that its dynamics model is not capable of capturing multi-modal transitions the way the MDN-RNN in World Models can.