One of the core issues in Reinforcement Learning is sample complexity. It is therefore appealing to train RL agents in a simulator, which removes the need to collect samples of an agent interacting with the real world. World Models is a compelling example of this approach. The basic idea is to first train a model that understands the world, then either use features learned by that model to train an agent or train an agent entirely inside the model.

Architecture

There are three main components proposed in World Models:

The encoder or vision module (V)
The model (M)
The controller (C)

Encoder(V)

The encoder is a basic Variational Auto-Encoder (VAE) which maps images from a 64x64x3 space (3 for the RGB channels) down to a 64-dimensional vector (z).

The general goal of a Variational Auto-Encoder is to find a lower-dimensional representation (z) of the observation. This is done by minimizing a reconstruction loss (first term) together with a penalty (second term) for the encoder's distribution moving far from the prior belief P(z).

The encoder is represented by q(z|x) and the decoder by p(x|z), where x is the raw observation.
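The two loss terms can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: `vae_loss` is a hypothetical helper, mean-squared error stands in for the reconstruction term, and the prior P(z) is the standard normal N(0, I).

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var):
    """Illustrative VAE loss: reconstruction term plus KL penalty.

    x, x_recon : flattened observation and the decoder's reconstruction
    mu, log_var: parameters of the encoder's Gaussian q(z|x)
    """
    # First term: how well the decoder p(x|z) reproduces the observation
    # (squared error used here as a stand-in reconstruction loss).
    recon = np.sum((x - x_recon) ** 2)
    # Second term: KL(q(z|x) || N(0, I)), the penalty for the encoder
    # drifting far from the prior belief P(z).
    kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))
    return recon + kl
```

When the reconstruction is perfect and q(z|x) matches the prior exactly (mu = 0, log_var = 0), both terms vanish and the loss is zero.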

Model(M)

The model is a Mixture Density Recurrent Neural Network (MD-RNN). The MD-RNN takes the current latent z (defined above) and the action as input and predicts π, μ, and σ for future timesteps. π is a k-dimensional vector representing the logits of a multinomial distribution over mixture components, and there are k corresponding μ and σ used to parameterize k separate multidimensional Gaussian distributions.

To draw a sample from the MD-RNN, we first sample from the multinomial parameterized by π; that sample then indexes which of the k multidimensional normal distributions we sample from.
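The two-stage sampling procedure can be sketched as follows. This is a simplified stand-alone version with diagonal Gaussians, not the paper's code; `sample_mdn` is a hypothetical helper.

```python
import numpy as np

def sample_mdn(logits, mus, sigmas, rng=None):
    """Draw one sample from a mixture of k diagonal Gaussians.

    logits : shape (k,)   -- pi, unnormalized mixture weights
    mus    : shape (k, d) -- means of the k Gaussians
    sigmas : shape (k, d) -- per-dimension standard deviations
    """
    rng = rng or np.random.default_rng()
    # Softmax over pi to get mixture probabilities, then pick a component.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    k = rng.choice(len(probs), p=probs)
    # Sample from the chosen multidimensional normal distribution.
    return rng.normal(mus[k], sigmas[k])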

The recurrent neural network also produces a hidden state vector (h) and a cell state vector (c), which will come in useful later.

Controller(C)

The controller (C) is responsible for deciding which actions to take; in the experiments the actions take continuous values (though this could easily be generalized to the discrete case). The authors intentionally make C as simple as possible to show that most of the complexity lies inside the encoder (V) and the model (M). The controller is a single-layer linear model whose parameters are found with an evolutionary strategy, CMA-ES.

Notice that the controller is a linear model with respect to its inputs and NOT the raw observation. Depending on the environment, the input is a concatenation of either [encoded observation (z), hidden state (h)] or [encoded observation (z), hidden state (h), cell state (c)].
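The controller therefore amounts to a single matrix-vector product on the concatenated features. A minimal sketch, assuming a tanh squashing to keep the continuous actions bounded (the exact output activation depends on the environment):

```python
import numpy as np

def controller_action(z, h, W, b):
    """Single-layer linear controller: action = tanh(W [z, h] + b).

    The input is the concatenated latent z and hidden state h, never the
    raw pixels. W and b are the only parameters CMA-ES has to evolve.
    """
    inp = np.concatenate([z, h])
    return np.tanh(W @ inp + b)  # tanh bounds each action dimension to (-1, 1)
```

Because the parameter count is just `len(action) * (len(z) + len(h)) + len(action)`, a black-box optimizer like CMA-ES remains practical.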

Training

The training scheme follows four steps after initializing V, M, and C:

1. Collect trajectories from the controller (C)
2. Use the raw observations to train V
3. Encode the raw observations into the latent space z with V, then train M
4. Use z, h, and (sometimes) c to train the controller (C)
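The four steps above can be sketched as a single orchestration function. Every argument here is a hypothetical callable standing in for the corresponding component's training code; this shows only the data flow between steps, not any real training logic.

```python
def train_world_model(collect_rollouts, train_v, encode, train_m, train_c):
    """Sketch of the four-step training scheme (data flow only)."""
    # 1. Collect trajectories by running the controller C in the environment.
    observations, actions = collect_rollouts()
    # 2. Train the VAE (V) on the raw observations.
    train_v(observations)
    # 3. Encode observations into latent z with V, then train the MD-RNN (M).
    z = encode(observations)
    train_m(z, actions)
    # 4. Train the controller C on the resulting features via CMA-ES.
    return train_c(z)
```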

When training the controller we have the option of either training on the actual environment or seeding M with a random value and training the controller inside its own “dreams.” The authors speculate that training inside synthetic environments was not feasible in earlier work [recurrent environment simulators, action-conditional video prediction] because the controller could exploit imperfections in the model (M). Here, however, a temperature parameter (τ) is introduced into the multinomial distribution, which creates more randomness in samples from the model (M). This is interesting because even though it makes the model's predictions worse, it forces the controller to be more robust.
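The effect of τ can be shown on the mixture-component sampling step. A minimal sketch (the paper applies temperature to the Gaussian noise as well; only the π half is shown here):

```python
import numpy as np

def sample_component(logits, tau, rng=None):
    """Sample a mixture index from pi with temperature tau.

    tau > 1 flattens the distribution (more randomness, a harder-to-exploit
    model); tau -> 0 approaches the argmax, a deterministic model.
    """
    rng = rng or np.random.default_rng()
    scaled = logits / tau          # temperature rescales the logits
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)
```

With a sharply peaked π and a tiny τ, sampling collapses to the most likely component; raising τ spreads probability mass across all k components.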

The second choice is compelling because it would make training an agent in “dangerous” environments easier, meaning environments where interacting with them can have catastrophic consequences. For example, imagine unleashing a randomly initialized controller onto the open road.

A final note concerns the case where the environment is not sufficiently explored by a random controller. The model would then never learn about dynamics that only a strong controller will experience. To navigate this problem, the authors suggest iteratively performing the four steps mentioned above; however, this was not demonstrated in any of the experiments.