We’ll now look at how to set up an environment that allows you to train your own version of the agent for car racing.

Time for some code!

Step 3: Set up your environment

If you’ve got a high-spec laptop you can run the solution locally, but I’d recommend using Google Compute Engine for access to powerful machines that you can use in short bursts.

The following has been tested on Linux (Ubuntu 16.04) — just change the relevant commands for package installation if you’re on Mac or Windows.

1. Clone the repository

In the command line, navigate to the place you want to store the repository and enter the following:

git clone https://github.com/AppliedDataSciencePartners/WorldModels.git

The repository is adapted from the highly useful estool library developed by David Ha, the first author of the World Models paper.

For the neural network training, this implementation uses Keras with a TensorFlow backend, though in the original paper the authors used raw TensorFlow.

2. Set up a virtual environment

Create yourself a Python 3 virtual environment (I use virtualenv and virtualenvwrapper):

sudo apt-get install python-pip

sudo pip install virtualenv

sudo pip install virtualenvwrapper

export WORKON_HOME=~/.virtualenvs

source /usr/local/bin/virtualenvwrapper.sh

mkvirtualenv --python=/usr/bin/python3 worldmodels

3. Install packages

sudo apt-get install cmake swig python3-dev zlib1g-dev python-opengl mpich xvfb xserver-xephyr vnc4server

4. Install requirements.txt

cd WorldModels

pip install -r requirements.txt

The requirements file contains more packages than the Car Racing example strictly needs, but you’ll have everything installed in case you want to test out some of the other OpenAI Gym environments that require them.

Step 4: Generate random rollouts

For the Car Racing environment, both the VAE and RNN can be trained on random rollout data: observation data generated by taking random actions at each time-step. Actually, we use pseudo-random actions that force the car to accelerate initially, in order to get it off the start line.

Since the VAE and RNN are independent of the decision-making Controller, all we need to ensure is that we encounter a diverse range of observations and choose a diverse range of actions to save as training data.
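The pseudo-random action scheme can be sketched as follows. This is a minimal numpy sketch assuming a simple warm-up window of forced acceleration; the function name, warm-up length, and sampling ranges are my assumptions, and the exact sampling in 01_generate_data.py may differ.

```python
import numpy as np

def sample_pseudo_random_action(t, warmup=50, rng=np.random):
    """Sample a Car Racing action [steering, gas, brake] for time-step t.

    For the first `warmup` time-steps the car is forced to accelerate so
    it gets off the start line; after that, actions are fully random.
    (Hypothetical sketch -- the repository's sampling may differ.)
    """
    steering = rng.uniform(-1.0, 1.0)   # steering is in [-1, 1]
    gas = rng.uniform(0.0, 1.0)
    brake = rng.uniform(0.0, 1.0)
    if t < warmup:
        gas, brake = 1.0, 0.0           # full throttle, no braking, early on
    return np.array([steering, gas, brake])
```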

To generate the random rollouts, run the following from the command line

python 01_generate_data.py car_racing --total_episodes 2000 --start_batch 0 --time_steps 300

or if you’re on a server without a display,

xvfb-run -a -s "-screen 0 1400x900x24" python 01_generate_data.py car_racing --total_episodes 2000 --start_batch 0 --time_steps 300

This will produce 2000 rollouts (saved in ten batches of 200), starting with batch number 0. Each rollout will be a maximum of 300 time-steps long.

Two sets of files are saved in ./data (where * is the batch number):

obs_data_*.npy (stores the 64*64*3 images as numpy arrays)

action_data_*.npy (stores the 3 dimensional actions)
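As a quick sanity check of the file format described above, the sketch below writes and re-loads a miniature batch in the same layout (episodes × time-steps × 64 × 64 × 3 frames, episodes × time-steps × 3 actions). The dtypes, and the use of two episodes instead of 200, are my assumptions for illustration.

```python
import os
import tempfile

import numpy as np

# Write a miniature batch: 2 episodes standing in for the real 200,
# each with 300 time-steps of 64x64x3 frames and 3-dimensional actions.
# (The uint8 / float32 dtypes are assumptions.)
batch_dir = tempfile.mkdtemp()
obs = np.random.randint(0, 256, size=(2, 300, 64, 64, 3), dtype=np.uint8)
act = np.random.uniform(-1.0, 1.0, size=(2, 300, 3)).astype(np.float32)
np.save(os.path.join(batch_dir, 'obs_data_0.npy'), obs)
np.save(os.path.join(batch_dir, 'action_data_0.npy'), act)

# Re-load and inspect, exactly as you would with the real files in ./data
obs_loaded = np.load(os.path.join(batch_dir, 'obs_data_0.npy'))
act_loaded = np.load(os.path.join(batch_dir, 'action_data_0.npy'))
print(obs_loaded.shape, act_loaded.shape)
```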

Step 5: Train the VAE

Training the VAE only requires the obs_data_*.npy files. Make sure you’ve completed Step 4, so that these files exist in the ./data folder.

From the command line, run:

python 02_train_vae.py --start_batch 0 --max_batch 9 --new_model

This will train a new VAE on each batch of data from 0 to 9.

The model weights will be saved to ./vae/weights.h5 . The --new_model flag tells the script to train the model from scratch.

If there is an existing weights.h5 in this folder and the --new_model flag is not specified, the script will load the weights from this file and continue training the existing model. This way, you can iteratively train your VAE in batches, rather than all in one go.

The VAE architecture is specified in the ./vae/arch.py file.
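The load-or-resume behaviour described above can be sketched as a small helper. This is a hypothetical sketch of the logic, not the actual code in 02_train_vae.py; resolve_weights and its defaults are invented names.

```python
import os

def resolve_weights(weights_path='./vae/weights.h5', new_model=False):
    """Return the weights file to resume from, or None for a fresh start.

    Mirrors the behaviour described above: the --new_model flag (or a
    missing weights.h5) means training starts from scratch; otherwise
    the existing weights are loaded and training continues.
    (Hypothetical sketch -- the script's actual logic may differ.)
    """
    if new_model or not os.path.exists(weights_path):
        return None          # build a fresh model and train from scratch
    return weights_path      # load these weights and continue training
```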

Step 6: Generate RNN data

Now that we have a trained VAE, we can use it to generate the training set for the RNN.

The RNN requires encoded image data (z) from the VAE and actions (a) as input and one time-step ahead encoded image data from the VAE as output.

You can generate this data by running:

python 03_generate_rnn_data.py --start_batch 0 --max_batch 9

This will take the obs_data_*.npy and action_data_*.npy files from batches 0 to 9 and convert them to the correct format required by the RNN for training.

Two sets of files will be saved in ./data (where * is the batch number):

rnn_input_*.npy (stores the [z a] concatenated vectors)

rnn_output_*.npy (stores the z vector one time-step ahead)
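The transformation is straightforward to sketch with numpy, assuming a 32-dimensional z (as in the paper) and 3-dimensional actions. make_rnn_dataset is a hypothetical helper illustrating the idea, not the actual code in 03_generate_rnn_data.py.

```python
import numpy as np

def make_rnn_dataset(z, actions):
    """Build RNN training pairs for one episode.

    z:       (T, 32) array of VAE-encoded frames
    actions: (T, 3)  array of the actions taken
    Returns rnn_input of shape (T-1, 35), the concatenated [z a] vectors,
    and rnn_output of shape (T-1, 32), the z vector one time-step ahead.
    """
    rnn_input = np.concatenate([z[:-1], actions[:-1]], axis=1)
    rnn_output = z[1:]
    return rnn_input, rnn_output
```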

Step 7: Train the RNN

Training the RNN only requires the rnn_input_*.npy and rnn_output_*.npy files. Make sure you’ve completed Step 6, so that these files exist in the ./data folder.

From the command line, run:

python 04_train_rnn.py --start_batch 0 --max_batch 9 --new_model

This will train a new RNN on each batch of data from 0 to 9.

The model weights will be saved to ./rnn/weights.h5 . The --new_model flag tells the script to train the model from scratch.

Similarly to the VAE, if there is an existing weights.h5 in this folder and the --new_model flag is not specified, the script will load the weights from this file and continue training the existing model. This way, you can iteratively train your RNN in batches, rather than all in one go.

The RNN architecture specification is in the ./rnn/arch.py file.

Step 8: Train the Controller

Now for the fun part!

So far, we’ve just used deep learning to build a VAE that can condense high-dimensional images down to a low-dimensional latent space, and an RNN that can predict how the latent space will evolve over time. This was possible because we were able to create a training set for each, using random rollout data.

To train the Controller, we’ll use a form of reinforcement learning that utilises an evolutionary algorithm known as CMA-ES (Covariance Matrix Adaptation Evolution Strategy).

Since the input is a vector of dimension 288 (= 32 + 256) and the output a vector of dimension 3, we have 288 * 3 weights + 3 biases = 867 parameters to train.

CMA-ES works by first creating multiple randomly initialised copies of the 867 parameters (the ‘population’). It then tests each member of the population inside the environment and records its average score. Following the same principle as natural selection, the parameter sets that generate the highest scores are allowed to ‘reproduce’ and spawn the next generation.
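The Controller itself is just a single linear layer over the concatenated [z h] vector, which is where the 867 comes from. In the sketch below, controller_action is a hypothetical helper, and the tanh squashing of the output is my assumption to keep actions in range; the repository's implementation may differ.

```python
import numpy as np

Z_DIM, H_DIM, A_DIM = 32, 256, 3   # VAE latent, RNN hidden, action dimensions

def controller_action(params, z, h):
    """Map the flat parameter vector evolved by CMA-ES to an action.

    A single linear layer over [z h]: 288 * 3 weights plus 3 biases
    = 867 parameters. (Hypothetical sketch; the tanh is an assumption.)
    """
    n_weights = (Z_DIM + H_DIM) * A_DIM
    W = params[:n_weights].reshape(A_DIM, Z_DIM + H_DIM)
    b = params[n_weights:]
    return np.tanh(W @ np.concatenate([z, h]) + b)

n_params = (Z_DIM + H_DIM) * A_DIM + A_DIM
print(n_params)  # 867
```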

To start this process on your machine, run the following command, with the appropriate values for the arguments

python 05_train_controller.py car_racing --num_worker 16 --num_worker_trial 2 --num_episode 4 --max_length 1000 --eval_steps 25

or on a server without display:

xvfb-run -s "-screen 0 1400x900x24" python 05_train_controller.py car_racing --num_worker 16 --num_worker_trial 2 --num_episode 4 --max_length 1000 --eval_steps 25

--num_worker 16 : set this to no more than the number of cores available

--num_worker_trial 2 : the number of members of the population that each worker will test (num_worker * num_worker_trial gives the total population size for each generation)

--num_episode 4 : the number of episodes each member of the population will be scored against (i.e. the score will be the average reward across this number of episodes)

--max_length 1000 : the maximum number of time-steps in an episode

--eval_steps 25 : the number of generations between evaluations of the current best set of weights (each evaluation averages the score over 100 episodes)

--init_opt ./controller/car_racing.cma.4.32.es.pk : by default, the Controller starts from scratch each time the script is run, saving the current state of the process to a pickle file in the controller directory. This argument allows you to continue training from the last save point, by pointing it at the relevant file.

After each generation, the current state of the algorithm and the best set of weights will be output to the ./controller folder.

Step 9: Visualise agent

At the point of writing, I’ve managed to train an agent to achieve an average score of ~833.13 after 200 generations of training. This was trained on Google Cloud using an Ubuntu 16.04, 18 vCPU, 67.5GB RAM machine with the steps and parameters given in this tutorial.

The authors of the paper managed to achieve an average score of ~906 after 2000 generations of training, believed to be the highest score in this environment to date. This utilised a higher-spec set-up (e.g. 10,000 episodes of training data, a population size of 64, a 64-core machine, 16 episodes per trial, etc.)

To visualise the current state of your Controller, simply run:

python model.py car_racing --filename ./controller/car_racing.cma.4.32.best.json --render_mode --record_video

--filename : the path to the json of weights that you want to attach to the controller

--render_mode : render the environment on your screen

--record_video : outputs mp4 files into the ./video folder, showing each episode

--final_mode : run a 100 episode test of your controller and output the average score.

Here’s a demo!

Step 10: Hallucinogenic Learning

That’s already pretty cool — but the next part of the paper is mind-blowingly impressive and I think has major implications for AI.

The paper goes on to show an amazing result in another environment, DoomTakeCover. The objective here is for the agent to avoid fireballs and stay alive for as long as possible.

The authors show that it is possible for the agent to learn how to play the game inside its own VAE / RNN generated hallucinogenic dream, rather than inside the environment itself.

The only required addition is that the RNN is also trained to predict the probability of being killed in the next time-step. This way, the VAE / RNN combination can be wrapped up as an environment in its own right and used to train the Controller. This is the core concept of a ‘World Model’.
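The idea can be sketched as a minimal ‘dream’ environment. Everything here is illustrative (DreamEnv, the placeholder model interface, the 0.5 done threshold, the +1 survival reward): it only shows how a learned model can stand in for the real environment, not how the paper implements it.

```python
import numpy as np

class DreamEnv:
    """A learned model wrapped up as an environment in its own right.

    `model` stands in for the trained VAE / RNN combination: given the
    current latent state z and an action, it returns the predicted next z
    and the predicted probability of being killed. (All names, thresholds
    and rewards here are illustrative assumptions.)
    """
    def __init__(self, model, z0):
        self.model = model
        self.z = z0

    def step(self, action):
        next_z, death_prob = self.model(self.z, action)
        self.z = next_z
        done = death_prob > 0.5          # threshold is an assumption
        reward = 0.0 if done else 1.0    # DoomTakeCover rewards survival
        return next_z, reward, done
```

The Controller can then be scored against this dream environment in exactly the same way it would be scored against the real one.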