To put this in perspective, the 100M frames normally used in RL papers would take us 7.5 days of patient waiting.

Change 1: larger batch size + several steps

The first idea we usually apply to speed up Deep Learning training is a larger batch size. It is applicable to Deep Reinforcement Learning too, but you need to be careful. In the normal Supervised Learning case, the simple rule “a larger batch is better” is usually true: you increase the batch as far as your GPU memory allows, and a larger batch normally means more samples processed per unit of time, thanks to the enormous parallelism of the GPU.

The Reinforcement Learning case is slightly different. During training, two things happen simultaneously:

your network is trained to get better predictions on the current data,

your agent is exploring the environment.

As the agent explores the environment and learns about the outcome of its actions, the training data changes. For example, in a shooter your agent can run around randomly for a while, being shot by monsters and having only a miserable “death is everywhere” experience in the training buffer. But after a while, the agent can discover that it has a weapon it can use. This new experience can dramatically change the data we’re using for training.

RL convergence usually rests on a fragile balance between training and exploration. If we just increase the batch size without tweaking other options, we can easily overfit to the current data (in the shooter example above, your agent can start thinking that “die young” is the only option to minimize suffering and never discover the gun it has).

So, in 02_play_steps.py we take several environment steps in every training loop iteration and use a batch size multiplied by this number of steps. But we need to be careful with the number-of-steps parameter. More steps mean a larger batch size, which should lead to faster training, but at the same time, doing lots of steps between training updates can populate our buffer with samples obtained from an outdated network.
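The idea can be sketched without any RL library. The snippet below is a minimal illustration, not the code from 02_play_steps.py: `play_step` and `train_on_batch` are hypothetical stand-ins for the agent step and the optimizer step, and the base batch size of 32 is an assumed value.

```python
import random
from collections import deque

BASE_BATCH_SIZE = 32   # assumed base batch size, for illustration only
PLAY_STEPS = 4         # environment steps taken per training iteration


def play_step(step_idx):
    """Hypothetical stand-in for one environment step; returns a fake transition."""
    return (step_idx, random.random())


def train_on_batch(batch):
    """Hypothetical stand-in for one optimizer step; returns the batch size used."""
    return len(batch)


def run(iterations, play_steps=PLAY_STEPS, buffer_size=10_000, seed=0):
    random.seed(seed)
    buffer = deque(maxlen=buffer_size)
    batch_size = BASE_BATCH_SIZE * play_steps  # batch grows with the number of steps
    frames, trained_samples = 0, 0
    for _ in range(iterations):
        # 1. play several steps before touching the network
        for _ in range(play_steps):
            buffer.append(play_step(frames))
            frames += 1
        # 2. train once on a proportionally larger batch
        if len(buffer) >= batch_size:
            batch = random.sample(list(buffer), batch_size)
            trained_samples += train_on_batch(batch)
    return frames, trained_samples
```

Because the batch size scales with the number of steps, the number of training samples per played frame stays the same as in the single-step case; only the number of optimizer calls drops.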

To find the sweet spot, I fixed the random seed of the training process (you need to pass the seed to both NumPy and PyTorch) and trained with various numbers of steps.
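A minimal sketch of such seeding, assuming reasonably recent NumPy and PyTorch versions. `seed_everything` is a hypothetical helper name, not a function from the original code.

```python
import random

import numpy as np
import torch


def seed_everything(seed):
    """Seed the Python, NumPy and PyTorch random generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        # explicit CUDA seeding; redundant on recent PyTorch versions but harmless
        torch.cuda.manual_seed_all(seed)


seed_everything(123)
first = (np.random.rand(3).tolist(), torch.rand(3).tolist())
seed_everything(123)
second = (np.random.rand(3).tolist(), torch.rand(3).tolist())
print(first == second)
```

Depending on the environment, its own random generator may need seeding as well, and some GPU operations can remain nondeterministic even with fixed seeds.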

steps=1: speed 154 f/s (obviously, the same as the original version)

steps=2: speed 200 f/s (+30%)

steps=3: speed 212 f/s (+37%)

steps=4: speed 227 f/s (+47%)

steps=5: speed 228 f/s (+48%)

steps=6: speed 232 f/s (+50%)

The convergence dynamics are almost the same (see the image below), but the speed increase saturates at around 4 steps, so I decided to stick to this number for further experiments.

Runs with steps varying from 1 to 6

OK, we’ve got a +47% performance increase.
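The saturation around 4 steps can be checked from the measured speeds listed above by looking at the marginal gain of each additional step. This is plain arithmetic on the reported figures, nothing more:

```python
# frames per second for each number of steps, taken from the runs listed above
speeds = {1: 154, 2: 200, 3: 212, 4: 227, 5: 228, 6: 232}


def marginal_gains(speeds):
    """Extra frames per second gained by each additional step."""
    steps = sorted(speeds)
    return {s: speeds[s] - speeds[prev] for prev, s in zip(steps, steps[1:])}


gains = marginal_gains(speeds)
for steps, gain in gains.items():
    print(f"steps={steps}: +{gain} f/s over steps={steps - 1}")
```

Going beyond 4 steps adds only a few frames per second, while every extra step makes the samples in the buffer a little more stale.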

Change 2: play and train in separate processes

In this step we’re going to look at our training loop, which basically repeats the following steps:

play N steps in the environment using the current network to choose actions,

put observations from those steps into the replay buffer,

randomly sample a batch from the replay buffer,

train on this batch.

The purpose of the first two steps is to populate the replay buffer with samples from the environment (each sample is an observation, action, reward and next observation). The last two steps train our network.
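To make the data flow concrete, here is a minimal sketch of a replay buffer holding such samples. It is an illustration only, not the buffer implementation used by the original code; the `Experience` and `ReplayBuffer` names are assumptions.

```python
import random
from collections import deque, namedtuple

# one sample from the environment, as described above
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])


class ReplayBuffer:
    def __init__(self, capacity):
        # a bounded deque drops the oldest transitions first
        self.buffer = deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def append(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size, rng=random):
        # uniform sampling without replacement
        return rng.sample(list(self.buffer), batch_size)
```

The first two steps of the loop only call `append`, and the last two only call `sample`, which is what makes the two halves of the loop easy to separate.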

The diagram below illustrates these steps and how they interact with the environment, the DQN on the GPU and the replay buffer.

Serial version

As we can see, the environment is used only by the first step, and the only connection between the top and bottom halves of our training loop is the replay buffer. Due to this data independence, we can run the two halves as parallel processes:

the first one will communicate with the environment, feeding the replay buffer with fresh data,

the second will sample training batch from the replay buffer and perform training.

Both activities should run in sync, to keep the training/exploration balance we discussed in the previous section.

This idea is implemented in 03_parallel.py, which uses the torch.multiprocessing module to run playing and training in parallel while both can still work with the GPU concurrently. To minimize modifications to other classes, only the first step (environment communication) was moved into a separate process. The obtained observations are transferred to the training loop using the Queue class.
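The producer/consumer structure can be sketched with the standard `multiprocessing` module, whose API `torch.multiprocessing` mirrors. This is not the code from 03_parallel.py: the function names and the fake transitions are assumptions, and the bounded queue is just one simple way to stop the player from running far ahead of training.

```python
import multiprocessing as mp
import random
from collections import deque


def play_func(queue, total_steps, seed):
    """Child process: 'plays' the environment and pushes transitions to the queue."""
    rng = random.Random(seed)
    for step in range(total_steps):
        queue.put((step, rng.random()))  # fake (observation, reward) pair
    queue.put(None)                      # sentinel: no more data


def train_loop(total_steps=200, batch_size=32, seed=0):
    queue = mp.Queue(maxsize=64)  # bounded, so put() blocks when training lags behind
    player = mp.Process(target=play_func, args=(queue, total_steps, seed))
    player.start()
    buffer = deque(maxlen=1000)
    rng = random.Random(seed)
    train_steps = 0
    while True:
        item = queue.get()        # blocks until the player has produced a transition
        if item is None:
            break
        buffer.append(item)
        if len(buffer) >= batch_size:
            batch = rng.sample(list(buffer), batch_size)
            train_steps += 1      # a real loop would run an optimizer step on `batch`
    player.join()
    return len(buffer), train_steps


if __name__ == "__main__":
    print(train_loop())
```

Only the environment side lives in the child process, so the training loop keeps its original shape and simply reads from the queue instead of calling the environment directly.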

Parallel version

Benchmarking this new version shows an impressive 395 frames/s, which is a 74% increase over the previous version and a 156% increase over the original version of the code.

Change 3: async cuda transfers

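The next step is simple: every time we call the cuda() method of a Tensor, we pass the async=True argument, which disables waiting for the transfer to complete. (In PyTorch 0.4 and later this argument is named non_blocking, since async became a reserved word in Python 3.7.) It won’t give you a very impressive speed-up, but it sometimes helps a little and is very simple to implement.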

This version is in the file 04_cuda_async.py, and the only difference is passing cuda_async=True to the calc_loss function.
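A hedged sketch of what such a flag can control, written against current PyTorch releases, where the option is called `non_blocking`. The `to_device` helper and the batch shape are assumptions for illustration, not the actual body of calc_loss.

```python
import torch


def to_device(batch, device, cuda_async=False):
    """Move a CPU tensor to `device`, optionally without blocking the host."""
    if device.type == "cuda" and cuda_async:
        # asynchronous host-to-GPU copies need page-locked (pinned) host memory
        batch = batch.pin_memory()
    # on a CPU device non_blocking has no effect, so the call is safe everywhere
    return batch.to(device, non_blocking=cuda_async)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
states = torch.zeros(32, 4, 84, 84)  # assumed batch of stacked Atari frames
states_v = to_device(states, device, cuda_async=True)
print(states_v.device)
```

Pinning memory has its own cost, so whether this helps depends on the batch size and on how much other work overlaps with the copy.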

After benchmarking, I got a training speed of 406 frames/s, which is a 3.5% speed-up over the previous step and a 165% increase over the original DQN.

Change 4: latest Atari wrappers

As I’ve said before, the original version of DQN used some old Atari wrappers from the OpenAI baselines project. Several days ago those wrappers were changed in a commit named “change atari preprocessing to use faster opencv”, which is definitely worth trying.

Here is the new code of the wrappers in the baselines repo. The next version of the DQN is in 05_new_wrapper.py. As I haven’t pulled the new wrappers into the ptan library, they live in a separate lib in the examples.

The benchmarking result is 484 frames/s, which is an 18% increase over the previous step and a final 214% gain over the original version.
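To relate these throughput figures to wall-clock time, the small calculation below converts each reported speed into the number of days needed for 100M frames. The labels are informal names for the versions discussed in this post, and since the inputs are rounded, the computed percentages may differ by a point from the ones quoted in the text.

```python
SECONDS_PER_DAY = 24 * 60 * 60
TOTAL_FRAMES = 100_000_000

# frames per second reported for each version in this post
speeds = {
    "original": 154,
    "steps=4": 227,
    "parallel": 395,
    "cuda async": 406,
    "new wrappers": 484,
}


def days_for(frames, fps):
    """Wall-clock days needed to process `frames` at `fps` frames per second."""
    return frames / fps / SECONDS_PER_DAY


def gain_percent(fps, baseline):
    """Relative speed-up over `baseline`, in percent."""
    return (fps / baseline - 1) * 100


for name, fps in speeds.items():
    gain = gain_percent(fps, speeds["original"])
    days = days_for(TOTAL_FRAMES, fps)
    print(f"{name:>12}: {fps} f/s, +{gain:.0f}% vs original, {days:.1f} days")
```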

Summary

With several not very complicated tricks, we’ve got a more than threefold increase in the speed of DQN, without sacrificing readability or adding extra complexity to the code (the training loop is still less than 100 lines of Python code). The latest version is now able to reach a score of 18 in Pong in 20–30 minutes, which opens lots of new possibilities to experiment with other Atari games, as 484 frames per second means less than 2.5 days to process 100M observations.

If you know more things that can increase the performance of PyTorch code, please leave a comment. I am really interested to know about them.

Thanks for reading!