This month OpenAI published the paper “Evolution Strategies as a Scalable Alternative to Reinforcement Learning” by Tim Salimans, Jonathan Ho, Xi Chen, and Ilya Sutskever. It shows that Evolution Strategies (ES) can be a strong alternative to Reinforcement Learning (RL), with a number of advantages: ease of implementation, invariance to episode length and to sparse-reward settings, better exploration behaviour than policy gradient methods, and ease of scaling in a distributed setting.
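To make the method concrete, here is a minimal NumPy sketch of the core ES update the paper builds on: perturb the parameter vector with Gaussian noise, score each perturbation by its reward, and move the parameters along the reward-weighted noise. The toy reward function and the hyperparameter values below are illustrative choices, not the paper's:

```python
import numpy as np

def evolution_strategies(f, theta, alpha=0.01, sigma=0.1, npop=50, iterations=300):
    """Core ES update: estimate a search gradient from the rewards of
    Gaussian-perturbed parameter vectors -- no backpropagation involved."""
    for _ in range(iterations):
        eps = np.random.randn(npop, theta.size)          # one perturbation per worker
        rewards = np.array([f(theta + sigma * e) for e in eps])
        # standardize rewards so the update is invariant to their scale
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        theta = theta + alpha / (npop * sigma) * (eps.T @ adv)
    return theta

# toy usage: maximize a simple concave "reward"
target = np.array([0.5, -0.3, 0.8])
reward = lambda w: -np.sum((w - target) ** 2)
print(evolution_strategies(reward, np.zeros(3)))
```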

ES scales extremely well with the number of CPUs available, demonstrating linear speedups in run time even with over a thousand workers. Running on a computing cluster of 80 machines and 1,440 CPU cores, the authors’ implementation was able to train a 3D MuJoCo humanoid walker in only 10 minutes (A3C on 32 cores takes about 10 hours). Using 720 cores they also obtained performance comparable to A3C on Atari while cutting training time from 1 day to 1 hour. The communication overhead of implementing ES in a distributed setting is lower than for reinforcement learning methods such as policy gradients and Q-learning.
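The reason the overhead is so low is the shared-seed trick described in the paper: workers exchange only scalar returns, and each side regenerates the Gaussian perturbations from known random seeds instead of transmitting parameter-sized vectors. A rough single-process sketch of that idea (function names and hyperparameters here are mine, not the paper's):

```python
import numpy as np

def worker_return(f, theta, seed, sigma=0.1):
    """A worker evaluates one perturbation; the perturbation itself is never
    sent over the wire, only the seed that identifies it and a scalar return."""
    noise = np.random.RandomState(seed).randn(theta.size)
    return f(theta + sigma * noise)

def master_update(theta, seeds, returns, alpha=0.01, sigma=0.1):
    """The master regenerates every worker's noise from its seed and applies
    the ES update, so communication is a few bytes per worker per step."""
    adv = (returns - returns.mean()) / (returns.std() + 1e-8)
    grad = np.zeros_like(theta)
    for seed, a in zip(seeds, adv):
        grad += a * np.random.RandomState(seed).randn(theta.size)
    return theta + alpha / (len(seeds) * sigma) * grad

# toy end-to-end step: 100 "workers", each identified only by a seed
theta = np.zeros(5)
f = lambda w: -np.sum(w ** 2)
seeds = list(range(100))
returns = np.array([worker_return(f, theta, s) for s in seeds])
theta = master_update(theta, seeds, returns)
```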

By not requiring backpropagation, black box optimizers (those that make no assumptions about the structure of the function being optimized) reduce the amount of computation per episode by about two thirds, and memory usage by potentially much more. This partly offsets ES's lower data efficiency: to match the performance of a good A3C implementation on most Atari environments, ES required 3x to 10x more data.

Additionally, black box optimization methods are uniquely suited to capitalize on advances in low precision hardware for deep learning. Low precision arithmetic, such as in binary neural networks, can be performed much more cheaply than high precision arithmetic. When optimizing such low precision architectures, biased low precision gradient estimates can be a problem for gradient-based methods. Similarly, specialized hardware for neural network inference, such as TPUs, can be used directly when performing optimization with ES, whereas their limited memory usually makes backpropagation impossible.

ES even allows us to incorporate non-differentiable elements into the architecture, such as modules that use hard attention.
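As a toy illustration of why this works, consider a policy with a hard argmax attention step: no gradient is defined through the selection, but ES only ever observes the episode return, so the discontinuity is harmless. Everything below (names, shapes, the architecture itself) is an invented example, not taken from the paper:

```python
import numpy as np

def hard_attention_policy(params, observation, k=8):
    """A toy policy with a non-differentiable hard-attention step: argmax
    picks a single slot, so backprop through the selection is impossible,
    but ES can still tune `params` from episode returns alone."""
    W_score, W_out = params
    slots = observation.reshape(k, -1)         # split observation into k slots
    scores = slots @ W_score                   # score each slot
    chosen = slots[np.argmax(scores)]          # hard, non-differentiable selection
    return np.tanh(chosen @ W_out)             # action from the attended slot

obs = np.random.randn(8 * 4)                   # 8 slots of 4 features each
params = (np.random.randn(4), np.random.randn(4, 2))
print(hard_attention_policy(params, obs))
```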

You can read more about this implementation in OpenAI’s accompanying blog post.