Exploring a Pixel-Maze with Evolution Strategies

Intro

For a few years now it has been possible to learn to play Atari games directly from pixels with reinforcement learning. A more recent discovery is that evolution strategies are competitive for training deep neural networks on those tasks, and that even a simple genetic algorithm can work.

Those methods are fun to play around with. I have created a maze-world and trained a neural network to explore it. This task is much simpler than Atari but still gives interesting results.

I’m using Covariance-Matrix Adaptation (CMA-ES) and have also tried a few simpler methods. The focus here is on results, but code is available.

Maze

This is my generated 2D maze, featuring:

Food (red)

Agents (green)

Traces (dark green)

The agents cannot die. They all use a copy of the same controller. The goal is to improve this controller so that the agents (together) find as much food as possible.

This video shows a mediocre controller in five random worlds:

(html5 video)

▶| next world

Observations: The agents start moving in different directions, but later all drift in a random walk to the bottom right. Sometimes they eat food blobs. (They can actually move through walls with a very low probability.)

Agent Controller

The controller is a neural network with fixed topology and learned weights. The architecture is pretty standard except for the memory.

Inputs:

The agents are nearly blind. They only see walls and food next to them.

The agents don’t know their previous action. To keep moving in a new direction they must learn to use the external memory.

The trace count can detect crowded areas. It includes the agent’s own trace.

Outputs:

The softmax makes it simple to output a probability distribution over actions, like “60% down, 40% left”.

An output with high certainty (“99.9% down”) requires larger weights, which only appear later during training.
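The link between weight magnitude and output certainty can be seen directly in the softmax itself. This is a small illustration with made-up logits, not values from the trained network: scaling the same logits up sharpens the distribution.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability.
    e = np.exp(z - z.max())
    return e / e.sum()

# Made-up action logits; the real network's values are learned.
z = np.array([1.0, 0.6, 0.0, 0.0])
soft = softmax(z)        # mild preference, ~42% on the first action
sharp = softmax(10 * z)  # larger weights -> near-certain (~98%) choice
```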

The network has a single hidden layer and about 1‘000 parameters (weights and biases).
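As a rough sketch of such a controller (the sensor count, memory size and hidden size here are assumptions for illustration, not the exact values used in this post): a single flat parameter vector, which is what the optimizer sees, is unpacked into one hidden layer, a softmax over the actions, and a few memory cells that are fed back as inputs on the next step.

```python
import numpy as np

# Assumed sizes, for illustration only.
N_SENSORS = 12   # wall / food / trace inputs
N_MEMORY = 4     # external memory cells, fed back on the next step
N_HIDDEN = 20
N_ACTIONS = 4    # up / down / left / right

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class Controller:
    def __init__(self, params):
        # Unpack one flat parameter vector into weights and biases.
        n_in, n_out = N_SENSORS + N_MEMORY, N_ACTIONS + N_MEMORY
        shapes = [(n_in, N_HIDDEN), (N_HIDDEN,), (N_HIDDEN, n_out), (n_out,)]
        self.layers, i = [], 0
        for s in shapes:
            n = int(np.prod(s))
            self.layers.append(np.asarray(params[i:i + n]).reshape(s))
            i += n

    def step(self, sensors, memory):
        w1, b1, w2, b2 = self.layers
        h = np.tanh(np.concatenate([sensors, memory]) @ w1 + b1)
        out = h @ w2 + b2
        action_probs = softmax(out[:N_ACTIONS])  # e.g. "60% down, 40% left"
        new_memory = np.tanh(out[N_ACTIONS:])    # memory input for the next step
        return action_probs, new_memory

N_PARAMS = ((N_SENSORS + N_MEMORY) * N_HIDDEN + N_HIDDEN
            + N_HIDDEN * (N_ACTIONS + N_MEMORY) + (N_ACTIONS + N_MEMORY))
```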

The parameters are optimized directly. This setup violates some assumptions of standard (value-based) reinforcement learning, but a direct search over the parameters does not rely on those assumptions.

Training

Inspired by the Visual Guide to Evolution Strategies and by remarks in various papers I have used the Covariance-Matrix Adaptation Evolution Strategy (CMA-ES). It worked with practically no tuning. I have tried other black-box optimization methods (discussed below) but was unable to reach the same score.

I would dare to compare CMA-ES with the Random Forest classifier: a strong baseline that works out of the box and is difficult to beat, though it only works up to a few thousand parameters.

Here is a summary: CMA-ES tracks a Gaussian distribution (with full covariance) over the parameters. The user provides an initial estimate of the scale of each parameter (variance). CMA-ES will then sample a small population, evaluate it and sort by fitness. This rank information is used to update the distribution, and the process repeats.
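The sample→evaluate→sort→update loop can be sketched on a toy problem. Note that this toy version only moves the mean of an isotropic Gaussian with a simple decaying step size; real CMA-ES additionally adapts a full covariance matrix and the step size automatically, so this illustrates the principle rather than replacing a proper library.

```python
import numpy as np

# Toy version of the loop described above: sample a population around the
# current mean, evaluate, keep the best, and move the mean toward them.
def toy_es(fitness, n_params, sigma=0.5, popsize=20, n_elite=5,
           generations=150, seed=0):
    rng = np.random.default_rng(seed)
    mean = np.zeros(n_params)  # center of the search distribution
    for _ in range(generations):
        pop = mean + sigma * rng.standard_normal((popsize, n_params))
        scores = np.array([fitness(x) for x in pop])
        elite = pop[np.argsort(scores)[-n_elite:]]  # best candidates (maximizing)
        mean = elite.mean(axis=0)                   # rank-based mean update
        sigma *= 0.98                               # simple annealing
    return mean

# Toy fitness: maximize the negative squared distance to a known target.
target = np.array([1.0, -2.0, 0.5])
best = toy_es(lambda x: -np.sum((x - target) ** 2), n_params=3)
```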

For a better overview, see “What is an Evolution Strategy?” from the same guide. The general principle is easy to understand, but the details are a bit involved. With CMA-ES you really should use a library (pycma) rather than implement it from scratch.

Training plot:

Each generation is evaluated on a different set of random maps. There is a lot of fitness noise because of this. The moving average shows a slight upwards trend at the end.

It takes about an hour to reach an average score of 1‘000 on my low-end PC. It takes about ten hours on an AWS c5.9xlarge instance to reach the score above.

After 44‘000 evaluations (second video):

(html5 video)

▶| next world

Observations: The agents now keep changing directions. Sometimes they jitter around for a moment, especially when they should be turning. They can escape from dead ends, but it requires many attempts if the exit is narrow.

And finally, after 2‘000‘000 evaluations:

(html5 video)

▶| next world

Observations: The winning strategy follows the left wall at a short distance. In empty space it prefers straight lines with random left turns. At the end of a horizontal line of food the agent usually turns 180 degrees and finds its way back to the remaining food blob.

Discussion

CMA-ES is known to work well with tens to hundreds of parameters. With just 1‘000 parameters we are already approaching its limits. This may be a bit disappointing, because the learned strategy does not look really complex. I expect that the model could be improved to fit the task better, reducing the number of parameters.

To put the network size into context: six parameters are enough to learn the classic cart-pole balancing task; 5‘000 have been used to control a simple 2D-Hopper robot; and 33‘000 to play Atari games from pixels, though four million have also been used.

Other evolution strategies can scale up. But CMA-ES can achieve amazing results with smart use of a few hundred parameters, like this Muscle-Based Locomotion for Bipedal Creatures.

I have tried a few simpler methods: a simple GA (not shown here), the Cross-Entropy Method (CEM), and differential evolution. They all got stuck in local optima.

Here is an impression of my attempts to find good CEM parameters:

(The CMA-ES lines have more noise because they use a smaller population size of 23 per generation, while other methods use between 200 and 1000. The dots show the fitness per generation before filtering.)

This is a different (older) variant of the task which is faster to train. As you can see, it is not a trivial optimization problem.

One big challenge is the evaluation noise caused by the random maze generator. One lucky maze can have lots of food exactly where the agents are going. Some algorithms give this score too much credibility and use that lucky controller as the starting point for most mutations, until the score is beaten by an even luckier evaluation. I think this is why differential evolution failed here.
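A common way to soften this problem is to evaluate each candidate on several independently seeded mazes and aggregate the scores. This is a sketch of the idea, not the method used here; `run_episode` is a hypothetical function that builds a maze from a seed and returns the controller's score on it.

```python
import numpy as np

# Reduce evaluation noise by scoring a candidate on several random mazes.
def robust_fitness(params, run_episode, n_mazes=5, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    seeds = rng.integers(0, 2**31, size=n_mazes)
    scores = [run_episode(params, seed=int(s)) for s in seeds]
    # The median discounts a single lucky (or unlucky) maze more than the mean.
    return float(np.median(scores))
```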

I have also tried variations of the neural network:

The controller did surprisingly well when I removed the hidden layer. It finds good solutions much faster. Maybe the memory feedback loop helps with that. But the advantage of using a hidden layer becomes clear after 3‘000 evaluations.

The performance difference between 20 and 40 hidden neurons is not that large. With 20 neurons it trains faster (only 600 parameters). The defaults of CMA-ES worked great with 20 hidden nodes, but I have found that increasing the population size from 23 (default) to 200 helps with the larger network.

A good scaling of the initial weights can speed up the first learning phase a lot. The same is true for input and output normalization. But this advantage often shrinks down to zero long before the final score is reached. Initialization should become more important with additional hidden layers. It’s known to be critical for deep neural networks.
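For illustration, a typical scheme of this kind (my assumption of what a good scaling looks like here, not the exact scheme from this project): initial weights are divided by √fan_in so pre-activation magnitudes stay roughly constant, and inputs are shifted and scaled toward zero mean and unit variance.

```python
import numpy as np

# Assumed initialization and normalization scheme, for illustration only.
def init_weights(fan_in, fan_out, rng):
    # 1/sqrt(fan_in) scaling keeps pre-activations at magnitude ~1.
    return rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)

def normalize(x, mean, std):
    # Shift/scale inputs toward zero mean and unit variance.
    return (x - mean) / (std + 1e-8)

rng = np.random.default_rng(0)
w = init_weights(fan_in=400, fan_out=20, rng=rng)
x = rng.standard_normal(400)     # an already ~normalized input
pre_activations = x @ w          # magnitude ~1 regardless of fan_in
```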

Code

Code is available in my pixelcrawl GitHub repository. It is not very polished, but it is reasonably optimized (at the expense of flexibility).

The original version was pure Python. It was about 100x slower than the current Python/C++ mix. There is a lot of access to individual pixels and math on small arrays, something which Python is really slow at.

Bonus

Can it survive in empty space? Or will it always stick to the walls?

Let’s transfer the trained agent into a new environment:

(html5 video)

▶| next world

Observations: The learned strategy is somewhat robust and looks well randomized. It has trouble getting into certain one-pixel cavities. But it keeps exploring.