Self-Play Fitness

To optimize an NN’s parameters with CMA-ES, you need to specify a fitness function: the value or quality you assign to a set of NN parameters. As an example, if you didn’t care about self-play, you could just play the NN against a minimax (perfect) player, say, 100 times and use the number of losses as the negative fitness. Then ask CMA-ES to find NN parameters that minimize the number of losses. I tried that. It works: it produces an NN player that always draws against minimax. To learn via “self-play”, however, without knowledge of the minimax solution or any other strategy hints, what, exactly, should the NN play against to determine its fitness?
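As a rough illustration of the minimax-based fitness described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the board encoding (+1, -1, 0), the tiny `linear_policy` standing in for the real NN, the alternation of who moves first, and all function names. It is not the code used for the experiments.

```python
import random
from functools import lru_cache

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return +1 or -1 if that side has three in a row, else 0."""
    for a, b, c in WIN_LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0

def legal_moves(board):
    return [i for i, v in enumerate(board) if v == 0]

@lru_cache(maxsize=None)
def minimax_value(board, player):
    """Game value for side +1 of `board` (a tuple) with `player` to move."""
    w = winner(board)
    if w != 0:
        return w
    moves = legal_moves(board)
    if not moves:
        return 0
    values = [minimax_value(board[:m] + (player,) + board[m + 1:], -player)
              for m in moves]
    return max(values) if player == 1 else min(values)

def minimax_move(board, player):
    """Perfect play with random tie-breaking, so repeated games can differ."""
    b = tuple(board)
    moves = legal_moves(b)
    random.shuffle(moves)
    pick = max if player == 1 else min
    return pick(moves, key=lambda m: minimax_value(
        b[:m] + (player,) + b[m + 1:], -player))

def linear_policy(params):
    """Hypothetical stand-in for the NN: 81 weights plus 9 biases."""
    def move(board, player):
        view = [v * player for v in board]  # board from the mover's side
        def score(m):
            return params[81 + m] + sum(params[9 * m + j] * view[j]
                                        for j in range(9))
        return max(legal_moves(board), key=score)
    return move

def play(first, second):
    """Play one game; `first` is side +1. Return the winner (+1, -1, or 0)."""
    board, player = [0] * 9, 1
    movers = {1: first, -1: second}
    while winner(board) == 0 and legal_moves(board):
        board[movers[player](board, player)] = player
        player = -player
    return winner(board)

def count_losses(agent, opponent, n_games=100):
    """Number of games `agent` loses, alternating who moves first (assumed)."""
    losses = 0
    for g in range(n_games):
        if g % 2 == 0:
            losses += play(agent, opponent) == -1
        else:
            losses += play(opponent, agent) == 1
    return losses

def fitness(params):
    """Value to minimize: losses of the parameterized player against minimax."""
    return count_losses(linear_policy(params), minimax_move)
```

A CMA-ES implementation would then be asked to minimize `fitness` over the 90-element parameter vector. With the `cma` Python package, for instance, that is typically an `ask()`/`tell()` loop on a `cma.CMAEvolutionStrategy` object.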

After much experimentation, I found that this worked: construct an “omniscient agent” for the NN player to compete against. What makes the omniscient agent (OA) omniscient is that it embeds a copy of the previous generation’s “xfavorite” NN and knows what move that NN would make for any board configuration. Before making a move, the OA simulates games against xfavorite to see which move would work best against it.

By knowing xfavorite’s moves, OA can play better than xfavorite without actually knowing how to play tic-tac-toe or doing any tree search. It’s pretty simple.
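One way to picture the OA is the sketch below. The text does not say how the simulated games are played out, so this sketch assumes the simplest option: for each candidate move, the OA finishes the game many times with random moves for itself while the embedded player answers as it always would, then picks the candidate with the best average result. The names `rollout` and `make_omniscient_agent`, the board encoding, and the rollout count are all hypothetical.

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return +1 or -1 if that side has three in a row, else 0."""
    for a, b, c in WIN_LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0

def legal_moves(board):
    return [i for i, v in enumerate(board) if v == 0]

def rollout(board, player, oa_side, embedded, rng):
    """Finish one game from `board` with `player` to move.

    The OA's side moves at random (an assumption); the other side follows
    the embedded player. Returns +1 for an OA win, 0 for a draw, -1 for a loss.
    """
    board = list(board)
    while winner(board) == 0 and legal_moves(board):
        if player == oa_side:
            move = rng.choice(legal_moves(board))
        else:
            move = embedded(board, player)
        board[move] = player
        player = -player
    return winner(board) * oa_side

def make_omniscient_agent(embedded, n_rollouts=30, seed=0):
    """Build an agent that picks moves by simulating games against `embedded`."""
    rng = random.Random(seed)

    def move(board, player):
        best_move, best_score = None, None
        for m in legal_moves(board):
            trial = list(board)  # copy, so the caller's board is untouched
            trial[m] = player
            total = sum(rollout(trial, -player, player, embedded, rng)
                        for _ in range(n_rollouts))
            score = total / n_rollouts
            if best_score is None or score > best_score:
                best_move, best_score = m, score
        return best_move

    return move
```

Note that nothing in `make_omniscient_agent` encodes tic-tac-toe strategy beyond the rules needed to play a game out: all of its strength comes from being able to query `embedded` for any board.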

As a test of the OA’s ability, I took it aside, embedded the minimax player in it (instead of the xfavorite NN) and had it play 100 games against the minimax player. The OA was able to draw the minimax player every time. Knowing that, we can say, “If the NN converged to minimax play, the OA would draw it”. In other words, the optimization won’t get stuck on account of the OA being too weak at tic-tac-toe. It could get stuck for other reasons, though.

Note that the OA isn’t useful for actual competitive play on its own because you could never embed an opponent in OA and call it a fair or meaningful game. But that doesn’t mean you can’t train an agent against the OA. That’s what I tried — and it worked. To track its progress, I paused training every 10 minutes, played the current xfavorite against xo’s minimax player 100 times, and recorded the number of losses:

(And yes, Run #4 was the best. You may see all the runs, though.) Notice how it gets to exactly zero losses and stays there as the optimization continues. Training against OA is forcing the agent towards perfect, minimax play.

What’s more enlightening, though, is all of the things that *didn’t* work.