fig. 1: Screenshot of my React app using the neural networks computed here.

Easily get into Reinforcement Learning with JavaScript applying deep Q-learning to a simple game: connect4. I built my first AI thanks to this!

Why connect 4?

When I decided to learn about Reinforcement Learning, I thought that I could begin with Generals.io. After all, it seemed quite easy to build unbeatable AI in Chess, Go and DOTA II. How could I fail to do the same?

Being a noob at both Neural Networks and Reinforcement Learning, I finally decided to start small: connect 4. First thing I was told?

Why connect 4? You can easily brute force it!

True. You can find a complete guide here. But that’s not the point. I wanted to try, fail and eventually succeed in building a learning AI.

What is the famous Q-learning?

Q-learning is the first algorithm given back by Google when looking for reinforcement learning. I won’t explain the whole concept, you can find a great introduction here. I’ll just give a quick summary here.

Q-learning is a solution to a MDP, aka Markov Decision Process. A MDP is defined by an agent, performing in an environment by means of actions. The environment can be in different states.

Given those 3, the idea of Q-learning is to reward the agent if the action it took led him to a better state. Or punish it if not. Those rewards will be used to build the function Q which is an estimate of the final gain, given the state s and the action a.

To ensure that a play is considering the potential future outcome of a game, the factor gamma is used to leverage the maximum Q-value of the next state s+1 among the possible actions.

Temporal Difference Learning

In connect 4, it is not possible to know if an action was a good move or a bad one. You have to wait until the end of the game to know if the succession of actions you took led you to the victory. It is the idea behind Temporal Difference Learning (TD-learning): the agent gets a reward when it wins the game, this reward is then distributed to the previous actions, with a discount.

With Q-learning and TD-Learning combined, each time the agent is rewarded — or punished — the function Q is updated with the following algorithm:

eq. 1: Update of Q-value for a play that happened n plays before the end of the game.

After several iterations, you learn the function Q (got it?).

We need to go deeper!

For simple games (a few actions, a few more states), you can evaluate the exact function Q as a 2D array. For more complex games, you have to approach the function Q. This is where neural networks come into play. They can be flexible enough for this task.

table 1: Possibilities in tic-tac-toe and connect 4.

The neural network will take the state of the environment as an input and will output an array corresponding to the available actions.

Your mission is to train the neural network to approach the Q function that will make the best choices.

In practice

The question is now: how to put this into use for connect 4?

Markov Decision Process

For connect 4, the environment is the board. The states are the different possibilities of this board being filled with yellow and red chips. There are at most 7 actions: drop a chip in each of the 7 columns.

What is my “labeled” data?

I don’t have a dataset of games played by humans. The idea of an AI learning just by playing against itself was kind of sexy so I decided to go for it: the training AI is playing successively as player 1 and player 2, saving the board states and actions. The more it learns, the more interesting the games will be.

To be sure the final model has explored enough in the game, there has to be a big part of randomness: I implemented the classic ε-greedy exploration strategy.

The rewards

I first decided to give a reward of +100 to the agent for winning the game. I trained my network and then played against it. This is when I realized it was not trained to prevent its opponent from winning, just to win when it could. For most cases, when a player wins, either the opponent could have prevented it, either it allowed the winner to win. For the former, I give the agent a +75 reward. For the latter, a punishment, that is to say a -75 reward.

Those rewards are completely arbitrary, I haven’t found a general law for picking them.

Neurons

I chose JavaScript because I wanted my connect 4 app to work autonomously and to train on client-side while playing. There are not a lot of deep-learning libraries in JavaScript but I wanted to try two implementations:

A dense neural network with SynapticJS (connectd to the 42 slots of the board).

A convolutional neural network (CNN) with ConvNetJS.

I had the intuition that a CNN would be quicker to train and smarter, for this reason: consider the following situation representing 2 boards for 2 different games.

fig. 2: The AI should react the same way as yellow player: play to complete the yellow diagonal.

Playing as the yellow player, a person would have the same reaction, regardless of the board: win. On the opposite, a dense neural network would see 2 completely different situations and would need to learn to play the same way.

I had the sense that a convolutional neural network would detect the same pattern and activate the same way the right column. No need to learn two times the same information.

The training

My repository is available on GitHub. I created a few APIs in order to:

Interface with the neural networks whichever the chosen implementation: NeuralNetwork;

Play a whole game of connect4: Game;

Train the network using Q-Learning: QLearning (see file below);

gist 1: QLearning.js file that implements learning on all the plays of the winner.

They are completed with a Helper file and an index.js file to launch the training:

gist 2: Hyperparameters in index.js.

Results

To reach the best level, I trained the same network several batches of around 200k games. Globally, I started with a lot of exploration (9/10 of the plays are random) to reach an even ratio at the end of the training — around 800k games played. On every batch, the learning rate was decreasing from 1e-5 to 1e-7.

The training is executed with node on a single CPU and takes at most 1 hour per batch for the dense neural network and a few hours for the CNN.

I like to say that the trained network has reached the level of a young teenager that has played quite often.

How did I measure the convergence?

To be sure my model was correctly learning, I decided to regularly use the model to predict its next play on a board where it could obviously win. I used 3 situations where it could win: horizontally, vertically and diagonally. Here is a chart displaying the typical convergence I observed.

fig. 3: Typical convergence of the probability to take the 3 obvious shot. While it was always clear the AI was learning correctly on vertical and horizontal situations, it sometimes was not converging on the diagonal one.

Try it yourself!

I created a React App to play against my newborn AI. When you play, the AI learns when it loses, to be able to adapt to your strategy. Thanks to JavaScript, the learning takes place in your browser, it’s quite immediate.

My learnings

I learnt a lot doing this side-project: experimenting things instead of overthinking them; being patient during a training; dividing my work to progress a bit everyday.

I didn’t meet the results I hoped with the CNN. It was quite frustrating: both networks have more or less the same level, and the CNN training takes way more time. I now need to look deeper into convolutions and understand how to improve them.

What I haven’t experienced

There are a few things that I didn’t try on this project. If I had more time, I would explore one of them. If you are interested, here is the list:

Benefit from the symmetry of the board (both playing and learning)

Train my AI against a brute force/exploratory model (Monte-Carlo, minimax)

Train my models against the first ones (for faster training)

Conduct a hyper-parameters optimization

Define a true performance measure of my AI “smartness”

For the CNN: Analyze the network to follow activated neurons and see the convolution filter that activated it.

For the CNN: Change the parameters of the filters (padding, size, stride)

Try not to lose!

Make your own mind playing against my connect4 AI on Sicara’s website! You can see its limitations but hey, it’s learning live! And it’s my first baby!

You have Reinforcement Learning projects? Take a look at Keras-RL for easier training! And if you liked this article, feel free to contact me.

Thanks to Damien Peltier and Adil Baaj.