In part 1 we used a random search algorithm to “solve” the cartpole environment. This time we are going to take things to the next level and implement a deep q-network. The OpenAI Gym environment is one of the most fun ways to learn more about machine learning. Reinforcement learning and neural networks in particular can be applied perfectly to the benchmark and Atari games collection that is included. Every environment has multiple featured solutions, and often you can find a writeup on how to achieve the same score. By looking at others’ approaches and ideas you can improve yourself quickly in a fun way.

In part 1 we introduced the Gym environment and looked at a “random search” algorithm. Hopefully you were able to add something to this algorithm, and got some more experience with OpenAI Gym. In part 2 we are going to take a look at reinforcement learning algorithms, specifically the deep q-networks that are all the hype lately.

Background

Q-learning is a reinforcement learning technique that tries to predict the reward of a state-action pair. For the cartpole environment the state consists of four values, and there are two possible actions. For a certain state S we can predict the reward if we were to push left, Q(S, left), or right, Q(S, right).
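As a quick refresher, here is a minimal sketch (my own, not from the notebook) that inspects these spaces with Gym:

import gym

env = gym.make('CartPole-v0')
print(env.observation_space)  # Box(4,): cart position, cart velocity, pole angle, pole velocity at tip
print(env.action_space)       # Discrete(2): 0 pushes the cart left, 1 pushes it right

state = env.reset()
print(state)                  # the four values that make up a state S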

In the Atari game environments you get a reward of 1 every time you score a point. This scoring can happen when you hit a block in Breakout, an alien in Space Invaders, or eat a pellet in Pac-Man. In the cartpole environment you get a reward for every frame the pole is still standing on the cart. The trick of q-learning is that it not only considers the direct reward, but also the expected future reward. After applying action a we enter state s' and take the following into account:

- The reward r we obtained by performing this action.

- The expected maximum reward max_a' Q(s', a'); in the cartpole environment this is max(Q(s', left), Q(s', right)).

We combine this into a neat formula where we say that the predicted value Q(s, a) should be:

Q(s, a) = r + γ · max_a' Q(s', a')

Where γ is the discount factor. Taking a small γ (for example 0.2) means that you don’t really care about long-term rewards, a large γ (0.95) means that you care a lot about the long-term rewards. In our case we do care a lot about long-term rewards, so we take a large γ.
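To make the formula concrete, here is a small sketch that computes this target value in Python (the Q-values below are made up for illustration):

import numpy as np

gamma = 0.95  # discount factor: we care a lot about long-term rewards

reward = 1.0  # cartpole gives a reward of 1 for surviving this frame

# Hypothetical predicted Q-values for the next state s', one per action
q_next = np.array([0.8, 1.4])  # [Q(s', left), Q(s', right)]

# The q-learning target: direct reward plus discounted maximum future reward
target = reward + gamma * np.max(q_next)
print(target)  # 1.0 + 0.95 * 1.4 = 2.33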

This notebook can be found in my prepared Docker environment. If you did not install Docker yet, make sure you do that first. To run this environment, type the following in your terminal:

docker run -p 8888:8888 rmeertens/tensorflowgym

Then open localhost:8888 in your browser and navigate to the TRADR folder.

Let’s apply our knowledge of q-learning to the same environment we tried last time: the CartPole environment.
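Before diving into the notebook, here is a rough sketch (assuming Keras; the notebook itself may structure this differently) of what a q-network for this environment could look like: four state values in, two predicted Q-values out.

from keras.models import Sequential
from keras.layers import Dense

# Map the four cartpole state values to two Q-values, one per action
model = Sequential()
model.add(Dense(16, activation='relu', input_shape=(4,)))
model.add(Dense(2, activation='linear'))  # Q(S, left) and Q(S, right)

# Mean squared error pushes the predictions towards the q-learning targets
model.compile(optimizer='adam', loss='mse')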