A learning agent performs actions within an environment, and each action puts the environment into a certain state. In turn, the environment returns a reward to the agent. The goal of the learning agent is to maximize the reward it collects. This reward-driven approach to learning reminds us of Pavlov’s experiments with his dog, which is really interesting. However, let’s not drift far from the subject. In order to formulate these systems mathematically, we use Markov Decision Processes, or MDPs. They are represented as a tuple of four elements:

S – Set of states. At each time step t, the agent gets the environment’s state St, where St ∈ S.

A – Set of actions that the agent can take.

Pa – Or Pa(s, s’), the probability that action a taken in state s at time step t will result in state s’ at time step t+1.

Ra – Or Ra(s, s’), the expected reward received after going from state s to state s’ as a result of action a.
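To make the four elements concrete, here is a minimal sketch of an MDP written out as plain Python data. The two-state "cold/hot" world, its probabilities and its rewards are all made up for illustration; only the (S, A, P, R) structure comes from the definition above.

```python
import random

# A hypothetical two-state MDP, written out as the tuple (S, A, P, R).
S = ["cold", "hot"]      # set of states
A = ["wait", "heat"]     # set of actions

# P[(s, a)] maps every possible next state s' to its probability.
P = {
    ("cold", "wait"): {"cold": 1.0},
    ("cold", "heat"): {"cold": 0.2, "hot": 0.8},
    ("hot", "wait"): {"cold": 0.5, "hot": 0.5},
    ("hot", "heat"): {"hot": 1.0},
}

# R[(s, a, s')] is the expected reward of that transition; missing entries count as 0.
R = {
    ("cold", "heat", "hot"): 1.0,
    ("hot", "wait", "hot"): 1.0,
    ("hot", "heat", "hot"): 0.5,
}


def step(s, a, rng=random):
    """Sample the next state s' from P[(s, a)] and return (s', reward)."""
    next_states = list(P[(s, a)].keys())
    weights = list(P[(s, a)].values())
    s_next = rng.choices(next_states, weights=weights, k=1)[0]
    return s_next, R.get((s, a, s_next), 0.0)
```

Calling step("cold", "heat") returns either ("hot", 1.0) or ("cold", 0.0), depending on the sampled transition.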

MDPs can be represented using the image below:

One approach to solving problems formulated as MDPs, and by far the most popular one, is Q-Learning.

Q-Learning vs Double Q-Learning

We already have multiple articles on our blog about Q-Learning, but let’s have a quick roundup. Q-Learning is based on estimating the Q-Value, which is the value of taking action a in state s under policy π. Some consider this the quality of action a in state s. A larger Q-Value indicates a bigger reward for the learning agent. The policy in this case defines which state–action pairs are visited and updated during the training process. In each epoch of the training process, the agent updates the Q-Values of the state-action combinations it visits. That is how it builds a matrix, or table, that stores a Q-Value for each state and action. The process of updating these values is described by the formula:

Q(St, At) ← Q(St, At) + α [Rt+1 + γ max Q(St+1, a) − Q(St, At)]

The important part of the formula above is max Q(St+1, a). Note the t+1 subscript. This means that the current Q-Value is based on the Q-Value of the state the environment will be in after the action is performed. Spooky indeed. How does this work? Well, in the beginning we initialize the Q-Values for states St and St+1 to some random values. During the first training iteration, we update the Q-Value in state St based on the reward and on that random Q-Value in state St+1.

To make it even clearer, we can break Q-Learning down into steps. It would look something like this:

1. Initialize all Q-Values in the Q-Table arbitrarily, and the Q-Value of the terminal state to 0:

   Q(s, a) = n, ∀s ∈ S, ∀a ∈ A(s)
   Q(terminal-state, ·) = 0

2. Pick the action a from the set of actions A(s) defined for that state, as chosen by the policy π.
3. Perform action a.
4. Observe the reward R and the next state s’.
5. Of all possible actions from the state s’, select the one with the highest Q-Value – a’.
6. Update the value for the state using the formula:

   Q(s, a) ← Q(s, a) + α [R + γ max Q(s’, a’) − Q(s, a)]

7. Repeat steps 2-6 for each time step until the terminal state is reached.
8. Repeat steps 2-7 for each episode.
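The steps above can be sketched in a few lines of plain Python. The environment here is a made-up five-state corridor (start at state 0, reward of 1 for reaching state 4), and the ε-greedy policy and the hyperparameter values are assumptions chosen only for illustration, not something prescribed by the algorithm.

```python
import random

N_STATES = 5        # hypothetical corridor: states 0..4, state 4 is terminal
ACTIONS = [0, 1]    # 0 = move left, 1 = move right


def env_step(s, a):
    """Toy environment: returns (next_state, reward, done)."""
    s_next = max(0, s - 1) if a == 0 else s + 1
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize the Q-Table (zeros here); the terminal row stays 0.
    Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):                      # step 8: repeat per episode
        s, done = 0, False
        while not done:                            # step 7: repeat per time step
            # Step 2: pick an action with an epsilon-greedy policy.
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                best = max(Q[s])
                a = rng.choice([x for x in ACTIONS if Q[s][x] == best])
            # Steps 3-4: perform the action, observe reward and next state.
            s_next, r, done = env_step(s, a)
            # Steps 5-6: bootstrap from the best next action and update.
            target = r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q
```

After training, "move right" has the higher Q-Value in every non-terminal state, which is the optimal policy for this toy corridor.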

However, this crucial part of the formula, max Q(St+1, a), is also the major flaw of Q-Learning. In general, Q-Learning performs poorly in some stochastic environments, and the max operator is the reason for that. Because of it, Q-Learning overestimates the Q-Values of certain actions: the maximum over noisy estimates tends to be higher than the maximum of the true values. This means that the algorithm can be tricked into believing that some actions are good even though they provide a lower reward in the end. Check out this article for further explanation.
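A small simulation illustrates the effect. In this sketch every action has a true value of exactly 0, and each estimate is that true value plus Gaussian noise. The number of actions, the noise level and the function name are all assumptions made for the demonstration.

```python
import random


def max_of_noisy_estimates(n_actions=5, noise_std=1.0, trials=2000, seed=0):
    """Average of max(estimates) when every action's true value is 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each estimate = true value (0) + zero-mean noise.
        estimates = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        total += max(estimates)
    return total / trials
```

Although the true maximum is 0, the average returned for several actions is clearly positive, while with a single action (n_actions=1) it stays close to 0. That positive gap is the overestimation the max operator introduces.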

The solution to this problem is Double Q-Learning. It builds on the idea that instead of using one estimator we can use two. In turn, this means that instead of using one Q-Value for each state-action pair, we should use two values – QA and QB. Double Q-Learning first finds the action a* that maximizes QA in the next state s’ – (QA(s’, a*) = max QA(s’, a)). Then it uses this action to get the value of the second Q-Value – QB(s’, a*). Finally, it uses QB(s’, a*) to update QA(s, a):

QA(s, a) ← QA(s, a) + α [R + γ QB(s’, a*) − QA(s, a)]

The same process is applied to QB. Here is what it looks like when we break the Double Q-Learning process down into steps:

Initialize QA, QB and the starting state s
Repeat
    Pick the action a based on QA(s, ·) and QB(s, ·), perform it, and get R and s’
    Choose Update(A) or Update(B) at random
    If Update(A)
        Pick the action a* = argmax QA(s’, a)
        Update QA: QA(s, a) ← QA(s, a) + α [R + γ QB(s’, a*) − QA(s, a)]
    If Update(B)
        Pick the action b* = argmax QB(s’, a)
        Update QB: QB(s, a) ← QB(s, a) + α [R + γ QA(s’, b*) − QB(s, a)]
    s ← s’
Until End
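The inner part of that loop might look like this in Python. This is only a sketch: the tables are plain lists of lists indexed as Q[state][action], the behaviour policy (ε-greedy on the sum of both tables) is one common choice rather than the only one, and the function names are invented for this example. Terminal states are assumed to keep all-zero rows, as in the Q-Learning steps earlier.

```python
import random


def choose_action(QA, QB, s, epsilon=0.1, rng=random):
    """Epsilon-greedy on QA(s, .) + QB(s, .) (an assumed behaviour policy)."""
    n_actions = len(QA[s])
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    combined = [QA[s][a] + QB[s][a] for a in range(n_actions)]
    return max(range(n_actions), key=lambda a: combined[a])


def double_q_update(QA, QB, s, a, r, s_next, alpha=0.1, gamma=0.9, rng=random):
    """Apply one Double Q-Learning update in place; return 'A' or 'B'."""
    # Pick at random which table is updated on this step.
    if rng.random() < 0.5:
        updated, other, name = QA, QB, "A"
    else:
        updated, other, name = QB, QA, "B"
    # Select the best next action with the table being updated...
    best = max(range(len(updated[s_next])), key=lambda x: updated[s_next][x])
    # ...but evaluate that action with the other table.
    target = r + gamma * other[s_next][best]
    updated[s][a] += alpha * (target - updated[s][a])
    return name
```

Splitting selection and evaluation between the two tables is what removes the upward bias: an action that looks good in one table only because of noise is unlikely to be overestimated in the other table as well.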

In the previous article, we compared these two algorithms in more detail, so make sure to check it out.

DQN and Double DQN

With recent advances in deep learning, researchers came up with the idea that Q-Learning can be combined with neural networks. That is how deep reinforcement learning, or Deep Q-Learning to be precise, was born. Instead of using Q-Tables, Deep Q-Learning, or DQN, uses two neural networks. In this architecture, the networks are feed-forward neural networks that are used to predict the best Q-Value. Because input data is not provided beforehand, the agent has to store previous experiences in a local memory called experience replay. This information is then used as input data.
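Experience replay can be sketched as a fixed-size buffer of transitions that is sampled at random during training. The class and method names below are invented for this example, and the tuple layout (state, action, reward, next_state, done) is an assumption, although a common one.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of past transitions (a hypothetical minimal version)."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently drops the oldest item when full.
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return up to batch_size transitions drawn without replacement."""
        k = min(batch_size, len(self.memory))
        return self.rng.sample(list(self.memory), k)

    def __len__(self):
        return len(self.memory)
```

Sampling at random, rather than replaying transitions in the order they happened, breaks up the strong correlation between consecutive steps of an episode.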

It is important to notice that DQNs don’t use supervised learning like the majority of neural networks. The reason for that is the lack of labels (or expected output). These are not provided to the learning agent beforehand, i.e. the learning agent has to figure them out on its own. Because every Q-Value depends on the policy, the target (expected output) is continuously changing with each iteration. This is the main reason why this type of learning agent doesn’t have just one neural network, but two of them. The first network, which is referred to as the Q-Network, calculates the Q-Value in the state St. The second network, referred to as the Target Network, calculates the Q-Value in the state St+1.

Speaking more formally, given the current state St, the Q-Network retrieves the action-values Q(St, a). At the same time, the Target Network uses the next state St+1 to calculate Q(St+1, a) for the Temporal Difference target. In order to stabilize the training of the two networks, on every N-th iteration the parameters of the Q-Network are copied over to the Target Network. Mathematically, a deep Q network (DQN) is represented as a neural network that, for a given state s, outputs a vector of action values Q(s, · ; θ), where θ are the parameters of the network. The Target Network, with parameters θ⁻, is the same as the Q-Network, but its parameters are copied every τ steps from the online network, so that θ⁻t = θt at that point, and are kept fixed on all other steps. The target used by DQN is then defined like this:

Yt = Rt+1 + γ max Q(St+1, a; θ⁻t)
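Stripped of the neural networks, the two mechanics described here are small. In the sketch below the Target Network's output for the next state is passed in as a plain list of action values, and the network parameters are stood in for by a dictionary. Both function names and the done flag for terminal states are assumptions made for this illustration.

```python
import copy


def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """Y = R + gamma * max_a Q(S', a; theta^-); no bootstrapping at terminal states."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def maybe_sync(step, online_params, target_params, tau=100):
    """Every tau steps return a copy of the online parameters, else keep the target's."""
    if step % tau == 0:
        return copy.deepcopy(online_params)
    return target_params


# Usage with made-up numbers:
print(dqn_target(1.0, [0.5, 2.0], gamma=0.5))   # 2.0
```

Returning a deep copy matters: if the Target Network shared the same parameter objects as the Q-Network, every training step would move the target as well and the stabilizing effect would be lost.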

A while back we implemented this process using Python and TensorFlow 2. You can check out that implementation here. We also implemented it with TF-Agents, and you can find that here.

The problem with DQN is essentially the same as with vanilla Q-Learning: it overestimates Q-Values. So, this concept was extended with the knowledge from Double Q-Learning, and Double DQN was born. It represents the minimal possible change to DQN. Personally, I think it is rather elegant how the authors were able to get most of the benefits of Double Q-Learning while keeping the rest of the DQN algorithm the same. The core of Double Q-Learning is that it reduces Q-Value overestimation by splitting the max operator into action selection and action evaluation. This is where the Target Network in the DQN algorithm plays a major role. No additional networks are added to the system. Instead, the Q-Network selects the greedy action and the Target Network is used to estimate its value. So, only the target changes in Double DQN:

Yt = Rt+1 + γ Q(St+1, argmax Q(St+1, a; θt); θ⁻t)

To sum it up, the weights of the second network from Double Q-Learning are replaced with the weights of the Target Network for the evaluation of the current greedy policy. The Target Network is still updated periodically by copying the parameters of the Q-Network.
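The difference between the two targets is easiest to see side by side. In this sketch the outputs of the Q-Network and the Target Network for the next state are given as plain lists of action values. The function names, the done flag and the numbers are assumptions for illustration only.

```python
def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """DQN: the Target Network both selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN: the Q-Network selects the action, the Target Network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


# Made-up action values for the next state S':
next_q_online = [1.0, 3.0]   # Q(S', a; theta)   -> greedy action is 1
next_q_target = [2.5, 2.0]   # Q(S', a; theta^-)

print(dqn_target(1.0, next_q_target, gamma=0.5))                        # 2.25
print(double_dqn_target(1.0, next_q_online, next_q_target, gamma=0.5))  # 2.0
```

With these numbers the Target Network's largest value (2.5) belongs to an action the Q-Network does not consider best, so Double DQN bootstraps from the smaller value 2.0 instead.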

Implementation

Prerequisites

This article contains two implementations of Double DQN. Both are done with Python 3.7 and use OpenAI Gym. The first implementation uses TensorFlow 2, and the second one uses TF-Agents. Make sure you have these installed in your environment:

Python 3.7

TensorFlow 2.1.0

TF-Agents

OpenAI Gym

If you need to learn more about TensorFlow 2, check out this guide and if you need to get familiar with TF-Agents, we recommend this guide.

In this article we use the famous CartPole-v0 environment:

In this environment, a pole is attached to a cart which moves along a track. The whole structure is controlled by applying a force of +1 or -1 to the cart, moving it left or right. The pole is in the upright position in the beginning, and the goal is to prevent it from falling. For every time step in which the pole doesn’t fall, a reward of +1 is provided. The episode ends when the pole is more than 12 degrees from vertical, or the cart moves more than 2.4 units from the center.

TensorFlow 2 Implementation

Let’s kick off this implementation with modules that we need to import: