Welcome to GradientCrescent’s special series on reinforcement learning. This series will serve to introduce some of the fundamental concepts in reinforcement learning using digestible examples, primarily obtained from the “Reinforcement Learning” text by Sutton et al. and the University of Alberta’s “Fundamentals of Reinforcement Learning” course. Note that code in this series will be kept to a minimum; readers interested in implementations are directed to the official course, or our Github. The secondary purpose of this series is to reinforce (pun intended) my own learning in the field.

Introduction

Over the last few articles, we’ve covered and implemented the fundamentals of reinforcement learning through Markov Decision Processes and the Bellman equations, learning to quantify the values of specific actions and states of an agent within an environment. In this article, we’ll discuss Dynamic Programming and its role in Generalized Policy Iteration, a mutually reliant pair of processes that can self-optimize in order to identify the ideal trajectories within an environment to achieve maximum reward.

Reward-driven behavior. (OpenAI)

Dynamic programming (DP) is one of the most central tenets of reinforcement learning. Within the context of reinforcement learning, DP refers to a collection of algorithms that can be used to compute optimal policies iteratively, given a perfect model of the environment as a Markov Decision Process (MDP). Unfortunately, their high computational expense, coupled with the fact that most environments fail to satisfy the condition of a perfect model, means they are of limited use in practice. However, the concepts DP introduces lay the foundation for understanding other RL algorithms; in fact, most reinforcement learning algorithms can be seen as approximations to DP.

DP algorithms work to find optimal policies by iteratively evaluating solutions to the Bellman equations, and then attempting to improve upon them by finding a policy that maximizes received reward. We’ve previously covered the Bellman equations, and advise the reader to consult our past articles for a deeper explanation. This sequence alternates until an optimal policy is identified.

Generalized policy iteration (Sutton)

DP algorithms primarily work for episodic, finite MDP environments, although it is possible to apply DP to continuous tasks via quantization. Recall that the value of a state is the expected return from that state, which itself is a discounted sum of future rewards. After first initializing the value functions of a group of states arbitrarily, DP algorithms allow the Bellman equation itself to be used as an update rule:

The Bellman Equation as an update rule (Sutton)

In practice, this is done by creating two arrays to hold the previous and present state-value functions. By using the values of the previous value function together with the DP update rule, we can generate a new approximation for the value of each state, one state at a time. It can be shown that the sequence 𝑣𝑘 converges to 𝑣𝜋 as 𝑘 → ∞ under the same conditions that guarantee the existence of 𝑣𝜋, meaning that eventually these state-value functions will stabilize.
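The two-array update described above can be sketched in a few lines of Python. Note that this is an illustrative sketch, not code from the course: the dictionary-based model interface (policy, transitions) is a hypothetical one chosen for simplicity.

```python
def evaluate_policy(states, policy, transitions, gamma=1.0, theta=1e-8):
    """Iterative policy evaluation using two arrays (previous and present values).

    policy[s]           -> list of (probability, action) pairs
    transitions[(s, a)] -> (reward, next_state, done)
    """
    V = {s: 0.0 for s in states}          # arbitrary initialization
    while True:
        V_new = {}
        for s in states:                  # update one state at a time
            v = 0.0
            for prob, a in policy[s]:
                r, s2, done = transitions[(s, a)]
                # the Bellman equation used as an update rule
                v += prob * (r + (0.0 if done else gamma * V[s2]))
            V_new[s] = v
        if max(abs(V_new[s] - V[s]) for s in states) < theta:
            return V_new                  # values have stabilized
        V = V_new

# A two-state chain: s0 -> s1 -> terminal, reward -1 per step
states = ["s0", "s1"]
policy = {s: [(1.0, "right")] for s in states}
transitions = {("s0", "right"): (-1, "s1", False),
               ("s1", "right"): (-1, None, True)}
V = evaluate_policy(states, policy, transitions)
# V["s1"] = -1.0, V["s0"] = -2.0
```

As expected, each state's value converges to the (undiscounted) sum of the -1 rewards collected on the way to termination.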

Once convergence is reached, we move on to the improvement step, also known as control. By using the existing evaluated state value functions, we can map agent trajectories across state transitions that follow a greedy principle to maximize reward, and hence yield an alternative policy. In terms of the Bellman equations, the greedy policy can be written as π′(s) = argmax_a Σ_{s′, r} p(s′, r | s, a)[r + γ 𝑣𝜋(s′)].

This policy can then itself be evaluated to yield new state value functions, which can then be compared to the previous set. A policy is strictly better if its state value functions are greater than those of another policy across all states. In practice, once we find a policy that cannot be strictly improved upon, we have reached the optimal policy. To summarize: we iteratively evaluate policies to obtain accurate state values, and then attempt to improve upon these policies by encouraging the agent to take greedy actions. This sequence is repeated until an optimal policy and optimal state values are reached.
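The improvement step amounts to a greedy one-step lookahead over the evaluated values. A minimal sketch, again using a hypothetical dictionary-based model interface for illustration:

```python
def greedy_policy(states, actions, transitions, V, gamma=1.0):
    """Policy improvement: for each state, pick the action that maximizes
    the one-step lookahead value r + gamma * V(s')."""
    policy = {}
    for s in states:
        best_a, best_q = None, float("-inf")
        for a in actions:
            if (s, a) not in transitions:
                continue                  # action unavailable in this state
            r, s2, done = transitions[(s, a)]
            q = r + (0.0 if done else gamma * V[s2])
            if q > best_q:
                best_a, best_q = a, q
        policy[s] = best_a
    return policy

# Example: from s0, "left" ends the episode with reward -5, while
# "right" costs -1 but reaches s1, whose evaluated value is -1.
transitions = {("s0", "left"): (-5, None, True),
               ("s0", "right"): (-1, "s1", False)}
pi = greedy_policy(["s0"], ["left", "right"], transitions, {"s1": -1.0})
# pi["s0"] = "right"  (lookahead of -2 beats -5)
```

If the greedy policy matches the one we just evaluated, no further improvement is possible and we have reached the optimal policy.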

Achieving optimal state values and policies through policy iteration.

Let’s put theory into practice and demonstrate how iterative policy evaluation works with a simple Gridworld example, based on the Computational Statistics course taught at ETH Zurich.

Gridworld: Policy Evaluation

To understand how the combination of evaluation and improvement works, let’s look at the case of Gridworld, essentially a 4x4 grid of states, labelled from [1, 2, … 16]. There are two terminal states here, at positions 1 and 16. These are our target states, and will always have a value of 0.

As we initialize the states within the grid (v0), all of their initial values will be 0.

Initialized Gridworld (v0).

Let’s assume that we start off with a random policy; in other words, we have a 25% chance of moving in each of the 4 cardinal directions. We’ll set the reward for each transition to be -1, in effect encouraging our agent to reach the terminal states in as few transitions as possible to avoid excessive penalization. We’ll also set our discount factor to 1 to keep things simple.

We can hence calculate the value of position 6. As the values of the neighbouring states have all been initialized to 0, this is straightforward.
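Concretely, assuming (as is standard in this example) that a move off the grid leaves the agent in place, each of the four equiprobable actions from position 6 yields a reward of -1 and lands in a state whose initial value is 0:

```latex
v_1(6) = \sum_{a} \pi(a \mid 6)\,\bigl[r + \gamma\, v_0(s')\bigr]
       = 4 \times 0.25 \times (-1 + 1 \cdot 0) = -1
```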

We can then repeat this calculation to sweep the rest of the grid:

State values on Gridworld after one iteration (v1)

Let’s continue to evaluate our policy for another iteration, with the exact same policy, starting at the exact same position.

But as the state values from the previous sweep are now non-zero, the values of positions adjacent to a terminal state will decrease less than those of the other states, reflecting the benefit of being next to a terminal state.
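For example, consider position 2 in the second sweep, again assuming that off-grid moves leave the agent in place: one of the four moves reaches the terminal state (value 0), while the other three effectively land in states valued -1 from the first sweep:

```latex
v_2(2) = 0.25\,(-1 + 0) + 3 \times 0.25\,(-1 + (-1)) = -1.75
```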

Extrapolating the calculations to the entire grid and rounding:

Gridworld after 2 iterations (v2)

We can continue this iterative process until we reach convergence, after which the value functions will no longer change with further evaluation.
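The full evaluation sweep on Gridworld can be sketched as follows. This is a minimal sketch under the assumptions used above (states indexed 0-15 here rather than 1-16, off-grid moves leaving the agent in place), not the course’s own code:

```python
def evaluate_random_policy(gamma=1.0, theta=1e-6):
    """Evaluate the equiprobable random policy on the 4x4 Gridworld.
    States are indexed 0..15; 0 and 15 are the terminal states (value 0)."""
    V = [0.0] * 16
    while True:
        V_new = V[:]                      # two arrays: previous and present values
        for s in range(1, 15):            # skip the two terminal states
            row, col = divmod(s, 4)
            total = 0.0
            for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
                nr, nc = row + dr, col + dc
                if not (0 <= nr < 4 and 0 <= nc < 4):
                    nr, nc = row, col     # off-grid moves leave the agent in place
                total += 0.25 * (-1 + gamma * V[nr * 4 + nc])
            V_new[s] = total
        if max(abs(a - b) for a, b in zip(V_new, V)) < theta:
            return V_new
        V = V_new

V = evaluate_random_policy()
# Converged values match the classic result for this example,
# e.g. V[1] ≈ -14 next to a terminal corner and V[3] ≈ -22 in the far corners.
```

The first sweep of this loop reproduces the all -1 grid of v1 above, and the loop then continues until the values stabilize.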

Gridworld: Policy Control

Now that we’ve fully evaluated our policy and populated the state values of Gridworld, let’s see if we can design a superior alternative. To do this, let’s take our final Gridworld and map out trajectories to the terminal states that would give an agent the maximal reward. In essence, we are moving from a stochastic approach to a deterministic one, as the possible actions are now dictated by the greedy choices of the agent.

Gridworld Mark 2, following the new policy 𝜋’.

Assuming the same rewards and discount factor as before, we can hence calculate the values of our states using our new deterministic policy, colored appropriately. Note that we won’t be using our original Gridworld values any longer, and so we’ll start at position 2 (or any of the red colored states), as the terminal states will always have a value of 0.

We can then calculate the value of position 3, also shared by all green colored states.

Similarly, we can then calculate the value of position 6, also shared by position 11.

I’ll leave the final uncolored states as an exercise for you. You should end up with the following grid:

As all of the values of v′ on the left are greater than those of v on the right for all states, we can clearly state that the new deterministic policy is superior to the random policy. In this particular case, we’ve also achieved convergence.
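As a check on this result, the deterministic greedy values can also be computed directly with the Bellman optimality update, V(s) = max over a of [r + γV(s′)]: each state’s value then equals minus the number of steps to the nearest terminal corner. This is a sketch under the same grid conventions as before (0-indexed states, off-grid moves leaving the agent in place), not the course’s code:

```python
def greedy_values(gamma=1.0):
    """Bellman optimality update on the 4x4 Gridworld:
    V(s) = max_a [-1 + gamma * V(s')], with states 0..15 and terminals 0, 15.
    In-place sweeps repeat until the values stop changing."""
    V = [0.0] * 16
    while True:
        delta = 0.0
        for s in range(1, 15):            # skip the two terminal states
            row, col = divmod(s, 4)
            best = float("-inf")
            for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
                nr = min(max(row + dr, 0), 3)   # off-grid moves leave the agent in place
                nc = min(max(col + dc, 0), 3)
                best = max(best, -1 + gamma * V[nr * 4 + nc])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < 1e-12:
            return V

V = greedy_values()
# Each value is minus the shortest path length to a terminal corner:
# e.g. V[1] = -1, V[3] = -3, V[5] = -2
```

The resulting grid of -1, -2 and -3 values dominates the random-policy values everywhere, confirming the comparison above.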

So let’s summarize what we’ve achieved.

Starting from just a random policy and arbitrarily initialized values, we calculated the value functions for all of the states on the Gridworld.

We evaluated the policy over many iterations until convergence was achieved.

Using the newly evaluated state-value functions, we then suggested a deterministic policy.

We then showed that the new policy is superior to the random policy.

Beyond Gridworld, such approaches can be extrapolated to various real-world applications, from robotic vacuums and optimized distribution networks, to self-driving automobiles.

That wraps up this introduction to Dynamic Programming. In our next tutorial, we’ll move on to sample-based learning methods and cover the differences between Monte Carlo and Temporal Difference learning.

We hope you enjoyed this article, and hope you check out the many other articles on GradientCrescent covering applied AI. To stay up to date with the latest updates on GradientCrescent, please consider following the publication.

References

Sutton et al., Reinforcement Learning

White et al., Fundamentals of Reinforcement Learning, University of Alberta

Silver et al., Reinforcement Learning, UCL

Seminar, Computational Statistics, ETH Zurich.