Welcome to GradientCrescent’s special series on reinforcement learning. This series introduces some of the fundamental concepts in reinforcement learning using digestible examples, primarily drawn from the “Reinforcement Learning” text by Sutton et al. and the University of Alberta’s “Fundamentals of Reinforcement Learning” course. Note that code in this series will be kept to a minimum; readers interested in implementations are directed to the official course, or our Github. The secondary purpose of this series is to reinforce (pun intended) my own learning in the field.

Introduction

Reinforcement learning has taken the AI world by storm. From AlphaGo to AlphaStar, a growing number of traditionally human-dominated activities have now been conquered by AI agents powered by reinforcement learning. Briefly, these achievements rely on optimizing an agent’s actions within an environment to achieve maximal reward. Over the past few articles, we’ve covered various fundamental aspects of reinforcement learning, from basic bandit systems and policy-based approaches to optimizing reward-based behavior within Markovian environments. Several of these approaches have demanded complete knowledge of our environment: dynamic programming, for example, requires that we possess the complete probability distributions of all possible state transitions. In reality, however, most systems are impossible to know completely, and their probability distributions cannot be obtained in explicit form due to complexity, innate uncertainty, or computational limitations. As an analogy, consider the task of a meteorologist: the factors involved in predicting the weather may be so numerous that it’s simply impractical to know the exact probabilities involved.

Can you guarantee a certain probability for hurricane formation?

For these situations, sample-based learning methods such as Monte Carlo are a solution. The term Monte Carlo is usually used to describe any estimation approach relying on random sampling. In other words, we do not assume any knowledge of our environment, but instead learn only from experience, through sample sequences of states, actions, and rewards obtained from interactions with the environment. These methods work by directly observing the rewards returned by the environment during normal operation to judge the average value of its states. Interestingly, it’s been shown that even without any knowledge of the environment’s dynamics (which can be thought of as the probability distribution of state transitions), we can still obtain optimal behavior to maximize reward.

As an example, consider the return from rolling 12 dice. By treating each throw of all 12 dice as a sample from a single state, we can average the observed sums to approach the true expected return. The more samples we collect, the more closely the average approaches the actual expected return.
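The dice example can be reproduced with a few lines of Python. This is a minimal sketch assuming fair six-sided dice, for which the true expected sum is 12 × 3.5 = 42; the function names and the seed are illustrative choices rather than anything from the course material.

```python
import random


def roll_twelve_dice(rng):
    """Return the sum of one throw of 12 fair six-sided dice."""
    return sum(rng.randint(1, 6) for _ in range(12))


def monte_carlo_estimate(n_samples, seed=0):
    """Estimate the expected sum by averaging n_samples simulated throws."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    total = 0
    for _ in range(n_samples):
        total += roll_twelve_dice(rng)
    return total / n_samples


# The estimate should drift towards 42 as the number of samples grows
for n in (10, 60, 10_000):
    print(n, monte_carlo_estimate(n))
```

With only a handful of samples the estimate can sit well away from 42, which is exactly the behavior the averaging argument above describes.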

The average sum of rolling 12 dice 60 times (University of Alberta)

This kind of sampling-based valuation may feel familiar to our loyal readers, as sampling is also done for k-bandit systems. Instead of comparing different bandits, Monte Carlo methods are used to compare different policies in Markovian environments, by determining the value of a state while following a particular policy until termination.

State Value Estimation with Monte Carlo Methods

Within the context of reinforcement learning, Monte Carlo methods are a way of estimating the values of states in a model by averaging sample returns. Because they need a terminal state, Monte Carlo methods are inherently applicable to episodic environments. Due to this restriction, Monte Carlo approaches are commonly considered to be “offline”, meaning that all updates are done after the terminal state is reached. A simple analogy would be randomly navigating a maze: an offline approach would have the agent reach the end before using the experience to try to decrease its maze time. In contrast, an online approach would have the agent constantly modifying its behavior while still within the maze. Perhaps it notices that green corridors lead to dead ends, and decides to avoid them for the rest of the run. We will discuss online approaches in the next article.

The Monte Carlo procedure can be summarized as follows:

Monte Carlo State-Value Estimation (Sutton et al.)

To better understand how Monte Carlo works, consider the state transition diagram below. The reward for each state transition is shown in black, and a discount factor of 0.5 is applied. Let’s put aside the actual state values for now, and focus on calculating one round of returns.

State transition diagram. State number is shown in red, rewards are shown in black.

Given that the terminal state has a return of 0, let’s calculate the return of every state, working backwards from the terminal state (G5). Note that we have set the discount factor to 0.5, which weights each return towards the most immediate rewards.
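The backward calculation can be sketched in Python as follows. The rewards used here are hypothetical placeholders, not the values from the diagram; only the discount factor of 0.5 and the terminal return of 0 are taken from the text.

```python
def discounted_returns(rewards, gamma):
    """Compute the return of every state by working backwards from the end.

    rewards[t] is the reward received on the transition out of state t.
    The final entry of the result is the terminal state, whose return is 0.
    """
    returns = [0.0] * (len(rewards) + 1)
    for t in range(len(rewards) - 1, -1, -1):
        returns[t] = rewards[t] + gamma * returns[t + 1]
    return returns


# Hypothetical rewards for four transitions (not the values in the diagram)
hypothetical_rewards = [1, 2, 3, 4]
print(discounted_returns(hypothetical_rewards, gamma=0.5))
# -> [3.25, 4.5, 5.0, 4.0, 0.0]
```

Each return is built from the one after it, so a single backward pass over the episode is enough.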

Or more generally,
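The general relationship is presumably the standard recursive definition of the discounted return, written here with γ as the discount factor, R as the reward, and T as the final time step of the episode:

```latex
G_t = R_{t+1} + \gamma \, G_{t+1}, \qquad G_T = 0
```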

To avoid keeping all of the returns in a list, we can execute the Monte-Carlo state-value update procedure incrementally, with an equation that shares some similarities with traditional gradient descent:

Incremental Monte Carlo update procedure. S stands for the state, V its value, G its return, and alpha is a step-size parameter.
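In its usual form, the incremental update reads V(S) ← V(S) + α (G − V(S)). The sketch below, using hypothetical returns, illustrates that choosing a step size of 1/n (with n the number of returns seen so far) reproduces the ordinary sample average without storing a list of returns.

```python
def incremental_update(value, target, step_size):
    """One incremental update: V <- V + alpha * (G - V)."""
    return value + step_size * (target - value)


# Hypothetical returns observed for a single state over five episodes
observed_returns = [-1.0, 1.0, 1.0, -1.0, 1.0]

v = 0.0
for n, g in enumerate(observed_returns, start=1):
    v = incremental_update(v, g, step_size=1.0 / n)

# The running estimate and the batch average agree (up to rounding)
print(v, sum(observed_returns) / len(observed_returns))
```

A constant step size could be used instead of 1/n, in which case recent returns would be weighted more heavily than older ones.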

Within reinforcement learning, Monte Carlo methods can be further classified as “first-visit” or “every-visit”. Briefly, the difference between the two lies in which visits to a state within an episode contribute to the MC update. The first-visit MC method estimates the value of a state as the average of the returns following the first visit to that state in each episode, whereas the every-visit MC method averages the returns following all visits to that state. We’ll be using first-visit Monte Carlo throughout this article due to its relative simplicity.
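To make the distinction concrete, here is a small sketch that collects the returns credited to each state in a single episode under both schemes. The episode (states A, B, A with rewards 1, 2, 3) is a made-up example, and the helper name is hypothetical.

```python
def returns_per_state(states, rewards, gamma=1.0, first_visit=True):
    """Collect the returns credited to each state within one episode."""
    credited = {}
    g = 0.0
    # Work backwards so that the return can be accumulated step by step
    for t in range(len(states) - 1, -1, -1):
        g = rewards[t] + gamma * g
        s = states[t]
        if first_visit and s in states[:t]:
            continue  # the state was visited earlier, so skip this visit
        credited.setdefault(s, []).append(g)
    return credited


# A made-up episode in which state "A" is visited twice
states = ["A", "B", "A"]
rewards = [1, 2, 3]

first = returns_per_state(states, rewards, first_visit=True)
every = returns_per_state(states, rewards, first_visit=False)
print(first)  # "A" is credited only with the return from its first visit
print(every)  # "A" is credited with the returns from both visits
```

The two schemes only differ when a state recurs within an episode; if no state repeats, they credit exactly the same returns.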

Policy Control with Monte Carlo Methods

If a model is not available to derive a policy from, MC can also be used to estimate state-action values. This is more useful than state values alone, as an idea of the value of each action (q) within a given state allows the agent to automatically form a policy from observations in an unknown environment.

More formally, we can use Monte Carlo to estimate q_π(s, a), the expected return when starting in state s, taking action a, and thereafter following policy π. The Monte Carlo methods remain the same, except that we now have the added dimension of the action taken in a given state. A state-action pair (s, a) is said to be visited in an episode if the state s is ever visited and action a is taken in it. As before, state-action value estimation can be done via first-visit or every-visit approaches.

As in Dynamic Programming, we can use generalized policy iteration to form a policy from observations of state-action values.

By alternating through policy evaluation and policy improvement steps and incorporating exploring starts to ensure that all possible actions are visited, we can achieve optimal policies for every state. For Monte Carlo GPI, this alternation is generally done after the termination of each episode.

Monte Carlo GPI (Sutton et al.)
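The alternation between evaluation and improvement can be sketched as two small functions. This is an illustrative outline rather than the algorithm from the figure: the function names, the episode format (a list of state, action, reward tuples), and the two hand-written one-step episodes are all assumptions, and exploring starts are not shown.

```python
from collections import defaultdict


def evaluate_episode(q, counts, episode, gamma=1.0):
    """Policy evaluation: first-visit MC update of Q from one finished episode.

    episode is a list of (state, action, reward) tuples.
    """
    pairs = [(s, a) for s, a, _ in episode]
    g = 0.0
    for t in range(len(episode) - 1, -1, -1):
        s, a, r = episode[t]
        g = r + gamma * g
        if (s, a) not in pairs[:t]:  # first visit of this state-action pair
            counts[(s, a)] += 1
            q[(s, a)] += (g - q[(s, a)]) / counts[(s, a)]


def improve_policy(q, actions):
    """Policy improvement: act greedily with respect to the current Q."""
    seen_states = {s for s, _ in q}
    return {s: max(actions, key=lambda a: q[(s, a)]) for s in seen_states}


q = defaultdict(float)
counts = defaultdict(int)

# Two hypothetical one-step episodes from the same state (1 = hit, 0 = stand)
evaluate_episode(q, counts, [((19, 10, False), 1, -1)])
evaluate_episode(q, counts, [((19, 10, False), 0, 1)])

policy = improve_policy(q, actions=(0, 1))
print(policy)  # the greedy action for (19, 10, False) is 0 (stand)
```

In a full control loop, the improved policy would be used to generate the next episode, and the two steps would repeat.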

Understanding Blackjack Strategy

To better understand how Monte Carlo works in practice when estimating state values and state-action values, let’s perform a step-by-step demonstration with the game of Blackjack. To begin with, let’s define the rules and conditions of our game:

We’ll be playing against the dealer only — no other players will be participating. This allows us to consider the hands of the dealer as part of the environment.

Numerical cards are worth their face value. The cards J, Q, and K are each worth 10. An ace can be worth 1 or 11, depending on the player’s choice.

Both parties are given two cards. The player’s two cards are face up, while only one of the dealer’s cards is face up.

The objective is to finish with a hand whose sum is closer to 21 than the dealer’s, without exceeding 21. Going over 21 results in a bust, while both parties finishing with the same sum results in a draw.

After the player has seen their cards and the dealer’s first card, the player can choose to hit or stand, and may keep hitting until they are satisfied with their sum, after which they stand.

The dealer then reveals their second card. If the dealer’s sum is less than 17, they continue drawing cards until it reaches 17 or more, after which they stand.
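The card-counting rules above can be captured in a short sketch. It assumes cards are represented as integers (ace as 1, face cards as 10) and that an ace is counted as 11 whenever doing so does not bust the hand; both helper names are hypothetical.

```python
def hand_value(cards):
    """Return (total, usable_ace) for a hand of integer card values.

    Aces are passed in as 1; face cards are passed in as 10. An ace is
    "usable" if counting it as 11 keeps the total at 21 or below.
    """
    total = sum(cards)
    usable_ace = 1 in cards and total + 10 <= 21
    if usable_ace:
        total += 10
    return total, usable_ace


def dealer_should_hit(cards):
    """The dealer keeps drawing while their sum is below 17."""
    total, _ = hand_value(cards)
    return total < 17


print(hand_value([1, 8]))         # ace counted as 11
print(hand_value([1, 10, 5]))     # ace must count as 1 to avoid a bust
print(dealer_should_hit([10, 3])) # dealer on 13 must hit
```

The pair (total, usable_ace), together with the dealer’s visible card, is the kind of state description used in the walkthrough that follows.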

Let’s demonstrate some Monte Carlo with a few hands of blackjack.

Round 1.

You draw a total of 19 but, pushing your luck, you hit, draw a 3, and go bust. At the moment you went bust, the dealer had only a single visible card, worth 10. This can be visualized as follows:

Round 1.

As we went bust, our reward for this round is -1. Let’s assign this accordingly as the return for the penultimate state, using the format of [Agent sum, dealer sum, ace?]:

Well that was unfortunate. Let’s go for another round.

Round 2.

You draw a total of 19. This time, you decide to stand. The dealer, holding 13, hits and goes bust. The penultimate states can be described as follows.

Round 2.

Let’s describe the states and rewards that have occurred in this round:

With the episode terminated, we can now update the values of all of the states in this round using the calculated returns. Assuming a discount factor of 1, we simply propagate our new reward back across our previous hands, as was done with the state transitions previously. As the state V(19, 10, no) already has a previous return of -1, we average the returns observed so far and assign the result as the state’s value:

Final state values for the Blackjack demonstration.
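For the state visited in both rounds, the averaging works out as follows. This assumes the usual reward of +1 for a win, which is not stated explicitly in the walkthrough above:

```latex
V(19, 10, \text{no}) = \frac{(-1) + (+1)}{2} = 0
```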

Implementation

Let’s implement a game of blackjack using first-visit Monte Carlo to learn the values of all of the possible states (the different hand combinations) within the game, using a Python approach based on that by Sudharsan et al. As usual, our code can be found on the GradientCrescent Github.

We’ll use OpenAI’s gym environment to make this easy. Think of the environment as an interface for running games of blackjack with minimal code, allowing us to focus on implementing reinforcement learning. Conveniently, all of the collected information about states, actions, and rewards is kept within “observation” variables, which are accumulated through running sessions of the game.

Let’s start by importing all of the libraries we’ll need to obtain and plot our results.

import gym
import numpy as np
from matplotlib import pyplot
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from collections import defaultdict
from functools import partial

# Jupyter notebook magic; omit this line when running as a plain script
%matplotlib inline
plt.style.use('ggplot')

Next let’s initialize our gym environment and define the policy that’ll guide our agent’s actions. Essentially, we will keep hitting until our hand sum reaches 19 or more, after which we’ll stand.

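# The observation encompasses all the data about the state that we need, as well as reactions to it.
# Note: this code assumes a classic Gym release. Newer Gym/Gymnasium releases register the
# environment as 'Blackjack-v1' and changed the return values of reset() and step().
env = gym.make('Blackjack-v0')

# Define a policy where we hit until we reach 19.
# Actions here are 0 = stand, 1 = hit.
def sample_policy(observation):
    score, dealer_score, usable_ace = observation
    return 0 if score >= 19 else 1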

Next, let’s define a method to generate data for an episode using our policy. We’ll store information on the state, the action taken, and the reward immediately following that action.

def generate_episode(policy, env):
    # We initialize the lists for storing states, actions, and rewards
    states, actions, rewards = [], [], []
    # Initialize the gym environment
    observation = env.reset()
    while True:
        # Append the current state to the states list
        states.append(observation)
        # Select an action using the policy passed in and append it to the actions list
        action = policy(observation)
        actions.append(action)
        # Perform the action in the environment and move to the next state
        observation, reward, done, info = env.step(action)
        rewards.append(reward)
        # Break if the state is a terminal state (i.e. done)
        if done:
            break
    return states, actions, rewards

Finally, let’s define the first-visit Monte Carlo prediction function. Firstly, we initialize an empty dictionary to store the current state-values along with another dictionary storing the number of entries for each state across episodes.

def first_visit_mc_prediction(policy, env, n_episodes):
    # First, we initialize the empty value table as a dictionary for storing the values of each state
    value_table = defaultdict(float)
    # N counts the number of first visits recorded for each state
    N = defaultdict(int)

For each episode, we call upon our previous generate_episode method to generate the states visited and the rewards earned following each state. We also initialize a variable to store our incremental returns. Next, working backwards through the episode, we obtain the reward and state for every step, and increment our returns variable with the reward for that step.

    for _ in range(n_episodes):
        # Next, we generate the episode and store the states and rewards
        states, _, rewards = generate_episode(policy, env)
        returns = 0
        # Then, for each step (working backwards), we store the reward in R and the state in S,
        # and accumulate the return
        for t in range(len(states) - 1, -1, -1):
            R = rewards[t]
            S = states[t]
            returns += R
            # To perform first-visit MC, we check whether the state is visited for the first
            # time in the episode. If so, we apply the standard incremental Monte Carlo equation:
            # NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)
            if S not in states[:t]:
                N[S] += 1
                value_table[S] += (returns - value_table[S]) / N[S]
    return value_table

Recall that as we are performing first-visit Monte Carlo, we only count the first visit to each state within an episode. Hence we perform a conditional check on the list of earlier states in the episode to see whether the state has already been visited. If it has not, we increase the number of observations for that state by 1 and calculate the new value using the Monte Carlo state-value update procedure defined previously. We then repeat the process for the following episode, in order to eventually obtain an average return.

Let’s run and take a look at our results!

value = first_visit_mc_prediction(sample_policy, env, n_episodes=500000)

# Print ten of the learned state values without removing them from the table
for item in list(value.items())[:10]:
    print(item)

Sample output showing the state values of various hands of blackjack.

Using the state values estimated over these episodes, we can plot a state-value distribution describing the value of any combination of player and dealer hands.

def plot_blackjack(V, ax1, ax2):
    player_sum = np.arange(12, 21 + 1)
    dealer_show = np.arange(1, 10 + 1)
    usable_ace = np.array([False, True])
    state_values = np.zeros((len(player_sum), len(dealer_show), len(usable_ace)))
    for i, player in enumerate(player_sum):
        for j, dealer in enumerate(dealer_show):
            for k, ace in enumerate(usable_ace):
                state_values[i, j, k] = V[player, dealer, ace]
    # Rows of state_values index the player sum and columns the dealer card,
    # so the dealer card goes on the x-axis and the player sum on the y-axis
    X, Y = np.meshgrid(dealer_show, player_sum)
    ax1.plot_wireframe(X, Y, state_values[:, :, 0])
    ax2.plot_wireframe(X, Y, state_values[:, :, 1])
    for ax in ax1, ax2:
        ax.set_zlim(-1, 1)
        ax.set_ylabel('player sum')
        ax.set_xlabel('dealer showing')
        ax.set_zlabel('state-value')

fig, axes = pyplot.subplots(nrows=2, figsize=(5, 8), subplot_kw={'projection': '3d'})
axes[0].set_title('state-value distribution w/o usable ace')
axes[1].set_title('state-value distribution w/ usable ace')
plot_blackjack(value, axes[0], axes[1])

State-value visualization of different blackjack hand combinations

So let’s summarize what we’ve learned.

Sample-based learning methods allow us to estimate state and state-action values without any knowledge of transition dynamics, simply through sampling.

Monte Carlo approaches rely on random sampling of a model, observing the rewards returned by the model, and collecting information during normal operation to define the average value of its states.

Generalized Policy Iteration is possible through Monte Carlo methods.

The value of all possible combinations of player and dealer hands in Blackjack can be judged through repeated Monte Carlo simulations, opening the way for optimized strategies.

That wraps up this introduction to Monte Carlo methods. In our next article, we’ll move on to online methods of sample-based learning, in the form of Temporal Difference learning.

We hope you enjoyed this article, and hope you check out the many other articles on GradientCrescent covering applied AI. To stay up to date with the latest updates on GradientCrescent, please consider following the publication.

References

Sutton et al., Reinforcement Learning

White et al., Fundamentals of Reinforcement Learning, University of Alberta

Silva et al., Reinforcement Learning, UCL

Platt et al., Northeastern University