Markov Reward Processes

Let’s extend the Markov Process model to add rewards to the system. For this, we attach a value to each transition from state to state. We already have the transition probability to capture the dynamics of the system; now every transition also carries another scalar number, the reward.

We can represent these rewards in many different forms, but the most common form is a reward matrix, much like a transition matrix. Each cell (i, j) holds the reward given for transitioning from state i to state j. Rewards can be positive or negative. (Sometimes we can get away with a simpler representation: when reaching a state gives the same reward regardless of the previous state, an array is enough instead of a matrix, which is more compact.)
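As a quick sketch of both representations in Python (the numbers are made up purely for illustration):

```python
import numpy as np

# Reward matrix: cell (i, j) is the reward for the transition
# from state i to state j (made-up numbers).
R = np.array([
    [ 1.0, -2.0,  0.5],
    [ 0.0,  3.0, -1.0],
    [ 2.0,  0.0,  1.0],
])

# Compact alternative: when the reward depends only on the state
# being entered, a plain array indexed by the next state is enough.
r = np.array([1.0, 3.0, 0.5])
```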

We also add one more thing to the model, called the discount factor, represented using γ (gamma). I know adding Greek symbols makes this less fun, but stick with me for a while. γ is a number between 0 and 1, and we will come back to it in a minute.

As you remember, we observe a chain of state transitions in an MP, but now every transition carries a reward as well, so all of our observations have a reward value attached to them (we use R_t to represent the reward at time t). Thus, for every episode, we can define the return as the total reward of that episode.

(And here is a scary equation for the same)

G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1}

Now that we have γ in there, let’s understand what this scary equation actually represents. We are calculating the sum of rewards, but each more distant reward is multiplied by the discount factor raised to the number of steps it is away from the starting point. In this way γ, the discount factor, represents the foresightedness of the agent. (The agent is basically the entity that goes through these state transitions in the MP system.)

If γ = 1, the return is just the sum of all subsequent rewards and corresponds to the case where the agent is a fortune teller with full visibility of all subsequent rewards. If γ = 0, the return G is just the immediate reward, without any contribution from subsequent states, and corresponds to absolute short-sightedness (basically the agent is focusing on instant gratification rather than long-term happiness, but philosophy much?). Usually γ is set to something in between, in which case we look into future rewards, but not too far. We can think of γ as a measure of how far into the future we look when estimating the return: the closer it is to 1, the more steps ahead we take into account.
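To make the discounting concrete, here is a small Python sketch; the reward sequence is made up for illustration:

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each discounted by gamma raised to its
    distance from the start: G = r0 + gamma*r1 + gamma^2*r2 + ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [1, 1, 3, -3, 5]               # rewards along one episode (made up)
print(discounted_return(rewards, 1.0))   # 7     -- fortune teller: plain sum
print(discounted_return(rewards, 0.0))   # 1     -- instant gratification only
print(discounted_return(rewards, 0.9))   # ~5.42 -- somewhere in between
```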

Value of a State

The return is usually not a very useful quantity by itself, because it is defined for one specific chain and can vary widely: some episodes are simply better than others. But if we calculate the mathematical expectation of the return from a state, by averaging over a large number of chains, we get an important quantity called the Value of state s, represented as V(s). Leaving all the probability concepts aside, this quantity represents how much return we can expect, on average, by following the Markov Reward Process from that state.

(And here is the scary equation for V(s))

V(s) = E[ G_t | S_t = s ]
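That averaging idea translates directly into code. Here is a minimal Monte Carlo sketch, assuming a hypothetical sample_episode helper that runs one chain from state s and returns the list of rewards it observed:

```python
def estimate_value(sample_episode, s, gamma, n_chains=10_000):
    """Monte Carlo estimate of V(s): average the discounted return
    over many sampled chains that start in state s."""
    total = 0.0
    for _ in range(n_chains):
        rewards = sample_episode(s)  # hypothetical: one chain of rewards from s
        total += sum(gamma ** k * r for k, r in enumerate(rewards))
    return total / n_chains
```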

Example of Value of a State

I know this post is really long as it is, but let’s go through a super simple example to showcase all these concepts. Let’s say we are modeling the day of a software engineer. The various states for this engineer can be:

Home : Not at the Office

Computer : Working at the Computer in the Office

Coffee : Drinking Coffee at the Office

Chat : Discussing something with colleagues at the Office

Here is a state diagram with the transition probabilities

(State Diagram for a Software Engineer, I guess?)

Here the engineer starts the day from home and always begins with coffee, without exception (thus there is no Home → Computer edge and no Home → Chat edge). The workday also ends at home (shown using the Computer → Home edge).
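In case the diagram is hard to read, here are the same transition probabilities written out in Python (read off the diagram; they also line up with the value calculations further down):

```python
# Transition probabilities for the software-engineer example,
# one inner dict per state, mapping next state -> probability.
P = {
    "home":     {"home": 0.6, "coffee": 0.4},
    "coffee":   {"coffee": 0.1, "chat": 0.7, "computer": 0.2},
    "chat":     {"chat": 0.5, "coffee": 0.2, "computer": 0.3},
    "computer": {"computer": 0.5, "chat": 0.1, "coffee": 0.2, "home": 0.2},
}
```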

Now let’s add rewards to this system:

Home → Home : 1 (Good to be home)

Home → Coffee : 1

Computer → Computer : 5 (Working hard is good)

Computer → Chat : -3 (It’s not good to be distracted)

Chat → Computer : 2

Computer → Coffee : 1

Computer → Home : 2

Coffee → Computer : 3

Coffee → Coffee : 1

Coffee → Chat : 2

Chat → Coffee : 1

Chat → Chat : -1 (Long conversations become boring)

Let’s overload our diagram with this information:

(As if the diagram wasn’t already ugly)

Now let’s think about the γ parameter. For a simple case, assume γ = 0. How do we calculate the values of the states here? For example, from the state Chat we know every possible next transition and its probability. With γ = 0, the return is just the immediate reward, so the value of a state is the expected immediate reward: for each possible transition out of the state, multiply its reward by its probability, and sum them all up.

V(s) = Σ_{s′} P(s → s′) · R(s → s′)

So,

V(chat) = -1*0.5 + 2*0.3 + 1*0.2 = 0.3

V(coffee) = 2*0.7 + 1*0.1 + 3*0.2 = 2.1

V(home) = 1*0.6 + 1*0.4 = 1.0

V(computer) = 5*0.5 + (-3)*0.1 + 1*0.2 + 2*0.2 = 2.8

So if we care about the immediate reward, Computer is the best state to be in. (I’m not leaving my laptop after this :P )
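These by-hand sums are easy to check in code. Here is a minimal sketch using the transition probabilities and rewards from above:

```python
# Same transition probabilities as in the earlier sketch.
P = {
    "home":     {"home": 0.6, "coffee": 0.4},
    "coffee":   {"coffee": 0.1, "chat": 0.7, "computer": 0.2},
    "chat":     {"chat": 0.5, "coffee": 0.2, "computer": 0.3},
    "computer": {"computer": 0.5, "chat": 0.1, "coffee": 0.2, "home": 0.2},
}
# Per-transition rewards from the list above, same layout as P.
R = {
    "home":     {"home": 1, "coffee": 1},
    "coffee":   {"coffee": 1, "chat": 2, "computer": 3},
    "chat":     {"chat": -1, "coffee": 1, "computer": 2},
    "computer": {"computer": 5, "chat": -3, "coffee": 1, "home": 2},
}

# With gamma = 0, the value of a state is just its expected immediate reward.
for s in P:
    v = sum(prob * R[s][s2] for s2, prob in P[s].items())
    print(f"V({s}) = {v:.1f}")
# V(home) = 1.0, V(coffee) = 2.1, V(chat) = 0.3, V(computer) = 2.8
```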

What happens when γ = 1? Then V(s) becomes infinite for all the states! This is because our diagram doesn’t have sink states (states which don’t have any outgoing transition), so every chain can go through an infinite number of transitions, and the sum of infinitely many positive rewards is infinite regardless of the starting state. This doesn’t seem right, does it?

This infinite result is one of the reasons to introduce γ into a Markov Reward Process. In most cases a process can go through a very large number of transitions, it is not practical to deal with infinite values, and we want to limit the horizon over which we calculate values. γ < 1 provides such a limit: with bounded rewards, the discounted sum is dominated by a geometric series and therefore converges to a finite value. If the horizon is finite, like in a tic-tac-toe game where there is a bounded number of steps, it is fine to use γ = 1. There is one other class of environments, called multi-armed bandits, which are MDPs with only one step (something we might look into in another post).

Now, when γ is strictly between 0 and 1, it is hard to calculate V(s) accurately by hand, because we would need to sum a huge number of terms. But that’s what computers are for, and there are several simple methods that can quickly calculate these values, most notably by solving the Bellman equation.
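As a sketch of the direct approach: in matrix form, the Bellman equation for a Markov Reward Process reads V = r + γPV, where r is the vector of expected immediate rewards. Rearranging gives (I - γP)V = r, a plain linear system. Here it is for our engineer example, with an assumed γ = 0.9 chosen purely for illustration:

```python
import numpy as np

states = ["home", "coffee", "chat", "computer"]

# Transition probabilities from the state diagram, rows = current state.
P = np.array([
    [0.6, 0.4, 0.0, 0.0],   # home     -> home, coffee
    [0.0, 0.1, 0.7, 0.2],   # coffee   -> coffee, chat, computer
    [0.0, 0.2, 0.5, 0.3],   # chat     -> coffee, chat, computer
    [0.2, 0.2, 0.1, 0.5],   # computer -> home, coffee, chat, computer
])

# Per-transition rewards, same layout as P.
R = np.array([
    [1, 1,  0, 0],
    [0, 1,  2, 3],
    [0, 1, -1, 2],
    [2, 1, -3, 5],
])

gamma = 0.9  # assumed discount factor for illustration

# Expected immediate reward per state: r(s) = sum_s' P(s, s') * R(s, s').
r = (P * R).sum(axis=1)

# Solve (I - gamma * P) V = r directly as a linear system.
# With gamma = 0.0 this reproduces the hand-computed values above.
V = np.linalg.solve(np.eye(len(states)) - gamma * P, r)

for s, v in zip(states, V):
    print(f"V({s}) = {v:.2f}")
```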

Now let’s add the final layer to Markov Reward Processes, Action!