Let's start by looking at passive reinforcement learning. ▶ 00:00

I'm going to describe an algorithm called temporal difference learning, or TD learning. ▶ 00:03

And what that means--sounds like a fancy name, ▶ 00:07

but all it really means is we're going to move ▶ 00:09

from one state to the next; ▶ 00:11

and we're going to look at the difference between the 2 states, ▶ 00:13

and learn that--and then kind of back up ▶ 00:16

the values, from one state to the next. ▶ 00:19

So we're going to follow a fixed policy, pi, ▶ 00:22

and let's say our policy tells us to go this way, and then go this way. ▶ 00:27

We'll eventually learn that we get a plus 1 reward there ▶ 00:31

and we'll start feeding back that plus 1, saying: ▶ 00:35

if it was good to get a plus 1 here, ▶ 00:38

it must be somewhat good to be in this state, ▶ 00:40

somewhat good to be in this state--and so on, back to the start state. ▶ 00:42

So, in order to run this algorithm, ▶ 00:46

we're going to try to build up a table of utilities for each state ▶ 00:48

and along the way, we're going to keep track of ▶ 00:53

the number of times that we visited each state. ▶ 00:56

Now, the table of utilities, we're going to start blank-- ▶ 00:59

we're not going to start them at zero or anything else ▶ 01:01

--they're just going to be undefined. ▶ 01:03

And the table of counts, we're going to start at zero, ▶ 01:05

saying we visited each state a total of zero times. ▶ 01:07
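
As a rough Python sketch of those two tables (U and N here are just placeholder names, not anything from the course code): the utilities can start as an empty dictionary, so a missing entry means "undefined," and every visit count starts at zero.

from collections import defaultdict

U = {}                 # utility table: no entry yet means "undefined" for that state
N = defaultdict(int)   # visit-count table: every state starts at zero visits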

What we're going to do is run the policy, ▶ 01:11

have a trial that goes through the states; ▶ 01:14

when it gets to a terminal state, ▶ 01:16

we start it over again at the start and run it again; ▶ 01:18

and we keep track of how many times we visited each state, ▶ 01:21

we update the utilities, and we get a better ▶ 01:24

and better estimate for the utility. ▶ 01:26

And this is what the inner loop of the algorithm looks like-- ▶ 01:28

and let's see if we can trace it out. ▶ 01:30
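
Here is a rough Python sketch of one step of that inner loop, using the U and N tables from the sketch above (td_step, alpha, and gamma are assumed names, and the actual course pseudocode may differ in its details):

def alpha(n):
    # Learning rate: take a big step when a state is nearly new,
    # and a small step once it has been visited many times.
    return 1.0 / (n + 1)

gamma = 1.0   # discount factor; the trace below uses no discounting

def td_step(s, r, s_prime, r_prime):
    # One inner-loop step: the policy just moved us from state s (reward r)
    # to state s_prime (reward r_prime).
    if s_prime not in U:
        U[s_prime] = r_prime   # a brand-new state's utility is set to its reward
    if s is not None:
        N[s] += 1
        # Temporal-difference update of the state we just left:
        # move U[s] a fraction alpha toward r + gamma * U[s_prime].
        U[s] += alpha(N[s]) * (r + gamma * U[s_prime] - U[s])
    return s_prime, r_prime    # these become (s, r) on the next step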

So we'll start at a start state, ▶ 01:32

we'll apply the policy--and let's say the policy tells us to move in this direction. ▶ 01:34

Then we get a reward here, ▶ 01:39

and then we look at it with the algorithm, ▶ 01:44

and the algorithm tells us if the state ▶ 01:46

is new--yes, it is; we've never been there before-- ▶ 01:48

then set the utility of that state to the reward there, which is zero. ▶ 01:51

Okay--so now we have a zero here; ▶ 01:56

and then let's say, the next step, we move up here. ▶ 01:58

So, again, we have a zero; ▶ 02:02

and let's say our policy looks like a good one, ▶ 02:04

so we get: here, we have a zero. ▶ 02:07

We get: here, we have a zero. ▶ 02:10

And then we get a reward of 1, so that state gets a utility of 1. ▶ 02:16

And all along the way, we have to think about ▶ 02:20

how we're backing up these values, as well. ▶ 02:23

So when we get here, we have to look at this formula to say: ▶ 02:26

How are we going to update the utility of the prior state? ▶ 02:31
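
In the notation used here (U for utility, N for visit count, Alpha for the learning rate, Gamma for the discount factor), that formula is presumably the standard passive TD update, with s the state we just left and s' the state we arrived in:

U(s) ← U(s) + α(N(s)) · ( R(s) + γ · U(s') − U(s) )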

And the difference between this state and this state is zero. ▶ 02:35

So this difference, here, is going to be zero-- ▶ 02:38

the reward is zero, and so there's going to be no update to this state. ▶ 02:43

But now, finally--for the first time--we're going to have an actual update. ▶ 02:46

So we're going to update this state to be plus 1, ▶ 02:50

and now we're going to think about changing this state. ▶ 02:54

And what was its old utility?--well, it was zero. ▶ 02:57

And then there's a factor called Alpha, ▶ 03:00

which is the learning rate ▶ 03:03

that tells us how much we want to move this utility ▶ 03:05

towards something that's maybe a better estimate. ▶ 03:08

And the learning rate should be such that, ▶ 03:11

if we are brand new, ▶ 03:14

we want to move a big step; ▶ 03:16

and if we've seen this state a lot of times, ▶ 03:18

we're pretty confident of our number ▶ 03:20

and we want to make a small step. ▶ 03:22

So let's say that the Alpha function is 1 over N plus 1. ▶ 03:24

Well, we'd better not make it just 1 over N, because N could be zero. ▶ 03:29

So, with N equal to 1 here, 1 over N plus 1 would be ½; ▶ 03:31

and then the reward in this state was zero; ▶ 03:35

plus, we had a Gamma-- ▶ 03:39

and let's just say that Gamma is 1, ▶ 03:41

so there's no discounting; and then ▶ 03:44

we look at the difference between the utility ▶ 03:46

of the resulting state--which is 1-- ▶ 03:49

minus the utility of this state, which was zero. ▶ 03:52

So we get ½ times 1 minus zero--which is ½. ▶ 03:57

So we update this; ▶ 04:01

and we change this zero to ½. ▶ 04:03
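
Written out in one line, in that notation, the update we just traced is: U(s) ← 0 + ½ · (0 + 1 · 1 − 0) = ½.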

Now let's say we start all over again ▶ 04:06

and let's say our policy is right on track; ▶ 04:10

and nothing unusual, stochastically, has happened. ▶ 04:12

So we follow the same path, ▶ 04:16

we don't update--because they're all zeros all along this path. ▶ 04:19

And now it's time for an update. ▶ 04:26

So now, we've transitioned from a zero to ½-- ▶ 04:28

so how are we going to update this state? ▶ 04:33

Well, the old utility was zero ▶ 04:35

and now we have an Alpha of 1 over N plus 1-- ▶ 04:37

So we're getting a little bit more confident--because we've been there ▶ 04:44

twice, rather than just once. ▶ 04:46

The reward in this state was zero, ▶ 04:48

and then we have to look at the difference between these 2 states. ▶ 04:51

That's where we get the name, Temporal Difference; ▶ 04:54

and so, we have ½ minus zero-- ▶ 04:57

and so that's 1/3 times ½-- ▶ 05:01

Now we update this state. ▶ 05:05

It was zero; now it becomes 1/6. ▶ 05:07
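
Written out, that update is: U(s) ← 0 + 1/3 · (0 + 1 · ½ − 0) = 1/6.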

And you can see how the results ▶ 05:11

of the positive 1 start to propagate back. ▶ 05:13

It takes 1 trial at a time ▶ 05:18

to get that to propagate backwards. ▶ 05:20

Now, how about the update from this state to this state? ▶ 05:22

Now, we were ½ here--so our old utility was ½; ▶ 05:25

The reward in the old state was zero; ▶ 05:35

plus Alpha--which is 1/3 again--times the difference between these two, ▶ 05:39

which is 1 minus ½. ▶ 05:42

So that's ½ plus 1/3 times ½--which is ½ plus 1/6, or 2/3. ▶ 05:45

And now the second time through, ▶ 05:49

we've updated the utility of this state from 1/2 to 2/3. ▶ 05:51
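
Written out, that update is: U(s) ← ½ + 1/3 · (0 + 1 · 1 − ½) = ½ + 1/6 = 2/3.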

And we keep on going--and you can see the results of the positive, propagating backwards. ▶ 05:57

And if we ran more trials through here, ▶ 06:02

you would see the results of the negative propagating backwards. ▶ 06:04
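
As a check, running two trials of the same path through the td_step sketch from earlier reproduces the numbers in this trace (the state names s0 through s3 and goal are just placeholder labels for the squares along the path):

# States along the fixed path, each paired with its reward; the goal has reward +1.
path = [("s0", 0), ("s1", 0), ("s2", 0), ("s3", 0), ("goal", 1)]

for trial in range(2):
    s, r = None, None              # each trial starts fresh from the start state
    for s_prime, r_prime in path:
        s, r = td_step(s, r, s_prime, r_prime)

# After trial 1: U["s3"] == 0.5 and U["goal"] == 1; everything else is still 0.
# After trial 2: U["s2"] == 1/6 and U["s3"] == 2/3, matching the trace above.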