Solve problems with the right representations

Backpropagation ≈ calculating gradients with the chain rule

NNs can learn features directly from data

Reinforcement Learning: An agent interacts with a (generally stochastic) environment and learns through trial and error. The agent perceives the environment state $\mathbf{s}_t$ and chooses an action $\mathbf{a}_t$; performing $\mathbf{a}_t$ transitions $\mathbf{s}_t$ to $\mathbf{s}_{t+1}$ with a scalar reward $r_{t+1}$.
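
As a concrete picture of this interaction loop, here is a minimal sketch; the toy chain environment and random policy are made up purely for illustration, not taken from any particular library:

```python
import random

# Minimal sketch of the agent-environment loop: the agent sees s_t, picks a_t,
# and the environment returns s_{t+1} and r_{t+1}. ChainEnv is a toy example.
class ChainEnv:
    """Walk along a 5-state chain; reward 1 for reaching the rightmost state."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # action: 0 = left, 1 = right
        self.state = max(0, min(4, self.state + (1 if action == 1 else -1)))
        done = self.state == 4
        return self.state, (1.0 if done else 0.0), done

def run_episode(env, policy, max_steps=100):
    state, total_reward = env.reset(), 0.0
    for _ in range(max_steps):
        action = policy(state)                   # a_t ~ pi(.|s_t)
        state, reward, done = env.step(action)   # s_{t+1}, r_{t+1}
        total_reward += reward
        if done:
            break
    return total_reward

print(run_episode(ChainEnv(), lambda s: random.choice([0, 1])))
```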

Supervised vs. Reinforcement: In supervised learning, you receive the correct answer and learn to produce the correct answer. In reinforcement learning, you receive only a reward signal; how do you produce the correct action?

Difficulties: The correct action is unknown; the agent affects its own observations (data are not i.i.d.); long-range time dependencies (credit assignment).

Goal: Maximise the expected return (a.k.a. value) $\mathbb{E}[R]$. The return is the cumulative (discounted) reward: $R = \sum\limits_{t=0}^{T-1} \gamma^tr_{t+1}$. The discount $\gamma \in [0, 1]$ determines "far-sightedness"; if the problem is non-episodic ($T = \infty$), $\gamma \in [0, 1)$ is needed for the sum to stay finite. Learn a policy $\pi$ that maps states to actions so as to maximise $\mathbb{E}[R]$; the optimal policy $\pi^*$ maximises $\mathbb{E}[R]$ from all states.
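
A small worked example of the discounted return; the reward sequence is made up for illustration:

```python
# Discounted return R = sum_t gamma^t * r_{t+1} for a made-up reward sequence.
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]   # r_1, ..., r_T (illustrative values)

R = sum(gamma ** t * r for t, r in enumerate(rewards))
print(R)  # 0.9**2 * 1 + 0.9**4 * 5 = 0.81 + 3.2805 = 4.0905
```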

Markov Assumption: Collect a history, e.g. $\mathbf{h}_2 = \{\mathbf{s}_0, \mathbf{a}_0, r_1, \mathbf{s}_1, \mathbf{a}_1, r_2, \mathbf{s}_2\}$. RL assumes a Markov decision process (MDP): choose $\mathbf{a}_2$ based purely on $\mathbf{s}_2$, not $\mathbf{h}_2$. The state is a sufficient statistic of the future, which allows dynamic programming instead of Monte Carlo estimates. Realistic problems are usually partially observable MDPs, in which the agent receives an observation $\mathbf{o}_{t+1} \sim O(\mathbf{s}_{t+1}, \mathbf{a}_t)$ rather than the state itself.

Approaches: Value functions estimate the value (expected return) of being in a given state. Policy search directly finds a policy. Actor-critic combines a value function (the critic) with policy search (the actor). These can be combined with (learned) models in many ways, e.g., training from simulation or model predictive control. Here we consider tabular value functions/policies, i.e., $|\pi| = |\mathcal{S}| \times |\mathcal{A}|$.
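
For a concrete picture of the tabular case (sizes chosen arbitrarily for illustration):

```python
import numpy as np

# A tabular value function/policy is just an array indexed by state (and action).
n_states, n_actions = 6, 3
Q = np.zeros((n_states, n_actions))    # |S| x |A| table of state-action values
pi = np.zeros(n_states, dtype=int)     # one (deterministic) action per state
print(Q.shape, pi.shape)
```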

Value Function: Define the state value function $V^\pi(\mathbf{s}_t) = \mathbb{E}_\pi[R|\mathbf{s}_t]$. The optimal value function comes from the optimal policy: $V^* = V^{\pi^*} = \max\limits_\pi V^\pi(\mathbf{s}) \ \forall \mathbf{s}$. Given the environment model, $\mathbf{s}_{t+1} \sim P(\mathbf{s}_t, \mathbf{a}_t)$, we could use dynamic programming with $V^\pi$.
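
A minimal sketch of dynamic programming with $V^\pi$ (iterative policy evaluation), assuming a tiny MDP whose transition model and rewards are randomly generated here purely for illustration:

```python
import numpy as np

# Iterative policy evaluation on a tiny MDP with a known model (illustrative).
# P[s, a, s'] = transition probability, r[s, a] = expected reward, pi[s, a] = policy.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(size=(n_states, n_actions))
pi = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform random policy

V = np.zeros(n_states)
for _ in range(200):
    # V(s) = sum_a pi(a|s) [ r(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
    V = (pi * (r + gamma * P @ V)).sum(axis=1)
print(V)
```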

Q-Function: Define the state-action value function $Q^\pi(\mathbf{s}_t, \mathbf{a}_t) = \mathbb{E}_\pi[R|\mathbf{s}_t, \mathbf{a}_t]$. If we had $Q^*$, then $\pi^*(\mathbf{s}_t) = \arg\!\max\limits_{\mathbf{a}}Q^*(\mathbf{s}_t, \mathbf{a})$. $Q^\pi$ satisfies a recursive relation (the Bellman equation): $Q^\pi(\mathbf{s}_t, \mathbf{a}_t) = \mathbb{E}_{\mathbf{s}_{t+1},\pi}\big[r_{t+1} + \gamma Q^\pi(\mathbf{s}_{t+1}, \pi(\mathbf{s}_{t+1}))\big]$, so $Q^\pi$ can be improved by bootstrapping. One can also define the relative advantage of an action against a baseline: $A(\mathbf{s}_t, \mathbf{a}_t) = Q(\mathbf{s}_t, \mathbf{a}_t) - V(\mathbf{s}_t)$.
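
A small sketch of extracting a greedy policy and advantages from a tabular $Q$; the numbers are made up for illustration:

```python
import numpy as np

# Given a tabular Q (illustrative values), derive the greedy policy and the
# advantages A(s, a) = Q(s, a) - V(s), with V(s) = max_a Q(s, a) for the
# greedy policy.
Q = np.array([[1.0, 2.0],
              [0.5, 0.1],
              [3.0, 3.5]])

greedy_policy = Q.argmax(axis=1)   # pi*(s) = argmax_a Q(s, a)
V = Q.max(axis=1)                  # V(s) under the greedy policy
A = Q - V[:, None]                 # advantage of each action in each state
print(greedy_policy, A, sep="\n")
```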

Q-Learning: Learn from experience: $Q'(\mathbf{s}_t, \mathbf{a}_t) = Q(\mathbf{s}_t, \mathbf{a}_t) + \alpha \delta$, where $\alpha$ is the learning rate and $\delta$ is the TD-error [7]: $\delta = Y - Q = \left(r_{t+1} + \gamma\max\limits_{\mathbf{a}}Q(\mathbf{s}_{t+1}, \mathbf{a})\right) - Q(\mathbf{s}_t, \mathbf{a}_t)$. The target $Y$ is the reward received plus the discounted max Q-value of the next state. Minimising $\delta$ satisfies the recursive (Bellman) relationship. The loss is the mean squared error over a batch: $\mathcal{L}(\delta) = \frac{1}{N}\sum\limits_{n=1}^{N}(\delta_n)^2$. DL note: RL updates are usually formulated for gradient ascent.
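
A minimal sketch of the tabular update on a single transition; all values are made up for illustration:

```python
import numpy as np

# One tabular Q-learning update from a single transition (s, a, r, s_next).
n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

s, a, r, s_next, done = 0, 1, 1.0, 2, False   # illustrative transition

# TD-error: delta = (r + gamma * max_a' Q(s', a')) - Q(s, a)
target = r + (0.0 if done else gamma * Q[s_next].max())
delta = target - Q[s, a]
Q[s, a] += alpha * delta
print(delta, Q[s, a])
```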

Generalised Policy Iteration: Used to get $Q^*$ from $Q^\pi$ by interleaving steps of policy evaluation and policy improvement. Policy evaluation: with the updated policy, improve the estimate of the value function. Policy improvement: with the updated value function, improve the policy.
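
A compact sketch of this evaluation/improvement loop on a tiny MDP with a known model; the MDP is randomly generated here purely for illustration:

```python
import numpy as np

# Generalised policy iteration on a small random MDP (illustrative only):
# alternate a one-step Bellman backup of Q with greedy policy improvement.
n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r = rng.uniform(size=(n_states, n_actions))

Q = np.zeros((n_states, n_actions))
for _ in range(100):
    pi = Q.argmax(axis=1)             # improvement: act greedily w.r.t. Q
    V = Q[np.arange(n_states), pi]    # V^pi(s) = Q(s, pi(s))
    Q = r + gamma * P @ V             # evaluation: one Bellman backup
print(Q.argmax(axis=1))               # (approximately) optimal policy
```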

Policy Search: Directly output actions with a parameterised policy $\pi_\theta(\mathbf{a}_t|\mathbf{s}_t)$. Search methods can include black-box optimisers such as genetic algorithms, or even random search [8].
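
A toy sketch of black-box policy search by random search; the `estimated_return` function is a made-up stand-in for actually rolling out $\pi_\theta$ in the environment:

```python
import numpy as np

# Black-box policy search by random search (illustrative). In practice the
# return would be estimated by running episodes with the candidate policy.
rng = np.random.default_rng(0)

def estimated_return(theta):
    # Placeholder objective: pretend the return peaks near theta = [1, -2].
    return -np.sum((theta - np.array([1.0, -2.0])) ** 2)

best_theta, best_R = None, -np.inf
for _ in range(1000):
    theta = rng.normal(scale=3.0, size=2)   # sample random policy parameters
    R = estimated_return(theta)             # evaluate by (simulated) rollout
    if R > best_R:
        best_theta, best_R = theta, R
print(best_theta, best_R)
```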

Continuous Control: "Direct" policy methods easily allow continuous action outputs, rather than searching for $\arg\!\max\limits_{\mathbf{a}}Q(\mathbf{s}, \mathbf{a})$.

Policy Gradients: Increase the log probability of actions, weighted by the reward. Score function gradient estimator (REINFORCE) [9]: $\nabla_\theta \mathbb{E}_{\mathbf{s}}[R(\mathbf{s})] = \mathbb{E}[R(\mathbf{s})\nabla_\theta\log \pi_\theta(\mathbf{s})]$. This gives a stochastic estimate of the gradient when $r$ is non-differentiable but $\pi_\theta$ can be sampled from. For more details, see Deep Reinforcement Learning: Pong from Pixels.
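
A sketch of the REINFORCE estimator for a tabular softmax policy over discrete actions, applied to a single made-up trajectory; the gradient of the log-probability is written out analytically:

```python
import numpy as np

# REINFORCE sketch: weight grad log pi_theta(a|s) by the return R and ascend.
n_states, n_actions, gamma, lr = 3, 2, 0.99, 0.1
theta = np.zeros((n_states, n_actions))          # policy logits for pi_theta(a|s)

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

# A made-up trajectory of (state, action, reward) tuples.
trajectory = [(0, 1, 0.0), (1, 0, 1.0), (2, 1, 5.0)]
R = sum(gamma ** t, for t in []) if False else sum(gamma ** t * r for t, (_, _, r) in enumerate(trajectory))

# For a softmax policy, grad_theta log pi(a|s) = one_hot(a) - pi(.|s) at row s.
grad = np.zeros_like(theta)
for s, a, _ in trajectory:
    grad[s] += R * (np.eye(n_actions)[a] - policy(s))   # R * grad log pi(a|s)

theta += lr * grad                                       # gradient ascent on E[R]
print(theta)
```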

Actor-Critic: Actor: the policy $\pi(\mathbf{a}_t|\mathbf{s}_t)$, trained with policy gradients. Critic: the state value function $V(\mathbf{s}_t)$, trained with the TD-error $\delta$.
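
A minimal tabular actor-critic sketch on a single transition; the softmax actor, TD(0) critic, and all numbers are illustrative assumptions:

```python
import numpy as np

# One tabular actor-critic update (illustrative): the critic V is updated with
# the TD-error, and the same TD-error weights the actor's policy-gradient step.
n_states, n_actions = 3, 2
gamma, lr_actor, lr_critic = 0.99, 0.1, 0.5
theta = np.zeros((n_states, n_actions))   # actor logits for pi(a|s)
V = np.zeros(n_states)                    # critic

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

s, a, r, s_next = 0, 1, 1.0, 2            # illustrative transition

delta = r + gamma * V[s_next] - V[s]      # TD-error
V[s] += lr_critic * delta                 # critic update
theta[s] += lr_actor * delta * (np.eye(n_actions)[a] - policy(s))  # actor update
print(delta, V[s], theta[s])
```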