3) How to solve RL problems? 🎮

“RL problems can be solved with dynamic programming, temporal difference learning, or policy gradient methods. However, agents operating in more complex environments often need help finding good approximate solutions, by employing deep neural networks & some cunning techniques that tackle the fundamental exploration-exploitation tradeoff.”

Approaches for solving RL problems vary with the nature of the environment, such as how much we know about it or what tasks we want to solve in it. In general, RL applications can be represented as MDPs and solved with some form of Dynamic Programming (DP). DP is an optimization technique that divides a problem into smaller subproblems and solves them recursively. RL problems can also be solved with methods that use only some aspects of DP, such as Temporal Difference (TD) learning. In TD learning, algorithms learn by combining the sampling of Monte Carlo methods (estimating expected values from experience) with the bootstrapping of DP (updating estimates on the basis of other estimates). One of the more modern RL solutions is the Policy Gradient (PG) method, which can directly approximate the optimal policy without using any approximate value function. The PG method is usually implemented via deep neural networks, which are Artificial Neural Networks (ANNs) with many layers. There are many other methods for solving RL problems, but they all share a similar aim: using the experience gathered during learning to somehow estimate unknown properties of the environment, such as transition probabilities or rewards in MDPs.
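To make the sampling-plus-bootstrapping idea of TD a bit more concrete, here is a minimal TD(0) value-update sketch in Python; the names (td0_update, V, alpha, gamma) are our own illustrative choices, not from any particular library.

```python
# Minimal TD(0) value update (an illustrative sketch, not a full agent).
# V is a dictionary mapping states to value estimates,
# alpha is the learning rate and gamma is the discount factor.

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # Sampling (Monte Carlo flavor): the transition (s, r, s_next)
    # comes from actual experience, not from known probabilities.
    # Bootstrapping (DP flavor): the target reuses the current
    # estimate V[s_next] instead of waiting for a full return.
    target = r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
```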

⇒ RL is solved via tabular or approximate solutions! 🤖 🍽 📐

In simple MDPs, solutions are mostly found via tabular solutions, in which tables (arrays) represent the value functions, since the environments are small or simple enough. These methods can often find exact solutions, such as exactly the optimal policy or optimal value function [Sutton & Barto 2017]. In other words, it is about finding an optimal policy or optimal value function, given possibly infinite time and data, in fully described environments. Algorithms calculate this through recursion over values & policies for all states of an environment. This can be done quite directly with some variant of recursion over the two equations for policies and values introduced in the previous section, or with some variant of their iteration, such as value iteration or policy iteration. In complex MDPs, solutions are mostly found via approximate solutions that can generalize from previously encountered states to new ones, since the environments are large. In other words, the problem shifts heavily toward finding a good approximate solution that is not computationally heavy, since we cannot expect to find the exact optimal policy or optimal value function in such environments [Sutton & Barto 2017]. It is also often necessary in this case to help agents with their job by employing ANNs. For instance, ANNs can process an agent’s experience by acting as nonlinear function approximators, learning value functions or maximizing expected rewards.
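Since value iteration was just mentioned, below is a small tabular value-iteration sketch in Python. It assumes a fully described MDP handed over as hypothetical dictionaries P (transition probabilities) and R (expected rewards), which is exactly the information a model-free agent would not have.

```python
# Tabular value iteration for a fully known MDP (illustrative sketch).
# states, actions: lists; P[s][a]: list of (prob, s_next) pairs;
# R[s][a]: expected immediate reward. These names are assumptions.

def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Back up each action's value using the current estimates.
            q = [R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                 for a in actions]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:  # stop once values barely change
            break
    # The policy that is greedy with respect to V is (near-)optimal.
    policy = {s: max(actions,
                     key=lambda a: R[s][a] + gamma *
                     sum(p * V[s2] for p, s2 in P[s][a]))
              for s in states}
    return V, policy
```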

… it seems like a good time for some questions again!

… so what’s the best way an agent can solve the task? 🎢

It is usually the one by which an agent can find the right balance between exploring and exploiting while completing tasks in an environment. Exploration is the ability to try out actions and learn from the resulting data, while exploitation is the ability to use what the agent already knows. Both are important abilities for an agent to have. However, balancing them is actually a hard task in RL, since the two clash when trying to come up with policies for an environment.

This conflict can be handled by two different learning methods for finding ‘good’ policies: on-policy and off-policy learning. Both attempt to evaluate or improve a policy while ensuring that all actions are selected infinitely often, meaning that an agent keeps trying all of them. The main difference is that on-policy learning does this for the policy that is used to make decisions, while off-policy learning does it for a policy different from the one used to generate the data [Sutton & Barto 2017]. Both can be combined with some function approximation, which is very useful in complex environments. There are then usually challenges with the convergence or stability of such approximate solutions, but we will not talk about those in this introduction material :)

… can algorithms differentiate between an agent’s goals and habits? ☯

Yes, algorithms can be built to do this quite well, and there are actually various terms across fields describing such a distinction in behavior. In cognitive science, it would be about reflective versus reflexive decision making and choice. In RL, if we map this psychological distinction between goal-directed and habitual behavior, it is about model-based versus model-free algorithms. Model-based methods select actions by planning ahead via a model of the agent’s environment. Model-free algorithms make decisions by accessing information that has been stored in a policy or an action-value function [Sutton & Barto 2017].

Let’s tie this back to the MDP terminology introduced in the previous section. The most likely job for model-free algorithms, such as TD learning, would be the one where not all properties of the MDP are known. Also, in connection with the rather colloquial terms for planning, model-free algorithms would be the ones solving the case of online learning, and model-based ones would be solving offline planning.

… what are some common RL algorithms? 🔦

It is perhaps worth mentioning some fundamental algorithms in RL that can be used for approximating solutions, since this is what RL implementations are all about. There are two distinct algorithms, called Q-learning and SARSA, that belong to the TD method. They both require action-value estimates to come up with their policies, i.e. they must first learn the values of actions in order to select the best ones. The main difference is their policy learning type: one is off-policy and the other is on-policy, respectively. Then there are two distinct algorithms, called REINFORCE and Actor-Critic, that belong to the PG method. They can learn a parameterized policy that selects actions without consulting a value function (action-value estimates) [Sutton & Barto 2017]. A parameterized policy is a policy expressed as a function of some independent quantities called parameters, e.g. the weights of a neural network implementing the PG method.

To summarize, TD algorithms learn action values and then use them to determine action selections, while PG algorithms learn a parameterized policy that enables actions to be taken without consulting action-value estimates [Sutton & Barto 2017]. In other words, it is about representing the policy by value functions versus parametrically. The on-policy/off-policy contrast between the two TD algorithms is sketched below.
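Here is a minimal sketch of the two TD updates side by side; these are our own illustrative functions, with Q assumed to be a plain dictionary of state-action values (e.g. a defaultdict(float)).

```python
# Illustrative TD updates; Q maps (state, action) pairs to estimates.

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    # Off-policy: the target uses the best possible next action (max),
    # regardless of which action the behavior policy actually takes.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy: the target uses a_next, the action the current
    # (e.g. epsilon-greedy) policy actually selected in s_next.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

Note how the only difference is the target: the max makes Q-learning off-policy, while plugging in the actually selected a_next makes SARSA on-policy.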

… well, this may sound a bit confusing, so let’s just look at Q-learning alone in the rest of this RL introduction text!

… how does Q-learning work? 🎛

0001 Q-learning is a model-free algorithm implementing off-policy learning that belongs to the TD method. It bases an agent’s learning on experience through state-action pairs, without an explicit specification of transition probabilities. The letter Q stands for the ‘quality’ of taking an action in a state, which is expressed by Q values assigned to state-action pairs. A high Q value means that taking that action in that state is better than taking one with a low Q value.

0010 Q-learning usually follows some exploration-exploitation policy and learns the Q values associated with the optimal policy. It stores a Q value for each state-action pair of the environment and updates them directly as it learns, i.e. it uses its own transitions (stored data) to directly produce solutions to the equations. The goal is to infer the optimal policy (π*) by approximating the Q function, as shown below.

π*(s) = argmaxₐ Q(s, a)

Q-learning — optimal policy
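In code, reading off this greedy policy from learned Q values could look like the following tiny sketch (again assuming Q is a dictionary keyed by state-action pairs):

```python
def greedy_policy(Q, s, actions):
    # pi*(s): the action with the highest learned Q value in state s.
    return max(actions, key=lambda a: Q[(s, a)])
```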

0100 Q-learning produces its Q values by repeatedly applying an update equation for each visited state-action pair in an environment, such as the equation below. It contains something called the learning rate (α), which determines how strongly newly acquired information overrides what the agent already learned (an agent that keeps only the newest estimates vs. one that learns nothing new). There is also the discount factor (ɣ), which determines the importance of future rewards by setting up an agent for instant or delayed gratification (aka a nearsighted or farsighted agent). Colloquially speaking, the new Q value produced by this equation roughly captures what happened (ending up in sʹ and receiving a reward) when an agent was in state (s) and tried doing something (a).

Q(s, a) ← Q(s, a) + α [r + ɣ · max_aʹ Q(sʹ, aʹ) − Q(s, a)]

Q-learning — equation for Q values
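To see the equation in action with some made-up numbers: suppose Q(s, a) = 1.0, the agent receives the reward r = 2, and the best Q value in sʹ is 3.0, with α = 0.5 and ɣ = 0.9. The target is then 2 + 0.9 × 3.0 = 4.7, the error (target minus old estimate) is 4.7 − 1.0 = 3.7, and the new Q value becomes 1.0 + 0.5 × 3.7 = 2.85.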

1000 Q-learning can use a table to store its data, as its simplest implementation. However, this may not be feasible for large problems (environments with a lot of states or actions). This is where ANNs come into play as function approximators, allowing it to scale up where tables just can’t. They provide more flexibility to the solution, but at the cost of sacrificing a bit of its stability. Regardless of the specific Q-learning implementation, there is also a need for a game plan when choosing actions. In our following coding example, we used a multi-armed bandit algorithm with an epsilon-greedy strategy for deciding on actions with regard to the exploration-exploitation dilemma.
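For reference, a minimal epsilon-greedy action selection could look like the sketch below; this is a generic illustration under the same Q-as-dictionary assumption as before, not the exact code of the example that follows.

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    # Explore: with probability epsilon, pick a random action.
    if random.random() < epsilon:
        return random.choice(actions)
    # Exploit: otherwise pick the action with the highest Q value.
    return max(actions, key=lambda a: Q[(s, a)])
```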