An introduction to RL

Reinforcement learning (RL) is an area of machine learning that deals with sequential decision-making aimed at reaching a desired goal. An RL problem consists of a decision-maker, called the agent, and the physical or virtual world with which the agent interacts, known as the environment. The agent acts on the environment by taking an action. In response, the environment feeds back to the agent a new state and a reward. These two signals are the consequences of the action taken by the agent. In particular, the reward is a value indicating how good or bad the action was, and the state is the current representation of the agent and the environment. This cycle is shown in the following diagram:

In this diagram, the agent is represented by PacMan, which chooses an action based on the current state of the environment. Its behavior influences the environment, such as its own position and that of the enemies, and the environment returns these changes in the form of a new state and a reward. This cycle is repeated until the game ends.

The ultimate goal of the agent is to maximize the total reward accumulated during its lifetime. Let's simplify the notation: if $a_t$ is the action at time $t$ and $r_t$ is the reward at time $t$, then the agent will take actions $a_0, a_1, \dots, a_t$ to maximize the sum of all rewards $\sum_{i=0}^{t} r_i$.

To maximize the cumulative reward, the agent has to learn the best behavior in every situation. To do so, it has to optimize over a long-term horizon while taking care of every single action. In environments with many discrete or continuous states and actions, learning is difficult because the agent has to account for every situation. To make the problem harder, rewards in RL can be very sparse and delayed, which makes the learning process more arduous.

To give an example of an RL problem while explaining the complexity of a sparse reward, consider the well-known story of two siblings, Hansel and Gretel. Their parents led them into the forest to abandon them, but Hansel, who knew of their intentions, had taken a slice of bread with him when they left the house and managed to leave a trail of breadcrumbs that would lead him and his sister home. In the RL framework, the agents are Hansel and Gretel, and the environment is the forest. A reward of +1 is obtained for every crumb of bread reached, and a reward of +10 is obtained when they reach home. In this case, the denser the trail of bread, the easier it is for the siblings to find their way home, because they have to explore a smaller area to get from one piece of bread to the next. Unfortunately, sparse rewards are far more common than dense rewards in the real world.

An important characteristic of RL is that it can deal with environments that are dynamic, uncertain, and non-deterministic. These qualities are essential for the adoption of RL in the real world. The following points are examples of how real-world problems can be reframed in RL settings:

Self-driving cars are a popular, yet difficult, problem to approach with RL. This is because of the many aspects to be taken into consideration while driving on the road (such as pedestrians, other cars, bikes, and traffic lights) and the highly uncertain environment. In this case, the self-driving car is the agent, which can act on the steering wheel, accelerator, and brakes. The environment is the world around it. The agent cannot be aware of the whole world around it, as it can only capture limited information via its sensors (for example, the camera, radar, and GPS). The goal of the self-driving car is to reach the destination in the minimum amount of time while following the rules of the road and without damaging anything. Consequently, the agent can receive a negative reward whenever a negative event occurs and a positive reward, larger for shorter driving times, when it reaches its destination.

In the game of chess, the goal is to checkmate the opponent's king. In an RL framework, the player is the agent and the environment is the board in its current state. The agent is allowed to move the pieces according to the movement rules of each piece. As a result of an action, the environment returns a positive or negative reward corresponding to a win or a loss for the agent. In all other situations, the reward is 0 and the next state is the state of the board after the opponent has moved. Unlike the self-driving car example, here the environment state equals the agent state. In other words, the agent has a perfect view of the environment.
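The interaction cycle and the breadcrumb rewards described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ForestPath` class, its one-dimensional layout, the crumb positions, and the `step` return convention are hypothetical and are not taken from any library. Only the reward values (+1 per crumb, +10 for reaching home) come from the text.

```python
class ForestPath:
    """Hypothetical 1-D forest: the agent walks from position 0 to `home`."""

    def __init__(self, home=10, crumbs=(2, 4, 6, 8)):
        self.home = home
        self.crumbs = set(crumbs)
        self.reset()

    def reset(self):
        """Start a new episode and return the initial state."""
        self.position = 0
        self.remaining = set(self.crumbs)
        return self.position

    def step(self, action):
        """Apply an action (-1 = step back, +1 = step forward).

        Returns the new state, the reward, and whether the episode ended.
        """
        self.position = max(0, min(self.home, self.position + action))
        reward = 0
        if self.position in self.remaining:
            self.remaining.remove(self.position)  # each crumb pays only once
            reward += 1
        done = self.position == self.home
        if done:
            reward += 10
        return self.position, reward, done


def run_episode(env, policy, max_steps=1000):
    """Run the agent-environment loop and return the cumulative reward."""
    state = env.reset()
    total_reward = 0
    for _ in range(max_steps):
        action = policy(state)                  # the agent chooses an action
        state, reward, done = env.step(action)  # the environment responds
        total_reward += reward                  # sum of all rewards so far
        if done:
            break
    return total_reward


def always_forward(state):
    """A trivial policy that ignores the state and always steps forward."""
    return +1


dense_return = run_episode(ForestPath(), always_forward)
sparse_return = run_episode(ForestPath(crumbs=()), always_forward)
print(dense_return, sparse_return)
```

With the dense trail, the agent receives feedback along the way. With no crumbs, the only reward arrives at the very end, which is the sparse and delayed case described above.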

Comparing RL and supervised learning

RL and supervised learning are similar, yet different, paradigms for learning from data. Many problems can be tackled with both supervised learning and RL; however, in most cases, they are suited to different tasks. Supervised learning learns to generalize from a fixed, limited dataset of examples. Each example is composed of an input and the desired output (or label), which provides immediate learning feedback. In comparison, RL focuses on the sequence of actions that can be taken in a particular situation. In this case, the only supervision provided is the reward signal. Unlike in the supervised setting, no correct action is given for each circumstance. RL can be viewed as a more general and complete framework for learning. The major characteristics that are unique to RL are as follows:

The reward could be dense, sparse, or very delayed. In many cases, the reward is obtained only at the end of the task (for example, in the game of chess).

The problem is sequential and time-dependent; actions will affect the next actions, which, in turn, influence the possible rewards and states.

An agent has to take the actions with the highest potential to achieve a goal (exploitation), but it should also try different actions to ensure that other parts of the environment are explored (exploration). This problem is called the exploration-exploitation dilemma (or exploration-exploitation trade-off), and it concerns the difficult task of balancing the exploration and the exploitation of the environment. This is also very important because, unlike in supervised learning, an RL agent can influence the environment and is free to collect new data for as long as it deems this useful.

The environment may be stochastic and non-deterministic, and the agent has to take this into consideration when learning and predicting the next action. In fact, we'll see that many of the RL components can be designed to output either a single deterministic value or a range of values along with their probabilities.

A third type of learning is unsupervised learning, which is used to identify patterns in data without any supervised information. Data compression, clustering, and generative models are examples of unsupervised learning. It can also be adopted in RL settings in order to explore and learn about the environment. The combination of unsupervised learning and RL is called unsupervised RL. In this case, no reward is given, and the agent can generate an intrinsic motivation that favors new situations in which it can explore the environment.

It's worth noting that the problems associated with self-driving cars have also been addressed as supervised learning problems, but with poor results. The main problem stems from the difference between the distribution of data that the agent encounters during its lifetime and the distribution used during training.
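The exploration-exploitation trade-off listed among the characteristics above can be illustrated with an epsilon-greedy rule, one common and simple way to balance the two. The sketch below is an assumption-laden illustration rather than a method prescribed by the text: the function names, the Gaussian reward noise, and the three-action setup are all hypothetical.

```python
import random


def epsilon_greedy(estimates, epsilon, rng=random):
    """Explore with probability `epsilon`, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))  # exploration: random action
    # exploitation: action with the highest estimated reward so far
    return max(range(len(estimates)), key=estimates.__getitem__)


def run_bandit(true_means, epsilon=0.1, steps=5000, seed=0):
    """Estimate the mean reward of each action from noisy samples.

    `true_means` is hidden from the agent; it only sees sampled rewards.
    """
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)
    counts = [0] * len(true_means)
    for _ in range(steps):
        action = epsilon_greedy(estimates, epsilon, rng)
        reward = rng.gauss(true_means[action], 1.0)  # stochastic reward
        counts[action] += 1
        # incremental average of the rewards observed for this action
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


estimates, counts = run_bandit([0.0, 0.5, 2.0])
print(estimates, counts)
```

Setting `epsilon` to 0 makes the agent purely greedy, so it may never discover a better action, while setting it to 1 makes the agent ignore what it has learned. Values in between trade one off against the other.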

History of RL

The first mathematical foundation of RL was built during the 1960s and 1970s in the field of optimal control. Optimal control addresses the problem of minimizing a measure of a dynamic system's behavior over time. The method involved solving a set of equations using the known dynamics of the system. During this time, the key concept of a Markov decision process (MDP) was introduced. This provides a general framework for modeling decision-making in stochastic situations. During these years, a solution method for optimal control called dynamic programming (DP) was also introduced. DP solves an MDP by breaking a complex problem down into a collection of simpler subproblems. Note that DP only provides an easier way to solve optimal control for systems with known dynamics; there is no learning involved. It also suffers from the curse of dimensionality, because the computational requirements grow exponentially with the number of state variables. Even though these methods don't involve learning, as noted by Richard S. Sutton and Andrew G. Barto, we must consider the solution methods of optimal control, such as DP, to also be RL methods.

In the 1980s, the concept of learning from temporally successive predictions, the so-called temporal difference learning (TD learning) method, was finally introduced. TD learning introduced a new family of powerful algorithms that will be explained in this book. The first problems solved with TD learning were small enough to be represented in tables or arrays. The methods used to solve them are called tabular methods; they can often find an optimal solution, but they do not scale. In fact, many RL tasks involve huge state spaces, making tabular methods impossible to adopt. In these problems, function approximation is used to find a good approximate solution with fewer computational resources.
The adoption of function approximation and, in particular, of artificial neural networks (and deep neural networks) in RL is not trivial; however, as shown on many occasions, they are able to achieve impressive results. The use of deep learning in RL is called deep reinforcement learning (deep RL), and it has achieved great popularity ever since a deep RL algorithm named deep Q-network (DQN) displayed a superhuman ability to play Atari games from raw images in 2015. Another striking achievement of deep RL was AlphaGo, which in 2016 became the first program to beat Lee Sedol, a professional Go player and 18-time world champion. These breakthroughs not only showed that machines can perform better than humans in high-dimensional spaces (using the same perception as humans with respect to images), but also that they can behave in interesting ways. An example of this is the creative shortcut found by a deep RL system while playing Breakout, an Atari arcade game in which the player has to destroy all the bricks, as shown in the following image. The agent found that just by creating a tunnel on the left-hand side of the bricks and sending the ball in that direction, it could destroy many more bricks and thus increase its overall score with just one move.

There are many other interesting cases where agents exhibit superb behavior or strategies that weren't known to humans, such as a move performed by AlphaGo while playing Go against Lee Sedol. From a human perspective, that move (known as move 37) seemed nonsensical, but it ultimately allowed AlphaGo to win the game.

Nowadays, when dealing with high-dimensional state or action spaces, the use of deep neural networks as function approximators has become almost a default choice. Deep RL has been applied to more challenging problems, such as data center energy optimization, self-driving cars, multi-period portfolio optimization, and robotics, to name just a few.
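To make the idea of a tabular TD method mentioned earlier in this section more concrete, here is a minimal sketch of a TD(0) value update on a toy problem. Everything about the setup is an assumption chosen for illustration: a hypothetical chain of five states in which the agent always moves right, with a reward of 1 on entering the last (terminal) state. The function name and parameter values are likewise illustrative.

```python
def td0_chain(num_states=5, episodes=200, alpha=0.5, gamma=1.0):
    """Learn state values on a deterministic chain with tabular TD(0).

    The value table has one entry per state, which is what makes the
    method "tabular". The last state is terminal and keeps value 0.
    """
    values = [0.0] * num_states
    terminal = num_states - 1
    for _ in range(episodes):
        state = 0
        while state != terminal:
            next_state = state + 1  # the only available move: step right
            reward = 1.0 if next_state == terminal else 0.0
            # TD target: immediate reward plus the current prediction
            # for the next state (a "temporally successive" prediction)
            td_target = reward + gamma * values[next_state]
            values[state] += alpha * (td_target - values[state])
            state = next_state
    return values


print(td0_chain())
```

Because every non-terminal state eventually leads to the single reward of 1 and no discounting is applied, each of their values approaches 1 as the number of episodes grows. The table needs one entry per state, which is why this approach stops being practical when the state space is huge.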

Deep RL