Challenges of real-world reinforcement learning, Dulac-Arnold et al., ICML’19

Last week we looked at some of the challenges inherent in automation and in building systems where humans and software agents collaborate. When we start talking about agents, policies, and modelling the environment, my thoughts naturally turn to reinforcement learning (RL). Today’s paper choice sets out some of the current (additional) challenges we face getting reinforcement learning to work well in many real-world systems.

We consider control systems grounded in the physical world, optimization of software systems, and systems that interact with users such as recommender systems and smart phones. … RL methods have been shown to be effective on a large set of simulated environments, but uptake in real-world problems has been much slower.

Why is this? The authors posit that there’s a meaningful gap between the tightly controlled, simulation-friendly research settings where many RL systems do well, and the messy realities and constraints of real-world systems. For example, there may be no good simulator available, exploration may be curtailed by strong safety constraints, and feedback cycles for learning may be slow.

This lack of available simulators means learning must be done using data from the real system, and all acting and exploring must be done on the real system. Thus, we cannot simply collect massive datasets to solve these challenges, nor can we ignore safety during training.

The paper highlights nine challenges, all of which have to be addressed simultaneously for successful deployment of RL in many real-world settings.

1. Effective off-line learning (e.g. from system logs)
2. Fast learning on real systems given limited samples
3. High-dimensional continuous state and action spaces
4. Safety constraints that should never or at least rarely be violated
5. Tasks that may be partially observable (non-stationary or stochastic)
6. Reward functions that are un(der)specified, multi-objective, or risk-sensitive
7. System operators who desire explainable policies and actions
8. Inference that must happen in real-time at the control frequency of the system
9. Large and/or unknown delays in the system actuators, sensors, or rewards

For RL researchers, the paper gives pointers to recent work in each of these areas. For the rest of us, the challenges form a handy checklist when thinking about the suitability of RL in a given situation.

Off-line learning

It’s often the case that training can’t be done directly online, and therefore learning takes place offline, using logs from a previous version of the control system. Generally we’d like the new version of the system to perform better than the old one, and that means we also need to do off-policy evaluation (estimating the performance of the new policy without running it on the real system). There are a few methods for doing this (detailed in §2.1 of the paper), including importance sampling.
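To make the importance sampling idea concrete, here is a minimal sketch of the ordinary (trajectory-wise) estimator. It is an illustration, not the paper’s code: the log format, the `Step` tuple, and the `target_prob` callback are all assumptions made for this example. It also assumes the logs record the probability with which the old policy chose each action.

```python
from typing import Callable, List, Tuple

# Hypothetical log record: (state, action, reward, probability the logging policy gave that action)
Step = Tuple[int, int, float, float]


def importance_sampling_estimate(
    trajectories: List[List[Step]],
    target_prob: Callable[[int, int], float],
    gamma: float = 1.0,
) -> float:
    """Estimate the new policy's expected return from logs of the old policy."""
    total = 0.0
    for trajectory in trajectories:
        weight = 1.0  # running product of pi_new(a|s) / pi_old(a|s)
        observed_return = 0.0  # discounted return actually seen in the log
        for t, (state, action, reward, logged_prob) in enumerate(trajectory):
            weight *= target_prob(state, action) / logged_prob
            observed_return += (gamma ** t) * reward
        total += weight * observed_return
    return total / len(trajectories)


# Two one-step episodes logged by a uniform-random policy over two actions.
logs = [[(0, 0, 0.0, 0.5)], [(0, 1, 1.0, 0.5)]]


def always_action_one(state: int, action: int) -> float:
    return 1.0 if action == 1 else 0.0


estimate = importance_sampling_estimate(logs, always_action_one)
```

The estimate reweights each logged return by how much more (or less) likely the new policy was to produce that trajectory. The product of ratios can have very high variance on long trajectories, which is one reason data efficiency and off-line evaluation are listed as open challenges.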

One special case to consider is the deployment of the first RL version (the initial policy); there is often a minimum performance threshold to be met before this is allowed to happen. Therefore another important quality to be able to evaluate is the warm-start performance.

Learning from limited samples

(Many) real systems do not have separate training and evaluation environments. All training data comes from the real system, and the agent cannot have a separate exploration policy during training as its exploratory actions do not come for free.

Given this higher cost for exploration, and the fact that logs for learning from are likely to explore very little of the state space, policy learning needs to be data-efficient. Control frequencies (opportunities to take an action) may be 1 hour or even multi-month timesteps, and reward horizons even longer.

One simple way to assess the data efficiency of a model is to look at the amount of data necessary to achieve a certain performance threshold.
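As a rough sketch of that measurement, assuming we have a learning curve recorded as parallel lists of cumulative sample counts and evaluation scores (the function name and input format here are made up for illustration):

```python
from typing import Optional, Sequence


def samples_to_threshold(
    samples_seen: Sequence[int],
    performance: Sequence[float],
    threshold: float,
) -> Optional[int]:
    """Return the sample count at which performance first reaches the threshold.

    Returns None if the threshold is never reached within the recorded curve.
    """
    for count, score in zip(samples_seen, performance):
        if score >= threshold:
            return count
    return None


# Hypothetical learning curve: evaluation score after 100, 200, and 300 samples.
needed = samples_to_threshold([100, 200, 300], [0.2, 0.6, 0.9], threshold=0.5)
```

Comparing this number across algorithms, for the same threshold, gives a simple data-efficiency ranking.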

High-dimensional state and action spaces

Many practical real-world problems have large and continuous state and action spaces… these can present serious issues for traditional RL algorithms.

One approach put forward by Dulac-Arnold et al. is to generate a vector for a candidate action and then do a nearest neighbour search to find the closest available real action. When evaluating policy performance, it’s important to take into account the relatedness of different actions: “millions of related actions are much easier to learn than a couple hundred completely unrelated actions.”
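The nearest-neighbour step might look something like the following sketch. It assumes each discrete action already has an embedding vector, and it uses a brute-force Euclidean search; the function name and the toy embeddings are illustrative only, and a real system with millions of actions would presumably need an approximate nearest-neighbour index instead.

```python
import math
from typing import List, Sequence


def nearest_actions(
    proto_action: Sequence[float],
    action_embeddings: Sequence[Sequence[float]],
    k: int = 1,
) -> List[int]:
    """Return the indices of the k real actions closest to the candidate vector."""
    distances = [
        (math.dist(proto_action, embedding), index)
        for index, embedding in enumerate(action_embeddings)
    ]
    distances.sort()  # closest first; ties broken by lower index
    return [index for _, index in distances[:k]]


# Four available actions embedded in 2-D, and a candidate vector from the policy.
embeddings = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
closest = nearest_actions((0.9, 0.2), embeddings, k=2)
```

Returning more than one neighbour leaves room for a later step to choose among the candidates, for example by scoring each with a value estimate.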

Safety constraints

Many control systems must operate under safety constraints – including during exploratory learning phases. Constrained MDPs (Markov Decision Processes) allow constraints to be specified over states and actions. Budgeted MDPs allow the constraint level / performance trade-off to be explored by letting constraint levels be learned rather than simply hard-wired. Another approach is to add a safety layer to the network which prevents any safety violations.
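One way to picture a safety layer is as a final step that minimally corrects the policy’s proposed action so that a predicted constraint value stays under its limit. The sketch below is an assumption-laden illustration rather than the method from the cited work: it supposes a single constraint whose value is modelled as linear in the action (`offset + gradient · action`), in which case the closest safe action has a closed form.

```python
from typing import List, Sequence


def project_to_safe_action(
    action: Sequence[float],
    gradient: Sequence[float],
    offset: float,
    limit: float,
) -> List[float]:
    """Return the closest action (Euclidean) whose predicted constraint value is <= limit.

    The constraint value is assumed to be offset + dot(gradient, action).
    """
    predicted = offset + sum(g * a for g, a in zip(gradient, action))
    violation = predicted - limit
    norm_sq = sum(g * g for g in gradient)
    if violation <= 0 or norm_sq == 0:
        # Already safe, or the action has no influence on the constraint.
        return list(action)
    scale = violation / norm_sq
    # Move against the gradient just far enough to sit on the constraint boundary.
    return [a - scale * g for a, g in zip(action, gradient)]


# The policy proposes [2.0, 0.0]; the (hypothetical) constraint is action[0] <= 1.0.
safe = project_to_safe_action([2.0, 0.0], gradient=[1.0, 0.0], offset=0.0, limit=1.0)
```

Actions that already satisfy the constraint pass through unchanged, so the layer only interferes when exploration would otherwise cross the boundary. The guarantee is only as good as the linear constraint model.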

Partial observability

Almost all real systems where we would want to deploy reinforcement learning are partially observable.

For example, efficiency of mechanical parts may degrade over time, ‘identical’ widgets may exhibit variations in performance given the same control inputs, or the state of some parts of the system (e.g. the mental state of users of a recommender system) may simply be unknowable.

To deal with this we can use Partially Observable Markov Decision Processes (POMDPs).

The key difference from the MDP formulation is that the agent’s observation is now separate from the state, with an observation function O(x | s) giving the probability of observing x given the environment state s.

Kant would approve.

Two common approaches to handling partial observability are including history in the input (e.g. some kind of windowing), and modelling history within the model itself using recurrent networks.
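The windowing option is the simpler of the two to sketch. The helper below is hypothetical (its name and zero-padding choice are assumptions for illustration): it keeps the last few observations and concatenates them into a single input vector for the policy.

```python
from collections import deque
from typing import Deque, List, Sequence


class HistoryWindow:
    """Keep the most recent observations and expose them as one flat input vector."""

    def __init__(self, window: int, obs_dim: int) -> None:
        # Pad with zeros so the policy input has a fixed size from the first step.
        self.frames: Deque[List[float]] = deque(
            ([0.0] * obs_dim for _ in range(window)), maxlen=window
        )

    def push(self, observation: Sequence[float]) -> List[float]:
        """Add the newest observation (dropping the oldest) and return the stacked input."""
        self.frames.append(list(observation))
        return [value for frame in self.frames for value in frame]


history = HistoryWindow(window=3, obs_dim=2)
first_input = history.push([1.0, 2.0])
second_input = history.push([3.0, 4.0])
```

The window length is a trade-off: too short and the relevant history is lost, too long and the input dimension grows. A recurrent network avoids picking a fixed length at the cost of a harder training problem.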

Furthermore, Robust MDP formalisms have explicit mechanisms to ensure that agents are robust to sensor and action noise and delays. If a given deployment environment may have unknown upfront but learnable sources of noise, then System Identification techniques can be used to train a policy that can learn what environment it is operating in and modify itself accordingly.

Reward functions

In many cases, system or product owners do not have a clear picture of what they want to optimize.

Who would have thought!

Often the reward function is multi-dimensional and needs to balance multiple sub-goals. Another great insight here (which reminds me of discussions of system latency) is that ‘average’ performance (i.e., expectation) is often an insufficient metric, and the system needs to perform well for all task instances. Thus we need to consider the distribution of behaviours in reward function evaluation, and not just a single number.

A typical approach to evaluate the full distribution of reward across groups is to use a Conditional Value at Risk (CVaR) objective, which looks at a given percentile of the reward distribution, rather than expected reward.
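For intuition, CVaR at level alpha is commonly defined as the expected return within the worst alpha fraction of outcomes, so it summarises the tail below a percentile rather than a single percentile point. Here is a minimal empirical version, assuming we simply have a list of sampled episode returns (the function name and the rounding of the tail size are choices made for this sketch):

```python
import math
from typing import Sequence


def cvar(returns: Sequence[float], alpha: float = 0.1) -> float:
    """Mean of the worst alpha fraction of returns (the lower tail)."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not returns:
        raise ValueError("returns must be non-empty")
    ordered = sorted(returns)  # worst outcomes first
    tail_size = max(1, math.ceil(alpha * len(ordered)))
    tail = ordered[:tail_size]
    return sum(tail) / len(tail)


episode_returns = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mean_return = sum(episode_returns) / len(episode_returns)
tail_return = cvar(episode_returns, alpha=0.2)
```

With these numbers the mean return is 5.5 but the CVaR at alpha = 0.2 is 1.5, the average of the two worst episodes. Two policies with the same mean can differ a lot on this measure, which is the point of looking beyond expectation.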

Explainability / Interpretability

Real systems are “owned and operated by humans who need to be reassured about the controller’s intentions and require insights regarding failure cases.”

For this reason, policy explainability is important for real-world policies. Especially in cases where the policy might find an alternative and unexpected approach to controlling a system, understanding the longer-term intent of the policy is important for obtaining stakeholder buy-in. In the event of policy errors, being able to understand the error’s origins a posteriori is essential.

The authors highlight the work of Verma et al. here, which reminds me of some of the interpretable ML systems we looked at last year: the agent learns a policy program, expressed in a domain-specific programming language. Thus the policy can be understood by inspecting the program, with the added benefit that programs in the DSL can be verified, so learned policies may be verifiably correct. That all sounds pretty interesting to me, and we’ll be looking at this work in the next edition of The Morning Paper!

Real-time inference

Policy inference needs to complete within the control interval of the system, which could be on the order of milliseconds or less. This prevents us from using computationally expensive approaches that don’t meet the constraint (e.g., some forms of model-based planning). Systems with longer control intervals of course bring the opposite problem: we can’t run the task faster than real-time in order to speed up data generation.

Delayed rewards

… most real systems have delays in either the sensation of the state, the actuators, or the reward feedback.

For example, there may be delays in the effects produced by a braking system, or delays between the choices presented by a recommendation system and the user’s subsequent behaviours (which might be spread over a period of weeks).

There are a number of potential techniques for dealing with this, including memory-based agents that leverage a memory retrieval system to allocate credit to distant past events that are useful in predicting the value function at the current timestep. Alternatively, the RUDDER algorithm uses a backward view of a task to redistribute delayed rewards more evenly throughout time.

The last word