UPDATE: We’ve also summarized the top 2019 Reinforcement Learning research papers.

At a 2017 O’Reilly AI conference, Andrew Ng ranked reinforcement learning dead last in terms of its utility for business applications. Compared to other machine learning methods like supervised learning, transfer learning, and even unsupervised learning, deep reinforcement learning (RL) is incredibly data hungry, often unstable, and rarely the best option in terms of performance. RL has historically been successfully applied only in arenas where mountains of simulated data can be generated on demand, such as games and robotics.

Despite RL’s limitations in solving business use cases, some AI experts believe this approach is the most viable strategy for achieving human-level or superhuman Artificial General Intelligence (AGI). The recent victory of DeepMind’s AlphaStar over top-ranked professional StarCraft players suggests we might be on the cusp of applying deep RL to real-world problems with real-time demands, extraordinary complexity, and incomplete information.

In 2018, we saw a number of advancements that could make reinforcement learning much more applicable to real-world domains. These include increased data efficiency and stability, multi-tasking, and the recently introduced Horizon platform for applied RL.

We’ve done our best to summarize these papers correctly, but if we’ve made any mistakes, please contact us to request a fix.

If these summaries of scientific AI research papers are useful for you, you can subscribe to our AI Research mailing list at the bottom of this article to be alerted when we release new summaries.

If you’d like to skip around, here are the papers we featured:

Important Reinforcement Learning Research Papers

1. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, by Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine

Original Abstract

Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy. That is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.

Our Summary

The Berkeley AI Research team introduces soft actor-critic (SAC), an off-policy maximum entropy deep reinforcement learning algorithm in which the actor aims to maximize expected reward while also maximizing entropy. The experiments confirm that stochastic, entropy-maximizing reinforcement learning algorithms achieve substantial improvements in both performance and sample efficiency. Moreover, the approach is stable and scalable.

What’s the core idea of this paper?

Introducing the soft actor-critic algorithm, which has three key ingredients: an actor-critic architecture with separate policy and value function networks; an off-policy formulation that enables reuse of previously collected data for efficiency; and entropy maximization for stability and exploration.
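To make the entropy term concrete, here is a minimal sketch of a "soft" state value and the TD target built from it. It is not the authors' implementation: the function names, the temperature `alpha`, and the discount `gamma` are our own illustrative assumptions, and the Q-values and log-probabilities are plain arrays standing in for network outputs.

```python
import numpy as np

def soft_state_value(q_values, log_probs, alpha=1.0):
    """Monte Carlo estimate of V(s) = E_a[Q(s, a) - alpha * log pi(a|s)].

    q_values and log_probs hold one entry per action sampled from the policy.
    alpha (assumed name) trades reward against entropy.
    """
    q_values = np.asarray(q_values, dtype=float)
    log_probs = np.asarray(log_probs, dtype=float)
    return float(np.mean(q_values - alpha * log_probs))

def soft_q_target(reward, next_soft_value, done, gamma=0.99):
    """TD target for the critic: r + gamma * V_soft(s') for non-terminal s'."""
    return reward + gamma * (1.0 - done) * next_soft_value

# Two sampled actions in the next state, each with log-probability -1.
v_next = soft_state_value([1.0, 3.0], [-1.0, -1.0], alpha=0.5)
target = soft_q_target(reward=1.0, next_soft_value=v_next, done=0.0, gamma=0.9)
```

Because the target comes from stored transitions rather than fresh rollouts, this kind of update can reuse previously collected data, which is the off-policy ingredient mentioned above.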



What’s the key achievement?

The SAC algorithm performs comparably to the baseline methods on the easier tasks but significantly outperforms them on the more challenging ones, and it demonstrates substantial improvements in learning speed, final performance, sample efficiency, stability, and scalability.



What does the AI community think?

The paper was presented at ICML 2018, one of the most important Machine Learning conferences.

What are future research areas?

Further exploration of maximum entropy methods in reinforcement learning.

What are possible business applications?

The stability and scalability of the SAC algorithm make it a good candidate for complex, real-world domains: it is sample efficient; it is an off-policy algorithm, which allows it to reuse already collected data; and the maximum entropy formulation reduces the need for hyperparameter tuning.



Where can you get implementation code?

The authors provide a TensorFlow implementation for this research paper on GitHub.

For a PyTorch implementation of soft actor-critic, take a look at the rlkit repository by Vitchyr Pong.

2. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, by Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, Koray Kavukcuoglu

Original Abstract

In this work we aim to solve a large collection of tasks using a single reinforcement learning agent with a single set of parameters. A key challenge is to handle the increased amount of data and extended training time. We have developed a new distributed agent IMPALA (Importance Weighted Actor-Learner Architecture) that not only uses resources more efficiently in single-machine training but also scales to thousands of machines without sacrificing data efficiency or resource utilisation. We achieve stable learning at high throughput by combining decoupled acting and learning with a novel off-policy correction method called V-trace. We demonstrate the effectiveness of IMPALA for multi-task reinforcement learning on DMLab-30 (a set of 30 tasks from the DeepMind Lab environment (Beattie et al., 2016)) and Atari-57 (all available Atari games in Arcade Learning Environment (Bellemare et al., 2013a)). Our results show that IMPALA is able to achieve better performance than previous agents with less data, and crucially exhibits positive transfer between tasks as a result of its multi-task approach.

Our Summary

In this research, the DeepMind team tackles multi-tasking, one of the key challenges in reinforcement learning. To this end, they have developed a new distributed agent called Importance Weighted Actor-Learner Architecture (IMPALA). This agent uses a topology of completely independent actors and learners that cooperate to share knowledge across different domains. Combined with a novel off-policy correction method called V-trace, IMPALA outperforms previous approaches and shows much better data efficiency and stability.

What’s the core idea of this paper?

Developing a fast and scalable policy gradient agent, Importance Weighted Actor-Learner Architecture (IMPALA), which was inspired by the popular A3C architecture but has some distinctive features: IMPALA actors collect experience (sequences of states, actions, and rewards) and pass it to a central learner that computes gradients; IMPALA can be implemented with a single learner machine or with multiple learners that perform synchronous updates among themselves; separating the learning and acting processes increases the throughput of the whole system.

However, decoupling the acting and learning also causes the policy in the actor to lag behind the learner, and to compensate for this lag the researchers introduce an off-policy actor-critic correction method called V-trace.
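The correction can be illustrated with a short sketch of V-trace value targets for a single trajectory. This follows our reading of the paper's definition (truncated importance weights applied to temporal-difference errors) rather than the released code; the function name, argument layout, and default thresholds `rho_bar` and `c_bar` are assumptions made for illustration.

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, ratios,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace targets for one trajectory of length T.

    ratios[t] is pi(a_t|x_t) / mu(a_t|x_t): learner policy over actor policy.
    values[t] is the learner's estimate V(x_t); bootstrap_value is V(x_T).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    ratios = np.asarray(ratios, dtype=float)

    rhos = np.minimum(rho_bar, ratios)  # truncated weights on the TD errors
    cs = np.minimum(c_bar, ratios)      # truncated weights on the trace
    next_values = np.append(values[1:], bootstrap_value)
    deltas = rhos * (rewards + gamma * next_values - values)

    # Backward recursion: v_t - V(x_t) = delta_t + gamma * c_t * (v_{t+1} - V(x_{t+1}))
    corrections = np.zeros_like(values)
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * cs[t] * acc
        corrections[t] = acc
    return values + corrections

# On-policy data (all ratios equal to 1) reduces to ordinary n-step returns.
targets = vtrace_targets([1.0, 1.0], [0.0, 0.0], 0.0, [1.0, 1.0], gamma=1.0)
```

When the actor's policy has drifted far from the learner's, the ratios shrink and the targets fall back toward the learner's current value estimates, which is how the lag is kept from destabilizing learning.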

What’s the key achievement?

The experiments in the multi-task setting (DMLab-30 and Atari-57) show that IMPALA combined with the V-trace correction algorithm: achieves exceptionally high data throughput of 250,000 frames per second (30 times faster than single-machine A3C); is more stable and 10 times more data efficient than A3C-based agents; achieves a human normalized score of 49.4% on DMLab-30, compared to 23.8% for A3C-based agents.



What does the AI community think?

The paper was presented at ICML 2018, one of the most important Machine Learning conferences.

What are future research areas?

Exploring ways to deal effectively with the large variation in reward scales across tasks, since this variation makes the agent focus on the tasks with larger scores.

What are possible business applications?

The suggested approach is suitable for deployment in commercial settings due to its stability, data efficiency, and ability to deal effectively with multiple tasks.

Where can you get implementation code?

The source code is available on GitHub.

3. Temporal Difference Models: Model-Free Deep RL for Model-Based Control, by Vitchyr Pong, Shixiang Gu, Murtaza Dalal, Sergey Levine

Original abstract

Model-free reinforcement learning (RL) is a powerful, general tool for learning complex behaviors. However, its sample efficiency is often impractically large for solving challenging real-world problems, even with off-policy algorithms such as Q-learning. A limiting factor in classic model-free RL is that the learning signal consists only of scalar rewards, ignoring much of the rich information contained in state transition tuples. Model-based RL uses this information, by training a predictive model, but often does not achieve the same asymptotic performance as model-free RL due to model bias. We introduce temporal difference models (TDMs), a family of goal-conditioned value functions that can be trained with model-free learning and used for model-based control. TDMs combine the benefits of model-free and model-based RL: they leverage the rich information in state transitions to learn very efficiently, while still attaining asymptotic performance that exceeds that of direct model-based RL methods. Our experimental results show that, on a range of continuous control tasks, TDMs provide a substantial improvement in efficiency compared to state-of-the-art model-based and model-free methods.

Our Summary

The Berkeley AI Research team explores ways to combine the benefits of model-free and model-based algorithms. To this end, they introduce Temporal Difference Models (TDMs), a family of goal-conditioned value functions that can be trained with model-free learning and used for model-based control. The experiments on several continuous control tasks confirm that TDMs perform as well as model-free algorithms but learn as quickly as model-based approaches.

What’s the core idea of this paper?

Model-free RL algorithms achieve the best asymptotic performance but are not sample-efficient, while model-based RL algorithms are more sample-efficient but pay for it with model bias that lowers their asymptotic performance.

Temporal Difference Models (TDMs) combine the benefits of model-free and model-based reinforcement learning: a TDM is a goal-conditioned value function; it predicts how close the agent can get to a goal state when it tries to reach that state within τ time steps; because a TDM is just another Q-function, it can be trained with model-free algorithms (e.g., deep deterministic policy gradient).
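As a rough illustration of how the horizon τ enters the learning target, the sketch below computes a TDM-style target for one transition. It reflects our understanding of the idea and not the authors' code: the function name is ours, the Euclidean distance is an assumed choice of metric, and `next_q_max` stands in for the maximum over actions of Q(s', a, g, τ − 1) that a real agent would get from its network.

```python
import numpy as np

def tdm_target(next_state, goal, tau, next_q_max):
    """Learning target for a goal-conditioned, horizon-aware Q-function.

    tau == 0: the horizon is exhausted, so the target is the negative
              distance between the reached state and the goal.
    tau  > 0: bootstrap from the best Q-value with one step less to go.
    """
    if tau == 0:
        diff = np.asarray(next_state, dtype=float) - np.asarray(goal, dtype=float)
        return -float(np.linalg.norm(diff))  # assumed metric: Euclidean
    return float(next_q_max)

# Out of time at distance 5 from the goal: the target is -5.
out_of_time = tdm_target([0.0, 0.0], [3.0, 4.0], tau=0, next_q_max=0.0)
# Two steps left: pass through the bootstrapped estimate.
still_going = tdm_target([0.0, 0.0], [3.0, 4.0], tau=2, next_q_max=-1.5)
```

Since every transition can be paired with many different goals and horizons, one stored transition yields many training targets, which is where the efficiency gain comes from.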



What’s the key achievement?

The experiments on five simulated continuous control tasks and one real-world robotics task confirm that TDMs combine the benefits of model-free and model-based algorithms: they achieve asymptotic performance close to the model-free algorithms, but learn as quickly as purely model-based methods.



What does the AI community think?

The paper was presented at ICLR 2018, one of the key deep learning conferences.

What are future research areas?

Applying TDMs to complex state representations, such as images.

Extending TDMs to stochastic environments.

Combining TDMs with some alternative model-based planning optimization algorithms.

What are possible business applications?

High performance and sample efficiency enable application of TDMs in real-world settings, including robotics, autonomous driving and flight.

Where can you get implementation code?

For a PyTorch implementation of TDMs, take a look at rlkit repository by Vitchyr Pong.

4. Addressing Function Approximation Error in Actor-Critic Methods, by Scott Fujimoto, Herke van Hoof, David Meger

Original abstract

In value-based reinforcement learning methods such as deep Q-learning, function approximation errors are known to lead to overestimated value estimates and suboptimal policies. We show that this problem persists in an actor-critic setting and propose novel mechanisms to minimize its effects on both the actor and the critic. Our algorithm builds on Double Q-learning, by taking the minimum value between a pair of critics to limit overestimation. We draw the connection between target networks and overestimation bias, and suggest delaying policy updates to reduce per-update error and further improve performance. We evaluate our method on the suite of OpenAI gym tasks, outperforming the state of the art in every environment tested.

Our Summary

The issue of value overestimation caused by function approximation errors is well studied in reinforcement learning problems with discrete action spaces, but the researchers argue that overestimation bias is also present in the actor-critic setting. To address it, they introduce the Twin Delayed Deep Deterministic policy gradient algorithm (TD3), which combines a clipped variant of Double Q-learning, delayed policy updates, and target policy smoothing. The experiments on seven continuous control tasks from OpenAI gym confirm that the suggested approach outperforms the state-of-the-art methods by a significant margin.

What’s the core idea of this paper?

High variance contributes to overestimation bias and leads to a noisy gradient for the policy update, resulting in reduced learning speed and lower performance.

To address high variance, the researchers introduce the Twin Delayed Deep Deterministic policy gradient algorithm (TD3), which is based on the Deep Deterministic Policy Gradient (DDPG) algorithm but includes several important modifications: a clipped Double Q-learning variant, which assumes that “a value estimate suffering from overestimation bias can be used as an approximate upper-bound to the true value estimate”; delayed policy updates, which wait until the value estimate has converged, to address the interplay between high-variance estimates and policy performance; target policy smoothing, which enforces the notion that similar actions should have similar values.
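The three modifications fit in a few lines. The sketch below is ours, not the authors' PyTorch code: the critics are passed in as plain callables `target_q1(state, action)` and `target_q2(state, action)`, and the noise scale, clipping range, action bound, and update period are illustrative assumptions.

```python
import numpy as np

def td3_target(reward, done, next_state, next_action, target_q1, target_q2,
               rng, gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0):
    """Critic target using target policy smoothing and clipped Double Q-learning."""
    next_action = np.asarray(next_action, dtype=float)
    # Target policy smoothing: perturb the target action with clipped noise.
    noise = np.clip(rng.normal(0.0, noise_std, size=next_action.shape),
                    -noise_clip, noise_clip)
    smoothed = np.clip(next_action + noise, -max_action, max_action)
    # Clipped Double Q-learning: trust the smaller of the two critic estimates.
    q_min = min(target_q1(next_state, smoothed), target_q2(next_state, smoothed))
    return reward + gamma * (1.0 - done) * q_min

def should_update_policy(step, policy_delay=2):
    """Delayed policy updates: the actor is updated less often than the critics."""
    return step % policy_delay == 0

rng = np.random.default_rng(0)
target = td3_target(
    reward=1.0, done=0.0, next_state=None, next_action=np.array([0.3]),
    target_q1=lambda s, a: 2.0, target_q2=lambda s, a: 5.0,
    rng=rng, gamma=0.5,
)
```

In the usage example the two critics disagree (2.0 versus 5.0), and the target is built from the lower estimate, which is what limits overestimation.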



What’s the key achievement?

Introducing an effective approach to addressing high variance and the resulting overestimation bias in actor-critic methods.

Evaluating the proposed algorithm on seven continuous control tasks from OpenAI gym and showing that it outperforms the state of the art by a wide margin.

What does the AI community think?

The paper was presented at ICML 2018, one of the most important Machine Learning conferences.

What are future research areas?

Applying the suggested modifications to other actor-critic algorithms.

Where can you get implementation code?

A PyTorch implementation of TD3 and DDPG for OpenAI gym tasks is available on GitHub.

5. Learning by Playing – Solving Sparse Reward Tasks from Scratch, by Martin Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave, Tom Van de Wiele, Volodymyr Mnih, Nicolas Heess, Jost Tobias Springenberg

Original abstract

We propose Scheduled Auxiliary Control (SAC-X), a new learning paradigm in the context of Reinforcement Learning (RL). SAC-X enables learning of complex behaviors – from scratch – in the presence of multiple sparse reward signals. To this end, the agent is equipped with a set of general auxiliary tasks, that it attempts to learn simultaneously via off-policy RL. The key idea behind our method is that active (learned) scheduling and execution of auxiliary policies allows the agent to efficiently explore its environment – enabling it to excel at sparse reward RL. Our experiments in several challenging robotic manipulation settings demonstrate the power of our approach.

Our Summary

The DeepMind team introduces a new reinforcement learning algorithm, called Scheduled Auxiliary Control (SAC-X), which was inspired by the playful phase of our childhood. The key idea of the suggested approach is that before learning a complex task from scratch, the agent needs to master a set of basic auxiliary tasks. The experiments on several challenging robotic manipulation tasks demonstrate that the SAC-X algorithm is able to learn complex tasks from scratch in a reliable and data-efficient way.

What’s the core idea of this paper?

Introducing a new learning paradigm called Scheduled Auxiliary Control (SAC-X), which leverages the idea that in order to learn complex tasks from scratch, an agent has to learn and master a set of basic skills first:

In addition to the main task reward, there is a series of auxiliary rewards for tackling the auxiliary tasks.

The auxiliary tasks encourage the agent to control its own sensory observations (e.g., images, proprioception, haptic sensors).

The agent decides by itself which goal to pursue next (one of the auxiliary tasks or the target task). This decision is made by a scheduling module, which learns through a meta-learning algorithm with the goal of maximizing progress on the main task.

Learning is performed off-policy so that each policy can learn from data generated by all the other policies.
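To give a feel for the scheduling step, here is a deliberately simplified sketch in which the next task is sampled from a Boltzmann (softmax) distribution over estimates of how much each task helps the main task. This is our own illustration under stated assumptions, not the paper's scheduler: the function names, the task names, the `temperature` parameter, and the idea of a flat list of per-task estimates are all hypothetical.

```python
import numpy as np

def schedule_probabilities(estimated_returns, temperature=1.0):
    """Softmax over per-task estimates of the resulting main-task return."""
    prefs = np.asarray(estimated_returns, dtype=float) / temperature
    prefs -= prefs.max()  # subtract the maximum for numerical stability
    weights = np.exp(prefs)
    return weights / weights.sum()

def pick_next_task(task_names, estimated_returns, rng, temperature=1.0):
    """Sample which task (auxiliary or main) the agent pursues next."""
    probs = schedule_probabilities(estimated_returns, temperature)
    return task_names[rng.choice(len(task_names), p=probs)]

rng = np.random.default_rng(0)
tasks = ["touch", "move_object", "stack"]  # hypothetical task names
next_task = pick_next_task(tasks, [0.1, 0.4, 0.2], rng, temperature=0.5)
```

A higher temperature keeps the schedule exploratory, while a lower one concentrates on the tasks that currently look most useful for the main objective.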



What’s the key achievement?

Experiments with several challenging robotics tasks in simulation as well as on a real robot show that: SAC-X is able to solve from scratch all the tasks it is given, and it can even learn from scratch directly on a real robot arm; learned intentions are highly reactive and reliable; the learning process is data-efficient.



What does the AI community think?

The paper was presented at ICML 2018, one of the most important Machine Learning conferences.

What are future research areas?

Applying the SAC-X approach to different types of tasks.

What are possible business applications?

The research paper discusses the application of the SAC-X framework to typical robotic manipulation tasks, but it is in fact a general RL method that can be applied in sparse-reward reinforcement learning settings beyond control and robotics.

Where can you get implementation code?

A PyTorch implementation of the SAC-X RL algorithm is available on GitHub.

6. Hierarchical Imitation and Reinforcement Learning, by Hoang M. Le, Nan Jiang, Alekh Agarwal, Miroslav Dudík, Yisong Yue, Hal Daumé III

Original abstract

We study how to effectively leverage expert feedback to learn sequential decision-making policies. We focus on problems with sparse rewards and long time horizons, which typically pose significant challenges in reinforcement learning. We propose an algorithmic framework, called hierarchical guidance, that leverages the hierarchical structure of the underlying problem to integrate different modes of expert interaction. Our framework can incorporate different combinations of imitation learning (IL) and reinforcement learning (RL) at different levels, leading to dramatic reductions in both expert effort and cost of exploration. Using long-horizon benchmarks, including Montezuma’s Revenge, we demonstrate that our approach can learn significantly faster than hierarchical RL, and be significantly more label-efficient than standard IL. We also theoretically analyze labeling cost for certain instantiations of our framework.

Our Summary

The paper introduces a hierarchical guidance framework that combines imitation learning (IL) and reinforcement learning (RL) to solve problems that can be divided into subtasks. To minimize expert effort and cost, the researchers suggest an approach in which experts mainly provide high-level feedback (i.e., whether a subgoal is chosen correctly) and dive into the subtask itself only when necessary (i.e., when a subgoal is chosen correctly but the subpolicy fails). The experiments show that the hierarchical guidance framework learns much faster than hierarchical RL and requires much less expert feedback than standard IL.

What’s the core idea of this paper?

Introducing an algorithmic framework, called hierarchical guidance, that leverages the hierarchical structure of the underlying problem:

An expert labels the high-level trajectory with the correct macro-actions (subgoals).

If a macro-action is chosen correctly and the subpolicy is good, the expert simply verifies this.

If a macro-action is chosen incorrectly, the expert doesn’t look into the corresponding subpolicy.

If a macro-action is chosen correctly but the subpolicy fails, the expert labels the subpolicy as wrong and shows the right low-level trajectory.

In a hybrid IL/RL setting, the meta-controller, which is responsible for choosing the correct macro-action, is learned via imitation learning, while the subpolicies are learned via reinforcement learning.
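The expert-interaction rules described above amount to a small decision procedure. The sketch below encodes them as we read them from the summary; the function name and the returned field names are our own, and a real system would attach actual labels and trajectories rather than flags.

```python
def expert_feedback(macro_action_correct, subpolicy_succeeded):
    """Decide how much work the expert does for one high-level step.

    Returns flags describing the feedback the expert provides:
      relabel_macro_action: expert supplies the correct subgoal
      inspect_subpolicy:    expert looks at the low-level trajectory
      demonstrate_low_level: expert shows a correct low-level trajectory
    """
    if not macro_action_correct:
        # Wrong subgoal: fix it at the high level and skip the low level.
        return {"relabel_macro_action": True,
                "inspect_subpolicy": False,
                "demonstrate_low_level": False}
    if subpolicy_succeeded:
        # Right subgoal, working subpolicy: a cheap verification is enough.
        return {"relabel_macro_action": False,
                "inspect_subpolicy": False,
                "demonstrate_low_level": False}
    # Right subgoal, failing subpolicy: only now pay for low-level labels.
    return {"relabel_macro_action": False,
            "inspect_subpolicy": True,
            "demonstrate_low_level": True}

feedback = expert_feedback(macro_action_correct=True, subpolicy_succeeded=False)
```

Expensive low-level demonstrations are requested in only one of the three branches, which is where the savings in expert effort come from.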

What’s the key achievement?

The experiments on a challenging maze domain and on Montezuma’s Revenge demonstrate that hierarchical IL requires far fewer labels than standard IL, and that even this modest amount of expert feedback leads to dramatic improvements in performance compared to pure RL.



What does the AI community think?

The paper was presented at ICML 2018, one of the most important Machine Learning conferences.

What are future research areas?

Using weaker feedback such as preference or gradient-style feedback, or only saying whether the agent’s action is correct or incorrect.

In settings where it is hard to specify whether a subgoal has been achieved, exploring the possibility of learning this information.

Where can you get implementation code?

A TensorFlow implementation of hierarchical imitation learning and reinforcement learning is available on GitHub.

7. Unsupervised Predictive Memory in a Goal-Directed Agent, by Greg Wayne, Chia-Chun Hung, David Amos, Mehdi Mirza, Arun Ahuja, Agnieszka Grabska-Barwinska, Jack Rae, Piotr Mirowski, Joel Z. Leibo, Adam Santoro, Mevlana Gemici, Malcolm Reynolds, Tim Harley, Josh Abramson, Shakir Mohamed, Danilo Rezende, David Saxton, Adam Cain, Chloe Hillier, David Silver, Koray Kavukcuoglu, Matt Botvinick, Demis Hassabis, Timothy Lillicrap

Original abstract

Animals execute goal-directed behaviours despite the limited range and scope of their sensors. To cope, they explore environments and store memories maintaining estimates of important information that is not presently available. Recently, progress has been made with artificial intelligence (AI) agents that learn to perform tasks from sensory input, even at a human level, by merging reinforcement learning (RL) algorithms with deep neural networks, and the excitement surrounding these results has led to the pursuit of related ideas as explanations of non-human animal learning. However, we demonstrate that contemporary RL algorithms struggle to solve simple tasks when enough information is concealed from the sensors of the agent, a property called “partial observability”. An obvious requirement for handling partially observed tasks is access to extensive memory, but we show memory is not enough; it is critical that the right information be stored in the right format. We develop a model, the Memory, RL, and Inference Network (MERLIN), in which memory formation is guided by a process of predictive modeling. MERLIN facilitates the solution of tasks in 3D virtual reality environments for which partial observability is severe and memories must be maintained over long durations. Our model demonstrates a single learning agent architecture that can solve canonical behavioural tasks in psychology and neurobiology without strong simplifying assumptions about the dimensionality of sensory input or the duration of experiences.

Our Summary

In 3D virtual reality environments, RL agents often struggle even with simple tasks because a lot of information is concealed from the agent’s sensors. To overcome this issue of partial observability, the DeepMind team introduces a new model, the Memory, RL, and Inference Network (MERLIN). With this model, the researchers suggest a new way to incorporate memory into an agent: they propose to guide memory formation with a process of predictive modeling. The experiments demonstrate that MERLIN can successfully solve standard tasks drawn from behavioural research in psychology and neuroscience.

What’s the core idea of this paper?

An agent can be much better at solving tasks in environments with limited observability if, at any given time step, it has access to both its current observation of the environment and memories relevant to its current state.

Thus, the researchers introduce a model architecture, called MERLIN, with two components:

A memory-based predictor (MBP), which is responsible for: compressing observations into low-dimensional state representations, or state variables; storing these state variables in memory; using the state variables in memory to make predictions based on past observations.

A policy network that receives state variables and memory contents from the MBP and outputs actions.

In addition, the MBP is trained to predict the reward from the current state, ensuring that learned representations are indeed useful and relevant to the current task.

What’s the key achievement?

Demonstrating that the combined use of memory and predictive modeling enhances the performance of RL agents: MERLIN was able to solve canonical behavioural tasks in psychology and neurobiology that were out of reach for previous state-of-the-art approaches. On some of the tasks, it learned faster and reached higher performance than professional human testers.



What does the AI community think?

“On addressing partial observability in deep RL with fancy neuro-inspired memory. ‘Memory, RL, and Inference Network’ = MERLIN. DeepMind’s been upping their acronym game lately.” – Miles Brundage, a research scientist at OpenAI.

What are future research areas?

Further studies of memory in computational agents.

Where can you get implementation code?

The authors do not provide implementation code. However, the results of some personal experiments with MERLIN implementation are available on GitHub.

8. Data-Efficient Hierarchical Reinforcement Learning, by Ofir Nachum, Shixiang Gu, Honglak Lee, Sergey Levine

Original abstract

Hierarchical reinforcement learning (HRL) is a promising approach to extend traditional reinforcement learning (RL) methods to solve more complex tasks. Yet, the majority of current HRL methods require careful task-specific design and on-policy training, making them difficult to apply in real-world scenarios. In this paper, we study how we can develop HRL algorithms that are general, in that they do not make onerous additional assumptions beyond standard RL algorithms, and efficient, in the sense that they can be used with modest numbers of interaction samples, making them suitable for real-world problems such as robotic control. For generality, we develop a scheme where lower-level controllers are supervised with goals that are learned and proposed automatically by the higher-level controllers. To address efficiency, we propose to use off-policy experience for both higher and lower-level training. This poses a considerable challenge, since changes to the lower-level behaviors change the action space for the higher-level policy, and we introduce an off-policy correction to remedy this challenge. This allows us to take advantage of recent advances in off-policy model-free RL to learn both higher- and lower-level policies using substantially fewer environment interactions than on-policy algorithms. We term the resulting HRL agent HIRO and find that it is generally applicable and highly sample-efficient. Our experiments show that HIRO can be used to learn highly complex behaviors for simulated robots, such as pushing objects and utilizing them to reach target locations, learning from only a few million samples, equivalent to a few days of real-time interaction. In comparisons with a number of prior HRL methods, we find that our approach substantially outperforms previous state-of-the-art techniques.

Our Summary

In this research, the Google Brain team seeks a way to develop hierarchical reinforcement learning (HRL) algorithms that do not make additional assumptions beyond those of standard RL algorithms and that require a relatively modest number of interaction samples. The key idea is to have two layers of policy, with lower-level controllers supervised with goals that are learned and proposed automatically by the higher-level controllers. To increase efficiency, both higher- and lower-level policies are learned using off-policy experience with a specific correction. The experiments show that the suggested approach outperforms previous state-of-the-art techniques while being highly sample-efficient.

What’s the core idea of this paper?

Introducing a multi-level HRL agent, called HIRO, that stands out from other HRL methods by being generally applicable and data-efficient:

Generality is achieved by training the lower-level policy to reach goals learned and instructed by the higher levels. Unlike previous work that operates in the goal-setting model, the HIRO approach uses states directly as goals.

Sample efficiency is achieved by using off-policy training.

Off-policy training in the HRL setting results in a non-stationary problem for the higher-level policy. To overcome this issue, the researchers suggest an off-policy correction, which retroactively replaces the high-level action seen in off-policy experience with a high-level action chosen to maximize the likelihood of the past lower-level actions.
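A rough sketch of the relabeling step may help. It follows our understanding of the correction and is not taken from the released code: the function name and argument layout are ours, the squared-error score is a stand-in for the log-likelihood under a deterministic lower-level policy, and the goal is assumed to be a desired state offset that is updated as g' = s + g − s' after each step.

```python
import numpy as np

def relabel_goal(states, actions, candidate_goals, low_level_policy):
    """Pick the candidate goal that best explains the stored low-level actions.

    states:  array of visited states, one per stored action
    actions: array of low-level actions actually taken
    low_level_policy(state, goal) -> action the *current* policy would take
    """
    states = np.asarray(states, dtype=float)
    actions = np.asarray(actions, dtype=float)
    best_goal, best_score = None, -np.inf
    for candidate in candidate_goals:
        goal = np.asarray(candidate, dtype=float)
        score = 0.0
        for t in range(len(actions)):
            predicted = low_level_policy(states[t], goal)
            # Higher score means the stored action is more likely under this goal.
            score -= 0.5 * np.sum((actions[t] - predicted) ** 2)
            if t + 1 < len(states):
                # Assumed goal transition: keep pointing at the same target state.
                goal = states[t] + goal - states[t + 1]
        if score > best_score:
            best_goal, best_score = np.asarray(candidate, dtype=float), score
    return best_goal

# Toy lower-level policy (hypothetical): move straight toward the goal offset.
policy = lambda state, goal: goal
new_goal = relabel_goal(
    states=[[0.0, 0.0], [1.0, 0.0]],
    actions=[[1.0, 0.0], [0.0, 0.0]],
    candidate_goals=[[1.0, 0.0], [0.0, 1.0]],
    low_level_policy=policy,
)
```

The relabeled goal then replaces the stored high-level action, so that old experience stays consistent with how the lower-level policy behaves now.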



What’s the key achievement?

Experiments with complex tasks that combine locomotion and rudimentary object interaction demonstrate that: the HIRO approach can learn complex behaviours such as pushing objects and utilizing them to reach target locations from only a few million samples, equivalent to a few days of real-time interaction; previous state-of-the-art techniques are not able to show competitive results after 10M steps of training.



What does the AI community think?

The paper was presented at NeurIPS 2018.

“The recent paper from Google Brain takes a particularly clean and simple approach [to hierarchical reinforcement learning], and introduces some nice off-policy corrections for data-efficient training.” – Joyce Xu, AI/ML engineer from Stanford.

What are future research areas?

Improving the stability and performance of hierarchical reinforcement learning methods.

Exploring the HIRO method in the context of multi-task learning.

What are possible business applications?

The HIRO agent is generally applicable and data-efficient, which makes it suitable for real-world applications.

Where can you get implementation code?

Find the open-source code for this research paper on GitHub.

9. Visual Reinforcement Learning with Imagined Goals, by Ashvin Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, Sergey Levine

Original abstract

For an autonomous agent to fulfill a wide range of user-specified goals at test time, it must be able to learn broadly applicable and general-purpose skill repertoires. Furthermore, to provide the requisite level of generality, these skills must handle raw sensory input such as images. In this paper, we propose an algorithm that acquires such general-purpose skills by combining unsupervised representation learning and reinforcement learning of goal-conditioned policies. Since the particular goals that might be required at test-time are not known in advance, the agent performs a self-supervised “practice” phase where it imagines goals and attempts to achieve them. We learn a visual representation with three distinct purposes: sampling goals for self-supervised practice, providing a structured transformation of raw sensory inputs, and computing a reward signal for goal reaching. We also propose a retroactive goal relabeling scheme to further improve the sample-efficiency of our method. Our off-policy algorithm is efficient enough to learn policies that operate on raw image observations and goals for a real-world robotic system, and substantially outperforms prior techniques.

Our Summary

In this paper, the Berkeley AI research team introduces a reinforcement learning algorithm that can be used to learn a large and diverse set of tasks simultaneously, without human supervision. The key idea is that agents can prepare to solve different tasks by setting their own goals, practicing complex behaviours, and learning about the environment around them. Using these autonomously learned skills, the agent is then able to solve user-specified tasks represented with images. The experiments illustrate that the suggested approach can effectively learn policies for complex image-based tasks and can be successfully used to learn real-world robotic manipulation skills.

What’s the core idea of this paper?

The paper introduces a framework for solving goal-conditioned vision-based tasks without access to any ground truth state or reward functions.

This method, called reinforcement learning with imagined goals (RIG), trains a generative model that serves several purposes: embedding states and goals using the encoder; sampling goals for exploration from the prior; sampling latent values from the generative model to retroactively relabel goals and rewards; and using distances in the latent space as rewards to train the agent to reach a goal.

Within this framework, the state is represented as the image from the robot’s camera and the goal is just an image of the world as it should be. So, to specify a new task a user simply provides a goal image.
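To make the roles of the latent space more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the `encode` function is a hypothetical stand-in for a trained encoder, the prior is assumed to be a standard normal, the reward is assumed to be the negative Euclidean distance between latent vectors, and the relabeling probability is an illustrative choice.

```python
import math
import random

LATENT_DIM = 2  # illustrative size; a real model would use a larger latent space


def encode(image):
    """Hypothetical stand-in for a trained encoder.

    Maps a flat list of pixel values to a 2-D latent vector by averaging
    each half of the image. A real system would use a learned network.
    """
    half = len(image) // 2
    return [sum(image[:half]) / half, sum(image[half:]) / (len(image) - half)]


def sample_imagined_goal(rng):
    """Sample a latent goal from the prior (assumed standard normal)."""
    return [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]


def latent_reward(z_state, z_goal):
    """Reward is the negative Euclidean distance in latent space."""
    return -math.dist(z_state, z_goal)


def relabel(transition, rng, p_relabel=0.5):
    """Retroactively relabel a stored transition.

    With probability p_relabel the stored goal is replaced by a freshly
    sampled latent goal; the reward is then recomputed for whichever
    goal the transition ends up with.
    """
    z_state, z_next, z_goal, _old_reward = transition
    if rng.random() < p_relabel:
        z_goal = sample_imagined_goal(rng)
    return (z_state, z_next, z_goal, latent_reward(z_next, z_goal))


if __name__ == "__main__":
    rng = random.Random(0)
    z_state = encode([0.1, 0.2, 0.8, 0.9])
    z_next = encode([0.2, 0.3, 0.7, 0.8])
    z_goal = sample_imagined_goal(rng)
    stored = (z_state, z_next, z_goal, latent_reward(z_next, z_goal))
    print(relabel(stored, rng))
```

Because the reward depends only on latent vectors, a relabeled transition can be reused for off-policy training without any new interaction with the environment.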

What’s the key achievement?

On a set of robotics tasks, the RIG algorithm comes close to the state-based “oracle” method in terms of sample efficiency and performance, even though RIG has no access to object state.

Despite learning directly from pixels, solving the tasks doesn’t take much time: it took about an hour of real-robot interaction time to solve the task of reaching a goal position, and about 4.5 hours to solve the object pushing task.



What does the AI community think?

The paper was presented at NeurIPS 2018 as a spotlight talk.

What are future research areas?

Allowing the goals to be represented not only with images but also with demonstrations or language, to make the system more flexible in interacting with humans.

Building on the existing research related to exploration and intrinsic motivation, and modifying the procedure so that the system could choose the goals in a more principled way.

Making the generative model aware of the environment dynamics.

What are possible business applications?

The algorithms introduced in this research paper can be deployed in real-world settings because they 1) learn directly from raw images and 2) use a single policy to solve a large and diverse set of tasks.

Where can you get implementation code?

The algorithm implementation is available through the rlkit repository by Vitchyr Pong.

The environments are also available publicly on GitHub.

10. Horizon: Facebook’s Open Source Applied Reinforcement Learning Platform, by Jason Gauci, Edoardo Conti, Yitao Liang, Kittipat Virochsiri, Yuchen He, Zachary Kaden, Vivek Narayanan, Xiaohui Ye

Original abstract

In this paper we present Horizon, Facebook’s open source applied reinforcement learning (RL) platform. Horizon is an end-to-end platform designed to solve industry applied RL problems where datasets are large (millions to billions of observations), the feedback loop is slow (vs. a simulator), and experiments must be done with care because they don’t run in a simulator. Unlike other RL platforms, which are often designed for fast prototyping and experimentation, Horizon is designed with production use cases as top of mind. The platform contains workflows to train popular deep RL algorithms and includes data preprocessing, feature transformation, distributed training, counterfactual policy evaluation, and optimized serving. We also showcase real examples of where models trained with Horizon significantly outperformed and replaced supervised learning systems at Facebook.

Our Summary

The Facebook team open-sources its end-to-end platform for applied reinforcement learning to push RL’s transition from academia to industry. The platform is built on principles that put industry applicability first: providing a pipeline for automatic and efficient preprocessing of data, allowing for flexible model serving in production, and providing the opportunity to estimate algorithm performance before launch. Models trained with Horizon are already successfully deployed at Facebook.

What’s the core idea of this paper?

Introducing Horizon, an end-to-end platform for applied RL with a number of important features:

Data preprocessing: a Spark pipeline that transforms logged data into the format required for deep RL models.

Feature normalization: extracting metadata about every feature and using it to automatically preprocess features during training and serving.

Implementation of popular deep RL models, including Deep Q-networks, Deep Q-networks with double Q-learning, Deep Q-networks with dueling architecture, and Deep Deterministic Policy Gradients.

Support for CPU, GPU, and multi-GPU training on a single machine.

Counterfactual policy evaluation (CPE): scoring trained models offline using several well-known CPE methods.

Optimized serving: using a Caffe2 network that is optimized for performance and portability.

Tested algorithms: testing core functionality and algorithms via unit tests and integration tests.
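To give a feel for what counterfactual policy evaluation involves, here is a minimal Python sketch of inverse propensity scoring, one standard offline estimator. This summary does not say which CPE methods Horizon uses, so the choice of estimator is an assumption made for illustration, and the data layout and function names are hypothetical rather than part of Horizon's API. The sketch also assumes single-step decisions and that the logging policy's action probabilities were recorded.

```python
def ips_estimate(logged, target_policy):
    """Estimate a new policy's average reward from logged data.

    logged: list of (state, action, reward, logging_prob) tuples, where
        logging_prob is the probability with which the old (logging)
        policy chose `action` in `state`.
    target_policy: function (state, action) -> probability that the new
        policy would choose `action` in `state`.
    """
    total = 0.0
    for state, action, reward, logging_prob in logged:
        # Reweight each logged reward by how much more (or less) likely
        # the new policy is to take the logged action.
        weight = target_policy(state, action) / logging_prob
        total += weight * reward
    return total / len(logged)


if __name__ == "__main__":
    # Toy log: the old policy picked each of two actions with probability 0.5;
    # only action 1 was ever rewarded.
    log = [
        ("s", 0, 0.0, 0.5),
        ("s", 1, 1.0, 0.5),
        ("s", 1, 1.0, 0.5),
        ("s", 0, 0.0, 0.5),
    ]

    def always_action_1(state, action):
        return 1.0 if action == 1 else 0.0

    print(ips_estimate(log, always_action_1))
```

The appeal of this kind of estimate is that a candidate policy can be scored from existing logs before it is ever served to users, which matters when experiments cannot be run in a simulator.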



What’s the key achievement?

Introducing the first open-source end-to-end platform that can be used by companies to apply reinforcement learning to real-world industry problems.

Providing examples of how models trained with Horizon outperformed approaches based on supervised learning and replaced them in real-life applications at Facebook.

What does the AI community think?

“Facebook the ever dominating social network platform has yet again proved that it can contribute to the enhancements in Machine Learning algorithms.”, – Analytics India Magazine.

What are future research areas?

Continually adding the best performing algorithms from the research community as well as improving currently available models.

Allowing developers to input a set of metrics that they are interested in tracking.

What are possible business applications?

Horizon is designed specifically to help companies that are interested in using applied RL: The platform allows handling very large datasets with hundreds or thousands of features and millions or billions of observations. The data can be inherently noisy, sparse, and arbitrarily distributed. Trained models can be deployed to thousands of machines.



Where can you get implementation code?

The first open-source platform for applied reinforcement learning is available at Facebook’s GitHub repository.

More technical content about Reinforcement Learning

If you enjoyed this article, you’ll want to check out other educational content in our RL series.

Want Deeper Dives Into Specific AI Research Topics?

Due to popular demand, we’ve released several of these easy-to-read summaries and syntheses of major research papers for different subtopics within AI and machine learning.

Update: 2019 Research Summaries Are Released

We’ll let you know when we release more summary articles like this one.
