In this article, I’ve conducted an informal survey of all the deep reinforcement learning research thus far in 2019 and I’ve picked out some of my favorite papers. This list should make for some enjoyable summer reading!

As we march into the second half of 2019, deep learning research continues at an accelerated pace. There are many fertile areas of research, such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM) networks, autoencoders, generative networks, and much more. But it is deep reinforcement learning (DRL) that seems to hold everyone’s fascination these days.

Reinforcement learning refers to algorithms that are “goal-oriented.” They learn how to attain a complex objective (a goal) by maximizing along a specific dimension over a number of iterations, for instance by maximizing the points obtained in a game over a number of moves. They can start from a blank slate, and under the right conditions they achieve extraordinary performance. These algorithms are penalized when they make the wrong decisions and rewarded when they make the right ones; this is how they engage the concept of reinforcement.
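
To make the reward-and-penalty loop concrete, here is a minimal sketch of tabular Q-learning, a classic (non-deep) reinforcement learning algorithm. The corridor environment, the reward values, and every name in the code are my own illustrative assumptions; none of them come from the papers below.

```python
import random


def train_q_learning(n_states=5, episodes=500, alpha=0.1, gamma=0.9,
                     epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy corridor (hypothetical environment).

    The agent starts in state 0 and the goal is the last state.
    Action 0 moves left, action 1 moves right.
    """
    rng = random.Random(seed)
    # q[s][a] is the current estimate of the return for action a in state s.
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            # Epsilon-greedy choice; ties are broken at random.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = max(0, state - 1) if action == 0 else state + 1
            done = next_state == goal
            # Reward for reaching the goal, small penalty for every other move.
            reward = 1.0 if done else -0.01
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = train_q_learning()
# Greedy action in every non-goal state (1 means "move right").
greedy_policy = [0 if left > right else 1 for left, right in q_table[:-1]]
print(greedy_policy)
```

Deep reinforcement learning replaces the lookup table `q` with a neural network, which is what makes high-dimensional inputs such as game frames tractable.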

Deep reinforcement learning algorithms can beat world champions at the game of Go as well as human experts playing numerous Atari video games. While that may sound inconsequential, it’s a vast improvement over their previous undertakings, and the state of the art is progressing rapidly.

For those of you not yet formally introduced to DRL, here is a great survey paper from December 2018 that will give you a healthy introduction: “An Introduction to Deep Reinforcement Learning,” by Vincent François-Lavet et al. For an interesting application of DRL to robotics, check out the Ph.D. dissertation of Pieter Abbeel (ODSC West 2018 speaker), Stanford University, Computer Science, August 2008: “Apprenticeship Learning and Reinforcement Learning with Application to Robotic Control.”

Decision support systems and autonomous systems are starting to be deployed in real applications. Although their operations often affect many users or stakeholders, fairness is generally not taken into account in their design, which can lead to completely unfair outcomes for some users or stakeholders. To tackle this issue, this paper advocates the use of social welfare functions that encode fairness and presents this general, novel problem in the context of DRL, although it could possibly be extended to other machine learning tasks.
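
As a rough illustration of what a fairness-encoding social welfare function can look like, here is a sketch of the generalized Gini welfare function, one well-known example from welfare economics. The summary above does not say which functions the paper uses, so treat this choice, the default weights, and the function name as my assumptions.

```python
def generalized_gini_welfare(utilities, weights=None):
    """Generalized Gini social welfare: a weighted sum of sorted utilities.

    Utilities are sorted so the worst-off user comes first, and the weights
    decrease, so improving the worst-off user counts the most.
    """
    ordered = sorted(utilities)  # worst-off first
    if weights is None:
        # Illustrative default: each weight is half the previous one.
        weights = [0.5 ** i for i in range(len(ordered))]
    if len(weights) != len(ordered):
        raise ValueError("need exactly one weight per utility")
    return sum(w * u for w, u in zip(weights, ordered))


# Same total utility (10), split evenly versus unevenly between two users.
balanced = generalized_gini_welfare([5.0, 5.0])
skewed = generalized_gini_welfare([0.0, 10.0])
print(balanced, skewed)
```

An agent that maximizes this score instead of the plain sum of utilities prefers the even split, even though both outcomes have the same total.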

DRL has achieved great success in various applications. However, recent studies show that machine learning models are vulnerable to adversarial attacks, and DRL models have been attacked by adding perturbations to observations. Such observation-based attacks are only one aspect of the potential attacks on DRL; other, more practical forms of attack, such as manipulating environment dynamics, require further analysis. This paper proposes to understand the vulnerabilities of DRL from various perspectives and provides a thorough taxonomy of potential attacks.
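
To show what "adding perturbations to observations" means in practice, here is a minimal sketch of a gradient-sign perturbation against a toy linear-softmax policy. The policy, its random weights, and the function names are hypothetical stand-ins of mine; the attacks studied in the paper target deep networks and are not reproduced here.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def perturb_observation(weights, observation, epsilon=0.1):
    """Shift each observation feature by +/- epsilon so that the policy's
    preferred action becomes less likely (a gradient-sign step)."""
    probs = softmax(weights @ observation)
    action = int(np.argmax(probs))
    # Gradient of log pi(action | observation) with respect to the
    # observation, for a policy whose logits are weights @ observation.
    grad = weights[action] - probs @ weights
    return observation - epsilon * np.sign(grad), action


rng = np.random.default_rng(0)
policy_weights = rng.normal(size=(3, 4))  # 3 actions, 4 observation features
clean_obs = rng.normal(size=4)

adv_obs, chosen = perturb_observation(policy_weights, clean_obs)
clean_prob = softmax(policy_weights @ clean_obs)[chosen]
adv_prob = softmax(policy_weights @ adv_obs)[chosen]
print(clean_prob, adv_prob)
```

The perturbation is bounded by `epsilon` per feature, yet it always lowers the probability of the action the policy originally preferred.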

This paper explores the design and implementation of a CUDA port of the Arcade Learning Environment (ALE), a system for developing and evaluating DRL algorithms using Atari games. The CUDA Learning Environment (CuLE) overcomes many limitations of existing CPU-based Atari emulators and scales naturally to multi-GPU systems. It leverages the parallelization capability of GPUs to run thousands of Atari games simultaneously; by rendering frames directly on the GPU, CuLE avoids the bottleneck arising from the limited CPU-GPU communication bandwidth. As a result, CuLE is able to generate between 40M and 190M frames per hour using a single GPU, a throughput that previously could be achieved only with a cluster of CPUs. The paper demonstrates the advantages of CuLE by effectively training agents with traditional deep reinforcement learning algorithms and measuring the utilization and throughput of the GPU. The analysis further highlights the differences in the data generation pattern for emulators running on CPUs or GPUs. The code is available on GitHub.

DRL is prone to overfitting, and traditional benchmarks such as the Atari 2600 benchmark can exacerbate this problem. The Obstacle Tower Challenge addresses this by using randomized environments and separate seeds for training, validation, and test runs. This paper examines various improvements and best practices for the PPO algorithm, using the Obstacle Tower Challenge to empirically study their impact on generalization. The experiments show that the combination provides state-of-the-art performance on the Obstacle Tower Challenge.
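
For readers who have not met PPO (Proximal Policy Optimization) before, here is a minimal sketch of its standard clipped surrogate objective. It shows only the textbook objective; the specific improvements the paper evaluates are not described in the summary above and are not reproduced here, and the function name is my own.

```python
import numpy as np


def ppo_clipped_objective(new_log_probs, old_log_probs, advantages,
                          clip_eps=0.2):
    """Standard PPO clipped surrogate objective (to be maximized).

    The probability ratio between the new and old policy is clipped to
    [1 - clip_eps, 1 + clip_eps], which removes the incentive for a single
    update to move the policy far from the one that collected the data.
    """
    ratio = np.exp(np.asarray(new_log_probs) - np.asarray(old_log_probs))
    adv = np.asarray(advantages, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return float(np.mean(np.minimum(unclipped, clipped)))


# A good action whose probability grew by 50%: the gain is capped at 1 + 0.2.
capped_gain = ppo_clipped_objective(np.log([1.5]), np.log([1.0]), [1.0])
# A bad action whose probability was halved: the objective stops improving
# once the ratio drops below 1 - 0.2.
capped_loss = ppo_clipped_objective(np.log([0.5]), np.log([1.0]), [-1.0])
print(capped_gain, capped_loss)
```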

This paper explores the use of DRL algorithms to automatically generate consistently profitable, robust, uncorrelated trading signals in any general financial market. To do this, the researchers present a novel Markov decision process (MDP) model to capture the financial trading markets. They review and propose various modifications to existing approaches and explore different techniques to succinctly capture market dynamics. They then go on to use DRL to enable the agent (the algorithm) to learn how to take profitable trades in any market on its own, while suggesting various methodology changes and leveraging the unique representation of the FMDP (financial MDP) to tackle the primary challenges faced in similar works. Through their experimental results, they go on to show that the model could be easily extended to two very different financial markets and that it delivers robust, positive performance in all conducted experiments.

Recent advances in DRL, grounded in combining classical theoretical results with the deep learning paradigm, led to breakthroughs in many artificial intelligence tasks and gave birth to DRL as a field of research. In this work, the latest DRL algorithms are reviewed with a focus on their theoretical justification, practical limitations, and observed empirical properties.

The scale of Internet-connected systems has increased considerably, and these systems are being exposed to cyberattacks more than ever. The complexity and dynamics of cyberattacks require protection mechanisms to be responsive, adaptive, and large-scale. Machine learning methods, and DRL methods in particular, have been widely proposed to address these issues. By incorporating deep learning into traditional RL, DRL is highly capable of solving complex, dynamic, and especially high-dimensional cyber defense problems. This paper presents a survey of DRL approaches developed for cyber security. The researchers touch on different vital aspects, including DRL-based security methods for cyber-physical systems, autonomous intrusion detection techniques, and multi-agent DRL-based game theory simulations for defense strategies against cyberattacks. Extensive discussions and future research directions on DRL-based cyber security are also given. The authors expect this comprehensive review to provide the foundations for, and to facilitate, future studies exploring the potential of emerging DRL to cope with increasingly complex cyber security problems.

DRL enables agents to make decisions based on a reward function. However, the choice of values for the learning algorithm's parameters can significantly affect the overall learning process. This paper explores the use of a genetic algorithm (GA) to find the values of the parameters used in Deep Deterministic Policy Gradient (DDPG) combined with Hindsight Experience Replay (HER), to help speed up the learning agent. The researchers apply this method to the fetch-reach, slide, push, pick-and-place, and door-opening robotic manipulation tasks. The experimental evaluation shows that the method leads to better performance, faster, than the original algorithm.
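
The general idea of searching parameter values with a genetic algorithm can be sketched in a few lines. Everything below is an illustrative assumption of mine: the two parameter names, their bounds, and especially `toy_fitness`, a cheap stand-in for the expensive step of actually training a DDPG + HER agent and scoring it. The paper's own GA design is not described in the summary above.

```python
import random


def toy_fitness(params):
    """Hypothetical stand-in for 'train the agent and measure performance'.
    It peaks at learning_rate = 0.001 and tau = 0.05."""
    learning_rate, tau = params
    return -((learning_rate - 0.001) / 0.001) ** 2 - ((tau - 0.05) / 0.05) ** 2


def genetic_search(fitness, bounds, pop_size=20, generations=30,
                   mutation_scale=0.1, seed=0):
    """Minimal genetic algorithm: truncation selection, uniform crossover,
    and Gaussian mutation clipped to the given bounds."""
    rng = random.Random(seed)
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]  # top half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            # Uniform crossover: each gene comes from one parent at random.
            child = [rng.choice(pair) for pair in zip(mother, father)]
            # Gaussian mutation, kept inside the allowed range.
            child = [min(hi, max(lo, gene + rng.gauss(0.0, mutation_scale * (hi - lo))))
                     for gene, (lo, hi) in zip(child, bounds)]
            children.append(child)
        population = parents + children
    return max(population, key=fitness), history


search_bounds = [(1e-4, 1e-2), (0.01, 0.2)]  # (learning_rate, tau) ranges
best_params, best_history = genetic_search(toy_fitness, search_bounds)
print(best_params, best_history[0], toy_fitness(best_params))
```

Because the top half of each generation survives unchanged, the best fitness never gets worse from one generation to the next. In a real setting each fitness call is a full training run, so population size and generation count dominate the cost.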

DRL has achieved significant breakthroughs in various tasks. However, most DRL algorithms struggle to generalize the learned policy, so learning performance is strongly affected even by minor modifications of the training environment. In addition, the use of deep neural networks makes the learned policies hard to interpret. To address these two challenges, this paper proposes a novel algorithm named Neural Logic Reinforcement Learning (NLRL) to represent the policies in reinforcement learning by first-order logic. NLRL is based on policy gradient methods and differentiable inductive logic programming, which have demonstrated significant advantages in terms of interpretability and generalisability in supervised tasks. Extensive experiments conducted on cliff-walking and blocks manipulation tasks demonstrate that NLRL can induce interpretable policies achieving near-optimal performance while demonstrating good generalisability to environments of different initial states and problem sizes.

This paper shows how to teach machines to paint like human painters, who can use a few strokes to create fantastic paintings. By combining a neural renderer with model-based DRL, the agent can decompose texture-rich images into strokes and make long-term plans. For each stroke, the agent directly determines the position and color of the stroke. Excellent visual effects can be achieved using hundreds of strokes. The training process requires neither human painting experience nor stroke-tracking data.

To meet the diverse challenges of solving many real-world problems, an intelligent agent has to be able to dynamically construct a model of its environment. Objects facilitate the modular reuse of prior knowledge and the combinatorial construction of such models. This paper argues that dynamically bound features (objects) do not simply emerge in connectionist models of the world. The authors identify several requirements that need to be fulfilled to overcome this limitation and highlight corresponding inductive biases.

In model-based reinforcement learning, the agent interleaves model learning and planning. These two components are inextricably intertwined. If the model cannot provide sensible long-term predictions, the planner will exploit the model's flaws, which can yield catastrophic failures. This paper focuses on building a model that reasons about the long-term future and demonstrates how to use it for efficient planning and exploration. To this end, the researchers build a latent-variable autoregressive model by leveraging recent ideas in variational inference. They argue that forcing latent variables to carry future information through an auxiliary task substantially improves long-term predictions. Moreover, planning in the latent space keeps the planner’s solution within regions where the model is valid. An exploration strategy can be devised by searching for trajectories that are unlikely under the model. The method achieves higher reward faster than the baselines on a variety of tasks and environments in both the imitation learning and model-based reinforcement learning settings.