shie {at} ee {dot} technion {dot} ac {dot} il

So you run a reinforcement learning (RL) algorithm and it performs poorly. What then? Basically, you can try some other algorithm out of the box: PPO/AxC/*QN/Rainbow/etc… [1, 2, 3, 4] and hope for the best. This approach rarely works. But, why? Why don’t we have a “ResNet” for RL? By that I mean, why don’t we have a network architecture that gets you to 90% of the desired performance with 10% of the effort?

The problem, I believe, is in the questions we ask. For most problems of interest, much of the mental effort and ingenuity goes into defining the problem rather than into finding the best algorithm. Algorithms matter, but less than solving the right problem. In this post I will explain what the right problems are, and how we can address them. Rather than considering algorithmic problems, I view the problems from the design side: what would an engineer who tries to use RL need? These properties are not specific to RL systems, of course, but are rather generic. As we view the problem from a design perspective, we are interested in the system's interfaces and in how it is reflected to the outside world.

Accountability:

Accountability means that the system can explain why it acts in a certain way. The explanation can be reasoning in words, by example, or by any other means that is interpretable to the entities that communicate with it. For example, consider an autonomous self-driving system. It ought to be able to explain itself: why it crossed a lane, or why it decided to hit the brakes. The explanation is important not only for investigating accidents, but also for making the human passenger feel safer and more in control, for debugging, and for accommodating different styles of driving. An explanation can take the form of natural language or a computation, but without such reasoning, humans (users, regulators, or people who interact with the system) will find it difficult to trust the system.
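One concrete way to read this requirement is as an interface: every action the policy emits carries a rationale alongside it. A minimal sketch in Python; the class names, the rule standing in for a learned policy, and the braking deceleration figure are all made up for illustration:

```python
from dataclasses import dataclass

@dataclass
class Decision:
    action: str
    rationale: str  # human-readable explanation attached to every action

class AccountablePolicy:
    """Toy driving policy whose interface forces an explanation per action."""

    def act(self, obstacle_distance_m: float, speed_mps: float) -> Decision:
        # A simple rule stands in for a learned policy; the point is the
        # interface, not the controller.
        stopping_distance = speed_mps ** 2 / (2 * 7.0)  # assume ~7 m/s^2 braking
        if obstacle_distance_m < stopping_distance:
            return Decision(
                "brake",
                f"obstacle at {obstacle_distance_m:.0f} m is inside the "
                f"{stopping_distance:.0f} m stopping distance")
        return Decision("cruise", "no obstacle within stopping distance")

policy = AccountablePolicy()
d = policy.act(obstacle_distance_m=20.0, speed_mps=25.0)
print(d.action, "-", d.rationale)  # → brake - obstacle at 20 m is inside the 45 m stopping distance
```

The design choice here is that the rationale is part of the return type, so a caller (logger, regulator, passenger-facing UI) cannot get an action without its explanation.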

Adaptivity:

Adaptivity means that the system should function properly under a variety of conditions. Some of these conditions may be expected and some harder to predict. Of specific importance is adaptivity to counterfactual behavior: other agents and the environment are expected to react to the system's behavior. Consider again the case of autonomous behavior in cars, not necessarily self-driving: ABS (Anti-lock Braking System) and ADAS (Advanced Driver Assistance Systems) come to mind as well. The system has to work in all weather and road conditions, even if trained mostly in a few specific ones. Such a requirement calls for control policies more akin to adaptive control, and for taking exogenous, or contextual, parameters (e.g., weather conditions) into account [9, 10].
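In the contextual-MDP spirit of [9], exogenous parameters can be made a first-class input to the policy rather than something baked into the training conditions. A minimal sketch, with made-up dimensions and random (untrained) weights:

```python
import numpy as np

class ContextualPolicy:
    """Sketch of a policy that conditions on exogenous context explicitly."""

    def __init__(self, n_features: int, n_context: int, n_actions: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # A linear scoring head over [state; context]; in practice these
        # weights would be learned across contexts.
        self.W = rng.normal(size=(n_actions, n_features + n_context))

    def act(self, state: np.ndarray, context: np.ndarray) -> int:
        # Concatenate the state with the exogenous context and pick the
        # highest-scoring action.
        x = np.concatenate([state, context])
        return int(np.argmax(self.W @ x))

policy = ContextualPolicy(n_features=4, n_context=2, n_actions=3)
dry = np.array([1.0, 0.0])  # hypothetical one-hot context: dry road
wet = np.array([0.0, 1.0])  # wet road
s = np.array([0.5, -0.2, 0.1, 0.0])
# The same state can map to different actions under different contexts.
print(policy.act(s, dry), policy.act(s, wet))
```

Because the context is an explicit argument, generalizing to an unseen weather condition becomes a question about the policy's input space rather than about retraining from scratch.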

Awareness:

Awareness means that the system knows how well it performs and can communicate its performance. The system should be able to identify what is happening to it and be cognizant of other entities (other agents and humans). Returning to the self-driving car example, consider a level 3 or level 4 self-driving car that can relinquish driving to the human behind the steering wheel. Relinquishing control should happen when the system recognizes its inability to perform some required maneuver (e.g., change lanes into a very busy lane), but also, and more importantly, when the system realizes it is not working well enough and is aware of its current inability to perform adequately.
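The handover logic can be sketched as a policy that refuses to act when its own confidence estimate is too low. Here "confidence" is just the maximum softmax probability, a crude stand-in for properly calibrated uncertainty, and the threshold is arbitrary:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def act_or_handover(logits: np.ndarray, min_confidence: float = 0.8):
    """Return ('act', action) when confident; ('handover', None) otherwise."""
    probs = softmax(logits)
    if probs.max() < min_confidence:
        # The system is aware it cannot perform adequately: hand control
        # back to the human instead of guessing.
        return ("handover", None)
    return ("act", int(np.argmax(probs)))

print(act_or_handover(np.array([2.5, 0.1, -1.0])))  # confident → act
print(act_or_handover(np.array([0.3, 0.2, 0.25])))  # ambiguous → handover
```

The hard part in practice is, of course, obtaining a confidence signal that is actually calibrated; the sketch only shows where such a signal would plug into the control loop.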

Life-cycle consciousness:

A system will be built, then put in the field, and eventually be deprecated and die out. We need to build systems that are aware of this life cycle, including debugging tools such as unit testing, decomposability, and interaction with other subsystems within a more complex system. While the importance of debugging is clear and has been considered by many researchers [11, 12], let us give an example of building a system that can work effectively with other systems. Consider again an ADAS or an ABS. Each such system will work with tens of other systems, from air pressure sensors to steering control to driver alertness monitoring. It should be possible to easily test whether replacing one of the other systems with a newer version (software or hardware) would work, and it should be possible to replace the version of the ABS/ADAS system itself while considering the current versions of all the other systems. Such testing is no easy feat, as it should be done virtually and probably not on the vehicle itself.
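One way to make version swaps testable is a contract test that every candidate version of a subsystem must pass before deployment. A toy sketch; the sensor classes, pressure values, and the engagement range are all invented for illustration:

```python
class PressureSensorV1:
    """Current tyre-pressure sensor firmware."""
    def read_kpa(self) -> float:
        return 220.0

class PressureSensorV2:
    """Newer firmware; must honour the same interface contract."""
    def read_kpa(self) -> float:
        return 219.5

class AbsController:
    """Toy ABS module that depends on a pressure-sensor subsystem."""
    def __init__(self, sensor):
        self.sensor = sensor

    def safe_to_engage(self) -> bool:
        # Engage only when the reported pressure is in a plausible range.
        return 180.0 <= self.sensor.read_kpa() <= 260.0

def contract_holds(sensor) -> bool:
    """Run the same contract check against any sensor version before a swap."""
    reading = sensor.read_kpa()
    return isinstance(reading, float) and AbsController(sensor).safe_to_engage()

# Check every candidate version virtually, before touching the vehicle.
results = {type(s).__name__: contract_holds(s)
           for s in (PressureSensorV1(), PressureSensorV2())}
print(results)  # → {'PressureSensorV1': True, 'PressureSensorV2': True}
```

The point is that the ABS module depends only on the sensor's interface, so a version swap reduces to re-running the contract suite against the new implementation.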

Scalability with resources:

The more resources we have, in terms of data and computational power, the better the policy should be. To quote Richard Sutton: “One thing that should be learned from the bitter lesson is the great power of general-purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are ‘search’ and ‘learning’.” [5]. Going back once more to the self-driving example, we would like to be able to learn and improve from, essentially, all driving experience of all cars ever recorded. Beyond the algorithmic challenge of learning from very large data sets, this calls for multi-domain learning (since individual drivers differ), for counterfactual learning [8], and for sim2real [6, 7] and real2sim transfer.

So when building a system that is supposed to work in the real world, you should ask yourself whether the above five principles are addressed.

More often than not, we are enchanted by the mathematical beauty of one algorithm or another, and settle for restricted and limited research that provides a solution that works on some mock-up domain.

So is this why people who run their pet RL algorithms on real systems do not understand why they fail?

A system that does not comply with the five principles is unlikely to work in a real-world application. Take, for instance, DQN-based algorithms: scalability is really the only design principle they satisfy. All the others are far from satisfied. I find it hard to imagine any real-world system that uses DQN as its engine.

Of course, there is work on making DQN adaptive, and on methods for debugging these algorithms, but in general they are notoriously difficult to debug.

My view is that any RL research should be judged according to the five principles above: which of the five principles does it advance?

Of course, there is no harm in working on solving games, and “solving” Go or DOTA 2 is a great exercise in scalability. But it is nothing more: it does not bring us much closer to building RL systems that interact with the real world.

There are two diverging views on RL at the moment. The optimists say that since Go is “solved”, and so are many other games, it is time to call it quits and look for a new research area. The pessimists' view is that RL basically works only when there are huge amounts of data, and essentially only for problems where the test set itself can be overfitted. So far, we have no indication that the pessimists are wrong. The reason, I have argued, is that we, as a community, have so far mostly worked on one aspect of the design problem.

Bibliography

[1] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G. and Petersen, S., 2015. Human-level control through deep reinforcement learning. Nature, 518(7540), pp.529-533.

[2] Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M. and Silver, D., 2018, April. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence.

[3] Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O., 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

[4] Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D. and Kavukcuoglu, K., 2016, June. Asynchronous methods for deep reinforcement learning. In International conference on machine learning (pp. 1928-1937).

[5] Sutton, R., 2019. The bitter lesson. Incomplete Ideas (blog), March, 13. http://www.incompleteideas.net/IncIdeas/BitterLesson.html

[6] Peng, X.B., Andrychowicz, M., Zaremba, W. and Abbeel, P., 2018, May. Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE international conference on robotics and automation (ICRA) (pp. 1-8). IEEE.

[7] Akkaya, I., Andrychowicz, M., Chociej, M., Litwin, M., McGrew, B., Petron, A., Paino, A., Plappert, M., Powell, G., Ribas, R. and Schneider, J., 2019. Solving Rubik’s Cube with a Robot Hand. arXiv preprint arXiv:1910.07113.

[8] Swaminathan, A. and Joachims, T., 2015, June. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning (pp. 814-823).

[9] Hallak, A., Di Castro, D. and Mannor, S., 2015. Contextual Markov decision processes. arXiv preprint arXiv:1502.02259.

[10] Finn, C., Abbeel, P. and Levine, S., 2017, August. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70 (pp. 1126-1135). JMLR. org.

[11] Zahavy, T., Ben-Zrihem, N. and Mannor, S., 2016, June. Graying the black box: Understanding dqns. In International Conference on Machine Learning (pp. 1899-1908).

[12] Hohman, F., Kahng, M., Pienta, R. and Chau, D.H., 2018. Visual analytics in deep learning: An interrogative survey for the next frontiers. IEEE Transactions on Visualization and Computer Graphics, 25(8), pp.2674-2693.