1 Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (MIT Press, 1998). This book is the definitive reference on computational reinforcement learning.

2 Kaelbling, L. P., Littman, M. L. & Moore, A. W. Reinforcement learning: a survey. J. Artif. Intell. Res. 4, 237–285 (1996).

3 Berry, D. A. & Fristedt, B. Bandit Problems: Sequential Allocation of Experiments (Chapman and Hall, 1985).

4 Shrager, J. & Tenenbaum, J. M. Rapid learning for precision oncology. Nature Rev. Clin. Oncol. 11, 109–118 (2014).

5 Auer, P., Cesa-Bianchi, N. & Fischer, P. Finite-time analysis of the multi-armed bandit problem. Mach. Learn. 47, 235–256 (2002).

6 Kaelbling, L. P. Learning in Embedded Systems (MIT Press, 1993).

7 Li, L., Chu, W., Langford, J. & Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proc. 19th International World Wide Web Conference 661–670 (2010).

8 Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, 285–294 (1933).

9 West, R. F. & Stanovich, K. E. Is probability matching smart? Associations between probabilistic choices and cognitive ability. Mem. Cognit. 31, 243–251 (2003).

10 May, B. C., Korda, N., Lee, A. & Leslie, D. S. Optimistic Bayesian sampling in contextual-bandit problems. J. Mach. Learn. Res. 13, 2069–2106 (2012).

11 Bubeck, S. & Liu, C.-Y. Prior-free and prior-dependent regret bounds for Thompson sampling. In Proc. Advances in Neural Information Processing Systems 638–646 (2013).

12 Gershman, S. & Blei, D. A tutorial on Bayesian nonparametric models. J. Math. Psychol. 56, 1–12 (2012).

13 Sutton, R. S. Learning to predict by the method of temporal differences. Mach. Learn. 3, 9–44 (1988).

14 Boyan, J. A. & Moore, A. W. Generalization in reinforcement learning: safely approximating the value function. In Proc. Advances in Neural Information Processing Systems 369–376 (1995).

15 Baird, L. Residual algorithms: reinforcement learning with function approximation. In Proc. 12th International Conference on Machine Learning (eds Prieditis, A. & Russell, S.) 30–37 (Morgan Kaufmann, 1995).

16 Sutton, R. S. et al. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proc. 26th Annual International Conference on Machine Learning 993–1000 (2009).

17 Sutton, R. S., Maei, H. R. & Szepesvári, C. A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Proc. Advances in Neural Information Processing Systems 1609–1616 (2009).

18 Maei, H. R. et al. Convergent temporal-difference learning with arbitrary smooth function approximation. In Proc. Advances in Neural Information Processing Systems 1204–1212 (2009).

19 Maei, H. R., Szepesvári, C., Bhatnagar, S. & Sutton, R. S. Toward off-policy learning control with function approximation. In Proc. 27th International Conference on Machine Learning 719–726 (2010).

20 van Hasselt, H., Mahmood, A. R. & Sutton, R. S. Off-policy TD(λ) with a true online equivalence. In Proc. 30th Conference on Uncertainty in Artificial Intelligence 324 (2014).

21 Russell, S. J. & Norvig, P. Artificial Intelligence: A Modern Approach (Prentice–Hall, 1994).

22 Campbell, M., Hoane, A. J. & Hsu, F. H. Deep Blue. Artif. Intell. 134, 57–83 (2002).

23 Samuel, A. L. Some studies in machine learning using the game of checkers. IBM J. Res. Develop. 3, 211–229 (1959).

24 Tesauro, G. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Comput. 6, 215–219 (1994). This article describes the first reinforcement-learning system to solve a truly non-trivial task.

25 Tesauro, G., Gondek, D., Lenchner, J., Fan, J. & Prager, J. M. Simulation, learning, and optimization techniques in Watson's game strategies. IBM J. Res. Develop. 56, 1–11 (2012).

26 Kocsis, L. & Szepesvári, C. Bandit based Monte-Carlo planning. In Proc. 17th European Conference on Machine Learning 282–293 (2006). This article introduces UCT, the decision-making algorithm that revolutionized gameplay in Go.

27 Gelly, S. et al. The grand challenge of computer Go: Monte Carlo tree search and extensions. Commun. ACM 55, 106–113 (2012).

28 İpek, E., Mutlu, O., Martínez, J. F. & Caruana, R. Self-optimizing memory controllers: a reinforcement learning approach. In Proc. 35th International Symposium on Computer Architecture 39–50 (2008).

29 Ng, A. Y., Kim, H. J., Jordan, M. I. & Sastry, S. Autonomous helicopter flight via reinforcement learning. In Proc. Advances in Neural Information Processing Systems http://papers.nips.cc/paper/2455-autonomous-helicopter-flight-via-reinforcement-learning (2003).

30 Sutton, R. S. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proc. 7th International Conference on Machine Learning 216–224 (Morgan Kaufmann, 1990).

31 Kearns, M. J. & Singh, S. P. Near-optimal reinforcement learning in polynomial time. Mach. Learn. 49, 209–232 (2002). This article provides the first algorithm and analysis showing that reinforcement-learning tasks can be solved approximately optimally with a relatively small amount of experience.

32 Brafman, R. I. & Tennenholtz, M. R-MAX — a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res. 3, 213–231 (2002).

33 Li, L., Littman, M. L., Walsh, T. J. & Strehl, A. L. Knows what it knows: a framework for self-aware learning. Mach. Learn. 82, 399–443 (2011).

34 Langley, P. Machine learning as an experimental science. Mach. Learn. 3, 5–8 (1988).

35 Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. The arcade learning environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013).

36 Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015). This article describes the application of deep learning in a reinforcement-learning setting to address the challenging task of decision making in an arcade environment.

37 Murphy, S. A. An experimental design for the development of adaptive treatment strategies. Stat. Med. 24, 1455–1481 (2005).

38 Li, L., Chu, W., Langford, J. & Wang, X. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proc. 4th ACM International Conference on Web Search and Data Mining 297–306 (2011).

39 Nouri, A. et al. A novel benchmark methodology and data repository for real-life reinforcement learning. In Proc. Multidisciplinary Symposium on Reinforcement Learning, Poster (2009).

40 Marivate, V. N., Chemali, J., Littman, M. & Brunskill, E. Discovering multi-modal characteristics in observational clinical data. In Proc. Machine Learning for Clinical Data Analysis and Healthcare NIPS Workshop http://paul.rutgers.edu/~vukosi/papers/nips2013workshop.pdf (2013).

41 Ng, A. Y., Harada, D. & Russell, S. Policy invariance under reward transformations: theory and application to reward shaping. In Proc. 16th International Conference on Machine Learning 278–287 (1999).

42 Thomaz, A. L. & Breazeal, C. Teachable robots: understanding human teaching behaviour to build more effective robot learners. Artif. Intell. 172, 716–737 (2008).

43 Knox, W. B. & Stone, P. Interactively shaping agents via human reinforcement: The TAMER framework. In Proc. 5th International Conference on Knowledge Capture 9–16 (2009).

44 Loftin, R. et al. A strategy-aware technique for learning behaviors from discrete human feedback. In Proc. 28th Association for the Advancement of Artificial Intelligence Conference https://www.aaai.org/ocs/index.php/AAAI/AAAI14/paper/view/8579 (2014).

45 Ng, A. Y. & Russell, S. Algorithms for inverse reinforcement learning. In Proc. 17th International Conference on Machine Learning 663–670 (2000).

46 Babes, M., Marivate, V. N., Littman, M. L. & Subramanian, K. Apprenticeship learning about multiple intentions. In Proc. 28th International Conference on Machine Learning 897–904 (2011).

47 Singh, S., Lewis, R. L., Barto, A. G. & Sorg, J. Intrinsically motivated reinforcement learning: an evolutionary perspective. IEEE Trans. Auto. Mental Dev. 2, 70–82 (2010).

48 Newell, A. The chess machine: an example of dealing with a complex task by adaptation. In Proc. Western Joint Computer Conference 101–108 (1955).

49 Minsky, M. L. Some methods of artificial intelligence and heuristic programming. In Proc. Symposium on the Mechanization of Thought Processes 24–27 (1958).

50 Sutton, R. S. & Barto, A. G. Toward a modern theory of adaptive networks: expectation and prediction. Psychol. Rev. 88, 135–170 (1981).

51 Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).

52 Dayan, P. & Niv, Y. Reinforcement learning and the brain: the good, the bad and the ugly. Curr. Opin. Neurobiol. 18, 185–196 (2008).

53 Niv, Y. Neuroscience: dopamine ramps up. Nature 500, 533–535 (2013).

54 Cushman, F. Action, outcome, and value: a dual-system framework for morality. Pers. Soc. Psychol. Rev. 17, 273–292 (2013).

55 Shapley, L. Stochastic games. Proc. Natl Acad. Sci. USA 39, 1095–1100 (1953).

56 Bellman, R. Dynamic Programming (Princeton Univ. Press, 1957).

57 Kober, J., Bagnell, J. A. & Peters, J. Reinforcement learning in robotics: a survey. Int. J. Rob. Res. 32, 1238–1274 (2013).

58 Watkins, C. J. C. H. & Dayan, P. Q-learning. Mach. Learn. 8, 279–292 (1992). This article introduces the first provably correct approach to reinforcement learning for both prediction and decision making.

59 Jaakkola, T., Jordan, M. I. & Singh, S. P. Convergence of stochastic iterative dynamic programming algorithms. In Proc. Advances in Neural Information Processing Systems 6, 703–710 (Morgan Kaufmann, 1994).