1 Russell, S. & Norvig, P. Artificial Intelligence: a Modern Approach (Prentice–Hall, 1995).

2 Thrun, S., Burgard, W. & Fox, D. Probabilistic Robotics (MIT Press, 2006).

3 Bishop, C. M. Pattern Recognition and Machine Learning (Springer, 2006).

4 Murphy, K. P. Machine Learning: A Probabilistic Perspective (MIT Press, 2012).

5 Hinton, G. et al. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29, 82–97 (2012).

6 Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Proc. Advances in Neural Information Processing Systems 25 1097–1105 (2012).

7 Sermanet, P. et al. Overfeat: integrated recognition, localization and detection using convolutional networks. In Proc. International Conference on Learning Representations http://arxiv.org/abs/1312.6229 (2014).

8 Bengio, Y., Ducharme, R., Vincent, P. & Janvin, C. A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003).

9 Ghahramani, Z. Bayesian nonparametrics and the probabilistic approach to modelling. Phil. Trans. R. Soc. A 371, 20110553 (2013). A review of Bayesian non-parametric modelling written for a general scientific audience.

10 Jaynes, E. T. Probability Theory: the Logic of Science (Cambridge Univ. Press, 2003).

11 Koller, D. & Friedman, N. Probabilistic Graphical Models: Principles and Techniques (MIT Press, 2009). This is an encyclopaedic text on probabilistic graphical models spanning many key topics.

12 Cox, R. T. The Algebra of Probable Inference (Johns Hopkins Univ. Press, 1961).

13 Van Horn, K. S. Constructing a logic of plausible inference: a guide to Cox's theorem. Int. J. Approx. Reason. 34, 3–24 (2003).

14 De Finetti, B. La prévision: ses lois logiques, ses sources subjectives. In Annales de l'institut Henri Poincaré [in French] 7, 1–68 (1937).

15 Knill, D. & Richards, W. Perception as Bayesian Inference (Cambridge Univ. Press, 1996).

16 Griffiths, T. L. & Tenenbaum, J. B. Optimal predictions in everyday cognition. Psychol. Sci. 17, 767–773 (2006).

17 Wolpert, D. M., Ghahramani, Z. & Jordan, M. I. An internal model for sensorimotor integration. Science 269, 1880–1882 (1995).

18 Tenenbaum, J. B., Kemp, C., Griffiths, T. L. & Goodman, N. D. How to grow a mind: statistics, structure, and abstraction. Science 331, 1279–1285 (2011).

19 Marcus, G. F. & Davis, E. How robust are probabilistic models of higher-level cognition? Psychol. Sci. 24, 2351–2360 (2013).

20 Goodman, N. D. et al. Relevant and robust: a response to Marcus and Davis (2013). Psychol. Sci. 26, 539–541 (2015).

21 Doya, K., Ishii, S., Pouget, A. & Rao, R. P. N. Bayesian Brain: Probabilistic Approaches to Neural Coding (MIT Press, 2007).

22 Deneve, S. Bayesian spiking neurons I: inference. Neural Comput. 20, 91–117 (2008).

23 Neal, R. M. Probabilistic Inference Using Markov Chain Monte Carlo Methods. Report No. CRG-TR-93-1 http://www.cs.toronto.edu/~radford/review.abstract.html (Univ. Toronto, 1993).

24 Jordan, M., Ghahramani, Z., Jaakkola, T. & Saul, L. An introduction to variational methods in graphical models. Mach. Learn. 37, 183–233 (1999).

25 Doucet, A., de Freitas, J. F. G. & Gordon, N. J. Sequential Monte Carlo Methods in Practice (Springer, 2000).

26 Minka, T. P. Expectation propagation for approximate Bayesian inference. In Proc. Uncertainty in Artificial Intelligence 17 362–369 (2001).

27 Neal, R. M. in Handbook of Markov Chain Monte Carlo (eds Brooks, S., Gelman, A., Jones, G. & Meng, X.-L.) (Chapman & Hall/CRC, 2010).

28 Girolami, M. & Calderhead, B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J. R. Stat. Soc. Series B Stat. Methodol. 73, 123–214 (2011).

29 Sutskever, I., Vinyals, O. & Le, Q. V. Sequence to sequence learning with neural networks. In Proc. Advances in Neural Information Processing Systems 27 3104–3112 (2014).

30 Neal, R. M. in Maximum Entropy and Bayesian Methods 197–211 (Springer, 1992).

31 Orbanz, P. & Teh, Y. W. in Encyclopedia of Machine Learning 81–89 (Springer, 2010).

32 Hjort, N., Holmes, C., Müller, P. & Walker, S. (eds). Bayesian Nonparametrics (Cambridge Univ. Press, 2010).

33 Rasmussen, C. E. & Williams, C. K. I. Gaussian Processes for Machine Learning (MIT Press, 2006). This is a classic monograph on Gaussian processes, relating them to kernel methods and other areas of machine learning.

34 Lu, C. & Tang, X. Surpassing human-level face verification performance on LFW with GaussianFace. In Proc. 29th AAAI Conference on Artificial Intelligence http://arxiv.org/abs/1404.3840 (2015).

35 Ferguson, T. S. A Bayesian analysis of some nonparametric problems. Ann. Stat. 1, 209–230 (1973).

36 Teh, Y. W., Jordan, M. I., Beal, M. J. & Blei, D. M. Hierarchical Dirichlet processes. J. Am. Stat. Assoc. 101, 1566–1581 (2006).

37 Kemp, C., Tenenbaum, J. B., Griffiths, T. L., Yamada, T. & Ueda, N. Learning systems of concepts with an infinite relational model. In Proc. 21st National Conference on Artificial Intelligence 381–388 (2006).

38 Medvedovic, M. & Sivaganesan, S. Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics 18, 1194–1206 (2002).

39 Rasmussen, C. E., De la Cruz, B. J., Ghahramani, Z. & Wild, D. L. Modeling and visualizing uncertainty in gene expression clusters using Dirichlet process mixtures. Trans. Comput. Biol. Bioinform. 6, 615–628 (2009).

40 Griffiths, T. L. & Ghahramani, Z. The Indian buffet process: an introduction and review. J. Mach. Learn. Res. 12, 1185–1224 (2011). This article introduced a new class of Bayesian non-parametric models for latent feature modelling.

41 Adams, R. P., Wallach, H. & Ghahramani, Z. Learning the structure of deep sparse graphical models. In Proc. 13th International Conference on Artificial Intelligence and Statistics (eds Teh, Y. W. & Titterington, M.) 1–8 (2010).

42 Miller, K., Jordan, M. I. & Griffiths, T. L. Nonparametric latent feature models for link prediction. In Proc. Advances in Neural Information Processing Systems 1276–1284 (2009).

43 Hinton, G. E., McClelland, J. L. & Rumelhart, D. E. in Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Foundations 77–109 (MIT Press, 1986).

44 Neal, R. M. Bayesian Learning for Neural Networks (Springer, 1996). This text derived MCMC-based Bayesian inference in neural networks and drew important links to Gaussian processes.

45 Koller, D., McAllester, D. & Pfeffer, A. Effective Bayesian inference for stochastic programs. In Proc. 14th National Conference on Artificial Intelligence 740–747 (1997).

46 Goodman, N. D. & Stuhlmüller, A. The Design and Implementation of Probabilistic Programming Languages. Available at http://dippl.org (2015).

47 Pfeffer, A. Practical Probabilistic Programming (Manning, 2015).

48 Freer, C., Roy, D. & Tenenbaum, J. B. in Turing's Legacy (ed. Downey, R.) 195–252 (2014).

49 Marjoram, P., Molitor, J., Plagnol, V. & Tavaré, S. Markov chain Monte Carlo without likelihoods. Proc. Natl Acad. Sci. USA 100, 15324–15328 (2003).

50 Mansinghka, V., Kulkarni, T. D., Perov, Y. N. & Tenenbaum, J. Approximate Bayesian image interpretation using generative probabilistic graphics programs. In Proc. Advances in Neural Information Processing Systems 26 1520–1528 (2013).

51 Bishop, C. M. Model-based machine learning. Phil. Trans. R. Soc. A 371, 20120222 (2013). This article is a very clear tutorial exposition of probabilistic modelling.

52 Lunn, D. J., Thomas, A., Best, N. & Spiegelhalter, D. WinBUGS — a Bayesian modelling framework: concepts, structure, and extensibility. Stat. Comput. 10, 325–337 (2000). This reports an early probabilistic programming framework widely used in statistics.

53 Stan Development Team. Stan Modeling Language Users Guide and Reference Manual, Version 2.5.0. http://mc-stan.org/ (2014).

54 Fischer, B. & Schumann, J. AutoBayes: a system for generating data analysis programs from statistical models. J. Funct. Program. 13, 483–508 (2003).

56 Wingate, D., Stuhlmüller, A. & Goodman, N. D. Lightweight implementations of probabilistic programming languages via transformational compilation. In Proc. International Conference on Artificial Intelligence and Statistics 770–778 (2011).

57 Pfeffer, A. IBAL: a probabilistic rational programming language. In Proc. International Joint Conference on Artificial Intelligence 733–740 (2001).

58 Milch, B. et al. BLOG: probabilistic models with unknown objects. In Proc. 19th International Joint Conference on Artificial Intelligence 1352–1359 (2005).

59 Goodman, N., Mansinghka, V., Roy, D., Bonawitz, K. & Tenenbaum, J. Church: a language for generative models. In Proc. Uncertainty in Artificial Intelligence 22 23 (2008). This is an influential paper introducing the Turing-complete probabilistic programming language Church.

60 Pfeffer, A. Figaro: An Object-Oriented Probabilistic Programming Language. Tech. Rep. (Charles River Analytics, 2009).

61 Mansinghka, V., Selsam, D. & Perov, Y. Venture: a higher-order probabilistic programming platform with programmable inference. Preprint at http://arxiv.org/abs/1404.0099 (2014).

62 Wood, F., van de Meent, J. W. & Mansinghka, V. A new approach to probabilistic programming inference. In Proc. 17th International Conference on Artificial Intelligence and Statistics 1024–1032 (2014).

63 Li, L., Wu, Y. & Russell, S. J. SWIFT: Compiled Inference for Probabilistic Programs. Report No. UCB/EECS-2015-12 (Univ. California, Berkeley, 2015).

64 Bergstra, J. et al. Theano: a CPU and GPU math expression compiler. In Proc. 9th Python in Science Conference http://conference.scipy.org/proceedings/scipy2010/ (2010).

65 Kushner, H. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. J. Basic Eng. 86, 97–106 (1964).

66 Jones, D. R., Schonlau, M. & Welch, W. J. Efficient global optimization of expensive black-box functions. J. Glob. Optim. 13, 455–492 (1998).

67 Brochu, E., Cora, V. M. & de Freitas, N. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Preprint at http://arXiv.org/abs/1012.2599 (2010).

68 Hennig, P. & Schuler, C. J. Entropy search for information-efficient global optimization. J. Mach. Learn. Res. 13, 1809–1837 (2012).

69 Hernández-Lobato, J. M., Hoffman, M. W. & Ghahramani, Z. Predictive entropy search for efficient global optimization of black-box functions. In Proc. Advances in Neural Information Processing Systems 918–926 (2014).

70 Snoek, J., Larochelle, H. & Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Proc. Advances in Neural Information Processing Systems 2960–2968 (2012).

71 Thornton, C., Hutter, F., Hoos, H. H. & Leyton-Brown, K. Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In Proc. 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 847–855 (2013).

72 Robbins, H. Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc. 55, 527–535 (1952).

73 Deisenroth, M. P. & Rasmussen, C. E. PILCO: a model-based and data-efficient approach to policy search. In Proc. 28th International Conference on Machine Learning 465–472 (2011).

74 Poupart, P. in Encyclopedia of Machine Learning 90–93 (Springer, 2010).

75 Diaconis, P. in Statistical Decision Theory and Related Topics IV 163–175 (Springer, 1988).

76 O'Hagan, A. Bayes-Hermite quadrature. J. Statist. Plann. Inference 29, 245–260 (1991).

77 Shannon, C. & Weaver, W. The Mathematical Theory of Communication (Univ. Illinois Press, 1949).

78 MacKay, D. J. C. Information Theory, Inference, and Learning Algorithms (Cambridge Univ. Press, 2003).

79 Wood, F., Gasthaus, J., Archambeau, C., James, L. & Teh, Y. W. The sequence memoizer. Commun. ACM 54, 91–98 (2011). This article derives a state-of-the-art data compression scheme based on Bayesian nonparametric models.

80 Steinruecken, C., Ghahramani, Z. & MacKay, D. J. C. Improving PPM with dynamic parameter updates. In Proc. Data Compression Conference (in the press).

81 Lloyd, J. R., Duvenaud, D., Grosse, R., Tenenbaum, J. B. & Ghahramani, Z. Automatic construction and natural-language description of nonparametric regression models. In Proc. 28th AAAI Conference on Artificial Intelligence. Preprint at http://arxiv.org/abs/1402.4304 (2014). Introduces the Automatic Statistician, translating learned probabilistic models into reports about data.

82 Grosse, R. B., Salakhutdinov, R. & Tenenbaum, J. B. Exploiting compositionality to explore a large space of model structures. In Proc. Conference on Uncertainty in Artificial Intelligence 306–315 (2012).

83 Schmidt, M. & Lipson, H. Distilling free-form natural laws from experimental data. Science 324, 81–85 (2009).

84 Wolstenholme, D. E., O'Brien, C. M. & Nelder, J. A. GLIMPSE: a knowledge-based front end for statistical analysis. Knowl. Base. Syst. 1, 173–178 (1988).

85 Hand, D. J. Patterns in statistical strategy. In Artificial Intelligence and Statistics (ed. Gale, W. A.) (Addison-Wesley Longman, 1986).

86 King, R. D. et al. Functional genomic hypothesis generation and experimentation by a robot scientist. Nature 427, 247–252 (2004).

87 Welling, M. et al. Bayesian inference with big data: a snapshot from a workshop. ISBA Bulletin 21, https://bayesian.org/sites/default/files/fm/bulletins/1412.pdf (2014).

88 Bakker, B. & Heskes, T. Task clustering and gating for Bayesian multitask learning. J. Mach. Learn. Res. 4, 83–99 (2003).

89 Houlsby, N., Hernández-Lobato, J. M., Huszár, F. & Ghahramani, Z. Collaborative Gaussian processes for preference learning. In Proc. Advances in Neural Information Processing Systems 25 2096–2104 (2012).

90 Russell, S. J. & Wefald, E. Do the Right Thing: Studies in Limited Rationality (MIT Press, 1991).

91 Jordan, M. I. On statistics, computation and scalability. Bernoulli 19, 1378–1390 (2013).

92 Hoffman, M., Blei, D., Paisley, J. & Wang, C. Stochastic variational inference. J. Mach. Learn. Res. 14, 1303–1347 (2013).

93 Hensman, J., Fusi, N. & Lawrence, N. D. Gaussian processes for big data. In Proc. Conference on Uncertainty in Artificial Intelligence 244 (UAI, 2013).

94 Korattikara, A., Chen, Y. & Welling, M. Austerity in MCMC land: cutting the Metropolis-Hastings budget. In Proc. 31st International Conference on Machine Learning 181–189 (2014).

95 Paige, B., Wood, F., Doucet, A. & Teh, Y. W. Asynchronous anytime sequential Monte Carlo. In Proc. Advances in Neural Information Processing Systems 27 3410–3418 (2014).

96 Jefferys, W. H. & Berger, J. O. Ockham's Razor and Bayesian Analysis. Am. Sci. 80, 64–72 (1992).

97 Rasmussen, C. E. & Ghahramani, Z. Occam's Razor. In Neural Information Processing Systems 13 (eds Leen, T. K., Dietterich, T. G., & Tresp, V.) 294–300 (2001).

98 Rabiner, L. R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 257–286 (1989).

99 Gelman, A. et al. Bayesian Data Analysis 3rd edn (Chapman & Hall/CRC, 2013).