References
Abbeel, P., & Ng, A. Y. (2004, July 4). Apprenticeship learning via
inverse reinforcement learning. Proceedings of the
Twenty-First International Conference on Machine
Learning. https://doi.org/10.1145/1015330.1015430
Agarwal, A., Jiang, N., Kakade, S. M., & Sun, W. (2022).
Reinforcement learning: Theory and algorithms. https://rltheorybook.github.io/rltheorybook_AJKS.pdf
Agarwal, R., Schwarzer, M., Castro, P. S., Courville, A. C., &
Bellemare, M. (2021). Deep reinforcement learning at the edge of the
statistical precipice. Advances in Neural Information
Processing Systems, 34, 29304–29320. https://proceedings.neurips.cc/paper_files/paper/2021/hash/f514cec81cb148559cf475e7426eed5e-Abstract.html
Albrecht, S. V., Christianos, F., & Schäfer, L. (2023).
Multi-agent reinforcement learning: Foundations and modern
approaches. MIT Press. https://www.marl-book.com
Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C.
(2024). Quarto (Version 1.4) [Computer software]. https://doi.org/10.5281/zenodo.5960048
Amari, S.-I. (1998). Natural gradient works efficiently in learning.
Neural Computation, 10(2), 251–276. https://doi.org/10.1162/089976698300017746
Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J.,
& Mané, D. (2016). Concrete problems in AI
safety. http://arxiv.org/abs/1606.06565
Ansel, J., Yang, E., He, H., Gimelshein, N., Jain, A., Voznesensky, M.,
Bao, B., Bell, P., Berard, D., Burovski, E., Chauhan, G., Chourdia, A.,
Constable, W., Desmaison, A., DeVito, Z., Ellison, E., Feng, W., Gong,
J., Gschwind, M., … Chintala, S. (2024, April). PyTorch 2:
Faster machine learning through dynamic Python bytecode transformation
and graph compilation. 29th ACM International
Conference on Architectural Support for Programming Languages and
Operating Systems, Volume 2 (ASPLOS ’24). https://doi.org/10.1145/3620665.3640366
Antonoglou, I., Schrittwieser, J., Ozair, S., Hubert, T. K., &
Silver, D. (2021, October 6). Planning in stochastic environments with a
learned model. The Tenth International Conference on Learning
Representations. https://openreview.net/forum?id=X6D9bAHhBQ1
Athans, M., & Falb, P. L. (1966). Optimal control:
An introduction to the theory and its applications.
McGraw-Hill. https://books.google.com?id=pfJHAQAAIAAJ
Aubret, A., Matignon, L., & Hassas, S. (2019, November 19). A
survey on intrinsic motivation in reinforcement learning. https://doi.org/10.48550/arXiv.1908.06976
Auer, P. (2002). Using confidence bounds for exploitation-exploration
trade-offs. Journal of Machine Learning Research, 3,
397–422. https://www.jmlr.org/papers/v3/auer02a.html
Azar, M. G., Osband, I., & Munos, R. (2017). Minimax regret bounds
for reinforcement learning. Proceedings of the 34th
International Conference on Machine
Learning, 263–272. https://proceedings.mlr.press/v70/azar17a.html
Babuschkin, I., Baumli, K., Bell, A., Bhupatiraju, S., Bruce, J.,
Buchlovsky, P., Budden, D., Cai, T., Clark, A., Danihelka, I., Dedieu,
A., Fantacci, C., Godwin, J., Jones, C., Hemsley, R., Hennigan, T.,
Hessel, M., Hou, S., Kapturowski, S., … Viola, F. (2020). The
DeepMind JAX ecosystem [Computer software]. http://github.com/deepmind
Badia, A. P., Piot, B., Kapturowski, S., Sprechmann, P., Vitvitskyi, A.,
Guo, D., & Blundell, C. (2020). Agent57: Outperforming
the Atari human benchmark. Proceedings of the 37th International
Conference on Machine Learning, 507–517. https://doi.org/10.48550/arXiv.2003.13350
Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike
adaptive elements that can solve difficult learning control problems.
IEEE Transactions on Systems, Man, and Cybernetics,
SMC-13(5), 834–846. https://doi.org/10.1109/TSMC.1983.6313077
Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M.
(2018, February 5). Automatic differentiation in machine learning: A
survey. https://doi.org/10.48550/arXiv.1502.05767
Beck, A., & Teboulle, M. (2003). Mirror descent and nonlinear
projected subgradient methods for convex optimization. Operations
Research Letters, 31(3), 167–175. https://doi.org/10.1016/S0167-6377(02)00231-6
Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2013). The
Arcade Learning Environment: An evaluation platform for
general agents. Journal of Artificial Intelligence Research,
47, 253–279. https://doi.org/10.1613/jair.3912
Bellemare, M. G., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D.,
& Munos, R. (2016). Unifying count-based exploration and intrinsic
motivation. Proceedings of the 30th International
Conference on Neural Information Processing
Systems, 1479–1487.
Bellman, R. (1957). Dynamic programming. Princeton University
Press. https://books.google.com?id=rZW4ugAACAAJ
Bellman, R. (1961). Adaptive control processes: A guided tour.
Princeton University Press. https://books.google.com?id=POAmAAAAMAAJ
Berkovitz, L. D. (1974). Optimal control theory. Springer
Science+Business Media LLC.
Berry, D. A., & Fristedt, B. (1985). Bandit problems.
Springer Netherlands. https://doi.org/10.1007/978-94-015-3711-7
Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic
programming. Athena Scientific.
Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., & Lee, M. (2009).
Natural actor–critic algorithms. Automatica, 45(11),
2471–2482. https://doi.org/10.1016/j.automatica.2009.07.008
Bishop, C. M. (2006). Pattern recognition and machine learning.
Springer.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization.
Cambridge University Press. https://web.stanford.edu/~boyd/cvxbook/
Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C.,
Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne,
S., & Zhang, Q. (2018). JAX:
Composable transformations of Python+NumPy
programs (Version 0.3.13) [Computer software]. http://github.com/google/jax
Burda, Y., Edwards, H., Storkey, A., & Klimov, O. (2018, September
27). Exploration by random network distillation. The Seventh
International Conference on Learning Representations. https://openreview.net/forum?id=H1lJJnR5Ym
Clark, J., & Amodei, D. (2024, February 14). Faulty reward
functions in the wild. OpenAI. https://openai.com/index/faulty-reward-functions/
Danihelka, I., Guez, A., Schrittwieser, J., & Silver, D. (2021,
October 6). Policy improvement by planning with Gumbel. The Tenth
International Conference on Learning Representations. https://openreview.net/forum?id=bERaNdoegnO
Danilyuk, P. (2021). A robot imitating a girl’s movement
[Graphic]. https://www.pexels.com/photo/a-robot-imitating-a-girl-s-movement-8294811/
Deng, L. (2012). The MNIST database of handwritten digit
images for machine learning research. IEEE Signal Processing
Magazine, 29(6), 141–142. https://doi.org/10.1109/MSP.2012.2211477
Drago, S., Mussi, M., & Metelli, A. M. (2025, February 24). A
refined analysis of UCBVI. https://doi.org/10.48550/arXiv.2502.17370
Fisher, R. A. (1925). Statistical methods for research workers (11th
ed., rev.). Oliver & Boyd.
Frans Berkelaar. (2009). Container ship MSC Davos - Westerschelde -
Zeeland [Graphic]. https://www.flickr.com/photos/28169156@N03/52957948820/
Gao, L., Schulman, J., & Hilton, J. (2023). Scaling laws for reward
model overoptimization. Proceedings of the 40th International
Conference on Machine Learning, 202,
10835–10866.
Gittins, J. C. (2011). Multi-armed bandit allocation indices
(2nd ed.). Wiley.
Gleave, A., Taufeeque, M., Rocamonde, J., Jenner, E., Wang, S. H.,
Toyer, S., Ernestus, M., Belrose, N., Emmons, S., & Russell, S.
(2022, November 22). imitation: Clean
imitation learning implementations. https://doi.org/10.48550/arXiv.2211.11972
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D.,
Ozair, S., Courville, A., & Bengio, Y. (2020). Generative
adversarial networks. Communications of the ACM, 63(11), 139–144. https://doi.org/10.1145/3422622
GPA Photo Archive. (2017). Robotic arm [Graphic]. https://www.flickr.com/photos/iip-photo-archive/36123310136/
Guy, R. (2006). Chess [Graphic]. https://www.flickr.com/photos/romainguy/230416692/
Haarnoja, T., Tang, H., Abbeel, P., & Levine, S. (2017).
Reinforcement learning with deep energy-based policies. Proceedings
of the 34th International Conference on Machine
Learning - Volume 70, 1352–1361.
Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft
actor-critic: Off-policy maximum entropy
deep reinforcement learning with a stochastic actor. Proceedings of
the 35th International Conference on Machine
Learning, 1861–1870. https://proceedings.mlr.press/v80/haarnoja18b.html
Hasselt, H. (2010). Double Q-learning. Advances in Neural
Information Processing Systems, 23. https://proceedings.neurips.cc/paper_files/paper/2010/hash/091d584fced301b442654dd8c23b3fc9-Abstract.html
Hasselt, H. van, Guez, A., & Silver, D. (2016). Deep reinforcement
learning with double Q-learning. Proceedings of the Thirtieth
AAAI Conference on Artificial Intelligence, 2094–2100.
Hastie, T., Tibshirani, R., & Friedman, J. (2013). The elements
of statistical learning: Data mining, inference, and
prediction. Springer Science & Business Media. https://books.google.com?id=yPfZBwAAQBAJ
Hausknecht, M., & Stone, P. (2017, January 11). Deep recurrent
Q-learning for partially observable MDPs. https://doi.org/10.48550/arXiv.1507.06527
Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G.,
Dabney, W., Horgan, D., Piot, B., Azar, M., & Silver, D. (2018).
Rainbow: Combining improvements in deep reinforcement learning.
Proceedings of the Thirty-Second AAAI Conference on
Artificial Intelligence and Thirtieth Innovative
Applications of Artificial Intelligence Conference
and Eighth AAAI Symposium on Educational
Advances in Artificial Intelligence, 3215–3222.
Ho, J., & Ermon, S. (2016). Generative adversarial imitation
learning. Proceedings of the 30th International
Conference on Neural Information Processing
Systems, 4572–4580.
Ivanov, S., & D’yakonov, A. (2019, July 6). Modern deep
reinforcement learning algorithms. https://doi.org/10.48550/arXiv.1906.10025
James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J.
(2023). An introduction to statistical learning: With applications
in Python. Springer International Publishing. https://doi.org/10.1007/978-3-031-38747-0
Kakade, S. M. (2001). A natural policy gradient. Advances in
Neural Information Processing Systems, 14. https://proceedings.neurips.cc/paper_files/paper/2001/hash/4b86abe48d358ecf194c56c69108433e-Abstract.html
Kakade, S., & Langford, J. (2002). Approximately optimal approximate
reinforcement learning. Proceedings of the Nineteenth
International Conference on Machine Learning,
267–274.
Keviczky, L., Bars, R., Hetthéssy, J., & Bányász, C. (2019).
Control engineering. Springer. https://doi.org/10.1007/978-981-10-8297-9
Kochenderfer, M. J., Wheeler, T. A., & Wray, K. H. (2022).
Algorithms for decision making. MIT Press. https://mitpress.mit.edu/9780262047012/algorithms-for-decision-making/
Ladosz, P., Weng, L., Kim, M., & Oh, H. (2022). Exploration in deep
reinforcement learning: A survey. Information Fusion, 85,
1–22. https://doi.org/10.1016/j.inffus.2022.03.003
Lai, T. L., & Robbins, H. (1985). Asymptotically efficient adaptive
allocation rules. Advances in Applied Mathematics,
6(1), 4–22. https://doi.org/10.1016/0196-8858(85)90002-8
Lambert, N. (2024). Reinforcement learning from human feedback.
Online. https://rlhfbook.com
Lewis, F. L., Vrabie, D. L., & Syrmos, V. L. (2012). Optimal
control (3rd ed.). John Wiley & Sons. https://doi.org/10.1002/9781118122631
Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010). A
contextual-bandit approach to personalized news article recommendation.
Proceedings of the 19th International Conference on
World Wide Web, 661–670. https://doi.org/10.1145/1772690.1772758
Li, S. E. (2023). Reinforcement learning for sequential decision and
optimal control. Springer Nature. https://doi.org/10.1007/978-981-19-7784-8
Li, Y. (2018). Deep reinforcement learning. https://doi.org/10.48550/arXiv.1810.06339
Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement
learning, planning and teaching. Machine Learning,
8(3), 293–321. https://doi.org/10.1007/BF00992699
Ljapunov, A. M., & Fuller, A. T. (1992). The general problem of
the stability of motion. Taylor & Francis.
Lyapunov, A. M. (1892). The general problem of the stability of
motion. University of Kharkov.
MacFarlane, A. (1979). The development of frequency-response methods in
automatic control [perspectives]. IEEE Transactions on Automatic
Control, 24(2), 250–265. https://doi.org/10.1109/TAC.1979.1101978
Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht,
M., & Bowling, M. (2018). Revisiting the Arcade Learning
Environment: Evaluation protocols and open problems for
general agents. Journal of Artificial Intelligence Research,
61, 523–562. https://doi.org/10.1613/jair.5699
Maei, H., Szepesvári, C., Bhatnagar, S., Precup, D., Silver, D., &
Sutton, R. S. (2009). Convergent temporal-difference learning with
arbitrary smooth function approximation. Advances in Neural
Information Processing Systems, 22. https://papers.nips.cc/paper_files/paper/2009/hash/3a15c7d0bbe60300a39f76f8a5ba6896-Abstract.html
Mahadevan, S., Giguere, S., & Jacek, N. (2013). Basis adaptation for
sparse nonlinear reinforcement learning. Proceedings of the AAAI
Conference on Artificial Intelligence, 27(1), 654–660.
https://doi.org/10.1609/aaai.v27i1.8665
Mahadevan, S., & Liu, B. (2012). Sparse Q-learning with mirror
descent. Proceedings of the Twenty-Eighth Conference on
Uncertainty in Artificial Intelligence,
564–573.
Mannor, S., Mansour, Y., & Tamar, A. (2024). Reinforcement
learning: Foundations. https://sites.google.com/view/rlfoundations/home
Marbach, P., & Tsitsiklis, J. N. (2001). Simulation-based
optimization of Markov reward processes. IEEE Transactions on
Automatic Control, 46(2), 191–209. https://doi.org/10.1109/9.905687
Martens, J., & Grosse, R. (2015). Optimizing neural networks with
Kronecker-factored approximate curvature. Proceedings of the 32nd
International Conference on Machine
Learning, 2408–2417. https://proceedings.mlr.press/v37/martens15.html
Maxwell, J. C. (1867). On governors. Proceedings of the Royal
Society of London, 16, 270–283. https://www.jstor.org/stable/112510
Mayr, E. (1970). Populations, species and evolution: An
abridgment of animal species and evolution. Belknap Press of
Harvard University Press.
Meurer, A., Smith, C. P., Paprocki, M., Čertík, O., Kirpichev, S. B.,
Rocklin, M., Kumar, A., Ivanov, S., Moore, J. K., Singh, S., Rathnayake,
T., Vig, S., Granger, B. E., Muller, R. P., Bonazzi, F., Gupta, H.,
Vats, S., Johansson, F., Pedregosa, F., … Scopatz, A. (2017).
SymPy: Symbolic computing in Python. PeerJ Computer
Science, 3, e103. https://doi.org/10.7717/peerj-cs.103
Meyer, J.-A., & Wilson, S. W. (1991). A possibility for implementing
curiosity and boredom in model-building neural controllers. In From
Animals to Animats: Proceedings of the First International
Conference on Simulation of Adaptive Behavior (pp.
222–227). MIT Press. https://ieeexplore.ieee.org/document/6294131
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley,
T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous methods for
deep reinforcement learning. Proceedings of The 33rd
International Conference on Machine
Learning, 1928–1937. https://proceedings.mlr.press/v48/mniha16.html
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I.,
Wierstra, D., & Riedmiller, M. A. (2013). Playing Atari with deep
reinforcement learning. CoRR, abs/1312.5602. http://arxiv.org/abs/1312.5602
Munos, R., Stepleton, T., Harutyunyan, A., & Bellemare, M. G.
(2016). Safe and efficient off-policy reinforcement learning.
Proceedings of the 30th International Conference on
Neural Information Processing Systems, 1054–1062.
Murphy, K. (2025, March 24). Reinforcement learning: A comprehensive
overview. https://doi.org/10.48550/arXiv.2412.05265
Negative Space. (2015). Photo of commercial district during
dawn [Graphic]. https://www.pexels.com/photo/photo-of-commercial-district-during-dawn-34639/
Nemirovskij, A. S., Judin, D. B., & Dawson, E. R. (1983). Problem
complexity and method efficiency in optimization. Wiley.
Ng, A. Y., & Russell, S. J. (2000). Algorithms for inverse
reinforcement learning. Proceedings of the Seventeenth
International Conference on Machine Learning,
663–670. https://ai.stanford.edu/~ang/papers/icml00-irl.pdf
Nielsen, M. A. (2015). Neural networks and deep learning.
Determination Press. http://neuralnetworksanddeeplearning.com/
Nocedal, J., & Wright, S. J. (2006). Numerical optimization
(2nd ed.). Springer.
OpenAI. (2022, November 30). Introducing ChatGPT.
OpenAI News. https://openai.com/index/chatgpt/
Orsini, M., Raichuk, A., Hussenot, L., Vincent, D., Dadashi, R., Girgin,
S., Geist, M., Bachem, O., Pietquin, O., & Andrychowicz, M. (2021).
What matters for adversarial imitation learning? Proceedings of the
35th International Conference on Neural Information
Processing Systems, 14656–14668.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin,
P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton,
J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P.,
Christiano, P., Leike, J., & Lowe, R. (2022). Training language
models to follow instructions with human feedback. Proceedings of
the 36th International Conference on Neural
Information Processing Systems, 27730–27744.
Peters, J., & Schaal, S. (2008). Natural actor-critic.
Neurocomputing, 71(7), 1180–1190. https://doi.org/10.1016/j.neucom.2007.11.026
Peters, J., Vijayakumar, S., & Schaal, S. (2005). Natural
actor-critic. In J. Gama, R. Camacho, P. B. Brazdil, A. M. Jorge, &
L. Torgo (Eds.), Machine Learning: ECML
2005 (pp. 280–291). Springer. https://doi.org/10.1007/11564096_29
Piot, B., Geist, M., & Pietquin, O. (2017). Bridging the gap between
imitation learning and inverse reinforcement learning. IEEE
Transactions on Neural Networks and Learning Systems,
28(8), 1814–1826. https://doi.org/10.1109/TNNLS.2016.2543000
Pixabay. (2016a). 20 mg label blister pack [Graphic]. https://www.pexels.com/photo/20-mg-label-blister-pack-208512/
Pixabay. (2016b). Coins on brown wood [Graphic]. https://www.pexels.com/photo/coins-on-brown-wood-210600/
Plaat, A. (2022). Deep reinforcement learning. Springer Nature.
https://doi.org/10.1007/978-981-19-0638-1
Pomerleau, D. A. (1991). Efficient training of artificial neural
networks for autonomous navigation. Neural Computation,
3(1), 88–97. https://doi.org/10.1162/neco.1991.3.1.88
Powell, W. B. (2022). Reinforcement learning and stochastic
optimization: A unified framework for sequential decisions. Wiley.
Puterman, M. L. (1994). Markov decision processes: Discrete
stochastic dynamic programming. Wiley.
Ramachandran, D., & Amir, E. (2007). Bayesian inverse reinforcement
learning. Proceedings of the 20th International Joint Conference on
Artificial Intelligence, 2586–2591.
Rao, A., & Jelvis, T. (2022). Foundations of reinforcement
learning with applications in finance. Chapman and
Hall/CRC. https://doi.org/10.1201/9781003229193
Ratliff, N. D., Bagnell, J. A., & Zinkevich, M. A. (2006). Maximum
margin planning. Proceedings of the 23rd International Conference on
Machine Learning, 729–736. https://doi.org/10.1145/1143844.1143936
Robbins, H. (1952). Some aspects of the sequential design of
experiments. Bulletin of the American Mathematical Society,
58(5), 527–535. https://projecteuclid.org/journals/bulletin-of-the-american-mathematical-society/volume-58/issue-5/Some-aspects-of-the-sequential-design-of-experiments/bams/1183517370.full
Ross, S., Gordon, G. J., & Bagnell, J. (2010, November 2). A
reduction of imitation learning and structured prediction to no-regret
online learning. International Conference on
Artificial Intelligence and Statistics. https://www.semanticscholar.org/paper/A-Reduction-of-Imitation-Learning-and-Structured-to-Ross-Gordon/79ab3c49903ec8cb339437ccf5cf998607fc313e
Russell, S. (1998). Learning agents for uncertain environments (extended
abstract). Proceedings of the Eleventh Annual Conference on
Computational Learning Theory, 101–103. https://doi.org/10.1145/279943.279964
Russell, S. J., & Norvig, P. (2021). Artificial intelligence: A
modern approach (Fourth edition). Pearson.
Schmidhuber, J. (1991). Curious model-building control systems.
1991 IEEE International Joint Conference on
Neural Networks, 1458–1463, vol. 2.
https://doi.org/10.1109/IJCNN.1991.170605
Schmidhuber, J. (2010). Formal theory of creativity, fun, and intrinsic
motivation (1990–2010). IEEE Transactions on Autonomous Mental
Development, 2(3), 230–247. https://doi.org/10.1109/TAMD.2010.2056368
Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L.,
Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T.,
Lillicrap, T., & Silver, D. (2020). Mastering Atari, Go, chess and
shogi by planning with a learned model. Nature,
588(7839), 604–609. https://doi.org/10.1038/s41586-020-03051-4
Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P.
(2015). Trust region policy optimization. Proceedings of the 32nd
International Conference on Machine Learning, 1889–1897. https://proceedings.mlr.press/v37/schulman15.html
Schulman, J., Moritz, P., Levine, S., Jordan, M. I., & Abbeel, P.
(2016). High-dimensional continuous control using generalized advantage
estimation. In Y. Bengio & Y. LeCun (Eds.), 4th International
Conference on Learning Representations. http://arxiv.org/abs/1506.02438
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O.
(2017, August 28). Proximal policy optimization algorithms. https://doi.org/10.48550/arXiv.1707.06347
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate
of prediction and reward. Science, 275(5306),
1593–1599. https://doi.org/10.1126/science.275.5306.1593
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang,
M., Li, Y. K., Wu, Y., & Guo, D. (2024, April 27).
DeepSeekMath: Pushing the limits of
mathematical reasoning in open language models. https://doi.org/10.48550/arXiv.2402.03300
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den
Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V.,
Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N.,
Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T.,
& Hassabis, D. (2016). Mastering the game of Go with deep neural
networks and tree search. Nature, 529(7587),
484–489. https://doi.org/10.1038/nature16961
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M.,
Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap,
T., Simonyan, K., & Hassabis, D. (2018). A general reinforcement
learning algorithm that masters chess, shogi, and Go through self-play.
Science, 362(6419), 1140–1144. https://doi.org/10.1126/science.aar6404
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., &
Riedmiller, M. (2014). Deterministic policy gradient algorithms.
Proceedings of the 31st International Conference on
Machine Learning, 387–395. https://proceedings.mlr.press/v32/silver14.html
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A.,
Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y.,
Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T.,
& Hassabis, D. (2017). Mastering the game of Go without human
knowledge. Nature, 550(7676), 354–359. https://doi.org/10.1038/nature24270
Silver, D., Singh, S., Precup, D., & Sutton, R. S. (2021). Reward is
enough. Artificial Intelligence, 299, 103535. https://doi.org/10.1016/j.artint.2021.103535
Stigler, S. M. (2003). The history of statistics: The measurement of
uncertainty before 1900 (9th printing). Belknap Press of Harvard
University Press.
Sussman, G. J., Wisdom, J., & Farr, W. (2013). Functional
differential geometry. The MIT Press.
Sutton, R. S. (1984). Temporal credit assignment in reinforcement
learning [PhD thesis]. University of Massachusetts Amherst.
Sutton, R. S. (1988). Learning to predict by the methods of temporal
differences. Machine Learning, 3(1), 9–44. https://doi.org/10.1007/BF00115009
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An
introduction (Second edition). The MIT Press. http://incompleteideas.net/book/RLbook2020trimmed.pdf
Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (1999).
Policy gradient methods for reinforcement learning with function
approximation. Proceedings of the 13th International
Conference on Neural Information Processing
Systems, 1057–1063.
Szepesvári, C. (2010). Algorithms for reinforcement learning.
Springer International Publishing. https://doi.org/10.1007/978-3-031-01551-9
Tang, H., Houthooft, R., Foote, D., Stooke, A., Xi Chen, O., Duan, Y.,
Schulman, J., DeTurck, F., & Abbeel, P. (2017).
#Exploration: A study of count-based exploration for deep
reinforcement learning. Advances in Neural Information
Processing Systems, 30. https://proceedings.neurips.cc/paper_files/paper/2017/hash/3a20f62a0af1aa152670bab3c602feed-Abstract.html
Thompson, W. R. (1933). On the likelihood that one unknown probability
exceeds another in view of the evidence of two samples.
Biometrika, 25(3/4), 285–294. https://doi.org/10.2307/2332286
Thompson, W. R. (1935). On the theory of apportionment. American
Journal of Mathematics, 57(2), 450–456. https://doi.org/10.2307/2371219
Thorndike, E. L. (1911). Animal intelligence:
Experimental studies. Macmillan Press.
https://doi.org/10.5962/bhl.title.55072
Thrun, S. B. (1992). Efficient exploration in reinforcement
learning [Technical Report]. Carnegie Mellon University.
Turing, A. (1948). Intelligent machinery. National Physical
Laboratory. https://weightagnostic.github.io/papers/turing1948.pdf
van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., &
Modayil, J. (2018, December 6). Deep reinforcement learning and the
deadly triad. https://doi.org/10.48550/arXiv.1812.02648
Vapnik, V. N. (2000). The nature of statistical learning
theory. Springer. https://doi.org/10.1007/978-1-4757-3264-1
Vershynin, R. (2018). High-dimensional probability: An
introduction with applications in data science. Cambridge
University Press. https://books.google.com?id=NDdqDwAAQBAJ
Wald, A. (1949). Statistical decision functions. The Annals of
Mathematical Statistics, 20(2), 165–205. https://doi.org/10.1214/aoms/1177730030
Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K.,
& Freitas, N. de. (2017, February 6). Sample efficient actor-critic
with experience replay. 5th International Conference on Learning
Representations. https://openreview.net/forum?id=HyM25Mqel
Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., &
Freitas, N. (2016). Dueling network architectures for deep reinforcement
learning. Proceedings of The 33rd International
Conference on Machine Learning, 1995–2003. https://proceedings.mlr.press/v48/wangf16.html
Williams, R. J. (1992). Simple statistical gradient-following algorithms
for connectionist reinforcement learning. Machine Learning,
8(3), 229–256. https://doi.org/10.1007/BF00992696
Witten, I. H. (1977). An adaptive optimal controller for discrete-time
Markov environments. Information and Control, 34(4),
286–295. https://doi.org/10.1016/S0019-9958(77)90354-0
Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., & Ba, J. (2017).
Scalable trust-region method for deep reinforcement learning using
Kronecker-factored approximation. Advances in Neural Information
Processing Systems, 30. https://proceedings.neurips.cc/paper/2017/hash/361440528766bbaaaa1901845cf4152b-Abstract.html
Ye, W., Liu, S., Kurutach, T., Abbeel, P., & Gao, Y. (2021).
Mastering Atari games with limited data. Advances in Neural
Information Processing Systems, 34, 25476–25488. https://doi.org/10.48550/arXiv.2111.00210
Zare, M., Kebria, P. M., Khosravi, A., & Nahavandi, S. (2024). A
survey of imitation learning: Algorithms, recent developments, and
challenges. IEEE Transactions on Cybernetics, 54(12),
7173–7186. https://doi.org/10.1109/TCYB.2024.3395626
Ziebart, B. D., Bagnell, J. A., & Dey, A. K. (2010). Modeling
interaction via the principle of maximum causal entropy. Proceedings
of the 27th International Conference on Machine
Learning, 1255–1262.
Ziebart, B. D., Maas, A., Bagnell, J. A., & Dey, A. K. (2008).
Maximum entropy inverse reinforcement learning. Proceedings of the
23rd National Conference on Artificial Intelligence -
Volume 3, 1433–1438.
Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei,
D., Christiano, P., & Irving, G. (2020, January 8).
Fine-tuning language models from human
preferences. https://doi.org/10.48550/arXiv.1909.08593