References

Abbeel, P., & Ng, A. Y. (2004, July 4). Apprenticeship learning via inverse reinforcement learning. Proceedings of the Twenty-First International Conference on Machine Learning. https://doi.org/10.1145/1015330.1015430
Agarwal, A., Jiang, N., Kakade, S. M., & Sun, W. (2022). Reinforcement learning: Theory and algorithms. https://rltheorybook.github.io/rltheorybook_AJKS.pdf
Agarwal, R., Schwarzer, M., Castro, P. S., Courville, A. C., & Bellemare, M. G. (2021). Deep reinforcement learning at the edge of the statistical precipice. Advances in Neural Information Processing Systems, 34, 29304–29320. https://proceedings.neurips.cc/paper_files/paper/2021/hash/f514cec81cb148559cf475e7426eed5e-Abstract.html
Albrecht, S. V., Christianos, F., & Schäfer, L. (2023). Multi-agent reinforcement learning: Foundations and modern approaches. MIT Press. https://www.marl-book.com
Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2024). Quarto (Version 1.4) [Computer software]. https://doi.org/10.5281/zenodo.5960048
Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276. https://doi.org/10.1162/089976698300017746
Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mané, D. (2016). Concrete problems in AI safety. http://arxiv.org/abs/1606.06565
Ansel, J., Yang, E., He, H., Gimelshein, N., Jain, A., Voznesensky, M., Bao, B., Bell, P., Berard, D., Burovski, E., Chauhan, G., Chourdia, A., Constable, W., Desmaison, A., DeVito, Z., Ellison, E., Feng, W., Gong, J., Gschwind, M., … Chintala, S. (2024, April). PyTorch 2: Faster machine learning through dynamic Python bytecode transformation and graph compilation. 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS ’24). https://doi.org/10.1145/3620665.3640366
Antonoglou, I., Schrittwieser, J., Ozair, S., Hubert, T. K., & Silver, D. (2021, October 6). Planning in stochastic environments with a learned model. The Tenth International Conference on Learning Representations. https://openreview.net/forum?id=X6D9bAHhBQ1
Athans, M., & Falb, P. L. (1966). Optimal control: An introduction to the theory and its applications. McGraw-Hill. https://books.google.com/books?id=pfJHAQAAIAAJ
Aubret, A., Matignon, L., & Hassas, S. (2019, November 19). A survey on intrinsic motivation in reinforcement learning. https://doi.org/10.48550/arXiv.1908.06976
Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3, 397–422. https://www.jmlr.org/papers/v3/auer02a.html
Azar, M. G., Osband, I., & Munos, R. (2017). Minimax regret bounds for reinforcement learning. Proceedings of the 34th International Conference on Machine Learning, 263–272. https://proceedings.mlr.press/v70/azar17a.html
Babuschkin, I., Baumli, K., Bell, A., Bhupatiraju, S., Bruce, J., Buchlovsky, P., Budden, D., Cai, T., Clark, A., Danihelka, I., Dedieu, A., Fantacci, C., Godwin, J., Jones, C., Hemsley, R., Hennigan, T., Hessel, M., Hou, S., Kapturowski, S., … Viola, F. (2020). The DeepMind JAX ecosystem [Computer software]. http://github.com/deepmind
Badia, A. P., Piot, B., Kapturowski, S., Sprechmann, P., Vitvitskyi, A., Guo, D., & Blundell, C. (2020). Agent57: Outperforming the Atari human benchmark. Proceedings of the 37th International Conference on Machine Learning, 507–517. https://doi.org/10.48550/arXiv.2003.13350
Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(5), 834–846. https://doi.org/10.1109/TSMC.1983.6313077
Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2018, February 5). Automatic differentiation in machine learning: A survey. https://doi.org/10.48550/arXiv.1502.05767
Beck, A., & Teboulle, M. (2003). Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3), 167–175. https://doi.org/10.1016/S0167-6377(02)00231-6
Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2013). The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, 253–279. https://doi.org/10.1613/jair.3912
Bellemare, M. G., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., & Munos, R. (2016). Unifying count-based exploration and intrinsic motivation. Proceedings of the 30th International Conference on Neural Information Processing Systems, 1479–1487.
Bellman, R. (1957). Dynamic programming. Princeton University Press. https://books.google.com/books?id=rZW4ugAACAAJ
Bellman, R. (1961). Adaptive control processes: A guided tour. Princeton University Press. https://books.google.com/books?id=POAmAAAAMAAJ
Berkovitz, L. D. (1974). Optimal control theory. Springer Science+Business Media LLC.
Berry, D. A., & Fristedt, B. (1985). Bandit problems. Springer Netherlands. https://doi.org/10.1007/978-94-015-3711-7
Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Athena Scientific.
Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., & Lee, M. (2009). Natural actor–critic algorithms. Automatica, 45(11), 2471–2482. https://doi.org/10.1016/j.automatica.2009.07.008
Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press. https://web.stanford.edu/~boyd/cvxbook/
Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., & Zhang, Q. (2018). JAX: Composable transformations of Python+NumPy programs (Version 0.3.13) [Computer software]. http://github.com/google/jax
Burda, Y., Edwards, H., Storkey, A., & Klimov, O. (2018, September 27). Exploration by random network distillation. The Seventh International Conference on Learning Representations. https://openreview.net/forum?id=H1lJJnR5Ym
Clark, J., & Amodei, D. (2024, February 14). Faulty reward functions in the wild. OpenAI. https://openai.com/index/faulty-reward-functions/
Danihelka, I., Guez, A., Schrittwieser, J., & Silver, D. (2021, October 6). Policy improvement by planning with Gumbel. The Tenth International Conference on Learning Representations. https://openreview.net/forum?id=bERaNdoegnO
Danilyuk, P. (2021). A robot imitating a girl’s movement [Graphic]. https://www.pexels.com/photo/a-robot-imitating-a-girl-s-movement-8294811/
Deng, L. (2012). The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6), 141–142. https://doi.org/10.1109/MSP.2012.2211477
Drago, S., Mussi, M., & Metelli, A. M. (2025, February 24). A refined analysis of UCBVI. https://doi.org/10.48550/arXiv.2502.17370
Fisher, R. A. (1925). Statistical methods for research workers (11th rev. ed.). Oliver and Boyd.
Frans Berkelaar. (2009). Container ship MSC Davos - Westerschelde - Zeeland [Graphic]. https://www.flickr.com/photos/28169156@N03/52957948820/
Gao, L., Schulman, J., & Hilton, J. (2023). Scaling laws for reward model overoptimization. Proceedings of the 40th International Conference on Machine Learning, 202, 10835–10866.
Gittins, J. C. (2011). Multi-armed bandit allocation indices (2nd ed.). Wiley.
Gleave, A., Taufeeque, M., Rocamonde, J., Jenner, E., Wang, S. H., Toyer, S., Ernestus, M., Belrose, N., Emmons, S., & Russell, S. (2022, November 22). imitation: Clean imitation learning implementations. https://doi.org/10.48550/arXiv.2211.11972
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2020). Generative adversarial networks. Communications of the ACM, 63(11), 139–144. https://doi.org/10.1145/3422622
GPA Photo Archive. (2017). Robotic arm [Graphic]. https://www.flickr.com/photos/iip-photo-archive/36123310136/
Guy, R. (2006). Chess [Graphic]. https://www.flickr.com/photos/romainguy/230416692/
Haarnoja, T., Tang, H., Abbeel, P., & Levine, S. (2017). Reinforcement learning with deep energy-based policies. Proceedings of the 34th International Conference on Machine Learning - Volume 70, 1352–1361.
Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. Proceedings of the 35th International Conference on Machine Learning, 1861–1870. https://proceedings.mlr.press/v80/haarnoja18b.html
Hasselt, H. (2010). Double Q-learning. Advances in Neural Information Processing Systems, 23. https://proceedings.neurips.cc/paper_files/paper/2010/hash/091d584fced301b442654dd8c23b3fc9-Abstract.html
Hasselt, H. van, Guez, A., & Silver, D. (2016). Deep reinforcement learning with double Q-learning. Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2094–2100.
Hastie, T., Tibshirani, R., & Friedman, J. (2013). The elements of statistical learning: Data mining, inference, and prediction. Springer Science & Business Media. https://books.google.com/books?id=yPfZBwAAQBAJ
Hausknecht, M., & Stone, P. (2017, January 11). Deep recurrent Q-learning for partially observable MDPs. https://doi.org/10.48550/arXiv.1507.06527
Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., & Silver, D. (2018). Rainbow: Combining improvements in deep reinforcement learning. Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, 3215–3222.
Ho, J., & Ermon, S. (2016). Generative adversarial imitation learning. Proceedings of the 30th International Conference on Neural Information Processing Systems, 4572–4580.
Ivanov, S., & D’yakonov, A. (2019, July 6). Modern deep reinforcement learning algorithms. https://doi.org/10.48550/arXiv.1906.10025
James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). An introduction to statistical learning: With applications in Python. Springer International Publishing. https://doi.org/10.1007/978-3-031-38747-0
Kakade, S. M. (2001). A natural policy gradient. Advances in Neural Information Processing Systems, 14. https://proceedings.neurips.cc/paper_files/paper/2001/hash/4b86abe48d358ecf194c56c69108433e-Abstract.html
Kakade, S., & Langford, J. (2002). Approximately optimal approximate reinforcement learning. Proceedings of the Nineteenth International Conference on Machine Learning, 267–274.
Keviczky, L., Bars, R., Hetthéssy, J., & Bányász, C. (2019). Control engineering. Springer. https://doi.org/10.1007/978-981-10-8297-9
Kochenderfer, M. J., Wheeler, T. A., & Wray, K. H. (2022). Algorithms for decision making. https://mitpress.mit.edu/9780262047012/algorithms-for-decision-making/
Ladosz, P., Weng, L., Kim, M., & Oh, H. (2022). Exploration in deep reinforcement learning: A survey. Information Fusion, 85, 1–22. https://doi.org/10.1016/j.inffus.2022.03.003
Lai, T. L., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1), 4–22. https://doi.org/10.1016/0196-8858(85)90002-8
Lambert, N. (2024). Reinforcement learning from human feedback. https://rlhfbook.com
Lewis, F. L., Vrabie, D. L., & Syrmos, V. L. (2012). Optimal control (3rd ed.). John Wiley & Sons. https://doi.org/10.1002/9781118122631
Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. Proceedings of the 19th International Conference on World Wide Web, 661–670. https://doi.org/10.1145/1772690.1772758
Li, S. E. (2023). Reinforcement learning for sequential decision and optimal control. Springer Nature. https://doi.org/10.1007/978-981-19-7784-8
Li, Y. (2018). Deep reinforcement learning. https://doi.org/10.48550/arXiv.1810.06339
Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3), 293–321. https://doi.org/10.1007/BF00992699
Lyapunov, A. M. (1892). The general problem of the stability of motion. University of Kharkov.
Lyapunov, A. M. (1992). The general problem of the stability of motion (A. T. Fuller, Trans.). Taylor & Francis.
MacFarlane, A. (1979). The development of frequency-response methods in automatic control [Perspectives]. IEEE Transactions on Automatic Control, 24(2), 250–265. https://doi.org/10.1109/TAC.1979.1101978
Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M., & Bowling, M. (2018). Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61, 523–562. https://doi.org/10.1613/jair.5699
Maei, H., Szepesvári, C., Bhatnagar, S., Precup, D., Silver, D., & Sutton, R. S. (2009). Convergent temporal-difference learning with arbitrary smooth function approximation. Advances in Neural Information Processing Systems, 22. https://papers.nips.cc/paper_files/paper/2009/hash/3a15c7d0bbe60300a39f76f8a5ba6896-Abstract.html
Mahadevan, S., Giguere, S., & Jacek, N. (2013). Basis adaptation for sparse nonlinear reinforcement learning. Proceedings of the AAAI Conference on Artificial Intelligence, 27(1), 654–660. https://doi.org/10.1609/aaai.v27i1.8665
Mahadevan, S., & Liu, B. (2012). Sparse Q-learning with mirror descent. Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, 564–573.
Mannor, S., Mansour, Y., & Tamar, A. (2024). Reinforcement learning: Foundations. https://sites.google.com/view/rlfoundations/home
Marbach, P., & Tsitsiklis, J. N. (2001). Simulation-based optimization of Markov reward processes. IEEE Transactions on Automatic Control, 46(2), 191–209. https://doi.org/10.1109/9.905687
Martens, J., & Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature. Proceedings of the 32nd International Conference on Machine Learning, 2408–2417. https://proceedings.mlr.press/v37/martens15.html
Maxwell, J. C. (1867). On governors. Proceedings of the Royal Society of London, 16, 270–283. https://www.jstor.org/stable/112510
Mayr, E. (1970). Populations, species and evolution: An abridgment of animal species and evolution. Belknap Press of Harvard University Press.
Meurer, A., Smith, C. P., Paprocki, M., Čertík, O., Kirpichev, S. B., Rocklin, M., Kumar, A., Ivanov, S., Moore, J. K., Singh, S., Rathnayake, T., Vig, S., Granger, B. E., Muller, R. P., Bonazzi, F., Gupta, H., Vats, S., Johansson, F., Pedregosa, F., … Scopatz, A. (2017). SymPy: Symbolic computing in Python. PeerJ Computer Science, 3, e103. https://doi.org/10.7717/peerj-cs.103
Meyer, J.-A., & Wilson, S. W. (1991). A possibility for implementing curiosity and boredom in model-building neural controllers. In From Animals to Animats: Proceedings of the First International Conference on Simulation of Adaptive Behavior (pp. 222–227). MIT Press. https://ieeexplore.ieee.org/document/6294131
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. Proceedings of The 33rd International Conference on Machine Learning, 1928–1937. https://proceedings.mlr.press/v48/mniha16.html
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. A. (2013). Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602. http://arxiv.org/abs/1312.5602
Munos, R., Stepleton, T., Harutyunyan, A., & Bellemare, M. G. (2016). Safe and efficient off-policy reinforcement learning. Proceedings of the 30th International Conference on Neural Information Processing Systems, 1054–1062.
Murphy, K. (2025, March 24). Reinforcement learning: A comprehensive overview. https://doi.org/10.48550/arXiv.2412.05265
Negative Space. (2015). Photo of commercial district during dawn [Graphic]. https://www.pexels.com/photo/photo-of-commercial-district-during-dawn-34639/
Nemirovskij, A. S., & Judin, D. B. (1983). Problem complexity and method efficiency in optimization (E. R. Dawson, Trans.). Wiley.
Ng, A. Y., & Russell, S. J. (2000). Algorithms for inverse reinforcement learning. Proceedings of the Seventeenth International Conference on Machine Learning, 663–670. https://ai.stanford.edu/~ang/papers/icml00-irl.pdf
Nielsen, M. A. (2015). Neural networks and deep learning. Determination Press. http://neuralnetworksanddeeplearning.com/
Nocedal, J., & Wright, S. J. (2006). Numerical optimization (2nd ed.). Springer.
OpenAI. (2022, November 30). Introducing ChatGPT. OpenAI News. https://openai.com/index/chatgpt/
Orsini, M., Raichuk, A., Hussenot, L., Vincent, D., Dadashi, R., Girgin, S., Geist, M., Bachem, O., Pietquin, O., & Andrychowicz, M. (2021). What matters for adversarial imitation learning? Proceedings of the 35th International Conference on Neural Information Processing Systems, 14656–14668.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., & Lowe, R. (2022). Training language models to follow instructions with human feedback. Proceedings of the 36th International Conference on Neural Information Processing Systems, 27730–27744.
Peters, J., & Schaal, S. (2008). Natural actor-critic. Neurocomputing, 71(7), 1180–1190. https://doi.org/10.1016/j.neucom.2007.11.026
Peters, J., Vijayakumar, S., & Schaal, S. (2005). Natural actor-critic. In J. Gama, R. Camacho, P. B. Brazdil, A. M. Jorge, & L. Torgo (Eds.), Machine Learning: ECML 2005 (pp. 280–291). Springer. https://doi.org/10.1007/11564096_29
Piot, B., Geist, M., & Pietquin, O. (2017). Bridging the gap between imitation learning and inverse reinforcement learning. IEEE Transactions on Neural Networks and Learning Systems, 28(8), 1814–1826. https://doi.org/10.1109/TNNLS.2016.2543000
Pixabay. (2016a). 20 mg label blister pack [Graphic]. https://www.pexels.com/photo/20-mg-label-blister-pack-208512/
Pixabay. (2016b). Coins on brown wood [Graphic]. https://www.pexels.com/photo/coins-on-brown-wood-210600/
Plaat, A. (2022). Deep reinforcement learning. Springer Nature. https://doi.org/10.1007/978-981-19-0638-1
Pomerleau, D. A. (1991). Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1), 88–97. https://doi.org/10.1162/neco.1991.3.1.88
Powell, W. B. (2022). Reinforcement learning and stochastic optimization: A unified framework for sequential decisions. Wiley.
Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic programming. Wiley.
Ramachandran, D., & Amir, E. (2007). Bayesian inverse reinforcement learning. Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2586–2591.
Rao, A., & Jelvis, T. (2022). Foundations of reinforcement learning with applications in finance. Chapman and Hall/CRC. https://doi.org/10.1201/9781003229193
Ratliff, N. D., Bagnell, J. A., & Zinkevich, M. A. (2006). Maximum margin planning. Proceedings of the 23rd International Conference on Machine Learning, 729–736. https://doi.org/10.1145/1143844.1143936
Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5), 527–535. https://projecteuclid.org/journals/bulletin-of-the-american-mathematical-society/volume-58/issue-5/Some-aspects-of-the-sequential-design-of-experiments/bams/1183517370.full
Ross, S., Gordon, G. J., & Bagnell, J. (2010, November 2). A reduction of imitation learning and structured prediction to no-regret online learning. International Conference on Artificial Intelligence and Statistics. https://www.semanticscholar.org/paper/A-Reduction-of-Imitation-Learning-and-Structured-to-Ross-Gordon/79ab3c49903ec8cb339437ccf5cf998607fc313e
Russell, S. (1998). Learning agents for uncertain environments (extended abstract). Proceedings of the Eleventh Annual Conference on Computational Learning Theory, 101–103. https://doi.org/10.1145/279943.279964
Russell, S. J., & Norvig, P. (2021). Artificial intelligence: A modern approach (4th ed.). Pearson.
Schmidhuber, J. (1991). Curious model-building control systems. [Proceedings] 1991 IEEE International Joint Conference on Neural Networks, 1458–1463 (Vol. 2). https://doi.org/10.1109/IJCNN.1991.170605
Schmidhuber, J. (2010). Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3), 230–247. https://doi.org/10.1109/TAMD.2010.2056368
Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T., & Silver, D. (2020). Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839), 604–609. https://doi.org/10.1038/s41586-020-03051-4
Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust region policy optimization. Proceedings of the 32nd International Conference on Machine Learning, 1889–1897. https://proceedings.mlr.press/v37/schulman15.html
Schulman, J., Moritz, P., Levine, S., Jordan, M. I., & Abbeel, P. (2016). High-dimensional continuous control using generalized advantage estimation. In Y. Bengio & Y. LeCun (Eds.), 4th international conference on learning representations. http://arxiv.org/abs/1506.02438
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017, August 28). Proximal policy optimization algorithms. https://doi.org/10.48550/arXiv.1707.06347
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275(5306), 1593–1599. https://doi.org/10.1126/science.275.5306.1593
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y. K., Wu, Y., & Guo, D. (2024, April 27). DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. https://doi.org/10.48550/arXiv.2402.03300
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., & Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489. https://doi.org/10.1038/nature16961
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., & Hassabis, D. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419), 1140–1144. https://doi.org/10.1126/science.aar6404
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014). Deterministic policy gradient algorithms. Proceedings of the 31st International Conference on Machine Learning, 387–395. https://proceedings.mlr.press/v32/silver14.html
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., & Hassabis, D. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676), 354–359. https://doi.org/10.1038/nature24270
Silver, D., Singh, S., Precup, D., & Sutton, R. S. (2021). Reward is enough. Artificial Intelligence, 299, 103535. https://doi.org/10.1016/j.artint.2021.103535
Stigler, S. M. (2003). The history of statistics: The measurement of uncertainty before 1900 (9th printing). Belknap Press of Harvard University Press.
Sussman, G. J., Wisdom, J., & Farr, W. (2013). Functional differential geometry. The MIT Press.
Sutton, R. S. (1984). Temporal credit assignment in reinforcement learning [PhD thesis]. University of Massachusetts Amherst.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9–44. https://doi.org/10.1007/BF00115009
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). The MIT Press. http://incompleteideas.net/book/RLbook2020trimmed.pdf
Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. Proceedings of the 13th International Conference on Neural Information Processing Systems, 1057–1063.
Szepesvári, C. (2010). Algorithms for reinforcement learning. Springer International Publishing. https://doi.org/10.1007/978-3-031-01551-9
Tang, H., Houthooft, R., Foote, D., Stooke, A., Xi Chen, O., Duan, Y., Schulman, J., DeTurck, F., & Abbeel, P. (2017). #Exploration: A study of count-based exploration for deep reinforcement learning. Advances in Neural Information Processing Systems, 30. https://proceedings.neurips.cc/paper_files/paper/2017/hash/3a20f62a0af1aa152670bab3c602feed-Abstract.html
Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4), 285–294. https://doi.org/10.2307/2332286
Thompson, W. R. (1935). On the theory of apportionment. American Journal of Mathematics, 57(2), 450–456. https://doi.org/10.2307/2371219
Thorndike, E. L. (1911). Animal intelligence: Experimental studies (pp. viii, 297). Macmillan Press. https://doi.org/10.5962/bhl.title.55072
Thrun, S. B. (1992). Efficient exploration in reinforcement learning [Technical Report]. Carnegie Mellon University.
Turing, A. (1948). Intelligent machinery. National Physical Laboratory. https://weightagnostic.github.io/papers/turing1948.pdf
van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., & Modayil, J. (2018, December 6). Deep reinforcement learning and the deadly triad. https://doi.org/10.48550/arXiv.1812.02648
Vapnik, V. N. (2000). The nature of statistical learning theory. Springer. https://doi.org/10.1007/978-1-4757-3264-1
Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science. Cambridge University Press. https://books.google.com/books?id=NDdqDwAAQBAJ
Wald, A. (1949). Statistical decision functions. The Annals of Mathematical Statistics, 20(2), 165–205. https://doi.org/10.1214/aoms/1177730030
Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., & Freitas, N. de. (2017, February 6). Sample efficient actor-critic with experience replay. 5th International Conference on Learning Representations. https://openreview.net/forum?id=HyM25Mqel
Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., & Freitas, N. (2016). Dueling network architectures for deep reinforcement learning. Proceedings of The 33rd International Conference on Machine Learning, 1995–2003. https://proceedings.mlr.press/v48/wangf16.html
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3), 229–256. https://doi.org/10.1007/BF00992696
Witten, I. H. (1977). An adaptive optimal controller for discrete-time Markov environments. Information and Control, 34(4), 286–295. https://doi.org/10.1016/S0019-9958(77)90354-0
Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., & Ba, J. (2017). Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. Advances in Neural Information Processing Systems, 30. https://proceedings.neurips.cc/paper/2017/hash/361440528766bbaaaa1901845cf4152b-Abstract.html
Ye, W., Liu, S., Kurutach, T., Abbeel, P., & Gao, Y. (2021). Mastering Atari games with limited data. Advances in Neural Information Processing Systems, 34, 25476–25488. https://doi.org/10.48550/arXiv.2111.00210
Zare, M., Kebria, P. M., Khosravi, A., & Nahavandi, S. (2024). A survey of imitation learning: Algorithms, recent developments, and challenges. IEEE Transactions on Cybernetics, 54(12), 7173–7186. https://doi.org/10.1109/TCYB.2024.3395626
Ziebart, B. D., Bagnell, J. A., & Dey, A. K. (2010). Modeling interaction via the principle of maximum causal entropy. Proceedings of the 27th International Conference on International Conference on Machine Learning, 1255–1262.
Ziebart, B. D., Maas, A., Bagnell, J. A., & Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 3, 1433–1438.
Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., & Irving, G. (2020, January 8). Fine-tuning language models from human preferences. https://doi.org/10.48550/arXiv.1909.08593