References
Abbeel, P., & Ng, A. Y. (2004, July 4). Apprenticeship learning via
inverse reinforcement learning. Proceedings of the
Twenty-First International Conference on Machine
Learning. https://doi.org/10.1145/1015330.1015430
Agarwal, A., Jiang, N., Kakade, S. M., & Sun, W. (2022).
Reinforcement learning: Theory and algorithms. https://rltheorybook.github.io/rltheorybook_AJKS.pdf
Agarwal, R., Schwarzer, M., Castro, P. S., Courville, A. C., &
Bellemare, M. (2021). Deep reinforcement learning at the edge of the
statistical precipice. Advances in Neural Information
Processing Systems, 34, 29304–29320. https://proceedings.neurips.cc/paper_files/paper/2021/hash/f514cec81cb148559cf475e7426eed5e-Abstract.html
Albrecht, S. V., Christianos, F., & Schäfer, L. (2023).
Multi-agent reinforcement learning: Foundations and modern
approaches. MIT Press. https://www.marl-book.com
Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C.
(2024). Quarto (Version 1.4) [Computer software]. https://doi.org/10.5281/zenodo.5960048
Amari, S.-I. (1998). Natural gradient works efficiently in learning.
Neural Computation, 10(2), 251–276. https://doi.org/10.1162/089976698300017746
Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J.,
& Mané, D. (2016). Concrete problems in AI
safety. http://arxiv.org/abs/1606.06565
Ansel, J., Yang, E., He, H., Gimelshein, N., Jain, A., Voznesensky, M.,
Bao, B., Bell, P., Berard, D., Burovski, E., Chauhan, G., Chourdia, A.,
Constable, W., Desmaison, A., DeVito, Z., Ellison, E., Feng, W., Gong,
J., Gschwind, M., … Chintala, S. (2024, April). PyTorch 2:
Faster machine learning through dynamic Python bytecode transformation
and graph compilation. 29th ACM International
Conference on Architectural Support for Programming Languages and
Operating Systems, Volume 2 (ASPLOS ’24). https://doi.org/10.1145/3620665.3640366
Antonoglou, I., Schrittwieser, J., Ozair, S., Hubert, T. K., &
Silver, D. (2021, October 6). Planning in stochastic environments with a
learned model. The Tenth International Conference on Learning
Representations. https://openreview.net/forum?id=X6D9bAHhBQ1
Athans, M., & Falb, P. L. (1966). Optimal control:
An introduction to the theory and its applications.
McGraw-Hill. https://books.google.com?id=pfJHAQAAIAAJ
Aubret, A., Matignon, L., & Hassas, S. (2019, November 19). A
survey on intrinsic motivation in reinforcement learning. https://doi.org/10.48550/arXiv.1908.06976
Auer, P. (2002). Using confidence bounds for exploitation-exploration
trade-offs. Journal of Machine Learning Research, 3,
397–422. https://www.jmlr.org/papers/v3/auer02a.html
Azar, M. G., Osband, I., & Munos, R. (2017). Minimax regret bounds
for reinforcement learning. Proceedings of the 34th
International Conference on Machine
Learning, 263–272. https://proceedings.mlr.press/v70/azar17a.html
Babuschkin, I., Baumli, K., Bell, A., Bhupatiraju, S., Bruce, J.,
Buchlovsky, P., Budden, D., Cai, T., Clark, A., Danihelka, I., Dedieu,
A., Fantacci, C., Godwin, J., Jones, C., Hemsley, R., Hennigan, T.,
Hessel, M., Hou, S., Kapturowski, S., … Viola, F. (2020). The
DeepMind JAX ecosystem [Computer software]. http://github.com/deepmind
Badia, A. P., Piot, B., Kapturowski, S., Sprechmann, P., Vitvitskyi, A.,
Guo, D., & Blundell, C. (2020). Agent57: Outperforming
the Atari human benchmark. Proceedings of the 37th International
Conference on Machine Learning, 507–517. https://doi.org/10.48550/arXiv.2003.13350
Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike
adaptive elements that can solve difficult learning control problems.
IEEE Transactions on Systems, Man, and Cybernetics,
SMC-13(5), 834–846. https://doi.org/10.1109/TSMC.1983.6313077
Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M.
(2018, February 5). Automatic differentiation in machine learning: A
survey. https://doi.org/10.48550/arXiv.1502.05767
Beck, A., & Teboulle, M. (2003). Mirror descent and nonlinear
projected subgradient methods for convex optimization. Operations
Research Letters, 31(3), 167–175. https://doi.org/10.1016/S0167-6377(02)00231-6
Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2013). The
Arcade Learning Environment: An evaluation platform for
general agents. Journal of Artificial Intelligence Research,
47, 253–279. https://doi.org/10.1613/jair.3912
Bellemare, M. G., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D.,
& Munos, R. (2016). Unifying count-based exploration and intrinsic
motivation. Proceedings of the 30th International
Conference on Neural Information Processing
Systems, 1479–1487.
Bellman, R. (1957). Dynamic programming. Princeton University
Press. https://books.google.com?id=rZW4ugAACAAJ
Bellman, R. (1961). Adaptive control processes: A guided tour.
Princeton University Press. https://books.google.com?id=POAmAAAAMAAJ
Berkovitz, L. D. (1974). Optimal control theory. Springer
Science+Business Media LLC.
Berry, D. A., & Fristedt, B. (1985). Bandit problems.
Springer Netherlands. https://doi.org/10.1007/978-94-015-3711-7
Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic
programming. Athena Scientific.
Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., & Lee, M. (2009).
Natural actor–critic algorithms. Automatica, 45(11),
2471–2482. https://doi.org/10.1016/j.automatica.2009.07.008
Bishop, C. M. (2006). Pattern recognition and machine learning.
Springer.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization.
Cambridge University Press. https://web.stanford.edu/~boyd/cvxbook/
Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C.,
Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne,
S., & Zhang, Q. (2018). JAX:
Composable transformations of Python+NumPy
programs (Version 0.3.13) [Computer software]. http://github.com/google/jax
Burda, Y., Edwards, H., Storkey, A., & Klimov, O. (2018, September
27). Exploration by random network distillation. The Seventh
International Conference on Learning Representations. https://openreview.net/forum?id=H1lJJnR5Ym
Clark, J., & Amodei, D. (2024, February 14). Faulty reward
functions in the wild. OpenAI. https://openai.com/index/faulty-reward-functions/
Danihelka, I., Guez, A., Schrittwieser, J., & Silver, D. (2021,
October 6). Policy improvement by planning with Gumbel. The Tenth
International Conference on Learning Representations. https://openreview.net/forum?id=bERaNdoegnO
Danilyuk, P. (2021). A robot imitating a girl’s movement
[Graphic]. https://www.pexels.com/photo/a-robot-imitating-a-girl-s-movement-8294811/
Deng, L. (2012). The MNIST database of handwritten digit
images for machine learning research. IEEE Signal Processing
Magazine, 29(6), 141–142. https://doi.org/10.1109/MSP.2012.2211477
Drago, S., Mussi, M., & Metelli, A. M. (2025, February 24). A
refined analysis of UCBVI. https://doi.org/10.48550/arXiv.2502.17370
Fisher, R. A. (1925). Statistical methods for research workers (11th
ed., rev.). Oliver & Boyd.
Frans Berkelaar. (2009). Container ship MSC Davos - Westerschelde -
Zeeland [Graphic]. https://www.flickr.com/photos/28169156@N03/52957948820/
Gao, L., Schulman, J., & Hilton, J. (2023). Scaling laws for reward
model overoptimization. Proceedings of the 40th International
Conference on Machine Learning, 202,
10835–10866.
Gittins, J. C. (2011). Multi-armed bandit allocation indices
(2nd ed.). Wiley.
Gleave, A., Taufeeque, M., Rocamonde, J., Jenner, E., Wang, S. H.,
Toyer, S., Ernestus, M., Belrose, N., Emmons, S., & Russell, S.
(2022, November 22). imitation: Clean
imitation learning implementations. https://doi.org/10.48550/arXiv.2211.11972
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D.,
Ozair, S., Courville, A., & Bengio, Y. (2020). Generative
adversarial networks. Communications of the ACM, 63(11), 139–144. https://doi.org/10.1145/3422622
GPA Photo Archive. (2017). Robotic arm [Graphic]. https://www.flickr.com/photos/iip-photo-archive/36123310136/
Guy, R. (2006). Chess [Graphic]. https://www.flickr.com/photos/romainguy/230416692/
Haarnoja, T., Tang, H., Abbeel, P., & Levine, S. (2017).
Reinforcement learning with deep energy-based policies. Proceedings
of the 34th International Conference on Machine
Learning - Volume 70, 1352–1361.
Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft
actor-critic: Off-policy maximum entropy
deep reinforcement learning with a stochastic actor. Proceedings of
the 35th International Conference on Machine
Learning, 1861–1870. https://proceedings.mlr.press/v80/haarnoja18b.html
Hasselt, H. (2010). Double Q-learning. Advances in Neural
Information Processing Systems, 23. https://proceedings.neurips.cc/paper_files/paper/2010/hash/091d584fced301b442654dd8c23b3fc9-Abstract.html
Hasselt, H. van, Guez, A., & Silver, D. (2016). Deep reinforcement
learning with double Q-learning. Proceedings of the Thirtieth
AAAI Conference on Artificial Intelligence, 2094–2100.
Hastie, T., Tibshirani, R., & Friedman, J. (2013). The elements
of statistical learning: Data mining, inference, and
prediction. Springer Science & Business Media. https://books.google.com?id=yPfZBwAAQBAJ
Hausknecht, M., & Stone, P. (2017, January 11). Deep recurrent
Q-learning for partially observable MDPs. https://doi.org/10.48550/arXiv.1507.06527
Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G.,
Dabney, W., Horgan, D., Piot, B., Azar, M., & Silver, D. (2018).
Rainbow: Combining improvements in deep reinforcement learning.
Proceedings of the Thirty-Second AAAI Conference on
Artificial Intelligence and Thirtieth Innovative
Applications of Artificial Intelligence Conference
and Eighth AAAI Symposium on Educational
Advances in Artificial Intelligence, 3215–3222.
Ho, J., & Ermon, S. (2016). Generative adversarial imitation
learning. Proceedings of the 30th International
Conference on Neural Information Processing
Systems, 4572–4580.
Ivanov, S., & D’yakonov, A. (2019, July 6). Modern deep
reinforcement learning algorithms. https://doi.org/10.48550/arXiv.1906.10025
James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J.
(2023). An introduction to statistical learning: With applications
in Python. Springer International Publishing. https://doi.org/10.1007/978-3-031-38747-0
Kakade, S. M. (2001). A natural policy gradient. Advances in
Neural Information Processing Systems, 14. https://proceedings.neurips.cc/paper_files/paper/2001/hash/4b86abe48d358ecf194c56c69108433e-Abstract.html
Kakade, S., & Langford, J. (2002). Approximately optimal approximate
reinforcement learning. Proceedings of the Nineteenth
International Conference on Machine Learning,
267–274.
Keviczky, L., Bars, R., Hetthéssy, J., & Bányász, C. (2019).
Control engineering. Springer. https://doi.org/10.1007/978-981-10-8297-9
Kochenderfer, M. J., Wheeler, T. A., & Wray, K. H. (2022).
Algorithms for decision making. MIT Press. https://mitpress.mit.edu/9780262047012/algorithms-for-decision-making/
Ladosz, P., Weng, L., Kim, M., & Oh, H. (2022). Exploration in deep
reinforcement learning: A survey. Information Fusion, 85,
1–22. https://doi.org/10.1016/j.inffus.2022.03.003
Lai, T. L., & Robbins, H. (1985). Asymptotically efficient adaptive
allocation rules. Advances in Applied Mathematics,
6(1), 4–22. https://doi.org/10.1016/0196-8858(85)90002-8
Lambert, N. (2024). Reinforcement learning from human feedback.
Online. https://rlhfbook.com
Lewis, F. L., Vrabie, D. L., & Syrmos, V. L. (2012). Optimal
control (3rd ed.). John Wiley & Sons. https://doi.org/10.1002/9781118122631
Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010). A
contextual-bandit approach to personalized news article recommendation.
Proceedings of the 19th International Conference on
World Wide Web, 661–670. https://doi.org/10.1145/1772690.1772758
Li, S. E. (2023). Reinforcement learning for sequential decision and
optimal control. Springer Nature. https://doi.org/10.1007/978-981-19-7784-8
Li, Y. (2018). Deep reinforcement learning. https://doi.org/10.48550/arXiv.1810.06339
Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement
learning, planning and teaching. Machine Learning,
8(3), 293–321. https://doi.org/10.1007/BF00992699
Ljapunov, A. M., & Fuller, A. T. (1992). The general problem of
the stability of motion. Taylor & Francis.
Lyapunov, A. M. (1892). The general problem of the stability of
motion. University of Kharkov.
MacFarlane, A. (1979). The development of frequency-response methods in
automatic control [perspectives]. IEEE Transactions on Automatic
Control, 24(2), 250–265. https://doi.org/10.1109/TAC.1979.1101978
Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht,
M., & Bowling, M. (2018). Revisiting the Arcade Learning
Environment: Evaluation protocols and open problems for
general agents. Journal of Artificial Intelligence Research,
61, 523–562. https://doi.org/10.1613/jair.5699
Maei, H., Szepesvári, C., Bhatnagar, S., Precup, D., Silver, D., &
Sutton, R. S. (2009). Convergent temporal-difference learning with
arbitrary smooth function approximation. Advances in Neural
Information Processing Systems, 22. https://papers.nips.cc/paper_files/paper/2009/hash/3a15c7d0bbe60300a39f76f8a5ba6896-Abstract.html
Mahadevan, S., Giguere, S., & Jacek, N. (2013). Basis adaptation for
sparse nonlinear reinforcement learning. Proceedings of the AAAI
Conference on Artificial Intelligence, 27(1), 654–660.
https://doi.org/10.1609/aaai.v27i1.8665
Mahadevan, S., & Liu, B. (2012). Sparse Q-learning with mirror
descent. Proceedings of the Twenty-Eighth Conference on
Uncertainty in Artificial Intelligence,
564–573.
Mannor, S., Mansour, Y., & Tamar, A. (2024). Reinforcement
learning: Foundations. https://sites.google.com/view/rlfoundations/home
Marbach, P., & Tsitsiklis, J. N. (2001). Simulation-based
optimization of Markov reward processes. IEEE Transactions on
Automatic Control, 46(2), 191–209. https://doi.org/10.1109/9.905687
Martens, J., & Grosse, R. (2015). Optimizing neural networks with
Kronecker-factored approximate curvature. Proceedings of the 32nd
International Conference on Machine
Learning, 2408–2417. https://proceedings.mlr.press/v37/martens15.html
Maxwell, J. C. (1867). On governors. Proceedings of the Royal
Society of London, 16, 270–283. https://www.jstor.org/stable/112510
Mayr, E. (1970). Populations, species and evolution: An
abridgment of animal species and evolution. Belknap Press of
Harvard University Press.
Meurer, A., Smith, C. P., Paprocki, M., Čertík, O., Kirpichev, S. B.,
Rocklin, M., Kumar, A., Ivanov, S., Moore, J. K., Singh, S., Rathnayake,
T., Vig, S., Granger, B. E., Muller, R. P., Bonazzi, F., Gupta, H.,
Vats, S., Johansson, F., Pedregosa, F., … Scopatz, A. (2017).
SymPy: Symbolic computing in Python. PeerJ Computer
Science, 3, e103. https://doi.org/10.7717/peerj-cs.103
Meyer, J.-A., & Wilson, S. W. (1991). A possibility for implementing
curiosity and boredom in model-building neural controllers. In From
Animals to Animats: Proceedings of the First International
Conference on Simulation of Adaptive Behavior (pp.
222–227). MIT Press. https://ieeexplore.ieee.org/document/6294131
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley,
T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous methods for
deep reinforcement learning. Proceedings of The 33rd
International Conference on Machine
Learning, 1928–1937. https://proceedings.mlr.press/v48/mniha16.html
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I.,
Wierstra, D., & Riedmiller, M. A. (2013). Playing Atari with deep
reinforcement learning. CoRR, abs/1312.5602. http://arxiv.org/abs/1312.5602
Munos, R., Stepleton, T., Harutyunyan, A., & Bellemare, M. G.
(2016). Safe and efficient off-policy reinforcement learning.
Proceedings of the 30th International Conference on
Neural Information Processing Systems, 1054–1062.
Murphy, K. (2025, March 24). Reinforcement learning: A comprehensive
overview. https://doi.org/10.48550/arXiv.2412.05265
Negative Space. (2015). Photo of commercial district during
dawn [Graphic]. https://www.pexels.com/photo/photo-of-commercial-district-during-dawn-34639/
Nemirovskij, A. S., Judin, D. B., & Dawson, E. R. (1983). Problem
complexity and method efficiency in optimization. Wiley.
Ng, A. Y., & Russell, S. J. (2000). Algorithms for inverse
reinforcement learning. Proceedings of the Seventeenth
International Conference on Machine Learning,
663–670. https://ai.stanford.edu/~ang/papers/icml00-irl.pdf
Nielsen, M. A. (2015). Neural networks and deep learning.
Determination Press. http://neuralnetworksanddeeplearning.com/
Nocedal, J., & Wright, S. J. (2006). Numerical optimization
(2nd ed.). Springer.
OpenAI. (2022, November 30). Introducing ChatGPT.
OpenAI News. https://openai.com/index/chatgpt/
Orsini, M., Raichuk, A., Hussenot, L., Vincent, D., Dadashi, R., Girgin,
S., Geist, M., Bachem, O., Pietquin, O., & Andrychowicz, M. (2021).
What matters for adversarial imitation learning? Proceedings of the
35th International Conference on Neural Information
Processing Systems, 14656–14668.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin,
P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton,
J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P.,
Christiano, P., Leike, J., & Lowe, R. (2022). Training language
models to follow instructions with human feedback. Proceedings of
the 36th International Conference on Neural
Information Processing Systems, 27730–27744.
Peters, J., & Schaal, S. (2008). Natural actor-critic.
Neurocomputing, 71(7), 1180–1190. https://doi.org/10.1016/j.neucom.2007.11.026
Peters, J., Vijayakumar, S., & Schaal, S. (2005). Natural
actor-critic. In J. Gama, R. Camacho, P. B. Brazdil, A. M. Jorge, &
L. Torgo (Eds.), Machine Learning: ECML
2005 (pp. 280–291). Springer. https://doi.org/10.1007/11564096_29
Piot, B., Geist, M., & Pietquin, O. (2017). Bridging the gap between
imitation learning and inverse reinforcement learning. IEEE
Transactions on Neural Networks and Learning Systems,
28(8), 1814–1826. https://doi.org/10.1109/TNNLS.2016.2543000
Pixabay. (2016a). 20 mg label blister pack [Graphic]. https://www.pexels.com/photo/20-mg-label-blister-pack-208512/
Pixabay. (2016b). Coins on brown wood [Graphic]. https://www.pexels.com/photo/coins-on-brown-wood-210600/
Plaat, A. (2022). Deep reinforcement learning. Springer Nature.
https://doi.org/10.1007/978-981-19-0638-1
Pomerleau, D. A. (1991). Efficient training of artificial neural
networks for autonomous navigation. Neural Computation,
3(1), 88–97. https://doi.org/10.1162/neco.1991.3.1.88
Powell, W. B. (2022). Reinforcement learning and stochastic
optimization: A unified framework for sequential decisions. Wiley.
Puterman, M. L. (1994). Markov decision processes: Discrete
stochastic dynamic programming. Wiley.
Ramachandran, D., & Amir, E. (2007). Bayesian inverse reinforcement
learning. Proceedings of the 20th International Joint Conference on
Artificial Intelligence, 2586–2591.
Rao, A., & Jelvis, T. (2022). Foundations of reinforcement
learning with applications in finance. Chapman and
Hall/CRC. https://doi.org/10.1201/9781003229193
Ratliff, N. D., Bagnell, J. A., & Zinkevich, M. A. (2006). Maximum
margin planning. Proceedings of the 23rd International Conference on
Machine Learning, 729–736. https://doi.org/10.1145/1143844.1143936
Robbins, H. (1952). Some aspects of the sequential design of
experiments. Bulletin of the American Mathematical Society,
58(5), 527–535. https://projecteuclid.org/journals/bulletin-of-the-american-mathematical-society/volume-58/issue-5/Some-aspects-of-the-sequential-design-of-experiments/bams/1183517370.full
Ross, S., Gordon, G. J., & Bagnell, J. (2010, November 2). A
reduction of imitation learning and structured prediction to no-regret
online learning. International Conference on
Artificial Intelligence and Statistics. https://www.semanticscholar.org/paper/A-Reduction-of-Imitation-Learning-and-Structured-to-Ross-Gordon/79ab3c49903ec8cb339437ccf5cf998607fc313e
Russell, S. (1998). Learning agents for uncertain environments (extended
abstract). Proceedings of the Eleventh Annual Conference on
Computational Learning Theory, 101–103. https://doi.org/10.1145/279943.279964
Russell, S. J., & Norvig, P. (2021). Artificial intelligence: A
modern approach (Fourth edition). Pearson.
Schmidhuber, J. (1991). Curious model-building control systems.
1991 IEEE International Joint Conference on
Neural Networks, 1458–1463, vol. 2.
https://doi.org/10.1109/IJCNN.1991.170605
Schmidhuber, J. (2010). Formal theory of creativity, fun, and intrinsic
motivation (1990–2010). IEEE Transactions on Autonomous Mental
Development, 2(3), 230–247. https://doi.org/10.1109/TAMD.2010.2056368
Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L.,
Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T.,
Lillicrap, T., & Silver, D. (2020). Mastering Atari, Go, chess and
shogi by planning with a learned model. Nature,
588(7839), 604–609. https://doi.org/10.1038/s41586-020-03051-4
Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P.
(2015). Trust region policy optimization. Proceedings of the 32nd
International Conference on Machine Learning, 1889–1897. https://proceedings.mlr.press/v37/schulman15.html
Schulman, J., Moritz, P., Levine, S., Jordan, M. I., & Abbeel, P.
(2016). High-dimensional continuous control using generalized advantage
estimation. In Y. Bengio & Y. LeCun (Eds.), 4th International
Conference on Learning Representations. http://arxiv.org/abs/1506.02438
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O.
(2017, August 28). Proximal policy optimization algorithms. https://doi.org/10.48550/arXiv.1707.06347
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate
of prediction and reward. Science, 275(5306),
1593–1599. https://doi.org/10.1126/science.275.5306.1593
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang,
M., Li, Y. K., Wu, Y., & Guo, D. (2024, April 27).
DeepSeekMath: Pushing the limits of
mathematical reasoning in open language models. https://doi.org/10.48550/arXiv.2402.03300
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den
Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V.,
Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N.,
Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T.,
& Hassabis, D. (2016). Mastering the game of Go with deep neural
networks and tree search. Nature, 529(7587),
484–489. https://doi.org/10.1038/nature16961
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M.,
Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap,
T., Simonyan, K., & Hassabis, D. (2018). A general reinforcement
learning algorithm that masters chess, shogi, and Go through self-play.
Science, 362(6419), 1140–1144. https://doi.org/10.1126/science.aar6404
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., &
Riedmiller, M. (2014). Deterministic policy gradient algorithms.
Proceedings of the 31st International Conference on
Machine Learning, 387–395. https://proceedings.mlr.press/v32/silver14.html
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A.,
Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y.,
Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T.,
& Hassabis, D. (2017). Mastering the game of Go without human
knowledge. Nature, 550(7676), 354–359. https://doi.org/10.1038/nature24270
Silver, D., Singh, S., Precup, D., & Sutton, R. S. (2021). Reward is
enough. Artificial Intelligence, 299, 103535. https://doi.org/10.1016/j.artint.2021.103535
Stigler, S. M. (2003). The history of statistics: The measurement of
uncertainty before 1900 (9th printing). Belknap Press of Harvard
University Press.
Sussman, G. J., Wisdom, J., & Farr, W. (2013). Functional
differential geometry. The MIT Press.
Sutton, R. S. (1984). Temporal credit assignment in reinforcement
learning [PhD thesis]. University of Massachusetts Amherst.
Sutton, R. S. (1988). Learning to predict by the methods of temporal
differences. Machine Learning, 3(1), 9–44. https://doi.org/10.1007/BF00115009
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An
introduction (Second edition). The MIT Press. http://incompleteideas.net/book/RLbook2020trimmed.pdf
Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (1999).
Policy gradient methods for reinforcement learning with function
approximation. Proceedings of the 13th International
Conference on Neural Information Processing
Systems, 1057–1063.
Szepesvári, C. (2010). Algorithms for reinforcement learning.
Springer International Publishing. https://doi.org/10.1007/978-3-031-01551-9
Tang, H., Houthooft, R., Foote, D., Stooke, A., Xi Chen, O., Duan, Y.,
Schulman, J., DeTurck, F., & Abbeel, P. (2017).
#Exploration: A study of count-based exploration for deep
reinforcement learning. Advances in Neural Information
Processing Systems, 30. https://proceedings.neurips.cc/paper_files/paper/2017/hash/3a20f62a0af1aa152670bab3c602feed-Abstract.html
Thompson, W. R. (1933). On the likelihood that one unknown probability
exceeds another in view of the evidence of two samples.
Biometrika, 25(3/4), 285–294. https://doi.org/10.2307/2332286
Thompson, W. R. (1935). On the theory of apportionment. American
Journal of Mathematics, 57(2), 450–456. https://doi.org/10.2307/2371219
Thorndike, E. L. (1911). Animal intelligence:
Experimental studies. Macmillan Press.
https://doi.org/10.5962/bhl.title.55072
Thrun, S. B. (1992). Efficient exploration in reinforcement
learning [Technical Report]. Carnegie Mellon University.
Turing, A. (1948). Intelligent machinery. National Physical
Laboratory. https://weightagnostic.github.io/papers/turing1948.pdf
van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., &
Modayil, J. (2018, December 6). Deep reinforcement learning and the
deadly triad. https://doi.org/10.48550/arXiv.1812.02648
Vapnik, V. N. (2000). The nature of statistical learning
theory. Springer. https://doi.org/10.1007/978-1-4757-3264-1
Vershynin, R. (2018). High-dimensional probability: An
introduction with applications in data science. Cambridge
University Press. https://books.google.com?id=NDdqDwAAQBAJ
Wald, A. (1949). Statistical decision functions. The Annals of
Mathematical Statistics, 20(2), 165–205. https://doi.org/10.1214/aoms/1177730030
Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K.,
& Freitas, N. de. (2017, February 6). Sample efficient actor-critic
with experience replay. 5th International Conference on Learning
Representations. https://openreview.net/forum?id=HyM25Mqel
Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., &
Freitas, N. (2016). Dueling network architectures for deep reinforcement
learning. Proceedings of The 33rd International
Conference on Machine Learning, 1995–2003. https://proceedings.mlr.press/v48/wangf16.html
Williams, R. J. (1992). Simple statistical gradient-following algorithms
for connectionist reinforcement learning. Machine Learning,
8(3), 229–256. https://doi.org/10.1007/BF00992696
Witten, I. H. (1977). An adaptive optimal controller for discrete-time
Markov environments. Information and Control, 34(4),
286–295. https://doi.org/10.1016/S0019-9958(77)90354-0
Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., & Ba, J. (2017).
Scalable trust-region method for deep reinforcement learning using
Kronecker-factored approximation. Advances in Neural Information
Processing Systems, 30. https://proceedings.neurips.cc/paper/2017/hash/361440528766bbaaaa1901845cf4152b-Abstract.html
Ye, W., Liu, S., Kurutach, T., Abbeel, P., & Gao, Y. (2021).
Mastering Atari games with limited data. Advances in Neural
Information Processing Systems, 34, 25476–25488. https://doi.org/10.48550/arXiv.2111.00210
Zare, M., Kebria, P. M., Khosravi, A., & Nahavandi, S. (2024). A
survey of imitation learning: Algorithms, recent developments, and
challenges. IEEE Transactions on Cybernetics, 54(12),
7173–7186. https://doi.org/10.1109/TCYB.2024.3395626
Ziebart, B. D., Bagnell, J. A., & Dey, A. K. (2010). Modeling
interaction via the principle of maximum causal entropy. Proceedings
of the 27th International Conference on Machine
Learning, 1255–1262.
Ziebart, B. D., Maas, A., Bagnell, J. A., & Dey, A. K. (2008).
Maximum entropy inverse reinforcement learning. Proceedings of the
23rd National Conference on Artificial Intelligence -
Volume 3, 1433–1438.
Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei,
D., Christiano, P., & Irving, G. (2020, January 8).
Fine-tuning language models from human
preferences. https://doi.org/10.48550/arXiv.1909.08593