1 Introduction

Reinforcement learning (RL) is a branch of machine learning that studies sequential decision-making in unknown environments. An RL algorithm finds a strategy, called a policy, that maximizes the reward it obtains from the environment.

RL provides a powerful framework for attacking a wide variety of problems, including robotic control, video games and board games, resource management, language modeling, and more. It also provides an interdisciplinary paradigm for studying animal and human behaviour. Many of the most stunning results in machine learning, ranging from AlphaGo (Silver et al., 2016) to ChatGPT (OpenAI, 2022), are built using RL algorithms.

How does RL compare to the other two core machine learning paradigms, supervised learning and unsupervised learning?

The key difference is that RL algorithms don’t learn from some existing dataset; rather, they must go out and interact with the environment to collect their own data in an online way. This means RL algorithms face a distinct set of challenges from other kinds of machine learning. We’ll discuss these more concretely in Section 1.2.

Remark 1.1 (The reward hypothesis). Why do we only focus on maximizing a scalar reward signal? Surely a more descriptive or higher-dimensional signal would enable more efficient learning. Nonetheless, many (prominent) researchers hold that scalar reward is enough for developing behaviours that achieve a wide array of goals (Silver et al., 2021). This idea is also termed the reward hypothesis, and goes back at least as far as Turing, who suggested that one could train a “universal machine” using one input signal for “pain” and another for “pleasure” (Turing, 1948). Reinforcement learning takes this hypothesis seriously. It is undeniable that maximizing scalar rewards has led to success in an assortment of sequential decision-making problems.

1.1 Core tasks of reinforcement learning

What tasks, exactly, are important for RL? Typically, there are two (a concrete sketch follows the list):

  • Policy evaluation (prediction): How ‘good’ is a specific state, or state-action pair (under a given policy)? That is, how much reward does it lead to in the long run? This is also called the task of value estimation.

  • Policy optimization (control): Suppose we fully understand how the environment behaves. What is the best action to take in every scenario?
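
To make these two tasks concrete, here is a minimal sketch on a small made-up environment with two states and two actions. Policy evaluation repeatedly applies the expected one-step reward plus discounted future value under a fixed policy; policy optimization instead maximizes over actions at each step (this particular scheme is known as value iteration). The transition probabilities, rewards, and helper names (`evaluate`, `optimize`) are our own illustration, not code from this book.

```python
import jax.numpy as jnp

# A made-up environment with 2 states and 2 actions.
# P[s, a, s'] = probability of landing in state s' after taking action a in state s.
# r[s, a]     = immediate reward for taking action a in state s.
P = jnp.array([[[0.9, 0.1], [0.2, 0.8]],
               [[0.5, 0.5], [0.1, 0.9]]])
r = jnp.array([[1.0, 0.0],
               [0.0, 2.0]])
gamma = 0.9  # how strongly we discount future rewards

def evaluate(pi, iters=500):
    """Policy evaluation: long-run (discounted) reward of each state under policy pi."""
    v = jnp.zeros(2)
    for _ in range(iters):
        # expected immediate reward plus discounted value of the next state, under pi
        v = r[jnp.arange(2), pi] + gamma * P[jnp.arange(2), pi] @ v
    return v

def optimize(iters=500):
    """Policy optimization: the best action to take in every state (value iteration)."""
    v = jnp.zeros(2)
    for _ in range(iters):
        v = (r + gamma * P @ v).max(axis=1)
    q = r + gamma * P @ v      # long-run value of each state-action pair
    return q.argmax(axis=1)    # greedy action in each state

print(evaluate(jnp.array([0, 0])))  # value of each state if we always take action 0
print(optimize())                   # best action in each state
```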

1.2 Challenges of reinforcement learning

Recursion (bootstrapping): how can we “reuse” our current predictions as targets for improving those same predictions?

Exploration-exploitation tradeoff: should we try new actions, or capitalize on actions that we currently believe to be good?
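
One common way to balance the two (though far from the only one) is ε-greedy action selection: exploit the action that currently looks best most of the time, but occasionally explore an action chosen uniformly at random. Here is a minimal sketch, where the estimates `q` and the setting `epsilon=0.1` are arbitrary illustrations:

```python
import jax
import jax.numpy as jnp

def epsilon_greedy(key, q, epsilon=0.1):
    """With probability epsilon, explore (random action); otherwise exploit (greedy action)."""
    key_explore, key_action = jax.random.split(key)
    explore = jax.random.uniform(key_explore) < epsilon
    random_action = jax.random.randint(key_action, (), 0, q.shape[0])
    greedy_action = jnp.argmax(q)
    return jnp.where(explore, random_action, greedy_action)

q = jnp.array([0.2, 0.5, 0.1])  # current (made-up) estimates of how good each action is
action = epsilon_greedy(jax.random.PRNGKey(0), q)
```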

Credit assignment: consider this example: some mornings, you may wake up well rested, while other mornings, you may wake up drowsy and tired, even if you spent the same amount of time asleep. What could be the cause? You take so many actions throughout the day, and any one of them could be the reason for your good or poor sleep. Was it a skipped meal? A lack of exercise? When such consequences interact with each other, it can be challenging to properly assign credit to the actions that caused the observed effects.

Reproducibility: the high variance inherent in interacting with the environment means that the results of RL experiments can be challenging to reproduce. Even when averaging across multiple random seeds, the same algorithm can achieve drastically different results (R. Agarwal et al., 2021).

1.3 Programming

Why include code in a textbook? We believe that implementing an algorithm is a strong test of your understanding of it; mathematical notation can often abstract away details, while a computer must be given every single instruction. We have sought to write readable Python code that is self-contained within each file. This approach is inspired by Sussman et al. (2013). There are some ways in which the code style differs from typical software projects:

  • We keep use of language features to a minimum, even if it leads to code that could otherwise be more concisely or idiomatically expressed.
  • The variable names used in the code match those used in the main text. For example, the variable s will be used instead of the more explicit state.

We also make extensive use of Python type annotations to explicitly specify variable types, including shapes of vectors and matrices using the jaxtyping library.
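
For illustration, here is roughly what such an annotation looks like with jaxtyping; the function and its shape names are a made-up example rather than code from the book:

```python
from jaxtyping import Array, Float

def backup(
    v: Float[Array, "n_states"],           # current value estimate for each state
    P: Float[Array, "n_states n_states"],  # transition probabilities between states
    r: Float[Array, "n_states"],           # reward received in each state
    gamma: float,                          # discount factor
) -> Float[Array, "n_states"]:
    """Expected reward plus discounted next-state value, for every state at once."""
    return r + gamma * P @ v
```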

This is an interactive book built with Quarto (Allaire et al., 2024). It uses Python 3.11 and the JAX library for numerical computing. We chose JAX for the clarity of its functional style and for its mature RL ecosystem, sustained in large part by the Google DeepMind research group and a large body of open-source contributors. We use the standard Gymnasium library for interfacing with RL environments.
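
To give a flavour of that interface, here is a minimal interaction loop with a randomly acting agent; the particular environment, CartPole-v1, is just an example:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
total_reward = 0.0

for _ in range(200):
    action = env.action_space.sample()  # a random policy, purely for illustration
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:         # the episode ended; start a new one
        obs, info = env.reset()

env.close()
print(total_reward)
```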

1.4 Bibliographic notes and further reading

Interest in RL has surged over the past few decades, especially since AlphaGo’s groundbreaking success (Silver et al., 2016). RL also has roots well beyond machine learning: Thorndike (1911) put reinforcement forward as a learning framework for animal behaviour, and Schultz et al. (1997) highlight RL as a normative theory of reward-driven behaviour in neuroscience.

There are a number of recent textbooks that cover the field of RL:

Sutton & Barto (2018) set the framework for much of modern RL. Plaat (2022) is a graduate-level textbook on deep reinforcement learning. A. Agarwal et al. (2022) is a useful reference for theoretical guarantees of RL algorithms. S. E. Li (2023) highlights the connections between RL and optimal control. Mannor et al. (2024) is another advanced undergraduate course textbook. Bertsekas & Tsitsiklis (1996) introduced many of the core concepts of RL. Szepesvári (2010) is an invaluable resource on much of the theory underlying the methods in this book and elsewhere. Kochenderfer et al. (2022) provides a more probabilistic perspective alongside Julia code for various RL algorithms.

There are also a number of review articles that summarize recent advances. Murphy (2025) gives an overview of the past decade of advancements in RL. Ivanov & D’yakonov (2019) lists many popular algorithms.

Other textbooks focus on specific RL techniques or on applications of RL to specific fields. Albrecht et al. (2023) discusses multi-agent reinforcement learning. Rao & Jelvis (2022) focuses on applications of RL to finance. Y. Li (2018) surveys applications of RL to various fields.