Welcome to the study of reinforcement learning! This textbook accompanies the undergraduate course CS 1840/STAT 184 taught at Harvard. It is intended to be a friendly yet rigorous introduction to this active subfield of machine learning.
1 Prerequisites
This book assumes the same prerequisites as the course: you should be familiar with multivariable calculus, linear algebra, and probability. For Harvard undergraduates, this is fulfilled by Math 21a, Math 21b, and Stat 110, or their equivalents. Stat 111 is strongly recommended but not required. Specifically, we will assume that you know the following topics (the italicized terms have brief re-introductions in the text or in the Appendix: Background):
- Linear Algebra: Vectors and matrices, matrix multiplication, matrix inversion, eigenvalues and eigenvectors.
- Multivariable Calculus: Partial derivatives, the chain rule, Taylor series, gradients, directional derivatives, Lagrange multipliers.
- Probability: Random variables, probability distributions, expectation and variance, the law of iterated expectations (Adam’s rule), covariance, conditional probability, Bayes’s rule, and the law of total probability.
You should also be comfortable with programming in Python. See Section 6 for more about this textbook’s philosophy regarding programming.
2 Reinforcement learning in a nutshell
Broadly speaking, RL studies sequential decision-making in dynamic environments. An RL algorithm finds a strategy, called a policy, that maximizes the reward it obtains from the environment.
RL provides a powerful framework for attacking a wide variety of problems, including robotic control, video games and board games, resource management, language modelling, and more. It also provides an interdisciplinary paradigm for studying animal and human behavior. Many of the most stunning results in machine learning, ranging from AlphaGo to ChatGPT, are built using RL algorithms.
How does RL compare to the other two core machine learning paradigms, supervised learning and unsupervised learning?
Supervised learning (SL) concerns itself with learning a mapping from inputs to outputs. Typically the data takes the form of statistically independent input-output pairs. In RL, however, the data is generated by the agent interacting with the environment, meaning the sequential observations of the state are not independent of each other.
Conversely, SL is a well-studied field that provides many useful tools for RL.
Unsupervised learning concerns itself with learning the structure of data without the use of outside feedback or labels. In RL, though, the agent receives a reward signal from the environment, which can be thought of as a sort of feedback.
Unsupervised learning is crucial in many real-world applications of RL for dimensionality reduction and other purposes.
3 Core tasks of reinforcement learning
What tasks, exactly, does RL comprise? An RL algorithm must typically solve two main subtasks:
- Policy evaluation (prediction): How ‘good’ is a specific state, or state-action pair (under a given policy)? That is, how much reward does it lead to in the long run? (A minimal sketch follows this list.)
- Policy optimization (control): Suppose we fully understand how the environment behaves. What is the best action to take in every scenario?
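To make the first task concrete, here is a minimal sketch (not from the course materials) of policy evaluation on a hypothetical two-state MDP: we repeatedly apply the Bellman update $V \leftarrow r_\pi + \gamma P_\pi V$ until the value estimate converges. The transition matrix and rewards below are made up purely for illustration.

```python
# Illustrative sketch of policy evaluation (prediction) on a toy MDP.
import jax.numpy as jnp

gamma = 0.9                    # discount factor
P_pi = jnp.array([[0.8, 0.2],  # P_pi[s, s'] = probability of moving from s to s'
                  [0.1, 0.9]]) # when following the (fixed) policy pi
r_pi = jnp.array([1.0, 0.0])   # expected one-step reward in each state under pi

V = jnp.zeros(2)
for _ in range(500):           # iterate the Bellman update to (near) convergence
    V = r_pi + gamma * P_pi @ V

print(V)  # estimated long-run discounted reward from each state
```

Policy optimization would instead search over policies for one whose long-run reward is as large as possible in every state.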
4 Course overview
The course will progress through the following units:
1. Markov Decision Processes introduces Markov Decision Processes, the core mathematical framework for describing a large class of interactive environments.
2. Linear Quadratic Regulators is a standalone chapter on the linear quadratic regulator (LQR), an important tool for continuous control, in which the state and action spaces are no longer finite but rather continuous. This has widespread applications in robotics.
3. Multi-Armed Bandits introduces the multi-armed bandit (MAB) model for stateless sequential decision-making tasks. In exploring a number of algorithms, we will see how each of them strikes a different balance between exploring new options and exploiting known options. This exploration-exploitation tradeoff is a core consideration in RL algorithm design.
4. Supervised Learning is a standalone crash course on some tools from supervised learning that we will use in later chapters.
5. Fitted Dynamic Programming Algorithms introduces fitted dynamic programming (fitted DP) algorithms for solving MDPs. These algorithms use supervised learning to approximately evaluate policies when they cannot be evaluated exactly.
6. Policy Gradient Methods explores an important class of algorithms based on iteratively improving a policy. We will also encounter the use of deep neural networks to express more complicated policies and approximate complicated functions.
7. Imitation Learning attempts to learn a good policy from expert demonstrations. At its most basic, this is an application of supervised learning to RL tasks.
8. Tree Search Methods looks at ways to explicitly plan ahead when the environment’s dynamics are known. We will study the Monte Carlo Tree Search heuristic, which has been used to great success in the famous AlphaGo algorithm and its successors.
9. Exploration in MDPs continues to investigate the exploration-exploitation tradeoff. We will extend ideas from multi-armed bandits to the MDP setting.

The Appendix: Background contains an overview of selected background mathematical content and programming content.
5 Notation
We will use the following notation throughout the book. This notation is inspired by Sutton & Barto (2018) and Agarwal et al. (2022). We use $[N]$ as shorthand for the set $\{0, 1, \dots, N-1\}$.
| Element | Space | Definition (of element) |
| --- | --- | --- |
| $s$ | $\mathcal{S}$ | A state. |
| $a$ | $\mathcal{A}$ | An action. |
| $r$ | | A reward. |
| $\gamma$ | | A discount factor. |
| $\tau$ | | A trajectory. |
| $\pi$ | $\Pi$ | A policy. |
| $V^\pi$ | $\mathcal{S} \to \mathbb{R}$ | The value function of policy $\pi$. |
| $Q^\pi$ | $\mathcal{S} \times \mathcal{A} \to \mathbb{R}$ | The action-value function (a.k.a. Q-function) of policy $\pi$. |
| $A^\pi$ | $\mathcal{S} \times \mathcal{A} \to \mathbb{R}$ | The advantage function of policy $\pi$. |
| | $\Delta(\mathcal{X})$ | A distribution supported on $\mathcal{X}$. |
| $h$ | $[H]$ | Time horizon index of an MDP (subscript). |
| $k$ | $[K]$ | Arm index of a multi-armed bandit (superscript). |
| $t$ | $[T]$ | Iteration index of an algorithm (subscript). |
| $\theta$ | $\Theta$ | A set of parameters. |
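For reference, one common convention (the infinite-horizon discounted setting; finite-horizon chapters index these functions by the time step $h$ instead) defines the value, action-value, and advantage functions of a policy $\pi$ as

$$
\begin{aligned}
V^\pi(s) &= \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\ a_t \sim \pi(\cdot \mid s_t)\right], \\
Q^\pi(s, a) &= \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\ a_0 = a,\ a_t \sim \pi(\cdot \mid s_t) \text{ for } t \ge 1\right], \\
A^\pi(s, a) &= Q^\pi(s, a) - V^\pi(s),
\end{aligned}
$$

so the advantage measures how much better it is to take action $a$ in state $s$ than to simply follow $\pi$.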
Note that throughout the text, certain symbols will stand for either random variables or fixed values. We aim to clarify in ambiguous settings.
6 Programming
Why include code in a textbook? We believe that implementing an algorithm is a strong test of your understanding of it; mathematical notation can often abstract away details, while a computer must be given every single instruction. We have sought to write readable Python code that is self-contained within each file. This approach is inspired by Sussman et al. (2013). There are some ways in which the code style differs from typical software projects:
- We keep use of language features to a minimum, even if it leads to code that could otherwise be more concisely or idiomatically expressed.
- The variable names used in the code match those used in the main text. For example, the variable `s` will be used instead of the more explicit `state`.
We also make extensive use of Python type annotations to explicitly specify variable types, including shapes of vectors and matrices using the jaxtyping library.
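As a small, hypothetical illustration of this annotation style (the function below is made up for this note), array shapes appear directly in the type signature:

```python
# Illustrative only: a made-up function annotated with jaxtyping shapes.
from jaxtyping import Array, Float
import jax.numpy as jnp

def expected_reward(
    pi: Float[Array, "num_actions"],  # a distribution over actions in one state
    r: Float[Array, "num_actions"],   # the reward associated with each action
) -> Float[Array, ""]:                # a scalar
    """Return the expected one-step reward of sampling an action from pi."""
    return pi @ r

print(expected_reward(jnp.array([0.5, 0.5]), jnp.array([1.0, 0.0])))  # 0.5
```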
This is an interactive book built with Jupyter Book. It uses Python 3.11 and the JAX library for numerical computing. JAX was chosen for the clarity of its functional style and for its mature RL ecosystem, sustained in large part by the Google DeepMind research group and a large body of open-source contributors. We use the standard Gymnasium library for interfacing with RL environments.
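To give a flavor of that interface (this snippet is illustrative and not part of the book's code), a Gymnasium environment is driven by a simple reset/step loop; `CartPole-v1` is just a convenient example environment, and the "policy" here picks actions uniformly at random:

```python
# Minimal sketch of the standard Gymnasium interaction loop.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=184)
total_reward = 0.0
for _ in range(200):
    action = env.action_space.sample()  # stand-in for a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:         # the episode ended; start a new one
        obs, info = env.reset()
print(total_reward)
```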
The following names are exported from the `utils` module:
```python
import matplotlib.pyplot as plt

# convenient class builder
from typing import NamedTuple
# function typings
from collections.abc import Callable
# array typings
from jaxtyping import Float, Array
# convenient partial application of functions
from functools import partial
# numerical computing and linear algebra
import jax
import jax.numpy as jnp
# print functions as latex
import latexify

plt.style.use("fivethirtyeight")
```
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (Second edition). The MIT Press.
- Agarwal, A., Jiang, N., Kakade, S. M., & Sun, W. (2022). Reinforcement Learning: Theory and Algorithms.
- Sussman, G. J., Wisdom, J., & Farr, W. (2013). Functional Differential Geometry. The MIT Press.