Welcome to the study of reinforcement learning! This textbook accompanies the undergraduate course CS 1840/STAT 184 taught at Harvard. It is intended to be a friendly yet rigorous introduction to this active subfield of machine learning.
1 Prerequisites
This book assumes the same prerequisites as the course: you should be familiar with multivariable calculus, linear algebra, and probability. For Harvard undergraduates, this is fulfilled by Math 21a, Math 21b, and Stat 110, or their equivalents. Stat 111 is strongly recommended but not required. Specifically, we will assume that you know the following topics (the italicized terms have brief re-introductions in the text or in the Appendix: Background):
- Linear Algebra: Vectors and matrices, matrix multiplication, matrix inversion, eigenvalues and eigenvectors.
- Multivariable Calculus: Partial derivatives, the chain rule, Taylor series, gradients, directional derivatives, Lagrange multipliers.
- Probability: Random variables, probability distributions, expectation and variance, the law of iterated expectations (Adam’s rule), covariance, conditional probability, Bayes’s rule, and the law of total probability.
You should also be comfortable with programming in Python. See Section 6 for more about this textbook’s philosophy regarding programming.
2 Reinforcement learning in a nutshell
Broadly speaking, RL studies sequential decision-making in dynamic environments. An RL algorithm finds a strategy, called a policy, that maximizes the reward it obtains from the environment.
RL provides a powerful framework for attacking a wide variety of problems, including robotic control, video games and board games, resource management, language modelling, and more. It also provides an interdisciplinary paradigm for studying animal and human behavior. Many of the most stunning results in machine learning, ranging from AlphaGo to ChatGPT, are built using RL algorithms.
How does RL compare to the other two core machine learning paradigms, supervised learning and unsupervised learning?
Supervised learning (SL) concerns itself with learning a mapping from inputs to outputs. Typically the data takes the form of statistically independent input-output pairs. In RL, however, the data is generated by the agent interacting with the environment, meaning the sequential observations of the state are not independent of each other.
Conversely, SL is a well-studied field that provides many useful tools for RL.
Unsupervised learning concerns itself with learning the structure of data without the use of outside feedback or labels. In RL, though, the agent receives a reward signal from the environment, which can be thought of as a sort of feedback.
Unsupervised learning is crucial in many real-world applications of RL for dimensionality reduction and other purposes.
3 Core tasks of reinforcement learning
What tasks, exactly, does RL comprise? An RL algorithm must typically solve two main subtasks:
- Policy evaluation (prediction): How ‘good’ is a specific state, or state-action pair (under a given policy)? That is, how much reward does it lead to in the long run? (A minimal sketch follows this list.)
- Policy optimization (control): Suppose we fully understand how the environment behaves. What is the best action to take in every scenario?
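To make the first task concrete, here is a minimal sketch (not from the course materials) of policy evaluation on a hypothetical two-state MDP: we repeatedly apply the Bellman update $V \leftarrow r_\pi + \gamma P_\pi V$ until the value estimate converges. The transition matrix and rewards below are made up purely for illustration.

```python
# Illustrative sketch of policy evaluation (prediction) on a toy MDP.
import jax.numpy as jnp

gamma = 0.9                    # discount factor
P_pi = jnp.array([[0.8, 0.2],  # P_pi[s, s'] = probability of moving from s to s'
                  [0.1, 0.9]]) # when following the (fixed) policy pi
r_pi = jnp.array([1.0, 0.0])   # expected one-step reward in each state under pi

V = jnp.zeros(2)
for _ in range(500):           # iterate the Bellman update to (near) convergence
    V = r_pi + gamma * P_pi @ V

print(V)  # estimated long-run discounted reward from each state
```

Policy optimization would instead search over policies for one whose long-run reward is as large as possible in every state.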
4 Course overview
The course will progress through the following units:
1. Markov Decision Processes introduces Markov Decision Processes, the core mathematical framework for describing a large class of interactive environments.
2. Linear Quadratic Regulators is a standalone chapter on the linear quadratic regulator (LQR), an important tool for continuous control, in which the state and action spaces are no longer finite but rather continuous. This has widespread applications in robotics.
3. Multi-Armed Bandits introduces the multi-armed bandit (MAB) model for stateless sequential decision-making tasks. In exploring a number of algorithms, we will see how each of them strikes a different balance between exploring new options and exploiting known options. This exploration-exploitation tradeoff is a core consideration in RL algorithm design.
4. Supervised Learning is a standalone crash course on some tools from supervised learning that we will use in later chapters.
5. Fitted Dynamic Programming Algorithms introduces fitted dynamic programming (fitted DP) algorithms for solving MDPs. These algorithms use supervised learning to approximately evaluate policies when they cannot be evaluated exactly.
6. Policy Gradient Methods explores an important class of algorithms based on iteratively improving a policy. We will also encounter the use of deep neural networks to express more complicated policies and approximate complicated functions.
7. Imitation Learning attempts to learn a good policy from expert demonstrations. At its most basic, this is an application of supervised learning to RL tasks.
8. Tree Search Methods looks at ways to explicitly plan ahead when the environment’s dynamics are known. We will study the Monte Carlo Tree Search heuristic, which has been used to great success in the famous AlphaGo algorithm and its successors.
9. Exploration in MDPs continues to investigate the exploration-exploitation tradeoff. We will extend ideas from multi-armed bandits to the MDP setting.

The Appendix: Background contains an overview of selected background mathematical content and programming content.
5 Notation
We will use the following notation throughout the book. This notation is inspired by Sutton & Barto (2018) and Agarwal et al. (2022). We use $[N]$ as shorthand for the set $\{0, 1, \dots, N-1\}$.
| Element | Space | Definition (of element) |
| --- | --- | --- |
| $s$ | $\mathcal{S}$ | A state. |
| $a$ | $\mathcal{A}$ | An action. |
| $r$ | | A reward. |
| $\gamma$ | | A discount factor. |
| $\tau$ | | A trajectory. |
| $\pi$ | $\Pi$ | A policy. |
| $V^\pi$ | $\mathcal{S} \to \mathbb{R}$ | The value function of policy $\pi$. |
| $Q^\pi$ | $\mathcal{S} \times \mathcal{A} \to \mathbb{R}$ | The action-value function (a.k.a. Q-function) of policy $\pi$. |
| $A^\pi$ | $\mathcal{S} \times \mathcal{A} \to \mathbb{R}$ | The advantage function of policy $\pi$. |
| | $\Delta(\mathcal{X})$ | A distribution supported on $\mathcal{X}$. |
| $h$ | $[H]$ | Time horizon index of an MDP (subscript). |
| $k$ | $[K]$ | Arm index of a multi-armed bandit (superscript). |
| $t$ | $[T]$ | Iteration index of an algorithm (subscript). |
| $\theta$ | $\Theta$ | A set of parameters. |
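For reference, one common convention (the infinite-horizon discounted setting; finite-horizon chapters index these functions by the time step $h$ instead) defines the value, action-value, and advantage functions of a policy $\pi$ as

$$
\begin{aligned}
V^\pi(s) &= \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\ a_t \sim \pi(\cdot \mid s_t)\right], \\
Q^\pi(s, a) &= \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\ a_0 = a,\ a_t \sim \pi(\cdot \mid s_t) \text{ for } t \ge 1\right], \\
A^\pi(s, a) &= Q^\pi(s, a) - V^\pi(s),
\end{aligned}
$$

so the advantage measures how much better it is to take action $a$ in state $s$ than to simply follow $\pi$.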
Note that throughout the text, certain symbols will stand for either random variables or fixed values. We aim to clarify in ambiguous settings.
6 Programming
Why include code in a textbook? We believe that implementing an algorithm is a strong test of your understanding of it; mathematical notation can often abstract away details, while a computer must be given every single instruction. We have sought to write readable Python code that is self-contained within each file. This approach is inspired by Sussman et al. (2013). There are some ways in which the code style differs from typical software projects:
- We keep use of language features to a minimum, even if it leads to code that could otherwise be more concisely or idiomatically expressed.
- The variable names used in the code match those used in the main text. For example, the variable `s` will be used instead of the more explicit `state`.
We also make extensive use of Python type annotations to explicitly specify variable types, including shapes of vectors and matrices using the jaxtyping library.
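As a small, hypothetical illustration of this annotation style (the function below is made up for this note), array shapes appear directly in the type signature:

```python
# Illustrative only: a made-up function annotated with jaxtyping shapes.
from jaxtyping import Array, Float
import jax.numpy as jnp

def expected_reward(
    pi: Float[Array, "num_actions"],  # a distribution over actions in one state
    r: Float[Array, "num_actions"],   # the reward associated with each action
) -> Float[Array, ""]:                # a scalar
    """Return the expected one-step reward of sampling an action from pi."""
    return pi @ r

print(expected_reward(jnp.array([0.5, 0.5]), jnp.array([1.0, 0.0])))  # 0.5
```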
This is an interactive book built with Jupyter Book. It uses Python 3.11 and the JAX library for numerical computing. JAX was chosen for the clarity of its functional style and for its mature RL ecosystem, sustained in large part by the Google DeepMind research group and a large body of open-source contributors. We use the standard Gymnasium library for interfacing with RL environments.
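To give a flavor of that interface (this snippet is illustrative and not part of the book's code), a Gymnasium environment is driven by a simple reset/step loop; `CartPole-v1` is just a convenient example environment, and the "policy" here picks actions uniformly at random:

```python
# Minimal sketch of the standard Gymnasium interaction loop.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=184)
total_reward = 0.0
for _ in range(200):
    action = env.action_space.sample()  # stand-in for a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:         # the episode ended; start a new one
        obs, info = env.reset()
print(total_reward)
```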
The following names are exported from the `utils` module:
```python
import matplotlib.pyplot as plt

# convenient class builder
from typing import NamedTuple
# function typings
from collections.abc import Callable
# array typings
from jaxtyping import Float, Array
# convenient partial application of functions
from functools import partial
# numerical computing and linear algebra
import jax
import jax.numpy as jnp
# print functions as latex
import latexify

plt.style.use("fivethirtyeight")
```
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (Second edition). The MIT Press.
- Agarwal, A., Jiang, N., Kakade, S. M., & Sun, W. (2022). Reinforcement Learning: Theory and Algorithms.
- Sussman, G. J., Wisdom, J., & Farr, W. (2013). Functional Differential Geometry. The MIT Press.