Notation

It’s worth it to spend a few words discussing the notation used for reinforcement learning. RL notation can appear quite complicated, since we often need to index across algorithm iterations, trajectories, and timesteps, so that certain values can have two or three indices attached to them. It’s important to see beyond the formal notation and interpret the quantities being symbolized.

We will use the following notation throughout the book. This notation is inspired by Sutton & Barto (2018) and Agarwal et al. (2022). We use \([N]\) as shorthand for the set \(\{ 0, 1, \dots, N-1 \}\). We try to use each lowercase letter to represent an element of the set denoted by the corresponding (stylized) uppercase letter. We use \(\triangle(\mathcal{X})\) to denote a distribution supported on some subset of \(\mathcal{X}\). (The triangle is used to evoke the probability simplex.)

Table 1: Table of integer indices.
Index Range Definition (of index)
\(h\) \([H]\) Time horizon index of an MDP (subscript).
\(k\) \([K]\) Arm index of a multi-armed bandit (superscript).
\(t\) \([T]\) Number of episodes.
\(i\) \([I]\) Iteration index of an algorithm (subscript or superscript).
Table 2: Table of common notation.
Element Space Definition (of element)
\(s\) \(\mathcal{S}\) A state.
\(a\) \(\mathcal{A}\) An action.
\(r\) \(\mathbb{R}\) A reward.
\(\gamma\) \([0, 1]\) A discount factor.
\(\tau\) \(\mathcal{T}\) A trajectory \((s_0, a_0, r_0, \dots, s_{H-1}, a_{H-1}, r_{H-1})\)
\(\pi\) \(\Pi\) A policy.
\(V^\pi\) \(\mathcal{S} \to \mathbb{R}\) The value function of policy \(\pi\).
\(Q^\pi\) \(\mathcal{S} \times \mathcal{A} \to \mathbb{R}\) The action-value function (a.k.a. Q-function) of policy \(\pi\).
\(A^\pi\) \(\mathcal{S} \times \mathcal{A} \to \mathbb{R}\) The advantage function of policy \(\pi\).
\(v\) \(\mathcal{S} \to \mathbb{R}\) An approximation to the value function.
\(q\) \(\mathcal{S} \times \mathcal{A} \to \mathbb{R}\) An approximation to the Q function.
\(\theta\) \(\Theta\) A parameter vector.

Note that throughout the text, certain symbols will stand for either random variables or fixed values. We aim to clarify in ambiguous settings.