10  Exploration in MDPs

10.1 Introduction

One of the key challenges of reinforcement learning is the exploration-exploitation tradeoff. Should we exploit actions we know will give high reward, or should we explore different actions to discover potentially better strategies? An algorithm that doesn’t explore effectively might easily overfit to certain areas of the state space, and fail to generalize once it enters a region it hasn’t yet seen.

In the multi-armed bandit setting (Chapter 4), we studied the upper confidence bound (UCB) algorithm (Section 4.7) that incentivizes the learner to explore arms that it is uncertain about. In particular, UCB relies on optimism in the face of uncertainty: it chooses arms based on an overestimate of the arm’s true mean reward. In this chapter, we will see how to generalize this idea to the MDP setting.


10.1.1 Sparse reward

Exploration is crucial in sparse reward problems where \(r(s, a) = 0\) for most (or nearly all) states and actions. Often, the agent must take a specific sequence of actions before any reward is observed.

Example 10.1 (Chain MDP) Here’s a simple example of an MDP with sparse rewards:

An illustration of the chain MDP environment.

There are \(|\mathcal{S}|\) cells arranged in a chain. The agent starts in the leftmost cell. The rightmost state is a terminal state. In every state, there are three possible actions, two of which move the agent left and one which moves the agent right. (The two “left” actions do nothing in the leftmost cell.) The reward function gives a reward of \(1\) for taking the action that enters the rightmost cell, and zero otherwise.
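
To make this concrete, here is a minimal sketch of the chain MDP as a Python environment. The class name and the `reset`/`step` interface are our own choices, not part of the example:

```python
class ChainMDP:
    """Chain of `num_states` cells; only the action entering the rightmost cell is rewarded."""

    def __init__(self, num_states: int = 10):
        self.num_states = num_states
        self.state = 0

    def reset(self) -> int:
        """Start a new episode in the leftmost cell."""
        self.state = 0
        return self.state

    def step(self, action: int):
        """Actions 0 and 1 move left (a no-op in the leftmost cell); action 2 moves right."""
        if action == 2:
            next_state = min(self.state + 1, self.num_states - 1)
        else:
            next_state = max(self.state - 1, 0)
        reward = 1.0 if next_state == self.num_states - 1 else 0.0
        terminated = next_state == self.num_states - 1   # rightmost cell is terminal
        self.state = next_state
        return next_state, reward, terminated
```

A uniformly random policy moves right only a third of the time, so it takes roughly exponentially many episodes (in the chain length) before it stumbles upon the reward by chance.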

The problem of sparse rewards is especially prevalent in RL compared to supervised learning. In most supervised learning tasks, every labelled sample provides some useful signal. However, in RL, algorithms that don’t systematically explore new states may fail to learn anything meaningful within a reasonable amount of time.

Consider the algorithms we’ve covered so far for unknown environments: policy gradient methods (Chapter 7) and fitted DP methods (Chapter 6). How would these do on this problem?

Remark 10.1 (Policy gradient methods fail on sparse reward). Policy gradient algorithms require the gradient to be nonzero in order to learn. If we never observe any reward, the gradient will always be zero, and the policy will never change or improve. If we think of the expected total reward as a function \(J(\theta)\) of the policy parameters, we can visualize the graph of \(J\) as being mostly flat, making it impossible to “climb the hill” from almost every random initialization.

Remark 10.2 (Fitted DP methods fail on sparse reward). Fitted DP algorithms run into a similar issue: as we randomly interact with the environment, we never observe any reward, and so the reward model simply gives zero for every state-action pair. In expectation, it would take a computationally infeasible number of rollouts to observe the reward by chance.

This is quite disheartening! The sophisticated methods we’ve developed, which can exceed human-level performance on a wide variety of tasks, fail on this problem that seems almost trivial.

Of course, a simple way to solve the “chain MDP” in ex. 10.1 is to actively visit unseen states. For a policy that visits a new state in each rollout, the final cell can be reached in \(O(|\mathcal{S}|)\) rollouts (i.e. \(O(|\mathcal{S}|^2)\) time). The rest of this chapter will consider ways to explicitly explore unknown states.

10.1.2 Reward shaping

One workaround to sparse reward problems in practice is to shape the reward function using domain knowledge. For example, in ex. 10.1, we (that is, the practitioners) know that travelling to the right is the correct action, so we could design a reward function that provides a reward of \(0.1\) for the action that moves to the right. A similar strategy is used in practice for many chess or board game algorithms where capturing the opponent’s pieces earns some positive reward.

Though this might seem obvious, designing a useful reward function can be challenging in practice. The agent may learn to exploit the intermediate rewards rather than solve the original goal. A famous example is the agent trained to play the CoastRunners game, in which players race boats around a racetrack. However, the algorithm found that it could achieve higher reward (i.e. in-game score) by refusing to go around the racetrack and instead collecting bonus points in a loop!

An RL agent collects bonus points instead of participating in the race. Image from Clark & Amodei (2024).

This phenomenon is known as reward hacking or Goodhart’s law. Reward hacking is essentially a special case of “finding loopholes” around the written guidelines (or in this case, the reward signal used for training); think of folk stories such as King Midas or the Monkey’s Paw. When RL algorithms are deployed in high-stakes scenarios, it is crucial to verify the learned policy’s behaviour and ensure that it is aligned to the designer’s intentions.

10.2 Exploration in deterministic MDPs

Let us address the exploration problem in a deterministic MDP, that is, where taking action \(a\) in state \(s\) always leads to the state \(P(s, a) \in \mathcal{S}\). How can we methodically visit every single state-action pair?

In the multi-armed bandit setting (Chapter 4), there are no states, so it’s trivial to visit every “state-action pair”: just pull each arm once. But in the MDP setting, in order to achieve a particular state-action pair \((s, a)\), one must plan out a path from the initial state.

We can do this by constructing an MDP in which only unseen state-action pairs are rewarded, and using value iteration/dynamic programming (Section 2.3.2) to plan a path towards them. Concretely, we keep a set \(\mathcal{D}\) of all the \((s, a, r, s')\) tuples we’ve observed. Each episode, we use \(\mathcal{D}\) to construct a fully known MDP, \(M_{\mathcal{D}}\), in which only state-action pairs not in \(\mathcal{D}\) are rewarded.

Definition 10.1 (Explore-then-exploit algorithm) Suppose that every state can be reached from the initial state within a single episode.

  1. \(\mathcal{D} \gets \emptyset\)
  2. For \(T = 0, 1, 2, \dots\) (until the entire MDP has been explored):
    1. Construct \(M_{\mathcal{D}}\) using \(\mathcal{D}\). That is, the state transitions are set to those observed in \(\mathcal{D}\), and the reward is set to \(0\) for all state-action pairs in \(\mathcal{D}\), and \(1\) otherwise.
    2. Execute DP (Section 2.3.2) on the known MDP \(M_{\mathcal{D}}\) to compute the optimal policy \(\pi^\star_{\mathcal{D}}\).
    3. Execute \(\pi^\star_{\mathcal{D}}\) in the true environment. This will visit some \((s, a)\) not yet in \(\mathcal{D}\), at which point we observe the reward \(r(s, a)\) and next state \(P(s, a)\).
    4. \(\mathcal{D} \gets \mathcal{D} \cup \{ (s, a, r, s') \}\), where \(s' = P(s, a), r = r(s, a)\) are the observed state transition and reward.
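
Below is a minimal sketch of this procedure for a deterministic environment with the `reset`/`step` interface used above. The code structure is our own: we treat unseen pairs as terminal with optimistic reward \(1\) when planning, stop once no unseen pair is reachable, and mark the actions of terminal states as explored so the planner does not keep targeting them.

```python
def explore_then_exploit(env, num_states, num_actions, horizon):
    """Explore a deterministic MDP by repeatedly planning toward unseen state-action pairs."""
    D = {}  # observed transitions: (s, a) -> (reward, next_state)

    while True:
        # Dynamic programming on the optimistic MDP M_D: seen pairs give reward 0,
        # unseen pairs give reward 1 (and, being unknown, end the planned episode).
        V = [0.0] * num_states                          # V_H = 0
        policy = [[0] * num_states for _ in range(horizon)]
        for h in reversed(range(horizon)):
            V_next, V = V, [0.0] * num_states
            for s in range(num_states):
                values = []
                for a in range(num_actions):
                    if (s, a) in D:
                        _, s_next = D[(s, a)]
                        values.append(V_next[s_next])   # reward 0 plus future value
                    else:
                        values.append(1.0)              # optimistic reward for an unseen pair
                best_action = max(range(num_actions), key=lambda a: values[a])
                policy[h][s] = best_action
                V[s] = values[best_action]

        s = env.reset()
        if V[s] == 0.0:
            return D                                    # no unseen pair is reachable: done

        # Execute the greedy policy in the true environment, recording new transitions.
        for h in range(horizon):
            a = policy[h][s]
            s_next, r, terminated = env.step(a)
            D.setdefault((s, a), (r, s_next))
            if terminated:
                # Actions at a terminal state can never be taken; mark them as seen
                # (harmless self-loops) so the planner stops targeting them.
                for a_any in range(num_actions):
                    D.setdefault((s_next, a_any), (0.0, s_next))
                break
            s = s_next
```

On the chain MDP, each episode discovers at least one new state-action pair, so exploration finishes after at most \(|\mathcal{S}||\mathcal{A}|\) episodes.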

Remark 10.3 (Path planning is graph traversal). Review the dynamic programming algorithm for a finite-horizon MDP (Section 2.3.2). Note that in the constructed MDP \(M_{\mathcal{D}}\), this is identical to a breadth-first search beginning from the desired state at the final timestep: each state-timestep pair is a node in the graph, and the state transitions determine the (directed) edges. Each state-timestep pair from which it is possible to reach the desired state is assigned a value of \(1\). The policy serves to backtrack through these state-timestep pairs, returning to the root node of the search: the desired state.

We can easily measure the per-episode regret of this algorithm.

Definition 10.2 (Per-episode regret) We aim to evaluate some iterative policy optimization algorithm. Let \(\pi^t\) be the policy returned by the algorithm after \(t\) iterations. The per-episode regret across \(T\) iterations is given by

\[ \text{Regret}_T= \mathop{\mathbb{E}}_{s_0 \sim P_0}\left[ \sum_{t= 0}^{T- 1} V^\star_0(s_0) - V^{\pi^t}_0(s_0) \right] \tag{10.1}\]

where the randomness is in the initial state distribution.

Remark 10.4 (MDP policies as MAB arms). What does this have to do with the definition of regret in the MAB setting (def. 4.3)? Here, policies are arms, and the “mean reward” is the expected total reward of a trajectory. We’ll make this connection more explicit in Section 10.3.

Theorem 10.1 (Performance of explore-then-exploit) The regret of the explore-then-exploit algorithm (def. 10.1) can be upper-bounded by

\[ \sum_{t= 0}^{T- 1} V^\star_0 - V_0^{\pi^t} \le |\mathcal{S}||\mathcal{A}| H. \tag{10.2}\]

(This MDP and algorithm are deterministic, assuming there is a single starting state, so the regret is not random.)

Proof. As long as every state can be reached from \(s_0\) within a single episode, i.e. \(|\mathcal{S}| \le H\), the algorithm in def. 10.1 will eventually be able to explore all \(|\mathcal{S}| |\mathcal{A}|\) state-action pairs, adding one new transition per episode.

Let \(M\) denote the original MDP that we aim to solve. We know it will take at most \(|\mathcal{S}| |\mathcal{A}|\) iterations to explore the entire MDP, after which \(M_{\mathcal{D}} = M\) and \(\pi^\star_{\mathcal{D}}\) is the optimal policy in \(M\), incurring no additional regret. For each “shortest-path” policy \(\pi^\star_{\mathcal{D}}\) up until then, its value will differ from that of \(\pi^\star\) by at most \(H\), since the policies will differ by at most \(1\) reward at each timestep.

10.3 Treating an unknown MDP as a MAB

We explored the exploration-exploitation tradeoff in the multi-armed bandits setting (Chapter 4). Can we apply the MAB algorithms we discovered to MDPs as well? Let us formally describe an unknown MDP as an MAB problem.

In a MAB problem, we want to find the arm with the highest mean reward. In an MDP, we want to find the policy that achieves the highest expected total reward. So if we want to apply MAB techniques to solving an MDP, it makes sense to draw an equivalence between arms and policies. We can summarize this equivalence in the following table:

Table 10.1: Treating an MDP with finite states and actions as a MAB.
| MAB | MDP |
|---|---|
| \(K\) arms | \((\vert\mathcal{A}\vert^{\vert\mathcal{S}\vert})^H\) deterministic policies |
| unknown reward distributions \(\nu^k\) | unknown trajectory distributions \(\rho^\pi\) |
| \(k^\star = \arg\max_{k \in [K]} \mathop{\mathbb{E}}_{r \sim \nu^k} [r]\) | \(\pi^\star = \arg\max_{\pi \in \Pi} \mathop{\mathbb{E}}_{\tau \sim \rho^\pi} \left[ \sum_{h=0}^{H-1} r(s_h, a_h) \right]\) |
| pull arm \(k\) and observe reward | roll out with \(\pi\) and observe total reward |

(For the sake of this example, assume that the MDP’s reward function is stochastic, so that the MAB reward distributions are nondegenerate.)

Recall that UCB incurs regret \(\widetilde{O}(\sqrt{TK})\), where \(T\) is the number of pulls and \(K\) is the number of arms. So in the MDP-as-MAB problem, using UCB for \(T\) episodes would achieve regret

\[ \widetilde{O}\left( \sqrt{|\mathcal{A}|^{|\mathcal{S}|H} T} \right) \tag{10.3}\]

This scales exponentially in \(|\mathcal{S}|\) and \(H\), which quickly becomes intractable. Notably, this method treats each policy as entirely independent from the others, but the performance of different policies is typically correlated. We can illustrate this with the following example:

Example 10.2 (Treating an MDP as a MAB) Consider a “coin MDP” with two states “heads” and “tails”, two actions “Y” and “N”, and a time horizon of \(H=2\). The state transition flips the coin, and doesn’t depend on the action. The reward only depends on the action: Taking action Y gives reward \(1\), and taking action N gives reward \(0\).

Suppose we collect data from the two constant policies \(\pi_{\text{Y}}(s) = \text{Y}\) and \(\pi_{\text{N}}(s) = \text{N}\). Now we want to learn about the policy \(\widetilde{\pi}\) that takes action Y and then N. Do we need to collect data from \(\widetilde{\pi}\) to evaluate it? No: Since the reward only depends on the action, we can infer its value from our data on the policies \(\pi_{\text{Y}}\) and \(\pi_{\text{N}}\). However, if we treat the MDP as a bandit in which \(\widetilde{\pi}\) is a new, unknown arm, we ignore the known correlation between the action and the reward.
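
To get a sense of the scale of eq. 10.3, we can count the number of deterministic policies for even a modest MDP. The specific numbers below are purely illustrative:

```python
import math

num_states, num_actions, horizon = 10, 2, 10

# Number of deterministic time-varying policies: one action per (state, timestep) pair.
num_policies = num_actions ** (num_states * horizon)
print(f"{float(num_policies):.3e}")                      # ~1.268e+30

# Rough scale of the UCB regret bound sqrt(K T) with K = num_policies and T = 10^6 episodes.
print(f"{math.sqrt(num_policies * 1e6):.3e}")            # ~1.1e+18
```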

10.4 Upper confidence bound value iteration

We shouldn’t need to consider all \(|\mathcal{A}|^{|\mathcal{S}| H}\) deterministic policies to achieve low regret. Rather, all we need to describe the optimal policy is \(Q^\star\), which has \(H |\mathcal{S}||\mathcal{A}|\) entries to be learned. In this section, we’ll study the upper confidence bound value iteration (UCBVI) algorithm (Azar et al., 2017), which indeed achieves polynomial regret in \(|\mathcal{S}|\), \(|\mathcal{A}|\), and \(H\).

As its name suggests, UCBVI combines the upper confidence bound (UCB) algorithm from the multi-armed bandits setting (Section 4.7) with value iteration (VI) from the MDP setting (Section 2.3.2):

  • UCB strikes a good exploration-exploitation tradeoff in an (unknown) MAB;
  • VI (what we simply call DP in the finite-horizon setting) computes the optimal value function in a known MDP (with finite states and actions).

Let us briefly review these two algorithms:

Remark 10.5 (Review of UCB). At each iteration \(t\), for each arm \(k\), we construct a confidence interval for the mean of arm \(k\)’s reward distribution. We then choose the arm with the highest upper confidence bound:

\[ \begin{aligned} k_{t+1} &\gets \arg\max_{k \in [K]} \text{ucb}^k_t \\ \text{where } \text{ucb}^k_t &= \frac{R^{k}_t}{N^{k}_t} + \sqrt{\frac{\ln(2t/\delta)}{2 N^{k}_t}} \end{aligned} \tag{10.4}\]

where \(N_t^k\) indicates the number of times arm \(k\) has been pulled up until time \(t\), \(R_t^k\) indicates the total reward obtained by pulling arm \(k\) up until time \(t\), and \(\delta > 0\) controls the width of the confidence interval.
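
As a small sketch, the arm-selection rule in eq. 10.4 can be written as follows, assuming (as is standard) that every arm has already been pulled at least once:

```python
import numpy as np


def ucb_arm(total_rewards, pull_counts, t, delta=0.05):
    """Select the arm with the highest upper confidence bound (eq. 10.4).

    `total_rewards[k]` is R_t^k and `pull_counts[k]` is N_t^k; both are NumPy arrays.
    """
    means = total_rewards / pull_counts
    bonuses = np.sqrt(np.log(2 * t / delta) / (2 * pull_counts))
    return int(np.argmax(means + bonuses))
```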

We can treat the upper confidence bound as a “proxy reward” that is the estimated mean reward plus a bonus exploration term. Since the size of the bonus term is proportional to our uncertainty (i.e. predicted variance) about that arm’s mean, this is called an optimistic bonus. In UCBVI, we will extend this idea to the case of an unknown MDP \(\mathcal{M}\) by adding an exploration term to the estimated reward function. Then, we will use DP to solve for the optimal policy in the resulting modelled MDP \(\widetilde{\mathcal{M}}\).

Remark 10.6 (Review of VI/DP). Value iteration (VI) is a dynamic programming (DP) algorithm for computing the optimal policy and value function in an MDP where the state transitions and reward function are known. We begin at the final timestep, where \(V_H^\star(s) = 0\), and work backwards using Bellman’s optimality equations (Theorem 2.5):

For \(h= H-1, \dots, 0\):

\[ \begin{aligned} Q_h^\star(s, a) &= r(s, a) + \mathop{\mathbb{E}}_{s' \sim P(\cdot \mid s, a)}[V_{h+1}^\star(s')] \\ \pi_h^\star(s) &= \arg\max_{a \in \mathcal{A}} Q_h^\star(s, a) \\ V_h^\star(s) &= Q_h^\star(s, \pi_h^\star(s)). \end{aligned} \tag{10.5}\]
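
In code, this backward recursion is a few lines of NumPy for a known tabular MDP (here with time-invariant `P` and `r`, matching eq. 10.5; the array-shape conventions are our own):

```python
import numpy as np


def value_iteration(P, r, horizon):
    """Finite-horizon DP (eq. 10.5). `P` has shape (S, A, S); `r` has shape (S, A)."""
    num_states, num_actions, _ = P.shape
    V = np.zeros(num_states)                  # V_H^* = 0
    pi = np.zeros((horizon, num_states), dtype=int)
    for h in reversed(range(horizon)):
        Q = r + P @ V                         # Q_h^*(s, a) = r(s, a) + E[V_{h+1}^*(s')]
        pi[h] = Q.argmax(axis=1)              # greedy policy pi_h^*
        V = Q.max(axis=1)                     # V_h^*(s)
    return pi, V                              # V is V_0^*
```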

Assumptions: We will consider the general case of a time-varying MDP where the transition and reward functions may change over time. Recall our convention that \(P_h\) is the distribution of \(s_{h+1} \mid s_{h}, a_{h}\) and \(r_h\) is applied to \(s_h, a_h\).

Definition 10.3 (UCBVI) At a high level, the UCBVI algorithm can be described as follows: For \(t = 0, \dots, T-1\):

  1. Modeling: Use previously collected data to model the state transitions \(\widehat{P}_0, \dots, \widehat{P}_{H-1}\) and reward functions \(\widehat{r}_0, \dots, \widehat{r}_{H-1}\).
  2. Reward bonus: Design a reward bonus \(b_h(s, a) \in \mathbb{R}\) to encourage exploration, analogous to the UCB term.
  3. Optimistic planning: Use VI (i.e. DP) to compute the optimal policy \(\widehat \pi\) in the modelled MDP

\[ \widetilde{\mathcal{M}} = (\mathcal{S}, \mathcal{A}, \{ \widehat{P}_h\}_{h \in [H]}, \{ \widehat{r}_h+ b_h\}_{h \in [H]}, H). \]

  4. Execution: Use \(\widehat \pi\) to collect a new trajectory.

We detail each of these steps below.

10.4.1 Modeling the transitions

Recall that we don’t know the state transitions or reward function of the MDP we aim to solve. We seek to approximate

\[ P_h(s_{h+ 1} \mid s_h, a_h) = \frac{\mathbb{P}(s_h, a_h, s_{h+ 1})}{\mathbb{P}(s_h, a_h)}, \tag{10.6}\]

where \(\mathbb{P}\) denotes the true joint probabilities. We can estimate these using their sample probabilities across a set of collected transitions. That is, define

\[ \begin{aligned} N_h^t(s, a, s') & := \sum_{i=0}^{t-1} \mathbf{1}\left\{ (s_h^{i}, a_h^{i}, s_{h+1}^{i}) = (s, a, s') \right\} \\ N_h^t(s, a) & := \sum_{i=0}^{t-1} \mathbf{1}\left\{ (s_h^{i}, a_h^{i}) = (s, a) \right\} \\ \end{aligned} \tag{10.7}\]

to be the number of times the tuple \(s, a, s'\) appears in the first \(t\) episodes of collected data, and similarly for the state-action pair \(s, a\). Then we can model

\[ \widehat{P}_h^t(s' \mid s, a) = \frac{N_h^t(s, a, s')}{N_h^t(s, a)}. \tag{10.8}\]

Similarly, we can model the rewards by the sample mean in each state-action pair:

\[ \widehat{r}_h^t(s, a) = \frac{1}{N_h^t(s, a)} \sum_{i=0}^{t-1} \mathbf{1}\left\{ (s_h^i, a_h^i) = (s, a) \right\} r_h^i. \tag{10.9}\]

This is a fairly naive, nonparametric estimator that doesn’t assume any underlying structure of the MDP. We’ll see how to incorporate assumptions about the MDP in the following section.
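
A minimal sketch of these count-based estimates, assuming the collected transitions are stored as `(h, s, a, r, s_next)` tuples (the data format is our own choice):

```python
import numpy as np


def estimate_model(transitions, num_states, num_actions, horizon):
    """Empirical transition and reward estimates (eq. 10.8 and eq. 10.9).

    `transitions` is a list of (h, s, a, r, s_next) tuples collected so far.
    """
    counts_sas = np.zeros((horizon, num_states, num_actions, num_states))  # N_h^t(s, a, s')
    counts_sa = np.zeros((horizon, num_states, num_actions))               # N_h^t(s, a)
    reward_sums = np.zeros((horizon, num_states, num_actions))
    for h, s, a, r, s_next in transitions:
        counts_sas[h, s, a, s_next] += 1
        counts_sa[h, s, a] += 1
        reward_sums[h, s, a] += r
    N = np.maximum(counts_sa, 1)               # avoid dividing by zero for unvisited pairs
    P_hat = counts_sas / N[..., None]          # estimated P_h^t(s' | s, a)
    r_hat = reward_sums / N                    # estimated r_h^t(s, a)
    return P_hat, r_hat, counts_sa
```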

10.4.2 Reward bonus

To motivate the reward bonus term, recall how we designed the reward bonus term for UCB (Section 4.7):

  1. We used Hoeffding’s inequality to bound, with high probability, how far the sample mean \(\widehat \mu_t^k\) deviated from the true mean \(\mu^k\).
  2. By inverting this inequality, we obtained a \((1-\delta)\)-confidence interval for the true mean, centered at our estimate.
  3. To make this bound uniform across all timesteps \(t \in [T]\), we applied the union bound and multiplied \(\delta\) by a factor of \(T\).

We’d like to do the same for UCBVI, and construct the bonus term such that \(V^\star_h(s) \le \widehat{V}_h^t(s)\) with high probability. However, our construction will be more complex than the MAB case, since \(\widehat{V}_h^t(s)\) depends on the bonus \(b_h^t(s, a)\) implicitly via DP. We claim that the bonus term that gives the proper bound is

\[ b_h^t(s, a) = 2 H\sqrt{\frac{\log( |\mathcal{S}||\mathcal{A}|HT/\delta )}{N_h^t(s, a)}}. \tag{10.10}\]

We provide a heuristic sketch of the proof in Section B.2; see Agarwal et al. (2022), Section 7.3 for a full proof.
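
Putting the pieces together, here is a self-contained sketch of tabular UCBVI for an episodic environment with a `reset`/`step` interface like the `ChainMDP` above. The function signature is our own, and truncating the optimistic values to \([0, H]\) is a standard practical detail not spelled out in the text:

```python
import numpy as np


def ucbvi(env, num_states, num_actions, horizon, num_episodes, delta=0.05):
    """Tabular UCBVI sketch (def. 10.3)."""
    S, A, H, T = num_states, num_actions, horizon, num_episodes
    counts_sas = np.zeros((H, S, A, S))     # N_h^t(s, a, s')
    counts_sa = np.zeros((H, S, A))         # N_h^t(s, a)
    reward_sums = np.zeros((H, S, A))       # running sum of observed rewards at (h, s, a)

    for t in range(T):
        # Step 1: model the transitions and rewards from counts (Section 10.4.1).
        N = np.maximum(counts_sa, 1)                       # avoid dividing by zero
        P_hat = counts_sas / N[..., None]
        r_hat = reward_sums / N

        # Step 2: optimistic reward bonus (eq. 10.10).
        bonus = 2 * H * np.sqrt(np.log(S * A * H * T / delta) / N)

        # Step 3: optimistic planning by DP in the modelled MDP.
        V = np.zeros(S)
        policy = np.zeros((H, S), dtype=int)
        for h in reversed(range(H)):
            Q = np.clip(r_hat[h] + bonus[h] + P_hat[h] @ V, 0.0, H)  # truncate values to [0, H]
            policy[h] = Q.argmax(axis=1)
            V = Q.max(axis=1)

        # Step 4: execute the optimistic policy to collect a new trajectory.
        s = env.reset()
        for h in range(H):
            a = policy[h, s]
            s_next, r, terminated = env.step(a)
            counts_sas[h, s, a, s_next] += 1
            counts_sa[h, s, a] += 1
            reward_sums[h, s, a] += r
            s = s_next
            if terminated:
                break

    return policy
```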

10.4.3 Performance of UCBVI

How exactly does UCBVI strike a good balance between exploration and exploitation? In UCB for MABs, the bonus term is simple to interpret: it encourages the learner to pull arms whose mean reward it is uncertain about. Here, the policy depends on the bonus term only indirectly: the policy is obtained by planning in an MDP where the bonus term is added to the reward function. Note that the bonuses propagate backwards in DP, effectively enabling the learner to plan to explore unknown states. This effect requires some further interpretation.

Recall we constructed \(b^t_h\) so that, with high probability, \(V^\star_h(s) \le \widehat{V}_h^t(s)\) and so

\[ V^\star_h(s) - V^{\pi^t}_h(s) \le \widehat{V}_h^t(s) - V^{\pi^t}_h(s). \]

That is, the l.h.s. measures how suboptimal policy \(\pi^t\) is in the true environment, while the r.h.s. is the difference in the policy’s value when acting in the modelled MDP \(\widetilde{\mathcal{M}}^t\) instead of the true one \(\mathcal{M}\).

If the r.h.s. is small, this implies that the l.h.s. difference is also small, i.e. that \(\pi^t\) is exploiting actions that are giving high reward.

If the r.h.s. is large, then we have overestimated the value: \(\pi^t\), the optimal policy of \(\widetilde{\mathcal{M}}^t\), does not perform well in the true environment \(\mathcal{M}\). This means that one of the \(b_h^t(s, a)\) terms must be large, or some \(\widehat P^t_h(\cdot \mid s, a)\) must be inaccurate; either way, there is a state-action pair with a low visit count \(N^t_h(s, a)\) that the learner was encouraged to explore.

It turns out that UCBVI achieves the following regret bound.

Theorem 10.2 (UCBVI regret) The expected regret of UCBVI satisfies

\[ \mathop{\mathbb{E}}\left[ \sum_{t= 0}^{T- 1} \left(V^\star_0(s_0) - V^{\pi^t}_0(s_0) \right) \right] = \widetilde{O}(H^2 \sqrt{|\mathcal{S}| |\mathcal{A}| T}) \]

Comparing this to the UCB regret bound \(\widetilde{O}(\sqrt{T K})\), where \(K\) is the number of arms of the MAB, we see that we’ve reduced the number of effective arms from \(|\mathcal{A}|^{|\mathcal{S}|H}\) (in eq. 10.3) to \(H^4 |\mathcal{S}||\mathcal{A}|\), which is indeed polynomial in \(|\mathcal{S}|\), \(|\mathcal{A}|\), and \(H\), as desired. This is also roughly the number of episodes it takes to achieve constant-order average regret:

\[ \frac{1}{T} \mathop{\mathbb{E}}[\text{Regret}_T] = \widetilde{O}\left(\sqrt{\frac{H^4 |\mathcal{S}||\mathcal{A}|}{T}}\right) \]

Note that the time-dependent transition matrix has \(H |\mathcal{S}|^2 |\mathcal{A}|\) entries. Assuming \(H \ll |\mathcal{S}|\), this shows that it’s possible to achieve low regret, and achieve a near-optimal policy, while only understanding a \(1/|\mathcal{S}|\) fraction of the world’s dynamics.

10.5 Linear MDPs

A polynomial dependency on \(|\mathcal{S}|\) and \(|\mathcal{A}|\) is manageable when the state and action spaces are small. But for large or continuous state and action spaces, even this polynomial factor will become intractable. Can we find algorithms that don’t depend on \(|\mathcal{S}|\) or \(|\mathcal{A}|\) at all, effectively reducing the dimensionality of the MDP? In this section, we’ll explore linear MDPs: an example of a parameterized MDP where the rewards and state transitions depend only on some parameter space of dimension \(d\) that is independent from \(|\mathcal{S}|\) or \(|\mathcal{A}|\).

Definition 10.4 (Linear MDP) We assume that the transition probabilities and rewards are linear in some feature vector

\(\phi(s, a) \in \mathbb{R}^d\):

\[ \begin{aligned} P_h(s' \mid s, a) & = \phi(s, a)^\top \mu^\star_h(s') \\ r_h(s, a) & = \phi(s, a)^\top \theta_h^\star \end{aligned} \]

Note that we can also think of \(\mu^\star_h\) as an \(|\mathcal{S}| \times d\) matrix, so that \(P_h(\cdot \mid s, a) = \mu^\star_h \phi(s, a)\), and think of \(\mu^\star_h(s')\) as the \(s'\)-th row of this matrix (treated as a column vector). Thinking of \(V^\star_{h+1}\) as an \(|\mathcal{S}|\)-dimensional vector, this allows us to write

\[ \mathop{\mathbb{E}}_{s' \sim P_h(\cdot \mid s, a)}[V^\star_{h+1}(s')] = (\mu^\star_h\phi(s, a))^\top V^\star_{h+1}. \]

The \(\phi\) feature mapping can be designed to capture interactions between the state \(s\) and action \(a\). In this book, we’ll assume that the feature map \(\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d\) and the reward function (described by \(\theta_h^\star\)) are known to the learner.

10.5.1 Planning in a linear MDP

It turns out that \(Q^\star_h\) is also linear with respect to this feature mapping. We can prove this by simply computing it using DP. We initialize the value function at the end of the time horizon by setting \(V_{H}^\star(s) = 0\) for all states \(s\). Then we iterate:

\[ \begin{aligned} Q^\star_h(s, a) & = r_h(s, a) + \mathop{\mathbb{E}}_{s' \sim P_h(\cdot \mid s, a)} [V^\star_{h+1}(s')] \\ & = \phi(s, a)^\top \theta_h^\star + (\mu_h^\star \phi(s, a))^\top V^\star_{h+1} \\ & = \phi(s, a)^\top \underbrace{( \theta_h^\star + (\mu_h^\star)^\top V^\star_{h+1})}_{w_h} \\ V^\star_h(s) & = \max_a Q^\star_h(s, a) \\ \pi^\star_h(s) & = \arg\max_a Q^\star_h(s, a) \end{aligned} \]
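
As a minimal sketch, the recursion above can be carried out purely in terms of the weight vectors \(w_h\) when \(\phi\), \(\mu^\star\), and \(\theta^\star\) are known (the array shapes below are our own conventions):

```python
import numpy as np


def linear_mdp_dp(phi, mu, theta, horizon):
    """DP in a known linear MDP, representing Q_h^*(s, a) = phi(s, a)^T w_h.

    `phi` has shape (S, A, d), `mu` has shape (H, S, d), `theta` has shape (H, d).
    """
    num_states, num_actions, d = phi.shape
    V = np.zeros(num_states)                  # V_H^* = 0
    weights = np.zeros((horizon, d))
    for h in reversed(range(horizon)):
        w = theta[h] + mu[h].T @ V            # w_h = theta_h^* + (mu_h^*)^T V_{h+1}^*
        Q = phi @ w                           # Q_h^*(s, a), shape (S, A)
        V = Q.max(axis=1)                     # V_h^*(s)
        weights[h] = w
    return weights
```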

Exercise 10.1 (Action-value function is linear in features) Show that \(Q^\pi_h\) is also linear with respect to \(\phi(s, a)\) for any policy \(\pi\).

10.5.2 UCBVI in a linear MDP

10.5.2.1 Modeling the transitions

This linear assumption on the MDP will also allow us to model the unknown dynamics \(P_h(s' \mid s, a)\) with techniques from supervised learning (SL). Recall that SL is useful for estimating conditional expectations by minimizing mean squared error. We can rephrase the estimation of \(P_h(s' \mid s, a)\) as a least-squares problem as follows: Write \(\delta_s\) to denote a one-hot vector in \(\mathbb{R}^{|\mathcal{S}|}\), with a \(1\) in the \(s\)-th entry and \(0\) everywhere else. Note that

\[ \mathop{\mathbb{E}}_{s' \sim P_h(\cdot \mid s, a)} [\delta_{s'}] = P_h(\cdot \mid s, a) = \mu_h^\star \phi(s, a). \]

Furthermore, since the expectation here is linear with respect to \(\phi(s, a)\), we can directly apply least-squares multi-target linear regression to construct the estimate

\[ \widehat \mu^t_h = \arg\min_{\mu \in \mathbb{R}^{|\mathcal{S}| \times d}} \sum_{i= 0}^{t- 1} \|\mu \phi(s_h^i, a_h^i) - \delta_{s_{h+1}^i} \|_2^2. \]

This has a well-known closed-form solution:

\[ \begin{aligned} (\widehat \mu^t_h)^\top & = (A_h^t)^{-1} \sum_{i=0}^{t-1} \phi(s_h^i, a_h^i) \delta_{s_{h+1}^i}^\top \\ \text{where} \quad A_h^t & = \sum_{i=0}^{t-1} \phi(s_h^i, a_h^i) \phi(s_h^i, a_h^i)^\top + \lambda I \end{aligned} \]

where we include a \(\lambda I\) term to ensure that the matrix \(A^t_h\) is invertible. (This can also be derived by adding a \(\lambda \|\mu\|_{\text{F}}^2\) regularization term to the objective.) We can directly plug in this estimate into \(\widehat{P}^t_h(\cdot \mid s, a) = \widehat \mu^t_h \phi(s, a)\).
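
In code, this closed-form estimate is a regularized least-squares solve. The variable names and the one-hot encoding of the targets are our own conventions:

```python
import numpy as np


def estimate_mu(features, next_states, num_states, reg=1.0):
    """Ridge-regression estimate of mu_h from features phi(s_h^i, a_h^i) and observed s_{h+1}^i.

    `features` has shape (n, d); `next_states` is an integer array of length n.
    """
    n, d = features.shape
    A = features.T @ features + reg * np.eye(d)           # A_h^t = sum_i phi_i phi_i^T + lambda I
    targets = np.eye(num_states)[next_states]             # one-hot vectors delta_{s'}, shape (n, S)
    mu_hat_T = np.linalg.solve(A, features.T @ targets)   # (A_h^t)^{-1} sum_i phi_i delta_{s'_i}^T
    return mu_hat_T.T                                      # estimated mu, shape (S, d)
```

The estimated transition distribution is then `mu_hat @ phi_sa` for a given feature vector `phi_sa`.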

10.5.2.2 Reward bonus

Now, to design the reward bonus, we can no longer apply Hoeffding’s inequality, since the estimates are no longer sample means of bounded random variables; instead, we are sharing information across different states and actions. We can construct an upper bound using Chebyshev’s inequality in the same way we did for the LinUCB algorithm in the MAB setting (Section 4.9.1):

\[ b^t_h(s, a) = \beta \sqrt{\phi(s, a)^\top (A^t_h)^{-1} \phi(s, a)}, \quad \beta = \widetilde O(d H). \]

Note that this isn’t explicitly inversely proportional to \(N_h^t(s, a)\) as in the original UCBVI bonus term (eq. 10.10). Rather, it is inversely proportional to the amount that the direction \(\phi(s, a)\) has been explored in the history. That is, if \(A_h^t\) has a large component in the direction \(\phi(s, a)\), implying that this direction is well explored, then the bonus term will be small, and vice versa.
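
A one-line sketch of this bonus, reusing the regularized design matrix \(A_h^t\) from the previous snippet (the choice of \(\beta\) is left as a parameter):

```python
import numpy as np


def linear_bonus(phi_sa, A, beta):
    """Elliptical bonus: beta * sqrt(phi(s, a)^T (A_h^t)^{-1} phi(s, a))."""
    return beta * np.sqrt(phi_sa @ np.linalg.solve(A, phi_sa))
```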

We can now plug in these transition estimates and reward bonuses into the UCBVI algorithm def. 10.3.

Theorem 10.3 (LinUCBVI regret) The LinUCBVI algorithm achieves expected regret

\[ \mathop{\mathbb{E}}[\text{Regret}_T] = \mathop{\mathbb{E}}\left[\sum_{t= 0}^{T- 1} V^\star_0(s_0) - V^{\pi^t}_0(s_0) \right] \le \widetilde O(H^2 d^{1.5} \sqrt{T}) \]

Comparing this to our bound for UCBVI in an environment without this linear assumption, we see that we go from a sample complexity of \(\widetilde \Omega(H^4 |\mathcal{S}||\mathcal{A}|)\) to \(\widetilde \Omega(H^4 d^{3})\). This new sample complexity only depends on the feature dimension and not on the state or action space of the MDP!

10.6 Key takeaways

We first discussed the explore-then-exploit algorithm (def. 10.1), a simple way to explore a deterministic MDP by visiting all state-action pairs. This is essentially a graph traversal algorithm, where the states are the nodes and the state-action pairs (transitions) are the edges of the graph. We then discussed how to treat an unknown MDP as a MAB (Section 10.3), and how this approach is inefficient since it doesn’t make use of correlations between different policies. We then introduced the UCBVI algorithm (def. 10.3), the key algorithm of this chapter, which models the unknown MDP by a proxy MDP with a reward bonus term that encourages exploration. Finally, assuming that the transitions and rewards are linear with respect to a feature transformation of the state and action, we introduced the LinUCBVI algorithm (Section 10.5.2), which has a sample complexity independent of the size of the state and action spaces. This makes it possible to scale up UCBVI to large problems that have a simple underlying structure.

10.7 Bibliographic notes and further reading

Sparse reward problems are frequent throughout reinforcement learning. The chain MDP example is from Thrun (1992). One of the most famous sparse reward problems is Montezuma’s Revenge, one of the tasks in the popular arcade learning environment (ALE) benchmark of Atari 2600 games (Bellemare et al., 2013; Machado et al., 2018). These were first solved by algorithms that explicitly encourage exploration (Bellemare et al., 2016; Burda et al., 2018).

The issue of reward hacking is one of many possible concerns relating to AI safety. We refer the reader to Amodei et al. (2016) for an overview of such risks. Reward hacking has been empirically demonstrated in large language model training (Gao et al., 2023).

The UCBVI algorithm was first presented in Azar et al. (2017). UCBVI extends the UCB algorithm from multi-armed bandits to the MDP by estimating a model of the environment. Later work by Drago et al. (2025) improved the regret bound on UCBVI. Other model-based methods for strategic exploration have been studied at least since Schmidhuber (1991) and Meyer & Wilson (1991). UCBVI computes the reward bonus using the count of the number of times that state-action pair has been visited. Tang et al. (2017) surveys other such count-based exploration algorithms.

It is also possible to encourage model-free algorithms to strategically explore. Badia et al. (2020) designed a Q-learning algorithm with exploration incentives that surpassed the human baseline on the challenging Atari tasks.

Intrinsic motivation is another family of approaches to strategic exploration. In some sense, intrinsic motivation approaches are to RL as self-supervised approaches are to unsupervised learning: typically, we add some intrinsic reward to the objective function that encourages the policy to explore. See Schmidhuber (2010) and Aubret et al. (2019) for surveys of this family of methods.

We refer the reader to the survey article Ladosz et al. (2022) for further reading on exploration in RL.