10  Exploration in MDPs

10.1 Introduction

One of the key challenges of reinforcement learning is the exploration-exploitation tradeoff. Should we exploit actions we know will give high reward, or should we explore different actions to discover potentially better strategies? An algorithm that doesn’t explore effectively might easily overfit to certain areas of the state space, and fail to generalize once it enters a region it hasn’t yet seen.

In the multi-armed bandit setting (Chapter 4), we studied the upper confidence bound (UCB) algorithm (Section 4.7) that incentivizes the learner to explore arms that it is uncertain about. In particular, UCB relies on optimism in the face of uncertainty: it chooses arms based on an overestimate of the arm’s true mean reward. In this chapter, we will see how to generalize this idea to the MDP setting.


10.1.1 Sparse reward

Exploration is crucial in sparse reward problems where \(r(s, a) = 0\) for most (or nearly all) states and actions. Often, the agent must take a specific sequence of actions before any reward is observed.

Example 10.1 (Chain MDP) Here’s a simple example of an MDP with sparse rewards:

An illustration of the chain MDP environment.

There are \(|\mathcal{S}|\) cells arranged in a chain. The agent starts in the leftmost cell. The rightmost state is a terminal state. In every state, there are three possible actions, two of which move the agent left and one which moves the agent right. (The two “left” actions do nothing in the leftmost cell.) The reward function gives a reward of \(1\) for taking the action that enters the rightmost cell, and zero otherwise.
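
To make this concrete, here is a minimal sketch of the chain MDP as a Python environment. The class name and the `reset`/`step` interface are our own choices, not part of the example:

```python
class ChainMDP:
    """Chain of `num_states` cells; only the action entering the rightmost cell is rewarded."""

    def __init__(self, num_states: int = 10):
        self.num_states = num_states
        self.state = 0

    def reset(self) -> int:
        """Start a new episode in the leftmost cell."""
        self.state = 0
        return self.state

    def step(self, action: int):
        """Actions 0 and 1 move left (a no-op in the leftmost cell); action 2 moves right."""
        if action == 2:
            next_state = min(self.state + 1, self.num_states - 1)
        else:
            next_state = max(self.state - 1, 0)
        reward = 1.0 if next_state == self.num_states - 1 else 0.0
        terminated = next_state == self.num_states - 1   # rightmost cell is terminal
        self.state = next_state
        return next_state, reward, terminated
```

A uniformly random policy moves right only a third of the time, so it takes roughly exponentially many episodes (in the chain length) before it stumbles upon the reward by chance.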

The problem of sparse rewards is especially prevalent in RL compared to supervised learning. In most supervised learning tasks, every labelled sample provides some useful signal. However, in RL, algorithms that don’t systematically explore new states may fail to learn anything meaningful within a reasonable amount of time.

Consider the algorithms we’ve covered so far for unknown environments: policy gradient methods (Chapter 7) and fitted DP methods (Chapter 6). How would these do on this problem?

Remark 10.1 (Policy gradient methods fail on sparse reward). Policy gradient algorithms require the gradient to be nonzero in order to learn. If we never observe any reward, the gradient will always be zero, and the policy will never change or improve. If we think of the expected total reward as a function \(J(\theta)\) of the policy parameters, we can visualize the graph of \(J\) as being mostly flat, making it impossible to “climb the hill” from almost every random initialization.

Remark 10.2 (Fitted DP methods fail on sparse reward). Fitted DP algorithms run into a similar issue: as we randomly interact with the environment, we never observe any reward, and so the reward model simply gives zero for every state-action pair. In expectation, it would take a computationally infeasible number of rollouts to observe the reward by chance.

This is quite disheartening! The sophisticated methods we’ve developed, which can exceed human-level performance on a wide variety of tasks, fail on this problem that seems almost trivial.

Of course, a simple way to solve the “chain MDP” in ex. 10.1 is to actively visit unseen states. For a policy that visits a new state in each rollout, the final cell can be reached in \(O(|\mathcal{S}|)\) rollouts (i.e. \(O(|\mathcal{S}|^2)\) time). The rest of this chapter will consider ways to explicitly explore unknown states.

10.1.2 Reward shaping

One workaround to sparse reward problems in practice is to shape the reward function using domain knowledge. For example, in ex. 10.1, we (that is, the practitioners) know that travelling to the right is the correct action, so we could design a reward function that provides a reward of \(0.1\) for the action that moves to the right. A similar strategy is used in practice for many chess or board game algorithms where capturing the opponent’s pieces earns some positive reward.

Though this might seem obvious, designing a useful reward function can be challenging in practice. The agent may learn to exploit the intermediate rewards rather than solve the original goal. A famous example is the agent trained to play the CoastRunners game, in which players race boats around a racetrack. However, the algorithm found that it could achieve higher reward (i.e. in-game score) by refusing to go around the racetrack and instead collecting bonus points in a loop!

An RL agent collects bonus points instead of participating in the race. Image from Clark & Amodei (2024).

This phenomenon is known as reward hacking or Goodhart’s law. Reward hacking is essentially a special case of “finding loopholes” around the written guidelines (or in this case, the reward signal used for training); think of folk stories such as King Midas or the Monkey’s Paw. When RL algorithms are deployed in high-stakes scenarios, it is crucial to verify the learned policy’s behaviour and ensure that it is aligned to the designer’s intentions.

10.2 Exploration in deterministic MDPs

Let us address the exploration problem in a deterministic MDP, that is, where taking action \(a\) in state \(s\) always leads to the state \(P(s, a) \in \mathcal{S}\). How can we methodically visit every single state-action pair?

In the multi-armed bandit setting (Chapter 4), there are no states, so it’s trivial to visit every “state-action pair”: just pull each arm once. But in the MDP setting, in order to achieve a particular state-action pair \((s, a)\), one must plan out a path from the initial state.

We can do this by constructing an MDP in which only unseen state-action pairs are rewarded, and using value iteration/dynamic programming (Section 2.3.2) to plan a path towards them. Concretely, we keep a set \(\mathcal{D}\) of all the \((s, a, r, s')\) tuples we’ve observed. Each episode, we use \(\mathcal{D}\) to construct a fully known MDP, \(M_{\mathcal{D}}\), in which only state-action pairs not in \(\mathcal{D}\) are rewarded.

Definition 10.1 (Explore-then-exploit algorithm) Suppose that every state can be reached from the initial state within a single episode.

  1. \(\mathcal{D} \gets \emptyset\)
  2. For \(T = 0, 1, 2, \dots\) (until the entire MDP has been explored):
    1. Construct \(M_{\mathcal{D}}\) using \(\mathcal{D}\). That is, the state transitions are set to those observed in \(\mathcal{D}\), and the reward is set to \(0\) for all state-action pairs in \(\mathcal{D}\), and \(1\) otherwise.
    2. Execute DP (Section 2.3.2) on the known MDP \(M_{\mathcal{D}}\) to compute the optimal policy \(\pi^\star_{\mathcal{D}}\).
    3. Execute \(\pi^\star_{\mathcal{D}}\) in the true environment. This will visit some \((s, a)\) not yet in \(\mathcal{D}\), at which point we observe the reward \(r(s, a)\) and next state \(P(s, a)\).
    4. \(\mathcal{D} \gets \mathcal{D} \cup \{ (s, a, r, s') \}\), where \(s' = P(s, a), r = r(s, a)\) are the observed state transition and reward.
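
Below is a minimal sketch of this procedure for a deterministic environment with the `reset`/`step` interface used above. The code structure is our own: we treat unseen pairs as terminal with optimistic reward \(1\) when planning, stop once no unseen pair is reachable, and mark the actions of terminal states as explored so the planner does not keep targeting them.

```python
def explore_then_exploit(env, num_states, num_actions, horizon):
    """Explore a deterministic MDP by repeatedly planning toward unseen state-action pairs."""
    D = {}  # observed transitions: (s, a) -> (reward, next_state)

    while True:
        # Dynamic programming on the optimistic MDP M_D: seen pairs give reward 0,
        # unseen pairs give reward 1 (and, being unknown, end the planned episode).
        V = [0.0] * num_states                          # V_H = 0
        policy = [[0] * num_states for _ in range(horizon)]
        for h in reversed(range(horizon)):
            V_next, V = V, [0.0] * num_states
            for s in range(num_states):
                values = []
                for a in range(num_actions):
                    if (s, a) in D:
                        _, s_next = D[(s, a)]
                        values.append(V_next[s_next])   # reward 0 plus future value
                    else:
                        values.append(1.0)              # optimistic reward for an unseen pair
                best_action = max(range(num_actions), key=lambda a: values[a])
                policy[h][s] = best_action
                V[s] = values[best_action]

        s = env.reset()
        if V[s] == 0.0:
            return D                                    # no unseen pair is reachable: done

        # Execute the greedy policy in the true environment, recording new transitions.
        for h in range(horizon):
            a = policy[h][s]
            s_next, r, terminated = env.step(a)
            D.setdefault((s, a), (r, s_next))
            if terminated:
                # Actions at a terminal state can never be taken; mark them as seen
                # (harmless self-loops) so the planner stops targeting them.
                for a_any in range(num_actions):
                    D.setdefault((s_next, a_any), (0.0, s_next))
                break
            s = s_next
```

On the chain MDP, each episode discovers at least one new state-action pair, so exploration finishes after at most \(|\mathcal{S}||\mathcal{A}|\) episodes.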

Remark 10.3 (Path planning is graph traversal). Review the dynamic programming algorithm for a finite-horizon MDP (Section 2.3.2). Note that in the constructed MDP \(M_{\mathcal{D}}\), this is identical to a breadth-first search beginning from the desired state at the final timestep: each state-timestep pair is a node in the graph, and the state transitions determine the (directed) edges. Each state-timestep pair from which it is possible to reach the desired state is assigned a value of \(1\). The policy serves to backtrack through these state-timestep pairs, returning to the root node of the search: the desired state.

We can easily measure the per-episode regret of this algorithm.

Definition 10.2 (Per-episode regret) We aim to evaluate some iterative policy optimization algorithm. Let \(\pi^t\) be the policy returned by the algorithm after \(t\) iterations. The per-episode regret across \(T\) iterations is given by

\[ \text{Regret}_T= \mathop{\mathbb{E}}_{s_0 \sim P_0}\left[ \sum_{t= 0}^{T- 1} V^\star_0(s_0) - V^{\pi^t}_0(s_0) \right] \tag{10.1}\]

where the randomness is in the initial state distribution.

Remark 10.4 (MDP policies as MAB arms). What does this have to do with the definition of regret in the MAB setting (def. 4.3)? Here, policies are arms, and the “mean reward” is the expected total reward of a trajectory. We’ll make this connection more explicit in Section 10.3.

Theorem 10.1 (Performance of explore-then-exploit) The regret of the explore-then-exploit algorithm (def. 10.1) can be upper-bounded by

\[ \sum_{t= 0}^{T- 1} V^\star_0 - V_0^{\pi^t} \le |\mathcal{S}||\mathcal{A}| H. \tag{10.2}\]

(This MDP and algorithm are deterministic, assuming there is a single starting state, so the regret is not random.)

Proof. As long as every state can be reached from \(s_0\) within a single episode, i.e. \(|\mathcal{S}| \le H\), the algorithm in def. 10.1 will eventually be able to explore all \(|\mathcal{S}| |\mathcal{A}|\) state-action pairs, adding one new transition per episode.

Let \(M\) denote the original MDP that we aim to solve. We know it will take at most \(|\mathcal{S}| |\mathcal{A}|\) iterations to explore the entire MDP, after which \(M_{\mathcal{D}} = M\) and \(\pi^\star_{\mathcal{D}}\) is the optimal policy in \(M\), incurring no additional regret. For each “shortest-path” policy \(\pi^\star_{\mathcal{D}}\) up until then, its value will differ from that of \(\pi^\star\) by at most \(H\), since the policies will differ by at most \(1\) reward at each timestep.

10.3 Treating an unknown MDP as a MAB

We explored the exploration-exploitation tradeoff in the multi-armed bandits setting (Chapter 4). Can we apply the MAB algorithms we discovered to MDPs as well? Let us formally describe an unknown MDP as an MAB problem.

In a MAB problem, we want to find the arm with the highest mean reward. In an MDP, we want to find the policy that achieves the highest expected total reward. So if we want to apply MAB techniques to solving an MDP, it makes sense to draw an equivalence between arms and policies. We can summarize this equivalence in the following table:

Table 10.1: Treating an MDP with finite states and actions as a MAB.
| MAB | MDP |
|---|---|
| \(K\) arms | \((\vert\mathcal{A}\vert^{\vert\mathcal{S}\vert})^H\) deterministic policies |
| unknown reward distributions \(\nu^k\) | unknown trajectory distributions \(\rho^\pi\) |
| \(k^\star = \arg\max_{k \in [K]} \mathop{\mathbb{E}}_{r \sim \nu^k} [r]\) | \(\pi^\star = \arg\max_{\pi \in \Pi} \mathop{\mathbb{E}}_{\tau \sim \rho^\pi} \left[ \sum_{h=0}^{H-1} r(s_h, a_h) \right]\) |
| pull arm \(k\) and observe reward | roll out with \(\pi\) and observe total reward |

(For the sake of this example, assume that the MDP’s reward function is stochastic, so that the MAB reward distributions are nondegenerate.)

Recall that UCB incurs regret \(\widetilde{O}(\sqrt{TK})\), where \(T\) is the number of pulls and \(K\) is the number of arms. So in the MDP-as-MAB problem, using UCB for \(T\) episodes would achieve regret

\[ \widetilde{O}\left( \sqrt{|\mathcal{A}|^{|\mathcal{S}|H} T} \right) \tag{10.3}\]

This scales exponentially in \(|\mathcal{S}|\) and \(H\), which quickly becomes intractable. Notably, this method treats each policy as entirely independent from the others, but the performance of different policies is typically correlated. We can illustrate this with the following example:

Example 10.2 (Treating an MDP as a MAB) Consider a “coin MDP” with two states “heads” and “tails”, two actions “Y” and “N”, and a time horizon of \(H=2\). The state transition flips the coin, and doesn’t depend on the action. The reward only depends on the action: Taking action Y gives reward \(1\), and taking action N gives reward \(0\).

Suppose we collect data from the two constant policies \(\pi_{\text{Y}}(s) = \text{Y}\) and \(\pi_{\text{N}}(s) = \text{N}\). Now we want to learn about the policy \(\widetilde{\pi}\) that takes action Y and then N. Do we need to collect data from \(\widetilde{\pi}\) to evaluate it? No: Since the reward only depends on the action, we can infer its value from our data on the policies \(\pi_{\text{Y}}\) and \(\pi_{\text{N}}\). However, if we treat the MDP as a bandit in which \(\widetilde{\pi}\) is a new, unknown arm, we ignore the known correlation between the action and the reward.
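
To get a sense of the scale of eq. 10.3, we can count the number of deterministic policies for even a modest MDP. The specific numbers below are purely illustrative:

```python
import math

num_states, num_actions, horizon = 10, 2, 10

# Number of deterministic time-varying policies: one action per (state, timestep) pair.
num_policies = num_actions ** (num_states * horizon)
print(f"{float(num_policies):.3e}")                      # ~1.268e+30

# Rough scale of the UCB regret bound sqrt(K T) with K = num_policies and T = 10^6 episodes.
print(f"{math.sqrt(num_policies * 1e6):.3e}")            # ~1.1e+18
```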

10.4 Upper confidence bound value iteration

We shouldn’t need to consider all \(|\mathcal{A}|^{|\mathcal{S}| H}\) deterministic policies to achieve low regret. Rather, all we need to describe the optimal policy is \(Q^\star\), which has \(H |\mathcal{S}||\mathcal{A}|\) entries to be learned. In this section, we’ll study the upper confidence bound value iteration (UCBVI) algorithm (Azar et al., 2017), which indeed achieves polynomial regret in \(|\mathcal{S}|\), \(|\mathcal{A}|\), and \(H\).

As its name suggests, UCBVI combines the upper confidence bound (UCB) algorithm from the multi-armed bandits setting (Section 4.7) with value iteration (VI) from the MDP setting (Section 2.3.2):

  • UCB strikes a good exploration-exploitation tradeoff in an (unknown) MAB;
  • VI (what we simply call DP in the finite-horizon setting) computes the optimal value function in a known MDP (with finite states and actions).

Let us briefly review these two algorithms:

Remark 10.5 (Review of UCB). At each iteration \(t\), for each arm \(k\), we construct a confidence interval for the mean of arm \(k\)’s reward distribution. We then choose the arm with the highest upper confidence bound:

\[ \begin{aligned} k_{t+1} &\gets \arg\max_{k \in [K]} \text{ucb}^k_t \\ \text{where } \text{ucb}^k_t &= \frac{R^{k}_t}{N^{k}_t} + \sqrt{\frac{\ln(2t/\delta)}{2 N^{k}_t}} \end{aligned} \tag{10.4}\]

where \(N_t^k\) indicates the number of times arm \(k\) has been pulled up until time \(t\), \(R_t^k\) indicates the total reward obtained by pulling arm \(k\) up until time \(t\), and \(\delta > 0\) controls the width of the confidence interval.
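
As a small sketch, the arm-selection rule in eq. 10.4 can be written as follows, assuming (as is standard) that every arm has already been pulled at least once:

```python
import numpy as np


def ucb_arm(total_rewards, pull_counts, t, delta=0.05):
    """Select the arm with the highest upper confidence bound (eq. 10.4).

    `total_rewards[k]` is R_t^k and `pull_counts[k]` is N_t^k; both are NumPy arrays.
    """
    means = total_rewards / pull_counts
    bonuses = np.sqrt(np.log(2 * t / delta) / (2 * pull_counts))
    return int(np.argmax(means + bonuses))
```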

We can treat the upper confidence bound as a “proxy reward” that is the estimated mean reward plus a bonus exploration term. Since the size of the bonus term is proportional to our uncertainty (i.e. predicted variance) about that arm’s mean, this is called an optimistic bonus. In UCBVI, we will extend this idea to the case of an unknown MDP \(\mathcal{M}\) by adding an exploration term to the estimated reward function. Then, we will use DP to solve for the optimal policy in the resulting modelled MDP \(\widetilde{\mathcal{M}}\).

Remark 10.6 (Review of VI/DP). Value iteration (VI) is a dynamic programming (DP) algorithm for computing the optimal policy and value function in an MDP where the state transitions and reward function are known. We begin at the final timestep, where \(V_H^\star(s) = 0\), and work backwards using Bellman’s optimality equations (Theorem 2.5):

For \(h= H-1, \dots, 0\):

\[ \begin{aligned} Q_h^\star(s, a) &= r(s, a) + \mathop{\mathbb{E}}_{s' \sim P(\cdot \mid s, a)}[V_{h+1}^\star(s')] \\ \pi_h^\star(s) &= \arg\max_{a \in \mathcal{A}} Q_h^\star(s, a) \\ V_h^\star(s) &= Q_h^\star(s, \pi_h^\star(s)). \end{aligned} \tag{10.5}\]
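
In code, this backward recursion is a few lines of NumPy for a known tabular MDP (here with time-invariant `P` and `r`, matching eq. 10.5; the array-shape conventions are our own):

```python
import numpy as np


def value_iteration(P, r, horizon):
    """Finite-horizon DP (eq. 10.5). `P` has shape (S, A, S); `r` has shape (S, A)."""
    num_states, num_actions, _ = P.shape
    V = np.zeros(num_states)                  # V_H^* = 0
    pi = np.zeros((horizon, num_states), dtype=int)
    for h in reversed(range(horizon)):
        Q = r + P @ V                         # Q_h^*(s, a) = r(s, a) + E[V_{h+1}^*(s')]
        pi[h] = Q.argmax(axis=1)              # greedy policy pi_h^*
        V = Q.max(axis=1)                     # V_h^*(s)
    return pi, V                              # V is V_0^*
```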

Assumptions: We will consider the general case of a time-varying MDP where the transition and reward functions may change over time. Recall our convention that \(P_h\) is the distribution of \(s_{h+1} \mid s_{h}, a_{h}\) and \(r_h\) is applied to \(s_h, a_h\).

Definition 10.3 (UCBVI) At a high level, the UCBVI algorithm can be described as follows: For \(t = 0, \dots, T-1\):

  1. Modeling: Use previously collected data to model the state transitions \(\widehat{P}_0, \dots, \widehat{P}_{H-1}\) and reward functions \(\widehat{r}_0, \dots, \widehat{r}_{H-1}\).
  2. Reward bonus: Design a reward bonus \(b_h(s, a) \in \mathbb{R}\) to encourage exploration, analogous to the UCB term.
  3. Optimistic planning: Use VI (i.e. DP) to compute the optimal policy \(\widehat \pi\) in the modelled MDP

\[ \widetilde{\mathcal{M}} = (\mathcal{S}, \mathcal{A}, \{ \widehat{P}_h\}_{h \in [H]}, \{ \widehat{r}_h+ b_h\}_{h \in [H]}, H). \]

  4. Execution: Use \(\widehat \pi\) to collect a new trajectory.

We detail each of these steps below.

10.4.1 Modeling the transitions

Recall that we don’t know the state transitions or reward function of the MDP we aim to solve. We seek to approximate

\[ P_h(s_{h+ 1} \mid s_h, a_h) = \frac{\mathbb{P}(s_h, a_h, s_{h+ 1})}{\mathbb{P}(s_h, a_h)}, \tag{10.6}\]

where \(\mathbb{P}\) denotes the true joint probabilities. We can estimate these using their sample probabilities across a set of collected transitions. That is, define

\[ \begin{aligned} N_h^t(s, a, s') & := \sum_{i=0}^{t-1} \mathbf{1}\left\{ (s_h^{i}, a_h^{i}, s_{h+1}^{i}) = (s, a, s') \right\} \\ N_h^t(s, a) & := \sum_{i=0}^{t-1} \mathbf{1}\left\{ (s_h^{i}, a_h^{i}) = (s, a) \right\} \\ \end{aligned} \tag{10.7}\]

to be the number of times the tuple \(s, a, s'\) appears in the first \(t\) episodes of collected data, and similarly for the state-action pair \(s, a\). Then we can model

\[ \widehat{P}_h^t(s' \mid s, a) = \frac{N_h^t(s, a, s')}{N_h^t(s, a)}. \tag{10.8}\]

Similarly, we can model the rewards by the sample mean in each state-action pair:

\[ \widehat{r}_h^t(s, a) = \frac{1}{N_h^t(s, a)} \sum_{i=0}^{t-1} \mathbf{1}\left\{ (s_h^i, a_h^i) = (s, a) \right\} r_h^i. \tag{10.9}\]

This is a fairly naive, nonparametric estimator that doesn’t assume any underlying structure of the MDP. We’ll see how to incorporate assumptions about the MDP in the following section.
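
A minimal sketch of these count-based estimates, assuming the collected transitions are stored as `(h, s, a, r, s_next)` tuples (the data format is our own choice):

```python
import numpy as np


def estimate_model(transitions, num_states, num_actions, horizon):
    """Empirical transition and reward estimates (eq. 10.8 and eq. 10.9).

    `transitions` is a list of (h, s, a, r, s_next) tuples collected so far.
    """
    counts_sas = np.zeros((horizon, num_states, num_actions, num_states))  # N_h^t(s, a, s')
    counts_sa = np.zeros((horizon, num_states, num_actions))               # N_h^t(s, a)
    reward_sums = np.zeros((horizon, num_states, num_actions))
    for h, s, a, r, s_next in transitions:
        counts_sas[h, s, a, s_next] += 1
        counts_sa[h, s, a] += 1
        reward_sums[h, s, a] += r
    N = np.maximum(counts_sa, 1)               # avoid dividing by zero for unvisited pairs
    P_hat = counts_sas / N[..., None]          # estimated P_h^t(s' | s, a)
    r_hat = reward_sums / N                    # estimated r_h^t(s, a)
    return P_hat, r_hat, counts_sa
```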

10.4.2 Reward bonus

To motivate the reward bonus term, recall how we designed the reward bonus term for UCB (Section 4.7):

  1. We used Hoeffding’s inequality to bound, with high probability, how far the sample mean \(\widehat \mu_t^k\) deviated from the true mean \(\mu^k\).
  2. By inverting this inequality, we obtained a \((1-\delta)\)-confidence interval for the true mean, centered at our estimate.
  3. To make this bound uniform across all timesteps \(t \in [T]\), we applied the union bound and multiplied \(\delta\) by a factor of \(T\).

We’d like to do the same for UCBVI, and construct the bonus term such that \(V^\star_h(s) \le \widehat{V}_h^t(s)\) with high probability. However, our construction will be more complex than the MAB case, since \(\widehat{V}_h^t(s)\) depends on the bonus \(b_h^t(s, a)\) implicitly via DP. We claim that the bonus term that gives the proper bound is

\[ b_h^t(s, a) = 2 H\sqrt{\frac{\log( |\mathcal{S}||\mathcal{A}|HT/\delta )}{N_h^t(s, a)}}. \tag{10.10}\]

We provide a heuristic sketch of the proof in Section B.2; see Agarwal et al. (2022), Section 7.3 for a full proof.
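
Putting the pieces together, here is a self-contained sketch of tabular UCBVI for an episodic environment with a `reset`/`step` interface like the `ChainMDP` above. The function signature is our own, and truncating the optimistic values to \([0, H]\) is a standard practical detail not spelled out in the text:

```python
import numpy as np


def ucbvi(env, num_states, num_actions, horizon, num_episodes, delta=0.05):
    """Tabular UCBVI sketch (def. 10.3)."""
    S, A, H, T = num_states, num_actions, horizon, num_episodes
    counts_sas = np.zeros((H, S, A, S))     # N_h^t(s, a, s')
    counts_sa = np.zeros((H, S, A))         # N_h^t(s, a)
    reward_sums = np.zeros((H, S, A))       # running sum of observed rewards at (h, s, a)

    for t in range(T):
        # Step 1: model the transitions and rewards from counts (Section 10.4.1).
        N = np.maximum(counts_sa, 1)                       # avoid dividing by zero
        P_hat = counts_sas / N[..., None]
        r_hat = reward_sums / N

        # Step 2: optimistic reward bonus (eq. 10.10).
        bonus = 2 * H * np.sqrt(np.log(S * A * H * T / delta) / N)

        # Step 3: optimistic planning by DP in the modelled MDP.
        V = np.zeros(S)
        policy = np.zeros((H, S), dtype=int)
        for h in reversed(range(H)):
            Q = np.clip(r_hat[h] + bonus[h] + P_hat[h] @ V, 0.0, H)  # truncate values to [0, H]
            policy[h] = Q.argmax(axis=1)
            V = Q.max(axis=1)

        # Step 4: execute the optimistic policy to collect a new trajectory.
        s = env.reset()
        for h in range(H):
            a = policy[h, s]
            s_next, r, terminated = env.step(a)
            counts_sas[h, s, a, s_next] += 1
            counts_sa[h, s, a] += 1
            reward_sums[h, s, a] += r
            s = s_next
            if terminated:
                break

    return policy
```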

10.4.3 Performance of UCBVI

How exactly does UCBVI strike a good balance between exploration and exploitation? In UCB for MABs, the bonus term is simple to interpret: it encourages the learner to pull arms whose mean reward it is uncertain about. Here, the policy depends on the bonus term only indirectly: the policy is obtained by planning in an MDP where the bonus term is added to the reward function. Note that the bonuses propagate backwards in DP, effectively enabling the learner to plan to explore unknown states. This effect requires some further interpretation.

Recall we constructed \(b^t_h\) so that, with high probability, \(V^\star_h(s) \le \widehat{V}_h^t(s)\) and so

\[ V^\star_h(s) - V^{\pi^t}_h(s) \le \widehat{V}_h^t(s) - V^{\pi^t}_h(s). \]

That is, the l.h.s. measures how suboptimal policy \(\pi^t\) is in the true environment, while the r.h.s. is the difference in the policy’s value when acting in the modelled MDP \(\widetilde{\mathcal{M}}^t\) instead of the true one \(\mathcal{M}\).

If the r.h.s. is small, this implies that the l.h.s. difference is also small, i.e. that \(\pi^t\) is exploiting actions that are giving high reward.

If the r.h.s. is large, then we have overestimated the value: \(\pi^t\), the optimal policy of \(\widetilde{\mathcal{M}}^t\), does not perform well in the true environment \(\mathcal{M}\). This means that one of the \(b_h^t(s, a)\) terms must be large, or some \(\widehat P^t_h(\cdot \mid s, a)\) must be inaccurate; either way, there is a state-action pair with a low visit count \(N^t_h(s, a)\) that the learner was encouraged to explore.

It turns out that UCBVI achieves the following regret bound.

Theorem 10.2 (UCBVI regret) The expected regret of UCBVI satisfies

\[ \mathop{\mathbb{E}}\left[ \sum_{t= 0}^{T- 1} \left(V^\star_0(s_0) - V^{\pi^t}_0(s_0) \right) \right] = \widetilde{O}(H^2 \sqrt{|\mathcal{S}| |\mathcal{A}| T}) \]

Comparing this to the UCB regret bound \(\widetilde{O}(\sqrt{T K})\), where \(K\) is the number of arms of the MAB, we see that we’ve reduced the number of effective arms from \(|\mathcal{A}|^{|\mathcal{S}|H}\) (in eq. 10.3) to \(H^4 |\mathcal{S}||\mathcal{A}|\), which is indeed polynomial in \(|\mathcal{S}|\), \(|\mathcal{A}|\), and \(H\), as desired. This is also roughly the number of episodes it takes to achieve constant-order average regret:

\[ \frac{1}{T} \mathop{\mathbb{E}}[\text{Regret}_T] = \widetilde{O}\left(\sqrt{\frac{H^4 |\mathcal{S}||\mathcal{A}|}{T}}\right) \]

Note that the time-dependent transition matrix has \(H |\mathcal{S}|^2 |\mathcal{A}|\) entries. Assuming \(H \ll |\mathcal{S}|\), this shows that it’s possible to achieve low regret, and achieve a near-optimal policy, while only understanding a \(1/|\mathcal{S}|\) fraction of the world’s dynamics.

10.5 Linear MDPs

A polynomial dependency on \(|\mathcal{S}|\) and \(|\mathcal{A}|\) is manageable when the state and action spaces are small. But for large or continuous state and action spaces, even this polynomial factor will become intractable. Can we find algorithms that don’t depend on \(|\mathcal{S}|\) or \(|\mathcal{A}|\) at all, effectively reducing the dimensionality of the MDP? In this section, we’ll explore linear MDPs: an example of a parameterized MDP where the rewards and state transitions depend only on some parameter space of dimension \(d\) that is independent from \(|\mathcal{S}|\) or \(|\mathcal{A}|\).

Definition 10.4 (Linear MDP) We assume that the transition probabilities and rewards are linear in some feature vector

\(\phi(s, a) \in \mathbb{R}^d\):

\[ \begin{aligned} P_h(s' \mid s, a) & = \phi(s, a)^\top \mu^\star_h(s') \\ r_h(s, a) & = \phi(s, a)^\top \theta_h^\star \end{aligned} \]

Note that we can also think of \(\mu^\star_h\) as an \(|\mathcal{S}| \times d\) matrix, so that \(P_h(\cdot \mid s, a) = \mu^\star_h \phi(s, a)\), and think of \(\mu^\star_h(s')\) as the \(s'\)-th row of this matrix (treated as a column vector). Thinking of \(V^\star_{h+1}\) as an \(|\mathcal{S}|\)-dimensional vector, this allows us to write

\[ \mathop{\mathbb{E}}_{s' \sim P_h(\cdot \mid s, a)}[V^\star_{h+1}(s')] = (\mu^\star_h\phi(s, a))^\top V^\star_{h+1}. \]

The \(\phi\) feature mapping can be designed to capture interactions between the state \(s\) and action \(a\). In this book, we’ll assume that the feature map \(\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d\) and the reward function (described by \(\theta_h^\star\)) are known to the learner.

10.5.1 Planning in a linear MDP

It turns out that \(Q^\star_h\) is also linear with respect to this feature mapping. We can prove this by simply computing it using DP. We initialize the value function at the end of the time horizon by setting \(V_{H}^\star(s) = 0\) for all states \(s\). Then we iterate:

\[ \begin{aligned} Q^\star_h(s, a) & = r_h(s, a) + \mathop{\mathbb{E}}_{s' \sim P_h(\cdot \mid s, a)} [V^\star_{h+1}(s')] \\ & = \phi(s, a)^\top \theta_h^\star + (\mu_h^\star \phi(s, a))^\top V^\star_{h+1} \\ & = \phi(s, a)^\top \underbrace{( \theta_h^\star + (\mu_h^\star)^\top V^\star_{h+1})}_{w_h} \\ V^\star_h(s) & = \max_a Q^\star_h(s, a) \\ \pi^\star_h(s) & = \arg\max_a Q^\star_h(s, a) \end{aligned} \]
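
As a minimal sketch, the recursion above can be carried out purely in terms of the weight vectors \(w_h\) when \(\phi\), \(\mu^\star\), and \(\theta^\star\) are known (the array shapes below are our own conventions):

```python
import numpy as np


def linear_mdp_dp(phi, mu, theta, horizon):
    """DP in a known linear MDP, representing Q_h^*(s, a) = phi(s, a)^T w_h.

    `phi` has shape (S, A, d), `mu` has shape (H, S, d), `theta` has shape (H, d).
    """
    num_states, num_actions, d = phi.shape
    V = np.zeros(num_states)                  # V_H^* = 0
    weights = np.zeros((horizon, d))
    for h in reversed(range(horizon)):
        w = theta[h] + mu[h].T @ V            # w_h = theta_h^* + (mu_h^*)^T V_{h+1}^*
        Q = phi @ w                           # Q_h^*(s, a), shape (S, A)
        V = Q.max(axis=1)                     # V_h^*(s)
        weights[h] = w
    return weights
```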

Exercise 10.1 (Action-value function is linear in features) Show that \(Q^\pi_h\) is also linear with respect to \(\phi(s, a)\) for any policy \(\pi\).

10.5.2 UCBVI in a linear MDP

10.5.2.1 Modeling the transitions

This linear assumption on the MDP will also allow us to model the unknown dynamics \(P_h(s' \mid s, a)\) with techniques from supervised learning (SL). Recall that SL is useful for estimating conditional expectations by minimizing mean squared error. We can rephrase the estimation of \(P_h(s' \mid s, a)\) as a least-squares problem as follows: Write \(\delta_s\) to denote a one-hot vector in \(\mathbb{R}^{|\mathcal{S}|}\), with a \(1\) in the \(s\)-th entry and \(0\) everywhere else. Note that

\[ \mathop{\mathbb{E}}_{s' \sim P_h(\cdot \mid s, a)} [\delta_{s'}] = P_h(\cdot \mid s, a) = \mu_h^\star \phi(s, a). \]

Furthermore, since the expectation here is linear with respect to \(\phi(s, a)\), we can directly apply least-squares multi-target linear regression to construct the estimate

\[ \widehat \mu^t_h = \arg\min_{\mu \in \mathbb{R}^{|\mathcal{S}| \times d}} \sum_{i= 0}^{t- 1} \|\mu \phi(s_h^i, a_h^i) - \delta_{s_{h+1}^i} \|_2^2. \]

This has a well-known closed-form solution:

\[ \begin{aligned} (\widehat \mu^t_h)^\top & = (A_h^t)^{-1} \sum_{i=0}^{t-1} \phi(s_h^i, a_h^i) \delta_{s_{h+1}^i}^\top \\ \text{where} \quad A_h^t & = \sum_{i=0}^{t-1} \phi(s_h^i, a_h^i) \phi(s_h^i, a_h^i)^\top + \lambda I \end{aligned} \]

where we include a \(\lambda I\) term to ensure that the matrix \(A^t_h\) is invertible. (This can also be derived by adding a \(\lambda \|\mu\|_{\text{F}}^2\) regularization term to the objective.) We can directly plug in this estimate into \(\widehat{P}^t_h(\cdot \mid s, a) = \widehat \mu^t_h \phi(s, a)\).
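
In code, this closed-form estimate is a regularized least-squares solve. The variable names and the one-hot encoding of the targets are our own conventions:

```python
import numpy as np


def estimate_mu(features, next_states, num_states, reg=1.0):
    """Ridge-regression estimate of mu_h from features phi(s_h^i, a_h^i) and observed s_{h+1}^i.

    `features` has shape (n, d); `next_states` is an integer array of length n.
    """
    n, d = features.shape
    A = features.T @ features + reg * np.eye(d)           # A_h^t = sum_i phi_i phi_i^T + lambda I
    targets = np.eye(num_states)[next_states]             # one-hot vectors delta_{s'}, shape (n, S)
    mu_hat_T = np.linalg.solve(A, features.T @ targets)   # (A_h^t)^{-1} sum_i phi_i delta_{s'_i}^T
    return mu_hat_T.T                                      # estimated mu, shape (S, d)
```

The estimated transition distribution is then `mu_hat @ phi_sa` for a given feature vector `phi_sa`.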

10.5.2.2 Reward bonus

Now, to design the reward bonus, we can no longer apply Hoeffding’s inequality, since the estimates are no longer sample means of bounded random variables; instead, we are sharing information across different states and actions. We can construct an upper bound using Chebyshev’s inequality in the same way we did for the LinUCB algorithm in the MAB setting (Section 4.9.1):

\[ b^t_h(s, a) = \beta \sqrt{\phi(s, a)^\top (A^t_h)^{-1} \phi(s, a)}, \quad \beta = \widetilde O(d H). \]

Note that this isn’t explicitly inversely proportional to \(N_h^t(s, a)\) as in the original UCBVI bonus term (eq. 10.10). Rather, it is inversely proportional to the amount that the direction \(\phi(s, a)\) has been explored in the history. That is, if \(A_h^t\) has a large component in the direction \(\phi(s, a)\), implying that this direction is well explored, then the bonus term will be small, and vice versa.
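
A one-line sketch of this bonus, reusing the regularized design matrix \(A_h^t\) from the previous snippet (the choice of \(\beta\) is left as a parameter):

```python
import numpy as np


def linear_bonus(phi_sa, A, beta):
    """Elliptical bonus: beta * sqrt(phi(s, a)^T (A_h^t)^{-1} phi(s, a))."""
    return beta * np.sqrt(phi_sa @ np.linalg.solve(A, phi_sa))
```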

We can now plug in these transition estimates and reward bonuses into the UCBVI algorithm def. 10.3.

Theorem 10.3 (LinUCBVI regret) The LinUCBVI algorithm achieves expected regret

\[ \mathop{\mathbb{E}}[\text{Regret}_T] = \mathop{\mathbb{E}}\left[\sum_{t= 0}^{T- 1} V^\star_0(s_0) - V^{\pi^t}_0(s_0) \right] \le \widetilde O(H^2 d^{1.5} \sqrt{T}) \]

Comparing this to our bound for UCBVI in an environment without this linear assumption, we see that we go from a sample complexity of \(\widetilde \Omega(H^4 |\mathcal{S}||\mathcal{A}|)\) to \(\widetilde \Omega(H^4 d^{3})\). This new sample complexity only depends on the feature dimension and not on the state or action space of the MDP!

10.6 Key takeaways

We first discussed the explore-then-exploit algorithm (def. 10.1), a simple way to explore a deterministic MDP by visiting all state-action pairs. This is essentially a graph traversal algorithm, where the states are the nodes and the state-action pairs (transitions) are the edges of the graph. We then discussed how to treat an unknown MDP as a MAB (Section 10.3), and how this approach is inefficient since it doesn’t make use of correlations between different policies. We then introduced the UCBVI algorithm (def. 10.3), the key algorithm of this chapter, which models the unknown MDP by a proxy MDP with a reward bonus term that encourages exploration. Finally, assuming that the transitions and rewards are linear with respect to a feature transformation of the state and action, we introduced the LinUCBVI algorithm (Section 10.5.2), which has a sample complexity independent of the size of the state and action spaces. This makes it possible to scale up UCBVI to large problems that have a simple underlying structure.

10.7 Bibliographic notes and further reading

Sparse reward problems are frequent throughout reinforcement learning. The chain MDP example is from Thrun (1992). One of the most famous sparse reward problems is Montezuma’s Revenge, one of the tasks in the popular arcade learning environment (ALE) benchmark of Atari 2600 games (Bellemare et al., 2013; Machado et al., 2018). These were first solved by algorithms that explicitly encourage exploration (Bellemare et al., 2016; Burda et al., 2018).

The issue of reward hacking is one of many possible concerns relating to AI safety. We refer the reader to Amodei et al. (2016) for an overview of such risks. Reward hacking has been empirically demonstrated in large language model training (Gao et al., 2023).

The UCBVI algorithm was first presented in Azar et al. (2017). UCBVI extends the UCB algorithm from multi-armed bandits to the MDP by estimating a model of the environment. Later work by Drago et al. (2025) improved the regret bound on UCBVI. Other model-based methods for strategic exploration have been studied at least since Schmidhuber (1991) and Meyer & Wilson (1991). UCBVI computes the reward bonus using the count of the number of times that state-action pair has been visited. Tang et al. (2017) surveys other such count-based exploration algorithms.

It is also possible to encourage model-free algorithms to strategically explore. Badia et al. (2020) designed a Q-learning algorithm with exploration incentives that surpassed the human baseline on the challenging Atari tasks.

Intrinsic motivation is another family of approaches to strategic exploration. In some sense, intrinsic motivation approaches are to RL as self-supervised approaches are to unsupervised learning: typically, we add some intrinsic reward to the objective function that encourages the policy to explore. See Schmidhuber (2010) and Aubret et al. (2019) for surveys of this family of methods.

We refer the reader to the survey article Ladosz et al. (2022) for further reading on exploration in RL.