
9 Exploration in MDPs

9.1 Introduction

One of the key challenges of reinforcement learning is the exploration-exploitation tradeoff: should we exploit actions we know will give high reward, or should we explore different actions to discover potentially better strategies? An algorithm that doesn't explore effectively might easily overfit to certain areas of the state space and fail to generalize once it enters a region it hasn't yet seen. The algorithms we saw in 5 Fitted Dynamic Programming Algorithms suffer from this issue.

In 3 Multi-Armed Bandits, where the state never changes so all we care about are the actions, we saw algorithms like the upper confidence bound (UCB) algorithm (Section 3.6) and Thompson sampling that incentivize the learner to explore arms that it is uncertain about. In this chapter, we will see how to generalize these ideas to the MDP setting.

9.1.1 Sparse reward

Exploration is especially crucial in sparse reward problems, where reward doesn't come until after many steps; algorithms that do not systematically explore new states may fail to learn anything meaningful within a reasonable amount of time.

For example, policy gradient algorithms require the gradient to be nonzero in order to learn. If we never observe any reward, the gradient will always be zero, and the policy will never change or improve.

9.1.2 Exploration in deterministic MDPs

Let us address the exploration problem in a deterministic MDP where taking action $a$ in state $s$ always leads to the state $P(s, a) \in \mathcal{S}$. In this simple setting, there will be no "automatic" exploration due to randomness, so our strategy must actively explore new states. One simple strategy is to visit every possible state-action pair to learn the entire MDP. Then, once the MDP is known, we can use DP to solve for the optimal policy. (This should remind you of the Section 3.4 algorithm.)
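As a rough illustration, here is a minimal sketch of this "visit everything, then plan" strategy in Python. It assumes a hypothetical deterministic environment interface (`env.reset()` returning the start state and `env.step(a)` returning the next state and reward), hashable states, and episodes long enough to replay a route to any discovered state; it is a sketch of the idea, not the chapter's formal algorithm.

```python
import numpy as np

def explore_then_exploit(env, num_actions, horizon):
    """Visit every reachable (s, a) pair of a deterministic MDP, then plan by DP.

    Assumes a hypothetical deterministic environment with env.reset() -> start state
    and env.step(a) -> (s_next, reward); states are hashable (e.g. integers).
    """
    model = {}                       # learned model: (s, a) -> (s_next, reward)
    path = {env.reset(): []}         # state -> known action sequence reaching it from the start

    # Exploration: while some discovered state has an untried action, go try it.
    while True:
        frontier = [(s, a) for s in path for a in range(num_actions) if (s, a) not in model]
        if not frontier:
            break
        s, a = frontier[0]
        env.reset()
        for prev_a in path[s]:       # replay the known route to s (valid since P is deterministic)
            env.step(prev_a)
        s_next, r = env.step(a)
        model[(s, a)] = (s_next, r)
        if s_next not in path:
            path[s_next] = path[s] + [a]

    # Exploitation: finite-horizon DP on the fully learned model.
    V = {s: 0.0 for s in path}
    policy = [{} for _ in range(horizon)]
    for h in reversed(range(horizon)):
        V_new = {}
        for s in path:
            q = [model[(s, a)][1] + V[model[(s, a)][0]] for a in range(num_actions)]
            policy[h][s] = int(np.argmax(q))
            V_new[s] = max(q)
        V = V_new
    return policy
```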

9.2 Treating an unknown MDP as a MAB

We also explored the exploration-exploitation tradeoff in 3 Multi-Armed Bandits. Recall that in the MAB setting, we have $K$ arms, each of which has an unknown reward distribution, and we want to learn which of the arms is optimal, i.e. has the highest mean reward.

One algorithm that struck a good balance between exploration and exploitation was the upper confidence bound algorithm (Section 3.6): for each arm, we construct a confidence interval for its true mean reward, and then choose the arm with the highest upper confidence bound. In summary,

$$k_{t+1} \gets \arg\max_{k \in [K]} \frac{R^{k}_t}{N^{k}_t} + \sqrt{\frac{\ln(2t/\delta)}{2 N^{k}_t}}$$

where $N_t^k$ indicates the number of times arm $k$ has been pulled up until time $t$, $R_t^k$ indicates the total reward obtained by pulling arm $k$ up until time $t$, and $\delta > 0$ controls the width of the confidence interval. How might we extend UCB to the MDP case?
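As a concrete reference point before extending this to MDPs, here is a minimal sketch of the arm-selection rule above, assuming each arm has already been pulled at least once so that every $N^k_t > 0$:

```python
import numpy as np

def ucb_arm(R, N, t, delta):
    """Select the next arm by upper confidence bound.

    R[k] is the total reward collected from arm k so far, and N[k] is the number
    of times arm k has been pulled (assumed > 0, e.g. after one initial pull of each arm).
    """
    bounds = R / N + np.sqrt(np.log(2 * t / delta) / (2 * N))
    return int(np.argmax(bounds))
```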

Let us formally describe an unknown MDP as an MAB problem. In an unknown MDP, we want to learn which policy is optimal. So if we want to apply MAB techniques to solving an MDP, it makes sense to think of arms as policies. There are $K = (|\mathcal{A}|^{|\mathcal{S}|})^H$ deterministic policies in a finite MDP. Then, "pulling" arm $\pi$ corresponds to using $\pi$ to act through a trajectory in the MDP, and observing the total reward.

Recall that UCB incurs regret $\tilde{O}(\sqrt{TK})$, where $T$ is the number of pulls and $K$ is the number of arms. So in the MDP-as-MAB problem, using UCB for $T$ episodes would achieve regret

$$\tilde{O}\left(\sqrt{|\mathcal{A}|^{|\mathcal{S}| H} T}\right)$$

This scales exponentially in $|\mathcal{S}|$ and $H$, which quickly becomes intractable. Notably, this method doesn't consider the information that we gain across different policies: each policy is treated as an unrelated arm, even though similar policies share much of their behaviour.

9.3 UCB-VI

The approach above is inefficient: we shouldn't need to consider all $|\mathcal{A}|^{|\mathcal{S}| H}$ deterministic policies to achieve low regret. Rather, all we need to describe the optimal policy is $Q^\star$, which has $H |\mathcal{S}||\mathcal{A}|$ entries to be learned. Can we borrow ideas from UCB to reduce the regret to this order (i.e. polynomial in $|\mathcal{S}|$, $|\mathcal{A}|$, and $H$)?

One way to frame the UCB algorithm is that, when choosing arms, we optimize over a proxy reward that is the sum of the estimated mean reward and an exploration term. In the UCB-VI algorithm, we will extend this idea to the case of an unknown MDP $\mathcal{M}^{?}$ by modelling a proxy MDP $\tilde{\mathcal{M}}$ with a reward function that encourages exploration. Then, we will use DP to solve for the optimal policy in $\tilde{\mathcal{M}}$.

Assumptions: For simplicity, here we assume the reward function of $\mathcal{M}^{?}$ is known, so we only need to model the state transitions, though the rewards can be modelled similarly. We will also consider the more general case of a time-varying MDP, where the transition and reward functions can change over time. We take the convention that $P_h$ is the distribution of $s_{h+1} \mid s_h, a_h$ and that $r_h$ is applied to $s_h, a_h$.

At a high level, the UCB-VI algorithm can be described as follows:

  1. Modelling: Use previous data to model the transitions $\hat{P}_0, \dots, \hat{P}_{H-1}$.

  2. Reward bonus: Design a reward bonus $b_h(s, a) \in \mathbb{R}$ to encourage exploration, analogous to the UCB term.

  3. Optimistic planning: Use DP to compute the optimal policy $\hat\pi_h(s)$ in the modelled MDP

$$\tilde{\mathcal{M}} = (\mathcal{S}, \mathcal{A}, \{ \hat{P}_h \}_{h \in [H]}, \{ r_h + b_h \}_{h \in [H]}, H).$$
  4. Execution: Use $\hat\pi_h(s)$ to collect a new trajectory, and repeat.

We detail each of these steps below. The full definition follows in (9.16).

9.3.1 Modelling the transitions

We seek to approximate $P_h(s_{h+1} \mid s_h, a_h) = \frac{\mathbb{P}(s_h, a_h, s_{h+1})}{\mathbb{P}(s_h, a_h)}$. We can estimate these using their sample probabilities from the dataset. That is, define

$$\begin{aligned} N_h^t(s, a, s') & := \sum_{i=0}^{t-1} \mathbf{1}\left\{ (s_h^i, a_h^i, s_{h+1}^i) = (s, a, s') \right\} \\ N_h^t(s, a) & := \sum_{i=0}^{t-1} \mathbf{1}\left\{ (s_h^i, a_h^i) = (s, a) \right\} \end{aligned}$$

Then we can model

$$\hat{P}_h^t(s' \mid s, a) = \frac{N_h^t(s, a, s')}{N_h^t(s, a)}.$$
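A minimal sketch of these count-based estimates, assuming trajectories are stored as lists of integer `(s, a, s_next)` tuples; the uniform fallback for unvisited pairs is an assumption of the sketch, not part of the definition above:

```python
import numpy as np

def estimate_transitions(trajectories, S, A, H):
    """Count-based estimates P_hat_h(s' | s, a) = N_h(s, a, s') / N_h(s, a).

    `trajectories` is assumed to be a list of episodes, each a length-H list of
    (s, a, s_next) tuples with integer states and actions.
    """
    counts = np.zeros((H, S, A, S))                   # N_h(s, a, s')
    for episode in trajectories:
        for h, (s, a, s_next) in enumerate(episode):
            counts[h, s, a, s_next] += 1
    visits = counts.sum(axis=-1, keepdims=True)       # N_h(s, a)
    # Unvisited (s, a) pairs get a uniform fallback (an assumption of this sketch).
    P_hat = np.where(visits > 0, counts / np.maximum(visits, 1), 1.0 / S)
    return P_hat, visits.squeeze(-1)
```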

9.3.2 Reward bonus

To motivate the reward bonus term $b_h^t(s, a)$, recall how we designed the reward bonus term for UCB:

  1. We used Hoeffding's inequality to bound, with high probability, how far the sample mean $\hat\mu_t^k$ deviated from the true mean $\mu^k$.

  2. By inverting this inequality, we obtained a $(1-\delta)$-confidence interval for the true mean, centered at our estimate.

  3. To make this bound uniform across all timesteps $t \in [T]$, we applied the union bound, which amounts to replacing $\delta$ with $\delta / T$ in the per-timestep bound.

We'd like to do the same for UCB-VI, and construct the bonus term such that $V^\star_h(s) \le \hat{V}_h^t(s)$ with high probability. However, our construction will be more complex than the MAB case, since $\hat{V}_h^t(s)$ depends on the bonus $b_h^t(s, a)$ implicitly via DP. We claim that the bonus term that gives the proper bound is

$$b_h^t(s, a) = 2 H \sqrt{\frac{\log( |\mathcal{S}||\mathcal{A}| H T/\delta )}{N_h^t(s, a)}}.$$

We will only provide a heuristic sketch of the proof; see Agarwal et al. (2022) (Section 7.3) for a full proof.
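For illustration, a minimal sketch of this bonus as a function of the visit counts; assigning the value $H$ to unvisited pairs is an assumption of the sketch, since the formula above is undefined when $N_h^t(s, a) = 0$:

```python
import numpy as np

def ucbvi_bonus(visits, H, S, A, T, delta):
    """Reward bonus b_h^t(s, a) = 2 H sqrt(log(S A H T / delta) / N_h^t(s, a)).

    `visits` is an array of visit counts N_h^t(s, a). Unvisited pairs get a bonus
    of H (an assumption of this sketch, since the formula is undefined there).
    """
    log_term = np.log(S * A * H * T / delta)
    return np.where(visits > 0, 2 * H * np.sqrt(log_term / np.maximum(visits, 1)), H)
```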

9.3.3 Definition

Putting these parts together, we can define the algorithm as follows:

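Below is a minimal sketch of the algorithm in Python, combining the four steps above. It assumes tabular (integer) states and actions, a known reward function bounded in $[0, 1]$, and a hypothetical episodic environment interface (`env.reset`, `env.step`, `env.reward`); it is an illustration of the idea rather than a verbatim transcription of the formal definition (9.16).

```python
import numpy as np

def ucb_vi(env, S, A, H, T, delta):
    """A minimal sketch of UCB-VI with count-based transition estimates.

    Assumes integer states/actions, rewards in [0, 1] known via env.reward(h, s, a),
    and a hypothetical episodic interface env.reset() -> s, env.step(a) -> s_next.
    """
    counts = np.zeros((H, S, A, S))                     # N_h^t(s, a, s')
    rewards = np.array([[[env.reward(h, s, a) for a in range(A)]
                         for s in range(S)] for h in range(H)])
    for t in range(T):
        # 1. Modelling: count-based transition estimates (uniform fallback if unvisited).
        visits = counts.sum(axis=-1)                    # N_h^t(s, a)
        P_hat = counts / np.maximum(visits, 1)[..., None]
        P_hat[visits == 0] = 1.0 / S

        # 2. Reward bonus: b_h^t(s, a) = 2 H sqrt(log(S A H T / delta) / N_h^t(s, a)).
        bonus = 2 * H * np.sqrt(np.log(S * A * H * T / delta) / np.maximum(visits, 1))
        bonus[visits == 0] = H

        # 3. Optimistic planning: DP in the proxy MDP with rewards r + b and transitions P_hat.
        V = np.zeros(S)
        policy = np.zeros((H, S), dtype=int)
        for h in reversed(range(H)):
            Q = rewards[h] + bonus[h] + P_hat[h] @ V    # shape (S, A)
            Q = np.minimum(Q, H - h)                    # true values never exceed H - h when r is in [0, 1]
            policy[h] = Q.argmax(axis=-1)
            V = Q.max(axis=-1)

        # 4. Execution: roll out the optimistic policy and record the observed transitions.
        s = env.reset()
        for h in range(H):
            a = policy[h, s]
            s_next = env.step(a)
            counts[h, s, a, s_next] += 1
            s = s_next
    return policy
```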

9.3.4 Performance of UCB-VI

How exactly does UCB-VI strike a good balance between exploration and exploitation? In UCB for MABs, the bonus term is simple to interpret: it encourages the learner to pull arms whose reward it is still uncertain about. Here, the policy depends on the bonus term indirectly: the policy is obtained by planning in an MDP where the bonus term is added to the reward function. Note that the bonuses propagate backwards through DP, effectively enabling the learner to plan to explore unknown states. This effect takes some further thought to interpret.

Recall that we constructed $b^t_h$ so that, with high probability, $V^\star_h(s) \le \hat{V}_h^t(s)$, and so

$$V^\star_h(s) - V^{\pi^t}_h(s) \le \hat{V}_h^t(s) - V^{\pi^t}_h(s).$$

That is, the l.h.s. measures how suboptimal policy $\pi^t$ is in the true environment, while the r.h.s. is the difference in the policy's value when acting in the modelled MDP $\tilde{\mathcal{M}}^t$ instead of the true one $\mathcal{M}^{?}$.

If the r.h.s. is small, this implies that the l.h.s. difference is also small, i.e. that $\pi^t$ is exploiting actions that are giving high reward.

If the r.h.s. is large, then we have overestimated the value: $\pi^t$, the optimal policy of $\tilde{\mathcal{M}}^t$, does not perform well in the true environment $\mathcal{M}^{?}$. This indicates that one of the $b_h^t(s, a)$ terms must be large, or some $\hat P^t_h(\cdot \mid s, a)$ must be inaccurate, indicating a state-action pair with a low visit count $N^t_h(s, a)$ that the learner was encouraged to explore.

It turns out that UCB-VI achieves an expected regret over $T$ episodes of

$$\mathbb{E}[\text{Regret}_T] = \tilde{O}\left(\sqrt{H^4 |\mathcal{S}||\mathcal{A}| T}\right)$$

Comparing this to the UCB regret bound $\tilde{O}(\sqrt{T K})$, where $K$ is the number of arms of the MAB, we see that we've reduced the number of effective arms from $|\mathcal{A}|^{|\mathcal{S}| H}$ (in (9.4)) to $H^4 |\mathcal{S}||\mathcal{A}|$, which is indeed polynomial in $|\mathcal{S}|$, $|\mathcal{A}|$, and $H$, as desired. This is also roughly the number of episodes it takes to achieve constant-order average regret:

$$\frac{1}{T} \mathbb{E}[\text{Regret}_T] = \tilde{O}\left(\sqrt{\frac{H^4 |\mathcal{S}||\mathcal{A}|}{T}}\right)$$

Note that the time-dependent transition matrix has $H |\mathcal{S}|^2 |\mathcal{A}|$ entries. Assuming $H \ll |\mathcal{S}|$, this shows that it's possible to achieve low regret, and achieve a near-optimal policy, while only understanding a $1/|\mathcal{S}|$ fraction of the world's dynamics.

9.4 Linear MDPs

A polynomial dependency on $|\mathcal{S}|$ and $|\mathcal{A}|$ is manageable when the state and action spaces are small. But for large or continuous state and action spaces, even this polynomial factor becomes intractable. Can we find algorithms that don't depend on $|\mathcal{S}|$ or $|\mathcal{A}|$ at all, effectively reducing the dimensionality of the MDP? In this section, we'll explore linear MDPs: an example of a parameterized MDP where the rewards and state transitions depend only on some parameter space of dimension $d$ that is independent of $|\mathcal{S}|$ and $|\mathcal{A}|$.

9.4.1 Planning in a linear MDP

Recall that in a linear MDP, there is a feature mapping $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ such that $r_h(s, a) = \phi(s, a)^\top \theta_h^\star$ and $P_h(\cdot \mid s, a) = \mu_h^\star \phi(s, a)$ for some $\theta_h^\star \in \mathbb{R}^d$ and $\mu_h^\star \in \mathbb{R}^{|\mathcal{S}| \times d}$. It turns out that $Q^\star_h$ is also linear with respect to this feature mapping. We can prove this by simply computing it using DP. We initialize $V_H^\star(s) = 0$ for all $s$. Then we iterate:

$$\begin{aligned} Q^\star_h(s, a) & = r_h(s, a) + \mathbb{E}_{s' \sim P_h(\cdot \mid s, a)} [V^\star_{h+1}(s')] \\ & = \phi(s, a)^\top \theta_h^\star + (\mu_h^\star \phi(s, a))^\top V^\star_{h+1} \\ & = \phi(s, a)^\top \underbrace{( \theta_h^\star + (\mu_h^\star)^\top V^\star_{h+1})}_{w_h} \\ V^\star_h(s) & = \max_a Q^\star_h(s, a) \\ \pi^\star_h(s) & = \arg\max_a Q^\star_h(s, a) \end{aligned}$$
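As an illustration, a minimal sketch of this backward recursion, assuming the features, rewards, and transitions are given as dense arrays (`phi` of shape $(|\mathcal{S}|, |\mathcal{A}|, d)$, `theta` of shape $(H, d)$, `mu` of shape $(H, |\mathcal{S}|, d)$); these array layouts are assumptions of the sketch:

```python
import numpy as np

def linear_mdp_dp(phi, theta, mu, H):
    """DP in a known linear MDP, representing each Q*_h by a weight vector w_h.

    Assumed layouts (illustrative): phi has shape (S, A, d), theta shape (H, d),
    and mu shape (H, S, d), so that P_h(. | s, a) = mu[h] @ phi[s, a].
    """
    S, A, d = phi.shape
    V = np.zeros(S)                        # V*_H = 0
    w = np.zeros((H, d))
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        w[h] = theta[h] + mu[h].T @ V      # w_h = theta_h^* + (mu_h^*)^T V*_{h+1}
        Q = phi @ w[h]                     # Q*_h(s, a) = phi(s, a)^T w_h, shape (S, A)
        policy[h] = Q.argmax(axis=-1)
        V = Q.max(axis=-1)
    return w, policy
```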

9.4.2 UCB-VI in a linear MDP

9.4.2.1 Modelling the transitions

This linear assumption on the MDP will also allow us to model the unknown dynamics $P^?_h(s' \mid s, a)$ with techniques from supervised learning (SL). Recall that SL is useful for estimating conditional expectations by minimizing mean squared error. We can rephrase the estimation of $P^?_h(s' \mid s, a)$ as a least-squares problem as follows: Write $\delta_s$ to denote a one-hot vector in $\mathbb{R}^{|\mathcal{S}|}$, with a 1 in the $s$-th entry and 0 everywhere else. Note that

$$\mathbb{E}_{s' \sim P_h(\cdot \mid s, a)} [\delta_{s'}] = P_h(\cdot \mid s, a) = \mu_h^\star \phi(s, a).$$

Furthermore, since the expectation here is linear with respect to $\phi(s, a)$, we can directly apply least-squares multi-target linear regression to construct the estimate

$$\hat \mu = \arg\min_{\mu \in \mathbb{R}^{|\mathcal{S}| \times d}} \sum_{i=0}^{t-1} \|\mu \phi(s_h^i, a_h^i) - \delta_{s_{h+1}^i} \|_2^2.$$

This has a well-known closed-form solution:

$$\begin{aligned} \hat \mu^\top & = (A_h^t)^{-1} \sum_{i=0}^{t-1} \phi(s_h^i, a_h^i) \delta_{s_{h+1}^i}^\top \\ \text{where} \quad A_h^t & = \sum_{i=0}^{t-1} \phi(s_h^i, a_h^i) \phi(s_h^i, a_h^i)^\top + \lambda I \end{aligned}$$

where we include a $\lambda I$ term to ensure that the matrix $A^t_h$ is invertible. (This can also be derived by adding a $\lambda \|\mu\|_{\text{F}}^2$ regularization term to the objective.) We can directly plug this estimate into $\hat{P}^t_h(\cdot \mid s, a) = \hat \mu^t_h \phi(s, a)$.
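A minimal sketch of this ridge-regression estimate for a single timestep $h$, assuming the features $\phi(s_h^i, a_h^i)$ are stacked into an $(n, d)$ array and the successor states are given as integers:

```python
import numpy as np

def estimate_mu(features, next_states, S, lam=1.0):
    """Ridge-regression estimate of mu_h from one timestep's transition data.

    `features` is assumed to be an (n, d) array whose rows are phi(s_h^i, a_h^i),
    and `next_states` an (n,) integer array of successor states s_{h+1}^i.
    """
    n, d = features.shape
    A = features.T @ features + lam * np.eye(d)           # A_h^t = sum_i phi_i phi_i^T + lam I
    targets = np.eye(S)[next_states]                      # one-hot rows delta_{s'}, shape (n, S)
    mu_hat_T = np.linalg.solve(A, features.T @ targets)   # (d, S): A^{-1} sum_i phi_i delta_i^T
    return mu_hat_T.T                                     # (S, d), so P_hat(. | s, a) = mu_hat @ phi(s, a)
```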

9.4.2.2 Reward bonus

Now, to design the reward bonus, we can't apply Hoeffding's inequality anymore, since the terms no longer involve sample means of bounded random variables; instead, we're incorporating information across different states and actions. Rather, we can construct an upper bound using Chebyshev's inequality, in the same way we did for the LinUCB algorithm in the MAB setting (Section 3.8.1):

$$b^t_h(s, a) = \beta \sqrt{\phi(s, a)^\top (A^t_h)^{-1} \phi(s, a)}, \quad \beta = \tilde O(d H).$$

Note that this isn't explicitly inversely proportional to $N_h^t(s, a)$ as in the original UCB-VI bonus term (9.8). Rather, it is inversely proportional to the amount that the direction $\phi(s, a)$ has been explored in the history. That is, if $A_h^t$ has a large component in the direction $\phi(s, a)$, implying that this direction is well explored, then the bonus term will be small, and vice versa.
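For illustration, a minimal sketch of this elliptical bonus, assuming features for every state-action pair are stored in an array of shape $(|\mathcal{S}|, |\mathcal{A}|, d)$ and that $(A_h^t)^{-1}$ has already been computed:

```python
import numpy as np

def elliptical_bonus(phi, A_inv, beta):
    """Bonus beta * sqrt(phi(s, a)^T (A_h^t)^{-1} phi(s, a)) for every (s, a) pair.

    `phi` is assumed to have shape (S, A, d), and `A_inv` is the (d, d) inverse
    of the regularized design matrix A_h^t; beta is the confidence scaling.
    """
    quad = np.einsum("sad,de,sae->sa", phi, A_inv, phi)   # phi^T A^{-1} phi for each (s, a)
    return beta * np.sqrt(quad)
```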

We can now plug these transition estimates and reward bonuses into the UCB-VI algorithm (9.16).

9.4.2.3 Performance

It turns out that LinUCB-VI achieves an expected regret of $\tilde{O}(\sqrt{H^4 d^3 T})$. Comparing this to our bound for UCB-VI in an environment without the linear assumption, we see that we go from a sample complexity of $\tilde \Omega(H^4 |\mathcal{S}||\mathcal{A}|)$ to $\tilde \Omega(H^4 d^{3})$. This new sample complexity depends only on the feature dimension and not on the state or action space of the MDP!

9.5 Summary

In this chapter, we’ve explored how to explore in an unknown MDP.

  • We first discussed the explore-then-exploit algorithm (Definition 9.2), a simple way to explore a deterministic MDP by visiting all state-action pairs.

  • We then discussed how to treat an unknown MDP as a MAB (Section 9.2), and how this approach is inefficient since it doesn't make use of relationships between policies.

  • We then introduced the UCB-VI algorithm (9.16), which models the unknown MDP by a proxy MDP with a reward bonus term that encourages exploration.

  • Finally, assuming that the transitions and rewards are linear with respect to a feature transformation of the state and action, we introduced the LinUCB-VI algorithm (Section 9.4.2), which has a sample complexity independent of the size of the state and action spaces.

References
  1. Agarwal, A., Jiang, N., Kakade, S. M., & Sun, W. (2022). Reinforcement Learning: Theory and Algorithms.