One of the key challenges of reinforcement learning is the exploration-exploitation tradeoff. Should we exploit actions we know will give high reward, or should we explore different actions to discover potentially better strategies? An algorithm that doesn’t explore effectively might easily overfit to certain areas of the state space and fail to generalize once it enters a region it hasn’t yet seen. The algorithms we saw in the chapter on fitted DP (Chapter 5, Fitted Dynamic Programming Algorithms) suffer from this issue.
In the chapter on multi-armed bandits (Chapter 3), where the state never changes and so all we care about are the actions, we saw algorithms like the upper confidence bound (UCB) algorithm (Section 3.6) and Thompson sampling that incentivize the learner to explore arms that it is uncertain about. In this chapter, we will see how to generalize these ideas to the MDP setting.
Exploration is especially crucial in sparse reward problems where reward doesn’t come until after many steps, and algorithms which do not systematically explore new states may fail to learn anything meaningful (within a reasonable amount of time).
For example, policy gradient algorithms require the gradient to be nonzero in order to learn. If we never observe any reward, the gradient will always be zero, and the policy will never change or improve.
Let us address the exploration problem in a deterministic MDP, where taking action $a$ in state $s$ always leads to the state $P(s, a) \in \mathcal{S}$. In this simple setting, there will be no “automatic” exploration due to randomness, so our strategy must actively explore new states. One simple strategy is to visit every possible state-action pair to learn the entire MDP. Then, once the MDP is known, we can use DP to solve for the optimal policy. (This should remind you of the algorithm from Section 3.4.)
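As a concrete illustration, here is a minimal sketch of this strategy for a small tabular deterministic MDP. The `query(s, a)` interface is our own assumption (not part of the original setting): it stands in for whatever mechanism lets the learner reach and try each state-action pair; in a purely episodic setting one would instead have to plan paths to unvisited pairs.

```python
import numpy as np

def explore_then_exploit(query, n_states, n_actions, horizon):
    """Sketch of explore-then-exploit for a small deterministic MDP.

    `query(s, a) -> (next_state, reward)` is an assumed interface that lets us
    try any state-action pair directly (e.g. via a generative model or resets).
    """
    # Phase 1: exploration. Try every (s, a) pair once to record the
    # (deterministic) transition and reward tables.
    P = np.zeros((n_states, n_actions), dtype=int)   # P[s, a] = next state
    R = np.zeros((n_states, n_actions))              # R[s, a] = reward
    for s in range(n_states):
        for a in range(n_actions):
            P[s, a], R[s, a] = query(s, a)

    # Phase 2: exploitation. With the MDP fully known, run backward induction
    # (DP) to compute the optimal time-varying policy.
    V = np.zeros((horizon + 1, n_states))            # V[H] = 0 by convention
    pi = np.zeros((horizon, n_states), dtype=int)
    for h in reversed(range(horizon)):
        Q = R + V[h + 1][P]          # Q[s, a] = r(s, a) + V_{h+1}(P(s, a))
        pi[h] = Q.argmax(axis=1)     # greedy action at timestep h
        V[h] = Q.max(axis=1)
    return pi, V
```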
We also explored the exploration-exploitation tradeoff in the chapter on multi-armed bandits (Chapter 3). Recall that in the MAB setting, we have $K$ arms, each of which has an unknown reward distribution, and we want to learn which of the arms is optimal, i.e. has the highest mean reward.
One algorithm that struck a good balance between exploration and exploitation was the upper confidence bound (UCB) algorithm (Section 3.6): for each arm, we construct a confidence interval for its true mean reward, and then choose the arm with the highest upper confidence bound. In summary,

$$k_{t+1} \gets \arg\max_{k \in [K]} \frac{R^k_t}{N^k_t} + \sqrt{\frac{\ln(2t/\delta)}{2 N^k_t}},$$
where $N^k_t$ indicates the number of times arm $k$ has been pulled up until time $t$, $R^k_t$ indicates the total reward obtained by pulling arm $k$ up until time $t$, and $\delta > 0$ controls the width of the confidence interval. How might we extend UCB to the MDP case?
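As a quick reference, here is a minimal sketch of this arm-selection rule in code (the exact constants inside the logarithm vary across presentations):

```python
import numpy as np

def ucb_pull(total_reward, pull_count, t, delta):
    """Pick the arm with the highest upper confidence bound.

    total_reward[k] = R_t^k and pull_count[k] = N_t^k; we assume every arm
    has been pulled at least once so the counts are positive.
    """
    mean = total_reward / pull_count                            # empirical mean reward of each arm
    width = np.sqrt(np.log(2 * t / delta) / (2 * pull_count))   # Hoeffding-style confidence width
    return int(np.argmax(mean + width))                         # optimistic arm choice
```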
Let us formally describe an unknown MDP as an MAB problem. In an unknown MDP, we want to learn which policy is optimal. So if we want to apply MAB techniques to solving an MDP, it makes sense to think of arms as policies. There are $K = \left(|\mathcal{A}|^{|\mathcal{S}|}\right)^H$ deterministic policies in a finite MDP. Then, “pulling” arm $\pi$ corresponds to using $\pi$ to act through a trajectory in the MDP and observing the total reward.
Recall that UCB incurs regret $\widetilde{O}(\sqrt{TK})$, where $T$ is the number of pulls and $K$ is the number of arms. So in the MDP-as-MAB problem, using UCB for $T$ episodes would achieve regret

$$\widetilde{O}\left(\sqrt{T \cdot |\mathcal{A}|^{|\mathcal{S}|H}}\right). \tag{9.4}$$
This scales exponentially in $|\mathcal{S}|$ and $H$, which quickly becomes intractable. Notably, this method also fails to share information across policies: two policies that differ only in a single state are treated as entirely distinct arms. We can illustrate this with the following example:
The approach above is inefficient: we shouldn’t need to consider all $|\mathcal{A}|^{|\mathcal{S}|H}$ deterministic policies to achieve low regret. Rather, all we need to describe the optimal policy is $Q^\star$, which has $H |\mathcal{S}| |\mathcal{A}|$ entries to be learned. Can we borrow ideas from UCB to reduce the regret to this order (i.e. polynomial in $|\mathcal{S}|$, $|\mathcal{A}|$, and $H$)?
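For a sense of scale, take a small MDP with $|\mathcal{S}| = 10$, $|\mathcal{A}| = 2$, and $H = 10$ (numbers chosen purely for illustration):

$$
K = \left(|\mathcal{A}|^{|\mathcal{S}|}\right)^H = \left(2^{10}\right)^{10} = 2^{100} \approx 10^{30} \quad \text{policies-as-arms,} \qquad
H |\mathcal{S}| |\mathcal{A}| = 10 \cdot 10 \cdot 2 = 200 \quad \text{entries of } Q^\star.
$$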
One way to frame the UCB algorithm is that, when choosing arms, we optimize over a proxy reward that is the sum of the estimated mean reward and an exploration term. In the UCB-VI algorithm, we will extend this idea to the case of an unknown MDP $\mathcal{M}^{?}$ by modelling a proxy MDP $\widetilde{\mathcal{M}}$ with a reward function that encourages exploration. Then, we will use DP to solve for the optimal policy in $\widetilde{\mathcal{M}}$.
Assumptions: For simplicity, here we assume the reward function of $\mathcal{M}^{?}$ is known, so we only need to model the state transitions, though the rewards can be modelled similarly. We will also consider the more general case of a time-varying MDP, where the transition and reward functions can change over time. We take the convention that $P_h$ is the distribution of $s_{h+1} \mid s_h, a_h$ and that $r_h$ is applied to $(s_h, a_h)$.
At a high level, the UCB-VI algorithm can be described as follows:
Modelling: Use previous data to model the transitions $\hat{P}_0, \dots, \hat{P}_{H-1}$.
Reward bonus: Design a reward bonus $b_h(s, a) \in \mathbb{R}$ to encourage exploration, analogous to the UCB exploration term.
Optimistic planning: Use DP to compute the optimal policy $\hat{\pi}_h(s)$ in the modelled MDP

$$\widetilde{\mathcal{M}} = \left(\mathcal{S}, \mathcal{A}, \{\hat{P}_h\}_{h \in [H]}, \{r_h + b_h\}_{h \in [H]}, H\right).$$
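To make these three steps concrete, here is a minimal sketch of one iteration of such a loop. The helper routines `estimate_transitions`, `compute_bonus`, and `optimistic_dp`, as well as the environment interface, are our own naming for the pieces described in the following sections, not part of the original algorithm statement.

```python
import numpy as np

def ucb_vi_episode(data, env, S, A, H, delta):
    """One episode of a UCB-VI-style loop (a sketch; see the helpers below)."""
    # 1. Modelling: estimate P_0, ..., P_{H-1} from the trajectories seen so far.
    P_hat, N = estimate_transitions(data, S, A, H)

    # 2. Reward bonus: larger for rarely visited (s, a) pairs.
    b = compute_bonus(N, S, A, H, T=len(data) + 1, delta=delta)

    # 3. Optimistic planning: DP in the modelled MDP with reward r + b.
    pi = optimistic_dp(P_hat, env.reward, b, H)   # pi[h, s] = action to take

    # Act with the optimistic policy and record the trajectory for future episodes.
    s, trajectory = env.reset(), []
    for h in range(H):
        a = pi[h, s]
        s_next, r = env.step(a)
        trajectory.append((h, s, a, s_next))
        s = s_next
    data.append(trajectory)
    return pi
```

Here `optimistic_dp` is ordinary backward induction, as in the deterministic sketch earlier, except that the value of the next state is an expectation under `P_hat` and the reward is augmented by the bonus `b`.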
We seek to approximate $P_h(s_{h+1} \mid s_h, a_h) = \frac{P(s_h, a_h, s_{h+1})}{P(s_h, a_h)}$. We can estimate these using their sample probabilities from the dataset. That is, define

$$N^t_h(s, a) := \sum_{i=0}^{t-1} \mathbb{1}\left\{(s^i_h, a^i_h) = (s, a)\right\}, \qquad N^t_h(s, a, s') := \sum_{i=0}^{t-1} \mathbb{1}\left\{(s^i_h, a^i_h, s^i_{h+1}) = (s, a, s')\right\},$$

where $s^i_h, a^i_h$ denote the state and action at timestep $h$ of episode $i$, so that the estimated transition probabilities are

$$\hat{P}^t_h(s' \mid s, a) = \frac{N^t_h(s, a, s')}{N^t_h(s, a)}.$$
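A direct implementation of these count-based estimates, assuming the `data` format from the sketch above (a list of trajectories of `(h, s, a, s_next)` tuples):

```python
import numpy as np

def estimate_transitions(data, S, A, H):
    """Empirical model: P_hat[h, s, a, s'] = N_h^t(s, a, s') / N_h^t(s, a)."""
    counts = np.zeros((H, S, A, S))            # N_h^t(s, a, s')
    for trajectory in data:
        for (h, s, a, s_next) in trajectory:
            counts[h, s, a, s_next] += 1
    N = counts.sum(axis=-1)                    # N_h^t(s, a)
    # Unvisited (s, a) pairs get a uniform distribution (one common convention).
    P_hat = np.where(N[..., None] > 0, counts / np.maximum(N[..., None], 1), 1.0 / S)
    return P_hat, N
```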
To motivate the reward bonus term $b^t_h(s, a)$, recall how we designed the reward bonus term for UCB:
We used Hoeffding’s inequality to bound, with high probability, how far the sample mean $\hat{\mu}^k_t$ deviated from the true mean $\mu^k$.
By inverting this inequality, we obtained a $(1-\delta)$-confidence interval for the true mean, centered at our estimate.
To make this bound uniform across all timesteps $t \in [T]$, we applied the union bound, which amounts to replacing $\delta$ with $\delta / T$.
We’d like to do the same for UCB-VI, and construct the bonus term such that $V^\star_h(s) \le \hat{V}^t_h(s)$ with high probability. However, our construction will be more complex than in the MAB case, since $\hat{V}^t_h(s)$ depends on the bonus $b^t_h(s, a)$ implicitly via DP. We claim that the bonus term that gives the proper bound is

$$b^t_h(s, a) = 2 H \sqrt{\frac{\log\left(|\mathcal{S}| |\mathcal{A}| H T / \delta\right)}{N^t_h(s, a)}}. \tag{9.8}$$
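In code, using the same visit-count array `N` as in the estimator above (the constant factor matters for the analysis but not for the shape of the bonus):

```python
import numpy as np

def compute_bonus(N, S, A, H, T, delta):
    """Count-based exploration bonus b_h^t(s, a): large where N_h^t(s, a) is small."""
    log_term = np.log(S * A * H * T / delta)
    # Treat unvisited pairs as if visited once so the bonus stays finite.
    return 2 * H * np.sqrt(log_term / np.maximum(N, 1))
```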
How exactly does UCB-VI strike a good balance between exploration and exploitation? In UCB for MABs, the bonus exploration term is simple to interpret: It encourages the learner to take actions with a high exploration term. Here, the policy depends on the bonus term indirectly: The policy is obtained by planning in an MDP where the bonus term is added to the reward function. Note that the bonuses propagate backwards in DP, effectively enabling the learner to plan to explore unknown states. This effect takes some further interpretation.
Recall that we constructed $b^t_h$ so that, with high probability, $V^\star_h(s) \le \hat{V}^t_h(s)$, and so

$$V^\star_0(s_0) - V^{\pi^t}_0(s_0) \le \hat{V}^t_0(s_0) - V^{\pi^t}_0(s_0).$$
That is, the l.h.s. measures how suboptimal policy $\pi^t$ is in the true environment, while the r.h.s. is the difference in the policy’s value when acting in the modelled MDP $\widetilde{\mathcal{M}}^t$ instead of the true one $\mathcal{M}^{?}$.
If the r.h.s. is small, this implies that the l.h.s. difference is also small, i.e. that $\pi^t$ is exploiting actions that are giving high reward.
If the r.h.s. is large, then we have overestimated the value: $\pi^t$, the optimal policy of $\widetilde{\mathcal{M}}^t$, does not perform well in the true environment $\mathcal{M}^{?}$. This means that one of the $b^t_h(s, a)$ terms must be large, or some $\hat{P}^t_h(\cdot \mid s, a)$ must be inaccurate; either way, there is a state-action pair with a low visit count $N^t_h(s, a)$ that the learner was encouraged to explore.
It turns out that UCB-VI achieves a cumulative regret, over $T$ episodes, of

$$\widetilde{O}\left(\sqrt{H^4 |\mathcal{S}| |\mathcal{A}| T}\right) = \widetilde{O}\left(H^2 \sqrt{|\mathcal{S}| |\mathcal{A}| T}\right).$$
Comparing this to the UCB regret bound $\widetilde{O}(\sqrt{TK})$, where $K$ is the number of arms of the MAB, we see that we’ve reduced the number of effective arms from $|\mathcal{A}|^{|\mathcal{S}|H}$ (in (9.4)) to $H^4 |\mathcal{S}| |\mathcal{A}|$, which is indeed polynomial in $|\mathcal{S}|$, $|\mathcal{A}|$, and $H$, as desired. This is also roughly the number of episodes it takes to achieve constant-order average regret:

$$\frac{1}{T} \mathbb{E}[\text{Regret}_T] = \widetilde{O}\left(\sqrt{\frac{H^4 |\mathcal{S}| |\mathcal{A}|}{T}}\right),$$

which is of constant order once $T \ge H^4 |\mathcal{S}| |\mathcal{A}|$ (up to logarithmic factors).
Note that the time-dependent transition matrix has $H |\mathcal{S}|^2 |\mathcal{A}|$ entries. Assuming $H \ll |\mathcal{S}|$, this shows that it’s possible to achieve low regret, and achieve a near-optimal policy, while only understanding a $1/|\mathcal{S}|$ fraction of the world’s dynamics.
A polynomial dependency on $|\mathcal{S}|$ and $|\mathcal{A}|$ is manageable when the state and action spaces are small. But for large or continuous state and action spaces, even this polynomial factor becomes intractable. Can we find algorithms that don’t depend on $|\mathcal{S}|$ or $|\mathcal{A}|$ at all, effectively reducing the dimensionality of the MDP? In this section, we’ll explore linear MDPs: an example of a parameterized MDP where the rewards and state transitions depend only on a parameter space of dimension $d$ that is independent of $|\mathcal{S}|$ and $|\mathcal{A}|$.
It turns out that $Q^\star_h$ is also linear with respect to the feature mapping $\phi(s, a) \in \mathbb{R}^d$ from the linear MDP definition. We can prove this by simply computing it using DP. We initialize $V^\star_H(s) = 0$ for all $s$. Then we iterate backwards over $h = H-1, \dots, 0$, as sketched below.
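Here is a sketch of that inductive step, assuming the linear MDP structure $r_h(s, a) = \phi(s, a)^\top \theta^\star_h$ and $P^?_h(s' \mid s, a) = \phi(s, a)^\top \mu^\star_h(s')$ (notation as in the linear MDP definition):

$$
\begin{aligned}
Q^\star_h(s, a) &= r_h(s, a) + \mathbb{E}_{s' \sim P^?_h(\cdot \mid s, a)}\left[V^\star_{h+1}(s')\right] \\
&= \phi(s, a)^\top \theta^\star_h + \sum_{s'} \phi(s, a)^\top \mu^\star_h(s') \, V^\star_{h+1}(s') \\
&= \phi(s, a)^\top \underbrace{\left(\theta^\star_h + \sum_{s'} \mu^\star_h(s') \, V^\star_{h+1}(s')\right)}_{=: \, w_h},
\end{aligned}
$$

so each $Q^\star_h(s, a) = \phi(s, a)^\top w_h$ is linear in $\phi(s, a)$ with some weight vector $w_h \in \mathbb{R}^d$.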
This linear assumption on the MDP will also allow us to model the unknown dynamics $P^?_h(s' \mid s, a)$ with techniques from supervised learning (SL). Recall that SL is useful for estimating conditional expectations by minimizing mean squared error. We can rephrase the estimation of $P^?_h(s' \mid s, a)$ as a least-squares problem as follows: write $\delta_s$ to denote the one-hot vector in $\mathbb{R}^{|\mathcal{S}|}$ with a $1$ in the $s$-th entry and $0$ everywhere else. Note that

$$P^?_h(\cdot \mid s, a) = \mathbb{E}_{s' \sim P^?_h(\cdot \mid s, a)}\left[\delta_{s'}\right],$$

so estimating the transition distribution is the same as estimating the conditional expectation of the one-hot vector $\delta_{s'}$ given $(s, a)$.
Furthermore, since this expectation is linear with respect to $\phi(s, a)$, we can directly apply least-squares multi-target linear regression to construct the estimate

$$\hat{\mu}^t_h = \left(\sum_{i=0}^{t-1} \delta_{s^i_{h+1}} \, \phi(s^i_h, a^i_h)^\top\right) \left(A^t_h\right)^{-1}, \qquad A^t_h = \sum_{i=0}^{t-1} \phi(s^i_h, a^i_h) \, \phi(s^i_h, a^i_h)^\top + \lambda I,$$
where we include a $\lambda I$ term to ensure that the matrix $A^t_h$ is invertible. (This can also be derived by adding a $\lambda \|\mu\|_F^2$ regularization term to the least-squares objective.) We can directly plug this estimate into $\hat{P}^t_h(\cdot \mid s, a) = \hat{\mu}^t_h \phi(s, a)$.
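A minimal sketch of this regularized least-squares estimate at a single timestep $h$ (the array names and shapes are our own):

```python
import numpy as np

def fit_linear_transitions(Phi, next_states, n_states, lam=1.0):
    """Ridge-regression estimate of mu_h, so that P_hat(. | s, a) = mu_hat @ phi(s, a).

    Phi:         (n, d) array whose i-th row is phi(s_h^i, a_h^i)
    next_states: (n,) integer array of observed next states s_{h+1}^i
    """
    n, d = Phi.shape
    Delta = np.zeros((n, n_states))
    Delta[np.arange(n), next_states] = 1.0           # one-hot regression targets delta_{s'}
    A = Phi.T @ Phi + lam * np.eye(d)                # A_h^t = sum_i phi_i phi_i^T + lam * I
    mu_hat = np.linalg.solve(A, Phi.T @ Delta).T     # shape (n_states, d)
    return mu_hat, A

# Estimated next-state "distribution" for a query feature vector phi
# (it may need clipping and renormalizing to be a valid distribution):
# p_hat = mu_hat @ phi
```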
Now, to design the reward bonus, we can’t apply Hoeffding anymore, since the estimate no longer involves sample means of bounded random variables; instead, we’re incorporating information across different states and actions. Rather, we can construct an upper bound using Chebyshev’s inequality in the same way we did for the LinUCB algorithm in the MAB setting (Section 3.8.1):

$$b^t_h(s, a) = \beta \sqrt{\phi(s, a)^\top \left(A^t_h\right)^{-1} \phi(s, a)},$$

where the scale factor $\beta$ is set by the concentration analysis.
Note that this bonus isn’t explicitly inversely proportional to $N^t_h(s, a)$ as in the original UCB-VI bonus term (9.8). Rather, it is inversely proportional to how much the direction $\phi(s, a)$ has already been explored in the history: if $A^t_h$ has a large component in the direction $\phi(s, a)$, implying that this direction is well explored, then the bonus term will be small, and vice versa.
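This “elliptical” bonus is cheap to compute from the same matrix $A^t_h$ used in the regression above (here $\beta$ is just a scale parameter):

```python
import numpy as np

def linear_bonus(phi, A, beta):
    """Exploration bonus beta * sqrt(phi^T A^{-1} phi): small in well-explored directions."""
    return beta * np.sqrt(phi @ np.linalg.solve(A, phi))
```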
We can now plug in these transition estimates and reward bonuses into the UCB-VI algorithm (9.16).
Comparing this to our bound for UCB-VI in an environment without the linear assumption, we see that we go from a sample complexity of $\widetilde{\Omega}(H^4 |\mathcal{S}| |\mathcal{A}|)$ to $\widetilde{\Omega}(H^4 d^3)$. This new sample complexity depends only on the feature dimension and not on the state or action space of the MDP!
In this chapter, we’ve explored how to explore in an unknown MDP.
We first discussed the explore-then-exploit algorithm (Definition 9.2), a simple way to explore a deterministic MDP by visiting all state-action pairs.
We then discussed how to treat an unknown MDP as a MAB (Section 9.2), and why this approach is inefficient: it fails to make use of relationships between policies.
We then introduced the UCB-VI algorithm (9.16), which models the unknown MDP by a proxy MDP with a reward bonus term that encourages exploration.
Finally, assuming that the transitions and rewards are linear with respect to a feature transformation of the state and action, we introduced the LinUCB-VI algorithm (Section 9.4.2), which has a sample complexity independent of the size of the state and action spaces.