App. B: Proofs
B.1 LQR proof
- We’ll compute \(V_H^\star\) (at the end of the horizon) as our base case.
- Then we’ll work step-by-step backwards in time, using \(V_{h+1}^\star\) to compute \(Q_h^\star\), \(\pi_{h}^\star\), and \(V_h^\star\).
Base case:
At the final timestep, there are no possible actions to take, and so \(V^\star_H(x) = c(x) = x^\top Q x\). Thus \(V_H^\star(x) = x^\top P_Hx+ p_H\) where \(P_H= Q\) and \(p_H= 0\).
Inductive step:
We seek to show that the inductive step holds for both theorems: If \(V^\star_{h+1}(x)\) is an upward-curved quadratic, then \(V^\star_h(x)\) must also be an upward-curved quadratic, and \(\pi^\star_h(x)\) must be linear. We’ll break this down into the following steps:
- Show that \(Q^\star_h(x, u)\) is an upward-curved quadratic (in both \(x\) and \(u\)).
- Derive the optimal policy \(\pi^\star_h(x) = \arg \min_uQ^\star_h(x, u)\) and show that it’s linear.
- Show that \(V^\star_h(x)\) is an upward-curved quadratic.
We first assume the inductive hypothesis that our theorems are true at time \(h+1\). That is, for some symmetric positive definite \(P_{h+1}\) and scalar \(p_{h+1}\),
\[ V^\star_{h+1}(x) = x^\top P_{h+1} x + p_{h+1} \quad \forall x\in \mathcal{S}. \]
Lemma B.1 (\(Q^\star_h(x, u)\) is an upward-curved quadratic) Let us decompose \(Q^\star_h: \mathcal{S} \times \mathcal{A} \to \mathbb{R}\) into the immediate cost plus the expected cost-to-go:
\[ Q^\star_h(x, u) = c(x, u) + \mathop{\mathbb{E}}_{x' \sim f(x, u, w_{h+1})} [V^\star_{h+1}(x')]. \]
Recall \(c(x, u) := x^\top Q x+ u^\top R u\). Let’s consider the expectation over the next timestep. The only randomness in the dynamics comes from the noise \(w_{h+1} \sim \mathcal{N}(0, \sigma^2 I)\), so we can expand the expectation as:
\[ \begin{aligned} & \mathop{\mathbb{E}}_{x'} [V^\star_{h+1}(x')] \\ {} = {} & \mathop{\mathbb{E}}_{w_{h+1}} [V^\star_{h+1}(A x+ B u+ w_{h+1})] & & \text{definition of } f \\ {} = {} & \mathop{\mathbb{E}}_{w_{h+1}} [ (A x+ B u+ w_{h+1})^\top P_{h+1} (A x+ B u+ w_{h+1}) + p_{h+1} ]. & & \text{inductive hypothesis} \end{aligned} \]
Summing and combining like terms, we get
\[ \begin{aligned} Q^\star_h(x, u) & = x^\top Q x+ u^\top R u+ \mathop{\mathbb{E}}_{w_{h+1}} [(A x+ B u+ w_{h+1})^\top P_{h+1} (A x+ B u+ w_{h+1}) + p_{h+1}] \\ & = x^\top (Q + A^\top P_{h+1} A)x+ u^\top (R + B^\top P_{h+1} B) u+ 2 x^\top A^\top P_{h+1} B u\\ & \qquad + \mathop{\mathbb{E}}_{w_{h+1}} [w_{h+1}^\top P_{h+1} w_{h+1}] + p_{h+1}. \end{aligned} \]
Note that the cross terms, which are linear in \(w_{h+1}\), have mean zero and vanish in expectation. Now consider the remaining expectation over the noise. By expanding out the product and using linearity of expectation, we can write this out as
\[ \begin{aligned} \mathop{\mathbb{E}}_{w_{h+1}} [w_{h+1}^\top P_{h+1} w_{h+1}] & = \sum_{i=1}^d \sum_{j=1}^d (P_{h+1})_{ij} \mathop{\mathbb{E}}_{w_{h+1}} [(w_{h+1})_i (w_{h+1})_j] \\ & = \sigma^2 \mathrm{Tr}(P_{h+ 1}) \end{aligned} \]
B.1.0.1 Quadratic forms
When solving quadratic forms, i.e. expressions of the form \(x^\top A x\), it’s often helpful to consider the terms on the diagonal (\(i = j\)) separately from those off the diagonal.
In this case, the expectation of each diagonal term becomes
\[ (P_{h+1})_{ii} \mathop{\mathbb{E}}[(w_{h+1})_i^2] = \sigma^2 (P_{h+1})_{ii}. \tag{B.1}\]
Off the diagonal, since the elements of \(w_{h+1}\) are independent, the expectation factors, and since each element has mean zero, the term vanishes:
\[ (P_{h+1})_{ij} \mathop{\mathbb{E}}[(w_{h+1})_i] \mathop{\mathbb{E}}[(w_{h+1})_j] = 0. \tag{B.2}\]
Thus, the only terms left are the ones on the diagonal, so the sum of these can be expressed as the trace of \(\sigma^2 P_{h+1}\):
\[ \mathop{\mathbb{E}}_{w_{h+1}} [w_{h+1}^\top P_{h+1} w_{h+1}] = \sigma^2 \mathrm{Tr}(P_{h+1}). \tag{B.3}\]
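This identity is easy to sanity-check numerically. The following minimal sketch (a standalone snippet assuming NumPy; none of the names come from the text) draws Gaussian noise vectors and compares the empirical mean of \(w^\top P w\) against \(\sigma^2 \mathrm{Tr}(P)\):

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 4, 0.3

# An arbitrary symmetric matrix P (the identity holds whether or not P is p.s.d.).
M = rng.normal(size=(d, d))
P = (M + M.T) / 2

# Draw many noise vectors w ~ N(0, sigma^2 I) and average the quadratic form w^T P w.
n = 200_000
w = sigma * rng.normal(size=(n, d))
empirical = np.einsum("ni,ij,nj->n", w, P, w).mean()

print(empirical)                 # approximately sigma^2 * trace(P)
print(sigma**2 * np.trace(P))
```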
Substituting this back into the expression for \(Q^\star_h\), we have:
\[ \begin{aligned} Q^\star_h(x, u) & = x^\top (Q + A^\top P_{h+1} A) x+ u^\top (R + B^\top P_{h+1} B) u + 2x^\top A^\top P_{h+1} B u\\ & \qquad + \sigma^2 \mathrm{Tr}(P_{h+1}) + p_{h+1}. \end{aligned} \tag{B.4}\]
As we hoped, this expression is quadratic in \(x\) and \(u\). Furthermore, we’d like to show that it also curves upwards with respect to \(u\) so that its minimum with respect to \(u\) is well-defined. We can do this by noting that the Hessian matrix of second derivatives is positive definite:
\[ \nabla_{uu} Q_h^\star(x, u) = R + B^\top P_{h+1} B \]
Since \(R\) is sym. p.d. (def. 3.5) and \(B^\top P_{h+1} B\) is sym. p.s.d. (because \(P_{h+1}\) is sym. p.d. by the inductive hypothesis), this sum must also be sym. p.d., and so \(Q^\star_h\) is indeed an upward-curved quadratic with respect to \(u\). (If this isn’t clear, try proving it as an exercise.) The proof of its upward curvature with respect to \(x\) is analogous.
Lemma B.2 (\(\pi^\star_h\) is linear) Since \(Q^\star_h\) is an upward-curved quadratic, finding its minimum over \(u\) is easy: we simply set the gradient with respect to \(u\) equal to zero and solve for \(u\). First, we calculate the gradient:
\[ \begin{aligned} \nabla_uQ^\star_h(x, u) & = \nabla_u[ u^\top (R + B^\top P_{h+1} B) u+ 2 x^\top A^\top P_{h+1} B u] \\ & = 2 (R + B^\top P_{h+1} B) u+ 2 (x^\top A^\top P_{h+1} B)^\top \end{aligned} \]
Setting this to zero, we get
\[ \begin{aligned} 0 & = (R + B^\top P_{h+1} B) \pi^\star_h(x) + B^\top P_{h+1} A x\nonumber \\ \pi^\star_h(x) & = (R + B^\top P_{h+1} B)^{-1} (-B^\top P_{h+1} A x) \nonumber \\ & = - K_hx, \end{aligned} \]
where
\[ K_h= (R + B^\top P_{h+1} B)^{-1} B^\top P_{h+1} A. \tag{B.5}\]
Note that this optimal policy doesn’t depend on the starting distribution \(\mu_0\). It’s also fully deterministic and isn’t affected by the noise terms \(w_0, \dots, w_{H-1}\).
Lemma B.4 (The value function is an upward-curved quadratic) Using the identity \(V^\star_h(x) = Q^\star_h(x, \pi^\star_h(x))\), we have:
\[ \begin{aligned} V^\star_h(x) & = Q^\star_h(x, \pi^\star_h(x)) \\ & = x^\top (Q + A^\top P_{h+1} A) x + (-K_h x)^\top (R + B^\top P_{h+1} B) (-K_h x) + 2x^\top A^\top P_{h+1} B (-K_h x) \\ & \qquad + \mathrm{Tr}(\sigma^2 P_{h+1}) + p_{h+1} \end{aligned} \]
Note that with respect to \(x\), this is the sum of a quadratic term and a constant, which is exactly what we were aiming for! The scalar term is clearly
\[ p_h= \mathrm{Tr}(\sigma^2 P_{h+1}) + p_{h+1}. \tag{B.6}\]
We can simplify the quadratic term by substituting in \(K_h\) from eq. B.5. Notice that when we do this, one factor of \((R+B^\top P_{h+1} B)\) cancels against its inverse, and the remaining terms combine to give the Riccati equation:
\[ P_h= Q + A^\top P_{h+1} A - A^\top P_{h+1} B (R + B^\top P_{h+1} B)^{-1} B^\top P_{h+1} A. \tag{B.7}\]
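In more detail: writing \(P := P_{h+1}\) and \(M := R + B^\top P B\), so that \(K_h = M^{-1} B^\top P A\), the \(u\)-dependent terms above evaluated at \(u = -K_h x\) become
\[ \begin{aligned} (K_h x)^\top M (K_h x) - 2 x^\top A^\top P B K_h x &= x^\top A^\top P B M^{-1} M M^{-1} B^\top P A x - 2 x^\top A^\top P B M^{-1} B^\top P A x \\ &= - x^\top A^\top P B M^{-1} B^\top P A x, \end{aligned} \]
and adding this to \(x^\top (Q + A^\top P A) x\) gives exactly the quadratic term \(x^\top P_h x\) of eq. B.7.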
It remains to prove that \(V^\star_h\) curves upwards, that is, that \(P_h\) is sym. p.d. We will use the following fact about Schur complements:
Lemma B.3 (Positive definiteness of Schur complements) Let
\[ D = \begin{pmatrix} A & B \\ B^\top & C \end{pmatrix} \tag{B.8}\]
be a symmetric \((m+n) \times (m+n)\) block matrix, where \(A \in \mathbb{R}^{m \times m}, B \in \mathbb{R}^{m \times n}, C \in \mathbb{R}^{n \times n}\). The Schur complement of \(A\) is denoted
\[ D/A = C - B^\top A^{-1} B. \tag{B.9}\]
Schur complements have various uses in linear algebra and numerical computation.
A useful fact for us is that if \(A\) is positive definite, then \(D\) is positive semidefinite if and only if \(D/A\) is positive semidefinite.
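As a quick numerical illustration of this fact (a standalone sketch assuming NumPy, with randomly generated blocks; it is not part of the proof), the positive semidefiniteness tests for \(D\) and \(D/A\) agree whenever the \(A\)-block is positive definite:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 2

# Random symmetric positive definite A-block; arbitrary B and symmetric C blocks.
A = rng.normal(size=(m, m)); A = A @ A.T + np.eye(m)
B = rng.normal(size=(m, n))
C = rng.normal(size=(n, n)); C = (C + C.T) / 2

D = np.block([[A, B], [B.T, C]])
schur = C - B.T @ np.linalg.solve(A, B)      # the Schur complement D / A

def is_psd(M, tol=1e-9):
    return np.linalg.eigvalsh(M).min() >= -tol

assert is_psd(D) == is_psd(schur)            # the two checks agree for this draw
```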
Let \(P\) denote \(P_{h+ 1}\) for brevity. We already know \(Q\) is sym. p.d., so it suffices to show that
\[ S = P - P B (R + B^\top P B)^{-1} B^\top P \tag{B.10}\]
is positive semidefinite (p.s.d.), since left- and right-multiplying by \(A^\top\) and \(A\) respectively preserves positive semidefiniteness. We note that \(S\) is the Schur complement \(D/(R + B^\top P B)\), where
\[ D = \begin{pmatrix} R + B^\top P B & B^\top P \\ P B & P \end{pmatrix}. \]
Thus we must show that \(D\) is p.s.d. This can be seen by computing, for arbitrary vectors \(y\) and \(z\),
\[ \begin{aligned} \begin{pmatrix} y^\top & z^\top \end{pmatrix} D \begin{pmatrix} y \\ z \end{pmatrix} &= y^\top R y + y^\top B^\top P B y + 2 y^\top B^\top P z + z^\top P z \\ &= y^\top R y + (By + z)^\top P (By + z) \\ &\ge 0. \end{aligned} \]
Since \(R + B^\top P B\) is sym. p.d. and \(D\) is p.s.d., \(S = D / (R + B^\top P B)\) must be p.s.d., and so \(P_h = Q + A^\top S A\) must be sym. p.d.
Now we’ve shown that \(V^\star_h(x) = x^\top P_h x + p_h\), where \(P_h\) is sym. p.d., completing the inductive step and with it the proof of Theorem 3.3 and Theorem 3.2.
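To make the resulting recursion concrete, here is a minimal sketch (assuming NumPy; the example system matrices are hypothetical, not from the text) that computes \(K_h\), \(P_h\), and \(p_h\) backwards in time directly from eq. B.5, eq. B.6, and eq. B.7:

```python
import numpy as np

def lqr_backward(A, B, Q, R, sigma, H):
    """Backward recursion for finite-horizon LQR: returns (K_h, P_h, p_h) for h = 0, ..., H-1."""
    P = Q.copy()   # base case: P_H = Q
    p = 0.0        # base case: p_H = 0
    Ks, Ps, ps = [], [], []
    for _ in range(H):                                   # h = H-1, ..., 0
        M = R + B.T @ P @ B
        K = np.linalg.solve(M, B.T @ P @ A)              # eq. B.5: K_h = M^{-1} B^T P_{h+1} A
        p = sigma**2 * np.trace(P) + p                   # eq. B.6 (uses the current P = P_{h+1})
        P = Q + A.T @ P @ A - A.T @ P @ B @ K            # eq. B.7, since A^T P B K = A^T P B M^{-1} B^T P A
        Ks.append(K); Ps.append(P); ps.append(p)
    return Ks[::-1], Ps[::-1], ps[::-1]                  # index h of each list is timestep h

# Hypothetical double-integrator-style example with 2 states and 1 input.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
Ks, Ps, ps = lqr_backward(A, B, Q, R, sigma=0.1, H=50)
u0 = -Ks[0] @ np.array([1.0, 0.0])                       # optimal first action: pi_0^*(x) = -K_0 x
```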
B.2 UCBVI reward bonus proof
We aim to show that, with high probability,
\[ V_h^\star(s) \le \widehat{V}_h^t(s) \quad \forall t \in [T], h \in [H], s \in \mathcal{S}. \]
We’ll do this by bounding the error incurred at each step of DP. Recall that DP solves for \(\widehat{V}_h^t(s)\) recursively as follows:
\[ \widehat{V}_h^t(s) = \max_{a \in \mathcal{A}} \left[ \widetilde r^t_h(s, a) + \mathop{\mathbb{E}}_{s' \sim \widehat{P}_h^t(\cdot \mid s, a)} \left[ \widehat{V}_{h+1}^t(s') \right] \right] \]
where \(\widetilde r^t_h(s, a) = r_h(s, a) + b_h^t(s, a)\) is the reward function of our modelled MDP \(\widetilde{\mathcal{M}}^t\). On the other hand, we know that \(V^\star\) must satisfy
\[ V^\star_h(s) = \max_{a \in \mathcal{A}} \left[ r_h(s, a) + \mathop{\mathbb{E}}_{s' \sim P^?_h(\cdot \mid s, a)} [V^\star_{h+1}(s')] \right] \]
so it suffices to bound the difference between the two inner expectations by the bonus \(b_h^t(s, a)\). There are two sources of error:
- The value functions \(\widehat{V}^t_{h+1}\) vs. \(V^\star_{h+1}\)
- The transition probabilities \(\widehat{P}_h^t\) vs. \(P^?_h\).
We can bound these individually, and then combine them by the triangle inequality. For the former, we can simply bound the difference by \(H\), assuming that the rewards are within \([0, 1]\). Now, all that is left is to bound the error from the transition probabilities:
\[ \text{error} = \left| \mathop{\mathbb{E}}_{s' \sim \widehat{P}_h^t(\cdot \mid s, a)} \left[ V^\star_{h+1}(s') \right] - \mathop{\mathbb{E}}_{s' \sim P^?_h(\cdot \mid s, a)} \left[ V^\star_{h+1}(s') \right] \right|. \tag{B.11}\]
Let us bound this term for a fixed \(s, a, h, t\). (Later we can make this uniform across \(s, a, h, t\) using the union bound.) Note that expanding out the definition of \(\widehat{P}_h^t\) gives
\[ \begin{aligned} \mathop{\mathbb{E}}_{s' \sim \widehat{P}_h^t(\cdot \mid s, a)} \left[ V^\star_{h+1}(s') \right] & = \sum_{s' \in \mathcal{S}} \frac{N^t_h(s, a, s')}{N^t_h(s, a)} V^\star_{h+1}(s') \\ & = \frac{1}{N^t_h(s, a)} \sum_{i=0}^{t-1} \sum_{s' \in \mathcal{S}} \mathbf{1}\left\{ (s_h^i, a_h^i, s_{h+1}^i) = (s, a, s') \right\} V^\star_{h+1}(s') \\ & = \frac{1}{N^t_h(s, a)} \sum_{i=0}^{t-1} \underbrace{\mathbf{1}\left\{ (s_h^i, a_h^i) = (s, a) \right\} V^\star_{h+1}(s_{h+1}^i)}_{X^i} \end{aligned} \]
since the terms where \(s' \neq s_{h+1}^i\) vanish.
Now, in order to apply Hoeffding’s inequality, we would like to express the second term in eq. B.11 as a sum over \(t\) random variables as well. We will do this by redundantly averaging over all desired trajectories (i.e. where we visit state \(s\) and action \(a\) at time \(h\)):
\[ \begin{aligned} \mathop{\mathbb{E}}_{s' \sim P^?_h(\cdot \mid s, a)} \left[ V^\star_{h+1}(s') \right] & = \sum_{s' \in \mathcal{S}} P^?_h(s' \mid s, a) V^\star_{h+1}(s') \\ & = \sum_{s' \in \mathcal{S}} \frac{1}{N^t_h(s, a)} \sum_{i=0}^{t-1} \mathbf{1}\left\{ (s_h^i, a_h^i) = (s, a) \right\} P^?_h(s' \mid s, a) V^\star_{h+1}(s') \\ & = \frac{1}{N^t_h(s, a)} \sum_{i=0}^{t-1} \mathop{\mathbb{E}}_{s_{h+1}^i \sim P^?_{h}(\cdot \mid s_h^i, a_h^i)} X^i. \end{aligned} \]
Now we can apply Hoeffding’s inequality to \(X^i - \mathop{\mathbb{E}}_{s_{h+1}^i \sim P^?_{h}(\cdot \mid s_h^i, a_h^i)} X^i\), which is bounded by \(H\), to obtain that, with probability at least \(1-\delta\),
\[ \text{error} = \left| \frac{1}{N^t_h(s, a)} \sum_{i=0}^{t-1} \left(X^i - \mathop{\mathbb{E}}_{s_{h+1}^i \sim P^?_{h}(\cdot \mid s_h^i, a_h^i)} X^i \right) \right| \le 2 H \sqrt{\frac{\ln(1/\delta)}{N_h^t(s, a)}}. \]
Applying a union bound over all \(s \in \mathcal{S}, a \in \mathcal{A}, t \in [T], h \in [H]\) gives the \(b_h^t(s, a)\) term above.
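In code, the bonus this analysis suggests is straightforward to compute from the visit counts. Below is a minimal sketch (assuming NumPy; the function name, the fallback of \(H\) for unvisited pairs, and folding the union bound into the logarithm are illustrative assumptions consistent with the argument above, not a prescribed implementation):

```python
import numpy as np

def ucbvi_bonus(N, H, S, A, T, delta):
    """Hoeffding-style bonus 2 H sqrt(ln(S A T H / delta) / N) with the union bound folded in.

    N: array of visit counts N_h^t(s, a); entries with zero visits get the trivial bound H.
    """
    log_term = np.log(S * A * T * H / delta)     # delta split across all (s, a, t, h) tuples
    bonus = 2.0 * H * np.sqrt(log_term / np.maximum(N, 1))
    return np.where(N > 0, bonus, H)             # no data yet: fall back to the trivial bound H

# Example: visit counts for one (h, t) pair in a 3-state, 2-action MDP (made-up numbers).
N = np.array([[5, 0], [12, 3], [1, 7]])
b = ucbvi_bonus(N, H=10, S=3, A=2, T=100, delta=0.01)
print(b)
```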