
2 Linear Quadratic Regulators

2.1 Introduction

Up to this point, we have considered decision problems with finitely many states and actions. However, in many applications, states and actions may take on continuous values. For example, consider autonomous driving, controlling a robot’s joints, and automated manufacturing. How can we teach computers to solve these kinds of problems? This is the task of continuous control.

Figure 2.1: Solving a Rubik's Cube with a robot hand.

Figure 2.2: Boston Dynamics's Spot robot.

Aside from the change in the state and action spaces, the general problem setup remains the same: we seek to construct an optimal policy that outputs actions to solve the desired task. We will see that many key ideas and algorithms, in particular dynamic programming algorithms, carry over to this new setting.

This chapter introduces a fundamental tool to solve a simple class of continuous control problems: the linear quadratic regulator. We will then extend this basic method to more complex settings.

2.2 Optimal control

Recall that an MDP is defined by its state space $\mathcal{S}$, action space $\mathcal{A}$, state transitions $P$, reward function $r$, and discount factor $\gamma$ or time horizon $\hor$. These have equivalents in the control setting:

  • The state and action spaces are continuous rather than finite. That is, $\mathcal{S} \subseteq \mathbb{R}^{n_\st}$ and $\mathcal{A} \subseteq \mathbb{R}^{n_\act}$, where $n_\st$ and $n_\act$ are the corresponding dimensions of these spaces, i.e. the number of coordinates needed to specify a single state or action respectively.

  • We call the state transitions the dynamics of the system. In the most general case, these might change across timesteps and also include some stochastic noise $w_\hi$ at each timestep. We denote these dynamics as the function $f_\hi$ such that $\st_{\hi+1} = f_\hi(\st_\hi, \act_\hi, w_\hi)$. Of course, we can simplify to cases where the dynamics are deterministic/noise-free (no $w_\hi$ term) and/or time-homogeneous (the same function $f$ across timesteps).

  • Instead of maximizing the reward function, we seek to minimize the cost function $c_\hi: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. Often, the cost function describes how far away we are from a target state-action pair $(\st^\star, \act^\star)$. An important special case is when the cost is time-homogeneous; that is, it remains the same function $c$ at each timestep $\hi$.

  • We seek to minimize the undiscounted cost within a finite time horizon $\hor$. Note that we end an episode at the final state $\st_\hor$ -- there is no action $\act_\hor$, and so we denote the cost for the final state as $c_\hor(\st_\hor)$.

With all of these components, we can now formulate the optimal control problem: compute a policy to minimize the expected undiscounted cost over $\hor$ timesteps. In this chapter, we will only consider deterministic, time-dependent policies $\pi = (\pi_0, \dots, \pi_{H-1})$ where $\pi_\hi : \mathcal{S} \to \mathcal{A}$ for each $\hi \in [\hor]$.

2.2.1 A first attempt: Discretization

Can we solve this problem using tools from the finite MDP setting? If $\mathcal{S}$ and $\mathcal{A}$ were finite, then we'd be able to work backwards using the DP algorithm for computing the optimal policy in an MDP (Definition 1.11). This inspires us to try discretizing the problem.

Suppose $\mathcal{S}$ and $\mathcal{A}$ are bounded, that is, $\max_{\st \in \mathcal{S}} \|\st\| \le B_\st$ and $\max_{\act \in \mathcal{A}} \|\act\| \le B_\act$. To make $\mathcal{S}$ and $\mathcal{A}$ finite, let's choose some small positive $\varepsilon$, and simply round each coordinate to the nearest multiple of $\varepsilon$. For example, if $\varepsilon = 0.01$, then we round each element of $\st$ and $\act$ to two decimal places.

However, the discretized $\widetilde{\mathcal{S}}$ and $\widetilde{\mathcal{A}}$ may be finite, but they may be infeasibly large: we must divide each dimension into intervals of length $\varepsilon$, resulting in $|\widetilde{\mathcal{S}}| = (B_\st/\varepsilon)^{n_\st}$ and $|\widetilde{\mathcal{A}}| = (B_\act/\varepsilon)^{n_\act}$. To get a sense of how quickly this grows, consider $\varepsilon = 0.01$, $n_\st = n_\act = 10$. Then the number of elements in the transition matrix would be $|\widetilde{\mathcal{S}}|^2 |\widetilde{\mathcal{A}}| = (100^{10})^2 (100^{10}) = 10^{60}$! (That's a trillion trillion trillion trillion trillion.)
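As a quick sanity check on this count, here is a minimal Python computation under the stated assumptions (a bound of 1 on each coordinate, so $B/\varepsilon = 100$ grid points per dimension; the variable names are illustrative):

```python
eps = 0.01                         # discretization resolution
n_x = n_u = 10                     # state and action dimensions
B = 1.0                            # assumed bound on each coordinate (illustrative)

points_per_dim = round(B / eps)    # 100 grid points per coordinate
n_states = points_per_dim ** n_x   # |S~| = 100^10 = 10^20
n_actions = points_per_dim ** n_u  # |A~| = 100^10 = 10^20
transition_entries = n_states ** 2 * n_actions
print(f"{transition_entries:.3e}") # ~1.000e+60 entries -- far too many to store
```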

What properties of the problem could we instead make use of? Note that by discretizing the state and action spaces, we implicitly assumed that rounding each state or action vector by some tiny amount $\varepsilon$ wouldn't change the behavior of the system by much; namely, that the cost and dynamics were relatively continuous. Can we use this continuous structure in other ways? This leads us to the linear quadratic regulator.

2.3 The Linear Quadratic Regulator

The optimal control problem (Definition 2.1) seems highly complex in general. Is there a relevant simplification that we can analyze? The linear quadratic regulator (LQR) is a solvable case and a fundamental tool in control theory.

We will henceforth abbreviate “symmetric positive definite” as s.p.d. and “positive definite” as p.d.

It will be helpful to reintroduce the value function notation for a policy to denote the average cost it incurs. These will be instrumental in constructing the optimal policy via dynamic programming, as we did in Section 1.3.2 for MDPs.

2.4 Optimality and the Riccati Equation

In this section, we'll compute the optimal value function $V^\star_\hi$, Q-function $Q^\star_\hi$, and policy $\pi^\star_\hi$ in the linear quadratic regulator using dynamic programming in a very similar way to the DP algorithms in the MDP setting. Recall the definition of the optimal value function:

We will prove the striking fact that the solution has a very simple structure: $V_\hi^\star$ and $Q^\star_\hi$ are upward-curved quadratics, and $\pi_\hi^\star$ is linear and, furthermore, does not depend on the noise!

The construction (and inductive proof) proceeds similarly to the one in the MDP setting.

  1. We'll compute $V_\hor^\star$ (at the end of the horizon) as our base case.
  2. Then we'll work step-by-step backwards in time, using $V_{\hi+1}^\star$ to compute $Q_\hi^\star$, $\pi_{\hi}^\star$, and $V_\hi^\star$.

Base case: At the final timestep, there are no possible actions to take, and so $V^\star_\hor(\st) = c(\st) = \st^\top Q \st$. Thus $V_\hor^\star(\st) = \st^\top P_\hor \st + p_\hor$ where $P_\hor = Q$ and $p_\hor = 0$.

Inductive hypothesis: We seek to show that the inductive step holds for both theorems: if $V^\star_{\hi+1}(\st)$ is an upward-curved quadratic, then $V^\star_\hi(\st)$ must also be an upward-curved quadratic, and $\pi^\star_\hi(\st)$ must be linear. We'll break this down into the following steps:

  1. Show that $Q^\star_\hi(\st, \act)$ is an upward-curved quadratic (in both $\st$ and $\act$).
  2. Derive the optimal policy $\pi^\star_\hi(\st) = \arg \min_\act Q^\star_\hi(\st, \act)$ and show that it's linear.
  3. Show that $V^\star_\hi(\st)$ is an upward-curved quadratic.

We first assume the inductive hypothesis that our theorems are true at time $\hi+1$. That is,

$$V^\star_{\hi+1}(\st) = \st^\top P_{\hi+1} \st + p_{\hi+1} \quad \forall \st \in \mathcal{S}.$$

Carrying out the three steps above then shows that $V^\star_\hi(\st) = \st^\top P_\hi \st + p_\hi$, where $P_\hi$ is s.p.d., proving the inductive hypothesis and completing the proof of Theorem 2.2 and Theorem 2.1.

In summary, we just demonstrated that at each timestep $\hi \in [\hor]$, the optimal value function $V^\star_\hi$ and optimal Q-function $Q^\star_\hi$ are both upward-curved quadratics and the optimal policy $\pi^\star_\hi$ is linear. We also showed that all of these quantities can be calculated using a sequence of s.p.d. matrices $P_0, \dots, P_\hor$ that can be defined recursively using the Riccati equation (Definition 2.5).
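The exact statement of the Riccati equation (Definition 2.5) is not reproduced in this excerpt, so the following is only a minimal sketch of the backward recursion, written for the standard cost $\st_\hi^\top Q \st_\hi + \act_\hi^\top R \act_\hi$ and dynamics $\st_{\hi+1} = A \st_\hi + B \act_\hi + w_\hi$; treat it as an illustration rather than the text's exact definition.

```python
import numpy as np

def lqr_backward(A, B, Q, R, H):
    """Backward Riccati recursion for time-homogeneous LQR (a sketch).

    Returns cost matrices P[0..H] and gains K[0..H-1]; the resulting policy
    is linear, pi_h(x) = -K[h] @ x, and does not depend on the noise.
    """
    P = [None] * (H + 1)
    K = [None] * H
    P[H] = Q                                      # base case: V_H(x) = x^T Q x
    for h in reversed(range(H)):
        S = R + B.T @ P[h + 1] @ B                # curvature in the action
        K[h] = np.linalg.solve(S, B.T @ P[h + 1] @ A)
        P[h] = Q + A.T @ P[h + 1] @ A - A.T @ P[h + 1] @ B @ K[h]
        P[h] = (P[h] + P[h].T) / 2                # keep P[h] symmetric numerically
    return P, K
```

Acting according to $\act_\hi = -K_\hi \st_\hi$ then realizes the linear policy described above.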

Before we move on to some extensions of LQR, let's consider how the state at time $\hi$ behaves when we act according to this optimal policy.

2.4.1 Expected state at time $\hi$

How can we compute the expected state at time $\hi$ when acting according to the optimal policy? Let's first express $\st_\hi$ in a cleaner way in terms of the history. Note that having linear dynamics makes it easy to expand terms backwards in time:

$$\begin{aligned} \st_\hi & = A \st_{\hi-1} + B \act_{\hi-1} + w_{\hi-1} \\ & = A (A\st_{\hi-2} + B \act_{\hi-2} + w_{\hi-2}) + B \act_{\hi-1} + w_{\hi-1} \\ & = \cdots \\ & = A^\hi \st_0 + \sum_{i=0}^{\hi-1} A^i (B \act_{\hi-i-1} + w_{\hi-i-1}). \end{aligned}$$

Let's consider the average state at this time, given all the past states and actions. Since we assume that $\E [w_\hi] = 0$ (the zero vector in $n_\st$ dimensions), when we take an expectation, the $w_\hi$ term vanishes due to linearity, and so we're left with

$$\E [\st_\hi \mid \st_{0:(\hi-1)}, \act_{0:(\hi-1)}] = A^\hi \st_0 + \sum_{i=0}^{\hi-1} A^i B \act_{\hi-i-1}.$$

Substituting the optimal linear policy $\act_\hi = -K_\hi \st_\hi$ into the dynamics introduces the quantity $A - B K_i$, which shows up frequently in control theory. For example, one important question is: will $\st_\hi$ remain bounded, or will it go to infinity as time goes on? To answer this, let's imagine for simplicity that these $K_i$s are equal (call this matrix $K$). Then the expression above becomes $(A-BK)^\hi \st_0$. Now consider the maximum eigenvalue $\lambda_{\max}$ of $A - BK$. If $|\lambda_{\max}| > 1$, then there's some nonzero initial state $\bar \st_0$, the corresponding eigenvector, for which

$$\lim_{\hi \to \infty} (A - BK)^\hi \bar \st_0 = \lim_{\hi \to \infty} \lambda_{\max}^\hi \bar \st_0 = \infty.$$

Otherwise, if $|\lambda_{\max}| < 1$, then it's impossible for the initial state to blow up as dramatically.
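To make the stability criterion concrete, here is a small check of the spectral radius of $A - BK$; the matrices below are a toy example with a hypothetical gain $K$, not something taken from the text.

```python
import numpy as np

def is_stable(A, B, K):
    """Return True if every eigenvalue of A - B K has magnitude < 1,
    so that the expected state (A - B K)^h x_0 shrinks over time."""
    eigvals = np.linalg.eigvals(A - B @ K)
    return bool(np.max(np.abs(eigvals)) < 1.0)

# Toy example: a discretized double integrator with a hypothetical gain K.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
K = np.array([[1.0, 2.0]])
print(is_stable(A, B, K))   # True: both closed-loop eigenvalues equal 0.9
```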

2.5 Extensions

We’ve now formulated an optimal solution for the time-homogeneous LQR and computed the expected state under the optimal policy. However, real world tasks rarely have such simple dynamics, and we may wish to design more complex cost functions. In this section, we’ll consider more general extensions of LQR where some of the assumptions we made above are relaxed. Specifically, we’ll consider:

  1. Time-dependency, where the dynamics and cost function might change depending on the timestep.

  2. General quadratic cost, where we allow for linear terms and a constant term.

  3. Tracking a goal trajectory rather than aiming for a single goal state-action pair.

Combining these will allow us to use the LQR solution to solve more complex setups by taking Taylor approximations of the dynamics and cost functions.

2.5.1 Time-dependent dynamics and cost function

So far, we've considered the time-homogeneous case, where the dynamics and cost function stay the same at every timestep. However, this might not always be the case. As an example, in many sports, the rules and scoring system might change during an overtime period. To address these sorts of problems, we can loosen the time-homogeneous restriction, and consider the case where the dynamics and cost function are time-dependent. Our analysis remains almost identical; in fact, we can simply add a time index to the matrices $A$ and $B$ that determine the dynamics and the matrices $Q$ and $R$ that determine the cost.

The modified problem is now defined as follows:

The derivation of the optimal value functions and the optimal policy remains almost exactly the same, and we can modify the Riccati equation accordingly:

Additionally, by allowing the dynamics to vary across time, we gain the ability to locally approximate nonlinear dynamics at each timestep. We’ll discuss this later in the chapter.

2.5.2 More general quadratic cost functions

Our original cost function had only second-order terms with respect to the state and action, incentivizing staying as close as possible to $(\st^\star, \act^\star) = (0, 0)$. We can also consider more general quadratic cost functions that also have first-order terms and a constant term. Combining this with time-dependent dynamics results in the following expression, where we introduce a new matrix $M_\hi$ for the cross term, linear coefficients $q_\hi$ and $r_\hi$ for the state and action respectively, and a constant term $c_\hi$:

$$c_\hi(\st_\hi, \act_\hi) = \left( \st_\hi^\top Q_\hi \st_\hi + \st_\hi^\top M_\hi \act_\hi + \act_\hi^\top R_\hi \act_\hi \right) + (\st_\hi^\top q_\hi + \act_\hi^\top r_\hi) + c_\hi.$$

Similarly, we can also include a constant term $v_\hi \in \mathbb{R}^{n_\st}$ in the dynamics (note that this is deterministic at each timestep, unlike the stochastic noise $w_\hi$):

$$\st_{\hi+1} = f_\hi(\st_\hi, \act_\hi, w_\hi) = A_\hi \st_\hi + B_\hi \act_\hi + v_\hi + w_\hi.$$

2.5.3 Tracking a predefined trajectory

Consider applying LQR to a task like autonomous driving, where the target state-action pair changes over time. We might want the vehicle to follow a predefined trajectory of states and actions $(\st_\hi^\star, \act_\hi^\star)_{\hi=0}^{\hor-1}$. To express this as a control problem, we'll need a corresponding time-dependent cost function:

$$c_\hi(\st_\hi, \act_\hi) = (\st_\hi - \st^\star_\hi)^\top Q (\st_\hi - \st^\star_\hi) + (\act_\hi - \act^\star_\hi)^\top R (\act_\hi - \act^\star_\hi).$$

Note that this punishes states and actions that are far from the intended trajectory. By expanding out these multiplications, we can see that this is actually a special case of the more general quadratic cost function above (2.38):

$$M_\hi = 0, \qquad q_\hi = -2Q \st^\star_\hi, \qquad r_\hi = -2R \act^\star_\hi, \qquad c_\hi = (\st^\star_\hi)^\top Q (\st^\star_\hi) + (\act^\star_\hi)^\top R (\act^\star_\hi).$$
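As a sketch of how these identities might be assembled in code, the helper below converts a target trajectory into the per-timestep terms of the general quadratic cost; the function and argument names are illustrative, not from the text.

```python
import numpy as np

def tracking_cost_terms(Q, R, x_star, u_star):
    """Convert a tracking objective into the general quadratic form above.

    Q, R:   cost matrices of shape (n_x, n_x) and (n_u, n_u)
    x_star: target states, shape (H, n_x)
    u_star: target actions, shape (H, n_u)
    Returns a list of per-timestep tuples (M_h, q_h, r_h, c_h).
    """
    H = x_star.shape[0]
    terms = []
    for h in range(H):
        M_h = np.zeros((Q.shape[0], R.shape[0]))      # no cross term
        q_h = -2 * Q @ x_star[h]                      # linear state coefficient
        r_h = -2 * R @ u_star[h]                      # linear action coefficient
        c_h = x_star[h] @ Q @ x_star[h] + u_star[h] @ R @ u_star[h]
        terms.append((M_h, q_h, r_h, c_h))
    return terms
```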

2.6 Approximating nonlinear dynamics

The LQR algorithm solves for the optimal policy when the dynamics are linear and the cost function is an upward-curved quadratic. However, real settings are rarely this simple! Let’s return to the CartPole example from the start of the chapter (Example 2.1). The dynamics (physics) aren’t linear. How can we approximate this by an LQR problem?

Concretely, let’s consider a noise-free problem since, as we saw, the noise doesn’t factor into the optimal policy. Let’s assume the dynamics and cost function are stationary, and ignore the terminal state for simplicity:

This is now only slightly simplified from the general optimal control problem (see Definition 2.1). Here, we don’t know an analytical form for the dynamics ff or the cost function cc, but we assume that we’re able to query/sample/simulate them to get their values at a given state and action. To clarify, consider the case where the dynamics are given by real world physics. We can’t (yet) write down an expression for the dynamics that we can differentiate or integrate analytically. However, we can still simulate the dynamics and cost function by running a real-world experiment and measuring the resulting states and costs. How can we adapt LQR to this more general nonlinear case?

2.6.1 Local linearization

How can we apply LQR when the dynamics are nonlinear or the cost function is more complex? We'll exploit the useful fact that we can take a function that's sufficiently smooth around $(\st^\star, \act^\star)$ and approximate it nearby with low-order polynomials (i.e. its Taylor approximation). In particular, as long as the dynamics $f$ are differentiable around $(\st^\star, \act^\star)$ and the cost function $c$ is twice differentiable at $(\st^\star, \act^\star)$, we can take a linear approximation of $f$ and a quadratic approximation of $c$ to bring us back to the regime of LQR.

Linearizing the dynamics around $(\st^\star, \act^\star)$ gives:

$$\begin{gathered} f(\st, \act) \approx f(\st^\star, \act^\star) + \nabla_\st f(\st^\star, \act^\star) (\st - \st^\star) + \nabla_\act f(\st^\star, \act^\star) (\act - \act^\star) \\ (\nabla_\st f(\st, \act))_{ij} = \frac{d f_i(\st, \act)}{d \st_j}, \quad i, j \le n_\st \qquad (\nabla_\act f(\st, \act))_{ij} = \frac{d f_i(\st, \act)}{d \act_j}, \quad i \le n_\st, j \le n_\act \end{gathered}$$

and quadratizing the cost function around $(\st^\star, \act^\star)$ gives:

$$\begin{aligned} c(\st, \act) & \approx c(\st^\star, \act^\star) \quad \text{constant term} \\ & \qquad + \nabla_\st c(\st^\star, \act^\star) (\st - \st^\star) + \nabla_\act c(\st^\star, \act^\star) (\act - \act^\star) \quad \text{linear terms} \\ & \left. \begin{aligned} & \qquad + \frac{1}{2} (\st - \st^\star)^\top \nabla_{\st \st} c(\st^\star, \act^\star) (\st - \st^\star) \\ & \qquad + \frac{1}{2} (\act - \act^\star)^\top \nabla_{\act \act} c(\st^\star, \act^\star) (\act - \act^\star) \\ & \qquad + (\st - \st^\star)^\top \nabla_{\st \act} c(\st^\star, \act^\star) (\act - \act^\star) \end{aligned} \right\} \text{quadratic terms} \end{aligned}$$

where the gradients and Hessians are defined as

$$\begin{aligned} (\nabla_\st c(\st, \act))_{i} & = \frac{d c(\st, \act)}{d \st_i}, \quad i \le n_\st & (\nabla_\act c(\st, \act))_{i} & = \frac{d c(\st, \act)}{d \act_i}, \quad i \le n_\act \\ (\nabla_{\st \st} c(\st, \act))_{ij} & = \frac{d^2 c(\st, \act)}{d \st_i d \st_j}, \quad i, j \le n_\st & (\nabla_{\act \act} c(\st, \act))_{ij} & = \frac{d^2 c(\st, \act)}{d \act_i d \act_j}, \quad i, j \le n_\act \\ (\nabla_{\st \act} c(\st, \act))_{ij} & = \frac{d^2 c(\st, \act)}{d \st_i d \act_j}, \quad i \le n_\st, j \le n_\act. \end{aligned}$$

Exercise: Note that this cost can be expressed in the general quadratic form seen in (2.38). Derive the corresponding quantities $Q, R, M, q, r, c$.

2.6.2 Finite differencing

To calculate these gradients and Hessians in practice, we use a method known as finite differencing for numerically computing derivatives. Namely, we can simply use the limit definition of the derivative, and see how the function changes as we add or subtract a tiny $\delta$ to the input.

$$\frac{d}{dx} f(x) = \lim_{\delta \to 0} \frac{f(x + \delta) - f(x)}{\delta}$$

Note that this only requires us to be able to query the function, not to have an analytical expression for it, which is why it’s so useful in practice.
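Below is a minimal sketch of forward finite differencing for the Jacobian of the dynamics; the step size `delta` is a typical but arbitrary choice, and the function name is illustrative. The gradients and Hessians of the cost can be approximated in the same query-only fashion.

```python
import numpy as np

def finite_diff_jacobian(f, x, delta=1e-5):
    """Approximate the Jacobian df_i/dx_j of f: R^n -> R^m at x by perturbing
    one coordinate at a time and querying f (no analytic form of f needed)."""
    y0 = np.asarray(f(x))
    n = x.shape[0]
    J = np.zeros((y0.shape[0], n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = delta
        J[:, j] = (np.asarray(f(x + e)) - y0) / delta
    return J
```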

2.6.3 Local convexification

However, simply taking the second-order approximation of the cost function is insufficient, since for the LQR setup we required that the $Q$ and $R$ matrices were positive definite, i.e. that all of their eigenvalues were positive.

One way to naively force some symmetric matrix $D$ to be positive definite is to set any non-positive eigenvalues to some small positive value $\varepsilon > 0$. Recall that any real symmetric matrix $D \in \mathbb{R}^{n \times n}$ has a basis of eigenvectors $u_1, \dots, u_n$ with corresponding eigenvalues $\lambda_1, \dots, \lambda_n$ such that $D u_i = \lambda_i u_i$. Then we can construct the positive definite approximation by

$$\widetilde{D} = \left( \sum_{i=1, \dots, n \mid \lambda_i > 0} \lambda_i u_i u_i^\top \right) + \varepsilon I.$$

Exercise: Convince yourself that $\widetilde{D}$ is indeed positive definite.

Note that Hessian matrices are generally symmetric, so we can apply this process to $Q$ and $R$ to obtain the positive definite approximations $\widetilde{Q}$ and $\widetilde{R}$. Now that we have an upward-curved quadratic approximation to the cost function, and a linear approximation to the state transitions, we can simply apply the time-homogeneous LQR methods from Section 2.4.
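A minimal sketch of this eigenvalue-clamping construction (the helper name is illustrative):

```python
import numpy as np

def make_positive_definite(D, eps=1e-6):
    """Drop the non-positive eigendirections of a symmetric matrix D and add
    eps * I, mirroring the construction of D-tilde above."""
    eigvals, eigvecs = np.linalg.eigh(D)           # eigh: for symmetric matrices
    clipped = np.where(eigvals > 0, eigvals, 0.0)  # keep only positive eigenvalues
    return eigvecs @ np.diag(clipped) @ eigvecs.T + eps * np.eye(D.shape[0])
```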

But what happens when we enter states far away from $\st^\star$ or want to use actions far from $\act^\star$? A Taylor approximation is only accurate in a local region around the point of linearization, so the performance of our LQR controller will degrade as we move further away. We'll see how to address this in the next section using the iterative LQR algorithm.

Figure 2.3: Local linearization might only be accurate in a small region around the point of linearization.

2.6.4 Iterative LQR

To address these issues with local linearization, we’ll use an iterative approach, where we repeatedly linearize around different points to create a time-dependent approximation of the dynamics, and then solve the resulting time-dependent LQR problem to obtain a better policy. This is known as iterative LQR or iLQR:

Now let's go through the details of each step. We'll use superscripts to denote the iteration of the algorithm. We'll also denote $\bar \st_0 = \E_{\st_0 \sim \mu_0} [\st_0]$ as the expected initial state.

At iteration $i$ of the algorithm, we begin with a candidate trajectory $\bar \tau^i = (\bar \st^i_0, \bar \act^i_0, \dots, \bar \st^i_{\hor-1}, \bar \act^i_{\hor-1})$.

Step 1: Form a time-dependent LQR problem. At each timestep $\hi \in [\hor]$, we use the techniques from Section 2.6 to linearize the dynamics and quadratize the cost function around $(\bar \st^i_\hi, \bar \act^i_\hi)$:

$$\begin{aligned} f_\hi(\st, \act) & \approx f(\bar {\st}^i_\hi, \bar {\act}^i_\hi) + \nabla_{\st } f(\bar {\st}^i_\hi, \bar {\act}^i_\hi)(\st - \bar {\st}^i_\hi) + \nabla_{\act } f(\bar {\st}^i_\hi, \bar {\act}^i_\hi)(\act - \bar {\act}^i_\hi) \\ c_\hi(\st, \act) & \approx c(\bar {\st}^i_\hi, \bar {\act}^i_\hi) + \begin{bmatrix} \st - \bar {\st }^i_\hi & \act - \bar {\act}^i_\hi \end{bmatrix} \begin{bmatrix} \nabla_{\st } c(\bar {\st}^i_\hi, \bar {\act}^i_\hi) \\ \nabla_{\act} c(\bar {\st}^i_\hi, \bar {\act}^i_\hi) \end{bmatrix} \\ & \qquad + \frac{1}{2} \begin{bmatrix} \st - \bar {\st }^i_\hi & \act - \bar {\act}^i_\hi \end{bmatrix} \begin{bmatrix} \nabla_{\st \st} c(\bar {\st}^i_\hi, \bar {\act}^i_\hi) & \nabla_{\st \act} c(\bar {\st}^i_\hi, \bar {\act}^i_\hi) \\ \nabla_{\act \st} c(\bar {\st}^i_\hi, \bar {\act}^i_\hi) & \nabla_{\act \act} c(\bar {\st}^i_\hi, \bar {\act}^i_\hi) \end{bmatrix} \begin{bmatrix} \st - \bar {\st }^i_\hi \\ \act - \bar {\act}^i_\hi \end{bmatrix}. \end{aligned}$$

Step 2: Compute the optimal policy. We can now solve the time-dependent LQR problem using the Riccati equation from Section 2.5.1 to compute the optimal policy $\pi^i_0, \dots, \pi^i_{\hor-1}$.

Step 3: Generate a new series of actions. We can then generate a new sample trajectory by taking actions according to this optimal policy:

$$\bar \st^{i+1}_0 = \bar \st_0, \qquad \widetilde \act_\hi = \pi^i_\hi(\bar \st^{i+1}_\hi), \qquad \bar \st^{i+1}_{\hi+1} = f(\bar \st^{i+1}_\hi, \widetilde \act_\hi).$$

Note that the states are sampled according to the true dynamics, which we assume we have query access to.

Step 4: Compute a better candidate trajectory. Note that we've denoted these actions as $\widetilde \act_\hi$ and aren't directly using them for the next iteration $\bar \act^{i+1}_\hi$. Rather, we want to interpolate between them and the actions from the previous iteration $\bar \act^i_0, \dots, \bar \act^i_{\hor-1}$. This is so that the cost will decrease monotonically: if the new policy turns out to actually be worse, we can stay closer to the previous trajectory. (Can you think of an intuitive example where this might happen?)

Formally, we want to find $\alpha \in [0, 1]$ to generate the next iteration of actions $\bar \act^{i+1}_0, \dots, \bar \act^{i+1}_{\hor-1}$ such that the cost is minimized:

$$\begin{aligned} \min_{\alpha \in [0, 1]} \quad & \sum_{\hi=0}^{\hor-1} c(\st_\hi, \bar \act^{i+1}_\hi) \\ \text{where} \quad & \st_{\hi+1} = f(\st_\hi, \bar \act^{i+1}_\hi) \\ & \bar \act^{i+1}_\hi = \alpha \bar \act^i_\hi + (1-\alpha) \widetilde \act_\hi \\ & \st_0 = \bar \st_0. \end{aligned}$$

Note that this optimizes over the closed interval $[0, 1]$, so by the Extreme Value Theorem (assuming the cost and dynamics are continuous), it's guaranteed to attain a global minimum.
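In practice, this one-dimensional search is often approximated by rolling out a small grid of candidate $\alpha$ values and keeping the cheapest one. Below is a sketch under that assumption; `f`, `c`, and the helper names are placeholders for query access to the true dynamics and cost, and the grid size is an arbitrary choice.

```python
import numpy as np

def rollout_cost(f, c, x0, actions):
    """Total cost of executing a fixed action sequence under the true dynamics f."""
    x, total = x0, 0.0
    for u in actions:
        total += c(x, u)
        x = f(x, u)
    return total

def line_search(f, c, x0, u_prev, u_new, num_alphas=11):
    """Pick alpha in [0, 1] minimizing the rollout cost of the interpolated
    actions alpha * u_prev[h] + (1 - alpha) * u_new[h], as in the objective above."""
    best_alpha, best_cost = None, np.inf
    for alpha in np.linspace(0.0, 1.0, num_alphas):
        actions = [alpha * up + (1 - alpha) * un for up, un in zip(u_prev, u_new)]
        cost = rollout_cost(f, c, x0, actions)
        if cost < best_cost:
            best_alpha, best_cost = alpha, cost
    return best_alpha, best_cost
```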

The final output of this algorithm is a policy $\pi^{n_\text{steps}}$ derived after $n_\text{steps}$ iterations of the algorithm. Though the proof is somewhat complex, one can show that for many nonlinear control problems, this solution converges to a locally optimal solution (in the policy space).

2.7 Summary

This chapter introduced some approaches to solving different variants of the optimal control problem (Definition 2.1). We began with the simple case of linear dynamics and an upward-curved quadratic cost. This model is called the LQR, and we solved for the optimal policy using dynamic programming. We then extended these results to the more general nonlinear case via local linearization. We finally saw the iterative LQR algorithm for solving nonlinear control problems.