Up to this point, we have considered decision problems with finitely
many states and actions. However, in many applications, states and
actions may take on continuous values. For example, consider autonomous
driving, controlling a robot’s joints, and automated manufacturing. How
can we teach computers to solve these kinds of problems? This is the
task of continuous control.
Aside from the change in the state and action spaces, the general
problem setup remains the same: we seek to construct an optimal policy
that outputs actions to solve the desired task. We will see that many
key ideas and algorithms, in particular dynamic programming algorithms,
carry over to this new setting.
This chapter introduces a fundamental tool to solve a simple class of
continuous control problems: the linear quadratic regulator. We will
then extend this basic method to more complex settings.
Recall that an MDP is defined by its state space $\mathcal{S}$, action space $\mathcal{A}$, state transitions $P$, reward function $r$, and discount factor $\gamma$ or time horizon $H$. These have equivalents in the control setting:
The state and action spaces are continuous rather than finite. That is, $\mathcal{S} \subseteq \mathbb{R}^{n_x}$ and $\mathcal{A} \subseteq \mathbb{R}^{n_u}$, where $n_x$ and $n_u$ are the corresponding dimensions of these spaces, i.e. the number of coordinates to specify a single state or action respectively.
We call the state transitions the dynamics of the system. In the most general case, these might change across timesteps and also include some stochastic noise $w_h$ at each timestep. We denote these dynamics as the function $f_h$ such that $x_{h+1} = f_h(x_h, u_h, w_h)$. Of course, we can simplify to cases where the dynamics are deterministic/noise-free (no $w_h$ term) and/or time-homogeneous (the same function $f$ across timesteps).
Instead of maximizing the reward function, we seek to minimize the cost function $c_h : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. Often, the cost function describes how far away we are from a target state-action pair $(x^\star, u^\star)$. An important special case is when the cost is time-homogeneous; that is, it remains the same function $c$ at each timestep $h$.
We seek to minimize the undiscounted cost within a finite time horizon $H$. Note that we end an episode at the final state $x_H$ -- there is no $u_H$, and so we denote the cost for the final state as $c_H(x_H)$.
With all of these components, we can now formulate the optimal control problem: compute a policy to minimize the expected undiscounted cost over $H$ timesteps. In this chapter, we will only consider deterministic, time-dependent policies $\pi = (\pi_0, \dots, \pi_{H-1})$ where $\pi_h : \mathcal{S} \to \mathcal{A}$ for each $h \in [H]$.
Can we solve this problem using tools from the finite MDP setting? If $\mathcal{S}$ and $\mathcal{A}$ were finite, then we’d be able to work backwards using the DP algorithm for computing the optimal policy in an MDP (Definition 1.11).
This inspires us to try discretizing the
problem.
Suppose $\mathcal{S}$ and $\mathcal{A}$ are bounded, that is, $\max_{x \in \mathcal{S}} \|x\| \le B_x$ and $\max_{u \in \mathcal{A}} \|u\| \le B_u$. To make $\mathcal{S}$ and $\mathcal{A}$ finite, let’s choose some small positive $\varepsilon$, and simply round each coordinate to the nearest multiple of $\varepsilon$. For example, if $\varepsilon = 0.01$, then we round each element of $x$ and $u$ to two decimal places.
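For illustration, here is a minimal sketch of this rounding scheme in Python (assuming NumPy; the helper name `discretize` is ours, not from any particular library):

```python
import numpy as np

def discretize(v: np.ndarray, eps: float = 0.01) -> np.ndarray:
    """Round each coordinate of a state or action vector to the
    nearest multiple of eps."""
    return np.round(v / eps) * eps

# Example: a 3-dimensional state rounded to the nearest 0.01
x = np.array([0.123, -0.456, 0.789])
print(discretize(x))  # approximately [ 0.12 -0.46  0.79]
```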
However, even though the discretized $\mathcal{S}$ and $\mathcal{A}$ are finite, they may be infeasibly large: we must divide each dimension into intervals of length $\varepsilon$, resulting in $|\mathcal{S}| = (B_x/\varepsilon)^{n_x}$ and $|\mathcal{A}| = (B_u/\varepsilon)^{n_u}$. To get a sense of how quickly this grows, consider $\varepsilon = 0.01$, $n_x = n_u = 10$, and $B_x = B_u = 1$. Then the number of elements in the transition matrix would be $|\mathcal{S}|^2 |\mathcal{A}| = (100^{10})^2 (100^{10}) = 10^{60}$! (That’s a trillion trillion trillion trillion trillion.)
What properties of the problem could we instead make use of? Note that
by discretizing the state and action spaces, we implicitly assumed that
rounding each state or action vector by some tiny amount $\varepsilon$
wouldn’t change the behavior of the system by much; namely, that the
cost and dynamics were relatively continuous. Can we use this
continuous structure in other ways? This leads us to the linear
quadratic regulator.
The optimal control problem (Definition 2.1) seems highly complex in general. Is there a relevant simplification that we can analyze?
The linear quadratic regulator (LQR) is such a special case: it can be solved exactly, and it is a fundamental tool in control theory.
We will henceforth abbreviate “symmetric positive definite” as s.p.d.
and “positive definite” as p.d.
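Concretely, the (time-homogeneous) LQR problem is typically stated along the following lines (a sketch consistent with the rest of this chapter; the exact formal statement may differ slightly, and here the noise $w_h$ is assumed zero-mean, e.g. Gaussian with covariance $\sigma^2 I$):

$$\min_{\pi_0, \dots, \pi_{H-1}} \quad \mathbb{E}\left[ \left( \sum_{h=0}^{H-1} x_h^\top Q x_h + u_h^\top R u_h \right) + x_H^\top Q x_H \right]$$

$$\text{where} \quad x_{h+1} = A x_h + B u_h + w_h, \qquad u_h = \pi_h(x_h), \qquad x_0 \sim \mu_0,$$

with $Q$ and $R$ s.p.d. matrices defining the quadratic cost $c(x, u) = x^\top Q x + u^\top R u$.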
It will be helpful to reintroduce the value function notation for a policy to denote the expected cost it incurs.
These will be instrumental in constructing the optimal policy via dynamic programming,
as we did in Section 1.3.2 for MDPs.
In this section, we’ll compute the optimal value function $V_h^\star$, Q-function $Q_h^\star$, and policy $\pi_h^\star$ in the linear quadratic regulator using dynamic programming, in a very similar way to the DP algorithms in the MDP setting.
Recall the definition of the optimal value function:
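In the cost-minimization setting of this chapter, it reads roughly as follows (a sketch in this chapter’s notation):

$$V_h^\star(x) = \min_{\pi_h, \dots, \pi_{H-1}} \mathbb{E}\left[\left(\sum_{h'=h}^{H-1} c(x_{h'}, u_{h'})\right) + c(x_H) \;\middle|\; x_h = x,\; u_{h'} = \pi_{h'}(x_{h'})\right],$$

and $Q_h^\star(x, u)$ is defined analogously, conditioning on taking action $u$ in state $x$ at time $h$ and acting optimally afterwards.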
We will prove the striking fact that the solution has very simple structure: $V_h^\star$ and $Q_h^\star$ are upward-curved quadratics, and $\pi_h^\star$ is linear and furthermore does not depend on the noise!
The construction (and inductive proof) proceeds similarly to the one in the MDP setting. We’ll compute $V_H^\star$ (at the end of the horizon) as our base case. Then we’ll work step-by-step backwards in time, using $V_{h+1}^\star$ to compute $Q_h^\star$, $\pi_h^\star$, and $V_h^\star$.
Base case:
At the final timestep, there are no possible actions to take, and so $V_H^\star(x) = c(x) = x^\top Q x$. Thus $V_H^\star(x) = x^\top P_H x + p_H$ where $P_H = Q$ and $p_H = 0$.
Inductive hypothesis:
We seek to show that the inductive step holds for both theorems: if $V_{h+1}^\star(x)$ is an upward-curved quadratic, then $V_h^\star(x)$ must also be an upward-curved quadratic, and $\pi_h^\star(x)$ must be linear.
We’ll break this down into the following steps:
1. Show that $Q_h^\star(x, u)$ is an upward-curved quadratic (in both $x$ and $u$).
2. Derive the optimal policy $\pi_h^\star(x) = \arg\min_u Q_h^\star(x, u)$ and show that it’s linear.
3. Show that $V_h^\star(x)$ is an upward-curved quadratic.
We first assume the inductive hypothesis that our theorems are true at time $h+1$. That is,
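$$V_{h+1}^\star(x) = x^\top P_{h+1} x + p_{h+1} \quad \text{for some s.p.d. matrix } P_{h+1} \text{ and scalar } p_{h+1} \ge 0.$$

(The full algebra is omitted here; what follows is a sketch of the standard calculation, assuming for concreteness that the noise $w_h$ has mean zero and covariance $\sigma^2 I$.) Expanding $Q_h^\star(x, u) = c(x, u) + \mathbb{E}_{w_h}\big[V_{h+1}^\star(A x + B u + w_h)\big]$ shows that it is an upward-curved quadratic in $(x, u)$. Minimizing over $u$ gives the linear policy

$$\pi_h^\star(x) = -K_h x, \qquad K_h = \big(R + B^\top P_{h+1} B\big)^{-1} B^\top P_{h+1} A,$$

and substituting this back in yields $V_h^\star(x) = x^\top P_h x + p_h$ with

$$P_h = Q + A^\top P_{h+1} A - A^\top P_{h+1} B \big(R + B^\top P_{h+1} B\big)^{-1} B^\top P_{h+1} A, \qquad p_h = p_{h+1} + \sigma^2 \operatorname{tr}(P_{h+1}).$$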
Now we’ve shown that $V_h^\star(x) = x^\top P_h x + p_h$, where $P_h$ is s.p.d., completing the inductive step and thus the proofs of Theorem 2.2 and Theorem 2.1.
In summary, we just demonstrated that at each timestep $h \in [H]$, the optimal value function $V_h^\star$ and optimal Q-function $Q_h^\star$ are both upward-curved quadratics and the optimal policy $\pi_h^\star$ is linear. We also showed that all of these quantities can be calculated using a sequence of s.p.d. matrices $P_0, \dots, P_H$ that can be defined recursively using the Riccati equation (Definition 2.5).
Before we move on to some extensions of LQR, let’s consider how the
state at time h behaves when we act according to this optimal
policy.
How can we compute the expected state at time $h$ when acting according to the optimal policy? Let’s first express $x_h$ in a cleaner way in terms of the history. Note that having linear dynamics makes it easy to expand terms backwards in time:
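$$x_h = A x_{h-1} + B u_{h-1} + w_{h-1} = (A - B K_{h-1})\, x_{h-1} + w_{h-1} = (A - B K_{h-1})\big[(A - B K_{h-2})\, x_{h-2} + w_{h-2}\big] + w_{h-1} = \cdots$$

where we write the optimal policy as $u_h = \pi_h^\star(x_h) = -K_h x_h$, as in the derivation above.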
Let’s consider the average state at this time, given all the past states and actions. Since we assume that $\mathbb{E}[w_h] = 0$ (the zero vector in $n_x$ dimensions), when we take an expectation, the $w_h$ term vanishes due to linearity, and so we’re left with
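$$\mathbb{E}[x_h] = (A - B K_{h-1})(A - B K_{h-2}) \cdots (A - B K_0)\, x_0.$$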
This introduces the quantity $A - B K_i$, which shows up frequently in control theory. For example, one important question is: will $x_h$ remain bounded, or will it go to infinity as time goes on? To answer this, let’s imagine for simplicity that these $K_i$s are equal (call this matrix $K$). Then the expression above becomes $(A - BK)^h x_0$. Now consider the maximum eigenvalue $\lambda_{\max}$ of $A - BK$. If $|\lambda_{\max}| > 1$, then there’s some nonzero initial state $\bar{x}_0$, the corresponding eigenvector, for which
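$$\left\| (A - BK)^h \bar{x}_0 \right\| = |\lambda_{\max}|^h \left\| \bar{x}_0 \right\| \to \infty \quad \text{as } h \to \infty.$$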
We’ve now formulated an optimal solution for the time-homogeneous LQR
and computed the expected state under the optimal policy. However, real
world tasks rarely have such simple dynamics, and we may wish to design
more complex cost functions. In this section, we’ll consider more
general extensions of LQR where some of the assumptions we made above
are relaxed. Specifically, we’ll consider:
Time-dependency, where the dynamics and cost function might
change depending on the timestep.
General quadratic cost, where we allow for linear terms and a
constant term.
Tracking a goal trajectory rather than aiming for a single goal
state-action pair.
Combining these will allow us to use the LQR solution to solve more
complex setups by taking Taylor approximations of the dynamics and
cost functions.
So far, we’ve considered the time-homogeneous case, where the dynamics
and cost function stay the same at every timestep. However, this might
not always be the case. As an example, in many sports, the rules and
scoring system might change during an overtime period. To address these
sorts of problems, we can loosen the time-homogeneous restriction, and
consider the case where the dynamics and cost function are
time-dependent. Our analysis remains almost identical; in fact, we can simply add a time index to the matrices $A$ and $B$ that determine the dynamics and the matrices $Q$ and $R$ that determine the cost.
The modified problem is now defined as follows:
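(A sketch mirroring the time-homogeneous problem above, now with time indices on the matrices:)

$$\min_{\pi_0, \dots, \pi_{H-1}} \quad \mathbb{E}\left[ \left( \sum_{h=0}^{H-1} x_h^\top Q_h x_h + u_h^\top R_h u_h \right) + x_H^\top Q_H x_H \right]$$

$$\text{where} \quad x_{h+1} = A_h x_h + B_h u_h + w_h, \qquad u_h = \pi_h(x_h), \qquad x_0 \sim \mu_0.$$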
The derivation of the optimal value functions and the optimal policy
remains almost exactly the same, and we can modify the Riccati equation
accordingly:
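$$P_h = Q_h + A_h^\top P_{h+1} A_h - A_h^\top P_{h+1} B_h \left(R_h + B_h^\top P_{h+1} B_h\right)^{-1} B_h^\top P_{h+1} A_h,$$

with base case $P_H = Q_H$ (a sketch of the standard time-varying form; only the matrices acquire time indices).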
Additionally, by allowing the dynamics to vary across time, we gain the
ability to locally approximate nonlinear dynamics at each timestep.
We’ll discuss this later in the chapter.
Our original cost function had only second-order terms with respect to the state and action, incentivizing staying as close as possible to $(x^\star, u^\star) = (0, 0)$. We can also consider more general quadratic cost functions that also have first-order terms and a constant term. Combining this with time-dependent dynamics results in the following expression, where we introduce a new matrix $M_h$ for the cross term, linear coefficients $q_h$ and $r_h$ for the state and action respectively, and a constant term $c_h$:
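$$c_h(x_h, u_h) = x_h^\top Q_h x_h + x_h^\top M_h u_h + u_h^\top R_h u_h + q_h^\top x_h + r_h^\top u_h + c_h$$

(one standard way to write it; the exact placement of constant factors is a matter of convention).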
Similarly, we can also include a constant term $v_h \in \mathbb{R}^{n_x}$ in the dynamics (note that this is deterministic at each timestep, unlike the stochastic noise $w_h$):
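$$x_{h+1} = f_h(x_h, u_h, w_h) = A_h x_h + B_h u_h + v_h + w_h.$$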
Consider applying LQR to a task like autonomous driving, where the target state-action pair changes over time. We might want the vehicle to follow a predefined trajectory of states and actions $(x_h^\star, u_h^\star)_{h=0}^{H-1}$. To express this as a control problem, we’ll need a corresponding time-dependent cost function:
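$$c_h(x_h, u_h) = (x_h - x_h^\star)^\top Q\, (x_h - x_h^\star) + (u_h - u_h^\star)^\top R\, (u_h - u_h^\star).$$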
Note that this punishes states and actions that are far from the
intended trajectory. By expanding out these multiplications, we can see
that this is actually a special case of the more general quadratic cost
function above (2.38):
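(Sketching the identification under the convention used above:)

$$c_h(x_h, u_h) = x_h^\top Q x_h + u_h^\top R u_h - 2 (x_h^\star)^\top Q\, x_h - 2 (u_h^\star)^\top R\, u_h + (x_h^\star)^\top Q x_h^\star + (u_h^\star)^\top R u_h^\star,$$

so that $Q_h = Q$, $R_h = R$, $M_h = 0$, $q_h = -2 Q x_h^\star$, $r_h = -2 R u_h^\star$, and the constant term is $(x_h^\star)^\top Q x_h^\star + (u_h^\star)^\top R u_h^\star$.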
The LQR algorithm solves for the optimal policy when the dynamics are
linear and the cost function is an upward-curved quadratic. However,
real settings are rarely this simple! Let’s return to the CartPole
example from the start of the chapter
(Example 2.1). The dynamics (physics) aren’t linear. How
can we approximate this by an LQR problem?
Concretely, let’s consider a noise-free problem since, as we saw, the
noise doesn’t factor into the optimal policy. Let’s assume the dynamics
and cost function are stationary, and ignore the terminal state for
simplicity:
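$$\min_{\pi_0, \dots, \pi_{H-1}} \; \sum_{h=0}^{H-1} c(x_h, u_h) \quad \text{where} \quad x_{h+1} = f(x_h, u_h), \qquad u_h = \pi_h(x_h),$$

with $x_0$ given (or drawn from some initial distribution $\mu_0$).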
This is now only slightly simplified from the general optimal control
problem (see
Definition 2.1). Here, we don’t know an analytical form
for the dynamics f or the cost function c, but we assume that we’re
able to query/sample/simulate them to get their values at a given
state and action. To clarify, consider the case where the dynamics are
given by real world physics. We can’t (yet) write down an expression for
the dynamics that we can differentiate or integrate analytically.
However, we can still simulate the dynamics and cost function by
running a real-world experiment and measuring the resulting states and
costs. How can we adapt LQR to this more general nonlinear case?
How can we apply LQR when the dynamics are nonlinear or the cost function is more complex? We’ll exploit the useful fact that we can take a function that’s locally continuous around $(x^\star, u^\star)$ and approximate it nearby with low-order polynomials (i.e. its Taylor approximation). In particular, as long as the dynamics $f$ are differentiable around $(x^\star, u^\star)$ and the cost function $c$ is twice differentiable at $(x^\star, u^\star)$, we can take a linear approximation of $f$ and a quadratic approximation of $c$ to bring us back to the regime of LQR.
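Concretely, the approximations take roughly the following form (a sketch; all derivatives are evaluated at $(x^\star, u^\star)$):

$$f(x, u) \approx f(x^\star, u^\star) + \nabla_x f\, (x - x^\star) + \nabla_u f\, (u - u^\star),$$

$$c(x, u) \approx c(x^\star, u^\star) + \nabla_x c^\top (x - x^\star) + \nabla_u c^\top (u - u^\star) + \tfrac{1}{2}(x - x^\star)^\top \nabla_{xx} c\, (x - x^\star) + (x - x^\star)^\top \nabla_{xu} c\, (u - u^\star) + \tfrac{1}{2}(u - u^\star)^\top \nabla_{uu} c\, (u - u^\star).$$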
To calculate these gradients and Hessians in practice,
we use a method known as finite differencing for numerically computing derivatives.
Namely, we can simply use the limit definition of the derivative, and
see how the function changes as we add or subtract a tiny $\delta$ to the input.
Note that this only requires us to be able to query the function, not
to have an analytical expression for it, which is why it’s so useful in
practice.
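As a minimal sketch (in Python with NumPy; the function name `finite_diff_grad` is ours), central finite differences approximate a gradient using only function evaluations:

```python
import numpy as np

def finite_diff_grad(f, x: np.ndarray, delta: float = 1e-5) -> np.ndarray:
    """Approximate the gradient of a scalar function f at x by central
    differences: (f(x + delta*e_i) - f(x - delta*e_i)) / (2*delta)."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[i] = delta
        grad[i] = (f(x + e) - f(x - e)) / (2 * delta)
    return grad

# Example: the gradient of c(x) = x^T x at x = (1, 2) is (2, 4)
print(finite_diff_grad(lambda x: x @ x, np.array([1.0, 2.0])))  # approximately [2. 4.]
```

The same idea, applied twice (or to each partial derivative), yields the Hessians needed for the quadratic approximation of the cost.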
However, simply taking the second-order approximation of the cost function is insufficient, since for the LQR setup we required that the $Q$ and $R$ matrices were positive definite, i.e. that all of their eigenvalues were positive.
One way to naively force some symmetric matrix $D$ to be positive definite is to set any non-positive eigenvalues to some small positive value $\varepsilon > 0$. Recall that any real symmetric matrix $D \in \mathbb{R}^{n \times n}$ has an orthonormal basis of eigenvectors $u_1, \dots, u_n$ with corresponding eigenvalues $\lambda_1, \dots, \lambda_n$ such that $D u_i = \lambda_i u_i$.
Then we can construct the positive definite approximation by
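$$\tilde{D} = \sum_{i=1}^{n} \tilde{\lambda}_i\, u_i u_i^\top, \qquad \tilde{\lambda}_i = \begin{cases} \lambda_i & \text{if } \lambda_i > 0, \\ \varepsilon & \text{otherwise}. \end{cases}$$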
Exercise: Convince yourself that $\tilde{D}$ is indeed positive definite.
Note that Hessian matrices are generally symmetric, so we can apply this process to $Q$ and $R$ to obtain the positive definite approximations $\tilde{Q}$ and $\tilde{R}$.
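A minimal sketch of this eigenvalue-clipping step in Python (assuming NumPy; `make_pd` is our name for the helper):

```python
import numpy as np

def make_pd(D: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Return a positive definite approximation of a symmetric matrix D
    by replacing its non-positive eigenvalues with a small eps > 0."""
    lam, U = np.linalg.eigh(D)           # eigendecomposition of symmetric D
    lam = np.where(lam > 0, lam, eps)    # clip non-positive eigenvalues
    return U @ np.diag(lam) @ U.T

# Example: a symmetric matrix with eigenvalues 3 and -1
D = np.array([[1.0, 2.0], [2.0, 1.0]])
print(np.linalg.eigvalsh(make_pd(D)))   # both eigenvalues are now positive
```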
Now that we have an upward-curved quadratic approximation to the cost function, and a linear approximation to the state transitions, we can simply apply the time-homogeneous LQR methods from Section 2.4.
But what happens when we enter states far away from x⋆ or want
to use actions far from u⋆? A Taylor approximation is only
accurate in a local region around the point of linearization, so the
performance of our LQR controller will degrade as we move further away.
We’ll see how to address this in the next section using the iterative LQR algorithm.
To address these issues with local linearization, we’ll use an iterative
approach, where we repeatedly linearize around different points to
create a time-dependent approximation of the dynamics, and then solve
the resulting time-dependent LQR problem to obtain a better policy. This
is known as iterative LQR or iLQR:
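(A high-level sketch of the loop; each step is detailed below.)

1. Form a time-dependent LQR problem by linearizing the dynamics and quadratizing the cost around the current candidate trajectory.
2. Compute the optimal policy for this LQR problem.
3. Generate a new series of actions by rolling out this policy under the true dynamics.
4. Compute a better candidate trajectory by interpolating between the new and previous actions; repeat from step 1 until convergence.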
Now let’s go through the details of each step. We’ll use superscripts to denote the iteration of the algorithm. We’ll also denote $\bar{x}_0 = \mathbb{E}_{x_0 \sim \mu_0}[x_0]$ as the expected initial state.
At iteration $i$ of the algorithm, we begin with a candidate trajectory $\bar{\tau}^i = (\bar{x}_0^i, \bar{u}_0^i, \dots, \bar{x}_{H-1}^i, \bar{u}_{H-1}^i)$.
Step 1: Form a time-dependent LQR problem. At each timestep $h \in [H]$, we use the techniques from Section 2.6 to linearize the dynamics and quadratize the cost function around $(\bar{x}_h^i, \bar{u}_h^i)$:
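$$f_h(x, u) \approx f(\bar{x}_h^i, \bar{u}_h^i) + \nabla_x f(\bar{x}_h^i, \bar{u}_h^i)\,(x - \bar{x}_h^i) + \nabla_u f(\bar{x}_h^i, \bar{u}_h^i)\,(u - \bar{u}_h^i),$$

and $c_h(x, u)$ is taken to be the second-order Taylor expansion of $c$ around $(\bar{x}_h^i, \bar{u}_h^i)$, with its Hessians adjusted to be positive definite as in Section 2.6 (a sketch; the resulting time-dependent matrices play the roles of $A_h, B_h, Q_h, R_h, \dots$ in the general quadratic cost above).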
Step 2: Compute the optimal policy. We can now solve the time-dependent LQR problem using the Riccati equation from Section 2.5.1 to compute the optimal policy $\pi_0^i, \dots, \pi_{H-1}^i$.
Step 3: Generate a new series of actions. We can then generate a new
sample trajectory by taking actions according to this optimal policy:
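$$x_0 = \bar{x}_0, \qquad u_h = \pi_h^i(x_h), \qquad x_{h+1} = f(x_h, u_h) \quad \text{for } h = 0, \dots, H-1.$$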
Note that the states are sampled according to the true dynamics, which
we assume we have query access to.
Step 4: Compute a better candidate trajectory. Note that we’ve denoted these actions as $u_h$ and aren’t directly using them for the next iteration $\bar{u}_h^{i+1}$. Rather, we want to interpolate between them and the actions from the previous iteration $\bar{u}_0^i, \dots, \bar{u}_{H-1}^i$. This is so that the cost decreases monotonically: if the new policy turns out to actually be worse, we can stay closer to the previous trajectory. (Can you think of an intuitive example where this might happen?)
Formally, we want to find $\alpha \in [0, 1]$ to generate the next iteration of actions $\bar{u}_0^{i+1}, \dots, \bar{u}_{H-1}^{i+1}$ such that the cost is minimized:
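$$\min_{\alpha \in [0, 1]} \; \sum_{h=0}^{H-1} c\big(x_h, \bar{u}_h^{i+1}\big), \qquad \bar{u}_h^{i+1} = \alpha\, \bar{u}_h^i + (1 - \alpha)\, u_h, \qquad x_{h+1} = f\big(x_h, \bar{u}_h^{i+1}\big), \quad x_0 = \bar{x}_0$$

(a sketch: the states are rolled out under the true dynamics using the interpolated actions).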
Note that this optimizes over the closed interval $[0, 1]$, so by the Extreme Value Theorem, it’s guaranteed to have a global minimum.
The final output of this algorithm is a policy $\pi^{n_\text{steps}}$ derived after $n_\text{steps}$ iterations of the algorithm. Though the proof is somewhat complex, one can show that for many nonlinear control problems, this solution converges to a locally optimal solution (in the policy space).
This chapter introduced some approaches to solving different variants of
the optimal control problem
Definition 2.1. We began with the simple case of linear
dynamics and an upward-curved quadratic cost. This model is called the
LQR and we solved for the optimal policy using dynamic programming. We
then extended these results to the more general nonlinear case via local
linearization. We finally saw the iterative LQR algorithm for solving
nonlinear control problems.