The field of RL studies how an agent can learn to make sequential decisions in an interactive environment.
This is a very general problem!
How can we formalize this task in a way that is both sufficiently general yet also tractable enough for fruitful analysis?
Let’s consider some examples of sequential decision problems to identify the key common properties we’d like to capture:
Board games and video games, where a player takes actions in a virtual environment.
Inventory management, where a company must efficiently move resources from producers to consumers.
Robotic control, where a robot can move and interact with the real world to complete some task.
In these environments and many others, the state transitions,
the “rules” of the environment,
only depend on the most recent state and action (generally speaking).
For example, if you want to take a break while playing a game of chess,
you could take a picture of the board,
and later on reset the board to that state and continue playing;
the past history of moves doesn’t matter (generally speaking).
This is called the Markov property.
Environments that satisfy the Markov property are called Markov decision processes (MDPs).
This chapter will focus on introducing core vocabulary for MDPs that will be useful throughout the book.
MDPs are usually classified as finite-horizon, where the interactions end after some finite number of time steps,
or infinite-horizon, where the interactions can continue indefinitely.
We’ll begin with the finite-horizon case and discuss the infinite-horizon case in the second half of the chapter.
We’ll describe how to evaluate different strategies, called policies, and how to compute (or approximate)
the optimal policy for a given MDP.
We’ll introduce the Bellman consistency condition, which allows us to analyze the whole sequence of interactions in terms of individual timesteps.
from utils import NamedTuple, Float, Array, partial, jax, jnp, latexify
class MDP(NamedTuple):
"""A description of a Markov decision process with finitely many states and actions."""
S: int # number of states
A: int # number of actions
    μ: Float[Array, " S"]  # initial state distribution
P: Float[Array, "S A S"] # "current" state, "current" action, "next" state
    r: Float[Array, "S A"]  # reward for each (state, action) pair
    H: int  # time horizon (number of timesteps)
γ: float = 1.0 # discount factor (used later)
Note that for finite state and action spaces,
we can represent a randomized mapping $\mathcal{S} \to \triangle(\mathcal{A})$
as a matrix $\pi \in [0, 1]^{\mathcal{S} \times \mathcal{A}}$ where each row describes
the policy’s distribution over actions for the corresponding state.
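As a minimal illustration (the sizes used here are hypothetical, chosen only for this example), the uniformly random policy is the matrix whose rows are all uniform distributions:

n_states, n_actions = 3, 2  # hypothetical sizes, for illustration only
π_uniform = jnp.ones((n_states, n_actions)) / n_actions
assert jnp.allclose(π_uniform.sum(axis=1), 1.0)  # each row is a distribution over actions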
A fascinating result is that every finite-horizon MDP has an optimal deterministic time-dependent policy!
Intuitively, the Markov property implies that the current state contains all the information we need to make the optimal decision.
We’ll prove this result constructively later in the chapter.
class Transition(NamedTuple):
"""A single state-action-reward interaction with the environment.
A trajectory comprises a sequence of transitions.
"""
    s: int  # state
    a: int  # action
    r: float  # reward
Once we’ve chosen a policy,
we can sample trajectories by repeatedly choosing actions according to the policy,
transitioning according to the state transitions, and observing the rewards.
That is, a policy induces a distribution $\rho^\pi$ over trajectories.
(We assume that $\mu$ and $P$ are clear from context.)
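To make this concrete, here is a minimal sketch of sampling one trajectory from $\rho^\pi$; the helper sample_trajectory is not part of the text, and the use of jax.random here is an illustrative choice:

def sample_trajectory(key, mdp: MDP, π: Float[Array, "S A"]) -> list[Transition]:
    """Sample one length-H trajectory by following the (state-dependent) policy π."""
    key, key_init = jax.random.split(key)
    s = int(jax.random.choice(key_init, mdp.S, p=mdp.μ))  # s_0 ~ μ
    τ = []
    for _ in range(mdp.H):
        key, key_a, key_s = jax.random.split(key, 3)
        a = int(jax.random.choice(key_a, mdp.A, p=π[s]))  # a_h ~ π(s_h)
        τ.append(Transition(s=s, a=a, r=float(mdp.r[s, a])))  # observe r(s_h, a_h)
        s = int(jax.random.choice(key_s, mdp.S, p=mdp.P[s, a]))  # s_{h+1} ~ P(s_h, a_h)
    return τ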
Note that for a state-dependent policy, using the Markov property Definition 1.1,
we can write down the likelihood function of this probability distribution in an autoregressive way (i.e. one timestep at a time):
def trajectory_log_likelihood(
mdp: MDP,
τ: list[Transition],
π: Float[Array, "S A"],
) -> float:
"""Compute the log-likelihood of a trajectory under a given MDP and policy."""
# initial distribution and action
total = jnp.log(mdp.μ[τ[0].s])
total += jnp.log(π[τ[0].s, τ[0].a])
# remaining state transitions and actions
for i in range(1, mdp.H):
total += jnp.log(mdp.P[τ[i - 1].s, τ[i - 1].a, τ[i].s])
total += jnp.log(π[τ[i].s, τ[i].a])
return total
For a deterministic policy $\pi$, we have that $\pi_h(a \mid s) = \mathbb{1}[a = \pi_h(s)]$;
that is, the probability of taking an action is 1 if it’s the unique action prescribed by the policy for that state and 0 otherwise.
In this case, the only randomness in sampling trajectories comes from the initial state distribution $\mu$ and the state transitions $P$.
The main goal of RL is to find a policy that maximizes the expected total
reward $\mathbb{E}[r_0 + \cdots + r_{H-1}]$.
Let’s introduce some notation for analyzing this quantity.
A policy’s value function at time $h$ is its expected remaining reward from a given state:

$$V^\pi_h(s) := \mathbb{E}_{\tau \sim \rho^\pi}\left[r_h + r_{h+1} + \cdots + r_{H-1} \mid s_h = s\right].$$
Similarly, we can define the action-value function (aka the
Q-function) at time $h$ as the expected remaining reward from a given state and taking a given action:

$$Q^\pi_h(s, a) := \mathbb{E}_{\tau \sim \rho^\pi}\left[r_h + r_{h+1} + \cdots + r_{H-1} \mid s_h = s, a_h = a\right].$$
1.2.4.1 Relating the value function and action-value function
Note that the value function is just the expected action-value over
actions drawn from the policy:

$$V^\pi_h(s) = \mathbb{E}_{a \sim \pi_h(s)}\left[Q^\pi_h(s, a)\right]$$
def q_to_v(
policy: Float[Array, "S A"],
q: Float[Array, "S A"],
) -> Float[Array, " S"]:
"""
Compute the value function for a given policy in a known finite MDP
at a single timestep from its action-value function.
"""
return jnp.average(q, weights=policy, axis=1)
and the action-value is the sum of the immediate reward and the expected value of the following
state:

$$Q^\pi_h(s, a) = r(s, a) + \mathbb{E}_{s' \sim P(s, a)}\left[V^\pi_{h+1}(s')\right]$$
def v_to_q(
mdp: MDP,
v_next: Float[Array, " S"],
) -> Float[Array, "S A"]:
"""
Compute the action-value function in a known finite MDP
at a single timestep from the corresponding value function.
"""
# the discount factor is relevant later
return mdp.r + mdp.γ * mdp.P @ v_next
# convert a list of v functions to a list of q functions
v_ary_to_q_ary = jax.vmap(v_to_q, in_axes=(None, 0))
For any given $Q \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|}$, we can define the greedy policy $\hat\pi_Q$ as the deterministic policy that selects the action with the highest $Q$-value at each state:
def q_to_greedy(q: Float[Array, "S A"]) -> Float[Array, "S A"]:
"""
Get the (deterministic) greedy policy with respect to an action-value function.
Return the policy as a matrix of shape (S, A) where each row is a one-hot vector.
"""
A = q.shape[1]
a_ary = jnp.argmax(q, axis=1)
return jnp.eye(A)[a_ary]
def v_to_greedy(mdp: MDP, v: Float[Array, " S"]) -> Float[Array, "S A"]:
"""Get the (deterministic) greedy policy with respect to a value function."""
return q_to_greedy(v_to_q(mdp, v))
Note that by simply considering the cumulative reward as the sum of the
current reward and the future cumulative reward, we can describe the
value function recursively (in terms of itself). This is named the
Bellman consistency equation after Richard Bellman (1920--1984),
who is credited with introducing dynamic programming in 1953.
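Concretely, for a state-dependent policy in the finite-horizon setting (with the convention $V^\pi_H \equiv 0$), the recursion takes the following form; the check below also carries a discount factor $\gamma$, which defaults to 1 here:

$$V^\pi_h(s) = \mathbb{E}_{a \sim \pi_h(s)}\left[r(s, a) + \mathbb{E}_{s' \sim P(s, a)}\left[V^\pi_{h+1}(s')\right]\right].$$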
def check_bellman_consistency_v(
mdp: MDP,
policy: Float[Array, "H S A"],
v_ary: Float[Array, "H S"],
) -> bool:
"""
Check that the given (time-dependent) "value function"
satisfies the Bellman consistency equation.
"""
return all(
jnp.allclose(
# lhs
v_ary[h],
# rhs
jnp.sum(policy[h] * (mdp.r + mdp.γ * mdp.P @ v_ary[h + 1]), axis=1),
)
for h in range(mdp.H - 1)
)
One can analogously derive the Bellman consistency equation for the
action-value function:

$$Q^\pi_h(s, a) = r(s, a) + \mathbb{E}_{s' \sim P(s, a),\; a' \sim \pi_{h+1}(s')}\left[Q^\pi_{h+1}(s', a')\right].$$
Fix a policy $\pi$. Consider the higher-order operator that takes in a
“value function” $v : \mathcal{S} \to \mathbb{R}$ and returns the r.h.s. of the Bellman
equation for that “value function”:
def bellman_operator_looping(
mdp: MDP,
policy: Float[Array, "S A"],
v: Float[Array, " S"],
) -> Float[Array, " S"]:
"""
Looping definition of the Bellman operator.
    Concise version is below.
    """
    v_new = jnp.zeros(mdp.S)
    for s in range(mdp.S):
        for a in range(mdp.A):
            for s_next in range(mdp.S):
                # JAX arrays are immutable, so accumulate with .at[].add() instead of +=
                v_new = v_new.at[s].add(
                    policy[s, a]
                    * mdp.P[s, a, s_next]
                    * (mdp.r[s, a] + mdp.γ * v[s_next])
                )
    return v_new
Note that we can concisely implement this using the q_to_v and v_to_q utilities from above:
def bellman_operator(
mdp: MDP,
policy: Float[Array, "S A"],
v: Float[Array, " S"],
) -> Float[Array, " S"]:
"""For a known finite MDP, the Bellman operator can be exactly evaluated."""
return q_to_v(policy, v_to_q(mdp, v)) # equivalent
return jnp.sum(policy * (mdp.r + mdp.γ * mdp.P @ v), axis=1)
We’ll call $J^\pi : \mathbb{R}^{|\mathcal{S}|} \to \mathbb{R}^{|\mathcal{S}|}$ the Bellman
operator of $\pi$.
Note that it’s defined on any “value function” mapping states to real numbers;
$v$ doesn’t have to be a well-defined value function for some policy (hence the lowercase notation).
The Bellman operator also gives us a concise way to express Theorem 1.1 for the value function:

$$V^\pi_h = J^{\pi_h}\left(V^\pi_{h+1}\right).$$
Intuitively, the output of the Bellman operator, a new “value function”,
evaluates states as follows: from a given state, take one action
according to π, observe the reward, and then evaluate the next state
using the input “value function”.
When we discuss infinite-horizon MDPs, the Bellman operator will turn
out to be more than just a notational convenience: We’ll use it to
construct algorithms for computing the optimal policy.
How can we actually compute the value function of a given policy? This
is the task of policy evaluation.
def dp_eval_finite(mdp: MDP, policy: Float[Array, "S A"]) -> Float[Array, "H S"]:
"""Evaluate a policy using dynamic programming."""
V_ary = [None] * mdp.H + [jnp.zeros(mdp.S)] # initialize to 0 at end of time horizon
for h in range(mdp.H - 1, -1, -1):
V_ary[h] = bellman_operator(mdp, policy[h], V_ary[h + 1])
return jnp.stack(V_ary[:-1])
This runs in time $O(H \cdot |\mathcal{S}|^2 \cdot |\mathcal{A}|)$ by counting the
loops.
We’ve just seen how to evaluate a given policy. But how can we find
the optimal policy for a given environment?
Convince yourself that all optimal policies must have the same value
function. We call this the optimal value function and denote it by
$V^\star_h(s)$. The same goes for the action-value function
$Q^\star_h(s, a)$.
It is a stunning fact that every finite-horizon MDP has an optimal
policy that is time-dependent and deterministic. In particular, we can
construct such a policy by acting greedily with respect to the optimal
action-value function:

$$\pi^\star_h(s) = \arg\max_{a \in \mathcal{A}} Q^\star_h(s, a).$$
Note that this also gives simplified forms of the Bellman consistency equations for the optimal policy:

$$
\begin{aligned}
V^\star_h(s) &= \max_{a \in \mathcal{A}} Q^\star_h(s, a), \\
Q^\star_h(s, a) &= r(s, a) + \mathbb{E}_{s' \sim P(s, a)}\left[V^\star_{h+1}(s')\right].
\end{aligned}
$$
Now that we’ve shown this particular greedy policy is optimal, all we
need to do is compute the optimal value function and optimal policy. We
can do this by working backwards in time using dynamic programming
(DP).
def find_optimal_policy(mdp: MDP):
    """Compute the optimal policy and its value functions by dynamic programming."""
Q = [None] * mdp.H
pi = [None] * mdp.H
V = [None] * mdp.H + [jnp.zeros(mdp.S)] # initialize to 0 at end of time horizon
for h in range(mdp.H - 1, -1, -1):
        Q[h] = mdp.r + mdp.γ * mdp.P @ V[h + 1]  # γ defaults to 1 in the finite-horizon setting
        pi[h] = jnp.eye(mdp.A)[jnp.argmax(Q[h], axis=1)]  # one-hot greedy action in each state
V[h] = jnp.max(Q[h], axis=1)
Q = jnp.stack(Q)
pi = jnp.stack(pi)
V = jnp.stack(V[:-1])
return pi, V, Q
At each of the $H$ timesteps, we must compute $Q^\star$ for each of
the $|\mathcal{S}| \cdot |\mathcal{A}|$ state-action pairs. Each computation takes $|\mathcal{S}|$
operations to evaluate the average value over $s'$. This gives a total
computation time of $O(H \cdot |\mathcal{S}|^2 \cdot |\mathcal{A}|)$.
Note that this algorithm is identical to the policy evaluation algorithm
dp_eval_finite, but instead of averaging over the
actions chosen by a policy, we instead simply take a maximum over the
action-values. We’ll see this relationship between policy evaluation
and optimal policy computation show up again in the infinite-horizon
setting.
π_opt, V_opt, Q_opt = find_optimal_policy(tidy_mdp)
assert jnp.allclose(π_opt, tidy_policy_messy_only)
assert jnp.allclose(V_opt, V_messy)
assert jnp.allclose(Q_opt[:-1], v_ary_to_q_ary(tidy_mdp, V_messy)[1:])
"Assertions passed (the 'tidy when messy' policy is optimal)"
"Assertions passed (the 'tidy when messy' policy is optimal)"
What happens if a trajectory is allowed to continue forever (i.e.
$H = \infty$)? This is the setting of infinite-horizon MDPs.
In the remainder of the chapter, we’ll describe the necessary adjustments from the
finite-horizon case to make the problem tractable. We’ll show that the
Bellman operator in the discounted reward setting is a
contraction mapping for any policy.
We’ll discuss how to evaluate
policies (i.e. compute their corresponding value functions). Finally,
we’ll present and analyze two iterative algorithms, based on the Bellman
operator, for computing the optimal policy: value iteration and
policy iteration.
First of all, note that maximizing the cumulative reward
$r_h + r_{h+1} + r_{h+2} + \cdots$ is no longer a good idea, since it
might blow up to infinity. Instead of a time horizon $H$, we now need a
discount factor $\gamma \in [0, 1)$ such that rewards become less
valuable the further into the future they are:

$$r_h + \gamma r_{h+1} + \gamma^2 r_{h+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{h+k}.$$
We can think of $\gamma$ as measuring how much we care about the future:
if it’s close to 0, we only care about near-term rewards; if it’s
close to 1, we put more weight on future rewards.
You can also interpret $\gamma$ as the probability of continuing the
trajectory at each time step. (This is equivalent to $H$ being
distributed according to a first-success (geometric) distribution with success probability
$1 - \gamma$.) This accords with the above interpretation: if $\gamma$ is
close to 0, the trajectory will likely be very short, while if
$\gamma$ is close to 1, the trajectory will likely continue for a long
time.
The time-dependent policies from the finite-horizon case become
difficult to handle in the infinite-horizon case. In particular, many of
the DP approaches we saw required us to start at the end of the
trajectory, which is no longer possible. We’ll shift to stationary
policies $\pi : \mathcal{S} \to \mathcal{A}$ (deterministic) or $\pi : \mathcal{S} \to \triangle(\mathcal{A})$ (stochastic).
We also consider stationary value functions $V^\pi : \mathcal{S} \to \mathbb{R}$ and
$Q^\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. We need to insert a factor of $\gamma$
into the Bellman consistency equation Theorem 1.1 to account for the discounting:

$$
\begin{aligned}
V^\pi(s) &= \mathbb{E}_{\tau \sim \rho^\pi}\left[r_h + \gamma r_{h+1} + \gamma^2 r_{h+2} + \cdots \mid s_h = s\right] && \text{for any } h \in \mathbb{N} \\
&= \mathbb{E}_{\substack{a \sim \pi(s) \\ s' \sim P(s, a)}}\left[r(s, a) + \gamma V^\pi(s')\right] \\
Q^\pi(s, a) &= \mathbb{E}_{\tau \sim \rho^\pi}\left[r_h + \gamma r_{h+1} + \gamma^2 r_{h+2} + \cdots \mid s_h = s, a_h = a\right] && \text{for any } h \in \mathbb{N} \\
&= r(s, a) + \gamma \, \mathbb{E}_{\substack{s' \sim P(s, a) \\ a' \sim \pi(s')}}\left[Q^\pi(s', a')\right]
\end{aligned}
$$
1.5.1 The Bellman operator is a contraction mapping
Recall from Definition 1.8 that the Bellman operator $J^\pi$
for a policy $\pi$ takes in a “value function” $v : \mathcal{S} \to \mathbb{R}$ and
returns the r.h.s. of the Bellman equation for that “value function”. In
the infinite-horizon setting, this is

$$[J^\pi(v)](s) := \mathbb{E}_{\substack{a \sim \pi(s) \\ s' \sim P(s, a)}}\left[r(s, a) + \gamma v(s')\right].$$
The crucial property of the Bellman operator is that it is a
contraction mapping for any policy. Intuitively, if we start with
two “value functions” $v, u : \mathcal{S} \to \mathbb{R}$ and repeatedly apply the
Bellman operator to each of them, they will get closer and closer
together at an exponential rate.
It is a powerful fact (known as the Banach fixed-point theorem) that
every contraction mapping has a unique fixed point $x^\star$ such
that $f(x^\star) = x^\star$. This means that if we repeatedly apply $f$
to any starting point, we will eventually converge to $x^\star$:

$$\lim_{k \to \infty} f^{(k)}(x) = x^\star.$$
Let’s return to the RL setting and apply this result to the Bellman
operator. How can we measure the distance between two “value functions”
$v, u : \mathcal{S} \to \mathbb{R}$? We’ll take the supremum norm as our distance
metric:

$$\|v - u\|_\infty := \sup_{s \in \mathcal{S}} |v(s) - u(s)|,$$

i.e.
we compare the “value functions” on the state that causes the biggest
gap between them. Then (1.36) implies that if we repeatedly
apply $J^\pi$ to any starting “value function”, we will eventually
converge to $V^\pi$:

$$\lim_{k \to \infty} (J^\pi)^{(k)}(v) = V^\pi \quad \text{for any } v : \mathcal{S} \to \mathbb{R}.$$
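To quantify the “exponential rate”: since $V^\pi$ is the fixed point of $J^\pi$, the contraction property implies

$$\left\|(J^\pi)^{(k)}(v) - V^\pi\right\|_\infty \le \gamma^k \left\|v - V^\pi\right\|_\infty \quad \text{for all } k \ge 0.$$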
The backwards DP technique we used in the finite-horizon case no
longer works since there is no “final timestep” to start from. We’ll
need another approach to policy evaluation.
The Bellman consistency conditions yield a system of equations we can
solve to evaluate a deterministic policy exactly. For a faster approximate solution,
we can iterate the policy’s Bellman operator, since we know that it has
a unique fixed point at the true value function.
1.5.2.1 Matrix inversion for deterministic policies
Note that when the policy $\pi$ is deterministic, the actions can be
determined from the states, and so we can chop off the action dimension
for the rewards and state transitions:

$$r^\pi(s) := r(s, \pi(s)), \qquad P^\pi_{s, s'} := P(s' \mid s, \pi(s)).$$

For $P^\pi$, we’ll treat the rows as the states and the
columns as the next states. Then $P^\pi_{s, s'}$ is the probability of
transitioning from state $s$ to state $s'$ under policy $\pi$.
The Bellman consistency equation for a deterministic policy can be
written in tabular notation as

$$V^\pi = r^\pi + \gamma P^\pi V^\pi, \qquad \text{so} \qquad V^\pi = (I - \gamma P^\pi)^{-1} r^\pi.$$
The matrix inversion above takes roughly $O(|\mathcal{S}|^3)$ time.
It also only works for deterministic policies.
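The policy iteration code later in this section calls an evaluator named eval_deterministic_infinite that is not shown in this excerpt. The following is a minimal sketch of what such a matrix-inversion evaluator might look like, assuming the deterministic policy is passed as a one-hot $|\mathcal{S}| \times |\mathcal{A}|$ matrix:

def eval_deterministic_infinite(
    mdp: MDP, policy: Float[Array, "S A"]
) -> Float[Array, " S"]:
    """Evaluate a deterministic policy exactly by solving (I - γ P^π) V^π = r^π."""
    pi = jnp.argmax(policy, axis=1)  # action chosen in each state
    P_pi = mdp.P[jnp.arange(mdp.S), pi]  # P^π[s, s'] = P(s' | s, π(s))
    r_pi = mdp.r[jnp.arange(mdp.S), pi]  # r^π[s] = r(s, π(s))
    return jnp.linalg.solve(jnp.eye(mdp.S) - mdp.γ * P_pi, r_pi)

(Solving the linear system directly is equivalent to inverting $I - \gamma P^\pi$ but numerically preferable.)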
Can we trade off the requirement of finding the exact value function for a faster
approximate algorithm that will also extend to stochastic policies?
Let’s use the Bellman operator to define an iterative algorithm for
computing the value function. We’ll start with an initial guess
$v^{(0)}$ with elements in $[0, 1/(1-\gamma)]$ and then iterate the
Bellman operator:

$$v^{(t+1)} := J^\pi\left(v^{(t)}\right).$$
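The code below also relies on a helper called loop_until_convergence that is not defined in this excerpt. Here is a minimal sketch of that helper, together with iterative policy evaluation built on the Bellman operator; both definitions are assumptions about the intended behavior:

def loop_until_convergence(op, x, ε: float):
    """Repeatedly apply `op` until successive iterates differ by at most ε in sup norm."""
    while True:
        x_new = op(x)
        if jnp.max(jnp.abs(x_new - x)) < ε:
            return x_new
        x = x_new

def iterative_evaluation(mdp: MDP, policy: Float[Array, "S A"], ε: float = 1e-6) -> Float[Array, " S"]:
    """Iterate the Bellman operator of `policy`, starting from the zero "value function"."""
    op = partial(bellman_operator, mdp, policy)
    return loop_until_convergence(op, jnp.zeros(mdp.S), ε)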
Now let’s move on to solving for an optimal policy in the
infinite-horizon case. As in the finite-horizon case, an optimal policy $\pi^\star$
is one that does at least as well as any other policy in all situations.
That is, for all policies $\pi$, states $s \in \mathcal{S}$, times
$h \in \mathbb{N}$, and initial trajectories
$\tau_h = (s_0, a_0, r_0, \ldots, s_h)$ where $s_h = s$,

$$\mathbb{E}_{\tau \sim \rho^{\pi^\star}}\left[r_h + \gamma r_{h+1} + \gamma^2 r_{h+2} + \cdots \mid \tau_h\right] \ge \mathbb{E}_{\tau \sim \rho^{\pi}}\left[r_h + \gamma r_{h+1} + \gamma^2 r_{h+2} + \cdots \mid \tau_h\right].$$
Once again, all optimal policies share the same optimal value function $V^\star$, and the greedy policy with respect to this value function
is optimal.
So how can we compute such an optimal policy? We can’t use the backwards
DP approach from the finite-horizon case Definition 1.11 since there’s no “final timestep” to start
from. Instead, we’ll exploit the fact that the Bellman consistency
equation (1.32) for the optimal value
function doesn’t depend on any policy:

$$V^\star(s) = \max_{a \in \mathcal{A}} \left[ r(s, a) + \gamma \, \mathbb{E}_{s' \sim P(s, a)}\left[V^\star(s')\right] \right].$$
Since the optimal policy is still a policy, our result that the Bellman
operator is a contraction mapping still holds, and so we can repeatedly
apply this operator to converge to the optimal value function! This
algorithm is known as value iteration.
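The value_iteration code below uses a Bellman optimality operator, which is not defined in this excerpt; a minimal sketch, implementing the r.h.s. of the equation above, might look like this:

def bellman_optimality_operator(mdp: MDP, v: Float[Array, " S"]) -> Float[Array, " S"]:
    """Back up v one step and maximize over actions (the Bellman optimality update)."""
    return jnp.max(mdp.r + mdp.γ * mdp.P @ v, axis=1)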
def value_iteration(mdp: MDP, ε: float = 1e-6) -> Float[Array, " S"]:
"""Iterate the Bellman optimality operator until convergence."""
op = partial(bellman_optimality_operator, mdp)
return loop_until_convergence(op, jnp.zeros(mdp.S), ε)
value_iteration(tidy_mdp_inf)
Array([15.564166, 14.785956], dtype=float32)
Note that the runtime analysis for an ε-optimal value function
is exactly the same as for iterative policy evaluation! This is because value iteration
performs the same kind of fixed-point iteration; it simply applies the Bellman
optimality operator in place of a fixed policy’s Bellman operator.
As the final step of the algorithm, to return an actual policy
$\hat\pi$, we can simply act greedily with respect to the final iterate
$v^{(T)}$ of our above algorithm:

$$\hat\pi(s) = \arg\max_{a \in \mathcal{A}} \left[ r(s, a) + \gamma \, \mathbb{E}_{s' \sim P(s, a)}\left[v^{(T)}(s')\right] \right].$$
We must be careful, though: the value function of this greedy policy,
$V^{\hat\pi}$, is not the same as $v^{(T)}$, which need not even be a
well-defined value function for some policy!
The bound on the policy’s quality is actually quite loose: if
$\|v^{(T)} - V^\star\|_\infty \le \epsilon$, then the greedy policy
$\hat\pi$ satisfies

$$\|V^{\hat\pi} - V^\star\|_\infty \le \frac{2\gamma}{1-\gamma}\epsilon,$$

which might potentially be very large.
So in order to compensate and achieve $\|V^{\hat\pi} - V^\star\|_\infty \le \epsilon$, we must have

$$\|v^{(T)} - V^\star\|_\infty \le \frac{1-\gamma}{2\gamma}\epsilon.$$
Can we mitigate this “greedy worsening”? What if instead of approximating the optimal value function and then acting greedily by it at the very end, we iteratively improve the policy and value function together? This is the idea behind policy iteration. In each step, we simply set the policy to act greedily with respect to its own value function.
def policy_iteration(mdp: MDP, ε=1e-6) -> Float[Array, "S A"]:
"""Iteratively improve the policy and value function."""
def op(pi):
return v_to_greedy(mdp, eval_deterministic_infinite(mdp, pi))
π_init = jnp.ones((mdp.S, mdp.A)) / mdp.A # uniform random policy
return loop_until_convergence(op, π_init, ε)
policy_iteration(tidy_mdp_inf)
Array([[1., 0.],
[0., 1.]], dtype=float32)
Although PI appears more complex than VI, we’ll use the same contraction property Theorem 1.4 to show convergence. This will give us the same runtime bound as value iteration and iterative policy evaluation for an ε-optimal value function Remark 1.2, although in practice, PI often converges much faster.
Markov decision processes (MDPs) are a framework for sequential
decision making under uncertainty. They consist of a state space
$\mathcal{S}$, an action space $\mathcal{A}$, an initial state distribution
$\mu \in \triangle(\mathcal{S})$, a transition function $P(s' \mid s, a)$, and a
reward function $r(s, a)$. They can be finite-horizon (the interaction ends after
$H$ timesteps) or infinite-horizon (where future rewards are discounted by a factor
$\gamma \in [0, 1)$ at each timestep).
Our goal is to find a policy π that maximizes expected total
reward. Policies can be deterministic or stochastic,
state-dependent or history-dependent, stationary or
time-dependent.
A policy induces a distribution over trajectories.
We can evaluate a policy by computing its value function $V^\pi(s)$, which is the expected total reward starting from state
$s$ and following policy $\pi$. We can also compute the
state-action value function $Q^\pi(s, a)$, which is the expected
total reward starting from state $s$, taking action $a$, and then
following policy $\pi$. In the finite-horizon setting, these also
depend on the timestep $h$.
The Bellman consistency equation is an equation that the value
function must satisfy. It can be used to solve for the value
functions exactly. Thinking of the r.h.s. of this equation as an
operator on value functions gives the Bellman operator.
In the finite-horizon setting, we can compute the optimal policy
using dynamic programming.
In the infinite-horizon setting, we can compute the optimal policy
using value iteration or policy iteration.