
1 Markov Decision Processes

1.1 Introduction

The field of RL studies how an agent can learn to make sequential decisions in an interactive environment. This is a very general problem! How can we formalize this task in a way that is both sufficiently general and tractable enough for fruitful analysis?

Let’s consider some examples of sequential decision problems to identify the key common properties we’d like to capture:

  • Board games and video games, where a player takes actions in a virtual environment.
  • Inventory management, where a company must efficiently move resources from producers to consumers.
  • Robotic control, where a robot can move and interact with the real world to complete some task.

In these environments and many others, the state transitions, that is, the “rules” of the environment, depend only on the most recent state and action. For example, if you want to take a break while playing a game of chess, you could take a picture of the board, then later reset the board to that state and continue playing; the past history of moves doesn’t matter (generally speaking). This is called the Markov property.
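
In symbols, the Markov property says that the distribution of the next state depends only on the current state and action, not on the rest of the history:

$$\mathbb{P}(s_{\hi+1} \mid s_0, a_0, \dots, s_\hi, a_\hi) = \mathbb{P}(s_{\hi+1} \mid s_\hi, a_\hi).$$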

Environments that satisfy the Markov property are called Markov decision processes (MDPs). This chapter will focus on introducing core vocabulary for MDPs that will be useful throughout the book.

MDPs are usually classified as finite-horizon, where the interactions end after some finite number of time steps, or infinite-horizon, where the interactions can continue indefinitely. We’ll begin with the finite-horizon case and discuss the infinite-horizon case in the second half of the chapter.

We’ll describe how to evaluate different strategies, called policies, and how to compute (or approximate) the optimal policy for a given MDP. We’ll introduce the Bellman consistency condition, which allows us to analyze the whole sequence of interactions in terms of individual timesteps.

from utils import NamedTuple, Float, Array, partial, jax, jnp, latexify

1.2 Finite-horizon MDPs

1.2.1 Definition

class MDP(NamedTuple):
    """A description of a Markov decision process with finitely many states and actions."""
    S: int  # number of states
    A: int  # number of actions
    μ: Float[Array, " S"]  # initial state distribution
    P: Float[Array, "S A S"]  # "current" state, "current" action, "next" state
    r: Float[Array, "S A"]  # reward for each state-action pair
    H: int  # time horizon
    γ: float = 1.0  # discount factor (used later)
tidy_mdp = MDP(
    S=2,  # 0 = orderly, 1 = messy
    A=2,  # 0 = ignore, 1 = tidy
    μ=jnp.array([1.0, 0.0]),  # start in orderly state
    P=jnp.array([
        [
            [0.7, 0.3],  # orderly, ignore
            [1.0, 0.0],  # orderly, tidy
        ],
        [
            [0.0, 1.0],  # messy, ignore
            [1.0, 0.0],  # messy, tidy
        ],
    ]),
    r=jnp.array([
        [
            1.0,   # orderly, ignore
            -1.0,  # orderly, tidy
        ],
        [
            -1.0,  # messy, ignore
            0.0,   # messy, tidy
        ]
    ]),
    H=7,
)

1.2.2 Policies

Note that for finite state and action spaces, we can represent a randomized mapping $\mathcal{S} \to \Delta(\mathcal{A})$ as a matrix $\pi \in [0, 1]^{\mathcal{S} \times \mathcal{A}}$ where each row describes the policy’s distribution over actions for the corresponding state.

A fascinating result is that every finite-horizon MDP has an optimal deterministic time-dependent policy! Intuitively, the Markov property implies that the current state contains all the information we need to make the optimal decision. We’ll prove this result constructively later in the chapter.

# arrays of shape (H, S, A) represent time-dependent policies
tidy_policy_always_tidy = (
    jnp.zeros((7, 2, 2))
    .at[:, :, 1].set(1.0)
)
tidy_policy_weekends = (
    jnp.zeros((7, 2, 2))
    .at[5:7, :, 1].set(1.0)
    .at[0:5, :, 0].set(1.0)
)
tidy_policy_messy_only = (
    jnp.zeros((7, 2, 2))
    .at[:, 1, 1].set(1.0)
    .at[:, 0, 0].set(1.0)
)

1.2.3 Trajectories

class Transition(NamedTuple):
    """A single state-action-reward interaction with the environment.

    A trajectory comprises a sequence of transitions.
    """
    s: int
    a: int
    r: float

Once we’ve chosen a policy, we can sample trajectories by repeatedly choosing actions according to the policy, transitioning according to the state transitions, and observing the rewards.

That is, a policy induces a distribution $\rho^{\pi}$ over trajectories. (We assume that $\mu$ and $P$ are clear from context.)
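
For concreteness, here is a minimal sketch of such a rollout for the finite MDPs defined above. (This helper, sample_trajectory, and its key handling are our own illustration, not part of the text.)

def sample_trajectory(
    key: jax.Array,
    mdp: MDP,
    policy: Float[Array, "H S A"],
) -> list[Transition]:
    """Roll out a time-dependent policy for H steps, returning the sampled transitions."""
    τ = []
    key, key_init = jax.random.split(key)
    s = int(jax.random.choice(key_init, mdp.S, p=mdp.μ))  # sample the initial state
    for h in range(mdp.H):
        key, key_a, key_s = jax.random.split(key, 3)
        a = int(jax.random.choice(key_a, mdp.A, p=policy[h, s]))  # sample an action
        τ.append(Transition(s=s, a=a, r=float(mdp.r[s, a])))
        s = int(jax.random.choice(key_s, mdp.S, p=mdp.P[s, a]))  # sample the next state
    return τ

For example, sample_trajectory(jax.random.PRNGKey(0), tidy_mdp, tidy_policy_messy_only) would sample one week of interactions in the tidying environment.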

Note that for a state-dependent policy, using the Markov property (Definition 1.1), we can write down the likelihood function of this probability distribution in an autoregressive way (i.e. one timestep at a time):

def trajectory_log_likelihood(
    mdp: MDP,
    τ: list[Transition],
    π: Float[Array, "S A"],
) -> float:
    """Compute the log-likelihood of a trajectory under a given MDP and policy."""

    # initial distribution and action
    total = jnp.log(mdp.μ[τ[0].s])
    total += jnp.log(π[τ[0].s, τ[0].a])

    # remaining state transitions and actions
    for i in range(1, mdp.H):
        total += jnp.log(mdp.P[τ[i - 1].s, τ[i - 1].a, τ[i].s])
        total += jnp.log(π[τ[i].s, τ[i].a])

    return total

For a deterministic policy π, we have that $\pi_\hi(a \mid s) = \mathbb{I}[a = \pi_\hi(s)]$; that is, the probability of taking an action is 1 if it’s the unique action prescribed by the policy for that state and 0 otherwise. In this case, the only randomness in sampling trajectories comes from the initial state distribution $\mu$ and the state transitions $P$.

1.2.4 Value functions

The main goal of RL is to find a policy that maximizes the expected total reward $\E [r_0 + \cdots + r_{\hor-1}]$.

Let’s introduce some notation for analyzing this quantity.

A policy’s value function at time $\hi$ is its expected remaining reward from a given state:
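
$$V_\hi^\pi(s) := \E_{\tau \sim \rho^\pi} \left[ r_\hi + r_{\hi+1} + \cdots + r_{\hor-1} \mid s_\hi = s \right]$$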

Similarly, we can define the action-value function (aka the Q-function) at time $\hi$ as the expected remaining reward from a given state and taking a given action:
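
$$Q_\hi^\pi(s, a) := \E_{\tau \sim \rho^\pi} \left[ r_\hi + r_{\hi+1} + \cdots + r_{\hor-1} \mid s_\hi = s, a_\hi = a \right]$$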

1.2.4.1 Relating the value function and action-value function

Note that the value function is just the expected action-value over actions drawn from the policy:

$$V_\hi^\pi(s) = \E_{a \sim \pi_\hi(s)} [Q_\hi^\pi(s, a)]$$
def q_to_v(
    policy: Float[Array, "S A"],
    q: Float[Array, "S A"],
) -> Float[Array, " S"]:
    """
    Compute the value function for a given policy in a known finite MDP
    at a single timestep from its action-value function.
    """
    return jnp.average(q, weights=policy, axis=1)

and the action-value is the sum of the immediate reward and the expected value of the following state:

$$Q_\hi^\pi(s, a) = r(s, a) + \E_{s' \sim P(s, a)} [V_{\hi+1}^\pi(s')]$$
def v_to_q(
    mdp: MDP,
    v_next: Float[Array, " S"],
) -> Float[Array, "S A"]:
    """
    Compute the action-value function in a known finite MDP
    at a single timestep from the corresponding value function.
    """
    # the discount factor is relevant later
    return mdp.r + mdp.γ * mdp.P @ v_next


# convert a list of v functions to a list of q functions
v_ary_to_q_ary = jax.vmap(v_to_q, in_axes=(None, 0))

1.2.4.2 Greedy policies

For any given $Q \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|}$, we can define the greedy policy $\hat \pi_Q$ as the deterministic policy that selects the action with the highest $Q$-value at each state:

$$\hat \pi_Q(s) = \arg\max_{a} Q_{sa}$$
def q_to_greedy(q: Float[Array, "S A"]) -> Float[Array, "S A"]:
    """
    Get the (deterministic) greedy policy with respect to an action-value function.
    Return the policy as a matrix of shape (S, A) where each row is a one-hot vector.
    """
    A = q.shape[1]
    a_ary = jnp.argmax(q, axis=1)
    return jnp.eye(A)[a_ary]


def v_to_greedy(mdp: MDP, v: Float[Array, " S"]) -> Float[Array, "S A"]:
    """Get the (deterministic) greedy policy with respect to a value function."""
    return q_to_greedy(v_to_q(mdp, v))

1.2.5 The one-step (Bellman) consistency equation

Note that by simply considering the cumulative reward as the sum of the current reward and the future cumulative reward, we can describe the value function recursively (in terms of itself). This is named the Bellman consistency equation after Richard Bellman (1920--1984), who is credited with introducing dynamic programming in 1953.
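
For the value function, the Bellman consistency equation reads

$$V_\hi^\pi(s) = \E_{\substack{a \sim \pi_\hi(s) \\ s' \sim P(s, a)}} \left[ r(s, a) + V_{\hi+1}^\pi(s') \right],$$

where we set $V_{\hor}^\pi(s) = 0$ for all $s$.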

def check_bellman_consistency_v(
    mdp: MDP,
    policy: Float[Array, "H S A"],
    v_ary: Float[Array, "H S"],
) -> bool:
    """
    Check that the given (time-dependent) "value function"
    satisfies the Bellman consistency equation.
    """
    return all(
        jnp.allclose(
            # lhs
            v_ary[h],
            # rhs
            jnp.sum(policy[h] * (mdp.r + mdp.γ * mdp.P @ v_ary[h + 1]), axis=1),
        )
        for h in range(mdp.H - 1)
    )

One can analogously derive the Bellman consistency equation for the action-value function:
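
$$Q_\hi^\pi(s, a) = r(s, a) + \E_{\substack{s' \sim P(s, a) \\ a' \sim \pi_{\hi+1}(s')}} \left[ Q_{\hi+1}^\pi(s', a') \right]$$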

1.2.6 The one-step Bellman operator

Fix a policy π. Consider the higher-order operator that takes in a “value function” $v : \mathcal{S} \to \mathbb{R}$ and returns the r.h.s. of the Bellman equation for that “value function”:

def bellman_operator_looping(
    mdp: MDP,
    policy: Float[Array, "S A"],
    v: Float[Array, " S"],
) -> Float[Array, " S"]:
    """
    Looping definition of the Bellman operator.
    A concise version is below.
    """
    v_new = jnp.zeros(mdp.S)
    for s in range(mdp.S):
        for a in range(mdp.A):
            for s_next in range(mdp.S):
                # JAX arrays are immutable, so we accumulate with .at[].add()
                v_new = v_new.at[s].add(
                    policy[s, a]
                    * mdp.P[s, a, s_next]
                    * (mdp.r[s, a] + mdp.γ * v[s_next])
                )
    return v_new

Note that we can concisely implement this using the q_to_v and v_to_q utilities from above:

def bellman_operator(
    mdp: MDP,
    policy: Float[Array, "S A"],
    v: Float[Array, " S"],
) -> Float[Array, " S"]:
    """For a known finite MDP, the Bellman operator can be exactly evaluated."""
    # equivalently: jnp.sum(policy * (mdp.r + mdp.γ * mdp.P @ v), axis=1)
    return q_to_v(policy, v_to_q(mdp, v))

We’ll call $\mathcal{J}^\pi : \mathbb{R}^\mathcal{S} \to \mathbb{R}^\mathcal{S}$ the Bellman operator of π. Note that it’s defined on any “value function” mapping states to real numbers; $v$ doesn’t have to be a well-defined value function for some policy (hence the lowercase notation). The Bellman operator also gives us a concise way to express Theorem 1.1 for the value function:

$$V_\hi^\pi = \mathcal{J}^{\pi}(V_{\hi+1}^\pi)$$

Intuitively, the output of the Bellman operator, a new “value function”, evaluates states as follows: from a given state, take one action according to π, observe the reward, and then evaluate the next state using the input “value function”.

When we discuss infinite-horizon MDPs, the Bellman operator will turn out to be more than just a notational convenience: We’ll use it to construct algorithms for computing the optimal policy.

1.3 Solving finite-horizon MDPs

1.3.1 Policy evaluation in finite-horizon MDPs

How can we actually compute the value function of a given policy? This is the task of policy evaluation.

def dp_eval_finite(mdp: MDP, policy: Float[Array, "S A"]) -> Float[Array, "H S"]:
    """Evaluate a policy using dynamic programming."""
    V_ary = [None] * mdp.H + [jnp.zeros(mdp.S)]  # initialize to 0 at end of time horizon
    for h in range(mdp.H - 1, -1, -1):
        V_ary[h] = bellman_operator(mdp, policy[h], V_ary[h + 1])
    return jnp.stack(V_ary[:-1])

This runs in time $O(H \cdot |\mathcal{S}|^2 \cdot |\mathcal{A}|)$ by counting the loops.

V_messy = dp_eval_finite(tidy_mdp, tidy_policy_messy_only)
V_messy
Array([[5.5621696, 4.7927704],
       [4.7927704, 4.0241003],
       [4.0241003, 3.253    ],
       [3.253    , 2.49     ],
       [2.49     , 1.7      ],
       [1.7      , 1.       ],
       [1.       , 0.       ]], dtype=float32)

1.3.2 Optimal policies in finite-horizon MDPs

We’ve just seen how to evaluate a given policy. But how can we find the optimal policy for a given environment?

Convince yourself that all optimal policies must have the same value function. We call this the optimal value function and denote it by $V_\hi^\star(s)$. The same goes for the action-value function $Q_\hi^\star(s, a)$.

It is a stunning fact that every finite-horizon MDP has an optimal policy that is time-dependent and deterministic. In particular, we can construct such a policy by acting greedily with respect to the optimal action-value function:
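
$$\pi_\hi^\star(s) = \arg\max_a Q_\hi^\star(s, a)$$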

Note that this also gives simplified forms of the Bellman consistency equations for the optimal policy:
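
$$\begin{aligned}
V_\hi^\star(s) &= \max_a Q_\hi^\star(s, a) \\
Q_\hi^\star(s, a) &= r(s, a) + \E_{s' \sim P(s, a)} [V_{\hi+1}^\star(s')]
\end{aligned}$$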

Now that we’ve shown this particular greedy policy is optimal, all we need to do is compute the optimal value function and optimal policy. We can do this by working backwards in time using dynamic programming (DP).

def find_optimal_policy(mdp: MDP):
    """Compute the optimal policy, value function, and Q-function via backwards DP."""
    Q = [None] * mdp.H
    pi = [None] * mdp.H
    V = [None] * mdp.H + [jnp.zeros(mdp.S)]  # initialize to 0 at end of time horizon

    for h in range(mdp.H - 1, -1, -1):
        Q[h] = mdp.r + mdp.γ * mdp.P @ V[h + 1]
        pi[h] = jnp.eye(mdp.A)[jnp.argmax(Q[h], axis=1)]  # one-hot greedy action
        V[h] = jnp.max(Q[h], axis=1)

    Q = jnp.stack(Q)
    pi = jnp.stack(pi)
    V = jnp.stack(V[:-1])

    return pi, V, Q

At each of the $H$ timesteps, we must compute $Q^{\star}$ for each of the $|\mathcal{S}| |\mathcal{A}|$ state-action pairs. Each computation takes $|\mathcal{S}|$ operations to evaluate the average value over $s'$. This gives a total computation time of $O(H \cdot |\mathcal{S}|^2 \cdot |\mathcal{A}|)$.

Note that this algorithm is identical to the policy evaluation algorithm dp_eval_finite, but instead of averaging over the actions chosen by a policy, we instead simply take a maximum over the action-values. We’ll see this relationship between policy evaluation and optimal policy computation show up again in the infinite-horizon setting.

π_opt, V_opt, Q_opt = find_optimal_policy(tidy_mdp)
assert jnp.allclose(π_opt, tidy_policy_messy_only)
assert jnp.allclose(V_opt, V_messy)
assert jnp.allclose(Q_opt[:-1], v_ary_to_q_ary(tidy_mdp, V_messy)[1:])
"Assertions passed (the 'tidy when messy' policy is optimal)"
"Assertions passed (the 'tidy when messy' policy is optimal)"

1.4 Infinite-horizon MDPs

What happens if a trajectory is allowed to continue forever (i.e. $H = \infty$)? This is the setting of infinite-horizon MDPs.

In this chapter, we’ll describe the necessary adjustments from the finite-horizon case to make the problem tractable. We’ll show that the Bellman operator in the discounted reward setting is a contraction mapping for any policy. We’ll discuss how to evaluate policies (i.e. compute their corresponding value functions). Finally, we’ll present and analyze two iterative algorithms, based on the Bellman operator, for computing the optimal policy: value iteration and policy iteration.

1.4.1 Discounted rewards

First of all, note that maximizing the cumulative reward $r_\hi + r_{\hi+1} + r_{\hi+2} + \cdots$ is no longer a good idea since it might blow up to infinity. Instead of a time horizon $H$, we now need a discount factor $\gamma \in [0, 1)$ such that rewards become less valuable the further into the future they are:

$$r_\hi + \gamma r_{\hi+1} + \gamma^2 r_{\hi+2} + \cdots = \sum_{k=0}^\infty \gamma^k r_{\hi+k}.$$

We can think of γ as measuring how much we care about the future: if it’s close to 0, we only care about near-term rewards; if it’s close to 1, we put more weight on future rewards.

You can also interpret γ as the probability of continuing the trajectory at each time step. (This is equivalent to $H$ itself being a random variable following a First Success (geometric) distribution with success probability $1 - \gamma$.) This accords with the above interpretation: if γ is close to 0, the trajectory will likely be very short, while if γ is close to 1, the trajectory will likely continue for a long time.
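
As a quick illustration of the discounted sum, here is a small helper (ours, not from the text) that computes the discounted return of a finite reward sequence; for a constant reward of 1 at every step, it approaches $1/(1-\gamma)$:

def discounted_return(rewards: Float[Array, " T"], γ: float) -> float:
    """Compute sum_k γ^k rewards[k] for a finite reward sequence."""
    discounts = γ ** jnp.arange(len(rewards))
    return jnp.sum(discounts * rewards)


discounted_return(jnp.ones(100), 0.95)  # ≈ 1 / (1 - 0.95) = 20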

The other components of the MDP remain the same:

$$M = (\mathcal{S}, \mathcal{A}, \mu, P, r, \gamma).$$

Code-wise, we can reuse the MDP class from before (Definition 1.2) and set mdp.H = float('inf').

tidy_mdp_inf = tidy_mdp._replace(H=float("inf"), γ=0.95)

1.4.2 Stationary policies

The time-dependent policies from the finite-horizon case become difficult to handle in the infinite-horizon case. In particular, many of the DP approaches we saw required us to start at the end of the trajectory, which is no longer possible. We’ll shift to stationary policies $\pi : \mathcal{S} \to \mathcal{A}$ (deterministic) or $\pi : \mathcal{S} \to \Delta(\mathcal{A})$ (stochastic).

1.4.3 Value functions and Bellman consistency

We also consider stationary value functions $V^\pi : \mathcal{S} \to \mathbb{R}$ and $Q^\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. We need to insert a factor of γ into the Bellman consistency equation (Theorem 1.1) to account for the discounting:

$$\begin{aligned}
V^\pi(s) &= \E_{\tau \sim \rho^\pi} [r_\hi + \gamma r_{\hi+1} + \gamma^2 r_{\hi+2} + \cdots \mid s_\hi = s] && \text{for any } \hi \in \mathbb{N} \\
&= \E_{\substack{a \sim \pi(s) \\ s' \sim P(s, a)}} [r(s, a) + \gamma V^\pi(s')] \\
Q^\pi(s, a) &= \E_{\tau \sim \rho^\pi} [r_\hi + \gamma r_{\hi+1} + \gamma^2 r_{\hi+2} + \cdots \mid s_\hi = s, a_\hi = a] && \text{for any } \hi \in \mathbb{N} \\
&= r(s, a) + \gamma \E_{\substack{s' \sim P(s, a) \\ a' \sim \pi(s')}} [Q^\pi(s', a')]
\end{aligned}$$

1.5 Solving infinite-horizon MDPs

1.5.1 The Bellman operator is a contraction mapping

Recall from Definition 1.8 that the Bellman operator $\mathcal{J}^{\pi}$ for a policy π takes in a “value function” $v : \mathcal{S} \to \mathbb{R}$ and returns the r.h.s. of the Bellman equation for that “value function”. In the infinite-horizon setting, this is

$$[\mathcal{J}^{\pi}(v)](s) := \E_{\substack{a \sim \pi(s) \\ s' \sim P(s, a)}} [r(s, a) + \gamma v(s')].$$

The crucial property of the Bellman operator is that it is a contraction mapping for any policy. Intuitively, if we start with two “value functions” $v, u : \mathcal{S} \to \mathbb{R}$ and repeatedly apply the Bellman operator to each of them, they will get closer and closer together at an exponential rate.
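
Formally, a map $f$ is a contraction mapping (with respect to some norm $\|\cdot\|$) if there is a $\gamma \in [0, 1)$ such that for all inputs $x, y$,

$$\|f(x) - f(y)\| \le \gamma \|x - y\|.$$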

It is a powerful fact (known as the Banach fixed-point theorem) that every contraction mapping has a unique fixed point $x^\star$ such that $f(x^\star) = x^\star$. This means that if we repeatedly apply $f$ to any starting point, we will eventually converge to $x^\star$:

$$\|f^{(t)}(x) - x^\star\| \le \gamma^t \|x - x^\star\|.$$

Let’s return to the RL setting and apply this result to the Bellman operator. How can we measure the distance between two “value functions” $v, u : \mathcal{S} \to \mathbb{R}$? We’ll take the supremum norm as our distance metric:

$$\| v - u \|_{\infty} := \sup_{s \in \mathcal{S}} |v(s) - u(s)|,$$

i.e. we compare the “value functions” on the state that causes the biggest gap between them. Then (1.36) implies that if we repeatedly apply $\mathcal{J}^\pi$ to any starting “value function”, we will eventually converge to $V^\pi$:

$$\|(\mathcal{J}^\pi)^{(t)}(v) - V^\pi \|_{\infty} \le \gamma^{t} \| v - V^\pi\|_{\infty}.$$

We’ll use this useful fact to prove the convergence of several algorithms later on.

1.5.2 Policy evaluation in infinite-horizon MDPs

The backwards DP technique we used in the finite-horizon case no longer works since there is no “final timestep” to start from. We’ll need another approach to policy evaluation.

The Bellman consistency conditions yield a system of equations we can solve to evaluate a deterministic policy exactly. For a faster approximate solution, we can iterate the policy’s Bellman operator, since we know that it has a unique fixed point at the true value function.

1.5.2.1 Matrix inversion for deterministic policies

Note that when the policy π is deterministic, the actions can be determined from the states, and so we can chop off the action dimension for the rewards and state transitions:

$$\begin{aligned}
r^{\pi} &\in \mathbb{R}^{|\mathcal{S}|} & P^{\pi} &\in [0, 1]^{|\mathcal{S}| \times |\mathcal{S}|} & \mu &\in [0, 1]^{|\mathcal{S}|} \\
\pi &\in \mathcal{A}^{|\mathcal{S}|} & V^\pi &\in \mathbb{R}^{|\mathcal{S}|} & Q^\pi &\in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|}.
\end{aligned}$$

For $P^\pi$, we’ll treat the rows as the states and the columns as the next states. Then $P^\pi_{s, s'}$ is the probability of transitioning from state $s$ to state $s'$ under policy π.

The Bellman consistency equation for a deterministic policy can be written in tabular notation as

$$V^\pi = r^\pi + \gamma P^\pi V^\pi.$$

(Unfortunately, this notation doesn’t simplify the expression for $Q^\pi$.) This system of equations can be solved with a matrix inversion:

$$V^\pi = (I - \gamma P^\pi)^{-1} r^\pi.$$
def eval_deterministic_infinite(
    mdp: MDP, policy: Float[Array, "S A"]
) -> Float[Array, " S"]:
    """Evaluate a deterministic policy exactly by solving the linear Bellman system."""
    pi = jnp.argmax(policy, axis=1)  # un-one-hot
    P_π = mdp.P[jnp.arange(mdp.S), pi]
    r_π = mdp.r[jnp.arange(mdp.S), pi]
    return jnp.linalg.solve(jnp.eye(mdp.S) - mdp.γ * P_π, r_π)
eval_deterministic_infinite(tidy_mdp_inf, tidy_policy_messy_only[0])
Array([15.56419, 14.78598], dtype=float32)

1.5.2.2 Iterative policy evaluation

The matrix inversion above takes roughly $O(|\mathcal{S}|^3)$ time. It also only works for deterministic policies. Can we trade off the requirement of finding the exact value function for a faster approximate algorithm that will also extend to stochastic policies?

Let’s use the Bellman operator to define an iterative algorithm for computing the value function. We’ll start with an initial guess $v^{(0)}$ with elements in $[0, 1/(1-\gamma)]$ and then iterate the Bellman operator:

$$v^{(t+1)} = \mathcal{J}^{\pi}(v^{(t)}),$$

i.e. $v^{(t)} = (\mathcal{J}^{\pi})^{(t)} (v^{(0)})$. Note that each iteration takes $O(|\mathcal{S}|^2)$ time for the matrix-vector multiplication.

def supremum_norm(v):
    return jnp.max(jnp.abs(v))  # same as jnp.linalg.norm(v, jnp.inf)


def loop_until_convergence(op, v, ε=1e-6):
    """Repeatedly apply op to v until convergence (in supremum norm)."""
    while True:
        v_new = op(v)
        if supremum_norm(v_new - v) < ε:
            return v_new
        v = v_new


def iterative_evaluation(mdp: MDP, pi: Float[Array, "S A"], ε=1e-6) -> Float[Array, " S"]:
    """Evaluate a policy by iterating its Bellman operator from an all-zeros initial guess."""
    op = partial(bellman_operator, mdp, pi)
    return loop_until_convergence(op, jnp.zeros(mdp.S), ε)

Then, as we showed in (1.38), by the Banach fixed-point theorem:

$$\|v^{(t)} - V^\pi \|_{\infty} \le \gamma^{t} \| v^{(0)} - V^\pi\|_{\infty}.$$
iterative_evaluation(tidy_mdp_inf, tidy_policy_messy_only[0])
Array([15.564166, 14.785956], dtype=float32)

1.5.3 Optimal policies in infinite-horizon MDPs

Now let’s move on to solving for an optimal policy in the infinite-horizon case. As in the finite-horizon case, an optimal policy $\pi^\star$ is one that does at least as well as any other policy in all situations. That is, for all policies π, states $s \in \mathcal{S}$, times $\hi \in \mathbb{N}$, and initial trajectories $\tau_\hi = (s_0, a_0, r_0, \dots, s_\hi)$ where $s_\hi = s$,

$$\begin{aligned}
V^{\pi^\star}(s) &= \E_{\tau \sim \rho^{\pi^{\star}}}[r_\hi + \gamma r_{\hi+1} + \gamma^2 r_{\hi+2} + \cdots \mid s_\hi = s] \\
&\ge \E_{\tau \sim \rho^{\pi}}[r_\hi + \gamma r_{\hi+1} + \gamma^2 r_{\hi+2} + \cdots \mid \tau_\hi]
\end{aligned}$$

Once again, all optimal policies share the same optimal value function $V^\star$, and the greedy policy with respect to this value function is optimal.

So how can we compute such an optimal policy? We can’t use the backwards DP approach from the finite-horizon case (Definition 1.11) since there’s no “final timestep” to start from. Instead, we’ll exploit the fact that the Bellman consistency equation (1.32) for the optimal value function doesn’t depend on any policy:

$$V^\star(s) = \max_a \left[ r(s, a) + \gamma \E_{s' \sim P(s, a)} V^\star(s') \right].$$

As before, thinking of the r.h.s. of (1.53) as an operator on value functions gives the Bellman optimality operator

$$[\mathcal{J}^{\star}(v)](s) = \max_a \left[ r(s, a) + \gamma \E_{s' \sim P(s, a)} v(s') \right]$$
def bellman_optimality_operator(mdp: MDP, v: Float[Array, " S"]) -> Float[Array, " S"]:
    return jnp.max(mdp.r + mdp.γ * mdp.P @ v, axis=1)


def check_optimal(v: Float[Array, " S"], mdp: MDP):
    """Check whether a value function satisfies the Bellman optimality equation."""
    return jnp.allclose(v, bellman_optimality_operator(mdp, v))

1.5.3.1 Value iteration

Since the optimal policy is still a policy, our result that the Bellman operator is a contracting map still holds, and so we can repeatedly apply this operator to converge to the optimal value function! This algorithm is known as value iteration.

def value_iteration(mdp: MDP, ε: float = 1e-6) -> Float[Array, " S"]:
    """Iterate the Bellman optimality operator until convergence."""
    op = partial(bellman_optimality_operator, mdp)
    return loop_until_convergence(op, jnp.zeros(mdp.S), ε)
value_iteration(tidy_mdp_inf)
Array([15.564166, 14.785956], dtype=float32)

Note that the runtime analysis for an ε-optimal value function is exactly the same as iterative policy evaluation! This is because value iteration is simply the special case of applying iterative policy evaluation to the optimal value function.

As the final step of the algorithm, to return an actual policy $\hat \pi$, we can simply act greedily with respect to the final iteration $v^{(T)}$ of our above algorithm:

$$\hat \pi(s) = \arg\max_a \left[ r(s, a) + \gamma \E_{s' \sim P(s, a)} v^{(T)}(s') \right].$$

We must be careful, though: the value function of this greedy policy, $V^{\hat \pi}$, is not the same as $v^{(T)}$, which need not even be a well-defined value function for some policy!

The bound on the policy’s quality is actually quite loose: if $\|v^{(T)} - V^\star\|_{\infty} \le \epsilon$, then the greedy policy $\hat \pi$ satisfies $\|V^{\hat \pi} - V^\star\|_{\infty} \le \frac{2\gamma}{1-\gamma} \epsilon$, which might potentially be very large.

So in order to compensate and achieve $\|V^{\hat \pi} - V^{\star}\|_{\infty} \le \epsilon$, we must have

$$\|v^{(T)} - V^\star\|_{\infty} \le \frac{1-\gamma}{2 \gamma} \epsilon.$$

This means, using Remark 1.2, we need to run value iteration for

$$T = O\left( \frac{1}{1-\gamma} \log\left(\frac{\gamma}{\epsilon (1-\gamma)^2}\right) \right)$$

iterations to achieve an ε-accurate estimate of the optimal value function.
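
To see roughly where this bound comes from (a sketch, assuming rewards lie in $[0, 1]$ so that all values lie in $[0, 1/(1-\gamma)]$): the contraction property gives $\|v^{(T)} - V^\star\|_\infty \le \gamma^T \|v^{(0)} - V^\star\|_\infty \le \gamma^T / (1-\gamma)$, so it suffices to take $T \ge \log\left(\frac{2\gamma}{\epsilon(1-\gamma)^2}\right) / \log(1/\gamma)$; using $\log(1/\gamma) \ge 1 - \gamma$ then yields the stated rate.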

1.5.3.2 Policy iteration

Can we mitigate this “greedy worsening”? What if instead of approximating the optimal value function and then acting greedily by it at the very end, we iteratively improve the policy and value function together? This is the idea behind policy iteration. In each step, we simply set the policy to act greedily with respect to its own value function.

def policy_iteration(mdp: MDP, ε=1e-6) -> Float[Array, "S A"]:
    """Iteratively improve the policy and value function."""
    def op(pi):
        return v_to_greedy(mdp, eval_deterministic_infinite(mdp, pi))
    π_init = jnp.ones((mdp.S, mdp.A)) / mdp.A  # uniform random policy
    return loop_until_convergence(op, π_init, ε)
policy_iteration(tidy_mdp_inf)
Array([[1., 0.], [0., 1.]], dtype=float32)

Although PI appears more complex than VI, we’ll use the same contraction property (Theorem 1.4) to show convergence. This will give us the same runtime bound as value iteration and iterative policy evaluation for an ε-optimal value function (Remark 1.2), although in practice, PI often converges much faster.

1.6 Summary

  • Markov decision processes (MDPs) are a framework for sequential decision making under uncertainty. They consist of a state space $\mathcal{S}$, an action space $\mathcal{A}$, an initial state distribution $\mu \in \Delta(\mathcal{S})$, a transition function $P(s' \mid s, a)$, and a reward function $r(s, a)$. They can be finite-horizon (ending after $H$ timesteps) or infinite-horizon (where rewards are scaled by $\gamma \in [0, 1)$ at each timestep).

  • Our goal is to find a policy π that maximizes expected total reward. Policies can be deterministic or stochastic, state-dependent or history-dependent, stationary or time-dependent.

  • A policy induces a distribution over trajectories.

  • We can evaluate a policy by computing its value function $V^\pi(s)$, which is the expected total reward starting from state $s$ and following policy π. We can also compute the state-action value function $Q^\pi(s, a)$, which is the expected total reward starting from state $s$, taking action $a$, and then following policy π. In the finite-horizon setting, these also depend on the timestep $\hi$.

  • The Bellman consistency equation is an equation that the value function must satisfy. It can be used to solve for the value functions exactly. Thinking of the r.h.s. of this equation as an operator on value functions gives the Bellman operator.

  • In the finite-horizon setting, we can compute the optimal policy using dynamic programming.

  • In the infinite-horizon setting, we can compute the optimal policy using value iteration or policy iteration.