1  Markov Decision Processes



1.1 Introduction

Machine learning studies algorithms that learn to solve a task “on their own”, without needing a programmer to implement handwritten “if statements”. Reinforcement learning (RL) is a branch of machine learning that focuses on decision problems like the following:

Example 1.1 (Decision problems)  

Board games and video games, such as a game of chess, where a player character takes actions that update the state of the game. (Guy 2006)

Inventory management, where a company must efficiently move resources from producers to consumers. (Frans Berkelaar 2009)

Robotic control, where a robot can move and interact with the real world to complete some task. (GPA Photo Archive 2017)

All of these tasks involve taking a sequence of actions in an interactive environment. This interaction loop can be summarized in the following diagram:

graph LR
    Agent --"takes *action*"--> Environment
    Environment --"observes *state*, *reward*"--> Agent
Figure 1.1: The RL interaction loop. The agent chooses an action that affects the environment. The environment updates according to the state transitions. Then the agent observes the updated environment and a reward signal that says how well the agent behaved.

The strategy for choosing actions is called the policy. The goal of an RL algorithm is to find a policy that achieves the maximum total reward. This is a very general problem! How can we describe such tasks using mathematics in a way that is general yet also structured enough for us to analyze?

This chapter will introduce the most popular formalization for such tasks: Markov decision processes (MDPs). We will study dynamic programming (DP) algorithms for solving fully specified tasks in which we completely understand the environment. We’ll describe how to evaluate different policies and how to compute (or approximate) the optimal policy for a given MDP. We’ll introduce the Bellman consistency condition, which allows us to analyze the whole sequence of interactions in terms of individual timesteps.

Remark 1.1 (Further generalizations). In the real world, we often don’t know how the environment works, or it is too hard to represent the environment on a computer. We will study such tasks in future chapters. Additionally, in many tasks, only a subset of the state is visible to the agent. We won’t discuss such partially observed environments in this course.

Code
from utils import NamedTuple, Float, Array, partial, jax, jnp, latexify, latex
from tabulate import tabulate
import numpy as np
from IPython.display import Markdown, display

1.2 Finite-horizon Markov decision processes

To gain a better grasp of decision processes, let us frame the robotic control task in Example 1.1 as a decision problem.

Example 1.2 (Robotic control as a decision problem) Suppose the goal is to move the robot to a certain position.

  • The state consists of the positions and velocities of the robot’s joints.
  • The action consists of the forces to apply to the robot’s motors.
  • The state transitions are essentially the rules of physics: after applying a certain force to the joints, their positions and velocities would change according to physical law.
  • We reward positions that are closer to the desired position. To be more energy-efficient, we could also deduct reward for applying a lot of force, or add other terms that describe the ideal behavior.

Exercise 1.1 (Identifying decision problems) For each of the other examples in Example 1.1, what information should the state include? What might the valid set of actions be? Describe the state transitions heuristically, and a reward function that describes how the task should be solved.

In many decision problems, the state transitions only depend on the current state and action. For example, in Example 1.2, if we use Newton’s laws of physics to compute the state transitions, then just knowing the current positions and velocities is enough to calculate the next positions and velocities, since Newton’s laws are second-order. We say that such state transitions satisfy the Markov property, which we will formally define in Definition 1.2 after introducing some notation.

Exercise 1.2 Look back at the state transitions you described in Exercise 1.1. Do they satisfy the Markov property? (For chess, consider the threefold repetition rule: if the same position occurs three times during the game, either player may claim a draw.)

Markov decision processes (MDPs), which satisfy the Markov property, are the most common formalization of decision problems. MDPs can be classified as finite-horizon, where the interactions end after some finite number of time steps, or infinite-horizon, where the interactions are allowed to continue indefinitely. We’ll begin with the finite-horizon case and discuss the infinite-horizon case in Section 1.3.

Definition 1.1 (Finite-horizon Markov decision process) The components of a finite-horizon Markov decision process are:

  1. The state that the agent interacts with. We use \(\mathcal{S}\) to denote the set of possible states, called the state space.

  2. The actions that the agent can take. We use \(\mathcal{A}\) to denote the set of possible actions, called the action space.

  3. Some initial state distribution \(\mu \in \triangle(\mathcal{S})\).

  4. The state transitions (a.k.a. dynamics) \(P : \mathcal{S} \times \mathcal{A} \to \triangle(\mathcal{S})\) that describe what state the agent transitions to after taking an action. (We write \(P(s_{h+ 1} \mid s_h, a_h)\) for the probability of transitioning to state \(s_{h+1}\) when starting in state \(s_h\) and taking action \(a_h\).)

  5. The reward function. In this course, we’ll take it to be a deterministic function on state-action pairs, \(r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}\), but in general many results will extend to a stochastic reward signal.

  6. A time horizon \(H\in \mathbb{N}\) that specifies the number of interactions in an episode.

Combined together, these objects specify a finite-horizon Markov decision process:

\[ \mathcal{M} = (\mathcal{S}, \mathcal{A}, \mu, P, r, H). \]

When there are finitely many states and actions, i.e. \(|\mathcal{S}|, |\mathcal{A}| < \infty\), we can express the relevant quantities as vectors and matrices (i.e. tables of values):

\[ \begin{aligned} \mu &\in [0, 1]^{|\mathcal{S}|} & P &\in [0, 1]^{(|\mathcal{S} \times \mathcal{A}|) \times |\mathcal{S}|} & r &\in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|} \end{aligned} \]

(Verify that the types and shapes provided above make sense!)

Code
class MDP(NamedTuple):
    """A description of a Markov decision process with finitely many states and actions."""
    S: int  # number of states
    A: int  # number of actions
    mu: Float[Array, " S"]
    P: Float[Array, "S A S"]  # "current" state, "current" action, "next" state
    r: Float[Array, "S A"]
    H: int  # the horizon
    gamma: float = 1.0  # discount factor (used later)

Definition 1.2 (Markov property) A decision process satisfies the Markov property if the next state is independent from the past states and actions when conditioned on the current state and action:

\[ \mathbb{P}(s_{h+1} \mid s_0, a_0, \dots, s_h, a_h) = \mathbb{P}(s_{h+1} \mid s_h, a_h) \]

By their definition, Markov decision processes satisfy the Markov property. This is because the state transitions are defined using the function \(P\), which only takes in the state-action pair to transition from.

Exercise 1.3 (Conditional independence for future states) Use the chain rule of probability to show that, conditioned on the current state and action \(s_h, a_h\), all future states \(s_{h'}\) where \(h' > h\) are conditionally independent of the past states and actions. That is, for all \(h' > h\),

\[ \mathbb{P}(s_{h'} \mid s_0, a_0, \dots, s_h, a_h) = \mathbb{P}(s_{h'} \mid s_h, a_h). \]

We prove this by induction on \(h'\). The Markov property itself gives the base case \(h' = h+1\). Now suppose the claim holds for some \(h' > h\). By the law of total probability,

\[ \mathbb{P}(s_{h'+1} \mid s_0, a_0, \dots, s_h, a_h) = \sum_{s_{h'}} \mathbb{P}(s_{h'+1} \mid s_0, a_0, \dots, s_h, a_h, s_{h'}) \, \mathbb{P}(s_{h'} \mid s_0, a_0, \dots, s_h, a_h). \]

Conditioned on \(s_{h'}\), the next state \(s_{h'+1}\) no longer depends on the earlier states and actions: the action \(a_{h'}\) is drawn from a state-dependent policy, and the transition depends only on \((s_{h'}, a_{h'})\). So the first factor equals \(\mathbb{P}(s_{h'+1} \mid s_{h'})\), and by the inductive hypothesis the second factor equals \(\mathbb{P}(s_{h'} \mid s_h, a_h)\). Summing over \(s_{h'}\) gives \(\mathbb{P}(s_{h'+1} \mid s_h, a_h)\), completing the induction.

Example 1.3 (Tidying MDP) Let’s consider a simple decision problem throughout this chapter: the task of keeping your room tidy.

Your room has the possible states \(\mathcal{S} = \{ \text{orderly}, \text{messy} \}.\) You can take either of the actions \(\mathcal{A} = \{ \text{ignore}, \text{tidy} \}.\) The room starts off orderly, that is, \(\mu(\text{orderly}) = 1\).

The state transitions are as follows: if you tidy the room, it becomes (or remains) orderly; if you ignore the room, it might become messy, according to the probabilities in Table 1.1.

The rewards are as follows: You get penalized for tidying an orderly room (a waste of time) or ignoring a messy room, but you get rewarded for ignoring an orderly room (since you can enjoy your additional time). Tidying a messy room is a chore that gives no reward.

Consider a time horizon of \(H= 7\) days (one interaction per day). Let \(h = 0\) correspond to Monday and \(h = 6\) correspond to Sunday.

Code
tidy_mdp = MDP(
    S=2,  # 0 = orderly, 1 = messy
    A=2,  # 0 = ignore, 1 = tidy
    mu=jnp.array([1.0, 0.0]),  # start in orderly state
    P=jnp.array([
        [
            [0.7, 0.3],  # orderly, ignore
            [1.0, 0.0],  # orderly, tidy
        ],
        [
            [0.0, 1.0],  # messy, ignore
            [1.0, 0.0],  # messy, tidy
        ],
    ]),
    r=jnp.array([
        [
            1.0,   # orderly, ignore
            -1.0,  # orderly, tidy
        ],
        [
            -1.0,  # messy, ignore
            0.0,   # messy, tidy
        ]
    ]),
    H=7,
)

Markdown(tabulate(
    zip(*[
        ['orderly', 'orderly', 'messy', 'messy'],
        ['ignore', 'tidy', 'ignore', 'tidy'],
        tidy_mdp.P[:, :, 0].flatten(),
        tidy_mdp.P[:, :, 1].flatten(),
        tidy_mdp.r.flatten(),
    ]),
    [
        r"$s$",
        r"$a$",
        r"$P(\text{orderly} \mid s, a)$",
        r"$P(\text{messy} \mid s, a)$",
        r"$r(s, a)$",
    ]
))
Table 1.1: Description of the MDP in Example 1.3.
\(s\) \(a\) \(P(\text{orderly} \mid s, a)\) \(P(\text{messy} \mid s, a)\) \(r(s, a)\)
orderly ignore 0.7 0.3 1
orderly tidy 1 0 -1
messy ignore 0 1 -1
messy tidy 1 0 0

1.2.1 Policies

Definition 1.3 (Policies) A policy \(\pi\) describes the agent’s strategy: which actions it takes in a given situation. A key goal of RL is to find the optimal policy that maximizes the total reward on average.

There are three axes along which policies can vary: their outputs, inputs, and time-dependence.

  1. Outputs: Deterministic or stochastic. A deterministic policy outputs actions while a stochastic policy outputs distributions over actions.

A deterministic policy, which produces a single action

A stochastic policy, which produces a distribution over possible actions
Figure 1.2
  2. Inputs: State-dependent or history-dependent. A state-dependent (a.k.a. “Markovian”) policy only depends on the current state, while a history-dependent policy depends on the sequence of past states, actions, and rewards. We’ll only consider state-dependent policies in this course.

  3. Time-dependence: Stationary or time-dependent. A stationary (a.k.a. time-homogeneous) policy keeps the same strategy at all time steps, while a time-dependent policy can depend on the current timestep. For consistency with states and actions, we will denote the timestep as a subscript, i.e. \(\pi = \{ \pi_0, \dots, \pi_{H-1} \}.\)

Note that for finite state and action spaces, we can represent a randomized mapping \(\mathcal{S} \to \Delta(\mathcal{A})\) as a matrix \(\pi \in [0, 1]^{\mathcal{S} \times \mathcal{A}}\) where each row describes the policy’s distribution over actions for the corresponding state.

Example 1.4 (Policies for the tidying MDP) Here are some possible deterministic policies for the tidying MDP Example 1.3:

  • Always tidy: \(\pi(s) = \text{tidy}\).
  • Only tidy on weekends: \(\pi_h(s) = \text{tidy}\) if \(h\in \{ 5, 6 \}\) and \(\pi_h(s) = \text{ignore}\) otherwise.
  • Only tidy if the room is messy: \(\pi_h(\text{messy}) = \text{tidy}\) and \(\pi_h(\text{orderly}) = \text{ignore}\) for all \(h\).
Code
# arrays of shape (H, S, A) represent time-dependent policies
tidy_policy_always_tidy = (
    jnp.zeros((7, 2, 2))
    .at[:, :, 1].set(1.0)
)
tidy_policy_weekends = (
    jnp.zeros((7, 2, 2))
    .at[5:7, :, 1].set(1.0)
    .at[0:5, :, 0].set(1.0)
)
tidy_policy_messy_only = (
    jnp.zeros((7, 2, 2))
    .at[:, 1, 1].set(1.0)
    .at[:, 0, 0].set(1.0)
)

1.2.2 Trajectories

Definition 1.4 (Trajectories) A sequence of states, actions, and rewards is called a trajectory:

\[ \tau = (s_0, a_0, r_0, \dots, s_{H-1}, a_{H-1}, r_{H-1}) \]

where \(r_h= r(s_h, a_h)\). (Note that some sources omit the reward at the final time step. This is a minor detail.)

Code
class Transition(NamedTuple):
    """A single state-action-reward interaction with the environment.

    A trajectory comprises a sequence of transitions.
    """
    s: int
    a: int
    r: float

Once we’ve chosen a policy, we can sample trajectories by repeatedly choosing actions according to the policy, transitioning according to the state transitions, and observing the rewards. This generative process induces a distribution \(\rho^{\pi}\) over trajectories. (We assume that \(\mu\) and \(P\) are clear from context.)

graph LR
    S0($$s_0$$) -- $$\pi_0$$ --> A0{{$$a_0$$}}
    S0 & A0 --> R0[$$r_0$$]
    A0 & S0 --> S1($$s_1$$)
    S1 -- $$\pi_1$$ --> A1{{$$a_1$$}}
    S1 & A1 --> R1[$$r_1$$]
    A1 & S1 --> S2($$s_2$$)
    S2 -- $$\pi_2$$ --> A2{{$$a_2$$}}
    S2 & A2 --> R2[$$r_2$$]
    A2 & S2 --> S3($$s_3$$)
Figure 1.3: A trajectory is generated by taking a sequence of actions according to the policy.
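To make this sampling process concrete, here is a minimal sketch of a rollout function for a finite MDP. (The helper name `sample_trajectory` and the use of `jax.random.choice` are our own choices for illustration; they are not used elsewhere in the text.)

Code
def sample_trajectory(
    mdp: MDP,
    policy: Float[Array, "H S A"],
    key: Array,
) -> list[Transition]:
    """Sample a trajectory tau ~ rho^pi by rolling out a (time-dependent) policy."""
    key, init_key = jax.random.split(key)
    s = int(jax.random.choice(init_key, mdp.S, p=mdp.mu))  # s_0 ~ mu
    trajectory = []
    for h in range(mdp.H):
        key, action_key, state_key = jax.random.split(key, 3)
        a = int(jax.random.choice(action_key, mdp.A, p=policy[h, s]))  # a_h ~ pi_h(. | s_h)
        r = float(mdp.r[s, a])                                         # r_h = r(s_h, a_h)
        trajectory.append(Transition(s=s, a=a, r=r))
        s = int(jax.random.choice(state_key, mdp.S, p=mdp.P[s, a]))    # s_{h+1} ~ P(. | s_h, a_h)
    return trajectory

For example, `sample_trajectory(tidy_mdp, tidy_policy_messy_only, jax.random.PRNGKey(0))` rolls out the “only tidy if the room is messy” policy for one week.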

Example 1.5 (Trajectories in the tidying environment) Here is a possible trajectory for the tidying example:

\(h\) \(0\) \(1\) \(2\) \(3\) \(4\) \(5\) \(6\)
\(s\) orderly orderly orderly messy messy orderly orderly
\(a\) tidy ignore ignore ignore tidy ignore ignore
\(r\) \(-1\) \(1\) \(1\) \(-1\) \(0\) \(1\) \(1\)

Could any of the policies in Example 1.4 have generated this trajectory?

Note that for a state-dependent policy, using the Markov property (Definition 1.2), we can write down the likelihood function of this probability distribution in an autoregressive way (i.e. one timestep at a time):

\[ \rho^{\pi}(\tau) := \mu(s_0) \pi_0(a_0 \mid s_0) P(s_1 \mid s_0, a_0) \cdots P(s_{H-1} \mid s_{H-2}, a_{H-2}) \pi_{H-1}(a_{H-1} \mid s_{H-1}) \tag{1.1}\]

Code
def trajectory_log_likelihood(
    mdp: MDP,
    traj: list[Transition],
    pi: Float[Array, "S A"],
) -> float:
    """Compute the log-likelihood of a trajectory under a given MDP and policy."""

    # initial distribution and action
    total = jnp.log(mdp.mu[traj[0].s])
    total += jnp.log(pi[traj[0].s, traj[0].a])

    # remaining state transitions and actions
    for i in range(1, mdp.H):
        total += jnp.log(mdp.P[traj[i - 1].s, traj[i - 1].a, traj[i].s])
        total += jnp.log(pi[traj[i].s, traj[i].a])

    return total

Exercise 1.4 (Trajectory distribution for stochastic reward) Modify Equation 1.1 for the case when the reward function is stochastic, that is, \(r : \mathcal{S} \times \mathcal{A} \to \triangle(\mathbb{R})\).

For a deterministic policy \(\pi\), we have that \(\pi_h(a \mid s) = \mathbb{I}[a = \pi_h(s)]\); that is, the probability of taking an action is \(1\) if it’s the unique action prescribed by the policy for that state and \(0\) otherwise. In this case, the only randomness in sampling trajectories comes from the initial state distribution \(\mu\) and the state transitions \(P\).

1.2.3 Value functions

The core goal of an RL algorithm is to find a policy that maximizes the expected total reward

\[ \mathbb{E}[r_0 + \cdots + r_{H-1}]. \]

Note that the quantity \(r_0 + \cdots + r_{H-1}\) is a random variable. For a given policy, its distribution is induced by that policy’s trajectory distribution (Equation 1.1). We need tools for describing the expected total reward achieved by a given policy, starting in specific states and actions. This will allow us to compare the quality of different policies and compute the optimal policy.
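As a rough sanity check, we can estimate this expectation by Monte Carlo, averaging the total reward over trajectories generated with the `sample_trajectory` sketch from Section 1.2.2 (again, that helper is our own and is for illustration only):

Code
def monte_carlo_total_reward(
    mdp: MDP,
    policy: Float[Array, "H S A"],
    key: Array,
    num_episodes: int = 1000,
) -> float:
    """Estimate E[r_0 + ... + r_{H-1}] by averaging the total reward of sampled trajectories."""
    total = 0.0
    for _ in range(num_episodes):
        key, episode_key = jax.random.split(key)
        trajectory = sample_trajectory(mdp, policy, episode_key)
        total += sum(step.r for step in trajectory)
    return total / num_episodes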

Definition 1.5 (Value function) A policy’s value function at time \(h\) is its expected remaining reward, starting in a specific state:

\[ V_h^\pi(s) := \mathbb{E}_{\tau \sim \rho^\pi} [r_h+ \cdots + r_{H-1} \mid s_h= s] \]

graph LR
    S0($$s_h = s$$) -- $$\pi_h$$ --> A0{{$$a_h$$}}
    S0 & A0 --> R0[$$r_h$$]
    A0 & S0 --> S1("$$s_{h+1}$$")
    S1 -- "$$\pi_{h+1}$$" --> A1{{"$$a_{h+1}$$"}}
    S1 & A1 --> R1["$$r_{h+1}$$"]
    A1 & S1 --> S2("$$s_{h+2}$$")
    S2 -- "$$\pi_{h+2}$$" --> A2{{"$$a_{h+2}$$"}}
    S2 & A2 --> R2["$$r_{h+2}$$"]
    A2 & S2 --> S3("$$s_{h+3}$$")

    class S0,R0,R1,R2 thick
    class R0,R1,R2 reward

    classDef thick stroke-width: 4px;
    classDef reward fill: lightcoral;
Figure 1.4: The policy starts in state \(s\) at time \(h\). Then the expected remaining reward is computed.

Definition 1.6 (Action-value function) Similarly, we can define the action-value function (aka the Q-function) at time \(h\) as the expected remaining reward starting in a specific state and taking a specific action:

\[ Q_h^\pi(s, a) := \mathbb{E}_{\tau \sim \rho^\pi} [r_h+ \cdots + r_{H-1} \mid s_h= s, a_h= a] \]

graph LR
    S0($$s_h = s$$) ~~~ A0{{$$a_h = a$$}}
    S0 & A0 --> R0[$$r_h$$]
    A0 & S0 --> S1("$$s_{h+1}$$")
    S1 -- "$$\pi_{h+1}$$" --> A1{{"$$a_{h+1}$$"}}
    S1 & A1 --> R1["$$r_{h+1}$$"]
    A1 & S1 --> S2("$$s_{h+2}$$")
    S2 -- "$$\pi_{h+2}$$" --> A2{{"$$a_{h+2}$$"}}
    S2 & A2 --> R2["$$r_{h+2}$$"]
    A2 & S2 --> S3("$$s_{h+3}$$")

    class S0,A0,R0,R1,R2 thick;
    class R0,R1,R2 reward;

    classDef thick stroke-width: 4px;
    classDef reward fill: lightcoral;
Figure 1.5: The policy starts in state \(s\) at time \(h\) and takes action \(a\). Then the expected remaining reward is computed.

Exercise 1.5 (Relating the value function and action-value function) Show that the value function is the expected action-value over actions drawn from the policy:

\[ V_h^\pi(s) = \mathbb{E}_{a \sim \pi_h(s)} [Q_h^\pi(s, a)] \]

and the action-value is the sum of the immediate reward and the expected value of the following state:

\[ Q_h^\pi(s, a) = r(s, a) + \mathbb{E}_{s' \sim P(s, a)} [V_{h+1}^\pi(s')] \]

Code
def q_to_v(
    policy: Float[Array, "S A"],
    q: Float[Array, "S A"],
) -> Float[Array, " S"]:
    """
    Compute the value function for a given policy in a known finite MDP
    at a single timestep from its action-value function.
    """
    return jnp.average(q, weights=policy, axis=1)

def v_to_q(
    mdp: MDP,
    v_next: Float[Array, " S"],
) -> Float[Array, "S A"]:
    """
    Compute the action-value function in a known finite MDP
    at a single timestep from the corresponding value function.
    """
    # the discount factor is relevant later
    return mdp.r + mdp.gamma * mdp.P @ v_next


# convert a list of v functions to a list of q functions
v_ary_to_q_ary = jax.vmap(v_to_q, in_axes=(None, 0))

1.2.4 The one-step (Bellman) consistency equation

Note that by simply considering the cumulative reward as the sum of the immediate reward and the remaining reward, we can describe the value function recursively, in terms of itself. The resulting system of equations (one equation per state) is named after Richard Bellman (1920–1984), who is credited with introducing dynamic programming in 1953.

Theorem 1.1 (Bellman consistency equations for the value function) For a state-dependent policy \(\pi\),

\[ \forall s \in \mathcal{S}, \quad V_h^\pi(s) = \mathbb{E}_{\substack{a \sim \pi_h(s) \\ s' \sim P(s, a)}} [r(s, a) + V_{h+1}^\pi(s')] \tag{1.2}\]

Code
def check_bellman_consistency_v(
    mdp: MDP,
    policy: Float[Array, "H S A"],
    v_ary: Float[Array, "H S"],
) -> bool:
    """
    Check that the given (time-dependent) "value function"
    satisfies the Bellman consistency equation.
    """
    return all(
        jnp.allclose(
            # lhs
            v_ary[h],
            # rhs
            jnp.sum(policy[h] * (mdp.r + mdp.gamma * mdp.P @ v_ary[h + 1]), axis=1),
        )
        for h in range(mdp.H - 1)
    )

Exercise 1.6 Verify that this equation holds for all \(s \in \mathcal{S}\) by expanding \(V_h^\pi(s)\) and \(V_{h+1}^\pi(s')\) and applying linearity of expectation. Where did you use the assumption that \(\pi\) doesn’t depend on the past states and actions?

Remark 1.2 (The Bellman consistency equation for deterministic policies). Note that for state-dependent deterministic policies, the Bellman consistency equations simplify to

\[ V_h^\pi(s) = r(s, \pi_h(s)) + \mathbb{E}_{s' \sim P(s, \pi_h(s))} [V_{h+1}^\pi(s')] \]
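As a quick illustration, here is a sketch of this simplified backup in code, representing a deterministic policy as an integer array of chosen actions. (This helper and the integer-array representation are ours; the rest of the code in this chapter uses one-hot policy matrices instead.)

Code
def bellman_backup_deterministic(
    mdp: MDP,
    pi_h,  # integer array of shape (S,): pi_h[s] is the action taken in state s
    v_next: Float[Array, " S"],
) -> Float[Array, " S"]:
    """Compute V_h(s) = r(s, pi_h(s)) + gamma * E_{s' ~ P(s, pi_h(s))} [v_next(s')]."""
    s_idx = jnp.arange(mdp.S)
    # gamma defaults to 1 in the finite-horizon setting
    return mdp.r[s_idx, pi_h] + mdp.gamma * mdp.P[s_idx, pi_h] @ v_next

For instance, `pi_h = jnp.array([0, 1])` encodes the policy that ignores an orderly room and tidies a messy one.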

1.2.5 Policy evaluation in finite-horizon MDPs

How can we actually evaluate a given policy, that is, compute its value function?

Suppose we start with some guess \(v_h: \mathcal{S} \to \mathbb{R}\) for the values of each state at time \(h= 0, \dots, H-1\). We write \(v\) in lowercase to indicate that it might not be the value function of a policy. How might we improve this guess?

Recall that the Bellman consistency equations (Equation 1.2) hold for any value function. Suppose we replace \(V^\pi_{h+1}\) with \(v_{h+1}\) on the right-hand side. Then the new right-hand side quantity,

\[ \mathbb{E}_{\substack{a \sim \pi(s) \\ s' \sim P(s, a)}} [r(s, a) + v_{h+1}(s')], \tag{1.3}\]

can be thought of as follows: we take one action according to \(\pi\), observe the exact immediate reward, and evaluate the next state using the guess \(v_{h+1}\). This gives us an updated estimate of \(V^\pi_h(s)\) whose error depends only on how accurate the guess \(v_{h+1}\) is at the next state, rather than on the guess at the current state.

graph LR
    S0($$s_h = s$$) -- $$\pi_h$$ --> A0{{"$$a_h = a$$"}}
    S0 & A0 --> R0["$$r_h = r(s, a)$$"]
    A0 & S0 --> S1("$$s_{h+1} = s'$$")
    S1 --> R1["$$r_{h+1} + \cdots + r_{H-1} \approx v_{h+1}(s')$$"]

    class S0,R0,R1,R2 thick
    class R0,R1,R2 reward

    classDef thick stroke-width: 4px;
    classDef reward fill: lightcoral;
Figure 1.6: We evaluate the next state using the guess function \(v_{h+1}\).

We can treat this operation of taking \(v\) to its improved version as a higher-order function \(\mathcal{J}^\pi\), known as the Bellman operator.

Definition 1.7 (Bellman operator) Let \(\pi : \mathcal{S} \to \triangle(\mathcal{A})\). The Bellman operator \(\mathcal{J}^{\pi} : (\mathcal{S} \to \mathbb{R}) \to (\mathcal{S} \to \mathbb{R})\) is the higher-order function that takes in a function \(v : \mathcal{S} \to \mathbb{R}\) and returns the r.h.s. of the Bellman equation with \(v\) substituted in:

\[ \mathcal{J}^{\pi}(v) := \left( s \mapsto \mathbb{E}_{\substack{a \sim \pi(s) \\ s' \sim P(s, a)}} [r(s, a) + v(s')] \right). \tag{1.4}\]

This is a crucial tool for reasoning about MDPs. Intuitively, it answers the following question: if we evaluate the next state using \(v\), how good is the current state, according to the given policy?

Code
def bellman_operator_looping(
    mdp: MDP,
    policy: Float[Array, "S A"],
    v: Float[Array, " S"],
) -> Float[Array, " S"]:
    """
    Looping definition of the Bellman operator.
    Concise version is below
    """
    v_new = np.zeros(mdp.S)  # use NumPy here: JAX arrays don't support in-place item assignment
    for s in range(mdp.S):
        for a in range(mdp.A):
            for s_next in range(mdp.S):
                v_new[s] += (
                    policy[s, a]
                    * mdp.P[s, a, s_next]
                    * (mdp.r[s, a] + mdp.gamma * v[s_next])
                )
    return jnp.array(v_new)

# Note that we can concisely implement this using the `q_to_v` and `v_to_q` utilities from above:

def bellman_operator(
    mdp: MDP,
    policy: Float[Array, "S A"],
    v: Float[Array, " S"],
) -> Float[Array, " S"]:
    """For a known finite MDP, the Bellman operator can be exactly evaluated."""
    # equivalent formulation: q_to_v(policy, v_to_q(mdp, v))
    return jnp.sum(policy * (mdp.r + mdp.gamma * mdp.P @ v), axis=1)

latex(
    bellman_operator_looping,
    id_to_latex={"v_new": r"v^\text{lhs}", "policy": r"\pi", "s_next": r"s'"},
    trim_prefixes={"mdp"},
    subscript_type="call",
)
\begin{array}{l} \mathbf{function} \ \mathrm{bellman\_operator\_looping}(\mathrm{mdp}: \mathrm{MDP}, \pi: \mathbb{R}^{S \times A}, v: \mathbb{R}^{S}) \\ \hspace{1em} \textrm{" Looping definition of the Bellman operator. Concise version is below "} \\ \hspace{1em} v^\text{lhs} \gets \mathbf{0}^{S} \\ \hspace{1em} \mathbf{for} \ s \in \mathrm{range} \mathopen{}\left( S \mathclose{}\right) \ \mathbf{do} \\ \hspace{2em} \mathbf{for} \ a \in \mathrm{range} \mathopen{}\left( A \mathclose{}\right) \ \mathbf{do} \\ \hspace{3em} \mathbf{for} \ s' \in \mathrm{range} \mathopen{}\left( S \mathclose{}\right) \ \mathbf{do} \\ \hspace{4em} v^\text{lhs}(s) \gets v^\text{lhs}(s) + \pi(s, a) P(s, a, s') \cdot \mathopen{}\left( r(s, a) + \gamma v(s') \mathclose{}\right) \\ \hspace{3em} \mathbf{end \ for} \\ \hspace{2em} \mathbf{end \ for} \\ \hspace{1em} \mathbf{end \ for} \\ \hspace{1em} \mathbf{return} \ v^\text{lhs} \\ \mathbf{end \ function} \end{array}
Figure 1.7: An algorithm for computing the Bellman operator for a finite state and action space.

The Bellman operator also gives us a concise way to express the Bellman consistency equations (Equation 1.2) for the value function:

\[ V_h^\pi = \mathcal{J}^{\pi}(V_{h+1}^\pi) \tag{1.5}\]

1.2.6 Dynamic programming

The Bellman consistency equations (Theorem 1.1) give us a convenient algorithm for evaluating state-dependent policies: they express the value function at timestep \(h\) as a function of the value function at timestep \(h+1\). This means we can start at the end of the time horizon, where the value is known, and work backwards in time, using the Bellman operator to compute the value function at each time step.

Code
def dp_eval_finite(mdp: MDP, policy: Float[Array, "H S A"]) -> Float[Array, "H S"]:
    """Evaluate a policy using dynamic programming."""
    V = np.zeros((mdp.H + 1, mdp.S))  # initialize to 0 at end of time horizon
    for h in range(mdp.H - 1, -1, -1):
        V[h] = bellman_operator(mdp, policy[h], V[h + 1])
    return V[:-1]

latex(
    dp_eval_finite,
    id_to_latex={"bellman_operator": r"\mathcal{J}", "policy": r"\pi", "V": "V^\pi"},
)
\begin{array}{l} \mathbf{function} \ \mathrm{dp\_eval\_finite}(\mathrm{mdp}: \mathrm{MDP}, \pi: \mathbb{R}^{S \times A}) \\ \hspace{1em} \textrm{"Evaluate a policy using dynamic programming."} \\ \hspace{1em} V^\pi \gets \mathbf{0}^{\mathrm{mdp}.H + 1 \times \mathrm{mdp}.S} \\ \hspace{1em} \mathbf{for} \ h \in \mathrm{range} \mathopen{}\left( \mathrm{mdp}.H - 1, -1, -1 \mathclose{}\right) \ \mathbf{do} \\ \hspace{2em} V^\pi_{h} \gets \mathcal{J} \mathopen{}\left( \mathrm{mdp}, \pi_{h}, V^\pi_{h + 1} \mathclose{}\right) \\ \hspace{1em} \mathbf{end \ for} \\ \hspace{1em} \mathbf{return} \ V^\pi_{:-1} \\ \mathbf{end \ function} \end{array}
Figure 1.8: A dynamic programming algorithm for evaluating a policy in a finite-horizon MDP.

This runs in time \(O(H \cdot |\mathcal{S}|^2 \cdot |\mathcal{A}|)\). Note that the implementation of the Bellman operator in Figure 1.7 can be easily modified to compute \(Q^\pi\) as an intermediate step. Do you see how?
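One possible answer, sketched under the convention \(V^\pi_H \equiv 0\) (the helper name `dp_eval_finite_q` is ours): evaluate \(V^\pi\) as before, shift it by one timestep, and apply `v_ary_to_q_ary`.

Code
def dp_eval_finite_q(mdp: MDP, policy: Float[Array, "H S A"]) -> Float[Array, "H S A"]:
    """Compute Q^pi at every timestep from the value functions computed by dp_eval_finite."""
    V = jnp.asarray(dp_eval_finite(mdp, policy))              # (H, S)
    V_next = jnp.concatenate([V[1:], jnp.zeros((1, mdp.S))])  # V_{h+1}, with V_H = 0
    return v_ary_to_q_ary(mdp, V_next)                        # Q_h = r + gamma * P @ V_{h+1}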

Example 1.6 (Tidying policy evaluation) Let’s evaluate the policy from Example 1.4 in the tidying MDP that tidies if and only if the room is messy. We’ll use the Bellman consistency equation to compute the value function at each time step.

\[ \begin{aligned} V_{H-1}^\pi(\text{orderly}) &= r(\text{orderly}, \text{ignore}) \\ &= 1 \\ V_{H-1}^\pi(\text{messy}) &= r(\text{messy}, \text{tidy}) \\ &= 0 \\ V_{H-2}^\pi(\text{orderly}) &= r(\text{orderly}, \text{ignore}) + \mathbb{E}_{s' \sim P(\text{orderly}, \text{ignore})} [V_{H-1}^\pi(s')] \\ &= 1 + 0.7 \cdot V_{H-1}^{\pi}(\text{orderly}) + 0.3 \cdot V_{H-1}^{\pi}(\text{messy}) \\ &= 1 + 0.7 \cdot 1 + 0.3 \cdot 0 \\ &= 1.7 \\ V_{H-2}^\pi(\text{messy}) &= r(\text{messy}, \text{tidy}) + \mathbb{E}_{s' \sim P(\text{messy}, \text{tidy})} [V_{H-1}^\pi(s')] \\ &= 0 + 1 \cdot V_{H-1}^{\pi}(\text{orderly}) + 0 \cdot V_{H-1}^{\pi}(\text{messy}) \\ &= 1 \\ V_{H-3}^\pi(\text{orderly}) &= r(\text{orderly}, \text{ignore}) + \mathbb{E}_{s' \sim P(\text{orderly}, \text{ignore})} [V_{H-2}^\pi(s')] \\ &= 1 + 0.7 \cdot V_{H-2}^{\pi}(\text{orderly}) + 0.3 \cdot V_{H-2}^{\pi}(\text{messy}) \\ &= 1 + 0.7 \cdot 1.7 + 0.3 \cdot 1 \\ &= 2.49 \\ V_{H-3}^\pi(\text{messy}) &= r(\text{messy}, \text{tidy}) + \mathbb{E}_{s' \sim P(\text{messy}, \text{tidy})} [V_{H-2}^\pi(s')] \\ &= 0 + 1 \cdot V_{H-2}^{\pi}(\text{orderly}) + 0 \cdot V_{H-2}^{\pi}(\text{messy}) \\ &= 1.7 \end{aligned} \]

etc. You may wish to repeat this computation for the other policies to get a better sense of this algorithm.

Code
V_messy = dp_eval_finite(tidy_mdp, tidy_policy_messy_only)
Markdown(tabulate(
    zip(*[
        ["orderly", "messy"],
        *V_messy,
    ]),
    ["$s$"] + [f"$h={h}$" for h in range(len(V_messy))]
))
\(s\) \(h=0\) \(h=1\) \(h=2\) \(h=3\) \(h=4\) \(h=5\) \(h=6\)
orderly 5.56217 4.79277 4.0241 3.253 2.49 1.7 1
messy 4.79277 4.0241 3.253 2.49 1.7 1 0

1.2.7 Optimal policies in finite-horizon MDPs

We’ve just seen how to evaluate a given policy. But how can we find the best policy for a given environment? We must first define what it means for a policy to be optimal. Intuitively speaking, acting according to an optimal policy should yield the highest possible expected remaining reward in any situation. In mathematical terms:

Definition 1.8 (Optimal policies) We call the policy \(\pi^\star\) optimal if, for any partial trajectory \(\tau_{\le h} = (s_0, a_0, r_0, \dots, s_{h-1}, a_{h-1}, r_{h-1}, s_{h})\), it does at least as well as any other policy \(\pi^\text{any}\):

\[ \begin{gathered} \forall \pi^\text{any}, \quad \mathbb{E}\left[ r_h+ \cdots + r_{H-1} \mid \tau_{\le h}, a_{h'} \sim \pi^\star(\tau_{\le h'}) \right] \\ \ge \mathbb{E}\left[ r_h+ \cdots + r_{H-1} \mid \tau_{\le h}, a_{h'} \sim \pi^\text{any}(\tau_{\le h'}) \right]. \end{gathered} \tag{1.6}\]

Example 1.7 (Optimal policy in the tidying MDP) For the tidying MDP (Example 1.3), you might guess that the optimal strategy is to only tidy your room if it is messy, since this puts the room into the “orderly” state, in which reward can be obtained, and you are penalized for tidying otherwise. Note that this policy is state-dependent and deterministic.

It is a stunning fact that every finite-horizon MDP has an optimal policy that is state-dependent and deterministic. How is this possible? For some intuition, recall that the Markov property (Definition 1.2) says that once we know the current state, the next state no longer depends on the past history, so history-dependent policies, intuitively speaking, don’t benefit over simply state-dependent ones. For a rigorous proof, see Chapter 1.1.3 of Agarwal et al. (2022). This theorem greatly narrows down the space of policies that we have to search through.

Definition 1.9 (Optimal value function) Given a state \(s\) at time \(h\), the optimal value \(V^\star_h(s)\) is the maximum expected remaining reward achievable by any policy \(\pi^\text{any}\). Since there exists a state-dependent optimal policy (Theorem 1.2), it suffices to restrict \(\pi^\text{any}\) to state-dependent policies, which have value functions:

\[ \forall s \in \mathcal{S}, \quad V^\star_h(s) := \max_{\pi^\text{any}} V^{\pi^\text{any}}_h(s). \tag{1.7}\]

We can define the optimal action-value function analogously:

\[ \forall s \in \mathcal{S}, a \in \mathcal{A}, \quad Q^\star_h(s, a) := \max_{\pi^\text{any}} Q^{\pi^\text{any}}_h(s, a). \tag{1.8}\]

We can construct an optimal policy by acting greedily with respect to the optimal action-value function:

Definition 1.10 (Greedy policies) For any sequence of functions \(q_h : \mathcal{S} \times \mathcal{A} \to \mathbb{R}\) for \(h = 0, \dots, H-1\), we define the greedy policy \(\hat \pi^q\) to be the deterministic policy that selects the action with the highest value according to \(q\) at each state:

\[ \hat \pi_h^q(s) := \arg\max_{a \in \mathcal{A}} q_h(s, a). \tag{1.9}\]

Code
def q_to_greedy(q: Float[Array, "S A"]) -> Float[Array, "S A"]:
    """
    Get the (deterministic) greedy policy with respect to an action-value function.
    Return the policy as a matrix of shape (S, A) where each row is a one-hot vector.
    """
    A = q.shape[1]
    a_ary = jnp.argmax(q, axis=1)
    return jnp.eye(A)[a_ary]


def v_to_greedy(mdp: MDP, v: Float[Array, " S"]) -> Float[Array, "S A"]:
    """Get the (deterministic) greedy policy with respect to a value function."""
    return q_to_greedy(v_to_q(mdp, v))

Theorem 1.2 (A deterministic optimal policy) The greedy policy \(\hat \pi_h^{Q^\star}(s)\), where \(Q^\star\) is the optimal action-value function (Equation 1.8), is an optimal policy (Definition 1.8).

Proof. Let \(V^{\star}\) and \(Q^{\star}\) denote the optimal value and action-value functions (Definition 1.9). We aim to show that the greedy policy \(\hat \pi^{Q^\star}\) is optimal; that is, \(V^{\hat \pi^{Q^\star}} = V^{\star}\). For notational convenience, we abbreviate \(\hat \pi^{Q^\star}\) by just \(\hat \pi\).

Fix an arbitrary state \(s \in \mathcal{S}\) and time \(h\in [H]\).

By the definition of \(V^{\star}\), we already know \(V_h^{\star}(s) \ge V_h^{\hat \pi}(s)\). So for equality to hold we just need to show that \(V_h^{\star}(s) \le V_h^{\hat \pi}(s)\).

We’ll first show that the Bellman operator \(\mathcal{J}^{\hat \pi_h}\) never decreases \(V_h^{\star}\) elementwise. Then we’ll apply this result recursively to show that \(V^{\star} = V^{\hat \pi}\).

Lemma 1.1 (The Bellman operator never decreases the optimal value function) \(\mathcal{J}^{\hat \pi_h}\) never decreases \(V^{\star}\) elementwise:

\[ \forall h \in [H], \quad [\mathcal{J}^{\hat \pi_h} (V_{h+1}^{\star})](s) \ge V_h^{\star}(s). \tag{1.10}\]

Proof. We perform some algebraic manipulations:

\[ \begin{aligned} V_h^{\star}(s) &= \max_{\pi \in \Pi} V_h^{\pi}(s) \\ &= \max_{\pi} \mathop{\mathbb{E}}_{a \sim \pi(\dots)}\left[r(s, a) + \mathop{\mathbb{E}}_{s' \sim P(s, a)} V_{h+1}^\pi(s') \right] && \text{Bellman consistency} \\ &\le \max_{\pi} \mathop{\mathbb{E}}_{a \sim \pi(\dots)}\left[r(s, a) + \mathop{\mathbb{E}}_{s' \sim P(s, a)} V_{h+1}^{\star}(s') \right] && \text{definition of } V^\star \\ &= \max_{a \in \mathcal{A}} \left[ r(s, a) + \mathop{\mathbb{E}}_{s' \sim P(s, a)} V_{h+1}^{\star}(s') \right] && \text{only depends on } \pi \text{ via } a \\ &= [\mathcal{J}^{\hat \pi_h}(V_{h+1}^{\star})](s). \end{aligned} \]

We can now apply this result recursively to get

\[ V^{\star}_h(s) \le V^{\hat \pi}_h(s) \]

as follows. (Note that even though \(\hat \pi\) is deterministic, we’ll use the \(a \sim \hat \pi(s)\) notation to make it explicit that we’re sampling a trajectory from it.)

\[ \begin{aligned} V_{h}^{\star}(s) &\le [\mathcal{J}^{\hat \pi}(V_{h+1}^{\star})](s) \\ &= \mathop{\mathbb{E}}_{a \sim \hat \pi(s)} \left[ r(s, a) + \mathop{\mathbb{E}}_{s' \sim P(s, a)} \left[ {\color{blue} V_{h+1}^{\star}(s')} \right] \right] && \text{definition of } \mathcal{J}^{\hat \pi} \\ &\le \mathop{\mathbb{E}}_{a \sim \hat \pi(s)} \left[ r(s, a) + \mathop{\mathbb{E}}_{s' \sim P(s, a)} \left[ {\color{blue}[ \mathcal{J}^{\hat \pi} (V_{h+2}^{\star})] (s')} \right] \right] && \text{above lemma} \\ &= \mathop{\mathbb{E}}_{a \sim \hat \pi(s)} \left[ r(s, a) + \mathop{\mathbb{E}}_{s' \sim P(s, a)}{\color{blue} \left[ \mathop{\mathbb{E}}_{a' \sim \hat \pi} r(s', a') + \mathop{\mathbb{E}}_{s''} V_{h+2}^{\star}(s'') \right]} \right] && \text{definition of } \mathcal{J}^{\hat \pi} \\ &\le \cdots && \text{apply at all timesteps} \\ &= \mathop{\mathbb{E}}_{\tau \sim \rho^{\hat \pi}} \left[\sum_{h'=h}^{H-1} r(s_{h'}, a_{h'}) \mid s_h= s\right] && \text{rewrite expectation} \\ &= V_{h}^{\hat \pi}(s) && \text{definition} \end{aligned} \]

And so we have \(V^{\star} = V^{\hat \pi}\), making \(\hat \pi\) optimal.

\(\hat \pi^{Q^\star}\) is a policy just like any other. In particular, we can write down the corresponding set of Bellman consistency equations (Theorem 1.1). These are often termed the “Bellman optimality equations”.

Theorem 1.3 (Bellman optimality equations) The optimal value function (Definition 1.9) satisfies

\[ \begin{aligned} \forall s \in \mathcal{S}, \quad V_h^\star(s) &= \max_a r(s, a) + \mathbb{E}_{s' \sim P(s, a)} [V_{h+1}^\star(s')] \\ \end{aligned} \tag{1.11}\]

Furthermore, this uniquely identifies \(V^\star\): it is the only function that satisfies this equation. That is, if a sequence of functions \(v_h : \mathcal{S} \to \mathbb{R}\) for \(h = 0, \dots, H-1\) satisfies Equation 1.11, then we know \(v = V^\star\). We leave the proof as an exercise; a proof by contradiction is relatively straightforward.

Note that due to the structure of the greedy policy, these equations don’t need the letter \(\pi\) at all! This will prove useful when we discuss off-policy algorithms later in the course.
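Mirroring `check_bellman_consistency_v` from earlier, here is a sketch of a check for Equation 1.11 (the function name is ours):

Code
def check_bellman_optimality_v(
    mdp: MDP,
    v_ary: Float[Array, "H S"],
) -> bool:
    """
    Check that the given (time-dependent) "value function"
    satisfies the Bellman optimality equation.
    """
    return all(
        jnp.allclose(
            # lhs
            v_ary[h],
            # rhs
            jnp.max(mdp.r + mdp.gamma * mdp.P @ v_ary[h + 1], axis=1),
        )
        for h in range(mdp.H - 1)
    )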

How can we compute \(\hat \pi^{Q^\star}\)? We need to compute the optimal value function and optimal policy. We can do this by working backwards in time using dynamic programming (DP).

graph RL
    Q0["$$Q^\star_{H-1} = r$$"] -- "$$\max_{a \in \mathcal{A}}$$" --> V0
    Q0 -- "$$\arg\max_{a \in \mathcal{A}}$$" --> P0["$$\pi^\star_{H-1}$$"]
    V0["$$V^\star_{H-1}$$"] -- "Bellman equations" --> Q1
    
    Q1["$$Q^\star_{H-2}$$"] -- "$$\max_{a \in \mathcal{A}}$$" --> V1
    Q1 -- "$$\arg\max_{a \in \mathcal{A}}$$" --> P1["$$\pi^\star_{H-2}$$"]
    V1["$$V^\star_{H-2}$$"] -- "Bellman equations" --> Q2

    Q2["..."]
Figure 1.9: Illustrating a dynamic programming algorithm for computing the optimal policy in a finite-horizon MDP.

Definition 1.11 (DP algorithm to compute an optimal policy in a finite-horizon MDP) Base case. At the end of the episode (time step \(H-1\)), we can’t take any more actions, so the \(Q\)-function is simply the reward that we obtain:

\[ Q^\star_{H-1}(s, a) = r(s, a). \]

The best thing to do at the end of the horizon, therefore, is to act greedily and get as much reward as we can!

\[ \pi^\star_{H-1}(s) = \arg\max_{a \in \mathcal{A}} Q^\star_{H-1}(s, a) \]

Then \(V^\star_{H-1}(s)\), the optimal value of state \(s\) at the end of the trajectory, is simply the highest reward achievable by any action in that state.

\[ V^\star_{H-1}(s) = \max_{a \in \mathcal{A}} Q^\star_{H-1}(s, a) \]

Recursion. Then, we can work backwards in time, starting from the end, using the Bellman optimality equations (Theorem 1.3)! i.e. for each \(h= H-2, \dots, 0\), we set

\[ \begin{aligned} Q^\star_{h}(s, a) &= r(s, a) + \mathbb{E}_{s' \sim P(s, a)} [V^\star_{h+1}(s')] \\ \pi^\star_{h}(s) &= \arg\max_{a \in \mathcal{A}} Q^\star_{h}(s, a) \\ V^\star_{h}(s) &= \max_{a \in \mathcal{A}} Q^\star_{h}(s, a) \end{aligned} \]

Code
# TODO fix latexify bugs
def find_optimal_policy(mdp: MDP):
    Q = np.zeros((mdp.H, mdp.S, mdp.A))
    policy = np.zeros((mdp.H, mdp.S, mdp.A))
    V = np.zeros((mdp.H+1, mdp.S))  # initialize to 0 at end of time horizon

    for h in range(mdp.H - 1, -1, -1):
        Q[h] = mdp.r + mdp.P @ V[h + 1]
        policy[h] = jnp.eye(mdp.S)[jnp.argmax(Q[h], axis=1)]  # one-hot
        V[h] = jnp.max(Q[h], axis=1)

    Q = jnp.stack(Q)
    policy = jnp.stack(policy)
    V = jnp.stack(V[:-1])

    return policy, V, Q

latex(find_optimal_policy, id_to_latex={"policy": r"\pi"}, trim_prefixes={"mdp"})
\begin{array}{l} \mathbf{function} \ \mathrm{find\_optimal\_policy}(\mathrm{mdp}: \mathrm{MDP}) \\ \hspace{1em} Q \gets \mathbf{0}^{H \times S \times A} \\ \hspace{1em} \pi \gets \mathbf{0}^{H \times S \times A} \\ \hspace{1em} V \gets \mathbf{0}^{H + 1 \times S} \\ \hspace{1em} \mathbf{for} \ h \in \mathrm{range} \mathopen{}\left( H - 1, -1, -1 \mathclose{}\right) \ \mathbf{do} \\ \hspace{2em} Q_{h} \gets r + P V_{h + 1} \\ \hspace{2em} \pi_{h} \gets \mathrm{jnp}.\mathrm{eye} \mathopen{}\left( S \mathclose{}\right)_{\mathrm{jnp}.\mathrm{argmax} \mathopen{}\left( Q_{h} \mathclose{}\right)} \\ \hspace{2em} V_{h} \gets \mathrm{jnp}.\mathrm{max} \mathopen{}\left( Q_{h} \mathclose{}\right) \\ \hspace{1em} \mathbf{end \ for} \\ \hspace{1em} Q \gets \mathrm{jnp}.\mathrm{stack} \mathopen{}\left( Q \mathclose{}\right) \\ \hspace{1em} \pi \gets \mathrm{jnp}.\mathrm{stack} \mathopen{}\left( \pi \mathclose{}\right) \\ \hspace{1em} V \gets \mathrm{jnp}.\mathrm{stack} \mathopen{}\left( V_{:-1} \mathclose{}\right) \\ \hspace{1em} \mathbf{return} \ \mathopen{}\left( \pi, V, Q \mathclose{}\right) \\ \mathbf{end \ function} \end{array}
Figure 1.10: Pseudocode for the dynamic programming algorithm.

At each of the \(H\) timesteps, we must compute \(Q^{\star}_h\) for each of the \(|\mathcal{S}| |\mathcal{A}|\) state-action pairs. Each computation takes \(|\mathcal{S}|\) operations to evaluate the average value over \(s'\). This gives a total computation time of \(O(H \cdot |\mathcal{S}|^2 \cdot |\mathcal{A}|)\).

Note that this algorithm is identical to the policy evaluation algorithm (Figure 1.8), but instead of averaging over the actions chosen by a policy, we instead take a maximum over the action-values. We’ll see this relationship between policy evaluation and optimal policy computation show up again in the infinite-horizon setting.

Code
pi_opt, V_opt, Q_opt = find_optimal_policy(tidy_mdp)
assert jnp.allclose(pi_opt, tidy_policy_messy_only)
assert jnp.allclose(V_opt, V_messy)
assert jnp.allclose(Q_opt[:-1], v_ary_to_q_ary(tidy_mdp, V_messy)[1:])
"Assertions passed (the 'tidy when messy' policy is optimal)"
"Assertions passed (the 'tidy when messy' policy is optimal)"

Remark 1.3 (Why stochastic policies). Given that there exists an optimal deterministic policy, why do we ever try to learn stochastic policies? We will see a partial answer to this when we discuss the exploration-exploitation tradeoff in Chapter 3. So far, we’ve assumed that the environment is totally known. If it isn’t, however, we need some way to explore different possible actions, and stochastic policies provide a natural way to do this.

1.3 Infinite-horizon MDPs

What happens if a trajectory is allowed to continue forever (i.e. \(H = \infty\))? This is the setting of infinite-horizon MDPs.

In this chapter, we’ll describe the necessary adjustments from the finite-horizon case. We’ll show that the Bellman operator (Definition 1.7) in the discounted reward setting is a contraction mapping for any policy. We’ll discuss how to evaluate policies (i.e. compute their corresponding value functions). Finally, we’ll present and analyze two iterative algorithms, based on the Bellman operator, for computing the optimal policy: value iteration and policy iteration.

1.3.1 Discounted rewards

First of all, note that the total reward \(r_h+ r_{h+1} + r_{h+2} + \cdots\) might blow up to infinity. To ensure that the “total reward” can remain bounded, we insert a discount factor \(\gamma \in [0, 1)\) such that rewards become less valuable the further into the future they are:

\[ r_h+ \gamma r_{h+1} + \gamma^2 r_{h+2} + \cdots = \sum_{k=0}^\infty \gamma^k r_{h+k}. \tag{1.12}\]

We can think of \(\gamma\) as measuring how much we care about the future: if it’s close to \(0\), we only care about the near-term rewards; if it’s close to \(1\), we put more weight into future rewards.

You can also interpret \(\gamma\) as the probability of continuing the trajectory at each time step. (This is equivalent to \(H\) following a First Success (geometric) distribution with success probability \(1 - \gamma\), where “success” means the trajectory terminates.) This accords with the above interpretation: if \(\gamma\) is close to \(0\), the trajectory will likely be very short, while if \(\gamma\) is close to \(1\), the trajectory will likely continue for a long time.

Exercise 1.7 Assuming that \(r_h\in [0, 1]\) for all \(h\in \mathbb{N}\), what is the maximum discounted cumulative reward? You may find it useful to review geometric series.
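For experimenting with this numerically, here is a small helper (the name is ours) that computes the discounted sum in Equation 1.12 for a reward sequence truncated after finitely many steps:

Code
def discounted_return(rewards: Float[Array, " T"], gamma: float) -> float:
    """Compute r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a truncated reward sequence."""
    discounts = gamma ** jnp.arange(len(rewards))
    return float(discounts @ rewards)

discounted_return(jnp.ones(100), gamma=0.95)  # try varying gamma and the sequence length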

The other components of the MDP remain the same:

\[ \mathcal{M} = (\mathcal{S}, \mathcal{A}, \mu, P, r, \gamma). \]

Code
# Code-wise, we can reuse the `MDP` class from Definition 1.1 and set `mdp.H = float('inf')`.

tidy_mdp_inf = tidy_mdp._replace(H=float("inf"), gamma=0.95)

1.3.2 Stationary policies

The time-dependent policies from the finite-horizon case become difficult to handle in the infinite-horizon case. In particular, many of the DP approaches we saw required us to start at the end of the trajectory, which is no longer possible. We’ll shift to stationary policies \(\pi : \mathcal{S} \to \mathcal{A}\) (deterministic) or \(\pi : \mathcal{S} \to \triangle(\mathcal{A})\) (stochastic).

Exercise 1.8 Which of the policies in Example 1.4 (policies in the tidying MDP) are stationary?

1.3.3 Value functions and Bellman consistency

We also consider stationary value functions \(V^\pi : \mathcal{S} \to \mathbb{R}\) and \(Q^\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}\). We need to insert a factor of \(\gamma\) into the Bellman consistency equations (Theorem 1.1) to account for the discounting:

\[ \begin{aligned} V^\pi(s) &= \mathbb{E}_{\tau \sim \rho^\pi} [r_h+ \gamma r_{h+1} + \gamma^2 r_{h+2} \cdots \mid s_h= s] && \text{for any } h\in \mathbb{N} \\ &= \mathbb{E}_{\substack{a \sim \pi(s) \\ s' \sim P(s, a)}} [r(s, a) + \gamma V^\pi(s')]\\ Q^\pi(s, a) &= \mathbb{E}_{\tau \sim \rho^\pi} [r_h+ \gamma r_{h+1} + \gamma^2 r_{h+2} + \cdots \mid s_h= s, a_h= a] && \text{for any } h\in \mathbb{N} \\ &= r(s, a) + \gamma \mathbb{E}_{\substack{s' \sim P(s, a) \\ a' \sim \pi(s')}} [Q^\pi(s', a')] \end{aligned} \tag{1.13}\]

Exercise 1.9 (Time-independent) Heuristically speaking, why does it no longer matter which time step we condition on when defining the value function?

1.3.4 Contraction mappings

Recall from Definition 1.7 that the Bellman operator \(\mathcal{J}^{\pi}\) for a policy \(\pi\) takes in a “value function” \(v : \mathcal{S} \to \mathbb{R}\) and returns the r.h.s. of the Bellman equation for that “value function”. In the infinite-horizon setting, this is

\[ [\mathcal{J}^{\pi}(v)](s) := \mathbb{E}_{\substack{a \sim \pi(s) \\ s' \sim P(s, a)}} [r(s, a) + \gamma v(s')]. \]

The crucial property of the Bellman operator is that it is a contraction mapping for any policy. Intuitively, if we start with two “value functions” \(v, u : \mathcal{S} \to \mathbb{R}\), if we repeatedly apply the Bellman operator to each of them, they will get closer and closer together at an exponential rate.

Definition 1.12 (Contraction mapping) Let \(X\) be some space with a norm \(\|\cdot\|\). We call an operator \(f: X \to X\) a contraction mapping if for any \(x, y \in X\),

\[ \|f(x) - f(y)\| \le \gamma \|x - y\| \]

for some fixed \(\gamma \in (0, 1)\). Intuitively, this means that if two points are \(\delta\) far apart, then after applying the mapping, they will be at most \(\gamma \delta\) far apart.

Exercise 1.10 Show that for a contraction mapping \(f\) with coefficient \(\gamma\), for all \(t \in \mathbb{N}\),

\[ \|f^{(t)}(x) - f^{(t)}(y)\| \le \gamma^t \|x - y\|, \]

i.e. that any two points will be pushed closer by at least a factor of \(\gamma\) at each iteration.

It is a powerful fact (known as the Banach fixed-point theorem) that every contraction mapping has a unique fixed point \(x^\star\) such that \(f(x^\star) = x^\star\). This means that if we repeatedly apply \(f\) to any starting point, we will eventually converge to \(x^\star\):

\[ \|f^{(t)}(x) - x^\star\| \le \gamma^t \|x - x^\star\|. \tag{1.14}\]

Let’s return to the RL setting and apply this result to the Bellman operator. How can we measure the distance between two “value functions” \(v, u : \mathcal{S} \to \mathbb{R}\)? We’ll take the supremum norm as our distance metric:

\[ \| v - u \|_{\infty} := \sup_{s \in \mathcal{S}} |v(s) - u(s)|, \]

i.e. we compare the “value functions” on the state that causes the biggest gap between them. Then Equation 1.14 implies that if we repeatedly apply \(\mathcal{J}^\pi\) to any starting “value function”, we will eventually converge to \(V^\pi\):

\[ \|(\mathcal{J}^\pi)^{(t)}(v) - V^\pi \|_{\infty} \le \gamma^{t} \| v - V^\pi\|_{\infty}. \tag{1.15}\]

We’ll use this useful fact to prove the convergence of several algorithms later on.

Theorem 1.4 (The Bellman operator is a contraction mapping) \[ \|\mathcal{J}^{\pi} (v) - \mathcal{J}^{\pi} (u) \|_{\infty} \le \gamma \|v - u \|_{\infty}. \]

Proof. For all states \(s \in \mathcal{S}\),

\[ \begin{aligned} |[\mathcal{J}^{\pi} (v)](s) - [\mathcal{J}^{\pi} (u)](s)|&= \Big| \mathop{\mathbb{E}}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathop{\mathbb{E}}_{s' \sim P(s, a)} v(s') \right] \\ &\qquad - \mathop{\mathbb{E}}_{a \sim \pi(s)} \left[r(s, a) + \gamma \mathop{\mathbb{E}}_{s' \sim P(s, a)} u(s') \right] \Big| \\ &= \gamma \left|\mathop{\mathbb{E}}_{a \sim \pi(s)} \mathop{\mathbb{E}}_{s' \sim P(s, a)} [v(s') - u(s')] \right| \\ &\le \gamma \mathop{\mathbb{E}}_{a \sim \pi(s)} \mathop{\mathbb{E}}_{s' \sim P(s, a)}|v(s') - u(s')| \qquad \text{(Jensen's inequality)} \\ &\le \gamma \max_{s'} |v(s') - u(s')| \\ &= \gamma \|v - u \|_{\infty}. \end{aligned} \]
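As an informal numerical sanity check (not a substitute for the proof), we can compare the two sides of Theorem 1.4 on the tidying MDP for a pair of arbitrary “value functions”. The particular random draws below are our own choice:

Code
key_v, key_u = jax.random.split(jax.random.PRNGKey(0))
v = jax.random.uniform(key_v, (tidy_mdp_inf.S,), maxval=1 / (1 - tidy_mdp_inf.gamma))
u = jax.random.uniform(key_u, (tidy_mdp_inf.S,), maxval=1 / (1 - tidy_mdp_inf.gamma))
pi = tidy_policy_messy_only[0]  # a stationary policy of shape (S, A)
lhs = jnp.max(jnp.abs(bellman_operator(tidy_mdp_inf, pi, v) - bellman_operator(tidy_mdp_inf, pi, u)))
rhs = tidy_mdp_inf.gamma * jnp.max(jnp.abs(v - u))
assert lhs <= rhs  # the Bellman operator contracts the supremum norm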

1.3.5 Policy evaluation in infinite-horizon MDPs

The backwards DP technique we used in the finite-horizon case (Section 1.2.5) no longer works since there is no “final timestep” to start from. We’ll need another approach to policy evaluation.

The Bellman consistency conditions yield a system of equations we can solve to evaluate a deterministic policy exactly. For a faster approximate solution, we can iterate the policy’s Bellman operator, since we know that it has a unique fixed point at the true value function.

1.3.6 Matrix inversion for deterministic policies

Note that when the policy \(\pi\) is deterministic, the actions can be determined from the states, and so we can chop off the action dimension for the rewards and state transitions:

\[ \begin{aligned} r^{\pi} &\in \mathbb{R}^{|\mathcal{S}|} & P^{\pi} &\in [0, 1]^{|\mathcal{S}| \times |\mathcal{S}|} & \mu &\in [0, 1]^{|\mathcal{S}|} \\ \pi &\in \mathcal{A}^{|\mathcal{S}|} & V^\pi &\in \mathbb{R}^{|\mathcal{S}|} & Q^\pi &\in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|}. \end{aligned} \]

For \(P^\pi\), we’ll treat the rows as the states and the columns as the next states. Then \(P^\pi_{s, s'}\) is the probability of transitioning from state \(s\) to state \(s'\) under policy \(\pi\).

Example 1.8 (Tidying MDP) The tabular MDP from before has \(|\mathcal{S}| = 2\) and \(|\mathcal{A}| = 2\). Let’s write down the quantities for the policy \(\pi\) that tidies if and only if the room is messy:

\[ r^{\pi} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad P^{\pi} = \begin{bmatrix} 0.7 & 0.3 \\ 1 & 0 \end{bmatrix}, \quad \mu = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \]

We’ll see how to evaluate this policy in the next section.

The Bellman consistency equation for a deterministic policy can be written in tabular notation as

\[ V^\pi = r^\pi + \gamma P^\pi V^\pi. \]

(Unfortunately, this notation doesn’t simplify the expression for \(Q^\pi\).) This system of equations can be solved with a matrix inversion:

\[ V^\pi = (I - \gamma P^\pi)^{-1} r^\pi. \tag{1.16}\]

Exercise 1.11 Note we’ve assumed that \(I - \gamma P^\pi\) is invertible. Can you see why this is the case?

(Recall that a linear operator, i.e. a square matrix, is invertible if and only if its null space is trivial; that is, it doesn’t map any nonzero vector to zero. In this case, since each row of \(P^\pi\) sums to \(1\), we have \(\|P^\pi v\|_\infty \le \|v\|_\infty\) for any vector \(v\), and so \(\|(I - \gamma P^\pi) v\|_\infty \ge \|v\|_\infty - \gamma \|P^\pi v\|_\infty \ge (1 - \gamma)\|v\|_\infty > 0\) whenever \(v \ne 0\). Hence \(I - \gamma P^\pi\) maps no nonzero vector to zero and is therefore invertible.)

Code
def eval_deterministic_infinite(
    mdp: MDP, policy: Float[Array, "S A"]
) -> Float[Array, " S"]:
    pi = jnp.argmax(policy, axis=1)  # un-one-hot
    P_pi = mdp.P[jnp.arange(mdp.S), pi]
    r_pi = mdp.r[jnp.arange(mdp.S), pi]
    return jnp.linalg.solve(jnp.eye(mdp.S) - mdp.gamma * P_pi, r_pi)

Example 1.9 (Tidying policy evaluation) Let’s use the same policy \(\pi\) that tidies if and only if the room is messy. Setting \(\gamma = 0.95\), we must invert

\[ I - \gamma P^{\pi} = \begin{bmatrix} 1 - 0.95 \times 0.7 & - 0.95 \times 0.3 \\ - 0.95 \times 1 & 1 - 0.95 \times 0 \end{bmatrix} = \begin{bmatrix} 0.335 & -0.285 \\ -0.95 & 1 \end{bmatrix}. \]

The inverse to two decimal points is

\[ (I - \gamma P^{\pi})^{-1} = \begin{bmatrix} 15.56 & 4.44 \\ 14.79 & 5.21 \end{bmatrix}. \]

Thus the value function is

\[ V^{\pi} = (I - \gamma P^{\pi})^{-1} r^{\pi} = \begin{bmatrix} 15.56 & 4.44 \\ 14.79 & 5.21 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 15.56 \\ 14.79 \end{bmatrix}. \]

Let’s sanity-check this result. Since rewards are at most \(1\), the maximum cumulative return of a trajectory is at most \(1/(1-\gamma) = 20\). We see that the value function is indeed slightly lower than this.

Code
eval_deterministic_infinite(tidy_mdp_inf, tidy_policy_messy_only[0])
Array([15.56419, 14.78598], dtype=float32)

1.3.6.1 Iterative policy evaluation

The matrix inversion above takes roughly \(O(|\mathcal{S}|^3)\) time. It also only works for deterministic policies. Can we trade off the requirement of finding the exact value function for a faster approximate algorithm that will also extend to stochastic policies?

Let’s use the Bellman operator to define an iterative algorithm for computing the value function. We’ll start with an initial guess \(v^{(0)}\) with elements in \([0, 1/(1-\gamma)]\) and then iterate the Bellman operator:

\[ v^{(t+1)} = \mathcal{J}^{\pi}(v^{(t)}), \]

i.e. \(v^{(t)} = (\mathcal{J}^{\pi})^{(t)} (v^{(0)})\). Note that each iteration takes \(O(|\mathcal{S}|^2)\) time for the matrix-vector multiplication.

Code
def supremum_norm(v):
    return jnp.max(jnp.abs(v))  # same as jnp.linalg.norm(v, jnp.inf)


def loop_until_convergence(op, v, ε=1e-6):
    """Repeatedly apply op to v until convergence (in supremum norm)."""
    while True:
        v_new = op(v)
        if supremum_norm(v_new - v) < ε:
            return v_new
        v = v_new


def iterative_evaluation(mdp: MDP, pi: Float[Array, "S A"], ε=1e-6) -> Float[Array, " S"]:
    op = partial(bellman_operator, mdp, pi)
    return loop_until_convergence(op, jnp.zeros(mdp.S), ε)

Then, as we showed in Equation 1.15, by the Banach fixed-point theorem:

\[ \|v^{(t)} - V^\pi \|_{\infty} \le \gamma^{t} \| v^{(0)} - V^\pi\|_{\infty}. \]

Code
iterative_evaluation(tidy_mdp_inf, tidy_policy_messy_only[0])
Array([15.564166, 14.785956], dtype=float32)

Remark 1.4 (Convergence of iterative policy evaluation). How many iterations do we need for an \(\epsilon\)-accurate estimate? We can work backwards to solve for \(t\):

\[ \begin{aligned} \gamma^t \|v^{(0)} - V^\pi\|_{\infty} &\le \epsilon \\ t &\ge \frac{\log (\epsilon / \|v^{(0)} - V^\pi\|_{\infty})}{\log \gamma} \\ &= \frac{\log (\|v^{(0)} - V^\pi\|_{\infty} / \epsilon)}{\log (1 / \gamma)}, \end{aligned} \]

and so the number of iterations required for an \(\epsilon\)-accurate estimate is

\[ T = O\left( \frac{1}{1-\gamma} \log\left(\frac{1}{\epsilon (1-\gamma)}\right) \right). \]

Note that we’ve applied the inequalities \(\|v^{(0)} - V^\pi\|_{\infty} \le 1/(1-\gamma)\) and \(\log (1/x) \ge 1-x\).
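As a small worked example, this bound can be turned into a function of \(\gamma\) and \(\epsilon\) (the helper name is ours, and we assume the initial error is at most \(1/(1-\gamma)\)):

Code
import math

def iterations_for_accuracy(gamma: float, eps: float) -> int:
    """Number of Bellman-operator iterations sufficient for an eps-accurate estimate,
    assuming the initial guess satisfies ||v0 - V^pi||_inf <= 1 / (1 - gamma)."""
    return math.ceil(math.log(1 / (eps * (1 - gamma))) / math.log(1 / gamma))

iterations_for_accuracy(gamma=0.95, eps=1e-6)  # 328 iterations for the tidying MDP's discount factor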

1.3.7 Optimal policies in infinite-horizon MDPs

Now let’s move on to solving for an optimal policy in the infinite-horizon case. As in Definition 1.8, an optimal policy \(\pi^\star\) is one that does at least as well as any other policy in all situations. That is, for all policies \(\pi\), states \(s \in \mathcal{S}\), times \(h\in \mathbb{N}\), and initial trajectories \(\tau_h= (s_0, a_0, r_0, \dots, s_h)\) where \(s_h= s\),

\[ \begin{aligned} V^{\pi^\star}(s) &= \mathbb{E}_{\tau \sim \rho^{\pi^{\star}}}[r_h+ \gamma r_{h+1} + \gamma^2 r_{h+2} + \cdots \mid s_h= s] \\ &\ge \mathbb{E}_{\tau \sim \rho^{\pi}}[r_h+ \gamma r_{h+1} + \gamma^2 r_{h+2} + \cdots \mid \tau_h] \end{aligned} \tag{1.17}\]

Once again, all optimal policies share the same optimal value function \(V^\star\), and the greedy policy with respect to this value function is optimal.

Exercise 1.12 (The greedy optimal policy) Verify this by modifying the proof of Theorem 1.2 from the finite-horizon case.

So how can we compute such an optimal policy? We can’t use the backwards DP approach from the finite-horizon case (Definition 1.11), since there’s no “final timestep” to start from. Instead, we’ll exploit the fact that the optimal value function satisfies a Bellman consistency equation (compare Equation 1.13) that doesn’t reference any policy:

\[ V^\star(s) = \max_a \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} V^\star(s') \right]. \tag{1.18}\]

Exercise 1.13 Verify this by substituting the greedy policy into the Bellman consistency equation.

As before, thinking of the r.h.s. of Equation 1.18 as an operator on value functions gives the Bellman optimality operator

\[ [\mathcal{J}^{\star}(v)](s) = \max_a \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} v(s') \right] \tag{1.19}\]

Code
def bellman_optimality_operator(mdp: MDP, v: Float[Array, " S"]) -> Float[Array, " S"]:
    return jnp.max(mdp.r + mdp.gamma * mdp.P @ v, axis=1)


def check_optimal(v: Float[Array, " S"], mdp: MDP):
    """Check whether v is a fixed point of the Bellman optimality operator."""
    return jnp.allclose(v, bellman_optimality_operator(mdp, v))

1.3.7.1 Value iteration

Since the optimal policy is still a policy, our result that the Bellman operator is a contracting map still holds, and so we can repeatedly apply this operator to converge to the optimal value function! This algorithm is known as value iteration.

Code
def value_iteration(mdp: MDP, ε: float = 1e-6) -> Float[Array, " S"]:
    """Iterate the Bellman optimality operator until convergence."""
    op = partial(bellman_optimality_operator, mdp)
    return loop_until_convergence(op, jnp.zeros(mdp.S), ε)
Code
value_iteration(tidy_mdp_inf)
Array([15.564166, 14.785956], dtype=float32)

Note that the runtime analysis for an \(\epsilon\)-accurate estimate is exactly the same as Section 1.3.6.1! This is because the Bellman optimality operator \(\mathcal{J}^{\star}\) is also a \(\gamma\)-contraction in the supremum norm, with \(V^\star\) as its unique fixed point, so the convergence argument for iterative policy evaluation carries over unchanged.

As the final step of the algorithm, to return an actual policy \(\hat \pi\), we can simply act greedily with respect to the final iteration \(v^{(T)}\) of our above algorithm:

\[ \hat \pi(s) = \arg\max_a \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} v^{(T)}(s') \right]. \]
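Here is a minimal sketch of this greedy extraction step (a hypothetical helper, using the same MDP fields as the code above; the v_to_greedy helper that appears in the policy iteration code below plays this role in the text's own code).

Code
def extract_greedy_policy(mdp: MDP, v: Float[Array, " S"]) -> Float[Array, "S A"]:
    """Hypothetical sketch: the one-hot policy that acts greedily w.r.t. v."""
    q = mdp.r + mdp.gamma * mdp.P @ v              # one-step lookahead values, shape (S, A)
    return jnp.eye(mdp.A)[jnp.argmax(q, axis=1)]   # one-hot over actions for each state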

We must be careful, though: the value function of this greedy policy, \(V^{\hat \pi}\), is generally not the same as \(v^{(T)}\), which need not even be the value function of any policy!

The bound on the policy’s quality is actually quite loose: if \(\|v^{(T)} - V^\star\|_{\infty} \le \epsilon\), then the greedy policy \(\hat \pi\) only satisfies \(\|V^{\hat \pi} - V^\star\|_{\infty} \le \frac{2\gamma}{1-\gamma} \epsilon\), which can be very large when \(\gamma\) is close to \(1\).

Theorem 1.5 (Greedy policy value worsening) \[ \|V^{\hat \pi} - V^\star \|_{\infty} \le \frac{2 \gamma}{1-\gamma} \|v - V^\star\|_{\infty} \]

where \(\hat \pi(s) = \arg\max_a q(s, a)\) is the greedy policy with respect to

\[ q(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} v(s'). \]

Proof. We first have

\[ \begin{aligned} V^{\star}(s) - V^{\hat \pi}(s) &= Q^{\star}(s,\pi^\star(s)) - Q^{\hat \pi}(s, \hat \pi(s))\\ &= [Q^{\star}(s,\pi^\star(s)) - Q^{\star}(s, \hat \pi(s))] + [Q^{\star}(s, \hat \pi(s)) - Q^{\hat \pi}(s, \hat \pi(s))]. \end{aligned} \]

Let us bound these two quantities separately.

For the first quantity, note that by the definition of \(\hat \pi\), we have

\[ q(s, \hat \pi(s)) \ge q(s,\pi^\star(s)). \]

Let’s add \(q(s, \hat \pi(s)) - q(s,\pi^\star(s)) \ge 0\) to the first term to get

\[ \begin{aligned} Q^{\star}(s,\pi^\star(s)) - Q^{\star}(s, \hat \pi(s)) &\le [Q^{\star}(s,\pi^\star(s))- q(s,\pi^\star(s))] + [q(s, \hat \pi(s)) - Q^{\star}(s, \hat \pi(s))] \\ &= \gamma \mathbb{E}_{s' \sim P(s, \pi^{\star}(s))} [ V^{\star}(s') - v(s') ] + \gamma \mathbb{E}_{s' \sim P(s, \hat \pi(s))} [ v(s') - V^{\star}(s') ] \\ &\le 2 \gamma \|v - V^{\star}\|_{\infty}. \end{aligned} \]

The second quantity is bounded by

\[ \begin{aligned} Q^{\star}(s, \hat \pi(s)) - Q^{\hat \pi}(s, \hat \pi(s)) &= \gamma \mathbb{E}_{s'\sim P(s, \hat \pi(s))}\left[ V^\star(s') - V^{\hat \pi}(s') \right] \\ & \leq \gamma \|V^{\star} - V^{\hat \pi}\|_\infty \end{aligned} \]

and thus

\[ \begin{aligned} \|V^\star - V^{\hat \pi}\|_\infty &\le 2 \gamma \|v - V^{\star}\|_{\infty} + \gamma \|V^{\star} - V^{\hat \pi}\|_\infty \\ \|V^\star - V^{\hat \pi}\|_\infty &\le \frac{2 \gamma \|v - V^{\star}\|_{\infty}}{1-\gamma}. \end{aligned} \]

So in order to compensate and achieve \(\|V^{\hat \pi} - V^{\star}\|_{\infty} \le \epsilon\), we must have

\[ \|v^{(T)} - V^\star\|_{\infty} \le \frac{1-\gamma}{2 \gamma} \epsilon. \]

This means, using Remark 1.4, we need to run value iteration for

\[ T = O\left( \frac{1}{1-\gamma} \log\left(\frac{\gamma}{\epsilon (1-\gamma)^2}\right) \right) \]

iterations of value iteration to guarantee that the greedy policy \(\hat \pi\) extracted from \(v^{(T)}\) is \(\epsilon\)-optimal.
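As a rough sketch (hypothetical values, not from the text), combining the greedy-worsening compensation with the count from Remark 1.4 looks like this:

Code
# Sketch: iterations of VI sufficient for an ε-optimal *greedy policy*,
# i.e. run VI until ||v^(T) - V*||_∞ <= (1-γ)ε / (2γ).
gamma, eps = 0.95, 1e-3
target = (1 - gamma) / (2 * gamma) * eps   # required value-function accuracy
T = int(jnp.ceil(jnp.log(1 / (target * (1 - gamma))) / jnp.log(1 / gamma)))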

1.3.7.2 Policy iteration

Can we mitigate this “greedy worsening”? What if instead of approximating the optimal value function and then acting greedily by it at the very end, we iteratively improve the policy and value function together? This is the idea behind policy iteration. In each step, we simply set the policy to act greedily with respect to its own value function.

Code
def policy_iteration(mdp: MDP, ε=1e-6) -> Float[Array, "S A"]:
    """Iteratively improve the policy and value function."""
    def op(pi):
        return v_to_greedy(mdp, eval_deterministic_infinite(mdp, pi))
    pi_init = jnp.ones((mdp.S, mdp.A)) / mdp.A  # uniform random policy
    return loop_until_convergence(op, pi_init, ε)
Code
policy_iteration(tidy_mdp_inf)
Array([[1., 0.],
       [0., 1.]], dtype=float32)

Although PI appears more complex than VI, we’ll use Theorem 1.4 to show that it converges. This gives the same runtime bound as value iteration and iterative policy evaluation for an \(\epsilon\)-optimal value function (Remark 1.4), although in practice, PI often converges in fewer iterations. Why so? Intuitively, by iterating over actual policies, PI can “skip over” the many different value functions that all induce the same greedy policy, each of which value iteration might pass through one by one. For a concrete example, suppose we have an MDP with three states \(\mathcal{S} = \{\text A, \text B, \text C\}\). Compare the two action-value functions below:

Code
display(Markdown(tabulate(
    {
        r"$s$": ["A", "B", "C"],
        r"$a = 0$": [0.1, 0.8, 0.4],
        r"$a = 1$": [0.2, 0.9, 0.5],
    },
    "keys"
)))

display(Markdown(tabulate(
    {
        r"$s$": ["A", "B", "C"],
        r"$a = 0$": [0.1, 0.8, 0.4],
        r"$a = 1$": [0.3, 1.0, 0.6],
    },
    "keys"
)))
Table 1.2: Two action-value functions that result in the same greedy policy

(a) One Q-function

| \(s\) | \(a = 0\) | \(a = 1\) |
|-------|-----------|-----------|
| A     | 0.1       | 0.2       |
| B     | 0.8       | 0.9       |
| C     | 0.4       | 0.5       |

(b) Another Q-function

| \(s\) | \(a = 0\) | \(a = 1\) |
|-------|-----------|-----------|
| A     | 0.1       | 0.3       |
| B     | 0.8       | 1.0       |
| C     | 0.4       | 0.6       |

These both map to the same greedy policy, so policy iteration would never iterate through both of these functions, while value iteration might.
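This is easy to check numerically; the following sketch plugs in the two tables’ values and confirms that both induce the same greedy policy.

Code
q1 = jnp.array([[0.1, 0.2], [0.8, 0.9], [0.4, 0.5]])   # Q-function (a)
q2 = jnp.array([[0.1, 0.3], [0.8, 1.0], [0.4, 0.6]])   # Q-function (b)
assert (jnp.argmax(q1, axis=1) == jnp.argmax(q2, axis=1)).all()  # same greedy action in every state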

Theorem 1.6 (Policy Iteration runtime and convergence) We aim to show that the number of iterations required for an \(\epsilon\)-accurate estimate of the optimal value function is

\[ T = O\left( \frac{1}{1-\gamma} \log\left(\frac{1}{\epsilon (1-\gamma)}\right) \right). \]

This bound follows from the contraction property Equation 1.15:

\[ \|V^{\pi^{t+1}} - V^\star \|_{\infty} \le \gamma \|V^{\pi^{t}} - V^\star \|_{\infty}. \]

We’ll prove that the iterates of PI respect the contraction property by showing that the policies improve monotonically:

\[ V^{\pi^{t+1}}(s) \ge V^{\pi^{t}}(s). \]

Then we’ll use this to show \(V^{\pi^{t+1}}(s) \ge [\mathcal{J}^{\star}(V^{\pi^{t}})](s)\). Note that

\[ \begin{aligned} [\mathcal{J}^{\star}(V^{\pi^{t}})](s) &= \max_a \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} V^{\pi^{t}}(s') \right] \\ &= r(s, \pi^{t+1}(s)) + \gamma \mathbb{E}_{s' \sim P(s, \pi^{t+1}(s))} V^{\pi^{t}}(s'), \end{aligned} \]

where the second equality holds because \(\pi^{t+1}\) acts greedily with respect to \(V^{\pi^{t}}\).

Since \([\mathcal{J}^{\star}(V^{\pi^{t}})](s) \ge V^{\pi^{t}}(s)\), we then have

\[ \begin{aligned} V^{\pi^{t+1}}(s) - V^{\pi^{t}}(s) &\ge V^{\pi^{t+1}}(s) - \mathcal{J}^{\star} (V^{\pi^{t}})(s) \\ &= \gamma \mathbb{E}_{s' \sim P(s, \pi^{t+1}(s))} \left[V^{\pi^{t+1}}(s') - V^{\pi^{t}}(s') \right]. \end{aligned} \tag{1.20}\]

But note that the expression being averaged is the same as the expression on the l.h.s. with \(s\) replaced by \(s'\). So we can apply the same inequality recursively to get

\[ \begin{aligned} V^{\pi^{t+1}}(s) - V^{\pi^{t}}(s) &\ge \gamma \mathbb{E}_{s' \sim P(s, \pi^{t+1}(s))} \left[V^{\pi^{t+1}}(s') - V^{\pi^{t}}(s') \right] \\ &\ge \gamma^2 \mathbb{E}_{\substack{s' \sim P(s, \pi^{t+1}(s)) \\ s'' \sim P(s', \pi^{t+1}(s'))}} \left[V^{\pi^{t+1}}(s'') - V^{\pi^{t}}(s'') \right]\\ &\ge \cdots \end{aligned} \]

which implies that \(V^{\pi^{t+1}}(s) \ge V^{\pi^{t}}(s)\) for all \(s\) (since the r.h.s. converges to zero). We can then plug this back into Equation 1.20 to get the desired result:

\[ \begin{aligned} V^{\pi^{t+1}}(s) - \mathcal{J}^{\star} (V^{\pi^{t}})(s) &= \gamma \mathbb{E}_{s' \sim P(s, \pi^{t+1}(s))} \left[V^{\pi^{t+1}}(s') - V^{\pi^{t}}(s') \right] \\ &\ge 0 \\ V^{\pi^{t+1}}(s) &\ge [\mathcal{J}^{\star}(V^{\pi^{t}})](s) \end{aligned} \]

This means we can now apply the Bellman convergence result Equation 1.15: since \([\mathcal{J}^{\star}(V^{\pi^{t}})](s) \le V^{\pi^{t+1}}(s) \le V^\star(s)\) for all \(s\), we get

\[ \|V^{\pi^{t+1}} - V^\star \|_{\infty} \le \|\mathcal{J}^{\star} (V^{\pi^{t}}) - V^{\star}\|_{\infty} \le \gamma \|V^{\pi^{t}} - V^\star \|_{\infty}. \]

1.4 Summary

  • Markov decision processes (MDPs) are a framework for sequential decision making under uncertainty. They consist of a state space \(\mathcal{S}\), an action space \(\mathcal{A}\), an initial state distribution \(\mu \in \Delta(\mathcal{S})\), a transition function \(P(s' \mid s, a)\), and a reward function \(r(s, a)\). They can be finite-horizon (ends after \(H\) timesteps) or infinite-horizon (where rewards scale by \(\gamma \in (0, 1)\) at each timestep).

  • Our goal is to find a policy \(\pi\) that maximizes expected total reward. Policies can be deterministic or stochastic, state-dependent or history-dependent, stationary or time-dependent.

  • A policy induces a distribution over trajectories.

  • We can evaluate a policy by computing its value function \(V^\pi(s)\), which is the expected total reward starting from state \(s\) and following policy \(\pi\). We can also compute the state-action value function \(Q^\pi(s, a)\), which is the expected total reward starting from state \(s\), taking action \(a\), and then following policy \(\pi\). In the finite-horizon setting, these also depend on the timestep \(h\).

  • The Bellman consistency equation is an equation that the value function must satisfy. It can be used to solve for the value functions exactly. Thinking of the r.h.s. of this equation as an operator on value functions gives the Bellman operator.

  • In the finite-horizon setting, we can compute the optimal policy using dynamic programming.

  • In the infinite-horizon setting, we can compute the optimal policy using value iteration or policy iteration.

1.5 References

The proof of Theorem 1.2 can also be found in Agarwal et al. (2022).