6 Fitted Dynamic Programming Algorithms

In Chapter 2, we discussed how to solve known tabular Markov decision processes where the state and action spaces $\mathcal{S}$ and $\mathcal{A}$ were finite and small and we knew the reward function and state transitions of the environment. For most interesting tasks, however, we don’t know the analytical form of the reward functions or state transitions. In such settings, we must learn through interaction with the environment.

Remark 6.1 (Tackling unknown environments). How can we deal with an unknown environment? One way is to learn a model of the environment, that is, approximations of the reward function and state transitions. We can then plan in the model environment. This is the model-based approach. If the approximation is good enough, the optimal policy in the model environment will be near-optimal in the true environment.

Instead of learning a model and planning within it, we could also try to learn the optimal value function and/or optimal policy directly. Such model-free methods are popular in practice, though they generally require many more samples.

In this short chapter, we will tackle the full RL problem by extending DP algorithms to account for unknown environments. We will also relax the assumption that the state space is small and finite: the methods we will cover support large or continuous state spaces by using parameterized function approximation. This will require learning about the environment by collecting data through interaction and then applying supervised learning (see Chapter 5) to approximate relevant functions.

Each of the algorithms we’ll cover is analogous to one of the algorithms from Chapter 2:

Table 6.1: Fitted algorithms for general environments.

Algorithm	Description	Fitted version
Policy evaluation (Section 2.4.3)	Given a policy $\pi$, compute its value function $V^\pi$ by iterating the Bellman consistency equations.	Fitted policy evaluation (Section 6.1)
Value iteration (Section 2.4.4.1)	Compute the optimal value function $V^\star$ by iterating the Bellman optimality equations.	Fitted value iteration (Section 6.2)
Policy iteration (Section 2.4.4.2)	Repeatedly make the policy $\pi$ greedy with respect to its own Q function.	Fitted policy iteration (Section 6.3)

The term “fitted” means that, since we can no longer compute the Bellman consistency equations or Bellman optimality equations exactly, we plug data from the environment into a supervised learning algorithm to approximately solve these equations. As such, these are also termed “approximate DP” algorithms, or “neuro-DP” algorithms, as in Bertsekas & Tsitsiklis (1996). We will also introduce some useful language for classifying RL algorithms.

Remark 6.2 (Notation). To simplify our notation, we will assume that each state $s$ includes the current timestep $h$. That is, if the original state space was $\mathcal{S}$, the augmented state space is $\mathcal{S} \times [H]$. This allows us to treat all state-dependent quantities as time-dependent without explicitly carrying around a subscript $h$. This also holds true in most real-world episodic tasks, where the state conveys how many timesteps have passed. For example, in tic-tac-toe, one can simply count the number of moves on the board.

Code

from utils import gym, tqdm, rand, Float, Array, NamedTuple, Callable, Optional, np, latex

key = rand.PRNGKey(184)

class Transition(NamedTuple):
    s: int
    a: int
    r: float


Trajectory = list[Transition]


def get_num_actions(trajectories: list[Trajectory]) -> int:
    """Get the number of actions in the dataset. Assumes actions range from 0 to A-1."""
    return max(max(t.a for t in τ) for τ in trajectories) + 1


State = Float[Array, "..."]  # arbitrary shape

# assume finite `A` actions and f outputs an array of Q-values
# i.e. Q(s, a, h) is implemented as f(s, h)[a]
QFunction = Callable[[State, int], Float[Array, " A"]]


def Q_zero(A: int) -> QFunction:
    """A Q-function that always returns zero."""
    return lambda s, a: np.zeros(A)


# a deterministic time-dependent policy
Policy = Callable[[State, int], int]


def q_to_greedy(Q: QFunction) -> Policy:
    """Get the greedy policy for the given state-action value function."""
    return lambda s, h: np.argmax(Q(s, h))

# We will see some examples of fitting methods in the next section
FittingMethod = Callable[[Float[Array, "N D"], Float[Array, " N"]], QFunction]

6.1 Fitted policy evaluation

Recall the task of policy evaluation: given a policy $\pi : \mathcal{S} \to \triangle(\mathcal{A})$, compute its Q function $Q^\pi$, which expresses the expected total reward in a given starting state-action pair.

Remark 6.3 (Value function vs Q function). In Chapter 2, we sought to compute the value function $V^\pi$. Here, we’ll instead approximate the action-value function $Q^\pi$, since we’ll need the ability to take the action with the maximum expected remaining reward. If we don’t know the environment, we can’t compute the action-values using only $V^\pi$.

Remark 6.4 (Fixed-point policy evaluation review). The fixed-point policy evaluation algorithm (Section 2.4.3.2) makes use of the Bellman consistency equations (Theorem 2.3), restated here for the Q-function:

\[ Q^\pi(s, a) = r(s, a) + \mathop{\mathbb{E}}_{\substack{ s' \sim P(\cdot \mid s, a)\\ a' \sim \pi(\cdot \mid s') }} [Q^\pi(s', a')]. \tag{6.1}\]

In fixed-point iteration, we treat eq. 6.1 not as an equation, but as an “operator” that takes in a function $q^i : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ and returns an updated function $q^{i+1}$ by substituting $q^i$ in place of $Q^\pi$.

\[ q^{i+1}(s, a) := r(s, a) + \mathop{\mathbb{E}}_{\substack{ s' \sim P(\cdot \mid s, a)\\ a' \sim \pi(\cdot \mid s') }} [q^i(s', a')]. \tag{6.2}\]

Roughly speaking, $q^{i+1}$ approximates $Q^\pi$ slightly better than $q^i$ since it incorporates a single step of reward from the environment. Since this operator is a contraction mapping (Theorem 2.8), by iterating this process, $q$ will eventually converge to the true policy $Q^\pi$.

However, computing the update step in eq. 6.2 requires computing an expectation over the state transitions $s' \sim P(\cdot \mid s, a)$. As mentioned in the introduction, this is intractable for most real-world tasks. Either the state space is too large, or we simply don’t know what $P$ is. Instead, we will apply supervised learning methods to approximately solve the Bellman consistency equations using data.

Recall that supervised learning is good at learning conditional expectations of the form

\[ f(x) = \mathop{\mathbb{E}}[ y \mid x ], \tag{6.3}\]

where $x$ are the input features and $y$ is the scalar response. Can we rewrite the one-step value target (eq. 6.1) as a conditional expectation? In fact, it already is! Let us use notation that makes this more clear. We explicitly treat $s', a'$ as random variables and move the conditioning on $s, a$ into the brackets:

\[ Q^\pi(s, a) = \mathop{\mathbb{E}}[ r(s, a) + Q^\pi(s', a') \mid s, a ]. \tag{6.4}\]

We can see that the input $x$ corresponds to $s, a$ and the response $y$ corresponds to $r(s, a) + q^i(s', a')$, where $q^i$ is our current estimate of $Q^\pi$. Now we just need to obtain a dataset of input-output pairs and run empirical risk minimization (Section 5.3). We can classify data collection strategies as either offline or online.

Definition 6.1 (Offline and online algorithms) We say that a learning algorithm works offline if the learning is performed as a function of a static dataset, without requiring further interaction with the environment. In contrast, online learning algorithms require interaction with the environment during learning.

We’ll begin with an offline version of fitted policy evaluation and then see an online version.

6.1.1 Offline fitted policy evaluation

In particular, we use $\pi$, the policy we’re trying to evaluate, to obtain a dataset of $N$ trajectories $\tau^1, \dots, \tau^N \sim \rho^{\pi}$. Let us indicate the trajectory index in the superscript, so that

\[ \tau^n = \{ s_0^n, a_0^n, r_0^n, s_1^n, a_1^n, r_1^n, \dots, s_{H-1}^n, a_{H-1}^n, r_{H-1}^n \}. \tag{6.5}\]

This would give us $N (H- 1)$ samples in the dataset. We subtract one from the horizon since each $(x, y)$ sample requires a pair of consecutive timesteps:

\[ x^n_{h} = (s_h^n, a_h^n) \qquad y^n_{h} = r(s_h^n, a_h^n) + q^i(s^n_{h+ 1}, a^n_{h+1}) \tag{6.6}\]

where $q^i$ is our current estimate of $Q^\pi$. This makes our new algorithm a fixed-point algorithm as well, since it uses $q^i$ to compute the next iterate $q^{i+1}$.

Now we can use empirical risk minimization to find a function $q^I$ that approximates the optimal Q-function.

graph LR
    A["$$\text{Use } \pi \text{ to collect data } \tau^1, \dots, \tau^N$$"] --> B
    B["$$\text{Compute labels } y^n_h \text{ using Bellman consistency equations for } q$$"] --> C
    C["$$\text{Update } q \text{ by fitting to data}$$"] --> B

Figure 6.1: Fitted policy evaluation.

Definition 6.2 (Fitted policy evaluation) Let $\pi : \mathcal{S} \to \triangle(\mathcal{A})$ be the policy we wish to evaluate. Our goal is to compute an approximation of its Q-function $Q^\pi$.

Collect a dataset $\tau^1, \dots, \tau^N \sim \rho^\pi$.
Initialize some function $q^0 : \mathcal{S} \times \mathcal{A} \in \mathbb{R}$.
For $i = 0, \dots, I-1$:
1. Generate labels $y^1, \dots, y^N$ from the trajectories and the current estimate $q^i$, where the labels come from eq. 6.1.
2. Set $q^{i+1}$ to the function that minimizes the empirical risk:

\[ q \gets \arg\min_{q} \frac{1}{N H} \sum_{n=1}^N \sum_{h=1}^{H-2} (y^n_h- q(x^n_h))^2. \tag{6.7}\]

Code

def fitted_evaluation(
    trajectories: list[Trajectory],
    fit: FittingMethod,
    π: Policy,
    epochs: int,
    Q_init: QFunction,
) -> QFunction:
    """
    Run fitted policy evaluation using the given dataset.
    Returns an estimate of the Q-function of the given policy.
    """
    Q_hat = Q_init
    X = get_X(trajectories)
    for epoch in tqdm(range(epochs)):
        y = get_y(trajectories, Q_hat, π)
        Q_hat = fit(X, y)
    return Q_hat

latex(fitted_evaluation)

$ \[\begin{array}{l} \mathbf{function} \ \mathrm{fitted\_evaluation}(\mathrm{trajectories}: \mathrm{list}_{\mathrm{Trajectory}}, \mathrm{fit}: \mathrm{FittingMethod}, π: \mathrm{Policy}, \mathrm{epochs}: \mathbb{Z}, Q_{\mathrm{init}}: \mathrm{QFunction}) \\ \hspace{1em} \textrm{" Run fitted policy evaluation using the given dataset. Returns an estimate of the Q-function of the given policy. "} \\ \hspace{1em} \widehat{Q} \gets Q_{\mathrm{init}} \\ \hspace{1em} X \gets \mathrm{get\_X} \mathopen{}\left( \mathrm{trajectories} \mathclose{}\right) \\ \hspace{1em} \mathbf{for} \ \mathrm{epoch} \in \mathrm{tqdm} \mathopen{}\left( \mathrm{range} \mathopen{}\left( \mathrm{epochs} \mathclose{}\right) \mathclose{}\right) \ \mathbf{do} \\ \hspace{2em} y \gets \mathrm{get\_y} \mathopen{}\left( \mathrm{trajectories}, \widehat{Q}, π \mathclose{}\right) \\ \hspace{2em} \widehat{Q} \gets \mathrm{fit} \mathopen{}\left( X, y \mathclose{}\right) \\ \hspace{1em} \mathbf{end \ for} \\ \hspace{1em} \mathbf{return} \ \widehat{Q} \\ \mathbf{end \ function} \end{array}\]

Fitted policy evaluation is offline since the “interaction phase” and the “learning phase” are disjoint. In other words, we can think of fitted policy evaluation as taking a dataset of trajectories collected by some unknown policy $\pi_\text{data}$, and then approximating the Q function of $\pi_\text{data}$.

Fitted policy evaluation is on-policy because the update rule uses trajectories sampled from $\pi$. Where do we need this assumption? Pay close attention to the target, and compare it to the true one-step value target (eq. 6.1):

\[ \begin{aligned} y_h&= r(s_h, a_h) + q(s_{h+1}, a_{h+1}) \\ Q^\pi(s, a) &= r(s, a) + \mathop{\mathbb{E}}_{\substack{ s' \sim P(\cdot \mid s, a) \\ a' \sim \pi(\cdot \mid s') }} Q^\pi(s', a'). \end{aligned} \tag{6.8}\]

Notice that $(s_{h+1}, a_{h+1})$ is a single sample from the joint distribution $s' \sim P(\cdot \mid s, a)$ and $a' \sim \pi(\cdot \mid s')$. If the trajectories were collected from a different policy, then $a_{h+1}$ would not be a sample from $\pi(\cdot \mid s')$, making the target a biased sample for evaluating $\pi$.

Definition 6.3 (On-policy and off-policy algorithms) We say that a learning algorithm is on-policy if the update rule must use data collected by the current policy. On the other hand, we call a learning algorithm off-policy if its update rule doesn’t care about how the data was collected.

Exercise 6.1 (Off-policy fitted policy evaluation) Now suppose you are given a dataset of trajectories sampled from a policy $\pi_\text{data}$, and you want to evaluate a different policy $\pi$. You are not given access to the environment. How could you use the dataset to evaluate $\pi$? Explain what makes this an off-policy algorithm.

Code

def collect_data(
    env: gym.Env, N: int, H: int, key: rand.PRNGKey, π: Optional[Policy] = None
) -> list[Trajectory]:
    """Collect a dataset of trajectories from the given policy (or a random one)."""
    trajectories = []
    seeds = [rand.bits(k).item() for k in rand.split(key, N)]
    for i in tqdm(range(N)):
        τ = []
        s, _ = env.reset(seed=seeds[i])
        for h in range(H):
            # sample from a random policy
            a = π(s, h) if π else env.action_space.sample()
            s_next, r, terminated, truncated, _ = env.step(a)
            τ.append(Transition(s, a, r))
            if terminated or truncated:
                break
            s = s_next
        trajectories.append(τ)
    return trajectories


def get_X(trajectories: list[Trajectory]):
    """
    We pass the state and timestep as input to the Q-function
    and return an array of Q-values.
    """
    rows = [(τ[h].s, τ[h].a, h) for τ in trajectories for h in range(len(τ))]
    return [np.stack(ary) for ary in zip(*rows)]


def get_y(
    trajectories: list[Trajectory],
    f: Optional[QFunction] = None,
    π: Optional[Policy] = None,
):
    """
    Transform the dataset of trajectories into a dataset for supervised learning.
    If `π` is None, instead estimates the optimal Q function.
    Otherwise, estimates the Q function of π.
    """
    f = f or Q_zero(get_num_actions(trajectories))
    y = []
    for τ in trajectories:
        for h in range(len(τ) - 1):
            s, a, r = τ[h]
            Q_values = f(s, h + 1)
            y.append(r + (Q_values[π(s, h + 1)] if π else Q_values.max()))
        y.append(τ[-1].r)
    return np.array(y)

6.1.2 Bootstrapping and target networks

Using the current guess $q$ to compute the labels is known as bootstrapping. (This has nothing to do with the term bootstrapping in statistical inference.)

This term comes from the following metaphor: if you are trying to get on a horse, and there’s nobody to help you up, you need to “pull yourself up by your bootstraps,” or in other words, start from your existing resources to build up to something more effective.

Using a bootstrapped estimate makes the optimization more complicated. Since we are constantly updating our Q function estimate, the labels are also constantly changing, destabilizing learning. Sutton & Barto (2018) calls bootstrapping one prong of the deadly triad of deep reinforcement learning, alongside function approximation and off-policy learning. When an algorithm relies on all three of these properties, it is possible for learning to diverge, so that the fitted Q function is far from the true Q function (van Hasselt et al., 2018).

Many tricks exist to mitigate this issue in practice. One such trick is to use a target network. That is, when computing $y$, instead of using $q$, which is constantly changing, we maintain another target network $q_\text{target}$ that “updates more slowly.” Concretely, $q_\text{target}$ is an exponential moving average of the iterates of $q$. Whenever we update $q$, we update $q_\text{target}$ accordingly:

\[ q_\text{target} \gets (1-\lambda_{\text{target}}) q_\text{target} + \lambda_\text{target} q, \tag{6.9}\]

where $\lambda_\text{target} \in (0, 1)$ is some mixing parameter: the larger it is, the more we update towards the current estimate $q$.

6.1.3 Online fitted policy evaluation

In the offline algorithm above, we collect the whole dataset before starting the learning process. What could go wrong with this? Since the environment is unknown, the dataset only contains information about some portion of the environment. When we update $q$, it may only learn to approximate $Q^\pi$ well for states that are in the initial dataset. It would be much better if $q$ were accurate in states that the policy will actually find itself in. This leads to the following simple shift: collect trajectories inside the iteration loop! This results in an online algorithm for fitted policy evaluation, also known as $\text{TD}(0)$.

Code

def fitted_policy_evaluation_online(
    env,
    pi: Policy,
    epochs: int,
    learning_rate: float,
    q_init,
    *,
    key
):
    q = q_init
    for epoch in range(epochs):
        trajectory = collect_data(env, N=1, H=20, key=key, pi=pi)
        for (s, a, r), (s_next, a_next, _) in zip(trajectory[:-1], trajectory[1:]):
            target = r + q[s_next, a_next]
            q[s, a] = q[s, a] - learning_rate * (q[s, a] - target)
    return q

latex(fitted_policy_evaluation_online)

$ \[\begin{array}{l} \mathbf{function} \ \mathrm{fitted\_policy\_evaluation\_online}(\mathrm{env}, \pi: \mathrm{Policy}, \mathrm{epochs}: \mathbb{Z}, \mathrm{learning\_rate}: \mathbb{R}, q_{\mathrm{init}}) \\ \hspace{1em} q \gets q_{\mathrm{init}} \\ \hspace{1em} \mathbf{for} \ \mathrm{epoch} \in \mathrm{range} \mathopen{}\left( \mathrm{epochs} \mathclose{}\right) \ \mathbf{do} \\ \hspace{2em} \mathrm{trajectory} \gets \mathrm{collect\_data} \mathopen{}\left( \mathrm{env} \mathclose{}\right) \\ \hspace{2em} \mathbf{for} \ \mathopen{}\left( \mathopen{}\left( s, a, r \mathclose{}\right), \mathopen{}\left( s_{\mathrm{next}}, a_{\mathrm{next}}, \mathrm{\_} \mathclose{}\right) \mathclose{}\right) \in \mathrm{zip} \mathopen{}\left( \mathrm{trajectory}_{:-1}, \mathrm{trajectory}_{1:} \mathclose{}\right) \ \mathbf{do} \\ \hspace{3em} \mathrm{target} \gets r + q_{s_{\mathrm{next}}, a_{\mathrm{next}}} \\ \hspace{3em} q_{s, a} \gets q_{s, a} - \mathrm{learning\_rate} \cdot \mathopen{}\left( q_{s, a} - \mathrm{target} \mathclose{}\right) \\ \hspace{2em} \mathbf{end \ for} \\ \hspace{1em} \mathbf{end \ for} \\ \hspace{1em} \mathbf{return} \ q \\ \mathbf{end \ function} \end{array}\]

Figure 6.2: Pseudocode for online policy evaluation (TD(0))

Note that we explicitly write out one step of gradient descent on the squared “temporal difference error”.

Remark 6.5 (On notation). What does $\text{TD}(0)$ stand for? This notation suggests the existence of other $\text{TD}(\lambda)$ algorithms. In fact, $\text{TD}(\lambda)$ is a well-studied algorithm with a parameter $\lambda \in [0, 1]$, and $\text{TD}(0)$ is simply the special case $\lambda = 0$. TD stands for “temporal difference”: we use the value function at a different point in time to estimate the value of the current state at the current time step. The parameter $\lambda$ is one way to average between the near-term estimates and the long-term estimates at later timesteps.

6.2 Fitted value iteration

We’ll now explore an algorithm for computing the optimal value function $V^\star$ when the environment is unknown. This method is analogous to value iteration (Section 2.4.4.1), except instead of solving the Bellman optimality equations (eq. 2.38) exactly, which would require full knowledge of the environment, we will collect a dataset of trajectories and apply supervised learning to solve the Bellman optimality equations approximately. This is exactly analogous to fitted policy evaluation (Section 6.1), except we use the Bellman optimality equations (eq. 2.38) instead of the Bellman consistency equations (eq. 2.9) for a given policy.

Table 6.2: How fitted value iteration relates to existing algorithms.

	Known environment	Unknown environment
Bellman consistency equations (evaluation)	Policy evaluation	Fitted policy evaluation
Bellman optimality equations (optimization)	Value iteration	Fitted value iteration

Remark 6.6 (Value iteration review). Value iteration was an algorithm we used to compute the optimal value function $V^\star : \mathcal{S} \to \mathbb{R}$ in an infinite-horizon MDP (Section 2.4.4.1). Here, we will present the equivalent algorithm for the optimal action-value function $Q^\star : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. The optimal action-value function satisfies the Bellman optimality equations

\[ Q^\star(s, a) = r(s, a) + \mathop{\mathbb{E}}_{s' \sim P(\cdot \mid s, a)}[ \max_{a' \in \mathcal{A}} Q^\star(s', a') ]. \tag{6.10}\]

Now let us treat eq. 6.10 as an “operator” instead of an equation, that is,

\[ q(s, a) \gets r(s, a) + \mathop{\mathbb{E}}_{s' \sim P(\cdot \mid s, a)}[ \max_{a' \in \mathcal{A}} q(s', a') ]. \tag{6.11}\]

If we start with some guess $q : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and repeatedly apply the update step eq. 6.11, the iterates will eventually converge to $Q^\star$ since eq. 6.11 is a contraction mapping.

When we can’t compute expectations over $s' \sim P(\cdot \mid s, a)$, we can instead apply supervised learning to approximately solve the Bellman optimality equations. As before, we can write $Q^\star$ explicitly as a conditional expectation

\[ Q^\star(s, a) = \mathop{\mathbb{E}}\left[ r(s, a) + \max_{a' \in \mathcal{A}}Q^\star(s', a') \mid s, a \right]. \tag{6.12}\]

From this expression, we can read off the inputs $x$ and the targets $y$ for supervised learning. As before, since we don’t know $Q^\star$, we replace it with our current guess $q \approx Q^\star$:

\[ \begin{aligned} x &:= (s, a) \\ y &:= r(s, a) + \max_{a' \in \mathcal{A}} q(s', a'). \end{aligned} \tag{6.13}\]

The only difference from fitted policy evaluation (Section 6.1) is how we compute the targets $y$. Instead of using the next action from the trajectory, we use the action with the maximum value. Notice that these equations don’t reference a policy anywhere! In other words, fitted value iteration is an off-policy algorithm.

graph LR
    A["$$\text{Use } \pi \text{ to collect data } \tau^1, \dots, \tau^N$$"] --> B
    B["$$\text{Compute labels } y^n_h \text{ using Bellman optimality equations for } q$$"] --> C
    C["$$\text{Update } q \text{ by fitting to data}$$"] --> B

Figure 6.3: Fitted policy evaluation.

6.2.1 Offline fitted value iteration

To construct an offline algorithm, we take some dataset of trajectories, and then do all of the learning without ever interacting with the environment.

Definition 6.4 (Fitted value iteration) Suppose we have some dataset of trajectories $\tau^1, \dots, \tau^N$ collected by interacting with the environment.

Initialize some function $q^0(s, a, h) \in \mathbb{R}$.
Compute $x^n_h= (s^n_h, a^n_h)$.
For $t = 0, \dots, T-1$:
1. Use $q^t$ to generate the targets \[ y^n_h= r^n_h+ \max_{a'} q^t(s^n_h, a'). \tag{6.14}\]
2. Set $q^{t+1}$ to the function that minimizes the empirical risk:

\[ f^{t+1} \gets \arg\min_f \frac{1}{N (H-1)} \sum_{n=1}^N \sum_{h=0}^{H-2} (y^n_h- f(x^n_h))^2, \tag{6.15}\]

Code

def fitted_q_iteration(
    trajectories,
    fit,
    epochs,
    Q_init,
) -> QFunction:
    Q_hat = Q_init or Q_zero(get_num_actions(trajectories))
    X = get_X(trajectories)
    for _ in range(epochs):
        y = get_y(trajectories, Q_hat)
        Q_hat = fit(X, y)
    return Q_hat

latex(fitted_q_iteration)

$ \[\begin{array}{l} \mathbf{function} \ \mathrm{fitted\_q\_iteration}(\mathrm{trajectories}, \mathrm{fit}, \mathrm{epochs}, Q_{\mathrm{init}}) \\ \hspace{1em} \widehat{Q} \gets Q_{\mathrm{init}} \lor Q_{\mathrm{zero}} \mathopen{}\left( \mathrm{get\_num\_actions} \mathopen{}\left( \mathrm{trajectories} \mathclose{}\right) \mathclose{}\right) \\ \hspace{1em} X \gets \mathrm{get\_X} \mathopen{}\left( \mathrm{trajectories} \mathclose{}\right) \\ \hspace{1em} \mathbf{for} \ \mathrm{\_} \in \mathrm{range} \mathopen{}\left( \mathrm{epochs} \mathclose{}\right) \ \mathbf{do} \\ \hspace{2em} y \gets \mathrm{get\_y} \mathopen{}\left( \mathrm{trajectories}, \widehat{Q} \mathclose{}\right) \\ \hspace{2em} \widehat{Q} \gets \mathrm{fit} \mathopen{}\left( X, y \mathclose{}\right) \\ \hspace{1em} \mathbf{end \ for} \\ \hspace{1em} \mathbf{return} \ \widehat{Q} \\ \mathbf{end \ function} \end{array}\]

Figure 6.4: Pseudocode for fitted value iteration.

6.2.2 Q-learning

In fitted value iteration (fig. 6.4), we collect the whole dataset beforehand from some unknown policy (or policies). This doesn’t interact with the environment during the learning process. What could go wrong? Since the environment is unknown, the dataset only contains information about some portion of the environment. When we update $q$, it may therefore only learn to approximate $Q^\star$ well for states that are in the initial dataset. It would be much better if $q$ was accurate in states that the policy will actually find itself in. Given access to the environment, we can instead collect trajectories while learning. This turns fitted value interation into an online algorithm known as Q-learning.

graph LR
    A[Collect trajectories from data collection policy] --> B
    B[Compute targets by bootstrapping with q] --> C
    C[Update q by empirical risk minimization] --> B

Figure 6.5: Fitted value iteration

graph LR
    A[Collect trajectories from epsilon-greedy policy for current guess for Q] --> B
    B[Compute targets by bootstrapping with q] --> C
    C[Update q by empirical risk minimization] --> A

Figure 6.6: Q learning

Code

def q_learning(env, Q_init, epochs: int):
    q = Q_init
    for _ in range(epochs):
        trajectories = collect_trajectories(env, EpsilonGreedyPolicy(theta))
        X = get_X(trajectories)
        y = get_y(trajectories, q)
        q = fit(X, y)
    return q

latex(q_learning, id_to_latex={"q_learning": r"\text{Q-learning}"})

$ \[\begin{array}{l} \mathbf{function} \ \text{Q-learning}(\mathrm{env}, Q_{\mathrm{init}}, \mathrm{epochs}: \mathbb{Z}) \\ \hspace{1em} q \gets Q_{\mathrm{init}} \\ \hspace{1em} \mathbf{for} \ \mathrm{\_} \in \mathrm{range} \mathopen{}\left( \mathrm{epochs} \mathclose{}\right) \ \mathbf{do} \\ \hspace{2em} \mathrm{trajectories} \gets \mathrm{collect\_trajectories} \mathopen{}\left( \mathrm{env}, \mathrm{EpsilonGreedyPolicy} \mathopen{}\left( \theta \mathclose{}\right) \mathclose{}\right) \\ \hspace{2em} X \gets \mathrm{get\_X} \mathopen{}\left( \mathrm{trajectories} \mathclose{}\right) \\ \hspace{2em} y \gets \mathrm{get\_y} \mathopen{}\left( \mathrm{trajectories}, q \mathclose{}\right) \\ \hspace{2em} q \gets \mathrm{fit} \mathopen{}\left( X, y \mathclose{}\right) \\ \hspace{1em} \mathbf{end \ for} \\ \hspace{1em} \mathbf{return} \ q \\ \mathbf{end \ function} \end{array}\]

Note that it doesn’t actually matter how the trajectories are collected, making Q-learning an off-policy algorithm. One common choice is to collect trajectories using an epsilon-greedy policy with respect to the current guess $q$.

Another common trick used in practice is to grow the dataset, called a replay buffer, at each iteration. Then, in the improvement step, we randomly sample a batch of $(x, y)$ samples from the replay buffer and use these for empirical risk minimization.

Code

import random

def q_learning_with_buffer(env, Q_init, epochs: int, batch_size):
    q = Q_init
    buffer = []
    for _ in range(epochs):
        trajectories = collect_trajectories(env, EpsilonGreedyPolicy(theta))
        buffer.extend(trajectories)

        batch = random.sample(trajectories, batch_size)
        X = get_X(batch)
        y = get_y(batch, q)
        q = fit(X, y)
    return q

latex(q_learning, id_to_latex={"q_learning_with_buffer": r"\text{Q-learning-buffer}"})

$ \[\begin{array}{l} \mathbf{function} \ q_{\mathrm{learning}}(\mathrm{env}, Q_{\mathrm{init}}, \mathrm{epochs}: \mathbb{Z}) \\ \hspace{1em} q \gets Q_{\mathrm{init}} \\ \hspace{1em} \mathbf{for} \ \mathrm{\_} \in \mathrm{range} \mathopen{}\left( \mathrm{epochs} \mathclose{}\right) \ \mathbf{do} \\ \hspace{2em} \mathrm{trajectories} \gets \mathrm{collect\_trajectories} \mathopen{}\left( \mathrm{env}, \mathrm{EpsilonGreedyPolicy} \mathopen{}\left( \theta \mathclose{}\right) \mathclose{}\right) \\ \hspace{2em} X \gets \mathrm{get\_X} \mathopen{}\left( \mathrm{trajectories} \mathclose{}\right) \\ \hspace{2em} y \gets \mathrm{get\_y} \mathopen{}\left( \mathrm{trajectories}, q \mathclose{}\right) \\ \hspace{2em} q \gets \mathrm{fit} \mathopen{}\left( X, y \mathclose{}\right) \\ \hspace{1em} \mathbf{end \ for} \\ \hspace{1em} \mathbf{return} \ q \\ \mathbf{end \ function} \end{array}\]

6.3 Fitted policy iteration

We can use fitted policy evaluation to extend policy iteration (Section 2.4.4.2) to this new, more general setting. The algorithm remains exactly the same – repeatedly make the policy greedy with respect to its own value function – except now we evaluate the policy $\pi$ (i.e. estimate $Q^\pi$) using fitted policy evaluation.

Code

def fitted_policy_iteration(
    trajectories: list[Trajectory],
    fit: FittingMethod,
    epochs: int,
    evaluation_epochs: int,
    π_init: Optional[Policy] = lambda s, h: 0,  # constant zero policy
):
    """Run fitted policy iteration using the given dataset."""
    π = π_init
    for _ in range(epochs):
        Q_hat = fitted_evaluation(trajectories, fit, π, evaluation_epochs)
        π = q_to_greedy(Q_hat)
    return π

latex(fitted_policy_iteration, id_to_latex={"fitted_policy_iteration": r"\text{fitted-policy-iteration}"})

$ \[\begin{array}{l} \mathbf{function} \ \text{fitted-policy-iteration}(\mathrm{trajectories}: \mathrm{list}_{\mathrm{Trajectory}}, \mathrm{fit}: \mathrm{FittingMethod}, \mathrm{epochs}: \mathbb{Z}, \mathrm{evaluation\_epochs}: \mathbb{Z}, π_{\mathrm{init}}: \mathrm{Optional}_{\mathrm{Policy}}) \\ \hspace{1em} \textrm{"Run fitted policy iteration using the given dataset."} \\ \hspace{1em} π \gets π_{\mathrm{init}} \\ \hspace{1em} \mathbf{for} \ \mathrm{\_} \in \mathrm{range} \mathopen{}\left( \mathrm{epochs} \mathclose{}\right) \ \mathbf{do} \\ \hspace{2em} \widehat{Q} \gets \mathrm{fitted\_evaluation} \mathopen{}\left( \mathrm{trajectories}, \mathrm{fit}, π, \mathrm{evaluation\_epochs} \mathclose{}\right) \\ \hspace{2em} π \gets \mathrm{q\_to\_greedy} \mathopen{}\left( \widehat{Q} \mathclose{}\right) \\ \hspace{1em} \mathbf{end \ for} \\ \hspace{1em} \mathbf{return} \ π \\ \mathbf{end \ function} \end{array}\]

Figure 6.7: Pseudocode for fitted policy iteration.

Exercise 6.2 (Classification of fitted policy iteration) Is fitted policy iteration online or offline? On-policy or off-policy?

6.4 Key takeaways

In a finite MDP where the environment is known, we can apply dynamic programming (Section 2.3.2) to compute the optimal policy. But in most real-world settings, we don’t know the state transitions $P$. This means we can’t exactly compute the Bellman consistency or optimality equations (Theorem 2.3), which require taking an expectation over the next state.

In an unknown environment, we must learn from data collected from the environment. Model-based methods learn a model of the environment, and then plan within that model to choose the best action. Model-free methods instead use the data to approximate quantities of interest (e.g. the optimal Q function or optimal policy) directly.

We began by considering offline algorithms that used a dataset of interactions to learn some quantity of interest. We then saw the online equivalents that observe data by interacting with the environment.

6.5 Bibliographic notes and further reading

The fitted dynamic programming algorithms we discuss in this chapter are a special case of temporal difference (TD) methods, an important class of reinforcement learning algorithms. In particular, this chapter only discusses one-step methods, which can be directly generalized to $n$-step methods and the $\lambda$-return, the formulation most commonly used in practice. This allows us to draw direct parallels to the tabular DP methods of Chapter 2 (see tbl. 6.1). TD learning is perhaps the “central and novel [idea of] reinforcement learning” (Sutton & Barto, 2018, ch. 6).

TD methods are based in the psychology of animal learning

Witten (1977), which proposed the tabular $\text{TD}(0)$ method, appears to be the earliest instance of a temporal difference algorithm.

Richard Sutton’s PhD thesis (Sutton, 1988) proved theoretical guarantees for many of the algorithms presented here. Sutton & Barto (2018) gives a comprehensive discussion of the methods discussed here.

In later work, Hausknecht & Stone (2017) made the Q-function recurrent, so that it depends on the past history of states. The resulting policy is non-Markovian, which additionally accounts for the partial observability of the environment.

Maei et al. (2009) further discusses the deadly triad of off-policy sampling, function approximation, and value bootstrapping.

Deep RL rose to prominence when Mnih et al. (2013) used a deep Q network to achieve state of the art performance on the Atari games. However, the combination of deep learning and the underlying variance in RL problems made it challenging to optimize these neural networks. Many changes have been proposed. Wang et al. (2016) suggests learning the value function and advantage function separately in a “dueling” architecture. Hasselt (2010) suggested the “double learning” algorithm to compensate for Q-learning’s tendency to overestimate the true values. Hasselt et al. (2016) applied double learning to deep neural networks. Hessel et al. (2018) combined

Bertsekas & Tsitsiklis (1996) coined the term “neuro-dynamic programming” to refere to the combination of (artificial) neural networks with dynamic programming techniques.

Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Athena scientific.

Hasselt, H. (2010). Double q-learning. Advances in Neural Information Processing Systems, 23. https://proceedings.neurips.cc/paper_files/paper/2010/hash/091d584fced301b442654dd8c23b3fc9-Abstract.html

Hasselt, H. van, Guez, A., & Silver, D. (2016). Deep reinforcement learning with double q-learning. Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2094–2100.

Hausknecht, M., & Stone, P. (2017, January 11). Deep recurrent q-learning for partially observable mdps. https://doi.org/10.48550/arXiv.1507.06527

Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., & Silver, D. (2018). Rainbow: Combining improvements in deep reinforcement learning. Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, 3215–3222.

Maei, H., Szepesvári, C., Bhatnagar, S., Precup, D., Silver, D., & Sutton, R. S. (2009). Convergent temporal-difference learning with arbitrary smooth function approximation. Advances in Neural Information Processing Systems, 22. https://papers.nips.cc/paper_files/paper/2009/hash/3a15c7d0bbe60300a39f76f8a5ba6896-Abstract.html

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. A. (2013). Playing atari with deep reinforcement learning. CoRR, abs/1312.5602. http://arxiv.org/abs/1312.5602

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9–44. https://doi.org/10.1007/BF00115009

Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (Second edition). The MIT Press. http://incompleteideas.net/book/RLbook2020trimmed.pdf

van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., & Modayil, J. (2018, December 6). Deep reinforcement learning and the deadly triad. https://doi.org/10.48550/arXiv.1812.02648

Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., & Freitas, N. (2016). Dueling network architectures for deep reinforcement learning. Proceedings of The 33rd International Conference on Machine Learning, 1995–2003. https://proceedings.mlr.press/v48/wangf16.html

Witten, I. H. (1977). An adaptive optimal controller for discrete-time markov environments. Information and Control, 34(4), 286–295. https://doi.org/10.1016/S0019-9958(77)90354-0