8  Imitation Learning

8.1 Introduction

Imagine you are tasked with learning how to drive. How do, or did, you go about it? At first, this task might seem insurmountable: there is a vast array of controls, and the cost of making a single mistake could be extremely high, making it hard to explore by trial and error. Luckily, there are already people in the world who know how to drive and can get you started. In almost every challenge we face, we “stand on the shoulders of giants” and learn skills from experts who have already mastered them.

A robot imitating the pose of a young child. Image from Danilyuk (2021).

In machine learning, we often try to teach machines to accomplish tasks that humans are already proficient at. In such cases, the machine learning algorithm is the one learning the new skill, and humans are the “experts” that can demonstrate how to perform the task. Imitation learning is an approach to sequential decision-making where we aim to learn a policy that performs at least as well as the expert. It is often used as a first step for complex tasks where it is too challenging to learn from scratch or difficult to specify a reward function that captures the desired behaviour.

We’ll see that the most naive form of imitation learning, called behaviour cloning, is really an application of supervised learning to interactive tasks. We’ll then explore dataset aggregation (DAgger) as a way to query an expert and learn even more effectively.

8.2 Behaviour cloning

This notion of “learning from human-provided data” may remind you of the basic premise of Chapter 5. In supervised learning, there is some mapping from inputs to outputs, such as the task of assigning the correct label to an image, that humans can implicitly compute. To teach a machine to calculate this mapping, we first collect a large training dataset by getting people to label a lot of inputs, and then use some optimization algorithm to produce a predictor that maps from the inputs to the outputs as closely as possible.

How does this relate to interactive tasks? Here, the input is the observation seen by the agent and the output is the action it selects, so the mapping is the agent’s policy. What’s stopping us from applying supervised learning techniques to mimic the expert’s policy? In principle, nothing! This is called behaviour cloning.

Definition 8.1 (Behaviour cloning)  

  1. Collect a training dataset of state-action pairs \(\mathcal{D} = \{ (s^n, a^n) \}_{n=1}^{N}\) from trajectories generated by an expert policy \(\pi_\text{expert}\). (For example, if the dataset contains \(M\) trajectories, each with a finite horizon \(H\), then \(N = M \times H\).)
  2. Use a supervised learning algorithm \(\texttt{fit} : \mathcal{D} \mapsto \widetilde{\pi}\) to extract a policy \(\widetilde{\pi}\) that approximates the expert policy.
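
As a toy illustration of the first step, the sketch below flattens \(M = 2\) expert trajectories of horizon \(H = 2\) into a supervised dataset of \(N = M \times H\) state-action pairs. The trajectory format and the numbers are assumptions made purely for illustration.

```python
import numpy as np

# Step 1 in miniature: toy expert data with M = 2 trajectories of horizon H = 2.
# Each trajectory is assumed to be a list of (state, action) pairs, where states
# are feature vectors and actions are integer indices.
expert_trajectories = [
    [(np.array([0.1, 0.3]), 2), (np.array([0.5, 0.2]), 0)],
    [(np.array([0.9, 0.4]), 1), (np.array([0.7, 0.8]), 2)],
]

# Flatten into a supervised dataset D of N = M * H state-action pairs.
states = np.array([s for traj in expert_trajectories for (s, a) in traj])
actions = np.array([a for traj in expert_trajectories for (s, a) in traj])
print(states.shape, actions.shape)  # (4, 2) and (4,)
```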

Typically, this second task can be framed as empirical risk minimization (which we previously saw in Section 5.3.2):

\[ \widetilde{\pi} = \arg\min_{\pi \in \Pi} \sum_{n=1}^{N} \text{loss}(\pi(s^n), a^n) \tag{8.1}\]

where \(\Pi\) is some class of possible policies, \(\text{loss}\) is the loss function to measure how different the policy’s prediction is from the true observed action, and the supervised learning algorithm itself, also known as the fitting method, tells us how to compute this \(\arg\min\).
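
For instance, with continuous actions, a squared-error loss, and the class of linear policies \(\pi(s) = W^\top s\), the \(\arg\min\) in eq. 8.1 has a closed form: it is the least-squares solution. The sketch below is illustrative only; the linear policy class and the synthetic expert data are assumptions, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic expert data: N states (feature vectors) and continuous expert actions.
N, state_dim, action_dim = 1000, 4, 2
states = rng.normal(size=(N, state_dim))
W_expert = rng.normal(size=(state_dim, action_dim))
actions = states @ W_expert + 0.01 * rng.normal(size=(N, action_dim))

# Empirical risk minimization (eq. 8.1) with squared loss over linear policies:
# the argmin is the ordinary least-squares solution.
W_hat, *_ = np.linalg.lstsq(states, actions, rcond=None)

def learned_policy(s):
    """Deterministic learned policy: predicted expert action for state s."""
    return s @ W_hat

print(np.abs(W_hat - W_expert).max())  # close to zero: the fit recovers the expert
```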

How should we choose the loss function? In supervised learning, we saw that the mean squared error is a good choice for continuous outputs. However, how should we measure the difference between two actions in a discrete action space? In this setting, the policy acts more like a classifier that picks the best action in a given state. Rather than considering a deterministic policy that just outputs a single action, we’ll consider a stochastic policy \(\pi\) that outputs a distribution over actions. This allows us to assign a likelihood to observing the entire dataset \(\mathcal{D}\) under the policy \(\pi\), as if the state-action pairs are independent:

\[ \mathbb{P}_\pi(\mathcal{D}) = \prod_{n=1}^{N} \pi(a^n \mid s^n) \tag{8.2}\]

Note that the states and actions are not, however, actually independent! A key property of interactive tasks is that the agent’s output – the action that it takes – may influence its next observation. We want to find a policy under which the training dataset \(\mathcal{D}\) is the most likely. This is called the maximum likelihood estimate of the policy that generated the dataset:

\[ \widetilde{\pi} = \arg\max_{\pi \in \Pi} \mathbb{P}_{\pi}(\mathcal{D}) \tag{8.3}\]

This is also equivalent to doing empirical risk minimization with the negative log likelihood as the loss function:

\[ \begin{aligned} \widetilde{\pi} &= \arg\min_{\pi \in \Pi} - \log \mathbb{P}_\pi(\mathcal{D}) \\ &= \arg\min_{\pi \in \Pi} \sum_{n=1}^N - \log \pi(a^n \mid s^n) \end{aligned} \tag{8.4}\]
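
As a minimal sketch of eq. 8.4 for a finite action space, the code below fits a softmax (multinomial logistic regression) policy by gradient descent on the negative log likelihood. The linear-softmax policy class, the synthetic expert data, and the step size are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic expert data: N states and discrete expert actions in {0, ..., A-1}.
N, state_dim, num_actions = 2000, 4, 3
states = rng.normal(size=(N, state_dim))
W_true = rng.normal(size=(state_dim, num_actions))
expert_actions = np.argmax(states @ W_true, axis=1)

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Maximum likelihood / minimum negative log likelihood (eq. 8.4) by gradient descent.
W = np.zeros((state_dim, num_actions))
for _ in range(500):
    probs = softmax(states @ W)                      # pi(a | s) for every action
    nll = -np.log(probs[np.arange(N), expert_actions]).mean()
    grad = probs.copy()
    grad[np.arange(N), expert_actions] -= 1.0        # d(NLL) / d(logits)
    W -= 0.1 * (states.T @ grad) / N                 # gradient step on W

learned_actions = np.argmax(states @ W, axis=1)
print("training accuracy:", (learned_actions == expert_actions).mean())
```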

Can we quantify how well this algorithm works? For simplicity, let’s consider the case where the action space is finite and both the expert policy and learned policy are deterministic.

Theorem 8.1 (Performance of behaviour cloning) Suppose the learned policy obtains \(\varepsilon\) classification error. That is, for trajectories drawn from the expert policy, the learned policy chooses a different action from the expert on at most an \(\varepsilon\) fraction of timesteps in expectation:

\[ \mathbb{E}_{\tau \sim \rho^{\pi_{\text{expert}}}} \left[ \frac 1 H\sum_{h=0}^{H-1} \mathbf{1}\left\{ \widetilde{\pi}(s_h) \ne \pi_{\text{expert}} (s_h) \right\} \right] \le \varepsilon \tag{8.5}\]

Then, their value functions differ by

\[ | V^{\pi_{\text{expert}}} - V^{\widetilde{\pi}} | \le H^2 \varepsilon \]

where \(H\) is the horizon of the problem.

Proof. Recall the Performance Difference Lemma (Theorem 7.8), which allows us to express the difference between \(\pi_{\text{expert}}\) and \(\widetilde{\pi}\) as

\[ V_0^{\pi_{\text{expert}}}(s) - V_0^{\widetilde{\pi}} (s) = \mathop{\mathbb{E}}_{\tau \sim \rho^{\pi_{\text{expert}}} \mid s_0 = s} \left[ \sum_{h=0}^{H-1} A_h^{\widetilde{\pi}} (s_h, a_h) \right]. \tag{8.6}\]

Since the expert policy is deterministic, we can substitute \(a_h = \pi_{\text{expert}}(s_h)\). Determinism also gives a further simplification: the advantage of the expert’s own chosen action under the expert policy is exactly zero:

\[ A^{\pi_{\text{expert}}}(s, \pi_{\text{expert}}(s)) = Q^{\pi_{\text{expert}}}(s, \pi_{\text{expert}}(s)) - V^{\pi_{\text{expert}}}(s) = 0. \tag{8.7}\]

But the right-hand side of eq. 8.6 uses \(A^{\widetilde{\pi}}\), not \(A^{\pi_{\text{expert}}}\). To bridge this gap, we use the assumption that \(\widetilde{\pi}\) obtains \(\varepsilon\) classification error. Since \(\widetilde{\pi}\) is also deterministic, \(A_h^{\widetilde{\pi}}(s_h, \pi_{\text{expert}}(s_h)) = 0\) whenever \(\pi_{\text{expert}}(s_h) = \widetilde{\pi}(s_h)\). When the two policies differ on \(s_h\), which happens on at most an \(\varepsilon\) fraction of timesteps in expectation (eq. 8.5), the advantage is naively upper bounded by \(H\) (assuming rewards are bounded between \(0\) and \(1\)). Summing over the horizon gives the desired bound.
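
Concretely, starting from eq. 8.6 with \(a_h = \pi_{\text{expert}}(s_h)\) and applying the classification-error assumption in eq. 8.5:

\[ \begin{aligned} \left| V^{\pi_{\text{expert}}} - V^{\widetilde{\pi}} \right| &\le \mathop{\mathbb{E}}_{\tau \sim \rho^{\pi_{\text{expert}}}} \left[ \sum_{h=0}^{H-1} \left| A_h^{\widetilde{\pi}}\big(s_h, \pi_{\text{expert}}(s_h)\big) \right| \right] \\ &\le H \cdot \mathop{\mathbb{E}}_{\tau \sim \rho^{\pi_{\text{expert}}}} \left[ \sum_{h=0}^{H-1} \mathbf{1}\left\{ \widetilde{\pi}(s_h) \ne \pi_{\text{expert}}(s_h) \right\} \right] \\ &= H^2 \cdot \mathop{\mathbb{E}}_{\tau \sim \rho^{\pi_{\text{expert}}}} \left[ \frac 1 H \sum_{h=0}^{H-1} \mathbf{1}\left\{ \widetilde{\pi}(s_h) \ne \pi_{\text{expert}}(s_h) \right\} \right] \le H^2 \varepsilon. \end{aligned} \]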

8.3 Distribution shift

Let us return to the driving analogy. Suppose you have taken some driving lessons and now feel comfortable in your neighbourhood. But today you have to travel to an area you haven’t visited before, such as a highway, where it would be dangerous to try and apply the techniques you’ve already learned. This is the issue of distribution shift: a policy learned under a certain distribution of states may perform poorly if the distribution of states changes.

This is already a common issue in supervised learning, where the training dataset for a model might not resemble the environment where it gets deployed. In interactive environments, this issue is further exacerbated by the dependency between the observations and the agent’s behaviour; if you take a wrong turn early on, it may be difficult or impossible to recover in that trajectory.

How could you learn a strategy for these new settings? In the driving example, you might decide to install a dashcam to record the car’s surroundings. That way, once you make it back to safety, you can show the recording to an expert, who can provide feedback at each step of the way. Then the next time you go for a drive, you can remember the expert’s advice, and take a safer route. You could then repeat this training as many times as desired, thereby collecting the expert’s feedback over a diverse range of locations. This is the key idea behind dataset aggregation.

8.4 Dataset aggregation (DAgger)

The DAgger algorithm assumes that we have query access to the expert policy. That is, for a given state \(s\), we can ask for the expert’s action \(\pi_{\text{expert}}(s)\) in that state. We also need access to the environment for rolling out policies. This makes DAgger an online algorithm, as opposed to pure behaviour cloning, which is offline since we don’t need to act in the environment at all.

You can think of DAgger as a specific way of collecting the dataset \(\mathcal{D}\).

Definition 8.2 (DAgger algorithm) Inputs: \(\pi_{\text{expert}}\), an initial policy \(\pi_{\text{init}}\), the number of iterations \(T\), and the number of trajectories \(N\) to collect per iteration.

  1. Initialize \(\mathcal{D} = \{\}\) (the empty set) and \(\pi = \pi_{\text{init}}\).
  2. For \(t = 1, \dots, T\):
    • Collect \(N\) trajectories \(\tau_1, \dots, \tau_N\) using the current policy \(\pi\).
    • For each trajectory \(\tau_n\):
      • Replace each action \(a_h\) in \(\tau_n\) with the expert action \(\pi_{\text{expert}}(s_h)\).
      • Call the resulting trajectory \(\tau^{\text{expert}}_n\).
    • \(\mathcal{D} \gets \mathcal{D} \cup \{ \tau^{\text{expert}}_1, \dots, \tau^{\text{expert}}_N \}\).
    • Let \(\pi \gets \texttt{fit}(\mathcal{D})\), where \(\texttt{fit}\) is a behaviour cloning algorithm.
  3. Return \(\pi\).
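
To make the loop concrete, here is a minimal sketch in Python. The helpers `rollout`, `expert_action`, and `fit` are hypothetical placeholders that the reader must supply (the definition above does not specify them), and trajectories are assumed to be lists of state-action pairs.

```python
def dagger(expert_action, pi_init, rollout, fit, T, N):
    """Sketch of def. 8.2. `expert_action(s)` queries the expert policy in state s,
    `rollout(pi)` collects one trajectory [(s_0, a_0), ..., (s_{H-1}, a_{H-1})]
    by acting with pi in the environment, and `fit(D)` is a behaviour cloning
    algorithm returning a new policy."""
    dataset = []                 # aggregated dataset D of (state, expert action) pairs
    pi = pi_init
    for _ in range(T):           # T iterations
        for _ in range(N):       # N trajectories per iteration
            trajectory = rollout(pi)
            # Relabel: keep the states visited by pi, replace actions with the expert's.
            dataset += [(s, expert_action(s)) for (s, _a) in trajectory]
        pi = fit(dataset)        # behaviour cloning on the aggregated dataset
    return pi
```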

We leave a complete implementation, including the environment interaction and the \(\texttt{fit}\) subroutine, as an exercise. How well does DAgger perform?

Theorem 8.2 (Performance of DAgger) Let \(\pi_\text{expert}\) be the expert policy and \(\pi_\text{DAgger}\) be the policy resulting from DAgger. In \(T = \widetilde O(H^2)\) iterations, with high probability,

\[ |V^{\pi_{\text{expert}}} - V^{\pi_{\text{DAgger}}}| \le H\varepsilon, \tag{8.8}\]

where \(\varepsilon\) is the “classification error” guaranteed by the supervised learning algorithm.

8.5 Key takeaways

Given a task where learning from scratch is too challenging, if we have access to expert data, we can use supervised learning to find a policy that imitates the expert demonstrations.

The simplest way to do this is to apply a supervised learning algorithm to an already-collected dataset of expert state-action pairs. This is called behaviour cloning. However, given query access to the expert policy, we can do better by integrating its feedback in an online loop. The DAgger algorithm is one way of doing this, where we use the expert policy to augment trajectories and then learn from this augmented dataset using behaviour cloning.

8.6 Bibliographic notes and further reading

Early interest in imitation learning arose in the context of autonomous driving (Pomerleau, 1991). This task is suitable for imitation learning since expert (or near-expert) driving data is readily available. It is also challenging to express a reward function that captures exactly what we mean by “good driving”. Imitation learning methods sidestep this issue by directly training the algorithm to imitate expert demonstrations. The DAgger algorithm (def. 8.2) is due to Ross et al. (2010). The performance guarantee is stated as Ross et al. (2010, thm. 3.4).

Another approach is to infer the reward function from the expert trajectories. This is known as the inverse reinforcement learning (IRL) problem (Abbeel & Ng, 2004; Ng & Russell, 2000; Russell, 1998). The typical RL problem is going from a reward function to an optimal policy, so the inverse problem is going from an optimal policy to the underlying reward function. One can then use typical RL techniques to optimize for the inferred reward function. This tends to generalize better than direct behaviour cloning. The challenge is that this problem is not well-defined, since any policy is optimal for infinitely many possible reward functions. This means that the researcher must come up with a useful “regularization” assumption to select which of the possible reward functions is the most plausible. Three common modelling assumptions are maximum-margin IRL (Ng & Russell, 2000; Ratliff et al., 2006), Bayesian IRL (Ramachandran & Amir, 2007), and maximum-entropy IRL (Ziebart et al., 2008, 2010).

Another framework for learning behaviour from expert demonstrations is generative adversarial imitation learning (Ho & Ermon, 2016; Orsini et al., 2021), inspired by generative adversarial networks (GANs) (Goodfellow et al., 2020). Rather than first learning a reward function from expert data and then optimizing the policy in a separate phase, generative adversarial imitation learning simultaneously learns the reward function and the optimal policy, leading to improved performance.

We can infer a reward function from other expert data besides demonstrations. Expert demonstrations provide a lot of signal but can be prohibitively expensive to collect for some tasks. It is often easier to recognize a good solution than to generate one. This leads to the related approach of reinforcement learning from human feedback (RLHF). Instead of inferring a reward function from expert demonstrations, we instead infer a reward function from a dataset of expert rankings of trajectories. This is the dominant approach used in RL finetuning of large language models (Ouyang et al., 2022; Ziegler et al., 2020). We recommend Lambert (2024) for a comprehensive treatment of RLHF.

Piot et al. (2017) unifies imitation learning and IRL in the set-policy framework. Gleave et al. (2022) is a library of modular implementations of imitation learning algorithms in PyTorch. We recommend Zare et al. (2024) for a more comprehensive survey of imitation learning.