7.1 Introduction
Imagine you are tasked with learning how to drive. How do, or did, you go about it? At first, this task might seem insurmountable: there is a vast array of controls, and the cost of making a single mistake could be extremely high, making it hard to explore by trial and error. Luckily, there are already people in the world who know how to drive who can get you started. In almost every challenge we face, we “stand on the shoulders of giants” and learn skills from experts who have already mastered them.
Now in machine learning, we are often trying to teach machines to accomplish tasks that humans are already proficient at. In such cases, the machine learning algorithm is the one learning the new skill, and humans are the “experts” that can demonstrate how to perform the task. Imitation learning is an approach to reinforcement learning where we aim to learn a policy that performs at least as well as the expert. It is often used as a first step for complex tasks where it is impractical to learn from scratch.
We’ll see that the most naive form of imitation learning, called behavioral cloning (or “behavior cloning”), is really an application of supervised learning to interactive tasks. We’ll then explore dataset aggregation (DAgger) as a way to query an expert and learn even more effectively.
7.2 Behavioral cloning
This notion of “learning from human-provided data” may remind you of the basic premise of Chapter 4 (Supervised learning). In supervised learning, there is some mapping from inputs to outputs, such as the task of assigning the correct label to an image, that humans can implicitly compute. To teach a machine to calculate this mapping, we first collect a large training dataset by getting people to label a lot of inputs, and then use some optimization algorithm to produce a predictor that maps from the inputs to the outputs as closely as possible.
How does this relate to interactive tasks? Here, the input is the observation seen by the agent and the output is the action it selects, so the mapping is the agent’s policy. What’s stopping us from applying supervised learning techniques to mimic the expert’s policy? In principle, nothing! This is called behavioral cloning.
Typically, this second task can be framed as empirical risk minimization (which we previously saw in Definition 4.1):

$$
\hat{\pi} = \arg\min_{\pi \in \Pi} \frac{1}{N} \sum_{n=1}^{N} \ell(\pi(s_n), a_n),
$$

where $\mathcal{D} = \{ (s_n, a_n) \}_{n=1}^{N}$ is the dataset of expert state-action pairs, $\Pi$ is some class of possible policies, $\ell$ is the loss function to measure how different the policy’s prediction is from the true observed action, and the supervised learning algorithm itself, also known as the fitting method, tells us how to compute this $\arg\min$.
How should we choose the loss function? In supervised learning, we saw that the mean squared error is a good choice for continuous outputs. However, how should we measure the difference between two actions in a discrete action space? In this setting, the policy acts more like a classifier that picks the best action in a given state. Rather than considering a deterministic policy that just outputs a single action, we’ll consider a stochastic policy π that outputs a distribution over actions. This allows us to assign a likelihood to observing the entire dataset under the policy π, as if the state-action pairs are independent:

$$
\Pr_{\pi}(\mathcal{D}) = \prod_{n=1}^{N} \pi(a_n \mid s_n).
$$
Note that the states and actions are not, however, actually independent! A key property of interactive tasks is that the agent’s output (the action that it takes) may influence its next observation. We want to find a policy under which the training dataset is the most likely. This is called the maximum likelihood estimate of the policy that generated the dataset:

$$
\hat{\pi}_{\text{MLE}} = \arg\max_{\pi \in \Pi} \Pr_{\pi}(\mathcal{D}) = \arg\max_{\pi \in \Pi} \prod_{n=1}^{N} \pi(a_n \mid s_n).
$$
This is also equivalent to doing empirical risk minimization with the negative log likelihood as the loss function:

$$
\hat{\pi}_{\text{MLE}} = \arg\min_{\pi \in \Pi} \sum_{n=1}^{N} -\log \pi(a_n \mid s_n).
$$
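To make this concrete, here is a minimal sketch of behavioral cloning for a discrete action space, written in PyTorch. Minimizing the empirical negative log likelihood of the expert’s actions is exactly cross-entropy classification; the network architecture and hyperparameters below are illustrative assumptions, not prescribed by the text.

```python
import torch
import torch.nn as nn

def behavioral_cloning(states, actions, obs_dim, n_actions, epochs=100, lr=1e-3):
    """Fit a softmax policy to expert data by minimizing the negative log likelihood.

    states:  (N, obs_dim) float tensor of observed states
    actions: (N,) long tensor of the expert's actions in those states
    """
    # A small illustrative policy network mapping states to action logits.
    policy = nn.Sequential(
        nn.Linear(obs_dim, 64), nn.ReLU(),
        nn.Linear(64, n_actions),       # logits of pi(a | s)
    )
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    nll = nn.CrossEntropyLoss()         # averages -log pi(a_n | s_n) over the dataset
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = nll(policy(states), actions)
        loss.backward()
        optimizer.step()
    return policy
```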
7.2.1 Performance of behavioral cloning
Can we quantify how well this algorithm works? For simplicity, let’s consider the case where the action space is finite and both the expert policy and learned policy are deterministic. Suppose the learned policy obtains $\varepsilon$ classification error. That is, for trajectories drawn from the expert policy, the learned policy chooses a different action at most $\varepsilon$ of the time:

$$
\mathbb{E}_{\tau \sim \rho_{\pi_{\text{expert}}}} \left[ \frac{1}{H} \sum_{h=0}^{H-1} \mathbf{1}\left\{ \hat{\pi}(s_h) \neq \pi_{\text{expert}}(s_h) \right\} \right] \le \varepsilon.
$$
Then, their value functions differ by

$$
\left| V^{\pi_{\text{expert}}}(s) - V^{\hat{\pi}}(s) \right| \le H^2 \varepsilon,
$$

where $H$ is the horizon.
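One way to see where the quadratic dependence on the horizon comes from (a heuristic sketch, assuming rewards bounded in $[0, 1]$, not a formal proof): along a trajectory drawn from the expert, the learned policy makes on average at most $H\varepsilon$ mistakes, and each mistake can send the agent into states the expert never visits, where it may lose up to $H$ in return:

$$
\underbrace{H \varepsilon}_{\text{expected number of mistakes}} \times \underbrace{H}_{\text{worst-case return lost per mistake}} = H^2 \varepsilon.
$$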
7.3 Distribution shift
Let us return to the driving analogy. Suppose you have taken some driving lessons and now feel comfortable in your neighborhood. But today you have to travel to an area you haven’t visited before, such as a highway, where it would be dangerous to try to apply the techniques you’ve already learned. This is the issue of distribution shift: a policy learned under a certain distribution of states may not perform well if this distribution changes.
This is already a common issue in supervised learning, where the training dataset for a model might not resemble the environment where it gets deployed. In interactive environments, this issue is further exacerbated by the dependency between the observations and the agent’s behavior; if you take a wrong turn early on, it may be difficult or impossible to recover in that trajectory.
How could you learn a strategy for these new settings? In the driving example, you might decide to install a dashcam to record the car’s surroundings. That way, once you make it back to safety, you can show the recording to an expert, who can provide feedback at each step of the way. Then the next time you go for a drive, you can remember the expert’s advice, and take a safer route. You could then repeat this training as many times as desired, thereby collecting the expert’s feedback over a diverse range of locations. This is the key idea behind dataset aggregation.
7.4 Dataset aggregation (DAgger)
The DAgger algorithm (Ross et al., 2010) assumes that we have query access to the expert policy. That is, for a given state $s$, we can ask for the expert’s action in that state. We also need access to the environment for rolling out policies. This makes DAgger an online algorithm, as opposed to pure behavioral cloning, which is offline since we don’t need to act in the environment at all.
You can think of DAgger as a specific way of collecting the dataset $\mathcal{D}$.
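Concretely, the loop alternates between rolling out the current policy, asking the expert to label the states that policy visits, and refitting on everything collected so far. Below is a rough sketch of this loop; the helper names (`rollout`, `expert`, `fit_policy`) and the choice to initialize from the expert are illustrative assumptions, not details specified here.

```python
# A rough sketch of the DAgger loop. The helpers are assumptions for illustration:
#   rollout(env, policy) -- runs `policy` in the environment, returns the states visited
#   expert(s)            -- the expert's action at state s (query access)
#   fit_policy(dataset)  -- fits a policy to (state, action) pairs, e.g. by
#                           behavioral cloning as in the previous section
def dagger(env, expert, rollout, fit_policy, n_iterations):
    dataset = []                   # aggregated dataset D of (state, expert action) pairs
    policy = expert                # one common choice: start by rolling out the expert
    for _ in range(n_iterations):
        # 1. Roll out the *current* policy and record the states it visits.
        states = rollout(env, policy)
        # 2. Ask the expert how it would act in each of those states.
        dataset.extend((s, expert(s)) for s in states)
        # 3. Refit the policy to the aggregated dataset.
        policy = fit_policy(dataset)
    return policy
```

Because the labels come from states that the learned policy itself visits, the training distribution tracks the states the policy actually encounters, which is exactly the mismatch that plain behavioral cloning suffers from.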
We leave a complete implementation as an exercise. How well does DAgger perform? A full proof can be found in Ross et al. (2010) that, under certain assumptions, the DAgger algorithm can better approximate the expert policy:

$$
\left| V^{\pi_{\text{expert}}}(s) - V^{\hat{\pi}}(s) \right| \le H \varepsilon,
$$

where $\varepsilon$ is the “classification error” guaranteed by the supervised learning algorithm.
7.5 Summary
For tasks where it is too difficult or expensive to learn from scratch, we can instead start off with a collection of expert demonstrations. Then we can use supervised learning techniques to find a policy that imitates the expert demonstrations.
The simplest way to do this is to apply a supervised learning algorithm to an already-collected dataset of expert state-action pairs. This is called behavioral cloning. However, given query access to the expert policy, we can do better by integrating its feedback in an online loop. The DAgger algorithm is one way of doing this, where we use the expert policy to augment trajectories and then learn from this augmented dataset using behavioral cloning.
- Ross, S., Gordon, G. J., & Bagnell, J. (2010, November). A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. International Conference on Artificial Intelligence and Statistics.