from utils import Float, Array, Callable, plt, np, latex
from torchvision.datasets import MNIST
from config import MNIST_PATH
Supervised learning (SL) is a core subfield of machine learning alongside RL and unsupervised learning. The typical SL task is to approximate an unknown function given a dataset of input-output pairs from that function.
Example 4.1 (Image classification) One of the most common examples of an SL problem is the task of image classification: Given a dataset of images and their respective labels, construct a function that takes an image and outputs the correct label.
Figure 4.1 illustrates two samples (that is, input-output pairs) from the MNIST database of handwritten digits (Deng 2012). This is a task that most humans can easily accomplish. By providing many samples of digits and their labels to a machine, SL algorithms can learn to solve this task as well.
data = MNIST(MNIST_PATH, train=True, download=True)

plt.axis('off')
plt.imshow(data.data[0], cmap='gray')
plt.title(f"Label: {data.targets[0]}")
plt.gcf().set_size_inches(2, 2)
plt.show()

plt.axis('off')
plt.imshow(data.data[1], cmap='gray')
plt.title(f"Label: {data.targets[1]}")
plt.gcf().set_size_inches(2, 2)
plt.show()
Where might function approximation be useful in RL? There are many functions involved in the definition of an MDP (Definition 1.1), such as the state transitions \(P\) or the reward function \(r\), any of which might be unknown. We can plug in an SL algorithm to model these functions, and then solve the modeled environment using dynamic programming (Section 1.2.7). This approach is called fitted DP and will be covered in Chapter 5. In the rest of this chapter, we’ll formalize the SL task and examine some basic algorithms.
In SL, we are given a dataset of labelled samples \((x_1, y_1), \dots, (x_N, y_N)\) that are independently sampled from some joint distribution \(p \in \triangle(X \times Y)\) known as the data generating process. Note that, by the chain rule of probability, this can be factored as \(p(x, y) = p(y \mid x) p(x)\).
Example 4.2 For example, in Example 4.1, the marginal distribution over \(x\) is assumed to be the distribution of handwritten digits by humans, scanned as \(28 \times 28\) grayscale images, and the conditional distribution \(y \mid x\) is assumed to be the distribution over \(\{ 0, \dots, 9 \}\) that a human would assign to the image \(x\).
Our task is to compute a “good” predictor \(\hat f : X \to Y\) that, as its name suggests, takes an input and tries to predict the corresponding output.
How can we measure how “good” a predictor is? The most common way is to use a loss function \(\ell : Y \times Y \to \mathbb{R}\) that compares the guess \(\hat y := \hat f(x)\) with the true output \(y\). \(\ell(\hat y, y)\) should be low if the predictor accurately guessed the output, and high if the prediction was incorrect.
Example 4.3 (Zero-one loss) In the image classification task Example 4.1, we have \(X = [0, 1]^{28 \times 28}\) (the space of \(28\)-by-\(28\) grayscale images) and \(Y = \{ 0, \dots, 9 \}\) (the image’s label). We could use the zero-one loss function,
\[ \ell(\hat y, y) = \begin{cases} 0 & \hat y = y \\ 1 & \hat y \ne y \end{cases} \]
to measure the accuracy of the predictor. That is, if the predictor assigns the wrong label to an image, it incurs a loss of one for that sample.
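In code, averaging the zero-one loss over a set of predictions simply gives the error rate (one minus the accuracy); the arrays below are illustrative placeholders, not data from the text.

y_true = np.array([3, 7, 2, 7])          # placeholder labels
y_pred = np.array([3, 1, 2, 7])          # placeholder predictions
error_rate = np.mean(y_pred != y_true)   # average zero-one loss: 0.25 here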
Example 4.4 (Square loss) For a continuous output (i.e. \(Y \subseteq \mathbb{R}\)), we typically use the squared difference as the loss function:
\[ \ell(\hat y, y) = (\hat y - y)^2 \]
The squared loss is nice to work with analytically since its derivative with respect to \(\hat y\) is simply \(2 (\hat y - y)\). (Sometimes authors define the square loss as half of the above value to cancel the factor of \(2\) in the derivative; generally speaking, scaling the loss by some constant scalar has no practical effect.)
x = np.linspace(-1, 1, 20)
y = x ** 2

plt.plot(x, y)
plt.xlabel(r"$\hat y - y$")
plt.ylabel(r"$\ell(\hat y, y)$")
plt.show()
Ultimately, we want a predictor that does well on new, unseen samples from the data generating process. We can thus ask, how much loss does the predictor incur in expectation? This is called the prediction’s generalization error or test error:
\[ \text{test error}(\hat f) := \mathbb{E}_{(x, y) \sim p} [ \ell(\hat f(x), y) ] \]
Our goal is then to find the function \(\hat f\) that minimizes the test error. For certain loss functions, this can be analytically computed, such as for squared error.
Theorem 4.1 (The conditional expectation minimizes mean squared error) An important result is that, under the squared loss, the optimal predictor is the conditional expectation:
\[ \arg\min_{f} \mathbb{E}[(y - f(x))^2] = (x \mapsto \mathbb{E}[y \mid x]) \]
Proof. We can decompose the mean squared error as
\[ \begin{aligned} \mathbb{E}[(y - f(x))^2] &= \mathbb{E}[ (y - \mathbb{E}[y \mid x] + \mathbb{E}[y \mid x] - f(x))^2 ] \\ &= \mathbb{E}[ (y - \mathbb{E}[y \mid x])^2 ] + \mathbb{E}[ (\mathbb{E}[y \mid x] - f(x))^2 ] \\ &\quad {} + 2 \mathbb{E}[ (y - \mathbb{E}[y \mid x])(\mathbb{E}[y \mid x] - f(x)) ] \\ \end{aligned} \]
We leave it as an exercise to show that the last term is zero. (Hint: use the law of iterated expectations.) The first term is the noise, or irreducible error, that doesn’t depend on \(f\), and the second term is the error due to the approximation, which is minimized at \(0\) when \(f(x) = \mathbb{E}[y \mid x]\).
In most applications, such as in Example 4.2, the joint distribution of \(x, y\) is intractable to compute, and so we can’t evaluate \(\mathbb{E}[y \mid x]\) analytically. Instead, all we have are \(N\) samples from the joint distribution of \(x\) and \(y\). How might we use these to approximate the generalization error?
To estimate the generalization error, we can simply take the sample average of the loss over the training data. This is called the training loss or empirical risk:
\[ \text{training loss}(\hat f) := \frac 1 N \sum_{n=1}^N \ell(\hat f(x_n), y_n). \]
By the law of large numbers, as \(N\) grows to infinity, the training loss converges to the generalization error.
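As a minimal sketch (the predictor, loss, and toy dataset below are placeholders for illustration, not objects defined in the text), the training loss is just an average over the labelled samples:

def empirical_risk(f_hat, loss, x_data, y_data):
    """Average the loss of the predictor over the N labelled samples."""
    return np.mean([loss(f_hat(x), y) for x, y in zip(x_data, y_data)])

squared_loss = lambda y_hat, y: (y_hat - y) ** 2
x_data = np.random.uniform(-1, +1, 100)
y_data = np.sin(np.pi * x_data)
print(empirical_risk(np.sin, squared_loss, x_data, y_data))  # risk of the (mediocre) predictor sin(x)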
The empirical risk minimization (ERM) approach is to find a predictor that minimizes the empirical risk. An ERM algorithm requires two ingredients to be chosen based on our domain knowledge about the DGP: a loss function \(\ell\) suited to the type of output, and a function class \(\mathcal{F}\) of candidate predictors to search over.
This allows us to compute the empirical risk minimizer:
\[ \begin{aligned} \hat f_\text{ERM} &:= \arg\min_{f \in \mathcal{F}} \text{training loss}(f) \\ &= \arg\min_{f \in \mathcal{F}}\frac 1 N \sum_{n=1}^N \ell(f(x_n), y_n). \end{aligned} \tag{4.1}\]
How should we choose the correct function class? In fact, why do we need to constrain our search at all?
Exercise 4.1 (Overfitting) Suppose we are trying to approximate a relationship between real-valued inputs and outputs using squared loss as our loss function. Consider the predictor (visualized in Figure 4.3)
\[\hat f(x) = \sum_{n=1}^N y_n \mathbf{1}\left\{x = x_n\right\}.\]
What is the empirical risk of this function? How well does it perform on newly generated samples?
n = 1000
x_axis = np.linspace(-1, +1, n)

for _ in range(2):
    # a small training set drawn from y = sin(pi x)
    x_train = np.random.uniform(-1, +1, 10)
    y_train = np.sin(np.pi * x_train)
    # the "memorizing" predictor: matches the label at each training point, zero elsewhere
    y_hat = np.where(np.isclose(x_axis[:, None], x_train, atol=2/n), y_train, 0).sum(axis=-1)

    plt.plot(x_axis, y_hat, label=r'$\hat f(x)$')
    plt.scatter(x_train, y_train, color='red', marker='x', label='training data')
    plt.legend()
    plt.gcf().set_size_inches(3, 2)
    plt.show()
The choice of \(\mathcal{F}\) depends on our domain knowledge about the task. On one hand, \(\mathcal{F}\) should be large enough to contain the true relationship, but on the other, it shouldn’t be too expressive; otherwise, it will overfit to random noise in the labels. The larger and more complex the function class, the more accurately we will be able to approximate any particular training dataset (i.e. smaller bias), but the more drastically the function will vary for different training datasets (i.e. larger variance). The mathematical details of the so-called bias-variance tradeoff can be found, for example, in Hastie, Tibshirani, and Friedman (2013, chap. 2.9).
n_samples = 10
x_axis = np.linspace(-1, +1, 50)

def generate_data(sigma=0.2):
    # noisy samples from y = sin(pi x)
    x_train = np.random.uniform(-1, +1, n_samples)
    y_train = np.sin(np.pi * x_train) + sigma * np.random.normal(size=n_samples)
    return x_train, y_train

def transform(x: Float[Array, " N"], d: int):
    # polynomial features 1, x, ..., x^d
    return np.column_stack([
        x ** d_
        for d_ in range(d + 1)
    ])

for d in [2, 5, 50]:
    # fit degree-d polynomials to two independent training sets (blue and red)
    for _ in range(2):
        x_train, y_train = generate_data()

        x_features = transform(x_train, d)
        w = np.linalg.lstsq(x_features, y_train)[0]
        y_hat = transform(x_axis, d) @ w

        color = 'blue' if _ == 0 else 'red'
        plt.scatter(x_train, y_train, color=color, marker='x')
        plt.plot(x_axis, y_hat, color=color)
    plt.xlim(-1, +1)
    plt.ylim(-1.2, 1.2)
    plt.gcf().set_size_inches(2, 2)
    plt.show()
We must also consider practical constraints on the function class. We need an efficient algorithm to actually compute the function in the class that minimizes the training error. This point should not be underestimated! The success of modern deep learning, for example, is in large part due to hardware developments that make certain parallelizable operations more efficient.
Both of the function classes we will consider, linear maps and neural networks, are finite-dimensional, a.k.a. parameterized. This means each function can be identified using some finite set of parameters, which we denote \(\theta \in \mathbb{R}^D\).
Example 4.5 (Quadratic functions) As a third example of a parameterized function class, consider the class of quadratic functions, i.e. polynomials of degree \(2\). This is a three-dimensional function space, since we can describe any quadratic \(p\) as
\[ p(x) = a x^2 + b x + c, \]
where \(a, b, c\) are the three parameters. We could also use a different parameterization:
\[ p(x) = a' (x - b')^2 + c'. \]
Note that the choice of parameterization can impact the performance of the chosen fitting method. What is the derivative of the first expression with respect to \(a, b, c\)? Compare this to the derivative of the second expression with respect to \(a', b', c'\). This shows that gradient-based fitting methods may change their behavior depending on the parameterization.
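To make the comparison concrete, differentiating each parameterization gives
\[ \nabla_{(a, b, c)}\, p(x) = (x^2,\; x,\; 1), \qquad \nabla_{(a', b', c')}\, p(x) = \big( (x - b')^2,\; -2 a' (x - b'),\; 1 \big). \]
The first gradient does not depend on the current parameter values, while the second depends on \(a'\) and \(b'\), so gradient-based updates behave differently in the two coordinate systems even though both describe the same set of functions.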
Using a parameterized function class allows us to reframe the ERM problem Equation 4.1 in terms of optimizing over the parameters instead of over the functions they represent:
\[ \begin{aligned} \hat \theta_\text{ERM} &:= \arg\min_{\theta \in \mathbb{R}^D} \text{training loss}(f_\theta) \\ &= \arg\min_{\theta \in \mathbb{R}^D} \frac{1}{N} \sum_{n=1}^N \ell(f_\theta(x_n), y_n). \end{aligned} \tag{4.2}\]
In general, optimizing over a finite-dimensional space is much, much easier than optimizing over an infinite-dimensional space.
One widely applicable fitting method for parameterized function classes is gradient descent.
Let \(L(\theta) = \text{training loss}(f_\theta)\) denote the empirical risk in terms of the parameters. The gradient descent algorithm iteratively updates the parameters according to the rule
\[ \theta^{t+1} = \theta^t - \eta \nabla_\theta L(\theta^t) \]
where \(\eta > 0\) is the learning rate and \(\nabla_\theta L(\theta^t)\) indicates the gradient of \(L\) at the point \(\theta^t\). Recall that the gradient of a function at a point is a vector in the direction that increases the function’s value the most within a neighborhood. So by taking small steps in the opposite direction, we obtain a solution that achieves a slightly lower loss than the current one.
Params = Float[Array, " D"]

def gradient_descent(
    loss: Callable[[Params], float],
    θ_init: Params,
    η: float,
    epochs: int,
):
    """
    Run gradient descent to minimize the given loss function
    (expressed in terms of the parameters).
    """
    θ = θ_init
    for _ in range(epochs):
        θ = θ - η * grad(loss)(θ)
    return θ
In Section 6.2.1, we will discuss methods for implementing the grad function above, which takes in a function and returns its gradient, which can then be evaluated at a point.
Why do we need to scale down the step size by \(\eta\)? The key word above is “neighborhood”. The gradient only describes the function within a local region around the point, whose size depends on the function’s smoothness. If we take a step that’s too large, we might end up with a worse solution by overshooting the region where the gradient is accurate. Note that, as a result, we can’t guarantee finding a global optimum of the function; we can only find local optima that are the best parameters within some neighborhood.
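For example, on the one-dimensional quadratic \(L(\theta) = \theta^2\), the update rule becomes \(\theta^{t+1} = \theta^t - \eta \cdot 2 \theta^t = (1 - 2\eta)\, \theta^t\): the iterates shrink towards the minimum when \(0 < \eta < 1\), but when \(\eta > 1\) each step overshoots the minimum by more than the previous distance to it, and the iterates diverge.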
Another issue is that it’s often expensive to compute \(\nabla_\theta L\) when \(N\) is very large. Instead of calculating the gradient for every point in the dataset and averaging these, we can simply draw a batch of samples from the dataset and average the gradient across just these samples. Note that this is an unbiased random estimator of the true gradient. This algorithm is known as stochastic gradient descent. The added noise sometimes helps to jump to better solutions with a lower overall empirical risk.
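Here is a minimal sketch of this idea in the style of the gradient_descent function above; the batch_loss signature and batch_size parameter are illustrative assumptions rather than definitions from the text.

def stochastic_gradient_descent(
    batch_loss: Callable[[Params, Array, Array], float],
    θ_init: Params,
    x_data: Array,
    y_data: Array,
    η: float,
    epochs: int,
    batch_size: int,
):
    """
    Like gradient_descent, but each step estimates the gradient
    from a randomly drawn batch of samples instead of the full dataset.
    """
    θ = θ_init
    N = len(x_data)
    for _ in range(epochs):
        # draw a random batch and differentiate the loss on that batch only
        idx = np.random.choice(N, batch_size, replace=False)
        loss_on_batch = lambda θ_: batch_loss(θ_, x_data[idx], y_data[idx])
        θ = θ - η * grad(loss_on_batch)(θ)  # unbiased estimate of the full gradient
    return θ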
Stepping for a moment back into the world of RL, you might wonder, why can’t we simply apply gradient descent (or rather, gradient ascent) to the total reward? It turns out that the gradient of the total reward with respect to the policy parameters, known as the policy gradient, is challenging but possible to approximate. In Chapter 6, we will do exactly this.
In linear regression, we assume that the function \(f\) is linear in the parameters:
\[ \mathcal{F} = \{ x \mapsto \theta^\top x \mid \theta \in \mathbb{R}^D \} \]
You may already be familiar with linear regression from an introductory statistics course. This function class is extremely simple and only contains linear functions, whose graphs look like “lines of best fit” through the training data. It turns out that, when minimizing the squared error, the empirical risk minimizer has a closed-form solution, known as the ordinary least squares estimator. Let us write \(Y = (y_1, \dots, y_N)^\top \in \mathbb{R}^N\) and \(X = (x_1, \dots, x_N)^\top \in \mathbb{R}^{N \times D}\). Then we can write
\[ \begin{aligned} \hat \theta &= \arg\min_{\theta \in \mathbb{R}^D} \frac{1}{2} \sum_{n=1}^N (y_n - \theta^\top x_n)^2 \\ &= \arg\min_{\theta \in \mathbb{R}^D} \frac 1 2 \|Y - X \theta \|^2 \\ &= (X^\top X)^{-1} X^\top Y, \end{aligned} \tag{4.3}\]
where we have assumed that the columns of \(X\) are linearly independent so that the matrix \(X^\top X\) is invertible.
What happens if the columns aren’t linearly independent? In this case, out of the possible solutions with the minimum empirical risk, we typically choose the one with the smallest norm.
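As a rough illustration (the toy data below is an assumption for the example, not from the text), both the closed-form estimator and the minimum-norm least-squares solution are available through standard NumPy routines:

# toy data: N = 100 samples with D = 3 features and a known linear relationship
N, D = 100, 3
X = np.random.normal(size=(N, D))
theta_true = np.array([1.0, -2.0, 0.5])
Y = X @ theta_true + 0.1 * np.random.normal(size=N)

theta_normal = np.linalg.solve(X.T @ X, X.T @ Y)    # (X^T X)^{-1} X^T Y, needs X^T X invertible
theta_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0]  # minimum-norm least-squares solution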
Exercise 4.2 Gradient descent on the ERM problem (Equation 4.3), initialized at the origin and using a small enough step size, eventually finds the parameters with the smallest norm. In practice, since the squared error gradient is convenient to compute, running gradient descent can be faster than explicitly computing the inverse (or pseudoinverse) of a matrix.
Assume that \(N < D\) and that the data points are linearly independent.
Let \(\hat{\theta}\) be the solution found by gradient descent. Show that \(\hat{\theta}\) is a linear combination of the data points, that is, \(\hat{\theta} = X^\top a\), where \(a \in \mathbb{R}^N\).
Let \(w \in \mathbb{R}^D\) be another empirical risk minimizer, i.e. \(X w = Y\). Show that \(\hat{\theta}^\top (w - \hat{\theta}) = 0\).
Use this to show that \(\|\hat{\theta}\| \le \|w\|\), showing that the gradient descent solution has the smallest norm out of all solutions that fit the data. (No need for algebra; there is a nice geometric solution!)
Though linear regression may appear trivially simple, it is a very powerful tool for more complex models to build upon. For instance, to expand the expressiveness of linear models, we can first transform the input \(x\) using some feature mapping \(\phi\), i.e. \(\widetilde x = \phi(x)\), and then fit a linear model in the transformed space instead. By using domain knowledge to choose a useful feature mapping, we can obtain a powerful SL method for a particular task.
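As a minimal sketch of this recipe (the particular feature map below is an arbitrary choice for illustration, not one prescribed by the text), we can reuse the generate_data and x_axis objects from the polynomial example above:

def phi(x: Float[Array, " N"]) -> Float[Array, " N 4"]:
    # a hand-chosen feature map: fit a linear model in (1, x, x^2, sin(pi x))
    return np.column_stack([np.ones_like(x), x, x ** 2, np.sin(np.pi * x)])

x_train, y_train = generate_data()
w = np.linalg.lstsq(phi(x_train), y_train, rcond=None)[0]  # linear regression in feature space
y_hat = phi(x_axis) @ w                                    # predictions on the plotting grid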
In neural networks, we assume that the function \(f\) is a composition of linear functions (represented by matrices \(W_i\)) and non-linear activation functions (denoted by \(\sigma\)):
\[ \mathcal{F} = \{ x \mapsto \sigma(W_L \sigma(W_{L-1} \dots \sigma(W_1 x + b_1) \dots + b_{L-1}) + b_L) \} \]
where \(W_\ell \in \mathbb{R}^{D_{\ell+1} \times D_\ell}\) and \(b_\ell \in \mathbb{R}^{D_{\ell+1}}\) are the parameters of the \(\ell\)-th layer, and \(\sigma\) is the activation function.
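As a minimal sketch of a member of this class (the layer widths and the choice of ReLU activation are arbitrary illustrations, not prescribed above):

def relu(z):
    return np.maximum(0, z)

def mlp_forward(x, W1, b1, W2, b2):
    # two-layer network: sigma(W2 sigma(W1 x + b1) + b2), matching the class above
    h = relu(W1 @ x + b1)
    return relu(W2 @ h + b2)

# illustrative parameter shapes: 2 inputs, 4 hidden units, 1 output
W1, b1 = np.random.normal(size=(4, 2)), np.zeros(4)
W2, b2 = np.random.normal(size=(1, 4)), np.zeros(1)
y_hat = mlp_forward(np.array([0.5, -0.3]), W1, b1, W2, b2)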
This function class is highly expressive and allows for more parameters. This makes it more susceptible to overfitting on smaller datasets, but also allows it to represent more complex functions. In practice, however, neural networks exhibit interesting phenomena during training, and are often able to generalize well even with many parameters.
Another reason for their popularity is the efficient backpropagation algorithm for computing the gradient of the output with respect to the parameters. Essentially, the hierarchical structure of the neural network, i.e. computing the output of the network as a composition of functions, allows us to use the chain rule to compute the gradient of the output with respect to the parameters of each layer.
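As a sketch of this bookkeeping for a two-layer network \(f(x) = W_2\, \sigma(W_1 x + b_1) + b_2\) with loss \(L = \ell(f(x), y)\) (a simplified form without an output activation, chosen here for brevity), write \(z_1 = W_1 x + b_1\) and \(h = \sigma(z_1)\). The chain rule then gives
\[ \begin{aligned} \frac{\partial L}{\partial W_2} &= \frac{\partial L}{\partial f}\, h^\top, & \frac{\partial L}{\partial b_2} &= \frac{\partial L}{\partial f}, \\ \frac{\partial L}{\partial z_1} &= \left( W_2^\top \frac{\partial L}{\partial f} \right) \odot \sigma'(z_1), & \frac{\partial L}{\partial W_1} &= \frac{\partial L}{\partial z_1}\, x^\top, \qquad \frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial z_1}, \end{aligned} \]
where \(\odot\) denotes the elementwise product. The quantities computed for the later layers are reused when computing the gradients of the earlier layers, which is what makes backpropagation efficient.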
We have now gotten a glimpse into supervised learning, which seeks to learn about some input-output relationship using a dataset of example points. In particular, we typically seek to compute a predictor that takes in an input value and returns a good guess for the corresponding output. We score predictors using a loss function that measures how incorrect their guesses are. We want to find a predictor that achieves low loss on unseen data points. We do this by searching over a class of functions to find one that minimizes the empirical risk over the training dataset. We finally saw two popular examples of parameterized function classes: linear regression and neural networks.
James et al. (2023) provides an accessible introduction to supervised learning. Hastie, Tibshirani, and Friedman (2013) examines the subject in even further depth and covers many relevant supervised learning methods. Nielsen (2015) provides a comprehensive introduction to neural networks and backpropagation.