3.1 Introduction
The multi-armed bandit (MAB) setting is a simple setting for studying the basic challenges of sequential decision-making. In this setting, an agent repeatedly chooses from a fixed set of actions, called arms, each of which has an associated reward distribution. The agent's goal is to maximize the total reward it receives over some time period.
In particular, we'll spend a lot of time discussing the exploration-exploitation tradeoff: should the agent choose new actions to learn more about the environment, or should it choose actions that it already knows to be good?
In this chapter, we will introduce the multi-armed bandit setting and discuss some of the challenges that arise when trying to solve problems in it. We will also introduce key concepts that we will use throughout the book, such as regret.
from jaxtyping import Float, Array
import numpy as np
import latexify
from typing import Callable, Union
import matplotlib.pyplot as plt

import solutions.bandits as solutions

np.random.seed(184)


def random_argmax(ary: Array) -> int:
    """Take an argmax and randomize between ties."""
    max_idx = np.flatnonzero(ary == ary.max())
    return np.random.choice(max_idx).item()


# used as decorator
latex = latexify.algorithmic(
    prefixes={"mab"},
    identifiers={"arm": "a_t", "reward": "r", "means": "mu"},
    use_math_symbols=True,
    escape_underscores=False,
)
Let $K$ denote the number of arms. We'll label them $0, \dots, K - 1$ and use superscripts to indicate the arm index; since we seldom need to raise a number to a power, this won't cause much confusion. In this chapter, we'll consider the Bernoulli bandit setting from the examples above, where arm $k$ either returns reward $1$ with probability $\mu^k$ or $0$ otherwise. The agent gets to pull an arm $T$ times in total. We can formalize the Bernoulli bandit in the following Python code:
class MAB:
    """
    The Bernoulli multi-armed bandit environment.

    :param means: the means (success probabilities) of the reward distributions for each arm
    :param T: the time horizon
    """

    def __init__(self, means: Float[Array, " K"], T: int):
        assert all(0 <= p <= 1 for p in means)
        self.means = means
        self.T = T
        self.K = self.means.size
        self.best_arm = random_argmax(self.means)

    def pull(self, k: int) -> int:
        """Pull the `k`-th arm and sample from its (Bernoulli) reward distribution."""
        reward = np.random.rand() < self.means[k].item()
        return +reward


mab = MAB(means=np.array([0.1, 0.8, 0.4]), T=100)
In pseudocode, the agent’s interaction with the MAB environment can be described by the following process:
@latex
def mab_loop(mab: MAB, agent: "Agent") -> int:
    for t in range(mab.T):
        arm = agent.choose_arm()  # in 0, ..., K-1
        reward = mab.pull(arm)
        agent.update_history(arm, reward)


mab_loop
The `Agent` class stores the pull history and uses it to decide which arm to pull next. Since we are working with Bernoulli bandits, we can summarize the pull history concisely in a $K \times 2$ array.
class Agent:
    def __init__(self, K: int, T: int):
        """The MAB agent that decides how to choose an arm given the past history."""
        self.K = K
        self.T = T
        self.rewards = []  # for plotting
        self.choices = []
        self.history = np.zeros((K, 2), dtype=int)

    def choose_arm(self) -> int:
        """Choose an arm of the MAB. Algorithm-specific."""
        ...

    def count(self) -> int:
        """The number of pulls made. Also the current step index."""
        return len(self.rewards)

    def update_history(self, arm: int, reward: int):
        self.rewards.append(reward)
        self.choices.append(arm)
        self.history[arm, reward] += 1
What's the optimal strategy for the agent, i.e. the one that achieves the highest expected reward? Convince yourself that the agent should try to always pull the arm with the highest expected reward:

$$k^\star := \arg\max_{k \in [K]} \mu^k.$$
The goal, then, can be rephrased as to minimize the regret, defined below:

$$\text{Regret}_T := \sum_{t=0}^{T-1} \left( \mu^{k^\star} - \mu^{a_t} \right),$$

where $a_t$ is the arm chosen at timestep $t$.
def regret_per_step(mab: MAB, agent: Agent):
    """Get the difference from the average reward of the optimal arm. The sum of these is the regret."""
    return [mab.means[mab.best_arm] - mab.means[arm] for arm in agent.choices]
Note that this depends on the true means of the pulled arms, not the actual observed rewards. We typically think of this as a random variable where the randomness comes from the agent's strategy (i.e. the sequence of actions $a_0, \dots, a_{T-1}$).
Throughout the chapter, we will try to upper bound the regret of various algorithms in two different senses:
- Upper bound the expected regret, i.e. show $\mathbb{E}[\text{Regret}_T] \le M_T$.
- Find a high-probability upper bound on the regret, i.e. show $\Pr(\text{Regret}_T \le M_{T, \delta}) \ge 1 - \delta$.

Note that these two different approaches say very different things about the regret. The first approach says that the average regret is at most $M_T$. However, the agent might still achieve higher regret on many runs. The second approach says that, with probability at least $1 - \delta$, the agent will achieve regret at most $M_{T, \delta}$. However, it doesn't say anything about the regret in the remaining $\delta$ fraction of runs, which might be arbitrarily high.
We'd like to achieve sublinear regret in expectation, i.e. $\mathbb{E}[\text{Regret}_T] = o(T)$. That is, as we learn more about the environment, we'd like to be able to exploit that knowledge to take the optimal arm as often as possible.
The rest of the chapter comprises a series of increasingly sophisticated MAB algorithms.
def plot_strategy(mab: MAB, agent: Agent):
    plt.figure(figsize=(10, 6))

    # plot reward and cumulative regret
    plt.plot(np.arange(mab.T), np.cumsum(agent.rewards), label="reward")
    cum_regret = np.cumsum(regret_per_step(mab, agent))
    plt.plot(np.arange(mab.T), cum_regret, label="cumulative regret")

    # draw colored circles for arm choices
    colors = ["red", "green", "blue"]
    color_array = [colors[k] for k in agent.choices]
    plt.scatter(np.arange(mab.T), np.zeros(mab.T), c=color_array, label="arm")

    # labels and title
    plt.xlabel("timestep")
    plt.legend()
    plt.title(f"{agent.__class__.__name__} reward and regret")
    plt.show()
3.2 Pure exploration (random guessing)
A trivial strategy is to always choose arms at random (i.e. “pure exploration”).
class PureExploration(Agent):
    def choose_arm(self):
        """Choose an arm uniformly at random."""
        return solutions.pure_exploration_choose_arm(self)
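The `solutions.bandits` module is not shown in this chapter. As a rough stand-in (an assumption about its behavior, not the book's actual implementation), the helper could simply ignore the history and sample uniformly:

def pure_exploration_choose_arm(agent: Agent) -> int:
    # ignore the pull history entirely and pick an arm uniformly at random
    return np.random.randint(agent.K)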
Note that

$$\mathbb{E}_{a_t \sim \text{Unif}([K])}\left[ \mu^{a_t} \right] = \bar\mu := \frac{1}{K} \sum_{k=0}^{K-1} \mu^k,$$

so the expected regret is simply

$$\mathbb{E}[\text{Regret}_T] = \sum_{t=0}^{T-1} \left( \mu^{k^\star} - \bar\mu \right) = T \left( \mu^{k^\star} - \bar\mu \right).$$

This scales as $\Theta(T)$, i.e. linearly in the number of timesteps $T$. There's no learning here: the agent doesn't use any information about the environment to improve its strategy. You can see that the distribution over its arm choices always appears "(uniformly) random".
agent = PureExploration(mab.K, mab.T)
mab_loop(mab, agent)
plot_strategy(mab, agent)
3.3 Pure greedy
How might we improve on pure exploration? Instead, we could try each arm once, and then commit to the one with the highest observed reward. We’ll call this the pure greedy strategy.
class PureGreedy(Agent):
    def choose_arm(self):
        """Choose the arm with the highest observed reward on its first pull."""
        return solutions.pure_greedy_choose_arm(self)
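Again, the solution helper isn't shown; a minimal sketch of the intended behavior (assuming the same helper is also responsible for the initial round of one pull per arm) might look like:

def pure_greedy_choose_arm(agent: Agent) -> int:
    t = agent.count()
    if t < agent.K:
        # exploration phase: pull each arm exactly once, in order
        return t
    # exploitation phase: commit to the arm whose single exploration pull paid off most
    return random_argmax(np.array(agent.rewards[: agent.K]))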
Note we've used superscripts $r^k$ during the exploration phase to indicate that we observe exactly one reward for each arm. Then we use subscripts $r_t$ during the exploitation phase to indicate that we observe a sequence of rewards from the chosen greedy arm $\hat k$.
How does the expected regret of this strategy compare to that of pure exploration? We'll do a more general analysis in the following section. For now, for intuition, suppose there are just $K = 2$ arms, with Bernoulli reward distributions with means $\mu^0 > \mu^1$.
Let $r^0$ be the random reward from the first arm and $r^1$ be the random reward from the second. If $r^0 > r^1$, then we achieve zero regret during the exploitation phase. Otherwise, we incur regret $\mu^0 - \mu^1$ at every exploitation step. Thus, ignoring the constant regret of the two exploration pulls, the expected regret is at least

$$\Pr(r^0 < r^1) \cdot T \left( \mu^0 - \mu^1 \right) = (1 - \mu^0) \mu^1 \cdot T \left( \mu^0 - \mu^1 \right),$$

which is still $\Theta(T)$, the same as pure exploration!
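As a concrete (hypothetical) two-arm instance, take $\mu^0 = 0.8$ and $\mu^1 = 0.4$ from the running example's means and a horizon of 100 steps:

mu0, mu1, horizon = 0.8, 0.4, 100
prob_wrong = (1 - mu0) * mu1  # P(r^0 = 0, r^1 = 1): greedy commits to the worse arm
print(prob_wrong * horizon * (mu0 - mu1))  # ≈ 3.2, a lower bound that grows linearly in the horizon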
agent = PureGreedy(mab.K, mab.T)
mab_loop(mab, agent)
plot_strategy(mab, agent)
The cumulative regret is a straight line because the regret only depends on the arms chosen and not the actual reward observed. In fact, if the greedy algorithm happens to get lucky on the first set of pulls, it may act entirely optimally for that episode! But its average regret is what measures its effectiveness.
3.4 Explore-then-commit
We can improve the pure greedy algorithm as follows: let's reduce the variance of the reward estimates by pulling each arm $N_{\text{explore}}$ times before committing. This is called the explore-then-commit strategy. Note that the "pure greedy" strategy above is just the special case where $N_{\text{explore}} = 1$.
class ExploreThenCommit(Agent):
    def __init__(self, K: int, T: int, N_explore: int):
        super().__init__(K, T)
        self.N_explore = N_explore

    def choose_arm(self):
        return solutions.etc_choose_arm(self)
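A sketch of what `etc_choose_arm` might do (an assumption; the round-robin exploration order here is an arbitrary choice):

def etc_choose_arm(agent: ExploreThenCommit) -> int:
    t = agent.count()
    if t < agent.K * agent.N_explore:
        # exploration phase: cycle through the arms, N_explore pulls each
        return t % agent.K
    # exploitation phase: commit to the arm with the highest sample mean
    sample_means = agent.history[:, 1] / agent.history.sum(axis=1)
    return random_argmax(sample_means)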
agent = ExploreThenCommit(mab.K, mab.T, mab.T // 15)
mab_loop(mab, agent)
plot_strategy(mab, agent)
Notice that now, the graphs are much more consistent, and the algorithm finds the true optimal arm and sticks with it much more frequently. We would expect ETC to then have a better (i.e. lower) average regret. Can we prove this?
3.4.1 ETC regret analysis
Let’s analyze the expected regret of the explore-then-commit strategy by splitting it up into the exploration and exploitation phases.
3.4.1.1 Exploration phase.
This phase takes $N_{\text{explore}} K$ timesteps. Since at each step we incur at most $1$ regret, the total regret from this phase is at most $N_{\text{explore}} K$.
3.4.1.2 Exploitation phase.
This will take a bit more effort. We'll prove that for any total time $T$, we can choose $N_{\text{explore}}$ such that with arbitrarily high probability, the regret is sublinear.
Let $\hat k$ denote the arm chosen after the exploration phase. We know the regret from the exploitation phase is

$$T_{\text{exploit}} \left( \mu^{k^\star} - \mu^{\hat k} \right) \qquad \text{where} \qquad T_{\text{exploit}} := T - N_{\text{explore}} K.$$

So we'd like to bound $\mu^{k^\star} - \mu^{\hat k} = o(1)$ (as a function of $T$) in order to achieve sublinear regret. How can we do this?
Let's define $\Delta^k := |\hat\mu^k - \mu^k|$ to denote how far the mean estimate $\hat\mu^k$ for arm $k$ is from the true mean. How can we bound this quantity? We'll use the following useful inequality for i.i.d. bounded random variables (Hoeffding's inequality): if $X_0, \dots, X_{n-1}$ are i.i.d. random variables with $X_i \in [0, 1]$ almost surely, then for any $\delta > 0$,

$$\Pr\left( \left| \frac{1}{n} \sum_{i=0}^{n-1} \left( X_i - \mathbb{E}[X_i] \right) \right| > \sqrt{\frac{\ln(2/\delta)}{2n}} \right) \le \delta.$$
The proof of this inequality is beyond the scope of this book. See Vershynin (2018) Chapter 2.2.
We can apply this directly to the rewards for a given arm $k$, since the rewards from that arm are i.i.d.:

$$\Pr\left( |\Delta^k| > \sqrt{\frac{\ln(2/\delta)}{2 N_{\text{explore}}}} \right) \le \delta. \qquad (3.8)$$
But note that we can't apply this to arm $\hat k$ directly, since $\hat k$ is itself a random variable. Instead, we need to "uniform-ize" this bound across all the arms, i.e. bound the error across all the arms simultaneously, so that the resulting bound will apply no matter what $\hat k$ "crystallizes" to.
The union bound provides a simple way to do this: for any events $A_0, \dots, A_{n-1}$,

$$\Pr\left( \bigcup_{i=0}^{n-1} A_i \right) \le \sum_{i=0}^{n-1} \Pr(A_i).$$

In particular, if $\Pr(A_i) \ge 1 - \delta$ for each $i$, then $\Pr\left( \bigcap_{i=0}^{n-1} A_i \right) \ge 1 - n\delta$.
Exercise: Prove the second statement above.
Applying the union bound across the arms for the l.h.s. event of (3.8) (with failure probability $\delta/K$ for each arm), we have

$$\Pr\left( \forall k \in [K]: |\Delta^k| \le \sqrt{\frac{\ln(2K/\delta)}{2 N_{\text{explore}}}} \right) \ge 1 - \delta.$$
Then to apply this bound to $\hat k$ in particular, we can use the useful trick of "adding zero":

$$
\begin{aligned}
\mu^{k^\star} - \mu^{\hat k}
&= \mu^{k^\star} - \hat\mu^{k^\star} + \hat\mu^{k^\star} - \hat\mu^{\hat k} + \hat\mu^{\hat k} - \mu^{\hat k} \\
&\le |\Delta^{k^\star}| + 0 + |\Delta^{\hat k}| \\
&\le 2 \sqrt{\frac{\ln(2K/\delta)}{2 N_{\text{explore}}}},
\end{aligned}
$$

where the middle term is nonpositive since $\hat k$ maximizes the sample mean $\hat\mu^k$. Putting this all together, we've shown that, with probability $1 - \delta$,

$$\text{Regret}_T \le N_{\text{explore}} K + T_{\text{exploit}} \cdot 2 \sqrt{\frac{\ln(2K/\delta)}{2 N_{\text{explore}}}}.$$
Note that it suffices for $N_{\text{explore}}$ to be on the order of $\sqrt{T}$ to achieve sublinear regret. In particular, we can find the optimal $N_{\text{explore}}$ by setting the derivative of the r.h.s. with respect to $N_{\text{explore}}$ to zero (treating $T_{\text{exploit}} \approx T$ as constant):

$$
\begin{aligned}
0 &= K - \frac{T_{\text{exploit}}}{N_{\text{explore}}^{3/2}} \sqrt{\frac{\ln(2K/\delta)}{2}} \\
N_{\text{explore}} &= \left( \frac{T_{\text{exploit}}}{K} \sqrt{\frac{\ln(2K/\delta)}{2}} \right)^{2/3}.
\end{aligned}
$$
Plugging this into the expression for the regret, we have (still with probability $1 - \delta$)

$$\text{Regret}_T \le 3 T^{2/3} \sqrt[3]{K \ln(2K/\delta) / 2} = \tilde O\left( T^{2/3} K^{1/3} \right).$$
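As a rough sanity check of this choice on the running example (with an arbitrarily chosen failure probability $\delta = 0.05$ and the approximation $T_{\text{exploit}} \approx T$):

T_total, K_arms, delta = mab.T, mab.K, 0.05
N_opt = (T_total / K_arms * np.sqrt(np.log(2 * K_arms / delta) / 2)) ** (2 / 3)
print(N_opt)  # ≈ 14 pulls per arm: at this short horizon the constants dominate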
The ETC algorithm is rather “abrupt” in that it switches from exploration to exploitation after a fixed number of timesteps. In practice, it’s often better to use a more gradual transition, which brings us to the epsilon-greedy algorithm.
3.5 Epsilon-greedy
Instead of doing all of the exploration and then all of the exploitation separately – which additionally requires knowing the time horizon beforehand – we can interleave exploration and exploitation by, at each timestep, choosing a random action with some probability. We call this the epsilon-greedy algorithm.
class EpsilonGreedy(Agent):
    def __init__(
        self,
        K: int,
        T: int,
        ε_array: Float[Array, " T"],
    ):
        super().__init__(K, T)
        self.ε_array = ε_array

    def choose_arm(self):
        return solutions.epsilon_greedy_choose_arm(self)
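A sketch of the missing helper (an assumption about its behavior, not the book's actual implementation):

def epsilon_greedy_choose_arm(agent: EpsilonGreedy) -> int:
    t = agent.count()
    if np.random.rand() < agent.ε_array[t]:
        # explore: pick an arm uniformly at random
        return np.random.randint(agent.K)
    # exploit: pick the arm with the highest sample mean so far
    pulls = np.maximum(agent.history.sum(axis=1), 1)  # avoid dividing by zero for unpulled arms
    return random_argmax(agent.history[:, 1] / pulls)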
agent = EpsilonGreedy(mab.K, mab.T, np.full(mab.T, 0.1))
mab_loop(mab, agent)
plot_strategy(mab, agent)
Note that we let ε vary over time. In particular, we might want to gradually decrease ε as we learn more about the reward distributions and no longer need to spend time exploring.
It turns out that setting $\epsilon_t = \sqrt[3]{K \ln(t) / t}$ also achieves a regret of $\tilde O(t^{2/3} K^{1/3})$ (ignoring the logarithmic factors). (We will not prove this here; a proof can be found in Agarwal et al. (2022).)
In ETC, we had to set $N_{\text{explore}}$ based on the total number of timesteps $T$. But the epsilon-greedy algorithm actually handles the exploration automatically: the regret rate holds for any $t$, and doesn't depend on the final horizon $T$.
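For example, a decaying schedule along these lines could be constructed as follows (the clipping to $[0, 1]$ and the exact constants are choices made here, not prescribed by the analysis):

t_range = np.arange(1, mab.T + 1)
eps_decaying = np.clip((mab.K * np.log(t_range) / t_range) ** (1 / 3), 0.0, 1.0)
agent = EpsilonGreedy(mab.K, mab.T, eps_decaying)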
But the way these algorithms explore is rather naive: we’ve been exploring uniformly across all the arms. But what if we could be smarter about it, and explore more for arms that we’re less certain about?
3.6 Upper Confidence Bound (UCB)
To quantify how certain we are about the mean of each arm, we’ll compute confidence intervals for our estimators, and then choose the arm with the highest upper confidence bound. This operates on the principle of the benefit of the doubt (i.e. optimism in the face of uncertainty): we’ll choose the arm that we’re most optimistic about.
In particular, for each arm $k$ at time $t$, we'd like to compute some upper confidence bound $M^k_t$ such that $\mu^k \le M^k_t$ with high probability, and then choose $a_t := \arg\max_{k \in [K]} M^k_t$. But how should we compute $M^k_t$?
In Section 3.4.1, we were able to compute this bound using Hoeffding's inequality, which assumes that the number of samples is fixed. This was the case in ETC (where we pull each arm $N_{\text{explore}}$ times), but in UCB, the number of times we pull each arm depends on the agent's actions, which in turn depend on the random rewards and are therefore stochastic. So we can't use Hoeffding's inequality directly.
Instead, we’ll apply the same trick we used in the ETC analysis: we’ll use the union bound to compute a looser bound that holds uniformly across all timesteps and arms. Let’s introduce some notation to discuss this.
Let $N^k_t$ denote the (random) number of times arm $k$ has been pulled within the first $t$ timesteps, and let $\hat\mu^k_t$ denote the sample average of those pulls. That is,

$$
\begin{aligned}
N^k_t &:= \sum_{\tau=0}^{t-1} \mathbf{1}\{ a_\tau = k \}, \\
\hat\mu^k_t &:= \frac{1}{N^k_t} \sum_{\tau=0}^{t-1} \mathbf{1}\{ a_\tau = k \} \, r_\tau.
\end{aligned}
$$

To achieve the "fixed sample size" assumption, we'll need to shift our index from time to number of samples from each arm. In particular, we'll define $\tilde r^k_n$ to be the reward from the $n$th pull of arm $k$, and $\tilde\mu^k_n := \frac{1}{n} \sum_{i=1}^{n} \tilde r^k_i$ to be the sample average of the first $n$ pulls of arm $k$. Then, for a fixed $n$, this satisfies the "fixed sample size" assumption, and we can apply Hoeffding's inequality to get a bound on $|\tilde\mu^k_n - \mu^k|$.
So how can we extend our bound on $\tilde\mu^k_n$ to $\hat\mu^k_t$? Well, we know $N^k_t \le t$ (where equality would hold if and only if we had pulled arm $k$ every time). So we can apply the same trick as last time, where we uniform-ize across all possible values of $N^k_t$:

$$\Pr\left( \forall n \le t: |\tilde\mu^k_n - \mu^k| \le \sqrt{\frac{\ln(2t/\delta)}{2n}} \right) \ge 1 - \delta.$$

In particular, since $N^k_t \le t$, and $\tilde\mu^k_{N^k_t} = \hat\mu^k_t$ by definition, we have

$$\Pr\left( |\hat\mu^k_t - \mu^k| \le \sqrt{\frac{\ln(2t/\delta)}{2 N^k_t}} \right) \ge 1 - \delta.$$
This bound would then suffice for applying the UCB algorithm! That is, the upper confidence bound for arm $k$ would be

$$M^k_t := \hat\mu^k_t + \sqrt{\frac{\ln(2t/\delta)}{2 N^k_t}},$$

where we can choose $\delta$ depending on how tight we want the interval to be.
- A smaller $\delta$ would give us a larger and higher-confidence interval, emphasizing the exploration term.
- A larger $\delta$ would give a tighter and lower-confidence interval, prioritizing the current sample averages.
We can now use this to define the UCB algorithm.
class UCB(Agent):
    def __init__(self, K: int, T: int, delta: float):
        super().__init__(K, T)
        self.delta = delta

    def choose_arm(self):
        return solutions.ucb_choose_arm(self)
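A sketch of what `ucb_choose_arm` might compute, based on the bound above (an assumption; here each arm is pulled once first, since the confidence width is infinite for an unpulled arm):

def ucb_choose_arm(agent: UCB) -> int:
    t = agent.count()
    counts = agent.history.sum(axis=1)
    if (counts == 0).any():
        # pull each arm at least once so that the confidence width below is finite
        return random_argmax(counts == 0)
    sample_means = agent.history[:, 1] / counts
    bonus = np.sqrt(np.log(2 * t / agent.delta) / (2 * counts))
    return random_argmax(sample_means + bonus)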
Intuitively, UCB prioritizes arms where:
- $\hat\mu^k_t$ is large, i.e. the arm has a high sample average, and we'd choose it for exploitation, and
- $\sqrt{\frac{\ln(2t/\delta)}{2 N^k_t}}$ is large, i.e. we're still uncertain about the arm, and we'd choose it for exploration.
As desired, this explores in a smarter, adaptive way compared to the previous algorithms. Does it achieve lower regret?
agent = UCB(mab.K, mab.T, 0.9)
mab_loop(mab, agent)
plot_strategy(mab, agent)
3.6.1 UCB regret analysis
First we’ll bound the regret incurred at each timestep. Then we’ll bound the total regret across timesteps.
For the sake of analysis, we'll use a slightly looser bound that applies across the whole time horizon and across all arms:

$$\Pr\left( \forall k \in [K],\ t < T: |\hat\mu^k_t - \mu^k| \le B^k_t \right) \ge 1 - \delta'' \qquad \text{where} \qquad B^k_t := \sqrt{\frac{\ln(2TK/\delta'')}{2 N^k_t}}.$$

We'll omit the derivation since it's very similar to the above (walk through it yourself for practice).
Intuitively, $B^k_t$ denotes the width of the confidence interval for arm $k$ at time $t$. Then, assuming the above uniform bound holds (which occurs with probability $1 - \delta''$), we can bound the regret at each timestep as follows:

$$
\begin{aligned}
\mu^{k^\star} - \mu^{a_t}
&\le \hat\mu^{k^\star}_t + B^{k^\star}_t - \mu^{a_t} && \text{applying the bound to } k^\star \\
&\le \hat\mu^{a_t}_t + B^{a_t}_t - \mu^{a_t} && \text{since } a_t \text{ maximizes the upper confidence bound} \\
&\le 2 B^{a_t}_t && \text{applying the bound to } a_t.
\end{aligned}
$$

Summing this across timesteps gives

$$
\begin{aligned}
\text{Regret}_T &\le \sum_{t=0}^{T-1} 2 B^{a_t}_t = \sqrt{2 \ln(2TK/\delta'')} \sum_{t=0}^{T-1} \frac{1}{\sqrt{N^{a_t}_t}}, \\
\sum_{t=0}^{T-1} \frac{1}{\sqrt{N^{a_t}_t}} &= \sum_{k=0}^{K-1} \sum_{n=1}^{N^k_T} \frac{1}{\sqrt{n}} \le K \sum_{n=1}^{T} \frac{1}{\sqrt{n}} \le 2 K \sqrt{T},
\end{aligned}
$$

where we've re-indexed the second sum by the number of pulls of each arm. Putting everything together gives

$$\text{Regret}_T \le 2 K \sqrt{2 T \ln(2TK/\delta'')} \qquad \text{with probability } 1 - \delta'',$$

i.e. $\text{Regret}_T = \tilde O(K \sqrt{T})$.
In fact, we can do a more sophisticated analysis to trim off a factor of $\sqrt{K}$ and show $\text{Regret}_T = \tilde O(\sqrt{TK})$.
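Taking the constants in the bound above at face value, a quick numeric check on the running example (with $\delta'' = 0.05$, chosen arbitrarily) shows how loose it is at short horizons:

T_total, K_arms, delta_pp = mab.T, mab.K, 0.05
bound = 2 * K_arms * np.sqrt(2 * T_total * np.log(2 * T_total * K_arms / delta_pp))
print(bound)  # ≈ 260, far larger than the trivial bound Regret_T <= T = 100 at this horizon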
3.6.2 Lower bound on regret (intuition)
Is it possible to do better than $\Omega(\sqrt{T})$ regret in general? In fact, no! We can show that any algorithm must incur $\Omega(\sqrt{T})$ regret in the worst case. We won't rigorously prove this here, but the intuition is as follows.
The Central Limit Theorem tells us that with $T$ i.i.d. samples from some distribution, we can only learn the mean of the distribution to within $\sigma/\sqrt{T}$ (where $\sigma$ is the standard deviation). Then, since we get $T$ samples spread out across the $K$ arms, we can only learn each arm's mean to an even looser degree.
That is, if two arms have means that are within about $1/\sqrt{T}$, we won't be able to confidently tell them apart, and will sample them about equally. But then we'll incur regret

$$\Omega\left( T \cdot \frac{1}{\sqrt{T}} \right) = \Omega(\sqrt{T}).$$
3.7 Thompson sampling and Bayesian bandits
So far, we’ve treated the parameters of the reward distributions as fixed. Instead, we can take a Bayesian approach where we treat them as random variables from some prior distribution. Then, upon pulling an arm and observing a reward, we can simply condition on this observation to exactly describe the posterior distribution over the parameters. This fully describes the information we gain about the parameters from observing the reward.
From this Bayesian perspective, the Thompson sampling algorithm follows naturally: just sample from the distribution of the optimal arm, given the observations!
class Distribution:
    def sample(self) -> Float[Array, " K"]:
        """Sample a vector of means for the K arms."""
        ...

    def update(self, arm: int, reward: float):
        """Condition on obtaining `reward` from the given arm."""
        ...


class ThompsonSampling(Agent):
    def __init__(self, K: int, T: int, prior: Distribution):
        super().__init__(K, T)
        self.distribution = prior

    def choose_arm(self):
        means = self.distribution.sample()
        return random_argmax(means)

    def update_history(self, arm: int, reward: int):
        super().update_history(arm, reward)
        self.distribution.update(arm, reward)
In other words, we sample each arm proportionally to how likely we think it is to be optimal, given the observations so far. This strikes a good exploration-exploitation tradeoff: we explore more for arms that we’re less certain about, and exploit more for arms that we’re more certain about. Thompson sampling is a simple yet powerful algorithm that achieves state-of-the-art performance in many settings.
class Beta(Distribution):
    def __init__(self, K: int, alpha: int = 1, beta: int = 1):
        self.alphas = np.full(K, alpha)
        self.betas = np.full(K, beta)

    def sample(self):
        return np.random.beta(self.alphas, self.betas)

    def update(self, arm: int, reward: int):
        self.alphas[arm] += reward
        self.betas[arm] += 1 - reward
beta_distribution = Beta(mab.K)
agent = ThompsonSampling(mab.K, mab.T, beta_distribution)
mab_loop(mab, agent)
plot_strategy(mab, agent)
It turns out that asymptotically, Thompson sampling is optimal in the following sense. Lai & Robbins (1985) prove an instance-dependent lower bound that says that for any (consistent) bandit algorithm and any suboptimal arm $k$,

$$\liminf_{T \to \infty} \frac{\mathbb{E}[N^k_T]}{\ln T} \ge \frac{1}{\mathrm{KL}\left( \mu^k \parallel \mu^\star \right)},$$

where

$$\mathrm{KL}\left( \mu^k \parallel \mu^\star \right) := \mu^k \ln \frac{\mu^k}{\mu^\star} + (1 - \mu^k) \ln \frac{1 - \mu^k}{1 - \mu^\star}$$

measures the Kullback-Leibler divergence from the Bernoulli distribution with mean $\mu^k$ to the Bernoulli distribution with mean $\mu^\star$. It turns out that Thompson sampling achieves this lower bound with equality! That is, not only is the error rate optimal, but the constant factor is optimal as well.
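To make the constant concrete, here is a small sketch evaluating the coefficients $1 / \mathrm{KL}(\mu^k \parallel \mu^\star)$ for the suboptimal arms of the running example (the `bernoulli_kl` helper below is defined here purely for illustration):

def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence from Bernoulli(p) to Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))


mu_star = mab.means[mab.best_arm]
coeffs = [1 / bernoulli_kl(mu, mu_star) for mu in mab.means if mu != mu_star]
print(coeffs)  # arms with means closer to mu_star must be pulled more often (about coeff * ln T times)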
3.8 Contextual bandits
In the above MAB environment, the reward distributions of the arms remain constant. However, in many real-world settings, we might receive additional information that affects these distributions. For example, in the online advertising case where each arm corresponds to an ad we could show the user, we might receive information about the user’s preferences that changes how likely they are to click on a given ad. We can model such environments using contextual bandits.
Assuming our context is discrete, we can just perform the same algorithms, treating each context-arm pair as its own arm. This gives us an enlarged MAB of $K |\mathcal{X}|$ arms, where $\mathcal{X}$ denotes the (finite) context space.
Recall that running UCB for $T$ timesteps on an MAB with $K$ arms achieves a regret bound of $\tilde O(\sqrt{TK})$. So in this problem, we would achieve regret $\tilde O(\sqrt{TK|\mathcal{X}|})$ in the contextual MAB, which has a polynomial dependence on $|\mathcal{X}|$. But in a situation where we have a large, or even infinite, number of contexts, e.g. in the case where our context is a continuous value, this becomes intractable.
Note that this “enlarged MAB” treats the different contexts as entirely unrelated to each other, while in practice, often contexts are related to each other in some way: for example, we might want to advertise similar products to users with similar preferences. How can we incorporate this structure into our solution?
3.8.1 Linear contextual bandits
We want to model the mean reward of arm $k$ as a function of the context, i.e. $\mu^k(x)$. One simple model is the linear one: $\mu^k(x) = x^\top \theta^k$, where $x \in \mathbb{R}^D$ is the context and $\theta^k \in \mathbb{R}^D$ describes a feature direction for arm $k$. Recall that supervised learning gives us a way to estimate a conditional expectation from samples: we learn a least squares estimator from the timesteps where arm $k$ was selected:

$$\hat\theta^k_t := \arg\min_{\theta \in \mathbb{R}^D} \sum_{\tau \in \mathcal{I}^k_t} \left( r_\tau - x_\tau^\top \theta \right)^2, \qquad \mathcal{I}^k_t := \{ \tau < t : a_\tau = k \}.$$

This has the closed-form solution known as the ordinary least squares (OLS) estimator:

$$\hat\theta^k_t = (A^k_t)^{-1} \sum_{\tau \in \mathcal{I}^k_t} x_\tau r_\tau, \qquad \text{where} \qquad A^k_t := \sum_{\tau \in \mathcal{I}^k_t} x_\tau x_\tau^\top. \qquad (3.30)$$
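As an illustration of (3.30), here is a sketch of the per-arm OLS computation from logged data (hypothetical arrays `contexts` of shape (n, D) and `rewards` of shape (n,) for a single arm; it assumes enough diverse contexts for $A$ to be invertible, otherwise one adds a ridge term $\lambda I$ as in the LinUCB code below):

def ols_estimate(
    contexts: Float[Array, "n D"], rewards: Float[Array, " n"]
) -> Float[Array, " D"]:
    A = contexts.T @ contexts  # sum of outer products x_tau x_tau^T
    b = contexts.T @ rewards  # sum of x_tau * r_tau
    return np.linalg.solve(A, b)  # A^{-1} b, the OLS estimator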
We can now apply the UCB algorithm in this environment in order to balance exploration of new arms and exploitation of arms that we believe to have high reward. But how should we construct the upper confidence bound? Previously, we treated the pulls of an arm as i.i.d. samples and used Hoeffding’s inequality to bound the distance of the sample mean, our estimator, from the true mean. However, now our estimator is not a sample mean, but rather the OLS estimator above (3.30). Instead, we’ll use Chebyshev’s inequality to construct an upper confidence bound.
Since the OLS estimator is known to be unbiased (try proving this yourself), we can apply Chebyshev's inequality to $x_t^\top \hat\theta^k_t$:

$$x_t^\top \theta^k \le x_t^\top \hat\theta^k_t + \beta \sqrt{x_t^\top (A^k_t)^{-1} x_t} \qquad \text{with probability at least } 1 - \frac{1}{\beta^2}.$$

The first term is exactly our predicted reward $\hat\mu^k_t(x_t) := x_t^\top \hat\theta^k_t$. To interpret the second term, note that

$$x_t^\top (A^k_t)^{-1} x_t = \frac{1}{N^k_t} x_t^\top (\hat\Sigma^k_t)^{-1} x_t,$$

where $N^k_t := |\mathcal{I}^k_t|$ and

$$\hat\Sigma^k_t := \frac{1}{N^k_t} \sum_{\tau \in \mathcal{I}^k_t} x_\tau x_\tau^\top$$

is the empirical covariance matrix of the contexts (assuming that the context has mean zero). That is, the learner is encouraged to choose arms when $x_t$ is not aligned with the data seen so far, or if arm $k$ has not been explored much and so $N^k_t$ is small.
We can now substitute these quantities into UCB to get the LinUCB algorithm:
class LinUCBPseudocode(Agent):
    def __init__(
        self, K: int, T: int, D: int, lam: float, get_c: Callable[[int], float]
    ):
        super().__init__(K, T)
        self.lam = lam
        self.get_c = get_c
        # per-arm regularized Gram matrices A^k = lam * I + sum of x x^T, shape (K, D, D)
        self.A = np.repeat((lam * np.eye(D))[np.newaxis], K, axis=0)
        self.targets = np.zeros((K, D))  # per-arm sums of x * r
        self.w = np.zeros((K, D))  # per-arm least squares estimates

    def choose_arm(self, context: Float[Array, " D"]):
        c = self.get_c(self.count())
        scores = np.array(
            [
                self.w[k] @ context
                + c * np.sqrt(context @ np.linalg.solve(self.A[k], context))
                for k in range(self.K)
            ]
        )
        return random_argmax(scores)

    def update_history(self, context: Float[Array, " D"], arm: int, reward: int):
        self.rewards.append(reward)  # record the pull so that count() keeps working
        self.choices.append(arm)
        self.A[arm] += np.outer(context, context)
        self.targets[arm] += context * reward
        self.w[arm] = np.linalg.solve(self.A[arm], self.targets[arm])
$c_t$ is similar to the $\sqrt{\ln(2t/\delta)/(2 N^k_t)}$ term of UCB: it controls the width of the confidence interval. Here, we treat it as a tunable parameter, though in a theoretical analysis, it would depend on the problem parameters and the probability $\delta$ with which the bound holds.
Using similar tools as for UCB, we can also prove an $\tilde O(\sqrt{T})$ regret bound. The full details of the analysis can be found in Section 3 of Agarwal et al. (2022).
3.9 Summary
In this chapter, we explored the multi-armed bandit setting for analyzing sequential decision-making in an unknown environment. An MAB consists of multiple arms, each with an unknown reward distribution. The agent’s task is to learn about these through interaction, eventually minimizing the regret, which measures how suboptimal the chosen arms were.
We saw algorithms such as the upper confidence bound (UCB) algorithm and Thompson sampling that handle the tradeoff between exploration and exploitation, that is, the tradeoff between choosing arms that the agent is uncertain about and arms the agent already supposes to be good.
We finally discussed contextual bandits, in which the agent gets to observe some context that affects the reward distributions. These problems can be approached using supervised learning methods such as least squares estimation.
- Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press.
- Agarwal, A., Jiang, N., Kakade, S. M., & Sun, W. (2022). Reinforcement Learning: Theory and Algorithms.
- Lai, T. L., & Robbins, H. (1985). Asymptotically Efficient Adaptive Allocation Rules. Advances in Applied Mathematics, 6(1), 4–22. 10.1016/0196-8858(85)90002-8