The multi-armed bandits (MAB) setting is a simple framework for studying the basic challenges of sequential decision-making. In this setting, an agent repeatedly chooses from a fixed set of actions, called arms, each of which has an associated reward distribution. The agent’s goal is to maximize the total reward it receives over some time period.
In particular, we’ll spend a lot of time discussing the Exploration-Exploitation Tradeoff: should the agent choose new actions to learn more about the environment, or should it choose actions that it already knows to be good?
Example 3.1 (Online advertising) Let’s suppose you, the agent, are an advertising company. You have \(K\) different ads that you can show to users; for concreteness, let’s suppose there’s just a single user. You receive \(1\) reward if the user clicks the ad, and \(0\) otherwise. Thus, the unknown reward distribution associated to each ad is a Bernoulli distribution defined by the probability that the user clicks on the ad. Your goal is to maximize the total number of clicks by the user.
Example 3.2 (Clinical trials) Suppose you’re a pharmaceutical company, and you’re testing a new drug. You have \(K\) different dosages of the drug that you can administer to patients. You receive \(1\) reward if the patient recovers, and \(0\) otherwise. Thus, the unknown reward distribution associated to each dosage is a Bernoulli distribution defined by the probability that the patient recovers. Your goal is to maximize the total number of patients that recover.
In this chapter, we will introduce the multi-armed bandits setting, and discuss some of the challenges that arise when trying to solve problems in this setting. We will also introduce some of the key concepts that we will use throughout the book, such as regret and exploration-exploitation tradeoffs.
Code
from jaxtyping import Float, Array
import numpy as np
import latexify
from typing import Callable, Union
import matplotlib.pyplot as plt
import solutions.bandits as solutions

np.random.seed(184)


def random_argmax(ary: Array) -> int:
    """Take an argmax and randomize between ties."""
    max_idx = np.flatnonzero(ary == ary.max())
    return np.random.choice(max_idx).item()


# used as decorator
latex = latexify.algorithmic(
    trim_prefixes={"mab"},
    id_to_latex={"arm": "a_t", "reward": "r", "means": "\mu"},
    use_math_symbols=True,
)
Remark 3.1 (Namesake). The name “multi-armed bandits” comes from slot machines in casinos, which are often called “one-armed bandits” since they have one arm (the lever) and rob money from the player.
Let \(K\) denote the number of arms. We’ll label them \(0, \dots, K-1\) and use superscripts to indicate the arm index; since we seldom need to raise a number to a power, this won’t cause much confusion. In this chapter, we’ll consider the Bernoulli bandit setting from the examples above, where arm \(k\) either returns reward \(1\) with probability \(\mu^k\) or \(0\) otherwise. The agent gets to pull an arm \(T\) times in total. We can formalize the Bernoulli bandit in the following Python code:
Code
class MAB:
    """The Bernoulli multi-armed bandit environment.

    :param means: the means (success probabilities) of the reward distributions for each arm
    :param T: the time horizon
    """

    def __init__(self, means: Float[Array, " K"], T: int):
        assert all(0 <= p <= 1 for p in means)
        self.means = means
        self.T = T
        self.K = self.means.size
        self.best_arm = random_argmax(self.means)

    def pull(self, k: int) -> int:
        """Pull the `k`-th arm and sample from its (Bernoulli) reward distribution."""
        reward = np.random.rand() < self.means[k].item()
        return +reward
Code
mab = MAB(means=np.array([0.1, 0.8, 0.4]), T=100)
In pseudocode, the agent’s interaction with the MAB environment can be described by the following process:
Code
@latex
def mab_loop(mab: MAB, agent: "Agent") -> None:
    for t in range(mab.T):
        arm = agent.choose_arm()  # in 0, ..., K-1
        reward = mab.pull(arm)
        agent.update_history(arm, reward)


mab_loop
The Agent class stores the pull history and uses it to decide which arm to pull next. Since we are working with Bernoulli bandits, we can summarize the pull history concisely in a \(\mathbb{N}^{K \times 2}\) array.
Code
class Agent:
    def __init__(self, K: int, T: int):
        """The MAB agent that decides how to choose an arm given the past history."""
        self.K = K
        self.T = T
        self.rewards = []  # for plotting
        self.choices = []
        self.history = np.zeros((K, 2), dtype=int)

    def choose_arm(self) -> int:
        """Choose an arm of the MAB. Algorithm-specific."""
        ...

    def count(self) -> int:
        """The number of pulls made. Also the current step index."""
        return len(self.rewards)

    def update_history(self, arm: int, reward: int):
        self.rewards.append(reward)
        self.choices.append(arm)
        self.history[arm, reward] += 1
What’s the optimal strategy for the agent, i.e. the one that achieves the highest expected reward? Convince yourself that the agent should try to always pull the arm with the highest expected reward:
\[
\mu^\star := \max_{k \in [K]} \mu^k.
\]
The goal, then, can be rephrased as to minimize the regret, defined below:
Definition 3.1 (Regret) The agent’s regret after \(T\) timesteps is defined as
\[
\text{Regret}_T := \sum_{t=0}^{T-1} \left( \mu^\star - \mu^{a_t} \right).
\]
In code, we can compute the per-timestep terms of this sum as follows:
def regret_per_step(mab: MAB, agent: Agent):
    """Get the difference from the average reward of the optimal arm. The sum of these is the regret."""
    return [mab.means[mab.best_arm] - mab.means[arm] for arm in agent.choices]
Note that this depends on the true means of the pulled arms, not the actual observed rewards. We typically think of this as a random variable where the randomness comes from the agent’s strategy (i.e. the sequence of actions \(a_0, \dots, a_{T-1}\)).
Throughout the chapter, we will try to upper bound the regret of various algorithms in two different senses:
Upper bound the expected regret, i.e. show \(\mathbb{E}[\text{Regret}_T] \le M_T\).
Find a high-probability upper bound on the regret, i.e. show \(\mathbb{P}(\text{Regret}_T \le M_{T, \delta}) \ge 1-\delta\).
Note that these two different approaches say very different things about the regret. The first approach says that the average regret is at most \(M_T\). However, the agent might still achieve higher regret on many runs. The second approach says that, with high probability, the agent will achieve regret at most \(M_{T, \delta}\). However, it doesn’t say anything about the regret in the remaining \(\delta\) fraction of runs, which might be arbitrarily high.
We’d like to achieve sublinear regret in expectation, i.e. \(\mathbb{E}[\text{Regret}_T] = o(T)\). That is, as we learn more about the environment, we’d like to be able to exploit that knowledge to take the optimal arm as often as possible.
The rest of the chapter comprises a series of increasingly sophisticated MAB algorithms.
Code
def plot_strategy(mab: MAB, agent: Agent):
    plt.figure(figsize=(10, 6))

    # plot reward and cumulative regret
    plt.plot(np.arange(mab.T), np.cumsum(agent.rewards), label="reward")
    cum_regret = np.cumsum(regret_per_step(mab, agent))
    plt.plot(np.arange(mab.T), cum_regret, label="cumulative regret")

    # draw colored circles for arm choices
    colors = ["red", "green", "blue"]
    color_array = [colors[k] for k in agent.choices]
    plt.scatter(np.arange(mab.T), np.zeros(mab.T), c=color_array, label="arm")

    # labels and title
    plt.xlabel("timestep")
    plt.legend()
    plt.title(f"{agent.__class__.__name__} reward and regret")
    plt.show()
3.2 Pure exploration
A trivial strategy is to always choose arms at random (i.e. “pure exploration”).
Code
class PureExploration(Agent):
    def choose_arm(self):
        """Choose an arm uniformly at random."""
        return solutions.pure_exploration_choose_arm(self)
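The solutions module that supplies pure_exploration_choose_arm is not shown in this chapter. As a rough sketch (an assumption about what it does, based on the docstring above), the implementation could be as simple as:

Code

def pure_exploration_choose_arm(agent: Agent) -> int:
    """Ignore the history entirely and pick one of the K arms uniformly at random."""
    return np.random.randint(agent.K)

We can run any agent against the environment defined earlier with, e.g., agent = PureExploration(mab.K, mab.T); mab_loop(mab, agent); plot_strategy(mab, agent).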
Its expected regret scales as \(\Theta(T)\), i.e. linearly in the number of timesteps \(T\): at every timestep, the expected regret is the constant \(\mu^\star - \frac{1}{K}\sum_{k=0}^{K-1} \mu^k\), which is positive unless every arm is optimal. There’s no learning here: the agent doesn’t use any information about the environment to improve its strategy, so the distribution over its arm choices remains uniform throughout.
3.3 Pure greedy
How might we improve on pure exploration? Instead, we could try each arm once, and then commit to the one with the highest observed reward. We’ll call this the pure greedy strategy.
Code
class PureGreedy(Agent):
    def choose_arm(self):
        """Choose the arm with the highest observed reward on its first pull."""
        return solutions.pure_greedy_choose_arm(self)
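Again, the actual pure_greedy_choose_arm in the solutions module is not shown here; a plausible sketch, assuming the Agent interface above, is to spend the first \(K\) steps pulling each arm once and then commit to the arm whose single observed reward was highest (breaking ties with random_argmax):

Code

def pure_greedy_choose_arm(agent: Agent) -> int:
    """Pull each arm once, then commit to the arm whose single observed reward was highest."""
    if agent.count() < agent.K:
        return agent.count()  # exploration phase: pull arm 0, then 1, ..., then K-1
    # after K pulls, history[k, 1] is exactly the single reward observed from arm k
    return random_argmax(agent.history[:, 1])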
Note that we’ve used superscripts \(r^k\) during the exploration phase to indicate that we observe exactly one reward for each arm, and subscripts \(r_t\) during the exploitation phase to indicate that we observe a sequence of rewards from the chosen greedy arm \(\hat k\).
How does the expected regret of this strategy compare to that of pure exploration? We’ll do a more general analysis in the following section. Now, for intuition, suppose there’s just \(K=2\) arms, with Bernoulli reward distributions with means \(\mu^0 > \mu^1\).
Let \(r^0\) be the random reward from the first arm and \(r^1\) be the random reward from the second. If \(r^0 > r^1\), then we achieve zero regret. Otherwise, we achieve regret \(T(\mu^0 - \mu^1)\). Thus, the expected regret is simply:
\[
\mathbb{E}[\text{Regret}_T] = T(\mu^0 - \mu^1) \cdot \mathbb{P}(r^0 \le r^1).
\]
The cumulative regret is a straight line because the regret only depends on the arms chosen and not the actual reward observed. In fact, if the greedy algorithm happens to get lucky on the first set of pulls, it may act entirely optimally for that episode! But its average regret is what measures its effectiveness.
3.4 Explore-then-commit
We can improve the pure greedy algorithm as follows: let’s reduce the variance of the reward estimates by pulling each arm \(N_{\text{explore}}> 1\) times before committing. This is called the explore-then-commit strategy. Note that the “pure greedy” strategy above is just the special case where \(N_{\text{explore}}= 1\).
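The chapter’s reference implementation of ETC is not reproduced here; the following is a minimal sketch under the same Agent interface, with N_explore passed in as a parameter.

Code

class ExploreThenCommit(Agent):
    def __init__(self, K: int, T: int, N_explore: int):
        super().__init__(K, T)
        self.N_explore = N_explore

    def choose_arm(self):
        t = self.count()
        if t < self.N_explore * self.K:
            return t % self.K  # exploration phase: pull each arm N_explore times, round-robin
        # exploitation phase: commit to the arm with the highest empirical mean
        return random_argmax(self.history[:, 1] / self.history.sum(axis=1))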
Notice that now, the graphs are much more consistent, and the algorithm finds the true optimal arm and sticks with it much more frequently. We would expect ETC to then have a better (i.e. lower) average regret. Can we prove this?
3.4.1 ETC regret analysis
Let’s analyze the expected regret of the explore-then-commit strategy by splitting it up into the exploration and exploitation phases.
3.4.1.1 Exploration phase.
This phase takes \(N_{\text{explore}}K\) timesteps. Since at each step we incur at most \(1\) regret, the total regret is at most \(N_{\text{explore}}K\).
3.4.1.2 Exploitation phase.
This will take a bit more effort. We’ll prove that for any total time \(T\), we can choose \(N_{\text{explore}}\) such that with arbitrarily high probability, the regret is sublinear.
Let \(\hat k\) denote the arm chosen after the exploration phase. We know the regret from the exploitation phase is
\[
T_{\text{exploit}} \cdot (\mu^\star - \mu^{\hat k}) \qquad \text{where } T_{\text{exploit}} := T - N_{\text{explore}} K.
\]
So we’d like to bound \(\mu^\star - \mu^{\hat k} = o(1)\) (as a function of \(T\)) in order to achieve sublinear regret. How can we do this?
Let’s define \(\Delta^k = \hat \mu^k - \mu^k\) to denote how far the mean estimate for arm \(k\) is from the true mean. How can we bound this quantity? We’ll use the following useful inequality for i.i.d. bounded random variables:
Theorem 3.1 (Hoeffding’s inequality) Let \(X_0, \dots, X_{n-1}\) be i.i.d. random variables with \(X_i \in [0, 1]\) almost surely for each \(i \in [n]\). Then for any \(\delta > 0\),
\[
\mathbb{P}\left( \left| \frac{1}{n} \sum_{i=0}^{n-1} X_i - \mathbb{E}[X_0] \right| \le \sqrt{\frac{\ln(2/\delta)}{2n}} \right) \ge 1 - \delta. \tag{3.1}
\]
But note that we can’t apply this to arm \(\hat k\) directly since \(\hat k\) is itself a random variable. Instead, we need to “uniform-ize” this bound across all the arms, i.e. bound the error across all the arms simultaneously, so that the resulting bound will apply no matter what \(\hat k\) “crystallizes” to.
The union bound provides a simple way to do this:
Theorem 3.2 (Union bound) Consider a set of events \(A_0, \dots, A_{n-1}\). Then
\[
\mathbb{P}(\exists i \in [n]. A_i) \le \sum_{i=0}^{n-1} \mathbb{P}(A_i).
\]
In particular, if \(\mathbb{P}(A_i) \ge 1 - \delta\) for each \(i \in [n]\), we have
\[
\mathbb{P}(\forall i \in [n]. A_i) \ge 1 - n \delta.
\]
Exercise: Prove the second statement above.
Applying the union bound across the arms to the l.h.s. event of Equation 3.1 (with \(n = N_{\text{explore}}\)), we have
\[
\mathbb{P}\left( \forall k \in [K]. \ |\Delta^k| \le \sqrt{\frac{\ln(2K/\delta')}{2 N_{\text{explore}}}} \right) \ge 1 - \delta',
\]
where \(\delta' := K\delta\). On this event, since \(\hat k\) maximizes the estimated mean \(\hat\mu^k\), we have \(\mu^\star - \mu^{\hat k} \le 2\sqrt{\ln(2K/\delta')/(2 N_{\text{explore}})}\), so the total regret is at most
\[
N_{\text{explore}} K + 2 T \sqrt{\frac{\ln(2K/\delta')}{2 N_{\text{explore}}}}.
\]
Note that it suffices for \(N_{\text{explore}}\) to be on the order of \(\sqrt{T}\) to achieve sublinear regret. In particular, we can find the optimal \(N_{\text{explore}}\) by setting the derivative of the r.h.s. to zero:
\[
N_{\text{explore}} = \left( \frac{T}{K} \sqrt{\frac{\ln(2K/\delta')}{2}} \right)^{2/3},
\]
which yields a total regret of \(\tilde O(T^{2/3} K^{1/3})\).
The ETC algorithm is rather “abrupt” in that it switches from exploration to exploitation after a fixed number of timesteps. In practice, it’s often better to use a more gradual transition, which brings us to the epsilon-greedy algorithm.
3.5 Epsilon-greedy
Instead of doing all of the exploration and then all of the exploitation separately – which additionally requires knowing the time horizon beforehand – we can instead interleave exploration and exploitation by, at each timestep, choosing a random action with some probability. We call this the epsilon-greedy algorithm.
Note that we let \(\epsilon\) vary over time. In particular, we might want to gradually decrease \(\epsilon\) as we learn more about the reward distributions and no longer need to spend time exploring.
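A sketch of such an agent, assuming the Agent interface above and taking the exploration probability as a function \(\epsilon(t)\) so that it can decay over time, might look like this:

Code

class EpsilonGreedy(Agent):
    def __init__(self, K: int, T: int, epsilon: Callable[[int], float]):
        super().__init__(K, T)
        self.epsilon = epsilon  # exploration probability as a function of the timestep

    def choose_arm(self):
        t = self.count()
        pulls = self.history.sum(axis=1)
        if np.random.rand() < self.epsilon(t) or pulls.min() == 0:
            return np.random.randint(self.K)  # explore: uniformly random arm
        # exploit: the arm with the highest empirical mean so far
        return random_argmax(self.history[:, 1] / pulls)

For instance, a constant schedule would be epsilon=lambda t: 0.1, and the decaying schedule discussed below could be written as epsilon=lambda t: min(1.0, (mab.K * np.log(t + 2) / (t + 2)) ** (1 / 3)) (the +2 just avoids taking the logarithm of zero on the first steps).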
What is the expected regret of the algorithm if we set \(\epsilon\) to be a constant?
It turns out that setting \(\epsilon_t = \sqrt[3]{K \ln(t)/t}\) also achieves a regret of \(\tilde O(t^{2/3} K^{1/3})\) (where the \(\tilde O\) suppresses logarithmic factors). (We will not prove this here; a proof can be found in Agarwal et al. (2022).)
In ETC, we had to set \(N_{\text{explore}}\) based on the total number of timesteps \(T\). But the epsilon-greedy algorithm handles the exploration automatically: the regret rate holds for any \(t\), and doesn’t depend on the final horizon \(T\).
But the way these algorithms explore is rather naive: we’ve been exploring uniformly across all the arms. But what if we could be smarter about it, and explore more for arms that we’re less certain about?
3.6 Upper Confidence Bound (UCB)
To quantify how certain we are about the mean of each arm, we’ll compute confidence intervals for our estimators, and then choose the arm with the highest upper confidence bound. This operates on the principle of the benefit of the doubt (i.e. optimism in the face of uncertainty): we’ll choose the arm that we’re most optimistic about.
In particular, for each arm \(k\) at time \(t\), we’d like to compute some upper confidence bound \(M^k_t\) such that \(\mu^k \le M^k_t\) with high probability, and then choose \(a_t := \arg \max_{k \in [K]} M^k_t\). But how should we compute \(M^k_t\)?
In Section 3.4.1, we were able to compute this bound using Hoeffding’s inequality, which assumes that the number of samples is fixed. This was the case in ETC (where we pull each arm \(N_{\text{explore}}\) times), but in UCB, the number of times we pull each arm depends on the agent’s actions, which in turn depend on the random rewards and are therefore stochastic. So we can’t use Hoeffding’s inequality directly.
Instead, we’ll apply the same trick we used in the ETC analysis: we’ll use the union bound to compute a looser bound that holds uniformly across all timesteps and arms. Let’s introduce some notation to discuss this.
Let \(N^k_t\) denote the (random) number of times arm \(k\) has been pulled within the first \(t\) timesteps, and \(\hat \mu^k_t\) denote the sample average of those pulls. That is,
\[
N^k_t := \sum_{\tau=0}^{t-1} \mathbf{1}\{ a_\tau = k \}, \qquad \hat \mu^k_t := \frac{1}{N^k_t} \sum_{\tau=0}^{t-1} \mathbf{1}\{ a_\tau = k \} \, r_\tau.
\]
To achieve the “fixed sample size” assumption, we’ll need to shift our index from time to number of samples from each arm. In particular, we’ll define \(\tilde r^k_n\) to be the \(n\)th sample from arm \(k\), and \(\tilde \mu^k_n\) to be the sample average of the first \(n\) samples from arm \(k\). Then, for a fixed \(n\), this satisfies the “fixed sample size” assumption, and we can apply Hoeffding’s inequality to get a bound on \(\tilde \mu^k_n\).
So how can we extend our bound on \(\tilde\mu^k_n\) to \(\hat \mu^k_t\)? Well, we know \(N^k_t \le t\) (where equality would hold if and only if we had pulled arm \(k\) every time). So we can apply the same trick as last time, where we uniform-ize across all possible values of \(N^k_t\):
\[
\mathbb{P}\left( \forall n \le t. \ |\tilde \mu^k_n - \mu^k| \le \sqrt{\frac{\ln(2t/\delta')}{2n}} \right) \ge 1 - \delta'.
\]
In particular, taking \(n = N^k_t\), with probability at least \(1 - \delta'\) we have
\[
|\hat \mu^k_t - \mu^k| \le \sqrt{\frac{\ln(2t/\delta')}{2 N^k_t}},
\]
so we can take the upper confidence bound to be
\[
M^k_t := \hat \mu^k_t + \sqrt{\frac{\ln(2t/\delta')}{2 N^k_t}}.
\]
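With \(M^k_t\) in hand, the UCB agent simply pulls the arm with the largest upper confidence bound at each step. The chapter’s implementation is not shown here; a minimal sketch under the Agent interface above, with \(\delta'\) left as a tunable parameter, is:

Code

class UCB(Agent):
    def __init__(self, K: int, T: int, delta: float = 0.05):
        super().__init__(K, T)
        self.delta = delta  # the delta' in the confidence bound above

    def choose_arm(self):
        t = self.count()
        pulls = self.history.sum(axis=1)  # N^k_t for each arm
        if pulls.min() == 0:
            return random_argmax(pulls == 0)  # an unpulled arm has an infinite bonus: pull it
        means = self.history[:, 1] / pulls
        bonus = np.sqrt(np.log(2 * t / self.delta) / (2 * pulls))
        return random_argmax(means + bonus)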
3.6.1 UCB regret analysis
First we’ll bound the regret incurred at each timestep. Then we’ll bound the total regret across timesteps.
For the sake of analysis, we’ll use a slightly looser bound that applies across the whole time horizon and across all arms. We’ll omit the derivation since it’s very similar to the above (walk through it yourself for practice).
\[
\begin{aligned}
\mathbb{P}\left(\forall k \le K, t < T. |\hat \mu^k_t - \mu^k | \le B^k_t \right) &\ge 1-\delta'' \\
\text{where} \quad B^k_t &:= \sqrt{\frac{\ln(2TK/\delta'')}{2N^k_t}}.
\end{aligned}
\]
Intuitively, \(B^k_t\) denotes the width of the confidence interval for arm \(k\) at time \(t\). Then, assuming the above uniform bound holds (which occurs with probability \(1-\delta''\)), we can bound the regret at each timestep as follows: since \(a_t\) maximizes the upper confidence bound, its optimistic estimate is at least \(\mu^\star\) and is within \(2 B^{a_t}_t\) of \(\mu^{a_t}\), so
\[
\mu^\star - \mu^{a_t} \le 2 B^{a_t}_t.
\]
Summing over timesteps, grouping the pulls by arm, and using \(\sum_{n=1}^{N} 1/\sqrt{n} \le 2\sqrt{N}\) together with \(N^k_T \le T\), we get
\[
\begin{aligned}
\text{Regret}_T &\le 2 K \sqrt{2T \ln(2TK/\delta'')} && \text{with probability } 1-\delta'' \\
&= \tilde O(K\sqrt{T})
\end{aligned}
\]
In fact, we can do a more sophisticated analysis to trim off a factor of \(\sqrt{K}\) and show \(\text{Regret}_T = \tilde O(\sqrt{TK})\).
3.6.2 Lower bound on regret (intuition)
Is it possible to do better than \(\Theta(\sqrt{T})\) regret in general? In fact, no: any algorithm must incur \(\Omega(\sqrt{T})\) regret in the worst case. We won’t rigorously prove this here, but the intuition is as follows.
The Central Limit Theorem tells us that with \(T\) i.i.d. samples from some distribution, we can only estimate its mean to within \(\Omega(1/\sqrt{T})\) error (the standard deviation of the sample mean). Then, since the \(T\) samples get spread out across the \(K\) arms, we can only learn each arm’s mean to an even looser precision.
That is, if two arms have means that are within about \(1/\sqrt{T}\), we won’t be able to confidently tell them apart, and will sample them about equally. But then we’ll incur regret \[
\Omega((T/2) \cdot (1/\sqrt{T})) = \Omega(\sqrt{T}).
\]
3.7 Thompson sampling and Bayesian bandits
So far, we’ve treated the parameters \(\mu^0, \dots, \mu^{K-1}\) of the reward distributions as fixed. Instead, we can take a Bayesian approach where we treat them as random variables from some prior distribution. Then, upon pulling an arm and observing a reward, we can simply condition on this observation to exactly describe the posterior distribution over the parameters. This fully describes the information we gain about the parameters from observing the reward.
From this Bayesian perspective, the Thompson sampling algorithm follows naturally: just sample from the distribution of the optimal arm, given the observations!
Code
class Distribution:
    def sample(self) -> Float[Array, " K"]:
        """Sample a vector of means for the K arms."""
        ...

    def update(self, arm: int, reward: float):
        """Condition on obtaining `reward` from the given arm."""
        ...
In other words, we sample each arm proportionally to how likely we think it is to be optimal, given the observations so far. This strikes a good exploration-exploitation tradeoff: we explore more for arms that we’re less certain about, and exploit more for arms that we’re more certain about. Thompson sampling is a simple yet powerful algorithm that achieves state-of-the-art performance in many settings.
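A sketch of a Thompson sampling agent built on this Distribution interface (not the chapter’s reference implementation) could look like the following: sample a vector of means from the posterior, act greedily with respect to that sample, and then condition on the observed reward.

Code

class ThompsonSampling(Agent):
    def __init__(self, K: int, T: int, prior: Distribution):
        super().__init__(K, T)
        self.distribution = prior  # current posterior over the vector of arm means

    def choose_arm(self):
        sampled_means = self.distribution.sample()  # one draw of all K means from the posterior
        return random_argmax(sampled_means)  # greedy with respect to the sampled means

    def update_history(self, arm: int, reward: int):
        super().update_history(arm, reward)
        self.distribution.update(arm, reward)  # condition the posterior on the new observation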
Example 3.3 (Bayesian Bernoulli bandit) We’ve been working in the Bernoulli bandit setting, where arm \(k\) yields a reward of \(1\) with probability \(\mu^k\) and no reward otherwise. The vector of success probabilities \(\boldsymbol{\mu} = (\mu^0, \dots, \mu^{K-1})\) thus describes the entire MAB.
Under the Bayesian perspective, we think of \(\boldsymbol{\mu}\) as a random vector drawn from some prior distribution \(\pi(\boldsymbol{\mu})\). For example, we might take \(\pi\) to be the uniform distribution over the unit hypercube \([0, 1]^K\), that is,
\[
\pi(\boldsymbol{\mu}) = \mathbf{1}\{ \boldsymbol{\mu} \in [0, 1]^K \} = \prod_{k=0}^{K-1} \mathbf{1}\{ \mu^k \in [0, 1] \}.
\]
In this case, upon viewing some reward, we can exactly calculate the posterior distribution of \(\boldsymbol{\mu}\) using Bayes’s rule (i.e. the definition of conditional probability). For the first pulled arm \(a_0\) with observed reward \(r_0\),
\[
\pi(\mu^{a_0} \mid a_0, r_0) \propto \pi(\mu^{a_0}) \, \mathbb{P}(r_0 \mid \mu^{a_0}) = (\mu^{a_0})^{r_0} (1 - \mu^{a_0})^{1 - r_0}.
\]
This is the PDF of the \(\text{Beta}(1 + r_0, 1 + (1 - r_0))\) distribution, which is a conjugate prior for the Bernoulli distribution. That is, if we start with a Beta prior on \(\mu^k\) (note that \(\text{Unif}([0, 1]) = \text{Beta}(1, 1)\)), then the posterior, after conditioning on samples from \(\text{Bern}(\mu^k)\), will also be Beta. This is a very convenient property, since it means we can simply update the parameters of the Beta distribution upon observing a reward, rather than having to recompute the entire posterior distribution from scratch.
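Concretely, a Beta prior implementing the Distribution interface above might look like the following sketch (class and attribute names are our own, not from the chapter’s code): each arm keeps a pair of Beta parameters, starting from \(\text{Beta}(1, 1)\), i.e. the uniform prior.

Code

class BetaPrior(Distribution):
    def __init__(self, K: int):
        self.alphas = np.ones(K)  # 1 + number of observed successes for each arm
        self.betas = np.ones(K)   # 1 + number of observed failures for each arm

    def sample(self) -> Float[Array, " K"]:
        return np.random.beta(self.alphas, self.betas)

    def update(self, arm: int, reward: float):
        # conjugate update: a success increments alpha, a failure increments beta
        self.alphas[arm] += reward
        self.betas[arm] += 1 - reward

Combined with the agent sketched above, Thompson sampling on the running example would then be ThompsonSampling(mab.K, mab.T, BetaPrior(mab.K)).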
It turns out that asymptotically, Thompson sampling is optimal in the following sense. Lai and Robbins (1985) prove an instance-dependent lower bound that says for any bandit algorithm, every suboptimal arm \(k\) must satisfy
\[
\liminf_{T \to \infty} \frac{\mathbb{E}[N^k_T]}{\ln T} \ge \frac{1}{\mathrm{KL}\left( \mu^k \parallel \mu^\star \right)},
\]
where
\[
\mathrm{KL}\left( \mu^k \parallel \mu^\star \right) = \mu^k \ln \frac{\mu^k}{\mu^\star} + (1 - \mu^k) \ln \frac{1 - \mu^k}{1 - \mu^\star}
\]
measures the Kullback-Leibler divergence from the Bernoulli distribution with mean \(\mu^k\) to the Bernoulli distribution with mean \(\mu^\star\). It turns out that Thompson sampling achieves this lower bound with equality! That is, not only is the error rate optimal, but the constant factor is optimal as well.
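As a small illustration of the quantity appearing in this bound (not part of the chapter’s code), the Bernoulli KL divergence can be computed directly from the formula above:

Code

def kl_bernoulli(p: float, q: float) -> float:
    """KL divergence from Bernoulli(p) to Bernoulli(q), assuming 0 < p, q < 1."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# e.g. kl_bernoulli(0.4, 0.8) < kl_bernoulli(0.1, 0.8): arms with means closer to the
# best arm's are harder to distinguish, so the bound allows them to be pulled more often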
3.8 Contextual bandits
In the above MAB environment, the reward distributions of the arms remain constant. However, in many real-world settings, we might receive additional information that affects these distributions. For example, in the online advertising case where each arm corresponds to an ad we could show the user, we might receive information about the user’s preferences that changes how likely they are to click on a given ad. We can model such environments using contextual bandits.
Definition 3.2 (Contextual bandit) At each timestep \(t\), a new context \(x_t\) is drawn from some distribution \(\nu_{\text{x}}\). The learner gets to observe the context, and choose an action \(a_t\) according to some context-dependent policy \(\pi_t(x_t)\). Then, the learner observes the reward from the chosen arm \(r_t \sim \nu^{a_t}(x_t)\). The reward distribution now also depends on the context.
Assuming our context is discrete, we can just perform the same algorithms, treating each context-arm pair as its own arm. This gives us an enlarged MAB of \(K |\mathcal{X}|\) arms.
Write down the UCB algorithm for this enlarged MAB. That is, write an expression for \(\pi_t(x_t) = \arg\max_a \dots\).
Recall that running UCB for \(T\) timesteps on an MAB with \(K\) arms achieves a regret bound of \(\tilde{O}(\sqrt{TK})\). So in this problem, we would achieve regret \(\tilde{O}(\sqrt{TK|\mathcal{X}|})\) in the contextual MAB, which has a polynomial dependence on \(|\mathcal{X}|\). But in a situation where we have large, or even infinitely many contexts, e.g. in the case where our context is a continuous value, this becomes intractable.
Note that this “enlarged MAB” treats the different contexts as entirely unrelated to each other, while in practice, often contexts are related to each other in some way: for example, we might want to advertise similar products to users with similar preferences. How can we incorporate this structure into our solution?
3.8.1 Linear contextual bandits
We want to model the mean reward of arm \(k\) as a function of the context, i.e. \(\mu^k(x)\). One simple model is the linear one: \(\mu^k(x) = x^\top \theta^k\), where \(x \in \mathcal{X} = \mathbb{R}^d\) and \(\theta^k \in \mathbb{R}^d\) describes a feature direction for arm \(k\). Recall that supervised learning gives us a way to estimate a conditional expectation from samples: We learn a least squares estimator from the timesteps where arm \(k\) was selected: \[
\hat \theta_t^k = \arg\min_{\theta \in \mathbb{R}^d} \sum_{\{ i \in [t] : a_i = k \}} (r_i - x_i^\top \theta)^2.
\] This has the closed-form solution known as the ordinary least squares (OLS) estimator:
\[
\begin{aligned}
\hat \theta_t^k & = (A_t^k)^{-1} \sum_{\{ i \in [t] : a_i = k \}} x_i r_i \\
\text{where} \quad A_t^k & = \sum_{\{ i \in [t] : a_i = k \}} x_i x_i^\top.
\end{aligned}
\tag{3.2}\]
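As an illustration (not from the chapter’s code), the closed form in Equation 3.2 can be computed with a few lines of numpy, given the contexts and rewards from the timesteps where arm \(k\) was selected:

Code

def ols_estimate(
    contexts: Float[Array, "n d"], rewards: Float[Array, " n"]
) -> Float[Array, " d"]:
    """OLS estimate of theta^k from the pulls of arm k (Equation 3.2)."""
    A = contexts.T @ contexts  # sum of x_i x_i^T
    b = contexts.T @ rewards   # sum of x_i r_i
    return np.linalg.solve(A, b)  # assumes A is invertible; see the discussion below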
We can now apply the UCB algorithm in this environment in order to balance exploration of new arms and exploitation of arms that we believe to have high reward. But how should we construct the upper confidence bound? Previously, we treated the pulls of an arm as i.i.d. samples and used Hoeffding’s inequality to bound the distance of the sample mean, our estimator, from the true mean. However, now our estimator is not a sample mean, but rather the OLS estimator above (Equation 3.2). Instead, we’ll use Chebyshev’s inequality to construct an upper confidence bound.
Theorem 3.3 (Chebyshev’s inequality) For a random variable \(Y\) such that \(\mathbb{E}Y = 0\) and \(\mathbb{E}Y^2 = \sigma^2\),
\[
|Y| \le \beta \sigma \quad \text{with probability at least } 1 - \frac{1}{\beta^2}.
\]
Since the OLS estimator is known to be unbiased (try proving this yourself), we can apply Chebyshev’s inequality to \(x_t^\top (\hat \theta_t^k - \theta^k)\):
\[
x_t^\top \theta^k \le x_t^\top \hat \theta_t^k + \beta \sqrt{x_t^\top (A_t^k)^{-1} x_t} \quad \text{with probability at least } 1 - \frac{1}{\beta^2}.
\]
We haven’t explained why \(x_t^\top (A_t^k)^{-1} x_t\) is the correct expression for the variance of \(x_t^\top \hat \theta_t^k\). This result follows from some algebra on the definition of the OLS estimator Equation 3.2.
The first term is exactly our predicted reward \(\hat \mu^k_t(x_t)\). To interpret the second term, note that \[
x_t^\top (A_t^k)^{-1} x_t = \frac{1}{N_t^k} x_t^\top (\Sigma_t^k)^{-1} x_t,
\] where \[
\Sigma_t^k = \frac{1}{N_t^k} \sum_{\{ i \in [t] : a_i = k \}} x_i x_i^\top
\] is the empirical covariance matrix of the contexts (assuming that the context has mean zero). That is, the learner is encouraged to choose arms when \(x_t\) is not aligned with the data seen so far, or if arm \(k\) has not been explored much and so \(N_t^k\) is small.
We can now substitute these quantities into UCB to get the LinUCB algorithm:
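The chapter’s reference implementation is not reproduced here; the sketch below shows the idea, with the context passed explicitly to choose_arm and update_history (so it does not reuse the Agent interface from before), \(c_t\) treated as a constant \(c\), and a pseudo-inverse standing in for \((A_t^k)^{-1}\) while \(A_t^k\) is still singular.

Code

class LinUCB:
    """Sketch of LinUCB for K arms with d-dimensional contexts."""

    def __init__(self, K: int, d: int, c: float = 1.0):
        self.c = c  # width multiplier, playing the role of c_t
        self.A = np.zeros((K, d, d))  # A_t^k = sum of x_i x_i^T over pulls of arm k
        self.b = np.zeros((K, d))     # sum of x_i r_i over pulls of arm k

    def choose_arm(self, x: Float[Array, " d"]) -> int:
        ucbs = np.zeros(len(self.A))
        for k in range(len(self.A)):
            A_inv = np.linalg.pinv(self.A[k])  # pseudo-inverse, since A_t^k may be singular early on
            theta_hat = A_inv @ self.b[k]      # OLS estimate for arm k (Equation 3.2)
            ucbs[k] = x @ theta_hat + self.c * np.sqrt(x @ A_inv @ x)
        return random_argmax(ucbs)

    def update_history(self, x: Float[Array, " d"], arm: int, reward: float):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x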
Note that the matrix \(A_t^k\) above might not be invertible. When does this occur? One way to address this is to include a \(\lambda I\) regularization term to ensure that \(A_t^k\) is invertible. This is equivalent to solving a ridge regression problem instead of the unregularized least squares problem. Implement this solution.
\(c_t\) is similar to the \(\log (2t/\delta')\) term of UCB: It controls the width of the confidence interval. Here, we treat it as a tunable parameter, though in a theoretical analysis, it would depend on \(A_t^k\) and the probability \(\delta\) with which the bound holds.
Using tools similar to those used for UCB, we can also prove an \(\tilde{O}(\sqrt{T})\) regret bound. The full details of the analysis can be found in Section 3 of Agarwal et al. (2022).
3.9 Summary
In this chapter, we explored the multi-armed bandit setting for analyzing sequential decision-making in an unknown environment. An MAB consists of multiple arms, each with an unknown reward distribution. The agent’s task is to learn about these through interaction, eventually minimizing the regret, which measures how suboptimal the chosen arms were.
We saw algorithms such as upper confidence bound and Thompson sampling that handle the tradeoff between exploration and exploitation, that is, the tradeoff between choosing arms that the agent is uncertain about and arms the agent already supposes to be good.
We finally discussed contextual bandits, in which the agent gets to observe some context that affects the reward distributions. We can approach these problems through supervised learning approaches.
Lai, T. L., and Herbert Robbins. 1985. “Asymptotically Efficient Adaptive Allocation Rules.” Advances in Applied Mathematics 6 (1): 4–22. https://doi.org/10.1016/0196-8858(85)90002-8.
Vershynin, Roman. 2018. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press. https://books.google.com?id=NDdqDwAAQBAJ.