12  Bandits

(For a book-length treatment of this topic, see Lattimore and Szepesvári (2020).)

Large online services, like search engines and social networks, frequently conduct experiments as they build new features and adapt to meet user needs. An online ad network may experiment with different algorithms to match ads to users, hoping to maximize revenue; a social network might experiment with different recommendation algorithms for content, hoping to increase user engagement; a search engine might experiment with different result ranking algorithms, hoping to increase the chance users click on top results.

But this experimental setting is very different from that of classic experimental design. A classic agricultural or industrial experiment has a limited sample size (because each observation is expensive); a large social network can experiment on thousands of users at once. An agricultural or industrial experiment is often run separately from the main production of the farm or factory; an advertising network experiments on its actual users as they see real ads. This changes the focus of experimental design.

Because of the scale of experiments on large web services, we can think of them as sequential experiments and use early stopping or adaptive design, as discussed in the previous chapters. But because the experiments are conducted on actual users, whose actions involve real revenue and expenses for the company, there is a secondary concern. We could use any classic experimental design to find the treatment that maximizes revenue, carefully designing the experiment to minimize the sample size needed; but since we have as many experimental units as we want, why not find the best treatment while also maximizing revenue during the experiment itself?

For instance, if we have \(K\) distinct treatments, we could start by randomly assigning users to treatments. As we gather evidence that one treatment yields more revenue, we could give it to larger and larger proportions of users, until we have conclusive evidence that it is the best—when we can switch to giving it to every user. This way, we both learn about the best treatment and use it to improve revenue during the experiment.

The classic framework for this type of experimental problem is the bandit framework.

12.1 Bandit framework

Bandits get their name from one-armed bandits—slot machines. We’ll imagine we have \(K\) separate one-armed bandits (or, equivalently, one multi-armed bandit with \(K\) arms). These are our treatments. We observe a sequence of rounds, \(t = 1, \dots, n\); the number of rounds \(n\) is known as the horizon, and may or may not be fixed in advance.

In each round of the bandit problem, a learner must decide which bandit to pick. (Or, to translate, the experimenter must decide how to allocate each unit to a treatment.) Once the learner selects a bandit, the environment (nature) reveals a reward (the unit’s response to that treatment). Each bandit has a different reward distribution; at the start of the experiment, we do not know these distributions, but we gradually learn them as we observe rewards. The learner’s goal is to assign units to the bandits with the highest reward, thus maximizing the total reward. Let’s state this formally.

Definition 12.1 (Multi-armed bandit) A finite-armed stochastic bandit with \(K\) arms is a tuple \(\nu = (P_1, \dots, P_K)\) of distributions. A learner interacts with the bandit sequentially over \(n\) rounds. In round \(t\):

  1. The learner selects an action \(A_t \in \mathcal{A}\). In a finite-armed bandit, the action is the choice of arm, so \(\mathcal{A} = \{1, \dots, K\}\).
  2. The environment reveals the reward \(X_t \sim P_{A_t}\).

If the bandit is stationary, the distributions \(P_k\) are fixed across rounds.
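
To make the interaction concrete, here is a minimal sketch in Python of a stationary finite-armed bandit with Bernoulli rewards. The class and method names (`BernoulliBandit`, `pull`) are illustrative choices, not standard terminology.

```python
import numpy as np

class BernoulliBandit:
    """A stationary K-armed bandit: arm k pays reward 1 with probability means[k]."""

    def __init__(self, means, rng=None):
        self.means = np.asarray(means, dtype=float)   # one mean reward per arm
        self.rng = rng if rng is not None else np.random.default_rng()

    @property
    def n_arms(self):
        return len(self.means)

    def pull(self, arm):
        """The environment reveals the reward X_t ~ P_{A_t} for the chosen arm."""
        return float(self.rng.random() < self.means[arm])
```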

We can hence describe a simple Bernoulli experiment (Definition 4.3) as a two-armed bandit. The distributions \(P_1\) and \(P_2\) give the distribution of the response under treatment and control, and the learner’s actions are chosen by Bernoulli draws. In industry, this special case is often referred to as A/B testing.

Example 12.1 (A/B testing) In an A/B test, a company has two versions of a product, A and B. Customers are randomly assigned to receive either A or B, and a response is measured. For example, A and B might be two different versions of a website, two different recommendation algorithms, or two different ads, and the response is the revenue or engagement. The goal is to maximize the response over time, so an A/B test is a two-armed bandit problem.
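
Continuing the illustrative `BernoulliBandit` sketch above, a simple (non-adaptive) A/B test assigns each customer to version A or B by a fair coin flip; the conversion rates below are made up purely for illustration.

```python
# A/B test viewed as a two-armed bandit with Bernoulli(1/2) assignment.
bandit = BernoulliBandit(means=[0.10, 0.12])   # hypothetical conversion rates
rng = np.random.default_rng(0)

rewards = {0: [], 1: []}
for t in range(10_000):
    arm = int(rng.integers(2))    # randomly assign this customer to A or B
    rewards[arm].append(bandit.pull(arm))

for arm, obs in rewards.items():
    print(arm, np.mean(obs))      # estimated mean response under each version
```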

Typically we deal with unstructured bandits, where learning about one distribution \(P_i\) gives no information about another distribution \(P_j\). A structured bandit, where the distributions are related so that observing one arm is informative about others, introduces extra complications.

In the bandit setting, however, we are not primarily interested in learning the distributions themselves but in the cumulative reward, \(\sum_t X_t\). We want to define a policy: a method of selecting the action at each round, such that we maximize the cumulative reward. The policy determines how we learn from the observed sequence to decide which action to take next.

Definition 12.2 (Policy) At round \(t\), the history is the sequence of actions and rewards up to \(t\): \(H_{t - 1} = ((A_1, X_1), \dots, (A_{t-1}, X_{t-1}))\). A policy is a function \(\pi\) mapping from histories to actions in \(\mathcal{A}\).

For example, you could imagine a simple policy that uses the history to estimate the average reward for each action (each bandit), then picks the action with the highest average reward in the next round. This seems simple enough, but there are complications: What do we do before we have observed rewards from every action? And how do we get more precise estimates of each action’s reward if we do not systematically gather more data on each?
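
As an illustration, here is a minimal sketch of such a greedy policy, using the convention (one possible answer to the first question above) of trying any untried arm before trusting the empirical means. The function name and the history format are our own choices, following Definition 12.2.

```python
def greedy_policy(history, n_arms):
    """Pick the arm with the highest average observed reward.

    `history` is a list of (action, reward) pairs, as in Definition 12.2.
    Arms that have never been tried are selected first, since we have no
    estimate of their mean reward yet.
    """
    totals = np.zeros(n_arms)
    counts = np.zeros(n_arms)
    for arm, reward in history:
        totals[arm] += reward
        counts[arm] += 1
    if np.any(counts == 0):                  # try every arm at least once
        return int(np.argmin(counts))
    return int(np.argmax(totals / counts))   # otherwise, best empirical mean
```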

Hence we need a way to evaluate policies and judge how good they are. Intuitively, we should compare the reward we obtain by following the policy to the best possible reward we could obtain from any policy; the difference is the regret.

Definition 12.3 (Regret) The regret of a policy \(\pi\) for a bandit is the difference between the expected reward of the best possible policy and the expected reward of the chosen policy.

For example, in a multi-armed bandit \(\nu\) (Definition 12.1), the best policy would be to always choose the arm with the highest expected reward. Let \[ \mu_i(\nu) = \int x \dif P_i(x) \] be the mean reward for bandit arm \(i\). The best possible expected reward is \(\mu^*(\nu) = \max_i \mu_i(\nu)\). Then the regret under a policy \(\pi\) is \[ R_n(\pi, \nu) = n \mu^*(\nu) - \E\left[ \sum_{t=1}^n X_t \right]. \]
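
To connect the definition to simulation, here is a hedged sketch of estimating regret empirically, reusing the illustrative `BernoulliBandit` class and policy convention above; the helper name `simulate_regret` is our own. It returns the realized regret \(n \mu^*(\nu) - \sum_t X_t\), whose expectation over runs is \(R_n(\pi, \nu)\).

```python
def simulate_regret(policy, bandit, n):
    """Run `policy` for n rounds on `bandit` and return the realized regret,
    n * mu_star - (total reward); its expectation over runs is R_n(pi, nu)."""
    history = []
    total_reward = 0.0
    for t in range(n):
        arm = policy(history, bandit.n_arms)
        reward = bandit.pull(arm)
        history.append((arm, reward))
        total_reward += reward
    return n * bandit.means.max() - total_reward

# Toy usage: simulate_regret(greedy_policy, BernoulliBandit([0.5, 0.6]), 1_000)
```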

There is always a policy with zero regret—but it requires knowing each distribution \(P_i\) in the bandit in advance, so you can always make the best possible choice. Usually, however, we don’t know the distributions in advance, so we could only select an optimal policy with the benefit of hindsight.1 Without this foreknowledge, we should expect that any reasonable policy will have high regret in the first few rounds (i.e., for small \(n\)), but as the history grows, the policy should learn the best possible action and the regret should stop growing.

Consequently, we are often interested in the rate of the regret, meaning its rate of growth as \(n \to \infty\). For example, a policy might be said to be “\(O(\sqrt{n})\)”, meaning that its regret grows at a rate bounded by some multiple of \(\sqrt{n}\). A good policy should certainly do better than \(O(n)\), which would imply the regret grows linearly with \(n\)—as though we do not learn from experience, continuing to make decisions we regret regardless of how much history is available.

TODO immediate regret decomposition?

12.2 Common policy choices

12.2.1 Explore-then-commit

One intuitive policy choice is to explore all \(K\) arms of the bandit by trying each repeatedly, say \(m\) times. After \(mK\) rounds, we can use the observed data to estimate the mean reward for each arm. We then select the arm with the highest mean reward and use it for all future actions.

Or, to put it another way, this is a classic experimental design setup: we conduct an experiment (in this case, a simple one-way balanced design) to find the arm with the highest treatment mean. Once the experiment is done and we have the results, we use the arm selected by the experiment from then on. In this sense, any classic industrial experiment is an explore-then-commit policy.
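
Here is a minimal sketch of explore-then-commit as a policy in the sense of Definition 12.2, compatible with the `simulate_regret` helper above; the factory-function style and the names are illustrative.

```python
def make_explore_then_commit(m):
    """Explore-then-commit: try each arm m times (round-robin), then always
    play the arm with the best empirical mean from the exploration phase."""
    def policy(history, n_arms):
        t = len(history)
        if t < m * n_arms:
            return t % n_arms                # exploration: round-robin over arms
        totals = np.zeros(n_arms)
        for arm, reward in history[: m * n_arms]:
            totals[arm] += reward
        return int(np.argmax(totals / m))    # commit to the best empirical mean
    return policy
```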

The regret of explore-then-commit depends heavily on \(m\). If we pick a small value, say \(m = 1\), our estimates \(\hat \mu_i(\nu)\) of the expected reward for each arm will have high variance, and so there is a good chance we will pick the wrong arm. If we pick a large value, say \(m = 100\), the estimates will be much better, but we may waste time gathering 100 observations of the worst arm, thus incurring lots of regret.

This tradeoff can be shown theoretically by bounding \(R_n(\pi, \nu)\) (Lattimore and Szepesvári 2020, theorem 6.1). The bound has two terms: one, increasing in \(m\), from the regret incurred by choosing bad arms \(m\) times each during exploration; and one, decreasing in \(m\), because larger values of \(m\) make it more likely we ultimately commit to the best arm. Unfortunately, we cannot use this bound to select the best \(m\) without knowing details of the bandit in advance.
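
With the illustrative helpers above, one way to see the tradeoff is to average the simulated regret over repeated runs for several values of \(m\); the means, horizon, and number of repetitions below are arbitrary toy choices.

```python
# Toy comparison of exploration lengths m for a two-armed Bernoulli bandit.
for m in (1, 5, 25, 100):
    regrets = [
        simulate_regret(make_explore_then_commit(m),
                        BernoulliBandit([0.5, 0.6], rng=np.random.default_rng(rep)),
                        n=2_000)
        for rep in range(200)
    ]
    print(m, np.mean(regrets))
```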

12.2.2 Upper confidence bound

The intuition behind explore-then-commit suggests we can do better with a dynamic approach. Clearly we need to try each arm at least once, but afterwards we can base our decisions on the estimated rewards. We should try promising arms more often, gathering more data and reducing our uncertainty about their mean rewards; as we gain more information, we can keep updating our sense of which arm could be the best.

We can express this idea as a principle of optimism. We are uncertain about each arm’s mean reward, so let’s assume each is as large as is plausible: as large as a confidence interval for its mean extends. We then pick the arm whose confidence interval extends the highest, gaining information about that arm and shrinking its interval.

Formally, the upper confidence bound policy works as follows:

  1. Select each arm once.
  2. Using the observed rewards, form confidence intervals for \(\mu_i(\nu)\) for each \(i\).
  3. In the next round, select the arm \(i\) whose confidence interval has the highest upper bound.
  4. Using the newly observed reward, return to step 2 and recompute the confidence intervals.

In principle, we could use any reasonable method to form the confidence intervals; for instance, we could assume the rewards are normally distributed, estimate their variances, and use \(t\)-based confidence intervals. In practice, people who study bandits tend to avoid distributional assumptions, so they base their confidence intervals on probability inequalities, such as Hoeffding’s inequality for bounded rewards, that hold for large classes of distributions.
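
Here is a sketch of this policy using one common nonparametric choice, a Hoeffding-style bound for rewards in \([0, 1]\), so the upper confidence bound for arm \(i\) is \(\hat\mu_i + \sqrt{2 \log t / T_i}\), where \(T_i\) is the number of times arm \(i\) has been played so far. The exact constant in the bonus term varies across treatments of UCB; this one is chosen for illustration.

```python
def ucb_policy(history, n_arms):
    """Upper confidence bound policy for rewards in [0, 1], using the
    Hoeffding-style bonus sqrt(2 * log(t) / T_i) as the upper bound."""
    t = len(history)
    if t < n_arms:
        return t                             # step 1: select each arm once
    totals = np.zeros(n_arms)
    counts = np.zeros(n_arms)
    for arm, reward in history:
        totals[arm] += reward
        counts[arm] += 1
    means = totals / counts
    bonus = np.sqrt(2 * np.log(t) / counts)  # width of the optimistic interval
    return int(np.argmax(means + bonus))     # steps 2-3: highest upper bound
```

Because it is written as a policy in the sense of Definition 12.2, it can be compared against explore-then-commit using the same `simulate_regret` helper sketched earlier.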

TODO

12.3 Contextual bandits

TODO bandits with covariates

12.4 Exercises

TODO


  1. In this sense, “regret” is a very well-chosen name.