Bandit Algorithms
Tor Lattimore and Csaba Szepesvári
This is the (free) online edition. The content is the same as the print edition, published by Cambridge University Press, except that minor typos are corrected here. There are also font and other typographical differences that mean the page numbers do not match between the versions.
Contents

Part I: Bandits, Probability and Concentration

1 Introduction
   1.1 The Language of Bandits
   1.2 Applications
   1.3 Notes
   1.4 Bibliographic Remarks

2 Foundations of Probability
   2.1 Probability Spaces and Random Elements
   2.2 σ-Algebras and Knowledge
   2.3 Conditional Probabilities
   2.4 Independence
   2.5 Integration and Expectation
   2.6 Conditional Expectation
   2.7 Notes
   2.8 Bibliographic Remarks
   2.9 Exercises

3 Stochastic Processes and Markov Chains
   3.1 Stochastic Processes
   3.2 Markov Chains
   3.3 Martingales and Stopping Times
   3.4 Notes
   3.5 Bibliographic Remarks
   3.6 Exercises

4 Stochastic Bandits
   4.1 Core Assumptions
   4.2 The Learning Objective
   4.3 Knowledge and Environment Classes
   4.4 The Regret
   4.5 Decomposing the Regret
   4.6 The Canonical Bandit Model
   4.7 The Canonical Bandit Model for Uncountable Action Sets
   4.8 Notes
   4.9 Bibliographical Remarks
   4.10 Exercises

5 Concentration of Measure
   5.1 Tail Probabilities
   5.2 The Inequalities of Markov and Chebyshev
   5.3 The Cramér-Chernoff Method and Subgaussian Random Variables
   5.4 Notes
   5.5 Bibliographical Remarks
   5.6 Exercises

Part II: Stochastic Bandits with Finitely Many Arms

6 The Explore-Then-Commit Algorithm
   6.1 Algorithm and Regret Analysis
   6.2 Notes
   6.3 Bibliographical Remarks
   6.4 Exercises

7 The Upper Confidence Bound Algorithm
   7.1 The Optimism Principle
   7.2 Notes
   7.3 Bibliographical Remarks
   7.4 Exercises

8 The Upper Confidence Bound Algorithm: Asymptotic Optimality
   8.1 Asymptotically Optimal UCB
   8.2 Notes
   8.3 Bibliographic Remarks
   8.4 Exercises

9 The Upper Confidence Bound Algorithm: Minimax Optimality
   9.1 The MOSS Algorithm
   9.2 Two Problems
   9.3 Notes
   9.4 Bibliographic Remarks
   9.5 Exercises

10 The Upper Confidence Bound Algorithm: Bernoulli Noise
   10.1 Concentration for Sums of Bernoulli Random Variables
   10.2 The KL-UCB Algorithm
   10.3 Notes
   10.4 Bibliographic Remarks
   10.5 Exercises

Part III: Adversarial Bandits with Finitely Many Arms

11 The Exp3 Algorithm
   11.1 Adversarial Bandit Environments
   11.2 Importance-Weighted Estimators
   11.3 The Exp3 Algorithm
   11.4 Regret Analysis
   11.5 Notes
   11.6 Bibliographic Remarks
   11.7 Exercises

12 The Exp3-IX Algorithm
   12.1 The Exp3-IX Algorithm
   12.2 Regret Analysis
   12.3 Notes
   12.4 Bibliographic Remarks
   12.5 Exercises

Part IV: Lower Bounds for Bandits with Finitely Many Arms

13 Lower Bounds: Basic Ideas
   13.1 Main Ideas Underlying Minimax Lower Bounds
   13.2 Notes
   13.3 Bibliographic Remarks
   13.4 Exercises

14 Foundations of Information Theory
   14.1 Entropy and Optimal Coding
   14.2 Relative Entropy
   14.3 Notes
   14.4 Bibliographic Remarks
   14.5 Exercises

15 Minimax Lower Bounds
   15.1 Relative Entropy Between Bandits
   15.2 Minimax Lower Bounds
   15.3 Notes
   15.4 Bibliographic Remarks
   15.5 Exercises

16 Instance-Dependent Lower Bounds
   16.1 Asymptotic Bounds
   16.2 Finite-Time Bounds
   16.3 Notes
   16.4 Bibliographic Remarks
   16.5 Exercises

17 High-Probability Lower Bounds
   17.1 Stochastic Bandits
   17.2 Adversarial Bandits
   17.3 Notes
   17.4 Bibliographic Remarks
   17.5 Exercises

Part V: Contextual and Linear Bandits

18 Contextual Bandits
   18.1 Contextual Bandits: One Bandit per Context
   18.2 Bandits with Expert Advice
   18.3 Exp4
   18.4 Regret Analysis
   18.5 Notes
   18.6 Bibliographic Remarks
   18.7 Exercises

19 Stochastic Linear Bandits
   19.1 Stochastic Contextual Bandits
   19.2 Stochastic Linear Bandits
   19.3 Regret Analysis
   19.4 Notes
   19.5 Bibliographic Remarks
   19.6 Exercises

20 Confidence Bounds for Least Squares Estimators
   20.1 Martingales and the Method of Mixtures
   20.2 Notes
   20.3 Bibliographic Remarks
   20.4 Exercises

21 Optimal Design for Least Squares Estimators
   21.1 The Kiefer–Wolfowitz Theorem
   21.2 Notes
   21.3 Bibliographic Remarks
   21.4 Exercises

22 Stochastic Linear Bandits with Finitely Many Arms
   22.1 Notes
   22.2 Bibliographic Remarks
   22.3 Exercises

23 Stochastic Linear Bandits with Sparsity
   23.1 Sparse Linear Stochastic Bandits
   23.2 Elimination on the Hypercube
   23.3 Online to Confidence Set Conversion
   23.4 Sparse Online Linear Prediction
   23.5 Notes
   23.6 Bibliographical Remarks
   23.7 Exercises

24 Minimax Lower Bounds for Stochastic Linear Bandits
   24.1 Hypercube
   24.2 Unit Ball
   24.3 Sparse Parameter Vectors
   24.4 Misspecified Models
   24.5 Notes
   24.6 Bibliographic Remarks
   24.7 Exercises

25 Asymptotic Lower Bounds for Stochastic Linear Bandits
   25.1 An Asymptotic Lower Bound for Fixed Action Sets
   25.2 Clouds Looming for Optimism
   25.3 Notes
   25.4 Bibliographic Remarks
   25.5 Exercises

Part VI: Adversarial Linear Bandits

26 Foundations of Convex Analysis
   26.1 Convex Sets and Functions
   26.2 Jensen's Inequality
   26.3 Bregman Divergence
   26.4 Legendre Functions
   26.5 Optimisation
   26.6 Projections
   26.7 Notes
   26.8 Bibliographic Remarks
   26.9 Exercises

27 Exp3 for Adversarial Linear Bandits
   27.1 Exponential Weights for Linear Bandits
   27.2 Regret Analysis
   27.3 Continuous Exponential Weights
   27.4 Notes
   27.5 Bibliographic Remarks
   27.6 Exercises

28 Follow-the-Regularised-Leader and Mirror Descent
   28.1 Online Linear Optimisation
   28.2 Regret Analysis
   28.3 Application to Linear Bandits
   28.4 Linear Bandits on the Unit Ball
   28.5 Notes
   28.6 Bibliographic Remarks
   28.7 Exercises

29 The Relation between Adversarial and Stochastic Linear Bandits
   29.1 Unified View
   29.2 Reducing Stochastic Linear Bandits to Adversarial Linear Bandits
   29.3 Stochastic Linear Bandits with Parameter Noise
   29.4 Contextual Linear Bandits
   29.5 Notes
   29.6 Bibliographic Remarks
   29.7 Exercises

Part VII: Other Topics

30 Combinatorial Bandits
   30.1 Notation and Assumptions
   30.2 Applications
   30.3 Bandit Feedback
   30.4 Semi-bandit Feedback and Mirror Descent
   30.5 Follow-the-Perturbed-Leader
   30.6 Notes
   30.7 Bibliographic Remarks
   30.8 Exercises

31 Non-stationary Bandits
   31.1 Adversarial Bandits
   31.2 Stochastic Bandits
   31.3 Notes
   31.4 Bibliographic Remarks
   31.5 Exercises

32 Ranking
   32.1 Click Models
   32.2 Policy
   32.3 Regret Analysis
   32.4 Notes
   32.5 Bibliographic Remarks
   32.6 Exercises

33 Pure Exploration
   33.1 Simple Regret
   33.2 Best-Arm Identification with a Fixed Confidence
   33.3 Best-Arm Identification with a Budget
   33.4 Notes
   33.5 Bibliographical Remarks
   33.6 Exercises

34 Foundations of Bayesian Learning
   34.1 Statistical Decision Theory and Bayesian Learning
   34.2 Bayesian Learning and the Posterior Distribution
   34.3 Conjugate Pairs, Conjugate Priors and the Exponential Family
   34.4 The Bayesian Bandit Environment
   34.5 Posterior Distributions in Bandits
   34.6 Bayesian Regret
   34.7 Notes
   34.8 Bibliographic Remarks
   34.9 Exercises

35 Bayesian Bandits
   35.1 Bayesian Optimal Regret for k-Armed Stochastic Bandits
   35.2 Optimal Stopping
   35.3 One-armed Bayesian Bandits
   35.4 Gittins Index
   35.5 Computing the Gittins Index
   35.6 Notes
   35.7 Bibliographical Remarks
   35.8 Exercises

36 Thompson Sampling
   36.1 Finite-Armed Bandits
   36.2 Frequentist Analysis
   36.3 Linear Bandits
   36.4 Information Theoretic Analysis
   36.5 Notes
   36.6 Bibliographic Remarks
   36.7 Exercises

Part VIII: Beyond Bandits

37 Partial Monitoring
   37.1 Finite Adversarial Partial Monitoring Problems
   37.2 The Structure of Partial Monitoring
   37.3 Classification of Finite Adversarial Partial Monitoring
   37.4 Lower Bounds
   37.5 Policy and Upper Bounds
   37.6 Proof of Theorem 37.16
   37.7 Proof of Theorem 37.17
   37.8 Proof of the Classification Theorem
   37.9 Notes
   37.10 Bibliographical Remarks
   37.11 Exercises

38 Markov Decision Processes
   38.1 Problem Set-Up
   38.2 Optimal Policies and the Bellman Optimality Equation
   38.3 Finding an Optimal Policy
   38.4 Learning in Markov Decision Processes
   38.5 Upper Confidence Bounds for Reinforcement Learning
   38.6 Proof of Upper Bound
   38.7 Proof of Lower Bound
   38.8 Notes
   38.9 Bibliographical Remarks
   38.10 Exercises

Bibliography
Index

Preface
Multi-armed bandits have now been studied for nearly a century. While research in the beginning was quite meandering, there is now a large community publishing hundreds of articles every year. Bandit algorithms are also finding their way into practical applications in industry, especially in on-line platforms where data is readily available and automation is the only way to scale.
We had hoped to write a comprehensive book, but the literature is now so vast that many topics have been excluded. In the end we settled on the more modest goal of equipping our readers with enough expertise to explore the specialised literature by themselves, and to adapt existing algorithms to their applications. This latter point is important. Problems in theory are all alike; every application is different. A practitioner seeking to apply a bandit algorithm needs to understand which assumptions in the theory are important and how to modify the algorithm when the assumptions change. We hope this book can provide that understanding.
What is covered in the book is covered in some depth. The focus is on the mathematical analysis of algorithms for bandit problems, but this is not a traditional mathematics book, where lemmas are followed by proofs, theorems and more lemmas. We worked hard to include guiding principles for designing algorithms and intuition for their analysis. Many algorithms are accompanied by empirical demonstrations that further aid intuition.
We expect our readers to be familiar with basic analysis and calculus and some linear algebra. The book uses the notation of measure-theoretic probability theory, but does not rely on any deep results. A dedicated chapter is included to introduce the notation and provide intuitions for the basic results we need. This chapter is unusual for an introduction to measure theory in that it emphasises the reasons to use σ-algebras beyond the standard technical justifications. We hope this will convince the reader that measure theory is an important and intuitive tool. Some chapters use techniques from information theory and convex analysis, and we devote a short chapter to each.
Most chapters are short and should be readable in an afternoon or presented in a single lecture. Some components of the book contain content that is not really about bandits. These can be skipped by knowledgeable readers, or otherwise referred to when necessary. They are marked with a kangaroo symbol because ‘Skippy the Kangaroo’ skips things.1 The same mark is used for those parts that contain useful, but perhaps overly specific information for the first-time reader. Later parts will not build on these chapters in any substantial way. Most chapters end with a list of notes and exercises. These are intended to deepen intuition and highlight the connections between various subsections and the literature. There is a table of notation at the end of this preface.
We’re indebted to our many collaborators and feel privileged that there are too many of you to name. The University of Alberta, Indiana University and DeepMind have all provided outstanding work environments and supported the completion of this book. The book has benefited enormously from the proofreading efforts of a large number of our friends and colleagues. We are sorry for all the mistakes introduced after your hard work. Alphabetically, they are: Aaditya Ramdas, Abbas Mehrabian, Aditya Gopalan, Ambuj Tewari, András György, Arnoud den Boer, Branislav Kveton, Brendan Patch, Chao Tao, Chao Qin, Christoph Dann, Claire Vernade, David Janz, Emilie Kaufmann, Eugene Ji, Gellért Weisz, Gergely Neu, Johannes Kirschner, Julian Zimmert, Kwang-Sung Jun, Lalit Jain, Laurent Orseau, Marcus Hutter, Michal Valko, Omar Rivasplata, Pierre Menard, Ramana Kumar, Roman Pogodin, Ronald Ortner, Ronan Fruit, Ruihao Zhu, Shuai Li, Toshiyuki Tanaka, Wei Chen, Yoan Russac, Yufei Yi and Zhu Xiaohu. We are especially grateful to Gábor Balázs and Wouter Koolen, who both read almost the entire book. Thanks to Lauren Cowles and Cambridge University Press for providing free books for our proofreaders, tolerating the delays and for supporting a freely available PDF version. Réka Szepesvári is responsible for converting some of our primary school figures to their current glory. Last of all, our families have endured endless weekends of editing and multiple false promises of ‘done by Christmas’. Rosina and Beáta, it really is done now!
1 Taking inspiration from Tor’s grandfather-in-law, John Dillon [Anderson et al., 1977].
Notation

Some sections are marked with special symbols, which are listed and described below.
A warning to the reader.
Something important.
An experiment.
This symbol is a note. Usually this is a remark that is slightly tangential to the topic at hand.

Nomenclature and Conventions
A sequence (a_n)_{n=1}^∞ is increasing if a_{n+1} ≥ a_n for all n and decreasing if a_{n+1} ≤ a_n. When the inequalities are strict, we say strictly increasing/decreasing. The same terminology holds for functions. We will not be dogmatic about what is the range of argmin/argmax. Sometimes they return sets, sometimes arbitrary elements of those sets and, where stated, specific elements of those sets. We will be specific when it is non-obvious/matters. The infimum of the empty set is inf ∅ = ∞ and the supremum is sup ∅ = −∞. The empty sum is Σ_{i∈∅} a_i = 0 and the empty product is Π_{i∈∅} a_i = 1.
Landau Notation
We make frequent use of the Bachmann–Landau notation. Both were nineteenth century mathematicians who could have never expected their notation to be adopted so enthusiastically by computer scientists. Given functions f, g : N → [0, ∞), define

f(n) = O(g(n)) ⇔ limsup_{n→∞} f(n)/g(n) < ∞,
f(n) = o(g(n)) ⇔ lim_{n→∞} f(n)/g(n) = 0,
f(n) = Ω(g(n)) ⇔ liminf_{n→∞} f(n)/g(n) > 0,
f(n) = ω(g(n)) ⇔ liminf_{n→∞} f(n)/g(n) = ∞,
f(n) = Θ(g(n)) ⇔ f(n) = O(g(n)) and f(n) = Ω(g(n)).
We make use of the (Bachmann–)Landau notation in two contexts. First, in proofs where limiting arguments are made, we sometimes write lower-order terms using Landau notation. For example, we might write that f(n) = √n + o(√n), by which we mean that lim_{n→∞} f(n)/√n = 1. In this case we use the mathematical definitions as envisaged by Bachmann and Landau. The second usage is to informally describe a result without the clutter of uninteresting constants. For better or worse, this usage is often a little imprecise. For example, we will often write expressions of the form Rn = O(m√dn). Almost always what is meant by this is that there exists a universal constant c > 0 (a constant that does not depend on either of the quantities involved) such that Rn ≤ cm√dn for all (reasonable) choices of m, d and n. In this context we are careful not to use Landau notation to hide large lower-order terms. For example, if f(x) = x2 + 10100x, we will not write f(x) = O(x2), although this would be true.
Bandits

A_t: action in round t
k: number of arms/actions
n: time horizon
X_t: reward in round t
Y_t: loss in round t
μ_i: mean reward of arm i

Sets

∅: empty set
N, N⁺: natural numbers, N = {0, 1, 2, . . .} and N⁺ = N \ {0}
R: real numbers
[n]: {1, 2, 3, . . . , n − 1, n}
2^A: the power set of set A (the set of all subsets of A)
A*: set of finite sequences over A, A* = ∪_{i=0}^∞ A^i
B_2^d: d-dimensional unit ball, {x ∈ R^d : ∥x∥_2 ≤ 1}
P_d: probability simplex, {x ∈ [0, 1]^{d+1} : ∥x∥_1 = 1}
P(A): set of distributions over set A
B(A): Borel σ-algebra on A
[x, y]: convex hull of vectors or real values x and y

Functions, Operators and Operations

|A|: the cardinality (number of elements) of the finite set A
(x)⁺: max(x, 0)
a mod b: remainder when natural number a is divided by b
⌊x⌋, ⌈x⌉: floor and ceiling functions of x
dom(f): domain of function f
E: expectation
Supp: support of distribution or random variable
∇f(x): gradient of f at x
∇_v f(x): directional derivative of f at x in direction v
∇²f(x): Hessian of f at x
∨, ∧: maximum and minimum, a ∨ b = max(a, b) and a ∧ b = min(a, b)
erf(x): (2/√π) ∫_0^x exp(−y²) dy
erfc(x): 1 − erf(x)
Γ(z): Gamma function, Γ(z) = ∫_0^∞ x^{z−1} exp(−x) dx
φ_A(x): support function, φ_A(x) = sup_{y∈A} ⟨x, y⟩
f*(y): convex conjugate, f*(y) = sup_{x∈A} ⟨x, y⟩ − f(x)
(n k): binomial coefficient, ‘n choose k’
argmax_x f(x): maximiser or maximisers of f
argmin_x f(x): minimiser or minimisers of f
I{φ}: indicator function: converts Boolean φ into binary
I_B: indicator of set B
D(P, Q): relative entropy between probability distributions P and Q
d(p, q): relative entropy between B(p) and B(q)

Linear Algebra

e_1, . . . , e_d: standard basis vectors of the d-dimensional Euclidean space
0, 1: vectors whose elements are all zeros and all ones, respectively
det(A): determinant of matrix A
trace(A): trace of matrix A
im(A): image of matrix A
ker(A): kernel of matrix A
span(v_1, . . . , v_d): span of vectors v_1, . . . , v_d
λ_min(G): minimum eigenvalue of matrix G
⟨x, y⟩: inner product, ⟨x, y⟩ = Σ_i x_i y_i
∥x∥_p: p-norm of vector x
∥x∥²_G: x⊤Gx for positive definite G ∈ R^{d×d} and x ∈ R^d
A ≼ B (A ≺ B): Loewner partial order of positive semidefinite matrices: A ≼ B (A ≺ B) if B − A is positive semidefinite (respectively, definite)

Distributions

N(μ, σ²): Normal distribution with mean μ and variance σ²
B(p): Bernoulli distribution with mean p
U(a, b): uniform distribution supported on [a, b]
Beta(α, β): Beta distribution with parameters α, β > 0
δ_x: Dirac distribution with point mass at x

Topological

cl(A): closure of set A
int(A): interior of set A
∂A: boundary of a set A, ∂A = cl(A) \ int(A)
co(A): convex hull of A
aff(A): affine hull of A
ri(A): relative interior of A
Part I: Bandits, Probability and Concentration
1 Introduction
Bandit problems were introduced by William R. Thompson in an article published in 1933 in Biometrika. Thompson was interested in medical trials and the cruelty of running a trial blindly, without adapting the treatment allocations on the fly as the drug appears more or less effective. The name comes from the 1950s, when Frederick Mosteller and Robert Bush decided to study animal learning and ran trials on mice and then on humans. The mice faced the dilemma of choosing to go left or right after starting in the bottom of a T-shaped maze, not knowing each time at which end they would find food. To study a similar learning setting in humans, a ‘two-armed bandit’ machine was commissioned where humans could choose to pull either the left or the right arm of the machine, each giving a random pay-off with the distribution of pay-offs for each arm unknown to the human player. The machine was called a ‘two-armed bandit’ in homage to the one-armed bandit, an old-fashioned name for a lever-operated slot machine (‘bandit’ because they steal your money).

Figure 1.1 Mouse learning a T-maze.
There are many reasons to care about bandit problems. Decision-making with uncertainty is a challenge we all face, and bandits provide a simple model of this dilemma. Bandit problems also have practical applications. We already mentioned clinical trial design, which researchers have used to motivate their work for 80 years. We can’t point to an example where bandits have actually been used in clinical trials, but adaptive experimental design is gaining popularity and is actively encouraged by the US Food and Drug Administration, with the justification that not doing so can lead to the withholding of effective drugs until long after a positive effect has been established.
While clinical trials are an important application for the future, there are applications where bandit algorithms are already in use. Major tech companies use bandit algorithms for configuring web interfaces, where applications include news recommendation, dynamic pricing and ad placement. A bandit algorithm plays a role in Monte Carlo Tree Search, an algorithm made famous by the recent success of AlphaGo.
Finally, the mathematical formulation of bandit problems leads to a rich structure with connections to other branches of mathematics. In writing this book (and previous papers), we have read books on convex analysis/optimisation, Brownian motion, probability theory, concentration analysis, statistics, differential geometry, information theory, Markov chains, computational complexity and more. What fun!
A combination of all these factors has led to an enormous growth in research over the last two decades. Google Scholar reports less than 1000, then 2700 and 7000 papers when searching for the phrase ‘bandit algorithm’ for the periods of 2001–5, 2006–10, and 2011–15, respectively, and the trend just seems to have strengthened since then, with 5600 papers coming up for the period of 2016 to the middle of 2018. Even if these numbers are somewhat overblown, they are indicative of a rapidly growing field. This could be a fashion, or maybe there is something interesting happening here. We think that the latter is true.
A Classical Dilemma
Imagine you are playing a two-armed bandit machine and you already pulled each lever five times, resulting in the following pay-offs (in dollars):
Pull        1    2    3    4    5
left arm    0   10    0    0   10
right arm  10    0    0    0    0
The left arm appears to be doing slightly better. The average pay-off for this arm is $4, while the average for the right arm is only $2. Let's say you have 10 more trials (pulls) altogether. What is your strategy? Will you keep pulling the left arm, ignoring the right? Or would you attribute the poor performance of the right arm to bad luck and try it a few more times? How many more times? This illustrates one of the main interests in bandit problems. They capture the fundamental dilemma a learner faces when choosing between uncertain options. Should one explore an option that looks inferior or exploit by going with the option that looks best currently? Finding the right balance between exploration and exploitation is at the heart of all bandit problems.

Figure 1.2 Two-armed bandit.
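To make the arithmetic concrete, here is a minimal Python sketch (our illustration, not code from the book) that recomputes the empirical means above and shows one naive way to spend the remaining pulls: an epsilon-greedy rule with an assumed epsilon of 0.1, which usually exploits the empirically better arm but occasionally explores at random.

import random

# Observed pay-offs from the first ten rounds (five pulls per arm),
# matching the table above; the order of the pulls does not matter here.
left_payoffs = [0, 10, 0, 0, 10]
right_payoffs = [10, 0, 0, 0, 0]

def mean(xs):
    return sum(xs) / len(xs)

empirical_means = [mean(left_payoffs), mean(right_payoffs)]
print(empirical_means)  # [4.0, 2.0] dollars per pull

# A simple (not necessarily good) continuation rule: with probability
# epsilon pick an arm uniformly at random, otherwise pick the arm with
# the largest empirical mean so far.
def epsilon_greedy_choice(means, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(len(means))
    return max(range(len(means)), key=lambda i: means[i])

print("next pull:", ["left", "right"][epsilon_greedy_choice(empirical_means)])

Whether an occasional exploratory pull of the right arm is worth its expected cost is exactly the kind of question the rest of the book makes precise.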
1.1 The Language of Bandits
A bandit problem is a sequential game between a learner and an environment. The game is played over n rounds, where n is a positive natural number called the horizon. In each round t ∈ [n], the learner first chooses an action At from a given set A, and the environment then reveals a reward Xt ∈ R.
Of course the learner cannot peek into the future when choosing their actions, which means that At should only depend on the history Ht−1 = (A1, X1, . . . , At−1, Xt−1). A policy is a mapping from histories to actions: a learner adopts a policy to interact with an environment. An environment is a mapping from history sequences ending in actions to rewards. Both the learner and the environment may randomise their decisions, but this detail is not so important for now. The most common objective of the learner is to choose actions that lead to the largest possible cumulative reward over all n rounds, which is Σ_{t=1}^n X_t.
The fundamental challenge in bandit problems is that the environment is unknown to the learner. All the learner knows is that the true environment lies in some set E called the environment class. Most of this book is about designing policies for different kinds of environment classes, though in some cases the framework is extended to include side observations as well as actions and rewards.
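The round-by-round protocol just described can be written as a short interaction loop. The sketch below is our illustration under simplifying assumptions (a stochastic, stationary environment with two Bernoulli arms and a policy that ignores the history); the names run_bandit, env_sample and uniform_policy are ours and not notation from the book.

import random

def run_bandit(policy, env_sample, n):
    # In round t the policy maps the history H_{t-1} to an action A_t,
    # and the environment responds with a reward X_t.
    history = []           # H_{t-1} as a list of (action, reward) pairs
    total_reward = 0.0
    for t in range(1, n + 1):
        a_t = policy(history)      # A_t may depend only on H_{t-1}
        x_t = env_sample(a_t)      # the environment reveals X_t
        history.append((a_t, x_t))
        total_reward += x_t
    return total_reward

# Assumed environment: two Bernoulli arms with means 0.5 and 0.6.
# The learner never gets to see these numbers.
means = [0.5, 0.6]

def env_sample(action):
    return 1.0 if random.random() < means[action] else 0.0

# A baseline policy that ignores the history entirely.
def uniform_policy(history):
    return random.randrange(len(means))

print(run_bandit(uniform_policy, env_sample, n=1000))  # about 550 on average

A better policy would use the history to shift its pulls towards the arm that appears to pay more, which is what the algorithms in this book are designed to do.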
The next question is how to evaluate a learner. We discuss several performance measures throughout the book, but most of our efforts are devoted to understanding the regret. There are several ways to define this quantity. To avoid getting bogged down in details, we start with a somewhat informal definition.
Definition 1.1. The regret of the learner relative to a policy π (not necessarily that followed by the learner) is the difference between the total expected reward using policy π for n rounds and the total expected reward collected by the learner over n rounds. The regret relative to a set of policies Π is the maximum regret relative to any policy π ∈ Π in the set.
The set Π is often called the competitor class. Another way of saying all this is that the regret measures the performance of the learner relative to the best policy in the competitor class. We usually measure the regret relative to a set of policies Π that is large enough to include the optimal policy for all environments in E. In this case, the regret measures the loss suffered by the learner relative to the optimal policy.

In the literature, actions are often also called ‘arms’. We talk about k-armed bandits when the number of actions is k, and about multi-armed bandits when the number of arms is at least two and the actual number is immaterial to the discussion. If there are multi-armed bandits, there are also one-armed bandits, which are really two-armed bandits where the pay-off of one of the arms is a known fixed deterministic number.
Example 1.2. Suppose the action set is A = {1, 2, . . . , k}. An environment is called a stochastic Bernoulli bandit if the reward Xt ∈ {0, 1} is binary valued and there exists a vector μ ∈ [0, 1]^k such that the probability that Xt = 1 given the learner chose action At = a is μa. The class of stochastic Bernoulli bandits is the set of all such bandits, which are characterised by their mean vectors. If you knew the mean vector associated with the environment, then the optimal policy is to play the fixed action a∗ = argmax_{a∈A} μa. This means that for this problem the natural competitor class is the set of k constant policies Π = {π1, . . . , πk}, where πi chooses action i in every round. The regret over n rounds becomes

R_n = n · max_{a∈A} μ_a − E[ Σ_{t=1}^n X_t ],
where the expectation is with respect to the randomness in the environment and policy. The first term in this expression is the maximum expected reward using any policy. The second term is the expected reward collected by the learner.
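The regret in Example 1.2 can also be estimated numerically. The sketch below (again our illustration: the mean vector, horizon, number of repetitions and the ‘follow the leader’ learner are all assumptions chosen for simplicity, not recommendations from the text) approximates R_n = n · max_a μ_a − E[Σ_{t=1}^n X_t] by averaging over repeated simulations.

import random

def estimate_regret(mu, n, runs=100):
    # Monte Carlo estimate of R_n for a learner that plays each arm once
    # and then always plays the arm with the best empirical mean
    # ('follow the leader'). mu is the mean vector of a Bernoulli bandit.
    k = len(mu)
    best = n * max(mu)                 # expected reward of the optimal policy
    total = 0.0
    for _ in range(runs):
        counts, sums, reward = [0] * k, [0.0] * k, 0.0
        for t in range(n):
            if t < k:                  # initial round-robin over the arms
                a = t
            else:
                a = max(range(k), key=lambda i: sums[i] / counts[i])
            x = 1.0 if random.random() < mu[a] else 0.0
            counts[a] += 1
            sums[a] += x
            reward += x
        total += reward
    return best - total / runs

print(estimate_regret([0.5, 0.6], n=1000))

Because follow the leader can lock onto the wrong arm after an unlucky start, its estimated regret typically grows linearly with n, which is one way to see why more careful exploration is needed.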
For a fixed policy and competitor class, the regret depends on the environment. The environments where the regret is large are those where the learner is behaving worse. Of course the ideal case is that the regret be small for all environments. The worst-case regret is the maximum regret over all possible environments.
One of the core questions in the study of bandits is to understand the growth rate of the regret as n grows. A good learner achieves sublinear regret. Letting Rn denote the regret over n rounds, this means that Rn = o(n) or equivalently that limn→∞ Rn/n = 0. Of course one can ask for more. Under what circumstances is Rn = O(√n) or Rn = O(log(n))? And what are the leading constants? How does the regret depend on the specific environment in which the learner finds itself? We will discover eventually that for the environment class in Example 1.2, the worst-case regret for any policy is at least Ω(√n) and that there exist policies for which Rn = O(√n).
The framework is general enough to model almost anything by using a rich enough environment class. This cannot be bad, but with too much generality it becomes impossible to say much. For this reason, we usually restrict our attention to certain kinds of environment classes and competitor classes.
A large environment class corresponds to less knowledge by the learner. A large competitor class means the regret is a more demanding criterion. Some care is sometimes required to choose these sets appropriately so that (a) guarantees on the regret are meaningful and (b) there exist policies that make the regret small.
A simple problem setting is that of stochastic stationary bandits. In this case the environment is restricted to generate the reward in response to each action from a distribution that is specific to that action and independent of the previous action choices and rewards. The environment class in Example 1.2 satisfies these conditions, but there are many alternatives. For example, the rewards could follow a Gaussian distribution rather than Bernoulli. This relatively mild difference does not change the nature of the problem in a significant way. A more drastic change is to assume the action set A is a subset of R^d and that the mean reward for choosing some action a ∈ A follows a linear model, Xt = ⟨a, θ⟩ + ηt for θ ∈ R^d and ηt a standard Gaussian (zero mean, unit variance). The unknown quantity in this case is θ, and the environment class corresponds to