CS861: Theoretical Foundations of Machine Learning Lecture 17 – 10/13/2023 University of Wisconsin–Madison, Fall 2023
Lecture 17: K-armed bandits, the UCB algorithm
Lecturer: Kirthevasan Kandasamy Scribed by: Ransheng Guan, Yamin Zhou
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the instructor.
In this lecture, we will introduce the K–armed bandit and then present the upper confidence bound algorithm. Let us first quickly review the bandit framework.
– Let $\nu = \{\nu_a, a \in \mathcal{A}\}$ be a bandit model.
– On round $t$, the learner chooses $A_t \in \mathcal{A}$ and observes a reward $X_t$ sampled from $\nu_{A_t}$.
– A learner is characterized by a policy $\pi = (\pi_t)_{t \in \mathbb{N}}$, where $\pi_t$ maps the history $\{(A_s, X_s)\}_{s=1}^{t-1}$ to an action in $\mathcal{A}$ (or, for randomized policies, to a distribution over $\mathcal{A}$).
– Let $\mu_a = \mathbb{E}_{X \sim \nu_a}[X]$ be the expected reward of action $a$, let $a^* \in \arg\max_{a \in \mathcal{A}} \mu_a$ be an optimal action, and let $\mu^* = \mu_{a^*}$ be the optimal value. The regret is
\[
R_T = R_T(\pi, \nu) = T\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} X_t\right],
\]
where $\mathbb{E}$ is with respect to the distribution of the action-reward sequence $A_1, X_1, A_2, X_2, \ldots, A_T, X_T$ induced by the interaction between the policy $\pi$ and the bandit model $\nu$.
– We want $R_T \in o(T)$, i.e. $\lim_{T \to \infty} \frac{R_T}{T} = 0$.
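To make the protocol concrete, here is a minimal simulation sketch of the interaction loop above. The Gaussian reward distributions, the uniform placeholder policy, and all variable names are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sketch of the bandit protocol; Gaussian rewards are used purely for
# concreteness (any sigma-sub-Gaussian nu_a fits the framework above).
mus = np.array([0.9, 0.6, 0.5])        # mu_a = E_{X ~ nu_a}[X]
sigma, T = 1.0, 1000
mu_star = mus.max()

history = []                            # {(A_s, X_s)} available to the policy
for t in range(T):
    # Placeholder policy: choose uniformly at random.  A real policy pi_t maps the
    # history {(A_s, X_s)}_{s=1}^{t-1} to an action (or a distribution over actions).
    A_t = rng.integers(len(mus))
    X_t = rng.normal(mus[A_t], sigma)   # X_t ~ nu_{A_t}
    history.append((A_t, X_t))

# Pathwise (pseudo-)regret T*mu_star - sum_t mu_{A_t}; its expectation is R_T above.
regret = T * mu_star - sum(mus[a] for a, _ in history)
print(f"pseudo-regret of the uniform policy over T={T} rounds: {regret:.1f}")
```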
1 K-armed bandits
A K-armed bandit is a stochastic bandit model where the action space consists of $K$ distinct actions. We will write $\mathcal{A} = [K]$. We will assume that each $\nu_i$ is $\sigma$-sub-Gaussian, with $\sigma$ known. That is,
\[
\mathcal{P} = \{\nu = \{\nu_i, i \in [K]\} : \nu_i \text{ is } \sigma\text{-sub-Gaussian } \forall i \in [K]\}.
\]
For convenience, assume without loss of generality that $1 \ge \mu_1 \ge \mu_2 \ge \cdots \ge \mu_K \ge 0$, where $\mu_i = \mathbb{E}_{X \sim \nu_i}[X]$. (The learner is not aware of this ordering.) Finally, let $\Delta_i = \mu^* - \mu_i = \mu_1 - \mu_i$ denote the gap between the optimal arm and arm $i$.
2 Explore-then-Commit
One of the simplest algorithms for bandit models is the explore-then-commit (ETC) algorithm, which simply pulls each arm a fixed number of times at the beginning and, for the remaining rounds, pulls the arm that appeared to be the best. We state this algorithm formally in Algorithm 1.
Algorithm 1 Explore-then-Commit Algorithm
Data: time horizon $T$, number of exploration pulls per arm $m$ ($\le T/K$)
– Exploration phase: Pull each arm $m$ times in the first $mK$ rounds.
– Commit phase: Let $\hat{A} = \arg\max_{i \in [K]} \hat{\mu}_i$, where $\hat{\mu}_i = \frac{1}{m}\sum_{s=1}^{mK} \mathbb{1}(A_s = i) X_s$. Pull arm $\hat{A}$ for the remaining $T - mK$ rounds.

Theorem 1. Let $\mathcal{P}$ denote the class of $\sigma$-sub-Gaussian bandit models, and let $\nu \in \mathcal{P}$. Then the ETC policy satisfies
\[
R_T(\nu) \le m \sum_{i=1}^{K} \Delta_i + (T - mK) \sum_{i=1}^{K} \Delta_i \exp\left(-\frac{m\Delta_i^2}{4\sigma^2}\right).
\]
If we choose $m \asymp (T/K)^{2/3}$ (up to logarithmic factors), then
\[
\sup_{\nu \in \mathcal{P}} R_T(\nu) \in \tilde{O}\left(K^{1/3} T^{2/3}\right).
\]
The proof of this theorem will be in HW2.
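As an illustration, here is a minimal simulation sketch of ETC on a Gaussian bandit. The Gaussian rewards, the function name, and the specific constants are our own illustrative assumptions; the procedure itself follows Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(1)

def explore_then_commit(mus, sigma, T, m):
    """Sketch of Algorithm 1 (ETC) on a simulated Gaussian bandit (illustrative only)."""
    K = len(mus)
    assert m * K <= T
    pulls = np.zeros(K, dtype=int)
    sums = np.zeros(K)
    total_mean_reward = 0.0
    # Exploration phase: pull each arm m times in the first mK rounds.
    for i in range(K):
        for _ in range(m):
            x = rng.normal(mus[i], sigma)
            pulls[i] += 1
            sums[i] += x
            total_mean_reward += mus[i]
    # Commit phase: pull the empirically best arm for the remaining T - mK rounds.
    A_hat = int(np.argmax(sums / pulls))
    total_mean_reward += (T - m * K) * mus[A_hat]
    return T * np.max(mus) - total_mean_reward      # pathwise pseudo-regret

mus, sigma, T = np.array([0.9, 0.6, 0.5]), 1.0, 20_000
m = int((T / len(mus)) ** (2 / 3))                   # m ~ (T/K)^{2/3}, cf. Theorem 1
print(np.mean([explore_then_commit(mus, sigma, T, m) for _ in range(20)]))
```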
3 The Upper Confidence Bound Algorithm
The UCB algorithm is based on the principle of optimism in the face of uncertainty: on each round we pretend that the bandit model is as nice as is statistically plausible. To state the algorithm, we first define the following quantities for each arm $i$ at the end of round $t$:
\[
\hat{\mu}_{i,t} = \frac{1}{N_{i,t}} \sum_{s=1}^{t} \mathbb{1}(A_s = i) X_s, \qquad
e_{i,t} = \sigma \sqrt{\frac{2\log(1/\delta_t)}{N_{i,t}}} \quad \text{where } \delta_t = \frac{1}{T^3},
\]
where $N_{i,t} = \sum_{s=1}^{t} \mathbb{1}(A_s = i)$ is the number of times arm $i$ has been pulled in the first $t$ rounds (both $\hat{\mu}_{i,t}$ and $e_{i,t}$ are undefined if $N_{i,t} = 0$).
Then, μˆi,t + ei,t is an upper confidence bound for μi, and μˆi,t − ei,t is a lower confidence bound for μi.
We can now state the upper confidence bound (UCB) algorithm, which stipulates that on each round we choose the arm with the highest upper confidence bound $\hat{\mu}_{i,t-1} + e_{i,t-1}$. Intuitively, when we maximize $\hat{\mu}_{i,t-1} + e_{i,t-1}$, the $\hat{\mu}_{i,t-1}$ term favors exploitation, while the $e_{i,t-1}$ term favors exploration. The algorithm is stated formally in Algorithm 2.
Algorithm 2 The Upper Confidence Bound Algorithm
Data: time horizon $T$
for $t = 1, \cdots, K$ do
    Pull arm $t$, i.e. $A_t = t$, and observe $X_t \sim \nu_t$
end for
for $t = K+1, \cdots, T$ do
    Pull $A_t = \arg\max_{i \in [K]} \{\hat{\mu}_{i,t-1} + e_{i,t-1}\}$ and observe $X_t \sim \nu_{A_t}$    ▷ break ties arbitrarily
end for
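Below is a minimal simulation sketch of Algorithm 2, using the confidence radius defined above with $\delta_t = 1/T^3$, so that $e_{i,t} = \sigma\sqrt{6\log(T)/N_{i,t}}$. The Gaussian rewards and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def ucb(mus, sigma, T):
    """Sketch of Algorithm 2 (UCB) on a simulated Gaussian bandit (illustrative only)."""
    K = len(mus)
    pulls = np.zeros(K, dtype=int)      # N_{i,t}
    sums = np.zeros(K)                  # running sum of rewards from arm i
    total_mean_reward = 0.0
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                   # pull each arm once in the first K rounds
        else:
            mu_hat = sums / pulls
            e = sigma * np.sqrt(6.0 * np.log(T) / pulls)   # e_{i,t-1} with delta_t = 1/T^3
            a = int(np.argmax(mu_hat + e))                 # break ties arbitrarily (argmax takes the first)
        x = rng.normal(mus[a], sigma)   # X_t ~ nu_{A_t}
        pulls[a] += 1
        sums[a] += x
        total_mean_reward += mus[a]
    return T * np.max(mus) - total_mean_reward             # pathwise pseudo-regret

mus, sigma, T = np.array([0.9, 0.6, 0.5]), 1.0, 20_000
print(np.mean([ucb(mus, sigma, T) for _ in range(20)]))
```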
Theorem 2. Let $\mathcal{P}$ denote the class of $\sigma$-sub-Gaussian bandit models, and let $\nu \in \mathcal{P}$. Then the UCB policy satisfies
\[
R_T(\nu) \le 3K + \sum_{i: \Delta_i > 0} \frac{24\sigma^2 \log(T)}{\Delta_i}.
\]
Consequently,
\[
\sup_{\nu \in \mathcal{P}} R_T(\nu) \le 3K + \sigma\sqrt{96\, K T \log(T)} \in \tilde{O}\left(\sqrt{KT}\right).
\]
Here, the first bound can be viewed as a gap-dependent bound while the second bound can be viewed as
a gap-independent bound or a worst-case bound. If the gaps $\Delta_i = \mu_1 - \mu_i$ are large, then $R_T \in O(\log(T))$. Otherwise, $R_T \in \tilde{O}(\sqrt{KT})$.
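To see how a worst-case bound of this form is typically obtained from the gap-dependent one, here is a sketch of the standard argument; the threshold $\Delta > 0$ below is an auxiliary variable we introduce, and we use a per-arm bound of the form $\mathbb{E}[N_{i,T}] \le 3 + 24\sigma^2\log(T)/\Delta_i^2$, which is the standard intermediate step behind the first inequality. Arms with $\Delta_i \le \Delta$ contribute at most $T\Delta$ to the regret, while arms with $\Delta_i > \Delta$ contribute at most $3K + 24K\sigma^2\log(T)/\Delta$ in total, so
\[
R_T \;\le\; 3K + T\Delta + \frac{24 K \sigma^2 \log T}{\Delta}.
\]
Choosing $\Delta = \sqrt{24 K \sigma^2 \log(T)/T}$ to balance the last two terms gives $R_T \le 3K + 2\sqrt{24\,\sigma^2 K T \log T} = 3K + \sigma\sqrt{96\, K T \log T}$, matching the second bound.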
Before we prove our theorem, we will first state the following decomposition of the regret.

Lemma 1 (Regret decomposition). (Applies to any policy, not just UCB.)
\[
R_T(\nu) = \sum_{i: \Delta_i > 0} \Delta_i \, \mathbb{E}[N_{i,T}],
\]
where the expectation $\mathbb{E}$ is with respect to the action-reward sequence $A_1, X_1, A_2, X_2, \ldots, A_T, X_T$.

Proof.
\begin{align*}
R_T &= \sum_{t=1}^{T} \left(\mu_1 - \mathbb{E}[X_t]\right) \\
&= \sum_{t=1}^{T} \left(\mu_1 - \mathbb{E}\Big[\sum_{i=1}^{K} \mathbb{1}(A_t = i)\, X_t\Big]\right) \\
&= \sum_{t=1}^{T} \sum_{i=1}^{K} \mathbb{E}\left[(\mu_1 - X_t)\,\mathbb{1}(A_t = i)\right] \\
&= \sum_{i=1}^{K} \sum_{t=1}^{T} \mathbb{E}\left[\mathbb{E}\left[(\mu_1 - X_t)\,\mathbb{1}(A_t = i) \,\middle|\, A_t\right]\right] \\
&= \sum_{i=1}^{K} \sum_{t=1}^{T} \mathbb{E}\left[\mathbb{1}(A_t = i)\, \mathbb{E}\left[\mu_1 - X_t \,\middle|\, A_t\right]\right] \\
&= \sum_{i=1}^{K} \sum_{t=1}^{T} \mathbb{E}\left[\mathbb{1}(A_t = i)\,(\mu_1 - \mu_{A_t})\right] && \text{(integrating out the observation)} \\
&= \sum_{i=1}^{K} \sum_{t=1}^{T} \mathbb{E}\left[\mathbb{1}(A_t = i)\,(\mu_1 - \mu_i)\right] && \text{(the summand is 0 if } A_t \ne i\text{)} \\
&= \sum_{i=1}^{K} \sum_{t=1}^{T} \mathbb{E}\left[\Delta_i\, \mathbb{1}(A_t = i)\right] \\
&= \sum_{i=1}^{K} \Delta_i\, \mathbb{E}\Big[\sum_{t=1}^{T} \mathbb{1}(A_t = i)\Big] \\
&= \sum_{i=1}^{K} \Delta_i\, \mathbb{E}[N_{i,T}] = \sum_{i: \Delta_i > 0} \Delta_i\, \mathbb{E}[N_{i,T}].
\end{align*}
The last step follows from the fact that $N_{i,T} = \sum_{s=1}^{T} \mathbb{1}(A_s = i)$, and that terms with $\Delta_i = 0$ contribute nothing.
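As a quick sanity check, the identity underlying Lemma 1 already holds pathwise before taking expectations: $T\mu_1 - \sum_{t=1}^T \mu_{A_t} = \sum_i \Delta_i N_{i,T}$ for any action sequence. The snippet below verifies this on a random action sequence; the specific values and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Pathwise check: T*mu_1 - sum_t mu_{A_t} == sum_i Delta_i * N_{i,T} for any action sequence.
mus = np.array([0.9, 0.6, 0.5])                     # mu_1 >= mu_2 >= ... >= mu_K (illustrative)
deltas = mus[0] - mus                               # gaps Delta_i
T = 1000
actions = rng.integers(len(mus), size=T)            # an arbitrary action sequence A_1, ..., A_T
counts = np.bincount(actions, minlength=len(mus))   # N_{i,T}

lhs = T * mus[0] - mus[actions].sum()
rhs = (deltas * counts).sum()
assert np.isclose(lhs, rhs)                         # taking expectations of both sides gives Lemma 1
```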
Proof of Theorem 2. We will assume w.l.o.g. that each arm $i$ generates an i.i.d. sequence of rewards $(y_{i,r})_{r \in \mathbb{N}} \sim \nu_i$, and that we observe these samples one-by-one as we pull the arm. Therefore, we can write
\[
\hat{\mu}_{i,t} = \frac{1}{N_{i,t}} \sum_{r=1}^{N_{i,t}} y_{i,r}.
\]
We now define the following good events $G_1$ and $G_i$, for all $i$ s.t. $\Delta_i > 0$:
\[
G_1 = \{\forall t > K,\ \mu_1 < \hat{\mu}_{1,t} + e_{1,t}\}, \qquad
G_i = \{\forall t > K,\ \mu_i > \hat{\mu}_{i,t} - e_{i,t}\},
\]
where $G_1$ indicates that the true mean of arm 1 is below its UCB on all rounds, and $G_i$ indicates that the true mean of arm $i$ is above its LCB on all rounds.

Claim 1. We have $\mathbb{P}(G_1^c) \le \frac{1}{T}$, and $\mathbb{P}(G_i^c) \le \frac{1}{T}$ for all $i$ with $\Delta_i > 0$.
Proof. We have
\begin{align*}
\mathbb{P}(G_1^c) &= \mathbb{P}\left(\exists t > K \text{ s.t. } \mu_1 \ge \hat{\mu}_{1,t} + e_{1,t}\right) \\
&\le \sum_{t > K} \mathbb{P}\left(\mu_1 \ge \hat{\mu}_{1,t} + e_{1,t}\right) \\
&= \sum_{t > K} \mathbb{P}\left(\mu_1 \ge \frac{1}{N_{1,t}}\sum_{r=1}^{N_{1,t}} y_{1,r} + \sigma\sqrt{\frac{2\log(1/\delta_t)}{N_{1,t}}}\right) \\
&\le \sum_{t > K} \mathbb{P}\left(\exists s \in [t-K+1] \text{ s.t. } \mu_1 \ge \frac{1}{s}\sum_{r=1}^{s} y_{1,r} + \sigma\sqrt{\frac{2\log(1/\delta_t)}{s}}\right) \\
&\le \sum_{t > K} \sum_{s=1}^{t-K+1} \mathbb{P}\left(\frac{1}{s}\sum_{r=1}^{s} (y_{1,r} - \mu_1) \le -\sigma\sqrt{\frac{2\log(1/\delta_t)}{s}}\right) \\
&\le \sum_{t > K} \sum_{s=1}^{t-K+1} \exp\left(-\frac{s}{2\sigma^2} \cdot \sigma^2 \cdot \frac{2\log(1/\delta_t)}{s}\right) \\
&= \sum_{t=K+1}^{T} \sum_{s=1}^{t-K+1} \delta_t
= \sum_{t=K+1}^{T} \frac{t-K+1}{T^3}
\le \frac{T \cdot T}{T^3} = \frac{1}{T},
\end{align*}
where the fourth step uses the fact that $1 \le N_{1,t} \le t - K + 1$ for all $t > K$. The bound on $\mathbb{P}(G_i^c)$ follows from an identical argument.
The trick we used in the fourth and fifth steps only works in K–armed bandits. For other bandit models, we usually use martingales.
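As a rough empirical companion to Claim 1, the sketch below estimates $\mathbb{P}(G_1^c)$, the probability that the true mean of arm 1 ever exceeds its UCB after the first $K$ rounds, by Monte Carlo under the UCB policy. The Gaussian rewards, $\delta_t = 1/T^3$ as above, and all names are illustrative assumptions; at this sample size the estimate is typically zero, consistent with the $1/T$ bound.

```python
import numpy as np

rng = np.random.default_rng(4)

def g1_violated(mus, sigma, T):
    """Run UCB once and report whether mu_1 >= hat{mu}_{1,t} + e_{1,t} for some t > K."""
    K = len(mus)
    pulls, sums = np.zeros(K, dtype=int), np.zeros(K)
    violated = False
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1
        else:
            e = sigma * np.sqrt(6.0 * np.log(T) / pulls)   # e_{i,t-1} with delta_t = 1/T^3
            a = int(np.argmax(sums / pulls + e))
        x = rng.normal(mus[a], sigma)
        pulls[a] += 1
        sums[a] += x
        if t > K:                                          # check the good event at the end of round t
            ucb1 = sums[0] / pulls[0] + sigma * np.sqrt(6.0 * np.log(T) / pulls[0])
            violated |= bool(mus[0] >= ucb1)
    return violated

mus, sigma, T = np.array([0.9, 0.6, 0.5]), 1.0, 2000
freq = np.mean([g1_violated(mus, sigma, T) for _ in range(200)])
print(f"estimated P(G_1^c) = {freq:.4f}  (Claim 1 bound: 1/T = {1/T:.4f})")
```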