
CS861: Theoretical Foundations of Machine Learning Lecture 18 – 10/16/2023 University of Wisconsin–Madison, Fall 2023
Lecture 18: Proof for UCB (cont’d), K-armed Bandit Lower Bound
Lecturer: Kirthevasan Kandasamy Scribed by: Michael Harding and Congwei Yang
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the instructor.
In this lecture, we will first upper bound the regret for UCB, providing gap-dependent and worst-case bounds. We will then start our discussion on proving lower bounds for K–armed bandits.
1 UCB Theorem and Proof
Recall the UCB algorithm from the last class.
Algorithm 1 The Upper Confidence Bound (UCB) Algorithm
Require: time horizon T
for t = 1, …, K do
    A_t ← t
    X_t ∼ ν_{A_t}
end for
for t = K + 1, …, T do
    A_t ← arg max_{i ∈ [K]} μ̂_{i,t−1} + e_{i,t−1}    ▷ Break ties arbitrarily
    X_t ∼ ν_{A_t}
end for
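For concreteness, here is a minimal Python sketch of the algorithm on Gaussian arms (a special case of σ-sub-Gaussian rewards). The confidence radius e_{i,t} = σ√(2 log(T² t)/N_{i,t}) matches the one used in the proof below; the toy means, helper name, and horizon are illustrative choices, not part of the notes.

import numpy as np

def ucb(means, sigma=1.0, T=10_000, seed=0):
    # Run UCB on a K-armed Gaussian bandit with the given true means and
    # confidence radius e_{i,t} = sigma * sqrt(2 * log(T^2 * t) / N_{i,t}).
    rng = np.random.default_rng(seed)
    K = len(means)
    counts = np.zeros(K)      # N_{i,t}: number of pulls of each arm so far
    sums = np.zeros(K)        # running reward sums (for the empirical means)
    mu_star = max(means)
    regret = 0.0

    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1                               # pull each arm once
        else:
            mu_hat = sums / counts
            radius = sigma * np.sqrt(2 * np.log(T**2 * t) / counts)
            arm = int(np.argmax(mu_hat + radius))     # break ties arbitrarily
        x = rng.normal(means[arm], sigma)             # X_t ~ nu_{A_t}
        counts[arm] += 1
        sums[arm] += x
        regret += mu_star - means[arm]                # pseudo-regret increment
    return regret

# Toy 3-armed instance with made-up means.
print(ucb([0.9, 0.8, 0.5]))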
We will now restate the theorem giving regret upper bounds for UCB, and pick up the proof where we left off.
Theorem 1 (UCB Regret Upper Bound). Let P = { ν = {ν_i}_{i=1}^K : ν_i is σ-sub-Gaussian, E_{X∼ν_i}[X] ∈ [0, 1] ∀ i ∈ [K] } be the class of σ-sub-Gaussian K-armed bandit models with means in [0, 1]. Let μ_i := E_{X∼ν_i}[X], μ* := max_{i∈[K]} μ_i, and denote Δ_i := μ* − μ_i. Then

R_T(ν) ≤ 3K + Σ_{i: Δ_i > 0} 24σ² log(T)/Δ_i    (1)

sup_{ν∈P} R_T(ν) ≤ 3K + σ√(96KT log(T))    (2)
Proof As before, WLOG, we begin by letting 1 ≥ μ_1 ≥ ··· ≥ μ_K ≥ 0 for ease of notation. Also, we again define our good events

G_1 := ∩_{t>K} { μ_1 < μ̂_{1,t} + e_{1,t} },
G_i := ∩_{t>K} { μ_i > μ̂_{i,t} − e_{i,t} }.
At the end of our previous class, we proved that P(G_1^c), P(G_i^c) ≤ 1/T (we directly showed this for the case of G_1^c, remarking that the case of G_i^c is nearly identical). We will now show that N_{i,t} := Σ_{s=1}^t I{A_s = i} is small for sub-optimal arms (Δ_i > 0) under the event G_1 ∩ G_i. To show this, suppose arm i was last pulled on round t + 1, where t ≥ K. Hence,
μ̂_{i,t} + e_{i,t} ≥ max_{j ≠ i} μ̂_{j,t} + e_{j,t}    (UCB arm selection)
 ≥ μ̂_{1,t} + e_{1,t}
 > μ_1    (under G_1),
and under G_i, we also have μ_i > μ̂_{i,t} − e_{i,t}. Therefore,

μ_1 < μ_i + 2e_{i,t}  ⇒  Δ_i < 2e_{i,t} = 2σ√( 2 log(T² t) / N_{i,t} )  ⇒  N_{i,t} ≤ 8σ² log(T² t)/Δ_i² ≤ 24σ² log(T)/Δ_i²,

where the last inequality uses log(T² t) ≤ 3 log(T) for t ≤ T. Since arm i is not pulled after round t + 1,

N_{i,T} = N_{i,t} + 1 ≤ 24σ² log(T)/Δ_i² + 1.

Now, combining these results, we can write

E[N_{i,T}] = E[N_{i,T} | G_1 ∩ G_i] P(G_1 ∩ G_i) + E[N_{i,T} | G_1^c ∪ G_i^c] P(G_1^c ∪ G_i^c)
 ≤ (24σ² log(T)/Δ_i² + 1) · 1 + T · (2/T)
 ≤ 3 + 24σ² log(T)/Δ_i².
Then, by the regret decomposition result shown towards the end of last class, we can write

R_T(ν) ≤ Σ_{i: Δ_i > 0} Δ_i E[N_{i,T}] ≤ 3K + Σ_{i: Δ_i > 0} 24σ² log(T)/Δ_i,

where we leverage the fact that Δ_i ∈ [0, 1] and there are at most K − 1 summands. This proves the gap-dependent bound in (1). For the gap-independent bound, we can choose some value Δ > 0 and rewrite our result above as follows:
R_T(ν) = Σ_{i: Δ_i > 0} Δ_i E[N_{i,T}]
 = Σ_{i: Δ_i ∈ (0, Δ]} Δ_i E[N_{i,T}] + Σ_{i: Δ_i > Δ} Δ_i E[N_{i,T}]
 ≤ Δ Σ_{i: Δ_i ∈ (0, Δ]} E[N_{i,T}] + Σ_{i: Δ_i > Δ} 24σ² log(T)/Δ_i + 3K
 ≤ 3K + ΔT + 24Kσ² log(T)/Δ,

where the last step uses Σ_{i: Δ_i ∈ (0, Δ]} E[N_{i,T}] ≤ T and the fact that each of the at most K terms in the second sum is at most 24σ² log(T)/Δ.
Then, because this holds for all Δ > 0, we are free to optimize over values of Δ, giving us in particular Δ = σ√(24K log(T)/T). Therefore,

R_T(ν) ≤ 3K + σ√(96KT log(T)),
and because this result holds for all ν ∈ P and the bound has no dependence on ν, we can write

sup_{ν∈P} R_T(ν) ≤ 3K + σ√(96KT log(T)),
which is exactly the statement in (2).
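As a quick sanity check, this choice of Δ balances the two Δ-dependent terms above:

ΔT = σ√(24K log(T)/T) · T = σ√(24KT log(T)),    24Kσ² log(T)/Δ = 24Kσ² log(T) / (σ√(24K log(T)/T)) = σ√(24KT log(T)),

so their sum is 2σ√(24KT log(T)) = σ√(96KT log(T)).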
Next, we will present an alternative proof of the gap-independent bound. We will use similar techniques for linear bandits in subsequent classes.
1.1 Alternative Proof for the Gap-Independent Bound
We will first decompose the regret as follows:

R_T = E[ Σ_{t=1}^T (μ* − X_t) ]
 = E[ Σ_{t=1}^T E[μ* − X_t | A_t] ]
 = E[ Σ_{t=1}^T (μ* − μ_{A_t}) ].

The quantity Σ_{t=1}^T (μ* − μ_{A_t}) is usually called the pseudo-regret. Let G = G_1 ∩ ( ∩_{i: Δ_i > 0} G_i ). Then

R_T = E[ Σ_{t=1}^T (μ_1 − μ_{A_t}) | G ] P(G) + E[ Σ_{t=1}^T (μ_1 − μ_{A_t}) | G^c ] P(G^c).    (3)

Note we have P(G) ≤ 1, E[ Σ_{t=1}^T (μ_1 − μ_{A_t}) | G^c ] ≤ T, and P(G^c) ≤ K/T. We will now bound Σ_{t=1}^T (μ_1 − μ_{A_t}) under the event G.
Claim: Under the event G, μ_1 − μ_{A_t} ≤ 2e_{A_t,t−1} for all t > K.
• If A_t is an optimal arm, then μ_1 − μ_{A_t} ≤ 0 ≤ 2e_{A_t,t−1}.
• If not, μ_1 ≤ μ̂_{1,t−1} + e_{1,t−1} ≤ μ̂_{A_t,t−1} + e_{A_t,t−1} ≤ μ_{A_t} + 2e_{A_t,t−1}, where the first inequality holds under G_1, the second follows from the UCB arm-selection rule, and the last holds under ∩_{i: Δ_i > 0} G_i. Then,
Σ_{t=1}^T (μ_1 − μ_{A_t}) ≤ K + Σ_{t=K+1}^T 2σ√( 2 log(1/δ_t) / N_{A_t,t−1} )
 = K + Σ_{t=K+1}^T 2σ√( 2 log(T² t) / N_{A_t,t−1} )
 ≤ K + σ√(24 log(T)) Σ_{t=K+1}^T 1/√(N_{A_t,t−1}),    (4)

where the first inequality bounds μ_1 − μ_{A_t} ≤ 1 on the first K rounds and applies the claim for t > K, and the last inequality uses log(T² t) ≤ 3 log(T).
We will now focus on the last summation:

Σ_{t=K+1}^T 1/√(N_{A_t,t−1}) = Σ_{i=1}^K Σ_{s=1}^{N_{i,T}−1} 1/√s
 ≤ 2 Σ_{i=1}^K √(N_{i,T} − 1)
 ≤ 2 √( K Σ_{i=1}^K (N_{i,T} − 1) )    (Jensen's inequality)
 = 2 √( K(T − K) ).    (5)

Here the first inequality follows from Σ_{s=1}^m 1/√s ≤ 2√m, which we prove below.
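The Jensen step above follows from concavity of x ↦ √x: (1/K) Σ_{i=1}^K √(N_{i,T} − 1) ≤ √( (1/K) Σ_{i=1}^K (N_{i,T} − 1) ), and multiplying both sides by K gives Σ_{i=1}^K √(N_{i,T} − 1) ≤ √( K Σ_{i=1}^K (N_{i,T} − 1) ).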
Combining (3), (4), and (5), we obtain R_T ≤ 2K + σ√(96KT log(T)).
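In more detail, plugging (5) into (4) gives, under the event G,

Σ_{t=1}^T (μ_1 − μ_{A_t}) ≤ K + σ√(24 log(T)) · 2√(K(T − K)) ≤ K + σ√(96KT log(T)),

and substituting this into (3) along with E[ Σ_{t=1}^T (μ_1 − μ_{A_t}) | G^c ] ≤ T and P(G^c) ≤ K/T yields

R_T ≤ (K + σ√(96KT log(T))) · 1 + T · (K/T) = 2K + σ√(96KT log(T)).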
To prove Σ_{s=1}^m 1/√s ≤ 2√m, we will bound the sum of a decreasing function by an integral as follows:

Σ_{s=1}^m 1/√s ≤ ∫_0^m 1/√s ds = [2s^{1/2}]_0^m = 2√m.

2 K-armed Bandit Lower Bound

In this section, we will prove the following lower bound on the minimax regret: inf_Π sup_{ν∈P} R_T(Π, ν) ∈ Ω(√(KT)). To do so, recall the following results we used in the proof of Le Cam's method (Lecture 9, Lemma 1 and Corollary 1).

Lemma 1. Let P_0, P_1 be two distributions and A be any event. Then,

P_0(A) + P_1(A^c) ≥ ∥P_0 ∧ P_1∥    (Neyman–Pearson test)
 = 1 − TV(P_0, P_1)
 ≥ (1/2) exp(−KL(P_0, P_1)).

When applying this inequality, the KL divergence will be between distributions of action–reward sequences A_1, X_1, ···, A_T, X_T induced by the interaction of a policy π with different bandit models. The following lemma will be helpful in computing this KL divergence.

Lemma 2 (KL divergence decomposition). Let ν, ν′ be two K-armed bandit models. For a fixed policy Π, let P, P′ denote the probability distributions over the sequence of actions and rewards A_1, X_1, ···, A_T, X_T under ν, ν′, respectively. Let E_ν denote the expectation under bandit model ν. Then for all T ≥ 1,

KL(P, P′) = Σ_{i=1}^K E_ν[N_{i,T}] KL(ν_i, ν_i′),

where N_{i,T} = Σ_{t=1}^T I{A_t = i}.

Intuitively, suppose we pulled arm 1 exactly N_1 times. As the observations are independent, KL(P, P′) = N_1 KL(ν_1, ν_1′). Next, consider a nonadaptive policy which pulls arm i exactly N_i times for i = 1, ···, K. We then have KL(P, P′) = Σ_{i=1}^K N_i KL(ν_i, ν_i′). The above lemma says that a similar result holds when we use an adaptive policy.
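To make the nonadaptive case concrete, here is a small numerical sketch for Gaussian arms with common variance σ², using the standard formula KL(N(μ, σ²), N(μ′, σ²)) = (μ − μ′)²/(2σ²); the pull counts, means, and function names below are illustrative choices, not part of the notes.

def gaussian_kl(mu, mu_prime, sigma):
    # KL( N(mu, sigma^2) || N(mu_prime, sigma^2) ) = (mu - mu_prime)^2 / (2 sigma^2)
    return (mu - mu_prime) ** 2 / (2 * sigma ** 2)

def kl_decomposition(pull_counts, means, means_prime, sigma=1.0):
    # Right-hand side of Lemma 2 for a nonadaptive policy:
    # sum_i N_i * KL(nu_i, nu_i')
    return sum(n * gaussian_kl(m, mp, sigma)
               for n, m, mp in zip(pull_counts, means, means_prime))

# Illustrative (made-up) instance: nu and nu' differ only in arm 3,
# and the policy pulls arm i exactly N_i times.
pulls = [50, 30, 20]             # N_1, N_2, N_3
means = [0.5, 0.4, 0.3]          # arm means under nu
means_prime = [0.5, 0.4, 0.5]    # nu' shifts arm 3 by 0.2

# Only arm 3 contributes: 20 * (0.2^2 / 2) = 0.4
print(kl_decomposition(pulls, means, means_prime))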

Proof (of Lemma 2). Consider any given action–reward sequence a_1, x_1, ···, a_T, x_T. Let p, p′ denote the Radon–Nikodym derivatives of P, P′ respectively, and let ν̃_i, ν̃_i′ denote the Radon–Nikodym derivatives of ν_i, ν_i′, respectively. For this fixed sequence, we can write

p(a_1, x_1, ···, a_T, x_T) = ∏_{t=1}^T p(a_t, x_t | a_1, x_1, ···, a_{t−1}, x_{t−1})
 = ∏_{t=1}^T Π(a_t | a_1, x_1, ···, a_{t−1}, x_{t−1}) ν̃_{a_t}(x_t).

Similarly, under ν′, we can write

p′(a_1, x_1, ···, a_T, x_T) = ∏_{t=1}^T Π(a_t | a_1, x_1, ···, a_{t−1}, x_{t−1}) ν̃′_{a_t}(x_t).

Since the policy terms Π(a_t | ·) are identical in the two expressions, they cancel in the likelihood ratio:

log ( p(a_1, x_1, ···, a_T, x_T) / p′(a_1, x_1, ···, a_T, x_T) ) = log ( ν̃_{a_1}(x_1) ··· ν̃_{a_T}(x_T) / ( ν̃′_{a_1}(x_1) ··· ν̃′_{a_T}(x_T) ) )
 = Σ_{t=1}^T log ( ν̃_{a_t}(x_t) / ν̃′_{a_t}(x_t) ).

To be continued next lecture…