CS861: Theoretical Foundations of Machine Learning Lecture 24 – 10/30/2023 University of Wisconsin–Madison, Fall 2023
Lecture 24: EXP3, Lower Bounds for adversarial bandits
Lecturer: Kirthevasan Kandasamy Scribed by: Bo-Hsun Chen, Zexuan Sun
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the instructor.
In this lecture, we will continue the proof of the EXP3 theorem from the previous lecture, then discuss lower bounds for adversarial bandits, and finally introduce and define the contextual bandit problem.
Proof of the EXP3 Theorem (continued).
Recall the lemma from the previous lecture: for all $a \in [K]$,
\[
\text{(i)}\ \ \mathbb{E}\big[\hat{l}_t(a) \mid p_t\big] = l_t(a),
\qquad
\text{(ii)}\ \ \mathbb{E}\big[\hat{l}_t^{\,2}(a) \mid p_t\big] = \frac{l_t^2(a)}{p_t(a)}.
\]
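For intuition, here is a quick sanity check of part (i). It assumes the standard importance-weighted estimator $\hat{l}_t(a) = \frac{l_t(a)\,\mathbf{1}(A_t = a)}{p_t(a)}$ from the previous lecture (the estimator's definition is not restated in this section, so this form is an assumption here). Since $A_t \sim p_t$,
\[
\mathbb{E}\big[\hat{l}_t(a) \mid p_t\big]
= \frac{l_t(a)}{p_t(a)}\,\mathbb{P}(A_t = a \mid p_t)
= \frac{l_t(a)}{p_t(a)}\, p_t(a)
= l_t(a),
\]
and the same calculation with $\hat{l}_t^{\,2}(a) = \frac{l_t^2(a)\,\mathbf{1}(A_t = a)}{p_t^2(a)}$ gives $\mathbb{E}[\hat{l}_t^{\,2}(a) \mid p_t] = \frac{l_t^2(a)}{p_t(a)}$, which is part (ii).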
We will apply the first result from the Hedge theorem with $a = a^*$ and losses $\hat{l}_t$. Since $p_t^\top \hat{l}_t = l_t(A_t) \le 1$ and the estimated losses are non-negative, we can apply this result. We have,
\[
\sum_{t=1}^T p_t^\top \hat{l}_t - \sum_{t=1}^T \hat{l}_t(a^*)
\;\le\; \frac{\log(K)}{\eta} + \eta \sum_{t=1}^T p_t^\top \hat{l}_t^{\,2},
\tag{1}
\]
where $a^* = \arg\min_{a \in [K]} \sum_{t=1}^T l_t(a)$.
First note that $\mathbb{E}[\hat{l}_t(a^*) \mid p_t] = l_t(a^*)$ by the lemma above. Next, applying the lemma again,
\[
\mathbb{E}\big[p_t^\top \hat{l}_t \mid p_t\big]
= p_t^\top \mathbb{E}\big[\hat{l}_t \mid p_t\big]
= p_t^\top l_t \quad \text{(by Lemma (i))}
= \mathbb{E}\big[l_t(A_t) \mid p_t\big].
\]
Finally, applying the second result of the above lemma,
\[
\mathbb{E}\big[p_t^\top \hat{l}_t^{\,2} \mid p_t\big]
= \mathbb{E}\Big[\sum_{a=1}^K p_t(a)\, \hat{l}_t^{\,2}(a) \,\Big|\, p_t\Big]
= \sum_{a=1}^K p_t(a)\, \mathbb{E}\big[\hat{l}_t^{\,2}(a) \mid p_t\big]
= \sum_{a=1}^K p_t(a)\, \frac{l_t^2(a)}{p_t(a)}
= \sum_{a=1}^K l_t^2(a) \;\le\; K,
\]
since $l_t(a) \le 1$ for all $a$.
Now, taking expectations on both sides of (1), we have
\begin{align*}
\mathbb{E}[\mathrm{LHS}]
&= \mathbb{E}\left[\sum_{t=1}^T \mathbb{E}\big[p_t^\top \hat{l}_t \mid p_t\big] - \sum_{t=1}^T \mathbb{E}\big[\hat{l}_t(a^*) \mid p_t\big]\right] \\
&= \mathbb{E}\left[\sum_{t=1}^T \mathbb{E}\big[l_t(A_t) \mid p_t\big] - \sum_{t=1}^T l_t(a^*)\right] \\
&= \mathbb{E}\left[\sum_{t=1}^T l_t(A_t)\right] - \sum_{t=1}^T l_t(a^*) \\
&= R_T(\pi, l),
\end{align*}
and
\begin{align*}
\mathbb{E}[\mathrm{RHS}]
&= \frac{\log(K)}{\eta} + \eta \sum_{t=1}^T \mathbb{E}\Big[\mathbb{E}\big[p_t^\top \hat{l}_t^{\,2} \mid p_t\big]\Big] \\
&\le \frac{\log(K)}{\eta} + \eta K T
\qquad \big(\text{since } \mathbb{E}\big[p_t^\top \hat{l}_t^{\,2} \mid p_t\big] \le K\big).
\end{align*}
This proves the first statement of the EXP-3 Theorem, and the second statement follows by optimizing over η.
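To make the algorithm and the tuned learning rate concrete, here is a minimal Python sketch of EXP3 using the importance-weighted loss estimator and the optimized choice $\eta = \sqrt{\log(K)/(KT)}$. The function name exp3, the loss_fn callback interface, and the toy loss sequence below are illustrative assumptions, not code from the lecture.

```python
import numpy as np

def exp3(K, T, loss_fn, eta=None, rng=None):
    """Minimal EXP3 sketch (illustrative; names and interface are assumptions).

    K       : number of arms
    T       : horizon
    loss_fn : callable (t, arm) -> loss in [0, 1], playing the role of l_t
    eta     : learning rate; defaults to the optimized sqrt(log(K) / (K * T))
    """
    rng = np.random.default_rng() if rng is None else rng
    eta = np.sqrt(np.log(K) / (K * T)) if eta is None else eta
    L_hat = np.zeros(K)                      # cumulative importance-weighted losses
    total_loss = 0.0
    for t in range(T):
        # Exponential weights (Hedge) on the estimated losses; shift for stability.
        w = np.exp(-eta * (L_hat - L_hat.min()))
        p = w / w.sum()
        A = int(rng.choice(K, p=p))          # sample A_t ~ p_t
        loss = loss_fn(t, A)                 # only l_t(A_t) is observed
        total_loss += loss
        # Importance-weighted estimate: hat{l}_t(a) = l_t(a) * 1(A_t = a) / p_t(a)
        L_hat[A] += loss / p[A]
    return total_loss
```

A quick way to exercise the sketch on an assumed toy loss sequence:

```python
# Toy check: Bernoulli losses where arm 0 is slightly better on average.
rng = np.random.default_rng(0)
K, T = 5, 20000
losses = (rng.random((T, K)) < 0.5 - 0.1 * (np.arange(K) == 0)).astype(float)
alg_loss = exp3(K, T, lambda t, a: losses[t, a], rng=rng)
regret = alg_loss - losses.sum(axis=0).min()  # regret vs. best fixed arm in hindsight
print(regret)                                 # roughly on the order of sqrt(K * T * log(K))
```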
1 Lower bounds for adversarial bandits
The following theorem provides a lower bound for the minimax rate of regret of the adversarial multi-armed bandit problem.
Theorem 1. For the adversarial multi-armed bandit problem, the minimax regret satisfies,
\[
\inf_{\pi} \; \sup_{l \in [0,1]^{K \times T}} R_T(\pi, l) \;\in\; \Omega\big(\sqrt{KT}\big).
\]
Recall that the minimax lower bound for stochastic bandits is
\[
\inf_{\pi} \; \sup_{\nu} R_T^{\mathrm{stoch}}(\pi, \nu) \;\in\; \Omega\big(\sqrt{KT}\big).
\]
Note that the adversarial bandit problem is applicable in more general settings than the stochastic bandit problem. Moreover, the regret definitions differ: for adversarial bandits, the regret is measured against the best action in hindsight, while for stochastic bandits it is measured against the arm with the smallest mean loss. Despite this, we find that the minimax regret is similar for both problems. This is because the hardest stochastic bandit problems are as hard as the hardest adversarial bandit problems. In fact, the proof of this lower bound will rely on techniques similar to those used to prove the lower bound for stochastic bandits.
Our proof will consider stochastic losses and show that the expected regret is large. It then follows that there is at least one sequence of losses for which the regret is large.
Proof. Let $\pi$ be given. We will consider two stochastic loss models $\nu^{(1)} = (\nu^{(1)}_1, \nu^{(1)}_2, \ldots, \nu^{(1)}_K)$ and $\nu^{(2)} = (\nu^{(2)}_1, \nu^{(2)}_2, \ldots, \nu^{(2)}_K)$, to be defined shortly. Let $P^{(1)}$ and $P^{(2)}$ denote the probability distributions of the action-loss sequence $(A_1, l_1(A_1), \ldots, A_T, l_T(A_T))$ induced by $\pi$'s interaction with $\nu^{(1)}$ and $\nu^{(2)}$ respectively, and let $\mathbb{E}^{(1)}$ and $\mathbb{E}^{(2)}$ denote the corresponding expectations.
Let $\nu^{(1)}$ be defined as
\[
\nu^{(1)}_1 = \mathrm{Bern}\!\left(\tfrac{1}{2} - \delta\right)
\quad \text{and} \quad
\nu^{(1)}_i = \mathrm{Bern}\!\left(\tfrac{1}{2}\right), \ \forall\, i \neq 1.
\]
Here, $\delta < 1/8$ is a parameter that we will specify later. So, the means of $\nu^{(1)}$ will be $(1/2 - \delta,\ 1/2,\ 1/2,\ \ldots,\ 1/2)$.
Since $\sum_{a=1}^K \mathbb{E}^{(1)}[N_{a,T}] = T$, where $N_{a,T} = \sum_{t=1}^T \mathbf{1}(A_t = a)$, we know that there exists $j \in \{2, 3, \ldots, K\}$ such that $\mathbb{E}^{(1)}[N_{j,T}] \le \frac{T}{K-1}$. We will next define $\nu^{(2)}$ as follows:
\[
\nu^{(2)}_j = \mathrm{Bern}\!\left(\tfrac{1}{2} - 2\delta\right)
\quad \text{and} \quad
\nu^{(2)}_i = \nu^{(1)}_i, \ \forall\, i \neq j.
\]
So, the means of $\nu^{(2)}$ would be $(1/2 - \delta,\ 1/2,\ \ldots,\ 1/2 - 2\delta,\ \ldots,\ 1/2)$, with $1/2 - 2\delta$ in coordinate $j$. Let $R_T'(\pi, l) = \sum_{t=1}^T l_t(A_t) - \min_{a \in [K]} \sum_{t=1}^T l_t(a)$, and let $\mathbb{E}_\pi$ denote expectation over the policy's randomness, so that $R_T(\pi, l) = \mathbb{E}_\pi[R_T'(\pi, l)]$. We can now bound the worst-case regret of policy $\pi$ as follows:
\begin{align*}
R^\star(\pi) = \sup_{l \in [0,1]^{K \times T}} R_T(\pi, l)
&= \sup_{l \in [0,1]^{K \times T}} \mathbb{E}_\pi\big[R_T'(\pi, l)\big] \\
&\ge \mathbb{E}_{i \sim \mathrm{Unif}(\{1,2\})}\, \mathbb{E}_{l \sim \nu^{(i)}}\, \mathbb{E}_\pi\big[R_T'(\pi, l)\big] \\
&= \tfrac{1}{2}\, \mathbb{E}_\pi\Big[\mathbb{E}_{l \sim \nu^{(1)}}\big[R_T'(\pi, l)\big]\Big] + \tfrac{1}{2}\, \mathbb{E}_\pi\Big[\mathbb{E}_{l \sim \nu^{(2)}}\big[R_T'(\pi, l)\big]\Big],
\end{align*}
where the inequality follows from the fact that $i \sim \mathrm{Unif}(\{1,2\}),\ l \sim \nu^{(i)}$ defines a distribution over $\{0,1\}^{K \times T} \subseteq [0,1]^{K \times T}$, and the fact that the maximum is larger than the average. Next, consider
\begin{align*}
\mathbb{E}_{l \sim \nu^{(i)}}\big[R_T'(\pi, l)\big]
&= \mathbb{E}_{l \sim \nu^{(i)}}\left[\sum_{t=1}^T l_t(A_t) - \min_{a \in [K]} \sum_{t=1}^T l_t(a)\right] \\
&\ge \mathbb{E}_{l \sim \nu^{(i)}}\left[\sum_{t=1}^T l_t(A_t)\right] - \min_{a \in [K]} \mathbb{E}_{l \sim \nu^{(i)}}\left[\sum_{t=1}^T l_t(a)\right] \\
&= \mathbb{E}_{l \sim \nu^{(i)}}\left[\sum_{t=1}^T l_t(A_t)\right] - T \mu^\star(\nu^{(i)}),
\end{align*}
where $\mu^\star(\nu) = \min_{a \in [K]} \mathbb{E}_{X \sim \nu_a}[X]$. The inequality is by Jensen's inequality and noting that the pointwise minimum is concave, i.e. $\mathbb{E}[\min_i x_i] \le \min_i \mathbb{E}[x_i]$. Therefore, we have
\[
\mathbb{E}_\pi\Big[\mathbb{E}_{l \sim \nu^{(i)}}\big[R_T'(\pi, l)\big]\Big]
\;\ge\; \mathbb{E}_\pi\, \mathbb{E}_{l \sim \nu^{(i)}}\left[\sum_{t=1}^T l_t(A_t)\right] - T \mu^\star(\nu^{(i)})
\;=\; R_T^{\mathrm{stoc}}(\pi, \nu^{(i)}),
\]
where $R_T^{\mathrm{stoc}}(\pi, \nu^{(i)})$ is the stochastic bandit regret of policy $\pi$ on the stochastic bandit model $\nu^{(i)}$. Therefore,
\[
R^\star(\pi) \;\ge\; \frac{1}{2}\Big(R_T^{\mathrm{stoc}}(\pi, \nu^{(1)}) + R_T^{\mathrm{stoc}}(\pi, \nu^{(2)})\Big).
\tag{$\star$}
\]
The term inside the parentheses is similar to the quantity we obtained when proving lower bounds for stochastic bandits. You will complete the remainder of the proof in the homework; you can show $(\star) \in \Omega(\sqrt{KT})$ for an appropriate choice of $\delta$.
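As an illustration of this construction (not part of the proof), the sketch below builds $\nu^{(1)}$, estimates the visit counts $\mathbb{E}^{(1)}[N_{a,T}]$ for a given policy by simulation, picks a least-visited arm $j \in \{2, \ldots, K\}$, and then forms $\nu^{(2)}$. The helper name build_hard_instances and the uniform-random stand-in policy are hypothetical.

```python
import numpy as np

def build_hard_instances(policy, K, T, delta, n_runs=200, seed=0):
    """Hypothetical helper illustrating the lower-bound construction:
    run `policy` against nu^(1), estimate E^(1)[N_{a,T}] by simulation,
    pick a least-visited arm j in {2,...,K}, and return the Bernoulli
    means of nu^(1) and nu^(2)."""
    rng = np.random.default_rng(seed)
    means1 = np.full(K, 0.5)
    means1[0] = 0.5 - delta          # nu^(1): arm 1 has the smallest mean loss
    visits = np.zeros(K)
    for _ in range(n_runs):          # Monte Carlo estimate of E^(1)[N_{a,T}]
        for t in range(T):
            a = policy(t, rng)       # the policy picks an arm (an adaptive policy
            visits[a] += 1           # would also be fed the observed loss l_t(A_t))
    avg_visits = visits / n_runs
    # Some j in {2,...,K} must satisfy E^(1)[N_{j,T}] <= T / (K - 1).
    j = 1 + int(np.argmin(avg_visits[1:]))
    means2 = means1.copy()
    means2[j] = 0.5 - 2 * delta      # nu^(2): arm j becomes the unique best arm
    return means1, means2, j

# Example with a uniform-random stand-in for the policy pi.
means1, means2, j = build_hard_instances(lambda t, rng: int(rng.integers(5)),
                                         K=5, T=100, delta=0.05)
```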