
CS861: Theoretical Foundations of Machine Learning Lecture 23 – 10/27/2023 University of Wisconsin–Madison, Fall 2023
Lecture 23: Experts problem (continued), Adversarial bandits
Lecturer: Kirthevasan Kandasamy Scribed by: Congwei Yang and Bo-Hsun Chen
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the instructor.
In this lecture, we will continue the discussion of the Hedge algorithm, and then begin the topic of adversarial bandits.
1 Experts problem (continued)
Consider the Hedge algorithm introduced in the last lecture. For any policy $\pi$ which samples an action according to $p_t$ on round $t$, define
$$\bar{R}_T(\pi, l, a) = \sum_{t=1}^T p_t^\top l_t - \sum_{t=1}^T l_t(a).$$
Let $a^* = \arg\min_{a \in [K]} \sum_{t=1}^T l_t(a)$. We have
\begin{align*}
R_T(\pi, l) &= \mathbb{E}\left[\sum_{t=1}^T l_t(A_t)\right] - \min_{a \in [K]} \sum_{t=1}^T l_t(a) \\
&= \mathbb{E}\left[\sum_{t=1}^T \mathbb{E}[l_t(A_t) \mid p_t]\right] - \sum_{t=1}^T l_t(a^*)
 = \mathbb{E}\left[\sum_{t=1}^T p_t^\top l_t\right] - \sum_{t=1}^T l_t(a^*) \\
&= \mathbb{E}[\bar{R}_T(\pi, l, a^*)].
\end{align*}
Hence, if we bound $\bar{R}_T(\pi, l, a^*)$, then we have a bound for $R_T(\pi, l)$.
Theorem 1 (Hedge). Let the loss vector on round $t$ be $l_t \in \mathbb{R}_+^K$ for all $t$, and let $l_t^2 \in \mathbb{R}_+^K$ denote the vector with $l_t^2(i) = (l_t(i))^2$. Then, for $\eta \le 1$, the Hedge algorithm satisfies:
(i) Let $l = (l_1, \dots, l_T)$ be an arbitrary sequence of losses and let $a \in [K]$. Then, if $p_t^\top l_t \le 1$ for all $t$, we have
$$\bar{R}_T(\pi, l, a) \le \frac{\log K}{\eta} + \eta \sum_{t=1}^T p_t^\top l_t^2.$$
(ii) If $l_t \in [0,1]^K$ for all $t$, then
$$\bar{R}_T(\pi, l, a) \le \frac{\log K}{\eta} + \eta T.$$
(iii) If we choose $\eta = \sqrt{\frac{\log K}{T}}$, then for all $a \in [K]$ and all loss sequences $l$,
$$\bar{R}_T(\pi, l, a) \le 2\sqrt{T \log K}.$$

Proof. Define $\Phi_t = \frac{1}{\eta} \log\left(\sum_{a=1}^K e^{-\eta L_t(a)}\right)$, where $L_t(a) = \sum_{s=1}^t l_s(a)$. Consider
\begin{align*}
\Phi_t - \Phi_{t-1}
&= \frac{1}{\eta} \log\left(\frac{\sum_{a=1}^K e^{-\eta L_t(a)}}{\sum_{a=1}^K e^{-\eta L_{t-1}(a)}}\right)
 = \frac{1}{\eta} \log\left(\frac{\sum_{a=1}^K e^{-\eta L_{t-1}(a)}\, e^{-\eta l_t(a)}}{\sum_{a=1}^K e^{-\eta L_{t-1}(a)}}\right) \\
&= \frac{1}{\eta} \log\left(\sum_{a=1}^K p_t(a)\, e^{-\eta l_t(a)}\right) \\
&\le \frac{1}{\eta} \log\left(\sum_{a=1}^K p_t(a)\left(1 - \eta l_t(a) + \eta^2 l_t^2(a)\right)\right)
 \qquad (\text{as } e^{-y} \le 1 - y + y^2 \ \forall\, y \ge -1) \\
&= \frac{1}{\eta} \log\left(1 - \eta\, p_t^\top l_t + \eta^2\, p_t^\top l_t^2\right) \\
&\le \frac{1}{\eta} \left(-\eta\, p_t^\top l_t + \eta^2\, p_t^\top l_t^2\right)
 \qquad (\text{as } \log(1+y) \le y \ \forall\, y \ge -1 \text{ and since } \eta\, p_t^\top l_t \le 1) \\
&= -p_t^\top l_t + \eta\, p_t^\top l_t^2.
\end{align*}
Summing over $t$, we have $\Phi_T - \Phi_0 \le -\sum_{t=1}^T p_t^\top l_t + \eta \sum_{t=1}^T p_t^\top l_t^2$. Also,
$$\Phi_0 = \frac{1}{\eta} \log\left(\sum_{i=1}^K e^{-\eta L_0(i)}\right) = \frac{\log K}{\eta},
\qquad
\Phi_T = \frac{1}{\eta} \log\left(\sum_{i=1}^K e^{-\eta L_T(i)}\right) \ge \frac{1}{\eta} \log\left(e^{-\eta L_T(a)}\right) = -L_T(a) = -\sum_{t=1}^T l_t(a).$$
Combining the two displays,
$$-\sum_{t=1}^T l_t(a) - \frac{\log K}{\eta} \le \Phi_T - \Phi_0 \le -\sum_{t=1}^T p_t^\top l_t + \eta \sum_{t=1}^T p_t^\top l_t^2,$$
and rearranging gives
$$\bar{R}_T(\pi, l, a) = \sum_{t=1}^T p_t^\top l_t - \sum_{t=1}^T l_t(a) \le \frac{\log K}{\eta} + \eta \sum_{t=1}^T p_t^\top l_t^2.$$
The proof for (i) is complete. To prove (ii), note that $l_t \in [0,1]^K$, so $l_t^2(a) \le 1$ for all $a$, which gives $p_t^\top l_t^2 \le 1$ and hence $\bar{R}_T(\pi, l, a) \le \frac{\log K}{\eta} + \eta T$. Statement (iii) follows by optimizing over $\eta$.
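For intuition, the exponential-weights update can be written as a minimal Python sketch (full-information feedback). The callback get_loss_vector is a hypothetical stand-in for the environment, and the step size defaults to the choice from Theorem 1(iii).

```python
import numpy as np

def hedge(T, K, get_loss_vector, eta=None, rng=None):
    """Minimal sketch of Hedge (exponential weights) with full-information feedback.

    get_loss_vector(t) is a hypothetical callback returning the loss vector
    l_t in [0, 1]^K chosen by the environment on round t.
    """
    rng = np.random.default_rng() if rng is None else rng
    eta = np.sqrt(np.log(K) / T) if eta is None else eta   # step size from Theorem 1(iii)
    L = np.zeros(K)                          # cumulative losses L_{t-1}(a)
    total_loss = 0.0
    for t in range(1, T + 1):
        w = np.exp(-eta * (L - L.min()))     # p_t(a) ∝ exp(-eta L_{t-1}(a)); shifted for stability
        p = w / w.sum()
        a = rng.choice(K, p=p)               # sample A_t ~ p_t
        l = np.asarray(get_loss_vector(t))   # full information: the whole loss vector is revealed
        total_loss += l[a]
        L += l                               # update cumulative losses of all actions
    return total_loss
```

For example, hedge(1000, 5, lambda t: np.random.default_rng(t).uniform(size=5)) runs this sketch against an i.i.d. uniform loss sequence.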
2 Adversarial Bandits
Adversarial bandits is a variant of the experts problem in which the learner only observes the loss for the action taken. It has the following components:

1. On each round, the learner chooses $A_t \in [K]$. Simultaneously, the environment picks $l_t \in [0,1]^K$.
2. The learner incurs the loss $l_t(A_t)$.
3. The learner observes only $l_t(A_t)$ (bandit feedback).
The regret $R_T(\pi, l)$ is defined exactly as in the experts problem:
$$R_T'(\pi, l) = \sum_{t=1}^T l_t(A_t) - \min_{a \in [K]} \sum_{t=1}^T l_t(a), \qquad R_T(\pi, l) = \mathbb{E}[R_T'(\pi, l)].$$
As before, we are interested in bounding $\sup_l R_T(\pi, l)$.
Here, the main challenge, when compared to full-information feedback, is balancing exploration and exploitation.
2.1 The EXP-3 Algorithm
The EXP-3 algorithm builds on Hedge: since only $l_t(A_t)$ is observed, we estimate the full loss vector $l_t$ using the following inverse probability weighted estimator:
$$\hat{l}_t(a) = \frac{l_t(a)}{p_t(a)}\, \mathbb{1}(a = A_t) =
\begin{cases}
\dfrac{l_t(A_t)}{p_t(A_t)}, & \text{if } a = A_t \\
0, & \text{otherwise}
\end{cases}
\tag{1}$$
Here, $p_t(a)$ is the probability of choosing action $a$ in Hedge. So, $\hat{l}_t$ would look as follows:
$$\hat{l}_t = \begin{bmatrix} 0 & \cdots & 0 & \dfrac{l_t(A_t)}{p_t(A_t)} & 0 & \cdots & 0 \end{bmatrix}^\top.$$
We will show that $\hat{l}_t$ is an unbiased estimator of $l_t$, i.e., $\mathbb{E}[\hat{l}_t \mid p_t] = l_t$.
The EXP-3 algorithm is stated below.
Algorithm 1 EXP-3 (Exponential weights for exploration and exploitation)
Require: time horizon T , learning rate η
Set $L_0 \leftarrow \mathbf{0}_K$;
for $t = 1, 2, \dots, T$ do
    Set $p_t(a) \leftarrow \dfrac{\exp(-\eta L_{t-1}(a))}{\sum_{j=1}^K \exp(-\eta L_{t-1}(j))}$;
    Sample $A_t \sim p_t$, and incur loss $l_t(A_t)$;
    Update $L_t(A_t) \leftarrow L_{t-1}(A_t) + \dfrac{l_t(A_t)}{p_t(A_t)}$;
    Update $L_t(a) \leftarrow L_{t-1}(a)$, $\forall\, a \neq A_t$;
end for
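The pseudocode above can be written as the following minimal Python sketch; the callback get_loss is a hypothetical stand-in for the bandit environment (it returns only $l_t(A_t)$), and the step size defaults to the choice from Theorem 2 below.

```python
import numpy as np

def exp3(T, K, get_loss, eta=None, rng=None):
    """Minimal sketch of EXP-3. get_loss(t, a) is a hypothetical callback that
    returns only l_t(a) in [0, 1] for the pulled arm (bandit feedback)."""
    rng = np.random.default_rng() if rng is None else rng
    eta = np.sqrt(np.log(K) / (K * T)) if eta is None else eta  # step size from Theorem 2
    L_hat = np.zeros(K)                # cumulative importance-weighted loss estimates
    total_loss = 0.0
    for t in range(1, T + 1):
        w = np.exp(-eta * (L_hat - L_hat.min()))   # exponential weights, shifted for stability
        p = w / w.sum()
        a = rng.choice(K, p=p)                     # sample A_t ~ p_t
        loss = get_loss(t, a)                      # only the loss of the pulled arm is observed
        total_loss += loss
        L_hat[a] += loss / p[a]                    # inverse-probability-weighted update, Eq. (1)
    return total_loss
```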
Intuitively, the exploitation in EXP-3 comes from the fact that arms with large estimated cumulative losses are discounted more heavily by the exponential weights. The exploration comes from the fact that we only discount the arm that was pulled, so arms that are pulled less frequently are more likely to be pulled in future rounds.
Before we analyze the algorithm, we will state the following lemma.
Lemma 1. If $\hat{l}_t$ is chosen as in Eq. (1), the following are true for all $a \in [K]$:
1. $\mathbb{E}[\hat{l}_t(a) \mid p_t] = l_t(a)$
2. $\mathbb{E}[\hat{l}_t^2(a) \mid p_t] = \dfrac{l_t^2(a)}{p_t(a)}$
We will state the main theorem for EXP-3 below, after proving this lemma; the theorem itself will be proved in the next class.
Proof (of Lemma 1).
(i) $\mathbb{E}[\hat{l}_t(a) \mid p_t] = p_t(a) \cdot \dfrac{l_t(a)}{p_t(a)} + (1 - p_t(a)) \cdot 0 = l_t(a)$.
(ii) $\mathbb{E}[\hat{l}_t^2(a) \mid p_t] = p_t(a) \cdot \dfrac{l_t^2(a)}{p_t^2(a)} + (1 - p_t(a)) \cdot 0 = \dfrac{l_t^2(a)}{p_t(a)}$.
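As a quick numerical sanity check of these two identities, the following sketch draws $A_t \sim p_t$ many times, forms $\hat{l}_t$ for each draw, and compares the empirical means against $l_t(a)$ and $l_t^2(a)/p_t(a)$. The specific values of $p_t$ and $l_t$ below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])      # p_t: arbitrary illustrative distribution over K = 3 arms
l = np.array([0.9, 0.4, 0.7])      # l_t: arbitrary illustrative losses
K, n = len(p), 200_000

A = rng.choice(K, size=n, p=p)          # n independent draws of A_t ~ p_t
l_hat = np.zeros((n, K))
l_hat[np.arange(n), A] = l[A] / p[A]    # hat{l}_t for each draw, as in Eq. (1)

print(l_hat.mean(axis=0), l)                   # empirical mean of hat{l}_t(a)   ≈ l_t(a)
print((l_hat ** 2).mean(axis=0), l ** 2 / p)   # empirical mean of hat{l}_t^2(a) ≈ l_t^2(a)/p_t(a)
```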
The following theorem gives an upper bound on the regret of EXP-3.
Theorem 2 (EXP-3). Assume the loss vector on each round $t$ satisfies $l_t \in [0,1]^K$. Then, EXP-3 satisfies, for all $l = (l_1, \dots, l_T)$,
$$R_T(\pi, l) \le \frac{\log K}{\eta} + \eta K T.$$
If we choose $\eta = \sqrt{\frac{\log K}{KT}}$ (which balances the two terms above), then $R_T(\pi, l) \le 2\sqrt{KT \log K}$.
Remark. The upper bounds of some strategies that we discussed in class are compared here.
• Hedge: $\tilde{O}(\sqrt{T})$ (experts problem)
• EXP-3: $\tilde{O}(\sqrt{KT})$ (adversarial bandits)
• UCB: $\tilde{O}(\sqrt{KT})$ (stochastic bandits, minimax regret)
Hedge has a better regret than EXP-3 since the experts problem, with full-information feedback, is easier. Interestingly, even though the adversarial bandit problem subsumes the stochastic bandit problem, the worst-case regret is the same. When we prove lower bounds for the adversarial bandit problem in the next class, we will see that the hardest stochastic bandit problems are as hard as the hardest adversarial bandit problems.