
CS/ECE/STAT-861: Theoretical Foundations of Machine Learning University of Wisconsin–Madison, Fall 2023
Homework 3. Due 11/08/2023, 11.00 am
Instructions:
1. Homework is due at 11 am on the due date. Please hand over your homework at the beginning of class. Please see the course website for the policy on late submissions.
2. I recommend that you typeset your homework using LaTeX. You will receive 5 percent extra credit if you do so. If you are submitting a hand-written homework, please make sure it is cleanly written up and legible. I will not invest undue effort to understand bad handwriting.
3. You must hand in a hard copy of the homework. The only exception is if you are out of town, in which case you must let me know ahead of time and email me a copy of the homework by 11 am on the due date. If this is the case, your homework must be typeset using LaTeX. Please do not email written and scanned copies.
4. Unless otherwise specified, you may use any result we have already proved in class. You do not need to prove them from scratch, but clearly state which result you are using.
5. Solutions to some of the problems may be found as examples or exercises in the suggested textbooks or other resources. You are encouraged to try the problems on your own first before searching for the solution. If you find an existing solution, first read and understand the proof, and then write it in your own words. Please indicate any references you have used at the beginning of your solution when you turn in your homework.
6. Collaboration: You are allowed to collaborate in groups of up to 3 on each problem. However, you must write the solution in your own words. If you collaborate, please indicate your collaborators at the beginning of your solution.
I may release more questions over the next week as we cover more material in class.

1 Short questions
1. (Do losses need to be non-negative for Hedge/EXP3?) Suppose we are in a sequential decision-making environment with adversarial losses that are known to lie in $[-1, 0]^K$. The losses are bounded, but may be negative. If we have full-information feedback, we will use the Hedge algorithm, and if we have bandit feedback, we will use EXP3. In this question, we will explore whether it is necessary to shift the losses so that they are non-negative before applying either of these algorithms.
(a) [2 pts] (Shifting losses in Hedge) Intuitively, explain if you expect the Hedge algorithm to still work if the losses are negative, but are bounded in [−1, 0].
(b) [2 pts] Justify your answer in part 1a by showing where the proof we did in class breaks down without the non-negativity assumption, or by arguing that the proof will still carry through.
(c) [2 pts] (Shifting losses in EXP3) Intuitively, explain if you expect EXP3 to still work if the losses are negative, but are bounded in [−1, 0]. You may reason using the exploration-exploitation tradeoff.
(d) [2 pts] Justify your answer in part 1c by showing where the proof we did in class breaks down without the non-negativity assumption, or by arguing that the proof will still carry through.
Solution:
(a) Yes, Hedge will still work (for bounded losses), as we will still assign more weight to actions with small cumulative historical losses when sampling.
(b) The proof carries through. The only place where we used properties of the losses is in applying the inequality $e^{-y} \le 1 - y + y^2$, which is valid as long as $y \ge -1$. In the Hedge analysis the inequality is applied with $y = \eta\, l_t(i)$; since $l_t(i) \in [-1, 0]$ and $\eta \le 1$ (as is the case for the usual choice $\eta = \sqrt{\log(K)/T}$ once $T \ge \log(K)$), we still have $y \ge -1$.
(c) No, EXP3 will no longer work. As the losses are negative, if we take an action, we become even more likely to sample it in future rounds, so EXP3 is less likely to explore.
(d) The inequality $e^{-y} \le 1 - y + y^2$ can no longer be applied, as the importance-weighted loss estimate $\hat{l}_t(i) = l_t(i)/p_t(i)$ (for the action taken) may have a large negative value, so $y = \eta\,\hat{l}_t(i)$ can fall well below $-1$.
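The effect in parts (c) and (d) is easy to see numerically. The following minimal NumPy sketch (illustrative only; the learning rate, horizon, and i.i.d. uniform losses are arbitrary choices, not specified in the problem) contrasts EXP3 run directly on losses in $[-1,0]$ with EXP3 run on the losses shifted to $[0,1]$:

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 5, 2000
eta = np.sqrt(np.log(K) / (K * T))   # illustrative EXP3 learning rate

def probs(cum_loss_est, eta):
    # Exponential weights; subtracting the min is only for numerical stability.
    w = np.exp(-eta * (cum_loss_est - cum_loss_est.min()))
    return w / w.sum()

cum_raw = np.zeros(K)        # EXP3 run directly on losses in [-1, 0]
cum_shift = np.zeros(K)      # EXP3 run on shifted losses in [0, 1]
for t in range(T):
    losses = rng.uniform(-1.0, 0.0, size=K)   # adversary's losses for this round

    p = probs(cum_raw, eta)
    a = rng.choice(K, p=p)
    # Importance-weighted estimate l_t(A_t)/p_t(A_t): a large *negative* number
    # when p[a] is tiny, so the sampled arm becomes even more likely next round
    # and exploration is suppressed.
    cum_raw[a] += losses[a] / p[a]

    p2 = probs(cum_shift, eta)
    a2 = rng.choice(K, p=p2)
    # Shifting by +1 makes the estimates non-negative; the shift adds the same
    # constant to every arm's loss in each round, so the regret of any action
    # sequence is unchanged.
    cum_shift[a2] += (losses[a2] + 1.0) / p2[a2]

print("min sampling prob, raw losses:    ", probs(cum_raw, eta).min())
print("min sampling prob, shifted losses:", probs(cum_shift, eta).min())
```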
2. [6 pts] (The doubling trick) In class, we looked at several algorithms for sequential decision-making problems where the time horizon is assumed to be known ahead of time. However, often, we are interested in any-time algorithms, where we execute an algorithm indefinitely, but wish to bound the regret after any number of rounds.
The doubling trick is often used to convert a given known-time-horizon algorithm into an any-time procedure. For such an algorithm $A$, let $A_T$ be the version of $A$ that is executed with time horizon $T$. The doubling trick proceeds by choosing an arbitrary initial time horizon $T_0$ (for simplicity, we may take $T_0 = 1$), executing $A_{T_0}$ for $T_0$ rounds, then $A_{2T_0}$ for $2T_0$ rounds, and proceeding in this fashion, doubling the time horizon each time we finish an execution.
Algorithm 1: The doubling trick
Given: an algorithm $A$.
Set $T_0 \leftarrow 1$. Initialize algorithm $A_{T_0}$.
for $t = 1, 2, \ldots$ do
    if $t \ge 2T_0$ then
        Set $T_0 \leftarrow 2T_0$. Re-initialize algorithm $A_{T_0}$.
    end if
    Execute round $t - T_0 + 1$ of algorithm $A_{T_0}$.
end for
In Hedge and EXP3, re-initialization would mean resetting the cumulative (estimated) losses to 0 and setting the learning rate based on the new horizon, e.g., $\eta = \sqrt{\log(K)/T_0}$ for Hedge and $\eta = \sqrt{\log(K)/(K T_0)}$ for EXP3. In UCB, this could mean resetting the mean estimates and confidence intervals.
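As a concrete illustration (a minimal sketch; the `make_algo`/`act`/`update` interface and the `env` callback are assumptions for the example, not something specified in the homework), the wrapper below implements the doubling trick around any known-horizon learner:

```python
def doubling_trick(make_algo, env, total_rounds):
    """Run a known-horizon algorithm in an any-time fashion via the doubling trick.

    make_algo(T): returns a fresh instance of A_T exposing .act() and .update(loss).
    env(t, action): returns the loss observed at round t.
    (Both interfaces are illustrative assumptions.)
    """
    horizon = 1                        # T0
    algo = make_algo(horizon)          # initialize A_{T0}
    for t in range(1, total_rounds + 1):
        if t >= 2 * horizon:           # the current instance has used its T0 rounds
            horizon *= 2               # T0 <- 2*T0
            algo = make_algo(horizon)  # re-initialize: reset losses, new learning rate
        action = algo.act()            # round t - T0 + 1 of A_{T0}
        algo.update(env(t, action))
```

For instance, `make_algo(T)` could return a fresh Hedge or EXP3 instance with the learning rate set from $T$ as described above.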

We wish to bound the regret of such an algorithm, which can be written as $R_T \triangleq \sum_{t=1}^T r_t$, where $r_t$ is the instantaneous regret. For instance, in the experts problem, this would be the difference between the loss of the action taken and the loss of any fixed action.

Suppose that there exist $\alpha > 0$, $\beta > 0$, and $\gamma \in (0,1)$ such that the regret of algorithm $A$ satisfies $R_T(A_T) \le \alpha T^\gamma + \beta$ for all known time horizons $T$. Let $A'$ denote the version of this algorithm modified using the doubling trick. Show that when executed in an any-time fashion, $A'$ satisfies $R_T(A') \in \tilde{O}(T^\gamma)$ for all $T$.
Solution: Let $T$ be given, and let $k$ be the integer such that $2^{k-1} \le T < 2^k$, so that $k \le \log_2(T) + 1$ and $2^k \le 2T$. With $T_0 = 1$, the instance $A_{2^j}$ is executed for rounds $t = 2^j, \ldots, 2^{j+1}-1$, so rounds $1, \ldots, T$ are covered by the instances $A_{2^0}, \ldots, A_{2^{k-1}}$. We can decompose the regret as follows:
$$R_T(A') = \sum_{t=1}^T r_t \;\le\; \sum_{t=1}^{2^k - 1} r_t \;=\; \sum_{j=0}^{k-1} \sum_{t=2^j}^{2^{j+1}-1} r_t \;\le\; \sum_{j=0}^{k-1}\big(\alpha\, 2^{j\gamma} + \beta\big) \;=\; \alpha \sum_{j=0}^{k-1} (2^\gamma)^j + \beta k \;=\; \alpha\,\frac{(2^\gamma)^k - 1}{2^\gamma - 1} + \beta k \;\le\; \frac{\alpha\, 2^{k\gamma}}{2^\gamma - 1} + \beta\big(\log_2(T) + 1\big) \;\le\; \frac{\alpha\, 2^{\gamma}}{2^\gamma - 1}\, T^{\gamma} + \beta\big(\log_2(T) + 1\big).$$
Here, the first inequality simply adds the terms for the last $2^k - 1 - T$ rounds, which should only increase the regret. The second inequality applies the given bound to each instance $A_{2^j}$, whose known time horizon is $2^j$. We then sum the geometric series and use $k \le \log_2(T) + 1$; the final inequality uses $2^k \le 2T$. The claim follows since the final expression is in $\tilde{O}(T^\gamma)$.

2 Lower bounds for adversarial bandits and learning from experts

1. [15 pts] (A lower bound for adversarial bandits) Recall the adversarial bandit problem, where we defined the regret for a policy $\pi$ and a sequence of loss vectors $l = (l_1, \ldots, l_T) \in [0,1]^{K\times T}$ as follows:
$$R_T(\pi, l) = \mathbb{E}_\pi\Big[\sum_{t=1}^T l_t(A_t)\Big] - \min_{a\in[K]}\sum_{t=1}^T l_t(a).$$
Here, the expectation $\mathbb{E}_\pi$ is with respect to the randomness of the policy. Show that the minimax regret satisfies
$$\inf_\pi \sup_{l\in[0,1]^{K\times T}} R_T(\pi, l) \in \Omega\big(\sqrt{KT}\big).$$
Hint: Recall that for two Bernoulli distributions $\mathrm{Bern}(p)$, $\mathrm{Bern}(q)$, the KL divergence can be upper bounded by $\mathrm{KL}(\mathrm{Bern}(p), \mathrm{Bern}(q)) \le \frac{(p-q)^2}{q(1-q)}$.

Solution: Let the policy $\pi$ be given. We will consider a distribution over losses in $[0,1]^{K\times T}$ and show that the expected regret of $\pi$ over this distribution is at least $\Omega(\sqrt{KT})$. Hence, there must be at least one sequence of losses in the support of this distribution whose regret is at least this large.

For this, we will consider two stochastic bandit models $\nu^{(1)}, \nu^{(2)}$, where $\nu^{(j)} = (\nu_1^{(j)}, \ldots, \nu_K^{(j)})$ and each $\nu_i^{(j)}$ has Bernoulli losses. Let $P^{(1)}, P^{(2)}$ denote the probability distributions of the action-loss sequence $A_1, l_1(A_1), \ldots, A_t, l_t(A_t), \ldots, A_T, l_T(A_T)$ due to $\pi$'s interaction with $\nu^{(1)}, \nu^{(2)}$ respectively, and let $\mathbb{E}^{(1)}, \mathbb{E}^{(2)}$ denote the corresponding expectations.

First, let $\nu^{(1)}$ be defined by $\nu_1^{(1)} = \mathrm{Bern}(1/2-\delta)$ and $\nu_i^{(1)} = \mathrm{Bern}(1/2)$ for all $i > 1$. Here, $\delta \le 1/8$ is a parameter we will specify later based on $K$ and $T$. Let $N_{a,T} = \sum_{t=1}^T \mathbf{1}(A_t = a)$ denote the number of times arm $a$ was pulled in $T$ rounds. As $\sum_{a=1}^K \mathbb{E}^{(1)}[N_{a,T}] = T$, there exists some $a' \in \{2, \ldots, K\}$ such that $\mathbb{E}^{(1)}[N_{a',T}] \le T/(K-1)$. We will define $\nu^{(2)}$ so that $\nu_{a'}^{(2)} = \mathrm{Bern}(1/2 - 2\delta)$ and $\nu_i^{(2)} = \nu_i^{(1)}$ for all $i \ne a'$. Denote $R_T'(\pi, l) = \sum_{t=1}^T l_t(A_t) - \min_{a\in[K]}\sum_{t=1}^T l_t(a)$, so that we have $R_T = \mathbb{E}_\pi[R_T']$. We can now lower bound the worst-case regret for the policy $\pi$ as follows:
$$\sup_{l\in[0,1]^{K\times T}} \mathbb{E}_\pi[R_T'(\pi, l)] \;\ge\; \mathbb{E}_{j\sim \mathrm{Unif}(\{1,2\})}\,\mathbb{E}_{l\sim\nu^{(j)}}\,\mathbb{E}_\pi[R_T'(\pi, l)] \;=\; \frac{1}{2}\,\mathbb{E}_\pi\big[\mathbb{E}_{l\sim\nu^{(1)}}[R_T'(\pi,l)]\big] + \frac{1}{2}\,\mathbb{E}_\pi\big[\mathbb{E}_{l\sim\nu^{(2)}}[R_T'(\pi,l)]\big]. \qquad (1)$$
In the first step we have used the fact that the maximum is larger than the average, while observing that drawing $j \sim \mathrm{Unif}(\{1,2\})$ and then $l \sim \nu^{(j)}$ defines a distribution over $\{0,1\}^{K\times T}$, which is a subset of $[0,1]^{K\times T}$. Next, using Jensen's inequality and the fact that the pointwise minimum is a concave function, we can lower bound $\mathbb{E}_{l\sim\nu^{(j)}}[R_T'(\pi,l)]$ as follows:
$$\mathbb{E}_{l\sim\nu^{(j)}}[R_T'(\pi,l)] = \mathbb{E}_{l\sim\nu^{(j)}}\Big[\sum_{t=1}^T l_t(A_t) - \min_{a\in[K]}\sum_{t=1}^T l_t(a)\Big] \ge \mathbb{E}_{l\sim\nu^{(j)}}\Big[\sum_{t=1}^T l_t(A_t)\Big] - \min_{a\in[K]}\mathbb{E}_{l\sim\nu^{(j)}}\Big[\sum_{t=1}^T l_t(a)\Big] = \mathbb{E}_{l\sim\nu^{(j)}}\Big[\sum_{t=1}^T l_t(A_t)\Big] - T\mu_j^\star. \qquad (2)$$
Here, $\mu_j^\star = \min_{a\in[K]}\mathbb{E}_{X\sim\nu_a^{(j)}}[X]$ is the value of the smallest mean in the bandit model $\nu^{(j)}$. Therefore, we can write, for $j \in \{1,2\}$,
$$\mathbb{E}_\pi\big[\mathbb{E}_{l\sim\nu^{(j)}}[R_T'(\pi,l)]\big] \;\ge\; \mathbb{E}^{(j)}\Big[\sum_{t=1}^T l_t(A_t)\Big] - T\mu_j^\star \;\triangleq\; R_T^{\mathrm{stoc}}(\pi, \nu^{(j)}). \qquad (3)$$
Here, we have observed that $\mathbb{E}^{(j)}$ is simply the expectation under $\pi$'s interaction with $\nu^{(j)}$. Moreover, the final term, denoted $R_T^{\mathrm{stoc}}(\pi, \nu^{(j)})$, is precisely the "stochastic bandit regret" of policy $\pi$ on the stochastic bandit model $\nu^{(j)}$. Combining the results in (1), (2), and (3), we have the following lower bound on the worst-case adversarial regret in terms of the stochastic regret on the bandit models $\nu^{(1)}$ and $\nu^{(2)}$:
$$\sup_{l\in[0,1]^{K\times T}} \mathbb{E}_\pi[R_T'(\pi,l)] \;\ge\; \frac{1}{2} R_T^{\mathrm{stoc}}(\pi, \nu^{(1)}) + \frac{1}{2} R_T^{\mathrm{stoc}}(\pi, \nu^{(2)}).$$
Next, the following inequalities are straightforward to verify (every suboptimal arm in $\nu^{(1)}$ has gap at least $\delta$, and arm 1 has gap $\delta$ in $\nu^{(2)}$):
$$R_T^{\mathrm{stoc}}(\pi, \nu^{(1)}) \ge P^{(1)}\big(N_{1,T} \le T/2\big)\,\frac{T\delta}{2}, \qquad R_T^{\mathrm{stoc}}(\pi, \nu^{(2)}) \ge P^{(2)}\big(N_{1,T} > T/2\big)\,\frac{T\delta}{2}.$$
Hence, using a corollary of the Neyman–Pearson test (the Bretagnolle–Huber inequality), we have
$$\sup_{l\in[0,1]^{K\times T}} R_T(\pi, l) \;\ge\; \frac{T\delta}{4}\Big(P^{(1)}\big(N_{1,T} \le T/2\big) + P^{(2)}\big(N_{1,T} > T/2\big)\Big) \;\ge\; \frac{T\delta}{8}\exp\Big(-\mathrm{KL}\big(P^{(1)}, P^{(2)}\big)\Big).$$
All that is left to do is to upper bound the KL divergence between $P^{(1)}$ and $P^{(2)}$. For this, we will use the divergence decomposition lemma and proceed as follows:
$$\mathrm{KL}\big(P^{(1)}, P^{(2)}\big) = \sum_{a=1}^K \mathbb{E}^{(1)}[N_{a,T}]\,\mathrm{KL}\big(\nu_a^{(1)}, \nu_a^{(2)}\big) = \mathbb{E}^{(1)}[N_{a',T}]\,\mathrm{KL}\big(\nu_{a'}^{(1)}, \nu_{a'}^{(2)}\big) \le \frac{T}{K-1}\cdot\frac{(2\delta)^2}{(1/2-2\delta)(1/2+2\delta)} \le \frac{32\,T\delta^2}{K-1}.$$
Here, the second step uses the fact that $\nu^{(1)}$ and $\nu^{(2)}$ differ only at arm $a'$. The third step uses the fact that $\mathbb{E}^{(1)}[N_{a',T}] \le T/(K-1)$ by our construction and then applies the upper bound on the KL divergence between Bernoulli distributions given in the hint. The last step uses the fact that $\delta \le 1/8$, so that $(1/2-2\delta)(1/2+2\delta) \ge 3/16$.
Finally, choosing $\delta = \sqrt{(K-1)/T}$, which satisfies $\delta \le 1/8$ whenever $T \ge 64(K-1)$, we obtain the desired lower bound:
$$\sup_{l\in[0,1]^{K\times T}} R_T(\pi, l) \;\ge\; \frac{e^{-32}}{8}\sqrt{T(K-1)} \;\in\; \Omega\big(\sqrt{KT}\big) \quad \text{(for } K \ge 2\text{)}.$$
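As an aside (illustrative only; the grid of values is arbitrary), the Bernoulli KL bound from the hint is easy to verify numerically:

```python
import numpy as np

def kl_bern(p, q):
    """KL divergence KL(Bern(p), Bern(q))."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

ps = np.linspace(0.05, 0.95, 19)
qs = np.linspace(0.05, 0.95, 19)
worst_gap = min(
    (p - q) ** 2 / (q * (1 - q)) - kl_bern(p, q)
    for p in ps for q in qs
)
print(worst_gap >= 0)   # True: the chi-squared-style bound dominates the KL
```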
2. [6 pts] (A lower bound for the expert problem) Recall the expert problem, which is similar to the adversarial bandit problem, but we observe the losses for all actions. Show that the minimax regret satisfies,
$$\inf_\pi \sup_{l\in[0,1]^{K\times T}} R_T(\pi, l) \in \Omega\big(\sqrt{T}\big).$$
N.B: If you choose to adapt the proof from part 1, it is sufficient to point out where the proof changes instead of
re-writing from scratch.
Solution: We will follow a similar outline, with $\nu^{(1)}$ defined by $\nu_1^{(1)} = \mathrm{Bern}(1/2-\delta)$ and $\nu_i^{(1)} = \mathrm{Bern}(1/2)$ for all $i > 1$. Next, $\nu^{(2)}$ is defined by $\nu_2^{(2)} = \mathrm{Bern}(1/2-2\delta)$ and $\nu_i^{(2)} = \nu_i^{(1)}$ for all $i \ne 2$. Here, $\delta \le 1/8$ is a parameter we will specify later based on $T$. The remainder of the proof follows in a similar fashion. However, as we have full-information feedback, the learner observes the entire loss vector on every round, so the KL divergence can be upper bounded as follows:
$$\mathrm{KL}\big(P^{(1)}, P^{(2)}\big) = T\sum_{i=1}^K \mathrm{KL}\big(\nu_i^{(1)}, \nu_i^{(2)}\big) = T\,\mathrm{KL}\big(\nu_2^{(1)}, \nu_2^{(2)}\big) \le 32\,\delta^2 T.$$
By choosing $\delta = 1/\sqrt{T}$ (which satisfies $\delta \le 1/8$ when $T \ge 64$), we obtain
$$\sup_{l\in[0,1]^{K\times T}} \mathbb{E}_\pi[R_T'(\pi, l)] \;\ge\; \frac{e^{-32}}{8}\sqrt{T}.$$
3. (An asymptotic lower bound for the experts problem) Recall that the Hedge algorithm achieves $O(\sqrt{T\log(K)})$ regret. While the result in part 2 is tight in the $\sqrt{T}$ term, it has no dependence on the number of experts $K$. In this question, you will show the following asymptotic lower bound on the regret for the experts problem, which captures this dependence (but in a weaker asymptotic sense):
$$\inf_\pi \lim_{T\to\infty}\lim_{K\to\infty}\; \sup_{l_1,\ldots,l_T} \frac{R_T(\pi, l)}{\sqrt{T\log(K)}} \;\ge\; C.$$
Here, $C$ is a universal constant.
(a) [6 pts] Let $\pi$ be given. Fix the values of $T$ and $K$. Let $P$ be the uniform distribution over $\{0,1\}^{K\times T}$ for the loss vectors $l = (l_1,\ldots,l_T)$. Show that
$$\sup_{l_1,\ldots,l_T} R_T(\pi, l) \;\ge\; \frac{1}{2}\,\mathbb{E}_{l\sim P}\Big[\max_{a\in[K]}\sum_{t=1}^T \big(1 - 2\, l_t(a)\big)\Big].$$
Solution: Let $R_T'(\pi, l) = \sum_{t=1}^T l_t(A_t) - \min_{a\in[K]}\sum_{t=1}^T l_t(a)$, so that we have $R_T = \mathbb{E}_\pi[R_T']$. We have
$$\sup_{l_1,\ldots,l_T} R_T(\pi, l) \;\ge\; \mathbb{E}_{l\sim P}\,\mathbb{E}_\pi[R_T'(\pi, l)] \;=\; \sum_{t=1}^T \mathbb{E}_\pi\,\mathbb{E}_{l_1,\ldots,l_t\sim P}[l_t(A_t)] - \mathbb{E}_{l\sim P}\Big[\min_{a\in[K]}\sum_{t=1}^T l_t(a)\Big] \;=\; \frac{T}{2} - \mathbb{E}_{l\sim P}\Big[\min_{a\in[K]}\sum_{t=1}^T l_t(a)\Big] \;=\; \frac{1}{2}\,\mathbb{E}_{l\sim P}\Big[\max_{a\in[K]}\sum_{t=1}^T \big(1 - 2\, l_t(a)\big)\Big].$$
Here, in the second step we have used the fact that the choice of $A_t$ can only depend on $l_1, \ldots, l_{t-1}$. In the third step we have used the fact that for all $a \in [K]$, we have $\mathbb{E}_{l_t\sim P}[l_t(a)] = 1/2$ as $P$ is the uniform distribution. The last step uses the identity $\frac{T}{2} - \min_{a}\sum_t l_t(a) = \frac{1}{2}\max_a \sum_t (1 - 2\, l_t(a))$.

(b) [2 pts] Prove the asymptotic lower bound shown above.
Hint: You may use the following well-known result about independent Rademacher random vectors $\sigma_1, \ldots, \sigma_T$, where $\sigma_t \in \{-1,1\}^K$:
$$\lim_{T\to\infty}\lim_{K\to\infty}\; \mathbb{E}_{\sigma_1,\ldots,\sigma_T}\Big[\frac{1}{\sqrt{T\log(K)}}\max_{i\in[K]}\sum_{t=1}^T \sigma_t(i)\Big] = \sqrt{2}.$$
Solution: First note that, in the result from part 3a, the quantities $(1 - 2\, l_t(i))$ are independent Rademacher random variables. After applying the hint, the claim follows with the value of the universal constant being $C = \sqrt{2}/2$.
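As an illustrative aside, the Rademacher maximum appearing in the hint can be estimated by Monte Carlo; the sizes below are arbitrary, and for finite $T$ and $K$ the value sits noticeably below the limit $\sqrt{2} \approx 1.414$ because the convergence in $K$ is slow.

```python
import numpy as np

# Rough Monte Carlo estimate (illustrative only) of
#   E[ max_i sum_t sigma_t(i) ] / sqrt(T log K)
# for independent Rademacher sigma_t(i). The limiting value is sqrt(2) ~ 1.414,
# but convergence in K is slow, so expect roughly 1.2-1.3 for these sizes.
rng = np.random.default_rng(0)
T, K, trials = 2000, 1000, 100
vals = []
for _ in range(trials):
    sigma = rng.choice([-1, 1], size=(T, K))       # independent Rademacher entries
    vals.append(sigma.sum(axis=0).max() / np.sqrt(T * np.log(K)))
print(np.mean(vals))
```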
3 Deriving the Hedge algorithm via FTRL
In this question, we will use the follow-the-regularized-leader (FTRL) framework for online convex optimization (OCO) to derive the Hedge algorithm for the experts problem.
Recall that in OCO, a learner is given a compact convex set $\Omega \subset \mathbb{R}^d$. On each round $t$, the learner chooses a weight vector $w_t \in \Omega$. Simultaneously, the environment chooses a convex loss function $f_t : \Omega \to \mathbb{R}$. The learner incurs loss $f_t(w_t)$ and then observes $f_t$, which it can use to choose the weights in future rounds. The FTRL strategy stipulates that we choose $w_t$ to minimize the sum of historical losses plus a regularization term $\frac{1}{\eta}\lambda(w)$, where $\eta > 0$ is the learning rate and $\lambda$ is strongly convex in some norm $\|\cdot\|$. We have
$$w_t \in \operatorname*{argmin}_{w\in\Omega}\; \sum_{s=1}^{t-1} f_s(w) + \frac{1}{\eta}\lambda(w). \qquad (4)$$
Recall that for any sequence of losses $f = \{f_t\}_{t=1}^T$, the regret of FTRL satisfies
$$R_T(\pi_{\mathrm{FTRL}}, f) \;\triangleq\; \sum_{t=1}^T f_t(w_t) - \min_{w\in\Omega}\sum_{t=1}^T f_t(w) \;\le\; \frac{1}{\eta}\Big(\max_{w\in\Omega}\lambda(w) - \min_{w\in\Omega}\lambda(w)\Big) + \eta\sum_{t=1}^T \|g_t\|_\star^2. \qquad (5)$$
Here, $g_t \in \partial f_t(w_t)$ is a subgradient of $f_t$ at $w_t$, and $\|\cdot\|_\star$ is the dual norm of $\|\cdot\|$.

In the experts problem, the learner is given $K$ actions. On round $t$, the learner chooses a probability vector $p_t \in \Delta_K$, where $\Delta_K = \{p \in \mathbb{R}_+^K : p^\top \mathbf{1} = 1\}$, from which an expert will be sampled for the round. Simultaneously, the environment chooses a loss vector $l_t \in [0,1]^K$, where $l_t(i)$ is the loss corresponding to expert $i$. After averaging out the randomness of sampling, the learner's loss on round $t$ is $p_t^\top l_t$. The learner then observes $l_t$. Recall that the regret for the experts problem for a sequence of losses $l = (l_1, \ldots, l_T)$ is
$$R_T(\pi, l) \;\triangleq\; \sum_{t=1}^T l_t^\top p_t - \min_{a\in[K]}\sum_{t=1}^T l_t(a) \;=\; \sum_{t=1}^T l_t^\top p_t - \min_{p\in\Delta_K}\sum_{t=1}^T l_t^\top p.$$
We will apply FTRL to the experts problem with the negative entropy regularizer $\lambda(p) = \sum_{i=1}^K p(i)\log(p(i))$.

1. [3 pts] Show that $\lambda(p)$ is strongly convex in the $L_1$-norm when the domain is the probability simplex $\Delta_K$. That is, show that for all $p, q \in \Delta_K$, we have
$$\lambda(q) \;\ge\; \lambda(p) + \nabla\lambda(p)^\top(q - p) + \frac{1}{2}\|q - p\|_1^2.$$
Hint: Consider using Pinsker’s inequality.
Solution: First observe that $\frac{\partial \lambda(p)}{\partial p(i)} = 1 + \log(p(i))$. Hence, we can write the RHS as
$$\mathrm{RHS} = \sum_{i=1}^K p(i)\log(p(i)) + \sum_{i=1}^K \big(1 + \log(p(i))\big)\big(q(i) - p(i)\big) + \frac{1}{2}\|q - p\|_1^2 = \sum_{i=1}^K q(i)\log(p(i)) + \frac{1}{2}\|q - p\|_1^2,$$
where we have used $\sum_i q(i) = \sum_i p(i) = 1$. By Pinsker's inequality, we therefore have
$$\mathrm{LHS} - \mathrm{RHS} = \sum_{i=1}^K q(i)\log\frac{q(i)}{p(i)} - \frac{1}{2}\|q - p\|_1^2 = \mathrm{KL}(q, p) - \frac{1}{2}\|q - p\|_1^2 \;\ge\; 0.$$
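As a quick numerical spot check (illustrative only; the dimension $K = 8$ and the Dirichlet sampling are arbitrary choices), the snippet below samples random pairs $p, q$ from the simplex and verifies the strong-convexity inequality just proved.

```python
import numpy as np

rng = np.random.default_rng(0)

def neg_entropy(p):
    return np.sum(p * np.log(p))

def grad_neg_entropy(p):
    return 1.0 + np.log(p)

ok = True
for _ in range(10_000):
    p = rng.dirichlet(np.ones(8))          # random points in the simplex (K = 8)
    q = rng.dirichlet(np.ones(8))
    lhs = neg_entropy(q)
    rhs = neg_entropy(p) + grad_neg_entropy(p) @ (q - p) + 0.5 * np.sum(np.abs(q - p)) ** 2
    ok &= lhs >= rhs - 1e-12               # small tolerance for floating point
print(ok)
```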
2. [5 pts] Derive the FTRL step in (4) and show that you arrive at the Hedge algorithm.
Solution: We will choose $p_t$ by minimizing $\sum_{s=1}^{t-1} l_s^\top p + \frac{1}{\eta}\sum_{i=1}^K p(i)\log(p(i))$ subject to the constraints $p(i) \ge 0$ and $p^\top \mathbf{1} = 1$. We will write out the Lagrangian for only the $p^\top\mathbf{1} = 1$ constraint and then verify that the solution satisfies non-negativity. Letting the Lagrange multiplier be $\mu$, we have
$$\mathcal{L} = \sum_{s=1}^{t-1} l_s^\top p + \frac{1}{\eta}\sum_{i=1}^K p(i)\log(p(i)) + \mu\big(p^\top\mathbf{1} - 1\big).$$
Taking the derivative with respect to $p(i)$ and setting it to 0, we obtain $p_t(i) = e^{-1-\eta\mu}\exp\big(-\eta\sum_{s=1}^{t-1} l_s(i)\big)$. This is non-negative for all $\mu \in \mathbb{R}$. To obtain the normalizing constant, note that $\sum_{i=1}^K p_t(i) = 1$. Hence, we obtain the following expression for $p_t$, which is precisely the Hedge algorithm:
$$p_t(i) = \frac{\exp\big(-\eta\sum_{s=1}^{t-1} l_s(i)\big)}{\sum_{j=1}^K \exp\big(-\eta\sum_{s=1}^{t-1} l_s(j)\big)}.$$
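To see that this expression really is Hedge (an illustrative sketch; the losses, $\eta$, and horizon are arbitrary), the snippet below checks that the closed-form FTRL solution coincides with the usual multiplicative-weights update $w_{t+1}(i) \propto w_t(i)\, e^{-\eta l_t(i)}$.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, eta = 6, 50, 0.1
losses = rng.uniform(0.0, 1.0, size=(T, K))

w = np.ones(K) / K                 # multiplicative-weights form of Hedge
for t in range(T):
    # Closed form from the FTRL derivation: p_t(i) proportional to exp(-eta * sum_{s<t} l_s(i)).
    cum = losses[:t].sum(axis=0)
    p_closed = np.exp(-eta * cum)
    p_closed /= p_closed.sum()

    p_iter = w / w.sum()
    assert np.allclose(p_closed, p_iter)

    w = w * np.exp(-eta * losses[t])   # multiplicative-weights update
print("closed-form FTRL solution matches the iterative Hedge update")
```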
3. [3 pts] Compute a bound on the regret using (5) via an optimal choice for η. You may use the fact that the
supremum norm ∥ · ∥∞ is the dual of the L1–norm ∥ · ∥1.
Solution: Note that $\lambda(p) = -H(p)$, where $H$ is the entropy, and recall that $0 \le H(p) \le \log(K)$. Applying the above bound with $f_t(p) = l_t^\top p$ (so that $g_t = l_t$), we have $R_T \le \frac{B}{\eta} + \eta\sum_{t=1}^T \|l_t\|_\infty^2$, where
$$B = \max_{p\in\Delta_K}\big(-H(p)\big) - \min_{p\in\Delta_K}\big(-H(p)\big) = \max_{p\in\Delta_K} H(p) - \min_{p\in\Delta_K} H(p) = \log(K).$$
Moreover, we have $\|l_t\|_\infty \le 1$. Hence, choosing $\eta = \sqrt{\log(K)/T}$, we obtain
$$R_T \;\le\; \frac{\log(K)}{\eta} + \eta T \;=\; 2\sqrt{T\log(K)} \;\in\; O\big(\sqrt{T\log(K)}\big).$$
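As a final illustration (not required by the problem; the i.i.d. uniform losses merely stand in for an adversarial sequence), the snippet below runs Hedge with $\eta = \sqrt{\log(K)/T}$ and checks that the realized regret stays below the bound $\log(K)/\eta + \eta T = 2\sqrt{T\log(K)}$.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 20, 5000
eta = np.sqrt(np.log(K) / T)
losses = rng.uniform(0.0, 1.0, size=(T, K))   # stand-in loss sequence in [0, 1]^K

cum = np.zeros(K)
alg_loss = 0.0
for t in range(T):
    p = np.exp(-eta * (cum - cum.min()))      # Hedge weights from cumulative losses
    p /= p.sum()
    alg_loss += p @ losses[t]                 # expected loss p_t^T l_t of the Hedge play
    cum += losses[t]

regret = alg_loss - cum.min()                 # regret against the best fixed expert
bound = np.log(K) / eta + eta * T             # = 2 * sqrt(T * log K)
print(f"regret = {regret:.1f}, bound = {bound:.1f}")
```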