CS/ECE/STAT-861: Theoretical Foundations of Machine Learning University of Wisconsin–Madison, Fall 2023
Homework 2. Due 10/27/2023, 11.00 am
Instructions:
1. Homework is due at 11 am on the due date. Please hand over your homework at the beginning of class. Please see the course website for the policy on late submission.
2. I recommend that you typeset your homework using LaTeX. You will receive 5 percent extra credit if you do so. If you are submitting hand-written homeworks, please make sure it is cleanly written up and legible. I will not invest undue effort to understand bad handwriting.
3. You must hand in a hard copy of the homework. The only exception is if you are out of town in which case you must let me know ahead of time and email me a copy of the homework by 11 am on the due date. If this is the case, your homework must be typeset using LaTeX. Please do not email written and scanned copies.
4. Unless otherwise specified, you may use any result we have already proved in class. You do not need to prove them from scratch, but clearly state which result you are using.
5. Solutions to some of the problems may be found as examples or exercises in the suggested textbooks or other resources. You are encouraged to try the problems on your own first before searching for the solution. If you find an existing solution, first read and understand the proof, and then write it in your own words. Please indicate any references you have used at the beginning of your solution when you turn in your homework.
6. Collaboration: You are allowed to collaborate in groups of size up to 3 on each problem. If you do so, please indicate your collaborators at the beginning of your solution.
1 Lower bounds with mixtures
In this question, you will prove a variant of our framework for proving minimax lower bounds, one that involves mixtures of distributions.
1. [5 pts] We observe data S drawn from some distribution P belonging to a family of distributions P. We wish to estimate a parameter θ(P) ∈ Θ of interest via a loss Φ ∘ ρ, where Φ : R+ → R+ is a non-decreasing function and ρ : Θ × Θ → R+ is a metric. Let P_1, …, P_N be subsets of P, and let Λ_j denote a prior on P_j. Let P̄_j denote the mixture,

P̄_j(S ∈ A) = E_{P∼Λ_j}[ E_{S∼P}[1(S ∈ A)] ].

Let δ = min_{j≠k} inf_{P∈P_j, P′∈P_k} ρ(θ(P), θ(P′)). Let ψ be a function which maps the data to [N] and θ̂ be an estimator which maps the data to Θ. Then, prove that

R = inf_θ̂ sup_{P∈P} E_S[ Φ ∘ ρ(θ(P), θ̂(S)) ] ≥ Φ(δ/2) · inf_ψ max_{j∈[N]} P̄_j(ψ(S) ≠ j).

(Correction: the right-hand side previously said inf_θ̂ instead of inf_ψ. –KK)
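For concreteness, here is a minimal instance of the mixture notation (an illustration only; the uniform prior over a finite collection below is an assumption, not part of the problem statement). If Λ_j is the uniform prior over {Q_1, …, Q_M} ⊆ P_j, then

\[
\bar{P}_j(S \in A) \;=\; \frac{1}{M} \sum_{m=1}^{M} Q_m(S \in A),
\]

which is exactly the form of the mixture P̄ used in part 2 below, where the Q_m are n-fold product distributions.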
2. [3 pts] Suppose we observe n i.i.d data points S = {X_1, …, X_n} drawn from some P ∈ P. Let {P_0, P_1, …, P_N} ⊂ P, and let P̄ = (1/N) Σ_{j=1}^N P_j^n. Show that

R_n⋆ ≥ (1/4) Φ(δ/2) exp(−KL(P_0^n, P̄)).
3. [2 pts] Using the result from part 2, briefly explain why using mixtures in the alternatives can (i) lead to tighter lower bounds, but (ii) be difficult to apply.
2 Density estimation in a Hölder class

Let H(2, L, B), defined below, denote the bounded second order Hölder class on [0, 1]. It consists of functions whose first derivative is L-Lipschitz:

H(2, L, B) = { f : [0,1] → [0,B] ; |f′(x_1) − f′(x_2)| ≤ L|x_1 − x_2| for all x_1, x_2 ∈ [0,1] }.
Let P denote the set of distributions whose densities are in H(2, L, B). We observe n samples S = {X_1, …, X_n} drawn i.i.d from some P ∈ P and wish to estimate its density p in the squared L_2 loss Φ ∘ ρ(p_1, p_2) = ∥p_1 − p_2∥_2². The minimax risk is

R_n⋆ = inf_p̂ sup_{p∈H(2,L,B)} E_S ∥p − p̂∥_2².
In this question, you will show that the minimax rate¹ for this problem is Θ(n^{−4/5}).
1. [15 pts] (Lower bound) Using Fano's method, or otherwise, show that R_n⋆ ∈ Ω(n^{−4/5}).
2. [15 pts] (Upper bound) Design an estimator p̂ for p and bound its risk by O(n^{−4/5}).
Hint: If you choose to use a kernel density estimator, consider the first order Taylor expansion of p and then apply the Ho ̈lder property.
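Not required for the problem, but a quick numerical sketch can help build intuition for the bias/variance trade-off behind the n^{−4/5} rate. The estimator below, the Epanechnikov kernel, the Beta(2, 2) target density, and the bandwidth h ≈ n^{−1/5} are all illustrative assumptions, not part of the assignment.

# Illustrative kernel density estimator with bandwidth h ~ n^(-1/5), the choice
# that balances squared bias (~ h^4) against variance (~ 1/(n h)) for a
# second-order Holder class.
import numpy as np

def kde(x_grid, samples, h):
    # Epanechnikov kernel K(u) = 0.75 * (1 - u^2) on |u| <= 1.
    u = (x_grid[:, None] - samples[None, :]) / h
    K = 0.75 * np.clip(1.0 - u ** 2, 0.0, None)
    return K.mean(axis=1) / h            # (1/(n h)) * sum_i K((x - X_i) / h)

rng = np.random.default_rng(0)
n = 2000
samples = rng.beta(2, 2, size=n)         # density 6x(1 - x): smooth on [0, 1]
h = n ** (-1 / 5)                        # assumed bandwidth; constants omitted
x_grid = np.linspace(0.0, 1.0, 201)
p_hat = kde(x_grid, samples, h)
mse = np.mean((p_hat - 6 * x_grid * (1 - x_grid)) ** 2)   # crude squared-L2 proxy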
3. [4 pts] (High dimensional setting) In words, briefly explain how you can extend both the upper and lower bounds for density estimation in d dimensions. The d-dimensional second-order Hölder class, defined below, consists of functions whose partial derivatives are Lipschitz.

H(2, L, B) = { f : [0,1]^d → [0,B] ; ∂f/∂x_i is L-Lipschitz for all i ∈ [d] }.
¹ Recall from class that the minimax rate for a Hölder class of order β is O(n^{−2β/(2β+d)}) in R^d.
You can focus only on the key differences. A detailed proof is not necessary.
4. [4 pts] (Lipschitz second derivatives) In words, briefly explain how you can extend both the upper and lower bounds if the densities belonged to the third order Hölder class in one dimension, defined below:

H(3, L, B) = { f : [0,1] → [0,B] ; |f″(x_1) − f″(x_2)| ≤ L|x_1 − x_2| for all x_1, x_2 ∈ [0,1] }.
Please focus only on the key differences. A detailed proof is not necessary.
Hint: For the upper bound, if you choose to use a kernel density estimator, you may consider a kernel of the
form K(u) = 1(|u| ≤ 1/2)(α − βu²) for appropriately chosen α, β.
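One way to read this hint (a pointer only, based on standard kernel-smoothing conventions; the conditions below are not part of the problem statement): after a second-order Taylor expansion of p, the bias terms up to order two cancel if the kernel satisfies

\[
\int K(u)\,du = 1, \qquad \int u\,K(u)\,du = 0, \qquad \int u^2\,K(u)\,du = 0,
\]

and the symmetric form 1(|u| ≤ 1/2)(α − βu²) leaves enough freedom in (α, β) to meet the first and third conditions, while the second holds automatically by symmetry.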
3 Lower bounds on the excess risk for prediction problems
In this question, we will develop a framework for lower bounding the excess risk of prediction problems. We will use this to establish a lower bound on the estimation error for binary classification in a VC class.
Let Z be a data space and H be a hypothesis space. Let f : H × Z → R be the instance loss, where f(h, Z) is the loss of hypothesis h on instance Z. Let F(h, P) = E_{Z∼P}[f(h, Z)] be the population loss of hypothesis h on distribution P, and let L(h, P) = F(h, P) − inf_{h′∈H} F(h′, P) denote the excess population loss. Let ĥ be an estimator which maps a dataset to a hypothesis in H. The risk of the estimator is

R(ĥ, P) = E[L(ĥ, P)] = E[F(ĥ, P)] − inf_{h∈H} F(h, P).
Here, the expectation is taken with respect to the data. The minimax risk is R⋆ = inf_ĥ sup_{P∈P} R(ĥ, P).
Example: In classification, f(h, (X, Y)) = 1(h(X) ≠ Y) is the 0–1 loss, and F(h, P) is usually called the risk of hypothesis h. The infimum inf_h F(h, P) is attained by the Bayes optimal classifier, and L(h, P) is the excess risk of hypothesis h. This framework can be used to lower bound the expected excess risk R(ĥ, P) of classification (and regression) problems. When ĥ chooses a hypothesis in some hypothesis class H, then R(ĥ, P) is the estimation error.
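As a reminder of the standard facts behind this example (stated only to fix intuition; the notation η is not used elsewhere in the assignment): writing η(x) = P(Y = 1 | X = x),

\[
F(h, P) = \mathbb{E}_X\big[\eta(X)\,\mathbf{1}(h(X) = 0) + (1 - \eta(X))\,\mathbf{1}(h(X) = 1)\big],
\]

and the infimum over all measurable classifiers is E_X[min(η(X), 1 − η(X))], attained by the Bayes optimal classifier 1(η(·) ≥ 1/2).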
1. [6 pts] (Reduction to testing) For two distributions P, Q, we define the separation ∆(P, Q) as

∆(P, Q) = sup{ δ ≥ 0 : L(h, P) ≤ δ ⟹ L(h, Q) ≥ δ for all h ∈ H, and L(h, Q) ≤ δ ⟹ L(h, P) ≥ δ for all h ∈ H }.
A dataset S is drawn from some distribution P ∈ P. Let {P_1, …, P_N} ⊂ P be such that ∆(P_j, P_k) ≥ δ for all j ≠ k. Let ψ be any function which maps S to [N]. Show that

R⋆ ≥ δ · inf_ψ max_{j∈[N]} P_j(ψ(S) ≠ j).
We can establish the following statements from the above result when S consists of n i.i.d data points. You do not need to prove them for the homework, but you are encouraged to verify that they are true.

Le Cam's method: Let {P_0, P_1} ⊂ P be such that ∆(P_0, P_1) ≥ δ and KL(P_0, P_1) ≤ log(2)/n. Then R_n⋆ ≥ δ/8.

Local Fano method: Let {P_1, …, P_N} ⊂ P be such that ∆(P_j, P_k) ≥ δ and KL(P_j, P_k) ≤ log(N)/(4n) for all j ≠ k. Suppose N ≥ 16. Then R_n⋆ ≥ δ/2.
You may use these results when solving the problems below.
2. (One-sided threshold classifiers) Consider a binary classification problem with input in X = [0, 1] and label in {0, 1}. We observe S = {(X_1, Y_1), …, (X_n, Y_n)} drawn i.i.d from some distribution P ∈ P, where P consists of distributions whose marginal p(x) is the uniform distribution on [0, 1].

Let H = { h_t(·) = 1(· ≥ t) ; t ∈ [0, 1] } be the class of one-sided threshold classifiers. For any h ∈ H, let f(h, (X, Y)) = 1(h(X) ≠ Y) be the 0–1 loss and let F(h, P) = E_{(X,Y)∼P}[1(h(X) ≠ Y)].
(a) [8 pts] Using Le Cam's method, show that for any estimator ĥ which maps the dataset to a hypothesis in H, there exists some distribution P ∈ P such that

E[F(ĥ, P)] ≥ inf_{h∈H} F(h, P) + Ω(1/√n).
(b) [2 pts] Note that we have not assumed that h(·) = P(Y = 1 | X = ·) belongs to H for all P ∈ P. We have, however, assumed that the estimator ĥ always chooses some hypothesis in H. Briefly explain why this assumption is necessary and where the proof breaks without this assumption.
3. [15 pts] (Classification in a VC class) Let X be a given input space and let P be all distributions supported on X × {0, 1}. We observe S = {(X_1, Y_1), …, (X_n, Y_n)} drawn i.i.d from some distribution P ∈ P. Let H ⊂ {h : X → {0, 1}} be a hypothesis class with finite VC dimension d. For any h ∈ H, let f(h, (X, Y)) = 1(h(X) ≠ Y) be the 0–1 loss and let F(h, P) = E_{(X,Y)∼P}[1(h(X) ≠ Y)].
In Homework 1, you showed the following upper bound on the estimation error of the ERM estimator ĥ_ERM,

E[F(ĥ_ERM, P)] ≤ inf_{h∈H} F(h, P) + O(√(d log(n)/n)).
Here, the expectation is with respect to the dataset S. Using the local Fano method, show that this rate is essentially unimprovable. That is, show that for any estimator ĥ which maps the dataset to a hypothesis in H, there exists some distribution P ∈ P such that, for sufficiently large d,

E[F(ĥ, P)] ≥ inf_{h∈H} F(h, P) + Ω(√(d/n)).
4 Explore-then-commit for K-armed bandits
In this question, we will upper and lower bound the regret for the explore-then-commit algorithm, described below.
Algorithm 1 Explore-then-Commit
Given: time horizon T, number of exploration rounds m (< T/K)
Pull each arm i ∈ [K] m times in the first mK rounds.
Set A = argmax_{i∈[K]} (1/m) Σ_{t=1}^{mK} 1(A_t = i) X_t.
Pull arm A for the remaining T − mK rounds.
Let ν = {ν_i}_{i∈[K]} be a σ sub-Gaussian K-armed bandit model, i.e. each ν_i is a σ sub-Gaussian distribution. Let μ_i = E_{X∼ν_i}[X] denote the mean of arm i, let μ⋆ = max_{j∈[K]} μ_j be the highest mean, and let ∆_i = μ⋆ − μ_i be the gap between the optimal arm and the i-th arm. Assume, without loss of generality, that μ_i ∈ [0, 1] for all i. Let R_T(m, ν) denote the regret when we execute the above algorithm on ν with m exploration rounds,

R_T(m, ν) = Tμ⋆ − E[ Σ_{t=1}^T X_t ].
1. [5 pts] (Gap-dependent bound) Show that there exist global constants C_1, C_2 such that

R_T(m, ν) ≤ m Σ_{i∈[K]} ∆_i + C_1 (T − mK) Σ_{i∈[K]} ∆_i exp( −m∆_i² / (C_2 σ²) ).

2. [6 pts] (Gap-independent bound) Let P denote the class of all σ sub-Gaussian bandits whose means are bounded between 0 and 1. Show that, for a suitable choice of m, say m′ (possibly dependent on T and K), we have

sup_{ν∈P} R_T(m′, ν) ∈ Õ(K^{1/3} T^{2/3}).
3. [10 pts] (Lower bound) Show that the result in part 2 cannot be improved (say, via a tighter upper bound analysis) for the explore-then-commit algorithm. That is, show

inf_{m∈N} sup_{ν∈P} R_T(m, ν) ∈ Ω(K^{1/3} T^{2/3}).
Hint: One approach is to adopt a similar technique to the proof of the general lower bound for K-armed bandits, but adapt it to the structure of the explore-then-commit algorithm. Your alternatives will need to depend on the specific choice of m to get a tight lower bound. To do so, you should carefully consider the failure cases if m is picked to be too large or too small.