
CS/ECE/STAT-861: Theoretical Foundations of Machine Learning University of Wisconsin–Madison, Fall 2023
Homework 1. Due 10/06/2023, 11.00 am
Instructions:
1. Homework is due at 11 am on the due date. Please hand over your homework at the beginning of class. Please see the course website for the policy on late submission.
2. I recommend that you typeset your homework using LaTeX. You will receive 5 percent extra credit if you do so. If you are submitting hand-written homework, please make sure it is cleanly written up and legible. I will not invest undue effort to understand bad handwriting.
3. You must hand in a hard copy of the homework. The only exception is if you are out of town, in which case you must let me know ahead of time and email me a copy of the homework by 11 am on the due date. If this is the case, your homework must be typeset using LaTeX. Please do not email written and scanned copies.
4. Unless otherwise specified, you may use any result we have already proved in class. You do not need to prove them from scratch, but clearly state which result you are using.
5. Solutions to some of the problems may be found as examples or exercises in the suggested textbooks or other resources. You are encouraged to try the problems on your own first before searching for the solution. If you find an existing solution, first read and understand the proof, and then write it in your own words. Please indicate any references you have used at the beginning of your solution when you turn in your homework.
6. Collaboration: You are allowed to collaborate in groups of up to 3 on each problem. If you do so, please indicate your collaborators at the beginning of your solution.
PAC Learning and Empirical Risk Minimization
1. [4 pts] (What is wrong with this proof?) We perform empirical risk minimization (ERM) in a finite hypothesis class H using an i.i.d. dataset S of n points. Let h⋆ ∈ argmin_{h∈H} R(h) be an optimal classifier in the class, and let ĥ ∈ argmin_{h∈H} R̂(h) minimize the empirical risk on the dataset S. A student offers the following proof and claims that it is possible to bound the estimation error without any dependence on |H|.
(i) Let B1 = {R̂(h⋆) − R(h⋆) > ε} denote the bad event that the empirical risk of h⋆ is more than ε larger than its true risk. By Hoeffding’s inequality we have P(B1) ≤ e^{−2nε²}.
(ii) Similarly, let B2 = {R(ĥ) − R̂(ĥ) > ε} denote the bad event that the empirical risk of ĥ is more than ε smaller than its true risk. By Hoeffding’s inequality we have P(B2) ≤ e^{−2nε²}. (correction: This previously said 2e^{−2nε²}. Thanks to Zhihao for pointing this out. –KK)
As R̂(ĥ) ≤ R̂(h⋆), we have
R(ĥ) − R(h⋆) ≤ R(ĥ) − R̂(ĥ) + R̂(h⋆) − R(h⋆) ≤ 2ε
under the good event G = B1^c ∩ B2^c, which holds with probability at least 1 − 2e^{−2nε²}. This result does not depend on |H| and even applies to infinite hypothesis classes provided there exists an h⋆ which minimizes the risk.
Which sentence below best describes the mistake (if any) in this proof? State your answer with an explanation. If you believe there is a mistake, be as specific as possible about what the mistake is. (An optional simulation sketch for building intuition appears after the answer choices.)
(a) Both statement (i) and statement (ii) are incorrect.
(b) Only statement (i) is incorrect. Statement (ii) is correct.
(c) Only statement (ii) is incorrect. Statement (i) is correct.
(d) Both statements are correct. There is nothing wrong with this proof.
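Optional: if you would like to build intuition numerically before answering, the following is a minimal Monte Carlo sketch in Python. It assumes a hypothetical data model (uniform inputs with a noisy threshold label) and a small finite class of threshold classifiers, runs ERM, and estimates how often the events B1 and B2 occur. All distributions, class sizes, and parameter values below are illustrative assumptions, not specified by the problem.

import numpy as np

# Illustrative model (assumptions): x ~ Uniform[0, 1], y = 1(x >= 0.5) flipped
# with probability 0.2; hypothesis class = thresholds {0.1, 0.2, ..., 0.9}.
rng = np.random.default_rng(0)
thresholds = np.linspace(0.1, 0.9, 9)        # finite class H
n, eps, trials = 200, 0.05, 2000
noise = 0.2

def true_risk(t):
    # Risk of h_t(x) = 1(x >= t) under the assumed model: the classifier disagrees
    # with the noiseless rule 1(x >= 0.5) on a region of length |t - 0.5|.
    return noise + (1 - 2 * noise) * abs(t - 0.5)

risks = np.array([true_risk(t) for t in thresholds])
star = int(np.argmin(risks))                 # index of the best-in-class h_star

b1_count = b2_count = 0
for _ in range(trials):
    x = rng.uniform(size=n)
    y = (x >= 0.5).astype(int) ^ (rng.uniform(size=n) < noise)
    emp = np.array([np.mean((x >= t).astype(int) != y) for t in thresholds])
    erm = int(np.argmin(emp))                # ERM over the finite class
    b1_count += emp[star] - risks[star] > eps    # event B1 for h_star
    b2_count += risks[erm] - emp[erm] > eps      # event B2 for h_hat
print("P(B1) ~", b1_count / trials, " P(B2) ~", b2_count / trials,
      " Hoeffding bound exp(-2 n eps^2) =", np.exp(-2 * n * eps ** 2))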
2. [6 pts] (PAC bound) Prove the following result which was presented but not proved in class.
Let H be a hypothesis class with finite Rad_n(H). Let ĥ be obtained via ERM using n i.i.d. samples. Let ε > 0.
Then, there exist universal constants C1, C2 such that with probability at least 1 − 2e^{−2nε²}, we have
R(ĥ) ≤ inf_{h∈H} R(h) + C1 Rad_n(H) + C2 ε.
3. [3 pts] (Sample complexity based on VC dimension) Say H has a finite VC dimension d. Let δ ∈ (0, 1). Using the result/proof in part 2 or otherwise, show that there exist universal constants C3, C4 such that when n ≥ d, the following bound holds with probability at least 1 − δ:
R(ĥ) ≤ inf_{h∈H} R(h) + C3 √((d log(n/d) + d)/n) + C4 √((1/n) log(2/δ)).
4. [3 pts] (Bound on the expected risk) The above results show that R(ĥ) is small with high probability. Using the results/proofs in parts 2 and 3 or otherwise, show that it is also small in expectation. Specifically, show that there exist universal constants C5, C6 such that the following bound holds, where the expectation is with respect to the dataset S:
E[R(ĥ)] ≤ inf_{h∈H} R(h) + C5 √((d log(n/d) + d)/n) + C6 (√(log(4n)/n) + 1/√n).
For parts 2, 3, and 4 of this question, if you can prove a bound that has the same higher-order terms but differs in additive/multiplicative constants or poly-logarithmic factors, you will still receive full credit. (An optional numerical illustration of how these bounds scale with n appears below.)
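Optional: the following minimal Python sketch simply evaluates the shape of the part-3 bound as n grows, with the unspecified universal constants set to 1 purely for illustration; it is a numerical aid for intuition, not part of any proof.

import numpy as np

# Part-3-style excess-risk bound with illustrative constants C3 = C4 = 1 (assumption).
def excess_risk_bound(n, d, delta, C3=1.0, C4=1.0):
    complexity_term = C3 * np.sqrt((d * np.log(n / d) + d) / n)
    confidence_term = C4 * np.sqrt(np.log(2 / delta) / n)
    return complexity_term + confidence_term

d, delta = 5, 0.05
for n in [10, 100, 1_000, 10_000, 100_000]:
    print(f"n = {n:>7d}: bound ~ {excess_risk_bound(n, d, delta):.4f}")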
Rademacher Complexity & VC dimension
1. [5 pts] (Empirical Rademacher complexity) Consider a binary classification problem with the 0–1 loss ℓ(y1, y2) = 1(y1 ≠ y2) and where X = R. Consider the following dataset S = {(x1 = 0, y1 = 0), (x2 = 1, y2 = 1)}. (An optional computational sketch for empirical Rademacher complexities appears after part (c).)
(a) Let H1 = {ha(x) = 1(x ≥ a); a ∈ R} be the hypothesis class of one-sided threshold functions. Compute the empirical Rademacher complexity Rad(S, H1).
(b) Let H2 = {ha(x) = 1(x ≥ a); a ∈ R} ∪ {ha(x) = 1(x ≤ a); a ∈ R} be the class of two-sided threshold functions. Compute the empirical Rademacher complexity Rad(S, H2).
(c) Are the values computed above consistent with the fact that H1 ⊂ H2?
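Optional: for checking hand calculations, here is a minimal Python sketch that computes an empirical Rademacher complexity exactly by enumerating all 2^n sign vectors, given the finite set of labelings a class realizes on S. It assumes the convention Rad(S, H) = E_σ[sup_{h∈H} (1/n) Σ_i σ_i h(x_i)] with h(x_i) ∈ {0, 1}; if the definition used in lecture differs (e.g., ±1-valued hypotheses or composition with the loss), adjust accordingly. The example labelings on the last line are a hypothetical illustration, not the answer to part (a) or (b).

import itertools
import numpy as np

def empirical_rademacher(labelings, n):
    # Exact Rad(S, H) = E_sigma[ sup_h (1/n) sum_i sigma_i h(x_i) ], where
    # `labelings` lists every vector (h(x_1), ..., h(x_n)) realized by H on S.
    labelings = np.array(labelings, dtype=float)             # shape: (num_dichotomies, n)
    total = 0.0
    for sigma in itertools.product([-1.0, 1.0], repeat=n):   # all 2^n sign vectors
        total += np.max(labelings @ np.array(sigma)) / n
    return total / 2 ** n

# Hypothetical example: a class realizing only the all-zeros and all-ones labelings
# on a 2-point dataset (NOT the classes H1 or H2 from the problem).
print(empirical_rademacher([[0, 0], [1, 1]], n=2))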
2. [6 pts] (VC dimension of linear classifiers) Consider a binary classification problem where X = R^D is the D-dimensional Euclidean space. The class of linear classifiers is given by H = {hw,b(x) = 1[w⊤x + b ≥ 0]; w ∈ R^D, b ∈ R}. Prove that the VC dimension of this class is d_H = D + 1. (correction: Previously this said H = {hw,b(x) = w⊤x + b ≥ 0 w ∈ Rd, b ∈ R}. –KK)
3. (Interval classifiers) Let X = R. Consider the class of interval classifiers, given by H = {ha,b(x) = 1(a ≤ x ≤ b); a, b ∈ R, a ≤ b}.
(a) [4 pts] What is the VC dimension d of this class?
(b) [8 pts] Show that Sauer’s lemma is tight for this class. That is, for all n, show that g(n, H) = Σ_{i=0}^{d} (n choose i).
4. (Union of interval classifiers) Let X = R. Consider the class of unions of K interval classifiers, given by H = {ha,b(x) = 1(∃k ∈ {1, …, K} s.t. ak ≤ x ≤ bk); a, b ∈ R^K, ak ≤ bk for all k}.
(a) [4 pts] What is the VC dimension d of this class?
(b) [8 pts] Show that Sauer’s lemma is tight for this class. That is, for all n, show that g(n, H) = Σ_{i=0}^{d} (n choose i).
Hint: The following identity from combinatorics, which we used in the proof of Sauer’s lemma, may be helpful: for all m > k, (m choose k) = (m−1 choose k) + (m−1 choose k−1).
5. [6 pts] (Tightness of Sauer’s lemma) Prove the following statement about the tightness of Sauer’s lemma when X = R: For all d > 0, there exists a hypothesis class H ⊂ {h : R → {0, 1}} with VC dimension d_H = d such that, for all dataset sizes n > 0, we have g(n, H) = Σ_{i=0}^{d} (n choose i).
Keep in mind that the hypothesis class H should depend on d but not on n.
Hint: One approach is to use the results from part 4, which will allow you to prove the result for even d. You will need to consider a different hypothesis class to show this for odd d. (An optional enumeration sketch for numerically checking growth-function counts appears below.)
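Optional: to sanity-check growth-function calculations numerically, here is a minimal Python sketch. It relies on the observation, stated here without proof, that on n sorted points a union of at most K closed intervals realizes exactly the labelings with at most K maximal runs of ones; the specific values of n and K at the bottom are arbitrary illustrative choices.

import itertools
from math import comb

def num_realized_labelings(n, K):
    # Count the labelings of n distinct points on the real line realizable by a
    # union of at most K closed intervals: on sorted points, these are exactly
    # the binary strings with at most K maximal runs of ones.
    count = 0
    for labels in itertools.product([0, 1], repeat=n):
        runs = sum(1 for i, b in enumerate(labels)
                   if b == 1 and (i == 0 or labels[i - 1] == 0))
        count += runs <= K
    return count

def sauer_bound(n, d):
    # Right-hand side of Sauer's lemma: sum_{i=0}^{d} (n choose i).
    return sum(comb(n, i) for i in range(d + 1))

# Compare the realized count against the Sauer bound for your candidate value of d.
n, K = 8, 2
print(num_realized_labelings(n, K))
print([sauer_bound(n, d) for d in range(0, 6)])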
Relationship between divergences
Let P, Q be probability distributions with densities p, q respectively. Recall the following divergences we discussed in class:
KL divergence: KL(P, Q) = ∫ log(p(x)/q(x)) p(x) dx.
Total variation distance: TV(P, Q) = sup_A |P(A) − Q(A)|.
L1 distance: ∥P − Q∥1 = ∫ |p(x) − q(x)| dx.
Hellinger distance: H²(P, Q) = ∫ (√p(x) − √q(x))² dx.
Finally, let ∥P ∧ Q∥ = ∫ min(p(x), q(x)) dx denote the affinity between two distributions. When we have n i.i.d. observations, let P^n, Q^n denote the corresponding product distributions.
(correction: Previously, the definition of the Hellinger distance said H and not H2. Thanks to Yixuan for pointing this out. –KK)
Prove the following statements. (An optional numerical sanity check appears after the list.)
1. [3 pts] KL(P^n, Q^n) = n KL(P, Q).
2. [3 pts] H²(P^n, Q^n) = 2 − 2(1 − (1/2)H²(P, Q))^n.
3. [3 pts] TV(P, Q) = (1/2)∥P − Q∥1.
Hint: Can you relate both sides of the equation to the set A = {x; p(x) > q(x)}?
4. [3 pts] TV(P, Q) = 1 − ∥P ∧ Q∥.
5. [3 pts] H²(P, Q) ≤ ∥P − Q∥1.
Hint: What can you say about (a − b)² and |a² − b²| when a, b > 0?
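Optional: the identities above can be sanity-checked numerically before you prove them. The following minimal Python sketch does so for two hypothetical distributions on a 3-point support, replacing integrals with sums; the support size and probability values are arbitrary illustrative choices.

import itertools
import numpy as np

# Two hypothetical distributions on a 3-point support (arbitrary choices).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def tv(p, q):
    # sup over events A of |P(A) - Q(A)|, by enumerating all subsets of the support.
    idx = range(len(p))
    return max(abs(p[list(A)].sum() - q[list(A)].sum())
               for r in range(len(p) + 1) for A in itertools.combinations(idx, r))

def l1(p, q):
    return float(np.sum(np.abs(p - q)))

def hellinger2(p, q):
    return float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def affinity(p, q):
    return float(np.sum(np.minimum(p, q)))

def product_dist(p, n):
    # Distribution of n i.i.d. draws, flattened over the product support.
    return np.array([np.prod(combo) for combo in itertools.product(p, repeat=n)])

n = 3
pn, qn = product_dist(p, n), product_dist(q, n)
print("KL(P^n,Q^n) =", kl(pn, qn), " vs n*KL(P,Q) =", n * kl(p, q))
print("H^2(P^n,Q^n) =", hellinger2(pn, qn),
      " vs 2 - 2(1 - H^2/2)^n =", 2 - 2 * (1 - hellinger2(p, q) / 2) ** n)
print("TV =", tv(p, q), " vs (1/2)||P-Q||_1 =", 0.5 * l1(p, q),
      " vs 1 - ||P^Q|| =", 1 - affinity(p, q))
print("H^2(P,Q) =", hellinger2(p, q), " <= ||P-Q||_1 =", l1(p, q))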