
CS861: Theoretical Foundations of Machine Learning Lecture 13 – 10/04/2023 University of Wisconsin–Madison, Fall 2023
Lecture 13: Varshamov-Gilbert lemma, Nonparametric regression
Lecturer: Kirthevasan Kandasamy Scribed by: Elliot Pickens, Yuya Shimizu
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the instructor.
In this lecture, we provide further methods to derive a lower bound for the minimax risk. First, we will continue our previous discussion on constructing alternatives via tight packings. Then, we will introduce the Varshamov-Gilbert lemma, which is another method to construct well-separated alternatives. We also briefly mention other methods for lower bounds. Finally, we will discuss nonparametric regression.
1 Method 1: Constructing alternatives via tight packings (continued)
In the previous lecture, we learned about ε-packing numbers. We now introduce ε-covering numbers, which behave at the same order as packing numbers as ε ↓ 0.
Definition 1. (Covering number, metric entropy)
• An ε-covering of a set X with respect to a metric ρ is a set {x1, ··· , xN} such that for all x ∈ X, there exists some xi ∈ {x1, ··· , xN} s.t. ρ(x, xi) ≤ ε.
• The ε-covering number N(ε, X, ρ) is the size of the smallest ε-covering.
• The metric entropy is log(N(ε, X, ρ)).
We have the following lemma that relates covering numbers and packing numbers.
Lemma 1. A covering number N(·, X, ρ) and a packing number M(·, X, ρ) satisfy
M(2ε,X,ρ) ≤ N(ε,X,ρ) ≤ M(ε,X,ρ).
Remark. This lemma is useful since we can apply prior work on bounding the metric entropy log(N(ε, X, ρ)).
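As a quick sanity check on the relation between packings and coverings, here is a minimal Python sketch (not from the lecture; the point cloud, the radius, and the greedy routine are illustrative assumptions). It builds a maximal ε-packing greedily and verifies that the selected centers also form an ε-covering, which is the usual argument behind N(ε, X, ρ) ≤ M(ε, X, ρ).

```python
import numpy as np

def greedy_packing(points, eps):
    """Greedily select a maximal eps-packing: keep a point only if it is
    more than eps away from every point already selected."""
    centers = []
    for x in points:
        if all(np.linalg.norm(x - c) > eps for c in centers):
            centers.append(x)
    return np.array(centers)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 2))   # finite stand-in for the set X (assumption)
eps = 0.2

packing = greedy_packing(X, eps)

# A maximal eps-packing is automatically an eps-covering: every point of X
# must be within eps of some center, otherwise it could have been added.
dists = np.min(np.linalg.norm(X[:, None, :] - packing[None, :, :], axis=2), axis=1)
print(f"packing size = {len(packing)}, max distance to nearest center = {dists.max():.3f} <= eps = {eps}")
```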
2 Method 2: Varshamov–Gilbert Lemma
To get a lower bound for the minimax risk, it is often convenient to consider alternatives indexed with a hypercube:
{Pω ; ω = (ω1, …, ωd) ∈ {0,1}^d}.
Example 1. (Normal mean estimation in R^d) Consider a hypercube:
{N(δω, σ²I) ; ω ∈ {0,1}^d}.

Figure 1: A hypercube that we will use to generate our alternatives by removing a few vertices from the cube.
For these alternatives, we can calculate the following values:
$$\min_{\omega \neq \omega'} \rho(\theta(P_\omega), \theta(P_{\omega'})) = \min_{\omega \neq \omega'} \|\delta\omega - \delta\omega'\|_2 = \delta \quad (\omega \text{ and } \omega' \text{ differ in only one coordinate}),$$
$$\max_{\omega \neq \omega'} \mathrm{KL}(P_\omega, P_{\omega'}) = \max_{\omega, \omega'} \frac{\|\delta\omega - \delta\omega'\|_2^2}{2\sigma^2} = \frac{d\delta^2}{2\sigma^2} \quad (\omega \text{ and } \omega' \text{ differ in all coordinates}).$$
The problem here is that the Kullback–Leibler divergence can be large relative to the minimum distance; thus, we cannot simply apply the local Fano method.
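To see this tension numerically, the following sketch (illustrative values of d, δ, σ; not part of the original notes) uses the closed-form KL between Gaussians with common covariance σ²I, namely KL = ∥μ1 − μ2∥²/(2σ²), to confirm that the minimum separation over the hypercube is δ while the worst-case KL scales like dδ²/(2σ²).

```python
import itertools
import numpy as np

d, delta, sigma = 6, 0.5, 1.0   # arbitrary illustrative values (assumption)

def kl_gaussian(mu1, mu2, sigma):
    # KL(N(mu1, sigma^2 I) || N(mu2, sigma^2 I)) = ||mu1 - mu2||^2 / (2 sigma^2)
    return np.sum((mu1 - mu2) ** 2) / (2 * sigma ** 2)

corners = [np.array(w, dtype=float) for w in itertools.product([0, 1], repeat=d)]
pairs = list(itertools.combinations(corners, 2))

min_dist = min(np.linalg.norm(delta * w - delta * v) for w, v in pairs)
max_kl = max(kl_gaussian(delta * w, delta * v, sigma) for w, v in pairs)

print(f"min separation = {min_dist:.3f} (should be delta = {delta})")
print(f"max KL = {max_kl:.3f} (should be d*delta^2/(2*sigma^2) = {d * delta**2 / (2 * sigma**2):.3f})")
```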
This example motivates us to introduce the Varshamov–Gilbert lemma. The Varshamov–Gilbert lemma states that we can find a large subset of {0,1}^d such that the minimum distance between any two points in the subset is also large. Before stating the lemma, we define the Hamming distance.
Definition 2. (Hamming distance) The Hamming distance between two binary vectors ω, ω′ ∈ {0,1}^d is $H(\omega, \omega') = \sum_{j=1}^{d} \mathbf{1}\{\omega_j \neq \omega_j'\}$. It counts the number of coordinates where ωj and ωj′ differ.
Lemma 2. (Varshamov–Gilbert) Let m ≥ 8. Then there exists Ωm ⊆ {0,1}^m such that the following are true: (i) |Ωm| ≥ 2^{m/8}; (ii) ∀ω, ω′ ∈ Ωm with ω ≠ ω′, H(ω, ω′) ≥ m/8. We will call Ωm the Varshamov–Gilbert pruned hypercube of {0,1}^m.
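The lemma is proved probabilistically, but a naive greedy pruning already conveys the idea. The sketch below (an illustration under assumed parameters, not the lemma's actual construction or proof) keeps a vertex of {0,1}^m only if it is at Hamming distance at least m/8 from all previously kept vertices, and compares the resulting size with the 2^{m/8} guarantee.

```python
import itertools

def greedy_pruned_hypercube(m):
    """Keep vertices of {0,1}^m that are at Hamming distance >= m/8
    from every vertex already kept."""
    def hamming(u, v):
        return sum(ui != vi for ui, vi in zip(u, v))

    kept = []
    for w in itertools.product((0, 1), repeat=m):
        if all(hamming(w, v) >= m / 8 for v in kept):
            kept.append(w)
    return kept

m = 10  # illustrative dimension (assumption); the lemma needs m >= 8
omega = greedy_pruned_hypercube(m)
print(f"greedy set size = {len(omega)}, Varshamov-Gilbert guarantee 2^(m/8) = {2 ** (m / 8):.1f}")
```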
We will revisit the normal mean estimation example to illustrate an application of the Varshamov-Gilbert lemma.
Example 2. (Normal mean estimation in R^d) Let S = {x1, ··· , xn} ∼ P^n be i.i.d. data, where P ∈ P and P = {N(μ, Σ) ; μ ∈ R^d, Σ ⪯ σ²I}. Consider Φ ∘ ρ(θ1, θ2) = ∥θ1 − θ2∥₂². Let Ωd be the Varshamov–Gilbert pruned hypercube of {0,1}^d. Define
$$\mathcal{P}' = \Big\{ N\big(\sqrt{\tfrac{8}{d}}\,\delta\omega,\; \sigma^2 I\big) \;;\; \omega \in \Omega_d \Big\}.$$

For these alternatives, we have the following bound:

$$\min_{\substack{P_\omega \neq P_{\omega'} \\ P_\omega, P_{\omega'} \in \mathcal{P}'}} \rho(\theta(P_\omega), \theta(P_{\omega'})) = \min_{\omega \neq \omega'} \Big\| \sqrt{\tfrac{8}{d}}\,\delta\,\omega - \sqrt{\tfrac{8}{d}}\,\delta\,\omega' \Big\|_2 = \sqrt{\tfrac{8}{d}}\,\delta \sqrt{\min_{\omega \neq \omega'} H(\omega, \omega')} \;\ge\; \sqrt{\tfrac{8}{d}}\,\delta \sqrt{\tfrac{d}{8}} = \delta,$$

where the inequality follows from property (ii) of the Varshamov–Gilbert pruned hypercube. Since the maximum ℓ2-distance over a hypercube is the length of a diagonal, we also have

$$\max_{P_\omega, P_{\omega'} \in \mathcal{P}'} \mathrm{KL}(P_\omega, P_{\omega'}) = \max_{\omega, \omega'} \frac{\Big\| \sqrt{\tfrac{8}{d}}\,\delta\,\omega - \sqrt{\tfrac{8}{d}}\,\delta\,\omega' \Big\|_2^2}{2\sigma^2} = \frac{d \times \tfrac{8}{d}\,\delta^2}{2\sigma^2} = \frac{4\delta^2}{\sigma^2}.$$
Choose $\delta = \sigma\sqrt{\frac{d\log(2)}{128\, n}}$. Then,

$$\max_{P_\omega, P_{\omega'} \in \mathcal{P}'} \mathrm{KL}(P_\omega, P_{\omega'}) = \frac{4\delta^2}{\sigma^2} = \frac{d\log(2)}{32\, n} = \frac{1}{4n}\log\big(2^{d/8}\big) \le \frac{1}{4n}\log(|\mathcal{P}'|),$$

where the inequality follows from property (i) of the Varshamov–Gilbert pruned hypercube: |P′| = |Ωd| ≥ 2^{d/8}. Therefore, by the local Fano method (here we also require d ≥ 32),

$$R_n^* \ge \frac{1}{2}\,\Phi\Big(\frac{\delta}{2}\Big) = \frac{\delta^2}{8} = \frac{\log(2)}{1024}\cdot\frac{d\sigma^2}{n}.$$
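Since the constants are easy to mistake, here is a short numeric check (with arbitrary illustrative d, n, σ; not from the notes) that the chosen δ makes the maximum KL equal to log(2^{d/8})/(4n), and that the resulting bound matches (log 2/1024)·dσ²/n.

```python
import math

d, n, sigma = 64, 1000, 2.0          # arbitrary illustrative values (d >= 32)

delta = sigma * math.sqrt(d * math.log(2) / (128 * n))

max_kl = 4 * delta ** 2 / sigma ** 2              # computed bound on the max KL
fano_budget = math.log(2 ** (d / 8)) / (4 * n)    # log(|P'|)/(4n) with |P'| >= 2^(d/8)
lower_bound = 0.5 * (delta / 2) ** 2              # (1/2) * Phi(delta/2) with Phi(t) = t^2

print(f"max KL      = {max_kl:.6f}")
print(f"Fano budget = {fano_budget:.6f}   (should match max KL)")
print(f"lower bound = {lower_bound:.6f}")
print(f"log(2)/1024 * d*sigma^2/n = {math.log(2) / 1024 * d * sigma ** 2 / n:.6f}")
```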
3 Other methods for lower bounds
Theorem 3. (Informal theorem; Ch. 2, Tsybakov)
Let S = {(x1, y1), …, (xn, yn)} ∼ P ∈ P, {P0, …, PN} ⊆ P, and δ = min_{j≠k} ρ(θ(Pj), θ(Pk)). Then if

$$\frac{1}{N}\sum_{j=1}^{N} \mathrm{KL}(P_j, P_0) \le C_1 \log(N),$$

we can say that

$$R_n^* \ge C_2\, \Phi\Big(\frac{\delta}{2}\Big).$$
Roughly, this informal theorem says that if the average KL divergence between the Pj's and some "central" distribution P0 is small enough, then we get a lower bound on the minimax risk of the form seen above.
A related method, Assouad's method, can be found in Chapter 9 of John Duchi's "Lecture Notes on Information Theory." This method applies when there is additional structure in the problem.
4 Nonparametric regression
The model: Let F be the class of bounded Lipschitz functions on [0, 1]. That is,
F = {f : [0,1] → [0,B] ; |f(x1) − f(x2)| ≤ L|x1 − x2|},
where L is the Lipschitz constant.
We observe some S = {(x1, y1), . . . , (xn, yn)} drawn i.i.d. from Pxy ∈ P, where
P = {Pxy ; 0 < α0 ≤ p(x) ≤ α1 < ∞, the regression function f(x) ≜ E[Y|X = x] ∈ F, Var(Y|X = x) ≤ σ²}.

We wish to estimate the regression function via the following loss:

$$\ell(P_{xy}, g) = \int_0^1 (f(x) - g(x))^2\, p(x)\, dx,$$

where f is the regression function and g : [0,1] → R. The risk of f̂ is

$$R(P_{xy}, \hat f) = \mathbb{E}_S\big[\ell(P_{xy}, \hat f)\big] = \mathbb{E}_S\Big[\int_0^1 (f(x) - \hat f(x))^2\, p(x)\, dx\Big],$$

and the minimax risk is

$$R_n^* = \inf_{\hat f}\; \sup_{P_{xy} \in \mathcal{P}} R(P_{xy}, \hat f).$$

We want to show that the minimax risk is Θ(n^{-2/3}). We will do this in two steps:
1. Establish a lower bound with Fano's method.
2. Obtain an upper bound using the Nadaraya–Watson estimator.

4.1 Lower Bound

To begin, it should be noted that the loss here cannot be written as ℓ = Φ ∘ ρ. To circumvent this, denote P′′ = {Pxy ∈ P ; p(x) = 1}.¹ Then

$$R_n^* \ge \inf_{\hat f}\; \sup_{P_{xy} \in \mathcal{P}''} \mathbb{E}_S\Big[\underbrace{\int_0^1 (f(x) - \hat f(x))^2\, dx}_{\Phi \circ \rho(f, \hat f)}\Big],$$

where Φ ∘ ρ(f1, f2) = ∥f1 − f2∥₂². Now we proceed with proving the lower bound via the following three steps:
1. Constructing the alternatives
2. Lower bounding ρ
3. Upper bounding the KL divergence

¹We choose p(x) = 1 here for simplicity, but any uniform pdf will work.

4.1.1 Constructing the alternatives

The bump function Ψ is given by

$$\Psi(x) = \begin{cases} x + \tfrac{1}{2} & \text{if } x \in [-1/2, 0], \\ \tfrac{1}{2} - x & \text{if } x \in [0, 1/2], \\ 0 & \text{otherwise.} \end{cases}$$

Figure 2: Ψ(x), where Ψ is 1-Lipschitz and ∫ Ψ²(x) dx = 1/12 < ∞.

Now let h > 0 (we will choose it more precisely later), and let m = 1/h. Define

$$\mathcal{F}' = \Big\{ f_w \;;\; f_w(\cdot) = \sum_{j=1}^{m} w_j\, \varphi_j(\cdot),\; w \in \Omega_m \Big\},$$

where Ωm is the Varshamov–Gilbert pruned subset of {0,1}^m, and φj is defined as

$$\varphi_j(x) = L h\, \Psi\!\left(\frac{x - (j - 1/2)h}{h}\right).$$
Figure 3: Depiction of φj and w when w = {0, 0, 1, 0, 1}.
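To make the construction concrete, the following sketch (illustrative choices of L, h, and the evaluation grid; not from the notes) builds the triangular bump Ψ, the rescaled bumps φj, and f_w for w = (0, 0, 1, 0, 1) as in Figure 3, and numerically checks that f_w is L-Lipschitz.

```python
import numpy as np

L, m = 1.0, 5                     # illustrative Lipschitz constant and number of bumps
h = 1.0 / m                       # each bump lives on an interval of width h

def Psi(u):
    """Triangular bump: 1-Lipschitz, supported on [-1/2, 1/2]."""
    return np.maximum(0.5 - np.abs(u), 0.0)

def phi(j, x):
    # phi_j(x) = L * h * Psi((x - (j - 1/2) h) / h), centered on the j-th interval
    return L * h * Psi((x - (j - 0.5) * h) / h)

def f_w(w, x):
    return sum(wj * phi(j, x) for j, wj in enumerate(w, start=1))

w = (0, 0, 1, 0, 1)
x = np.linspace(0.0, 1.0, 2001)
y = f_w(w, x)

# Numerical check that f_w is L-Lipschitz on the grid.
slopes = np.abs(np.diff(y) / np.diff(x))
print(f"max |f_w| = {y.max():.4f} (= L*h/2 = {L * h / 2}), max slope = {slopes.max():.4f} <= L = {L}")
```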
Now we need to check that F′ ⊆ F (to show that fw is L−Lipschitz). To do this, it is sufficient to check
within one of the "bumps." By the chain rule,

$$|\varphi_j'(x)| = \Big| L h\, \Psi'\!\Big(\frac{x - (j - 1/2)h}{h}\Big) \cdot \frac{1}{h} \Big| = L.$$

Finally we define our set of alternatives P′ as follows.
P′ = {P ; p(x) is uniform, f(x) ≜ E[Y|X = x] ∈ F′, Y|X = x ∼ N(f(x), σ²)}.
We see that P′ ⊆ P′′ ⊆ P, where