
CS861: Theoretical Foundations of Machine Learning Lecture 14 – 10/06/2023 University of Wisconsin–Madison, Fall 2023
Lecture 14: Nonparametric Regression and Density Estimation
Lecturer: Kirthevasan Kandasamy Scribed by: Haoran Xiong, Zhihao Zhao
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the instructor.
In this lecture, we will develop upper and lower bounds for nonparametric regression and show that the minimax rate is $\tilde{\Theta}(n^{-2/3})$. We will also briefly introduce nonparametric density estimation.
1 Nonparametric Regression
Assume that we observe a dataset $S = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$ drawn i.i.d. from some distribution $P_{XY} \in \mathcal{P}$, where
$$\mathcal{P} = \left\{ P_{XY} \ :\ 0 < \alpha_0 \le p(x) \le \alpha_1 < \infty,\ \ f(x) = \mathbb{E}[Y \mid X = x] \text{ is } L\text{-Lipschitz},\ \ \mathrm{Var}(Y \mid X = x) \le \sigma^2 \right\},$$

in which $p(x)$ is the marginal density of $X$. Our target regression function will be estimated via the following loss:

$$\ell(P_{XY}, g) \triangleq \int \big[ f(x) - g(x) \big]^2 p(x)\, dx,$$

where $p(x)$ and $f(x) = \mathbb{E}[Y \mid X = x]$ are defined above. Then the minimax risk is defined as follows:

$$R_n^* = \inf_{\hat f} \sup_{P_{XY} \in \mathcal{P}} \mathbb{E}_{S \sim P_{XY}} \left[ \ell(P_{XY}, \hat f) \right] = \inf_{\hat f} \sup_{P_{XY} \in \mathcal{P}} \mathbb{E}_{S \sim P_{XY}} \left[ \int \big( f(x) - \hat f(x) \big)^2 p(x)\, dx \right].$$

We want to show that $R_n^*$ is $\tilde{\Theta}(n^{-2/3})$ in two steps:
1. Establish a lower bound with Fano's method.
2. Obtain an upper bound by using the Nadaraya-Watson estimator.

1.1 Lower bound

Noticing that $\ell(P_{XY}, g)$ defined above cannot be written in the form $\ell = \Phi \circ \rho$, which means we cannot directly use the theorems and lemmas learnt in previous lectures, we circumvent this problem by constructing a sub-class $\mathcal{P}''$ of $\mathcal{P}$ as follows:

$$\mathcal{P}'' = \{ P_{XY} \in \mathcal{P} \ :\ p(x) = 1 \}.$$

(Here we use the uniform density $p(x) = 1$ for convenience, but any fixed density $p(x)$ would still induce a metric.) Then

$$R_n^* \ge \inf_{\hat f} \sup_{P_{XY} \in \mathcal{P}''} \mathbb{E}_{S \sim P_{XY}} \left[ \int \big( f(x) - \hat f(x) \big)^2 dx \right],$$

and now we can write $\Phi \circ \rho(f_1, f_2) = \| f_1 - f_2 \|_2^2$, i.e. $\rho(f_1, f_2) = \| f_1 - f_2 \|_2$ and $\Phi(t) = t^2$.

1.1.1 Constructing alternatives

Define

$$\psi(x) = \begin{cases} x + \tfrac{1}{2} & \text{if } x \in [-\tfrac{1}{2}, 0), \\ -x + \tfrac{1}{2} & \text{if } x \in [0, \tfrac{1}{2}], \\ 0 & \text{otherwise}, \end{cases}$$

then we can easily note that $\psi$ is 1-Lipschitz and $\int \psi^2(x)\, dx = 1/12$. Now let $h > 0$ (we will specify its value later) and let $m = \tfrac{1}{h}$. We construct a new function class
$$\mathcal{F}' = \left\{ f_\omega \ :\ f_\omega(\cdot) = \sum_{j=1}^{m} \omega_j \phi_j(\cdot),\ \omega \in \Omega_m \right\},$$

where $\Omega_m$ is the Varshamov-Gilbert pruned hypercube of $\{0,1\}^m$, and $\phi_j(x) = Lh\, \psi\!\left( \frac{x - (j - 1/2)h}{h} \right)$. Since $|\phi_j'(x)| \le L$ ($\psi$ is 1-Lipschitz, the argument is rescaled by $1/h$ and the output by $Lh$) and the $\phi_j$ have disjoint supports, we know that $f_\omega$ is $L$-Lipschitz. We can now define our alternatives:
$$\mathcal{P}' = \left\{ P_{XY} \ :\ p(x) \text{ uniform on } [0,1],\ f(x) = \mathbb{E}[Y \mid X = x] \in \mathcal{F}',\ Y \mid X = x \sim \mathcal{N}(f(x), \sigma^2) \right\}.$$
We see that $\mathcal{P}' \subset \mathcal{P}'' \subset \mathcal{P}$.
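To make the construction concrete, here is a minimal numpy sketch (not part of the original notes) that builds an alternative regression function $f_\omega$ from a bit vector $\omega \in \{0,1\}^m$. The names `psi`, `f_omega`, and the example bit vector are illustrative choices; the pruning of the hypercube down to $\Omega_m$ is not implemented here.

```python
import numpy as np

def psi(u):
    """Triangular bump: 1-Lipschitz, supported on [-1/2, 1/2], equal to 1/2 - |u| there."""
    return np.where(np.abs(u) <= 0.5, 0.5 - np.abs(u), 0.0)

def f_omega(x, omega, L=1.0):
    """Evaluate f_omega(x) = sum_j omega_j * phi_j(x), phi_j(x) = L*h*psi((x - (j-1/2)h)/h)."""
    m = len(omega)
    h = 1.0 / m
    centers = (np.arange(1, m + 1) - 0.5) * h          # bump centers (j - 1/2)h
    # Each bump is supported on [(j-1)h, jh]; the supports are disjoint.
    bumps = L * h * psi((x[:, None] - centers[None, :]) / h)
    return bumps @ np.asarray(omega, dtype=float)

# Example: m = 8 bumps and an arbitrary bit vector (Omega_m would be a pruned subset).
x = np.linspace(0.0, 1.0, 1001)
omega = np.array([1, 0, 1, 1, 0, 0, 1, 0])
f = f_omega(x, omega, L=1.0)
print(f.max())   # <= L*h/2 = 1/16: the bumps are small, but f_omega is L-Lipschitz
```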
1.1.2 Lower bound on $\|f_\omega - f_{\omega'}\|_2$

To better organize our result, we first calculate, for each $j$,

$$\int_{(j-1)h}^{jh} \phi_j^2(x)\, dx = \int_{(j-1)h}^{jh} L^2 h^2\, \psi^2\!\left( \frac{x - (j-1/2)h}{h} \right) dx = L^2 h^3 \int_{-1/2}^{1/2} \psi^2(u)\, du = \frac{L^2 h^3}{12}.$$

We then have,

$$\rho^2(f_\omega, f_{\omega'}) = \int_0^1 (f_\omega - f_{\omega'})^2\, dx = \sum_{j=1}^m \int_{(j-1)h}^{jh} \big( \omega_j \phi_j(x) - \omega_j' \phi_j(x) \big)^2 dx = \sum_{j=1}^m \mathbb{1}\{\omega_j \ne \omega_j'\} \int \phi_j^2(x)\, dx = \frac{L^2 h^3}{12} \cdot H(\omega, \omega'),$$

where $H(\cdot,\cdot)$ is the Hamming distance and the last equality holds by its definition. Since $\omega, \omega' \in \Omega_m$, by the Varshamov-Gilbert lemma, $H(\omega, \omega') \ge \frac{m}{4} = \frac{1}{4h}$. Then we have

$$\min_{\omega \ne \omega'} \rho(f_\omega, f_{\omega'}) \ge \frac{Lh}{\sqrt{48}} \triangleq \delta,$$

where $\delta$ is called the separation between hypotheses.
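As a sanity check (not in the original notes), the following snippet numerically integrates $(f_\omega - f_{\omega'})^2$ on a grid and compares it with the closed form $\frac{L^2 h^3}{12} H(\omega, \omega')$ derived above. All variable names are illustrative, and the grid-based integral is only approximate.

```python
import numpy as np

def psi(u):
    # Triangular bump from the construction: 1/2 - |u| on [-1/2, 1/2], zero elsewhere.
    return np.where(np.abs(u) <= 0.5, 0.5 - np.abs(u), 0.0)

def f_omega(x, omega, L, h):
    centers = (np.arange(1, len(omega) + 1) - 0.5) * h
    return (L * h * psi((x[:, None] - centers[None, :]) / h)) @ np.asarray(omega, float)

L, m = 1.0, 8
h = 1.0 / m
x = np.linspace(0.0, 1.0, 200001)            # fine grid for numerical integration
omega  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
omega2 = np.array([0, 0, 1, 0, 1, 0, 1, 1])

diff = f_omega(x, omega, L, h) - f_omega(x, omega2, L, h)
rho_sq_numeric = np.sum(diff**2) * (x[1] - x[0])      # Riemann-sum approximation of the integral
hamming = int(np.sum(omega != omega2))
rho_sq_formula = L**2 * h**3 / 12 * hamming

print(rho_sq_numeric, rho_sq_formula)        # the two values agree up to grid error
```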
1.1.3 Upper bound on the KL divergence
Next, we will upper bound the maximum KL divergence between our alternatives. Let $P_\omega, P_{\omega'} \in \mathcal{P}'$. Then,

$$\begin{aligned}
\mathrm{KL}(P_\omega, P_{\omega'}) &= \int p_\omega \log \frac{p_\omega}{p_{\omega'}} \\
&= \int_0^1 \int_{-\infty}^{\infty} p_\omega(x)\, p_\omega(y \mid x) \log \frac{p_\omega(x)\, p_\omega(y \mid x)}{p_{\omega'}(x)\, p_{\omega'}(y \mid x)}\, dy\, dx \\
&= \int_0^1 \int_{-\infty}^{\infty} p_\omega(y \mid x) \log \frac{p_\omega(y \mid x)}{p_{\omega'}(y \mid x)}\, dy\, dx && (\text{as } p_\omega(x) = p_{\omega'}(x) = 1) \\
&= \int_0^1 \mathrm{KL}\big( \mathcal{N}(f_\omega(x), \sigma^2),\, \mathcal{N}(f_{\omega'}(x), \sigma^2) \big)\, dx && (\text{as } Y \mid X = x \sim \mathcal{N}(f(x), \sigma^2)) \\
&= \frac{1}{2\sigma^2} \int_0^1 \big( f_\omega(x) - f_{\omega'}(x) \big)^2 dx \\
&= \frac{1}{2\sigma^2} \rho^2(f_\omega, f_{\omega'}) = \frac{L^2 h^3}{24\sigma^2} \cdot H(\omega, \omega').
\end{aligned}$$
Then, since $\max_{\omega, \omega'} H(\omega, \omega') \le m = 1/h$,

$$\max_{\omega, \omega'} \mathrm{KL}(P_\omega, P_{\omega'}) = \frac{L^2 h^3}{24\sigma^2} \max_{\omega, \omega'} H(\omega, \omega') \le \frac{L^2 h^2}{24\sigma^2}.$$

1.1.4 Apply local Fano's method
In order to apply Fano's method, we need to satisfy $\max_{\omega, \omega'} \mathrm{KL}(P_\omega, P_{\omega'}) \le \frac{\log |\mathcal{P}'|}{4n}$. Recall that by the Varshamov-Gilbert lemma, $|\mathcal{P}'| \ge 2^{m/8}$, so it is sufficient if we have

$$\frac{L^2 h^2}{24\sigma^2} \le \frac{\log(2^{m/8})}{4n} = \frac{m \log 2}{32 n} = \frac{\log 2}{32 n h}.$$

This suggests that we could choose

$$h = \left( \frac{3 \log 2}{4} \right)^{1/3} \frac{\sigma^{2/3}}{n^{1/3} L^{2/3}}.$$
Thus the separation between hypotheses is $\delta = C \frac{L^{1/3} \sigma^{2/3}}{n^{1/3}}$, where $C$ is some constant, and then by the local Fano method,

$$R_n^* \ge \frac{1}{2} \Phi\!\left( \frac{\delta}{2} \right) = \frac{\delta^2}{8} = C_2 \frac{L^{2/3} \sigma^{4/3}}{n^{2/3}}.$$
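For illustration only (not from the notes), the snippet below plugs in the bandwidth choice above, checks that the KL condition is met, and evaluates the resulting lower bound $\delta^2/8$ for a few values of $n$; the specific values of $L$ and $\sigma$ are arbitrary.

```python
import numpy as np

# Illustrative check of the bandwidth choice and the n^{-2/3} scaling of delta^2 / 8.
L, sigma = 1.0, 1.0

for n in [10**3, 10**4, 10**5]:
    h = (3 * np.log(2) / 4) ** (1/3) * sigma**(2/3) / (n**(1/3) * L**(2/3))
    kl_max = L**2 * h**2 / (24 * sigma**2)       # max KL between alternatives
    budget = np.log(2) / (32 * n * h)            # log|P'| / (4n), using |P'| >= 2^{m/8}, m = 1/h
    delta = L * h / np.sqrt(48)                  # separation between hypotheses
    lower_bound = delta**2 / 8                   # local Fano lower bound
    print(n, kl_max <= budget + 1e-12, lower_bound * n**(2/3))
```

The last printed column stays (roughly) constant across $n$, consistent with the $n^{-2/3}$ rate.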
Remark: In order to apply the above local Fano method, it is required that $|\mathcal{P}'| \ge 16$. It is sufficient to have $|\mathcal{P}'| \ge 2^{m/8} \ge 16$, i.e. $m = 1/h \ge 32$, which means the following must hold:

$$h = \left( \frac{3 \log 2}{4} \right)^{1/3} \frac{\sigma^{2/3}}{n^{1/3} L^{2/3}} \le \frac{1}{32} \ \implies\ n \ge C_3 \frac{\sigma^2}{L^2} \text{ for some constant } C_3.$$

1.2 Upper Bound
To upper bound the minimax risk we introduce the following estimator. Later we will introduce the Nadaraya-Watson estimator and show that the current estimator is a special case of it.
Our estimator $\hat f(t)$ is defined as follows. Let $N(t) = \sum_{i=1}^n \mathbb{1}\{X_i \in [t-h, t+h]\}$. Then define

$$\hat f(t) = \begin{cases} \mathrm{clip}\!\left( \dfrac{1}{N(t)} \displaystyle\sum_{i=1}^n Y_i \mathbb{1}\{X_i \in [t-h, t+h]\},\ 0,\ 1 \right) & \text{if } N(t) > 0, \\ 0 & \text{if } N(t) = 0, \end{cases}$$

where $\mathrm{clip}(x, 0, 1)$ means that

$$\mathrm{clip}(x, 0, 1) = \begin{cases} x & \text{if } 0 \le x \le 1, \\ 0 & \text{if } x < 0, \\ 1 & \text{if } x > 1. \end{cases}$$
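Below is a minimal Python sketch (not from the notes) of this clipped local-average estimator; as the notes point out, it is a special case of the Nadaraya-Watson estimator, with a boxcar kernel followed by clipping. The function name `local_average_estimate`, the synthetic regression function, and the noise level are all illustrative.

```python
import numpy as np

def local_average_estimate(t, X, Y, h):
    """Clipped local average: average the Y_i with X_i in [t-h, t+h], then clip to [0, 1]."""
    in_window = np.abs(X - t) <= h              # indicator 1{X_i in [t-h, t+h]}
    N_t = np.sum(in_window)
    if N_t == 0:
        return 0.0                              # convention when no samples fall in the window
    return float(np.clip(Y[in_window].mean(), 0.0, 1.0))

# Illustrative usage on synthetic data with a Lipschitz regression function in [0, 1].
rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(0.0, 1.0, size=n)
f = lambda x: 0.5 + 0.3 * np.sin(2 * np.pi * x)
Y = f(X) + 0.1 * rng.standard_normal(n)

sigma, L = 0.1, 0.3 * 2 * np.pi                 # rough noise level and Lipschitz constant
h = sigma**(2/3) * L**(-2/3) * n**(-1/3)        # bandwidth choice used later in the notes
ts = np.linspace(0.0, 1.0, 11)
print([round(local_average_estimate(t, X, Y, h), 3) for t in ts])
```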
By definition,

$$R(P_{XY}, \hat f) = \mathbb{E}_S \left[ \int \big( \hat f(x) - f(x) \big)^2 p(x)\, dx \right] \le \alpha_1\, \mathbb{E}_S \left[ \int \big( \hat f(x) - f(x) \big)^2 dx \right] = \alpha_1 \int_0^1 \mathbb{E}_S \left[ \big( \hat f(t) - f(t) \big)^2 \right] dt.$$
We will next provide a pointwise bound on $\mathrm{err}(t) = \mathbb{E}_S\big[ (\hat f(t) - f(t))^2 \big]$, which will then translate to an integrated bound. The calculations for the pointwise bound are very similar to an example we saw previously, so we will only provide an overview and highlight the differences.
Let $G_t = \{N(t) \ge \alpha_0 n h\}$ denote the good event that there are a sufficient number of samples in the window $[t-h, t+h]$. We have,
$$P(G_t^c) = P\left( \sum_{i=1}^n \mathbb{1}\{X_i \in [t-h, t+h]\} < \alpha_0 n h \right) = P\left( \sum_{i=1}^n \Big( \mathbb{1}\{X_i \in [t-h, t+h]\} - P([t-h, t+h]) \Big) < \alpha_0 n h - n P([t-h, t+h]) \right),$$

where $P([t-h, t+h]) = \int_{t-h}^{t+h} p(x)\, dx \ge 2\alpha_0 h$. Thus we have $\alpha_0 n h - n P([t-h, t+h]) \le -\alpha_0 n h$. By Hoeffding's inequality, we have $P(G_t^c) \le \exp(-2\alpha_0^2 n h^2)$. By following the calculations from our previous example, we obtain

$$\mathbb{E}_S \left[ \big( \hat f(t) - f(t) \big)^2 \right] \le L^2 h^2 + \frac{\sigma^2}{n h} + e^{-2\alpha_0^2 n h^2},$$

and hence

$$R(P_{XY}, \hat f) \le \alpha_1 \int_0^1 \mathbb{E}_S \left[ \big( \hat f(t) - f(t) \big)^2 \right] dt \le \alpha_1 \left( L^2 h^2 + \frac{\sigma^2}{n h} + e^{-2\alpha_0^2 n h^2} \right).$$

Now we choose $h = \sigma^{2/3} L^{-2/3} n^{-1/3}$, which implies that

$$R(P_{XY}, \hat f) \le 2\alpha_1 \frac{\sigma^{4/3} L^{2/3}}{n^{2/3}} + \alpha_1 \exp\left( -2\alpha_0^2 \frac{\sigma^{4/3} n^{1/3}}{L^{4/3}} \right).$$

Remark: On an ancillary note, had we used the multiplicative Chernoff bound instead of Hoeffding's inequality, we would have had the following bounds:

$$P(G_t^c) \le e^{-\alpha_0 n h / 8}, \qquad R(P_{XY}, \hat f) \le 2\alpha_1 \frac{\sigma^{4/3} L^{2/3}}{n^{2/3}} + \alpha_1 \exp\left( -\frac{\alpha_0}{4} \frac{\sigma^{2/3} n^{2/3}}{L^{2/3}} \right).$$

For i.i.d. Bernoulli random variables with success probability close to 0 or 1, the multiplicative Chernoff bound can provide a tighter bound than Hoeffding's inequality. This does not significantly alter our conclusions in this example, but it may be significant in other use cases.

1.3 Nadaraya-Watson Estimator

The Nadaraya-Watson estimator (also known as the kernel estimator) is defined to be

$$\hat f(t) = \sum_{i=1}^n w_i(t)\, Y_i, \qquad w_i(t) = \frac{K\big( (t - X_i)/h \big)}{\sum_{j=1}^n K\big( (t - X_j)/h \big)},$$

where $K$ is called a smoothing kernel. For example, in our previous case, the smoothing kernel is $K(u) = \mathbb{1}\{|u| \le 1\}$ (uniform weights on the window $[t-h, t+h]$, followed by clipping). Other kernel choices can lead to better rates under stronger smoothness assumptions. One such assumption is the Hölder class in $\mathbb{R}^d$, which is denoted by $\mathcal{H}(\beta, L)$ and defined to be the set of all functions whose $(\beta - 1)$th order partial derivatives are $L$-Lipschitz. The minimax rate in this class is $\Theta(n^{-2\beta/(2\beta + d)})$. To achieve this rate, we will need to design smarter kernels in the Nadaraya-Watson estimator. The same rates hold for density estimation in the Hölder class.

2 Density Estimation

We will briefly introduce lower and upper bounds for density estimation. Let $\mathcal{F}$ be the class of bounded Lipschitz functions, i.e.

$$\mathcal{F} = \{ f : [0,1] \to [0, B] \ :\ |f(x_1) - f(x_2)| \le L |x_1 - x_2| \ \ \forall\, x_1, x_2 \in [0,1] \}.$$

The corresponding nonparametric family of densities is then defined to be

$$\mathcal{P} = \{ P \ :\ \text{the p.d.f. } p \text{ of } P \text{ is in } \mathcal{F} \}.$$

Suppose we observe $S = (X_1, \ldots, X_n)$ drawn i.i.d. from some distribution $P \in \mathcal{P}$. We wish to estimate the p.d.f. under the $L_2$ loss, i.e.

$$\Phi \circ \rho(P_1, P_2) = \int \big( p_1(t) - p_2(t) \big)^2 dt.$$

By definition, the minimax risk is

$$R_n^* = \inf_{\hat p} \sup_{P \in \mathcal{P}} \mathbb{E}_S \left[ \| \hat p - p \|_2^2 \right].$$

2.1 Lower bound

The first step is to construct alternatives. For this, define

$$\psi(x) = \begin{cases} x + \tfrac{1}{2} & \text{if } x \in [-\tfrac{1}{2}, -\tfrac{1}{4}), \\ -x & \text{if } x \in [-\tfrac{1}{4}, \tfrac{1}{4}], \\ x - \tfrac{1}{2} & \text{if } x \in (\tfrac{1}{4}, \tfrac{1}{2}]. \end{cases}$$

Note that $\psi$ is 1-Lipschitz and always in $[-1/4, 1/4]$. Moreover, $\int \psi(t)\, dt = 0$, $\int \psi^2(t)\, dt = 1/48$, and $|\psi(t)| \le 1/4$.
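As a quick numerical check (not part of the notes), the snippet below evaluates this $\psi$ on a grid and verifies the stated properties: it integrates to 0, $\int \psi^2 = 1/48$, it is bounded by $1/4$, and its finite-difference slopes are at most 1 in magnitude, consistent with 1-Lipschitzness. All names are illustrative.

```python
import numpy as np

def psi(x):
    """Piecewise-linear bump used for the density-estimation lower bound."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    out = np.where((x >= -0.5) & (x < -0.25), x + 0.5, out)
    out = np.where((x >= -0.25) & (x <= 0.25), -x, out)
    out = np.where((x > 0.25) & (x <= 0.5), x - 0.5, out)
    return out

x = np.linspace(-0.5, 0.5, 400001)
dx = x[1] - x[0]
vals = psi(x)

print(np.sum(vals) * dx)                     # ~ 0        (integral of psi)
print(np.sum(vals**2) * dx)                  # ~ 1/48 ≈ 0.0208333
print(np.max(np.abs(vals)))                  # ≈ 0.25     (|psi| <= 1/4)
print(np.max(np.abs(np.diff(vals)) / dx))    # ≈ 1        (consistent with psi being 1-Lipschitz)
```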