CS861: Theoretical Foundations of Machine Learning Lecture 15 – 09/10/2023 University of Wisconsin–Madison, Fall 2023
Lecture 15: Density estimation, Lower bounds for prediction problems
Lecturer: Kirthevasan Kandasamy Scribed by: Haoyue Bai, Zexuan Sun
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the instructor.
In this lecture, we will first present a lower bound for nonparametric density estimation, and then study kernel density estimation to upper bound the risk. We will conclude with a framework for proving lower bounds for prediction problems.
1 Nonparametric Density Estimation (Cont’d)
1.1 Recap: Nonparametric Density Estimation
Consider the function space F = {f : [0, 1] → [0, B] such that |f(x1) − f(x2)| ≤ L|x1 − x2|}. The constraint ensures that each function is Lipschitz continuous with Lipschitz constant L. We are interested in the family of distributions P whose densities are L-Lipschitz; that is, P = {P : the density p of P is in F}.
We observe samples S = {X1, …, Xn} drawn iid from some p ∈ P; that is, the set S is a random sample of n data points that are independent and identically distributed (iid) according to an unknown density p that belongs to P.
Given the sample S, we wish to estimate the density p of P under the squared L2 loss. That is,
\[
\Phi \circ \rho(p_1, p_2) = \int_0^1 \big(p_1(t) - p_2(t)\big)^2\, dt .
\]
We will show that the minimax risk satisfies
\[
R_n^* = \inf_{\hat{p}} \sup_{p \in \mathcal{P}} \mathbb{E}_S\big[\|\hat{p} - p\|_2^2\big] \in \Theta\big(n^{-2/3}\big).
\]
1.2 Lower bound
We will first prove a lower bound via Fano’s method.
Step 1: Construct alternatives
Consider the function ψ illustrated in Figure 1. The following facts are straightforward to verify.
• ψ is 1-Lipschitz, meaning that for any two inputs λ1 and λ2: |ψ(λ1) − ψ(λ2)| ≤ |λ1 − λ2|;
• ∫ψ = 0, which indicates that the function integrates to zero over its entire domain;
• −1/4 ≤ ψ ≤ 1/4, which gives the range of the function;
• ∫ψ² = 1/48, which is the integral of ψ² over its domain (see the numerical check after Figure 1).
Figure 1: Illustrative figure for the function ψ(λ).
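One concrete choice consistent with all four properties is a piecewise-linear bump supported on [−1/2, 1/2] that rises to 1/4 on one half and dips to −1/4 on the other. The sketch below assumes this triangular form (the exact ψ drawn in Figure 1 may differ) and checks the listed properties numerically.

```python
import numpy as np

def psi(lam):
    """An assumed triangular bump on [-1/2, 1/2]: peaks at +1/4 on the left half,
    dips to -1/4 on the right half, and vanishes at -1/2, 0, and 1/2."""
    lam = np.asarray(lam, dtype=float)
    out = np.zeros_like(lam)
    left = (lam >= -0.5) & (lam <= 0.0)
    right = (lam > 0.0) & (lam <= 0.5)
    out[left] = 0.25 - np.abs(lam[left] + 0.25)       # peak +1/4 at lam = -1/4
    out[right] = -(0.25 - np.abs(lam[right] - 0.25))  # trough -1/4 at lam = +1/4
    return out

grid = np.linspace(-0.5, 0.5, 100_001)
vals = psi(grid)
dx = grid[1] - grid[0]

print("max |slope|    :", np.max(np.abs(np.diff(vals) / dx)))    # <= 1, so 1-Lipschitz
print("integral psi   :", np.sum(vals) * dx)                     # ~ 0
print("integral psi^2 :", np.sum(vals ** 2) * dx, "vs", 1 / 48)  # ~ 0.0208
print("range of psi   :", vals.min(), vals.max())                # [-1/4, 1/4]
```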
To construct the alternatives, let h > 0 be a quantity that we will decide later, and let m = 1/h (assume h is chosen so that m is an integer).
The alternative function space is
\[
\mathcal{F}' = \Big\{ p_\omega : p_\omega(\cdot) = 1 + \sum_{j=1}^{m} \omega_j \Phi_j(\cdot),\;\; \omega \in \Omega_m \Big\}.
\]
This space is a set of densities p_ω formed by adding to the uniform density a linear combination of the basis functions Φ_j(t); the vector ω resides in a set Ω_m ⊆ {0, 1}^m that will be chosen via the Varshamov–Gilbert lemma. The basis function is defined as
\[
\Phi_j(t) = L h\, \psi\!\left( \frac{t - \big(j - \tfrac{1}{2}\big) h}{h} \right),
\]
where L denotes the Lipschitz constant, and h is the bandwidth.
Figure 2 provides a visual representation of such a density p_ω.
Figure 2: Illustrative figure for the example p_ω, with ω = 01001010.
Step 2: Lower bound the distance ρ(pω,pω′)
The objective of this step is to lower bound the distance between p_ω and p_ω′ for distinct ω, ω′ ∈ Ω_m. We can compute the squared distance directly:
\[
\begin{aligned}
\rho^2(p_\omega, p_{\omega'}) &= \int_0^1 (p_\omega - p_{\omega'})^2 \\
&= \sum_{j=1}^{m} \int_{\frac{j-1}{m}}^{\frac{j}{m}} \big( 1 + \omega_j \Phi_j - (1 + \omega'_j \Phi_j) \big)^2 \\
&= \sum_{j=1}^{m} \mathbb{I}(\omega_j \neq \omega'_j) \int \Phi_j^2 \;=\; H(\omega, \omega')\, \frac{L^2 h^3}{48},
\end{aligned}
\]
where H(ω, ω′) denotes the Hamming distance between ω and ω′. By the Varshamov–Gilbert lemma, we can choose Ω_m ⊆ {0, 1}^m with |Ω_m| ≥ 2^{m/8} such that H(ω, ω′) ≥ m/8 for all distinct ω, ω′ ∈ Ω_m. Therefore,
\[
\min_{\omega \neq \omega'} \|p_\omega - p_{\omega'}\|_2 \;=\; \min_{\omega \neq \omega'} \sqrt{H(\omega, \omega')\, \frac{L^2 h^3}{48}} \;\ge\; \sqrt{\frac{m}{8} \cdot \frac{L^2 h^3}{48}} \;=\; \frac{L h}{8\sqrt{6}} \;\triangleq\; \delta,
\]
using m = 1/h in the last equality.
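As a quick sanity check on this identity, the sketch below builds two densities from the construction, reusing the assumed triangular ψ from the earlier snippet (repeated here so it runs on its own), and compares a numerical L2 distance with H(ω, ω′) L²h³/48. The values of L, h, ω, and ω′ are placeholders.

```python
import numpy as np

def psi(lam):
    """Assumed triangular bump on [-1/2, 1/2] (same form as in the earlier check)."""
    lam = np.asarray(lam, dtype=float)
    out = np.zeros_like(lam)
    left = (lam >= -0.5) & (lam <= 0.0)
    right = (lam > 0.0) & (lam <= 0.5)
    out[left] = 0.25 - np.abs(lam[left] + 0.25)
    out[right] = -(0.25 - np.abs(lam[right] - 0.25))
    return out

def p_omega(t, omega, L, h):
    """p_omega(t) = 1 + sum_j omega_j * Phi_j(t), with Phi_j(t) = L*h*psi((t - (j - 1/2)h)/h)."""
    density = np.ones_like(t)
    for j, bit in enumerate(omega, start=1):
        if bit:
            density += L * h * psi((t - (j - 0.5) * h) / h)
    return density

L, m = 1.0, 8                                 # placeholder Lipschitz constant and number of bins
h = 1.0 / m
omega = np.array([0, 1, 0, 0, 1, 0, 1, 0])    # the bit string shown in Figure 2
omega_p = np.array([1, 1, 0, 1, 0, 0, 1, 0])  # another (arbitrary) vertex of the hypercube
H = int(np.sum(omega != omega_p))             # Hamming distance (here 3)

t = np.linspace(0.0, 1.0, 200_001)
dx = t[1] - t[0]
rho_sq_numeric = np.sum((p_omega(t, omega, L, h) - p_omega(t, omega_p, L, h)) ** 2) * dx
rho_sq_formula = H * L ** 2 * h ** 3 / 48

print(f"numeric : {rho_sq_numeric:.6e}")      # the two should agree up to discretization error
print(f"formula : {rho_sq_formula:.6e}")
```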
Step 3: Upper bound KL
In this step, the goal is to determine an upper bound for the Kullback–Leibler (KL) divergence between any two densities p_ω and p_ω′ in F′. Expanding the definition of the KL divergence,
\[
\begin{aligned}
\mathrm{KL}(p_\omega, p_{\omega'}) &= \int_0^1 p_\omega \log\!\left( \frac{p_\omega}{p_{\omega'}} \right) \\
&= \sum_{j=1}^{m} \int_{\frac{j-1}{m}}^{\frac{j}{m}} (1 + \omega_j \Phi_j)\, \log\!\left( \frac{1 + \omega_j \Phi_j}{1 + \omega'_j \Phi_j} \right) \mathbb{I}(\omega_j \neq \omega'_j).
\end{aligned}
\]
After some algebra (you will do this in the homework), we obtain the following upper bound:
\[
\mathrm{KL}(p_\omega, p_{\omega'}) \;\le\; H(\omega, \omega')\, \frac{L^2 h^3}{48} \quad\Longrightarrow\quad \forall\, \omega, \omega',\;\; \mathrm{KL}(p_\omega, p_{\omega'}) \;\le\; \frac{L^2 h^2}{48},
\]
where the implication uses H(ω, ω′) ≤ m = 1/h. This shows that, regardless of the particular values of ω and ω′, the KL divergence between any two densities p_ω and p_ω′ from the considered set is at most L²h²/48.
A formal proof of this statement is left as an exercise in a homework assignment.
Step 4: Apply local Fano
In this step we apply the local Fano method. We want
\[
\max_{\omega \neq \omega'} \mathrm{KL}(p_\omega, p_{\omega'}) \;\le\; \frac{\log |\mathcal{F}'|}{4 n},
\]
i.e., we upper bound the maximum KL divergence between any two densities in the set by a term that diminishes with increasing sample size n. Since KL(p_ω, p_ω′) ≤ L²h²/48 and |F′| ≥ 2^{m/8} with m = 1/h, it is sufficient to have
\[
\frac{L^2 h^2}{48} \;\le\; \frac{\log\!\big(2^{m/8}\big)}{4 n} \;=\; \frac{\log 2}{32\, n h}.
\]
Choosing h = C / (n^{1/3} L^{2/3}) for a small enough constant C satisfies this condition (a quick numerical check appears after the requirements below); this determines the choice of h as a function of n and L.
With this choice of h, local Fano's inequality gives
\[
R_n^* \;\ge\; \frac{1}{2}\, \Phi\!\left(\frac{\delta}{2}\right) \;=\; \frac{1}{2}\, \Phi\!\left( \frac{L h}{2 \times 8\sqrt{6}} \right) \;=\; C' \, \frac{L^{2/3}}{n^{2/3}}.
\]
This offers a lower bound on the risk R_n^*, which quantifies the error of estimation. The following requirements ensure the validity and applicability of the above relations:
• m = 1/h ≥ 8. This is necessary for the Varshamov–Gilbert lemma to be applicable.
• Cardinality of F′: |F′| ≥ 16 ⇐ 2^{m/8} ≥ 16 ⇐ h ≤ 1/32 ⇒ satisfied if n ≥ c/L². This ensures a sufficient number of observations given the Lipschitz constant L.
• KL bounding condition: h ≤ 2.72/L. This ensures the KL divergence between the densities remains bounded, and ties the bandwidth h to the Lipschitz constant L.
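For a quick check of the bandwidth choice: the condition L²h²/48 ≤ log 2/(32nh) rearranges to h³ ≤ (3/2) log 2 / (nL²), i.e. h ≤ ((3/2) log 2)^{1/3} n^{−1/3} L^{−2/3} ≈ 1.01 · n^{−1/3} L^{−2/3}, so any constant C up to about 1.01 works. The snippet below verifies this numerically for C = 1 with placeholder values of n and L.

```python
import math

n, L = 10_000, 2.0                        # placeholder sample size and Lipschitz constant
h = 1.0 / (n ** (1 / 3) * L ** (2 / 3))   # h = C / (n^{1/3} L^{2/3}) with C = 1

lhs = L ** 2 * h ** 2 / 48                # upper bound on the max KL from Step 3
rhs = math.log(2) / (32 * n * h)          # log(|F'|) / (4n), using |F'| >= 2^{m/8} and m = 1/h
print(f"{lhs:.3e} <= {rhs:.3e} ? {lhs <= rhs}")   # True for any C <= ((3/2) log 2)^(1/3)
```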
2 Upper Bound via Kernel Density Estimation
Kernel Density Estimation (KDE) is a nonparametric technique for estimating the probability density function of a continuous random variable. The KDE estimator has the following form:
\[
\hat{p}(t) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h} K\!\left( \frac{t - X_i}{h} \right),
\]
where p̂(t) is the estimated density at the point t, n is the number of data points, and X_i are the observed data points. The bandwidth h is a parameter that plays a critical role in KDE, and K is a (smoothing) kernel with the following properties:
(1) Normalization:
\[
\int K(u)\, du = 1.
\]
This ensures the estimate integrates to 1 over its entire domain, maintaining the fundamental property of a probability density function.
(2) Symmetry:
\[
K(u) = K(-u).
\]
This property ensures that the kernel is symmetric around zero. As a result, the estimated density will not be biased towards any direction from the point of estimation.
For the problem at hand, the kernel selected is K(t) = 𝕀(|t| ≤ 1/2), which is sufficient for Lipschitz densities. This is a simple uniform (box) kernel.
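A minimal implementation sketch of this estimator with the uniform kernel follows; the test density (a Beta(2,2), which is Lipschitz on [0,1]) and the bandwidth value are illustrative choices, not prescribed by the lecture.

```python
import numpy as np

def uniform_kernel(u):
    """K(u) = I(|u| <= 1/2), the box kernel used above."""
    return (np.abs(u) <= 0.5).astype(float)

def kde(t, data, h, kernel=uniform_kernel):
    """p_hat(t) = (1/n) sum_i (1/h) K((t - X_i)/h), evaluated at the points t."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    u = (t[:, None] - data[None, :]) / h      # (num evaluation points) x (n) matrix
    return kernel(u).mean(axis=1) / h

rng = np.random.default_rng(0)
n = 2000
data = rng.beta(2, 2, size=n)                 # samples from a Lipschitz density on [0, 1]
h = n ** (-1 / 3)                             # bandwidth of the order suggested by the analysis

grid = np.linspace(0.0, 1.0, 11)
print(np.round(kde(grid, data, h), 3))        # estimated density on a coarse grid
```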
We can bound the risk as follows:
\[
\begin{aligned}
\mathbb{E}\big[\|p - \hat{p}\|_2^2\big] &= \mathbb{E}\Big[ \int_0^1 (p - \hat{p})^2\, dt \Big] \\
&= \mathbb{E}\Big[ \int_0^1 (p - \mathbb{E}[\hat{p}])^2\, dt + \int_0^1 (\mathbb{E}[\hat{p}] - \hat{p})^2\, dt + 2 \int_0^1 (p - \mathbb{E}[\hat{p}])(\mathbb{E}[\hat{p}] - \hat{p})\, dt \Big] \\
&= \int_0^1 \underbrace{\big(p(t) - \mathbb{E}[\hat{p}(t)]\big)^2}_{\mathrm{bias}^2(t)}\, dt + \int_0^1 \underbrace{\mathbb{E}\big[(\hat{p}(t) - \mathbb{E}[\hat{p}(t)])^2\big]}_{\mathrm{Var}(t)}\, dt + 2 \int_0^1 (p - \mathbb{E}[\hat{p}])\, \underbrace{\mathbb{E}\big[\mathbb{E}[\hat{p}] - \hat{p}\big]}_{=\,0}\, dt.
\end{aligned}
\]
The bias and variance terms can be written and bounded as follows. The bias of an estimator indicates how far, on average, the estimate is from the true value. In our context, the bias term is derived as follows:
\[
\begin{aligned}
\mathrm{bias}(t) &= \mathbb{E}[\hat{p}(t)] - p(t) \\
&= \mathbb{E}_{X \sim P}\!\left[ \frac{1}{h} K\!\left(\frac{t - X}{h}\right) \right] - p(t) \qquad \left(\text{using } \mathbb{E}\Big[\tfrac{1}{n}\textstyle\sum_i Z_i\Big] = \mathbb{E}[Z_1]\right) \\
&= \int \frac{1}{h} K\!\left(\frac{t - x}{h}\right) p(x)\, dx - p(t) \\
&= \int K(u)\, p(t + uh)\, du - p(t) \qquad \left(\text{substituting } u = \tfrac{x - t}{h} \text{ and using the symmetry of } K\right) \\
&= \int K(u)\, \big( p(t + uh) - p(t) \big)\, du \qquad \left(\text{using } \textstyle\int K(u)\, du = 1\right).
\end{aligned}
\]
The bias depends on the choice of kernel and bandwidth h. Then we can bound the bias term:
\[
\begin{aligned}
|\mathrm{bias}(t)| &\le \int K(u)\, \big|p(t + uh) - p(t)\big|\, du \\
&\le \int K(u)\, L |u h|\, du \\
&= L h \int K(u)\, |u|\, du \;=\; C_1 L h,
\end{aligned}
\]
where C₁ = ∫ K(u)|u| du is a constant that depends only on the kernel, and L is the Lipschitz constant, signifying the bound on the rate of change of our function.
Variance quantifies the dispersion of an estimator around its expected value. In the context of KDE, variance arises from the randomness in the sample. For our setup, the variance term is captured by:
\[
\begin{aligned}
\mathrm{Var}(t) &= \mathrm{Var}\!\left( \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h} K\!\left(\frac{t - X_i}{h}\right) \right) \\
&= \frac{1}{n}\, \mathrm{Var}_{X \sim P}\!\left( \frac{1}{h} K\!\left(\frac{t - X}{h}\right) \right) \qquad \left(\text{using } \mathrm{Var}\Big(\tfrac{1}{n}\textstyle\sum_i Z_i\Big) = \tfrac{1}{n}\mathrm{Var}(Z_1)\right) \\
&\le \frac{1}{n}\, \mathbb{E}_{X \sim P}\!\left[ \frac{1}{h^2} K^2\!\left(\frac{t - X}{h}\right) \right] \qquad \big(\mathrm{Var}(Z) = \mathbb{E}[Z^2] - (\mathbb{E}[Z])^2 \le \mathbb{E}[Z^2]\big) \\
&= \frac{1}{n h^2} \int K^2\!\left(\frac{t - x}{h}\right) p(x)\, dx \\
&= \frac{1}{n h} \int K^2(u)\, p(t + uh)\, du \\
&\le \frac{B}{n h} \int K^2(u)\, du \;=\; \frac{B}{n h},
\end{aligned}
\]
where B is the upper bound on the density (recall that functions in F take values in [0, B]), and the last equality uses ∫ K²(u) du = 1 for the uniform kernel. Combining these bounds for the bias and variance terms, we have
\[
\begin{aligned}
\mathbb{E}_S\big[\|p - \hat{p}\|_2^2\big] &\le \int_0^1 \mathrm{bias}^2(t)\, dt + \int_0^1 \mathrm{Var}(t)\, dt \\
&\le C_1^2 L^2 h^2 + \frac{B}{n h} \\
&= \big(B + C_1^2\big)\, \frac{L^{2/3}}{n^{2/3}} \qquad \left(\text{if } h = \frac{1}{n^{1/3} L^{2/3}}\right),
\end{aligned}
\]
which matches the lower bound from Section 1 up to constant factors.
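The n^{−2/3} rate can also be eyeballed empirically. The sketch below is an illustrative simulation (the Beta(2,2) test density, the bandwidth constant, and the Monte Carlo settings are assumptions, not from the lecture); if the rate holds, the product n^{2/3} × risk in the last column should stay roughly constant as n grows.

```python
import numpy as np

def kde(t, data, h):
    """Uniform-kernel KDE, redefined here so this snippet runs on its own."""
    u = (t[:, None] - data[None, :]) / h
    return (np.abs(u) <= 0.5).mean(axis=1) / h

def true_density(t):
    """Beta(2, 2) density, which is Lipschitz on [0, 1]."""
    return 6 * t * (1 - t)

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 501)
dx = grid[1] - grid[0]

for n in [500, 2000, 8000]:
    h = n ** (-1 / 3)                            # bandwidth of the order used in the analysis
    risks = []
    for _ in range(10):                          # Monte Carlo repetitions
        data = rng.beta(2, 2, size=n)
        err = kde(grid, data, h) - true_density(grid)
        risks.append(np.sum(err ** 2) * dx)      # squared L2 error via a Riemann sum
    risk = float(np.mean(risks))
    print(f"n = {n:5d}   E||p - p_hat||^2 ~ {risk:.4f}   n^(2/3) * risk ~ {risk * n ** (2 / 3):.2f}")
```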
3 Lower Bounds for Prediction Problems
So far, we have talked about estimation in some metric, for instance
\[
\rho(\theta_1, \theta_2) = |\theta_1 - \theta_2|, \quad \theta(P) \in \mathbb{R}; \qquad
\rho(\theta_1, \theta_2) = \|\theta_1 - \theta_2\|_{L_2}, \quad \theta(P) \in \{f : \mathcal{X} \to \mathbb{R}\}.
\]
Next, we will develop a framework for proving lower bounds for prediction problems (among others). The framework comprises the following components:
1. Data space Z. This is the space from which data samples arise.
2. A family of distributions P, where ∀P ∈ P, supp(P) ⊆ Z. This represents a collection of probability distributions from which the data can be drawn.
3. A hypothesis/parameter space H. This contains all the potential hypotheses or models that we might use to make predictions.
4. An “instance loss”, f : H × Z → ℝ, where f(h, Z) is the loss of hypothesis h on instance Z. This measures how well a particular hypothesis h from H performs on an instance Z from Z.
5. The population loss, F(h, P) = E_{Z∼P}[f(h, Z)]; the excess population loss, L(h, P) = F(h, P) − inf_{h′∈H} F(h′, P). These terms indicate how well our hypothesis does on average (under distribution P) and how it compares to the best possible hypothesis in H, respectively.
6. A dataset S is drawn from some P ∈ P.
7. An estimator ĥ, which maps the dataset S to a hypothesis in H. Note that we overload notation here:
ĥ : a map from data to H (the estimator), and ĥ : the resulting estimate (ĥ ∈ H).
8. Risk of the estimator ĥ:
\[
R(\hat{h}, P) = \mathbb{E}\big[L(\hat{h}(S), P)\big] = \mathbb{E}\big[F(\hat{h}(S), P)\big] - \inf_{h \in \mathcal{H}} F(h, P).
\]
9. Minimax risk:
\[
R^\star = \inf_{\hat{h}} \sup_{P \in \mathcal{P}} R(\hat{h}, P).
\]
Next, let us see an example.
Example 1. Excess risk in classification/regression
• Data space Z = X × Y, H : {h : X → Y}. This is the set of all possible prediction functions mapping from features X to outcomes Y .
• Instance loss (in classification), f (h, (X, Y )) = 1 (h(X) ̸= Y ). This measures the discrepancy between a predicted and actual class label.
• F(h, P) = E_{X,Y∼P}[1(h(X) ≠ Y)] is the “risk”. This is the expected value of the instance loss over the joint distribution of X and Y, and can be understood as the overall error rate of the classifier.
• L(h, P) = F(h, P) − F(h⋆, P), where h⋆ is the Bayes optimal classifier, i.e. h⋆(x) = argmax_{y∈Y} P(Y = y | X = x), and L(h, P) is the “excess risk” in classification. The excess risk quantifies how much worse our classifier h performs compared to the Bayes optimal classifier h⋆.
• In regression we define f(h, (X, Y)) = (h(X) − Y)². The typical loss function used is the squared loss.
• L(h, P) = E_{X,Y∼P}[(h(X) − Y)²] − E_{X,Y∼P}[(h⋆(X) − Y)²], where h⋆(x) = E[Y | X = x], and L(h, P) is the “excess risk” in regression. The excess risk measures how much worse our regression function performs compared to the optimal regression function h⋆(x), which is the conditional expectation.
• Note that this is different from Φ ◦ ρ(h₁, h₂) = ‖h₁ − h₂‖₂², which captures the squared difference between two hypotheses h₁ and h₂; E_S[Φ ◦ ρ(ĥ, h⋆)] measures how much the estimator ĥ deviates from the optimal hypothesis h⋆. A small numerical illustration of these definitions is given below.
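To make these definitions concrete, here is a small numerical illustration for binary classification on a toy joint distribution; the regression function η, the thresholds, and the Monte Carlo sample size are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def eta(x):
    """Regression function P(Y = 1 | X = x) for an assumed toy distribution."""
    return 0.25 + 0.5 * x

def sample(n):
    """Draw n iid pairs (X, Y) with X ~ Uniform[0, 1] and Y | X ~ Bernoulli(eta(X))."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = (rng.uniform(0.0, 1.0, size=n) < eta(x)).astype(int)
    return x, y

def instance_loss(h, x, y):
    """f(h, (X, Y)) = 1(h(X) != Y), the 0/1 instance loss."""
    return (h(x) != y).astype(float)

def bayes(x):
    """Bayes optimal classifier h*(x) = 1{eta(x) >= 1/2}, i.e. a threshold at x = 0.5."""
    return (eta(x) >= 0.5).astype(int)

def h_hat(x):
    """An illustrative (suboptimal) hypothesis: a threshold at x = 0.6."""
    return (x >= 0.6).astype(int)

# Monte Carlo approximation of the population losses F(h, P) = E[f(h, Z)],
# using a common sample so the difference (the excess risk) has low variance.
x, y = sample(1_000_000)
F_hat = instance_loss(h_hat, x, y).mean()
F_star = instance_loss(bayes, x, y).mean()

print(f"F(h_hat, P) ~ {F_hat:.4f}   F(h*, P) ~ {F_star:.4f}")
print(f"excess risk L(h_hat, P) ~ {F_hat - F_star:.4f}   (exact value: 0.005)")
```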