
CS861: Theoretical Foundations of Machine Learning Lecture 10 – 27/09/2023 University of Wisconsin–Madison, Fall 2023
Lecture 10: Le Cam’s Method (Some Examples)
Lecturer: Kirthevasan Kandasamy Scribed by: Ying Fu, Deep Patel
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the instructor.
Previously, we defined the notion of minimax optimality and obtained lower bounds on the minimax risk by 'reducing' the problem of estimation to that of hypothesis testing. Following this reduction, we studied Le Cam's method, which obtains a lower bound by considering binary hypothesis testing specifically. In this lecture, we begin by deriving a specific form of the lower bound as a consequence of Le Cam's method, and then see its applications through some examples of mean estimation and regression.
Corollary 1 (From Le Cam's Method). Let P0, P1 ∈ P and δ = ρ(θ(P0), θ(P1)). Suppose the dataset S consists of n i.i.d. samples and KL(P0, P1) ≤ (1/n) log 2. Then,

    R_n^* ≥ (1/8) Φ(δ/2).        (1)

Proof. Write Pi^n for the n-fold product distribution of Pi, i.e. the distribution of the dataset S under Pi. From our last lecture, we know that

    R_n^* ≥ (1/2) Φ(δ/2) ∥P0^n ∧ P1^n∥
          ≥ (1/2) Φ(δ/2) · (1/2) exp(−KL(P0^n, P1^n))        (by the inequality ∥P ∧ Q∥ ≥ (1/2) exp(−KL(P, Q)))
          = (1/4) Φ(δ/2) exp(−n KL(P0, P1))                   (since KL(P0^n, P1^n) = n KL(P0, P1))
          ≥ (1/8) Φ(δ/2).                                     (by the assumption KL(P0, P1) ≤ (1/n) log 2)
Remark 0.1. Corollary 1 formalizes the intuition that if we can find two distributions that are statistically close to each other (i.e., low KL divergence) but whose parameters θ(P0) and θ(P1) are far apart (i.e., large δ), then the minimax risk is also large. This is because the lower bound in the Corollary will be large, owing to the difficulty of distinguishing between the two distributions via the binary hypothesis test.
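The inequality ∥P ∧ Q∥ ≥ (1/2) exp(−KL(P, Q)) used in the proof can be sanity-checked numerically. Below is a minimal Python sketch (ours, not part of the notes) that verifies it for pairs of unit-variance Gaussians, using the fact that the overlap ∫ min(p, q) of N(0, 1) and N(μ, 1) equals twice the standard normal CDF evaluated at −μ/2.

```python
# Numerical sanity check (illustrative, not from the notes) of ||P ∧ Q|| >= (1/2) exp(-KL(P, Q))
# for P = N(0, 1) and Q = N(mu, 1).
import numpy as np
from scipy.stats import norm

for mu in [0.1, 0.5, 1.0, 2.0]:
    affinity = 2 * norm.cdf(-mu / 2)   # ||P ∧ Q|| = ∫ min(p, q): the two densities cross at mu/2
    kl = mu**2 / 2                     # KL(N(0,1), N(mu,1)) = mu^2 / 2
    assert affinity >= 0.5 * np.exp(-kl)
    print(f"mu={mu}: overlap={affinity:.4f} >= (1/2)exp(-KL)={0.5 * np.exp(-kl):.4f}")
```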
1 Estimating Mean of Normal Distribution (with known variance)
Suppose we have S = {X1, · · · , Xn} with n i.i.d. samples drawn from P ∈ P, where P = {N(μ, σ²) | μ ∈ R} with σ² known. We also have Φ ◦ ρ(θ1, θ2) = (θ1 − θ2)². Define P0 = N(0, σ²) and P1 = N(δ, σ²), so that |θ(P0) − θ(P1)| = δ. Then the KL divergence between P0 and P1 is

    KL(P0, P1) = δ²/(2σ²)        (since KL(N(μ1, σ²), N(μ2, σ²)) = (μ1 − μ2)²/(2σ²))

In order to satisfy the requirement of Corollary 1, namely KL(P0, P1) ≤ (log 2)/n, we will choose δ = σ√(2 log 2 / n). Then, by Corollary 1, we have the following:

    R_n^* ≥ (1/8) Φ(δ/2) = (1/8) · (δ/2)²
          = (1/8) · (1/4) · σ² · (2 log 2)/n
          = (log 2 / 16) · σ²/n
Previously, we have seen that the sample mean estimator achieves the σ²/n rate.
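As an aside, the constant-factor gap between the σ²/n rate of the sample mean and the (log 2/16) · σ²/n lower bound above can be seen in a quick simulation. The following Python sketch is illustrative and not part of the notes.

```python
# Illustrative Monte Carlo (not from the notes): risk of the sample mean for N(mu, sigma^2),
# compared against the exact value sigma^2 / n and the Le Cam lower bound (log 2 / 16) sigma^2 / n.
import numpy as np

rng = np.random.default_rng(0)
sigma, mu, n, trials = 1.0, 0.3, 200, 20000
sample_means = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
empirical_risk = np.mean((sample_means - mu) ** 2)
print(f"empirical risk ≈ {empirical_risk:.2e}")
print(f"sigma^2 / n     = {sigma**2 / n:.2e}")
print(f"lower bound     = {np.log(2) / 16 * sigma**2 / n:.2e}")
```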
2 Estimating Mean of Bernoulli Distribution

Before we proceed, we will state some facts about Bernoulli distributions that we will be using for deriving the lower bounds:
• If P ∼ Bern(p) and Q ∼ Bern(q), then we have

      2(p − q)² ≤ KL(P, Q) ≤ (p − q)²/(q(1 − q))        (2)

  The proof of the relations in Equation (2) is left as an exercise: the first inequality follows from Pinsker's inequality, and the second uses the fact that log(x) ≤ x − 1. (A quick numerical check of these bounds is given right after this list.)

• Therefore, if q is bounded away from 0 and 1, then

      KL(P, Q) ∈ Θ((p − q)²)

  This is also a more general phenomenon for sub-Gaussian distributions (i.e., KL(P, Q) ∈ Θ((μ_P − μ_Q)²)).
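The following Python snippet (illustrative, not part of the notes) checks Equation (2) numerically for a few random (p, q) pairs.

```python
# Numerical check of Equation (2): 2 (p - q)^2 <= KL(Bern(p), Bern(q)) <= (p - q)^2 / (q (1 - q)).
import numpy as np

def kl_bern(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for p, q in (0, 1)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

rng = np.random.default_rng(1)
for _ in range(5):
    p, q = rng.uniform(0.05, 0.95, size=2)
    kl = kl_bern(p, q)
    assert 2 * (p - q) ** 2 <= kl <= (p - q) ** 2 / (q * (1 - q))
    print(f"p={p:.2f}, q={q:.2f}: {2*(p-q)**2:.4f} <= {kl:.4f} <= {(p-q)**2/(q*(1-q)):.4f}")
```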
With this setup, let us begin our example of mean estimation for Bernoulli distributions. Let S = {X1, · · · , Xn} with n i.i.d. samples drawn from P ∈ P, where P = {Bern(μ) | μ ∈ [0, 1]}. Define P0 = Bern(1/2 + δ) and P1 = Bern(1/2), so that |θ(P0) − θ(P1)| = δ. Then, by the facts introduced above, we have

    KL(P0, P1) ≤ (p0 − p1)²/(p1(1 − p1)) = δ²/(1/4) = 4δ²

In order to satisfy the requirement of Corollary 1, namely KL(P0, P1) ≤ (log 2)/n, we will choose δ = (1/2)√(log 2 / n). With this chosen δ, by Corollary 1, we get

    R_n^* ≥ (1/8) Φ(δ/2) = (1/8) · (δ/2)²
          = (1/8) · (1/16) · (log 2)/n
          = (log 2 / 128) · (1/n)
Previously, we have seen that the sample mean estimator achieves the 1/n rate.
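As with the Gaussian case, a short simulation (illustrative, not part of the notes) shows the sample mean's risk sitting above the log 2/(128n) lower bound derived here.

```python
# Illustrative Monte Carlo: sample-mean risk for Bern(1/2) vs the Le Cam lower bound log2 / (128 n).
import numpy as np

rng = np.random.default_rng(2)
n, trials, mu = 100, 50000, 0.5
risk = np.mean((rng.binomial(1, mu, size=(trials, n)).mean(axis=1) - mu) ** 2)
print(f"empirical risk ≈ {risk:.2e}, exact mu(1-mu)/n = {mu*(1-mu)/n:.2e}, "
      f"lower bound = {np.log(2)/(128*n):.2e}")
```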

3 A Simplified Regression Problem
Let S = {(X1, Y1), . . . , (Xn, Yn)}, where (Xi, Yi) ∼ P ∈ P i.i.d., with Xi ∼ unif([0, 1]) and Yi | Xi following a σ-sub-Gaussian distribution with mean f(Xi). Here, we will assume that the function f : [0, 1] → [0, 1] is L-Lipschitz. Thus, by our assumptions, we have

    P = { P_XY : P_X = unif([0, 1]) (the marginal of X), f(·) = E[Y | X = ·] (the regression function) is L-Lipschitz with range [0, 1], and Y | X is σ-sub-Gaussian }.

We will begin with a simplified regression problem, where we are interested in estimating the value of the regression function at x = 1/2, i.e., θ = f(1/2) = E[Y | X = 1/2], and the criterion for measuring 'good'-ness is the standard squared error loss: Φ ◦ ρ(θ1, θ2) ≜ (θ1 − θ2)².

3.1 Lower Bound
Let us first obtain the lower bound for the minimax risk. For this, we need to choose a P0 and P1. To keep things simple, we will choose them to be the following:

    P0 :  P0(y | x) = N(f0(x), σ²),   P0(x) = unif([0, 1])   (the marginal of X is unif([0, 1]) by assumption)
    P1 :  P1(y | x) = N(f1(x), σ²),   P1(x) = unif([0, 1])   (the marginal of X is unif([0, 1]) by assumption)
Recall that we want the gap between the two parameters to be large, i.e., a large δ = |f0(1/2) − f1(1/2)|, while ensuring that the two distributions are hard to distinguish, i.e., a small KL(P0, P1). Given that we require the regression functions f0 and f1 to be L-Lipschitz, for the simple choice f0 ≡ 0, the regression function f1 would look like the 'tent' shown in Figure 1.

Figure 1: Here, we'd need δ/L ≤ 1/2 for f1 to be well-defined.

We can mathematically define these functions, f0 and f1, as follows:

    f0(x) ≜ 0   for all x ∈ [0, 1]

    f1(x) ≜  L(x − (1/2 − δ/L)),   if x ∈ [1/2 − δ/L, 1/2)
             L(1/2 + δ/L − x),     if x ∈ [1/2, 1/2 + δ/L)
             0,                    otherwise
For obtaining the lower bound as derived in Corollary 1, recall that we need to choose δ such that KL(P0, P1) ≤ (log 2)/n. To do this, let us first compute KL(P0, P1):

    KL(P0, P1) = ∫₀¹ ∫_{−∞}^{∞} P0(x, y) log( P0(x, y)/P1(x, y) ) dy dx
               = ∫₀¹ ∫ P0(y|x) P0(x) log( P0(y|x) P0(x) / (P1(y|x) P1(x)) ) dy dx
               = ∫₀¹ ∫ P0(y|x) log( P0(y|x)/P1(y|x) ) dy dx                        (since P0(x) = P1(x) = unif([0, 1]))
               = ∫₀¹ KL( N(f0(x), σ²), N(f1(x), σ²) ) dx
               = ∫₀¹ (1/(2σ²)) (0 − f1(x))² dx                                     (∵ KL(N(μ1, σ²), N(μ2, σ²)) = (μ1 − μ2)²/(2σ²))
               = (1/(2σ²)) [ ∫_{1/2−δ/L}^{1/2} L²(x − 1/2 + δ/L)² dx + ∫_{1/2}^{1/2+δ/L} L²(1/2 + δ/L − x)² dx ]
               = (1/(2σ²)) · 2L² · (1/3)(δ/L)³
               = δ³/(3σ²L)

So, we can choose δ = (3σ²L log 2 / n)^{1/3} to ensure KL(P0, P1) ≤ (log 2)/n. Application of Corollary 1 now tells us that

    R_n^* ≥ (1/8) Φ(δ/2) = (1/8) · (δ²/4) = C · σ^{4/3} L^{2/3} / n^{2/3},        (3)

where C = (1/32)(3 log 2)^{2/3}.
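To double-check the closed form KL(P0, P1) = δ³/(3σ²L), one can integrate f1² numerically; the sketch below (illustrative, not part of the notes) does this with a simple Riemann sum.

```python
# Numerical check of KL(P0, P1) = ∫_0^1 f1(x)^2 / (2 sigma^2) dx = delta^3 / (3 sigma^2 L).
import numpy as np

def f1(x, delta, L):
    # The 'tent' bump of Figure 1: rises with slope L, peaks at delta at x = 1/2, falls back to 0.
    return np.clip(delta - L * np.abs(x - 0.5), 0.0, None)

sigma, L, delta = 1.0, 2.0, 0.4                                # requires delta / L <= 1/2
xs = np.linspace(0.0, 1.0, 400001)
kl_numeric = np.mean(f1(xs, delta, L) ** 2) / (2 * sigma**2)   # Riemann sum over [0, 1]
kl_closed_form = delta**3 / (3 * sigma**2 * L)
print(kl_numeric, kl_closed_form)                              # both ≈ 0.01067
```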
Remark 3.1. One minor detail here is that, for our construction, we need δ/L ≤ 1/2 (see Figure 1). With the choice of δ above, this holds, and hence the lower bound in Equation (3) applies, only if n ≥ 24σ² log 2 / L².
Remark 3.2. Previously, when we were estimating the mean of a Gaussian distribution, we obtained a 1/n 'rate' in the lower bound, as opposed to the 1/n^{2/3} that we have obtained here for regression in Equation (3). This is because only the samples with Xi near the point X = 1/2 carry information about f(1/2). We will note later that we get similarly weaker rates even when we look at more realistic regression settings wherein we are not doing point estimation as in this toy setting.
3.2 Upper Bound
Having derived the lower bound, let us now derive an upper bound. While our focus is on lower bounds, this exercise will be useful when we derive upper bounds for general regression problems later on. For this, we will rely on the idea of estimating θ = f(1/2) by averaging the Yi whose Xi fall in a neighborhood around the point 1/2 (see Figure 2). Towards this, for a bandwidth α ∈ (0, 1/2), we define the following quantities:

    N(S) ≜ Σ_{i=1}^n I{Xi ∈ (1/2 − α, 1/2 + α)}

    θ̂(S) ≜  0,                                                    if N(S) = 0
            (1/N(S)) Σ_{i=1}^n I{Xi ∈ (1/2 − α, 1/2 + α)} Yi,      if N(S) > 0

Figure 2: Idea: Estimate θ = f(1/2) by calculating the average of points in a neighborhood around the point 1/2
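A direct implementation of this estimator is short; the following Python sketch (ours, not the notes') mirrors the definition above, with the bandwidth α left as an argument and an illustrative test function of our choosing.

```python
# Windowed-average estimator of theta = f(1/2), as defined above (illustrative sketch).
import numpy as np

def theta_hat(X, Y, alpha):
    """Average the Y_i with X_i in (1/2 - alpha, 1/2 + alpha); return 0 if no point falls inside."""
    in_window = np.abs(X - 0.5) < alpha      # A_i = I{X_i in (1/2 - alpha, 1/2 + alpha)}
    return Y[in_window].mean() if in_window.any() else 0.0

# Example usage on synthetic data from the model class (f is 1-Lipschitz with range in [0, 1]).
rng = np.random.default_rng(3)
n, sigma, L = 2000, 0.5, 1.0
X = rng.uniform(0.0, 1.0, n)
f = lambda x: np.minimum(x, 1.0 - x)
Y = f(X) + rng.normal(0.0, sigma, n)
alpha_star = (sigma**2 / (L**2 * n)) ** (1 / 3)   # the bandwidth chosen in Equation (11) below
print(theta_hat(X, Y, alpha_star), f(0.5))        # estimate vs. true theta = f(1/2) = 0.5
```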
Having defined the estimator θ̂(S), we now wish to compute the risk of this estimator. For this, we will define a 'Good Event': G ≜ {N(S) ≥ nα}.

Remark 3.3 (Rationale for defining G the way we have). If we observe points sampled uniformly between 0 and 1, the probability of each of them falling in the interval (1/2 − α, 1/2 + α), which has width 2α, is 2α. Thus, N(S) ∼ Binomial(n, 2α). Since E[N(S)] = 2nα, the 'Good Event' is saying that 'at least half of the expected number of points fall near x = 1/2'.
Given the way we have defined G, we immediately obtain the following inequality:

    P(Gᶜ) = P( Σ_{i=1}^n I{Xi ∈ (1/2 − α, 1/2 + α)} ≤ nα ) ≤ exp(−2nα²)        (by Hoeffding's inequality)
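As a quick aside (ours, not in the notes), this tail bound can be checked against the exact Binomial(n, 2α) CDF:

```python
# Check that P(N(S) <= n*alpha) <= exp(-2 n alpha^2) for N(S) ~ Binomial(n, 2*alpha).
import numpy as np
from scipy.stats import binom

n, alpha = 500, 0.05
tail = binom.cdf(np.floor(n * alpha), n, 2 * alpha)   # exact left-tail probability
bound = np.exp(-2 * n * alpha**2)                     # Hoeffding bound
assert tail <= bound
print(f"exact tail = {tail:.2e} <= Hoeffding bound = {bound:.2e}")
```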
Thus, by the tower property of the expectation operator, we can write the following:

    E[(θ̂(S) − θ)²] = E[(θ̂(S) − θ)² | G] · P(G) + E[(θ̂(S) − θ)² | Gᶜ] · P(Gᶜ),        (4)

where P(G) ≤ 1, E[(θ̂(S) − θ)² | Gᶜ] ≤ θ² ≤ 1, and P(Gᶜ) ≤ exp(−2nα²). To upper bound E[(θ̂(S) − θ)² | G], we will expand (θ̂(S) − θ)² as follows, writing Ai = I{Xi ∈ (1/2 − α, 1/2 + α)}:

    (θ̂(S) − θ)² = ( (1/N(S)) Σ_{i=1}^n Ai Yi − θ )² = ( (1/N(S)) Σ_{i=1}^n Ai (Yi − θ) )² = (v + b)²,

where

    v ≜ (1/N(S)) Σ_{i=1}^n Ai (Yi − f(Xi))   and   b ≜ (1/N(S)) Σ_{i=1}^n Ai (f(Xi) − θ).

Here, we can think of the quantities v and b as the variance and bias terms, respectively. Therefore, we can write the following:

    E[(θ̂(S) − θ)² | G] = E[(v + b)² | G] = E[b² | G] + E[v² | G] + 2 E[bv | G]        (5)
Calculating and bounding each of the conditional expectations separately yields the following. For the variance term,

    E[v² | G] = E[ E[ ( (1/N(S)) Σ_{i=1}^n Ai (Yi − f(Xi)) )² | G, X1, . . . , Xn ] | G ]
              = E[ (1/N(S)²) E[ Σ_{i=1}^n Ai (Yi − f(Xi))² | G, X1, . . . , Xn ] | G ]      (the cross terms vanish since, given X1, . . . , Xn, the Yi − f(Xi) are independent with zero mean)
              ≤ E[ (1/N(S)²) · N(S) σ² | G ]                                               (∵ Yi − f(Xi) is σ-sub-Gaussian, so E[(Yi − f(Xi))² | Xi] ≤ σ²)
              = E[ (1/N(S)) · σ² | G ]
              ≤ σ²/(nα)                                                                    (∵ N(S) ≥ nα under the good event G)        (6)

For the bias term, since f is L-Lipschitz, |f(Xi) − θ| = |f(Xi) − f(1/2)| ≤ Lα whenever Ai = 1, and hence |b| ≤ Lα, so

    E[b² | G] ≤ L²α²        (7)

Similarly, for the cross term we obtain

    E[bv | G] = E[ b · E[ (1/N(S)) Σ_{i=1}^n Ai (Yi − f(Xi)) | G, X1, . . . , Xn ] | G ] = 0        (∵ given Xi = xi, the Yi − f(Xi) have zero mean and are independent)        (8)

Plugging Equations (6)–(8) into Equation (5) gives

    E[(θ̂(S) − θ)² | G] ≤ σ²/(nα) + L²α²        (9)
Combining Equations (4) – (9) together, we get

    E[(θ̂(S) − θ)²] ≤ σ²/(nα) + L²α² + e^{−2nα²}        (10)

For a tighter upper bound, optimizing the R.H.S. above (up to constants, by balancing the first two terms) yields

    α* = σ^{2/3} / (L^{2/3} n^{1/3})        (11)

    ⇒ E[(θ̂(S) − θ)²] ≤ 2 σ^{4/3} L^{2/3} / n^{2/3} + exp( −2 σ^{4/3} n^{1/3} / L^{4/3} )        (12)
Remark 3.4. Note that the first term on the R.H.S. of Equation (12) above matches, up to constant factors, the lower bound we obtained earlier. But this requires that we choose α* as per Equation (11), which requires knowledge of both σ and L, which may or may not be available.
However, if one wanted to choose α without knowledge of σ and L, say α = 1/n^{1/3} instead of α = α*, then we get the following looser upper bound from Equation (10), which is tight with respect to n but not with respect to σ or L:

    E[(θ̂(S) − θ)²] ≤ e^{−2n^{1/3}} + (1/n^{2/3})(L² + σ²)        (13)
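To close, the following Monte Carlo sketch (ours, not the notes'; the test function and constants are illustrative) compares the risks obtained with the two bandwidth choices discussed above against the bound in Equation (12).

```python
# Compare the windowed-average estimator's risk under alpha* (Equation (11))
# and the (sigma, L)-agnostic choice alpha = n^{-1/3}, against the bound in Equation (12).
import numpy as np

rng = np.random.default_rng(4)
n, sigma, L, trials = 2000, 0.5, 1.0, 2000
f = lambda x: np.minimum(x, 1.0 - x)            # an L-Lipschitz test function with f(1/2) = 0.5

def risk(alpha):
    errs = []
    for _ in range(trials):
        X = rng.uniform(0.0, 1.0, n)
        Y = f(X) + rng.normal(0.0, sigma, n)
        in_win = np.abs(X - 0.5) < alpha
        est = Y[in_win].mean() if in_win.any() else 0.0
        errs.append((est - 0.5) ** 2)
    return np.mean(errs)

alpha_star = (sigma**2 / (L**2 * n)) ** (1 / 3)
print("risk with alpha*        :", risk(alpha_star))
print("risk with alpha = n^-1/3:", risk(n ** (-1 / 3)))
print("bound (12), first term  :", 2 * sigma**(4/3) * L**(2/3) / n**(2/3))
```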