CS861: Theoretical Foundations of Machine Learning Lecture 9 – 25/09/2023 University of Wisconsin–Madison, Fall 2023
Lecture 09: Hypothesis testing and Le Cam’s method
Lecturer: Kirthevasan Kandasamy Scribed by: Haoyue Bai, Ying Fu
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the instructor.
In this lecture, we begin by recapping point estimation and minimax optimality from the previous class. Then, we introduce the concept of hypothesis testing, along with a theorem reducing estimation to testing. Finally, we introduce Le Cam's method, a fundamental tool for establishing minimax lower bounds.
1 From Estimation to Testing
A standard first step in proving minimax bounds is to “reduce” the estimation problem to a testing problem. We then show that the estimation risk can be lower bounded by the probability of error in the testing problem, for which we can develop tools. We first define the hypothesis testing problems we will use.
Definition 1 (Hypothesis Test). Let Q be a class of distributions, and let Q1, Q2, …, QN be a partition of Q. Let S be a dataset drawn from some P ∈ Q. A (multiple) hypothesis test Ψ is a function of the data which maps to {1, …, N} ≜ [N]. If Ψ(S) = j, the test has decided that P ∈ Qj.
• In this class, Q = {P1, …, PN} and Qj = {Pj}.
• PS∼Pj(Ψ(S) ≠ j) is the probability of error when S ∼ Pj.
With this setup, we obtain the classical reduction from estimation to testing.
Theorem 1 (Reduction from Estimation to Testing). Let {P1, …, PN} ⊆ P, and let δ = min_{j≠k} ρ(θ(Pj), θ(Pk)). Then,
\[
R_n^* = \inf_{\hat\theta}\, \sup_{P \in \mathcal{P}} \mathbb{E}_S\Big[\Phi \circ \rho\big(\theta(P), \hat\theta(S)\big)\Big]
\;\geq\; \Phi\Big(\frac{\delta}{2}\Big)\, \inf_{\Psi}\, \max_{j \in [N]} P_{S \sim P_j}\big(\Psi(S) \neq j\big).
\]
Proof. For brevity, write θj = θ(Pj), Pj(·) = PS∼Pj(·), and Ej[·] = ES∼Pj[·]. Since a maximum over a finite subset lower bounds the supremum over P,
\[
R_n^* = \inf_{\hat\theta}\, \sup_{P \in \mathcal{P}} \mathbb{E}_S\big[\Phi \circ \rho(\theta(P), \hat\theta)\big]
\;\geq\; \inf_{\hat\theta}\, \max_{j \in [N]} \mathbb{E}_j\big[\Phi \circ \rho(\theta_j, \hat\theta)\big].
\]
By Markov's inequality, for any $t > 0$, $\mathbb{E}_j[\Phi \circ \rho(\theta_j, \hat\theta)] \geq t\, P_j\big(\Phi \circ \rho(\theta_j, \hat\theta) \geq t\big)$. Setting t = Φ(δ/2),
\[
R_n^* \;\geq\; \inf_{\hat\theta}\, \max_{j \in [N]} \Phi\Big(\frac{\delta}{2}\Big)\, P_j\Big(\Phi \circ \rho(\theta_j, \hat\theta) \geq \Phi\Big(\frac{\delta}{2}\Big)\Big)
\;\geq\; \Phi\Big(\frac{\delta}{2}\Big)\, \inf_{\hat\theta}\, \max_{j \in [N]} P_j\Big(\rho(\theta_j, \hat\theta) \geq \frac{\delta}{2}\Big),
\]
where the last step holds since Φ(·) is a nondecreasing function, so ρ(θj, θ̂) ≥ δ/2 implies Φ ∘ ρ(θj, θ̂) ≥ Φ(δ/2).
[Figure 1: three points θ(P1), θ(P2), θ(P3), each at the center of a ball of radius δ/2.]
Figure 1: Illustrative figure for Theorem 1. The radius of each circle is δ/2. If N is too large (δ is small), Φ(δ/2) will be small. But if N is small (δ is large), PS∼Pj(Ψ(S) ≠ j) will be small, as it will be easier to distinguish between the alternatives.
To bound the testing term, given an estimator θ̂, we define the following test:
\[
\Psi_{\hat\theta}(S) = \arg\min_{j \in [N]} \rho\big(\hat\theta(S), \theta_j\big).
\]
Suppose the data was generated by Pj, but Ψ_θ̂(S) = k ≠ j. Then,
\[
\begin{aligned}
\delta &\leq \rho(\theta_j, \theta_k) && \text{(by definition of } \delta\text{)} \\
&\leq \rho(\theta_j, \hat\theta) + \rho(\hat\theta, \theta_k) && \text{(by the triangle inequality)} \\
&\leq \rho(\theta_j, \hat\theta) + \rho(\hat\theta, \theta_j) && \text{(since } \Psi_{\hat\theta}(S) = k \text{ means } \rho(\hat\theta, \theta_k) \leq \rho(\hat\theta, \theta_j)\text{)} \\
&= 2\, \rho(\theta_j, \hat\theta).
\end{aligned}
\]
Therefore, Ψ_θ̂ ≠ j implies ρ(θj, θ̂) ≥ δ/2, so Pj(ρ(θj, θ̂) ≥ δ/2) ≥ Pj(Ψ_θ̂ ≠ j), and we have
\[
R_n^* \;\geq\; \Phi\Big(\frac{\delta}{2}\Big)\, \inf_{\hat\theta}\, \max_{j \in [N]} P_j\big(\Psi_{\hat\theta} \neq j\big)
\;\geq\; \Phi\Big(\frac{\delta}{2}\Big)\, \inf_{\Psi}\, \max_{j \in [N]} P_j\big(\Psi \neq j\big).
\]
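To make the reduction concrete, here is a minimal numerical sketch (not from the lecture; the Gaussian location family, sample size, and function names are illustrative assumptions). Given any estimator θ̂, the induced test Ψ_θ̂ snaps the estimate to the nearest θj, so the event {Ψ_θ̂ ≠ j} is contained in {ρ(θj, θ̂) ≥ δ/2}, exactly as in the proof.

```python
import numpy as np

# Illustrative setup (an assumption, not from the lecture): N Gaussian
# location models P_j = N(theta_j, 1), well-separated means, metric rho = |.|.
rng = np.random.default_rng(0)
thetas = np.array([0.0, 1.0, 2.0])   # theta_j = theta(P_j)
offdiag = ~np.eye(len(thetas), dtype=bool)
delta = np.abs(np.subtract.outer(thetas, thetas))[offdiag].min()

def theta_hat(S):
    """A simple estimator: the sample mean."""
    return S.mean()

def induced_test(S):
    """Psi_{theta_hat}(S) = argmin_j rho(theta_hat(S), theta_j)."""
    return int(np.argmin(np.abs(theta_hat(S) - thetas)))

n, trials = 5, 2000
for j, th in enumerate(thetas):
    S = rng.normal(th, 1.0, size=(trials, n))
    est_far = np.mean([abs(theta_hat(s) - th) >= delta / 2 for s in S])
    test_err = np.mean([induced_test(s) != j for s in S])
    # {Psi != j} is contained in {rho(theta_j, theta_hat) >= delta/2}.
    print(f"j={j}: P(rho >= delta/2) ~ {est_far:.3f} >= P(Psi != j) ~ {test_err:.3f}")
```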
2 Distances/divergences between distributions
Consider two probability distributions, P and Q, and let p(x) and q(x) be their probability density functions. We will use the following distances and divergences between probability distributions; they are key ingredients in formulating lower bounds on the performance of inference procedures.
1. KL divergence:
\[
\mathrm{KL}(P, Q) = \int \log\Big(\frac{dP(x)}{dQ(x)}\Big)\, dP(x) = \int p(x) \log\frac{p(x)}{q(x)}\, dx.
\]
2. Total Variation (TV) distance:
\[
\mathrm{TV}(P, Q) = \sup_{A} |P(A) - Q(A)|.
\]
3. $L_1$ distance:
\[
\|P - Q\|_1 = \int |p(x) - q(x)|\, dx.
\]
4. Hellinger distance $H(P, Q)$:
\[
H^2(P, Q) = \int \Big(\sqrt{p(x)} - \sqrt{q(x)}\Big)^2 dx = 2 - 2\int \sqrt{p(x) q(x)}\, dx.
\]
Also, define the affinity:
\[
\|P \wedge Q\| = \int \min\big(p(x), q(x)\big)\, dx = \int \big(p(x) \wedge q(x)\big)\, dx.
\]
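As a quick illustration (not part of the lecture notes; the two discrete distributions below are arbitrary assumptions), these quantities are straightforward to compute when P and Q have probability mass functions p and q:

```python
import numpy as np

# Two arbitrary distributions on {0, 1, 2, 3} (an illustrative assumption).
p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.3, 0.4])

kl = np.sum(p * np.log(p / q))                    # KL(P, Q)
tv = 0.5 * np.sum(np.abs(p - q))                  # TV(P, Q), using sup_A |P(A)-Q(A)| = (1/2)||P - Q||_1
l1 = np.sum(np.abs(p - q))                        # ||P - Q||_1
hell_sq = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)  # H^2(P, Q)
affinity = np.sum(np.minimum(p, q))               # ||P ∧ Q||

print(kl, tv, l1, hell_sq, affinity)
```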
2.1 Inequalities between divergences and product distributions
Here we present a few inequalities, and their consequences when applied to product distributions, which will be quite useful for proving our lower bounds. These inequalities relate the three divergences above: total variation distance, Kullback-Leibler divergence, and Hellinger distance.
1. KL divergence and Hellinger distance are both easier to manipulate on product distributions than total variation. For the product distributions $P^n = P \times \cdots \times P$ and $Q^n = Q \times \cdots \times Q$,
\[
\mathrm{KL}(P^n, Q^n) = n\, \mathrm{KL}(P, Q), \qquad
H^2(P^n, Q^n) = 2 - 2\Big(1 - \tfrac{1}{2} H^2(P, Q)\Big)^n.
\]
2. $\mathrm{TV}(P, Q) = \tfrac{1}{2}\|P - Q\|_1 = 1 - \|P \wedge Q\|$.
3. $H^2(P, Q) \leq \|P - Q\|_1 = 2\, \mathrm{TV}(P, Q)$.
4. Pinsker's inequality: $\mathrm{TV}(P, Q) \leq \sqrt{\tfrac{1}{2}\, \mathrm{KL}(P, Q)}$.
5. $\|P \wedge Q\| \geq \tfrac{1}{2} \exp\big(-\mathrm{KL}(P, Q)\big)$.
We will prove statement 5 below. You will prove the remaining statements in your homework.
Proof of statement 5:
\[
\begin{aligned}
2\, \|P \wedge Q\| &= 2\int \min(p(x), q(x))\, dx \\
&\geq 2\int \min(p(x), q(x))\, dx - \Big(\int \min(p(x), q(x))\, dx\Big)^2 \\
&= \int \min(p(x), q(x))\, dx \,\Big(2 - \int \min(p(x), q(x))\, dx\Big) \\
&= \int \min(p(x), q(x))\, dx \int \max(p(x), q(x))\, dx \\
&\geq \Big(\int \sqrt{\min(p(x), q(x))\, \max(p(x), q(x))}\, dx\Big)^2 && \text{(by the Cauchy–Schwarz inequality)} \\
&= \Big(\int \sqrt{p(x) q(x)}\, dx\Big)^2 \\
&= \exp\Big(2 \log \int \sqrt{p(x) q(x)}\, dx\Big) \\
&= \exp\Big(2 \log \int p(x) \sqrt{\tfrac{q(x)}{p(x)}}\, dx\Big) \\
&\geq \exp\Big(2 \int p(x) \log \sqrt{\tfrac{q(x)}{p(x)}}\, dx\Big) && \text{(by Jensen's inequality: } \log \mathbb{E}[Z] \geq \mathbb{E}[\log Z]\text{, with } Z = \sqrt{q(X)/p(X)},\ X \sim P\text{)} \\
&= \exp\Big(-\int p(x) \log \tfrac{p(x)}{q(x)}\, dx\Big) \\
&= \exp\big(-\mathrm{KL}(P, Q)\big),
\end{aligned}
\]
where the step from the third to the fourth line follows since
\[
\int \min(p(x), q(x))\, dx + \int \max(p(x), q(x))\, dx = \int p(x)\, dx + \int q(x)\, dx = 2.
\]
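The following short numerical check (an illustration, not from the lecture; the Bernoulli example and the value of n are assumptions) verifies statements 1–5 for a pair of Bernoulli distributions and their n-fold products:

```python
import numpy as np

# Illustrative Bernoulli(a) vs Bernoulli(b) example (an assumption).
a, b, n = 0.3, 0.5, 4
p = np.array([1 - a, a])
q = np.array([1 - b, b])

def kl(p, q):  return float(np.sum(p * np.log(p / q)))
def tv(p, q):  return 0.5 * float(np.sum(np.abs(p - q)))
def hell2(p, q): return float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
def aff(p, q): return float(np.sum(np.minimum(p, q)))

# Build the product distributions P^n, Q^n explicitly over {0,1}^n.
pn, qn = p.copy(), q.copy()
for _ in range(n - 1):
    pn = np.outer(pn, p).ravel()
    qn = np.outer(qn, q).ravel()

print(np.isclose(kl(pn, qn), n * kl(p, q)))                          # statement 1 (KL)
print(np.isclose(hell2(pn, qn), 2 - 2 * (1 - hell2(p, q) / 2) ** n)) # statement 1 (Hellinger)
print(np.isclose(tv(p, q), 1 - aff(p, q)))                           # statement 2
print(hell2(p, q) <= 2 * tv(p, q))                                   # statement 3
print(tv(p, q) <= np.sqrt(0.5 * kl(p, q)))                           # statement 4 (Pinsker)
print(aff(p, q) >= 0.5 * np.exp(-kl(p, q)))                          # statement 5
```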
3 Le Cam’s Method
Consider this scenario: nature chooses one of (say) k + 1 possible worlds, indexed by probability distributions P0, P1, …, Pk, and conditional on nature's choice of the world, i.e., the chosen distribution P* ∈ {P0, …, Pk}, we observe data S drawn from P*. Intuitively, it will be difficult to decide which distribution Pi is the true P* if all the distributions are similar, i.e., the distances/divergences between the Pi are small, and easy if the distances/divergences between the Pi are large.
The simplest case is when there are only two possible distributions, P0 and P1, and our goal is to decide which of P0 and P1 generated the data we observe. Suppose that nature chooses one of the distributions P0 or P1 uniformly at random, and let V ∈ {0, 1} index the choice. Conditional on V = v, we observe samples S drawn from Pv. Then, for any test Ψ : S → {0, 1}, the probability of error is
\[
P\big(\Psi(S) \neq V\big) = \tfrac{1}{2} P_0(\Psi \neq 0) + \tfrac{1}{2} P_1(\Psi \neq 1).
\]
Now, we introduce the Neyman-Pearson test, and then we will show that it minimizes the sum of errors.
3.1 Neyman-Pearson Test
Given a binary hypothesis test between two alternatives P0 and P1 with densities p0 and p1, let S denote an i.i.d. dataset. Then, the Neyman-Pearson test is of the form
\[
\Psi_{\mathrm{NP}}(S) = \begin{cases} 0 & \text{if } p_0(S) \geq p_1(S), \\ 1 & \text{if } p_0(S) < p_1(S). \end{cases}
\]
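As a minimal sketch (not from the lecture; the Gaussian alternatives, sample size, and names such as psi_np are assumptions), the NP test simply compares the two likelihoods of the observed dataset:

```python
import numpy as np
from scipy.stats import norm

# Illustrative alternatives (an assumption): P0 = N(0,1)^n, P1 = N(1,1)^n.
mu0, mu1, n = 0.0, 1.0, 10

def psi_np(S):
    """Neyman-Pearson test: report 0 iff the P0 log-likelihood is at least as large."""
    loglik0 = norm.logpdf(S, loc=mu0).sum()
    loglik1 = norm.logpdf(S, loc=mu1).sum()
    return 0 if loglik0 >= loglik1 else 1

rng = np.random.default_rng(0)
S = rng.normal(mu1, 1.0, size=n)   # data drawn from P1
print(psi_np(S))                   # typically 1
```

Working with log-likelihoods rather than the raw densities avoids numerical underflow for large n; since log is increasing, the decision rule is unchanged.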
Lemma 1. The Neyman-Pearson test minimizes the sum of errors. That is, for any test Ψ,
\[
P_0(\Psi \neq 0) + P_1(\Psi \neq 1) \;\geq\; P_0(\Psi_{\mathrm{NP}} \neq 0) + P_1(\Psi_{\mathrm{NP}} \neq 1),
\]
where P0(Ψ ≠ 0) is shorthand for PS∼P0(Ψ(S) ≠ 0).
Proof.
\[
\begin{aligned}
P_0(\Psi \neq 0) + P_1(\Psi \neq 1)
&= P_0(\Psi = 1) + P_1(\Psi = 0) \\
&= \int_{\Psi = 1} p_0(x)\, dx + \int_{\Psi = 0} p_1(x)\, dx \\
&= \int_{\Psi = 1,\, \Psi_{\mathrm{NP}} = 1} p_0(x)\, dx + \int_{\Psi = 1,\, \Psi_{\mathrm{NP}} = 0} p_0(x)\, dx
 + \int_{\Psi = 0,\, \Psi_{\mathrm{NP}} = 0} p_1(x)\, dx + \int_{\Psi = 0,\, \Psi_{\mathrm{NP}} = 1} p_1(x)\, dx \\
&\geq \int_{\Psi = 1,\, \Psi_{\mathrm{NP}} = 1} p_0(x)\, dx + \int_{\Psi = 1,\, \Psi_{\mathrm{NP}} = 0} p_1(x)\, dx
 + \int_{\Psi = 0,\, \Psi_{\mathrm{NP}} = 0} p_1(x)\, dx + \int_{\Psi = 0,\, \Psi_{\mathrm{NP}} = 1} p_0(x)\, dx \\
&= \int_{\Psi_{\mathrm{NP}} = 1} p_0(x)\, dx + \int_{\Psi_{\mathrm{NP}} = 0} p_1(x)\, dx \\
&= P_0(\Psi_{\mathrm{NP}} = 1) + P_1(\Psi_{\mathrm{NP}} = 0) \\
&= P_0(\Psi_{\mathrm{NP}} \neq 0) + P_1(\Psi_{\mathrm{NP}} \neq 1),
\end{aligned}
\]
where the inequality follows from the definition of the NP test: on the event {Ψ_NP = 0} we have p0(x) ≥ p1(x), and on {Ψ_NP = 1} we have p0(x) < p1(x).
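A quick Monte Carlo illustration of Lemma 1 (not from the lecture; the Gaussian setup and the competitor test are assumptions): the NP test's sum of errors should be no larger than that of any other test, for example a test with a mis-placed threshold.

```python
import numpy as np

# Compare Psi_NP against a deliberately shifted threshold test on
# P0 = N(0,1)^n vs P1 = N(1,1)^n (an illustrative assumption).
rng = np.random.default_rng(0)
mu0, mu1, n, trials = 0.0, 1.0, 5, 20000

def psi_np(S):
    # For equal-variance Gaussians, the NP test thresholds the sample mean at (mu0+mu1)/2.
    return int(S.mean() > (mu0 + mu1) / 2)

def psi_other(S):
    # A suboptimal competitor: threshold the sample mean at 0.9 instead.
    return int(S.mean() > 0.9)

def error_sum(test):
    S0 = rng.normal(mu0, 1.0, size=(trials, n))
    S1 = rng.normal(mu1, 1.0, size=(trials, n))
    return np.mean([test(s) != 0 for s in S0]) + np.mean([test(s) != 1 for s in S1])

print("NP test   :", error_sum(psi_np))
print("other test:", error_sum(psi_other))  # should be at least as large
```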
Next, we show the connection between hypothesis testing and total variation distance and later use this to yield lower bounds on minimax error by Le Cam’s Method.
Corollary 1. For any hypothesis test Ψ, we have,
\[
P_0(\Psi \neq 0) + P_1(\Psi \neq 1) \;\geq\; \|P_0 \wedge P_1\| \;=\; 1 - \mathrm{TV}(P_0, P_1) \;\geq\; \tfrac{1}{2} \exp\big(-\mathrm{KL}(P_0, P_1)\big).
\]
From this corollary, we can see that the smaller the KL divergence or TV distance between P0 and P1 (i.e., the more similar P0 and P1 are), the larger the lower bound on the testing error. This matches the intuition from the scenario introduced above.
Proof. By Lemma 1, for any test Ψ,
\[
\begin{aligned}
P_0(\Psi \neq 0) + P_1(\Psi \neq 1) &\geq P_0(\Psi_{\mathrm{NP}} \neq 0) + P_1(\Psi_{\mathrm{NP}} \neq 1) \\
&= \int_{\Psi_{\mathrm{NP}} = 1} p_0(x)\, dx + \int_{\Psi_{\mathrm{NP}} = 0} p_1(x)\, dx \\
&= \int \min\big(p_0(x), p_1(x)\big)\, dx && \text{(by definition of the NP test)} \\
&= \|P_0 \wedge P_1\|.
\end{aligned}
\]
The remaining inequalities in the corollary follow from the relations in Section 2 (statements 2 and 5).
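A small numerical sanity check of the first part of Corollary 1 (an illustration, not from the lecture; the Bernoulli alternatives are assumptions): for a single observation, the NP test's sum of errors equals the affinity, which equals 1 − TV and is at least (1/2)exp(−KL).

```python
import numpy as np

# Two Bernoulli alternatives (an illustrative assumption), single observation S in {0, 1}.
p0 = np.array([0.7, 0.3])   # P0
p1 = np.array([0.4, 0.6])   # P1

# NP test on each outcome x: decide 0 iff p0(x) >= p1(x).
psi_np = (p0 < p1).astype(int)

err_sum = p0[psi_np != 0].sum() + p1[psi_np != 1].sum()  # P0(NP != 0) + P1(NP != 1)
affinity = np.minimum(p0, p1).sum()
tv = 0.5 * np.abs(p0 - p1).sum()
kl = np.sum(p0 * np.log(p0 / p1))

print(np.isclose(err_sum, affinity))   # = ||P0 ∧ P1||
print(np.isclose(affinity, 1 - tv))    # = 1 - TV(P0, P1)
print(affinity >= 0.5 * np.exp(-kl))   # >= (1/2) exp(-KL(P0, P1))
```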
Putting everything together, we now show that Le Cam's method yields lower bounds for minimax estimation.
Theorem 2 (Le Cam's Method). Let P0, P1 ∈ P, let δ = ρ(θ(P0), θ(P1)), and let S be an i.i.d. dataset of n points. Then,
\[
R_n^* \;\geq\; \frac{1}{2}\, \Phi\Big(\frac{\delta}{2}\Big)\, \|P_0^n \wedge P_1^n\|.
\]
Proof.
\[
\begin{aligned}
R_n^* &\geq \Phi\Big(\frac{\delta}{2}\Big)\, \inf_{\Psi}\, \max_{j \in \{0,1\}} P_j^n(\Psi \neq j) && \text{(by Theorem 1)} \\
&\geq \Phi\Big(\frac{\delta}{2}\Big)\, \inf_{\Psi}\, \frac{1}{2}\Big(P_0^n(\Psi \neq 0) + P_1^n(\Psi \neq 1)\Big) && \text{(as the max is at least the average)} \\
&\geq \frac{1}{2}\, \Phi\Big(\frac{\delta}{2}\Big)\, \|P_0^n \wedge P_1^n\| && \text{(by Corollary 1)}.
\end{aligned}
\]
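As a sketch of how this bound is typically applied (this instantiation is not from the lecture; the Gaussian model and squared loss are assumptions): take P0 = N(0, σ²), P1 = N(δ, σ²), and Φ(t) = t², so that ρ(θ(P0), θ(P1)) = δ. Statement 5 and KL(P0^n, P1^n) = nδ²/(2σ²) give R*_n ≥ (1/2)(δ/2)²·(1/2)exp(−nδ²/(2σ²)); choosing δ of order σ/√n then yields a lower bound of order σ²/n. The snippet below evaluates this bound over a grid of δ.

```python
import numpy as np

# Illustrative instantiation (an assumption): P0 = N(0, sigma^2), P1 = N(delta, sigma^2),
# squared loss Phi(t) = t^2, so the separation in Theorem 2 is delta.
sigma, n = 1.0, 100
deltas = np.linspace(1e-3, 1.0, 2000)

# KL(P0^n, P1^n) = n * delta^2 / (2 * sigma^2); statement 5 then gives
# ||P0^n ∧ P1^n|| >= 0.5 * exp(-KL), so Le Cam's bound is at least:
kl_n = n * deltas ** 2 / (2 * sigma ** 2)
lower_bound = 0.5 * (deltas / 2) ** 2 * 0.5 * np.exp(-kl_n)

best = np.argmax(lower_bound)
print(f"best delta ~ {deltas[best]:.3f}   (compare sigma/sqrt(n) = {sigma / np.sqrt(n):.3f})")
print(f"lower bound ~ {lower_bound[best]:.2e}   (compare sigma^2/n = {sigma ** 2 / n:.2e})")
```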