
CS861: Theoretical Foundations of Machine Learning Lecture 12 – 10/02/2023 University of Wisconsin–Madison, Fall 2023
Lecture 12: Fano’s method
Lecturer: Kirthevasan Kandasamy Scribed by: Yuya Shimizu, Keran Chen
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the instructor.
In this lecture, we will prove Fano's inequality and apply it to derive Fano's method. In the previous lecture, we saw that Le Cam's method, which relies on a binary (Neyman-Pearson) test between two hypotheses, is best suited to single-parameter problems when deriving lower bounds on the minimax risk. Fano's method instead considers multiple hypotheses, so it can cover more general situations. We will also introduce a method for constructing the alternatives used in Fano's method via packing numbers.
1 Fano’s inequality
Before presenting Fano's method, we will prove Fano's inequality using properties from information theory.

Theorem 1. (Fano's inequality) Let X be a discrete random variable with a finite support X. Let X, Y, X̂ form a Markov chain X → Y → X̂. Denote P_e = P(X̂ ≠ X) and

h(P_e) = −P_e log(P_e) − (1 − P_e) log(1 − P_e).

Then,

H(X | Y) ≤ H(X | X̂) ≤ P_e log(|X|) + h(P_e). (1)

Hence,

P(X ≠ X̂) ≥ (H(X | Y) − log(2)) / log(|X|).
(Interpretation) We can interpret P_e as a probability of error, and X defines a prior on the alternatives P_1, ..., P_{|X|}. We want to guess X (via X̂) after observing Y. If Y uniquely identifies X, we have H(X | Y) = 0, i.e., no uncertainty about X remains after observing Y. Fano's inequality lower bounds P(X ≠ X̂) in terms of H(X | Y). We do not require any restriction on the support of Y. It is instructive to assume that X̂ has the same support as X, although this is strictly not necessary.
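As a quick numerical sanity check of the inequality, the following Python sketch builds a small joint distribution and compares P(X ≠ X̂) with (H(X | Y) − log 2)/log |X|. The symmetric channel, the predictor X̂ = Y, and the parameter values are illustrative choices, not part of the statement above.

```python
import numpy as np

K = 8          # |X|, size of the support of X (illustrative)
p = 0.3        # probability that the channel corrupts X (illustrative)

# Joint distribution of (X, Y): X uniform on {0,...,K-1}; Y = X w.p. 1-p,
# otherwise Y is uniform over the remaining K-1 symbols.
P_xy = np.full((K, K), (1.0 / K) * p / (K - 1))
np.fill_diagonal(P_xy, (1.0 / K) * (1 - p))

# Conditional entropy H(X | Y) = -sum_{x,y} P(x,y) log P(x | y)  (natural log).
P_y = P_xy.sum(axis=0)                 # marginal of Y
P_x_given_y = P_xy / P_y               # column j holds P(. | Y = j)
H_X_given_Y = -np.sum(P_xy * np.log(P_x_given_y))

# Use the predictor Xhat = Y, so the error probability is P(X != Y).
P_e = 1.0 - np.trace(P_xy)

fano_lower_bound = (H_X_given_Y - np.log(2)) / np.log(K)
print(f"P_e = {P_e:.4f}, Fano lower bound = {fano_lower_bound:.4f}")
assert P_e >= fano_lower_bound - 1e-12
```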
Proof  Let E = 1{X ≠ X̂} (thus, if E = 1, there is an error). Using the chain rule in two different ways,

H(E, X | X̂) = H(X | X̂) + H(E | X, X̂)
            = H(E | X̂) + H(X | E, X̂). (2)

Here, since E is a function of X, X̂,

H(E | X, X̂) = 0. (3)

Also, since conditioning reduces entropy,

H(E | X̂) ≤ H(E) = h(P_e), (4)

where the equality follows by the definition of h(·). By the definition of conditional entropy,

H(X | E, X̂) = P(E = 0) H(X | X̂, E = 0) + P(E = 1) H(X | X̂, E = 1) ≤ P_e log(|X|), (5)

where the inequality follows since E = 0 implies X = X̂ (so H(X | X̂, E = 0) = 0) and since the entropy of a discrete random variable with support X is at most log(|X|). Thus, (2)-(5) together imply that H(X | X̂) ≤ h(P_e) + P_e log(|X|). We have proved the second inequality in the theorem.

Next, we will show the first inequality in the theorem. By the data processing inequality, I(X, X̂) ≤ I(X, Y). Since we also have I(X, X̂) = H(X) − H(X | X̂) and I(X, Y) = H(X) − H(X | Y), the next inequality is implied:

H(X | Y) ≤ H(X | X̂).

Since h(P_e) is the entropy of a binary random variable, h(P_e) ≤ log(2). This and (1) together imply the third inequality in the theorem:

P_e ≥ (H(X | Y) − log(2)) / log(|X|).

2 Fano's method
Given Fano's inequality, we will derive two different types of lower bounds on the minimax risk. The first bound is called the global Fano method since it involves the Kullback-Leibler divergences between each distribution and the mixture of all the distributions in the set of alternatives. The second bound is called the local Fano method since it involves only the pairwise KL divergences between the alternatives.
Theorem 2. Let S be drawn (not necessarily i.i.d.) from some joint distribution P ∈ P. Let {P_1, ..., P_N} ⊆ P, and denote P̄ = (1/N) Σ_{j=1}^N P_j (an equally weighted mixture distribution). Denote the minimax risk

R* = inf_{θ̂} sup_{P ∈ P} E[ Φ ∘ ρ(θ(P), θ̂(S)) ].

Let δ = min_{j≠k} ρ(θ(P_j), θ(P_k)). Then, the following statements are true:

(I) (Global Fano's method)

R* ≥ Φ(δ/2) ( 1 − ( (1/N) Σ_{j=1}^N KL(P_j, P̄) + log(2) ) / log(N) ).

(II) (Local Fano's method)

R* ≥ Φ(δ/2) ( 1 − ( (1/N²) Σ_{1≤j,k≤N} KL(P_j, P_k) + log(2) ) / log(N) ).

Remark  The global Fano method is stronger (tighter), but the local Fano method is easier to apply since KL(P_j, P_k) is easier to compute than KL(P_j, P̄).

Proof  Define a uniform prior on {P_1, ..., P_N} as follows:

P(V = j) = 1/N, for j = 1, ..., N.

Given V = j, let S be sampled from P_j. Then, the joint distribution of (V, S) is

P_{V,S}(S ∈ A, V = j) = P(S ∈ A | V = j) P(V = j) = (1/N) P_j(A).

By our "basic" theorem of reduction from estimation to testing,

R* ≥ Φ(δ/2) inf_ψ max_{j ∈ [N]} P_j(ψ(S) ≠ j)
   ≥ Φ(δ/2) inf_ψ P_{V,S}(ψ(S) ≠ V), (6)

where the second inequality follows since the maximum value is greater than or equal to the average value. By Fano's inequality (on the Markov chain V → S → ψ(S)),

P_{V,S}(ψ(S) ≠ V) ≥ (H(V | S) − log(2)) / log(N)
                  = (H(V) − I(V, S) − log(2)) / log(N)
                  = 1 − (I(V, S) + log(2)) / log(N), (7)

where the first equality follows from I(V, S) = H(V) − H(V | S) and the last equality follows from H(V) = log(N) since V follows a uniform distribution. Next,

I(V, S) = I(S, V)
        = E_{S,V}[ log( P(S, V) / (P(S) · P(V)) ) ]
        = Σ_{j=1}^N P(V = j) ∫ P_j(s) log( (P_j(s) · P(V = j)) / (P̄(s) · P(V = j)) ) ds
        = (1/N) Σ_{j=1}^N ∫ P_j(s) log( P_j(s) / P̄(s) ) ds
        = (1/N) Σ_{j=1}^N KL(P_j, P̄), (8)

where the first equality follows from the symmetry of mutual information, the second equality follows from the definition of mutual information, the third equality follows from the law of iterated expectations, the fourth equality follows from P(V = j) = 1/N, and the last equality follows from the definition of the Kullback-Leibler divergence. (6)-(8) together imply the global Fano method.

Moreover, since KL(Q, P) = E_Q[ log( Q(x)/P(x) ) ] is convex in its second argument, Jensen's inequality implies that

KL(Q, (1/N) Σ_{k=1}^N P_k) ≤ (1/N) Σ_{k=1}^N KL(Q, P_k),

and hence we can further bound (8) as

(1/N) Σ_{j=1}^N KL(P_j, P̄) ≤ (1/N²) Σ_{1≤j,k≤N} KL(P_j, P_k). (9)

(6), (7), and (9) together imply the local Fano method.
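Inequality (9) is easy to check numerically. The sketch below draws a few random categorical distributions and compares the average KL divergence to the mixture P̄ (the global Fano quantity) with the pairwise average (the local Fano quantity); the Dirichlet-sampled family and the parameter values are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    """KL(p, q) for discrete distributions given as probability vectors."""
    return np.sum(p * np.log(p / q))

N, m = 5, 4                              # number of distributions, alphabet size (illustrative)
Ps = rng.dirichlet(np.ones(m), size=N)   # N random distributions on m symbols
P_bar = Ps.mean(axis=0)                  # the equally weighted mixture

global_term = np.mean([kl(P, P_bar) for P in Ps])
local_term = np.mean([kl(Pj, Pk) for Pj in Ps for Pk in Ps])

print(f"(1/N)   sum_j  KL(P_j, Pbar) = {global_term:.4f}")
print(f"(1/N^2) sum_jk KL(P_j, P_k)  = {local_term:.4f}")
assert global_term <= local_term + 1e-12   # inequality (9)
```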
Code Help, Add WeChat: cstutorcs
Remark  Similarly to Le Cam's method, Fano's method also lower bounds an average probability of error in the proof. This can be seen via the following observation:

P_{V,S}(ψ(S) ≠ V) = Σ_{j=1}^N P(ψ(S) ≠ j | V = j) P(V = j) = (1/N) Σ_{j=1}^N P_j(ψ(S) ≠ j).
We will next state the following corollary of the local Fano method which is convenient to apply.
Corollary 1. (The local Fano method for i.i.d. data) Suppose S contains n i.i.d. samples from a distribution P ∈ P, so that S ∼ P^n. Let {P_1, ..., P_N} ⊆ P with N ≥ 16, and suppose max_{j,k} KL(P_j, P_k) ≤ log(N)/(4n). Define δ = min_{j≠k} ρ(θ(P_j), θ(P_k)). Then R_n* ≥ (1/2) Φ(δ/2).
Proof  Applying the local Fano method to the product distributions {P_1^n, ..., P_N^n} yields

R_n* ≥ Φ(δ/2) ( 1 − ( (1/N²) Σ_{1≤j,k≤N} KL(P_j^n, P_k^n) + log(2) ) / log(N) ).

By the fact that KL(P_j^n, P_k^n) = n KL(P_j, P_k) and max_{j,k} KL(P_j, P_k) ≤ log(N)/(4n),

R_n* ≥ Φ(δ/2) ( 1 − ( (n/N²) Σ_{1≤j,k≤N} KL(P_j, P_k) + log(2) ) / log(N) )
     ≥ Φ(δ/2) ( 1 − ( log(N)/4 + log(2) ) / log(N) )
     = Φ(δ/2) ( 1 − 1/4 − log(2)/log(N) )
     ≥ Φ(δ/2) ( 1 − 1/4 − log(2)/log(16) )
     = (1/2) Φ(δ/2).

The last inequality is because N ≥ 16, so log(2)/log(N) ≤ log(2)/log(16) = 1/4.
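The arithmetic in this proof can be checked with a short script: whenever max_{j,k} KL(P_j, P_k) ≤ log(N)/(4n) and N ≥ 16, the bracket multiplying Φ(δ/2) is at least 1/2. The (n, N) pairs below are illustrative choices.

```python
import numpy as np

def local_fano_bracket(n, N, max_kl):
    """1 - (n * max pairwise KL + log 2) / log N, which lower bounds the
    bracket in the local Fano bound for product distributions."""
    return 1.0 - (n * max_kl + np.log(2)) / np.log(N)

for n, N in [(10, 16), (100, 64), (1000, 1024)]:
    max_kl = np.log(N) / (4 * n)                      # the condition in Corollary 1
    print(n, N, local_fano_bracket(n, N, max_kl))     # always >= 0.5
```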
Remark  For the global Fano method, P̄ = (1/N) Σ_j P_j^n is a mixture of product distributions, and KL(P_j^n, P̄) does not have a simple form as in the local Fano method. This is one reason why the local Fano method is easier to apply.
3 Constructing alternatives
When constructing alternatives for Fano's method (and other methods), we need to be careful in our construction of {P_1, ..., P_N}. In particular, we need ρ(θ(P_j), θ(P_k)) to be large but KL(P_j, P_k) to be small. If δ = min_{j≠k} ρ(θ(P_j), θ(P_k)) is too small, then the lower bound on the minimax risk will be small. Next, we will discuss two common tools that are used in the construction.

3.1 Method 1: Tight Packings
Definition 1. (Packing number)
– An ε-packing of a set χ with respect to a metric ρ is a set {x_1, ..., x_n} ⊆ χ such that ρ(x_i, x_j) ≥ ε for all i ≠ j.
– The ε-packing number M(ε, χ, ρ) is the size of the largest ε-packing of χ.
– A packing of size M(ε, χ, ρ) is called a maximal packing.
Next, we will illustrate the use of tight packings via a d-dimensional normal mean estimation problem. The following fact will be helpful.
Lemma 1. For an L2 ball of radius r in R^d, χ = {x ∈ R^d : ‖x‖_2 ≤ r}, we have

M(ε, χ, ρ) ≥ (r/ε)^d.
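To visualize Definition 1 and Lemma 1, the sketch below constructs an ε-packing of an L2 ball by greedy rejection sampling. It only produces some packing, i.e., a lower bound on the packing number, not necessarily a maximal one; the dimension, radius, and ε below are illustrative choices.

```python
import numpy as np

def greedy_packing(d, r, eps, n_trials=20000, seed=0):
    """Greedily build an eps-packing of the L2 ball of radius r in R^d."""
    rng = np.random.default_rng(seed)
    centers = []
    for _ in range(n_trials):
        x = rng.normal(size=d)
        x *= r * rng.uniform() ** (1.0 / d) / np.linalg.norm(x)   # uniform point in the ball
        if all(np.linalg.norm(x - c) >= eps for c in centers):    # keep it if eps-separated
            centers.append(x)
    return np.array(centers)

pack = greedy_packing(d=3, r=1.0, eps=0.5)
print(f"found a 0.5-packing of size {len(pack)}; "
      f"Lemma 1 guarantees a packing of size at least {(1.0 / 0.5) ** 3:.0f}")
```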
Example 3. (Normal mean estimation in R^d) Let P be the family {N(μ, Σ) : μ ∈ R^d, Σ ⪯ σ²I}. Let S = {X_1, ..., X_n} be n i.i.d. samples from P ∈ P. The minimax risk is

R_n* = inf_{θ̂} sup_{P ∈ P} E[ Φ ∘ ρ(θ(P), θ̂(S)) ].

Let Φ ∘ ρ(θ_1, θ_2) = ‖θ_1 − θ_2‖_2², i.e., ρ is the Euclidean distance and Φ(t) = t², with θ(P) = E_{X∼P}[X].
Upper bound on the minimax risk: Consider the estimator θ̂(S) = (1/n) Σ_{i=1}^n X_i. Since Var(X_{ij}) ≤ σ² for each coordinate j,

R_n* ≤ E‖θ̂(S) − θ‖_2² = Σ_{j=1}^d E[ ( (1/n) Σ_{i=1}^n X_{ij} − θ_j )² ] = Σ_{j=1}^d (1/n²) Σ_{i=1}^n Var(X_{ij}) ≤ σ²d/n.
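A quick Monte Carlo sketch of this upper bound: the empirical risk of the sample mean is close to σ²d/n. The values of d, n, σ, and μ below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 5, 50, 2.0           # illustrative problem sizes
mu = rng.normal(size=d)            # an arbitrary true mean

errs = []
for _ in range(2000):
    X = mu + sigma * rng.normal(size=(n, d))   # n i.i.d. samples from N(mu, sigma^2 I)
    theta_hat = X.mean(axis=0)                 # the sample-mean estimator
    errs.append(np.sum((theta_hat - mu) ** 2))

print(f"Monte Carlo risk = {np.mean(errs):.4f}, sigma^2 d / n = {sigma**2 * d / n:.4f}")
```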
Lower bound on the minimax risk: Let U be a maximal δ-packing of the L2 ball of radius 2δ. Consider the following subset of P: P' = {N(μ, σ²I) : μ ∈ U}. According to Lemma 1, |P'| = |U| ≥ (2δ/δ)^d = 2^d. Moreover, by the definition of a δ-packing, we have

min_{j≠k} ‖θ(P_j) − θ(P_k)‖_2 = min_{μ ≠ μ' ∈ U} ‖μ − μ'‖_2 ≥ δ.
Lemma 2. If N(μ_1, Σ) and N(μ_2, Σ) are two multivariate normal distributions with a common covariance Σ, then

KL(N(μ_1, Σ), N(μ_2, Σ)) = (1/2)(μ_1 − μ_2)ᵀ Σ⁻¹ (μ_1 − μ_2).
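Lemma 2 is easy to verify numerically: the sketch below compares the closed form with a Monte Carlo estimate of E_{P_1}[log p_1(X) − log p_2(X)] for a randomly chosen shared covariance (the normalization constants cancel because the covariances agree). All parameter values are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
mu1, mu2 = rng.normal(size=d), rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)              # a random positive definite covariance
Sigma_inv = np.linalg.inv(Sigma)

# Closed form from Lemma 2.
diff = mu1 - mu2
closed_form = 0.5 * diff @ Sigma_inv @ diff

# Monte Carlo estimate of E_{P1}[log p1(X) - log p2(X)] under X ~ N(mu1, Sigma).
X = rng.multivariate_normal(mu1, Sigma, size=200_000)
q1 = np.einsum('ij,jk,ik->i', X - mu1, Sigma_inv, X - mu1)   # (x-mu1)^T Sigma^{-1} (x-mu1)
q2 = np.einsum('ij,jk,ik->i', X - mu2, Sigma_inv, X - mu2)
mc_estimate = np.mean(-0.5 * (q1 - q2))

print(f"closed form = {closed_form:.4f}, Monte Carlo = {mc_estimate:.4f}")
```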
Let P_j, P_k be two distributions in P': P_j = N(μ_j, σ²I), P_k = N(μ_k, σ²I) with μ_j, μ_k ∈ U. Since U is contained in the L2 ball of radius 2δ, we have ‖μ_j − μ_k‖_2 ≤ 4δ, and by the packing property ‖μ_j − μ_k‖_2 ≥ δ. Then

KL(P_j, P_k) = ‖μ_j − μ_k‖_2² / (2σ²) ≤ (4δ)² / (2σ²) = 8δ²/σ².

The first equality is by Lemma 2, and the inequality is because ‖μ_j − μ_k‖_2 ≤ 4δ. To apply the local Fano method (Corollary 1), we want KL(P_j, P_k) ≤ log(|P'|)/n. Choose δ = σ√(d log(2)/(8n)), so that

KL(P_j, P_k) ≤ 8δ²/σ² = d log(2)/n = log(2^d)/n ≤ log(|P'|)/n.

Then, by the local Fano method,

R_n* ≥ (1/2) Φ(δ/2) = log(2) σ²d / (64 n).
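Putting the two bounds for this example together: the Fano lower bound log(2)σ²d/(64n) and the sample-mean upper bound σ²d/n differ only by a constant factor, so the minimax risk scales as σ²d/n. The short sketch below just evaluates both expressions for illustrative parameter values.

```python
import numpy as np

def gaussian_mean_bounds(sigma, d, n):
    """Lower and upper bounds on the minimax risk from this example."""
    delta = sigma * np.sqrt(d * np.log(2) / (8 * n))   # the packing radius chosen above
    lower = 0.5 * (delta / 2) ** 2                     # (1/2) Phi(delta/2) with Phi(t) = t^2
    upper = sigma ** 2 * d / n                         # risk of the sample mean
    return lower, upper

low, up = gaussian_mean_bounds(sigma=1.0, d=10, n=100)
print(f"lower = {low:.5f}, upper = {up:.5f}, ratio = {up / low:.1f}")   # ratio = 64 / log(2)
```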