CS861: Theoretical Foundations of Machine Learning Lecture 11 – 29/09/2023 University of Wisconsin–Madison, Fall 2023
Lecture 11: Review of Information Theory
Lecturer: Kirthevasan Kandasamy Scribed by: Deep Patel, Keran Chen
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the instructor.
We have been looking at the notion of Minimax Optimality for a few lectures, wherein we introduced the notion of Minimax Risk and “reduced” the problem of estimation to that of hypothesis testing in order to obtain lower bounds on the minimax risk. Specifically, we reduced to binary hypothesis testing and derived lower bounds using Le Cam’s method, which we applied to mean estimation and a toy regression setting. In this lecture, we will briefly review Information Theory and set the stage for Fano’s method, which will be more appropriate for the lower bounds we are interested in obtaining in this course.
1 Insufficiency of Le Cam’s method
As we consider only binary hypothesis testing, Le Cam’s method is usually sufficient only for point estimation. However, when we are doing high-dimensional parameter estimation, it makes sense to distinguish between multiple hypotheses in order to obtain better estimation bounds. To illustrate this, consider the following (imperfect) example of mean estimation for d-dimensional Gaussian distributions. Consider the family P = { N(μ, σ²I) : μ ∈ R^d }, where σ² is known. We are given a set S = {X_1, X_2, …, X_n} drawn i.i.d. from some P ∈ P. Let Φ ◦ ρ(θ_1, θ_2) ≜ ∥θ_1 − θ_2∥_2², i.e., ρ is the Euclidean distance and Φ(t) = t². Consider the sample mean θ̂(S) = (1/n) Σ_{i=1}^n X_i as our estimator for the mean.
The upper bound for this mean estimator can be obtained as follows:
R(θ̂, P) = E_{S∼P^n}[ ∥θ̂(S) − θ(P)∥_2² ]
         = Σ_{j=1}^d E[ ( (1/n) Σ_{i=1}^n X_{ij} − θ_j )² ]
         ≤ dσ²/n        (∵ for each 1-D Gaussian coordinate, the upper bound is σ²/n)
Note that, as we will see next, the lower bound obtained via Le Cam’s method becomes increasingly loose relative to this upper bound as the dimensionality, d, grows larger.
Using Le Cam’s method, we can obtain a lower bound as follows. Let P_0 = N(0, σ²I) and P_1 = N(δv, σ²I), where v ∈ R^d with ∥v∥_2 = 1. We want to apply Corollary 1 from Lecture 10 to obtain the lower bound. For this, we need to choose δ such that KL(P_0, P_1) ≤ (log 2)/n. Since P_0 and P_1 are Gaussian, we have KL(P_0, P_1) = δ²/(2σ²). Thus, we choose δ = √(2σ² log 2 / n). Whence, by Corollary 1 of Lecture 10, we have
R*_n = inf_{θ̂} sup_{P∈P} E_S[ Φ ◦ ρ( θ(P), θ̂(S) ) ] ≥ (σ² log 2)/(16 n).
Note that there is no factor of d in this lower bound, in contrast to the upper bound dσ²/n above.
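To see the resulting gap concretely, the following small simulation (a Python sketch added for illustration; it is not part of the lecture, and the parameter values are arbitrary) estimates the risk of the sample mean for a few dimensions d and compares it with the upper bound dσ²/n and the dimension-free Le Cam lower bound σ² log 2 / (16n):

import numpy as np

rng = np.random.default_rng(0)
n, sigma, trials = 50, 1.0, 500

for d in [1, 10, 100]:
    mu = np.zeros(d)                                  # true mean (zero, without loss of generality)
    X = rng.normal(mu, sigma, size=(trials, n, d))    # 'trials' independent samples of size n
    err = ((X.mean(axis=1) - mu) ** 2).sum(axis=1)    # squared L2 error of the sample mean
    print(d,
          err.mean(),                                 # empirical risk, roughly d * sigma^2 / n
          d * sigma ** 2 / n,                         # upper bound (grows with d)
          sigma ** 2 * np.log(2) / (16 * n))          # Le Cam lower bound (does not grow with d)

The empirical risk tracks the upper bound dσ²/n, while the two-point lower bound stays flat in d; this is exactly the insufficiency that Fano’s method will address.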
2 Review of Information Theory

2.1 Entropy
Definition 1 (Entropy of a random variable).
H(X) = E_X[−log p(X)]
For a discrete random variable: H(X) = −Σ_{x∈X} p(x) log p(x). For a continuous random variable: H(X) = −∫ p(x) log p(x) dx.
For example, if X ∼ Bern(p), then H(X) = −p log p − (1 − p) log(1 − p). Similarly, if X ∼ N(μ, σ²), then H(X) = (1/2) log(2πeσ²).
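As a quick numerical sanity check of these two examples (a Python sketch of ours, not part of the notes), we can compare the closed-form entropies against the defining expectation E[−log p(X)], estimated by Monte Carlo in the Gaussian case:

import numpy as np

rng = np.random.default_rng(0)

# Bernoulli(p): closed form of H(X) = -sum_x p(x) log p(x)
p = 0.3
h_bern = -p * np.log(p) - (1 - p) * np.log(1 - p)      # ≈ 0.611 nats

# Gaussian N(mu, sigma^2): closed form vs. Monte Carlo estimate of E[-log p(X)]
mu, sigma = 1.0, 2.0
h_gauss_closed = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)
x = rng.normal(mu, sigma, size=200_000)
log_density = -0.5 * np.log(2 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
h_gauss_mc = -log_density.mean()

print(h_bern)                          # ≈ 0.6109
print(h_gauss_closed, h_gauss_mc)      # both ≈ 2.112 nats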
Remark 2.1 (Interpretation of Entropy for discrete random variables). Entropy measures the spread of the distribution. Another interpretation of entropy is that it measures the amount of information or uncertainty about the possible outcomes contained in a variable.
Remark 2.2. Some properties of entropy (for a discrete random variable):
0 ≤ H(X) ≤ log|X|,
where (a) the lower bound uses the fact that log(1/p(x)) ≥ 0 since p(x) ≤ 1, and (b) the upper bound is proved in Lemma 6 towards the end of these lecture notes.
Definition 2 (Conditional Entropy). First define the entropy of X conditioned on knowing Y = y as follows:
H(X|Y = y) = −Σ_x p(x|y) log p(x|y).
The conditional entropy is the expectation of H(X|Y = y) over Y. We have
H(X|Y) = Σ_{y∈Y} p(y) H(X|Y = y) = −Σ_{x,y} p(x, y) log p(x|y)
More generally, we can write
H(X|Y) = −E_{X,Y}[log p(X|Y)]

Definition 3 (Joint Entropy of two random variables).
H(X, Y) = −E_{X,Y}[log p(X, Y)]

Lemma 1 (Chain Rule for Entropy).
H(X_1, …, X_n) = Σ_{i=1}^n H(X_i | X_1, …, X_{i−1})        (1)
H(X_1, …, X_n | Y) = Σ_{i=1}^n H(X_i | X_1, …, X_{i−1}, Y)  (2)
Proof We will prove the first statement when n = 2. First note that we can write,
p(x1, x2) = p(x1)p(x2|x1) ⇒ − log(p(x1, x2)) = − log(p(x1)) − log(p(x2|x1))
Take expectation with respect to X1, X2:
H(X1, X2) = H(X1) + H(X2|X1)
Then use induction:
H(X_1, …, X_n) = H(X_1, …, X_{n−1}) + H(X_n | X_1, …, X_{n−1})
= H(X_1, …, X_{n−2}) + H(X_{n−1} | X_1, …, X_{n−2}) + H(X_n | X_1, …, X_{n−1})
⋮
= Σ_{i=1}^n H(X_i | X_1, …, X_{i−1})
The second statement can be proved in a similar fashion.
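To make the chain rule concrete, here is a small numerical check (a Python sketch of ours, using an arbitrary joint PMF) that H(X_1, X_2) = H(X_1) + H(X_2|X_1) for a discrete pair:

import numpy as np

# Arbitrary 2x3 joint PMF of (X1, X2); rows index X1, columns index X2
p_joint = np.array([[0.10, 0.20, 0.10],
                    [0.25, 0.05, 0.30]])

def H(p):
    """Entropy of a PMF given as an array of probabilities."""
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return -(p * np.log(p)).sum()

p_x1 = p_joint.sum(axis=1)                                         # marginal of X1
H_joint = H(p_joint)                                               # H(X1, X2)
H_cond = sum(p1 * H(row / p1) for p1, row in zip(p_x1, p_joint))   # H(X2 | X1)

print(np.isclose(H_joint, H(p_x1) + H_cond))                       # True: chain rule holds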
Definition 4 (KL divergence (relative entropy) of distributions P and Q).
KL(P, Q) = E_{X∼P}[ log( p(X) / q(X) ) ]

Lemma 2 (KL(P, Q) ≥ 0, with equality iff P = Q).
Proof We have
KL(P, Q) = E_{X∼P}[ −log( q(X)/p(X) ) ] ≥ −log E_{X∼P}[ q(X)/p(X) ] = −log 1 = 0,
where the inequality is Jensen’s inequality applied to the convex function −log, and the last step uses E_{X∼P}[ q(X)/p(X) ] = 1 (the q-mass sums, or integrates, to one). The equality condition follows from the tightness condition for Jensen’s inequality.
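As an illustration (a Python sketch of ours, with made-up distributions), we can compute KL(P, Q) for two discrete distributions and observe that it is nonnegative in both directions, not symmetric in general, and zero when Q = P:

import numpy as np

def kl(p, q):
    """KL(P, Q) = sum_x p(x) log(p(x)/q(x)) for discrete PMFs."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return (p[mask] * np.log(p[mask] / q[mask])).sum()

P = [0.5, 0.3, 0.2]
Q = [0.2, 0.5, 0.3]
print(kl(P, Q), kl(Q, P))    # both positive, and generally unequal
print(kl(P, P))              # 0.0, matching the equality condition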
2.2 Mutual Information
Definition 5. Mutual Information is the KL divergence between the joint distribution P_XY and the product of marginals P_X × P_Y:
I(X; Y) = KL(P_XY, P_X × P_Y) = E_{P_XY}[ log( p(X, Y) / (p(X) p(Y)) ) ]
Some properties of Mutual Information:
• (Non-negativity) I(X; Y) ≥ 0, with equality iff X ⊥ Y.
• (Symmetry) I(X; Y) = I(Y; X).
Lemma 3. Mutual Information can be expressed in terms of entropy, conditional entropy, and joint entropy as follows:
1.) I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
2.) I(X; Y) = H(X) + H(Y) − H(X, Y)
3.) I(X; X) = H(X)
Proof 1.) We have
I(X; Y) = E_{X,Y}[ log( P(X, Y) / (P(X) P(Y)) ) ]
= E_{X,Y}[ log( P(X) P(Y|X) / (P(X) P(Y)) ) ]
= −E_Y[log P(Y)] + E_{X,Y}[log P(Y|X)]
= H(Y) − H(Y|X)
Similarly, one can obtain I(X; Y) = H(X) − H(X|Y).
2.) By the Chain Rule (Lemma 1), we have
H(X, Y) = H(Y) + H(X|Y)
= H(Y) + H(X) − I(X; Y)   (using part 1)
Rearranging gives I(X; Y) = H(X) + H(Y) − H(X, Y).
3.) I(X; X) = H(X) − H(X|X) = H(X) − 0 = H(X).
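The identities in Lemma 3 are easy to verify numerically. The following Python sketch (ours, with an arbitrary joint PMF) computes I(X; Y) directly from the KL definition and compares it with H(X) + H(Y) − H(X, Y):

import numpy as np

p_xy = np.array([[0.15, 0.05, 0.20],
                 [0.10, 0.30, 0.20]])      # arbitrary joint PMF of (X, Y)

def H(p):
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return -(p * np.log(p)).sum()

p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# I(X;Y) = KL(P_XY, P_X x P_Y)
prod = np.outer(p_x, p_y)
mask = p_xy > 0
I_kl = (p_xy[mask] * np.log(p_xy[mask] / prod[mask])).sum()

# I(X;Y) = H(X) + H(Y) - H(X, Y)
I_ent = H(p_x) + H(p_y) - H(p_xy)

print(np.isclose(I_kl, I_ent))   # True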
Definition 6 (Conditional Mutual Information).
I(X; Y | Z) ≜ H(X|Z) − H(X|Y, Z)
= E_{X,Y,Z}[ log( P(X, Y|Z) / (P(X|Z) P(Y|Z)) ) ]

Lemma 4 (Chain Rule for Mutual Information).
I(X_1, …, X_n; Y) = Σ_{i=1}^n I(X_i; Y | X_1, …, X_{i−1})
Proof We will see the proof for the case of n = 2, but exactly the same proof strategy works for any n.
I(X1, X2; Y ) = H(X1, X2) − H(X1, X2|Y )
= H(X1) + H(X2|X1) − (H(X1|Y ) + H(X2|X1, Y )) (Applying Lemma 1 for entropy and conditional entropy)
= I(X1; Y ) + I(X2; Y |X1)
Lemma 5 (Conditioning reduces entropy). H(X|Y ) ≤ H(X)
Figure 1: Summary of relationship between entropy and mutual information. Source: Chapter 2, Elements of Information Theory by Thomas M. Cover & Joy A. Thomas
Proof
I(X; Y) = H(X) − H(X|Y) ≥ 0   (Mutual Information is non-negative)
⇒ H(X|Y) ≤ H(X)
Lemma 6 (Uniform distribution represents the maximum uncertainty). 0 ≤ H(X) ≤ log|X| for a discrete random variable X, with equality in the upper bound iff X ∼ Unif([|X|]).
Proof Let P, U be the distribution of X and the uniform distribution over [|X|], respectively. Let p, u be the corresponding PMFs. Then, we have
0 ≤ KL(P, U) = E_X[ log( p(X)/u(X) ) ] = −H(X) − E_X[ log(1/|X|) ]
⇒ H(X) ≤ −log(1/|X|) = log|X|.
The lower bound 0 ≤ H(X) holds because log p(x) ≤ 0 for every x.
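As a quick check of Lemma 6 (a Python sketch of ours), we can draw random PMFs on an alphabet of size k and verify that their entropies never exceed log k, which the uniform distribution attains:

import numpy as np

rng = np.random.default_rng(0)
k = 5

def H(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

random_pmfs = rng.dirichlet(np.ones(k), size=1000)        # random distributions on [k]
print(max(H(p) for p in random_pmfs) <= np.log(k))        # True
print(np.isclose(H(np.full(k, 1 / k)), np.log(k)))        # True: uniform attains log k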
Lemma 7 (Data Processing Inequality). Say X, Y, Z are random variables such that X ⊥ Z | Y. Then,
I(X; Y) ≥ I(X; Z)
Proof Applying Lemma 4 twice,
I(X; Y, Z) = I(X; Z) + I(X; Y | Z)
I(X; Y, Z) = I(X; Y) + I(X; Z | Y) = I(X; Y)   (∵ I(X; Z | Y) = 0 since X ⊥ Z | Y)
Since I(X; Y | Z) ≥ 0, combining the two displays gives I(X; Y) ≥ I(X; Z).
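The inequality is easy to check on a concrete example. Below is a Python sketch of ours with an arbitrary Markov chain X → Y → Z (so X ⊥ Z | Y by construction), verifying that I(X; Y) ≥ I(X; Z):

import numpy as np

def mutual_information(p_ab):
    """I(A; B) from a joint PMF given as a 2-D array."""
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return (p_ab[mask] * np.log(p_ab[mask] / (p_a * p_b)[mask])).sum()

# A small Markov chain X -> Y -> Z
p_x = np.array([0.4, 0.6])                  # P(X)
p_y_given_x = np.array([[0.9, 0.1],         # P(Y|X = x); rows sum to 1
                        [0.2, 0.8]])
p_z_given_y = np.array([[0.7, 0.3],         # P(Z|Y = y); rows sum to 1
                        [0.1, 0.9]])

p_xy = p_x[:, None] * p_y_given_x           # P(X, Y)
p_xz = p_xy @ p_z_given_y                   # P(X, Z) = sum_y P(X, Y) P(Z|Y)

print(mutual_information(p_xy) >= mutual_information(p_xz))   # True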
Remark 2.3. Some remarks regarding the Data Processing Inequality (DPI) that we have proved above (Lemma 7):
1.) We can think of X,Y,Z as forming a Markov Chain: X −→ Y −→ Z.
2.) I(X; Z) ≤ I(X; Y) means that “Z contains no more information about X than Y does”.
3.) The road ahead:
◦ In hypothesis testing, we will assume a prior on the alternatives {P1, . . . , PN } ⊆ P
◦ X ∈ [N] indexes a selection from {P_1, …, P_N}. Then, the data, Y, is generated from the chosen P_X. Finally, we have a test, Z, which tries to estimate X from Y.
◦ With the help of Fano’s inequality, we will obtain a lower bound on the probability that Z fails to estimate X, i.e., a bound of the form
P[Z ≠ X] ≥ (……),
where the right-hand side is the lower bound to be derived. We will use the data processing inequality to prove Fano’s inequality.