
CS861: Theoretical Foundations of Machine Learning Lecture 3 – 09/11/2023 University of Wisconsin–Madison, Fall 2023
Lecture 03: Introduction to Rademacher complexity
Lecturer: Kirthevasan Kandasamy Scribed by: Justin Kiefel, Albert Dorador
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the instructor.
In this lecture, we will first finish up the Bayes optimal classifier from the previous class. Then, we will introduce McDiarmid's inequality. This tool is then applied to derive the uniform convergence in probability of the empirical risk to the true risk over all hypotheses in a given hypothesis class. Lastly, we will introduce the Empirical Rademacher complexity as a tool to more explicitly bound, with high probability, the difference between the empirical risk and the true risk.
1 Bayes optimal classifier
We will begin this lecture by introducing the Bayes optimal classifier. This classifier always selects the class with the highest probability conditioned on the input. For binary classification, the classifier is as follows:
$$h_B(x) = \underset{y \in \{0,1\}}{\operatorname{argmax}}\ P(Y = y \mid X = x) = \begin{cases} 1 & \text{if } P(Y = 1 \mid X = x) \geq \tfrac{1}{2} \\ 0 & \text{if } P(Y = 0 \mid X = x) \geq \tfrac{1}{2} \end{cases}$$
We can show that the Bayes optimal classifier produces the minimum risk across all potential classifiers. The intuition behind this result is that selecting the highest probability class will always minimize the expected prediction error.
Theorem: $\forall h \in \{h : \mathcal{X} \to \mathcal{Y}\},\ R(h) \geq R(h_B)$.
Proof:
$$R(h) = E_{XY}[\mathbf{1}(h(X) \neq Y)]$$
Applying the law of iterated expectation,
$$= E_X\big[ E_{Y|X}[\mathbf{1}(h(X) \neq Y) \mid X] \big]$$
Expanding the inner expectation over the two values of $Y$ (law of total expectation),
$$= E_X\big[ E_Y[\mathbf{1}(h(X) \neq Y) \mid Y = 0, X]\, P(Y = 0 \mid X) + E_Y[\mathbf{1}(h(X) \neq Y) \mid Y = 1, X]\, P(Y = 1 \mid X) \big]$$
$$= E_X\big[ \mathbf{1}(h(X) \neq 0)\, P(Y = 0 \mid X) + \mathbf{1}(h(X) \neq 1)\, P(Y = 1 \mid X) \big]$$
$$= E_X\big[ \mathbf{1}(h(X) = 1)\, P(Y = 0 \mid X) + \mathbf{1}(h(X) = 0)\, P(Y = 1 \mid X) \big]$$
Writing the outer expectation as an integral (definition of expectation),
$$= \int_{\mathcal{X}} \big( \mathbf{1}(h(x) = 1)\, P(Y = 0 \mid X = x) + \mathbf{1}(h(x) = 0)\, P(Y = 1 \mid X = x) \big)\, dP_X(x)$$
Observe that the above integrand is minimized pointwise if $h(x)$ obeys the following scheme: if $P(Y = 0 \mid X = x) \geq P(Y = 1 \mid X = x)$, then set $h(x) = 0$, so that only the smaller of the two summands in the integrand is "activated". A symmetric argument reveals that if $P(Y = 1 \mid X = x) \geq P(Y = 0 \mid X = x)$, then we should choose $h(x) = 1$. In other words, the integrand, and therefore the risk of $h$, is minimized for all $x$ if $h$ is chosen to be
$$h(x) = \begin{cases} 1 & \text{if } P(Y = 1 \mid X = x) \geq P(Y = 0 \mid X = x) \\ 0 & \text{if } P(Y = 0 \mid X = x) \geq P(Y = 1 \mid X = x) \end{cases}$$
Notice that this classifier is identical to the Bayes classifier $h_B$. Since the Bayes classifier minimizes the integrand for every $x$, the integral is minimized, and $h_B$ attains the minimum possible value of $R(h)$.
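To make this concrete, here is a small Python sketch (not from the lecture) that estimates by Monte Carlo the risk of the Bayes classifier and of two arbitrary competitors; the discrete toy distribution over $X$ and the conditional probabilities `eta` are assumptions chosen purely for illustration.

```python
import numpy as np

# Toy model (an assumption for this illustration): X ~ Uniform{0, 1, 2},
# P(Y = 1 | X = x) = eta[x]. We estimate R(h) = P(h(X) != Y) by simulation.
rng = np.random.default_rng(0)
eta = np.array([0.2, 0.5, 0.9])

def sample(n):
    X = rng.integers(0, 3, size=n)
    Y = (rng.random(n) < eta[X]).astype(int)
    return X, Y

def risk(h, n=200_000):
    """Monte Carlo estimate of R(h) = P(h(X) != Y)."""
    X, Y = sample(n)
    return np.mean(h(X) != Y)

# Bayes classifier: predict 1 whenever P(Y = 1 | X = x) >= 1/2.
h_bayes = lambda X: (eta[X] >= 0.5).astype(int)
# Two arbitrary competitors.
h_always_one = lambda X: np.ones_like(X)
h_anti_bayes = lambda X: (eta[X] < 0.5).astype(int)

print("Bayes risk      :", risk(h_bayes))       # smallest of the three
print("always predict 1:", risk(h_always_one))
print("anti-Bayes      :", risk(h_anti_bayes))  # largest of the three
```

The printed Bayes risk should be close to $E[\min(\eta(X), 1 - \eta(X))]$, and no competitor should beat it beyond Monte Carlo noise.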
2 McDiarmid’s inequality
Next, we will introduce McDiarmid's inequality. This concentration inequality bounds the deviation of a function of independent random variables from its expected value. We will use McDiarmid's inequality when working with Rademacher complexity. To begin, let us define the bounded difference property, which is required in order to apply McDiarmid's inequality.
Definition 1 (Bounded Difference Property). Let $f : \mathbb{R}^n \to \mathbb{R}$. We say $f$ satisfies the bounded difference property if there exist $c_1, \dots, c_n \in \mathbb{R}$ such that for all $k \in \{1, \dots, n\}$,
$$\sup_{z_1, \dots, z_n, \tilde{z}_k} \big| f(z_1, \dots, z_k, \dots, z_n) - f(z_1, \dots, \tilde{z}_k, \dots, z_n) \big| \leq c_k.$$
Intuitively, the bounded difference property states that changing any single coordinate of the input changes the function's value by at most a bounded amount ($c_k$ for the $k$-th coordinate). Now we state McDiarmid's inequality.
Theorem 1 (McDiarmid's Inequality). Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function satisfying the bounded difference property with bounds $c_1, \dots, c_n \in \mathbb{R}$, and let $Z_1, \dots, Z_n$ be $n$ independent random variables. Then for all $\varepsilon > 0$,
$$P\big( f(Z_1, \dots, Z_n) - E[f(Z_1, \dots, Z_n)] > \varepsilon \big) \leq \exp\left( \frac{-2\varepsilon^2}{\sum_{k=1}^n c_k^2} \right).$$
Similarly,
$$P\big( E[f(Z_1, \dots, Z_n)] - f(Z_1, \dots, Z_n) > \varepsilon \big) \leq \exp\left( \frac{-2\varepsilon^2}{\sum_{k=1}^n c_k^2} \right).$$
To demonstrate the application of McDiarmid's inequality, we present the following example:
Example 1. We will use McDiarmid's inequality to show $P(|\hat{R}(h) - R(h)| > \varepsilon) \leq 2e^{-2n\varepsilon^2}$. First, we will show that the function $\hat{R}(h)$ satisfies the bounded difference property. Let $X_1, \dots, X_n \in \mathbb{R}^d$ and $Y_1, \dots, Y_n \in \{0, 1\}$ be random variables, and define $Z_i = \mathbf{1}(h(X_i) \neq Y_i)$. Then
$$\hat{R}(h) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}(h(X_i) \neq Y_i) = \frac{1}{n} \sum_{i=1}^n Z_i,$$
so we can represent $\hat{R}(h)$ as a function of $Z_1, \dots, Z_n$. Let $k \in \{1, \dots, n\}$. Next, we can see that the maximum change in $\hat{R}(h)$ from altering $Z_k$ is bounded by $\frac{1}{n}$:
$$\sup_{Z_1, \dots, Z_n, \tilde{Z}_k} \left| \frac{1}{n} \sum_{i=1}^n Z_i - \frac{1}{n}\left( \sum_{i=1}^{k-1} Z_i + \tilde{Z}_k + \sum_{i=k+1}^{n} Z_i \right) \right| = \sup_{Z_k, \tilde{Z}_k} \left| \frac{1}{n} Z_k - \frac{1}{n} \tilde{Z}_k \right| = \frac{1}{n}$$
Hence, the bounded difference property applies to $\hat{R}(h)$, with $c_k = \frac{1}{n}$ for every input. Now we can apply McDiarmid's inequality. Noting that $E[\hat{R}(h)] = R(h)$,
$$P\big( \hat{R}(h) - R(h) > \varepsilon \big) = P\big( \hat{R}(h) - E[\hat{R}(h)] > \varepsilon \big) \leq \exp\left( \frac{-2\varepsilon^2}{\sum_{k=1}^n (1/n)^2} \right) = \exp(-2n\varepsilon^2).$$
Applying the same reasoning to the other tail, we get
$$P\big( R(h) - \hat{R}(h) > \varepsilon \big) \leq \exp(-2n\varepsilon^2).$$
Using the union bound, we get our desired result:
$$P\big( |\hat{R}(h) - R(h)| > \varepsilon \big) \leq 2\exp(-2n\varepsilon^2).$$
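As a rough sanity check on this bound (our own construction, not from the notes), the sketch below simulates the empirical risk of a fixed classifier, modelled simply as the mean of i.i.d. Bernoulli indicator losses $Z_i$ with an assumed error rate `p`, and compares the observed tail frequency with $2\exp(-2n\varepsilon^2)$.

```python
import numpy as np

# For a fixed classifier h, the Z_i = 1(h(X_i) != Y_i) are i.i.d. Bernoulli(p)
# with p = R(h). Here p, n, eps are assumed values chosen for illustration.
rng = np.random.default_rng(1)
n, p, eps, trials = 200, 0.3, 0.1, 100_000

Z = rng.random((trials, n)) < p          # trials x n matrix of indicator losses
R_hat = Z.mean(axis=1)                   # empirical risk in each trial
freq = np.mean(np.abs(R_hat - p) > eps)  # observed deviation frequency

print("empirical P(|R_hat - R| > eps):", freq)
print("bound 2*exp(-2*n*eps^2)       :", 2 * np.exp(-2 * n * eps**2))
# The empirical frequency should lie below the (conservative) bound.
```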
3 Uniform convergence
We would like to have, for any small ε > 0,
$$P\big( \forall h \in \mathcal{H},\ |\hat{R}(h) - R(h)| \leq \varepsilon \big) \geq \gamma$$
where γ ∈ [0, 1] is a large quantity (i.e. close to 1).
In a previous lecture, we have considered, for an arbitrary h ∈ H,
$$P\big( |\hat{R}(h) - R(h)| > \varepsilon \big) \leq \delta$$
where δ ∈ [0, 1] is a small quantity (i.e. close to 0), and that bound was derived e.g. by Hoeffding’s inequality.
Then, we applied a union bound and obtained:
$$P\big( \exists h \in \mathcal{H} : |\hat{R}(h) - R(h)| > \varepsilon \big) \leq \sum_{h \in \mathcal{H}} P\big( |\hat{R}(h) - R(h)| > \varepsilon \big) \leq |\mathcal{H}| \cdot \delta$$
Of course, the above statement is vacuous if |H| · δ ≥ 1, which will necessarily be the case if |H| = ∞ no matter how small δ > 0 is. Therefore, we wish to derive a bound that is still useful in the presence of a potentially non-finite hypothesis class. We will proceed by considering the following quantity
$$f(S) := \sup_{h \in \mathcal{H}} \big( \hat{R}_S(h) - R(h) \big)$$
where $S := \{(X_1, Y_1), \dots, (X_k, Y_k), \dots, (X_n, Y_n)\}$ with $X_i \in \mathcal{X}$ and $Y_i \in \{0, 1\}$, and $\hat{R}_S(h) := \frac{1}{n} \sum_{(X_i, Y_i) \in S} \mathbf{1}\{h(X_i) \neq Y_i\}$.
To apply McDiarmid's inequality, additionally define
$$\tilde{S} := \{(X_1, Y_1), \dots, (\tilde{X}_k, \tilde{Y}_k), \dots, (X_n, Y_n)\},$$
i.e., the sample $S$ with its $k$-th point replaced by an independent copy.

Then, equipped with the above definitions,
$$\sup_{S, \tilde{S}} \big| f(S) - f(\tilde{S}) \big| = \sup_{S, \tilde{S}} \left| \sup_{h \in \mathcal{H}} \big( \hat{R}_S(h) - R(h) \big) - \sup_{h \in \mathcal{H}} \big( \hat{R}_{\tilde{S}}(h) - R(h) \big) \right|$$
$$\leq \sup_{S, \tilde{S}}\, \sup_{h \in \mathcal{H}} \left| \big( \hat{R}_S(h) - R(h) \big) - \big( \hat{R}_{\tilde{S}}(h) - R(h) \big) \right|$$
$$= \sup_{S, \tilde{S}}\, \sup_{h \in \mathcal{H}} \left| \hat{R}_S(h) - \hat{R}_{\tilde{S}}(h) \right|$$
$$= \sup_{S, \tilde{S}}\, \sup_{h \in \mathcal{H}} \frac{1}{n} \left| \mathbf{1}\{h(X_k) \neq Y_k\} - \mathbf{1}\{h(\tilde{X}_k) \neq \tilde{Y}_k\} \right|$$
$$\leq \frac{1}{n},$$
where the first inequality follows from the fact that for any functions $f_1, f_2$, $\big| \sup_a f_1(a) - \sup_a f_2(a) \big| \leq \sup_a |f_1(a) - f_2(a)|$.
Noting that $\frac{1}{n}$ plays the role of $c_k$ for all $k$ in the bounded difference property introduced in the previous section, we see that said property holds, and so we can apply McDiarmid's inequality as follows:
$$P_{S \sim P_{XY}}\big( f(S) - E[f(S)] > \varepsilon \big) = P_{S \sim P_{XY}}\left( \sup_{h \in \mathcal{H}} \big( \hat{R}_S(h) - R(h) \big) - E\left[ \sup_{h \in \mathcal{H}} \big( \hat{R}_S(h) - R(h) \big) \right] > \varepsilon \right) \leq \exp(-2n\varepsilon^2)$$
Observe that the above probability is equivalent to
$$P\left( \sup_{h \in \mathcal{H}} \big( \hat{R}_S(h) - R(h) \big) > E\left[ \sup_{h \in \mathcal{H}} \big( \hat{R}_S(h) - R(h) \big) \right] + \varepsilon \right)$$
which in turn is equivalent to
$$P\left( \exists h \in \mathcal{H},\ \hat{R}_S(h) - R(h) > E\left[ \sup_{h \in \mathcal{H}} \big( \hat{R}_S(h) - R(h) \big) \right] + \varepsilon \right)$$
Hence, by McDiarmid's inequality, we can equivalently claim that, with probability at least $1 - \exp(-2n\varepsilon^2)$, for all $h \in \mathcal{H}$,
$$\hat{R}_S(h) - R(h) \leq E\left[ \sup_{h \in \mathcal{H}} \big( \hat{R}_S(h) - R(h) \big) \right] + \varepsilon.$$
It would be more illuminating if we had a way to quantify, or at least bound, the term $E\big[ \sup_{h \in \mathcal{H}} ( \hat{R}_S(h) - R(h) ) \big]$, which, given the non-linear nature of the supremum operator, is in general not equal to $\sup_{h \in \mathcal{H}} E\big[ \hat{R}_S(h) - R(h) \big] = \sup_{h \in \mathcal{H}} 0 = 0$.
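For intuition, this expected supremum can be estimated directly by simulation when $\mathcal{H}$ is small. The sketch below (our own toy example, not part of the notes) uses a finite class of threshold classifiers $h_t(x) = \mathbf{1}\{x \geq t\}$ with $X \sim \mathrm{Unif}(0, 1)$ and $P(Y = 1 \mid X = x) = x$; all of these modelling choices are assumptions made for illustration.

```python
import numpy as np

# Monte Carlo estimate of E[ sup_{h in H} ( R_hat_S(h) - R(h) ) ] for a finite
# class of threshold classifiers h_t(x) = 1{x >= t} on a toy data model.
rng = np.random.default_rng(2)
thresholds = np.linspace(0.05, 0.95, 19)   # the finite class H
n, trials = 100, 2000

def true_risk(t):
    # For x < t we predict 0 and err with prob x; for x >= t we predict 1 and
    # err with prob 1 - x.  Integrating over X ~ Unif(0, 1) gives:
    return t**2 / 2 + (1 - t)**2 / 2

R = np.array([true_risk(t) for t in thresholds])

sups = []
for _ in range(trials):
    X = rng.random(n)
    Y = (rng.random(n) < X).astype(int)
    preds = (X[None, :] >= thresholds[:, None]).astype(int)   # |H| x n predictions
    R_hat = np.mean(preds != Y[None, :], axis=1)              # empirical risk of each h_t
    sups.append(np.max(R_hat - R))

print("estimate of E[ sup_h (R_hat_S(h) - R(h)) ]:", np.mean(sups))
```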
Next, we will introduce Rademacher complexity to help us bound this term.
4 Rademacher complexity
Definition 2. A Rademacher random variable $\sigma \in \{-1, 1\}$ is such that $P(\sigma = -1) = P(\sigma = 1) = 1/2$.
Definition 3 (Empirical Rademacher Complexity). Let $S := \{(x_1, y_1), \dots, (x_n, y_n)\}$ be an observed sample of $n$ points, and let $\sigma := (\sigma_1, \dots, \sigma_n) \in \{-1, 1\}^n$ be $n$ independent Rademacher random variables. Then, the Empirical Rademacher Complexity is
$$\mathrm{Rad}(S, \mathcal{H}) := E_\sigma\left[ \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i\, l(h(x_i), y_i) \right]$$
where $l(\cdot, \cdot)$ is our choice of loss function, e.g. $l(h(x_i), y_i) = \mathbf{1}\{h(x_i) \neq y_i\}$ in the case of a classification problem.
The above definition can be intuitively interpreted as follows: let $\vec{l} := \big( l(h(x_1), y_1), \dots, l(h(x_n), y_n) \big)$. Then,
$$\mathrm{Rad}(S, \mathcal{H}) := E_\sigma\left[ \sup_{h \in \mathcal{H}} \frac{1}{n}\, \sigma \cdot \vec{l}\, \right]$$
measures how well the hypothesis class $\mathcal{H}$ could correlate¹ with a random Rademacher vector. More flexible hypothesis classes will be able to correlate more with random vectors.

¹Recall that for any two vectors $a, b$, their dot product is equal to the cosine of the angle between them, scaled by the product of their norms; if those vectors have mean zero, their cosine coincides with their correlation coefficient.
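Because the definition involves only an expectation over $\sigma$ for a fixed sample $S$, $\mathrm{Rad}(S, \mathcal{H})$ can be approximated by averaging over random Rademacher draws. The sketch below (our own construction, not from the notes) does this for a toy finite class of threshold classifiers with the 0-1 loss; the data-generating model and the class are assumptions chosen for illustration.

```python
import numpy as np

# Monte Carlo approximation of the Empirical Rademacher Complexity
#   Rad(S, H) = E_sigma[ sup_h (1/n) sum_i sigma_i * l(h(x_i), y_i) ]
# for a finite class of threshold classifiers h_t(x) = 1{x >= t} and 0-1 loss.
rng = np.random.default_rng(3)
thresholds = np.linspace(0.05, 0.95, 19)   # the finite class H
n, n_sigma = 100, 5000

# One fixed observed sample S (toy model: X ~ Unif(0,1), P(Y=1|X=x) = x).
X = rng.random(n)
Y = (rng.random(n) < X).astype(int)

# Loss matrix L[j, i] = 1{h_{t_j}(x_i) != y_i}.
preds = (X[None, :] >= thresholds[:, None]).astype(int)
L = (preds != Y[None, :]).astype(float)               # |H| x n

sigma = rng.choice([-1.0, 1.0], size=(n_sigma, n))    # independent Rademacher draws
corr = sigma @ L.T / n                                # (1/n) sigma . l  for every (sigma, h)
rad_hat = np.mean(np.max(corr, axis=1))               # average of sup over H

print("estimated Rad(S, H):", rad_hat)
```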