
CS861: Theoretical Foundations of Machine Learning Lecture 5 – 09/15/2023 University of Wisconsin–Madison, Fall 2023
Lecture 05: Growth Function and VC Dimension
Lecturer: Kirthevasan Kandasamy Scribed by: Chenghui Zheng, Yixuan Zhang
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the instructor.
In this lecture, we will first bound the Rademacher complexity using the growth function. Then, we will introduce the VC dimension and provide some examples.
1 Bounding Rademacher Complexity Using the Growth Function
First, we will prove Massart's lemma, which upper bounds the empirical Rademacher complexity.
Lemma 1 (Massart's Lemma). Let $S = \{(x_1, y_1), \dots, (x_n, y_n)\} \in (\mathcal{X} \times \mathcal{Y})^n$, and let $\mathcal{H}$ be a hypothesis class. Then
\[
\mathrm{Rad}(S, \mathcal{H}) \le \frac{1}{n} \max_{v \in L(S, \mathcal{H})} \|v\|_2 \sqrt{2 \log(|L(S, \mathcal{H})|)},
\]
where $\|v\|_2 = \sqrt{\sum_{i=1}^n v_i^2}$ and $L(S, \mathcal{H}) = \{[\ell(h(x_1), y_1), \dots, \ell(h(x_n), y_n)] \mid h \in \mathcal{H}\}$ is the set of loss vectors realized by $\mathcal{H}$ on $S$.

Proof. First, observe that we can write
\[
\mathrm{Rad}(S, \mathcal{H}) = \mathbb{E}_\sigma\left[\sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i \ell(h(x_i), y_i)\right] = \mathbb{E}_\sigma\left[\max_{v \in L(S, \mathcal{H})} \frac{1}{n} \sum_{i=1}^n \sigma_i v_i\right]. \tag{1}
\]
Next, let $s > 0$, whose value we will specify later. We have
\begin{align*}
\mathbb{E}_\sigma\left[\max_{v \in L(S, \mathcal{H})} \sum_{i=1}^n \sigma_i v_i\right]
&= \frac{1}{s}\, \mathbb{E}_\sigma\left[\log \exp\left(\max_{v \in L(S, \mathcal{H})} s \sum_{i=1}^n \sigma_i v_i\right)\right] \\
&\le \frac{1}{s} \log \mathbb{E}_\sigma\left[\exp\left(\max_{v \in L(S, \mathcal{H})} s \sum_{i=1}^n \sigma_i v_i\right)\right] && \text{by Jensen's inequality} \\
&= \frac{1}{s} \log \mathbb{E}_\sigma\left[\max_{v \in L(S, \mathcal{H})} \exp\left(s \sum_{i=1}^n \sigma_i v_i\right)\right] \\
&\le \frac{1}{s} \log \sum_{v \in L(S, \mathcal{H})} \mathbb{E}_\sigma\left[\exp\left(s \sum_{i=1}^n \sigma_i v_i\right)\right] \\
&\overset{(i)}{\le} \frac{1}{s} \log\left(|L(S, \mathcal{H})| \exp\left(\frac{s^2}{2} \max_{v \in L(S, \mathcal{H})} \|v\|_2^2\right)\right) \\
&= \frac{1}{s} \log(|L(S, \mathcal{H})|) + \frac{s}{2} \max_{v \in L(S, \mathcal{H})} \|v\|_2^2. \tag{2}
\end{align*}
The inequality (i) holds because $\sigma = (\sigma_1, \dots, \sigma_n)$ has independent coordinates and each $\sigma_i$ is 1-subgaussian. Then,
\[
\mathbb{E}_\sigma\left[\exp\left(s \sum_{i=1}^n \sigma_i v_i\right)\right] = \prod_{i=1}^n \mathbb{E}_{\sigma_i}\left[\exp((s v_i)\sigma_i)\right] \le \prod_{i=1}^n \exp\left(\frac{s^2 v_i^2}{2}\right) = \exp\left(\frac{s^2 \|v\|_2^2}{2}\right). \tag{3}
\]
Since equation (2) holds for all $s > 0$, we can choose the minimizing value
\[
s = \frac{\sqrt{2 \log(|L(S, \mathcal{H})|)}}{\max_{v \in L(S, \mathcal{H})} \|v\|_2}.
\]
Equations (1), (2), and (3) then imply
\[
\mathrm{Rad}(S, \mathcal{H}) \le \frac{1}{n} \max_{v \in L(S, \mathcal{H})} \|v\|_2 \sqrt{2 \log(|L(S, \mathcal{H})|)}.
\]
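As a sanity check on Massart's lemma, the following Python sketch (our illustration, not part of the lecture) estimates the empirical Rademacher complexity of a small finite set of loss vectors by Monte Carlo and compares it against the bound. The random 0/1 vectors stand in for $L(S, \mathcal{H})$.

```python
import numpy as np

# Monte Carlo sanity check of Massart's lemma (illustrative; the random 0/1
# vectors below stand in for the loss vectors in L(S, H)).
rng = np.random.default_rng(0)
n, num_hypotheses, num_samples = 20, 8, 100_000

# Each row is one loss vector v = (l(h(x_1), y_1), ..., l(h(x_n), y_n)).
L = rng.integers(0, 2, size=(num_hypotheses, n)).astype(float)

# Rad(S, H) = E_sigma[ max_{v in L(S,H)} (1/n) * sum_i sigma_i * v_i ],
# with sigma_i i.i.d. uniform on {-1, +1}.
sigma = rng.choice([-1.0, 1.0], size=(num_samples, n))
rad_hat = np.mean(np.max(sigma @ L.T, axis=1)) / n

# Massart bound: (1/n) * max_v ||v||_2 * sqrt(2 * log |L(S, H)|).
bound = np.max(np.linalg.norm(L, axis=1)) * np.sqrt(2 * np.log(len(L))) / n

print(f"estimated Rad(S, H) ~ {rad_hat:.4f}, Massart bound = {bound:.4f}")
assert rad_hat <= bound  # holds up to Monte Carlo error
```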
Corollary 1. For all $S$ such that $|S| = n$, we have
\[
\mathrm{Rad}(S, \mathcal{H}) \le \sqrt{\frac{2 \log(g(n, \mathcal{H}))}{n}}, \qquad \mathrm{Rad}_n(\mathcal{H}) \le \sqrt{\frac{2 \log(g(n, \mathcal{H}))}{n}}.
\]
Proof. The first statement follows from Massart's lemma, since $\|v\|_2 \le \sqrt{n}$ (each coordinate of $v$ lies in $\{0, 1\}$ under the 0-1 loss) and $|L(S, \mathcal{H})| \le g(n, \mathcal{H})$ by definition of $g(n, \mathcal{H})$. The second statement follows by taking the expectation over $S$ of the LHS of the first statement.
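To see what Corollary 1 buys us, the short sketch below compares the bound $\sqrt{2 \log(g(n, \mathcal{H}))/n}$ under polynomial and exponential growth. It assumes $g(n, \mathcal{H}) = n + 1$, the standard growth function of one-sided thresholds, which is not derived in these notes.

```python
import numpy as np

# How the Corollary 1 bound sqrt(2 log g(n, H) / n) scales with n.
# Assumption (not derived in these notes): g(n, H) = n + 1, the growth
# function of one-sided thresholds 1{x >= a}. With g(n, H) = 2^n the
# bound equals the constant sqrt(2 log 2) and never vanishes.
for n in [10, 100, 1_000, 10_000]:
    poly_bound = np.sqrt(2 * np.log(n + 1) / n)   # g(n, H) = n + 1
    full_bound = np.sqrt(2 * n * np.log(2) / n)   # g(n, H) = 2^n
    print(f"n = {n:>6}:  poly g -> {poly_bound:.4f},  g = 2^n -> {full_bound:.4f}")
```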
To motivate the ensuing discussion of the VC dimension, recall that with probability at least $1 - 2e^{-2n\varepsilon^2}$,
\[
R(\hat{h}) \le \inf_{h \in \mathcal{H}} R(h) + c_1 \mathrm{Rad}_n(\mathcal{H}) + c_2 \varepsilon,
\]
where, for fixed $n$ and $\delta$, $\varepsilon \in O\big(\sqrt{\tfrac{1}{n} \log \tfrac{1}{\delta}}\big)$. From the previous lecture, we obtained $g(n, \mathcal{H}) \le 2^n$. However, when $g(n, \mathcal{H}) = 2^n$, the bound of Corollary 1 equals $\sqrt{2 \log(2^n)/n} = \sqrt{2 \log 2}$, a constant, so it never goes to 0. At the very least, we hope to have $g(n, \mathcal{H}) \in o(2^n)$, but ideally we would like $g(n, \mathcal{H}) \in \mathrm{poly}(n)$, so that $\sqrt{\tfrac{\log(g(n, \mathcal{H}))}{n}} \lesssim \sqrt{\tfrac{\log n}{n}}$.
2 VC Dimension
In this section, we begin with the definition of shattering.
Definition 1. Let $S_X = \{x_1, \dots, x_n\} \in \mathcal{X}^n$ be a set of $n$ points in $\mathcal{X}$. We say that $S_X$ is shattered by a hypothesis class $\mathcal{H}$ if $\mathcal{H}$ "can realize any labeling of $S_X$". That is,
\[
|\mathcal{H}(S_X)| = 2^n,
\]
where $\mathcal{H}(S_X) = \{[h(x_1), \dots, h(x_n)] \mid h \in \mathcal{H}\}$.

Next, we give two examples of shattering under the same hypothesis class $\mathcal{H}$, the two-sided threshold classifiers:
\[
\mathcal{H} = \{h_a(x) = \mathbf{1}\{x \ge a\} \mid a \in \mathbb{R}\} \cup \{h_a(x) = \mathbf{1}\{x < a\} \mid a \in \mathbb{R}\}.
\]
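As an illustration of Definition 1 (ours, not from the lecture), the following brute-force check enumerates the labelings $\mathcal{H}(S_X)$ that the two-sided threshold class realizes on a finite point set, by sweeping the threshold $a$ over midpoints between sorted points, and tests whether $|\mathcal{H}(S_X)| = 2^n$.

```python
import numpy as np

# Brute-force shattering check for the two-sided threshold class
# H = {1{x >= a} : a in R} union {1{x < a} : a in R} (our illustration).
def labelings(points):
    xs = np.sort(np.asarray(points, dtype=float))
    # Sweeping a over one value below all points, all midpoints, and one
    # value above all points realizes every labeling H can produce.
    thresholds = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    achieved = set()
    for a in thresholds:
        achieved.add(tuple(int(x >= a) for x in xs))  # h_a(x) = 1{x >= a}
        achieved.add(tuple(int(x < a) for x in xs))   # h_a(x) = 1{x < a}
    return achieved

def is_shattered(points):
    return len(labelings(points)) == 2 ** len(points)

print(is_shattered([0.0]))             # True: both labels realizable on one point
print(is_shattered([0.0, 1.0]))        # True: all four labelings realizable
print(is_shattered([0.0, 1.0, 2.0]))   # False: (1, 0, 1) needs two boundaries
```

Consistent with the examples to come, any one or two points are shattered by this class, but three points are not, since the labeling $(1, 0, 1)$ would require two decision boundaries.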