
CS861: Theoretical Foundations of Machine Learning Lecture 6 – 09/18/2023 University of Wisconsin–Madison, Fall 2023
Lecture 06: PAC bound in a finite VC class, Proof of Sauer’s lemma
Lecturer: Kirthevasan Kandasamy Scribed by: Justin Kiefel, Joseph Salzer
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the instructor.
In this lecture, we demonstrate how to bound the growth function of a hypothesis class via its VC-dimension. This bound is known as Sauer's lemma, which we prove in the second section. Once we have this bound, we can show that any hypothesis class with finite VC-dimension is (agnostic) PAC-learnable.
1 PAC Bound in a Finite VC Class
Recall the definition of the restriction $L(S, H)$ and the growth function $g(n, H)$ of a hypothesis class $H$ (see Lecture 4, Definitions 2 and 3 respectively). In our previous lecture, we proved the following generalization bound on the estimation error using the growth function: with probability at least $1 - 2e^{-2n\varepsilon^2}$, we have
$$R(\hat{h}) \leq \inf_{h \in H} R(h) + c_1 \sqrt{\frac{2\log(g(n, H))}{n}} + 2\varepsilon \qquad (1)$$
Furthermore, we introduced the concepts of shattering and of the VC-dimension, i.e., the maximal size of a set that can be shattered by $H$. The following lemma provides an upper bound on the growth function based on the VC-dimension.
Lemma 1 (Sauer's Lemma). Define $\Phi_d(n) := \sum_{i=0}^{d} \binom{n}{i}$. If the VC-dimension of a hypothesis class $H$ is $d$, then
$$g(n, H) \leq \Phi_d(n).$$
We will prove this lemma in the next section of this lecture. For now, we will demonstrate a few properties of the function Φd(n) and use them to derive the PAC bound similar to Equation 1 but in terms of the VC-dimension instead of the growth function.
If $n \leq d$, then $\Phi_d(n) = 2^n$. But if $n > d$,
$$\begin{aligned}
\Phi_d(n) &= \left(\frac{n}{d}\right)^d \sum_{i=0}^{d} \binom{n}{i}\left(\frac{d}{n}\right)^d \\
&\leq \left(\frac{n}{d}\right)^d \sum_{i=0}^{n} \binom{n}{i}\left(\frac{d}{n}\right)^i && \text{since } (d/n)^d \leq (d/n)^i \text{ for } i \leq d \\
&= \left(\frac{n}{d}\right)^d \left(1 + \frac{d}{n}\right)^n && \text{binomial expansion of } (1 + d/n)^n \\
&\leq \left(\frac{n}{d}\right)^d e^d = \left(\frac{en}{d}\right)^d && \text{since } (1 + x/n)^n \leq e^x
\end{aligned}$$
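Both properties are easy to check numerically. The following Python sketch (our own illustration; the function name `phi` and the tested ranges are arbitrary choices, not from the notes) verifies that $\Phi_d(n) = 2^n$ for $n \leq d$ and that $\Phi_d(n) \leq (en/d)^d$ for $n > d$ on a small grid.

```python
from math import comb, e

def phi(d, n):
    """Phi_d(n) = sum_{i=0}^{d} C(n, i), the bound in Sauer's lemma."""
    return sum(comb(n, i) for i in range(d + 1))

# Property 1: Phi_d(n) = 2^n whenever n <= d
# (math.comb(n, i) is 0 for i > n, so the sum runs over all subsets).
for d in range(1, 6):
    for n in range(d + 1):
        assert phi(d, n) == 2 ** n

# Property 2: Phi_d(n) <= (en/d)^d whenever n > d.
for d in range(1, 6):
    for n in range(d + 1, 50):
        assert phi(d, n) <= (e * n / d) ** d

print("both properties hold on the tested range")
```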
Thus, when $n > d$, the growth function is bounded by a polynomial in $n$ of degree $d$. We can combine this result with Equation 1 to obtain the following theorem:
Theorem 1 (PAC Bound for Finite VC-dim). Let $H$ be a hypothesis class with finite VC dimension $d$. Let $\hat{h}$ be obtained via ERM using $n$ i.i.d. samples, where $n \geq d$. Further, let $\varepsilon > 0$. Then, with probability at least $1 - 2e^{-2n\varepsilon^2}$,
$$R(\hat{h}) \leq \inf_{h \in H} R(h) + O\!\left(\sqrt{\frac{\log(n/d)}{n/d}}\right) + 2\varepsilon$$
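To get a feel for this rate, the short sketch below (a rough illustration of ours; the constant hidden in the $O(\cdot)$ is unspecified, so we simply set it to 1, and $d = 10$ is an arbitrary choice) tabulates the estimation-error term as $n$ grows.

```python
from math import log, sqrt

d = 10  # assumed VC dimension, chosen for illustration
for n in [100, 1_000, 10_000, 100_000]:
    # Estimation-error term from Theorem 1, with the O(.) constant taken as 1.
    term = sqrt(log(n / d) / (n / d))
    print(f"n = {n:>6}: estimation-error term ~ {term:.4f}")
```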
2 Proof of Sauer's Lemma
We will now provide a proof for Sauer’s lemma via a modified induction argument.
For $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ and $S_X = \{x_1, \ldots, x_n\} \in \mathcal{X}^n$, define $H(S_X) := \{[h(x_1), \ldots, h(x_n)] : h \in H\}$. The following claim will be useful in constructing our proof.
Claim 1. $g(n, H) = \max_{|S_X| = n} |H(S_X)|$
Proof. Recall that $L(S, H) = \{[\ell(h(x_1), y_1), \ldots, \ell(h(x_n), y_n)] : h \in H\}$. There exists a bijection between $L(S, H)$ and $H(S_X)$, so that $|L(S, H)| = |H(S_X)|$. Thus,
$$g(n, H) = \max_{|S| = n} |L(S, H)| = \max_{|S| = n} |H(S_X)| = \max_{|S_X| = n} |H(S_X)|.$$
The following example illustrates the bijection.
Example 2. Let $S = \{(x_1 = -1, y_1 = 0), (x_2 = 1, y_2 = 1)\}$ and $H_{\text{one-sided}} = \{h_a(x) = \mathbb{1}\{x \geq a\} : a \in \mathbb{R}\}$. Under the zero-one loss we have
$$L(S, H_{\text{one-sided}}) = \{[0, 1], [0, 0], [1, 0]\}, \qquad H_{\text{one-sided}}(S_X) = \{[0, 0], [0, 1], [1, 1]\}.$$
Clearly, there is a one-to-one correspondence/bijection between these two sets.
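The correspondence can also be verified by brute force. The following Python sketch (our own; the three threshold values are chosen to realize every achievable labeling) enumerates $H_{\text{one-sided}}(S_X)$ and $L(S, H_{\text{one-sided}})$ for Example 2 and confirms the two sets have the same size.

```python
S = [(-1, 0), (1, 1)]  # (x_i, y_i) pairs from Example 2

def h(a, x):
    """One-sided threshold classifier h_a(x) = 1{x >= a}."""
    return 1 if x >= a else 0

# Sweeping a over these thresholds realizes every achievable labeling.
thresholds = [-2, 0, 2]

labelings = {tuple(h(a, x) for x, _ in S) for a in thresholds}
losses = {tuple(int(h(a, x) != y) for x, y in S) for a in thresholds}

print("H(S_X)  =", sorted(labelings))  # [(0, 0), (0, 1), (1, 1)]
print("L(S, H) =", sorted(losses))     # [(0, 0), (0, 1), (1, 0)]
print("same size:", len(labelings) == len(losses))
```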
The setup for our proof of Sauer's lemma will be via induction on $k = n + d$, where $n$ is the number of i.i.d. samples and $d$ is the VC-dimension of our hypothesis class.
1. Base case: Show that Sauer's lemma holds
(a) for all $d$ and $n = 0$;
(b) for all $n$ and $d = 0$.
2. Inductive case: Let $k$ be some constant. Assume Sauer's lemma holds for all $n, d$ such that $n + d < k$. Show that Sauer's lemma holds for all $n, d$ such that $n + d = k$.

See Figure 1 for a visual demo of the induction strategy.

Figure 1: Visual demo of the proof by induction. The axes are $n$ and $d$. The gray region represents the base cases (where $n = 0$ or $d = 0$). The brownish region represents the induction hypothesis (where $n + d < k$). The purple region represents the inductive step (where $n + d = k$).

We will begin by proving the two base cases. For the first case, let $n = 0$; the VC dimension may be any non-negative integer. Then
$$\Phi_d(0) = \sum_{i=0}^{d} \binom{0}{i} = \binom{0}{0} + \sum_{i=1}^{d} \binom{0}{i} = 1 + 0 = 1.$$
Notice that $S_X$ must be empty when $|S_X| = 0$, and there is only one possible labeling of zero data points. Therefore, $H(S_X) = \{[\,]\}$ and $|H(S_X)| = 1$. Applying Claim 1, we see that $g(n, H) = 1$. Thus, $g(n, H) = \Phi_d(n)$, and Sauer's lemma is satisfied for $n = 0$.

Now we will consider the case where $d = 0$ and $n$ is any non-negative integer. Here
$$\Phi_0(n) = \sum_{i=0}^{0} \binom{n}{i} = \binom{n}{0} = 1.$$
The VC dimension of $H$ is 0, so the hypothesis class cannot shatter a set of size 1. Therefore, for any $x \in \mathcal{X}$, all classifiers in $H$ must generate the same label. It follows that for any $S_X = \{x_1, \ldots, x_n\} \in \mathcal{X}^n$, $|\{[h(x_1), \ldots, h(x_n)] : h \in H\}| = |H(S_X)| = 1$. Hence, $g(n, H) = \Phi_0(n)$, and the lemma is satisfied.

We will now prove the inductive case. Assume that Sauer's lemma holds for all $d, n$ where $d + n \leq k - 1$, and let $d, n$ be such that $d + n = k$. Let $S_X = \{x_1, \ldots, x_n\}$ be given. To begin with, we construct a new hypothesis class $G$, defined only on $\{x_1, \ldots, x_n\}$, as follows. For each $[y_1, \ldots, y_n] \in H(S_X)$, there exists $h \in H$ such that $[y_1, \ldots, y_n] = [h(x_1), \ldots, h(x_n)]$. Add one such $h$, restricted to the points in $S_X$, to $G$; that is, we add $g_h : S_X \to \{0, 1\}$, where $g_h(x) = h(x)$ for all $x \in S_X$ and $g_h$ is undefined elsewhere. Therefore, $G$ has exactly one function that generates each labeling in $H(S_X)$. It follows that $|G(S_X)| = |H(S_X)| = |G|$.

Next, we partition $G$ into the sets $G_1$ and $G_2$ using the following construction:
1. $G_1$: For every labeling of $\{x_1, \ldots, x_{n-1}\}$ realized by $G$, add one element of $G$ generating that labeling to $G_1$.
2. $G_2$: Let $G_2 = G \setminus G_1$.

The intuition behind this partition is to generate two hypothesis classes to which the inductive hypothesis applies: $G_1$ on one fewer sample, and $G_2$ on one fewer sample with VC dimension at most $d - 1$. We will then apply the inductive hypothesis to each hypothesis class and bound the growth function. To demonstrate how $G$ is constructed and partitioned, we present an example with a simple hypothesis class.

Example 3. Let $x_1, x_2 \in \mathbb{R}$ with $x_1 < x_2$, and let $H_{\text{one-sided}} = \{h_a(x) = \mathbb{1}\{x \geq a\} : a \in \mathbb{R}\}$. We can see that $H_{\text{one-sided}}(S_X) = \{[0, 0], [0, 1], [1, 1]\}$; due to the one-sided nature of $H_{\text{one-sided}}$, it is not possible to generate the labeling $[1, 0]$. Let $g_1$, $g_2$, and $g_3$ be classifiers generating the predictions $[0, 0]$, $[0, 1]$, and $[1, 1]$ respectively, and define $G = \{g_1, g_2, g_3\}$. One possible choice is the following:
1. $g_1 = g_{h_{x_2+1}}$, the function $h_{x_2+1}$ restricted to $\{x_1, x_2\}$. This function generates the label 0 for both $x_1$ and $x_2$, and is undefined for other values.
2. $g_2 = g_{h_{(x_1+x_2)/2}}$, the function $h_{(x_1+x_2)/2}$ restricted to $\{x_1, x_2\}$. This function generates the labels 0 and 1 for $x_1$ and $x_2$ respectively, and is undefined for other values.
3. $g_3 = g_{h_{x_1-1}}$, the function $h_{x_1-1}$ restricted to $\{x_1, x_2\}$. This function generates the label 1 for both $x_1$ and $x_2$, and is undefined for other values.

To construct $G_1$, we select one classifier for each labeling of $\{x_1\}$; the remaining classifiers define $G_2$. One possible partition is $G_1 = \{g_1, g_3\}$ and $G_2 = \{g_2\}$, as the sketch below illustrates.
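The following Python sketch is our own illustration (the concrete values $x_1 = 0$, $x_2 = 1$ and the representative thresholds are assumptions, not from the notes); it builds $G$ with one representative per labeling of $S_X$ and then peels off $G_1$, keeping one representative per labeling of $\{x_1\}$.

```python
x1, x2 = 0.0, 1.0  # assumed concrete points with x1 < x2
SX = [x1, x2]

def h(a, x):
    """One-sided threshold classifier h_a(x) = 1{x >= a}."""
    return 1 if x >= a else 0

# Build G: one representative threshold per achievable labeling of SX.
G = {}  # labeling of SX -> representative classifier restricted to SX
for a in [x2 + 1, (x1 + x2) / 2, x1 - 1]:  # realize [0,0], [0,1], [1,1]
    labeling = tuple(h(a, x) for x in SX)
    G.setdefault(labeling, {x: h(a, x) for x in SX})

# Partition: G1 keeps one element per labeling of {x1}; G2 is the rest.
G1, seen_prefixes = {}, set()
for labeling, g in G.items():
    prefix = labeling[:-1]  # the labeling of {x1, ..., x_{n-1}} = {x1}
    if prefix not in seen_prefixes:
        seen_prefixes.add(prefix)
        G1[labeling] = g
G2 = {lab: g for lab, g in G.items() if lab not in G1}

print("G  labelings:", sorted(G))   # [(0, 0), (0, 1), (1, 1)]
print("G1 labelings:", sorted(G1))  # [(0, 0), (1, 1)], i.e. g1 and g3
print("G2 labelings:", sorted(G2))  # [(0, 1)], i.e. g2
```

Running it prints $G_1$ containing the labelings $[0, 0]$ and $[1, 1]$ (i.e., $g_1$ and $g_3$) and $G_2$ containing $[0, 1]$ (i.e., $g_2$), matching the partition above.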
Claim 2. $|G_1(S_X)| = |G_1(\{x_1, \ldots, x_{n-1}\})|$

Proof. By construction, for every labeling $[g(x_1), \ldots, g(x_{n-1})] \in G_1(\{x_1, \ldots, x_{n-1}\})$, exactly one of $[g(x_1), \ldots, g(x_{n-1}), 0]$ or $[g(x_1), \ldots, g(x_{n-1}), 1]$ is in $G_1(S_X)$.

Claim 3. $|G_2(S_X)| = |G_2(\{x_1, \ldots, x_{n-1}\})|$

Proof. For every labeling $[g(x_1), \ldots, g(x_{n-1})]$ realized by $G$, exactly one of $[g(x_1), \ldots, g(x_{n-1}), 0]$ or $[g(x_1), \ldots, g(x_{n-1}), 1]$ is in $G_1(S_X)$. Therefore, $G_2(S_X)$ contains at most one extension of each such labeling, so distinct elements of $G_2(S_X)$ restrict to distinct labelings of $\{x_1, \ldots, x_{n-1}\}$.

We can apply the equality in Claim 2 to create a bound on $|G_1(S_X)|$:
$$\begin{aligned}
|G_1(S_X)| &= |G_1(\{x_1, \ldots, x_{n-1}\})| \\
&\leq g(n-1, G_1) && \text{definition of growth function} \\
&\leq \Phi_{d_{G_1}}(n-1) && \text{inductive hypothesis} \\
&\leq \Phi_d(n-1) && \Phi_d(n) \text{ increases with } d
\end{aligned}$$
To see why the inductive hypothesis applies to $g(n-1, G_1)$ and why $d_{G_1} \leq d$, consider $G_1$. This hypothesis class is a subset of $G$, so any set shattered by $G_1$ is also shattered by $G$; as a result, $d_{G_1} \leq d_G$. Similarly, any set shattered by $G$ is shattered by $H$, so $d_{G_1} \leq d_G \leq d$. Furthermore, the sum of the VC dimension and the number of samples in the second line is $d_{G_1} + n - 1 \leq d + n - 1 = k - 1$, so the inductive hypothesis indeed applies.

Now consider $G_2$. For every $g_2 \in G_2$, there exists $g_1 \in G_1$ which disagrees with $g_2$ only on $x_n$. Therefore, if $T_X \subseteq \{x_1, \ldots, x_{n-1}\}$ is shattered by $G_2$, then $T_X \cup \{x_n\}$ must be shattered by $G$. Because no set larger than $d$ can be shattered by $G$, $|T_X \cup \{x_n\}| \leq d$, i.e., $|T_X| \leq d - 1$. Hence, $d_{G_2} \leq d - 1$. We now apply this result together with Claim 3 to create a bound on $|G_2(S_X)|$:
$$\begin{aligned}
|G_2(S_X)| &= |G_2(\{x_1, \ldots, x_{n-1}\})| \\
&\leq g(n-1, G_2) && \text{definition of growth function} \\
&\leq \Phi_{d_{G_2}}(n-1) && \text{inductive hypothesis} \\
&\leq \Phi_{d-1}(n-1) && \Phi_d(n) \text{ increases with } d
\end{aligned}$$
With these results, we can prove the bound in Sauer's lemma:
$$\begin{aligned}
|H(S_X)| &= |G(S_X)| \\
&= |G_1(S_X) \cup G_2(S_X)| && \{G_1, G_2\} \text{ is a partition of } G \\
&= |G_1(S_X)| + |G_2(S_X)| && G_1(S_X) \text{ and } G_2(S_X) \text{ are disjoint} \\
&\leq \Phi_d(n-1) + \Phi_{d-1}(n-1) \\
&= \sum_{i=0}^{d} \binom{n-1}{i} + \sum_{i=0}^{d-1} \binom{n-1}{i} \\
&= \binom{n-1}{0} + \sum_{i=1}^{d} \binom{n-1}{i} + \sum_{i=1}^{d} \binom{n-1}{i-1} \\
&= \binom{n}{0} + \sum_{i=1}^{d} \left[\binom{n-1}{i} + \binom{n-1}{i-1}\right] \\
&= \binom{n}{0} + \sum_{i=1}^{d} \binom{n}{i} = \sum_{i=0}^{d} \binom{n}{i} = \Phi_d(n),
\end{aligned}$$
where the second-to-last equality uses Pascal's rule, $\binom{n-1}{i} + \binom{n-1}{i-1} = \binom{n}{i}$. Since $S_X \in \mathcal{X}^n$ is arbitrary, $g(n, H) = \max_{|S_X| = n} |H(S_X)| \leq \Phi_d(n)$. Therefore, Sauer's lemma holds in the inductive case.
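The step that closes the induction is the identity $\Phi_d(n-1) + \Phi_{d-1}(n-1) = \Phi_d(n)$. As a final sanity check, the following Python sketch (ours; the tested grid is arbitrary) confirms this recursion numerically.

```python
from math import comb

def phi(d, n):
    """Phi_d(n) = sum_{i=0}^{d} C(n, i)."""
    return sum(comb(n, i) for i in range(d + 1))

# Check the recursion Phi_d(n) = Phi_d(n-1) + Phi_{d-1}(n-1)
# that closes the inductive step.
for d in range(1, 8):
    for n in range(1, 30):
        assert phi(d, n) == phi(d, n - 1) + phi(d - 1, n - 1)

print("recursion verified on the tested grid")
```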