CS861: Theoretical Foundations of Machine Learning Lecture 4 – 09/13/2023 University of Wisconsin–Madison, Fall 2023
Lecture 04: Rademacher Complexity & Growth Function
Lecturer: Kirthevasan Kandasamy Scribed by: Yixuan Zhang, Elliot Pickens
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the instructor.
In this lecture, we first work through a simple example of the Empirical Rademacher Complexity (ERC). Then, we introduce the Rademacher Complexity, which can be applied to derive an upper bound for
E_S[ sup_{h∈H} ( Rˆ_S(h) − R(h) ) ]. After that, we will state a bound for PAC learning. Finally, we will introduce
the growth function.
1 Rademacher Complexity
Before introducing the Rademacher complexity, we first give a simple example to recap the Empirical Rademacher Complexity (ERC).
Figure 1: Two example threshold functions, where the hypothesis is either h(x) = I(x ≥ a) or h(x) = I(x ≤ a).

Consider the dataset S = {(x1 = 0, y1 = 0), (x2 = 1, y2 = 1)} and two hypothesis classes:

H1 = {h_a(x) = 1{x≥a} | ∀a ∈ R}   ("one-sided threshold")
H2 = H1 ∪ {h′_a(x) = 1{x≤a} | ∀a ∈ R}   ("two-sided threshold")

Theorem 1. Let hˆ denote the empirical risk minimizer over H and let ε > 0. Then, there exist universal constants c1, c2 such that with probability at least 1 − 2e^{−2nε²},

R(hˆ) ≤ inf_{h∈H} R(h) + c1 Rad_n(H) + c2 ε.
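To make the Rad_n(H) term concrete, here is a short Python sketch (not from the original notes) that brute-forces the empirical Rademacher complexity of H1 and H2 on the two data points x1 = 0 and x2 = 1 of S. The particular definition of the empirical Rademacher complexity used below is an assumption, since conventions differ (e.g., complexity of the hypothesis class versus the loss class).

from itertools import product

# A minimal sketch (assumption, not from the notes): brute-force
#   ERC_S(H) = E_sigma[ sup_{h in H} (1/n) * sum_i sigma_i * h(x_i) ],
# with sigma_i uniform on {-1, +1}, using the distinct prediction patterns
# each class realizes on the sample points.

def empirical_rademacher(patterns, n):
    """patterns: distinct prediction vectors (h(x_1), ..., h(x_n)) realizable by H."""
    total = 0.0
    for sigma in product([-1, +1], repeat=n):          # enumerate all 2^n sign vectors
        total += max(sum(s * p for s, p in zip(sigma, pat)) / n for pat in patterns)
    return total / 2 ** n                              # exact average over sigma

# On x1 = 0, x2 = 1 the one-sided thresholds 1{x >= a} realize the patterns
# (1, 1), (0, 1), (0, 0); the two-sided class additionally realizes (1, 0).
H1_patterns = [(1, 1), (0, 1), (0, 0)]
H2_patterns = H1_patterns + [(1, 0)]

print(empirical_rademacher(H1_patterns, 2))   # 0.375
print(empirical_rademacher(H2_patterns, 2))   # 0.5

As expected, the richer two-sided class has the larger empirical Rademacher complexity.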
We will prove this theorem in the next homework. The following ideas may be helpful in this proof.
• For the case that ∃h∗ ∈ H such that R(h∗) = inf_{h∈H} R(h), we can use the following decomposition:
R(hˆ) − R(h∗) = [R(hˆ) − Rˆ(hˆ)] + [Rˆ(hˆ) − R(h∗)] ≤ [R(hˆ) − Rˆ(hˆ)] + [Rˆ(h∗) − R(h∗)] =: T1 + T2,
where the inequality holds because hˆ minimizes the empirical risk, so Rˆ(hˆ) ≤ Rˆ(h∗).
By McDiarmid’s inequality, we can bound both T1 and T2; a rough sketch of how these bounds can be combined is given after this list.
• We also need to carefully deal with the case that ∄h∗ ∈ H such that R(h∗) = inf_{h∈H} R(h); this case will
not be shown here.
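As a rough, hedged sketch of how these pieces might combine (the constants below are illustrative and not taken from the original notes): for the 0-1 loss, Rˆ(h∗) is an average of n i.i.d. terms in [0, 1], so replacing one sample changes it by at most 1/n, and McDiarmid’s inequality gives

P(T2 ≥ ε) ≤ e^{−2nε²}.

Similarly, T1 ≤ sup_{h∈H} (R(h) − Rˆ(h)), and this supremum also changes by at most 1/n when a single sample is replaced; McDiarmid’s inequality, together with a bound of the form E_S[ sup_{h∈H} (R(h) − Rˆ(h)) ] ≤ c1 Rad_n(H), gives

P(T1 ≥ c1 Rad_n(H) + ε) ≤ e^{−2nε²}.

A union bound over these two events then yields R(hˆ) − R(h∗) ≤ c1 Rad_n(H) + 2ε with probability at least 1 − 2e^{−2nε²}, matching the form of the theorem.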
2 Growth Function
While the above bound is useful, computing Rad_n(H) can be difficult for general hypothesis classes. Hence, we will relate the Rademacher complexity to the VC dimension, which is easier to bound. For this, we will first define the growth function.
Definition 2 (Restriction of H to S). Given a sample S = {(x1, y1), . . . , (xn, yn)} and a hypothesis space H, define
L(S, H) = {[l(h(x1), y1), …, l(h(xn), yn)] | h ∈ H}
to be the set of all possible loss vectors of S given H, i.e. all possible loss vectors we can generate from
S by iterating over all h ∈ H.
For the 0-1 loss, the loss on each datapoint in the sample can take one of two values, since l(h(xi), yi) ∈ {0, 1}. This
allows us to easily bound the cardinality of L for the 0-1 loss as |L(S, H)| ≤ 2^n.
Now, let us go through a few examples of L.
Example 2. Let S = {(x1 = −1, y1 = 0), (x2 = 1, y2 = 1)} and let H_one-sided = {h_a(x) = 1{x≥a} | ∀a ∈ R} be the set of all "one-sided threshold functions." Then

L(S, H_one-sided) = {[0, 1], [1, 0], [0, 0]},

since we can misclassify a single point or no points, but it is not possible to misclassify both points with this hypothesis class. (Concretely, a ≤ −1 gives [1, 0], −1 < a ≤ 1 gives [0, 0], and a > 1 gives [0, 1].)

Figure 2: An example of an h ∈ H_one-sided that gives the [0, 0] loss vector.
Example 3. Let S = {(x1 = −1, y1 = 0), (x2 = 1, y2 = 1)} and let H_two-sided = H_one-sided ∪ {h′_a(x) = 1{x≤a} | ∀a ∈ R} be the set of all "two-sided threshold functions." Then

L(S, H_two-sided) = {[0, 0], [0, 1], [1, 0], [1, 1]},

since the added hypotheses h′_a can also misclassify both points (for instance, any a with −1 ≤ a < 1 gives the loss vector [1, 1]).

Figure 3: An example of an h ∈ H_two-sided that gives the [0, 0] loss vector.
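To complement Examples 2 and 3, the following short Python sketch (not part of the original notes) enumerates the loss vectors directly; the specific threshold values and helper names are hypothetical, chosen as arbitrary representatives of the regions determined by the two data points.

S = [(-1.0, 0), (1.0, 1)]          # the dataset from Examples 2 and 3
thresholds = [-2.0, 0.0, 2.0]      # one representative threshold a per region around the data points

def loss_vector(h, S):
    # 0-1 loss of hypothesis h on each labelled point of S
    return tuple(int(h(x) != y) for x, y in S)

one_sided = [lambda x, a=a: int(x >= a) for a in thresholds]              # h_a(x) = 1{x >= a}
two_sided = one_sided + [lambda x, a=a: int(x <= a) for a in thresholds]  # plus h'_a(x) = 1{x <= a}

print({loss_vector(h, S) for h in one_sided})   # {(1, 0), (0, 0), (0, 1)}
print({loss_vector(h, S) for h in two_sided})   # all four 0-1 loss vectors, including (1, 1)

This reproduces L(S, H_one-sided) = {[0, 1], [1, 0], [0, 0]} and shows that the two-sided class attains all 2^2 = 4 possible loss vectors.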