CS861: Theoretical Foundations of Machine Learning Lecture 1-2 – 09/06-08/2023 University of Wisconsin–Madison, Fall 2023
Lecture 01 & 02: PAC Learning
Lecturer: Kirthevasan Kandasamy Scribed by: Albert Dorador, Michael Harding
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the instructor.
In the first two lectures, we will introduce PAC Learning. We will first introduce some background definitions, then discuss empirical risk minimization, analysis of ERM, sub-Gaussian Random Variables, agnostic PAC bounds, and finally conclude with a discussion on approximation error vs. estimation error.
1 Background Definitions
We begin by laying out some important foundational definitions for discussing data and algorithms and for evaluating our methods via the simple, albeit instructive, example of the Binary Classification problem.
We first introduce the general concepts of the Input Space X (also known as the covariate, feature, etc. space) and the Label Space Y (response, output, target, etc.). In the case of binary classification, we have Y = {0, 1}. One common example for X is R^d, though specific data settings may of course result in a different X. With a given Input and Label Space pair, we then wish to assume that there exists some joint distribution PX,Y over ordered pairs (X, Y ) ∈ X × Y, such that our Observed Dataset,
S = {(X1,Y1),…,(Xn,Yn)}
is sampled independently from this distribution.
From here, we define a hypothesis to be any map h : X → Y, taking a member of the input space and
outputting its “estimated” label space value, and we consider the concept of Learning in the statistical / machine learning sense to be the act of finding a “good” hypothesis. This of course then motivates the questions: What constitutes a “good” hypothesis? And how do we compare one hypothesis to another? To answer these, we define the Risk of a hypothesis h to be
R(h) = E(X,Y )∼PX,Y [l(h(X), Y )],
where l : Y × Y → R+ is a predefined Loss Function. For our Binary Classification example, our hypotheses are functions that propose a “splitting” of the data into positive and negative (0 and 1) classes, and our goal is to learn a function (or set of functions) that produces a low probability of misclassification, motivating the 0-1 loss function,
l(h(X), Y) = I{h(X) ≠ Y}  ⇒  R(h) = P(h(X) ≠ Y)
2 Empirical Risk Minimization
In order to go about learning the “best” hypothesis, we recognize two important factors:
1. We must begin by defining a suitable hypothesis class H ⊂ {h : X → Y} that will be the set of “learnable” hypotheses for our problem setting.
2. We of course wish to minimize R(h) over all h ∈ H, but often PX,Y is unknown, and thus the risk R(h) is not calculable in general.
This motivates us to define the Empirical Risk of a hypothesis h to be
R̂(h) = (1/n) Σ_{i=1}^{n} l(h(Xi), Yi) = (1/n) Σ_{i=1}^{n} I{h(Xi) ≠ Yi} in the case of binary classification with the 0-1 loss.
Rather than the expected loss, we use the observed average loss as a stand-in; as a sample mean, it is unbiased, it concentrates around its expectation, and it matches the true Risk asymptotically by the Law of Large Numbers. Using this definition, we can then define the process of learning the “best” hypothesis ĥ via Empirical Risk Minimization (ERM) by letting
ĥ ∈ arg min_{h ∈ H} R̂(h).
(Note that we use the set notation ∈ due to the fact that we do not necessarily have a unique h ∈ H that minimizes the empirical risk.)
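As a concrete illustration, ERM over a finite hypothesis class can be sketched in a few lines of Python. This is only an illustrative sketch: the threshold classifiers, the data-generating threshold 0.3, and the function names are hypothetical choices, not part of the formal setup.

```python
import numpy as np

def empirical_risk(h, X, y):
    """Average 0-1 loss of hypothesis h on the sample (X, y)."""
    return np.mean(h(X) != y)

def erm(hypotheses, X, y):
    """Return a hypothesis with minimal empirical risk (ties broken by order)."""
    risks = [empirical_risk(h, X, y) for h in hypotheses]
    return hypotheses[int(np.argmin(risks))]

# Toy data: X uniform on [0, 1], labels y = 1{x > 0.3}.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=100)
y = (X > 0.3).astype(int)

# Finite class H of threshold classifiers h_t(x) = 1{x > t}, t = 0.0, 0.1, ..., 0.9.
H = [lambda x, t=t: (x > t).astype(int) for t in [k / 10 for k in range(10)]]
h_hat = erm(H, X, y)
print(empirical_risk(h_hat, X, y))  # 0.0, since the true threshold 0.3 is in H
```

Note that since the labeling rule itself belongs to H here, the minimal empirical risk is exactly zero, previewing the realizability assumption introduced below.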
Example 1 (Binary Classification ERM). Let X = R^2, Y = {0, 1}, and H = {h1, h2, h3} (as pictured below). As pictured, we have R̂(h3) > 0 and R̂(h1) = R̂(h2) = 0, giving us ĥ ∈ {h1, h2}. However, given the full distribution PX,Y , we have R(h2), R(h3) > 0 and R(h1) = 0, so h1 is clearly the “best” hypothesis in H, though we cannot uniquely identify ĥ = h1 using only this dataset S.
Figure 1: A simple binary classification example with input space X = R^2
3 Analysis of ERM
To facilitate our analysis of the efficacy of the ERM algorithm, we begin by making two important assump- tions:
1. We have a finite hypothesis class, i.e. |H| < ∞.
2. Our problem {X , Y, PX,Y , H} is Realizable, meaning ∃ h∗ ∈ H s.t. ∀ (x, y) ∈ supp(PX,Y ), h∗(x) = y.
These assumptions, while not necessary, greatly simplify our analysis and enable us to develop strong results. We will relax both assumptions later in class. The first assumption narrows the problem scope and allows us to control statistical bounds via |H|. The realizability assumption guarantees that there exists some hypothesis in our hypothesis class with 0 true risk.
It is also easy to see that due to realizability, we have R̂(h∗) = 0. Consequently, our ERM estimator ĥ has zero empirical risk (R̂(ĥ) = 0), as there is at least one hypothesis (namely h∗) in our hypothesis class with 0 empirical risk; however, we are not guaranteed R(ĥ) = 0, as we may select ĥ ≠ h∗. We saw this case in Example 1, where the problem was realizable via h1, but we had ĥ ∈ {h1, h2} under our particular dataset S. Because of this, we generally aim for statistical results that guarantee, with high probability, R(ĥ) ≤ ε for a sufficiently small tolerance ε > 0, dependent on our sample size n and hypothesis class H.
Below, we state our first theoretical result for ERM under both assumptions above.
Theorem 1. Suppose |H| < ∞ and the problem is realizable, and let ĥ ∈ H be chosen via ERM using a dataset of n i.i.d. samples. Then, for any ε ∈ (0, 1),
P_S(R(ĥ) < ε) ≥ 1 − |H| e^{−nε}
Proof. Define HB := {h ∈ H : R(h) > ε} to be the set of “bad” hypotheses (we call them “bad” because their risk exceeds our desired tolerance ε). Consider any h ∈ HB. Then R(h) > ε by construction. More concretely, with the standard 0-1 loss for binary classification, we have that, for any h ∈ HB,
R(h) = E(X,Y )∼PX,Y [I{h(X) ≠ Y}] = P(h(X) ≠ Y) > ε. Moreover, for any h ∈ HB,
P_S(R̂(h) = 0) = P_S(h(Xi) = Yi ∀ i) = ∏_{i=1}^{n} P(h(Xi) = Yi) ≤ (1 − ε)^n
Observe that the second equality above follows from the fact that the random vectors (Xi,Yi) are i.i.d. by initial assumption.
By the Realizability assumption, we know that there exists h∗ ∈ H such that R̂(h∗) = 0. Therefore, one would never pick an h ∈ H with R̂(h) > 0 as the empirical risk minimizer ĥ. Hence, we can define the good event G := {∀ h ∈ HB, R̂(h) > 0} (“good” since under G one never makes the mistake of selecting a hypothesis with large true risk as the empirical risk minimizer). That is, under G and Realizability, ĥ ∉ HB. Then
P_S(G^c) = P_S(∃ h ∈ HB : R̂(h) = 0) ≤ Σ_{h∈HB} P_S(R̂(h) = 0) ≤ Σ_{h∈HB} (1 − ε)^n = |HB| (1 − ε)^n
where the first inequality is a direct application of the union bound¹ and the second follows from our previous derivation. Observe that the above derivation implies that
P_S(G) ≥ 1 − |HB| (1 − ε)^n ≥ 1 − |H| (1 − ε)^n ≥ 1 − |H| e^{−nε}
¹The union bound states that if A1, . . . , AK are events, then P(∪_{k=1}^{K} Ak) ≤ Σ_{k=1}^{K} P(Ak).
where the last inequality follows by noting that (1 − ε)^n = e^{n ln(1−ε)} ≤ e^{−nε}, since ln(1 − ε) ≤ −ε for any 0 < ε < 1 (indeed, ln(1 − ε) = −ε − ε²/2 − ε³/3 − ε⁴/4 − · · ·).
Since, under G and Realizability, ĥ ∉ HB and hence R(ĥ) ≤ ε, the result we wished to prove then follows (the distinction between ≤ ε and < ε is immaterial, as ε > 0 is arbitrary). □
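To see the theorem in action, the following Monte Carlo sketch estimates the failure probability P_S(R(ĥ) ≥ ε) for a realizable threshold-classifier problem and compares it against the |H| e^{−nε} bound. This is an illustrative check, not part of the proof; the threshold class and all constants are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(1)
thresholds = [k / 10 for k in range(10)]  # finite class H of h_t(x) = 1{x > t}; |H| = 10
t_star, n, eps, trials = 0.3, 50, 0.15, 2000

failures = 0
for _ in range(trials):
    X = rng.uniform(0, 1, size=n)
    y = (X > t_star).astype(int)  # realizable: h_{t_star} is in H
    # ERM: pick a threshold with minimal empirical 0-1 loss
    risks = [np.mean((X > t).astype(int) != y) for t in thresholds]
    t_hat = thresholds[int(np.argmin(risks))]
    # For X ~ Unif(0, 1), the true risk of h_t is R(h_t) = |t - t_star|
    if abs(t_hat - t_star) >= eps:
        failures += 1

bound = len(thresholds) * np.exp(-n * eps)
print(failures / trials, "<=", bound)  # empirical failure rate vs. |H| e^{-n eps}
```

The empirical failure rate should come out well below the bound, consistent with the fact that the union-bound argument is typically loose in practice.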
Observe that there are three parameters that one can control: namely the amount of data n, the desired risk tolerance ε > 0, and the probability of error δ.
In particular, given n samples and δ ∈ (0, 1), ĥ satisfies
P_S( R(ĥ) < log(|H|/δ) / n ) ≥ 1 − δ,
which in effect corresponds to a tolerance of order 1/n (obtained by setting |H| e^{−nε} = δ and solving for ε).
Moreover, given δ ∈ (0, 1) and ε > 0, as long as n ≥ (1/ε) log(|H|/δ), i.e. we have a number of samples at least of order 1/ε, it holds that
P_S( R(ĥ) < ε ) ≥ 1 − δ.
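The sample-complexity condition above is easy to evaluate numerically; the helper below (a minimal sketch with a made-up function name) simply computes n = ⌈(1/ε) log(|H|/δ)⌉.

```python
import math

def pac_sample_size(num_hypotheses, eps, delta):
    """Samples sufficient for P_S(R(h_hat) < eps) >= 1 - delta
    in the finite, realizable setting: n >= (1/eps) * log(|H|/delta)."""
    return math.ceil(math.log(num_hypotheses / delta) / eps)

# e.g. |H| = 1000, eps = 0.05, delta = 0.01:
print(pac_sample_size(1000, 0.05, 0.01))  # 231
```

Note the mild logarithmic dependence on |H| and 1/δ: doubling the hypothesis class, or halving the allowed failure probability, costs only an additive log(2)/ε extra samples.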
The previous results illustrate the concept of “PAC learning”. PAC is an acronym for “Probably Approximately Correct”, which means that with high probability (“Probably”) the error our learning algorithm makes is small (i.e. it is “Approximately Correct”). The standard definition of PAC Learning requires a more technical characterization; please refer to Definition 2.3 in MRT and Definition 3.1 in SB.
In the next section we introduce a concept that will later come in handy.
4 Sub Gaussian Random Variables
Definition 1. If a random variable X satisfies
E[ e^{λ(X − EX)} ] ≤ e^{λ²σ²/2}
for all λ ∈ R, then we say it is a σ-sub Gaussian random variable.
Intuitively, X is a σ-sub Gaussian (henceforward, σ-sG) random variable if its tail decays at least as fast as that of a N(0, σ²) random variable.
Example 2 (Gaussian random variables are sub Gaussian). Let X ∼ N(μ, σ²). Then X is σ-sG.
Example 3 (Bounded random variables are sub Gaussian). If supp(X) ⊂ [a, b] for some −∞ < a ≤ b < ∞, then X is ((b − a)/2)-sG.
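As a numerical sanity check of Example 3 (not a proof), one can compare the moment generating function of a Rademacher random variable, which is bounded in [−1, 1] and hence should be 1-sub Gaussian, against the bound e^{λ²/2} on a grid of λ values:

```python
import numpy as np

# Rademacher X ∈ {-1, +1} with equal probability: EX = 0, supp(X) ⊂ [-1, 1],
# so Example 3 predicts sigma = (b - a)/2 = 1.
# Its MGF is E[e^{λX}] = (e^λ + e^{-λ})/2 = cosh(λ).
lambdas = np.linspace(-5, 5, 101)
mgf = np.cosh(lambdas)             # exact MGF (X is already centered)
bound = np.exp(lambdas**2 / 2)     # sub-Gaussian bound with sigma = 1
print(bool(np.all(mgf <= bound)))  # True
```

The inequality cosh(λ) ≤ e^{λ²/2} also follows analytically by comparing the two power series term by term, since (2k)! ≥ 2^k k!.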