
CS861: Theoretical Foundations of Machine Learning Lecture 1-2 – 09/06-08/2023 University of Wisconsin–Madison, Fall 2023
Lecture 01 & 02: PAC Learning
Lecturer: Kirthevasan Kandasamy Scribed by: Albert Dorador, Michael Harding
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the instructor.
In the first two lectures, we will introduce PAC Learning. We will first introduce some background definitions, then discuss empirical risk minimization, analysis of ERM, sub-Gaussian Random Variables, agnostic PAC bounds, and finally conclude with a discussion on approximation error vs. estimation error.
1 Background Definitions
We begin by laying out some important foundational definitions for discussing data and algorithms and for evaluating our methods via the simple, albeit instructive, example of the Binary Classification problem.
We first introduce the general concepts of the Input Space X (also known as the covariate, feature, etc. space) and the Label Space Y (response, output, target, etc.). In the case of binary classification, we have Y = {0, 1}. One common example for X is Rd, though specific data settings may of course result in a different X. With a given Input and Label Space pair, we then wish to assume that there exists some joint distribution PX,Y over ordered pairs (X, Y ) ∈ X × Y, such that our Observed Dataset,
S = {(X1,Y1),…,(Xn,Yn)}
is sampled independently from this distribution.
From here, we define a hypothesis to be any map h : X → Y, taking a member of the input space and
outputting its “estimated” label space value, and we consider the concept of Learning in the statistical / machine learning sense to be the act of finding a “good” hypothesis. This of course then motivates the questions: What constitutes a “good” hypothesis? And how do we compare one hypothesis to another? To answer these, we define the Risk of a hypothesis h to be
R(h) = E(X,Y )∼PX,Y [l(h(X), Y )],
where $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ is a predefined Loss Function. For our Binary Classification example, our hypotheses are functions that propose a "splitting" of the data into positive and negative (1 and 0) classes, and our goal is to learn a function (or set of functions) that produces a low probability of misclassification, motivating the 0-1 loss function,
$$l(h(X), Y) = \mathbb{I}\{h(X) \neq Y\} \quad \Rightarrow \quad R(h) = P(h(X) \neq Y).$$
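To make these definitions concrete, here is a minimal Python sketch (our own illustration, not part of the original notes) that estimates $R(h) = P(h(X) \neq Y)$ by Monte Carlo under an assumed toy distribution: $X \sim N(0,1)$ with $Y = \mathbb{I}\{X > 0\}$ flipped with probability 0.1. Both the distribution and the threshold hypothesis are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy distribution P_{X,Y}: X ~ N(0, 1), Y = 1{X > 0} with 10% label noise.
def sample(n):
    X = rng.normal(size=n)
    Y = (X > 0).astype(int)
    flip = rng.random(n) < 0.1          # flip each label with probability 0.1
    Y[flip] = 1 - Y[flip]
    return X, Y

# A hypothesis h : X -> {0, 1}; here, the threshold-at-zero classifier.
def h(x):
    return (x > 0).astype(int)

# Monte Carlo estimate of the risk R(h) = P(h(X) != Y) under the 0-1 loss.
X, Y = sample(1_000_000)
print("estimated R(h):", np.mean(h(X) != Y))   # should be close to 0.1
```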
2 Empirical Risk Minimization
In order to go about learning the “best” hypothesis, we recognize two important factors:
1. We must begin by defining a suitable hypothesis class H ⊂ {h : X → Y} that will be the set of “learnable” hypotheses for our problem setting.
2. We of course wish to minimize R(h) over all h ∈ H, but often PX,Y is unknown, and thus the risk R(h) is not calculable in general.
This motivates us to define the Empirical Risk of a hypothesis $h$ to be
$$\hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} l(h(X_i), Y_i) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{h(X_i) \neq Y_i\} \ \text{ in the case of binary classification.}$$
Rather than the expected loss, we use the observed average loss as a stand-in (which naturally carries with it the benefits of the sample mean, such as unbiasedness and concentration rates, and asymptotically matches the true Risk by the Law of Large Numbers). Using this definition, we can then define the process of learning the "best" hypothesis $\hat{h}$ via Empirical Risk Minimization (ERM) by letting
$$\hat{h} \in \arg\min_{h \in \mathcal{H}} \hat{R}(h).$$
(Note that we use the set notation ∈ due to the fact that we do not necessarily have a unique h ∈ H that minimizes the empirical risk.)
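As an illustration, here is a short Python sketch (a toy of our own construction, not the notes') performing ERM over a small finite class of threshold classifiers $h_t(x) = \mathbb{I}\{x > t\}$; the class and the data-generating process are both hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# A finite hypothesis class of threshold classifiers h_t(x) = 1{x > t}.
thresholds = np.linspace(-2.0, 2.0, 9)

def empirical_risk(t, X, Y):
    """Average 0-1 loss of h_t on the observed dataset S."""
    return np.mean(((X > t).astype(int)) != Y)

# A sample from a toy realizable distribution: Y = 1{X > 0}.
X = rng.normal(size=200)
Y = (X > 0).astype(int)

risks = np.array([empirical_risk(t, X, Y) for t in thresholds])
# ERM: any minimizer of the empirical risk (the minimizer need not be unique).
erm_thresholds = thresholds[risks == risks.min()]
print("empirical risks:", np.round(risks, 3))
print("ERM threshold(s):", erm_thresholds)
```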
Example 1 (Binary Classification ERM). Let $\mathcal{X} = \mathbb{R}^2$, $\mathcal{Y} = \{0, 1\}$, $\mathcal{H} = \{h_1, h_2, h_3\}$ (as pictured below). As pictured, we have $\hat{R}(h_3) > 0$ and $\hat{R}(h_1) = \hat{R}(h_2) = 0$, giving us $\hat{h} \in \{h_1, h_2\}$. However, we can also see that, given the full distribution $P_{X,Y}$, we have $R(h_2), R(h_3) > 0$ and $R(h_1) = 0$, so $h_1$ is clearly the "best" hypothesis in $\mathcal{H}$, though we cannot uniquely identify $\hat{h} = h_1$ using only this dataset $S$.
Figure 1: A simple binary classification example with input space $\mathcal{X} = \mathbb{R}^2$

3 Analysis of ERM
To facilitate our analysis of the efficacy of the ERM algorithm, we begin by making two important assumptions:
1. We have a finite hypothesis class, i.e. $|\mathcal{H}| < \infty$.
2. Our problem $\{\mathcal{X}, \mathcal{Y}, P_{X,Y}, \mathcal{H}\}$ is Realizable, meaning $\exists\, h^* \in \mathcal{H}$ s.t. $\forall\, (x, y) \in \mathrm{supp}(P_{X,Y})$, $h^*(x) = y$.
These assumptions, while not necessary, greatly simplify our analysis and enable us to develop strong results; we will relax both later in the course. The first assumption narrows the problem scope and allows us to control statistical bounds via $|\mathcal{H}|$. The realizability assumption guarantees that there exists some hypothesis in our hypothesis class with 0 true risk. It is also easy to see that, due to realizability, we have $\hat{R}(h^*) = 0$. Consequently, our ERM estimator $\hat{h}$ has zero empirical risk ($\hat{R}(\hat{h}) = 0$), as there is at least one hypothesis (namely $h^*$) in our hypothesis class with 0 empirical risk; however, we are not guaranteed $R(\hat{h}) = 0$, as we may select $\hat{h} \neq h^*$. We saw this case in Example 1, where the problem was realizable by $h_1$, but we had $\hat{h} \in \{h_1, h_2\}$ under our particular dataset $S$. Because of this, we generally aim for statistical results that guarantee, with high probability, $R(\hat{h}) \leq \varepsilon$ for a sufficiently small tolerance $\varepsilon > 0$, dependent on our sample size $n$ and hypothesis class $\mathcal{H}$.
Below, we state our first theoretical result for ERM under both assumptions above.
Theorem 1. Let $\hat{h} \in \mathcal{H}$ be chosen via ERM, using a dataset of $n$ samples. Furthermore, let $|\mathcal{H}| < \infty$. Then
$$P_S\big(R(\hat{h}) < \varepsilon\big) \geq 1 - |\mathcal{H}| e^{-n\varepsilon}.$$
Proof. Define $\mathcal{H}_B := \{h \in \mathcal{H} : R(h) \geq \varepsilon\}$ to be the set of "bad" hypotheses (we call them "bad" because their risk is at least our desired tolerance $\varepsilon$). Consider any $h \in \mathcal{H}_B$. Then $R(h) \geq \varepsilon$ by construction. More concretely, if we choose our loss function to be the standard 0-1 loss in binary classification problems, by construction we have that, for any $h \in \mathcal{H}_B$,
$$R(h) = E_{(X,Y) \sim P_{X,Y}}\big[\mathbb{I}\{h(X) \neq Y\}\big] = P\big(h(X) \neq Y\big) \geq \varepsilon.$$
Moreover, for any $h \in \mathcal{H}_B$,
$$P_S\big(\hat{R}(h) = 0\big) = P_S\big(h(X_i) = Y_i \ \forall\, i\big) = \prod_{i=1}^{n} P\big(h(X_i) = Y_i\big) \leq (1 - \varepsilon)^n.$$
Observe that the second equality above follows from the fact that the random vectors $(X_i, Y_i)$ are i.i.d. by our initial assumption.
By the Realizability assumption, we know that there exists $h^* \in \mathcal{H}$ such that $\hat{R}(h^*) = 0$. Therefore, one would never pick an $h \in \mathcal{H}$ with $\hat{R}(h) > 0$ to be the empirical risk minimizer $\hat{h}$. Hence, we can define the good event $G := \{\forall\, h \in \mathcal{H}_B,\ \hat{R}(h) > 0\}$ ("good" since under those conditions one would never make the mistake of selecting a hypothesis with large true risk as the empirical risk minimizer). That is, under $G$ and Realizability, $\hat{h} \notin \mathcal{H}_B$. Then
$$P_S(G^c) = P_S\big(\exists\, h \in \mathcal{H}_B : \hat{R}(h) = 0\big) \leq \sum_{h \in \mathcal{H}_B} P_S\big(\hat{R}(h) = 0\big) \leq \sum_{h \in \mathcal{H}_B} (1 - \varepsilon)^n = |\mathcal{H}_B| (1 - \varepsilon)^n,$$
where the second inequality follows from our previous derivation, and the first inequality is a direct application of the union bound1. Observe that the above derivation implies that
$$P_S(G) \geq 1 - |\mathcal{H}_B|(1 - \varepsilon)^n \geq 1 - |\mathcal{H}|(1 - \varepsilon)^n \geq 1 - |\mathcal{H}| e^{-n\varepsilon},$$
1 The union bound states that if $A_1, \ldots, A_K$ are events, then $P\big(\bigcup_{k=1}^{K} A_k\big) \leq \sum_{k=1}^{K} P(A_k)$.

where the last inequality follows by noting that $(1 - \varepsilon)^n = e^{n \ln(1 - \varepsilon)} \leq e^{-n\varepsilon}$, since $\ln(1 - \varepsilon) = -\varepsilon - \varepsilon^2/2 - \varepsilon^3/3 - \varepsilon^4/4 - \cdots \leq -\varepsilon$ for any $0 < \varepsilon < 1$. Since, under $G$ and Realizability, $\hat{h} \notin \mathcal{H}_B$ and hence $R(\hat{h}) < \varepsilon$, the result we wished to prove then follows.
Observe that there are three parameters that one can control: namely the amount of data $n$, the desired risk tolerance $\varepsilon > 0$, and the probability of error $\delta$.
In particular, given $n$ samples and $\delta \in (0, 1)$, $\hat{h}$ satisfies
$$P_S\left(R(\hat{h}) < \frac{\log(|\mathcal{H}|/\delta)}{n}\right) \geq 1 - \delta,$$
which in effect corresponds to a tolerance of order $1/n$. Moreover, given $\delta \in (0, 1)$ and $\varepsilon > 0$, as long as $n \geq \frac{1}{\varepsilon} \log(|\mathcal{H}|/\delta)$, i.e. we have a number of samples at least of order $1/\varepsilon$, it holds that
$$P_S\big(R(\hat{h}) < \varepsilon\big) \geq 1 - \delta.$$
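These two forms are rearrangements of the same condition $|\mathcal{H}| e^{-n\varepsilon} \leq \delta$. Here is a small Python sketch (ours; the plugged-in values of $|\mathcal{H}|$, $\varepsilon$, $\delta$ are arbitrary) that evaluates both:

```python
import math

def realizable_sample_size(H_size, eps, delta):
    """Smallest n with |H| e^{-n eps} <= delta, i.e. n >= (1/eps) log(|H|/delta)."""
    return math.ceil(math.log(H_size / delta) / eps)

def realizable_tolerance(H_size, n, delta):
    """Tolerance eps = log(|H|/delta) / n achievable with n samples."""
    return math.log(H_size / delta) / n

print(realizable_sample_size(1000, 0.05, 0.01))          # 231 samples suffice
print(round(realizable_tolerance(1000, 231, 0.01), 4))   # tolerance ~ 0.0498
```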
The previous results illustrate the concept of "PAC learning". PAC is an acronym for "Probably Approximately Correct", which means that with high probability ("Probably") the error our learning algorithm makes is small (i.e. it is "Approximately Correct"). The standard definition of PAC learning requires a more technical characterization; please refer to Definition 2.3 in MRT and Definition 3.1 in SB.
In the next section we introduce a concept that will later come in handy.
4 Sub-Gaussian Random Variables
Definition 1. If a random variable $X$ satisfies
$$E\big[e^{\lambda(X - E X)}\big] \leq e^{\lambda^2 \sigma^2 / 2} \quad \text{for all } \lambda \in \mathbb{R},$$
then we say it is a $\sigma$-sub-Gaussian random variable.
Intuitively, $X$ is a $\sigma$-sub-Gaussian (henceforward, $\sigma$-sG) random variable if its tail decays at least as fast as that of a $N(0, \sigma^2)$ random variable.
Example 2 (Gaussian random variables are sub-Gaussian). Let $X \sim N(\mu, \sigma^2)$. Then $X$ is $\sigma$-sG.
Example 3 (Bounded random variables are sub-Gaussian). If $\mathrm{supp}(X) \subset [a, b]$ for some $-\infty < a \leq b < \infty$, then $X$ is $\frac{b-a}{2}$-sG.
Lemma 1. Let $X$ be a $\sigma$-sG random variable. Then, for any $\varepsilon > 0$,
$$P(X - E X > \varepsilon) \leq e^{-\frac{\varepsilon^2}{2\sigma^2}}, \qquad P(X - E X < -\varepsilon) \leq e^{-\frac{\varepsilon^2}{2\sigma^2}}, \qquad P(|X - E X| > \varepsilon) \leq 2 e^{-\frac{\varepsilon^2}{2\sigma^2}}.$$
Proof. We will just prove the first statement above. Assume without loss of generality that $E X = 0$. Then, for any $\lambda > 0$,
$$P(X > \varepsilon) = P\big(e^{\lambda X} > e^{\lambda \varepsilon}\big) \leq E\big[e^{\lambda X}\big] e^{-\lambda \varepsilon} \leq e^{\frac{\lambda^2 \sigma^2}{2} - \lambda \varepsilon},$$
where the second inequality follows from the fact that $X$ is $\sigma$-sG, and the first inequality is a direct application of Markov's inequality: for any non-negative random variable $W$ and any $a > 0$, $P(W > a) \leq \frac{E W}{a}$.
Now, the statement above holds for any $\lambda > 0$, so in particular it holds for $\lambda = \varepsilon/\sigma^2$, for which the exponent becomes $-\frac{\varepsilon^2}{2\sigma^2}$, which then concludes our proof.
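As a quick numerical sanity check of the lemma (our own sketch; all parameters are arbitrary), note that by Example 3 each Bernoulli(1/2) draw is $\frac{1}{2}$-sG, and since a sum of independent sub-Gaussian variables is sub-Gaussian with parameters added in quadrature, the mean of $n$ i.i.d. draws is $\frac{1}{2\sqrt{n}}$-sG; the two-sided tail bound then specializes to Hoeffding's inequality $2e^{-2n\varepsilon^2}$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Each Bernoulli(1/2) draw lies in [0, 1], hence is (1/2)-sub-Gaussian (Example 3).
# The mean of n i.i.d. draws is then (1/(2*sqrt(n)))-sub-Gaussian.
n, eps, n_trials = 100, 0.1, 200_000
means = rng.binomial(1, 0.5, size=(n_trials, n)).mean(axis=1)

empirical_tail = np.mean(np.abs(means - 0.5) > eps)   # P(|mean - E mean| > eps)
sigma = 1 / (2 * np.sqrt(n))                          # sub-Gaussian parameter of the mean
bound = 2 * np.exp(-eps**2 / (2 * sigma**2))          # = 2 exp(-2 n eps^2)
print(f"empirical tail ~ {empirical_tail:.4f} <= bound = {bound:.4f}")
```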
5 Agnostic PAC Bounds
Previously, when we discussed PAC learning, we made two important assumptions: a finite hypothesis class $\mathcal{H}$, and realizability. Now, in agnostic PAC learning, we relax the second assumption. In other words, we do not require our dataset to be separable, either by a hypothesis in our class or by any hypothesis at all. More formally, realizability might fail for either (or both) of the following reasons.
a) $\exists\, h \in \{h : \mathcal{X} \to \mathcal{Y}\}$ such that $Y = h(X)$ for all $(X, Y) \in \mathrm{supp}(P_{X,Y})$, but $h \notin \mathcal{H}$. For example, if $\mathcal{H}$ is the set of linear classifiers and our data looks like the data in Figure 2 below, then clearly there is no $h \in \mathcal{H}$ that can separate the two classes (but there might be a non-linear classifier that can).
Figure 2: Non-realizability if our hypothesis class is the set of all linear classifiers
b) $\nexists\, h \in \{h : \mathcal{X} \to \mathcal{Y}\}$ such that $Y = h(X)$ for all $(X, Y) \in \mathrm{supp}(P_{X,Y})$, because the labels are stochastic (i.e. the same $x \in \mathcal{X}$ is randomly mapped to a potentially different label $y \in \mathcal{Y}$ in the next realization).

Figure 3: Non-realizability for any choice of hypothesis class (stochastic labels)

To deal with the possibility of our problem being non-realizable, we can still define
$$h^* \in \arg\min_{h \in \mathcal{H}} R(h),$$
now allowing for $R(h^*) > 0$. This will change some of the bounds from Theorem 1:
Theorem 2. Let $|\mathcal{H}| < \infty$, $\varepsilon > 0$, and let $\hat{h}$ be chosen via ERM using $n$ i.i.d. samples. Then
$$P\big(R(\hat{h}) \leq R(h^*) + 2\varepsilon\big) \geq 1 - 2|\mathcal{H}| e^{-2n\varepsilon^2}.$$
Proof. We begin by defining the good event
$$G = \bigcap_{h \in \mathcal{H}} \big\{|\hat{R}(h) - R(h)| \leq \varepsilon\big\},$$
i.e. the empirical risk of each hypothesis h ∈ H is within an ε-bound of its true risk. Under G, we have
Figure 4: An example of hypotheses and their associated real and empirical risks under the conditions of G.

$$R(\hat{h}) - R(h^*) = \underbrace{R(\hat{h}) - \hat{R}(\hat{h})}_{\leq\, \varepsilon \text{ under } G} + \underbrace{\hat{R}(\hat{h}) - R(h^*)}_{\leq\, \hat{R}(h^*) - R(h^*) \,\leq\, \varepsilon \text{ under } G} \leq 2\varepsilon,$$
where the second underbrace uses $\hat{R}(\hat{h}) \leq \hat{R}(h^*)$, which holds by the definition of ERM.
Thus, we wish to show $P(G^c) \leq 2|\mathcal{H}| e^{-2n\varepsilon^2}$. We begin by using a union bound to show
$$P(G^c) = P\Big(\bigcup_{h \in \mathcal{H}} \big\{|\hat{R}(h) - R(h)| > \varepsilon\big\}\Big) \leq \sum_{h \in \mathcal{H}} P\big(|\hat{R}(h) - R(h)| > \varepsilon\big).$$
From here, we point out that
$$\hat{R}(h) - R(h) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{h(X_i) \neq Y_i\} - E\big[\mathbb{I}\{h(X) \neq Y\}\big] = \frac{1}{n} \sum_{i=1}^{n} Z_i^{(h)} - E\big[Z_1^{(h)}\big],$$
where the $Z_i^{(h)}$ are i.i.d. Bernoulli random variables with probability of success $P_{(X,Y) \sim P_{X,Y}}(h(X) \neq Y)$. Then, we can either apply Hoeffding's inequality directly, or use the fact that the $Z_i^{(h)}$ are bounded in $[0, 1]$, and thus $\frac{1}{2}$-sub-Gaussian, to show
$$P\big(|\hat{R}(h) - R(h)| > \varepsilon\big) = P\left(\Big|\frac{1}{n} \sum_{i=1}^{n} Z_i^{(h)} - E\big[Z_1^{(h)}\big]\Big| > \varepsilon\right) \leq 2 e^{-2n\varepsilon^2}.$$
Thus, by this result and our previous work, we have
$$P(G^c) \leq \sum_{h \in \mathcal{H}} P\big(|\hat{R}(h) - R(h)| > \varepsilon\big) \leq \sum_{h \in \mathcal{H}} 2 e^{-2n\varepsilon^2} = 2|\mathcal{H}| e^{-2n\varepsilon^2},$$
as desired.

Corollary 1. This result presents 3 controllable parameters: the tolerance $\varepsilon > 0$, the sample size $n \in \mathbb{N}$, and the probability of error $\delta \in (0, 1)$. If we are given a fixed $n$ and $\delta$, then
$$P\left(R(\hat{h}) \leq R(h^*) + 2\sqrt{\frac{1}{2n} \log\frac{2|\mathcal{H}|}{\delta}}\right) \geq 1 - \delta.$$
Otherwise, if we are instead given a fixed $\varepsilon$ and $\delta$, then
$$n \geq \frac{1}{2\varepsilon^2} \log\frac{2|\mathcal{H}|}{\delta} \ \Rightarrow\ P\big(R(\hat{h}) \leq R(h^*) + 2\varepsilon\big) \geq 1 - \delta.$$
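Mirroring the realizable-case sketch from Section 3, this Python snippet (ours, with arbitrary example values) evaluates the agnostic-case corollary; note the $1/\varepsilon^2$ sample-size scaling versus $1/\varepsilon$ before.

```python
import math

def agnostic_sample_size(H_size, eps, delta):
    """n >= (1/(2 eps^2)) log(2|H|/delta) gives R(h_hat) <= R(h*) + 2 eps w.p. >= 1 - delta."""
    return math.ceil(math.log(2 * H_size / delta) / (2 * eps**2))

def agnostic_excess_risk(H_size, n, delta):
    """Excess-risk width 2 sqrt(log(2|H|/delta) / (2n)) achievable with n samples."""
    return 2 * math.sqrt(math.log(2 * H_size / delta) / (2 * n))

# With |H| = 1000, eps = 0.05, delta = 0.01 the agnostic case needs ~2442 samples,
# versus 231 in the realizable case for the same tolerance and confidence.
print(agnostic_sample_size(1000, 0.05, 0.01))            # 2442
print(round(agnostic_excess_risk(1000, 2442, 0.01), 4))  # ~0.1 = 2 * eps
```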
Given these results, we can then compare the relationships between ε, n, and δ in the Agnostic and
Realizable PAC learning cases, summarized in the table below:
                Agnostic                            Realizable
Fix n, δ        ε = O(√((1/n) log(1/δ)))            ε = O((1/n) log(1/δ))
Fix ε, δ        n ≥ O((1/ε²) log(1/δ))              n ≥ O((1/ε) log(1/δ))
This table illustrates that the guarantees for the agnostic case are weaker than the realizable case.

Figure 5: An example of approximation error incurred by working with a non-ideal H

6 Approximation Error vs. Estimation Error
We conclude by presenting a decomposition of the error between our estimate $\hat{h}$ and the "best" estimator. In the case of binary classification, it is provable (we will show this next lecture) that when $P_{X,Y}$ is known, the estimator with the minimum risk is the Bayes Classifier,
$$h_B(x) = \arg\max_{y \in \mathcal{Y}} P(Y = y \mid X = x) = \begin{cases} 1 & P(Y = 1 \mid X = x) \geq 1/2 \\ 0 & P(Y = 1 \mid X = x) < 1/2 \end{cases}$$
In our discussions thus far, we have mainly focused on "estimation error," which arises from only having access to $n$ data points and not the full distribution $P_{X,Y}$. On the other hand, we also have the case where our hypothesis class $\mathcal{H}$ does not contain the Bayes classifier, and thus we incur some amount of error just from the difference in risk between the Bayes classifier and $h^*$. An example of this is found in Figure 5, where the Bayes classifier $h_B$ is highly nonlinear, but our hypothesis class is $\mathcal{H} = \{\text{linear classifiers}\}$. We can decompose this error into "approximation" and "estimation" error as
$$R(\hat{h}) - R(h_B) = \underbrace{R(\hat{h}) - R(h^*)}_{\text{estimation error ("variance")}} + \underbrace{R(h^*) - R(h_B)}_{\text{approximation error ("bias")}},$$
likening each term to the typical "Bias-Variance" trade-off language. A visual example is provided below, where we can see that expanding from $\mathcal{H}$ to $\mathcal{H}'$ would reduce the approximation error, but would also then potentially increase the estimation error by having a larger class of hypotheses to "test" on the given data.

Figure 6: A visual representation of the Approximation vs. Estimation error trade-off
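To see the decomposition numerically, here is a final Python sketch on an assumed toy problem: $X \sim \mathrm{Uniform}(0,1)$ with stochastic labels $P(Y = 1 \mid X = x) = x$, so the Bayes classifier is $h_B(x) = \mathbb{I}\{x \geq 1/2\}$, and for a threshold rule $h_t(x) = \mathbb{I}\{x \geq t\}$ one can check that $R(h_t) = t^2 - t + 1/2$. Restricting to a deliberately poor class of two thresholds makes both error terms computable exactly; every modeling choice here is ours, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)

# True risk of the threshold rule h_t(x) = 1{x >= t} under this toy distribution:
# R(h_t) = integral_0^t x dx + integral_t^1 (1 - x) dx = t^2 - t + 1/2.
def true_risk(t):
    return t**2 - t + 0.5

# A deliberately poor hypothesis class that excludes the Bayes threshold 1/2.
H = np.array([0.0, 0.25])
h_star = H[np.argmin(true_risk(H))]              # best-in-class hypothesis (t = 0.25)

# ERM on a small sample drawn from the toy distribution.
n = 50
X = rng.uniform(size=n)
Y = (rng.uniform(size=n) < X).astype(int)        # P(Y = 1 | X = x) = x
emp_risks = [np.mean(((X >= t).astype(int)) != Y) for t in H]
h_hat = H[int(np.argmin(emp_risks))]

bayes_risk = true_risk(0.5)                      # = 1/4
print("estimation error:   ", true_risk(h_hat) - true_risk(h_star))
print("approximation error:", true_risk(h_star) - bayes_risk)
```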