CS861: Theoretical Foundations of Machine Learning Lecture 16 – 10/11/2023 University of Wisconsin–Madison, Fall 2023
Lecture 16: Lower bounds for prediction problems, Stochastic Bandits
Lecturer: Kirthevasan Kandasamy Scribed by: Ransheng Guan, Haoran Xiong
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the instructor.
In this lecture, we will continue our discussion on proving minimax lower bounds for prediction problems, and use it to prove a lower bound for classification in a VC class. Finally, we will briefly introduce Stochastic Bandits.
1 Excess risk in classification/regression (Cont’d)
Let Z be a data space, P be a family of distributions, and H be a hypothesis space. Let f : H × Z → R be the instance loss, where f(h, Z) is the loss of hypothesis h on instance Z. Let F(h, P) = E_{Z∼P}[f(h, Z)] be the population loss of hypothesis h on distribution P, and let L(h, P) = F(h, P) − inf_{h′∈H} F(h′, P) denote the excess population loss.
A dataset S is drawn from some P ∈ P, and an estimator ĥ maps the data to a hypothesis in H. The risk is
\[ R(\hat{h}, P) = \mathbb{E}[L(\hat{h}, P)] = \mathbb{E}[F(\hat{h}, P)] - \inf_{h \in \mathcal{H}} F(h, P), \]
and the minimax risk is
\[ R^\star = \inf_{\hat{h}} \sup_{P \in \mathcal{P}} R(\hat{h}, P). \]
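For concreteness, one instance of this framework is regression with the squared loss: take Z = (X, Y) and f(h, (X, Y)) = (h(X) − Y)^2, so that
\[ F(h, P) = \mathbb{E}_{(X,Y)\sim P}\big[(h(X) - Y)^2\big], \qquad L(h, P) = F(h, P) - \inf_{h' \in \mathcal{H}} F(h', P). \]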
Example 1 (Estimation error in a hypothesis class). Let H ⊆ {h : X → Y} be a hypothesis class.
Our estimator ĥ will choose some hypothesis in H using the data. We can now view L(h, P) as the estimation error. Recall that, letting h∗ denote the Bayes optimal classifier, we can write
\[ F(h, P) - F(h^*, P) = \underbrace{F(h, P) - \inf_{h' \in \mathcal{H}} F(h', P)}_{\text{estimation error} \,=\, L(h, P)} + \underbrace{\inf_{h' \in \mathcal{H}} F(h', P) - F(h^*, P)}_{\text{approximation error}}. \]
In Homework 1, we saw that for ERM, when H has VC dimension d_H, we have
\[ R(\hat{h}_{\mathrm{ERM}}, P) = \underbrace{\mathbb{E}_S[F(\hat{h}_{\mathrm{ERM}}(S), P)]}_{\mathbb{E}_S\left[\mathbb{E}_{(X,Y)\sim P}\left[\mathbf{1}\left(\hat{h}_{\mathrm{ERM}}(S)(X) \neq Y\right)\right]\right]} - \inf_{h \in \mathcal{H}} F(h, P) \;\in\; \tilde{O}\!\left(\sqrt{\frac{d_{\mathcal{H}}}{n}}\right). \]
We will use this framework to show a corresponding lower bound:
\[ \inf_{\hat{h}} \sup_{P \in \mathcal{P}} \left\{ \mathbb{E}_{S \sim P}[F(\hat{h}(S), P)] - \inf_{h' \in \mathcal{H}} F(h', P) \right\} \;\in\; \Omega\!\left(\sqrt{\frac{d_{\mathcal{H}}}{n}}\right). \]
To proceed, we will first define the separation of two distributions, with respect to a given hypothesis class and loss L.
Definition 1 (Separation). For two distributions P, Q, define the separation ∆(P, Q) as
\[ \Delta(P, Q) = \sup\big\{ \delta \geq 0 : \; L(h, P) \leq \delta \Rightarrow L(h, Q) \geq \delta \;\; \forall h \in \mathcal{H}, \;\text{ and }\; L(h, Q) \leq \delta \Rightarrow L(h, P) \geq \delta \;\; \forall h \in \mathcal{H} \big\}. \]
• We say P, Q are δ-separated if ∆(P, Q) ≥ δ, i.e. any hypothesis that does well on P (i.e. L(h, P) ≤ δ) does poorly on Q (i.e. L(h, Q) ≥ δ), and vice versa.
• We say a collection of distributions {P_1, · · · , P_N} is δ-separated if ∆(P_j, P_k) ≥ δ for all j ≠ k.
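As a quick illustration of the definition, take a single point X = {x_0}, the 0–1 loss, and H = {h_0, h_1} with h_b(x_0) = b. Let P(Y = 1) = 1/2 + γ and Q(Y = 1) = 1/2 − γ for some γ ∈ (0, 1/2). Then
\[ L(h_1, P) = 0, \quad L(h_0, P) = 2\gamma, \qquad L(h_0, Q) = 0, \quad L(h_1, Q) = 2\gamma, \]
so any h with L(h, P) < 2γ must be h_1, which satisfies L(h_1, Q) = 2γ; by symmetry the same holds with P and Q swapped, and hence P and Q are 2γ-separated.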
The following theorem can be proved using a technique similar to our previous theorem on reducing estimation to testing. You will do this in your homework.
Theorem 2 (Reduction to testing). Let {P1, · · · , PN } be a δ-separated subset of P. Let ψ be any test which maps the dataset to [N]. Then
\[ R^\star \;\geq\; \delta \, \inf_{\psi} \max_{j \in [N]} P_j(\psi \neq j). \]
We can then establish the following statements from the above result when S consists of n i.i.d data points.
Theorem 3 (Le Cam & Fano Method). 1. Le Cam: If {P_0, P_1} are δ-separated, then
\[ R^\star \;\geq\; \frac{\delta}{2}\, \| P_0 \wedge P_1 \| \;\geq\; \frac{\delta}{4}\, e^{-\mathrm{KL}(P_0, P_1)}. \]
Hence, for i.i.d. data S ∼ P^n, if KL(P_0, P_1) ≤ log(2)/n, then R∗ ≥ δ/8.
2. Local Fano Method: If {P_1, · · · , P_N} are δ-separated, then
\[ R^\star \;\geq\; \delta \left( 1 - \frac{\frac{1}{N^2} \sum_{j,k} \mathrm{KL}(P_j, P_k) + \log(2)}{\log(N)} \right). \]
Hence, for i.i.d. data S ∼ P^n, if KL(P_j, P_k) ≤ log(N)/(4n) for all j ≠ k and N ≥ 16, then R∗ ≥ δ/2.
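To see where the two corollaries come from, recall that the KL divergence tensorizes over i.i.d. data, i.e. KL(P_0^n, P_1^n) = n · KL(P_0, P_1), and apply the theorem to the product distributions. For Le Cam, if KL(P_0, P_1) ≤ log(2)/n, then
\[ R^\star \;\geq\; \frac{\delta}{4}\, e^{-n\,\mathrm{KL}(P_0, P_1)} \;\geq\; \frac{\delta}{4} \cdot \frac{1}{2} \;=\; \frac{\delta}{8}, \]
and for the local Fano method, if KL(P_j, P_k) ≤ log(N)/(4n) and N ≥ 16 (so that log(2) ≤ (1/4) log(N)),
\[ R^\star \;\geq\; \delta \left( 1 - \frac{\frac{1}{4}\log(N) + \log(2)}{\log(N)} \right) \;\geq\; \delta \left( 1 - \frac{1}{4} - \frac{1}{4} \right) \;=\; \frac{\delta}{2}. \]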
Remark While our focus is on prediction problems, this framework and these theorems apply to any problem for which
\[ \inf_{h \in \mathcal{H}} L(h, P) = 0 \quad \forall\, P \in \mathcal{P}. \]
2 Application: Classification in a VC class
We will now use the above results to prove a lower bound for classification in a VC class.
Theorem 4. Let P be the set of all distributions supported on X × {0,1}. Let H ⊆ {h : X → {0,1}} be a hypothesis class with VC dimension d ⩾ 8. Let S = {(X_1, Y_1), . . . , (X_n, Y_n)} be drawn i.i.d. from P, where P ∈ P. Then, for any estimator ĥ which maps the data set S to a hypothesis in H,
\[ R^\star = \inf_{\hat{h}} \sup_{P \in \mathcal{P}} \left\{ \mathbb{E}[F(\hat{h}, P)] - \inf_{h' \in \mathcal{H}} F(h', P) \right\} \;\geq\; C_1 \sqrt{\frac{d}{n}} \]
for some global constant C1.
Proof Our proof will follow the usual four-step recipe for applying the Fano/Le Cam methods.
Step 1: Construct alternatives.
Let X_d = {x_1, . . . , x_d} be a set of points shattered by H. Let γ ⩽ 1/4 be a value which will be specified later. Define
\[ \mathcal{P}' = \Big\{ P_\omega \;:\; P_\omega(X = x) = \tfrac{1}{d}\, \mathbf{1}\{x \in \mathcal{X}_d\}, \;\; P_\omega(Y = 1 \mid X = x_i) = \tfrac{1}{2} + (2\omega_i - 1)\gamma, \;\; \omega \in \Omega_d \Big\}, \]
where Ω_d is the Varshamov–Gilbert pruned hypercube of {0,1}^d, i.e. a subset of {0,1}^d of size at least 2^{d/8} whose elements are pairwise at Hamming distance at least d/4.
Remark To illustrate the above construction, consider the class of two-sided threshold classifiers with d = 2, i.e. X_2 = {x_1, x_2} ⊆ R. Let P_ω be the distribution for ω = (0, 1), with P_ω(X = x_1) = P_ω(X = x_2) = 1/2. Then the conditional distribution of Y is
\[ P_\omega(Y = 1 \mid X = x_1) = \tfrac{1}{2} - \gamma, \qquad P_\omega(Y = 1 \mid X = x_2) = \tfrac{1}{2} + \gamma. \]
Step 2: Lower bound the separation min_{ω,ω′} ∆(P_ω, P_ω′).
We have the following claim: for any P_ω, P_ω′ ∈ P′, the separation satisfies
\[ \Delta(P_\omega, P_{\omega'}) \;\geq\; \frac{\gamma}{d}\, H(\omega, \omega'), \]
where H(ω, ω′) denotes the Hamming distance. We will prove this claim in homework. Then, by the Varshamov–Gilbert lemma, we have
\[ \min_{\omega \neq \omega'} \Delta(P_\omega, P_{\omega'}) \;\geq\; \frac{\gamma}{d} \cdot \frac{d}{4} \;=\; \frac{\gamma}{4} \;\triangleq\; \delta. \]
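As a quick sanity check on the claim (the full argument is left to the homework): since X_d is shattered by H, we have inf_{h∈H} F(h, P_ω) = 1/2 − γ, and hence for any h ∈ H,
\[ L(h, P_\omega) \;=\; \frac{2\gamma}{d}\, \#\{ i : h(x_i) \neq \omega_i \}, \]
i.e. the excess risk of h under P_ω is proportional to the number of coordinates of ω on which h disagrees; the claim then follows from the triangle inequality for the Hamming distance.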
Step 3: Upper bound the KL divergence max_{ω,ω′} KL(P_ω, P_ω′). We have
\[ \begin{aligned}
\mathrm{KL}(P_\omega, P_{\omega'}) &= \mathbb{E}_{X,Y}\left[ \log \frac{P_\omega(X, Y)}{P_{\omega'}(X, Y)} \right] \\
&= \sum_{i=1}^{d} P_\omega(x_i) \sum_{y \in \{0,1\}} P_\omega(y \mid x_i) \log \frac{P_\omega(y \mid x_i)}{P_{\omega'}(y \mid x_i)} \qquad (\text{as } P_\omega(x) = P_{\omega'}(x)) \\
&= \sum_{i=1}^{d} \frac{1}{d}\, \mathbf{1}\{\omega_i \neq \omega'_i\} \left[ \left(\tfrac{1}{2} + \gamma\right) \log \frac{1/2 + \gamma}{1/2 - \gamma} + \left(\tfrac{1}{2} - \gamma\right) \log \frac{1/2 - \gamma}{1/2 + \gamma} \right] \\
&\leq C_2\, \frac{\gamma^2}{d}\, H(\omega, \omega').
\end{aligned} \]
Therefore, since H(ω, ω′) ⩽ d, we have
\[ \max_{\omega, \omega'} \mathrm{KL}(P_\omega, P_{\omega'}) \;\leq\; C_2 \gamma^2. \]
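To justify the constant C_2: for γ ⩽ 1/4, the bracketed term above satisfies
\[ \left(\tfrac{1}{2} + \gamma\right) \log \frac{1/2 + \gamma}{1/2 - \gamma} + \left(\tfrac{1}{2} - \gamma\right) \log \frac{1/2 - \gamma}{1/2 + \gamma} \;=\; 2\gamma \log \frac{1/2 + \gamma}{1/2 - \gamma} \;\leq\; 2\gamma \cdot \frac{2\gamma}{1/2 - \gamma} \;\leq\; 16 \gamma^2, \]
using log(1 + x) ⩽ x and 1/2 − γ ⩾ 1/4; so C_2 = 16 is one admissible choice.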
Step 4: To conclude the proof, we will choose γ = C_3 √(d/n) for a sufficiently small constant C_3. Then we have
\[ \max_{\omega, \omega'} \mathrm{KL}(P_\omega, P_{\omega'}) \;\leq\; \frac{C_4\, d}{4n} \;\leq\; \frac{\log(2^{d/8})}{4n} \;\leq\; \frac{\log |\mathcal{P}'|}{4n}, \]
where the last inequality is by the Varshamov–Gilbert lemma. Then, by the local Fano method, we have
\[ R^\star \;\geq\; \frac{\delta}{2} \;\geq\; C_5 \sqrt{\frac{d}{n}}. \]
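To spell out the constant bookkeeping (one admissible choice of constants, not the only one): with γ = C_3 √(d/n), Step 3 gives max_{ω,ω′} KL(P_ω, P_ω′) ⩽ C_2 C_3^2 · d/n, so it suffices to take C_3 small enough that C_2 C_3^2 ⩽ log(2)/32, which makes this quantity at most (d/8) log(2)/(4n) = log(2^{d/8})/(4n). The local Fano method then gives R∗ ⩾ δ/2 = γ/8 = (C_3/8)√(d/n), i.e. we may take C_5 = C_3/8.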
3 Stochastic Bandits
In the next series of lectures we will be discussing sequential/adaptive decision making problems in which there exists a sequence of interactions between a learner and an environment. Specifically, on round t, the learner chooses an action A_t ∈ A, where A is a set of possible actions. Then the environment reveals an observation O_t. In return, the learner receives a reward X_t = X_t(O_t, A_t). The learner's goal is to maximize the sum of rewards $\sum_{t=1}^{T} X_t$. Stochastic/adversarial bandits and online learning are typical examples of sequential/adaptive decision making problems. We will first focus on stochastic bandits.
A stochastic bandit problem will have the following components:
• Let ν = {νa, a ∈ A} denote a set of distributions indexed by actions in A. ν is called a bandit model
and is a subset of some family P.
• On round t, the learner chooses At ∈ A and observes a reward Xt sampled from νAt .
• The learner is characterized by a policy π = (π_t)_{t ∈ ℕ}, where π_t maps the history {(A_s, X_s)}_{s=1}^{t−1} to an action in A.
• If π is a randomized policy, π_t maps the history to a probability distribution on A, and an action is then sampled from this distribution. π can also be a deterministic policy.
• μa = EX∼νa [X] is defined to be the expected reward of the action a. Let a∗ ∈ arg maxa∈A μa be an
optimal action, and let μ∗ = μa∗ be the corresponding optimal value of the expected reward.
• Finally, we define the regret after T rounds of interaction as
\[ R_T = R_T(\pi, \nu) = T \mu^* - \mathbb{E}\left[ \sum_{t=1}^{T} X_t \right], \]
where E is with respect to the distribution of the action-reward sequence A_1, X_1, A_2, X_2, . . . , A_T, X_T induced by the interaction between the policy π and the bandit model ν. Here, μ_a, a∗, and μ∗ should be viewed as functions of the bandit model ν and can be written as μ_a(ν), a∗(ν), and μ∗(ν).
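It is often convenient to rewrite the regret in the following equivalent form (a standard identity, obtained by conditioning on A_t and using E[X_t | A_t] = μ_{A_t}):
\[ R_T = \sum_{a \in \mathcal{A}} \Delta_a\, \mathbb{E}[N_a(T)], \qquad \text{where } \Delta_a = \mu^* - \mu_a \text{ and } N_a(T) = \sum_{t=1}^{T} \mathbf{1}\{A_t = a\}, \]
so the regret is small precisely when suboptimal actions (those with large gaps Δ_a) are played rarely.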
When designing an algorithm for bandits, at the bare minimum, we will require R_T ∈ o(T), i.e. lim_{T→∞} R_T / T = 0. This means that, over time, the learner is eventually able to learn the optimal arm.
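To make the interaction protocol and the regret concrete, the following is a minimal simulation sketch in Python; the Bernoulli reward model and the ε-greedy policy with a decaying exploration rate ε_t ∝ t^{−1/3} are illustrative choices (not algorithms from this lecture), and the function and parameter names are ours.

import numpy as np

def simulate_bandit(means, T=20000, seed=0):
    """Simulate T rounds of learner-environment interaction on a Bernoulli bandit
    with arm means `means`, using an epsilon-greedy policy with eps_t ~ t^(-1/3).
    Returns the realized regret T*mu_star - sum_t X_t."""
    rng = np.random.default_rng(seed)
    K = len(means)
    counts = np.zeros(K)        # number of times each arm has been pulled
    sums = np.zeros(K)          # cumulative reward collected from each arm
    total_reward = 0.0
    for t in range(T):
        if t < K:                                            # pull every arm once to initialize
            a = t
        elif rng.random() < min(1.0, (t + 1) ** (-1 / 3)):   # explore with probability eps_t
            a = int(rng.integers(K))
        else:                                                # exploit the empirically best arm
            a = int(np.argmax(sums / counts))
        x = float(rng.random() < means[a])                   # reward X_t ~ Bernoulli(mu_a)
        counts[a] += 1
        sums[a] += x
        total_reward += x
    return T * max(means) - total_reward                     # realized analogue of R_T

print(simulate_bandit([0.5, 0.6]))                           # e.g., a two-armed instance

The decaying exploration rate is what keeps the realized regret sublinear in T for this simple policy (roughly of order T^{2/3}); a constant exploration rate would instead give regret growing linearly in T.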