
CS861: Theoretical Foundations of Machine Learning Lecture 22 – 10/25/2023 University of Wisconsin–Madison, Fall 2023
Lecture 22: Online learning, The experts problem
Lecturer: Kirthevasan Kandasamy Scribed by: Xinyan Wang, Zhifeng Chen
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the instructor.
In this lecture, we will introduce online learning and the experts problem. We will first complete the proof from the last lecture, which applies the martingale concentration result.
Proof Pick a round t ∈ {d+1, …, T} and any a ∈ A. By the L-Lipschitz property of f, we know
\[
\left| f(\theta_\star^\top a) - f(\hat\theta_{t-1}^\top a) \right| \;\le\; L \left| (\theta_\star - \hat\theta_{t-1})^\top a \right| \tag{1}
\]
Now we bound (θ_⋆ − θ̂_{t−1})^⊤ a. Using the assumption that f′ ≥ c, we have
\[
\nabla g_{t-1}(\theta) \;=\; \sum_{s=1}^{t-1} A_s A_s^\top f'(A_s^\top \theta) \;\succeq\; c \sum_{s=1}^{t-1} A_s A_s^\top \tag{2}
\]
As f′ is continuous, by the fundamental theorem of calculus,
\[
g_{t-1}(\theta_\star) - g_{t-1}(\hat\theta_{t-1}) \;=\; G_{t-1}\,(\theta_\star - \hat\theta_{t-1}),
\qquad \text{where } G_{t-1} = \int_0^1 \nabla g_{t-1}\!\left(s\theta_\star + (1-s)\hat\theta_{t-1}\right) \mathrm{d}s. \tag{3}
\]
By (2) and (3), G_{t−1} is invertible, so (θ_⋆ − θ̂_{t−1}) = G_{t−1}^{−1}\big(g_{t−1}(θ_⋆) − g_{t−1}(θ̂_{t−1})\big). We therefore have,
\[
\begin{aligned}
\left| (\theta_\star - \hat\theta_{t-1})^\top a \right|
&= \left| \left( g_{t-1}(\theta_\star) - g_{t-1}(\hat\theta_{t-1}) \right)^\top G_{t-1}^{-1} a \right| \\
&\le \left\| g_{t-1}(\theta_\star) - g_{t-1}(\hat\theta_{t-1}) \right\|_{G_{t-1}^{-1}} \, \|a\|_{G_{t-1}^{-1}}
&& \text{(Cauchy--Schwarz)} \\
&\le \frac{1}{c} \left\| g_{t-1}(\theta_\star) - g_{t-1}(\hat\theta_{t-1}) \right\|_{V_{t-1}^{-1}} \, \|a\|_{V_{t-1}^{-1}},
&& \text{as } G_{t-1} \succeq c V_{t-1} \;\Rightarrow\; G_{t-1}^{-1} \preceq \tfrac{1}{c} V_{t-1}^{-1}
\end{aligned} \tag{4}
\]
We will apply the martingale concentration result with t_0 = d, V_{t−1} = ∑_{s=1}^{t−1} A_s A_s^⊤, c² = max_{a∈A} a^⊤ a = d, and finally δ = 1/T². Then, with probability ≥ 1 − 1/T², by (1) and (4),
\[
\forall a \in \mathcal{A}, \quad
\left| f(\theta_\star^\top a) - f(\hat\theta_{t-1}^\top a) \right|
\;\le\;
\underbrace{\frac{2 L \sigma \gamma}{c} \sqrt{2 d \log(t) \log(d T^2)}}_{\rho(t)} \, \|a\|_{V_{t-1}^{-1}}.
\]
Applying a union bound over all t ∈ {d+1, …, T}, we have that, with probability at least 1 − 1/T, ∀a ∈ A and ∀t ∈ {d+1, …, T},
\[
\left| f(\theta_\star^\top a) - f(\hat\theta_{t-1}^\top a) \right| \;\le\; \rho(t) \, \|a\|_{V_{t-1}^{-1}}.
\]
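As a quick numerical illustration of the confidence width appearing in the final bound, here is a small Python sketch that computes ρ(t)·∥a∥_{V_{t−1}^{−1}} for a toy instance. The constants L, σ, γ, c, the dimension d, and the simulated past actions below are arbitrary illustrative choices, not values from the lecture.

```python
import math
import numpy as np

# Toy illustration of the width rho(t) * ||a||_{V_{t-1}^{-1}} from the bound above.
# All constants and past actions are illustrative placeholders.
L, sigma, gamma, c = 1.0, 1.0, 1.0, 0.5
d, T, t = 3, 1000, 50

rng = np.random.default_rng(0)
past_actions = rng.normal(size=(t - 1, d))     # A_1, ..., A_{t-1}

V = past_actions.T @ past_actions              # V_{t-1} = sum_s A_s A_s^T

# rho(t) = (2 L sigma gamma / c) * sqrt(2 d log(t) log(d T^2))
rho = (2 * L * sigma * gamma / c) * math.sqrt(2 * d * math.log(t) * math.log(d * T**2))

a = np.array([1.0, 0.0, 0.0])                  # some action a in A
a_norm = math.sqrt(a @ np.linalg.solve(V, a))  # ||a||_{V_{t-1}^{-1}}

print(rho * a_norm)  # width of the interval around f(theta_hat_{t-1}^T a)
```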
1 The Experts Problem
To motivate the ensuing model, we will begin with two examples.
Example 1 (Online spam detection). Given a hypothesis class H of binary classifiers, where H ⊆ {h : X → {0, 1}}. Consider the following game over T rounds:
1. A learner receives an input email x_t ∈ X on round t.
2. The learner chooses some h_t ∈ H and predicts h_t(x_t) (spam or not-spam).
3. The learner sees the label y_t and incurs loss 1{h_t(x_t) ≠ y_t}.
Note that the learner knows the loss for all h ∈ H, since after observing y_t it can evaluate every hypothesis on x_t.
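To make the full-information structure concrete, here is a minimal Python sketch of one round of this game. The toy hypothesis class and the set-of-words email representation are hypothetical illustrations, not part of the notes; the weather-forecasting example below is analogous, with loss l(y_t, ŷ_t) in place of the 0-1 loss.

```python
# One round of the online spam-detection game with full-information feedback.
# The hypotheses, email encoding, and labels below are illustrative toys.

# A "hypothesis" maps an email (here, a set of words) to {0, 1}.
H = {
    "flags_free":  lambda email: int("free" in email),
    "flags_money": lambda email: int("money" in email),
    "always_ham":  lambda email: 0,
}

def play_round(x_t, y_t, chosen):
    """Predict with the chosen hypothesis, then observe y_t.
    With full information the learner can compute the loss of every h in H."""
    prediction = H[chosen](x_t)
    incurred = int(prediction != y_t)                            # 1{h_t(x_t) != y_t}
    all_losses = {name: int(h(x_t) != y_t) for name, h in H.items()}
    return incurred, all_losses

# Example round: a spam email (y_t = 1) containing the word "free".
incurred, all_losses = play_round({"free", "offer"}, y_t=1, chosen="always_ham")
print(incurred)    # 1: the chosen hypothesis was wrong on this round
print(all_losses)  # the full loss vector over H, known to the learner
```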
Example 2 (Weather forecasting). Given a set of models H. Consider the following game over T rounds:
1. Learner (weather forecaster) chooses some model h_t ∈ H and predicts the number ŷ_t.
2. Learner observes the true weather y_t and incurs loss l(y_t, ŷ_t).
We can now introduce the experts problem, which proceeds over T rounds in the following fashion:
1. We are given a set of K experts, denoted [K].
2. On each round, the learner chooses an expert (action) A_t ∈ [K]. Simultaneously, the environment picks a loss vector l_t ∈ [0,1]^K, where l_t(i) is the loss of expert i.
3. Learner incurs loss l_t(A_t).
4. Learner observes l_t (the losses of all experts).
This type of feedback, where we observe the losses of all actions, is called full-information feedback. In contrast, when we observe losses or rewards only for the action we took, it is called bandit feedback.
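For reference, here is a minimal Python sketch of this interaction protocol with full-information feedback. The uniformly random learner and the random environment are placeholders used only to show the structure of the game, not algorithms studied in these notes.

```python
import random

K, T = 3, 5
cumulative_loss = [0.0] * K          # running total of each expert's loss

for t in range(1, T + 1):
    # Learner picks an expert (here: uniformly at random, as a stand-in policy).
    A_t = random.randrange(K)

    # Simultaneously, the environment picks a loss vector l_t in [0, 1]^K.
    l_t = [random.random() for _ in range(K)]

    incurred = l_t[A_t]              # loss suffered by the learner this round

    # Full-information feedback: the entire vector l_t is revealed.
    # (Under bandit feedback, only l_t[A_t] would be observed.)
    for a in range(K):
        cumulative_loss[a] += l_t[a]
```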
Unlike in the stochastic bandit setting, we will not assume that the loss vectors are drawn from some distribution. Then how do we define regret? Recall that in the stochastic setting, we let a_⋆ = arg min_{i∈[K]} E_{X∼ν_i}[X] be the action with the smallest expected loss and defined the regret as follows:
\[
R_T^{\mathrm{Stochastic}}(\pi, \nu) \;=\; \mathbb{E}\!\left[ \sum_{t=1}^{T} X_t \right] \;-\; T\, \mathbb{E}_{X \sim \nu_{a_\star}}[X]
\]
We did this for bandit feedback, but can define the regret similarly for full information feedback.
Here, in the non-stochastic setting, where loss vectors are arbitrary, we will compete against the best
fixed action in hindsight. For a policy π and a sequence of losses l_1, …, l_T, define
\[
R_T'(\pi, l) \;=\; \sum_{t=1}^{T} l_t(A_t) \;-\; \min_{a \in [K]} \sum_{t=1}^{T} l_t(a).
\]
For a stochastic policy, we will consider
\[
R_T(\pi, l) \;=\; \mathbb{E}\!\left[ R_T'(\pi, l) \right] \;=\; \mathbb{E}\!\left[ \sum_{t=1}^{T} l_t(A_t) \right] \;-\; \min_{a \in [K]} \sum_{t=1}^{T} l_t(a),
\]
where E is with respect to the randomness of the policy.
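For concreteness, here is a small Python sketch of how R'_T(π, l) would be computed from one realized run of a policy; the loss matrix and action sequence are hypothetical inputs. For a stochastic policy, R_T(π, l) would then be the average of this quantity over the policy's internal randomness.

```python
# Compute R'_T = sum_t l_t(A_t) - min_a sum_t l_t(a) for one realized run.
# `losses` (a T x K matrix) and `actions` are hypothetical inputs.

def empirical_regret(losses, actions):
    K = len(losses[0])
    incurred = sum(l_t[A_t] for l_t, A_t in zip(losses, actions))
    best_fixed = min(sum(l_t[a] for l_t in losses) for a in range(K))
    return incurred - best_fixed

# Toy run with T = 3 rounds and K = 2 experts.
losses = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
actions = [0, 1, 0]                       # the learner was wrong every round
print(empirical_regret(losses, actions))  # 3.0 - 1.0 = 2.0
```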
For a given policy π, we wish to bound R_T(π, l) for all loss sequences, that is, sup_l R_T(π, l). We wish to do well even if the losses were generated by an adversary which has full knowledge of our policy π. Here, we are concerned with oblivious adversaries, who may choose l_t using knowledge of the policy π and of the round index, but not of the learner's realized actions; equivalently, the entire loss sequence can be fixed before the game begins.
2 The Hedge Algorithm
The most intuitive approach to solve this problem is to choose, on round t, the action
\[
A_t \;=\; \arg\min_{a \in [K]} \sum_{s=1}^{t-1} l_s(a).
\]
This is called Follow The Leader (FTL). For instance, in the binary classification example, this would simply be empirical risk minimization, as we will choose
\[
h_t \;=\; \arg\min_{h \in H} \sum_{s=1}^{t-1} \mathbf{1}\{ h(x_s) \ne y_s \}.
\]
Unfortunately, this does not work. To see why, suppose K = 2, and define the loss vectors as follows:
\[
l_t \;=\;
\begin{cases}
(0.5,\, 0) & \text{if } t = 1 \\
(1,\, 0) & \text{if } t \text{ is odd and } t > 1 \\
(0,\, 1) & \text{if } t \text{ is even}
\end{cases}
\]
From round 2 onward, FTL will then choose A_t = 1 on odd rounds and A_t = 2 on even rounds, incurring loss 1 on every such round.
Then the total loss of FTL will be at least T − 1, while the best action in hindsight will have loss at most T/2. Hence, the regret of FTL is at least T/2 − 1 ∈ Ω(T).
In the Hedge algorithm, we will instead use a soft version of the minimum, where we will sample from a distribution which picks actions with small cumulative losses more frequently. We have summarized the Hedge algorithm below.
Algorithm 1 The Hedge Algorithm (a.k.a. multiplicative weights, a.k.a. exponential weights)
Given: time horizon T, learning rate η
Let L_0 ← 0_K (the all-zeros vector in R^K)
for t = 1, …, T do
    Set P_t(a) ← e^{−η L_{t−1}(a)} / ∑_{j=1}^{K} e^{−η L_{t−1}(j)}, ∀ a ∈ [K]
    Sample A_t ∼ P_t (note that P_t ∈ Δ_K)
    Incur loss l_t(A_t), observe l_t
    Update L_t(a) ← L_{t−1}(a) + l_t(a), ∀ a ∈ [K]
end for
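Below is a minimal Python sketch of Algorithm 1. The function name, the way the loss sequence is passed in, and the return value are illustrative implementation choices; the learning rate η is left as a parameter.

```python
import math
import random

def hedge(loss_vectors, eta):
    """A sketch of the Hedge algorithm (Algorithm 1).

    `loss_vectors` is the sequence l_1, ..., l_T (each a list of K losses in [0, 1]),
    revealed one round at a time; `eta` is the learning rate.
    Returns the total loss incurred by the learner.
    """
    K = len(loss_vectors[0])
    L = [0.0] * K                  # cumulative losses L_{t-1}(a)
    total = 0.0

    for l_t in loss_vectors:
        # P_t(a) proportional to exp(-eta * L_{t-1}(a))
        weights = [math.exp(-eta * L[a]) for a in range(K)]
        Z = sum(weights)
        P_t = [w / Z for w in weights]

        # Sample A_t ~ P_t and incur its loss.
        A_t = random.choices(range(K), weights=P_t)[0]
        total += l_t[A_t]

        # Full-information update: the whole vector l_t is observed.
        for a in range(K):
            L[a] += l_t[a]

    return total
```

Unlike FTL's hard argmin, the sampling distribution here varies smoothly with the cumulative losses, which is what prevents the round-to-round flip-flopping exploited in the counterexample above.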