CS861: Theoretical Foundations of Machine Learning Lecture 25 – 11/01/2023 University of Wisconsin–Madison, Fall 2023
Lecture 25: Online Convex Optimization
Lecturer: Kirthevasan Kandasamy Scribed by: Xindi Lin, Tony Chang Wang
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the instructor.
In this lecture, we introduce online convex optimization. We first present two motivating examples, online linear classification and the expert problem, and give a unified framework for online convex optimization. Then, we discuss two methods, Follow the Leader (FTL) and Follow the Regularized Leader (FTRL). Finally, we use several examples to show how to choose the regularizer.
1 Examples and Unified Framework
We will first present two examples to illustrate what online convex optimization is: online linear classification and the expert problem.
Example 1 (Online linear classification). Let Θ = {θ ∈ R^d : ∥θ∥_2 ≤ 1}. On each round t, the learner chooses some θ_t ∈ Θ. Simultaneously, the environment picks an instance (x_t, y_t) ∈ X × Y, where the domain X ⊆ R^d and Y = {+1, −1}. Then, the learner incurs the hinge loss l_t(θ_t) = max{0, 1 − y_t θ_t^⊤ x_t}. Finally, the learner observes the instance (x_t, y_t), and hence knows the loss for all θ ∈ Θ. The regret is defined as follows:
\[
R_T\big(\pi, \{(x_t, y_t)\}_{t=1}^T\big) = \sum_{t=1}^T l_t(\theta_t) - \min_{\theta \in \Theta} \sum_{t=1}^T l_t(\theta).
\]
Example 2 (The expert problem). Given K arms, denote ∆_K = {p ∈ R_+^K : p^⊤ 1 = 1}. On each round t, the learner chooses some p_t ∈ ∆_K. Simultaneously, the environment picks a loss vector l_t ∈ [0, 1]^K. Then, the learner incurs the loss p_t^⊤ l_t. Finally, the learner observes the loss vector l_t, and hence knows the loss for all p ∈ ∆_K. The regret is defined as follows:
\[
R_T(\pi, l) = \sum_{t=1}^T p_t^\top l_t - \min_{a \in [K]} \sum_{t=1}^T l_t(a) = \sum_{t=1}^T p_t^\top l_t - \min_{p \in \Delta_K} \sum_{t=1}^T p^\top l_t,
\]
where the equality min_{p∈∆_K} Σ_{t=1}^T p^⊤ l_t = min_{a∈[K]} Σ_{t=1}^T l_t(a) is easy to see by taking the derivative with respect to each coordinate of p in Σ_{t=1}^T p^⊤ l_t: the objective is linear, so its minimum over the simplex is attained at a vertex.
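As a quick numerical sanity check (ours, not part of the lecture), the Python snippet below verifies that no distribution over the arms can beat the best single arm for a linear loss; the variable names and the random loss sequence are illustrative.

import numpy as np

rng = np.random.default_rng(0)
K, T = 4, 50
losses = rng.uniform(0.0, 1.0, size=(T, K))   # loss vectors l_1, ..., l_T in [0, 1]^K
cumulative = losses.sum(axis=0)                # entry a holds sum_t l_t(a)

# For any p in the simplex, p @ cumulative is a convex combination of the entries,
# so it can never be smaller than the smallest entry (the best single arm).
random_p = rng.dirichlet(np.ones(K), size=10000)
assert np.all(random_p @ cumulative >= cumulative.min() - 1e-9)

# The minimum over the simplex is attained at the vertex e_a of the best arm a.
best_arm = np.argmin(cumulative)
print(cumulative.min(), np.eye(K)[best_arm] @ cumulative)   # the two values coincide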
We will now present a unified framework for online convex optimization.
Definition 1 (Online convex optimization). Consider the following framework. A learner is given a weight space Ω ⊆ R^d. On each round t, the learner chooses a weight vector w_t ∈ Ω. Simultaneously, the environment chooses a loss function f_t : Ω → R, a mapping from the weight space to the real line. Then the learner incurs the loss f_t(w_t). Finally, the learner observes the loss function f_t, and hence knows the value of f_t(w) for all w ∈ Ω.
In the above framework, if (1) the weight space Ω is convex and compact, and (2) the loss function ft at every round is convex, the framework is called online convex optimization.
Given a horizon T, the goal is to minimize the regret against the best fixed weight vector in Ω, where π denotes the policy for choosing the weight vector at each round:
\[
R_T(\pi, f) = \sum_{t=1}^T f_t(w_t) - \min_{w \in \Omega} \sum_{t=1}^T f_t(w).
\]
In Example 1, the ℓ2-ball is convex and compact, and the hinge loss is convex. In Example 2, ∆_K is convex and compact, and the loss p_t^⊤ l_t is a linear function of p_t and thus convex.
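To make the online protocol concrete, here is a minimal Python sketch (ours, not from the lecture) of the expert problem from Example 2, with a naive learner that plays the uniform distribution every round; the random loss sequence and names are illustrative.

import numpy as np

rng = np.random.default_rng(1)
K, T = 5, 1000
p_uniform = np.full(K, 1.0 / K)

learner_loss, cumulative = 0.0, np.zeros(K)
for t in range(T):
    l_t = rng.uniform(0.0, 1.0, size=K)   # environment picks a loss vector l_t in [0, 1]^K
    learner_loss += p_uniform @ l_t        # learner incurs p_t^T l_t (here p_t is uniform)
    cumulative += l_t                      # learner then observes l_t in full

regret = learner_loss - cumulative.min()   # compare against the best fixed arm in hindsight
print(f"regret of uniform play over T={T} rounds: {regret:.2f}")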
2 Follow the Regularized Leader
The most straightforward policy is Follow the Leader (FTL), where the weight w_t is chosen by
\[
w_t \in \operatorname*{argmin}_{w \in \Omega} \sum_{s=1}^{t-1} f_s(w),
\]
which is the best weight vector based on the observed loss functions. However, this is often a bad idea, as the chosen weight can fluctuate from round to round. Therefore, we stabilize FTL by adding a regularization term Λ(w):
\[
w_t \in \operatorname*{argmin}_{w \in \Omega} \left( \sum_{s=1}^{t-1} f_s(w) + \Lambda(w) \right).
\]
We call the above policy with the regularization term Follow the Regularized Leader (FTRL), and we will give its regret upper bound.
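As an illustration of the FTRL update (a minimal sketch under our own naming, not part of the lecture), the following Python function picks w_t by minimizing the cumulative observed loss plus the regularizer over a fine grid of Ω = [0, 1]; passing a zero regularizer recovers FTL.

import numpy as np

def ftrl_choose(losses_so_far, regularizer, grid=np.linspace(0.0, 1.0, 1001)):
    """Return w_t = argmin over the grid of sum_{s<t} f_s(w) + Lambda(w)."""
    objective = regularizer(grid)
    for f in losses_so_far:
        objective = objective + f(grid)
    return grid[np.argmin(objective)]

# After observing f_1(w) = w and f_2(w) = 1 - w, the cumulative loss is constant,
# so with Lambda(w) = (w - 1/2)^2 the next weight is the minimizer of the regularizer.
print(ftrl_choose([lambda w: w, lambda w: 1.0 - w], lambda w: (w - 0.5) ** 2))   # 0.5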
Theorem 3 (Regret Upper Bound for FTRL). For any u ∈ Ω, FTRL satisfies
\[
\sum_{t=1}^T f_t(w_t) - \sum_{t=1}^T f_t(u) \le \sum_{t=1}^T \big(f_t(w_t) - f_t(w_{t+1})\big) + \Lambda(u) - \min_{w \in \Omega} \Lambda(w).
\]
In particular, taking u to be the best fixed weight in hindsight,
\[
R_T(\mathrm{FTRL}, f) \le \sum_{t=1}^T \big(f_t(w_t) - f_t(w_{t+1})\big) + \Lambda(u) - \min_{w \in \Omega} \Lambda(w).
\]
N.B. We have not assumed convexity of Ω, ft, or Λ in the theorem.
Proof: By the definition of regret, the second claim follows from the first inequality by taking u to be the best fixed weight in hindsight. To prove the first inequality, we denote
\[
F_t(w) = \sum_{s=1}^{t} f_s(w) + \Lambda(w), \qquad \Phi_t = \min_{w \in \Omega} F_t(w) = F_t(w_{t+1}).
\]
Consider Φ_{t−1} − Φ_t. We have
\[
\Phi_{t-1} - \Phi_t = F_{t-1}(w_t) - F_t(w_{t+1}) = F_{t-1}(w_t) - \big(F_{t-1}(w_{t+1}) + f_t(w_{t+1})\big) = \big(F_{t-1}(w_t) - F_{t-1}(w_{t+1})\big) - f_t(w_{t+1}) \le -f_t(w_{t+1}),
\]
since F_{t−1}(w_t) ≤ F_{t−1}(w_{t+1}), as w_t minimizes F_{t−1} over Ω. Adding f_t(w_t) to both sides of this inequality, we have
\[
\Phi_{t-1} - \Phi_t + f_t(w_t) \le f_t(w_t) - f_t(w_{t+1}).
\]
Summing both sides over t = 1, …, T, we obtain
\[
\Phi_0 - \Phi_T + \sum_{t=1}^T f_t(w_t) \le \sum_{t=1}^T \big(f_t(w_t) - f_t(w_{t+1})\big).
\]
We can bound Φ_T and compute Φ_0 as follows:
\[
\Phi_T = \min_{w \in \Omega} \left( \sum_{s=1}^T f_s(w) + \Lambda(w) \right) \le \sum_{s=1}^T f_s(u) + \Lambda(u), \qquad \Phi_0 = \min_{w \in \Omega} \Lambda(w).
\]
Therefore, we have
\[
\sum_{t=1}^T f_t(w_t) - \sum_{s=1}^T f_s(u) - \Lambda(u) + \min_{w \in \Omega} \Lambda(w) \le \sum_{t=1}^T \big(f_t(w_t) - f_t(w_{t+1})\big).
\]
Rearranging gives
\[
\sum_{t=1}^T f_t(w_t) - \sum_{t=1}^T f_t(u) \le \sum_{t=1}^T \big(f_t(w_t) - f_t(w_{t+1})\big) + \Lambda(u) - \min_{w \in \Omega} \Lambda(w),
\]
which completes the proof.
• The above theorem implies that for Follow the Leader (FTL), i.e., FTRL with Λ ≡ 0,
\[
R_T(\mathrm{FTL}, f) \le \sum_{t=1}^T \big(f_t(w_t) - f_t(w_{t+1})\big).
\]
• If wt fluctuates frequently, the regret of FTRL/FTL will be bad.
• The purpose of the regularization term Λ(w) is to stabilize the chosen weights w_t.

3 Example Analysis: How a Regularizer is Chosen

To motivate how a regularizer is chosen, we will consider three examples with Ω = [0, 1] and f_t : [0, 1] → [0, 1].
3.1 Example 1: FTL with linear losses
First, let Ω = [0, 1]. Then we define f_t(w), ∀w ∈ Ω, as:
\[
f_t(w) = \begin{cases} \frac{1}{2} w & \text{if } t = 1 \\ w & \text{if } t \text{ is odd}, \ t > 1 \\ 1 - w & \text{if } t \text{ is even} \end{cases}
\]
First note that the sum of losses can be written as:
\[
F_t(w) = \sum_{s=1}^{t} f_s(w) = \begin{cases} \frac{1}{2} w + \frac{t-1}{2} & \text{if } t \text{ is odd} \\ -\frac{1}{2} w + \frac{t}{2} & \text{if } t \text{ is even} \end{cases}
\]
Hence, we have
\[
w_t = \operatorname*{argmin}_{w \in [0,1]} F_{t-1}(w) = \begin{cases} 1 & \text{if } t \text{ is odd} \\ 0 & \text{if } t \text{ is even} \end{cases}
\]
Therefore, we obtain the upper bound from Theorem 3:
\[
R_T \le \sum_{t=1}^T \big(f_t(w_t) - f_t(w_{t+1})\big) = \sum_{t \le T,\ t \text{ odd}} \big(f_t(w_t) - f_t(w_{t+1})\big) + \sum_{t \le T,\ t \text{ even}} \big(f_t(w_t) - f_t(w_{t+1})\big),
\]
where each term with t > 1 equals 1: on odd rounds the learner plays w_t = 1 and f_t(w_t) = 1, on even rounds it plays w_t = 0 and f_t(w_t) = 1, while f_t(w_{t+1}) = 0 in both cases.
The bound given by the theorem is O(T). Moreover, it is not hard to see that the actual regret is also large: the total loss of FTL is at least T − 1, while the best action in hindsight has loss at most T/2. Therefore, Regret ≥ T/2 − 1, so the bound on R_T is fairly tight. Linear losses are a bad use case for FTL.
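A small simulation (ours, for illustration) of FTL on these alternating linear losses over a discretized Ω = [0, 1] confirms the linear regret growth; ties in the argmin are broken towards the first grid point.

import numpy as np

T = 1000
grid = np.linspace(0.0, 1.0, 1001)

def f(t, w):
    # f_1(w) = w/2, f_t(w) = w for odd t > 1, f_t(w) = 1 - w for even t
    if t == 1:
        return 0.5 * w
    return w if t % 2 == 1 else 1.0 - w

cumulative = np.zeros_like(grid)
learner_loss = 0.0
for t in range(1, T + 1):
    w_t = grid[np.argmin(cumulative)]   # FTL: minimize the observed cumulative loss
    learner_loss += f(t, w_t)
    cumulative += f(t, grid)

print(f"FTL regret at T={T}: {learner_loss - np.min(cumulative):.1f}")   # roughly T/2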
3.2 Example 2: FTL with quadratic losses
Let Ω = [0, 1], and we define f_t(w), ∀w ∈ Ω, as follows:
\[
f_t(w) = \begin{cases} w^2 & \text{if } t \text{ is odd} \\ (1 - w)^2 & \text{if } t \text{ is even} \end{cases}
\]
so that the cumulative loss is
\[
F_t(w) = \sum_{s=1}^{t} f_s(w) = \begin{cases} \frac{t+1}{2} w^2 + \frac{t-1}{2} (1-w)^2 & \text{if } t \text{ is odd} \\ \frac{t}{2} \left( w^2 + (1-w)^2 \right) & \text{if } t \text{ is even} \end{cases}
\]
Similar to the previous example, the best action for a given round oscillates between 0 and 1. However, we will see that the regret is not large.
\[
w_{t+1} = \operatorname*{argmin}_{w \in [0,1]} F_t(w) = \begin{cases} \frac{t-1}{2t} = \frac{1}{2} - \frac{1}{2t} & \text{if } t \text{ is odd} \\ \frac{1}{2} & \text{if } t \text{ is even} \end{cases}
\]
We see that the choices made by FTL do not oscillate much, with w_t → 1/2 as t → ∞. We have the following upper bound:
\[
R_T \le \sum_{t=1}^T \big(f_t(w_t) - f_t(w_{t+1})\big)
= \sum_{t \le T,\ t \text{ odd}} \left( \left(\tfrac{1}{2}\right)^2 - \left(\tfrac{1}{2} - \tfrac{1}{2t}\right)^2 \right)
+ \sum_{t \le T,\ t \text{ even}} \left( \left(\tfrac{1}{2} + \tfrac{1}{2(t-1)}\right)^2 - \left(\tfrac{1}{2}\right)^2 \right)
= \sum_{t=1}^T \left( \frac{1}{2t} + O\!\left(\frac{1}{t^2}\right) \right) \in O(\log T).
\]
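Repeating the same simulation (ours, for illustration) with the quadratic losses shows the much slower, logarithmic growth of FTL's regret.

import numpy as np

grid = np.linspace(0.0, 1.0, 2001)

def f(t, w):
    # f_t(w) = w^2 for odd t, (1 - w)^2 for even t
    return w ** 2 if t % 2 == 1 else (1.0 - w) ** 2

for T in (100, 1000, 10000):
    cumulative = np.zeros_like(grid)
    learner_loss = 0.0
    for t in range(1, T + 1):
        w_t = grid[np.argmin(cumulative)]   # FTL step
        learner_loss += f(t, w_t)
        cumulative += f(t, grid)
    print(T, round(learner_loss - np.min(cumulative), 2))   # grows like log T, not like T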
3.3 Example 3: FTRL with linear losses
For our final example, we will revisit the linear losses of Example 1, but add a regularizer to stabilize the fluctuations. Since quadratic losses achieved small regret, let us try Λ(w) = (1/η)(w − 1/2)², where η will be chosen later. We define f_t as in Example 1, namely, ∀w ∈ Ω = [0, 1]:
\[
f_t(w) = \begin{cases} \frac{1}{2} w & \text{if } t = 1 \\ w & \text{if } t \text{ is odd}, \ t > 1 \\ 1 - w & \text{if } t \text{ is even} \end{cases}
\]
Then we have F_t(w):
\[
F_t(w) = \sum_{s=1}^{t} f_s(w) + \Lambda(w) = \begin{cases} \frac{1}{2} w + \frac{t-1}{2} + \frac{1}{\eta}\left(w - \frac{1}{2}\right)^2 & \text{if } t \text{ is odd} \\ -\frac{1}{2} w + \frac{t}{2} + \frac{1}{\eta}\left(w - \frac{1}{2}\right)^2 & \text{if } t \text{ is even} \end{cases}
\]
Hence we get (assuming η ≤ 2, so that these points lie in [0, 1]):
\[
w_t = \operatorname*{argmin}_{w \in [0,1]} F_{t-1}(w) = \begin{cases} \frac{1}{2} + \frac{\eta}{4} & \text{if } t \text{ is odd} \\ \frac{1}{2} - \frac{\eta}{4} & \text{if } t \text{ is even} \end{cases}
\]
Then we have the following upper bound on the regret. Define
\[
B := \max_{w \in [0,1]} \frac{1}{\eta}\left(w - \frac{1}{2}\right)^2 - \min_{w \in [0,1]} \frac{1}{\eta}\left(w - \frac{1}{2}\right)^2 = \frac{1}{4\eta}.
\]
By Theorem 3, since Λ(u) − min_{w∈[0,1]} Λ(w) ≤ B, we have
\[
R_T \le \sum_{t=1}^T \big(f_t(w_t) - f_t(w_{t+1})\big) + B
\le \sum_{t=1}^T \left( \left(\frac{1}{2} + \frac{\eta}{4}\right) - \left(\frac{1}{2} - \frac{\eta}{4}\right) \right) + B
= \frac{T\eta}{2} + \frac{1}{4\eta},
\]
where each term f_t(w_t) − f_t(w_{t+1}) equals η/2 for t > 1 (it is w_t − w_{t+1} on odd rounds and w_{t+1} − w_t on even rounds) and is at most η/2 for t = 1.
Next, we choose η = 1/√T. Based on the regret upper bound we just showed, we have
\[
R_T \le \frac{\sqrt{T}}{2} + \frac{\sqrt{T}}{4} = \frac{3}{4}\sqrt{T} \in O\big(\sqrt{T}\big).
\]
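Finally, a simulation (ours, for illustration) of FTRL with the quadratic regularizer Λ(w) = (1/η)(w − 1/2)² and η = 1/√T on the linear losses of Example 1 exhibits the predicted O(√T) regret.

import numpy as np

grid = np.linspace(0.0, 1.0, 2001)

def f(t, w):
    # the linear losses of Example 1
    if t == 1:
        return 0.5 * w
    return w if t % 2 == 1 else 1.0 - w

for T in (100, 1000, 10000):
    eta = 1.0 / np.sqrt(T)
    reg = (grid - 0.5) ** 2 / eta            # Lambda(w) = (1/eta) (w - 1/2)^2
    cumulative = np.zeros_like(grid)
    learner_loss = 0.0
    for t in range(1, T + 1):
        w_t = grid[np.argmin(cumulative + reg)]   # FTRL step
        learner_loss += f(t, w_t)
        cumulative += f(t, grid)
    print(T, round(learner_loss - np.min(cumulative), 2))   # grows like sqrt(T) (about sqrt(T)/4 here)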
Some take-aways from the examples above:
• Linear functions have bad behaviour in FTL due to the instability of the chosen w_t.
• We should add a "nice" regularizer to stabilize oscillations ("nice" means strong convexity here).