CS861: Theoretical Foundations of Machine Learning Lecture 26 – 03/11/2023 University of Wisconsin–Madison, Fall 2023
Lecture 26: Online Convex Optimization (continued)
Lecturer: Kirthevasan Kandasamy Scribed by: Haoyue Bai & Deep Patel
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the instructor.
We will return to contextual bandits next week. Right now, we will continue studying online convex optimization: A learner is given a convex set Ω ⊆ Rd, and at each round t, the learner chooses an action wt ∈ Ω. Simultaneously, the environment picks a convex loss function ft : Ω → R, so that the player incurs the loss ft(wt). We also assume that the player gets to observe all ft's. So far, we have looked at the Follow-The-Regularized-Leader (FTRL) framework, which helps the learner choose an action wt ∈ Ω such that its regret w.r.t. the best action in hindsight is minimized, by introducing an appropriate additive regularizer term to the losses incurred.
We concluded the last lecture by looking at three examples in order to motivate how to pick these regularizers. In particular, we saw that Follow-The-Leader (FTL) has bad regret (RT ∈ O(T)) when the losses are linear, whereas FTL does well when the losses are quadratic (RT ∈ O(log T)). Next, we saw that with linear losses and quadratic regularizers, FTRL actually has better regret compared to FTL with linear losses (i.e., RT ∈ O(√T)). However, it is not clear what type of regularizer to use for a given loss function. We will now formalize this process of choosing an appropriate regularizer given a convex loss function so as to ensure good regret bounds under the FTRL framework. Specifically, in this lecture, we will first present a primer on some properties of convex functions, and then study the framework of Follow-The-Regularized-Leader (FTRL) with convex losses and strongly convex regularizers.
1 A primer on properties of convex functions
Definition 1 (Convex function). We will present two equivalent definitions of convex functions:

(i.) A function f : Ω → R is convex if Ω is a convex set and, ∀α ∈ [0,1] and ∀u, v ∈ Ω, we have:

f(αu + (1 − α)v) ≤ αf(u) + (1 − α)f(v).

(ii.) Equivalently, f is convex if ∀ w ∈ Ω, ∃ g ∈ Rd s.t. ∀ w′ ∈ Ω, we have:

f(w′) ≥ f(w) + gᵀ(w′ − w).
Definition 2 (Sub-gradients and sub-differential). We will present the definitions of sub-gradients and the sub-differential.

(i.) Any g ∈ Rd which satisfies (ii) in the above definition is called a sub-gradient of f at w.

(ii.) The set of sub-gradients of f at w is called the sub-differential of f at w and is denoted ∂f(w).

Remark 1.1. Some facts about sub-gradients (a small numerical illustration follows this list):
(i.) If f is differentiable, ∂f(w) = {∇f(w)}.
Figure 1: The blue curve above depicts a convex function f, whereas the red lines denote first-order linear underestimators of f. These linear underestimators also correspond to three of the uncountably many possible sub-gradients at the non-differentiable point (0.8, 0.52) on f.
(ii.) 0 ∈ ∂f(w) ⇔ w ∈ argmin_{w′∈Ω} f(w′).
(iii.) For finite-valued convex functions1 f1, f2 and positive scalars α1, α2, if g1 ∈ ∂f1(w) and g2 ∈ ∂f2(w), then α1g1 + α2g2 ∈ ∂h(w), where h = α1f1 + α2f2.
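To make the first-order characterization and fact (ii) concrete, here is a small numerical illustration (a sketch added to these notes, with a made-up example function): for f(w) = |w|, which is non-differentiable at 0, every g ∈ [−1, 1] satisfies the sub-gradient inequality of Definition 1(ii) at w = 0 (and 0 ∈ ∂f(0), consistent with 0 being the minimizer), while values outside [−1, 1] do not.

```python
import numpy as np

# Illustrative check: f(w) = |w| is convex but not differentiable at w = 0,
# and every g in [-1, 1] is a sub-gradient there, i.e.
#   f(w') >= f(0) + g * (w' - 0)   for all w'.
f = lambda w: abs(w)

w0 = 0.0
candidate_gs = [-1.0, -0.3, 0.0, 0.5, 1.0]      # all lie in the sub-differential [-1, 1]
test_points = np.linspace(-2.0, 2.0, 401)

for g in candidate_gs:
    # Check the first-order underestimator property from Definition 1(ii).
    ok = all(f(wp) >= f(w0) + g * (wp - w0) - 1e-12 for wp in test_points)
    print(f"g = {g:+.1f} is a sub-gradient at 0: {ok}")

# A slope outside [-1, 1] violates the inequality, so it is not a sub-gradient.
print("g = +1.5 is a sub-gradient at 0:",
      all(f(wp) >= f(w0) + 1.5 * (wp - w0) for wp in test_points))
```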
Definition 3 (Strong Convexity). A convex function f : Ω → R is α-strongly convex with respect to a norm || · || if, ∀ w, w′ ∈ Ω and ∀ g ∈ ∂f(w),

f(w′) ≥ f(w) + gᵀ(w′ − w) + (α/2) ||w′ − w||².
Example 1. Some examples of strongly-convex functions:

(i.) f(w) = (1/2)||w||₂² is 1-strongly convex in || · ||₂.
(ii.) The negative entropy f(w) = Σ_{i=1}^d w(i) log(w(i)) is 1-strongly convex in || · ||₁ when Ω = ∆d.

Remark 1.2. Some remarks and properties of strongly-convex functions (a small numerical check follows this list):
(i.) If f is α-strongly convex in || · ||₂, then this is equivalent to saying that f(w) − (α/2)||w||₂² is convex. In other words, f is ‘at least as convex as a quadratic function’.
(ii.) If f1 is α-strongly convex and f2 is convex, then βf1 + f2 is (βα)-strongly convex ∀β > 0.
(iii.) Let w∗ = argmin_{w∈Ω} f(w), where f is α-strongly convex. Then f(w) ≥ f(w∗) + (α/2)||w − w∗||² for all w ∈ Ω. The proof uses the definition of strong convexity and the fact that 0 ∈ ∂f(w∗).
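As a quick sanity check on Definition 3 and Example 1(i) (an illustrative snippet added here, with arbitrary random points), the function f(w) = (1/2)||w||₂² satisfies the 1-strong-convexity inequality, in fact with equality, at any pair of points:

```python
import numpy as np

# Sanity check: f(w) = 0.5 * ||w||_2^2 is 1-strongly convex in ||.||_2, i.e.
#   f(w') >= f(w) + <grad f(w), w' - w> + (1/2) * ||w' - w||_2^2,
# and for this particular f the inequality holds with equality.
rng = np.random.default_rng(0)
f = lambda w: 0.5 * np.dot(w, w)
grad_f = lambda w: w            # f is differentiable, so the sub-differential is {w}

for _ in range(5):
    w, wp = rng.normal(size=3), rng.normal(size=3)
    lhs = f(wp)
    rhs = f(w) + grad_f(w) @ (wp - w) + 0.5 * np.linalg.norm(wp - w) ** 2
    print(f"f(w') = {lhs:.4f}, strong-convexity lower bound = {rhs:.4f}")
```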
Definition 4 (Dual norm). Given a norm || · ||, its dual norm || · ||∗ is defined as:

||w||∗ = max_{||u|| ≤ 1} uᵀw.
Example 2. Some examples of dual-norm pairs: (i.) (|| · ||2, || · ||2)
(ii.) (|| · ||1, || · ||∞)
1Refer to Theorem 8.11 here: https://people.eecs.berkeley.edu/~brecht/opt4ml_book/O4MD_08_Subgradients.pdf
(iii.) More generally, given that Hölder's inequality (Lemma 1 below) holds, the following are also dual-norm pairs when considering lp-norms (a quick numerical check of the (l1, l∞) pairing appears after Lemma 1 below):

(|| · ||p, || · ||q), where p, q ≥ 1 and 1/p + 1/q = 1.

Lemma 1 (Hölder's inequality). ∀ a, b ∈ Rd, aᵀb ≤ ||a|| · ||b||∗.
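The following short snippet (an added illustration, assuming a brute-force search over the extreme points of the l1 ball is acceptable) checks the (|| · ||₁, || · ||∞) pairing from (ii) and Hölder's inequality numerically: since uᵀw is linear in u, its maximum over the l1 ball is attained at a signed standard basis vector, which recovers ||w||∞.

```python
import numpy as np

# The dual of ||.||_1 is ||.||_inf: the maximizer of u^T w over {u : ||u||_1 <= 1}
# is an extreme point, i.e. a signed standard basis vector, so the maximum is
# max_i |w_i| = ||w||_inf.
rng = np.random.default_rng(1)
w = rng.normal(size=5)

extreme_points = np.vstack([np.eye(5), -np.eye(5)])    # vertices of the l1 ball
dual_norm_via_max = max(e @ w for e in extreme_points)
print(dual_norm_via_max, np.linalg.norm(w, np.inf))    # these two values agree

# Hoelder's inequality (Lemma 1) for the pair (||.||_1, ||.||_inf):
a, b = rng.normal(size=5), rng.normal(size=5)
print(abs(a @ b) <= np.linalg.norm(a, 1) * np.linalg.norm(b, np.inf))  # True
```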
2 FTRL with convex losses and strongly-convex regularizers
We will now state our main theorem for FTRL with convex losses and strongly-convex regularizers.
Theorem 2.1. Suppose ft is convex for all t and Λ(w) = (1/η) λ(w), where η > 0 and λ is 1-strongly convex with respect to some norm || · ||. Let || · ||∗ be the dual norm of || · ||, and let gt ∈ ∂ft(wt), where wt was chosen by FTRL. Then,

RT(FTRL, f) ≜ Σ_{t=1}^T ft(wt) − min_{w∈Ω} Σ_{t=1}^T ft(w) ≤ (1/η) ( max_{w∈Ω} λ(w) − min_{w∈Ω} λ(w) ) + η Σ_{t=1}^T ||gt||∗².

Corollary 1. Suppose max_{w∈Ω} λ(w) − min_{w∈Ω} λ(w) ≤ B and ||gt||∗ ≤ G ∀t. Then, choosing η = √(B/(TG²)), we have

RT ≤ B/η + ηTG² ∈ O(G√(BT)).

Remark 2.1. Note that the condition ||gt||∗ ≤ G ∀t here means that ft is G-Lipschitz with respect to || · ||.

We will do the proof of Theorem 2.1 after looking at some examples below.
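Before the examples, here is a schematic FTRL loop in the spirit of Theorem 2.1 (a rough sketch with made-up losses, a made-up step size, and a brute-force grid argmin; it is not code from the lecture): Ω = [−1, 1], λ(w) = (1/2)w², Λ(w) = λ(w)/η, and the environment plays linear losses.

```python
import numpy as np

# Schematic FTRL loop (illustrative only):
#   w_t = argmin_{w in Omega}  sum_{s<t} f_s(w) + (1/eta) * lambda(w),
# with Omega = [-1, 1] and lambda(w) = 0.5 * w^2 (1-strongly convex in |.|).
grid = np.linspace(-1.0, 1.0, 2001)        # brute-force discretization of Omega
eta = 0.1
reg = lambda w: 0.5 * w ** 2               # the regularizer lambda

losses = []                                # losses f_1, ..., f_{t-1} observed so far

def ftrl_action():
    # sum() of an empty list is 0, so the first action is the argmin of the regularizer
    objective = sum(f(grid) for f in losses) + reg(grid) / eta
    return grid[np.argmin(objective)]

rng = np.random.default_rng(0)
total_loss = 0.0
for t in range(50):
    w_t = ftrl_action()
    c = rng.uniform(-1, 1)
    f_t = lambda w, c=c: c * w             # environment picks a convex (linear) loss
    total_loss += f_t(w_t)
    losses.append(f_t)                     # full-information feedback: f_t is observed
print("cumulative loss:", total_loss)
```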
Example 2 (Linear Losses). Let Ω = {w : ||w||₂ ≤ 1} and ft(w) = wᵀlt, where ||lt||₂ ≤ 1. We will apply the FTRL result with λ(w) = (1/2)||w||₂², which is 1-strongly convex in || · ||₂. We compute the action on round t as follows:

wt = argmin_{w∈Ω} Σ_{s=1}^{t−1} fs(w) + Λ(w)
   = argmin_{w∈Ω} Σ_{s=1}^{t−1} lsᵀw + (1/(2η)) ||w||₂²
   = argmin_{w∈Ω} ||w||₂² + 2η wᵀ ( Σ_{s=1}^{t−1} ls )
   = argmin_{w∈Ω} ||w||₂² + 2η wᵀ ( Σ_{s=1}^{t−1} ls ) + η² || Σ_{s=1}^{t−1} ls ||₂²
   = argmin_{w∈Ω} || w + η Σ_{s=1}^{t−1} ls ||₂².

We should therefore choose wt = projΩ( −η Σ_{s=1}^{t−1} ls ). This can be implemented via the following update in O(1) time on each round t:

ut ← ut−1 − η lt−1
wt ← argmin_{w∈Ω} ||w − ut||₂.
This gives us the following regret bound:

RT(FTRL, l) ≤ (1/η) ( max_{w∈Ω} (1/2)||w||₂² − min_{w∈Ω} (1/2)||w||₂² ) + η Σ_{t=1}^T ||lt||₂²
            = (1/(2η)) (1 − 0) + η Σ_{t=1}^T ||lt||₂²   (∵ 0 ≤ ||w||₂ ≤ 1 ∀w ∈ Ω)
            ≤ 1/(2η) + η · T   (∵ ||lt||₂ ≤ 1 ∀t)
            ∈ O(√T)   if η = 1/√T.
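A minimal implementation sketch of this update (added for illustration; the dimension, horizon, and loss sequence below are made up), where the projection onto the unit l2 ball has the closed form ut / max(1, ||ut||₂):

```python
import numpy as np

# Sketch of Example 2's FTRL update on the unit l2 ball (illustrative parameters).
d, T = 5, 1000
eta = 1.0 / np.sqrt(T)
rng = np.random.default_rng(0)

u = np.zeros(d)                             # u_t = -eta * sum_{s<t} l_s
total_loss = 0.0
for t in range(T):
    w = u / max(1.0, np.linalg.norm(u))     # w_t = argmin_{||w||_2 <= 1} ||w - u_t||_2
    l = rng.uniform(-1, 1, size=d)
    l /= max(1.0, np.linalg.norm(l))        # enforce ||l_t||_2 <= 1 as in the example
    total_loss += w @ l
    u -= eta * l                            # u_{t+1} = u_t - eta * l_t (no re-summation)
print("cumulative loss:", total_loss)
```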
Example 3 (Experts Problem). Let Ω = ∆K = {p ∈ R^K₊ : 1ᵀp = 1}, ft(p) = ltᵀp, lt ∈ [0,1]^K, and say K ≥ 2. Let's try FTRL with λ(p) = (1/2)||p||₂². Doing the same calculations as in the example above, we get the following regret bound:

RT(FTRL, l) ≤ B/η + η Σ_{t=1}^T ||lt||₂².

Note that B = max_{p∈∆K} λ(p) − min_{p∈∆K} λ(p) = 1/2 − 1/(2K) ≤ 1/2, and that ||lt||₂² ≤ K (∵ lt ∈ [0,1]^K). Therefore,

∴ RT(FTRL, l) ≤ 1/(2η) + ηKT
∴ RT(FTRL, l) ∈ O(√(KT))   for η = √(1/(KT)).
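Running Example 3 as stated requires the Euclidean projection onto the simplex ∆K, which no longer has the simple rescaling form of Example 2. One standard sorting-based routine for this projection (added here for completeness; it is not derived in the notes) is sketched below; the lazy FTRL form p_t = proj_{∆K}(−η Σ_{s<t} ls) then follows exactly as in Example 2.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {p : p >= 0, sum(p) = 1} (standard sort-based routine)."""
    u = np.sort(v)[::-1]                       # sort entries in decreasing order
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * ks > css - 1)[0][-1]  # largest k with u_k > (css_k - 1)/k
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

# FTRL with the quadratic regularizer on the simplex takes the lazy form
#   p_t = project_to_simplex(-eta * sum_{s<t} l_s),  analogous to Example 2.
p = project_to_simplex(np.array([0.3, -0.2, 1.5]))
print(p, p.sum())                              # a valid point of the simplex
```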
Remark 2.2. We observe the following when comparing the regret bounds derived in Example 2 and Example 3 above with that of Hedge as derived in previous lectures:

• Recall that we have the following regret bound for the Hedge algorithm: RT ∈ O(√(T log K)).

• So, it seems that we are not accurately capturing the geometry of the problem here. That is, the l2-norm of a loss vector in the hypercube [0,1]^K scales with √K, whereas, say, its l∞-norm remains a constant. So, we would want to use a regularizer that is strongly convex in some norm other than the l2-norm; say, the l1-norm.

• You will prove in the upcoming homework that using λ(p) = −H(p), the negative entropy, which is 1-strongly convex in the l1-norm, yields a better regret bound here (a schematic of the resulting update is sketched below). Since B = max_{p∈∆K} λ(p) − min_{p∈∆K} λ(p) ≤ log K and ||lt||∞² ≤ 1, Corollary 1 gives

RT ≤ (log K)/η + η · T · 1 ∈ O(√(T log K))   for η = √((log K)/T).
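For concreteness, here is what the resulting algorithm looks like (a sketch with made-up losses; the closed form pt(i) ∝ exp(−η Σ_{s<t} ls(i)) is exactly what the homework exercise asks you to derive for FTRL with the negative-entropy regularizer and linear losses):

```python
import numpy as np

# FTRL over the simplex with lambda(p) = sum_i p_i log p_i (negative entropy) and
# linear losses f_t(p) = l_t^T p reduces to exponential weights:
#   p_t(i)  proportional to  exp(-eta * sum_{s<t} l_s(i)).
K, T = 10, 500
eta = np.sqrt(np.log(K) / T)                  # the step size suggested by Corollary 1
rng = np.random.default_rng(0)

cumulative_l = np.zeros(K)
total_loss = 0.0
for t in range(T):
    logits = -eta * cumulative_l
    p = np.exp(logits - logits.max())         # subtract max for numerical stability
    p /= p.sum()                              # p_t lies in the simplex Delta_K
    l = rng.uniform(0, 1, size=K)             # l_t in [0, 1]^K, full-information feedback
    total_loss += p @ l
    cumulative_l += l
print("average loss:", total_loss / T)
```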
Remark 2.3. Thus, if we know/anticipate that the (sub)gradients {∇ft}t≥1 are small in some dual norm || · ||∗, then it would be a good idea to run FTRL with a regularizer Λ which is strongly convex w.r.t. the corresponding norm2 (|| · ||∗)∗ = || · ||.
2dual of the dual-norm, which is the norm itself