CSCI 659 Lecture 2 Fall 2022
Instructor: Haipeng Luo
1 A General Algorithmic Framework
In the last lecture, we discussed a special case of OCO, that is, the expert problem, and the classical Hedge algorithm for this problem. For a general OCO problem, how should we design a no-regret algorithm? (Recall the OCO setting: each round the learner decides wt ∈ Ω, while the environment decides a convex loss function ft : Ω → R).
Before any further discussion, we first make the following observation: it is sufficient to solve the case where each ft is a linear function. To see this, notice that by convexity, we have ft(wt) − ft(w) ≤ ⟨∇ft(wt), wt − w⟩ (draw a picture to convince yourself) and thus
\[
\max_{w \in \Omega} \sum_{t=1}^{T} \big(f_t(w_t) - f_t(w)\big) \;\le\; \max_{w \in \Omega} \sum_{t=1}^{T} \langle \nabla f_t(w_t), w_t - w \rangle.
\]
Therefore, we only need to solve a different OCO instance where the loss function $f_t'(w) = \langle \ell_t, w \rangle$ is linear with $\ell_t = \nabla f_t(w_t)$. There are a couple of things worth noticing in this argument:
• First, this shows that knowing the gradient of ft at the played action wt is already enough, and we do not really need full information about ft. More importantly, many algorithms (such as those to be discussed in this lecture) only need to store the cumulative gradients, leading to T -independent space complexity.
• Second, in this reduction, the linear loss function generally becomes adaptively chosen (that is, dependent on wt), even if the original adversary is oblivious (unless ft itself is a linear function). Therefore, we do generally need an algorithm that is able to deal with an adaptive adversary.
• Third, while this reduction is sufficient to derive some regret bounds, it might not be the best way to do so, especially if ft has some curvature making the linear approximation too loose.
In light of this reduction, for the rest of this lecture, we assume that the loss functions are linear and parameterized by l1, . . . , lT .
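As a small illustration of this reduction (the learner interface and function names below are placeholders of my choosing, not something defined in the lecture), any no-regret learner designed for linear losses can be run on general convex losses by feeding it gradients:

```python
def run_convex_via_linear(learner, grad_f, T):
    """Run a linear-loss OCO learner on general convex losses f_1, ..., f_T.

    `learner` exposes predict() -> w_t and update(l_t); `grad_f(t, w)` returns
    the gradient of f_t at w.  Only the gradient at the played point is needed,
    and a typical learner only has to store the running sum of these gradients.
    """
    plays = []
    for t in range(T):
        w_t = learner.predict()    # learner commits to w_t
        l_t = grad_f(t, w_t)       # environment reveals (enough of) f_t
        learner.update(l_t)        # surrogate linear loss <l_t, w>
        plays.append(w_t)
    return plays
```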
1.1 The Importance of Stability
Once again, the natural first attempt is the FTL approach: $w_t = \operatorname{argmin}_{w\in\Omega} \left\langle w, \sum_{s=1}^{t-1} \ell_s \right\rangle$, which
we already know suffers linear regret in the worst case. If we take a closer look at the worst-case instance discussed in Lecture 1, we see that FTL exhibits highly unstable behavior: it alternates between two very different decisions. Motivated by this, we explore the idea of stabilizing the algorithm. This might not sound very intuitive at first glance: the loss functions can be changing in some arbitrary way, so why is it a good thing to have a stable learner? One answer is that the goal of the learner is only to compete with a fixed action, so there is really no point in "chasing" the loss functions all the time. Instead, one should stick around and not move too far away from the current best fixed action.
To make this intuition more formal, in this lecture we consider stabilizing the FTL approach by adding an auxiliary loss function Ψ : Ω → R that is differentiable and convex:
\[
w_t \in \operatorname{argmin}_{w\in\Omega} \left\langle w, \sum_{s=1}^{t-1} \ell_s \right\rangle + \Psi(w), \qquad (1)
\]
and we will discuss two different types of Ψ, leading to two different classical approaches. Before that, we first derive an intermediate regret bound for this general form, which will further illustrate the importance of stability and the role of Ψ.
First, recall that the Bregman divergence DΨ : Ω × Ω → R+ with respect to Ψ is defined as DΨ(w, u) = Ψ(w) − Ψ(u) − ⟨∇Ψ(u), w − u⟩ ,
which is simply the gap between Ψ and its first-order approximation at u. Note that this is always nonnegative due to the convexity of Ψ and is zero when w = u, but it is in general asymmetric between w and u (thus not a metric). Taylor's theorem also tells us that there exists some ξ between w and u such that $D_\Psi(w, u) = \frac{1}{2}\|w - u\|^2_{\nabla^2\Psi(\xi)}$,¹ meaning that the Bregman divergence roughly measures a certain squared (quadratic) norm of w − u.
Question 1. What is the Bregman divergence with respect to a linear function? What about a quadratic function?
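For reference, here is one way to work these out (assuming, as one reading of the question, that the quadratic means $\Psi(w) = \frac{1}{2} w^{\top} A w$ for a symmetric PSD matrix $A$): for a linear function $\Psi(w) = \langle a, w\rangle + b$,
\[
D_\Psi(w, u) = \langle a, w\rangle - \langle a, u\rangle - \langle a, w - u\rangle = 0,
\]
while for the quadratic,
\[
D_\Psi(w, u) = \tfrac{1}{2} w^{\top} A w - \tfrac{1}{2} u^{\top} A u - \langle A u, w - u\rangle = \tfrac{1}{2}\|w - u\|_A^2,
\]
consistent with the squared-norm interpretation above.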
We will be using the following simple fact.
Lemma 1. Suppose that F : Ω → R is convex and differentiable, and w⋆ ∈ Ω minimizes F . Then F (w⋆) ≤ F (w) − DF (w, w⋆) holds for any w ∈ Ω.
Proof. By the definition of the Bregman divergence, $F(w) - D_F(w, w^\star) = F(w^\star) + \langle \nabla F(w^\star), w - w^\star\rangle$, so the claimed inequality is equivalent to $\langle \nabla F(w^\star), w - w^\star\rangle \ge 0$, which is exactly the first-order optimality condition for $w^\star$ being a minimizer of $F$.
What this lemma says is that w⋆ being a minimizer of F actually tells us a bit more than the trivial fact F(w⋆) ≤ F(w). With this, we prove the following useful lemma.
Lemma 2. Let $w^\star \in \Omega$ be a minimizer of the function $\langle w, L\rangle + \Psi(w)$ for some arbitrary $L$ and $\Phi$ be its minimum value; similarly, let $\bar{w}^\star$ and $\bar{\Phi}$ be a minimizer and the minimum value, respectively, of the function $\langle w, \bar{L}\rangle + \Psi(w)$ for some arbitrary $\bar{L}$. Then we have
\[
\bar{\Phi} - \Phi \le \langle w^\star, \bar{L} - L\rangle - D_\Psi(w^\star, \bar{w}^\star).
\]
Proof. Simply apply Lemma 1 with $F(w) = \langle w, \bar{L}\rangle + \Psi(w)$ and note that $D_F = D_\Psi$:
\begin{align*}
\bar{\Phi} - \Phi &= F(\bar{w}^\star) - \Phi && \text{(by definition)}\\
&\le F(w^\star) - \Phi - D_\Psi(w^\star, \bar{w}^\star) && \text{(Lemma 1)}\\
&= \langle w^\star, \bar{L} - L\rangle - D_\Psi(w^\star, \bar{w}^\star). && \text{(by definition)}
\end{align*}
Now we are ready to show the following intermediate regret bound.

Lemma 3. The algorithm described by Eq. (1) ensures for any u ∈ Ω:
\[
\sum_{t=1}^{T}\langle w_t - u, \ell_t\rangle \;\le\; \underbrace{\Psi(u) - \min_{w\in\Omega}\Psi(w)}_{\text{penalty term}} \;+\; \underbrace{\sum_{t=1}^{T}\langle w_t - w_{t+1}, \ell_t\rangle}_{\text{stability term}} \;-\; \sum_{t=1}^{T} D_\Psi(w_{t+1}, w_t)
\]
\[
\le\; B_\Psi + \sum_{t=1}^{T}\langle w_t - w_{t+1}, \ell_t\rangle - \sum_{t=1}^{T} D_\Psi(w_{t+1}, w_t),
\]
where $B_\Psi = \max_{w\in\Omega}\Psi(w) - \min_{w\in\Omega}\Psi(w)$ is the range of $\Psi$.
¹For a PSD matrix $M$, the notation $\|w\|_M$ denotes $\sqrt{w^{\top} M w}$, which is equivalent to $\|M^{1/2} w\|_2$.
In this bound, the first term is the penalty or error term, caused by the fact that Eq. (1) does not directly minimize the cumulative loss, but rather the cumulative loss plus the auxiliary term Ψ. The second term, called the stability term, exactly demonstrates why having a stable algorithm is important: the closer $w_t$ and $w_{t+1}$ are (specifically, in terms of their loss under the loss function at time $t$), the smaller the regret. Finally, for this lecture, we can simply ignore the third, nonpositive term, but in the future we will see that it is critical in some cases. In fact, after ignoring this term and rearranging, Lemma 3 is simply saying (try to verify yourself)
\[
\Psi(w_1) + \sum_{t=1}^{T}\langle w_{t+1}, \ell_t\rangle \;\le\; \Psi(u) + \sum_{t=1}^{T}\langle u, \ell_t\rangle
\]
for any u ∈ Ω, that is, the one-step lookahead strategy $w_{t+1}$ has negative regret (when the auxiliary loss is considered as well). This is often known as the Be-the-Leader lemma (where the leader refers to the one-step lookahead strategy).
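To spell out this rearrangement (as invited above): since $w_1$ minimizes $\Psi$ by Eq. (1), we have $\min_{w\in\Omega}\Psi(w) = \Psi(w_1)$, so dropping the nonpositive Bregman terms in Lemma 3 gives
\[
\sum_{t=1}^{T}\langle w_t - u, \ell_t\rangle \le \Psi(u) - \Psi(w_1) + \sum_{t=1}^{T}\langle w_t - w_{t+1}, \ell_t\rangle;
\]
canceling $\sum_{t=1}^{T}\langle w_t, \ell_t\rangle$ from both sides and moving the remaining terms around yields exactly the displayed inequality.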
Proof of Lemma 3. Let $\Phi_t = \min_{w\in\Omega} \left\langle w, \sum_{s\le t}\ell_s\right\rangle + \Psi(w) = \left\langle w_{t+1}, \sum_{s\le t}\ell_s\right\rangle + \Psi(w_{t+1})$. Then Lemma 2 tells us
\[
\Phi_{t-1} - \Phi_t \le -\langle w_{t+1}, \ell_t\rangle - D_\Psi(w_{t+1}, w_t).
\]
Adding $\langle w_t, \ell_t\rangle$ to both sides, summing over $t$, telescoping, and rearranging leads to
\[
\sum_{t=1}^{T}\langle w_t, \ell_t\rangle - \Phi_T \le -\Phi_0 + \sum_{t=1}^{T}\langle w_t - w_{t+1}, \ell_t\rangle - \sum_{t=1}^{T} D_\Psi(w_{t+1}, w_t).
\]
It thus remains to point out that $\Phi_0 = \Psi(w_1) = \min_{w\in\Omega}\Psi(w)$ and $\Phi_T \le \left\langle u, \sum_{s\le T}\ell_s\right\rangle + \Psi(u)$ for any $u$ by definition.

Therefore, our goal will be to find Ψ that stabilizes the algorithm while having a relatively small
penalty. In the rest of this lecture, we discuss two general approaches to doing so.
2 Follow the Regularized Leader
The first approach is by taking Ψ to be some strongly convex function, often called a regularizer, to stabilize the algorithm. This approach is called Follow-the-Regularized-Leader, or FTRL for short. Note that regularization is also a widely-used technique in statistical learning (in fact, being able to stabilize the learning algorithm is one explanation of why it helps reduce generalization error).
Formally, let $\Psi = \frac{1}{\eta}\psi$ for some learning rate η > 0 and a strongly convex function ψ : Ω → R. Strong convexity means that for any w, u ∈ Ω, the following holds:²
\[
\psi(u) - \psi(w) \le \langle \nabla\psi(u), u - w\rangle - \tfrac{1}{2}\|u - w\|^2 \qquad (2)
\]
for some norm $\|\cdot\|$. The following observations might help you understand this concept better:
• Compared to the convexity property $\psi(u) - \psi(w) \le \langle \nabla\psi(u), u - w\rangle$, there is an extra nonpositive term on the right-hand side of Eq. (2) (thus indeed "stronger").
• Rearranging Eq. (2) gives $\frac{1}{2}\|w - u\|^2 + \langle \nabla\psi(u), w - u\rangle + \psi(u) \le \psi(w)$, which, for a quadratic norm $\|\cdot\|$, says that at any point $w$ the function ψ lies completely above a certain quadratic (the left-hand side) that has the same function value and gradient as ψ at $u$, thus illustrating a certain curvature of ψ.
• By definition, Eq. (2) is the same as $\frac{1}{2}\|w - u\|^2 \le D_\psi(w, u) = \frac{1}{2}\|w - u\|^2_{\nabla^2\psi(\xi)}$ for some ξ between u and w, so strong convexity provides a nice lower bound on the Bregman divergence.
2More precisely, this is the definition of ψ being 1-strongly convex. Due to the scaling parameter of 1/η, it does not matter whether ψ is 1-strongly convex or α-strongly convex for some other parameter α.
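As a concrete example (a standard fact, included here only as an illustration): the Euclidean regularizer $\psi(w) = \frac{1}{2}\|w\|_2^2$ satisfies Eq. (2) with equality for the norm $\|\cdot\|_2$, since
\[
\psi(u) - \psi(w) = \tfrac{1}{2}\|u\|_2^2 - \tfrac{1}{2}\|w\|_2^2 = \langle u, u - w\rangle - \tfrac{1}{2}\|u - w\|_2^2 = \langle \nabla\psi(u), u - w\rangle - \tfrac{1}{2}\|u - w\|_2^2,
\]
so it is 1-strongly convex with respect to $\|\cdot\|_2$.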
Due to the curvature, the minimizer of a strongly convex function is unique (try to verify this yourself), and more importantly, does not move significantly when the function is slightly changed, that is, the minimizer is stable. Formally, we prove the following.
Lemma 4. Let $w^\star = \operatorname{argmin}_{w\in\Omega} \langle w, L\rangle + \frac{1}{\eta}\psi(w)$ and $\bar{w}^\star = \operatorname{argmin}_{w\in\Omega} \langle w, \bar{L}\rangle + \frac{1}{\eta}\psi(w)$ for some arbitrary $L$ and $\bar{L}$, $\eta > 0$, and strongly convex ψ (with respect to some norm $\|\cdot\|$). We have
\[
\|w^\star - \bar{w}^\star\|^2 \le \eta\,\langle w^\star - \bar{w}^\star, \bar{L} - L\rangle, \qquad (3)
\]
and consequently
\[
\|w^\star - \bar{w}^\star\| \le \eta\,\|L - \bar{L}\|_\star, \qquad (4)
\]
where $\|\cdot\|_\star$ is the dual norm.³
Proof. Similar to Lemma 2, define $\Phi = \langle w^\star, L\rangle + \frac{1}{\eta}\psi(w^\star)$ and $\bar{\Phi} = \langle \bar{w}^\star, \bar{L}\rangle + \frac{1}{\eta}\psi(\bar{w}^\star)$. Then we have
\begin{align*}
\bar{\Phi} - \Phi &\le \langle w^\star, \bar{L} - L\rangle - \tfrac{1}{\eta} D_\psi(w^\star, \bar{w}^\star) && \text{(Lemma 2)}\\
&\le \langle w^\star, \bar{L} - L\rangle - \tfrac{1}{2\eta}\|w^\star - \bar{w}^\star\|^2. && \text{(definition of strong convexity)}
\end{align*}
On the other hand, noticing the symmetry between $\bar{\Phi}$ and $\Phi$, we in fact also have
\begin{align*}
\Phi - \bar{\Phi} &\le \langle \bar{w}^\star, L - \bar{L}\rangle - \tfrac{1}{\eta} D_\psi(\bar{w}^\star, w^\star) && \text{(Lemma 2)}\\
&\le \langle \bar{w}^\star, L - \bar{L}\rangle - \tfrac{1}{2\eta}\|w^\star - \bar{w}^\star\|^2. && \text{(definition of strong convexity)}
\end{align*}
Summing up the two inequalities and rearranging proves Eq. (3). To prove Eq. (4), it suffices to
further apply Hölder's inequality:
\[
\langle w^\star - \bar{w}^\star, L - \bar{L}\rangle \le \|w^\star - \bar{w}^\star\|\,\|L - \bar{L}\|_\star
\]
and divide both sides by $\|w^\star - \bar{w}^\star\|$.
Note that the stability level is naturally controlled by the parameter η. Combining this stability lemma and the intermediate regret bound Lemma 3, we obtain the following regret bound for FTRL, where one sees that the tradeoff between the penalty term and the stability term is exactly governed by η.
Theorem 1. The FTRL strategy $w_t = \operatorname{argmin}_{w\in\Omega} \left\langle w, \sum_{s<t} \ell_s\right\rangle + \frac{1}{\eta}\psi(w)$ ensures
\[
R_T \le \frac{B_\psi}{\eta} + \eta \sum_{t=1}^{T} \|\ell_t\|_\star^2.
\]
In particular, if $\|\ell_t\|_\star \le G$ for all $t$, then picking $\eta = \sqrt{B_\psi/(TG^2)}$ gives the regret bound $R_T = O(G\sqrt{T B_\psi})$.
Proof. Note that by the definition of FTRL, Lemma 4 exactly tells us ∥wt − wt+1∥ ≤ η∥lt∥⋆. Therefore, the regret can be bounded as
\begin{align*}
R_T &\le \frac{B_\psi}{\eta} + \sum_{t=1}^{T}\langle w_t - w_{t+1}, \ell_t\rangle && \text{(Lemma 3)}\\
&\le \frac{B_\psi}{\eta} + \sum_{t=1}^{T}\|w_t - w_{t+1}\|\,\|\ell_t\|_\star && \text{(Hölder's inequality)}\\
&\le \frac{B_\psi}{\eta} + \eta\sum_{t=1}^{T}\|\ell_t\|_\star^2, && \text{(Lemma 4)}
\end{align*}
which finishes the proof.
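For completeness, the tuning of η claimed in Theorem 1 follows by balancing the two terms: if $\|\ell_t\|_\star \le G$ for all $t$, then
\[
\frac{B_\psi}{\eta} + \eta\sum_{t=1}^{T}\|\ell_t\|_\star^2 \le \frac{B_\psi}{\eta} + \eta T G^2 = 2G\sqrt{T B_\psi} \qquad \text{for } \eta = \sqrt{\frac{B_\psi}{T G^2}}.
\]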
We conclude by pointing out that having a uniform bound $G$ on $\|\ell_t\|_\star$, which we recall is just $\|\nabla f_t(w_t)\|_\star$, is simply saying that the loss function $f_t$ is $G$-Lipschitz with respect to the norm $\|\cdot\|$.

³Given a norm $\|\cdot\|$ (called the primal norm), its dual norm is defined as $\|u\|_\star = \max_{\|w\|\le 1}\langle u, w\rangle$. The most important examples of primal-dual norm pairs for this course are $(\|\cdot\|_2, \|\cdot\|_2)$ and $(\|\cdot\|_1, \|\cdot\|_\infty)$.
3 Instances of FTRL
We now discuss concrete instantiations of FTRL for different problems.
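Before specializing, here is a minimal numerical sketch of the generic FTRL update in Python (assuming, only for this illustration, that Ω = R^d so the inner minimization is unconstrained, and using scipy's general-purpose optimizer; none of this is part of the lecture's formal development):

```python
import numpy as np
from scipy.optimize import minimize

def ftrl_step(cum_loss, psi, eta, dim):
    """One FTRL step with Omega = R^d (an assumption made only for this sketch):
    w_t = argmin_w  <w, cum_loss> + psi(w) / eta, solved numerically."""
    def objective(w):
        return float(np.dot(w, cum_loss) + psi(w) / eta)
    return minimize(objective, x0=np.zeros(dim)).x

# Example usage with the Euclidean regularizer psi(w) = 0.5 * ||w||_2^2:
# ftrl_step(cum_loss=np.array([1.0, -2.0]), psi=lambda w: 0.5 * np.dot(w, w), eta=0.1, dim=2)
```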
3.1 Online Gradient Descent
For an arbitrary action space Ω, we can pick $\psi(w) = \frac{1}{2}\|w\|_2^2$, a very standard regularizer in machine learning. The induced FTRL is thus
\[
w_t = \operatorname{argmin}_{w\in\Omega} \left\langle w, \sum_{s<t} \ell_s\right\rangle + \frac{1}{2\eta}\|w\|_2^2 = \operatorname{argmin}_{w\in\Omega} \left\|w + \eta\sum_{s<t} \ell_s\right\|_2^2.
\]
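The last expression is minimized by the Euclidean projection of $-\eta\sum_{s<t}\ell_s$ onto Ω, so this instance of FTRL has a simple closed form whenever such projections are easy. Below is a minimal Python sketch under the additional assumption (made only for concreteness) that Ω is the Euclidean ball of radius R; the predict/update interface is likewise just an illustrative choice:

```python
import numpy as np

class EuclideanFTRL:
    """FTRL with psi(w) = 0.5 * ||w||_2^2, i.e.
    w_t = argmin_{w in Omega} <w, sum_{s<t} l_s> + ||w||_2^2 / (2 * eta).

    Assumes Omega = {w : ||w||_2 <= radius}, so w_t is simply the projection of
    -eta * (cumulative loss vector) onto that ball.
    """

    def __init__(self, dim, eta, radius=1.0):
        self.cum_loss = np.zeros(dim)   # running sum of l_1, ..., l_{t-1}
        self.eta = eta
        self.radius = radius

    def predict(self):
        w = -self.eta * self.cum_loss    # unconstrained minimizer of the FTRL objective
        norm = np.linalg.norm(w)
        if norm > self.radius:           # project back onto the ball if necessary
            w = w * (self.radius / norm)
        return w

    def update(self, loss_vector):
        self.cum_loss = self.cum_loss + loss_vector
```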