
CS861: Theoretical Foundations of Machine Learning Lecture 27 – 06/11/2023 University of Wisconsin–Madison, Fall 2023
Lectures 27, 28: Online Gradient Descent, Contextual Bandits
Lecturer: Kirthevasan Kandasamy Scribed by: Haoyue Bai & Deep Patel
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the instructor.
We have been looking at the Follow-The-Regularized-Leader (FTRL) framework, which helps the learner choose actions $w_t$ over rounds so as to achieve small cumulative regret w.r.t. the best action in hindsight. Specifically, we saw how choosing an appropriate regularizer can help obtain small regret when the learner incurs linear losses. We then looked at a more general framework – FTRL with convex losses and strongly convex regularizers – which, in turn, led us to the following observation:
“If we know/anticipate that $\{\nabla f_t\}_{t\ge 1}$ are small in some dual norm $\|\cdot\|_*$, then it would be a good idea to run FTRL with a regularizer $\Lambda$ which is strongly convex w.r.t. the corresponding norm¹ $(\|\cdot\|_*)_* = \|\cdot\|$.”
We stated this formally as Theorem 1.1 (below), which we will prove in today's lecture. We will wrap up our discussion of the FTRL framework by applying this result to Online Gradient Descent. Finally, we will conclude this lecture and the course with a brief discussion of Contextual Bandits and a commonly used algorithm for them, the EXP4 algorithm.
1 FTRL and Online Gradient Descent
Theorem 1.1. If $f_t$ is convex, and $\Lambda(w) = \frac{1}{\eta}\lambda(w)$ where $\lambda$ is 1-strongly convex in $\|\cdot\|$, then
$$R_T(\mathrm{FTRL}, \bar{f}) \;\le\; \frac{1}{\eta}\Big(\max_{w\in\Omega}\lambda(w) - \min_{w\in\Omega}\lambda(w)\Big) + \eta\sum_{t=1}^{T}\|g_t\|_*^2,$$
where $g_t \in \partial f_t(w_t)$ and $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$.
Proof. In the previous lecture, we proved the following for any $u \in \Omega$:
$$\sum_{t=1}^{T} f_t(w_t) - \sum_{t=1}^{T} f_t(u) \;\le\; \sum_{t=1}^{T}\big(f_t(w_t) - f_t(w_{t+1})\big) + \frac{1}{\eta}\Big(\max_{w\in\Omega}\lambda(w) - \min_{w\in\Omega}\lambda(w)\Big).$$
Using this, for a given policy $\pi$, we can say that
$$R_T(\pi, \bar{f}) = \sum_{t=1}^{T} f_t(w_t) - \sum_{t=1}^{T} f_t(w^*) \;\le\; \sum_{t=1}^{T}\big(f_t(w_t) - f_t(w_{t+1})\big) + \frac{1}{\eta}\Big(\max_{w\in\Omega}\lambda(w) - \min_{w\in\Omega}\lambda(w)\Big).$$
¹dual of the dual-norm, which is the norm itself
Therefore, it is sufficient to prove that the following holds on all rounds $t$:
$$f_t(w_t) - f_t(w_{t+1}) \;\le\; \eta\|g_t\|_*^2. \tag{1}$$
By convexity and as $g_t \in \partial f_t(w_t)$, we can write
$$f_t(w_{t+1}) \;\ge\; f_t(w_t) + g_t^{T}(w_{t+1} - w_t),$$
i.e., $f_t(w_t) - f_t(w_{t+1}) \le g_t^{T}(w_t - w_{t+1})$. By Hölder's inequality, we then have
$$f_t(w_t) - f_t(w_{t+1}) \;\le\; \|g_t\|_*\,\|w_t - w_{t+1}\|. \tag{2}$$
Let us now denote $F_t(w) \triangleq \sum_{s=1}^{t} f_s(w) + \frac{1}{\eta}\lambda(w)$. Since $\lambda$ is 1-strongly convex in the $\|\cdot\|$-norm by assumption, we have that $F_t$ is $\frac{1}{\eta}$-strongly convex in the $\|\cdot\|$-norm. Note that, by definition, $w_{t+1}$ minimizes $F_t$ and $w_t$ minimizes $F_{t-1}$. Thus, as $F_{t-1}$ and $F_t$ are $\frac{1}{\eta}$-strongly convex, we can say that
$$F_{t-1}(w_{t+1}) \;\ge\; F_{t-1}(w_t) + \frac{1}{2\eta}\|w_{t+1} - w_t\|^2,$$
$$F_t(w_t) \;\ge\; F_t(w_{t+1}) + \frac{1}{2\eta}\|w_t - w_{t+1}\|^2.$$
Summing both sides of the two inequalities above (and using $F_t = F_{t-1} + f_t$) gives us
$$f_t(w_t) - f_t(w_{t+1}) \;\ge\; \frac{1}{\eta}\|w_t - w_{t+1}\|^2. \tag{3}$$
Thus, Equations (2) and (3) imply that
$$\|w_t - w_{t+1}\| \;\le\; \eta\|g_t\|_*. \tag{4}$$
Now, combining Equation (2) and Equation (4) gives us the desired inequality stated above in Equation (1):
$$f_t(w_t) - f_t(w_{t+1}) \;\le\; \eta\|g_t\|_*^2.$$
Example 2 (Online Gradient Descent). Let $f_t$ be differentiable² and $\Omega$ be a compact, convex set. Choose $\lambda(w) = \frac{1}{2}\|w\|_2^2$. Say that we are using the following FTRL framework to obtain $w_t$ at the end of each round:
$$w_t \in \arg\min_{w\in\Omega}\ \sum_{s=1}^{t-1} f_s(w) + \frac{1}{2\eta}\|w\|_2^2.$$
Although the actions $w_t$ obtained as above give us good regret rates, the problem with obtaining the $w_t$'s this way is that, in general, the complexity of solving the aforementioned optimization problem grows with $t$: at the end of each round $t$ we pick up a new loss $f_t$ (and hence a new gradient) in the objective, which results in the computational cost growing linearly in $t$. We would like to keep the computational cost per round small, ideally not depending on $t$. So, we will take a different perspective to circumvent this issue. We will start by rewriting
²We don't actually need this assumption. We are using it for simplicity in this class.

the regret as follows:
$$\begin{aligned} R_T(\pi, \{f_t\}_{t=1}^T) &= \sum_{t=1}^{T} f_t(w_t) - \min_{w\in\Omega}\sum_{t=1}^{T} f_t(w) \\ &= \max_{w\in\Omega}\sum_{t=1}^{T}\big[f_t(w_t) - f_t(w)\big] \\ &\le \max_{w\in\Omega}\sum_{t=1}^{T} \nabla f_t(w_t)^{T}(w_t - w) \qquad \big(\because f_t \text{ is convex} \iff f_t(w) \ge f_t(w_t) + (w - w_t)^{T}\nabla f_t(w_t)\ \forall w\in\Omega\big) \\ &= \sum_{t=1}^{T} w_t^{T}\nabla f_t(w_t) - \min_{w\in\Omega}\sum_{t=1}^{T} w^{T}\nabla f_t(w_t) \\ &= R_T\big(\pi, \{\nabla f_t(w_t)\}_{t=1}^{T}\big) \qquad \text{(abuse of notation³)} \end{aligned}$$
We will now apply FTRL on the linear losses $\tilde{f}_t(w) \triangleq w^{T}\nabla f_t(w_t)$ with $\lambda(w) = \frac{1}{2}\|w\|_2^2$ as shown below:
$$\begin{aligned} w_t &= \arg\min_{w\in\Omega}\ \sum_{s=1}^{t-1}\tilde{f}_s(w) + \frac{1}{2\eta}\|w\|_2^2 \\ &= \arg\min_{w\in\Omega}\ w^{T}\Big(\sum_{s=1}^{t-1}\nabla f_s(w_s)\Big) + \frac{1}{2\eta}\|w\|_2^2 \\ &= \arg\min_{w\in\Omega}\ \Big\|w + \eta\sum_{s=1}^{t-1}\nabla f_s(w_s)\Big\|_2^2 \qquad \text{(by completing the squares)} \end{aligned}$$
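For completeness, the completing-the-squares step can be spelled out as follows (here $g_{1:t-1} \triangleq \sum_{s=1}^{t-1}\nabla f_s(w_s)$ is a shorthand we introduce only for this display):
$$2\eta\, w^{T} g_{1:t-1} + \|w\|_2^2 \;=\; \big\|w + \eta\, g_{1:t-1}\big\|_2^2 - \eta^2\|g_{1:t-1}\|_2^2,$$
and since the last term does not depend on $w$, minimizing $w^{T} g_{1:t-1} + \frac{1}{2\eta}\|w\|_2^2$ over $w\in\Omega$ is equivalent to minimizing $\|w + \eta\, g_{1:t-1}\|_2^2$.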
Hence, $w_t$ will be the $\ell_2$-projection of $-\eta\sum_{s=1}^{t-1}\nabla f_s(w_s)$ onto $\Omega$, which can be implemented in $O(1)$ time⁴ at each round $t$ as follows:
$$u_t \longleftarrow u_{t-1} - \eta\nabla f_{t-1}(w_{t-1}), \qquad w_t \longleftarrow \arg\min_{w\in\Omega}\|w - u_t\|_2. \tag{5}$$
Now, we can show that
$$R_T(\pi, \{f_t\}_{t=1}^T) \;\le\; R_T(\pi, \{\nabla f_t(w_t)\}_{t=1}^T) \;\le\; \frac{B}{\eta} + \eta T G^2 \quad \text{(by Theorem 1.1)} \;\in\; O\big(G\sqrt{BT}\big) \ \text{ if } \eta = \sqrt{\tfrac{B}{TG^2}}, \tag{6}$$
where $B = \max_{w\in\Omega}\lambda(w) - \min_{w\in\Omega}\lambda(w)$ and $\|\nabla f_t(w_t)\|_2 \le G\ \forall t$.
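To make the update rule in Equation (5) concrete, here is a minimal Python sketch of online gradient descent with Euclidean projection. This is an illustration only, not code from the lecture; the domain $\Omega$ is assumed to be a Euclidean ball of radius $R$, and the quadratic losses in the usage example are hypothetical placeholders.
```python
import numpy as np

def project_ball(u, radius):
    """l2-projection onto the Euclidean ball {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(u)
    return u if norm <= radius else (radius / norm) * u

def online_gradient_descent(grad_fns, d, eta, radius=1.0):
    """Run the OGD update in Equation (5).

    grad_fns: list of callables; grad_fns[t](w) returns the (sub)gradient of f_t at w.
    Returns the sequence of actions w_1, ..., w_T.
    """
    u = np.zeros(d)                  # u_0 = 0
    w = project_ball(u, radius)      # w_1 is the projection of u_0 (here, the origin)
    ws = []
    for grad_f in grad_fns:
        ws.append(w)                 # play w_t, then observe f_t via its gradient
        u = u - eta * grad_f(w)      # u_t <- u_{t-1} - eta * grad f_{t-1}(w_{t-1})
        w = project_ball(u, radius)  # w_t <- argmin_{w in Omega} ||w - u_t||_2
    return ws

# Example usage with made-up quadratic losses f_t(w) = 0.5 * ||w - z_t||^2:
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d = 100, 5
    zs = rng.uniform(-1, 1, size=(T, d))
    grads = [(lambda w, z=z: w - z) for z in zs]   # grad f_t(w) = w - z_t
    ws = online_gradient_descent(grads, d, eta=0.1, radius=1.0)
    print(np.round(ws[-1], 3))
```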
Remark 1.1. Some connections that we can make a note of:
• Suppose $f = f_t$ is a fixed function. Then Equation (5) is similar to the standard Projected Gradient Descent (PGD) update:
$$u_t \longleftarrow w_{t-1} - \eta\nabla f(w_{t-1}), \qquad w_t \longleftarrow \arg\min_{w\in\Omega}\|w - u_t\|_2.$$
³We mean $w_t^{T}\nabla f_t(w_t)$ here.
⁴We are not considering how this scales with the dimensionality, $d$, of $\Omega \subseteq \mathbb{R}^d$ at the moment.

Using the update rule in Equation (5) on a fixed function leads to the following bound for convex optimization via Equation (6):
$$\min_{t\in[T]} f(w_t) - f(w^*) \;\le\; \frac{1}{T}\sum_{t=1}^{T}\big(f(w_t) - f(w^*)\big) \quad (\because \min \le \text{avg.}) \;\le\; \frac{1}{T}\Big(\frac{B}{\eta} + \eta T G^2\Big) \;\in\; O\Big(G\sqrt{\tfrac{B}{T}}\Big) \ \text{ if } \eta = \sqrt{\tfrac{B}{TG^2}}.$$
Note that this need not necessarily be an optimal bound. We are simply showing an application of Theorem 1.1 to a convex optimization problem.
• In machine learning, the update rule defined in Equation (5) is similar to the Stochastic Gradient Descent (SGD) update, where $f_t$ is the loss for the instance $(x_t, y_t)$; see the sketch below.
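As an illustration of this connection (our own sketch, not part of the lecture), the snippet below runs the projected-gradient-style update from the first bullet with hypothetical per-instance squared losses $f_t(w) = \tfrac{1}{2}(w^{T}x_t - y_t)^2$, which is exactly projected SGD on a streaming linear regression problem.
```python
import numpy as np

def projected_sgd_linear_regression(stream, d, eta, radius=1.0):
    """Projected SGD with f_t(w) = 0.5 * (w^T x_t - y_t)^2.

    stream: iterable of (x_t, y_t) pairs arriving one at a time.
    """
    w = np.zeros(d)
    for x, y in stream:
        grad = (w @ x - y) * x                 # gradient of the per-instance loss at w
        u = w - eta * grad                     # gradient step
        norm = np.linalg.norm(u)
        w = u if norm <= radius else (radius / norm) * u   # l2-projection onto the ball
    return w
```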
2 Contextual Bandits
We will resume our discussion of contextual bandits now. Recall that, in the case of $K$-armed bandits, we had $K$ arms that could be pulled, and we were competing against the single best arm/action in hindsight. However, in certain situations, the best arm/action depends on contextual information, which may be available to the learner. For example, $K$-armed bandits correspond to advertising, whereas contextual bandits correspond to targeted advertising.
Definition 1 (The contextual bandit problem). We will define the contextual bandit problem as follows. On each round $t$:
(i.) The environment picks a context $x_t \in \mathcal{X}$ and the learner observes $x_t$.
(ii.) The learner chooses an action $A_t \in [K]$. Simultaneously, the environment picks a loss vector $l_t \in [0,1]^K$.
(iii.) The learner incurs the loss $l_t(A_t)$.
(iv.) The learner observes (only) $l_t(A_t)$.
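To make the interaction protocol concrete, here is a minimal simulation loop in Python. Everything in it (the environment's context distribution, the loss structure, and the uniform-random learner) is a hypothetical placeholder used only to show the order of the steps above.
```python
import numpy as np

def run_contextual_bandit(T=10, K=3, num_contexts=5, seed=0):
    rng = np.random.default_rng(seed)
    total_loss = 0.0
    for t in range(T):
        # (i.) Environment picks a context x_t; the learner observes it.
        x_t = rng.integers(num_contexts)
        # (ii.) Learner chooses A_t (here: uniformly at random); simultaneously,
        #       the environment picks a loss vector l_t in [0, 1]^K.
        A_t = rng.integers(K)
        l_t = rng.uniform(0.0, 1.0, size=K)
        # (iii.) and (iv.) Learner incurs and observes only l_t(A_t); the rest of l_t stays hidden.
        total_loss += l_t[A_t]
    return total_loss

print(run_contextual_bandit())
```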
Question: How do we define regret here?
■ One option is to compete against the best action for the given context:
$$R_T(\pi, \bar{l}, \bar{x}) = \mathbb{E}\left[\sum_{t=1}^{T} l_t(A_t)\right] - \min_{e:\mathcal{X}\to[K]}\ \sum_{t=1}^{T} l_t(e(x_t)),$$
where $\mathbb{E}$ is w.r.t. the randomness of the policy. Here, we are competing against the single best mapping from contexts to actions.
▲ This is like running a separate bandit algorithm for each context.
▲ And this is challenging if $|\mathcal{X}|$ is large, but it is also unnecessary if there are relationships between contexts (e.g., a frying pan and a non-stick skillet).
■ Instead, we will consider a set of N experts, who map contexts to actions and we will now be competing against the single best expert in hindsight. Here, the experts could be, say, machine learning models trained on a variety of large datasets.
▲ If the experts are $\{e_1,\ldots,e_N\}$, where $e_j : \mathcal{X}\to[K]\ \forall j\in[N]$, then
$$R_T(\pi, \bar{l}, \bar{x}) = \mathbb{E}\left[\sum_{t=1}^{T} l_t(A_t)\right] - \min_{j\in[N]}\ \sum_{t=1}^{T} l_t(e_j(x_t)).$$
Question: Can we apply the EXP3 algorithm here by treating the experts as arms?
■ Yes. But the regret is going to be large: $R_T \in O(\sqrt{TN\log N})$, which is fine as long as the number of experts, $N$, is small. However, we are usually interested in cases where we have many more experts than possible actions, $N \gg K$. We wish to avoid a $\mathrm{poly}(N)$ dependence, but a $\mathrm{polylog}(N)$ dependence would be fine.
3 The EXP4 Algorithm
Just like we built on Hedge to arrive at the EXP3 algorithm, we are going to build on the EXP3 algorithm to arrive at the EXP4 algorithm here. We will treat the experts as arms and run EXP3 on them, but we will do one thing differently: when we observe the feedback for the chosen action, we will discount all the experts who would have chosen that action. Based on this, we can express the pseudocode of EXP4 as shown below in Algorithm 1:
Algorithm 1: The EXP4 algorithm
Input: Time horizon $T$, learning rate $\eta$
Let $\tilde{L}_0 \leftarrow 0_N$
for $t = 1,\ldots,T$ do
  (i.) Observe $x_t$
  (ii.) Compute $\tilde{p}_t(i) = \dfrac{e^{-\eta \tilde{L}_{t-1}(i)}}{\sum_{j=1}^{N} e^{-\eta \tilde{L}_{t-1}(j)}},\ \forall i\in[N]$
  (iii.) Let $p_t(a) = \sum_{j=1}^{N} \tilde{p}_t(j)\,\mathbb{I}(e_j(x_t) = a),\ \forall a\in[K]$
  (iv.) Sample $A_t \sim p_t$
  (v.) Observe $l_t(A_t)$
  (vi.) $\hat{l}_t(a) \leftarrow \dfrac{l_t(a)}{p_t(a)}\,\mathbb{I}(A_t = a),\ \forall a\in[K]$
  (vii.) $\tilde{l}_t(j) \leftarrow \hat{l}_t(e_j(x_t)),\ \forall j\in[N]$
  (viii.) $\tilde{L}_t(j) \leftarrow \tilde{L}_{t-1}(j) + \tilde{l}_t(j),\ \forall j\in[N]$
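The following Python sketch mirrors Algorithm 1. It is offered as an illustration, not as reference code from the course; the experts are assumed to be arbitrary callables mapping a context to an arm, and the environment supplying $(x_t, l_t)$ is a placeholder left to the caller.
```python
import numpy as np

def exp4(experts, env, T, eta, K, seed=0):
    """Run EXP4 (Algorithm 1).

    experts: list of N callables, e_j(x) -> arm index in {0, ..., K-1}.
    env:     callable, env(t) -> (x_t, l_t) with l_t a length-K array in [0, 1];
             only l_t[A_t] is actually revealed to the learner below.
    """
    rng = np.random.default_rng(seed)
    N = len(experts)
    L_tilde = np.zeros(N)                       # cumulative estimated losses, L~_0 = 0_N
    total_loss = 0.0
    for t in range(1, T + 1):
        x_t, l_t = env(t)                       # (i.) observe x_t (l_t stays hidden for now)
        weights = np.exp(-eta * (L_tilde - L_tilde.min()))   # shift for numerical stability
        p_expert = weights / weights.sum()      # (ii.) expert distribution p~_t
        advice = np.array([e(x_t) for e in experts])
        p_arm = np.zeros(K)                     # (iii.) induced arm distribution p_t
        for j in range(N):
            p_arm[advice[j]] += p_expert[j]
        A_t = rng.choice(K, p=p_arm)            # (iv.) sample A_t ~ p_t
        loss = l_t[A_t]                         # (v.) observe only l_t(A_t)
        total_loss += loss
        l_hat = np.zeros(K)                     # (vi.) importance-weighted loss estimate
        l_hat[A_t] = loss / p_arm[A_t]
        l_tilde = l_hat[advice]                 # (vii.) per-expert estimated losses
        L_tilde += l_tilde                      # (viii.) update cumulative estimates
    return total_loss, L_tilde
```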
Remark 3.1. We can note the following about the EXP4 algorithm:
• Lines (ii.), (iii.), and (iv.) can be implemented as $\mathrm{Expert}_t \sim \tilde{p}_t$ and $A_t = \mathrm{Expert}_t(x_t)$.
• We can write the loss update as
$$\tilde{L}_t(j) \longleftarrow \tilde{L}_{t-1}(j) + \mathbb{I}\{e_j(x_t) = A_t\}\cdot \frac{l_t(A_t)}{p_t(A_t)}. \tag{7}$$
▲ Note that we are not discounting only one expert here. That is, we are NOT utilizing the following update rule:
$$\tilde{L}_t(j) \longleftarrow \tilde{L}_{t-1}(j) + \mathbb{I}\{E_t = j\}\cdot \frac{l_t(A_t)}{\tilde{p}_t(E_t)}. \tag{8}$$
See Remark 3.2 below for more on this.
Theorem 3.1 (Regret bound for EXP4). Assume that the loss vectors on each round satisfy $l_t \in [0,1]^K$ and let $x_t \in \mathcal{X}$ be drawn arbitrarily. Then, $\forall\, \bar{l}\,(=(l_1,\ldots,l_T))$ and $\forall\, \bar{x}\,(=(x_1,\ldots,x_T))$, EXP4 satisfies:
$$R_T(\pi, \bar{l}, \bar{x}) \;\le\; \frac{\log N}{\eta} + \eta K T,$$
where $N$ is the number of experts. With $\eta = \sqrt{\frac{\log N}{KT}}$, we see that the regret bound becomes
$$R_T(\pi, \bar{l}, \bar{x}) \;\le\; 2\sqrt{KT\log N}.$$
Proof. First, we will compute $\mathbb{E}[\tilde{l}_t(j)\,|\,\tilde{p}_t]$ and $\mathbb{E}[\tilde{l}_t^2(j)\,|\,\tilde{p}_t]$.
$$\mathbb{E}[\tilde{l}_t(j)\,|\,\tilde{p}_t] = \big(1 - p_t(e_j(x_t))\big)\times 0 + p_t(e_j(x_t))\cdot\frac{l_t(e_j(x_t))}{p_t(e_j(x_t))} = l_t(e_j(x_t)). \tag{9}$$
Similarly,
$$\mathbb{E}[\tilde{l}_t^2(j)\,|\,\tilde{p}_t] = \big(1 - p_t(e_j(x_t))\big)\times 0 + p_t(e_j(x_t))\cdot\frac{l_t^2(e_j(x_t))}{p_t^2(e_j(x_t))} = \frac{l_t^2(e_j(x_t))}{p_t(e_j(x_t))}. \tag{10}$$
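As a quick sanity check of Equations (9) and (10), the following snippet (not part of the lecture; all numbers are made up) estimates both conditional expectations by simulation for a single round with a fixed $\tilde{p}_t$, fixed expert advice, and a fixed loss vector.
```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 3, 5
advice = np.array([0, 2, 1, 2, 0])                 # e_j(x_t) for a fixed, made-up context
p_expert = np.array([0.1, 0.3, 0.2, 0.25, 0.15])   # a fixed p~_t over the N experts
l = np.array([0.4, 0.9, 0.2])                      # a fixed loss vector l_t in [0, 1]^K

# Induced arm distribution: p_t(a) = sum_j p~_t(j) I{e_j(x_t) = a}
p_arm = np.array([p_expert[advice == a].sum() for a in range(K)])

j = 1                                              # the expert whose estimate we check
A = rng.choice(K, size=200_000, p=p_arm)           # draws of A_t ~ p_t
# l~_t(j) = I{e_j(x_t) = A_t} * l_t(e_j(x_t)) / p_t(e_j(x_t)), evaluated for each draw
l_tilde_j = np.where(A == advice[j], l[advice[j]] / p_arm[advice[j]], 0.0)

# Equation (9): E[l~_t(j) | p~_t] should equal l_t(e_j(x_t))
print(l_tilde_j.mean(), "vs", l[advice[j]])
# Equation (10): E[l~_t(j)^2 | p~_t] should equal l_t(e_j(x_t))^2 / p_t(e_j(x_t))
print((l_tilde_j ** 2).mean(), "vs", l[advice[j]] ** 2 / p_arm[advice[j]])
```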
We will now apply result (i.) from the Hedge Theorem (Theorem 1, Lecture 23). As we have $N$ experts, the loss $\tilde{l}_t$ is non-negative, and $\tilde{p}_t^{T}\tilde{l}_t \le 1\ \forall t$, we have, for any expert $i$:
$$\sum_{t=1}^{T}\tilde{p}_t^{T}\tilde{l}_t - \sum_{t=1}^{T}\tilde{l}_t(i) \;\le\; \frac{\log N}{\eta} + \eta\sum_{t=1}^{T}\tilde{p}_t^{T}\tilde{l}_t^{2}. \tag{11}$$
We will apply Equation (11) by setting $i = i^*$ (i.e., the best expert in hindsight), where
$$i^* = \arg\min_{j\in[N]}\ \sum_{t=1}^{T} l_t(e_j(x_t)).$$
From Equation (9), we get $\mathbb{E}[\tilde{l}_t(i^*)] = l_t(e_{i^*}(x_t))$. From Equation (9) again, we get
$$\begin{aligned} \mathbb{E}\big[\tilde{p}_t^{T}\tilde{l}_t\,\big|\,\tilde{p}_t\big] &= \tilde{p}_t^{T}\,\mathbb{E}[\tilde{l}_t\,|\,\tilde{p}_t] \\ &= \sum_{j=1}^{N}\tilde{p}_t(j)\, l_t(e_j(x_t)) \qquad \text{(By Equation (9))} \\ &= \sum_{j=1}^{N}\tilde{p}_t(j)\sum_{a=1}^{K} l_t(a)\,\mathbb{I}\{e_j(x_t)=a\} \\ &= \sum_{a=1}^{K} l_t(a)\sum_{j=1}^{N}\tilde{p}_t(j)\,\mathbb{I}\{e_j(x_t)=a\} \\ &= \sum_{a=1}^{K} p_t(a)\, l_t(a) = p_t^{T} l_t \\ &= \mathbb{E}[l_t(A_t)\,|\,p_t]. \end{aligned}$$

Next, using Equation (10), we can say
$$\begin{aligned} \mathbb{E}\big[\tilde{p}_t^{T}\tilde{l}_t^{2}\,\big|\,\tilde{p}_t\big] &= \sum_{j=1}^{N}\tilde{p}_t(j)\cdot\frac{l_t^2(e_j(x_t))}{p_t(e_j(x_t))} \qquad \text{(By Equation (10))} \\ &= \sum_{j=1}^{N}\tilde{p}_t(j)\cdot\sum_{a=1}^{K}\frac{l_t^2(a)}{p_t(a)}\,\mathbb{I}\{e_j(x_t)=a\} \\ &= \sum_{a=1}^{K}\frac{l_t^2(a)}{p_t(a)}\sum_{j=1}^{N}\tilde{p}_t(j)\,\mathbb{I}\{e_j(x_t)=a\} \\ &= \sum_{a=1}^{K} l_t^2(a) \;\le\; K \qquad (\because\, l_t(a)\in[0,1]\ \forall a\in[K],\ \forall t). \end{aligned}$$
Remark 3.2. We discount all experts that predict the chosen action at round $t$ instead of just one expert. If we were to discount only one expert, we would get a dependence on $N$ as shown below, which we clearly do not want in the case where we have a large $N$:
$$\sum_{j=1}^{N}\tilde{p}_t(j)\cdot\frac{l_t^2(e_j(x_t))}{\tilde{p}_t(j)} \;\le\; N.$$
Intuitively, since $p_t(A_t) \ge \tilde{p}_t(E_t)$, we see that the update rule in Equation (7) is better as it reduces the variance in the observed loss values compared to the case where we use the update rule in Equation (8).
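To illustrate Remark 3.2 numerically, the short, hypothetical snippet below (reusing the same made-up quantities as the earlier sanity check) compares the conditional second moments $\mathbb{E}[\tilde{p}_t^{T}\tilde{l}_t^{2}\,|\,\tilde{p}_t]$ under the all-experts update (7) and the single-expert update (8); the former is bounded by $K$ while the latter is only bounded by $N$.
```python
import numpy as np

K, N = 3, 5
advice = np.array([0, 2, 1, 2, 0])                 # e_j(x_t), as before (made up)
p_expert = np.array([0.1, 0.3, 0.2, 0.25, 0.15])   # p~_t
l = np.array([0.4, 0.9, 0.2])                      # l_t
p_arm = np.array([p_expert[advice == a].sum() for a in range(K)])

# Update (7): discount all experts who chose the played arm (denominator p_t(a)).
second_moment_all = sum(p_expert[j] * l[advice[j]] ** 2 / p_arm[advice[j]] for j in range(N))
# Update (8): discount only the sampled expert E_t (denominator p~_t(j)).
second_moment_single = sum(p_expert[j] * l[advice[j]] ** 2 / p_expert[j] for j in range(N))

print(second_moment_all, "<= K =", K)
print(second_moment_single, "<= N =", N)
```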
Thus, we can finally see that
$$\mathbb{E}[\text{LHS of Equation (11)}] = \mathbb{E}\left[\sum_{t=1}^{T} l_t(A_t)\right] - \sum_{t=1}^{T} l_t(e_{i^*}(x_t)) = R_T(\pi, \bar{l}, \bar{x}),$$
$$\mathbb{E}[\text{RHS of Equation (11)}] \;\le\; \frac{\log N}{\eta} + \eta K T \;\in\; O\big(\sqrt{KT\log N}\big) \ \text{ if } \eta = \sqrt{\tfrac{\log N}{KT}}.$$