
CS861: Theoretical Foundations of Machine Learning Lecture 21 – 10/23/2023 University of Wisconsin–Madison, Fall 2023
Lecture 21: Martingale concentration and structured bandits
Lecturer: Kirthevasan Kandasamy Scribed by: Zhifeng Chen, Xinyan Wang
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the instructor.
In the previous lectures, we showed that $R_T \in \tilde{O}(d\sqrt{T})$ under the following good event:
$$G = \left\{ \left| f(\theta_*^\top a) - f(\hat\theta_t^\top a) \right| \le \rho \, \|a\|_{V_{t-1}^{-1}}, \ \forall a \in \mathcal{A}, \ \forall t \in \{d+1, \dots, T\} \right\}.$$
In this lecture, we will show that $P(G^c) \le 1/T$ using martingale concentration inequalities.
1 Martingale Concentration Inequality
Theorem 1. Let $\mathcal{F} = \{\mathcal{F}_t\}_{t \ge 0}$ be a filtration. Let $\{A_t\}_{t\ge1}$ be an $\mathbb{R}^d$-valued stochastic process predictable with respect to $\mathcal{F}$, and let $\{\varepsilon_t\}_{t\ge1}$ be a real-valued martingale difference sequence adapted to $\{\mathcal{F}_t\}_{t\ge1}$. Assume $\varepsilon_t$ is $\sigma$-sub-Gaussian. Let $V_t = \sum_{s=1}^{t} A_s A_s^\top$ and $\xi_t = \sum_{s=1}^{t} A_s \varepsilon_s$, and suppose $A_s^\top A_s \le C^2$ for all $s \in [T]$ and $V_t \succeq I$ for all $t > t_0$. Then, for any $t > t_0$ and any $\delta \le e^{-1}$, with probability at least $1 - \delta$,
$$\|\xi_t\|_{V_t^{-1}} \le \gamma \sigma \sqrt{2 d \log(t) \log(d/\delta)},$$
where $\gamma = \sqrt{3 + 2\log(1 + 2C)}$.
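As a quick numerical sanity check (not part of the lecture), the following Python sketch simulates this setting with Gaussian noise and unit-norm actions (so that $C = 1$, and $V_t \succeq I$ once the first $d$ actions are the standard basis), and compares $\|\xi_T\|_{V_T^{-1}}$ against the bound of Theorem 1. The dimension, horizon, confidence level, and number of trials are arbitrary illustrative choices.

import numpy as np

# Monte Carlo sanity check of the bound in Theorem 1.
# All parameter values are illustrative choices, not from the lecture.
rng = np.random.default_rng(0)
d, T, sigma, delta = 5, 1000, 1.0, 0.05
C = 1.0                                     # ||A_s|| <= C
gamma = np.sqrt(3 + 2 * np.log(1 + 2 * C))  # constant from the theorem
bound = gamma * sigma * np.sqrt(2 * d * np.log(T) * np.log(d / delta))

n_runs, exceed = 500, 0
for _ in range(n_runs):
    # First d actions are the standard basis, so V_t >= I for t > t_0 = d;
    # the remaining actions are random unit vectors, chosen independently of the noise.
    A = np.vstack([np.eye(d), rng.normal(size=(T - d, d))])
    A[d:] /= np.linalg.norm(A[d:], axis=1, keepdims=True)
    eps = sigma * rng.normal(size=T)             # sigma-sub-Gaussian noise
    V = A.T @ A                                  # V_T = sum_s A_s A_s^T
    xi = A.T @ eps                               # xi_T = sum_s A_s eps_s
    norm = np.sqrt(xi @ np.linalg.solve(V, xi))  # ||xi_T||_{V_T^{-1}}
    exceed += norm > bound

print(f"empirical exceedance: {exceed / n_runs:.3f}  (delta = {delta})")

The exceedance frequency is typically far below $\delta$: the constant $\gamma$ and the coordinate-wise union bound used in the proof below make the inequality conservative.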
To prove this theorem, we need the following lemma.
Lemma 1. If $A$ and $B$ are random variables such that $E\big[e^{\lambda A - \frac{\lambda^2 B^2}{2}}\big] \le 1$ for all $\lambda \in \mathbb{R}$, then for all $\tau \ge \sqrt{2}$ and $y > 0$,
$$P\left( |A| > \tau \sqrt{\left(B^2 + y\right)\left(1 + \tfrac{1}{2}\log\!\left(1 + \tfrac{B^2}{y}\right)\right)} \right) \le e^{-\tau^2/2}.$$
Remark. If we think of $B$ not as a random variable but as a constant, then $A$ is $B$-sub-Gaussian, since $E[e^{\lambda A}] \le e^{\frac{\lambda^2 B^2}{2}}$, and the usual sub-Gaussian tail bound gives $P(|A| \ge B\tau) \le 2e^{-\tau^2/2}$. The lemma gives a similar result when $B$ is a random variable.

Now we can prove Theorem 1.
Proof. Let $x \in \mathbb{R}^d$ be given. We will apply the lemma with $A = \frac{x^\top \xi_t}{\sigma}$ and $B = \|x\|_{V_t} = \sqrt{x^\top V_t x}$. First we check the condition $E\big[e^{\lambda A - \frac{\lambda^2 B^2}{2}}\big] \le 1$ for all $\lambda$. Write
$$\lambda A - \frac{\lambda^2 B^2}{2} = \frac{\lambda}{\sigma} x^\top \xi_t - \frac{\lambda^2}{2} x^\top V_t x = \sum_{s=1}^{t} \underbrace{\left( \frac{\lambda}{\sigma} x^\top A_s \varepsilon_s - \frac{\lambda^2}{2} x^\top A_s A_s^\top x \right)}_{=:\, Q_s}.$$
Since $A_s$ is $\mathcal{F}_{s-1}$-measurable, it is a non-stochastic quantity given $\mathcal{F}_{s-1}$, so
$$E\!\left[ e^{Q_s} \mid \mathcal{F}_{s-1} \right] = E\!\left[ \exp\!\left( \frac{\lambda}{\sigma} x^\top A_s \varepsilon_s - \frac{\lambda^2}{2}\left(x^\top A_s\right)^2 \right) \Bigm| \mathcal{F}_{s-1} \right] = E\!\left[ \exp\!\left( \frac{\lambda}{\sigma} x^\top A_s \varepsilon_s \right) \Bigm| \mathcal{F}_{s-1} \right] \exp\!\left( -\frac{\lambda^2}{2}\left(x^\top A_s\right)^2 \right) \le \exp\!\left( \frac{\lambda^2}{2\sigma^2}\,\sigma^2 \left(x^\top A_s\right)^2 \right) \exp\!\left( -\frac{\lambda^2}{2}\left(x^\top A_s\right)^2 \right) = 1,$$
where the inequality holds since $\varepsilon_s$ is $\sigma$-sub-Gaussian. Therefore, by the tower property,
$$E\!\left[ e^{\lambda A - \frac{\lambda^2 B^2}{2}} \right] = E\!\left[ e^{\sum_{s=1}^{t} Q_s} \right] = E\!\left[ e^{\sum_{s=1}^{t-1} Q_s}\, E\!\left[ e^{Q_t} \mid \mathcal{F}_{t-1} \right] \right] \le E\!\left[ e^{\sum_{s=1}^{t-1} Q_s} \right] \le \cdots \le 1.$$
We will now apply the lemma with $y = \|x\|^2$ and $\tau = \sqrt{2\log(1/\delta')}$; we will choose $\delta'$ in terms of $\delta$ later on. We require $\tau \ge \sqrt{2}$, which is satisfied as long as $\delta' \le e^{-1}$. Then, with probability at least $1 - \delta'$,
$$|A| = \frac{\left| x^\top \xi_t \right|}{\sigma} \le \sqrt{ \underbrace{\left( \|x\|_{V_t}^2 + \|x\|^2 \right)\left( 1 + \frac{1}{2}\log\!\left(1 + \frac{\|x\|_{V_t}^2}{\|x\|^2}\right) \right)}_{(*)} } \cdot \sqrt{2\log\!\left(\tfrac{1}{\delta'}\right)}.$$
Next, we will show that $(*) \le \gamma^2 \|x\|_{V_t}^2 \log(t)$. For $t > t_0$, since $I \preceq V_t = \sum_{s=1}^{t} A_s A_s^\top \preceq t C^2 I$, we have
$$\|x\|^2 \le \|x\|_{V_t}^2 \le t C^2 \|x\|^2, \qquad \text{and hence} \qquad \|x\|^2 + \|x\|_{V_t}^2 \le 2\|x\|_{V_t}^2.$$
We can also show that
$$1 + \frac{1}{2}\log\!\left(1 + \frac{\|x\|_{V_t}^2}{\|x\|^2}\right) \le \frac{\gamma^2}{2}\log(t),$$
where $\gamma = \sqrt{3 + 2\log(1+2C)}$ as given in the theorem.
Therefore, for the given $x \in \mathbb{R}^d$, with probability at least $1 - \delta'$,
$$\left| x^\top \xi_t \right| \le \sigma \gamma \|x\|_{V_t} \sqrt{2 \log(t) \log\!\left(\tfrac{1}{\delta'}\right)}. \tag{1}$$
We can decompose $\|\xi_t\|_{V_t^{-1}}^2$ in the following way:
$$\|\xi_t\|_{V_t^{-1}}^2 = \xi_t^\top V_t^{-1} \xi_t = \xi_t^\top V_t^{-\frac12} I V_t^{-\frac12} \xi_t = \sum_{j=1}^{d} \xi_t^\top V_t^{-\frac12} e_j e_j^\top V_t^{-\frac12} \xi_t = \sum_{j=1}^{d} \left( \xi_t^\top V_t^{-\frac12} e_j \right)^2.$$
Now, for any $s > 0$,
$$P\left( \|\xi_t\|_{V_t^{-1}}^2 \ge d s^2 \right) = P\left( \sum_{j=1}^{d} \left( \xi_t^\top V_t^{-\frac12} e_j \right)^2 \ge d s^2 \right) \le \sum_{j=1}^{d} P\left( \left| \xi_t^\top V_t^{-\frac12} e_j \right| \ge s \right),$$
where the inequality is a union bound: if every term were smaller than $s^2$, the sum would be smaller than $d s^2$. We apply (1) with $x = V_t^{-1/2} e_j$ and $\delta' = \delta/d$, and set
$$s = \sigma \gamma \left\| V_t^{-\frac12} e_j \right\|_{V_t} \sqrt{2\log(t)\log(d/\delta)} = \sigma \gamma \sqrt{2\log(t)\log(d/\delta)},$$
since $\big\| V_t^{-1/2} e_j \big\|_{V_t} = \|e_j\| = 1$. Each term in the sum is then at most $\delta/d$, and we finally get
$$P\left( \|\xi_t\|_{V_t^{-1}}^2 \ge 2 d \gamma^2 \sigma^2 \log(t)\log\!\left(\tfrac{d}{\delta}\right) \right) \le \delta,$$
which is the statement of the theorem.
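The linear-algebra identity behind the decomposition above, $\|\xi_t\|_{V_t^{-1}}^2 = \sum_{j=1}^{d} (\xi_t^\top V_t^{-1/2} e_j)^2$, is easy to verify numerically; below is a minimal sketch with an arbitrary positive definite $V$ and vector $\xi$ (illustrative values only).

import numpy as np

# Check ||xi||^2_{V^{-1}} = sum_j (xi^T V^{-1/2} e_j)^2 for a random positive definite V.
rng = np.random.default_rng(1)
d = 4
M = rng.normal(size=(d, d))
V = M @ M.T + np.eye(d)                  # symmetric positive definite, V >= I
xi = rng.normal(size=d)

lhs = xi @ np.linalg.solve(V, xi)        # xi^T V^{-1} xi
w, U = np.linalg.eigh(V)                 # eigendecomposition of V
V_inv_sqrt = U @ np.diag(w ** -0.5) @ U.T
rhs = sum((xi @ V_inv_sqrt @ e) ** 2 for e in np.eye(d))
print(np.isclose(lhs, rhs))              # True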
2 Bounding $P(G^c)$
We have proved the martingale concentration inequality, so we now proceed to prove $P(G^c) \le 1/T$. We will also use the following fact about generalized linear models. Define $g_t(\theta) := \sum_{s=1}^{t} A_s f(A_s^\top \theta)$, so that we can write
$$\hat\theta_t = \operatorname*{arg\,min}_{\theta} \left\| \sum_{s=1}^{t} A_s \left( f(A_s^\top \theta) - X_s \right) \right\|_{V_t^{-1}} = \operatorname*{arg\,min}_{\theta} \left\| g_t(\theta) - \sum_{s=1}^{t} A_s X_s \right\|_{V_t^{-1}}.$$
Fact: by properties of quasi-maximum likelihood estimators in the exponential family, there exists a unique $\tilde\theta_t \in \mathbb{R}^d$ such that
$$g_t(\tilde\theta_t) - \sum_{s=1}^{t} A_s X_s = \sum_{s=1}^{t} A_s \left( f(A_s^\top \tilde\theta_t) - X_s \right) = 0.$$
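As a concrete illustration of this Fact (not from the lecture), the sketch below numerically finds the root $\tilde\theta_t$ of the score equation $\sum_{s=1}^{t} A_s\big(f(A_s^\top\theta) - X_s\big) = 0$ for the logistic link $f(z) = 1/(1+e^{-z})$ on synthetic Bernoulli rewards; the link, data-generating process, and solver are assumptions made purely for illustration.

import numpy as np
from scipy.optimize import least_squares

def f(z):
    # logistic link function (an illustrative choice)
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic data: unit-norm actions and Bernoulli rewards with mean f(A_s^T theta_*).
rng = np.random.default_rng(2)
d, t = 3, 500
theta_star = rng.normal(size=d)
A = rng.normal(size=(t, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)
X = rng.binomial(1, f(A @ theta_star))

def score(theta):
    # g_t(theta) - sum_s A_s X_s = sum_s A_s (f(A_s^T theta) - X_s)
    return A.T @ (f(A @ theta) - X)

# Drive the score to zero; the root is the quasi-MLE theta_tilde_t from the Fact above.
theta_tilde = least_squares(score, x0=np.zeros(d)).x
print(np.linalg.norm(theta_tilde - theta_star))  # typically small for large t

With enough samples the recovered parameter is close to $\theta_*$; the good-event analysis that follows controls this estimation error uniformly over rounds.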
Therefore we can write
$$\hat\theta_t = \operatorname*{arg\,min}_{\theta} \left\| g_t(\theta) - g_t(\tilde\theta_t) \right\|_{V_t^{-1}}.$$
Now consider
$$\left\| g_t(\theta_*) - g_t(\hat\theta_t) \right\|_{V_t^{-1}} \le \left\| g_t(\theta_*) - g_t(\tilde\theta_t) \right\|_{V_t^{-1}} + \left\| g_t(\tilde\theta_t) - g_t(\hat\theta_t) \right\|_{V_t^{-1}} \le 2 \left\| g_t(\theta_*) - g_t(\tilde\theta_t) \right\|_{V_t^{-1}} = 2 \left\| \sum_{s=1}^{t} A_s \varepsilon_s \right\|_{V_t^{-1}},$$
where the second inequality uses that $\hat\theta_t$ minimizes $\big\| g_t(\cdot) - g_t(\tilde\theta_t) \big\|_{V_t^{-1}}$, and the last equality uses $g_t(\tilde\theta_t) = \sum_{s=1}^{t} A_s X_s$ together with $\varepsilon_s = X_s - f(A_s^\top \theta_*)$.
We now prove the claim that $P(G^c) \le 1/T$, where
$$G = \left\{ \left| f(\theta_*^\top a) - f(\hat\theta_t^\top a) \right| \le \rho \, \|a\|_{V_{t-1}^{-1}}, \ \forall a \in \mathcal{A}, \ \forall t \in \{d+1, \dots, T\} \right\}.$$
Proof. Pick a round $t \in \{d+1, \dots, T\}$ and any $a \in \mathcal{A}$. By the $L$-Lipschitz property of $f$, we know
$$\left| f(\theta_*^\top a) - f(\hat\theta_t^\top a) \right| \le L \left| (\theta_* - \hat\theta_t)^\top a \right|.$$
We now bound $\|\theta_* - \hat\theta_t\|$. Consider
$$\nabla g_{t-1}(\theta) = \sum_{s=1}^{t-1} A_s A_s^\top f'(A_s^\top \theta) \succeq c \sum_{s=1}^{t-1} A_s A_s^\top \succeq c I,$$
where the first inequality uses $f' \ge c$.
As $f'$ is continuous, by the fundamental theorem of calculus,
$$g_{t-1}(\theta_*) - g_{t-1}(\hat\theta_t) = G_{t-1}\left( \theta_* - \hat\theta_t \right), \qquad \text{where } G_{t-1} = \int_0^1 \nabla g_{t-1}\!\left( s\,\theta_* + (1-s)\,\hat\theta_t \right) ds.$$
(Proof to be continued in the next class.)