CS861: Theoretical Foundations of Machine Learning Lecture 19 – 18/10/2023 University of Wisconsin–Madison, Fall 2023
Lecture 19: K–armed bandit lower bounds, generalized linear bandits Lecturer: Kirthevasan Kandasamy Scribed by: Alex Clinton, Yamin Zhou
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the instructor.
In the previous lecture we proved an upper bound for the UCB algorithm and began analyzing a lower bound. In this lecture, we will continue the proof of that lower bound, which is gap independent. We will then also provide a gap-dependent lower bound for K-armed bandits. Finally, we will introduce a structured bandit model which generalizes the K-armed setting.
1 K-armed bandit lower bounds
Lemma 1 (Divergence decomposition). Let $\nu, \nu'$ be two $K$-armed bandit models, and let $P, P'$ be the probability distributions of the sequence $A_1, X_1, \dots, A_T, X_T$ arising from the interaction of a policy $\pi$ with $\nu, \nu'$ respectively. Then
$$KL(P, P') = \sum_{i=1}^{K} E_P[N_{i,T}]\, KL(\nu_i, \nu_i').$$
Proof We showed in the previous lecture that, for any sequence $a_1, x_1, \dots, a_T, x_T$,
$$\log \frac{p(a_1, x_1, \dots, a_T, x_T)}{p'(a_1, x_1, \dots, a_T, x_T)} = \sum_{t=1}^{T} \log \frac{\nu_{a_t}(x_t)}{\nu'_{a_t}(x_t)}. \qquad (\ast)$$
Therefore, we have
$$
\begin{aligned}
KL(P, P') &= E_P\!\left[\log \frac{P(A_1, X_1, \dots, A_T, X_T)}{P'(A_1, X_1, \dots, A_T, X_T)}\right] \\
&= E_P\!\left[\sum_{t=1}^{T} \log \frac{\nu_{A_t}(X_t)}{\nu'_{A_t}(X_t)}\right] && \text{by } (\ast) \\
&= \sum_{t=1}^{T} \sum_{i=1}^{K} E_P\!\left[ E_P\!\left[ \mathbf{1}(A_t = i)\, \log \frac{\nu_{A_t}(X_t)}{\nu'_{A_t}(X_t)} \,\Big|\, A_t \right] \right] \\
&= \sum_{t=1}^{T} \sum_{i=1}^{K} E_P\big[ \mathbf{1}(A_t = i)\, KL(\nu_i, \nu_i') \big] \\
&= \sum_{i=1}^{K} KL(\nu_i, \nu_i') \sum_{t=1}^{T} E_P[\mathbf{1}(A_t = i)] && \text{since } KL(\nu_i, \nu_i') \text{ does not depend on } t \\
&= \sum_{i=1}^{K} KL(\nu_i, \nu_i')\, E_P[N_{i,T}],
\end{aligned}
$$
where in the fourth line we used
$$
\mathbf{1}(A_t = i)\, E_P\!\left[ \log \frac{\nu_{A_t}(X_t)}{\nu'_{A_t}(X_t)} \,\Big|\, A_t \right]
= \mathbf{1}(A_t = i)\, KL(\nu_{A_t}, \nu'_{A_t})
= \mathbf{1}(A_t = i)\, KL(\nu_i, \nu_i').
$$
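Since the proofs below apply this lemma to Gaussian reward distributions, it is convenient to record the standard KL divergence between two Gaussians with a common variance (this is the computation invoked later as "KL of Gaussians"):
$$KL\big(N(\mu_1, \sigma^2),\, N(\mu_2, \sigma^2)\big) = \frac{(\mu_1 - \mu_2)^2}{2\sigma^2}.$$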
This KL divergence decomposition lemma now allows us to state the following minimax lower bound for k-armed bandits. The idea of the proof is that given any policy π, we can construct two bandit instances such that π will perform poorly in one of them.
Theorem 1. (Minimax lower bound for K-armed bandits) Let $\mathcal{P} = \{\nu = (\nu_1, \dots, \nu_K) : \nu_i \text{ is } \sigma\text{-sub-Gaussian for all } i \in [K]\}$. Then, if $K > 1$, for some universal constant $C$,
$$\inf_{\pi} \sup_{\nu \in \mathcal{P}} R_T(\pi, \nu) \ge C\sigma\sqrt{T(K-1)}.$$
Proof Let $\pi$ be given. Consider the following two bandit models $\nu, \nu'$ (constructed based on $\pi$) as follows:
• Let $\nu = \{\nu_i = N(\mu_i, \sigma^2)\}_{i \in [K]}$, where $\mu_1 = \delta$ and $\mu_i = 0$ for all $i \in \{2, \dots, K\}$, i.e., $\mu = (\delta, 0, 0, \dots, 0)$. Here $\delta > 0$ is a value we will specify shortly.
• Let $E_\nu$ denote the expectation with respect to the sequence $A_1, X_1, \dots, A_T, X_T$ due to $\pi$'s interaction with $\nu$. Since $\sum_{i=1}^{K} E_\nu[N_{i,T}] = T$, there exists some $j \in \{2, \dots, K\}$ such that $E_\nu[N_{j,T}] \le \frac{T}{K-1}$.
• Let $\nu' = \{\nu'_i = N(\mu'_i, \sigma^2)\}_{i \in [K]}$, where
$$\mu'_i = \begin{cases} \mu_i & \text{if } i \ne j, \\ 2\delta & \text{if } i = j. \end{cases}$$
That is, $\mu' = (\delta, 0, \dots, 0, \underbrace{2\delta}_{\text{position } j}, 0, \dots, 0)$.
• Let $P, P'$ denote the probability distributions of $A_1, X_1, \dots, A_T, X_T$ due to $\pi$'s interaction with $\nu, \nu'$ respectively.
Since arm 1 is optimal in $\nu$ with gap $\delta$ to every other arm, and arm $j$ is optimal in $\nu'$ with gap at least $\delta$ to every other arm, we have
$$R_T(\pi, \nu) \ge \frac{T\delta}{2}\, P\!\left(N_{1,T} \le \frac{T}{2}\right), \qquad
R_T(\pi, \nu') \ge \frac{T\delta}{2}\, P'\!\left(N_{j,T} \le \frac{T}{2}\right) \ge \frac{T\delta}{2}\, P'\!\left(N_{1,T} > \frac{T}{2}\right),$$
where the last step holds because $N_{1,T} > T/2$ implies $N_{j,T} < T/2$ (since $N_{1,T} + N_{j,T} \le T$).
Now, note that we can write
$$\sup_{\nu} R_T(\pi, \nu) \ge \max\big(R_T(\pi, \nu),\, R_T(\pi, \nu')\big) \ge \underbrace{\tfrac{1}{2}\big(R_T(\pi, \nu) + R_T(\pi, \nu')\big)}_{(\star)}.$$
We will now bound the term $(\star)$ as follows:
$$
\begin{aligned}
(\star) &\ge \frac{T\delta}{2}\left[P\!\left(N_{1,T} \le \frac{T}{2}\right) + P'\!\left(N_{1,T} > \frac{T}{2}\right)\right] && \text{from the bounds on } R_T(\pi,\nu), R_T(\pi,\nu') \text{ above} \\
&\ge \frac{T\delta}{4}\, \exp\big(-KL(P, P')\big) && \text{by Le Cam's lemma} \\
&= \frac{T\delta}{4}\, \exp\left(-E_\nu[N_{j,T}]\, \frac{(2\delta)^2}{2\sigma^2}\right) && \text{divergence decomposition + KL of Gaussians} \\
&\ge \frac{T\delta}{4}\, \exp\left(-\frac{T}{K-1}\cdot \frac{2\delta^2}{\sigma^2}\right).
\end{aligned}
$$
Now, if we choose $\delta = \sigma\sqrt{\frac{K-1}{T}}$, then we are able to get $(\star) \ge \frac{e^{-2}}{4}\, \sigma\sqrt{T(K-1)}$, and hence $\inf_\pi \sup_{\nu \in \mathcal{P}} R_T(\pi, \nu) \ge C\sigma\sqrt{T(K-1)}$ (with $C = e^{-2}/4$).
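To see the scaling concretely, here is a small numerical sketch (not part of the notes; the values of K, T, σ and the choice j = 2 are illustrative) that builds the two mean vectors from the proof, picks δ as above, and evaluates the resulting value of the bound:

    import math

    # Illustrative parameters (not from the notes)
    K, T, sigma = 10, 10_000, 1.0

    # Choice of delta from the proof
    delta = sigma * math.sqrt((K - 1) / T)

    # Mean vectors of the two hard instances nu and nu'.
    # (j = 2 is used for illustration; in the proof j is the arm the
    #  policy pulls least often in expectation under nu.)
    j = 2
    mu = [delta] + [0.0] * (K - 1)      # nu : arm 1 optimal
    mu_prime = list(mu)
    mu_prime[j - 1] = 2 * delta         # nu': arm j optimal

    # Exponent from the divergence decomposition with Gaussian rewards,
    # using E_nu[N_{j,T}] <= T/(K-1); equals 2 for this choice of delta
    kl_term = (T / (K - 1)) * (2 * delta**2) / sigma**2

    # Resulting value (T*delta/4) * exp(-kl_term) = (e^{-2}/4) * sigma * sqrt(T(K-1))
    lower_bound = (math.exp(-kl_term) / 4) * T * delta
    print(delta, kl_term, lower_bound)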
2 Gap-dependent lower bounds
Recall from our analysis of UCB that there are two types of bounds: gap dependent bounds (those that depend on ∆i) and gap independent bounds (those that do not depend on ∆i). In addition to the gap independent lower bound we just proved, there is also a gap dependent lower bound for k-armed bandits. Although we will not prove it here, this lower bound is given by the following theorem.
Theorem 2. (Theorem 16.4 in LS)
Let $\nu$ be a given $K$-armed bandit model with $\sigma$-sub-Gaussian rewards, and let $\mu = \mu(\nu)$ be the means of the arms. Let $\mathcal{P}(\nu) = \{\nu' : \mu_i(\nu') \in [\mu_i, \mu_i + 2\Delta_i],\ \nu'_i \text{ is } \sigma\text{-sub-Gaussian}\}$. Say $\pi$ is a policy such that $R_T(\pi, \nu') \le cT^p$ for all $\nu' \in \mathcal{P}(\nu)$, for some $c > 0$ and $p \in (0, 1)$. Then
$$R_T(\pi, \nu) \ge \sum_{i : \Delta_i > 0} \frac{\sigma^2}{2\Delta_i}\left((1-p)\log(T) + \log\!\left(\frac{\Delta_i}{8c}\right)\right).$$
Remark At a high level, this theorem says that if a policy does well on all "similar" problems, then it does at least as poorly as the given expression on the original problem.
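As a quick illustration (not part of the notes; the gap profile and constants below are made up), the bound can be evaluated numerically for a given instance:

    import math

    # Hypothetical instance: gaps, noise level, horizon, and the policy's
    # assumed uniform guarantee R_T <= c * T^p over the class P(nu)
    gaps = [0.0, 0.1, 0.2, 0.5]      # Delta = 0 marks the optimal arm
    sigma, T, c, p = 1.0, 100_000, 1.0, 0.5

    # Gap-dependent lower bound from Theorem 2
    lower_bound = sum(
        (sigma**2 / (2 * d)) * ((1 - p) * math.log(T) + math.log(d / (8 * c)))
        for d in gaps if d > 0
    )
    print(lower_bound)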
3 Stochastic bandits in a generalized linear model
One potential criticism of the bandit model we have studied thus far is its restriction of the action space to K specific choices. In the generalized linear bandit model we are about to introduce, we will allow for an infinite action space, but assume additional structure on the rewards.
Definition 1. A generalized linear bandit model consists of the following components:
1. Action space $\mathcal{A} \subseteq [-1, 1]^d$. For reasons of convenience which will become clear shortly, we will assume that the basis vectors $e_1, \dots, e_d$ are in $\mathcal{A}$.
2. Parameter space $\Theta \subseteq [-1, 1]^d$.
3. True parameter (unknown) $\theta_* \in \Theta$.
4. When we choose an action (arm) $A_t$, we observe $X_t = f(\theta_*^\top A_t) + \varepsilon_t$, where $E[\varepsilon_t] = 0$ and $\varepsilon_t$ is $\sigma$-sub-Gaussian. Here $\varepsilon_t$ can be thought of as noise.
5. Here $f$ is known and has the following properties:
(a) $f$ is strictly increasing with $f'(x) \ge c > 0$
(b) $f$ is $L$-Lipschitz
(c) $f'$ is continuous
Example (f is the identity in a linear bandit model): In this case we have the following expression for the regret:
$$R_T = T f(\theta_*^\top a_*) - E\!\left[\sum_{t=1}^{T} X_t\right], \qquad \text{where } a_* = \mathop{\mathrm{argmax}}_{a \in \mathcal{A}} f(\theta_*^\top a).$$
Notice that this allows us to recover our original bandit model: taking $\mathcal{A} = \{e_1, \dots, e_K\}$, $f$ the identity, and $\theta_* = (\mu_1, \dots, \mu_K)$, choosing action $e_i$ yields $X_t = \mu_i + \varepsilon_t$. Hence our new framework is broader than how we first introduced K-armed bandits.
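The following is a minimal simulation sketch of this model (not from the lecture; the logistic link, Gaussian noise, and all parameter values are illustrative assumptions), showing how rewards are generated from $\theta_*$ and a chosen action:

    import numpy as np

    def make_glm_bandit(theta_star, f, sigma, rng):
        """Return a function that, given an action a, samples X = f(theta_star^T a) + noise."""
        def pull(a):
            noise = rng.normal(0.0, sigma)   # sigma-sub-Gaussian noise (Gaussian here)
            return f(theta_star @ a) + noise
        return pull

    # Illustrative instance with d = 3 and a logistic link
    # (increasing, Lipschitz, f' continuous and bounded below on the bounded domain)
    rng = np.random.default_rng(0)
    theta_star = np.array([0.5, -0.2, 0.8])
    f = lambda z: 1.0 / (1.0 + np.exp(-z))
    pull = make_glm_bandit(theta_star, f, sigma=0.1, rng=rng)

    # Pulling the basis vector e_1 observes a noisy version of f(theta_star_1)
    e1 = np.array([1.0, 0.0, 0.0])
    print(pull(e1))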
3.1 A UCB algorithm (Based on Filippi et al. 2010)
In order to get a UCB algorithm, we need to know how to estimate θ∗ from data and construct UCBs. We define the following quantities:
• $\hat\theta_t = \mathop{\mathrm{argmin}}_{\theta \in \Theta} \left\| \sum_{s=1}^{t} A_s\big(f(A_s^\top \theta) - X_s\big) \right\|_{V_t^{-1}}^2$, where $V_t = \sum_{s=1}^{t} A_s A_s^\top$ and $\|y\|_Q^2 = y^\top Q y$ (here $Q$ must be positive semi-definite)
• $\rho(t) = 2L\sigma\sqrt{(3 + 2\log(1 + 2d))\cdot 2d\log(t)\log(dT^2)} \in \tilde O(\sqrt d)$
In the following algorithm, for the first d rounds we pull each basis vector, which is analogous to pulling each of the K arms once at the start of our original UCB algorithm. For the remaining rounds we then pull the arm with the highest (new) upper confidence bound, which is again analogous to the original UCB algorithm, where we pulled the arm with the highest (original) upper confidence bound.
Algorithm 1 UCB
Require: a time horizon T
for t = 1, ..., d do
    Choose $A_t = e_t$ (the t-th basis vector)
end for
for t = d+1, ..., T do
    Choose $A_t = \mathop{\mathrm{argmax}}_{a \in \mathcal{A}}\ \underbrace{f(\hat\theta_{t-1}^\top a)}_{\text{exploitation}} + \underbrace{\rho(t)\, \|a\|_{V_{t-1}^{-1}}}_{\text{exploration}}$
end for
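Below is a small, self-contained sketch of this algorithm on a finite action set (not from the lecture; the grid-search stand-in for the argmin over $\Theta$, the tiny ridge term added to $V_{t-1}$ for invertibility, and all constants are simplifying assumptions made for illustration):

    import numpy as np

    def glm_ucb(actions, f, pull, T, sigma=0.1, L=0.25, reg=1e-6):
        """Hedged sketch of the UCB algorithm of Section 3.1 on a finite action set."""
        d = actions.shape[1]
        history_A, history_X = [], []

        # Crude stand-in for the argmin over Theta: a grid over [-1, 1]^d
        # (only sensible for small d; a real implementation would use an optimizer)
        grid = np.array(np.meshgrid(*[np.linspace(-1.0, 1.0, 11)] * d)).reshape(d, -1).T

        def estimate_theta(V_inv):
            # theta_hat = argmin_theta || sum_s A_s (f(A_s^T theta) - X_s) ||^2_{V^{-1}}
            A, X = np.array(history_A), np.array(history_X)
            vals = []
            for theta in grid:
                g = A.T @ (f(A @ theta) - X)   # sum_s A_s (f(A_s^T theta) - X_s)
                vals.append(g @ V_inv @ g)     # squared V^{-1}-norm
            return grid[int(np.argmin(vals))]

        def rho(t):
            # Exploration radius from the notes (up to constants)
            return 2 * L * sigma * np.sqrt(
                (3 + 2 * np.log(1 + 2 * d)) * 2 * d * np.log(t) * np.log(d * T**2))

        for t in range(1, T + 1):
            if t <= d:
                a = np.eye(d)[t - 1]           # first d rounds: pull each basis vector
            else:
                A = np.array(history_A)
                V = A.T @ A + reg * np.eye(d)  # V_{t-1}, plus a tiny ridge for invertibility
                V_inv = np.linalg.inv(V)
                theta_hat = estimate_theta(V_inv)
                bonus = np.sqrt(np.einsum('ij,jk,ik->i', actions, V_inv, actions))
                ucb = f(actions @ theta_hat) + rho(t) * bonus
                a = actions[int(np.argmax(ucb))]   # exploitation + exploration
            history_A.append(a)
            history_X.append(pull(a))
        return np.array(history_A), np.array(history_X)

    # Example usage on a toy logistic-link instance (illustrative only)
    rng = np.random.default_rng(0)
    theta_star = np.array([0.5, -0.2, 0.8])
    f = lambda z: 1.0 / (1.0 + np.exp(-z))
    pull = lambda a: f(theta_star @ a) + rng.normal(0.0, 0.1)
    actions = np.vstack([np.eye(3), rng.uniform(-1, 1, size=(20, 3))])  # includes e_1, ..., e_d
    A_hist, X_hist = glm_ucb(actions, f, pull, T=200)
    print("total observed reward:", X_hist.sum())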