NIPS 2010

Parametric Bandits: The Generalized Linear Case
Sarah Filippi
Telecom ParisTech and CNRS
Paris, France

Olivier Cappé
Telecom ParisTech and CNRS
Paris, France

Aurélien Garivier
Telecom ParisTech and CNRS
Paris, France

Csaba Szepesvári
RLAI Laboratory, University of Alberta
Edmonton, Canada
Abstract

We consider structured multi-armed bandit problems based on the Generalized Linear Model (GLM) framework of statistics. For these bandits, we propose a new algorithm, called GLM-UCB. We derive finite-time, high-probability bounds on the regret of the algorithm, extending previous analyses developed for linear bandits to the non-linear case. The analysis highlights a key difficulty in generalizing linear bandit algorithms to the non-linear case, which is solved in GLM-UCB by focusing on the reward space rather than on the parameter space. Moreover, as the actual effectiveness of current parameterized bandit algorithms is often poor in practice, we provide a tuning method based on asymptotic arguments, which leads to significantly better practical performance. We present two numerical experiments on real-world data that illustrate the potential of the GLM-UCB approach.
Keywords: multi-armed bandit, parametric bandits, generalized linear models, UCB, regret minimization.
1 Introduction
In the classical K-armed bandit problem, an agent selects at each time step one of the K arms and receives a reward that depends on the chosen action. The aim of the agent is to choose the sequence of arms to be played so as to maximize the cumulated reward. There is a fundamental trade-off between gathering experimental data about the reward distribution (exploration) and exploiting the arm which seems to be the most promising.
In the basic multi-armed bandit problem, also called the independent bandits problem, the rewards are assumed to be random and distributed independently according to a probability distribution that is specific to each arm –see [1, 2, 3, 4] and references therein. Recently, structured bandit problems in which the distributions of the rewards pertaining to each arm are connected by a common unknown parameter have received much attention [5, 6, 7, 8, 9]. This model is motivated by the many practical applications where the number of arms is large, but the payoffs are interrelated. Up to now, two different models have been studied in the literature along these lines. In the first model, at each time step, side-information, or context, is given to the agent first. The payoffs of the arms depend both on this side-information and on the index of the arm. Thus the optimal arm changes with the context [5, 6, 9]. In the second, simpler model, which we are also interested in here, there is no side-information, but the agent is given a model that describes the possible relations between the arms' payoffs. In particular, in "linear bandits" [10, 8, 11, 12], each
arm a ∈ A is associated with some d-dimensional vector ma ∈ Rd known to the agent. The expected payoffs of the arms are given by the inner product of their associated vector and some fixed, but initially unknown parameter vector θ∗. Thus, the expected payoff of arm a is m′aθ∗, which is linear in θ∗.1
In this article, we study a richer generalized linear model (GLM) in which the expectation of the reward conditionally on the action a is given by μ(m′aθ∗), where μ is a real-valued, non-linear function called the (inverse) link function. This generalization allows us to consider a wider class of problems, and in particular cases where the rewards are counts or binary variables, using, respectively, Poisson or logistic regression. Such situations are very common in the fields of marketing, social networking, web-mining (see the example of Section 5.2 below) or clinical studies.
Our first contribution is an "optimistic" algorithm, termed GLM-UCB, inspired by the Upper Confidence Bound (UCB) approach [2]. GLM-UCB generalizes the algorithms studied by [10, 8, 12]. We also derive finite-time bounds on the statistical performance of this algorithm. In particular, we show that the performance depends on the dimension of the parameter but not on the number of arms, a result that was previously known in the linear case. Interestingly, the GLM-UCB approach takes advantage of the particular structure of the parameter estimate of generalized linear models and operates only in the reward space. In contrast, the parameter-space confidence-region approach adopted by [8, 12] appears to be harder to generalize to non-linear regression models. Our second contribution is a tuning method based on asymptotic arguments. This contribution addresses the poor empirical performance of current algorithms that we have observed for small or moderate sample sizes when these algorithms are tuned based on finite-sample bounds.
The paper is organized as follows. The generalized linear bandit model is presented in Section 2, together with a brief survey of needed statistical results. Section 3 is devoted to the description of the GLM-UCB algorithm, which is compared to related approaches. Section 4 presents our regret bounds, as well as a discussion, based on asymptotic arguments, on the optimal tuning of the method. Section 5 reports the results of two experiments on real data sets.
2 Generalized Linear Bandits, Generalized Linear Models
We consider a structured bandit model with a finite, but possibly very large, number of arms. At each time t, the agent chooses an arm At from the set A (we shall denote the cardinality of A by K). The prior knowledge available to the agent consists of a collection of vectors {ma}a∈A of features which are specific to each arm and a so-called (inverse) link function μ : R → R.
The generalized linear bandit model investigated in this work is based on the assumption that the payoff Rt received at time t is conditionally independent of the past payoffs and choices and it satisfies
$$E[R_t \,|\, A_t] = \mu(m_{A_t}'\theta_*), \qquad (1)$$
for some unknown parameter vector θ∗ ∈ Rd. This framework generalizes the linear bandit model considered by [10, 8, 12]. Just like the linear bandit model builds on linear regression, our model capitalizes on the well-known statistical framework of Generalized Linear Models (GLMs). The advantage of this framework is that it allows us to address various specific reward structures widely found in applications. For example, when rewards are binary-valued, a suitable choice of μ is μ(x) = exp(x)/(1 + exp(x)), leading to the logistic regression model. For integer-valued rewards, the choice μ(x) = exp(x) leads to the Poisson regression model. This can easily be extended to the case of multinomial (or polytomic) logistic regression, which is appropriate for modeling situations in which the rewards are associated with categorical variables.
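To make the two reward models above concrete, here is a minimal Python sketch (the function names and numerical values are ours, purely illustrative) of the logistic and Poisson inverse link functions applied to a hypothetical arm feature vector and parameter:

```python
import numpy as np

def mu_logistic(x):
    # Inverse link for binary rewards: mu(x) = exp(x) / (1 + exp(x))
    return 1.0 / (1.0 + np.exp(-x))

def mu_poisson(x):
    # Inverse link for count rewards: mu(x) = exp(x)
    return np.exp(x)

# Hypothetical arm feature vector m_a and unknown parameter theta_star.
m_a = np.array([0.5, -1.0, 2.0])
theta_star = np.array([0.3, 0.1, -0.2])

print("expected binary reward:", mu_logistic(m_a @ theta_star))
print("expected count reward :", mu_poisson(m_a @ theta_star))
```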
To keep this article self-contained, we briefly review the main properties of GLMs [13]. A univariate probability distribution is said to belong to a canonical exponential family if its density with respect to a reference measure is given by
$$p_\beta(r) = \exp\left(r\beta - b(\beta) + c(r)\right), \qquad (2)$$
where β is a real parameter, c(·) is a real function and the function b(·) is assumed to be twice continuously differentiable. This family contains the Gaussian and Gamma distributions when the reference measure is the Lebesgue measure and the Poisson and Bernoulli distributions when the reference measure is the counting measure on the integers. For a random variable R with density
¹Throughout the paper we use the prime to denote transposition.

defined in (2), $E(R) = \dot{b}(\beta)$ and $\text{Var}(R) = \ddot{b}(\beta)$, where $\dot{b}$ and $\ddot{b}$ denote, respectively, the first and second derivatives of b. In addition, $\ddot{b}(\beta)$ can also be shown to be equal to the Fisher information for the parameter β. The function b is thus strictly convex.
Now, assume that, in addition to the response variable R, we have at hand a vector of covariates X ∈ Rd. The canonical GLM associated to (2) postulates that $p_\theta(r\,|\,x) = p_{x'\theta}(r)$, where θ ∈ Rd is a parameter vector. Denote by $\mu = \dot{b}$ the so-called inverse link function. From the properties of b, we know that μ is continuously differentiable, strictly increasing, and thus one-to-one. The maximum likelihood estimator $\hat{\theta}_t$, based on observations $(R_1, X_1), \dots, (R_{t-1}, X_{t-1})$, is defined as the maximizer of the function
$$\sum_{k=1}^{t-1} \log p_\theta(R_k \,|\, X_k) = \sum_{k=1}^{t-1} R_k X_k'\theta - b(X_k'\theta) + c(R_k),$$
a strictly concave function in θ.² Upon differentiating, we obtain that $\hat{\theta}_t$ is the unique solution of the following estimating equation
$$\sum_{k=1}^{t-1} \left(R_k - \mu(X_k'\theta)\right) X_k = 0, \qquad (3)$$
where we have used the fact that $\mu = \dot{b}$. In practice, the solution of (3) may be found efficiently using, for instance, Newton's algorithm.
²Here, and in what follows, log denotes the natural logarithm.
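For concreteness, here is a minimal Newton iteration for the estimating equation (3), written for the logistic link; the function and variable names are ours, and the small ridge term and clipping are numerical safeguards of this sketch, not part of the estimator described above.

```python
import numpy as np

def mle_glm(X, R, mu, mu_dot, n_iter=25, ridge=1e-8):
    """Solve sum_k (R_k - mu(X_k' theta)) X_k = 0 by Newton's method."""
    d = X.shape[1]
    theta = np.zeros(d)
    for _ in range(n_iter):
        z = np.clip(X @ theta, -30.0, 30.0)        # numerical safeguard
        grad = X.T @ (R - mu(z))                   # value of the estimating equation
        hess = -(X * mu_dot(z)[:, None]).T @ X     # its Jacobian with respect to theta
        theta -= np.linalg.solve(hess - ridge * np.eye(d), grad)
    return theta

# Logistic link and its derivative.
mu = lambda x: 1.0 / (1.0 + np.exp(-x))
mu_dot = lambda x: mu(x) * (1.0 - mu(x))

# Toy data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
theta_true = np.array([1.0, -0.5, 0.2])
R = rng.binomial(1, mu(X @ theta_true))
print(mle_glm(X, R, mu, mu_dot))
```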
A semi-parametric version of the above model is obtained by assuming only that $E_\theta[R\,|\,X] = \mu(X'\theta)$ without (much) further assumptions on the conditional distribution of R given X. In this case, the estimator obtained by solving (3) is referred to as the maximum quasi-likelihood estimator. It is a remarkable fact that this estimator is consistent under very general assumptions as long as the design matrix $\sum_{k=1}^{t-1} X_k X_k'$ tends to infinity [14]. As we will see, this matrix also plays a crucial role in the algorithm that we propose for bandit optimization in the generalized linear bandit model.
3 The GLM-UCB Algorithm
According to (1), the agent receives, upon playing arm a, a random reward whose expected value is
μ(m′aθ∗), where θ∗ ∈ Θ is the unknown parameter. The parameter set Θ is an arbitrary closed subset
of Rd. Any arm with largest expected reward is called optimal. The aim of the agent is to quickly
find an optimal arm in order to maximize the received rewards. The greedy action $\arg\max_{a \in \mathcal{A}} \mu(m_a'\hat{\theta}_t)$ may lead to an unreliable algorithm which does not sufficiently explore to guarantee the selection of an optimal arm. This issue can be addressed by resorting to an "optimistic approach". As described by [8, 12] in the linear case, an optimistic algorithm consists in selecting, at time t, the arm
$$A_t = \arg\max_{a \in \mathcal{A}} \; \max_{\theta} \; E_\theta[R_t \,|\, A_t = a] \quad \text{s.t.} \quad \|\theta - \hat{\theta}_t\|_{M_t} \le \rho(t), \qquad (4)$$
where ρ is an appropriate, "slowly increasing" function,
$$M_t = \sum_{k=1}^{t-1} m_{A_k} m_{A_k}'$$
is the design matrix corresponding to the first t − 1 timesteps, and $\|v\|_M = \sqrt{v'Mv}$ denotes the matrix norm induced by the positive semidefinite matrix M. The region $\|\theta - \hat{\theta}_t\|_{M_t} \le \rho(t)$ is a confidence ellipsoid around the estimated parameter $\hat{\theta}_t$. Generalizing this approach beyond the case of linear link functions looks challenging. In particular, in GLMs, the relevant confidence regions may have a more complicated geometry in the parameter space than simple ellipsoids. As a consequence, the benefits of this form of optimistic algorithm appear dubious.³
An alternative approach consists in directly determining an upper confidence bound for the expected reward of each arm, thus choosing the action a that maximizes
$$E_{\hat{\theta}_t}[R_t \,|\, A_t = a] + \rho(t)\,\|m_a\|_{M_t^{-1}}. \qquad (5)$$
³Note that maximizing μ(m′aθ) over a convex confidence region is equivalent to maximizing m′aθ over the same region since μ is strictly increasing. Thus, computationally, this approach is not more difficult than it is for the linear case.

In the linear case the two approaches lead to the same solution [12]. Interestingly, for non-linear bandits, the second approach looks more appropriate.
In the rest of this section, we apply this second approach to the GLM bandit model defined in (1). According to (3), the maximum quasi-likelihood estimator of the parameter in the GLM is the unique solution of the estimating equation
$$\sum_{k=1}^{t-1} \left(R_k - \mu(m_{A_k}'\hat{\theta}_t)\right) m_{A_k} = 0, \qquad (6)$$
where A1, . . . , At−1 denote the arms played so far and R1, . . . , Rt−1 are the corresponding rewards.
Let $g_t(\theta) = \sum_{k=1}^{t-1} \mu(m_{A_k}'\theta)\, m_{A_k}$ be the invertible function such that the estimated parameter $\hat{\theta}_t$ satisfies $g_t(\hat{\theta}_t) = \sum_{k=1}^{t-1} R_k\, m_{A_k}$. Since $\hat{\theta}_t$ might be outside of the set of admissible parameters Θ, we "project it" to Θ, to obtain $\tilde{\theta}_t$:
$$\tilde{\theta}_t = \arg\min_{\theta \in \Theta} \left\| g_t(\theta) - g_t(\hat{\theta}_t) \right\|_{M_t^{-1}} = \arg\min_{\theta \in \Theta} \left\| g_t(\theta) - \sum_{k=1}^{t-1} R_k\, m_{A_k} \right\|_{M_t^{-1}}. \qquad (7)$$
Note that if $\hat{\theta}_t \in \Theta$ (which is easy to check and which happened to hold always in the examples we dealt with) then we can let $\tilde{\theta}_t = \hat{\theta}_t$. This is important since computing $\tilde{\theta}_t$ is non-trivial and we can save this computation by this simple check. The proposed algorithm, GLM-UCB, is as follows:

Algorithm 1 GLM-UCB
Input: {ma}a∈A
1: Play actions a1, . . . , ad, receive R1, . . . , Rd.
2: for t > d do
3:   Estimate $\hat{\theta}_t$ according to (6)
4:   if $\hat{\theta}_t \in \Theta$, let $\tilde{\theta}_t = \hat{\theta}_t$; else compute $\tilde{\theta}_t$ according to (7)
5:   Play the action $A_t = \arg\max_a \mu(m_a'\tilde{\theta}_t) + \rho(t)\|m_a\|_{M_t^{-1}}$, receive $R_t$

At time t, for each arm a, an upper bound $\mu(m_a'\tilde{\theta}_t) + \beta_t^a$ is computed, where the "exploration bonus" $\beta_t^a = \rho(t)\|m_a\|_{M_t^{-1}}$ is the product of two terms. The quantity ρ(t) is a slowly increasing function; we prove in Section 4 that ρ(t) can be set to guarantee high-probability bounds on the expected regret (for the actual form used, see (8)). Note that the leading term of $\beta_t^a$ is $\|m_a\|_{M_t^{-1}}$, which decreases to zero as t increases.
As we are mostly interested in the case when the number of arms K is much larger than the dimension d, the algorithm is simply initialized by playing actions a1, . . . , ad such that the vectors $m_{a_1}, \dots, m_{a_d}$ form a basis of M = span(ma, a ∈ A). Without loss of generality, here and in what follows we assume that the dimension of M is equal to d. Then, by playing a1, . . . , ad in the first d steps the agent ensures that Mt is invertible for all t. An alternative strategy would be to initialize M0 = λ0I, where I is the d × d identity matrix.
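The following Python sketch puts Algorithm 1 together, under simplifying assumptions of ours: the first d feature vectors are assumed to be linearly independent (initialization), the estimate is assumed to stay in Θ so the projection step (7) is skipped, and the estimating equation (6) is solved with a few Newton iterations as in the earlier snippet. The names and the `pull` reward oracle are hypothetical.

```python
import numpy as np

def glm_ucb(M_features, mu, mu_dot, rho, pull, T, newton_iters=15):
    """Sketch of GLM-UCB.
    M_features: (K, d) array of arm features m_a (first d rows assumed independent).
    mu, mu_dot: inverse link function and its derivative.
    rho: function t -> exploration rate rho(t).
    pull: function a -> observed reward R_t for arm index a.
    Assumes theta_hat stays in Theta, so the projection step (7) is skipped.
    """
    K, d = M_features.shape
    history_m, history_r = [], []

    # Initialization: play d arms whose features span R^d.
    for a in range(d):
        history_m.append(M_features[a])
        history_r.append(pull(a))

    for t in range(d, T):
        X = np.array(history_m)
        R = np.array(history_r)
        M_t = X.T @ X                      # design matrix
        M_inv = np.linalg.inv(M_t)

        # Maximum quasi-likelihood estimate: solve (6) by Newton's method.
        theta = np.zeros(d)
        for _ in range(newton_iters):
            z = np.clip(X @ theta, -30.0, 30.0)      # numerical safeguard
            grad = X.T @ (R - mu(z))
            hess = -(X * mu_dot(z)[:, None]).T @ X
            theta -= np.linalg.solve(hess - 1e-8 * np.eye(d), grad)

        # Upper confidence index: mu(m_a' theta) + rho(t) * ||m_a||_{M_t^{-1}}.
        bonus = rho(t) * np.sqrt(np.sum((M_features @ M_inv) * M_features, axis=1))
        a_t = int(np.argmax(mu(M_features @ theta) + bonus))

        history_m.append(M_features[a_t])
        history_r.append(pull(a_t))
    return history_r
```

For the logistic link one would pass `mu` and `mu_dot` as in the earlier snippet and, in the spirit of Section 4.2, something like `rho = lambda t: np.sqrt(2 * (k_mu / c_mu) * np.log(t))`.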
3.1 Discussion
The purpose of this section is to discuss some properties of Algorithm 1, and in particular the interpretation of the role played by ∥ma∥M−1 .
Generalizing UCB The standard UCB algorithm for K arms [2] can be seen as a special case of GLM-UCB in which the vectors of covariates associated with the arms form an orthogonal system and μ(x) = x. Indeed, take d = K, A = {1, . . . , K}, define the vectors {ma}a∈A as the canonical basis {ea}a∈A of Rd, and take θ∗ ∈ Rd to be the vector whose component θa is the expected reward of arm a.
Then, Mt is a diagonal matrix whose a-th diagonal element is the number Nt(a) of times the a-th arm has been played up to time t. Therefore, the exploration bonus in GLM-UCB is given by $\beta_t^a = \rho(t)/\sqrt{N_t(a)}$. Moreover, the maximum quasi-likelihood estimator $\hat{\theta}_t$ satisfies $\hat{\theta}_t(a) = \bar{R}_t^a$
for all a ∈ A, where $\bar{R}_t^a = \frac{1}{N_t(a)} \sum_{k=1}^{t-1} \mathbb{I}_{\{A_k = a\}} R_k$ is the empirical mean of the rewards received while playing arm a. Algorithm 1 then reduces to the familiar UCB algorithm. In this case, it is known that the expected cumulated regret can be controlled upon setting the slowly varying function ρ to $\rho(t) = \sqrt{2\log(t)}$, assuming that the range of the rewards is bounded by one [2].
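As a quick numerical illustration of this reduction (the numbers below are our own toy values): with canonical-basis features and μ(x) = x, equation (6) yields the per-arm empirical means and the bonus collapses to ρ(t)/√N_t(a).

```python
import numpy as np

counts = np.array([5, 2, 9])                # N_t(a): times each arm was played
sums = np.array([3.0, 1.0, 4.5])            # cumulated reward per arm
M_t = np.diag(counts.astype(float))         # design matrix for canonical-basis features

theta_hat = sums / counts                   # solves (6) with mu(x) = x: empirical means
bonus = 1.0 / np.sqrt(np.diag(M_t))         # ||e_a||_{M_t^{-1}} = 1 / sqrt(N_t(a))
rho_t = np.sqrt(2 * np.log(counts.sum() + 1))
print(theta_hat + rho_t * bonus)            # the familiar UCB index for each arm
```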
Generalizing linear bandits Obviously, setting μ(x) = x, we obtain a linear bandit model. In this case, assuming that Θ = Rd, the algorithm reduces to those described in [8, 12]. In particular, the maximum quasi-likelihood estimator becomes the least-squares estimator and, as noted earlier, the algorithm behaves identically to one which chooses the parameter optimistically within the confidence ellipsoid $\{\theta : \|\theta - \hat{\theta}_t\|_{M_t} \le \rho(t)\}$.
Dependence on the Number of Arms In contrast to an algorithm such as UCB, Algorithm 1 does not require that every arm be played even once.⁴ To understand this phenomenon, observe that, as $M_{t+1} = M_t + m_{A_t} m_{A_t}'$,
$$\|m_a\|^2_{M_{t+1}^{-1}} = \|m_a\|^2_{M_t^{-1}} - \frac{\left(m_a' M_t^{-1} m_{A_t}\right)^2}{1 + \|m_{A_t}\|^2_{M_t^{-1}}} \quad \text{for any arm } a.$$
Thus the exploration bonus $\beta_t^a$ decreases for all arms, except those which are exactly orthogonal to $m_{A_t}$ (in the $M_t^{-1}$ metric). The decrease is most significant for arms that are collinear with $m_{A_t}$. This explains why the regret bounds obtained in Theorems 1 and 2 below depend on d but not on K.
⁴Of course, the linear bandit algorithms also share this property with our algorithm.
4 Theoretical analysis
In this section we first give our finite sample regret bounds and then show how the algorithm can be tuned based on asymptotic arguments.
4.1 Regret Bounds
To quantify the performance of the GLM-UCB algorithm, we consider the cumulated (pseudo) regret defined as the expected difference between the optimal reward obtained by always playing an optimal arm and the reward received following the algorithm:
$$\text{Regret}_T = \sum_{t=1}^{T} \left(\mu(m_{a^*}'\theta_*) - \mu(m_{A_t}'\theta_*)\right).$$
For the sake of the analysis, in this section we shall assume that the following assumptions hold:
Assumption 1. The link function μ : R → R is continuously differentiable, Lipschitz with constant kμ and such that $c_\mu = \inf_{\theta \in \Theta, a \in \mathcal{A}} \dot{\mu}(m_a'\theta) > 0$.
For the logistic function kμ = 1/4, while the value of cμ depends on $\sup_{\theta \in \Theta, a \in \mathcal{A}} |m_a'\theta|$.
Assumption 2. The norm of the covariates in {ma : a ∈ A} is bounded: there exists cm < ∞ such that for all a ∈ A, ∥ma∥2 ≤ cm.
Finally, we make the following assumption on the rewards:
Assumption 3. There exists Rmax > 0 such that for any t ≥ 1, 0 ≤ Rt ≤ Rmax holds a.s. Let $\varepsilon_t = R_t - \mu(m_{A_t}'\theta_*)$. For all t ≥ 1, it holds that $E[\varepsilon_t \,|\, m_{A_t}, \varepsilon_{t-1}, \dots, m_{A_2}, \varepsilon_1, m_{A_1}] = 0$ a.s.
As for the standard UCB algorithm, the regret can be analyzed in terms of the difference between the expected reward received playing an optimal arm and that of the best sub-optimal arm:
$$\Delta(\theta_*) = \min_{a :\, \mu(m_a'\theta_*) < \mu(m_{a^*}'\theta_*)} \left(\mu(m_{a^*}'\theta_*) - \mu(m_a'\theta_*)\right).$$
Theorem 1 establishes a high-probability bound on the regret when using GLM-UCB with
$$\rho(t) = 2 k_\mu \kappa R_{\max} \sqrt{2 d \log(t) \log(2dT/\delta)}, \qquad (8)$$
where T is the fixed time horizon, $\kappa = \sqrt{3 + 2\log(1 + 2c_m^2/\lambda_0)}$ and λ0 denotes the smallest eigenvalue of $\sum_{i=1}^{d} m_{a_i} m_{a_i}'$, which by our previous assumption is positive.
Theorem 1 (Problem Dependent Upper Bound). Let $s = \max(1, c_m^2/\lambda_0)$. Then, under Assumptions 1–3, for all T ≥ 1, the regret satisfies
$$P\left(\text{Regret}_T \le (d+1)R_{\max} + \frac{C\, d^2}{\Delta(\theta_*)} \log^2\!\left[s\,T\right] \log\!\left(\frac{2dT}{\delta}\right)\right) \ge 1 - \delta \quad \text{with } C = \frac{32\,\kappa^2 R_{\max}^2 k_\mu^2}{c_\mu^2}.$$
Note that the above regret bound depends on the true value of θ∗ through ∆(θ∗). The following theorem provides an upper bound on the regret that is independent of θ∗.
Theorem 2 (Problem Independent Upper Bound). Let $s = \max(1, c_m^2/\lambda_0)$. Then, under Assumptions 1–3, for all T ≥ 1, the regret satisfies
$$P\left(\text{Regret}_T \le (d+1)R_{\max} + C\, d \log\!\left[s\,T\right] \sqrt{T \log\!\left(\frac{2dT}{\delta}\right)}\right) \ge 1 - \delta \quad \text{with } C = \frac{8 R_{\max} k_\mu \kappa}{c_\mu}.$$
The proofs of Theorems 1–2 can be found in the supplementary material. The main idea is to use the explicit form of the estimator given by (6) to show that
$$\mu(m_{A_t}'\theta_*) - \mu(m_{A_t}'\hat{\theta}_t) \le \frac{k_\mu}{c_\mu}\, \|m_{A_t}\|_{M_t^{-1}} \left\| \sum_{k=1}^{t-1} m_{A_k} \varepsilon_k \right\|_{M_t^{-1}}.$$
Bounding the last term on the right-hand side is then carried out following the lines of [12].
4.2 Asymptotic Upper Confidence Bound
Preliminary experiments carried out using the value of ρ(t) defined in equation (8), including the case where μ is the identity function (i.e., using the algorithm described by [8, 12]), revealed poor performance for moderate sample sizes. A look into the proof of the regret bound easily explains this observation, as the mathematical involvement of the arguments is such that some approximations seem unavoidable, in particular several applications of the Cauchy-Schwarz inequality, leading to pessimistic confidence bounds. We provide here some asymptotic arguments that suggest choosing significantly smaller exploration bonuses, which will in turn be validated by the numerical experiments presented in Section 5.
Consider the canonical GLM associated with an inverse link function μ and assume that the vectors of covariates X are drawn independently under a fixed distribution. This random design model would for instance describe the situation when the arms are drawn randomly from a fixed distribution. Standard statistical arguments show that the Fisher information matrix pertaining to this model is given by $J = E[\dot{\mu}(X'\theta_*) X X']$ and that the maximum likelihood estimate $\hat{\theta}_t$ is such that $t^{1/2}(\hat{\theta}_t - \theta_*) \xrightarrow{D} N(0, J^{-1})$, where $\xrightarrow{D}$ stands for convergence in distribution. Moreover, $t^{-1} M_t \to \Sigma$, where $\Sigma = E[XX']$. Hence, using the delta method and Slutsky's lemma,
$$\|m_a\|_{M_t^{-1}}^{-1} \left(\mu(m_a'\hat{\theta}_t) - \mu(m_a'\theta_*)\right) \xrightarrow{D} N\!\left(0,\; \dot{\mu}(m_a'\theta_*)^2\, \|m_a\|_{\Sigma^{-1}}^{-2}\, \|m_a\|_{J^{-1}}^{2}\right).$$
The right-hand variance is smaller than kμ/cμ as J ≽ cμΣ. Hence, for any sampling distribution such that J and Σ are positive definite, and for sufficiently large t and small δ,
$$P\left(\|m_a\|_{M_t^{-1}}^{-1} \left(\mu(m_a'\hat{\theta}_t) - \mu(m_a'\theta_*)\right) > \sqrt{2 k_\mu/c_\mu \log(1/\delta)}\right)$$
is asymptotically bounded by δ. Based on the above asymptotic argument, we postulate that using $\rho(t) = \sqrt{2 k_\mu/c_\mu \log(t)}$, i.e., inflating the exploration bonus by a factor of $\sqrt{k_\mu/c_\mu}$ compared to the usual UCB setting, is sufficient. This is the setting used in the simulations below.
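For the logistic link, the constants in this tuning become explicit once a bound on |m′aθ| is assumed; the sketch below uses a hypothetical bound x_max = 3, chosen by us purely for illustration.

```python
import numpy as np

# Logistic link: mu_dot(x) = mu(x) * (1 - mu(x)), which is maximal at x = 0,
# so the Lipschitz constant is k_mu = 1/4; c_mu is the derivative at the
# assumed bound x_max on |m_a' theta|.
x_max = 3.0                                   # hypothetical bound, for illustration
mu = lambda x: 1.0 / (1.0 + np.exp(-x))
k_mu = 0.25
c_mu = mu(x_max) * (1.0 - mu(x_max))

rho_asymptotic = lambda t: np.sqrt(2.0 * (k_mu / c_mu) * np.log(t))
print(rho_asymptotic(1000.0))                 # exploration rate at t = 1000
```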
5 Experiments
To the best of our knowledge, there is currently no public benchmark available to test bandit methods on real world data. On simulated data, the proposed method unsurprisingly outperforms its competitors when the data is indeed simulated from a well-specified generalized linear model. In order to evaluate the potential of the method in more challenging scenarios, we thus carried out two experiments using real world datasets.
Programming Help, Add QQ: 749389476
5.1 Forest Cover Type Data
In this first experiment, we test the performance of the proposed method on a toy problem using the "Forest Cover Type dataset" from the UCI repository. The dataset (centered and normalized, with a constant covariate added, resulting in 11-dimensional vectors, ignoring all categorical variables) has been partitioned into K = 32 clusters using unsupervised k-means. The values of the response variable for the data points assigned to each cluster are viewed as the outcomes of an arm, while the centroid of the cluster is taken as the 11-dimensional vector of covariates characteristic of the arm. To cast the problem into the logistic regression framework, each response variable is binarized by associating the first class ("Spruce/Fir") with a response R = 1 and all other six classes with R = 0. The proportions of responses equal to 1 in each cluster (or, in other words, the expected reward associated with each arm) range from 0.354 to 0.992, while the proportion on the complete set of 581,012 data points is equal to 0.367. In effect, we try to locate as fast as possible the cluster that contains the maximal proportion of trees from a given species. We are thus faced with a 32-arm problem in an 11-dimensional space with binary rewards. Obviously, the logistic regression model is not satisfied, although we do expect some regularity with respect to the position of the cluster's centroid, as the logistic regression trained on all data reaches a 0.293 misclassification rate.
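A minimal sketch of this arm-construction step, assuming the standard scikit-learn covertype loader and KMeans API (the exact preprocessing used in the paper may differ):

```python
import numpy as np
from sklearn.datasets import fetch_covtype
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

data = fetch_covtype()
X = StandardScaler().fit_transform(data.data[:, :10])   # quantitative covariates only
X = np.hstack([X, np.ones((X.shape[0], 1))])            # add a constant covariate
y = (data.target == 1).astype(int)                      # 1 for "Spruce/Fir", 0 otherwise

km = KMeans(n_clusters=32, n_init=10, random_state=0).fit(X)
arm_features = km.cluster_centers_                      # one 11-dimensional vector per arm
arm_rewards = [y[km.labels_ == a] for a in range(32)]   # empirical reward pool per arm
print([round(r.mean(), 3) for r in arm_rewards])        # expected reward of each arm
```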
Figure 1: Top: Regret of the UCB, GLM-UCB and ε-greedy algorithms. Bottom: Frequencies of the 20 best arms' draws using UCB and GLM-UCB.
We compare the performance of three algorithms. First, the GLM-UCB algorithm, with parameters
tuned as indicated in Section 4.2. Second, the standard UCB algorithm that ignores the covariates.
Third, an ε-greedy algorithm that performs logistic regression and plays the best estimated action, $A_t = \arg\max_a \mu(m_a'\hat{\theta}_t)$, with probability 1 − ε (with ε = 0.1). We observe in the top graph of Figure 1 that the GLM-UCB algorithm achieves the smallest average regret by a large margin. When the parameter is well estimated, the greedy algorithm may find the best arm in little time and then incur small regret. However, the exploration/exploitation tradeoff is not correctly handled by the ε-greedy approach, causing a large variability in the regret. The lower plot of Figure 1 shows the number of times each of the 20 best arms has been played by the UCB and GLM-UCB algorithms. The arms are sorted in decreasing order of expected reward. It can be observed that GLM-UCB only plays a small subset of all possible arms, concentrating on the best ones. This behavior is made possible by the predictive power of the covariates: by sharing information between arms, it is possible to obtain sufficiently accurate predictions of the expected rewards of all actions, even for those that have never (or rarely) been played.

5.2 Internet Advertisement Data
In this experiment, we used a large record of the activity of internet users provided by a major ISP. The original dataset logs the visits to a set of 1222 pages over a six-day period, corresponding to about 5 × 10⁸ page visits. The dataset also contains a record of the users' clicks on the ads that were presented on these pages. We worked with a subset of 208 ads and 3 × 10⁵ users. The pages (ads) were partitioned into 10 (respectively, 8) categories using Latent Dirichlet Allocation [15] applied to their respective textual content (in the case of ads, the textual content was that of the page pointed to by the ad's link). This second experiment is much more challenging, as the predictive power of the sole textual information turns out to be quite limited (for instance, Poisson regression trained on the entire data does not even correctly identify the best arm).
The action space is composed of the 80 pairs of page and ad categories: when a pair is chosen, it is presented to a group of 50 users, randomly selected from the database, and the reward is the number of recorded clicks. As the average reward is typically equal to 0.15, we use a logarithmic link function corresponding to Poisson regression. The vector of covariates for each pair is of dimension 19: it is composed of an intercept followed by the concatenation of two vectors of dimension 10 and 8 representing, respectively, the categories of the pages and the ads. In this problem, the covariate vectors do not span the entire space; to address this issue, it is sufficient to consider the pseudo-inverse of Mt instead of the inverse.
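Concretely, this amounts to replacing $M_t^{-1}$ by the Moore–Penrose pseudo-inverse when computing the exploration bonus; a short sketch of ours:

```python
import numpy as np

def exploration_bonus(M_t, m_a, rho_t):
    # Use the Moore-Penrose pseudo-inverse so the bonus is well defined even
    # when the played covariate vectors do not span the whole space.
    M_pinv = np.linalg.pinv(M_t)
    return rho_t * np.sqrt(m_a @ M_pinv @ m_a)
```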
On this data, we compared the GLM-UCB algorithm with the two alternatives described in Section 5.1. Figure 2 shows that GLM-UCB once again outperforms its competitors, even though the margin over UCB is now less remarkable. Given the rather limited predictive power of the covariates in this example, this is an encouraging illustration of the potential of techniques which use vectors of covariates in real-life applications.
Figure 2: Comparison of the regret of the UCB, GLM-UCB and ε-greedy (ε = 0.1) algorithms on the advertisement dataset.
6 Conclusions
We have introduced an approach that generalizes the linear regression model studied by [10, 8, 12]. As in the original UCB algorithm, the proposed GLM-UCB method operates directly in the reward space. We discussed how to tune the parameters of the algorithm to avoid exaggerated optimism, which would slow down learning. In the numerical simulations, the proposed algorithm was shown to be competitive and sufficiently robust to tackle real-world problems. An interesting open problem (already challenging in the linear case) consists in tightening the theoretical results obtained so far in order to bridge the gap between the existing (pessimistic) confidence bounds and those suggested by the asymptotic arguments presented in Section 4.2, which have been shown to perform satisfactorily in practice.
Acknowledgments
This work was supported in part by AICML, AITF, NSERC, PASCAL2 under grant no. 216886, the DARPA GALE project under grant no. HR0011-08-C-0110 and Orange Labs under contract no. 289365.

References
[1] T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.
[3] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[4] J. Audibert, R. Munos, and Cs. Szepesvári. Tuning bandit algorithms in stochastic environments. Lecture Notes in Computer Science, 4754:150, 2007.
[5] C.C. Wang, S.R. Kulkarni, and H.V. Poor. Bandit problems with side observations. IEEE Transactions on Automatic Control, 50(3):338–355, 2005.
[6] J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. Advances in Neural Information Processing Systems, pages 817–824, 2008.
[7] S. Pandey, D. Chakrabarti, and D. Agarwal. Multi-armed bandit problems with dependent arms. International Conference on Machine Learning, pages 721–728, 2007.
[8] V. Dani, T.P. Hayes, and S.M. Kakade. Stochastic linear optimization under bandit feedback. Conference on Learning Theory, 2008.
[9] S.M. Kakade, S. Shalev-Shwartz, and A. Tewari. Efficient bandit algorithms for online multiclass prediction. In Proceedings of the 25th International Conference on Machine Learning, pages 440–447. ACM, 2008.
[10] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2002.
[11] Y. Abbasi-Yadkori, A. Antos, and Cs. Szepesvári. Forced-exploration based algorithms for playing in stochastic linear bandits. In COLT Workshop on On-line Learning with Limited Feedback, 2009.
[12] P. Rusmevichientong and J.N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
[13] P. McCullagh and J.A. Nelder. Generalized Linear Models. Chapman and Hall, 1989.
[14] K. Chen, I. Hu, and Z. Ying. Strong consistency of maximum quasi-likelihood estimators in generalized linear models with fixed and adaptive designs. Annals of Statistics, 27(4):1155–1163, 1999.
[15] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Advances in Neural Information Processing Systems.