
A Modern Introduction to Online Learning
Francesco Orabona Boston University
May 30, 2023
arXiv:1912.13213v6 [cs.LG] 28 May 2023

Contents

Abstract

1 What is Online Learning?
1.1 History Bits

2 Online Subgradient Descent
2.1 Online Learning with Convex Differentiable Losses
2.1.1 Convex Analysis Bits: Convexity
2.1.2 Online Gradient Descent
2.2 Online Subgradient Descent
2.2.1 Convex Analysis Bits: Subgradients
2.2.2 Analysis with Subgradients
2.3 From Convex Losses to Linear Losses
2.4 History Bits
2.5 Exercises

3 Online-to-Batch Conversion
3.1 Agnostic PAC Learning
3.2 Bits on Concentration Inequalities
3.3 From Regret to Agnostic PAC
3.4 History Bits
3.5 Exercises
4 Beyond √T Regret
4.1 Strong Convexity and Online Subgradient Descent
4.1.1 Convex Analysis Bits: Strong Convexity
4.1.2 Online Subgradient Descent for Strongly Convex Losses
4.2 Adaptive Algorithms: L⋆ bounds and AdaGrad
4.2.1 Adaptive Learning Rates for Online Subgradient Descent
4.2.2 Convex Analysis Bits: Dual Norms and Smooth Functions
4.2.3 L⋆ bounds
4.2.4 AdaGrad
4.3 History Bits
4.4 Exercises

5 Lower bounds for Online Linear Optimization
5.1 Lower bounds for Bounded OLO
5.2 Unconstrained Online Linear Optimization
5.2.1 Convex Analysis Bits: Fenchel Conjugate
5.2.2 Lower Bound for the Unconstrained Case
5.3 History Bits
5.4 Exercises

6 Online Mirror Descent
6.1 Subgradients are not Informative
6.2 Reinterpreting the Online Subgradient Descent Algorithm
6.3 Convex Analysis Bits: Bregman Divergence
6.4 Online Mirror Descent
6.4.1 The “Mirror” Interpretation
6.4.2 Yet Another Way to Write the Online Mirror Descent Update
6.5 OMD Regret Bound using Local Norms
6.6 Example of OMD: Exponentiated Gradient
6.7 Example of OMD: p-norm Algorithms
6.8 An Application of Online Mirror Descent: Learning with Expert Advice
6.9 Optimistic OMD
6.10 History Bits
6.11 Exercises
7 Follow-The-Regularized-Leader
7.1 The Follow-the-Regularized-Leader Algorithm
7.2 FTRL Regret Bound using Strong Convexity
7.2.1 Convex Analysis Bits: Properties of Strongly Convex Functions
7.2.2 An Explicit Regret Bound
7.3 FTRL with Linearized Losses
7.3.1 FTRL with Linearized Losses Can Be Equivalent to OMD
7.4 FTRL Regret Bound using Local Norms
7.5 Example of FTRL: Exponentiated Gradient without Knowing T
7.6 Example of FTRL: AdaHedge*
7.7 Example of FTRL: Group Norms
7.8 Composite Losses and Proximal Operators
7.9 FTRL Regret Bound with Proximal Regularizers
7.10 Online Newton Step
7.11 Online Linear Regression: Vovk-Azoury-Warmuth Forecaster
7.12 Optimistic FTRL
7.12.1 Regret that Depends on the Variance of the Subgradients
7.12.2 Online Convex Optimization with Gradual Variations
7.13 History Bits
7.14 Exercises

8 Online Linear Classification
8.1 Online Randomized Classifier
8.2 The Perceptron Algorithm
8.3 History Bits
8.4 Exercises

9 Parameter-free Online Linear Optimization
9.1 Coin-Betting Game
9.2 Parameter-free 1d OCO through Coin-Betting
9.2.1 KT as a 1d Online Convex Optimization Algorithm
9.3 Coordinate-wise Parameter-free OCO
9.4 Parameter-free in Any Norm
9.5 Combining Online Convex Optimization Algorithms
9.6 Reduction to Learning with Experts
9.6.1 A Betting Strategy that Loses at Most a Constant Fraction of Money
9.7 History Bits
9.8 Exercises

10 Multi-Armed Bandit
10.1 Adversarial Multi-Armed Bandit
10.1.1 Exponential-weight algorithm for Exploration and Exploitation: Exp3
10.1.2 Optimal Regret Using OMD with Tsallis Entropy
10.2 Stochastic Bandits
10.2.1 Concentration Inequalities Bits
10.2.2 Explore-Then-Commit Algorithm
10.2.3 Upper Confidence Bound Algorithm
10.3 History Bits
10.4 Exercises

11 Saddle-Point Optimization and OCO Algorithms
11.1 Saddle-Point Problems
11.2 Solving Saddle-Point Problems with OCO
11.2.1 Variations with Best Response and Alternation
11.3 Game-Theory interpretation of Saddle-Point Problems
11.4 Boosting as a Two-Person Game
11.5 Faster Rates Through Optimism
11.6 Prescient Online Mirror Descent and Be-The-Regularized-Leader
11.7 History Bits
11.8 Exercises

12 Sequential Investment and Universal Portfolio Algorithms
12.1 Portfolio Selection with Exponentiated Gradient
12.2 Universal Portfolio Selection with F-Weighted Portfolio Algorithms
12.3 Information Theory Bits
12.4 Proof of Theorem 12.1
12.5 Portfolio Selection through Online-Newton-Step
12.6 Application: Portfolio Selection and Continuous Coin-Betting
12.7 Application: From Portfolio Regret to Time-Uniform Concentration Inequalities
12.8 History Bits
12.9 Exercises

A Appendix
A.1 The Lambert Function and Its Applications
A.2 Topology Bits

List of Definitions
2.2 Definition (Convex Set)
2.3 Definition (Convex Function)
2.14 Definition (Proper Function)
2.16 Definition (Subgradient)
2.23 Definition (Lipschitz Function)
3.9 Definition (Agnostic-PAC-learnable)
3.10 Definition (Martingale)
3.11 Definition (Supermartingale)
4.1 Definition (Strongly Convex Function)
4.15 Definition (Dual Norm)
4.20 Definition (Smooth Function)
5.3 Definition (Closed Function)
5.5 Definition (Fenchel Conjugate)
6.2 Definition (Strictly Convex Function)
6.3 Definition (Bregman Divergence)
7.16 Definition (Group Norm)
7.17 Definition (Absolutely Symmetric Function)
10.5 Definition (Subgaussian Random Variable)
11.1 Definition (Saddle Point)
11.9 Definition (Duality Gap)
11.10 Definition (ε-Saddle-Point)
12.2 Definition (Type of a sequence of symbols)
A.4 Definition (Bounded Set)
A.5 Definition (Open and Closed Sets)
A.7 Definition (Neighborhood)
A.8 Definition (Interior point and Interior of a Set)
A.9 Definition (Boundary points and Boundary of a Set)

List of Algorithms
2.1 Projected Online Gradient Descent
2.2 Projected Online Subgradient Descent
4.1 AdaGrad for Hyperrectangles
6.1 Online Mirror Descent
6.2 Exponentiated Gradient
6.3 Learning with Expert Advice through Randomization
6.4 Optimistic Online Mirror Descent
7.1 Follow-the-Regularized-Leader Algorithm
7.2 Follow-the-Regularized-Leader Algorithm on Linearized Losses
7.3 AdaHedge Algorithm
7.4 FTRL with Group Norms for Linear Regression
7.5 Follow-the-Regularized-Leader Algorithm with “Quadratized” Losses
7.6 Online Newton Step Algorithm
7.7 Vovk-Azoury-Warmuth Forecaster
7.8 Optimistic Follow-the-Regularized-Leader Algorithm
8.1 Randomized Online Linear Classifier through FTRL
8.2 Perceptron Algorithm
9.1 Krichevsky-Trofimov Bettor
9.2 Krichevsky-Trofimov OCO Algorithm
9.3 OCO with Coordinate-Wise Krichevsky-Trofimov
9.4 Learning Magnitude and Direction Separately
9.5 Learning with Expert Advice based on KT Bettors
10.1 Exponential Weights with Explicit Exploration for Multi-Armed Bandit
10.2 Exp3
10.3 INF Algorithm (OMD with Tsallis Entropy for Multi-Armed Bandit)
10.4 Explore-Then-Commit Algorithm
10.5 Upper Confidence Bound Algorithm
11.1 Solving Saddle-Point Problems with OCO
11.2 Saddle-Point Optimization with OCO and Y-Best-Response
11.3 Saddle-Point Optimization with OCO and X-Best-Response
11.4 Saddle-Point Optimization with OCO and Alternation
11.5 Boosting through OCO
11.6 Solving Saddle-Point Problems with Optimistic FTRL
11.7 Solving Saddle-Point Problems with Optimistic OMD
11.8 Prescient Online Mirror Descent
12.1 F-Weighted Portfolio Selection
12.2 Online Newton Step for Portfolio Selection
Disclaimer: This is work in progress; I plan to add more material and/or change/reorganize the content.
In this monograph, I introduce the basic concepts of Online Learning through a modern view of Online Convex Optimization. Here, online learning refers to the framework of regret minimization under worst-case assumptions. I present first-order and second-order algorithms for online learning with convex losses, in Euclidean and non-Euclidean settings. All the algorithms are clearly presented as instantiations of Online Mirror Descent or Follow-The-Regularized-Leader and their variants. Particular attention is given to the issue of tuning the parameters of the algorithms and learning in unbounded domains, through adaptive and parameter-free online learning algorithms. Non-convex losses are dealt with through convex surrogate losses and through randomization. The bandit setting is also briefly discussed, touching on the problem of adversarial and stochastic multi-armed bandits. These notes do not require prior knowledge of convex analysis and all the required mathematical tools are rigorously explained. Moreover, all the included proofs have been carefully chosen to be as simple and as short as possible.
I want to thank all the people who checked the proofs and reasonings in these notes. In particular, the students in my first class who mercilessly pointed out my mistakes, Nicolò Campolongo who found all the typos in my formulas, and Jake Abernethy for the brainstorming on presentation strategies. Other people who helped me with comments, feedback, references, and/or hunting typos (in alphabetical order): Andreas Argyriou, Param Kishor Budhraja, Nicolò Cesa-Bianchi, Keyi Chen, Mingyu Chen, Peiqing Chen, Ryan D'Orazio, Alon Gonen, Daniel Hsu, Gergely Imreh, Christian Kroer, Kwang-Sung Jun, Michał Kempka, Pierre Laforgue, Chuang-Chieh Lin, Shashank Manjunath, Aryan Mokhtari, Gergely Neu, Ankit Pensia, Daniel Roy, Guanghui Wang, and Jiujia Zhang.
This material is based upon work supported by the National Science Foundation under grant no. 1925930 “Col- laborative Research: TRIPODS Institute for Optimization and Learning”.
A note on citations: it is customary in the computer science literature to only cite the journal version of a result that first appeared in a conference. The rationale is that the conference version is only a preliminary version, while the journal one is often more complete and sometimes more correct. In these notes, I will not use this custom. Instead, in the presence of the conference and journal version of the same paper, I will cite both. The reason is that I want to clearly delineate the history of the ideas, their first inventors, and the unavoidable rediscoveries. Hence, I need the exact year when some ideas were first proposed. Moreover, in some rare cases the authors changed from the conference to the journal version, so citing only the latter would erase the contribution of some key people from the history of Science.

1 What is Online Learning?
Imagine the following repeated game: in each round $t = 1, \dots, T$
• An adversary chooses a real number $y_t \in [0, 1]$ and keeps it secret;
• You try to guess the real number, choosing $x_t \in [0, 1]$;
• The adversary's number is revealed and you pay the squared difference $(x_t - y_t)^2$.
Basically, we want to guess a sequence of numbers as precisely as possible. To make this a game, we now have to decide on the “winning condition”. Let's see what makes sense to consider as a winning condition.
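Before discussing winning conditions, here is a minimal Python sketch that makes the protocol concrete. The specific adversary and player at the end are arbitrary illustrative choices: any rule for generating the numbers or the guesses fits the same template.

def play_game(player, adversary, T):
    """Run the guessing game for T rounds and return the list of squared losses."""
    losses, history = [], []
    for t in range(1, T + 1):
        y_t = adversary(t, history)      # the adversary picks y_t in [0, 1] and keeps it secret
        x_t = player(t, history)         # the player guesses x_t in [0, 1] using only the past
        losses.append((x_t - y_t) ** 2)  # y_t is revealed and the squared difference is paid
        history.append((x_t, y_t))       # both numbers become part of the observable past
    return losses

# Illustrative choices: the adversary is allowed to follow any rule, even one adapted to our strategy.
adversary = lambda t, history: 0.9 if t % 2 == 0 else 0.1
player = lambda t, history: 0.5          # a naive constant guess

print(sum(play_game(player, adversary, T=100)))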
First, let's simplify the game a bit. Let's assume that the adversary draws the numbers i.i.d. from some fixed distribution over $[0, 1]$. However, he is still free to decide which distribution to use at the beginning of the game. If we knew the distribution, we could just predict the mean of the distribution in each round, and in expectation we would pay $\sigma^2 T$, where $\sigma^2$ is the variance of the distribution. We cannot do better than that! However, given that we do not know the distribution, it is natural to benchmark our strategy with respect to the optimal one. That is, it is natural to measure the quantity
$$\mathbb{E}\left[\sum_{t=1}^{T} (x_t - Y)^2\right] - \sigma^2 T, \tag{1.1}$$
or equivalently, considering the average,
$$\frac{1}{T}\,\mathbb{E}\left[\sum_{t=1}^{T} (x_t - Y)^2\right] - \sigma^2. \tag{1.2}$$
Clearly these quantities are positive and they seem to be a good measure, because they are somehow normalized with respect to the “difficulty” of the numbers generated by the adversary, through the variance of the distribution. It is not the only possible choice to measure our “success”, but for sure it is a reasonable one. It would make sense to consider a strategy “successful” if the difference in (1.1) grows sublinearly over time and, equivalently, if the difference in (1.2) goes to zero as the number of rounds T goes to infinity. That is, on average on the number of rounds, we would like our algorithm to be able to approach the optimal performance.
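Incidentally, the claim that we cannot pay less than $\sigma^2$ per round in expectation follows from the standard bias-variance decomposition: writing $\mu$ for the mean of the distribution of $Y$, for any fixed guess $x$ we have
$$\mathbb{E}\big[(x - Y)^2\big] = \mathbb{E}\big[(x - \mu + \mu - Y)^2\big] = (x - \mu)^2 + \sigma^2 \ge \sigma^2,$$
with equality if and only if $x = \mu$. The same bound holds for a guess that depends only on the past rounds, because such a guess is independent of the current $Y$. Hence, predicting the mean in every round costs exactly $\sigma^2 T$ in expectation over $T$ rounds, and no strategy can pay less.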
Minimizing Regret. Given that we have converged to what seems to be a good measure of success for the algorithm, let's now rewrite (1.1) in an equivalent way:
$$\mathbb{E}\left[\sum_{t=1}^{T} (x_t - Y)^2\right] - \min_{x \in [0,1]} \mathbb{E}\left[\sum_{t=1}^{T} (x - Y)^2\right].$$

Now, the last step: let's remove the assumption on how the data is generated, consider an arbitrary sequence of $y_t$, and keep using the same measure of success. Of course, we can remove the expectation because there is no stochasticity anymore. Note that in the stochastic case the optimal strategy is given by a single best prediction, so it was natural to compare against it. Instead, with arbitrary sequences it is not clear anymore that this is a good competitor. For example, we might consider a sequence of competitors instead of a single one. Indeed, this can be done, but the single competitor is still interesting in a variety of settings and simpler to explain. So, most of the time we will use a single competitor.
Now, we get that we will win the game if
$$\text{Regret}_T := \sum_{t=1}^{T} (x_t - y_t)^2 - \min_{x \in [0,1]} \sum_{t=1}^{T} (x - y_t)^2$$
grows sublinearly with $T$. The quantity above is called the Regret, because it measures how much the algorithm regrets not sticking on all the rounds to the optimal choice in hindsight. We will denote it by $\text{Regret}_T$. Our reasoning should provide sufficient justification for this metric; however, in the following we will see that it also makes sense from both a convex optimization and a machine learning point of view.
Let's now generalize the online game a bit more, considering that the algorithm outputs a vector $x_t \in V \subseteq \mathbb{R}^d$ and pays a loss $\ell_t : V \to \mathbb{R}$ that measures how good the prediction of the algorithm was in each round. The set $V$ is called the feasible set. Also, let's consider an arbitrary predictor $u$ in$^1$ $V \subseteq \mathbb{R}^d$ and let's parameterize the regret with respect to it: $\text{Regret}_T(u)$. So, to summarize, Online Learning is nothing else than designing and analyzing algorithms to minimize the Regret over a sequence of loss functions with respect to an arbitrary competitor $u \in V \subseteq \mathbb{R}^d$:
$$\text{Regret}_T(u) := \sum_{t=1}^{T} \ell_t(x_t) - \sum_{t=1}^{T} \ell_t(u).$$
Remark 1.1. Strictly speaking, the regret is also a function of the losses $\ell_1, \dots, \ell_T$. However, we will suppress this dependency for simplicity of notation.
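The definition translates directly into code. Here is a minimal Python sketch that computes $\text{Regret}_T(u)$ from a recorded run; the losses and predictions at the end are made-up numbers from the guessing game, used only for illustration.

def regret(losses, xs, u):
    """losses: list of loss functions l_t; xs: list of predictions x_t; u: a fixed competitor."""
    return sum(l(x) for l, x in zip(losses, xs)) - sum(l(u) for l in losses)

# Illustrative data: l_t(x) = (x - y_t)^2 as in the guessing game.
ys = [0.2, 0.8, 0.5, 0.9]
losses = [lambda x, y=y: (x - y) ** 2 for y in ys]
xs = [0.5, 0.5, 0.5, 0.5]       # the algorithm's predictions
u = sum(ys) / len(ys)           # for squared losses, the best fixed guess in hindsight
print(regret(losses, xs, u))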
This framework is pretty powerful, and it allows us to reformulate a bunch of different problems in machine learning and optimization as similar games. More generally, with the regret framework we can analyze situations in which the data are not independent and identically distributed from a distribution, yet we would like to guarantee that the algorithm is “learning” something. For example, online learning can be used to analyze
• Prediction of clicks on banners on webpages;
• Routing on a network;
• Convergence to equilibrium of repeated games.
It can also be used to analyze stochastic algorithms, e.g., Stochastic Gradient Descent, but the adversarial nature of the analysis might give you suboptimal results. For example, it can be used to analyze momentum algorithms, but the adversarial nature of the losses might force you to prove a convergence guarantee that treats the momentum term as a vanishing disturbance that does not help the algorithm in any way.
Let's now go back to our number guessing game and let's try a strategy to win it. Of course, this is one of the simplest examples of online learning, without a real application. Yet, going through it we will uncover most of the key ingredients of online learning algorithms and their analysis.
A Winning Strategy. Can we win the number guessing game? Note that we did not assume anything about how the adversary decides the numbers. In fact, the numbers can be generated in any way, even in an adaptive way based on our strategy. Indeed, they can be chosen adversarially, that is, explicitly trying to make us lose the game. This is why we call the mechanism generating the numbers the adversary.
$^1$In some cases, we can make the game easier for the algorithm by letting it choose the prediction from a set $W \supset V$.

The fact that the numbers are adversarially chosen means that we can immediately rule out any strategy based on any statistical modeling of the data. In fact, it cannot work because the moment we estimate something and act on our estimate, the adversary can immediately change the way he is generating the data, ruining us. So, we have to think about something else. Yet, many times online learning algorithms will look like classic ones from statistical estimation, even if they work for different reasons.
Now, let's try to design a strategy to make the regret provably sublinear in time, regardless of how the adversary chooses the numbers. The first thing to do is to take a look at the best strategy in hindsight, that is, the argmin of the second term of the regret. It should be immediate to see that
$$x^\star_T := \operatorname*{argmin}_{x \in [0,1]} \sum_{t=1}^{T} (x - y_t)^2 = \frac{1}{T} \sum_{t=1}^{T} y_t.$$
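The calculation behind this formula is the standard first-order one: the objective is convex and differentiable in $x$, so we can set its derivative to zero, and the constraint $x \in [0,1]$ is inactive because the average of numbers in $[0,1]$ stays in $[0,1]$:
$$\frac{d}{dx} \sum_{t=1}^{T} (x - y_t)^2 = 2 \sum_{t=1}^{T} (x - y_t) = 2\left(T x - \sum_{t=1}^{T} y_t\right) = 0 \quad \Longrightarrow \quad x^\star_T = \frac{1}{T} \sum_{t=1}^{T} y_t.$$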
Now, given that we do not know the future, for sure we cannot use $x^\star_T$ as our guess in each round. However, we do know the past, so a reasonable strategy in each round could be to output the best number over the past rounds. Why would such a strategy work? For sure, the reason why it could work is not that we expect the future to be like the past, because that is not true! Instead, we want to leverage the fact that the optimal guess over time cannot change too much between rounds, so we can try to “track” it over time.
Hence, on each round $t$ our strategy is to guess $x_t = x^\star_{t-1} = \frac{1}{t-1} \sum_{i=1}^{t-1} y_i$. Such a strategy is usually called Follow-the-Leader (FTL), because you are following what would have been the optimal thing to do on the past rounds (the Leader).
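Here is a minimal Python sketch of FTL for this game, together with an empirical check of the logarithmic regret guarantee proved below in Theorem 1.3. The sequence used at the end is an arbitrary illustrative choice, and the first guess is set to 0 because $x^\star_0$ is undefined.

import math

def ftl_regret(ys):
    """Play Follow-the-Leader (guess the running mean) on ys and return the regret."""
    T = len(ys)
    running_sum, alg_loss = 0.0, 0.0
    for t, y in enumerate(ys, start=1):
        x_t = running_sum / (t - 1) if t > 1 else 0.0  # mean of the past; x_1 set arbitrarily to 0
        alg_loss += (x_t - y) ** 2
        running_sum += y
    best = running_sum / T                             # best fixed guess in hindsight
    best_loss = sum((best - y) ** 2 for y in ys)
    return alg_loss - best_loss

# Any sequence in [0, 1] is allowed; this one is just an arbitrary deterministic example.
ys = [((7 * t) % 10) / 10 for t in range(1, 1001)]
print("regret:", ftl_regret(ys), "  bound:", 4 + 4 * math.log(len(ys)))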
Let’s now try to show that indeed this strategy will allow us to win the game. Given that this is a simple example, we will prove its regret guarantee using first principles. In the following, we will introduce and use very general proof methods. First, we will need a small lemma.
Lemma 1.2. Let $V \subseteq \mathbb{R}^d$ and let $\ell_t : V \to \mathbb{R}$ be an arbitrary sequence of loss functions. Denote by $x^\star_t$ a minimizer of the cumulative losses over the previous $t$ rounds in $V$. Then, we have
$$\sum_{t=1}^{T} \ell_t(x^\star_t) \le \sum_{t=1}^{T} \ell_t(x^\star_T).$$
Proof. We prove it by induction over $T$. The base case is
$$\ell_1(x^\star_1) \le \ell_1(x^\star_1),$$
that is trivially true. Now, for $T \ge 2$, we assume that $\sum_{t=1}^{T-1} \ell_t(x^\star_t) \le \sum_{t=1}^{T-1} \ell_t(x^\star_{T-1})$ is true and we must prove the stated inequality, that is
$$\sum_{t=1}^{T} \ell_t(x^\star_t) \le \sum_{t=1}^{T} \ell_t(x^\star_T). \tag{1.3}$$
This inequality is equivalent to
$$\sum_{t=1}^{T-1} \ell_t(x^\star_t) \le \sum_{t=1}^{T-1} \ell_t(x^\star_T),$$
where we removed the last element of the sums because they are the same. Now observe that
$$\sum_{t=1}^{T-1} \ell_t(x^\star_t) \le \sum_{t=1}^{T-1} \ell_t(x^\star_{T-1})$$
by induction hypothesis, and
$$\sum_{t=1}^{T-1} \ell_t(x^\star_{T-1}) \le \sum_{t=1}^{T-1} \ell_t(x^\star_T)$$
because $x^\star_{T-1}$ is a minimizer of the left hand side in $V$ and $x^\star_T \in V$. Chaining these two inequalities, we have that (1.3) is true, and so the theorem is proven.
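As a quick numerical sanity check of the lemma, consider the guessing game with squared losses and an arbitrary random sequence; the minimizer over the first $t$ rounds is the mean of $y_1, \dots, y_t$, and the inequality can be verified directly:

import random

random.seed(0)
ys = [random.random() for _ in range(200)]
# x*_t is the minimizer over the first t rounds, i.e., the mean of y_1, ..., y_t.
prefix_means = [sum(ys[:t]) / t for t in range(1, len(ys) + 1)]
lhs = sum((prefix_means[t - 1] - ys[t - 1]) ** 2 for t in range(1, len(ys) + 1))
rhs = sum((prefix_means[-1] - y) ** 2 for y in ys)
print(lhs <= rhs)   # True, as guaranteed by Lemma 1.2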
Basically, the above lemma quantifies the idea that knowing the future and being adaptive to it is typically better than not being adaptive to it.
With this lemma, we can now prove that the regret will grow sublinearly, in particular it will be logarithmic in time. Note that we will not prove that our strategy is minimax optimal, even if it is possible to show that the logarithmic dependency on time is unavoidable for this problem.
Theorem 1.3. Let $y_t \in [0, 1]$ for $t = 1, \dots, T$ be an arbitrary sequence of numbers. Let the algorithm's output be $x_t = \frac{1}{t-1} \sum_{i=1}^{t-1} y_i$. Then, we have
$$\text{Regret}_T = \sum_{t=1}^{T} (x_t - y_t)^2 - \min_{x \in [0,1]} \sum_{t=1}^{T} (x - y_t)^2 \le 4 + 4 \ln T.$$
Proof. We use Lemma 1.2 to upper bound the regret:
$$\sum_{t=1}^{T} (x_t - y_t)^2 - \min_{x \in [0,1]} \sum_{t=1}^{T} (x - y_t)^2 = \sum_{t=1}^{T} (x^\star_{t-1} - y_t)^2 - \sum_{t=1}^{T} (x^\star_T - y_t)^2 \le \sum_{t=1}^{T} (x^\star_{t-1} - y_t)^2 - \sum_{t=1}^{T} (x^\star_t - y_t)^2.$$
Now, let's take a look at each difference in the sum in the last equation. We have that
$$\begin{aligned}
(x^\star_{t-1} - y_t)^2 - (x^\star_t - y_t)^2
&= (x^\star_{t-1})^2 - 2 y_t x^\star_{t-1} - (x^\star_t)^2 + 2 y_t x^\star_t
 = (x^\star_{t-1} + x^\star_t - 2 y_t)(x^\star_{t-1} - x^\star_t) \\
&\le |x^\star_{t-1} + x^\star_t - 2 y_t|\,|x^\star_{t-1} - x^\star_t|
 \le 2\,|x^\star_{t-1} - x^\star_t| \\
&= 2\left|\frac{1}{t-1}\sum_{i=1}^{t-1} y_i - \frac{1}{t}\sum_{i=1}^{t} y_i\right|
 = 2\left|\frac{1}{t-1}\sum_{i=1}^{t-1} y_i - \frac{t-1}{t}\,\frac{1}{t-1}\sum_{i=1}^{t-1} y_i - \frac{y_t}{t}\right| \\
&= 2\left|\frac{1}{t(t-1)}\sum_{i=1}^{t-1} y_i - \frac{y_t}{t}\right|
 \le \frac{2}{t(t-1)}\sum_{i=1}^{t-1} y_i + \frac{2|y_t|}{t}
 \le \frac{2}{t} + \frac{2|y_t|}{t}
 \le \frac{4}{t}.
\end{aligned}$$
Hence, overall we have
$$\sum_{t=1}^{T} (x_t - y_t)^2 - \min_{x \in [0,1]} \sum_{t=1}^{T} (x - y_t)^2 \le 4 \sum_{t=1}^{T} \frac{1}{t}.$$
To upper bound the last sum, observe that we are trying to find an upper bound to the green area in Figure 1.1. As you can see from the picture, it can be upper bounded by 1 plus the integral of $\frac{1}{t-1}$ from 2 to $T+1$. So, we have
$$\sum_{t=1}^{T} \frac{1}{t} \le 1 + \int_{2}^{T+1} \frac{1}{t-1}\, dt = 1 + \ln T.$$
Let's write in words the steps of the proof: Lemma 1.2 allows us to upper bound the regret against the single best guess with the regret against the sequence $x^\star_1, \dots, x^\star_T$. In turn, given that $|x^\star_t - x^\star_{t-1}|$ goes to zero very fast, the total regret is sublinear in time.
Figure 1.1: Upper bounding the sum with an integral.

There are a few things to stress about this strategy. The strategy does not have parameters to tune (e.g., learning rates, regularizers). Note that the presence of parameters does not make sense in online learning: we have only one stream of data and we cannot run our algorithm over it multiple times to select the best parameter! Also, the strategy does not need to maintain a complete record of the past, but only a “summary” of it, through the running average. This gives a computationally efficient algorithm. When we design online learning algorithms, we will strive to achieve all these characteristics. The last thing I want to stress is that the algorithm does not use gradients: gradients are useful and we will use them a lot, but they do not constitute the entire world of online learning.
Before going on I want to remind you that, as seen above, this is different from the classic setting in statistical machine learning. So, for example, “overfitting” has absolutely no meaning here. Same for “generalization gap” and similar ideas linked to a training/testing scenario.
In the next chapters, we will introduce several algorithms for online optimization and one of them will be a strict generalization of the strategy we used in the example above.
1.1 History Bits
The concept of “regret” seems to have been proposed by Savage [1951], in an exposition and review of the book by Wald [1950] on a foundation of statistical decision problems based on zero-sum two-person games. Savage [1951] introduces the idea of considering the difference between the utility of the best action in a given state and the utility incurred by any action under the same state. The proposed optimal strategy was the one minimizing such regret over the worst possible state. Savage [1951] called this concept “loss” and did not like the word “regret” because “that term seems to me charged with emotion and liable to lead to such misinterpretation as that the loss necessarily becomes known” [Savage, 1954, page 163]. The name “regret” instead seems to have been suggested in Milnor [1951].
However, Savage’s definition is a modification of the one proposed by Wald [1950], who instead proposed to maximize just the utility, under the assumption that the utility of the best action for any st