Lecture Notes on Statistics and Information Theory
John Duchi
December 6, 2023
Contents

1 Introduction and setting
  1.1 Information theory
  1.2 Moving to statistics
  1.3 A remark about measure theory
  1.4 Outline and chapter discussion

2 An information theory review
  2.1 Basics of Information Theory
    2.1.1 Definitions
    2.1.2 Chain rules and related properties
    2.1.3 Data processing inequalities
  2.2 General divergence measures and definitions
    2.2.1 Partitions, algebras, and quantizers
    2.2.2 KL-divergence
    2.2.3 f-divergences
    2.2.4 Inequalities and relationships between divergences
    2.2.5 Convexity and data processing for divergence measures
  2.3 First steps into optimal procedures: testing inequalities
    2.3.1 Le Cam's inequality and binary hypothesis testing
    2.3.2 Fano's inequality and multiple hypothesis testing
  2.4 A first operational result: entropy and source coding
    2.4.1 The source coding problem
    2.4.2 The Kraft-McMillan inequalities
    2.4.3 Entropy rates and longer codes
  2.5 Bibliography
  2.6 Exercises

3 Exponential families and statistical modeling
  3.1 Exponential family models
  3.2 Why exponential families?
    3.2.1 Fitting an exponential family model
  3.3 Divergence measures and information for exponential families
  3.4 Generalized linear models and regression
    3.4.1 Fitting a generalized linear model from a sample
  3.5 Lower bounds on testing a parameter's value
  3.6 Deferred proofs
    3.6.1 Proof of Proposition 3.2.2
  3.7 Bibliography
  3.8 Exercises

I Concentration, information, stability, and generalization

4 Concentration Inequalities
  4.1 Basic tail inequalities
    4.1.1 Sub-Gaussian random variables
    4.1.2 Sub-exponential random variables
    4.1.3 Orlicz norms
    4.1.4 First applications of concentration: random projections
    4.1.5 A second application of concentration: codebook generation
  4.2 Martingale methods
    4.2.1 Sub-Gaussian martingales and Azuma-Hoeffding inequalities
    4.2.2 Examples and bounded differences
  4.3 Uniformity and metric entropy
    4.3.1 Symmetrization and uniform laws
    4.3.2 Metric entropy, coverings, and packings
  4.4 Generalization bounds
    4.4.1 Finite and countable classes of functions
    4.4.2 Large classes
    4.4.3 Structural risk minimization and adaptivity
  4.5 Technical proofs
    4.5.1 Proof of Theorem 4.1.11
    4.5.2 Proof of Theorem 4.1.15
    4.5.3 Proof of Theorem 4.3.6
  4.6 Bibliography
  4.7 Exercises

5 Generalization and stability
  5.1 The variational representation of Kullback-Leibler divergence
  5.2 PAC-Bayes bounds
    5.2.1 Relative bounds
    5.2.2 A large-margin guarantee
    5.2.3 A mutual information bound
  5.3 Interactive data analysis
    5.3.1 The interactive setting
    5.3.2 Second moment errors and mutual information
    5.3.3 Limiting interaction in interactive analyses
    5.3.4 Error bounds for a simple noise addition scheme
  5.4 Bibliography and further reading
  5.5 Exercises
6 Advanced techniques in concentration inequalities
  6.1 Entropy and concentration inequalities
    6.1.1 The Herbst argument
    6.1.2 Tensorizing the entropy
    6.1.3 Concentration of convex functions

7 Privacy and disclosure limitation
  7.1 Disclosure limitation, privacy, and definitions
    7.1.1 Basic mechanisms
    7.1.2 Resilience to side information, Bayesian perspectives, and data processing
  7.2 Weakenings of differential privacy
    7.2.1 Basic mechanisms
    7.2.2 Connections between privacy measures
    7.2.3 Side information protections under weakened notions of privacy
  7.3 Composition and privacy based on divergence
    7.3.1 Composition of Rényi-private channels
    7.3.2 Privacy games and composition
  7.4 Additional mechanisms and privacy-preserving algorithms
    7.4.1 The exponential mechanism
    7.4.2 Local sensitivities and the inverse sensitivity mechanism
  7.5 Deferred proofs
    7.5.1 Proof of Lemma 7.2.10
  7.6 Bibliography
  7.7 Exercises

II Fundamental limits and optimality

8 Minimax lower bounds: the Le Cam, Fano, and Assouad methods
  8.1 Basic framework and minimax risk
  8.2 Preliminaries on methods for lower bounds
    8.2.1 From estimation to testing
    8.2.2 Inequalities between divergences and product distributions
    8.2.3 Metric entropy and packing numbers
  8.3 Le Cam's method
  8.4 Fano's method
    8.4.1 The classical (local) Fano method
    8.4.2 A distance-based Fano method
  8.5 Assouad's method
    8.5.1 Well-separated problems
    8.5.2 From estimation to multiple binary tests
    8.5.3 Example applications of Assouad's method
  8.6 Nonparametric regression: minimax upper and lower bounds
    8.6.1 Kernel estimates of the function
    8.6.2 Minimax lower bounds on estimation with Assouad's method
  8.7 Global Fano Method
    8.7.1 A mutual information bound based on metric entropy
    8.7.2 Minimax bounds using global packings
    8.7.3 Example: non-parametric regression
  8.8 Deferred proofs
    8.8.1 Proof of Proposition 8.4.6
    8.8.2 Proof of Corollary 8.4.7
    8.8.3 Proof of Lemma 8.5.2
  8.9 Bibliography
  8.10 Exercises

9 Constrained risk inequalities
  9.1 Strong data processing inequalities
  9.2 Local privacy
  9.3 Communication complexity
    9.3.1 Classical communication complexity problems
    9.3.2 Deterministic communication: lower bounds and structure
    9.3.3 Randomization, information complexity, and direct sums
    9.3.4 The structure of randomized communication and communication complexity of primitives
  9.4 Communication complexity in estimation
    9.4.1 Direct sum communication bounds
    9.4.2 Communication data processing
    9.4.3 Applications: communication and privacy lower bounds
  9.5 Proof of Theorem 9.4.4
    9.5.1 Proof of Lemma 9.5.3
  9.6 Bibliography
  9.7 Exercises

10 Testing and functional estimation
  10.1 Le Cam's convex hull method
    10.1.1 The χ²-mixture bound
    10.1.2 Estimating errors and the norm of a Gaussian vector
  10.2 Minimax hypothesis testing
    10.2.1 Detecting a difference in populations
    10.2.2 Signal detection and testing a Gaussian mean
    10.2.3 Goodness of fit and two-sample tests for multinomials
  10.3 Geometrizing rates of convergence
  10.4 Best possible lower bounds and super-efficiency
  10.5 Bibliography
  10.6 A useful divergence calculation
  10.7 Exercises

III Entropy, predictions, divergences, and information

11 Predictions, loss functions, and entropies
  11.1 Proper losses, scoring rules, and generalized entropies
    11.1.1 A convexity primer
    11.1.2 From a proper loss to an entropy
    11.1.3 The information in an experiment
  11.2 Characterizing proper losses and Bregman divergences
    11.2.1 Characterizing proper losses for Y taking finitely many values
    11.2.2 General proper losses
    11.2.3 Proper losses and vector-valued Y
  11.3 From entropies to convex losses, arbitrary predictions, and link functions
    11.3.1 Convex conjugate linkages
    11.3.2 Convex conjugate linkages with affine constraints
  11.4 Exponential families, maximum entropy, and log loss
    11.4.1 Maximizing entropy
    11.4.2 I-projections and maximum likelihood
  11.5 Technical and deferred proofs
    11.5.1 Finalizing the proof of Theorem 11.2.14
    11.5.2 Proof of Proposition 11.4.1
    11.5.3 Proof of Proposition 11.4.3
  11.6 Exercises

12 Calibration and Proper Losses
  12.1 Proper losses and calibration error
  12.2 Measuring calibration
    12.2.1 The impossibility of measuring calibration
    12.2.2 Alternative calibration measures
  12.3 Auditing and improving calibration at the population level
    12.3.1 The post-processing gap and calibration audits for squared error
    12.3.2 Calibration audits for losses based on conjugate linkages
    12.3.3 A population-level algorithm for calibration
  12.4 Calibeating: improving squared error by calibration
    12.4.1 Proof of Theorem 12.4.1
  12.5 Continuous and equivalent calibration measures
    12.5.1 Calibration measures
    12.5.2 Equivalent calibration measures
  12.6 Deferred technical proofs
    12.6.1 Proof of Lemma 12.2.1
    12.6.2 Proof of Proposition 12.5.2
    12.6.3 Proof of Lemma 12.5.4
    12.6.4 Proof of Theorem 12.5.6
  12.7 Bibliography
  12.8 Exercises

13 Surrogate Risk Consistency: the Classification Case
  13.1 General results
  13.2 Proofs of convex analytic results
    13.2.1 Proof of Lemma 13.0.4
    13.2.2 Proof of Lemma 13.0.4
    13.2.3 Proof of Lemma 13.0.6
  13.3 Exercises
14 Divergences, classification, and risk
  14.1 Generalized entropies
  14.2 From entropy to losses
    14.2.1 Classification case
    14.2.2 Structured prediction case
  14.3 Predictions, calibration, and scoring rules
  14.4 Surrogate risk consistency
    14.4.1 Uniformly convex case
    14.4.2 Structured prediction (discrete) case
    14.4.3 Proof of Theorem 14.4.1
  14.5 Loss equivalence
  14.6 Proof of Theorem 14.5.1
    14.6.1 Proof of Lemma 14.6.2
    14.6.2 Proof of Lemma 14.6.4
  14.7 Bibliography
  14.8 Exercises

15 Fisher Information
  15.1 Fisher information: definitions and examples
  15.2 Estimation and Fisher information: elementary considerations
  15.3 Connections between Fisher information and divergence measures

IV Online game playing and compression

16 Universal prediction and coding
  16.1 Basics of minimax game playing with log loss
  16.2 Universal and sequential prediction
  16.3 Minimax strategies for regret
  16.4 Mixture (Bayesian) strategies and redundancy
    16.4.1 Bayesian redundancy and objective, reference, and Jeffreys priors
    16.4.2 Redundancy capacity duality
  16.5 Asymptotic normality and Theorem 16.4.1
    16.5.1 Heuristic justification of asymptotic normality
    16.5.2 Heuristic calculations of posterior distributions and redundancy
  16.6 Proof of Theorem 16.4.5
  16.7 Exercises

17 Universal prediction with other losses
  17.1 Redundancy and expected regret
    17.1.1 Universal prediction via the log loss
    17.1.2 Examples
  17.2 Individual sequence prediction and regret
18 Online convex optimization
  18.1 The problem of online convex optimization
  18.2 Online gradient and non-Euclidean gradient (mirror) descent
    18.2.1 Proof of Theorem 18.2.5
  18.3 Online to batch conversions
  18.4 More refined convergence guarantees
    18.4.1 Proof of Proposition 18.4.1

19 Exploration, exploitation, and bandit problems
  19.1 Confidence-based algorithms
  19.2 Bayesian approaches to bandits
    19.2.1 Posterior (Thompson) sampling
    19.2.2 An information-theoretic analysis
    19.2.3 Information and exploration
  19.3 Online gradient descent approaches
  19.4 Further notes and references
  19.5 Technical proofs
    19.5.1 Proof of Claim (19.1.1)

V Appendices

A Miscellaneous mathematical results
  A.1 The roots of a polynomial
  A.2 Measure-theoretic development of divergence measures

B Convex Analysis
  B.1 Convex sets
    B.1.1 Operations preserving convexity
    B.1.2 Representation and separation of convex sets
  B.2 Sublinear and support functions
  B.3 Convex functions
    B.3.1 Equivalent definitions of convex functions
    B.3.2 Continuity properties of convex functions
    B.3.3 Operations preserving convexity
    B.3.4 Smoothness properties, first-order developments for convex functions, and subdifferentiability
    B.3.5 Calculus rules of subgradients

C Optimality, stability, and duality
  C.1 Optimality conditions and stability properties
    C.1.1 Subgradient characterizations for optimality
    C.1.2 Stability properties of minimizers
  C.2 Conjugacy and duality properties
    C.2.1 Gradient dualities and the Fenchel-Young inequality
    C.2.2 Smoothness and strict convexity of conjugates
  C.3 Exercises
1 Introduction and setting
This set of lecture notes explores some of the (many) connections relating information theory, statistics, computation, and learning. Signal processing, machine learning, and statistics all revolve around extracting useful information from signals and data. In signal processing and information theory, a central question is how to best design signals—and the channels over which they are transmitted—to maximally communicate and store information, and to allow the most effective decoding. In machine learning and statistics, by contrast, it is often the case that there is a fixed data distribution that nature provides, and it is the learner’s or statistician’s goal to recover information about this (unknown) distribution.
A central aspect of information theory is the discovery of fundamental results: results that demonstrate that certain procedures are optimal. That is, information-theoretic tools allow a characterization of the attainable results in a variety of communication and statistical settings. As we explore in these notes in the context of statistical, inferential, and machine learning tasks, this allows us to develop procedures whose optimality we can certify—no better procedure is possible. Such results are useful for a myriad of reasons: we would like to avoid making bad decisions or false inferences, we may realize a task is impossible, and we can explicitly calculate the amount of data necessary for solving different statistical problems.
1.1 Information theory
Information theory is a broad field, but focuses on several main questions: what is information, how much information content do various signals and data hold, and how much information can be reliably transmitted over a channel. We will vastly oversimplify information theory into two main questions with corresponding chains of tasks.
1. How much information does a signal contain?
2. How much information can a noisy channel reliably transmit?
In this context, we provide two main high-level examples, one for each of these tasks.
Example 1.1.1 (Source coding): The source coding, or data compression problem, is to take information from a source, compress it, decompress it, and recover the original message. Graphically, we have
Source → Compressor → Decompressor → Receiver
The question, then, is how to design a compressor (encoder) and decompressor (decoder) that uses the fewest number of bits to describe a source (or a message) while preserving all the information, in the sense that the receiver receives the correct message with high probability. This fewest number of bits is then the information content of the source (signal). Q
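To make this concrete, here is a toy numerical sketch (an illustration added here, not developed in the notes; the three-symbol source and the particular prefix code are hypothetical choices). A source emitting symbols a, b, c with probabilities 1/2, 1/4, 1/4 has entropy 1.5 bits per symbol, and the prefix-free code {a ↦ 0, b ↦ 10, c ↦ 11} achieves exactly this expected length:

```python
import math

# Hypothetical three-symbol source and its emission probabilities.
probs = {"a": 0.5, "b": 0.25, "c": 0.25}

# A prefix-free (instantaneously decodable) code for this source.
code = {"a": "0", "b": "10", "c": "11"}

# Shannon entropy H(X) = -sum_x p(x) log2 p(x), in bits per symbol.
entropy = -sum(p * math.log2(p) for p in probs.values())

# Expected code length E[len(code(X))] = sum_x p(x) * len(code(x)).
expected_length = sum(probs[x] * len(code[x]) for x in probs)

print(f"entropy              = {entropy:.3f} bits/symbol")        # 1.500
print(f"expected code length = {expected_length:.3f} bits/symbol") # 1.500
```

No uniquely decodable code can use fewer than H(X) bits per symbol on average, which is the sense in which entropy measures the information content of a source; Chapter 2 makes this precise.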
Example 1.1.2: The channel coding, or data transmission problem, is the same as the source coding problem of Example 1.1.1, except that between the compressor and decompressor is a source of noise, a channel. In this case, the graphical representation is
Source → Compressor → Channel → Decompressor → Receiver
Here the question is the maximum number of bits that may be sent per channel use in the sense that the receiver may reconstruct the desired message with low probability of error. Because the channel introduces noise, we require some redundancy, and information theory studies the exact amount of redundancy and number of bits that must be sent to allow such reconstruction. Q
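To see how redundancy buys reliability, consider a toy sketch (again an illustration added here rather than a construction from the notes; the binary symmetric channel and the repetition code are the simplest possible choices). If the channel flips each bit independently with probability p = 0.1, sending each bit three times and decoding by majority vote reduces the per-bit error probability from p to 3p²(1 − p) + p³ ≈ 0.028, at the cost of three channel uses per message bit:

```python
import random

def bsc(bits, p, rng):
    """Binary symmetric channel: flip each bit independently with probability p."""
    return [b ^ (rng.random() < p) for b in bits]

def encode_repetition(bits, k=3):
    """Add redundancy by repeating each message bit k times."""
    return [b for b in bits for _ in range(k)]

def decode_majority(received, k=3):
    """Decode each block of k received bits by majority vote."""
    return [int(sum(received[i:i + k]) > k // 2)
            for i in range(0, len(received), k)]

rng = random.Random(0)
p = 0.1
message = [rng.randint(0, 1) for _ in range(100_000)]

uncoded = bsc(message, p, rng)
decoded = decode_majority(bsc(encode_repetition(message), p, rng))

print(sum(m != u for m, u in zip(message, uncoded)) / len(message))  # about 0.10
print(sum(m != d for m, d in zip(message, decoded)) / len(message))  # about 0.028
```

The channel capacity characterizes the best achievable trade-off between this kind of redundancy and reliability: it is the largest rate at which the error probability can be driven to zero.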
1.2 Moving to statistics
Statistics and machine learning can—broadly—be studied with the same views in mind: both can be thought of as (perhaps shoehorned into) source coding and channel coding problems.
In the analogy with source coding, we observe a sequence of data points X1, . . . , Xn drawn from some (unknown) distribution P on a space X. For example, we might be observing species that biologists collect. Then the analogue of source coding is to construct a model (often a generative model) that encodes the data using relatively few bits: that is,
Source (P) −− X1, . . . , Xn −→ Compressor −− P̂ −→ Decompressor → Receiver.
Here, we estimate P̂, an empirical version of the distribution P that is easier to describe than the original signal X1, . . . , Xn, with the hope that we learn information about the generating distribution P, or at least describe it efficiently.
In our analogy with channel coding, we make a connection with estimation and inference. Roughly, the major problem in statistics we consider is as follows: there exists some unknown function f on a space X that we wish to estimate, and we are able to observe a noisy version of f(Xi) for a series of Xi drawn from a distribution P. Recalling the graphical description of Example 1.1.2, we now have a channel P(Y | f(X)) that gives us noisy observations of f(X) for each Xi, but we may (generally) no longer choose the encoder/compressor. That is, we have
Source (P) −− X1, . . . , Xn −→ Compressor −− f(X1), . . . , f(Xn) −→ Channel P(Y | f(X)) −− Y1, . . . , Yn −→ Decompressor.
The estimation—decompression—problem is to either estimate f, or, in some cases, to estimate other aspects of the source probability distribution P. In general, in statistics, we do not have any choice in the design of the compressor f that transforms the original signal X1, . . . , Xn, which makes it somewhat different from traditional ideas in information theory. In some cases that we explore later—such as experimental design, randomized controlled trials, reinforcement learning and bandits (and associated exploration/exploitation tradeoffs)—we are also able to influence the compression part of the above scheme.
Example 1.2.1: A classical example of the statistical paradigm in this lens is the usual linear regression problem. Here the data Xi belong to Rd, and the compression function is f(x) = θ⊤x for some vector θ ∈ Rd. Then the channel is often of the form

Yi = θ⊤Xi + εi,

where θ⊤Xi is the signal, εi is the noise, and the εi ∼ N(0, σ²) are independent mean-zero normal perturbations. The goal is, given a sequence of pairs (Xi, Yi), to recover the true θ in the linear model.

In active learning or active sensing scenarios, also known as (sequential) experimental design, we may choose the sequence Xi so as to better explore properties of θ. Later in the course we will investigate whether it is possible to improve estimation by these strategies. As one concrete idea, if we allow infinite power, which in this context corresponds to letting ∥Xi∥ → ∞—choosing very “large” vectors Xi—then the signal of θ⊤Xi should swamp any noise and make estimation easier. Q
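As a minimal simulation of this source/channel picture (a sketch with arbitrary choices of dimension, sample size, and noise level, none of which come from the notes), ordinary least squares plays the role of the decompressor recovering θ from the noisy observations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 1000, 5, 0.5                   # sample size, dimension, noise level (illustrative)

theta = rng.normal(size=d)                   # unknown parameter defining f(x) = theta^T x
X = rng.normal(size=(n, d))                  # source: X_1, ..., X_n drawn i.i.d. from P
Y = X @ theta + sigma * rng.normal(size=n)   # channel: Y_i = theta^T X_i + eps_i

# "Decompressor": ordinary least squares estimate of theta.
theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.linalg.norm(theta_hat - theta))     # small, and shrinks as n grows

# The "infinite power" idea: scale up ||X_i|| while the noise level stays fixed,
# so the signal dominates and the estimation error shrinks proportionally.
Y_big = (100 * X) @ theta + sigma * rng.normal(size=n)
theta_big, *_ = np.linalg.lstsq(100 * X, Y_big, rcond=None)
print(np.linalg.norm(theta_big - theta))     # roughly 100 times smaller
```

The factor-of-100 improvement in the second estimate reflects that the least squares error scales as the noise level divided by the signal strength.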
For the remainder of the class, we explore these ideas in substantially more detail.
1.3 A remark about measure theory
As this book focuses on a number of fundamental questions in statistics, machine learning, and information theory, fully general statements of the results often require measure theory. Thus, formulae such as ∫ f(x)dP(x) or ∫ f(x)dμ(x) appear. While knowledge of measure theory is certainly useful and may help appreciate the results, it is completely inessential to developing the intuition and, I hope, understanding the proofs and main results. Indeed, the best strategy (for a reader unfamiliar with measure theory) is to simply replace every instance of a formula such as dμ(x) with dx. The most frequent cases we encounter will be the following: we wish to compute the expectation of a function f of a random variable X following distribution P, that is, EP[f(X)]. Normally, we would write EP[f(X)] = ∫ f(x)dP(x), or sometimes EP[f(X)] = ∫ f(x)p(x)dμ(x), saying that “P has density p with respect to the underlying measure μ.” Instead, one may simply (and intuitively) assume that x really has density p over the reals, and instead of computing the integral

EP[f(X)] = ∫ f(x)dP(x)   or   EP[f(X)] = ∫ f(x)p(x)dμ(x),

assume we may write

EP[f(X)] = ∫ f(x)p(x)dx.

Nothing will be lost.
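For instance (an illustrative calculation, not in the original text), if X has density p(x) = e^{-x} on [0, ∞) and f(x) = x, one simply computes

EP[f(X)] = ∫_0^∞ x e^{-x} dx = 1,

with no reference to the underlying measure μ.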
1.4 Outline and chapter discussion

We divide the lecture notes into four distinct parts, each of course interacting with the others, but it is possible to read each as a reasonably self-contained unit. The lecture notes begin with a review (Chapter 2) that introduces the basic information-theoretic quantities that we discuss: mutual information, entropy, and divergence measures. It is required reading for all the chapters that follow.
Part I of the notes covers what I term “stability”-based results. At a high level, this means that we ask what can be gained by considering situations where individual observations in a sequence of random variables X1, . . . , Xn have little effect on various functions of the sequence. We begin in Chapter 4 with basic concentration inequalities, discussing how sums and related quantities can converge quickly; while this material is essential for the remainder of the lectures, it does not depend on particular information-theoretic techniques. We discuss some heuristic applications to problems in statistical learning—empirical risk minimization—in this section of the notes. We provide a treatment of more advanced ideas in Chapter 6, including some approaches to concentration via entropy methods. We then turn in Chapter 5 to carefully investigate generalization and convergence guarantees—arguing that functions of a sample X1, . . . , Xn are representative of the full population P from which the sample is drawn—based on controlling different information-theoretic quantities. In this context, we develop PAC-Bayesian bounds, and we also use the same framework to present tools to control generalization and convergence in interactive data analyses. These types of analyses reflect modern statistics, where one performs some type of data exploration before committing to a fuller analysis, a practice that breaks classical statistical approaches because the analysis now depends on the sample. Finally, we provide a chapter (Cha