Understanding Machine Learning: From Theory to Algorithms
⃝c 2014 by Shai Shalev-Shwartz and Shai Ben-David
Published 2014 by Cambridge University Press.
This copy is for personal use only. Not for distribution. Do not post. Please link to:
http://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning
Please note: This copy is almost, but not entirely, identical to the printed version of the book. In particular, page numbers are not identical (but section numbers are the same).
Understanding Machine Learning
Machine learning is one of the fastest growing areas of computer science, with far-reaching applications. The aim of this textbook is to introduce machine learning, and the algorithmic paradigms it offers, in a princi- pled way. The book provides an extensive theoretical account of the fundamental ideas underlying machine learning and the mathematical derivations that transform these principles into practical algorithms. Fol- lowing a presentation of the basics of the field, the book covers a wide array of central topics that have not been addressed by previous text- books. These include a discussion of the computational complexity of learning and the concepts of convexity and stability; important algorith- mic paradigms including stochastic gradient descent, neural networks, and structured output learning; and emerging theoretical concepts such as the PAC-Bayes approach and compression-based bounds. Designed for an advanced undergraduate or beginning graduate course, the text makes the fundamentals and algorithms of machine learning accessible to stu- dents and nonexpert readers in statistics, computer science, mathematics, and engineering.
Shai Shalev-Shwartz is an Associate Professor at the School of Computer Science and Engineering at The Hebrew University, Israel.
Shai Ben-David is a Professor in the School of Computer Science at the University of Waterloo, Canada.
浙大学霸代写 加微信 cstutorcs
UNDERSTANDING MACHINE LEARNING
From Theory to Algorithms
Shai Shalev-Shwartz
The Hebrew University, Jerusalem
Shai Ben-David
University of Waterloo, Canada
32 Avenue of the Americas, New York, NY 10013-2473, USA
Cambridge University Press is part of the University of Cambridge.
It furthers the University’s mission by disseminating knowledge in the pursuit of education, learning and research at the highest international levels of excellence.
www.cambridge.org
Information on this title: www.cambridge.org/9781107057135
⃝c Shai Shalev-Shwartz and Shai Ben-David 2014
This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.
First published 2014
Printed in the United States of America
A catalog record for this publication is available from the British Library Library of Congress Cataloging in Publication Data
ISBN 978-1-107-05713-5 Hardback
Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party Internet Web sites referred to in this publication, and does not guarantee that any content on such Web sites is, or will remain, accurate or appropriate.
Triple-S dedicates the book to triple-M
The term machine learning refers to the automated detection of meaningful patterns in data. In the past couple of decades it has become a common tool in almost any task that requires information extraction from large data sets. We are surrounded by a machine learning based technology: search engines learn how to bring us the best results (while placing profitable ads), anti-spam software learns to filter our email messages, and credit card transactions are secured by a software that learns how to detect frauds. Digital cameras learn to detect faces and intelligent personal assistance applications on smart-phones learn to recognize voice commands. Cars are equipped with accident prevention systems that are built using machine learning algorithms. Machine learning is also widely used in scientific applications such as bioinformatics, medicine, and astronomy.
One common feature of all of these applications is that, in contrast to more traditional uses of computers, in these cases, due to the complexity of the patterns that need to be detected, a human programmer cannot provide an explicit, fine- detailed specification of how such tasks should be executed. Taking example from intelligent beings, many of our skills are acquired or refined through learning from our experience (rather than following explicit instructions given to us). Machine learning tools are concerned with endowing programs with the ability to “learn” and adapt.
The first goal of this book is to provide a rigorous, yet easy to follow, intro- duction to the main concepts underlying machine learning: What is learning? How can a machine learn? How do we quantify the resources needed to learn a given concept? Is learning always possible? Can we know if the learning process succeeded or failed?
The second goal of this book is to present several key machine learning algo- rithms. We chose to present algorithms that on one hand are successfully used in practice and on the other hand give a wide spectrum of different learning techniques. Additionally, we pay specific attention to algorithms appropriate for large scale learning (a.k.a. “Big Data”), since in recent years, our world has be- come increasingly “digitized” and the amount of data available for learning is dramatically increasing. As a result, in many applications data is plentiful and computation time is the main bottleneck. We therefore explicitly quantify both the amount of data and the amount of computation time needed to learn a given concept.
The book is divided into four parts. The first part aims at giving an initial rigorous answer to the fundamental questions of learning. We describe a gen- eralization of Valiant’s Probably Approximately Correct (PAC) learning model, which is a first solid answer to the question “what is learning?”. We describe the Empirical Risk Minimization (ERM), Structural Risk Minimization (SRM), and Minimum Description Length (MDL) learning rules, which shows “how can a machine learn”. We quantify the amount of data needed for learning using the ERM, SRM, and MDL rules and show how learning might fail by deriving
a “no-free-lunch” theorem. We also discuss how much computation time is re- quired for learning. In the second part of the book we describe various learning algorithms. For some of the algorithms, we first present a more general learning principle, and then show how the algorithm follows the principle. While the first two parts of the book focus on the PAC model, the third part extends the scope by presenting a wider variety of learning models. Finally, the last part of the book is devoted to advanced theory.
We made an attempt to keep the book as self-contained as possible. However, the reader is assumed to be comfortable with basic notions of probability, linear algebra, analysis, and algorithms. The first three parts of the book are intended for first year graduate students in computer science, engineering, mathematics, or statistics. It can also be accessible to undergraduate students with the adequate background. The more advanced chapters can be used by researchers intending to gather a deeper theoretical understanding.
Acknowledgements
The book is based on Introduction to Machine Learning courses taught by Shai Shalev-Shwartz at the Hebrew University and by Shai Ben-David at the Univer- sity of Waterloo. The first draft of the book grew out of the lecture notes for the course that was taught at the Hebrew University by Shai Shalev-Shwartz during 2010–2013. We greatly appreciate the help of Ohad Shamir, who served as a TA for the course in 2010, and of Alon Gonen, who served as a TA for the course in 2011–2013. Ohad and Alon prepared few lecture notes and many of the exercises. Alon, to whom we are indebted for his help throughout the entire making of the book, has also prepared a solution manual.
We are deeply grateful for the most valuable work of Dana Rubinstein. Dana has scientifically proofread and edited the manuscript, transforming it from lecture-based chapters into fluent and coherent text.
Special thanks to Amit Daniely, who helped us with a careful read of the advanced part of the book and also wrote the advanced chapter on multiclass learnability. We are also grateful for the members of a book reading club in Jerusalem that have carefully read and constructively criticized every line of the manuscript. The members of the reading club are: Maya Alroy, Yossi Arje- vani, Aharon Birnbaum, Alon Cohen, Alon Gonen, Roi Livni, Ofer Meshi, Dan Rosenbaum, Dana Rubinstein, Shahar Somin, Alon Vinnikov, and Yoav Wald. We would also like to thank Gal Elidan, Amir Globerson, Nika Haghtalab, Shie Mannor, Amnon Shashua, Nati Srebro, and Ruth Urner for helpful discussions.
Shai Shalev-Shwartz, Jerusalem, Israel Shai Ben-David, Waterloo, Canada
Introduction
1.1 What Is Learning?
1.2 When Do We Need Machine Learning?
1.3 Types of Learning
1.4 Relations to Other Fields
1.5 How to Read This Book
1.5.1 Possible Course Plans Based on This Book 1.6 Notation
Foundations
A Gentle Start
2.1 A Formal Model – The Statistical Learning Framework
2.2 Empirical Risk Minimization
2.2.1 Something May Go Wrong – Overfitting 2.3 Empirical Risk Minimization with Inductive Bias
2.3.1 Finite Hypothesis Classes 2.4 Exercises
A Formal Learning Model
3.1 PAC Learning
3.2 A More General Learning Model
3.2.1 Releasing the Realizability Assumption – Agnostic PAC Learning
3.2.2 The Scope of Learning Problems Modeled
3.3 Summary
3.4 Bibliographic Remarks
3.5 Exercises
Learning via Uniform Convergence
4.1 Uniform Convergence Is Sufficient for Learnability
4.2 Finite Classes Are Agnostic PAC Learnable
45 47 49 50 50
Understanding Machine Learning, ⃝c 2014 by Shai Shalev-Shwartz and Shai Ben-David Published 2014 by Cambridge University Press.
Personal use only. Not for distribution. Do not post.
Please link to http://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning
x Contents
4.3 Summary 58
4.4 Bibliographic Remarks 58
4.5 Exercises 58
5 The Bias-Complexity Tradeoff 60 5.1 The No-Free-Lunch Theorem 61 5.1.1 No-Free-Lunch and Prior Knowledge 63
5.2 Error Decomposition 64
5.3 Summary 65
5.4 Bibliographic Remarks 66
5.5 Exercises 66
6 The VC-Dimension 67
6.1 Infinite-Size Classes Can Be Learnable 67
6.2 The VC-Dimension 68
6.3 Examples 70
6.3.1 Threshold Functions 70
6.3.2 Intervals 71
6.3.3 Axis Aligned Rectangles 71
6.3.4 Finite Classes 72
6.3.5 VC-Dimension and the Number of Parameters 72
6.4 The Fundamental Theorem of PAC learning 72
6.5 Proof of Theorem 6.7 73
6.5.1 Sauer’s Lemma and the Growth Function 73
6.5.2 Uniform Convergence for Classes of Small Effective Size 75
6.6 Summary 78
6.7 Bibliographic remarks 78
6.8 Exercises 78
7 Nonuniform Learnability 83
7.1 Nonuniform Learnability 83 7.1.1 Characterizing Nonuniform Learnability 84
7.2 Structural Risk Minimization 85
7.3 Minimum Description Length and Occam’s Razor 89
7.3.1 Occam’s Razor 91
7.4 Other Notions of Learnability – Consistency 92
7.5 Discussing the Different Notions of Learnability 93
7.5.1 The No-Free-Lunch Theorem Revisited 95
7.6 Summary 96
7.7 Bibliographic Remarks 97
7.8 Exercises 97
8 The Runtime of Learning 100 8.1 Computational Complexity of Learning 101
Contents xi
8.1.1 Formal Definition* 102
8.2 Implementing the ERM Rule 103
8.2.1 Finite Classes 104
8.2.2 Axis Aligned Rectangles 105
8.2.3 Boolean Conjunctions 106
8.2.4 Learning 3-Term DNF 107
8.3 Efficiently Learnable, but Not by a Proper ERM 107
8.4 Hardness of Learning* 108
8.5 Summary 110
8.6 Bibliographic Remarks 110
8.7 Exercises 110
From Theory to Algorithms 115
Linear Predictors 117
9.1 Halfspaces 118
9.1.1 Linear Programming for the Class of Halfspaces 119
9.1.2 Perceptron for Halfspaces 120
9.1.3 The VC Dimension of Halfspaces 122
9.2 Linear Regression 123
9.2.1 Least Squares 124
9.2.2 Linear Regression for Polynomial Regression Tasks 125
9.3 Logistic Regression 126
9.4 Summary 128
9.5 Bibliographic Remarks 128
9.6 Exercises 128
Boosting 130 10.1 Weak Learnability 131 10.1.1 Efficient Implementation of ERM for Decision Stumps 133
10.2 AdaBoost 134
10.3 Linear Combinations of Base Hypotheses 137
10.3.1 The VC-Dimension of L(B, T ) 139
10.4 AdaBoost for Face Recognition 140
10.5 Summary 141
10.6 Bibliographic Remarks 141
10.7 Exercises 142
Model Selection and Validation 144
11.1 Model Selection Using SRM 145
11.2 Validation 146
11.2.1 Hold Out Set 146
11.2.2 Validation for Model Selection 147
11.2.3 The Model-Selection Curve 148
xii Contents
11.2.4 k-Fold Cross Validation 149
11.2.5 Train-Validation-Test Split 150
11.3 What to Do If Learning Fails 151
11.4 Summary 154
11.5 Exercises 154
12 Convex Learning Problems 156
12.1 Convexity, Lipschitzness, and Smoothness 156
12.1.1 Convexity 156
12.1.2 Lipschitzness 160
12.1.3 Smoothness 162
12.2 Convex Learning Problems 163
12.2.1 Learnability of Convex Learning Problems 164
12.2.2 Convex-Lipschitz/Smooth-Bounded Learning Problems 166
12.3 Surrogate Loss Functions 167
12.4 Summary 168
12.5 Bibliographic Remarks 169
12.6 Exercises 169
13 Regularization and Stability 171
13.1 Regularized Loss Minimization 171 13.1.1 Ridge Regression 172
13.2 Stable Rules Do Not Overfit 173
13.3 Tikhonov Regularization as a Stabilizer 174
13.3.1 Lipschitz Loss 176
13.3.2 Smooth and Nonnegative Loss 177
13.4 Controlling the Fitting-Stability Tradeoff 178
13.5 Summary 180
13.6 Bibliographic Remarks 180
13.7 Exercises 181
14 Stochastic Gradient Descent 184 14.1 Gradient Descent 185 14.1.1 Analysis of GD for Convex-Lipschitz Functions 186
14.2 Subgradients 188
14.2.1 Calculating Subgradients 189
14.2.2 Subgradients of Lipschitz Functions 190
14.2.3 Subgradient Descent 190
14.3 Stochastic Gradient Descent (SGD) 191
14.3.1 Analysis of SGD for Convex-Lipschitz-Bounded Functions 191 14.4 Variants 193
14.4.1 Adding a Projection Step 193
14.4.2 Variable Step Size 194
14.4.3 Other Averaging Techniques 195
Contents xiii
14.4.4 Strongly Convex Functions* 195
14.5 Learning with SGD 196
14.5.1 SGD for Risk Minimization 196
14.5.2 Analyzing SGD for Convex-Smooth Learning Problems 198
14.5.3 SGD for Regularized Loss Minimization 199
14.6 Summary 200
14.7 Bibliographic Remarks 200
14.8 Exercises 201
15 Support Vector Machines 202
15.1 Margin and Hard-SVM 202
15.1.1 The Homogenous Case 205
15.1.2 The Sample Complexity of Hard-SVM 205
15.2 Soft-SVM and Norm Regularization 206
15.2.1 The Sample Complexity of Soft-SVM 208
15.2.2 Margin and Norm-Based Bounds versus Dimension 208
15.2.3 The Ramp Loss* 209
15.3 Optimality Conditions and “Support Vectors”* 210
15.4 Duality* 211
15.5 Implementing Soft-SVM Using SGD 212
15.6 Summary 213
15.7 Bibliographic Remarks 213
15.8 Exercises 214
16 Kernel Methods 215
16.1 Embeddings into Feature Spaces 215
16.2 The Kernel Trick 217
16.2.1 Kernels as a Way to Express Prior Knowledge 221
16.2.2 Characterizing Kernel Functions* 222
16.3 Implementing Soft-SVM with Kernels 222
16.4 Summary 224
16.5 Bibliographic Remarks 225
16.6 Exercises 225
17 Multiclass, Ranking, and Complex Prediction Problems 227
17.1 One-versus-All and All-Pairs 227
17.2 Linear Multiclass Predictors 230
17.2.1 How to Construct Ψ 230
17.2.2 Cost-Sensitive Classification 232
17.2.3 ERM 232
17.2.4 Generalized Hinge Loss 233
17.2.5 Multiclass SVM and SGD 234
17.3 Structured Output Prediction 236
17.4 Ranking 238
17.4.1 Linear Predictors for Ranking 240 17.5 Bipartite Ranking and Multivariate Performance Measures 243 17.5.1 Linear Predictors for Bipartite Ranking 245
17.6 Summary 247
17.7 Bibliographic Remarks 247
17.8 Exercises 248
Decision Trees 250
18.1 Sample Complexity 251
18.2 Decision Tree Algorithms 252
18.2.1 Implementations of the Gain Measure 253
18.2.2 Pruning 254
18.2.3 Threshold-Based Splitting Rules for Real-Valued Features 255
18.3 Random Forests 255
18.4 Summary 256
18.5 Bibliographic Remarks 256
18.6 Exercises 256
Nearest Neighbor 258
19.1 k Nearest Neighbors 258
19.2 Analysis 259
19.2.1 A Generalization Bound for the 1-NN Rule 260
19.2.2 The “Curse of Dimensionality” 263
19.3 Efficient Implementation* 264
19.4 Summary 264
19.5 Bibliographic Remarks 264
19.6 Exercises 265
Neural Networks 268
20.1 Feedforward Neural Networks 269
20.2 Learning Neural Networks 270
20.3 The Expressive Power of Neural Networks 271
20.3.1 Geometric Intuition 273
20.4 The Sample Complexity of Neural Networks 274
20.5 The Runtime of Learning Neural Networks 276
20.6 SGD and Backpropagation 277
20.7 Summary 281
20.8 Bibliographic Remarks 281
20.9 Exercises 282
Additional Learning Models 285
Online Learning 287 21.1 Online Classification in the Realizable Case 288
Part III 21
Contents xv
21.1.1 Online Learnability 290 21.2 Online Classification in the Unrealizable Case 294 21.2.1 Weighted-Ma jority 295
21.3 Online Convex Optimization 300
21.4 The Online Perceptron Algorithm 301
21.5 Summary 304
21.6 Bibliographic Remarks 305
21.7 Exercises 305
22 Clustering 307
22.1 Linkage-Based Clustering Algorithms 310
22.2 k-Means and Other Cost Minimization Clusterings 311
22.2.1 The k-Means Algorithm 313
22.3 Spectral Clustering 315
22.3.1 Graph Cut 315
22.3.2 Graph Laplacian and Relaxed Graph Cuts 315
22.3.3 Unnormalized Spectral Clustering 317
22.4 Information Bottleneck* 317
22.5 A High Level View of Clustering 318
22.6 Summary 320
22.7 Bibliographic Remarks 320
22.8 Exercises 320
23 Dimensionality Reduction 323
23.1 Principal Component Analysis (PCA) 324
23.1.1 A More Efficient Solution for the Case d ≫ m 326
23.1.2 Implementation and Demonstration 326
23.2 Random Pro jections 329
23.3 Compressed Sensing 330 23.3.1 Proofs* 333
23.4 PCA or Compressed Sensing? 338
23.5 Summary 338
23.6 Bibliographic Remarks 339
23.7 Exercises 339
24 Generative Models 342
24.1 Maximum Likelihood Estimator 343
24.1.1 Maximum Likelihood Estimation for Continuous Ran-
dom Variables 344
24.1.2 Maximum Likelihood and Empirical Risk Minimization 345
24.1.3 Generalization Analysis 345
24.2 Naive Bayes 347
24.3 Linear Discriminant Analysis 347
24.4 Latent Variables and the EM Algorithm 348
24.4.1 EM as an Alternate Maximization Algorithm 350
24.4.2 EM for Mixture of Gaussians (Soft k-Means) 352
24.5 Bayesian Reasoning 353
24.6 Summary 355
24.7 Bibliographic Remarks 355
24.8 Exercises 356
Feature Selection and Generation 357
25.1 Feature Selection 358
25.1.1 Filters 359
25.1.2 Greedy Selection Approaches 360
25.1.3 Sparsity-Inducing Norms 363
25.2 Feature Manipulation and Normalization 365
25.2.1 Examples of Feature Transformations 367 25.3 Feature Learning 368 25.3.1 Dictionary Learning Using Auto-Encoders 368
25.4 Summary 370
25.5 Bibliographic Remarks 371
25.6 Exercises 371
Advanced Theory 373
Rademacher Complexities 375 26.1 The Rademacher Complexity 375 26.1.1 Rademacher Calculus 379
26.2 Rademacher Complexity of Linear Classes 382
26.3 Generalization Bounds for SVM 383
26.4 Generalization Bounds for Predictors with Low l1 Norm 386
26.5 Bibliographic Remarks 386
Covering Numbers 388
27.1 Covering 388 27.1.1 Properties 388
27.2 From Covering to Rademacher Complexity via Chaining 389
27.3 Bibliographic Remarks 391
Proof of the Fundamental Theorem of Learning Theory 392
28.1 The Upper Bound for the Agnostic Case 392
28.2 The Lower Bound for the Agnostic Case 393
28.2.1 Showing That m(ε, δ) ≥ 0.5 log(1/(4δ))/ε2 393
28.2.2 Showing That m(ε, 1/8) ≥ 8d/ε2 395
28.3 The Upper Bound for the Realizable Case 398
28.3.1 From ε-Nets to PAC Learnability 401
Part IV 26
Computer Science Tutoring
29 Multiclass Learnability
29.1 The Natarajan Dimension
29.2 The Multiclass Fundamental Theorem
29.2.1 On the Proof of Theorem 29.3
29.3 Calculating the Natarajan Dimension
29.3.1 One-versus-All Based Classes
29.3.2 General Multiclass-to-Binary Reductions
29.3.3 Linear Multiclass Predictors
29.4 On Good and Bad ERMs
29.5 Bibliographic Remarks
29.6 Exercises
30 Compression Bounds
30.1 Compression Bounds
30.2 Examples
30.2.1 Axis Aligned Rectangles
30.2.2 Halfspaces
30.2.3 Separating Polynomials
30.2.4 Separation with Margin
30.3 Bibliographic Remarks
31 PAC-Bayes
Contents xvii
419 422 430
435 437 447
31.1 31.2 31.3
Appendix A Appendix B Appendix C
PAC-Bayes Bounds Bibliographic Remarks Exercises
Technical Lemmas Measure Concentration Linear Algebra
Notes References Index
1 Introduction
The subject of this book is automated learning, or, as we will more often call it, Machine Learning (ML). That is, we wish to program computers so that they can “learn” from input available to them. Roughly speaking, learning is the process of converting experience into expertise or knowledge. The input to a learning algorithm is training data, representing experience, and the output is some expertise, which usually takes the form of another computer program that can perform some task. Seeking a formal-mathematical understanding of this concept, we’ll have to be more explicit about what we mean by each of the involved terms: What is the training data our programs will access? How can the process of learning be automated? How can we evaluate the success of such a process (namely, the quality of the output of a learning program)?
1.1 What Is Learning?
Let us begin by considering a couple of examples from naturally occurring ani- mal learning. Some of the most fundamental issues in ML arise already in that context, which we are all familiar with.
Bait Shyness – Rats Learning to Avoid Poisonous Baits: When rats encounter food items with novel look or smell, they will first eat very small amounts, and subsequent feeding will depend on the flavor of the food and its physiological effect. If the food produces an ill effect, the novel food will often be associated with the illness, and subsequently, the rats will not eat it. Clearly, there is a learning mechanism in play here – the animal used past experience with some food to acquire expertise in detecting the safety of this food. If past experience with the food was negatively labeled, the animal predicts that it will also have a negative effect when encountered in the future.
Inspired by the preceding example of successful learning, let us demonstrate a typical machine learning task. Suppose we would like to program a machine that learns how to filter spam e-mails. A naive solution would be seemingly similar to the way rats learn how to avoid poisonous baits. The machine will simply memorize all previous e-mails that had been labeled as spam e-mails by the human user. When a new e-mail arrives, the machine will search for it in the set
Understanding Machine Learning, ⃝c 2014 by Shai Shalev-Shwartz and Shai Ben-David Published 2014 by Cambridge University Press.
Personal use only. Not for distribution. Do not post.
Please link to http://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning
20 Introduction
of previous spam e-mails. If it matches one of them, it will be trashed. Otherwise, it will be moved to the user’s inbox folder.
While the preceding “learning by memorization” approach is sometimes use- ful, it lacks an important aspect of learning systems – the ability to label unseen e-mail messages. A successful learner should be able to progress from individual examples to broader generalization. This is also referred to as inductive reasoning or inductive inference. In the bait shyness example presented previously, after the rats encounter an example of a certain type of food, they apply their attitude toward it on new, unseen examples of food of similar smell and taste. To achieve generalization in the spam filtering task, the learner can scan the previously seen e-mails, and extract a set of words whose appearance in an e-mail message is indicative of spam. Then, when a new e-mail arrives, the machine can check whether one of the suspicious words appears in it, and predict its label accord- ingly. Such a system would potentially be able correctly to predict the label of unseen e-mails.
However, inductive reasoning might lead us to false conclusions. To illustrate this, let us consider again an example from animal learning.
Pigeon Superstition: In an experiment performed by the psychologist B. F. Skinner, he placed a bunch of hungry pigeons in a cage. An automatic mechanism had been attached to the cage, delivering food to the pigeons at regular intervals with no reference whatsoever to the birds’ behavior. The hungry pigeons went around the cage, and when food was first delivered, it found each pigeon engaged
in some activity (pecking, turning the head, etc.). The arrival of food reinforced each bird’s specific action, and consequently, each bird tended to spend some more time doing that very same action. That, in turn, increased the chance that the next random food delivery would find each bird engaged in that activity again. What results is a chain of events that reinforces the pigeons’ association of the delivery of the food with whatever chance actions they had been perform- ing when it was first delivered. They subsequently continue to perform these same actions diligently.1
What distinguishes learning mechanisms that result in superstition from useful learning? This question is crucial to the development of automated learners. While human learners can rely on common sense to filter out random meaningless learning conclusions, once we export the task of learning to a machine, we must provide well defined crisp principles that will protect the program from reaching senseless or useless conclusions. The development of such principles is a central goal of the theory of machine learning.
What, then, made the rats’ learning more successful than that of the pigeons? As a first step toward answering this question, let us have a closer look at the bait shyness phenomenon in rats.
Bait Shyness revisited – rats fail to acquire conditioning between food and electric shock or between sound and nausea: The bait shyness mechanism in
1 See: http://psychclassics.yorku.ca/Skinner/Pigeon
Programming Help, Add QQ: 749389476
1.2 When Do We Need Machine Learning? 21
rats turns out to be more complex than what one may expect. In experiments carried out by Garcia (Garcia & Koelling 1996), it was demonstrated that if the unpleasant stimulus that follows food consumption is replaced by, say, electrical shock (rather than nausea), then no conditioning occurs. Even after repeated trials in which the consumption of some food is followed by the administration of unpleasant electrical shock, the rats do not tend to avoid that food. Similar failure of conditioning occurs when the characteristic of the food that implies nausea (such as taste or smell) is replaced by a vocal signal. The rats seem to have some “built in” prior knowledge telling them that, while temporal correlation between food and nausea can be causal, it is unlikely that there would be a causal relationship between food consumption and electrical shocks or between sounds and nausea.
We conclude that one distinguishing feature between the bait shyness learning and the pigeon superstition is the incorporation of prior knowledge that biases the learning mechanism. This is also referred to as inductive bias. The pigeons in the experiment are willing to adopt any explanation for the occurrence of food. However, the rats “know” that food cannot cause an electric shock and that the co-occurrence of noise with some food is not likely to affect the nutritional value of that food. The rats’ learning process is biased toward detecting some kind of patterns while ignoring other temporal correlations between events.
It turns out that the incorporation of prior knowledge, biasing the learning process, is inevitable for the success of learning algorithms (this is formally stated and proved as the “No-Free-Lunch theorem” in Chapter 5). The development of tools for expressing domain expertise, translating it into a learning bias, and quantifying the effect of such a bias on the success of learning is a central theme of the theory of machine learning. Roughly speaking, the stronger the prior knowledge (or prior assumptions) that one starts the learning process with, the easier it is to learn from further examples. However, the stronger these prior assumptions are, the less flexible the learning is – it is bound, a priori, by the commitment to these assumptions. We shall discuss these issues explicitly in Chapter 5.
1.2 When Do We Need Machine Learning?
When do we need machine learning rather than directly program our computers to carry out the task at hand? Two aspects of a given problem may call for the use of programs that learn and improve on the basis of their “experience”: the problem’s complexity and the need for adaptivity.
Tasks That Are Too Complex to Program.
• Tasks Performed by Animals/Humans: There are numerous tasks that we human beings perform routinely, yet our introspection concern- ing how we do them is not sufficiently elaborate to extract a well
22 Introduction
defined program. E