Machine Learning Fundamentals: From Beginner to Job Seeker (机器学习基础从入门到求职) [PDF]
http://www.100md.com
November 18, 2020
See the attached PDF (5,523 KB, 427 pages).
Machine Learning Fundamentals: From Beginner to Job Seeker aims to help job seekers quickly get started with and master the core foundational knowledge of machine learning, lowering the cost of learning and saving time.

Editor's Recommendation

Intended audience: beginner and intermediate machine learning readers, and candidates applying for machine learning positions.

The book does not, however, teach how to hunt for a job or how to interview. Instead, it collects the questions commonly asked for machine-learning algorithm positions in the preface; after reading it, readers can judge for themselves which areas need strengthening and then study the relevant chapters, which brings them to the starting line of a machine learning job search.
About the Book

This book is a theory-plus-practice guide to machine learning algorithms, organized into six parts: machine learning fundamentals, regression models, classification models, clustering models, dimensionality reduction models, and deep learning models. The fundamentals part, Chapters 1 and 2, introduces the theoretical and engineering foundations of machine learning. Chapter 3 covers regression models: building the model, choosing the learning strategy, and solving the optimization problem, ending with a house-price prediction case study built on three common linear regression models.

Chapters 4 through 11 detail several common classification models: naive Bayes, K-nearest neighbors, decision trees, logistic regression, support vector machines, random forests, AdaBoost, and boosting trees; each comes with a fairly detailed derivation and a practical application case. Chapter 12 systematically introduces five common clustering models: K-Means, hierarchical, density-based, spectral, and Gaussian mixture clustering, with a fairly detailed account of each model's principles, strengths and weaknesses, and engineering practice. Chapter 13 systematically introduces four common dimensionality reduction methods: singular value decomposition, principal component analysis, linear discriminant analysis, and locally linear embedding, again with detailed theoretical derivations and analysis. The final two chapters cover word vector models and deep neural networks: the former details the derivation and application of Word2Vec and Doc2Vec, and the latter systematically covers the foundations of deep learning.

The book suits students, job seekers, and working professionals interested in artificial intelligence and machine learning, as well as readers from other fields who want to use machine learning as a tool (basic calculus, linear algebra, and probability suffice); the preface offers concrete guidance for deciding whether the book fits you.
60 Questions for the Machine Learning Job Search

Type 1: Basic Concepts

Question 1: Overfitting and underfitting (what are their definitions, causes, and remedies?).
Question 2: L1 and L2 regularization (what are the common regularization methods? What does each do, and how do they differ? Why does adding a regularization term prevent overfitting?).
Question 3: Model variance and bias (can you explain variance and bias in machine learning? Which models reduce variance, and which reduce bias? Give examples).
Question 4: Occam's razor (explain the Occam's razor principle in machine learning).
Question 5: Model evaluation metrics (what are the common evaluation metrics for regression and classification models, and what does each mean? Explain AUC. Which metrics have you used in practice, and why did you choose them?).
Question 6: Risk functions (explain empirical risk and structural risk, their meanings, and how they are similar and different).
Question 7: Optimization algorithms (what are the common optimization algorithms in machine learning? Derive gradient descent and Newton's method).
Question 8: Activation functions (what activation functions are commonly used in neural networks? Describe the characteristics of each).
Question 9: Kernel functions (what is the definition and purpose of a kernel function? What are the common kernels, and which have you used? Explain the role of the parameter in the Gaussian kernel).
Question 10: Vanishing and exploding gradients (explain the vanishing and exploding gradient problems and the remedies for each).
Question 11: Supervised and unsupervised learning (describe the characteristics of supervised and unsupervised learning, with examples).
Question 12: Generative and discriminative models (do you know generative and discriminative models? What characterizes each? Which models are generative, and which are discriminative?).
Type 2: Model Principles

Question 13: Linear regression (the principle of the linear regression model, its loss function, and its regularization terms).
Question 14: KNN (the principle of the K-nearest-neighbor model, its three key elements, optimization schemes, and the model's strengths and weaknesses).
Question 15: Naive Bayes (derivation of the naive Bayes model, Laplace smoothing, the meaning of maximizing the posterior probability, and the model's strengths and weaknesses).
Question 16: Decision trees (the principle of decision tree models, feature evaluation criteria, the pruning process and its rationale, several common decision tree models, and their respective strengths and weaknesses).
Question 17: Random forests (the basic principle of the RF model and its two sources of "randomness". Discuss RF's strengths and weaknesses from the perspective of bias and variance, and the difference between RF and gradient boosted trees).
Question 18: AdaBoost (derivation of the AdaBoost model, a bias-and-variance view of AdaBoost, and its strengths and weaknesses).
Question 19: Gradient boosted trees (derivation of the GBDT model, how GBDT is used for feature combination, and its strengths and weaknesses).
Question 20: XGBoost (the basic principle of XGBoost, the similarities and differences between XGBoost and GBDT, and XGBoost's strengths and weaknesses).
Question 21: Logistic regression (the principle and essence of the LR model and its loss function; can squared loss be used, and why or why not?).
Question 22: Support vector machines (the principle of the SVM model. What is a "support vector"? Why use Lagrangian duality? Explain the KKT conditions, and how soft-margin and hard-margin SVMs are similar and different. How does an SVM achieve nonlinear classification? What kernels are commonly used with SVMs? What are the SVM model's strengths and weaknesses?).
Question 23: K-Means clustering (what are the process and principle of K-Means clustering? What optimization schemes exist, and what are their respective strengths and weaknesses?).
Question 24: Hierarchical clustering (its process, principle, strengths, and weaknesses).
Question 25: Density-based clustering (its basic principle, strengths, and weaknesses).
Question 26: Spectral clustering (its basic principle, strengths, and weaknesses).
Question 27: Gaussian mixture clustering (its principle, strengths, and weaknesses).
Question 28: The EM algorithm (its derivation and application scenarios).
Question 29: Eigendecomposition and singular value decomposition (their principles, similarities and differences, and application scenarios).
Question 30: Principal component analysis (the principle, procedure, and application scenarios of PCA).
Question 31: Linear discriminant analysis (the principle, procedure, and application scenarios of LDA).
Question 32: Locally linear embedding (the principle, procedure, and application scenarios of LLE).
Question 33: Word vectors (the variants of the Word2Vec and Doc2Vec models, the derivation of each, their applications, and parameter tuning).
Question 34: Deep neural networks (the principle of deep neural networks, the derivation of backpropagation, commonly used activation functions, how to address vanishing and exploding gradients, and Dropout, early stopping, and regularization in deep networks).
Type 3: Model Comparisons

Question 35: Similarities and differences between the LR and SVM models.
Question 36: Similarities and differences between the LR and naive Bayes models.
Question 37: Similarities and differences between the KNN and K-Means models.
Question 38: Similarities and differences among the ID3, C4.5, and CART models.
Question 39: Similarities and differences between the PCA and LDA models.
Question 40: Similarities and differences between Bagging and Boosting.
Question 41: Similarities and differences between the GBDT and XGBoost models.
Question 42: Similarities and differences between the CBOW and Skip-gram modes of Word2Vec.
Question 43: Similarities and differences between the Word2Vec and Doc2Vec models.
Type 4: Modeling Techniques

Question 44: Parameter tuning (pick any of the models above and describe how you would tune it and the tuning process).
Question 45: Feature combination (what are the common ways to combine features, and what characterizes each?).
Question 46: Feature engineering (explain your understanding of feature engineering, drawing on your own practice).
Question 47: Missing values (describe a missing-value problem you have encountered. What handling methods do you know? Which have you used, and how well did they work?).
Question 48: Class imbalance (do you know the class imbalance problem? How have you handled it, and with what results? Beyond oversampling and undersampling, can you design a more novel approach of your own?).
Question 49: Feature selection (what are the common approaches to feature selection? Drawing on your own experience, explain the principle and characteristics of each).
Question 50: Model selection (how do you usually choose an appropriate model? Can you give a concrete example?).
Question 51: Model combination (what model combination methods do you know? Besides applying models such as AdaBoost and RF, have you combined models yourself using Bagging or Embedding? Illustrate with a concrete example).
Question 52: A/B testing (are you familiar with A/B testing? Why is it used?).
Question 53: Dimensionality reduction (why reduce dimensionality? What reduction methods do you know, and which have you used? Illustrate with actual usage).
Question 54: Projects (what relevant projects have you done? Pick the one that impressed you most and walk through it).
Question 55: Pitfalls (what pitfalls have you hit when using machine learning models, and how did you eventually resolve them?).
Type 5: Job-Search Skills

Question 56: What should you prepare for a machine learning job search, and how should you prepare each part?
Question 57: What should you study for machine learning? How should the learning path be set, and what study materials are recommended?
Question 58: What are the application channels for machine learning positions? When is the best time to apply, and how should you choose target companies?
Question 59: What should a resume for a machine learning position contain, and how should you describe your projects?
Question 60: How should you deliver a self-introduction in an interview? How do you keep the right mindset, and how do you cope with pressure? How do you take the initiative during an interview? How do you answer the interviewer's closing "do you have any questions for me?", and how do you handle the final HR interview?
Foundations of Machine Learning

Adaptive Computation and Machine Learning series
Thomas Dietterich, Editor
Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns, Associate Editors

A complete list of books published in the Adaptive Computation and Machine Learning series appears at the back of this book.

Foundations of Machine Learning
Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar

The MIT Press
Cambridge, Massachusetts
London, England

© 2012 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

MIT Press books may be purchased at special quantity discounts for business or sales promotional use. For information, please email special_sales@mitpress.mit.edu or write to Special Sales Department, The MIT Press, 55 Hayward Street, Cambridge, MA 02142.

This book was set in LaTeX by the authors. Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data

Mohri, Mehryar.
Foundations of machine learning / Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar.
p. cm. - (Adaptive computation and machine learning series)
Includes bibliographical references and index.
ISBN 978-0-262-01825-8 (hardcover : alk. paper) 1. Machine learning. 2. Computer algorithms. I. Rostamizadeh, Afshin. II. Talwalkar, Ameet. III. Title.
Q325.5.M64 2012
006.3'1-dc23
2012007249

10 9 8 7 6 5 4 3 2 1

Contents
Preface

1 Introduction
1.1 Applications and problems
1.2 Definitions and terminology
1.3 Cross-validation
1.4 Learning scenarios
1.5 Outline

2 The PAC Learning Framework
2.1 The PAC learning model
2.2 Guarantees for finite hypothesis sets — consistent case
2.3 Guarantees for finite hypothesis sets — inconsistent case
2.4 Generalities
2.4.1 Deterministic versus stochastic scenarios
2.4.2 Bayes error and noise
2.4.3 Estimation and approximation errors
2.4.4 Model selection
2.5 Chapter notes
2.6 Exercises

3 Rademacher Complexity and VC-Dimension
3.1 Rademacher complexity
3.2 Growth function
3.3 VC-dimension
3.4 Lower bounds
3.5 Chapter notes
3.6 Exercises

4 Support Vector Machines
4.1 Linear classification
4.2 SVMs — separable case
4.2.1 Primal optimization problem
4.2.2 Support vectors
4.2.3 Dual optimization problem
4.2.4 Leave-one-out analysis
4.3 SVMs — non-separable case
4.3.1 Primal optimization problem
4.3.2 Support vectors
4.3.3 Dual optimization problem
4.4 Margin theory
4.5 Chapter notes
4.6 Exercises

5 Kernel Methods
5.1 Introduction
5.2 Positive definite symmetric kernels
5.2.1 Definitions
5.2.2 Reproducing kernel Hilbert space
5.2.3 Properties
5.3 Kernel-based algorithms
5.3.1 SVMs with PDS kernels
5.3.2 Representer theorem
5.3.3 Learning guarantees
5.4 Negative definite symmetric kernels
5.5 Sequence kernels
5.5.1 Weighted transducers
5.5.2 Rational kernels
5.6 Chapter notes
5.7 Exercises

6 Boosting
6.1 Introduction
6.2 AdaBoost
6.2.1 Bound on the empirical error
6.2.2 Relationship with coordinate descent
6.2.3 Relationship with logistic regression
6.2.4 Standard use in practice
6.3 Theoretical results
6.3.1 VC-dimension-based analysis
6.3.2 Margin-based analysis
6.3.3 Margin maximization
6.3.4 Game-theoretic interpretation
6.4 Discussion
6.5 Chapter notes
6.6 Exercises

7 On-Line Learning
7.1 Introduction
7.2 Prediction with expert advice
7.2.1 Mistake bounds and Halving algorithm
7.2.2 Weighted majority algorithm
7.2.3 Randomized weighted majority algorithm
7.2.4 Exponential weighted average algorithm
7.3 Linear classification
7.3.1 Perceptron algorithm
7.3.2 Winnow algorithm
7.4 On-line to batch conversion
7.5 Game-theoretic connection
7.6 Chapter notes
7.7 Exercises

8 Multi-Class Classification
8.1 Multi-class classification problem
8.2 Generalization bounds
8.3 Uncombined multi-class algorithms
8.3.1 Multi-class SVMs
8.3.2 Multi-class boosting algorithms
8.3.3 Decision trees
8.4 Aggregated multi-class algorithms
8.4.1 One-versus-all
8.4.2 One-versus-one
8.4.3 Error-correction codes
8.5 Structured prediction algorithms
8.6 Chapter notes
8.7 Exercises

9 Ranking
9.1 The problem of ranking
9.2 Generalization bound
9.3 Ranking with SVMs
9.4 RankBoost
9.4.1 Bound on the empirical error
9.4.2 Relationship with coordinate descent
9.4.3 Margin bound for ensemble methods in ranking
9.5 Bipartite ranking
9.5.1 Boosting in bipartite ranking
9.5.2 Area under the ROC curve
9.6 Preference-based setting
9.6.1 Second-stage ranking problem
9.6.2 Deterministic algorithm
9.6.3 Randomized algorithm
9.6.4 Extension to other loss functions
9.7 Discussion
9.8 Chapter notes
9.9 Exercises

10 Regression
10.1 The problem of regression
10.2 Generalization bounds
10.2.1 Finite hypothesis sets
10.2.2 Rademacher complexity bounds
10.2.3 Pseudo-dimension bounds
10.3 Regression algorithms
10.3.1 Linear regression
10.3.2 Kernel ridge regression
10.3.3 Support vector regression
10.3.4 Lasso
10.3.5 Group norm regression algorithms
10.3.6 On-line regression algorithms
10.4 Chapter notes
10.5 Exercises

11 Algorithmic Stability
11.1 Definitions
11.2 Stability-based generalization guarantee
11.3 Stability of kernel-based regularization algorithms
11.3.1 Application to regression algorithms: SVR and KRR
11.3.2 Application to classification algorithms: SVMs
11.3.3 Discussion
11.4 Chapter notes
11.5 Exercises

12 Dimensionality Reduction
12.1 Principal Component Analysis
12.2 Kernel Principal Component Analysis (KPCA)
12.3 KPCA and manifold learning
12.3.1 Isomap
12.3.2 Laplacian eigenmaps
12.3.3 Locally linear embedding (LLE)
12.4 Johnson-Lindenstrauss lemma
12.5 Chapter notes
12.6 Exercises

13 Learning Automata and Languages
13.1 Introduction
13.2 Finite automata
13.3 Efficient exact learning
13.3.1 Passive learning
13.3.2 Learning with queries
13.3.3 Learning automata with queries
13.4 Identification in the limit
13.4.1 Learning reversible automata
13.5 Chapter notes
13.6 Exercises

14 Reinforcement Learning
14.1 Learning scenario
14.2 Markov decision process model
14.3 Policy
14.3.1 Definition
14.3.2 Policy value
14.3.3 Policy evaluation
14.3.4 Optimal policy
14.4 Planning algorithms
14.4.1 Value iteration
14.4.2 Policy iteration
14.4.3 Linear programming
14.5 Learning algorithms
14.5.1 Stochastic approximation
14.5.2 TD(0) algorithm
14.5.3 Q-learning algorithm
14.5.4 SARSA
14.5.5 TD(λ) algorithm
14.5.6 Large state space
14.6 Chapter notes

Conclusion

A Linear Algebra Review
A.1 Vectors and norms
A.1.1 Norms
A.1.2 Dual norms
A.2 Matrices
A.2.1 Matrix norms
A.2.2 Singular value decomposition
A.2.3 Symmetric positive semidefinite (SPSD) matrices

B Convex Optimization
B.1 Differentiation and unconstrained optimization
B.2 Convexity
B.3 Constrained optimization
B.4 Chapter notes

C Probability Review
C.1 Probability
C.2 Random variables
C.3 Conditional probability and independence
C.4 Expectation, Markov's inequality, and moment-generating function
C.5 Variance and Chebyshev's inequality

D Concentration inequalities
D.1 Hoeffding's inequality
D.2 McDiarmid's inequality
D.3 Other inequalities
D.3.1 Binomial distribution: Slud's inequality
D.3.2 Normal distribution: tail bound
D.3.3 Khintchine-Kahane inequality
D.4 Chapter notes
D.5 Exercises

E Notation

References

Index

Preface
This book is a general introduction to machine learning that can serve as a textbook for students and researchers in the field. It covers fundamental modern topics in machine learning while providing the theoretical basis and conceptual tools needed for the discussion and justification of algorithms. It also describes several key aspects of the application of these algorithms.

We have aimed to present the most novel theoretical tools and concepts while giving concise proofs, even for relatively advanced results. In general, whenever possible, we have chosen to favor succinctness. Nevertheless, we discuss some crucial complex topics arising in machine learning and highlight several open research questions. Certain topics often merged with others or treated with insufficient attention are discussed separately here and with more emphasis: for example, a different chapter is reserved for multi-class classification, ranking, and regression.

Although we cover a very wide variety of important topics in machine learning, we have chosen to omit a few important ones, including graphical models and neural networks, both for the sake of brevity and because of the current lack of solid theoretical guarantees for some methods.

The book is intended for students and researchers in machine learning, statistics and other related areas. It can be used as a textbook for both graduate and advanced undergraduate classes in machine learning or as a reference text for a research seminar. The first three chapters of the book lay the theoretical foundation for the subsequent material. Other chapters are mostly self-contained, with the exception of chapter 5, which introduces some concepts that are extensively used in later ones. Each chapter concludes with a series of exercises, with full solutions presented separately.

The reader is assumed to be familiar with basic concepts in linear algebra, probability, and analysis of algorithms. However, to further help him, we present in the appendix a concise linear algebra and a probability review, and a short introduction to convex optimization. We have also collected in the appendix a number of useful tools for concentration bounds used in this book.
To our knowledge, there is no single textbook covering all of the material presented here. The need for a unified presentation has been pointed out to us every year by our machine learning students. There are several good books for various specialized areas, but these books do not include a discussion of other fundamental topics in a general manner. For example, books about kernel methods do not include a discussion of other fundamental topics such as boosting, ranking, reinforcement learning, learning automata or online learning. There also exist more general machine learning books, but the theoretical foundation of our book and our emphasis on proofs make our presentation quite distinct.

Most of the material presented here takes its origins in a machine learning graduate course (Foundations of Machine Learning) taught by the first author at the Courant Institute of Mathematical Sciences in New York University over the last seven years. This book has considerably benefited from the comments and suggestions from students in these classes, along with those of many friends, colleagues and researchers to whom we are deeply indebted.

We are particularly grateful to Corinna Cortes and Yishay Mansour who have both made a number of key suggestions for the design and organization of the material presented with detailed comments that we have fully taken into account and that have greatly improved the presentation. We are also grateful to Yishay Mansour for using a preliminary version of the book for teaching and for reporting his feedback to us.

We also thank for discussions, suggested improvement, and contributions of many kinds the following colleagues and friends from academic and corporate research laboratories: Cyril Allauzen, Stephen Boyd, Spencer Greenberg, Lisa Hellerstein, Sanjiv Kumar, Ryan McDonald, Andres Muñoz Medina, Tyler Neylon, Peter Norvig, Fernando Pereira, Maria Pershina, Ashish Rastogi, Michael Riley, Umar Syed, Csaba Szepesvári, Eugene Weinstein, and Jason Weston.

Finally, we thank the MIT Press publication team for their help and support in the development of this text.

1 Introduction
Machine learning can be broadly defined as computational methods using experience to improve performance or to make accurate predictions. Here, experience refers to the past information available to the learner, which typically takes the form of electronic data collected and made available for analysis. This data could be in the form of digitized human-labeled training sets, or other types of information obtained via interaction with the environment. In all cases, its quality and size are crucial to the success of the predictions made by the learner.

Machine learning consists of designing efficient and accurate prediction algorithms. As in other areas of computer science, some critical measures of the quality of these algorithms are their time and space complexity. But, in machine learning, we will additionally need a notion of sample complexity to evaluate the sample size required for the algorithm to learn a family of concepts. More generally, theoretical learning guarantees for an algorithm depend on the complexity of the concept classes considered and the size of the training sample.

Since the success of a learning algorithm depends on the data used, machine learning is inherently related to data analysis and statistics. More generally, learning techniques are data-driven methods combining fundamental concepts in computer science with ideas from statistics, probability and optimization.
1.1 Applications and problems
Learning algorithms have been successfully deployed in a variety of applications, including:

Text or document classification, e.g., spam detection;
Natural language processing, e.g., morphological analysis, part-of-speech tagging, statistical parsing, named-entity recognition;
Speech recognition, speech synthesis, speaker verification;
Optical character recognition (OCR);
Computational biology applications, e.g., protein function or structured prediction;
Computer vision tasks, e.g., image recognition, face detection;
Fraud detection (credit card, telephone) and network intrusion;
Games, e.g., chess, backgammon;
Unassisted vehicle control (robots, navigation);
Medical diagnosis;
Recommendation systems, search engines, information extraction systems.

This list is by no means comprehensive, and learning algorithms are applied to new applications every day. Moreover, such applications correspond to a wide variety of learning problems. Some major classes of learning problems are:
Classification: Assign a category to each item. For example, document classification may assign items with categories such as politics, business, sports, or weather, while image classification may assign items with categories such as landscape, portrait, or animal. The number of categories in such tasks is often relatively small, but can be large in some difficult tasks and even unbounded as in OCR, text classification, or speech recognition.

Regression: Predict a real value for each item. Examples of regression include prediction of stock values or variations of economic variables. In this problem, the penalty for an incorrect prediction depends on the magnitude of the difference between the true and predicted values, in contrast with the classification problem, where there is typically no notion of closeness between various categories.

Ranking: Order items according to some criterion. Web search, e.g., returning web pages relevant to a search query, is the canonical ranking example. Many other similar ranking problems arise in the context of the design of information extraction or natural language processing systems.

Clustering: Partition items into homogeneous regions. Clustering is often performed to analyze very large data sets. For example, in the context of social network analysis, clustering algorithms attempt to identify "communities" within large groups of people.

Dimensionality reduction or manifold learning: Transform an initial representation of items into a lower-dimensional representation of these items while preserving some properties of the initial representation. A common example involves preprocessing digital images in computer vision tasks.
The main practical objectives of machine learning consist of generating accurate predictions for unseen items and of designing efficient and robust algorithms to produce these predictions, even for large-scale problems. To do so, a number of algorithmic and theoretical questions arise. Some fundamental questions include: Which concept families can actually be learned, and under what conditions? How well can these concepts be learned computationally?

Figure 1.1 The zig-zag line on the left panel is consistent over the blue and red training sample, but it is a complex separation surface that is not likely to generalize well to unseen data. In contrast, the decision surface on the right panel is simpler and might generalize better in spite of its misclassification of a few points of the training sample.

1.2 Definitions and terminology
We will use the canonical problem of spam detection as a running example to illustrate some basic definitions and to describe the use and evaluation of machine learning algorithms in practice. Spam detection is the problem of learning to automatically classify email messages as either spam or non-spam.

Examples: Items or instances of data used for learning or evaluation. In our spam problem, these examples correspond to the collection of email messages we will use for learning and testing.

Features: The set of attributes, often represented as a vector, associated to an example. In the case of email messages, some relevant features may include the length of the message, the name of the sender, various characteristics of the header, the presence of certain keywords in the body of the message, and so on.

Labels: Values or categories assigned to examples. In classification problems, examples are assigned specific categories, for instance, the spam and non-spam categories in our binary classification problem. In regression, items are assigned real-valued labels.

Training sample: Examples used to train a learning algorithm. In our spam problem, the training sample consists of a set of email examples along with their associated labels. The training sample varies for different learning scenarios, as described in section 1.4.

Validation sample: Examples used to tune the parameters of a learning algorithm when working with labeled data. Learning algorithms typically have one or more free parameters, and the validation sample is used to select appropriate values for these model parameters.

Test sample: Examples used to evaluate the performance of a learning algorithm. The test sample is separate from the training and validation data and is not made available in the learning stage. In the spam problem, the test sample consists of a collection of email examples for which the learning algorithm must predict labels based on features. These predictions are then compared with the labels of the test sample to measure the performance of the algorithm.
Loss function: A function that measures the difference, or loss, between a predicted label and a true label. Denoting the set of all labels as $\mathcal{Y}$ and the set of possible predictions as $\mathcal{Y}'$, a loss function $L$ is a mapping $L \colon \mathcal{Y} \times \mathcal{Y}' \to \mathbb{R}_+$. In most cases, $\mathcal{Y}' = \mathcal{Y}$ and the loss function is bounded, but these conditions do not always hold. Common examples of loss functions include the zero-one (or misclassification) loss defined over $\{-1,+1\} \times \{-1,+1\}$ by $L(y, y') = 1_{y' \neq y}$ and the squared loss defined over $I \times I$ by $L(y, y') = (y' - y)^2$, where $I \subseteq \mathbb{R}$ is typically a bounded interval.
Hypothesis set: A set of functions mapping features (feature vectors) to the set of labels $\mathcal{Y}$. In our example, these may be a set of functions mapping email features to $\mathcal{Y} = \{\text{spam}, \text{non-spam}\}$. More generally, hypotheses may be functions mapping features to a different set $\mathcal{Y}'$. They could be linear functions mapping email feature vectors to real numbers interpreted as scores ($\mathcal{Y}' = \mathbb{R}$), with higher score values more indicative of spam than lower ones.
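As a small, hedged illustration of the two loss functions defined above (our own sketch, not code from this book), both can be written directly in Python:

```python
def zero_one_loss(y_true, y_pred):
    """Zero-one (misclassification) loss over {-1, +1}: 1 if the labels differ."""
    return 0.0 if y_pred == y_true else 1.0

def squared_loss(y_true, y_pred):
    """Squared loss over a real interval I: (y' - y)^2."""
    return (y_pred - y_true) ** 2

print(zero_one_loss(+1, -1))   # 1.0: the prediction disagrees with the label
print(zero_one_loss(+1, +1))   # 0.0
print(squared_loss(0.5, 0.2))  # 0.09 (up to floating-point rounding)
```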
We now define the learning stages of our spam problem. We start with a given collection of labeled examples. We first randomly partition the data into a training sample, a validation sample, and a test sample. The size of each of these samples depends on a number of different considerations. For example, the amount of data reserved for validation depends on the number of free parameters of the algorithm. Also, when the labeled sample is relatively small, the amount of training data is often chosen to be larger than that of test data since the learning performance directly depends on the training sample.

Next, we associate relevant features to the examples. This is a critical step in the design of machine learning solutions. Useful features can effectively guide the learning algorithm, while poor or uninformative ones can be misleading. Although it is critical, to a large extent, the choice of the features is left to the user. This choice reflects the user's prior knowledge about the learning task which in practice can have a dramatic effect on the performance results.
Now, we use the features selected to train our learning algorithm by fixing different values of its free parameters. For each value of these parameters, the algorithm selects a different hypothesis out of the hypothesis set. We choose among them the hypothesis resulting in the best performance on the validation sample. Finally, using that hypothesis, we predict the labels of the examples in the test sample. The performance of the algorithm is evaluated by using the loss function associated to the task, e.g., the zero-one loss in our spam detection task, to compare the predicted and true labels.

Thus, the performance of an algorithm is of course evaluated based on its test error and not its error on the training sample. A learning algorithm may be consistent, that is, it may commit no error on the examples of the training data, and yet have a poor performance on the test data. This occurs for consistent learners defined by very complex decision surfaces, as illustrated in figure 1.1, which tend to memorize a relatively small training sample instead of seeking to generalize well. This highlights the key distinction between memorization and generalization, which is the fundamental property sought for an accurate learning algorithm. Theoretical guarantees for consistent learners will be discussed in great detail in chapter 2.
1.3 Cross-validation
In practice, the amount of labeled data available is often too small to set aside a validation sample since that would leave an insufficient amount of training data. Instead, a widely adopted method known as n-fold cross-validation is used to exploit the labeled data both for model selection (selection of the free parameters of the algorithm) and for training.
Let $\theta$ denote the vector of free parameters of the algorithm. For a fixed value of $\theta$, the method consists of first randomly partitioning a given sample $S$ of $m$ labeled examples into $n$ subsamples, or folds. The $i$th fold is thus a labeled sample $((x_{i1}, y_{i1}), \ldots, (x_{im_i}, y_{im_i}))$ of size $m_i$. Then, for any $i \in [1, n]$, the learning algorithm is trained on all but the $i$th fold to generate a hypothesis $h_i$, and the performance of $h_i$ is tested on the $i$th fold, as illustrated in figure 1.2a. The parameter value $\theta$ is evaluated based on the average error of the hypotheses $h_i$, which is called the cross-validation error. This quantity is denoted by $\widehat{R}_{CV}(\theta)$ and defined by
\[ \widehat{R}_{CV}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \underbrace{\frac{1}{m_i} \sum_{j=1}^{m_i} L\big(h_i(x_{ij}), y_{ij}\big)}_{\text{error of } h_i \text{ on the } i\text{th fold}}. \]
The folds are generally chosen to have equal size, that is $m_i = m/n$ for all $i \in [1, n]$.
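As a hedged sketch of this procedure (ours, not the book's), the cross-validation error above can be computed as follows in Python; `train` and `loss` are hypothetical placeholders for the learning algorithm with $\theta$ fixed and for the task loss $L$:

```python
import random

def cross_validation_error(sample, n, train, loss, seed=0):
    """Estimate the n-fold cross-validation error: the average over the n folds
    of the error of the hypothesis h_i trained on the other n - 1 folds.
    `sample` is a list of (x, y) pairs; `train` maps such a list to a callable
    hypothesis; `loss` maps (prediction, true label) to a nonnegative real."""
    data = sample[:]
    random.Random(seed).shuffle(data)        # random partition into folds
    folds = [data[i::n] for i in range(n)]   # n folds of (roughly) equal size
    total = 0.0
    for i in range(n):
        held_out = folds[i]
        train_set = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        h_i = train(train_set)
        # average error of h_i on the i-th (held-out) fold
        total += sum(loss(h_i(x), y) for x, y in held_out) / len(held_out)
    return total / n
```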
How should n be chosen? The appropriate choice is subject to a trade-off and the topic of much learning theory research that we cannot address in this introductory chapter.

Figure 1.2 n-fold cross-validation. (a) Illustration of the partitioning of the training data into 5 folds. (b) Typical plot of a classifier's prediction error as a function of the size of the training sample: the error decreases as a function of the number of training points.
For a large $n$, each training sample used in n-fold cross-validation has size $m - m/n = m(1 - 1/n)$ (illustrated by the right vertical red line in figure 1.2b), which is close to $m$, the size of the full sample, but the training samples are quite similar. Thus, the method tends to have a small bias but a large variance. In contrast, smaller values of $n$ lead to more diverse training samples but their size (shown by the left vertical red line in figure 1.2b) is significantly less than $m$, thus the method tends to have a smaller variance but a larger bias.

In machine learning applications, $n$ is typically chosen to be 5 or 10. n-fold cross-validation is used as follows in model selection. The full labeled data is first split into a training and a test sample. The training sample of size $m$ is then used to compute the n-fold cross-validation error $\widehat{R}_{CV}(\theta)$ for a small number of possible values of $\theta$. $\theta$ is next set to the value $\theta_0$ for which $\widehat{R}_{CV}(\theta)$ is smallest and the algorithm is trained with the parameter setting $\theta_0$ over the full training sample of size $m$. Its performance is evaluated on the test sample as already described in the previous section.
The special case of n-fold cross-validation where $n = m$ is called leave-one-out cross-validation, since at each iteration exactly one instance is left out of the training sample. As shown in chapter 4, the average leave-one-out error is an approximately unbiased estimate of the average error of an algorithm and can be used to derive simple guarantees for some algorithms. In general, the leave-one-out error is very costly to compute, since it requires training $n$ times on samples of size $m - 1$, but for some algorithms it admits a very efficient computation (see exercise 10.9).

In addition to model selection, n-fold cross-validation is also commonly used for performance evaluation. In that case, for a fixed parameter setting $\theta$, the full labeled sample is divided into $n$ random folds with no distinction between training and test samples. The performance reported is the n-fold cross-validation error on the full sample as well as the standard deviation of the errors measured on each fold.
1.4 Learning scenarios
We next briefly describe common machine learning scenarios. These scenarios differ in the types of training data available to the learner, the order and method by which training data is received, and the test data used to evaluate the learning algorithm.

Supervised learning: The learner receives a set of labeled examples as training data and makes predictions for all unseen points. This is the most common scenario associated with classification, regression, and ranking problems. The spam detection problem discussed in the previous section is an instance of supervised learning.

Unsupervised learning: The learner exclusively receives unlabeled training data, and makes predictions for all unseen points. Since in general no labeled example is available in that setting, it can be difficult to quantitatively evaluate the performance of a learner. Clustering and dimensionality reduction are examples of unsupervised learning problems.

Semi-supervised learning: The learner receives a training sample consisting of both labeled and unlabeled data, and makes predictions for all unseen points. Semi-supervised learning is common in settings where unlabeled data is easily accessible but labels are expensive to obtain. Various types of problems arising in applications, including classification, regression, or ranking tasks, can be framed as instances of semi-supervised learning. The hope is that the distribution of unlabeled data accessible to the learner can help him achieve a better performance than in the supervised setting. The analysis of the conditions under which this can indeed be realized is the topic of much modern theoretical and applied machine learning research.
Transductive inference: As in the semi-supervised scenario, the learner receives a labeled training sample along with a set of unlabeled test points. However, the objective of transductive inference is to predict labels only for these particular test points. Transductive inference appears to be an easier task and matches the scenario encountered in a variety of modern applications. However, as in the semi-supervised setting, the assumptions under which a better performance can be achieved in this setting are research questions that have not been fully resolved.

On-line learning: In contrast with the previous scenarios, the on-line scenario involves multiple rounds, and training and testing phases are intermixed. At each round, the learner receives an unlabeled training point, makes a prediction, receives the true label, and incurs a loss. The objective in the on-line setting is to minimize the cumulative loss over all rounds. Unlike the previous settings just discussed, no distributional assumption is made in on-line learning. In fact, instances and their labels may be chosen adversarially within this scenario.

Reinforcement learning: The training and testing phases are also intermixed in reinforcement learning. To collect information, the learner actively interacts with the environment and in some cases affects the environment, and receives an immediate reward for each action. The objective of the learner is to maximize his reward over a course of actions and iterations with the environment. However, no long-term reward feedback is provided by the environment, and the learner is faced with the exploration versus exploitation dilemma, since he must choose between exploring unknown actions to gain more information versus exploiting the information already collected.

Active learning: The learner adaptively or interactively collects training examples, typically by querying an oracle to request labels for new points. The goal in active learning is to achieve a performance comparable to the standard supervised learning scenario, but with fewer labeled examples. Active learning is often used in applications where labels are expensive to obtain, for example computational biology applications.

In practice, many other intermediate and somewhat more complex learning scenarios may be encountered.
1.5 Outline
This book presents several fundamental and mathematically well-studied algorithms. It discusses in depth their theoretical foundations as well as their practical applications. The topics covered include:

Probably approximately correct (PAC) learning framework; learning guarantees for finite hypothesis sets;
Learning guarantees for infinite hypothesis sets, Rademacher complexity, VC-dimension;
Support vector machines (SVMs), margin theory;
Kernel methods, positive definite symmetric kernels, representer theorem, rational kernels;
Boosting, analysis of empirical error, generalization error, margin bounds;
On-line learning, mistake bounds, the weighted majority algorithm, the exponential weighted average algorithm, the Perceptron and Winnow algorithms;
Multi-class classification, multi-class SVMs, multi-class boosting, one-versus-all, one-versus-one, error-correction methods;
Ranking, ranking with SVMs, RankBoost, bipartite ranking, preference-based ranking;
Regression, linear regression, kernel ridge regression, support vector regression, Lasso;
Stability-based analysis, applications to classification and regression;
Dimensionality reduction, principal component analysis (PCA), kernel PCA, Johnson-Lindenstrauss lemma;
Learning automata and languages;
Reinforcement learning, Markov decision processes, planning and learning problems.

The analyses in this book are self-contained, with relevant mathematical concepts related to linear algebra, convex optimization, probability and statistics included in the appendix.

2 The PAC Learning Framework
Several fundamental questions arise when designing and analyzing algorithms that learn from examples: What can be learned efficiently? What is inherently hard to learn? How many examples are needed to learn successfully? Is there a general model of learning? In this chapter, we begin to formalize and address these questions by introducing the Probably Approximately Correct (PAC) learning framework. The PAC framework helps define the class of learnable concepts in terms of the number of sample points needed to achieve an approximate solution, sample complexity, and the time and space complexity of the learning algorithm, which depends on the cost of the computational representation of the concepts.

We first describe the PAC framework and illustrate it, then present some general learning guarantees within this framework when the hypothesis set used is finite, both for the consistent case where the hypothesis set used contains the concept to learn and for the opposite inconsistent case.
2.1 The PAC learning model
We first introduce several definitions and the notation needed to present the PAC model, which will also be used throughout much of this book.

We denote by $\mathcal{X}$ the set of all possible examples or instances. $\mathcal{X}$ is also sometimes referred to as the input space. The set of all possible labels or target values is denoted by $\mathcal{Y}$. For the purpose of this introductory chapter, we will limit ourselves to the case where $\mathcal{Y}$ is reduced to two labels, $\mathcal{Y} = \{0, 1\}$, so-called binary classification. Later chapters will extend these results to more general settings.

A concept $c \colon \mathcal{X} \to \mathcal{Y}$ is a mapping from $\mathcal{X}$ to $\mathcal{Y}$. Since $\mathcal{Y} = \{0, 1\}$, we can identify $c$ with the subset of $\mathcal{X}$ over which it takes the value 1. Thus, in the following, we equivalently refer to a concept to learn as a mapping from $\mathcal{X}$ to $\{0, 1\}$, or to a subset of $\mathcal{X}$. As an example, a concept may be the set of points inside a triangle or the indicator function of these points. In such cases, we will say in short that the concept to learn is a triangle. A concept class is a set of concepts we may wish to learn and is denoted by $C$. This could, for example, be the set of all triangles in the plane.
We assume that examples are independently and identically distributed (i.i.d.) according to some fixed but unknown distribution $D$. The learning problem is then formulated as follows. The learner considers a fixed set of possible concepts $H$, called a hypothesis set, which may not coincide with $C$. He receives a sample $S = (x_1, \ldots, x_m)$ drawn i.i.d. according to $D$ as well as the labels $(c(x_1), \ldots, c(x_m))$, which are based on a specific target concept $c \in C$ to learn. His task is to use the labeled sample $S$ to select a hypothesis $h_S \in H$ that has a small generalization error with respect to the concept $c$. The generalization error of a hypothesis $h \in H$, also referred to as the true error or just error of $h$, is denoted by $R(h)$ and defined as follows.[1]

Definition 2.1 Generalization error
Given a hypothesis $h \in H$, a target concept $c \in C$, and an underlying distribution $D$, the generalization error or risk of $h$ is defined by
\[ R(h) = \Pr_{x \sim D}[h(x) \neq c(x)] = \mathbb{E}_{x \sim D}\big[1_{h(x) \neq c(x)}\big], \tag{2.1} \]
where $1_\omega$ is the indicator function of the event $\omega$.[2]

The generalization error of a hypothesis is not directly accessible to the learner since both the distribution $D$ and the target concept $c$ are unknown. However, the learner can measure the empirical error of a hypothesis on the labeled sample $S$.
Definition 2.2 Empirical error
Given a hypothesis $h \in H$, a target concept $c \in C$, and a sample $S = (x_1, \ldots, x_m)$, the empirical error or empirical risk of $h$ is defined by
\[ \widehat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} 1_{h(x_i) \neq c(x_i)}. \tag{2.2} \]

Thus, the empirical error of $h \in H$ is its average error over the sample $S$, while the generalization error is its expected error based on the distribution $D$. We will see in this chapter and the following chapters a number of guarantees relating these two quantities with high probability, under some general assumptions. We can already note that for a fixed $h \in H$, the expectation of the empirical error based on an i.i.d. sample $S$ is equal to the generalization error:
\[ \mathbb{E}_{S \sim D^m}[\widehat{R}(h)] = R(h). \tag{2.3} \]
Indeed, by the linearity of the expectation and the fact that the sample is drawn i.i.d., we can write
\[ \mathbb{E}_{S \sim D^m}[\widehat{R}(h)] = \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{S \sim D^m}\big[1_{h(x_i) \neq c(x_i)}\big] = \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{S \sim D^m}\big[1_{h(x) \neq c(x)}\big], \]
for any $x$ in sample $S$. Thus,
\[ \mathbb{E}_{S \sim D^m}[\widehat{R}(h)] = \mathbb{E}_{S \sim D^m}\big[1_{h(x) \neq c(x)}\big] = \mathbb{E}_{x \sim D}\big[1_{h(x) \neq c(x)}\big] = R(h). \]

[1] The choice of $R$ instead of $E$ to denote an error avoids possible confusion with the notation for expectations and is further justified by the fact that the term risk is also used in machine learning and statistics to refer to an error.
[2] For this and other related definitions, the family of functions $H$ and the target concept $c$ must be measurable. The function classes we consider in this book all have this property.
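The identity (2.3) is easy to check numerically. The following small simulation (our own illustration, with an arbitrarily chosen distribution, target, and hypothesis) averages the empirical error over many i.i.d. samples and recovers the generalization error:

```python
import random

# Toy setup: X = [0, 1) with the uniform distribution, target c(x) = 1_{x < 0.5},
# and a fixed hypothesis h that disagrees with c exactly on [0.4, 0.6),
# so that R(h) = 0.2.
c = lambda x: x < 0.5
h = lambda x: x < 0.4 or 0.5 <= x < 0.6

rng = random.Random(0)
m, trials = 50, 20000
avg = 0.0
for _ in range(trials):
    S = [rng.random() for _ in range(m)]
    avg += sum(h(x) != c(x) for x in S) / m   # empirical error on one sample
print(avg / trials)  # close to the generalization error R(h) = 0.2
```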
The following introduces the Probably Approximately Correct (PAC) learning framework. We denote by $O(n)$ an upper bound on the cost of the computational representation of any element $x \in \mathcal{X}$ and by $\mathrm{size}(c)$ the maximal cost of the computational representation of $c \in C$. For example, $x$ may be a vector in $\mathbb{R}^n$, for which the cost of an array-based representation would be in $O(n)$.

Definition 2.3 PAC-learning
A concept class $C$ is said to be PAC-learnable if there exists an algorithm $A$ and a polynomial function $\mathrm{poly}(\cdot, \cdot, \cdot, \cdot)$ such that for any $\epsilon > 0$ and $\delta > 0$, for all distributions $D$ on $\mathcal{X}$ and for any target concept $c \in C$, the following holds for any sample size $m \ge \mathrm{poly}(1/\epsilon, 1/\delta, n, \mathrm{size}(c))$:
\[ \Pr_{S \sim D^m}[R(h_S) \le \epsilon] \ge 1 - \delta. \tag{2.4} \]
If $A$ further runs in $\mathrm{poly}(1/\epsilon, 1/\delta, n, \mathrm{size}(c))$, then $C$ is said to be efficiently PAC-learnable. When such an algorithm $A$ exists, it is called a PAC-learning algorithm for $C$.

A concept class $C$ is thus PAC-learnable if the hypothesis returned by the algorithm after observing a number of points polynomial in $1/\epsilon$ and $1/\delta$ is approximately correct (error at most $\epsilon$) with high probability (at least $1 - \delta$), which justifies the PAC terminology. $\delta > 0$ is used to define the confidence $1 - \delta$ and $\epsilon > 0$ the accuracy $1 - \epsilon$. Note that if the running time of the algorithm is polynomial in $1/\epsilon$ and $1/\delta$, then the sample size $m$ must also be polynomial if the full sample is received by the algorithm.
Several key points of the PAC definition are worth emphasizing. First, the PAC framework is a distribution-free model: no particular assumption is made about the distribution $D$ from which examples are drawn. Second, the training sample and the test examples used to define the error are drawn according to the same distribution $D$. This is a necessary assumption for generalization to be possible in most cases.

Figure 2.1 Target concept $R$ and possible hypothesis $R'$. Circles represent training instances. A blue circle is a point labeled with 1, since it falls within the rectangle $R$. Others are red and labeled with 0.

Finally, the PAC framework deals with the question of learnability for a concept class $C$ and not a particular concept. Note that the concept class $C$ is known to the algorithm, but of course the target concept $c \in C$ is unknown.

In many cases, in particular when the computational representation of the concepts is not explicitly discussed or is straightforward, we may omit the polynomial dependency on $n$ and $\mathrm{size}(c)$ in the PAC definition and focus only on the sample complexity.
We now illustrate PAC-learning with a specific learning problem.

Example 2.1 Learning axis-aligned rectangles
Consider the case where the set of instances are points in the plane, $\mathcal{X} = \mathbb{R}^2$, and the concept class $C$ is the set of all axis-aligned rectangles lying in $\mathbb{R}^2$. Thus, each concept $c$ is the set of points inside a particular axis-aligned rectangle. The learning problem consists of determining with small error a target axis-aligned rectangle using the labeled training sample. We will show that the concept class of axis-aligned rectangles is PAC-learnable.

Figure 2.1 illustrates the problem. $R$ represents a target axis-aligned rectangle and $R'$ a hypothesis. As can be seen from the figure, the error regions of $R'$ are formed by the area within the rectangle $R$ but outside the rectangle $R'$ and the area within $R'$ but outside the rectangle $R$. The first area corresponds to false negatives, that is, points that are labeled as 0 or negatively by $R'$, which are in fact positive or labeled with 1. The second area corresponds to false positives, that is, points labeled positively by $R'$ which are in fact negatively labeled.

To show that the concept class is PAC-learnable, we describe a simple PAC-learning algorithm $A$. Given a labeled sample $S$, the algorithm consists of returning the tightest axis-aligned rectangle $R' = R_S$ containing the points labeled with 1. Figure 2.2 illustrates the hypothesis returned by the algorithm. By definition, $R_S$ does not produce any false positive, since its points must be included in the target concept $R$. Thus, the error region of $R_S$ is included in $R$.
Figure 2.2 Illustration of the hypothesis $R' = R_S$ returned by the algorithm.
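This tightest-rectangle learner is straightforward to implement. The following Python sketch (our illustration, not the book's code) returns the hypothesis and uses it for prediction:

```python
def tightest_rectangle(sample):
    """Return the tightest axis-aligned rectangle (x_min, x_max, y_min, y_max)
    containing the points labeled 1, or None if there is no positive point.
    `sample` is a list of ((x, y), label) pairs with label in {0, 1}."""
    positives = [point for point, label in sample if label == 1]
    if not positives:
        return None          # degenerate hypothesis: predict 0 everywhere
    xs = [x for x, _ in positives]
    ys = [y for _, y in positives]
    return (min(xs), max(xs), min(ys), max(ys))

def predict(rect, point):
    """Label 1 iff the point falls within the learned rectangle R_S."""
    if rect is None:
        return 0
    x, y = point
    x_min, x_max, y_min, y_max = rect
    return int(x_min <= x <= x_max and y_min <= y <= y_max)
```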
Let $R \in C$ be a target concept. Fix $\epsilon > 0$. Let $\Pr[R]$ denote the probability mass of the region defined by $R$, that is, the probability that a point randomly drawn according to $D$ falls within $R$. Since errors made by our algorithm can be due only to points falling inside $R$, we can assume that $\Pr[R] > \epsilon$; otherwise, the error of $R_S$ is less than or equal to $\epsilon$ regardless of the training sample $S$ received.

Now, since $\Pr[R] > \epsilon$, we can define four rectangular regions $r_1, r_2, r_3$, and $r_4$ along the sides of $R$, each with probability at least $\epsilon/4$. These regions can be constructed by starting with the empty rectangle along a side and increasing its size until its distribution mass is at least $\epsilon/4$. Figure 2.3 illustrates the definition of these regions.

Observe that if $R_S$ meets all of these four regions, then, because it is a rectangle, it will have one side in each of these four regions (geometric argument). Its error area, which is the part of $R$ that it does not cover, is thus included in these regions and cannot have probability mass more than $\epsilon$. By contraposition, if $R(R_S) > \epsilon$, then $R_S$ must miss at least one of the regions $r_i$, $i \in [1, 4]$. As a result, we can write
\[ \Pr_{S \sim D^m}[R(R_S) > \epsilon] \le \Pr_{S \sim D^m}\Big[\bigcup_{i=1}^{4}\{R_S \cap r_i = \emptyset\}\Big] \tag{2.5} \]
\[ \le \sum_{i=1}^{4} \Pr_{S \sim D^m}[\{R_S \cap r_i = \emptyset\}] \quad \text{(by the union bound)} \]
\[ \le 4(1 - \epsilon/4)^m \quad \text{(since } \Pr[r_i] \ge \epsilon/4\text{)} \]
\[ \le 4\exp(-m\epsilon/4), \]
where for the last step we used the general identity $1 - x \le e^{-x}$ valid for all $x \in \mathbb{R}$. For any $\delta > 0$, to ensure that $\Pr_{S \sim D^m}[R(R_S) > \epsilon] \le \delta$, we can impose
\[ 4\exp(-m\epsilon/4) \le \delta \iff m \ge \frac{4}{\epsilon}\log\frac{4}{\delta}. \tag{2.6} \]
Thus, for any $\epsilon > 0$ and $\delta > 0$, if the sample size $m$ is greater than $\frac{4}{\epsilon}\log\frac{4}{\delta}$, then $\Pr_{S \sim D^m}[R(R_S) > \epsilon] \le \delta$. Furthermore, the computational cost of the representation of points in $\mathbb{R}^2$ and of axis-aligned rectangles, which can be defined by their four corners, is constant. This proves that the concept class of axis-aligned rectangles is PAC-learnable and that the sample complexity of PAC-learning axis-aligned rectangles is in $O(\frac{1}{\epsilon}\log\frac{1}{\delta})$.

Figure 2.3 Illustration of the regions $r_1, \ldots, r_4$.
An equivalent way to present sample complexity results like (2.6), which we will often see throughout this book, is to give a generalization bound. It states that with probability at least $1 - \delta$, $R(R_S)$ is upper bounded by some quantity that depends on the sample size $m$ and $\delta$. To obtain this, it suffices to set $\delta$ to be equal to the upper bound derived in (2.5), that is $\delta = 4\exp(-m\epsilon/4)$, and solve for $\epsilon$. This yields that with probability at least $1 - \delta$, the error of the algorithm is bounded as:
\[ R(R_S) \le \frac{4}{m}\log\frac{4}{\delta}. \tag{2.7} \]
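For a concrete sense of scale (our own numerical instantiation, not the book's), take $\epsilon = 0.1$ and $\delta = 0.02$ in (2.6): the bound requires
\[ m \ge \frac{4}{0.1}\log\frac{4}{0.02} = 40\log 200 \approx 40 \times 5.30 \approx 212 \]
labeled points (natural logarithm), after which $R(R_S) \le 0.1$ holds with probability at least $0.98$.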
Other PAC-learning algorithms could be considered for this example. One alternative is to return the largest axis-aligned rectangle not containing the negative points, for example. The proof of PAC-learning just presented for the tightest axis-aligned rectangle can be easily adapted to the analysis of other such algorithms.

Note that the hypothesis set $H$ we considered in this example coincided with the concept class $C$ and that its cardinality was infinite. Nevertheless, the problem admitted a simple proof of PAC-learning. We may then ask if a similar proof can readily apply to other similar concept classes. This is not as straightforward because the specific geometric argument used in the proof is key. It is non-trivial to extend the proof to other concept classes such as that of non-concentric circles (see exercise 2.4). Thus, we need a more general proof technique and more general results. The next two sections provide us with such tools in the case of a finite hypothesis set.

2.2 Guarantees for finite hypothesis sets — consistent case
In the example of axis-aligned rectangles that we examined, the hypothesis $h_S$ returned by the algorithm was always consistent, that is, it admitted no error on the training sample $S$. In this section, we present a general sample complexity bound, or equivalently, a generalization bound, for consistent hypotheses, in the case where the cardinality $|H|$ of the hypothesis set is finite. Since we consider consistent hypotheses, we will assume that the target concept $c$ is in $H$.

Theorem 2.1 Learning bounds — finite $H$, consistent case
Let $H$ be a finite set of functions mapping from $\mathcal{X}$ to $\mathcal{Y}$. Let $A$ be an algorithm that for any target concept $c \in H$ and i.i.d. sample $S$ returns a consistent hypothesis $h_S$: $\widehat{R}(h_S) = 0$. Then, for any $\epsilon, \delta > 0$, the inequality $\Pr_{S \sim D^m}[R(h_S) \le \epsilon] \ge 1 - \delta$ holds if
\[ m \ge \frac{1}{\epsilon}\Big(\log|H| + \log\frac{1}{\delta}\Big). \tag{2.8} \]
This sample complexity result admits the following equivalent statement as a generalization bound: for any $\epsilon, \delta > 0$, with probability at least $1 - \delta$,
\[ R(h_S) \le \frac{1}{m}\Big(\log|H| + \log\frac{1}{\delta}\Big). \tag{2.9} \]
Proof Fix $\epsilon > 0$. We do not know which consistent hypothesis $h_S \in H$ is selected by the algorithm $A$. This hypothesis further depends on the training sample $S$. Therefore, we need to give a uniform convergence bound, that is, a bound that holds for the set of all consistent hypotheses, which a fortiori includes $h_S$. Thus, we will bound the probability that some $h \in H$ would be consistent and have error more than $\epsilon$:
\[ \Pr[\exists h \in H \colon \widehat{R}(h) = 0 \land R(h) > \epsilon] \]
\[ = \Pr\big[(h_1 \in H, \widehat{R}(h_1) = 0 \land R(h_1) > \epsilon) \lor (h_2 \in H, \widehat{R}(h_2) = 0 \land R(h_2) > \epsilon) \lor \cdots\big] \]
\[ \le \sum_{h \in H} \Pr[\widehat{R}(h) = 0 \land R(h) > \epsilon] \quad \text{(union bound)} \]
\[ \le \sum_{h \in H} \Pr[\widehat{R}(h) = 0 \mid R(h) > \epsilon]. \quad \text{(definition of conditional probability)} \]
Now, consider any hypothesis $h \in H$ with $R(h) > \epsilon$. Then, the probability that $h$ would be consistent on a training sample $S$ drawn i.i.d., that is, that it would have no error on any point in $S$, can be bounded as:
\[ \Pr[\widehat{R}(h) = 0 \mid R(h) > \epsilon] \le (1 - \epsilon)^m. \]
The previous inequality implies
\[ \Pr[\exists h \in H \colon \widehat{R}(h) = 0 \land R(h) > \epsilon] \le |H|(1 - \epsilon)^m. \]
Setting the right-hand side to be equal to $\delta$ and solving for $\epsilon$ concludes the proof.
The theorem shows that when the hypothesis set $H$ is finite, a consistent algorithm $A$ is a PAC-learning algorithm, since the sample complexity given by (2.8) is dominated by a polynomial in $1/\epsilon$ and $1/\delta$. As shown by (2.9), the generalization error of consistent hypotheses is upper bounded by a term that decreases as a function of the sample size $m$. This is a general fact: as expected, learning algorithms benefit from larger labeled training samples. The decrease rate of $O(1/m)$ guaranteed by this theorem, however, is particularly favorable.

The price to pay for coming up with a consistent algorithm is the use of a larger hypothesis set $H$ containing target concepts. Of course, the upper bound (2.9) increases with $|H|$. However, that dependency is only logarithmic. Note that the term $\log|H|$, or the related term $\log_2|H|$ from which it differs by a constant factor, can be interpreted as the number of bits needed to represent $H$. Thus, the generalization guarantee of the theorem is controlled by the ratio of this number of bits, $\log_2|H|$, and the sample size $m$.
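The bound (2.8) is easy to evaluate numerically. As a small illustration (ours, not the book's), the following Python helper computes the sample size it prescribes:

```python
import math

def consistent_sample_complexity(h_size, epsilon, delta):
    """Smallest integer m satisfying the consistent-case bound (2.8):
    m >= (1/epsilon) * (log|H| + log(1/delta)), with natural logarithms."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Example: |H| = 2**20 hypotheses, target error 0.05 with confidence 0.99:
print(consistent_sample_complexity(2 ** 20, 0.05, 0.01))  # 370
```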
We now use theorem 2.1 to analyze PAC-learning with various concept classes.
Example 2.2 Conjunction of Boolean literals
Consider learning the concept class $C_n$ of conjunctions of at most $n$ Boolean literals $x_1, \ldots, x_n$. A Boolean literal is either a variable $x_i$, $i \in [1, n]$, or its negation $\bar{x}_i$. For $n = 4$, an example is the conjunction $x_1 \land \bar{x}_2 \land x_4$, where $\bar{x}_2$ denotes the negation of the Boolean literal $x_2$. $(1, 0, 0, 1)$ is a positive example for this concept while $(1, 0, 0, 0)$ is a negative example.

Observe that for $n = 4$, a positive example $(1, 0, 1, 0)$ implies that the target concept cannot contain the literals $\bar{x}_1$ and $\bar{x}_3$ and that it cannot contain the literals $x_2$ and $x_4$. In contrast, a negative example is not as informative since it is not known which of its $n$ bits are incorrect. A simple algorithm for finding a consistent hypothesis is thus based on positive examples and consists of the following: for each positive example $(b_1, \ldots, b_n)$ and $i \in [1, n]$, if $b_i = 1$ then $\bar{x}_i$ is ruled out as a possible literal in the concept class and if $b_i = 0$ then $x_i$ is ruled out. The conjunction of all the literals not ruled out is thus a hypothesis consistent with the target. Figure 2.4 shows an example training sample as well as a consistent hypothesis for the case $n = 6$.
We have |H| = |C_n| = 3^n, since each literal can be included positively, with negation, or not included. Plugging this into the sample complexity bound for consistent hypotheses yields the following sample complexity bound for any ε > 0 and δ > 0:

$$m \ge \frac{1}{\epsilon}\left((\log 3)\, n + \log\frac{1}{\delta}\right). \quad (2.10)$$

    0 1 1 0 1 1   +
    0 1 1 1 1 1   +
    0 0 1 1 0 1   −
    0 1 1 1 1 1   +
    1 0 0 1 1 0   −
    0 1 0 0 1 1   +
    0 1 ? ? 1 1

Figure 2.4 Each of the first six rows of the table represents a training example with its label, + or −, indicated in the last column. The last row contains 0 (respectively 1) in column i ∈ [1, 6] if the ith entry is 0 (respectively 1) for all the positive examples. It contains "?" if both 0 and 1 appear as an ith entry for some positive example. Thus, for this training sample, the hypothesis returned by the consistent algorithm described in the text is x̄1 ∧ x2 ∧ x5 ∧ x6.
Thus, the class of conjunctions of at most n Boolean literals is PAC-learnable. Note that the computational complexity is also polynomial, since the training cost per example is in O(n). For δ = 0.02, ε = 0.1, and n = 10, the bound becomes m ≥ 149. Thus, for a labeled sample of at least 149 examples, the bound guarantees 90% accuracy (ε = 0.1) with a confidence of at least 98%.
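The consistent algorithm just described is simple enough to implement directly. Here is a minimal sketch of our own (consistent_conjunction is a hypothetical name, not from the book), run on the positive examples of figure 2.4:

    def consistent_conjunction(positives):
        """Return a conjunction consistent with the positive examples (example 2.2).

        Literal x_i survives only if bit i is 1 in every positive example;
        its negation survives only if bit i is 0 in every one.
        """
        n = len(positives[0])
        literals = []
        for i in range(n):
            bits = {ex[i] for ex in positives}
            if bits == {"1"}:
                literals.append(f"x{i + 1}")
            elif bits == {"0"}:
                literals.append(f"~x{i + 1}")  # ~x stands for the negated literal
            # if both 0 and 1 occur, both literals are ruled out ("?" in figure 2.4)
        return " ^ ".join(literals)

    # Positive examples from figure 2.4 (n = 6):
    print(consistent_conjunction(["011011", "011111", "011111", "010011"]))
    # -> ~x1 ^ x2 ^ x5 ^ x6

As expected, it recovers the hypothesis x̄1 ∧ x2 ∧ x5 ∧ x6 of figure 2.4.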
Example 2.3 Universal concept class
Consider the set X = {0, 1}^n of all Boolean vectors with n components, and let U_n be the concept class formed by all subsets of X. Is this concept class PAC-learnable? To guarantee a consistent hypothesis the hypothesis class must include the concept class, thus |H| ≥ |U_n| = 2^{2^n}. Theorem 2.1 gives the following sample complexity bound:

$$m \ge \frac{1}{\epsilon}\left((\log 2)\, 2^n + \log\frac{1}{\delta}\right). \quad (2.11)$$

Here, the number of training samples required is exponential in n, which is the cost of the representation of a point in X. Thus, PAC-learning is not guaranteed by the theorem. In fact, it is not hard to show that this universal concept class is not PAC-learnable.
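To make the exponential dependence concrete, the right-hand side of (2.11) can be evaluated numerically; the sketch below is our own illustration, with arbitrary parameter values:

    import math

    def m_universal(n: int, eps: float, delta: float) -> float:
        """Right-hand side of (2.11) for the universal concept class over {0,1}^n."""
        return (math.log(2) * 2 ** n + math.log(1.0 / delta)) / eps

    print(m_universal(10, eps=0.1, delta=0.02))  # ~7.1e3 examples
    print(m_universal(50, eps=0.1, delta=0.02))  # ~7.8e15 examples: hopeless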
Example 2.4 k-term DNF formulae
A disjunctive normal form (DNF) formula is a formula written as the disjunction of several terms, each term being a conjunction of Boolean literals. A k-term DNF is a DNF formula defined by the disjunction of k terms, each term being a conjunction of at most n Boolean literals. Thus, for k = 2 and n = 3, an example of a k-term DNF is (x1 ∧ x2 ∧ x3) ∨ (x1 ∧ x3).
Is the class C of k-term DNF formulae PAC-learnable? The cardinality of the class is 3^{nk}, since each term is a conjunction of at most n variables and there are 3^n such conjunctions, as seen previously. The hypothesis set H must contain C for consistency to be possible, thus |H| ≥ 3^{nk}. Theorem 2.1 gives the following sample complexity bound:

$$m \ge \frac{1}{\epsilon}\left((\log 3)\, nk + \log\frac{1}{\delta}\right), \quad (2.12)$$

which is polynomial. However, it can be shown that learning k-term DNF formulae, using a hypothesis set that is itself the class of k-term DNF formulae, is computationally intractable unless RP = NP, where RP is the complexity class of problems that admit a randomized polynomial-time decision solution; this equality is commonly conjectured not to hold. Thus, while the sample size needed for learning k-term DNF formulae is only polynomial, efficient PAC-learning of this class is not possible unless RP = NP.
Example 2.5 k-CNF formulae
A conjunctive normal form (CNF) formula is a conjunction of disjunctions. A k-CNF formula is an expression of the form T1 ∧ ··· ∧ Tj with arbitrary length j ∈ N and with each term T_i being a disjunction of at most k Boolean attributes.
The problem of learning k-CNF formulae can be reduced to that of learning conjunctions of Boolean literals, which, as seen previously, is a PAC-learnable concept class. To do so, it suffices to associate to each term T_i a new variable. Then, this can be done with the following bijection:

$$a_i(x_1) \vee \cdots \vee a_i(x_n) \;\to\; Y_{a_i(x_1),\ldots,a_i(x_n)}, \quad (2.13)$$

where a_i(x_j) denotes the assignment to x_j in term T_i. This reduction to PAC-learning of conjunctions of Boolean literals may affect the original distribution, but this is not an issue since in the PAC framework no assumption is made about the distribution. Thus, the PAC-learnability of conjunctions of Boolean literals implies that of k-CNF formulae.
This is a surprising result, however, since any k-term DNF formula can be written as a k-CNF formula. Indeed, using associativity and the distributivity of ∨ over ∧, a k-term DNF can be rewritten as a k-CNF formula via

$$\bigvee_{i=1}^{k} \big(a_i(x_1) \wedge \cdots \wedge a_i(x_n)\big) \;=\; \bigwedge_{i_1,\ldots,i_k = 1}^{n} \big(a_1(x_{i_1}) \vee \cdots \vee a_k(x_{i_k})\big).$$

To illustrate this rewriting in a specific case, observe, for example, that

$$(u_1 \wedge u_2 \wedge u_3) \vee (v_1 \wedge v_2 \wedge v_3) \;=\; \bigwedge_{i,j=1}^{3} (u_i \vee v_j).$$
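This rewriting can be checked mechanically; the following brute-force truth-table comparison (our own sketch, not from the book) verifies the three-variable instance above:

    from itertools import product

    # Check (u1 ^ u2 ^ u3) v (v1 ^ v2 ^ v3) == AND over i,j of (u_i v v_j)
    # on all 2^6 truth assignments of the six Boolean variables.
    for bits in product([False, True], repeat=6):
        u, v = bits[:3], bits[3:]
        dnf = all(u) or all(v)
        cnf = all(ui or vj for ui in u for vj in v)
        assert dnf == cnf
    print("identity holds on all 64 assignments")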
But, as we previously saw, k-term DNF formulae are not efficiently PAC-learnable! What can explain this apparent inconsistency? Observe that the number of new variables needed to write a k-term DNF as a k-CNF formula via the transformation just described is exponential in k: it is in O(n^k). The discrepancy comes from the size of the representation of a concept. A k-term DNF formula can be an exponentially more compact representation, and efficient PAC-learning is intractable if a time complexity polynomial in that size is required. Thus, this apparent paradox deals with key aspects of PAC-learning, which include the cost of the representation of a concept and the choice of the hypothesis set.
2.3 Guarantees for finite hypothesis sets — inconsistent case
In the most general case, there may be no hypothesis in H consistent with the labeled training sample. This, in fact, is the typical case in practice, where the learning problems may be somewhat difficult or the concept classes more complex than the hypothesis set used by the learning algorithm. However, inconsistent hypotheses with a small number of errors on the training sample can be useful and, as we shall see, can benefit from favorable guarantees under some assumptions. This section presents learning guarantees precisely for this inconsistent case and finite hypothesis sets.
To derive learning guarantees in this more general setting, we will use Hoeffding's inequality (theorem D.1) or the following corollary, which relates the generalization error and empirical error of a single hypothesis.
Corollary 2.1
Fix ε > 0 and let S denote an i.i.d. sample of size m. Then, for any hypothesis h: X → {0, 1}, the following inequalities hold:

$$\Pr_{S \sim D^m}[\hat{R}(h) - R(h) \ge \epsilon] \le \exp(-2m\epsilon^2) \quad (2.14)$$

$$\Pr_{S \sim D^m}[\hat{R}(h) - R(h) \le -\epsilon] \le \exp(-2m\epsilon^2). \quad (2.15)$$
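The corollary is easy to illustrate by simulation; the sketch below (our own, with arbitrary parameter values) compares the observed frequency of the deviation event in (2.14) with the bound exp(−2mε²):

    import math
    import random

    def deviation_frequency(p=0.3, m=200, eps=0.05, trials=20000, seed=0):
        """Frequency of the event R_hat(h) - R(h) >= eps over simulated samples."""
        rng = random.Random(seed)
        hits = 0
        for _ in range(trials):
            errors = sum(rng.random() < p for _ in range(m))  # i.i.d. 0/1 mistakes
            if errors / m - p >= eps:
                hits += 1
        return hits / trials

    print(deviation_frequency())               # observed frequency, roughly 0.06
    print(math.exp(-2 * 200 * 0.05 ** 2))      # Hoeffding bound (2.14): ~0.37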
Thomas Dietterich, Editor
Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns, Associate Editors

A complete list of books published in The Adaptive Computation and Machine Learning series appears at the back of this book.

Foundations of Machine Learning
Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar

The MIT Press
Cambridge, Massachusetts
London, England

© 2012 Massachusetts Institute of Technology
All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.
MIT Press books may be purchased at special quantity discounts for business or sales promotional use. For information, please email special_sales@mitpress.mit.edu or write to Special Sales Department, The MIT Press, 55 Hayward Street, Cambridge, MA 02142.
This book was set in LaTeX by the authors. Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data
Mohri, Mehryar.
Foundations of machine learning / Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar.
p. cm. - (Adaptive computation and machine learning series)
Includes bibliographical references and index.
ISBN 978-0-262-01825-8 (hardcover : alk. paper) 1. Machine learning. 2. Computer algorithms. I. Rostamizadeh, Afshin. II. Talwalkar, Ameet. III. Title.
Q325.5.M64 2012
006.3'1-dc23
2012007249
10 9 8 7 6 5 4 3 2 1

Contents

Preface
1 Introduction
1.1 Applications and problems
1.2 Definitions and terminology
1.3 Cross-validation
1.4 Learning scenarios
1.5 Outline
2 The PAC Learning Framework
2.1 The PAC learning model
2.2 Guarantees for finite hypothesis sets — consistent case
2.3 Guarantees for finite hypothesis sets — inconsistent case
2.4 Generalities
2.4.1 Deterministic versus stochastic scenarios
2.4.2 Bayes error and noise
2.4.3 Estimation and approximation errors
2.4.4 Model selection
2.5 Chapter notes
2.6 Exercises
3 Rademacher Complexity and VC-Dimension
3.1 Rademacher complexity
3.2 Growth function
3.3 VC-dimension
3.4 Lower bounds
3.5 Chapter notes
3.6 Exercises
4 Support Vector Machines
4.1 Linear classification
4.2 SVMs — separable case
4.2.1 Primal optimization problem
4.2.2 Support vectors
4.2.3 Dual optimization problem
4.2.4 Leave-one-out analysis
4.3 SVMs — non-separable case
4.3.1 Primal optimization problem
4.3.2 Support vectors
4.3.3 Dual optimization problem
4.4 Margin theory
4.5 Chapter notes
4.6 Exercises
5 Kernel Methods
5.1 Introduction
5.2 Positive definite symmetric kernels
5.2.1 Definitions
5.2.2 Reproducing kernel Hilbert space
5.2.3 Properties
5.3 Kernel-based algorithms
5.3.1 SVMs with PDS kernels
5.3.2 Representer theorem
5.3.3 Learning guarantees
5.4 Negative definite symmetric kernels
5.5 Sequence kernels
5.5.1 Weighted transducers
5.5.2 Rational kernels
5.6 Chapter notes
5.7 Exercises
6 Boosting
6.1 Introduction
6.2 AdaBoost
6.2.1 Bound on the empirical error
6.2.2 Relationship with coordinate descent
6.2.3 Relationship with logistic regression
6.2.4 Standard use in practice
6.3 Theoretical results
6.3.1 VC-dimension-based analysis
6.3.2 Margin-based analysis
6.3.3 Margin maximization
6.3.4 Game-theoretic interpretation
6.4 Discussion
6.5 Chapter notes
6.6 Exercises
7 On-Line Learning
7.1 Introduction
7.2 Prediction with expert advice
7.2.1 Mistake bounds and Halving algorithm
7.2.2 Weighted majority algorithm
7.2.3 Randomized weighted majority algorithm
7.2.4 Exponential weighted average algorithm
7.3 Linear classification
7.3.1 Perceptron algorithm
7.3.2 Winnow algorithm
7.4 On-line to batch conversion
7.5 Game-theoretic connection
7.6 Chapter notes
7.7 Exercises
8 Multi-Class Classification
8.1 Multi-class classification problem
8.2 Generalization bounds
8.3 Uncombined multi-class algorithms
8.3.1 Multi-class SVMs
8.3.2 Multi-class boosting algorithms
8.3.3 Decision trees
8.4 Aggregated multi-class algorithms
8.4.1 One-versus-all
8.4.2 One-versus-one
8.4.3 Error-correction codes
8.5 Structured prediction algorithms
8.6 Chapter notes
8.7 Exercises
9 Ranking
9.1 The problem of ranking
9.2 Generalization bound
9.3 Ranking with SVMs
9.4 RankBoost
9.4.1 Bound on the empirical error
9.4.2 Relationship with coordinate descent
9.4.3 Margin bound for ensemble methods in ranking
9.5 Bipartite ranking
9.5.1 Boosting in bipartite ranking
9.5.2 Area under the ROC curve
9.6 Preference-based setting
9.6.1 Second-stage ranking problem
9.6.2 Deterministic algorithm
9.6.3 Randomized algorithm
9.6.4 Extension to other loss functions
9.7 Discussion
9.8 Chapter notes
9.9 Exercises
10 Regression
10.1 The problem of regression
10.2 Generalization bounds
10.2.1 Finite hypothesis sets
10.2.2 Rademacher complexity bounds
10.2.3 Pseudo-dimension bounds
10.3 Regression algorithms
10.3.1 Linear regression
10.3.2 Kernel ridge regression
10.3.3 Support vector regression
10.3.4 Lasso
10.3.5 Group norm regression algorithms
10.3.6 On-line regression algorithms
10.4 Chapter notes
10.5 Exercises
11 Algorithmic Stability
11.1 Definitions
11.2 Stability-based generalization guarantee
11.3 Stability of kernel-based regularization algorithms
11.3.1 Application to regression algorithms: SVR and KRR
11.3.2 Application to classification algorithms: SVMs
11.3.3 Discussion
11.4 Chapter notes
11.5 Exercises
12 Dimensionality Reduction
12.1 Principal Component Analysis
12.2 Kernel Principal Component Analysis (KPCA)
12.3 KPCA and manifold learning
12.3.1 Isomap
12.3.2 Laplacian eigenmaps
12.3.3 Locally linear embedding (LLE)
12.4 Johnson-Lindenstrauss lemma
12.5 Chapter notes
12.6 Exercises
13 Learning Automata and Languages
13.1 Introduction
13.2 Finite automata
13.3 Efficient exact learning
13.3.1 Passive learning
13.3.2 Learning with queries
13.3.3 Learning automata with queries
13.4 Identification in the limit
13.4.1 Learning reversible automata
13.5 Chapter notes
13.6 Exercises
14 Reinforcement Learning
14.1 Learning scenario
14.2 Markov decision process model
14.3 Policy
14.3.1 Definition
14.3.2 Policy value
14.3.3 Policy evaluation
14.3.4 Optimal policy
14.4 Planning algorithms
14.4.1 Value iteration
14.4.2 Policy iteration
14.4.3 Linear programming
14.5 Learning algorithms
14.5.1 Stochastic approximation
14.5.2 TD(0) algorithm
14.5.3 Q-learning algorithm
14.5.4 SARSA
14.5.5 TD(λ) algorithm
14.5.6 Large state space
14.6 Chapter notes
Conclusion
A Linear Algebra Review
A.1 Vectors and norms
A.1.1 Norms
A.1.2 Dual norms
A.2 Matrices
A.2.1 Matrix norms
A.2.2 Singular value decomposition
A.2.3 Symmetric positive semidefinite (SPSD) matrices
B Convex Optimization
B.1 Differentiation and unconstrained optimization
B.2 Convexity
B.3 Constrained optimization
B.4 Chapter notes
C Probability Review
C.1 Probability
C.2 Random variables
C.3 Conditional probability and independence
C.4 Expectation, Markov's inequality, and moment-generating function
C.5 Variance and Chebyshev's inequality
D Concentration inequalities
D.1 Hoeffding's inequality
D.2 McDiarmid's inequality
D.3 Other inequalities
D.3.1 Binomial distribution: Slud's inequality
D.3.2 Normal distribution: tail bound
D.3.3 Khintchine-Kahane inequality
D.4 Chapter notes
D.5 Exercises
E Notation
References
Index

Preface
This book is a general introduction to machine learning that can serve as a textbook for students and researchers in the field. It covers fundamental modern topics in machine learning while providing the theoretical basis and conceptual tools needed for the discussion and justification of algorithms. It also describes several key aspects of the application of these algorithms.
We have aimed to present the most novel theoretical tools and concepts while giving concise proofs, even for relatively advanced results. In general, whenever possible, we have chosen to favor succinctness. Nevertheless, we discuss some crucial complex topics arising in machine learning and highlight several open research questions. Certain topics often merged with others or treated with insufficient attention are discussed separately here and with more emphasis: for example, a different chapter is reserved for each of multi-class classification, ranking, and regression.
Although we cover a very wide variety of important topics in machine learning, we have chosen to omit a few important ones, including graphical models and neural networks, both for the sake of brevity and because of the current lack of solid theoretical guarantees for some methods.
The book is intended for students and researchers in machine learning, statistics and other related areas. It can be used as a textbook for both graduate and advanced undergraduate classes in machine learning or as a reference text for a research seminar. The first three chapters of the book lay the theoretical foundation for the subsequent material. Other chapters are mostly self-contained, with the exception of chapter 5, which introduces some concepts that are extensively used in later ones. Each chapter concludes with a series of exercises, with full solutions presented separately.
The reader is assumed to be familiar with basic concepts in linear algebra, probability, and analysis of algorithms. However, to further help him, we present in the appendix a concise linear algebra and a probability review, and a short introduction to convex optimization. We have also collected in the appendix a number of useful tools for concentration bounds used in this book.
To our knowledge, there is no single textbook covering all of the material presented here. The need for a unified presentation has been pointed out to us every year by our machine learning students. There are several good books for various specialized areas, but these books do not include a discussion of other fundamental topics in a general manner. For example, books about kernel methods do not include a discussion of other fundamental topics such as boosting, ranking, reinforcement learning, learning automata or on-line learning. There also exist more general machine learning books, but the theoretical foundation of our book and our emphasis on proofs make our presentation quite distinct.
Most of the material presented here takes its origins in a machine learning graduate course (Foundations of Machine Learning) taught by the first author at the Courant Institute of Mathematical Sciences in New York University over the last seven years. This book has considerably benefited from the comments and suggestions from students in these classes, along with those of many friends, colleagues and researchers to whom we are deeply indebted.
We are particularly grateful to Corinna Cortes and Yishay Mansour, who have both made a number of key suggestions for the design and organization of the material presented, with detailed comments that we have fully taken into account and that have greatly improved the presentation. We are also grateful to Yishay Mansour for using a preliminary version of the book for teaching and for reporting his feedback to us.
We also thank for discussions, suggested improvement, and contributions of many kinds the following colleagues and friends from academic and corporate research laboratories: Cyril Allauzen, Stephen Boyd, Spencer Greenberg, Lisa Hellerstein, Sanjiv Kumar, Ryan McDonald, Andres Muñoz Medina, Tyler Neylon, Peter Norvig, Fernando Pereira, Maria Pershina, Ashish Rastogi, Michael Riley, Umar Syed, Csaba Szepesvári, Eugene Weinstein, and Jason Weston.
Finally, we thank the MIT Press publication team for their help and support in the development of this text.

1 Introduction
Machine learning can be broadly defined as computational methods using experience to improve performance or to make accurate predictions. Here, experience refers to the past information available to the learner, which typically takes the form of electronic data collected and made available for analysis. This data could be in the form of digitized human-labeled training sets, or other types of information obtained via interaction with the environment. In all cases, its quality and size are crucial to the success of the predictions made by the learner.
Machine learning consists of designing efficient and accurate prediction algorithms. As in other areas of computer science, some critical measures of the quality of these algorithms are their time and space complexity. But, in machine learning, we will additionally need a notion of sample complexity to evaluate the sample size required for the algorithm to learn a family of concepts. More generally, theoretical learning guarantees for an algorithm depend on the complexity of the concept classes considered and the size of the training sample.
Since the success of a learning algorithm depends on the data used, machine learning is inherently related to data analysis and statistics. More generally, learning techniques are data-driven methods combining fundamental concepts in computer science with ideas from statistics, probability and optimization.
1.1 Applications and problems
Learning algorithms have been successfully deployed in a variety of applications, including:
Text or document classification, e.g., spam detection;
Natural language processing, e.g., morphological analysis, part-of-speech tagging, statistical parsing, named-entity recognition;
Speech recognition, speech synthesis, speaker verification;
Optical character recognition (OCR);
Computational biology applications, e.g., protein function or structured prediction;
Computer vision tasks, e.g., image recognition, face detection;
Fraud detection (credit card, telephone) and network intrusion;
Games, e.g., chess, backgammon;
Unassisted vehicle control (robots, navigation);
Medical diagnosis;
Recommendation systems, search engines, information extraction systems.
This list is by no means comprehensive, and learning algorithms are applied to new
applications every day. Moreover, such applications correspond to a wide variety of
learning problems. Some major classes of learning problems are:
Classification: Assign a category to each item. For example, document classification may assign items with categories such as politics, business, sports, or weather, while image classification may assign items with categories such as landscape, portrait, or animal. The number of categories in such tasks is often relatively small, but can be large in some difficult tasks and even unbounded as in OCR, text classification, or speech recognition.
Regression: Predict a real value for each item. Examples of regression include prediction of stock values or variations of economic variables. In this problem, the penalty for an incorrect prediction depends on the magnitude of the difference between the true and predicted values, in contrast with the classification problem, where there is typically no notion of closeness between various categories.
Ranking: Order items according to some criterion. Web search, e.g., returning web pages relevant to a search query, is the canonical ranking example. Many other similar ranking problems arise in the context of the design of information extraction or natural language processing systems.
Clustering: Partition items into homogeneous regions. Clustering is often performed to analyze very large data sets. For example, in the context of social network analysis, clustering algorithms attempt to identify "communities" within large groups of people.
Dimensionality reduction or manifold learning: Transform an initial representation of items into a lower-dimensional representation of these items while preserving some properties of the initial representation. A common example involves preprocessing digital images in computer vision tasks.
The main practical objectives of machine learning consist of generating accurate predictions for unseen items and of designing efficient and robust algorithms to produce these predictions, even for large-scale problems. To do so, a number of algorithmic and theoretical questions arise. Some fundamental questions include: Which concept families can actually be learned, and under what conditions? How well can these concepts be learned computationally?

Figure 1.1 The zig-zag line on the left panel is consistent over the blue and red training sample, but it is a complex separation surface that is not likely to generalize well to unseen data. In contrast, the decision surface on the right panel is simpler and might generalize better in spite of its misclassification of a few points of the training sample.
1.2 Definitions and terminology
We will use the canonical problem of spam detection as a running example to illustrate some basic definitions and to describe the use and evaluation of machine learning algorithms in practice. Spam detection is the problem of learning to automatically classify email messages as either spam or non-spam.
Examples: Items or instances of data used for learning or evaluation. In our spam problem, these examples correspond to the collection of email messages we will use for learning and testing.
Features: The set of attributes, often represented as a vector, associated to an example. In the case of email messages, some relevant features may include the length of the message, the name of the sender, various characteristics of the header, the presence of certain keywords in the body of the message, and so on.
Labels: Values or categories assigned to examples. In classification problems, examples are assigned specific categories, for instance, the spam and non-spam categories in our binary classification problem. In regression, items are assigned real-valued labels.
Training sample: Examples used to train a learning algorithm. In our spam problem, the training sample consists of a set of email examples along with their associated labels. The training sample varies for different learning scenarios, as described in section 1.4.
Validation sample: Examples used to tune the parameters of a learning algorithm when working with labeled data. Learning algorithms typically have one or more free parameters, and the validation sample is used to select appropriate values for these model parameters.
Test sample: Examples used to evaluate the performance of a learning algorithm.
The test sample is separate from the training and validation data and is not made
available in the learning stage. In the spam problem, the test sample consists of a
collection of email examples for which the learning algorithm must predict labels
based on features. These predictions are then compared with the labels of the test
sample to measure the performance of the algorithm.
Loss function: A function that measures the difference, or loss, between a predicted label and a true label. Denoting the set of all labels as Y and the set of possible predictions as Y′, a loss function L is a mapping L: Y × Y′ → R+. In most cases, Y′ = Y and the loss function is bounded, but these conditions do not always hold. Common examples of loss functions include the zero-one (or misclassification) loss defined over {−1, +1} × {−1, +1} by L(y, y′) = 1_{y′≠y} and the squared loss defined over I × I by L(y, y′) = (y′ − y)², where I ⊆ R is typically a bounded interval (both are spelled out in the short code sketch after this list).
Hypothesis set: A set of functions mapping features (feature vectors) to the set of labels Y. In our example, these may be a set of functions mapping email features to Y = {spam, non-spam}. More generally, hypotheses may be functions mapping features to a different set Y′. They could be linear functions mapping email feature vectors to real numbers interpreted as scores (Y′ = R), with higher score values more indicative of spam than lower ones.
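For concreteness, the two loss functions just mentioned can be written in a few lines of code; this is our own sketch, not code from the book:

    def zero_one_loss(y_pred: int, y_true: int) -> float:
        """Misclassification loss over {-1, +1}: 1 if the labels differ, else 0."""
        return float(y_pred != y_true)

    def squared_loss(y_pred: float, y_true: float) -> float:
        """Squared loss (y' - y)^2 over a bounded real interval."""
        return (y_pred - y_true) ** 2

    print(zero_one_loss(+1, -1), zero_one_loss(-1, -1))  # 1.0 0.0
    print(squared_loss(0.8, 1.0))                        # ~0.04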
We now define the learning stages of our spam problem. We start with a given collection of labeled examples. We first randomly partition the data into a training
sample, a validation sample, and a test sample. The size of each of these samples
depends on a number of different considerations. For example, the amount of data
reserved for validation depends on the number of free parameters of the algorithm.
Also, when the labeled sample is relatively small, the amount of training data is
often chosen to be larger than that of test data since the learning performance
directly depends on the training sample.
Next, we associate relevant features to the examples. This is a critical step in the design of machine learning solutions. Useful features can effectively guide the learning algorithm, while poor or uninformative ones can be misleading. Although it is critical, the choice of the features is, to a large extent, left to the user. This choice reflects the user's prior knowledge about the learning task, which in practice can have a dramatic effect on the performance results.
Now, we use the features selected to train our learning algorithm by fixing different values of its free parameters. For each value of these parameters, the algorithm selects a different hypothesis out of the hypothesis set. We choose among them the hypothesis resulting in the best performance on the validation sample. Finally, using that hypothesis, we predict the labels of the examples in the test sample. The performance of the algorithm is evaluated by using the loss function associated to the task, e.g., the zero-one loss in our spam detection task, to compare the predicted and true labels.
Thus, the performance of an algorithm is of course evaluated based on its test error and not its error on the training sample. A learning algorithm may be consistent, that is, it may commit no error on the examples of the training data, and yet have a poor performance on the test data. This occurs for consistent learners defined by very complex decision surfaces, as illustrated in figure 1.1, which tend to memorize a relatively small training sample instead of seeking to generalize well. This highlights the key distinction between memorization and generalization, which is the fundamental property sought for an accurate learning algorithm. Theoretical guarantees for consistent learners will be discussed in great detail in chapter 2.
1.3 Cross-validation
In practice, the amount of labeled data available is often too small to set aside a validation sample, since that would leave an insufficient amount of training data. Instead, a widely adopted method known as n-fold cross-validation is used to exploit the labeled data both for model selection (selection of the free parameters of the algorithm) and for training.
Let θ denote the vector of free parameters of the algorithm. For a fixed value of θ, the method consists of first randomly partitioning a given sample S of m labeled examples into n subsamples, or folds. The ith fold is thus a labeled sample ((x_{i1}, y_{i1}), ..., (x_{im_i}, y_{im_i})) of size m_i. Then, for any i ∈ [1, n], the learning algorithm is trained on all but the ith fold to generate a hypothesis h_i, and the performance of h_i is tested on the ith fold, as illustrated in figure 1.2a. The parameter value θ is evaluated based on the average error of the hypotheses h_i, which is called the cross-validation error. This quantity is denoted by R̂_CV(θ) and defined by

$$\hat{R}_{CV}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \underbrace{\frac{1}{m_i} \sum_{j=1}^{m_i} L(h_i(x_{ij}), y_{ij})}_{\text{error of } h_i \text{ on the } i\text{th fold}}.$$
The folds are generally chosen to have equal size, that is, m_i = m/n for all i ∈ [1, n]. How should n be chosen? The appropriate choice is subject to a trade-off and is the topic of much learning theory research that we cannot address in this introductory chapter. For a large n, each training sample used in n-fold cross-validation has size m − m/n = m(1 − 1/n) (illustrated by the right vertical red line in figure 1.2b), which is close to m, the size of the full sample, but the training samples are quite similar. Thus, the method tends to have a small bias but a large variance. In contrast, smaller values of n lead to more diverse training samples, but their size (shown by the left vertical red line in figure 1.2b) is significantly less than m; thus the method tends to have a smaller variance but a larger bias.

Figure 1.2 n-fold cross validation. (a) Illustration of the partitioning of the training data into 5 folds. (b) Typical plot of a classifier's prediction error as a function of the size of the training sample: the error decreases as a function of the number of training points.
In machine learning applications, n is typically chosen to be 5 or 10. n-fold cross-validation is used as follows in model selection. The full labeled data is first split into a training and a test sample. The training sample of size m is then used to compute the n-fold cross-validation error R̂_CV(θ) for a small number of possible values of θ. θ is next set to the value θ₀ for which R̂_CV(θ) is smallest, and the algorithm is trained with the parameter setting θ₀ over the full training sample of size m. Its performance is evaluated on the test sample as already described in the previous section.
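The procedure just described is compact enough to sketch in code. The following is our own illustration (cross_validation_error and the fit/loss callables are hypothetical names; the toy usage at the end is arbitrary):

    import random

    def cross_validation_error(train, n_folds, fit, loss, seed=0):
        """n-fold cross-validation error R_CV(theta) for one parameter setting.

        `train` is a list of (x, y) pairs, `fit` maps a list of training pairs
        to a hypothesis h (a callable), and `loss` maps (h(x), y) to a real.
        """
        data = list(train)
        random.Random(seed).shuffle(data)
        folds = [data[i::n_folds] for i in range(n_folds)]  # near-equal sizes m_i
        total = 0.0
        for i in range(n_folds):
            held_out = folds[i]
            rest = [ex for j in range(n_folds) if j != i for ex in folds[j]]
            h = fit(rest)
            total += sum(loss(h(x), y) for x, y in held_out) / len(held_out)
        return total / n_folds

    # Toy usage: a constant (mean) predictor under the squared loss.
    data = [(x / 10, 2 * x / 10) for x in range(50)]
    fit_mean = lambda s: (lambda x, c=sum(y for _, y in s) / len(s): c)
    print(cross_validation_error(data, 5, fit_mean, lambda p, y: (p - y) ** 2))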
The special case of n-fold cross validation where n = m is called leave-one-out cross-validation, since at each iteration exactly one instance is left out of the training sample. As shown in chapter 4, the average leave-one-out error is an approximately unbiased estimate of the average error of an algorithm and can be used to derive simple guarantees for some algorithms. In general, the leave-one-out error is very costly to compute, since it requires training n times on samples of size m − 1, but for some algorithms it admits a very efficient computation (see exercise 10.9).
In addition to model selection, n-fold cross validation is also commonly used for performance evaluation. In that case, for a fixed parameter setting θ, the full labeled sample is divided into n random folds with no distinction between training and test samples. The performance reported is the n-fold cross-validation error on the full sample as well as the standard deviation of the errors measured on each fold.
1.4 Learning scenarios
We next briefly describe common machine learning scenarios. These scenarios differ in the types of training data available to the learner, the order and method by which training data is received, and the test data used to evaluate the learning algorithm.
Supervised learning: The learner receives a set of labeled examples as training data and makes predictions for all unseen points. This is the most common scenario associated with classification, regression, and ranking problems. The spam detection problem discussed in the previous section is an instance of supervised learning.
Unsupervised learning: The learner exclusively receives unlabeled training data, and makes predictions for all unseen points. Since in general no labeled example is available in that setting, it can be difficult to quantitatively evaluate the performance of a learner. Clustering and dimensionality reduction are examples of unsupervised learning problems.
Semi-supervised learning: The learner receives a training sample consisting of both labeled and unlabeled data, and makes predictions for all unseen points. Semi-supervised learning is common in settings where unlabeled data is easily accessible but labels are expensive to obtain. Various types of problems arising in applications, including classification, regression, or ranking tasks, can be framed as instances of semi-supervised learning. The hope is that the distribution of unlabeled data accessible to the learner can help him achieve a better performance than in the supervised setting. The analysis of the conditions under which this can indeed be realized is the topic of much modern theoretical and applied machine learning research.
Transductive inference: As in the semi-supervised scenario, the learner receives
a labeled training sample along with a set of unlabeled test points. However, the
objective of transductive inference is to predict labels only for these particular test
points. Transductive inference appears to be an easier task and matches the scenario
encountered in a variety of modern applications. However, as in the semi-supervised
setting, the assumptions under which a better performance can be achieved in this
setting are research questions that have not been fully resolved.
On-line learning: In contrast with the previous scenarios, the on-line scenario involves multiple rounds, and training and testing phases are intermixed. At each
round, the learner receives an unlabeled training point, makes a prediction, receives
the true label, and incurs a loss. The objective in the on-line setting is to minimize
the cumulative loss over all rounds. Unlike the previous settings just discussed, no
distributional assumption is made in on-line learning. In fact, instances and their
labels may be chosen adversarially within this scenario.8 Introduction
Reinforcement learning: The training and testing phases are also intermixed in reinforcement learning. To collect information, the learner actively interacts with the environment and in some cases affects the environment, and receives an immediate reward for each action. The object of the learner is to maximize his reward over a course of actions and iterations with the environment. However, no long-term reward feedback is provided by the environment, and the learner is faced with the exploration versus exploitation dilemma, since he must choose between exploring unknown actions to gain more information versus exploiting the information already collected.
Active learning: The learner adaptively or interactively collects training examples, typically by querying an oracle to request labels for new points. The goal in active learning is to achieve a performance comparable to the standard supervised learning scenario, but with fewer labeled examples. Active learning is often used in applications where labels are expensive to obtain, for example, computational biology applications.
In practice, many other intermediate and somewhat more complex learning scenarios
may be encountered.
1.5 Outline
This book presents several fundamental and mathematically well-studied algo-
rithms. It discusses in depth their theoretical foundations as well as their practical
applications. The topics covered include:
Probably approximately correct (PAC) learning framework; learning guarantees for finite hypothesis sets;
Learning guarantees for infinite hypothesis sets, Rademacher complexity, VC-dimension;
Support vector machines (SVMs), margin theory;
Kernel methods, positive definite symmetric kernels, representer theorem, rational kernels;
Boosting, analysis of empirical error, generalization error, margin bounds;
On-line learning, mistake bounds, the weighted majority algorithm, the exponential weighted average algorithm, the Perceptron and Winnow algorithms;
Multi-class classification, multi-class SVMs, multi-class boosting, one-versus-all, one-versus-one, error-correction methods;
Ranking, ranking with SVMs, RankBoost, bipartite ranking, preference-based ranking;
Regression, linear regression, kernel ridge regression, support vector regression, Lasso;
Stability-based analysis, applications to classification and regression;
Dimensionality reduction, principal component analysis (PCA), kernel PCA, Johnson-Lindenstrauss lemma;
Learning automata and languages;
Reinforcement learning, Markov decision processes, planning and learning problems.
The analyses in this book are self-contained, with relevant mathematical concepts related to linear algebra, convex optimization, probability and statistics included in the appendix.

2 The PAC Learning Framework
Several fundamental questions arise when designing and analyzing algorithms that learn from examples: What can be learned efficiently? What is inherently hard to learn? How many examples are needed to learn successfully? Is there a general model of learning? In this chapter, we begin to formalize and address these questions by introducing the Probably Approximately Correct (PAC) learning framework. The PAC framework helps define the class of learnable concepts in terms of the number of sample points needed to achieve an approximate solution (the sample complexity) and the time and space complexity of the learning algorithm, which depends on the cost of the computational representation of the concepts.
We first describe the PAC framework and illustrate it, then present some general learning guarantees within this framework when the hypothesis set used is finite, both for the consistent case where the hypothesis set used contains the concept to learn and for the opposite inconsistent case.
2.1 The PAC learning model
We first introduce several definitions and the notation needed to present the PAC model, which will also be used throughout much of this book.
We denote by X the set of all possible examples or instances. X is also sometimes referred to as the input space. The set of all possible labels or target values is denoted by Y. For the purpose of this introductory chapter, we will limit ourselves to the case where Y is reduced to two labels, Y = {0, 1}, so-called binary classification. Later chapters will extend these results to more general settings.
A concept c: X → Y is a mapping from X to Y. Since Y = {0, 1}, we can identify c with the subset of X over which it takes the value 1. Thus, in the following, we equivalently refer to a concept to learn as a mapping from X to {0, 1}, or to a subset of X. As an example, a concept may be the set of points inside a triangle or the indicator function of these points. In such cases, we will say in short that the concept to learn is a triangle. A concept class is a set of concepts we may wish to learn and is denoted by C. This could, for example, be the set of all triangles in the plane.
We assume that examples are independently and identically distributed (i.i.d.) according to some fixed but unknown distribution D. The learning problem is then formulated as follows. The learner considers a fixed set of possible concepts H, called a hypothesis set, which may not coincide with C. He receives a sample S = (x1, ..., xm) drawn i.i.d. according to D as well as the labels (c(x1), ..., c(xm)), which are based on a specific target concept c ∈ C to learn. His task is to use the labeled sample S to select a hypothesis h_S ∈ H that has a small generalization error with respect to the concept c. The generalization error of a hypothesis h ∈ H, also referred to as the true error or just error of h, is denoted by R(h) and defined as follows.
Definition 2.1 Generalization error
Given a hypothesis h ∈ H, a target concept c ∈ C, and an underlying distribution D, the generalization error or risk of h is defined by

$$R(h) = \Pr_{x \sim D}[h(x) \ne c(x)] = \mathop{\mathbb{E}}_{x \sim D}\big[1_{h(x) \ne c(x)}\big], \quad (2.1)$$

where 1_ω is the indicator function of the event ω.
The generalization error of a hypothesis is not directly accessible to the learner
since both the distribution D and the target concept c are unknown. However, the
learner can measure the empirical error of a hypothesis on the labeled sample S.
Definition 2.2 Empirical error
Given a hypothesis h ∈ H, a target concept c ∈ C, and a sample S = (x1, ..., xm), the empirical error or empirical risk of h is defined by

$$\hat{R}(h) = \frac{1}{m}\sum_{i=1}^{m} 1_{h(x_i) \ne c(x_i)}. \quad (2.2)$$
Thus, the empirical error of h ∈ H is its average error over the sample S, while the generalization error is its expected error based on the distribution D. We will see in this chapter and the following chapters a number of guarantees relating these two quantities with high probability, under some general assumptions. We can already note that for a fixed h ∈ H, the expectation of the empirical error based on an i.i.d. sample S is equal to the generalization error:

$$\mathop{\mathbb{E}}_{S \sim D^m}[\hat{R}(h)] = R(h). \quad (2.3)$$

Indeed, by the linearity of the expectation and the fact that the sample is drawn i.i.d., we can write

$$\mathop{\mathbb{E}}_{S \sim D^m}[\hat{R}(h)] = \frac{1}{m}\sum_{i=1}^{m} \mathop{\mathbb{E}}_{S \sim D^m}\big[1_{h(x_i) \ne c(x_i)}\big] = \frac{1}{m}\sum_{i=1}^{m} \mathop{\mathbb{E}}_{S \sim D^m}\big[1_{h(x) \ne c(x)}\big],$$

for any fixed x in sample S. Thus,

$$\mathop{\mathbb{E}}_{S \sim D^m}[\hat{R}(h)] = \mathop{\mathbb{E}}_{S \sim D^m}\big[1_{h(x) \ne c(x)}\big] = \mathop{\mathbb{E}}_{x \sim D}\big[1_{h(x) \ne c(x)}\big] = R(h).$$

1. The choice of R instead of E to denote an error avoids possible confusions with the notation for expectations and is further justified by the fact that the term risk is also used in machine learning and statistics to refer to an error.
2. For this and other related definitions, the family of functions H and the target concept c must be measurable. The function classes we consider in this book all have this property.
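Equation (2.3) can also be checked by simulation: averaging the empirical error of a fixed hypothesis over many i.i.d. samples should recover its generalization error. The sketch below is our own, with R(h) = 0.2 chosen arbitrarily:

    import random

    # Simulate (2.3): average R_hat over many i.i.d. samples of size m and
    # compare with the true error R(h) (here R(h) = 0.2, an assumed value).
    rng = random.Random(1)
    R_true, m, n_samples = 0.2, 25, 100000
    avg = sum(sum(rng.random() < R_true for _ in range(m)) / m
              for _ in range(n_samples)) / n_samples
    print(avg)  # close to 0.2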
The following introduces the Probably Approximately Correct (PAC) learning framework. We denote by O(n) an upper bound on the cost of the computational representation of any element x ∈ X and by size(c) the maximal cost of the computational representation of c ∈ C. For example, x may be a vector in Rⁿ, for which the cost of an array-based representation would be in O(n).
Definition 2.3 PAC-learning
A concept class C is said to be PAC-learnable if there exists an algorithm A and a polynomial function poly(·, ·, ·, ·) such that for any ε > 0 and δ > 0, for all distributions D on X and for any target concept c ∈ C, the following holds for any sample size m ≥ poly(1/ε, 1/δ, n, size(c)):

$$\Pr_{S \sim D^m}[R(h_S) \le \epsilon] \ge 1 - \delta. \quad (2.4)$$

If A further runs in poly(1/ε, 1/δ, n, size(c)), then C is said to be efficiently PAC-learnable. When such an algorithm A exists, it is called a PAC-learning algorithm for C.
A concept class C is thus PAC-learnable if the hypothesis returned by the algorithm after observing a number of points polynomial in 1/ε and 1/δ is approximately correct (error at most ε) with high probability (at least 1 − δ), which justifies the PAC terminology. δ > 0 is used to define the confidence 1 − δ and ε > 0 the accuracy 1 − ε. Note that if the running time of the algorithm is polynomial in 1/ε and 1/δ, then the sample size m must also be polynomial if the full sample is received by the algorithm.
Several key points of the PAC definition are worth emphasizing. First, the PAC framework is a distribution-free model: no particular assumption is made about the distribution D from which examples are drawn. Second, the training sample and the test examples used to define the error are drawn according to the same distribution D. This is a necessary assumption for generalization to be possible in most cases.
Figure 2.1 Target concept R and possible hypothesis R′. Circles represent training instances. A blue circle is a point labeled with 1, since it falls within the rectangle R. Others are red and labeled with 0.
Finally, the PAC framework deals with the question of learnability for a concept class C and not a particular concept. Note that the concept class C is known to the algorithm, but of course the target concept c ∈ C is unknown.
In many cases, in particular when the computational representation of the concepts is not explicitly discussed or is straightforward, we may omit the polynomial dependency on n and size(c) in the PAC definition and focus only on the sample complexity.
We now illustrate PAC-learning with a specific learning problem.
Example 2.1 Learning axis-aligned rectangles
Consider the case where the set of instances are points in the plane, X = R², and the concept class C is the set of all axis-aligned rectangles lying in R². Thus, each concept c is the set of points inside a particular axis-aligned rectangle. The learning problem consists of determining with small error a target axis-aligned rectangle using the labeled training sample. We will show that the concept class of axis-aligned rectangles is PAC-learnable.
Figure 2.1 illustrates the problem. R represents a target axis-aligned rectangle and R′ a hypothesis. As can be seen from the figure, the error regions of R′ are formed by the area within the rectangle R but outside the rectangle R′ and the area within R′ but outside the rectangle R. The first area corresponds to false negatives, that is, points that are labeled as 0 or negatively by R′, which are in fact positive or labeled with 1. The second area corresponds to false positives, that is, points labeled positively by R′ which are in fact negatively labeled.
To show that the concept class is PAC-learnable, we describe a simple PAC-learning algorithm A. Given a labeled sample S, the algorithm consists of returning the tightest axis-aligned rectangle R′ = R_S containing the points labeled with 1. Figure 2.2 illustrates the hypothesis returned by the algorithm. By definition, R_S does not produce any false positive, since its points must be included in the target concept R. Thus, the error region of R_S is included in R.
Figure 2.2 Illustration of the hypothesis R′ = R_S returned by the algorithm.
Let R ∈ C be a target concept. Fix ε > 0. Let Pr[R] denote the probability mass of the region defined by R, that is, the probability that a point randomly drawn according to D falls within R. Since errors made by our algorithm can be due only to points falling inside R, we can assume that Pr[R] > ε; otherwise, the error of R_S is less than or equal to ε regardless of the training sample S received.
Now, since Pr[R] > ε, we can define four rectangular regions r1, r2, r3, and r4 along the sides of R, each with probability at least ε/4. These regions can be constructed by starting with the empty rectangle along a side and increasing its size until its distribution mass is at least ε/4. Figure 2.3 illustrates the definition of these regions.
Observe that if R_S meets all of these four regions, then, because it is a rectangle, it will have one side in each of these four regions (geometric argument). Its error area, which is the part of R that it does not cover, is thus included in these regions and cannot have probability mass more than ε. By contraposition, if R(R_S) > ε, then R_S must miss at least one of the regions r_i, i ∈ [1, 4]. As a result, we can write

$$\begin{aligned}
\Pr_{S \sim D^m}[R(R_S) > \epsilon] &\le \Pr_{S \sim D^m}\big[\cup_{i=1}^{4}\{R_S \cap r_i = \emptyset\}\big] \quad (2.5)\\
&\le \sum_{i=1}^{4} \Pr_{S \sim D^m}[\{R_S \cap r_i = \emptyset\}] && \text{(by the union bound)}\\
&\le 4(1 - \epsilon/4)^m && (\text{since } \Pr[r_i] \ge \epsilon/4)\\
&\le 4\exp(-m\epsilon/4),
\end{aligned}$$

where for the last step we used the general identity 1 − x ≤ e^{−x}, valid for all x ∈ R. For any δ > 0, to ensure that Pr_{S∼D^m}[R(R_S) > ε] ≤ δ, we can impose

$$4\exp(-m\epsilon/4) \le \delta \iff m \ge \frac{4}{\epsilon}\log\frac{4}{\delta}. \quad (2.6)$$

Thus, for any ε > 0 and δ > 0, if the sample size m is greater than (4/ε) log(4/δ), then Pr_{S∼D^m}[R(R_S) > ε] ≤ δ.
Furthermore, the computational cost of the representation of points in R² and axis-aligned rectangles, which can be defined by their four corners, is constant. This proves that the concept class of axis-aligned rectangles is PAC-learnable and that the sample complexity of PAC-learning axis-aligned rectangles is in O((1/ε) log(1/δ)).

Figure 2.3 Illustration of the regions r1, ..., r4.
An equivalent way to present sample complexity results like (2.6), which we will often see throughout this book, is to give a generalization bound. It states that with probability at least 1 − δ, R(R_S) is upper bounded by some quantity that depends on the sample size m and δ. To obtain this, it suffices to set δ to be equal to the upper bound derived in (2.5), that is δ = 4 exp(−mε/4), and solve for ε. This yields that with probability at least 1 − δ, the error of the algorithm is bounded as:

$$R(R_S) \le \frac{4}{m}\log\frac{4}{\delta}. \quad (2.7)$$
Other PAC-learning algorithms could be considered for this example. One alternative is to return the largest axis-aligned rectangle not containing the negative points, for example. The proof of PAC-learning just presented for the tightest axis-aligned rectangle can be easily adapted to the analysis of other such algorithms.
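The tightest-rectangle algorithm is also easy to simulate. In the sketch below (our own illustration: the target rectangle is arbitrary, and the uniform distribution on the unit square stands in for D), the estimated error of R_S decreases as the sample size m grows, as the bound (2.7) predicts:

    import random

    def learn_tightest_rectangle(sample):
        """Tightest axis-aligned rectangle containing the positively labeled points."""
        pos = [x for x, label in sample if label == 1]
        if not pos:
            return (1.0, 0.0, 1.0, 0.0)  # empty rectangle (contains no point)
        xs = [p[0] for p in pos]
        ys = [p[1] for p in pos]
        return (min(xs), max(xs), min(ys), max(ys))

    def inside(rect, p):
        x0, x1, y0, y1 = rect
        return x0 <= p[0] <= x1 and y0 <= p[1] <= y1

    target = (0.2, 0.7, 0.3, 0.9)                # target rectangle R (arbitrary)
    rng = random.Random(0)
    draw = lambda: (rng.random(), rng.random())  # D: uniform on the unit square

    for m in (20, 100, 500):
        S = [(p, int(inside(target, p))) for p in (draw() for _ in range(m))]
        h = learn_tightest_rectangle(S)
        test = [draw() for _ in range(20000)]
        err = sum(inside(target, p) != inside(h, p) for p in test) / len(test)
        print(m, round(err, 4))  # the estimated error R(R_S) shrinks with m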
Note that the hypothesis set H we considered in this example coincided with the concept class C and that its cardinality was infinite. Nevertheless, the problem admitted a simple proof of PAC-learning. We may then ask if a similar proof can readily apply to other similar concept classes. This is not as straightforward because the specific geometric argument used in the proof is key. It is non-trivial to extend the proof to other concept classes such as that of non-concentric circles (see exercise 2.4). Thus, we need a more general proof technique and more general results. The next two sections provide us with such tools in the case of a finite hypothesis set.
2.2 Guarantees for ?nite hypothesis sets — consistent case
In the example of axis-aligned rectangles that we examined, the hypothesis h_S
returned by the algorithm was always consistent, that is, it admitted no error on
the training sample S. In this section, we present a general sample complexity
bound, or equivalently, a generalization bound, for consistent hypotheses, in the
case where the cardinality |H| of the hypothesis set is finite. Since we consider
consistent hypotheses, we will assume that the target concept c is in H.
Theorem 2.1 Learning bounds — finite H, consistent case
Let H be a finite set of functions mapping from X to Y. Let A be an algorithm that
for any target concept c ∈ H and i.i.d. sample S returns a consistent hypothesis h_S:
R̂_S(h_S) = 0. Then, for any ε, δ > 0, the inequality Pr_{S∼D^m}[R(h_S) ≤ ε] ≥ 1 − δ
holds if

    m ≥ (1/ε) (log|H| + log(1/δ)).        (2.8)
This sample complexity result admits the following equivalent statement as a gener-
alization bound: for any ε, δ > 0, with probability at least 1 − δ,

    R(h_S) ≤ (1/m) (log|H| + log(1/δ)).        (2.9)
Proof: Fix ε > 0. We do not know which consistent hypothesis h_S ∈ H is selected
by the algorithm A. This hypothesis further depends on the training sample S.
Therefore, we need to give a uniform convergence bound, that is, a bound that
holds for the set of all consistent hypotheses, which a fortiori includes h_S. Thus,
we will bound the probability that some h ∈ H would be consistent and have error
more than ε:

    Pr[∃h ∈ H : R̂_S(h) = 0 ∧ R(h) > ε]
        = Pr[(R̂_S(h_1) = 0 ∧ R(h_1) > ε) ∨ (R̂_S(h_2) = 0 ∧ R(h_2) > ε) ∨ ···]
        ≤ Σ_{h∈H} Pr[R̂_S(h) = 0 ∧ R(h) > ε]          (union bound)
        ≤ Σ_{h∈H} Pr[R̂_S(h) = 0 | R(h) > ε].         (definition of conditional probability)
Now, consider any hypothesis h ∈ H with R(h) > ε. Then, the probability that h
would be consistent on a training sample S drawn i.i.d., that is, that it would have
no error on any point in S, can be bounded as:

    Pr[R̂_S(h) = 0 | R(h) > ε] ≤ (1 − ε)^m.
The previous inequality implies

    Pr[∃h ∈ H : R̂_S(h) = 0 ∧ R(h) > ε] ≤ |H| (1 − ε)^m.

Setting the right-hand side to be equal to δ and solving for ε concludes the proof.
The theorem shows that when the hypothesis set H is finite, a consistent algorithm
A is a PAC-learning algorithm, since the sample complexity given by (2.8) is
dominated by a polynomial in 1/ε and 1/δ. As shown by (2.9), the generalization
error of consistent hypotheses is upper bounded by a term that decreases as
a function of the sample size m. This is a general fact: as expected, learning
algorithms benefit from larger labeled training samples. The decrease rate of O(1/m)
guaranteed by this theorem, however, is particularly favorable.
The price to pay for coming up with a consistent algorithm is the use of a
larger hypothesis set H containing target concepts. Of course, the upper bound
(2.9) increases with |H|. However, that dependency is only logarithmic. Note that
the term log|H|, or the related term log₂|H| from which it differs by a constant
factor, can be interpreted as the number of bits needed to represent H. Thus, the
generalization guarantee of the theorem is controlled by the ratio of this number of
bits, log₂|H|, and the sample size m.
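As a quick numerical illustration of (2.8) and (2.9), the following sketch (the helper names are my own, not from the text) computes the sample complexity and the generalization bound for a finite hypothesis set:

    import math

    def sample_complexity(h_size, eps, delta):
        # Smallest integer m satisfying (2.8): m >= (1/eps)(log|H| + log(1/delta)).
        return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

    def generalization_bound(h_size, m, delta):
        # Upper bound on R(h_S) from (2.9), valid with probability at least 1 - delta.
        return (math.log(h_size) + math.log(1 / delta)) / m

    # Squaring |H| only doubles the log|H| term, illustrating the logarithmic dependency:
    print(sample_complexity(2 ** 20, 0.05, 0.01))   # |H| = 2^20
    print(sample_complexity(2 ** 40, 0.05, 0.01))   # |H| = 2^40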
We now use theorem 2.1 to analyze PAC-learning with various concept classes.
Example 2.2 Conjunction of Boolean literals
Consider learning the concept class C_n of conjunctions of at most n Boolean literals
x_1, ..., x_n. A Boolean literal is either a variable x_i, i ∈ [1, n], or its negation x̄_i. For
n = 4, an example is the conjunction x_1 ∧ x̄_2 ∧ x_4, where x̄_2 denotes the negation
of the Boolean variable x_2. (1, 0, 0, 1) is a positive example for this concept, while
(1, 0, 0, 0) is a negative example.
Observe that for n = 4, a positive example (1, 0, 1, 0) implies that the target
concept cannot contain the literals x̄_1 and x̄_3 and that it cannot contain the literals
x_2 and x_4. In contrast, a negative example is not as informative, since it is not
known which of its n bits are incorrect. A simple algorithm for finding a consistent
hypothesis is thus based on positive examples and consists of the following: for each
positive example (b_1, ..., b_n) and each i ∈ [1, n], if b_i = 1 then x̄_i is ruled out as a
possible literal in the concept, and if b_i = 0 then x_i is ruled out. The conjunction of
all the literals not ruled out is thus a hypothesis consistent with the target. Figure 2.4
shows an example training sample as well as a consistent hypothesis for the case
n = 6; a short implementation of this algorithm is sketched at the end of this example.
    0 1 1 0 1 1   +
    0 1 1 1 1 1   +
    0 0 1 1 0 1   −
    0 1 1 1 1 1   +
    1 0 0 1 1 0   −
    0 1 0 0 1 1   +
    0 1 ? ? 1 1

Figure 2.4: Each of the first six rows of the table represents a training example, with
its label, + or −, indicated in the last column. The last row contains 0 (respectively
1) in column i ∈ [1, 6] if the ith entry is 0 (respectively 1) for all the positive examples.
It contains "?" if both 0 and 1 appear as an ith entry for some positive example.
Thus, for this training sample, the hypothesis returned by the consistent algorithm
described in the text is x̄_1 ∧ x_2 ∧ x_5 ∧ x_6.

We have |H| = |C_n| = 3^n, since each literal can be included positively, included with
negation, or not included. Plugging this into the sample complexity bound for
consistent hypotheses yields the following sample complexity bound, for any ε > 0
and δ > 0:

    m ≥ (1/ε) ((log 3) n + log(1/δ)).        (2.10)
Thus, the class of conjunctions of at most n Boolean literals is PAC-learnable. Note
that the computational complexity is also polynomial, since the training cost per
example is in O(n). For δ = 0.02, ε = 0.1, and n = 10, the bound becomes m ≥ 149.
Thus, for a labeled sample of at least 149 examples, the bound guarantees 90%
accuracy with a confidence of at least 98%.
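The consistent algorithm described above is simple enough to state in a few lines. Below is a minimal sketch (the function and variable names are my own) that reproduces the hypothesis of figure 2.4 and checks the m ≥ 149 figure from (2.10):

    import math

    def learn_conjunction(sample, n):
        # Start with all 2n literals possible; each positive example (b_1,...,b_n)
        # rules out the negated literal where b_i = 1 and the plain literal where b_i = 0.
        plain = [True] * n    # x_i still possible
        negated = [True] * n  # the negation of x_i still possible
        for x, label in sample:
            if label == 1:
                for i, b in enumerate(x):
                    if b == 1:
                        negated[i] = False
                    else:
                        plain[i] = False
        return plain, negated

    # Training sample of figure 2.4 (n = 6); at least one positive example present.
    sample = [
        ((0, 1, 1, 0, 1, 1), 1),
        ((0, 1, 1, 1, 1, 1), 1),
        ((0, 0, 1, 1, 0, 1), 0),
        ((0, 1, 1, 1, 1, 1), 1),
        ((1, 0, 0, 1, 1, 0), 0),
        ((0, 1, 0, 0, 1, 1), 1),
    ]
    plain, negated = learn_conjunction(sample, 6)
    literals = [f"x{i+1}" for i in range(6) if plain[i]] + \
               [f"not x{i+1}" for i in range(6) if negated[i]]
    print(" and ".join(literals))  # x2 and x5 and x6 and not x1

    # Sample size prescribed by (2.10) for n = 10, eps = 0.1, delta = 0.02:
    print(math.ceil((10 * math.log(3) + math.log(1 / 0.02)) / 0.1))  # 149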
Example 2.3 Universal concept class
Consider the set X = {0, 1}^n of all Boolean vectors with n components, and let U_n
be the concept class formed by all subsets of X. Is this concept class PAC-learnable?
To guarantee a consistent hypothesis, the hypothesis set must include the concept
class, thus |H| ≥ |U_n| = 2^(2^n). Theorem 2.1 gives the following sample complexity
bound:

    m ≥ (1/ε) ((log 2) 2^n + log(1/δ)).        (2.11)
Here, the number of training samples required is exponential in n, which is the cost
of the representation of a point in X. Thus, PAC-learning is not guaranteed by
the theorem. In fact, it is not hard to show that this universal concept class is not
PAC-learnable.
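For contrast with the conjunction class, the bound (2.11) can be evaluated numerically; the figures below are a quick illustration, not from the text:

    import math

    # Bound (2.11) for the universal class over {0,1}^n, with eps = 0.1, delta = 0.02.
    for n in (5, 10, 20):
        m = math.ceil((2 ** n * math.log(2) + math.log(50)) / 0.1)
        print(n, m)
    # The required sample size roughly doubles with each increment of n;
    # n = 10 already demands more than 7,000 examples.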
Example 2.4 k-term DNF formulae
A disjunctive normal form (DNF) formula is a formula written as the disjunction of
several terms, each term being a conjunction of Boolean literals. A k-term DNF is a
DNF formula defined by the disjunction of k terms, each term being a conjunction
of at most n Boolean literals. Thus, for k = 2 and n = 3, an example of a k-term
DNF is (x_1 ∧ x̄_2 ∧ x_3) ∨ (x̄_1 ∧ x_3).
Is the class C of k-term DNF formulae PAC-learnable? The cardinality of the
class is 3^(nk), since each term is a conjunction of at most n variables and there are
3^n such conjunctions, as seen previously. The hypothesis set H must contain C for
consistency to be possible, thus |H| ≥ 3^(nk). Theorem 2.1 gives the following sample
complexity bound:

    m ≥ (1/ε) ((log 3) nk + log(1/δ)),        (2.12)

which is polynomial. However, it can be shown that learning k-term DNF formulae
does not admit a randomized polynomial-time solution unless RP = NP, where RP is
the complexity class of problems that admit a randomized polynomial-time decision
solution; and RP = NP is commonly conjectured not to be the case. Thus, while the
sample size needed for learning k-term DNF formulae is only polynomial, efficient
PAC-learning of this class is not possible unless RP = NP.
Example 2.5 k-CNF formulae
A conjunctive normal form (CNF) formula is a conjunction of disjunctions. A k-
CNF formula is an expression of the form T_1 ∧ ··· ∧ T_j, with arbitrary length j ∈ ℕ
and with each term T_i being a disjunction of at most k Boolean literals.
The problem of learning k-CNF formulae can be reduced to that of learning
conjunctions of Boolean literals, which, as seen previously, is a PAC-learnable
concept class. To do so, it suffices to associate a new variable to each possible term,
via the following bijection:

    (u_1 ∨ ··· ∨ u_k)  →  Y_{u_1,...,u_k},        (2.13)

where u_1, ..., u_k denote the literals appearing in the term. This reduction to PAC-
learning of conjunctions of Boolean literals may affect the original distribution, but
this is not an issue, since in the PAC framework no assumption is made about the
distribution. Thus, the PAC-learnability of conjunctions of Boolean literals implies
that of k-CNF formulae.
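As a concrete, admittedly brute-force rendering of this reduction, the sketch below (an illustration with assumed names, not code from the text) enumerates all disjunctions of at most k literals over n variables and re-encodes each example as one Boolean feature per disjunction, after which a conjunction learner can be run on the transformed data:

    import itertools

    def all_terms(n, k):
        # All disjunctions of at most k literals over x_1, ..., x_n.
        # A term is a tuple of (variable index, sign) pairs, with sign 1 for x_i
        # and sign 0 for its negation; there are O((2n)^k) of them.
        terms = []
        for size in range(1, k + 1):
            for idx in itertools.combinations(range(n), size):
                for signs in itertools.product((0, 1), repeat=size):
                    terms.append(tuple(zip(idx, signs)))
        return terms

    def transform(x, terms):
        # One new Boolean variable Y per term: its value on x is the
        # truth value of the corresponding disjunction.
        return [int(any(x[i] == s for i, s in t)) for t in terms]

    terms = all_terms(4, 2)
    print(len(terms))                          # 32 terms for n = 4, k = 2
    print(transform((1, 0, 1, 1), terms)[:8])  # first few transformed features

Any k-CNF formula over the x variables then becomes a conjunction over the new Y variables, so the learner and the sample complexity bound of example 2.2 apply directly in the transformed feature space.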
This is a surprising result, however, since any k-term DNF formula can be written
as a k-CNF formula. Indeed, by distributivity, a k-term DNF can be rewritten as
a k-CNF formula via

    ∨_{i=1}^{k} (u_{i,1} ∧ ··· ∧ u_{i,n}) = ∧_{j_1,...,j_k ∈ [1,n]} (u_{1,j_1} ∨ ··· ∨ u_{k,j_k}),

where u_{i,j} denotes the jth literal of the ith term.
To illustrate this rewriting in a specific case, observe, for example, that

    (u_1 ∧ u_2 ∧ u_3) ∨ (v_1 ∧ v_2 ∧ v_3) = ∧_{i,j=1}^{3} (u_i ∨ v_j).
But, as we previously saw, k-term DNF formulae are not efficiently PAC-learnable!
What can explain this apparent inconsistency? Observe that the number of new
variables needed to write a k-term DNF as a k-CNF formula via the transformation
just described is exponential in k: it is in O(n^k). The discrepancy comes from the size
of the representation of a concept. A k-term DNF formula can be an exponentially
more compact representation, and efficient PAC-learning is intractable if a time
complexity polynomial in that size is required. Thus, this apparent paradox deals
with key aspects of PAC-learning, which include the cost of the representation of a
concept and the choice of the hypothesis set.
2.3 Guarantees for finite hypothesis sets — inconsistent case
In the most general case, there may be no hypothesis in H consistent with the
labeled training sample. This, in fact, is the typical case in practice, where the
learning problems may be somewhat difficult or the concept classes more complex
than the hypothesis set used by the learning algorithm. However, inconsistent
hypotheses with a small number of errors on the training sample can be useful and,
as we shall see, can benefit from favorable guarantees under some assumptions. This
section presents learning guarantees precisely for this inconsistent case and finite
hypothesis sets.

To derive learning guarantees in this more general setting, we will use Hoeffding's
inequality (theorem D.1) or the following corollary, which relates the generalization
error and empirical error of a single hypothesis.
Corollary 2.1
Fix ε > 0 and let S denote an i.i.d. sample of size m. Then, for any hypothesis
h: X → {0, 1}, the following inequalities hold:

    Pr_{S∼D^m}[R̂_S(h) − R(h) ≥ ε] ≤ exp(−2mε²),        (2.14)
    Pr_{S∼D^m}[R̂_S(h) − R(h) ≤ −ε] ≤ exp(−2mε²).       (2.15)
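It may help to see the bound (2.14) checked numerically. The following sketch is illustrative only: it assumes a single fixed hypothesis whose true error is some value p, estimates the deviation probability by simulation, and compares it with exp(−2mε²):

    import math
    import random

    def hoeffding_check(p=0.3, m=200, eps=0.1, trials=20000, seed=0):
        # The empirical error of a fixed hypothesis with true error p is a
        # Binomial(m, p) average; compare Pr[R_hat - R >= eps] to the bound (2.14).
        rng = random.Random(seed)
        exceed = sum(
            (sum(rng.random() < p for _ in range(m)) / m - p) >= eps
            for _ in range(trials)
        )
        return exceed / trials, math.exp(-2 * m * eps ** 2)

    empirical, bound = hoeffding_check()
    print(empirical, bound)  # the empirical frequency stays below the bound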