Machine Learning Fundamentals: From Beginner to Job Seeker (机器学习基础从入门到求职) [PDF]
http://www.100md.com
November 18, 2020
See the attached PDF (5,523 KB, 427 pages).
Machine Learning Fundamentals: From Beginner to Job Seeker aims to help job seekers quickly get started with and master the core foundational knowledge of machine learning, lowering the cost of learning and saving time.

Editor's Recommendation

Intended audience: beginner and intermediate machine learning readers, and candidates applying for machine learning positions.

The book does not, however, teach how to hunt for a job or how to interview. Instead, it collects the questions commonly asked for machine-learning algorithm positions in the preface; after reading it, readers can judge for themselves which areas need strengthening and then study the relevant chapters, which brings them to the starting line of a machine learning job search.
About the Book

This book is a theory-plus-practice guide to machine learning algorithms, organized into six parts: machine learning fundamentals, regression models, classification models, clustering models, dimensionality reduction models, and deep learning models. The fundamentals part, Chapters 1 and 2, introduces the theoretical and engineering foundations of machine learning. Chapter 3 covers regression models: building the model, choosing the learning strategy, and solving the optimization problem, ending with a house-price prediction case study built on three common linear regression models.

Chapters 4 through 11 detail several common classification models: naive Bayes, K-nearest neighbors, decision trees, logistic regression, support vector machines, random forests, AdaBoost, and boosting trees; each comes with a fairly detailed derivation and a practical application case. Chapter 12 systematically introduces five common clustering models: K-Means, hierarchical, density-based, spectral, and Gaussian mixture clustering, with a fairly detailed account of each model's principles, strengths and weaknesses, and engineering practice. Chapter 13 systematically introduces four common dimensionality reduction methods: singular value decomposition, principal component analysis, linear discriminant analysis, and locally linear embedding, again with detailed theoretical derivations and analysis. The final two chapters cover word vector models and deep neural networks: the former details the derivation and application of Word2Vec and Doc2Vec, and the latter systematically covers the foundations of deep learning.

The book suits students, job seekers, and working professionals interested in artificial intelligence and machine learning, as well as readers from other fields who want to use machine learning as a tool (basic calculus, linear algebra, and probability suffice); the preface offers concrete guidance for deciding whether the book fits you.
60 Questions for the Machine Learning Job Search

Type 1: Basic Concepts

Question 1: Overfitting and underfitting (what are their definitions, causes, and remedies?).
Question 2: L1 and L2 regularization (what are the common regularization methods? What does each do, and how do they differ? Why does adding a regularization term prevent overfitting?).
Question 3: Model variance and bias (can you explain variance and bias in machine learning? Which models reduce variance, and which reduce bias? Give examples).
Question 4: Occam's razor (explain the Occam's razor principle in machine learning).
Question 5: Model evaluation metrics (what are the common evaluation metrics for regression and classification models, and what does each mean? Explain AUC. Which metrics have you used in practice, and why did you choose them?).
Question 6: Risk functions (explain empirical risk and structural risk, their meanings, and how they are similar and different).
Question 7: Optimization algorithms (what are the common optimization algorithms in machine learning? Derive gradient descent and Newton's method).
Question 8: Activation functions (what activation functions are commonly used in neural networks? Describe the characteristics of each).
Question 9: Kernel functions (what is the definition and purpose of a kernel function? What are the common kernels, and which have you used? Explain the role of the parameter in the Gaussian kernel).
Question 10: Vanishing and exploding gradients (explain the vanishing and exploding gradient problems and the remedies for each).
Question 11: Supervised and unsupervised learning (describe the characteristics of supervised and unsupervised learning, with examples).
Question 12: Generative and discriminative models (do you know generative and discriminative models? What characterizes each? Which models are generative, and which are discriminative?).
Type 2: Model Principles

Question 13: Linear regression (the principle of the linear regression model, its loss function, and its regularization terms).
Question 14: KNN (the principle of the K-nearest-neighbor model, its three key elements, optimization schemes, and the model's strengths and weaknesses).
Question 15: Naive Bayes (derivation of the naive Bayes model, Laplace smoothing, the meaning of maximizing the posterior probability, and the model's strengths and weaknesses).
Question 16: Decision trees (the principle of decision tree models, feature evaluation criteria, the pruning process and its rationale, several common decision tree models, and their respective strengths and weaknesses).
Question 17: Random forests (the basic principle of the RF model and its two sources of "randomness". Discuss RF's strengths and weaknesses from the perspective of bias and variance, and the difference between RF and gradient boosted trees).
Question 18: AdaBoost (derivation of the AdaBoost model, a bias-and-variance view of AdaBoost, and its strengths and weaknesses).
Question 19: Gradient boosted trees (derivation of the GBDT model, how GBDT is used for feature combination, and its strengths and weaknesses).
Question 20: XGBoost (the basic principle of XGBoost, the similarities and differences between XGBoost and GBDT, and XGBoost's strengths and weaknesses).
Question 21: Logistic regression (the principle and essence of the LR model and its loss function; can squared loss be used, and why or why not?).
Question 22: Support vector machines (the principle of the SVM model. What is a "support vector"? Why use Lagrangian duality? Explain the KKT conditions, and how soft-margin and hard-margin SVMs are similar and different. How does an SVM achieve nonlinear classification? What kernels are commonly used with SVMs? What are the SVM model's strengths and weaknesses?).
Question 23: K-Means clustering (what are the process and principle of K-Means clustering? What optimization schemes exist, and what are their respective strengths and weaknesses?).
Question 24: Hierarchical clustering (its process, principle, strengths, and weaknesses).
Question 25: Density-based clustering (its basic principle, strengths, and weaknesses).
Question 26: Spectral clustering (its basic principle, strengths, and weaknesses).
Question 27: Gaussian mixture clustering (its principle, strengths, and weaknesses).
Question 28: The EM algorithm (its derivation and application scenarios).
Question 29: Eigendecomposition and singular value decomposition (their principles, similarities and differences, and application scenarios).
Question 30: Principal component analysis (the principle, procedure, and application scenarios of PCA).
Question 31: Linear discriminant analysis (the principle, procedure, and application scenarios of LDA).
Question 32: Locally linear embedding (the principle, procedure, and application scenarios of LLE).
Question 33: Word vectors (the variants of the Word2Vec and Doc2Vec models, the derivation of each, their applications, and parameter tuning).
Question 34: Deep neural networks (the principle of deep neural networks, the derivation of backpropagation, commonly used activation functions, how to address vanishing and exploding gradients, and Dropout, early stopping, and regularization in deep networks).
Type 3: Model Comparisons

Question 35: Similarities and differences between the LR and SVM models.
Question 36: Similarities and differences between the LR and naive Bayes models.
Question 37: Similarities and differences between the KNN and K-Means models.
Question 38: Similarities and differences among the ID3, C4.5, and CART models.
Question 39: Similarities and differences between the PCA and LDA models.
Question 40: Similarities and differences between Bagging and Boosting.
Question 41: Similarities and differences between the GBDT and XGBoost models.
Question 42: Similarities and differences between the CBOW and Skip-gram modes of Word2Vec.
Question 43: Similarities and differences between the Word2Vec and Doc2Vec models.
Type 4: Modeling Techniques

Question 44: Parameter tuning (pick any of the models above and describe how you would tune it and the tuning process).
Question 45: Feature combination (what are the common ways to combine features, and what characterizes each?).
Question 46: Feature engineering (explain your understanding of feature engineering, drawing on your own practice).
Question 47: Missing values (describe a missing-value problem you have encountered. What handling methods do you know? Which have you used, and how well did they work?).
Question 48: Class imbalance (do you know the class imbalance problem? How have you handled it, and with what results? Beyond oversampling and undersampling, can you design a more novel approach of your own?).
Question 49: Feature selection (what are the common approaches to feature selection? Drawing on your own experience, explain the principle and characteristics of each).
Question 50: Model selection (how do you usually choose an appropriate model? Can you give a concrete example?).
Question 51: Model combination (what model combination methods do you know? Besides applying models such as AdaBoost and RF, have you combined models yourself using Bagging or Embedding? Illustrate with a concrete example).
Question 52: A/B testing (are you familiar with A/B testing? Why is it used?).
Question 53: Dimensionality reduction (why reduce dimensionality? What reduction methods do you know, and which have you used? Illustrate with actual usage).
Question 54: Projects (what relevant projects have you done? Pick the one that impressed you most and walk through it).
Question 55: Pitfalls (what pitfalls have you hit when using machine learning models, and how did you eventually resolve them?).
Type 5: Job-Search Skills

Question 56: What should you prepare for a machine learning job search, and how should you prepare each part?
Question 57: What should you study for machine learning? How should the learning path be set, and what study materials are recommended?
Question 58: What are the application channels for machine learning positions? When is the best time to apply, and how should you choose target companies?
Question 59: What should a resume for a machine learning position contain, and how should you describe your projects?
Question 60: How should you deliver a self-introduction in an interview? How do you keep the right mindset, and how do you cope with pressure? How do you take the initiative during an interview? How do you answer the interviewer's closing "do you have any questions for me?", and how do you handle the final HR interview?
Foundations of Machine Learning

Adaptive Computation and Machine Learning series
Thomas Dietterich, Editor
Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns, Associate Editors

A complete list of books published in the Adaptive Computation and Machine Learning series appears at the back of this book.

Foundations of Machine Learning
Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar

The MIT Press
Cambridge, Massachusetts
London, England

© 2012 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

MIT Press books may be purchased at special quantity discounts for business or sales promotional use. For information, please email special_sales@mitpress.mit.edu or write to Special Sales Department, The MIT Press, 55 Hayward Street, Cambridge, MA 02142.

This book was set in LaTeX by the authors. Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data

Mohri, Mehryar.
Foundations of machine learning / Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar.
p. cm. - (Adaptive computation and machine learning series)
Includes bibliographical references and index.
ISBN 978-0-262-01825-8 (hardcover : alk. paper) 1. Machine learning. 2. Computer algorithms. I. Rostamizadeh, Afshin. II. Talwalkar, Ameet. III. Title.
Q325.5.M64 2012
006.3'1-dc23
2012007249

10 9 8 7 6 5 4 3 2 1

Contents
Preface

1 Introduction
1.1 Applications and problems
1.2 Definitions and terminology
1.3 Cross-validation
1.4 Learning scenarios
1.5 Outline

2 The PAC Learning Framework
2.1 The PAC learning model
2.2 Guarantees for finite hypothesis sets — consistent case
2.3 Guarantees for finite hypothesis sets — inconsistent case
2.4 Generalities
2.4.1 Deterministic versus stochastic scenarios
2.4.2 Bayes error and noise
2.4.3 Estimation and approximation errors
2.4.4 Model selection
2.5 Chapter notes
2.6 Exercises

3 Rademacher Complexity and VC-Dimension
3.1 Rademacher complexity
3.2 Growth function
3.3 VC-dimension
3.4 Lower bounds
3.5 Chapter notes
3.6 Exercises

4 Support Vector Machines
4.1 Linear classification
4.2 SVMs — separable case
4.2.1 Primal optimization problem
4.2.2 Support vectors
4.2.3 Dual optimization problem
4.2.4 Leave-one-out analysis
4.3 SVMs — non-separable case
4.3.1 Primal optimization problem
4.3.2 Support vectors
4.3.3 Dual optimization problem
4.4 Margin theory
4.5 Chapter notes
4.6 Exercises

5 Kernel Methods
5.1 Introduction
5.2 Positive definite symmetric kernels
5.2.1 Definitions
5.2.2 Reproducing kernel Hilbert space
5.2.3 Properties
5.3 Kernel-based algorithms
5.3.1 SVMs with PDS kernels
5.3.2 Representer theorem
5.3.3 Learning guarantees
5.4 Negative definite symmetric kernels
5.5 Sequence kernels
5.5.1 Weighted transducers
5.5.2 Rational kernels
5.6 Chapter notes
5.7 Exercises

6 Boosting
6.1 Introduction
6.2 AdaBoost
6.2.1 Bound on the empirical error
6.2.2 Relationship with coordinate descent
6.2.3 Relationship with logistic regression
6.2.4 Standard use in practice
6.3 Theoretical results
6.3.1 VC-dimension-based analysis
6.3.2 Margin-based analysis
6.3.3 Margin maximization
6.3.4 Game-theoretic interpretation
6.4 Discussion
6.5 Chapter notes
6.6 Exercises

7 On-Line Learning
7.1 Introduction
7.2 Prediction with expert advice
7.2.1 Mistake bounds and Halving algorithm
7.2.2 Weighted majority algorithm
7.2.3 Randomized weighted majority algorithm
7.2.4 Exponential weighted average algorithm
7.3 Linear classification
7.3.1 Perceptron algorithm
7.3.2 Winnow algorithm
7.4 On-line to batch conversion
7.5 Game-theoretic connection
7.6 Chapter notes
7.7 Exercises

8 Multi-Class Classification
8.1 Multi-class classification problem
8.2 Generalization bounds
8.3 Uncombined multi-class algorithms
8.3.1 Multi-class SVMs
8.3.2 Multi-class boosting algorithms
8.3.3 Decision trees
8.4 Aggregated multi-class algorithms
8.4.1 One-versus-all
8.4.2 One-versus-one
8.4.3 Error-correction codes
8.5 Structured prediction algorithms
8.6 Chapter notes
8.7 Exercises

9 Ranking
9.1 The problem of ranking
9.2 Generalization bound
9.3 Ranking with SVMs
9.4 RankBoost
9.4.1 Bound on the empirical error
9.4.2 Relationship with coordinate descent
9.4.3 Margin bound for ensemble methods in ranking
9.5 Bipartite ranking
9.5.1 Boosting in bipartite ranking
9.5.2 Area under the ROC curve
9.6 Preference-based setting
9.6.1 Second-stage ranking problem
9.6.2 Deterministic algorithm
9.6.3 Randomized algorithm
9.6.4 Extension to other loss functions
9.7 Discussion
9.8 Chapter notes
9.9 Exercises

10 Regression
10.1 The problem of regression
10.2 Generalization bounds
10.2.1 Finite hypothesis sets
10.2.2 Rademacher complexity bounds
10.2.3 Pseudo-dimension bounds
10.3 Regression algorithms
10.3.1 Linear regression
10.3.2 Kernel ridge regression
10.3.3 Support vector regression
10.3.4 Lasso
10.3.5 Group norm regression algorithms
10.3.6 On-line regression algorithms
10.4 Chapter notes
10.5 Exercises

11 Algorithmic Stability
11.1 Definitions
11.2 Stability-based generalization guarantee
11.3 Stability of kernel-based regularization algorithms
11.3.1 Application to regression algorithms: SVR and KRR
11.3.2 Application to classification algorithms: SVMs
11.3.3 Discussion
11.4 Chapter notes
11.5 Exercises

12 Dimensionality Reduction
12.1 Principal Component Analysis
12.2 Kernel Principal Component Analysis (KPCA)
12.3 KPCA and manifold learning
12.3.1 Isomap
12.3.2 Laplacian eigenmaps
12.3.3 Locally linear embedding (LLE)
12.4 Johnson-Lindenstrauss lemma
12.5 Chapter notes
12.6 Exercises

13 Learning Automata and Languages
13.1 Introduction
13.2 Finite automata
13.3 Efficient exact learning
13.3.1 Passive learning
13.3.2 Learning with queries
13.3.3 Learning automata with queries
13.4 Identification in the limit
13.4.1 Learning reversible automata
13.5 Chapter notes
13.6 Exercises

14 Reinforcement Learning
14.1 Learning scenario
14.2 Markov decision process model
14.3 Policy
14.3.1 Definition
14.3.2 Policy value
14.3.3 Policy evaluation
14.3.4 Optimal policy
14.4 Planning algorithms
14.4.1 Value iteration
14.4.2 Policy iteration
14.4.3 Linear programming
14.5 Learning algorithms
14.5.1 Stochastic approximation
14.5.2 TD(0) algorithm
14.5.3 Q-learning algorithm
14.5.4 SARSA
14.5.5 TD(λ) algorithm
14.5.6 Large state space
14.6 Chapter notes

Conclusion

A Linear Algebra Review
A.1 Vectors and norms
A.1.1 Norms
A.1.2 Dual norms
A.2 Matrices
A.2.1 Matrix norms
A.2.2 Singular value decomposition
A.2.3 Symmetric positive semidefinite (SPSD) matrices

B Convex Optimization
B.1 Differentiation and unconstrained optimization
B.2 Convexity
B.3 Constrained optimization
B.4 Chapter notes

C Probability Review
C.1 Probability
C.2 Random variables
C.3 Conditional probability and independence
C.4 Expectation, Markov's inequality, and moment-generating function
C.5 Variance and Chebyshev's inequality

D Concentration inequalities
D.1 Hoeffding's inequality
D.2 McDiarmid's inequality
D.3 Other inequalities
D.3.1 Binomial distribution: Slud's inequality
D.3.2 Normal distribution: tail bound
D.3.3 Khintchine-Kahane inequality
D.4 Chapter notes
D.5 Exercises

E Notation

References

Index

Preface
This book is a general introduction to machine learning that can serve as a textbook for students and researchers in the field. It covers fundamental modern topics in machine learning while providing the theoretical basis and conceptual tools needed for the discussion and justification of algorithms. It also describes several key aspects of the application of these algorithms.

We have aimed to present the most novel theoretical tools and concepts while giving concise proofs, even for relatively advanced results. In general, whenever possible, we have chosen to favor succinctness. Nevertheless, we discuss some crucial complex topics arising in machine learning and highlight several open research questions. Certain topics often merged with others or treated with insufficient attention are discussed separately here and with more emphasis: for example, a different chapter is reserved for multi-class classification, ranking, and regression.

Although we cover a very wide variety of important topics in machine learning, we have chosen to omit a few important ones, including graphical models and neural networks, both for the sake of brevity and because of the current lack of solid theoretical guarantees for some methods.

The book is intended for students and researchers in machine learning, statistics and other related areas. It can be used as a textbook for both graduate and advanced undergraduate classes in machine learning or as a reference text for a research seminar. The first three chapters of the book lay the theoretical foundation for the subsequent material. Other chapters are mostly self-contained, with the exception of chapter 5, which introduces some concepts that are extensively used in later ones. Each chapter concludes with a series of exercises, with full solutions presented separately.

The reader is assumed to be familiar with basic concepts in linear algebra, probability, and analysis of algorithms. However, to further help him, we present in the appendix a concise linear algebra and a probability review, and a short introduction to convex optimization. We have also collected in the appendix a number of useful tools for concentration bounds used in this book.
To our knowledge, there is no single textbook covering all of the material presented here. The need for a unified presentation has been pointed out to us every year by our machine learning students. There are several good books for various specialized areas, but these books do not include a discussion of other fundamental topics in a general manner. For example, books about kernel methods do not include a discussion of other fundamental topics such as boosting, ranking, reinforcement learning, learning automata or online learning. There also exist more general machine learning books, but the theoretical foundation of our book and our emphasis on proofs make our presentation quite distinct.

Most of the material presented here takes its origins in a machine learning graduate course (Foundations of Machine Learning) taught by the first author at the Courant Institute of Mathematical Sciences in New York University over the last seven years. This book has considerably benefited from the comments and suggestions from students in these classes, along with those of many friends, colleagues and researchers to whom we are deeply indebted.

We are particularly grateful to Corinna Cortes and Yishay Mansour who have both made a number of key suggestions for the design and organization of the material presented with detailed comments that we have fully taken into account and that have greatly improved the presentation. We are also grateful to Yishay Mansour for using a preliminary version of the book for teaching and for reporting his feedback to us.

We also thank for discussions, suggested improvement, and contributions of many kinds the following colleagues and friends from academic and corporate research laboratories: Cyril Allauzen, Stephen Boyd, Spencer Greenberg, Lisa Hellerstein, Sanjiv Kumar, Ryan McDonald, Andres Muñoz Medina, Tyler Neylon, Peter Norvig, Fernando Pereira, Maria Pershina, Ashish Rastogi, Michael Riley, Umar Syed, Csaba Szepesvári, Eugene Weinstein, and Jason Weston.

Finally, we thank the MIT Press publication team for their help and support in the development of this text.

1 Introduction
Machine learning can be broadly defined as computational methods using experience to improve performance or to make accurate predictions. Here, experience refers to the past information available to the learner, which typically takes the form of electronic data collected and made available for analysis. This data could be in the form of digitized human-labeled training sets, or other types of information obtained via interaction with the environment. In all cases, its quality and size are crucial to the success of the predictions made by the learner.

Machine learning consists of designing efficient and accurate prediction algorithms. As in other areas of computer science, some critical measures of the quality of these algorithms are their time and space complexity. But, in machine learning, we will additionally need a notion of sample complexity to evaluate the sample size required for the algorithm to learn a family of concepts. More generally, theoretical learning guarantees for an algorithm depend on the complexity of the concept classes considered and the size of the training sample.

Since the success of a learning algorithm depends on the data used, machine learning is inherently related to data analysis and statistics. More generally, learning techniques are data-driven methods combining fundamental concepts in computer science with ideas from statistics, probability and optimization.
1.1 Applications and problems
Learning algorithms have been successfully deployed in a variety of applications, including:

Text or document classification, e.g., spam detection;
Natural language processing, e.g., morphological analysis, part-of-speech tagging, statistical parsing, named-entity recognition;
Speech recognition, speech synthesis, speaker verification;
Optical character recognition (OCR);
Computational biology applications, e.g., protein function or structured prediction;
Computer vision tasks, e.g., image recognition, face detection;
Fraud detection (credit card, telephone) and network intrusion;
Games, e.g., chess, backgammon;
Unassisted vehicle control (robots, navigation);
Medical diagnosis;
Recommendation systems, search engines, information extraction systems.

This list is by no means comprehensive, and learning algorithms are applied to new applications every day. Moreover, such applications correspond to a wide variety of learning problems. Some major classes of learning problems are:
Classification: Assign a category to each item. For example, document classification may assign items with categories such as politics, business, sports, or weather, while image classification may assign items with categories such as landscape, portrait, or animal. The number of categories in such tasks is often relatively small, but can be large in some difficult tasks and even unbounded as in OCR, text classification, or speech recognition.

Regression: Predict a real value for each item. Examples of regression include prediction of stock values or variations of economic variables. In this problem, the penalty for an incorrect prediction depends on the magnitude of the difference between the true and predicted values, in contrast with the classification problem, where there is typically no notion of closeness between various categories.

Ranking: Order items according to some criterion. Web search, e.g., returning web pages relevant to a search query, is the canonical ranking example. Many other similar ranking problems arise in the context of the design of information extraction or natural language processing systems.

Clustering: Partition items into homogeneous regions. Clustering is often performed to analyze very large data sets. For example, in the context of social network analysis, clustering algorithms attempt to identify "communities" within large groups of people.

Dimensionality reduction or manifold learning: Transform an initial representation of items into a lower-dimensional representation of these items while preserving some properties of the initial representation. A common example involves preprocessing digital images in computer vision tasks.
The main practical objectives of machine learning consist of generating accurate predictions for unseen items and of designing efficient and robust algorithms to produce these predictions, even for large-scale problems. To do so, a number of algorithmic and theoretical questions arise. Some fundamental questions include: Which concept families can actually be learned, and under what conditions? How well can these concepts be learned computationally?

Figure 1.1 The zig-zag line on the left panel is consistent over the blue and red training sample, but it is a complex separation surface that is not likely to generalize well to unseen data. In contrast, the decision surface on the right panel is simpler and might generalize better in spite of its misclassification of a few points of the training sample.

1.2 Definitions and terminology
We will use the canonical problem of spam detection as a running example to illustrate some basic definitions and to describe the use and evaluation of machine learning algorithms in practice. Spam detection is the problem of learning to automatically classify email messages as either spam or non-spam.

Examples: Items or instances of data used for learning or evaluation. In our spam problem, these examples correspond to the collection of email messages we will use for learning and testing.

Features: The set of attributes, often represented as a vector, associated to an example. In the case of email messages, some relevant features may include the length of the message, the name of the sender, various characteristics of the header, the presence of certain keywords in the body of the message, and so on.

Labels: Values or categories assigned to examples. In classification problems, examples are assigned specific categories, for instance, the spam and non-spam categories in our binary classification problem. In regression, items are assigned real-valued labels.

Training sample: Examples used to train a learning algorithm. In our spam problem, the training sample consists of a set of email examples along with their associated labels. The training sample varies for different learning scenarios, as described in section 1.4.

Validation sample: Examples used to tune the parameters of a learning algorithm when working with labeled data. Learning algorithms typically have one or more free parameters, and the validation sample is used to select appropriate values for these model parameters.

Test sample: Examples used to evaluate the performance of a learning algorithm. The test sample is separate from the training and validation data and is not made available in the learning stage. In the spam problem, the test sample consists of a collection of email examples for which the learning algorithm must predict labels based on features. These predictions are then compared with the labels of the test sample to measure the performance of the algorithm.
Loss function: A function that measures the difference, or loss, between a predicted label and a true label. Denoting the set of all labels as $\mathcal{Y}$ and the set of possible predictions as $\mathcal{Y}'$, a loss function $L$ is a mapping $L \colon \mathcal{Y} \times \mathcal{Y}' \to \mathbb{R}_+$. In most cases, $\mathcal{Y}' = \mathcal{Y}$ and the loss function is bounded, but these conditions do not always hold. Common examples of loss functions include the zero-one (or misclassification) loss defined over $\{-1,+1\} \times \{-1,+1\}$ by $L(y, y') = 1_{y' \neq y}$ and the squared loss defined over $I \times I$ by $L(y, y') = (y' - y)^2$, where $I \subseteq \mathbb{R}$ is typically a bounded interval.
Hypothesis set: A set of functions mapping features (feature vectors) to the set of labels $\mathcal{Y}$. In our example, these may be a set of functions mapping email features to $\mathcal{Y} = \{\text{spam}, \text{non-spam}\}$. More generally, hypotheses may be functions mapping features to a different set $\mathcal{Y}'$. They could be linear functions mapping email feature vectors to real numbers interpreted as scores ($\mathcal{Y}' = \mathbb{R}$), with higher score values more indicative of spam than lower ones.
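As a small, hedged illustration of the two loss functions defined above (our own sketch, not code from this book), both can be written directly in Python:

```python
def zero_one_loss(y_true, y_pred):
    """Zero-one (misclassification) loss over {-1, +1}: 1 if the labels differ."""
    return 0.0 if y_pred == y_true else 1.0

def squared_loss(y_true, y_pred):
    """Squared loss over a real interval I: (y' - y)^2."""
    return (y_pred - y_true) ** 2

print(zero_one_loss(+1, -1))   # 1.0: the prediction disagrees with the label
print(zero_one_loss(+1, +1))   # 0.0
print(squared_loss(0.5, 0.2))  # 0.09 (up to floating-point rounding)
```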
We now define the learning stages of our spam problem. We start with a given collection of labeled examples. We first randomly partition the data into a training sample, a validation sample, and a test sample. The size of each of these samples depends on a number of different considerations. For example, the amount of data reserved for validation depends on the number of free parameters of the algorithm. Also, when the labeled sample is relatively small, the amount of training data is often chosen to be larger than that of test data since the learning performance directly depends on the training sample.

Next, we associate relevant features to the examples. This is a critical step in the design of machine learning solutions. Useful features can effectively guide the learning algorithm, while poor or uninformative ones can be misleading. Although it is critical, to a large extent, the choice of the features is left to the user. This choice reflects the user's prior knowledge about the learning task which in practice can have a dramatic effect on the performance results.
Now, we use the features selected to train our learning algorithm by fixing different values of its free parameters. For each value of these parameters, the algorithm selects a different hypothesis out of the hypothesis set. We choose among them the hypothesis resulting in the best performance on the validation sample. Finally, using that hypothesis, we predict the labels of the examples in the test sample. The performance of the algorithm is evaluated by using the loss function associated to the task, e.g., the zero-one loss in our spam detection task, to compare the predicted and true labels.

Thus, the performance of an algorithm is of course evaluated based on its test error and not its error on the training sample. A learning algorithm may be consistent, that is, it may commit no error on the examples of the training data, and yet have a poor performance on the test data. This occurs for consistent learners defined by very complex decision surfaces, as illustrated in figure 1.1, which tend to memorize a relatively small training sample instead of seeking to generalize well. This highlights the key distinction between memorization and generalization, which is the fundamental property sought for an accurate learning algorithm. Theoretical guarantees for consistent learners will be discussed in great detail in chapter 2.
1.3 Cross-validation
In practice, the amount of labeled data available is often too small to set aside a validation sample since that would leave an insufficient amount of training data. Instead, a widely adopted method known as n-fold cross-validation is used to exploit the labeled data both for model selection (selection of the free parameters of the algorithm) and for training.
Let $\theta$ denote the vector of free parameters of the algorithm. For a fixed value of $\theta$, the method consists of first randomly partitioning a given sample $S$ of $m$ labeled examples into $n$ subsamples, or folds. The $i$th fold is thus a labeled sample $((x_{i1}, y_{i1}), \ldots, (x_{im_i}, y_{im_i}))$ of size $m_i$. Then, for any $i \in [1, n]$, the learning algorithm is trained on all but the $i$th fold to generate a hypothesis $h_i$, and the performance of $h_i$ is tested on the $i$th fold, as illustrated in figure 1.2a. The parameter value $\theta$ is evaluated based on the average error of the hypotheses $h_i$, which is called the cross-validation error. This quantity is denoted by $\widehat{R}_{CV}(\theta)$ and defined by
\[ \widehat{R}_{CV}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \underbrace{\frac{1}{m_i} \sum_{j=1}^{m_i} L\big(h_i(x_{ij}), y_{ij}\big)}_{\text{error of } h_i \text{ on the } i\text{th fold}}. \]
The folds are generally chosen to have equal size, that is $m_i = m/n$ for all $i \in [1, n]$.
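As a hedged sketch of this procedure (ours, not the book's), the cross-validation error above can be computed as follows in Python; `train` and `loss` are hypothetical placeholders for the learning algorithm with $\theta$ fixed and for the task loss $L$:

```python
import random

def cross_validation_error(sample, n, train, loss, seed=0):
    """Estimate the n-fold cross-validation error: the average over the n folds
    of the error of the hypothesis h_i trained on the other n - 1 folds.
    `sample` is a list of (x, y) pairs; `train` maps such a list to a callable
    hypothesis; `loss` maps (prediction, true label) to a nonnegative real."""
    data = sample[:]
    random.Random(seed).shuffle(data)        # random partition into folds
    folds = [data[i::n] for i in range(n)]   # n folds of (roughly) equal size
    total = 0.0
    for i in range(n):
        held_out = folds[i]
        train_set = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        h_i = train(train_set)
        # average error of h_i on the i-th (held-out) fold
        total += sum(loss(h_i(x), y) for x, y in held_out) / len(held_out)
    return total / n
```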
How should n be chosen? The appropriate choice is subject to a trade-off and the topic of much learning theory research that we cannot address in this introductory chapter.

Figure 1.2 n-fold cross-validation. (a) Illustration of the partitioning of the training data into 5 folds. (b) Typical plot of a classifier's prediction error as a function of the size of the training sample: the error decreases as a function of the number of training points.
For a large $n$, each training sample used in n-fold cross-validation has size $m - m/n = m(1 - 1/n)$ (illustrated by the right vertical red line in figure 1.2b), which is close to $m$, the size of the full sample, but the training samples are quite similar. Thus, the method tends to have a small bias but a large variance. In contrast, smaller values of $n$ lead to more diverse training samples but their size (shown by the left vertical red line in figure 1.2b) is significantly less than $m$, thus the method tends to have a smaller variance but a larger bias.

In machine learning applications, $n$ is typically chosen to be 5 or 10. n-fold cross-validation is used as follows in model selection. The full labeled data is first split into a training and a test sample. The training sample of size $m$ is then used to compute the n-fold cross-validation error $\widehat{R}_{CV}(\theta)$ for a small number of possible values of $\theta$. $\theta$ is next set to the value $\theta_0$ for which $\widehat{R}_{CV}(\theta)$ is smallest and the algorithm is trained with the parameter setting $\theta_0$ over the full training sample of size $m$. Its performance is evaluated on the test sample as already described in the previous section.
The special case of n-fold cross-validation where $n = m$ is called leave-one-out cross-validation, since at each iteration exactly one instance is left out of the training sample. As shown in chapter 4, the average leave-one-out error is an approximately unbiased estimate of the average error of an algorithm and can be used to derive simple guarantees for some algorithms. In general, the leave-one-out error is very costly to compute, since it requires training $n$ times on samples of size $m - 1$, but for some algorithms it admits a very efficient computation (see exercise 10.9).

In addition to model selection, n-fold cross-validation is also commonly used for performance evaluation. In that case, for a fixed parameter setting $\theta$, the full labeled sample is divided into $n$ random folds with no distinction between training and test samples. The performance reported is the n-fold cross-validation error on the full sample as well as the standard deviation of the errors measured on each fold.
1.4 Learning scenarios
We next briefly describe common machine learning scenarios. These scenarios differ in the types of training data available to the learner, the order and method by which training data is received, and the test data used to evaluate the learning algorithm.

Supervised learning: The learner receives a set of labeled examples as training data and makes predictions for all unseen points. This is the most common scenario associated with classification, regression, and ranking problems. The spam detection problem discussed in the previous section is an instance of supervised learning.

Unsupervised learning: The learner exclusively receives unlabeled training data, and makes predictions for all unseen points. Since in general no labeled example is available in that setting, it can be difficult to quantitatively evaluate the performance of a learner. Clustering and dimensionality reduction are examples of unsupervised learning problems.

Semi-supervised learning: The learner receives a training sample consisting of both labeled and unlabeled data, and makes predictions for all unseen points. Semi-supervised learning is common in settings where unlabeled data is easily accessible but labels are expensive to obtain. Various types of problems arising in applications, including classification, regression, or ranking tasks, can be framed as instances of semi-supervised learning. The hope is that the distribution of unlabeled data accessible to the learner can help him achieve a better performance than in the supervised setting. The analysis of the conditions under which this can indeed be realized is the topic of much modern theoretical and applied machine learning research.
Transductive inference: As in the semi-supervised scenario, the learner receives a labeled training sample along with a set of unlabeled test points. However, the objective of transductive inference is to predict labels only for these particular test points. Transductive inference appears to be an easier task and matches the scenario encountered in a variety of modern applications. However, as in the semi-supervised setting, the assumptions under which a better performance can be achieved in this setting are research questions that have not been fully resolved.

On-line learning: In contrast with the previous scenarios, the on-line scenario involves multiple rounds, and training and testing phases are intermixed. At each round, the learner receives an unlabeled training point, makes a prediction, receives the true label, and incurs a loss. The objective in the on-line setting is to minimize the cumulative loss over all rounds. Unlike the previous settings just discussed, no distributional assumption is made in on-line learning. In fact, instances and their labels may be chosen adversarially within this scenario.

Reinforcement learning: The training and testing phases are also intermixed in reinforcement learning. To collect information, the learner actively interacts with the environment and in some cases affects the environment, and receives an immediate reward for each action. The objective of the learner is to maximize his reward over a course of actions and iterations with the environment. However, no long-term reward feedback is provided by the environment, and the learner is faced with the exploration versus exploitation dilemma, since he must choose between exploring unknown actions to gain more information versus exploiting the information already collected.

Active learning: The learner adaptively or interactively collects training examples, typically by querying an oracle to request labels for new points. The goal in active learning is to achieve a performance comparable to the standard supervised learning scenario, but with fewer labeled examples. Active learning is often used in applications where labels are expensive to obtain, for example computational biology applications.

In practice, many other intermediate and somewhat more complex learning scenarios may be encountered.
1.5 Outline
This book presents several fundamental and mathematically well-studied algorithms. It discusses in depth their theoretical foundations as well as their practical applications. The topics covered include:

Probably approximately correct (PAC) learning framework; learning guarantees for finite hypothesis sets;
Learning guarantees for infinite hypothesis sets, Rademacher complexity, VC-dimension;
Support vector machines (SVMs), margin theory;
Kernel methods, positive definite symmetric kernels, representer theorem, rational kernels;
Boosting, analysis of empirical error, generalization error, margin bounds;
On-line learning, mistake bounds, the weighted majority algorithm, the exponential weighted average algorithm, the Perceptron and Winnow algorithms;
Multi-class classification, multi-class SVMs, multi-class boosting, one-versus-all, one-versus-one, error-correction methods;
Ranking, ranking with SVMs, RankBoost, bipartite ranking, preference-based ranking;
Regression, linear regression, kernel ridge regression, support vector regression, Lasso;
Stability-based analysis, applications to classification and regression;
Dimensionality reduction, principal component analysis (PCA), kernel PCA, Johnson-Lindenstrauss lemma;
Learning automata and languages;
Reinforcement learning, Markov decision processes, planning and learning problems.

The analyses in this book are self-contained, with relevant mathematical concepts related to linear algebra, convex optimization, probability and statistics included in the appendix.

2 The PAC Learning Framework
Several fundamental questions arise when designing and analyzing algorithms that learn from examples: What can be learned efficiently? What is inherently hard to learn? How many examples are needed to learn successfully? Is there a general model of learning? In this chapter, we begin to formalize and address these questions by introducing the Probably Approximately Correct (PAC) learning framework. The PAC framework helps define the class of learnable concepts in terms of the number of sample points needed to achieve an approximate solution, sample complexity, and the time and space complexity of the learning algorithm, which depends on the cost of the computational representation of the concepts.

We first describe the PAC framework and illustrate it, then present some general learning guarantees within this framework when the hypothesis set used is finite, both for the consistent case where the hypothesis set used contains the concept to learn and for the opposite inconsistent case.
2.1 The PAC learning model
We first introduce several definitions and the notation needed to present the PAC model, which will also be used throughout much of this book.

We denote by $\mathcal{X}$ the set of all possible examples or instances. $\mathcal{X}$ is also sometimes referred to as the input space. The set of all possible labels or target values is denoted by $\mathcal{Y}$. For the purpose of this introductory chapter, we will limit ourselves to the case where $\mathcal{Y}$ is reduced to two labels, $\mathcal{Y} = \{0, 1\}$, so-called binary classification. Later chapters will extend these results to more general settings.

A concept $c \colon \mathcal{X} \to \mathcal{Y}$ is a mapping from $\mathcal{X}$ to $\mathcal{Y}$. Since $\mathcal{Y} = \{0, 1\}$, we can identify $c$ with the subset of $\mathcal{X}$ over which it takes the value 1. Thus, in the following, we equivalently refer to a concept to learn as a mapping from $\mathcal{X}$ to $\{0, 1\}$, or to a subset of $\mathcal{X}$. As an example, a concept may be the set of points inside a triangle or the indicator function of these points. In such cases, we will say in short that the concept to learn is a triangle. A concept class is a set of concepts we may wish to learn and is denoted by $C$. This could, for example, be the set of all triangles in the plane.
We assume that examples are independently and identically distributed (i.i.d.) according to some fixed but unknown distribution $D$. The learning problem is then formulated as follows. The learner considers a fixed set of possible concepts $H$, called a hypothesis set, which may not coincide with $C$. He receives a sample $S = (x_1, \ldots, x_m)$ drawn i.i.d. according to $D$ as well as the labels $(c(x_1), \ldots, c(x_m))$, which are based on a specific target concept $c \in C$ to learn. His task is to use the labeled sample $S$ to select a hypothesis $h_S \in H$ that has a small generalization error with respect to the concept $c$. The generalization error of a hypothesis $h \in H$, also referred to as the true error or just error of $h$, is denoted by $R(h)$ and defined as follows.[1]

Definition 2.1 Generalization error
Given a hypothesis $h \in H$, a target concept $c \in C$, and an underlying distribution $D$, the generalization error or risk of $h$ is defined by
\[ R(h) = \Pr_{x \sim D}[h(x) \neq c(x)] = \mathbb{E}_{x \sim D}\big[1_{h(x) \neq c(x)}\big], \tag{2.1} \]
where $1_\omega$ is the indicator function of the event $\omega$.[2]

The generalization error of a hypothesis is not directly accessible to the learner since both the distribution $D$ and the target concept $c$ are unknown. However, the learner can measure the empirical error of a hypothesis on the labeled sample $S$.
Definition 2.2 Empirical error
Given a hypothesis $h \in H$, a target concept $c \in C$, and a sample $S = (x_1, \ldots, x_m)$, the empirical error or empirical risk of $h$ is defined by
\[ \widehat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} 1_{h(x_i) \neq c(x_i)}. \tag{2.2} \]

Thus, the empirical error of $h \in H$ is its average error over the sample $S$, while the generalization error is its expected error based on the distribution $D$. We will see in this chapter and the following chapters a number of guarantees relating these two quantities with high probability, under some general assumptions. We can already note that for a fixed $h \in H$, the expectation of the empirical error based on an i.i.d. sample $S$ is equal to the generalization error:
\[ \mathbb{E}_{S \sim D^m}[\widehat{R}(h)] = R(h). \tag{2.3} \]
Indeed, by the linearity of the expectation and the fact that the sample is drawn i.i.d., we can write
\[ \mathbb{E}_{S \sim D^m}[\widehat{R}(h)] = \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{S \sim D^m}\big[1_{h(x_i) \neq c(x_i)}\big] = \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{S \sim D^m}\big[1_{h(x) \neq c(x)}\big], \]
for any $x$ in sample $S$. Thus,
\[ \mathbb{E}_{S \sim D^m}[\widehat{R}(h)] = \mathbb{E}_{S \sim D^m}\big[1_{h(x) \neq c(x)}\big] = \mathbb{E}_{x \sim D}\big[1_{h(x) \neq c(x)}\big] = R(h). \]

[1] The choice of $R$ instead of $E$ to denote an error avoids possible confusion with the notation for expectations and is further justified by the fact that the term risk is also used in machine learning and statistics to refer to an error.
[2] For this and other related definitions, the family of functions $H$ and the target concept $c$ must be measurable. The function classes we consider in this book all have this property.
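The identity (2.3) is easy to check numerically. The following small simulation (our own illustration, with an arbitrarily chosen distribution, target, and hypothesis) averages the empirical error over many i.i.d. samples and recovers the generalization error:

```python
import random

# Toy setup: X = [0, 1) with the uniform distribution, target c(x) = 1_{x < 0.5},
# and a fixed hypothesis h that disagrees with c exactly on [0.4, 0.6),
# so that R(h) = 0.2.
c = lambda x: x < 0.5
h = lambda x: x < 0.4 or 0.5 <= x < 0.6

rng = random.Random(0)
m, trials = 50, 20000
avg = 0.0
for _ in range(trials):
    S = [rng.random() for _ in range(m)]
    avg += sum(h(x) != c(x) for x in S) / m   # empirical error on one sample
print(avg / trials)  # close to the generalization error R(h) = 0.2
```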
The following introduces the Probably Approximately Correct (PAC) learning framework. We denote by $O(n)$ an upper bound on the cost of the computational representation of any element $x \in \mathcal{X}$ and by $\mathrm{size}(c)$ the maximal cost of the computational representation of $c \in C$. For example, $x$ may be a vector in $\mathbb{R}^n$, for which the cost of an array-based representation would be in $O(n)$.

Definition 2.3 PAC-learning
A concept class $C$ is said to be PAC-learnable if there exists an algorithm $A$ and a polynomial function $\mathrm{poly}(\cdot, \cdot, \cdot, \cdot)$ such that for any $\epsilon > 0$ and $\delta > 0$, for all distributions $D$ on $\mathcal{X}$ and for any target concept $c \in C$, the following holds for any sample size $m \ge \mathrm{poly}(1/\epsilon, 1/\delta, n, \mathrm{size}(c))$:
\[ \Pr_{S \sim D^m}[R(h_S) \le \epsilon] \ge 1 - \delta. \tag{2.4} \]
If $A$ further runs in $\mathrm{poly}(1/\epsilon, 1/\delta, n, \mathrm{size}(c))$, then $C$ is said to be efficiently PAC-learnable. When such an algorithm $A$ exists, it is called a PAC-learning algorithm for $C$.

A concept class $C$ is thus PAC-learnable if the hypothesis returned by the algorithm after observing a number of points polynomial in $1/\epsilon$ and $1/\delta$ is approximately correct (error at most $\epsilon$) with high probability (at least $1 - \delta$), which justifies the PAC terminology. $\delta > 0$ is used to define the confidence $1 - \delta$ and $\epsilon > 0$ the accuracy $1 - \epsilon$. Note that if the running time of the algorithm is polynomial in $1/\epsilon$ and $1/\delta$, then the sample size $m$ must also be polynomial if the full sample is received by the algorithm.
Several key points of the PAC definition are worth emphasizing. First, the PAC framework is a distribution-free model: no particular assumption is made about the distribution $D$ from which examples are drawn. Second, the training sample and the test examples used to define the error are drawn according to the same distribution $D$. This is a necessary assumption for generalization to be possible in most cases.

Figure 2.1 Target concept $R$ and possible hypothesis $R'$. Circles represent training instances. A blue circle is a point labeled with 1, since it falls within the rectangle $R$. Others are red and labeled with 0.

Finally, the PAC framework deals with the question of learnability for a concept class $C$ and not a particular concept. Note that the concept class $C$ is known to the algorithm, but of course the target concept $c \in C$ is unknown.

In many cases, in particular when the computational representation of the concepts is not explicitly discussed or is straightforward, we may omit the polynomial dependency on $n$ and $\mathrm{size}(c)$ in the PAC definition and focus only on the sample complexity.
We now illustrate PAC-learning with a specific learning problem.

Example 2.1 Learning axis-aligned rectangles
Consider the case where the set of instances are points in the plane, $\mathcal{X} = \mathbb{R}^2$, and the concept class $C$ is the set of all axis-aligned rectangles lying in $\mathbb{R}^2$. Thus, each concept $c$ is the set of points inside a particular axis-aligned rectangle. The learning problem consists of determining with small error a target axis-aligned rectangle using the labeled training sample. We will show that the concept class of axis-aligned rectangles is PAC-learnable.

Figure 2.1 illustrates the problem. $R$ represents a target axis-aligned rectangle and $R'$ a hypothesis. As can be seen from the figure, the error regions of $R'$ are formed by the area within the rectangle $R$ but outside the rectangle $R'$ and the area within $R'$ but outside the rectangle $R$. The first area corresponds to false negatives, that is, points that are labeled as 0 or negatively by $R'$, which are in fact positive or labeled with 1. The second area corresponds to false positives, that is, points labeled positively by $R'$ which are in fact negatively labeled.

To show that the concept class is PAC-learnable, we describe a simple PAC-learning algorithm $A$. Given a labeled sample $S$, the algorithm consists of returning the tightest axis-aligned rectangle $R' = R_S$ containing the points labeled with 1. Figure 2.2 illustrates the hypothesis returned by the algorithm. By definition, $R_S$ does not produce any false positive, since its points must be included in the target concept $R$. Thus, the error region of $R_S$ is included in $R$.
Figure 2.2 Illustration of the hypothesis $R' = R_S$ returned by the algorithm.
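This tightest-rectangle learner is straightforward to implement. The following Python sketch (our illustration, not the book's code) returns the hypothesis and uses it for prediction:

```python
def tightest_rectangle(sample):
    """Return the tightest axis-aligned rectangle (x_min, x_max, y_min, y_max)
    containing the points labeled 1, or None if there is no positive point.
    `sample` is a list of ((x, y), label) pairs with label in {0, 1}."""
    positives = [point for point, label in sample if label == 1]
    if not positives:
        return None          # degenerate hypothesis: predict 0 everywhere
    xs = [x for x, _ in positives]
    ys = [y for _, y in positives]
    return (min(xs), max(xs), min(ys), max(ys))

def predict(rect, point):
    """Label 1 iff the point falls within the learned rectangle R_S."""
    if rect is None:
        return 0
    x, y = point
    x_min, x_max, y_min, y_max = rect
    return int(x_min <= x <= x_max and y_min <= y <= y_max)
```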
Let $R \in C$ be a target concept. Fix $\epsilon > 0$. Let $\Pr[R]$ denote the probability mass of the region defined by $R$, that is, the probability that a point randomly drawn according to $D$ falls within $R$. Since errors made by our algorithm can be due only to points falling inside $R$, we can assume that $\Pr[R] > \epsilon$; otherwise, the error of $R_S$ is less than or equal to $\epsilon$ regardless of the training sample $S$ received.

Now, since $\Pr[R] > \epsilon$, we can define four rectangular regions $r_1, r_2, r_3$, and $r_4$ along the sides of $R$, each with probability at least $\epsilon/4$. These regions can be constructed by starting with the empty rectangle along a side and increasing its size until its distribution mass is at least $\epsilon/4$. Figure 2.3 illustrates the definition of these regions.

Observe that if $R_S$ meets all of these four regions, then, because it is a rectangle, it will have one side in each of these four regions (geometric argument). Its error area, which is the part of $R$ that it does not cover, is thus included in these regions and cannot have probability mass more than $\epsilon$. By contraposition, if $R(R_S) > \epsilon$, then $R_S$ must miss at least one of the regions $r_i$, $i \in [1, 4]$. As a result, we can write
\[ \Pr_{S \sim D^m}[R(R_S) > \epsilon] \le \Pr_{S \sim D^m}\Big[\bigcup_{i=1}^{4}\{R_S \cap r_i = \emptyset\}\Big] \tag{2.5} \]
\[ \le \sum_{i=1}^{4} \Pr_{S \sim D^m}[\{R_S \cap r_i = \emptyset\}] \quad \text{(by the union bound)} \]
\[ \le 4(1 - \epsilon/4)^m \quad \text{(since } \Pr[r_i] \ge \epsilon/4\text{)} \]
\[ \le 4\exp(-m\epsilon/4), \]
where for the last step we used the general identity $1 - x \le e^{-x}$ valid for all $x \in \mathbb{R}$. For any $\delta > 0$, to ensure that $\Pr_{S \sim D^m}[R(R_S) > \epsilon] \le \delta$, we can impose
\[ 4\exp(-m\epsilon/4) \le \delta \iff m \ge \frac{4}{\epsilon}\log\frac{4}{\delta}. \tag{2.6} \]
Thus, for any $\epsilon > 0$ and $\delta > 0$, if the sample size $m$ is greater than $\frac{4}{\epsilon}\log\frac{4}{\delta}$, then $\Pr_{S \sim D^m}[R(R_S) > \epsilon] \le \delta$. Furthermore, the computational cost of the representation of points in $\mathbb{R}^2$ and of axis-aligned rectangles, which can be defined by their four corners, is constant. This proves that the concept class of axis-aligned rectangles is PAC-learnable and that the sample complexity of PAC-learning axis-aligned rectangles is in $O(\frac{1}{\epsilon}\log\frac{1}{\delta})$.

Figure 2.3 Illustration of the regions $r_1, \ldots, r_4$.
An equivalent way to present sample complexity results like (2.6), which we will often see throughout this book, is to give a generalization bound. It states that with probability at least $1 - \delta$, $R(R_S)$ is upper bounded by some quantity that depends on the sample size $m$ and $\delta$. To obtain this, it suffices to set $\delta$ to be equal to the upper bound derived in (2.5), that is $\delta = 4\exp(-m\epsilon/4)$, and solve for $\epsilon$. This yields that with probability at least $1 - \delta$, the error of the algorithm is bounded as:
\[ R(R_S) \le \frac{4}{m}\log\frac{4}{\delta}. \tag{2.7} \]
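For a concrete sense of scale (our own numerical instantiation, not the book's), take $\epsilon = 0.1$ and $\delta = 0.02$ in (2.6): the bound requires
\[ m \ge \frac{4}{0.1}\log\frac{4}{0.02} = 40\log 200 \approx 40 \times 5.30 \approx 212 \]
labeled points (natural logarithm), after which $R(R_S) \le 0.1$ holds with probability at least $0.98$.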
Other PAC-learning algorithms could be considered for this example. One alternative is to return the largest axis-aligned rectangle not containing the negative points, for example. The proof of PAC-learning just presented for the tightest axis-aligned rectangle can be easily adapted to the analysis of other such algorithms.

Note that the hypothesis set $H$ we considered in this example coincided with the concept class $C$ and that its cardinality was infinite. Nevertheless, the problem admitted a simple proof of PAC-learning. We may then ask if a similar proof can readily apply to other similar concept classes. This is not as straightforward because the specific geometric argument used in the proof is key. It is non-trivial to extend the proof to other concept classes such as that of non-concentric circles (see exercise 2.4). Thus, we need a more general proof technique and more general results. The next two sections provide us with such tools in the case of a finite hypothesis set.

2.2 Guarantees for finite hypothesis sets — consistent case
In the example of axis-aligned rectangles that we examined, the hypothesis $h_S$ returned by the algorithm was always consistent, that is, it admitted no error on the training sample $S$. In this section, we present a general sample complexity bound, or equivalently, a generalization bound, for consistent hypotheses, in the case where the cardinality $|H|$ of the hypothesis set is finite. Since we consider consistent hypotheses, we will assume that the target concept $c$ is in $H$.

Theorem 2.1 Learning bounds — finite $H$, consistent case
Let $H$ be a finite set of functions mapping from $\mathcal{X}$ to $\mathcal{Y}$. Let $A$ be an algorithm that for any target concept $c \in H$ and i.i.d. sample $S$ returns a consistent hypothesis $h_S$: $\widehat{R}(h_S) = 0$. Then, for any $\epsilon, \delta > 0$, the inequality $\Pr_{S \sim D^m}[R(h_S) \le \epsilon] \ge 1 - \delta$ holds if
\[ m \ge \frac{1}{\epsilon}\Big(\log|H| + \log\frac{1}{\delta}\Big). \tag{2.8} \]
This sample complexity result admits the following equivalent statement as a generalization bound: for any $\epsilon, \delta > 0$, with probability at least $1 - \delta$,
\[ R(h_S) \le \frac{1}{m}\Big(\log|H| + \log\frac{1}{\delta}\Big). \tag{2.9} \]
Proof Fix $\epsilon > 0$. We do not know which consistent hypothesis $h_S \in H$ is selected by the algorithm $A$. This hypothesis further depends on the training sample $S$. Therefore, we need to give a uniform convergence bound, that is, a bound that holds for the set of all consistent hypotheses, which a fortiori includes $h_S$. Thus, we will bound the probability that some $h \in H$ would be consistent and have error more than $\epsilon$:
\[ \Pr[\exists h \in H \colon \widehat{R}(h) = 0 \land R(h) > \epsilon] \]
\[ = \Pr\big[(h_1 \in H, \widehat{R}(h_1) = 0 \land R(h_1) > \epsilon) \lor (h_2 \in H, \widehat{R}(h_2) = 0 \land R(h_2) > \epsilon) \lor \cdots\big] \]
\[ \le \sum_{h \in H} \Pr[\widehat{R}(h) = 0 \land R(h) > \epsilon] \quad \text{(union bound)} \]
\[ \le \sum_{h \in H} \Pr[\widehat{R}(h) = 0 \mid R(h) > \epsilon]. \quad \text{(definition of conditional probability)} \]
Now, consider any hypothesis $h \in H$ with $R(h) > \epsilon$. Then, the probability that $h$ would be consistent on a training sample $S$ drawn i.i.d., that is, that it would have no error on any point in $S$, can be bounded as:
\[ \Pr[\widehat{R}(h) = 0 \mid R(h) > \epsilon] \le (1 - \epsilon)^m. \]
The previous inequality implies
\[ \Pr[\exists h \in H \colon \widehat{R}(h) = 0 \land R(h) > \epsilon] \le |H|(1 - \epsilon)^m. \]
Setting the right-hand side to be equal to $\delta$ and solving for $\epsilon$ concludes the proof.
The theorem shows that when the hypothesis set $H$ is finite, a consistent algorithm $A$ is a PAC-learning algorithm, since the sample complexity given by (2.8) is dominated by a polynomial in $1/\epsilon$ and $1/\delta$. As shown by (2.9), the generalization error of consistent hypotheses is upper bounded by a term that decreases as a function of the sample size $m$. This is a general fact: as expected, learning algorithms benefit from larger labeled training samples. The decrease rate of $O(1/m)$ guaranteed by this theorem, however, is particularly favorable.

The price to pay for coming up with a consistent algorithm is the use of a larger hypothesis set $H$ containing target concepts. Of course, the upper bound (2.9) increases with $|H|$. However, that dependency is only logarithmic. Note that the term $\log|H|$, or the related term $\log_2|H|$ from which it differs by a constant factor, can be interpreted as the number of bits needed to represent $H$. Thus, the generalization guarantee of the theorem is controlled by the ratio of this number of bits, $\log_2|H|$, and the sample size $m$.
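The bound (2.8) is easy to evaluate numerically. As a small illustration (ours, not the book's), the following Python helper computes the sample size it prescribes:

```python
import math

def consistent_sample_complexity(h_size, epsilon, delta):
    """Smallest integer m satisfying the consistent-case bound (2.8):
    m >= (1/epsilon) * (log|H| + log(1/delta)), with natural logarithms."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Example: |H| = 2**20 hypotheses, target error 0.05 with confidence 0.99:
print(consistent_sample_complexity(2 ** 20, 0.05, 0.01))  # 370
```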
We now use theorem 2.1 to analyze PAC-learning with various concept classes.
Example 2.2 Conjunction of Boolean literals
Consider learning the concept class $C_n$ of conjunctions of at most $n$ Boolean literals $x_1, \ldots, x_n$. A Boolean literal is either a variable $x_i$, $i \in [1, n]$, or its negation $\bar{x}_i$. For $n = 4$, an example is the conjunction $x_1 \land \bar{x}_2 \land x_4$, where $\bar{x}_2$ denotes the negation of the Boolean literal $x_2$. $(1, 0, 0, 1)$ is a positive example for this concept while $(1, 0, 0, 0)$ is a negative example.

Observe that for $n = 4$, a positive example $(1, 0, 1, 0)$ implies that the target concept cannot contain the literals $\bar{x}_1$ and $\bar{x}_3$ and that it cannot contain the literals $x_2$ and $x_4$. In contrast, a negative example is not as informative since it is not known which of its $n$ bits are incorrect. A simple algorithm for finding a consistent hypothesis is thus based on positive examples and consists of the following: for each positive example $(b_1, \ldots, b_n)$ and $i \in [1, n]$, if $b_i = 1$ then $\bar{x}_i$ is ruled out as a possible literal in the concept class and if $b_i = 0$ then $x_i$ is ruled out. The conjunction of all the literals not ruled out is thus a hypothesis consistent with the target. Figure 2.4 shows an example training sample as well as a consistent hypothesis for the case $n = 6$.
We have |H| = |C_n| = 3^n, since each literal can be included positively, with negation, or not included. Plugging this into the sample complexity bound for consistent hypotheses yields the following sample complexity bound for any ε > 0 and δ > 0:

$$m \ge \frac{1}{\epsilon}\left((\log 3)\, n + \log\frac{1}{\delta}\right). \quad (2.10)$$

    0 1 1 0 1 1   +
    0 1 1 1 1 1   +
    0 0 1 1 0 1   −
    0 1 1 1 1 1   +
    1 0 0 1 1 0   −
    0 1 0 0 1 1   +
    0 1 ? ? 1 1

Figure 2.4 Each of the first six rows of the table represents a training example with its label, + or −, indicated in the last column. The last row contains 0 (respectively 1) in column i ∈ [1, 6] if the ith entry is 0 (respectively 1) for all the positive examples. It contains "?" if both 0 and 1 appear as an ith entry for some positive example. Thus, for this training sample, the hypothesis returned by the consistent algorithm described in the text is x̄1 ∧ x2 ∧ x5 ∧ x6.
Thus, the class of conjunctions of at most n Boolean literals is PAC-learnable. Note that the computational complexity is also polynomial, since the training cost per example is in O(n). For δ = 0.02, ε = 0.1, and n = 10, the bound becomes m ≥ 149. Thus, for a labeled sample of at least 149 examples, the bound guarantees 90% accuracy (ε = 0.1) with a confidence of at least 98%.
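The consistent algorithm just described is simple enough to implement directly. Here is a minimal sketch of our own (consistent_conjunction is a hypothetical name, not from the book), run on the positive examples of figure 2.4:

    def consistent_conjunction(positives):
        """Return a conjunction consistent with the positive examples (example 2.2).

        Literal x_i survives only if bit i is 1 in every positive example;
        its negation survives only if bit i is 0 in every one.
        """
        n = len(positives[0])
        literals = []
        for i in range(n):
            bits = {ex[i] for ex in positives}
            if bits == {"1"}:
                literals.append(f"x{i + 1}")
            elif bits == {"0"}:
                literals.append(f"~x{i + 1}")  # ~x stands for the negated literal
            # if both 0 and 1 occur, both literals are ruled out ("?" in figure 2.4)
        return " ^ ".join(literals)

    # Positive examples from figure 2.4 (n = 6):
    print(consistent_conjunction(["011011", "011111", "011111", "010011"]))
    # -> ~x1 ^ x2 ^ x5 ^ x6

As expected, it recovers the hypothesis x̄1 ∧ x2 ∧ x5 ∧ x6 of figure 2.4.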
Example 2.3 Universal concept class
Consider the set X = {0, 1}^n of all Boolean vectors with n components, and let U_n be the concept class formed by all subsets of X. Is this concept class PAC-learnable? To guarantee a consistent hypothesis the hypothesis class must include the concept class, thus |H| ≥ |U_n| = 2^{2^n}. Theorem 2.1 gives the following sample complexity bound:

$$m \ge \frac{1}{\epsilon}\left((\log 2)\, 2^n + \log\frac{1}{\delta}\right). \quad (2.11)$$

Here, the number of training samples required is exponential in n, which is the cost of the representation of a point in X. Thus, PAC-learning is not guaranteed by the theorem. In fact, it is not hard to show that this universal concept class is not PAC-learnable.
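To make the exponential dependence concrete, the right-hand side of (2.11) can be evaluated numerically; the sketch below is our own illustration, with arbitrary parameter values:

    import math

    def m_universal(n: int, eps: float, delta: float) -> float:
        """Right-hand side of (2.11) for the universal concept class over {0,1}^n."""
        return (math.log(2) * 2 ** n + math.log(1.0 / delta)) / eps

    print(m_universal(10, eps=0.1, delta=0.02))  # ~7.1e3 examples
    print(m_universal(50, eps=0.1, delta=0.02))  # ~7.8e15 examples: hopeless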
Example 2.4 k-term DNF formulae
A disjunctive normal form (DNF) formula is a formula written as the disjunction of several terms, each term being a conjunction of Boolean literals. A k-term DNF is a DNF formula defined by the disjunction of k terms, each term being a conjunction of at most n Boolean literals. Thus, for k = 2 and n = 3, an example of a k-term DNF is (x1 ∧ x2 ∧ x3) ∨ (x1 ∧ x3).
Is the class C of k-term DNF formulae PAC-learnable? The cardinality of the class is 3^{nk}, since each term is a conjunction of at most n variables and there are 3^n such conjunctions, as seen previously. The hypothesis set H must contain C for consistency to be possible, thus |H| ≥ 3^{nk}. Theorem 2.1 gives the following sample complexity bound:

$$m \ge \frac{1}{\epsilon}\left((\log 3)\, nk + \log\frac{1}{\delta}\right), \quad (2.12)$$

which is polynomial. However, it can be shown that learning k-term DNF formulae, using a hypothesis set that is itself the class of k-term DNF formulae, is computationally intractable unless RP = NP, where RP is the complexity class of problems that admit a randomized polynomial-time decision solution; this equality is commonly conjectured not to hold. Thus, while the sample size needed for learning k-term DNF formulae is only polynomial, efficient PAC-learning of this class is not possible unless RP = NP.
Example 2.5 k-CNF formulae
A conjunctive normal form (CNF) formula is a conjunction of disjunctions. A k-CNF formula is an expression of the form T1 ∧ ··· ∧ Tj with arbitrary length j ∈ N and with each term T_i being a disjunction of at most k Boolean attributes.
The problem of learning k-CNF formulae can be reduced to that of learning conjunctions of Boolean literals, which, as seen previously, is a PAC-learnable concept class. To do so, it suffices to associate to each term T_i a new variable. Then, this can be done with the following bijection:

$$a_i(x_1) \vee \cdots \vee a_i(x_n) \;\to\; Y_{a_i(x_1),\ldots,a_i(x_n)}, \quad (2.13)$$

where a_i(x_j) denotes the assignment to x_j in term T_i. This reduction to PAC-learning of conjunctions of Boolean literals may affect the original distribution, but this is not an issue since in the PAC framework no assumption is made about the distribution. Thus, the PAC-learnability of conjunctions of Boolean literals implies that of k-CNF formulae.
This is a surprising result, however, since any k-term DNF formula can be written as a k-CNF formula. Indeed, using associativity and the distributivity of ∨ over ∧, a k-term DNF can be rewritten as a k-CNF formula via

$$\bigvee_{i=1}^{k} \big(a_i(x_1) \wedge \cdots \wedge a_i(x_n)\big) \;=\; \bigwedge_{i_1,\ldots,i_k = 1}^{n} \big(a_1(x_{i_1}) \vee \cdots \vee a_k(x_{i_k})\big).$$

To illustrate this rewriting in a specific case, observe, for example, that

$$(u_1 \wedge u_2 \wedge u_3) \vee (v_1 \wedge v_2 \wedge v_3) \;=\; \bigwedge_{i,j=1}^{3} (u_i \vee v_j).$$
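This rewriting can be checked mechanically; the following brute-force truth-table comparison (our own sketch, not from the book) verifies the three-variable instance above:

    from itertools import product

    # Check (u1 ^ u2 ^ u3) v (v1 ^ v2 ^ v3) == AND over i,j of (u_i v v_j)
    # on all 2^6 truth assignments of the six Boolean variables.
    for bits in product([False, True], repeat=6):
        u, v = bits[:3], bits[3:]
        dnf = all(u) or all(v)
        cnf = all(ui or vj for ui in u for vj in v)
        assert dnf == cnf
    print("identity holds on all 64 assignments")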
But, as we previously saw, k-term DNF formulae are not efficiently PAC-learnable! What can explain this apparent inconsistency? Observe that the number of new variables needed to write a k-term DNF as a k-CNF formula via the transformation just described is exponential in k: it is in O(n^k). The discrepancy comes from the size of the representation of a concept. A k-term DNF formula can be an exponentially more compact representation, and efficient PAC-learning is intractable if a time complexity polynomial in that size is required. Thus, this apparent paradox deals with key aspects of PAC-learning, which include the cost of the representation of a concept and the choice of the hypothesis set.
2.3 Guarantees for finite hypothesis sets — inconsistent case
In the most general case, there may be no hypothesis in H consistent with the labeled training sample. This, in fact, is the typical case in practice, where the learning problems may be somewhat difficult or the concept classes more complex than the hypothesis set used by the learning algorithm. However, inconsistent hypotheses with a small number of errors on the training sample can be useful and, as we shall see, can benefit from favorable guarantees under some assumptions. This section presents learning guarantees precisely for this inconsistent case and finite hypothesis sets.
To derive learning guarantees in this more general setting, we will use Hoeffding's inequality (theorem D.1) or the following corollary, which relates the generalization error and empirical error of a single hypothesis.
Corollary 2.1
Fix ε > 0 and let S denote an i.i.d. sample of size m. Then, for any hypothesis h: X → {0, 1}, the following inequalities hold:

$$\Pr_{S \sim D^m}[\hat{R}(h) - R(h) \ge \epsilon] \le \exp(-2m\epsilon^2) \quad (2.14)$$

$$\Pr_{S \sim D^m}[\hat{R}(h) - R(h) \le -\epsilon] \le \exp(-2m\epsilon^2). \quad (2.15)$$
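The corollary is easy to illustrate by simulation; the sketch below (our own, with arbitrary parameter values) compares the observed frequency of the deviation event in (2.14) with the bound exp(−2mε²):

    import math
    import random

    def deviation_frequency(p=0.3, m=200, eps=0.05, trials=20000, seed=0):
        """Frequency of the event R_hat(h) - R(h) >= eps over simulated samples."""
        rng = random.Random(seed)
        hits = 0
        for _ in range(trials):
            errors = sum(rng.random() < p for _ in range(m))  # i.i.d. 0/1 mistakes
            if errors / m - p >= eps:
                hits += 1
        return hits / trials

    print(deviation_frequency())               # observed frequency, roughly 0.06
    print(math.exp(-2 * 200 * 0.05 ** 2))      # Hoeffding bound (2.14): ~0.37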
Thomas Dietterich, Editor
Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns, Associate Editors

A complete list of books published in The Adaptive Computation and Machine Learning series appears at the back of this book.

Foundations of Machine Learning
Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar

The MIT Press
Cambridge, Massachusetts
London, England

© 2012 Massachusetts Institute of Technology
All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.
MIT Press books may be purchased at special quantity discounts for business or sales promotional use. For information, please email special_sales@mitpress.mit.edu or write to Special Sales Department, The MIT Press, 55 Hayward Street, Cambridge, MA 02142.
This book was set in LaTeX by the authors. Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data
Mohri, Mehryar.
Foundations of machine learning / Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar.
p. cm. - (Adaptive computation and machine learning series)
Includes bibliographical references and index.
ISBN 978-0-262-01825-8 (hardcover : alk. paper) 1. Machine learning. 2. Computer algorithms. I. Rostamizadeh, Afshin. II. Talwalkar, Ameet. III. Title.
Q325.5.M64 2012
006.3'1-dc23
2012007249
10 9 8 7 6 5 4 3 2 1

Contents

Preface
1 Introduction
1.1 Applications and problems
1.2 Definitions and terminology
1.3 Cross-validation
1.4 Learning scenarios
1.5 Outline
2 The PAC Learning Framework
2.1 The PAC learning model
2.2 Guarantees for finite hypothesis sets — consistent case
2.3 Guarantees for finite hypothesis sets — inconsistent case
2.4 Generalities
2.4.1 Deterministic versus stochastic scenarios
2.4.2 Bayes error and noise
2.4.3 Estimation and approximation errors
2.4.4 Model selection
2.5 Chapter notes
2.6 Exercises
3 Rademacher Complexity and VC-Dimension
3.1 Rademacher complexity
3.2 Growth function
3.3 VC-dimension
3.4 Lower bounds
3.5 Chapter notes
3.6 Exercises
4 Support Vector Machines
4.1 Linear classification
4.2 SVMs — separable case
4.2.1 Primal optimization problem
4.2.2 Support vectors
4.2.3 Dual optimization problem
4.2.4 Leave-one-out analysis
4.3 SVMs — non-separable case
4.3.1 Primal optimization problem
4.3.2 Support vectors
4.3.3 Dual optimization problem
4.4 Margin theory
4.5 Chapter notes
4.6 Exercises
5 Kernel Methods
5.1 Introduction
5.2 Positive definite symmetric kernels
5.2.1 Definitions
5.2.2 Reproducing kernel Hilbert space
5.2.3 Properties
5.3 Kernel-based algorithms
5.3.1 SVMs with PDS kernels
5.3.2 Representer theorem
5.3.3 Learning guarantees
5.4 Negative definite symmetric kernels
5.5 Sequence kernels
5.5.1 Weighted transducers
5.5.2 Rational kernels
5.6 Chapter notes
5.7 Exercises
6 Boosting
6.1 Introduction
6.2 AdaBoost
6.2.1 Bound on the empirical error
6.2.2 Relationship with coordinate descent
6.2.3 Relationship with logistic regression
6.2.4 Standard use in practice
6.3 Theoretical results
6.3.1 VC-dimension-based analysis
6.3.2 Margin-based analysis
6.3.3 Margin maximization
6.3.4 Game-theoretic interpretation
6.4 Discussion
6.5 Chapter notes
6.6 Exercises
7 On-Line Learning
7.1 Introduction
7.2 Prediction with expert advice
7.2.1 Mistake bounds and Halving algorithm
7.2.2 Weighted majority algorithm
7.2.3 Randomized weighted majority algorithm
7.2.4 Exponential weighted average algorithm
7.3 Linear classification
7.3.1 Perceptron algorithm
7.3.2 Winnow algorithm
7.4 On-line to batch conversion
7.5 Game-theoretic connection
7.6 Chapter notes
7.7 Exercises
8 Multi-Class Classification
8.1 Multi-class classification problem
8.2 Generalization bounds
8.3 Uncombined multi-class algorithms
8.3.1 Multi-class SVMs
8.3.2 Multi-class boosting algorithms
8.3.3 Decision trees
8.4 Aggregated multi-class algorithms
8.4.1 One-versus-all
8.4.2 One-versus-one
8.4.3 Error-correction codes
8.5 Structured prediction algorithms
8.6 Chapter notes
8.7 Exercises
9 Ranking
9.1 The problem of ranking
9.2 Generalization bound
9.3 Ranking with SVMs
9.4 RankBoost
9.4.1 Bound on the empirical error
9.4.2 Relationship with coordinate descent
9.4.3 Margin bound for ensemble methods in ranking
9.5 Bipartite ranking
9.5.1 Boosting in bipartite ranking
9.5.2 Area under the ROC curve
9.6 Preference-based setting
9.6.1 Second-stage ranking problem
9.6.2 Deterministic algorithm
9.6.3 Randomized algorithm
9.6.4 Extension to other loss functions
9.7 Discussion
9.8 Chapter notes
9.9 Exercises
10 Regression
10.1 The problem of regression
10.2 Generalization bounds
10.2.1 Finite hypothesis sets
10.2.2 Rademacher complexity bounds
10.2.3 Pseudo-dimension bounds
10.3 Regression algorithms
10.3.1 Linear regression
10.3.2 Kernel ridge regression
10.3.3 Support vector regression
10.3.4 Lasso
10.3.5 Group norm regression algorithms
10.3.6 On-line regression algorithms
10.4 Chapter notes
10.5 Exercises
11 Algorithmic Stability
11.1 Definitions
11.2 Stability-based generalization guarantee
11.3 Stability of kernel-based regularization algorithms
11.3.1 Application to regression algorithms: SVR and KRR
11.3.2 Application to classification algorithms: SVMs
11.3.3 Discussion
11.4 Chapter notes
11.5 Exercises
12 Dimensionality Reduction
12.1 Principal Component Analysis
12.2 Kernel Principal Component Analysis (KPCA)
12.3 KPCA and manifold learning
12.3.1 Isomap
12.3.2 Laplacian eigenmaps
12.3.3 Locally linear embedding (LLE)
12.4 Johnson-Lindenstrauss lemma
12.5 Chapter notes
12.6 Exercises
13 Learning Automata and Languages
13.1 Introduction
13.2 Finite automata
13.3 Efficient exact learning
13.3.1 Passive learning
13.3.2 Learning with queries
13.3.3 Learning automata with queries
13.4 Identification in the limit
13.4.1 Learning reversible automata
13.5 Chapter notes
13.6 Exercises
14 Reinforcement Learning
14.1 Learning scenario
14.2 Markov decision process model
14.3 Policy
14.3.1 Definition
14.3.2 Policy value
14.3.3 Policy evaluation
14.3.4 Optimal policy
14.4 Planning algorithms
14.4.1 Value iteration
14.4.2 Policy iteration
14.4.3 Linear programming
14.5 Learning algorithms
14.5.1 Stochastic approximation
14.5.2 TD(0) algorithm
14.5.3 Q-learning algorithm
14.5.4 SARSA
14.5.5 TD(λ) algorithm
14.5.6 Large state space
14.6 Chapter notes
Conclusion
A Linear Algebra Review
A.1 Vectors and norms
A.1.1 Norms
A.1.2 Dual norms
A.2 Matrices
A.2.1 Matrix norms
A.2.2 Singular value decomposition
A.2.3 Symmetric positive semidefinite (SPSD) matrices
B Convex Optimization
B.1 Differentiation and unconstrained optimization
B.2 Convexity
B.3 Constrained optimization
B.4 Chapter notes
C Probability Review
C.1 Probability
C.2 Random variables
C.3 Conditional probability and independence
C.4 Expectation, Markov's inequality, and moment-generating function
C.5 Variance and Chebyshev's inequality
D Concentration inequalities
D.1 Hoeffding's inequality
D.2 McDiarmid's inequality
D.3 Other inequalities
D.3.1 Binomial distribution: Slud's inequality
D.3.2 Normal distribution: tail bound
D.3.3 Khintchine-Kahane inequality
D.4 Chapter notes
D.5 Exercises
E Notation
References
Index

Preface
This book is a general introduction to machine learning that can serve as a textbook for students and researchers in the field. It covers fundamental modern topics in machine learning while providing the theoretical basis and conceptual tools needed for the discussion and justification of algorithms. It also describes several key aspects of the application of these algorithms.
We have aimed to present the most novel theoretical tools and concepts while giving concise proofs, even for relatively advanced results. In general, whenever possible, we have chosen to favor succinctness. Nevertheless, we discuss some crucial complex topics arising in machine learning and highlight several open research questions. Certain topics often merged with others or treated with insufficient attention are discussed separately here and with more emphasis: for example, a different chapter is reserved for each of multi-class classification, ranking, and regression.
Although we cover a very wide variety of important topics in machine learning, we have chosen to omit a few important ones, including graphical models and neural networks, both for the sake of brevity and because of the current lack of solid theoretical guarantees for some methods.
The book is intended for students and researchers in machine learning, statistics and other related areas. It can be used as a textbook for both graduate and advanced undergraduate classes in machine learning or as a reference text for a research seminar. The first three chapters of the book lay the theoretical foundation for the subsequent material. Other chapters are mostly self-contained, with the exception of chapter 5, which introduces some concepts that are extensively used in later ones. Each chapter concludes with a series of exercises, with full solutions presented separately.
The reader is assumed to be familiar with basic concepts in linear algebra, probability, and analysis of algorithms. However, to further help him, we present in the appendix a concise linear algebra and a probability review, and a short introduction to convex optimization. We have also collected in the appendix a number of useful tools for concentration bounds used in this book.
To our knowledge, there is no single textbook covering all of the material presented here. The need for a unified presentation has been pointed out to us every year by our machine learning students. There are several good books for various specialized areas, but these books do not include a discussion of other fundamental topics in a general manner. For example, books about kernel methods do not include a discussion of other fundamental topics such as boosting, ranking, reinforcement learning, learning automata or on-line learning. There also exist more general machine learning books, but the theoretical foundation of our book and our emphasis on proofs make our presentation quite distinct.
Most of the material presented here takes its origins in a machine learning graduate course (Foundations of Machine Learning) taught by the first author at the Courant Institute of Mathematical Sciences in New York University over the last seven years. This book has considerably benefited from the comments and suggestions from students in these classes, along with those of many friends, colleagues and researchers to whom we are deeply indebted.
We are particularly grateful to Corinna Cortes and Yishay Mansour, who have both made a number of key suggestions for the design and organization of the material presented, with detailed comments that we have fully taken into account and that have greatly improved the presentation. We are also grateful to Yishay Mansour for using a preliminary version of the book for teaching and for reporting his feedback to us.
We also thank for discussions, suggested improvement, and contributions of many kinds the following colleagues and friends from academic and corporate research laboratories: Cyril Allauzen, Stephen Boyd, Spencer Greenberg, Lisa Hellerstein, Sanjiv Kumar, Ryan McDonald, Andres Muñoz Medina, Tyler Neylon, Peter Norvig, Fernando Pereira, Maria Pershina, Ashish Rastogi, Michael Riley, Umar Syed, Csaba Szepesvári, Eugene Weinstein, and Jason Weston.
Finally, we thank the MIT Press publication team for their help and support in the development of this text.

1 Introduction
Machine learning can be broadly defined as computational methods using experience to improve performance or to make accurate predictions. Here, experience refers to the past information available to the learner, which typically takes the form of electronic data collected and made available for analysis. This data could be in the form of digitized human-labeled training sets, or other types of information obtained via interaction with the environment. In all cases, its quality and size are crucial to the success of the predictions made by the learner.
Machine learning consists of designing efficient and accurate prediction algorithms. As in other areas of computer science, some critical measures of the quality of these algorithms are their time and space complexity. But, in machine learning, we will additionally need a notion of sample complexity to evaluate the sample size required for the algorithm to learn a family of concepts. More generally, theoretical learning guarantees for an algorithm depend on the complexity of the concept classes considered and the size of the training sample.
Since the success of a learning algorithm depends on the data used, machine learning is inherently related to data analysis and statistics. More generally, learning techniques are data-driven methods combining fundamental concepts in computer science with ideas from statistics, probability and optimization.
1.1 Applications and problems
Learning algorithms have been successfully deployed in a variety of applications, including:
Text or document classification, e.g., spam detection;
Natural language processing, e.g., morphological analysis, part-of-speech tagging, statistical parsing, named-entity recognition;
Speech recognition, speech synthesis, speaker verification;
Optical character recognition (OCR);
Computational biology applications, e.g., protein function or structured prediction;
Computer vision tasks, e.g., image recognition, face detection;
Fraud detection (credit card, telephone) and network intrusion;
Games, e.g., chess, backgammon;
Unassisted vehicle control (robots, navigation);
Medical diagnosis;
Recommendation systems, search engines, information extraction systems.
This list is by no means comprehensive, and learning algorithms are applied to new
applications every day. Moreover, such applications correspond to a wide variety of
learning problems. Some major classes of learning problems are:
Classification: Assign a category to each item. For example, document classification may assign items with categories such as politics, business, sports, or weather, while image classification may assign items with categories such as landscape, portrait, or animal. The number of categories in such tasks is often relatively small, but can be large in some difficult tasks and even unbounded as in OCR, text classification, or speech recognition.
Regression: Predict a real value for each item. Examples of regression include prediction of stock values or variations of economic variables. In this problem, the penalty for an incorrect prediction depends on the magnitude of the difference between the true and predicted values, in contrast with the classification problem, where there is typically no notion of closeness between various categories.
Ranking: Order items according to some criterion. Web search, e.g., returning web pages relevant to a search query, is the canonical ranking example. Many other similar ranking problems arise in the context of the design of information extraction or natural language processing systems.
Clustering: Partition items into homogeneous regions. Clustering is often performed to analyze very large data sets. For example, in the context of social network analysis, clustering algorithms attempt to identify "communities" within large groups of people.
Dimensionality reduction or manifold learning: Transform an initial representation of items into a lower-dimensional representation of these items while preserving some properties of the initial representation. A common example involves preprocessing digital images in computer vision tasks.
The main practical objectives of machine learning consist of generating accurate predictions for unseen items and of designing efficient and robust algorithms to produce these predictions, even for large-scale problems. To do so, a number of algorithmic and theoretical questions arise. Some fundamental questions include: Which concept families can actually be learned, and under what conditions? How well can these concepts be learned computationally?

Figure 1.1 The zig-zag line on the left panel is consistent over the blue and red training sample, but it is a complex separation surface that is not likely to generalize well to unseen data. In contrast, the decision surface on the right panel is simpler and might generalize better in spite of its misclassification of a few points of the training sample.
1.2 Definitions and terminology
We will use the canonical problem of spam detection as a running example to illustrate some basic definitions and to describe the use and evaluation of machine learning algorithms in practice. Spam detection is the problem of learning to automatically classify email messages as either spam or non-spam.
Examples: Items or instances of data used for learning or evaluation. In our spam problem, these examples correspond to the collection of email messages we will use for learning and testing.
Features: The set of attributes, often represented as a vector, associated to an example. In the case of email messages, some relevant features may include the length of the message, the name of the sender, various characteristics of the header, the presence of certain keywords in the body of the message, and so on.
Labels: Values or categories assigned to examples. In classification problems, examples are assigned specific categories, for instance, the spam and non-spam categories in our binary classification problem. In regression, items are assigned real-valued labels.
Training sample: Examples used to train a learning algorithm. In our spam problem, the training sample consists of a set of email examples along with their associated labels. The training sample varies for different learning scenarios, as described in section 1.4.
Validation sample: Examples used to tune the parameters of a learning algorithm when working with labeled data. Learning algorithms typically have one or more free parameters, and the validation sample is used to select appropriate values for these model parameters.
Test sample: Examples used to evaluate the performance of a learning algorithm.
The test sample is separate from the training and validation data and is not made
available in the learning stage. In the spam problem, the test sample consists of a
collection of email examples for which the learning algorithm must predict labels
based on features. These predictions are then compared with the labels of the test
sample to measure the performance of the algorithm.
Loss function: A function that measures the difference, or loss, between a predicted label and a true label. Denoting the set of all labels as Y and the set of possible predictions as Y′, a loss function L is a mapping L: Y × Y′ → R+. In most cases, Y′ = Y and the loss function is bounded, but these conditions do not always hold. Common examples of loss functions include the zero-one (or misclassification) loss defined over {−1, +1} × {−1, +1} by L(y, y′) = 1_{y′≠y} and the squared loss defined over I × I by L(y, y′) = (y′ − y)², where I ⊆ R is typically a bounded interval (both are spelled out in the short code sketch after this list).
Hypothesis set: A set of functions mapping features (feature vectors) to the set of labels Y. In our example, these may be a set of functions mapping email features to Y = {spam, non-spam}. More generally, hypotheses may be functions mapping features to a different set Y′. They could be linear functions mapping email feature vectors to real numbers interpreted as scores (Y′ = R), with higher score values more indicative of spam than lower ones.
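For concreteness, the two loss functions just mentioned can be written in a few lines of code; this is our own sketch, not code from the book:

    def zero_one_loss(y_pred: int, y_true: int) -> float:
        """Misclassification loss over {-1, +1}: 1 if the labels differ, else 0."""
        return float(y_pred != y_true)

    def squared_loss(y_pred: float, y_true: float) -> float:
        """Squared loss (y' - y)^2 over a bounded real interval."""
        return (y_pred - y_true) ** 2

    print(zero_one_loss(+1, -1), zero_one_loss(-1, -1))  # 1.0 0.0
    print(squared_loss(0.8, 1.0))                        # ~0.04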
We now define the learning stages of our spam problem. We start with a given collection of labeled examples. We first randomly partition the data into a training
sample, a validation sample, and a test sample. The size of each of these samples
depends on a number of different considerations. For example, the amount of data
reserved for validation depends on the number of free parameters of the algorithm.
Also, when the labeled sample is relatively small, the amount of training data is
often chosen to be larger than that of test data since the learning performance
directly depends on the training sample.
Next, we associate relevant features to the examples. This is a critical step in the design of machine learning solutions. Useful features can effectively guide the learning algorithm, while poor or uninformative ones can be misleading. Although it is critical, the choice of the features is, to a large extent, left to the user. This choice reflects the user's prior knowledge about the learning task, which in practice can have a dramatic effect on the performance results.
Now, we use the features selected to train our learning algorithm by fixing different values of its free parameters. For each value of these parameters, the algorithm selects a different hypothesis out of the hypothesis set. We choose among them the hypothesis resulting in the best performance on the validation sample. Finally, using that hypothesis, we predict the labels of the examples in the test sample. The performance of the algorithm is evaluated by using the loss function associated to the task, e.g., the zero-one loss in our spam detection task, to compare the predicted and true labels.
Thus, the performance of an algorithm is of course evaluated based on its test error and not its error on the training sample. A learning algorithm may be consistent, that is, it may commit no error on the examples of the training data, and yet have a poor performance on the test data. This occurs for consistent learners defined by very complex decision surfaces, as illustrated in figure 1.1, which tend to memorize a relatively small training sample instead of seeking to generalize well. This highlights the key distinction between memorization and generalization, which is the fundamental property sought for an accurate learning algorithm. Theoretical guarantees for consistent learners will be discussed in great detail in chapter 2.
1.3 Cross-validation
In practice, the amount of labeled data available is often too small to set aside a validation sample, since that would leave an insufficient amount of training data. Instead, a widely adopted method known as n-fold cross-validation is used to exploit the labeled data both for model selection (selection of the free parameters of the algorithm) and for training.
Let θ denote the vector of free parameters of the algorithm. For a fixed value of θ, the method consists of first randomly partitioning a given sample S of m labeled examples into n subsamples, or folds. The ith fold is thus a labeled sample ((x_{i1}, y_{i1}), ..., (x_{im_i}, y_{im_i})) of size m_i. Then, for any i ∈ [1, n], the learning algorithm is trained on all but the ith fold to generate a hypothesis h_i, and the performance of h_i is tested on the ith fold, as illustrated in figure 1.2a. The parameter value θ is evaluated based on the average error of the hypotheses h_i, which is called the cross-validation error. This quantity is denoted by R̂_CV(θ) and defined by

$$\hat{R}_{CV}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \underbrace{\frac{1}{m_i} \sum_{j=1}^{m_i} L(h_i(x_{ij}), y_{ij})}_{\text{error of } h_i \text{ on the } i\text{th fold}}.$$
The folds are generally chosen to have equal size, that is, m_i = m/n for all i ∈ [1, n]. How should n be chosen? The appropriate choice is subject to a trade-off and is the topic of much learning theory research that we cannot address in this introductory chapter. For a large n, each training sample used in n-fold cross-validation has size m − m/n = m(1 − 1/n) (illustrated by the right vertical red line in figure 1.2b), which is close to m, the size of the full sample, but the training samples are quite similar. Thus, the method tends to have a small bias but a large variance. In contrast, smaller values of n lead to more diverse training samples, but their size (shown by the left vertical red line in figure 1.2b) is significantly less than m; thus the method tends to have a smaller variance but a larger bias.

Figure 1.2 n-fold cross validation. (a) Illustration of the partitioning of the training data into 5 folds. (b) Typical plot of a classifier's prediction error as a function of the size of the training sample: the error decreases as a function of the number of training points.
In machine learning applications, n is typically chosen to be 5 or 10. n-fold cross-validation is used as follows in model selection. The full labeled data is first split into a training and a test sample. The training sample of size m is then used to compute the n-fold cross-validation error R̂_CV(θ) for a small number of possible values of θ. θ is next set to the value θ₀ for which R̂_CV(θ) is smallest, and the algorithm is trained with the parameter setting θ₀ over the full training sample of size m. Its performance is evaluated on the test sample as already described in the previous section.
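The procedure just described is compact enough to sketch in code. The following is our own illustration (cross_validation_error and the fit/loss callables are hypothetical names; the toy usage at the end is arbitrary):

    import random

    def cross_validation_error(train, n_folds, fit, loss, seed=0):
        """n-fold cross-validation error R_CV(theta) for one parameter setting.

        `train` is a list of (x, y) pairs, `fit` maps a list of training pairs
        to a hypothesis h (a callable), and `loss` maps (h(x), y) to a real.
        """
        data = list(train)
        random.Random(seed).shuffle(data)
        folds = [data[i::n_folds] for i in range(n_folds)]  # near-equal sizes m_i
        total = 0.0
        for i in range(n_folds):
            held_out = folds[i]
            rest = [ex for j in range(n_folds) if j != i for ex in folds[j]]
            h = fit(rest)
            total += sum(loss(h(x), y) for x, y in held_out) / len(held_out)
        return total / n_folds

    # Toy usage: a constant (mean) predictor under the squared loss.
    data = [(x / 10, 2 * x / 10) for x in range(50)]
    fit_mean = lambda s: (lambda x, c=sum(y for _, y in s) / len(s): c)
    print(cross_validation_error(data, 5, fit_mean, lambda p, y: (p - y) ** 2))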
The special case of n-fold cross validation where n = m is called leave-one-out cross-validation, since at each iteration exactly one instance is left out of the training sample. As shown in chapter 4, the average leave-one-out error is an approximately unbiased estimate of the average error of an algorithm and can be used to derive simple guarantees for some algorithms. In general, the leave-one-out error is very costly to compute, since it requires training n times on samples of size m − 1, but for some algorithms it admits a very efficient computation (see exercise 10.9).
In addition to model selection, n-fold cross validation is also commonly used for performance evaluation. In that case, for a fixed parameter setting θ, the full labeled sample is divided into n random folds with no distinction between training and test samples. The performance reported is the n-fold cross-validation error on the full sample as well as the standard deviation of the errors measured on each fold.
1.4 Learning scenarios
We next briefly describe common machine learning scenarios. These scenarios differ in the types of training data available to the learner, the order and method by which training data is received, and the test data used to evaluate the learning algorithm.
Supervised learning: The learner receives a set of labeled examples as training data and makes predictions for all unseen points. This is the most common scenario associated with classification, regression, and ranking problems. The spam detection problem discussed in the previous section is an instance of supervised learning.
Unsupervised learning: The learner exclusively receives unlabeled training data, and makes predictions for all unseen points. Since in general no labeled example is available in that setting, it can be difficult to quantitatively evaluate the performance of a learner. Clustering and dimensionality reduction are examples of unsupervised learning problems.
Semi-supervised learning: The learner receives a training sample consisting of both labeled and unlabeled data, and makes predictions for all unseen points. Semi-supervised learning is common in settings where unlabeled data is easily accessible but labels are expensive to obtain. Various types of problems arising in applications, including classification, regression, or ranking tasks, can be framed as instances of semi-supervised learning. The hope is that the distribution of unlabeled data accessible to the learner can help him achieve a better performance than in the supervised setting. The analysis of the conditions under which this can indeed be realized is the topic of much modern theoretical and applied machine learning research.
Transductive inference: As in the semi-supervised scenario, the learner receives
a labeled training sample along with a set of unlabeled test points. However, the
objective of transductive inference is to predict labels only for these particular test
points. Transductive inference appears to be an easier task and matches the scenario
encountered in a variety of modern applications. However, as in the semi-supervised
setting, the assumptions under which a better performance can be achieved in this
setting are research questions that have not been fully resolved.
On-line learning: In contrast with the previous scenarios, the on-line scenario involves multiple rounds, and training and testing phases are intermixed. At each
round, the learner receives an unlabeled training point, makes a prediction, receives
the true label, and incurs a loss. The objective in the on-line setting is to minimize
the cumulative loss over all rounds. Unlike the previous settings just discussed, no
distributional assumption is made in on-line learning. In fact, instances and their
labels may be chosen adversarially within this scenario.8 Introduction
Reinforcement learning: The training and testing phases are also intermixed in reinforcement learning. To collect information, the learner actively interacts with the environment and in some cases affects the environment, and receives an immediate reward for each action. The object of the learner is to maximize his reward over a course of actions and iterations with the environment. However, no long-term reward feedback is provided by the environment, and the learner is faced with the exploration versus exploitation dilemma, since he must choose between exploring unknown actions to gain more information versus exploiting the information already collected.
Active learning: The learner adaptively or interactively collects training examples, typically by querying an oracle to request labels for new points. The goal in active learning is to achieve a performance comparable to the standard supervised learning scenario, but with fewer labeled examples. Active learning is often used in applications where labels are expensive to obtain, for example, computational biology applications.
In practice, many other intermediate and somewhat more complex learning scenarios
may be encountered.
1.5 Outline
This book presents several fundamental and mathematically well-studied algo-
rithms. It discusses in depth their theoretical foundations as well as their practical
applications. The topics covered include:
Probably approximately correct (PAC) learning framework; learning guarantees for finite hypothesis sets;
Learning guarantees for infinite hypothesis sets, Rademacher complexity, VC-dimension;
Support vector machines (SVMs), margin theory;
Kernel methods, positive definite symmetric kernels, representer theorem, rational kernels;
Boosting, analysis of empirical error, generalization error, margin bounds;
On-line learning, mistake bounds, the weighted majority algorithm, the exponential weighted average algorithm, the Perceptron and Winnow algorithms;
Multi-class classification, multi-class SVMs, multi-class boosting, one-versus-all, one-versus-one, error-correction methods;
Ranking, ranking with SVMs, RankBoost, bipartite ranking, preference-based ranking;
Regression, linear regression, kernel ridge regression, support vector regression, Lasso;
Stability-based analysis, applications to classification and regression;
Dimensionality reduction, principal component analysis (PCA), kernel PCA, Johnson-Lindenstrauss lemma;
Learning automata and languages;
Reinforcement learning, Markov decision processes, planning and learning problems.
The analyses in this book are self-contained, with relevant mathematical concepts related to linear algebra, convex optimization, probability and statistics included in the appendix.

2 The PAC Learning Framework
Several fundamental questions arise when designing and analyzing algorithms that learn from examples: What can be learned efficiently? What is inherently hard to learn? How many examples are needed to learn successfully? Is there a general model of learning? In this chapter, we begin to formalize and address these questions by introducing the Probably Approximately Correct (PAC) learning framework. The PAC framework helps define the class of learnable concepts in terms of the number of sample points needed to achieve an approximate solution (the sample complexity) and the time and space complexity of the learning algorithm, which depends on the cost of the computational representation of the concepts.
We first describe the PAC framework and illustrate it, then present some general learning guarantees within this framework when the hypothesis set used is finite, both for the consistent case where the hypothesis set used contains the concept to learn and for the opposite inconsistent case.
2.1 The PAC learning model
We first introduce several definitions and the notation needed to present the PAC model, which will also be used throughout much of this book.
We denote by X the set of all possible examples or instances. X is also sometimes referred to as the input space. The set of all possible labels or target values is denoted by Y. For the purpose of this introductory chapter, we will limit ourselves to the case where Y is reduced to two labels, Y = {0, 1}, so-called binary classification. Later chapters will extend these results to more general settings.
A concept c: X → Y is a mapping from X to Y. Since Y = {0, 1}, we can identify c with the subset of X over which it takes the value 1. Thus, in the following, we equivalently refer to a concept to learn as a mapping from X to {0, 1}, or to a subset of X. As an example, a concept may be the set of points inside a triangle or the indicator function of these points. In such cases, we will say in short that the concept to learn is a triangle. A concept class is a set of concepts we may wish to learn and is denoted by C. This could, for example, be the set of all triangles in the plane.
We assume that examples are independently and identically distributed (i.i.d.) according to some fixed but unknown distribution D. The learning problem is then formulated as follows. The learner considers a fixed set of possible concepts H, called a hypothesis set, which may not coincide with C. He receives a sample S = (x1, ..., xm) drawn i.i.d. according to D as well as the labels (c(x1), ..., c(xm)), which are based on a specific target concept c ∈ C to learn. His task is to use the labeled sample S to select a hypothesis h_S ∈ H that has a small generalization error with respect to the concept c. The generalization error of a hypothesis h ∈ H, also referred to as the true error or just error of h, is denoted by R(h) and defined as follows.
Definition 2.1 Generalization error
Given a hypothesis h ∈ H, a target concept c ∈ C, and an underlying distribution D, the generalization error or risk of h is defined by

$$R(h) = \Pr_{x \sim D}[h(x) \ne c(x)] = \mathop{\mathbb{E}}_{x \sim D}\big[1_{h(x) \ne c(x)}\big], \quad (2.1)$$

where 1_ω is the indicator function of the event ω.
The generalization error of a hypothesis is not directly accessible to the learner
since both the distribution D and the target concept c are unknown. However, the
learner can measure the empirical error of a hypothesis on the labeled sample S.
Definition 2.2 Empirical error
Given a hypothesis h ∈ H, a target concept c ∈ C, and a sample S = (x1, ..., xm), the empirical error or empirical risk of h is defined by

$$\hat{R}(h) = \frac{1}{m}\sum_{i=1}^{m} 1_{h(x_i) \ne c(x_i)}. \quad (2.2)$$
Thus, the empirical error of h ∈ H is its average error over the sample S, while the generalization error is its expected error based on the distribution D. We will see in this chapter and the following chapters a number of guarantees relating these two quantities with high probability, under some general assumptions. We can already note that for a fixed h ∈ H, the expectation of the empirical error based on an i.i.d. sample S is equal to the generalization error:

$$\mathop{\mathbb{E}}_{S \sim D^m}[\hat{R}(h)] = R(h). \quad (2.3)$$

Indeed, by the linearity of the expectation and the fact that the sample is drawn i.i.d., we can write

$$\mathop{\mathbb{E}}_{S \sim D^m}[\hat{R}(h)] = \frac{1}{m}\sum_{i=1}^{m} \mathop{\mathbb{E}}_{S \sim D^m}\big[1_{h(x_i) \ne c(x_i)}\big] = \frac{1}{m}\sum_{i=1}^{m} \mathop{\mathbb{E}}_{S \sim D^m}\big[1_{h(x) \ne c(x)}\big],$$

for any fixed x in sample S. Thus,

$$\mathop{\mathbb{E}}_{S \sim D^m}[\hat{R}(h)] = \mathop{\mathbb{E}}_{S \sim D^m}\big[1_{h(x) \ne c(x)}\big] = \mathop{\mathbb{E}}_{x \sim D}\big[1_{h(x) \ne c(x)}\big] = R(h).$$

1. The choice of R instead of E to denote an error avoids possible confusions with the notation for expectations and is further justified by the fact that the term risk is also used in machine learning and statistics to refer to an error.
2. For this and other related definitions, the family of functions H and the target concept c must be measurable. The function classes we consider in this book all have this property.
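Equation (2.3) can also be checked by simulation: averaging the empirical error of a fixed hypothesis over many i.i.d. samples should recover its generalization error. The sketch below is our own, with R(h) = 0.2 chosen arbitrarily:

    import random

    # Simulate (2.3): average R_hat over many i.i.d. samples of size m and
    # compare with the true error R(h) (here R(h) = 0.2, an assumed value).
    rng = random.Random(1)
    R_true, m, n_samples = 0.2, 25, 100000
    avg = sum(sum(rng.random() < R_true for _ in range(m)) / m
              for _ in range(n_samples)) / n_samples
    print(avg)  # close to 0.2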
The following introduces the Probably Approximately Correct (PAC) learning framework. We denote by O(n) an upper bound on the cost of the computational representation of any element x ∈ X and by size(c) the maximal cost of the computational representation of c ∈ C. For example, x may be a vector in Rⁿ, for which the cost of an array-based representation would be in O(n).
Definition 2.3 PAC-learning
A concept class C is said to be PAC-learnable if there exists an algorithm A and a polynomial function poly(·, ·, ·, ·) such that for any ε > 0 and δ > 0, for all distributions D on X and for any target concept c ∈ C, the following holds for any sample size m ≥ poly(1/ε, 1/δ, n, size(c)):

$$\Pr_{S \sim D^m}[R(h_S) \le \epsilon] \ge 1 - \delta. \quad (2.4)$$

If A further runs in poly(1/ε, 1/δ, n, size(c)), then C is said to be efficiently PAC-learnable. When such an algorithm A exists, it is called a PAC-learning algorithm for C.
A concept class C is thus PAC-learnable if the hypothesis returned by the algorithm after observing a number of points polynomial in 1/ε and 1/δ is approximately correct (error at most ε) with high probability (at least 1 − δ), which justifies the PAC terminology. δ > 0 is used to define the confidence 1 − δ and ε > 0 the accuracy 1 − ε. Note that if the running time of the algorithm is polynomial in 1/ε and 1/δ, then the sample size m must also be polynomial if the full sample is received by the algorithm.
Several key points of the PAC definition are worth emphasizing. First, the PAC framework is a distribution-free model: no particular assumption is made about the distribution D from which examples are drawn. Second, the training sample and the test examples used to define the error are drawn according to the same distribution D. This is a necessary assumption for generalization to be possible in most cases.
Figure 2.1 Target concept R and possible hypothesis R′. Circles represent training instances. A blue circle is a point labeled with 1, since it falls within the rectangle R. Others are red and labeled with 0.
Finally, the PAC framework deals with the question of learnability for a concept class C and not a particular concept. Note that the concept class C is known to the algorithm, but of course the target concept c ∈ C is unknown.
In many cases, in particular when the computational representation of the concepts is not explicitly discussed or is straightforward, we may omit the polynomial dependency on n and size(c) in the PAC definition and focus only on the sample complexity.
We now illustrate PAC-learning with a specific learning problem.
Example 2.1 Learning axis-aligned rectangles
Consider the case where the set of instances are points in the plane, X = R², and the concept class C is the set of all axis-aligned rectangles lying in R². Thus, each concept c is the set of points inside a particular axis-aligned rectangle. The learning problem consists of determining with small error a target axis-aligned rectangle using the labeled training sample. We will show that the concept class of axis-aligned rectangles is PAC-learnable.
Figure 2.1 illustrates the problem. R represents a target axis-aligned rectangle and R′ a hypothesis. As can be seen from the figure, the error regions of R′ are formed by the area within the rectangle R but outside the rectangle R′ and the area within R′ but outside the rectangle R. The first area corresponds to false negatives, that is, points that are labeled as 0 or negatively by R′, which are in fact positive or labeled with 1. The second area corresponds to false positives, that is, points labeled positively by R′ which are in fact negatively labeled.
To show that the concept class is PAC-learnable, we describe a simple PAC-learning algorithm A. Given a labeled sample S, the algorithm consists of returning the tightest axis-aligned rectangle R′ = R_S containing the points labeled with 1. Figure 2.2 illustrates the hypothesis returned by the algorithm. By definition, R_S does not produce any false positive, since its points must be included in the target concept R. Thus, the error region of R_S is included in R.
Figure 2.2 Illustration of the hypothesis R′ = R_S returned by the algorithm.
Let R ∈ C be a target concept. Fix ε > 0. Let Pr[R] denote the probability mass of the region defined by R, that is, the probability that a point randomly drawn according to D falls within R. Since errors made by our algorithm can be due only to points falling inside R, we can assume that Pr[R] > ε; otherwise, the error of R_S is less than or equal to ε regardless of the training sample S received.
Now, since Pr[R] > ε, we can define four rectangular regions r1, r2, r3, and r4 along the sides of R, each with probability at least ε/4. These regions can be constructed by starting with the empty rectangle along a side and increasing its size until its distribution mass is at least ε/4. Figure 2.3 illustrates the definition of these regions.
Observe that if R_S meets all of these four regions, then, because it is a rectangle, it will have one side in each of these four regions (geometric argument). Its error area, which is the part of R that it does not cover, is thus included in these regions and cannot have probability mass more than ε. By contraposition, if R(R_S) > ε, then R_S must miss at least one of the regions r_i, i ∈ [1, 4]. As a result, we can write

$$\begin{aligned}
\Pr_{S \sim D^m}[R(R_S) > \epsilon] &\le \Pr_{S \sim D^m}\big[\cup_{i=1}^{4}\{R_S \cap r_i = \emptyset\}\big] \quad (2.5)\\
&\le \sum_{i=1}^{4} \Pr_{S \sim D^m}[\{R_S \cap r_i = \emptyset\}] && \text{(by the union bound)}\\
&\le 4(1 - \epsilon/4)^m && (\text{since } \Pr[r_i] \ge \epsilon/4)\\
&\le 4\exp(-m\epsilon/4),
\end{aligned}$$

where for the last step we used the general identity 1 − x ≤ e^{−x}, valid for all x ∈ R. For any δ > 0, to ensure that Pr_{S∼D^m}[R(R_S) > ε] ≤ δ, we can impose

$$4\exp(-m\epsilon/4) \le \delta \iff m \ge \frac{4}{\epsilon}\log\frac{4}{\delta}. \quad (2.6)$$

Thus, for any ε > 0 and δ > 0, if the sample size m is greater than (4/ε) log(4/δ), then Pr_{S∼D^m}[R(R_S) > ε] ≤ δ.
Furthermore, the computational cost of the representation of points in R² and axis-aligned rectangles, which can be defined by their four corners, is constant. This proves that the concept class of axis-aligned rectangles is PAC-learnable and that the sample complexity of PAC-learning axis-aligned rectangles is in O((1/ε) log(1/δ)).

Figure 2.3 Illustration of the regions r1, ..., r4.
An equivalent way to present sample complexity results like (2.6), which we will often see throughout this book, is to give a generalization bound. It states that with probability at least 1 − δ, R(R_S) is upper bounded by some quantity that depends on the sample size m and δ. To obtain this, it suffices to set δ to be equal to the upper bound derived in (2.5), that is δ = 4 exp(−mε/4), and solve for ε. This yields that with probability at least 1 − δ, the error of the algorithm is bounded as:

$$R(R_S) \le \frac{4}{m}\log\frac{4}{\delta}. \quad (2.7)$$
Other PAC-learning algorithms could be considered for this example. One alternative is to return the largest axis-aligned rectangle not containing the negative points, for example. The proof of PAC-learning just presented for the tightest axis-aligned rectangle can be easily adapted to the analysis of other such algorithms.
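The tightest-rectangle algorithm is also easy to simulate. In the sketch below (our own illustration: the target rectangle is arbitrary, and the uniform distribution on the unit square stands in for D), the estimated error of R_S decreases as the sample size m grows, as the bound (2.7) predicts:

    import random

    def learn_tightest_rectangle(sample):
        """Tightest axis-aligned rectangle containing the positively labeled points."""
        pos = [x for x, label in sample if label == 1]
        if not pos:
            return (1.0, 0.0, 1.0, 0.0)  # empty rectangle (contains no point)
        xs = [p[0] for p in pos]
        ys = [p[1] for p in pos]
        return (min(xs), max(xs), min(ys), max(ys))

    def inside(rect, p):
        x0, x1, y0, y1 = rect
        return x0 <= p[0] <= x1 and y0 <= p[1] <= y1

    target = (0.2, 0.7, 0.3, 0.9)                # target rectangle R (arbitrary)
    rng = random.Random(0)
    draw = lambda: (rng.random(), rng.random())  # D: uniform on the unit square

    for m in (20, 100, 500):
        S = [(p, int(inside(target, p))) for p in (draw() for _ in range(m))]
        h = learn_tightest_rectangle(S)
        test = [draw() for _ in range(20000)]
        err = sum(inside(target, p) != inside(h, p) for p in test) / len(test)
        print(m, round(err, 4))  # the estimated error R(R_S) shrinks with m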
Note that the hypothesis set H we considered in this example coincided with the concept class C and that its cardinality was infinite. Nevertheless, the problem admitted a simple proof of PAC-learning. We may then ask if a similar proof can readily apply to other similar concept classes. This is not as straightforward because the specific geometric argument used in the proof is key. It is non-trivial to extend the proof to other concept classes such as that of non-concentric circles (see exercise 2.4). Thus, we need a more general proof technique and more general results. The next two sections provide us with such tools in the case of a finite hypothesis set.
2.2 Guarantees for ?nite hypothesis sets — consistent case
In the example of axis-aligned rectangles that we examined, the hypothesis h_S
returned by the algorithm was always consistent, that is, it admitted no error on
the training sample S. In this section, we present a general sample complexity
bound, or equivalently, a generalization bound, for consistent hypotheses, in the
case where the cardinality |H| of the hypothesis set is finite. Since we consider
consistent hypotheses, we will assume that the target concept c is in H.
Theorem 2.1 Learning bounds — finite H, consistent case
Let H be a finite set of functions mapping from X to Y. Let A be an algorithm that
for any target concept c ∈ H and i.i.d. sample S returns a consistent hypothesis h_S:
R̂_S(h_S) = 0. Then, for any ε, δ > 0, the inequality Pr_{S∼D^m}[R(h_S) ≤ ε] ≥ 1 − δ
holds if

    m ≥ (1/ε) (log|H| + log(1/δ)).        (2.8)
This sample complexity result admits the following equivalent statement as a gener-
alization bound: for any ε, δ > 0, with probability at least 1 − δ,

    R(h_S) ≤ (1/m) (log|H| + log(1/δ)).        (2.9)
Proof: Fix ε > 0. We do not know which consistent hypothesis h_S ∈ H is selected
by the algorithm A. This hypothesis further depends on the training sample S.
Therefore, we need to give a uniform convergence bound, that is, a bound that
holds for the set of all consistent hypotheses, which a fortiori includes h_S. Thus,
we will bound the probability that some h ∈ H would be consistent and have error
more than ε:

    Pr[∃h ∈ H : R̂_S(h) = 0 ∧ R(h) > ε]
        = Pr[(R̂_S(h_1) = 0 ∧ R(h_1) > ε) ∨ (R̂_S(h_2) = 0 ∧ R(h_2) > ε) ∨ ···]
        ≤ Σ_{h∈H} Pr[R̂_S(h) = 0 ∧ R(h) > ε]          (union bound)
        ≤ Σ_{h∈H} Pr[R̂_S(h) = 0 | R(h) > ε].         (definition of conditional probability)
Now, consider any hypothesis h ∈ H with R(h) > ε. Then, the probability that h
would be consistent on a training sample S drawn i.i.d., that is, that it would have
no error on any point in S, can be bounded as:

    Pr[R̂_S(h) = 0 | R(h) > ε] ≤ (1 − ε)^m.
The previous inequality implies

    Pr[∃h ∈ H : R̂_S(h) = 0 ∧ R(h) > ε] ≤ |H| (1 − ε)^m.

Setting the right-hand side to be equal to δ and solving for ε concludes the proof.
The theorem shows that when the hypothesis set H is finite, a consistent algorithm
A is a PAC-learning algorithm, since the sample complexity given by (2.8) is
dominated by a polynomial in 1/ε and 1/δ. As shown by (2.9), the generalization
error of consistent hypotheses is upper bounded by a term that decreases as
a function of the sample size m. This is a general fact: as expected, learning
algorithms benefit from larger labeled training samples. The decrease rate of O(1/m)
guaranteed by this theorem, however, is particularly favorable.
The price to pay for coming up with a consistent algorithm is the use of a
larger hypothesis set H containing target concepts. Of course, the upper bound
(2.9) increases with |H|. However, that dependency is only logarithmic. Note that
the term log|H|, or the related term log₂|H| from which it differs by a constant
factor, can be interpreted as the number of bits needed to represent H. Thus, the
generalization guarantee of the theorem is controlled by the ratio of this number of
bits, log₂|H|, and the sample size m.
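As a quick numerical illustration of (2.8) and (2.9), the following sketch (the helper names are my own, not from the text) computes the sample complexity and the generalization bound for a finite hypothesis set:

    import math

    def sample_complexity(h_size, eps, delta):
        # Smallest integer m satisfying (2.8): m >= (1/eps)(log|H| + log(1/delta)).
        return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

    def generalization_bound(h_size, m, delta):
        # Upper bound on R(h_S) from (2.9), valid with probability at least 1 - delta.
        return (math.log(h_size) + math.log(1 / delta)) / m

    # Squaring |H| only doubles the log|H| term, illustrating the logarithmic dependency:
    print(sample_complexity(2 ** 20, 0.05, 0.01))   # |H| = 2^20
    print(sample_complexity(2 ** 40, 0.05, 0.01))   # |H| = 2^40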
We now use theorem 2.1 to analyze PAC-learning with various concept classes.
Example 2.2 Conjunction of Boolean literals
Consider learning the concept class C_n of conjunctions of at most n Boolean literals
x_1, ..., x_n. A Boolean literal is either a variable x_i, i ∈ [1, n], or its negation x̄_i. For
n = 4, an example is the conjunction x_1 ∧ x̄_2 ∧ x_4, where x̄_2 denotes the negation
of the Boolean variable x_2. (1, 0, 0, 1) is a positive example for this concept, while
(1, 0, 0, 0) is a negative example.
Observe that for n = 4, a positive example (1, 0, 1, 0) implies that the target
concept cannot contain the literals x̄_1 and x̄_3 and that it cannot contain the literals
x_2 and x_4. In contrast, a negative example is not as informative, since it is not
known which of its n bits are incorrect. A simple algorithm for finding a consistent
hypothesis is thus based on positive examples and consists of the following: for each
positive example (b_1, ..., b_n) and each i ∈ [1, n], if b_i = 1 then x̄_i is ruled out as a
possible literal in the concept, and if b_i = 0 then x_i is ruled out. The conjunction of
all the literals not ruled out is thus a hypothesis consistent with the target. Figure 2.4
shows an example training sample as well as a consistent hypothesis for the case
n = 6; a short implementation of this algorithm is sketched at the end of this example.
    0 1 1 0 1 1   +
    0 1 1 1 1 1   +
    0 0 1 1 0 1   −
    0 1 1 1 1 1   +
    1 0 0 1 1 0   −
    0 1 0 0 1 1   +
    0 1 ? ? 1 1

Figure 2.4: Each of the first six rows of the table represents a training example, with
its label, + or −, indicated in the last column. The last row contains 0 (respectively
1) in column i ∈ [1, 6] if the ith entry is 0 (respectively 1) for all the positive examples.
It contains "?" if both 0 and 1 appear as an ith entry for some positive example.
Thus, for this training sample, the hypothesis returned by the consistent algorithm
described in the text is x̄_1 ∧ x_2 ∧ x_5 ∧ x_6.

We have |H| = |C_n| = 3^n, since each literal can be included positively, included with
negation, or not included. Plugging this into the sample complexity bound for
consistent hypotheses yields the following sample complexity bound, for any ε > 0
and δ > 0:

    m ≥ (1/ε) ((log 3) n + log(1/δ)).        (2.10)
Thus, the class of conjunctions of at most n Boolean literals is PAC-learnable. Note
that the computational complexity is also polynomial, since the training cost per
example is in O(n). For δ = 0.02, ε = 0.1, and n = 10, the bound becomes m ≥ 149.
Thus, for a labeled sample of at least 149 examples, the bound guarantees 90%
accuracy with a confidence of at least 98%.
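The consistent algorithm described above is simple enough to state in a few lines. Below is a minimal sketch (the function and variable names are my own) that reproduces the hypothesis of figure 2.4 and checks the m ≥ 149 figure from (2.10):

    import math

    def learn_conjunction(sample, n):
        # Start with all 2n literals possible; each positive example (b_1,...,b_n)
        # rules out the negated literal where b_i = 1 and the plain literal where b_i = 0.
        plain = [True] * n    # x_i still possible
        negated = [True] * n  # the negation of x_i still possible
        for x, label in sample:
            if label == 1:
                for i, b in enumerate(x):
                    if b == 1:
                        negated[i] = False
                    else:
                        plain[i] = False
        return plain, negated

    # Training sample of figure 2.4 (n = 6); at least one positive example present.
    sample = [
        ((0, 1, 1, 0, 1, 1), 1),
        ((0, 1, 1, 1, 1, 1), 1),
        ((0, 0, 1, 1, 0, 1), 0),
        ((0, 1, 1, 1, 1, 1), 1),
        ((1, 0, 0, 1, 1, 0), 0),
        ((0, 1, 0, 0, 1, 1), 1),
    ]
    plain, negated = learn_conjunction(sample, 6)
    literals = [f"x{i+1}" for i in range(6) if plain[i]] + \
               [f"not x{i+1}" for i in range(6) if negated[i]]
    print(" and ".join(literals))  # x2 and x5 and x6 and not x1

    # Sample size prescribed by (2.10) for n = 10, eps = 0.1, delta = 0.02:
    print(math.ceil((10 * math.log(3) + math.log(1 / 0.02)) / 0.1))  # 149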
Example 2.3 Universal concept class
Consider the set X = {0, 1}^n of all Boolean vectors with n components, and let U_n
be the concept class formed by all subsets of X. Is this concept class PAC-learnable?
To guarantee a consistent hypothesis, the hypothesis set must include the concept
class, thus |H| ≥ |U_n| = 2^(2^n). Theorem 2.1 gives the following sample complexity
bound:

    m ≥ (1/ε) ((log 2) 2^n + log(1/δ)).        (2.11)
Here, the number of training samples required is exponential in n, which is the cost
of the representation of a point in X. Thus, PAC-learning is not guaranteed by
the theorem. In fact, it is not hard to show that this universal concept class is not
PAC-learnable.
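For contrast with the conjunction class, the bound (2.11) can be evaluated numerically; the figures below are a quick illustration, not from the text:

    import math

    # Bound (2.11) for the universal class over {0,1}^n, with eps = 0.1, delta = 0.02.
    for n in (5, 10, 20):
        m = math.ceil((2 ** n * math.log(2) + math.log(50)) / 0.1)
        print(n, m)
    # The required sample size roughly doubles with each increment of n;
    # n = 10 already demands more than 7,000 examples.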
Example 2.4 k-term DNF formulae
A disjunctive normal form (DNF) formula is a formula written as the disjunction of
several terms, each term being a conjunction of Boolean literals. A k-term DNF is a
DNF formula defined by the disjunction of k terms, each term being a conjunction
of at most n Boolean literals. Thus, for k = 2 and n = 3, an example of a k-term
DNF is (x_1 ∧ x̄_2 ∧ x_3) ∨ (x̄_1 ∧ x_3).
Is the class C of k-term DNF formulae PAC-learnable? The cardinality of the
class is 3^(nk), since each term is a conjunction of at most n variables and there are
3^n such conjunctions, as seen previously. The hypothesis set H must contain C for
consistency to be possible, thus |H| ≥ 3^(nk). Theorem 2.1 gives the following sample
complexity bound:

    m ≥ (1/ε) ((log 3) nk + log(1/δ)),        (2.12)

which is polynomial. However, it can be shown that learning k-term DNF formulae
does not admit a randomized polynomial-time solution unless RP = NP, where RP is
the complexity class of problems that admit a randomized polynomial-time decision
solution; and RP = NP is commonly conjectured not to be the case. Thus, while the
sample size needed for learning k-term DNF formulae is only polynomial, efficient
PAC-learning of this class is not possible unless RP = NP.
Example 2.5 k-CNF formulae
A conjunctive normal form (CNF) formula is a conjunction of disjunctions. A k-
CNF formula is an expression of the form T_1 ∧ ··· ∧ T_j, with arbitrary length j ∈ ℕ
and with each term T_i being a disjunction of at most k Boolean literals.
The problem of learning k-CNF formulae can be reduced to that of learning
conjunctions of Boolean literals, which, as seen previously, is a PAC-learnable
concept class. To do so, it suffices to associate a new variable to each possible term,
via the following bijection:

    (u_1 ∨ ··· ∨ u_k)  →  Y_{u_1,...,u_k},        (2.13)

where u_1, ..., u_k denote the literals appearing in the term. This reduction to PAC-
learning of conjunctions of Boolean literals may affect the original distribution, but
this is not an issue, since in the PAC framework no assumption is made about the
distribution. Thus, the PAC-learnability of conjunctions of Boolean literals implies
that of k-CNF formulae.
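As a concrete, admittedly brute-force rendering of this reduction, the sketch below (an illustration with assumed names, not code from the text) enumerates all disjunctions of at most k literals over n variables and re-encodes each example as one Boolean feature per disjunction, after which a conjunction learner can be run on the transformed data:

    import itertools

    def all_terms(n, k):
        # All disjunctions of at most k literals over x_1, ..., x_n.
        # A term is a tuple of (variable index, sign) pairs, with sign 1 for x_i
        # and sign 0 for its negation; there are O((2n)^k) of them.
        terms = []
        for size in range(1, k + 1):
            for idx in itertools.combinations(range(n), size):
                for signs in itertools.product((0, 1), repeat=size):
                    terms.append(tuple(zip(idx, signs)))
        return terms

    def transform(x, terms):
        # One new Boolean variable Y per term: its value on x is the
        # truth value of the corresponding disjunction.
        return [int(any(x[i] == s for i, s in t)) for t in terms]

    terms = all_terms(4, 2)
    print(len(terms))                          # 32 terms for n = 4, k = 2
    print(transform((1, 0, 1, 1), terms)[:8])  # first few transformed features

Any k-CNF formula over the x variables then becomes a conjunction over the new Y variables, so the learner and the sample complexity bound of example 2.2 apply directly in the transformed feature space.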
This is a surprising result, however, since any k-term DNF formula can be written
as a k-CNF formula. Indeed, by distributivity, a k-term DNF can be rewritten as
a k-CNF formula via

    ∨_{i=1}^{k} (u_{i,1} ∧ ··· ∧ u_{i,n}) = ∧_{j_1,...,j_k ∈ [1,n]} (u_{1,j_1} ∨ ··· ∨ u_{k,j_k}),

where u_{i,j} denotes the jth literal of the ith term.
To illustrate this rewriting in a specific case, observe, for example, that

    (u_1 ∧ u_2 ∧ u_3) ∨ (v_1 ∧ v_2 ∧ v_3) = ∧_{i,j=1}^{3} (u_i ∨ v_j).
But, as we previously saw, k-term DNF formulae are not efficiently PAC-learnable!
What can explain this apparent inconsistency? Observe that the number of new
variables needed to write a k-term DNF as a k-CNF formula via the transformation
just described is exponential in k: it is in O(n^k). The discrepancy comes from the size
of the representation of a concept. A k-term DNF formula can be an exponentially
more compact representation, and efficient PAC-learning is intractable if a time
complexity polynomial in that size is required. Thus, this apparent paradox deals
with key aspects of PAC-learning, which include the cost of the representation of a
concept and the choice of the hypothesis set.
2.3 Guarantees for finite hypothesis sets — inconsistent case
In the most general case, there may be no hypothesis in H consistent with the
labeled training sample. This, in fact, is the typical case in practice, where the
learning problems may be somewhat difficult or the concept classes more complex
than the hypothesis set used by the learning algorithm. However, inconsistent
hypotheses with a small number of errors on the training sample can be useful and,
as we shall see, can benefit from favorable guarantees under some assumptions. This
section presents learning guarantees precisely for this inconsistent case and finite
hypothesis sets.

To derive learning guarantees in this more general setting, we will use Hoeffding's
inequality (theorem D.1) or the following corollary, which relates the generalization
error and empirical error of a single hypothesis.
Corollary 2.1
Fix ε > 0 and let S denote an i.i.d. sample of size m. Then, for any hypothesis
h: X → {0, 1}, the following inequalities hold:

    Pr_{S∼D^m}[R̂_S(h) − R(h) ≥ ε] ≤ exp(−2mε²),        (2.14)
    Pr_{S∼D^m}[R̂_S(h) − R(h) ≤ −ε] ≤ exp(−2mε²).       (2.15)
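It may help to see the bound (2.14) checked numerically. The following sketch is illustrative only: it assumes a single fixed hypothesis whose true error is some value p, estimates the deviation probability by simulation, and compares it with exp(−2mε²):

    import math
    import random

    def hoeffding_check(p=0.3, m=200, eps=0.1, trials=20000, seed=0):
        # The empirical error of a fixed hypothesis with true error p is a
        # Binomial(m, p) average; compare Pr[R_hat - R >= eps] to the bound (2.14).
        rng = random.Random(seed)
        exceed = sum(
            (sum(rng.random() < p for _ in range(m)) / m - p) >= eps
            for _ in range(trials)
        )
        return exceed / trials, math.exp(-2 * m * eps ** 2)

    empirical, bound = hoeffding_check()
    print(empirical, bound)  # the empirical frequency stays below the bound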