Contents
1 Introduction
2 Overview of Supervised Learning
2.1 Introduction
2.2 Variable Types and Terminology
2.3 Two Simple Approaches to Prediction: Least Squares and Nearest Neighbors
2.3.1 Linear Models and Least Squares
2.3.2 Nearest-Neighbor Methods
2.3.3 From Least Squares to Nearest Neighbors
2.4 Statistical Decision Theory
2.5 Local Methods in High Dimensions
2.6 Statistical Models, Supervised Learning and Function Approximation
2.6.1 A Statistical Model for the Joint Distribution Pr(X, Y)
2.6.2 Supervised Learning
2.6.3 Function Approximation
2.7 Structured Regression Models
2.7.1 Difficulty of the Problem
11 Neural Networks
11.1 Introduction
11.2 Projection Pursuit Regression
11.3 Neural Networks
11.4 Fitting Neural Networks
11.5 Some Issues in Training Neural Networks
11.5.1 Starting Values
11.5.2 Overfitting
11.5.3 Scaling of the Inputs
11.5.4 Number of Hidden Units and Layers
11.5.5 Multiple Minima
11.6 Example: Simulated Data
11.7 Example: ZIP Code Data
11.8 Discussion
11.9 Bayesian Neural Nets and the NIPS 2003 Challenge
11.9.1 Bayes, Boosting and Bagging
11.9.2 Performance Comparisons
11.10 Computational Considerations
Bibliographic Notes
Exercises
12 Support Vector Machines and Flexible Discriminants
12.1 Introduction
12.2 The Support Vector Classifier
12.2.1 Computing the Support Vector Classifier
12.2.2 Mixture Example (Continued)
12.3 Support Vector Machines and Kernels
12.3.1 Computing the SVM for Classification
12.3.2 The SVM as a Penalization Method
12.3.3 Function Estimation and Reproducing Kernels
12.3.4 SVMs and the Curse of Dimensionality
12.3.5 A Path Algorithm for the SVM Classifier
12.3.6 Support Vector Machines for Regression
12.3.7 Regression and Kernels
12.3.8 Discussion
12.4 Generalizing Linear Discriminant Analysis
12.5 Flexible Discriminant Analysis
12.5.1 Computing the FDA Estimates
12.6 Penalized Discriminant Analysis
12.7 Mixture Discriminant Analysis
12.7.1 Example: Waveform Data
Bibliographic Notes
Exercises
13 Prototype Methods and Nearest-Neighbors
13.1 Introduction
13.2 Prototype Methods
13.2.1 K-means Clustering
13.2.2 Learning Vector Quantization
13.2.3 Gaussian Mixtures
13.3 k-Nearest-Neighbor Classifiers
13.3.1 Example: A Comparative Study
13.3.2 Example: k-Nearest-Neighbors and Image Scene Classification
13.3.3 Invariant Metrics and Tangent Distance
13.4 Adaptive Nearest-Neighbor Methods
13.4.1 Example
13.4.2 Global Dimension Reduction for Nearest-Neighbors
13.5 Computational Considerations
Bibliographic Notes
Exercises
14 Unsupervised Learning
14.1 Introduction
14.2 Association Rules
14.2.1 Market Basket Analysis
14.2.2 The Apriori Algorithm
14.2.3 Example: Market Basket Analysis
14.2.4 Unsupervised as Supervised Learning
14.2.5 Generalized Association Rules
14.2.6 Choice of Supervised Learning Method
14.2.7 Example: Market Basket Analysis (Continued)
14.3 Cluster Analysis
14.3.1 Proximity Matrices
14.3.2 Dissimilarities Based on Attributes
14.3.3 Object Dissimilarity
14.3.4 Clustering Algorithms
14.3.5 Combinatorial Algorithms
14.3.6 K-means
14.3.7 Gaussian Mixtures as Soft K-means Clustering
14.3.8 Example: Human Tumor Microarray Data
14.3.9 Vector Quantization
14.3.10 K-medoids
14.3.11 Practical Issues
14.3.12 Hierarchical Clustering
14.4 Self-Organizing Maps
14.5 Principal Components, Curves and Surfaces
14.5.1 Principal Components
14.5.2 Principal Curves and Surfaces
14.5.3 Spectral Clustering
14.5.4 Kernel Principal Components
14.5.5 Sparse Principal Components
14.6 Non-negative Matrix Factorization
14.6.1 Archetypal Analysis
14.7 Independent Component Analysis and Exploratory Projection Pursuit
14.7.1 Latent Variables and Factor Analysis
14.7.2 Independent Component Analysis
14.7.3 Exploratory Projection Pursuit
14.7.4 A Direct Approach to ICA
14.8 Multidimensional Scaling
14.9 Nonlinear Dimension Reduction and Local Multidimensional Scaling
14.10 The Google PageRank Algorithm
Bibliographic Notes
Exercises
15 Random Forests
15.1 Introduction
15.2 Definition of Random Forests
15.3 Details of Random Forests
15.3.1 Out of Bag Samples
15.3.2 Variable Importance
15.3.3 Proximity Plots
15.3.4 Random Forests and Overfitting
15.4 Analysis of Random Forests
15.4.1 Variance and the De-Correlation Effect
15.4.2 Bias
15.4.3 Adaptive Nearest Neighbors
Bibliographic Notes
Exercises
16 Ensemble Learning
16.1 Introduction
16.2 Boosting and Regularization Paths
16.2.1 Penalized Regression
16.2.2 The Bet on Sparsity Principle
16.2.3 Regularization Paths, Over-fitting and Margins
16.3 Learning Ensembles
16.3.1 Learning a Good Ensemble
16.3.2 Rule Ensembles
Bibliographic Notes
Exercises
18 High-Dimensional Problems: p ≫ N
18.1 When p is Much Bigger than N
References
Author Index
Index