
(machine learning).

[Mitchell ,

(learning

et a l., 2001].
2

(data

(attribute
(attribute (samp1e

(feature vector).
=

Xi = (X i1; Xi2; . . . ;

), (dimensionality).
(training) ,

(training (training samp1e) ,


(training
(training
1. 2 3

(regression).
(binary (positive
(negative
(multi-class
Y1) , (X2 , Y2) ,..., (Xm ,
=

(testing
(testing
f(x).

(supervised (unsupervised

(unseen instance).

(distribution)
(independent and identically
4

(inductive .

1
2
3
4

"
1.3 5

[Cohen and Feigenbaum. ^ ^

*) ^

x 3 x 3+1=

7 11

(version
6

*) ^ *)"
(inductive bias) ,

(feature
1. 4 7

y
&

B
,
/' "
-,
\ A

(Occam's

= _X 2 + 6x +

*) ^ *) ^
8

y y ,
B
J
/
B
, ‘BEES

(a) (b)

I (1. 1)
h æEX-X

P(h I
f f h

I X ,'ca)

= L IX,
æEX-X h

h
1. 4 9

= 21.1' 1- 1 P(æ). 1 . (1. 2)

f) , (1. 3)
f f

Free Lunch
[Wolpert , 1996; Wolpert and Macready , 1995].

*)
*) ^
^

"
10

(Logic
(General Problem

A.

E. A.

S. Michalski
B.

J.

S. G.
[Michalski et a l.,
1. 5 11

G.
[Carbonell ,

R. S. et a l.,

E. A.
[Cohen and

S.

Logic
12

S.

J. J.
D. E.

(statistical
Vector
(kernel

N. J.
1. 6 13

Intel
14

and DeCoste ,
ZJZ;;225;

T.

(data
1.6 15

S.

S.
16

(Sparse Distributed

[Mitchell , [Duda et al., 2001; Alc


paydin , 2004; Flach , [Hastie et al. ,
[Bishop ,
[Shalev-Shwartz and Ben-David , [Witten et
2011]

http://www.cs.waikato.
ac.nz/ml/weka j.
[Michalski et a l.,
Morgan
1.7 17

A.
and Feigenbaum ,
[Dietterich ,

(transfer learning) [Pan and Yang ,


(1earning by
(deep

and [Winston , 1970]

[Simon and Lea ,


[Mitchell ,

et a l.,

1996;

(principle of multiple
explanations)

of Machine Learning Learning.


Intelli-
of Intelligence
Transactions on Knowledge Discovery
and Knowledge
18.

Transactions on Pattem
Analysis and Machine Com-
on Neural
19

1.1

1. 2

^ *))
v ^ ,

^
V (A = *)
=

1. 3

1.4*

= I
h

1. 5
20

(2007). 3(12):35-44.

Alpaydin , E. (2004). Introduction to Machine Learning. MIT Press , Cambridge ,


MA.
Asinis , E. (1984). Epicurus' Method. Cornell University Press , It haca ,
NY.
Bishop , C. M. (2006). Pattern Recognition and Machine
New York , NY.
Blumer , A. , A. Ehrenfeucht , D. Haussler , and M. K. (1996). -

J. G. , ed. (1990). Machine Learning: and Methods. MIT


Press , Cambridge , MA.
Cohen , P. R. and E. A. eds. (1983). The of
Intelligence , volume 3. William Kaufmann , New York , NY.
Dietterich, T. G. (1997). "Machine Four current directions."

P. (1999). "The role of Occam's razor in knowledge discovery."


Mining and K Discovery,
Duda , R. 0. , P. E. Hart , and D. G. Stork. (2001). Pattern 2nd
edition. & Sons , New York , NY.
Flach, P. (2012). Machine The Art and Science of that
Sense of Cambridge University Press , Cambridge , UK.
H. Mannila , and P. Smyth. (2001). Principles of MIT
Press , Cambridge , MA.
Hastie , T. , R. J. Friedman. (2009). The Elements of
Learning, 2nd edition. Springer , New York , NY.
Hunt , E. G. and D. 1. Hovland. (1963). "Programming a model of human con-
cept Thought (E. and J.
eds.) , 310-325 , McGraw Hill , New York , NY.
21

Kanerva , P. (1988). Distributed Memory. MIT Press , Cambridge , MA.


Michalski , J. G. Carbonell , and T. M. Mitchell , eds. (1983). Machine
Learning: An Intelligence Tioga , Palo Alto , CA.
Mitchell , T. (1997). Learning. McGraw Hill , New York , NY.
Mitchell , T. M. (1977). "Version spaces: A candidate elimination approach to
rule learning." In Proceedings of the 5th Joint on
Intelligence 305-310 , Cambridge , MA.
and D. DeCoste. (2001). "Machine science: State of
the art and future prospects." Science , 293(5537):2051-2055.
Pan , S. J. and Q. Yang. of transfer learning." IEEE
on 22(10):1345-1359.
Shalev-Shwartz , S. and S. Ben-David. (2014). Understanding Machine Learn-
ing. Cambridge University Press , Cambridge , UK.
Simon , H. A. and G. Lea. (1974). "Problem solving and rule induction: A
unified view." In K and Cognition (L. W. ed.) , 105-127,
Erlbaum , New York , NY.
Vapnik , V. N. (1998). Learning New York , NY.
Webb , G. 1. (1996). experimental evidence against the utility of Oc-
cam's razor." Journal of Intelligence Research, 43:397-417.
Winston , P. H. (1970). "Learning structural descriptions from examples." Tech-
nical Report AI-TR-231 , AI Lab , MIT , Cambridge , MA.
Witten, I. H., E. Frank, and M. A. Hall. (2011). Data Mining: Practical Machine Learning Tools and Techniques, 3rd edition. Elsevier, Burlington, MA.
Wolpert, D. H. (1996). "The lack of a priori distinctions between learning algorithms." Neural Computation, 8(7):1341-1390.
Wolpert, D. H. and W. G. Macready. (1995). "No free lunch theorems for search." Technical Report SFI-TR-05-010, Santa Fe Institute, Santa Fe, NM.
Zhou , Z.-H. (2003). "Three perspectives of data
143(1):139-146.
22

Samuel, 1901- 1990)

(John McCarthy, ,

studies in machine learning using the game of checkers"

and
(error
rate)

-;:;) x 100%
(error) ,
(training
(empirical error) (generalization
24

(model

(testing
(testing
2.2 25

Yl) , (X2 ,Y2) , … 7

x 30% = 70%.

(stratified
26

rv

1997]

(cross
= D1 U D k, Di n Dj = ø

(k-fold cross
validat ion).

D
G
ID1 1D2 1 31D4 1Ds ID6 1D71 D81Dg IDlOI
D

1
r

IDsID6ID7ID8IDgID lO l J
2.2 27

[Efron and Tibshirani, 1993]

lim_{m→∞} (1 − 1/m)^m = 1/e ≈ 0.368,                                        (2.1)

so roughly 36.8% of the original examples never appear in a bootstrap sample; these left-out examples can serve as a test set, giving the out-of-bag estimate.
28

(parameter tuning).

(validation

measure).
2.3 29

= {(X1 , Y1) , . . . , (Xm,

(mean squared error)

(2.2)

E (f ;Ð) = (2.3)

(2 .4)

acc (f; D) (f (Xi) (2.5)

1- E (f ;D) .

, (2.6)
30

acc(f; D) = ∫_{x∼D} I(f(x) = y) p(x) dx = 1 − E(f; D).                      (2.7)

For binary classification, predictions split into four cases, summarized by the confusion matrix:

                     predicted positive    predicted negative
actually positive           TP                    FN
actually negative           FP                    TN

Precision P and recall R are defined as

P = TP / (TP + FP),                                                          (2.8)
R = TP / (TP + FN).                                                          (2.9)
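A minimal sketch (not taken from the book's own code, which it does not provide): counting the confusion-matrix entries and computing the precision and recall of Eqs. (2.8)-(2.9) with plain Python lists.

```python
# Precision/recall from label lists; "positive" marks the positive class code.
def precision_recall(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0   # Eq. (2.8)
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0      # Eq. (2.9)
    return precision, recall

# Example: 6 samples, positive class encoded as 1.
print(precision_recall([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
```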
2.3 31

10

./

0.2

Ol 0 2 04 0 6

(Break-Event
32

F1 = 2 × P × R / (P + R) = 2 × TP / (m + TP − TN),                           (2.10)

where m is the total number of examples. More generally, the Fβ measure trades off precision against recall [van Rijsbergen, 1979]:

Fβ = (1 + β²) × P × R / ((β² × P) + R).                                      (2.11)

β = 1 recovers the standard F1; β > 1 weights recall more heavily, β < 1 weights precision more heavily.

When there are several binary confusion matrices (e.g. one per class or one per data set), two averaging schemes are common. Macro-averaging first computes the per-matrix precision P_i and recall R_i and averages them:

macro-P = (1/n) Σ_{i=1}^{n} P_i,                                             (2.12)
macro-R = (1/n) Σ_{i=1}^{n} R_i,                                             (2.13)
macro-F1 = 2 × macro-P × macro-R / (macro-P + macro-R).                      (2.14)

Micro-averaging first averages the confusion-matrix entries, giving TP̄, FP̄, TN̄, FN̄, and then computes (micro-F1):

micro-P = TP̄ / (TP̄ + FP̄),                                                    (2.15)
micro-R = TP̄ / (TP̄ + FN̄),                                                    (2.16)
micro-F1 = 2 × micro-P × micro-R / (micro-P + micro-R).                      (2.17)
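A minimal sketch of Eqs. (2.10)-(2.17): the Fβ measure for one confusion matrix, and macro-/micro-averaged F1 over several confusion matrices (the input format, a list of (TP, FP, FN) tuples, is an assumption for illustration).

```python
def f_beta(p, r, beta=1.0):
    # Eq. (2.11); beta = 1 gives F1 as in Eq. (2.10).
    return (1 + beta**2) * p * r / (beta**2 * p + r) if p + r > 0 else 0.0

def macro_micro_f1(confusions):
    """confusions: list of (TP, FP, FN) tuples, one per binary task."""
    ps = [tp / (tp + fp) for tp, fp, fn in confusions]
    rs = [tp / (tp + fn) for tp, fp, fn in confusions]
    macro_p, macro_r = sum(ps) / len(ps), sum(rs) / len(rs)
    macro_f1 = f_beta(macro_p, macro_r)                        # Eq. (2.14)
    tp_bar = sum(tp for tp, _, _ in confusions) / len(confusions)
    fp_bar = sum(fp for _, fp, _ in confusions) / len(confusions)
    fn_bar = sum(fn for _, _, fn in confusions) / len(confusions)
    micro_p = tp_bar / (tp_bar + fp_bar)                       # Eq. (2.15)
    micro_r = tp_bar / (tp_bar + fn_bar)                       # Eq. (2.16)
    micro_f1 = f_beta(micro_p, micro_r)                        # Eq. (2.17)
    return macro_f1, micro_f1

print(macro_micro_f1([(30, 10, 5), (20, 5, 15)]))
```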
2.3.3

(cut

(Receiver Operating

[Spackman ,

The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR), defined as

TPR = TP / (TP + FN),                                                        (2.18)
FPR = FP / (TN + FP).                                                        (2.19)
34

[Figure: ROC curve and its AUC]

When two learners' ROC curves cross, they are commonly compared by the area under the ROC curve, the AUC (Area Under ROC Curve).

y l), (X2, Y2) , . . . , (x m,


=0, x m =
2.3 35

Given m⁺ positive and m⁻ negative examples, define the ranking loss

ℓ_rank = (1 / (m⁺ m⁻)) Σ_{x⁺∈D⁺} Σ_{x⁻∈D⁻} ( I(f(x⁺) < f(x⁻)) + ½ I(f(x⁺) = f(x⁻)) ),   (2.21)

i.e. the fraction of positive-negative pairs in which the positive example is scored below the negative one, with ties counted as one half. Then

AUC = 1 − ℓ_rank.                                                            (2.22)
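A minimal sketch of Eqs. (2.21)-(2.22): computing the ranking loss ℓ_rank directly from the scores of positive and negative examples and taking AUC = 1 − ℓ_rank.

```python
def auc_from_scores(pos_scores, neg_scores):
    m_plus, m_minus = len(pos_scores), len(neg_scores)
    loss = 0.0
    for sp in pos_scores:
        for sn in neg_scores:
            if sp < sn:
                loss += 1.0          # positive ranked below a negative
            elif sp == sn:
                loss += 0.5          # ties contribute 1/2
    l_rank = loss / (m_plus * m_minus)   # Eq. (2.21)
    return 1.0 - l_rank                  # Eq. (2.22)

print(auc_from_scores([0.9, 0.8, 0.35], [0.7, 0.3, 0.1]))
```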

(unequa1 cost).

(cost
costii =
17'1 / .,
>
costl0 5 1 qa-
9"-
36

(total

E (f; D; cost) (f

+ L I. (2.23)

(cost

pX costOl
(2.24)
+ (1 - p)

(normaliza-

!NR X P x x (1- p) (2.25)


norm p + (1 - p)
2.8.

FNR == 1 -
2.4 37

0.5 1. 0

2010]

=
38

(1 - x

P(Ê; f) = (2.26)

f)/8f =
=

0.25

0.20

0.15

0, 10

= 0.3)

(binomial

e = maxf (2.27)
2 .4 39

('Binomial' ,

www.r-project.org

(2.28)

(2.29)

EO)
(2.30)

-10 5
Tt
‘4 5 10

= 10)

1
40

-1).

Critical values t_{α/2, k−1} of the two-tailed t-test:

α \ k      2        5        10       20       30
0.05     12.706    2.776    2.262    2.093    2.045
0.10      6.314    2.132    1.833    1.729    1.699

(paired t-tests

.ð. 1 , .ð.2 , . . .

(2.31)

x
2.4 41

1998].

0.5(ßi +

f..L
Tt = (2.32)
5
0.2 I: ut

2.4 .3

(contingency table)

eOO eOl
eîo e l1

-
le01 -

- e lO l- 1)2
(2.33)
^

('Chisquare' , 0.1
= 2
42

2 .4.4

2,

Dl 1 2 3
D2 l 2.5 2.5
D3 1 2 3
D4 1 2 3
1 2.125 2.875

+ -

- - (N -1)Tx2
YF Z N(k-1)-TX2' (2.35)
2.4 43

- l)(N -

0.05

2 3 4 5 6 7 8 9 10
1. (k 4 10.128 5.143 3.863 3.259 2.901 2.661 2.488 2.355 2.250
(/F / , 1 5 7.709 4.459 3.490 3.007 2.711 2.508 2.359 2.244 2.153
-1 , (k -1) * (N -1)). 8 5.591 3.739 3.072 2.714 2.485 2.324 2.203 2.109 2.032
10 5.117 3.555 2.960 2.634 2.422 2.272 2.159 2.070 1. 998
15 4.600 3.340 2.827 2.537 2.346 2.209 2.104 2.022 1. 955
20 4.381 3.245 2.766 2.492 2.310 2.179 2.079 2.000 1. 935

0.1

2 3 4 5 6 7 8 9 10
4 5.538 3.463 2.813 2.480 2.273 2.130 2.023 1. 940 1. 874
5 4.545 3.113 2.606 2.333 2.158 2.035 1.943 1.870 1. 811
8 3.589 2.726 2.365 2.157 2.019 1. 919 1.843 1. 782 1. 733
10 3.360 2.624 2.299 2.108 1. 980 1. 886 1. 814 1. 757 1. 710
15 3.102 2.503 2.219 2.048 1.931 1. 845 1. 779 1. 726 1. 682
20 2.990 2.448 2.182 2.020 1.909 1. 826 1. 762 1. 711 1. 668

(post-hoc

CD=JF , (2.36)

Inf)!
sqrt

2 3 4 5 6 7 8 9 10
0.05 1. 960 2.344 2.569 2.728 2.850 2.949 3.031 3.102 3.164
0.1 1. 645 2.052 2.291 2.459 2.589 2.693 2.780 2.855 2.920
44

=
QO.05 = =

1. 0 3.0

(bias-variance

f(x;
2.5 45

f̄(x) = E_D[ f(x; D) ],                                                       (2.37)

the variance of the prediction over different training sets of the same size,

var(x) = E_D[ ( f(x; D) − f̄(x) )² ],                                          (2.38)

the noise,

ε² = E_D[ (y_D − y)² ],                                                       (2.39)

and the squared bias of the expected prediction,

bias²(x) = ( f̄(x) − y )².                                                     (2.40)

Assuming the noise has zero mean, E_D[y_D − y] = 0, the expected generalization error decomposes as

E(f; D) = E_D[ ( f(x; D) − y_D )² ]
        = E_D[ ( f(x; D) − f̄(x) + f̄(x) − y_D )² ]
        = E_D[ ( f(x; D) − f̄(x) )² ] + E_D[ ( f̄(x) − y_D )² ]
          + E_D[ 2 ( f(x; D) − f̄(x) )( f̄(x) − y_D ) ]
        = E_D[ ( f(x; D) − f̄(x) )² ] + E_D[ ( f̄(x) − y + y − y_D )² ]
        = E_D[ ( f(x; D) − f̄(x) )² ] + ( f̄(x) − y )² + E_D[ (y − y_D)² ]
          + 2 E_D[ ( f̄(x) − y )( y − y_D ) ]
        = E_D[ ( f(x; D) − f̄(x) )² ] + ( f̄(x) − y )² + E_D[ (y_D − y)² ],      (2.41)

that is,

E(f; D) = bias²(x) + var(x) + ε².                                             (2.42)


46

variance

Tibshirani ,

1989],
2.6 47

and McNeil , 1983]. [Hand and Till ,

[Drummond and Holte ,

learning) [Elkan , 2001;


Zhou and Liu ,
[Dietterich ,
[Demsar ,
[Geman et al.,
variance-covariance

and Dietterich , 1995; Kohavi and Wolpert ,


1996; Breiman, 1996; Friedman , 1997;
48

2.1

2.2

2.3

2.4

2.5

2.6

2.7

2.8

- . (2 .43)

, x-x
(2 .44)

2.9

2.10*
49

Bradley, A. P. (1997). "The use of the area under the ROC curve in the evaluation of machine learning algorithms." Pattern Recognition.
Breiman, L. (1996). "Bias, variance, and arcing classifiers." Technical Report 460, Statistics Department, University of California, Berkeley, CA.
Demsar , J. (2006). "Statistical comparison of classifiers over mu1tip1e data
sets." Journal of Machine Leaming Research, 7:1-30.
Dietterich, T. G. (1998). "Approximate statistical tests for comparing super-
vised classification 1earning algorithms." Neural
1923.
P. uni

CA.

for visualizing classifier performance." Machine


Efron , B. and R. Tibshirani. (1993). An Introduction to the Chap-
man & Hall , New York , NY.
Elkan , C. (2001). "The foundations of cost-senstive 1earning." In Proceedings of
the 17th Intemational Joint Conference on Artificial
973-978 , Seatt1e , WA.
Fawcett , T. (2006). "An introduction to ROC ana1ysis." Recognition
Letters ,
Friedman , J. H. (1997). "On bias , variance , 0/1-108s , and the curse-of-
dimensionality." Knowledge
E. Bienenstock , and R. Doursat. (1992). networks and the
bias/variance dilemma." Neural 4(1):1 58

D. J. and R. J. Till. generalisation of the area under


the ROC curve for multiple class classification prob1ems." Learning,
45(2):171-186
J. A. and B. J. McNeil. (1983). "A method ofcomparing
under receiver operating characteristic curves derived from the same cases."
148(3):839-843.
50

Kohavi , R. and D. H. Wo1pert. (1996). variance decomposition for


zero-one 10ss functions." 1n Proceeding of the 13th Conference
275-283 , Bari , 1ta1y.
E. B. and T. G. Dietterich. (1995). coding cor-
rects bias and variance." 1n Proceedings of the 12th International Conference
on M achine 313-321 , Tahoe City, CA.
Mitchell , T. (1997). Machine Learning. McGrawHill , New
Spackm
inductive 1n Proceedings of W orkshop on
160-163 , Ithaca , NY.
VanRijsbergen , C. J. (1979). Retrieval, 2nd edition. Butterworths ,
London , UK.
Wellek , S. (2010). of
2nd edition. Chapman & HalljCRC , Boca Raton , FL.
Zhou , Z.-H. and X.-Y. Liu. (2006). "On mu1ti-class
on
Boston, WA.
51

.
Gosset ,

(Student's t-test).

,
Pearson,
(University College
=

f(æ) + b, (3.1)

f(æ) (3.2)

(un-
derstandability)

= {(æ1 , Y1) , (Xi1;


. . ; Xid) , (linear

=
54

(0 , (1 , 0, 0).

(3.3)

(5quare 1055)

(w*, b*) = argmin_{(w,b)} Σ_{i=1}^{m} ( f(x_i) − y_i )²
         = argmin_{(w,b)} Σ_{i=1}^{m} ( y_i − w x_i − b )².                   (3.4)

(Euclidean
(least

estimation).

434-t
no-nu-
z'
'b\lll-/
(3.5)

Z
(3.6)

f(x) =

Z Yi(Xi - x)
(3.7)
3.2 55

(3.8)

f(Xi) + ,

= x

X11 X12

X22 X2d 1.)(71


I I xi 1
.
.. ..
Xm l Xm 2 Xmd 1} \x;, 1

ŵ* = argmin_ŵ (y − X ŵ)ᵀ (y − X ŵ).                                           (3.9)

Let E_ŵ = (y − X ŵ)ᵀ (y − X ŵ); differentiating with respect to ŵ gives

∂E_ŵ / ∂ŵ = 2 Xᵀ (X ŵ − y).                                                   (3.10)

When XᵀX is a full-rank (positive definite) matrix, setting this derivative to zero yields

ŵ* = (XᵀX)⁻¹ Xᵀ y,                                                            (3.11)

and with x̂_i = (x_i; 1) the learned model is

f(x̂_i) = x̂_iᵀ (XᵀX)⁻¹ Xᵀ y.                                                   (3.12)
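A minimal sketch of the closed-form solution in Eq. (3.11): the design matrix is augmented with a constant-1 column so the bias b is absorbed into ŵ, and the normal equations are solved directly (this assumes XᵀX is full rank, as stated above).

```python
import numpy as np

def least_squares(X, y):
    X_hat = np.hstack([X, np.ones((X.shape[0], 1))])       # append the "1" column
    w_hat = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)  # Eq. (3.11)
    return w_hat                                            # last entry plays the role of b

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.7 + rng.normal(scale=0.1, size=50)
print(least_squares(X, y))   # approximately [1.5, -2.0, 0.5, 0.7]
```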

(3.13)

lny = wTx +b (3.14)

(log-linear

U
30

20

2 X
3.3 57

y = (3.15)

(generalized linear
(link
g(-) =

(unit-step function)

I 0, z<0;
y= < 0.5 , z = 0 ; (3.16)
1 1, z >0,

Z >O
z = 0;
l

10 5 10 z
58

(surrogate

y = 1 / (1 + e^{−z}),                                                         (3.17)

which, substituted as the link function, gives

y = 1 / (1 + e^{−(wᵀx + b)}).                                                 (3.18)

This can be rewritten as

ln( y / (1 − y) ) = wᵀx + b.                                                  (3.19)

Viewing y as the probability that x is positive and 1 − y as the probability that it is negative,

y / (1 − y)                                                                   (3.20)

is the odds, and

ln( y / (1 − y) )                                                             (3.21)

is the log odds (logit). The resulting model is logistic regression: treating y as the posterior probability p(y = 1 | x), Eq. (3.19) becomes

ln( p(y = 1 | x) / p(y = 0 | x) ) = wᵀx + b,                                  (3.22)

so that

p(y = 1 | x) = e^{wᵀx + b} / (1 + e^{wᵀx + b}),                                (3.23)
p(y = 0 | x) = 1 / (1 + e^{wᵀx + b}).                                          (3.24)

(maximum likelihood
(log-
likelihood)

1 (3.25)

X = (x; = 1 1
= p(y = 0 1 = 1-

+ . (3.26)

= E . (3.27)

and
Vandenberghe , descent

= . (3.28)

+1 at \
(3.29)
60

∂ℓ(β) / ∂β = −Σ_{i=1}^{m} x̂_i ( y_i − p1(x̂_i; β) ),                           (3.30)

and, for Newton's method, the update rule is

β^{t+1} = β^t − ( ∂²ℓ(β) / ∂β ∂βᵀ )⁻¹ ∂ℓ(β) / ∂β.                              (3.31)
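A minimal sketch (the fixed step size and iteration count are assumptions): minimizing the log-likelihood objective of Eq. (3.27) by plain gradient descent, using the gradient of Eq. (3.30); the text's Newton iteration of Eq. (3.31) would converge faster but needs the Hessian.

```python
import numpy as np

def logistic_regression(X, y, lr=0.1, iters=2000):
    X_hat = np.hstack([X, np.ones((X.shape[0], 1))])    # beta = (w; b)
    beta = np.zeros(X_hat.shape[1])
    for _ in range(iters):
        p1 = 1.0 / (1.0 + np.exp(-X_hat @ beta))         # p(y=1 | x), Eq. (3.23)
        grad = -X_hat.T @ (y - p1)                       # Eq. (3.30)
        beta -= lr * grad / len(y)
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X @ np.array([2.0, -1.0]) + 0.3 > 0).astype(float)
print(logistic_regression(X, y))   # direction roughly proportional to (2, -1, 0.3)
```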

Discriminant

Xl

"

=
3.4 61

- wT :E ow +
(3.32)

scatter matrix)

+:E 1

(3.33)
æEXo

scatter matrix)

Sb = , (3.34)

(3.35)

(generalized
Rayleigh

1,

(3.36)
s.t. wTSww = 1.

Sb W (3.37)
62

Sb W , (3.38)

. (3.39)

St = Sb + Sw

, (3.40)

N
(3 .4 1)

SWi = L (3.42)

Sb = St - Sw
T
m (3 .43)

Sw ,
3.5 63

tr (WTSbW)
(3 .44)
,
E

. (3 .45)

(classifier)

(One vs (One vs.


(One vs. (Many vs.
= {(Xl , Y1) , ... CN}.
64

=} h -7 "+"

(Error Correcting
Output
ECOC [Dietterich and Bakiri ,
3.5 65

(coding
and et a l.,

11 h fa 14 15 h h fa 16
•• ••
C1• • 32 V3 C1 • • 4 4
C •
2 • 4 4 C2 • • 2 2
C •
3 • 1 2 C3 • • 5
C •
4 • 22 v'2
••
66

y >

(3 .46)
l-y

(3 .4 7)
3.7 67

y' y m
1- y' 1- Y --
m+
(3.48)

(rebal-
(rescaling).
ance).

(upsam-
pling)

(threshold-moving).

[Chawla
et a l.,

EasyEnsemble [Liu et

(cost-sensitive

(sparse
(sparsity
LASSO
[Tibshirani ,
68

et al.,
[Crammer and Singer ,

et 2006 , 2008].

(Directed Acyclic
et al.,

and Singer, 2001; Lee et

(misclassification

and Liu ,
and Liu ,

and Zhou, 2014].


69

3.1

3.2

3.3

3.4

3.5

3.6

3.7

3.8*

3.9
70

Allwein , E. L. , R. E. Schapire , and Y. Singer. (2000).


to binary: A approach for margin classifiers." Journal of Machine
1:113-14 1.
Boyd , S. and L. Vande erghe. (2004). Convex Optimization. Cambridge U
Press , Cambridge , UK.
Chawla , N. V. , K. W. Bowyer , L. O. Hall , and W. P. Kegelmeyer. (2002).
"SMOTE: Synthetic minority technique." Journal of Artificial
Intelligence 16:321-:-357.
Crammer , K. and Y. Singer. (2001). "On the algorithmic implementation of
multiclass kernel-based vector machines." Journal of Learning Re-
2:265-292.

J
Crammer, K. and Y. Singer. (2002). "On the learnability and design of
codes for multiclass problems." Machine Learning, 47(2-3):201-233.

via error-correcting of Artificial Intelligence Re-


2:263-286.
Elkan , C. (2001). "The foundations of cost-sensitive learning." In Proceedings
of the 17th Joint Conference on Artifiçial Intelligence (IJCAI) ,
973-978 , Seattle , WA.
O. Pujol , and P. Radeva. (2010). "Error-correcting ouput codes
library." Journal of Machine 11:661-664.
Fisher , R. A. (1936). "The use of multiple taxonomic prob-
lems." Annals of 7(2):179-188.
Lee , Y., Y. Lin , and G. Wahba. (2004). "Multicategory support vector ma-
chines , theory, and application to the classification of microarray data and
satellite radiance data." Journal of the American Statistical Association, 99
(465):67-8 1.
Liu , X.- Y., J. Wu , and Z.-H. Zhou. undersamping for
learning." IEEE on Systems , Cyber-
neticíi - Part B: Cybernetics , 39(2):539- 550.
Platt , J. C. , N. Cristianini , and J. Shawe-Taylor. (2000). "Large margin DAGs
71

for multiclass classification." In in Neural Processing


12 (NIPS) (8. A. 8011a, T. K. Leen , and K.-R. Müller , eds.) , MIT
Press , Cambridge , MA.
Pujol , 0. , 8. Escalera , and P. Radeva. incremental node embedding
technique for error correcting output codes." Pattern Recognitìon,
725.
Pujol , 0. , P. Radeva , and J. Vitrià. (2006). "Discriminant ECOC: A heuristic
method for application dependent design of error output codes."
IEEE on Pattern Analysis and Machine Intelligence , 28(6):
1007-1012.
Tibshirani , R. (1996). "Regression shrinkage and selection via the LA880."
Journal of the Royal Series B , 58(1):267-288.
M.-L. and Z.-H. Zhou. on multi-label learning al-
gorithms." IEEE Transactions on 26(8):
1819-1837.
Zhou , Z.-H. and X.- Y. Liu. (2006a). "On multi-class cost-sensitive learning." In
Proceeding ofthe 21st National Conference on Intelligence
567-572 , Boston, WA.
Zhou , Z.-H. and X.- Y. Liu. (2006b). cost-sensitive neural networks
with methods addressing the class imbalance problem." IEEE
18(1):63-77.
72

"3L"
74

= {(X1 , Y1) , . , (Xm,Ym)};

A)
2: if then
3:
4: end if
5: if A= 0
6: return
7: end if

for do
10:
11: if
12: return
13: else
14: A\
15: end if
16: end for
4.2 75

Information entropy is the most common measure of the purity of a sample set. Let p_k (k = 1, 2, ..., |Y|) be the proportion of class-k examples in D; then

Ent(D) = − Σ_{k=1}^{|Y|} p_k log₂ p_k.                                        (4.1)

The smaller Ent(D), the purer D. The information gain of splitting D on a discrete attribute a with possible values {a¹, ..., a^V}, where D^v is the subset of D taking value a^v, is

Gain(D, a) = Ent(D) − Σ_{v=1}^{V} (|D^v| / |D|) Ent(D^v).                     (4.2)
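A minimal sketch of Eqs. (4.1)-(4.2): entropy of a label list and the information gain of splitting on one discrete attribute, on a hypothetical toy data set.

```python
from collections import Counter
from math import log2

def entropy(labels):
    m = len(labels)
    return -sum((c / m) * log2(c / m) for c in Counter(labels).values())   # Eq. (4.1)

def information_gain(attr_values, labels):
    m = len(labels)
    gain = entropy(labels)
    for v in set(attr_values):
        subset = [y for a, y in zip(attr_values, labels) if a == v]
        gain -= len(subset) / m * entropy(subset)        # weighted child entropy, Eq. (4.2)
    return gain

attr = ['a', 'a', 'b', 'b', 'b', 'a']
label = ['+', '+', '-', '-', '+', '-']
print(entropy(label), information_gain(attr, label))
```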

erative Dichotomiser

(8. 8 9. 9\
1"'7) =
76

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

D1
D2 D3

4, 6 , 10 , 13 ,
3 , 7, 8 , 9 ,
= = 11 , 12 , 14,

Ent(D l ) (63 3 6 3 6)}I


6 '3 == 1.000 ,

Ent(D :l) (4422)

Ent(D ,j) ;. i5 4 5)}


5 +'4
(51 1 0.722
-- ,
4.2 77

=
fLlDU!
IDI
= x

= 0.143; = 0.141;
= 0.381; = 0.289;

= 0.006.

2, 3 , 4 , 5, 6 , 8 , 10 ,

= 0.043; = 0.458;
= 0.331; = 0.458;
= 0.4 58.
78

C4.5 instead uses the gain ratio:

Gain_ratio(D, a) = Gain(D, a) / IV(a),                                        (4.3)

where

IV(a) = − Σ_{v=1}^{V} (|D^v| / |D|) log₂ (|D^v| / |D|)                        (4.4)

is the intrinsic value of attribute a [Quinlan, 1993]; it grows with the number of possible values of a.

= 0.874 (V = 2) , = 1.580 (V = 3) ,
= 4.088 (V = 17).
C4.5
4.3 79

and Regression
et al., (Gini

Gini(D) = 1 − Σ_{k=1}^{|Y|} p_k².                                             (4.5)

Intuitively, Gini(D) is the probability that two examples drawn at random from D carry different class labels, so smaller means purer. The Gini index of attribute a is

Gini_index(D, a) = Σ_{v=1}^{V} (|D^v| / |D|) Gini(D^v),                       (4.6)

and the splitting attribute is chosen as a* = argmin_{a∈A} Gini_index(D, a).
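A minimal sketch of Eqs. (4.5)-(4.6): the Gini value of a label list and the Gini index of a candidate discrete splitting attribute (smaller is better, as in CART).

```python
from collections import Counter

def gini(labels):
    m = len(labels)
    return 1.0 - sum((c / m) ** 2 for c in Counter(labels).values())   # Eq. (4.5)

def gini_index(attr_values, labels):
    m = len(labels)
    total = 0.0
    for v in set(attr_values):
        subset = [y for a, y in zip(attr_values, labels) if a == v]
        total += len(subset) / m * gini(subset)                        # Eq. (4.6)
    return total

attr = ['a', 'a', 'b', 'b', 'b', 'a']
label = ['+', '+', '-', '-', '+', '-']
print(gini(label), gini_index(attr, label))
```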

pruning) [Quinlan,
80
4.3 81

4%

4%
4%

100% = 42.9%

x 100% = 71. 4% > 42.9%


82

(decision
stump).
4.4 83

1%
4%

1993] .

t
84

, (4.7)

[Quinlan , 1993].

Gain(D , a) t)

-LEnt(Df) , (4.8)
IDI

a,

12345678
0.697 0.460
0.774 0.376
0.634 0.264
0.608 0.318
0.556 0.215
0.403 0.237
0.481 0.149
0 .437 0.211
0.666 0.091
0.243 0.267
0.245 0.057
0.343 0.099
0.639 0.161
0.657 0.198
0.360 0.370
0.593 0.042
0.719 0.103
4.4 85

0.294 , 0.351 , 0.381 , 0.420 , 0 .4 59 , 0.518 ,


0.574 , 0.600 , 0.621 , 0.636 , 0.648 , 0.661 , 0.681 , 0.708 ,

{0.049 , 0.074 , 0.095 , 0.101 , 0.126 , 0.155 , 0.179 , 0.204 , 0.213 , 0.226 , 0.250 , 0.265 ,
0.292 , 0.344 , 0.373 ,

= 0.109; = 0.143;
= 0.141; = 0.381;
= 0.289; = 0.006;
= 0.262; = 0.349.

.
86

7, 14 ,

..,

ÌJ k
= 1 , 2 , .. . , ÌJ k ,

LZLEL-L
(4.9)

~pf
(1 k IYI) , (4.10)

(1 v V) . (4.11)
4.4 87

Gain(D , a) = p x Gain(D , a)
\tll1/

× En4L ~D En 4tu
ND 4 -i9"

IYI
Ent(15) = -

Ent(15) = - LPk
(6. 6 8. 8\
= - I -::-: . -::-: 14 J = 0.985.

\11FE/\11/
NDND
EE
MM)) - - -
oo
ubub -
oo
FDFb nu03
nu--
1inu
q4qL

.•
88

14
0.306 = 0.252 .
17

= 0.252; = 0.171;
= 0.145; = 0 .424;
= 0.289; = 0.006.

9, 13 , 14 , 12 ,
4.5 89

0.697 0.460
0.774 0.376
0.634 0.264
0.608 0.318
0.556 0.215
0.403 0.237
0.481 0.149
0.437 0.211
,.•
0.666 0.091
0.243 0.267
0.245 0.057
0.343 0.099
0.639 0.161
0.657 0.198
0.360 0.370
0.593 0.042
0.719 0.103
90

06EZE +
+
+
+
o. 2

o O. 2 O. 4 O. 6 O. 8

(multivariate decision

(oblique
decision tree).

(univariate decision
4.5 91

/ '-....

O. 2

o 0.2 O. 4 O. 6 0.8
92

[Quinlan , 1979 , [Quin-


[Breiman et a l., 1984]. [Murthy ,

[Quinlan ,

[Raileanu and Stoffel ,

[Murthy et a l., and Ut-


goff,
[Brodley
and Utgoff,

(Perceptron tree) [Utgoff,


and Gelfand ,

[Schlimmer and Fisher , [Utgoff, [Utgoff et a l.,


93

4.1

4.2

4.3

4.4

4.5

http://archive.ics.uci.edu/mlj. 4.6

4.7

4.8*

4.9

4.10
94

Breiman , L. , J. Friedman , C. J. and R. A. Olshen. (1984).


Chapman & Hall/CRC , Boca Raton , FL.
Brodley, C. E. and P. E. Utgoff. (1995). "Multivariate decision trees." Machine
Learning, 19(1):45-77.
Guo , H. and S. B. Gelfand. (1992). trees with ncural network
feature extraction." IEEE on Neural Networks , 3(6):923-933
Mingers , J. (1989a). "An empirical comparison of pruning mcthods for decision
tree induction." Learning , 4(2):227-243.
Mingers , J. (1989b). "An empirical comparison of selection measures for
decision-tree induction." Learning, 3(4):319-342.
Murthy, S. K. (1998). "Automatic construction of decision trees from data:
A survey." and K 2(4):
345-389.
Murthy, S. Kasif, and S. Salzberg. for induction of
oblique decision trees." Jo'urnal of Artíficial Intelligence 2:1-32.
J. R. (1979). rules by large collections of
examples." 1n Expert the (D.
University Press , Edinburgh , UK.
Quinlan , J. R. (1986). "1nduction of decision trees." Machine 1(1):
81-106.
Quinlan , J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kauf-
mann , San Mateo , CA.
Raileanu , L. E. Stoffel. (2004). "Theoretical comparison between the
Gini index and information gain criteria." of Arti-
ficialIntelligence , 41(1):77-93.
J. C. and D. Fisher. study of incremental concept
induction." 1n Proceedings of the 5th Conference on Artificial In-
telligence (AAAI) , 495-501 , PA
Utgoff, P. E. (1989a). "1ncremental induction of decision trees." Machíne
Learning, 4(2):161-186.
95

P. E. (1989b). "Perceptron A case study in hybrid concept rep-


resenations." Connection Science, 1(4):377-39 1.
Utgoff, P. E. , N. C. Berkman, and J. A. Clouse. (1997). "Decision t ree induction
based on effcient tree Machine Learning, 29(1 ):5-44.

Ross Quinlan , 1943- ).


B.
Hunt
(Concept Learning

achine
Kohonen 1988
Networks
[Kohonen ,

[McCulloch and Pitts ,

U
98

(activation

1.0

1 x -1. 0 -0.5 10 0.5 1. 0 :Í;


cd Z) 0;
x< O

(b)

(threshold logic unit) .


fCI:'i WiXi -

• (Xl ^ = 1, e= f(l . Xl + 1 . X2 -
5.2 99

'W l/ \'W 2

Xl X2

= X2 = = 1;

• (X1 V 1, e= f(l . X2 - 0.5) ,


= = = 1;

• = = 0, e= f( -0.6. Xl + O.
X2 = = = 1.

(dummy

(5.1)

ý)Xi , (5.2)

and Papert ,
100

X2 X2
"+"
(1 , 1) (0, 1) ..., (1 , 1)
i+
( AU nU)
(1 , 0) Xl (0 , 0) (1, 0) Xl

X2)

+
r_"_"
F

, i< ;H "+

-0 X
Xl ,
..
(0, 0) (1, 0) xl
(d)

(multi-layer feedforward neural

(nu , )
1
i
‘. (1, 1)

1
Xl
Xl X2
5.3 101

(connection

Yl) ,
.. . , (Xm , Ym)} , Xi E Yi

=
102

Yl

Xl Xi Xd

rl' FI<J Sigmoid

Bj) , (5.3)

(5.4)

+
x

. (5.5)

(5.6)
5.3 103

_
(5.7)

't.
(5.8)

f'(x)=f(x)(1-f(x)) , (5.9)

g_j = ŷ_j (1 − ŷ_j) (y_j − ŷ_j),                                              (5.10)

and the BP updates for the hidden-to-output weights and output thresholds are

Δw_hj = η g_j b_h,                                                            (5.11)
Δθ_j = −η g_j.                                                                (5.12)

Similarly, for the input-to-hidden weights and the hidden thresholds,

Δv_ih = η e_h x_i,                                                            (5.13)
Δγ_h = −η e_h,                                                                (5.14)

where the hidden-layer gradient term is

e_h = b_h (1 − b_h) Σ_{j=1}^{l} w_hj g_j.                                      (5.15)
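A minimal sketch (array shapes, learning rate, and initialization are assumptions) of one standard-BP update for a single-hidden-layer sigmoid network, following Eqs. (5.10)-(5.15): v and gamma are the input-to-hidden weights and hidden thresholds, w and theta the hidden-to-output weights and output thresholds.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bp_step(x, y, v, gamma, w, theta, eta=0.1):
    b = sigmoid(x @ v - gamma)              # hidden outputs b_h
    y_hat = sigmoid(b @ w - theta)          # network outputs y_hat_j
    g = y_hat * (1 - y_hat) * (y - y_hat)   # Eq. (5.10)
    e = b * (1 - b) * (w @ g)               # Eq. (5.15)
    w += eta * np.outer(b, g)               # Eq. (5.11)
    theta -= eta * g                        # Eq. (5.12)
    v += eta * np.outer(x, e)               # Eq. (5.13)
    gamma -= eta * e                        # Eq. (5.14)
    return v, gamma, w, theta

d, q, l = 3, 4, 2                           # input, hidden, output sizes
rng = np.random.default_rng(0)
v, gamma = rng.normal(size=(d, q)), rng.normal(size=q)
w, theta = rng.normal(size=(q, l)), rng.normal(size=l)
v, gamma, w, theta = bp_step(rng.normal(size=d), np.array([1.0, 0.0]),
                             v, gamma, w, theta)
```

Standard BP applies this update once per training example; accumulated BP instead averages the gradients over the whole training set before updating.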

2: repeat
3: for all (Xk ,
4:
5:
6:
7:
8: end for
9:

, (5.16)
5.3 105

0.11 0.51 0.56

0.53 1.72

2f .
1

of + +
2

error

(one round ,

gradient
[Hornik et al.,

(early

(regularization) [Barron , 1991; Girosi et a l.,


106

(5.17)

(local
(global

v 0) 111(w;

w;

v ,- , -,
5 .4 107

(simulated and Korst , 1989].

algorithms) [Goldberg ,
108

5.5.1
RBF(Radial and

= Ci) , (5.18)

= (5.19)

[Park and Sandberg ,

5 .5 .2

A R,T(Adaptive Reson.ance and


Grossberg ,
5.5 109

(stability-
plasticity

5.5.3
SOM(Self-Organizing
(Self-Organizing Fea-

matching
110

(construc-

and Lcbiere , 1990]

(
5.5 111

5.5.5
neural (recurrent neural
networks"

1987].

5.5.6

(energy-based

E {O ,

E(s) =- (5.20)
i=l j= i+ l i=l

(station
ary distribution)
112

(a)

P(s) (5.21)

Boltzmann

(Contrastive

= rr (5.22)

P (hj I v) (5.23)
j=1

(5.24)
5.6 113

(deep

layer-wise

(pre- (fine-
belief [Hinton

(weight
Neural
[LeCun and 1995; LeCun et a1.,
et a 1.,
114

32x 32 (j<<128x28 6 (çý 14 x14

et al. , 1998]

f (x ) =< -' x <,-,


0,
I x. otherwlse

LU(Rectified Li near Unit ) ;

(feat ure
(representation learning) .

(feature
5.7 115

[Haykin , [Bishop ,
Computation , Neural
IEEE 'JIransactions on Neural Networks and Learning Systems;
on Neu-
ral Networks.

[Gerstner and Kistler , 2002].


et al. ,
(Least Mean

and Rumelhart , 1995].

[Gori and Tesi , [Yao ,

and 1998; Orr and Müller , 1998].


et al. , 2001]. [Carpenter and
Grossberg ,
2001]. [Bengio et al.,

[Tickle et al., 1998; Zhou , 2004].


116

5.1

5.2

5.3

5.4

5.5

5.6

http://archive.ics.ucí.edu/ml/

5.7

5.8

5.9*

5.10

http://yann.lecun.com/
117

Aarts , E. and J. Korst. (1989). Simulated Machines:


A to Neural Comput-
ing. John Wiley & Sons , New York , NY.
D. H. , G. E. Hinton , and T. J. Sejnowski. algorithm
for Boltzmann machines."
A. R. with to -
cial neural networks." In Related
Topics; NATO Volume 335 (G. ed.) , Kluwer ,
Amsterdam , The Netherlands.
Bengio , Y., A. Courville , and P. Vincent. (2013).
review and new perspectives." IEEE on Pattern
Machine Intelligence ,
Bishop , C. M. (1995). Neural Networks fo 1' Pattern Recognition. Oxford Uni-
Press , New York , NY.
Broomhead , D: S. and D. Lowe. (1988). functional interpolation
and adaptive networks." Complex
Carpenter , G. A. and S. Grossberg. (1987). "A massively parallel architecture
for a self-organizing neural pattern Computer Vision ,
Image
Carpenter , G. A. and S. Grossberg , eds. (1991). Pattern Recognition by
Organizing Neural Networks. MIT Press , Cambridge , MA.
Y. and D. E. Rumelhart , eds. (1995).
Applications. Lawrence Erlbaum Associates , Hillsdale , NJ.
Elman , J. L. (1990). structure in time."

Fahlman, S. E. and C. Lebiere. (1990). "The cascade-correlation


Technical Report CMU-CS-90-100 , School of Computer Sciences ,
Carnergie Mellon University, Pittsburgh , PA.
Gerstner , W. and W. Kistler. (2002). Spiking Neuron Models: Single
Cambridge University Press , Cambridge, UK.
M. Jones , and T. Poggio. (1995). "Regularization theory and neural
118

networks architectures."
Goldberg , D. K (1989). Genetic Algorithms in Optirnizaiton and Ma-
chine Addison-Wesley, Boston , MA.
Gori , M. and A. Tes i. (1992). "On the problem of local in backpropa-
gation." IEEE 0 11, knalysis and Intelligence ,
14(1):76-86
S. (1998). Neural A Comprehensive 2nd
tion. Prentice-Hall , Upper Saddle River , NJ.
Hinton , G. (2010). "A practical guide to Boltzmann ma-
chines." Technical Report UTML TR 2010-003 , Department of Computer
Science , University of Toronto.
Hinton , G. , S. Osindero , and Y. -W. Teh. (2006). "A fast learning algorithm for
deep belief nets." Neural 1527-1554.
Hornik , K., M. Stinchcombe, and H. White. (1989). feedforward
networks are universal approximators." Neural 2(5):359-366
Kohonen , T. (1982). "Self-organized formation of topologically correct feature
maps." Cybernetics , 43(1):59-69
Kohonen , T. (1988). "An introduction to neural computing." Neural Networks ,
1(1):3-16.
Kohonen , T. (2001). 3rd edition. Springer, Berlin.
LeCun , Y. and Y. (1995). networks for images , speech ,
and time-series." In The of Brain Neural Networks
(M. A. Arbib , ed.) , MIT Press , Cainbridge , MA.
LeCun , Y., L. and P. Haffner. (1998). "Gradient-based
learning applied to document Proceedings of theIEEE, 86(11):
2278 2324.
D. J. C. (1992). "A 1

networks." Neural Computation , 4(3):448-472.


McCulloch , W. S. and W. Pitts. (1943). "A logical calculus of the ideas im-
manent in nervous activity." Bulletin of Mathematical Biophysics, 5(4):115-
133.
Minsky, M. and S. Papert. (1969). Perceptrons. MIT Press , Cambridge , MA.
119

Orr , G. B. and K.-R. Müller , eds. (1998). Neuml of the Trade.


Springer, London , UK.
Park , J. and 1. W. Sandberg. (1991). "Universal approximation
networks." Neural
Pineda , F. J. (1987). "Generalization of Back-Propagation to recurrent neural
networks." Physical Review Letters , 59(19):2229-2232.
Reed , R. D. and R. J. Marks. (1998). Neuml Leaming in
Artificial Neural Networks. MIT Press , Cambridge , MA.
Rumelhart , D. E. , G. E. and R. J. Williams. (1986a). "Learning internal
representations by error propagation." In Distributed Processing:
Explorations in the Cognition (D. E. Rumelhart and J. L.
McClelland , eds.) , volume 1, 318-362 , MIT.Press , Cambridge , MA
Rumelhart , G. E. Hinton , and R. J. Williams. (1986b). repre-
sentations by 323(9):318-362.
Schwenker , F. , H.A. Kestler , and G. Palm. (2001). "Three learning phases for

Tickle , A. B. , R. Andrews , M. Golea , and J. Diederich. (1998). "The truth


will to light: Directions and challenges in extracting the knowledge
embedded within trained artificial neural networks." IEEE on
Neural 9(6):1057 1067.

P. (1974). Beyond New tools for in


the science. Ph.D. thesis , Harvard University, Cambridge , MA.
Yao , x. (1999). "Evolving neural networks." Proceedings of the IEEE,
87(9) :1423-1447.
Zhou , Z.-H. (2004). "Rule extraction: networks or for neural
networks?" Joumal of Technology , 19(2):249-253.
120

Minsky,

Paul

UCSD

!'l"
= {(Xl , Y2) , . . . , (X m , Ym)} ,

Xl

(6.1)

=
122

÷
(6.2)

=
, T'o".J n llA.I <lV1., I V
> v,
_,.,..-
wᵀx_i + b ≥ +1,  if y_i = +1;
wᵀx_i + b ≤ −1,  if y_i = −1.                                                 (6.3)

The training points closest to the hyperplane, for which equality holds, are the support vectors; the sum of the distances from the two classes' support vectors to the hyperplane is

γ = 2 / ||w||,                                                                (6.4)

called the margin.

[Figure: support vectors and the maximum margin]

Finding the maximum-margin separating hyperplane means solving

max_{w,b} 2 / ||w||
s.t. y_i (wᵀx_i + b) ≥ 1,  i = 1, 2, ..., m,                                  (6.5)

which is equivalent to

min_{w,b} ½ ||w||²
s.t. y_i (wᵀx_i + b) ≥ 1,  i = 1, 2, ..., m.                                  (6.6)

Vector

f(æ) (6.7)

quadratic

(dual

, (6.8)

, (6.9)

(6.10)

( tu )
4·i-t4
max -
124

S.t. = 0 ,
i=l

i =

f(x) =

+b , (6.12)

(Karush-Kuhn-

(> O
(6.13)

(Yd(Xi) - 1) = 0 ..

=
0,
=

SMO (Sequential Minimal


1998].
6.2 125

et a l.,

:? 0 :? 0 , (6.14)

"uo
(6.15)

=C (6.16)

= 1,

=1 (6.17)

= {i > 0, i =

(6.18)
126

>

f(x) +b , (6. 19)

(6 .20)
'"

S.t.

T
x WU
U
Z Z nhv
6.3 127

s.t. L O'. iYi = 0 ,

;;;:: 0 , i = 1, 2,..., m .

= = cþ(Xi)Tcþ(Xj) , (6.22)

(ker-
nel trick).

=0,
i=l

;;;:: 0 , i = 1, 2,… , m.

f(x) +b

+b
i=l

(6.24)

(kernel

vector expansion).
128

and Smola , 2002]


=
(kernel matrix)

X1) Xj) Xm )

K= X1) Xm )

X1)

(Reproducing Kernel Hilbert

= x 'f Xj
d = = (x 'f Xj)d
= exp
= exp (-
= 0, B < 0

(6.25)
6.4 129

z) (6.26)

= z)g(z ) (6.27)

(80ft

X2

Xl

(hard
130

+ (6.28)

-1) , (6.29)

ℓ_{0/1}(z) = 1 if z < 0; 0 otherwise.                                         (6.30)

Because ℓ_{0/1} is non-convex and discontinuous, it is usually replaced by a surrogate loss, typically a convex, continuous upper bound. Three common choices are:

hinge loss:       ℓ_hinge(z) = max(0, 1 − z);                                 (6.31)
exponential loss: ℓ_exp(z) = exp(−z);                                         (6.32)
logistic loss:    ℓ_log(z) = log(1 + exp(−z)).                                (6.33)
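A minimal sketch, not the dual/SMO route the text follows: it trains a soft-margin *linear* SVM by subgradient descent on the regularized hinge objective min_{w,b} ½||w||² + C Σ_i max(0, 1 − y_i(wᵀx_i + b)), i.e. Eq. (6.31) plugged into the form of Eq. (6.29). Step size and iteration count are assumed.

```python
import numpy as np

def linear_svm(X, y, C=1.0, lr=0.01, iters=2000):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        margin = y * (X @ w + b)
        viol = margin < 1                       # samples violating the margin
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] - X[:, 1] + 0.2 > 0, 1.0, -1.0)
w, b = linear_svm(X, y)
print(np.mean(np.sign(X @ w + b) == y))        # training accuracy
```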


log(.)

(6.34)

(s1ack variab1es) Çi

mm (6.35)
W ,b,Çí
6.4 131

[Figure: the 0/1 loss ℓ_{0/1}(z) together with the hinge loss max(0, 1 − z), the exponential loss exp(−z), and the logistic loss]

S.t. 3 1 - Çi
Çi i = 1, 2 ,… , m.

+ b)) - (6.36)

, (6.37)

(6.38)

(6.39)
132

lllax (6 .40)
a

s.t.

i = 1, 2,… , m.

0 ,
-1 + Çi 0,
(6 .4 1)
(Xi) - 1 + Çi) = 0,

Çi = 0.

= =
=

< =
= =
>

and
6.5 133

C(f (x i) ,

+ C "L C(f (Xi)'Y (6.42)

(structural risk)
C(f (Xi) , (empirical risk)

= {(X l, Y1) , ,…?


(X m, Ym)} , Yi E

Vector
134

_-0 0

, (6.43)

I 0, if Izl E ;
(6 .44)
=<
l Izl - E , otherwi8e

utLz (6 .45)

f! (z )

if Izl E
otherwise

Z
6.5 135

S.t. E+ Çi ,
f(x;)
0, i = 1 , 2,… , m.

C
mZM

(x;) - E - - f (Xi) - (6.46)

b,

(6 .4 7)

0= , (6.48)

, (6.49)

iii . (6.50)

mhMMA (6.51 )

S.t. =0,

âi c.
136

f_ -

- f(Xi) - =0,
(6.52)
= 0 , ÇiÇi = 0 ,
(0 = 0 , (0 -

=
Yi - f(Xi) - =
- Yi -
-

f(x) = (6.53)

- Yi - f - Çi) = O. <

(6.54)
i=l

(6.55)
6.6 137

f(x) (6.56)
i=1
=

Yl) , (X2 ,Y2) , … 7

(representer

and
Smola ,

: ]Rm •-7

2fE (6.57)

h*(x) Xi) (6.58)

(kernel

(Kernelized Linear Discriminant

h(x) . (6.59)
138

(6.60)

(6.61)
$

st = - (6.62)

L L (<þ (x) - p, f)( <þ (x) - p,f)T (6.63)


i=O æEXi

Xi) , (6.64)

(6.65)

(6.66)
mo
(6.67)
m1

M= 11 1) T , (6.68)
6.7 139

(6.69)

(6.70)

and Vapnik ,

(statistical

and Shawe-Tay lor ,

and Shawe-Taylor , 2000; Burges , 1998;


2009; Schölkopf et a l., 1999; Schölkopf and Smola ,
1995 , 1998 , 1999].

plane

et a l.,
[Hsieh et al.,
[Tsang et a l.,
and Seeger ,
and
et a l., 2012].

[Hsu and Lin ,


140

et al. , et a1., 1997] , [Smola and


Schölkopf,

[Vapnik and Chervonenkis ,

[Chang and Lin ,


LIBLINEAR [Fan et a1.,
141

6.1

6.2
csie. ntu.edu. tw/

6.3

6.4

6.5

6.6

6.7

6.8

6.9

6.10*
142

Bach , R. R., G. R. G. Lanckriet , and M. I. Jordan. (2004). "Multiple kernel


learning , conic duality, and the SMO algorithm." In Proceedings 0] the 21st
International on 6-13 , Banff, Cana-
da.
Boyd , S. and L. Vandenberghe. (2004). Convex Optimization. Cambridge Uni >

versity Cambridge , UK.


Burges , C. J. C. (1998). "A tutorial on support vector machines for pattern
recognition." and Knowledge Discovery, 2(1):121-167.
Chang , C.-C. and C.-J. Lin. (2011). "LIBSVM: A library for support vector
machines." ACM Transactions on Intelligent Technology , 2(3):
27.
Cortes , C. and V. N. Vapnik. vector networks."
20(3):273-297.
and J. Shawe-Taylor. (2000). An Introduction to Vector
and Other Learning Methods. Cambridge University
Press , Cambridge , UK
Drucker , H. , C. J. C. Burges , L. A. J. Smola , and V. Vapnik. (1997)
"Suppurt vector machines." In in Neural
Processing Systems 9 (NIPS) (M. C. Mozer , M. I. Jordan , and T. Petsche ,
eds.) , 155-161 , MIT Press , Cambridge , MA
Fan , R. -E. , K. -W. Chang , C.-J. Hsieh , X.- R. Wang , and C.-J. Lin. (2008).
"LIBLINEAR: A library for large linear Journal 0] Machine
Learning Research, 9:1871-1874.
Hsieh , C.-J. , K. -W. Chang , C.-J. S. S. Keerthi , and S. Sundararajan.
(2008). "A dual coordinate descent method for large-scalc lincar SVM."
In Proceedings of the 25th International Con]erence on Machine Learning
408-415 , Helsinki ,
and C.-J. Lin. of methods for multi-class
support vector IEEE
143

Joachims , T. (1998). "Text with vector Learn-


ing with many relevant features." In of the 10th
ference on Machine Learning Chemnitz , Germany.
Joachims , T. (2006). "Training linear SVMs in linear time." In Proceedings of
the 12th ACM SIGKDD International Conference on Knowledge Discovery
(KDD) , Philadelphia, PA.
Lanckriet , G. R. G. , N. Cristianini , and M. 1. Jordan P. Bartlett , L. El Ghaoui.
(2004). the kernel matrix with semidefinite Jour-
nal of Machine 5:27-72.
Osuna , E. , R. Freund , and F. Girosi. (1997). "An improved
for support vector machines." In Proceedings ofthe IEEE Workshop on Neu-
for Processing (NNSP) , 276-285 , Amelia Island , FL.
Platt , J. (1998). "Sequential minimal optimization: A fast algorithm for train-
support vector machines." Technical Report MSR-TR-98-14 , Microsoft
Research.
Platt , J. (2000). "Probabilities for (SV) machines." In Advances in Mar-
gin (A. Smola, P. Bartlett , B. Schölkopf, and D. Schuurmans ,
MIT Press , Cambridge , MA
Rahimi , A. Recht. features for large-scale kernel m a-
chines." In Advances in Neural Information Processing 20 (NIPS)
(J.C. Platt , D. Koller , Y. Singer , Roweis , eds.) , 1177-1184 , MIT Press ,
Cambridge , MA.
C. J. C. Burges , and A. J. Smola, eds. (1999). Advances in Kernel
Methods: Support Vector MIT Cambridge , MA.
Schölkopf, B. and A. J. Smola, eds. (2002). Kernels: Vec-
tor Beyond. MIT Press , Cam-
bridge , MA.
Shalev-Shwartz , S. , Y. N. Srebro , and A. Cotter. (2011). "Pegasos: Pri-
mal estimated sub-gradient solver for SVM." Mathematical Programming ,
127(1):3-30.
Smola , A. J. and B. Schölkopf. (2004). "A tutorial vector regres-
Computing, 14(3):199-222.
144

1. W. , J. T. Kwok , and P. Cheung. (2006). "Core vector machines:


Fast SVM training on very large data sets." Journal of Machine

Tsochantaridis , 1., T. Joachims , T. Hofmann , and Y. Altun. (2005). "Large


margin methods for structured and interdependent output variables." Jour-
of Machine Learning Research,
Vapnik , V. N. (1995). The Nature of 8tatistical Springer , New
York , NY.
Vapnik , V. N. (1998). Learning Theory. Wiley, New York , NY.
Vapnik , V. N. (1999). "An 1EEE
on Neural Networks ,
Vapnik , V. N. and A. J. (1991). "The necessary and sufficient
conditions for consistency of the method of empirical risk." Pattern Recog-

Williams , C. K. and M. Seeger. (2001). "Using the Nyström method to speed


up kernel machines." In Processing 8ystems
13 (NIP8) (T. K. Leen , T. G. Dietterich, and V. Tresp , eds.) , 682-688 , MIT
Press , Cambridge , MA.
Yang , T.-B. , Y.-F. Li , M.
method vs Fourier features: A theoretical and empirical
son." In Advances in Neural 1nformation Processing 8ystems 25 (NIP8) (P.
Bartlett , F. C. N. Pereira , C. J. C. L. Bottou , and K. Q. Weinberger ,
eds.) , MIT Press , Cambridge , MA.

based on risk A ls
145

N. Vapnik ,

"Nothing is
theory."
decision

=
I
loss)
(conditional risk)
(risk)

Let λ_ij be the loss incurred by classifying an example whose true class is c_j as c_i. The expected loss of labeling x as c_i, the conditional risk, is

R(c_i | x) = Σ_{j=1}^{N} λ_ij P(c_j | x),                                      (7.1)

and the overall risk of a decision rule h is

R(h) = E_x[ R(h(x) | x) ].                                                     (7.2)

Minimizing R(h) amounts to choosing, for every x, the label with the smallest conditional risk, giving the Bayes decision rule

h*(x) = argmin_{c∈Y} R(c | x);                                                 (7.3)

h* is the Bayes optimal classifier and R(h*) the Bayes risk, so 1 − R(h*) is the best accuracy any classifier can achieve. If the goal is to minimize the misclassification rate, take

λ_ij = 0 if i = j; 1 otherwise,                                                (7.4)

so that

R(c | x) = 1 − P(c | x),                                                       (7.5)

and the Bayes optimal classifier becomes

h*(x) = argmax_{c∈Y} P(c | x).                                                 (7.6)

P(c | x) can be estimated directly (discriminative models) or by first modeling the joint distribution P(x, c) (generative models), in which case

P(c | x) = P(x, c) / P(x),                                                     (7.7)

and, by Bayes' theorem,

P(c | x) = P(c) P(x | c) / P(x).                                               (7.8)

P(æ I
(likelihood);

I
I c).

I
7.2 149

I I

2005;
Likelihood

P(Dc = rr (7.9)

= log P(Dc I Oc)


, (7.10)
150

ax L L( QUC)
PAV
C
a?4
(7.11)

rv N(!-"c ,

(7.12)

(7.13)

{I'!

I
I
Bayes
Naive Bayes adopts the attribute conditional independence assumption: given the class, the attributes are mutually independent. Eq. (7.8) then becomes

P(c | x) = (P(c) / P(x)) Π_{i=1}^{d} P(x_i | c),                               (7.14)

where d is the number of attributes and x_i the value of x on the i-th attribute. Since P(x) is the same for every class, the naive Bayes classifier is

h_nb(x) = argmax_{c∈Y} P(c) Π_{i=1}^{d} P(x_i | c).                            (7.15)

With sufficient i.i.d. samples, the class prior is estimated by

P(c) = |D_c| / |D|,                                                            (7.16)

and, for a discrete attribute, the class-conditional probability by

P(x_i | c) = |D_{c,x_i}| / |D_c|,                                              (7.17)

where D_{c,x_i} is the subset of D_c taking value x_i on the i-th attribute. For a continuous attribute one may assume p(x_i | c) ∼ N(μ_{c,i}, σ²_{c,i}) and use

p(x_i | c) = (1 / (√(2π) σ_{c,i})) exp( −(x_i − μ_{c,i})² / (2σ²_{c,i}) ).      (7.18)
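A minimal sketch of Eqs. (7.15)-(7.17) for discrete attributes only, using plain frequency estimates (a zero count would wipe out a whole product; the Laplacian correction discussed below, Eqs. (7.19)-(7.20), is the standard remedy).

```python
from collections import Counter, defaultdict

def train_nb(X, y):
    m = len(y)
    prior = {c: cnt / m for c, cnt in Counter(y).items()}        # Eq. (7.16)
    cond = defaultdict(Counter)                                   # (class, attr index) -> value counts
    for xi, c in zip(X, y):
        for i, v in enumerate(xi):
            cond[(c, i)][v] += 1
    return prior, cond, Counter(y)

def predict_nb(x, prior, cond, class_count):
    best_c, best_p = None, -1.0
    for c, pc in prior.items():
        p = pc
        for i, v in enumerate(x):
            p *= cond[(c, i)][v] / class_count[c]                 # Eq. (7.17)
        if p > best_p:
            best_c, best_p = c, p                                 # Eq. (7.15)
    return best_c

X = [('a', 'x'), ('a', 'y'), ('b', 'y'), ('b', 'x')]
y = ['+', '+', '-', '-']
model = train_nb(X, y)
print(predict_nb(('a', 'x'), *model))
```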

0.460 ?

17
152

1 c):

1 .... ( 0

1 ..... ( (0.697 -
0.195 ….t-' \ 2.0.195 2 ) -

-
V2ir. 0.101 \ 2.0.101 2 )

1 \ _. (\ fìD t:.'

)'
7.3 153

X 10- 5 .

> 6.80 x

To keep an unseen attribute value from zeroing out the whole product, the estimates are smoothed with the Laplacian correction. Let N be the number of possible classes and N_i the number of possible values of the i-th attribute; then

P̂(c) = (|D_c| + 1) / (|D| + N),                                                (7.19)
P̂(x_i | c) = (|D_{c,x_i}| + 1) / (|D_c| + N_i).                                (7.20)

17 + 2
154

3+1

0+1

(lazy

(semi-naïve Bayes

(One-Dependent

P(c I rr P(Xi I (7.21)

I
7.4 155

SPODE (Super-Parent

(a) NB (b) SPODE (c) TAN

TAN (Tree Augmented naÏve Bayes) [Fr iedman et a l.,


weighted

mutual information)

n/__ 1_\'-_ P(Xi' Xj I c)


I(xi , Xj I y) = ) P(Xi , Xj I c)log (7.22)
z;f7cU P(zz|c)P(Z31c)'

I(xi , Xj ;

Xj I

AODE (Averaged One-Dependent Estimator) [Webb et


156

P(c 1 rr P(Xj , (7.23)

m' [Webb
et al. , 2005].
c,

P(C , Xi) (7.24)


IDI+Ni
+1
P(X.iJ -, _.'/
1
I
(7.25)

6+1

y,

(belief
Acyclic
7.5 157

Probability

FB ( Xi

= , Xd

l 7ri) = (7.26)
i=l i=l

P ( Xl , X2 , X3 , X4 , X5) = P(Xl)P(X2) P (X3 1 Xl)P(X41 Xl , X2 ) P ( X5 I X2) ,

j_ X4 I X2.
158

(common

= . (7.27)

(marginal
(marginal-
ization)

j_ X4 I Xl

y j_ z I

(direct
ed)

1988].

(moral
(moralization) [Cowell et a l., 1999].

=
7.5 159

I Xl , X3 1- X2 I X5

X3 1- X5 I

(Minimal Description

Itp
= =

s(B I D) = f(O)IBI- LL(B I D) , (7.28)


160

LL(B I D) (Xi) (7.29)

s(B I
= (Akaike Information

AIC(B I D) = IBI - LL(B I D) . (7.30)

= (Bayesian
Information

D) I D) . (7.31

=
s(B I I

(7.32)

et
7.5 161

(evidence).

=
= q I E = e) ,
=

=
=

P(Q =q I E (7.33)

(random
(Markov

P(Q I IE =
= q I E = e).
162

= (G , 8);

1: nn = 0
2: qO
3: for t = 1, 2, . . . , T do
4: for do
5: Z = E U Q \ {Qi};
6: z=euqt-l\{qf-l};
7: IZ = z)j
8: Z
9: qt
10: end for
11: if qt = q then
12: nq = nq + 1
13: end if
14: end for
P(Q = q I E =

"1"

7.6

(latent

LL(8 I X , Z) =lnP(X , Z I 8) . (7.34)


ln(.) .
7.6 163

(marginal likelihood)

LL(8 I X) = I .Lz P(X , Z I 8) . (7.35)

EM (Expectation- et al.,

P(Z I

I
IX,

Q(8 I 8 t ) = lE z lx ,8 t LL(8 I X , Z) . (7.36)

= argmax Q(8 I 8 t ) . (7.37)


@

1983].

descent)
164

[Domingos and pazzani , 1997; Ng and Jordan,

and pazzani ,

1998] , [McCallum and Nigam ,

1991].
[Friedman et al. , [Webb et al.,
(lazy Bayesian Rule) [Zheng and Webb ,

[Kohavi ,
(Bayesian

2006].

J.

1990; Chickering et
[Friedman and Goldszmidt ,
and Domingos ,
1997; Heckerman , 1998].
7.7 165

mÎxture

and Krishnan , 2008].


166

7.1

7.2*

7.3

7.4
P(Xi I

7.5

7.6

7.7

x 2=
.

7.8
y ..l z I

7.9

7.10
167

Bishop , C. M. (2006). Pattern Recognition and Machine Learning. Springer ,


New York , NY.
M. , D. Heckerman , and C. Meek. (2004). 1earning
of Bayesian networks is NP-hard." Journal of Machine Learning
5:1287-1330.
Chow , C. K. and C. N. Liu. (1968). probabi1ity distri-
butions with dependence trees." IEEE on Theory ,
14(3):462-467.
Cooper , G. F. (1990). "The computatíona1 comp1exity ofprobabilistic inference
belief networks." lntelligence , 42(2-3):393-405
Cowell , R. G. , P. Dawid , S. L. Lauritzen , and D. J. Spiege1ha1ter. (1999). Prob-
Expert Systems. Springer , New York , NY.
Dempster , A. P. , N. M. Laird , and D. B. Rubin. (1977). "Maximum likelihood
from incomp1ete data via the EM a1gorithm." of the Royal
Society - Series B , 39(1):1-38.
P. and M. Pazzani. (1997). "On the optimality of the simp1e
Bayesian classifier under zero-one 10ss." Machine 29(2-3): 103-130.
(2005). and scientists." of the
American Association, 100(469):1-5.
D. Geiger , Go1dszmidt. (1997). "Bayesian network clas-
sifiers." Machine Learning, 29(2-3):131-163.
Friedman , N. and M. Go1dszmidt. (1996). Bayesian networks with
10ca1 structure." In Proceedings of the 12th Conference on Uncer-
in Port1and , O R.
Grossman , D. and P. (2004). network
by maximizing conditiona1likelihood." In Proceedings of the 21st
tional Conference on Machine Learning (ICML) , 46-53 , Banff, Canada.

in Models (M. 1. Jordan , ed.) , 301-354 , K1uwer , Dor-


drecht , The Netherlands.
V. (1997). An to Springer , NY.
168

Kohavi , R. (1996). the accuracy of naive- Bayes classifiers: A


decision-tree hybrid." In Proceedings of the 2nd Conference
on Discovery and (KDD) , 202-207 , Port1and , OR.
r. (1991). "Semi-naive Bayesian classifier." In Proceedings of the
Working Session on Learning (EWSL) , 206-219 , Porto , Por-
tugal.
Lewis , D. D. at forty: The independence assumption in
information retrieval." In Proceedings of the 10th on
Machine Learning (ECML) , 4-15 , Chemnitz , Germany.
A. and K.
In Working Notes of the AAA I'98 Workshop on
Learning for Text Cagegorization , Madison , W r.
McLach1an , G. and T. Krishnan. (2008). The EM Algorithm and Extensions ,
2nd edition. John Wiley & Sons , Hoboken , NJ.
Ng , A. Y. and M. r. Jordan. (2002). "On discriminative vs. generative classifiers:
A comparison of 10gistic regression and naive Bayes." In in N eural
Information Processing Systems 14 (NIPS) (T. G. Dietterich, S. Becker , and
Z. 841-848 , MIT Press , Cambridge, MA
J. (1988). Systems: Networks of
Plausible Inference. Morgan Kaufmann, San Francisco , CA.
Sahami , M. (1996). "Learning limited dependence Bayesian classifiers." In Pro-
ceedings of the 2nd Conference on Knowledge Discovery and
(KDD) , 335-338 , Portland , OR.
Samaniego , F. J. (2010). A of the and Prequentist Ap-
to Estimation. Springer, New York , NY.
Webb , G. , J. Boughton , and Z. Wang. (2005). "Not so naive Bayes: Aggregating
estimators." Machine 58(1):5-24
Wu , C. F. Jeff. (1983). convergence the EM algorithm."
of 11(1):95-103.
Wu , X. , V. Kumar , J. R. Ghosh , Q. Yang , H. G. J. M-
cLachlan, A. Ng , B. Liu , P. S. Yu , Z.-H. Zhou , M. Steinbach, D. J. Hand ,
and D. Steinberg. (2007). "Top 10 algorithms in data mining."
169

Systems, 14(1) :1-37


Zhang, H. (2004). "The optimality of naive Bayes." In Proceedings of the
17th International Intelligence Society Confer-
ence (FLAIRS) , 562-567, Miami , FL.
Z. and G. I. Webb. (2000). "Lazy learning of Bayesian rules." Machine
Learning, 41(1):53-84.

Bayes ,

,
(individual

(base learner) ,
(base learning

(component
172

× h1 X h1 × ×
h2 × hz v' × hz × ×
ha v' × d h3 V V x ha × ×
× ×

(8.1)

/ T \
H(x) = sign I ) (8.2)
8.2 Boosting 173

. (8.3)

(Random Forest).

8.2 Boosting

[Freund and Schapire , 1997] ,


E {-1 , + 1},

AdaBoost can be derived as an additive model: the final classifier is a weighted combination of base learners,

H(x) = Σ_{t=1}^{T} α_t h_t(x),                                                 (8.4)

learned by minimizing the exponential loss function [Friedman et al., 2000]

ℓ_exp(H | D) = E_{x∼D}[ e^{−f(x) H(x)} ].                                      (8.5)
174

= {(X1' . . , (Xm ,Ym)};

1: 1)1(X) = 11m.
2: for t = 1, 2 ,. ..., T do
3: ht = 'c (D , 1)t);

5: if Et > 0.5 then break

7:
Ð ,(æ) " f if ht(x) = f(x)
Z, if
iH-
dj-
n··- Zt

1-3JU-AU
A-A

= _e-H(oo) P (f (æ) = 1 1æ) + eH(oo) P (f (æ) = (8.6)

P (f (x) = 11 æ)
(8.7)
P (f (x) =

,-_ P (f (x) = 11
l(1 ... P (f (x) == -11 æ))j
\2-h
=
1 100) = -11 æ)
-1 , P (f (x) = 1 æ) < P (f (x)
1 1 æ)

= y æ) ,
1 (8.8)

consistent
8.2 175

ℓ_exp(α_t h_t | D_t) = E_{x∼D_t}[ e^{−f(x) α_t h_t(x)} ]
 = E_{x∼D_t}[ e^{−α_t} I(f(x) = h_t(x)) + e^{α_t} I(f(x) ≠ h_t(x)) ]
 = e^{−α_t} P_{x∼D_t}(f(x) = h_t(x)) + e^{α_t} P_{x∼D_t}(f(x) ≠ h_t(x))
 = e^{−α_t} (1 − ε_t) + e^{α_t} ε_t.                                            (8.9)

Setting the derivative

∂ℓ_exp(α_t h_t | D_t) / ∂α_t = −e^{−α_t} (1 − ε_t) + e^{α_t} ε_t               (8.10)

to zero gives the classifier weight

α_t = ½ ln( (1 − ε_t) / ε_t ).                                                 (8.11)

I Ð) =
f(æ)ht(æ)j . (8.12)

= hr(æ) =

Ð) 1_-f(ælH+_ , (æl (,
( 1-
"1_\ 1.. 1_\ , f2(æ)h;(æ )\1
) I

ht(æ) = I Ð)
h
176

=
| I
r • I, (8.14)
h I I

(æ)
Dt(æ) = u. (,.,.\, , (8.15)

I
ht(æ) = I ... r •
,- r - ,-, II

= arg m a.x . (8.16)

f(æ)h(æ) = 1- , (8.17)

ht(æ) ] (8.18)

D(æ) e-f(æ)Ht(æ)
(æ) =
[e-f(æ)Ht(æ)]

1) (æ)

(æ)]
= Dt( æ) . , (8.19)
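A minimal sketch of the AdaBoost procedure given earlier in this section, with labels in {−1, +1}: the base learner here is an assumed one-feature threshold stump (not specified by the text), α_t follows Eq. (8.11), and the sample weights are re-scaled with the exponential-loss update just derived.

```python
import numpy as np

def stump_train(X, y, w):
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1.0, -1.0):
                pred = sign * np.where(X[:, j] <= thr, 1.0, -1.0)
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, T=20):
    m = len(y)
    w = np.full(m, 1.0 / m)                       # D_1: uniform distribution
    ensemble = []
    for _ in range(T):
        err, j, thr, sign = stump_train(X, y, w)
        if err >= 0.5:                            # stop if no better than chance
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))   # Eq. (8.11)
        pred = sign * np.where(X[:, j] <= thr, 1.0, -1.0)
        w *= np.exp(-alpha * y * pred)            # exponential-loss re-weighting
        w /= w.sum()
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def adaboost_predict(X, ensemble):
    score = sum(a * s * np.where(X[:, j] <= t, 1.0, -1.0) for a, j, t, s in ensemble)
    return np.sign(score)                         # H(x) = sign(sum_t alpha_t h_t(x))

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
print(np.mean(adaboost_predict(X, adaboost(X, y)) == y))
```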
8.2 Boosting 177

and Wolpert , 1996] ,

178

8.3

8.3.1 Bagging
Bagging [Breiman ,
Bootstrap
sampling).

Input: training set D = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)};
       base learning algorithm L;
       number of rounds T.
Process:
1: for t = 1, 2, ..., T do
2:    h_t = L(D, D_bs)     // train on a bootstrap sample D_bs of D
3: end for
Output: H(x) = argmax_{y∈Y} Σ_{t=1}^{T} I(h_t(x) = y)
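A minimal sketch of the procedure above: each round draws a bootstrap sample of D, trains a base learner on it, and prediction is by majority vote. The base learner is an assumed one-feature threshold stump; T is kept odd so the {−1, +1} vote cannot tie.

```python
import numpy as np

def stump_fit(X, y):
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1.0, -1.0):
                pred = sign * np.where(X[:, j] <= thr, 1.0, -1.0)
                err = np.mean(pred != y)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best[1:]

def bagging(X, y, T=25, rng=np.random.default_rng(0)):
    m = len(y)
    learners = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)          # bootstrap sample (with replacement)
        learners.append(stump_fit(X[idx], y[idx]))
    return learners

def bagging_predict(X, learners):
    votes = sum(s * np.where(X[:, j] <= t, 1.0, -1.0) for j, t, s in learners)
    return np.sign(votes)                          # majority vote

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)
print(np.mean(bagging_predict(X, bagging(X, y)) == y))
```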
8.3 179

T (0 (m) +0

[Zhou. 2012].

estimate) [Breiman , 1996a;


Wolpert and Macready ,

T
Hoob(æ) = = y) . lI (æ 1:. Dt) , (8.20)
t:l

Eoob . (8.21)

[Breiman ,
180


=
= = log2 d
[Breiman ,

8 .4 181

Bagging

/
.!
/ < h3 . .
.f ,/ . f \"
• h1 :
..,,' '"1

2000]

averaging)

H(x ) (8.22)
182

• averaging)

T
(8.23)

Breiman T

1952J , [Perrone and Cooper ,

et a l., 1992; Ho et a l., 1994;


Kittler et al.,

voting)

T N T
if >
(8.24)
otherwise.

voting)

H(x) = hi(æ) . (8.25)


8.4 183

yoting)

H(x) = . (8.26)

h;
(hard voting).

h; I
(soft voting).

scaling) [Platt , regression) [Zadrozny and


Elkan , 2001

Stacking [Wolpert , 1992;


184

= {(æl ,

1: for t = 1 , 2,… , T do
2: ht =
3: end for
4:
5: for i = 1, 2,..., m do
6: for t = 1, 2,… , Tdo
7: Zit = ht(æi)j
8: end for
9: D' = D' U
10: end for
11: h' =
H(æ) = h'(h1(æ) , h 2(æ) ,..., hT(æ))

=D\

(Zi1; Zi2; . . . ; ZiT)

Linear
and Witten ,
2002].
8.5 185

Model A
[Clarke ,

:
(arnbigui ty
I æ) = , (8.27)

A(h I æ) = æ)

(æ) - H (æ))2 (8.28)

æ) = (1 (æ) - hi(æ))2 , (8.29)

E(H I æ) = (1 (æ) - H(æ))2 . (8.30)

I æ) E(hi I

= LWiE(hi I æ) -E(H I æ)
i=l
= E(h I æ) - E(H I æ) . (8.31)
186

Ei =J x)p(x)dx , (8.33)

Ai =J x)p(x (8.34)

E(H).
E =J E(H I (8.35)

E=E-A. (8.36)

and Vedelsby ,
(error-ambiguity decomposition).
8.5 187

= Y2) , . . . , (X m ,
{-1 ,

hi = -1

b d

measure)

d-Mb+c
1, 8 ,;.; = (8.37)
m

coefficient)

- bc
(8.38)
c)(c + d)(b + d)

• Q-statistic )
bc (8.39)

p-1 qa-Ba
= (8 .40)

Pl (8 .41)
m

P2 (8 .42)
m2
188

0 .40 0.40

0.35 0.35

0.30 0.30

0 .25
v
'

.......

0 .1 5

0. 10 r
nuo A 06
U
0 0 .2 0.4 0.6 0.8 - 0.2
( ) B F
(a) D

base
8.5 189

= {(Xl , Yl) , ,… , (Xm , Ym)};

1: for t = 1, 2, . . . ,T do

3: Dt =
4: ht = 5:. (D t )
5: end for
H(x) = (MapFt (x))
yEY

(Flipping Output) [Breiman ,


(Output
Smearing) [Breiman ,
and
Bakiri,
190

(Negative Correlation) [Liu and Yao ,

[Kuncheva, 2004; Rokach ,


[Schapire and
and Valiant ,

and Schapire ,

Mult iBoosting
[Webb ,

(statistical view) [Fried-


man et

[Demiriz et a l.,
and Wyner ,

(margin theory) [Schapire et a l.,


8.6 191

2014].

of
and Whitaker , 2003; Tang et a l.,

2012]

(selec-
tive [Rokach , 2010a]; [Zhou et a l.,

(ensemble selec- 2012]


tion).

[Zhou ,

and
2012]
192

8.1

k
P(H(n)";; k) (8 .43)
i=O

> 0, k = (p -

P(H (n) ,,;; (p - 8) n) ,,;; e- 282n . (8 .44)

8.2
-
(8 >
8.3

8.4 GradientBoosting [Friedman ,

8.5

8.6

8.7

8.8
Iterative

8.9*
193

Breiman, L. (1996a). "Bagging predictors." Learning, 24(2):123-140.


Breiman, L. (1996b). regressions." Machine 24(1):49 64.

Breiman , L. (2000). to increase prediction accuracy."


Machine 40(3):113-120.
Breiman , L. (2001a). "Random Machine Learning, 45(1):5-32.
Breiman, L. iterated bagging to debias Machine
Learning, 45(3) :261-277.
C1arke , B. (2003). mode1 averaging and stacking when mod-
e1 approximation error cannot be ignored." Journal 01 Machine Learning
4:683-712.
A. , K. P. Bennett , and J. Shawe-Tay1or. (2008).
Boosting via co1umn generation." Learning, 46(1-3):225-254.
Dietterich , T. G. (2000). "Ensemb1e methods In Pro-
ceedings 01 the 1st on Multiple Systems
(MCS) , 1-15 , Cagliari , Italy.
Dietterich, T. G. and G. (1995). "Solving multiclass 1earning problems
via error-correcting output codes." Journal 01 Artificial Intelligence Re-
2:263-286.
Y. and R. E. Schapire. generalization of
on-line learning and an application to boosting." Journal 01 Computer and
System Sciences ,
Friedman , J. , T. Hastie , and R. TibshiranL (2000). "Additive logistic
sion: A statistical view of boosting (with discussions)." Annals 01
28(2):337 407.

Friedman, J. H. (2001). "Greedy function approximation: A gradient Boosting


machine." Annals 01 Statistics , 29(5):1189-1232.
Ho , T. K. (1998). "The random subspace method for constructing decision
forests." IEEE on Pattern Machine Intelligence ,
20(8):832-844.
Ho , T. K., J. J. Hull , and S. N. Srihari. (1994). in mu1ti-
194

ple classifier systems." IEEE on Pattern Machine


Intelligence , 16(1) :66-75.
Kearns , M. and L. G. Valiant. (1989). "Cryptographic limitations on learni:rig
Boolean formulae and finite automata." In Proceedings 01 the 21st Annual
ACM on Computing (STOC) , 433-444 , Seattle, WA.
Kittler , J. , M. Hatef, R. and J. Matas. (1998). "On combining classifiers."
IEEE on Pattern Analysis and Machine Intelligence , 20(3):
226-239.
Kohavi , R. and D. H. Wolpert. (1996). "Bias plus variance decomposition for
zero-one 10ss functions." In Proceedings 01 the 13th Conference
on Machine , 275-283 , Bari , ltaly.
Krogh , A. and J. Vedelsby. (1995). "Neural network ensemb1es , cross va1idation ,
and active learning." In in Neural Processing Systems
7 (NIPS) (G. Tesauro , D. S. Touretzky, and T. K. Leen , eds.) , 231… 238 , MIT
Cambridge, MA.
Kuncheva , L. I. (2004). Combining Pattern Classifiers: Methods and Algo-
rithms. J & Sons , Hoboken , NJ.
Kuncheva , L. I. and C. J. Whitaker. (2003). "Measures of diversity in classifi-
er ensemb1es and their re1ationship with the ensemble accuracy." Machine
51(2):181-207.
Liu , Y. and X. YaQ. (1999). via negative
12(10):1399- 1404.
, H. (1952).
A. the statistical view
of boosting (with discussions)." Journal of Machine Learning

Perrone , M. P. and L. N. Cooper. (1993). "When networks disagree: Ensemble


method for neura1 networks." In Artificial Neural Networks for Speech
(R. J. ed.) , Chapman & Hall , New York , NY.
P1att , J. C. (2000). "Probabilities for SV machines." In Advances in
gin (A. J. Smo1a, P. L. Bart1ett , B. Schö1kopf, and D. Schuurmans ,
eds.) , 61 74 , MIT Press , Cambridge , MA.

195

Rokach , L. Arlificial Intelligence


33(1):1-39.
Rokach , L. (2010b). Pattem Ensemble Methods. World
Scientific , Singapore.
Schapire, R. E. (1990). "The strength ofweak learnability." Machine Leaming,
5(2):197 227.
Schapire , R. E. and Y. Freund. (2012). Boosting: Algorithms.
M1T Press , Cambridge , MA.
Y. P. Bartlett , and W. S. Lee. (1998). "Boosting the
margin: explanation for the effectiveness of voting methods."
01 26(5):1651-1686.
Seewald , A. K. (2002). "How to make Stacking better and faster while also tak-
ing care of an unknown weakness." 1n of the 19th Intemational
Conference on Machine 554-561 , Sydney, Australia.
E. K., P. N. Suganthan, and X. (2006). of diversity
measures." Machine Leaming, 65(1):247-27 1.
Ting , K. M. and 1. H. Witten. (1999). stacked Jour-
of Intelligence 10:271-289.
Webb , G. 1. (2000). "Mult iBoosting: A technique for combining boosting and
wagging." Machine
H.
260.
Wolpert , D. H. and W. G. Macready. (1999). "An method to estimate
Bagging's M achine 35(1):41-55.
Xu , L. , A. Krzyzak , and C. Y. Suen. (1992). "Methods of combining multiple
classifiers and their applications to recognition." IEEE
on Systems,
Zadrozny, B. and C. Elkan. (2001). probability esti-
mates from decision trees and naÏve Bayesian classifiers." 1n Proceedings of
the Conference on Leaming (ICML) , 609-616 ,
Williamstown , MA.
Zhou , Z.-H. (2012). Ensemble Methods:
196

Chapman &HalljCRC , Boca Raton , FL.


Zhou , Z.-H. (2014). margin distribution learning." In Proceedings of the
6th IAPR International Workshop on Neural Networks in Pattern
Recognition (ANNPR) , 1-11 , Montreal, Canada.
Zhou, Z.-H. and Y. (2004). ensemble based C4.5."
IEEE on
Zhou , Z.-H. , J. Wu , and W. Tang. (2002). "Ensembling neural networks: Many
could be better than all." Intelligence, 137(1-2):239-263.

Breiman,
(unsupervised

ty
(anomaly
(clustering).

=
= (X í1; Xí2;... ;
Il = = ø
= U7=1
(cluster

(validity
198

(intra-cluster (inter-cluster

(reference (external
(internal
index).

= = {Cl ,
= {q , 02 ,...,

88 = ((æi , æj) i < j)} , (9.1)


b =18DI , 8D = (9.2)

D8 = {(æi , æj)

d =IDDI , DD = {(æi , Xj) < j)} , (9.4)

(i <
c+ d =

• J accard

(9.5)

• and Mallows

(9.6)
9.3 199

m(m -1) . (9.7)

ICI(IOI - , (9.8)

diam(C) = dist(Xi , Xj) , (9.9)


= dist(Xi , Xj) , (9.10)
dcen(Ci , Cj ) = , (9.11)

dmin(Ci ,

• Bouldin

\
I ) . (9.12)

• )

DI = _min. ir ( ) • (9.13)
l-'Tii: J

(distance

dist(Xi , Xj) 0; (9.14)

= Xj ; (9.15)
200

= dist(xj , Xi) ; (9.16)

dist(Xi , Xj) dist(Xi , Xk) + . (9.17)

For samples x_i = (x_i1; x_i2; ...; x_in) and x_j = (x_j1; x_j2; ...; x_jn), the most common choice is the Minkowski distance

dist_mk(x_i, x_j) = ( Σ_{u=1}^{n} |x_iu − x_ju|^p )^{1/p}.                      (9.18)

For p = 2 it becomes the Euclidean distance

dist_ed(x_i, x_j) = ||x_i − x_j||₂ = √( Σ_{u=1}^{n} |x_iu − x_ju|² ),            (9.19)

and for p = 1 the Manhattan distance (city block distance)

dist_man(x_i, x_j) = ||x_i − x_j||₁ = Σ_{u=1}^{n} |x_iu − x_ju|.                 (9.20)
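A minimal sketch of Eqs. (9.18)-(9.20): the Minkowski distance and its p = 2 / p = 1 special cases.

```python
import numpy as np

def minkowski(xi, xj, p=2):
    # Eq. (9.18); p=2 is Euclidean (9.19), p=1 is Manhattan (9.20).
    return np.sum(np.abs(xi - xj) ** p) ** (1.0 / p)

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(a, b, p=2),   # Euclidean: 5.0
      minkowski(a, b, p=1))   # Manhattan: 7.0
```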

(continuous
(categorical
(numerical

(nominal attribute)

(ordinal
(non-ordinal
attribute)
(Value Difference Metric) [Stanfill and Waltz ,

(9.21 )
9.3 201

MinkovDMp(Xi , X j) = IL IXiu - Xju lP + VDMp(Xiu , Xju) I


\u= l u=nc+ l /
(9.22)

(weighted

distwmk(X'i , Xj) = IXil - Xj l lP + '"

0 (i =

(similarity

(distance metric

< d3
202

based

9.4.1

= , xm} , (k- means


= {C1 , C 2 ,...,

E=LL i=l æEGi


(9.24)

LæEGi
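A minimal sketch of the standard k-means iteration for Eq. (9.24) (Lloyd-style alternation between nearest-mean assignment and mean update; the random initialization and iteration cap are assumptions).

```python
import numpy as np

def k_means(X, k, iters=100, rng=np.random.default_rng(0)):
    means = X[rng.choice(len(X), size=k, replace=False)]     # random initial means
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = d.argmin(axis=1)                             # nearest-mean assignment
        new_means = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                              else means[i] for i in range(k)])
        if np.allclose(new_means, means):                     # means stopped changing
            break
        means = new_means
    return means, labels

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)), rng.normal(2, 0.3, size=(50, 2))])
means, labels = k_means(X, k=2)
print(means)
```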

et a l.,

1 0.697 0.460 11 0.245 0.057 21 0.748 0.232


2 0.774 0.376 12 0.343 0.099 22 0.714 0.346
3 . 0.634 0.264 13 0.639 0.161 23 0.483 0.312
4 0.608 0.318 14 0.657 0.198 24 0 .478 0 .437
5 0.556 0.215 15 0.360 0.370 25 0.525 0.369
6 0.403 0.237 16 0.593 0.042 26 0.751 0 .489
7 0.4 81 0.149 17 0.719 0.103 27 0.532 0.472
8 0.437 0.211 18 0.359 0.188 28 0.473 0.376
9 0.666 0.091 19 0.339 0.241 29 0.725 0.445
10 0.243 0.267 20 0.282 0.257 30 0.446 0.459
9.4 203

1Lk}
2: repeat
= ø k)
4: forj = 1 , 2, . . . , m do
5: dji =
6: = d ji ;
7: U{Xj};
8: end for
9: for i = 1 , 2 ,… , k do
10: X;
11:
12:
13: else
14:
15: end if
16: end for
17:
, Ck}

X12 ,

= = (0.343; = (0.532; 0.472)

= (0.697;
0.369 , 0.506 ,

01 = X8 , X9 , X lQ, X13 , X14 , X15 , X17 , X18 , X19 , X20 , X23};

02 = {X l1, X12 , X16};

03 = {X1 , X2 , X3 , X4 , X21 , X22 , X24 , X25 , X26 , X27 , X28 , X29 , X30}'

= (0.473; = (0.394; = (0.623; 0.388) .


204


(Learning Vector

= {(Xl , Yl ), (Xm ,
9.4 205

, (Xm , Ym)}j

2:
3:
4: i d ji = Ilxj -pil12j
5: djij
6: if Yj =
7: (Xj -
8: else
9: p' = (Xj -
10: end if
11:
12:
, Pq}

p' , (9.25)

=
= . (9.26)
206

(Iossy com
(vector
Ri = {x EX Illx -PiI12:( . (9.27)

(Voronoi tessellation).

Cl, C2 , C2 , Cl , Cl.

X12 , X18 , æ23 ,

0.283 , 0.506 , 0 .434 , 0.260 ,

p' (Xl - P5)

= (0.725; 0.445) + 0.1 . ((0.697; - (0.725; 0 .445))

= (0.722; 0 .442) .

IEI: (9.28)
E- 1 :
9.4 207


= Cl ,

PM (
PM(X) . p (x , (9.29)

>
coefficient) =1
208

=
2 ,...
P(Zj =

P (Zj = i) . PM(Xj i)
PM(Zj =i 1 Xj) =
PM(Xj)

. p(Xj :E i)
(9.30)
p(Xj :E l)

(i=1 , 2 ,..., k).

. (9.31)

:Ei)

\Ill-/'PA
LL D -qJ

\-atfF/

m 2 QU qdq,"

J.Li , :E i )

:E i ) = 0 , (9.33)
:E 1)

= PM(Zj
9.4 209

(9.34)

= j=1 (9.35)

0, =

(9.36)

p(Xj (9.37)
p(Xj

(9.38)

k}
210

}";i) 1 1 ,,;; i ,,;; k}


2: repeat
3:
4:
for j =, ,...,
1 2 m do

= i 1æj) (1";; i ,,;; k)


5: end for
6: for i = 1, 2 ,..., k do
7:

8:

9: •

10:
11: 11 ,,;; k}
12:
13: Ci ,,;;

14: for j = 1, 2,... , m do


15:
16: C>'j = C>'j U{æj}
17: end for

=
= :1: 6 ,

= 0.219 , 1'12 = =

0.361 , a; = = 0.316

= (0 .4 91; = (0.534; 0.295)

21=(::;;;;:;)Ji=(;:;:::17) z•(:;;;;;:;) ,
9.5 211

=

(density-based

5- (neigh-
patial Clustering of Appli-
cations with Noise"
D=

= I : : ; E};
212

= Xi , Pn

0 0...- - 0
h XA

OY\/21 , , \ ,J , J d )

(9.39)

(9.40)

(seed) ,
9.5 213

MinPts).

2: for do
3:
4: if ;;;:: MinPts then
5: 0 = OU{Xj}
6: end if
7: end for

f=D
10: while do
11: f o1d = f;
12: =< 0>;
13: f = f \ {o};
14: while do
15:
16: if then

18:
19: f =f \ t:l i
20: end if
21: end while
22: k= k+ = f o1d \f;
23: O=O\Ck
24: end while
= {Cl, C 2 ,..., Ck}

0.11 , MinPts = 5.
D= X9 , X13 , X14 , X18 , X19 , X24 ,

01 = {X6 , X7 , X8 , X lO, X12 , X18 , X19 , X20 , X23}'

D = D \ 01 =
X13 , X14 , X24 , X25 , X28 ,
214


= 0.11 , M inPts =
"0"

C2 = X17 , X 21} ,
C3 = ;
C4 = X 27 , X28 , X 30 } .

AGNES (AGglomerative NESting) is a bottom-up hierarchical clustering algorithm: it starts with every sample as a singleton cluster and repeatedly merges the two closest clusters until the desired number of clusters remains. The distance between clusters can be measured by the minimum or maximum inter-sample distance (the Hausdorff distance of Exercise 9.2 is another option):

d_min(C_i, C_j) = min_{x∈C_i, z∈C_j} dist(x, z),                               (9.41)
d_max(C_i, C_j) = max_{x∈C_i, z∈C_j} dist(x, z).                               (9.42)

= {X1 , X2 , … , Xm};
dmax

1: for j = 1, 2,…, mdo


2:
3: end for
4: for i = 1, 2, . . . , m do
5: for j = 1, 2,…, mdo
6: M(i , j) =
7: M (j, i) =
8: end for
9: end for
q=m
11: while q > k do
j* 12:
13:
14: for j = j* + do
15:
16: end for
17:
18: for j = 1, 2, .. . , q - 1 do
19:
20: j)
21: end for
22: q = q-1
23: end while
= {C1 , C 2 ,..., Ck}
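A minimal sketch of the AGNES procedure above: start with singleton clusters and repeatedly merge the closest pair until k clusters remain, using complete linkage (d_max of Eq. (9.42)) as the cluster distance. The quadratic pairwise search is kept naive for clarity.

```python
import numpy as np

def agnes(X, k):
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])   # d_max, Eq. (9.42)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])      # merge the closest pair
        del clusters[b]
    return clusters

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 0.2, size=(10, 2)), rng.normal(3, 0.2, size=(10, 2))])
print(agnes(X, k=2))
```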

AGNES
216


C1 = {æ !, æ26 , æ2g}; C2 = {æ2 , æ3 , æ4 , æ21 , æ22};


C3 = {æ23 , æ24 , æ28 , æ30}; C4 = {æ5 , æ7};

C5 = {æ9 , æ13 , æ14 , æ16 , æ17}; C6 = æ18 ,

C7 = {æ11 , æ12}.
9.7 217


= 7, 6 , 5,
218

Jain 1999; Xu and Wunsch II , 2005;

silhouette width)
1988; Halkidi et a1., 2001; Maulik and
2002].

[Deza and Deza , 2009]. and


et a1.,
2000; Tan et a l.,
al.,

[Jain and Dubes , 1988; Jain , 2009].


and Rousseeuw ,
Fuzzy C -means [Bezdek , 1981]
(soft

k-
[Schölkopf et al., clustering) [von Luxburg ,
et al.,

[Pelleg and Moore , 2000; Tibshirani et

[Kohonen , 2001]. [McLachlan and Peel ,


1998; Jain and Dubes , 1988].

[Ester et a l., [Ankerst et a1.,


9.7 219

[Hinneburg and AGNES

DIANA [Kaufman and Ro usseeuw ,

[Zhang et a l., [Guha et al. ,

detection detection) [Hodge and Austin , 2004; Chandola et

et al., 2012].
220

9.1

lim_{p→∞} ( Σ_{u=1}^n |x_iu − x_ju|^p )^{1/p} = max_u |x_iu − x_ju| .


J
9.2
(Hausdorff distance):

dist_H(X, Z) = max( dist_h(X, Z), dist_h(Z, X) ) ,    (9.44)

where

dist_h(X, Z) = max_{x∈X} min_{z∈Z} ‖x − z‖₂ .    (9.45)

9.3
9.4

9.5

9.6
9.7

9.8
9.9*
9.10*
221

Aloise , D. , A. Hansen , and P. Popat. (2009). "NP-hardness of


Euclidean sum-of-squares clustering." Machine Learning,
Ankerst , M. , M. Breunig, H.-P. Kriegel , and J. Sander. (1999). "OPTICS: Or-
to the clustering structure." In Proceedings of the
ACM SIGMOD on of
MOD) , 49-60 , Philadelphia, PA.
S. 1. and J.

Bezdek , J. C. (1981). Pattern Recognition with Fuzzy Objective Function Algo-


rithms. Plenum Press , New York , NY.
Bilmes , J. A. tutorial of the EM algorithm and its appli-
cations to parameter estimation for Gaussian mixture and hidden Markov
models." Technical Report TR-97-021 , Department of Electrical Engineering
and Computer Science , University of California at Berkeley, Berkeley, CA.
Chandola , V. , A. Banerjee , and V. Kumar. (2009). "Anomaly detection: A
survey." ACM Computing 41(3):Article 15.
Deza, M. and E. Deza. (2009). Encyclopedia of Springer ,
Dhillon ,1. Y. Guan , and B. Kulis. Spectral
tering and normalized cuts." of the 10th ACM SIGKDD In-
ternational Conference on (KDD) ,
551 556 , Seattle, WA.

Ester , M. , H. P. Kriegel , J. Sander , and X. Xu. (1996). "A density-based algo-


rithm for discovering clusters in large spatial databases." In Proceedings of
the 2nd International Conference on Knowledge Discovery and
(KDD) , Portland , OR
V. (2002). "Why so many clustering - a position
paper." SIGKDD Explorations ,
R. Rastogi , and K. Shim. A robust clustering al-
gorithm for categorical attributes." In Proceedings of the 15th
Conference on Sydney, Australia.
Halkidi , M. , Y. Batistakis , and M. Vazirgiannis. (2001). "On clustering valida-
222

tion
145.
Hinneburg , A. and D. A. Keim. (1998). "An efficient approach to clustering
large multimedia databases with noise." In Pmceedings of the 4th
tional Conference on K Discovery and (KDD) , 58-65 ,
New York , NY.
Hodge , V. J. and J. Austin. (2004). "A survey of outlier detection methodolo-
gies." A 7t ificial Intelligence
Z. (1998). "Extensions to the k-means algorithm for clustering large
data sets with categorical values." Knowledge Discovery ,
2(3) :283-304.
D. W. , D. Weinshall , and Y. Gdalyahu. (2000). with
non-metric Image retrieval and class representation." IEEE
Transactions on Pattern Analysis and Machine Intelligence , 6(22):583-600.
Jain , A. K. (2009). "Data clustering: 50 years beyond k-means." Pattern Recog-
nition Letters , 31(8):651-666.
Jain , A. K. and R. C. Dubes. (1988). Algorithms for Clustering Prentice
Hall , Upper Saddle River , NJ.
K., M. N. Murty, and P. J. Flynn. (1999). "Data clustering: A review."
ACM Computing 3(31):264-323.
Kaufman , L. and P. J. Rousseeuw. (1987). "Clustering by means of medoids."
In Statistical on the LI-Norm Related Methods (Y.
Dodge , ed.) , 405-416 , Elsevier , Amsterdam , The Netherlands.
Kaufman , L. and P. J. (1990). in An Intro-
duction to Cluster Analysis. & New York , NY.
Kohonen , T. (2001). 3rd edition. Springer , Berlin.
Liu , F. T. , K. M. Ting , and Z.-H. Zhou. (2012). "Isolation-based anomaly detec-
tion." ACM on Knowledge Discovery fmm 6(1):Article
3.
Maulik , U. and S. Bandyopadhyay. (2002). "Performance evaluation of some
clustering algorithms and validity indices." IEEE on
Analysis and Machine Intelligence , 24(12):1650-1654.
223

McLachlan , G. and D. Peel. (2000). Finite Mixture Models. John Wiley & Sons ,
New York , NY.
Mitchell , T. (1997). McGraw Hill , New York , NY.
Pelleg , D. and A. Moore. (2000). "X-means: Extending
estimation of the number of clusters." 1n of the 1 7th
tional Conference on Machine Learning (ICML), 727-734, Stanford, CA.
Rousseeuw , P. J. (1987). "Silhouettes: A graphical aid to the interpretation
and validation of cluster analysis." Journal of Computational and Applied
Mathematics, 20:53-65.
Schölkopf, B., A. Smola, and K.-R. Müller. (1998). "Nonlinear component anal-
ysis as a kernel eigenvalue problem." Neural Computation, 10(5):1299-1319.
Stanfill , C. and D. Waltz. memory-based

Tan , X. , 8. Chen , Z.-H. Zhou , and J. Liu. (2009). under oc-


clusions and variant expressions with partial similarity." IEEE
on Forensics and Security, 2(4):217-230.
G. Walther , and T. Hastie. (2001). "Estimating the number of
clusters in a data set via the gap statistic." of the
Society - Series B , 63(2):411-423.
von Luxburg , U. (2007). "A tutorial on spectral and
Computing , 17 (4) :395 416.

Xing , E. P. , A. Y. Ng , M. 1. Jordan , and 8. Russell. (2003). "Distance metric


learning , with application to clustering with 1n
in Neural Information Processing Systems 15 (NIPS) (8. Becker, 8. Thrun ,
and K. Obermayer , eds.) , 505-512 , MIT Press , Cambridge, MA.
Xu , R. and D. Wunsch II. (2005). "Survey of clustering algorithms." IEEE
Transactions on Neural Networks , 16(3):645-678.
Zhang , T. , R. Ramakrishnan , and M. Livny. (1996). "BIRCH: An efficient data
clustering method for large databases." 1n Proceedings of the ACM
SIGMOD on of
103-114 , Montreal , Canada.
Zhou , Z.-H. (2012). Ensemble Methods: Chap-
224

man & Hall/CRC , Boca Raton , FL.

Zhou , Z.-H. and Y. Yu. (2005). "Ensembling local learners through multimodal
perturbation." IEEE Transactions on Systems, Man, and Cybernetics -
Part B: Cybernetics , 35(4):725-735.

Minkowski ,

- 3) + (33 - 23) =

(Kaunas)
10.1

(lazy

(eager learning) .

[Figure: illustration of the k-nearest-neighbor classifier; dashed circles mark the neighborhoods used for different values of k, and the query point is classified by voting among the neighbors inside the circle.]

k =
226

Consider the 1-nearest-neighbor classifier, and let z denote the nearest neighbor of a test sample x. The probability of error is the probability that x and z carry different labels:

P(err) = 1 − Σ_{c∈Y} P(c | x) P(c | z) .    (10.1)

Assume the samples are i.i.d. and dense enough that z can be taken arbitrarily close to x, so that P(c | z) ≃ P(c | x). Let c* = argmax_{c∈Y} P(c | x) denote the prediction of the Bayes-optimal classifier. Then

P(err) ≃ 1 − Σ_{c∈Y} P²(c | x)
       ≤ 1 − P²(c* | x)
       = (1 + P(c* | x)) (1 − P(c* | x))
       ≤ 2 × (1 − P(c* | x)) .    (10.2)

That is, the generalization error of the nearest-neighbor classifier is at most twice the error of the Bayes-optimal classifier [Cover and Hart, 1967].
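As a concrete illustration (not from the book), a k-nearest-neighbor classifier needs only a distance computation and a majority vote; the toy data below are assumed.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=1):
    """Classify x by majority vote among its k nearest training samples."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

# illustrative usage on random data
X_train = np.random.rand(100, 2)
y_train = (X_train[:, 0] + X_train[:, 1] > 1).astype(int)
print(knn_predict(X_train, y_train, np.array([0.4, 0.7]), k=5))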

The argument above relies on a dense-sample assumption. Suppose we require a training sample within distance δ = 0.001 of every possible test point along each attribute: with a single attribute on [0, 1] about 1000 samples suffice, but with 20 attributes roughly (10³)²⁰ = 10⁶⁰ samples would be needed.
10.2 227

[Bellman , 1957]
(curse of
dimensionality) .
red uction)

(Mult iple Dimensional [Cox and


Cox ,

Z E ]R d'x m , d'
- zj ll = distij .

dist;j = IIzil12 + IIzj l1 2- 2z; Zj

= bii + bjj - 2bij (10.3)


228

bij = bij

+ mbjj , (10 .4)

(10.5)
j=1
m m
nunhu
= ,
i=1 j=1

tr(B)

(10.7)

(10.8)

(10.9)

decomposition) , B =
A = ;;:: ... V

nu
Z= Xm
-EA
10.3 229

. (10.12)

<<- d.
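The MDS derivation above reduces to: double-center the squared distance matrix to recover the inner-product matrix B, eigendecompose it, and keep the d' largest eigenvalues. Below is a minimal NumPy sketch (rows of the returned Z are samples, i.e. the transpose of the convention used in the text); the toy data are illustrative.

import numpy as np

def mds(D, d_prime):
    """Classical MDS: from an m x m distance matrix D to d'-dimensional coordinates."""
    m = D.shape[0]
    D2 = D ** 2
    # double centering recovers the inner-product matrix B
    J = np.eye(m) - np.ones((m, m)) / m
    B = -0.5 * J @ D2 @ J
    w, V = np.linalg.eigh(B)                  # eigendecomposition of B
    idx = np.argsort(w)[::-1][:d_prime]       # d' largest eigenvalues
    w, V = np.clip(w[idx], 0, None), V[:, idx]
    return V * np.sqrt(w)                     # coordinates Z (rows are samples)

X = np.random.rand(20, 5)                     # illustrative high-dimensional data
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
Z = mds(D, d_prime=2)
print(Z.shape)                                # (20, 2)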

z=wTx , (10.13)

Z E

(i =1

Component
230

=
IIwil12 = 1, =0
( <
Zi2;...;

= TWTXi + const

(W T (10.14)

Xi X ;

tr (WTXXTW) (10.15)

s.t.VVTVVzI

tr (WTXXTW) (10.16)

S.t. WTW=I ,
10.3 231

X2

0.045
2 Xl

, (10.17)

;;:: ... =

æ ij

., Wd"
).
i= l j= l

t. (10.18)
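The optimization (10.15)/(10.16) is solved by the leading eigenvectors of the centered covariance matrix, which gives the usual PCA recipe. A minimal sketch (illustrative data; not the book's implementation):

import numpy as np

def pca(X, d_prime):
    """PCA: project centered data onto the d' leading eigenvectors of the covariance."""
    Xc = X - X.mean(axis=0)                      # center the samples
    C = Xc.T @ Xc                                # proportional to the covariance matrix
    w, V = np.linalg.eigh(C)
    W = V[:, np.argsort(w)[::-1][:d_prime]]      # projection matrix W
    return Xc @ W, W                             # low-dimensional coordinates z = W^T x

X = np.random.rand(100, 8)                       # illustrative data
Z, W = pca(X, d_prime=3)
print(Z.shape, W.shape)                          # (100, 3) (8, 3)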
232

(c)

[Schölkopf et al.,

(10.19)
10 .4 233

, (10.20)

i = 1, 2,. . .

(10.21)

(10.22)

Xj) = . (10.23)

(10.24)

(K)ij A = .

(j = 1, 2, . . .

æ) , (10.25)
234

(manifold

[Tenenbaum et al.,
10.5 235

= {Xl , X2 ,..., Xm};

1: for i = 1, 2,... , rn do
2:
3:
4: end for
Xj);

7: return

Linear [Roweis and Saul,


236

Xk
Xk
z

• Xl

Xi = + WilXI , (10.26)

(10.27)
Jbm
S.t. = 1,

= (Xi - Xj)T(Xi -

kEQi
(10.28)
l ,sEQi

(10.29)
10.6 237

= Zm) E ]Rd'Xm , (W)ij

M = (1 - W? (1 - W) , (10.30)

(10.31)

ZZT = 1.

= {Xl , X2 ,'" , Xm};

1: for i = 1, 2,…, mdo


2:
3:
4: = 0;
5: end for

8: return
=

(distance metric learning)


238

= Ilxi - Xj = +... + , (10.32)

dist!ed(Xi , Xj) = Ilxi - .

= , (10.33)

0, W = (W)ii

distance)
C.

flp M = = - Xj) = Ilxi - , (10.34)

Component
[Goldberger et al.,
10.6 239

(10.35)
3 ,

Pi , (10.36)

= (10.37)

-
(10.38)
i=l

et a l., 2005]

(Xi'

[Xing et a l., 2003]:

mjp - (10.39)

"
240

et a1., 1996];

1997].

[Fisher , [Baudat
and Anouar ,
(Canonical Correlation
[Harden et
view

[Yang et al. ,
[Ye et a 1., Zhou ,
and Bader , 2009].

cian [Belkin and Niyogi ,


Tangent Space [Zhang and Zha ,
Preserving [He and Niyogi ,

et al. ,

[Belkin et al. , 2006]. [Yan et

et a1.,
et
10.7 241

and Saul ,
et al., 2007; Zhan et
et
al.,
[Davis et a l.,
242

10.1

10.2

(" IYI.. err*


- 1 "v" )l .
_____* \
(10 .40)

10.3

10 .4

10.5

10.6

http://vision.ucsd.edu/content
jyale-face-database
10.7

10.8*

10.10
243

Aha , D. , ed. (1997). Kluwer , Norwell , MA.


Baudat , G. and F. Anouar. (2000). "Generalized discriminant analysis using a
Neural
Belkin , M. and P. Niyogi. (2003). "Laplacian eigenmaps for dimensionality re-
duction and data representation." Neural
Be1kin , M. , P. Niyogi , and V. Sindhwani. (2006). "Manifold A
geometric framework for learning from labeled and unlabe1ed examples."
Journal of Machine Research,
Bellman , R. E. (1957). Programming. Press ,
Princeton , N J.
Cover , T. M. and P. E. Hart. (1967). "Nearest neighbor pattern
IEEE on Theory ,
Cox , T. F. and M. A. Cox. (2001). Multidimensional Chapman & Ha1-
l/CRC , UK.
Davis , J. V. , B. P. Jain , S. Sra, and 1. S. Dhillon. (2007). "Information-
theoretic metric 1earning." In Proceedings of the 24th International Conferc
ence on Corvalis , OR.
Fisher , R. A. (1936). "The use of multip1e in taxonomic prob-
1ems." Annals of Eugenics , 188.
Friedman , J. H. , R.
of Co on Aritificial elligence
717 724 , Port1and , OR.

Frome , A. , Y. Singer , and J. Malik. (2007). "Image retrieval and classification


using local distance functions." in N eural Process-
ing Systems 19 (NIPS) (B. J. C. Platt , and T. Hoffman , eds.) ,
MIT Press , Cambridge , MA.
Geng , X. , Zhan , and Z.-H. Zhou. (2005). "Supervised nonlinear dimen-
sionality reduction for visualization and
-P 1107.
J. , G. E. Hinto S. T. R. R.
"N ana1ysis." In in Neural
244

Systems 17 (NIPS) (L. K. Weiss , and L. Bottou , eds.) ,


513-520 , MIT Press , Cambridge , MA.
Harden , D. R., S. Szedmak, and J. Shawe-Taylor. (2004). "Canonical
tion analysis: An overview with application to learning methods." Neural
:2639-2664.
He , X. and P. Niyogi. (2004). "Locality In Advances
in Neural Processing Systems 16 (NIPS) (S. Thrun , L. K. Saul ,
and B. Schölkopf, eds.) , 153-160 , MIT Press , Cambridge , MA.
H. (1936). "Relations between two sets of variates."
(3-4) :321-377.
T. G. and B. W. Bader. (2009). "Tensor decompositions and applica-
tions." SIAM Review, 51(3):455… 500.
Roweis , S. T. and L. K. Saul. (2000). "Locally linear Science , 290
(5500):2323-2316.
A. Smola, and K. -R. Müller. (1998). "Nonlinear component anal-
kernel eigenvalue problem." Neural 10(5):1299-1319.
J. B. , V. de Silva, and J. C. Langíord. (2000). "A global geomet-
ric framework for dimensionality reduction." Science , 290(5500):
2319-2323.
Wagstaff, K. , C. Cardie , S. Rogers , and S. Schrödl. (2001). "Constrained
k-means clustering with background knowledge." In Proceedings of the
18th Intemational Conference on Machine 577-584 ,
MA.
Weinberger , K. Q. and L. K. Saul. (2009). ."Distance metric learning for large
margin nearest neighbor Journal of Re-
10:207-244.
E. P. , A. Y. Ng , M. 1. Jordan , and S. Russell. (2003). "Distance metric
learning, with application to clustering with side-information." In
in Neural Information 15 (NIPS) (8. Thrun ,
and K. Obermayer , eds.) , 505-512 , MIT Press , Cambridge , MA
Yan , 8. , D. Xu , B. Zhang , and H.-J. Zhang. (2007). "Graph embedding and ex-
tensions: A general framework for dimensionality reduction." IEEE Trans-
245

on Pattem and Machine Intelligence , 29(1):40-5 1.


A. F. FIangi , and J.- Y. Yang. (2004). "Two-dimensional
PCA: A new approach to appearance-based face representation and recog-
nition." IEEE on Pattem Intelligence ,
26(1):131-137
Yang , L. , R. R. Sukthankar , and Y. Liu. (2006). "An efficient algorithm
for local distance metric learning." In Proceedings of the 21st National Con-
ference on Boston , MA.
Ye , J. , R. Janardan , and Q. Li. (2005). "Two-dimensionallinear discriminant
analysis." In Advances in Information Processing 17 (NIPS)
(L. K. Saul , Y. Weiss , and L. Bottou , eds.) , 1569-1576 , MIT Press ,
bridge , MA.
Zhan , D.-C. , Y. -F. Li , and Z.-H. Zhou. (2009). "Learning instance specific
distances using metric propagation." In Proceedings ofthe 26th Intemational
Conference on 1225-1232 , Montreal , Canada.
D. and Z.-H. Zhou. (2005). "(2D)2PCA: 2-directional 2-dimensional
PCA for efficient face representation and recognition." Neurocomputing , 69
(1-3) :224-23 1.
Zhang , Z. and H. Zha. (2004). "Principal manifolds and nonlinear dimension
reduction via local tangent space alignment." SIAM Joumal on Scientific
Computing, 26(1):313-338.
246

Pearson ,

College

,
(relevant
(irrelevant
(feature selection).
(data

(redundant
248

(su bset

(subset
(i =

{D l, D2 ,...,

, (1 1. 1)
11.2 249

IYI
Ent(D) =- (1 1. 2)

Relief (Relevant Features) [Kira and Rendell ,

Y1) ,
(X2 , Y2) , ..., (x m ,
250

+ , (1 1. 3)

=
= -

and Rendell ,

[Kononenko ,

(l = 1, 2,…, IYI;

8j
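A small Python sketch of the Relief statistic for the two-class case (near-hit / near-miss contributions as in (11.3)); continuous attributes are scaled to [0, 1], and the random data are illustrative. The multi-class Relief-F variant would additionally weight the miss term over the other classes.

import numpy as np

def relief(X, y, n_samples=None, seed=0):
    """Relief relevance statistic for binary classification with continuous attributes.
    Larger scores mean the attribute better separates the two classes."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    # scale attributes to [0, 1] so differences are comparable across attributes
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)
    idx = rng.choice(m, n_samples or m, replace=False)
    delta = np.zeros(n)
    for i in idx:
        same, other = (y == y[i]), (y != y[i])
        same[i] = False
        d = np.linalg.norm(X - X[i], axis=1)
        near_hit = np.argmin(np.where(same, d, np.inf))    # nearest neighbor of same class
        near_miss = np.argmin(np.where(other, d, np.inf))  # nearest neighbor of other class
        delta += -(X[i] - X[near_hit]) ** 2 + (X[i] - X[near_miss]) ** 2
    return delta / len(idx)

X, y = np.random.rand(60, 4), np.random.randint(0, 2, 60)  # illustrative data
print(relief(X, y).round(3))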
11. 3 251

LVW (Las Vegas Wrapper) [Liu and Setiono ,


Vegas

1:
2: d= IAI;
3: A* = A;
4: t = 0;
5: while t < T do
6:
7: d'= IA'I;
8: E' =
9: < d)) then
10: t = 0;
11: E=E';
12: d= d';
13: A* =A'
14: else
15: t=t+l
16: end if
17: end while
252

y l),

I)Yi - WTXi)2 (1 1. 5)

(1 1.6)

(ridge regression) [Tikhonov


, and Arsenin ,

(1 1. 7)

(Least Absolute
Selection Operator)
50.
11 .4 253

[Figure: contours of the squared-error term together with the L1 and L2 constraint regions in the (w1, w2) plane; the L1 region has corners on the axes, which is why the LASSO solution tends to be sparse.]

Gradient Descent ,
[Boyd and Vandenberghe ,

Consider the L1-regularized optimization problem

min_x f(x) + λ‖x‖₁ .    (11.8)

Suppose f(x) is differentiable and its gradient ∇f satisfies an L-Lipschitz condition, i.e. there is a constant L > 0 such that

‖∇f(x') − ∇f(x)‖₂ ≤ L ‖x' − x‖₂   (∀ x, x') .    (11.9)

Around x_k, f(x) can then be approximated by the quadratic upper bound

f̂(x) ≃ f(x_k) + ⟨∇f(x_k), x − x_k⟩ + (L/2) ‖x − x_k‖²
     = (L/2) ‖x − ( x_k − (1/L) ∇f(x_k) )‖₂² + const ,    (11.10)

whose minimum is attained at

x_{k+1} = x_k − (1/L) ∇f(x_k) .    (11.11)

Applying the same idea to (11.8) gives, at each step,

x_{k+1} = argmin_x (L/2) ‖x − ( x_k − (1/L) ∇f(x_k) )‖₂² + λ‖x‖₁ ,    (11.12)

i.e. a gradient step on f(x) followed by minimizing the L1-regularized distance. Writing z = x_k − (1/L) ∇f(x_k), the update becomes

x_{k+1} = argmin_x (L/2) ‖x − z‖₂² + λ‖x‖₁ ,    (11.13)

which decomposes over the components of x and has the closed-form soft-thresholding solution

x_{k+1}^i = z^i − λ/L ,   if z^i > λ/L ;
x_{k+1}^i = 0 ,            if |z^i| ≤ λ/L ;    (11.14)
x_{k+1}^i = z^i + λ/L ,   if z^i < −λ/L ,

where x_{k+1}^i and z^i denote the i-th components of x_{k+1} and z.
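The whole procedure, a gradient step followed by the soft-thresholding of (11.14), is easy to write down for the LASSO objective. The sketch below uses the exact Lipschitz constant of the squared-error gradient as the step size, and the synthetic data are illustrative assumptions.

import numpy as np

def soft_threshold(z, tau):
    """Componentwise solution of eq (11.14): shrink z toward zero by tau."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_pgd(X, y, lam=0.1, n_iter=500):
    """Proximal gradient descent for min_w ||y - Xw||^2 + lam * ||w||_1."""
    m, n = X.shape
    L = 2 * np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    w = np.zeros(n)
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ w - y)           # gradient of the smooth part f(w)
        z = w - grad / L                       # gradient step, eq (11.13)
        w = soft_threshold(z, lam / L)         # proximal step, eq (11.14)
    return w

X = np.random.randn(50, 10)
w_true = np.zeros(10); w_true[[1, 4]] = [2.0, -3.0]
y = X @ w_true + 0.01 * np.random.randn(50)
print(lasso_pgd(X, y, lam=1.0).round(2))       # most coefficients driven to exactly zero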
11. 5 255

(codebook)

(dictionary
learning)
(sparse

(1 1. 15)
256

minllxi - . (1 1. 16)

(1 1. 17)

= A = E ]R kxm , 11 .

[Aharon et a l.,

=
)=1 IIF

X- J - bíoí
\ J7=' / IIF
2F ( )
n E LU
.
4tinxu

= 2:#i
11.6 257

pressíve sens-
sensing) [Donoho , 2006; a l.,
mg

n <<:

y = q, æ , (11.19)

(1 1. 20)
258

(Restricted Isometry [Candès , 2008].

x m (n<<
E

(1- :::;; (1 , (1 1. 21)

-ms s nu
(1 1. 22)

s.t. Y = As .

L1
et a l.,

min IIsl11 (1 1. 23)

s.t. Y = As .
11. 6 259

(Basis
Pursuit [Chen et al. , 1998].

(collaborative filter-

"

5 ? 3 2
?
53?· ?
whuq'-qr·

5 ?

3 5 4

and Recht ,

rank(X) (1 1. 24)

s.t. (X)ij =
260

(nuclear norm):
norm)

min{m ,n}

IIXII*= , (1 1. 25)

IIXII* (1 1. 26)

S.t. (X)ij = (A)ij , (i , j) E n.

Programming ,
n<<
[Recht , 2011].

and Fu kunaga ,
et al.,
(Akaike Criterion) [Akaike , [Blum and
Langley , [Forman ,

and
John , et al.,

[Quinlan ,
and Pederson , 1997; Jain and Zongker ,
and Elisseeff, 2003; Liu et al.,
1 1. 7 261

[Liu and Motoda , 1998 , 2007].


LARS (Least Angle [Efron et al.,

LASSO
LASSO [Yuan and
Lin , LASSO [Tibshirani et al.,

and Hastie , 2005].


et al.,

(group
sparse coding)
et a l., 2008; Wang et al., 2010].
2006; et al.,
et al. ,
et al., 2010]. [Baraniuk ,

Pursuit) [Chen et al.,


Pursuit ) [Mallat and [Liu and Ye ,

(http://www.yelab.net/software/SLEP/).
262

1 1. 1

1 1.. 2

1 1. 3

11.4

1 1.5

1 1. 6

11.7

1 1. 8

1 1. 9

11.1 0*
263

M. Elad , and A. Bruckstein. (2006). "K-SVD: An algorithm for


designing overcomplete dictionaries for sparse representation." 1EEE 'I'ra ns-
on Processing, 54(11):4311-4322.
Akaike , H. (1974). "A new look at the statistical model 1EEE
on Automatic Control, 19(6):716 723.•

Baraniuk, R. G. (2007). "Compressive sensing." 1EEE Signal Processing


24(4):118-121
F. Pereira , y. Singer , and D. Strelow. (2009). cod-
In Advances in Neural 1nformation Processing 22 (N1PS) (Y.
Bengio , D. Schuurmans, J. D. Lafferty, C. K. 1. Williams , and A. Culotta,
eds.) , 82-89 , MIT Press , Cambridge , MA.
Blum, A. and P. Langley. (1997). "Selection of relevant features and examples
in machine learning." 97(1-2):245-27 1.
Boyd , S. and L. Vandenberghe. (2004). Convex Optimization. Cambridge Uni-
versity Press , Cambridge , UK.
Candès , E. J. (2008). "The restricted isometry property and its implications for
compressed sensing." Rendus
E. J. , X. Li , Y. Ma , and J. Wright. (2011). "Robust principal
nent analysis?" Joumal 01 the ACM, 58(3):Article 11.
E. J. and B. Recht. matrix completion via convex op-

E. J. , J. . uncertainty principles:
Exact signal highly incomplete frequency
1EEE on 1nformation 52(2):489-509
Chen , D. L. Donoho , and M. A. Saunders. (1998). "Atomic decomposition
by basis pursuit." S1AM on Scientific 20(1):33-6 1.
Donoho , D. L. (2006). "Compressed sensing." 1EEE on
tion Theory , 52(4):1289-1306.
T. Hastie , 1. Johnstone , and R. Tibshirani. (2004). "Least angle
regression." Annals 01 Statistics , 32(2):407-499.
264

Forman , G. (2003). "An extensive empirical study of feature selection metrics


for text classification." Joumal of Leaming Research, 3:1289-1305.
Guyon , 1. and (2003). "An introduction to variable and feature
selection." Joumal of Machine Leaming Research , 3:1157-1182.
Jain , A. and D. Zongker. (1997). selection:
small sample. 1EEE s

Kira , K. and L. A. Rendell. (1992). "The feature selection problem: Tradi-


tional methods and a new algorithm." In Proceedings of the 10th
on (AAA1) , 129-134 , San Jose , CA.
Kohavi , R. and G. H. John. for feature subset selection."
1ntelligence , 97(1-2):273-324.

LIEF." ln Proceedings of the 7th European Conference on Leaming


(ECML) , 171-182 , Catania, Italy.
Liu , H. and H. Motoda. (1998). Feature Selection for Knowledge
Kluwer , Boston , MA.
Liu , H. and H. Motoda. (2007). M Selection.
Chapman & HalljCRC , Boca Raton , FL.
H. and Z. Zhao. (2010). "Feature selection: An
ever evolving frontier in data mining." In Proceedings of Workshop
on Selection in (FSDM) , 4-13 , Hyderabad , India.
Liu , H. and R. Setiono. (1996). "Feature selection and classification - a prob-
abilistic wrapper approach." In Proceedings of the 9th 1ntemational Con-
ference on Engineering of Artificial 1ntelligence
Expert Systems (1EAj Fukuoka, Japan.
Liu , J. and J. Ye. (2009). Euclidean projections in linear time."
In Proceedings of the 26th Conference on Leaming
657-664 , Montreal , Canada.
Mairal , J. , M. Elad , and G. Sapiro. (2008). "Sparse representation for color
image restoration." 1EEE on Processing, 17(1):5 3-69.
S. G. and Z. F.
265

dictionaries." IEEE on Signal Processing , 41(12):3397-3415.


P. M. and K. Fukunaga. (1977). "A branch and bound algorithm
for feature subset selection." IEEE on C-26(9):
917-922.
J. Novovicová, and J. Kittler. (1994). methods in
feature selection." Recognition Letters ,
J. R. (1986). "Induction of decision trees." Machine 1(1):
81-106.
B. approach
12:3413-3430.
Recht , B. , M. Fazel , and P. Parrilo. (2010). "Guaranteed
lutions of linear matrix equations via nuclear norm minimization." SIAM

(1996). and selection via the LASSO."


Journal of the Royal Series B ,
M. Saunders , S. Rosset , J. K. Knight. (2005). "Spar-
sity and smoothness via the fused LASSO." Journal of the Royal Statistical
B , 67(1):91-108.
Tikhonov , A. N. and V. Y. Arsenin , eds. Solution of Problem
DC.
Wang , J. , J. Yang , K. Yu , F. Lv , T. and Y. Gong. (2010). "Locality-
constrained linear coding for image classification." In Proceedings of the
IEEE Computer Society Conference on Computer Recog-
nition (CVPR) , 3360-3367, San CA.
Weston , J. , A. Elisseff, B. Schölkopf, and M. Tipping. (2003). "Use of the zero
norm with linear models and kernel methods." Joumal of Machine Learning

Yang , Y. and J. O. Pederson. (1997). "A comparative study on feature selection


in text categorization." In Pmceedings of the 14th Conference
on Machine Leaming (ICML) , 412-420 , Nashville , TN.
Yuan , M. and Y. Lin. (2006). "Model selection and estimation in regression
with grouped variables." of the Royal Society - Series B ,
266

68(1):49-67
Zou , H. and T. Hastie. (2005). "Regularization and variable selection via the
elastic net." of the Royal Statistical Society - Series B , 67(2):301
320.

Ulam, 1909- 1984)

1867-1918
learning

= {(Xl , Yl) , (Xm , Ym)} ,

and identically

E(h;Ð) , (12.1)

Ê(h;D) (12.2)

:::;;

d(h 1 , h2) "1 h2(X)) . (12.3)


268

f( lE (x)) lE(f (x)) . (12 .4)

(12.5)

) (12.6)

sup xm) - Xi+l,..., ,


X l,..., Xm ,

P (f (Xl ,"', Xm ) E) exp , (12.7)

(12.8)

12.2

Approximately
1984].

=
(concept
class)

(hypothesis
12.2 269

<

P(E(h) E) 1- 8 , (12.9)

< E, 8 <
270

Learning
E, 1/6,
PAC learnable)

;?;

(properly PAC

=1=
1111
12.3 271

P(h(x) = Y) = 1 -
= 1- E(h)
<1 • E • (12.10)

P( (h(Xl) = Yl) = Ym)) = (1- P (h )m


< (1 - E)m • (12.11)

E(h) > E = 0) < E)m

(12.12)

<5, (12.13)

(12.14)

.
272

<E<

P(Ê(h) - E(h) ? E) :S:; exp( -2m(2) , (12.15)

P(E(h) - Ê(h) ? -2m(2) , (12.16)

P(IE(h) - .

<E<

<8<

(1 ro f7_\ ;:;r ,_\1 yLU1fLI 8.

: IE(h)- Ê(h)1 > E)


1
,

L P(IE(h) - Ê(h)1 > E) :s:; 21 1-L 1exp( -2m(2) ,


12.4 273

exp(

(agnostic

PAC
0 8 <
"
m ?:

(12.20)

12.4

(Vapnik-
dimension) 1971].

.
=

{(h (æ 1), h (æ2) ,"', h (æ m ))}.


274

= (h(X1) ,..., h(xm )) I .

E N, 0 < E <

and Chervonenkis , 1971]. > E) (12.22)

= 2m } . (12.23)
12.4 275

E b} , X = = +1 ,
= 0.5 , X2
{h [o ,l] ' h[O ,2]' h[l ,2J'
< X4 <
(X5 ,

X =

E
= =

1972]:

(12.24)

= 1, =
- 1, d - - 1, = {Xl' X2 , … , Xm } ,
D' = {Xl' X2 ,..., Xm- l} ,

= { (h (Xl) , h (X2) , ... , h (xm)) I ,

= { (h (Xl) , h (X2) ,"' , h (Xm-l)) I h E 1i} .


276

={ I 3h ,

(h(Xi) = h' (Xi) = (x m )) ,

11iID I = (12.25)

(12.26)

1) 1) (12.27)

I
1l D I
1 (m 1) + (m 1)
(n:_-;:l) o.
=
= -=-11) )

{12.28)
12.4 277

= (:tt
m;;'d.
(:t i:; (7)
=

d, 0 <8<

P II E(h)
_",
- ,;:::" , 18d ln
m' - --- 0 I
\
1- 8 . (12.29)

E= 11
m


278

(12.30)

Risk

E(g) . (12.31)

K=; ,
(12.32)

_ f- __,..
E(g) E(g)

(12.34)
m 2 '

P( E(h) -

E(h) -

= E(h) - E(g) + E
12.5 279

12.5

Rademach-
er(1892-1969)
= {(X l, Y1) , (x m ,

Ê(h)

1
mz2
(12.36)

ar324ph(24) (12.37)

Yi) ,
280

(12.38)

, (12.39)

Rz(F) =

Rm (F) (12 .4 1)

et
2012]:

= Zm} , 0 < t5 <


12.5 281

+ (12.42)

(12.43)

Êz (f) ,

Ez (f) - EZ' (f)


fEF

= sup 1.(Zm) - f(z '.r,)


fEF m

(12 .44)
282

I sup Ez' (f) - Ez (f) I


J

J J

;:; r - r \ , . Jln(2 j8)


Rz( F) + (12 .4 5)

+3 (12 .46)


12.5 283

,x m } , 0 < Ó <

'8 {1. \ , n (/'I1\


+, _I/ln(ljó)
2m
(12 .47)

'8" \ ,; , " /ln(2jó)


+ RD( 1l) +
(/'I I\
E(h) E(h)
I

= X x

!h (z) = !h (x , y) = ,

{!h: h

sup 1
J

WL

-
- Yi ih(Xi))]
(J

(12.50)

(12.51)


284

et
a l., 2012]

Rm (tl) (12.52)

E(h) :s; Ê(h) + (12.53)

= {Z1 = (Xl , Yl) , Z2 = = (Xm , Ym)} ,


Yi =
12.6 285

D\i = Zi+ l,.. ., Zm} ,

Dt =

J!('c, [J!('c D, Z)] . (12.54)

(12.55)

1-m
avb Z D at ZD z 9" vhuEU

stability):

Z =

1J!('cD , Z) - J!('cD\i , Z) I ::;; ß , i = 1, 2, . . . , m , (12.57)

1J!('cD , Z) -J!('cDi , Z)1


-J!('c D\i , Z)1 + 1J!('cDi , Z) -.e('cD\i , Z)1
286

= z)
[Bousquet and

quet and Elisseeff, 2002].

0< ð<

+ 2ß + (4mß + M)V (12.58)

D) + ß + (4mß + M) (12.59)

ß=

D) - i( 'c,
-

i( 'c, (12.60)

Risk
12.7 287

Jl (g ,V) =
hEH

,
Ed-2'x
E-AVBq"

p mE

I Jl (g , V) - ê( g , D) I ,,;;

Jl('c,

Jl( Jl (g ,

,,;; Jl('c, D) - Jl (g , D) +E
E

[Valiant ,
[Kearns and Vazirani ,
288

and Chervonenkis ,

and

Ben-David et al., 1995].


and Panchenko ,
and Mendelson , [Bartlett ct al.,

and Elisseeff, 2002]


[Mukherjee
et et

et a l., (Asyrnptotical Empirical Risk Minimization)

deterministic

et al., 1996].
289

12.1

12.2

= 2e- 2me2 . 12.3

12 .4

12.5

12.6

12.7

12.8

12.9
+
12.10*
290

P. L. , O. Bousquet , and S. Mendelson. (2002). "Localized Rademacher


complexities." In Proceedings of the 15th Conference on
Theory 44-58 , Sydney, Australia.
Bartlett , P. L. and S. Mendelson. (2003). "Rademacher and Gaussian com-
plexities: Risk bounds and structural results." Journal of
3:463-482.
Ben-David , S. , N. Cesa-Bianchi , D. Haussler , and P. M. Long. (1995). "Charac-
terizations of learnability for classes of {O , . . . , n }-valued functions." Journal
of System Sciences , 50(1):74-86.
Bousquet , O. and A. Elisseeff. (2002). "Stability and generalization." Journal
of Machine Learning Research, 2:499-526.
Devroye , L. , L. G. eds. (1996). A Theory of
Springer , New York , NY.
Hoeffding , W. (1963). "Probability inequalities for sums of bounded random
variables." of the 58(301):13-30.

MIT Press , Cambridge , MA.


V. and D. Pa
ing the of function ." In High Dimensional Probability II (E.
D. M. Mason , and J. A. Wellner , eds.) , 443-457 , Birkhäuser Boston,
Cambridge , MA.
McDiarmid , C. (1989). "On the method of bounded in
Combinatorics , 141(1):148… 188.
Mohri , M. , A. and A. Talwalkar , eds. (2012). Foundations of
MIT Press , Cambridge , MA.
S. , P. Niyogi , T. Poggio , and R. M. Rifkin. (2006). "Learning the-
ory: Stability is sufficient for generalization and necessary and sufficient
for consistency of empirical risk minimization." in Computational
25(1-3):161-193.
K. (1989). "On learning sets and functions." Machine Learning,
291

Sauer , N. (1972). "On the density of families of sets." Journal of Combinatorial


Series A , 13(1):145-147.
Shalev-Shwartz , S. , O. Shamir , N. Srebro , and K. Sridharan. (2010).
ability, stability and uniform convergence." Journal of Machine Learning

Shelah , S. (1972). "A combinatorial problem; stability and order for models
and theories in languages." Journal of 41
(1):247-261
Valiant , L. G. (1984). "A theory of the learnable." of the
ACM, 27(11):1134-1142.
Vapnik , V. N. (1971). "On the uniform convergence of
relative frequencies of events to their probabilities." Theory of
Its Applications, 16(2):264-280.
292

TCS (Theoretical Computer

G.

theory of the
learnable"

Thring Award Goes to in Machine


= {(X1 , Y1) ,

l<<

(active

!"
294

[Figure: a binary classification task with a few labeled "+" / "−" samples and many unlabeled samples (dots); the unlabeled samples reveal the cluster structure and suggest how the query point should be labeled.]
(duster

(manifold
13.2 295
296

= {1 , 2,...

N
p(x) p(X , (13.1)

0, = 1;

I(x) = argmaxp(y = j 1 x)

N
= (13.2)

p(x :Ei )
p(8 = i 1 x) = ;. (13.3)
p(x :E i)

= j 1

= j
1 8 =
= j 18 = i) = =j 1 = o.
= j 1 8 = i,

= ...,
13.2 297

Du = {XZ+l , XZ+2' … , Xl+ u}, 1<< u , 1+ u =

Dz

IN \
LL(Dz U Du) = . p(Xj :E i) . p(Yj 1e = i ,xj) 1
/

:Ei)) (13 .4)

:E i)
ji
'YIJ" = N (13.5)
. p(Xj :E i)

Z eqJ
(13.6)

:E i =
$30Jz z \z30tL

(13.7)

(13.8)

and U yar ,
et
298

[Cozman and Cohen ,

Support Vector

(low-density

[Figure: semi-supervised SVM seeks the separating hyperplane that passes through the low-density region between the clusters formed by the labeled and unlabeled samples.]

(Transductive Support Vector


Machine) [Joachims ,

assignment)
13.3 299

= . . , (XI ,
XI+2 , . . . 1 << u , 1 + u = m.
= . . . , YI+u) ,

me
g ,_
+CILÇi +Cu (13.9)

S.t. i = 1, 2,… , l ,
+ b) ;? i = 1+ 1, 1 + 2 ,… ,m ,
Çi ;? 0 , i =

• ,
=
300

= {(Xl ,Yl) , (X2 , Y2) , … , (Xz , yz)};


'= {X l+ l , Xl+ 2,"', XZ+u};
Cu .

(YZ +1, Yl+ 2, …, Yl+ u);


CZ ;
4: while Cu < Cz do
5: y , C z, e;
6: while :l{ i , j I 0) ^ (Çi > > 0) ^ (Çi 2)} do
7: Yi = -Yi;
8: Yj
9: e
10: end while
11: Cu '= min{2Cu , Cz }
12: end while
y= • , YZ+u)

, (13.10)
11 ,+

+ çj >

[Joachims. 1999].

[Chapelle and Zien ,


meanS3VM [Li et a l.,
13.4 301

= {(X1 , Y1) , , XZ+u } ,


l +u = =
=

I( exp
__._
), if od 14 )
(W)ij = < ,-- / .
l 0, otherwise ,

=
Yi = Sign (f (Xi)) ,
(energy function) [Zhu et al. , 2003]:

- f(Xj))2

1-2
IIU Z PTd
Z q" w ,
d
PT
Z rJ z

-
i=1 j=1
= fT(D - W)f , (13.12)

= (fzT (f (X1); f(X2);"'; f(xz)) , fu = (f (XZ +1);


...;
D = diag(d1 , d2 ,"',

W=
IWuz Wuul
302

D = I Dll
I uul Ll UU I

l Olu Wluilri

2fJWudl (13.14)

fu = (Duu - W uu )-lWudl . (13.15)

-ud
DO
P D W

.. t!
D;-;-lWlu 1I
.L.IU •• ,,,
(13.16)
'

= =

(Duu(1 -

=(I
= (1 - Puu)-lpudl . (13.17)

et a l., 2004].

U =
- {Xl ,'"
= diag(d1 , d2 ,...,
+ u) x = (F I, F i,...,
Fi = ((F) i1, • ,
Yi = arg maX1';;j ,;; IYI (F)ij.

= 1, 2 ,..., m , j = 1, 2,...,
13.4 303

OK
4.'''hw 'bn8
-qJ
F QU Y -qJ
(13.18)
'+b ,A

F(t + 1) + (13.19)

F* = .lim F(t) = (13.20)


z• -+00

= (:1: 1, Yl)};

4: t= 0;
5:
6: +
7: t =
8:
9: for i = 1 + 1, 1 + 2,..., 1

11: end for


iì =

et al. , 2004]

11 • •
,
\Ô=l \ (W)ij 11
. . ) tJ 11 z
(13.21)
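Below is a minimal sketch of the iterative multi-class label propagation described above (Gaussian affinity graph, normalized propagation matrix S, iteration F ← αSF + (1−α)Y). It is not the book's implementation; the bandwidth σ, the parameter α and the toy two-cluster data are illustrative assumptions.

import numpy as np

def label_propagation(X, y, n_labeled, sigma=0.25, alpha=0.99, n_iter=200):
    """Iterate F <- alpha * S @ F + (1 - alpha) * Y on a Gaussian affinity graph,
    then label each unlabeled sample by the largest entry of its row of F."""
    m = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    Dm = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    S = Dm @ W @ Dm                                   # normalized affinity matrix
    classes = np.unique(y[:n_labeled])
    Y = np.zeros((m, len(classes)))
    for j, c in enumerate(classes):                   # one-hot rows for labeled samples
        Y[:n_labeled, j] = (y[:n_labeled] == c)
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * S @ F + (1 - alpha) * Y           # propagation step
    return classes[F[n_labeled:].argmax(axis=1)]      # predicted labels for unlabeled part

X = np.vstack([np.random.randn(3, 2), np.random.randn(3, 2) + 4,
               np.random.randn(37, 2), np.random.randn(37, 2) + 4])
y = np.array([0] * 3 + [1] * 3 + [0] * 37 + [1] * 37)
pred = label_propagation(X, y, n_labeled=6)           # only the first 6 labels are used
print((pred == y[6:]).mean())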
304

based

sity.
" " (co-training) [Blum

(multi-view

(attribute
set)
13.5 305

= =

sufficient

[Blum and Mitchell ,

and Mitchell ,

and Zhou ,
and Li ,
[Zhou and Li ,
306

Yl) ,.…,( (x f, xf) , Yl)};


{(Xf+l' XT+1) , . . . , (xf+u' xT+u)};

2: Du \Ds;
3: for j = 1, 2 do

5: end for
6: for t = 1 , 2 , . . . , T do
7: for j = 1, 2 do

9
c Ds;
10: = I Xi E Dt};
11: I
12: Ds = Ds \ (Dp U Dn);
13: end for
14: if
15: break
16: else
17: for j = 1, 2 do
18:
19: end for
20:
21: end if
22: end for
h2
13.6 307

et al. ,
=

2: repeat
3: Cj = ø (1 j k);
4: fori=1 , 2,..., mdo
5: d ij = ;

7: is ...merged=false;
8: while -, do
9: minjE /C dij ;
10:
11: if -, is_voilated then
12: Cr =
13:
14: else
{r};
16: ø then
17:
18: end if
19: end if
20: end while
21: end for
22: for j = 1, 2, . . ., k do
X;
24: end for
25:
C2 ,..., Ck}
308

(X i ,

X25) , (X12 ' X20 ) , (X20 , X12) , (X14 , X17 ) , ( X 17 , X14)} ,

C= {(X2 , X21) , (X21' X2) , (X13 , X 23 ) , ( X23 , X 13), (X19 , X23 ) , ( X23 , X19)}.

= X6 , X12 ,

[Figure: constrained k-means clustering results on the watermelon dataset using the must-link and cannot-link constraints above; panels show the partitions after successive iteration rounds.]

(
=
13.6 309

C1 = {X3 , X5 , X7 , X9 , X13 , X14 , X16 , X17 , X21};

C2 = {X6 , X8 , X 1O, X11 , X12 , X15 , X18 , X19 , X20};


C3 = X26 , X27 , X28 , X29 , X30}.

=
label).

Seed
et al.,

= {Xl' X2 ,..., xm};


8c D, 181 << IDI

1: for j = 1 , 2 , . . . , k do
2: X
end for
4: repeat
5:
6: for j = 1, 2, . . . , k do
7: for all X E S , do
8: C j = Cj U{ X }
9: end for
10: end for
11: for all
12: dij = Ilxi ;
13: r = d ij ;
14: Cr = Cr U{x i}
15: end for
16: for j = 1 , 2 ,. . . , k do
X;
18: end for
19:
C2 ,..., C k}
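A compact sketch of k-means with labeled seed samples in the spirit of the algorithm above: the seeds determine the initial means and their own cluster assignments are never changed. The data and seed labels below are assumed for illustration only.

import numpy as np

def seeded_kmeans(X, seed_idx, seed_labels, k, n_iter=100):
    """k-means that uses a small labeled 'seed' set: seeds fix both the initial
    centroids and their own cluster assignments throughout the iterations."""
    labels = np.full(len(X), -1)
    labels[seed_idx] = seed_labels
    centroids = np.array([X[seed_idx][seed_labels == j].mean(axis=0) for j in range(k)])
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        labels[seed_idx] = seed_labels                  # seeds never change cluster
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

X = np.random.rand(30, 2)                               # illustrative data
seed_idx = np.array([0, 1, 2, 3, 4, 5])
seed_labels = np.array([0, 0, 1, 1, 2, 2])              # assumed supervision
labels, _ = seeded_kmeans(X, seed_idx, seed_labels, k=3)
print(np.bincount(labels))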
310

82 = 83 = {XI4 , X17}.

C1 = X 4 , X22 , X23 , X24 , X25 , X26 , X27 , X28 , X29 , X30} ;

C2 = x 8,x lQ, xu , X12 , X 15 , x18 , x19 , X20} ;


C3 = X17 ,

[Figure: constrained seed k-means clustering results on the watermelon dataset using the labeled seed samples above; panels show the partitions after successive iteration rounds.]

=
13.7 311

and Landgrebe ,

and Mitchell ,
et al.,
and Landgrebe ,

et a l.,

et a l.,
[Collobert et a l.,
and Chawla ,

and Zhang , 2006; Jebara et a l.,


et a l., 2003].

[Blum and Mitchell ,


and Li , 2005b].

[Belkin et a l., regular-


312

and Cohen ,

[Li and Zhou ,

and Li , et a l.,
et a l., 2006b;
[Zhou and Li ,
[Settles ,
313

13.1

13.2

13.3 of

P Z QV
Z P Z AU --qa

13.4
http:j j

13.5

13.6*

13. 7*

13.8

13.10
314

(2013).
Basu , S. , A. Banerjee, and R. J. Mooney. (2002). "Semí-supervised clusteríng
by seedíng." In Proceedings of the 19th on M achine
19-26 , Sydney, Australía.
P. Níyogí , and V. Síndhwaní. (2006). "Manifold regularízatíon: A
geometríc framework for learníng from labeled and unlabeled examples."
Journal of 7:2399-2434.
Blum, A. and S. (2001). "Learníng from labeled and unlabeled data
usíng graph míncuts." In Proceedings of the 18th International Conference
on Machine , 19-26 , Wíllíamston , MA.
Blum, A. and T. Mítchell. (1998). "Combíníng labeled and unlabeled data with
co-traíning." In ofthe 11th Annual Conference on Computation-
al Learning Theory 92-100 , Madison , W I.
Chapelle , 0. , M. Chi , and A. Zien. (2006a). "A contínuation method for semi-
supervísed SVMs." In Proceedings of the 23rd Conference on
Machine , 185-192 , Pittsburgh , PA.
Chapelle , 0. , B. Schölkopf, and A. Zien , eds. (2006b). Semi-Supervised
ing. MIT Press , Cambrídge , MA.
Chapelle , 0. , J. Weston , and B. Schölkopf. (2003). "Cluster kernels for semi-
supervised learníng." In in Processing Systems
15 (NIPS) (S. Becker , S. K. Obermayer , eds.) , 585'---592 , MIT
Press , Cambridge , MA.
Chapelle , O. and A. Zien. (2005). learning by low density
separation." In Proceedings of the 1 0th W orkshop on A
Intelligence and Statistics (AISTATS) , 57-64 , Savannah Hotel , Barbados.
R., F. Sinz , J. L. Bottou. (2006). "Tr adíng convexíty for
scalability." In Proceedings of the 23rd Conference on Machine
201-208 , Pittsburgh , PA.
G. and I. Cohen. (2002). "Unlabeled data can degrade classifìca-
tíon performance of generative In Proceedings of the 15th Inter-
Conference of the Intelligence Research Society
315

(FLAIRS) , 327-331 , Pensacola , FL.


Goldman , S. and Y. Zhou. (2000). "Enhancing supervised learning with unla-
beled data." In Proceedings of the 17th International
327-334 , San Francisco , CA.
Jebara , T. , J. and S. F. Chang. (2009). "Graph construction and
matching for semi-supervised learning." In Proceedings of the 26th Interna-
tional Conference on 441-448 , Montreal , Cana-
da.
Joachims , T. (1999). "Tr ansductive inference for text classification
port vector machines." In Proceedings of the 16th Conference
on Machine , 200-209 , Bled , Slovenia.
Li , Y.-F. , J. T.
label In Proceedings of the 26th International Conference on Machine
633-640 , Montreal , Canada.
Li , Y.-F. and Z.-H. Zhou. (2015). "Towards making unlabeled data never hurt."
IEEE on Pattern and Machine 37(1):
175-188.
Miller , D. J. and H. S. Uyar. (1997). "A mixture of experts classifier with
learning based on both labelled and unlabelled data." In in Neural
Information Systems 9 (NIPS) (M. Mozer , M. 1. Jordan , and T.
Petsche , eds.) , 571-577 , MIT Press , Cambridge , MA
A. S. Thrun , and T. Mitchell. (2000). "Text
tion from labeled and unlabeled documents using EM." Learning,
39(2-3):103-134.
B. (2009). "Active Technical Re-
port 1648 , Department of Computer University of Wis-
consin at W I. http:j jpages.cs.wisc.eduj ,,-, bsettlesj
pub j
and D. Landgrebe. (1994). "The effect of unlabeled samples
in reducing the small sample size problem and mitigating the Hughes phe-
nomenon." IEEE Transactions on Remote Sensing, 32(5):
1087-1095.
316

S. S. Keerthi , and O. Chapelle.


for semi-supervised kernel machines." In Proceedings ofthe 23rd
al Conference on Machine (ICML) , 123-130 , Pittsburgh, PA.
K., C. Cardie , 8. Rogers , and S. Schrödl. (2001).
k-means clustering with background knowledge." In Proceedings of the
18th Intemational Conference on M achine 577-584 ,
MA.
F. and C. Zhang. (2006). "Label propagation through linear neighbor-
hoods." In Proceedings 01 the 23rd Intemational Conference on Machine
Leaming (ICML) , 985-992 , Pittsburgh , PA.
D. , Z.-H. Zhou , and S. Chen. (2007). dimensionality
reduction." In Proceedings of the 7th SIAM Intemational Conference on
Minneapolis , MN.
Zhou , D. , O. Bousquet , T. N. Lal , J. Weston , and B. Schölkopf. (2004).
ing with local and global consistency." In Advances in Neuml Information
16 (NIPS) (8. Thrun , L. Saul , and B. Schölkopf, eds.) ,
284-291 , MIT Press , Cambridge , MA.
"When
In
529-538 , Reykjavik , Iceland.
Zhou , Z.-H. and M. Li. (2005a). regression with
In Proceedings of the 19th Joint Conference on
telligence (IJCAI) , 908-913 , Edinburgh , Scotland.
Zhou , Z.-H. and M. Li. (2005b). Exploiting unlabeled data using
three classifiers." IEEE on
17(11):1529-1541
Zhou , Z.-H. and M. Li. (2010). learning by disagreement."
Systems , 24(3):415-439.
Zhu , X. (2006). learning literature survey." Technical Report
1530 , Department of Computer Sciences , University of Wisconsin at Madi-
son , Madison , W I. http:j jwww.cs.wisc.eduj"'jerryzhujpubjssl....s urvey.pdf.
Zhu , X. , Z. Ghahramani , and J. Lafferty. learning
317

using Gaussian fields and harmonic functions." In Proceedings of the 20th


International Conference on Machine 912-919 ,
ton, DC.

Riemann,

1853
I
I 0).

graphical

network).
Markov
Bayesian
320

., OM}.

(Markov

, Xn , Yn) = rr P(Yi I Yi-1)P(Xi I yi) . (14.1)

= P(Yt+l = Sj I Yt = s i) , 1:(; i , j :(; N ,

B=

. bij = P(Xt = Oj I , 1 :(; i :(; N , 1 :(; j :(; M


14.1 321

P(Yl

..., Xn}:

< =t +

02 , . . .

{Xl , X2 ,...,

=
322

Random

functions)

{Xl , x 2}, {Xl ,

X =

, (14.2)

=
14.2 323

Lx

ç xQ*;

P(X) , (14 ,3)

= Lx

X5 ,

(separating

• (global Markov

XB I xc.

C B
324

P (XA , XB , XC) P(XA , XB , XC)


P(XA , XB I XC) = - \-;"1___ 0
:-'-'/

P(XC)

(14.5)

P(XA'XC)
P(XA I XC)
P(XC)

.
(14.6)

P(XA , XB I XC) = P(XA I XC)P(XB I XC) , (14.7)

Markov
14.3 325

(Markov =n(v)U l_ XV\n*(v) I Xn(v) .


blanket)
Markov

l_ X v I XV\(u ,v) .

= <I 1. 5‘ if
---, --
I 0.1 , otherwise ,
I 0.2 ‘ if XR = X r;;
--, --
I 1. 3, otherwise ,

= e-HQ(xQ) (14.8)

HQ(xQ) =' , (14.9)

Random
326

= {Xl' X2 , … , = .. .,
I

[Figure: part-of-speech tagging as a conditional random field; the observation sequence x ("The boy ...") is annotated with the label sequence y = [D] [N] [V] [P] [D] [N].]

I = P (Yv I x , Yn(v)) , (14.10)

(chain-structured

Yl Y2 Y3 Yn

x = {Xl X2 ... Xn }
14.3 327

P X =
--zp
ex UU + X + QU'K X

1i

feature
Sk(Yi , X ,
feature

I 1, if Yi+ 1 = [P ], Yi = [V] and


i) = <
- 1 0 , otherwise ,

I 1, if Yi = [V] and Xi
=<
1 0, otherwise ,
328

(marginalization).

P(XE , XF) P(XE , XF)


P(XF I
P(XE , XF)

P(XE) . (14.13)
14.4 329

m43(X3)

P(Xl , X2 , X3 , X4 , X5)
X4 X3 X2 Xl

I Xl) P(X3 I X2)P(X4 I X3)


X4 X3 X2 Xl

(14.14)

X3) I X2) L P(Xl)P(X2 I Xl)

I I X2)m12(X2) , (14.15)
X3 X4 X2

P(X5) I X3) L

I I X3)

= L P (X5 I

= m35(X5) . (14.16)
330

,
(14.17)

=jm35(Z5) (14.18)

mij(Xj) II (14.19)
14.5 331

P(Xi) oc II mki(Xi) (14.20)

inference).

14.5.1
332

Ep[f] = J (14.21 )

N
, (14.22)

Chain
Monte

(14.23)

p(f) = lEp (14.24)

(14.25)
14.5 333

I x) ,

p(xt )T(xt - 1 IXt) = p(X t - 1 )T(xt I x t - 1) , (14.26)

Metropolis-Hastings
(rejeet
et al., K.

p(x^{t−1}) Q(x* | x^{t−1}) A(x* | x^{t−1}) = p(x*) Q(x^{t−1} | x*) A(x^{t−1} | x*) ,    (14.27)

where Q(x* | x^{t−1}) is the user-specified proposal distribution and A(x* | x^{t−1}) is the probability of accepting the candidate x*.

1: Initialize x⁰;
2: for t = 1, 2, ... do
3:   Sample a candidate x* from Q(x* | x^{t−1});
4:   Sample a threshold u from the uniform distribution on (0, 1);
5:   if u ≤ A(x* | x^{t−1}) then
6:     x^t = x*
7:   else
8:     x^t = x^{t−1}
9:   end if
10: end for
11: return x¹, x², ...
334

A(x* | x^{t−1}) = min( 1 , [ p(x*) Q(x^{t−1} | x*) ] / [ p(x^{t−1}) Q(x* | x^{t−1}) ] ) ,    (14.28)

I
xt = ,
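A minimal random-walk Metropolis-Hastings sketch follows: with a symmetric Gaussian proposal the acceptance probability (14.28) reduces to min(1, p(x*)/p(x^{t−1})), so only an unnormalized (log-)density is needed. The target below is an illustrative two-component mixture, not an example from the book.

import numpy as np

def metropolis_hastings(log_p_tilde, x0, n_samples=5000, step=0.5, seed=0):
    """Random-walk Metropolis-Hastings: propose x* ~ N(x, step^2) and accept with
    probability min(1, p(x*) / p(x)), the symmetric-proposal case of eq (14.28)."""
    rng = np.random.default_rng(seed)
    x, samples = float(x0), []
    for _ in range(n_samples):
        x_star = x + step * rng.standard_normal()
        if np.log(rng.random()) < log_p_tilde(x_star) - log_p_tilde(x):
            x = x_star                      # accept the candidate
        samples.append(x)                   # otherwise keep the previous state
    return np.array(samples)

# illustrative target: an unnormalized mixture of two Gaussians
log_p = lambda x: np.log(np.exp(-0.5 * (x + 2) ** 2) + np.exp(-0.5 * (x - 2) ** 2))
draws = metropolis_hastings(log_p, x0=0.0)
print(draws.mean().round(2), draws.std().round(2))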

notation) [Buntine,
14.5 335

rr I:
N
p(x 1 8) = P(Xi' (14.29)

z 1

=
@

= 1 (14.31)

z 1

p(z 1 1

lnp(x) = .c (q) + KL(q 11 p) , (14.32)

.c (q) = J q(z) ln { } dz , (14.33)

KL(q 11 p) = - J q(z) (14.34)


336

M
q(Z) = (14.35)

exponential
qi

qi

z)] + const , (14.37)

lEi #j [lnp(x , z)] = !lnp(x, z) rr qidzi .

(qj =

lnqj(zj) = [lnp(x , z)] + const , (14.39)

[lnp (x , z) l)
q;(Z3)=.(14AO)
J/ J exp (x , Z)] )dzj

[lnp(x ,
14.6 337

field (mean

Dirichlet

(k =

1, 2,...

C.1. 6
338
14.7 339

T K I N \
IIp(8 t I II I Zt ,n , 8 t ) ), (14 .4 1)

pa
LR HK
@
(14.42)

T
LL( (14.43)

(14 .44)

and Friedman , 2009].

[Pearl, [Pearl ,
and Geman ,

et
and McCallum , 2012].
340

Belief Propagation)
[Murphyet
and Kappen , 2007; (factor graph)
[Kschischang et al.,
and Spiegelhalter ,

et a l.,

[Jordan ,
and Jordan , 2008].

LDA [Blei et a l.,

(non-parametric
et a l., and

[Hofmann ,

et a l., 2003;
Gilks et 1996]
341

14.1

14.2

14.3

14.4

14.5

14.6

14.7

14.8

14.9*

14.10*
342

C. , N. De A. Doucet , and M. 1. Jordan. (2003). "An intro-


duction to MCMC for machine Machine 50(1-2):5-43.
Blei , D. M. (2012). "Probabi1isitic topic models." ofthe ACM,
55( 4):77-84.
Blei , D. M. , A. Ng , and M. 1. Jordan. Dirichlet
of M

2:159-225.
and D. Geman. (1984). "Stochastic relaxation , Gibbs distributions ,
and the Bayesian restoration of images." IEEE on Pattern
Machine Intelligence , 6(6):721-74 1.
Ghahramani , Z. and T. L. Griffiths. (2006). "Infinite latent feature models
the Indian process." In
18 (NIPS) (Y. Schölkopf, and J. C. Platt , eds.) , 475-482 ,
MIT Press , Cambridge , MA.
Gilks , W. R., S. Richardson , and D. J. Spiegelhalter. (1996).
Monte in Practice. Chapman & HalljCRC , Boca Raton , FL.
Gonzalez , J. E. , Y. Low , and C. Guestrin. (2009). "Residualsplash for optimally
parallelizing belief propagation." In Proceedings of the 12th
on and
Clearwater Beach, FL.
Hastings , W. K. (1970). "Monte Carlo sampling methods using Markov chains
and
Hofmann , T. (2001). probabi1istic latent semantic
analysis." Machine 42(1):177-196.
Jordan , M. 1., ed. (1998). Models. Kluwer , Dordrecht ,
The Netherlands.
Koller , D. and N. Friedman. (2009). Graphical Models: Principles
Techniques. MIT Press , Cambridge , MA.
B. J. Frey, and H.-A. Loeliger. (2001). "Factor graphs and
the sum-product algorithm." IEEE on Information Theory , 47
343

(2) :498-519.
Lafferty, J. D. , A.
dom Probabilistic mode1s for segmenting and labeling sequence data."
1n Proceedings of the 18th Intemational Conference Leaming
282-289 , Williamstown , MA.
Lauritzen , S. L. and D. J. Spiege1ha1ter. (1988). "Local computations with prob
abi1ities on graphica1 structures and their application to expert systems."
of the Royal Society - Series B , 50(2):157-224.
Metropolis , N. , A. W. Rosenb1uth , M. N. Rosenb1uth , A. H. Teller , and E.
Teller. (1953). "Equations of state calcu1ations by fast computing machines."
Joumal of Chemical Physics , 21(6):1087-1092.
J. M. and H. J. Kappen. (2007). "Sufficient conditions for convergence
of the sum-product a1gorithm." IEEE on Information Theory ,
53(12):4422-4437.

tion for approximate inference: An empirical study." 1n Proceedings of the


15th Conference on in Artificial Intelligence
Sweden.
Nea1 , R. M. (1993). "Probabilistic inference using Markov chain Monte Carlo
methods." Technica1 Report CRG-TR-93-1 , Department of Computer Sci-
ence , University of Toronto
Pearl , .J. (1982). "Asymptotic properties of trees and game-searching
procedures." 1n Proceedings of the 2nd National un
Intelligence (AAAI) , Pittsburgh , PA.
J. "Fu structuring belief

Pearl , J. "

Pearl , J. (1988). Probabilistic Reasoning 'in Intelligent Systems: Networks of


Plausible Inference. Morgan Kaufmann , San Francisco , CA.
Rabiner , L. R. (1989). "A tutoria1 on hidden Markov model and selected ap-
plications in speech Proceedings of the IEEE , 77(2):257-286.
344

Sutton , C. McCallum. (2012). "An introduction to random


fields." Trends in Machine Learning, 4(4):267-373.
Teh , Y. W. , M. 1. Jordan , M. J. Beal , and D. M. Blei. (2006). "Hierarchical
Dirichlet processes." Journal of the Statistical Association, 101
(476):1566-1581
Wainwright , M. J. and M. 1. Jordan. (2008). "Graphical models , exponential
and variational inference." Trends in Machine

Y. (2000). "Correctness of local probability propagation in graphical


models with loops." Neural
345

Pearl , 1936-

.
Newell

"9.
Michael
[Fürnkranz et al., 2012 J. (rule

@• f1 ^ f2 (15.1)

^ ;
348

val

(confl.ict

(ordered
(priority

(default rule)

(propositional
(first-order (propositional

(atomic
+

^
15.2 349

(relational

(sequential

=
350

6,
.

^ .

^
15.2 351

.
352

(beam

and Niblett ,

Ratio

LRS = 2. (r 'h-l- log0 .


15.3 353

(Reduced Error
[Brunk and Pazzani ,
(growi ng

and
(Incremental 4]l

[Cohen ,
ed Incremental Pruning
Produce Error Reduction ,

JRIP

lREP* [Cohen ,

IREP*(D);
2: i = 0;
3: repeat

5: Di =
= lREP*(Di );
7:
8: i =
9: until
354


rule)j


rule).

et a l., 2012].
15 .4 355

1) 6) 10) 14)
16) 17) 1) 6)

16) 17) 14) 16)


6) 7) 10) 14)

7) 10) 14) 15)


1) 6) 7)

7) 10) 15) 16)


7) 14) 16) 17)

14) 16) 17) 16)


7) 10) 15)

10) 16) 10) 16)


6) 7) 10) 15)

6) 7) 10) 15)
10) 14) 15) 16)

14) 15) 16) 17)


1) 2) 3) 6)

2) 3) 6) 7)

(relational

(background

.)
356

Y) .

FOIL (Fírst-Order Inductive Learner) [Quinlan ,

Y) ,
15.5 357

(FOIL

F_Gain (15.3)
\ -_ m+ +m_ _

X (log 2 - 1og 2

6)" .

Logic

P(f (X)) ,
358

(Least General
LGG) [Plotkin , 1970].

t) =

LGG(s , t) ==

Y) Y)

Y);
15.5 359

Y) Y) Y).

Y) Y). (15 .4)

10)

, (15.5)

LGG( lO, Y)

and Dze-
roski , 1993]

(Relative Least
General Generalization) [Plotkin ,
360

A.

principle)
(inverse
resolution) [Muggleton and Buntine ,

= L1 = -, L 2 , C1 = A V L , C2 = B V
C=AV

(Av =A, (15.6)

V , (15.7)

C 1 . C2 (15.8)

v AVB

C2 = V {-, L}. (15.9)


15.5 361

[Muggleton ,

q •A
:
q •A (15.10)

p • p • A^q
A^B
:
q •B
• A^q p
(15.11)

p • A^C (15.12)
q• B q• C
p • A^B (15.13)
• A q • r^C
T

"C' = CÐ

= {X/Y}.

= BÐ

ò (most
general
= {1/ X} , Ð2 = {1/ X , 2/Y} ,
Ð3

= AV = Bv = ...., L2Ð,
362

C= V . (15.14)

= (resol ution quotient) ,

(15.15)
(C = AV

(--, L101)021 =

{--, L10 t})Ø 21. (15.16)

Y) Y).

,
15.6 363

"-, q(M, N)" . N) •?"

{ljM , XjN} , O2 =

(symbolism
[Michalski , 1983]. [Fürnkranz et

and

PRISM [Cendrowska ,

CN2 [Clark and Niblett ,


[Fürnkranz ,
RIPPER [Cohen ,
364

(Never-El1ding Language
et

[Muggleton ,
GOLEM [Muggleton and

PROGOL [Muggleton ,

and Lin ,

[Muggleton , [Srinivasan ,
Datalog [Ceri et al.,

1992; Lavrac and Dzeroski ,

and Muggleton ,

(probabilistic
ILP) [De Raedt et
(relational Bayesian network) [Jaeger ,

et al.,
Logic Program) al., 2000
logic network) [Richardson and Domingos ,
(statistical relationallearning) [Getoor and
365

15.1

15.2

15.3

15.4

15.5

15.6 Y)" .

15.7

15.8

15.9* t2 , … ,

15.1 0*
366

Bratko , I. and S. Muggleton. (1995). "Applications of inductive logic program-


ming." of the ACM, 38(11):65-70.
C. A. and M. J. pazzani. (1991). "An investigation of noise-tolerant re-
lational concept learning algorithms." In Proceedings ofthe 8th
Workshop on Machine Learning (1WML) , 389-393 , Evanston , IL.
Carlson , A. , J. Betteridge , B. Kisiel , B. Settles , E. R. Hruschka , and T. M.
Mitchell. (2010). "Toward an architecture for learn-
ing." In Proceedings of the 24th AAA1 Conference on Artificial 1ntelligence
1306-1313 , Atlanta , GA.
Cendrowska , J. (1987). "PRISM: An algorithm for inducing modular
1nternational Journal of Studies , 27(4):349-370.
Ceri , S. , G. Gottlob , and L. Tanca. (1989). "What you always wal1ted to kl10W
about Datalog (and l1ever dared 1EEE on K
1(1):146-166
Clark , P. and T. Niblett. (1989). "The CN2 induction algorithm." Machine
3(4):261-283.
Cohen , W. W. (1995). "Fast effective rule induction." In Proceedings ofthe 12th
1nternational Conference on Machine 115-123 , Tahoe ,
CA.
De Raedt , L. , P. Frasco l1i , K. and S. eds. (2008). Prob-
1nductive Logic Programming: Applications. Springer ,
Berlin.
L. Getoor , D. Koller , and A Pfeffer. (1999). "Learning prob-
abilistic relatìonal models." In Proceedings of the 16th 1nternational Joint
Conference on 1300-1307, Stockholm , Swe-
den.
J. (1994). "Top-down prul1Íng in In Proceed-
ings 0 f the 11 th Conference on Artificial
457 , Amsterdam , The Netherlands.
Fürnkranz , J. , D. Gamberger , and N. Lavrac. (2012). of Rule
Learning. Springer , Berlin.
367

J. and G. Widmer. (1994). "Incremental reduced error pruning."


In Proceedings of the 11th Conference on Machine
New Brunswick, NJ.
Getoor , L. and B. Taskar. (2007). Introduction to Relational Learn-
ing. MIT Press , Cambridge, MA.
Jaeger , M. (2002). "Relational Bayesian networks: A survey." Electronic Trans-
on Artificial Intelligence , 6:Article 15.
L. and S. Kramer. (2000). logic
programs." In of the AAA I'2000 W orkshop on Statis-.
Models from Relational 29-35 , Austin , TX.
N. and S. Dzeroski. (1993). Progmmming: Techniques
Ellis Horwood , New York , NY.
S. (1969). "On the quasi-minimal solution of the general covering
problem." In Proceedings of the 5th Symposium on
Processing (FCIP) , volume A3 , 128 , Yugoslavia.
Michalski , R. S. and methodology of inductive In
Machine An Intelligence (R. S. Michalski , J.
Carbonell, and T. Mitchell, eds.) , Tioga , Palo Alto , CA.
Michalski , R. S. , I. Mozetic , J. Hong , and N. Lavrac. (1986). "The multi-purpose
incrementallearning system AQ15 and its testing application to three med-
ical domains." In Proceedings of the 5th National Conference on
Intelligence 1041 1045 , Philadelphia, PA.

Muggleton , S. (1991). "Inductive logic programming."

Muggleton , S. , ed. (1992). Inductive Logic


don , UK.
Muggleton , S. (1995). "Inverse entailment and Progol." New Generation Com-
putíng, 13(3-4):245-286.
Muggleton , S. and W. Buntine. (1988). "Machine invention offirst order predi-
cates by inverting resolution." In Proceedings of the 5th International Work-
shop on Machine Ann Arbor , MI.
Muggleton , S. and C. (1990). "Efficient induction of logic programs."
368

In Proceedings of the 1st International Workshop on Algorithmic Learning


Theory Tokyo , Japan.
Mugg1eton , S. and D. Lin. 1earning of higher-order
dyadic data1og: Predicate invention revisited." In Proceedings ofthe 23rd In-
Joint Conference on Intelligence
Beijing , China.
P1otkin , G. D. on inductive
ligence 5 (B. D. 153-165 , Edinburgh U niversity
Press , Edinburgh , Scot1and.
P1otkin , G. D. note on inductive generalization." In
chine 6 (B. Me1tzer and D. Mitchie , eds.) , 107-124 , Edinburgh
University Press , Edinburgh , Scotland.
J. R.
5(3):239-266.
Ri chardson, M. and P. Domingos. (2006). "Markov logic networks." Machine
Learning, 62(1-2):107 136.

machin
Journal of the ACM, 12(1):23-41.
Srinivasan,A.(1999). "The A1eph manual."
mach1earnj A1ephj a1eph.html.
Winston , P. H. (1970). structural descriptions from Ph.D.
Department Engineering , MIT , Cambridge , MA.
Wnek , J. and R. S. Micha1ski. (1994). "Hypothesis-driven constructive induc-
tion in AQI7-HCI: A method and experiments." Machine Learning , 2(14):
139-168.
369

S. Michalski , 1937- 2007) .

G.

achine
(reinforcement learning).

Decision

= (X , A , P, R) ,
372

p=l
r= -100

a) = 1.
16.2 373

tll

16.2

armed

(exploration-
374


(Exploration-
Exploitation

Suppose arm k has been pulled n times, yielding rewards v₁, v₂, ..., v_n. The average reward is

Q(k) = (1/n) Σ_{i=1}^n v_i .    (16.1)

Rather than storing all n rewards, the average can be updated incrementally. Writing Q_n(k) for the average after the n-th pull,

Q_n(k) = (1/n) ( (n − 1) × Q_{n−1}(k) + v_n )    (16.2)
       = Q_{n−1}(k) + (1/n) ( v_n − Q_{n−1}(k) ) ,    (16.3)

so only the running average and the pull count need to be kept.

1: r = 0;
2: ∀ i = 1, 2, ..., K: Q(i) = 0, count(i) = 0;
3: for t = 1, 2, ..., T do
4:   if rand() < ε then
5:     k = an arm chosen uniformly at random from {1, 2, ..., K}
6:   else
7:     k = argmax_i Q(i)
8:   end if
9:   v = R(k);
10:  r = r + v;
11:  Q(k) = ( Q(k) × count(k) + v ) / ( count(k) + 1 );
12:  count(k) = count(k) + 1
13: end for
Output: cumulative reward r

= l/vt.
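The ε-greedy procedure with the incremental update (16.3) fits in a few lines of Python; the arm reward distributions (Gaussian with the stated means) and the value of ε below are illustrative assumptions.

import numpy as np

def epsilon_greedy(arm_means, T=3000, eps=0.1, seed=0):
    """K-armed bandit with epsilon-greedy exploration and incremental value updates."""
    rng = np.random.default_rng(seed)
    K = len(arm_means)
    Q, count, total = np.zeros(K), np.zeros(K), 0.0
    for _ in range(T):
        k = rng.integers(K) if rng.random() < eps else int(Q.argmax())
        v = rng.normal(arm_means[k], 1.0)            # stochastic reward of the chosen arm
        total += v
        count[k] += 1
        Q[k] += (v - Q[k]) / count[k]                # incremental average, eq (16.3)
    return total, Q

total, Q = epsilon_greedy(arm_means=[0.2, 0.5, 0.8, 0.4])   # assumed true means
print(round(total, 1), Q.round(2))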

16.2.3 Softmax
376

" T
(16 .4)

1: r = 0;
2: Vi = 1, 2,... K: Q(i) = 0 , count(i) = 0;
3: for t = 1, 2,..., T do
4:
5: v = R(k);

Softmax
(T =
16.3 377

[Figure: average reward curves of the ε-greedy and Softmax (τ = 0.01) strategies on a K-armed bandit over 3000 pulls; the vertical axis spans roughly 0.25-0.40.]
E =

(state value function) ,


(state-action value
378

(16.5)
I Xo

2: ;=1 rt I Xo = (16.6)
a) = I Xo = x , aO

IF--4
Z
z
R
•z + Z --au

L P:--t x' (16.8)


16.3 379

(X , A , P , R);

1: V(x) = 0;
2: for t = 1, 2, . . . do
3: V'(x) = 2:: x'EX P:• x'
4: if then
5: break
6: else
7: V = V'
8: end if
9: end for

- V'(x)1 < B. (16.9)

;
(16.10)
.
x'EX

arg max 2.::: (16.11)


7r
380

V*(x) = (16.12)

(16.13)
V; (x) =
lb<::......
L:
x'EX
P:-+ x'

. (16.14)

a' t:.ti (16.15)


a) = L: a' t:.ti

íT '(x))

)
16.3 381

(16.16)

= a) , (16.17)

(policy iteration).

(X , A , P , R);

1: V(x) =
2: loop
3: for t = 1, 2,... do
4: : V'(x) = a) I: x/EX
5: if t = T + 1 then
6: break
7: else
8: V = V'
9: end if
10: end for
11: = argmaxaEA Q(x , a);
12: if \/x then
13: break
14: else

16: end if
17: end loop
382

th)= …= maxaEA
2: x' EX X-+X'

PZ• X'
\"1 '--X-+X" T .

.
'1' (16.18)

= (X , A , P , R);

1: V(x) = 0;
2: for t = 1, 2, ' , • do
3: : V'(x) = L: x' EX P:-+ x ' (t
4: if maxxEX lV (x) - V'(x)1 < () then
5: break
6: else
7: V=V'
8: end if
9: end for
= argmaxaEA

V'(x) . (16.19)
16.4 383

(model-free

< T1 , T2 , . . . , TT , XT >,
384

1 7T (X) , E;
7T E (X) = < '" - - (16.20)

;;;:::

(on-policy

1: Q(x , a) = =
2: for s = 1, 2 , . .. do
3:
< >;
4: fort=O , 1,..., T-1 do
5: R= T=-t ri;
6: =
=
8:
9:
I argmaxa ,
a) == < ::, ,-J: _:_

10: end for


16.4 385

E[fl

E[fl = I
r I_'P(X)
Jx " n q(x) f(x)dx . (16.23)

(16.24)
q(xD

(16.25)

Q(x , a) (16.26)

p 7r = rr
T-l
(16.27)
386

(16.28)

7T'

1: Q(x , a) = 0 ,
2: for s = 1, 2, . . . do
3:
< r l, >j

1___ E+ ;
.0 1 E/IAI ,
5: for t = do
6: R = ;
7: Q(Xt , at) =
8:
for
= Q(x , a')
11: end for

(Temporal
16 .4 387

= t 2::;=1

Qf+l (x ,a) = Qf( x, a)

- Qt( X,
(rt +l -

Q1r (x , a) (X'))
x'EX

a')Q 1r (x' , a')). (16.30)


x'EX

Qf+l = - , (16.31)

and Dayan ,
388

1: Q(X , a)
= 0 , 1l" (x , a)
2: x =
3: for t = 1 , 2 , . .. do
4: r , x'
5: a'
6: Q(x , a) = Q(x , a) a') - Q(x , a));
= argmaxa" ;
8:
9: end for

1: Q(x, a) = 0, π(x, a) = 1/|A|;
2: x = x₀;
3: for t = 1, 2, ... do
4:   r, x' = reward and next state obtained by executing the ε-greedy action a = π^ε(x);
5:   a' = π(x');
6:   Q(x, a) = Q(x, a) + α ( r + γ max_{a''} Q(x', a'') − Q(x, a) );
7:   π(x) = argmax_{a''} Q(x, a'');
8:   x = x'
9: end for
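A tabular Q-learning sketch on a small assumed chain MDP (five states, move left/right, reward only at the rightmost state); the behavior policy is ε-greedy and the update uses the max over next-state actions, as in line 6 above. The environment and hyperparameters are illustrative, not from the book.

import numpy as np

def q_learning(n_states, n_actions, step, episodes=500,
               alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning: behave epsilon-greedily, update toward r + gamma * max_a' Q(x', a')."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        x = 0                                                   # assumed start state
        for _ in range(50):                                     # episode length cap
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[x].argmax())
            r, x_next, done = step(x, a)
            Q[x, a] += alpha * (r + gamma * (0 if done else Q[x_next].max()) - Q[x, a])
            x = x_next
            if done:
                break
    return Q

# illustrative chain MDP: action 1 moves right, action 0 moves left, +1 at the last state
def step(x, a):
    x_next = min(max(x + (1 if a == 1 else -1), 0), 4)
    return (1.0 if x_next == 4 else 0.0), x_next, x_next == 4

Q = q_learning(n_states=5, n_actions=2, step=step)
print(Q.argmax(axis=1))        # greedy action per state (1 = move right for non-terminal states)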

(tabular value
16.5 389

=
et al., 2010]

(16.32)

function

.(16.33)

lr.' l
2 I

= (16.34)

x . (16.35)

X , (16.36)
390

(0; . . . ; 1; . . . ;

1: (J = 0;
2: = arg ;
3: for t = 1, 2, . . . do
4: r , æ'
5: a' =

7: 7l" (æ) = a");


8: æ=
9: end for

(apprenticeship learning) ,
(Iearning
from demonstration) ,
(Iearning by

(imitation "learning).
16.6 391

D= ,…, ni' ni)} ,

reinforce-
ment learning) [Abbeel and Ng , 2004].

=E
392

(16.37)

X11") ?: 0 . (16.38)

(16.39)
"

S.t. 1

2: 1['

3: for t = do
4:
5: S.t. :S; 1;
6: 1[' =
7: end for
16.7 393

and Barto , 1998]. [Gosavi , 2003]


[Whiteson ,
[Mausam and Kolobov ,
[Sigaud and Buffet ,
Observable
et a l., 2010].

[Kaelbling et a l., [Kober et


Deisenroth et
[Kuleshov and Precup , and Mohri ,

and Ftistedt , (online


[Bubeck and Cesa-Bianchi ,
(regret

[Sutton ,
p.22.

and Dayan ,
and Niranjan , 1994].
et a l.,
and Scherrer , [Dann et

1992; Price and Boutili-


er , et a l., 2009]. [Abbeel and Ng , 2004;
Langford and

(approximate dynamic
394

16.1 (Upper Confidence


+

Q(k)

16.2

16.3

16.4

16.5 ).

16.6

16.7

16.8

16.9

16.10*
395

Abbeel , P. and A. Y. Ng. inverse rein-


forcement learning." In Proceedings of the 21st International Conference on
Banff, Canada.
Argall , B. D. , S. Chernova , M. Veloso , and B. Browning. (2009). "A survey
of robot learning from demonstration." A utonomous Systems ,
57(5):469-483.
and B. Fr istedt. (1985). Problems. Chapman & HalljCRC ,
London , U K.
Bertsekas , D. P. (2012). Dynamic Optimal Control: Approx-
Programming , 4th edition. Athena Scientific ,
S. and N. Cesa-Bianchi. (2012). "Regret analysis of stochastic and
nonstochastic mu1ti-armed bandit problems." Trends in
Learning, 5(1):1-122.
R. Babuska , B. De Schutter , and D. Ernst. (2010). Reinforcement
Programming Using Punction Approximators. Chap-
Press , Boca Raton , FL.
C. , G. Neumann , and J. Peters. (2014). "Policy eva1uation with tem-
poral differences: A survey and comparison." Journal of Machine Learning

Deisenroth , M. P. , G. Neumann , and J. Peters. on policy


search for robotics." and Trends in Robotics, 2(1-2):1-142.
Geist , M. and B. Scherrer. (2014). "Off-policy eligibility traces:
A survey." of Machine Research, 15:289-333.
Gosavi , A. (2003). Optimization: Optimization
Reinforcement Kluwer , Norwell , MA.
M. L. Littman , and A. W. Moore. (1996). "Reinforcement
A survey." Journal of Artificial Intelligence Research, 4:237-285.
Kober , J. , J. A. Bagnell , and J. Peters.
robotics: A survey." International on Robotics Research , 32(11):

Ku1eshov , V. and D. "Algorithms for the multi-armed bandit


396

problem." Journal of Machine Learning 1:1-48.


Langford , J. and B. (2005). "Relating reinforcement
mance to performance." In Proceedings of the 22nd
tional Conference on Machine 473-480 , Bonn, Germany.
Lin , L.-J. (1992). agents based on reinforcement learn-
ing , and Machine 8(3-4):293-32 1.
Mausam and A. Kolobov. (2012). with Decision Processes:
An AI Perspective. Claypool, San Rafael , CA.
Price , B. and C. Boutilier. (2003). "Accelerating reinforcement
through implicit imitation." Journal of Artificial Intelligence Research, 19:

Rummery, G. A. and M. (1994). "On-line Q-learning using connec-


tionist systems." Technical Report CUEDjF-INFENGjTR 166 , Engineering
Department , Cambridge University, Cambridge, U K.
O. and O. Buffet. (2010). Decision Processes in
Hoboken , NJ.
Sutton , R. S. (1988). predict by the methods of temporal differ-
ences." Machine Learning, 3(1):9-44.
Sutton, R. S. and A. G. Barto. (1998). Reinforcement An Introduc-
tion. MIT Press , Cambridge, MA.
Tesauro , G. (1995). "Temporal difference TD-Gammon." Com-
of the ACM, 38(3):58-68.
Ueno , T. , S. Maeda , M. Kawanabe , and S. Ishii. (2011). "Generalized TD learn-
ing." Journal of Machine Learning 12:1977-2020.
Vermorel , J. and M. Mohri. (2005). "Multi-armed bandit algorithms and empir-
ical evaluation." In Proceedings of the 16th Conference on Machine
437-448 , Porto, Portugal
C. J. C. H. and P. Dayan. (1992). "Q-learning." Machine Learning, 8
(3-4):279-292.
Whiteson , S. (2010). for Reinforcement Learning.
Springer , Berlin.
397

Andreyevich
=
transpose A T, (AT)ij =

(A.l)
(AB)T = B TA T . (A.2)

= A- 1A =

(AT )-1 = (A- 1 )T , (A.3)


=B lA- 1 .

(A .4)

tr(A T) = tr(A) , (A.5)


tr(A + B) = tr(A) + tr(B) , (A.6)
tr(AB) = tr(BA) , (A.7)
tr(ABC) = tr(BCA) = tr(CAB) . (A.8)

det(A) = L • • • (A.9)
uESn
400

det(I) =
AA AA
--n
d 4lu A - d 4EU
=A A A n4 A
1i

,
G

det(cA) = cn det(A) , (A. lO)


det(A T) = det(A) , (A.ll)
= det(A) det(B) , (A.12)
= det(A)-l , (A.13)
det(A n) = det(A)n . (A.14)

(A.15)

i
(A.16)

(A.17)
i

(A.18)
401

(A.19)

(A.20)

(\1 2 f (A.21)

rule)

(A.22)

(A.23)

. 1
(A.24)
ax ax

T>

, (A.25)

r> T
(A.26)

T>

-, (A.27)

-, (A.28)

. (A.29)
åA
402

(A.30)

(chain
= g (h

(A.31)

- b)
- byW(Ax - b) 2W(Ax- b)

= 2AW(Ax - b) . (A.32)

A=U :E yT (A.33)

matrix);

Value
vector) ,

matrix ap-

k:::;;

mm IIA-AIIF (A.34)
AE lRmxn

s.t. rank(A) =k .
403

Ak = Uk :E kV f' (A.35)

=
=
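Numerically, the best rank-k approximation of (A.34)/(A.35) is obtained by truncating the SVD; a short NumPy check with an illustrative random matrix:

import numpy as np

def best_rank_k(A, k):
    """Best rank-k approximation of A in Frobenius norm: keep the k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.random.rand(6, 4)
A2 = best_rank_k(A, k=2)
print(np.linalg.matrix_rank(A2), round(np.linalg.norm(A - A2), 3))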

'V =0, (B.1)

= , (B.2)

=
404

=
=
< 0.

g(x) < = <


=
g(x ) =

= =

l MZKO (B.3)

=0.

J(x) (B.4)

S.t . hi (X) = 0 (i = ,
(j = .
405

L Z = FJ Z h g + Z B

… (B.6)

(primal
(dual
(dual function) r : lR.m x

(B.7)

:::;; 0 , (B.8)

=
æ EollJ}
. (B.9)

:::;; p* , (B. I0)


406

mJ pi nu
CO
(B.ll)

(dual

(weak
= (strong

(B.12)
æ z

s.t. Ax

Q E

pro-
B 407

c.x= 2:2: CijXij , (B.13)

(i =

min c.x (B.14)


X

s.t. Ai. X = bi , i = 1, 2, .

X>- O.

f(X t+ l) < f(x t ) , t = 0, 1, 2 ,. (B.15)

f(x + + ð. xTVf(x) , (B.16)


408

+ Ll x) <

Ll x , (B.17)

method).

Coordinate descent minimizes f(x) by optimizing over one coordinate at a time while holding the others fixed; the analogous method for maximization is called coordinate ascent. Writing x = (x_1; x_2; …; x_d), the method starts from an initial point x^0 and produces iterates x^0, x^1, x^2, … by cyclically solving one-dimensional subproblems,

x_i^{t+1} = arg min_y f(x_1^{t+1}, …, x_{i−1}^{t+1}, y, x_{i+1}^t, …, x_d^t) ,   (B.18)

which guarantees a non-increasing sequence of objective values,

f(x^0) ≥ f(x^1) ≥ f(x^2) ≥ …   (B.19)

The method requires no gradient information and is attractive when each coordinate-wise subproblem is easy to solve. For smooth objectives the iterates x^0, x^1, x^2, … converge, under suitable conditions, to a stationary point; for non-differentiable objectives, however, the method may stop at a point that is not a minimum.
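The same quadratic objective used above also gives a compact illustration of the update (B.18), since each one-dimensional subproblem has a closed-form solution:

    import numpy as np

    # Coordinate descent on f(x) = 1/2 x^T Q x - b^T x, with exact
    # one-dimensional minimization for each coordinate, as in (B.18).
    Q = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
    b = np.array([1.0, -1.0])
    x = np.zeros(2)

    for t in range(100):
        for i in range(len(x)):
            # Setting d f / d x_i = 0 with the other coordinates fixed gives
            # x_i = (b_i - sum_{j != i} Q_ij x_j) / Q_ii.
            x[i] = (b[i] - Q[i] @ x + Q[i, i] * x[i]) / Q[i, i]

    print(x)                       # approaches the minimizer Q^{-1} b
    print(np.linalg.solve(Q, b))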

Appendix C Probability Distributions

Uniform distribution. The uniform distribution on an interval [a, b] (a < b) has a constant density over the interval:

p(x | a, b) = U(x | a, b) = 1 / (b − a) ,   x ∈ [a, b] ,   (C.1)
E[x] = (a + b) / 2 ,   (C.2)
var[x] = (b − a)² / 12 .   (C.3)

Bernoulli distribution. Named after Jacob Bernoulli, the Bernoulli distribution describes a binary variable x ∈ {0, 1} that takes the value 1 with probability μ ∈ [0, 1]:

P(x | μ) = Bern(x | μ) = μ^x (1 − μ)^{1−x} ,   (C.4)
E[x] = μ ,   (C.5)
var[x] = μ (1 − μ) .   (C.6)

Binomial distribution. The binomial distribution describes the number m of successes in N independent Bernoulli trials with success probability μ:

P(m | N, μ) = Bin(m | N, μ) = (N choose m) μ^m (1 − μ)^{N−m} ,   (C.7)
E[m] = N μ ,   (C.8)
var[m] = N μ (1 − μ) ,   (C.9)

where (N choose m) = N! / (m! (N − m)!). For N = 1 the binomial distribution reduces to the Bernoulli distribution.

Categorical and multinomial distributions. A discrete variable with d mutually exclusive states can be represented by a d-dimensional one-hot vector x = (x_1; x_2; …; x_d) with x_i ∈ {0, 1} and Σ_{i=1}^{d} x_i = 1. With parameters μ = (μ_1; μ_2; …; μ_d), μ_i ≥ 0, Σ_i μ_i = 1,

P(x | μ) = Π_{i=1}^{d} μ_i^{x_i} ,   (C.10)
E[x_i] = μ_i ,   (C.11)
var[x_i] = μ_i (1 − μ_i) ,   (C.12)
cov[x_j, x_i] = 𝕀[j = i] μ_i − μ_j μ_i ,   (C.13)

where 𝕀[·] is the indicator function. The multinomial distribution extends the binomial distribution to d outcomes: it gives the probability of observing counts m_1, m_2, …, m_d (with Σ_i m_i = N) in N independent trials,

P(m_1, m_2, …, m_d | N, μ) = Mult(m_1, m_2, …, m_d | N, μ)
    = ( N! / (m_1! m_2! ⋯ m_d!) ) Π_{i=1}^{d} μ_i^{m_i} ,   (C.14)
E[m_i] = N μ_i ,   (C.15)
var[m_i] = N μ_i (1 − μ_i) ,   (C.16)
cov[m_j, m_i] = −N μ_j μ_i   (j ≠ i) .   (C.17)

Beta distribution. The Beta distribution is a distribution over x ∈ [0, 1] with parameters a > 0 and b > 0:

[Figure: probability density of the Beta distribution for several values of (a, b).]

p(x | a, b) = Beta(x | a, b) = ( Γ(a + b) / (Γ(a) Γ(b)) ) x^{a−1} (1 − x)^{b−1}
            = ( 1 / B(a, b) ) x^{a−1} (1 − x)^{b−1} ,   (C.18)
E[x] = a / (a + b) ,   (C.19)
var[x] = a b / ( (a + b)² (a + b + 1) ) ,   (C.20)

where Γ(·) is the Gamma function

Γ(a) = ∫_0^∞ t^{a−1} e^{−t} dt ,   (C.21)

and B(a, b) is the Beta function

B(a, b) = Γ(a) Γ(b) / Γ(a + b) .   (C.22)

Dirichlet distribution. The Dirichlet distribution is a distribution over d-dimensional vectors μ = (μ_1; μ_2; …; μ_d) with μ_i ≥ 0 and Σ_{i=1}^{d} μ_i = 1. With parameters α = (α_1; α_2; …; α_d), α_i > 0, and α_0 = Σ_{i=1}^{d} α_i,

p(μ | α) = Dir(μ | α) = ( Γ(α_0) / (Γ(α_1) ⋯ Γ(α_d)) ) Π_{i=1}^{d} μ_i^{α_i − 1} ,   (C.23)
E[μ_i] = α_i / α_0 ,   (C.24)
var[μ_i] = α_i (α_0 − α_i) / ( α_0² (α_0 + 1) ) ,   (C.25)
cov[μ_j, μ_i] = −α_j α_i / ( α_0² (α_0 + 1) )   (j ≠ i) .   (C.26)

For d = 2 the Dirichlet distribution reduces to the Beta distribution.

Gaussian distribution. The univariate Gaussian (normal) distribution over x ∈ R has mean μ and variance σ²:

p(x | μ, σ²) = N(x | μ, σ²) = ( 1 / (√(2π) σ) ) exp( −(x − μ)² / (2σ²) ) ,   (C.27)
E[x] = μ ,   (C.28)
var[x] = σ² .   (C.29)

For a d-dimensional vector x, the multivariate Gaussian distribution with mean vector μ ∈ R^d and symmetric positive definite covariance matrix Σ ∈ R^{d×d} is

p(x | μ, Σ) = N(x | μ, Σ) = ( 1 / ( (2π)^{d/2} |Σ|^{1/2} ) ) exp( −½ (x − μ)^T Σ^{-1} (x − μ) ) ,   (C.30)
E[x] = μ ,   (C.31)
cov[x] = Σ .   (C.32)

Conjugate distributions. Suppose a variable x follows a distribution P(x | θ) with parameter θ, and let X = {x_1, x_2, …, x_m} be an observed sample. If a prior distribution p(θ) is such that the posterior p(θ | X) ∝ P(X | θ) p(θ) belongs to the same family as p(θ), then the prior and the likelihood are said to be conjugate, and p(θ) is a conjugate prior for P(x | θ).

For example, let x be a Bernoulli variable with parameter μ, take the Beta prior p(μ | a, b) = Beta(μ | a, b), and let x̄ be the mean of the m observations in X. The posterior is then

p(μ | X, a, b) ∝ P(X | μ) p(μ | a, b) ∝ μ^{m x̄ + a − 1} (1 − μ)^{m − m x̄ + b − 1} ,   (C.33)

which is again a Beta distribution, with updated parameters a' = a + m x̄ and b' = b + m − m x̄. Conjugacy thus lets the posterior be obtained simply by updating the parameters of the prior with statistics of the data, which is particularly convenient for incremental (on-line) Bayesian updating.
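For instance, starting from the uniform prior Beta(μ | 1, 1) and observing m = 10 outcomes of which 7 equal 1 (so m x̄ = 7), the posterior is Beta(μ | 8, 4): its mean 8/12 ≈ 0.67 sits between the maximum likelihood estimate 0.7 and the prior mean 0.5.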

KL divergence. The Kullback-Leibler (KL) divergence measures the difference between two probability distributions P and Q. For continuous distributions with densities p(x) and q(x),

KL(P ‖ Q) = ∫ p(x) ln ( p(x) / q(x) ) dx .   (C.34)

It is nonnegative, with KL(P ‖ Q) = 0 if and only if P = Q, but it is not symmetric: in general KL(P ‖ Q) ≠ KL(Q ‖ P), so it is not a distance metric. Expanding (C.34) gives

KL(P ‖ Q) = ∫ p(x) ln p(x) dx − ∫ p(x) ln q(x) dx   (C.35)
          = −H(P) + H(P, Q) ,   (C.36)

where H(P) is the entropy of P and H(P, Q) is the cross entropy of P and Q.
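As a concrete instance of (C.34): for two univariate Gaussians P = N(μ_1, σ_1²) and Q = N(μ_2, σ_2²), the integral has the closed form

KL(P ‖ Q) = ln(σ_2 / σ_1) + ( σ_1² + (μ_1 − μ_2)² ) / (2 σ_2²) − 1/2 ,

which is zero exactly when μ_1 = μ_2 and σ_1 = σ_2, and is clearly not symmetric in P and Q.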
Index

AdaBoost, 173
Bagging, 178
Boosting, 173, 190
ECOC, 64
F1, 32
ILP, 357, 364
LASSO, 252, 261
LVQ, 204, 218
MDP, 371
MvM, 63
OvO, 63
OvR, 63
PCA, 229
ReLU, 114
RIPPER, 353
RKHS, 128
S3VM, 298
Softmax, 375
SVM, 123
WEKA, 16
Main symbols

sup(·)    supremum
𝕀(·)     indicator function: 1 if the argument is true, 0 otherwise
sign(·)   sign function: takes the values −1, 0, 1 when the argument is negative, zero, or positive
Contents

Chapter 1 Introduction ..... 1
  1.1 Preamble ..... 1
  1.2 Basic Terminology ..... 2
  1.3 Hypothesis Space ..... 4
  1.4 Inductive Bias ..... 6
  1.5 Brief History ..... 10
  1.6 Current State of Applications ..... 13
  1.7 Further Reading ..... 16
  Exercises
  References ..... 20
  Break Time ..... 22

Chapter 2 Model Evaluation and Selection ..... 23
  2.1 Empirical Error and Overfitting ..... 23
  2.2 Evaluation Methods ..... 24
  2.3 Performance Measures ..... 28
  2.4 Comparison Tests ..... 37
  2.5 Bias and Variance ..... 44
  2.6 Further Reading ..... 46
  Exercises ..... 48
  References ..... 49
  Break Time ..... 51

Chapter 3 Linear Models ..... 53
  3.1 Basic Form ..... 53
  3.2 Linear Regression ..... 53
  3.3 Logistic Regression ..... 57
  3.4 Linear Discriminant Analysis ..... 60
  3.5 Multi-class Classification ..... 63
  3.6 Class Imbalance ..... 66
  3.7 Further Reading ..... 67
  Exercises ..... 69
  References ..... 70
  Break Time ..... 72

Chapter 4 Decision Trees ..... 73
  4.1 Basic Process ..... 73
  4.2 Split Selection ..... 75
  4.3 Pruning ..... 79
  4.4 Continuous and Missing Values ..... 83
  4.5 Multivariate Decision Trees ..... 88
  4.6 Further Reading ..... 92
  Exercises ..... 93
  References ..... 94
  Break Time ..... 95

Chapter 5 Neural Networks ..... 97
  5.1 Neuron Model ..... 97
  5.2 Perceptrons and Multi-layer Networks ..... 98
  5.3 Error Backpropagation ..... 101
  5.4 Global Minima and Local Minima ..... 106
  5.5 Other Common Neural Networks ..... 108
  5.6 Deep Learning ..... 113
  5.7 Further Reading ..... 115
  Exercises ..... 116
  References ..... 117
  Break Time ..... 120

Chapter 6 Support Vector Machines ..... 121
  6.1 Margin and Support Vectors ..... 121
  6.2 Dual Problem ..... 123
  6.3 Kernel Functions ..... 126
  6.4 Soft Margin and Regularization ..... 129
  6.5 Support Vector Regression ..... 133
  6.6 Kernel Methods ..... 137
  6.7 Further Reading ..... 139
  Exercises ..... 141
  References ..... 142
  Break Time ..... 145

Chapter 7 Bayesian Classifiers ..... 147
  7.1 Bayesian Decision Theory ..... 147
  7.2 Maximum Likelihood Estimation ..... 149
  7.3 Naive Bayes Classifiers ..... 150
  7.4 Semi-naive Bayes Classifiers ..... 154
  7.5 Bayesian Networks ..... 156
  7.6 The EM Algorithm ..... 162
  7.7 Further Reading ..... 164
  Exercises ..... 166
  References ..... 167
  Break Time ..... 169

Chapter 8 Ensemble Learning ..... 171
  8.1 Individual Learners and Ensembles ..... 171
  8.2 Boosting ..... 173
  8.3 Bagging and Random Forest ..... 178
  8.4 Combination Strategies ..... 181
  8.5 Diversity ..... 185
  8.6 Further Reading ..... 190
  Exercises ..... 192
  References ..... 193
  Break Time ..... 196

Chapter 9 Clustering ..... 197
  9.1 The Clustering Task ..... 197
  9.2 Performance Measures ..... 197
  9.3 Distance Measures ..... 199
  9.4 Prototype-Based Clustering ..... 202
  9.5 Density-Based Clustering ..... 211
  9.6 Hierarchical Clustering ..... 214
  9.7 Further Reading ..... 217
  Exercises ..... 220
  References ..... 221
  Break Time ..... 224

Chapter 10 Dimensionality Reduction and Metric Learning ..... 225
  10.1 k-Nearest Neighbor Learning ..... 225
  10.2 Low-Dimensional Embedding ..... 226
  10.3 Principal Component Analysis ..... 229
  10.4 Kernelized Linear Dimensionality Reduction ..... 232
  10.5 Manifold Learning ..... 234
  10.6 Metric Learning ..... 237
  10.7 Further Reading ..... 240
  Exercises ..... 242
  References ..... 243
  Break Time ..... 246

Chapter 11 Feature Selection and Sparse Learning ..... 247
  11.1 Subset Search and Evaluation ..... 247
  11.2 Filter Methods ..... 249
  11.3 Wrapper Methods ..... 250
  11.4 Embedded Methods and L1 Regularization ..... 252
  11.5 Sparse Representation and Dictionary Learning ..... 254
  11.6 Compressed Sensing ..... 257
  11.7 Further Reading ..... 260
  Exercises ..... 262
  References ..... 263
  Break Time ..... 266

Chapter 12 Computational Learning Theory ..... 267
  12.1 Basics ..... 267
  12.2 PAC Learning ..... 268
  12.3 Finite Hypothesis Spaces ..... 270
  12.4 VC Dimension ..... 273
  12.5 Rademacher Complexity ..... 279
  12.6 Stability ..... 284
  12.7 Further Reading ..... 287
  Exercises ..... 289
  References ..... 290
  Break Time ..... 292

Chapter 13 Semi-Supervised Learning ..... 293
  13.1 Unlabeled Samples ..... 293
  13.2 Generative Methods ..... 295
  13.3 Semi-Supervised SVM ..... 298
  13.4 Graph-Based Semi-Supervised Learning ..... 300
  13.5 Disagreement-Based Methods ..... 304
  13.6 Semi-Supervised Clustering ..... 307
  13.7 Further Reading ..... 311
  Exercises ..... 313
  References ..... 314
  Break Time ..... 317

Chapter 14 Probabilistic Graphical Models ..... 319
  14.1 Hidden Markov Models ..... 319
  14.2 Markov Random Fields ..... 322
  14.3 Conditional Random Fields ..... 325
  14.4 Learning and Inference ..... 328
  14.5 Approximate Inference ..... 331
  14.6 Topic Models ..... 337
  14.7 Further Reading ..... 339
  Exercises ..... 341
  References ..... 342
  Break Time ..... 345

Chapter 15 Rule Learning ..... 347
  15.1 Basic Concepts ..... 347
  15.2 Sequential Covering ..... 349
  15.3 Pruning and Optimization ..... 352
  15.4 First-Order Rule Learning ..... 354
  15.5 Inductive Logic Programming ..... 357
  15.6 Further Reading ..... 363
  Exercises ..... 365
  References ..... 366
  Break Time ..... 369

Chapter 16 Reinforcement Learning ..... 371
  16.1 Task and Reward ..... 371
  16.2 K-Armed Bandits ..... 373
  16.3 Model-Based Learning ..... 377
  16.4 Model-Free Learning ..... 382
  16.5 Value Function Approximation ..... 388
  16.6 Imitation Learning ..... 390
  16.7 Further Reading ..... 393
  Exercises ..... 394
  References ..... 395
  Break Time ..... 397

Appendix ..... 399
  A Matrices ..... 399
  B Optimization ..... 403
  C Probability Distributions ..... 409

Postscript ..... 417

Index ..... 419
