
(machine learning).

[Mitchell ,

(learning

et a l., 2001].
2

(data

(attribute
(attribute (samp1e

(feature vector).
=

Xi = (X i1; Xi2; . . . ;

), (dimensionality).
(training) ,

(training (training samp1e) ,


(training
(training
1. 2 3

(regression).
(binary (positive
(negative
(multi-class
Y1) , (X2 , Y2) ,..., (Xm ,
=

(testing
(testing
f(x).

(supervised (unsupervised

(unseen instance).

(distribution)
(independent and identically
4

(inductive .

1
2
3
4

"
1.3 5

[Cohen and Feigenbaum. ^ ^

*) ^

x 3 x 3+1=

7 11

(version
6

*) ^ *)"
(inductive bias) ,

(feature
1. 4 7

y
&

B
,
/' "
-,
\ A

(Occam's

= _X 2 + 6x +

*) ^ *) ^
8

y y ,
B
J
/
B
, ‘BEES

(a) (b)

I (1. 1)
h æEX-X

P(h I
f f h

I X ,'ca)

= L IX,
æEX-X h

h
1. 4 9

= 21.1' 1- 1 P(æ). 1 . (1. 2)

f) , (1. 3)
f f

Free Lunch
[Wolpert , 1996; Wolpert and Macready , 1995].

*)
*) ^
^

"
10

(Logic
(General Problem

A.

E. A.

S. Michalski
B.

J.

S. G.
[Michalski et a l.,
1. 5 11

G.
[Carbonell ,

R. S. et a l.,

E. A.
[Cohen and

S.

Logic
12

S.

J. J.
D. E.

(statistical
Vector
(kernel

N. J.
1. 6 13

Intel
14

and DeCoste ,
ZJZ;;225;

T.

(data
1.6 15

S.

S.
16

(Sparse Distributed

[Mitchell , [Duda et al., 2001; Alc


paydin , 2004; Flach , [Hastie et al. ,
[Bishop ,
[Shalev-Shwartz and Ben-David , [Witten et
2011]

http://www.cs.waikato.
ac.nz/ml/weka j.
[Michalski et a l.,
Morgan
1.7 17

A.
and Feigenbaum ,
[Dietterich ,

(transfer learning) [Pan and Yang ,


(1earning by
(deep

and [Winston , 1970]

[Simon and Lea ,


[Mitchell ,

et a l.,

1996;

(principle of multiple
explanations)

of Machine Learning Learning.


Intelli-
of Intelligence
Transactions on Knowledge Discovery
and Knowledge
18.

Transactions on Pattem
Analysis and Machine Com-
on Neural
19

1.1

1. 2

^ *))
v ^ ,

^
V (A = *)
=

1. 3

1.4*

= I
h

1. 5
20

(2007). 3(12):35-44.

Alpaydin , E. (2004). Introduction to Machine Learning. MIT Press , Cambridge ,


MA.
Asinis , E. (1984). Epicurus' Method. Cornell University Press , It haca ,
NY.
Bishop , C. M. (2006). Pattern Recognition and Machine
New York , NY.
Blumer , A. , A. Ehrenfeucht , D. Haussler , and M. K. (1996). -

J. G. , ed. (1990). Machine Learning: and Methods. MIT


Press , Cambridge , MA.
Cohen , P. R. and E. A. eds. (1983). The of
Intelligence , volume 3. William Kaufmann , New York , NY.
Dietterich, T. G. (1997). "Machine Four current directions."

P. (1999). "The role of Occam's razor in knowledge discovery."


Mining and K Discovery,
Duda , R. 0. , P. E. Hart , and D. G. Stork. (2001). Pattern 2nd
edition. & Sons , New York , NY.
Flach, P. (2012). Machine The Art and Science of that
Sense of Cambridge University Press , Cambridge , UK.
H. Mannila , and P. Smyth. (2001). Principles of MIT
Press , Cambridge , MA.
Hastie , T. , R. J. Friedman. (2009). The Elements of
Learning, 2nd edition. Springer , New York , NY.
Hunt , E. G. and D. 1. Hovland. (1963). "Programming a model of human con-
cept Thought (E. and J.
eds.) , 310-325 , McGraw Hill , New York , NY.
21

Kanerva , P. (1988). Distributed Memory. MIT Press , Cambridge , MA.


Michalski , J. G. Carbonell , and T. M. Mitchell , eds. (1983). Machine
Learning: An Intelligence Tioga , Palo Alto , CA.
Mitchell , T. (1997). Learning. McGraw Hill , New York , NY.
Mitchell , T. M. (1977). "Version spaces: A candidate elimination approach to
rule learning." In Proceedings of the 5th Joint on
Intelligence 305-310 , Cambridge , MA.
and D. DeCoste. (2001). "Machine science: State of
the art and future prospects." Science , 293(5537):2051-2055.
Pan , S. J. and Q. Yang. of transfer learning." IEEE
on 22(10):1345-1359.
Shalev-Shwartz , S. and S. Ben-David. (2014). Understanding Machine Learn-
ing. Cambridge University Press , Cambridge , UK.
Simon , H. A. and G. Lea. (1974). "Problem solving and rule induction: A
unified view." In K and Cognition (L. W. ed.) , 105-127,
Erlbaum , New York , NY.
Vapnik , V. N. (1998). Learning New York , NY.
Webb , G. 1. (1996). experimental evidence against the utility of Oc-
cam's razor." Journal of Intelligence Research, 43:397-417.
Winston , P. H. (1970). "Learning structural descriptions from examples." Tech-
nical Report AI-TR-231 , AI Lab , MIT , Cambridge , MA.
Witten, I. H., E. Frank, and M. A. Hall. (2011). Data Mining: Practical Machine Learning Tools and Techniques, 3rd edition. Elsevier, Burlington, MA.
Wolpert, D. H. (1996). "The lack of a priori distinctions between learning algorithms." Neural Computation, 8(7):1341-1390.
Wolpert, D. H. and W. G. Macready. (1995). "No free lunch theorems for search." Technical Report SFI-TR-05-010, Santa Fe Institute, Santa Fe, NM.
Zhou , Z.-H. (2003). "Three perspectives of data
143(1):139-146.
22

Samuel, 1901- 1990)

(John McCarthy, ,

studies in machine learning using the game of checkers"

and
(error
rate)

-;:;) x 100%
(error) ,
(training
(empirical error) (generalization
24

(model

(testing
(testing
2.2 25

Yl) , (X2 ,Y2) , … 7

x 30% = 70%.

(stratified
26

rv

1997]

(cross
= D1 U D k, Di n Dj = ø

(k-fold cross
validat ion).

D
G
ID1 1D2 1 31D4 1Ds ID6 1D71 D81Dg IDlOI
D

1
r

IDsID6ID7ID8IDgID lO l J
2.2 27

[Efron and Tibshirani, 1993]

lim_{m→∞} (1 − 1/m)^m = 1/e ≈ 0.368,                                        (2.1)

so roughly 36.8% of the original examples never appear in a bootstrap sample; these left-out examples can serve as a test set, giving the out-of-bag estimate.
28

(parameter tuning).

(validation

measure).
2.3 29

= {(X1 , Y1) , . . . , (Xm,

(mean squared error)

(2.2)

E (f ;Ð) = (2.3)

(2 .4)

acc (f; D) (f (Xi) (2.5)

1- E (f ;D) .

, (2.6)
30

acc(f; D) = ∫_{x∼D} I(f(x) = y) p(x) dx = 1 − E(f; D).                      (2.7)

For binary classification, predictions split into four cases, summarized by the confusion matrix:

                     predicted positive    predicted negative
actually positive           TP                    FN
actually negative           FP                    TN

Precision P and recall R are defined as

P = TP / (TP + FP),                                                          (2.8)
R = TP / (TP + FN).                                                          (2.9)
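A minimal sketch (not taken from the book's own code, which it does not provide): counting the confusion-matrix entries and computing the precision and recall of Eqs. (2.8)-(2.9) with plain Python lists.

```python
# Precision/recall from label lists; "positive" marks the positive class code.
def precision_recall(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0   # Eq. (2.8)
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0      # Eq. (2.9)
    return precision, recall

# Example: 6 samples, positive class encoded as 1.
print(precision_recall([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
```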
2.3 31

10

./

0.2

Ol 0 2 04 0 6

(Break-Event
32

F1 = 2 × P × R / (P + R) = 2 × TP / (m + TP − TN),                           (2.10)

where m is the total number of examples. More generally, the Fβ measure trades off precision against recall [van Rijsbergen, 1979]:

Fβ = (1 + β²) × P × R / ((β² × P) + R).                                      (2.11)

β = 1 recovers the standard F1; β > 1 weights recall more heavily, β < 1 weights precision more heavily.

When there are several binary confusion matrices (e.g. one per class or one per data set), two averaging schemes are common. Macro-averaging first computes the per-matrix precision P_i and recall R_i and averages them:

macro-P = (1/n) Σ_{i=1}^{n} P_i,                                             (2.12)
macro-R = (1/n) Σ_{i=1}^{n} R_i,                                             (2.13)
macro-F1 = 2 × macro-P × macro-R / (macro-P + macro-R).                      (2.14)

Micro-averaging first averages the confusion-matrix entries, giving TP̄, FP̄, TN̄, FN̄, and then computes (micro-F1):

micro-P = TP̄ / (TP̄ + FP̄),                                                    (2.15)
micro-R = TP̄ / (TP̄ + FN̄),                                                    (2.16)
micro-F1 = 2 × micro-P × micro-R / (micro-P + micro-R).                      (2.17)
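A minimal sketch of Eqs. (2.10)-(2.17): the Fβ measure for one confusion matrix, and macro-/micro-averaged F1 over several confusion matrices (the input format, a list of (TP, FP, FN) tuples, is an assumption for illustration).

```python
def f_beta(p, r, beta=1.0):
    # Eq. (2.11); beta = 1 gives F1 as in Eq. (2.10).
    return (1 + beta**2) * p * r / (beta**2 * p + r) if p + r > 0 else 0.0

def macro_micro_f1(confusions):
    """confusions: list of (TP, FP, FN) tuples, one per binary task."""
    ps = [tp / (tp + fp) for tp, fp, fn in confusions]
    rs = [tp / (tp + fn) for tp, fp, fn in confusions]
    macro_p, macro_r = sum(ps) / len(ps), sum(rs) / len(rs)
    macro_f1 = f_beta(macro_p, macro_r)                        # Eq. (2.14)
    tp_bar = sum(tp for tp, _, _ in confusions) / len(confusions)
    fp_bar = sum(fp for _, fp, _ in confusions) / len(confusions)
    fn_bar = sum(fn for _, _, fn in confusions) / len(confusions)
    micro_p = tp_bar / (tp_bar + fp_bar)                       # Eq. (2.15)
    micro_r = tp_bar / (tp_bar + fn_bar)                       # Eq. (2.16)
    micro_f1 = f_beta(micro_p, micro_r)                        # Eq. (2.17)
    return macro_f1, micro_f1

print(macro_micro_f1([(30, 10, 5), (20, 5, 15)]))
```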
2.3.3

(cut

(Receiver Operating

[Spackman ,

The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR), defined as

TPR = TP / (TP + FN),                                                        (2.18)
FPR = FP / (TN + FP).                                                        (2.19)
34

[Figure: ROC curve and its AUC]

When two learners' ROC curves cross, they are commonly compared by the area under the ROC curve, the AUC (Area Under ROC Curve).

y l), (X2, Y2) , . . . , (x m,


=0, x m =
2.3 35

Given m⁺ positive and m⁻ negative examples, define the ranking loss

ℓ_rank = (1 / (m⁺ m⁻)) Σ_{x⁺∈D⁺} Σ_{x⁻∈D⁻} ( I(f(x⁺) < f(x⁻)) + ½ I(f(x⁺) = f(x⁻)) ),   (2.21)

i.e. the fraction of positive-negative pairs in which the positive example is scored below the negative one, with ties counted as one half. Then

AUC = 1 − ℓ_rank.                                                            (2.22)
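A minimal sketch of Eqs. (2.21)-(2.22): computing the ranking loss ℓ_rank directly from the scores of positive and negative examples and taking AUC = 1 − ℓ_rank.

```python
def auc_from_scores(pos_scores, neg_scores):
    m_plus, m_minus = len(pos_scores), len(neg_scores)
    loss = 0.0
    for sp in pos_scores:
        for sn in neg_scores:
            if sp < sn:
                loss += 1.0          # positive ranked below a negative
            elif sp == sn:
                loss += 0.5          # ties contribute 1/2
    l_rank = loss / (m_plus * m_minus)   # Eq. (2.21)
    return 1.0 - l_rank                  # Eq. (2.22)

print(auc_from_scores([0.9, 0.8, 0.35], [0.7, 0.3, 0.1]))
```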

(unequa1 cost).

(cost
costii =
17'1 / .,
>
costl0 5 1 qa-
9"-
36

(total

E (f; D; cost) (f

+ L I. (2.23)

(cost

pX costOl
(2.24)
+ (1 - p)

(normaliza-

!NR X P x x (1- p) (2.25)


norm p + (1 - p)
2.8.

FNR == 1 -
2.4 37

0.5 1. 0

2010]

=
38

(1 - x

P(Ê; f) = (2.26)

f)/8f =
=

0.25

0.20

0.15

0, 10

= 0.3)

(binomial

e = maxf (2.27)
2 .4 39

('Binomial' ,

www.r-project.org

(2.28)

(2.29)

EO)
(2.30)

-10 5
Tt
‘4 5 10

= 10)

1
40

-1).

Critical values t_{α/2, k−1} of the two-tailed t-test:

α \ k      2        5        10       20       30
0.05     12.706    2.776    2.262    2.093    2.045
0.10      6.314    2.132    1.833    1.729    1.699

(paired t-tests

.ð. 1 , .ð.2 , . . .

(2.31)

x
2.4 41

1998].

0.5(ßi +

f..L
Tt = (2.32)
5
0.2 I: ut

2.4 .3

(contingency table)

eOO eOl
eîo e l1

-
le01 -

- e lO l- 1)2
(2.33)
^

('Chisquare' , 0.1
= 2
42

2 .4.4

2,

Dl 1 2 3
D2 l 2.5 2.5
D3 1 2 3
D4 1 2 3
1 2.125 2.875

+ -

- - (N -1)Tx2
YF Z N(k-1)-TX2' (2.35)
2.4 43

- l)(N -

0.05

2 3 4 5 6 7 8 9 10
1. (k 4 10.128 5.143 3.863 3.259 2.901 2.661 2.488 2.355 2.250
(/F / , 1 5 7.709 4.459 3.490 3.007 2.711 2.508 2.359 2.244 2.153
-1 , (k -1) * (N -1)). 8 5.591 3.739 3.072 2.714 2.485 2.324 2.203 2.109 2.032
10 5.117 3.555 2.960 2.634 2.422 2.272 2.159 2.070 1. 998
15 4.600 3.340 2.827 2.537 2.346 2.209 2.104 2.022 1. 955
20 4.381 3.245 2.766 2.492 2.310 2.179 2.079 2.000 1. 935

0.1

2 3 4 5 6 7 8 9 10
4 5.538 3.463 2.813 2.480 2.273 2.130 2.023 1. 940 1. 874
5 4.545 3.113 2.606 2.333 2.158 2.035 1.943 1.870 1. 811
8 3.589 2.726 2.365 2.157 2.019 1. 919 1.843 1. 782 1. 733
10 3.360 2.624 2.299 2.108 1. 980 1. 886 1. 814 1. 757 1. 710
15 3.102 2.503 2.219 2.048 1.931 1. 845 1. 779 1. 726 1. 682
20 2.990 2.448 2.182 2.020 1.909 1. 826 1. 762 1. 711 1. 668

(post-hoc

CD=JF , (2.36)

Inf)!
sqrt

2 3 4 5 6 7 8 9 10
0.05 1. 960 2.344 2.569 2.728 2.850 2.949 3.031 3.102 3.164
0.1 1. 645 2.052 2.291 2.459 2.589 2.693 2.780 2.855 2.920
44

=
QO.05 = =

1. 0 3.0

(bias-variance

f(x;
2.5 45

f̄(x) = E_D[ f(x; D) ],                                                       (2.37)

the variance of the prediction over different training sets of the same size,

var(x) = E_D[ ( f(x; D) − f̄(x) )² ],                                          (2.38)

the noise,

ε² = E_D[ (y_D − y)² ],                                                       (2.39)

and the squared bias of the expected prediction,

bias²(x) = ( f̄(x) − y )².                                                     (2.40)

Assuming the noise has zero mean, E_D[y_D − y] = 0, the expected generalization error decomposes as

E(f; D) = E_D[ ( f(x; D) − y_D )² ]
        = E_D[ ( f(x; D) − f̄(x) + f̄(x) − y_D )² ]
        = E_D[ ( f(x; D) − f̄(x) )² ] + E_D[ ( f̄(x) − y_D )² ]
          + E_D[ 2 ( f(x; D) − f̄(x) )( f̄(x) − y_D ) ]
        = E_D[ ( f(x; D) − f̄(x) )² ] + E_D[ ( f̄(x) − y + y − y_D )² ]
        = E_D[ ( f(x; D) − f̄(x) )² ] + ( f̄(x) − y )² + E_D[ (y − y_D)² ]
          + 2 E_D[ ( f̄(x) − y )( y − y_D ) ]
        = E_D[ ( f(x; D) − f̄(x) )² ] + ( f̄(x) − y )² + E_D[ (y_D − y)² ],      (2.41)

that is,

E(f; D) = bias²(x) + var(x) + ε².                                             (2.42)


46

variance

Tibshirani ,

1989],
2.6 47

and McNeil , 1983]. [Hand and Till ,

[Drummond and Holte ,

learning) [Elkan , 2001;


Zhou and Liu ,
[Dietterich ,
[Demsar ,
[Geman et al.,
variance-covariance

and Dietterich , 1995; Kohavi and Wolpert ,


1996; Breiman, 1996; Friedman , 1997;
48

2.1

2.2

2.3

2.4

2.5

2.6

2.7

2.8

- . (2 .43)

, x-x
(2 .44)

2.9

2.10*
49

Bradley, A. P. (1997). "The use of the area under the ROC curve in the evaluation of machine learning algorithms." Pattern Recognition.
Breiman, L. (1996). "Bias, variance, and arcing classifiers." Technical Report 460, Statistics Department, University of California, Berkeley, CA.
Demsar , J. (2006). "Statistical comparison of classifiers over mu1tip1e data
sets." Journal of Machine Leaming Research, 7:1-30.
Dietterich, T. G. (1998). "Approximate statistical tests for comparing super-
vised classification 1earning algorithms." Neural
1923.
P. uni

CA.

for visualizing classifier performance." Machine


Efron , B. and R. Tibshirani. (1993). An Introduction to the Chap-
man & Hall , New York , NY.
Elkan , C. (2001). "The foundations of cost-senstive 1earning." In Proceedings of
the 17th Intemational Joint Conference on Artificial
973-978 , Seatt1e , WA.
Fawcett , T. (2006). "An introduction to ROC ana1ysis." Recognition
Letters ,
Friedman , J. H. (1997). "On bias , variance , 0/1-108s , and the curse-of-
dimensionality." Knowledge
E. Bienenstock , and R. Doursat. (1992). networks and the
bias/variance dilemma." Neural 4(1):1 58

D. J. and R. J. Till. generalisation of the area under


the ROC curve for multiple class classification prob1ems." Learning,
45(2):171-186
J. A. and B. J. McNeil. (1983). "A method ofcomparing
under receiver operating characteristic curves derived from the same cases."
148(3):839-843.
50

Kohavi , R. and D. H. Wo1pert. (1996). variance decomposition for


zero-one 10ss functions." 1n Proceeding of the 13th Conference
275-283 , Bari , 1ta1y.
E. B. and T. G. Dietterich. (1995). coding cor-
rects bias and variance." 1n Proceedings of the 12th International Conference
on M achine 313-321 , Tahoe City, CA.
Mitchell , T. (1997). Machine Learning. McGrawHill , New
Spackm
inductive 1n Proceedings of W orkshop on
160-163 , Ithaca , NY.
VanRijsbergen , C. J. (1979). Retrieval, 2nd edition. Butterworths ,
London , UK.
Wellek , S. (2010). of
2nd edition. Chapman & HalljCRC , Boca Raton , FL.
Zhou , Z.-H. and X.-Y. Liu. (2006). "On mu1ti-class
on
Boston, WA.
51

.
Gosset ,

(Student's t-test).

,
Pearson,
(University College
=

f(æ) + b, (3.1)

f(æ) (3.2)

(un-
derstandability)

= {(æ1 , Y1) , (Xi1;


. . ; Xid) , (linear

=
54

(0 , (1 , 0, 0).

(3.3)

(5quare 1055)

(w*, b*) = argmin_{(w,b)} Σ_{i=1}^{m} ( f(x_i) − y_i )²
         = argmin_{(w,b)} Σ_{i=1}^{m} ( y_i − w x_i − b )².                   (3.4)

(Euclidean
(least

estimation).

434-t
no-nu-
z'
'b\lll-/
(3.5)

Z
(3.6)

f(x) =

Z Yi(Xi - x)
(3.7)
3.2 55

(3.8)

f(Xi) + ,

= x

X11 X12

X22 X2d 1.)(71


I I xi 1
.
.. ..
Xm l Xm 2 Xmd 1} \x;, 1

ŵ* = argmin_ŵ (y − X ŵ)ᵀ (y − X ŵ).                                           (3.9)

Let E_ŵ = (y − X ŵ)ᵀ (y − X ŵ); differentiating with respect to ŵ gives

∂E_ŵ / ∂ŵ = 2 Xᵀ (X ŵ − y).                                                   (3.10)

When XᵀX is a full-rank (positive definite) matrix, setting this derivative to zero yields

ŵ* = (XᵀX)⁻¹ Xᵀ y,                                                            (3.11)

and with x̂_i = (x_i; 1) the learned model is

f(x̂_i) = x̂_iᵀ (XᵀX)⁻¹ Xᵀ y.                                                   (3.12)
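A minimal sketch of the closed-form solution in Eq. (3.11): the design matrix is augmented with a constant-1 column so the bias b is absorbed into ŵ, and the normal equations are solved directly (this assumes XᵀX is full rank, as stated above).

```python
import numpy as np

def least_squares(X, y):
    X_hat = np.hstack([X, np.ones((X.shape[0], 1))])       # append the "1" column
    w_hat = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)  # Eq. (3.11)
    return w_hat                                            # last entry plays the role of b

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.7 + rng.normal(scale=0.1, size=50)
print(least_squares(X, y))   # approximately [1.5, -2.0, 0.5, 0.7]
```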

(3.13)

lny = wTx +b (3.14)

(log-linear

U
30

20

2 X
3.3 57

y = (3.15)

(generalized linear
(link
g(-) =

(unit-step function)

I 0, z<0;
y= < 0.5 , z = 0 ; (3.16)
1 1, z >0,

Z >O
z = 0;
l

10 5 10 z
58

(surrogate

y = 1 / (1 + e^{−z}),                                                         (3.17)

which, substituted as the link function, gives

y = 1 / (1 + e^{−(wᵀx + b)}).                                                 (3.18)

This can be rewritten as

ln( y / (1 − y) ) = wᵀx + b.                                                  (3.19)

Viewing y as the probability that x is positive and 1 − y as the probability that it is negative,

y / (1 − y)                                                                   (3.20)

is the odds, and

ln( y / (1 − y) )                                                             (3.21)

is the log odds (logit). The resulting model is logistic regression: treating y as the posterior probability p(y = 1 | x), Eq. (3.19) becomes

ln( p(y = 1 | x) / p(y = 0 | x) ) = wᵀx + b,                                  (3.22)

so that

p(y = 1 | x) = e^{wᵀx + b} / (1 + e^{wᵀx + b}),                                (3.23)
p(y = 0 | x) = 1 / (1 + e^{wᵀx + b}).                                          (3.24)

(maximum likelihood
(log-
likelihood)

1 (3.25)

X = (x; = 1 1
= p(y = 0 1 = 1-

+ . (3.26)

= E . (3.27)

and
Vandenberghe , descent

= . (3.28)

+1 at \
(3.29)
60

∂ℓ(β) / ∂β = −Σ_{i=1}^{m} x̂_i ( y_i − p1(x̂_i; β) ),                           (3.30)

and, for Newton's method, the update rule is

β^{t+1} = β^t − ( ∂²ℓ(β) / ∂β ∂βᵀ )⁻¹ ∂ℓ(β) / ∂β.                              (3.31)
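A minimal sketch (the fixed step size and iteration count are assumptions): minimizing the log-likelihood objective of Eq. (3.27) by plain gradient descent, using the gradient of Eq. (3.30); the text's Newton iteration of Eq. (3.31) would converge faster but needs the Hessian.

```python
import numpy as np

def logistic_regression(X, y, lr=0.1, iters=2000):
    X_hat = np.hstack([X, np.ones((X.shape[0], 1))])    # beta = (w; b)
    beta = np.zeros(X_hat.shape[1])
    for _ in range(iters):
        p1 = 1.0 / (1.0 + np.exp(-X_hat @ beta))         # p(y=1 | x), Eq. (3.23)
        grad = -X_hat.T @ (y - p1)                       # Eq. (3.30)
        beta -= lr * grad / len(y)
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X @ np.array([2.0, -1.0]) + 0.3 > 0).astype(float)
print(logistic_regression(X, y))   # direction roughly proportional to (2, -1, 0.3)
```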

Discriminant

Xl

"

=
3.4 61

- wT :E ow +
(3.32)

scatter matrix)

+:E 1

(3.33)
æEXo

scatter matrix)

Sb = , (3.34)

(3.35)

(generalized
Rayleigh

1,

(3.36)
s.t. wTSww = 1.

Sb W (3.37)
62

Sb W , (3.38)

. (3.39)

St = Sb + Sw

, (3.40)

N
(3 .4 1)

SWi = L (3.42)

Sb = St - Sw
T
m (3 .43)

Sw ,
3.5 63

tr (WTSbW)
(3 .44)
,
E

. (3 .45)

(classifier)

(One vs (One vs.


(One vs. (Many vs.
= {(Xl , Y1) , ... CN}.
64

=} h -7 "+"

(Error Correcting
Output
ECOC [Dietterich and Bakiri ,
3.5 65

(coding
and et a l.,

11 h fa 14 15 h h fa 16
•• ••
C1• • 32 V3 C1 • • 4 4
C •
2 • 4 4 C2 • • 2 2
C •
3 • 1 2 C3 • • 5
C •
4 • 22 v'2
••
66

y >

(3 .46)
l-y

(3 .4 7)
3.7 67

y' y m
1- y' 1- Y --
m+
(3.48)

(rebal-
(rescaling).
ance).

(upsam-
pling)

(threshold-moving).

[Chawla
et a l.,

EasyEnsemble [Liu et

(cost-sensitive

(sparse
(sparsity
LASSO
[Tibshirani ,
68

et al.,
[Crammer and Singer ,

et 2006 , 2008].

(Directed Acyclic
et al.,

and Singer, 2001; Lee et

(misclassification

and Liu ,
and Liu ,

and Zhou, 2014].


69

3.1

3.2

3.3

3.4

3.5

3.6

3.7

3.8*

3.9
70

Allwein , E. L. , R. E. Schapire , and Y. Singer. (2000).


to binary: A approach for margin classifiers." Journal of Machine
1:113-14 1.
Boyd , S. and L. Vande erghe. (2004). Convex Optimization. Cambridge U
Press , Cambridge , UK.
Chawla , N. V. , K. W. Bowyer , L. O. Hall , and W. P. Kegelmeyer. (2002).
"SMOTE: Synthetic minority technique." Journal of Artificial
Intelligence 16:321-:-357.
Crammer , K. and Y. Singer. (2001). "On the algorithmic implementation of
multiclass kernel-based vector machines." Journal of Learning Re-
2:265-292.

J
Crammer, K. and Y. Singer. (2002). "On the learnability and design of
codes for multiclass problems." Machine Learning, 47(2-3):201-233.

via error-correcting of Artificial Intelligence Re-


2:263-286.
Elkan , C. (2001). "The foundations of cost-sensitive learning." In Proceedings
of the 17th Joint Conference on Artifiçial Intelligence (IJCAI) ,
973-978 , Seattle , WA.
O. Pujol , and P. Radeva. (2010). "Error-correcting ouput codes
library." Journal of Machine 11:661-664.
Fisher , R. A. (1936). "The use of multiple taxonomic prob-
lems." Annals of 7(2):179-188.
Lee , Y., Y. Lin , and G. Wahba. (2004). "Multicategory support vector ma-
chines , theory, and application to the classification of microarray data and
satellite radiance data." Journal of the American Statistical Association, 99
(465):67-8 1.
Liu , X.- Y., J. Wu , and Z.-H. Zhou. undersamping for
learning." IEEE on Systems , Cyber-
neticíi - Part B: Cybernetics , 39(2):539- 550.
Platt , J. C. , N. Cristianini , and J. Shawe-Taylor. (2000). "Large margin DAGs
71

for multiclass classification." In in Neural Processing


12 (NIPS) (8. A. 8011a, T. K. Leen , and K.-R. Müller , eds.) , MIT
Press , Cambridge , MA.
Pujol , 0. , 8. Escalera , and P. Radeva. incremental node embedding
technique for error correcting output codes." Pattern Recognitìon,
725.
Pujol , 0. , P. Radeva , and J. Vitrià. (2006). "Discriminant ECOC: A heuristic
method for application dependent design of error output codes."
IEEE on Pattern Analysis and Machine Intelligence , 28(6):
1007-1012.
Tibshirani , R. (1996). "Regression shrinkage and selection via the LA880."
Journal of the Royal Series B , 58(1):267-288.
M.-L. and Z.-H. Zhou. on multi-label learning al-
gorithms." IEEE Transactions on 26(8):
1819-1837.
Zhou , Z.-H. and X.- Y. Liu. (2006a). "On multi-class cost-sensitive learning." In
Proceeding ofthe 21st National Conference on Intelligence
567-572 , Boston, WA.
Zhou , Z.-H. and X.- Y. Liu. (2006b). cost-sensitive neural networks
with methods addressing the class imbalance problem." IEEE
18(1):63-77.
72

"3L"
74

= {(X1 , Y1) , . , (Xm,Ym)};

A)
2: if then
3:
4: end if
5: if A= 0
6: return
7: end if

for do
10:
11: if
12: return
13: else
14: A\
15: end if
16: end for
4.2 75

Information entropy is the most common measure of the purity of a sample set. Let p_k (k = 1, 2, ..., |Y|) be the proportion of class-k examples in D; then

Ent(D) = − Σ_{k=1}^{|Y|} p_k log₂ p_k.                                        (4.1)

The smaller Ent(D), the purer D. The information gain of splitting D on a discrete attribute a with possible values {a¹, ..., a^V}, where D^v is the subset of D taking value a^v, is

Gain(D, a) = Ent(D) − Σ_{v=1}^{V} (|D^v| / |D|) Ent(D^v).                     (4.2)
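A minimal sketch of Eqs. (4.1)-(4.2): entropy of a label list and the information gain of splitting on one discrete attribute, on a hypothetical toy data set.

```python
from collections import Counter
from math import log2

def entropy(labels):
    m = len(labels)
    return -sum((c / m) * log2(c / m) for c in Counter(labels).values())   # Eq. (4.1)

def information_gain(attr_values, labels):
    m = len(labels)
    gain = entropy(labels)
    for v in set(attr_values):
        subset = [y for a, y in zip(attr_values, labels) if a == v]
        gain -= len(subset) / m * entropy(subset)        # weighted child entropy, Eq. (4.2)
    return gain

attr = ['a', 'a', 'b', 'b', 'b', 'a']
label = ['+', '+', '-', '-', '+', '-']
print(entropy(label), information_gain(attr, label))
```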

erative Dichotomiser

(8. 8 9. 9\
1"'7) =
76

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

D1
D2 D3

4, 6 , 10 , 13 ,
3 , 7, 8 , 9 ,
= = 11 , 12 , 14,

Ent(D l ) (63 3 6 3 6)}I


6 '3 == 1.000 ,

Ent(D :l) (4422)

Ent(D ,j) ;. i5 4 5)}


5 +'4
(51 1 0.722
-- ,
4.2 77

=
fLlDU!
IDI
= x

= 0.143; = 0.141;
= 0.381; = 0.289;

= 0.006.

2, 3 , 4 , 5, 6 , 8 , 10 ,

= 0.043; = 0.458;
= 0.331; = 0.458;
= 0.4 58.
78

C4.5 instead uses the gain ratio:

Gain_ratio(D, a) = Gain(D, a) / IV(a),                                        (4.3)

where

IV(a) = − Σ_{v=1}^{V} (|D^v| / |D|) log₂ (|D^v| / |D|)                        (4.4)

is the intrinsic value of attribute a [Quinlan, 1993]; it grows with the number of possible values of a.

= 0.874 (V = 2) , = 1.580 (V = 3) ,
= 4.088 (V = 17).
C4.5
4.3 79

and Regression
et al., (Gini

Gini(D) = 1 − Σ_{k=1}^{|Y|} p_k².                                             (4.5)

Intuitively, Gini(D) is the probability that two examples drawn at random from D carry different class labels, so smaller means purer. The Gini index of attribute a is

Gini_index(D, a) = Σ_{v=1}^{V} (|D^v| / |D|) Gini(D^v),                       (4.6)

and the splitting attribute is chosen as a* = argmin_{a∈A} Gini_index(D, a).
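A minimal sketch of Eqs. (4.5)-(4.6): the Gini value of a label list and the Gini index of a candidate discrete splitting attribute (smaller is better, as in CART).

```python
from collections import Counter

def gini(labels):
    m = len(labels)
    return 1.0 - sum((c / m) ** 2 for c in Counter(labels).values())   # Eq. (4.5)

def gini_index(attr_values, labels):
    m = len(labels)
    total = 0.0
    for v in set(attr_values):
        subset = [y for a, y in zip(attr_values, labels) if a == v]
        total += len(subset) / m * gini(subset)                        # Eq. (4.6)
    return total

attr = ['a', 'a', 'b', 'b', 'b', 'a']
label = ['+', '+', '-', '-', '+', '-']
print(gini(label), gini_index(attr, label))
```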

pruning) [Quinlan,
80
4.3 81

4%

4%
4%

100% = 42.9%

x 100% = 71. 4% > 42.9%


82

(decision
stump).
4.4 83

1%
4%

1993] .

t
84

, (4.7)

[Quinlan , 1993].

Gain(D , a) t)

-LEnt(Df) , (4.8)
IDI

a,

12345678
0.697 0.460
0.774 0.376
0.634 0.264
0.608 0.318
0.556 0.215
0.403 0.237
0.481 0.149
0 .437 0.211
0.666 0.091
0.243 0.267
0.245 0.057
0.343 0.099
0.639 0.161
0.657 0.198
0.360 0.370
0.593 0.042
0.719 0.103
4.4 85

0.294 , 0.351 , 0.381 , 0.420 , 0 .4 59 , 0.518 ,


0.574 , 0.600 , 0.621 , 0.636 , 0.648 , 0.661 , 0.681 , 0.708 ,

{0.049 , 0.074 , 0.095 , 0.101 , 0.126 , 0.155 , 0.179 , 0.204 , 0.213 , 0.226 , 0.250 , 0.265 ,
0.292 , 0.344 , 0.373 ,

= 0.109; = 0.143;
= 0.141; = 0.381;
= 0.289; = 0.006;
= 0.262; = 0.349.

.
86

7, 14 ,

..,

ÌJ k
= 1 , 2 , .. . , ÌJ k ,

LZLEL-L
(4.9)

~pf
(1 k IYI) , (4.10)

(1 v V) . (4.11)
4.4 87

Gain(D , a) = p x Gain(D , a)
\tll1/

× En4L ~D En 4tu
ND 4 -i9"

IYI
Ent(15) = -

Ent(15) = - LPk
(6. 6 8. 8\
= - I -::-: . -::-: 14 J = 0.985.

\11FE/\11/
NDND
EE
MM)) - - -
oo
ubub -
oo
FDFb nu03
nu--
1inu
q4qL

.•
88

14
0.306 = 0.252 .
17

= 0.252; = 0.171;
= 0.145; = 0 .424;
= 0.289; = 0.006.

9, 13 , 14 , 12 ,
4.5 89

0.697 0.460
0.774 0.376
0.634 0.264
0.608 0.318
0.556 0.215
0.403 0.237
0.481 0.149
0.437 0.211
,.•
0.666 0.091
0.243 0.267
0.245 0.057
0.343 0.099
0.639 0.161
0.657 0.198
0.360 0.370
0.593 0.042
0.719 0.103
90

06EZE +
+
+
+
o. 2

o O. 2 O. 4 O. 6 O. 8

(multivariate decision

(oblique
decision tree).

(univariate decision
4.5 91

/ '-....

O. 2

o 0.2 O. 4 O. 6 0.8
92

[Quinlan , 1979 , [Quin-


[Breiman et a l., 1984]. [Murthy ,

[Quinlan ,

[Raileanu and Stoffel ,

[Murthy et a l., and Ut-


goff,
[Brodley
and Utgoff,

(Perceptron tree) [Utgoff,


and Gelfand ,

[Schlimmer and Fisher , [Utgoff, [Utgoff et a l.,


93

4.1

4.2

4.3

4.4

4.5

http://archive.ics.uci.edu/mlj. 4.6

4.7

4.8*

4.9

4.10
94

Breiman , L. , J. Friedman , C. J. and R. A. Olshen. (1984).


Chapman & Hall/CRC , Boca Raton , FL.
Brodley, C. E. and P. E. Utgoff. (1995). "Multivariate decision trees." Machine
Learning, 19(1):45-77.
Guo , H. and S. B. Gelfand. (1992). trees with ncural network
feature extraction." IEEE on Neural Networks , 3(6):923-933
Mingers , J. (1989a). "An empirical comparison of pruning mcthods for decision
tree induction." Learning , 4(2):227-243.
Mingers , J. (1989b). "An empirical comparison of selection measures for
decision-tree induction." Learning, 3(4):319-342.
Murthy, S. K. (1998). "Automatic construction of decision trees from data:
A survey." and K 2(4):
345-389.
Murthy, S. Kasif, and S. Salzberg. for induction of
oblique decision trees." Jo'urnal of Artíficial Intelligence 2:1-32.
J. R. (1979). rules by large collections of
examples." 1n Expert the (D.
University Press , Edinburgh , UK.
Quinlan , J. R. (1986). "1nduction of decision trees." Machine 1(1):
81-106.
Quinlan , J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kauf-
mann , San Mateo , CA.
Raileanu , L. E. Stoffel. (2004). "Theoretical comparison between the
Gini index and information gain criteria." of Arti-
ficialIntelligence , 41(1):77-93.
J. C. and D. Fisher. study of incremental concept
induction." 1n Proceedings of the 5th Conference on Artificial In-
telligence (AAAI) , 495-501 , PA
Utgoff, P. E. (1989a). "1ncremental induction of decision trees." Machíne
Learning, 4(2):161-186.
95

P. E. (1989b). "Perceptron A case study in hybrid concept rep-


resenations." Connection Science, 1(4):377-39 1.
Utgoff, P. E. , N. C. Berkman, and J. A. Clouse. (1997). "Decision t ree induction
based on effcient tree Machine Learning, 29(1 ):5-44.

Ross Quinlan , 1943- ).


B.
Hunt
(Concept Learning

achine
Kohonen 1988
Networks
[Kohonen ,

[McCulloch and Pitts ,

U
98

(activation

1.0

1 x -1. 0 -0.5 10 0.5 1. 0 :Í;


cd Z) 0;
x< O

(b)

(threshold logic unit) .


fCI:'i WiXi -

• (Xl ^ = 1, e= f(l . Xl + 1 . X2 -
5.2 99

'W l/ \'W 2

Xl X2

= X2 = = 1;

• (X1 V 1, e= f(l . X2 - 0.5) ,


= = = 1;

• = = 0, e= f( -0.6. Xl + O.
X2 = = = 1.

(dummy

(5.1)

ý)Xi , (5.2)

and Papert ,
100

X2 X2
"+"
(1 , 1) (0, 1) ..., (1 , 1)
i+
( AU nU)
(1 , 0) Xl (0 , 0) (1, 0) Xl

X2)

+
r_"_"
F

, i< ;H "+

-0 X
Xl ,
..
(0, 0) (1, 0) xl
(d)

(multi-layer feedforward neural

(nu , )
1
i
‘. (1, 1)

1
Xl
Xl X2
5.3 101

(connection

Yl) ,
.. . , (Xm , Ym)} , Xi E Yi

=
102

Yl

Xl Xi Xd

rl' FI<J Sigmoid

Bj) , (5.3)

(5.4)

+
x

. (5.5)

(5.6)
5.3 103

_
(5.7)

't.
(5.8)

f'(x)=f(x)(1-f(x)) , (5.9)

g_j = ŷ_j (1 − ŷ_j) (y_j − ŷ_j),                                              (5.10)

and the BP updates for the hidden-to-output weights and output thresholds are

Δw_hj = η g_j b_h,                                                            (5.11)
Δθ_j = −η g_j.                                                                (5.12)

Similarly, for the input-to-hidden weights and the hidden thresholds,

Δv_ih = η e_h x_i,                                                            (5.13)
Δγ_h = −η e_h,                                                                (5.14)

where the hidden-layer gradient term is

e_h = b_h (1 − b_h) Σ_{j=1}^{l} w_hj g_j.                                      (5.15)
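A minimal sketch (array shapes, learning rate, and initialization are assumptions) of one standard-BP update for a single-hidden-layer sigmoid network, following Eqs. (5.10)-(5.15): v and gamma are the input-to-hidden weights and hidden thresholds, w and theta the hidden-to-output weights and output thresholds.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bp_step(x, y, v, gamma, w, theta, eta=0.1):
    b = sigmoid(x @ v - gamma)              # hidden outputs b_h
    y_hat = sigmoid(b @ w - theta)          # network outputs y_hat_j
    g = y_hat * (1 - y_hat) * (y - y_hat)   # Eq. (5.10)
    e = b * (1 - b) * (w @ g)               # Eq. (5.15)
    w += eta * np.outer(b, g)               # Eq. (5.11)
    theta -= eta * g                        # Eq. (5.12)
    v += eta * np.outer(x, e)               # Eq. (5.13)
    gamma -= eta * e                        # Eq. (5.14)
    return v, gamma, w, theta

d, q, l = 3, 4, 2                           # input, hidden, output sizes
rng = np.random.default_rng(0)
v, gamma = rng.normal(size=(d, q)), rng.normal(size=q)
w, theta = rng.normal(size=(q, l)), rng.normal(size=l)
v, gamma, w, theta = bp_step(rng.normal(size=d), np.array([1.0, 0.0]),
                             v, gamma, w, theta)
```

Standard BP applies this update once per training example; accumulated BP instead averages the gradients over the whole training set before updating.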

2: repeat
3: for all (Xk ,
4:
5:
6:
7:
8: end for
9:

, (5.16)
5.3 105

0.11 0.51 0.56

0.53 1.72

2f .
1

of + +
2

error

(one round ,

gradient
[Hornik et al.,

(early

(regularization) [Barron , 1991; Girosi et a l.,


106

(5.17)

(local
(global

v 0) 111(w;

w;

v ,- , -,
5 .4 107

(simulated and Korst , 1989].

algorithms) [Goldberg ,
108

5.5.1
RBF(Radial and

= Ci) , (5.18)

= (5.19)

[Park and Sandberg ,

5 .5 .2

A R,T(Adaptive Reson.ance and


Grossberg ,
5.5 109

(stability-
plasticity

5.5.3
SOM(Self-Organizing
(Self-Organizing Fea-

matching
110

(construc-

and Lcbiere , 1990]

(
5.5 111

5.5.5
neural (recurrent neural
networks"

1987].

5.5.6

(energy-based

E {O ,

E(s) =- (5.20)
i=l j= i+ l i=l

(station
ary distribution)
112

(a)

P(s) (5.21)

Boltzmann

(Contrastive

= rr (5.22)

P (hj I v) (5.23)
j=1

(5.24)
5.6 113

(deep

layer-wise

(pre- (fine-
belief [Hinton

(weight
Neural
[LeCun and 1995; LeCun et a1.,
et a 1.,
114

32x 32 (j<<128x28 6 (çý 14 x14

et al. , 1998]

f (x ) =< -' x <,-,


0,
I x. otherwlse

LU(Rectified Li near Unit ) ;

(feat ure
(representation learning) .

(feature
5.7 115

[Haykin , [Bishop ,
Computation , Neural
IEEE 'JIransactions on Neural Networks and Learning Systems;
on Neu-
ral Networks.

[Gerstner and Kistler , 2002].


et al. ,
(Least Mean

and Rumelhart , 1995].

[Gori and Tesi , [Yao ,

and 1998; Orr and Müller , 1998].


et al. , 2001]. [Carpenter and
Grossberg ,
2001]. [Bengio et al.,

[Tickle et al., 1998; Zhou , 2004].


116

5.1

5.2

5.3

5.4

5.5

5.6

http://archive.ics.ucí.edu/ml/

5.7

5.8

5.9*

5.10

http://yann.lecun.com/
117

Aarts , E. and J. Korst. (1989). Simulated Machines:


A to Neural Comput-
ing. John Wiley & Sons , New York , NY.
D. H. , G. E. Hinton , and T. J. Sejnowski. algorithm
for Boltzmann machines."
A. R. with to -
cial neural networks." In Related
Topics; NATO Volume 335 (G. ed.) , Kluwer ,
Amsterdam , The Netherlands.
Bengio , Y., A. Courville , and P. Vincent. (2013).
review and new perspectives." IEEE on Pattern
Machine Intelligence ,
Bishop , C. M. (1995). Neural Networks fo 1' Pattern Recognition. Oxford Uni-
Press , New York , NY.
Broomhead , D: S. and D. Lowe. (1988). functional interpolation
and adaptive networks." Complex
Carpenter , G. A. and S. Grossberg. (1987). "A massively parallel architecture
for a self-organizing neural pattern Computer Vision ,
Image
Carpenter , G. A. and S. Grossberg , eds. (1991). Pattern Recognition by
Organizing Neural Networks. MIT Press , Cambridge , MA.
Y. and D. E. Rumelhart , eds. (1995).
Applications. Lawrence Erlbaum Associates , Hillsdale , NJ.
Elman , J. L. (1990). structure in time."

Fahlman, S. E. and C. Lebiere. (1990). "The cascade-correlation


Technical Report CMU-CS-90-100 , School of Computer Sciences ,
Carnergie Mellon University, Pittsburgh , PA.
Gerstner , W. and W. Kistler. (2002). Spiking Neuron Models: Single
Cambridge University Press , Cambridge, UK.
M. Jones , and T. Poggio. (1995). "Regularization theory and neural
118

networks architectures."
Goldberg , D. K (1989). Genetic Algorithms in Optirnizaiton and Ma-
chine Addison-Wesley, Boston , MA.
Gori , M. and A. Tes i. (1992). "On the problem of local in backpropa-
gation." IEEE 0 11, knalysis and Intelligence ,
14(1):76-86
S. (1998). Neural A Comprehensive 2nd
tion. Prentice-Hall , Upper Saddle River , NJ.
Hinton , G. (2010). "A practical guide to Boltzmann ma-
chines." Technical Report UTML TR 2010-003 , Department of Computer
Science , University of Toronto.
Hinton , G. , S. Osindero , and Y. -W. Teh. (2006). "A fast learning algorithm for
deep belief nets." Neural 1527-1554.
Hornik , K., M. Stinchcombe, and H. White. (1989). feedforward
networks are universal approximators." Neural 2(5):359-366
Kohonen , T. (1982). "Self-organized formation of topologically correct feature
maps." Cybernetics , 43(1):59-69
Kohonen , T. (1988). "An introduction to neural computing." Neural Networks ,
1(1):3-16.
Kohonen , T. (2001). 3rd edition. Springer, Berlin.
LeCun , Y. and Y. (1995). networks for images , speech ,
and time-series." In The of Brain Neural Networks
(M. A. Arbib , ed.) , MIT Press , Cainbridge , MA.
LeCun , Y., L. and P. Haffner. (1998). "Gradient-based
learning applied to document Proceedings of theIEEE, 86(11):
2278 2324.
D. J. C. (1992). "A 1

networks." Neural Computation , 4(3):448-472.


McCulloch , W. S. and W. Pitts. (1943). "A logical calculus of the ideas im-
manent in nervous activity." Bulletin of Mathematical Biophysics, 5(4):115-
133.
Minsky, M. and S. Papert. (1969). Perceptrons. MIT Press , Cambridge , MA.
119

Orr , G. B. and K.-R. Müller , eds. (1998). Neuml of the Trade.


Springer, London , UK.
Park , J. and 1. W. Sandberg. (1991). "Universal approximation
networks." Neural
Pineda , F. J. (1987). "Generalization of Back-Propagation to recurrent neural
networks." Physical Review Letters , 59(19):2229-2232.
Reed , R. D. and R. J. Marks. (1998). Neuml Leaming in
Artificial Neural Networks. MIT Press , Cambridge , MA.
Rumelhart , D. E. , G. E. and R. J. Williams. (1986a). "Learning internal
representations by error propagation." In Distributed Processing:
Explorations in the Cognition (D. E. Rumelhart and J. L.
McClelland , eds.) , volume 1, 318-362 , MIT.Press , Cambridge , MA
Rumelhart , G. E. Hinton , and R. J. Williams. (1986b). repre-
sentations by 323(9):318-362.
Schwenker , F. , H.A. Kestler , and G. Palm. (2001). "Three learning phases for

Tickle , A. B. , R. Andrews , M. Golea , and J. Diederich. (1998). "The truth


will to light: Directions and challenges in extracting the knowledge
embedded within trained artificial neural networks." IEEE on
Neural 9(6):1057 1067.

P. (1974). Beyond New tools for in


the science. Ph.D. thesis , Harvard University, Cambridge , MA.
Yao , x. (1999). "Evolving neural networks." Proceedings of the IEEE,
87(9) :1423-1447.
Zhou , Z.-H. (2004). "Rule extraction: networks or for neural
networks?" Joumal of Technology , 19(2):249-253.
120

Minsky,

Paul

UCSD

!'l"
= {(Xl , Y2) , . . . , (X m , Ym)} ,

Xl

(6.1)

=
122

÷
(6.2)

=
, T'o".J n llA.I <lV1., I V
> v,
_,.,..-
wᵀx_i + b ≥ +1,  if y_i = +1;
wᵀx_i + b ≤ −1,  if y_i = −1.                                                 (6.3)

The training points closest to the hyperplane, for which equality holds, are the support vectors; the sum of the distances from the two classes' support vectors to the hyperplane is

γ = 2 / ||w||,                                                                (6.4)

called the margin.

[Figure: support vectors and the maximum margin]

Finding the maximum-margin separating hyperplane means solving

max_{w,b} 2 / ||w||
s.t. y_i (wᵀx_i + b) ≥ 1,  i = 1, 2, ..., m,                                  (6.5)

which is equivalent to

min_{w,b} ½ ||w||²
s.t. y_i (wᵀx_i + b) ≥ 1,  i = 1, 2, ..., m.                                  (6.6)

Vector

f(æ) (6.7)

quadratic

(dual

, (6.8)

, (6.9)

(6.10)

( tu )
4·i-t4
max -
124

S.t. = 0 ,
i=l

i =

f(x) =

+b , (6.12)

(Karush-Kuhn-

(> O
(6.13)

(Yd(Xi) - 1) = 0 ..

=
0,
=

SMO (Sequential Minimal


1998].
6.2 125

et a l.,

:? 0 :? 0 , (6.14)

"uo
(6.15)

=C (6.16)

= 1,

=1 (6.17)

= {i > 0, i =

(6.18)
126

>

f(x) +b , (6. 19)

(6 .20)
'"

S.t.

T
x WU
U
Z Z nhv
6.3 127

s.t. L O'. iYi = 0 ,

;;;:: 0 , i = 1, 2,..., m .

= = cþ(Xi)Tcþ(Xj) , (6.22)

(ker-
nel trick).

=0,
i=l

;;;:: 0 , i = 1, 2,… , m.

f(x) +b

+b
i=l

(6.24)

(kernel

vector expansion).
128

and Smola , 2002]


=
(kernel matrix)

X1) Xj) Xm )

K= X1) Xm )

X1)

(Reproducing Kernel Hilbert

= x 'f Xj
d = = (x 'f Xj)d
= exp
= exp (-
= 0, B < 0

(6.25)
6.4 129

z) (6.26)

= z)g(z ) (6.27)

(80ft

X2

Xl

(hard
130

+ (6.28)

-1) , (6.29)

ℓ_{0/1}(z) = 1 if z < 0; 0 otherwise.                                         (6.30)

Because ℓ_{0/1} is non-convex and discontinuous, it is usually replaced by a surrogate loss, typically a convex, continuous upper bound. Three common choices are:

hinge loss:       ℓ_hinge(z) = max(0, 1 − z);                                 (6.31)
exponential loss: ℓ_exp(z) = exp(−z);                                         (6.32)
logistic loss:    ℓ_log(z) = log(1 + exp(−z)).                                (6.33)
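A minimal sketch, not the dual/SMO route the text follows: it trains a soft-margin *linear* SVM by subgradient descent on the regularized hinge objective min_{w,b} ½||w||² + C Σ_i max(0, 1 − y_i(wᵀx_i + b)), i.e. Eq. (6.31) plugged into the form of Eq. (6.29). Step size and iteration count are assumed.

```python
import numpy as np

def linear_svm(X, y, C=1.0, lr=0.01, iters=2000):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        margin = y * (X @ w + b)
        viol = margin < 1                       # samples violating the margin
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] - X[:, 1] + 0.2 > 0, 1.0, -1.0)
w, b = linear_svm(X, y)
print(np.mean(np.sign(X @ w + b) == y))        # training accuracy
```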


log(.)

(6.34)

(s1ack variab1es) Çi

mm (6.35)
W ,b,Çí
6.4 131

[Figure: the 0/1 loss ℓ_{0/1}(z) together with the hinge loss max(0, 1 − z), the exponential loss exp(−z), and the logistic loss]

S.t. 3 1 - Çi
Çi i = 1, 2 ,… , m.

+ b)) - (6.36)

, (6.37)

(6.38)

(6.39)
132

lllax (6 .40)
a

s.t.

i = 1, 2,… , m.

0 ,
-1 + Çi 0,
(6 .4 1)
(Xi) - 1 + Çi) = 0,

Çi = 0.

= =
=

< =
= =
>

and
6.5 133

C(f (x i) ,

+ C "L C(f (Xi)'Y (6.42)

(structural risk)
C(f (Xi) , (empirical risk)

= {(X l, Y1) , ,…?


(X m, Ym)} , Yi E

Vector
134

_-0 0

, (6.43)

I 0, if Izl E ;
(6 .44)
=<
l Izl - E , otherwi8e

utLz (6 .45)

f! (z )

if Izl E
otherwise

Z
6.5 135

S.t. E+ Çi ,
f(x;)
0, i = 1 , 2,… , m.

C
mZM

(x;) - E - - f (Xi) - (6.46)

b,

(6 .4 7)

0= , (6.48)

, (6.49)

iii . (6.50)

mhMMA (6.51 )

S.t. =0,

âi c.
136

f_ -

- f(Xi) - =0,
(6.52)
= 0 , ÇiÇi = 0 ,
(0 = 0 , (0 -

=
Yi - f(Xi) - =
- Yi -
-

f(x) = (6.53)

- Yi - f - Çi) = O. <

(6.54)
i=l

(6.55)
6.6 137

f(x) (6.56)
i=1
=

Yl) , (X2 ,Y2) , … 7

(representer

and
Smola ,

: ]Rm •-7

2fE (6.57)

h*(x) Xi) (6.58)

(kernel

(Kernelized Linear Discriminant

h(x) . (6.59)
138

(6.60)

(6.61)
$

st = - (6.62)

L L (<þ (x) - p, f)( <þ (x) - p,f)T (6.63)


i=O æEXi

Xi) , (6.64)

(6.65)

(6.66)
mo
(6.67)
m1

M= 11 1) T , (6.68)
6.7 139

(6.69)

(6.70)

and Vapnik ,

(statistical

and Shawe-Tay lor ,

and Shawe-Taylor , 2000; Burges , 1998;


2009; Schölkopf et a l., 1999; Schölkopf and Smola ,
1995 , 1998 , 1999].

plane

et a l.,
[Hsieh et al.,
[Tsang et a l.,
and Seeger ,
and
et a l., 2012].

[Hsu and Lin ,


140

et al. , et a1., 1997] , [Smola and


Schölkopf,

[Vapnik and Chervonenkis ,

[Chang and Lin ,


LIBLINEAR [Fan et a1.,
141

6.1

6.2
csie. ntu.edu. tw/

6.3

6.4

6.5

6.6

6.7

6.8

6.9

6.10*
142

Bach , R. R., G. R. G. Lanckriet , and M. I. Jordan. (2004). "Multiple kernel


learning , conic duality, and the SMO algorithm." In Proceedings 0] the 21st
International on 6-13 , Banff, Cana-
da.
Boyd , S. and L. Vandenberghe. (2004). Convex Optimization. Cambridge Uni >

versity Cambridge , UK.


Burges , C. J. C. (1998). "A tutorial on support vector machines for pattern
recognition." and Knowledge Discovery, 2(1):121-167.
Chang , C.-C. and C.-J. Lin. (2011). "LIBSVM: A library for support vector
machines." ACM Transactions on Intelligent Technology , 2(3):
27.
Cortes , C. and V. N. Vapnik. vector networks."
20(3):273-297.
and J. Shawe-Taylor. (2000). An Introduction to Vector
and Other Learning Methods. Cambridge University
Press , Cambridge , UK
Drucker , H. , C. J. C. Burges , L. A. J. Smola , and V. Vapnik. (1997)
"Suppurt vector machines." In in Neural
Processing Systems 9 (NIPS) (M. C. Mozer , M. I. Jordan , and T. Petsche ,
eds.) , 155-161 , MIT Press , Cambridge , MA
Fan , R. -E. , K. -W. Chang , C.-J. Hsieh , X.- R. Wang , and C.-J. Lin. (2008).
"LIBLINEAR: A library for large linear Journal 0] Machine
Learning Research, 9:1871-1874.
Hsieh , C.-J. , K. -W. Chang , C.-J. S. S. Keerthi , and S. Sundararajan.
(2008). "A dual coordinate descent method for large-scalc lincar SVM."
In Proceedings of the 25th International Con]erence on Machine Learning
408-415 , Helsinki ,
and C.-J. Lin. of methods for multi-class
support vector IEEE
143

Joachims , T. (1998). "Text with vector Learn-


ing with many relevant features." In of the 10th
ference on Machine Learning Chemnitz , Germany.
Joachims , T. (2006). "Training linear SVMs in linear time." In Proceedings of
the 12th ACM SIGKDD International Conference on Knowledge Discovery
(KDD) , Philadelphia, PA.
Lanckriet , G. R. G. , N. Cristianini , and M. 1. Jordan P. Bartlett , L. El Ghaoui.
(2004). the kernel matrix with semidefinite Jour-
nal of Machine 5:27-72.
Osuna , E. , R. Freund , and F. Girosi. (1997). "An improved
for support vector machines." In Proceedings ofthe IEEE Workshop on Neu-
for Processing (NNSP) , 276-285 , Amelia Island , FL.
Platt , J. (1998). "Sequential minimal optimization: A fast algorithm for train-
support vector machines." Technical Report MSR-TR-98-14 , Microsoft
Research.
Platt , J. (2000). "Probabilities for (SV) machines." In Advances in Mar-
gin (A. Smola, P. Bartlett , B. Schölkopf, and D. Schuurmans ,
MIT Press , Cambridge , MA
Rahimi , A. Recht. features for large-scale kernel m a-
chines." In Advances in Neural Information Processing 20 (NIPS)
(J.C. Platt , D. Koller , Y. Singer , Roweis , eds.) , 1177-1184 , MIT Press ,
Cambridge , MA.
C. J. C. Burges , and A. J. Smola, eds. (1999). Advances in Kernel
Methods: Support Vector MIT Cambridge , MA.
Schölkopf, B. and A. J. Smola, eds. (2002). Kernels: Vec-
tor Beyond. MIT Press , Cam-
bridge , MA.
Shalev-Shwartz , S. , Y. N. Srebro , and A. Cotter. (2011). "Pegasos: Pri-
mal estimated sub-gradient solver for SVM." Mathematical Programming ,
127(1):3-30.
Smola , A. J. and B. Schölkopf. (2004). "A tutorial vector regres-
Computing, 14(3):199-222.
144

1. W. , J. T. Kwok , and P. Cheung. (2006). "Core vector machines:


Fast SVM training on very large data sets." Journal of Machine

Tsochantaridis , 1., T. Joachims , T. Hofmann , and Y. Altun. (2005). "Large


margin methods for structured and interdependent output variables." Jour-
of Machine Learning Research,
Vapnik , V. N. (1995). The Nature of 8tatistical Springer , New
York , NY.
Vapnik , V. N. (1998). Learning Theory. Wiley, New York , NY.
Vapnik , V. N. (1999). "An 1EEE
on Neural Networks ,
Vapnik , V. N. and A. J. (1991). "The necessary and sufficient
conditions for consistency of the method of empirical risk." Pattern Recog-

Williams , C. K. and M. Seeger. (2001). "Using the Nyström method to speed


up kernel machines." In Processing 8ystems
13 (NIP8) (T. K. Leen , T. G. Dietterich, and V. Tresp , eds.) , 682-688 , MIT
Press , Cambridge , MA.
Yang , T.-B. , Y.-F. Li , M.
method vs Fourier features: A theoretical and empirical
son." In Advances in Neural 1nformation Processing 8ystems 25 (NIP8) (P.
Bartlett , F. C. N. Pereira , C. J. C. L. Bottou , and K. Q. Weinberger ,
eds.) , MIT Press , Cambridge , MA.

based on risk A ls
145

N. Vapnik ,

"Nothing is
theory."
decision

=
I
loss)
(conditional risk)
(risk)

Let λ_ij be the loss incurred by classifying an example whose true class is c_j as c_i. The expected loss of labeling x as c_i, the conditional risk, is

R(c_i | x) = Σ_{j=1}^{N} λ_ij P(c_j | x),                                      (7.1)

and the overall risk of a decision rule h is

R(h) = E_x[ R(h(x) | x) ].                                                     (7.2)

Minimizing R(h) amounts to choosing, for every x, the label with the smallest conditional risk, giving the Bayes decision rule

h*(x) = argmin_{c∈Y} R(c | x);                                                 (7.3)

h* is the Bayes optimal classifier and R(h*) the Bayes risk, so 1 − R(h*) is the best accuracy any classifier can achieve. If the goal is to minimize the misclassification rate, take

λ_ij = 0 if i = j; 1 otherwise,                                                (7.4)

so that

R(c | x) = 1 − P(c | x),                                                       (7.5)

and the Bayes optimal classifier becomes

h*(x) = argmax_{c∈Y} P(c | x).                                                 (7.6)

P(c | x) can be estimated directly (discriminative models) or by first modeling the joint distribution P(x, c) (generative models), in which case

P(c | x) = P(x, c) / P(x),                                                     (7.7)

and, by Bayes' theorem,

P(c | x) = P(c) P(x | c) / P(x).                                               (7.8)

P(æ I
(likelihood);

I
I c).

I
7.2 149

I I

2005;
Likelihood

P(Dc = rr (7.9)

= log P(Dc I Oc)


, (7.10)
150

ax L L( QUC)
PAV
C
a?4
(7.11)

rv N(!-"c ,

(7.12)

(7.13)

{I'!

I
I
Bayes
Naive Bayes adopts the attribute conditional independence assumption: given the class, the attributes are mutually independent. Eq. (7.8) then becomes

P(c | x) = (P(c) / P(x)) Π_{i=1}^{d} P(x_i | c),                               (7.14)

where d is the number of attributes and x_i the value of x on the i-th attribute. Since P(x) is the same for every class, the naive Bayes classifier is

h_nb(x) = argmax_{c∈Y} P(c) Π_{i=1}^{d} P(x_i | c).                            (7.15)

With sufficient i.i.d. samples, the class prior is estimated by

P(c) = |D_c| / |D|,                                                            (7.16)

and, for a discrete attribute, the class-conditional probability by

P(x_i | c) = |D_{c,x_i}| / |D_c|,                                              (7.17)

where D_{c,x_i} is the subset of D_c taking value x_i on the i-th attribute. For a continuous attribute one may assume p(x_i | c) ∼ N(μ_{c,i}, σ²_{c,i}) and use

p(x_i | c) = (1 / (√(2π) σ_{c,i})) exp( −(x_i − μ_{c,i})² / (2σ²_{c,i}) ).      (7.18)
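A minimal sketch of Eqs. (7.15)-(7.17) for discrete attributes only, using plain frequency estimates (a zero count would wipe out a whole product; the Laplacian correction discussed below, Eqs. (7.19)-(7.20), is the standard remedy).

```python
from collections import Counter, defaultdict

def train_nb(X, y):
    m = len(y)
    prior = {c: cnt / m for c, cnt in Counter(y).items()}        # Eq. (7.16)
    cond = defaultdict(Counter)                                   # (class, attr index) -> value counts
    for xi, c in zip(X, y):
        for i, v in enumerate(xi):
            cond[(c, i)][v] += 1
    return prior, cond, Counter(y)

def predict_nb(x, prior, cond, class_count):
    best_c, best_p = None, -1.0
    for c, pc in prior.items():
        p = pc
        for i, v in enumerate(x):
            p *= cond[(c, i)][v] / class_count[c]                 # Eq. (7.17)
        if p > best_p:
            best_c, best_p = c, p                                 # Eq. (7.15)
    return best_c

X = [('a', 'x'), ('a', 'y'), ('b', 'y'), ('b', 'x')]
y = ['+', '+', '-', '-']
model = train_nb(X, y)
print(predict_nb(('a', 'x'), *model))
```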

0.460 ?

17
152

1 c):

1 .... ( 0

1 ..... ( (0.697 -
0.195 ….t-' \ 2.0.195 2 ) -

-
V2ir. 0.101 \ 2.0.101 2 )

1 \ _. (\ fìD t:.'

)'
7.3 153

X 10- 5 .

> 6.80 x

To keep an unseen attribute value from zeroing out the whole product, the estimates are smoothed with the Laplacian correction. Let N be the number of possible classes and N_i the number of possible values of the i-th attribute; then

P̂(c) = (|D_c| + 1) / (|D| + N),                                                (7.19)
P̂(x_i | c) = (|D_{c,x_i}| + 1) / (|D_c| + N_i).                                (7.20)

17 + 2
154

3+1

0+1

(lazy

(semi-naïve Bayes

(One-Dependent

P(c I rr P(Xi I (7.21)

I
7.4 155

SPODE (Super-Parent

(a) NB (b) SPODE (c) TAN

TAN (Tree Augmented naÏve Bayes) [Fr iedman et a l.,


weighted

mutual information)

n/__ 1_\'-_ P(Xi' Xj I c)


I(xi , Xj I y) = ) P(Xi , Xj I c)log (7.22)
z;f7cU P(zz|c)P(Z31c)'

I(xi , Xj ;

Xj I

AODE (Averaged One-Dependent Estimator) [Webb et


156

P(c 1 rr P(Xj , (7.23)

m' [Webb
et al. , 2005].
c,

P(C , Xi) (7.24)


IDI+Ni
+1
P(X.iJ -, _.'/
1
I
(7.25)

6+1

y,

(belief
Acyclic
7.5 157

Probability

FB ( Xi

= , Xd

l 7ri) = (7.26)
i=l i=l

P ( Xl , X2 , X3 , X4 , X5) = P(Xl)P(X2) P (X3 1 Xl)P(X41 Xl , X2 ) P ( X5 I X2) ,

j_ X4 I X2.
158

(common

= . (7.27)

(marginal
(marginal-
ization)

j_ X4 I Xl

y j_ z I

(direct
ed)

1988].

(moral
(moralization) [Cowell et a l., 1999].

=
7.5 159

I Xl , X3 1- X2 I X5

X3 1- X5 I

(Minimal Description

Itp
= =

s(B I D) = f(O)IBI- LL(B I D) , (7.28)


160

LL(B I D) (Xi) (7.29)

s(B I
= (Akaike Information

AIC(B I D) = IBI - LL(B I D) . (7.30)

= (Bayesian
Information

D) I D) . (7.31

=
s(B I I

(7.32)

et
7.5 161

(evidence).

=
= q I E = e) ,
=

=
=

P(Q =q I E (7.33)

(random
(Markov

P(Q I IE =
= q I E = e).
162

= (G , 8);

1: nn = 0
2: qO
3: for t = 1, 2, . . . , T do
4: for do
5: Z = E U Q \ {Qi};
6: z=euqt-l\{qf-l};
7: IZ = z)j
8: Z
9: qt
10: end for
11: if qt = q then
12: nq = nq + 1
13: end if
14: end for
P(Q = q I E =

"1"

7.6

(latent

LL(8 I X , Z) =lnP(X , Z I 8) . (7.34)


ln(.) .
7.6 163

(marginal likelihood)

LL(8 I X) = I .Lz P(X , Z I 8) . (7.35)

EM (Expectation- et al.,

P(Z I

I
IX,

Q(8 I 8 t ) = lE z lx ,8 t LL(8 I X , Z) . (7.36)

= argmax Q(8 I 8 t ) . (7.37)


@

1983].

descent)
164

[Domingos and pazzani , 1997; Ng and Jordan,

and pazzani ,

1998] , [McCallum and Nigam ,

1991].
[Friedman et al. , [Webb et al.,
(lazy Bayesian Rule) [Zheng and Webb ,

[Kohavi ,
(Bayesian

2006].

J.

1990; Chickering et
[Friedman and Goldszmidt ,
and Domingos ,
1997; Heckerman , 1998].
7.7 165

mÎxture

and Krishnan , 2008].


166

7.1

7.2*

7.3

7.4
P(Xi I

7.5

7.6

7.7

x 2=
.

7.8
y ..l z I

7.9

7.10
167

Bishop , C. M. (2006). Pattern Recognition and Machine Learning. Springer ,


New York , NY.
M. , D. Heckerman , and C. Meek. (2004). 1earning
of Bayesian networks is NP-hard." Journal of Machine Learning
5:1287-1330.
Chow , C. K. and C. N. Liu. (1968). probabi1ity distri-
butions with dependence trees." IEEE on Theory ,
14(3):462-467.
Cooper , G. F. (1990). "The computatíona1 comp1exity ofprobabilistic inference
belief networks." lntelligence , 42(2-3):393-405
Cowell , R. G. , P. Dawid , S. L. Lauritzen , and D. J. Spiege1ha1ter. (1999). Prob-
Expert Systems. Springer , New York , NY.
Dempster , A. P. , N. M. Laird , and D. B. Rubin. (1977). "Maximum likelihood
from incomp1ete data via the EM a1gorithm." of the Royal
Society - Series B , 39(1):1-38.
P. and M. Pazzani. (1997). "On the optimality of the simp1e
Bayesian classifier under zero-one 10ss." Machine 29(2-3): 103-130.
(2005). and scientists." of the
American Association, 100(469):1-5.
D. Geiger , Go1dszmidt. (1997). "Bayesian network clas-
sifiers." Machine Learning, 29(2-3):131-163.
Friedman , N. and M. Go1dszmidt. (1996). Bayesian networks with
10ca1 structure." In Proceedings of the 12th Conference on Uncer-
in Port1and , O R.
Grossman , D. and P. (2004). network
by maximizing conditiona1likelihood." In Proceedings of the 21st
tional Conference on Machine Learning (ICML) , 46-53 , Banff, Canada.

in Models (M. 1. Jordan , ed.) , 301-354 , K1uwer , Dor-


drecht , The Netherlands.
V. (1997). An to Springer , NY.
168

Kohavi , R. (1996). the accuracy of naive- Bayes classifiers: A


decision-tree hybrid." In Proceedings of the 2nd Conference
on Discovery and (KDD) , 202-207 , Port1and , OR.
r. (1991). "Semi-naive Bayesian classifier." In Proceedings of the
Working Session on Learning (EWSL) , 206-219 , Porto , Por-
tugal.
Lewis , D. D. at forty: The independence assumption in
information retrieval." In Proceedings of the 10th on
Machine Learning (ECML) , 4-15 , Chemnitz , Germany.
A. and K.
In Working Notes of the AAA I'98 Workshop on
Learning for Text Cagegorization , Madison , W r.
McLach1an , G. and T. Krishnan. (2008). The EM Algorithm and Extensions ,
2nd edition. John Wiley & Sons , Hoboken , NJ.
Ng , A. Y. and M. r. Jordan. (2002). "On discriminative vs. generative classifiers:
A comparison of 10gistic regression and naive Bayes." In in N eural
Information Processing Systems 14 (NIPS) (T. G. Dietterich, S. Becker , and
Z. 841-848 , MIT Press , Cambridge, MA
J. (1988). Systems: Networks of
Plausible Inference. Morgan Kaufmann, San Francisco , CA.
Sahami , M. (1996). "Learning limited dependence Bayesian classifiers." In Pro-
ceedings of the 2nd Conference on Knowledge Discovery and
(KDD) , 335-338 , Portland , OR.
Samaniego , F. J. (2010). A of the and Prequentist Ap-
to Estimation. Springer, New York , NY.
Webb , G. , J. Boughton , and Z. Wang. (2005). "Not so naive Bayes: Aggregating
estimators." Machine 58(1):5-24
Wu , C. F. Jeff. (1983). convergence the EM algorithm."
of 11(1):95-103.
Wu , X. , V. Kumar , J. R. Ghosh , Q. Yang , H. G. J. M-
cLachlan, A. Ng , B. Liu , P. S. Yu , Z.-H. Zhou , M. Steinbach, D. J. Hand ,
and D. Steinberg. (2007). "Top 10 algorithms in data mining."
169

Systems, 14(1) :1-37


Zhang, H. (2004). "The optimality of naive Bayes." In Proceedings of the
17th International Intelligence Society Confer-
ence (FLAIRS) , 562-567, Miami , FL.
Z. and G. I. Webb. (2000). "Lazy learning of Bayesian rules." Machine
Learning, 41(1):53-84.

Bayes ,

,
(individual

(base learner) ,
(base learning

(component
172

× h1 X h1 × ×
h2 × hz v' × hz × ×
ha v' × d h3 V V x ha × ×
× ×

(8.1)

/ T \
H(x) = sign I ) (8.2)
8.2 Boosting 173

. (8.3)

(Random Forest).

8.2 Boosting

[Freund and Schapire , 1997] ,


E {-1 , + 1},

AdaBoost can be derived as an additive model: the final classifier is a weighted combination of base learners,

H(x) = Σ_{t=1}^{T} α_t h_t(x),                                                 (8.4)

learned by minimizing the exponential loss function [Friedman et al., 2000]

ℓ_exp(H | D) = E_{x∼D}[ e^{−f(x) H(x)} ].                                      (8.5)
174

= {(X1' . . , (Xm ,Ym)};

1: 1)1(X) = 11m.
2: for t = 1, 2 ,. ..., T do
3: ht = 'c (D , 1)t);

5: if Et > 0.5 then break

7:
Ð ,(æ) " f if ht(x) = f(x)
Z, if
iH-
dj-
n··- Zt

1-3JU-AU
A-A

= _e-H(oo) P (f (æ) = 1 1æ) + eH(oo) P (f (æ) = (8.6)

P (f (x) = 11 æ)
(8.7)
P (f (x) =

,-_ P (f (x) = 11
l(1 ... P (f (x) == -11 æ))j
\2-h
=
1 100) = -11 æ)
-1 , P (f (x) = 1 æ) < P (f (x)
1 1 æ)

= y æ) ,
1 (8.8)

consistent
8.2 175

ℓ_exp(α_t h_t | D_t) = E_{x∼D_t}[ e^{−f(x) α_t h_t(x)} ]
 = E_{x∼D_t}[ e^{−α_t} I(f(x) = h_t(x)) + e^{α_t} I(f(x) ≠ h_t(x)) ]
 = e^{−α_t} P_{x∼D_t}(f(x) = h_t(x)) + e^{α_t} P_{x∼D_t}(f(x) ≠ h_t(x))
 = e^{−α_t} (1 − ε_t) + e^{α_t} ε_t.                                            (8.9)

Setting the derivative

∂ℓ_exp(α_t h_t | D_t) / ∂α_t = −e^{−α_t} (1 − ε_t) + e^{α_t} ε_t               (8.10)

to zero gives the classifier weight

α_t = ½ ln( (1 − ε_t) / ε_t ).                                                 (8.11)

I Ð) =
f(æ)ht(æ)j . (8.12)

= hr(æ) =

Ð) 1_-f(ælH+_ , (æl (,
( 1-
"1_\ 1.. 1_\ , f2(æ)h;(æ )\1
) I

ht(æ) = I Ð)
h
176

=
| I
r • I, (8.14)
h I I

(æ)
Dt(æ) = u. (,.,.\, , (8.15)

I
ht(æ) = I ... r •
,- r - ,-, II

= arg m a.x . (8.16)

f(æ)h(æ) = 1- , (8.17)

ht(æ) ] (8.18)

D(æ) e-f(æ)Ht(æ)
(æ) =
[e-f(æ)Ht(æ)]

1) (æ)

(æ)]
= Dt( æ) . , (8.19)
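A minimal sketch of the AdaBoost procedure given earlier in this section, with labels in {−1, +1}: the base learner here is an assumed one-feature threshold stump (not specified by the text), α_t follows Eq. (8.11), and the sample weights are re-scaled with the exponential-loss update just derived.

```python
import numpy as np

def stump_train(X, y, w):
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1.0, -1.0):
                pred = sign * np.where(X[:, j] <= thr, 1.0, -1.0)
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, T=20):
    m = len(y)
    w = np.full(m, 1.0 / m)                       # D_1: uniform distribution
    ensemble = []
    for _ in range(T):
        err, j, thr, sign = stump_train(X, y, w)
        if err >= 0.5:                            # stop if no better than chance
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))   # Eq. (8.11)
        pred = sign * np.where(X[:, j] <= thr, 1.0, -1.0)
        w *= np.exp(-alpha * y * pred)            # exponential-loss re-weighting
        w /= w.sum()
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def adaboost_predict(X, ensemble):
    score = sum(a * s * np.where(X[:, j] <= t, 1.0, -1.0) for a, j, t, s in ensemble)
    return np.sign(score)                         # H(x) = sign(sum_t alpha_t h_t(x))

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
print(np.mean(adaboost_predict(X, adaboost(X, y)) == y))
```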
8.2 Boosting 177

and Wolpert , 1996] ,

178

8.3

8.3.1 Bagging
Bagging [Breiman ,
Bootstrap
sampling).

Input: training set D = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)};
       base learning algorithm L;
       number of rounds T.
Process:
1: for t = 1, 2, ..., T do
2:    h_t = L(D, D_bs)     // train on a bootstrap sample D_bs of D
3: end for
Output: H(x) = argmax_{y∈Y} Σ_{t=1}^{T} I(h_t(x) = y)
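A minimal sketch of the procedure above: each round draws a bootstrap sample of D, trains a base learner on it, and prediction is by majority vote. The base learner is an assumed one-feature threshold stump; T is kept odd so the {−1, +1} vote cannot tie.

```python
import numpy as np

def stump_fit(X, y):
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1.0, -1.0):
                pred = sign * np.where(X[:, j] <= thr, 1.0, -1.0)
                err = np.mean(pred != y)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best[1:]

def bagging(X, y, T=25, rng=np.random.default_rng(0)):
    m = len(y)
    learners = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)          # bootstrap sample (with replacement)
        learners.append(stump_fit(X[idx], y[idx]))
    return learners

def bagging_predict(X, learners):
    votes = sum(s * np.where(X[:, j] <= t, 1.0, -1.0) for j, t, s in learners)
    return np.sign(votes)                          # majority vote

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)
print(np.mean(bagging_predict(X, bagging(X, y)) == y))
```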
8.3 179

T (0 (m) +0

[Zhou. 2012].

estimate) [Breiman , 1996a;


Wolpert and Macready ,

T
Hoob(æ) = = y) . lI (æ 1:. Dt) , (8.20)
t:l

Eoob . (8.21)

[Breiman ,
180


=
= = log2 d
[Breiman ,

8 .4 181

Bagging

/
.!
/ < h3 . .
.f ,/ . f \"
• h1 :
..,,' '"1

2000]

averaging)

H(x ) (8.22)
182

• averaging)

T
(8.23)

Breiman T

1952J , [Perrone and Cooper ,

et a l., 1992; Ho et a l., 1994;


Kittler et al.,

voting)

T N T
if >
(8.24)
otherwise.

voting)

H(x) = hi(æ) . (8.25)


8.4 183

yoting)

H(x) = . (8.26)

h;
(hard voting).

h; I
(soft voting).

scaling) [Platt , regression) [Zadrozny and


Elkan , 2001

Stacking [Wolpert , 1992;


184

= {(æl ,

1: for t = 1 , 2,… , T do
2: ht =
3: end for
4:
5: for i = 1, 2,..., m do
6: for t = 1, 2,… , Tdo
7: Zit = ht(æi)j
8: end for
9: D' = D' U
10: end for
11: h' =
H(æ) = h'(h1(æ) , h 2(æ) ,..., hT(æ))

=D\

(Zi1; Zi2; . . . ; ZiT)

Linear
and Witten ,
2002].
8.5 185

Model A
[Clarke ,

:
(arnbigui ty
I æ) = , (8.27)

A(h I æ) = æ)

(æ) - H (æ))2 (8.28)

æ) = (1 (æ) - hi(æ))2 , (8.29)

E(H I æ) = (1 (æ) - H(æ))2 . (8.30)

I æ) E(hi I

= LWiE(hi I æ) -E(H I æ)
i=l
= E(h I æ) - E(H I æ) . (8.31)
186

Ei =J x)p(x)dx , (8.33)

Ai =J x)p(x (8.34)

E(H).
E =J E(H I (8.35)

E=E-A. (8.36)

and Vedelsby ,
(error-ambiguity decomposition).
8.5 187

= Y2) , . . . , (X m ,
{-1 ,

hi = -1

b d

measure)

d-Mb+c
1, 8 ,;.; = (8.37)
m

coefficient)

- bc
(8.38)
c)(c + d)(b + d)

• Q-statistic )
bc (8.39)

p-1 qa-Ba
= (8 .40)

Pl (8 .41)
m

P2 (8 .42)
m2
188

0 .40 0.40

0.35 0.35

0.30 0.30

0 .25
v
'

.......

0 .1 5

0. 10 r
nuo A 06
U
0 0 .2 0.4 0.6 0.8 - 0.2
( ) B F
(a) D

base
8.5 189

= {(Xl , Yl) , ,… , (Xm , Ym)};

1: for t = 1, 2, . . . ,T do

3: Dt =
4: ht = 5:. (D t )
5: end for
H(x) = (MapFt (x))
yEY

(Flipping Output) [Breiman ,


(Output
Smearing) [Breiman ,
and
Bakiri,
190

(Negative Correlation) [Liu and Yao ,

[Kuncheva, 2004; Rokach ,


[Schapire and
and Valiant ,

and Schapire ,

Mult iBoosting
[Webb ,

(statistical view) [Fried-


man et

[Demiriz et a l.,
and Wyner ,

(margin theory) [Schapire et a l.,


8.6 191

2014].

of
and Whitaker , 2003; Tang et a l.,

2012]

(selec-
tive [Rokach , 2010a]; [Zhou et a l.,

(ensemble selec- 2012]


tion).

[Zhou ,

and
2012]
192

8.1

k
P(H(n)";; k) (8 .43)
i=O

> 0, k = (p -

P(H (n) ,,;; (p - 8) n) ,,;; e- 282n . (8 .44)

8.2
-
(8 >
8.3

8.4 GradientBoosting [Friedman ,

8.5

8.6

8.7

8.8
Iterative

8.9*
193

Breiman, L. (1996a). "Bagging predictors." Learning, 24(2):123-140.


Breiman, L. (1996b). regressions." Machine 24(1):49 64.

Breiman , L. (2000). to increase prediction accuracy."


Machine 40(3):113-120.
Breiman , L. (2001a). "Random Machine Learning, 45(1):5-32.
Breiman, L. iterated bagging to debias Machine
Learning, 45(3) :261-277.
C1arke , B. (2003). mode1 averaging and stacking when mod-
e1 approximation error cannot be ignored." Journal 01 Machine Learning
4:683-712.
A. , K. P. Bennett , and J. Shawe-Tay1or. (2008).
Boosting via co1umn generation." Learning, 46(1-3):225-254.
Dietterich , T. G. (2000). "Ensemb1e methods In Pro-
ceedings 01 the 1st on Multiple Systems
(MCS) , 1-15 , Cagliari , Italy.
Dietterich, T. G. and G. (1995). "Solving multiclass 1earning problems
via error-correcting output codes." Journal 01 Artificial Intelligence Re-
2:263-286.
Y. and R. E. Schapire. generalization of
on-line learning and an application to boosting." Journal 01 Computer and
System Sciences ,
Friedman , J. , T. Hastie , and R. TibshiranL (2000). "Additive logistic
sion: A statistical view of boosting (with discussions)." Annals 01
28(2):337 407.

Friedman, J. H. (2001). "Greedy function approximation: A gradient Boosting


machine." Annals 01 Statistics , 29(5):1189-1232.
Ho , T. K. (1998). "The random subspace method for constructing decision
forests." IEEE on Pattern Machine Intelligence ,
20(8):832-844.
Ho , T. K., J. J. Hull , and S. N. Srihari. (1994). in mu1ti-
194

ple classifier systems." IEEE on Pattern Machine


Intelligence , 16(1) :66-75.
Kearns , M. and L. G. Valiant. (1989). "Cryptographic limitations on learni:rig
Boolean formulae and finite automata." In Proceedings 01 the 21st Annual
ACM on Computing (STOC) , 433-444 , Seattle, WA.
Kittler , J. , M. Hatef, R. and J. Matas. (1998). "On combining classifiers."
IEEE on Pattern Analysis and Machine Intelligence , 20(3):
226-239.
Kohavi , R. and D. H. Wolpert. (1996). "Bias plus variance decomposition for
zero-one 10ss functions." In Proceedings 01 the 13th Conference
on Machine , 275-283 , Bari , ltaly.
Krogh , A. and J. Vedelsby. (1995). "Neural network ensemb1es , cross va1idation ,
and active learning." In in Neural Processing Systems
7 (NIPS) (G. Tesauro , D. S. Touretzky, and T. K. Leen , eds.) , 231… 238 , MIT
Cambridge, MA.
Kuncheva , L. I. (2004). Combining Pattern Classifiers: Methods and Algo-
rithms. J & Sons , Hoboken , NJ.
Kuncheva , L. I. and C. J. Whitaker. (2003). "Measures of diversity in classifi-
er ensemb1es and their re1ationship with the ensemble accuracy." Machine
51(2):181-207.
Liu , Y. and X. YaQ. (1999). via negative
12(10):1399- 1404.
, H. (1952).
A. the statistical view
of boosting (with discussions)." Journal of Machine Learning

Perrone , M. P. and L. N. Cooper. (1993). "When networks disagree: Ensemble


method for neura1 networks." In Artificial Neural Networks for Speech
(R. J. ed.) , Chapman & Hall , New York , NY.
P1att , J. C. (2000). "Probabilities for SV machines." In Advances in
gin (A. J. Smo1a, P. L. Bart1ett , B. Schö1kopf, and D. Schuurmans ,
eds.) , 61 74 , MIT Press , Cambridge , MA.

195

Rokach , L. Arlificial Intelligence


33(1):1-39.
Rokach , L. (2010b). Pattem Ensemble Methods. World
Scientific , Singapore.
Schapire, R. E. (1990). "The strength ofweak learnability." Machine Leaming,
5(2):197 227.
Schapire , R. E. and Y. Freund. (2012). Boosting: Algorithms.
M1T Press , Cambridge , MA.
Y. P. Bartlett , and W. S. Lee. (1998). "Boosting the
margin: explanation for the effectiveness of voting methods."
01 26(5):1651-1686.
Seewald , A. K. (2002). "How to make Stacking better and faster while also tak-
ing care of an unknown weakness." 1n of the 19th Intemational
Conference on Machine 554-561 , Sydney, Australia.
E. K., P. N. Suganthan, and X. (2006). of diversity
measures." Machine Leaming, 65(1):247-27 1.
Ting , K. M. and 1. H. Witten. (1999). stacked Jour-
of Intelligence 10:271-289.
Webb , G. 1. (2000). "Mult iBoosting: A technique for combining boosting and
wagging." Machine
H.
260.
Wolpert , D. H. and W. G. Macready. (1999). "An method to estimate
Bagging's M achine 35(1):41-55.
Xu , L. , A. Krzyzak , and C. Y. Suen. (1992). "Methods of combining multiple
classifiers and their applications to recognition." IEEE
on Systems,
Zadrozny, B. and C. Elkan. (2001). probability esti-
mates from decision trees and naÏve Bayesian classifiers." 1n Proceedings of
the Conference on Leaming (ICML) , 609-616 ,
Williamstown , MA.
Zhou , Z.-H. (2012). Ensemble Methods:
196

Chapman &HalljCRC , Boca Raton , FL.


Zhou , Z.-H. (2014). margin distribution learning." In Proceedings of the
6th IAPR International Workshop on Neural Networks in Pattern
Recognition (ANNPR) , 1-11 , Montreal, Canada.
Zhou, Z.-H. and Y. (2004). ensemble based C4.5."
IEEE on
Zhou , Z.-H. , J. Wu , and W. Tang. (2002). "Ensembling neural networks: Many
could be better than all." Intelligence, 137(1-2):239-263.

Breiman,
(unsupervised

ty
(anomaly
(clustering).

=
= (X í1; Xí2;... ;
Il = = ø
= U7=1
(cluster

(validity
198

(intra-cluster (inter-cluster

(reference (external
(internal
index).

= = {Cl ,
= {q , 02 ,...,

88 = ((æi , æj) i < j)} , (9.1)


b =18DI , 8D = (9.2)

D8 = {(æi , æj)

d =IDDI , DD = {(æi , Xj) < j)} , (9.4)

(i <
c+ d =

• J accard

(9.5)

• and Mallows

(9.6)
9.3 199

m(m -1) . (9.7)

ICI(IOI - , (9.8)

diam(C) = dist(Xi , Xj) , (9.9)


= dist(Xi , Xj) , (9.10)
dcen(Ci , Cj ) = , (9.11)

dmin(Ci ,

• Bouldin

\
I ) . (9.12)

• )

DI = _min. ir ( ) • (9.13)
l-'Tii: J

(distance

dist(Xi , Xj) 0; (9.14)

= Xj ; (9.15)
200

= dist(xj , Xi) ; (9.16)

dist(Xi , Xj) dist(Xi , Xk) + . (9.17)

For samples x_i = (x_i1; x_i2; ...; x_in) and x_j = (x_j1; x_j2; ...; x_jn), the most common choice is the Minkowski distance

dist_mk(x_i, x_j) = ( Σ_{u=1}^{n} |x_iu − x_ju|^p )^{1/p}.                      (9.18)

For p = 2 it becomes the Euclidean distance

dist_ed(x_i, x_j) = ||x_i − x_j||₂ = √( Σ_{u=1}^{n} |x_iu − x_ju|² ),            (9.19)

and for p = 1 the Manhattan distance (city block distance)

dist_man(x_i, x_j) = ||x_i − x_j||₁ = Σ_{u=1}^{n} |x_iu − x_ju|.                 (9.20)
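A minimal sketch of Eqs. (9.18)-(9.20): the Minkowski distance and its p = 2 / p = 1 special cases.

```python
import numpy as np

def minkowski(xi, xj, p=2):
    # Eq. (9.18); p=2 is Euclidean (9.19), p=1 is Manhattan (9.20).
    return np.sum(np.abs(xi - xj) ** p) ** (1.0 / p)

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(a, b, p=2),   # Euclidean: 5.0
      minkowski(a, b, p=1))   # Manhattan: 7.0
```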

(continuous
(categorical
(numerical

(nominal attribute)

(ordinal
(non-ordinal
attribute)
(Value Difference Metric) [Stanfill and Waltz ,

(9.21 )
9.3 201

MinkovDMp(Xi , X j) = IL IXiu - Xju lP + VDMp(Xiu , Xju) I


\u= l u=nc+ l /
(9.22)

(weighted

distwmk(X'i , Xj) = IXil - Xj l lP + '"

0 (i =

(similarity

(distance metric

< d3
202

based

9.4.1

= , xm} , (k- means


= {C1 , C 2 ,...,

E=LL i=l æEGi


(9.24)

LæEGi
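A minimal sketch of the standard k-means iteration for Eq. (9.24) (Lloyd-style alternation between nearest-mean assignment and mean update; the random initialization and iteration cap are assumptions).

```python
import numpy as np

def k_means(X, k, iters=100, rng=np.random.default_rng(0)):
    means = X[rng.choice(len(X), size=k, replace=False)]     # random initial means
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = d.argmin(axis=1)                             # nearest-mean assignment
        new_means = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                              else means[i] for i in range(k)])
        if np.allclose(new_means, means):                     # means stopped changing
            break
        means = new_means
    return means, labels

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)), rng.normal(2, 0.3, size=(50, 2))])
means, labels = k_means(X, k=2)
print(means)
```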

et a l.,

1 0.697 0.460 11 0.245 0.057 21 0.748 0.232


2 0.774 0.376 12 0.343 0.099 22 0.714 0.346
3 . 0.634 0.264 13 0.639 0.161 23 0.483 0.312
4 0.608 0.318 14 0.657 0.198 24 0 .478 0 .437
5 0.556 0.215 15 0.360 0.370 25 0.525 0.369
6 0.403 0.237 16 0.593 0.042 26 0.751 0 .489
7 0.4 81 0.149 17 0.719 0.103 27 0.532 0.472
8 0.437 0.211 18 0.359 0.188 28 0.473 0.376
9 0.666 0.091 19 0.339 0.241 29 0.725 0.445
10 0.243 0.267 20 0.282 0.257 30 0.446 0.459
9.4 203

1Lk}
2: repeat
= ø k)
4: forj = 1 , 2, . . . , m do
5: dji =
6: = d ji ;
7: U{Xj};
8: end for
9: for i = 1 , 2 ,… , k do
10: X;
11:
12:
13: else
14:
15: end if
16: end for
17:
, Ck}

X12 ,

= = (0.343; = (0.532; 0.472)

= (0.697;
0.369 , 0.506 ,

01 = X8 , X9 , X lQ, X13 , X14 , X15 , X17 , X18 , X19 , X20 , X23};

02 = {X l1, X12 , X16};

03 = {X1 , X2 , X3 , X4 , X21 , X22 , X24 , X25 , X26 , X27 , X28 , X29 , X30}'

= (0.473; = (0.394; = (0.623; 0.388) .


204


(Learning Vector

= {(Xl , Yl ), (Xm ,
9.4 205

, (Xm , Ym)}j

2:
3:
4: i d ji = Ilxj -pil12j
5: djij
6: if Yj =
7: (Xj -
8: else
9: p' = (Xj -
10: end if
11:
12:
, Pq}

p' , (9.25)

=
= . (9.26)
206

(Iossy com
(vector
Ri = {x EX Illx -PiI12:( . (9.27)

(Voronoi tessellation).

Cl, C2 , C2 , Cl , Cl.

X12 , X18 , æ23 ,

0.283 , 0.506 , 0 .434 , 0.260 ,

p' (Xl - P5)

= (0.725; 0.445) + 0.1 . ((0.697; - (0.725; 0 .445))

= (0.722; 0 .442) .

IEI: (9.28)
E- 1 :
9.4 207


= Cl ,

PM (
PM(X) . p (x , (9.29)

>
coefficient) =1
208

=
2 ,...
P(Zj =

P (Zj = i) . PM(Xj i)
PM(Zj =i 1 Xj) =
PM(Xj)

. p(Xj :E i)
(9.30)
p(Xj :E l)

(i=1 , 2 ,..., k).

. (9.31)

:Ei)

\Ill-/'PA
LL D -qJ

\-atfF/

m 2 QU qdq,"

J.Li , :E i )

:E i ) = 0 , (9.33)
:E 1)

= PM(Zj
9.4 209

(9.34)

= j=1 (9.35)

0, =

(9.36)

p(Xj (9.37)
p(Xj

(9.38)

k}
210

}";i) 1 1 ,,;; i ,,;; k}


2: repeat
3:
4:
for j =, ,...,
1 2 m do

= i 1æj) (1";; i ,,;; k)


5: end for
6: for i = 1, 2 ,..., k do
7:

8:

9: •

10:
11: 11 ,,;; k}
12:
13: Ci ,,;;

14: for j = 1, 2,... , m do


15:
16: C>'j = C>'j U{æj}
17: end for

=
= :1: 6 ,

= 0.219 , 1'12 = =

0.361 , a; = = 0.316

= (0 .4 91; = (0.534; 0.295)

21=(::;;;;:;)Ji=(;:;:::17) z•(:;;;;;:;) ,
9.5 211

=

(density-based

5- (neigh-
patial Clustering of Appli-
cations with Noise"
D=

= I : : ; E};
212

= Xi , Pn

0 0...- - 0
h XA

OY\/21 , , \ ,J , J d )

(9.39)

(9.40)

(seed) ,
9.5 213

MinPts).

2: for do
3:
4: if ;;;:: MinPts then
5: 0 = OU{Xj}
6: end if
7: end for

f=D
10: while do
11: f o1d = f;
12: =< 0>;
13: f = f \ {o};
14: while do
15:
16: if then

18:
19: f =f \ t:l i
20: end if
21: end while
22: k= k+ = f o1d \f;
23: O=O\Ck
24: end while
= {Cl, C 2 ,..., Ck}

0.11 , MinPts = 5.
D= X9 , X13 , X14 , X18 , X19 , X24 ,

01 = {X6 , X7 , X8 , X lO, X12 , X18 , X19 , X20 , X23}'

D = D \ 01 =
X13 , X14 , X24 , X25 , X28 ,
214


= 0.11 , M inPts =
"0"

C2 = X17 , X 21} ,
C3 = ;
C4 = X 27 , X28 , X 30 } .

AGNES (AGglomerative NESting) is a bottom-up hierarchical clustering algorithm: it starts with every sample as a singleton cluster and repeatedly merges the two closest clusters until the desired number of clusters remains. The distance between clusters can be measured by the minimum or maximum inter-sample distance (the Hausdorff distance of Exercise 9.2 is another option):

d_min(C_i, C_j) = min_{x∈C_i, z∈C_j} dist(x, z),                               (9.41)
d_max(C_i, C_j) = max_{x∈C_i, z∈C_j} dist(x, z).                               (9.42)

= {X1 , X2 , … , Xm};
dmax

1: for j = 1, 2,…, mdo


2:
3: end for
4: for i = 1, 2, . . . , m do
5: for j = 1, 2,…, mdo
6: M(i , j) =
7: M (j, i) =
8: end for
9: end for
q=m
11: while q > k do
j* 12:
13:
14: for j = j* + do
15:
16: end for
17:
18: for j = 1, 2, .. . , q - 1 do
19:
20: j)
21: end for
22: q = q-1
23: end while
= {C1 , C 2 ,..., Ck}
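A minimal sketch of the AGNES procedure above: start with singleton clusters and repeatedly merge the closest pair until k clusters remain, using complete linkage (d_max of Eq. (9.42)) as the cluster distance. The quadratic pairwise search is kept naive for clarity.

```python
import numpy as np

def agnes(X, k):
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])   # d_max, Eq. (9.42)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])      # merge the closest pair
        del clusters[b]
    return clusters

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 0.2, size=(10, 2)), rng.normal(3, 0.2, size=(10, 2))])
print(agnes(X, k=2))
```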

AGNES
216


C1 = {æ !, æ26 , æ2g}; C2 = {æ2 , æ3 , æ4 , æ21 , æ22};


C3 = {æ23 , æ24 , æ28 , æ30}; C4 = {æ5 , æ7};

C5 = {æ9 , æ13 , æ14 , æ16 , æ17}; C6 = æ18 ,

C7 = {æ11 , æ12}.
9.7 217


= 7, 6 , 5,
218

Jain 1999; Xu and Wunsch II , 2005;

silhouette width)
1988; Halkidi et a1., 2001; Maulik and
2002].

[Deza and Deza , 2009]. and


et a1.,
2000; Tan et a l.,
al.,

[Jain and Dubes , 1988; Jain , 2009].


and Rousseeuw ,
Fuzzy C -means [Bezdek , 1981]
(soft

k-
[Schölkopf et al., clustering) [von Luxburg ,
et al.,

[Pelleg and Moore , 2000; Tibshirani et

[Kohonen , 2001]. [McLachlan and Peel ,


1998; Jain and Dubes , 1988].

[Ester et a l., [Ankerst et a1.,


9.7 219

[Hinneburg and AGNES

DIANA [Kaufman and Ro usseeuw ,

[Zhang et a l., [Guha et al. ,

detection detection) [Hodge and Austin , 2004; Chandola et

et al., 2012].
220

9.1

lim_{p→∞} ( Σ_{u=1}^n |x_iu − x_ju|^p )^{1/p} = max_u |x_iu − x_ju| .


J
9.2
(Hausdorff distance):

dist_H(X, Z) = max( dist_h(X, Z), dist_h(Z, X) ) ,    (9.44)

where

dist_h(X, Z) = max_{x∈X} min_{z∈Z} ‖x − z‖₂ .    (9.45)

9.3
9.4

9.5

9.6
9.7

9.8
9.9*
9.10*
221

Aloise , D. , A. Hansen , and P. Popat. (2009). "NP-hardness of


Euclidean sum-of-squares clustering." Machine Learning,
Ankerst , M. , M. Breunig, H.-P. Kriegel , and J. Sander. (1999). "OPTICS: Or-
to the clustering structure." In Proceedings of the
ACM SIGMOD on of
MOD) , 49-60 , Philadelphia, PA.
S. 1. and J.

Bezdek , J. C. (1981). Pattern Recognition with Fuzzy Objective Function Algo-


rithms. Plenum Press , New York , NY.
Bilmes , J. A. tutorial of the EM algorithm and its appli-
cations to parameter estimation for Gaussian mixture and hidden Markov
models." Technical Report TR-97-021 , Department of Electrical Engineering
and Computer Science , University of California at Berkeley, Berkeley, CA.
Chandola , V. , A. Banerjee , and V. Kumar. (2009). "Anomaly detection: A
survey." ACM Computing 41(3):Article 15.
Deza, M. and E. Deza. (2009). Encyclopedia of Springer ,
Dhillon ,1. Y. Guan , and B. Kulis. Spectral
tering and normalized cuts." of the 10th ACM SIGKDD In-
ternational Conference on (KDD) ,
551 556 , Seattle, WA.

Ester , M. , H. P. Kriegel , J. Sander , and X. Xu. (1996). "A density-based algo-


rithm for discovering clusters in large spatial databases." In Proceedings of
the 2nd International Conference on Knowledge Discovery and
(KDD) , Portland , OR
V. (2002). "Why so many clustering - a position
paper." SIGKDD Explorations ,
R. Rastogi , and K. Shim. A robust clustering al-
gorithm for categorical attributes." In Proceedings of the 15th
Conference on Sydney, Australia.
Halkidi , M. , Y. Batistakis , and M. Vazirgiannis. (2001). "On clustering valida-
222

tion
145.
Hinneburg , A. and D. A. Keim. (1998). "An efficient approach to clustering
large multimedia databases with noise." In Pmceedings of the 4th
tional Conference on K Discovery and (KDD) , 58-65 ,
New York , NY.
Hodge , V. J. and J. Austin. (2004). "A survey of outlier detection methodolo-
gies." A 7t ificial Intelligence
Z. (1998). "Extensions to the k-means algorithm for clustering large
data sets with categorical values." Knowledge Discovery ,
2(3) :283-304.
D. W. , D. Weinshall , and Y. Gdalyahu. (2000). with
non-metric Image retrieval and class representation." IEEE
Transactions on Pattern Analysis and Machine Intelligence , 6(22):583-600.
Jain , A. K. (2009). "Data clustering: 50 years beyond k-means." Pattern Recog-
nition Letters , 31(8):651-666.
Jain , A. K. and R. C. Dubes. (1988). Algorithms for Clustering Prentice
Hall , Upper Saddle River , NJ.
K., M. N. Murty, and P. J. Flynn. (1999). "Data clustering: A review."
ACM Computing 3(31):264-323.
Kaufman , L. and P. J. Rousseeuw. (1987). "Clustering by means of medoids."
In Statistical on the LI-Norm Related Methods (Y.
Dodge , ed.) , 405-416 , Elsevier , Amsterdam , The Netherlands.
Kaufman , L. and P. J. (1990). in An Intro-
duction to Cluster Analysis. & New York , NY.
Kohonen , T. (2001). 3rd edition. Springer , Berlin.
Liu , F. T. , K. M. Ting , and Z.-H. Zhou. (2012). "Isolation-based anomaly detec-
tion." ACM on Knowledge Discovery fmm 6(1):Article
3.
Maulik , U. and S. Bandyopadhyay. (2002). "Performance evaluation of some
clustering algorithms and validity indices." IEEE on
Analysis and Machine Intelligence , 24(12):1650-1654.
223

McLachlan , G. and D. Peel. (2000). Finite Mixture Models. John Wiley & Sons ,
New York , NY.
Mitchell , T. (1997). McGraw Hill , New York , NY.
Pelleg , D. and A. Moore. (2000). "X-means: Extending
estimation of the number of clusters." 1n of the 1 7th
tional Conference on Machine Learning (ICML), 727-734, Stanford, CA.
Rousseeuw , P. J. (1987). "Silhouettes: A graphical aid to the interpretation
and validation of cluster analysis." Journal of Computational and Applied
Mathematics, 20:53-65.
Schölkopf, B., A. Smola, and K.-R. Müller. (1998). "Nonlinear component anal-
ysis as a kernel eigenvalue problem." Neural Computation, 10(5):1299-1319.
Stanfill , C. and D. Waltz. memory-based

Tan , X. , 8. Chen , Z.-H. Zhou , and J. Liu. (2009). under oc-


clusions and variant expressions with partial similarity." IEEE
on Forensics and Security, 2(4):217-230.
G. Walther , and T. Hastie. (2001). "Estimating the number of
clusters in a data set via the gap statistic." of the
Society - Series B , 63(2):411-423.
von Luxburg , U. (2007). "A tutorial on spectral and
Computing , 17 (4) :395 416.

Xing , E. P. , A. Y. Ng , M. 1. Jordan , and 8. Russell. (2003). "Distance metric


learning , with application to clustering with 1n
in Neural Information Processing Systems 15 (NIPS) (8. Becker, 8. Thrun ,
and K. Obermayer , eds.) , 505-512 , MIT Press , Cambridge, MA.
Xu , R. and D. Wunsch II. (2005). "Survey of clustering algorithms." IEEE
Transactions on Neural Networks , 16(3):645-678.
Zhang , T. , R. Ramakrishnan , and M. Livny. (1996). "BIRCH: An efficient data
clustering method for large databases." 1n Proceedings of the ACM
SIGMOD on of
103-114 , Montreal , Canada.
Zhou , Z.-H. (2012). Ensemble Methods: Chap-
224

man & Hall/CRC , Boca Raton , FL.

Zhou , Z.-H. and Y. Yu. (2005). "Ensembling local learners through multimodal
perturbation." IEEE Transactions on Systems, Man, and Cybernetics -
Part B: Cybernetics , 35(4):725-735.

Minkowski ,

- 3) + (33 - 23) =

(Kaunas)
10.1

(lazy

(eager learning) .

[Figure: illustration of the k-nearest-neighbor classifier; dashed circles mark the neighborhoods used for different values of k, and the query point is classified by voting among the neighbors inside the circle.]

k =
226

Consider the 1-nearest-neighbor classifier, and let z denote the nearest neighbor of a test sample x. The probability of error is the probability that x and z carry different labels:

P(err) = 1 − Σ_{c∈Y} P(c | x) P(c | z) .    (10.1)

Assume the samples are i.i.d. and dense enough that z can be taken arbitrarily close to x, so that P(c | z) ≃ P(c | x). Let c* = argmax_{c∈Y} P(c | x) denote the prediction of the Bayes-optimal classifier. Then

P(err) ≃ 1 − Σ_{c∈Y} P²(c | x)
       ≤ 1 − P²(c* | x)
       = (1 + P(c* | x)) (1 − P(c* | x))
       ≤ 2 × (1 − P(c* | x)) .    (10.2)

That is, the generalization error of the nearest-neighbor classifier is at most twice the error of the Bayes-optimal classifier [Cover and Hart, 1967].
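As a concrete illustration (not from the book), a k-nearest-neighbor classifier needs only a distance computation and a majority vote; the toy data below are assumed.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=1):
    """Classify x by majority vote among its k nearest training samples."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

# illustrative usage on random data
X_train = np.random.rand(100, 2)
y_train = (X_train[:, 0] + X_train[:, 1] > 1).astype(int)
print(knn_predict(X_train, y_train, np.array([0.4, 0.7]), k=5))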

The argument above relies on a dense-sample assumption. Suppose we require a training sample within distance δ = 0.001 of every possible test point along each attribute: with a single attribute on [0, 1] about 1000 samples suffice, but with 20 attributes roughly (10³)²⁰ = 10⁶⁰ samples would be needed.
10.2 227

[Bellman , 1957]
(curse of
dimensionality) .
red uction)

(Mult iple Dimensional [Cox and


Cox ,

Z E ]R d'x m , d'
- zj ll = distij .

dist;j = IIzil12 + IIzj l1 2- 2z; Zj

= bii + bjj - 2bij (10.3)


228

bij = bij

+ mbjj , (10 .4)

(10.5)
j=1
m m
nunhu
= ,
i=1 j=1

tr(B)

(10.7)

(10.8)

(10.9)

decomposition) , B =
A = ;;:: ... V

nu
Z= Xm
-EA
10.3 229

. (10.12)

<<- d.
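The MDS derivation above reduces to: double-center the squared distance matrix to recover the inner-product matrix B, eigendecompose it, and keep the d' largest eigenvalues. Below is a minimal NumPy sketch (rows of the returned Z are samples, i.e. the transpose of the convention used in the text); the toy data are illustrative.

import numpy as np

def mds(D, d_prime):
    """Classical MDS: from an m x m distance matrix D to d'-dimensional coordinates."""
    m = D.shape[0]
    D2 = D ** 2
    # double centering recovers the inner-product matrix B
    J = np.eye(m) - np.ones((m, m)) / m
    B = -0.5 * J @ D2 @ J
    w, V = np.linalg.eigh(B)                  # eigendecomposition of B
    idx = np.argsort(w)[::-1][:d_prime]       # d' largest eigenvalues
    w, V = np.clip(w[idx], 0, None), V[:, idx]
    return V * np.sqrt(w)                     # coordinates Z (rows are samples)

X = np.random.rand(20, 5)                     # illustrative high-dimensional data
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
Z = mds(D, d_prime=2)
print(Z.shape)                                # (20, 2)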

z=wTx , (10.13)

Z E

(i =1

Component
230

=
IIwil12 = 1, =0
( <
Zi2;...;

= TWTXi + const

(W T (10.14)

Xi X ;

tr (WTXXTW) (10.15)

s.t.VVTVVzI

tr (WTXXTW) (10.16)

S.t. WTW=I ,
10.3 231

X2

0.045
2 Xl

, (10.17)

;;:: ... =

æ ij

., Wd"
).
i= l j= l

t. (10.18)
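The optimization (10.15)/(10.16) is solved by the leading eigenvectors of the centered covariance matrix, which gives the usual PCA recipe. A minimal sketch (illustrative data; not the book's implementation):

import numpy as np

def pca(X, d_prime):
    """PCA: project centered data onto the d' leading eigenvectors of the covariance."""
    Xc = X - X.mean(axis=0)                      # center the samples
    C = Xc.T @ Xc                                # proportional to the covariance matrix
    w, V = np.linalg.eigh(C)
    W = V[:, np.argsort(w)[::-1][:d_prime]]      # projection matrix W
    return Xc @ W, W                             # low-dimensional coordinates z = W^T x

X = np.random.rand(100, 8)                       # illustrative data
Z, W = pca(X, d_prime=3)
print(Z.shape, W.shape)                          # (100, 3) (8, 3)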
232

(c)

[Schölkopf et al.,

(10.19)
10 .4 233

, (10.20)

i = 1, 2,. . .

(10.21)

(10.22)

Xj) = . (10.23)

(10.24)

(K)ij A = .

(j = 1, 2, . . .

æ) , (10.25)
234

(manifold

[Tenenbaum et al.,
10.5 235

= {Xl , X2 ,..., Xm};

1: for i = 1, 2,... , rn do
2:
3:
4: end for
Xj);

7: return

Linear [Roweis and Saul,


236

Xk
Xk
z

• Xl

Xi = + WilXI , (10.26)

(10.27)
Jbm
S.t. = 1,

= (Xi - Xj)T(Xi -

kEQi
(10.28)
l ,sEQi

(10.29)
10.6 237

= Zm) E ]Rd'Xm , (W)ij

M = (1 - W? (1 - W) , (10.30)

(10.31)

ZZT = 1.

= {Xl , X2 ,'" , Xm};

1: for i = 1, 2,…, mdo


2:
3:
4: = 0;
5: end for

8: return
=

(distance metric learning)


238

= Ilxi - Xj = +... + , (10.32)

dist!ed(Xi , Xj) = Ilxi - .

= , (10.33)

0, W = (W)ii

distance)
C.

flp M = = - Xj) = Ilxi - , (10.34)

Component
[Goldberger et al.,
10.6 239

(10.35)
3 ,

Pi , (10.36)

= (10.37)

-
(10.38)
i=l

et a l., 2005]

(Xi'

[Xing et a l., 2003]:

mjp - (10.39)

"
240

et a1., 1996];

1997].

[Fisher , [Baudat
and Anouar ,
(Canonical Correlation
[Harden et
view

[Yang et al. ,
[Ye et a 1., Zhou ,
and Bader , 2009].

cian [Belkin and Niyogi ,


Tangent Space [Zhang and Zha ,
Preserving [He and Niyogi ,

et al. ,

[Belkin et al. , 2006]. [Yan et

et a1.,
et
10.7 241

and Saul ,
et al., 2007; Zhan et
et
al.,
[Davis et a l.,
242

10.1

10.2

(" IYI.. err*


- 1 "v" )l .
_____* \
(10 .40)

10.3

10 .4

10.5

10.6

http://vision.ucsd.edu/content
jyale-face-database
10.7

10.8*

10.10
243

Aha , D. , ed. (1997). Kluwer , Norwell , MA.


Baudat , G. and F. Anouar. (2000). "Generalized discriminant analysis using a
Neural
Belkin , M. and P. Niyogi. (2003). "Laplacian eigenmaps for dimensionality re-
duction and data representation." Neural
Be1kin , M. , P. Niyogi , and V. Sindhwani. (2006). "Manifold A
geometric framework for learning from labeled and unlabe1ed examples."
Journal of Machine Research,
Bellman , R. E. (1957). Programming. Press ,
Princeton , N J.
Cover , T. M. and P. E. Hart. (1967). "Nearest neighbor pattern
IEEE on Theory ,
Cox , T. F. and M. A. Cox. (2001). Multidimensional Chapman & Ha1-
l/CRC , UK.
Davis , J. V. , B. P. Jain , S. Sra, and 1. S. Dhillon. (2007). "Information-
theoretic metric 1earning." In Proceedings of the 24th International Conferc
ence on Corvalis , OR.
Fisher , R. A. (1936). "The use of multip1e in taxonomic prob-
1ems." Annals of Eugenics , 188.
Friedman , J. H. , R.
of Co on Aritificial elligence
717 724 , Port1and , OR.

Frome , A. , Y. Singer , and J. Malik. (2007). "Image retrieval and classification


using local distance functions." in N eural Process-
ing Systems 19 (NIPS) (B. J. C. Platt , and T. Hoffman , eds.) ,
MIT Press , Cambridge , MA.
Geng , X. , Zhan , and Z.-H. Zhou. (2005). "Supervised nonlinear dimen-
sionality reduction for visualization and
-P 1107.
J. , G. E. Hinto S. T. R. R.
"N ana1ysis." In in Neural
244

Systems 17 (NIPS) (L. K. Weiss , and L. Bottou , eds.) ,


513-520 , MIT Press , Cambridge , MA.
Harden , D. R., S. Szedmak, and J. Shawe-Taylor. (2004). "Canonical
tion analysis: An overview with application to learning methods." Neural
:2639-2664.
He , X. and P. Niyogi. (2004). "Locality In Advances
in Neural Processing Systems 16 (NIPS) (S. Thrun , L. K. Saul ,
and B. Schölkopf, eds.) , 153-160 , MIT Press , Cambridge , MA.
H. (1936). "Relations between two sets of variates."
(3-4) :321-377.
T. G. and B. W. Bader. (2009). "Tensor decompositions and applica-
tions." SIAM Review, 51(3):455… 500.
Roweis , S. T. and L. K. Saul. (2000). "Locally linear Science , 290
(5500):2323-2316.
A. Smola, and K. -R. Müller. (1998). "Nonlinear component anal-
kernel eigenvalue problem." Neural 10(5):1299-1319.
J. B. , V. de Silva, and J. C. Langíord. (2000). "A global geomet-
ric framework for dimensionality reduction." Science , 290(5500):
2319-2323.
Wagstaff, K. , C. Cardie , S. Rogers , and S. Schrödl. (2001). "Constrained
k-means clustering with background knowledge." In Proceedings of the
18th Intemational Conference on Machine 577-584 ,
MA.
Weinberger , K. Q. and L. K. Saul. (2009). ."Distance metric learning for large
margin nearest neighbor Journal of Re-
10:207-244.
E. P. , A. Y. Ng , M. 1. Jordan , and S. Russell. (2003). "Distance metric
learning, with application to clustering with side-information." In
in Neural Information 15 (NIPS) (8. Thrun ,
and K. Obermayer , eds.) , 505-512 , MIT Press , Cambridge , MA
Yan , 8. , D. Xu , B. Zhang , and H.-J. Zhang. (2007). "Graph embedding and ex-
tensions: A general framework for dimensionality reduction." IEEE Trans-
245

on Pattem and Machine Intelligence , 29(1):40-5 1.


A. F. FIangi , and J.- Y. Yang. (2004). "Two-dimensional
PCA: A new approach to appearance-based face representation and recog-
nition." IEEE on Pattem Intelligence ,
26(1):131-137
Yang , L. , R. R. Sukthankar , and Y. Liu. (2006). "An efficient algorithm
for local distance metric learning." In Proceedings of the 21st National Con-
ference on Boston , MA.
Ye , J. , R. Janardan , and Q. Li. (2005). "Two-dimensionallinear discriminant
analysis." In Advances in Information Processing 17 (NIPS)
(L. K. Saul , Y. Weiss , and L. Bottou , eds.) , 1569-1576 , MIT Press ,
bridge , MA.
Zhan , D.-C. , Y. -F. Li , and Z.-H. Zhou. (2009). "Learning instance specific
distances using metric propagation." In Proceedings ofthe 26th Intemational
Conference on 1225-1232 , Montreal , Canada.
D. and Z.-H. Zhou. (2005). "(2D)2PCA: 2-directional 2-dimensional
PCA for efficient face representation and recognition." Neurocomputing , 69
(1-3) :224-23 1.
Zhang , Z. and H. Zha. (2004). "Principal manifolds and nonlinear dimension
reduction via local tangent space alignment." SIAM Joumal on Scientific
Computing, 26(1):313-338.
246

Pearson ,

College

,
(relevant
(irrelevant
(feature selection).
(data

(redundant
248

(su bset

(subset
(i =

{D l, D2 ,...,

, (1 1. 1)
11.2 249

IYI
Ent(D) =- (1 1. 2)

Relief (Relevant Features) [Kira and Rendell ,

Y1) ,
(X2 , Y2) , ..., (x m ,
250

+ , (1 1. 3)

=
= -

and Rendell ,

[Kononenko ,

(l = 1, 2,…, IYI;

8j
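A small Python sketch of the Relief statistic for the two-class case (near-hit / near-miss contributions as in (11.3)); continuous attributes are scaled to [0, 1], and the random data are illustrative. The multi-class Relief-F variant would additionally weight the miss term over the other classes.

import numpy as np

def relief(X, y, n_samples=None, seed=0):
    """Relief relevance statistic for binary classification with continuous attributes.
    Larger scores mean the attribute better separates the two classes."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    # scale attributes to [0, 1] so differences are comparable across attributes
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)
    idx = rng.choice(m, n_samples or m, replace=False)
    delta = np.zeros(n)
    for i in idx:
        same, other = (y == y[i]), (y != y[i])
        same[i] = False
        d = np.linalg.norm(X - X[i], axis=1)
        near_hit = np.argmin(np.where(same, d, np.inf))    # nearest neighbor of same class
        near_miss = np.argmin(np.where(other, d, np.inf))  # nearest neighbor of other class
        delta += -(X[i] - X[near_hit]) ** 2 + (X[i] - X[near_miss]) ** 2
    return delta / len(idx)

X, y = np.random.rand(60, 4), np.random.randint(0, 2, 60)  # illustrative data
print(relief(X, y).round(3))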
11. 3 251

LVW (Las Vegas Wrapper) [Liu and Setiono ,


Vegas

1:
2: d= IAI;
3: A* = A;
4: t = 0;
5: while t < T do
6:
7: d'= IA'I;
8: E' =
9: < d)) then
10: t = 0;
11: E=E';
12: d= d';
13: A* =A'
14: else
15: t=t+l
16: end if
17: end while
252

y l),

I)Yi - WTXi)2 (1 1. 5)

(1 1.6)

(ridge regression) [Tikhonov


, and Arsenin ,

(1 1. 7)

(Least Absolute
Selection Operator)
50.
11 .4 253

[Figure: contours of the squared-error term together with the L1 and L2 constraint regions in the (w1, w2) plane; the L1 region has corners on the axes, which is why the LASSO solution tends to be sparse.]

Gradient Descent ,
[Boyd and Vandenberghe ,

Consider the L1-regularized optimization problem

min_x f(x) + λ‖x‖₁ .    (11.8)

Suppose f(x) is differentiable and its gradient ∇f satisfies an L-Lipschitz condition, i.e. there is a constant L > 0 such that

‖∇f(x') − ∇f(x)‖₂ ≤ L ‖x' − x‖₂   (∀ x, x') .    (11.9)

Around x_k, f(x) can then be approximated by the quadratic upper bound

f̂(x) ≃ f(x_k) + ⟨∇f(x_k), x − x_k⟩ + (L/2) ‖x − x_k‖²
     = (L/2) ‖x − ( x_k − (1/L) ∇f(x_k) )‖₂² + const ,    (11.10)

whose minimum is attained at

x_{k+1} = x_k − (1/L) ∇f(x_k) .    (11.11)

Applying the same idea to (11.8) gives, at each step,

x_{k+1} = argmin_x (L/2) ‖x − ( x_k − (1/L) ∇f(x_k) )‖₂² + λ‖x‖₁ ,    (11.12)

i.e. a gradient step on f(x) followed by minimizing the L1-regularized distance. Writing z = x_k − (1/L) ∇f(x_k), the update becomes

x_{k+1} = argmin_x (L/2) ‖x − z‖₂² + λ‖x‖₁ ,    (11.13)

which decomposes over the components of x and has the closed-form soft-thresholding solution

x_{k+1}^i = z^i − λ/L ,   if z^i > λ/L ;
x_{k+1}^i = 0 ,            if |z^i| ≤ λ/L ;    (11.14)
x_{k+1}^i = z^i + λ/L ,   if z^i < −λ/L ,

where x_{k+1}^i and z^i denote the i-th components of x_{k+1} and z.
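The whole procedure, a gradient step followed by the soft-thresholding of (11.14), is easy to write down for the LASSO objective. The sketch below uses the exact Lipschitz constant of the squared-error gradient as the step size, and the synthetic data are illustrative assumptions.

import numpy as np

def soft_threshold(z, tau):
    """Componentwise solution of eq (11.14): shrink z toward zero by tau."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_pgd(X, y, lam=0.1, n_iter=500):
    """Proximal gradient descent for min_w ||y - Xw||^2 + lam * ||w||_1."""
    m, n = X.shape
    L = 2 * np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    w = np.zeros(n)
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ w - y)           # gradient of the smooth part f(w)
        z = w - grad / L                       # gradient step, eq (11.13)
        w = soft_threshold(z, lam / L)         # proximal step, eq (11.14)
    return w

X = np.random.randn(50, 10)
w_true = np.zeros(10); w_true[[1, 4]] = [2.0, -3.0]
y = X @ w_true + 0.01 * np.random.randn(50)
print(lasso_pgd(X, y, lam=1.0).round(2))       # most coefficients driven to exactly zero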
11. 5 255

(codebook)

(dictionary
learning)
(sparse

(1 1. 15)
256

minllxi - . (1 1. 16)

(1 1. 17)

= A = E ]R kxm , 11 .

[Aharon et a l.,

=
)=1 IIF

X- J - bíoí
\ J7=' / IIF
2F ( )
n E LU
.
4tinxu

= 2:#i
11.6 257

pressíve sens-
sensing) [Donoho , 2006; a l.,
mg

n <<:

y = q, æ , (11.19)

(1 1. 20)
258

(Restricted Isometry [Candès , 2008].

x m (n<<
E

(1- :::;; (1 , (1 1. 21)

-ms s nu
(1 1. 22)

s.t. Y = As .

L1
et a l.,

min IIsl11 (1 1. 23)

s.t. Y = As .
11. 6 259

(Basis
Pursuit [Chen et al. , 1998].

(collaborative filter-

"

5 ? 3 2
?
53?· ?
whuq'-qr·

5 ?

3 5 4

and Recht ,

rank(X) (1 1. 24)

s.t. (X)ij =
260

(nuclear norm):
norm)

min{m ,n}

IIXII*= , (1 1. 25)

IIXII* (1 1. 26)

S.t. (X)ij = (A)ij , (i , j) E n.

Programming ,
n<<
[Recht , 2011].

and Fu kunaga ,
et al.,
(Akaike Criterion) [Akaike , [Blum and
Langley , [Forman ,

and
John , et al.,

[Quinlan ,
and Pederson , 1997; Jain and Zongker ,
and Elisseeff, 2003; Liu et al.,
1 1. 7 261

[Liu and Motoda , 1998 , 2007].


LARS (Least Angle [Efron et al.,

LASSO
LASSO [Yuan and
Lin , LASSO [Tibshirani et al.,

and Hastie , 2005].


et al.,

(group
sparse coding)
et a l., 2008; Wang et al., 2010].
2006; et al.,
et al. ,
et al., 2010]. [Baraniuk ,

Pursuit) [Chen et al.,


Pursuit ) [Mallat and [Liu and Ye ,

(http://www.yelab.net/software/SLEP/).
262

1 1. 1

1 1.. 2

1 1. 3

11.4

1 1.5

1 1. 6

11.7

1 1. 8

1 1. 9

11.1 0*
263

M. Elad , and A. Bruckstein. (2006). "K-SVD: An algorithm for


designing overcomplete dictionaries for sparse representation." 1EEE 'I'ra ns-
on Processing, 54(11):4311-4322.
Akaike , H. (1974). "A new look at the statistical model 1EEE
on Automatic Control, 19(6):716 723.•

Baraniuk, R. G. (2007). "Compressive sensing." 1EEE Signal Processing


24(4):118-121
F. Pereira , y. Singer , and D. Strelow. (2009). cod-
In Advances in Neural 1nformation Processing 22 (N1PS) (Y.
Bengio , D. Schuurmans, J. D. Lafferty, C. K. 1. Williams , and A. Culotta,
eds.) , 82-89 , MIT Press , Cambridge , MA.
Blum, A. and P. Langley. (1997). "Selection of relevant features and examples
in machine learning." 97(1-2):245-27 1.
Boyd , S. and L. Vandenberghe. (2004). Convex Optimization. Cambridge Uni-
versity Press , Cambridge , UK.
Candès , E. J. (2008). "The restricted isometry property and its implications for
compressed sensing." Rendus
E. J. , X. Li , Y. Ma , and J. Wright. (2011). "Robust principal
nent analysis?" Joumal 01 the ACM, 58(3):Article 11.
E. J. and B. Recht. matrix completion via convex op-

E. J. , J. . uncertainty principles:
Exact signal highly incomplete frequency
1EEE on 1nformation 52(2):489-509
Chen , D. L. Donoho , and M. A. Saunders. (1998). "Atomic decomposition
by basis pursuit." S1AM on Scientific 20(1):33-6 1.
Donoho , D. L. (2006). "Compressed sensing." 1EEE on
tion Theory , 52(4):1289-1306.
T. Hastie , 1. Johnstone , and R. Tibshirani. (2004). "Least angle
regression." Annals 01 Statistics , 32(2):407-499.
264

Forman , G. (2003). "An extensive empirical study of feature selection metrics


for text classification." Joumal of Leaming Research, 3:1289-1305.
Guyon , 1. and (2003). "An introduction to variable and feature
selection." Joumal of Machine Leaming Research , 3:1157-1182.
Jain , A. and D. Zongker. (1997). selection:
small sample. 1EEE s

Kira , K. and L. A. Rendell. (1992). "The feature selection problem: Tradi-


tional methods and a new algorithm." In Proceedings of the 10th
on (AAA1) , 129-134 , San Jose , CA.
Kohavi , R. and G. H. John. for feature subset selection."
1ntelligence , 97(1-2):273-324.

LIEF." ln Proceedings of the 7th European Conference on Leaming


(ECML) , 171-182 , Catania, Italy.
Liu , H. and H. Motoda. (1998). Feature Selection for Knowledge
Kluwer , Boston , MA.
Liu , H. and H. Motoda. (2007). M Selection.
Chapman & HalljCRC , Boca Raton , FL.
H. and Z. Zhao. (2010). "Feature selection: An
ever evolving frontier in data mining." In Proceedings of Workshop
on Selection in (FSDM) , 4-13 , Hyderabad , India.
Liu , H. and R. Setiono. (1996). "Feature selection and classification - a prob-
abilistic wrapper approach." In Proceedings of the 9th 1ntemational Con-
ference on Engineering of Artificial 1ntelligence
Expert Systems (1EAj Fukuoka, Japan.
Liu , J. and J. Ye. (2009). Euclidean projections in linear time."
In Proceedings of the 26th Conference on Leaming
657-664 , Montreal , Canada.
Mairal , J. , M. Elad , and G. Sapiro. (2008). "Sparse representation for color
image restoration." 1EEE on Processing, 17(1):5 3-69.
S. G. and Z. F.
265

dictionaries." IEEE on Signal Processing , 41(12):3397-3415.


P. M. and K. Fukunaga. (1977). "A branch and bound algorithm
for feature subset selection." IEEE on C-26(9):
917-922.
J. Novovicová, and J. Kittler. (1994). methods in
feature selection." Recognition Letters ,
J. R. (1986). "Induction of decision trees." Machine 1(1):
81-106.
B. approach
12:3413-3430.
Recht , B. , M. Fazel , and P. Parrilo. (2010). "Guaranteed
lutions of linear matrix equations via nuclear norm minimization." SIAM

(1996). and selection via the LASSO."


Journal of the Royal Series B ,
M. Saunders , S. Rosset , J. K. Knight. (2005). "Spar-
sity and smoothness via the fused LASSO." Journal of the Royal Statistical
B , 67(1):91-108.
Tikhonov , A. N. and V. Y. Arsenin , eds. Solution of Problem
DC.
Wang , J. , J. Yang , K. Yu , F. Lv , T. and Y. Gong. (2010). "Locality-
constrained linear coding for image classification." In Proceedings of the
IEEE Computer Society Conference on Computer Recog-
nition (CVPR) , 3360-3367, San CA.
Weston , J. , A. Elisseff, B. Schölkopf, and M. Tipping. (2003). "Use of the zero
norm with linear models and kernel methods." Joumal of Machine Learning

Yang , Y. and J. O. Pederson. (1997). "A comparative study on feature selection


in text categorization." In Pmceedings of the 14th Conference
on Machine Leaming (ICML) , 412-420 , Nashville , TN.
Yuan , M. and Y. Lin. (2006). "Model selection and estimation in regression
with grouped variables." of the Royal Society - Series B ,
266

68(1):49-67
Zou , H. and T. Hastie. (2005). "Regularization and variable selection via the
elastic net." of the Royal Statistical Society - Series B , 67(2):301
320.

Ulam, 1909- 1984)

1867-1918
learning

= {(Xl , Yl) , (Xm , Ym)} ,

and identically

E(h;Ð) , (12.1)

Ê(h;D) (12.2)

:::;;

d(h 1 , h2) "1 h2(X)) . (12.3)


268

f( lE (x)) lE(f (x)) . (12 .4)

(12.5)

) (12.6)

sup xm) - Xi+l,..., ,


X l,..., Xm ,

P (f (Xl ,"', Xm ) E) exp , (12.7)

(12.8)

12.2

Approximately
1984].

=
(concept
class)

(hypothesis
12.2 269

<

P(E(h) E) 1- 8 , (12.9)

< E, 8 <
270

Learning
E, 1/6,
PAC learnable)

;?;

(properly PAC

=1=
1111
12.3 271

P(h(x) = Y) = 1 -
= 1- E(h)
<1 • E • (12.10)

P( (h(Xl) = Yl) = Ym)) = (1- P (h )m


< (1 - E)m • (12.11)

E(h) > E = 0) < E)m

(12.12)

<5, (12.13)

(12.14)

.
272

<E<

P(Ê(h) - E(h) ? E) :S:; exp( -2m(2) , (12.15)

P(E(h) - Ê(h) ? -2m(2) , (12.16)

P(IE(h) - .

<E<

<8<

(1 ro f7_\ ;:;r ,_\1 yLU1fLI 8.

: IE(h)- Ê(h)1 > E)


1
,

L P(IE(h) - Ê(h)1 > E) :s:; 21 1-L 1exp( -2m(2) ,


12.4 273

exp(

(agnostic

PAC
0 8 <
"
m ?:

(12.20)

12.4

(Vapnik-
dimension) 1971].

.
=

{(h (æ 1), h (æ2) ,"', h (æ m ))}.


274

= (h(X1) ,..., h(xm )) I .

E N, 0 < E <

and Chervonenkis , 1971]. > E) (12.22)

= 2m } . (12.23)
12.4 275

E b} , X = = +1 ,
= 0.5 , X2
{h [o ,l] ' h[O ,2]' h[l ,2J'
< X4 <
(X5 ,

X =

E
= =

1972]:

(12.24)

= 1, =
- 1, d - - 1, = {Xl' X2 , … , Xm } ,
D' = {Xl' X2 ,..., Xm- l} ,

= { (h (Xl) , h (X2) , ... , h (xm)) I ,

= { (h (Xl) , h (X2) ,"' , h (Xm-l)) I h E 1i} .


276

={ I 3h ,

(h(Xi) = h' (Xi) = (x m )) ,

11iID I = (12.25)

(12.26)

1) 1) (12.27)

I
1l D I
1 (m 1) + (m 1)
(n:_-;:l) o.
=
= -=-11) )

{12.28)
12.4 277

= (:tt
m;;'d.
(:t i:; (7)
=

d, 0 <8<

P II E(h)
_",
- ,;:::" , 18d ln
m' - --- 0 I
\
1- 8 . (12.29)

E= 11
m


278

(12.30)

Risk

E(g) . (12.31)

K=; ,
(12.32)

_ f- __,..
E(g) E(g)

(12.34)
m 2 '

P( E(h) -

E(h) -

= E(h) - E(g) + E
12.5 279

12.5

Rademach-
er(1892-1969)
= {(X l, Y1) , (x m ,

Ê(h)

1
mz2
(12.36)

ar324ph(24) (12.37)

Yi) ,
280

(12.38)

, (12.39)

Rz(F) =

Rm (F) (12 .4 1)

et
2012]:

= Zm} , 0 < t5 <


12.5 281

+ (12.42)

(12.43)

Êz (f) ,

Ez (f) - EZ' (f)


fEF

= sup 1.(Zm) - f(z '.r,)


fEF m

(12 .44)
282

I sup Ez' (f) - Ez (f) I


J

J J

;:; r - r \ , . Jln(2 j8)


Rz( F) + (12 .4 5)

+3 (12 .46)


12.5 283

,x m } , 0 < Ó <

'8 {1. \ , n (/'I1\


+, _I/ln(ljó)
2m
(12 .47)

'8" \ ,; , " /ln(2jó)


+ RD( 1l) +
(/'I I\
E(h) E(h)
I

= X x

!h (z) = !h (x , y) = ,

{!h: h

sup 1
J

WL

-
- Yi ih(Xi))]
(J

(12.50)

(12.51)


284

et
a l., 2012]

Rm (tl) (12.52)

E(h) :s; Ê(h) + (12.53)

= {Z1 = (Xl , Yl) , Z2 = = (Xm , Ym)} ,


Yi =
12.6 285

D\i = Zi+ l,.. ., Zm} ,

Dt =

J!('c, [J!('c D, Z)] . (12.54)

(12.55)

1-m
avb Z D at ZD z 9" vhuEU

stability):

Z =

1J!('cD , Z) - J!('cD\i , Z) I ::;; ß , i = 1, 2, . . . , m , (12.57)

1J!('cD , Z) -J!('cDi , Z)1


-J!('c D\i , Z)1 + 1J!('cDi , Z) -.e('cD\i , Z)1
286

= z)
[Bousquet and

quet and Elisseeff, 2002].

0< ð<

+ 2ß + (4mß + M)V (12.58)

D) + ß + (4mß + M) (12.59)

ß=

D) - i( 'c,
-

i( 'c, (12.60)

Risk
12.7 287

Jl (g ,V) =
hEH

,
Ed-2'x
E-AVBq"

p mE

I Jl (g , V) - ê( g , D) I ,,;;

Jl('c,

Jl( Jl (g ,

,,;; Jl('c, D) - Jl (g , D) +E
E

[Valiant ,
[Kearns and Vazirani ,
288

and Chervonenkis ,

and

Ben-David et al., 1995].


and Panchenko ,
and Mendelson , [Bartlett ct al.,

and Elisseeff, 2002]


[Mukherjee
et et

et a l., (Asyrnptotical Empirical Risk Minimization)

deterministic

et al., 1996].
289

12.1

12.2

= 2e- 2me2 . 12.3

12 .4

12.5

12.6

12.7

12.8

12.9
+
12.10*
290

P. L. , O. Bousquet , and S. Mendelson. (2002). "Localized Rademacher


complexities." In Proceedings of the 15th Conference on
Theory 44-58 , Sydney, Australia.
Bartlett , P. L. and S. Mendelson. (2003). "Rademacher and Gaussian com-
plexities: Risk bounds and structural results." Journal of
3:463-482.
Ben-David , S. , N. Cesa-Bianchi , D. Haussler , and P. M. Long. (1995). "Charac-
terizations of learnability for classes of {O , . . . , n }-valued functions." Journal
of System Sciences , 50(1):74-86.
Bousquet , O. and A. Elisseeff. (2002). "Stability and generalization." Journal
of Machine Learning Research, 2:499-526.
Devroye , L. , L. G. eds. (1996). A Theory of
Springer , New York , NY.
Hoeffding , W. (1963). "Probability inequalities for sums of bounded random
variables." of the 58(301):13-30.

MIT Press , Cambridge , MA.


V. and D. Pa
ing the of function ." In High Dimensional Probability II (E.
D. M. Mason , and J. A. Wellner , eds.) , 443-457 , Birkhäuser Boston,
Cambridge , MA.
McDiarmid , C. (1989). "On the method of bounded in
Combinatorics , 141(1):148… 188.
Mohri , M. , A. and A. Talwalkar , eds. (2012). Foundations of
MIT Press , Cambridge , MA.
S. , P. Niyogi , T. Poggio , and R. M. Rifkin. (2006). "Learning the-
ory: Stability is sufficient for generalization and necessary and sufficient
for consistency of empirical risk minimization." in Computational
25(1-3):161-193.
K. (1989). "On learning sets and functions." Machine Learning,
291

Sauer , N. (1972). "On the density of families of sets." Journal of Combinatorial


Series A , 13(1):145-147.
Shalev-Shwartz , S. , O. Shamir , N. Srebro , and K. Sridharan. (2010).
ability, stability and uniform convergence." Journal of Machine Learning

Shelah , S. (1972). "A combinatorial problem; stability and order for models
and theories in languages." Journal of 41
(1):247-261
Valiant , L. G. (1984). "A theory of the learnable." of the
ACM, 27(11):1134-1142.
Vapnik , V. N. (1971). "On the uniform convergence of
relative frequencies of events to their probabilities." Theory of
Its Applications, 16(2):264-280.
292

TCS (Theoretical Computer

G.

theory of the
learnable"

Thring Award Goes to in Machine


= {(X1 , Y1) ,

l<<

(active

!"
294

[Figure: a binary classification task with a few labeled "+" / "−" samples and many unlabeled samples (dots); the unlabeled samples reveal the cluster structure and suggest how the query point should be labeled.]
(duster

(manifold
13.2 295
296

= {1 , 2,...

N
p(x) p(X , (13.1)

0, = 1;

I(x) = argmaxp(y = j 1 x)

N
= (13.2)

p(x :Ei )
p(8 = i 1 x) = ;. (13.3)
p(x :E i)

= j 1

= j
1 8 =
= j 18 = i) = =j 1 = o.
= j 1 8 = i,

= ...,
13.2 297

Du = {XZ+l , XZ+2' … , Xl+ u}, 1<< u , 1+ u =

Dz

IN \
LL(Dz U Du) = . p(Xj :E i) . p(Yj 1e = i ,xj) 1
/

:Ei)) (13 .4)

:E i)
ji
'YIJ" = N (13.5)
. p(Xj :E i)

Z eqJ
(13.6)

:E i =
$30Jz z \z30tL

(13.7)

(13.8)

and U yar ,
et
298

[Cozman and Cohen ,

Support Vector

(low-density

[Figure: semi-supervised SVM seeks the separating hyperplane that passes through the low-density region between the clusters formed by the labeled and unlabeled samples.]

(Transductive Support Vector


Machine) [Joachims ,

assignment)
13.3 299

= . . , (XI ,
XI+2 , . . . 1 << u , 1 + u = m.
= . . . , YI+u) ,

me
g ,_
+CILÇi +Cu (13.9)

S.t. i = 1, 2,… , l ,
+ b) ;? i = 1+ 1, 1 + 2 ,… ,m ,
Çi ;? 0 , i =

• ,
=
300

= {(Xl ,Yl) , (X2 , Y2) , … , (Xz , yz)};


'= {X l+ l , Xl+ 2,"', XZ+u};
Cu .

(YZ +1, Yl+ 2, …, Yl+ u);


CZ ;
4: while Cu < Cz do
5: y , C z, e;
6: while :l{ i , j I 0) ^ (Çi > > 0) ^ (Çi 2)} do
7: Yi = -Yi;
8: Yj
9: e
10: end while
11: Cu '= min{2Cu , Cz }
12: end while
y= • , YZ+u)

, (13.10)
11 ,+

+ çj >

[Joachims. 1999].

[Chapelle and Zien ,


meanS3VM [Li et a l.,
13.4 301

= {(X1 , Y1) , , XZ+u } ,


l +u = =
=

I( exp
__._
), if od 14 )
(W)ij = < ,-- / .
l 0, otherwise ,

=
Yi = Sign (f (Xi)) ,
(energy function) [Zhu et al. , 2003]:

- f(Xj))2

1-2
IIU Z PTd
Z q" w ,
d
PT
Z rJ z

-
i=1 j=1
= fT(D - W)f , (13.12)

= (fzT (f (X1); f(X2);"'; f(xz)) , fu = (f (XZ +1);


...;
D = diag(d1 , d2 ,"',

W=
IWuz Wuul
302

D = I Dll
I uul Ll UU I

l Olu Wluilri

2fJWudl (13.14)

fu = (Duu - W uu )-lWudl . (13.15)

-ud
DO
P D W

.. t!
D;-;-lWlu 1I
.L.IU •• ,,,
(13.16)
'

= =

(Duu(1 -

=(I
= (1 - Puu)-lpudl . (13.17)

et a l., 2004].

U =
- {Xl ,'"
= diag(d1 , d2 ,...,
+ u) x = (F I, F i,...,
Fi = ((F) i1, • ,
Yi = arg maX1';;j ,;; IYI (F)ij.

= 1, 2 ,..., m , j = 1, 2,...,
13.4 303

OK
4.'''hw 'bn8
-qJ
F QU Y -qJ
(13.18)
'+b ,A

F(t + 1) + (13.19)

F* = .lim F(t) = (13.20)


z• -+00

= (:1: 1, Yl)};

4: t= 0;
5:
6: +
7: t =
8:
9: for i = 1 + 1, 1 + 2,..., 1

11: end for


iì =

et al. , 2004]

11 • •
,
\Ô=l \ (W)ij 11
. . ) tJ 11 z
(13.21)
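Below is a minimal sketch of the iterative multi-class label propagation described above (Gaussian affinity graph, normalized propagation matrix S, iteration F ← αSF + (1−α)Y). It is not the book's implementation; the bandwidth σ, the parameter α and the toy two-cluster data are illustrative assumptions.

import numpy as np

def label_propagation(X, y, n_labeled, sigma=0.25, alpha=0.99, n_iter=200):
    """Iterate F <- alpha * S @ F + (1 - alpha) * Y on a Gaussian affinity graph,
    then label each unlabeled sample by the largest entry of its row of F."""
    m = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    Dm = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    S = Dm @ W @ Dm                                   # normalized affinity matrix
    classes = np.unique(y[:n_labeled])
    Y = np.zeros((m, len(classes)))
    for j, c in enumerate(classes):                   # one-hot rows for labeled samples
        Y[:n_labeled, j] = (y[:n_labeled] == c)
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * S @ F + (1 - alpha) * Y           # propagation step
    return classes[F[n_labeled:].argmax(axis=1)]      # predicted labels for unlabeled part

X = np.vstack([np.random.randn(3, 2), np.random.randn(3, 2) + 4,
               np.random.randn(37, 2), np.random.randn(37, 2) + 4])
y = np.array([0] * 3 + [1] * 3 + [0] * 37 + [1] * 37)
pred = label_propagation(X, y, n_labeled=6)           # only the first 6 labels are used
print((pred == y[6:]).mean())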
304

based

sity.
" " (co-training) [Blum

(multi-view

(attribute
set)
13.5 305

= =

sufficient

[Blum and Mitchell ,

and Mitchell ,

and Zhou ,
and Li ,
[Zhou and Li ,
306

Yl) ,.…,( (x f, xf) , Yl)};


{(Xf+l' XT+1) , . . . , (xf+u' xT+u)};

2: Du \Ds;
3: for j = 1, 2 do

5: end for
6: for t = 1 , 2 , . . . , T do
7: for j = 1, 2 do

9
c Ds;
10: = I Xi E Dt};
11: I
12: Ds = Ds \ (Dp U Dn);
13: end for
14: if
15: break
16: else
17: for j = 1, 2 do
18:
19: end for
20:
21: end if
22: end for
h2
13.6 307

et al. ,
=

2: repeat
3: Cj = ø (1 j k);
4: fori=1 , 2,..., mdo
5: d ij = ;

7: is ...merged=false;
8: while -, do
9: minjE /C dij ;
10:
11: if -, is_voilated then
12: Cr =
13:
14: else
{r};
16: ø then
17:
18: end if
19: end if
20: end while
21: end for
22: for j = 1, 2, . . ., k do
X;
24: end for
25:
C2 ,..., Ck}
308

(X i ,

X25) , (X12 ' X20 ) , (X20 , X12) , (X14 , X17 ) , ( X 17 , X14)} ,

C= {(X2 , X21) , (X21' X2) , (X13 , X 23 ) , ( X23 , X 13), (X19 , X23 ) , ( X23 , X19)}.

= X6 , X12 ,

[Figure: constrained k-means clustering results on the watermelon dataset using the must-link and cannot-link constraints above; panels show the partitions after successive iteration rounds.]

(
=
13.6 309

C1 = {X3 , X5 , X7 , X9 , X13 , X14 , X16 , X17 , X21};

C2 = {X6 , X8 , X 1O, X11 , X12 , X15 , X18 , X19 , X20};


C3 = X26 , X27 , X28 , X29 , X30}.

=
label).

Seed
et al.,

= {Xl' X2 ,..., xm};


8c D, 181 << IDI

1: for j = 1 , 2 , . . . , k do
2: X
end for
4: repeat
5:
6: for j = 1, 2, . . . , k do
7: for all X E S , do
8: C j = Cj U{ X }
9: end for
10: end for
11: for all
12: dij = Ilxi ;
13: r = d ij ;
14: Cr = Cr U{x i}
15: end for
16: for j = 1 , 2 ,. . . , k do
X;
18: end for
19:
C2 ,..., C k}
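A compact sketch of k-means with labeled seed samples in the spirit of the algorithm above: the seeds determine the initial means and their own cluster assignments are never changed. The data and seed labels below are assumed for illustration only.

import numpy as np

def seeded_kmeans(X, seed_idx, seed_labels, k, n_iter=100):
    """k-means that uses a small labeled 'seed' set: seeds fix both the initial
    centroids and their own cluster assignments throughout the iterations."""
    labels = np.full(len(X), -1)
    labels[seed_idx] = seed_labels
    centroids = np.array([X[seed_idx][seed_labels == j].mean(axis=0) for j in range(k)])
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        labels[seed_idx] = seed_labels                  # seeds never change cluster
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

X = np.random.rand(30, 2)                               # illustrative data
seed_idx = np.array([0, 1, 2, 3, 4, 5])
seed_labels = np.array([0, 0, 1, 1, 2, 2])              # assumed supervision
labels, _ = seeded_kmeans(X, seed_idx, seed_labels, k=3)
print(np.bincount(labels))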
310

82 = 83 = {XI4 , X17}.

C1 = X 4 , X22 , X23 , X24 , X25 , X26 , X27 , X28 , X29 , X30} ;

C2 = x 8,x lQ, xu , X12 , X 15 , x18 , x19 , X20} ;


C3 = X17 ,

[Figure: constrained seed k-means clustering results on the watermelon dataset using the labeled seed samples above; panels show the partitions after successive iteration rounds.]

=
13.7 311

and Landgrebe ,

and Mitchell ,
et al.,
and Landgrebe ,

et a l.,

et a l.,
[Collobert et a l.,
and Chawla ,

and Zhang , 2006; Jebara et a l.,


et a l., 2003].

[Blum and Mitchell ,


and Li , 2005b].

[Belkin et a l., regular-


312

and Cohen ,

[Li and Zhou ,

and Li , et a l.,
et a l., 2006b;
[Zhou and Li ,
[Settles ,
313

13.1

13.2

13.3 of

P Z QV
Z P Z AU --qa

13.4
http:j j

13.5

13.6*

13. 7*

13.8

13.10
314

(2013).
Basu , S. , A. Banerjee, and R. J. Mooney. (2002). "Semí-supervised clusteríng
by seedíng." In Proceedings of the 19th on M achine
19-26 , Sydney, Australía.
P. Níyogí , and V. Síndhwaní. (2006). "Manifold regularízatíon: A
geometríc framework for learníng from labeled and unlabeled examples."
Journal of 7:2399-2434.
Blum, A. and S. (2001). "Learníng from labeled and unlabeled data
usíng graph míncuts." In Proceedings of the 18th International Conference
on Machine , 19-26 , Wíllíamston , MA.
Blum, A. and T. Mítchell. (1998). "Combíníng labeled and unlabeled data with
co-traíning." In ofthe 11th Annual Conference on Computation-
al Learning Theory 92-100 , Madison , W I.
Chapelle , 0. , M. Chi , and A. Zien. (2006a). "A contínuation method for semi-
supervísed SVMs." In Proceedings of the 23rd Conference on
Machine , 185-192 , Pittsburgh , PA.
Chapelle , 0. , B. Schölkopf, and A. Zien , eds. (2006b). Semi-Supervised
ing. MIT Press , Cambrídge , MA.
Chapelle , 0. , J. Weston , and B. Schölkopf. (2003). "Cluster kernels for semi-
supervised learníng." In in Processing Systems
15 (NIPS) (S. Becker , S. K. Obermayer , eds.) , 585'---592 , MIT
Press , Cambridge , MA.
Chapelle , O. and A. Zien. (2005). learning by low density
separation." In Proceedings of the 1 0th W orkshop on A
Intelligence and Statistics (AISTATS) , 57-64 , Savannah Hotel , Barbados.
R., F. Sinz , J. L. Bottou. (2006). "Tr adíng convexíty for
scalability." In Proceedings of the 23rd Conference on Machine
201-208 , Pittsburgh , PA.
G. and I. Cohen. (2002). "Unlabeled data can degrade classifìca-
tíon performance of generative In Proceedings of the 15th Inter-
Conference of the Intelligence Research Society
315

(FLAIRS) , 327-331 , Pensacola , FL.


Goldman , S. and Y. Zhou. (2000). "Enhancing supervised learning with unla-
beled data." In Proceedings of the 17th International
327-334 , San Francisco , CA.
Jebara , T. , J. and S. F. Chang. (2009). "Graph construction and
matching for semi-supervised learning." In Proceedings of the 26th Interna-
tional Conference on 441-448 , Montreal , Cana-
da.
Joachims , T. (1999). "Tr ansductive inference for text classification
port vector machines." In Proceedings of the 16th Conference
on Machine , 200-209 , Bled , Slovenia.
Li , Y.-F. , J. T.
label In Proceedings of the 26th International Conference on Machine
633-640 , Montreal , Canada.
Li , Y.-F. and Z.-H. Zhou. (2015). "Towards making unlabeled data never hurt."
IEEE on Pattern and Machine 37(1):
175-188.
Miller , D. J. and H. S. Uyar. (1997). "A mixture of experts classifier with
learning based on both labelled and unlabelled data." In in Neural
Information Systems 9 (NIPS) (M. Mozer , M. 1. Jordan , and T.
Petsche , eds.) , 571-577 , MIT Press , Cambridge , MA
A. S. Thrun , and T. Mitchell. (2000). "Text
tion from labeled and unlabeled documents using EM." Learning,
39(2-3):103-134.
B. (2009). "Active Technical Re-
port 1648 , Department of Computer University of Wis-
consin at W I. http:j jpages.cs.wisc.eduj ,,-, bsettlesj
pub j
and D. Landgrebe. (1994). "The effect of unlabeled samples
in reducing the small sample size problem and mitigating the Hughes phe-
nomenon." IEEE Transactions on Remote Sensing, 32(5):
1087-1095.
316

S. S. Keerthi , and O. Chapelle.


for semi-supervised kernel machines." In Proceedings ofthe 23rd
al Conference on Machine (ICML) , 123-130 , Pittsburgh, PA.
K., C. Cardie , 8. Rogers , and S. Schrödl. (2001).
k-means clustering with background knowledge." In Proceedings of the
18th Intemational Conference on M achine 577-584 ,
MA.
F. and C. Zhang. (2006). "Label propagation through linear neighbor-
hoods." In Proceedings 01 the 23rd Intemational Conference on Machine
Leaming (ICML) , 985-992 , Pittsburgh , PA.
D. , Z.-H. Zhou , and S. Chen. (2007). dimensionality
reduction." In Proceedings of the 7th SIAM Intemational Conference on
Minneapolis , MN.
Zhou , D. , O. Bousquet , T. N. Lal , J. Weston , and B. Schölkopf. (2004).
ing with local and global consistency." In Advances in Neuml Information
16 (NIPS) (8. Thrun , L. Saul , and B. Schölkopf, eds.) ,
284-291 , MIT Press , Cambridge , MA.
"When
In
529-538 , Reykjavik , Iceland.
Zhou , Z.-H. and M. Li. (2005a). regression with
In Proceedings of the 19th Joint Conference on
telligence (IJCAI) , 908-913 , Edinburgh , Scotland.
Zhou , Z.-H. and M. Li. (2005b). Exploiting unlabeled data using
three classifiers." IEEE on
17(11):1529-1541
Zhou , Z.-H. and M. Li. (2010). learning by disagreement."
Systems , 24(3):415-439.
Zhu , X. (2006). learning literature survey." Technical Report
1530 , Department of Computer Sciences , University of Wisconsin at Madi-
son , Madison , W I. http:j jwww.cs.wisc.eduj"'jerryzhujpubjssl....s urvey.pdf.
Zhu , X. , Z. Ghahramani , and J. Lafferty. learning
317

using Gaussian fields and harmonic functions." In Proceedings of the 20th


International Conference on Machine 912-919 ,
ton, DC.

Riemann,

1853
I
I 0).

graphical

network).
Markov
Bayesian
320

., OM}.

(Markov

, Xn , Yn) = rr P(Yi I Yi-1)P(Xi I yi) . (14.1)

= P(Yt+l = Sj I Yt = s i) , 1:(; i , j :(; N ,

B=

. bij = P(Xt = Oj I , 1 :(; i :(; N , 1 :(; j :(; M


14.1 321

P(Yl

..., Xn}:

< =t +

02 , . . .

{Xl , X2 ,...,

=
322

Random

functions)

{Xl , x 2}, {Xl ,

X =

, (14.2)

=
14.2 323

Lx

ç xQ*;

P(X) , (14 ,3)

= Lx

X5 ,

(separating

• (global Markov

XB I xc.

C B
324

P (XA , XB , XC) P(XA , XB , XC)


P(XA , XB I XC) = - \-;"1___ 0
:-'-'/

P(XC)

(14.5)

P(XA'XC)
P(XA I XC)
P(XC)

.
(14.6)

P(XA , XB I XC) = P(XA I XC)P(XB I XC) , (14.7)

Markov
14.3 325

(Markov =n(v)U l_ XV\n*(v) I Xn(v) .


blanket)
Markov

l_ X v I XV\(u ,v) .

= <I 1. 5‘ if
---, --
I 0.1 , otherwise ,
I 0.2 ‘ if XR = X r;;
--, --
I 1. 3, otherwise ,

= e-HQ(xQ) (14.8)

HQ(xQ) =' , (14.9)

Random
326

= {Xl' X2 , … , = .. .,
I

[Figure: part-of-speech tagging as a conditional random field; the observation sequence x ("The boy ...") is annotated with the label sequence y = [D] [N] [V] [P] [D] [N].]

I = P (Yv I x , Yn(v)) , (14.10)

(chain-structured

Yl Y2 Y3 Yn

x = {Xl X2 ... Xn }
14.3 327

P X =
--zp
ex UU + X + QU'K X

1i

feature
Sk(Yi , X ,
feature

I 1, if Yi+ 1 = [P ], Yi = [V] and


i) = <
- 1 0 , otherwise ,

I 1, if Yi = [V] and Xi
=<
1 0, otherwise ,
328

(marginalization).

P(XE , XF) P(XE , XF)


P(XF I
P(XE , XF)

P(XE) . (14.13)
14.4 329

m43(X3)

P(Xl , X2 , X3 , X4 , X5)
X4 X3 X2 Xl

I Xl) P(X3 I X2)P(X4 I X3)


X4 X3 X2 Xl

(14.14)

X3) I X2) L P(Xl)P(X2 I Xl)

I I X2)m12(X2) , (14.15)
X3 X4 X2

P(X5) I X3) L

I I X3)

= L P (X5 I

= m35(X5) . (14.16)
330

,
(14.17)

=jm35(Z5) (14.18)

mij(Xj) II (14.19)
14.5 331

P(Xi) oc II mki(Xi) (14.20)

inference).

14.5.1
332

Ep[f] = J (14.21 )

N
, (14.22)

Chain
Monte

(14.23)

p(f) = lEp (14.24)

(14.25)
14.5 333

I x) ,

p(xt )T(xt - 1 IXt) = p(X t - 1 )T(xt I x t - 1) , (14.26)

Metropolis-Hastings
(rejeet
et al., K.

p(x^{t−1}) Q(x* | x^{t−1}) A(x* | x^{t−1}) = p(x*) Q(x^{t−1} | x*) A(x^{t−1} | x*) ,    (14.27)

where Q(x* | x^{t−1}) is the user-specified proposal distribution and A(x* | x^{t−1}) is the probability of accepting the candidate x*.

1: Initialize x⁰;
2: for t = 1, 2, ... do
3:   Sample a candidate x* from Q(x* | x^{t−1});
4:   Sample a threshold u from the uniform distribution on (0, 1);
5:   if u ≤ A(x* | x^{t−1}) then
6:     x^t = x*
7:   else
8:     x^t = x^{t−1}
9:   end if
10: end for
11: return x¹, x², ...
334

A(x* | x^{t−1}) = min( 1 , [ p(x*) Q(x^{t−1} | x*) ] / [ p(x^{t−1}) Q(x* | x^{t−1}) ] ) ,    (14.28)

I
xt = ,
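A minimal random-walk Metropolis-Hastings sketch follows: with a symmetric Gaussian proposal the acceptance probability (14.28) reduces to min(1, p(x*)/p(x^{t−1})), so only an unnormalized (log-)density is needed. The target below is an illustrative two-component mixture, not an example from the book.

import numpy as np

def metropolis_hastings(log_p_tilde, x0, n_samples=5000, step=0.5, seed=0):
    """Random-walk Metropolis-Hastings: propose x* ~ N(x, step^2) and accept with
    probability min(1, p(x*) / p(x)), the symmetric-proposal case of eq (14.28)."""
    rng = np.random.default_rng(seed)
    x, samples = float(x0), []
    for _ in range(n_samples):
        x_star = x + step * rng.standard_normal()
        if np.log(rng.random()) < log_p_tilde(x_star) - log_p_tilde(x):
            x = x_star                      # accept the candidate
        samples.append(x)                   # otherwise keep the previous state
    return np.array(samples)

# illustrative target: an unnormalized mixture of two Gaussians
log_p = lambda x: np.log(np.exp(-0.5 * (x + 2) ** 2) + np.exp(-0.5 * (x - 2) ** 2))
draws = metropolis_hastings(log_p, x0=0.0)
print(draws.mean().round(2), draws.std().round(2))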

notation) [Buntine,
14.5 335

rr I:
N
p(x 1 8) = P(Xi' (14.29)

z 1

=
@

= 1 (14.31)

z 1

p(z 1 1

lnp(x) = .c (q) + KL(q 11 p) , (14.32)

.c (q) = J q(z) ln { } dz , (14.33)

KL(q 11 p) = - J q(z) (14.34)


336

M
q(Z) = (14.35)

exponential
qi

qi

z)] + const , (14.37)

lEi #j [lnp(x , z)] = !lnp(x, z) rr qidzi .

(qj =

lnqj(zj) = [lnp(x , z)] + const , (14.39)

[lnp (x , z) l)
q;(Z3)=.(14AO)
J/ J exp (x , Z)] )dzj

[lnp(x ,
14.6 337

field (mean

Dirichlet

(k =

1, 2,...

C.1. 6
338
14.7 339

T K I N \
IIp(8 t I II I Zt ,n , 8 t ) ), (14 .4 1)

pa
LR HK
@
(14.42)

T
LL( (14.43)

(14 .44)

and Friedman , 2009].

[Pearl, [Pearl ,
and Geman ,

et
and McCallum , 2012].
340

Belief Propagation)
[Murphyet
and Kappen , 2007; (factor graph)
[Kschischang et al.,
and Spiegelhalter ,

et a l.,

[Jordan ,
and Jordan , 2008].

LDA [Blei et a l.,

(non-parametric
et a l., and

[Hofmann ,

et a l., 2003;
Gilks et 1996]
341

14.1

14.2

14.3

14.4

14.5

14.6

14.7

14.8

14.9*

14.10*
342

C. , N. De A. Doucet , and M. 1. Jordan. (2003). "An intro-


duction to MCMC for machine Machine 50(1-2):5-43.
Blei , D. M. (2012). "Probabi1isitic topic models." ofthe ACM,
55( 4):77-84.
Blei , D. M. , A. Ng , and M. 1. Jordan. Dirichlet
of M

2:159-225.
and D. Geman. (1984). "Stochastic relaxation , Gibbs distributions ,
and the Bayesian restoration of images." IEEE on Pattern
Machine Intelligence , 6(6):721-74 1.
Ghahramani , Z. and T. L. Griffiths. (2006). "Infinite latent feature models
the Indian process." In
18 (NIPS) (Y. Schölkopf, and J. C. Platt , eds.) , 475-482 ,
MIT Press , Cambridge , MA.
Gilks , W. R., S. Richardson , and D. J. Spiegelhalter. (1996).
Monte in Practice. Chapman & HalljCRC , Boca Raton , FL.
Gonzalez , J. E. , Y. Low , and C. Guestrin. (2009). "Residualsplash for optimally
parallelizing belief propagation." In Proceedings of the 12th
on and
Clearwater Beach, FL.
Hastings , W. K. (1970). "Monte Carlo sampling methods using Markov chains
and
Hofmann , T. (2001). probabi1istic latent semantic
analysis." Machine 42(1):177-196.
Jordan , M. 1., ed. (1998). Models. Kluwer , Dordrecht ,
The Netherlands.
Koller , D. and N. Friedman. (2009). Graphical Models: Principles
Techniques. MIT Press , Cambridge , MA.
B. J. Frey, and H.-A. Loeliger. (2001). "Factor graphs and
the sum-product algorithm." IEEE on Information Theory , 47
343

(2) :498-519.
Lafferty, J. D. , A.
dom Probabilistic mode1s for segmenting and labeling sequence data."
1n Proceedings of the 18th Intemational Conference Leaming
282-289 , Williamstown , MA.
Lauritzen , S. L. and D. J. Spiege1ha1ter. (1988). "Local computations with prob
abi1ities on graphica1 structures and their application to expert systems."
of the Royal Society - Series B , 50(2):157-224.
Metropolis , N. , A. W. Rosenb1uth , M. N. Rosenb1uth , A. H. Teller , and E.
Teller. (1953). "Equations of state calcu1ations by fast computing machines."
Joumal of Chemical Physics , 21(6):1087-1092.
J. M. and H. J. Kappen. (2007). "Sufficient conditions for convergence
of the sum-product a1gorithm." IEEE on Information Theory ,
53(12):4422-4437.

tion for approximate inference: An empirical study." 1n Proceedings of the


15th Conference on in Artificial Intelligence
Sweden.
Nea1 , R. M. (1993). "Probabilistic inference using Markov chain Monte Carlo
methods." Technica1 Report CRG-TR-93-1 , Department of Computer Sci-
ence , University of Toronto
Pearl , .J. (1982). "Asymptotic properties of trees and game-searching
procedures." 1n Proceedings of the 2nd National un
Intelligence (AAAI) , Pittsburgh , PA.
J. "Fu structuring belief

Pearl , J. "

Pearl , J. (1988). Probabilistic Reasoning 'in Intelligent Systems: Networks of


Plausible Inference. Morgan Kaufmann , San Francisco , CA.
Rabiner , L. R. (1989). "A tutoria1 on hidden Markov model and selected ap-
plications in speech Proceedings of the IEEE , 77(2):257-286.
344

Sutton , C. McCallum. (2012). "An introduction to random


fields." Trends in Machine Learning, 4(4):267-373.
Teh , Y. W. , M. 1. Jordan , M. J. Beal , and D. M. Blei. (2006). "Hierarchical
Dirichlet processes." Journal of the Statistical Association, 101
(476):1566-1581
Wainwright , M. J. and M. 1. Jordan. (2008). "Graphical models , exponential
and variational inference." Trends in Machine

Y. (2000). "Correctness of local probability propagation in graphical


models with loops." Neural
345

Pearl , 1936-

.
Newell

"9.
Michael
[Fürnkranz et al., 2012 J. (rule

@• f1 ^ f2 (15.1)

^ ;
348

val

(confl.ict

(ordered
(priority

(default rule)

(propositional
(first-order (propositional

(atomic
+

^
15.2 349

(relational

(sequential

=
350

6,
.

^ .

^
15.2 351

.
352

(beam

and Niblett ,

Ratio

LRS = 2. (r 'h-l- log0 .


15.3 353

(Reduced Error
[Brunk and Pazzani ,
(growi ng

and
(Incremental 4]l

[Cohen ,
ed Incremental Pruning
Produce Error Reduction ,

JRIP

lREP* [Cohen ,

IREP*(D);
2: i = 0;
3: repeat

5: Di =
= lREP*(Di );
7:
8: i =
9: until
354


rule)j


rule).

et a l., 2012].
15 .4 355

1) 6) 10) 14)
16) 17) 1) 6)

16) 17) 14) 16)


6) 7) 10) 14)

7) 10) 14) 15)


1) 6) 7)

7) 10) 15) 16)


7) 14) 16) 17)

14) 16) 17) 16)


7) 10) 15)

10) 16) 10) 16)


6) 7) 10) 15)

6) 7) 10) 15)
10) 14) 15) 16)

14) 15) 16) 17)


1) 2) 3) 6)

2) 3) 6) 7)

(relational

(background

.)
356

Y) .

FOIL (Fírst-Order Inductive Learner) [Quinlan ,

Y) ,
15.5 357

(FOIL

F_Gain (15.3)
\ -_ m+ +m_ _

X (log 2 - 1og 2

6)" .

Logic

P(f (X)) ,
358

(Least General
LGG) [Plotkin , 1970].

t) =

LGG(s , t) ==

Y) Y)

Y);
15.5 359

Y) Y) Y).

Y) Y). (15 .4)

10)

, (15.5)

LGG( lO, Y)

and Dze-
roski , 1993]

(Relative Least
General Generalization) [Plotkin ,
360

A.

principle)
(inverse
resolution) [Muggleton and Buntine ,

= L1 = -, L 2 , C1 = A V L , C2 = B V
C=AV

(Av =A, (15.6)

V , (15.7)

C 1 . C2 (15.8)

v AVB

C2 = V {-, L}. (15.9)


15.5 361

[Muggleton ,

q •A
:
q •A (15.10)

p • p • A^q
A^B
:
q •B
• A^q p
(15.11)

p • A^C (15.12)
q• B q• C
p • A^B (15.13)
• A q • r^C
T

"C' = CÐ

= {X/Y}.

= BÐ

ò (most
general
= {1/ X} , Ð2 = {1/ X , 2/Y} ,
Ð3

= AV = Bv = ...., L2Ð,
362

C= V . (15.14)

= (resol ution quotient) ,

(15.15)
(C = AV

(--, L101)021 =

{--, L10 t})Ø 21. (15.16)

Y) Y).

,
15.6 363

"-, q(M, N)" . N) •?"

{ljM , XjN} , O2 =

(symbolism
[Michalski , 1983]. [Fürnkranz et

and

PRISM [Cendrowska ,

CN2 [Clark and Niblett ,


[Fürnkranz ,
RIPPER [Cohen ,
364

(Never-El1ding Language
et

[Muggleton ,
GOLEM [Muggleton and

PROGOL [Muggleton ,

and Lin ,

[Muggleton , [Srinivasan ,
Datalog [Ceri et al.,

1992; Lavrac and Dzeroski ,

and Muggleton ,

(probabilistic
ILP) [De Raedt et
(relational Bayesian network) [Jaeger ,

et al.,
Logic Program) al., 2000
logic network) [Richardson and Domingos ,
(statistical relationallearning) [Getoor and
365

15.1

15.2

15.3

15.4

15.5

15.6 Y)" .

15.7

15.8

15.9* t2 , … ,

15.1 0*
366

Bratko , I. and S. Muggleton. (1995). "Applications of inductive logic program-


ming." of the ACM, 38(11):65-70.
C. A. and M. J. pazzani. (1991). "An investigation of noise-tolerant re-
lational concept learning algorithms." In Proceedings ofthe 8th
Workshop on Machine Learning (1WML) , 389-393 , Evanston , IL.
Carlson , A. , J. Betteridge , B. Kisiel , B. Settles , E. R. Hruschka , and T. M.
Mitchell. (2010). "Toward an architecture for learn-
ing." In Proceedings of the 24th AAA1 Conference on Artificial 1ntelligence
1306-1313 , Atlanta , GA.
Cendrowska , J. (1987). "PRISM: An algorithm for inducing modular
1nternational Journal of Studies , 27(4):349-370.
Ceri , S. , G. Gottlob , and L. Tanca. (1989). "What you always wal1ted to kl10W
about Datalog (and l1ever dared 1EEE on K
1(1):146-166
Clark , P. and T. Niblett. (1989). "The CN2 induction algorithm." Machine
3(4):261-283.
Cohen , W. W. (1995). "Fast effective rule induction." In Proceedings ofthe 12th
1nternational Conference on Machine 115-123 , Tahoe ,
CA.
De Raedt , L. , P. Frasco l1i , K. and S. eds. (2008). Prob-
1nductive Logic Programming: Applications. Springer ,
Berlin.
L. Getoor , D. Koller , and A Pfeffer. (1999). "Learning prob-
abilistic relatìonal models." In Proceedings of the 16th 1nternational Joint
Conference on 1300-1307, Stockholm , Swe-
den.
J. (1994). "Top-down prul1Íng in In Proceed-
ings 0 f the 11 th Conference on Artificial
457 , Amsterdam , The Netherlands.
Fürnkranz , J. , D. Gamberger , and N. Lavrac. (2012). of Rule
Learning. Springer , Berlin.
367

J. and G. Widmer. (1994). "Incremental reduced error pruning."


In Proceedings of the 11th Conference on Machine
New Brunswick, NJ.
Getoor , L. and B. Taskar. (2007). Introduction to Relational Learn-
ing. MIT Press , Cambridge, MA.
Jaeger , M. (2002). "Relational Bayesian networks: A survey." Electronic Trans-
on Artificial Intelligence , 6:Article 15.
L. and S. Kramer. (2000). logic
programs." In of the AAA I'2000 W orkshop on Statis-.
Models from Relational 29-35 , Austin , TX.
N. and S. Dzeroski. (1993). Progmmming: Techniques
Ellis Horwood , New York , NY.
S. (1969). "On the quasi-minimal solution of the general covering
problem." In Proceedings of the 5th Symposium on
Processing (FCIP) , volume A3 , 128 , Yugoslavia.
Michalski , R. S. and methodology of inductive In
Machine An Intelligence (R. S. Michalski , J.
Carbonell, and T. Mitchell, eds.) , Tioga , Palo Alto , CA.
Michalski , R. S. , I. Mozetic , J. Hong , and N. Lavrac. (1986). "The multi-purpose
incrementallearning system AQ15 and its testing application to three med-
ical domains." In Proceedings of the 5th National Conference on
Intelligence 1041 1045 , Philadelphia, PA.

Muggleton , S. (1991). "Inductive logic programming."

Muggleton , S. , ed. (1992). Inductive Logic


don , UK.
Muggleton , S. (1995). "Inverse entailment and Progol." New Generation Com-
putíng, 13(3-4):245-286.
Muggleton , S. and W. Buntine. (1988). "Machine invention offirst order predi-
cates by inverting resolution." In Proceedings of the 5th International Work-
shop on Machine Ann Arbor , MI.
Muggleton , S. and C. (1990). "Efficient induction of logic programs."
368

In Proceedings of the 1st International Workshop on Algorithmic Learning


Theory Tokyo , Japan.
Mugg1eton , S. and D. Lin. 1earning of higher-order
dyadic data1og: Predicate invention revisited." In Proceedings ofthe 23rd In-
Joint Conference on Intelligence
Beijing , China.
P1otkin , G. D. on inductive
ligence 5 (B. D. 153-165 , Edinburgh U niversity
Press , Edinburgh , Scot1and.
P1otkin , G. D. note on inductive generalization." In
chine 6 (B. Me1tzer and D. Mitchie , eds.) , 107-124 , Edinburgh
University Press , Edinburgh , Scotland.
J. R.
5(3):239-266.
Ri chardson, M. and P. Domingos. (2006). "Markov logic networks." Machine
Learning, 62(1-2):107 136.

machin
Journal of the ACM, 12(1):23-41.
Srinivasan,A.(1999). "The A1eph manual."
mach1earnj A1ephj a1eph.html.
Winston , P. H. (1970). structural descriptions from Ph.D.
Department Engineering , MIT , Cambridge , MA.
Wnek , J. and R. S. Micha1ski. (1994). "Hypothesis-driven constructive induc-
tion in AQI7-HCI: A method and experiments." Machine Learning , 2(14):
139-168.
369

S. Michalski , 1937- 2007) .

G.

achine
(reinforcement learning).

Decision

= (X , A , P, R) ,
372

p=l
r= -100

a) = 1.
16.2 373

tll

16.2

armed

(exploration-
374


(Exploration-
Exploitation

Suppose arm k has been pulled n times, yielding rewards v₁, v₂, ..., v_n. The average reward is

Q(k) = (1/n) Σ_{i=1}^n v_i .    (16.1)

Rather than storing all n rewards, the average can be updated incrementally. Writing Q_n(k) for the average after the n-th pull,

Q_n(k) = (1/n) ( (n − 1) × Q_{n−1}(k) + v_n )    (16.2)
       = Q_{n−1}(k) + (1/n) ( v_n − Q_{n−1}(k) ) ,    (16.3)

so only the running average and the pull count need to be kept.

1: r = 0;
2: ∀ i = 1, 2, ..., K: Q(i) = 0, count(i) = 0;
3: for t = 1, 2, ..., T do
4:   if rand() < ε then
5:     k = an arm chosen uniformly at random from {1, 2, ..., K}
6:   else
7:     k = argmax_i Q(i)
8:   end if
9:   v = R(k);
10:  r = r + v;
11:  Q(k) = ( Q(k) × count(k) + v ) / ( count(k) + 1 );
12:  count(k) = count(k) + 1
13: end for
Output: cumulative reward r

= l/vt.
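The ε-greedy procedure with the incremental update (16.3) fits in a few lines of Python; the arm reward distributions (Gaussian with the stated means) and the value of ε below are illustrative assumptions.

import numpy as np

def epsilon_greedy(arm_means, T=3000, eps=0.1, seed=0):
    """K-armed bandit with epsilon-greedy exploration and incremental value updates."""
    rng = np.random.default_rng(seed)
    K = len(arm_means)
    Q, count, total = np.zeros(K), np.zeros(K), 0.0
    for _ in range(T):
        k = rng.integers(K) if rng.random() < eps else int(Q.argmax())
        v = rng.normal(arm_means[k], 1.0)            # stochastic reward of the chosen arm
        total += v
        count[k] += 1
        Q[k] += (v - Q[k]) / count[k]                # incremental average, eq (16.3)
    return total, Q

total, Q = epsilon_greedy(arm_means=[0.2, 0.5, 0.8, 0.4])   # assumed true means
print(round(total, 1), Q.round(2))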

16.2.3 Softmax
376

" T
(16 .4)

1: r = 0;
2: Vi = 1, 2,... K: Q(i) = 0 , count(i) = 0;
3: for t = 1, 2,..., T do
4:
5: v = R(k);

Softmax
(T =
16.3 377

[Figure: average reward curves of the ε-greedy and Softmax (τ = 0.01) strategies on a K-armed bandit over 3000 pulls; the vertical axis spans roughly 0.25-0.40.]
E =

(state value function) ,


(state-action value
378

(16.5)
I Xo

2: ;=1 rt I Xo = (16.6)
a) = I Xo = x , aO

IF--4
Z
z
R
•z + Z --au

L P:--t x' (16.8)


16.3 379

(X , A , P , R);

1: V(x) = 0;
2: for t = 1, 2, . . . do
3: V'(x) = 2:: x'EX P:• x'
4: if then
5: break
6: else
7: V = V'
8: end if
9: end for

- V'(x)1 < B. (16.9)

;
(16.10)
.
x'EX

arg max 2.::: (16.11)


7r
380

V*(x) = (16.12)

(16.13)
V; (x) =
lb<::......
L:
x'EX
P:-+ x'

. (16.14)

a' t:.ti (16.15)


a) = L: a' t:.ti

íT '(x))

)
16.3 381

(16.16)

= a) , (16.17)

(policy iteration).

(X , A , P , R);

1: V(x) =
2: loop
3: for t = 1, 2,... do
4: : V'(x) = a) I: x/EX
5: if t = T + 1 then
6: break
7: else
8: V = V'
9: end if
10: end for
11: = argmaxaEA Q(x , a);
12: if \/x then
13: break
14: else

16: end if
17: end loop
382

th)= …= maxaEA
2: x' EX X-+X'

PZ• X'
\"1 '--X-+X" T .

.
'1' (16.18)

= (X , A , P , R);

1: V(x) = 0;
2: for t = 1, 2, ' , • do
3: : V'(x) = L: x' EX P:-+ x ' (t
4: if maxxEX lV (x) - V'(x)1 < () then
5: break
6: else
7: V=V'
8: end if
9: end for
= argmaxaEA

V'(x) . (16.19)
16.4 383

(model-free

< T1 , T2 , . . . , TT , XT >,
384

1 7T (X) , E;
7T E (X) = < '" - - (16.20)

;;;:::

(on-policy

1: Q(x , a) = =
2: for s = 1, 2 , . .. do
3:
< >;
4: fort=O , 1,..., T-1 do
5: R= T=-t ri;
6: =
=
8:
9:
I argmaxa ,
a) == < ::, ,-J: _:_

10: end for


16.4 385

E[fl

E[fl = I
r I_'P(X)
Jx " n q(x) f(x)dx . (16.23)

(16.24)
q(xD

(16.25)

Q(x , a) (16.26)

p 7r = rr
T-l
(16.27)
386

(16.28)

7T'

1: Q(x , a) = 0 ,
2: for s = 1, 2, . . . do
3:
< r l, >j

1___ E+ ;
.0 1 E/IAI ,
5: for t = do
6: R = ;
7: Q(Xt , at) =
8:
for
= Q(x , a')
11: end for

(Temporal
16 .4 387

= t 2::;=1

Qf+l (x ,a) = Qf( x, a)

- Qt( X,
(rt +l -

Q1r (x , a) (X'))
x'EX

a')Q 1r (x' , a')). (16.30)


x'EX

Qf+l = - , (16.31)

and Dayan ,
388

1: Q(X , a)
= 0 , 1l" (x , a)
2: x =
3: for t = 1 , 2 , . .. do
4: r , x'
5: a'
6: Q(x , a) = Q(x , a) a') - Q(x , a));
= argmaxa" ;
8:
9: end for

1: Q(x, a) = 0, π(x, a) = 1/|A|;
2: x = x₀;
3: for t = 1, 2, ... do
4:   r, x' = reward and next state obtained by executing the ε-greedy action a = π^ε(x);
5:   a' = π(x');
6:   Q(x, a) = Q(x, a) + α ( r + γ max_{a''} Q(x', a'') − Q(x, a) );
7:   π(x) = argmax_{a''} Q(x, a'');
8:   x = x'
9: end for
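A tabular Q-learning sketch on a small assumed chain MDP (five states, move left/right, reward only at the rightmost state); the behavior policy is ε-greedy and the update uses the max over next-state actions, as in line 6 above. The environment and hyperparameters are illustrative, not from the book.

import numpy as np

def q_learning(n_states, n_actions, step, episodes=500,
               alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning: behave epsilon-greedily, update toward r + gamma * max_a' Q(x', a')."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        x = 0                                                   # assumed start state
        for _ in range(50):                                     # episode length cap
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[x].argmax())
            r, x_next, done = step(x, a)
            Q[x, a] += alpha * (r + gamma * (0 if done else Q[x_next].max()) - Q[x, a])
            x = x_next
            if done:
                break
    return Q

# illustrative chain MDP: action 1 moves right, action 0 moves left, +1 at the last state
def step(x, a):
    x_next = min(max(x + (1 if a == 1 else -1), 0), 4)
    return (1.0 if x_next == 4 else 0.0), x_next, x_next == 4

Q = q_learning(n_states=5, n_actions=2, step=step)
print(Q.argmax(axis=1))        # greedy action per state (1 = move right for non-terminal states)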

(tabular value
16.5 389

=
et al., 2010]

(16.32)

function

.(16.33)

lr.' l
2 I

= (16.34)

x . (16.35)

X , (16.36)
390

(0; . . . ; 1; . . . ;

1: (J = 0;
2: = arg ;
3: for t = 1, 2, . . . do
4: r , æ'
5: a' =

7: 7l" (æ) = a");


8: æ=
9: end for

(apprenticeship learning) ,
(Iearning
from demonstration) ,
(Iearning by

(imitation "learning).
16.6 391

D= ,…, ni' ni)} ,

reinforce-
ment learning) [Abbeel and Ng , 2004].

=E
392

(16.37)

X11") ?: 0 . (16.38)

(16.39)
"

S.t. 1

2: 1['

3: for t = do
4:
5: S.t. :S; 1;
6: 1[' =
7: end for
16.7 393

and Barto , 1998]. [Gosavi , 2003]


[Whiteson ,
[Mausam and Kolobov ,
[Sigaud and Buffet ,
Observable
et a l., 2010].

[Kaelbling et a l., [Kober et


Deisenroth et
[Kuleshov and Precup , and Mohri ,

and Ftistedt , (online


[Bubeck and Cesa-Bianchi ,
(regret

[Sutton ,
p.22.

and Dayan ,
and Niranjan , 1994].
et a l.,
and Scherrer , [Dann et

1992; Price and Boutili-


er , et a l., 2009]. [Abbeel and Ng , 2004;
Langford and

(approximate dynamic
394

16.1 (Upper Confidence


+

Q(k)

16.2

16.3

16.4

16.5 ).

16.6

16.7

16.8

16.9

16.10*
395

Abbeel , P. and A. Y. Ng. inverse rein-


forcement learning." In Proceedings of the 21st International Conference on
Banff, Canada.
Argall , B. D. , S. Chernova , M. Veloso , and B. Browning. (2009). "A survey
of robot learning from demonstration." A utonomous Systems ,
57(5):469-483.
and B. Fr istedt. (1985). Problems. Chapman & HalljCRC ,
London , U K.
Bertsekas , D. P. (2012). Dynamic Optimal Control: Approx-
Programming , 4th edition. Athena Scientific ,
S. and N. Cesa-Bianchi. (2012). "Regret analysis of stochastic and
nonstochastic mu1ti-armed bandit problems." Trends in
Learning, 5(1):1-122.
R. Babuska , B. De Schutter , and D. Ernst. (2010). Reinforcement
Programming Using Punction Approximators. Chap-
Press , Boca Raton , FL.
C. , G. Neumann , and J. Peters. (2014). "Policy eva1uation with tem-
poral differences: A survey and comparison." Journal of Machine Learning

Deisenroth , M. P. , G. Neumann , and J. Peters. on policy


search for robotics." and Trends in Robotics, 2(1-2):1-142.
Geist , M. and B. Scherrer. (2014). "Off-policy eligibility traces:
A survey." of Machine Research, 15:289-333.
Gosavi , A. (2003). Optimization: Optimization
Reinforcement Kluwer , Norwell , MA.
M. L. Littman , and A. W. Moore. (1996). "Reinforcement
A survey." Journal of Artificial Intelligence Research, 4:237-285.
Kober , J. , J. A. Bagnell , and J. Peters.
robotics: A survey." International on Robotics Research , 32(11):

Ku1eshov , V. and D. "Algorithms for the multi-armed bandit


396

problem." Journal of Machine Learning 1:1-48.


Langford , J. and B. (2005). "Relating reinforcement
mance to performance." In Proceedings of the 22nd
tional Conference on Machine 473-480 , Bonn, Germany.
Lin , L.-J. (1992). agents based on reinforcement learn-
ing , and Machine 8(3-4):293-32 1.
Mausam and A. Kolobov. (2012). with Decision Processes:
An AI Perspective. Claypool, San Rafael , CA.
Price , B. and C. Boutilier. (2003). "Accelerating reinforcement
through implicit imitation." Journal of Artificial Intelligence Research, 19:

Rummery, G. A. and M. (1994). "On-line Q-learning using connec-


tionist systems." Technical Report CUEDjF-INFENGjTR 166 , Engineering
Department , Cambridge University, Cambridge, U K.
O. and O. Buffet. (2010). Decision Processes in
Hoboken , NJ.
Sutton , R. S. (1988). predict by the methods of temporal differ-
ences." Machine Learning, 3(1):9-44.
Sutton, R. S. and A. G. Barto. (1998). Reinforcement An Introduc-
tion. MIT Press , Cambridge, MA.
Tesauro , G. (1995). "Temporal difference TD-Gammon." Com-
of the ACM, 38(3):58-68.
Ueno , T. , S. Maeda , M. Kawanabe , and S. Ishii. (2011). "Generalized TD learn-
ing." Journal of Machine Learning 12:1977-2020.
Vermorel , J. and M. Mohri. (2005). "Multi-armed bandit algorithms and empir-
ical evaluation." In Proceedings of the 16th Conference on Machine
437-448 , Porto, Portugal
C. J. C. H. and P. Dayan. (1992). "Q-learning." Machine Learning, 8
(3-4):279-292.
Whiteson , S. (2010). for Reinforcement Learning.
Springer , Berlin.
397

Andreyevich
=
transpose A T, (AT)ij =

(A.l)
(AB)T = B TA T . (A.2)

= A- 1A =

(AT )-1 = (A- 1 )T , (A.3)


=B lA- 1 .

(A .4)

tr(A T) = tr(A) , (A.5)


tr(A + B) = tr(A) + tr(B) , (A.6)
tr(AB) = tr(BA) , (A.7)
tr(ABC) = tr(BCA) = tr(CAB) . (A.8)

det(A) = L • • • (A.9)
uESn
400

det(I) =
AA AA
--n
d 4lu A - d 4EU
=A A A n4 A
1i

,
G

det(cA) = cn det(A) , (A. lO)


det(A T) = det(A) , (A.ll)
= det(A) det(B) , (A.12)
= det(A)-l , (A.13)
det(A n) = det(A)n . (A.14)

(A.15)

i
(A.16)

(A.17)
i

(A.18)
401

(A.19)

(A.20)

(\1 2 f (A.21)

rule)

(A.22)

(A.23)

. 1
(A.24)
ax ax

T>

, (A.25)

r> T
(A.26)

T>

-, (A.27)

-, (A.28)

. (A.29)
åA
402

(A.30)

(chain
= g (h

(A.31)

- b)
- byW(Ax - b) 2W(Ax- b)

= 2AW(Ax - b) . (A.32)

A=U :E yT (A.33)

matrix);

Value
vector) ,

matrix ap-

k:::;;

mm IIA-AIIF (A.34)
AE lRmxn

s.t. rank(A) =k .
403

Ak = Uk :E kV f' (A.35)

=
=
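Numerically, the best rank-k approximation of (A.34)/(A.35) is obtained by truncating the SVD; a short NumPy check with an illustrative random matrix:

import numpy as np

def best_rank_k(A, k):
    """Best rank-k approximation of A in Frobenius norm: keep the k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.random.rand(6, 4)
A2 = best_rank_k(A, k=2)
print(np.linalg.matrix_rank(A2), round(np.linalg.norm(A - A2), 3))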

'V =0, (B.1)

= , (B.2)

=
404

=
=
< 0.

g(x) < = <


=
g(x ) =

= =

l MZKO (B.3)

=0.

J(x) (B.4)

S.t . hi (X) = 0 (i = ,
(j = .
405

L Z = FJ Z h g + Z B

… (B.6)

(primal
(dual
(dual function) r : lR.m x

(B.7)

:::;; 0 , (B.8)

=
æ EollJ}
. (B.9)

:::;; p* , (B. I0)


406

mJ pi nu
CO
(B.ll)

(dual

(weak
= (strong

(B.12)
æ z

s.t. Ax

Q E

pro-
B 407

c.x= 2:2: CijXij , (B.13)

(i =

min c.x (B.14)


X

s.t. Ai. X = bi , i = 1, 2, .

X>- O.

f(X t+ l) < f(x t ) , t = 0, 1, 2 ,. (B.15)

f(x + + ð. xTVf(x) , (B.16)


408

+ Ll x) <

Ll x , (B.17)

method).

Coordinate descent minimizes f(x) by optimizing over one coordinate at a time while holding the others fixed; the analogous method for maximization is called coordinate ascent. Writing x = (x_1; x_2; …; x_d), the method starts from an initial point x^0 and produces iterates x^0, x^1, x^2, … by cyclically solving one-dimensional subproblems,

x_i^{t+1} = arg min_y f(x_1^{t+1}, …, x_{i−1}^{t+1}, y, x_{i+1}^t, …, x_d^t) ,   (B.18)

which guarantees a non-increasing sequence of objective values,

f(x^0) ≥ f(x^1) ≥ f(x^2) ≥ …   (B.19)

The method requires no gradient information and is attractive when each coordinate-wise subproblem is easy to solve. For smooth objectives the iterates x^0, x^1, x^2, … converge, under suitable conditions, to a stationary point; for non-differentiable objectives, however, the method may stop at a point that is not a minimum.
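The same quadratic objective used above also gives a compact illustration of the update (B.18), since each one-dimensional subproblem has a closed-form solution:

    import numpy as np

    # Coordinate descent on f(x) = 1/2 x^T Q x - b^T x, with exact
    # one-dimensional minimization for each coordinate, as in (B.18).
    Q = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
    b = np.array([1.0, -1.0])
    x = np.zeros(2)

    for t in range(100):
        for i in range(len(x)):
            # Setting d f / d x_i = 0 with the other coordinates fixed gives
            # x_i = (b_i - sum_{j != i} Q_ij x_j) / Q_ii.
            x[i] = (b[i] - Q[i] @ x + Q[i, i] * x[i]) / Q[i, i]

    print(x)                       # approaches the minimizer Q^{-1} b
    print(np.linalg.solve(Q, b))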

Appendix C Probability Distributions

Uniform distribution. The uniform distribution on an interval [a, b] (a < b) has a constant density over the interval:

p(x | a, b) = U(x | a, b) = 1 / (b − a) ,   x ∈ [a, b] ,   (C.1)
E[x] = (a + b) / 2 ,   (C.2)
var[x] = (b − a)² / 12 .   (C.3)

Bernoulli distribution. Named after Jacob Bernoulli, the Bernoulli distribution describes a binary variable x ∈ {0, 1} that takes the value 1 with probability μ ∈ [0, 1]:

P(x | μ) = Bern(x | μ) = μ^x (1 − μ)^{1−x} ,   (C.4)
E[x] = μ ,   (C.5)
var[x] = μ (1 − μ) .   (C.6)

Binomial distribution. The binomial distribution describes the number m of successes in N independent Bernoulli trials with success probability μ:

P(m | N, μ) = Bin(m | N, μ) = (N choose m) μ^m (1 − μ)^{N−m} ,   (C.7)
E[m] = N μ ,   (C.8)
var[m] = N μ (1 − μ) ,   (C.9)

where (N choose m) = N! / (m! (N − m)!). For N = 1 the binomial distribution reduces to the Bernoulli distribution.

Categorical and multinomial distributions. A discrete variable with d mutually exclusive states can be represented by a d-dimensional one-hot vector x = (x_1; x_2; …; x_d) with x_i ∈ {0, 1} and Σ_{i=1}^{d} x_i = 1. With parameters μ = (μ_1; μ_2; …; μ_d), μ_i ≥ 0, Σ_i μ_i = 1,

P(x | μ) = Π_{i=1}^{d} μ_i^{x_i} ,   (C.10)
E[x_i] = μ_i ,   (C.11)
var[x_i] = μ_i (1 − μ_i) ,   (C.12)
cov[x_j, x_i] = 𝕀[j = i] μ_i − μ_j μ_i ,   (C.13)

where 𝕀[·] is the indicator function. The multinomial distribution extends the binomial distribution to d outcomes: it gives the probability of observing counts m_1, m_2, …, m_d (with Σ_i m_i = N) in N independent trials,

P(m_1, m_2, …, m_d | N, μ) = Mult(m_1, m_2, …, m_d | N, μ)
    = ( N! / (m_1! m_2! ⋯ m_d!) ) Π_{i=1}^{d} μ_i^{m_i} ,   (C.14)
E[m_i] = N μ_i ,   (C.15)
var[m_i] = N μ_i (1 − μ_i) ,   (C.16)
cov[m_j, m_i] = −N μ_j μ_i   (j ≠ i) .   (C.17)

Beta distribution. The Beta distribution is a distribution over x ∈ [0, 1] with parameters a > 0 and b > 0:

[Figure: probability density of the Beta distribution for several values of (a, b).]

p(x | a, b) = Beta(x | a, b) = ( Γ(a + b) / (Γ(a) Γ(b)) ) x^{a−1} (1 − x)^{b−1}
            = ( 1 / B(a, b) ) x^{a−1} (1 − x)^{b−1} ,   (C.18)
E[x] = a / (a + b) ,   (C.19)
var[x] = a b / ( (a + b)² (a + b + 1) ) ,   (C.20)

where Γ(·) is the Gamma function

Γ(a) = ∫_0^∞ t^{a−1} e^{−t} dt ,   (C.21)

and B(a, b) is the Beta function

B(a, b) = Γ(a) Γ(b) / Γ(a + b) .   (C.22)

Dirichlet distribution. The Dirichlet distribution is a distribution over d-dimensional vectors μ = (μ_1; μ_2; …; μ_d) with μ_i ≥ 0 and Σ_{i=1}^{d} μ_i = 1. With parameters α = (α_1; α_2; …; α_d), α_i > 0, and α_0 = Σ_{i=1}^{d} α_i,

p(μ | α) = Dir(μ | α) = ( Γ(α_0) / (Γ(α_1) ⋯ Γ(α_d)) ) Π_{i=1}^{d} μ_i^{α_i − 1} ,   (C.23)
E[μ_i] = α_i / α_0 ,   (C.24)
var[μ_i] = α_i (α_0 − α_i) / ( α_0² (α_0 + 1) ) ,   (C.25)
cov[μ_j, μ_i] = −α_j α_i / ( α_0² (α_0 + 1) )   (j ≠ i) .   (C.26)

For d = 2 the Dirichlet distribution reduces to the Beta distribution.

Gaussian distribution. The univariate Gaussian (normal) distribution over x ∈ R has mean μ and variance σ²:

p(x | μ, σ²) = N(x | μ, σ²) = ( 1 / (√(2π) σ) ) exp( −(x − μ)² / (2σ²) ) ,   (C.27)
E[x] = μ ,   (C.28)
var[x] = σ² .   (C.29)

For a d-dimensional vector x, the multivariate Gaussian distribution with mean vector μ ∈ R^d and symmetric positive definite covariance matrix Σ ∈ R^{d×d} is

p(x | μ, Σ) = N(x | μ, Σ) = ( 1 / ( (2π)^{d/2} |Σ|^{1/2} ) ) exp( −½ (x − μ)^T Σ^{-1} (x − μ) ) ,   (C.30)
E[x] = μ ,   (C.31)
cov[x] = Σ .   (C.32)

Conjugate distributions. Suppose a variable x follows a distribution P(x | θ) with parameter θ, and let X = {x_1, x_2, …, x_m} be an observed sample. If a prior distribution p(θ) is such that the posterior p(θ | X) ∝ P(X | θ) p(θ) belongs to the same family as p(θ), then the prior and the likelihood are said to be conjugate, and p(θ) is a conjugate prior for P(x | θ).

For example, let x be a Bernoulli variable with parameter μ, take the Beta prior p(μ | a, b) = Beta(μ | a, b), and let x̄ be the mean of the m observations in X. The posterior is then

p(μ | X, a, b) ∝ P(X | μ) p(μ | a, b) ∝ μ^{m x̄ + a − 1} (1 − μ)^{m − m x̄ + b − 1} ,   (C.33)

which is again a Beta distribution, with updated parameters a' = a + m x̄ and b' = b + m − m x̄. Conjugacy thus lets the posterior be obtained simply by updating the parameters of the prior with statistics of the data, which is particularly convenient for incremental (on-line) Bayesian updating.
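For instance, starting from the uniform prior Beta(μ | 1, 1) and observing m = 10 outcomes of which 7 equal 1 (so m x̄ = 7), the posterior is Beta(μ | 8, 4): its mean 8/12 ≈ 0.67 sits between the maximum likelihood estimate 0.7 and the prior mean 0.5.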

KL divergence. The Kullback-Leibler (KL) divergence measures the difference between two probability distributions P and Q. For continuous distributions with densities p(x) and q(x),

KL(P ‖ Q) = ∫ p(x) ln ( p(x) / q(x) ) dx .   (C.34)

It is nonnegative, with KL(P ‖ Q) = 0 if and only if P = Q, but it is not symmetric: in general KL(P ‖ Q) ≠ KL(Q ‖ P), so it is not a distance metric. Expanding (C.34) gives

KL(P ‖ Q) = ∫ p(x) ln p(x) dx − ∫ p(x) ln q(x) dx   (C.35)
          = −H(P) + H(P, Q) ,   (C.36)

where H(P) is the entropy of P and H(P, Q) is the cross entropy of P and Q.
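As a concrete instance of (C.34): for two univariate Gaussians P = N(μ_1, σ_1²) and Q = N(μ_2, σ_2²), the integral has the closed form

KL(P ‖ Q) = ln(σ_2 / σ_1) + ( σ_1² + (μ_1 − μ_2)² ) / (2 σ_2²) − 1/2 ,

which is zero exactly when μ_1 = μ_2 and σ_1 = σ_2, and is clearly not symmetric in P and Q.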
Index

AdaBoost, 173
Bagging, 178
Boosting, 173, 190
ECOC, 64
F1, 32
ILP, 357, 364
LASSO, 252, 261
LVQ, 204, 218
MDP, 371
MvM, 63
OvO, 63
OvR, 63
PCA, 229
ReLU, 114
RIPPER, 353
RKHS, 128
S3VM, 298
Softmax, 375
SVM, 123
WEKA, 16
Main symbols

sup(·)    supremum
𝕀(·)     indicator function: 1 if the argument is true, 0 otherwise
sign(·)   sign function: takes the values −1, 0, 1 when the argument is negative, zero, or positive
Contents

Chapter 1 Introduction ..... 1
  1.1 Preamble ..... 1
  1.2 Basic Terminology ..... 2
  1.3 Hypothesis Space ..... 4
  1.4 Inductive Bias ..... 6
  1.5 Brief History ..... 10
  1.6 Current State of Applications ..... 13
  1.7 Further Reading ..... 16
  Exercises
  References ..... 20
  Break Time ..... 22

Chapter 2 Model Evaluation and Selection ..... 23
  2.1 Empirical Error and Overfitting ..... 23
  2.2 Evaluation Methods ..... 24
  2.3 Performance Measures ..... 28
  2.4 Comparison Tests ..... 37
  2.5 Bias and Variance ..... 44
  2.6 Further Reading ..... 46
  Exercises ..... 48
  References ..... 49
  Break Time ..... 51

Chapter 3 Linear Models ..... 53
  3.1 Basic Form ..... 53
  3.2 Linear Regression ..... 53
  3.3 Logistic Regression ..... 57
  3.4 Linear Discriminant Analysis ..... 60
  3.5 Multi-class Classification ..... 63
  3.6 Class Imbalance ..... 66
  3.7 Further Reading ..... 67
  Exercises ..... 69
  References ..... 70
  Break Time ..... 72

Chapter 4 Decision Trees ..... 73
  4.1 Basic Process ..... 73
  4.2 Split Selection ..... 75
  4.3 Pruning ..... 79
  4.4 Continuous and Missing Values ..... 83
  4.5 Multivariate Decision Trees ..... 88
  4.6 Further Reading ..... 92
  Exercises ..... 93
  References ..... 94
  Break Time ..... 95

Chapter 5 Neural Networks ..... 97
  5.1 Neuron Model ..... 97
  5.2 Perceptrons and Multi-layer Networks ..... 98
  5.3 Error Backpropagation ..... 101
  5.4 Global Minima and Local Minima ..... 106
  5.5 Other Common Neural Networks ..... 108
  5.6 Deep Learning ..... 113
  5.7 Further Reading ..... 115
  Exercises ..... 116
  References ..... 117
  Break Time ..... 120

Chapter 6 Support Vector Machines ..... 121
  6.1 Margin and Support Vectors ..... 121
  6.2 Dual Problem ..... 123
  6.3 Kernel Functions ..... 126
  6.4 Soft Margin and Regularization ..... 129
  6.5 Support Vector Regression ..... 133
  6.6 Kernel Methods ..... 137
  6.7 Further Reading ..... 139
  Exercises ..... 141
  References ..... 142
  Break Time ..... 145

Chapter 7 Bayesian Classifiers ..... 147
  7.1 Bayesian Decision Theory ..... 147
  7.2 Maximum Likelihood Estimation ..... 149
  7.3 Naive Bayes Classifiers ..... 150
  7.4 Semi-naive Bayes Classifiers ..... 154
  7.5 Bayesian Networks ..... 156
  7.6 The EM Algorithm ..... 162
  7.7 Further Reading ..... 164
  Exercises ..... 166
  References ..... 167
  Break Time ..... 169

Chapter 8 Ensemble Learning ..... 171
  8.1 Individual Learners and Ensembles ..... 171
  8.2 Boosting ..... 173
  8.3 Bagging and Random Forest ..... 178
  8.4 Combination Strategies ..... 181
  8.5 Diversity ..... 185
  8.6 Further Reading ..... 190
  Exercises ..... 192
  References ..... 193
  Break Time ..... 196

Chapter 9 Clustering ..... 197
  9.1 The Clustering Task ..... 197
  9.2 Performance Measures ..... 197
  9.3 Distance Measures ..... 199
  9.4 Prototype-Based Clustering ..... 202
  9.5 Density-Based Clustering ..... 211
  9.6 Hierarchical Clustering ..... 214
  9.7 Further Reading ..... 217
  Exercises ..... 220
  References ..... 221
  Break Time ..... 224

Chapter 10 Dimensionality Reduction and Metric Learning ..... 225
  10.1 k-Nearest Neighbor Learning ..... 225
  10.2 Low-Dimensional Embedding ..... 226
  10.3 Principal Component Analysis ..... 229
  10.4 Kernelized Linear Dimensionality Reduction ..... 232
  10.5 Manifold Learning ..... 234
  10.6 Metric Learning ..... 237
  10.7 Further Reading ..... 240
  Exercises ..... 242
  References ..... 243
  Break Time ..... 246

Chapter 11 Feature Selection and Sparse Learning ..... 247
  11.1 Subset Search and Evaluation ..... 247
  11.2 Filter Methods ..... 249
  11.3 Wrapper Methods ..... 250
  11.4 Embedded Methods and L1 Regularization ..... 252
  11.5 Sparse Representation and Dictionary Learning ..... 254
  11.6 Compressed Sensing ..... 257
  11.7 Further Reading ..... 260
  Exercises ..... 262
  References ..... 263
  Break Time ..... 266

Chapter 12 Computational Learning Theory ..... 267
  12.1 Basics ..... 267
  12.2 PAC Learning ..... 268
  12.3 Finite Hypothesis Spaces ..... 270
  12.4 VC Dimension ..... 273
  12.5 Rademacher Complexity ..... 279
  12.6 Stability ..... 284
  12.7 Further Reading ..... 287
  Exercises ..... 289
  References ..... 290
  Break Time ..... 292

Chapter 13 Semi-Supervised Learning ..... 293
  13.1 Unlabeled Samples ..... 293
  13.2 Generative Methods ..... 295
  13.3 Semi-Supervised SVM ..... 298
  13.4 Graph-Based Semi-Supervised Learning ..... 300
  13.5 Disagreement-Based Methods ..... 304
  13.6 Semi-Supervised Clustering ..... 307
  13.7 Further Reading ..... 311
  Exercises ..... 313
  References ..... 314
  Break Time ..... 317

Chapter 14 Probabilistic Graphical Models ..... 319
  14.1 Hidden Markov Models ..... 319
  14.2 Markov Random Fields ..... 322
  14.3 Conditional Random Fields ..... 325
  14.4 Learning and Inference ..... 328
  14.5 Approximate Inference ..... 331
  14.6 Topic Models ..... 337
  14.7 Further Reading ..... 339
  Exercises ..... 341
  References ..... 342
  Break Time ..... 345

Chapter 15 Rule Learning ..... 347
  15.1 Basic Concepts ..... 347
  15.2 Sequential Covering ..... 349
  15.3 Pruning and Optimization ..... 352
  15.4 First-Order Rule Learning ..... 354
  15.5 Inductive Logic Programming ..... 357
  15.6 Further Reading ..... 363
  Exercises ..... 365
  References ..... 366
  Break Time ..... 369

Chapter 16 Reinforcement Learning ..... 371
  16.1 Task and Reward ..... 371
  16.2 K-Armed Bandits ..... 373
  16.3 Model-Based Learning ..... 377
  16.4 Model-Free Learning ..... 382
  16.5 Value Function Approximation ..... 388
  16.6 Imitation Learning ..... 390
  16.7 Further Reading ..... 393
  Exercises ..... 394
  References ..... 395
  Break Time ..... 397

Appendix ..... 399
  A Matrices ..... 399
  B Optimization ..... 403
  C Probability Distributions ..... 409

Postscript ..... 417

Index ..... 419
