
204423 (Data Mining)
2559










1



1.1
(Information) (Knowledge)
(Raw Data) (Relationship)
(Pattern) (Concept)
(Non-trivial)




(Knowledge Discovery in Databases)
(Knowledge Extraction) (Data Analysis)





7
(Query) SELECT
(SQL)


(Data Stream)
(Remote Sensing) [Richards, 1999, Lillesand et al., 2014]
(Graphical data)






1.2
3
(Data Preprocessing) (Data Processing) (Post Processing) 1.1

Figure 1.1: The three stages of the data mining process: Data, Data pre-processing, Data processing, Post processing, Knowledge.

8
(Data Cleaning) (Data Integration)
(Data Selection) (Data Normalising)
(Dimensionality Reduction) 2
3


(Association Rule) (Classification)
(Clustering) (Algorithm)
(Machine Learning)

4 5 6



(Multidisciplinary)

(Database system)



(Data Visualisation) [Cleveland, 1993, Fayyad et al., 2002]


1.2.1

(Image) (Text)
(Sound) (Time Series)

(Transactional Data)









(Frequent Itemset)

4

(Temporal Data)

-



(Spatial Data)




( )

10
Spatio-temporal Data

(Textual and Multimedia Data)



(Text) (Keyword)

(Task)
(Topic Analysis) [Blei et al., 2003]
(Abnormality Detection)
(Sentiment Analysis)

[Pang and Lee, 2008]
(Multimedia)



(Graphical Data)

World-Wide-Web
(Node) (Edge) (Hyperlink)
WWW WWW

AdSense Google

Facebook
Facebook

11
Facebook

1.2.2



2
(Descriptive) (Predictive)



(Class) (Concept)




(Concept/Class Descriptions)
(Data Characterisation)

(Data Discrimination) (Manually)

Feature Extraction


Frequent Patterns

12

buys(X,bike) buys(X,helmet)
[support=3%, condence=60%]




(Measure of Interestingness)
(Condence) (Support) 60%
60%

(Single-dimension Association Rule)

buys(X,bike) age(X,3045) buys(X,helmet)


[support=2%, condence=50%]


(Minimum Support Threshold) (Minimum Confidence Threshold)


(Model)


(Labelled Training Data)
(Classier)
(Decision Tree), (Artificial Neural Network), (Support Vector Machine), (Logistic Regression)
5

13





(Similarity Measure)
6

(Outlier Analysis)
(Outlier)
(Noise)
[Jindal and Liu, 2007]


[Fawcett and Provost, 1997]

1.2.3


[Han et al., 2011]

14

15

1.

2.

3.

16
2




(Data)
(Data Point),
(Example), (Instance) (Input Vector) x
(Data Set) S


(Feature)
3
3

n N
(Euclidean Space) M

xn = {x1n , x2n , . . . , xm
n , . . . , xn }
M

17
n N m M
m
N M

1 m M
x1 . . . x1 . . . x1
.. . . . .. ..
. . .
1 M
xn . . . xn . . . xn
m
.. .. . . . ..
. . .
1 m M
xN . . . x N . . . x N

(Row Vector)
(Column Vector)

2.0.1. S 5 ()
( ) ()

13 39 157
23 56 157

15 60 177

32 72 187
21 43 162

1 13 39 157
3 60

2.1




18
2.1.1

(Nominal Feature)
(Domain) (Finite Set)
{,,,,,}

2.1.2

(Binary Feature)
2 ( 2)
2 (Symmetric) (Asymmetric)




2.1.3

(Ordinal Feature)


5
4 1

2.1.4

(Numeric Feature)
(Integer) (Real Number)

19
2.2


2
3 (Mean) (Median) (Mode)

2.2.1
(Arithmetic Mean)

\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n   (2.1)

For M-dimensional data the mean is taken component-wise:

\bar{\mathbf{x}} = [\bar{x}^1, \bar{x}^2, \ldots, \bar{x}^M] = \left[\frac{1}{N}\sum_{n=1}^{N} x_n^1, \frac{1}{N}\sum_{n=1}^{N} x_n^2, \ldots, \frac{1}{N}\sum_{n=1}^{N} x_n^M\right]   (2.2)

The weighted average (Weighted Average):

\bar{x} = \frac{1}{N}\sum_{n=1}^{N} w_n x_n   (2.3)

where w_n is the weight of x_n; when \sum_{n=1}^{N} w_n = 1 the weighted average corresponds to the expected value (Expected Value).

20
(Extreme Value) (Outlier)
(Trimmed Mean)
k

2.2.1. 6 1 5
10000, 15000, 13000, 20000, 30000
2000000

\frac{10000 + 15000 + 13000 + 20000 + 30000 + 2000000}{6} = 348000


2.2.2
(Median)

median = \begin{cases} x_{(\frac{N+1}{2})} & \text{if } N \text{ is odd} \\ \frac{1}{2}\left(x_{(\frac{N}{2})} + x_{(\frac{N}{2}+1)}\right) & \text{if } N \text{ is even} \end{cases}   (2.4)

where x_{(i)} is the i-th value of the data sorted in ascending order.

2.2.3
(Mode)
(Distribution)
(Unimodal)
(Multimodal)

21
70
( )

2.2.4
(Variance)
Var(x) = \sigma^2 = \frac{1}{N}\sum_{n=1}^{N} (x_n - \mu)^2   (2.5)




(Standard Deviation)

The covariance (Covariance) between features i and j:

Cov(x^i, x^j) = \sigma_{ij} = \frac{1}{N}\sum_{n=1}^{N} (x_n^i - \mu_i)(x_n^j - \mu_j)   (2.6)

where \mu_i and \mu_j are the means of features i and j. The covariance matrix (Covariance Matrix):

Cov(\mathbf{x}) = \Sigma = \frac{1}{N}\sum_{n=1}^{N} (x_n - \mu)(x_n - \mu)^T   (2.7)

22
where (x_n - \mu)^T denotes transposition (Transposition). \Sigma is an M \times M matrix:

\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1M} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{M1} & \sigma_{M2} & \cdots & \sigma_M^2 \end{bmatrix}

The covariance matrix is a symmetric (Symmetric) square (Square) matrix: the diagonal (Diagonal) holds the variances of the individual features, and \sigma_{ij} = \sigma_{ji}.

2.2.5
(Quantile) — cutpoints Q that divide the range (Range) of the data into n intervals of equal probability; there are |Q| = n - 1 cutpoints. With n = 2 the cutpoint is the median, n = 4 gives the quartiles, and n = 100 gives the percentiles.

The difference between the third quartile (Q3) and the first quartile (Q1) is the inter-quartile range (Inter-Quartile Range (IQR)): IQR = Q3 - Q1. The IQR describes the spread of the middle half of the data and is often used to detect outliers.
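The summary statistics above can be computed directly; a minimal sketch (the data values reuse the salary example, not a dataset from the text):

```python
import numpy as np

data = np.array([10000, 15000, 13000, 20000, 30000, 2000000])

mean = data.mean()                      # arithmetic mean (2.1)
median = np.median(data)                # robust to the extreme value
q1, q3 = np.percentile(data, [25, 75])  # first and third quartiles
iqr = q3 - q1                           # inter-quartile range

print(mean, median, iqr)                # 348000.0 17500.0 14000.0
```

The 2,000,000 outlier pulls the mean to 348,000 while the median stays near the bulk of the data, which is why the median is preferred for skewed data.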

23
2.3

(Visualisation)

100000
(Outlier)

[Hoffman and Grinstein, 2002]

2.3.1
(Boxplot) 5 (Five Numbers Summary)

1. (Minimum)

2. 1

3.

4. 3

5. (Maximum)


From Figure 2.1, a mild outlier (Mild Outlier) is a value below Q1 - 1.5 IQR or above Q3 + 1.5 IQR; an extreme outlier (Extreme Outlier) is a value below Q1 - 3 IQR or above Q3 + 3 IQR.

24
Figure 2.1: A boxplot summarising the five-number summary (Min, Q1, Median, Q3, Max), with whiskers at 1.5 IQR from the box and points beyond marked as outliers (points beyond 3 IQR as extreme outliers).

2.3.2
(Histogram)


M 1
2 M 2.2

Figure 2.2: A histogram of 1000 data points.

25
2.3.3

(Scatter Plot)
2 3 3
2
2.3

Figure 2.3: A scatter plot of two features.

2.4
(Similarity)

(Similarity Measure)

26

(Dissimilarity)

2
(Similarity Matrix) ( )
(Dissimilarity Matrix)


\begin{bmatrix} 0 & & & & \\ d(x_2, x_1) & 0 & & & \\ d(x_3, x_1) & d(x_3, x_2) & 0 & & \\ \vdots & \vdots & \vdots & 0 & \\ d(x_N, x_1) & d(x_N, x_2) & \cdots & d(x_N, x_{N-1}) & 0 \end{bmatrix}   (2.8)

where d(x_i, x_j) is the dissimilarity between x_i and x_j computed by a distance function d(\cdot).

2.4.1



Common choices are the Manhattan distance (Manhattan Distance) and the Euclidean distance (Euclidean Distance), both special cases of the Minkowski distance (Minkowski Distance). For x_i, x_j \in \mathbb{R}^M:

d(x_i, x_j) = \left(|x_i^1 - x_j^1|^h + |x_i^2 - x_j^2|^h + \cdots + |x_i^M - x_j^M|^h\right)^{1/h}   (2.9)

A distance must satisfy three properties:

(Positive Definiteness): d(x_i, x_j) > 0 if i \neq j, and d(x_i, x_i) = 0

(Symmetry): d(x_i, x_j) = d(x_j, x_i)

(Triangle Inequality): d(x_i, x_j) \leq d(x_i, x_k) + d(x_k, x_j)

With h = 1 we get the L1, Manhattan, or cityblock distance (Cityblock Distance):

d(x_i, x_j) = |x_i^1 - x_j^1| + |x_i^2 - x_j^2| + \cdots + |x_i^M - x_j^M|   (2.10)

With h = 2 we get the L2 or Euclidean distance:

d(x_i, x_j) = \sqrt{(x_i^1 - x_j^1)^2 + (x_i^2 - x_j^2)^2 + \cdots + (x_i^M - x_j^M)^2}   (2.11)

With h = \infty we get the supremum distance (Supremum Distance), also written L_{max} or L_\infty:

d(x_i, x_j) = \max_f |x_i^f - x_j^f|   (2.12)
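The three distances above can be sketched in a few lines (the sample points are illustrative):

```python
import numpy as np

def minkowski(xi, xj, h):
    """Minkowski distance (2.9); h=1 Manhattan, h=2 Euclidean."""
    return np.sum(np.abs(xi - xj) ** h) ** (1.0 / h)

def supremum(xi, xj):
    """Supremum (L-infinity) distance (2.12)."""
    return np.max(np.abs(xi - xj))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(minkowski(a, b, 1))  # 7.0
print(minkowski(a, b, 2))  # 5.0
print(supremum(a, b))      # 4.0
```

Note how the same pair of points gives different distances as h grows: larger h weights the single largest coordinate difference more heavily.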

2.4.2

For two data points whose features are binary, count the feature positions in a 2x2 contingency table (Contingency Table):

             x_j = 1   x_j = 0
  x_i = 1       q         r
  x_i = 0       s         t

Table 2.1: Contingency table for binary features

For symmetric binary features:

d(x_i, x_j) = \frac{r + s}{q + r + s + t}   (2.13)

For asymmetric binary features, 1-1 matches (q) are informative but 0-0 matches (t) are not, so t is dropped:

d(x_i, x_j) = \frac{r + s}{q + r + s}   (2.14)

Example 2.4.1. Let P denote a Positive test result and N a Negative one, over six tests for three patients (the patient names were lost in extraction; call them A, B and C):

  test   1  2  3  4  5  6
   A     P  N  P  N  N  N
   B     P  N  P  N  P  N
   C     P  P  N  N  N  N

Using the asymmetric distance (2.14):

d(A, B) = \frac{0 + 1}{2 + 0 + 1} = 0.33
d(A, C) = \frac{1 + 1}{1 + 1 + 1} = 0.67
d(B, C) = \frac{1 + 2}{1 + 1 + 2} = 0.75

so A and B are the most similar pair, and B and C the least similar.
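The example can be checked with a small sketch (encoding P as 1 and N as 0):

```python
def asym_binary_distance(x, y):
    """Asymmetric binary dissimilarity (2.14): 0-0 matches are ignored."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return (r + s) / (q + r + s)

A = [1, 0, 1, 0, 0, 0]   # P N P N N N
B = [1, 0, 1, 0, 1, 0]   # P N P N P N
print(round(asym_binary_distance(A, B), 2))  # 0.33
```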

2.4.3


2


1

d(x_i, x_j) = \frac{M - P}{M}   (2.15)

where M is the number of features and P is the number of features on which x_i and x_j have the same value.

2
(Binary Feature Encoding) (State)
( ) (Binary Code)


3 110, 011, 101
2 2 (Encoded Binary
Feature)

30
2.4.4

1
1
(Absolute Value) 2

0 ( ) (Unbounded)

2
Map each value to its rank (Rank) r \in \{1, \ldots, K\} and normalise it to [0, 1]:

Z = \frac{r - 1}{K - 1}   (2.16)

where K is the number of ordinal levels.

2.4.5
(Cosine Similarity) measures the angle between two vectors: it is 1 when they point in the same direction, -1 when they point in opposite directions (180 degrees apart), and 0 when they are orthogonal:

cos(x_i, x_j) = \frac{x_i^T x_j}{||x_i||\,||x_j||}   (2.17)

where ||x_i|| is the L2 norm of x_i.
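A minimal sketch of (2.17):

```python
import numpy as np

def cosine_similarity(xi, xj):
    """Cosine similarity (2.17): dot product over the product of L2 norms."""
    return xi @ xj / (np.linalg.norm(xi) * np.linalg.norm(xj))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0
print(cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0])))  # ~1.0
```

Parallel vectors score near 1 regardless of their lengths, which is why cosine similarity is popular for comparing documents of different sizes.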

2.5



4
1. (Accuracy)

2. (Completeness)

3. (Consistency)

4. (Timeliness)

4
4

2.5.1


A B
A cust.id B
customer.id
A B


3

32





2

2
(Random Variable)

p(X, Y ) p(X)
p(Y ) p(X, Y ) = p(X) p(Y )
(Null Hypothesis)

2


Consider two discrete random variables X and Y observed jointly, with counts collected in a contingency table (the row and column labels were lost in extraction):

           Y = y1   Y = y2   total
  X = x1     250      200      450
  X = x2      50     1000     1050
  total      300     1200     1500
The chi-square (\chi^2) statistic tests whether an R-row, C-column table is consistent with independence:

\chi^2 = \sum_{i=1}^{R}\sum_{j=1}^{C} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}   (2.18)

where o_{ij} is the observed frequency (Observed Frequency) and e_{ij} is the expected frequency (Expected Frequency) under independence:

e_{ij} = \frac{\left(\sum_{k=1}^{C} o_{ik}\right)\left(\sum_{l=1}^{R} o_{lj}\right)}{N}   (2.19)

i.e. the row total times the column total divided by N, the probability-weighted count expected if X = i and Y = j were independent. For example, e_{11} = 300 \times 450 / 1500 = 90. The expected counts are shown in parentheses:

           Y = y1       Y = y2       total
  X = x1   250 (90)     200 (360)     450
  X = x2    50 (210)   1000 (840)    1050
  total     300         1200         1500

Substituting o_{ij} and e_{ij} into (2.18):

\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 507.93   (2.20)

34

The computed \chi^2 value is compared against the chi-square distribution: we evaluate P(\chi^2_{df} > \chi^2) using the cumulative distribution function (Cumulative Distribution Function (CDF)) of the \chi^2 distribution with degrees of freedom (Degree of Freedom) df = (R - 1)(C - 1). If the CDF value exceeds 0.95 (significance value 0.05), the independence hypothesis is rejected and the two features are considered correlated.
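The computation of (2.18)-(2.20) can be sketched with plain array operations (the text rounds the result to 507.93):

```python
import numpy as np

observed = np.array([[250.0, 200.0],
                     [50.0, 1000.0]])

row_tot = observed.sum(axis=1, keepdims=True)
col_tot = observed.sum(axis=0, keepdims=True)
n = observed.sum()

expected = row_tot @ col_tot / n                       # e_ij per (2.19)
chi2 = ((observed - expected) ** 2 / expected).sum()   # (2.18)
print(round(chi2, 2))  # 507.94
```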


For numeric features x^i and x^j, Cov(x^i, x^j) = 0 when the features are uncorrelated. The correlation coefficient normalises the covariance by the standard deviations:

R(x^i, x^j) = \frac{Cov(x^i, x^j)}{\sigma_{x^i}\,\sigma_{x^j}}   (2.21)

R(X, Y) > 0 means the two features tend to increase together, R(X, Y) < 0 means one increases as the other decreases, and R(X, Y) = 0 means there is no linear relationship.
35
Figure 2.4: Scatter plots of feature pairs with different correlations: (a) positively correlated (R = 0.94); (b) negatively correlated (R = -0.95); (c) uncorrelated (R = 0.05); (d) uncorrelated (R = -0.02).

36


Note, however, that correlation does not imply causation: a high correlation between two features does not mean that one causes the other.

2.5.2
(Data reduction)









(Para-
metric) (Non-parametric)




(Model)
(Models Parameters)




A popular parametric model is the Gaussian Mixture Model (GMM), a weighted sum of k Gaussian components:

GMM(x) = \sum_{i=1}^{k} \pi_i\, N(x; \mu_i, \Sigma_i)   (2.22)

where \pi_i \geq 0 and \sum_{i=1}^{k} \pi_i = 1 are the mixing weights (Mixing Weight) of the GMM. The parameters \pi_i, \mu_i, \Sigma_i are commonly estimated with the Expectation-Maximisation (EM) algorithm [Dempster et al., 1977].

GMM


(Poisson Distribution)






(Equal-width Histogram) (Equal-frequency Histogram)





GMM

38
(k-means)
6





(Sampling)

(Random Sampling)

(Skewed Distribution)



(Stratified Sampling) — there are four common sampling schemes: simple random sampling (Simple Random Sampling), random sampling with replacement (Random Sampling with Replacement), random sampling without replacement (Random Sampling without Replacement), and stratified sampling (Stratified Sampling).

2.5.3




39

(Incomplete)
(Missing Feature)
(Missing Value)

(Incorrect) (Noise),
(Outlier) (Extreme Value)





-1







(Random Error) (Random Deviation)


3


(Quantisation) 1
(Bin)
1

40
( )
()
Zero-mean noise (Zero-mean Noise): the observed value X' is the true value X plus a noise term \epsilon:

X' = X + \epsilon   (2.23)

where \epsilon is drawn from a zero-mean distribution, e.g. N(0, \sigma). Averaging many observations of X' cancels the noise:

\frac{\sum_{n=1}^{N} X'_n}{N} = \frac{\sum_{n=1}^{N} X_n}{N} + \frac{\sum_{n=1}^{N} \epsilon_n}{N} = \frac{\sum_{n=1}^{N} X_n}{N} + 0   (2.24)

since the mean of zero-mean noise approaches 0.



f () a b b = f (a)
f ()
b a b
b

41
b
b
2.5
(
)

Figure 2.5: Fitting a function to noisy data; the fitted curve smooths out the random noise.

2.5.4


(Mapping Function) z
x
x = f (z) (2.25)
3

42

(Feature Construction)


(Pixel) 3

1
100 100 1
10000 10000 1



(Feature
Vector)
3

(Texture Extraction)
Local
Binary Pattern [Ojala et al., 2002]

(Shape Extraction)

[Ke and Sukthankar, 2004]

(Colour Extraction)
[Manjunath et al., 2001]


(Image Processing) [Sonka et al., 2014]


43



Three common normalisation methods:

1. Min-Max normalisation maps a value v in [min, max] to v' in a new range [min_n, max_n]:

v' = \frac{v - min}{max - min}(max_n - min_n) + min_n   (2.26)

2. Z-score normalisation rescales a feature to mean \mu = 0 and standard deviation \sigma = 1:

v' = \frac{v - \mu}{\sigma}   (2.27)

3. Decimal scaling (Decimal Scaling) divides by a power of ten so that values fall below 1:

v' = \frac{v}{10^j}, \quad \text{where } j \text{ is the smallest integer such that } \max(|v'|) < 1   (2.28)
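The first two methods can be sketched as follows (the input values are illustrative):

```python
import numpy as np

def min_max(v, new_min=0.0, new_max=1.0):
    """Min-max normalisation (2.26)."""
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score(v):
    """Z-score normalisation (2.27)."""
    return (v - v.mean()) / v.std()

v = np.array([10.0, 20.0, 30.0, 40.0])
print(min_max(v))    # values rescaled into [0, 1]
print(z_score(v))    # mean 0, standard deviation 1
```

Min-max is sensitive to outliers (a single extreme value compresses everything else), while z-score is preferable when the min and max are not known in advance.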


(Discretisation)




44


45

1.



2.

3.

2, 3, 4, 1, 11, 10, 7, 5, 2, 3, 6, 0, 4, 2

4.

5. 1000 1.
2.

50
550
250
150

46
3







(k-means
clustering) k (k-Nearest Neighbours)
[Indyk and Motwani, 1998]

(Hyperplane)


(Curse of Dimensionality) [Friedman, 1997]



3
3
This chapter covers three dimensionality reduction techniques: Principal Component Analysis, feature subset selection (Feature Subset Selection), and random projection (Random Projection).

3.1
Given N data points x_1, \ldots, x_N in M dimensions, consider first finding a single representative point x_0 that is as close as possible to all the points \{x_n\}_{n=1}^{N}. Define the objective f_0(x_0) as the sum of squared distances:

f_0(x_0) = \sum_{n=1}^{N} ||x_0 - x_n||^2   (3.1)

where ||x|| = \sqrt{(x^1)^2 + (x^2)^2 + \cdots + (x^M)^2}. This is a convex function (Convex Function) of x_0, so the optimal x_0 is found by setting the derivative with respect to x_0 to 0:

\frac{\partial f_0}{\partial x_0} = \sum_{n=1}^{N} 2(x_0 - x_n) = 0   (3.2)

x_0 = \frac{1}{N}\sum_{n=1}^{N} x_n = \mu   (3.3)

That is, the best zero-dimensional representative x_0 is the mean of the data.
48
Figure 3.1: A set of data points and their representative point, the mean.


3.1



A richer representation projects (Project) each point (Point) onto a line (Line). The projection (Projection) of x = (x^1, x^2, \ldots, x^M) onto a vector e = (e^1, e^2, \ldots, e^M) is the linear combination (Linear Combination):

e^T x = \sum_{i=1}^{M} e^i x^i   (3.4)
49
Figure 3.2: Projecting a data point x_n onto the line y through the mean.

A point on the line through \mu with direction e (Figure 3.2) can be written as

y = \mu + a e   (3.5)

where a is the signed distance of y from \mu along e. Representing each x_n by a point y_n = \mu + a_n e on the line, the total squared error is

f_1(a_1, \ldots, a_N, e) = \sum_{n=1}^{N} ||(\mu + a_n e) - x_n||^2
= \sum_{n=1}^{N} a_n^2 ||e||^2 - 2\sum_{n=1}^{N} a_n e^T (x_n - \mu) + \sum_{n=1}^{N} ||x_n - \mu||^2   (3.6)
50
To find the best a_n, take the partial derivative (Partial Derivative) of f_1(\cdot) with respect to a_n and set it to zero:

\frac{\partial f_1}{\partial a_n} = 2 a_n ||e||^2 - 2 e^T (x_n - \mu) = 0
a_n = e^T (x_n - \mu)   (3.7)

i.e. the best a_n is the projection of x_n - \mu onto e. Substituting (3.7) back into (3.6), with ||e|| = 1:

f_1(e) = \sum_{n=1}^{N} a_n^2 - 2\sum_{n=1}^{N} a_n^2 + \sum_{n=1}^{N} ||x_n - \mu||^2   (3.8)
= -\sum_{n=1}^{N} [e^T (x_n - \mu)]^2 + \sum_{n=1}^{N} ||x_n - \mu||^2   (3.9)
= -\sum_{n=1}^{N} e^T (x_n - \mu)(x_n - \mu)^T e + \sum_{n=1}^{N} ||x_n - \mu||^2   (3.10)
= -e^T S e + \sum_{n=1}^{N} ||x_n - \mu||^2   (3.11)

where S = \sum_{n=1}^{N} (x_n - \mu)(x_n - \mu)^T is the scatter matrix (Scatter Matrix).
n=1

(Scatter Matrix) S = N n=1 (xn )(xn )
T

Minimising f_1 thus means maximising e^T S e. Since e^T S e grows without bound with the length of e, we constrain e to be a unit vector, ||e|| = 1, and use a Lagrange multiplier (Lagrange Multiplier):

u = e^T S e - \lambda(e^T e - 1)   (3.12)

\frac{\partial u}{\partial e} = 2 S e - 2\lambda e   (3.13)

Setting this to zero gives an eigensystem (Eigensystem): the e that maximises e^T S e is an eigenvector (Eigenvector) of S:

S e = \lambda e   (3.14)

where \lambda is the eigenvalue (Eigenvalue) of e. From linear algebra (Linear Algebra), an M \times M matrix has up to M eigenvector/eigenvalue pairs, and e^T S e = e^T \lambda e = \lambda, so e^T S e is maximised by the eigenvector with the largest eigenvalue. Sorting the pairs (e_i, \lambda_i) by decreasing eigenvalue, the top k eigenvectors form a basis (Basis) matrix E:

E = \begin{bmatrix} e_1^1 & e_2^1 & \cdots & e_k^1 \\ e_1^2 & e_2^2 & \cdots & e_k^2 \\ \vdots & \vdots & \ddots & \vdots \\ e_1^M & e_2^M & \cdots & e_k^M \end{bmatrix}   (3.15)

52
Projecting x - \mu from M down to k dimensions:

E^T (x - \mu) = \begin{bmatrix} e_1^1 & e_1^2 & \cdots & e_1^M \\ e_2^1 & e_2^2 & \cdots & e_2^M \\ \vdots & \vdots & \ddots & \vdots \\ e_k^1 & e_k^2 & \cdots & e_k^M \end{bmatrix} \begin{bmatrix} x^1 - \mu^1 \\ x^2 - \mu^2 \\ \vdots \\ x^M - \mu^M \end{bmatrix}   (3.16)

When k < M this reduces the dimensionality of the data. A point is reconstructed from its projection by

y = \mu + \sum_{m=1}^{k} a_m e_m   (3.17)
= \mu + E a   (3.18)
a = E^T (x - \mu)   (3.19)

The coefficients a of a point x are its principal components (Principal Components); the k columns of E are the top-k eigenvectors of the scatter matrix S. In practice the whole data matrix (Data Matrix) is pre-multiplied (Pre-multiply) by E^T at once, as in Algorithm 1.

Algorithm 1 Principal Component Analysis
1: Centre the data: X ← X − μ so that the mean of X is 0
2: S = cov(X)
3: λ, E_all = eig(S) — eigenvalues and eigenvectors of S
4: Select the k most important principal components and put them in matrix E
5: X_pca = E^T X — project the data with E
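Algorithm 1 can be sketched with numpy (the random data are illustrative; rows are data points, so the projection is written X_c E rather than E^T X):

```python
import numpy as np

def pca(X, k):
    """PCA sketch following Algorithm 1; rows of X are data points."""
    mu = X.mean(axis=0)
    Xc = X - mu                          # centre the data (step 1)
    S = np.cov(Xc, rowvar=False)         # covariance matrix (step 2)
    eigval, eigvec = np.linalg.eigh(S)   # eigh: S is symmetric (step 3)
    order = np.argsort(eigval)[::-1]     # sort by decreasing eigenvalue
    E = eigvec[:, order[:k]]             # top-k principal directions (step 4)
    return Xc @ E                        # projected data, shape (N, k) (step 5)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])
Z = pca(X, 1)
print(Z.shape)  # (100, 1)
```

The projected data keep the high-variance direction (standard deviation about 3) and drop the low-variance one.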

53
3.1.1

Each eigenvalue measures the variance of the data along its eigenvector:

S e_i = \lambda_i e_i   (3.20)

where \lambda_i is the i-th eigenvalue of S. Discarding the directions with small eigenvalues therefore loses little information. The reconstruction error (Error) from keeping only the top k components in E is the fraction of discarded variance:

\epsilon = \frac{\sum_{i=k+1}^{M} \lambda_i}{\sum_{i=1}^{M} \lambda_i}   (3.21)

so k can be chosen as the smallest value for which \epsilon is acceptably small.

3.2

(Feature Selection)





There are three approaches. The filter method (Filter Method) scores features before any learning takes place, often with measures from information theory (Information Theory); see Figure 3.3.

Figure 3.3: The filter method: a set of features is scored by the filter method, and the selected subset is passed to the learning algorithm to build the model.

The wrapper method (Wrapper Method) scores feature subsets by the performance of the learning algorithm itself. The embedded method (Embedded Method) performs feature selection as part of training the model.
3.2.1







3.3






1
Information Theory [MacKay, 2003])

55



One filter score for a binary-class problem is the signal-to-noise ratio (Signal-to-Noise Ratio):

S2N = \frac{signal}{noise}   (3.22)

For feature k it compares the class means relative to the class spreads:

S2N = \frac{|\mu_0^k - \mu_1^k|}{\sigma_0^k + \sigma_1^k}   (3.23)

where \mu_0^k and \sigma_0^k are the mean and standard deviation of feature k over the N_0 data points of class 0:

\mu_0^k = \frac{\sum_{n=1}^{N_0} x_n^k}{N_0}   (3.24)

(\sigma_0^k)^2 = \frac{\sum_{n=1}^{N_0} (x_n^k - \mu_0^k)^2}{N_0}   (3.25)

and \mu_1^k, \sigma_1^k are defined analogously over the N_1 points of class 1. A larger S2N means the feature separates the two classes better; ranking all M features by their S2N gives a feature ordering for selection.
56

Another filter score is information gain (Information Gain). For a feature X and class label Y, it measures how much knowing X reduces the uncertainty about Y, where uncertainty is measured by entropy (Entropy). The information content (Information Content) of an outcome, and the entropy as its expected value (Expected Value), are

I(X) = \log_2\left(\frac{1}{P(X)}\right)   (3.26)

H(X) = E[I(X)]   (3.27)
= E[-\log_2(P(X))]   (3.28)
= -\sum_{n=1}^{N} P(x_n)\log_2 P(x_n)   (3.29)

For a fair (Fair) coin toss the entropy is maximal, 1 bit:

H(X) = -\sum_{n=1}^{N} P(x_n)\log_2 P(x_n)   (3.30)
= -[0.5\log_2(2^{-1}) + 0.5\log_2(2^{-1})]   (3.31)
= 1   (3.32)

For a biased (Biased) coin that always lands the same way there is no uncertainty at all:

H(X) = -\sum_{n=1}^{N} P(x_n)\log_2 P(x_n)   (3.33)
= -[1\log_2(1) + 0\log_2(0)]   (3.34)
= 0   (3.35)

The information gain of feature X is the reduction in the entropy of Y once X is known:

IG(X) = H(Y) - H(Y|X)   (3.36)

For a discrete event (Discrete Event) Y with k classes:

H(Y) = -\sum_{i=1}^{k} P(Y = y_i)\log_2(P(Y = y_i))   (3.37)

H(Y|X) = \sum_{j=1}^{r} P(X = x_j) H(Y|X = x_j)   (3.38)

where H(Y|X = x_j) is the entropy of Y restricted to the data points with X = x_j:

H(Y|X = x_j) = -\sum_{i=1}^{k} P(Y = y_i|X = x_j)\log_2(P(Y = y_i|X = x_j))   (3.39)
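Equations (3.36)-(3.39) can be sketched as follows (the tiny label lists are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) per (3.37)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys):
    """IG(X) = H(Y) - H(Y|X) per (3.36) and (3.38)."""
    n = len(ys)
    cond = 0.0
    for v in set(xs):
        subset = [y for x, y in zip(xs, ys) if x == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(ys) - cond

ys = [1, 1, 0, 0]
print(entropy(ys))                         # 1.0, like the fair coin
print(information_gain([0, 0, 1, 1], ys))  # 1.0: X fully determines Y
print(information_gain([0, 1, 0, 1], ys))  # 0.0: X tells us nothing about Y
```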

58





Mutual information (Mutual Information) measures the dependence between two features X and Y; it is zero exactly when they are independent:

I(x, y) = \sum_{x}\sum_{y} p(x, y)\log\frac{p(x, y)}{p(x)p(y)}   (3.40)


1.
x1 x2

2. X Y



M < M



( )
[Guyon and Elisseeff, 2003]

59
3.2.2







(Search Strategy)


(Mean Squared
Error)
(Brute Force)
(Greedy Approach) (Evolutionary Algorithm)



2 2
(Forward Selection) (Backward Elimination)



( )


[Guyon and Elisseeff, 2003]
2
[Eiben and Smith, 2003]

60






[Guyon and Elisseeff, 2003]

3.2.3



The embedded approach adds a regularisation (Regularisation) term to the training objective:

f_{obj} = \text{objective} + \lambda \cdot \text{regularisation}   (3.41)

With an L1 penalty on the weights:

f_{obj} = \arg\min_{w} \underbrace{\sum_{i=1}^{N} y_i(w^T x_i + b)}_{\text{objective function}} + \underbrace{\lambda\sum_{j=1}^{M} |w_j|}_{\text{regularisation}}   (3.42)

where \lambda is the regularisation parameter (Regularisation Parameter). The L1 penalty is the sum of the absolute values of the weights (w), so minimising it drives many weights w_j to exactly 0; a feature j with w_j = 0 is effectively discarded, so feature selection happens inside training [Ng, 2004].

61
3.3
(PCA)

PCA


PCA




(Random Projection) Johnson
Lindenstrauss Lemma

Lemma 1 ([Johnson et al., 1986]). Let \epsilon \in (0, 1). Let k, M, N \in \mathbb{N} such that k \geq C\epsilon^{-2}\log N, for a large enough absolute constant C. Let V \subset \mathbb{R}^M be a set of N points. Then there exists a linear mapping R : \mathbb{R}^M \to \mathbb{R}^k, such that for all u, v \in V:

(1 - \epsilon)||u - v||^2_{\ell_2} \leq ||Ru - Rv||^2_{\ell_2} \leq (1 + \epsilon)||u - v||^2_{\ell_2}   (3.43)

That is, for a set V of N points in M dimensions there exists a linear map, realisable as a random matrix (Random Matrix) R, into k dimensions that preserves all pairwise relative distances (Relative distance) up to a factor of 1 \pm \epsilon. Notably, k depends on N but not on M, so even very high-dimensional data can be projected down aggressively while distances are approximately preserved.
62

1.

2.

3.

4.

5.

63
4






(Recommendation Systems)

4.1
(Association Rule)
. . . A = B A B
2
(Frequent Pattern)

(Frequent Itemset)

64

(Frequent Sequential Pattern)


(Frequent Structured Pattern)


A
B





(Market Basket Analysis)




4.1.1


Let I = \{I_1, I_2, \ldots, I_m\} be the set of items and D = \{t_1, t_2, \ldots, t_n\} the transaction set (Transaction Set), where each transaction t_i is a subset of I. A transaction t_i contains an itemset A when A \subseteq t_i. An association rule is an implication (Implication)

A \Rightarrow B   (4.1)

where A \subset I, B \subset I and A \cap B = \emptyset. Over D the rule is measured by its support (Support) s, the fraction of transactions containing both A and B (A \cup B), and its confidence (Confidence) c, the fraction of transactions containing A that also contain B:

support(A \Rightarrow B) = P(A \cup B) = \frac{\#\text{ of } t_i \text{ containing } A \cup B}{\#\text{ of } t_i}   (4.2)

confidence(A \Rightarrow B) = P(B|A) = \frac{support(A \cup B)}{support(A)} = \frac{\#\text{ of } t_i \text{ containing } A \cup B}{\#\text{ of } t_i \text{ containing } A}   (4.3)

Here A is the antecedent (Antecedence) and B the consequent (Consequence). Given a minimum support threshold (Minimum Support Threshold) s and a minimum confidence threshold (Minimum Confidence Threshold) c, a rule meeting both is called strong (Strong). An itemset F = A \cup B with k items whose support is at least s is a frequent k-itemset (Frequent k-itemset); the set of all frequent k-itemsets is denoted L_k.

Example 4.1.1. Consider the five transactions over four items in Table 4.1 (the item names were lost in extraction; call the items a, b, c, d):

  transaction   a  b  c  d
       1        1  1  0  0
       2        0  0  1  0
       3        0  0  0  1
       4        1  1  1  0
       5        0  1  0  0

Table 4.1: A transaction set

With A = {a} and B = {b}:

support(A \Rightarrow B) = \frac{2}{5} = 0.4 = 40\%

confidence(A \Rightarrow B) = \frac{0.4}{0.4} = 1 = 100\%

If the minimum support threshold is 40% and the minimum confidence threshold is 60%, the rule A \Rightarrow B is strong, and F = A \cup B = {a, b} is a frequent 2-itemset.


Association rule mining therefore proceeds in two steps:

1. Find all frequent itemsets that meet the minimum support threshold.

2. Generate strong rules from the frequent itemsets using the minimum confidence threshold.
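The two measures (4.2)-(4.3) can be sketched directly (items are named a-d only for illustration, matching nothing in the original table):

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item of the itemset (4.2)."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(a, b, transactions):
    """support(A ∪ B) / support(A) per (4.3)."""
    return support(a | b, transactions) / support(a, transactions)

T = [{"a", "b"}, {"c"}, {"d"}, {"a", "b", "c"}, {"b"}]
print(support({"a", "b"}, T))        # 0.4
print(confidence({"a"}, {"b"}, T))   # 1.0
```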

4.1.2

(Apriori Algorithm) (Breadth-rst-
search) k (k+1)
1 (1-itemset)
1 L1 L2 L1 L2
2 (Frequent 2-itemsets) L2 L3
k
k
(k+1) (k+1)

67
(Apriori Property)

Apriori Property:


I
I A I
I A I I
I A

Algorithm 2
1: Find all strong 1-itemsets
2: while Lk1 is non-empty set do
3: Ck = apriori-gen(Lk1 )
4: For each c in Ck , initialise c.count to zero
5: for records r in the database do
6: Cr = subset(Ck , r) ; for each c in Cr , c.count++
7: Set Lk to all c in Ck whose support is greater than minimum support
8: end for
9: end while
10: Return all of the Lk sets
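Algorithm 2 can be sketched compactly (support counting is done naively here; a real implementation would use the subset/count structure of the pseudocode):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Apriori sketch: breadth-first generation of frequent itemsets."""
    n = len(transactions)

    def frequent(candidates):
        return {c for c in candidates
                if sum(1 for t in transactions if c <= t) / n >= min_support}

    items = {i for t in transactions for i in t}
    L = frequent({frozenset([i]) for i in items})   # frequent 1-itemsets
    result = set(L)
    k = 2
    while L:
        # join step: unite pairs of frequent (k-1)-itemsets into k-itemsets
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # prune step: every (k-1)-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        L = frequent(candidates)
        result |= L
        k += 1
    return result

T = [{"a", "b"}, {"c"}, {"d"}, {"a", "b", "c"}, {"b"}]
print(apriori(T, 0.4))
```

On the Table 4.1 data this returns {a}, {b}, {c} and {a, b}: {d} fails the support threshold, so by the Apriori property no superset of {d} is ever generated.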

apriori-gen
(Joining) (Pruning)


k ( k ) Lk
Lk1 Lk1 Ck

68

In the join step, L_{k-1} is joined with itself: two itemsets l_1 and l_2 in L_{k-1} are joinable when their first (k - 2) items are identical,

(l_1[1] = l_2[1]) \wedge (l_1[2] = l_2[2]) \wedge \cdots \wedge (l_1[k-2] = l_2[k-2])

and, to avoid generating duplicates, (l_1[k-1] < l_2[k-1]). Joining l_1 and l_2 yields the candidate k-itemset

l_1[1], l_1[2], \ldots, l_1[k-2], l_1[k-1], l_2[k-1]


Ck Lk k
Ck Ck
Lk
Ck Ck
Ck
Ck (k-1)
k (k-1) k
Lk1

For each frequent itemset l and each non-empty subset s \subset l, output the rule

s \Rightarrow (l - s) \quad \text{if} \quad \frac{support\_count(l)}{support\_count(s)} \geq min\_confidence


A long frequent itemset implies an enormous number of frequent subsets: an itemset A with 100 items has on the order of 2^{100} subsets, all of them frequent. Two compressed representations avoid enumerating them. An itemset A is a closed frequent itemset (Closed Frequent Itemset) if A is frequent and no superset (Superset) B \supset A has the same support count as A; for example, if A = {milk} and B = {milk, bread} appear in exactly the same transactions, only B needs to be kept. An itemset A is a maximal frequent itemset (Maximal Frequent Itemset) if A is frequent (Frequent) and no superset B \supset A is frequent at all.

4.2


70


A famous example is the Netflix Prize: Netflix offered one million dollars for a recommender that beat its own system by a set margin, and the winning team, BellKor's Pragmatic Chaos, included AT&T researchers. Amazon likewise recommends products based on purchase histories.




Recommender systems fall into two families: content-based systems (Content-based System), which recommend items similar to those a user already liked, and collaborative filtering (Collaborative Filtering), which recommends items liked by similar users. Physical stores can only stock popular items, while online catalogues can also serve the long tail (Long-tail Phenomenon) of rarely demanded items, as sketched in Figure 4.1.

Figure 4.1: The long-tail phenomenon: a few items are very popular, while most items are each demanded rarely.


( )

71
Table 4.2: A utility matrix of 4 users and 7 movies (most of its entries were lost in extraction); blank cells are unrated items.

4.2.1

(Utility Matrix)



4 1-5 4.2








1 K
0 1

72

4.2.2






(Item Prole) (User)
(User Prole)







4.3

73
Table 4.3: An item profile: each row is a movie (Ant man, Avenger, Iron man), each column an actor (S.Johansson, C.Evans, R.Downey), and a 1 marks that the actor appears in the movie.







4.4

Table 4.4: The item profiles of Table 4.3 extended with further features (the extra rows were lost in extraction).

1.

2.

3.

74
(Rating) j i


pi,j = ((j k))/|Si | (4.4)
Si

Si i (j k) 1
k j 0 k
i j


pi,j = (vc vi )/|Cj | (4.5)
cCj

Cj i j
vi i |Cj |
Cj
4.

4.2.1.

/ S.Johansson C.Evans R.Downey


Ant man 1
Avenger 1 1 1
Iron man 1 1
4.5:

Iron man
( )

75
/ Ant man Avenger Iron man
1
1
1 1
4.6:

/ S.Johansson C.Evans R.Downey


0 0 0
4.7:

/ S.Johansson C.Evans R.Downey


1 0 1
4.8:

j i pi,j = Si ((j
k))/|Si |
S.Johansson R.Downey Iron man
( )
S , Avenger = (11)+(10)+(11) = 2
S , Ant man = (1 0) + (0 1) + (1 0) = 0
Avenger

0
1
( ) 1 5

i
j pi,j = cCj (vc vi )/|Cj |

76
/ Ant man Avenger Iron man
3
2
5 5
4.9: (Score-based)

4.2.3



(Collaborative Filtering)


Let v_{i,j} be the rating user i gave item j, and I_i the set of items user i has rated. The mean rating of user i is

\bar{v}_i = \frac{1}{|I_i|}\sum_{j \in I_i} v_{i,j}   (4.6)

The predicted rating of the active user a for item j combines the other users' deviations from their own means, weighted by similarity:

p_{a,j} = \bar{v}_a + \sum_{i=1}^{n} w(a, i)(v_{i,j} - \bar{v}_i)   (4.7)

where p_{a,j} is the predicted rating and w(a, i) is the similarity weight between users a and i.


77
Common choices of w(a, i) include the Minkowski distance-based nearest neighbour (Minkowski distance-based Nearest Neighbour), the Pearson correlation coefficient (Pearson's Correlation Coefficient), the Jaccard distance (Jaccard Distance), and the cosine distance (Cosine Distance).

The simplest weight includes the k nearest users uniformly:

w(a, i) = \begin{cases} 1, & \text{if } i \in neighbour(a) \\ 0, & \text{otherwise} \end{cases}

where neighbour(a) is the set of the k users closest to a under a Minkowski distance d(u_i, u_j) = (|u_i^1 - u_j^1|^h + \cdots + |u_i^M - u_j^M|^h)^{1/h}. A smoother variant weights each neighbour by its distance:

w(a, i) = \begin{cases} d(u_a, u_i), & \text{if } i \in neighbour(a) \\ 0, & \text{otherwise} \end{cases}

The Pearson correlation weight is

w(a, i) = \frac{\sum_j (v_{a,j} - \bar{v}_a)(v_{i,j} - \bar{v}_i)}{\sqrt{\sum_j (v_{a,j} - \bar{v}_a)^2 \sum_j (v_{i,j} - \bar{v}_i)^2}}

78




0






The Jaccard similarity of two finite sets (Finite Set) S and T is |S \cap T| / |S \cup T|, and the Jaccard distance is

1 - |S \cap T| / |S \cup T|   (4.8)

where S \cap T is the intersection (Intersection) of S and T, and S \cup T their union (Union).

Example 4.2.2. With S = {dog, cat, parrot, monkey} and T = {dog, monkey, snake}, SIM_{Jaccard}(S, T) = 2/5 = 0.4, and the Jaccard distance is 1 - SIM_{Jaccard} = 0.6.


(Rating Vector)


79

SIM_{cos}(A, B) = \frac{A \cdot B}{||A||\,||B||}   (4.9)

where A and B are the rating vectors (Rating Vector) of the two users.




0


(Discretise)

( )
1-5 3,4,5 1 1 2
0

80



acc = \frac{\sum_{i=1}^{n} |\hat{r}_i - r_i|}{n}   (4.10)

for ratings \hat{r}_i predicted on a 1..K scale, or by the mean squared error over N held-out ratings:

M.S.E = \frac{\sum_{i=1}^{N} (p_{i,j} - v_{i,j})^2}{N}   (4.11)

81

1.

2.

3.

4.

5.


1 1 1 1 1
2 1 1 1
3 1 1 1 1
4 1 1 1
5 1 1 1 1 1
6 1 1 1
7 1 1 1 1 1


[, ] [ , , ]

IF & THEN

82
5



(E-mail)
(Spam E-mail)







Classification (Classification) learns a function h: X \to Y from a training set S = \{x_i, y_i\}_{i=1}^{N} drawn from a distribution D, where x_i \in X is an input vector (Input Vector) and y_i \in Y is the class label (Class Label) of x_i.


(Supervised Learning)
h y x
y y x
h (Classier)

83
5.1
(Bayesian Learning)
(Bayes Rule)

p(y|x) = \frac{p(x|y)p(y)}{p(x)}   (5.1)

which follows from the two factorisations of the joint probability (Joint Probability) p(x, y):

p(x, y) = p(x|y)p(y) = p(y|x)p(x)   (5.2)

p(x)
5.1

y x

1. p(y|x) x y
(Posterior Probability)

2. p(x|y) x y
(Likelihood)

3. p(y) y
(Prior Probability)

4. p(x) x (Evidence)
2

84

(Binary Classification Problem) 0 1
p(y = 1|x) p(y = 0|x)

p(y = 1|x) = \frac{p(x|y = 1)p(y = 1)}{p(x)}   (5.3)

p(y = 0|x) = \frac{p(x|y = 0)p(y = 0)}{p(x)}   (5.4)



p(x|y)
(Probability Density Function (PDF))

5.1.1. (Positive)

98% (True Positive)
97%

x = positive y = 0
y = 1
p(y = 0|positive) p(y = 1|positive)

85
From Example 5.1.1, the two posteriors expand by Bayes rule:

p(y = 0|positive) = \frac{p(positive|y = 0)p(y = 0)}{p(positive)} = \frac{p(positive|y = 0)p(y = 0)}{p(positive|y = 0)p(y = 0) + p(positive|y = 1)p(y = 1)}   (5.5, 5.6)

p(y = 1|positive) = \frac{p(positive|y = 1)p(y = 1)}{p(positive)} = \frac{p(positive|y = 1)p(y = 1)}{p(positive|y = 0)p(y = 0) + p(positive|y = 1)p(y = 1)}   (5.7, 5.8)

The test detects the disease with p(positive|y = 1) = 0.98 and correctly rejects healthy cases with p(negative|y = 0) = 0.97. Since the two outcomes are complementary:

p(positive|y = 1) + p(negative|y = 1) = 1   (5.9)
p(negative|y = 1) = 1 - p(positive|y = 1) = 1 - 0.98 = 0.02   (5.10, 5.11)
p(positive|y = 0) + p(negative|y = 0) = 1   (5.12)
p(positive|y = 0) = 1 - 0.97 = 0.03   (5.13)



p(y)
y
( ) p(y)

86
p(y)

p(y = 1) = p(y = 0) = 0.5

p(y)

With equal priors p(y = 1) = p(y = 0) = 0.5 the prior cancels out:

p(y = 1|x) = \frac{p(x|y = 1)p(y = 1)}{p(x)}   (5.14)
= \frac{p(x|y = 1)p(y = 1)}{p(x|y = 0)p(y = 0) + p(x|y = 1)p(y = 1)}   (5.15)
= \frac{0.5\,p(x|y = 1)}{0.5\,p(x|y = 0) + 0.5\,p(x|y = 1)}   (5.16)
= \frac{p(x|y = 1)}{p(x|y = 0) + p(x|y = 1)}   (5.17)

p(x) = p(x|y = 0)p(y = 0) + p(x|y = 1)p(y = 1)


p(x) 0 1






\hat{y} = \arg\max_{y \in \{0,1\}} p(y|x) = \arg\max_{y \in \{0,1\}} p(x|y)   (5.18)

This decision rule is the maximum likelihood estimate (Maximum Likelihood Estimate (ML)); it ignores the prior p(y).
87
With p(y = 0) = p(y = 1) = 0.5, the ML decision for a positive test result is:

p(y = 0|positive) = \frac{p(positive|y = 0) \times 0.5}{p(positive|y = 0) \times 0.5 + p(positive|y = 1) \times 0.5}   (5.19)
= \frac{0.03 \times 0.5}{0.03 \times 0.5 + 0.98 \times 0.5}   (5.20)
\approx 0.03   (5.21)

p(y = 1|positive) = \frac{p(positive|y = 1) \times 0.5}{p(positive|y = 0) \times 0.5 + p(positive|y = 1) \times 0.5}   (5.22)
= \frac{0.98 \times 0.5}{0.03 \times 0.5 + 0.98 \times 0.5}   (5.23)
\approx 0.97   (5.24)
> p(y = 0|positive)   (5.25)

ML

5.1.2
In reality the disease is rare: only a fraction 0.008 of the population has it, so p(y = 1) = 0.008 and p(y = 0) = 0.992. Taking the prior into account gives the maximum a posteriori (Maximum A Posteriori (MAP)) decision:

\hat{y} = \arg\max_{y \in \{0,1\}} p(y|x) = \arg\max_{y \in \{0,1\}} p(x|y)p(y)   (5.26)

88
For the positive test result, MAP gives:

p(y = 0|positive) = \frac{p(positive|y = 0)p(y = 0)}{p(positive|y = 0)p(y = 0) + p(positive|y = 1)p(y = 1)}   (5.27)
= \frac{0.03 \times 0.992}{0.03 \times 0.992 + 0.98 \times 0.008}   (5.28)
\approx 0.79   (5.29)

p(y = 1|positive) = \frac{p(positive|y = 1)p(y = 1)}{p(positive|y = 0)p(y = 0) + p(positive|y = 1)p(y = 1)}   (5.30)
= \frac{0.98 \times 0.008}{0.03 \times 0.992 + 0.98 \times 0.008}   (5.31)
\approx 0.21   (5.32)
< p(y = 0|positive)   (5.33)

Under MAP the patient is therefore more likely healthy even after a positive test, because the disease is so rare. Classifiers that decide by ML or MAP in this way are called Bayes classifiers (Bayes Classifier).
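The MAP computation (5.27)-(5.32) can be sketched generically:

```python
def posterior(likelihood, prior):
    """MAP posteriors: p(y|x) ∝ p(x|y) p(y), normalised by the evidence."""
    joint = {y: likelihood[y] * prior[y] for y in prior}
    evidence = sum(joint.values())
    return {y: joint[y] / evidence for y in joint}

# positive test result: p(positive|y=0) = 0.03, p(positive|y=1) = 0.98
post = posterior({0: 0.03, 1: 0.98}, {0: 0.992, 1: 0.008})
print(round(post[0], 2), round(post[1], 2))  # 0.79 0.21
```

Swapping the prior for {0: 0.5, 1: 0.5} reproduces the ML decision instead.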

5.1.3




14 5.1 xi = {x1i , x2i , . . . , xM i }

89

15

x15 = {=,=, = , =}


Table 5.1: The 14 training examples (the attribute values of the table were lost in extraction).

p(y|x_{15}) \propto p(\{x_{15}^1, x_{15}^2, \ldots, x_{15}^M\}|y)\,p(y)   (5.34)

Each conditional probability of a single feature, such as p(x^1 = v|y), is easy to estimate by counting; the example in the text gives 2/9 for one such feature value within the positive class. The joint likelihood p(\{x^1, x^2, \ldots, x^M\}|y), however, would require counting every combination of feature values. By the chain rule of conditional probability (Conditional Probability) the joint probability (Joint Probability) factorises as:

p(\{x^1, \ldots, x^M\}|y) = p(x^1|y)\,p(x^2, \ldots, x^M|y, x^1)   (5.35)
= p(x^1|y)\,p(x^2|y, x^1)\,p(x^3, \ldots, x^M|y, x^1, x^2)   (5.36)
= \ldots   (5.37)

The naive assumption (Naive Assumption) is that the features are conditionally independent given the class, so each factor in (5.35) drops its feature conditions, e.g. p(x^2|y, x^1) = p(x^2|y):

p(\{x^1, \ldots, x^M\}|y) = p(x^1|y)\,p(x^2, \ldots, x^M|y)   (5.38)
= p(x^1|y)\,p(x^2|y)\,p(x^3, \ldots, x^M|y)   (5.39)
= \prod_{i=1}^{M} p(x^i|y)   (5.40)


The resulting naive Bayes classifier (Naive Bayes Classifier) predicts example 15 by

\hat{y} = \arg\max_{y} p(x_{15}|y)p(y)   (5.41)
= \arg\max_{y} p(y)\prod_{i=1}^{4} p(x_{15}^i|y)   (5.42)

Using the counts from Table 5.1, where 9 of the 14 examples are in the positive class and 5 in the negative class:

p(y = pos|x_{15}) \propto \frac{9}{14}\cdot\frac{2}{9}\cdot\frac{2}{9}\cdot\frac{3}{9}\cdot\frac{3}{9}   (5.43, 5.44)

p(y = neg|x_{15}) \propto \frac{5}{14}\cdot(\text{the corresponding conditional probabilities of the negative class})   (5.45, 5.46)

Comparing the two, p(y = pos|x_{15}) > p(y = neg|x_{15}), so the positive class is predicted.
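The counting scheme of (5.42) can be sketched on a tiny illustrative dataset (these weather-style values are hypothetical, not the lost Table 5.1):

```python
def predict_nb(X, y, query):
    """Naive Bayes prediction (5.42): argmax_c p(c) * prod_i p(x^i = v | c)."""
    best, best_score = None, -1.0
    for c in set(y):
        rows = [x for x, yy in zip(X, y) if yy == c]
        score = len(rows) / len(y)                        # prior p(c)
        for f, v in enumerate(query):
            score *= sum(1 for r in rows if r[f] == v) / len(rows)
        if score > best_score:
            best, best_score = c, score
    return best

X = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "hot")]
y = ["no", "no", "yes", "no"]
print(predict_nb(X, y, ("rain", "mild")))  # yes
```

A production implementation would add Laplace smoothing so that an unseen feature value does not zero out the whole product.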




20 Newsgroups 20


(Keyword)

v {, , , , , }

92

(
)



( : )

x = {=1, =3, =1, =0, =0, =4}

x = {1,3,1,0,0,4}

5.2






(Normal Discriminant Analysis)
50

93
100 xq = 173.2

p(xq = 173.2|y = )
0 173.2
p(y = |xq = 173.2)
p(y = |xq = 173.2)

(
)

(Data Distribution
Model)





(Normally Distributed)
Central Limit Theorem [Wasserman, 2013]
(Independent Feature)



For a single numeric feature, a common choice of PDF is the univariate normal distribution (Univariate Normal Distribution):

p(x|y = k) = \frac{1}{\sqrt{2\pi\sigma_k^2}}\exp\left\{-\frac{(x - \mu_k)^2}{2\sigma_k^2}\right\}   (5.47)

94

The class posteriors, dropping the evidence term, are:

p(y = 0|x) \propto p(x|y = 0)p(y = 0)   (5.48)
= p(x|\mu_0, \sigma_0)p(y = 0)   (5.49)
= \frac{1}{\sqrt{2\pi\sigma_0^2}}\exp\left\{-\frac{(x - \mu_0)^2}{2\sigma_0^2}\right\}p(y = 0)   (5.50)

p(y = 1|x) \propto p(x|y = 1)p(y = 1)   (5.51)
= p(x|\mu_1, \sigma_1)p(y = 1)   (5.52)
= \frac{1}{\sqrt{2\pi\sigma_1^2}}\exp\left\{-\frac{(x - \mu_1)^2}{2\sigma_1^2}\right\}p(y = 1)   (5.53)

The parameters \mu_0, \mu_1, \sigma_0, \sigma_1 are estimated from the training data of each class, using the N_k points with y = k:

\mu_k = \frac{\sum_{n=1}^{N_k} x_n}{N_k}   (5.54)

\sigma_k^2 = \frac{\sum_{n=1}^{N_k} (x_n - \mu_k)^2}{N_k}   (5.55)

\pi_k = p(y = k) = \frac{N_k}{N}   (5.56)
N
A classifier that models probabilities in this way is a probabilistic classifier (Probabilistic Classifier), and because it models how the data of each class are generated, it is also a generative classifier (Generative Classifier).
95
5.2.1

A discriminant function (Discriminant Function) maps x directly to a class y. For two classes:

f_1(x) = \mathbb{1}\left(\frac{p(y = 1|x)}{p(y = 0|x)} > 1\right)   (5.57)

predicts 1 when p(y = 1|x) > p(y = 0|x) and 0 otherwise. Taking logs gives the equivalent rule

f_2(x) = \mathbb{1}\left(\log\frac{p(y = 1|x)}{p(y = 0|x)} > 0\right)   (5.58)

with a prediction threshold (Prediction Threshold) of 0.

5.2.2
1
PDF

p(x|y = k) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}\exp\left\{-\frac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k)\right\}   (5.59)

where |\Sigma| is the determinant of the covariance matrix, and the parameters are estimated per class:

\mu_k = \frac{\sum_{n=1}^{N_k} x_n}{N_k}   (5.60)

\Sigma_k = \frac{\sum_{n=1}^{N_k} (x_n - \mu_k)(x_n - \mu_k)^T}{N_k}   (5.61)

96





M
O(M 2 )

2


(Common Covariance Matrix)



\Sigma = \frac{\Sigma_0 + \Sigma_1}{2}   (5.62)

(Diagonal Covariance Matrix)


0
0



97
5.3




1 [Vapnik and Vapnik, 1998]



(Discriminative)



(Logistic Regression) [Ng and Jordan, 2002]



(Support Vector Machine), (Artificial Neural Network), (Logistic Regression)

Linear regression (Linear Regression) predicts y from x as

\hat{y} = w^T x + \epsilon   (5.63)

where w is the weight vector (Weight Vector), trained so that the error |\hat{y} - y| between \hat{y} and y is small. A linear classifier instead thresholds (Threshold) \hat{y} at 0, predicting 1 if \hat{y} \geq 0 and 0 if \hat{y} < 0, which makes w^T x = 0 the decision hyperplane (Decision Hyperplane). Comparing with (5.58):

f_2(x) = \mathbb{1}\left(\log\frac{p(y = 1|x)}{p(y = 0|x)} > 0\right) = \mathbb{1}(\hat{y} > 0)   (5.64)

suggests interpreting w^T x as the log-odds:

\log\frac{p(y = 1|x)}{p(y = 0|x)} = \hat{y} = w^T x   (5.65)

1 Vladimir Vapnik, creator of the Support Vector Machine (SVM)
p(y = 0|x)

Solving (5.65) for p(y = 1|x), using p(y = 0|x) = 1 - p(y = 1|x):

\log\frac{p(y = 1|x)}{p(y = 0|x)} = w^T x   (5.66)

\frac{p(y = 1|x)}{p(y = 0|x)} = \exp(w^T x)   (5.67)

\frac{p(y = 1|x)}{1 - p(y = 1|x)} = \exp(w^T x)   (5.68)

p(y = 1|x) = \exp(w^T x) - p(y = 1|x)\exp(w^T x)   (5.69)

p(y = 1|x) = \frac{\exp(w^T x)}{1 + \exp(w^T x)}   (5.70)
= \frac{1}{1 + \exp(-w^T x)}   (5.71)

and correspondingly

p(y = 0|x) = \frac{1}{1 + \exp(w^T x)}   (5.72)

The function

\frac{1}{1 + \exp(-w^T x)}   (5.73)

is the sigmoid function (Sigmoid Function).

Training finds the w that makes the observed labels most probable: a new x is then classified 0 when p(y = 0|x) > p(y = 1|x) and 1 when p(y = 1|x) > p(y = 0|x). The likelihood function (Likelihood Function) of the training set is

L_1 = \prod_{n=1}^{N} p(y_n = 1|x_n, w)^{y_n}\left(1 - p(y_n = 1|x_n, w)\right)^{1 - y_n}   (5.74)

Maximising L_1 directly is awkward, so its logarithm, the log-likelihood function (Log-likelihood Function), is maximised instead:

L_2 = \sum_{n=1}^{N} y_n\log p(y_n = 1|x_n, w) + (1 - y_n)\log(1 - p(y_n = 1|x_n, w))   (5.75)

Equivalently, minimising -L_2 is the negative log-likelihood (Negative Log-likelihood Function) formulation.

100
L_2 is a concave function of w, so it is maximised where its gradient is zero; there is no closed form, but iterative methods from optimisation theory (Optimisation Theory) [Boyd and Vandenberghe, 2004] apply. Newton's method (Newton's Method) finds a stationary point of f(w) by iterating

w_{i+1} = w_i - \frac{f'(w_i)}{f''(w_i)}   (5.76)

Applied to L_2, with a learning rate (Learning Rate) \eta:

w_{i+1} = w_i - \eta\,\frac{L_2'(w_i)}{L_2''(w_i)}   (5.77)

where the first derivative (First Derivative) and second derivative (Second Derivative) of L_2 are

L_2' = \sum_{n=1}^{N} \left[y_n\,p(y_n = 0|x_n, w) - (1 - y_n)\,p(y_n = 1|x_n, w)\right]x_n   (5.78)

L_2'' = -\sum_{n=1}^{N} x_n\,p(y_n = 1|x_n, w)\,p(y_n = 0|x_n, w)\,x_n^T   (5.79)

Once trained, w classifies a new x by the sign of w^T x. Because logistic regression models the decision boundary directly rather than the data distribution, it is a discriminative classifier (Discriminative Classifier).
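Training can be sketched with plain gradient ascent on (5.75) (a simpler update than the Newton step of (5.77); the tiny dataset is illustrative and includes a bias column):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Maximise the log-likelihood (5.75) by gradient ascent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)          # p(y=1|x, w)
        w += lr * (X.T @ (y - p))   # gradient step: sum_n (y_n - p_n) x_n
    return w

# 1-D toy data with a bias column: class 1 when x is large
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = fit_logistic(X, y)
pred = (sigmoid(X @ w) >= 0.5).astype(float)
print(pred)  # [0. 0. 1. 1.]
```

The gradient here is the y_n − p_n form, which is algebraically the same as the bracketed term in (5.78).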

101
5.4

(Parametric Model)


(Non-parametric)
A popular non-parametric classifier is k-nearest neighbours (k-Nearest Neighbour (kNN)). kNN assumes that points close to x behave like x: to classify a query point x_q, it finds the k training points nearest to x_q and takes a majority vote of their labels. Given training pairs (x, y) with y = h(x), kNN predicts

h_{knn}(x_q) = majority(h_1(x_q), h_2(x_q), \ldots, h_k(x_q))   (5.80)

where h_i(x_q) is the label of the i-th nearest neighbour of x_q.
kNN (Lazy Learner)
kNN
( )
(Training Time)
xq kNN
(Eager Learner) kNN
xq

kNN

102
(Real Time)


(Global Approximation)
(Local Approximation) ( )
kNN

kNN kNN



kNN k k

(Cross Validation) k kNN
(Distance-weighted kNN) kNN



h_{knn}(x_q) = \frac{\sum_{n=1}^{N} w_n h(x_n)}{\sum_{n=1}^{N} w_n}   (5.81)

w_n = \frac{1}{d(x_q, x_n)}   (5.82)

where d(x_q, x_n) is the distance between x_q and x_n.
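Plain kNN (5.80) can be sketched in a few lines (the four training points are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X, y, xq, k=3):
    """kNN (5.80): majority vote among the k nearest training points."""
    dist = np.linalg.norm(X - xq, axis=1)          # Euclidean distances
    nearest = np.argsort(dist)[:k]                 # indices of k neighbours
    return Counter(y[i] for i in nearest).most_common(1)[0][0]

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y = ["a", "a", "b", "b"]
print(knn_predict(X, y, np.array([0.2, 0.1])))  # a
print(knn_predict(X, y, np.array([4.9, 5.1])))  # b
```

Note that all the work happens at query time, which is exactly the lazy-learner behaviour described above.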

103
5.5




accuracy = \frac{\#\text{ correctly classified examples}}{\#\text{ examples}}   (5.83)

The error on the training data is the training error (Training Error); what we actually care about is the generalisation error (Generalisation Error), the expected error on unseen data.

(
)

(Test Error)

1. (Training Dataset)


x y

2. (Testing Dataset)


x y

1.

104
2.


5.5.1

(Overfitting)






(Underfitting)


5.5.2



1
2


The counts are arranged in a confusion matrix (Confusion Matrix):

                    predicted Negative    predicted Positive
  actual Negative   a (true negative)     b (false positive)
  actual Positive   c (false negative)    d (true positive)

Table 5.2: The confusion matrix

Accuracy is \frac{a + d}{a + b + c + d} = 1 - error.


Several rates are derived from the confusion matrix:

True Positive Rate (Recall) = \frac{d}{c + d}

True Negative Rate (Specificity) = \frac{a}{a + b}

False Positive Rate (False Alarm) = \frac{b}{a + b}

False Negative Rate = \frac{c}{c + d}

A good classifier has a low false alarm rate (False Alarm) and a high recall (Recall).
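The rates can be computed directly from the four counts (the counts below are illustrative):

```python
def rates(tn, fp, fn, tp):
    """Confusion-matrix rates, with a, b, c, d = tn, fp, fn, tp (Table 5.2)."""
    return {
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
        "recall": tp / (fn + tp),          # true positive rate
        "specificity": tn / (tn + fp),     # true negative rate
        "false_alarm": fp / (tn + fp),     # false positive rate
    }

m = rates(tn=50, fp=10, fn=5, tp=35)
print(m["accuracy"], m["recall"])  # 0.85 0.875
```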

5.5.3

(Receiver Operating Characteristic Analysis) plots classifiers in a two-dimensional space with the False Positive Rate on the x axis and the True Positive Rate on the y axis, the ROC graph (ROC Graph), as in Figure 5.1.

Figure 5.1: The ROC space: False Positive Rate (x axis) versus True Positive Rate (y axis).


The ideal classifier sits at (0,1): a False Positive Rate of 0 and a True Positive Rate of 1. The point (0,0) corresponds to a classifier that always predicts negative, (1,1) to one that always predicts positive, and (1,0) to one that gets every example wrong (inverting its outputs would make it perfect).
107


A probabilistic classifier predicts 1 when p(y = 1|x) > 0.5 and 0 otherwise; the value 0.5 is the decision threshold (Decision Threshold). Different errors can carry different costs (Cost): a false positive is a type I error (Type I Error) and a false negative a type II error (Type II Error). When false positives are expensive, the threshold can be raised above 0.5 so the classifier predicts 1 only when p(y = 1|x) is high; when false negatives (type II errors) are the expensive ones, the threshold can be lowered below 0.5.


Sweeping the threshold from 0 to 1 traces out a sequence of (False Positive Rate, True Positive Rate) pairs: the ROC curve (ROC Curve) of the classification model (Classification Model), as in Figure 5.2a. The area under the curve (Area Under Curve (AUC)) summarises the whole curve in a single number between 0 and 1: the closer the AUC is to 1, the better the classifier. Figure 5.2 shows ROC curves with different AUC values.

Figure 5.2: ROC curves with (a) AUC = 100%, (b) AUC = 65%, and (c) AUC = 50%, the last being no better than random guessing.
109


[Fawcett, 2006]

5.5.4

(x, y)



1

(Hold-out Method)


k
k


Figure 5.3: 5-fold cross validation: the data are split into five parts (#1-#5), and each part takes one turn as the test set while the remaining parts train the model.

(
)

110
In k-fold cross validation (k-fold Cross Validation) the data are split into k parts; each round trains on k - 1 parts and tests on the remaining one, and the k test results are averaged. Figure 5.3 shows the case k = 5. Setting k = N, so that each fold holds a single example, gives leave-one-out cross validation (Leave-one-out Cross Validation).
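The index bookkeeping of k-fold cross validation can be sketched as follows:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Train/test index pairs for k-fold cross validation (Figure 5.3)."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    return [(np.concatenate(folds[:i] + folds[i + 1:]), folds[i])
            for i in range(k)]

for train_idx, test_idx in kfold_indices(10, 5):
    print(len(train_idx), len(test_idx))  # 8 2, on each of the 5 rounds
```

Each example appears in exactly one test fold, so the averaged test error uses every data point exactly once.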

111

1.

2.

3.

4.

5. k

6. k

7.

112
6


(Supervised Learning)






(Unlabelled Data)
(Class) (Cluster)

(Clustering)




(Partitioning Clustering)
(Hierarchical Clustering)
(Tree)

113
6.1



Given data D = (x_1, \ldots, x_N) with x_i = [x_i^1, \ldots, x_i^M] in M dimensions, define a distance function (Distance Function) between x_i and x_j, e.g. the Euclidean distance

d(x_i, x_j) = \sqrt{\sum_{m=1}^{M} (x_i^m - x_j^m)^2} = ||x_i - x_j||_2

Partitioning clustering assigns each point x_n a cluster label z_n, collected in z = [z_1, \ldots, z_N] \in \{1, \ldots, k\}^N, and seeks the assignment that minimises the distance of each point to its cluster centre m_{z_n}:

F_{obj} = \frac{1}{2}\sum_{n=1}^{N} ||x_n - m_{z_n}||^2   (6.1)

(k-means Algorithm)

6.1.1
(Mean)




114
Algorithm 3 k-means clustering
1: Randomly choose centres \mu_{1:k}
2: Initialise z
3: do
4:   z_n = \arg\min_{i \in \{1,\ldots,k\}} d(x_n, \mu_i)
5:   \mu_k = \frac{1}{N_k}\sum_{n: z_n = k} x_n
6: while z does not change
7: return z — the cluster assignment z
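Algorithm 3 can be sketched as follows (the tiny two-group dataset is illustrative, and the first k points are used as initial centres instead of a random choice, to keep the run deterministic):

```python
import numpy as np

def kmeans(X, k, iters=100):
    """k-means sketch following Algorithm 3."""
    centres = X[:k].astype(float).copy()   # deterministic init for the demo
    z = np.full(len(X), -1)
    for _ in range(iters):
        # assignment step: each point goes to its nearest centre
        dist = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        new_z = dist.argmin(axis=1)
        if np.array_equal(new_z, z):
            break                          # z did not change: converged
        z = new_z
        # update step: each centre moves to the mean of its points
        for c in range(k):
            if np.any(z == c):
                centres[c] = X[z == c].mean(axis=0)
    return z, centres

X = np.array([[4.0, 4.0], [-4.0, -4.0], [4.2, 3.9], [-4.1, -3.8],
              [3.9, 4.1], [-3.9, -4.2]])
z, centres = kmeans(X, 2)
print(z)  # [0 1 0 1 0 1]
```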

Algorithm 3 k

z
z

z
Figures 6.1-6.3 illustrate the algorithm on 80 data points drawn from 4 groups of 20 points centred at (4,4), (-4,4), (4,-4) and (-4,-4): the initial random centres, the assignments after the first iteration, and the final converged clusters.
115
Figure 6.1: The data points and the initial cluster centres.

Figure 6.2: Cluster assignments during the early iterations.

Figure 6.3: The final clusters after k-means converges.

Why does Algorithm 3 converge? Consider the objective as a function of both the assignments and the centres:

F(z_{1:N}, \mu_{1:k}) = \frac{1}{2}\sum_{n=1}^{N} ||x_n - \mu_{z_n}||^2   (6.2)

1. The assignment step picks each z_n to minimise F with the centres \mu fixed.

2. The update step picks each \mu_k to minimise F with z fixed.

Each step can only decrease (or keep) F, so F converges to a local minimum (Local Minimum); this style of alternating optimisation is a coordinate descent algorithm (Coordinate Descend Algorithm).
117
6.1.2


3
=1 =2 =3

2.2

(k-medoids) replaces the mean in the update step with the median, which is robust to outliers, as in Algorithm 4.

Algorithm 4 k-medoids clustering
1: Randomly choose medoids m_{1:k}
2: Initialise z
3: do
4:   z_n = \arg\min_{i \in \{1,\ldots,k\}} d(x_n, m_i)
5:   m_k = median(x_n) where z_n = k
6: while z does not change
7: return z — the cluster assignment z

6.1.3
The number of clusters k must be chosen in advance. A common heuristic (Heuristic) runs the algorithm for k = 1, 2, 3, \ldots and plots the final objective value against k: the objective always drops as k grows, but the drop flattens once k exceeds the true number of groups. In Figure 6.4 (the 4-group data above, with k from 1 to 9) the objective falls steeply from about 800 until k = 4, reaching about 300, and only creeps down after that, so k = 4 is a good choice.

[Figure: Objective function value (from about 800 down to about 200) plotted against the value of k (1 to 9).]

Figure 6.4: The final objective value for each choice of k; the curve flattens after k = 4.
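The elbow heuristic can be sketched as follows. The objective values below are made up to resemble Figure 6.4, and the 10% drop cut-off is an arbitrary choice for illustration.

```python
# The elbow heuristic: record the final objective for k = 1..9 and
# pick the last k that still gave a substantial improvement.

objectives = {1: 800, 2: 690, 3: 500, 4: 300,
              5: 280, 6: 265, 7: 255, 8: 250, 9: 248}

def elbow(objectives, min_drop=0.10):
    """Return the last k whose objective dropped by at least min_drop
    (as a fraction of the previous value) relative to k - 1."""
    ks = sorted(objectives)
    best = ks[0]
    for prev, k in zip(ks, ks[1:]):
        if (objectives[prev] - objectives[k]) / objectives[prev] >= min_drop:
            best = k
    return best

print(elbow(objectives))   # → 4
```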

6.2 Hierarchical Clustering

Hierarchical clustering (Hierarchical Clustering) approaches the problem differently from the partitioning methods of the previous section. Rather than producing a single flat division of the data, it produces a nested hierarchy of groups, so the number of groups k does not have to be chosen in advance; any desired number of groups can be obtained afterwards by cutting the hierarchy at a suitable level.

The most common strategy is agglomerative clustering (Agglomerative Clustering), which proceeds bottom-up (Bottom-up): every example starts in a group of its own, and at each step the two closest groups are merged, until all examples belong to a single group.
6.2.1 The Dendrogram

The output of hierarchical clustering is usually drawn as a dendrogram (Dendrogram), a tree in which the height of each junction records the distance at which two groups were merged, as in Figure 6.5. Merge heights are monotonic (Monotonic): each successive merge occurs at a distance no smaller than the previous one.

[Figure: a dendrogram over the five examples a, b, c, d and e.]

Figure 6.5: A dendrogram of 5 examples.

In this example, c and d are the closest pair, so they are merged first. At height 2, e joins the group {c, d}. After that a and b are merged, and finally the two remaining groups are joined at the root.
6.2.2 Linkage Criteria

Agglomerative clustering repeatedly merges the two closest groups, so it needs a distance between groups of examples, not just between individual examples. Three definitions are in common use.

Single linkage (Single Linkage) measures the distance between the closest pair of examples, one drawn from each group:

d_{SL}(G, H) = \min_{i \in G, j \in H} d(x_i, x_j)    (6.3)

where G and H are the two groups. Because only one close pair is needed, single linkage can merge groups that are connected merely by a thin chain of intermediate examples, a behaviour called chaining (Chaining) that sometimes yields long, straggly groups.

Complete linkage (Complete Linkage) measures the distance between the farthest pair instead:

d_{CL}(G, H) = \max_{i \in G, j \in H} d(x_i, x_j)    (6.4)

Two groups are then considered close only when all of their examples are close, so complete linkage tends to produce compact groups.

Group average (Group Average) averages the distance over every pair of examples:

d_{GA}(G, H) = \frac{1}{N_G N_H} \sum_{i \in G} \sum_{j \in H} d(x_i, x_j)    (6.5)

where N_G and N_H are the numbers of examples in G and H. Group average behaves as a compromise between the two extremes of single and complete linkage.
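The bottom-up merging process with single linkage (6.3) can be sketched directly; the five one-dimensional points below are made up so that the merge order mirrors the dendrogram of Figure 6.5 (c and d first, then e, then a and b).

```python
# Bottom-up (agglomerative) clustering with single linkage: start with
# every example in its own group and repeatedly merge the pair of
# groups whose closest members are nearest, recording each merge
# (the information a dendrogram such as Figure 6.5 displays).

def single_linkage_merges(X):
    """Return the merges as a list of (group_a, group_b, distance)."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    clusters = [frozenset([i]) for i in range(len(X))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(X[i], X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        merges.append((set(clusters[a]), set(clusters[b]), d))
        merged = clusters[a] | clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges

# Points for a, b, c, d, e chosen so c and d merge first, then e joins:
X = [(0.0,), (1.0,), (5.0,), (5.2,), (6.0,)]
for ga, gb, d in single_linkage_merges(X):
    print(sorted(ga | gb), round(d, 2))   # merges in order of height
```

Swapping the inner `min` for `max` gives complete linkage (6.4), and replacing it with the average over all pairs gives group average (6.5).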

Exercises

1.

2.

3.

4.

5.

Data set for the exercises:

(5, 10), (6, 19), (8, 15), (12, 5), (7, 13)

Bibliography
[Blei et al., 2003] Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

[Boyd and Vandenberghe, 2004] Boyd, S. and Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.

[Cleveland, 1993] Cleveland, W. S. (1993). Visualizing data. Hobart Press.

[Dempster et al., 1977] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–38.

[Eiben and Smith, 2003] Eiben, A. E. and Smith, J. E. (2003). Introduction to evolutionary computing, volume 53. Springer.

[Fawcett, 2006] Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874.

[Fawcett and Provost, 1997] Fawcett, T. and Provost, F. (1997). Adaptive fraud detection. Data Mining and Knowledge Discovery, 1(3):291–316.

[Fayyad et al., 2002] Fayyad, U. M., Wierse, A., and Grinstein, G. G. (2002). Information visualization in data mining and knowledge discovery. Morgan Kaufmann.

[Friedman, 1997] Friedman, J. H. (1997). On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1(1):55–77.

[Guyon and Elisseeff, 2003] Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar):1157–1182.

[Han et al., 2011] Han, J., Pei, J., and Kamber, M. (2011). Data mining: concepts and techniques. Elsevier.

[Hoffman and Grinstein, 2002] Hoffman, P. E. and Grinstein, G. G. (2002). A survey of visualizations for high-dimensional data mining. Information Visualization in Data Mining and Knowledge Discovery, pages 47–82.

[Indyk and Motwani, 1998] Indyk, P. and Motwani, R. (1998). Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 604–613. ACM.

[Jindal and Liu, 2007] Jindal, N. and Liu, B. (2007). Review spam detection. In Proceedings of the 16th International Conference on World Wide Web, pages 1189–1190. ACM.

[Johnson et al., 1986] Johnson, W. B., Lindenstrauss, J., and Schechtman, G. (1986). Extensions of Lipschitz maps into Banach spaces. Israel Journal of Mathematics, 54(2):129–138.

[Ke and Sukthankar, 2004] Ke, Y. and Sukthankar, R. (2004). PCA-SIFT: A more distinctive representation for local image descriptors. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II-506. IEEE.

[Lillesand et al., 2014] Lillesand, T., Kiefer, R. W., and Chipman, J. (2014). Remote sensing and image interpretation. John Wiley & Sons.

[MacKay, 2003] MacKay, D. J. (2003). Information theory, inference and learning algorithms. Cambridge University Press.

[Manjunath et al., 2001] Manjunath, B. S., Ohm, J.-R., Vasudevan, V. V., and Yamada, A. (2001). Color and texture descriptors. IEEE Transactions on Circuits and Systems for Video Technology, 11(6):703–715.

[Ng, 2004] Ng, A. Y. (2004). Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-first International Conference on Machine Learning, page 78. ACM.

[Ng and Jordan, 2002] Ng, A. Y. and Jordan, M. I. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In Advances in Neural Information Processing Systems, pages 841–848.

[Ojala et al., 2002] Ojala, T., Pietikainen, M., and Maenpaa, T. (2002). Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):971–987.

[Pang and Lee, 2008] Pang, B. and Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135.

[Richards, 1999] Richards, J. A. (1999). Remote sensing digital image analysis, volume 3. Springer.

[Sonka et al., 2014] Sonka, M., Hlavac, V., and Boyle, R. (2014). Image processing, analysis, and machine vision. Cengage Learning.

[Vapnik and Vapnik, 1998] Vapnik, V. N. and Vapnik, V. (1998). Statistical learning theory, volume 1. Wiley New York.

[Wasserman, 2013] Wasserman, L. (2013). All of statistics: a concise course in statistical inference. Springer Science & Business Media.
