
204423 (Data Mining)
2559










1



1.1
(Information) (Knowledge)
(Raw Data) (Relationship)
(Pattern) (Concept)
(Non-trivial)




(Knowledge Discovery in Databases)
(Knowledge Extraction) (Data Analysis)





7
(Query) SELECT
(SQL)


(Data Stream)
(Remote Sensing) [Richards, 1999, Lillesand et al., 2014]
(Graphical data)






1.2
3
(Data Preprocessing) (Data Processing) (Post Processing) 1.1

Figure 1.1: The three stages of the data mining process: Data, Data pre-processing, Data processing, Post processing, Knowledge.

8
(Data Cleaning) (Data Integration)
(Data Selection) (Data Normalising)
(Dimensionality Reduction) 2
3


(Association Rule) (Classification)
(Clustering) (Algorithm)
(Machine Learning)

4 5 6



(Multidisciplinary)

(Database system)



(Data Visualisation) [Cleveland, 1993, Fayyad et al., 2002]


1.2.1

(Image) (Text)
(Sound) (Time Series)

(Transactional Data)









(Frequent Itemset)

4

(Temporal Data)

-



(Spatial Data)




( )

10
Spatio-temporal Data

(Textual and Multimedia Data)



(Text) (Keyword)

(Task)
(Topic Analysis) [Blei et al., 2003]
(Abnormality Detection)
(Sentiment Analysis)

[Pang and Lee, 2008]
(Multimedia)



(Graphical Data)

World-Wide-Web
(Node) (Edge) (Hyperlink)
WWW WWW

AdSense Google

Facebook
Facebook

11
Facebook

1.2.2



2
(Descriptive) (Predictive)



(Class) (Concept)




(Concept/Class Descriptions)
(Data Characterisation)

(Data Discrimination) (Manually)

Feature Extraction


Frequent Patterns

12

buys(X,bike) buys(X,helmet)
[support=3%, condence=60%]




(Measure of Interestingness)
(Condence) (Support) 60%
60%

(Single-dimension Association Rule)

buys(X,bike) age(X,3045) buys(X,helmet)


[support=2%, condence=50%]


(Minimum Support Threshold) (Minimum Confidence Threshold)


(Model)


(Labelled Training Data)
(Classier)
(Decision Tree), (Artificial Neural Network), (Support Vector Machine), (Logistic Regression)
5

13





(Similarity Measure)
6

(Outlier Analysis)
(Outlier)
(Noise)
[Jindal and Liu, 2007]


[Fawcett and Provost, 1997]

1.2.3


[Han et al., 2011]

14

15

1.

2.

3.

16
2




(Data)
(Data Point),
(Example), (Instance) (Input Vector) x
(Data Set) S


(Feature)
3
3

n N
(Euclidean Space) M

xn = {x1n , x2n , . . . , xm
n , . . . , xn }
M

17
n N m M
m
N M

1 m M
x1 . . . x1 . . . x1
.. . . . .. ..
. . .
1 M
xn . . . xn . . . xn
m
.. .. . . . ..
. . .
1 m M
xN . . . x N . . . x N

(Row Vector)
(Column Vector)

2.0.1. S 5 ()
( ) ()

13 39 157
23 56 157

15 60 177

32 72 187
21 43 162

1 13 39 157
3 60

2.1




18
2.1.1

(Nominal Feature)
(Domain) (Finite Set)
{,,,,,}

2.1.2

(Binary Feature)
2 ( 2)
2 (Symmetric) (Asymmetric)




2.1.3

(Ordinal Feature)


5
4 1

2.1.4

(Numeric Feature)
(Integer) (Real Number)

19
2.2


2
3 (Mean) (Median) (Mode)

2.2.1
(Arithmetic Mean)

\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n   (2.1)

For M-dimensional data the mean is taken component-wise:

\bar{\mathbf{x}} = [\bar{x}^1, \bar{x}^2, \ldots, \bar{x}^M] = \left[\frac{1}{N}\sum_{n=1}^{N} x_n^1, \frac{1}{N}\sum_{n=1}^{N} x_n^2, \ldots, \frac{1}{N}\sum_{n=1}^{N} x_n^M\right]   (2.2)

The weighted average (Weighted Average):

\bar{x} = \frac{1}{N}\sum_{n=1}^{N} w_n x_n   (2.3)

where w_n is the weight of x_n; when \sum_{n=1}^{N} w_n = 1 the weighted average corresponds to the expected value (Expected Value).

20
(Extreme Value) (Outlier)
(Trimmed Mean)
k

2.2.1. 6 1 5
10000, 15000, 13000, 20000, 30000
2000000

\frac{10000 + 15000 + 13000 + 20000 + 30000 + 2000000}{6} = 348000


2.2.2
(Median)

median = \begin{cases} x_{(\frac{N+1}{2})} & \text{if } N \text{ is odd} \\ \frac{1}{2}\left(x_{(\frac{N}{2})} + x_{(\frac{N}{2}+1)}\right) & \text{if } N \text{ is even} \end{cases}   (2.4)

where x_{(i)} is the i-th value of the data sorted in ascending order.

2.2.3
(Mode)
(Distribution)
(Unimodal)
(Multimodal)

21
70
( )

2.2.4
(Variance)
Var(x) = \sigma^2 = \frac{1}{N}\sum_{n=1}^{N} (x_n - \mu)^2   (2.5)




(Standard Deviation)

The covariance (Covariance) between features i and j:

Cov(x^i, x^j) = \sigma_{ij} = \frac{1}{N}\sum_{n=1}^{N} (x_n^i - \mu_i)(x_n^j - \mu_j)   (2.6)

where \mu_i and \mu_j are the means of features i and j. The covariance matrix (Covariance Matrix):

Cov(\mathbf{x}) = \Sigma = \frac{1}{N}\sum_{n=1}^{N} (x_n - \mu)(x_n - \mu)^T   (2.7)

22
where (x_n - \mu)^T denotes transposition (Transposition). \Sigma is an M \times M matrix:

\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1M} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{M1} & \sigma_{M2} & \cdots & \sigma_M^2 \end{bmatrix}

The covariance matrix is a symmetric (Symmetric) square (Square) matrix: the diagonal (Diagonal) holds the variances of the individual features, and \sigma_{ij} = \sigma_{ji}.

2.2.5
(Quantile) — cutpoints Q that divide the range (Range) of the data into n intervals of equal probability; there are |Q| = n - 1 cutpoints. With n = 2 the cutpoint is the median, n = 4 gives the quartiles, and n = 100 gives the percentiles.

The difference between the third quartile (Q3) and the first quartile (Q1) is the inter-quartile range (Inter-Quartile Range (IQR)): IQR = Q3 - Q1. The IQR describes the spread of the middle half of the data and is often used to detect outliers.
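The summary statistics above can be computed directly; a minimal sketch (the data values reuse the salary example, not a dataset from the text):

```python
import numpy as np

data = np.array([10000, 15000, 13000, 20000, 30000, 2000000])

mean = data.mean()                      # arithmetic mean (2.1)
median = np.median(data)                # robust to the extreme value
q1, q3 = np.percentile(data, [25, 75])  # first and third quartiles
iqr = q3 - q1                           # inter-quartile range

print(mean, median, iqr)                # 348000.0 17500.0 14000.0
```

The 2,000,000 outlier pulls the mean to 348,000 while the median stays near the bulk of the data, which is why the median is preferred for skewed data.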

23
2.3

(Visualisation)

100000
(Outlier)

[Hoffman and Grinstein, 2002]

2.3.1
(Boxplot) 5 (Five Numbers Summary)

1. (Minimum)

2. 1

3.

4. 3

5. (Maximum)


From Figure 2.1, a mild outlier (Mild Outlier) is a value below Q1 - 1.5 IQR or above Q3 + 1.5 IQR; an extreme outlier (Extreme Outlier) is a value below Q1 - 3 IQR or above Q3 + 3 IQR.

24
Figure 2.1: A boxplot summarising the five-number summary (Min, Q1, Median, Q3, Max), with whiskers at 1.5 IQR from the box and points beyond marked as outliers (points beyond 3 IQR as extreme outliers).

2.3.2
(Histogram)


M 1
2 M 2.2

Figure 2.2: A histogram of 1000 data points.

25
2.3.3

(Scatter Plot)
2 3 3
2
2.3

Figure 2.3: A scatter plot of two features.

2.4
(Similarity)

(Similarity Measure)

26

(Dissimilarity)

2
(Similarity Matrix) ( )
(Dissimilarity Matrix)


\begin{bmatrix} 0 & & & & \\ d(x_2, x_1) & 0 & & & \\ d(x_3, x_1) & d(x_3, x_2) & 0 & & \\ \vdots & \vdots & \vdots & 0 & \\ d(x_N, x_1) & d(x_N, x_2) & \cdots & d(x_N, x_{N-1}) & 0 \end{bmatrix}   (2.8)

where d(x_i, x_j) is the dissimilarity between x_i and x_j computed by a distance function d(\cdot).

2.4.1



Common choices are the Manhattan distance (Manhattan Distance) and the Euclidean distance (Euclidean Distance), both special cases of the Minkowski distance (Minkowski Distance). For x_i, x_j \in \mathbb{R}^M:

d(x_i, x_j) = \left(|x_i^1 - x_j^1|^h + |x_i^2 - x_j^2|^h + \cdots + |x_i^M - x_j^M|^h\right)^{1/h}   (2.9)

A distance must satisfy three properties:

(Positive Definiteness): d(x_i, x_j) > 0 if i \neq j, and d(x_i, x_i) = 0

(Symmetry): d(x_i, x_j) = d(x_j, x_i)

(Triangle Inequality): d(x_i, x_j) \leq d(x_i, x_k) + d(x_k, x_j)

With h = 1 we get the L1, Manhattan, or cityblock distance (Cityblock Distance):

d(x_i, x_j) = |x_i^1 - x_j^1| + |x_i^2 - x_j^2| + \cdots + |x_i^M - x_j^M|   (2.10)

With h = 2 we get the L2 or Euclidean distance:

d(x_i, x_j) = \sqrt{(x_i^1 - x_j^1)^2 + (x_i^2 - x_j^2)^2 + \cdots + (x_i^M - x_j^M)^2}   (2.11)

With h = \infty we get the supremum distance (Supremum Distance), also written L_{max} or L_\infty:

d(x_i, x_j) = \max_f |x_i^f - x_j^f|   (2.12)
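The three distances above can be sketched in a few lines (the sample points are illustrative):

```python
import numpy as np

def minkowski(xi, xj, h):
    """Minkowski distance (2.9); h=1 Manhattan, h=2 Euclidean."""
    return np.sum(np.abs(xi - xj) ** h) ** (1.0 / h)

def supremum(xi, xj):
    """Supremum (L-infinity) distance (2.12)."""
    return np.max(np.abs(xi - xj))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(minkowski(a, b, 1))  # 7.0
print(minkowski(a, b, 2))  # 5.0
print(supremum(a, b))      # 4.0
```

Note how the same pair of points gives different distances as h grows: larger h weights the single largest coordinate difference more heavily.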

2.4.2

For two data points whose features are binary, count the feature positions in a 2x2 contingency table (Contingency Table):

             x_j = 1   x_j = 0
  x_i = 1       q         r
  x_i = 0       s         t

Table 2.1: Contingency table for binary features

For symmetric binary features:

d(x_i, x_j) = \frac{r + s}{q + r + s + t}   (2.13)

For asymmetric binary features, 1-1 matches (q) are informative but 0-0 matches (t) are not, so t is dropped:

d(x_i, x_j) = \frac{r + s}{q + r + s}   (2.14)

Example 2.4.1. Let P denote a Positive test result and N a Negative one, over six tests for three patients (the patient names were lost in extraction; call them A, B and C):

  test   1  2  3  4  5  6
   A     P  N  P  N  N  N
   B     P  N  P  N  P  N
   C     P  P  N  N  N  N

Using the asymmetric distance (2.14):

d(A, B) = \frac{0 + 1}{2 + 0 + 1} = 0.33
d(A, C) = \frac{1 + 1}{1 + 1 + 1} = 0.67
d(B, C) = \frac{1 + 2}{1 + 1 + 2} = 0.75

so A and B are the most similar pair, and B and C the least similar.
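The example can be checked with a small sketch (encoding P as 1 and N as 0):

```python
def asym_binary_distance(x, y):
    """Asymmetric binary dissimilarity (2.14): 0-0 matches are ignored."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return (r + s) / (q + r + s)

A = [1, 0, 1, 0, 0, 0]   # P N P N N N
B = [1, 0, 1, 0, 1, 0]   # P N P N P N
print(round(asym_binary_distance(A, B), 2))  # 0.33
```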

2.4.3


2


1

d(x_i, x_j) = \frac{M - P}{M}   (2.15)

where M is the number of features and P is the number of features on which x_i and x_j have the same value.

2
(Binary Feature Encoding) (State)
( ) (Binary Code)


3 110, 011, 101
2 2 (Encoded Binary
Feature)

30
2.4.4

1
1
(Absolute Value) 2

0 ( ) (Unbounded)

2
Map each value to its rank (Rank) r \in \{1, \ldots, K\} and normalise it to [0, 1]:

Z = \frac{r - 1}{K - 1}   (2.16)

where K is the number of ordinal levels.

2.4.5
(Cosine Similarity) measures the angle between two vectors: it is 1 when they point in the same direction, -1 when they point in opposite directions (180 degrees apart), and 0 when they are orthogonal:

cos(x_i, x_j) = \frac{x_i^T x_j}{||x_i||\,||x_j||}   (2.17)

where ||x_i|| is the L2 norm of x_i.
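A minimal sketch of (2.17):

```python
import numpy as np

def cosine_similarity(xi, xj):
    """Cosine similarity (2.17): dot product over the product of L2 norms."""
    return xi @ xj / (np.linalg.norm(xi) * np.linalg.norm(xj))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0
print(cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0])))  # ~1.0
```

Parallel vectors score near 1 regardless of their lengths, which is why cosine similarity is popular for comparing documents of different sizes.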

2.5



4
1. (Accuracy)

2. (Completeness)

3. (Consistency)

4. (Timeliness)

4
4

2.5.1


A B
A cust.id B
customer.id
A B


3

32





2

2
(Random Variable)

p(X, Y ) p(X)
p(Y ) p(X, Y ) = p(X) p(Y )
(Null Hypothesis)

2


Consider two discrete random variables X and Y observed jointly, with counts collected in a contingency table (the row and column labels were lost in extraction):

           Y = y1   Y = y2   total
  X = x1     250      200      450
  X = x2      50     1000     1050
  total      300     1200     1500
The chi-square (\chi^2) statistic tests whether an R-row, C-column table is consistent with independence:

\chi^2 = \sum_{i=1}^{R}\sum_{j=1}^{C} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}   (2.18)

where o_{ij} is the observed frequency (Observed Frequency) and e_{ij} is the expected frequency (Expected Frequency) under independence:

e_{ij} = \frac{\left(\sum_{k=1}^{C} o_{ik}\right)\left(\sum_{l=1}^{R} o_{lj}\right)}{N}   (2.19)

i.e. the row total times the column total divided by N, the probability-weighted count expected if X = i and Y = j were independent. For example, e_{11} = 300 \times 450 / 1500 = 90. The expected counts are shown in parentheses:

           Y = y1       Y = y2       total
  X = x1   250 (90)     200 (360)     450
  X = x2    50 (210)   1000 (840)    1050
  total     300         1200         1500

Substituting o_{ij} and e_{ij} into (2.18):

\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 507.93   (2.20)

34

The computed \chi^2 value is compared against the chi-square distribution: we evaluate P(\chi^2_{df} > \chi^2) using the cumulative distribution function (Cumulative Distribution Function (CDF)) of the \chi^2 distribution with degrees of freedom (Degree of Freedom) df = (R - 1)(C - 1). If the CDF value exceeds 0.95 (significance value 0.05), the independence hypothesis is rejected and the two features are considered correlated.
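The computation of (2.18)-(2.20) can be sketched with plain array operations (the text rounds the result to 507.93):

```python
import numpy as np

observed = np.array([[250.0, 200.0],
                     [50.0, 1000.0]])

row_tot = observed.sum(axis=1, keepdims=True)
col_tot = observed.sum(axis=0, keepdims=True)
n = observed.sum()

expected = row_tot @ col_tot / n                       # e_ij per (2.19)
chi2 = ((observed - expected) ** 2 / expected).sum()   # (2.18)
print(round(chi2, 2))  # 507.94
```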


For numeric features x^i and x^j, Cov(x^i, x^j) = 0 when the features are uncorrelated. The correlation coefficient normalises the covariance by the standard deviations:

R(x^i, x^j) = \frac{Cov(x^i, x^j)}{\sigma_{x^i}\,\sigma_{x^j}}   (2.21)

R(X, Y) > 0 means the two features tend to increase together, R(X, Y) < 0 means one increases as the other decreases, and R(X, Y) = 0 means there is no linear relationship.
35
Figure 2.4: Scatter plots of feature pairs with different correlations: (a) positively correlated (R = 0.94); (b) negatively correlated (R = -0.95); (c) uncorrelated (R = 0.05); (d) uncorrelated (R = -0.02).

36


Note, however, that correlation does not imply causation: a high correlation between two features does not mean that one causes the other.

2.5.2
(Data reduction)









(Para-
metric) (Non-parametric)




(Model)
(Models Parameters)




A popular parametric model is the Gaussian Mixture Model (GMM), a weighted sum of k Gaussian components:

GMM(x) = \sum_{i=1}^{k} \pi_i\, N(x; \mu_i, \Sigma_i)   (2.22)

where \pi_i \geq 0 and \sum_{i=1}^{k} \pi_i = 1 are the mixing weights (Mixing Weight) of the GMM. The parameters \pi_i, \mu_i, \Sigma_i are commonly estimated with the Expectation-Maximisation (EM) algorithm [Dempster et al., 1977].

GMM


(Poisson Distribution)






(Equal-width Histogram) (Equal-frequency Histogram)





GMM

38
(k-means)
6





(Sampling)

(Random Sampling)

(Skewed Distribution)



(Stratified Sampling) — there are four common sampling schemes: simple random sampling (Simple Random Sampling), random sampling with replacement (Random Sampling with Replacement), random sampling without replacement (Random Sampling without Replacement), and stratified sampling (Stratified Sampling).

2.5.3




39

(Incomplete)
(Missing Feature)
(Missing Value)

(Incorrect) (Noise),
(Outlier) (Extreme Value)





-1







(Random Error) (Random Deviation)


3


(Quantisation) 1
(Bin)
1

40
( )
()
Zero-mean noise (Zero-mean Noise): the observed value X' is the true value X plus a noise term \epsilon:

X' = X + \epsilon   (2.23)

where \epsilon is drawn from a zero-mean distribution, e.g. N(0, \sigma). Averaging many observations of X' cancels the noise:

\frac{\sum_{n=1}^{N} X'_n}{N} = \frac{\sum_{n=1}^{N} X_n}{N} + \frac{\sum_{n=1}^{N} \epsilon_n}{N} = \frac{\sum_{n=1}^{N} X_n}{N} + 0   (2.24)

since the mean of zero-mean noise approaches 0.



f () a b b = f (a)
f ()
b a b
b

41
b
b
2.5
(
)

Figure 2.5: Fitting a function to noisy data; the fitted curve smooths out the random noise.

2.5.4


(Mapping Function) z
x
x = f (z) (2.25)
3

42

(Feature Construction)


(Pixel) 3

1
100 100 1
10000 10000 1



(Feature
Vector)
3

(Texture Extraction)
Local
Binary Pattern [Ojala et al., 2002]

(Shape Extraction)

[Ke and Sukthankar, 2004]

(Colour Extraction)
[Manjunath et al., 2001]


(Image Processing) [Sonka et al., 2014]


43



Three common normalisation methods:

1. Min-Max normalisation maps a value v in [min, max] to v' in a new range [min_n, max_n]:

v' = \frac{v - min}{max - min}(max_n - min_n) + min_n   (2.26)

2. Z-score normalisation rescales a feature to mean \mu = 0 and standard deviation \sigma = 1:

v' = \frac{v - \mu}{\sigma}   (2.27)

3. Decimal scaling (Decimal Scaling) divides by a power of ten so that values fall below 1:

v' = \frac{v}{10^j}, \quad \text{where } j \text{ is the smallest integer such that } \max(|v'|) < 1   (2.28)
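The first two methods can be sketched as follows (the input values are illustrative):

```python
import numpy as np

def min_max(v, new_min=0.0, new_max=1.0):
    """Min-max normalisation (2.26)."""
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score(v):
    """Z-score normalisation (2.27)."""
    return (v - v.mean()) / v.std()

v = np.array([10.0, 20.0, 30.0, 40.0])
print(min_max(v))    # values rescaled into [0, 1]
print(z_score(v))    # mean 0, standard deviation 1
```

Min-max is sensitive to outliers (a single extreme value compresses everything else), while z-score is preferable when the min and max are not known in advance.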


(Discretisation)




44


45

1.



2.

3.

2, 3, 4, 1, 11, 10, 7, 5, 2, 3, 6, 0, 4, 2

4.

5. 1000 1.
2.

50
550
250
150

46
3







(k-means
clustering) k (k-Nearest Neighbours)
[Indyk and Motwani, 1998]

(Hyperplane)


(Curse of Dimensionality) [Friedman, 1997]



3
3
This chapter covers three dimensionality reduction techniques: Principal Component Analysis, feature subset selection (Feature Subset Selection), and random projection (Random Projection).

3.1
Given N data points x_1, \ldots, x_N in M dimensions, consider first finding a single representative point x_0 that is as close as possible to all the points \{x_n\}_{n=1}^{N}. Define the objective f_0(x_0) as the sum of squared distances:

f_0(x_0) = \sum_{n=1}^{N} ||x_0 - x_n||^2   (3.1)

where ||x|| = \sqrt{(x^1)^2 + (x^2)^2 + \cdots + (x^M)^2}. This is a convex function (Convex Function) of x_0, so the optimal x_0 is found by setting the derivative with respect to x_0 to 0:

\frac{\partial f_0}{\partial x_0} = \sum_{n=1}^{N} 2(x_0 - x_n) = 0   (3.2)

x_0 = \frac{1}{N}\sum_{n=1}^{N} x_n = \mu   (3.3)

That is, the best zero-dimensional representative x_0 is the mean of the data.
48
Figure 3.1: A set of data points and their representative point, the mean.


3.1



A richer representation projects (Project) each point (Point) onto a line (Line). The projection (Projection) of x = (x^1, x^2, \ldots, x^M) onto a vector e = (e^1, e^2, \ldots, e^M) is the linear combination (Linear Combination):

e^T x = \sum_{i=1}^{M} e^i x^i   (3.4)
49
Figure 3.2: Projecting a data point x_n onto the line y through the mean.

A point on the line through \mu with direction e (Figure 3.2) can be written as

y = \mu + a e   (3.5)

where a is the signed distance of y from \mu along e. Representing each x_n by a point y_n = \mu + a_n e on the line, the total squared error is

f_1(a_1, \ldots, a_N, e) = \sum_{n=1}^{N} ||(\mu + a_n e) - x_n||^2
= \sum_{n=1}^{N} a_n^2 ||e||^2 - 2\sum_{n=1}^{N} a_n e^T (x_n - \mu) + \sum_{n=1}^{N} ||x_n - \mu||^2   (3.6)
50
To find the best a_n, take the partial derivative (Partial Derivative) of f_1(\cdot) with respect to a_n and set it to zero:

\frac{\partial f_1}{\partial a_n} = 2 a_n ||e||^2 - 2 e^T (x_n - \mu) = 0
a_n = e^T (x_n - \mu)   (3.7)

i.e. the best a_n is the projection of x_n - \mu onto e. Substituting (3.7) back into (3.6), with ||e|| = 1:

f_1(e) = \sum_{n=1}^{N} a_n^2 - 2\sum_{n=1}^{N} a_n^2 + \sum_{n=1}^{N} ||x_n - \mu||^2   (3.8)
= -\sum_{n=1}^{N} [e^T (x_n - \mu)]^2 + \sum_{n=1}^{N} ||x_n - \mu||^2   (3.9)
= -\sum_{n=1}^{N} e^T (x_n - \mu)(x_n - \mu)^T e + \sum_{n=1}^{N} ||x_n - \mu||^2   (3.10)
= -e^T S e + \sum_{n=1}^{N} ||x_n - \mu||^2   (3.11)

where S = \sum_{n=1}^{N} (x_n - \mu)(x_n - \mu)^T is the scatter matrix (Scatter Matrix).
n=1

(Scatter Matrix) S = N n=1 (xn )(xn )
T

Minimising f_1 thus means maximising e^T S e. Since e^T S e grows without bound with the length of e, we constrain e to be a unit vector, ||e|| = 1, and use a Lagrange multiplier (Lagrange Multiplier):

u = e^T S e - \lambda(e^T e - 1)   (3.12)

\frac{\partial u}{\partial e} = 2 S e - 2\lambda e   (3.13)

Setting this to zero gives an eigensystem (Eigensystem): the e that maximises e^T S e is an eigenvector (Eigenvector) of S:

S e = \lambda e   (3.14)

where \lambda is the eigenvalue (Eigenvalue) of e. From linear algebra (Linear Algebra), an M \times M matrix has up to M eigenvector/eigenvalue pairs, and e^T S e = e^T \lambda e = \lambda, so e^T S e is maximised by the eigenvector with the largest eigenvalue. Sorting the pairs (e_i, \lambda_i) by decreasing eigenvalue, the top k eigenvectors form a basis (Basis) matrix E:

E = \begin{bmatrix} e_1^1 & e_2^1 & \cdots & e_k^1 \\ e_1^2 & e_2^2 & \cdots & e_k^2 \\ \vdots & \vdots & \ddots & \vdots \\ e_1^M & e_2^M & \cdots & e_k^M \end{bmatrix}   (3.15)

52
Projecting x - \mu from M down to k dimensions:

E^T (x - \mu) = \begin{bmatrix} e_1^1 & e_1^2 & \cdots & e_1^M \\ e_2^1 & e_2^2 & \cdots & e_2^M \\ \vdots & \vdots & \ddots & \vdots \\ e_k^1 & e_k^2 & \cdots & e_k^M \end{bmatrix} \begin{bmatrix} x^1 - \mu^1 \\ x^2 - \mu^2 \\ \vdots \\ x^M - \mu^M \end{bmatrix}   (3.16)

When k < M this reduces the dimensionality of the data. A point is reconstructed from its projection by

y = \mu + \sum_{m=1}^{k} a_m e_m   (3.17)
= \mu + E a   (3.18)
a = E^T (x - \mu)   (3.19)

The coefficients a of a point x are its principal components (Principal Components); the k columns of E are the top-k eigenvectors of the scatter matrix S. In practice the whole data matrix (Data Matrix) is pre-multiplied (Pre-multiply) by E^T at once, as in Algorithm 1.

Algorithm 1 Principal Component Analysis
1: Centre the data: X ← X − μ so that the mean of X is 0
2: S = cov(X)
3: λ, E_all = eig(S) — eigenvalues and eigenvectors of S
4: Select the k most important principal components and put them in matrix E
5: X_pca = E^T X — project the data with E
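Algorithm 1 can be sketched with numpy (the random data are illustrative; rows are data points, so the projection is written X_c E rather than E^T X):

```python
import numpy as np

def pca(X, k):
    """PCA sketch following Algorithm 1; rows of X are data points."""
    mu = X.mean(axis=0)
    Xc = X - mu                          # centre the data (step 1)
    S = np.cov(Xc, rowvar=False)         # covariance matrix (step 2)
    eigval, eigvec = np.linalg.eigh(S)   # eigh: S is symmetric (step 3)
    order = np.argsort(eigval)[::-1]     # sort by decreasing eigenvalue
    E = eigvec[:, order[:k]]             # top-k principal directions (step 4)
    return Xc @ E                        # projected data, shape (N, k) (step 5)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])
Z = pca(X, 1)
print(Z.shape)  # (100, 1)
```

The projected data keep the high-variance direction (standard deviation about 3) and drop the low-variance one.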

53
3.1.1

Each eigenvalue measures the variance of the data along its eigenvector:

S e_i = \lambda_i e_i   (3.20)

where \lambda_i is the i-th eigenvalue of S. Discarding the directions with small eigenvalues therefore loses little information. The reconstruction error (Error) from keeping only the top k components in E is the fraction of discarded variance:

\epsilon = \frac{\sum_{i=k+1}^{M} \lambda_i}{\sum_{i=1}^{M} \lambda_i}   (3.21)

so k can be chosen as the smallest value for which \epsilon is acceptably small.

3.2

(Feature Selection)





There are three approaches. The filter method (Filter Method) scores features before any learning takes place, often with measures from information theory (Information Theory); see Figure 3.3.

Figure 3.3: The filter method: a set of features is scored by the filter method, and the selected subset is passed to the learning algorithm to build the model.

The wrapper method (Wrapper Method) scores feature subsets by the performance of the learning algorithm itself. The embedded method (Embedded Method) performs feature selection as part of training the model.
3.2.1







3.3






1
Information Theory [MacKay, 2003])

55



One filter score for a binary-class problem is the signal-to-noise ratio (Signal-to-Noise Ratio):

S2N = \frac{signal}{noise}   (3.22)

For feature k it compares the class means relative to the class spreads:

S2N = \frac{|\mu_0^k - \mu_1^k|}{\sigma_0^k + \sigma_1^k}   (3.23)

where \mu_0^k and \sigma_0^k are the mean and standard deviation of feature k over the N_0 data points of class 0:

\mu_0^k = \frac{\sum_{n=1}^{N_0} x_n^k}{N_0}   (3.24)

(\sigma_0^k)^2 = \frac{\sum_{n=1}^{N_0} (x_n^k - \mu_0^k)^2}{N_0}   (3.25)

and \mu_1^k, \sigma_1^k are defined analogously over the N_1 points of class 1. A larger S2N means the feature separates the two classes better; ranking all M features by their S2N gives a feature ordering for selection.
56

Another filter score is information gain (Information Gain). For a feature X and class label Y, it measures how much knowing X reduces the uncertainty about Y, where uncertainty is measured by entropy (Entropy). The information content (Information Content) of an outcome, and the entropy as its expected value (Expected Value), are

I(X) = \log_2\left(\frac{1}{P(X)}\right)   (3.26)

H(X) = E[I(X)]   (3.27)
= E[-\log_2(P(X))]   (3.28)
= -\sum_{n=1}^{N} P(x_n)\log_2 P(x_n)   (3.29)

For a fair (Fair) coin toss the entropy is maximal, 1 bit:

H(X) = -\sum_{n=1}^{N} P(x_n)\log_2 P(x_n)   (3.30)
= -[0.5\log_2(2^{-1}) + 0.5\log_2(2^{-1})]   (3.31)
= 1   (3.32)

For a biased (Biased) coin that always lands the same way there is no uncertainty at all:

H(X) = -\sum_{n=1}^{N} P(x_n)\log_2 P(x_n)   (3.33)
= -[1\log_2(1) + 0\log_2(0)]   (3.34)
= 0   (3.35)

The information gain of feature X is the reduction in the entropy of Y once X is known:

IG(X) = H(Y) - H(Y|X)   (3.36)

For a discrete event (Discrete Event) Y with k classes:

H(Y) = -\sum_{i=1}^{k} P(Y = y_i)\log_2(P(Y = y_i))   (3.37)

H(Y|X) = \sum_{j=1}^{r} P(X = x_j) H(Y|X = x_j)   (3.38)

where H(Y|X = x_j) is the entropy of Y restricted to the data points with X = x_j:

H(Y|X = x_j) = -\sum_{i=1}^{k} P(Y = y_i|X = x_j)\log_2(P(Y = y_i|X = x_j))   (3.39)
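Equations (3.36)-(3.39) can be sketched as follows (the tiny label lists are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) per (3.37)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys):
    """IG(X) = H(Y) - H(Y|X) per (3.36) and (3.38)."""
    n = len(ys)
    cond = 0.0
    for v in set(xs):
        subset = [y for x, y in zip(xs, ys) if x == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(ys) - cond

ys = [1, 1, 0, 0]
print(entropy(ys))                         # 1.0, like the fair coin
print(information_gain([0, 0, 1, 1], ys))  # 1.0: X fully determines Y
print(information_gain([0, 1, 0, 1], ys))  # 0.0: X tells us nothing about Y
```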

58





Mutual information (Mutual Information) measures the dependence between two features X and Y; it is zero exactly when they are independent:

I(x, y) = \sum_{x}\sum_{y} p(x, y)\log\frac{p(x, y)}{p(x)p(y)}   (3.40)


1.
x1 x2

2. X Y



M < M



( )
[Guyon and Elisseeff, 2003]

59
3.2.2







(Search Strategy)


(Mean Squared
Error)
(Brute Force)
(Greedy Approach) (Evolutionary Algorithm)



2 2
(Forward Selection) (Backward Elimination)



( )


[Guyon and Elisseeff, 2003]
2
[Eiben and Smith, 2003]

60






[Guyon and Elisseeff, 2003]

3.2.3



The embedded approach adds a regularisation (Regularisation) term to the training objective:

f_{obj} = \text{objective} + \lambda \cdot \text{regularisation}   (3.41)

With an L1 penalty on the weights:

f_{obj} = \arg\min_{w} \underbrace{\sum_{i=1}^{N} y_i(w^T x_i + b)}_{\text{objective function}} + \underbrace{\lambda\sum_{j=1}^{M} |w_j|}_{\text{regularisation}}   (3.42)

where \lambda is the regularisation parameter (Regularisation Parameter). The L1 penalty is the sum of the absolute values of the weights (w), so minimising it drives many weights w_j to exactly 0; a feature j with w_j = 0 is effectively discarded, so feature selection happens inside training [Ng, 2004].

61
3.3
(PCA)

PCA


PCA




(Random Projection) Johnson
Lindenstrauss Lemma

Lemma 1 ([Johnson et al., 1986]). Let \epsilon \in (0, 1). Let k, M, N \in \mathbb{N} such that k \geq C\epsilon^{-2}\log N, for a large enough absolute constant C. Let V \subset \mathbb{R}^M be a set of N points. Then there exists a linear mapping R : \mathbb{R}^M \to \mathbb{R}^k, such that for all u, v \in V:

(1 - \epsilon)||u - v||^2_{\ell_2} \leq ||Ru - Rv||^2_{\ell_2} \leq (1 + \epsilon)||u - v||^2_{\ell_2}   (3.43)

That is, for a set V of N points in M dimensions there exists a linear map, realisable as a random matrix (Random Matrix) R, into k dimensions that preserves all pairwise relative distances (Relative distance) up to a factor of 1 \pm \epsilon. Notably, k depends on N but not on M, so even very high-dimensional data can be projected down aggressively while distances are approximately preserved.
62

1.

2.

3.

4.

5.

63
4






(Recommendation Systems)

4.1
(Association Rule)
. . . A = B A B
2
(Frequent Pattern)

(Frequent Itemset)

64

(Frequent Sequential Pattern)


(Frequent Structured Pattern)


A
B





(Market Basket Analysis)




4.1.1


Let I = \{I_1, I_2, \ldots, I_m\} be the set of items and D = \{t_1, t_2, \ldots, t_n\} the transaction set (Transaction Set), where each transaction t_i is a subset of I. A transaction t_i contains an itemset A when A \subseteq t_i. An association rule is an implication (Implication)

A \Rightarrow B   (4.1)

where A \subset I, B \subset I and A \cap B = \emptyset. Over D the rule is measured by its support (Support) s, the fraction of transactions containing both A and B (A \cup B), and its confidence (Confidence) c, the fraction of transactions containing A that also contain B:

support(A \Rightarrow B) = P(A \cup B) = \frac{\#\text{ of } t_i \text{ containing } A \cup B}{\#\text{ of } t_i}   (4.2)

confidence(A \Rightarrow B) = P(B|A) = \frac{support(A \cup B)}{support(A)} = \frac{\#\text{ of } t_i \text{ containing } A \cup B}{\#\text{ of } t_i \text{ containing } A}   (4.3)

Here A is the antecedent (Antecedence) and B the consequent (Consequence). Given a minimum support threshold (Minimum Support Threshold) s and a minimum confidence threshold (Minimum Confidence Threshold) c, a rule meeting both is called strong (Strong). An itemset F = A \cup B with k items whose support is at least s is a frequent k-itemset (Frequent k-itemset); the set of all frequent k-itemsets is denoted L_k.

Example 4.1.1. Consider the five transactions over four items in Table 4.1 (the item names were lost in extraction; call the items a, b, c, d):

  transaction   a  b  c  d
       1        1  1  0  0
       2        0  0  1  0
       3        0  0  0  1
       4        1  1  1  0
       5        0  1  0  0

Table 4.1: A transaction set

With A = {a} and B = {b}:

support(A \Rightarrow B) = \frac{2}{5} = 0.4 = 40\%

confidence(A \Rightarrow B) = \frac{0.4}{0.4} = 1 = 100\%

If the minimum support threshold is 40% and the minimum confidence threshold is 60%, the rule A \Rightarrow B is strong, and F = A \cup B = {a, b} is a frequent 2-itemset.


Association rule mining therefore proceeds in two steps:

1. Find all frequent itemsets that meet the minimum support threshold.

2. Generate strong rules from the frequent itemsets using the minimum confidence threshold.
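The two measures (4.2)-(4.3) can be sketched directly (items are named a-d only for illustration, matching nothing in the original table):

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item of the itemset (4.2)."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(a, b, transactions):
    """support(A ∪ B) / support(A) per (4.3)."""
    return support(a | b, transactions) / support(a, transactions)

T = [{"a", "b"}, {"c"}, {"d"}, {"a", "b", "c"}, {"b"}]
print(support({"a", "b"}, T))        # 0.4
print(confidence({"a"}, {"b"}, T))   # 1.0
```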

4.1.2

(Apriori Algorithm) (Breadth-rst-
search) k (k+1)
1 (1-itemset)
1 L1 L2 L1 L2
2 (Frequent 2-itemsets) L2 L3
k
k
(k+1) (k+1)

67
(Apriori Property)

Apriori Property:


I
I A I
I A I I
I A

Algorithm 2
1: Find all strong 1-itemsets
2: while Lk1 is non-empty set do
3: Ck = apriori-gen(Lk1 )
4: For each c in Ck , initialise c.count to zero
5: for records r in the database do
6: Cr = subset(Ck , r) ; for each c in Cr , c.count++
7: Set Lk to all c in Ck whose support is greater than minimum support
8: end for
9: end while
10: Return all of the Lk sets
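Algorithm 2 can be sketched compactly (support counting is done naively here; a real implementation would use the subset/count structure of the pseudocode):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Apriori sketch: breadth-first generation of frequent itemsets."""
    n = len(transactions)

    def frequent(candidates):
        return {c for c in candidates
                if sum(1 for t in transactions if c <= t) / n >= min_support}

    items = {i for t in transactions for i in t}
    L = frequent({frozenset([i]) for i in items})   # frequent 1-itemsets
    result = set(L)
    k = 2
    while L:
        # join step: unite pairs of frequent (k-1)-itemsets into k-itemsets
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # prune step: every (k-1)-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        L = frequent(candidates)
        result |= L
        k += 1
    return result

T = [{"a", "b"}, {"c"}, {"d"}, {"a", "b", "c"}, {"b"}]
print(apriori(T, 0.4))
```

On the Table 4.1 data this returns {a}, {b}, {c} and {a, b}: {d} fails the support threshold, so by the Apriori property no superset of {d} is ever generated.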

apriori-gen
(Joining) (Pruning)


k ( k ) Lk
Lk1 Lk1 Ck

68

In the join step, L_{k-1} is joined with itself: two itemsets l_1 and l_2 in L_{k-1} are joinable when their first (k - 2) items are identical,

(l_1[1] = l_2[1]) \wedge (l_1[2] = l_2[2]) \wedge \cdots \wedge (l_1[k-2] = l_2[k-2])

and, to avoid generating duplicates, (l_1[k-1] < l_2[k-1]). Joining l_1 and l_2 yields the candidate k-itemset

l_1[1], l_1[2], \ldots, l_1[k-2], l_1[k-1], l_2[k-1]


Ck Lk k
Ck Ck
Lk
Ck Ck
Ck
Ck (k-1)
k (k-1) k
Lk1

For each frequent itemset l and each non-empty subset s \subset l, output the rule

s \Rightarrow (l - s) \quad \text{if} \quad \frac{support\_count(l)}{support\_count(s)} \geq min\_confidence


A long frequent itemset implies an enormous number of frequent subsets: an itemset A with 100 items has on the order of 2^{100} subsets, all of them frequent. Two compressed representations avoid enumerating them. An itemset A is a closed frequent itemset (Closed Frequent Itemset) if A is frequent and no superset (Superset) B \supset A has the same support count as A; for example, if A = {milk} and B = {milk, bread} appear in exactly the same transactions, only B needs to be kept. An itemset A is a maximal frequent itemset (Maximal Frequent Itemset) if A is frequent (Frequent) and no superset B \supset A is frequent at all.

4.2


70


A famous example is the Netflix Prize: Netflix offered one million dollars for a recommender that beat its own system by a set margin, and the winning team, BellKor's Pragmatic Chaos, included AT&T researchers. Amazon likewise recommends products based on purchase histories.




Recommender systems fall into two families: content-based systems (Content-based System), which recommend items similar to those a user already liked, and collaborative filtering (Collaborative Filtering), which recommends items liked by similar users. Physical stores can only stock popular items, while online catalogues can also serve the long tail (Long-tail Phenomenon) of rarely demanded items, as sketched in Figure 4.1.

Figure 4.1: The long-tail phenomenon: a few items are very popular, while most items are each demanded rarely.


( )

71
Table 4.2: A utility matrix of 4 users and 7 movies (most of its entries were lost in extraction); blank cells are unrated items.

4.2.1

(Utility Matrix)



4 1-5 4.2








1 K
0 1

72

4.2.2






(Item Prole) (User)
(User Prole)







4.3

73
Table 4.3: An item profile: each row is a movie (Ant man, Avenger, Iron man), each column an actor (S.Johansson, C.Evans, R.Downey), and a 1 marks that the actor appears in the movie.







4.4

Table 4.4: The item profiles of Table 4.3 extended with further features (the extra rows were lost in extraction).

1.

2.

3.

74
(Rating) j i


pi,j = ((j k))/|Si | (4.4)
Si

Si i (j k) 1
k j 0 k
i j


pi,j = (vc vi )/|Cj | (4.5)
cCj

Cj i j
vi i |Cj |
Cj
4.

4.2.1.

/ S.Johansson C.Evans R.Downey


Ant man 1
Avenger 1 1 1
Iron man 1 1
4.5:

Iron man
( )

75
/ Ant man Avenger Iron man
1
1
1 1
4.6:

/ S.Johansson C.Evans R.Downey


0 0 0
4.7:

/ S.Johansson C.Evans R.Downey


1 0 1
4.8:

j i pi,j = Si ((j
k))/|Si |
S.Johansson R.Downey Iron man
( )
S , Avenger = (11)+(10)+(11) = 2
S , Ant man = (1 0) + (0 1) + (1 0) = 0
Avenger

0
1
( ) 1 5

i
j pi,j = cCj (vc vi )/|Cj |

76
/ Ant man Avenger Iron man
3
2
5 5
4.9: (Score-based)

4.2.3



(Collaborative Filtering)


Let v_{i,j} be the rating user i gave item j, and I_i the set of items user i has rated. The mean rating of user i is

\bar{v}_i = \frac{1}{|I_i|}\sum_{j \in I_i} v_{i,j}   (4.6)

The predicted rating of the active user a for item j combines the other users' deviations from their own means, weighted by similarity:

p_{a,j} = \bar{v}_a + \sum_{i=1}^{n} w(a, i)(v_{i,j} - \bar{v}_i)   (4.7)

where p_{a,j} is the predicted rating and w(a, i) is the similarity weight between users a and i.


77
Common choices of w(a, i) include the Minkowski distance-based nearest neighbour (Minkowski distance-based Nearest Neighbour), the Pearson correlation coefficient (Pearson's Correlation Coefficient), the Jaccard distance (Jaccard Distance), and the cosine distance (Cosine Distance).

The simplest weight includes the k nearest users uniformly:

w(a, i) = \begin{cases} 1, & \text{if } i \in neighbour(a) \\ 0, & \text{otherwise} \end{cases}

where neighbour(a) is the set of the k users closest to a under a Minkowski distance d(u_i, u_j) = (|u_i^1 - u_j^1|^h + \cdots + |u_i^M - u_j^M|^h)^{1/h}. A smoother variant weights each neighbour by its distance:

w(a, i) = \begin{cases} d(u_a, u_i), & \text{if } i \in neighbour(a) \\ 0, & \text{otherwise} \end{cases}

The Pearson correlation weight is

w(a, i) = \frac{\sum_j (v_{a,j} - \bar{v}_a)(v_{i,j} - \bar{v}_i)}{\sqrt{\sum_j (v_{a,j} - \bar{v}_a)^2 \sum_j (v_{i,j} - \bar{v}_i)^2}}

78




0






The Jaccard similarity of two finite sets (Finite Set) S and T is |S \cap T| / |S \cup T|, and the Jaccard distance is

1 - |S \cap T| / |S \cup T|   (4.8)

where S \cap T is the intersection (Intersection) of S and T, and S \cup T their union (Union).

Example 4.2.2. With S = {dog, cat, parrot, monkey} and T = {dog, monkey, snake}, SIM_{Jaccard}(S, T) = 2/5 = 0.4, and the Jaccard distance is 1 - SIM_{Jaccard} = 0.6.


(Rating Vector)


79

SIM_{cos}(A, B) = \frac{A \cdot B}{||A||\,||B||}   (4.9)

where A and B are the rating vectors (Rating Vector) of the two users.




0


(Discretise)

( )
1-5 3,4,5 1 1 2
0

80



acc = \frac{\sum_{i=1}^{n} |\hat{r}_i - r_i|}{n}   (4.10)

for ratings \hat{r}_i predicted on a 1..K scale, or by the mean squared error over N held-out ratings:

M.S.E = \frac{\sum_{i=1}^{N} (p_{i,j} - v_{i,j})^2}{N}   (4.11)

81

1.

2.

3.

4.

5.


1 1 1 1 1
2 1 1 1
3 1 1 1 1
4 1 1 1
5 1 1 1 1 1
6 1 1 1
7 1 1 1 1 1


[, ] [ , , ]

IF & THEN

82
5



(E-mail)
(Spam E-mail)







Classification (Classification) learns a function h: X \to Y from a training set S = \{x_i, y_i\}_{i=1}^{N} drawn from a distribution D, where x_i \in X is an input vector (Input Vector) and y_i \in Y is the class label (Class Label) of x_i.


(Supervised Learning)
h y x
y y x
h (Classier)

83
5.1
(Bayesian Learning)
(Bayes Rule)

p(y|x) = \frac{p(x|y)p(y)}{p(x)}   (5.1)

which follows from the two factorisations of the joint probability (Joint Probability) p(x, y):

p(x, y) = p(x|y)p(y) = p(y|x)p(x)   (5.2)

p(x)
5.1

y x

1. p(y|x) x y
(Posterior Probability)

2. p(x|y) x y
(Likelihood)

3. p(y) y
(Prior Probability)

4. p(x) x (Evidence)
2

84

(Binary Classification Problem) 0 1
p(y = 1|x) p(y = 0|x)

p(y = 1|x) = \frac{p(x|y = 1)p(y = 1)}{p(x)}   (5.3)

p(y = 0|x) = \frac{p(x|y = 0)p(y = 0)}{p(x)}   (5.4)



p(x|y)
(Probability Density Function (PDF))

5.1.1. (Positive)

98% (True Positive)
97%

x = positive y = 0
y = 1
p(y = 0|positive) p(y = 1|positive)

85
From Example 5.1.1, the two posteriors expand by Bayes rule:

p(y = 0|positive) = \frac{p(positive|y = 0)p(y = 0)}{p(positive)} = \frac{p(positive|y = 0)p(y = 0)}{p(positive|y = 0)p(y = 0) + p(positive|y = 1)p(y = 1)}   (5.5, 5.6)

p(y = 1|positive) = \frac{p(positive|y = 1)p(y = 1)}{p(positive)} = \frac{p(positive|y = 1)p(y = 1)}{p(positive|y = 0)p(y = 0) + p(positive|y = 1)p(y = 1)}   (5.7, 5.8)

The test detects the disease with p(positive|y = 1) = 0.98 and correctly rejects healthy cases with p(negative|y = 0) = 0.97. Since the two outcomes are complementary:

p(positive|y = 1) + p(negative|y = 1) = 1   (5.9)
p(negative|y = 1) = 1 - p(positive|y = 1) = 1 - 0.98 = 0.02   (5.10, 5.11)
p(positive|y = 0) + p(negative|y = 0) = 1   (5.12)
p(positive|y = 0) = 1 - 0.97 = 0.03   (5.13)



p(y)
y
( ) p(y)

86
p(y)

p(y = 1) = p(y = 0) = 0.5

p(y)

With equal priors p(y = 1) = p(y = 0) = 0.5 the prior cancels out:

p(y = 1|x) = \frac{p(x|y = 1)p(y = 1)}{p(x)}   (5.14)
= \frac{p(x|y = 1)p(y = 1)}{p(x|y = 0)p(y = 0) + p(x|y = 1)p(y = 1)}   (5.15)
= \frac{0.5\,p(x|y = 1)}{0.5\,p(x|y = 0) + 0.5\,p(x|y = 1)}   (5.16)
= \frac{p(x|y = 1)}{p(x|y = 0) + p(x|y = 1)}   (5.17)

p(x) = p(x|y = 0)p(y = 0) + p(x|y = 1)p(y = 1)


p(x) 0 1






\hat{y} = \arg\max_{y \in \{0,1\}} p(y|x) = \arg\max_{y \in \{0,1\}} p(x|y)   (5.18)

This decision rule is the maximum likelihood estimate (Maximum Likelihood Estimate (ML)); it ignores the prior p(y).
87
With p(y = 0) = p(y = 1) = 0.5, the ML decision for a positive test result is:

p(y = 0|positive) = \frac{p(positive|y = 0) \times 0.5}{p(positive|y = 0) \times 0.5 + p(positive|y = 1) \times 0.5}   (5.19)
= \frac{0.03 \times 0.5}{0.03 \times 0.5 + 0.98 \times 0.5}   (5.20)
\approx 0.03   (5.21)

p(y = 1|positive) = \frac{p(positive|y = 1) \times 0.5}{p(positive|y = 0) \times 0.5 + p(positive|y = 1) \times 0.5}   (5.22)
= \frac{0.98 \times 0.5}{0.03 \times 0.5 + 0.98 \times 0.5}   (5.23)
\approx 0.97   (5.24)
> p(y = 0|positive)   (5.25)

ML

5.1.2
In reality the disease is rare: only a fraction 0.008 of the population has it, so p(y = 1) = 0.008 and p(y = 0) = 0.992. Taking the prior into account gives the maximum a posteriori (Maximum A Posteriori (MAP)) decision:

\hat{y} = \arg\max_{y \in \{0,1\}} p(y|x) = \arg\max_{y \in \{0,1\}} p(x|y)p(y)   (5.26)

88
For the positive test result, MAP gives:

p(y = 0|positive) = \frac{p(positive|y = 0)p(y = 0)}{p(positive|y = 0)p(y = 0) + p(positive|y = 1)p(y = 1)}   (5.27)
= \frac{0.03 \times 0.992}{0.03 \times 0.992 + 0.98 \times 0.008}   (5.28)
\approx 0.79   (5.29)

p(y = 1|positive) = \frac{p(positive|y = 1)p(y = 1)}{p(positive|y = 0)p(y = 0) + p(positive|y = 1)p(y = 1)}   (5.30)
= \frac{0.98 \times 0.008}{0.03 \times 0.992 + 0.98 \times 0.008}   (5.31)
\approx 0.21   (5.32)
< p(y = 0|positive)   (5.33)

Under MAP the patient is therefore more likely healthy even after a positive test, because the disease is so rare. Classifiers that decide by ML or MAP in this way are called Bayes classifiers (Bayes Classifier).
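The MAP computation (5.27)-(5.32) can be sketched generically:

```python
def posterior(likelihood, prior):
    """MAP posteriors: p(y|x) ∝ p(x|y) p(y), normalised by the evidence."""
    joint = {y: likelihood[y] * prior[y] for y in prior}
    evidence = sum(joint.values())
    return {y: joint[y] / evidence for y in joint}

# positive test result: p(positive|y=0) = 0.03, p(positive|y=1) = 0.98
post = posterior({0: 0.03, 1: 0.98}, {0: 0.992, 1: 0.008})
print(round(post[0], 2), round(post[1], 2))  # 0.79 0.21
```

Swapping the prior for {0: 0.5, 1: 0.5} reproduces the ML decision instead.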

5.1.3




14 5.1 xi = {x1i , x2i , . . . , xM i }

89

15

x15 = {=,=, = , =}


Table 5.1: The 14 training examples (the attribute values of the table were lost in extraction).

p(y|x_{15}) \propto p(\{x_{15}^1, x_{15}^2, \ldots, x_{15}^M\}|y)\,p(y)   (5.34)

Each conditional probability of a single feature, such as p(x^1 = v|y), is easy to estimate by counting; the example in the text gives 2/9 for one such feature value within the positive class. The joint likelihood p(\{x^1, x^2, \ldots, x^M\}|y), however, would require counting every combination of feature values. By the chain rule of conditional probability (Conditional Probability) the joint probability (Joint Probability) factorises as:

p(\{x^1, \ldots, x^M\}|y) = p(x^1|y)\,p(x^2, \ldots, x^M|y, x^1)   (5.35)
= p(x^1|y)\,p(x^2|y, x^1)\,p(x^3, \ldots, x^M|y, x^1, x^2)   (5.36)
= \ldots   (5.37)

The naive assumption (Naive Assumption) is that the features are conditionally independent given the class, so each factor in (5.35) drops its feature conditions, e.g. p(x^2|y, x^1) = p(x^2|y):

p(\{x^1, \ldots, x^M\}|y) = p(x^1|y)\,p(x^2, \ldots, x^M|y)   (5.38)
= p(x^1|y)\,p(x^2|y)\,p(x^3, \ldots, x^M|y)   (5.39)
= \prod_{i=1}^{M} p(x^i|y)   (5.40)


The resulting naive Bayes classifier (Naive Bayes Classifier) predicts example 15 by

\hat{y} = \arg\max_{y} p(x_{15}|y)p(y)   (5.41)
= \arg\max_{y} p(y)\prod_{i=1}^{4} p(x_{15}^i|y)   (5.42)

Using the counts from Table 5.1, where 9 of the 14 examples are in the positive class and 5 in the negative class:

p(y = pos|x_{15}) \propto \frac{9}{14}\cdot\frac{2}{9}\cdot\frac{2}{9}\cdot\frac{3}{9}\cdot\frac{3}{9}   (5.43, 5.44)

p(y = neg|x_{15}) \propto \frac{5}{14}\cdot(\text{the corresponding conditional probabilities of the negative class})   (5.45, 5.46)

Comparing the two, p(y = pos|x_{15}) > p(y = neg|x_{15}), so the positive class is predicted.
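The counting scheme of (5.42) can be sketched on a tiny illustrative dataset (these weather-style values are hypothetical, not the lost Table 5.1):

```python
def predict_nb(X, y, query):
    """Naive Bayes prediction (5.42): argmax_c p(c) * prod_i p(x^i = v | c)."""
    best, best_score = None, -1.0
    for c in set(y):
        rows = [x for x, yy in zip(X, y) if yy == c]
        score = len(rows) / len(y)                        # prior p(c)
        for f, v in enumerate(query):
            score *= sum(1 for r in rows if r[f] == v) / len(rows)
        if score > best_score:
            best, best_score = c, score
    return best

X = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "hot")]
y = ["no", "no", "yes", "no"]
print(predict_nb(X, y, ("rain", "mild")))  # yes
```

A production implementation would add Laplace smoothing so that an unseen feature value does not zero out the whole product.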




20 Newsgroups 20


(Keyword)

v {, , , , , }

92

(
)



( : )

x = {=1, =3, =1, =0, =0, =4}

x = {1,3,1,0,0,4}

5.2






(Normal Discriminant Analysis)
50

93
100 xq = 173.2

p(xq = 173.2|y = )
0 173.2
p(y = |xq = 173.2)
p(y = |xq = 173.2)

(
)

(Data Distribution
Model)





(Normally Distributed)
Central Limit Theorem [Wasserman, 2013]
(Independent Feature)



For a single numeric feature, a common choice of PDF is the univariate normal distribution (Univariate Normal Distribution):

p(x|y = k) = \frac{1}{\sqrt{2\pi\sigma_k^2}}\exp\left\{-\frac{(x - \mu_k)^2}{2\sigma_k^2}\right\}   (5.47)

94

The class posteriors, dropping the evidence term, are:

p(y = 0|x) \propto p(x|y = 0)p(y = 0)   (5.48)
= p(x|\mu_0, \sigma_0)p(y = 0)   (5.49)
= \frac{1}{\sqrt{2\pi\sigma_0^2}}\exp\left\{-\frac{(x - \mu_0)^2}{2\sigma_0^2}\right\}p(y = 0)   (5.50)

p(y = 1|x) \propto p(x|y = 1)p(y = 1)   (5.51)
= p(x|\mu_1, \sigma_1)p(y = 1)   (5.52)
= \frac{1}{\sqrt{2\pi\sigma_1^2}}\exp\left\{-\frac{(x - \mu_1)^2}{2\sigma_1^2}\right\}p(y = 1)   (5.53)

The parameters \mu_0, \mu_1, \sigma_0, \sigma_1 are estimated from the training data of each class, using the N_k points with y = k:

\mu_k = \frac{\sum_{n=1}^{N_k} x_n}{N_k}   (5.54)

\sigma_k^2 = \frac{\sum_{n=1}^{N_k} (x_n - \mu_k)^2}{N_k}   (5.55)

\pi_k = p(y = k) = \frac{N_k}{N}   (5.56)
N
A classifier that models probabilities in this way is a probabilistic classifier (Probabilistic Classifier), and because it models how the data of each class are generated, it is also a generative classifier (Generative Classifier).
95
5.2.1

A discriminant function (Discriminant Function) maps x directly to a class y. For two classes:

f_1(x) = \mathbb{1}\left(\frac{p(y = 1|x)}{p(y = 0|x)} > 1\right)   (5.57)

predicts 1 when p(y = 1|x) > p(y = 0|x) and 0 otherwise. Taking logs gives the equivalent rule

f_2(x) = \mathbb{1}\left(\log\frac{p(y = 1|x)}{p(y = 0|x)} > 0\right)   (5.58)

with a prediction threshold (Prediction Threshold) of 0.

5.2.2
1
PDF

p(x|y = k) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}\exp\left\{-\frac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k)\right\}   (5.59)

where |\Sigma| is the determinant of the covariance matrix, and the parameters are estimated per class:

\mu_k = \frac{\sum_{n=1}^{N_k} x_n}{N_k}   (5.60)

\Sigma_k = \frac{\sum_{n=1}^{N_k} (x_n - \mu_k)(x_n - \mu_k)^T}{N_k}   (5.61)

96





M
O(M 2 )

2


(Common Covariance Matrix)



\Sigma = \frac{\Sigma_0 + \Sigma_1}{2}   (5.62)

(Diagonal Covariance Matrix)


0
0



97
5.3




1 [Vapnik and Vapnik, 1998]



(Discriminative)



(Logistic Regression) [Ng and Jordan, 2002]



(Support Vector Machine), (Artificial Neural Network), (Logistic Regression)

Linear regression (Linear Regression) predicts y from x as

\hat{y} = w^T x + \epsilon   (5.63)

where w is the weight vector (Weight Vector), trained so that the error |\hat{y} - y| between \hat{y} and y is small. A linear classifier instead thresholds (Threshold) \hat{y} at 0, predicting 1 if \hat{y} \geq 0 and 0 if \hat{y} < 0, which makes w^T x = 0 the decision hyperplane (Decision Hyperplane). Comparing with (5.58):

f_2(x) = \mathbb{1}\left(\log\frac{p(y = 1|x)}{p(y = 0|x)} > 0\right) = \mathbb{1}(\hat{y} > 0)   (5.64)

suggests interpreting w^T x as the log-odds:

\log\frac{p(y = 1|x)}{p(y = 0|x)} = \hat{y} = w^T x   (5.65)

1 Vladimir Vapnik, creator of the Support Vector Machine (SVM)
p(y = 0|x)

Solving (5.65) for p(y = 1|x), using p(y = 0|x) = 1 - p(y = 1|x):

\log\frac{p(y = 1|x)}{p(y = 0|x)} = w^T x   (5.66)

\frac{p(y = 1|x)}{p(y = 0|x)} = \exp(w^T x)   (5.67)

\frac{p(y = 1|x)}{1 - p(y = 1|x)} = \exp(w^T x)   (5.68)

p(y = 1|x) = \exp(w^T x) - p(y = 1|x)\exp(w^T x)   (5.69)

p(y = 1|x) = \frac{\exp(w^T x)}{1 + \exp(w^T x)}   (5.70)
= \frac{1}{1 + \exp(-w^T x)}   (5.71)

and correspondingly

p(y = 0|x) = \frac{1}{1 + \exp(w^T x)}   (5.72)

The function

\frac{1}{1 + \exp(-w^T x)}   (5.73)

is the sigmoid function (Sigmoid Function).

Training finds the w that makes the observed labels most probable: a new x is then classified 0 when p(y = 0|x) > p(y = 1|x) and 1 when p(y = 1|x) > p(y = 0|x). The likelihood function (Likelihood Function) of the training set is

L_1 = \prod_{n=1}^{N} p(y_n = 1|x_n, w)^{y_n}\left(1 - p(y_n = 1|x_n, w)\right)^{1 - y_n}   (5.74)

Maximising L_1 directly is awkward, so its logarithm, the log-likelihood function (Log-likelihood Function), is maximised instead:

L_2 = \sum_{n=1}^{N} y_n\log p(y_n = 1|x_n, w) + (1 - y_n)\log(1 - p(y_n = 1|x_n, w))   (5.75)

Equivalently, minimising -L_2 is the negative log-likelihood (Negative Log-likelihood Function) formulation.

100
L_2 is a concave function of w, so it is maximised where its gradient is zero; there is no closed form, but iterative methods from optimisation theory (Optimisation Theory) [Boyd and Vandenberghe, 2004] apply. Newton's method (Newton's Method) finds a stationary point of f(w) by iterating

w_{i+1} = w_i - \frac{f'(w_i)}{f''(w_i)}   (5.76)

Applied to L_2, with a learning rate (Learning Rate) \eta:

w_{i+1} = w_i - \eta\,\frac{L_2'(w_i)}{L_2''(w_i)}   (5.77)

where the first derivative (First Derivative) and second derivative (Second Derivative) of L_2 are

L_2' = \sum_{n=1}^{N} \left[y_n\,p(y_n = 0|x_n, w) - (1 - y_n)\,p(y_n = 1|x_n, w)\right]x_n   (5.78)

L_2'' = -\sum_{n=1}^{N} x_n\,p(y_n = 1|x_n, w)\,p(y_n = 0|x_n, w)\,x_n^T   (5.79)

Once trained, w classifies a new x by the sign of w^T x. Because logistic regression models the decision boundary directly rather than the data distribution, it is a discriminative classifier (Discriminative Classifier).
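Training can be sketched with plain gradient ascent on (5.75) (a simpler update than the Newton step of (5.77); the tiny dataset is illustrative and includes a bias column):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Maximise the log-likelihood (5.75) by gradient ascent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)          # p(y=1|x, w)
        w += lr * (X.T @ (y - p))   # gradient step: sum_n (y_n - p_n) x_n
    return w

# 1-D toy data with a bias column: class 1 when x is large
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = fit_logistic(X, y)
pred = (sigmoid(X @ w) >= 0.5).astype(float)
print(pred)  # [0. 0. 1. 1.]
```

The gradient here is the y_n − p_n form, which is algebraically the same as the bracketed term in (5.78).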

101
5.4

(Parametric Model)


(Non-parametric)
A popular non-parametric classifier is k-nearest neighbours (k-Nearest Neighbour (kNN)). kNN assumes that points close to x behave like x: to classify a query point x_q, it finds the k training points nearest to x_q and takes a majority vote of their labels. Given training pairs (x, y) with y = h(x), kNN predicts

h_{knn}(x_q) = majority(h_1(x_q), h_2(x_q), \ldots, h_k(x_q))   (5.80)

where h_i(x_q) is the label of the i-th nearest neighbour of x_q.
kNN (Lazy Learner)
kNN
( )
(Training Time)
xq kNN
(Eager Learner) kNN
xq

kNN

102
(Real Time)


(Global Approximation)
(Local Approximation) ( )
kNN

kNN kNN



kNN k k

(Cross Validation) k kNN
(Distance-weighted kNN) kNN



h_{knn}(x_q) = \frac{\sum_{n=1}^{N} w_n h(x_n)}{\sum_{n=1}^{N} w_n}   (5.81)

w_n = \frac{1}{d(x_q, x_n)}   (5.82)

where d(x_q, x_n) is the distance between x_q and x_n.
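Plain kNN (5.80) can be sketched in a few lines (the four training points are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X, y, xq, k=3):
    """kNN (5.80): majority vote among the k nearest training points."""
    dist = np.linalg.norm(X - xq, axis=1)          # Euclidean distances
    nearest = np.argsort(dist)[:k]                 # indices of k neighbours
    return Counter(y[i] for i in nearest).most_common(1)[0][0]

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y = ["a", "a", "b", "b"]
print(knn_predict(X, y, np.array([0.2, 0.1])))  # a
print(knn_predict(X, y, np.array([4.9, 5.1])))  # b
```

Note that all the work happens at query time, which is exactly the lazy-learner behaviour described above.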

103
5.5




accuracy = \frac{\#\text{ correctly classified examples}}{\#\text{ examples}}   (5.83)

The error on the training data is the training error (Training Error); what we actually care about is the generalisation error (Generalisation Error), the expected error on unseen data.

(
)

(Test Error)

1. (Training Dataset)


x y

2. (Testing Dataset)


x y

1.

104
2.


5.5.1

(Overfitting)






(Underfitting)


5.5.2



1
2


The counts are arranged in a confusion matrix (Confusion Matrix):

                    predicted Negative    predicted Positive
  actual Negative   a (true negative)     b (false positive)
  actual Positive   c (false negative)    d (true positive)

Table 5.2: The confusion matrix

Accuracy is \frac{a + d}{a + b + c + d} = 1 - error.


Several rates are derived from the confusion matrix:

True Positive Rate (Recall) = \frac{d}{c + d}

True Negative Rate (Specificity) = \frac{a}{a + b}

False Positive Rate (False Alarm) = \frac{b}{a + b}

False Negative Rate = \frac{c}{c + d}

A good classifier has a low false alarm rate (False Alarm) and a high recall (Recall).
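The rates can be computed directly from the four counts (the counts below are illustrative):

```python
def rates(tn, fp, fn, tp):
    """Confusion-matrix rates, with a, b, c, d = tn, fp, fn, tp (Table 5.2)."""
    return {
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
        "recall": tp / (fn + tp),          # true positive rate
        "specificity": tn / (tn + fp),     # true negative rate
        "false_alarm": fp / (tn + fp),     # false positive rate
    }

m = rates(tn=50, fp=10, fn=5, tp=35)
print(m["accuracy"], m["recall"])  # 0.85 0.875
```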

5.5.3

(Receiver Operating Characteristic Analysis) plots classifiers in a two-dimensional space with the False Positive Rate on the x axis and the True Positive Rate on the y axis, the ROC graph (ROC Graph), as in Figure 5.1.

Figure 5.1: The ROC space: False Positive Rate (x axis) versus True Positive Rate (y axis).


The ideal classifier sits at (0,1): a False Positive Rate of 0 and a True Positive Rate of 1. The point (0,0) corresponds to a classifier that always predicts negative, (1,1) to one that always predicts positive, and (1,0) to one that gets every example wrong (inverting its outputs would make it perfect).
107


A probabilistic classifier predicts 1 when p(y = 1|x) > 0.5 and 0 otherwise; the value 0.5 is the decision threshold (Decision Threshold). Different errors can carry different costs (Cost): a false positive is a type I error (Type I Error) and a false negative a type II error (Type II Error). When false positives are expensive, the threshold can be raised above 0.5 so the classifier predicts 1 only when p(y = 1|x) is high; when false negatives (type II errors) are the expensive ones, the threshold can be lowered below 0.5.


Sweeping the threshold from 0 to 1 traces out a sequence of (False Positive Rate, True Positive Rate) pairs: the ROC curve (ROC Curve) of the classification model (Classification Model), as in Figure 5.2a. The area under the curve (Area Under Curve (AUC)) summarises the whole curve in a single number between 0 and 1: the closer the AUC is to 1, the better the classifier. Figure 5.2 shows ROC curves with different AUC values.

Figure 5.2: ROC curves with (a) AUC = 100%, (b) AUC = 65%, and (c) AUC = 50%, the last being no better than random guessing.
109


[Fawcett, 2006]

5.5.4

(x, y)



1

(Hold-out Method)


k
k


Figure 5.3: 5-fold cross validation: the data are split into five parts (#1-#5), and each part takes one turn as the test set while the remaining parts train the model.

(
)

110
In k-fold cross validation (k-fold Cross Validation) the data are split into k parts; each round trains on k - 1 parts and tests on the remaining one, and the k test results are averaged. Figure 5.3 shows the case k = 5. Setting k = N, so that each fold holds a single example, gives leave-one-out cross validation (Leave-one-out Cross Validation).
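The index bookkeeping of k-fold cross validation can be sketched as follows:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Train/test index pairs for k-fold cross validation (Figure 5.3)."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    return [(np.concatenate(folds[:i] + folds[i + 1:]), folds[i])
            for i in range(k)]

for train_idx, test_idx in kfold_indices(10, 5):
    print(len(train_idx), len(test_idx))  # 8 2, on each of the 5 rounds
```

Each example appears in exactly one test fold, so the averaged test error uses every data point exactly once.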

111

1.

2.

3.

4.

5. k

6. k

7.

112
6


(Supervised Learning)






(Unlabelled Data)
(Class) (Cluster)

(Clustering)




(Partitioning Clustering)
(Hierarchical Clustering)
(Tree)

113
6.1



Given data D = (x_1, \ldots, x_N) with x_i = [x_i^1, \ldots, x_i^M] in M dimensions, define a distance function (Distance Function) between x_i and x_j, e.g. the Euclidean distance

d(x_i, x_j) = \sqrt{\sum_{m=1}^{M} (x_i^m - x_j^m)^2} = ||x_i - x_j||_2

Partitioning clustering assigns each point x_n a cluster label z_n, collected in z = [z_1, \ldots, z_N] \in \{1, \ldots, k\}^N, and seeks the assignment that minimises the distance of each point to its cluster centre m_{z_n}:

F_{obj} = \frac{1}{2}\sum_{n=1}^{N} ||x_n - m_{z_n}||^2   (6.1)

(k-means Algorithm)

6.1.1
(Mean)




114
Algorithm 3 k-means clustering
1: Randomly choose centres \mu_{1:k}
2: Initialise z
3: do
4:   z_n = \arg\min_{i \in \{1,\ldots,k\}} d(x_n, \mu_i)
5:   \mu_k = \frac{1}{N_k}\sum_{n: z_n = k} x_n
6: while z does not change
7: return z — the cluster assignment z
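Algorithm 3 can be sketched as follows (the tiny two-group dataset is illustrative, and the first k points are used as initial centres instead of a random choice, to keep the run deterministic):

```python
import numpy as np

def kmeans(X, k, iters=100):
    """k-means sketch following Algorithm 3."""
    centres = X[:k].astype(float).copy()   # deterministic init for the demo
    z = np.full(len(X), -1)
    for _ in range(iters):
        # assignment step: each point goes to its nearest centre
        dist = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        new_z = dist.argmin(axis=1)
        if np.array_equal(new_z, z):
            break                          # z did not change: converged
        z = new_z
        # update step: each centre moves to the mean of its points
        for c in range(k):
            if np.any(z == c):
                centres[c] = X[z == c].mean(axis=0)
    return z, centres

X = np.array([[4.0, 4.0], [-4.0, -4.0], [4.2, 3.9], [-4.1, -3.8],
              [3.9, 4.1], [-3.9, -4.2]])
z, centres = kmeans(X, 2)
print(z)  # [0 1 0 1 0 1]
```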

Algorithm 3 k

z
z

z
Figures 6.1-6.3 illustrate the algorithm on 80 data points drawn from 4 groups of 20 points centred at (4,4), (-4,4), (4,-4) and (-4,-4): the initial random centres, the assignments after the first iteration, and the final converged clusters.
115
Figure 6.1: The data points and the initial cluster centres.

Figure 6.2: Cluster assignments during the early iterations.

Figure 6.3: The final clusters after k-means converges.

Why does Algorithm 3 converge? Consider the objective as a function of both the assignments and the centres:

F(z_{1:N}, \mu_{1:k}) = \frac{1}{2}\sum_{n=1}^{N} ||x_n - \mu_{z_n}||^2   (6.2)

1. The assignment step picks each z_n to minimise F with the centres \mu fixed.

2. The update step picks each \mu_k to minimise F with z fixed.

Each step can only decrease (or keep) F, so F converges to a local minimum (Local Minimum); this style of alternating optimisation is a coordinate descent algorithm (Coordinate Descend Algorithm).
117
6.1.2


3
=1 =2 =3

2.2

(k-medoids) replaces the mean in the update step with the median, which is robust to outliers, as in Algorithm 4.

Algorithm 4 k-medoids clustering
1: Randomly choose medoids m_{1:k}
2: Initialise z
3: do
4:   z_n = \arg\min_{i \in \{1,\ldots,k\}} d(x_n, m_i)
5:   m_k = median(x_n) where z_n = k
6: while z does not change
7: return z — the cluster assignment z

6.1.3
The number of clusters k must be chosen in advance. A common heuristic (Heuristic) runs the algorithm for k = 1, 2, 3, \ldots and plots the final objective value against k: the objective always drops as k grows, but the drop flattens once k exceeds the true number of groups. In Figure 6.4 (the 4-group data above, with k from 1 to 9) the objective falls steeply from about 800 until k = 4, reaching about 300, and only creeps down after that, so k = 4 is a good choice.

[Figure: Objective function value (from about 800 down to about 200) plotted against the value of k (1 to 9).]

Figure 6.4: The final objective value for each choice of k; the curve flattens after k = 4.
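The elbow heuristic can be sketched as follows. The objective values below are made up to resemble Figure 6.4, and the 10% drop cut-off is an arbitrary choice for illustration.

```python
# The elbow heuristic: record the final objective for k = 1..9 and
# pick the last k that still gave a substantial improvement.

objectives = {1: 800, 2: 690, 3: 500, 4: 300,
              5: 280, 6: 265, 7: 255, 8: 250, 9: 248}

def elbow(objectives, min_drop=0.10):
    """Return the last k whose objective dropped by at least min_drop
    (as a fraction of the previous value) relative to k - 1."""
    ks = sorted(objectives)
    best = ks[0]
    for prev, k in zip(ks, ks[1:]):
        if (objectives[prev] - objectives[k]) / objectives[prev] >= min_drop:
            best = k
    return best

print(elbow(objectives))   # → 4
```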

6.2 Hierarchical Clustering

Hierarchical clustering (Hierarchical Clustering) approaches the problem differently from the partitioning methods of the previous section. Rather than producing a single flat division of the data, it produces a nested hierarchy of groups, so the number of groups k does not have to be chosen in advance; any desired number of groups can be obtained afterwards by cutting the hierarchy at a suitable level.

The most common strategy is agglomerative clustering (Agglomerative Clustering), which proceeds bottom-up (Bottom-up): every example starts in a group of its own, and at each step the two closest groups are merged, until all examples belong to a single group.
6.2.1 The Dendrogram

The output of hierarchical clustering is usually drawn as a dendrogram (Dendrogram), a tree in which the height of each junction records the distance at which two groups were merged, as in Figure 6.5. Merge heights are monotonic (Monotonic): each successive merge occurs at a distance no smaller than the previous one.

[Figure: a dendrogram over the five examples a, b, c, d and e.]

Figure 6.5: A dendrogram of 5 examples.

In this example, c and d are the closest pair, so they are merged first. At height 2, e joins the group {c, d}. After that a and b are merged, and finally the two remaining groups are joined at the root.
6.2.2 Linkage Criteria

Agglomerative clustering repeatedly merges the two closest groups, so it needs a distance between groups of examples, not just between individual examples. Three definitions are in common use.

Single linkage (Single Linkage) measures the distance between the closest pair of examples, one drawn from each group:

d_{SL}(G, H) = \min_{i \in G, j \in H} d(x_i, x_j)    (6.3)

where G and H are the two groups. Because only one close pair is needed, single linkage can merge groups that are connected merely by a thin chain of intermediate examples, a behaviour called chaining (Chaining) that sometimes yields long, straggly groups.

Complete linkage (Complete Linkage) measures the distance between the farthest pair instead:

d_{CL}(G, H) = \max_{i \in G, j \in H} d(x_i, x_j)    (6.4)

Two groups are then considered close only when all of their examples are close, so complete linkage tends to produce compact groups.

Group average (Group Average) averages the distance over every pair of examples:

d_{GA}(G, H) = \frac{1}{N_G N_H} \sum_{i \in G} \sum_{j \in H} d(x_i, x_j)    (6.5)

where N_G and N_H are the numbers of examples in G and H. Group average behaves as a compromise between the two extremes of single and complete linkage.
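The bottom-up merging process with single linkage (6.3) can be sketched directly; the five one-dimensional points below are made up so that the merge order mirrors the dendrogram of Figure 6.5 (c and d first, then e, then a and b).

```python
# Bottom-up (agglomerative) clustering with single linkage: start with
# every example in its own group and repeatedly merge the pair of
# groups whose closest members are nearest, recording each merge
# (the information a dendrogram such as Figure 6.5 displays).

def single_linkage_merges(X):
    """Return the merges as a list of (group_a, group_b, distance)."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    clusters = [frozenset([i]) for i in range(len(X))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(X[i], X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        merges.append((set(clusters[a]), set(clusters[b]), d))
        merged = clusters[a] | clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges

# Points for a, b, c, d, e chosen so c and d merge first, then e joins:
X = [(0.0,), (1.0,), (5.0,), (5.2,), (6.0,)]
for ga, gb, d in single_linkage_merges(X):
    print(sorted(ga | gb), round(d, 2))   # merges in order of height
```

Swapping the inner `min` for `max` gives complete linkage (6.4), and replacing it with the average over all pairs gives group average (6.5).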

Exercises

1.

2.

3.

4.

5.

Data set for the exercises:

(5, 10), (6, 19), (8, 15), (12, 5), (7, 13)

Bibliography
[Blei et al., 2003] Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

[Boyd and Vandenberghe, 2004] Boyd, S. and Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.

[Cleveland, 1993] Cleveland, W. S. (1993). Visualizing data. Hobart Press.

[Dempster et al., 1977] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–38.

[Eiben and Smith, 2003] Eiben, A. E. and Smith, J. E. (2003). Introduction to evolutionary computing, volume 53. Springer.

[Fawcett, 2006] Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874.

[Fawcett and Provost, 1997] Fawcett, T. and Provost, F. (1997). Adaptive fraud detection. Data Mining and Knowledge Discovery, 1(3):291–316.

[Fayyad et al., 2002] Fayyad, U. M., Wierse, A., and Grinstein, G. G. (2002). Information visualization in data mining and knowledge discovery. Morgan Kaufmann.

[Friedman, 1997] Friedman, J. H. (1997). On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1(1):55–77.

[Guyon and Elisseeff, 2003] Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar):1157–1182.

[Han et al., 2011] Han, J., Pei, J., and Kamber, M. (2011). Data mining: concepts and techniques. Elsevier.

[Hoffman and Grinstein, 2002] Hoffman, P. E. and Grinstein, G. G. (2002). A survey of visualizations for high-dimensional data mining. Information Visualization in Data Mining and Knowledge Discovery, pages 47–82.

[Indyk and Motwani, 1998] Indyk, P. and Motwani, R. (1998). Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 604–613. ACM.

[Jindal and Liu, 2007] Jindal, N. and Liu, B. (2007). Review spam detection. In Proceedings of the 16th International Conference on World Wide Web, pages 1189–1190. ACM.

[Johnson et al., 1986] Johnson, W. B., Lindenstrauss, J., and Schechtman, G. (1986). Extensions of Lipschitz maps into Banach spaces. Israel Journal of Mathematics, 54(2):129–138.

[Ke and Sukthankar, 2004] Ke, Y. and Sukthankar, R. (2004). PCA-SIFT: A more distinctive representation for local image descriptors. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II-506. IEEE.

[Lillesand et al., 2014] Lillesand, T., Kiefer, R. W., and Chipman, J. (2014). Remote sensing and image interpretation. John Wiley & Sons.

[MacKay, 2003] MacKay, D. J. (2003). Information theory, inference and learning algorithms. Cambridge University Press.

[Manjunath et al., 2001] Manjunath, B. S., Ohm, J.-R., Vasudevan, V. V., and Yamada, A. (2001). Color and texture descriptors. IEEE Transactions on Circuits and Systems for Video Technology, 11(6):703–715.

[Ng, 2004] Ng, A. Y. (2004). Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-first International Conference on Machine Learning, page 78. ACM.

[Ng and Jordan, 2002] Ng, A. Y. and Jordan, M. I. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In Advances in Neural Information Processing Systems, pages 841–848.

[Ojala et al., 2002] Ojala, T., Pietikainen, M., and Maenpaa, T. (2002). Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):971–987.

[Pang and Lee, 2008] Pang, B. and Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135.

[Richards, 1999] Richards, J. A. (1999). Remote sensing digital image analysis, volume 3. Springer.

[Sonka et al., 2014] Sonka, M., Hlavac, V., and Boyle, R. (2014). Image processing, analysis, and machine vision. Cengage Learning.

[Vapnik and Vapnik, 1998] Vapnik, V. N. and Vapnik, V. (1998). Statistical learning theory, volume 1. Wiley New York.

[Wasserman, 2013] Wasserman, L. (2013). All of statistics: a concise course in statistical inference. Springer Science & Business Media.
