Chapter 5: Data Mining

Data mining discovers knowledge from data by building models, such as DSS decision tables, decision trees, and frequent-pattern trees (FP-Tree) built over transaction data, without the analyst having to specify the algorithm in advance (Data Mining: Concepts and Techniques).
Knowledge discovered by data mining (for example, from transaction data) should have three properties:
1. Unknown: the knowledge was not previously known to the user.
2. Valid: the knowledge holds when checked against further data (validation and checking).
3. Actionable: the knowledge can be acted on to produce a benefit.
5.2 The Evolution of Data Mining
1. 1960s: Data Collection
2. 1980s: Data Access
3. 1990s: Data Warehouse and Decision Support
4. 2000s: Data Mining
Data mining is also known by several other names, including:
3. Data archeology
4. Data exploration
5. Data pattern processing
6. Data dredging
5.4 Characteristics of Data Mining
Data mining tools search for patterns in data. They typically:
- run on a client/server architecture;
- let analysts drill into the data (data drills) themselves, without waiting for a programmer;
- present results in familiar forms such as spreadsheets.
5.5 Data Mining (A KDD Process)
Data mining is one step of the knowledge discovery in databases (KDD) process, which extracts patterns from data:
1. Data cleaning and data integration: data from the source databases are cleaned and combined.
2. Data warehouse: the integrated data are stored in a data warehouse.
3. Selection: task-relevant data are selected from the warehouse.
4. Data mining: patterns are extracted from the task-relevant data.
5. Pattern evaluation: the discovered patterns are evaluated to yield knowledge.
Figure 5.2: Data Mining and Business Intelligence (BI) (Cabena et al., 1997; http://www.g-able.com/thai/solutions/g-biz/bis.htm)
5.8 Data Mining (Architecture of a Typical Data Mining System)
Data mining can be applied to many kinds of data, for example:
1. Relational databases
2. Data warehouses
3. Transactional databases
4. Other advanced data repositories, such as:
   - Text databases
   - The World Wide Web (WWW)
5.9 Data Mining Functionalities (Data Mining Tasks)
Data mining can perform several kinds of tasks, including:
1. Characterization and discrimination
2. Association
3. Classification / Regression
Classification assigns each record to one of a set of predefined classes, for example classifying customers by brand loyalty. The input fields for such a model might look like Table 5.1.

Table 5.1: Customer
Field      Data Type   Value
Cus_id     Int         unique
Time       Int         integer
Trend      Text        categorical
Status     Text        categorical
Cus_type   Text        categorical
Prediction
Prediction is similar to classification and estimation, except that records are classified according to a predicted future value or behaviour (for example, a predicted figure such as 10% over the next 6 months).

6. Description / Visualization
Description summarizes the data in terms people can understand, while visualization presents the same information graphically; the two are often used together, for example as plots.
2. Decision Trees
A decision tree is a supervised learning technique: the model is built from a training set whose class labels are already known. The tree consists of a root node, child nodes, and leaf nodes.
3. Memory Based Reasoning (MBR)
MBR classifies a new case by comparing it with cases it has already seen, so it is also a supervised learning technique.
4. Cluster Detection
Cluster detection divides the records into segments (sub-groups) so that records within the same segment are similar to one another.
5. Link Analysis
Link analysis looks for associations between records. It has three main forms:
5.1 Association Discovery: finds items that occur together, as in market basket analysis in a supermarket or in selecting a mailing list for a direct mail promotion; a rule might state, for example, that 75% of customers who buy one item also buy another.
5.2 Sequential Pattern Discovery: finds patterns that occur over the long term, for example customers who buy a TV and later buy a VDO player.
5.3 Similar Time Sequence Discovery: finds two (or more) time series that behave similarly, for example stock prices.
6. Genetic Algorithm (GA)
GAs are inspired by natural selection, the survival of the fittest within a species. A GA works as follows:
- Candidate solutions are scored with a fitness function.
- Genetic operators are applied to create new candidates, which are evaluated again with the fitness function.
7. Rule Induction
Rule induction extracts if-then rules directly from the data.
8. K-nearest neighbor (K-NN)
K-NN classifies a new record by counting up (voting among) the classes of its K nearest neighbours; the most frequent class among the neighbours is assigned to the record. K-NN is closely related to MBR (Memory-Based Reasoning).
9. Association and Sequence Detection
- Association finds items that occur together, as in market-basket analysis.
- Sequence Detection is association extended over time.
In an association rule A -> B, A is called the antecedent or LHS (left-hand side) and B the consequent or RHS (right-hand side).
10. Logistic Regression
Logistic regression is used when the dependent variable takes only two values (Yes/No, 0/1). The model is fitted not on the probability itself but on the log odds, via the logit transformation (a minimal sketch is given after this list).
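The text above mentions the log-odds (logit) transformation. The sketch below shows, under illustrative assumptions, how a fitted logistic model turns a linear score into a probability; the coefficients b0 and b1 are hypothetical, not taken from the text.

```python
import math

def logit(p):
    """Log-odds (logit) transformation of a probability p."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit: maps a linear score back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Hypothetical fitted coefficients b0, b1 for a single predictor x.
b0, b1 = -2.0, 0.8

x = 3.5                      # a new observation
p = sigmoid(b0 + b1 * x)     # predicted probability of the "Yes"/1 class
print(round(p, 3))           # 0.69 -> classify as Yes if p > 0.5
```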
Data mining is also used for forecasting. For association analysis, the strength of a rule is measured by its support and confidence, defined as follows.
Support
The support of an association rule A => B is the proportion of all transactions that contain both A and B:

    support(A => B) = P(A ∩ B)

(A ∩ B means "A intersect B", i.e., transactions containing both items.)

Confidence
The confidence of A => B is the proportion of transactions containing A that also contain B:

    confidence(A => B) = P(B | A) = P(A ∩ B) / P(A)
Example 1: From the transactions in Table 5.2, find the support of the rule A => C and its confidence P(C | A).
Table 5.2: Transactions
Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F
Steps:
1. Count the total number of transactions in the database.
2. Count the transactions that contain each item (and each itemset).
3. Compute the association measures (support and confidence).

Solution:
    support(A => C)    = P(A ∩ C) = 2/4 = 0.5 = 50%
    confidence(A => C) = P(C | A) = P(A ∩ C) / P(A)
                       = (2/4) / (3/4) = 0.5 / 0.75 = 0.6666 = 66.66%

So the rule A => C holds with (support, confidence) = (50%, 66.66%): out of every 100 transactions, about 50 contain both A and C, and of the transactions that contain A, 66.66% also contain C.
Note that A => C and C => A are different rules: they have the same support, but their confidences differ, P(C | A) versus P(A | C).
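As an illustration, the support and confidence of A => C can be computed directly from the transactions of Table 5.2; the sketch below is a minimal example (function names are illustrative).

```python
# Transactions from Table 5.2 (transaction ID -> items bought)
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for items in db.values() if itemset <= items)
    return hits / len(db)

def confidence(antecedent, consequent, db):
    """P(consequent | antecedent) = support(antecedent and consequent) / support(antecedent)."""
    return support(antecedent | consequent, db) / support(antecedent, db)

print(support({"A", "C"}, transactions))        # 0.5      -> 50%
print(confidence({"A"}, {"C"}, transactions))   # 0.666... -> 66.66%
```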
Association rules have many practical uses, for example planning promotions and arranging related products on the same shelf so that items that are bought together are displayed together.

Exercises on association rules (based on Table 5.2 and Table 5.3):
1. From Table 5.2, find the support of the rule C => A and its confidence P(A | C).
   Solution: C appears in 2 of the 4 transactions, and both of those transactions also contain A, so support(C => A) = 2/4 = 50% and confidence = P(A | C) = (2/4) / (2/4) = 100%.

Table 5.3: Transactions
Items Bought
A, B, C, F, D
A, C, D, E
A, D, E
B, E, F
C, E, D
E, F, A
F, E, C
F, A, B

7. From Table 5.3, count how many transactions contain A, how many contain D, and how many contain E.
   Solution: A appears in 5 of the 8 transactions, D in 4, and E in 6 (a counting sketch follows below).
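One way to verify these counts is to scan the transactions of Table 5.3 and tally, for each item, how many transactions contain it; the sketch below does this in Python (variable names are illustrative).

```python
from collections import Counter

# The eight transactions of Table 5.3
transactions = [
    {"A", "B", "C", "F", "D"},
    {"A", "C", "D", "E"},
    {"A", "D", "E"},
    {"B", "E", "F"},
    {"C", "E", "D"},
    {"E", "F", "A"},
    {"F", "E", "C"},
    {"F", "A", "B"},
]

# Count, for each item, the number of transactions that contain it.
counts = Counter(item for t in transactions for item in t)
for item in ("A", "D", "E"):
    print(item, counts[item], f"{counts[item] / len(transactions):.2%}")
# A occurs in 5 of 8 transactions, D in 4 of 8, E in 6 of 8.
```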
Interest
The interest (lift) of two items A and B measures whether they occur together more or less often than expected if they were independent:

    interest(A, B) = P(A ∧ B) / (P(A) · P(B))

Example: Table 5.4 records eight transactions over the items X, Y, and Z (1 = item present, 0 = absent).

Table 5.4
X   Y   Z
1   1   0
1   1   1
1   0   1
1   0   1
0   0   1
0   0   1
0   0   1
0   0   1

The single-item supports are P(X) = 4/8, P(Y) = 2/8, and P(Z) = 7/8.
Solution:
1. Compute the support of each pair of items and its interest value:
Item Set   Support                  Interest
X, Y       2/8 = 0.25 = 25%         (2/8) / ((4/8)·(2/8)) = 0.25 / 0.125 = 2
X, Z       3/8 = 0.375 = 37.50%     (3/8) / ((4/8)·(7/8)) = 0.375 / 0.438 = 0.86
Y, Z       1/8 = 0.125 = 12.50%     (1/8) / ((2/8)·(7/8)) = 0.125 / 0.219 = 0.57
Interpretation: an interest value greater than 1 indicates dependence (the items occur together more often than expected), as for (X, Y); a value of 1 or less indicates independence or negative correlation, as for (X, Z) and (Y, Z).
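The supports and interest values above can be recomputed directly from the 0/1 data of Table 5.4. The sketch below is illustrative; the helper p() simply counts the rows in which every listed item equals 1.

```python
# Binary transaction data from Table 5.4 (one row per transaction: X, Y, Z)
rows = [
    (1, 1, 0),
    (1, 1, 1),
    (1, 0, 1),
    (1, 0, 1),
    (0, 0, 1),
    (0, 0, 1),
    (0, 0, 1),
    (0, 0, 1),
]
n = len(rows)
names = ("X", "Y", "Z")

def p(*cols):
    """Support: fraction of rows where every listed column equals 1."""
    return sum(all(r[c] for c in cols) for r in rows) / n

for a, b in ((0, 1), (0, 2), (1, 2)):
    interest = p(a, b) / (p(a) * p(b))
    print(names[a], names[b], f"support={p(a, b):.3f}", f"interest={interest:.2f}")
# X Y support=0.250 interest=2.00   (> 1: dependent)
# X Z support=0.375 interest=0.86   (<= 1: not positively correlated)
# Y Z support=0.125 interest=0.57
```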
Example: dissimilarity between objects described by binary variables. Table 5.6 records symptoms and test results for three patients.

Table 5.6
Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

Solution:
Step 1: Code the values Y and P as 1 and the value N as 0, giving Table 5.7. Gender is not used in the dissimilarity calculation; the six remaining attributes are treated as asymmetric binary attributes.

Table 5.7
Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y(1)    N(0)    P(1)     N(0)     N(0)     N(0)
Mary   F        Y(1)    N(0)    P(1)     N(0)     P(1)     N(0)
Jim    M        Y(1)    P(1)    N(0)     N(0)     N(0)     N(0)
Step 2: For two objects i and j, let a be the number of attributes equal to 1 in both, b the number equal to 1 in i but 0 in j, c the number equal to 0 in i but 1 in j, and d the number equal to 0 in both. The dissimilarity is

    d(i, j) = (b + c) / (a + b + c)
Step 3: Compute the pairwise dissimilarities. For Jack and Mary the counts are a = 2, b = 0, c = 1 (and d = 3), so

    d(Jack, Mary) = (b + c) / (a + b + c) = (0 + 1) / (2 + 0 + 1) = 1/3 = 0.33
    d(Jack, Jim)  = 0.67
    d(Jim, Mary)  = 0.75

Interpretation: Jack and Mary have the smallest dissimilarity (0.33), so of the three patients they are the most similar pair (and the most likely to have a similar condition); Jim differs more from both Jack and Mary.
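The same dissimilarities can be computed from the coded records of Table 5.7. The sketch below (the dictionary layout is illustrative) treats the six symptom/test attributes as asymmetric binary attributes and applies d(i, j) = (b + c) / (a + b + c).

```python
# Binary records from Table 5.7 (Fever, Cough, Test-1, Test-2, Test-3, Test-4)
patients = {
    "Jack": (1, 0, 1, 0, 0, 0),
    "Mary": (1, 0, 1, 0, 1, 0),
    "Jim":  (1, 1, 0, 0, 0, 0),
}

def dissimilarity(i, j):
    """d(i, j) = (b + c) / (a + b + c) for asymmetric binary attributes."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)  # 1 in both
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)  # 1 in i, 0 in j
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)  # 0 in i, 1 in j
    return (b + c) / (a + b + c)

print(round(dissimilarity(patients["Jack"], patients["Mary"]), 2))  # 0.33
print(round(dissimilarity(patients["Jack"], patients["Jim"]), 2))   # 0.67
print(round(dissimilarity(patients["Jim"], patients["Mary"]), 2))   # 0.75
```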
Table 5.8: T/F values for attributes Test-1, Test-2, Test-3, and Test-4.
5.14.4 Naive Bayesian Classification

Example 1: Using the "play tennis" training data in Table 5.9, classify the unseen sample X = (outlook = rain, temperature = hot, humidity = high, windy = false): will tennis be played (class P) or not (class N)?
Table 5.9
Outlook    Temperature   Humidity   Windy   Class
sunny      hot           high       false   N
sunny      hot           high       true    N
overcast   hot           high       false   P
rain       mild          high       false   P
rain       cool          normal     false   P
rain       cool          normal     true    N
overcast   cool          normal     true    P
sunny      mild          high       false   N
sunny      cool          normal     false   P
rain       mild          normal     false   P
sunny      mild          normal     true    P
overcast   mild          high       true    P
overcast   hot           normal     false   P
rain       mild          high       true    N
Solution:
1. Compute the prior and conditional probabilities from Table 5.9.

Outlook:
    P(sunny | P) = 2/9        P(sunny | N) = 3/5
    P(overcast | P) = 4/9     P(overcast | N) = 0/5
    P(rain | P) = 3/9         P(rain | N) = 2/5

Temperature:
    P(hot | P) = 2/9          P(hot | N) = 2/5
    P(mild | P) = 4/9         P(mild | N) = 2/5
    P(cool | P) = 3/9         P(cool | N) = 1/5

Humidity:
    P(high | P) = 3/9         P(high | N) = 4/5
    P(normal | P) = 6/9       P(normal | N) = 1/5

Windy:
    P(true | P) = 3/9         P(true | N) = 3/5
    P(false | P) = 6/9        P(false | N) = 2/5

Class priors:
    P(P) = 9/14, P(N) = 5/14 (together 14/14 = 1, since every record belongs to one of the two classes).
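With the probabilities above, Example 1 can be finished by applying the naive Bayes rule: compare P(X | P)·P(P) with P(X | N)·P(N) for X = (rain, hot, high, false). The sketch below is a minimal illustration (the data structures are assumptions, not from the text); the computation yields the larger score for class N, so the sample is classified as "do not play".

```python
# Class priors and per-attribute conditional probabilities taken from the solution above
prior = {"P": 9 / 14, "N": 5 / 14}
cond = {
    "P": {"rain": 3 / 9, "hot": 2 / 9, "high": 3 / 9, "false": 6 / 9},
    "N": {"rain": 2 / 5, "hot": 2 / 5, "high": 4 / 5, "false": 2 / 5},
}

x = ("rain", "hot", "high", "false")   # the unseen sample from Example 1

scores = {}
for cls in ("P", "N"):
    score = prior[cls]
    for value in x:
        score *= cond[cls][value]      # naive (conditional independence) assumption
    scores[cls] = score

print(scores)                          # P: ~0.0106, N: ~0.0183
print(max(scores, key=scores.get))     # -> "N": tennis is not played
```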
Table 5.10: class labels P, P, N, P, N, P, P, N.
For a training set of s samples in m classes, where class i contains si samples, the following quantities measure how well an attribute separates the classes:

1. Expected information needed to classify an arbitrary sample:

    I(s1, s2, ..., sm) = - Σ (si / s) · log2(si / s),  summed over i = 1, ..., m

   * I (information) depends only on the class distribution, not on any particular field.

2. Entropy of attribute A with values {a1, a2, ..., av}:

    E(A) = Σ ((s1j + ... + smj) / s) · I(s1j, ..., smj),  summed over j = 1, ..., v

   * E(A) is the entropy of A: the expected information still required after partitioning the samples on A, where sij is the number of class-i samples with A = aj.

3. Information gained by branching on attribute A:

    Gain(A) = I(s1, ..., sm) - E(A)
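As an illustration, the three quantities above translate directly into code; the sketch below uses illustrative function names, and its final line anticipates the value computed in Step 1 of the worked example that follows.

```python
import math

def info(*counts):
    """Expected information I(s1, ..., sm) = -sum (si/s) * log2(si/s)."""
    s = sum(counts)
    return -sum((si / s) * math.log2(si / s) for si in counts if si > 0)

def entropy(partitions):
    """E(A): weighted information over the partitions {a1, ..., av} of attribute A.
    `partitions` is a list of per-value class-count tuples, e.g. [(84, 42), ...]."""
    s = sum(sum(part) for part in partitions)
    return sum((sum(part) / s) * info(*part) for part in partitions)

def gain(class_counts, partitions):
    """Information gained by branching on attribute A."""
    return info(*class_counts) - entropy(partitions)

print(round(info(120, 130), 4))   # 0.9988, as computed in Step 1 below
```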
The following tables give the summarized training data (one row per combination of attribute values, with a count of samples). Class 1 contains 120 samples and class 2 contains 130 samples.

Class 1 (s1 = 120):
gender   major         birth_country   age_range   gpa         count
M        Science       Canada          20-25       Very_good   16
F        Science       Foreign         25-30       Excellent   22
M        Engineering   Foreign         25-30       Excellent   18
F        Science       Foreign         25-30       Excellent   25
M        Science       Canada          20-25       Excellent   21
F        Engineering   Canada          20-25       Excellent   18

Class 2 (s2 = 130):
gender   major         birth_country   age_range   gpa         count
M        Science       Foreign         <20         Very_good   18
F        Business      Canada          <20         Fair        20
M        Business      Canada          <20         Fair        22
F        Science       Canada          20-25       Fair        24
M        Engineering   Foreign         20-25       Very_good   22
F        Engineering   Canada          <20         Excellent   24
Solution:
Step 1: Calculate the expected information required to classify an arbitrary tuple:

    I(s1, s2, ..., sm) = - Σ (si / s) · log2(si / s),  summed over i = 1, ..., m

1.1 There are two classes, with s1 = 120 (class 1), s2 = 130 (class 2), and s = 250:

    I(s1, s2) = I(120, 130)
              = -(120/250) · log2(120/250) - (130/250) · log2(130/250)
              = -0.48 · log2(0.48) - 0.52 · log2(0.52)
              = 0.50826 + 0.49057
              = 0.9988

    Note: log2 A = log10 A / log10 2, where log10 2 = 0.301.
Step 2: Compute the entropy of the attribute major. Group the counts by the value of major and compute I for each group:

    For major = Science:      s11 = 84,  s21 = 42,  sum = 126,  I(s11, s21) = 0.9183
    For major = Engineering:  s12 = 36,  s22 = 46,  sum = 82,   I(s12, s22) = 0.9892
    For major = Business:     s13 = 0,   s23 = 42,  sum = 42,   I(s13, s23) = 0
    E(major) = (126/250) · I(s11, s21) + (82/250) · I(s12, s22) + (42/250) · I(s13, s23) = 0.7873
Step 3: Compute the information gain.

3.1 Information gained by branching on major:

    Gain(major) = I(s1, s2) - E(major) = 0.9988 - 0.7873 = 0.2115

3.2 In the same way, compute the information gained by each of the remaining fields (gender, birth_country, gpa, and age_range). The results are:
    Gain(gender)        = 0.0003
    Gain(birth_country) = 0.0407
    Gain(major)         = 0.2115
    Gain(gpa)           = 0.4490
    Gain(age_range)     = 0.5971
Conclusion: Gain(age_range) is the largest, so age_range is the most relevant (most informative) attribute for separating the two classes.
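As a check, Gain(major) can be recomputed from the per-value class counts derived from the two tables above (class-1/class-2 counts of 84/42 for Science, 36/46 for Engineering, and 0/42 for Business). The sketch below is illustrative and repeats the small helper functions so that it is self-contained.

```python
import math

def info(*counts):
    s = sum(counts)
    return -sum((si / s) * math.log2(si / s) for si in counts if si > 0)

def entropy(partitions):
    s = sum(sum(p) for p in partitions)
    return sum((sum(p) / s) * info(*p) for p in partitions)

# (class 1, class 2) counts per value of "major", summed from the tables above
major = [(84, 42),   # Science
         (36, 46),   # Engineering
         (0, 42)]    # Business

i_total = info(120, 130)            # 0.9988
e_major = entropy(major)            # 0.7873
print(round(i_total - e_major, 4))  # Gain(major) = 0.2115
```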
Chapter 5 Review Questions
1. Explain the meaning of data mining.
2. Describe the evolution of data mining.
3. Explain data mining as a step in the KDD process (A KDD Process).
4. Explain the relationship between data mining and business intelligence.
5. Describe the kinds of data to which data mining can be applied.
6. Describe the main data mining functionalities and techniques.