
Chapter 5
Data Mining

Decision support systems draw on many kinds of models, such as decision tables and decision trees, and on techniques for discovering knowledge in data, such as the Frequent Pattern tree (FP-Tree) used to mine transaction data. Data mining lets an analyst uncover useful patterns without a programmer having to design each analysis by hand: the mining algorithm searches the data itself. The treatment in this chapter follows the standard text Data Mining: Concepts and Techniques.

5.1 What Is Data Mining?
Data mining is the process of discovering patterns hidden in large volumes of data, such as databases of business transactions, and of expressing those patterns as rules that can support decisions.

Knowledge produced by data mining should have three properties: it should be unknown, valid, and actionable.
1. Unknown: the discovered pattern must not be something the organization already knows. A rule that merely restates the obvious adds nothing; the value of mining lies in patterns that were previously unknown.
2. Valid: the pattern found by data mining must actually hold in the data; its validity is confirmed by validation and checking against the data.
3. Actionable: the organization must be able to act on the discovered knowledge (for example, by changing a promotion or a product placement).

Only results with all three properties make data mining worthwhile.
5.2 The Evolution of Data Mining
1. 1960s, Data Collection: data were simply gathered and stored, answering static, retrospective questions.
2. 1980s, Data Access: relational databases and query languages allowed data to be retrieved directly at the record level.
3. 1990s, Data Warehouse and Decision Support: data warehouses and OLAP supported analysis at many levels of summarization.
4. 2000s, Data Mining: better algorithms and cheaper hardware made it possible to discover hidden patterns and deliver forward-looking information.

5.3 Other Names for Data Mining
Data mining goes by several other names, including:
1. Knowledge discovery in databases (KDD)
2. Knowledge extraction
3. Data archeology
4. Data exploration
5. Data pattern processing
6. Data dredging
among other terms.
5.4 Characteristics of Data Mining
Data mining discovers patterns in data that are often buried deep within very large databases. The tools typically run on a client/server architecture, and the miner is often an end user who, empowered by data drills and other power-query tools, can ask ad hoc questions and get answers quickly with little or no programming skill. Results can be delivered to familiar end-user tools such as spreadsheets.
5.5 Data Mining as a Step in a KDD Process
Data mining is one step in the larger process of knowledge discovery in databases (KDD): patterns extracted by the mining step must be evaluated before they count as knowledge.

[Figure 5.1: Data mining in the KDD (Knowledge Discovery in Database) process. Data flow from the source Databases through Data Integration and Data Cleaning into a Data Warehouse; Selection produces the Task-relevant Data on which Data Mining is performed; Pattern Evaluation turns the resulting patterns into knowledge.]

Steps of a KDD Process
1. Learning the application domain: acquire relevant prior knowledge and the goals of the application.
2. Data selection: create the target data set on which mining will be performed.
3. Data cleaning and preprocessing: remove noise and handle missing or inconsistent values.
4. Data reduction and transformation: find useful features and transform the data into a format the data mining algorithm can use.
5. Choosing the functions of data mining: summarization, classification, regression, association, or clustering.
6. Choosing the mining algorithm(s) with which to mine the data.
7. Data mining: search for the patterns of interest.
8. Pattern evaluation and presentation: judge which patterns are genuinely interesting and present them.
9. Use of discovered knowledge.
5.6 Types of Knowledge to Be Mined
1. Characterization: summarizing the general features of a target class of data.
2. Discrimination: contrasting a target class with one or more comparison classes.
3. Association: finding items or events that occur together.
4. Classification/prediction: assigning records to predefined classes or predicting values.
5. Clustering: grouping similar records without predefined classes.
6. Outlier analysis: finding records that deviate from the general behavior of the data.
7. Other data mining tasks.
5.7 Data Mining and Business Intelligence
Data mining usually takes its input from a data warehouse or data mart, and it forms the analytical layer of Business Intelligence (BI). BI turns raw data into information that supports business decisions; data mining supplies the discovery techniques near the top of that stack, as Figure 5.2 shows.

[Figure 5.2: The position of data mining in Business Intelligence (BI). Source: http://www.g-able.com/thai/solutions/g-biz/bis.htm; Data Mining and Business Intelligence: Cabena et al., 1997.]
5.8 Architecture of a Typical Data Mining System

[Figure 5.3: Architecture of a typical data mining system.]

Data mining can be performed on many kinds of data repositories:
1. Relational databases
2. Data warehouses
3. Transactional databases
4. Advanced database systems and information repositories, for example:
- object-oriented and object-relational databases
- spatial databases
- text databases
- multimedia databases
- the World Wide Web (WWW)
5.9 Data Mining Functionalities (Data Mining Tasks)
The main data mining tasks are:
1. Characterization and discrimination: summarizing a class of data and contrasting it with other classes.
2. Association: finding items or events that occur together.
3. Classification / Regression
Classification builds a model that assigns each record to one of a set of predefined classes. For example, a company may want to classify its customers by brand loyalty so that marketing can treat each group appropriately.
Example: suppose a customer table holds the following fields.

Table 5.1
Table: Customer
Field    | Data Type | Value               | Description
Cus_id   | Int       | unique              | customer identifier
Time     | Int       | integer             | length of the customer relationship
Trend    | Text      | categorical values  | purchasing trend (e.g., over the past 6 months)
Status   | Text      | categorical values  | customer status
Cus_type | Text      | categorical classes | type of customer (the class to be predicted)

The output of the model is the field Cus_type, the dependent variable. The fields used to predict it, the independent variables, are Time, Trend, and Status. In data mining, classification is performed by algorithms whose output is a categorical value such as (Yes, No) or (High, Medium, Low). Widely used classification techniques include:
1. Decision Tree
2. Neural Networks
3. Naive Bayes
4. K-nearest neighbor (K-NN)
Regression is similar to classification except that the model outputs a continuous numeric value rather than a category: for example, predicting how much (say, 1,000 baht) customer B will spend, instead of a Yes/No answer. A minimal classification sketch follows.
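The following short Python sketch (an added illustration, not part of the original text) shows the classification idea on an invented customer table; the encodings and labels are hypothetical stand-ins for Table 5.1.

    # A minimal sketch, assuming scikit-learn is installed; all data are invented.
    from sklearn.tree import DecisionTreeClassifier

    # Independent variables: Time, Trend, Status (categories encoded as integers).
    X = [[12, 0, 1], [3, 1, 0], [24, 0, 1], [6, 1, 0]]
    # Dependent variable: Cus_type (1 = loyal, 0 = not loyal) -- hypothetical labels.
    y = [1, 0, 1, 0]

    model = DecisionTreeClassifier(max_depth=2)
    model.fit(X, y)
    print(model.predict([[18, 0, 1]]))  # classify a new customer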
4. Cluster Analysis / Segmentation
Clustering groups records by their similarity, without any predefined output: there is no dependent variable, only the independent variables themselves, which is why clustering is called unsupervised learning. The resulting clusters are normally disjoint (no two clusters intersect) and together cover the whole data set (their union is the data set).
Example: a business might cluster its customers into segments with similar behavior and design a separate campaign for each segment.
Data mining tools offer clustering techniques such as Demographic Clustering and Neural Clustering.
5. Estimation / Prediction
Estimation builds models whose output is a continuous numeric value, such as estimating a customer's income or spending from other attributes.
Prediction applies classification or estimation to records whose outcome has not yet been observed: the model is built on historical data and then used to predict future values, for example that sales over the next 6 months will grow by 10%.
6. Description / Visualization
Description summarizes the data so that its main features can be communicated and understood.
Visualization presents the data graphically, for example as two- or three-dimensional plots, so that patterns and relationships can be seen directly; a minimal plotting sketch follows.
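As a small added illustration (not in the original text), a two-dimensional plot of invented customer data can be drawn as follows:

    # A minimal 2-D visualization sketch, assuming matplotlib is installed.
    import matplotlib.pyplot as plt

    months_as_customer = [3, 6, 12, 18, 24, 30]   # invented values
    spending = [200, 450, 300, 800, 950, 700]     # invented values (baht)

    plt.scatter(months_as_customer, spending)
    plt.xlabel("Time as customer (months)")
    plt.ylabel("Spending (baht)")
    plt.show()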

5.10 Data Mining Tools and Technologies
Many techniques are used to build data mining models; the most common are described below.

1. Neural Network
Unlike conventional sequential processing, a neural network performs parallel processing: many simple units process inputs into outputs at the same time. Each unit multiplies each of its inputs by a weight, combines the weighted inputs, and passes the result on as output to the next units; learning adjusts the weights until the outputs match the training data. The nodes of a neural network are arranged in layers: an input layer, one or more hidden layers, and an output layer. A tiny numeric sketch follows the figure.

[Figure 5.4: Structure of a neural network.]

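To make the weight-and-combine idea concrete, here is a tiny added sketch of a single unit (the weights are invented; this is not from the original text):

    import math

    def neuron(inputs, weights, bias=0.0):
        # Multiply each input by its weight, sum, and squash to produce the output.
        total = sum(i * w for i, w in zip(inputs, weights)) + bias
        return 1.0 / (1.0 + math.exp(-total))    # sigmoid activation

    print(neuron([0.5, 0.2], [0.8, -0.4]))       # output of one unit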
2. Decision Trees
A decision tree is a supervised learning technique (the classes of the training examples are known in advance). The tree is built from a training set and consists of a root node, internal child nodes that test attribute values, and leaf nodes that assign a class.
3. Memory Based Reasoning (MBR)
MBR classifies a new record by comparing it with known examples kept in memory and assigning the class of the most similar ones; it is another form of supervised learning.
4. Cluster Detection
Cluster detection divides the data into segments (groups of records that resemble one another). Each segment or cluster can in turn be divided into subgroups.

5. Link Analysis
Link analysis finds relationships between records, in three main forms:
5.1 Association Discovery: finding items that occur together, as in market basket analysis of supermarket transactions; the results can also drive mailing lists for direct mail promotions. An association might state, for example, that 75% of customers who buy one item also buy another.
5.2 Sequential Pattern Discovery: finding patterns spread over time, for example that customers who buy a TV tend to buy a VDO player within some period (a long-term purchase sequence).
5.3 Similar Time Sequence Discovery: finding two or more time series that move alike, for example among stock prices.

6. Genetic Algorithm (GA)
A genetic algorithm mimics natural selection, in which the fittest members of a species survive. A GA has 3 main elements: a population of candidate solutions, a fitness function, and genetic operators.
- The fitness function scores each candidate solution, and the fittest candidates are kept.
- The genetic operators then modify and combine the surviving candidates to produce new ones, which are scored by the fitness function again; the cycle repeats until a sufficiently good solution emerges.

7. Rule Induction
Rule induction extracts if-then rules from the data according to the statistical significance of the patterns found.
8. K-nearest neighbor (K-NN)
K-NN classifies a new record by finding the K records most similar to it and counting up the classes of those neighbors; the record is assigned the class held by the majority of its K nearest neighbors. Because K-NN keeps the training examples and compares new records against them directly, it is closely related to MBR (Memory-Based Reasoning). A small sketch follows.
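The sketch below (added for illustration; the points and labels are invented) implements the counting-up idea directly:

    # A minimal K-NN sketch: classify by majority vote of the K nearest points.
    from collections import Counter

    def knn_classify(train, labels, x, k=3):
        # Sort training points by squared Euclidean distance from x.
        order = sorted(range(len(train)),
                       key=lambda i: sum((a - b) ** 2 for a, b in zip(train[i], x)))
        # Count up the classes of the k nearest neighbors and take the majority.
        votes = Counter(labels[i] for i in order[:k])
        return votes.most_common(1)[0][0]

    train = [(1, 1), (2, 1), (8, 9), (9, 8)]    # invented data
    labels = ["A", "A", "B", "B"]
    print(knn_classify(train, labels, (2, 2)))  # -> "A"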
9. Association and Sequence Detection
- Association discovery finds items that occur together, as in market-basket analysis.
- Sequence detection is association extended with time: it finds items that occur together in a particular order.
An association rule A -> B has two sides: A is the antecedent, or LHS (Left-Hand Side), and B is the consequent, or RHS (Right-Hand Side). The rule states that when A occurs, B tends to occur as well.
10. Logistic Regression
Logistic regression is used when the dependent variable takes only 2 values, such as Yes/No or 0/1. The model does not predict that value directly; instead the algorithm models the log odds of the outcome through the logit transformation, as written out below.
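For reference, the logit transformation mentioned above has the standard form (this equation is supplied here, not reproduced from the original):

    logit(p) = log( p / (1 - p) ) = b0 + b1*x1 + ... + bk*xk,   where p = P(Y = 1)

The coefficients b0..bk are estimated so that the predicted probabilities match the observed 0/1 outcomes.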

11. Discriminant Analysis
Discriminant analysis is one of the oldest classification techniques. It finds combinations of the input variables that best separate the classes and assigns each record to the class whose region it falls in. The method was published in 1936 by R. A. Fisher, who used it to classify the famous Iris data into 3 species. It is simple to interpret, but its assumptions suit cleanly separated classes, so it is less often used in modern data mining.
12. Generalized Additive Models (GAM)
GAM extends linear regression and logistic regression: instead of one linear model, it fits a possibly non-linear function of each input variable and adds the results together. GAM can be used for both regression and classification. Because each function is fitted as a curve rather than as a few parameters, GAM, like a neural network, needs substantial data, and like a neural network it relates output to inputs flexibly.
13. Multivariate Adaptive Regression Splines (MARS)
MARS was developed in the 1980s by Jerome H. Friedman, one of the inventors of CART, to address CART's weaknesses. MARS fits piecewise spline functions whose plots can be non-linear, so it behaves like a flexible, non-linear step-wise regression tool.

Data mining draws on many disciplines at once, as Figure 5.5 illustrates.

[Figure 5.5: Data Mining as a confluence of multiple disciplines.]

5.11 Applications of Data Mining
Data mining is applied in many industries, for example:
1. Marketing: identifying customer groups and targeting promotions.
2. Banking / Financial analysis: evaluating credit and designing financial packages.
3. Retailing and sales: analyzing sales and market baskets to decide what to stock, how to arrange shelves, and which items to promote together.
4. Manufacturing and production: diagnosing problems and improving processes.
5. Brokerage and securities trading: mining price and trading histories to support investment decisions.
6. Biomedical and DNA analysis: finding patterns in clinical and genetic data.
Other areas include insurance, computer hardware and software, government and defense, airlines, health care, broadcasting, and law enforcement.

5.12 Intelligent Data Mining
Intelligent data mining applies intelligent techniques to the data held in data warehouses. Rather than producing fixed reports, it searches the data for patterns and expresses the discovered patterns as rules. Intelligent data mining discovers 5 types of knowledge:
1) Association: items or events that occur together.
2) Sequences: events that follow one another over time.
3) Classifications: rules that assign records to predefined classes.
4) Clusters: groups of similar records with no predefined classes.
5) Forecasting: estimates of future values.

[Figure 5.6: How data mining discovers knowledge.]


5.13 Tools for Intelligent Data Mining
Tools used in intelligent data mining include:
1. Case-based Reasoning
2. Neural Computing
3. Intelligent Agents
4. Other tools, such as:
- Decision trees
- Rule induction
- Data visualization
5.14 Worked Examples of Data Mining Calculations
This section works through the calculations behind several data mining techniques.
5.14.1 Association Rules
An association rule is judged by 2 measures: support and confidence.

Support measures how often the items of a rule occur together in the database. For a rule A -> B:

    support = P(A ∩ B)

i.e., the fraction of all transactions that contain both A and B (A intersect B).

Confidence measures how often B appears among the transactions that contain A:

    confidence = P(B|A) = P(A ∩ B) / P(A)
Example 1: Find the support of the rule A -> C and its confidence P(C|A) from the transactions below.

Table 5.2
Transaction ID | Items Bought
2000           | A,B,C
1000           | A,C
4000           | A,D
5000           | B,E,F

Steps:
1. Count the transactions in the database.
2. Count the transactions containing the items of interest.
3. Compute the association measures.
Solve:

Support = fraction of transactions containing both A and C
        = P(A ∩ C)
        = 2/4
        = 0.5
        = 50%

Confidence = P(C|A)
           = P(A ∩ C) / P(A)
           = (2/4) / (3/4)
           = 0.5 / 0.75
           = 0.6666
           = 66.66%

So the rule A -> C holds with (support, confidence) = (50%, 66.66%): out of every 100 transactions, 50 contain both A and C, and of the transactions that contain A, 66.66% also contain C.
Note: the rule A -> C is not the same rule as C -> A. Their support is identical, but the confidence of A -> C is P(C|A) while that of C -> A is P(A|C).
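The same calculation can be scripted. This added Python sketch (not part of the original text) recomputes the support and confidence of A -> C from Table 5.2:

    # Support and confidence of the rule A -> C over the four transactions.
    transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

    n = len(transactions)
    both = sum(1 for t in transactions if {"A", "C"} <= t)   # contain A and C
    with_a = sum(1 for t in transactions if "A" in t)        # contain A

    support = both / n          # 2/4 = 50%
    confidence = both / with_a  # 2/3 = 66.66%
    print(f"support = {support:.2%}, confidence = {confidence:.2%}")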
Association rules are useful in at least two ways:
1. Marketing: if two products sell together, a promotion on one can pull sales of the other, and the two can be placed on the same shelf.
2. Planning: knowing which items sell together helps decide what to stock and how to lay out the store.

Exercises on Association Rules
From Table 5.2, compute the following:
1. The support of C -> A and its confidence P(A|C).
Solve:

2. The support of A -> B and its confidence P(B|A).
Solve:

3. The support of D -> A and its confidence P(A|D).
Solve:

4. The support of E -> B and its confidence P(B|E).
Solve:

5. The support of B -> F and its confidence P(F|B).
Solve:

6. From Table 5.3, compute the support of E -> D and its confidence P(D|E).

Table 5.3
Transaction ID | Items Bought
2000           | A,B,C,F,D
1000           | A,C,D,E
4000           | A,D,E
5000           | B,E,F
6000           | C,E,D
7000           | E,F,A
8000           | F,E,C
9000           | F,A,B

Solve:

7. From Table 5.3, consider the itemset A,D,E. List every association rule that can be formed from A, D, and E, and compute the support and confidence of each.
Solve:

5.14.2 Interestingness Measures
Support and confidence alone can be misleading, so a further measure, interest, tests whether 2 itemsets are genuinely related. Interest compares how often A and B occur together with how often they would co-occur if they were independent:

    interest = P(A ∩ B) / ( P(A) · P(B) )

The interest value is read as follows:
1. Dependence: if the interest of the 2 itemsets is > 1, they are positively dependent (they occur together more often than chance).
2. Independence: if the interest is <= 1, the itemsets are independent or negatively dependent (they occur together no more often, or less often, than chance).

Example 1: From Table 5.4, compute the support and interest of the itemsets (X,Y), (X,Z), and (Y,Z).

Table 5.4
X | Y | Z
1 | 1 | 0
1 | 1 | 1
1 | 0 | 1
1 | 0 | 1
0 | 0 | 1
0 | 0 | 1
0 | 0 | 1
0 | 0 | 1

P(X) = 4/8, P(Y) = 2/8, P(Z) = 7/8
Solve:
1. Compute the interest of each pair using interest = P(A ∩ B) / (P(A) · P(B)):

Item Set | Support              | Interest                                   | Description
X,Y      | 2/8 = 0.25 = 25%     | (2/8) / ((4/8)·(2/8)) = 0.25/0.125 = 2     | > 1: Dependence (X and Y are related)
X,Z      | 3/8 = 0.375 = 37.50% | (3/8) / ((4/8)·(7/8)) = 0.375/0.438 = 0.86 | <= 1: Independence (X and Z are not related)
Y,Z      | 1/8 = 0.125 = 12.50% | (1/8) / ((2/8)·(7/8)) = 0.125/0.218 = 0.57 | <= 1: Independence (Y and Z are not related)
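The interest values above can be checked with a short added script (not in the original); it recomputes the three pairs from Table 5.4:

    # Support and interest (lift) for the item-set pairs in Table 5.4.
    rows = [(1, 1, 0), (1, 1, 1), (1, 0, 1), (1, 0, 1),
            (0, 0, 1), (0, 0, 1), (0, 0, 1), (0, 0, 1)]
    n = len(rows)

    def p(*cols):
        # Fraction of rows in which all the given columns equal 1.
        return sum(all(r[c] for c in cols) for r in rows) / n

    X, Y, Z = 0, 1, 2
    for a, b, name in [(X, Y, "X,Y"), (X, Z, "X,Z"), (Y, Z, "Y,Z")]:
        support = p(a, b)
        interest = support / (p(a) * p(b))
        print(f"{name}: support = {support:.3f}, interest = {interest:.2f}")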

5.14.3 Dissimilarity Between Binary Variables
A dissimilarity value measures how different two objects are and can be used to decide which objects belong together in a group. A binary variable takes only 2 values, 0 and 1:
1) P = Positive = 1 = Yes = True
2) N = Negative = 0 = No = False
Example 1: A patient record contains the attributes name, gender, fever, cough, and four test results. Consider the 3 patients Jack, Mary, and Jim:

Table 5.6
Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4
Jack | M      | Y     | N     | P      | N      | N      | N
Mary | F      | Y     | N     | P      | N      | P      | N
Jim  | M      | Y     | P     | N      | N      | N      | N

Solve:
Step:
1. Convert the values to binary, with Y and P as 1 and N as 0:

Table 5.7
Name | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4
Jack | 1     | 0     | 1      | 0      | 0      | 0
Mary | 1     | 0     | 1      | 0      | 1      | 0
Jim  | 1     | 1     | 0      | 0      | 0      | 0

2. Compute the dissimilarity between each pair of objects with the asymmetric binary measure:

    d(i,j) = (b + c) / (a + b + c)

where, comparing object i against object j over all the binary variables:
a = number of variables equal to 1 in both objects
b = number of variables equal to 1 in i but 0 in j
c = number of variables equal to 0 in i but 1 in j
d = number of variables equal to 0 in both (not used by this measure)

3. For Jack and Mary: a = 2 (Fever and Test-1 are 1 in both), b = 0, c = 1 (Test-3), d = 3, so

    d(Jack,Mary) = (0 + 1) / (2 + 0 + 1) = 1/3 = 0.33

and similarly

    d(Jack,Jim) = 0.67
    d(Jim,Mary) = 0.75

Interpretation: Jack and Mary have the lowest dissimilarity (0.33), so their symptoms are the most alike; Jim is the most different from both Jack and Mary.
Note: the Gender attribute is left out of the calculation (it is not a symptom).
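A short added script (not in the original text) reproduces the three dissimilarities from Table 5.7:

    # Asymmetric binary dissimilarity d(i,j) = (b + c) / (a + b + c).
    jack = [1, 0, 1, 0, 0, 0]   # Fever, Cough, Test-1 .. Test-4
    mary = [1, 0, 1, 0, 1, 0]
    jim  = [1, 1, 0, 0, 0, 0]

    def d(i, j):
        a = sum(x == 1 and y == 1 for x, y in zip(i, j))  # 1 in both
        b = sum(x == 1 and y == 0 for x, y in zip(i, j))  # 1 in i only
        c = sum(x == 0 and y == 1 for x, y in zip(i, j))  # 1 in j only
        return (b + c) / (a + b + c)

    print(round(d(jack, mary), 2),   # 0.33
          round(d(jack, jim), 2),    # 0.67
          round(d(jim, mary), 2))    # 0.75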

Exercise: Dissimilarity Between Binary Variables
1. From the binary (T/F) data in Table 5.8, compute the dissimilarity between each pair of objects and state which 2 objects are most alike.

[Table 5.8: binary (T/F) values of the attributes Test-1 through Test-4 for the objects to be compared.]
5.14.4 Naive Bayesian Classification
Naive Bayesian classification predicts the class of a new record from the conditional probabilities of its attribute values within each class.

Example 1: From the play-tennis data in Table 5.9, decide whether tennis will be played on a day described by <rain, hot, high, false>.

Table 5.9
Outlook  | Temperature | Humidity | Windy | Class (play / don't play)
sunny    | hot         | high     | false | N
sunny    | hot         | high     | true  | N
overcast | hot         | high     | false | P
rain     | mild        | high     | false | P
rain     | cool        | normal   | false | P
rain     | cool        | normal   | true  | N
overcast | cool        | normal   | true  | P
sunny    | mild        | high     | false | N
sunny    | cool        | normal   | false | P
rain     | mild        | normal   | false | P
sunny    | mild        | normal   | true  | P
overcast | mild        | high     | true  | P
overcast | hot         | normal   | false | P
rain     | mild        | high     | true  | N
Solve:
1. Compute the conditional probability of each attribute value within each class, and the prior probability of each class.
Outlook:
P(sunny|P) = 2/9        P(sunny|N) = 3/5
P(overcast|P) = 4/9     P(overcast|N) = 0/5
P(rain|P) = 3/9         P(rain|N) = 2/5
Temperature:
P(hot|P) = 2/9          P(hot|N) = 2/5
P(mild|P) = 4/9         P(mild|N) = 2/5
P(cool|P) = 3/9         P(cool|N) = 1/5
Humidity:
P(high|P) = 3/9         P(high|N) = 4/5
P(normal|P) = 6/9       P(normal|N) = 1/5
Windy:
P(true|P) = 3/9         P(true|N) = 3/5
P(false|P) = 6/9        P(false|N) = 2/5

Class priors: P(p) = 9/14 and P(n) = 5/14, which together give 14/14 = 1 (the whole data set).

2. Classify the unseen sample X = <rain, hot, high, false> by comparing P(X|class).P(class) for the two classes:

P(X|p).P(p) = P(rain|p).P(hot|p).P(high|p).P(false|p).P(p) = (3/9).(2/9).(3/9).(6/9).(9/14) = 0.010582

P(X|n).P(n) = P(rain|n).P(hot|n).P(high|n).P(false|n).P(n) = (2/5).(2/5).(4/5).(2/5).(5/14) = 0.018285

Ans: since 0.018285 > 0.010582, sample X is classified in class n (don't play).
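As an added check (not part of the original text), this sketch recomputes the two class scores from Table 5.9:

    # Naive Bayes scores for X = <rain, hot, high, false> over Table 5.9.
    data = [  # (outlook, temperature, humidity, windy, class)
        ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
        ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
        ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
        ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
        ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
        ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
        ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
    ]
    x = ("rain", "hot", "high", "false")

    for cls in ("P", "N"):
        rows = [r for r in data if r[4] == cls]
        score = len(rows) / len(data)              # prior P(class)
        for i, value in enumerate(x):              # times each P(value|class)
            score *= sum(r[i] == value for r in rows) / len(rows)
        print(cls, round(score, 6))                # P: 0.010582, N: 0.018286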

Exercises on Naive Bayesian Classification
1. From Table 5.9, classify the sample X = <overcast, cool, normal, true>: will tennis be played?
2. A test for Virus X gives the patient records in Table 5.10, where class P means the patient has Virus X and class N means the patient does not. Given a new sample Z of attribute values, determine which class Z falls in.

[Table 5.10: patient records with class labels P, P, N, P, N, P, P, N.]

5.14.5 Entropy and Information Gain
Entropy and information gain measure how well each attribute separates the classes, and are used to select the most relevant attributes (analytical characterization).

Three quantities are computed:
1. The information required to classify an arbitrary tuple, given classes of sizes s1, s2, ..., sm with s = s1 + ... + sm:

    I(s1, s2, ..., sm) = - Σ (i=1..m) (si/s) log2(si/s)

* I (information) is computed once for the classes as a whole, and again within each value of a candidate field.
2. The entropy of an attribute A with values {a1, a2, ..., av}, where value aj splits the classes into sizes s1j, ..., smj:

    E(A) = Σ (j=1..v) ((s1j + ... + smj) / s) · I(s1j, ..., smj)

* Entropy is the information still needed after splitting on A; the smaller the entropy, the better the attribute.
3. The information gained by branching on attribute A:

    Gain(A) = I(s1, s2, ..., sm) - E(A)

* Information gained tells how much each field (attribute) reduces the information required; the field with the highest gain is the most relevant.
Example 1: Two candidate relations describe students in two classes, the target class (graduate students) and the contrasting class (undergraduate students). Use entropy and information gain (analytical characterization) to find which attribute best distinguishes the two classes.

Table 5.11 Candidate relation for Target class: Graduate students (Σ = 120)
gender | major       | birth_country | age_range | gpa       | count
M      | Science     | Canada        | 20-25     | Very_good | 16
F      | Science     | Foreign       | 25-30     | Excellent | 22
M      | Engineering | Foreign       | 25-30     | Excellent | 18
F      | Science     | Foreign       | 25-30     | Excellent | 25
M      | Science     | Canada        | 20-25     | Excellent | 21
F      | Engineering | Canada        | 20-25     | Excellent | 18

Candidate relation for Contrasting class: Undergraduate students (Σ = 130)
gender | major       | birth_country | age_range | gpa       | count
M      | Science     | Foreign       | <20       | Very_good | 18
F      | Business    | Canada        | <20       | Fair      | 20
M      | Business    | Canada        | <20       | Fair      | 22
F      | Science     | Canada        | 20-25     | Fair      | 24
M      | Engineering | Foreign       | 20-25     | Very_good | 22
F      | Engineering | Canada        | <20       | Excellent | 24

Solve:
Step 1: Calculate expected info required to classify an arbitrary tuple.
I is computed first for the two classes as a whole, and then within each field.

    I(s1, s2, ..., sm) = - Σ (i=1..m) (si/s) log2(si/s)

1.1 Compute I for the two classes.
Class 1 (graduate students): S1 = 120
Class 2 (undergraduate students): S2 = 130

I(s1, s2) = I(120, 130)
          = -(120/250) log2(120/250) - (130/250) log2(130/250)
          = -0.48 log2 0.48 - 0.52 log2 0.52
          = (-0.48)(log 0.48 / log 2) + (-0.52)(log 0.52 / log 2)
          = 0.50826 + 0.49057
          = 0.9988

Note: log2 A = log10 A / log10 2, where log10 2 = 0.301.

1.2 Compute I within each field (column) other than the class, i.e., for all 5 fields:
gender, major, birth_country, age_range, and gpa:
1.2.1 I for field major
1.2.2 I for field gender
1.2.3 I for field birth_country
1.2.4 I for field age_range
1.2.5 I for field gpa
Worked example: I for field major.
1.2.1 The field major takes 3 values, so I is computed for the records of each value:
1. For major = Science
Solve:
S11 = 84 (graduate records with major = Science: 16 + 22 + 25 + 21)
S21 = 42 (undergraduate records with major = Science: 18 + 24)
sum(S11,S21) = 126
I(S11,S21) = -(84/126) log2(84/126) - (42/126) log2(42/126)
           = -0.666 log2 0.666 - 0.333 log2 0.333
           = (-0.666)(log 0.666 / log 2) + (-0.333)(log 0.333 / log 2)
           = 0.390 + 0.528
           = 0.918
Compute I for the other values of the field in the same way.
2. For major = Engineering
Solve:
S12 = 36 (18 + 18)
S22 = 46 (22 + 24)
sum(S12,S22) = 82
3. For major = Business
Solve:
S13 = 0 (no graduate records have major = Business)
S23 = 42 (20 + 22)
sum(S13,S23) = 42
I(S13,S23) = 0 (all Business records fall in one class)

Summary for field major:
For major = Science:     S11 = 84, S21 = 42, I(s11,s21) = 0.9183
For major = Engineering: S12 = 36, S22 = 46, I(s12,s22) = 0.9892
For major = Business:    S13 = 0,  S23 = 42, I(s13,s23) = 0
1.2.2 I for field gender
Solve:

1.2.3 I for field birth_country
Solve:

1.2.4 I for field age_range
Solve:

1.2.5 I for field gpa
Solve:

Step 2: Calculate the entropy of each attribute, e.g. major.

Entropy:

    E(A) = Σ (j=1..v) ((s1j + ... + smj) / s) · I(s1j, ..., smj)

Compute the entropy of each field:
2.1 Entropy of field major
2.2 Entropy of field gender
2.3 Entropy of field birth_country
2.4 Entropy of field age_range
2.5 Entropy of field gpa
Worked example: entropy of field major.
1. Collect the sums and the I value of each value of major:

Major                 | Sum | I
Science (S11,S21)     | 126 | 0.918
Engineering (S12,S22) | 82  | 0.989
Business (S13,S23)    | 42  | 0
Sum (S1,S2)           | 250 |

2. Weight each I by the fraction of the records having that value:

E(major) = (126/250) I(s11,s21) + (82/250) I(s12,s22) + (42/250) I(s13,s23) = 0.7873

3. Solve:
E(major) = (126/250)(0.918) + (82/250)(0.989) + (42/250)(0)
         = 0.462 + 0.324 + 0
         = 0.786
which is the entropy of field major (0.7873 when the unrounded I values are used). The lower the entropy, the better the field separates the classes; compute the entropy of the other fields the same way.
2.2 Entropy of field gender
Solve:

2.3 Entropy of field birth_country
Solve:

2.4 Entropy of field age_range
Solve:

2.5 Entropy of field gpa
Solve:

Step 3: Calculate information gain for each attribute.

Information gained:

    Gain(A) = I(s1, s2, ..., sm) - E(A)

Compute the information gained for each field (attribute):
3.1 Information gained for field major
3.2 Information gained for field gender
3.3 Information gained for field birth_country
3.4 Information gained for field age_range
3.5 Information gained for field gpa
Worked example: information gained for field major.

Gain(major) = I(s1, s2) - E(major)
            = 0.9988 - 0.7873
            = 0.2115

(The rounded entropy 0.786 gives 0.9988 - 0.786 = 0.2128.) Compute the gain of the other fields the same way.
3.2 Information gained for field gender
Solve:

3.3 Information gained for field birth_country
Solve:

3.4 Information gained for field age_range
Solve:

3.5 Information gained for field gpa
Solve:

Summary: Information gain for all attributes

Gain(gender)        = 0.0003
Gain(birth_country) = 0.0407
Gain(major)         = 0.2115
Gain(gpa)           = 0.4490
Gain(age_range)     = 0.5971

Conclusion: Gain(age_range) is the highest, so age_range is the attribute that best distinguishes graduate from undergraduate students.
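To make the three formulas concrete, this added Python sketch (not part of the original text) recomputes I(s1,s2), E(major), and Gain(major) from the class counts derived from the two candidate relations:

    import math

    # (class, major, count) summarized from the candidate relations above.
    rows = [("grad", "Science", 84), ("grad", "Engineering", 36),
            ("undergrad", "Science", 42), ("undergrad", "Engineering", 46),
            ("undergrad", "Business", 42)]

    def info(counts):
        # I(s1,...,sm) = -sum (si/s) log2(si/s); empty classes contribute 0.
        s = sum(counts)
        return -sum(c / s * math.log2(c / s) for c in counts if c)

    total = sum(c for _, _, c in rows)   # 250 students in all
    I_classes = info([sum(c for cls, _, c in rows if cls == "grad"),
                      sum(c for cls, _, c in rows if cls == "undergrad")])

    E_major = 0.0
    for value in ("Science", "Engineering", "Business"):
        counts = [sum(c for cls, m, c in rows if m == value and cls == k)
                  for k in ("grad", "undergrad")]
        E_major += sum(counts) / total * info(counts)

    print(round(I_classes, 4), round(E_major, 4), round(I_classes - E_major, 4))
    # -> 0.9988 0.7873 0.2115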

Review Questions for Chapter 5
1. What is data mining? Explain.
2. Describe the evolution that led to data mining.
3. Explain data mining as a step in a KDD process, and list the steps of the KDD process.
4. How are data mining and business intelligence related?
5. Describe the main data mining tasks, with an example of each.
6. Describe the main data mining tools and techniques.
