
Chapter 5
Data Mining

Decision support systems draw on many kinds of models, such as decision tables and decision trees, and on techniques for discovering knowledge in data, such as the Frequent Pattern tree (FP-Tree) used to mine transaction data. Data mining lets an analyst uncover useful patterns without a programmer having to design each analysis by hand: the mining algorithm searches the data itself. The treatment in this chapter follows the standard text Data Mining: Concepts and Techniques.

5.1 What Is Data Mining?
Data mining is the process of discovering patterns hidden in large volumes of data, such as databases of business transactions, and of expressing those patterns as rules that can support decisions.

Knowledge produced by data mining should have three properties: it should be unknown, valid, and actionable.
1. Unknown: the discovered pattern must not be something the organization already knows. A rule that merely restates the obvious adds nothing; the value of mining lies in patterns that were previously unknown.
2. Valid: the pattern found by data mining must actually hold in the data; its validity is confirmed by validation and checking against the data.
3. Actionable: the organization must be able to act on the discovered knowledge (for example, by changing a promotion or a product placement).

Only results with all three properties make data mining worthwhile.
5.2 The Evolution of Data Mining
1. 1960s, Data Collection: data were simply gathered and stored, answering static, retrospective questions.
2. 1980s, Data Access: relational databases and query languages allowed data to be retrieved directly at the record level.
3. 1990s, Data Warehouse and Decision Support: data warehouses and OLAP supported analysis at many levels of summarization.
4. 2000s, Data Mining: better algorithms and cheaper hardware made it possible to discover hidden patterns and deliver forward-looking information.

5.3 Other Names for Data Mining
Data mining goes by several other names, including:
1. Knowledge discovery in databases (KDD)
2. Knowledge extraction
3. Data archeology
4. Data exploration
5. Data pattern processing
6. Data dredging
among other terms.
5.4 Characteristics of Data Mining
Data mining discovers patterns in data that are often buried deep within very large databases. The tools typically run on a client/server architecture, and the miner is often an end user who, empowered by data drills and other power-query tools, can ask ad hoc questions and get answers quickly with little or no programming skill. Results can be delivered to familiar end-user tools such as spreadsheets.
5.5 Data Mining as a Step in a KDD Process
Data mining is one step in the larger process of knowledge discovery in databases (KDD): patterns extracted by the mining step must be evaluated before they count as knowledge.

[Figure 5.1: Data mining in the KDD (Knowledge Discovery in Database) process. Data flow from the source Databases through Data Integration and Data Cleaning into a Data Warehouse; Selection produces the Task-relevant Data on which Data Mining is performed; Pattern Evaluation turns the resulting patterns into knowledge.]

Steps of a KDD Process
1. Learning the application domain: acquire relevant prior knowledge and the goals of the application.
2. Data selection: create the target data set on which mining will be performed.
3. Data cleaning and preprocessing: remove noise and handle missing or inconsistent values.
4. Data reduction and transformation: find useful features and transform the data into a format the data mining algorithm can use.
5. Choosing the functions of data mining: summarization, classification, regression, association, or clustering.
6. Choosing the mining algorithm(s) with which to mine the data.
7. Data mining: search for the patterns of interest.
8. Pattern evaluation and presentation: judge which patterns are genuinely interesting and present them.
9. Use of discovered knowledge.
5.6 Types of Knowledge to Be Mined
1. Characterization: summarizing the general features of a target class of data.
2. Discrimination: contrasting a target class with one or more comparison classes.
3. Association: finding items or events that occur together.
4. Classification/prediction: assigning records to predefined classes or predicting values.
5. Clustering: grouping similar records without predefined classes.
6. Outlier analysis: finding records that deviate from the general behavior of the data.
7. Other data mining tasks.
5.7 Data Mining and Business Intelligence
Data mining usually takes its input from a data warehouse or data mart, and it forms the analytical layer of Business Intelligence (BI). BI turns raw data into information that supports business decisions; data mining supplies the discovery techniques near the top of that stack, as Figure 5.2 shows.

[Figure 5.2: The position of data mining in Business Intelligence (BI). Source: http://www.g-able.com/thai/solutions/g-biz/bis.htm; Data Mining and Business Intelligence: Cabena et al., 1997.]
5.8 Architecture of a Typical Data Mining System

[Figure 5.3: Architecture of a typical data mining system.]

Data mining can be performed on many kinds of data repositories:
1. Relational databases
2. Data warehouses
3. Transactional databases
4. Advanced database systems and information repositories, for example:
- object-oriented and object-relational databases
- spatial databases
- text databases
- multimedia databases
- the World Wide Web (WWW)
5.9 Data Mining Functionalities (Data Mining Tasks)
The main data mining tasks are:
1. Characterization and discrimination: summarizing a class of data and contrasting it with other classes.
2. Association: finding items or events that occur together.
3. Classification / Regression
Classification builds a model that assigns each record to one of a set of predefined classes. For example, a company may want to classify its customers by brand loyalty so that marketing can treat each group appropriately.
Example: suppose a customer table holds the following fields.

Table 5.1
Table: Customer
Field    | Data Type | Value               | Description
Cus_id   | Int       | unique              | customer identifier
Time     | Int       | integer             | length of the customer relationship
Trend    | Text      | categorical values  | purchasing trend (e.g., over the past 6 months)
Status   | Text      | categorical values  | customer status
Cus_type | Text      | categorical classes | type of customer (the class to be predicted)

The output of the model is the field Cus_type, the dependent variable. The fields used to predict it, the independent variables, are Time, Trend, and Status. In data mining, classification is performed by algorithms whose output is a categorical value such as (Yes, No) or (High, Medium, Low). Widely used classification techniques include:
1. Decision Tree
2. Neural Networks
3. Naive Bayes
4. K-nearest neighbor (K-NN)
Regression is similar to classification except that the model outputs a continuous numeric value rather than a category: for example, predicting how much (say, 1,000 baht) customer B will spend, instead of a Yes/No answer. A minimal classification sketch follows.
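The following short Python sketch (an added illustration, not part of the original text) shows the classification idea on an invented customer table; the encodings and labels are hypothetical stand-ins for Table 5.1.

    # A minimal sketch, assuming scikit-learn is installed; all data are invented.
    from sklearn.tree import DecisionTreeClassifier

    # Independent variables: Time, Trend, Status (categories encoded as integers).
    X = [[12, 0, 1], [3, 1, 0], [24, 0, 1], [6, 1, 0]]
    # Dependent variable: Cus_type (1 = loyal, 0 = not loyal) -- hypothetical labels.
    y = [1, 0, 1, 0]

    model = DecisionTreeClassifier(max_depth=2)
    model.fit(X, y)
    print(model.predict([[18, 0, 1]]))  # classify a new customer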
4. Cluster Analysis / Segmentation
Clustering groups records by their similarity, without any predefined output: there is no dependent variable, only the independent variables themselves, which is why clustering is called unsupervised learning. The resulting clusters are normally disjoint (no two clusters intersect) and together cover the whole data set (their union is the data set).
Example: a business might cluster its customers into segments with similar behavior and design a separate campaign for each segment.
Data mining tools offer clustering techniques such as Demographic Clustering and Neural Clustering.
5. Estimation / Prediction
Estimation builds models whose output is a continuous numeric value, such as estimating a customer's income or spending from other attributes.
Prediction applies classification or estimation to records whose outcome has not yet been observed: the model is built on historical data and then used to predict future values, for example that sales over the next 6 months will grow by 10%.
6. Description / Visualization
Description summarizes the data so that its main features can be communicated and understood.
Visualization presents the data graphically, for example as two- or three-dimensional plots, so that patterns and relationships can be seen directly; a minimal plotting sketch follows.
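As a small added illustration (not in the original text), a two-dimensional plot of invented customer data can be drawn as follows:

    # A minimal 2-D visualization sketch, assuming matplotlib is installed.
    import matplotlib.pyplot as plt

    months_as_customer = [3, 6, 12, 18, 24, 30]   # invented values
    spending = [200, 450, 300, 800, 950, 700]     # invented values (baht)

    plt.scatter(months_as_customer, spending)
    plt.xlabel("Time as customer (months)")
    plt.ylabel("Spending (baht)")
    plt.show()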

5.10 Data Mining Tools and Technologies
Many techniques are used to build data mining models; the most common are described below.

1. Neural Network
Unlike conventional sequential processing, a neural network performs parallel processing: many simple units process inputs into outputs at the same time. Each unit multiplies each of its inputs by a weight, combines the weighted inputs, and passes the result on as output to the next units; learning adjusts the weights until the outputs match the training data. The nodes of a neural network are arranged in layers: an input layer, one or more hidden layers, and an output layer. A tiny numeric sketch follows the figure.

[Figure 5.4: Structure of a neural network.]

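To make the weight-and-combine idea concrete, here is a tiny added sketch of a single unit (the weights are invented; this is not from the original text):

    import math

    def neuron(inputs, weights, bias=0.0):
        # Multiply each input by its weight, sum, and squash to produce the output.
        total = sum(i * w for i, w in zip(inputs, weights)) + bias
        return 1.0 / (1.0 + math.exp(-total))    # sigmoid activation

    print(neuron([0.5, 0.2], [0.8, -0.4]))       # output of one unit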
2. Decision Trees
A decision tree is a supervised learning technique (the classes of the training examples are known in advance). The tree is built from a training set and consists of a root node, internal child nodes that test attribute values, and leaf nodes that assign a class.
3. Memory Based Reasoning (MBR)
MBR classifies a new record by comparing it with known examples kept in memory and assigning the class of the most similar ones; it is another form of supervised learning.
4. Cluster Detection
Cluster detection divides the data into segments (groups of records that resemble one another). Each segment or cluster can in turn be divided into subgroups.

5. Link Analysis
Link analysis finds relationships between records, in three main forms:
5.1 Association Discovery: finding items that occur together, as in market basket analysis of supermarket transactions; the results can also drive mailing lists for direct mail promotions. An association might state, for example, that 75% of customers who buy one item also buy another.
5.2 Sequential Pattern Discovery: finding patterns spread over time, for example that customers who buy a TV tend to buy a VDO player within some period (a long-term purchase sequence).
5.3 Similar Time Sequence Discovery: finding two or more time series that move alike, for example among stock prices.

6. Genetic Algorithm (GA)
A genetic algorithm mimics natural selection, in which the fittest members of a species survive. A GA has 3 main elements: a population of candidate solutions, a fitness function, and genetic operators.
- The fitness function scores each candidate solution, and the fittest candidates are kept.
- The genetic operators then modify and combine the surviving candidates to produce new ones, which are scored by the fitness function again; the cycle repeats until a sufficiently good solution emerges.

7. Rule Induction
Rule induction extracts if-then rules from the data according to the statistical significance of the patterns found.
8. K-nearest neighbor (K-NN)
K-NN classifies a new record by finding the K records most similar to it and counting up the classes of those neighbors; the record is assigned the class held by the majority of its K nearest neighbors. Because K-NN keeps the training examples and compares new records against them directly, it is closely related to MBR (Memory-Based Reasoning). A small sketch follows.
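The sketch below (added for illustration; the points and labels are invented) implements the counting-up idea directly:

    # A minimal K-NN sketch: classify by majority vote of the K nearest points.
    from collections import Counter

    def knn_classify(train, labels, x, k=3):
        # Sort training points by squared Euclidean distance from x.
        order = sorted(range(len(train)),
                       key=lambda i: sum((a - b) ** 2 for a, b in zip(train[i], x)))
        # Count up the classes of the k nearest neighbors and take the majority.
        votes = Counter(labels[i] for i in order[:k])
        return votes.most_common(1)[0][0]

    train = [(1, 1), (2, 1), (8, 9), (9, 8)]    # invented data
    labels = ["A", "A", "B", "B"]
    print(knn_classify(train, labels, (2, 2)))  # -> "A"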
9. Association and Sequence Detection
- Association discovery finds items that occur together, as in market-basket analysis.
- Sequence detection is association extended with time: it finds items that occur together in a particular order.
An association rule A -> B has two sides: A is the antecedent, or LHS (Left-Hand Side), and B is the consequent, or RHS (Right-Hand Side). The rule states that when A occurs, B tends to occur as well.
10. Logistic Regression
Logistic regression is used when the dependent variable takes only 2 values, such as Yes/No or 0/1. The model does not predict that value directly; instead the algorithm models the log odds of the outcome through the logit transformation, as written out below.
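For reference, the logit transformation mentioned above has the standard form (this equation is supplied here, not reproduced from the original):

    logit(p) = log( p / (1 - p) ) = b0 + b1*x1 + ... + bk*xk,   where p = P(Y = 1)

The coefficients b0..bk are estimated so that the predicted probabilities match the observed 0/1 outcomes.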

11. Discriminant Analysis
Discriminant analysis is one of the oldest classification techniques. It finds combinations of the input variables that best separate the classes and assigns each record to the class whose region it falls in. The method was published in 1936 by R. A. Fisher, who used it to classify the famous Iris data into 3 species. It is simple to interpret, but its assumptions suit cleanly separated classes, so it is less often used in modern data mining.
12. Generalized Additive Models (GAM)
GAM extends linear regression and logistic regression: instead of one linear model, it fits a possibly non-linear function of each input variable and adds the results together. GAM can be used for both regression and classification. Because each function is fitted as a curve rather than as a few parameters, GAM, like a neural network, needs substantial data, and like a neural network it relates output to inputs flexibly.
13. Multivariate Adaptive Regression Splines (MARS)
MARS was developed in the 1980s by Jerome H. Friedman, one of the inventors of CART, to address CART's weaknesses. MARS fits piecewise spline functions whose plots can be non-linear, so it behaves like a flexible, non-linear step-wise regression tool.

Data mining draws on many disciplines at once, as Figure 5.5 illustrates.

[Figure 5.5: Data Mining as a confluence of multiple disciplines.]

5.11 Applications of Data Mining
Data mining is applied in many industries, for example:
1. Marketing: identifying customer groups and targeting promotions.
2. Banking / Financial analysis: evaluating credit and designing financial packages.
3. Retailing and sales: analyzing sales and market baskets to decide what to stock, how to arrange shelves, and which items to promote together.
4. Manufacturing and production: diagnosing problems and improving processes.
5. Brokerage and securities trading: mining price and trading histories to support investment decisions.
6. Biomedical and DNA analysis: finding patterns in clinical and genetic data.
Other areas include insurance, computer hardware and software, government and defense, airlines, health care, broadcasting, and law enforcement.

5.12 Intelligent Data Mining
Intelligent data mining applies intelligent techniques to the data held in data warehouses. Rather than producing fixed reports, it searches the data for patterns and expresses the discovered patterns as rules. Intelligent data mining discovers 5 types of knowledge:
1) Association: items or events that occur together.
2) Sequences: events that follow one another over time.
3) Classifications: rules that assign records to predefined classes.
4) Clusters: groups of similar records with no predefined classes.
5) Forecasting: estimates of future values.

[Figure 5.6: How data mining discovers knowledge.]


5.13 Tools for Intelligent Data Mining
Tools used in intelligent data mining include:
1. Case-based Reasoning
2. Neural Computing
3. Intelligent Agents
4. Other tools, such as:
- Decision trees
- Rule induction
- Data visualization
5.14 Worked Examples of Data Mining Calculations
This section works through the calculations behind several data mining techniques.
5.14.1 Association Rules
An association rule is judged by 2 measures: support and confidence.

Support measures how often the items of a rule occur together in the database. For a rule A -> B:

    support = P(A ∩ B)

i.e., the fraction of all transactions that contain both A and B (A intersect B).

Confidence measures how often B appears among the transactions that contain A:

    confidence = P(B|A) = P(A ∩ B) / P(A)
Example 1: Find the support of the rule A -> C and its confidence P(C|A) from the transactions below.

Table 5.2
Transaction ID | Items Bought
2000           | A,B,C
1000           | A,C
4000           | A,D
5000           | B,E,F

Steps:
1. Count the transactions in the database.
2. Count the transactions containing the items of interest.
3. Compute the association measures.
Solve:

Support = fraction of transactions containing both A and C
        = P(A ∩ C)
        = 2/4
        = 0.5
        = 50%

Confidence = P(C|A)
           = P(A ∩ C) / P(A)
           = (2/4) / (3/4)
           = 0.5 / 0.75
           = 0.6666
           = 66.66%

So the rule A -> C holds with (support, confidence) = (50%, 66.66%): out of every 100 transactions, 50 contain both A and C, and of the transactions that contain A, 66.66% also contain C.
Note: the rule A -> C is not the same rule as C -> A. Their support is identical, but the confidence of A -> C is P(C|A) while that of C -> A is P(A|C).
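The same calculation can be scripted. This added Python sketch (not part of the original text) recomputes the support and confidence of A -> C from Table 5.2:

    # Support and confidence of the rule A -> C over the four transactions.
    transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

    n = len(transactions)
    both = sum(1 for t in transactions if {"A", "C"} <= t)   # contain A and C
    with_a = sum(1 for t in transactions if "A" in t)        # contain A

    support = both / n          # 2/4 = 50%
    confidence = both / with_a  # 2/3 = 66.66%
    print(f"support = {support:.2%}, confidence = {confidence:.2%}")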
Association rules are useful in at least two ways:
1. Marketing: if two products sell together, a promotion on one can pull sales of the other, and the two can be placed on the same shelf.
2. Planning: knowing which items sell together helps decide what to stock and how to lay out the store.

Exercises on Association Rules
From Table 5.2, compute the following:
1. The support of C -> A and its confidence P(A|C).
Solve:

2. The support of A -> B and its confidence P(B|A).
Solve:

3. The support of D -> A and its confidence P(A|D).
Solve:

4. The support of E -> B and its confidence P(B|E).
Solve:

5. The support of B -> F and its confidence P(F|B).
Solve:

6. From Table 5.3, compute the support of E -> D and its confidence P(D|E).

Table 5.3
Transaction ID | Items Bought
2000           | A,B,C,F,D
1000           | A,C,D,E
4000           | A,D,E
5000           | B,E,F
6000           | C,E,D
7000           | E,F,A
8000           | F,E,C
9000           | F,A,B

Solve:

7. From Table 5.3, consider the itemset A,D,E. List every association rule that can be formed from A, D, and E, and compute the support and confidence of each.
Solve:

5.14.2 Interestingness Measures
Support and confidence alone can be misleading, so a further measure, interest, tests whether 2 itemsets are genuinely related. Interest compares how often A and B occur together with how often they would co-occur if they were independent:

    interest = P(A ∩ B) / ( P(A) · P(B) )

The interest value is read as follows:
1. Dependence: if the interest of the 2 itemsets is > 1, they are positively dependent (they occur together more often than chance).
2. Independence: if the interest is <= 1, the itemsets are independent or negatively dependent (they occur together no more often, or less often, than chance).

Example 1: From Table 5.4, compute the support and interest of the itemsets (X,Y), (X,Z), and (Y,Z).

Table 5.4
X | Y | Z
1 | 1 | 0
1 | 1 | 1
1 | 0 | 1
1 | 0 | 1
0 | 0 | 1
0 | 0 | 1
0 | 0 | 1
0 | 0 | 1

P(X) = 4/8, P(Y) = 2/8, P(Z) = 7/8
Solve:
1. Compute the interest of each pair using interest = P(A ∩ B) / (P(A) · P(B)):

Item Set | Support              | Interest                                   | Description
X,Y      | 2/8 = 0.25 = 25%     | (2/8) / ((4/8)·(2/8)) = 0.25/0.125 = 2     | > 1: Dependence (X and Y are related)
X,Z      | 3/8 = 0.375 = 37.50% | (3/8) / ((4/8)·(7/8)) = 0.375/0.438 = 0.86 | <= 1: Independence (X and Z are not related)
Y,Z      | 1/8 = 0.125 = 12.50% | (1/8) / ((2/8)·(7/8)) = 0.125/0.218 = 0.57 | <= 1: Independence (Y and Z are not related)
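The interest values above can be checked with a short added script (not in the original); it recomputes the three pairs from Table 5.4:

    # Support and interest (lift) for the item-set pairs in Table 5.4.
    rows = [(1, 1, 0), (1, 1, 1), (1, 0, 1), (1, 0, 1),
            (0, 0, 1), (0, 0, 1), (0, 0, 1), (0, 0, 1)]
    n = len(rows)

    def p(*cols):
        # Fraction of rows in which all the given columns equal 1.
        return sum(all(r[c] for c in cols) for r in rows) / n

    X, Y, Z = 0, 1, 2
    for a, b, name in [(X, Y, "X,Y"), (X, Z, "X,Z"), (Y, Z, "Y,Z")]:
        support = p(a, b)
        interest = support / (p(a) * p(b))
        print(f"{name}: support = {support:.3f}, interest = {interest:.2f}")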

5.14.3 Dissimilarity Between Binary Variables
A dissimilarity value measures how different two objects are and can be used to decide which objects belong together in a group. A binary variable takes only 2 values, 0 and 1:
1) P = Positive = 1 = Yes = True
2) N = Negative = 0 = No = False
Example 1: A patient record contains the attributes name, gender, fever, cough, and four test results. Consider the 3 patients Jack, Mary, and Jim:

Table 5.6
Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4
Jack | M      | Y     | N     | P      | N      | N      | N
Mary | F      | Y     | N     | P      | N      | P      | N
Jim  | M      | Y     | P     | N      | N      | N      | N

Solve:
Step:
1. Convert the values to binary, with Y and P as 1 and N as 0:

Table 5.7
Name | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4
Jack | 1     | 0     | 1      | 0      | 0      | 0
Mary | 1     | 0     | 1      | 0      | 1      | 0
Jim  | 1     | 1     | 0      | 0      | 0      | 0

2. Compute the dissimilarity between each pair of objects with the asymmetric binary measure:

    d(i,j) = (b + c) / (a + b + c)

where, comparing object i against object j over all the binary variables:
a = number of variables equal to 1 in both objects
b = number of variables equal to 1 in i but 0 in j
c = number of variables equal to 0 in i but 1 in j
d = number of variables equal to 0 in both (not used by this measure)

3. For Jack and Mary: a = 2 (Fever and Test-1 are 1 in both), b = 0, c = 1 (Test-3), d = 3, so

    d(Jack,Mary) = (0 + 1) / (2 + 0 + 1) = 1/3 = 0.33

and similarly

    d(Jack,Jim) = 0.67
    d(Jim,Mary) = 0.75

Interpretation: Jack and Mary have the lowest dissimilarity (0.33), so their symptoms are the most alike; Jim is the most different from both Jack and Mary.
Note: the Gender attribute is left out of the calculation (it is not a symptom).
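A short added script (not in the original text) reproduces the three dissimilarities from Table 5.7:

    # Asymmetric binary dissimilarity d(i,j) = (b + c) / (a + b + c).
    jack = [1, 0, 1, 0, 0, 0]   # Fever, Cough, Test-1 .. Test-4
    mary = [1, 0, 1, 0, 1, 0]
    jim  = [1, 1, 0, 0, 0, 0]

    def d(i, j):
        a = sum(x == 1 and y == 1 for x, y in zip(i, j))  # 1 in both
        b = sum(x == 1 and y == 0 for x, y in zip(i, j))  # 1 in i only
        c = sum(x == 0 and y == 1 for x, y in zip(i, j))  # 1 in j only
        return (b + c) / (a + b + c)

    print(round(d(jack, mary), 2),   # 0.33
          round(d(jack, jim), 2),    # 0.67
          round(d(jim, mary), 2))    # 0.75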

Exercise: Dissimilarity Between Binary Variables
1. From the binary (T/F) data in Table 5.8, compute the dissimilarity between each pair of objects and state which 2 objects are most alike.

[Table 5.8: binary (T/F) values of the attributes Test-1 through Test-4 for the objects to be compared.]
5.14.4 Naive Bayesian Classification
Naive Bayesian classification predicts the class of a new record from the conditional probabilities of its attribute values within each class.

Example 1: From the play-tennis data in Table 5.9, decide whether tennis will be played on a day described by <rain, hot, high, false>.

Table 5.9
Outlook  | Temperature | Humidity | Windy | Class (play / don't play)
sunny    | hot         | high     | false | N
sunny    | hot         | high     | true  | N
overcast | hot         | high     | false | P
rain     | mild        | high     | false | P
rain     | cool        | normal   | false | P
rain     | cool        | normal   | true  | N
overcast | cool        | normal   | true  | P
sunny    | mild        | high     | false | N
sunny    | cool        | normal   | false | P
rain     | mild        | normal   | false | P
sunny    | mild        | normal   | true  | P
overcast | mild        | high     | true  | P
overcast | hot         | normal   | false | P
rain     | mild        | high     | true  | N
Solve:
1. Compute the conditional probability of each attribute value within each class, and the prior probability of each class.
Outlook:
P(sunny|P) = 2/9        P(sunny|N) = 3/5
P(overcast|P) = 4/9     P(overcast|N) = 0/5
P(rain|P) = 3/9         P(rain|N) = 2/5
Temperature:
P(hot|P) = 2/9          P(hot|N) = 2/5
P(mild|P) = 4/9         P(mild|N) = 2/5
P(cool|P) = 3/9         P(cool|N) = 1/5
Humidity:
P(high|P) = 3/9         P(high|N) = 4/5
P(normal|P) = 6/9       P(normal|N) = 1/5
Windy:
P(true|P) = 3/9         P(true|N) = 3/5
P(false|P) = 6/9        P(false|N) = 2/5

Class priors: P(p) = 9/14 and P(n) = 5/14, which together give 14/14 = 1 (the whole data set).

2. Classify the unseen sample X = <rain, hot, high, false> by comparing P(X|class).P(class) for the two classes:

P(X|p).P(p) = P(rain|p).P(hot|p).P(high|p).P(false|p).P(p) = (3/9).(2/9).(3/9).(6/9).(9/14) = 0.010582

P(X|n).P(n) = P(rain|n).P(hot|n).P(high|n).P(false|n).P(n) = (2/5).(2/5).(4/5).(2/5).(5/14) = 0.018285

Ans: since 0.018285 > 0.010582, sample X is classified in class n (don't play).
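As an added check (not part of the original text), this sketch recomputes the two class scores from Table 5.9:

    # Naive Bayes scores for X = <rain, hot, high, false> over Table 5.9.
    data = [  # (outlook, temperature, humidity, windy, class)
        ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
        ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
        ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
        ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
        ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
        ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
        ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
    ]
    x = ("rain", "hot", "high", "false")

    for cls in ("P", "N"):
        rows = [r for r in data if r[4] == cls]
        score = len(rows) / len(data)              # prior P(class)
        for i, value in enumerate(x):              # times each P(value|class)
            score *= sum(r[i] == value for r in rows) / len(rows)
        print(cls, round(score, 6))                # P: 0.010582, N: 0.018286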

Exercises on Naive Bayesian Classification
1. From Table 5.9, classify the sample X = <overcast, cool, normal, true>: will tennis be played?
2. A test for Virus X gives the patient records in Table 5.10, where class P means the patient has Virus X and class N means the patient does not. Given a new sample Z of attribute values, determine which class Z falls in.

[Table 5.10: patient records with class labels P, P, N, P, N, P, P, N.]

5.14.5 Entropy and Information Gain
Entropy and information gain measure how well each attribute separates the classes, and are used to select the most relevant attributes (analytical characterization).

Three quantities are computed:
1. The information required to classify an arbitrary tuple, given classes of sizes s1, s2, ..., sm with s = s1 + ... + sm:

    I(s1, s2, ..., sm) = - Σ (i=1..m) (si/s) log2(si/s)

* I (information) is computed once for the classes as a whole, and again within each value of a candidate field.
2. The entropy of an attribute A with values {a1, a2, ..., av}, where value aj splits the classes into sizes s1j, ..., smj:

    E(A) = Σ (j=1..v) ((s1j + ... + smj) / s) · I(s1j, ..., smj)

* Entropy is the information still needed after splitting on A; the smaller the entropy, the better the attribute.
3. The information gained by branching on attribute A:

    Gain(A) = I(s1, s2, ..., sm) - E(A)

* Information gained tells how much each field (attribute) reduces the information required; the field with the highest gain is the most relevant.
Example 1: Two candidate relations describe students in two classes, the target class (graduate students) and the contrasting class (undergraduate students). Use entropy and information gain (analytical characterization) to find which attribute best distinguishes the two classes.

Table 5.11 Candidate relation for Target class: Graduate students (Σ = 120)
gender | major       | birth_country | age_range | gpa       | count
M      | Science     | Canada        | 20-25     | Very_good | 16
F      | Science     | Foreign       | 25-30     | Excellent | 22
M      | Engineering | Foreign       | 25-30     | Excellent | 18
F      | Science     | Foreign       | 25-30     | Excellent | 25
M      | Science     | Canada        | 20-25     | Excellent | 21
F      | Engineering | Canada        | 20-25     | Excellent | 18

Candidate relation for Contrasting class: Undergraduate students (Σ = 130)
gender | major       | birth_country | age_range | gpa       | count
M      | Science     | Foreign       | <20       | Very_good | 18
F      | Business    | Canada        | <20       | Fair      | 20
M      | Business    | Canada        | <20       | Fair      | 22
F      | Science     | Canada        | 20-25     | Fair      | 24
M      | Engineering | Foreign       | 20-25     | Very_good | 22
F      | Engineering | Canada        | <20       | Excellent | 24

Solve:
Step 1: Calculate expected info required to classify an arbitrary tuple.
I is computed first for the two classes as a whole, and then within each field.

    I(s1, s2, ..., sm) = - Σ (i=1..m) (si/s) log2(si/s)

1.1 Compute I for the two classes.
Class 1 (graduate students): S1 = 120
Class 2 (undergraduate students): S2 = 130

I(s1, s2) = I(120, 130)
          = -(120/250) log2(120/250) - (130/250) log2(130/250)
          = -0.48 log2 0.48 - 0.52 log2 0.52
          = (-0.48)(log 0.48 / log 2) + (-0.52)(log 0.52 / log 2)
          = 0.50826 + 0.49057
          = 0.9988

Note: log2 A = log10 A / log10 2, where log10 2 = 0.301.

1.2 Compute I within each field (column) other than the class, i.e., for all 5 fields:
gender, major, birth_country, age_range, and gpa:
1.2.1 I for field major
1.2.2 I for field gender
1.2.3 I for field birth_country
1.2.4 I for field age_range
1.2.5 I for field gpa
Worked example: I for field major.
1.2.1 The field major takes 3 values, so I is computed for the records of each value:
1. For major = Science
Solve:
S11 = 84 (graduate records with major = Science: 16 + 22 + 25 + 21)
S21 = 42 (undergraduate records with major = Science: 18 + 24)
sum(S11,S21) = 126
I(S11,S21) = -(84/126) log2(84/126) - (42/126) log2(42/126)
           = -0.666 log2 0.666 - 0.333 log2 0.333
           = (-0.666)(log 0.666 / log 2) + (-0.333)(log 0.333 / log 2)
           = 0.390 + 0.528
           = 0.918
Compute I for the other values of the field in the same way.
2. For major = Engineering
Solve:
S12 = 36 (18 + 18)
S22 = 46 (22 + 24)
sum(S12,S22) = 82
3. For major = Business
Solve:
S13 = 0 (no graduate records have major = Business)
S23 = 42 (20 + 22)
sum(S13,S23) = 42
I(S13,S23) = 0 (all Business records fall in one class)

Summary for field major:
For major = Science:     S11 = 84, S21 = 42, I(s11,s21) = 0.9183
For major = Engineering: S12 = 36, S22 = 46, I(s12,s22) = 0.9892
For major = Business:    S13 = 0,  S23 = 42, I(s13,s23) = 0
1.2.2 I for field gender
Solve:

1.2.3 I for field birth_country
Solve:

1.2.4 I for field age_range
Solve:

1.2.5 I for field gpa
Solve:

Step 2: Calculate the entropy of each attribute, e.g. major.

Entropy:

    E(A) = Σ (j=1..v) ((s1j + ... + smj) / s) · I(s1j, ..., smj)

Compute the entropy of each field:
2.1 Entropy of field major
2.2 Entropy of field gender
2.3 Entropy of field birth_country
2.4 Entropy of field age_range
2.5 Entropy of field gpa
Worked example: entropy of field major.
1. Collect the sums and the I value of each value of major:

Major                 | Sum | I
Science (S11,S21)     | 126 | 0.918
Engineering (S12,S22) | 82  | 0.989
Business (S13,S23)    | 42  | 0
Sum (S1,S2)           | 250 |

2. Weight each I by the fraction of the records having that value:

E(major) = (126/250) I(s11,s21) + (82/250) I(s12,s22) + (42/250) I(s13,s23) = 0.7873

3. Solve:
E(major) = (126/250)(0.918) + (82/250)(0.989) + (42/250)(0)
         = 0.462 + 0.324 + 0
         = 0.786
which is the entropy of field major (0.7873 when the unrounded I values are used). The lower the entropy, the better the field separates the classes; compute the entropy of the other fields the same way.
2.2 Entropy of field gender
Solve:

2.3 Entropy of field birth_country
Solve:

2.4 Entropy of field age_range
Solve:

2.5 Entropy of field gpa
Solve:

Step 3: Calculate information gain for each attribute.

Information gained:

    Gain(A) = I(s1, s2, ..., sm) - E(A)

Compute the information gained for each field (attribute):
3.1 Information gained for field major
3.2 Information gained for field gender
3.3 Information gained for field birth_country
3.4 Information gained for field age_range
3.5 Information gained for field gpa
Worked example: information gained for field major.

Gain(major) = I(s1, s2) - E(major)
            = 0.9988 - 0.7873
            = 0.2115

(The rounded entropy 0.786 gives 0.9988 - 0.786 = 0.2128.) Compute the gain of the other fields the same way.
3.2 Information gained for field gender
Solve:

3.3 Information gained for field birth_country
Solve:

3.4 Information gained for field age_range
Solve:

3.5 Information gained for field gpa
Solve:

Summary: Information gain for all attributes

Gain(gender)        = 0.0003
Gain(birth_country) = 0.0407
Gain(major)         = 0.2115
Gain(gpa)           = 0.4490
Gain(age_range)     = 0.5971

Conclusion: Gain(age_range) is the highest, so age_range is the attribute that best distinguishes graduate from undergraduate students.
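To make the three formulas concrete, this added Python sketch (not part of the original text) recomputes I(s1,s2), E(major), and Gain(major) from the class counts derived from the two candidate relations:

    import math

    # (class, major, count) summarized from the candidate relations above.
    rows = [("grad", "Science", 84), ("grad", "Engineering", 36),
            ("undergrad", "Science", 42), ("undergrad", "Engineering", 46),
            ("undergrad", "Business", 42)]

    def info(counts):
        # I(s1,...,sm) = -sum (si/s) log2(si/s); empty classes contribute 0.
        s = sum(counts)
        return -sum(c / s * math.log2(c / s) for c in counts if c)

    total = sum(c for _, _, c in rows)   # 250 students in all
    I_classes = info([sum(c for cls, _, c in rows if cls == "grad"),
                      sum(c for cls, _, c in rows if cls == "undergrad")])

    E_major = 0.0
    for value in ("Science", "Engineering", "Business"):
        counts = [sum(c for cls, m, c in rows if m == value and cls == k)
                  for k in ("grad", "undergrad")]
        E_major += sum(counts) / total * info(counts)

    print(round(I_classes, 4), round(E_major, 4), round(I_classes - E_major, 4))
    # -> 0.9988 0.7873 0.2115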

Review Questions for Chapter 5
1. What is data mining? Explain.
2. Describe the evolution that led to data mining.
3. Explain data mining as a step in a KDD process, and list the steps of the KDD process.
4. How are data mining and business intelligence related?
5. Describe the main data mining tasks, with an example of each.
6. Describe the main data mining tools and techniques.
