Lecture Notes for Chapter 5, Introduction to Data Mining, by Tan, Steinbach, Kumar
Name | Blood Type | Give Birth | Can Fly | Live in Water | Class
hawk | warm | no | yes | no | ?
grizzly bear | warm | yes | no | no | ?

Example rule: (Status=Single) ==> No
Coverage = 40%, Accuracy = 50%
How does Rule-based Classifier Work?
Name | Blood Type | Give Birth | Can Fly | Live in Water | Class
lemur | warm | yes | no | no | ?
turtle | cold | no | no | sometimes | ?
dogfish shark | cold | yes | no | yes | ?
Exhaustive rules
– Classifier has exhaustive coverage if it accounts for every possible combination of attribute values
– Each record is covered by at least one rule
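A minimal sketch of how an ordered, exhaustive rule set is applied to a record. The rules and the default class below are illustrative stand-ins (the R1-R5 rule set from the original slides is not reproduced in this text):

```python
# Sketch: ordered rules checked top to bottom; a default rule makes the set exhaustive.
RULES = [
    (lambda r: r["Give Birth"] == "no" and r["Can Fly"] == "yes", "Birds"),
    (lambda r: r["Give Birth"] == "no" and r["Live in Water"] == "yes", "Fishes"),
    (lambda r: r["Give Birth"] == "yes" and r["Blood Type"] == "warm", "Mammals"),
]
DEFAULT_CLASS = "Reptiles"  # fired when no other rule covers the record

def classify(record):
    """Return the class of the first rule triggered by the record."""
    for condition, label in RULES:
        if condition(record):
            return label
    return DEFAULT_CLASS

hawk = {"Blood Type": "warm", "Give Birth": "no", "Can Fly": "yes", "Live in Water": "no"}
grizzly = {"Blood Type": "warm", "Give Birth": "yes", "Can Fly": "no", "Live in Water": "no"}
print(classify(hawk))     # Birds
print(classify(grizzly))  # Mammals
```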
From Decision Trees To Rules
Classification Rules (read off the tree):
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No

[Figure: decision tree with root Refund (Yes ==> NO; No ==> Marital Status), Marital Status ({Married} ==> NO; {Single, Divorced} ==> Taxable Income), Taxable Income (<80K ==> NO; >80K ==> YES)]

Rules are mutually exclusive and exhaustive
Rule set contains as much information as the tree
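As a small illustration, the four rules above can be applied in order. The dictionary encoding of a record (key names, income in thousands, the 80K boundary handled as >= 80) is an assumption made for this sketch:

```python
# Sketch: the four rules extracted from the tree, checked in order.
def classify_tax_record(r):
    if r["Refund"] == "Yes":
        return "No"
    if r["Marital Status"] in ("Single", "Divorced") and r["Taxable Income"] < 80:
        return "No"
    if r["Marital Status"] in ("Single", "Divorced") and r["Taxable Income"] >= 80:
        return "Yes"
    if r["Marital Status"] == "Married":
        return "No"

print(classify_tax_record({"Refund": "No", "Marital Status": "Single", "Taxable Income": 90}))  # Yes
```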
Name | Blood Type | Give Birth | Can Fly | Live in Water | Class
turtle | cold | no | no | sometimes | ?
Rule Ordering Schemes
Rule-based ordering
– Individual rules are ranked based on their quality
Class-based ordering
– Rules that belong to the same class appear together
Direct Method:
– Extract rules directly from data
– e.g., RIPPER, CN2, Holte’s 1R
Indirect Method:
– Extract rules from other classification models (e.g., decision trees, neural networks, etc.)
– e.g., C4.5rules
Aspects of Sequential Covering:
– Rule Growing
– Instance Elimination
– Rule Evaluation
– Stopping Criterion
– Rule Pruning
[Figure: two rule-growing strategies.
General-to-specific: start from the empty rule {} and add one conjunct at a time (e.g. Refund=No, Status=Single, Status=Divorced, Status=Married, …, Income>80K), evaluating each candidate by the class counts of the records it covers (Yes: 3 / No: 4, Yes: 2 / No: 1, Yes: 1 / No: 0, Yes: 0 / No: 3, Yes: 3 / No: 1).
Specific-to-general: start from a specific rule such as (Refund=No, Status=Single, Income=85K) ==> Class=Yes or (Refund=No, Status=Single, Income=90K) ==> Class=Yes and generalize, e.g. to (Refund=No, Status=Single) ==> Class=Yes.]
CN2 Algorithm:
– Start from an empty conjunct: {}
– Add conjuncts that minimize the entropy measure: {A}, {A,B}, …
– Determine the rule consequent by taking the majority class of instances covered by the rule
RIPPER Algorithm:
– Start from an empty rule: {} => class
– Add conjuncts that maximize FOIL’s information gain measure (see the sketch below):
R0: {} => class (initial rule)
R1: {A} => class (rule after adding conjunct)
Gain(R0, R1) = t [ log2( p1 / (p1 + n1) ) - log2( p0 / (p0 + n0) ) ]
where t is the number of positive instances covered by both R0 and R1, and p0/n0 (p1/n1) are the positive/negative instances covered by R0 (R1)
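A small sketch of FOIL's information gain as defined above; the counts in the usage example are hypothetical:

```python
import math

# FOIL's information gain for extending rule R0 to R1 (as used by RIPPER
# when deciding which conjunct to add).
def foil_gain(p0, n0, p1, n1, t=None):
    if t is None:
        # when R1 specializes R0, every positive covered by R1 is also covered by R0
        t = p1
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Hypothetical counts: R0 covers 3 positives / 4 negatives,
# adding a conjunct gives R1 covering 2 positives / 1 negative.
print(round(foil_gain(p0=3, n0=4, p1=2, n1=1), 3))
```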
Instance Elimination
Why do we need to eliminate instances?
– Otherwise, the next rule is identical to the previous rule
Why do we remove positive instances?
– Ensure that the next rule is different
Why do we remove negative instances?
– Prevent underestimating the accuracy of the rule
– Compare rules R2 and R3 in the diagram
[Figure: training instances of class + and class - with the regions covered by rules R1, R2, and R3]
Metrics:
– Accuracy = n_c / n
– Laplace = (n_c + 1) / (n + k)
– M-estimate = (n_c + k p) / (n + k)
where
n : number of instances covered by the rule
n_c : number of instances covered by the rule that belong to the rule's class
k : number of classes
p : prior probability
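A minimal sketch of the three metrics above; the counts in the example call are hypothetical:

```python
# n  : instances covered by the rule
# nc : covered instances that belong to the rule's class
# k  : number of classes, p : prior probability of the rule's class
def accuracy(nc, n):
    return nc / n

def laplace(nc, n, k):
    return (nc + 1) / (n + k)

def m_estimate(nc, n, k, p):
    return (nc + k * p) / (n + k)

# Hypothetical rule covering 50 instances, 40 of its class, 2 classes, prior 0.5
print(accuracy(40, 50), laplace(40, 50, 2), m_estimate(40, 50, 2, 0.5))
```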
Stopping criterion
– Compute the gain
– If gain is not significant, discard the new rule
Rule Pruning
– Similar to post-pruning of decision trees
– Reduced Error Pruning:
Remove one of the conjuncts in the rule
Compare error rate on validation set before and
after pruning
If error improves, prune the conjunct
Repeat
Direct Method: RIPPER
Growing a rule:
– Start from an empty rule
– Add conjuncts as long as they improve FOIL’s information gain
– Stop when the rule no longer covers negative examples
– Prune the rule immediately using incremental reduced error pruning
– Measure for pruning: v = (p - n) / (p + n)
  p: number of positive examples covered by the rule in the validation set
  n: number of negative examples covered by the rule in the validation set
– Pruning method: delete any final sequence of conditions that maximizes v
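A rough sketch of the pruning step described above (incremental reduced error pruning with v = (p - n)/(p + n)); the rule representation, helper names, and data are assumptions made for illustration:

```python
# A rule is a list of conditions (predicates over a record); labels are "+"/"-".
def v_metric(rule, validation_set):
    covered = [(r, y) for r, y in validation_set if all(cond(r) for cond in rule)]
    p = sum(1 for _, y in covered if y == "+")  # positives covered
    n = sum(1 for _, y in covered if y == "-")  # negatives covered
    return (p - n) / (p + n) if covered else float("-inf")

def prune_rule(rule, validation_set):
    """Delete the final sequence of conditions that maximizes v."""
    best_rule, best_v = rule, v_metric(rule, validation_set)
    for cut in range(1, len(rule)):          # drop the last 1, 2, ... conditions
        candidate = rule[:-cut]
        v = v_metric(candidate, validation_set)
        if v > best_v:
            best_rule, best_v = candidate, v
    return best_rule

rule = [lambda r: r["Refund"] == "No", lambda r: r["Income"] > 80]
validation = [({"Refund": "No", "Income": 90}, "+"),
              ({"Refund": "No", "Income": 60}, "+"),
              ({"Refund": "No", "Income": 85}, "-")]
print(len(prune_rule(rule, validation)))  # 1: dropping the Income test improves v
```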
Indirect Method: C4.5rules versus RIPPER

[Figure: a small decision tree with nodes P, Q, R and the rule set read off its paths; the larger tree for the animal data has root Give Birth? (Yes ==> Mammals; No ==> Live In Water?), Live In Water? (Yes ==> Fishes; Sometimes ==> Amphibians; No ==> Can Fly?), and Can Fly? (Yes ==> Birds; No ==> Reptiles).]

C4.5rules:
(Give Birth=No, Can Fly=Yes) ==> Birds
(Give Birth=No, Live in Water=Yes) ==> Fishes
(Give Birth=Yes) ==> Mammals
(Give Birth=No, Can Fly=No, Live in Water=No) ==> Reptiles
( ) ==> Amphibians

RIPPER:
(Live in Water=Yes) ==> Fishes
(Have Legs=No) ==> Reptiles
(Give Birth=No, Can Fly=No, Live In Water=No) ==> Reptiles
(Can Fly=Yes, Give Birth=No) ==> Birds
( ) ==> Mammals
Instance-Based Classifiers
Examples:
– Rote-learner
Memorizes entire training data and performs
classification only if attributes of record match one of
the training examples exactly
– Nearest neighbor
Uses k “closest” points (nearest neighbors) for
performing classification
Basic idea:
– If it walks like a duck, quacks like a duck, then
it’s probably a duck
[Figure: to classify a test record, compute its distance to the training records; examples of the 1-, 2-, and 3-nearest neighbors of a test record; Voronoi diagram for 1-nearest neighbor.]
Compute the distance between two points, e.g. Euclidean distance:
d(p, q) = sqrt( Σ_i (p_i - q_i)² )
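A minimal k-nearest-neighbor sketch based on the distance formula above; the toy training data is hypothetical:

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(test, training, k=3):
    # training: list of (attribute_vector, class_label); majority vote among k closest
    neighbors = sorted(training, key=lambda rec: euclidean(test, rec[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((5.0, 5.0), "-"), ((5.2, 4.8), "-")]
print(knn_classify((1.1, 0.9), train, k=3))  # "+"
```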
Scaling issues
– Attributes may have to be scaled to prevent
distance measures from being dominated by
one of the attributes
– Example:
height of a person may vary from 1.5m to 1.8m
weight of a person may vary from 90lb to 300lb
Problem with the Euclidean measure: it can give counter-intuitive results for high-dimensional (e.g. binary) data. Both pairs below are at the same distance:
1 1 1 1 1 1 1 1 1 1 1 0  vs  0 1 1 1 1 1 1 1 1 1 1 1   d = 1.4142
1 0 0 0 0 0 0 0 0 0 0 0  vs  0 0 0 0 0 0 0 0 0 0 0 1   d = 1.4142
even though the vectors in the first pair are nearly identical and the vectors in the second pair share no 1s.
Bayes Classifier
Bayes theorem:
P(C | A) = P(A | C) P(C) / P(A)
Example of Bayes Theorem
Given:
– A doctor knows that meningitis causes stiff neck 50% of the
time
– Prior probability of any patient having meningitis is 1/50,000
– Prior probability of any patient having stiff neck is 1/20
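Putting the three numbers above into Bayes' theorem gives the posterior probability of meningitis (M) given a stiff neck (S), a worked application of the theorem:

P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002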
Approach:
– compute the posterior probability P(C | A1, A2, …, An) for
all values of C using the Bayes theorem
P(C | A1 A2 … An) = P(A1 A2 … An | C) P(C) / P(A1 A2 … An)
The Naïve Bayes classifier assumes the attributes are independent given the class, so P(A1 A2 … An | C) = P(A1 | C) P(A2 | C) … P(An | C).
For continuous attributes, assume a normal distribution:
P(Ai | cj) = 1 / sqrt(2 π σij²) · exp( -(Ai - μij)² / (2 σij²) )
– One for each (Ai, ci) pair

Training data (partial; Tid, Refund, Marital Status, Taxable Income, Evade):
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes

For (Income, Class=No):
– If Class=No: sample mean = 110, sample variance = 2975
P(Income=120 | No) = 1 / sqrt(2 π · 2975) · exp( -(120 - 110)² / (2 · 2975) ) = 0.0072
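A one-function sketch of the normal-density estimate used above (mean 110, variance 2975 for Income given Class=No):

```python
import math

def gaussian_likelihood(x, mean, variance):
    # normal density: 1/sqrt(2*pi*var) * exp(-(x-mean)^2 / (2*var))
    return (1.0 / math.sqrt(2 * math.pi * variance)) * math.exp(-(x - mean) ** 2 / (2 * variance))

print(round(gaussian_likelihood(120, mean=110, variance=2975), 4))  # ~0.0072
```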
Original : P(Ai | C) = N_ic / N_c
Laplace : P(Ai | C) = (N_ic + 1) / (N_c + c)
m-estimate : P(Ai | C) = (N_ic + m p) / (N_c + m)
where
N_c : number of training instances of class C
N_ic : number of instances of class C with attribute value Ai
c : number of classes
p : prior probability
m : parameter
Example of Naïve Bayes Classifier

Name | Give Birth | Can Fly | Live in Water | Have Legs | Class
human | yes | no | no | yes | mammals
python | no | no | no | no | non-mammals
salmon | no | no | yes | no | non-mammals
whale | yes | no | yes | no | mammals
frog | no | no | sometimes | yes | non-mammals
komodo | no | no | no | yes | non-mammals
bat | yes | yes | no | yes | mammals
pigeon | no | yes | no | yes | non-mammals
cat | yes | no | no | yes | mammals
leopard shark | yes | no | yes | no | non-mammals
turtle | no | no | sometimes | yes | non-mammals
penguin | no | no | sometimes | yes | non-mammals
porcupine | yes | no | no | yes | mammals
eel | no | no | yes | no | non-mammals
salamander | no | no | sometimes | yes | non-mammals
gila monster | no | no | no | yes | non-mammals
platypus | no | no | no | yes | mammals
owl | no | yes | no | yes | non-mammals
dolphin | yes | no | yes | no | mammals
eagle | no | yes | no | yes | non-mammals

A: the test record (Give Birth=yes, Can Fly=no, Live in Water=yes, Have Legs=no)
M: mammals, N: non-mammals

P(A | M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
P(A | N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042
P(A | M) P(M) = 0.06 × 7/20 = 0.021
P(A | N) P(N) = 0.0042 × 13/20 = 0.0027
Since P(A | M) P(M) > P(A | N) P(N), the test record is classified as a mammal.
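A short sketch that reproduces the computation above directly from the table; the record encoding is an assumption of this sketch:

```python
# Each record: (give_birth, can_fly, live_in_water, have_legs, class)
data = [
    ("yes", "no", "no", "yes", "mammals"),            # human
    ("no", "no", "no", "no", "non-mammals"),          # python
    ("no", "no", "yes", "no", "non-mammals"),         # salmon
    ("yes", "no", "yes", "no", "mammals"),            # whale
    ("no", "no", "sometimes", "yes", "non-mammals"),  # frog
    ("no", "no", "no", "yes", "non-mammals"),         # komodo
    ("yes", "yes", "no", "yes", "mammals"),           # bat
    ("no", "yes", "no", "yes", "non-mammals"),        # pigeon
    ("yes", "no", "no", "yes", "mammals"),            # cat
    ("yes", "no", "yes", "no", "non-mammals"),        # leopard shark
    ("no", "no", "sometimes", "yes", "non-mammals"),  # turtle
    ("no", "no", "sometimes", "yes", "non-mammals"),  # penguin
    ("yes", "no", "no", "yes", "mammals"),            # porcupine
    ("no", "no", "yes", "no", "non-mammals"),         # eel
    ("no", "no", "sometimes", "yes", "non-mammals"),  # salamander
    ("no", "no", "no", "yes", "non-mammals"),         # gila monster
    ("no", "no", "no", "yes", "mammals"),             # platypus
    ("no", "yes", "no", "yes", "non-mammals"),        # owl
    ("yes", "no", "yes", "no", "mammals"),            # dolphin
    ("no", "yes", "no", "yes", "non-mammals"),        # eagle
]

test = ("yes", "no", "yes", "no")  # give birth, can fly, live in water, have legs

def score(label):
    # P(A | class) * P(class), with per-attribute conditional probabilities
    rows = [r for r in data if r[4] == label]
    prior = len(rows) / len(data)
    likelihood = 1.0
    for i, value in enumerate(test):
        likelihood *= sum(1 for r in rows if r[i] == value) / len(rows)
    return prior * likelihood

for label in ("mammals", "non-mammals"):
    print(label, round(score(label), 4))
# mammals ~0.021 > non-mammals ~0.0027, so the record is classified as a mammal
```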
Artificial Neural Networks (ANN)
Output Y as a function of the inputs X1, X2, X3:
X1 X2 X3 | Y
1 0 0 | 0
1 0 1 | 1
1 1 0 | 1
1 1 1 | 1
0 0 1 | 0
0 1 0 | 0
0 1 1 | 1
0 0 0 | 0

[Figure: a "black box" perceptron with input nodes X1, X2, X3, each connected to the output node Y with weight 0.3, and output threshold t = 0.4.]
The model outputs Y = 1 if 0.3 X1 + 0.3 X2 + 0.3 X3 - 0.4 > 0 and Y = 0 otherwise, i.e. Y = 1 when at least two of the inputs are 1.
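A minimal sketch of the perceptron shown above (all weights 0.3, threshold t = 0.4), reproducing the truth table:

```python
def perceptron(x1, x2, x3, w=0.3, t=0.4):
    # weighted sum of the inputs, thresholded at t
    s = w * x1 + w * x2 + w * x3
    return 1 if s - t > 0 else 0

# Y = 1 exactly when at least two inputs are 1
for x1, x2, x3 in [(1,0,0), (1,0,1), (1,1,0), (1,1,1), (0,0,1), (0,1,0), (0,1,1), (0,0,0)]:
    print(x1, x2, x3, "->", perceptron(x1, x2, x3))
```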
[Figure: general structure of an ANN with an input layer (x1 … x5), a hidden layer, and an output layer. Each neuron i receives inputs I1, I2, I3 with weights wi1, wi2, wi3, computes the weighted sum Si, and applies an activation function g(Si) (with threshold t) to produce the output Oi.]
Support Vector Machines
Find a linear hyperplane (decision boundary) that will separate the data
[Figure: several possible decision boundaries B1, B2, … that separate the data; B1 has a larger margin (b11, b12) than B2 (b21, b22).]
[Figure: decision boundary B1 given by w · x + b = 0, with margin hyperplanes b11: w · x + b = 1 and b12: w · x + b = -1.]
f(x) = 1 if w · x + b ≥ 1; -1 if w · x + b ≤ -1
Margin = 2 / ||w||²
Support Vector Machines
We want to maximize: Margin = 2 / ||w||²
– Which is equivalent to minimizing: L(w) = ||w||² / 2
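A small sketch of the decision function f(x) and of the margin as defined on these slides (Margin = 2/||w||²); the weight vector w and bias b in the example are hypothetical, as if a boundary had already been trained:

```python
def f(x, w, b):
    # linear SVM decision function; 0 means the point falls inside the margin
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s >= 1 else (-1 if s <= -1 else 0)

def margin(w):
    # Margin = 2 / ||w||^2, following the definition on the slide
    return 2.0 / sum(wi * wi for wi in w)

w, b = (1.0, 1.0), -3.0
print(f((3.0, 1.0), w, b))   # 1   (w·x + b = 1)
print(f((1.0, 1.0), w, b))   # -1  (w·x + b = -1)
print(round(margin(w), 2))   # 1.0
```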
Ensemble Methods: General Idea
Step 1: Create multiple data sets D1, D2, …, Dt-1, Dt from the original training data D
Step 2: Build multiple classifiers C1, C2, …, Ct-1, Ct (one per data set)
Step 3: Combine the classifiers into a single classifier C*

How the training sets are constructed:
– Bagging: sampling with replacement
– Boosting: adaptively focus on records misclassified in previous rounds
Original Data 1 2 3 4 5 6 7 8 9 10
Bagging (Round 1) 7 8 10 8 2 5 10 10 5 9
Bagging (Round 2) 1 4 9 1 2 3 2 7 3 2
Bagging (Round 3) 1 8 5 10 5 5 9 6 3 7
Original Data 1 2 3 4 5 6 7 8 9 10
Boosting (Round 1) 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2) 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3) 4 4 8 10 4 5 4 6 3 4
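A minimal sketch of the bootstrap sampling behind the bagging rounds above; the random seed is arbitrary, so the drawn indices will not match the table:

```python
import random

def bagging_round(record_ids, rng):
    # draw N records uniformly with replacement from the original data
    n = len(record_ids)
    return [rng.choice(record_ids) for _ in range(n)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
original = list(range(1, 11))
for i in range(1, 4):
    print(f"Bagging (Round {i}):", bagging_round(original, rng))
```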
Error rate:
ε_i = (1/N) Σ_{j=1..N} w_j δ( C_i(x_j) ≠ y_j )
Importance of a classifier:
α_i = (1/2) ln( (1 - ε_i) / ε_i )
Weight update:
w_i^(j+1) = ( w_i^(j) / Z_j ) × exp(-α_j)  if C_j(x_i) = y_i
w_i^(j+1) = ( w_i^(j) / Z_j ) × exp(+α_j)  if C_j(x_i) ≠ y_i
where Z_j is the normalization factor
If any intermediate round produces an error rate higher than 50%, the weights are reverted to 1/n and the resampling procedure is repeated
Classification:
C*(x) = arg max_y Σ_{j=1..T} α_j δ( C_j(x) = y )
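A rough sketch of one AdaBoost round using the formulas above. Here the error is taken as the total weight of misclassified records (with weights kept normalized to sum to 1), and the example predictions and labels are hypothetical:

```python
import math

def adaboost_round(weights, predictions, labels):
    # weighted error of the classifier on this round
    error = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    # importance alpha = 1/2 ln((1 - error) / error)
    alpha = 0.5 * math.log((1 - error) / error)
    # decrease weights of correct records, increase weights of misclassified ones
    new_w = [w * math.exp(-alpha if p == y else alpha)
             for w, p, y in zip(weights, predictions, labels)]
    z = sum(new_w)                      # normalization factor Z_j
    return alpha, [w / z for w in new_w]

weights = [1 / 5] * 5
predictions = [1, 1, -1, -1, 1]
labels      = [1, 1, -1, -1, -1]        # last record misclassified
alpha, weights = adaboost_round(weights, predictions, labels)
print(round(alpha, 3), [round(w, 3) for w in weights])
```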
Illustrating AdaBoost
[Figure: three boosting rounds on a one-dimensional data set of + and - points.
Round 1: boundary B1, α = 1.9459; example instance weights 0.0094, 0.0094, 0.4623.
Round 2: boundary B2, α = 2.9323; example instance weights 0.3037, 0.0009, 0.0422.
Round 3: boundary B3, α = 3.8744; example instance weights 0.0276, 0.1819, 0.0038.
Overall prediction of the combined classifier: + + + - - - - - + +]