Data Mining Classification: Alternative Techniques — Lecture Notes for Chapter 5, Introduction to Data Mining
Rule-Based Classifier
Name          Blood Type  Give Birth  Can Fly  Live in Water  Class
hawk          warm        no          yes      no             ?
grizzly bear  warm        yes         no       no             ?
Rule Coverage and Accuracy
– Coverage of a rule: fraction of records that satisfy the antecedent of a rule
– Accuracy of a rule: fraction of records that satisfy the antecedent that also satisfy the consequent of a rule
Example: rule (Status=Single) → No
Coverage = 40%, Accuracy = 50%
Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
lemur          warm        yes         no       no             ?
turtle         cold        no          no       sometimes      ?
dogfish shark  cold        yes         no       yes            ?
O Exhaustive rules
– Classifier has exhaustive coverage if it
accounts for every possible combination of
attribute values
– Each record is covered by at least one rule
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 7
Classification Rules

(Figure: decision tree — Refund? Yes → NO; No → Marital Status; {Single, Divorced} → Taxable Income (< 80K → NO, > 80K → YES); {Married} → NO)

(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No
Rules are mutually exclusive and exhaustive
Rule set contains as much information as the tree
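A rule set like this can be applied as an ordered, first-match list. The sketch below is illustrative only: the record format, the `classify` helper, and the treatment of the 80K boundary (assigned to the "Yes" rule here) are my assumptions, not from the notes.

```python
# Sketch: the four rules above as an ordered, first-match rule list.
# Income is in thousands; the 80K boundary case is assigned to >= 80 here.
RULES = [
    (lambda r: r["Refund"] == "Yes", "No"),
    (lambda r: r["Refund"] == "No" and r["Status"] in {"Single", "Divorced"}
               and r["Income"] < 80, "No"),
    (lambda r: r["Refund"] == "No" and r["Status"] in {"Single", "Divorced"}
               and r["Income"] >= 80, "Yes"),
    (lambda r: r["Refund"] == "No" and r["Status"] == "Married", "No"),
]

def classify(record):
    """Return the consequent of the first rule whose antecedent matches."""
    for antecedent, consequent in RULES:
        if antecedent(record):
            return consequent
    return None  # unreachable here: the rule set is exhaustive

print(classify({"Refund": "No", "Status": "Single", "Income": 90}))  # Yes
```

Because the rules are mutually exclusive, the order of the list does not actually matter for this particular rule set.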
O Rule-based ordering
– Individual rules are ranked based on their quality
O Class-based ordering
– Rules that belong to the same class appear together
O Direct Method:
– Extract rules directly from data
– e.g.: RIPPER, CN2, Holte's 1R
O Indirect Method:
– Extract rules from other classification models (e.g. decision trees, neural networks, etc.)
– e.g.: C4.5rules
Aspects of Sequential Covering
O Rule Growing
O Instance Elimination
O Rule Evaluation
O Stopping Criterion
O Rule Pruning
Rule Growing

Two common strategies:

– General-to-specific: start from the empty rule {} => Class=Yes (covering Yes: 3, No: 4) and add the conjunct that improves the rule most. Candidate conjuncts and their coverage:

  Refund=No:        Yes: 3, No: 4
  Status=Single:    Yes: 2, No: 1
  Status=Divorced:  Yes: 1, No: 0
  Status=Married:   Yes: 0, No: 3
  Income>80K:       Yes: 3, No: 1

  e.g. the refined rule becomes (Refund=No, Status=Single) => Class=Yes

– Specific-to-general: start from a rule covering a single positive record, e.g. (Refund=No, Status=Single, Income=85K) => Class=Yes or (Refund=No, Status=Single, Income=90K) => Class=Yes, and generalize it by removing conjuncts
O CN2 Algorithm:
– Start from an empty conjunct: {}
– Add the conjunct that minimizes the entropy measure: {A}, {A,B}, …
– Determine the rule consequent by taking the majority class of instances covered by the rule
O RIPPER Algorithm:
– Start from an empty rule: {} => class
– Add the conjunct that maximizes FOIL's information gain measure:
  R0: {} => class (initial rule)
  R1: {A} => class (rule after adding conjunct)
  Gain(R0, R1) = t × [ log2( p1/(p1+n1) ) − log2( p0/(p0+n0) ) ]
  where t is the number of positive instances covered by both R0 and R1, p0/n0 are the positive/negative instances covered by R0, and p1/n1 those covered by R1
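FOIL's gain is easy to compute directly. A minimal sketch (the `foil_gain` helper is mine; the counts come from the rule-growing example above, where {} covers 3 positives / 4 negatives and adding Status=Single leaves 2 / 1):

```python
from math import log2

def foil_gain(p0, n0, p1, n1):
    """FOIL's information gain for specializing R0 (p0 pos / n0 neg covered)
    into R1 (p1 pos / n1 neg covered). Since R1 specializes R0, the positives
    covered by both rules are exactly the p1 positives covered by R1."""
    t = p1
    return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

# {} covers 3 pos / 4 neg; adding Status=Single leaves 2 pos / 1 neg:
print(round(foil_gain(3, 4, 2, 1), 3))  # 1.275
```

The gain trades off precision improvement against the number of positives retained, which is why it favors specializations that stay general.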
Instance Elimination

(Figure: rules R1, R2, R3 covering different regions of positive (+) and negative (−) instances)

O Why do we need to eliminate instances?
– Otherwise, the next rule is identical to the previous rule
O Why do we remove positive instances?
– Ensure that the next rule is different
O Why do we remove negative instances?
– Prevent underestimating the accuracy of the rule
– Compare rules R2 and R3 in the diagram
Rule Evaluation
O Metrics:
– Accuracy   = nc / n
– Laplace    = (nc + 1) / (n + k)
– M-estimate = (nc + k·p) / (n + k)
where
  n  : number of instances covered by rule
  nc : number of instances covered by rule that belong to the rule's class
  k  : number of classes
  p  : prior probability
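These three metrics can be sketched as one-liners (the function names are mine, and the example numbers are made up for illustration):

```python
def accuracy(nc, n):
    """Fraction of covered instances that belong to the rule's class."""
    return nc / n

def laplace(nc, n, k):
    """Laplace-corrected accuracy with k classes."""
    return (nc + 1) / (n + k)

def m_estimate(nc, n, k, p):
    """M-estimate in the form (nc + k*p)/(n + k) used above,
    with k playing the role of the pseudo-count and p the class prior."""
    return (nc + k * p) / (n + k)

# A rule covering n = 10 instances, nc = 8 of its class, k = 2, prior p = 0.5:
print(accuracy(8, 10), laplace(8, 10, 2), m_estimate(8, 10, 2, 0.5))
# 0.8 0.75 0.75
```

Note that with p = 1/k the m-estimate reduces exactly to the Laplace estimate, which is why the last two values coincide here.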
O Stopping criterion
– Compute the gain
– If gain is not significant, discard the new rule
O Rule Pruning
– Similar to post-pruning of decision trees
– Reduced Error Pruning:
Remove one of the conjuncts in the rule
Compare error rate on validation set before and
after pruning
If error improves, prune the conjunct
O Repeat
O Growing a rule:
– Start from empty rule
– Add conjuncts as long as they improve FOIL’s
information gain
– Stop when rule no longer covers negative examples
– Prune the rule immediately using incremental reduced
error pruning
– Measure for pruning: v = (p-n)/(p+n)
p: number of positive examples covered by the rule in
the validation set
n: number of negative examples covered by the rule in
the validation set
– Pruning method: delete any final sequence of
conditions that maximizes v
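The pruning metric and the "delete the final sequence of conditions that maximizes v" step can be sketched as follows. The input format, where each prefix of the rule is paired with its validation-set coverage, is a hypothetical simplification of mine:

```python
def v(p, n):
    """RIPPER's pruning metric: p positives and n negatives
    covered by the rule on the validation set."""
    return (p - n) / (p + n)

def best_prefix_length(prefix_coverage):
    """prefix_coverage[i] = (p, n) on the validation set for the rule
    truncated after conjunct i+1. Returns how many leading conjuncts to
    keep, i.e. the prefix maximizing v (ties go to the shorter rule)."""
    return max(range(1, len(prefix_coverage) + 1),
               key=lambda k: v(*prefix_coverage[k - 1]))

# Rule A and B and C, with (p, n) coverage after each prefix A, AB, ABC:
print(best_prefix_length([(10, 6), (7, 2), (5, 2)]))  # 2
```

Here v(A) = 0.25, v(AB) ≈ 0.556, v(ABC) ≈ 0.429, so the final conjunct C is pruned.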
Indirect Methods

(Figure: decision tree with root P and subtrees Q, R, converted into a rule set)

Example: decision tree on the vertebrate data (Give Birth? → Live in Water? → Can Fly?) with leaves Mammals, Fishes, Amphibians, Birds, Reptiles.

C4.5rules:
(Give Birth=No, Can Fly=Yes) → Birds
(Give Birth=No, Live in Water=Yes) → Fishes
(Give Birth=Yes) → Mammals
(Give Birth=No, Can Fly=No, Live in Water=No) → Reptiles
( ) → Amphibians

RIPPER:
(Live in Water=Yes) → Fishes
(Have Legs=No) → Reptiles
(Give Birth=No, Can Fly=No, Live In Water=No) → Reptiles
(Can Fly=Yes, Give Birth=No) → Birds
( ) → Mammals
RIPPER confusion matrix:

                     PREDICTED CLASS
                     Amphibians  Fishes  Reptiles  Birds  Mammals
ACTUAL  Amphibians   0           0       0         0      2
CLASS   Fishes       0           3       0         0      0
        Reptiles     0           0       3         0      1
        Birds        0           0       1         2      1
        Mammals      0           2       1         0      4
Instance-Based Classifiers
O Examples:
– Rote-learner
Memorizes entire training data and performs
classification only if attributes of record match one of
the training examples exactly
– Nearest neighbor
Uses k “closest” points (nearest neighbors) for
performing classification
O Basic idea:
– If it walks like a duck, quacks like a duck, then
it’s probably a duck
(Figure: compute the distance between the test record and the stored training records)
Nearest-Neighbor Classifiers
O Requires three things:
– The set of stored records
– Distance Metric to compute
distance between records
– The value of k, the number of
nearest neighbors to retrieve
(Figure: the 1-, 2- and 3-nearest neighbors of a record x; the Voronoi diagram induced by a 1-nearest-neighbor classifier)
O Compute the distance between two points, e.g. Euclidean distance:

  d(p, q) = √( Σi (pi − qi)² )
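A minimal k-nearest-neighbor sketch using this distance (the training-set format and the `knn_predict` helper are illustrative assumptions, not from the notes):

```python
from math import sqrt
from collections import Counter

def euclidean(p, q):
    """d(p, q) = sqrt(sum_i (p_i - q_i)^2)"""
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train, test_point, k=3):
    """Majority vote among the k training records closest to test_point.
    train is a list of (point, label) pairs."""
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], test_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1, 1), "+"), ((1, 2), "+"), ((2, 1), "+"),
         ((5, 5), "-"), ((6, 5), "-")]
print(knn_predict(train, (1.5, 1.5), k=3))  # +
```

Sorting the whole training set is O(n log n) per query; real implementations use spatial indexes, but the voting logic is the same.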
O Scaling issues
– Attributes may have to be scaled to prevent
distance measures from being dominated by
one of the attributes
– Example:
height of a person may vary from 1.5m to 1.8m
weight of a person may vary from 90lb to 300lb
– Problem with the Euclidean measure: high-dimensional data can produce counter-intuitive results, e.g.
    111111111110 vs 011111111111 → d = 1.4142
    100000000000 vs 000000000001 → d = 1.4142
  both pairs are the same distance apart
– Solution: normalize the vectors to unit length
Example: PEBLS
– For nominal attributes, the distance between two attribute values is computed from their class distributions:
  d(V1, V2) = Σi | n1i/n1 − n2i/n2 |
Bayes theorem:

  P(C | A) = P(A | C) P(C) / P(A)
O Given:
– A doctor knows that meningitis causes stiff neck 50% of the
time
– Prior probability of any patient having meningitis is 1/50,000
– Prior probability of any patient having stiff neck is 1/20
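These givens set up the posterior probability of meningitis given a stiff neck via Bayes' theorem; a quick check of the arithmetic (variable names are mine):

```python
# P(M | S) = P(S | M) P(M) / P(S)
p_s_given_m = 0.5        # meningitis causes stiff neck 50% of the time
p_m = 1 / 50_000         # prior probability of meningitis
p_s = 1 / 20             # prior probability of stiff neck

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)  # 0.0002
```

So even though meningitis causes a stiff neck half the time, a stiff neck alone implies only a 0.02% chance of meningitis, because the prior P(M) is so small.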
Bayesian Classifiers
O Approach:
– compute the posterior probability P(C | A1, A2, …, An) for all values of C using Bayes' theorem:

  P(C | A1 A2 … An) = P(A1 A2 … An | C) P(C) / P(A1 A2 … An)
How to Estimate Probabilities from Data?

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

O Class: P(C) = Nc/N
– e.g., P(No) = 7/10, P(Yes) = 3/10
O For discrete attributes: P(Ai | Ck) = |Aik| / Nc
– where |Aik| is the number of instances having attribute value Ai and belonging to class Ck
– Examples: P(Status=Married|No) = 4/7, P(Refund=Yes|Yes) = 0
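The class priors and conditional probabilities above can be recomputed from the table (a sketch; the tuple encoding of the records is my own):

```python
# (Refund, Marital Status, Taxable Income in K, Evade) for the 10 records
data = [
    ("Yes", "Single", 125, "No"),   ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),     ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),  ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),    ("No", "Single", 90, "Yes"),
]

def p_class(c):
    """P(C) = Nc / N"""
    return sum(1 for r in data if r[3] == c) / len(data)

def p_attr_given_class(idx, value, c):
    """P(Ai | Ck) = |Aik| / Nc for the discrete attribute at column idx."""
    nc = sum(1 for r in data if r[3] == c)
    return sum(1 for r in data if r[3] == c and r[idx] == value) / nc

print(p_class("No"))                           # 0.7
print(p_attr_given_class(1, "Married", "No"))  # 4/7 ≈ 0.571
print(p_attr_given_class(0, "Yes", "Yes"))     # 0.0
```

The zero in the last line is exactly the P(Refund=Yes|Yes) = 0 case that motivates the smoothed estimates below.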
O For continuous attributes, assume a normal distribution, one for each (Ai, ci) pair:

  P(Ai | cj) = ( 1 / √(2π σij²) ) · exp( −(Ai − μij)² / (2 σij²) )

O For (Income, Class=No):
– If Class=No: sample mean = 110, sample variance = 2975

  P(Income=120 | No) = ( 1 / √(2π · 2975) ) · exp( −(120 − 110)² / (2 · 2975) ) = 0.0072
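A quick check of the worked value, assuming the normal density above (helper name is mine):

```python
from math import sqrt, pi, exp

def gaussian(x, mean, var):
    """Normal density with the given mean and variance."""
    return (1 / sqrt(2 * pi * var)) * exp(-(x - mean) ** 2 / (2 * var))

# P(Income=120 | No) with sample mean 110 and sample variance 2975:
print(round(gaussian(120, 110, 2975), 4))  # 0.0072
```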
O Probability estimation:

  Original:    P(Ai | C) = Nic / Nc
  Laplace:     P(Ai | C) = (Nic + 1) / (Nc + c)
  m-estimate:  P(Ai | C) = (Nic + m·p) / (Nc + m)

  c: number of classes
  p: prior probability
  m: parameter
Example of Naive Bayes Classifier

Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals

A: attributes of the test record (Give Birth=yes, Can Fly=no, Live in Water=yes, Have Legs=no)
M: mammals, N: non-mammals

P(A | M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
P(A | N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042
P(A | M) P(M) = 0.06 × 7/20 = 0.021
P(A | N) P(N) = 0.0042 × 13/20 = 0.0027

Since P(A|M)P(M) > P(A|N)P(N), the record is classified as a mammal.
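A quick check of the arithmetic (the fractions are the counts read off the table above):

```python
# Conditional likelihoods for the test record
# (Give Birth=yes, Can Fly=no, Live in Water=yes, Have Legs=no)
p_a_m = (6/7) * (6/7) * (2/7) * (2/7)      # 7 mammals in the table
p_a_n = (1/13) * (10/13) * (3/13) * (4/13)  # 13 non-mammals

print(round(p_a_m, 3), round(p_a_n, 4))                 # 0.06 0.0042
print(round(p_a_m * 7/20, 3), round(p_a_n * 13/20, 4))  # 0.021 0.0027
```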
(Figure: perceptron — input nodes X1, X2, X3 with weights 0.3 each feed an output node Y with threshold t = 0.4)

X1  X2  X3  Y
1   0   0   0
1   0   1   1
1   1   0   1
1   1   1   1
0   0   1   0
0   1   0   0
0   1   1   1
0   0   0   0

  Y = I( 0.3 X1 + 0.3 X2 + 0.3 X3 − 0.4 > 0 )

  where I(z) = 1 if z is true, 0 otherwise
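The output function can be checked against the truth table directly (the `perceptron` helper is mine; the weights and threshold are the ones in the figure):

```python
def perceptron(x1, x2, x3, w=0.3, t=0.4):
    """Y = I(0.3*X1 + 0.3*X2 + 0.3*X3 - 0.4 > 0)"""
    return 1 if w * x1 + w * x2 + w * x3 - t > 0 else 0

table = [
    (1, 0, 0, 0), (1, 0, 1, 1), (1, 1, 0, 1), (1, 1, 1, 1),
    (0, 0, 1, 0), (0, 1, 0, 0), (0, 1, 1, 1), (0, 0, 0, 0),
]
for x1, x2, x3, y in table:
    assert perceptron(x1, x2, x3) == y
print("all 8 rows reproduced")
```

With these weights the node fires exactly when at least two inputs are 1, i.e. it computes a majority function.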
Artificial Neural Networks (ANN)

(Figure: model of neuron i — inputs I1, I2, I3 from the input layer are weighted by wi1, wi2, wi3 and summed into Si; an activation function g(Si), which depends on a threshold t, produces the output Oi; hidden layers sit between the input and output layers)
O Find a linear hyperplane (decision boundary) that will separate the data
Support Vector Machines

(Figures: two candidate decision boundaries B1 and B2, with margin lines b11/b12 and b21/b22; B1 has the larger margin and is therefore preferred)
For boundary B1:

  w · x + b = 0                          (decision boundary)
  w · x + b = +1 and w · x + b = −1      (margin hyperplanes b11, b12)

  f(x) =  1 if w · x + b ≥ 1
         −1 if w · x + b ≤ −1

  Margin = 2 / ‖w‖²
Support Vector Machines
O We want to maximize:  Margin = 2 / ‖w‖²
– which is equivalent to minimizing:  L(w) = ‖w‖² / 2
Ensemble Methods

General idea:
Step 1: create multiple data sets D1, D2, …, Dt−1, Dt from the original training data D
Step 2: build multiple classifiers C1, C2, …, Ct−1, Ct
Step 3: combine the classifiers into C*
O Examples: Bagging, Boosting

Bagging
Original Data 1 2 3 4 5 6 7 8 9 10
Bagging (Round 1) 7 8 10 8 2 5 10 10 5 9
Bagging (Round 2) 1 4 9 1 2 3 2 7 3 2
Bagging (Round 3) 1 8 5 10 5 5 9 6 3 7
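Rounds like those in the table come from sampling n records with replacement (bootstrap sampling). A sketch (the seed and helper name are mine, so the ids drawn will not match the table's rounds):

```python
import random

def bootstrap_round(data, rng):
    """One bagging round: draw len(data) records with replacement."""
    return [rng.choice(data) for _ in data]

records = list(range(1, 11))   # record ids 1..10, as in the table
rng = random.Random(0)         # fixed seed for reproducibility
for i in range(1, 4):
    print(f"Bagging (Round {i}):", bootstrap_round(records, rng))
```

Note how, as in the table, each round typically repeats some records and omits others; on average a record appears in about 63% of the bootstrap samples.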
Boosting
Original Data 1 2 3 4 5 6 7 8 9 10
Boosting (Round 1) 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2) 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3) 4 4 8 10 4 5 4 6 3 4
O Error rate:

  εi = (1/N) Σj=1..N wj δ( Ci(xj) ≠ yj )

O Importance of a classifier:

  αi = (1/2) ln( (1 − εi) / εi )
Example: AdaBoost
O Weight update:

  wi(j+1) = (wi(j) / Zj) · exp(−αj)  if Cj(xi) = yi
  wi(j+1) = (wi(j) / Zj) · exp(+αj)  if Cj(xi) ≠ yi

  where Zj is the normalization factor
O If any intermediate round produces an error rate higher than 50%, the weights are reverted back to 1/n and the resampling procedure is repeated
O Classification:

  C*(x) = arg maxy Σj=1..T αj δ( Cj(x) = y )
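One round of the weight update can be sketched as follows (the helper name and input format are mine; the weights are kept normalized to sum to 1, which absorbs the 1/N factor in the error-rate formula):

```python
from math import log, exp

def adaboost_round(weights, correct):
    """One AdaBoost update. weights: current w_i summing to 1;
    correct[i]: whether classifier C_j got record i right.
    Returns (alpha_j, new normalized weights)."""
    eps = sum(w for w, ok in zip(weights, correct) if not ok)  # weighted error
    alpha = 0.5 * log((1 - eps) / eps)
    new = [w * exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    z = sum(new)  # normalization factor Z_j
    return alpha, [w / z for w in new]

# 10 records with equal weights; the classifier misclassifies only the last.
w = [0.1] * 10
alpha, w = adaboost_round(w, [True] * 9 + [False])
print(round(alpha, 4))  # 1.0986
print(round(w[-1], 4))  # 0.5
```

With error rate ε = 0.1, α = ½ ln(9) ≈ 1.0986, and the single misclassified record ends up carrying half of the total weight in the next round.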
Illustrating AdaBoost

(Figure: one-dimensional data set, with instance weights shown above selected points)

Round 1 (B1): weights 0.0094, 0.0094, 0.4623; predictions + + + − − − − − − −; α = 1.9459
Round 2 (B2): weights 0.3037, 0.0009, 0.0422; predictions − − − − − − − − + +; α = 2.9323
Round 3 (B3): weights 0.0276, 0.1819, 0.0038; predictions + + + + + + + + + +; α = 3.8744
Overall:      + + + − − − − − + +