
Data Mining

Classification: Alternative Techniques

Lecture Notes for Chapter 5


Introduction to Data Mining
by
Tan, Steinbach, Kumar

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

Rule-Based Classifier
Classify records by using a collection of "if...then..." rules

Rule: (Condition) → y
where
- Condition is a conjunction of attribute tests
- y is the class label

LHS: rule antecedent or condition
RHS: rule consequent

Examples of classification rules:
(Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
(Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No


Rule-based Classifier (Example)


Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
human          warm        yes         no       no             mammals
python         cold        no          no       no             reptiles
salmon         cold        no          no       yes            fishes
whale          warm        yes         no       yes            mammals
frog           cold        no          no       sometimes      amphibians
komodo         cold        no          no       no             reptiles
bat            warm        yes         yes      no             mammals
pigeon         warm        no          yes      no             birds
cat            warm        yes         no       no             mammals
leopard shark  cold        yes         no       yes            fishes
turtle         cold        no          no       sometimes      reptiles
penguin        warm        no          no       sometimes      birds
porcupine      warm        yes         no       no             mammals
eel            cold        no          no       yes            fishes
salamander     cold        no          no       sometimes      amphibians
gila monster   cold        no          no       no             reptiles
platypus       warm        no          no       no             mammals
owl            warm        no          yes      no             birds
dolphin        warm        yes         no       yes            mammals
eagle          warm        no          yes      no             birds

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Application of Rule-Based Classifier


A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name          Blood Type  Give Birth  Can Fly  Live in Water  Class
hawk          warm        no          yes      no             ?
grizzly bear  warm        yes         no       no             ?

The rule R1 covers a hawk => Bird
The rule R3 covers the grizzly bear => Mammal

Rule Coverage and Accuracy


Coverage of a rule:
- Fraction of records that satisfy the antecedent of the rule

Accuracy of a rule:
- Fraction of the records covered by the rule (those that satisfy the antecedent) that also satisfy its consequent

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Status=Single) → No
Coverage = 40%, Accuracy = 50%
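
The two numbers above can be reproduced with a short sketch. The tuple layout and the function name `coverage_and_accuracy` are illustrative choices, not part of the slides; the records are the ten rows of the table.

```python
# Each record: (refund, marital_status, taxable_income_in_K, class_label)
records = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]

def coverage_and_accuracy(records, antecedent, consequent):
    """antecedent is a predicate over a record; consequent is a class label."""
    covered = [r for r in records if antecedent(r)]
    coverage = len(covered) / len(records)
    # Accuracy is measured only over the records the rule covers.
    correct = sum(1 for r in covered if r[3] == consequent)
    accuracy = correct / len(covered) if covered else 0.0
    return coverage, accuracy

# Rule: (Status=Single) -> No
coverage, accuracy = coverage_and_accuracy(records, lambda r: r[1] == "Single", "No")
print(coverage, accuracy)
```

Four of the ten records are Single (coverage 40%), and two of those four have class No (accuracy 50%).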



How does Rule-based Classifier Work?


R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
lemur          warm        yes         no       no             ?
turtle         cold        no          no       sometimes      ?
dogfish shark  cold        yes         no       yes            ?

A lemur triggers rule R3, so it is classified as a mammal
A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules
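
As a rough illustration of what "triggers" means here, the sketch below encodes R1 to R5 as dictionaries of attribute tests and lists the rules each record satisfies. The data layout and the helper name `triggered` are assumptions made for this example.

```python
# Each rule: (name, {attribute: required value}, class label)
RULES = [
    ("R1", {"Give Birth": "no", "Can Fly": "yes"}, "Birds"),
    ("R2", {"Give Birth": "no", "Live in Water": "yes"}, "Fishes"),
    ("R3", {"Give Birth": "yes", "Blood Type": "warm"}, "Mammals"),
    ("R4", {"Give Birth": "no", "Can Fly": "no"}, "Reptiles"),
    ("R5", {"Live in Water": "sometimes"}, "Amphibians"),
]

def triggered(record):
    """Return the names of all rules whose antecedent the record satisfies."""
    return [name for name, cond, _ in RULES
            if all(record.get(attr) == value for attr, value in cond.items())]

lemur = {"Blood Type": "warm", "Give Birth": "yes", "Can Fly": "no", "Live in Water": "no"}
turtle = {"Blood Type": "cold", "Give Birth": "no", "Can Fly": "no", "Live in Water": "sometimes"}
dogfish = {"Blood Type": "cold", "Give Birth": "yes", "Can Fly": "no", "Live in Water": "yes"}

for name, rec in [("lemur", lemur), ("turtle", turtle), ("dogfish shark", dogfish)]:
    print(name, triggered(rec))
```

The turtle (two rules) and the dogfish shark (no rule) are the cases that motivate rule ordering and default classes.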


Characteristics of Rule-Based Classifier


Mutually exclusive rules
- Classifier contains mutually exclusive rules if the rules are independent of each other
- Every record is covered by at most one rule

Exhaustive rules
- Classifier has exhaustive coverage if it accounts for every possible combination of attribute values
- Each record is covered by at least one rule


From Decision Trees To Rules


[Figure: decision tree. Refund: Yes → NO; No → Marital Status. Marital Status: {Married} → NO; {Single, Divorced} → Taxable Income. Taxable Income: < 80K → NO; > 80K → YES]

Classification Rules
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No

Rules are mutually exclusive and exhaustive
Rule set contains as much information as the tree


Rules Can Be Simplified


[Figure: decision tree. Refund: Yes → NO; No → Marital Status. Marital Status: {Married} → NO; {Single, Divorced} → Taxable Income. Taxable Income: < 80K → NO; > 80K → YES]

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Initial Rule: (Refund=No) ∧ (Status=Married) → No
Simplified Rule: (Status=Married) → No



Effect of Rule Simplification


Rules are no longer mutually exclusive
- A record may trigger more than one rule
- Solution?
    Ordered rule set
    Unordered rule set: use voting schemes

Rules are no longer exhaustive
- A record may not trigger any rules
- Solution?
    Use a default class


Ordered Rule Set


Rules are rank ordered according to their priority
- An ordered rule set is known as a decision list

When a test record is presented to the classifier
- It is assigned to the class label of the highest ranked rule it has triggered
- If none of the rules fired, it is assigned to the default class

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name    Blood Type  Give Birth  Can Fly  Live in Water  Class
turtle  cold        no          no       sometimes      ?
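
A decision list can be sketched as a loop that returns the class of the first rule that fires. The function name `classify` and the default label `"unknown"` are placeholders chosen for this example; the slides do not name a default class.

```python
# Rules in priority order: ({attribute: required value}, class label)
DECISION_LIST = [
    ({"Give Birth": "no", "Can Fly": "yes"}, "Birds"),          # R1
    ({"Give Birth": "no", "Live in Water": "yes"}, "Fishes"),   # R2
    ({"Give Birth": "yes", "Blood Type": "warm"}, "Mammals"),   # R3
    ({"Give Birth": "no", "Can Fly": "no"}, "Reptiles"),        # R4
    ({"Live in Water": "sometimes"}, "Amphibians"),             # R5
]

def classify(record, rules=DECISION_LIST, default_class="unknown"):
    """Return the label of the highest-ranked rule the record triggers."""
    for condition, label in rules:
        if all(record.get(attr) == value for attr, value in condition.items()):
            return label
    return default_class  # hypothetical default; no rule fired

turtle = {"Blood Type": "cold", "Give Birth": "no", "Can Fly": "no", "Live in Water": "sometimes"}
dogfish = {"Blood Type": "cold", "Give Birth": "yes", "Can Fly": "no", "Live in Water": "yes"}
print(classify(turtle), classify(dogfish))
```

The turtle triggers both R4 and R5, but R4 is ranked higher, so under this ordering it is labeled Reptiles.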


Rule Ordering Schemes


Rule-based ordering
- Individual rules are ranked based on their quality

Class-based ordering
- Rules that belong to the same class appear together

Rule-based Ordering
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No

Class-based Ordering
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Married}) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes


Building Classification Rules


Direct Method:
- Extract rules directly from data
- e.g.: RIPPER, CN2, Holte's 1R

Indirect Method:
- Extract rules from other classification models (e.g. decision trees, neural networks, etc.)
- e.g.: C4.5rules


Direct Method: Sequential Covering


1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat Steps (2) and (3) until stopping criterion is met
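
The four steps can be sketched as follows. This is a simplified illustration, not RIPPER or CN2: `learn_one_rule` here greedily adds the attribute test with the highest accuracy (ties broken by the number of positives covered), which is an assumption made for the example, and the seven records are a hand-picked subset of the vertebrate table.

```python
def covers(rule, record):
    """A rule (dict of attribute -> value) covers a record if every conjunct matches."""
    return all(record[a] == v for a, v in rule.items())

def learn_one_rule(records, target, label="Class"):
    """Grow one rule general-to-specific until it covers no negative record."""
    rule, covered = {}, list(records)
    attributes = [a for a in records[0] if a != label]
    while not rule or any(r[label] != target for r in covered):
        best = None
        for a in attributes:
            if a in rule:
                continue
            for v in sorted({r[a] for r in covered}):
                subset = [r for r in covered if r[a] == v]
                positives = sum(r[label] == target for r in subset)
                score = (positives / len(subset), positives)
                if positives and (best is None or score > best[0]):
                    best = (score, a, v, subset)
        if best is None:  # no attribute left to add
            break
        _, a, v, covered = best
        rule[a] = v
    return rule

def sequential_covering(records, target, label="Class"):
    rules, remaining = [], list(records)
    while any(r[label] == target for r in remaining):   # stopping criterion
        rule = learn_one_rule(remaining, target, label)  # step 2
        if not rule:
            break
        rules.append(rule)
        remaining = [r for r in remaining if not covers(rule, r)]  # step 3
    return rules

def animal(birth, fly, water, cls):
    return {"Give Birth": birth, "Can Fly": fly, "Live in Water": water, "Class": cls}

animals = [
    animal("yes", "no", "no", "mammals"), animal("no", "no", "no", "reptiles"),
    animal("no", "no", "yes", "fishes"), animal("yes", "no", "yes", "mammals"),
    animal("yes", "yes", "no", "mammals"), animal("no", "yes", "no", "birds"),
    animal("no", "yes", "no", "birds"),
]
print(sequential_covering(animals, "mammals"))
print(sequential_covering(animals, "birds"))
```

On this small subset a single rule per class is enough; real algorithms differ mainly in the evaluation measure, the stopping criterion, and pruning.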


Example of Sequential Covering

[Figure: (i) Original Data; (ii) Step 1; (iii) Step 2, showing rule R1; (iv) Step 3, showing rules R1 and R2]

Aspects of Sequential Covering



Rule Growing

Instance Elimination

Rule Evaluation

Stopping Criterion

Rule Pruning


Rule Growing
Two common strategies

(a) General-to-specific
[Figure: start from the empty rule {} (Yes: 3, No: 4) and add one conjunct at a time. Candidate conjuncts and the class counts of the records they cover:
  Refund=No: Yes: 3, No: 4
  Status=Single: Yes: 2, No: 1
  Status=Divorced: Yes: 1, No: 0
  Status=Married: Yes: 0, No: 3
  ...
  Income>80K: Yes: 3, No: 1]

(b) Specific-to-general
[Figure: start from specific rules such as (Refund=No, Status=Single, Income=85K) (Class=Yes) and (Refund=No, Status=Single, Income=90K) (Class=Yes), then generalize to (Refund=No, Status=Single) (Class=Yes)]

Rule Growing (Examples)


CN2 Algorithm:
- Start from an empty conjunct: {}
- Add conjuncts that minimize the entropy measure: {A}, {A,B}, ...
- Determine the rule consequent by taking majority class of instances covered by the rule

RIPPER Algorithm:
- Start from an empty rule: {} => class
- Add conjuncts that maximize FOIL's information gain measure:
    R0: {} => class (initial rule)
    R1: {A} => class (rule after adding conjunct)
    $\text{Gain}(R_0, R_1) = t \left[ \log \frac{p_1}{p_1 + n_1} - \log \frac{p_0}{p_0 + n_0} \right]$
  where t: number of positive instances covered by both R0 and R1
        p0: number of positive instances covered by R0
        n0: number of negative instances covered by R0
        p1: number of positive instances covered by R1
        n1: number of negative instances covered by R1
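
A small numeric sketch of the gain formula follows. It assumes a base-2 logarithm (the slide only writes "log") and that R1 is a specialization of R0, so every positive covered by R1 is also covered by R0 and t = p1. The counts are taken from the rule-growing figure: the empty rule covers Yes: 3, No: 4, and Income > 80K covers Yes: 3, No: 1.

```python
import math

def foil_gain(p0, n0, p1, n1):
    """FOIL's information gain for extending rule R0 into rule R1."""
    t = p1  # assumption: R1 specializes R0, so positives covered by both = p1
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

gain = foil_gain(p0=3, n0=4, p1=3, n1=1)
print(round(gain, 3))
```

The gain is positive because the new conjunct keeps all three positives while dropping three of the four negatives.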


Instance Elimination
Why do we need to eliminate instances?
- Otherwise, the next rule is identical to previous rule

Why do we remove positive instances?
- Ensure that the next rule is different

Why do we remove negative instances?
- Prevent underestimating accuracy of rule
- Compare rules R2 and R3 in the diagram

[Figure: scatter of positive (class = +) and negative (class = -) instances with the regions covered by rules R1, R2, and R3]

Rule Evaluation
Metrics:
- Accuracy $= \frac{n_c}{n}$
- Laplace $= \frac{n_c + 1}{n + k}$
- M-estimate $= \frac{n_c + kp}{n + k}$

n : Number of instances covered by rule
nc : Number of instances covered by rule that belong to the class predicted by the rule
k : Number of classes
p : Prior probability
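
The three metrics can be compared on two candidate rules from the rule-growing figure. The function names are illustrative, and using p = 3/7 (the fraction of Yes records in that figure) as the prior is an assumption made for this example.

```python
def accuracy(nc, n):
    return nc / n

def laplace(nc, n, k):
    return (nc + 1) / (n + k)

def m_estimate(nc, n, k, p):
    return (nc + k * p) / (n + k)

k, p = 2, 3 / 7  # two classes; assumed prior of class Yes

# Status=Single covers Yes: 2, No: 1; Status=Divorced covers Yes: 1, No: 0
single = (accuracy(2, 3), laplace(2, 3, k), m_estimate(2, 3, k, p))
divorced = (accuracy(1, 1), laplace(1, 1, k), m_estimate(1, 1, k, p))
print(single)
print(divorced)
```

Plain accuracy rates the one-record rule at 1.0, while the Laplace and m-estimate measures pull it toward the prior because its coverage is so low.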


Stopping Criterion and Rule Pruning


Stopping criterion
- Compute the gain
- If gain is not significant, discard the new rule

Rule Pruning
- Similar to post-pruning of decision trees
- Reduced Error Pruning:
    Remove one of the conjuncts in the rule
    Compare error rate on validation set before and after pruning
    If error improves, prune the conjunct


Summary of Direct Method


Grow a single rule

Remove instances covered by the rule

Prune the rule (if necessary)

Add rule to Current Rule Set

Repeat


Direct Method: RIPPER


For 2-class problem, choose one of the classes as positive class, and the other as negative class
- Learn rules for positive class
- Negative class will be default class

For multi-class problem
- Order the classes according to increasing class prevalence (fraction of instances that belong to a particular class)
- Learn the rule set for smallest class first, treat the rest as negative class
- Repeat with next smallest class as positive class


Direct Method: RIPPER


Growing a rule:
- Start from empty rule
- Add conjuncts as long as they improve FOIL's information gain
- Stop when rule no longer covers negative examples
- Prune the rule immediately using incremental reduced error pruning
- Measure for pruning: v = (p-n)/(p+n)
    p: number of positive examples covered by the rule in the validation set
    n: number of negative examples covered by the rule in the validation set
- Pruning method: delete any final sequence of conditions that maximizes v

Direct Method: RIPPER


Building a Rule Set:
- Use sequential covering algorithm
    Finds the best rule that covers the current set of positive examples
    Eliminate both positive and negative examples covered by the rule
- Each time a rule is added to the rule set, compute the new description length
    Stop adding new rules when the new description length is d bits longer than the smallest description length obtained so far


Direct Method: RIPPER


Optimize the rule set:
- For each rule r in the rule set R
    Consider 2 alternative rules:
      Replacement rule (r*): grow new rule from scratch
      Revised rule (r'): add conjuncts to extend the rule r
    Compare the rule set for r against the rule set for r* and r'
    Choose the rule set with the smallest description length (MDL principle)
- Repeat rule generation and rule optimization for the remaining positive examples


Indirect Methods

[Figure: decision tree. P: No → Q, Yes → R. Under P=No, Q: No → -, Yes → +. Under P=Yes, R: No → +, Yes → Q. Under R=Yes, Q: No → -, Yes → +]

Rule Set
r1: (P=No,Q=No) ==> -
r2: (P=No,Q=Yes) ==> +
r3: (P=Yes,R=No) ==> +
r4: (P=Yes,R=Yes,Q=No) ==> -
r5: (P=Yes,R=Yes,Q=Yes) ==> +


Indirect Method: C4.5rules


Extract rules from an unpruned decision tree
For each rule, r: A → y,
- Consider an alternative rule r': A' → y where A' is obtained by removing one of the conjuncts in A
- Compare the pessimistic error rate for r against all r's
- Prune if one of the r's has lower pessimistic error rate
- Repeat until we can no longer improve generalization error


Indirect Method: C4.5rules


Instead of ordering the rules, order subsets of rules (class ordering)
- Each subset is a collection of rules with the same rule consequent (class)
- Compute description length of each subset
    Description length = L(error) + g L(model)
    g is a parameter that takes into account the presence of redundant attributes in a rule set (default value = 0.5)


Example
Name           Give Birth  Lay Eggs  Can Fly  Live in Water  Have Legs  Class
human          yes         no        no       no             yes        mammals
python         no          yes       no       no             no         reptiles
salmon         no          yes       no       yes            no         fishes
whale          yes         no        no       yes            no         mammals
frog           no          yes       no       sometimes      yes        amphibians
komodo         no          yes       no       no             yes        reptiles
bat            yes         no        yes      no             yes        mammals
pigeon         no          yes       yes      no             yes        birds
cat            yes         no        no       no             yes        mammals
leopard shark  yes         no        no       yes            no         fishes
turtle         no          yes       no       sometimes      yes        reptiles
penguin        no          yes       no       sometimes      yes        birds
porcupine      yes         no        no       no             yes        mammals
eel            no          yes       no       yes            no         fishes
salamander     no          yes       no       sometimes      yes        amphibians
gila monster   no          yes       no       no             yes        reptiles
platypus       no          yes       no       no             yes        mammals
owl            no          yes       yes      no             yes        birds
dolphin        yes         no        no       yes            no         mammals
eagle          no          yes       yes      no             yes        birds

C4.5 versus C4.5rules versus RIPPER


[Figure: C4.5 decision tree. Give Birth?: Yes → Mammals; No → Live In Water?. Live In Water?: Yes → Fishes; Sometimes → Amphibians; No → Can Fly?. Can Fly?: Yes → Birds; No → Reptiles]

C4.5rules:
(Give Birth=No, Can Fly=Yes) → Birds
(Give Birth=No, Live in Water=Yes) → Fishes
(Give Birth=Yes) → Mammals
(Give Birth=No, Can Fly=No, Live in Water=No) → Reptiles
( ) → Amphibians

RIPPER:
(Live in Water=Yes) → Fishes
(Have Legs=No) → Reptiles
(Give Birth=No, Can Fly=No, Live In Water=No) → Reptiles
(Can Fly=Yes, Give Birth=No) → Birds
( ) → Mammals


C4.5 versus C4.5rules versus RIPPER


C4.5 and C4.5rules:

                     PREDICTED CLASS
ACTUAL CLASS   Amphibians  Fishes  Reptiles  Birds  Mammals
Amphibians     2           0       0         0      0
Fishes         0           2       0         0      1
Reptiles       1           0       3         0      0
Birds          1           0       0         3      0
Mammals        0           0       1         0      6

RIPPER:

                     PREDICTED CLASS
ACTUAL CLASS   Amphibians  Fishes  Reptiles  Birds  Mammals
Amphibians     0           0       0         0      2
Fishes         0           3       0         0      0
Reptiles       0           0       3         0      1
Birds          0           0       1         2      1
Mammals        0           2       1         0      4


Advantages of Rule-Based Classifiers


- As highly expressive as decision trees
- Easy to interpret
- Easy to generate
- Can classify new instances rapidly
- Performance comparable to decision trees


Instance-Based Classifiers
- Store the training records
- Use training records to predict the class label of unseen cases

[Figure: Set of Stored Cases, a table with columns Atr1, ..., AtrN, Class (class values A, B, B, C, A, C, B), and an Unseen Case with attributes Atr1, ..., AtrN]


Instance Based Classifiers


Examples:
- Rote-learner
    Memorizes entire training data and performs classification only if attributes of record match one of the training examples exactly
- Nearest neighbor
    Uses k "closest" points (nearest neighbors) for performing classification


Nearest Neighbor Classifiers


Basic idea:
- If it walks like a duck, quacks like a duck, then it's probably a duck

[Figure: compute the distance from the test record to the training records, then choose k of the "nearest" records]


Nearest-Neighbor Classifiers
[Figure: unknown record]

Requires three things
- The set of stored records
- Distance metric to compute distance between records
- The value of k, the number of nearest neighbors to retrieve

To classify an unknown record:
- Compute distance to other training records
- Identify k nearest neighbors
- Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote)


Definition of Nearest Neighbor

[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]

K-nearest neighbors of a record x are data points that have the k smallest distances to x

1 nearest-neighbor
[Figure: Voronoi Diagram]

Nearest Neighbor Classification


Compute distance between two points:
- Euclidean distance
    $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$

Determine the class from nearest neighbor list
- Take the majority vote of class labels among the k-nearest neighbors
- Weigh the vote according to distance
    weight factor, $w = 1/d^2$
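
Both voting schemes can be sketched in a few lines. The five training points, the function names, and the handling of a zero distance are illustrative assumptions, not part of the slides.

```python
import math
from collections import defaultdict

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(train, x, k, weighted=False):
    """train is a list of (point, label); returns the winning label."""
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], x))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = euclidean(point, x)
        if weighted:
            if d == 0:           # exact match: treat it as decisive
                return label
            votes[label] += 1.0 / (d * d)   # w = 1/d^2
        else:
            votes[label] += 1.0             # plain majority vote
    return max(votes, key=votes.get)

train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"),
         ((3.0, 3.0), "-"), ((3.2, 2.9), "-"), ((2.9, 3.3), "-")]
x = (1.1, 1.0)
print(knn_classify(train, x, k=3))
print(knn_classify(train, x, k=5))
print(knn_classify(train, x, k=5, weighted=True))
```

With k = 5 the plain vote is swamped by the three distant "-" points, while the distance-weighted vote still follows the two close "+" points.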


Nearest Neighbor Classification


Choosing the value of k:
- If k is too small, sensitive to noise points
- If k is too large, neighborhood may include points from other classes


Nearest Neighbor Classification


Scaling issues
- Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
- Example:
    height of a person may vary from 1.5m to 1.8m
    weight of a person may vary from 90lb to 300lb
    income of a person may vary from $10K to $1M
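
One common remedy is min-max scaling, which maps every attribute to [0, 1]. The slides do not prescribe a particular scaling method, so this choice, the function name, and the three sample people are assumptions for illustration.

```python
def min_max_scale(rows):
    """Rescale each column of a list of numeric tuples to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

# (height in m, weight in lb, income in $)
people = [(1.5, 90, 10_000), (1.8, 300, 1_000_000), (1.65, 195, 505_000)]
scaled = min_max_scale(people)
for row in scaled:
    print(row)
```

Before scaling, income differences dwarf the other two attributes in any Euclidean distance; after scaling, each attribute contributes on the same footing.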


Nearest Neighbor Classification


Problem with Euclidean measure:
- High dimensional data
    curse of dimensionality
- Can produce counter-intuitive results

    1 1 1 1 1 1 1 1 1 1 1 0    vs    1 0 0 0 0 0 0 0 0 0 0 0
    0 1 1 1 1 1 1 1 1 1 1 1          0 0 0 0 0 0 0 0 0 0 0 1
    d = 1.4142                       d = 1.4142

- Solution: Normalize the vectors to unit length
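
The effect of normalization on the two pairs above can be checked directly. The helper names are illustrative.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

a = [1] * 11 + [0]   # 111111111110
b = [0] + [1] * 11   # 011111111111
c = [1] + [0] * 11   # 100000000000
d = [0] * 11 + [1]   # 000000000001

raw = (euclidean(a, b), euclidean(c, d))
normalized = (euclidean(unit(a), unit(b)), euclidean(unit(c), unit(d)))
print(raw)
print(normalized)
```

Both raw distances are 1.4142, although the first pair shares ten of its eleven 1s and the second pair shares none. After normalization the first pair is much closer than the second, which matches intuition.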


Nearest neighbor Classification


k-NN classifiers are lazy learners
- They do not build models explicitly
- Unlike eager learners such as decision tree induction and rule-based systems
- Classifying unknown records is relatively expensive


Example: PEBLS
PEBLS: Parallel Exemplar-Based Learning System (Cost & Salzberg)
- Works with both continuous and nominal features
    For nominal features, distance between two nominal values is computed using modified value difference metric (MVDM)
- Each record is assigned a weight factor
- Number of nearest neighbors, k = 1


Example: PEBLS
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Class counts per attribute value (counted from the table above):

         Marital Status
Class    Single  Married  Divorced
Yes      2       0        1
No       2       4        1

         Refund
Class    Yes  No
Yes      0    3
No       3    4

Distance between nominal attribute values:

$d(V_1, V_2) = \sum_i \left| \frac{n_{1i}}{n_1} - \frac{n_{2i}}{n_2} \right|$

d(Single,Married) = | 2/4 - 0/4 | + | 2/4 - 4/4 | = 1
d(Single,Divorced) = | 2/4 - 1/2 | + | 2/4 - 1/2 | = 0
d(Married,Divorced) = | 0/4 - 1/2 | + | 4/4 - 1/2 | = 1
d(Refund=Yes,Refund=No) = | 0/3 - 3/7 | + | 3/3 - 4/7 | = 6/7


Example: PEBLS
Tid  Refund  Marital Status  Taxable Income  Cheat
X    Yes     Single          125K            No
Y    No      Married         100K            No

Distance between record X and record Y:

$\Delta(X, Y) = w_X w_Y \sum_{i=1}^{d} d(X_i, Y_i)^2$

where:

$w_X = \frac{\text{Number of times X is used for prediction}}{\text{Number of times X predicts correctly}}$

$w_X \approx 1$ if X makes accurate prediction most of the time
$w_X > 1$ if X is not reliable for making predictions

Bayes Classifier
A probabilistic framework for solving classification problems

Conditional Probability:
  $P(C \mid A) = \frac{P(A, C)}{P(A)}$
  $P(A \mid C) = \frac{P(A, C)}{P(C)}$

Bayes theorem:
  $P(C \mid A) = \frac{P(A \mid C)\, P(C)}{P(A)}$


Example of Bayes Theorem


Given:
- A doctor knows that meningitis causes stiff neck 50% of the time
- Prior probability of any patient having meningitis is 1/50,000
- Prior probability of any patient having stiff neck is 1/20

If a patient has stiff neck, what's the probability he/she has meningitis?

$P(M \mid S) = \frac{P(S \mid M)\, P(M)}{P(S)} = \frac{0.5 \times 1/50000}{1/20} = 0.0002$
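
The arithmetic can be checked with exact fractions. The variable names are illustrative.

```python
from fractions import Fraction

p_s_given_m = Fraction(1, 2)     # P(S | M): meningitis causes stiff neck 50% of the time
p_m = Fraction(1, 50000)         # P(M): prior probability of meningitis
p_s = Fraction(1, 20)            # P(S): prior probability of stiff neck

p_m_given_s = p_s_given_m * p_m / p_s   # Bayes theorem
print(p_m_given_s, float(p_m_given_s))
```

Even though half of all meningitis patients have a stiff neck, the posterior stays small (1 in 5,000) because the prior for meningitis is so low.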

Bayesian Classifiers
Consider each attribute and class label as random variables

Given a record with attributes (A1, A2, ..., An)
- Goal is to predict class C
- Specifically, we want to find the value of C that maximizes P(C | A1, A2, ..., An)

Can we estimate P(C | A1, A2, ..., An) directly from data?


Bayesian Classifiers
Approach:
- Compute the posterior probability P(C | A1, A2, ..., An) for all values of C using the Bayes theorem
    $P(C \mid A_1 A_2 \ldots A_n) = \frac{P(A_1 A_2 \ldots A_n \mid C)\, P(C)}{P(A_1 A_2 \ldots A_n)}$
- Choose value of C that maximizes P(C | A1, A2, ..., An)
- Equivalent to choosing value of C that maximizes P(A1, A2, ..., An | C) P(C)

How to estimate P(A1, A2, ..., An | C)?


Naïve Bayes Classifier

Assume independence among attributes Ai when class is given:
- P(A1, A2, ..., An | Cj) = P(A1 | Cj) P(A2 | Cj) ... P(An | Cj)
- Can estimate P(Ai | Cj) for all Ai and Cj.
- New point is classified to Cj if $P(C_j) \prod_i P(A_i \mid C_j)$ is maximal.


How to Estimate Probabilities from Data?


Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Class: P(C) = Nc/N
- e.g., P(No) = 7/10, P(Yes) = 3/10

For discrete attributes:
  P(Ai | Ck) = |Aik| / Nck
- where |Aik| is the number of instances having attribute Ai and belonging to class Ck
- Examples:
    P(Status=Married|No) = 4/7
    P(Refund=Yes|Yes) = 0

How to Estimate Probabilities from Data?


For continuous attributes:
- Discretize the range into bins
    one ordinal attribute per bin
    violates independence assumption
- Two-way split: (A < v) or (A > v)
    choose only one of the two splits as new attribute
- Probability density estimation:
    Assume attribute follows a normal distribution
    Use data to estimate parameters of distribution (e.g., mean and standard deviation)
    Once probability distribution is known, can use it to estimate the conditional probability P(Ai|c)


How to Estimate Probabilities from Data?


Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Normal distribution:

$P(A_i \mid c_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\left( -\frac{(A_i - \mu_{ij})^2}{2\sigma_{ij}^2} \right)$

- One for each (Ai, cj) pair

For (Income, Class=No):
- If Class=No
    sample mean = 110
    sample variance = 2975

$P(\text{Income} = 120 \mid \text{No}) = \frac{1}{\sqrt{2\pi}\,(54.54)} \exp\left( -\frac{(120 - 110)^2}{2(2975)} \right) = 0.0072$


Example of Naïve Bayes Classifier

Given a Test Record:
X = (Refund = No, Married, Income = 120K)

Naïve Bayes Classifier:
P(Refund=Yes|No) = 3/7
P(Refund=No|No) = 4/7
P(Refund=Yes|Yes) = 0
P(Refund=No|Yes) = 1
P(Marital Status=Single|No) = 2/7
P(Marital Status=Divorced|No) = 1/7
P(Marital Status=Married|No) = 4/7
P(Marital Status=Single|Yes) = 2/3
P(Marital Status=Divorced|Yes) = 1/3
P(Marital Status=Married|Yes) = 0
For taxable income:
If class=No: sample mean=110, sample variance=2975
If class=Yes: sample mean=90, sample variance=25

P(X|Class=No) = P(Refund=No|Class=No) × P(Married|Class=No) × P(Income=120K|Class=No)
              = 4/7 × 4/7 × 0.0072 = 0.0024

P(X|Class=Yes) = P(Refund=No|Class=Yes) × P(Married|Class=Yes) × P(Income=120K|Class=Yes)
               = 1 × 0 × 1.2 × 10^-9 = 0

Since P(X|No)P(No) > P(X|Yes)P(Yes)
Therefore P(No|X) > P(Yes|X)
=> Class = No
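
The whole calculation can be reproduced from the ten training records. This is a minimal sketch: the function names are illustrative, the discrete probabilities are plain relative frequencies, and income uses a normal density with the sample mean and sample variance, as on the slides.

```python
import math

# (refund, marital_status, taxable_income_in_K, evade)
records = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]

def gaussian(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def score(refund, status, income, label):
    """Return P(X | label) * P(label) under the naive independence assumption."""
    rows = [r for r in records if r[3] == label]
    n = len(rows)
    prior = n / len(records)
    p_refund = sum(r[0] == refund for r in rows) / n
    p_status = sum(r[1] == status for r in rows) / n
    incomes = [r[2] for r in rows]
    mean = sum(incomes) / n
    var = sum((v - mean) ** 2 for v in incomes) / (n - 1)  # sample variance
    return prior * p_refund * p_status * gaussian(income, mean, var)

score_no = score("No", "Married", 120, "No")
score_yes = score("No", "Married", 120, "Yes")
prediction = "No" if score_no > score_yes else "Yes"
print(score_no, score_yes, prediction)
```

The Yes score is exactly zero because no Married record evades in the training data, which is the zero-probability problem that smoothing addresses.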


Naïve Bayes Classifier

If one of the conditional probabilities is zero, then the entire expression becomes zero

Probability estimation:
  Original: $P(A_i \mid C) = \frac{N_{ic}}{N_c}$
  Laplace: $P(A_i \mid C) = \frac{N_{ic} + 1}{N_c + c}$
  m-estimate: $P(A_i \mid C) = \frac{N_{ic} + mp}{N_c + m}$

c: number of classes
p: prior probability
m: parameter
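
As a quick illustration, take P(Marital Status=Married | Yes) from the tax data, where N_ic = 0 and N_c = 3. The values m = 3 and p = 1/3 below are arbitrary choices for this sketch, not values given in the slides.

```python
from fractions import Fraction

def original(n_ic, n_c):
    return Fraction(n_ic, n_c)

def laplace(n_ic, n_c, c):
    return Fraction(n_ic + 1, n_c + c)

def m_estimate(n_ic, n_c, m, p):
    return (n_ic + m * p) / (n_c + m)

n_ic, n_c = 0, 3   # no Married record among the 3 records of class Yes
estimates = (
    original(n_ic, n_c),
    laplace(n_ic, n_c, c=2),                         # c = 2 classes
    m_estimate(n_ic, n_c, m=3, p=Fraction(1, 3)),    # assumed m and p
)
print(estimates)
```

Both smoothed estimates are nonzero, so a single unseen attribute value no longer forces the whole product to zero.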


Example of Naïve Bayes Classifier

Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals

Test record:
Give Birth  Can Fly  Live in Water  Have Legs  Class
yes         no       yes            no         ?

A: attributes
M: mammals
N: non-mammals

$P(A \mid M) = \frac{6}{7} \times \frac{6}{7} \times \frac{2}{7} \times \frac{2}{7} = 0.06$
$P(A \mid N) = \frac{1}{13} \times \frac{10}{13} \times \frac{3}{13} \times \frac{4}{13} = 0.0042$
$P(A \mid M)\, P(M) = 0.06 \times \frac{7}{20} = 0.021$
$P(A \mid N)\, P(N) = 0.004 \times \frac{13}{20} = 0.0027$

P(A|M)P(M) > P(A|N)P(N)
=> Mammals


Naïve Bayes (Summary)

- Robust to isolated noise points
- Handle missing values by ignoring the instance during probability estimate calculations
- Robust to irrelevant attributes
- Independence assumption may not hold for some attributes
    Use other techniques such as Bayesian Belief Networks (BBN)


Artificial Neural Networks (ANN)

X1  X2  X3  Y
1   0   0   0
1   0   1   1
1   1   0   1
1   1   1   1
0   0   1   0
0   1   0   0
0   1   1   1
0   0   0   0

[Figure: inputs X1, X2, X3 feed a black box that produces the output Y]

Output Y is 1 if at least two of the three inputs are equal to 1.


Artificial Neural Networks (ANN)

X1  X2  X3  Y
1   0   0   0
1   0   1   1
1   1   0   1
1   1   1   1
0   0   1   0
0   1   0   0
0   1   1   1
0   0   0   0

[Figure: input nodes X1, X2, X3 connected to the output node Y, each with weight 0.3; threshold t = 0.4]

$Y = I(0.3 X_1 + 0.3 X_2 + 0.3 X_3 - 0.4 > 0)$

where $I(z) = 1$ if z is true, 0 otherwise
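
It is easy to confirm that these weights and this threshold reproduce the table. The function name `perceptron` and the default arguments are illustrative.

```python
def perceptron(x1, x2, x3, weights=(0.3, 0.3, 0.3), t=0.4):
    """Y = I(w1*X1 + w2*X2 + w3*X3 - t > 0)."""
    total = sum(w * x for w, x in zip(weights, (x1, x2, x3)))
    return 1 if total - t > 0 else 0

# Rows of the table: (X1, X2, X3, Y)
table = [
    (1, 0, 0, 0), (1, 0, 1, 1), (1, 1, 0, 1), (1, 1, 1, 1),
    (0, 0, 1, 0), (0, 1, 0, 0), (0, 1, 1, 1), (0, 0, 0, 0),
]
predictions = [perceptron(x1, x2, x3) for x1, x2, x3, _ in table]
print(predictions)
```

One active input gives 0.3 - 0.4 < 0, while two or three active inputs give a positive sum, which is exactly the "at least two of three" rule.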


Artificial Neural Networks (ANN)


- Model is an assembly of inter-connected nodes and weighted links
- Output node sums up each of its input values according to the weights of its links
- Compare output node against some threshold t

[Figure: input nodes X1, X2, X3 connected to the output node Y with weights w1, w2, w3; threshold t]

Perceptron Model

$Y = I\left( \sum_i w_i X_i - t \right)$ or $Y = \text{sign}\left( \sum_i w_i X_i - t \right)$


General Structure of ANN


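[Figure: feed-forward network with an input layer (x1, ..., x5), a hidden layer, and an output layer producing y. Detail of neuron i: inputs I1, I2, I3 with weights wi1, wi2, wi3 are summed into Si, compared against threshold t, and passed through the activation function g(Si) to give the output Oi]

Training ANN means learning the weights of the neurons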


Algorithm for learning ANN


- Initialize the weights (w0, w1, ..., wk)
- Adjust the weights in such a way that the output of ANN is consistent with class labels of training examples
    Objective function: $E = \sum_i \left[ Y_i - f(w_i, X_i) \right]^2$
    Find the weights wi's that minimize the above objective function
      e.g., backpropagation algorithm (see lecture notes)


Support Vector Machines

Find a linear hyperplane (decision boundary) that will separate the data

[Figure: one possible solution, hyperplane B1]
[Figure: another possible solution, hyperplane B2]
[Figure: other possible solutions]

Which one is better? B1 or B2?
How do you define better?

[Figure: hyperplanes B1 and B2 with their margins, bounded by b11, b12 and b21, b22]

Find the hyperplane that maximizes the margin => B1 is better than B2

Support Vector Machines


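[Figure: hyperplane B1 with margin boundaries b11 and b12]

Decision boundary: $\vec{w} \cdot \vec{x} + b = 0$
Margin boundaries: $\vec{w} \cdot \vec{x} + b = +1$ and $\vec{w} \cdot \vec{x} + b = -1$

$f(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} + b \ge 1 \\ -1 & \text{if } \vec{w} \cdot \vec{x} + b \le -1 \end{cases}$

$\text{Margin} = \frac{2}{\|\vec{w}\|}$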

Support Vector Machines


We want to maximize: $\text{Margin} = \frac{2}{\|\vec{w}\|}$
- Which is equivalent to minimizing: $L(w) = \frac{\|\vec{w}\|^2}{2}$
- But subjected to the following constraints:
    $f(\vec{x}_i) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x}_i + b \ge 1 \\ -1 & \text{if } \vec{w} \cdot \vec{x}_i + b \le -1 \end{cases}$

This is a constrained optimization problem
- Numerical approaches to solve it (e.g., quadratic programming)


Support Vector Machines


What if the problem is not linearly separable?
- Introduce slack variables
    Need to minimize: $L(w) = \frac{\|\vec{w}\|^2}{2} + C \left( \sum_{i=1}^{N} \xi_i^k \right)$
    Subject to:
    $f(\vec{x}_i) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x}_i + b \ge 1 - \xi_i \\ -1 & \text{if } \vec{w} \cdot \vec{x}_i + b \le -1 + \xi_i \end{cases}$


Nonlinear Support Vector Machines


What if decision boundary is not linear?
- Transform data into higher dimensional space


Ensemble Methods
- Construct a set of classifiers from the training data
- Predict class label of previously unseen records by aggregating predictions made by multiple classifiers


General Idea
[Figure: Step 1: create multiple data sets D1, D2, ..., Dt-1, Dt from the original training data D. Step 2: build multiple classifiers C1, C2, ..., Ct-1, Ct. Step 3: combine the classifiers into C*]


Why does it work?


Suppose there are 25 base classifiers
- Each classifier has error rate, ε = 0.35
- Assume classifiers are independent
- Probability that the ensemble classifier makes a wrong prediction:
    $\sum_{i=13}^{25} \binom{25}{i} \varepsilon^i (1 - \varepsilon)^{25 - i} = 0.06$
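
The sum can be evaluated directly. The sketch assumes a majority vote over an odd number of independent classifiers, so the ensemble errs when at least 13 of the 25 are wrong; the function name is illustrative.

```python
from math import comb

def ensemble_error(n, eps):
    """Probability that a majority of n independent classifiers are wrong."""
    k = n // 2 + 1  # smallest number of wrong votes that forms a majority
    return sum(comb(n, i) * eps ** i * (1 - eps) ** (n - i) for i in range(k, n + 1))

err = ensemble_error(25, 0.35)
print(round(err, 2))
print(ensemble_error(25, 0.5))
```

The ensemble only helps when the base classifiers are better than random: at ε = 0.5 the ensemble error stays at 0.5.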


Examples of Ensemble Methods


How to generate an ensemble of classifiers?
- Bagging
- Boosting


Bagging
Sampling with replacement

Original Data      1  2  3   4   5  6  7   8   9  10
Bagging (Round 1)  7  8  10  8   2  5  10  10  5  9
Bagging (Round 2)  1  4  9   1   2  3  2   7   3  2
Bagging (Round 3)  1  8  5   10  5  5  9   6   3  7

Build classifier on each bootstrap sample

Each record has probability $1 - (1 - 1/n)^n$ of being selected for a given bootstrap sample
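
The selection probability follows from the chance of missing a record in all n draws, (1 - 1/n)^n. A short sketch (the function name is illustrative):

```python
import math

def prob_selected(n):
    """Probability that a given record appears at least once in a bootstrap sample of size n."""
    return 1 - (1 - 1 / n) ** n

p10 = prob_selected(10)
print(round(p10, 4))
print(round(prob_selected(1_000_000), 4), round(1 - 1 / math.e, 4))
```

For the 10-record example each record has about a 65% chance of appearing in a given round, and for large n the value approaches 1 - 1/e, roughly 0.632.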


Boosting
An iterative procedure to adaptively change distribution of training data by focusing more on previously misclassified records
- Initially, all N records are assigned equal weights
- Unlike bagging, weights may change at the end of each boosting round


Boosting
- Records that are wrongly classified will have their weights increased
- Records that are classified correctly will have their weights decreased

Original Data       1  2  3  4   5  6  7  8   9  10
Boosting (Round 1)  7  3  2  8   7  9  4  10  6  3
Boosting (Round 2)  5  4  9  4   2  5  1  7   4  2
Boosting (Round 3)  4  4  8  10  4  5  4  6   3  4

- Example 4 is hard to classify
- Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds


Example: AdaBoost
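Base classifiers: C1, C2, ..., CT

Error rate:

$\varepsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j \, \delta\left( C_i(x_j) \ne y_j \right)$

Importance of a classifier:

$\alpha_i = \frac{1}{2} \ln\left( \frac{1 - \varepsilon_i}{\varepsilon_i} \right)$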

Example: AdaBoost
Weight update:

$w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times \begin{cases} \exp(-\alpha_j) & \text{if } C_j(x_i) = y_i \\ \exp(\alpha_j) & \text{if } C_j(x_i) \ne y_i \end{cases}$

where $Z_j$ is the normalization factor

If any intermediate rounds produce error rate higher than 50%, the weights are reverted back to 1/n and the resampling procedure is repeated

Classification:

$C^*(x) = \arg\max_y \sum_{j=1}^{T} \alpha_j \, \delta\left( C_j(x) = y \right)$
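
One boosting round can be sketched as below. The sketch assumes the weights already sum to 1, so the error rate is simply the total weight of the misclassified records, and it assumes 0 < ε < 1 so the logarithm is defined. The function name and the ten-record example are illustrative.

```python
import math

def adaboost_round(weights, correct):
    """Return (alpha, new_weights) for one round.

    weights: current record weights, assumed to sum to 1
    correct: booleans, True where the base classifier was right
    """
    eps = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1 - eps) / eps)
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    z = sum(updated)  # normalization factor Z_j
    return alpha, [w / z for w in updated]

weights = [0.1] * 10
correct = [True] * 8 + [False] * 2   # the classifier misses the last two records
alpha, new_weights = adaboost_round(weights, correct)
print(round(alpha, 4), [round(w, 4) for w in new_weights])
```

After the update the two misclassified records carry half of the total weight between them, which is what makes them more likely to be sampled in the next round.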

Illustrating AdaBoost
[Figure: original data (points labeled + and -) with an initial weight of 0.1 for each data point. Boosting Round 1: classifier B1, updated weights 0.0094, 0.4623, 0.0094, α = 1.9459]

Illustrating AdaBoost
[Figure: Boosting Round 1 (B1): weights 0.0094, 0.0094, 0.4623, α = 1.9459. Boosting Round 2 (B2): weights 0.0009, 0.3037, 0.0422, α = 2.9323. Boosting Round 3 (B3): weights 0.0276, 0.1819, 0.0038, α = 3.8744. Overall: combined prediction of the three rounds]
