
Data Mining

Lecture – 03
Classification: Definition
• Given a collection of records (the training set)
– Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of
the values of the other attributes.
• Goal: previously unseen records should be
assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and test sets,
with the training set used to build the model and the test set used to
validate it (see the sketch below).
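
This whole workflow (induction on a training set, then deduction on a test set, illustrated on the next slide) can be seen end to end in a few lines. A minimal sketch assuming scikit-learn is available, with a made-up toy dataset:

```python
# Minimal train/test classification workflow (a sketch; toy data).
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy records: [refund (0/1), taxable income in K]; class: cheat (0/1)
X = [[1, 125], [0, 100], [0, 70], [1, 120], [0, 95],
     [0, 60], [1, 220], [0, 85], [0, 75], [0, 90]]
y = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]

# Divide the given data set into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier()    # induction: build the model
model.fit(X_train, y_train)
y_pred = model.predict(X_test)      # deduction: apply the model
print("Test accuracy:", accuracy_score(y_test, y_pred))
```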
Illustrating Classification Task
Training Set:

    Tid  Attrib1  Attrib2  Attrib3  Class
    1    Yes      Large    125K     No
    2    No       Medium   100K     No
    3    No       Small    70K      No
    4    Yes      Medium   120K     No
    5    No       Large    95K      Yes
    6    No       Medium   60K      No
    7    Yes      Large    220K     No
    8    No       Small    85K      Yes
    9    No       Medium   75K      No
    10   No       Small    90K      Yes

Test Set:

    Tid  Attrib1  Attrib2  Attrib3  Class
    11   No       Small    55K      ?
    12   Yes      Medium   80K      ?
    13   Yes      Large    110K     ?
    14   No       Small    95K      ?
    15   No       Large    67K      ?

[Figure: the training set is fed to a learning algorithm, which learns a model (induction); the model is then applied to the test set to predict the unknown classes (deduction).]
Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Memory-based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
Example of a Decision Tree

Training Data:

    Tid  Refund  Marital Status  Taxable Income  Cheat
    1    Yes     Single          125K            No
    2    No      Married         100K            No
    3    No      Single          70K             No
    4    Yes     Married         120K            No
    5    No      Divorced        95K             Yes
    6    No      Married         60K             No
    7    Yes     Divorced        220K            No
    8    No      Single          85K             Yes
    9    No      Married         75K             No
    10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

    Refund?
    ├─ Yes → NO
    └─ No → MarSt?
        ├─ Single, Divorced → TaxInc?
        │    ├─ < 80K → NO
        │    └─ > 80K → YES
        └─ Married → NO
Another Example of Decision Tree

The same training data (Tids 1–10 above) is also fitted by this tree:

    MarSt?
    ├─ Married → NO
    └─ Single, Divorced → Refund?
        ├─ Yes → NO
        └─ No → TaxInc?
             ├─ < 80K → NO
             └─ > 80K → YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task
[Figure: the same framework as before – the training set (Tids 1–10) is fed to a tree induction algorithm to learn a model (induction), and the resulting decision tree is applied to the test set (Tids 11–15) to predict the unknown classes (deduction).]
Apply Model to Test Data
Test Data:

    Refund  Marital Status  Taxable Income  Cheat
    No      Married         80K             ?

Start from the root of the tree and follow the branch that matches the record at each node:

    Refund?
    ├─ Yes → NO
    └─ No → MarSt?
        ├─ Single, Divorced → TaxInc?
        │    ├─ < 80K → NO
        │    └─ > 80K → YES
        └─ Married → NO

1. Refund = No, so take the "No" branch to the MarSt node.
2. Marital Status = Married, so take the "Married" branch, which is a leaf labeled NO.

Assign Cheat to "No".
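
Once learned, the tree is just nested conditionals. A minimal hand-coded sketch of this particular model (function and argument names are ours, and how the 80K boundary is handled is an assumption):

```python
# Hand-coded version of the example decision tree above (a sketch of
# the learned model, not of the induction algorithm itself).
def classify(refund, marital_status, taxable_income_k):
    if refund == "Yes":
        return "No"                               # Refund=Yes leaf: NO
    if marital_status == "Married":
        return "No"                               # Married leaf: NO
    # Single or Divorced: test Taxable Income (boundary convention assumed)
    return "No" if taxable_income_k < 80 else "Yes"

# The test record from the slide: Refund=No, Married, 80K
print(classify("No", "Married", 80))              # -> No
```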
Decision Tree Induction
• Many Algorithms:
1. Hunt’s Algorithm (one of the earliest)
2. CART (Classification And Regression Tree)
3. ID3 (Iterative Dichotomiser 3)
4. C4.5 (Successor of ID3)
5. SLIQ (does not require loading the entire dataset into main memory)
6. SPRINT (takes a similar approach to SLIQ; induces decision trees relatively quickly)
7. CHAID (CHi-squared Automatic Interaction Detector): performs multi-level splits when computing classification trees.
8. MARS (Multivariate Adaptive Regression Splines): extends decision trees to handle numerical data better.
9. Conditional Inference Trees: a statistics-based approach that uses non-parametric tests as splitting criteria, corrected for multiple testing to avoid overfitting.
General Structure of Hunt’s Algorithm
• Let Dt be the set of training records that reach a node t.
• General Procedure:
– If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
– If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
– If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset (see the sketch below).

The running example is the Refund / Marital Status / Taxable Income / Cheat training data shown earlier (Tids 1–10).
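
A minimal Python sketch of this recursive procedure. The attribute-selection rule below is a deliberate placeholder (it just picks the first attribute that still varies); a real implementation would choose the split using an impurity measure, as discussed later. All function and field names are ours:

```python
# Sketch of Hunt's algorithm over records of the form (attrs_dict, label).
def choose_test(records):
    """Placeholder split selection: the first attribute with >1 value."""
    for a in records[0][0]:
        if len({r[0][a] for r in records}) > 1:
            parts = {}
            for r in records:
                parts.setdefault(r[0][a], []).append(r)
            return a, parts
    return None, {}

def hunt(records, default_class):
    if not records:                      # empty D_t: leaf labeled y_d
        return {"leaf": default_class}
    labels = [label for _, label in records]
    if len(set(labels)) == 1:            # pure D_t: leaf labeled y_t
        return {"leaf": labels[0]}
    attr, parts = choose_test(records)
    majority = max(set(labels), key=labels.count)
    if attr is None:                     # no attribute left to test
        return {"leaf": majority}
    return {"test": attr,
            "children": {v: hunt(sub, majority)
                         for v, sub in parts.items()}}

data = [({"Refund": "Yes", "Marital": "Single"}, "No"),
        ({"Refund": "No", "Marital": "Married"}, "No"),
        ({"Refund": "No", "Marital": "Single"}, "Yes")]
print(hunt(data, default_class="No"))
```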
Hunt’s Algorithm
Applied to this training data (with default class "Don't Cheat"), the tree grows in steps:

Step 1 – all records reach the root; it is labeled with the default class:

    Don't Cheat

Step 2 – split on Refund:

    Refund?
    ├─ Yes → Don't Cheat
    └─ No  → Don't Cheat

Step 3 – refine the Refund = No branch by splitting on Marital Status:

    Refund?
    ├─ Yes → Don't Cheat
    └─ No  → Marital Status?
         ├─ Single, Divorced → Cheat
         └─ Married → Don't Cheat

Step 4 – refine the Single, Divorced branch by splitting on Taxable Income:

    Refund?
    ├─ Yes → Don't Cheat
    └─ No  → Marital Status?
         ├─ Single, Divorced → Taxable Income?
         │    ├─ < 80K  → Don't Cheat
         │    └─ >= 80K → Cheat
         └─ Married → Don't Cheat
Evaluation of a Classifier
• How predictive is the model we learned?
– Which performance measure to use?
• Natural performance measure for classification
problems: error rate on a test set
– Success: instance’s class is predicted correctly
– Error: instance’s class is predicted incorrectly
– Error rate: proportion of errors made over the whole set of
instances
– Accuracy: proportion of correctly classified instances over
the whole set of instances
accuracy = 1 – error rate
Confusion Matrix
• A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.

                          PREDICTED CLASS
                          Class = Yes    Class = No
    ACTUAL   Class = Yes      a              b
    CLASS    Class = No       c              d

    a: TP (true positive)     b: FN (false negative)
    c: FP (false positive)    d: TN (true negative)
Confusion Matrix - Example
• What can we learn from this matrix?

                          PREDICTED CLASS
                          Class = No     Class = Yes
    ACTUAL   Class = No     TN = 50        FP = 10      (60)
    CLASS    Class = Yes    FN = 5         TP = 100     (105)
                            (55)           (110)        n = 165

– There are two possible predicted classes: "yes" and "no". If we were predicting the presence of a disease, for example, "yes" would mean they have the disease, and "no" would mean they don't.
– The classifier made a total of 165 predictions (e.g., 165 patients were being tested for the presence of that disease).
– Out of those 165 cases, the classifier predicted "yes" 110 times and "no" 55 times.
– In reality, 105 patients in the sample have the disease, and 60 patients do not.
Confusion Matrix – Confusion?
• False positives are actually negatives.
• False negatives are actually positives.
Confusion Matrix - Example
• Let's now define the most basic terms, which are whole numbers (not rates):
– true positives (TP): cases in which we predicted yes (they have the disease), and they do have the disease.
– true negatives (TN): we predicted no, and they don't have the disease.
– false positives (FP): we predicted yes, but they don't actually have the disease. (Also known as a "Type I error.")
– false negatives (FN): we predicted no, but they actually do have the disease. (Also known as a "Type II error.")
Confusion Matrix - Computations
• This is a list of rates that are often computed from a confusion matrix:
• Accuracy: Overall, how often is the classifier correct?
  (TP+TN)/total = (100+50)/165 = 0.91
• Misclassification Rate: Overall, how often is it wrong?
  (FP+FN)/total = (10+5)/165 = 0.09
  (equivalent to 1 minus Accuracy; also known as "Error Rate")
• True Positive Rate: When it's actually yes, how often does it predict yes?
  TP/actual yes = 100/105 = 0.95
  (also known as "Sensitivity" or "Recall")
• False Positive Rate: When it's actually no, how often does it predict yes?
  FP/actual no = 10/60 = 0.17
Confusion Matrix - Computations
• Specificity: When it's actually no, how often does it predict no?
  TN/actual no = 50/60 = 0.83
  (equivalent to 1 minus the False Positive Rate)
• Precision: When it predicts yes, how often is it correct?
  TP/predicted yes = 100/110 = 0.91
• Prevalence: How often does the yes condition actually occur in our sample?
  actual yes/total = 105/165 = 0.64
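
All of these rates follow mechanically from the four counts; a small sketch that reproduces the numbers above:

```python
# Rates from the worked confusion-matrix example (TP=100, TN=50,
# FP=10, FN=5; 165 predictions in total).
TP, TN, FP, FN = 100, 50, 10, 5
total = TP + TN + FP + FN                 # 165

accuracy    = (TP + TN) / total           # 150/165 ~ 0.91
error_rate  = (FP + FN) / total           # 15/165 ~ 0.09 = 1 - accuracy
tpr         = TP / (TP + FN)              # sensitivity/recall: 100/105 ~ 0.95
fpr         = FP / (FP + TN)              # 10/60 ~ 0.17
specificity = TN / (FP + TN)              # 50/60 ~ 0.83 = 1 - fpr
precision   = TP / (TP + FP)              # 100/110 ~ 0.91
prevalence  = (TP + FN) / total           # 105/165 ~ 0.64

for name, value in [("accuracy", accuracy), ("error rate", error_rate),
                    ("TPR", tpr), ("FPR", fpr), ("specificity", specificity),
                    ("precision", precision), ("prevalence", prevalence)]:
    print(f"{name}: {value:.2f}")
```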
Confusion Matrix – Example 2
• Imagine that you have a dataset that consists of 33 patterns that are 'Spam' (S) and 67 patterns that are 'Non-Spam' (NS).
• Of the 33 'Spam' patterns, 27 were correctly predicted as 'Spam' while 6 were incorrectly predicted as 'Non-Spam'.
• On the other hand, of the 67 'Non-Spam' patterns, 57 were correctly predicted as 'Non-Spam' while 10 were incorrectly classified as 'Spam'.

http://aimotion.blogspot.com/2010/08/tools-for-machine-learning-performance.html
Confusion Matrix – Example 2
• Accuracy = (TP+TN)/total = (27+57)/100 = 84%
• Misclassification Rate = (FP+FN)/total = (6+10)/100 = 16%
• True Positive Rate = TP/actual yes = 27/33 ≈ 0.82
• False Positive Rate = FP/actual no = 10/67 ≈ 0.15

http://www.marcovanetti.com/pages/cfmatrix/?noc=1
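
In practice such a matrix is usually produced by a library call rather than by hand. A sketch with scikit-learn, using synthetic label vectors constructed to match the counts above:

```python
# Building the spam/non-spam confusion matrix with scikit-learn.
# The label vectors are made up to match the example's counts
# (TP=27, FN=6, FP=10, TN=57).
from sklearn.metrics import confusion_matrix

y_true = ["S"] * 33 + ["NS"] * 67
y_pred = (["S"] * 27 + ["NS"] * 6        # 27 spams caught, 6 missed
          + ["S"] * 10 + ["NS"] * 57)    # 10 false alarms, 57 correct
print(confusion_matrix(y_true, y_pred, labels=["S", "NS"]))
# [[27  6]
#  [10 57]]
```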
Tree Induction
• Greedy strategy:
– Split the records based on an attribute test that optimizes a certain criterion.

• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
How to Specify Test Condition?
• Depends on attribute types
– Nominal
– Ordinal
– Continuous

• Depends on number of ways to split


– 2-way split
– Multi-way split
Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as distinct values.

    CarType?
    ├─ Family
    ├─ Sports
    └─ Luxury

• Binary split: divides the values into two subsets; need to find the optimal partitioning (the candidates can be enumerated, as in the sketch below).

    {Sports, Luxury} vs. {Family}    OR    {Family, Luxury} vs. {Sports}
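
For a nominal attribute with k distinct values there are 2^(k-1) - 1 candidate binary splits. A quick sketch that enumerates them (the helper name is ours):

```python
# Enumerate all distinct binary splits of a nominal attribute's values.
# For k values there are 2**(k - 1) - 1 such splits.
from itertools import combinations

def binary_splits(values):
    values = sorted(values)
    anchor, rest = values[0], values[1:]   # fixing one value avoids mirrored duplicates
    for r in range(len(rest)):
        for combo in combinations(rest, r):
            left = {anchor, *combo}
            yield left, set(values) - left

for left, right in binary_splits(["Family", "Sports", "Luxury"]):
    print(sorted(left), "vs.", sorted(right))
# ['Family'] vs. ['Luxury', 'Sports']
# ['Family', 'Luxury'] vs. ['Sports']
# ['Family', 'Sports'] vs. ['Luxury']
```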
Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as distinct values.

    Size?
    ├─ Small
    ├─ Medium
    └─ Large

• Binary split: divides the values into two subsets; need to find the optimal partitioning.

    {Small, Medium} vs. {Large}    OR    {Medium, Large} vs. {Small}

• What about the split {Small, Large} vs. {Medium}? (It groups non-adjacent values and therefore violates the ordering of the attribute.)
Splitting Based on Continuous Attributes
• Different ways of handling:
– Discretization to form an ordinal categorical attribute
  • Static – discretize once at the beginning
  • Dynamic – ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.
– Binary decision: (A < v) or (A ≥ v)
  • consider all possible splits and find the best cut (see the sketch below)
  • can be more compute-intensive
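
A minimal sketch of the exhaustive best-cut search for a single continuous attribute, scoring each candidate threshold with the weighted Gini index of the two partitions (all names are ours; using Gini here is our choice of criterion):

```python
# Find the best binary cut (A < v) vs. (A >= v) for one continuous
# attribute: try midpoints between consecutive distinct sorted values
# and keep the threshold with the lowest weighted Gini index.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels)) if n else 0.0

def best_cut(values, labels):
    pairs = sorted(zip(values, labels))
    best_v, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                              # no cut between equal values
        v = (pairs[i - 1][0] + pairs[i][0]) / 2   # candidate threshold
        left = [l for x, l in pairs if x < v]
        right = [l for x, l in pairs if x >= v]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_v, best_score = v, score
    return best_v, best_score

# Taxable income vs. Cheat from the earlier training data (in K)
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_cut(incomes, cheat))   # -> (97.5, 0.3) on this data
```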
Splitting Based on Continuous Attributes

(i) Binary split:

    Taxable Income > 80K?
    ├─ Yes
    └─ No

(ii) Multi-way split:

    Taxable Income?
    ├─ < 10K
    ├─ [10K, 25K)
    ├─ [25K, 50K)
    ├─ [50K, 80K)
    └─ > 80K
How to determine the Best Split
Before splitting: 10 records of class C0 and 10 records of class C1.

Three candidate test conditions:

    Own Car?
    ├─ Yes: C0: 6, C1: 4
    └─ No:  C0: 4, C1: 6

    Car Type?
    ├─ Family: C0: 1, C1: 3
    ├─ Sports: C0: 8, C1: 0
    └─ Luxury: C0: 1, C1: 7

    Student ID?
    ├─ c1 … c10: one record each (C0: 1, C1: 0)
    └─ c11 … c20: one record each (C0: 0, C1: 1)

Which test condition is the best?
How to determine the Best Split
• Greedy approach:
– Nodes with a homogeneous class distribution are preferred
• Need a measure of node impurity:

    C0: 5, C1: 5 – non-homogeneous, high degree of impurity
    C0: 9, C1: 1 – homogeneous, low degree of impurity
How to Measure Impurity?
• Given a data table that contains attributes and the class of each record, we can measure the homogeneity (or heterogeneity) of the table based on the classes.
• We say a table is pure or homogeneous if it contains only a single class.
• If a data table contains several classes, then we say that the table is impure or heterogeneous.

http://people.revoledu.com/kardi/tutorial/DecisionTree/how-to-measure-impurity.htm
How to Measure Impurity?
• There are several indices that measure the degree of impurity quantitatively.
• The most well-known indices are (see the sketch below):
– Entropy = - Σj pj log2(pj)
– Gini Index = 1 - Σj pj^2
– Misclassification error = 1 - max{pj}
• All of the above formulas are based on pj, the probability of class j.
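
A small sketch of the three measures, which can be checked against the worked 4B/3C/3T example on the following slides:

```python
# The three impurity measures, each over a list of class probabilities.
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def gini_index(probs):
    return 1.0 - sum(p * p for p in probs)

def misclassification_error(probs):
    return 1.0 - max(probs)

# Transportation-mode example: 4 Bus, 3 Car, 3 Train out of 10 records
probs = [0.4, 0.3, 0.3]
print(round(entropy(probs), 3))                   # 1.571
print(round(gini_index(probs), 3))                # 0.66
print(round(misclassification_error(probs), 2))   # 0.6
```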
How to Measure Impurity? - Example
• In our example, the class (Transportation mode) consists of three groups: Bus, Car, and Train. In this case we have 4 buses, 3 cars, and 3 trains (in short, we write 4B, 3C, 3T). The total data is 10 rows.

[Table: the 10-record example data with class Transportation mode.]
How to Measure Impurity? - Example
• Based on the data, we can compute the probability of each class. Since probability equals relative frequency, we have:
– Prob(Bus) = 4/10 = 0.4
– Prob(Car) = 3/10 = 0.3
– Prob(Train) = 3/10 = 0.3
• Observe that when we compute the probabilities, we only focus on the classes, not on the attributes. Having the probability of each class, we are now ready to compute the quantitative indices of impurity degree.
How to Measure Impurity? - Entropy
• One way to measure the impurity degree is entropy:

    Entropy = - Σj pj log2(pj)

• Example: given Prob(Bus) = 0.4, Prob(Car) = 0.3, and Prob(Train) = 0.3, we can compute the entropy as:

    Entropy = - 0.4 log2(0.4) - 0.3 log2(0.3) - 0.3 log2(0.3) = 1.571
How to Measure Impurity? - Entropy
• The entropy of a pure table (consisting of a single class) is zero, because the probability is 1 and log2(1) = 0.
• Entropy reaches its maximum value when all classes in the table have equal probability.
• The figure plots the maximum entropy for different numbers of classes n, where each probability equals p = 1/n.
• In this case, the maximum entropy equals -n*p*log2(p) = log2(n).
• Notice that the value of entropy is larger than 1 if the number of classes is more than 2.

[Figure: maximum entropy log2(n) as a function of the number of classes n.]
How to Measure Impurity? - Gini
• Another way to measure the impurity degree is the Gini index:

    Gini Index = 1 - Σj pj^2

• Example: given Prob(Bus) = 0.4, Prob(Car) = 0.3, and Prob(Train) = 0.3, we can compute the Gini index as:

    Gini Index = 1 - (0.4^2 + 0.3^2 + 0.3^2) = 0.660
How to Measure Impurity? - Gini
• The Gini index of a pure table (consisting of a single class) is zero, because the probability is 1 and 1 - 1^2 = 0.
• Similar to entropy, the Gini index also reaches its maximum value when all classes in the table have equal probability.
• The figure plots the maximum Gini index for different numbers of classes n, where each probability equals p = 1/n.
• Notice that the value of the Gini index is always between 0 and 1, regardless of the number of classes.

[Figure: maximum Gini index 1 - 1/n as a function of the number of classes n.]
How to Measure Impurity? – Misclassification Error
• Still another way to measure the impurity degree:

    Misclassification Error = 1 - max{pj}

• Example: given Prob(Bus) = 0.4, Prob(Car) = 0.3, and Prob(Train) = 0.3, we can compute the index as:

    Index = 1 - max{0.4, 0.3, 0.3} = 1 - 0.4 = 0.60
How to Measure Impurity? – Misclassification Error
• The misclassification error index of a pure table (consisting of a single class) is zero, because the probability is 1 and 1 - max{1} = 0.
• The value of the misclassification error index is always between 0 and 1.
• In fact, the maximum Gini index for a given number of classes always equals the maximum misclassification error index: for n classes with equal probabilities p = 1/n, the maximum Gini index is 1 - n*(1/n)^2 = 1 - 1/n, while the maximum misclassification error index is 1 - max{1/n} = 1 - 1/n.
Information Gain
• We compute impurity degrees both for the data table D and for the subset tables Si because we want to compare the impurity before the split (data table D) with the impurity after splitting the table according to the values of an attribute i (subset tables Si). The measure that compares these impurity degrees is called information gain: it tells us what we gain if we split the data table based on some attribute's values.
Information Gain - Example
• For example, in the parent table below we can compute the degree of impurity based on Transportation mode. In this case we have 4 Buses, 3 Cars, and 3 Trains (in short: 4B, 3C, 3T).

[Table: the 10-record parent table; its entropy, computed earlier, is 1.571.]
Information Gain - Example
• For example, we split using the Travel cost per km attribute and compute the degree of impurity of each subset.

[Figure: the table split by Travel cost per km into three subsets of 5, 2, and 3 records, with entropies 0.722, 0, and 0 respectively.]
Information Gain - Example
• Information gain is computed as the impurity degree of the parent table minus the weighted sum of the impurity degrees of the subset tables, where each weight is based on the number of records for the corresponding attribute value. Suppose we use entropy as the measurement of impurity degree; then we have (see the sketch below):
• Information gain(i) = Entropy of parent table D - Σk (nk/n * Entropy of each value k of subset table Si)
• The information gain of the attribute Travel cost per km is computed as 1.571 - (5/10 * 0.722 + 2/10 * 0 + 3/10 * 0) = 1.210
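
A sketch of this weighted-entropy computation that reproduces the Travel cost per km number. The subset class counts below are chosen to be consistent with the sizes and entropies quoted above (the exact value labels of the attribute are not shown in these slides):

```python
# Information gain = parent entropy - weighted sum of subset entropies.
from math import log2

def entropy_from_counts(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, subsets):
    n = sum(parent_counts)
    weighted = sum(sum(s) / n * entropy_from_counts(s) for s in subsets)
    return entropy_from_counts(parent_counts) - weighted

parent = [4, 3, 3]            # 4 Bus, 3 Car, 3 Train
# Three subsets of sizes 5, 2, 3 with entropies 0.722, 0, 0
# (class counts [4, 1], [2], [3] match those figures).
subsets = [[4, 1], [2], [3]]
print(round(information_gain(parent, subsets), 2))   # 1.21
```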
Information Gain - Example
• You can also compute the information gain based on the Gini index or the classification error in the same way. The results are given below.

[Table: information gain of Travel cost per km under entropy, Gini index, and classification error.]
Information Gain – Example
• Split using “Gender” attribute

[Figure: the table split by Gender and the resulting impurity and information-gain computations.]
Information Gain - Example
• Split using “Car ownership” attribute

[Figure: the table split by Car ownership and the resulting impurity and information-gain computations.]
Information Gain - Example
• Split using “Income Level” attribute

[Figure: the table split by Income Level and the resulting impurity and information-gain computations.]
Information Gain - Example
• The table below summarizes the information gain for all four attributes. In practice, you don't need to compute the impurity degree with all three methods; you can use any one of entropy, the Gini index, or the classification error index.
• Now we find the optimal attribute, i.e. the one that produces the maximum information gain (i* = argmax {information gain of attribute i}). In our case, Travel cost per km produces the maximum information gain.

[Table: information gain of Gender, Car ownership, Travel cost per km, and Income Level.]
Information Gain - Example
• So we split using the Travel cost per km attribute, as this produces the maximum information gain.
