
Data Mining

Lecture – 03
Classification: Definition
• Given a collection of records (the training set)
– Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of
the values of the other attributes.
• Goal: previously unseen records should be
assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and test sets,
with the training set used to build the model and the test set used to
validate it (see the sketch below).
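
This whole workflow (induction on a training set, then deduction on a test set, illustrated on the next slide) can be seen end to end in a few lines. A minimal sketch assuming scikit-learn is available, with a made-up toy dataset:

```python
# Minimal train/test classification workflow (a sketch; toy data).
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy records: [refund (0/1), taxable income in K]; class: cheat (0/1)
X = [[1, 125], [0, 100], [0, 70], [1, 120], [0, 95],
     [0, 60], [1, 220], [0, 85], [0, 75], [0, 90]]
y = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]

# Divide the given data set into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier()    # induction: build the model
model.fit(X_train, y_train)
y_pred = model.predict(X_test)      # deduction: apply the model
print("Test accuracy:", accuracy_score(y_test, y_pred))
```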
Illustrating Classification Task
Training Set:

    Tid  Attrib1  Attrib2  Attrib3  Class
    1    Yes      Large    125K     No
    2    No       Medium   100K     No
    3    No       Small    70K      No
    4    Yes      Medium   120K     No
    5    No       Large    95K      Yes
    6    No       Medium   60K      No
    7    Yes      Large    220K     No
    8    No       Small    85K      Yes
    9    No       Medium   75K      No
    10   No       Small    90K      Yes

Test Set:

    Tid  Attrib1  Attrib2  Attrib3  Class
    11   No       Small    55K      ?
    12   Yes      Medium   80K      ?
    13   Yes      Large    110K     ?
    14   No       Small    95K      ?
    15   No       Large    67K      ?

[Figure: the training set is fed to a learning algorithm, which learns a model (induction); the model is then applied to the test set to predict the unknown classes (deduction).]
Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Memory-based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
Example of a Decision Tree

Training Data:

    Tid  Refund  Marital Status  Taxable Income  Cheat
    1    Yes     Single          125K            No
    2    No      Married         100K            No
    3    No      Single          70K             No
    4    Yes     Married         120K            No
    5    No      Divorced        95K             Yes
    6    No      Married         60K             No
    7    Yes     Divorced        220K            No
    8    No      Single          85K             Yes
    9    No      Married         75K             No
    10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

    Refund?
    ├─ Yes → NO
    └─ No → MarSt?
        ├─ Single, Divorced → TaxInc?
        │    ├─ < 80K → NO
        │    └─ > 80K → YES
        └─ Married → NO
Another Example of Decision Tree

The same training data (Tids 1–10 above) is also fitted by this tree:

    MarSt?
    ├─ Married → NO
    └─ Single, Divorced → Refund?
        ├─ Yes → NO
        └─ No → TaxInc?
             ├─ < 80K → NO
             └─ > 80K → YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task
[Figure: the same framework as before – the training set (Tids 1–10) is fed to a tree induction algorithm to learn a model (induction), and the resulting decision tree is applied to the test set (Tids 11–15) to predict the unknown classes (deduction).]
Apply Model to Test Data
Test Data:

    Refund  Marital Status  Taxable Income  Cheat
    No      Married         80K             ?

Start from the root of the tree and follow the branch that matches the record at each node:

    Refund?
    ├─ Yes → NO
    └─ No → MarSt?
        ├─ Single, Divorced → TaxInc?
        │    ├─ < 80K → NO
        │    └─ > 80K → YES
        └─ Married → NO

1. Refund = No, so take the "No" branch to the MarSt node.
2. Marital Status = Married, so take the "Married" branch, which is a leaf labeled NO.

Assign Cheat to "No".
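
Once learned, the tree is just nested conditionals. A minimal hand-coded sketch of this particular model (function and argument names are ours, and how the 80K boundary is handled is an assumption):

```python
# Hand-coded version of the example decision tree above (a sketch of
# the learned model, not of the induction algorithm itself).
def classify(refund, marital_status, taxable_income_k):
    if refund == "Yes":
        return "No"                               # Refund=Yes leaf: NO
    if marital_status == "Married":
        return "No"                               # Married leaf: NO
    # Single or Divorced: test Taxable Income (boundary convention assumed)
    return "No" if taxable_income_k < 80 else "Yes"

# The test record from the slide: Refund=No, Married, 80K
print(classify("No", "Married", 80))              # -> No
```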
Decision Tree Induction
• Many Algorithms:
1. Hunt’s Algorithm (one of the earliest)
2. CART (Classification And Regression Tree)
3. ID3 (Iterative Dichotomiser 3)
4. C4.5 (Successor of ID3)
5. SLIQ (does not require loading the entire dataset into main memory)
6. SPRINT (takes a similar approach to SLIQ; induces decision trees relatively quickly)
7. CHAID (CHi-squared Automatic Interaction Detector): performs multi-level splits when computing classification trees.
8. MARS (Multivariate Adaptive Regression Splines): extends decision trees to handle numerical data better.
9. Conditional Inference Trees: a statistics-based approach that uses non-parametric tests as splitting criteria, corrected for multiple testing to avoid overfitting.
General Structure of Hunt’s Algorithm
• Let Dt be the set of training records that reach a node t.
• General Procedure:
– If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
– If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
– If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset (see the sketch below).

The running example is the Refund / Marital Status / Taxable Income / Cheat training data shown earlier (Tids 1–10).
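
A minimal Python sketch of this recursive procedure. The attribute-selection rule below is a deliberate placeholder (it just picks the first attribute that still varies); a real implementation would choose the split using an impurity measure, as discussed later. All function and field names are ours:

```python
# Sketch of Hunt's algorithm over records of the form (attrs_dict, label).
def choose_test(records):
    """Placeholder split selection: the first attribute with >1 value."""
    for a in records[0][0]:
        if len({r[0][a] for r in records}) > 1:
            parts = {}
            for r in records:
                parts.setdefault(r[0][a], []).append(r)
            return a, parts
    return None, {}

def hunt(records, default_class):
    if not records:                      # empty D_t: leaf labeled y_d
        return {"leaf": default_class}
    labels = [label for _, label in records]
    if len(set(labels)) == 1:            # pure D_t: leaf labeled y_t
        return {"leaf": labels[0]}
    attr, parts = choose_test(records)
    majority = max(set(labels), key=labels.count)
    if attr is None:                     # no attribute left to test
        return {"leaf": majority}
    return {"test": attr,
            "children": {v: hunt(sub, majority)
                         for v, sub in parts.items()}}

data = [({"Refund": "Yes", "Marital": "Single"}, "No"),
        ({"Refund": "No", "Marital": "Married"}, "No"),
        ({"Refund": "No", "Marital": "Single"}, "Yes")]
print(hunt(data, default_class="No"))
```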
Hunt’s Algorithm
Applied to this training data (with default class "Don't Cheat"), the tree grows in steps:

Step 1 – all records reach the root; it is labeled with the default class:

    Don't Cheat

Step 2 – split on Refund:

    Refund?
    ├─ Yes → Don't Cheat
    └─ No  → Don't Cheat

Step 3 – refine the Refund = No branch by splitting on Marital Status:

    Refund?
    ├─ Yes → Don't Cheat
    └─ No  → Marital Status?
         ├─ Single, Divorced → Cheat
         └─ Married → Don't Cheat

Step 4 – refine the Single, Divorced branch by splitting on Taxable Income:

    Refund?
    ├─ Yes → Don't Cheat
    └─ No  → Marital Status?
         ├─ Single, Divorced → Taxable Income?
         │    ├─ < 80K  → Don't Cheat
         │    └─ >= 80K → Cheat
         └─ Married → Don't Cheat
Evaluation of a Classifier
• How predictive is the model we learned?
– Which performance measure to use?
• Natural performance measure for classification
problems: error rate on a test set
– Success: instance’s class is predicted correctly
– Error: instance’s class is predicted incorrectly
– Error rate: proportion of errors made over the whole set of
instances
– Accuracy: proportion of correctly classified instances over
the whole set of instances
accuracy = 1 – error rate
Confusion Matrix
• A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.

                          PREDICTED CLASS
                          Class = Yes    Class = No
    ACTUAL   Class = Yes      a              b
    CLASS    Class = No       c              d

    a: TP (true positive)     b: FN (false negative)
    c: FP (false positive)    d: TN (true negative)
Confusion Matrix - Example
• What can we learn from this matrix?

                          PREDICTED CLASS
                          Class = No     Class = Yes
    ACTUAL   Class = No     TN = 50        FP = 10      (60)
    CLASS    Class = Yes    FN = 5         TP = 100     (105)
                            (55)           (110)        n = 165

– There are two possible predicted classes: "yes" and "no". If we were predicting the presence of a disease, for example, "yes" would mean they have the disease, and "no" would mean they don't.
– The classifier made a total of 165 predictions (e.g., 165 patients were being tested for the presence of that disease).
– Out of those 165 cases, the classifier predicted "yes" 110 times and "no" 55 times.
– In reality, 105 patients in the sample have the disease, and 60 patients do not.
Confusion Matrix – Confusion?
• False positives are actually negatives.
• False negatives are actually positives.
Confusion Matrix - Example
• Let's now define the most basic terms, which are whole numbers (not rates):
– true positives (TP): cases in which we predicted yes (they have the disease), and they do have the disease.
– true negatives (TN): we predicted no, and they don't have the disease.
– false positives (FP): we predicted yes, but they don't actually have the disease. (Also known as a "Type I error.")
– false negatives (FN): we predicted no, but they actually do have the disease. (Also known as a "Type II error.")
Confusion Matrix - Computations
• This is a list of rates that are often computed from a confusion matrix:
• Accuracy: Overall, how often is the classifier correct?
  (TP+TN)/total = (100+50)/165 = 0.91
• Misclassification Rate: Overall, how often is it wrong?
  (FP+FN)/total = (10+5)/165 = 0.09
  (equivalent to 1 minus Accuracy; also known as "Error Rate")
• True Positive Rate: When it's actually yes, how often does it predict yes?
  TP/actual yes = 100/105 = 0.95
  (also known as "Sensitivity" or "Recall")
• False Positive Rate: When it's actually no, how often does it predict yes?
  FP/actual no = 10/60 = 0.17
Confusion Matrix - Computations
• Specificity: When it's actually no, how often does it predict no?
  TN/actual no = 50/60 = 0.83
  (equivalent to 1 minus the False Positive Rate)
• Precision: When it predicts yes, how often is it correct?
  TP/predicted yes = 100/110 = 0.91
• Prevalence: How often does the yes condition actually occur in our sample?
  actual yes/total = 105/165 = 0.64
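
All of these rates follow mechanically from the four counts; a small sketch that reproduces the numbers above:

```python
# Rates from the worked confusion-matrix example (TP=100, TN=50,
# FP=10, FN=5; 165 predictions in total).
TP, TN, FP, FN = 100, 50, 10, 5
total = TP + TN + FP + FN                 # 165

accuracy    = (TP + TN) / total           # 150/165 ~ 0.91
error_rate  = (FP + FN) / total           # 15/165 ~ 0.09 = 1 - accuracy
tpr         = TP / (TP + FN)              # sensitivity/recall: 100/105 ~ 0.95
fpr         = FP / (FP + TN)              # 10/60 ~ 0.17
specificity = TN / (FP + TN)              # 50/60 ~ 0.83 = 1 - fpr
precision   = TP / (TP + FP)              # 100/110 ~ 0.91
prevalence  = (TP + FN) / total           # 105/165 ~ 0.64

for name, value in [("accuracy", accuracy), ("error rate", error_rate),
                    ("TPR", tpr), ("FPR", fpr), ("specificity", specificity),
                    ("precision", precision), ("prevalence", prevalence)]:
    print(f"{name}: {value:.2f}")
```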
Confusion Matrix – Example 2
• Imagine that you have a dataset that consists of 33 patterns that are 'Spam' (S) and 67 patterns that are 'Non-Spam' (NS).
• Of the 33 'Spam' patterns, 27 were correctly predicted as 'Spam' while 6 were incorrectly predicted as 'Non-Spam'.
• On the other hand, of the 67 'Non-Spam' patterns, 57 were correctly predicted as 'Non-Spam' while 10 were incorrectly classified as 'Spam'.

http://aimotion.blogspot.com/2010/08/tools-for-machine-learning-performance.html
Confusion Matrix – Example 2
• Accuracy = (TP+TN)/total = (27+57)/100 = 84%
• Misclassification Rate = (FP+FN)/total = (6+10)/100 = 16%
• True Positive Rate = TP/actual yes = 27/33 ≈ 0.82
• False Positive Rate = FP/actual no = 10/67 ≈ 0.15

http://www.marcovanetti.com/pages/cfmatrix/?noc=1
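
In practice such a matrix is usually produced by a library call rather than by hand. A sketch with scikit-learn, using synthetic label vectors constructed to match the counts above:

```python
# Building the spam/non-spam confusion matrix with scikit-learn.
# The label vectors are made up to match the example's counts
# (TP=27, FN=6, FP=10, TN=57).
from sklearn.metrics import confusion_matrix

y_true = ["S"] * 33 + ["NS"] * 67
y_pred = (["S"] * 27 + ["NS"] * 6        # 27 spams caught, 6 missed
          + ["S"] * 10 + ["NS"] * 57)    # 10 false alarms, 57 correct
print(confusion_matrix(y_true, y_pred, labels=["S", "NS"]))
# [[27  6]
#  [10 57]]
```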
Tree Induction
• Greedy strategy:
– Split the records based on an attribute test that optimizes a certain criterion.

• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
How to Specify Test Condition?
• Depends on attribute types
– Nominal
– Ordinal
– Continuous

• Depends on number of ways to split


– 2-way split
– Multi-way split
Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as distinct values.

    CarType?
    ├─ Family
    ├─ Sports
    └─ Luxury

• Binary split: divides the values into two subsets; need to find the optimal partitioning (the candidates can be enumerated, as in the sketch below).

    {Sports, Luxury} vs. {Family}    OR    {Family, Luxury} vs. {Sports}
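
For a nominal attribute with k distinct values there are 2^(k-1) - 1 candidate binary splits. A quick sketch that enumerates them (the helper name is ours):

```python
# Enumerate all distinct binary splits of a nominal attribute's values.
# For k values there are 2**(k - 1) - 1 such splits.
from itertools import combinations

def binary_splits(values):
    values = sorted(values)
    anchor, rest = values[0], values[1:]   # fixing one value avoids mirrored duplicates
    for r in range(len(rest)):
        for combo in combinations(rest, r):
            left = {anchor, *combo}
            yield left, set(values) - left

for left, right in binary_splits(["Family", "Sports", "Luxury"]):
    print(sorted(left), "vs.", sorted(right))
# ['Family'] vs. ['Luxury', 'Sports']
# ['Family', 'Luxury'] vs. ['Sports']
# ['Family', 'Sports'] vs. ['Luxury']
```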
Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as distinct values.

    Size?
    ├─ Small
    ├─ Medium
    └─ Large

• Binary split: divides the values into two subsets; need to find the optimal partitioning.

    {Small, Medium} vs. {Large}    OR    {Medium, Large} vs. {Small}

• What about the split {Small, Large} vs. {Medium}? (It groups non-adjacent values and therefore violates the ordering of the attribute.)
Splitting Based on Continuous Attributes
• Different ways of handling:
– Discretization to form an ordinal categorical attribute
  • Static – discretize once at the beginning
  • Dynamic – ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.
– Binary decision: (A < v) or (A ≥ v)
  • consider all possible splits and find the best cut (see the sketch below)
  • can be more compute-intensive
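
A minimal sketch of the exhaustive best-cut search for a single continuous attribute, scoring each candidate threshold with the weighted Gini index of the two partitions (all names are ours; using Gini here is our choice of criterion):

```python
# Find the best binary cut (A < v) vs. (A >= v) for one continuous
# attribute: try midpoints between consecutive distinct sorted values
# and keep the threshold with the lowest weighted Gini index.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels)) if n else 0.0

def best_cut(values, labels):
    pairs = sorted(zip(values, labels))
    best_v, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                              # no cut between equal values
        v = (pairs[i - 1][0] + pairs[i][0]) / 2   # candidate threshold
        left = [l for x, l in pairs if x < v]
        right = [l for x, l in pairs if x >= v]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_v, best_score = v, score
    return best_v, best_score

# Taxable income vs. Cheat from the earlier training data (in K)
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_cut(incomes, cheat))   # -> (97.5, 0.3) on this data
```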
Splitting Based on Continuous Attributes

(i) Binary split:

    Taxable Income > 80K?
    ├─ Yes
    └─ No

(ii) Multi-way split:

    Taxable Income?
    ├─ < 10K
    ├─ [10K, 25K)
    ├─ [25K, 50K)
    ├─ [50K, 80K)
    └─ > 80K
How to determine the Best Split
Before splitting: 10 records of class C0 and 10 records of class C1.

Three candidate test conditions:

    Own Car?
    ├─ Yes: C0: 6, C1: 4
    └─ No:  C0: 4, C1: 6

    Car Type?
    ├─ Family: C0: 1, C1: 3
    ├─ Sports: C0: 8, C1: 0
    └─ Luxury: C0: 1, C1: 7

    Student ID?
    ├─ c1 … c10: one record each (C0: 1, C1: 0)
    └─ c11 … c20: one record each (C0: 0, C1: 1)

Which test condition is the best?
How to determine the Best Split
• Greedy approach:
– Nodes with a homogeneous class distribution are preferred
• Need a measure of node impurity:

    C0: 5, C1: 5 – non-homogeneous, high degree of impurity
    C0: 9, C1: 1 – homogeneous, low degree of impurity
How to Measure Impurity?
• Given a data table that contains attributes and the class of each record, we can measure the homogeneity (or heterogeneity) of the table based on the classes.
• We say a table is pure or homogeneous if it contains only a single class.
• If a data table contains several classes, then we say that the table is impure or heterogeneous.

http://people.revoledu.com/kardi/tutorial/DecisionTree/how-to-measure-impurity.htm
How to Measure Impurity?
• There are several indices that measure the degree of impurity quantitatively.
• The most well-known indices are (see the sketch below):
– Entropy = - Σj pj log2(pj)
– Gini Index = 1 - Σj pj^2
– Misclassification error = 1 - max{pj}
• All of the above formulas are based on pj, the probability of class j.
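
A small sketch of the three measures, which can be checked against the worked 4B/3C/3T example on the following slides:

```python
# The three impurity measures, each over a list of class probabilities.
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def gini_index(probs):
    return 1.0 - sum(p * p for p in probs)

def misclassification_error(probs):
    return 1.0 - max(probs)

# Transportation-mode example: 4 Bus, 3 Car, 3 Train out of 10 records
probs = [0.4, 0.3, 0.3]
print(round(entropy(probs), 3))                   # 1.571
print(round(gini_index(probs), 3))                # 0.66
print(round(misclassification_error(probs), 2))   # 0.6
```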
How to Measure Impurity? - Example
• In our example, the class (Transportation mode) consists of three groups: Bus, Car, and Train. In this case we have 4 buses, 3 cars, and 3 trains (in short, we write 4B, 3C, 3T). The total data is 10 rows.

[Table: the 10-record example data with class Transportation mode.]
How to Measure Impurity? - Example
• Based on the data, we can compute the probability of each class. Since probability equals relative frequency, we have:
– Prob(Bus) = 4/10 = 0.4
– Prob(Car) = 3/10 = 0.3
– Prob(Train) = 3/10 = 0.3
• Observe that when we compute the probabilities, we only focus on the classes, not on the attributes. Having the probability of each class, we are now ready to compute the quantitative indices of impurity degree.
How to Measure Impurity? - Entropy
• One way to measure the impurity degree is entropy:

    Entropy = - Σj pj log2(pj)

• Example: given Prob(Bus) = 0.4, Prob(Car) = 0.3, and Prob(Train) = 0.3, we can compute the entropy as:

    Entropy = - 0.4 log2(0.4) - 0.3 log2(0.3) - 0.3 log2(0.3) = 1.571
How to Measure Impurity? - Entropy
• The entropy of a pure table (consisting of a single class) is zero, because the probability is 1 and log2(1) = 0.
• Entropy reaches its maximum value when all classes in the table have equal probability.
• The figure plots the maximum entropy for different numbers of classes n, where each probability equals p = 1/n.
• In this case, the maximum entropy equals -n*p*log2(p) = log2(n).
• Notice that the value of entropy is larger than 1 if the number of classes is more than 2.

[Figure: maximum entropy log2(n) as a function of the number of classes n.]
How to Measure Impurity? - Gini
• Another way to measure the impurity degree is the Gini index:

    Gini Index = 1 - Σj pj^2

• Example: given Prob(Bus) = 0.4, Prob(Car) = 0.3, and Prob(Train) = 0.3, we can compute the Gini index as:

    Gini Index = 1 - (0.4^2 + 0.3^2 + 0.3^2) = 0.660
How to Measure Impurity? - Gini
• The Gini index of a pure table (consisting of a single class) is zero, because the probability is 1 and 1 - 1^2 = 0.
• Similar to entropy, the Gini index also reaches its maximum value when all classes in the table have equal probability.
• The figure plots the maximum Gini index for different numbers of classes n, where each probability equals p = 1/n.
• Notice that the value of the Gini index is always between 0 and 1, regardless of the number of classes.

[Figure: maximum Gini index 1 - 1/n as a function of the number of classes n.]
How to Measure Impurity? – Misclassification Error
• Still another way to measure the impurity degree:

    Misclassification Error = 1 - max{pj}

• Example: given Prob(Bus) = 0.4, Prob(Car) = 0.3, and Prob(Train) = 0.3, we can compute the index as:

    Index = 1 - max{0.4, 0.3, 0.3} = 1 - 0.4 = 0.60
How to Measure Impurity? – Misclassification Error
• The misclassification error index of a pure table (consisting of a single class) is zero, because the probability is 1 and 1 - max{1} = 0.
• The value of the misclassification error index is always between 0 and 1.
• In fact, the maximum Gini index for a given number of classes always equals the maximum misclassification error index: for n classes with equal probabilities p = 1/n, the maximum Gini index is 1 - n*(1/n)^2 = 1 - 1/n, while the maximum misclassification error index is 1 - max{1/n} = 1 - 1/n.
Information Gain
• We compute impurity degrees both for the data table D and for the subset tables Si because we want to compare the impurity before the split (data table D) with the impurity after splitting the table according to the values of an attribute i (subset tables Si). The measure that compares these impurity degrees is called information gain: it tells us what we gain if we split the data table based on some attribute's values.
Information Gain - Example
• For example, in the parent table below we can compute the degree of impurity based on Transportation mode. In this case we have 4 Buses, 3 Cars, and 3 Trains (in short: 4B, 3C, 3T).

[Table: the 10-record parent table; its entropy, computed earlier, is 1.571.]
Information Gain - Example
• For example, we split using the Travel cost per km attribute and compute the degree of impurity of each subset.

[Figure: the table split by Travel cost per km into three subsets of 5, 2, and 3 records, with entropies 0.722, 0, and 0 respectively.]
Information Gain - Example
• Information gain is computed as the impurity degree of the parent table minus the weighted sum of the impurity degrees of the subset tables, where each weight is based on the number of records for the corresponding attribute value. Suppose we use entropy as the measurement of impurity degree; then we have (see the sketch below):
• Information gain(i) = Entropy of parent table D - Σk (nk/n * Entropy of each value k of subset table Si)
• The information gain of the attribute Travel cost per km is computed as 1.571 - (5/10 * 0.722 + 2/10 * 0 + 3/10 * 0) = 1.210
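
A sketch of this weighted-entropy computation that reproduces the Travel cost per km number. The subset class counts below are chosen to be consistent with the sizes and entropies quoted above (the exact value labels of the attribute are not shown in these slides):

```python
# Information gain = parent entropy - weighted sum of subset entropies.
from math import log2

def entropy_from_counts(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, subsets):
    n = sum(parent_counts)
    weighted = sum(sum(s) / n * entropy_from_counts(s) for s in subsets)
    return entropy_from_counts(parent_counts) - weighted

parent = [4, 3, 3]            # 4 Bus, 3 Car, 3 Train
# Three subsets of sizes 5, 2, 3 with entropies 0.722, 0, 0
# (class counts [4, 1], [2], [3] match those figures).
subsets = [[4, 1], [2], [3]]
print(round(information_gain(parent, subsets), 2))   # 1.21
```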
Information Gain - Example
• You can also compute the information gain based on the Gini index or the classification error in the same way. The results are given below.

[Table: information gain of Travel cost per km under entropy, Gini index, and classification error.]
Information Gain – Example
• Split using “Gender” attribute

[Figure: the table split by Gender and the resulting impurity and information-gain computations.]
Information Gain - Example
• Split using “Car ownership” attribute

[Figure: the table split by Car ownership and the resulting impurity and information-gain computations.]
Information Gain - Example
• Split using “Income Level” attribute

[Figure: the table split by Income Level and the resulting impurity and information-gain computations.]
Information Gain - Example
• The table below summarizes the information gain for all four attributes. In practice, you don't need to compute the impurity degree with all three methods; you can use any one of entropy, the Gini index, or the classification error index.
• Now we find the optimal attribute, i.e. the one that produces the maximum information gain (i* = argmax {information gain of attribute i}). In our case, Travel cost per km produces the maximum information gain.

[Table: information gain of Gender, Car ownership, Travel cost per km, and Income Level.]
Information Gain - Example
• So we split using the Travel cost per km attribute, as this produces the maximum information gain.
