
Outline
• Recap: Last week – Data and Data Exploration
• Classification Concept Review
• Classification Technique: Decision Tree


Recap: Data and Data exploration


• Topic 1: What is data
• Topic 2: Why Data Exploration and Data Preprocessing?
• Topic 3: Basic Techniques for Data Exploration
• Topic 4: Data Pre-processing


Topic 1: What is data


• Definitions of Data
– Objects (record, point, case, sample, entity, or instance)
– Attributes (variable, field, characteristic, or feature)
– Types of Attributes: Nominal, Ordinal, Interval, Ratio

• Types of Data Sets


– Record (Data Matrix, Document Data, Transaction Data)
– Graph (World Wide Web, Social Networks, Molecular Structures)
– Ordered (Spatial Data, Temporal Data, Sequential Data, Genetic Sequence Data)

• Data Quality Problems
– Noise and outliers
– Missing values
– Duplicate data

Question: Can a data set belong to two or more of these data set types?
• Characteristics of Structured Data
– Dimensionality
– Sparsity
– …


Topic 2: Why Data Exploration and Data Pre-processing

[Flow diagram: Get Data → Exploratory Data Analysis → Preprocessing → Data Mining]

Garbage in, garbage out: exploratory analysis and preprocessing solve data quality problems for data mining.

Why Data Exploration and Data Pre-processing
• Garbage in, garbage out
• Most models assume numerical columns
• Does it make sense if I put this record into the model?
• Does it corrupt my program?
• What are the aspects of Data Exploration and Data Pre-processing?
– Missing values
– Bad data (e.g., noise) or outliers
– Duplicates
– Data Cleaning
– Aggregation
– Sampling
– Dimensionality Reduction
– Attribute Transformation
– Measure of Similarity & Dissimilarity
– …
These steps solve data quality problems for data mining / machine learning.

What is Data Exploration


• What is Data Exploration?
– Goal: to understand data
– Also named Exploratory Data Analysis (EDA)
– A preliminary exploration of the data to better understand its
characteristics.

• Key motivations of data exploration include


– Helping to select the right tool for pre-processing and data mining
– Making use of humans’ abilities to recognize patterns
– People can recognize patterns not captured by data analysis tools

• We focus on two main categories of techniques for data


exploration
– Summary statistics
– Visualization


Topic 3:
Data Exploration Techniques

-Summary statistics
– Location, Spread, Frequency, Mode, Quartiles,
Percentiles, etc.

-Visualization
– Histogram, Box plots, Scatter plots, Star plots, etc.


Topic 4: Data Pre-processing


• Data Cleaning
• How to handle missing values (Delete/Eliminate; Fill in/Imputation)
• Resolve redundancy caused by data integration
• Bad data

• Aggregation (Reduce the number of attributes or objects)


• Sampling
• Sampling is the main technique used for data selection
• The key principle for effective sampling: sample is representative
• Type of sampling

• Dimensionality Reduction
• Purpose: avoid the Curse of Dimensionality
• Filter out redundant features and irrelevant features
• Feature Extraction and Feature Selection (e.g., PCA, SVD)

• Discretization and Binarization


• Attribute Transformation and Normalization
• Measure of Similarity & Dissimilarity
• Euclidean Distance (L2)
• City Block Distance (L1)
• Simple Matching Coefficient (SMC) and Jaccard Coefficient between binary vectors
• Cosine Similarity

• Correlation


Curse of Dimensionality

• Exponential growth of the number of regions with dimension causes high sparsity in the space for the same dataset!
• This sparsity is problematic for any method that requires statistical significance.

[Figure: the same dataset spread over 1D, 2D, 3D, and 4D grids; the number of regions grows as 4, 4², 4³, 4⁴ = 256.]

For the same size of dataset:
– Low-dimensional space: each region contains some data points
– Higher-dimensional space: some regions contain no data points
– Very high-dimensional space: most regions contain no data points

With no data points in most regions, it is hard to solve the problem!

What Is Principal Component Analysis (PCA)?

• PCA is a statistical procedure that converts a set of observations of possibly correlated variables into a set of values of uncorrelated variables.

• PCA is one way to perform dimensionality reduction.

• PCA is one way to transform data.

• PCA is a method that can help us find the important dimensions (principal components).

• Each principal component is chosen so that it describes most of the still-available variance, and all principal components are orthogonal to each other.

• Among all principal components, the first principal component has the maximum variance.
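To make this concrete, here is a minimal PCA sketch in Python, assuming NumPy and scikit-learn are available; the correlated data and all variable names are illustrative, not part of the lecture.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a, 0.9 * a + 0.1 * rng.normal(size=200)])  # two correlated variables

pca = PCA(n_components=2)
Z = pca.fit_transform(X)               # uncorrelated values along PC1 and PC2
print(pca.explained_variance_ratio_)   # PC1 carries most of the variance

Z1 = Z[:, :1]                          # keep only PC1: 2 dimensions reduced to 1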


What Are the New Axes?

[Figure: data plotted over original variables/dimensions A and B; the variance explained is 85% along PC1 and 15% along PC2.]

• Orthogonal directions of greatest variance in the data

• Projections along PC1 discriminate the data most along any one axis

We may select the first PC (PC1) and reduce the 2 dimensions to 1 dimension if needed, as most of the variance (85%) is covered by PC1.
Principal Component Analysis (PCA) for Dimension Reduction

[Figure: two example datasets (Case 1 and Case 2) projected onto PC1, the new Dimension 1.]

• Reduce from 2 dimensions to 1 dimension, as most of the variance is covered by PC1.
• PC1 is a new dimension; its unit is neither inches nor cm. This is a disadvantage of the PCA method: the meaning of the original dimensions is lost.

Outline
• Recap: Last week – Data and Data Exploration
• Classification Concept Review
• Classification Technique: Decision Tree


Classification: Definition
• Given a collection of records (training set)
– Each record contains a set of attributes, one of the
attributes is the class.
• Find a model for class attribute as a function of the
values of other attributes.
• Goal: previously unseen records should be assigned a
class as accurately as possible.

A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set is used to validate or test it.
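A hedged sketch of this train/test methodology in Python (scikit-learn assumed; the iris dataset and the 70/30 split are my own illustrative choices, not from the slides):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
# Divide the given data set into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # build on the training set
print(accuracy_score(y_test, model.predict(X_test)))                  # accuracy on unseen records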


Classification Task Example


Classification Examples

1. Predicting tumor cells as benign or malignant


2. Classifying credit card transactions as legitimate or
fraudulent
3. Classifying secondary structures of protein as alpha-
helix, beta-sheet, or random coil in biology
4. Categorizing news stories as finance, weather,
entertainment, sports, etc.


Outline
• Recap: Last week – Data and Data Exploration
• Classification Concept Review
• Classification Technique: Decision Tree


What is “Decision Tree”


• A decision tree builds classification or regression models in the form of a tree structure.
• It breaks down a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed.
• The final result is a tree with decision nodes and leaf nodes.

Source: http://dataaspirant.com/2017/01/30/how-decision-tree-algorithm-works/


Definition of Decision Tree


• “A decision tree is a decision support tool that uses a
tree-like model of decisions and their possible
consequences, including chance event outcomes,
resource costs, and utility.”


Important Terminology related to Decision Trees

https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html


Important Terminology related to Decision Trees
1. Root Node: represents the entire population or sample; it further gets divided into two or more homogeneous sets.
2. Splitting: the process of dividing a node into two or more sub-nodes.
3. Decision Node: when a sub-node splits into further sub-nodes, it is called a decision node.
4. Leaf / Terminal Node: nodes that do not split are called leaf or terminal nodes.
5. Parent and Child Node: a node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are the children of the parent node.
6. Branch / Sub-Tree: a subsection of the entire tree is called a branch or sub-tree.
7. Pruning: removing sub-nodes of a decision node is called pruning; it can be seen as the opposite of splitting.


Example of Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc):

Refund?
├─ Yes → NO
└─ No → MarSt?
    ├─ Single, Divorced → TaxInc?
    │   ├─ < 80K → NO
    │   └─ > 80K → YES
    └─ Married → NO
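One possible way to reproduce this example in Python (pandas and scikit-learn assumed; the one-hot encoding and all variable names are my own illustrative choices, and the learned tree may differ from the one drawn above):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "Refund": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "MarSt":  ["Single", "Married", "Single", "Married", "Divorced",
               "Married", "Divorced", "Single", "Married", "Single"],
    "TaxInc": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
    "Cheat":  ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})
X = pd.get_dummies(df[["Refund", "MarSt", "TaxInc"]])   # one-hot encode the categorical attributes
clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, df["Cheat"])

test = pd.get_dummies(pd.DataFrame({"Refund": ["No"], "MarSt": ["Married"], "TaxInc": [80]}))
test = test.reindex(columns=X.columns, fill_value=0)    # align columns with the training matrix
print(clf.predict(test))   # likely ['No']: every Married record in the training data has Cheat = No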


Training Data → Model

The same training data (Tid 1–10 above) also fits this alternative tree:

MarSt?
├─ Married → NO
└─ Single, Divorced → Refund?
    ├─ Yes → NO
    └─ No → TaxInc?
        ├─ < 80K → NO
        └─ > 80K → YES

There could be more than one tree that fits the same data!

Apply Model to Test Data

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the test record down the branches:

Refund?
├─ Yes → NO
└─ No → MarSt?
    ├─ Single, Divorced → TaxInc?
    │   ├─ < 80K → NO
    │   └─ > 80K → YES
    └─ Married → NO



Apply Model to Test Data (Cont)

Following the path for the test record: Refund = No → MarSt = Married → NO.

Assign Cheat to “No”

Decision Tree for Classification Task


Growing a Tree
1. Features to choose

2. Conditions for splitting

3. Knowing when to stop

4. Pruning


Decision Tree Induction


• Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART*
– ID3*
– C4.5*
– SLIQ, SPRINT*


What is “Hunt’s Algorithm”?

• Hunt’s Algorithm is the basis of many existing decision tree induction algorithms.
• A decision tree is grown in a recursive fashion by successively partitioning the training records into purer subsets.
– Recursion is when a function calls itself.
– Iteration is when a loop executes repeatedly until the controlling condition becomes false.
– The primary difference is that recursion is a process applied to a function that calls itself, whereas iteration applies a set of instructions repeatedly within a loop.


General Structure of Hunt’s Algorithm

• Let Dt be the set of training records that reach a node t
• General Procedure:
– If Dt contains records that belong to the same class yt, then t is a leaf node labeled as yt
– If Dt is an empty set, then t is a leaf node labeled by the default class yd
– If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the above procedure to each subset.

(Training data: the Tid 1–10 Refund / Marital Status / Taxable Income / Cheat table above.)
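A compact Python sketch of this recursive procedure (illustrative only; a real induction algorithm would choose the attribute test that optimizes an impurity criterion rather than taking attributes in order):

from collections import Counter

def hunt(records, labels, attributes, default):
    """records: list of dicts; labels: one class label per record."""
    if not records:                          # empty D_t -> leaf with the default class y_d
        return default
    if len(set(labels)) == 1:                # all records in the same class y_t -> leaf
        return labels[0]
    if not attributes:                       # no tests left -> majority class
        return Counter(labels).most_common(1)[0][0]
    attr, rest = attributes[0], attributes[1:]
    majority = Counter(labels).most_common(1)[0][0]
    tree = {attr: {}}
    for value in set(r[attr] for r in records):   # attribute test splits D_t into subsets
        pairs = [(r, l) for r, l in zip(records, labels) if r[attr] == value]
        sub_r = [p[0] for p in pairs]
        sub_l = [p[1] for p in pairs]
        tree[attr][value] = hunt(sub_r, sub_l, rest, majority)   # recurse on each subset
    return tree

# Example call: hunt(records, labels, ["Refund", "MarSt"], default="No")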


Hunt’s Algorithm Example

Step 1: split on Refund (training data as above):

Refund?
├─ Yes → Cheat = No          (purely contains the label “No”)
└─ No  → Cheat = {Yes, No}   (mix of “Yes” and “No” labels, so we recursively split)

Hunt’s Algorithm Example (Cont)

Step 2: under Refund = No, split on Marital Status:

Refund?
├─ Yes → Cheat = No
└─ No  → Marital Status?
    ├─ Married → Cheat = No
    └─ Single, Divorced → Cheat = {Yes, No}   (still mixed, so split again)

Hunt’s Algorithm Example (Cont)

Step 3: under Marital Status = Single or Divorced, split on Taxable Income:

Refund?
├─ Yes → Cheat = No
└─ No  → Marital Status?
    ├─ Married → Cheat = No
    └─ Single, Divorced → Taxable Income?
        ├─ < 80K  → Cheat = No
        └─ >= 80K → Cheat = Yes

All leaf nodes are now pure, so the recursion stops.

Greedy Strategy for Tree Induction

– Split the records based on an attribute test that optimizes a certain criterion
– Split such that we get the most homogeneous leaf nodes


Tree Induction

• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting


How to Specify Test Condition?


• Depends on attribute types
– Discrete types, such as Nominal and Ordinal
– Continuous types, such as Interval or Ratio

• Depends on number of ways to split


– 2-way split
– Multi-way split


Splitting Based on Nominal Attributes


• Multi-way split: use as many partitions as distinct values.
  CarType → {Family} | {Sports} | {Luxury}

• Binary split: divides values into two subsets; need to find the optimal partitioning.
  CarType → {Sports, Luxury} | {Family}    or    CarType → {Family, Luxury} | {Sports}

Splitting Based on Ordinal Attributes


• Multi-way split: use as many partitions as distinct values.
  Size → {Small} | {Medium} | {Large}

• Binary split: divides values into two subsets; need to find the optimal partitioning.
  Size → {Small, Medium} | {Large}    or    Size → {Medium, Large} | {Small}

Splitting Based on Ordinal Attributes (Cont)

• What about this split? Size → {Small, Large} | {Medium}
  This grouping merges non-adjacent values, so it breaks the order of the ordinal attribute.

Splitting Based on Continuous Attributes


• Different ways of handling
– Discretization to form an ordinal categorical attribute
• Static – discretize once at the beginning
• Dynamic – ranges can be found by equal interval
bucketing, equal frequency bucketing (percentiles),
or clustering.

– Binary Decision: (A < v) or (A ≥ v)
  • consider all possible splits and find the best cut
  • can be more computationally intensive
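A short pandas sketch of these options (the income values are illustrative):

import pandas as pd

income = pd.Series([60, 70, 75, 85, 90, 95, 100, 120, 125, 220])
equal_width = pd.cut(income, bins=4)    # equal-interval bucketing
equal_freq  = pd.qcut(income, q=4)      # equal-frequency bucketing (quartiles)
binary      = income >= 80              # binary decision: (A >= v) with v = 80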


Splitting Based on Continuous Attributes

(i) Binary split:     Taxable Income > 80K? → Yes / No
(ii) Multi-way split: Taxable Income? → < 10K | [10K, 25K) | [25K, 50K) | [50K, 80K) | > 80K

Tree Induction

• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting


How to Determine the Best Split

Before splitting: 10 records of class C0, 10 records of class C1.

Own Car?     Yes: C0 6, C1 4          No: C0 4, C1 6
Car Type?    Family: C0 1, C1 3       Sports: C0 8, C1 0       Luxury: C0 1, C1 7
Student ID?  c1 … c10: C0 1, C1 0 each        c11 … c20: C0 0, C1 1 each

Which test condition is the best?

How to Determine the Best Split

• Greedy approach:
– Nodes with a homogeneous class distribution are preferred
• Need a measure of node impurity:

C0: 5, C1: 5 — non-homogeneous, high degree of impurity
C0: 9, C1: 1 — homogeneous, low degree of impurity ✓

Measures of Node Impurity


• Gini Index

• Entropy

• Misclassification error


Measures of Impurity: GINI


• Gini Index for a given node t:

  GINI(t) = 1 − Σj [p(j|t)]²

  (NOTE: p(j|t) is the relative frequency of class j at node t.)

– Maximum: (1 − 1/nc) when records are equally distributed among all classes, implying the least interesting information

– Minimum: (0.0) when all records belong to one class, implying the most interesting information
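A direct transcription of this formula into a small Python sketch (per-class record counts are passed in as a list):

def gini(counts):
    """Gini index of a node from its per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([10, 10]))   # 0.5 -> equally distributed, maximum impurity for 2 classes
print(gini([7, 0]))     # 0.0 -> all records in one class, minimum impurity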


Consider this example…

Before splitting: Fish 10, Don’t Fish 10.  (*Note that “Fish” or “Don’t Fish” is the class label.)

Sunny?            Yes (N1): Fish 4, Don’t Fish 5     No (N2): Fish 6, Don’t Fish 5
Girlfriend busy?  Yes (N3): Fish 7, Don’t Fish 0     No (N4): Fish 3, Don’t Fish 10

Using the Gini Index, evaluate which test condition is better.

Computing GINI for the Fishing Example

Before splitting:
  Gini = 1 − [p(Fish)² + p(Don’t Fish)²]
       = 1 − [(10/20)² + (10/20)²]
       = 1 − (0.25 + 0.25) = 0.5

Split by “Sunny?”:
  Yes: P(Fish) = 4/9, P(Don’t Fish) = 5/9
       Gini = 1 − [(4/9)² + (5/9)²] = 0.494
  No:  P(Fish) = 6/11, P(Don’t Fish) = 5/11
       Gini = 1 − [(6/11)² + (5/11)²] = 0.496


Computing GINI for the Fishing Example (Cont)

Split by “Girlfriend busy?”:
  Yes: P(Fish) = 7/7, P(Don’t Fish) = 0/7
       Gini = 1 − [(7/7)² + (0/7)²] = 0.0
  No:  P(Fish) = 3/13, P(Don’t Fish) = 10/13
       Gini = 1 − [(3/13)² + (10/13)²] = 0.355

Splitting Based on GINI


• Used in CART, SLIQ, SPRINT.
• When a node p is split into k partitions (children), the quality of the split is computed as

  GINI_split = Σi=1..k (ni / n) · GINI(i)

  where ni = number of records at child i, and n = number of records at node p.
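A sketch of the weighted split Gini in Python (the node-level gini() is inlined so the snippet runs on its own; the counts are the fishing example’s):

def gini_split(children):
    """children: one class-count list per partition of the split."""
    gini = lambda c: 1.0 - sum((x / sum(c)) ** 2 for x in c)   # node-level Gini, as above
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

print(gini_split([[4, 5], [6, 5]]))    # ~0.495 (Sunny?)
print(gini_split([[7, 0], [3, 10]]))   # ~0.231 (Girlfriend busy?) -- the better split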


Fishing Example Continued…

Sunny?            Yes (N1): Fish 4, Don’t Fish 5     No (N2): Fish 6, Don’t Fish 5
                  Gini(N1) = 0.494                   Gini(N2) = 0.496
Girlfriend busy?  Yes (N3): Fish 7, Don’t Fish 0     No (N4): Fish 3, Don’t Fish 10
                  Gini(N3) = 0.0                     Gini(N4) = 0.355

GINI_split(Sunny)           = (9/20 × 0.494) + (11/20 × 0.496) = 0.495
GINI_split(Girlfriend busy) = (7/20 × 0.0) + (13/20 × 0.355) = 0.231 ✓

“Girlfriend busy?” gives the lower (better) split Gini.

Categorical Attributes: Computing the GINI Index

• For each distinct value, gather counts for each class in the dataset
• Use the count matrix to make decisions

Multi-way split:
       Family  Sports  Luxury
  C1   1       2       1
  C2   4       1       1
  Gini = 0.393

Two-way split (find the best partition of values):
       {Sports, Luxury}  {Family}         {Family, Luxury}  {Sports}
  C1   3                 1                2                 2
  C2   2                 4                1                 5
  Gini = 0.400                            Gini = 0.419

Continuous Attributes: Computing the GINI Index

• Use binary decisions based on one value, e.g., Taxable Income > 80K? (Yes / No)
• Several choices for the splitting value
– Number of possible splitting values = number of distinct values
• Each splitting value v has a count matrix associated with it
– Class counts in each of the partitions, A < v and A ≥ v
• Simple method to choose the best v:
– For each v, scan the database to gather the count matrix and compute its Gini index
– Computationally inefficient! Repetition of work.

(Training data: the Tid 1–10 Refund / Marital Status / Taxable Income / Cheat table above.)

Continuous Attributes: Computing the GINI Index (Cont)

• For efficient computation: for each attribute,
– Sort the attribute on its values
– Linearly scan these values, each time updating the count matrix and computing the Gini index
– Choose the split position that has the least Gini index

Sorted values (Taxable Income) with class labels, and candidate split positions v:

  Cheat:          No    No    No    Yes   Yes   Yes   No    No    No    No
  Income:         60    70    75    85    90    95    100   120   125   220
  Split v:     55    65    72    80    87    92    97    110   122   172   230
  Yes (≤v/>v): 0/3   0/3   0/3   0/3   1/2   2/1   3/0   3/0   3/0   3/0   3/0
  No  (≤v/>v): 0/7   1/6   2/5   3/4   3/4   3/4   3/4   4/3   5/2   6/1   7/0
  Gini:        0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

Best split: Taxable Income ≤ 97 (Gini = 0.300).
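A Python sketch of this sort-and-scan search (illustrative; candidate cuts are taken midway between consecutive sorted values, so the best cut reported is 97.5 rather than the table’s 97):

from collections import Counter

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    left = Counter()            # class counts for A <= v
    right = Counter(labels)     # class counts for A > v
    best_v, best_g = None, float("inf")
    for i in range(n - 1):      # linear scan, updating the count matrix at each step
        v, c = pairs[i]
        left[c] += 1
        right[c] -= 1
        cut = (v + pairs[i + 1][0]) / 2
        g = (i + 1) / n * gini(list(left.values())) + \
            (n - i - 1) / n * gini(list(right.values()))
        if g < best_g:
            best_v, best_g = cut, g
    return best_v, best_g

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split(income, cheat))   # -> (97.5, 0.3), matching the table's minimum Gini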


Measures of Node Impurity


• Gini Index

• Entropy

• Misclassification error


Alternative Splitting Criteria Based on Entropy

• Entropy at a given node t:

  Entropy(t) = − Σj p(j|t) log₂ p(j|t)

  (NOTE: p(j|t) is the relative frequency of class j at node t.)

– Measures the homogeneity of a node.
  • Maximum (log₂ nc) when records are equally distributed among all classes, implying the least information
  • Minimum (0.0) when all records belong to one class, implying the most information
– Entropy-based computations are similar to the GINI index computations
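The same formula as a small Python sketch (log base 2; the 0 · log₂ 0 term is treated as 0 by convention):

from math import log2

def entropy(counts):
    """Entropy of a node from its per-class record counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

print(entropy([10, 10]))   # 1.0 -> equally distributed (maximum = log2 of 2 classes)
print(entropy([7, 0]))     # 0.0 -> pure node (minimum)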


Same Fishing Example…

Before splitting: Fish 10, Don’t Fish 10.

Sunny?            Yes (N1): Fish 4, Don’t Fish 5     No (N2): Fish 6, Don’t Fish 5
Girlfriend busy?  Yes (N3): Fish 7, Don’t Fish 0     No (N4): Fish 3, Don’t Fish 10

Using Entropy, evaluate which test condition is better.

Computing Entropy for the Fishing Example

Before splitting:
  Entropy = − [p(Fish) log₂ p(Fish) + p(Don’t Fish) log₂ p(Don’t Fish)]
          = − [(10/20) log₂(10/20) + (10/20) log₂(10/20)]
          = − [−0.5 + −0.5] = 1

Split by “Sunny?”:
  Yes: P(Fish) = 4/9, P(Don’t Fish) = 5/9
       Entropy = − [(4/9) log₂(4/9) + (5/9) log₂(5/9)] = 0.991
  No:  P(Fish) = 6/11, P(Don’t Fish) = 5/11
       Entropy = − [(6/11) log₂(6/11) + (5/11) log₂(5/11)] = 0.994


Computing Entropy for the Fishing Example (Cont)

Split by “Girlfriend busy?”:
  Yes: P(Fish) = 7/7, P(Don’t Fish) = 0/7
       Entropy = − [(7/7) log₂(7/7) + (0/7) log₂(0/7)] = 0   (taking 0 · log₂ 0 = 0)
  No:  P(Fish) = 3/13, P(Don’t Fish) = 10/13
       Entropy = − [(3/13) log₂(3/13) + (10/13) log₂(10/13)] = 0.779

Splitting Based on Information Gain


• Information Gain:

  GAIN_split = Entropy(p) − Σi=1..k (ni / n) · Entropy(i)

  where p is the parent node, its n records are split into k partitions, and ni is the number of records in partition i.

– Measures the reduction in entropy achieved by the split. Choose the split that achieves the most reduction (maximizes GAIN).
– Used in ID3 and C4.5
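A sketch of information gain in Python (entropy() is redefined here so the snippet runs on its own; the counts are the fishing example’s):

from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def info_gain(parent, children):
    """Parent entropy minus the weighted entropy of the child partitions."""
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

print(info_gain([10, 10], [[4, 5], [6, 5]]))    # ~0.007 (Sunny?)
print(info_gain([10, 10], [[7, 0], [3, 10]]))   # ~0.494 (Girlfriend busy?) -- larger gain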

Fishing Example Continued… Information Gain

Before splitting: Fish 10, Don’t Fish 10 → Entropy(parent) = 1.

Sunny?            Yes (N1): Fish 4, Don’t Fish 5     No (N2): Fish 6, Don’t Fish 5
                  Entropy(N1) = 0.991                Entropy(N2) = 0.994
Girlfriend busy?  Yes (N3): Fish 7, Don’t Fish 0     No (N4): Fish 3, Don’t Fish 10
                  Entropy(N3) = 0.0                  Entropy(N4) = 0.779

GAIN_split(Sunny)           = 1 − ((9/20 × 0.991) + (11/20 × 0.994)) = 0.007
GAIN_split(Girlfriend busy) = 1 − ((7/20 × 0.0) + (13/20 × 0.779)) = 0.494 ✓

Problem of a Large Number of Partitions…

• Disadvantage:
– Information gain tends to prefer splits that result in a large number of partitions, each being small but pure.

  Day:              D1   D2   D3   D4   D5   D6   D7   D8
  Fish/Don’t Fish:  1/0  0/1  0/1  1/0  1/0  0/1  1/0  0/1

All subsets are perfectly pure → optimal split!?
A large number of small partitions should be penalized!

Solution: Gain Ratio

• Gain Ratio:

  GainRATIO_split = GAIN_split / SplitINFO,   where   SplitINFO = − Σi=1..k (ni / n) log₂(ni / n)

  The parent node’s n records are split into k partitions; ni is the number of records in partition i.

– Adjusts Information Gain by the entropy of the partitioning (SplitINFO).
– Higher-entropy partitioning (a large number of small partitions) is penalized!
– Designed to overcome the disadvantage of Information Gain.
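Gain ratio as a Python sketch (entropy() and info_gain() are redefined from the earlier sketches so the snippet runs on its own):

from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def info_gain(parent, children):
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

def gain_ratio(parent, children):
    """Information gain adjusted by SplitINFO, the entropy of the partition sizes."""
    n = sum(parent)
    split_info = -sum(sum(c) / n * log2(sum(c) / n) for c in children)
    return info_gain(parent, children) / split_info

day = [[1, 0]] * 10 + [[0, 1]] * 10               # 20 singleton partitions
print(gain_ratio([10, 10], day))                  # ~0.231: the Day split is penalized
print(gain_ratio([10, 10], [[7, 0], [3, 10]]))    # ~0.529: Girlfriend busy? wins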


Example: SplitINFO

Day split (20 singleton partitions, each holding 1/20 of the records):
  SplitINFO(Day) = − Σi=1..20 (1/20) log₂(1/20) = log₂ 20 = 4.322

Girlfriend busy? split (7/20 vs 13/20):
  SplitINFO(Girlfriend busy) = − [(7/20) log₂(7/20) + (13/20) log₂(13/20)]
                             = − [0.35 × (−1.5146) + 0.65 × (−0.6215)]
                             = 0.530 + 0.404 = 0.934

Example: GainRATIO

Before splitting: Fish 10, Don’t Fish 10 → Entropy(parent) = 1.

GAIN_split(Day) = 1 − 0 = 1          GAIN_split(Girlfriend busy) = 0.494

GainRATIO(Day)             = 1 / 4.322 = 0.231
GainRATIO(Girlfriend busy) = 0.494 / 0.934 = 0.529 ✓

After adjusting by SplitINFO, “Girlfriend busy?” beats the Day split.

Measures of Node Impurity


• Gini Index

• Entropy

• Misclassification error


Splitting Criteria Based on Classification Error

• Classification error at a node t:

  Error(t) = 1 − maxi P(i|t)

• Measures the misclassification error made by a node.
– Maximum: (1 − 1/nc) when records are equally distributed among all classes, implying the least interesting information
– Minimum: (0.0) when all records belong to one class, implying the most interesting information
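Classification error and the corresponding weighted split error as a Python sketch (counts are the fishing example’s):

def error(counts):
    """Misclassification error of a node from its per-class record counts."""
    n = sum(counts)
    return 1.0 - max(counts) / n

def error_split(children):
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * error(c) for c in children)

print(error_split([[4, 5], [6, 5]]))    # 0.45 (Sunny?)
print(error_split([[7, 0], [3, 10]]))   # 0.15 (Girlfriend busy?) -- the better split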


Still the Fishing Example…

Before splitting: Fish 10, Don’t Fish 10.

Sunny?            Yes (N1): Fish 4, Don’t Fish 5     No (N2): Fish 6, Don’t Fish 5
Girlfriend busy?  Yes (N3): Fish 7, Don’t Fish 0     No (N4): Fish 3, Don’t Fish 10

Using Classification Error, evaluate which test condition is better.

Computing Error for the Fishing Example

Before splitting:
  Error = 1 − max(p(Fish), p(Don’t Fish))
        = 1 − max(10/20, 10/20)
        = 1 − 0.5 = 0.5

Split by “Sunny?”:
  Yes: P(Fish) = 4/9, P(Don’t Fish) = 5/9
       Error = 1 − max(4/9, 5/9) = 0.444
  No:  P(Fish) = 6/11, P(Don’t Fish) = 5/11
       Error = 1 − max(6/11, 5/11) = 0.455


Computing Error for the Fishing Example (Cont)

Split by “Girlfriend busy?”:
  Yes: P(Fish) = 7/7, P(Don’t Fish) = 0/7
       Error = 1 − max(7/7, 0/7) = 0
  No:  P(Fish) = 3/13, P(Don’t Fish) = 10/13
       Error = 1 − max(3/13, 10/13) = 0.231

Splitting Based on Error

• The split error:

  Error_split = Σi=1..k (ni / n) · Error(i)

  where ni = number of records at child i, and n = number of records at node p.


Fishing Example Continued… Splitting Error

Sunny?            Yes (N1): Fish 4, Don’t Fish 5     No (N2): Fish 6, Don’t Fish 5
                  Error(N1) = 0.444                  Error(N2) = 0.455
Girlfriend busy?  Yes (N3): Fish 7, Don’t Fish 0     No (N4): Fish 3, Don’t Fish 10
                  Error(N3) = 0.0                    Error(N4) = 0.231

Error_split(Sunny)           = (9/20 × 0.444) + (11/20 × 0.455) = 0.45
Error_split(Girlfriend busy) = (7/20 × 0.0) + (13/20 × 0.231) = 0.15 ✓

Misclassification Error vs GINI

• Which measure gives the bigger impurity gain?

Parent: C1 7, C2 3.
Split A?   Yes → Node N1: C1 3, C2 0       No → Node N2: C1 4, C2 3

Misclassification Error vs GINI (Cont)

Error(parent) = 1 − max(7/10, 3/10) = 0.3

Error(N1) = 1 − max(3/3, 0/3) = 0
Error(N2) = 1 − max(4/7, 3/7) = 0.428

Error(children) = (3/10 × 0) + (7/10 × 0.428) = 0.3

Error does not improve: 0.3 both before and after the split.

Misclassification Error vs GINI (Cont)

Gini(parent) = 1 − (7/10)² − (3/10)² = 0.42

Gini(N1) = 1 − (3/3)² − (0/3)² = 0
Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489

Gini(children) = (3/10 × 0) + (7/10 × 0.489) = 0.342

Gini improves (0.42 → 0.342)! Gini detects this split’s gain while misclassification error does not.

Comparison Among Splitting Criteria

• For a 2-class problem, which curve is “entropy”, which is “Gini”, and which is “error”?

[Figure: the three impurity measures plotted against the fraction p of records in one class; all peak at p = 0.5 and are 0 at p = 0 and p = 1.]

Tree Induction

• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting


Determine When to Stop Splitting


(Stopping Criteria for Tree Induction)
• Stop expanding a node when all the records belong to
the same class

• Stop expanding a node when all the records have similar


attribute values

• Early termination


From Decision Tree to Rules

Decision tree:

Refund?
├─ Yes → NO
└─ No → Marital Status?
    ├─ Single, Divorced → Taxable Income?
    │   ├─ < 80K → NO
    │   └─ > 80K → YES
    └─ Married → NO

Classification Rules:
(Refund = Yes) ==> No
(Refund = No, Marital Status = {Single, Divorced}, Taxable Income < 80K) ==> No
(Refund = No, Marital Status = {Single, Divorced}, Taxable Income > 80K) ==> Yes
(Refund = No, Marital Status = {Married}) ==> No

Rules generated from decision trees are both mutually exclusive and exhaustive.
The rule set contains as much information as the tree.
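If the tree is built with scikit-learn, its rules can be printed directly; a minimal sketch (the iris dataset, max_depth=2, and the feature names are illustrative choices, not from the slides):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
# export_text renders the fitted tree as nested if/else rules.
print(export_text(clf, feature_names=["sepal_len", "sepal_wid", "petal_len", "petal_wid"]))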


Classification Tree (Example)

[Figure: points in the (x1, x2) plane labeled RED/BLUE, partitioned by axis-parallel cuts at x2 = 1, 5, 6 and x1 = 4.]

X2 > 5?
├─ No → X2 > 1?
│   ├─ Yes → RED
│   └─ No  → BLUE
└─ Yes → X1 < 4?
    ├─ Yes → BLUE
    └─ No  → X2 > 6?
        ├─ Yes → RED
        └─ No  → BLUE


SMU Classification: Restricted

Regression Tree (Example)

[Figure: a step function over x with steps at x = 1, 2, 3 and predicted values 0.50, 1.50, 1.00, 2.00.]

X > 2?
├─ No → X > 1?
│   ├─ No  → 0.50
│   └─ Yes → 1.50
└─ Yes → X > 3?
    ├─ No  → 1.00
    └─ Yes → 2.00

Pros and Cons


• Pros
– Simple to understand, interpret, and visualize
– Uses a white-box model: the path from the root to a leaf explains how a given result is produced
– Can handle both categorical and numerical data
– Extremely fast at classifying unknown records
– Accuracy is comparable to other classification techniques for many simple data sets
– Non-linear relationships between variables do not affect performance
• Cons
– Prone to overfitting
– Unstable, because small variations in the data can result in completely different trees being generated
– The greedy algorithm cannot guarantee returning the globally optimal decision tree
– Calculations can get very complex, particularly if many values are uncertain and/or many outcomes are linked


Summary of Decision Tree


• Classification Concept Review
• Classification Technique: Decision Tree
– What is Decision Tree
• Definition of Decision Tree
• Important Terminology related to Decision Trees (Root Node, Splitting, Decision Node, Leaf / Terminal Node,
Parent and Child Node, Branch / Sub-Tree, Pruning)
– Example of Decision Tree
• Training data → model
• Apply model to test data → classification result
– Growing a tree (Features to choose, Conditions for splitting, Knowing when to stop, Pruning)
– Decision tree induction algorithm: Hunt’s Algorithm (Example)
• How to split the records (Split the records based on an attribute test that optimizes certain
criterion; Split such that we get most homogeneous leaf node)
– How to specify the attribute test condition?
» Depends on attribute types (nominal, ordinal, continuous)
» Depends on number of ways to split (2-way split, multi-way split)
– How to determine the best split?
» Measures of Node Impurity: Gini Index, Entropy, Misclassification error
» Comparison Among the Splitting Criteria
• Determine when to stop splitting (Stopping Criteria for Tree Induction)
» Stop expanding a node when all the records belong to the same class
» Stop expanding a node when all the records have similar attribute values
» Early termination
– Pros and Cons of Decision Tree


Questions?

Thank You
