
Outline
• Recap: Last week – Data and Data Exploration
• Classification Concept Review
• Classification Technique: Decision Tree


Recap: Data and Data exploration


• Topic 1: What is data
• Topic 2: Why Data Exploration and Data Preprocessing?
• Topic 3: Basic Techniques for Data Exploration
• Topic 4: Data Pre-processing


Topic 1: What is data


• Definitions of Data
– Objects (record, point, case, sample, entity, or instance)
– Attributes (variable, field, characteristic, or feature)
– Types of Attributes: Nominal, Ordinal, Interval, Ratio

• Types of Data Sets


– Record (Data Matrix, Document Data, Transaction Data)
– Graph (World Wide Web, Social Networks, Molecular Structures)
– Ordered (Spatial Data, Temporal Data, Sequential Data, Genetic Sequence Data)

• Data Quality Problems
– Noise and outliers
– Missing values
– Duplicate data

Question: Can a data set belong to two or more of these data set types?
• Characteristics of Structured Data
– Dimensionality
– Sparsity
– …


Topic 2: Why Data Exploration and Data Pre-processing

[Flow diagram: Get Data → Exploratory Data Analysis → Preprocessing → Data Mining]

Garbage in, garbage out: exploratory analysis and preprocessing solve data quality problems for data mining.

Why Data Exploration and Data Pre-processing
• Garbage in, garbage out
• Most models assume numerical columns
• Does it make sense if I put this record into the model?
• Does it corrupt my program?
• What are the aspects of Data Exploration and Data Pre-processing?
– Missing values
– Bad data (e.g., noise) or outliers
– Duplicates
– Data Cleaning
– Aggregation
– Sampling
– Dimensionality Reduction
– Attribute Transformation
– Measure of Similarity & Dissimilarity
– …
These steps solve data quality problems for data mining / machine learning.

What is Data Exploration


• What is Data Exploration?
– Goal: to understand data
– Also named Exploratory Data Analysis (EDA)
– A preliminary exploration of the data to better understand its
characteristics.

• Key motivations of data exploration include


– Helping to select the right tool for pre-processing and data mining
– Making use of humans’ abilities to recognize patterns
– People can recognize patterns not captured by data analysis tools

• We focus on two main categories of techniques for data


exploration
– Summary statistics
– Visualization


Topic 3:
Data Exploration Techniques

-Summary statistics
– Location, Spread, Frequency, Mode, Quartiles,
Percentiles, etc.

-Visualization
– Histogram, Box plots, Scatter plots, Star plots, etc.


Topic 4: Data Pre-processing


• Data Cleaning
• How to handle missing values (Delete/Eliminate; Fill in/Imputation)
• Resolve redundancy caused by data integration
• Bad data

• Aggregation (Reduce the number of attributes or objects)


• Sampling
• Sampling is the main technique used for data selection
• The key principle for effective sampling: sample is representative
• Type of sampling

• Dimensionality Reduction
• Purpose: avoid the Curse of Dimensionality
• Filter out redundant features and irrelevant features
• Feature Extraction and Feature Selection (e.g., PCA, SVD)

• Discretization and Binarization


• Attribute Transformation and Normalization
• Measure of Similarity & Dissimilarity
• Euclidean Distance (L2)
• City Block Distance (L1)
• Simple Matching Coefficient (SMC) and Jaccard Coefficient between binary vectors
• Cosine Similarity

• Correlation


Curse of Dimensionality

• Exponential growth of the number of regions with dimension causes high sparsity in the space for the same dataset!
• This sparsity is problematic for any method that requires statistical significance.

[Figure: the same dataset spread over 1D, 2D, 3D, and 4D grids; the number of regions grows as 4, 4², 4³, 4⁴ = 256.]

For the same size of dataset:
– Low-dimensional space: each region contains some data points
– Higher-dimensional space: some regions contain no data points
– Very high-dimensional space: most regions contain no data points

With no data points in most regions, it is hard to solve the problem!

What Is Principal Component Analysis (PCA)?

• PCA is a statistical procedure that converts a set of observations of possibly correlated variables into a set of values of uncorrelated variables.

• PCA is one way to perform dimensionality reduction.

• PCA is one way to transform data.

• PCA is a method that can help us find the important dimensions (principal components).

• Each principal component is chosen so that it describes most of the still-available variance, and all principal components are orthogonal to each other.

• Among all principal components, the first principal component has the maximum variance.
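To make this concrete, here is a minimal PCA sketch in Python, assuming NumPy and scikit-learn are available; the correlated data and all variable names are illustrative, not part of the lecture.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a, 0.9 * a + 0.1 * rng.normal(size=200)])  # two correlated variables

pca = PCA(n_components=2)
Z = pca.fit_transform(X)               # uncorrelated values along PC1 and PC2
print(pca.explained_variance_ratio_)   # PC1 carries most of the variance

Z1 = Z[:, :1]                          # keep only PC1: 2 dimensions reduced to 1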


What Are the New Axes?

[Figure: data plotted over original variables/dimensions A and B; the variance explained is 85% along PC1 and 15% along PC2.]

• Orthogonal directions of greatest variance in the data

• Projections along PC1 discriminate the data most along any one axis

We may select the first PC (PC1) and reduce the 2 dimensions to 1 dimension if needed, as most of the variance (85%) is covered by PC1.
Principal Component Analysis (PCA) for Dimension Reduction

[Figure: two example datasets (Case 1 and Case 2) projected onto PC1, the new Dimension 1.]

• Reduce from 2 dimensions to 1 dimension, as most of the variance is covered by PC1.
• PC1 is a new dimension; its unit is neither inches nor cm. This is a disadvantage of the PCA method: the meaning of the original dimensions is lost.

Outline
• Recap: Last week – Data and Data Exploration
• Classification Concept Review
• Classification Technique: Decision Tree


Classification: Definition
• Given a collection of records (training set)
– Each record contains a set of attributes, one of the
attributes is the class.
• Find a model for class attribute as a function of the
values of other attributes.
• Goal: previously unseen records should be assigned a
class as accurately as possible.

A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set is used to validate or test it.
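A hedged sketch of this train/test methodology in Python (scikit-learn assumed; the iris dataset and the 70/30 split are my own illustrative choices, not from the slides):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
# Divide the given data set into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # build on the training set
print(accuracy_score(y_test, model.predict(X_test)))                  # accuracy on unseen records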


Classification Task Example


Classification Examples

1. Predicting tumor cells as benign or malignant


2. Classifying credit card transactions as legitimate or
fraudulent
3. Classifying secondary structures of protein as alpha-
helix, beta-sheet, or random coil in biology
4. Categorizing news stories as finance, weather,
entertainment, sports, etc.


Outline
• Recap: Last week – Data and Data Exploration
• Classification Concept Review
• Classification Technique: Decision Tree


What is “Decision Tree”


• A decision tree builds classification or regression models in the form of a tree structure.
• It breaks down a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed.
• The final result is a tree with decision nodes and leaf nodes.

Source: http://dataaspirant.com/2017/01/30/how-decision-tree-algorithm-works/


Definition of Decision Tree


• “A decision tree is a decision support tool that uses a
tree-like model of decisions and their possible
consequences, including chance event outcomes,
resource costs, and utility.”


Important Terminology related to Decision Trees

https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html


Important Terminology related to Decision Trees
1. Root Node: represents the entire population or sample; it further gets divided into two or more homogeneous sets.
2. Splitting: the process of dividing a node into two or more sub-nodes.
3. Decision Node: when a sub-node splits into further sub-nodes, it is called a decision node.
4. Leaf / Terminal Node: nodes that do not split are called leaf or terminal nodes.
5. Parent and Child Node: a node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are the children of the parent node.
6. Branch / Sub-Tree: a subsection of the entire tree is called a branch or sub-tree.
7. Pruning: removing sub-nodes of a decision node is called pruning; it can be seen as the opposite of splitting.


Example of Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc):

Refund?
├─ Yes → NO
└─ No → MarSt?
    ├─ Single, Divorced → TaxInc?
    │   ├─ < 80K → NO
    │   └─ > 80K → YES
    └─ Married → NO
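One possible way to reproduce this example in Python (pandas and scikit-learn assumed; the one-hot encoding and all variable names are my own illustrative choices, and the learned tree may differ from the one drawn above):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "Refund": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "MarSt":  ["Single", "Married", "Single", "Married", "Divorced",
               "Married", "Divorced", "Single", "Married", "Single"],
    "TaxInc": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
    "Cheat":  ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})
X = pd.get_dummies(df[["Refund", "MarSt", "TaxInc"]])   # one-hot encode the categorical attributes
clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, df["Cheat"])

test = pd.get_dummies(pd.DataFrame({"Refund": ["No"], "MarSt": ["Married"], "TaxInc": [80]}))
test = test.reindex(columns=X.columns, fill_value=0)    # align columns with the training matrix
print(clf.predict(test))   # likely ['No']: every Married record in the training data has Cheat = No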


Training Data → Model

The same training data (Tid 1–10 above) also fits this alternative tree:

MarSt?
├─ Married → NO
└─ Single, Divorced → Refund?
    ├─ Yes → NO
    └─ No → TaxInc?
        ├─ < 80K → NO
        └─ > 80K → YES

There could be more than one tree that fits the same data!

Apply Model to Test Data

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the test record down the branches:

Refund?
├─ Yes → NO
└─ No → MarSt?
    ├─ Single, Divorced → TaxInc?
    │   ├─ < 80K → NO
    │   └─ > 80K → YES
    └─ Married → NO



Apply Model to Test Data (Cont)

Following the path for the test record: Refund = No → MarSt = Married → NO.

Assign Cheat to “No”

Decision Tree for Classification Task


Growing a Tree
1. Features to choose

2. Conditions for splitting

3. Knowing when to stop

4. Pruning


Decision Tree Induction


• Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART*
– ID3*
– C4.5*
– SLIQ, SPRINT*


What is “Hunt’s Algorithm”?

• Hunt’s Algorithm is the basis of many existing decision tree induction algorithms.
• A decision tree is grown in a recursive fashion by successively partitioning the training records into purer subsets.
– Recursion is when a function calls itself.
– Iteration is when a loop executes repeatedly until the controlling condition becomes false.
– The primary difference is that recursion is a process applied to a function that calls itself, whereas iteration applies a set of instructions repeatedly within a loop.


General Structure of Hunt’s Algorithm

• Let Dt be the set of training records that reach a node t
• General Procedure:
– If Dt contains records that belong to the same class yt, then t is a leaf node labeled as yt
– If Dt is an empty set, then t is a leaf node labeled by the default class yd
– If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the above procedure to each subset.

(Training data: the Tid 1–10 Refund / Marital Status / Taxable Income / Cheat table above.)
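A compact Python sketch of this recursive procedure (illustrative only; a real induction algorithm would choose the attribute test that optimizes an impurity criterion rather than taking attributes in order):

from collections import Counter

def hunt(records, labels, attributes, default):
    """records: list of dicts; labels: one class label per record."""
    if not records:                          # empty D_t -> leaf with the default class y_d
        return default
    if len(set(labels)) == 1:                # all records in the same class y_t -> leaf
        return labels[0]
    if not attributes:                       # no tests left -> majority class
        return Counter(labels).most_common(1)[0][0]
    attr, rest = attributes[0], attributes[1:]
    majority = Counter(labels).most_common(1)[0][0]
    tree = {attr: {}}
    for value in set(r[attr] for r in records):   # attribute test splits D_t into subsets
        pairs = [(r, l) for r, l in zip(records, labels) if r[attr] == value]
        sub_r = [p[0] for p in pairs]
        sub_l = [p[1] for p in pairs]
        tree[attr][value] = hunt(sub_r, sub_l, rest, majority)   # recurse on each subset
    return tree

# Example call: hunt(records, labels, ["Refund", "MarSt"], default="No")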


Hunt’s Algorithm Example

Step 1: split on Refund (training data as above):

Refund?
├─ Yes → Cheat = No          (purely contains the label “No”)
└─ No  → Cheat = {Yes, No}   (mix of “Yes” and “No” labels, so we recursively split)

Hunt’s Algorithm Example (Cont)

Step 2: under Refund = No, split on Marital Status:

Refund?
├─ Yes → Cheat = No
└─ No  → Marital Status?
    ├─ Married → Cheat = No
    └─ Single, Divorced → Cheat = {Yes, No}   (still mixed, so split again)

Hunt’s Algorithm Example (Cont)

Step 3: under Marital Status = Single or Divorced, split on Taxable Income:

Refund?
├─ Yes → Cheat = No
└─ No  → Marital Status?
    ├─ Married → Cheat = No
    └─ Single, Divorced → Taxable Income?
        ├─ < 80K  → Cheat = No
        └─ >= 80K → Cheat = Yes

All leaf nodes are now pure, so the recursion stops.

Greedy Strategy for Tree Induction

– Split the records based on an attribute test that optimizes a certain criterion
– Split such that we get the most homogeneous leaf nodes


Tree Induction

• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting


How to Specify Test Condition?


• Depends on attribute types
– Discrete types, such as Nominal and Ordinal
– Continuous types, such as Interval or Ratio

• Depends on number of ways to split


– 2-way split
– Multi-way split


Splitting Based on Nominal Attributes


• Multi-way split: use as many partitions as distinct values.
  CarType → {Family} | {Sports} | {Luxury}

• Binary split: divides values into two subsets; need to find the optimal partitioning.
  CarType → {Sports, Luxury} | {Family}    or    CarType → {Family, Luxury} | {Sports}

Splitting Based on Ordinal Attributes


• Multi-way split: use as many partitions as distinct values.
  Size → {Small} | {Medium} | {Large}

• Binary split: divides values into two subsets; need to find the optimal partitioning.
  Size → {Small, Medium} | {Large}    or    Size → {Medium, Large} | {Small}

Splitting Based on Ordinal Attributes (Cont)

• What about this split? Size → {Small, Large} | {Medium}
  This grouping merges non-adjacent values, so it breaks the order of the ordinal attribute.

Splitting Based on Continuous Attributes


• Different ways of handling
– Discretization to form an ordinal categorical attribute
• Static – discretize once at the beginning
• Dynamic – ranges can be found by equal interval
bucketing, equal frequency bucketing (percentiles),
or clustering.

– Binary Decision: (A < v) or (A ≥ v)
  • consider all possible splits and find the best cut
  • can be more computationally intensive
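A short pandas sketch of these options (the income values are illustrative):

import pandas as pd

income = pd.Series([60, 70, 75, 85, 90, 95, 100, 120, 125, 220])
equal_width = pd.cut(income, bins=4)    # equal-interval bucketing
equal_freq  = pd.qcut(income, q=4)      # equal-frequency bucketing (quartiles)
binary      = income >= 80              # binary decision: (A >= v) with v = 80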


Splitting Based on Continuous Attributes

(i) Binary split:     Taxable Income > 80K? → Yes / No
(ii) Multi-way split: Taxable Income? → < 10K | [10K, 25K) | [25K, 50K) | [50K, 80K) | > 80K

Tree Induction

• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting


How to Determine the Best Split

Before splitting: 10 records of class C0, 10 records of class C1.

Own Car?     Yes: C0 6, C1 4          No: C0 4, C1 6
Car Type?    Family: C0 1, C1 3       Sports: C0 8, C1 0       Luxury: C0 1, C1 7
Student ID?  c1 … c10: C0 1, C1 0 each        c11 … c20: C0 0, C1 1 each

Which test condition is the best?

How to Determine the Best Split

• Greedy approach:
– Nodes with a homogeneous class distribution are preferred
• Need a measure of node impurity:

C0: 5, C1: 5 — non-homogeneous, high degree of impurity
C0: 9, C1: 1 — homogeneous, low degree of impurity ✓

Measures of Node Impurity


• Gini Index

• Entropy

• Misclassification error


Measures of Impurity: GINI


• Gini Index for a given node t:

  GINI(t) = 1 − Σj [p(j|t)]²

  (NOTE: p(j|t) is the relative frequency of class j at node t.)

– Maximum: (1 − 1/nc) when records are equally distributed among all classes, implying the least interesting information

– Minimum: (0.0) when all records belong to one class, implying the most interesting information
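A direct transcription of this formula into a small Python sketch (per-class record counts are passed in as a list):

def gini(counts):
    """Gini index of a node from its per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([10, 10]))   # 0.5 -> equally distributed, maximum impurity for 2 classes
print(gini([7, 0]))     # 0.0 -> all records in one class, minimum impurity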


Consider this example…

Before splitting: Fish 10, Don’t Fish 10.  (*Note that “Fish” or “Don’t Fish” is the class label.)

Sunny?            Yes (N1): Fish 4, Don’t Fish 5     No (N2): Fish 6, Don’t Fish 5
Girlfriend busy?  Yes (N3): Fish 7, Don’t Fish 0     No (N4): Fish 3, Don’t Fish 10

Using the Gini Index, evaluate which test condition is better.

Computing GINI for the Fishing Example

Before splitting:
  Gini = 1 − [p(Fish)² + p(Don’t Fish)²]
       = 1 − [(10/20)² + (10/20)²]
       = 1 − (0.25 + 0.25) = 0.5

Split by “Sunny?”:
  Yes: P(Fish) = 4/9, P(Don’t Fish) = 5/9
       Gini = 1 − [(4/9)² + (5/9)²] = 0.494
  No:  P(Fish) = 6/11, P(Don’t Fish) = 5/11
       Gini = 1 − [(6/11)² + (5/11)²] = 0.496


Computing GINI for the Fishing Example (Cont)

Split by “Girlfriend busy?”:
  Yes: P(Fish) = 7/7, P(Don’t Fish) = 0/7
       Gini = 1 − [(7/7)² + (0/7)²] = 0.0
  No:  P(Fish) = 3/13, P(Don’t Fish) = 10/13
       Gini = 1 − [(3/13)² + (10/13)²] = 0.355

Splitting Based on GINI


• Used in CART, SLIQ, SPRINT.
• When a node p is split into k partitions (children), the quality of the split is computed as

  GINI_split = Σi=1..k (ni / n) · GINI(i)

  where ni = number of records at child i, and n = number of records at node p.
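A sketch of the weighted split Gini in Python (the node-level gini() is inlined so the snippet runs on its own; the counts are the fishing example’s):

def gini_split(children):
    """children: one class-count list per partition of the split."""
    gini = lambda c: 1.0 - sum((x / sum(c)) ** 2 for x in c)   # node-level Gini, as above
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

print(gini_split([[4, 5], [6, 5]]))    # ~0.495 (Sunny?)
print(gini_split([[7, 0], [3, 10]]))   # ~0.231 (Girlfriend busy?) -- the better split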


Fishing Example Continued…

Sunny?            Yes (N1): Fish 4, Don’t Fish 5     No (N2): Fish 6, Don’t Fish 5
                  Gini(N1) = 0.494                   Gini(N2) = 0.496
Girlfriend busy?  Yes (N3): Fish 7, Don’t Fish 0     No (N4): Fish 3, Don’t Fish 10
                  Gini(N3) = 0.0                     Gini(N4) = 0.355

GINI_split(Sunny)           = (9/20 × 0.494) + (11/20 × 0.496) = 0.495
GINI_split(Girlfriend busy) = (7/20 × 0.0) + (13/20 × 0.355) = 0.231 ✓

“Girlfriend busy?” gives the lower (better) split Gini.

Categorical Attributes: Computing the GINI Index

• For each distinct value, gather counts for each class in the dataset
• Use the count matrix to make decisions

Multi-way split:
       Family  Sports  Luxury
  C1   1       2       1
  C2   4       1       1
  Gini = 0.393

Two-way split (find the best partition of values):
       {Sports, Luxury}  {Family}         {Family, Luxury}  {Sports}
  C1   3                 1                2                 2
  C2   2                 4                1                 5
  Gini = 0.400                            Gini = 0.419

Continuous Attributes: Computing the GINI Index

• Use binary decisions based on one value, e.g., Taxable Income > 80K? (Yes / No)
• Several choices for the splitting value
– Number of possible splitting values = number of distinct values
• Each splitting value v has a count matrix associated with it
– Class counts in each of the partitions, A < v and A ≥ v
• Simple method to choose the best v:
– For each v, scan the database to gather the count matrix and compute its Gini index
– Computationally inefficient! Repetition of work.

(Training data: the Tid 1–10 Refund / Marital Status / Taxable Income / Cheat table above.)

Continuous Attributes: Computing the GINI Index (Cont)

• For efficient computation: for each attribute,
– Sort the attribute on its values
– Linearly scan these values, each time updating the count matrix and computing the Gini index
– Choose the split position that has the least Gini index

Sorted values (Taxable Income) with class labels, and candidate split positions v:

  Cheat:          No    No    No    Yes   Yes   Yes   No    No    No    No
  Income:         60    70    75    85    90    95    100   120   125   220
  Split v:     55    65    72    80    87    92    97    110   122   172   230
  Yes (≤v/>v): 0/3   0/3   0/3   0/3   1/2   2/1   3/0   3/0   3/0   3/0   3/0
  No  (≤v/>v): 0/7   1/6   2/5   3/4   3/4   3/4   3/4   4/3   5/2   6/1   7/0
  Gini:        0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

Best split: Taxable Income ≤ 97 (Gini = 0.300).
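A Python sketch of this sort-and-scan search (illustrative; candidate cuts are taken midway between consecutive sorted values, so the best cut reported is 97.5 rather than the table’s 97):

from collections import Counter

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    left = Counter()            # class counts for A <= v
    right = Counter(labels)     # class counts for A > v
    best_v, best_g = None, float("inf")
    for i in range(n - 1):      # linear scan, updating the count matrix at each step
        v, c = pairs[i]
        left[c] += 1
        right[c] -= 1
        cut = (v + pairs[i + 1][0]) / 2
        g = (i + 1) / n * gini(list(left.values())) + \
            (n - i - 1) / n * gini(list(right.values()))
        if g < best_g:
            best_v, best_g = cut, g
    return best_v, best_g

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split(income, cheat))   # -> (97.5, 0.3), matching the table's minimum Gini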


Measures of Node Impurity


• Gini Index

• Entropy

• Misclassification error


Alternative Splitting Criteria Based on Entropy

• Entropy at a given node t:

  Entropy(t) = − Σj p(j|t) log₂ p(j|t)

  (NOTE: p(j|t) is the relative frequency of class j at node t.)

– Measures the homogeneity of a node.
  • Maximum (log₂ nc) when records are equally distributed among all classes, implying the least information
  • Minimum (0.0) when all records belong to one class, implying the most information
– Entropy-based computations are similar to the GINI index computations
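The same formula as a small Python sketch (log base 2; the 0 · log₂ 0 term is treated as 0 by convention):

from math import log2

def entropy(counts):
    """Entropy of a node from its per-class record counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

print(entropy([10, 10]))   # 1.0 -> equally distributed (maximum = log2 of 2 classes)
print(entropy([7, 0]))     # 0.0 -> pure node (minimum)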


Same Fishing Example…

Before splitting: Fish 10, Don’t Fish 10.

Sunny?            Yes (N1): Fish 4, Don’t Fish 5     No (N2): Fish 6, Don’t Fish 5
Girlfriend busy?  Yes (N3): Fish 7, Don’t Fish 0     No (N4): Fish 3, Don’t Fish 10

Using Entropy, evaluate which test condition is better.

Computing Entropy for the Fishing Example

Before splitting:
  Entropy = − [p(Fish) log₂ p(Fish) + p(Don’t Fish) log₂ p(Don’t Fish)]
          = − [(10/20) log₂(10/20) + (10/20) log₂(10/20)]
          = − [−0.5 + −0.5] = 1

Split by “Sunny?”:
  Yes: P(Fish) = 4/9, P(Don’t Fish) = 5/9
       Entropy = − [(4/9) log₂(4/9) + (5/9) log₂(5/9)] = 0.991
  No:  P(Fish) = 6/11, P(Don’t Fish) = 5/11
       Entropy = − [(6/11) log₂(6/11) + (5/11) log₂(5/11)] = 0.994


Computing Entropy for the Fishing Example (Cont)

Split by “Girlfriend busy?”:
  Yes: P(Fish) = 7/7, P(Don’t Fish) = 0/7
       Entropy = − [(7/7) log₂(7/7) + (0/7) log₂(0/7)] = 0   (taking 0 · log₂ 0 = 0)
  No:  P(Fish) = 3/13, P(Don’t Fish) = 10/13
       Entropy = − [(3/13) log₂(3/13) + (10/13) log₂(10/13)] = 0.779

Splitting Based on Information Gain


• Information Gain:

  GAIN_split = Entropy(p) − Σi=1..k (ni / n) · Entropy(i)

  where p is the parent node, its n records are split into k partitions, and ni is the number of records in partition i.

– Measures the reduction in entropy achieved by the split. Choose the split that achieves the most reduction (maximizes GAIN).
– Used in ID3 and C4.5
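A sketch of information gain in Python (entropy() is redefined here so the snippet runs on its own; the counts are the fishing example’s):

from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def info_gain(parent, children):
    """Parent entropy minus the weighted entropy of the child partitions."""
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

print(info_gain([10, 10], [[4, 5], [6, 5]]))    # ~0.007 (Sunny?)
print(info_gain([10, 10], [[7, 0], [3, 10]]))   # ~0.494 (Girlfriend busy?) -- larger gain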

Fishing Example Continued… Information Gain

Before splitting: Fish 10, Don’t Fish 10 → Entropy(parent) = 1.

Sunny?            Yes (N1): Fish 4, Don’t Fish 5     No (N2): Fish 6, Don’t Fish 5
                  Entropy(N1) = 0.991                Entropy(N2) = 0.994
Girlfriend busy?  Yes (N3): Fish 7, Don’t Fish 0     No (N4): Fish 3, Don’t Fish 10
                  Entropy(N3) = 0.0                  Entropy(N4) = 0.779

GAIN_split(Sunny)           = 1 − ((9/20 × 0.991) + (11/20 × 0.994)) = 0.007
GAIN_split(Girlfriend busy) = 1 − ((7/20 × 0.0) + (13/20 × 0.779)) = 0.494 ✓

Problem of a Large Number of Partitions…

• Disadvantage:
– Information gain tends to prefer splits that result in a large number of partitions, each being small but pure.

  Day:              D1   D2   D3   D4   D5   D6   D7   D8
  Fish/Don’t Fish:  1/0  0/1  0/1  1/0  1/0  0/1  1/0  0/1

All subsets are perfectly pure → optimal split!?
A large number of small partitions should be penalized!

Solution: Gain Ratio

• Gain Ratio:

  GainRATIO_split = GAIN_split / SplitINFO,   where   SplitINFO = − Σi=1..k (ni / n) log₂(ni / n)

  The parent node’s n records are split into k partitions; ni is the number of records in partition i.

– Adjusts Information Gain by the entropy of the partitioning (SplitINFO).
– Higher-entropy partitioning (a large number of small partitions) is penalized!
– Designed to overcome the disadvantage of Information Gain.
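Gain ratio as a Python sketch (entropy() and info_gain() are redefined from the earlier sketches so the snippet runs on its own):

from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def info_gain(parent, children):
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

def gain_ratio(parent, children):
    """Information gain adjusted by SplitINFO, the entropy of the partition sizes."""
    n = sum(parent)
    split_info = -sum(sum(c) / n * log2(sum(c) / n) for c in children)
    return info_gain(parent, children) / split_info

day = [[1, 0]] * 10 + [[0, 1]] * 10               # 20 singleton partitions
print(gain_ratio([10, 10], day))                  # ~0.231: the Day split is penalized
print(gain_ratio([10, 10], [[7, 0], [3, 10]]))    # ~0.529: Girlfriend busy? wins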


Example: SplitINFO

Day split (20 singleton partitions, each holding 1/20 of the records):
  SplitINFO(Day) = − Σi=1..20 (1/20) log₂(1/20) = log₂ 20 = 4.322

Girlfriend busy? split (7/20 vs 13/20):
  SplitINFO(Girlfriend busy) = − [(7/20) log₂(7/20) + (13/20) log₂(13/20)]
                             = − [0.35 × (−1.5146) + 0.65 × (−0.6215)]
                             = 0.530 + 0.404 = 0.934

Example: GainRATIO

Before splitting: Fish 10, Don’t Fish 10 → Entropy(parent) = 1.

GAIN_split(Day) = 1 − 0 = 1          GAIN_split(Girlfriend busy) = 0.494

GainRATIO(Day)             = 1 / 4.322 = 0.231
GainRATIO(Girlfriend busy) = 0.494 / 0.934 = 0.529 ✓

After adjusting by SplitINFO, “Girlfriend busy?” beats the Day split.

Measures of Node Impurity


• Gini Index

• Entropy

• Misclassification error


Splitting Criteria Based on Classification Error

• Classification error at a node t:

  Error(t) = 1 − maxi P(i|t)

• Measures the misclassification error made by a node.
– Maximum: (1 − 1/nc) when records are equally distributed among all classes, implying the least interesting information
– Minimum: (0.0) when all records belong to one class, implying the most interesting information
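Classification error and the corresponding weighted split error as a Python sketch (counts are the fishing example’s):

def error(counts):
    """Misclassification error of a node from its per-class record counts."""
    n = sum(counts)
    return 1.0 - max(counts) / n

def error_split(children):
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * error(c) for c in children)

print(error_split([[4, 5], [6, 5]]))    # 0.45 (Sunny?)
print(error_split([[7, 0], [3, 10]]))   # 0.15 (Girlfriend busy?) -- the better split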


Still the Fishing Example…

Before splitting: Fish 10, Don’t Fish 10.

Sunny?            Yes (N1): Fish 4, Don’t Fish 5     No (N2): Fish 6, Don’t Fish 5
Girlfriend busy?  Yes (N3): Fish 7, Don’t Fish 0     No (N4): Fish 3, Don’t Fish 10

Using Classification Error, evaluate which test condition is better.

Computing Error for the Fishing Example

Before splitting:
  Error = 1 − max(p(Fish), p(Don’t Fish))
        = 1 − max(10/20, 10/20)
        = 1 − 0.5 = 0.5

Split by “Sunny?”:
  Yes: P(Fish) = 4/9, P(Don’t Fish) = 5/9
       Error = 1 − max(4/9, 5/9) = 0.444
  No:  P(Fish) = 6/11, P(Don’t Fish) = 5/11
       Error = 1 − max(6/11, 5/11) = 0.455


Computing Error for the Fishing Example (Cont)

Split by “Girlfriend busy?”:
  Yes: P(Fish) = 7/7, P(Don’t Fish) = 0/7
       Error = 1 − max(7/7, 0/7) = 0
  No:  P(Fish) = 3/13, P(Don’t Fish) = 10/13
       Error = 1 − max(3/13, 10/13) = 0.231

Splitting Based on Error

• The split error:

  Error_split = Σi=1..k (ni / n) · Error(i)

  where ni = number of records at child i, and n = number of records at node p.


Fishing Example Continued… Splitting Error

Sunny?            Yes (N1): Fish 4, Don’t Fish 5     No (N2): Fish 6, Don’t Fish 5
                  Error(N1) = 0.444                  Error(N2) = 0.455
Girlfriend busy?  Yes (N3): Fish 7, Don’t Fish 0     No (N4): Fish 3, Don’t Fish 10
                  Error(N3) = 0.0                    Error(N4) = 0.231

Error_split(Sunny)           = (9/20 × 0.444) + (11/20 × 0.455) = 0.45
Error_split(Girlfriend busy) = (7/20 × 0.0) + (13/20 × 0.231) = 0.15 ✓

Misclassification Error vs GINI

• Which measure gives the bigger impurity gain?

Parent: C1 7, C2 3.
Split A?   Yes → Node N1: C1 3, C2 0       No → Node N2: C1 4, C2 3

Misclassification Error vs GINI (Cont)

Error(parent) = 1 − max(7/10, 3/10) = 0.3

Error(N1) = 1 − max(3/3, 0/3) = 0
Error(N2) = 1 − max(4/7, 3/7) = 0.428

Error(children) = (3/10 × 0) + (7/10 × 0.428) = 0.3

Error does not improve: 0.3 both before and after the split.

Misclassification Error vs GINI (Cont)

Gini(parent) = 1 − (7/10)² − (3/10)² = 0.42

Gini(N1) = 1 − (3/3)² − (0/3)² = 0
Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489

Gini(children) = (3/10 × 0) + (7/10 × 0.489) = 0.342

Gini improves (0.42 → 0.342)! Gini detects this split’s gain while misclassification error does not.

Comparison Among Splitting Criteria

• For a 2-class problem, which curve is “entropy”, which is “Gini”, and which is “error”?

[Figure: the three impurity measures plotted against the fraction p of records in one class; all peak at p = 0.5 and are 0 at p = 0 and p = 1.]

Tree Induction

• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting


Determine When to Stop Splitting


(Stopping Criteria for Tree Induction)
• Stop expanding a node when all the records belong to
the same class

• Stop expanding a node when all the records have similar


attribute values

• Early termination


From Decision Tree to Rules

Decision tree:

Refund?
├─ Yes → NO
└─ No → Marital Status?
    ├─ Single, Divorced → Taxable Income?
    │   ├─ < 80K → NO
    │   └─ > 80K → YES
    └─ Married → NO

Classification Rules:
(Refund = Yes) ==> No
(Refund = No, Marital Status = {Single, Divorced}, Taxable Income < 80K) ==> No
(Refund = No, Marital Status = {Single, Divorced}, Taxable Income > 80K) ==> Yes
(Refund = No, Marital Status = {Married}) ==> No

Rules generated from decision trees are both mutually exclusive and exhaustive.
The rule set contains as much information as the tree.
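If the tree is built with scikit-learn, its rules can be printed directly; a minimal sketch (the iris dataset, max_depth=2, and the feature names are illustrative choices, not from the slides):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
# export_text renders the fitted tree as nested if/else rules.
print(export_text(clf, feature_names=["sepal_len", "sepal_wid", "petal_len", "petal_wid"]))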


Classification Tree (Example)

[Figure: points in the (x1, x2) plane labeled RED/BLUE, partitioned by axis-parallel cuts at x2 = 1, 5, 6 and x1 = 4.]

X2 > 5?
├─ No → X2 > 1?
│   ├─ Yes → RED
│   └─ No  → BLUE
└─ Yes → X1 < 4?
    ├─ Yes → BLUE
    └─ No  → X2 > 6?
        ├─ Yes → RED
        └─ No  → BLUE


SMU Classification: Restricted

Regression Tree (Example)

[Figure: a step function over x with steps at x = 1, 2, 3 and predicted values 0.50, 1.50, 1.00, 2.00.]

X > 2?
├─ No → X > 1?
│   ├─ No  → 0.50
│   └─ Yes → 1.50
└─ Yes → X > 3?
    ├─ No  → 1.00
    └─ Yes → 2.00

Pros and Cons


• Pros
– Simple to understand, interpret, and visualize
– Uses a white-box model: the path from the root to a leaf explains how a given result is produced
– Can handle both categorical and numerical data
– Extremely fast at classifying unknown records
– Accuracy is comparable to other classification techniques for many simple data sets
– Non-linear relationships between variables do not affect performance
• Cons
– Prone to overfitting
– Unstable, because small variations in the data can result in completely different trees being generated
– The greedy algorithm cannot guarantee returning the globally optimal decision tree
– Calculations can get very complex, particularly if many values are uncertain and/or many outcomes are linked


Summary of Decision Tree


• Classification Concept Review
• Classification Technique: Decision Tree
– What is Decision Tree
• Definition of Decision Tree
• Important Terminology related to Decision Trees (Root Node, Splitting, Decision Node, Leaf / Terminal Node,
Parent and Child Node, Branch / Sub-Tree, Pruning)
– Example of Decision Tree
• Training data → model
• Apply model to test data → classification result
– Growing a tree (Features to choose, Conditions for splitting, Knowing when to stop, Pruning)
– Decision tree induction algorithm: Hunt’s Algorithm (Example)
• How to split the records (Split the records based on an attribute test that optimizes certain
criterion; Split such that we get most homogeneous leaf node)
– How to specify the attribute test condition?
» Depends on attribute types (nominal, ordinal, continuous)
» Depends on number of ways to split (2-way split, multi-way split)
– How to determine the best split?
» Measures of Node Impurity: Gini Index, Entropy, Misclassification error
» Comparison Among the Splitting Criteria
• Determine when to stop splitting (Stopping Criteria for Tree Induction)
» Stop expanding a node when all the records belong to the same class
» Stop expanding a node when all the records have similar attribute values
» Early termination
– Pros and Cons of Decision Tree


Questions?

Thank You
