
ISOM3360 Data Mining for Business Analytics

Decision Tree Learning

Instructor: Yi Yang
Department of ISOM
Spring 2023
• Last lecture
  • Data preparation

• This lecture
  • Decision tree
Data Mining Process

(Figure: overview of the data mining process.)
Supervised Learning

• Classification is used to predict which class (discrete value) a data point belongs to.
  • Examples: fraud detection, customer churn prediction

• Regression is used to predict a continuous value.
  • Examples: stock price prediction, housing price prediction
Training vs. Testing

Two options:

1. Learn the model on all the data, and evaluate it on parts of that same data.

2. Split the data into two parts, learn the model on one part, and evaluate it on the other part.

Recall that an ML model is used to make predictions on unseen data, so option 2 is the right approach.

Supervised learning rule of thumb: never, ever use testing data to learn your model.


Supervised Learning: Data Split

• Training set — a subset used to train the model.

• Test set — a subset used to test the trained model.

A good test set:
• is large enough to yield statistically meaningful results;
• is representative of the data set as a whole. In other words, don't pick a test set with different characteristics than the training set.

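A minimal sketch of such a split, using scikit-learn (an assumed tooling choice; the slides do not name a library):

```python
# Hold out part of the data for testing; scikit-learn's train_test_split
# is one common way to do this. The toy data here is made up.
from sklearn.model_selection import train_test_split

X = [[i] for i in range(100)]    # toy features
y = [i % 2 for i in range(100)]  # toy labels

# Hold out 30% for testing; stratify on y so the test set stays
# representative of the class proportions in the whole dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Rule of thumb: fit on X_train / y_train only; use X_test / y_test
# once, for evaluation.
```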
Decision Tree

• Decision trees are one of the most popular data mining tools.

• Decision trees are easy to understand, implement, and use, and are computationally cheap.

• "It is probably the machine learning workhorse most widely used in practice to date."

• Model comprehensibility is important for communicating with stakeholders who are not data-mining-savvy.
Decision Tree

Employed (categorical) | Balance (continuous) | Age (continuous) | Default
-----------------------|----------------------|------------------|--------
Yes                    | 123,000              | 50               | No
No                     |  51,100              | 40               | Yes
No                     |  68,000              | 55               | No
Yes                    |  34,000              | 46               | Yes
Yes                    |  50,000              | 44               | No
No                     | 100,000              | 50               | Yes
Yes                    |  70,000              | 35               | ?????

Predicting customers who will default on loan payments
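As an illustration of this prediction task, here is a minimal sketch that fits a decision tree on the six labeled rows and predicts the unlabeled one. scikit-learn and the 1/0 encoding of Employed are my assumptions, not from the slides:

```python
# Fit a decision tree on the labeled loan rows, then predict the "?????" row.
from sklearn.tree import DecisionTreeClassifier

# Encode Employed as 1 (Yes) / 0 (No) by hand for brevity.
X = [[1, 123000, 50],
     [0,  51100, 40],
     [0,  68000, 55],
     [1,  34000, 46],
     [1,  50000, 44],
     [0, 100000, 50]]
y = ["No", "Yes", "No", "Yes", "No", "Yes"]

model = DecisionTreeClassifier(criterion="entropy")  # entropy = information gain
model.fit(X, y)

# Predict the last row: Employed=Yes, Balance=70,000, Age=35.
print(model.predict([[1, 70000, 35]]))
```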
Decision Tree: Structure

• An upside-down if-else tree; it starts with all the training data.
• Each node tests an if-else condition on one feature. (Which feature?)
• The dataset is split into subsets based on the condition.
• The root node contains all training examples.
• Each leaf node contains a subset of the training examples.
• (Optional) Numerical features are discretized.

(Figure: an example tree. The root node tests Employed; one branch is a leaf, Class = Not Default, and the other leads to a Balance node split at 50K. That node leads to another Class = Not Default leaf and to an Age node split at 45, whose two branches end in the leaves Class = Not Default and Class = Default.)
The essence of Decision Tree

• The essence of supervised learning (prediction) is to find features that are informative, i.e., have high predictive power.

• Decision tree methods iteratively select a feature so that, after splitting on it, the resulting subsets become more pure/homogeneous. In other words, the selected feature has high predictive power.

• Information gain is one way to measure informativeness.
Entropy

• Entropy H(S) is a measure of the amount of uncertainty/impurity in the dataset S (i.e., entropy characterizes the dataset S). It measures chaos:

  H(S) = − Σₓ p(x) log₂ p(x)

  where p(x) is the proportion of class x in the dataset S.

• E.g., a dataset composed of 16 cases of class "Positive" and 14 cases of class "Negative":

  Entropy(dataset) = − (16/30) log₂(16/30) − (14/30) log₂(14/30) ≈ 0.997

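A minimal sketch (my own illustration) of this computation, checking the 16-Positive / 14-Negative example above:

```python
# Entropy of a dataset, given the counts of each class.
from math import log2

def entropy(class_counts):
    """H(S) = -sum over classes x of p(x) * log2(p(x))."""
    total = sum(class_counts)
    # Skipping zero counts implements the convention 0 * log2(0) = 0.
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

print(entropy([16, 14]))  # ~0.997, matching the example above
```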
Entropy Exercise

Tip: 0 log₂ 0 = 0
Let's play a game. I have someone in my mind, and your job is to guess this person. You can only ask yes/no questions.

This person is a HK celebrity.

Go!
Information Gain

• Information gain is based on the decrease in entropy after a dataset is split on a feature:

  IG(S, A) = H(S) − Σ_{t ∈ T} p(t) · H(t)

  where the subtracted term is the weighted average of the subset entropies, and:
  • H(S) – entropy of set S
  • T – the subsets created by splitting set S on feature A
  • p(t) – the proportion of subset t relative to set S
  • H(t) – entropy of subset t

(Figure: an example split on feature A, "has credit card??", with Yes and No branches.)
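A minimal sketch (my own illustration, not from the slides) of this formula for a binary split; the branch memberships below are made up:

```python
# Information gain for a split, from lists of class labels.
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                for c in set(labels))

def information_gain(parent, subsets):
    """IG(S, A) = H(S) - sum over subsets t of p(t) * H(t)."""
    total = len(parent)
    weighted = sum(len(t) / total * entropy(t) for t in subsets)
    return entropy(parent) - weighted

# Hypothetical split of a 16-Positive / 14-Negative dataset on some feature A.
parent = ["P"] * 16 + ["N"] * 14
yes_branch = ["P"] * 12 + ["N"] * 3   # made-up branch memberships
no_branch  = ["P"] * 4  + ["N"] * 11
print(information_gain(parent, [yes_branch, no_branch]))
```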
Information Gain Example

(Figure: the 16-Positive / 14-Negative dataset, entropy 0.997 before splitting, is split two ways: on A1, "has credit card??", and on A2, "is student??", each with Yes and No branches; two of the resulting subsets have entropies 0.837 and 0.722.)

Splitting on A1 leaves a weighted average entropy of 0.615: IG = 0.997 − 0.615 = 0.382
Splitting on A2 leaves a weighted average entropy of 0.779: IG = 0.997 − 0.779 = 0.218


Information Gain Exercise

(Figure: the same dataset split on feature A1 and on feature A2, each with Yes and No branches, showing the class distributions before and after each split.)

Without calculating, which split (left or right) gives the higher information gain? Which feature (A1 or A2) do we prefer?
Decision Tree: ID3

ID3 performs classification only, and handles only categorical features.

• Step 1: Calculate the information gain of every feature.
• Step 2: Split the set S into subsets using the feature for which the information gain is maximum.
• Step 3: Make a decision tree node containing that feature, divide the dataset by its branches, and repeat the same process on every branch.
• Step 4a: A branch with entropy 0 becomes a leaf node.
• Step 4b: A branch with entropy greater than 0 needs further splitting.
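A minimal sketch of these steps (my own illustration; the row format and function names are assumptions, not from the slides):

```python
# ID3 on categorical features. Rows are dicts mapping feature name -> value.
from math import log2

def entropy(rows, target):
    total = len(rows)
    counts = {}
    for r in rows:
        counts[r[target]] = counts.get(r[target], 0) + 1
    return -sum(c / total * log2(c / total) for c in counts.values())

def info_gain(rows, feature, target):
    # Step 1: entropy before the split minus weighted entropy after it.
    subsets = {}
    for r in rows:
        subsets.setdefault(r[feature], []).append(r)
    total = len(rows)
    after = sum(len(s) / total * entropy(s, target) for s in subsets.values())
    return entropy(rows, target) - after

def id3(rows, features, target):
    labels = {r[target] for r in rows}
    # Step 4a: a pure branch (entropy 0, i.e. one class) becomes a leaf;
    # if no features remain, fall back to the majority class.
    if len(labels) == 1 or not features:
        return max(labels, key=lambda c: sum(r[target] == c for r in rows))
    # Steps 1-2: pick the feature with maximum information gain.
    best = max(features, key=lambda f: info_gain(rows, f, target))
    # Steps 3 and 4b: make a node for it and recurse on every branch.
    tree = {best: {}}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        rest = [f for f in features if f != best]
        tree[best][value] = id3(subset, rest, target)
    return tree
```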
Example Dataset

• Outlook: Sunny, Overcast, Rain
• Temp: Hot, Mild, Cool
• Humidity: High, Normal
• Wind: Weak, Strong
• Decision: Yes (9), No (5)
• On-board demonstration (a code sketch of the same walk-through follows below)
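As a code version of the demonstration, here is the id3 sketch from above run on the classic Play Tennis table (Quinlan, 1986). The slide gives only the feature values and the Yes (9) / No (5) counts; assuming it uses this standard 14-row dataset, which matches those counts:

```python
# Run the id3 sketch above on the classic Play Tennis data.
columns = ["Outlook", "Temp", "Humidity", "Wind", "Decision"]
data = [
    ("Sunny",    "Hot",  "High",   "Weak",   "No"),
    ("Sunny",    "Hot",  "High",   "Strong", "No"),
    ("Overcast", "Hot",  "High",   "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny",    "Mild", "High",   "Weak",   "No"),
    ("Sunny",    "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "Normal", "Weak",   "Yes"),
    ("Sunny",    "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High",   "Strong", "Yes"),
    ("Overcast", "Hot",  "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Strong", "No"),
]
rows = [dict(zip(columns, r)) for r in data]
tree = id3(rows, ["Outlook", "Temp", "Humidity", "Wind"], "Decision")
print(tree)  # the root split is Outlook, the highest-information-gain feature
```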
Recap: The essence of Decision Tree

• The essence of supervised learning (prediction) is to find features that are informative, i.e., have high predictive power.

• Decision tree methods select a feature so that, after splitting on it, the resulting subsets become more homogeneous. In other words, the selected feature has high predictive power.

• Information gain is one way to measure informativeness.
