
ISOM3360 Data Mining for Business Analytics

Decision Tree Learning

Instructor: Yi Yang
Department of ISOM
Spring 2023
• Last lecture
  • Data preparation

• This lecture
  • Decision tree
Data Mining Process

(Figure: overview of the data mining process.)
Supervised Learning

• Classification is used to predict which class (discrete value) a data point belongs to.
  • Examples: fraud detection, customer churn prediction

• Regression is used to predict a continuous value.
  • Examples: stock price prediction, housing price prediction
Training vs. Testing

Two options:

1. Learn the model on all the data, and evaluate it on parts of that same data.

2. Split the data into two parts, learn the model on one part, and evaluate it on the other part.

Recall that an ML model is used to make predictions on unseen data, so option 2 is the right approach.

Supervised learning rule of thumb: never, ever use testing data to learn your model.


Supervised Learning: Data Split

• Training set — a subset used to train the model.

• Test set — a subset used to test the trained model.

A good test set:
• is large enough to yield statistically meaningful results;
• is representative of the data set as a whole. In other words, don't pick a test set with different characteristics than the training set.

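A minimal sketch of such a split, using scikit-learn (an assumed tooling choice; the slides do not name a library):

```python
# Hold out part of the data for testing; scikit-learn's train_test_split
# is one common way to do this. The toy data here is made up.
from sklearn.model_selection import train_test_split

X = [[i] for i in range(100)]    # toy features
y = [i % 2 for i in range(100)]  # toy labels

# Hold out 30% for testing; stratify on y so the test set stays
# representative of the class proportions in the whole dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Rule of thumb: fit on X_train / y_train only; use X_test / y_test
# once, for evaluation.
```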
Decision Tree

• Decision trees are one of the most popular data mining tools.

• Decision trees are easy to understand, implement, and use, and are computationally cheap.

• "It is probably the machine learning workhorse most widely used in practice to date."

• Model comprehensibility is important for communicating with stakeholders who are not data-mining-savvy.
Decision Tree

Employed (categorical) | Balance (continuous) | Age (continuous) | Default
-----------------------|----------------------|------------------|--------
Yes                    | 123,000              | 50               | No
No                     |  51,100              | 40               | Yes
No                     |  68,000              | 55               | No
Yes                    |  34,000              | 46               | Yes
Yes                    |  50,000              | 44               | No
No                     | 100,000              | 50               | Yes
Yes                    |  70,000              | 35               | ?????

Predicting customers who will default on loan payments
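As an illustration of this prediction task, here is a minimal sketch that fits a decision tree on the six labeled rows and predicts the unlabeled one. scikit-learn and the 1/0 encoding of Employed are my assumptions, not from the slides:

```python
# Fit a decision tree on the labeled loan rows, then predict the "?????" row.
from sklearn.tree import DecisionTreeClassifier

# Encode Employed as 1 (Yes) / 0 (No) by hand for brevity.
X = [[1, 123000, 50],
     [0,  51100, 40],
     [0,  68000, 55],
     [1,  34000, 46],
     [1,  50000, 44],
     [0, 100000, 50]]
y = ["No", "Yes", "No", "Yes", "No", "Yes"]

model = DecisionTreeClassifier(criterion="entropy")  # entropy = information gain
model.fit(X, y)

# Predict the last row: Employed=Yes, Balance=70,000, Age=35.
print(model.predict([[1, 70000, 35]]))
```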
Decision Tree: Structure

• An upside-down if-else tree; it starts with all the training data.
• Each node tests an if-else condition on one feature. (Which feature?)
• The dataset is split into subsets based on the condition.
• The root node contains all training examples.
• Each leaf node contains a subset of the training examples.
• (Optional) Numerical features are discretized.

(Figure: an example tree. The root node tests Employed; one branch is a leaf, Class = Not Default, and the other leads to a Balance node split at 50K. That node leads to another Class = Not Default leaf and to an Age node split at 45, whose two branches end in the leaves Class = Not Default and Class = Default.)
The essence of Decision Tree

• The essence of supervised learning (prediction) is to find features that are informative, i.e., have high predictive power.

• Decision tree methods iteratively select a feature so that, after splitting on it, the resulting subsets become more pure/homogeneous. In other words, the selected feature has high predictive power.

• Information gain is one way to measure informativeness.
Entropy

• Entropy H(S) is a measure of the amount of uncertainty/impurity in the dataset S (i.e., entropy characterizes the dataset S). It measures chaos:

  H(S) = − Σₓ p(x) log₂ p(x)

  where p(x) is the proportion of class x in the dataset S.

• E.g., a dataset composed of 16 cases of class "Positive" and 14 cases of class "Negative":

  Entropy(dataset) = − (16/30) log₂(16/30) − (14/30) log₂(14/30) ≈ 0.997

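A minimal sketch (my own illustration) of this computation, checking the 16-Positive / 14-Negative example above:

```python
# Entropy of a dataset, given the counts of each class.
from math import log2

def entropy(class_counts):
    """H(S) = -sum over classes x of p(x) * log2(p(x))."""
    total = sum(class_counts)
    # Skipping zero counts implements the convention 0 * log2(0) = 0.
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

print(entropy([16, 14]))  # ~0.997, matching the example above
```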
Entropy Exercise

Tip: 0 log₂ 0 = 0
Let's play a game. I have someone in my mind, and your job is to guess this person. You can only ask yes/no questions.

This person is a HK celebrity.

Go!
Information Gain

• Information gain is based on the decrease in entropy after a dataset is split on a feature:

  IG(S, A) = H(S) − Σ_{t ∈ T} p(t) · H(t)

  where the subtracted term is the weighted average of the subset entropies, and:
  • H(S) – entropy of set S
  • T – the subsets created by splitting set S on feature A
  • p(t) – the proportion of subset t relative to set S
  • H(t) – entropy of subset t

(Figure: an example split on feature A, "has credit card??", with Yes and No branches.)
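A minimal sketch (my own illustration, not from the slides) of this formula for a binary split; the branch memberships below are made up:

```python
# Information gain for a split, from lists of class labels.
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                for c in set(labels))

def information_gain(parent, subsets):
    """IG(S, A) = H(S) - sum over subsets t of p(t) * H(t)."""
    total = len(parent)
    weighted = sum(len(t) / total * entropy(t) for t in subsets)
    return entropy(parent) - weighted

# Hypothetical split of a 16-Positive / 14-Negative dataset on some feature A.
parent = ["P"] * 16 + ["N"] * 14
yes_branch = ["P"] * 12 + ["N"] * 3   # made-up branch memberships
no_branch  = ["P"] * 4  + ["N"] * 11
print(information_gain(parent, [yes_branch, no_branch]))
```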
Information Gain Example

(Figure: the 16-Positive / 14-Negative dataset, entropy 0.997 before splitting, is split two ways: on A1, "has credit card??", and on A2, "is student??", each with Yes and No branches; two of the resulting subsets have entropies 0.837 and 0.722.)

Splitting on A1 leaves a weighted average entropy of 0.615: IG = 0.997 − 0.615 = 0.382
Splitting on A2 leaves a weighted average entropy of 0.779: IG = 0.997 − 0.779 = 0.218


Information Gain Exercise

(Figure: the same dataset split on feature A1 and on feature A2, each with Yes and No branches, showing the class distributions before and after each split.)

Without calculating, which split (left or right) gives the higher information gain? Which feature (A1 or A2) do we prefer?
Decision Tree: ID3

ID3 performs classification only, and handles only categorical features.

• Step 1: Calculate the information gain of every feature.
• Step 2: Split the set S into subsets using the feature for which the information gain is maximum.
• Step 3: Make a decision tree node containing that feature, divide the dataset by its branches, and repeat the same process on every branch.
• Step 4a: A branch with entropy 0 becomes a leaf node.
• Step 4b: A branch with entropy greater than 0 needs further splitting.
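A minimal sketch of these steps (my own illustration; the row format and function names are assumptions, not from the slides):

```python
# ID3 on categorical features. Rows are dicts mapping feature name -> value.
from math import log2

def entropy(rows, target):
    total = len(rows)
    counts = {}
    for r in rows:
        counts[r[target]] = counts.get(r[target], 0) + 1
    return -sum(c / total * log2(c / total) for c in counts.values())

def info_gain(rows, feature, target):
    # Step 1: entropy before the split minus weighted entropy after it.
    subsets = {}
    for r in rows:
        subsets.setdefault(r[feature], []).append(r)
    total = len(rows)
    after = sum(len(s) / total * entropy(s, target) for s in subsets.values())
    return entropy(rows, target) - after

def id3(rows, features, target):
    labels = {r[target] for r in rows}
    # Step 4a: a pure branch (entropy 0, i.e. one class) becomes a leaf;
    # if no features remain, fall back to the majority class.
    if len(labels) == 1 or not features:
        return max(labels, key=lambda c: sum(r[target] == c for r in rows))
    # Steps 1-2: pick the feature with maximum information gain.
    best = max(features, key=lambda f: info_gain(rows, f, target))
    # Steps 3 and 4b: make a node for it and recurse on every branch.
    tree = {best: {}}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        rest = [f for f in features if f != best]
        tree[best][value] = id3(subset, rest, target)
    return tree
```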
Example Dataset

• Outlook: Sunny, Overcast, Rain
• Temp: Hot, Mild, Cool
• Humidity: High, Normal
• Wind: Weak, Strong
• Decision: Yes (9), No (5)
• On-board demonstration (a code sketch of the same walk-through follows below)
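As a code version of the demonstration, here is the id3 sketch from above run on the classic Play Tennis table (Quinlan, 1986). The slide gives only the feature values and the Yes (9) / No (5) counts; assuming it uses this standard 14-row dataset, which matches those counts:

```python
# Run the id3 sketch above on the classic Play Tennis data.
columns = ["Outlook", "Temp", "Humidity", "Wind", "Decision"]
data = [
    ("Sunny",    "Hot",  "High",   "Weak",   "No"),
    ("Sunny",    "Hot",  "High",   "Strong", "No"),
    ("Overcast", "Hot",  "High",   "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny",    "Mild", "High",   "Weak",   "No"),
    ("Sunny",    "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "Normal", "Weak",   "Yes"),
    ("Sunny",    "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High",   "Strong", "Yes"),
    ("Overcast", "Hot",  "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Strong", "No"),
]
rows = [dict(zip(columns, r)) for r in data]
tree = id3(rows, ["Outlook", "Temp", "Humidity", "Wind"], "Decision")
print(tree)  # the root split is Outlook, the highest-information-gain feature
```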
Recap: The essence of Decision Tree

• The essence of supervised learning (prediction) is to find features that are informative, i.e., have high predictive power.

• Decision tree methods select a feature so that, after splitting on it, the resulting subsets become more homogeneous. In other words, the selected feature has high predictive power.

• Information gain is one way to measure informativeness.
