
AI Foundations and Applications

5. Decision Trees

Thien Huynh-The
HCM City Univ. Technology and Education
Jan, 2023
Decision Trees

• Previous techniques have involved real-valued (or discrete-valued) feature vectors and natural measures of distance (e.g., Euclidean).
• Consider a classification problem that involves nominal data – data described by a
list of attributes (e.g., categorizing people as short or tall using gender, height, age,
and ethnicity).
• How can we use such nominal data for
classification?
• How can we learn the categories of such
data?
• Nonmetric methods such as decision trees
provide a way to deal with such data.



Decision Trees

• Decision trees attempt to classify a pattern through a sequence of questions. For example, attributes such as gender and height can be used to classify people as short or tall. But the best threshold for height is gender-dependent.
• A decision tree consists of nodes and leaves, with each leaf denoting a class.
• Classes (tall or short) are the outputs of the tree.
• Attributes (gender and height) are a set of features that describe the data.

• The input data consists of values of the different attributes. Using these attribute values, the decision tree generates a class as the output for each input.

[Figure: a decision tree that first splits on gender (male / female), then on height (>1m7 vs ≤1m7 for male, >1m55 vs ≤1m55 for female); the leaves are the classes tall and short.]
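As a minimal illustration (not part of the original slides), the tree in the figure can be written as nested conditionals in Python; the 1m7 and 1m55 thresholds are the ones shown in the figure:

def classify(gender, height_m):
    """Classify a person as 'tall' or 'short' using the tree in the figure above."""
    if gender == "male":
        return "tall" if height_m > 1.70 else "short"
    else:  # female
        return "tall" if height_m > 1.55 else "short"

print(classify("male", 1.75))    # tall
print(classify("female", 1.50))  # short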



Basic Principles

• The top, or first, node is called the root node.
• The last level of nodes consists of the leaf nodes, which contain the final classification.
• The intermediate nodes are the descendant (or “hidden”) nodes.
• Binary trees, like the one shown below, are the most popular type of tree. However, M-ary trees (M branches at each node) are possible.

[Figure: the same gender/height decision tree as on the previous slide.]



Basic Principles

• Nodes can contain one or more questions. In a binary tree, by convention, if the answer to a question is “yes”, the left branch is selected. Note that the same question can appear in multiple places in the tree.
• Decision trees have several benefits over neural network-type approaches,
including interpretability and data-driven learning.
• Key questions include how to grow the tree, how to stop growing, and how to prune
the tree to increase generalization.
• Decision trees are very powerful and can give excellent performance on closed-set
testing. Generalization is a challenge.



ID3 Algorithm

• Benefits
• Can represent any Boolean function.
• Can be viewed as a way to compactly represent a lot of data.
• Natural representation (20 questions).
• Evaluating a decision tree classifier is easy.
• Clearly, given data, there are many ways to represent it as a decision tree.
• Learning a good representation from data is the challenge.

[Figure: the play-tennis decision tree with Outlook at the root (Sunny / Overcast / Rain); Sunny → Humidity (High → No, Normal → Yes); Overcast → Yes; Rain → Wind (Strong → No, Weak → Yes).]



ID3 Algorithm

• Given data, you can always represent it using a decision tree; so what makes a "good" decision tree?
• Consider the following example: Will I play tennis today?
• Features
• Outlook: {Sun, Overcast, Rain}
• Temperature: {Hot, Mild, Cool}
• Humidity: {High, Normal, Low}
• Wind: {Strong, Weak}
• Labels
• Binary classification task: Y = {+, -}
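For concreteness (an illustration added here, not from the slides), one hypothetical training example can be represented in Python as a mapping from attributes to values together with a label:

# one hypothetical example for "Will I play tennis today?"
example = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Weak"}
label = "-"  # a binary label drawn from Y = {+, -}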



Example

• Outlook: S(unny)
O(vercast)
R(ainy)
• Temperature: H(ot),
M(edium),
C(ool)
• Humidity: H(igh)
N(ormal)
L(ow)
• Wind: S(trong)
W(eak)



Example

• Data is processed in batch (i.e., all the data is available).
• Recursively build a decision tree top-down.

[Figure: the play-tennis decision tree built from the data: Outlook at the root; Sunny → Humidity (High → No, Normal → Yes); Overcast → Yes; Rain → Wind (Strong → No, Weak → Yes).]



Example

• Let S be the set of examples
• Label is the target attribute (the prediction)
• Attributes is the set of measured attributes

• ID3(S, Attributes, Label)
   If all examples in S have the same label, return a single-node tree with that label
   Otherwise begin
      Create a Root node for the tree
      A = the attribute in Attributes that best classifies S
      For each possible value v of A
         Add a new tree branch corresponding to A = v
         Let Sv be the subset of examples in S with A = v
         If Sv is empty: add a leaf node with the most common value of Label in S
         Else: below this branch add the subtree ID3(Sv, Attributes - {A}, Label)
   End
   Return Root
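A compact Python sketch of this recursion (an illustration, not the course implementation; attribute selection by information gain is defined on the following slides):

import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Expected reduction in entropy from splitting on one attribute."""
    remainder = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, labels, attributes):
    """Build a decision tree as nested dicts; leaves are labels."""
    if len(set(labels)) == 1:            # all examples share one label
        return labels[0]
    if not attributes:                   # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    tree = {best: {}}
    for value in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == value]
        tree[best][value] = id3([examples[i] for i in idx],
                                [labels[i] for i in idx],
                                [a for a in attributes if a != best])
    return tree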
Picking the Root Attribute

• The goal is to have the resulting decision tree as small as possible (Occam’s Razor)
• But, finding the minimal decision tree consistent with the data is NP-hard
• The recursive algorithm is a greedy heuristic search for a simple tree, but cannot
guarantee optimality.
• The main decision of the algorithm is the selection of the next attribute to condition
on.
• Consider data with two Boolean attributes (A,B).
< (A=0, B=0), - >: 50 examples
< (A=0, B=1), - >: 50 examples
< (A=1, B=0), - >: 0 examples
< (A=1, B=1), + >: 100 examples
• What should be the first attribute we select?
Picking the Root Attribute

• Consider data with two Boolean attributes (A, B).
< (A=0, B=0), - >: 50 examples
< (A=0, B=1), - >: 50 examples
< (A=1, B=0), - >: 0 examples
< (A=1, B=1), + >: 100 examples
• What should be the first attribute we select?
• Splitting on A: we get purely labeled nodes.
• Splitting on B: we don’t get purely labeled nodes.
• What if we have < (A=1, B=0), - >: 3 examples?
• (One way to think about it: # of queries required to label a random data point.)

[Figure: splitting on A gives the leaves A=1 → + and A=0 → -; splitting on B gives B=0 → - and B=1 → a further split on A (A=1 → +, A=0 → -).]



• Consider data with two Boolean attributes (A, B), now with 3 examples of < (A=1, B=0), - >:
< (A=0, B=0), - >: 50 examples
< (A=0, B=1), - >: 50 examples
< (A=1, B=0), - >: 3 examples
< (A=1, B=1), + >: 100 examples
• What should be the first attribute we select?
• The trees look structurally similar; which attribute should we choose?
• One way to think about it: # of queries required to label a random data point.
• If we choose A we have less uncertainty about the labels.

[Figure: splitting on A: A=0 → - (100 examples); A=1 → split on B (B=1 → + with 100 examples, B=0 → - with 3). Splitting on B: B=0 → - (53 examples); B=1 → split on A (A=1 → + with 100 examples, A=0 → - with 50).]
Picking the Root Attribute

• The goal is to have the resulting decision tree as small as possible (Occam’s
Razor)
• The main decision in the algorithm is the selection of the next attribute to
condition on.
• We want attributes that split the examples into sets that are relatively pure in one label; this way we are closer to a leaf node.
• The most popular heuristic is based on information gain and originated with Quinlan’s ID3 system.



Entropy

• Entropy (impurity, disorder) of a set of examples, S, relative to a binary classification is:

   Entropy(S) = -p+ log2(p+) - p- log2(p-)

   where p+ is the proportion of positive examples in S and p- is the proportion of negative examples in S.
   – If all the examples belong to the same category [(1,0) or (0,1)]: Entropy = 0
   – If all the examples are equally mixed (0.5, 0.5): Entropy = 1
   – Entropy = level of uncertainty.
• In general, when pi is the fraction of examples labeled i:

   Entropy(S) = - Σ_i pi log2(pi)

• Entropy can be viewed as the number of bits required, on average, to encode the class label of an example. If the probability of + is 0.5, a single bit is required for each example; if it is 0.8, we can use less than 1 bit.
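A quick Python check (illustration only) of the two cases mentioned in the last bullet:

import math

def binary_entropy(p_plus):
    """Entropy in bits of a binary label distribution (p+, 1 - p+)."""
    return -sum(p * math.log2(p) for p in (p_plus, 1.0 - p_plus) if p > 0)

print(binary_entropy(0.5))  # 1.0 bit per example
print(binary_entropy(0.8))  # about 0.722 bits per example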
Entropy

• Entropy (impurity, disorder) of a set of examples, S, relative to a binary classification is:

   Entropy(S) = -p+ log2(p+) - p- log2(p-)

   where p+ is the proportion of positive examples in S and p- is the proportion of negative examples in S.
   – If all the examples belong to the same category: Entropy = 0
   – If all the examples are equally mixed (0.5, 0.5): Entropy = 1
   – Entropy = level of uncertainty.

Test yourself: assign high, medium, and low entropy to each of the label distributions shown on the slide. Which one has the lowest entropy (i.e., the least uncertainty)?

[Figure: three label distributions (not recovered from the slide).]


Exercise

Calculate the entropy for series 1 and series 2 using the example distributions given on the slide.



Information Gain

• The information gain of an attribute a is the expected reduction in entropy caused by partitioning on this attribute:

   Gain(S, a) = Entropy(S) - Σ_{v ∈ values(a)} (|Sv| / |S|) × Entropy(Sv)

• Where:
   – Sv is the subset of S for which attribute a has value v, and
   – the entropy of the partitioned data is calculated by weighting the entropy of each partition by its size relative to the original set.
• Partitions of low entropy (nearly pure subsets) lead to high gain.
• Go back to check which of the A, B splits is better (a quick numeric check follows below).
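A small sketch (illustration only) that re-checks the A/B example from the earlier slides, using the variant with 3 examples of < (A=1, B=0), - >:

import math

def entropy(pos, neg):
    """Entropy in bits of a node with pos positive and neg negative examples."""
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c > 0)

# counts from the slides: 100 positive and 103 negative examples in total
root = entropy(100, 103)

# splitting on A: A=1 -> (100+, 3-), A=0 -> (0+, 100-)
split_A = (103 / 203) * entropy(100, 3) + (100 / 203) * entropy(0, 100)

# splitting on B: B=1 -> (100+, 50-), B=0 -> (0+, 53-)
split_B = (150 / 203) * entropy(100, 50) + (53 / 203) * entropy(0, 53)

print(root - split_A)  # gain of A, about 0.90
print(root - split_B)  # gain of B, about 0.32, so A is the better split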



Example

• Outlook: S(unny)
O(vercast)
R(ainy)
• Temperature: H(ot),
M(edium),
C(ool)
• Humidity: H(igh)
N(ormal)
L(ow)
• Wind: S(trong)
W(eak)



Example - Entropy

Calculate the current entropy of the full data set (9 positive and 5 negative examples):

   p+ = 9/14 ;  p- = 5/14

   Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.94

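A one-line check (illustration only):

from math import log2
print(-(9/14) * log2(9/14) - (5/14) * log2(5/14))  # about 0.940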


Example – Information Gain of Outlook

Gain(S, a) = Entropy(S) - Σ_{v ∈ values(a)} (|Sv| / |S|) × Entropy(Sv)
Outlook = sunny:
Entropy(O = S) = 0.971
Outlook = overcast:
Entropy(O = O) = 0
Outlook = rainy:
Entropy(O = R) = 0.971

Expected entropy
= (5/14)×0.971 + (4/14)×0 + (5/14)×0.971 = 0.694

Information gain = 0.940 – 0.694 = 0.246



Example – Information Gain of Humidity

Gain(S, a) = Entropy(S) - Σ_{v ∈ values(a)} (|Sv| / |S|) × Entropy(Sv)
Humidity = high:
Entropy(H = H) = 0.985
Humidity = Normal:
Entropy(H = N) = 0.592

Expected entropy
= (7/14)×0.985 + (7/14)×0.592 = 0.7885

Information gain = 0.940 – 0.7885 = 0.1515



Which feature to split on?

Information gain:
Outlook: 0.246
Humidity: 0.151
Wind: 0.048
Temperature: 0.029

→ Split on Outlook
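The gains above can be reproduced with a short script. The sketch below assumes the standard 14-example play-tennis data set (Quinlan/Mitchell); this assumption is consistent with the entropy (0.94) and the gain values shown on these slides:

import math
from collections import Counter

# assumed data: (Outlook, Temperature, Humidity, Wind, Play)
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),     ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
attributes = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(column):
    """Information gain of splitting on attribute column 0..3."""
    labels = [row[-1] for row in data]
    expected = 0.0
    for value in set(row[column] for row in data):
        subset = [row[-1] for row in data if row[column] == value]
        expected += len(subset) / len(data) * entropy(subset)
    return entropy(labels) - expected

for i, name in enumerate(attributes):
    # prints roughly 0.247, 0.029, 0.152, 0.048; the slide rounds these to 0.246, 0.029, 0.151, 0.048
    print(name, round(gain(i), 3))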



Complete the tree

• Students should complete and draw the tree that can make a decision based on the values of the given attributes.



Student’s tasks

• Stopping criteria in decision trees
• Overfitting in decision trees and pruning techniques
• Read here: https://machinelearningcoban.com/2018/01/14/id3/
• Random forest algorithm
• Read here: https://machinelearningcoban.com/tabml_book/ch_model/random_forest.html
  https://www.mathworks.com/help/stats/framework-for-ensemble-learning.html



Python
from __future__ import print_function
import numpy as np
import pandas as pd

class TreeNode(object):
    def __init__(self, ids=None, children=[], entropy=0, depth=0):
        self.ids = ids                    # indices of the data points in this node
        self.entropy = entropy            # entropy of this node, will be filled later
        self.depth = depth                # distance to the root node
        self.split_attribute = None       # which attribute is chosen for splitting, if non-leaf
        self.children = children          # list of its child nodes
        self.order = None                 # order of values of split_attribute in children
        self.label = None                 # label of the node if it is a leaf

    def set_properties(self, split_attribute, order):
        self.split_attribute = split_attribute  # split at which attribute
        self.order = order                      # order of this node's children

    def set_label(self, label):
        self.label = label                      # set label if the node is a leaf



Python
def entropy(freq):
    # drop zero counts so that 0 * log(0) does not appear
    freq_0 = freq[np.array(freq).nonzero()[0]]
    prob_0 = freq_0 / float(freq_0.sum())
    return -np.sum(prob_0 * np.log(prob_0))

# DecisionTreeID3 is defined in the tutorial linked below
df = pd.read_csv('weather.csv', index_col=0)  # pd.DataFrame.from_csv was removed in newer pandas
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
tree = DecisionTreeID3(max_depth=3, min_samples_split=2)
tree.fit(X, y)
print(tree.predict(X))

https://machinelearningcoban.com/2018/01/14/id3/



Matlab

• Students practice by following the examples:

https://www.mathworks.com/help/stats/train-decision-trees-in-classification-learner-app.html

https://www.mathworks.com/help/stats/view-decision-tree.html

