
MACHINE LEARNING

• Humans can supply rules to logical reasoning programs

• Another way is to impart to the machines themselves the ability
  to construct rules. This is called Machine Learning

• The machine is given raw data and is supposed to form rules
  (i.e. a model or concept) about the process from which the data
  was generated

• Two types, depending on the kind of data:
  Symbolic machine learning (symbolic data)
  Numeric machine learning (numeric data)

1
DECISION TREES

Most widely used learning method

It is a method that induces concepts from examples

The learning is supervised: i.e. the classes or categories of the
data instances are known

It represents concepts as decision trees, a representation that
allows us to determine the classification of an object by testing
its values for certain properties

2
IDENTIFICATION TREES

We may think of each property of an instance as contributing a
certain amount of information to its classification.

For example, if our goal is to determine the species of an animal,
the discovery that it lays eggs contributes a certain amount of
information to that goal.

3
Definition

● A decision tree is a classifier in the form of a tree structure
  – Decision node: specifies a test on a single attribute
  – Leaf node: indicates the value of the target attribute
  – Arc/edge: one outcome (split value) of the parent attribute’s test
  – Path: a conjunction of tests leading to the final decision

● Decision trees classify instances or examples by starting at the
  root of the tree and moving through it until a leaf node is
  reached.
Why decision tree?

■ Decision trees are powerful and popular tools for
  classification and prediction.
■ Decision trees represent rules, which can be understood by
  humans and used in knowledge systems such as databases.
Key requirements

■ Attribute-value description: each object or case must be
  expressible in terms of a fixed collection of properties or
  attributes (e.g., hot, mild, cold).
■ Predefined classes (target values): the target function has
  discrete output values (Boolean or multiclass).
■ Sufficient data: enough training cases must be provided to
  learn the model.
Concept Learning

■ E.g., learn the concept “edible mushroom”
  ◻ The target function has two values: T or F
■ Represent concepts as decision trees
■ Use hill-climbing search through the space of decision trees
  ◻ Start with a simple concept
  ◻ Refine it into a more complex concept as needed

7
IDENTIFICATION TREES

From Data to Identification Trees

Step 1: Collect data (the sunburn training set, shown in full on slide 25)

8
IDENTIFICATION TREES

From Identification Trees to Rules

Step 2: Create Decision Tree

Hair
  Blonde → Lotion Used
    No  → Sunburned (Sarah, Annie)
    Yes → No sunburn (Dana, Katie)
  Red → Sunburned (Emily)
  Brown → No sunburn (Alex, Pete, John)
9
IDENTIFICATION TREES

From Identification Trees to Rules

Step 3: Make rules from the identification tree

For our example we have:

If the person’s hair is blonde
   and the person uses lotion
then the person is not sunburned

If the person’s hair is blonde
   and the person uses no lotion
then the person is sunburned

10
IDENTIFICATION TREES

From Identification Trees to Rules

Step 3: Make rules from the identification tree

For our example we have:

If the person’s hair is red
then the person is sunburned

If the person’s hair is brown
then the person is not sunburned

11
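As a minimal sketch, the four extracted rules can be written directly as ordinary code; attribute values follow the data table on slide 25, and the function name is ours, not part of the slides.

```python
# A minimal sketch: the four rules extracted from the identification tree,
# written as a plain Python function. Attribute values follow the table on
# slide 25 (hair, lotion); class labels are "Sunburned" / "None".
def classify_sunburn(hair: str, lotion: str) -> str:
    if hair == "Blonde" and lotion == "Yes":
        return "None"          # blonde + lotion -> not sunburned
    if hair == "Blonde" and lotion == "No":
        return "Sunburned"     # blonde + no lotion -> sunburned
    if hair == "Red":
        return "Sunburned"
    if hair == "Brown":
        return "None"
    raise ValueError("no rule fires for this combination of attributes")

# Example: Sarah is blonde and uses no lotion.
print(classify_sunburn("Blonde", "No"))   # -> Sunburned
```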
IDENTIFICATION TREES

From Identification Trees to Rules

Step 4: Optimize the rules (prune the antecedents)

To simplify a rule, you ask whether any of the antecedents can be
eliminated without changing what the rule does on the samples.

Example:
If hair is blonde and person uses lotion then no sunburn

If we eliminate the 1st antecedent and check the rule over the
whole database, we find that there are no misclassifications.
Hence we can drop this antecedent as unnecessary.
12
IDENTIFICATION TREES

From Identification Trees to Rules

Step 4: Optimize the rules (prune the antecedents)

Example:
If hair is blonde and person uses lotion then no sunburn

If we eliminate the 1st antecedent and check the shortened rule
(“if person uses lotion then no sunburn”) over the whole database,
we find that there are no misclassifications. Hence we can drop
this antecedent as unnecessary.

                  R1 = Sunburn    R2 = No Sunburn
  Lotion = Yes        l = 0            m = 3

13
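A minimal sketch of this pruning check over the sunburn data from slide 25; function and variable names are ours. It reproduces the counts shown on slides 13 and 14.

```python
# A minimal sketch of the antecedent-pruning check, assuming the sunburn
# data set from slide 25. We drop one antecedent from the rule
# "blonde AND lotion -> no sunburn" and count misclassifications of the
# shortened rule over the whole database.
DATA = [  # (hair, lotion, result)
    ("Blonde", "No",  "Sunburned"), ("Blonde", "Yes", "None"),
    ("Brown",  "Yes", "None"),      ("Blonde", "No",  "Sunburned"),
    ("Red",    "No",  "Sunburned"), ("Brown",  "No",  "None"),
    ("Brown",  "No",  "None"),      ("Blonde", "Yes", "None"),
]

def misclassified(antecedent, conclusion):
    """Count covered samples whose actual class disagrees with the rule."""
    return sum(1 for hair, lotion, result in DATA
               if antecedent(hair, lotion) and result != conclusion)

# Full rule: blonde AND lotion -> None ; shortened rules drop one antecedent.
print(misclassified(lambda h, l: h == "Blonde" and l == "Yes", "None"))  # 0
print(misclassified(lambda h, l: l == "Yes", "None"))  # 0 -> safe to drop "blonde"
print(misclassified(lambda h, l: h == "Blonde", "None"))  # 2 -> cannot drop "uses lotion"
```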
IDENTIFICATION TREES

From Identification Trees to Rules

Step 4: Optimize the rules (prune the antecedents)

Example:
If hair is blonde and person uses lotion then no sunburn

If we eliminate the 2nd antecedent, the resulting shortened rule
(“if hair is blonde then no sunburn”) is not consistent with the
data, hence that antecedent cannot be eliminated.

                  R1 = Sunburn    R2 = No Sunburn
  Blonde = Yes        l = 2            m = 2

14
IDENTIFICATION TREES

From Identification Trees to Rules

Step 4: Optimize the rules (prune the antecedents)

Consider the following table

                  R1 = Canary    R2 = Crow
  Color = Yellow    l = 1000       m = 0

It is evident that the rule
    If color is yellow then bird is canary
is sufficient

15
IDENTIFICATION TREES

From Identification Trees to Rules

Step 4: Optimize the rules (prune the antecedents)

However, if we have

                  R1 = Canary    R2 = Crow
  Color = Yellow    l = 999        m = 1

we observe that the rule is not consistent with the data, and more
antecedents would have to be incorporated (by decision tree
expansion) to make the rule valid for the whole data set.

However, we should ask whether doing so is worth the trouble.
Is the simplified rule worth an occasional error?
Also, might the single conflicting entry just be noise?

16
IDENTIFICATION TREES

From Identification Trees to Rules

Step 4: Optimize the rules (eliminate unnecessary rules)

Rules leading to one label can be replaced by a default rule:

    If no other rule applies
    then label x

However, this involves a very big assumption: that all of the
uncovered concept space belongs to label x.

This may mean a misclassification of an unknown instance.

17
IDENTIFICATION TREES

From Identification Trees to Rules

Step 4: Optimize the rules (eliminate unnecessary rules)

Without the default rule, we would have known that a new instance
does not fire any rule, and it could be set aside to be classified
manually.

This sometimes leads to the formation of new rules for the
category of that instance.

It may also lead to the discovery of a new, as yet unknown,
category.

18
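A minimal sketch of the difference between using a default rule and setting unmatched instances aside; the rule set and attribute names are illustrative, not from the slides.

```python
# A minimal sketch contrasting a rule base with and without a default rule.
# Rules are (condition, label) pairs; the conditions are hypothetical
# predicates over an instance dictionary.
RULES = [
    (lambda x: x["color"] == "Yellow", "Canary"),
    (lambda x: x["color"] == "Black",  "Crow"),
]

def classify(instance, default=None):
    for condition, label in RULES:
        if condition(instance):
            return label
    # No rule fires: either fall back on the default label, or return
    # None so the instance can be set aside for manual classification.
    return default

print(classify({"color": "Green"}))                  # None -> set aside; maybe a new category
print(classify({"color": "Green"}, default="Crow"))  # "Crow" -> possible misclassification
```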
IDENTIFICATION TREES

From Identification Trees to Rules

Step 4: Optimize the rules (eliminate unnecessary rules)

Consider the following table

                  R1 = Canary    R2 = Crow
  Color = Yellow    l = 1000       m = 0
  Color = Black     n = 0          o = 1000

For this we have to have the following rules:
    If Color is Yellow then bird is Canary
    If Color is Black then bird is Crow

19
IDENTIFICATION TREES

From Identification Trees to Rules

Step 4: Optimize the rules (eliminate unnecessary rules)

If Color is Yellow then bird is Canary
If Color is Black then bird is Crow

Both rules are necessary. If we drop one of them we will not be
able to classify 50% of the samples.

20
IDENTIFICATION TREES

From Identification Trees to Rules

Step 4: Optimize the rules (eliminate unnecessary rules)

Now consider the following table

                  R1 = Canary    R2 = Crow
  Color = Yellow    l = 999        m = 0
  Color = Black     n = 0          o = 1

In this case we have the rules:
    If Color is Yellow then Bird is Canary
    If Color is Black then Bird is Crow

21
IDENTIFICATION TREES

From Identification Trees to Rules

Step 4: Optimize the rules (eliminate unnecessary rules)

In this case we have the rules:
    If Color is Yellow then Bird is Canary
    If Color is Black then Bird is Crow

If we drop the 2nd rule, we would misclassify only 1 sample in
1000 samples.

We should ask ourselves:
Is the rule elimination worth an occasional error?
Also, maybe the single entry is noise?

22
IDENTIFICATION TREES

Identification Trees

For any dimension (attribute), we partition the set of training
examples into subsets.

The number of subsets is the number of values of the dimension;
the examples in a subset all have the same value for that
dimension.

For example: risk is a dimension.
It has three values: high, moderate, low.

This has the effect of cutting the example space with
hyper-planes.
23
IDENTIFICATION TREES

Identification Trees

The dimension to be partitioned first is the one with the lowest
average disorder of the resulting partitions.

The space has been partitioned sufficiently when all the training
examples in a partition have the same class (category) label.

To classify an instance whose class label is undetermined, we
simply place it in our concept space and note the label of the
partition which encloses it.

24
ID3 Example
Name Hair Height Weight Lotion Result

Sarah Blonde Average Light No Sunburned

Dana Blonde Tall Average Yes None

Alex Brown Short Average Yes None

Annie Blonde Short Average No Sunburned

Emily Red Average Heavy No Sunburned

Pete Brown Tall Heavy No None

John Brown Average Heavy No None

Katie Blonde Short Light Yes None

25
Average Disorder
■ Average Disorder = Σb (Nb / Nt) × Σc ( -(Nbc / Nb) log2 (Nbc / Nb) )

  ◻ where Nb is the number of samples in branch b,
  ◻ Nt is the total number of samples in all branches,
  ◻ Nbc is the total number of samples in branch b of class c

26
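A minimal sketch of this formula in Python (function names are ours), checked against the Hair attribute of the sunburn data on slide 25.

```python
# A minimal sketch of the average-disorder formula above, evaluated for the
# Hair attribute of the sunburn data (slide 25). Only the Hair and Result
# columns are needed here.
from collections import Counter
from math import log2

hair   = ["Blonde", "Blonde", "Brown", "Blonde", "Red", "Brown", "Brown", "Blonde"]
result = ["Sunburned", "None", "None", "Sunburned", "Sunburned", "None", "None", "None"]

def branch_disorder(labels):
    """-(sum over classes c) (Nbc/Nb) log2(Nbc/Nb) for one branch."""
    nb = len(labels)
    return -sum((nbc / nb) * log2(nbc / nb) for nbc in Counter(labels).values())

def average_disorder(values, labels):
    """(sum over branches b) (Nb/Nt) * branch_disorder(branch b)."""
    nt = len(values)
    return sum((values.count(v) / nt) *
               branch_disorder([l for x, l in zip(values, labels) if x == v])
               for v in set(values))

print(average_disorder(hair, result))   # 0.5, as computed on the next slides
```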
ID3 Example (cont.)

Attribute Name Attribute Values Attribute Occurrences

Hair Blonde 4

Brown 3

Red 1

27
ID3 Example (cont.)
■ Blonde = 4/8 (-2/4 log2 2/4 -2/4 log2 2/4)
= 4/8 (0.5 + 0.5)
= 0.5
■ Brown = 3/8 (-3/3 log2 3/3)
= 3/8 (-1 log2 1)
=0
■ Red = 1/8 (-1 log2 1)
=0
28
ID3 Example (cont.)
■ Average Disorder (Hair) = Blonde + Brown + Red
                          = 0.5 + 0 + 0
                          = 0.5

29
ID3 Example (cont.)
■ Similarly Average Disorder for other
attributes can be calculated; which turns
out to be
■ Average Disorder (Hair) = 0.5
■ Average Disorder (Height) = 0.6886
■ Average Disorder (Weight) = 0.9386
■ Average Disorder (Lotion)= 0.6067

30
ID3 Example (cont.)
■ The most homogeneous attribute (lowest average disorder) is
  Hair, so Hair is the first test. The tree so far:

  Hair
    Blonde
    Red
    Brown

31
ID3 Example (cont.)
■ With red and brown hair colors, the training set is completely
  classified. The only remaining problem is the blonde branch.
Attribute Name Attribute Values Attribute Occurrences

Height (with hair = blonde) Tall 1

Average 1

Short 2

32
ID3 Example (cont.)
■ Tall = 1/4 (-1 log2 1)
=0
■ Average = 1/4 (-1 log2 1)
=0
■ Short = 2/4 (-1/2 log2 1/2 -1/2 log2 1/2)
= 2/4 (0.5 + 0.5)
= 0.5
■ Average Disorder (Height with “hair = blonde”) = 0 + 0 + 0.5
                                                 = 0.5

33
ID3 Example (cont.)
■ Similarly for other attributes but with hair =
blonde the average disorder is:
■ Average Disorder (Height with “hair =
blonde”) = 0.5
■ Average Disorder (Weight with “hair =
blonde”) = 1
■ Average Disorder (Lotion with “hair =
blonde”) = 0
34
ID3 Example (cont.)
■ Lotion has the minimum average disorder, so it will be the next
  test. Now the tree becomes:

  Hair
    Blonde → Lotion Used
      No
      Yes
    Red
    Brown

35
ID3 Example (cont.)
Hair
  Blonde → Lotion Used
    No  → Sunburned (Sarah, Annie)
    Yes → No sunburn (Dana, Katie)
  Red → Sunburned (Emily)
  Brown → No sunburn (Alex, Pete, John)

36
ID3 Example (cont.)
■ Rules Extraction
■ IF the person’s hair color is blonde
  AND the person uses lotion
  THEN no sunburn

■ IF the person’s hair color is blonde
  AND the person uses no lotion
  THEN the person turns red (sunburn)
37
ID3 Example (cont.)
■ IF the person’s hair color is red
  THEN the person turns red (sunburn)

■ IF the person’s hair color is brown
  THEN no sunburn

38
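The whole procedure of slides 25–38 can be condensed into a short recursive routine. The sketch below is not from the slides; it is one possible ID3-style implementation using the average-disorder criterion, and it recovers the same tree from the sunburn data.

```python
# A minimal recursive ID3 sketch using the average-disorder criterion from
# slide 26 and the sunburn data from slide 25. Names (id3, disorder, DATA,
# ...) are illustrative, not from the slides.
from collections import Counter
from math import log2

DATA = [
    {"Hair": "Blonde", "Height": "Average", "Weight": "Light",   "Lotion": "No",  "Result": "Sunburned"},
    {"Hair": "Blonde", "Height": "Tall",    "Weight": "Average", "Lotion": "Yes", "Result": "None"},
    {"Hair": "Brown",  "Height": "Short",   "Weight": "Average", "Lotion": "Yes", "Result": "None"},
    {"Hair": "Blonde", "Height": "Short",   "Weight": "Average", "Lotion": "No",  "Result": "Sunburned"},
    {"Hair": "Red",    "Height": "Average", "Weight": "Heavy",   "Lotion": "No",  "Result": "Sunburned"},
    {"Hair": "Brown",  "Height": "Tall",    "Weight": "Heavy",   "Lotion": "No",  "Result": "None"},
    {"Hair": "Brown",  "Height": "Average", "Weight": "Heavy",   "Lotion": "No",  "Result": "None"},
    {"Hair": "Blonde", "Height": "Short",   "Weight": "Light",   "Lotion": "Yes", "Result": "None"},
]

def disorder(rows):
    counts, n = Counter(r["Result"] for r in rows), len(rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def average_disorder(rows, attr):
    return sum(len(branch) / len(rows) * disorder(branch)
               for v in {r[attr] for r in rows}
               for branch in [[r for r in rows if r[attr] == v]])

def id3(rows, attrs):
    labels = {r["Result"] for r in rows}
    if len(labels) == 1:                 # partition is homogeneous: make a leaf
        return labels.pop()
    best = min(attrs, key=lambda a: average_disorder(rows, a))
    return {best: {v: id3([r for r in rows if r[best] == v], attrs - {best})
                   for v in {r[best] for r in rows}}}

tree = id3(DATA, {"Hair", "Height", "Weight", "Lotion"})
print(tree)
# {'Hair': {'Blonde': {'Lotion': {'No': 'Sunburned', 'Yes': 'None'}},
#           'Red': 'Sunburned', 'Brown': 'None'}}   (branch order may vary)
```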
IDENTIFICATION TREES

Evaluating ID3

Bad Data
Data is bad if one set of attributes has two different
outcomes

Missing Data
Data is missing if an attribute is not present, perhaps
because it was too expensive to obtain

39
Entropy

■ A measure of homogeneity of the set of examples.

■ Given a set S of positive and negative examples of


some target concept (a 2-class problem), the entropy
of set S relative to this binary classification is

E(S) = - p(P)log2 p(P) – p(N)log2 p(N)


■ Suppose S has 25 examples, 15 positive and 10
negatives [15+, 10-]. Then the entropy of S relative to
this classification is

E(S) = -(15/25) log2(15/25) - (10/25) log2(10/25) ≈ 0.971
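A minimal sketch of this two-class entropy (the function name is ours), reproducing the [15+, 10-] example above.

```python
# A minimal sketch of E(S) = -p(P) log2 p(P) - p(N) log2 p(N),
# reproducing the [15+, 10-] example above.
from math import log2

def entropy(pos: int, neg: int) -> float:
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:               # convention: 0 * log2(0) = 0
            e -= p * log2(p)
    return e

print(round(entropy(15, 10), 3))  # 0.971
print(entropy(10, 0))             # 0.0  (outcome is certain)
print(entropy(12, 12))            # 1.0  (maximum uncertainty)
```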


Some Intuitions
■ The entropy is 0 if the outcome is “certain”.
■ The entropy is maximum if we have no knowledge of the system
  (i.e., any outcome is equally possible).

[Figure: entropy of a 2-class problem as a function of the
proportion of one of the two classes]
Information Gain

■ Information gain measures the expected reduction in entropy, or
  uncertainty, from splitting on an attribute A:

  Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) Entropy(Sv)

  ◻ Values(A) is the set of all possible values for attribute A,
    and Sv is the subset of S for which attribute A has value v:
    Sv = {s in S | A(s) = v}
  ◻ the first term in the equation for Gain is just the entropy of
    the original collection S
  ◻ the second term is the expected value of the entropy after S
    is partitioned using attribute A
■ It is simply the expected reduction in
entropy caused by partitioning the
examples according to this attribute.
■ It is the number of bits saved when
encoding the target value of an arbitrary
member of S, by knowing the value of
attribute A.
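As a minimal sketch, the same quantity can be computed directly over a list of examples stored as dictionaries; the attribute and target names here are illustrative, not fixed by the slides.

```python
# A minimal sketch of Gain(S, A) computed directly over a list of examples.
# Examples are dictionaries; `target` is the class attribute.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, target="PlayTennis"):
    labels = [e[target] for e in examples]
    gain = entropy(labels)                                   # Entropy(S)
    for v in {e[attribute] for e in examples}:               # Values(A)
        sv = [e[target] for e in examples if e[attribute] == v]  # Sv
        gain -= len(sv) / len(examples) * entropy(sv)
    return gain
```

Applied to the 14-example tennis table on the next slide, this gives Gain(Outlook) ≈ 0.25, matching the worked computation on slide 55.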
Training Examples
Day Outlook Temp. Humidity Wind Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Weak Yes
D8 Sunny Mild High Weak No
D9 Sunny Cold Normal Weak Yes
D10 Rain Mild Normal Strong Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

45
Example: “Good day for tennis”

■ Attributes of instances
◻ Outlook = {rainy (r), overcast (o), sunny (s)}
◻ Temperature = {cool (c), medium (m), hot
(h)}
◻ Humidity = {normal (n), high (h)}
◻ Wind = {weak (w), strong (s)}
■ Class value
◻ Play Tennis? = {don’t play (n), play (y)}
■ Feature = attribute with one value
  E.g., outlook = sunny

46
Decision Tree Representation
Good day for tennis?
Leaves = classification; arcs = choice of value for the parent attribute

Outlook
  Sunny    → Humidity
               High   → Don’t play
               Normal → Play
  Overcast → Play
  Rain     → Wind
               Strong → Don’t play
               Weak   → Play

A decision tree is equivalent to logic in disjunctive normal form:

Play ⇔ (Sunny ∧ Normal) ∨ Overcast ∨ (Rain ∧ Weak)

47
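Since the slide notes the tree is equivalent to a DNF formula, here is a minimal sketch (function names are ours) that expresses the tree as nested conditionals and checks the equivalence over every combination of attribute values.

```python
# A minimal sketch of the tree above as nested conditionals, plus a check
# that it agrees with the DNF formula
#   Play <=> (Sunny and Normal) or Overcast or (Rain and Weak).
def tree_play(outlook: str, humidity: str, wind: str) -> bool:
    if outlook == "Sunny":
        return humidity == "Normal"
    if outlook == "Overcast":
        return True
    if outlook == "Rain":
        return wind == "Weak"
    raise ValueError("unknown outlook value")

def dnf_play(outlook: str, humidity: str, wind: str) -> bool:
    return ((outlook == "Sunny" and humidity == "Normal")
            or outlook == "Overcast"
            or (outlook == "Rain" and wind == "Weak"))

# The two formulations agree on every combination of attribute values.
assert all(tree_play(o, h, w) == dnf_play(o, h, w)
           for o in ("Sunny", "Overcast", "Rain")
           for h in ("High", "Normal")
           for w in ("Weak", "Strong"))
```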
Numeric Attributes

Use thresholds to convert numeric attributes into discrete values:

Outlook
  Sunny    → Humidity
               >= 75%    → Don’t play
               < 75%     → Play
  Overcast → Play
  Rain     → Wind
               >= 10 MPH → Don’t play
               < 10 MPH  → Play

48
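A minimal sketch of the thresholding idea, assuming the 75% humidity and 10 MPH wind cut-offs shown above; the function name is ours.

```python
# A minimal sketch of turning a numeric attribute into a discrete one with
# a threshold, as the slide suggests for humidity and wind speed.
def discretize(value: float, threshold: float, low: str, high: str) -> str:
    """Map a numeric reading to one of two discrete labels."""
    return high if value >= threshold else low

print(discretize(82, 75, "Normal", "High"))   # humidity 82%  -> "High"
print(discretize(6, 10, "Weak", "Strong"))    # wind 6 MPH    -> "Weak"
```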
DT Learning as Search

■ Nodes:        decision trees
■ Operators:    tree refinement (sprouting the tree)
■ Initial node: the smallest tree possible (a single leaf)
■ Heuristic:    information gain
■ Goal:         the best tree possible (???)

What is the simplest tree?

Day  Outlook Temp Humid Wind Play?
d1   s       h    h     w    n
d2   s       h    h     s    n
d3   o       h    h     w    y
d4   r       m    h     w    y
d5   r       c    n     w    y
d6   r       c    n     s    n
d7   o       c    n     s    y
d8   s       m    h     w    n
d9   s       c    n     w    y
d10  r       m    n     w    y
d11  s       m    n     s    y
d12  o       m    h     s    y
d13  o       h    n     w    y
d14  r       m    h     s    n

How good is the simplest tree (the majority class)?
Majority class [9+, 5-]: correct on 9 examples, incorrect on 5
examples.
50
Successors

Candidate splits at the root: Humid, Wind, Outlook, Temp.

Which attribute should we use to split?


51
Disorder is bad
Homogeneity is good
[Figure: candidate splits compared, ranging from bad (mixed
classes) through better to good (homogeneous subsets)]

52
Entropy (disorder) is bad
Homogeneity is good
■ Let S be a set of examples
■ Entropy(S) = -P log2(P) - N log2(N)
◻ P is proportion of pos example
◻ N is proportion of neg examples
◻ 0 log 0 == 0
■ Example: S has 9 pos and 5 neg examples
  Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14)
                    = 0.940

53
Information Gain
■ Measure of expected reduction in entropy
■ Resulting from splitting along an attribute

  Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) Entropy(Sv)

  where
  Entropy(S) = -P log2(P) - N log2(N)

54
Tree Induction Example

■ Entropy of data S
  Info(S) = -9/14 log2(9/14) - 5/14 log2(5/14) = 0.94

■ Split data by attribute Outlook
  S [9+, 5-] → Sunny    [2+, 3-]
               Overcast [4+, 0-]
               Rain     [3+, 2-]

  Gain(Outlook) = 0.94 - 5/14 [-2/5 log2(2/5) - 3/5 log2(3/5)]
                       - 4/14 [-4/4 log2(4/4) - 0/4 log2(0/4)]
                       - 5/14 [-3/5 log2(3/5) - 2/5 log2(2/5)]
                = 0.94 - 0.69 = 0.25

  (A parent node p is split into k partitions; ni is the number of
  records in partition i.)

55
Tree Induction Example

■ Split data by attribute Temperature
  S [9+, 5-] → <15   [3+, 1-]
               15-25 [4+, 2-]
               >25   [2+, 2-]

  Gain(Temperature) = 0.94 - 4/14 [-3/4 log2(3/4) - 1/4 log2(1/4)]
                           - 6/14 [-4/6 log2(4/6) - 2/6 log2(2/6)]
                           - 4/14 [-2/4 log2(2/4) - 2/4 log2(2/4)]
                    = 0.94 - 0.91 = 0.03

56
Tree Induction Example

■ Split data by attribute Humidity
  S [9+, 5-] → High   [3+, 4-]
               Normal [6+, 1-]

  Gain(Humidity) = 0.94 - 7/14 [-3/7 log2(3/7) - 4/7 log2(4/7)]
                        - 7/14 [-6/7 log2(6/7) - 1/7 log2(1/7)]
                 = 0.94 - 0.79 = 0.15

■ Split data by attribute Wind
  S [9+, 5-] → Weak   [6+, 2-]
               Strong [3+, 3-]

  Gain(Wind) = 0.94 - 8/14 [-6/8 log2(6/8) - 2/8 log2(2/8)]
                    - 6/14 [-3/6 log2(3/6) - 3/6 log2(3/6)]
             = 0.94 - 0.89 = 0.05

57
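The four gains can be reproduced from the branch class counts alone. Below is a minimal sketch (names are ours) that matches the arithmetic on slides 55–57.

```python
# A minimal sketch reproducing the four gains above from the class counts
# of each branch, as listed on slides 55-57.
from math import log2

def entropy(pos: int, neg: int) -> float:
    total = pos + neg
    return -sum(c / total * log2(c / total) for c in (pos, neg) if c)

def gain(parent, branches):
    """parent = (pos, neg); branches = list of (pos, neg), one per attribute value."""
    n = sum(parent)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q) for p, q in branches)

S = (9, 5)
print(round(gain(S, [(2, 3), (4, 0), (3, 2)]), 2))  # Outlook     -> 0.25
print(round(gain(S, [(3, 1), (4, 2), (2, 2)]), 2))  # Temperature -> 0.03
print(round(gain(S, [(3, 4), (6, 1)]), 2))          # Humidity    -> 0.15
print(round(gain(S, [(6, 2), (3, 3)]), 2))          # Wind        -> 0.05
```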
Tree Induction Example

Outlook   Temperature  Humidity  Wind    Play Tennis
Sunny     >25          High      Weak    No
Sunny     >25          High      Strong  No
Overcast  >25          High      Weak    Yes
Rain      15-25        High      Weak    Yes
Rain      <15          Normal    Weak    Yes
Rain      <15          Normal    Strong  No
Overcast  <15          Normal    Strong  Yes
Sunny     15-25        High      Weak    No
Sunny     <15          Normal    Weak    Yes
Rain      15-25        Normal    Weak    Yes
Sunny     15-25        Normal    Strong  Yes
Overcast  15-25        High      Strong  Yes
Overcast  >25          Normal    Weak    Yes
Rain      15-25        High      Strong  No

Gain(Outlook)     = 0.25
Gain(Temperature) = 0.03
Gain(Humidity)    = 0.15
Gain(Wind)        = 0.05

Outlook has the highest gain, so it becomes the root test:

Outlook
  Sunny    → ??
  Overcast → Yes
  Rain     → ??

58
■ Entropy of branch Sunny
  Info(Sunny) = -2/5 log2(2/5) - 3/5 log2(3/5) = 0.97

■ Split Sunny branch by attribute Temperature
  Sunny [2+, 3-] → <15   [1+, 0-]
                   15-25 [1+, 1-]
                   >25   [0+, 2-]
  Gain(Temperature) = 0.97 - 1/5 [-1/1 log2(1/1) - 0/1 log2(0/1)]
                           - 2/5 [-1/2 log2(1/2) - 1/2 log2(1/2)]
                           - 2/5 [-0/2 log2(0/2) - 2/2 log2(2/2)]
                    = 0.97 - 0.4 = 0.57

■ Split Sunny branch by attribute Humidity
  Sunny [2+, 3-] → High   [0+, 3-]
                   Normal [2+, 0-]
  Gain(Humidity) = 0.97 - 3/5 [-0/3 log2(0/3) - 3/3 log2(3/3)]
                        - 2/5 [-2/2 log2(2/2) - 0/2 log2(0/2)]
                 = 0.97 - 0 = 0.97

■ Split Sunny branch by attribute Wind
  Sunny [2+, 3-] → Weak   [1+, 2-]
                   Strong [1+, 1-]
  Gain(Wind) = 0.97 - 3/5 [-1/3 log2(1/3) - 2/3 log2(2/3)]
                    - 2/5 [-1/2 log2(1/2) - 1/2 log2(1/2)]
             = 0.97 - 0.95 = 0.02
59
Tree Induction Example

Humidity has the highest gain on the Sunny branch, so the tree
becomes:

Outlook
  Sunny    → Humidity
               High   → No
               Normal → Yes
  Overcast → Yes
  Rain     → ??

60
■ Entropy of branch Rain
  Info(Rain) = -3/5 log2(3/5) - 2/5 log2(2/5) = 0.97

■ Split Rain branch by attribute Temperature
  Rain [3+, 2-] → <15   [1+, 1-]
                  15-25 [2+, 1-]
                  >25   [0+, 0-]
  Gain(Temperature) = 0.97 - 2/5 [-1/2 log2(1/2) - 1/2 log2(1/2)]
                           - 3/5 [-2/3 log2(2/3) - 1/3 log2(1/3)]
                           - 0/5 [ empty branch, contributes 0 ]
                    = 0.97 - 0.95 = 0.02

■ Split Rain branch by attribute Humidity
  Rain [3+, 2-] → High   [1+, 1-]
                  Normal [2+, 1-]
  Gain(Humidity) = 0.97 - 2/5 [-1/2 log2(1/2) - 1/2 log2(1/2)]
                        - 3/5 [-2/3 log2(2/3) - 1/3 log2(1/3)]
                 = 0.97 - 0.95 = 0.02

■ Split Rain branch by attribute Wind
  Rain [3+, 2-] → Weak   [3+, 0-]
                  Strong [0+, 2-]
  Gain(Wind) = 0.97 - 3/5 [-3/3 log2(3/3) - 0/3 log2(0/3)]
                    - 2/5 [-0/2 log2(0/2) - 2/2 log2(2/2)]
             = 0.97 - 0 = 0.97

61
Issues

■ Missing data
■ Real-valued attributes
■ Many-valued features
■ Evaluation
■ Overfitting

62
Strengths

■ Can generate understandable rules
■ Perform classification without much computation
■ Can handle continuous and categorical variables
■ Provide a clear indication of which fields are most important
  for prediction or classification
Weaknesses

■ Not suitable for predicting continuous attributes.
■ Perform poorly with many classes and small data sets.
■ Computationally expensive to train.
  ◻ At each node, each candidate splitting field must be sorted
    before its best split can be found.
  ◻ In some algorithms, combinations of fields are used and a
    search must be made for optimal combining weights.
  ◻ Pruning algorithms can also be expensive since many candidate
    sub-trees must be formed and compared.
■ Do not handle non-rectangular regions well.
