
Decision Trees

Decision tree representation


ID3 (Iterative Dichotomizer 3) learning algorithm
Entropy, information gain
Overfitting
Supervised Classification Techniques

Discriminative models – focus on decision boundaries; learn p(y|x):
• Decision Tree based Methods
• Memory based reasoning (closest experience) – KNN
• Traditional Neural Networks
• Support Vector Machines
• Logistic Regression
• Conditional Random Fields
• Rule-based Methods
• Rough Sets

Generative models – focus on the joint distribution of {x, y}; learn p(x, y):
• Naïve Bayes
• Bayesian Belief Networks
• Hidden Markov Models
• Gaussian Maximum Likelihood

Both: Generative Adversarial Networks (GANs) – a generator and a discriminator model
Introduction:
• Identify cases/categories/classes based on a list of features:
• Who won an election?
• Which album/folder should a photo/email be placed in?
• Disambiguation: what is the sense of a word such as "bank"?
• Sentiment: which news items were judgmental or neutral?
• Profile: who sent an ad through email?
• Diagnosis: is this a case of malaria?

Illustrating Supervised Classification Task

A learning algorithm performs induction on the Training Set to learn a model;
the model is then applied to the Test Set to deduce the unknown class labels.

Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
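A minimal sketch of this induction/deduction loop in Python, assuming scikit-learn is available (the slides name no library); the numeric encodings of the categorical attributes are illustrative choices of mine, not part of the original example.

# Hypothetical end-to-end sketch: learn a model from the Training Set
# (induction), then apply it to the Test Set (deduction).
from sklearn.tree import DecisionTreeClassifier

# Training Set rows: (Attrib1, Attrib2, Attrib3 in thousands, Class)
train = [
    ("Yes", "Large", 125, "No"),  ("No", "Medium", 100, "No"),
    ("No", "Small", 70, "No"),    ("Yes", "Medium", 120, "No"),
    ("No", "Large", 95, "Yes"),   ("No", "Medium", 60, "No"),
    ("Yes", "Large", 220, "No"),  ("No", "Small", 85, "Yes"),
    ("No", "Medium", 75, "No"),   ("No", "Small", 90, "Yes"),
]
test = [("No", "Small", 55), ("Yes", "Medium", 80), ("Yes", "Large", 110),
        ("No", "Small", 95), ("No", "Large", 67)]

# Simple (illustrative) numeric encodings for the categorical attributes.
yes_no = {"No": 0, "Yes": 1}
size = {"Small": 0, "Medium": 1, "Large": 2}

X = [[yes_no[a1], size[a2], a3] for a1, a2, a3, _ in train]
y = [label for *_, label in train]

model = DecisionTreeClassifier(criterion="entropy").fit(X, y)   # "Learn Model"
X_new = [[yes_no[a1], size[a2], a3] for a1, a2, a3 in test]
print(model.predict(X_new))                                     # "Apply Model"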
ID3: Entropy

Entropy(S) = -p+ log2 p+ - p- log2 p-

[Figure: sample sets of +/- examples:  -+-+ -+-+   and   ---- +++ ]

• S is a sample of training examples taken from a population
• p+ is the proportion of positive examples; a very high or very low p+ means low entropy
• p- is the proportion of negative examples
• Entropy measures the impurity of S
Entropy
• Entropy of set S = the expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S

Why? Information theory: an optimal (shortest expected length) code assigns -log2 p bits to a message of probability p. So the expected number of bits to encode the class of a random member of S is

  p+ (-log2 p+) + p- (-log2 p-)  =  -p+ log2 p+ - p- log2 p-
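A small sketch of this formula in Python (the helper name entropy is mine, not the slides'); the sample values also confirm the earlier claim that a very high or very low p+ gives low entropy.

import math

def entropy(p_pos, p_neg):
    """Entropy(S) = -p+ log2 p+ - p- log2 p- (with 0*log2(0) taken as 0)."""
    return sum(-p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

print(entropy(0.5, 0.5))      # 1.0  : evenly mixed set, maximal impurity
print(entropy(0.99, 0.01))    # ~0.08: almost pure set, low entropy
print(entropy(1.0, 0.0))      # 0.0  : pure set
print(entropy(9/14, 5/14))    # ~0.94: the PlayTennis table used below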
Top-Down Induction of Decision Trees – ID3, a greedy method

1. Start by calculating the entropy of the decision (root) node.
2. A ← the "best" decision attribute for the next node
   (least weighted entropy, i.e., highest information gain).
3. For each value of A, create a new descendant node.
4. If all training examples are perfectly classified (same value of the target
   attribute), stop; else iterate over the new leaf nodes.
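A compact, self-contained sketch of this greedy loop (illustrative code of mine, not the original ID3 implementation); it assumes examples is a list of dicts mapping attribute names to values and target is the name of the class attribute.

import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, attr, target):
    """Information gain of splitting `examples` on `attr`."""
    base = entropy([ex[target] for ex in examples])
    n = len(examples)
    for v in set(ex[attr] for ex in examples):
        subset = [ex[target] for ex in examples if ex[attr] == v]
        base -= len(subset) / n * entropy(subset)
    return base

def id3(examples, attributes, target):
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:                # step 4: all examples agree -> leaf
        return labels[0]
    if not attributes:                       # no attribute left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a, target))   # step 2
    tree = {best: {}}
    for v in set(ex[best] for ex in examples):     # step 3: branch per value of A
        subset = [ex for ex in examples if ex[best] == v]
        rest = [a for a in attributes if a != best]
        tree[best][v] = id3(subset, rest, target)  # iterate over new leaf nodes
    return tree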
Training Examples        Entropy(S) = -p+ log2 p+ - p- log2 p-

Day  Outlook   Temp  Humidity  Wind    Play Tennis
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rain      Mild  High      Weak    Yes
D5   Rain      Cool  Normal    Weak    Yes
D6   Rain      Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Strong  Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rain      Mild  Normal    Weak    Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rain      Mild  High      Strong  No
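The same table as plain Python records (variable names are my own), convenient for reproducing the entropy and gain figures on the following slides.

# The PlayTennis training examples D1..D14 as a list of dicts.
ATTRS = ("Outlook", "Temp", "Humidity", "Wind", "PlayTennis")
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),           # D1
    ("Sunny", "Hot", "High", "Strong", "No"),         # D2
    ("Overcast", "Hot", "High", "Weak", "Yes"),       # D3
    ("Rain", "Mild", "High", "Weak", "Yes"),          # D4
    ("Rain", "Cool", "Normal", "Weak", "Yes"),        # D5
    ("Rain", "Cool", "Normal", "Strong", "No"),       # D6
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),  # D7
    ("Sunny", "Mild", "High", "Weak", "No"),          # D8
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),       # D9
    ("Rain", "Mild", "Normal", "Weak", "Yes"),        # D10
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),     # D11
    ("Overcast", "Mild", "High", "Strong", "Yes"),    # D12
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),     # D13
    ("Rain", "Mild", "High", "Strong", "No"),         # D14
]
DATA = [dict(zip(ATTRS, row)) for row in ROWS]

pos = sum(r["PlayTennis"] == "Yes" for r in DATA)
print(pos, len(DATA) - pos)   # 9 positive and 5 negative examples: S = [9+, 5-]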
Entropy of the Table's Decision Field
• Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
• This measures the amount of disorder (impurity) at the ROOT decision node.
Converting a Tree to Rules

The learned tree: Outlook is the root test.
  Sunny    -> test Humidity: High -> No,   Normal -> Yes
  Overcast -> Yes
  Rain     -> test Wind:     Strong -> No, Weak -> Yes
R1: If (Outlook=Sunny) ∧ (Humidity=High) Then PlayTennis=No
R2: If (Outlook=Sunny) ∧ (Humidity=Normal) Then PlayTennis=Yes
R3: If (Outlook=Overcast) Then PlayTennis=Yes
R4: If (Outlook=Rain) ∧ (Wind=Strong) Then PlayTennis=No
R5: If (Outlook=Rain) ∧ (Wind=Weak) Then PlayTennis=Yes
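The same five rules written as a plain classifier function (an illustrative sketch; the function name is mine, and the attribute values are the strings used in the table).

def play_tennis(outlook, humidity, wind):
    """Classify one day by walking the rules R1..R5 read off the tree."""
    if outlook == "Sunny":
        return "No" if humidity == "High" else "Yes"   # R1 / R2
    if outlook == "Overcast":
        return "Yes"                                   # R3
    if outlook == "Rain":
        return "No" if wind == "Strong" else "Yes"     # R4 / R5
    return None  # value of Outlook not covered by the rules

print(play_tennis("Sunny", "High", "Weak"))    # No  (R1)
print(play_tennis("Rain", "Normal", "Weak"))   # Yes (R5)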
Selecting the Next Attribute

S = [9+, 5-], E(S) = 0.940

Split on Humidity:
  High:   [3+, 4-], E = 0.985
  Normal: [6+, 1-], E = 0.592

Split on Wind:
  Weak:   [6+, 2-], E = 0.811
  Strong: [3+, 3-], E = 1.0
Example: Entropy(S = [29+, 35-]) = -(29/64) log2(29/64) - (35/64) log2(35/64) = 0.99

Splitting S on attribute A1 (values G and H):
  G: [21+, 5-]
  H: [8+, 30-]
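A quick check of these numbers (illustrative code of mine; the branch entropies are not shown on the slide but follow from the same formula).

import math

def entropy(pos, neg):
    n = pos + neg
    return -sum(p * math.log2(p) for p in (pos / n, neg / n) if p > 0)

print(round(entropy(29, 35), 2))   # 0.99  root set [29+, 35-]
print(round(entropy(21, 5), 2))    # 0.71  branch G, [21+, 5-]
print(round(entropy(8, 30), 2))    # 0.74  branch H, [8+, 30-]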
Temperature

S = [9+, 5-], E(S) = 0.940

Split on Temperature:
  Hot:  [2+, 2-], E = 1.00
  Cool: [3+, 1-], E = 0.81
  Mild: [4+, 2-], E = 0.91
Information Gain

• Information Gain(S, A): the expected reduction in entropy due to sorting S on attribute A

  Gain(S, A) = Entropy(S) - Σ over v in Values(A) of (|Sv| / |S|) * Entropy(Sv)

  where Sv is the subset of S for which attribute A has value v.
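Gain(S, A) written directly over the [+, -] counts used on these slides (a sketch with names of my choosing; children holds one (pos, neg) pair per value of A).

import math

def entropy(pos, neg):
    n = pos + neg
    return -sum(p * math.log2(p) for p in (pos / n, neg / n) if p > 0)

def gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the child subsets."""
    n = sum(p + q for p, q in children)
    remainder = sum((p + q) / n * entropy(p, q) for p, q in children)
    return entropy(*parent) - remainder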
Selecting the Next Attribute

S = [9+, 5-], E(S) = 0.940

Split on Humidity:
  High:   [3+, 4-], E = 0.985
  Normal: [6+, 1-], E = 0.592
Gain(S, Humidity) = 0.940 - (7/14)*0.985 - (7/14)*0.592 = 0.151

Split on Wind:
  Weak:   [6+, 2-], E = 0.811
  Strong: [3+, 3-], E = 1.0
Gain(S, Wind) = 0.940 - (8/14)*0.811 - (6/14)*1.0 = 0.048

Humidity provides greater information gain than Wind, w.r.t. the target classification.
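Checking these two values with the count-based gain helper sketched after the Information Gain slide (repeated here so the snippet runs on its own); the slide's 0.151 comes from rounding the intermediate entropies.

import math

def entropy(pos, neg):
    n = pos + neg
    return -sum(p * math.log2(p) for p in (pos / n, neg / n) if p > 0)

def gain(parent, children):
    n = sum(p + q for p, q in children)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q) for p, q in children)

print(round(gain((9, 5), [(3, 4), (6, 1)]), 3))   # Humidity: 0.152 (~0.151 on the slide)
print(round(gain((9, 5), [(6, 2), (3, 3)]), 3))   # Wind:     0.048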
Selecting the Next Attribute

S = [9+, 5-], E(S) = 0.940

Split on Outlook:
  Sunny:    [2+, 3-], E = 0.971
  Overcast: [4+, 0-], E = 0.0
  Rain:     [3+, 2-], E = 0.971

Gain(S, Outlook) = 0.940 - (5/14)*0.971 - (4/14)*0.0 - (5/14)*0.971 = 0.247
Selecting the Next Attribute

The information gain values for the four attributes are:
• Gain(S, Outlook)     = 0.247
• Gain(S, Humidity)    = 0.151
• Gain(S, Wind)        = 0.048
• Gain(S, Temperature) = 0.029

where S denotes the collection of training examples.
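The two gains not worked out in detail above, Outlook and Temperature, checked with the same count-based helper (repeated so the block is self-contained; the per-value counts come from the training table).

import math

def entropy(pos, neg):
    n = pos + neg
    return -sum(p * math.log2(p) for p in (pos / n, neg / n) if p > 0)

def gain(parent, children):
    n = sum(p + q for p, q in children)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q) for p, q in children)

# Outlook: Sunny [2+,3-], Overcast [4+,0-], Rain [3+,2-]
print(round(gain((9, 5), [(2, 3), (4, 0), (3, 2)]), 3))   # 0.247
# Temperature: Hot [2+,2-], Mild [4+,2-], Cool [3+,1-]
print(round(gain((9, 5), [(2, 2), (4, 2), (3, 1)]), 3))   # 0.029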

ID3 Algorithm

Root: Outlook, chosen for S = [D1, D2, ..., D14] = [9+, 5-]

  Sunny:    Ssunny = [D1, D2, D8, D9, D11] = [2+, 3-]  -> ?
  Overcast: [D3, D7, D12, D13] = [4+, 0-]              -> Yes
  Rain:     [D4, D5, D6, D10, D14] = [3+, 2-]          -> ?

Which attribute should be tested at the Sunny branch?
Gain(Ssunny, Humidity) = 0.970 - (3/5)*0.0 - (2/5)*0.0 = 0.970
Gain(Ssunny, Temp)     = 0.970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = 0.570
Gain(Ssunny, Wind)     = 0.970 - (2/5)*1.0 - (3/5)*0.918 = 0.019
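A check of these three values with the same count-based helper (repeated here; the per-value counts are taken from the five Sunny-branch examples D1, D2, D8, D9, D11).

import math

def entropy(pos, neg):
    n = pos + neg
    return -sum(p * math.log2(p) for p in (pos / n, neg / n) if p > 0)

def gain(parent, children):
    n = sum(p + q for p, q in children)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q) for p, q in children)

s_sunny = (2, 3)  # [2+, 3-] among D1, D2, D8, D9, D11
print(round(gain(s_sunny, [(0, 3), (2, 0)]), 3))          # Humidity: 0.971 (~0.970)
print(round(gain(s_sunny, [(0, 2), (1, 1), (1, 0)]), 3))  # Temp:     0.571 (~0.570)
print(round(gain(s_sunny, [(1, 2), (1, 1)]), 3))          # Wind:     0.020 (~0.019)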
ID3 Algorithm

The final tree:

Outlook = Sunny    -> Humidity = High   -> No   [D1, D2, D8]
                      Humidity = Normal -> Yes  [D9, D11]
Outlook = Overcast -> Yes                       [D3, D7, D12, D13]
Outlook = Rain     -> Wind = Strong     -> No   [D6, D14]
                      Wind = Weak       -> Yes  [D4, D5, D10]
Some Characteristics of Decision Trees
• Discrete-valued target function
• Best suited to discrete attribute values, though continuous attributes are also possible
• The target function is expressed as a disjunction of conjunctions
• Searches the complete space of finite discrete-valued functions for a correct hypothesis (a decision tree)
• However, it returns one single final hypothesis
• Greedy search with no backtracking; the solution may be sub-optimal
• Uses all training examples at each step
• Robust, but may suffer from overfitting
Occam's Razor
"If two theories explain the facts equally well, then the simpler theory is to be preferred."

Arguments in favor:
• There are fewer short hypotheses than long hypotheses
• A short hypothesis that fits the data is unlikely to be a coincidence
• A long hypothesis that fits the data might be a coincidence

Arguments opposed:
• There are many ways to define small sets of hypotheses

• This preference supports ID3's inductive bias: after training, the learner justifies future classifications (generalization) by having chosen the shortest tree that fits the data
