
Decision Trees

Decision tree representation


ID3 (Iterative Dichotomizer 3) learning algorithm
Entropy, information gain
Overfitting
Supervised Classification Techniques

Discriminative models – focus on decision boundaries; learn p(y|x):
• Decision Tree based Methods
• Memory based reasoning (closest experience) – KNN
• Traditional Neural Networks
• Support Vector Machines
• Logistic Regression
• Conditional Random Fields
• Rule-based Methods
• Rough Sets

Generative models – focus on the joint distribution of {x, y}; learn p(x, y):
• Naïve Bayes
• Bayesian Belief Networks
• Hidden Markov Models
• Gaussian Maximum Likelihood

Both: Generative Adversarial Networks (GANs) – a generator and a discriminator model
Introduction:
• Identify cases/categories/classes based on a list of features:
• Who won an election?
• Which album/folder should a photo/email be placed in?
• Disambiguation: what is the sense of a word such as "bank"?
• Sentiment: which news items were judgmental or neutral?
• Profile: who sent an ad through email?
• Diagnosis: is this a case of malaria?

Illustrating Supervised Classification Task

A learning algorithm performs induction on the Training Set to learn a model;
the model is then applied to the Test Set to deduce the unknown class labels.

Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
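A minimal sketch of this induction/deduction loop in Python, assuming scikit-learn is available (the slides name no library); the numeric encodings of the categorical attributes are illustrative choices of mine, not part of the original example.

# Hypothetical end-to-end sketch: learn a model from the Training Set
# (induction), then apply it to the Test Set (deduction).
from sklearn.tree import DecisionTreeClassifier

# Training Set rows: (Attrib1, Attrib2, Attrib3 in thousands, Class)
train = [
    ("Yes", "Large", 125, "No"),  ("No", "Medium", 100, "No"),
    ("No", "Small", 70, "No"),    ("Yes", "Medium", 120, "No"),
    ("No", "Large", 95, "Yes"),   ("No", "Medium", 60, "No"),
    ("Yes", "Large", 220, "No"),  ("No", "Small", 85, "Yes"),
    ("No", "Medium", 75, "No"),   ("No", "Small", 90, "Yes"),
]
test = [("No", "Small", 55), ("Yes", "Medium", 80), ("Yes", "Large", 110),
        ("No", "Small", 95), ("No", "Large", 67)]

# Simple (illustrative) numeric encodings for the categorical attributes.
yes_no = {"No": 0, "Yes": 1}
size = {"Small": 0, "Medium": 1, "Large": 2}

X = [[yes_no[a1], size[a2], a3] for a1, a2, a3, _ in train]
y = [label for *_, label in train]

model = DecisionTreeClassifier(criterion="entropy").fit(X, y)   # "Learn Model"
X_new = [[yes_no[a1], size[a2], a3] for a1, a2, a3 in test]
print(model.predict(X_new))                                     # "Apply Model"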
ID3: Entropy

Entropy(S) = -p+ log2 p+ - p- log2 p-

[Figure: sample sets of +/- examples:  -+-+ -+-+   and   ---- +++ ]

• S is a sample of training examples taken from a population
• p+ is the proportion of positive examples; a very high or very low p+ means low entropy
• p- is the proportion of negative examples
• Entropy measures the impurity of S
Entropy
• Entropy of set S = the expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S

Why? Information theory: an optimal (shortest expected length) code assigns -log2 p bits to a message of probability p. So the expected number of bits to encode the class of a random member of S is

  p+ (-log2 p+) + p- (-log2 p-)  =  -p+ log2 p+ - p- log2 p-
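A small sketch of this formula in Python (the helper name entropy is mine, not the slides'); the sample values also confirm the earlier claim that a very high or very low p+ gives low entropy.

import math

def entropy(p_pos, p_neg):
    """Entropy(S) = -p+ log2 p+ - p- log2 p- (with 0*log2(0) taken as 0)."""
    return sum(-p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

print(entropy(0.5, 0.5))      # 1.0  : evenly mixed set, maximal impurity
print(entropy(0.99, 0.01))    # ~0.08: almost pure set, low entropy
print(entropy(1.0, 0.0))      # 0.0  : pure set
print(entropy(9/14, 5/14))    # ~0.94: the PlayTennis table used below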
Top-Down Induction of Decision Trees – ID3, a greedy method

1. Start by calculating the entropy of the decision (root) node.
2. A ← the "best" decision attribute for the next node
   (least weighted entropy, i.e., highest information gain).
3. For each value of A, create a new descendant node.
4. If all training examples are perfectly classified (same value of the target
   attribute), stop; else iterate over the new leaf nodes.
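A compact, self-contained sketch of this greedy loop (illustrative code of mine, not the original ID3 implementation); it assumes examples is a list of dicts mapping attribute names to values and target is the name of the class attribute.

import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, attr, target):
    """Information gain of splitting `examples` on `attr`."""
    base = entropy([ex[target] for ex in examples])
    n = len(examples)
    for v in set(ex[attr] for ex in examples):
        subset = [ex[target] for ex in examples if ex[attr] == v]
        base -= len(subset) / n * entropy(subset)
    return base

def id3(examples, attributes, target):
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:                # step 4: all examples agree -> leaf
        return labels[0]
    if not attributes:                       # no attribute left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a, target))   # step 2
    tree = {best: {}}
    for v in set(ex[best] for ex in examples):     # step 3: branch per value of A
        subset = [ex for ex in examples if ex[best] == v]
        rest = [a for a in attributes if a != best]
        tree[best][v] = id3(subset, rest, target)  # iterate over new leaf nodes
    return tree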
Training Examples        Entropy(S) = -p+ log2 p+ - p- log2 p-

Day  Outlook   Temp  Humidity  Wind    Play Tennis
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rain      Mild  High      Weak    Yes
D5   Rain      Cool  Normal    Weak    Yes
D6   Rain      Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Strong  Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rain      Mild  Normal    Weak    Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rain      Mild  High      Strong  No
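The same table as plain Python records (variable names are my own), convenient for reproducing the entropy and gain figures on the following slides.

# The PlayTennis training examples D1..D14 as a list of dicts.
ATTRS = ("Outlook", "Temp", "Humidity", "Wind", "PlayTennis")
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),           # D1
    ("Sunny", "Hot", "High", "Strong", "No"),         # D2
    ("Overcast", "Hot", "High", "Weak", "Yes"),       # D3
    ("Rain", "Mild", "High", "Weak", "Yes"),          # D4
    ("Rain", "Cool", "Normal", "Weak", "Yes"),        # D5
    ("Rain", "Cool", "Normal", "Strong", "No"),       # D6
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),  # D7
    ("Sunny", "Mild", "High", "Weak", "No"),          # D8
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),       # D9
    ("Rain", "Mild", "Normal", "Weak", "Yes"),        # D10
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),     # D11
    ("Overcast", "Mild", "High", "Strong", "Yes"),    # D12
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),     # D13
    ("Rain", "Mild", "High", "Strong", "No"),         # D14
]
DATA = [dict(zip(ATTRS, row)) for row in ROWS]

pos = sum(r["PlayTennis"] == "Yes" for r in DATA)
print(pos, len(DATA) - pos)   # 9 positive and 5 negative examples: S = [9+, 5-]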
Entropy of the Table's Decision Field
• Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
• This measures the amount of disorder (impurity) at the ROOT decision node.
Converting a Tree to Rules

The learned tree: Outlook is the root test.
  Sunny    -> test Humidity: High -> No,   Normal -> Yes
  Overcast -> Yes
  Rain     -> test Wind:     Strong -> No, Weak -> Yes
R1: If (Outlook=Sunny) ∧ (Humidity=High) Then PlayTennis=No
R2: If (Outlook=Sunny) ∧ (Humidity=Normal) Then PlayTennis=Yes
R3: If (Outlook=Overcast) Then PlayTennis=Yes
R4: If (Outlook=Rain) ∧ (Wind=Strong) Then PlayTennis=No
R5: If (Outlook=Rain) ∧ (Wind=Weak) Then PlayTennis=Yes
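The same five rules written as a plain classifier function (an illustrative sketch; the function name is mine, and the attribute values are the strings used in the table).

def play_tennis(outlook, humidity, wind):
    """Classify one day by walking the rules R1..R5 read off the tree."""
    if outlook == "Sunny":
        return "No" if humidity == "High" else "Yes"   # R1 / R2
    if outlook == "Overcast":
        return "Yes"                                   # R3
    if outlook == "Rain":
        return "No" if wind == "Strong" else "Yes"     # R4 / R5
    return None  # value of Outlook not covered by the rules

print(play_tennis("Sunny", "High", "Weak"))    # No  (R1)
print(play_tennis("Rain", "Normal", "Weak"))   # Yes (R5)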
Selecting the Next Attribute

S = [9+, 5-], E(S) = 0.940

Split on Humidity:
  High:   [3+, 4-], E = 0.985
  Normal: [6+, 1-], E = 0.592

Split on Wind:
  Weak:   [6+, 2-], E = 0.811
  Strong: [3+, 3-], E = 1.0
Example: Entropy(S = [29+, 35-]) = -(29/64) log2(29/64) - (35/64) log2(35/64) = 0.99

Splitting S on attribute A1 (values G and H):
  G: [21+, 5-]
  H: [8+, 30-]
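A quick check of these numbers (illustrative code of mine; the branch entropies are not shown on the slide but follow from the same formula).

import math

def entropy(pos, neg):
    n = pos + neg
    return -sum(p * math.log2(p) for p in (pos / n, neg / n) if p > 0)

print(round(entropy(29, 35), 2))   # 0.99  root set [29+, 35-]
print(round(entropy(21, 5), 2))    # 0.71  branch G, [21+, 5-]
print(round(entropy(8, 30), 2))    # 0.74  branch H, [8+, 30-]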
Temperature

S = [9+, 5-], E(S) = 0.940

Split on Temperature:
  Hot:  [2+, 2-], E = 1.00
  Cool: [3+, 1-], E = 0.81
  Mild: [4+, 2-], E = 0.91
Information Gain

• Information Gain(S, A): the expected reduction in entropy due to sorting S on attribute A

  Gain(S, A) = Entropy(S) - Σ over v in Values(A) of (|Sv| / |S|) * Entropy(Sv)

  where Sv is the subset of S for which attribute A has value v.
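Gain(S, A) written directly over the [+, -] counts used on these slides (a sketch with names of my choosing; children holds one (pos, neg) pair per value of A).

import math

def entropy(pos, neg):
    n = pos + neg
    return -sum(p * math.log2(p) for p in (pos / n, neg / n) if p > 0)

def gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the child subsets."""
    n = sum(p + q for p, q in children)
    remainder = sum((p + q) / n * entropy(p, q) for p, q in children)
    return entropy(*parent) - remainder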
Selecting the Next Attribute

S = [9+, 5-], E(S) = 0.940

Split on Humidity:
  High:   [3+, 4-], E = 0.985
  Normal: [6+, 1-], E = 0.592
Gain(S, Humidity) = 0.940 - (7/14)*0.985 - (7/14)*0.592 = 0.151

Split on Wind:
  Weak:   [6+, 2-], E = 0.811
  Strong: [3+, 3-], E = 1.0
Gain(S, Wind) = 0.940 - (8/14)*0.811 - (6/14)*1.0 = 0.048

Humidity provides greater information gain than Wind, w.r.t. the target classification.
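Checking these two values with the count-based gain helper sketched after the Information Gain slide (repeated here so the snippet runs on its own); the slide's 0.151 comes from rounding the intermediate entropies.

import math

def entropy(pos, neg):
    n = pos + neg
    return -sum(p * math.log2(p) for p in (pos / n, neg / n) if p > 0)

def gain(parent, children):
    n = sum(p + q for p, q in children)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q) for p, q in children)

print(round(gain((9, 5), [(3, 4), (6, 1)]), 3))   # Humidity: 0.152 (~0.151 on the slide)
print(round(gain((9, 5), [(6, 2), (3, 3)]), 3))   # Wind:     0.048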
Selecting the Next Attribute

S = [9+, 5-], E(S) = 0.940

Split on Outlook:
  Sunny:    [2+, 3-], E = 0.971
  Overcast: [4+, 0-], E = 0.0
  Rain:     [3+, 2-], E = 0.971

Gain(S, Outlook) = 0.940 - (5/14)*0.971 - (4/14)*0.0 - (5/14)*0.971 = 0.247
Selecting the Next Attribute

The information gain values for the four attributes are:
• Gain(S, Outlook)     = 0.247
• Gain(S, Humidity)    = 0.151
• Gain(S, Wind)        = 0.048
• Gain(S, Temperature) = 0.029

where S denotes the collection of training examples.
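The two gains not worked out in detail above, Outlook and Temperature, checked with the same count-based helper (repeated so the block is self-contained; the per-value counts come from the training table).

import math

def entropy(pos, neg):
    n = pos + neg
    return -sum(p * math.log2(p) for p in (pos / n, neg / n) if p > 0)

def gain(parent, children):
    n = sum(p + q for p, q in children)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q) for p, q in children)

# Outlook: Sunny [2+,3-], Overcast [4+,0-], Rain [3+,2-]
print(round(gain((9, 5), [(2, 3), (4, 0), (3, 2)]), 3))   # 0.247
# Temperature: Hot [2+,2-], Mild [4+,2-], Cool [3+,1-]
print(round(gain((9, 5), [(2, 2), (4, 2), (3, 1)]), 3))   # 0.029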

ID3 Algorithm

Root: Outlook, chosen for S = [D1, D2, ..., D14] = [9+, 5-]

  Sunny:    Ssunny = [D1, D2, D8, D9, D11] = [2+, 3-]  -> ?
  Overcast: [D3, D7, D12, D13] = [4+, 0-]              -> Yes
  Rain:     [D4, D5, D6, D10, D14] = [3+, 2-]          -> ?

Which attribute should be tested at the Sunny branch?
Gain(Ssunny, Humidity) = 0.970 - (3/5)*0.0 - (2/5)*0.0 = 0.970
Gain(Ssunny, Temp)     = 0.970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = 0.570
Gain(Ssunny, Wind)     = 0.970 - (2/5)*1.0 - (3/5)*0.918 = 0.019
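A check of these three values with the same count-based helper (repeated here; the per-value counts are taken from the five Sunny-branch examples D1, D2, D8, D9, D11).

import math

def entropy(pos, neg):
    n = pos + neg
    return -sum(p * math.log2(p) for p in (pos / n, neg / n) if p > 0)

def gain(parent, children):
    n = sum(p + q for p, q in children)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q) for p, q in children)

s_sunny = (2, 3)  # [2+, 3-] among D1, D2, D8, D9, D11
print(round(gain(s_sunny, [(0, 3), (2, 0)]), 3))          # Humidity: 0.971 (~0.970)
print(round(gain(s_sunny, [(0, 2), (1, 1), (1, 0)]), 3))  # Temp:     0.571 (~0.570)
print(round(gain(s_sunny, [(1, 2), (1, 1)]), 3))          # Wind:     0.020 (~0.019)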
ID3 Algorithm

The final tree:

Outlook = Sunny    -> Humidity = High   -> No   [D1, D2, D8]
                      Humidity = Normal -> Yes  [D9, D11]
Outlook = Overcast -> Yes                       [D3, D7, D12, D13]
Outlook = Rain     -> Wind = Strong     -> No   [D6, D14]
                      Wind = Weak       -> Yes  [D4, D5, D10]
Some Characteristics of Decision Trees
• Discrete-valued target function
• Best suited to discrete attribute values, though continuous attributes are also possible
• The target function is expressed as a disjunction of conjunctions
• Searches the complete space of finite discrete-valued functions for a correct hypothesis (a decision tree)
• However, it returns one single final hypothesis
• Greedy search with no backtracking; the solution may be sub-optimal
• Uses all training examples at each step
• Robust, but may suffer from overfitting
Occam's Razor
"If two theories explain the facts equally well, then the simpler theory is to be preferred."

Arguments in favor:
• There are fewer short hypotheses than long hypotheses
• A short hypothesis that fits the data is unlikely to be a coincidence
• A long hypothesis that fits the data might be a coincidence

Arguments opposed:
• There are many ways to define small sets of hypotheses

• This preference supports ID3's inductive bias: after training, the learner justifies future classifications (generalization) by having chosen the shortest tree that fits the data
