Machine Learning Approaches: Decision Trees
[Figure: a decision tree whose internal nodes test attributes (e.g. A2, A3) and whose leaves carry the class labels c1 and c2]
A B C | f
0 0 0 | m0
0 0 1 | m1
0 1 0 | m2
0 1 1 | m3
1 0 0 | m4
1 0 1 | m5
1 1 0 | m6
1 1 1 | m7
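As a concrete illustration (not from the original slides), such a tree can be read as nested if/else tests. A minimal Python sketch, with a hypothetical assignment of the minterms m0..m7 to the classes c1 and c2:

```python
def f(A, B, C):
    """One possible decision tree computing a Boolean function f(A, B, C):
    each internal node tests one attribute, each leaf returns a class
    (the c1/c2 assignment of the minterms here is hypothetical)."""
    if A == 0:
        if B == 0:
            return "c1"   # minterms m0, m1
        return "c2"       # minterms m2, m3
    if C == 0:            # another tree may test the attributes in a different order
        return "c2"       # minterms m4, m6
    return "c1"           # minterms m5, m7
```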
Some Characteristics
- A decision tree may be n-ary, n ≥ 2
- A decision tree is not unique: different orderings of the internal nodes give different decision trees
- A decision tree helps us classify data
[Figure: a decision tree on continuous attributes, with threshold tests such as x2 < 0.83, x1 < 0.27, x1 < 0.89, x2 < 0.34, x2 < 0.56 and x1 < 0.09, and leaves labeled C1 or C2]
2. Splitting criteria
Example: attribute A is chosen first.
[Figure: A at the root with branches 0 and 1, each leading to subtrees that separate the classes C1 and C2]
Learning sample: B = {(Xn, Yn), n = 1, ..., N}
- Choose the "best" attribute
- Split the learning sample
- IF all sample data from the training set have the same class: create a leaf with that class
- ELSE: proceed recursively until each sample is correctly classified
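A minimal Python sketch of this recursion (an illustration, not the slides' code; `best_attribute` is a hypothetical placeholder for the splitting criterion introduced in the next section):

```python
from collections import Counter

def best_attribute(samples, attributes):
    # Placeholder: a real implementation ranks attributes by a splitting
    # criterion such as the information gain defined in the next section.
    return attributes[0]

def build_tree(samples, attributes):
    """samples: list of (x, y) pairs, where x is a dict attribute -> value.
    Returns either a class label (leaf) or {attribute: {value: subtree}}."""
    classes = [y for _, y in samples]
    # IF all samples share the same class: create a leaf with that class
    if len(set(classes)) == 1:
        return classes[0]
    # No attribute left to test: fall back to the majority class
    if not attributes:
        return Counter(classes).most_common(1)[0][0]
    # Choose the "best" attribute and split the learning sample on it
    a = best_attribute(samples, attributes)
    remaining = [b for b in attributes if b != a]
    tree = {a: {}}
    for v in {x[a] for x, _ in samples}:
        subset = [(x, y) for x, y in samples if x[a] == v]
        tree[a][v] = build_tree(subset, remaining)
    return tree
```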
[Figure: example decision tree on network connection records; the nodes test the attributes flag (values SF, RSTO, REJ), protocol (tcp, udp) and service (http, private, domain-u), with yes/no branches and leaves labeled N, P, or D]
Splitting (on any attribute) has the property that the average entropy of the resulting training subsets is less than or equal to the entropy of the previous training set.
Entropy is a measure of impurity:

Entropy = - Σi pi log2(pi)

where pi (the probability of class "i") is the proportion of samples labeled with class "i" in the considered set.
For two classes:

[Figure: entropy I(p1) as a function of p1 ∈ [0, 1]; the curve is 0 at p1 = 0 and p1 = 1 and reaches its maximum of 1 at p1 = 0.5]

At p1 = 1 (or 0): Entropy = -1·log2(1) = 0
At p1 = 0.5: Entropy = -0.5·log2(0.5) - 0.5·log2(0.5) = 1
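These boundary values can be reproduced with a small helper (a sketch, not from the slides):

```python
from math import log2

def entropy(probs):
    """Entropy = -sum_i p_i * log2(p_i); a term with p_i = 0 contributes 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([1.0]))        # 0.0: all samples in a single class
print(entropy([0.5, 0.5]))   # 1.0: two equally likely classes
```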
2.a. ID3 Algorithm
• Entropy takes its minimum value (zero) if and only if all the instances have the same class (i.e., the training set has only one non-empty class, whose probability is 1)
• Entropy takes its maximum value when the instances are equally distributed among the K possible classes; in this case the maximum value of the entropy is log2(K)
[Figure: a set B of N instances with entropy E is split by a test node for an attribute A into subsets B1 (N1 instances, entropy E1) and B2 (N2 instances, entropy E2)]

Information Gain (due to attribute A) = E - {(N1/N)·E1 + (N2/N)·E2}

This represents the difference between:
- the information needed to identify an element of B, and
- the information needed to identify an element of B after the value of attribute A has been obtained.
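A sketch of this computation (reusing the `entropy` helper above; the class-count dictionaries are assumed inputs, not the slides' notation, and the example counts mirror the impurity example below):

```python
def subset_entropy(counts):
    """counts: class counts in a subset, e.g. {'+': 13, '-': 4}."""
    n = sum(counts.values())
    return entropy([c / n for c in counts.values()])

def information_gain(parent_counts, subset_counts):
    """Gain = E - sum_k (Nk/N) * Ek over the subsets produced by attribute A."""
    N = sum(parent_counts.values())
    weighted = sum(sum(c.values()) / N * subset_entropy(c)
                   for c in subset_counts)
    return subset_entropy(parent_counts) - weighted

# A 30-instance set split into subsets of 17 and 13 instances
print(information_gain({'+': 14, '-': 16},
                       [{'+': 13, '-': 4}, {'+': 1, '-': 12}]))  # ~0.38
```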
Example (N = 30 instances; the split gives N1 = 17 and N2 = 13 instances):

impurity(B1) = -(13/17)·log2(13/17) - (4/17)·log2(4/17) ≈ 0.787
impurity(B2) = -(1/13)·log2(1/13) - (12/13)·log2(12/13) ≈ 0.391
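These figures can be checked with the `entropy` helper sketched earlier:

```python
print(entropy([13/17, 4/17]))   # ~0.787
print(entropy([1/13, 12/13]))   # ~0.391
```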
Example: deciding whether to play ("jouer"). Training set: positive days J3, J4, J5, J7, J9, J10, J11, J12, J13; negative days J1, J2, J6, J8, J14. Candidate splits:

- Pif?   → E1 = 0, E2 = 0.971, E3 = 0.971
- Temp?  → E1 = 1, E2 = 0.918, E3 = 0.811
- Humid? (high / normal) → E1 = 0.985, E2 = 0.591
- Vent?  (true / false)  → true: {+ J7, J11, J12 ; - J2, J6, J14}, false: {+ J3, J4, J5, J9, J10, J13 ; - J1, J8}
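As a worked check (assuming, as in the classic 14-day version of this data, that the three Pif branches contain 5, 4 and 5 days): the full set with 9 positive and 5 negative days has entropy E = -(9/14)·log2(9/14) - (5/14)·log2(5/14) ≈ 0.940, so Gain(Pif) ≈ 0.940 - (5/14)·0.971 - (4/14)·0 - (5/14)·0.971 ≈ 0.246, the largest gain of the four candidates, which is why Pif becomes the root of the tree below.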
[Figure: the resulting decision tree — Pif at the root, with subtrees testing Humid (high / normal) and Vent, and leaves giving the decision play = yes or no]
For the Gini index, gini(T) = 1 - Σi pi² (e.g., n = 2 classes with p1 = 14/30 and p2 = 16/30):
- The smallest value of gini(T) is zero, which it takes when all the classifications are the same (the class distribution in T is completely skewed)
- It takes its largest value, 1 - 1/n, when the classes are evenly distributed between the tuples, that is, when the frequency of each class is 1/n
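A minimal sketch of this measure (illustrative Python, not from the slides):

```python
def gini(probs):
    """gini(T) = 1 - sum_i p_i**2."""
    return 1 - sum(p * p for p in probs)

print(gini([14/30, 16/30]))  # ~0.498 for the two-class example above
print(gini([1.0]))           # 0.0: all classifications are the same
print(gini([0.5, 0.5]))      # 0.5 = 1 - 1/n, the maximum for n = 2
```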
3. Overfitting Problem
REASONS FOR OVERFITTING
A tree that overfits the data will end up fitting the noise in the data.
[Figure: scatter plot of + and - training samples; a few + samples lie deep inside the - region, an area with probably noisy samples]
We may not have enough data in some part of the learning sample to
make a good decision
Pre-pruning: stop growing the tree earlier, before it reaches the point where it perfectly classifies the learning sample. Stop splitting a node if:
- the number of objects is too small
- the impurity is low enough
- the best test is not statistically significant (according to some statistical test)
Post-pruning: allow the tree to overfit, then post-prune it:
- Compute a sequence of trees {T1, T2, ...}, where T1 is the complete tree and Ti is obtained by removing some test nodes from Ti-1
- Select the tree Ti* from the sequence that minimizes the error on a validation sample set
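A sketch of the pre-pruning test described above (reusing the `entropy` helper and `Counter` import from earlier; the thresholds are illustrative assumptions, not values from the slides):

```python
def should_stop(samples, min_size=5, max_impurity=0.05):
    """Pre-pruning: stop splitting a node when it holds too few objects
    or is already pure enough (statistical tests omitted here)."""
    classes = [y for _, y in samples]
    n = len(classes)
    probs = [count / n for count in Counter(classes).values()]
    return n < min_size or entropy(probs) <= max_impurity
```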
A regression tree has exactly the same model as a decision tree, but with a number (called the prediction value) in each leaf instead of a class.
[Figure: regression tree — Outlook at the root; Sunny leads to a test on Humidity (High / Normal), Overcast to the leaf value 45.6, Rain to a test on Wind (Strong / Weak); the remaining leaves hold prediction values such as 1.2 and 3.4]
[Figure: the (x1, x2) input space partitioned into Region1, Region2 and Region3, each carrying a constant prediction: y = 2.2, y = 3.2, y = 5.6]
The regions are chosen to minimize the total squared error

Σ (m = 1, ..., M) Σ (sample(i) ∈ Rm) [yi - y(Rm)]²

where y(Rm) is the prediction value associated with region Rm.
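A sketch of this criterion in Python (assuming each region is given as its list of observed y values, and taking the region mean as the prediction value; the sample values are illustrative, chosen so the region means match the figure above):

```python
def regression_cost(regions):
    """Sum over regions Rm of the squared deviations of y from the
    region's prediction value (here, the mean of y within Rm)."""
    total = 0.0
    for ys in regions:
        y_hat = sum(ys) / len(ys)  # prediction value of the region
        total += sum((y - y_hat) ** 2 for y in ys)
    return total

# Three regions with means 2.2, 3.2 and 5.6, as in the figure
print(regression_cost([[2.0, 2.4], [3.0, 3.4], [5.6]]))  # ~0.16
```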