
Classification

Addressing Overfitting
What is classification?
Examples
A bank loan officer needs analysis of the data to learn which
loan applicants are “safe” and which are “risky” for the bank.
A marketing manager needs data analysis to help guess
whether a customer with a given profile will buy a new
computer.
A medical researcher wants to analyse cancer data to predict
which one of three specific treatments a patient should receive.
In each of these examples, the data analysis task is
classification, where a model or classifier is constructed to
predict class labels.
A class label is a discrete value whose ordering has no meaning.
Classification problems are prediction problems in which class labels are predicted.
Because the class label of each training example is provided, this step is also known as supervised learning.
General Approach to
classification
Data classification is a two-step process, consisting of a learning step (where a classification model is constructed) and a classification step (where the model is used to predict class labels for given data).
Decision Tree Induction
Decision tree induction is the learning of
decision trees from class-labelled training
examples.
A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label.
Decision Tree Induction
ID3 (Iterative Dichotomiser 3), by J. Ross Quinlan
C4.5 is a successor of ID3
Classification and Regression Trees (CART) – generation of binary trees
ID3, C4.5 and CART adopt a greedy (non-backtracking) approach in which decision trees are constructed in a top-down recursive divide-and-conquer manner.
The algorithm starts with a training set of tuples and their associated class labels.
The training set is recursively partitioned into smaller subsets as the tree is being built.
The Algorithm
Create a root node N for the tree.
If all examples in D are of the same class C, then return N as a leaf node labelled with the class C.
If attribute_list is empty, then return N as a leaf node labelled with the majority class in D (majority voting).
Apply Attribute_Selection_method(D, attribute_list) to find the “best” splitting_criterion.
Label node N with splitting_criterion.
If splitting_attribute is discrete-valued and multiway splits are allowed, then
◦ attribute_list ← attribute_list − splitting_attribute
For each outcome j of splitting_criterion:
◦ Let Dj be the set of data tuples in D satisfying outcome j.
◦ If Dj is empty, then attach a leaf labelled with the majority class in D to node N;
◦ else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N.
Endfor
Return N
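The pseudocode above can be sketched in Python. This is a minimal, illustrative sketch rather than the exact algorithm from the slides: the callable passed as best_split_attribute stands in for Attribute_Selection_method and is assumed to return the attribute with the best score under some measure (e.g., information gain, defined later in this section).

```python
from collections import Counter

def majority_class(rows, target):
    """Most common class label in the partition (majority voting)."""
    return Counter(row[target] for row in rows).most_common(1)[0][0]

def generate_decision_tree(rows, attributes, target, best_split_attribute):
    """Greedy, top-down, recursive divide-and-conquer tree construction.

    rows                 -- list of dicts mapping attribute name -> value
    attributes           -- candidate splitting attributes (attribute_list)
    target               -- name of the class-label column
    best_split_attribute -- callable implementing an attribute-selection measure
    """
    classes = {row[target] for row in rows}
    if len(classes) == 1:               # all tuples belong to the same class
        return classes.pop()
    if not attributes:                  # no attributes left: majority voting
        return majority_class(rows, target)

    best = best_split_attribute(rows, attributes, target)
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]   # multiway split: drop the attribute
    for value in {row[best] for row in rows}:          # one branch per outcome
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = generate_decision_tree(subset, remaining, target,
                                                   best_split_attribute)
    return tree
```

The empty-partition case of the pseudocode does not arise in this sketch, because branches are created only for attribute values that actually occur in the current partition.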
To partition the tuples in D, there are three possible scenarios. Let A be the splitting attribute; A has v distinct values {a1, a2, …, av} based on the training data:
◦ A is discrete-valued;
◦ A is continuous-valued;
◦ A is discrete-valued and a binary tree must be produced.
The recursive partitioning stops only when one of the following terminating conditions is true:
◦ All the tuples in partition D belong to the same class
◦ There are no remaining attributes on which the tuples
may be further partitioned. In this case majority voting is
employed.
◦ There are no tuples for a given branch, that is, a
partition Dj, is empty. In this case, a leaf is created with
the majority class in D.
Attribute Selection
Procedure
It is a heuristic for selecting the splitting criterion that
“best” separates a given data partition, D, of class-labelled
training tuples into individual classes.
It is also known as splitting rules because they determine
how the tuples at a given node are to be split.
It provides ranking for each attribute describing the given
training tuple. The attribute having the best score for the
measure is chosen as the splitting attribute for the given
tuples.
Well-known attribute selection measures are: information gain, gain ratio, and the Gini index.
Choose an attribute to
partition data
 The key to building a decision tree – which attribute to choose in
order to branch.
The objective is to reduce impurity or uncertainty in data as much
as possible (A subset of data is pure if all instances belong to the
same class).
The heuristic is to choose the attribute with the maximum information gain or gain ratio, based on information theory.
Information Theory
 Information Theory provides a mathematical basis for measuring
the information content.
 To understand the notion of information, think about it as
providing the answer to a question, for example whether a coin will
come up heads.
If one already has a good guess about the answer, then the actual answer
is less informative.
If one already knows that the coin is rigged so that it comes up heads with probability 0.99, then a message (advance information) about the actual outcome of a flip is worth less than it would be for an honest coin (50-50).
For a fair (honest) coin you have no prior information, and the less you know, the more valuable the information is.
Information theory uses this same intuition, measuring the value of information content in bits.
One bit of information is enough to answer a yes/no question about which one has no idea, such as the flip of a fair coin.
Entropy (information) measure:

$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$

where $p_i$ is the probability that an arbitrary tuple in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$.
We use entropy (information) as a measure of the impurity of data set D.
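As a quick check of the formula, here is a small Python helper (a sketch, not part of the slides) that computes Info(D) from the class counts of a partition:

```python
import math

def entropy(class_counts):
    """Info(D) = -sum over classes of p_i * log2(p_i)."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

# Example used in the exercise below: 9 positive and 5 negative examples
print(round(entropy([9, 5]), 3))   # 0.94
```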
http://www.saedsayad.com/decision_tree.htm

As the data becomes purer and purer, the entropy value becomes smaller and smaller.
Constructing a tree is all about finding attributes that return the highest information gain (i.e., the most homogeneous branches).
Information Gain
(ID3/C4.5)
The information gain is based on the decrease in entropy after a
dataset is split on an attribute.
Select the attribute with the highest information gain.
Let pi be the probability that an arbitrary tuple in D
belongs to class Ci, estimated by |Ci, D|/|D|
Expected information (entropy) needed to classify a tuple
in D:
$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
Information needed (after using A to split D into v
partitions) to classify D:

$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$
Information gained by branching on attribute A

$Gain(A) = Info(D) - Info_A(D)$


Exercise
Calculate the information(entropy)
Given:
Set S contains 14 examples
9 Positive values
5 Negative values
Exercise
Info(S) = - (9/14) log2 (9/14) - (5/14) log2 (5/14)
= 0.940
Information Gain
Information gain is based on the decrease in entropy after a
dataset is split on an attribute.
Looking for which attribute creates the most homogeneous
branches
Information Gain
Example
14 examples, 9 positive 5 negative
The attribute is Wind.
Values of wind are Weak and Strong
Exercise (cont.)
8 occurrences of weak winds
6 occurrences of strong winds
For the weak winds, 6 are positive and 2 are negative
For the strong winds, 3 are positive and 3 are negative
Exercise (cont.)
Gain(S,Wind) =
Info(S) - (8/14) * Info(Weak)
-(6/14) * Info(Strong)
Info(Weak) = - (6/8)*log2(6/8) - (2/8)*log2(2/8) = 0.811
Info(Strong) = - (3/6)*log2(3/6) - (3/6)*log2(3/6) = 1.00
Exercise (cont.)
So…
0.940 - (8/14)*0.811 - (6/14)*1.00
= 0.048
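The same result can be reproduced with a short script; the counts are taken directly from the exercise above:

```python
from math import log2

def entropy(counts):
    """Info(S) for a partition given as a list of class counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

def information_gain(parent_counts, partitions):
    """Gain(S, A) = Info(S) - sum(|Sj|/|S| * Info(Sj)) over the partitions induced by A."""
    n = sum(parent_counts)
    return entropy(parent_counts) - sum(sum(p) / n * entropy(p) for p in partitions)

# Wind: 8 weak examples (6+, 2-) and 6 strong examples (3+, 3-); S = (9+, 5-)
print(round(information_gain([9, 5], [[6, 2], [3, 3]]), 3))   # 0.048
```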
Sample training data to determine
whether an animal lays eggs.
Condition (independent) attributes: Warm-blooded, Feathers, Fur, Swims. Decision (dependent) attribute: Lays Eggs.

Animal     Warm-blooded  Feathers  Fur  Swims  Lays Eggs
Ostrich    Yes           Yes       No   No     Yes
Crocodile  No            No        No   Yes    Yes
Raven      Yes           Yes       No   No     Yes
Albatross  Yes           Yes       No   No     Yes
Dolphin    Yes           No        No   Yes    No
Koala      Yes           No        Yes  No     No
Entropy(4Y,2N): -(4/6)log2(4/6) – (2/6)log2(2/6)
= 0.91829

Now, we have to find the information gain (IG) for all four attributes: Warm-blooded, Feathers, Fur, Swims.
For attribute ‘Warm-blooded’:
Values(Warm-blooded) : [Yes,No]
S = [4Y,2N]
SYes = [3Y,2N] E(SYes) = 0.97095
SNo = [1Y,0N] E(SNo) = 0 (all members belong to same class)
Gain(S,Warm-blooded) = 0.91829 – [(5/6)*0.97095 + (1/6)*0]
= 0.10916
For attribute ‘Feathers’:
Values(Feathers) : [Yes,No]
S = [4Y,2N]
SYes = [3Y,0N] E(SYes) = 0
SNo = [1Y,2N] E(SNo) = 0.91829
Gain(S,Feathers) = 0.91829 – [(3/6)*0 + (3/6)*0.91829]
= 0.45914
For attribute ‘Fur’:
Values(Fur) : [Yes,No]
S = [4Y,2N]
SYes = [0Y,1N] E(SYes) = 0
SNo = [4Y,1N] E(SNo) = 0.7219
Gain(S,Fur) = 0.91829 – [(1/6)*0 + (5/6)*0.7219]
= 0.3167
For attribute ‘Swims’:
Values(Swims) : [Yes,No]
S = [4Y,2N]
SYes = [1Y,1N] E(SYes) = 1 (equal members in both classes)
SNo = [3Y,1N] E(SNo) = 0.81127
Gain(S,Swims) = 0.91829 – [(2/6)*1 + (4/6)*0.81127]
= 0.04411
Gain(S,Warm-blooded) = 0.10916
Gain(S,Feathers) = 0.45914
Gain(S,Fur) = 0.31670
Gain(S,Swims) = 0.04411
Gain(S,Feathers) is maximum, so Feathers is chosen as the root node.

The ‘Y’ descendant has only positive examples and becomes a leaf node with classification ‘Lays Eggs’:

Feathers
  Y → [Ostrich, Raven, Albatross] : Lays Eggs
  N → [Crocodile, Dolphin, Koala] : ?
Animal     Warm-blooded  Fur  Swims  Lays Eggs
Crocodile  No            No   Yes    Yes
Dolphin    Yes           No   Yes    No
Koala      Yes           Yes  No     No

We now repeat the procedure for this subset.
S = [Crocodile, Dolphin, Koala]
S: [1+,2-]
Entropy(S) = -(1/3)log2(1/3) – (2/3)log2(2/3) = 0.91829
For attribute ‘Warm-blooded’:
Values(Warm-blooded) : [Yes,No]
S = [1Y,2N]
SYes = [0Y,2N] E(SYes) = 0
SNo = [1Y,0N] E(SNo) = 0
Gain(S,Warm-blooded) = 0.91829 – [(2/3)*0 + (1/3)*0] = 0.91829
For attribute ‘Fur’:
Values(Fur) : [Yes,No]
S = [1Y,2N]
SYes = [0Y,1N] E(SYes) = 0
SNo = [1Y,1N] E(SNo) = 1
Gain(S,Fur) = 0.91829 – [(1/3)*0 + (2/3)*1] = 0.25162
For attribute ‘Swims’:
Values(Swims) : [Yes,No]
S = [1Y,2N]
SYes = [1Y,1N] E(SYes) = 1
SNo = [0Y,1N] E(SNo) = 0
Gain(S,Swims) = 0.91829 – [(2/3)*1 + (1/3)*0] = 0.25162
Gain(S,Warm-blooded) is maximum, so Warm-blooded is chosen as the next splitting attribute.
The final decision tree will be:

Feathers
  Y → Lays Eggs
  N → Warm-blooded
        Y → Does not lay eggs
        N → Lays Eggs

Factors affecting sunburn

Name   Hair    Height   Weight   Lotion  Sunburned
Sarah  Blonde  Average  Light    No      Yes
Dana   Blonde  Tall     Average  Yes     No
Alex   Brown   Short    Average  Yes     No
Annie  Blonde  Short    Average  No      Yes
Emily  Red     Average  Heavy    No      Yes
Pete   Brown   Tall     Heavy    No      No
John   Brown   Average  Heavy    No      No
Katie  Blonde  Short    Light    Yes     No
S = [3+, 5-]
Entropy(S) = -(3/8)log2(3/8) – (5/8)log2(5/8)
= 0.95443
Find IG for all 4 attributes: Hair, Height, Weight, Lotion
For attribute ‘Hair’:
Values(Hair) : [Blonde, Brown, Red]
S = [3+,5-]
SBlonde = [2+,2-] E(SBlonde) = 1
SBrown = [0+,3-] E(SBrown) = 0
SRed = [1+,0-] E(SRed) = 0
Gain(S,Hair) = 0.95443 – [(4/8)*1 + (3/8)*0 + (1/8)*0]
= 0.45443
For attribute ‘Height’:
Values(Height) : [Average, Tall, Short]
SAverage = [2+,1-] E(SAverage) = 0.91829
STall = [0+,2-] E(STall) = 0
SShort = [1+,2-] E(SShort) = 0.91829
Gain(S,Height) = 0.95443 – [(3/8)*0.91829 + (2/8)*0 + (3/8)*0.91829] = 0.26571
For attribute ‘Weight’:
Values(Weight) : [Light, Average, Heavy]
SLight = [1+,1-] E(SLight) = 1
SAverage = [1+,2-] E(SAverage) = 0.91829
SHeavy = [1+,2-] E(SHeavy) = 0.91829
Gain(S,Weight) = 0.95443 – [(2/8)*1 + (3/8)*0.91829 + (3/8)*0.91829] = 0.01571
For attribute ‘Lotion’:
Values(Lotion) : [Yes, No]
SYes = [0+,3-] E(SYes) = 0
SNo = [3+,2-] E(SNo) = 0.97095
Gain(S,Lotion) = 0.95443 – [(3/8)*0 + (5/8)*0.97095] = 0.34759
Gain(S,Hair) = 0.45443
Gain(S,Height) = 0.26571
Gain(S,Weight) = 0.01571
Gain(S,Lotion) = 0.3475
Gain(S,Hair) is maximum, so it is considered as the root node
Hair is the root node, and the partitions are:

Hair
  Blonde → [Sarah, Dana, Annie, Katie] : ?
  Red    → [Emily] : Sunburned
  Brown  → [Alex, Pete, John] : Not Sunburned
Name Hair Height Weight Lotion Sunburned
Sarah Blonde Average Light No Yes
Dana Blonde Tall Average Yes No
Annie Blonde Short Average No Yes
Katie Blonde Short Light Yes No

Repeating again:
S = [Sarah, Dana, Annie, Katie]
S: [2+,2-]
Entropy(S) = 1
Find IG for remaining 3 attributes Height, Weight, Lotion
For attribute ‘Height’:
Values(Height) : [Average, Tall, Short]
S = [2+,2-]
SAverage = [1+,0-] E(SAverage) = 0
STall = [0+,1-] E(STall) = 0
SShort = [1+,1-] E(SShort) = 1
Gain(S,Height) = 1 – [(1/4)*0 + (1/4)*0 + (2/4)*1] = 0.5
For attribute ‘Weight’:
Values(Weight) : [Average, Light]
S = [2+,2-]
SAverage = [1+,1-] E(SAverage) = 1
SLight = [1+,1-] E(SLight) = 1
Gain(S,Weight) = 1 – [(2/4)*1 + (2/4)*1]= 0
For attribute ‘Lotion’:
Values(Lotion) : [Yes, No]
S = [2+,2-]
SYes = [0+,2-] E(SYes) = 0
SNo = [2+,0-] E(SNo) = 0
Gain(S,Lotion) = 1 – [(2/4)*0 + (2/4)*0]= 1

Therefore, Gain(S,Lotion) is maximum. In this case, the final decision tree will be:

Hair
  Blonde → Lotion
             Y → Not Sunburned
             N → Sunburned
  Red    → Sunburned
  Brown  → Not Sunburned
Computing Information-Gain
for Continuous-Valued
Attributes
Let A be a continuous-valued attribute.
 Discretization ( break real-valued attributes into ranges in advance)
Using thresholds for splitting nodes
Must determine the best split point for A
◦ Sort the value A in increasing order
◦ Typically, the midpoint between each pair of adjacent values is considered as a possible split
point
◦ (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
◦ The point with the minimum expected information requirement for A is selected as the split-
point for A

Split:
◦ D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A
> split-point
Example:
Check thresholds: is length ≤ 12.5? ≤ 24.5? ≤ 30? ≤ 45?

Length  10  15  21  28  32  40  50
Class   No  Yes Yes No  Yes Yes No
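A sketch of the midpoint search described above, applied to the Length/Class example (the entropy helper mirrors the earlier formula):

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels))

def best_split_point(values, labels):
    """Evaluate the midpoint between each pair of adjacent sorted values and
    return the split point with the minimum expected information requirement."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(len(pairs) - 1):
        mid = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [c for v, c in pairs if v <= mid]
        right = [c for v, c in pairs if v > mid]
        info = len(left) / len(pairs) * entropy(left) + len(right) / len(pairs) * entropy(right)
        if best is None or info < best[1]:
            best = (mid, info)
    return best

length = [10, 15, 21, 28, 32, 40, 50]
cls = ['No', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No']
print(best_split_point(length, cls))   # candidate midpoints include 12.5, 24.5, 30 and 45
```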
Gain Ratio for Attribute Selection
(C4.5)
Information gain measure is biased towards attributes with a
large number of values
C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\!\left(\frac{|D_j|}{|D|}\right)$
◦ GainRatio(A) = Gain(A)/SplitInfo(A)
Ex.
◦ gain_ratio(income) = 0.029/1.557 = 0.019
The attribute with the maximum gain ratio is selected as the
splitting attribute
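A sketch of the normalisation, assuming Gain(A) has already been computed as in the earlier slides; the 0.029 and 1.557 figures are the income example quoted above:

```python
from math import log2

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum(|Dj|/|D| * log2(|Dj|/|D|))."""
    total = sum(partition_sizes)
    return -sum(s / total * log2(s / total) for s in partition_sizes if s)

def gain_ratio(gain, partition_sizes):
    return gain / split_info(partition_sizes)

# income splits the 14 tuples below into 4 'high', 6 'medium' and 4 'low'
print(round(split_info([4, 6, 4]), 3))          # 1.557
print(round(gain_ratio(0.029, [4, 6, 4]), 3))   # 0.019
```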
age income student credit_rating buys_computer
youth high no fair no
youth high no excellent no
middle_aged high no fair yes
senior medium no fair yes
senior low yes fair yes
senior low yes excellent no
middle_aged low yes excellent yes
youth medium no fair no
youth low yes fair yes
senior medium yes fair yes
youth medium yes excellent yes
middle_aged medium no excellent yes
middle_aged high yes fair yes
senior medium no excellent no
Gini Index (CART, IBM IntelligentMiner)
If a data set D contains examples from n classes, the gini index gini(D) is defined as

$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$
where pj is the relative frequency of class j in D.
If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

$gini_A(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2)$
Reduction in Impurity:
$\Delta gini(A) = gini(D) - gini_A(D)$
The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (this requires enumerating all possible splitting points for each attribute).
Computation of Gini Index
Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”
$gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$
Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}:

$gini_{income \in \{low,\,medium\}}(D) = \frac{10}{14}\, gini(D_1) + \frac{4}{14}\, gini(D_2) = 0.443$

Gini{low,high} is 0.458 and Gini{medium,high} is 0.450. Thus, split on {low, medium} (and {high}) since it has the lowest Gini index.
All attributes are assumed continuous-valued.
May need other tools, e.g., clustering, to get the possible split values
Can be modified for categorical attributes
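The numbers above can be checked with a short sketch; the 7/3 and 2/2 class counts for the income split are read off the buys_computer table shown earlier:

```python
def gini(class_counts):
    """gini(D) = 1 - sum(p_j^2)."""
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)

def gini_split(partitions):
    """Weighted Gini index of a split; partitions is a list of class-count lists."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

print(round(gini([9, 5]), 3))                  # 0.459
# D1 = {low, medium} income: 7 yes / 3 no;  D2 = {high}: 2 yes / 2 no
print(round(gini_split([[7, 3], [2, 2]]), 3))  # 0.443
```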
Comparing Attribute Selection Measures

The three measures, in general, return good results, but:
◦ Information gain:
◦ biased towards multivalued attributes

◦ Gain ratio:
◦ tends to prefer unbalanced splits in which one partition is much smaller than the others

◦ Gini index:
◦ biased to multivalued attributes
◦ has difficulty when # of classes is large
◦ tends to favor tests that result in equal-sized partitions and purity in both partitions
Other Attribute Selection Measures
CHAID: a popular decision tree algorithm, measure based on χ2 test for
independence

C-SEP: performs better than info. gain and gini index in certain cases

G-statistic: has a close approximation to χ2 distribution

MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
◦ The best tree as the one that requires the fewest # of bits to both (1) encode the tree,
and (2) encode the exceptions to the tree

Multivariate splits (partition based on multiple variable combinations)


◦ CART: finds multivariate splits based on a linear combination of attributes.

Which attribute selection measure is the best?


◦ Most give good results; none is significantly superior to the others.
Overfitting and Tree Pruning
Overfitting: An induced tree may overfit the training data
◦ Too many branches, some may reflect anomalies due to
noise or outliers
◦ Poor accuracy for unseen samples
Two approaches to avoid overfitting
◦ Prepruning: Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
◦ Difficult to choose an appropriate threshold

◦ Postpruning: Remove branches from a “fully grown” tree to get a sequence of progressively pruned trees
◦ Use a set of data different from the training data to decide which is the “best pruned tree”
Enhancements to Basic Decision Tree Induction

Allow for continuous-valued attributes
◦ Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete set of
intervals
Handle missing attribute values
◦ Assign the most common value of the attribute
◦ Assign probability to each of the possible values
Attribute construction
◦ Create new attributes based on existing ones that are
sparsely represented
◦ This reduces fragmentation, repetition, and replication
Bayesian Classification: Why?
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities.
Foundation: based on Bayes’ theorem.
Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers.
Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data.
Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.
Bayes’ Theorem: Basics
Total probability theorem:

$P(B) = \sum_{i=1}^{M} P(B \mid A_i)\, P(A_i)$

Bayes’ theorem:

$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}$
◦ Let X be a data sample (“evidence”): class label is unknown
◦ Let H be a hypothesis that X belongs to class C
◦ Classification is to determine P(H|X), (i.e., posteriori probability): the probability
that the hypothesis holds given the observed data sample X
◦ P(H) (prior probability): the initial probability
◦ E.g., X will buy computer, regardless of age, income, …
◦ P(X): probability that sample data is observed
◦ P(X|H) (likelihood): the probability of observing the sample X, given that the
hypothesis holds
◦ E.g., Given that X will buy computer, the prob. that X is 31..40, medium income
Prediction Based on Bayes’ Theorem
Given training data X, posteriori probability of a hypothesis H, P(H|X),
follows the Bayes’ theorem

$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}$
Informally, this can be viewed as
posteriori = likelihood x prior/evidence

Predicts that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes.
Practical difficulty: It requires initial knowledge of many
probabilities, involving significant computational cost
Classification Is to Derive the Maximum Posteriori
Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector X = (x1,
x2, …, xn)
Suppose there are m classes C1, C2, …, Cm.
Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
This can be derived from Bayes’ theorem

$P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}$
Since P(X) is constant for all classes, only $P(X \mid C_i)\, P(C_i)$ needs to be maximized.
Naïve Bayes Classifier
A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):

$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \cdots \times P(x_n \mid C_i)$
This greatly reduces the computation cost: Only counts the
class distribution
If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value
xk for Ak divided by |Ci, D| (# of tuples of Ci in D)
If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:
$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

and $P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$
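For the continuous case, a minimal sketch of the Gaussian density used as the class-conditional probability; the mean and standard deviation below are hypothetical values chosen only for illustration:

```python
from math import exp, pi, sqrt

def gaussian(x, mu, sigma):
    """g(x, mu, sigma) = 1 / (sqrt(2*pi) * sigma) * exp(-(x - mu)^2 / (2 * sigma^2))."""
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

# Hypothetical: within class Ci the attribute 'age' has mean 38 and std 12,
# so P(age = 35 | Ci) would be estimated as
print(gaussian(35, 38, 12))
```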
Naïve Bayes Classifier: Training Dataset
Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Data to be classified:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Naïve Bayes Classifier: An Example
(using the training dataset above)

P(Ci):
P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357

Compute P(X|Ci) for each class:
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X|Ci): P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci): P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.044 x 0.643 = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.019 x 0.357 = 0.007
Therefore, X belongs to the class buys_computer = “yes”.
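The whole worked example can be reproduced with a short script; the conditional probabilities are exactly the counts listed above:

```python
# Priors and class-conditional probabilities estimated from the 14-tuple training set
p_yes, p_no = 9 / 14, 5 / 14
cond_yes = {'age<=30': 2 / 9, 'income=medium': 4 / 9, 'student=yes': 6 / 9, 'credit=fair': 6 / 9}
cond_no  = {'age<=30': 3 / 5, 'income=medium': 2 / 5, 'student=yes': 1 / 5, 'credit=fair': 2 / 5}

def likelihood(cond_probs):
    """P(X|Ci) under the naive conditional-independence assumption."""
    prod = 1.0
    for p in cond_probs.values():
        prod *= p
    return prod

lx_yes, lx_no = likelihood(cond_yes), likelihood(cond_no)
print(round(lx_yes, 3), round(lx_no, 3))                  # 0.044 0.019
# Multiply by the priors and pick the larger product (P(X) cancels out)
print('yes' if lx_yes * p_yes > lx_no * p_no else 'no')   # yes
```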
Avoiding the Zero-Probability
Problem
Naïve Bayesian prediction requires each conditional prob. be
non-zero. Otherwise, the predicted prob. will be zero
$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$

Ex. Suppose a dataset with 1000 tuples: income = low (0), income = medium (990), and income = high (10).
Use Laplacian correction (or Laplacian estimator)
◦ Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003

◦ The “corrected” probability estimates are close to their “uncorrected” counterparts.
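The correction simply adds one to each value’s count before normalising; a quick sketch with the income example above:

```python
def laplace_probs(counts):
    """Laplacian correction: add 1 to each count, then normalise."""
    total = sum(counts) + len(counts)
    return [(c + 1) / total for c in counts]

# income = low (0), medium (990), high (10) over 1000 tuples
print(laplace_probs([0, 990, 10]))   # [1/1003, 991/1003, 11/1003]
```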
Naïve Bayes Classifier: Comments
Advantages
◦ Easy to implement
◦ Good results obtained in most of the cases
Disadvantages
◦ Assumption: class conditional independence, therefore loss of
accuracy
◦ Practically, dependencies exist among variables
◦ E.g., hospitals: a patient’s profile (age, family history, etc.), symptoms (fever, cough, etc.), and disease (lung cancer, diabetes, etc.)
◦ Dependencies among these cannot be modeled by the naïve Bayes classifier

How to deal with these dependencies? Bayesian belief networks (Chapter 9).
Preparing Data for
Classification and Prediction
Data Cleaning: remove or reduce noise (apply smoothing) and treat missing values (e.g., replace a missing value with the most commonly occurring value for that attribute, or with the most probable value based on statistics).
Relevance Analysis : 1) Correlation analysis is used to check
whether any two given attributes are statistically related. 2)
Dataset may also contain irrelevant attributes. These kind of
attributes may otherwise slow down and possibly mislead the
learning step.
Data Transformation and Reduction: data may be transformed by normalization (scaling attribute values so that they fall within a common range) or by generalizing them to higher-level concepts.
Advantage of ID3
Understandable prediction rules are created from the
training data.
Builds the fastest tree.
Builds a short tree.
Only need to test enough attributes until all data is
classified.
Finding leaf nodes enables test data to be pruned, reducing
number of tests.
Disadvantage of ID3
Data may be over-fitted or over-classified if a small sample is tested.
Only one attribute at a time is tested for making a decision.
Classifying continuous data may be computationally
expensive, as many trees must be generated to see where to
break the continuum.
