What is classification?
Examples
A bank loan officer needs an analysis of loan data to learn which
loan applicants are "safe" and which are "risky" for the bank.
A marketing manager needs data analysis to help guess
whether a customer with a given profile will buy a new
computer.
A medical researcher wants to analyse cancer data to predict
which one of three specific treatments a patient should receive.
In each of these examples, the data analysis task is
classification, where a model or classifier is constructed to
predict class labels.
A class label is a discrete value where the ordering among values has no meaning.
Classification problems are prediction problems in which class labels are predicted.
Because the class label of each training example is provided, this step is also known as supervised learning.
General Approach to
classification
Data classification is a two-step process, consisting of a learning step (where a classification model is constructed) and a classification step (where the model is used to predict class labels for given data).
Decision Tree Induction
Decision tree induction is the learning of
decision trees from class-labelled training
examples.
A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label.
Decision Tree Induction
ID3 (Iterative Dichotomiser 3), by J. Ross Quinlan
C4.5 is a successor of ID3
Classification and Regression Trees (CART) – generation of binary trees
ID3, C4.5 and CART adopt a greedy (non-backtracking) approach in which decision trees are constructed in a top-down recursive divide-and-conquer manner.
The algorithm starts with a training set of tuples and their associated class labels.
The training set is recursively partitioned into smaller subsets as the tree is being built.
The Algorithm
Create a root node N for the tree;
If all examples in D are of the same class C, then return N as a leaf node labelled with the class C;
If attribute_list is empty, then return N as a leaf node labelled with the majority class in D (majority voting);
Apply Attribute_selection_method(D, attribute_list) to find the "best" splitting_criterion;
Label node N with splitting_criterion;
If splitting_attribute is discrete-valued and multiway splits are allowed then
◦ attribute_list ← attribute_list − splitting_attribute;
For each outcome j of splitting_criterion:
◦ Let Dj be the set of data tuples in D satisfying outcome j;
◦ If Dj is empty then attach a leaf labelled with the majority class in D to node N;
◦ else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
Endfor
Return N
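As a concrete illustration, here is a minimal Python sketch of this recursive procedure. It assumes tuples are dicts, the class label sits under a "class" key, and the attribute selection measure is passed in as a function standing in for Attribute_selection_method; these representation choices are assumptions, not part of the original algorithm.

    from collections import Counter

    def majority_class(D):
        """Majority voting: the most common class label in partition D."""
        return Counter(t["class"] for t in D).most_common(1)[0][0]

    def generate_decision_tree(D, attribute_list, select_attribute):
        """Top-down recursive divide-and-conquer tree induction.

        D              -- list of training tuples (dicts: attribute -> value)
        attribute_list -- attributes still available for splitting
        select_attribute(D, attrs) -- returns the "best" splitting attribute
        """
        classes = {t["class"] for t in D}
        if len(classes) == 1:                 # all tuples belong to the same class C
            return {"leaf": classes.pop()}
        if not attribute_list:                # no attributes left: majority voting
            return {"leaf": majority_class(D)}

        A = select_attribute(D, attribute_list)
        node = {"test": A, "branches": {}}
        remaining = [a for a in attribute_list if a != A]   # discrete, multiway split

        for j in {t[A] for t in D}:           # one branch per observed outcome of the test
            Dj = [t for t in D if t[A] == j]
            # In the pseudocode, outcomes range over all values of A, so Dj can be
            # empty; iterating over observed values it never is, but the guard is
            # kept to mirror the algorithm.
            if not Dj:
                node["branches"][j] = {"leaf": majority_class(D)}
            else:
                node["branches"][j] = generate_decision_tree(Dj, remaining, select_attribute)
        return node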
To partition the tuples in D, there are three possible scenarios. Let A be the splitting attribute; A has v distinct values, {a1, a2, …, av}, based on the training data.
◦ A is discrete-valued: a branch is created for each known value of A.
◦ A is continuous-valued: two branches are created, for A ≤ split_point and A > split_point.
◦ A is discrete-valued and a binary tree must be produced: the test has the form "A ∈ S_A?", where S_A is a subset of A's values.
The recursive partitioning stops only when one of the following terminating conditions is true:
◦ All the tuples in partition D belong to the same class
◦ There are no remaining attributes on which the tuples may be further partitioned. In this case, majority voting is employed.
◦ There are no tuples for a given branch, that is, a partition Dj is empty. In this case, a leaf is created with the majority class in D.
Attribute Selection
Procedure
An attribute selection measure is a heuristic for selecting the splitting criterion that "best" separates a given data partition, D, of class-labelled training tuples into individual classes.
Such measures are also known as splitting rules because they determine how the tuples at a given node are to be split.
The measure provides a ranking for each attribute describing the given training tuples. The attribute having the best score for the measure is chosen as the splitting attribute for the given tuples.
Well-known attribute selection measures are: information gain, gain ratio, and the gini index.
Choose an attribute to
partition data
The key to building a decision tree – which attribute to choose in
order to branch.
The objective is to reduce impurity or uncertainty in data as much
as possible (A subset of data is pure if all instances belong to the
same class).
The heuristic is to choose the attribute with the maximum information gain or gain ratio, based on information theory.
Information Theory
Information Theory provides a mathematical basis for measuring
the information content.
To understand the notion of information, think about it as
providing the answer to a question, for example whether a coin will
come up heads.
If one already has a good guess about the answer, then the actual answer
is less informative.
If one already knows that the coin is rigged so that it will come up heads with probability 0.99, then a message (advance information) about the actual outcome of a flip is worth less than it would be for an honest coin (50-50).
For a fair (honest) coin, you have no advance information, and the less you know, the more valuable the information is.
Information theory uses this same intuition, measuring the value of information content in bits.
One bit of information is enough to answer a yes/no question about which one has no idea, such as the flip of a fair coin.
Entropy (information) measure:
Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
where p_i is the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|.
We use entropy (information) as a measure of the impurity of data set D.
http://www.saedsayad.com/decision_tree.htm
As the data becomes purer and purer, the entropy value becomes
smaller and smaller.
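A small Python helper makes the measure concrete (a sketch reusing the dict-of-tuples representation assumed in the algorithm sketch above):

    import math
    from collections import Counter

    def entropy(D):
        """Info(D) = -sum_i p_i log2(p_i): 0 for a pure partition,
        1 bit for a 50-50 two-class partition."""
        n = len(D)
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(t["class"] for t in D).values())

    # A 50-50 two-class partition has maximum impurity: 1 bit.
    print(entropy([{"class": "yes"}] * 2 + [{"class": "no"}] * 2))  # 1.0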
Constructing a tree is all about finding the attributes that return the highest information gain (i.e. the most homogeneous branches).
Information Gain
(ID3/C4.5)
The information gain is based on the decrease in entropy after a
dataset is split on an attribute.
Select the attribute with the highest information gain.
Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|.
Expected information (entropy) needed to classify a tuple in D:
Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
Information needed (after using A to split D into v partitions) to classify D:
Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
Information gained by branching on attribute A:
Gain(A) = Info(D) - Info_A(D)
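Continuing the sketch, information gain follows directly from the entropy helper above (same assumed representation):

    def info_gain(D, A):
        """Gain(A) = Info(D) - Info_A(D): expected entropy reduction
        from partitioning D on attribute A."""
        n = len(D)
        info_A = 0.0
        for v in {t[A] for t in D}:              # partitions induced by A's values
            Dj = [t for t in D if t[A] == v]
            info_A += len(Dj) / n * entropy(Dj)  # weighted by |Dj|/|D|
        return entropy(D) - info_A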
[Figure: a small example decision tree whose internal nodes test Feathers (Y/N), Warm-blooded, and Lays eggs (Y/N).]
Repeating the procedure on the subset S = [Sarah, Dana, Annie, Katie]:
S: [2+, 2-]
Entropy(S) = 1
Find the information gain for the remaining 3 attributes: Height, Weight, Lotion.
For attribute 'Height':
Values(Height): [Average, Tall, Short]
S = [2+, 2-]
S_Average = [1+, 0-], E(S_Average) = 0
S_Tall = [0+, 1-], E(S_Tall) = 0
S_Short = [1+, 1-], E(S_Short) = 1
Gain(S, Height) = 1 - [(1/4)·0 + (1/4)·0 + (2/4)·1] = 0.5
For attribute 'Weight':
Values(Weight): [Average, Light]
S = [2+, 2-]
S_Average = [1+, 1-], E(S_Average) = 1
S_Light = [1+, 1-], E(S_Light) = 1
Gain(S, Weight) = 1 - [(2/4)·1 + (2/4)·1] = 0
For attribute 'Lotion':
Values(Lotion): [Yes, No]
S = [2+, 2-]
S_Yes = [0+, 2-], E(S_Yes) = 0
S_No = [2+, 0-], E(S_No) = 0
Gain(S, Lotion) = 1 - [(2/4)·0 + (2/4)·0] = 1
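These numbers can be checked with the helpers sketched above. The attribute values below are reconstructed from the counts in this example, so treat the table as illustrative data rather than the original dataset:

    S = [  # "+" = sunburned, "-" = not sunburned
        {"name": "Sarah", "Height": "Average", "Weight": "Light",   "Lotion": "No",  "class": "+"},
        {"name": "Dana",  "Height": "Tall",    "Weight": "Average", "Lotion": "Yes", "class": "-"},
        {"name": "Annie", "Height": "Short",   "Weight": "Average", "Lotion": "No",  "class": "+"},
        {"name": "Katie", "Height": "Short",   "Weight": "Light",   "Lotion": "Yes", "class": "-"},
    ]
    for A in ("Height", "Weight", "Lotion"):
        print(A, info_gain(S, A))   # Height 0.5, Weight 0.0, Lotion 1.0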
[Figure: the resulting decision tree. The root tests Hair (Blonde / Red / Brown); the Red branch leads to Sunburned, the Brown branch to Not Sunburned, and the Blonde branch to a Lotion test (Y → Not Sunburned, N → Sunburned).]
Computing Information Gain
for Continuous-Valued
Attributes
Let attribute A be a continuous-valued attribute. To handle A, either:
Discretization (break real-valued attributes into ranges in advance), or
Use thresholds for splitting nodes; the best split point for A must then be determined:
◦ Sort the values of A in increasing order
◦ Typically, the midpoint between each pair of adjacent values is considered as a possible split point
◦ (a_i + a_{i+1}) / 2 is the midpoint between the values a_i and a_{i+1}
◦ The point with the minimum expected information requirement for A is selected as the split point for A
Split:
◦ D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
Example: check thresholds where length ≤ 12.5? ≤ 24.5? ≤ 30? ≤ 45?
Length: 10  15  21  28  32  40  50
Class:  No  Yes Yes No  Yes Yes No
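A sketch of this search over candidate split points, reusing the entropy helper from earlier and the Length/Class data above:

    def best_split_point(D, A):
        """Try the midpoint between each pair of adjacent values of A and
        return (split_point, expected_info) with the minimum expected information."""
        n = len(D)
        values = sorted({t[A] for t in D})
        best = None
        for lo, hi in zip(values, values[1:]):
            point = (lo + hi) / 2                   # midpoint (a_i + a_{i+1}) / 2
            D1 = [t for t in D if t[A] <= point]    # A <= split-point
            D2 = [t for t in D if t[A] > point]     # A >  split-point
            info = len(D1) / n * entropy(D1) + len(D2) / n * entropy(D2)
            if best is None or info < best[1]:
                best = (point, info)
        return best

    D = [{"Length": l, "class": c}
         for l, c in zip((10, 15, 21, 28, 32, 40, 50),
                         ("No", "Yes", "Yes", "No", "Yes", "Yes", "No"))]
    print(best_split_point(D, "Length"))   # (12.5, ~0.787); 45 ties with the same score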
Gain Ratio for Attribute Selection
(C4.5)
Information gain measure is biased towards attributes with a
large number of values
C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain)
SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)
◦ GainRatio(A) = Gain(A) / SplitInfo_A(D)
Ex.
◦ gain_ratio(income) = 0.029/1.557 = 0.019
The attribute with the maximum gain ratio is selected as the
splitting attribute
Training data D for the example (class label: buys_computer):
age income student credit_rating buys_computer
youth high no fair no
youth high no excellent no
middle_aged high no fair yes
senior medium no fair yes
senior low yes fair yes
senior low yes excellent no
middle_aged low yes excellent yes
youth medium no fair no
youth low yes fair yes
senior medium yes fair yes
youth medium yes excellent yes
middle_aged medium no excellent yes
middle_aged high yes fair yes
senior medium no excellent no
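Reusing the earlier helpers, the gain ratio figures for income can be checked against this table; only the income column and the class label are needed (a sketch, with split_info and gain_ratio as illustrative names):

    def split_info(D, A):
        """SplitInfo_A(D): the entropy of the split itself, ignoring class labels."""
        n = len(D)
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(t[A] for t in D).values())

    def gain_ratio(D, A):
        return info_gain(D, A) / split_info(D, A)

    incomes = ("high high high medium low low low medium low "
               "medium medium medium high medium").split()
    labels  = "no no yes yes yes no yes no yes yes yes yes yes no".split()
    D = [{"income": i, "class": c} for i, c in zip(incomes, labels)]
    print(round(info_gain(D, "income"), 3))   # 0.029
    print(round(split_info(D, "income"), 3))  # 1.557
    print(round(gain_ratio(D, "income"), 3))  # 0.019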
Gini Index (CART, IBM IntelligentMiner)
If a data set D contains examples from n classes, the gini index, gini(D), is defined as
gini(D) = 1 - \sum_{j=1}^{n} p_j^2
where p_j is the relative frequency of class j in D.
If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as
gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)
Reduction in impurity:
\Delta gini(A) = gini(D) - gini_A(D)
The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute).
Computation of Gini Index
Ex. D has 9 tuples with buys_computer = "yes" and 5 with "no":
gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2:
gini_{income ∈ {low,medium}}(D) = \frac{10}{14} Gini(D_1) + \frac{4}{14} Gini(D_2)
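A sketch of both computations, reusing the representation and the income/label data from the gain-ratio sketch above:

    def gini(D):
        """gini(D) = 1 - sum_j p_j^2 over the class frequencies in D."""
        n = len(D)
        return 1 - sum((c / n) ** 2 for c in Counter(t["class"] for t in D).values())

    def gini_split(D, D1, D2):
        """gini_A(D) for a binary split of D into partitions D1 and D2."""
        return len(D1) / len(D) * gini(D1) + len(D2) / len(D) * gini(D2)

    # 9 "yes" and 5 "no" tuples, as in the example above.
    D = [{"class": "yes"}] * 9 + [{"class": "no"}] * 5
    print(round(gini(D), 3))                  # 0.459

    # The income split, using incomes/labels from the gain-ratio sketch:
    full = [{"income": i, "class": c} for i, c in zip(incomes, labels)]
    D1 = [t for t in full if t["income"] in ("low", "medium")]   # 10 tuples
    D2 = [t for t in full if t["income"] == "high"]              # 4 tuples
    print(round(gini_split(full, D1, D2), 3))  # 0.443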
Comparing the three measures:
◦ Information gain:
◦ biased towards multivalued attributes
◦ Gain ratio:
◦ tends to prefer unbalanced splits in which one partition is much smaller than the others
◦ Gini index:
◦ biased to multivalued attributes
◦ has difficulty when the number of classes is large
◦ tends to favor tests that result in equal-sized partitions and purity in both partitions
Other Attribute Selection Measures
CHAID: a popular decision tree algorithm; its measure is based on the χ² test for independence
C-SEP: performs better than information gain and the gini index in certain cases
Bayes' theorem:
P(C_i | X) = \frac{P(X | C_i) P(C_i)}{P(X)}
Since P(X) is constant for all classes, only P(X | C_i) P(C_i) needs to be maximized.
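The resulting decision rule is easy to sketch; likelihood and prior below are hypothetical stand-ins for whatever estimates a classifier supplies:

    def predict(X, classes, likelihood, prior):
        """Choose the class Ci maximizing P(X|Ci) * P(Ci); P(X) is a common
        constant across classes and can be ignored."""
        return max(classes, key=lambda Ci: likelihood(X, Ci) * prior(Ci))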