UNIT - III
Binary split:
Divides the attribute's values into two subsets. We need to find the optimal partitioning.
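As an aside on the size of this search space: for an attribute with k distinct values there are 2^(k-1) - 1 distinct binary partitions to examine. A minimal Python sketch (the function name and layout are my own) that enumerates them:

    from itertools import combinations

    def binary_partitions(values):
        # Yield every division of the value set into two non-empty
        # subsets; for k values there are 2**(k-1) - 1 of them.
        values = sorted(values)
        for r in range(1, len(values)):
            for left in combinations(values, r):
                if values[0] in left:        # fix one value on the left side
                    right = tuple(v for v in values if v not in left)
                    yield set(left), set(right)

For example, binary_partitions({"red", "green", "blue"}) yields the three partitions ({blue}, {green, red}), ({blue, green}, {red}) and ({blue, red}, {green}).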
Splitting Indices
Decision tree construction has 3 main phases:
Construction Phase
Pruning Phase
Processing the pruned tree to improve understandability
THE GENERIC ALGORITHM
Let the training data set be T with class labels {C1, C2, …, Ck}.
The process continues until all the records in a partition belong to the
same class.
T is homogeneous
T contains cases all belonging to a single class Cj. The decision tree for T is
a leaf identifying class Cj.
T is not homogeneous
T contains cases that belong to a mixture of classes.
A test is chosen, based on a single attribute, that has one or more mutually
exclusive outcomes {O1, O2, …, On}.
T is partitioned into subsets T1, T2, …, Tn, where Ti contains all those cases
in T that have outcome Oi of the chosen test.
The decision tree for T consists of a decision node identifying the test, and one
branch for each possible outcome.
The same tree building method is applied recursively to each subset of training cases.
If n is taken as 2, a binary decision tree is generated.
T is trivial
T contains no cases.
The decision tree for T is a leaf, but the class to be associated with the
leaf must be determined from information other than T.
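The three cases translate directly into a recursive procedure. A minimal Python sketch, assuming records are (attribute_tuple, class_label) pairs and leaving the test-selection step abstract (choose_test is a hypothetical caller-supplied helper):

    def build_tree(T, choose_test, default_class):
        # Trivial case: T contains no cases; the class must come from
        # information other than T (here, a caller-supplied default).
        if not T:
            return ("leaf", default_class)
        classes = {label for _, label in T}
        # Homogeneous case: all cases belong to a single class Cj.
        if len(classes) == 1:
            return ("leaf", classes.pop())
        # Non-homogeneous case: choose a test with mutually exclusive
        # outcomes O1..On, partition T, and recurse on each subset Ti.
        test, outcomes = choose_test(T)
        majority = max(classes, key=lambda c: sum(1 for _, y in T if y == c))
        branches = {}
        for o in outcomes:
            Ti = [case for case in T if test(case[0]) == o]
            branches[o] = build_tree(Ti, choose_test, majority)
        return ("node", test, branches)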
The following are the three major difficulties when using decision
trees in real-life situations:
Guillotine cut (only a single attribute is examined at a time)
Overfitting (the training data may not be properly representative)
Attribute selection error
BEST SPLIT
The main operations during tree building are:
Evaluation of splits for each attribute and selection of the best split;
determination of the splitting attribute.
Determination of the splitting condition on the selected splitting
attribute.
Partitioning the data using the best split.
The first task is to decide which of the independent attributes makes the
best splitter.
If an attribute takes on multiple values, sort it and then, using some
evaluation function as the measure of goodness, evaluate each split.
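For an attribute with many ordered values this means sorting the records on that attribute and scoring each candidate cut point. A sketch, assuming a caller-supplied goodness(left, right) function (a hypothetical placeholder for an entropy- or gini-based measure to be maximized):

    def best_numeric_split(records, attr_index, goodness):
        # records: list of (attribute_tuple, class_label) pairs.
        ordered = sorted(records, key=lambda r: r[0][attr_index])
        best_score, best_cut = float("-inf"), None
        for i in range(1, len(ordered)):
            lo = ordered[i - 1][0][attr_index]
            hi = ordered[i][0][attr_index]
            if lo == hi:
                continue                 # no cut between equal values
            score = goodness(ordered[:i], ordered[i:])
            if score > best_score:
                best_score, best_cut = score, (lo + hi) / 2.0
        return best_cut, best_score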
SPLITTING INDICES
Two methods:
Information gain based on entropy (from information theory)
Gini index (a measure of diversity derived from economics)
Entropy
It provides an information-theoretic approach to measuring the goodness of a
split.
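For a set T with class proportions p1, …, pk, entropy(T) = -Σj pj log2 pj and gini(T) = 1 - Σj pj^2; information gain is the drop in entropy from the parent to the weighted partitions. A sketch of all three:

    from collections import Counter
    from math import log2

    def entropy(labels):
        # entropy(T) = -sum_j p_j * log2(p_j); zero for a pure node
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    def gini(labels):
        # gini(T) = 1 - sum_j p_j**2; also zero for a pure node
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def information_gain(parent_labels, partitions):
        # gain = entropy(parent) - weighted entropy of the partitions
        n = len(parent_labels)
        rem = sum(len(p) / n * entropy(p) for p in partitions)
        return entropy(parent_labels) - rem

For example, information_gain(["y","y","n","n"], [["y","y"], ["n","n"]]) is 1.0: a perfect binary split removes all uncertainty.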
SPLITTING CRITERIA
CLASS HISTOGRAM
The frequency distribution of the class values is represented as a class
histogram.
BINARY SPLITS FOR CATEGORICAL ATTRIBUTES
COUNT MATRIX
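A count matrix tabulates, for each distinct value of a candidate categorical attribute, the class histogram of the records taking that value; binary splits are then evaluated directly from these counts. A minimal sketch (record layout as in the earlier sketches):

    from collections import defaultdict, Counter

    def count_matrix(records, attr_index):
        # Rows: distinct attribute values; columns: class counts.
        matrix = defaultdict(Counter)
        for attrs, label in records:
            matrix[attrs[attr_index]][label] += 1
        return matrix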
DECISION TREE CONSTRUCTION ALGORITHMS
Two types of algorithms:
Those that handle only memory-resident data.
Those that also address efficiency and scalability issues.
CART
CART (Classification and Regression Trees) builds a binary decision tree
by splitting the records at each node according to a function of a single
attribute.
It uses the gini index for determining the best split.
The initial split produces two nodes, each of which is then split
in the same manner as the root node.
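A sketch of the scoring CART applies to a candidate binary split: each child's gini value is weighted by its share of the records, and the candidate with the lowest weighted value is the best split (function names are illustrative):

    from collections import Counter

    def gini(labels):
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def weighted_gini(left_labels, right_labels):
        # Size-weighted gini of the two child nodes; lower is purer.
        n = len(left_labels) + len(right_labels)
        return (len(left_labels) / n * gini(left_labels)
                + len(right_labels) / n * gini(right_labels))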
ID3
ID3 stands for Iterative Dichotomiser 3 and is so named because the
algorithm iteratively (repeatedly) dichotomizes (divides) features into
two or more groups at each step.
Invented by Ross Quinlan, ID3 uses a top-down greedy approach to
build a decision tree. In simple words, the top-down approach means
that we start building the tree from the top and the greedy approach
means that at each iteration we select the best feature at the present
moment to create a node.
ID3 Steps
Calculate the Information Gain of each feature.
Considering that all rows don’t belong to the same class, split the
dataset S into subsets using the feature for which the Information Gain is
maximum.
Make a decision tree node using the feature with the maximum
Information gain.
If all rows belong to the same class, make the current node a leaf node
with the class as its label.
Repeat for the remaining features until we run out of features, or the
decision tree has all leaf nodes.
The attribute with the greatest information gain is taken as the splitting
attribute, and the data set is split on all distinct values of that attribute.
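A compact ID3 sketch following the steps above, assuming each record is an (attribute_dict, label) pair; an illustrative reconstruction, not Quinlan's original code:

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    def id3(records, features):
        labels = [label for _, label in records]
        # Leaf: all rows share one class, or no features remain.
        if len(set(labels)) == 1 or not features:
            return Counter(labels).most_common(1)[0][0]

        def gain(f):
            # Information gain of splitting on feature f.
            values = Counter(attrs[f] for attrs, _ in records)
            rem = sum(cnt / len(records)
                      * entropy([y for a, y in records if a[f] == v])
                      for v, cnt in values.items())
            return entropy(labels) - rem

        best = max(features, key=gain)
        rest = [f for f in features if f != best]
        # One branch per distinct value of the chosen attribute.
        return (best, {v: id3([(a, y) for a, y in records if a[best] == v], rest)
                       for v in {attrs[best] for attrs, _ in records}})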
C4.5
C4.5 is an extension of the ID3 algorithm that addresses the following issues not
dealt with by ID3:
Avoiding overfitting the data
Determining how deeply to grow a decision tree.
Handling continuous attributes. e.g., temperature
Choosing an appropriate attribute selection measure.
Handling training data with missing attribute values.
Handling attributes with differing costs.
Improving computational efficiency.
Handling unavailable values
Pruning of decision trees and rule derivation.
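The attribute selection measure C4.5 adopts is the gain ratio: information gain normalized by the "split information" of the partition, which penalizes attributes with many distinct values. A sketch (names are illustrative):

    from math import log2

    def split_info(partition_sizes):
        # Entropy of the partition proportions themselves.
        n = sum(partition_sizes)
        return -sum(s / n * log2(s / n) for s in partition_sizes if s)

    def gain_ratio(info_gain, partition_sizes):
        # Gain ratio = information gain / split information; guard the
        # degenerate case of everything falling into one branch.
        si = split_info(partition_sizes)
        return info_gain / si if si > 0 else 0.0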
CHAID
CHAID, proposed by Kass in 1980, is a derivative of AID (Automatic Interaction
Detection), proposed by Morgan and Sonquist in 1963.
It attempts to stop growing the tree before overfitting occurs.
It avoids the pruning process.
Unlike CART, it forms multiway splits, merging predictor categories that are statistically similar.
The splitting attribute is chosen according to a chi-squared test of independence in a
contingency table (a cross-tabulation of the non-class and class attributes).
The main stopping criterion is the p-value from the chi-squared test.
A combinatorial search algorithm can be used to find a partition that has a small
p-value for the chi-squared test.
A Bonferroni adjustment is used when relating the predictors to the dependent
variable; the adjustment is conditional on the number of branches in the partition.
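A sketch of the statistic CHAID's attribute selection rests on, using scipy.stats.chi2_contingency on a predictor-versus-class contingency table (the counts below are invented for illustration):

    from scipy.stats import chi2_contingency

    # Rows: categories of one predictor; columns: class counts.
    table = [[30, 10],    # predictor value A: 30 of class 0, 10 of class 1
             [12, 28]]    # predictor value B

    chi2, p_value, dof, expected = chi2_contingency(table)
    # CHAID favors the predictor with the smallest (Bonferroni-adjusted)
    # p-value and stops splitting when no test is significant.
    print(chi2, p_value)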