
DATA MINING

UNIT - III

Decision Tree: Introduction - Tree construction principles - Best split - Splitting indices - Splitting criteria - Tree construction algorithms: CART - ID3 - C4.5 - CHAID.
DECISION TREE
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels, class distributions, or rules.
Decision trees are used to classify an unknown sample by testing the attribute values of the sample against the decision tree.
A Decision Tree is a classification scheme which generates a tree and a
set of rules, representing the model of different classes, from a given
data set.
The set of records available for developing classification methods is
generally divided into two disjoint subsets
A Training set (used for deriving the classifier)
A Test set (used to measure the accuracy of the classifier)
The attributes of the records are categorized into two types.
Numerical Attributes
Categorical Attributes
Class label – a distinguished attribute whose value is to be predicted.
The goal of classification is to build a concise model to predict the class
of records whose class label is not known.
EXTRACTING CLASSIFICATION RULES FROM TREES
Represent the knowledge in the form of IF-THEN rules
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction
The leaf node holds the class prediction
Rules are easier for humans to understand
RULE 1: If it is sunny and the humidity is not above 75%, then play.
RULE 2: If it is sunny and the humidity is above 75%, then don't play.
RULE 3: If it is overcast, then play.
RULE 4: If it is rainy and not windy, then play.
RULE 5: If it is rainy and windy, then don't play.
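Purely as an illustration, these five rules can be written as a small if-then classifier; the attribute names outlook, humidity and windy are assumed for the example:

    # Hypothetical encoding of the five weather rules as an if-then classifier.
    def play_decision(outlook, humidity, windy):
        if outlook == "sunny":
            return "play" if humidity <= 75 else "don't play"   # Rules 1 and 2
        if outlook == "overcast":
            return "play"                                       # Rule 3
        if outlook == "rainy":
            return "don't play" if windy else "play"            # Rules 4 and 5
        return None                                             # unseen outlook value

    print(play_decision("sunny", 70, windy=False))              # -> play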
ADVANTAGES
Generate understandable rules.
Able to handle both numeric and categorical
attributes.
They provide clear indication of which fields are
most important for prediction or classification.
WEAKNESSES
Some decision trees can only deal with binary-valued target classes.
Others can assign records to an arbitrary number of classes, but are error-prone when the number of training examples per class gets small.
The process of growing a decision tree is computationally expensive.
TREE CONSTRUCTION PRINCIPLE
Best Split
Greedy strategy
Split the records based on an attribute test that optimizes a certain criterion
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
Greedy approach:
Nodes with homogeneous class distribution are preferred
Need a measure of node impurity.
How to specify the attribute test condition?

● Depends on attribute types


Nominal
Ordinal
Continuous
● Depends on number of ways to split
2-way split
Multi-way split
Splitting Based on Nominal Attributes
Multi-way split:
Use as many partitions as distinct values.

Binary split:
Divides values into two subsets. Need to find optimal partitioning.
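A small sketch of the two options on a toy record set (the data and the attribute name are invented): a multi-way split creates one partition per distinct value, while a binary split groups the values into two subsets:

    from collections import defaultdict

    records = [{"outlook": "sunny"}, {"outlook": "rainy"},
               {"outlook": "overcast"}, {"outlook": "sunny"}]   # toy data (assumed)

    # Multi-way split: one partition per distinct attribute value.
    multiway = defaultdict(list)
    for r in records:
        multiway[r["outlook"]].append(r)

    # Binary split: choose a subset of values for the left branch, rest go right.
    left_values = {"sunny"}                    # one candidate partitioning
    left = [r for r in records if r["outlook"] in left_values]
    right = [r for r in records if r["outlook"] not in left_values]

    print(sorted(multiway))       # ['overcast', 'rainy', 'sunny'] -> 3 branches
    print(len(left), len(right))  # 2 records go left, 2 go right -> 2 branches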
Splitting Indices

Work out the entropy based on the distribution of classes.
Try splitting on each attribute.
Work out the expected information gain for each attribute.
Choose the best attribute.
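A minimal sketch of this entropy-based selection in Python, assuming nominal attributes stored as dictionaries; entropy and information_gain are illustrative helper names, not library functions:

    import math
    from collections import Counter

    def entropy(labels):
        # Entropy of a list of class labels, in bits.
        total = len(labels)
        return -sum((n / total) * math.log2(n / total)
                    for n in Counter(labels).values())

    def information_gain(records, attribute, labels):
        # Expected reduction in entropy from splitting on one nominal attribute.
        total = len(labels)
        groups = {}
        for rec, lab in zip(records, labels):
            groups.setdefault(rec[attribute], []).append(lab)
        remainder = sum(len(g) / total * entropy(g) for g in groups.values())
        return entropy(labels) - remainder

    # Toy example: the attribute with the largest gain would be chosen.
    records = [{"outlook": "sunny"}, {"outlook": "sunny"},
               {"outlook": "overcast"}, {"outlook": "rainy"}]
    labels = ["no", "no", "yes", "yes"]
    print(information_gain(records, "outlook", labels))   # -> 1.0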
TREE CONSTRUCTION PRINCIPLE
Splitting Attribute

Splitting Criterion (Qualifying Condition)

3 main phases
Construction Phase
Pruning Phase
Processing the pruned tree to improve its understandability
THE GENERIC ALGORITHM
Let the training data set be T with class labels {C1, C2, …, Ck}.

The tree is built by repeatedly partitioning the training data set.
The process is continued until all the records in a partition belong to the same class.
T is homogeneous
T contains cases all belonging to a single class Cj. The decision tree for T is
a leaf identifying class Cj.

T is not homogeneous
T contains cases that belong to a mixture of classes.
A test is chosen, based on a single attribute, that has one or more mutually exclusive outcomes {O1, O2, …, On}.
T is partitioned into subsets T1, T2, …, Tn, where Ti contains all those cases in T that have outcome Oi of the chosen test.
The decision tree for T consists of a decision node identifying the test, and one branch for each possible outcome.
The same tree-building method is applied recursively to each subset of training cases.
Typically n is taken as 2, and a binary decision tree is generated.
T is trivial
T contains no cases.
The decision tree for T is a leaf, but the class to be associated with the leaf must be determined from information other than T.
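A skeleton of this generic algorithm might look as follows; the tuple-based tree representation and the choose_test parameter (which stands for whatever split-selection method a concrete algorithm such as ID3 or CART plugs in) are assumptions made only for illustration:

    def build_tree(cases, labels, default_class, choose_test):
        # Trivial case: T contains no cases; the class comes from outside information.
        if not cases:
            return ("leaf", default_class)
        # Homogeneous case: all cases belong to a single class Cj.
        if len(set(labels)) == 1:
            return ("leaf", labels[0])
        # Heterogeneous case: choose a test with mutually exclusive outcomes
        # and recurse on each resulting partition T1 ... Tn.
        # choose_test is assumed to return (test_description, partitions), where
        # partitions maps each outcome to (sub_cases, sub_labels).
        test, partitions = choose_test(cases, labels)
        majority = max(set(labels), key=labels.count)
        branches = {
            outcome: build_tree(sub_cases, sub_labels, majority, choose_test)
            for outcome, (sub_cases, sub_labels) in partitions.items()
        }
        return ("node", test, branches)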
The following are three major difficulties when using a decision tree in a real-life situation:
Guillotine cut (only a single attribute is examined at a time)
Overfit (the training data may not be a proper representative)
Attribute selection error
BEST SPLIT
The main operations during tree building are:
Evaluation of splits for each attribute and the selection of the best split;
determination of the splitting attribute.
Determination of the splitting condition on the selected splitting
attribute.
Partitioning the data using the best split.
The first task is to decide which of the independent attributes makes the
best splitter.
If an attribute takes on multiple values, sort it and then, using some
evaluation function as the measure of goodness, evaluate each split.
SPLITTING INDICES
Two methods
Information gain based on entropy (information theory)
Gini index (a measure of diversity, derived from economics)
Entropy
It provides an information-theoretic approach to measuring the goodness of a split.
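The entropy computation was sketched earlier; for comparison, the gini index of a set of records can be computed just as directly (gini here is an illustrative helper name, not a library function):

    from collections import Counter

    def gini(labels):
        # Gini index: 1 minus the sum of squared class proportions (0 = pure node).
        total = len(labels)
        return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

    print(gini(["yes", "yes", "no", "no"]))   # 0.5, maximally mixed for two classes
    print(gini(["yes", "yes", "yes"]))        # 0.0, pure node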
SPLITTING CRITERIA

Binary Splits for Numerical Attributes


Class Histogram
Binary Splits for categorical Attributes
Count Matrix
BINARY SPLITS FOR NUMERICAL ATTRIBUTES
Numerical attributes are split by a binary split of the form A ≤ v, where v is a real number.
The attribute values are sorted, and the midpoint between adjacent values is taken as a candidate split point.
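A sketch of this candidate-midpoint search, scoring each split A ≤ v by the weighted gini index of the two resulting partitions (function names are illustrative):

    from collections import Counter

    def gini(labels):
        total = len(labels)
        return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

    def best_numeric_split(values, labels):
        # Try the midpoint between each pair of adjacent distinct sorted values.
        pairs = sorted(zip(values, labels))
        best = (None, float("inf"))
        for i in range(1, len(pairs)):
            if pairs[i - 1][0] == pairs[i][0]:
                continue                      # identical values, no split point here
            v = (pairs[i - 1][0] + pairs[i][0]) / 2
            left = [lab for val, lab in pairs if val <= v]
            right = [lab for val, lab in pairs if val > v]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
            if score < best[1]:
                best = (v, score)
        return best

    # Toy data: the best split point falls between 75 and 80.
    print(best_numeric_split([65, 70, 75, 80, 85], ["yes", "yes", "yes", "no", "no"]))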

CLASS HISTOGRAM
The frequency distribution of the class values in each partition is represented as a class histogram.
BINARY SPLITS FOR CATEGORICAL ATTRIBUTES
COUNT MATRIX
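A count matrix cross-tabulates each value of a categorical attribute against the class labels. The sketch below builds one in plain Python from an invented toy data set:

    from collections import Counter

    outlook = ["sunny", "sunny", "overcast", "rainy", "rainy"]
    play    = ["no",    "no",    "yes",      "yes",   "no"]

    # Count matrix: rows are attribute values, columns are class labels.
    matrix = Counter(zip(outlook, play))
    for value in sorted(set(outlook)):
        row = {cls: matrix[(value, cls)] for cls in sorted(set(play))}
        print(value, row)
    # overcast {'no': 0, 'yes': 1}
    # rainy    {'no': 1, 'yes': 1}
    # sunny    {'no': 2, 'yes': 0}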
DECISION TREE CONSTRUCTION ALGORITHMS
Two types of algorithms
Algorithms that handle only memory-resident data.
Algorithms that address efficiency and scalability issues (for data sets too large to fit in memory).
CART
CART (Classification and Regression Trees )
CART builds a binary decision tree by splitting the records at each node, according to a function of a single attribute.
It uses the gini index for determining the best split.
The initial split produces two nodes, each of which the algorithm then attempts to split in the same manner as the root node, and the process continues recursively on each new node.
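scikit-learn documents its tree learner as an optimised variant of CART; assuming scikit-learn is installed, a minimal sketch using the gini criterion looks like this (the Iris data set is used only as convenient sample data):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_iris(return_X_y=True)

    # criterion="gini" selects the gini index as the splitting measure,
    # mirroring the CART description above.
    cart = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
    cart.fit(X, y)

    print(export_text(cart))      # binary tree of "feature <= value" tests
    print(cart.predict(X[:5]))    # class labels for the first five records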
ID3
ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm iteratively (repeatedly) dichotomizes (divides) features into two or more groups at each step.
Invented by Ross Quinlan, ID3 uses a top-down greedy approach to
build a decision tree. In simple words, the top-down approach means
that we start building the tree from the top and the greedy approach
means that at each iteration we select the best feature at the present
moment to create a node.
ID3 Steps
Calculate the Information Gain of each feature.
Considering that all rows don’t belong to the same class, split the
dataset S into subsets using the feature for which the Information Gain is
maximum.
Make a decision tree node using the feature with the maximum
Information gain.
If all rows belong to the same class, make the current node a leaf node with the class as its label.
Repeat for the remaining features until we run out of features, or the decision tree consists entirely of leaf nodes.
The attribute with the greatest information gain is taken as the splitting attribute, and the data set is split on each distinct value of that attribute.
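Putting these steps together, a compact ID3 sketch for nominal attributes might look like this (all names are illustrative; the tree is returned as nested dictionaries):

    import math
    from collections import Counter

    def entropy(labels):
        total = len(labels)
        return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())

    def id3(records, labels, attributes):
        # All rows in the same class: leaf labelled with that class.
        if len(set(labels)) == 1:
            return labels[0]
        # No attributes left: leaf labelled with the majority class.
        if not attributes:
            return Counter(labels).most_common(1)[0][0]

        # Information gain of one remaining attribute.
        def gain(attr):
            groups = {}
            for rec, lab in zip(records, labels):
                groups.setdefault(rec[attr], []).append(lab)
            remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
            return entropy(labels) - remainder

        # Split on the attribute with maximum information gain,
        # one branch per distinct value of that attribute.
        best = max(attributes, key=gain)
        node = {best: {}}
        for value in set(r[best] for r in records):
            rows = [(r, l) for r, l in zip(records, labels) if r[best] == value]
            sub_records = [r for r, _ in rows]
            sub_labels = [l for _, l in rows]
            node[best][value] = id3(sub_records, sub_labels,
                                    [a for a in attributes if a != best])
        return node

    # Toy usage with a single nominal attribute.
    data = [{"outlook": "sunny"}, {"outlook": "overcast"}, {"outlook": "rainy"}]
    print(id3(data, ["no", "yes", "yes"], ["outlook"]))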
C4.5
C4.5 is an extension of the ID3 algorithm that addresses the following issues not dealt with by ID3:
Avoiding overfitting the data.
Determining how deeply to grow a decision tree.
Handling continuous attributes. e.g., temperature
Choosing an appropriate attribute selection measure.
Handling training data with missing attribute values.
Handling attributes with differing costs.
Improving computational efficiency.
Unavailable values
Pruning of decision trees and rule derivation.
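One concrete change in C4.5's attribute selection is the gain ratio, which divides the information gain by the split information of the partition so that many-valued attributes are not unduly favoured. A minimal sketch, assuming an information-gain function such as the one sketched earlier is passed in (helper names are illustrative):

    import math
    from collections import Counter

    def split_information(records, attribute):
        # Entropy of the partition sizes produced by splitting on the attribute.
        total = len(records)
        counts = Counter(r[attribute] for r in records)
        return -sum(n / total * math.log2(n / total) for n in counts.values())

    def gain_ratio(records, attribute, labels, information_gain):
        # Gain ratio = information gain / split information (C4.5's measure).
        si = split_information(records, attribute)
        return information_gain(records, attribute, labels) / si if si > 0 else 0.0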
CHAID
CHAID, proposed by Kass in 1980, is a derivative of AID (Automatic Interaction Detection), proposed by Hartigan in 1975.
It attempts to stop growing the tree before overfitting occurs.
It avoids the pruning process.
Unlike CART, its splits are not necessarily binary; a node may be split into more than two branches.
The splitting attribute is chosen according to a chi-squared test of independence in a contingency table (a cross-tabulation of the non-class attribute and the class attribute).
The main stopping criterion is the P-value from the chi-squared test.
A combinatorial search algorithm can be used to find a partition that has a small p-value for the chi-squared test.
A Bonferroni adjustment is applied when relating the predictors to the dependent variable; the adjustment is conditional on the number of branches in the partition.
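The chi-squared test behind CHAID's splitting decision can be reproduced on a contingency table with SciPy; the sketch below uses an invented cross-tabulation and an illustrative significance threshold:

    from scipy.stats import chi2_contingency

    # Toy contingency table: rows are values of a candidate predictor,
    # columns are counts of the class labels ("play", "don't play").
    table = [[20,  5],
             [10, 15],
             [ 8, 12]]

    chi2, p_value, dof, expected = chi2_contingency(table)
    print(chi2, p_value)

    # CHAID keeps splitting while the (Bonferroni-adjusted) p-value is small;
    # a large p-value means the predictor and the class look independent,
    # so the branch stops growing.
    alpha = 0.05                       # illustrative significance threshold
    print("split" if p_value < alpha else "stop")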
