
Decision Trees

Introduction
• Predictive modeling involves building regression (multiple regression or logistic regression), neural network, and decision tree models.
Generally,
• Logistic regression is considered a statistical method
• A neural network, an artificial intelligence model
• A decision tree, a machine learning technique

2
Decision Tree
• Decision trees are useful for classification and prediction.
• A decision tree model consists of a set of rules for dividing a large heterogeneous population into smaller, more homogeneous groups with respect to a particular target.
• The target variable is usually categorical and the decision tree is used either to:
• (1) calculate the probability that a given record belongs to each of the categories, or
• (2) classify the record by assigning it to the most likely class (or category).
• The algorithm used to construct a decision tree is referred to as recursive partitioning.

• Note: Decision trees can also be used to estimate the value of a continuous target variable. However, multiple regression and neural network models are generally more appropriate when the target variable is continuous.
3
Decision Tree Structure
• Root node = top node
• Child node = descendant node
• Leaf node (or terminal node) = final node, i.e. no further splitting
• Rule = unique path (set of conditions) from the root to each leaf

[Figure: tree diagram showing a root node splitting into child nodes, which split further until the leaf nodes.]
4
A Simple Decision Tree
Target: Status = Buyer or Non-buyer (categorical variable)

Node 0: Buyer 600 (40%), Non-buyer 900 (60%)
Split on Income:
– Income < $100,000 → Node 1: Buyer 350 (36.84%), Non-buyer 600 (63.16%)
– Income $100,000 and above → Node 2: Buyer 250 (45.45%), Non-buyer 300 (54.55%)
Node 1 split on Age:
– Age < 25 → Node 3: Buyer 50 (9.09%), Non-buyer 500 (90.91%)
– Age 25 and above → Node 4: Buyer 300 (75%), Non-buyer 100 (25%)
Node 2 split on Gender:
– Male → Node 5: Buyer 200 (50%), Non-buyer 200 (50%)
– Female → Node 6: Buyer 50 (33.33%), Non-buyer 100 (66.67%)
Node 5 split on Race:
– Chinese → Node 7: Buyer 30 (15%), Non-buyer 170 (85%)
– Malay & Indian → Node 8: Buyer 170 (85%), Non-buyer 30 (15%)

A customer with income less than $100,000 and age less than 25 is predicted as a non-buyer.
Note: Input variables that are higher up in the decision tree can be deemed the more important variables in predicting the target variable.
5
Decision Rules
1. If Income < $100,000 and Age < 25 then Status = Non-Buyer
2. If Income < $100,000 and Age >= 25 then Status = Buyer
3. If Income >= $100,000, Gender = Male and Race = Chinese then Status = Non-Buyer
4. If Income >= $100,000, Gender = Male and Race = Malay/Indian then Status = Buyer
5. If Income >= $100,000 and Gender = Female then Status = Non-Buyer
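These rules map directly to program logic. The sketch below (Python, with hypothetical field names income, age, gender and race) is one way to express them; it is an illustration, not part of the original slides.

def predict_status(income, age, gender, race):
    """Apply the five decision rules from the tree to one customer record."""
    if income < 100_000:
        # Rules 1 and 2: the low-income branch splits on age
        return "Buyer" if age >= 25 else "Non-Buyer"
    # Income >= $100,000: the branch splits on gender, then race
    if gender == "Male":
        # Rules 3 and 4
        return "Non-Buyer" if race == "Chinese" else "Buyer"
    # Rule 5: high income, female
    return "Non-Buyer"

# Example: matches the note on the tree diagram
print(predict_status(income=50_000, age=22, gender="Female", race="Malay"))  # Non-Buyer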

Profile of Buyers
• Those earning less than $100,000 and aged 25 and above.
• Malay or Indian males earning $100,000 or more.

6
Regression Tree
Target: salary (continuous variable)

7
Decision Rules
• If Employment = Clerical or Custodial and Gender = Female, average salary = $25,003.69.
• If Employment = Clerical and Gender = Male, average salary = $31,558.15.
• If Employment = Custodial and Gender = Male, average salary = $30,938.89.
• If Employment = Manager and Gender = Female, average salary = $47,213.50.
• If Employment = Manager and Gender = Male, average salary = $66,243.24.
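A regression tree like the one summarised above can be fitted with scikit-learn. The sketch below is a minimal illustration assuming a pandas DataFrame emp with hypothetical columns employment, gender and salary (not the original SPSS data); the leaf values of the fitted tree are the average salaries of the groups.

import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

emp = pd.DataFrame({
    "employment": ["Clerical", "Clerical", "Custodial", "Manager", "Manager", "Manager"],
    "gender":     ["Female",   "Male",     "Male",      "Female",  "Male",    "Male"],
    "salary":     [25000,      31500,      31000,       47200,     66200,     66300],
})

# One-hot encode the categorical inputs; the target (salary) stays continuous
X = pd.get_dummies(emp[["employment", "gender"]])
y = emp["salary"]

# Regression tree: splits are chosen to reduce the variance of salary within nodes
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # leaf values = average salary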

8
Example: Good & Poor Splits

[Figure: two example splits: a good split that separates the classes well, and a poor split that produces nodes with very small sample sizes.]
9
Split Criteria
• The best split is defined as the one that does the best job of separating the data into groups where a single class predominates in each group.
• The measure used to evaluate a potential split is purity.
– The best split is the one that increases the purity of the subsets by the greatest amount.
– A good split also creates nodes of similar size, or at least does not create very small nodes.
10
Tests for Choosing the Best Split
The choice of splitting criterion depends on the type of target variable, not on the type of input variable.

Splitting criteria

• Categorical target variable:
– Gini (population diversity)
– Entropy (information gain)
– Information gain ratio
– Chi-square test

• Interval target variable:
– Variance reduction
– F-test

11
Gini (measures Population Diversity)

• The Gini measure of a node is the sum of the squares of the proportions of the classes: Gini = Σ Pi^2, where Pi is the proportion of class i in the node.

Root node: 0.5^2 + 0.5^2 = 0.5 (even balance)
Leaf node: 0.1^2 + 0.9^2 = 0.82 (close to pure)
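As a quick check of these numbers, a small Python helper (an illustrative sketch, not part of the slides) reproduces the Gini values for the root and leaf examples.

def gini(proportions):
    """Gini measure of a node: sum of squared class proportions (higher = purer)."""
    return sum(p ** 2 for p in proportions)

print(gini([0.5, 0.5]))            # 0.5  -> evenly balanced root node
print(round(gini([0.1, 0.9]), 2))  # 0.82 -> leaf node that is close to pure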

12
Evaluating the split using Gini
Which of these two proposed splits increases purity the most?
Parent node (20 records: 10 buyers, 10 non-buyers):
Gini score = 0.5^2 + 0.5^2 = 0.5

First proposed split (by gender: female vs male), giving two nodes of 10 records each, each with a 9:1 class mix:
Gini_left = 0.1^2 + 0.9^2 = 0.82
Gini_right = 0.1^2 + 0.9^2 = 0.82
Gini score = (10/20)*0.82 + (10/20)*0.82 = 0.820

Second proposed split (by income: <= $2,000 vs > $2,000), giving a pure node of 6 records and a mixed node of 14 records (4:10):
Gini_left = 1
Gini_right = (4/14)^2 + (10/14)^2 = 0.592
Gini score = (6/20)*1 + (14/20)*0.592 = 0.714

A perfectly pure node would have a Gini score of 1. Therefore, the first split is better. Source: Berry and Linoff (2004)
13
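The weighted Gini scores of the two proposed splits can be reproduced with a short sketch; the class counts below are taken from the example above and are for illustration only.

def gini_from_counts(counts):
    """Gini of a node given class counts: sum of squared class proportions."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

def split_score(children):
    """Size-weighted average of the children's Gini values."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini_from_counts(child) for child in children)

split_1 = [[9, 1], [1, 9]]    # two nodes of 10 records, each 9:1
split_2 = [[6, 0], [4, 10]]   # a pure node of 6 and a mixed node of 14
print(round(split_score(split_1), 3))  # 0.82  -> higher score, preferred split
print(round(split_score(split_2), 3))  # 0.714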
Entropy

• Entropy can be computed for each node in the decision tree


• Entropy for a decision tree is the total of the entropy of all terminal nodes in the tree.
• Entropy measures impurity, disorderliness or lack of information in a decision tree.
• The best input variable is the one that gives the greatest reduction in entropy.
• As a decision tree becomes purer, more orderly and more informative, its entropy approaches zero.
• The reduction in entropy is sometimes referred to as information gain.

Entropy: H = -Σ_i P_i log2(P_i)

P_i is the probability of the i-th category of the target variable occurring in a particular node.
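The formula can be written as a short Python function (an illustrative sketch): it returns 1 for a perfectly balanced two-class node and approaches 0 as the node becomes pure.

from math import log2

def entropy(proportions):
    """H = -sum(p_i * log2(p_i)) over the classes present in the node (pure node -> 0)."""
    return -sum(p * log2(p) for p in proportions if p > 0)

print(entropy([0.5, 0.5]))            # 1.0  -> maximum disorder (50/50 node)
print(round(entropy([0.9, 0.1]), 2))  # 0.47 -> much purer node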

14
Evaluating the split using entropy
Entropy = -1 * (P(dark) log2 P(dark) + P(light) log2 P(light)), where dark and light denote the two classes in the figure.

Which of these two proposed splits gives the greatest information gain?

Parent node (20 records, 10 of each class): Entropy = 1
(log2(a) = log10(a)/log10(2))

First proposed split (by gender: female vs male), two nodes of 10 records each with a 9:1 mix:
Entropy_left = -1*(0.9 log2(0.9) + 0.1 log2(0.1)) = 0.47
Entropy_right = -1*(0.1 log2(0.1) + 0.9 log2(0.9)) = 0.47
Entropy score = (10/20)*0.47 + (10/20)*0.47 = 0.47
Information gain = 1 - 0.47 = 0.53

Second proposed split (by income: <= $2,000 vs > $2,000), a pure node of 6 records and a mixed node of 14 records (4:10):
Entropy_left = -1*(1 log2(1) + 0) = 0
Entropy_right = -1*((4/14) log2(4/14) + (10/14) log2(10/14)) = -1*(-0.52 - 0.35) = 0.863
Entropy score = (6/20)*0 + (14/20)*0.863 = 0.604
Information gain = 1 - 0.604 = 0.396

Source: Berry and Linoff (2004)
15
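The entropy scores and information gains of the two proposed splits can be reproduced as follows (an illustrative sketch using the class counts from the example above).

from math import log2

def entropy_from_counts(counts):
    """Entropy of a node given class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy_from_counts(child) for child in children)
    return entropy_from_counts(parent) - weighted

parent = [10, 10]                                             # 10 of each class
print(round(information_gain(parent, [[9, 1], [1, 9]]), 2))   # 0.53
print(round(information_gain(parent, [[6, 0], [4, 10]]), 3))  # 0.396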
Comparing two splits using Gini and entropy

• Both Gini and entropy prefer the first proposed split.
• Decision trees built using entropy tend to be quite bushy. Bushy trees with many multi-way splits are undesirable because these splits lead to small numbers of records in each node.

16
Algorithms for constructing decision trees

Common ones:
• CHAID (chi-square automatic interaction
detection)
• C5.0
• CART (classification and regression tree)
• Decision tree algorithms are computationally intensive (i.e. a lot of computations are performed to construct the tree)

17
CART(classification and regression tree)

• Regression tree: the target is a quantitative variable
• Classification tree: the target is a categorical variable
• Performs only binary splits
• Uses the Gini measure (instead of the entropy measure) for classification trees
• Uses variance reduction for regression trees (continuous target variable)
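scikit-learn's tree estimators follow the CART approach. The minimal sketch below (using synthetic data, not the slide data) shows the Gini criterion for a classification tree and squared-error (variance) reduction for a regression tree; parameter values are arbitrary examples.

from sklearn.datasets import make_classification, make_regression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: binary splits chosen to maximise the reduction in Gini impurity
Xc, yc = make_classification(n_samples=200, n_features=4, random_state=0)
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(Xc, yc)

# Regression tree: binary splits chosen to maximise variance (squared-error) reduction
# ("squared_error" is the criterion name in recent scikit-learn versions)
Xr, yr = make_regression(n_samples=200, n_features=4, noise=10, random_state=0)
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3, random_state=0).fit(Xr, yr)

print(clf.get_depth(), reg.get_depth())  # both trees are built from binary splits only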

18
CHAID
• A commonly used algorithm for constructing decision trees for a qualitative (or categorical) target variable.
• The splitting algorithm (chi-square test) is designed for categorical inputs, so continuous inputs must be discretized.
• Example:
• The research objective of a data mining application is to predict the buying status (i.e. buyer vs non-buyer) of a product based on demographic variables such as gender, race, age and income.
• Assume that the sample consists of 600 buyers and 900 non-buyers.

19
CHAID (cont’d)
• Step 1: Each input variable is evaluated on its potential to split the data into two or more subsets so that the target variable is as differentiated (in a statistical sense) between the subsets as possible.

Decision tree split by gender:
Node 0: Buyer 600 (40%), Non-buyer 900 (60%)
– Male → Node 1: Buyer 450 (39.13%), Non-buyer 700 (60.87%)
– Female → Node 2: Buyer 150 (42.86%), Non-buyer 200 (57.14%)

The chi-square test of independence is used to test the null hypothesis:
H0: Buying status and gender are independent.
20
Chi-Square test of independence using SPSS
H0: Buying status and gender are independent
H1: Buying status and gender are dependent

gender * buyer Crosstabulation (Count)
                      non-buyer (.00)   buyer (1.00)   Total
gender  .00 female          200             150          350
        1.00 male           700             450         1150
Total                       900             600         1500

Chi-Square Tests
                               Value    df   Asymp. Sig. (2-sided)   Exact Sig. (2-sided)   Exact Sig. (1-sided)
Pearson Chi-Square             1.553b    1          .213
Continuity Correction(a)       1.401     1          .236
Likelihood Ratio               1.545     1          .214
Fisher's Exact Test                                                        .213                  .118
Linear-by-Linear Association   1.552     1          .213
N of Valid Cases               1500
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 140.00.

Chi-square = 1.553 with p-value = 0.213 > 0.05, so do not reject the null hypothesis.
Thus buying status does not depend on gender.
Gender does not help differentiate between buyers and non-buyers.
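The same test can be run outside SPSS. The sketch below uses scipy.stats.chi2_contingency on the crosstab counts shown above; correction=False is set so the statistic matches the uncorrected Pearson chi-square rather than the continuity-corrected value.

from scipy.stats import chi2_contingency

# Rows: female, male; columns: non-buyer, buyer (counts from the crosstab above)
observed = [[200, 150],
            [700, 450]]

chi2, p, df, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 3), df, round(p, 3))  # 1.553 1 0.213 -> do not reject H0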

21
CHAID (cont’d)
The significance of race in differentiating between buyers and non-buyers can also be evaluated.

Decision tree split by race:
Node 0: Buyer 600 (40%), Non-buyer 900 (60%)
– Chinese → Node 1: Buyer 300 (37.5%), Non-buyer 500 (62.5%)
– Malay → Node 2: Buyer 155 (43.66%), Non-buyer 200 (56.34%)
– Indian → Node 3: Buyer 145 (42.03%), Non-buyer 200 (57.97%)

H0: Buying status and race are independent
H1: Buying status and race are dependent
Chi-square = 4.66 with p-value = 0.097 > 0.05, so do not reject the null hypothesis at the 5% significance level.
Thus buying status does not depend on race.

22
Decision Tree – Split by Race (Binary Split)

Node 0: Buyer 600 (40%), Non-buyer 900 (60%)
– Chinese → Node 1: Buyer 300 (37.5%), Non-buyer 500 (62.5%)
– Malay & Indian → Node 2: Buyer 300 (42.86%), Non-buyer 400 (57.14%)

H0: Buying status and race are independent
H1: Buying status and race are dependent
Chi-square = 4.46 with p-value = 0.035 < 0.05, so reject the null hypothesis at the 5% significance level.
Thus buying status does depend on race.
Here, the binary split is better at differentiating between buyers and non-buyers.
Generally, for a categorical target variable, a decision tree algorithm will try different combinations of categories to search for the split(s) that best differentiate (or predict) the target variable.
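The two race splits can be compared the same way (a sketch using the node counts above); note how merging Malay and Indian into one category sharpens the contrast with the Chinese group.

from scipy.stats import chi2_contingency

# Each row holds the buyer / non-buyer counts for one race category
three_way = [[300, 500],   # Chinese
             [155, 200],   # Malay
             [145, 200]]   # Indian
binary    = [[300, 500],   # Chinese
             [300, 400]]   # Malay & Indian combined

for name, table in [("3-way split", three_way), ("binary split", binary)]:
    chi2, p, df, _ = chi2_contingency(table, correction=False)
    print(name, round(chi2, 2), "p =", round(p, 3))
# 3-way split 4.66 p = 0.097  -> not significant at the 5% level
# binary split 4.46 p = 0.035 -> significant at the 5% level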
23
Decision Tree – Determining the Split for a Quantitative Input Variable

• Determining the best split for a quantitative input variable is more difficult.
• Example: 5 values of a quantitative variable in ascending order: A < B < C < D < E
Four possible splitting points (see the sketch below):
– average of A and B
– average of B and C
– average of C and D
– average of D and E
• As quantitative input variables can take many values, decision tree algorithms perform very intensive computations to determine the splitting point.
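A minimal sketch (for illustration only) of how the candidate splitting points for a quantitative input can be generated: midpoints between consecutive distinct sorted values.

def candidate_split_points(values):
    """Midpoints between consecutive distinct values of a quantitative input."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

# Five distinct values A < B < C < D < E give four possible splitting points
print(candidate_split_points([10, 20, 30, 40, 50]))  # [15.0, 25.0, 35.0, 45.0]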

24
Decision Tree – Determining the Split for a Quantitative Input Variable

Decision tree split by age (chi-square = 5.52, p-value = 0.019 < 0.05):
Node 0: Buyer 600 (40%), Non-buyer 900 (60%)
– Age < 30 years → Node 1: Buyer 230 (36.51%), Non-buyer 400 (63.49%)
– Age 30 years or more → Node 2: Buyer 370 (42.53%), Non-buyer 500 (57.47%)

Decision tree split by income (chi-square = 10.77, p-value = 0.001 < 0.05):
Node 0: Buyer 600 (40%), Non-buyer 900 (60%)
– Income < $100,000 → Node 1: Buyer 350 (36.84%), Non-buyer 600 (63.16%)
– Income $100,000 and above → Node 2: Buyer 250 (45.45%), Non-buyer 300 (54.55%)

Conclusion: Age and income can contribute towards predicting buying status. Increasing age and increasing income are associated with a higher probability of buying the product.
25
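To pick between candidate splits, the algorithm compares their chi-square statistics (or p-values) and keeps the most significant. The sketch below does this for the age and income splits shown above (an illustration using the node counts from the slides).

from scipy.stats import chi2_contingency

# Candidate splits with buyer / non-buyer counts in each child node
candidates = {
    "age (< 30 vs >= 30)":        [[230, 400], [370, 500]],
    "income (< 100k vs >= 100k)": [[350, 600], [250, 300]],
}

best = None
for name, table in candidates.items():
    chi2, p, _, _ = chi2_contingency(table, correction=False)
    print(f"{name}: chi-square = {chi2:.2f}, p = {p:.3f}")
    if best is None or p < best[1]:
        best = (name, p)

print("Most significant split:", best[0])  # income gives the smaller p-value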
Decision Tree
• Step 1: Each input variable is evaluated on its potential to split the data into two or more subsets so that the target variable is as differentiated (in a statistical sense) between the subsets as possible.
• Step 2: The statistical results (i.e. chi-square statistics or p-values) are compared and the most significant input variable is selected to split the tree at the best splitting point(s) (threshold(s)) identified for that variable.
• Step 3: Steps 1 and 2 are repeated for each child node in the decision tree.
• Decision tree algorithms perform steps 1 to 3 repeatedly, hence the name recursive partitioning.
• The iterative process stops when a stopping rule is encountered.
Examples of stopping rules (see the sketch after this list):
1) Maximum depth (number of levels) is reached
2) All potential parent nodes fail to have a pre-specified minimum number of observations
3) All potential child nodes fail to have a pre-specified minimum number of observations
4) None of the input variables reaches a pre-specified level of statistical significance
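In scikit-learn, several of these stopping rules correspond to hyperparameters of the tree estimator. The sketch below is an illustration with arbitrary parameter values; note that min_impurity_decrease is an impurity-based threshold, only roughly analogous to the statistical-significance rule used by CHAID.

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=0)

tree = DecisionTreeClassifier(
    max_depth=4,                 # rule 1: maximum depth (number of levels)
    min_samples_split=50,        # rule 2: minimum observations a parent node needs to be split
    min_samples_leaf=20,         # rule 3: minimum observations required in each child node
    min_impurity_decrease=0.01,  # roughly rule 4: a split must improve purity by a minimum amount
    random_state=0,
).fit(X, y)

print(tree.get_depth(), tree.get_n_leaves())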
26
Final Decision Tree
Note: Input variables that are higher up in the decision tree can be deemed the more important variables in predicting the target variable.

Node 0: Buyer 600 (40%), Non-buyer 900 (60%)
Split on Income:
– Income < $100,000 → Node 1: Buyer 350 (36.84%), Non-buyer 600 (63.16%)
– Income $100,000 and above → Node 2: Buyer 250 (45.45%), Non-buyer 300 (54.55%)
Node 1 split on Age:
– Age < 25 → Node 3: Buyer 50 (9.09%), Non-buyer 500 (90.91%)
– Age 25 and above → Node 4: Buyer 300 (75%), Non-buyer 100 (25%)
Node 2 split on Gender:
– Male → Node 5: Buyer 200 (50%), Non-buyer 200 (50%)
– Female → Node 6: Buyer 50 (33.33%), Non-buyer 100 (66.67%)
Node 5 split on Race:
– Chinese → Node 7: Buyer 30 (15%), Non-buyer 170 (85%)
– Malay & Indian → Node 8: Buyer 170 (85%), Non-buyer 30 (15%)

A customer with income less than $100,000 and age less than 25 is predicted as a non-buyer.
27
Assessing the predictive performance of the decision tree

                            Actual status
Predicted status      Buyer            Non-buyer        Total
Buyer                 470 (78.33%)     130 (21.67%)      600
Non-buyer             130 (14.44%)     770 (85.56%)      900
Total                 600              900              1500

Overall accuracy rate = (470 + 770) / 1500 = 82.67%
Classification error rate = (130 + 130) / 1500 = 17.33%

• The classification error rate is used to assess the adequacy (or predictive performance) of the decision tree.
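The accuracy and error rates follow directly from the confusion matrix counts; a small sketch using the table above:

# Confusion matrix counts from the table above
correct_buyers, correct_non_buyers = 470, 770
misclassified = 130 + 130
total = 1500

accuracy = (correct_buyers + correct_non_buyers) / total
error_rate = misclassified / total
print(f"accuracy = {accuracy:.2%}, error rate = {error_rate:.2%}")  # 82.67%, 17.33%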

28
Comparison of Decision Tree Models
• Split Type: C5.0 Multiple; CHAID Multiple; QUEST Binary; C&RT Binary
• Continuous Target: C5.0 No; CHAID Yes; QUEST No; C&RT Yes
• Continuous Predictors: C5.0 Yes; CHAID No; QUEST Yes; C&RT Yes
• Misclassification Costs: C5.0 Yes; CHAID No; QUEST No; C&RT Yes
• Criterion for Predictor Selection: C5.0 Information measure; CHAID Statistical; QUEST Statistical; C&RT Impurity (dispersion) measure
• Handling Missing Predictor Values: C5.0 Fractionalization; CHAID Missing becomes a category; QUEST Surrogates (substitute); C&RT Surrogates (substitute)
• Priors: C5.0 No; CHAID No; QUEST Yes; C&RT Yes
• Pruning Criterion: C5.0 Upper limit on predicted error; CHAID Stops rather than overfit; QUEST Cost-complexity pruning; C&RT Cost-complexity pruning
• Interactive Trees: C5.0 No; CHAID Yes; QUEST Yes; C&RT Yes
© Copyrights 2005 SPSS Malaysia

29
Comparison of Decision Tree Algorithms

CHAID (Chi-Square Automatic Interaction Detection)
• Classification method: uses chi-square statistics to identify optimal splits. It examines the crosstab between each of the predictor variables and the outcome and tests for significance using a chi-square independence test. If more than one of these relations is statistically significant, CHAID selects the most significant predictor (smallest p-value).
• Feature: handles a large number of predictors; appropriate for non-linear/complex relationships.

C&RT (Classification and Regression Trees)
• Classification method: uses recursive partitioning to split the training records into segments with similar output field values. It starts by examining the input fields to find the best split, measured by the reduction in an impurity index that results from the split. The split defines two subgroups, each of which is subsequently split into two more subgroups, and so on, until one of the stopping criteria is triggered.

QUEST (Quick, Unbiased, Efficient, Statistical Tree)
• Classification method: uses a sequence of rules, based on significance tests, to evaluate the predictor variables at a node. For selection purposes, as little as a single test may need to be performed on each predictor at a node. Splits are determined by running quadratic discriminant analysis using the selected predictor on groups formed by the target categories.
• Feature: an improvement on C&RT.

C5.0
• Classification method: splits the sample based on the field that provides the maximum information gain. Each subsample defined by the first split is then split again, usually based on a different field, and the process repeats until the subsamples cannot be split any further. Finally, the lowest-level splits are re-examined, and those that do not contribute significantly to the value of the model are removed or pruned.
• Feature: the most recent version of a long-standing machine-learning program.

30
Comparison of Decision Tree Algorithms
• Split Type: CHAID Multiple; C&RT Binary; QUEST Binary; C5.0 Multiple
• Continuous Target: CHAID Yes; C&RT Yes; QUEST No; C5.0 No
• Continuous Predictors: CHAID No; C&RT Yes; QUEST Yes; C5.0 Yes
• Misclassification Costs: CHAID No; C&RT Yes; QUEST No; C5.0 Yes
• Criterion for Predictor Selection: CHAID Statistical; C&RT Impurity (dispersion measure); QUEST Statistical; C5.0 Information measure (max. information gain)
• Handling Missing Predictor Values: CHAID Missing becomes a category; C&RT Surrogates (substitute); QUEST Surrogates (substitute); C5.0 Fractionalization
• Strengths:
– CHAID: generates non-binary trees; works for all types of predictors, and accepts both case weights and frequency variables.
– C&RT: robust in the presence of problems such as missing data and large numbers of fields.
– QUEST: reduces the processing time required for large C&RT analyses with either many variables or many cases; reduces the tendency found in classification tree methods to favor predictors that allow more splits (that is, continuous predictor variables or those with many categories); uses statistical tests to decide whether or not a predictor is used.
– C5.0: robust in the presence of problems such as missing data and large numbers of input fields; usually does not require long training times to estimate; C5.0 models tend to be easier to understand; offers a powerful boosting method to increase the accuracy of classification.
31
Splitting Criteria for Decision Tree using SAS EMiner

(a) Binary or nominal target variable:
• (1) Chi-square
• (2) Gini
• (3) Entropy
(b) Ordinal target variable: Entropy
(c) Interval target variable: F-test, variance reduction

32
Decision Tree Advantages
1. Easy to understand and interpret.
2. Maps nicely to a set of business rules.
3. Can be applied to real problems.
4. Makes no prior assumptions about the data.
5. Able to process both numerical and categorical data.
6. Can handle missing values as a separate category (e.g. CHAID).
33
34
Using IBM SPSS
Modeler 18
CART

35
CHAID

36
C5

37
EXERCISES
• Use the jobapply.sav data and compare the CART, CHAID and C5 models.
• Use the bankcredit SAS data and compare the decision tree models.

38
