
Chapter 9 – Classification and Regression Trees (CART)


Agenda

• Last week’s lecture review

• Week 11: Classification and Regression Trees


• Introduction
• Characteristics
• Classification and Regression Trees
• Implementation Examples
• Summary
• Lab

2
Introduction
• The basic principle:
1. Classify or predict an outcome based on a set of predictors
2. The output is a set of rules
3. Also called CART*, Decision Trees, Binary Trees, or just Trees

• Characteristics
 Data-driven, not model-driven
 Makes no assumptions about the data; no normalization is needed
 Performs well across a wide range of situations without much effort from the analyst
 CART applies to both numerical and categorical predictors
 Rules are represented by tree diagrams
 Very popular because it provides easily understandable decision
rules (at least if the trees are not too large).

* CART: Classification And Regression Tree

3
Example 1: Bank Customer Classification

Using CART to classify bank customers who receive a loan offer as either acceptors or nonacceptors, based on information such as their income, education level, and average credit card expenditure.

• Goal: classify a record as “will accept credit card offer” or “will not accept”.
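If the lab uses Python with scikit-learn (an assumption; the lecture does not name a tool), a minimal sketch of this example could look as follows. The column names and values below are hypothetical stand-ins for the bank data:

```python
# Sketch only: the columns and values are hypothetical stand-ins for the bank example.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

bank = pd.DataFrame({
    "Income":    [49, 120, 83, 30, 160, 45, 95, 110],
    "Education": [1, 3, 2, 1, 3, 2, 1, 3],
    "CCAvg":     [1.6, 8.9, 2.4, 0.4, 6.8, 1.0, 3.3, 7.5],
    "Accepted":  [0, 1, 0, 0, 1, 0, 0, 1],   # 1 = will accept the offer
})

clf = DecisionTreeClassifier(random_state=1)
clf.fit(bank[["Income", "Education", "CCAvg"]], bank["Accepted"])

# The fitted tree is a set of readable if/then rules.
print(export_text(clf, feature_names=["Income", "Education", "CCAvg"]))
```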

4
Tree Structure

• Two types of nodes in a tree: decision (= splitting) nodes and terminal nodes.
• Decision nodes have successors. If we use the tree to classify a new record for which we know only the values of the predictor variables, we “drop” the record down the tree, applying the splitting rule at each decision node it reaches.
• Terminal nodes (the leaves of the tree) have no successors and represent the final partitioning of the data by the predictors.
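A minimal sketch of the “drop the record down the tree” idea, using a hypothetical hand-built tree for the loan example (the structure and split values are illustrative, not taken from the slides):

```python
# Each decision node holds a predictor, a split value, and two subtrees;
# a terminal node (leaf) holds only the predicted class.
tree = {
    "predictor": "Income", "split": 100,
    "left":  {"leaf": "nonacceptor"},                     # Income <= 100
    "right": {"predictor": "CCAvg", "split": 3.0,         # Income > 100
              "left":  {"leaf": "nonacceptor"},
              "right": {"leaf": "acceptor"}},
}

def classify(record, node):
    """Drop a record down the tree until it reaches a terminal node."""
    if "leaf" in node:
        return node["leaf"]
    branch = "left" if record[node["predictor"]] <= node["split"] else "right"
    return classify(record, node[branch])

print(classify({"Income": 130, "CCAvg": 4.2}, tree))   # -> acceptor
```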

5
Example 1: Decision Rules and Classifying a New Record

6
Recursive Partitioning

7
Example 2: Riding Mowers

• Goal: find a way of classifying families in a city into those likely to purchase a riding mower
and those not likely to buy one.
• Data: 24 households classified as owning or not owning riding mowers (12 owners and 12
non-owners).
• Predictors: Income, Lot Size
Income Lot_Size Ownership
60.0 18.4 owner
85.5 16.8 owner
64.8 21.6 owner
61.5 20.8 owner
87.0 23.6 owner
110.1 19.2 owner
108.0 17.6 owner
82.8 22.4 owner
69.0 20.0 owner
93.0 20.8 owner
51.0 22.0 owner
81.0 20.0 owner
75.0 19.6 non-owner
52.8 20.8 non-owner
64.8 17.2 non-owner
43.2 20.4 non-owner
84.0 17.6 non-owner
49.2 17.6 non-owner
59.4 16.0 non-owner
66.0 18.4 non-owner
47.4 16.4 non-owner
33.0 18.8 non-owner
51.0 14.0 non-owner
63.0 14.8 non-owner

8
Example 2 (cont.)
First split: Income at 59.7

• How was this split selected? The algorithm examined each predictor variable (in
this case, Income and Lot Size) and all possible split values for each variable to
find the best split.
• What are the possible split values for a variable? They are simply the midpoints
between pairs of consecutive values for the predictor.
• The possible split points for Income are {38.1, 45.3, 50.1, …, 109.5} and those for
Lot Size are {14.4, 15.4, 16.2, …, 23}. These split points are ranked according to
how much they reduce impurity (heterogeneity) in the resulting rectangles.

Note: We already sorted the data
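A minimal sketch of computing the candidate split points for a numeric predictor, i.e., the midpoints between consecutive sorted values, using the Income column from the table above:

```python
# Midpoints between consecutive distinct sorted values are the candidate split points.
income = [60.0, 85.5, 64.8, 61.5, 87.0, 110.1, 108.0, 82.8, 69.0, 93.0, 51.0, 81.0,
          75.0, 52.8, 64.8, 43.2, 84.0, 49.2, 59.4, 66.0, 47.4, 33.0, 51.0, 63.0]

values = sorted(set(income))
split_points = [round((a + b) / 2, 2) for a, b in zip(values, values[1:])]
print(split_points[:2])   # [38.1, 45.3]
```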

9
Measuring Impurity
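As a sketch, assuming the standard definitions used with classification trees: the Gini index of a node is 1 − Σ p_k² and its entropy is −Σ p_k log₂ p_k, where p_k is the proportion of class k in the node.

```python
from math import log2

def gini(counts):
    """Gini index of a node given its class counts, e.g. gini([12, 12]) for the root."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy (in bits) of a node given its class counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

print(gini([12, 12]), entropy([12, 12]))   # 0.5 and 1.0 for the 12-owner / 12-nonowner root
```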

10
Impurity and Recursive Partitioning

• Obtain overall impurity measure (weighted avg. of individual rectangles)


• At each successive stage, compare this measure across all possible splits
in all variables
• Choose the split that reduces impurity the most (see the sketch below)
• Chosen split points become nodes on the tree
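A minimal sketch of this search for a single numeric predictor, assuming the Gini index as the impurity measure (a hypothetical helper, not the lab code): for every candidate split point, compute the weighted average impurity of the two resulting rectangles and keep the split with the lowest value.

```python
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def best_split(values, labels):
    """Return (split_point, weighted_impurity) minimizing impurity for one predictor."""
    pairs = sorted(zip(values, labels))
    xs = sorted(set(values))
    best = None
    for split in [(a + b) / 2 for a, b in zip(xs, xs[1:])]:
        left  = [lab for x, lab in pairs if x <= split]
        right = [lab for x, lab in pairs if x > split]
        imp = (len(left)  * gini([left.count(c)  for c in set(left)]) +
               len(right) * gini([right.count(c) for c in set(right)])) / len(pairs)
        if best is None or imp < best[1]:
            best = (split, imp)
    return best
```

On the Income column of the mower data this search should return the 59.7 split shown on the following slide.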

11
Example 2 (cont.)
Split on Income at 59.7:

Left node (Income ≤ 59.7):  Owner = 1,  Nonowner = 7, Total = 8
Right node (Income > 59.7): Owner = 11, Nonowner = 5, Total = 16

• By comparing the reduction in impurity across all possible splits in all possible predictors, the next split is chosen.
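As a check on the numbers above (assuming the Gini index as the impurity measure): before the split, the root node has 12 owners and 12 nonowners, so its impurity is 1 − 0.5² − 0.5² = 0.5. After splitting on Income at 59.7, the left node's impurity is 1 − (1/8)² − (7/8)² ≈ 0.219 and the right node's is 1 − (11/16)² − (5/16)² ≈ 0.430, giving a weighted average of (8/24)(0.219) + (16/24)(0.430) ≈ 0.359, a clear reduction from 0.5.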

12
Example 2 (cont.)

• If we continue splitting the mower data, the next split is on the Lot Size variable at the value 21.4.
• The lower-left rectangle, which contains records with Income ≤ 59.7 and Lot Size ≤ 21.4, contains only nonowners, whereas the upper-left rectangle, which contains records with Income ≤ 59.7 and Lot Size > 21.4, consists of a single record, an owner.

• The two left rectangles are now “pure.”

13
Example 2 (cont.)

14
Example 2 (cont.)

• The final stage of the recursive partitioning is shown in the figure.

15
Example 3: Build a tree
Categorical Variables:

•Examine all possible ways in which the categories can be split.


•e.g., categories A, B, C can be split 3 ways
 {A} and {B, C}
 {B} and {A, C}
 {C} and {A, B}
•With many categories, the number of possible splits becomes huge (see the sketch below)
Normalization:
•Whether predictors are numerical or categorical, it makes no difference whether they are standardized (normalized) or not, since splits depend only on the ordering of the values.
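A minimal sketch of enumerating the binary splits of a categorical predictor; with m categories there are 2^(m−1) − 1 such splits, which is why the count explodes when there are many categories:

```python
from itertools import combinations

def category_splits(categories):
    """All ways to split a set of categories into two non-empty groups."""
    cats = sorted(categories)
    first, rest = cats[0], cats[1:]
    splits = []
    # Fix the first category on the left side to avoid counting each split twice.
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            left = {first, *extra}
            right = set(cats) - left
            if right:                      # both sides must be non-empty
                splits.append((left, right))
    return splits

print(category_splits(["A", "B", "C"]))
# Three splits, i.e. 2**(3-1) - 1: {A}|{B,C}, {A,B}|{C}, {A,C}|{B}
```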

16
Example: Playing Golf

Goal: Classify 14 days as Yes (play golf) or No (no play)

Predictors = Outlook, Humidity, Wind

Output = Yes, No

17
Root node: 9 Yes / 5 No, split on Outlook

Sunny branch (2 Yes / 3 No): split further
Day  Outlook  Humid   Wind
D1   Sunny    High    Weak
D2   Sunny    High    Strong
D8   Sunny    High    Weak
D9   Sunny    Normal  Weak
D11  Sunny    Normal  Strong

Overcast branch (4 Yes / 0 No): pure subset
Day  Outlook   Humid   Wind
D3   Overcast  High    Weak
D7   Overcast  Normal  Strong
D12  Overcast  High    Strong
D13  Overcast  Normal  Weak

Rain branch (3 Yes / 2 No): split further
Day  Outlook  Humid   Wind
D4   Rain     High    Weak
D5   Rain     Normal  Weak
D6   Rain     Normal  Strong
D10  Rain     Normal  Weak
D14  Rain     High    Strong

18
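As a quick check of why Outlook makes a good first split, assuming entropy as the impurity measure (the class counts come from the three branches above):

```python
from math import log2

def entropy(yes, no):
    total = yes + no
    return -sum(p * log2(p) for p in (yes / total, no / total) if p > 0)

root = entropy(9, 5)                                               # about 0.940
after = (5 * entropy(2, 3) + 4 * entropy(4, 0) + 5 * entropy(3, 2)) / 14
print(round(root, 3), round(after, 3), round(root - after, 3))     # 0.94 0.694 0.247
```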
Overfitting Problem

• Training vs Validation data error

• Stopping Tree Growth


 Natural end of the process is 100% purity in each leaf
 Growing the tree to this point overfits the data: the tree ends up fitting the noise in the data
 Overfitting leads to low predictive accuracy on new data
 Past a certain point, the error rate for the validation data starts to increase

19
Overfitting Problem
• Different criteria can be used to stop tree growth before it starts overfitting the data. Examples:
 Tree depth (i.e., number of splits); we can control the maximum depth of the tree,
 The minimum number of records in a terminal node,
 The minimum reduction in impurity,
 The minimum number of records in a node needed in order to split, etc.
The problem is that it is not simple to determine a good stopping point using such rules (see the sketch below).
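If the lab uses scikit-learn (an assumption; the slides do not name a tool), these stopping rules map directly onto DecisionTreeClassifier parameters, for example:

```python
from sklearn.tree import DecisionTreeClassifier

# Each argument corresponds to one of the stopping rules listed above.
clf = DecisionTreeClassifier(
    max_depth=5,                 # tree depth (number of splits along a path)
    min_samples_leaf=10,         # minimum number of records in a terminal node
    min_impurity_decrease=0.01,  # minimum reduction in impurity to accept a split
    min_samples_split=20,        # minimum number of records in a node needed to split
    random_state=1,
)
# clf.fit(X_train, y_train)      # X_train / y_train are placeholders for the lab data
```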

20
Pruning

• CART lets the tree grow to its full extent, then prunes it back
• The idea is to find the point at which the validation error begins to rise
• Generate successively smaller trees by pruning leaves
• At each pruning stage, multiple trees are possible
• Use the cost complexity criterion to choose the best tree at that stage (sketched below)
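A minimal sketch of cost-complexity pruning with scikit-learn (again assuming that tool): grow the full tree, obtain the candidate ccp_alpha values, and keep the pruned tree that does best on the validation data.

```python
from sklearn.tree import DecisionTreeClassifier

def prune_by_validation(X_train, y_train, X_valid, y_valid):
    """Grow a full tree, then pick the cost-complexity alpha that does best on validation data."""
    full = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
    path = full.cost_complexity_pruning_path(X_train, y_train)

    best_alpha, best_score = 0.0, -1.0
    for alpha in path.ccp_alphas:
        pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=1).fit(X_train, y_train)
        score = pruned.score(X_valid, y_valid)        # validation accuracy
        if score > best_score:
            best_alpha, best_score = alpha, score
    return DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=1).fit(X_train, y_train)
```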

21
Advantages and Shortcomings

Advantages
• Easy to use, understand
• Produce rules that are easy to interpret & implement
• Variable selection & reduction is automatic
• Do not require the assumptions of statistical models
• Can work without extensive handling of missing data
Shortcomings
• May not perform well where the structure in the data is not well captured by horizontal or vertical (axis-parallel) splits
• Since the process deals with one variable at a time, there is no way to capture interactions between variables

22
Summary

• Classification and Regression Trees are an easily understandable and transparent method for predicting or classifying new records
• A tree is a graphical representation of a set of rules
• Trees must be pruned to avoid overfitting the training data
• As trees do not make any assumptions about the data structure, they usually require large samples

23
Agenda

• Last week’s lecture review

• Week 11: Classification and Regression Trees


• Introduction
• Characteristics
• Classification and Regression Trees
• Implementation Examples
• Summary
• Lab

24
