Unit 1 - Lecture 3
This lecture is based primarily on Chapters 3 and 4 of the book “Data Science for
Business” by Foster Provost and Tom Fawcett, 2013.
Lecture Objectives
At the end of this lecture, the student should be able to identify and define concepts such
as:
• Decision Trees and how to construct them
• Decision Trees and data mining
• Tree-based methods
• Trees as a set of rules
• How to create a decision tree in R
• Working with the tree, rpart, and the party R packages
• Filling in a data frame using the R built-in Editor
Decision Trees
Decision trees are hierarchically branched structures that help one reach a decision by asking only a few meaningful questions in a particular sequence.
Decision trees
• Are easy to use and explain and their classification accuracy is competitive with other
methods.
• Can generate knowledge from a few instances that can be applied to a broad
population.
• May not need values for all informative variables to help us reach a decision.
Tree-Structured Models
What do they look like?
Tree-Structured Models: “Rules”
Decision trees and Data Mining
• Decision trees (DTs), or classification trees, are one of the most popular data mining tools (along with linear and logistic regression).
Tree-based Methods
Tree-based methods segment or stratify the predictor space into simple, non-overlapping regions. The splitting is typically done in a top-down, greedy fashion.
Top-down means we begin at the top of the tree and successively split the predictor space.
Greedy means that at each step of the process, the best split is made at that particular node, instead of looking ahead and choosing a split that would lead to a better tree in some future step.
Decision Tree Scenario
Initial Considerations:
• What is the outlook (sunny, overcast, or rainy)?
• What is the temperature (hot, mild, or cool)?
• What is the humidity?
• Is it windy?
These four questions can be systematically compared to determine the one with the most
correct predictions, or equivalently, the one with the fewest errors (misses).
How to construct a Decision Tree
(Calculating the error table for a variable)
1) Start with the first variable, outlook, which can take 3 values (sunny, overcast, and rainy).
Sunny: 5 instances (3 → No, 2 → Yes). Rule: sunny → No. Errors: 2/5
Overcast: 4 instances (4 → Yes). Rule: overcast → Yes. Errors: 0/4
Rainy: 5 instances (3 → Yes, 2 → No). Rule: rainy → Yes. Errors: 2/5
2) Continue with the second variable, temperature, which can take 3 values (hot, mild, and cool).
Hot: 4 instances (2 → Yes, 2 → No). Rule: hot → No. Errors: 2/4 (there is a tie; choose either class)
Mild: 6 instances (4 → Yes, 2 → No). Rule: mild → Yes. Errors: 2/6
Cool: 4 instances (3 → Yes, 1 → No). Rule: cool → Yes. Errors: 1/4
Continuing with a similar analysis for the remaining variables, we obtain the complete error table for the four informative variables.

Attribute      Rule              Error   Total Errors
Outlook        Sunny → No        2/5     4/14
               Overcast → Yes    0/4
               Rainy → Yes       2/5
Temperature    Hot → No          2/4     5/14
               Mild → Yes        2/6
               Cool → Yes        1/4
Humidity       High → No         3/7     4/14
               Normal → Yes      1/7
Windy          False → Yes       2/8     5/14
               True → No         3/6
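A minimal R sketch of how these per-attribute error counts could be reproduced, assuming the 14 weather instances live in the gamedf data frame built later in this lecture (the column names Outlook, Temperature, Humidity, Windy, and Play are assumptions based on the variables named in these slides):

# errors made when each value of an attribute predicts its own majority class
error_count <- function(attribute, target) {
  tab <- table(attribute, target)           # value-by-class contingency table
  sum(rowSums(tab) - apply(tab, 1, max))    # misses per value, summed over values
}

# total errors per informative variable; should reproduce the totals in the table above
sapply(gamedf[c("Outlook", "Temperature", "Humidity", "Windy")],
       error_count, target = gamedf$Play)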
How to construct a Decision Tree
(How to choose the root of the tree)
The variable that leads to the smallest number of errors (and thus the largest number of correct decisions) should be chosen as the first node.
In this case, two variables (outlook and humidity) are tied for the smallest number of errors, each with 4 errors out of 14 instances. Breaking the tie in favor of outlook (whose overcast value is completely error-free), the decision tree will have outlook as the root node (also called the first splitting variable).
How to construct a Decision Tree
(How to construct the branches of the tree)
From the root node (outlook) the decision tree will be split into 3 branches or subtrees, one for each
of the three values of outlook.
The sunny branch will inherit the data for instances that have “sunny” as the value of outlook. These
will be used for further building of that sub-tree.
The rainy branch will inherit data for the instances that have “rainy” as the value of outlook. These will
be used for further building of that sub-tree.
The overcast branch will inherit data for the instances that have “overcast” as the value of outlook.
However, there is no need to further build this branch since the decision for all instances of this value
is always Yes.
The decision tree, after this first level of splitting, looks as shown in the next slide.
Decision Tree
Diagram: Outlook at the root, with branches for Sunny, Overcast, and Rainy.
Determining the remaining branches of the Decision tree
To build the remaining branches of the tree we will follow a procedure similar to the one used to determine the root of the tree.
In this case, for the sunny branch, error values will be calculated for the informative variables temperature,
humidity, and windy.
The error table is shown below.
Attribute      Rule             Error   Total Errors
Temperature    Hot → No         0/2     1/5
               Mild → No        1/2
               Cool → Yes       0/1
Humidity       High → No        0/3     0/5
               Normal → Yes     0/2
Windy          False → No       1/3     2/5
               True → No        1/2

The informative variable humidity shows the least amount of error, i.e., zero error; the other two variables have non-zero errors. Therefore, the Outlook: sunny branch will use humidity as the next splitting variable.
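Using the error_count helper sketched earlier, the same calculation could be repeated on just the rows the sunny branch inherits (assuming the lowercase value coding used in the gamedf sketch later in this lecture):

# restrict to the instances the sunny branch inherits, then re-score the remaining variables
sunny <- subset(gamedf, Outlook == "sunny")
sapply(sunny[c("Temperature", "Humidity", "Windy")],
       error_count, target = sunny$Play)   # humidity should come out with zero errors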
Determining the remaining branches of the Decision tree
In this case, for the rainy branch, error values will be calculated for the informative variables temperature, humidity,
and windy.
The error table is shown below.
Attribute      Rule             Error   Total Errors
Temperature    Mild → Yes       1/3     2/5
               Cool → Yes       1/2
Humidity       High → No        1/2     2/5
               Normal → Yes     1/3
Windy          False → Yes      0/3     0/5
               True → No        0/2

The informative variable windy shows zero error for this branch, so the Outlook: rainy branch will use windy as the next splitting variable.
Final Decision Tree
Diagram: Outlook at the root; the Sunny branch splits on Humidity, the Overcast branch is a YES leaf, and the Rainy branch splits on Windy.
Using the Decision tree to Solve the Current Problem
• IF (Employed = No) AND (Balance ≥ 50k) AND (Age < 45) THEN Class=No Write-off
• IF (Employed = No) AND (Balance ≥ 50k) AND (Age ≥ 45) THEN Class=Write-off
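These rules can be read as a simple lookup procedure. A minimal R sketch covering only the two rules shown on this slide (classify_writeoff is a hypothetical helper name, and Balance ≥ 50k is taken to mean 50,000):

classify_writeoff <- function(employed, balance, age) {
  # only the two rules listed above; the tree's other branches are not shown on this slide
  if (employed == "No" && balance >= 50000 && age < 45) return("No Write-off")
  if (employed == "No" && balance >= 50000 && age >= 45) return("Write-off")
  NA
}
classify_writeoff(employed = "No", balance = 60000, age = 50)   # "Write-off"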
What are the rules associated with this tree?
Diagram: the final weather decision tree, with Outlook at the root and its Sunny, Overcast, and Rainy branches.
What are we predicting?
Classification tree example: the population is first split over income (<50K vs. >=50K) and then over age (<45 vs. >=45), giving leaf probabilities of buying life insurance of p(LI)=0.15, p(LI)=0.43, and p(LI)=0.83; the lowest-probability group largely did not buy life insurance. A new individual (marked "?") is classified by dropping it down the tree and reading off the probability at the leaf it reaches.
MegaTelCo: Predicting Churn with Tree Induction
How to Create a Decision Tree in R
To create a decision tree we will first need to create a data frame, which is the most common data structure in R.
A data frame is built with a call such as data.frame(col.vector.1, col.vector.2, ...), where each col.vector is a character, numeric, or logical vector corresponding to one of the columns of a table that contains historical data.
Creating a Data frame in R
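One way such a frame could be built is sketched below, assuming the classic 14-instance weather ("play golf") data and the column names used later in the rpart formula; the exact rows used in the lecture may differ.

# each column is a factor vector; one row per historical instance
gamedf <- data.frame(
  Outlook     = factor(c("sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
                         "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy")),
  Temperature = factor(c("hot", "hot", "hot", "mild", "cool", "cool", "cool",
                         "mild", "cool", "mild", "mild", "mild", "hot", "mild")),
  Humidity    = factor(c("high", "high", "high", "high", "normal", "normal", "normal",
                         "high", "normal", "normal", "normal", "high", "normal", "high")),
  Windy       = factor(c("false", "true", "false", "false", "false", "true", "true",
                         "false", "false", "false", "true", "true", "false", "true")),
  Play        = factor(c("No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                         "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"))
)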
Checking the Structure of a data frame
> str(gamedf)
Taking a look at what the data frame looks like
To see what the data frame looks like, just type the name of the data frame at the console.
> gamedf
Using the Recursive Partitioning and Regression Trees Package
“rpart”
> install.packages("rpart")
> require("rpart")
or
> library("rpart")
Using the Recursive Partitioning and Regression Trees Package
“rpart”
# grow tree
> fit <- rpart(Play ~ Outlook + Temperature + Humidity, method = "class", data = gamedf)
> printcp(fit)
> summary(fit)
# plot tree
> plot(fit)
> text(fit)
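One practical note, assuming the 14-row gamedf sketched earlier: rpart's default control settings (minsplit = 20) will refuse to split such a small data set, so fit would be a single root node and plot(fit) would fail. A sketch of a call that forces the tree to grow:

fit <- rpart(Play ~ Outlook + Temperature + Humidity, method = "class", data = gamedf,
             control = rpart.control(minsplit = 2, cp = 0))   # allow splits on very small nodes
plot(fit)
text(fit, use.n = TRUE)   # label the splits and show class counts at the leaves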
Using the Classification and Regression Trees Package
"tree"
> install.packages("tree")
> require("tree")
or
> library("tree")
# grow tree
> tr <- tree(Play ~ Outlook + Temperature + Humidity, data = gamedf)   # tree() fits a classification tree when Play is a factor
# see summary
> summary(tr)
# plot tree
> plot(tr)
Using the R Package "A Laboratory for Recursive Partytioning"
"party"
> install.packages("party")
> require("party")
or
> library("party")
Using the R Package "A Laboratory for Recursive Partytioning"
"party"
# grow tree
> ct <- ctree(Play ~ Outlook + Temperature + Humidity, data = gamedf)   # ctree() fits a conditional inference tree
# see summary
> summary(ct)
#plot tree
> plot(ct)
Entering Data into a Data Frame via Keyboard
It is possible to create an “empty” data frame and then fill it in via the built-in editor of
R.
An assignment like name = character(0) creates a variable of mode character that contains no data.
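A minimal sketch of such an empty frame (the column names here are illustrative, mirroring the character, numeric, and logical column types mentioned above):

# an empty data frame: each column has a mode but zero rows
mydataframe <- data.frame(name   = character(0),
                          age    = numeric(0),
                          member = logical(0))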
Entering Data into a Data Frame via Keyboard
The edit() built-in function in R brings up the editor so the user can fill in the variables of the data frame.
mydataframe <- edit(mydataframe) invokes the editor so the user can fill in each one of the
variables of mydataframe.
Notice how the result of the editing is assigned back to the data frame object.
Central Tendency and Variability
A probability distribution describes the possible outcomes of an event and how likely each of those outcomes is.
"Central tendency" is a euphemism for "If you have a bunch of values, what's in the middle (and what does the middle even mean)?"
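A quick R illustration of two common answers to "what's in the middle" (the values here are arbitrary):

x <- c(2, 3, 3, 4, 90)   # one extreme value
mean(x)     # 20.4 -- the arithmetic mean is pulled toward the outlier
median(x)   # 3    -- the middle value after sorting is unaffected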
Fun Fact
Questions?