
CSCI E-84


A Practical Approach to Data


Science
Ramon A. Mata-Toledo, Ph.D.
Professor of Computer Science
Harvard Extension School

Unit 1 - Lecture 3

Wednesday, February 10, 2016


1
Business Problems and Data Science
Solutions

This lecture is based primarily on Chapters 3 and 4 of the book “Data Science for
Business” by Foster Provost and Tom Fawcett, 2013.

Thanks also to Professor P. Adamopoulos (Stern School of Business of New York


University) and Professor Tomer Geva (The Tel Aviv University School of
Management)

Figures are used with the authors’ permission.

2
Lecture Objectives
At the end of this lecture, the student should be able to identify and define concepts such
as:
• Decision Trees and how to construct them
• Decision Trees and data mining
• Tree-based methods
• Trees as a set of rules
• How to create a decision tree in R
• Working with the tree, rpart, and party R packages
• Filling in a data frame using the R built-in Editor

3
Decision Trees

Decision trees are hierarchically branched structures that help one come to a decision by
asking only a few meaningful questions in a particular sequence.

Decision trees

• Are easy to use and explain and their classification accuracy is competitive with other
methods.

• Can generate knowledge from a few instances that can be applied to a broad
population.

• May not need values for all informative variables to help us reach a decision.

4
Tree-Structured Models

What do they look like?

I think I shall never see
a poem as lovely as a tree.

Poems are made by fools like me,
but only God can make a tree.

Joyce Kilmer (1886-1918)

5
Tree-Structured Models: “Rules”

• No two parents share descendants


• There are no cycles
• The branches always “point downwards”
• The points along the tree where the predictor space is split are called internal nodes
• Every example always ends up at a leaf node (or terminal node) with some specific class
determination

• Probability estimation trees, regression trees (to be continued..)

6
Decision trees and Data Mining

• Decision trees (DTs), or classification trees, are one of the most popular data mining
tools
• (along with linear and logistic regression)

• Almost all data mining packages include DTs

• They have advantages for model comprehensibility, which is important for:


• model evaluation
• communication to non-DM-savvy stakeholders

7
Tree-based Methods

We will consider tree-based methods for regression and classification.

Tree-based methods segment, or stratify, the predictor space into simple, non-overlapping regions.

The approach we will use is recursive, top-down, and greedy.

Top-down means we begin at the top of the tree and successively split the predictor space.

Recursive means that the process is inherently repetitive.

Greedy means that at each step of the process, the best split is made at that particular node
instead of looking ahead and choosing a split that would produce a better tree in some future
step.

8
Decision Tree Scenario

Example: The table below is a historical record of the atmospheric conditions and the decisions
made to play a game in a little league tournament. (Data courtesy of Witten, Frank, and Hall, 2010.)

Outlook    Temperature  Humidity  Windy  Play
Sunny      Hot          High      False  No
Sunny      Hot          High      True   No
Overcast   Hot          High      False  Yes
Rainy      Mild         High      False  Yes
Rainy      Cool         Normal    False  Yes
Rainy      Cool         Normal    True   No
Overcast   Cool         Normal    True   Yes
Sunny      Mild         High      False  No
Sunny      Cool         Normal    False  Yes
Rainy      Mild         Normal    False  Yes
Sunny      Mild         Normal    True   Yes
Overcast   Mild         High      True   Yes
Overcast   Hot          Normal    False  Yes
Rainy      Mild         High      True   No

The objective of this example is to determine whether, given the atmospheric conditions shown
below, the game should be played.

Outlook  Temperature  Humidity  Windy  Play
Sunny    Hot          Normal    True   ??

9
How to construct a Decision Tree
If there were a row that matched the given conditions, we could make the same decision.
However, there is no such past instance in this case.

Initial Considerations:

• What should be the first question to ask?

• How do we determine the importance of each question?

• How do we determine the root of the tree?

• Which question gives the most insight?

• Which question provides the shortest tree?


10
How to construct a Decision Tree

(Continuation)

There are 4 informative variables (choices):

• What is the outlook?

• What is the temperature?

• What is the humidity?

• Is it windy?

These four questions can be systematically compared to determine the one with the most
correct predictions, or equivalently, the one with the fewest errors (misses).

11
How to construct a Decision Tree

(Calculating the error table for a variable)

1) Start with the first variable, outlook, which can take 3 values (sunny, overcast, and rainy).

Sunny:    5 instances (3 → No, 2 → Yes)   Rule: sunny → No       2/5 errors
Overcast: 4 instances (4 → Yes)           Rule: overcast → Yes   0/4 errors
Rainy:    5 instances (3 → Yes, 2 → No)   Rule: rainy → Yes      2/5 errors

Error table for outlook:

Attribute   Rules             Error   Total Errors
Outlook     Sunny → No        2/5     4/14
            Overcast → Yes    0/4
            Rainy → Yes       2/5


12
How to construct a Decision Tree

(Calculating the error table for the variables - Continuation)

2) Continue with the second variable, temperature, which can take 3 values (hot, mild, and cool).

Hot:  4 instances (2 → Yes, 2 → No)   Rule: hot → No     2/4 errors   (a tie; choose either class)
Mild: 6 instances (4 → Yes, 2 → No)   Rule: mild → Yes   2/6 errors
Cool: 4 instances (3 → Yes, 1 → No)   Rule: cool → Yes   1/4 errors

Error table for temperature:

Attribute     Rules          Error   Total Errors
Temperature   Hot → No       2/4     5/14
              Mild → Yes     2/6
              Cool → Yes     1/4


13
How to construct a Decision Tree

(Final Error table)

Continuing with similar analysis for the remaining variables, we have the following error table.

Complete error table for all four informative variables:

Attribute     Rules            Error   Total Errors
Outlook       Sunny → No       2/5     4/14
              Overcast → Yes   0/4
              Rainy → Yes      2/5
Temperature   Hot → No         2/4     5/14
              Mild → Yes       2/6
              Cool → Yes       1/4
Humidity      High → No        3/7     4/14
              Normal → Yes     1/7
Windy         False → Yes      2/8     5/14
              True → No        3/6
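
A minimal R sketch (not from the slides) of how these totals could be computed, assuming the
14 historical instances are stored in the data frame gamedf that is built later in this lecture:

# For each attribute, count the instances in every value group that disagree with
# that group's majority answer; the sum of these minority counts is the attribute's
# total number of errors.
error_for <- function(attribute, outcome) {
  counts <- table(attribute, outcome)                  # value-by-Play contingency table
  sum(apply(counts, 1, function(r) sum(r) - max(r)))   # errors of each majority rule
}

sapply(gamedf[c("Outlook", "Temperature", "Humidity", "Windy")],
       error_for, outcome = gamedf$Play)
# Expected totals (out of 14): Outlook 4, Temperature 5, Humidity 4, Windy 5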
14
How to construct a Decision Tree

(How to choose the root of the tree)

The variable that leads to the fewest errors (and thus the greatest number of correct decisions)
should be chosen as the first node.

In this case, two variables (outlook and humidity) have the fewest errors: both make 4
errors out of 14 instances.

How do we break the tie?


Notice that outlook has a pure subtree (overcast), while there is no such “pure” subclass
for the humidity variable. The tie is broken in favor of outlook.

The decision tree will have outlook as the root node (also called the first splitting variable)

15
How to construct a Decision Tree

(How to construct the branches of the tree)

From the root node (outlook) the decision tree will be split into 3 branches or subtrees, one for each
of the three values of outlook.

The sunny branch will inherit the data for instances that have “sunny” as the value of outlook. These
will be used for further building of that sub-tree.

The rainy branch will inherit data for the instances that have “rainy” as the value of outlook. These will
be used for further building of that sub-tree.

The overcast branch will inherit data for the instances that have “overcast” as the value of outlook.
However, there is no need to further build this branch since the decision for all instances of this value
is always Yes.

The decision tree, after this first level of splitting, looks as shown in the next slide.

16
Decision Tree

                                Outlook
          = Sunny              = Overcast              = Rainy

  (data inherited by the           YES          (data inherited by the
   sunny branch)                                 rainy branch)

  Temperature Humidity Windy Play        Temperature Humidity Windy Play
  Hot         High     False No          Mild        High     False Yes
  Hot         High     True  No          Cool        Normal   False Yes
  Mild        High     False No          Cool        Normal   True  No
  Cool        Normal   False Yes         Mild        Normal   False Yes
  Mild        Normal   True  Yes         Mild        High     True  No

17
Determining the remaining branches of the Decision tree
To build the remaining branches of the tree we follow the same procedure used to determine the
root of the tree.

For the sunny branch, error values are calculated for the informative variables temperature,
humidity, and windy. The error table is shown below.

Attribute     Rules          Error   Total Errors
Temperature   Hot → No       0/2     1/5
              Mild → No      1/2
              Cool → Yes     0/1
Humidity      High → No      0/3     0/5
              Normal → Yes   0/2
Windy         False → No     1/3     2/5
              True → Yes     1/2

The informative variable humidity shows the least error, namely zero errors; the other two
variables have non-zero errors. Therefore, the Outlook = sunny branch will use Humidity as
the next splitting variable.
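
The same counting can be reproduced programmatically with a short sketch (again assuming the
gamedf data frame built later in this lecture): restrict the data to the instances the sunny
branch inherits and recount the errors of each remaining attribute.

# keep only the instances with Outlook = Sunny, then total the errors per attribute
sunny <- gamedf[gamedf$Outlook == "Sunny", ]
sapply(sunny[c("Temperature", "Humidity", "Windy")], function(a) {
  counts <- table(a, sunny$Play)
  sum(apply(counts, 1, function(r) sum(r) - max(r)))
})
# Expected totals (out of 5): Temperature 1, Humidity 0, Windy 2
# Repeating this with Outlook == "Rainy" reproduces the table on the next slide.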

18
Determining the remaining branches of the Decision tree

For the rainy branch, error values are calculated for the informative variables temperature,
humidity, and windy. The error table is shown below.

Attribute     Rules          Error   Total Errors
Temperature   Mild → Yes     1/3     2/5
              Cool → Yes     1/2
Humidity      High → No      1/2     2/5
              Normal → Yes   1/3
Windy         False → Yes    0/3     0/5
              True → No      0/2

The informative variable windy shows the least error, namely zero errors; the other two
variables have non-zero errors. Therefore, the Outlook = rainy branch will use Windy as
the next splitting variable.

19
Final Decision Tree

Outlook
  = Sunny    → Humidity
                 = High   → No (3.0)
                 = Normal → Yes (2.0)
  = Overcast → YES
  = Rainy    → Windy
                 = True   → No (2.0)
                 = False  → Yes (3.0)
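
The construction just carried out by hand is exactly the recursive, top-down, greedy procedure
described on slide 8. A rough R sketch of it is shown below (not from the lecture; it assumes
the gamedf data frame built later, and it breaks ties simply by attribute order rather than by
the “pure subtree” heuristic used above):

# Print a tree: at every node pick the attribute whose majority-vote rules make the
# fewest errors, split on it, and recurse until the node is pure or no attributes remain.
build_tree <- function(data, target, attributes, depth = 0) {
  pad    <- paste(rep("  ", depth), collapse = "")
  labels <- data[[target]]
  if (length(unique(labels)) == 1 || length(attributes) == 0) {
    cat(pad, "-> ", names(which.max(table(labels))), "\n", sep = "")  # leaf: majority class
    return(invisible(NULL))
  }
  errors <- sapply(attributes, function(a) {                          # greedy step
    counts <- table(data[[a]], labels)
    sum(apply(counts, 1, function(r) sum(r) - max(r)))
  })
  best <- attributes[which.min(errors)]
  cat(pad, "split on ", best, "\n", sep = "")
  for (v in unique(as.character(data[[best]]))) {                     # recursive step
    cat(pad, "  ", best, " = ", v, "\n", sep = "")
    build_tree(data[data[[best]] == v, , drop = FALSE],
               target, setdiff(attributes, best), depth + 1)
  }
}

# Example call:
# build_tree(gamedf, "Play", c("Outlook", "Temperature", "Humidity", "Windy"))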

20
Using the Decision tree to Solve the Current Problem

The new instance to classify:

Outlook  Temperature  Humidity  Windy  Play
Sunny    Hot          Normal    True   ??

Following the tree: Outlook = Sunny → Humidity = Normal → Yes, so the game should be played.

Outlook
  = Sunny    → Humidity
                 = High   → No (3.0)
                 = Normal → Yes (2.0)
  = Overcast → YES
  = Rainy    → Windy
                 = True   → No (2.0)
                 = False  → Yes (3.0)
21
Trees as Sets of Rules

• IF (Employed = Yes) THEN Class=No Write-off

• IF (Employed = No) AND (Balance < 50k) THEN Class=No Write-off

• IF (Employed = No) AND (Balance ≥ 50k) AND (Age < 45) THEN Class=No Write-off

• IF (Employed = No) AND (Balance ≥ 50k) AND (Age ≥ 45) THEN Class=Write-off

22
What are the rules associated with this tree?

Outlook
  = Sunny    → Humidity
                 = High   → No (3.0)
                 = Normal → Yes (2.0)
  = Overcast → YES
  = Rainy    → Windy
                 = True   → No (2.0)
                 = False  → Yes (3.0)
23
MegaTelCo: Predicting Churn with Tree Induction

24
What are we predicting?
[Figure: a scatter plot of Age versus Income (with reference lines at Age = 45 and Income = 50K)
marking who bought life insurance and who did not, shown next to the corresponding classification
tree. The tree first splits on Income: for Income < 50K, p(LI) = 0.15; for Income >= 50K it splits
on Age, giving p(LI) = 0.43 for Age < 45 and p(LI) = 0.83 for Age >= 45. A new instance marked "?"
is classified, and the slide's answer is "Interested in LI? = No".]


25
What are we predicting?
[Figure: the same scatter plot and classification tree as on the previous slide; here the reported
answer for the new instance is "Interested in LI? = 3/7", a proportion rather than a hard Yes/No
answer.]


26
MegaTelCo: Predicting Churn with Tree Induction

27
MegaTelCo: Predicting Churn with Tree Induction

28
How to Create a Decision Tree in R

To create a decision tree we will need to create a data frame, which is the most common
data structure in R.

A data frame is created with the data.frame( ) function.

A data frame object is created as follows:

mydataframe <- data.frame(colvector1, colvector2, ..., colvectorn)

where each colvector is a vector (character, numeric, or logical, for example) corresponding to
a column of the table that contains the historical data.

29
Creating a Data frame in R

Creating the column vectors from the historical table:

Outlook    Temperature  Humidity  Windy  Play
Sunny      Hot          High      False  No
Sunny      Hot          High      True   No
Overcast   Hot          High      False  Yes
Rainy      Mild         High      False  Yes
Rainy      Cool         Normal    False  Yes
Rainy      Cool         Normal    True   No
Overcast   Cool         Normal    True   Yes
Sunny      Mild         High      False  No
Sunny      Cool         Normal    False  Yes
Rainy      Mild         Normal    False  Yes
Sunny      Mild         Normal    True   Yes
Overcast   Mild         High      True   Yes
Overcast   Hot          Normal    False  Yes
Rainy      Mild         High      True   No

Outlook     <- c("Sunny", "Sunny", ..., "Rainy")
Temperature <- c("Hot", "Hot", ..., "Mild")
Humidity    <- c("High", "High", ..., "High")
Windy       <- c(FALSE, TRUE, ..., TRUE)
Play        <- c("No", "No", ..., "No")

30
Creating a Data frame
(continuation)

Outlook     <- c("Sunny", "Sunny", ..., "Rainy")

Temperature <- c("Hot", "Hot", ..., "Mild")

Humidity    <- c("High", "High", ..., "High")

Windy       <- c(FALSE, TRUE, ..., TRUE)

Play        <- c("No", "No", ..., "No")

gamedf <- data.frame(Outlook, Temperature, Humidity, Windy, Play)
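
For reference, a fully written-out version of these vectors, filled in directly from the
historical table (this completed listing is a sketch, not part of the original slides):

Outlook     <- c("Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Rainy", "Overcast",
                 "Sunny", "Sunny", "Rainy", "Sunny", "Overcast", "Overcast", "Rainy")
Temperature <- c("Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                 "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild")
Humidity    <- c("High", "High", "High", "High", "Normal", "Normal", "Normal",
                 "High", "Normal", "Normal", "Normal", "High", "Normal", "High")
Windy       <- c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE,
                 FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE)
Play        <- c("No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                 "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No")
gamedf      <- data.frame(Outlook, Temperature, Humidity, Windy, Play)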

31
Checking the Structure of a data frame

Examine the structure of the data frame using the function

> str(gamedf)
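
With the R versions current at the time of this lecture, character columns are converted to
factors by default, so the output should look roughly as follows (the exact display depends on
your R version, and the value codes are elided here):

'data.frame':   14 obs. of  5 variables:
 $ Outlook    : Factor w/ 3 levels "Overcast","Rainy",..: ...
 $ Temperature: Factor w/ 3 levels "Cool","Hot",..: ...
 $ Humidity   : Factor w/ 2 levels "High","Normal": ...
 $ Windy      : logi  FALSE TRUE FALSE FALSE FALSE TRUE ...
 $ Play       : Factor w/ 2 levels "No","Yes": ...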

32
Taking a look at what the data frame looks like

To see what the data frame looks like, just type the name of the data frame at the
console.
> gamedf

33
Using the Recursive Partitioning and Regression Trees Package
“rpart”

1) Install the package using

> install.packages("rpart")

2)Load the package using

> require("rpart")
or
> library("rpart")

34
Using the Recursive Partitioning and Regression Trees Package
“rpart”
# grow tree

> fit <- rpart(Play ~ Outlook + Temperature + Humidity, method = "class", data = gamedf)

#display the results

> printcp(fit)

#detailed summary of splits

> summary(fit)

#plot tree

> plot(fit, uniform = TRUE, main = "Classification Tree for gamedf")
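
A practical note (not from the slides): with only 14 rows, rpart's default stopping rules
(minsplit = 20) usually refuse to make any split, leaving a single-node tree. The sketch below
loosens those rules, labels the plot, and classifies the new day from the scenario; the control
values and the object name newday are illustrative assumptions, and the formula here also
includes Windy, which the slide's version omits.

# loosen the stopping rules so a tree can grow on 14 instances
> fit <- rpart(Play ~ Outlook + Temperature + Humidity + Windy,
               method = "class", data = gamedf,
               control = rpart.control(minsplit = 2, cp = 0))

# plot the tree and label its splits and leaves
> plot(fit, uniform = TRUE, main = "Classification Tree for gamedf")
> text(fit, use.n = TRUE)

# classify the new day (Sunny, Hot, Normal, True); compare the result with the
# manual answer (Yes) obtained earlier in the lecture
> newday <- data.frame(Outlook = "Sunny", Temperature = "Hot",
                       Humidity = "Normal", Windy = TRUE)
> predict(fit, newday, type = "class")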


35
Using the “Fit a Classification or Regression Tree” Package
“tree”

1) Install the package using

> install.packages("tree")

2)Load the package using

> require("tree")
or
> library("tree")

36
Using the “Fit a Classification or Regression Tree” Package
“tree”
# grow tree

> tr <- tree(Play ~ Outlook + Temperature + Humidity, data = gamedf)

# see summary

> summary(tr)

#plot tree

> plot(tr)
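
A brief note (not from the slides): tree() expects the class label to be a factor and, like
rpart, has stopping rules (minsize = 10 by default) that may prevent any split on only 14 rows.
Converting Play first (needed if it was read in as plain character text) and loosening the
control settings, as sketched below with illustrative values, is one way to experiment; text()
adds the split labels to the plot.

> gamedf$Play <- factor(gamedf$Play)
> tr <- tree(Play ~ Outlook + Temperature + Humidity, data = gamedf,
             control = tree.control(nobs = nrow(gamedf), mincut = 1,
                                    minsize = 2, mindev = 0))
> plot(tr)
> text(tr)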

37
Using the R Package “A Laboratory for Recursive Partytioning”
“party”

1) Install the package using

> install.packages("party")

2)Load the package using

> require("party")
or
> library("party")

38
Using the R Package “A Laboratory for Recursive Partytioning”
“party”
# grow tree

> ct <- ctree(Play ~ Outlook + Temperature + Humidity, data = gamedf)

# see summary

> summary(ct)

#plot tree

> plot(ct)
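
A brief note (not from the slides): for ctree objects, print(ct) is the usual way to inspect
the fitted tree, and plot(ct) already draws a fully annotated diagram. With only 14 rows,
ctree's default significance-based stopping criterion may decline to split at all; it can be
relaxed through the controls argument, for example (illustrative values):

> print(ct)
> ct <- ctree(Play ~ Outlook + Temperature + Humidity, data = gamedf,
              controls = ctree_control(mincriterion = 0.5, minsplit = 2, minbucket = 1))
> plot(ct)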

39
Entering Data into a Data Frame via Keyboard

It is possible to create an “empty” data frame and then fill it in via the built-in editor of
R.

Example (of empty data frame)

mydataframe <- data.frame(name = character(0), age = numeric(0),
                          gender = character(0), smoker = logical(0))

An assignment like name = character(0) creates a variable of mode character that contains
no data.

40
Entering Data into a Data Frame via Keyboard

The built-in edit() function in R brings up the editor so the user can fill in the variables of
the data frame.

mydataframe <- edit(mydataframe) invokes the editor so the user can fill in each one of the
variables of mydataframe.

Notice how the result of the editing is assigned back to the data frame object.

41
Central Tendency and Variability

A probability distribution describes the possible outcomes of an event and how likely
each of those outcomes is.

What do we need to know about a distribution?


• Where is its center?
• How “fat” or “thin” is it?
• What are its other characteristics (is it skewed? does it have a “hump”?)

“Central tendency” is a euphemism for “if you have a bunch of values, what is in the
middle (and what does the middle even mean)?”
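
As a quick preview (not from the slides), R's built-in functions answer these questions
directly; a minimal sketch on some made-up values:

> x <- c(2, 3, 3, 4, 5, 9)    # hypothetical sample
> mean(x)                     # center: arithmetic mean
> median(x)                   # center: middle value
> sd(x)                       # spread: standard deviation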

42
Fun Fact

• Mr. & Mrs. Doe (http://en.wikipedia.org/wiki/John_Doe)


• “John Doe” for males,
• “Jane Doe” for females, and
• “Jonnie Doe” and “Janie Doe” for children.

43
Questions?

44
