
CSCI E-84


A Practical Approach to Data


Science
Ramon A. Mata-Toledo, Ph.D.
Professor of Computer Science
Harvard Extension School

Unit 1 - Lecture 3

Wednesday, February 10, 2016


1
Business Problems and Data Science
Solutions

This lecture is based primarily on Chapters 3 and 4 of the book “Data Science for
Business” by Foster Provost and Tom Fawcett, 2013.

Thanks also to Professor P. Adamopoulos (Stern School of Business of New York


University) and Professor Tomer Geva (The Tel Aviv University School of
Management)

Figures are used with the authors’ permission.

2
Lecture Objectives
At the end of this lecture, the student should be able to identify and define concepts such
as:
• Decision Trees and how to construct them
• Decision Trees and data mining
• Tree-based methods
• Trees as a set of rules
• How to create a decision tree in R
• Working with the tree, rpart, and party R packages
• Filling in a data frame using the R built-in Editor

3
Decision Trees

Decision trees are hierarchically branched structures that help one come to a decision by
asking only a few meaningful questions in a particular sequence.

Decision trees

• Are easy to use and explain and their classification accuracy is competitive with other
methods.

• Can generate knowledge from a few instances that can be applied to a broad
population.

• May not need values for all informative variables to help us reach a decision.

4
Tree-Structured Models

What do they look like?

I think I shall never see
a poem as lovely as a tree.

Poems are made by fools like me,
but only God can make a tree.

Joyce Kilmer (1886-1918)

5
Tree-Structured Models: “Rules”

• No two parents share descendants


• There are no cycles
• The branches always “point downwards”
• The points along the tree where the predictor space is split are called internal nodes
• Every example always ends up at a leaf node (or terminal node) with some specific class
determination

• Probability estimation trees, regression trees (to be continued..)

6
Decision trees and Data Mining

• Decision trees (DTs), or classification trees, are one of the most popular data mining
tools
• (along with linear and logistic regression)

• Almost all data mining packages include DTs

• They have advantages for model comprehensibility, which is important for:


• model evaluation
• communication to non-DM-savvy stakeholders

7
Tree-based Methods

We will consider tree-based methods for regression and classification.

Tree-based methods segment, or stratify, the predictor space into simple, non-overlapping regions.

The approach we will use is recursive, top-down, and greedy.

Top-down means we begin at the top of the tree and successively split the predictor space.

Recursive means that the process is inherently repetitive.

Greedy means that at each step of the process, the best split is made at that particular node
instead of looking ahead and choosing a split that would produce a better tree in some future
step.

8
Decision Tree Scenario

Example: The table below is a historical record of the atmospheric conditions and the decisions
made to play a game in a little league tournament. (Data courtesy of Witten, Frank, and Hall, 2010.)

Outlook    Temperature  Humidity  Windy  Play
Sunny      Hot          High      False  No
Sunny      Hot          High      True   No
Overcast   Hot          High      False  Yes
Rainy      Mild         High      False  Yes
Rainy      Cool         Normal    False  Yes
Rainy      Cool         Normal    True   No
Overcast   Cool         Normal    True   Yes
Sunny      Mild         High      False  No
Sunny      Cool         Normal    False  Yes
Rainy      Mild         Normal    False  Yes
Sunny      Mild         Normal    True   Yes
Overcast   Mild         High      True   Yes
Overcast   Hot          Normal    False  Yes
Rainy      Mild         High      True   No

The objective of this example is to determine whether, given the atmospheric conditions shown
below, the game should be played.

Outlook  Temperature  Humidity  Windy  Play
Sunny    Hot          Normal    True   ??

9
How to construct a Decision Tree
If there were a row that matched the given conditions, we could make the same decision.
However, there is no such past instance in this case.

Initial Considerations:

• What should be the first question to ask?

• How do we determine the importance of each question?

• How do we determine the root of the tree?

• Which question gives the most insight?

• Which question provides the shortest tree?


10
How to construct a Decision Tree

(Continuation)

There are 4 informative variables (choices):

• What is the outlook?

• What is the temperature?

• What is the humidity?

• Is it windy?

These four questions can be systematically compared to determine the one with the most
correct predictions, or equivalently, the one with the fewest errors (misses).

11
How to construct a Decision Tree

(Calculating the error table for a variable)

1) Start with the first variable, outlook, which can take 3 values (sunny, overcast, and rainy).

Sunny:    5 instances (3 → No, 2 → Yes)   Rule: sunny → No       2/5 errors
Overcast: 4 instances (4 → Yes)           Rule: overcast → Yes   0/4 errors
Rainy:    5 instances (3 → Yes, 2 → No)   Rule: rainy → Yes      2/5 errors

Error table for outlook:

Attribute   Rules             Error   Total Errors
Outlook     Sunny → No        2/5     4/14
            Overcast → Yes    0/4
            Rainy → Yes       2/5


12
How to construct a Decision Tree

(Calculating the error table for the variables - Continuation)

2) Continue with the second variable, temperature, which can take 3 values (hot, mild, and cool).

Hot:  4 instances (2 → Yes, 2 → No)   Rule: hot → No     2/4 errors   (a tie; choose either class)
Mild: 6 instances (4 → Yes, 2 → No)   Rule: mild → Yes   2/6 errors
Cool: 4 instances (3 → Yes, 1 → No)   Rule: cool → Yes   1/4 errors

Error table for temperature:

Attribute     Rules          Error   Total Errors
Temperature   Hot → No       2/4     5/14
              Mild → Yes     2/6
              Cool → Yes     1/4


13
How to construct a Decision Tree

(Final Error table)

Continuing with similar analysis for the remaining variables, we have the following error table.

Complete error table for all four informative variables:

Attribute     Rules            Error   Total Errors
Outlook       Sunny → No       2/5     4/14
              Overcast → Yes   0/4
              Rainy → Yes      2/5
Temperature   Hot → No         2/4     5/14
              Mild → Yes       2/6
              Cool → Yes       1/4
Humidity      High → No        3/7     4/14
              Normal → Yes     1/7
Windy         False → Yes      2/8     5/14
              True → No        3/6
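
A minimal R sketch (not from the slides) of how these totals could be computed, assuming the
14 historical instances are stored in the data frame gamedf that is built later in this lecture:

# For each attribute, count the instances in every value group that disagree with
# that group's majority answer; the sum of these minority counts is the attribute's
# total number of errors.
error_for <- function(attribute, outcome) {
  counts <- table(attribute, outcome)                  # value-by-Play contingency table
  sum(apply(counts, 1, function(r) sum(r) - max(r)))   # errors of each majority rule
}

sapply(gamedf[c("Outlook", "Temperature", "Humidity", "Windy")],
       error_for, outcome = gamedf$Play)
# Expected totals (out of 14): Outlook 4, Temperature 5, Humidity 4, Windy 5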
14
How to construct a Decision Tree

(How to choose the root of the tree)

The variable that leads to the fewest errors (and thus the greatest number of correct decisions)
should be chosen as the first node.

In this case, two variables (outlook and humidity) have the fewest errors: both make 4
errors out of 14 instances.

How do we break the tie?


Notice that outlook has a pure subtree (overcast), while there is no such “pure” subclass
for the humidity variable. The tie is broken in favor of outlook.

The decision tree will have outlook as the root node (also called the first splitting variable)

15
How to construct a Decision Tree

(How to construct the branches of the tree)

From the root node (outlook) the decision tree will be split into 3 branches or subtrees, one for each
of the three values of outlook.

The sunny branch will inherit the data for instances that have “sunny” as the value of outlook. These
will be used for further building of that sub-tree.

The rainy branch will inherit data for the instances that have “rainy” as the value of outlook. These will
be used for further building of that sub-tree.

The overcast branch will inherit data for the instances that have “overcast” as the value of outlook.
However, there is no need to further build this branch since the decision for all instances of this value
is always Yes.

The decision tree, after this first level of splitting, looks as shown in the next slide.

16
Decision Tree

                                Outlook
          = Sunny              = Overcast              = Rainy

  (data inherited by the           YES          (data inherited by the
   sunny branch)                                 rainy branch)

  Temperature Humidity Windy Play        Temperature Humidity Windy Play
  Hot         High     False No          Mild        High     False Yes
  Hot         High     True  No          Cool        Normal   False Yes
  Mild        High     False No          Cool        Normal   True  No
  Cool        Normal   False Yes         Mild        Normal   False Yes
  Mild        Normal   True  Yes         Mild        High     True  No

17
Determining the remaining branches of the Decision tree
To build the remaining branches of the tree we follow the same procedure used to determine the
root of the tree.

For the sunny branch, error values are calculated for the informative variables temperature,
humidity, and windy. The error table is shown below.

Attribute     Rules          Error   Total Errors
Temperature   Hot → No       0/2     1/5
              Mild → No      1/2
              Cool → Yes     0/1
Humidity      High → No      0/3     0/5
              Normal → Yes   0/2
Windy         False → No     1/3     2/5
              True → Yes     1/2

The informative variable humidity shows the least error, namely zero errors; the other two
variables have non-zero errors. Therefore, the Outlook = sunny branch will use Humidity as
the next splitting variable.
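
The same counting can be reproduced programmatically with a short sketch (again assuming the
gamedf data frame built later in this lecture): restrict the data to the instances the sunny
branch inherits and recount the errors of each remaining attribute.

# keep only the instances with Outlook = Sunny, then total the errors per attribute
sunny <- gamedf[gamedf$Outlook == "Sunny", ]
sapply(sunny[c("Temperature", "Humidity", "Windy")], function(a) {
  counts <- table(a, sunny$Play)
  sum(apply(counts, 1, function(r) sum(r) - max(r)))
})
# Expected totals (out of 5): Temperature 1, Humidity 0, Windy 2
# Repeating this with Outlook == "Rainy" reproduces the table on the next slide.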

18
Determining the remaining branches of the Decision tree

For the rainy branch, error values are calculated for the informative variables temperature,
humidity, and windy. The error table is shown below.

Attribute     Rules          Error   Total Errors
Temperature   Mild → Yes     1/3     2/5
              Cool → Yes     1/2
Humidity      High → No      1/2     2/5
              Normal → Yes   1/3
Windy         False → Yes    0/3     0/5
              True → No      0/2

The informative variable windy shows the least error, namely zero errors; the other two
variables have non-zero errors. Therefore, the Outlook = rainy branch will use Windy as
the next splitting variable.

19
Final Decision Tree

Outlook
  = Sunny    → Humidity
                 = High   → No (3.0)
                 = Normal → Yes (2.0)
  = Overcast → YES
  = Rainy    → Windy
                 = True   → No (2.0)
                 = False  → Yes (3.0)
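
The construction just carried out by hand is exactly the recursive, top-down, greedy procedure
described on slide 8. A rough R sketch of it is shown below (not from the lecture; it assumes
the gamedf data frame built later, and it breaks ties simply by attribute order rather than by
the “pure subtree” heuristic used above):

# Print a tree: at every node pick the attribute whose majority-vote rules make the
# fewest errors, split on it, and recurse until the node is pure or no attributes remain.
build_tree <- function(data, target, attributes, depth = 0) {
  pad    <- paste(rep("  ", depth), collapse = "")
  labels <- data[[target]]
  if (length(unique(labels)) == 1 || length(attributes) == 0) {
    cat(pad, "-> ", names(which.max(table(labels))), "\n", sep = "")  # leaf: majority class
    return(invisible(NULL))
  }
  errors <- sapply(attributes, function(a) {                          # greedy step
    counts <- table(data[[a]], labels)
    sum(apply(counts, 1, function(r) sum(r) - max(r)))
  })
  best <- attributes[which.min(errors)]
  cat(pad, "split on ", best, "\n", sep = "")
  for (v in unique(as.character(data[[best]]))) {                     # recursive step
    cat(pad, "  ", best, " = ", v, "\n", sep = "")
    build_tree(data[data[[best]] == v, , drop = FALSE],
               target, setdiff(attributes, best), depth + 1)
  }
}

# Example call:
# build_tree(gamedf, "Play", c("Outlook", "Temperature", "Humidity", "Windy"))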

20
Using the Decision tree to Solve the Current Problem

The new instance to classify:

Outlook  Temperature  Humidity  Windy  Play
Sunny    Hot          Normal    True   ??

Following the tree: Outlook = Sunny → Humidity = Normal → Yes, so the game should be played.

Outlook
  = Sunny    → Humidity
                 = High   → No (3.0)
                 = Normal → Yes (2.0)
  = Overcast → YES
  = Rainy    → Windy
                 = True   → No (2.0)
                 = False  → Yes (3.0)
21
Trees as Sets of Rules

• IF (Employed = Yes) THEN Class=No Write-off

• IF (Employed = No) AND (Balance < 50k) THEN Class=No Write-off

• IF (Employed = No) AND (Balance ≥ 50k) AND (Age < 45) THEN Class=No Write-off

• IF (Employed = No) AND (Balance ≥ 50k) AND (Age ≥ 45) THEN Class=Write-off

22
What are the rules associated with this tree?

Outlook
  = Sunny    → Humidity
                 = High   → No (3.0)
                 = Normal → Yes (2.0)
  = Overcast → YES
  = Rainy    → Windy
                 = True   → No (2.0)
                 = False  → Yes (3.0)
23
MegaTelCo: Predicting Churn with Tree Induction

24
What are we predicting?
[Figure: a scatter plot of Age versus Income (with reference lines at Age = 45 and Income = 50K)
marking who bought life insurance and who did not, shown next to the corresponding classification
tree. The tree first splits on Income: for Income < 50K, p(LI) = 0.15; for Income >= 50K it splits
on Age, giving p(LI) = 0.43 for Age < 45 and p(LI) = 0.83 for Age >= 45. A new instance marked "?"
is classified, and the slide's answer is "Interested in LI? = No".]


25
What are we predicting?
[Figure: the same scatter plot and classification tree as on the previous slide; here the reported
answer for the new instance is "Interested in LI? = 3/7", a proportion rather than a hard Yes/No
answer.]


26
MegaTelCo: Predicting Churn with Tree Induction

27
MegaTelCo: Predicting Churn with Tree Induction

28
How to Create a Decision Tree in R

To create a decision tree we will need to create a data frame, which is the most common
data structure in R.

A data frame is created with the data.frame( ) function.

A data frame object is created as follows:

mydataframe <- data.frame(colvector1, colvector2, ..., colvectorn)

where each colvector is a vector (character, numeric, or logical, for example) corresponding to
a column of the table that contains the historical data.

29
Creating a Data frame in R

Creating the column vectors from the historical table:

Outlook    Temperature  Humidity  Windy  Play
Sunny      Hot          High      False  No
Sunny      Hot          High      True   No
Overcast   Hot          High      False  Yes
Rainy      Mild         High      False  Yes
Rainy      Cool         Normal    False  Yes
Rainy      Cool         Normal    True   No
Overcast   Cool         Normal    True   Yes
Sunny      Mild         High      False  No
Sunny      Cool         Normal    False  Yes
Rainy      Mild         Normal    False  Yes
Sunny      Mild         Normal    True   Yes
Overcast   Mild         High      True   Yes
Overcast   Hot          Normal    False  Yes
Rainy      Mild         High      True   No

Outlook     <- c("Sunny", "Sunny", ..., "Rainy")
Temperature <- c("Hot", "Hot", ..., "Mild")
Humidity    <- c("High", "High", ..., "High")
Windy       <- c(FALSE, TRUE, ..., TRUE)
Play        <- c("No", "No", ..., "No")

30
Creating a Data frame
(continuation)

Outlook     <- c("Sunny", "Sunny", ..., "Rainy")

Temperature <- c("Hot", "Hot", ..., "Mild")

Humidity    <- c("High", "High", ..., "High")

Windy       <- c(FALSE, TRUE, ..., TRUE)

Play        <- c("No", "No", ..., "No")

gamedf <- data.frame(Outlook, Temperature, Humidity, Windy, Play)
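
For reference, a fully written-out version of these vectors, filled in directly from the
historical table (this completed listing is a sketch, not part of the original slides):

Outlook     <- c("Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Rainy", "Overcast",
                 "Sunny", "Sunny", "Rainy", "Sunny", "Overcast", "Overcast", "Rainy")
Temperature <- c("Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                 "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild")
Humidity    <- c("High", "High", "High", "High", "Normal", "Normal", "Normal",
                 "High", "Normal", "Normal", "Normal", "High", "Normal", "High")
Windy       <- c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE,
                 FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE)
Play        <- c("No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                 "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No")
gamedf      <- data.frame(Outlook, Temperature, Humidity, Windy, Play)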

31
Checking the Structure of a data frame

Examine the structure of the data frame using the function

> str(gamedf)
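
With the R versions current at the time of this lecture, character columns are converted to
factors by default, so the output should look roughly as follows (the exact display depends on
your R version, and the value codes are elided here):

'data.frame':   14 obs. of  5 variables:
 $ Outlook    : Factor w/ 3 levels "Overcast","Rainy",..: ...
 $ Temperature: Factor w/ 3 levels "Cool","Hot",..: ...
 $ Humidity   : Factor w/ 2 levels "High","Normal": ...
 $ Windy      : logi  FALSE TRUE FALSE FALSE FALSE TRUE ...
 $ Play       : Factor w/ 2 levels "No","Yes": ...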

32
Taking a look at what the data frame looks like

To see what the data frame looks like, just type the name of the data frame at the
console.
> gamedf

33
Using the Recursive Partitioning and Regression Trees Package
“rpart”

1) Install the package using

> install.packages("rpart")

2)Load the package using

> require("rpart")
or
> library("rpart")

34
Using the Recursive Partitioning and Regression Trees Package
“rpart”
# grow tree

> fit <- rpart(Play ~ Outlook + Temperature + Humidity, method = "class", data = gamedf)

#display the results

> printcp(fit)

#detailed summary of splits

> summary(fit)

#plot tree

> plot(fit, uniform = TRUE, main = "Classification Tree for gamedf")
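
A practical note (not from the slides): with only 14 rows, rpart's default stopping rules
(minsplit = 20) usually refuse to make any split, leaving a single-node tree. The sketch below
loosens those rules, labels the plot, and classifies the new day from the scenario; the control
values and the object name newday are illustrative assumptions, and the formula here also
includes Windy, which the slide's version omits.

# loosen the stopping rules so a tree can grow on 14 instances
> fit <- rpart(Play ~ Outlook + Temperature + Humidity + Windy,
               method = "class", data = gamedf,
               control = rpart.control(minsplit = 2, cp = 0))

# plot the tree and label its splits and leaves
> plot(fit, uniform = TRUE, main = "Classification Tree for gamedf")
> text(fit, use.n = TRUE)

# classify the new day (Sunny, Hot, Normal, True); compare the result with the
# manual answer (Yes) obtained earlier in the lecture
> newday <- data.frame(Outlook = "Sunny", Temperature = "Hot",
                       Humidity = "Normal", Windy = TRUE)
> predict(fit, newday, type = "class")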


35
Using the “Fit a Classification or Regression Tree” Package
“tree”

1) Install the package using

> install.packages("tree")

2)Load the package using

> require("tree")
or
> library("tree")

36
Using the “Fit a Classification or Regression Tree” Package
“tree”
# grow tree

> tr <- tree(Play ~ Outlook + Temperature + Humidity, data = gamedf)

# see summary

> summary(tr)

#plot tree

> plot(tr)
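
A brief note (not from the slides): tree() expects the class label to be a factor and, like
rpart, has stopping rules (minsize = 10 by default) that may prevent any split on only 14 rows.
Converting Play first (needed if it was read in as plain character text) and loosening the
control settings, as sketched below with illustrative values, is one way to experiment; text()
adds the split labels to the plot.

> gamedf$Play <- factor(gamedf$Play)
> tr <- tree(Play ~ Outlook + Temperature + Humidity, data = gamedf,
             control = tree.control(nobs = nrow(gamedf), mincut = 1,
                                    minsize = 2, mindev = 0))
> plot(tr)
> text(tr)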

37
Using the R Package “A Laboratory for Recursive Partytioning”
“party”

1) Install the package using

> install.packages("party")

2)Load the package using

> require("party")
or
> library("party")

38
Using the R Package “A Laboratory for Recursive Partytioning”
“party”
# grow tree

> ct <- ctree(Play ~ Outlook + Temperature + Humidity, data = gamedf)

# see summary

> summary(ct)

#plot tree

> plot(ct)
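
A brief note (not from the slides): for ctree objects, print(ct) is the usual way to inspect
the fitted tree, and plot(ct) already draws a fully annotated diagram. With only 14 rows,
ctree's default significance-based stopping criterion may decline to split at all; it can be
relaxed through the controls argument, for example (illustrative values):

> print(ct)
> ct <- ctree(Play ~ Outlook + Temperature + Humidity, data = gamedf,
              controls = ctree_control(mincriterion = 0.5, minsplit = 2, minbucket = 1))
> plot(ct)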

39
Entering Data into a Data Frame via Keyboard

It is possible to create an “empty” data frame and then fill it in via the built-in editor of
R.

Example (of empty data frame)

mydataframe <- data.frame(name = character(0), age = numeric(0),
                          gender = character(0), smoker = logical(0))

An assignment like name = character(0) creates a variable of mode character that contains
no data.

40
Entering Data into a Data Frame via Keyboard

The built-in edit() function in R brings up the editor so the user can fill in the variables of
the data frame.

mydataframe <- edit(mydataframe) invokes the editor so the user can fill in each one of the
variables of mydataframe.

Notice how the result of the editing is assigned back to the data frame object.

41
Central Tendency and Variability

A probability distribution describes the possible outcomes of an event and how likely
each of those outcomes is.

What do we need to know about a distribution?


• Where is its center?
• How “fat” or “thin” is it?
• What are its other characteristics (is it skewed? does it have a “hump”?)

“Central tendency” is a euphemism for “if you have a bunch of values, what is in the
middle (and what does the middle even mean)?”
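
As a quick preview (not from the slides), R's built-in functions answer these questions
directly; a minimal sketch on some made-up values:

> x <- c(2, 3, 3, 4, 5, 9)    # hypothetical sample
> mean(x)                     # center: arithmetic mean
> median(x)                   # center: middle value
> sd(x)                       # spread: standard deviation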

42
Fun Fact

• Mr. & Mrs. Doe (http://en.wikipedia.org/wiki/John_Doe)


• “John Doe” for males,
• “Jane Doe” for females, and
• “Jonnie Doe” and “Janie Doe” for children.

43
Questions?

44
