
Decision Trees in R

Arko Barman
With additions and modifications by Ch. Eick
COSC 4335 Data Mining
Example of a Decision Tree

Training Data:

  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

  Refund = Yes               -> NO
  Refund = No                -> test MarSt
    MarSt = Married          -> NO
    MarSt = Single, Divorced -> test TaxInc
      TaxInc < 80K           -> NO
      TaxInc > 80K           -> YES


Another Example of Decision Tree

Same training data as on the previous slide (Tid, Refund, Marital Status, Taxable Income, Cheat).

Model: Decision Tree

  MarSt = Married            -> NO
  MarSt = Single, Divorced   -> test Refund
    Refund = Yes             -> NO
    Refund = No              -> test TaxInc
      TaxInc < 80K           -> NO
      TaxInc > 80K           -> YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

Training Set:

  Tid  Attrib1  Attrib2  Attrib3  Class
  1    Yes      Large    125K     No
  2    No       Medium   100K     No
  3    No       Small    70K      No
  4    Yes      Medium   120K     No
  5    No       Large    95K      Yes
  6    No       Medium   60K      No
  7    Yes      Large    220K     No
  8    No       Small    85K      Yes
  9    No       Medium   75K      No
  10   No       Small    90K      Yes

Induction: a tree induction algorithm learns a model (the decision tree) from the Training Set.

Test Set:

  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small    55K      ?
  12   Yes      Medium   80K      ?
  13   Yes      Large    110K     ?
  14   No       Small    95K      ?
  15   No       Large    67K      ?

Deduction: the learned model is applied to the Test Set to predict the missing class labels.
Apply Model to Test Data

Test Data:

  Refund  Marital Status  Taxable Income  Cheat
  No      Married         80K             ?

Start from the root of the tree and follow the branch that matches the test record at each node:

  Refund = No     -> go to MarSt
  MarSt = Married -> leaf NO

The record is assigned Cheat = No.
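To make the traversal concrete, here is a small added sketch (not from the original slides) that hand-codes the tree above as nested conditionals; classify is a hypothetical helper name:

# The tree above written out as nested conditionals (illustrative only)
classify <- function(Refund, MarSt, TaxInc) {
  if (Refund == "Yes")    return("No")   # Refund = Yes -> NO
  if (MarSt == "Married") return("No")   # Refund = No, Married -> NO
  if (TaxInc < 80)        return("No")   # Single/Divorced, TaxInc < 80K -> NO
  "Yes"                                  # Single/Divorced, TaxInc >= 80K -> YES
}
classify(Refund = "No", MarSt = "Married", TaxInc = 80)   # "No"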
Decision Trees
• Classify data by recursively partitioning the attribute space
• Find axis-parallel decision boundaries that optimize a specified criterion
• Leaf nodes contain class labels, representing classification decisions
• Nodes keep being split as long as the split criterion improves, e.g. the Gini index or information gain, an entropy-based measure (see the sketch below)
• Pruning is necessary to avoid overfitting
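For concreteness, here is a small added sketch of the two impurity measures in R; gini and entropy are hypothetical helper names, not part of the slides' code:

# Impurity of a vector of class labels
gini <- function(y) {
  p <- table(y) / length(y)   # class proportions
  1 - sum(p^2)                # Gini index: 1 - sum_i p_i^2
}
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]               # skip empty classes so log2(0) never occurs
  -sum(p * log2(p))           # entropy: -sum_i p_i * log2(p_i)
}
gini(iris$Species)      # 2/3 for three equally frequent classes
entropy(iris$Species)   # log2(3) ~ 1.585 for three equally frequent classes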
Decision Trees in R

mydata <- data.frame(iris)   # the built-in iris data set
attach(mydata)

library(rpart)
# Fit a classification tree for Species using all four measurements
model <- rpart(Species ~ Sepal.Length + Sepal.Width +
                 Petal.Length + Petal.Width,
               data = mydata,
               method = "class")
plot(model)                                        # draw the tree
text(model, use.n = TRUE, all = TRUE, cex = 0.8)   # label the nodes
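A quick sanity check on the fitted tree (an addition, not on the original slide) is to predict on the training data and tabulate predictions against the true species:

pred <- predict(model, mydata, type = "class")
table(predicted = pred, actual = mydata$Species)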
Decision Trees in R

library(tree)
# tree() infers a classification tree from the factor response Species;
# split = "gini" selects the Gini splitting criterion (default: "deviance")
model1 <- tree(Species ~ Sepal.Length + Sepal.Width +
                 Petal.Length + Petal.Width,
               data = mydata,
               split = "gini")
plot(model1)
text(model1, all = TRUE, cex = 0.6)
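Since pruning is needed to avoid overfitting, here is a minimal added sketch using the tree package's cv.tree() and prune.misclass() to pick a tree size by cross-validation:

cv <- cv.tree(model1, FUN = prune.misclass)   # CV error for each tree size
best.size <- cv$size[which.min(cv$dev)]       # size with the lowest CV error
pruned <- prune.misclass(model1, best = best.size)
plot(pruned)
text(pruned, all = TRUE, cex = 0.6)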
Decision Trees in R

library(party)
# ctree() fits a conditional inference tree; classification is
# inferred from the factor response
model2 <- ctree(Species ~ Sepal.Length + Sepal.Width +
                  Petal.Length + Petal.Width,
                data = mydata)
plot(model2)
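As with the rpart model, the ctree fit can be checked (an added example) by tabulating its predictions on the training data:

table(predicted = predict(model2), actual = mydata$Species)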
Controlling number of nodes

This is just an example. You can come up with better or more efficient methods!

library(tree)
mydata <- data.frame(iris)
attach(mydata)
# mincut = 10: each child node must contain at least 10 observations
model1 <- tree(Species ~ Sepal.Length + Sepal.Width +
                 Petal.Length + Petal.Width,
               data = mydata,
               control = tree.control(nobs = 150, mincut = 10))
plot(model1)
text(model1, all = TRUE, cex = 0.6)
predict(model1, iris)

Note how the number of nodes is reduced by increasing the minimum number of observations in a child node!
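To verify the effect of mincut (an added check), summary() on a tree object reports the number of terminal nodes and the training misclassification rate:

summary(model1)   # reports "Number of terminal nodes" and
                  # "Misclassification error rate"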
Controlling number of nodes

This is just an example. You can come up with better or more efficient methods!

# maxdepth = 2 caps the tree at two levels of splits below the root
model2 <- ctree(Species ~ Sepal.Length + Sepal.Width +
                  Petal.Length + Petal.Width,
                data = mydata,
                controls = ctree_control(maxdepth = 2))
plot(model2)

Note that setting the maximum depth to 2 has reduced the number of nodes!
http://data.princeton.edu/R/linearmodels.html

Linear Models in R
• abline() – adds one or more straight lines to a plot
• lm() – fits a linear regression model

x1 <- c(1:5, 1:3)
x2 <- c(2, 2, 2, 3, 6, 7, 5, 1)
plot(x1, x2)              # draw the scatterplot first, so that abline()
                          # has an open plot to add lines to
abline(lm(x2 ~ x1))       # add the fitted regression line
title('Regression of x2 on x1')
s <- lm(x2 ~ x1)          # store the fitted model
lm(x1 ~ x2)               # the regression the other way around
abline(1, 2)              # a line with intercept 1 and slope 2
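The stored model s can be inspected with the usual accessors (an added illustration):

coef(s)        # intercept and slope of x2 ~ x1
summary(s)     # coefficients, standard errors, R-squared
fitted(s)      # fitted values at each x1
residuals(s)   # observed minus fitted x2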
Scaling and Z-Scoring Datasets
• http://stat.ethz.ch/R-manual/R-patched/library/base/html/scale.html

s <- scale(iris[1:4])   # z-score: subtract each column mean, divide by its sd
mean(s[,1])             # approximately 0
sd(s[,1])               # exactly 1
t <- scale(s, center = c(5,5,5,5), scale = FALSE)
# together with the first call, this subtracts the mean vector and
# additionally (5,5,5,5); it does not divide by the standard deviation
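A hand-rolled z-score (added for illustration) confirms what scale() computes by default, using the scaled matrix s from above:

z <- (iris$Sepal.Length - mean(iris$Sepal.Length)) / sd(iris$Sepal.Length)
all.equal(as.numeric(s[,1]), z)   # TRUE: scale() centers by the mean and
                                  # divides by the sample sd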
• https://archive.ics.uci.edu/ml/datasets/banknote+authentication
