
Decision Trees in R

Arko Barman
With additions and modifications by Ch. Eick
COSC 4335 Data Mining
Example of a Decision Tree

Training Data:

  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

  Refund = Yes               -> NO
  Refund = No                -> test MarSt
    MarSt = Married          -> NO
    MarSt = Single, Divorced -> test TaxInc
      TaxInc < 80K           -> NO
      TaxInc > 80K           -> YES


Another Example of Decision Tree

Same training data as on the previous slide (Tid, Refund, Marital Status, Taxable Income, Cheat).

Model: Decision Tree

  MarSt = Married            -> NO
  MarSt = Single, Divorced   -> test Refund
    Refund = Yes             -> NO
    Refund = No              -> test TaxInc
      TaxInc < 80K           -> NO
      TaxInc > 80K           -> YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

Training Set:

  Tid  Attrib1  Attrib2  Attrib3  Class
  1    Yes      Large    125K     No
  2    No       Medium   100K     No
  3    No       Small    70K      No
  4    Yes      Medium   120K     No
  5    No       Large    95K      Yes
  6    No       Medium   60K      No
  7    Yes      Large    220K     No
  8    No       Small    85K      Yes
  9    No       Medium   75K      No
  10   No       Small    90K      Yes

Induction: a tree induction algorithm learns a model (the decision tree) from the Training Set.

Test Set:

  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small    55K      ?
  12   Yes      Medium   80K      ?
  13   Yes      Large    110K     ?
  14   No       Small    95K      ?
  15   No       Large    67K      ?

Deduction: the learned model is applied to the Test Set to predict the missing class labels.
Apply Model to Test Data

Test Data:

  Refund  Marital Status  Taxable Income  Cheat
  No      Married         80K             ?

Start from the root of the tree and follow the branch that matches the test record at each node:

  Refund = No     -> go to MarSt
  MarSt = Married -> leaf NO

The record is assigned Cheat = No.
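To make the traversal concrete, here is a small added sketch (not from the original slides) that hand-codes the tree above as nested conditionals; classify is a hypothetical helper name:

# The tree above written out as nested conditionals (illustrative only)
classify <- function(Refund, MarSt, TaxInc) {
  if (Refund == "Yes")    return("No")   # Refund = Yes -> NO
  if (MarSt == "Married") return("No")   # Refund = No, Married -> NO
  if (TaxInc < 80)        return("No")   # Single/Divorced, TaxInc < 80K -> NO
  "Yes"                                  # Single/Divorced, TaxInc >= 80K -> YES
}
classify(Refund = "No", MarSt = "Married", TaxInc = 80)   # "No"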
Decision Trees
• Classify data by recursively partitioning the attribute space
• Find axis-parallel decision boundaries that optimize a specified criterion
• Leaf nodes contain class labels, representing classification decisions
• Nodes keep being split as long as the split criterion improves, e.g. the Gini index or information gain, an entropy-based measure (see the sketch below)
• Pruning is necessary to avoid overfitting
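For concreteness, here is a small added sketch of the two impurity measures in R; gini and entropy are hypothetical helper names, not part of the slides' code:

# Impurity of a vector of class labels
gini <- function(y) {
  p <- table(y) / length(y)   # class proportions
  1 - sum(p^2)                # Gini index: 1 - sum_i p_i^2
}
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]               # skip empty classes so log2(0) never occurs
  -sum(p * log2(p))           # entropy: -sum_i p_i * log2(p_i)
}
gini(iris$Species)      # 2/3 for three equally frequent classes
entropy(iris$Species)   # log2(3) ~ 1.585 for three equally frequent classes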
Decision Trees in R

mydata <- data.frame(iris)   # the built-in iris data set
attach(mydata)

library(rpart)
# Fit a classification tree for Species using all four measurements
model <- rpart(Species ~ Sepal.Length + Sepal.Width +
                 Petal.Length + Petal.Width,
               data = mydata,
               method = "class")
plot(model)                                        # draw the tree
text(model, use.n = TRUE, all = TRUE, cex = 0.8)   # label the nodes
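A quick sanity check on the fitted tree (an addition, not on the original slide) is to predict on the training data and tabulate predictions against the true species:

pred <- predict(model, mydata, type = "class")
table(predicted = pred, actual = mydata$Species)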
Decision Trees in R

library(tree)
# tree() infers a classification tree from the factor response Species;
# split = "gini" selects the Gini splitting criterion (default: "deviance")
model1 <- tree(Species ~ Sepal.Length + Sepal.Width +
                 Petal.Length + Petal.Width,
               data = mydata,
               split = "gini")
plot(model1)
text(model1, all = TRUE, cex = 0.6)
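Since pruning is needed to avoid overfitting, here is a minimal added sketch using the tree package's cv.tree() and prune.misclass() to pick a tree size by cross-validation:

cv <- cv.tree(model1, FUN = prune.misclass)   # CV error for each tree size
best.size <- cv$size[which.min(cv$dev)]       # size with the lowest CV error
pruned <- prune.misclass(model1, best = best.size)
plot(pruned)
text(pruned, all = TRUE, cex = 0.6)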
Decision Trees in R

library(party)
# ctree() fits a conditional inference tree; classification is
# inferred from the factor response
model2 <- ctree(Species ~ Sepal.Length + Sepal.Width +
                  Petal.Length + Petal.Width,
                data = mydata)
plot(model2)
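As with the rpart model, the ctree fit can be checked (an added example) by tabulating its predictions on the training data:

table(predicted = predict(model2), actual = mydata$Species)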
Controlling number of nodes

This is just an example. You can come up with better or more efficient methods!

library(tree)
mydata <- data.frame(iris)
attach(mydata)
# mincut = 10: each child node must contain at least 10 observations
model1 <- tree(Species ~ Sepal.Length + Sepal.Width +
                 Petal.Length + Petal.Width,
               data = mydata,
               control = tree.control(nobs = 150, mincut = 10))
plot(model1)
text(model1, all = TRUE, cex = 0.6)
predict(model1, iris)

Note how the number of nodes is reduced by increasing the minimum number of observations in a child node!
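To verify the effect of mincut (an added check), summary() on a tree object reports the number of terminal nodes and the training misclassification rate:

summary(model1)   # reports "Number of terminal nodes" and
                  # "Misclassification error rate"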
Controlling number of nodes

This is just an example. You can come up with better or more efficient methods!

# maxdepth = 2 caps the tree at two levels of splits below the root
model2 <- ctree(Species ~ Sepal.Length + Sepal.Width +
                  Petal.Length + Petal.Width,
                data = mydata,
                controls = ctree_control(maxdepth = 2))
plot(model2)

Note that setting the maximum depth to 2 has reduced the number of nodes!
http://data.princeton.edu/R/linearmodels.html

Linear Models in R
• abline() – adds one or more straight lines to a plot
• lm() – fits a linear regression model

x1 <- c(1:5, 1:3)
x2 <- c(2, 2, 2, 3, 6, 7, 5, 1)
plot(x1, x2)              # draw the scatterplot first, so that abline()
                          # has an open plot to add lines to
abline(lm(x2 ~ x1))       # add the fitted regression line
title('Regression of x2 on x1')
s <- lm(x2 ~ x1)          # store the fitted model
lm(x1 ~ x2)               # the regression the other way around
abline(1, 2)              # a line with intercept 1 and slope 2
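The stored model s can be inspected with the usual accessors (an added illustration):

coef(s)        # intercept and slope of x2 ~ x1
summary(s)     # coefficients, standard errors, R-squared
fitted(s)      # fitted values at each x1
residuals(s)   # observed minus fitted x2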
Scaling and Z-Scoring Datasets
• http://stat.ethz.ch/R-manual/R-patched/library/base/html/scale.html

s <- scale(iris[1:4])   # z-score: subtract each column mean, divide by its sd
mean(s[,1])             # approximately 0
sd(s[,1])               # exactly 1
t <- scale(s, center = c(5,5,5,5), scale = FALSE)
# together with the first call, this subtracts the mean vector and
# additionally (5,5,5,5); it does not divide by the standard deviation
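A hand-rolled z-score (added for illustration) confirms what scale() computes by default, using the scaled matrix s from above:

z <- (iris$Sepal.Length - mean(iris$Sepal.Length)) / sd(iris$Sepal.Length)
all.equal(as.numeric(s[,1]), z)   # TRUE: scale() centers by the mean and
                                  # divides by the sample sd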
• https://archive.ics.uci.edu/ml/datasets/banknote+authentication
