
Classification using R

Decision Trees


Decision trees are one of the most widely used techniques for classifying data into more than two categories. They use a statistical method called recursive partitioning to create classes for the data.

We are going to use the iris dataset for our study here. The data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

We will create a model that can predict the species based on the other parameters.
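Because the train/test split below is drawn at random, the exact tree and accuracy will vary from run to run. A minimal sketch, assuming we want reproducible results, is to fix the random seed before sampling (the value 42 is arbitrary):

> set.seed(42)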

> library(party)

> head(iris)

> sampleIndex=sample(1:nrow(iris),size=120, replace=FALSE)

> trainIris=iris[sampleIndex,]

> testIris=iris[-sampleIndex,]

> iris_ctree = ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=trainIris)

> iris_ctree

Conditional inference tree with 4 terminal nodes

Response: Species

Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width

Number of observations: 120


1) Petal.Length <= 1.9; criterion = 1, statistic = 110.561
  2)*  weights = 35
1) Petal.Length > 1.9
  3) Petal.Width <= 1.7; criterion = 1, statistic = 56.979
    4) Petal.Length <= 4.8; criterion = 0.988, statistic = 8.726
      5)*  weights = 41
    4) Petal.Length > 4.8
      6)*  weights = 7
  3) Petal.Width > 1.7
    7)*  weights = 37

> plot(iris_ctree)
One of the reasons decision trees are used so widely is their easy interpretability. The plot created after a successful classification can be read and understood easily by anyone, which is why it is popular with top-level executives such as CEOs and CFOs.

Each node partitions the data into two segments based on the decision variable indicated in the node, shown together with its p-value.

In the terminal nodes (the square boxes), the bar graph shows the probability that a record falling into that node is setosa, versicolor or virginica.
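These per-class probabilities can also be obtained directly from the fitted tree. A minimal sketch, assuming the party package's predict() method with type="prob", which should return one vector of class probabilities per observation:

> predict(iris_ctree, newdata=testIris, type="prob")

This is useful when we want the predicted probabilities rather than just the predicted class labels.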

Now let's predict the test data values.

> PredictIris=predict(iris_ctree,newdata=testIris)

> FinalResult=data.frame(testIris,PredictIris)

> table(FinalResult$Species, FinalResult$PredictIris)

             setosa versicolor virginica
  setosa         15          0         0
  versicolor      0          5         0
  virginica       0          1         9

> 29/30

[1] 0.9666667

Our model predicted the test values very accurately: 29 of the 30 test observations were classified correctly. This high accuracy is also partly due to the small number of test observations.
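Instead of typing the ratio by hand, the accuracy can be computed from the confusion matrix itself. A minimal sketch, where confMat is simply our own name for the table created above:

> confMat=table(FinalResult$Species, FinalResult$PredictIris)

> sum(diag(confMat))/sum(confMat)

[1] 0.9666667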
Factor Analysis (Principal Component Analysis)
Suppose we are interested in consumers' evaluation of a brand of coffee. We take a random sample of consumers, each of whom was given a cup of coffee. They were not told which brand of coffee they were given. After they had drunk the coffee, they were asked to rate it on 14 semantic-differential scales. The 14 attributes which were investigated are shown below:

1. Pleasant Flavor – Unpleasant Flavor

2. Stagnant, muggy taste – Sparkling, Refreshing Taste

3. Mellow taste – Bitter taste

4. Cheap taste – Expensive taste

5. Comforting, harmonious – Irritating, discordant

6. Smooth, friendly taste – Rough, hostile taste

7. Dead, lifeless, dull taste – alive, lively, peppy taste

8. Tastes artificial – Tastes like real coffee

9. Deep distinct flavor – Shallow indistinct flavor

10. Tastes warmed over – Tastes just brewed

11. Hearty, full-bodied, full flavor – Warm, thin, empty flavor

12. Pure, clear taste – Muddy, swampy taste

13. Raw taste – Stale taste


14. Overall preference: Excellent quality – Very poor quality

A factor analysis of the ratings given by consumers indicated that four factors could summarize the 14 attributes. These factors were: comforting quality, heartiness, genuineness and freshness.

Here we are only exploring the factors; we cannot confirm whether these are the only factors, hence the name Exploratory Factor Analysis.
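Exploratory factor analysis itself is available in base R through factanal(). A minimal sketch, run here on the attitude dataset introduced just below; the choice of 2 factors is ours, purely for illustration:

> factanal(attitude, factors=2)

PCA, shown next, is a related but different technique: it builds uncorrelated linear combinations of the variables rather than modelling them with latent factors.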
Let's try PCA on a dataset to understand it better. We are going to use the attitude dataset. It contains data from a survey of the clerical employees of a large financial organization; the data are aggregated from the questionnaires of approximately 35 employees for each of 30 (randomly selected) departments. The numbers give the percent proportion of favourable responses to seven questions in each department.

> str(attitude)

> PComps=princomp(attitude,cor=T)

> summary(PComps)

Importance of components:

                          Comp.1    Comp.2    Comp.3     Comp.4     Comp.5     Comp.6     Comp.7
Standard deviation     1.9277904 1.0681395 0.9204301 0.78285992 0.56892250 0.46747256 0.37475026
Proportion of Variance 0.5309108 0.1629888 0.1210274 0.08755281 0.04623897 0.03121866 0.02006254
Cumulative Proportion  0.5309108 0.6938997 0.8149270 0.90247983 0.94871881 0.97993746 1.00000000

> screeplot(PComps,type="line")
By looking at the scree plot we can decide how many principal components (or factors) to retain. The components below the elbow explain very little of the variance, as their eigenvalues are less than 1, and hence only 4 components are needed to explain most (about 90%) of the variance instead of all 7 variables.
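The eigenvalues mentioned above are simply the squared standard deviations of the components, so the scree-plot reading can be checked numerically. A minimal sketch using the PComps object created above:

> PComps$sdev^2

> cumsum(PComps$sdev^2)/sum(PComps$sdev^2)

The first line gives the eigenvalues (the "less than 1" check), and the second reproduces the cumulative proportion of variance from the summary.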

Cluster Analysis
Previously we saw how we can group variables using factor analysis. Now let's study how we can group similar observations and make clusters of the data.

Cluster analysis groups individuals or objects into clusters so that objects in the same cluster are more similar to one another than they are to objects in other clusters. The attempt is to maximize the homogeneity of objects within the clusters while also maximizing the heterogeneity between the clusters. Like factor analysis, cluster analysis is an interdependence technique.

An example: suppose you have done pilot marketing of a candy on a randomly selected sample of consumers. Each consumer was given a candy and was asked whether they liked it and whether they would buy it. After clustering the survey data, the respondents were grouped into four clusters: LIKED, WILL BUY; LIKED, WILL NOT BUY; NOT LIKED, WILL BUY; and NOT LIKED, WILL NOT BUY.

The NOT LIKED, WILL BUY group is a bit unusual, but people may buy for others. From a strategy point of view, the LIKED, WILL NOT BUY group is important, because these respondents are potential customers; a possible change in the pricing policy may change their purchasing decision.

Let's study this on the iris dataset, where we will first remove the category column and then group the data based on the remaining observation values.

> newIris=iris[,-5]

> head(newIris)

> clusterIris=kmeans(newIris,3)

> clusterIris

K-means clustering with 3 clusters of sizes 62, 50, 38

Cluster means:

Sepal.Length Sepal.Width Petal.Length Petal.Width

1 5.901613 2.748387 4.393548 1.433871

2 5.006000 3.428000 1.462000 0.246000

3 6.850000 3.073684 5.742105 2.071053

Clustering vector:

[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1

[52] 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1

[103] 3 3 3 3 1 3 3 3 3 3 3 1 1 3 3 3 3 1 3 1 3 1 3 3 1 1 3 3 3 3 3 1 3 3 3 3 1 3 3 3 1 3 3 3 1 3 3 1
Within cluster sum of squares by cluster:

[1] 39.82097 15.15100 23.87947

(between_SS / total_SS = 88.4 %)

Available components:

[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss"


"size"

[8] "iter" "ifault"

> plot(newIris[c("Sepal.Length", "Sepal.Width")], col=clusterIris$cluster)
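Because the true species labels are known for iris, we can check how well the clusters recover them and recompute the between_SS / total_SS ratio printed above from the components of the kmeans object. A minimal sketch (note that the cluster numbers 1 to 3 are arbitrary and may be permuted on a different run):

> table(iris$Species, clusterIris$cluster)

> clusterIris$betweenss/clusterIris$totss

The second value should be about 0.884, matching the 88.4 % reported in the output.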
