We are going to use the iris dataset for our study here. The data set gives
the measurements in centimeters of the variables sepal length and width
and petal length and width, respectively, for 50 flowers from each of 3
species of iris. The species are Iris setosa, versicolor, and virginica.
We will create a model that can predict the species of a flower based on
the other parameters.
> library(party)
> head(iris)
> set.seed(1234)   # seed chosen arbitrarily, for a reproducible split
> sampleIndex=sample(1:nrow(iris),120)   # 120 training rows, 30 test rows
> trainIris=iris[sampleIndex,]
> testIris=iris[-sampleIndex,]
> iris_ctree=ctree(Species~.,data=trainIris)
> iris_ctree
Response: Species
2)* weights = 35
5)* weights = 41
6)* weights = 7
7)* weights = 37
> plot(iris_ctree)
One of the reasons decision trees are used so widely is their easy
interpretability. The plot created after a successful classification can
be read and understood easily by almost anyone, which is why it is popular
with top-level executives such as CEOs and CFOs.
Each node partitions the data into two segments based on the decision
parameter indicated in the node, along with its p-value.
In the square boxes, the bar graph shows the probability that a record
falling into that group is setosa, versicolor, or virginica.
> PredictIris=predict(iris_ctree,newdata=testIris)
> FinalResult=data.frame(testIris,PredictIris)
> table(FinalResult$Species,FinalResult$PredictIris)

             setosa versicolor virginica
  setosa         15          0         0
  versicolor      0          5         0
  virginica       0          1         9
> 29/30
[1] 0.9666667
Our model predicted the values very accurately! Note, however, that this
high accuracy is also partly due to the small number of test observations.
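The accuracy figure above can also be computed directly from the confusion matrix, rather than by hand. A minimal sketch in base R that rebuilds the matrix from the counts shown above (the row and column labels are assumptions matching the printed table):

```r
# Confusion matrix rebuilt from the counts shown above
# (rows = actual species, columns = predicted species)
conf <- matrix(c(15, 0, 0,
                  0, 5, 0,
                  0, 1, 9),
               nrow = 3, byrow = TRUE,
               dimnames = list(actual    = c("setosa", "versicolor", "virginica"),
                               predicted = c("setosa", "versicolor", "virginica")))

# Accuracy = correctly classified records (diagonal) / total records
accuracy <- sum(diag(conf)) / sum(conf)
accuracy   # 29/30 = 0.9666667
```

The same one-liner works on any confusion matrix produced by `table()`, whatever the number of classes.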
Factor Analysis (Principal Component Analysis)
Suppose we are interested in consumers' evaluation of a brand of coffee.
We take a random sample of consumers, each of whom was given a cup of
coffee. They were not told which brand of coffee they were given. After
they had drunk the coffee, they were asked to rate it on 14 semantic-
differential scales. The 14 attributes that were investigated are shown
below:
11. Hearty, full-bodied, full flavor vs. warm, thin, empty flavor
Here we are only exploring the factors; we cannot confirm whether these
are the only factors, hence the name Exploratory Factor Analysis.
Let's try PCA on a dataset to understand it better. We are going to use
the attitude dataset. It contains data from a survey of the clerical
employees of a large financial organization; the data are aggregated from
the questionnaires of approximately 35 employees in each of 30 randomly
selected departments. The numbers give the percentage of favorable
responses to seven questions in each department.
> str(attitude)
> PComps=princomp(attitude,cor=T)
> summary(PComps)
Importance of components:
> screeplot(PComps,type="line")
By looking at the scree plot we can decide how many principal components
or factors to retain. The factors below the elbow explain very little of
the variance, as their eigenvalues are less than 1; hence just 4 factors,
rather than all 7 variables, can explain most (around 90%) of the
variance.
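The "eigenvalue less than 1" cut-off mentioned above (the Kaiser criterion) can be checked directly, since the scree plot is just a plot of the eigenvalues of the correlation matrix. A minimal sketch in base R, using the same built-in attitude dataset:

```r
# Eigenvalues of the correlation matrix of the 7 survey questions;
# princomp(..., cor = TRUE) analyzes this same matrix internally
ev <- eigen(cor(attitude))$values

ev                    # eigenvalues, in decreasing order
cumsum(ev) / sum(ev)  # cumulative proportion of variance explained
```

A component with an eigenvalue above 1 explains more variance than any single standardized variable does on its own, which is why it is considered worth retaining.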
Cluster Analysis
Previously we saw how to group variables using factor analysis. Now let's
study how to group similar observations and make clusters of the data.
As an example, suppose you have done pilot marketing of a candy on a
randomly selected sample of consumers. Each consumer was given a candy
and was asked whether they liked it and whether they would buy it. After
clustering the survey data, the respondents were grouped into the
following four clusters:
The NOT LIKED, WILL BUY group is a bit unusual, but people may buy for
others. From a strategy point of view, the LIKED, WILL NOT BUY group is
important, because these are potential customers: a possible change in the
pricing policy may change their purchasing decision.
Let's study this on the iris dataset, where we will first remove the
category column and then group the data based on the observation values.
> newIris=iris[,-5]
> head(newIris)
> clusterIris=kmeans(newIris,3)
> clusterIris
Cluster means:
Clustering vector:
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1
[52] 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1
[103] 3 3 3 3 1 3 3 3 3 3 3 1 1 3 3 3 3 1 3 1 3 1 3 3 1 1 3 3 3 3 3 1 3 3 3 3 1 3 3 3 1 3 3 3 1 3 3 1
Within cluster sum of squares by cluster:
Available components:
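Because kmeans() starts from randomly chosen centers, rerunning the code above can give different (and occasionally worse) clusterings. A reproducible sketch, with an arbitrary seed, multiple restarts via nstart, and a cross-tabulation of the clusters against the true species:

```r
set.seed(42)                                 # seed value chosen arbitrarily
newIris <- iris[, -5]                        # drop the Species column
clusterIris <- kmeans(newIris, centers = 3, nstart = 25)

# Compare the discovered clusters with the actual species labels;
# cluster numbers are arbitrary, so only the cross-tabulation pattern matters
table(iris$Species, clusterIris$cluster)
```

With nstart = 25 the algorithm keeps the best of 25 random starts, which makes it much less likely to get stuck in a poor local optimum.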