Institute of Rural Management Anand: PGDM-RM41 - Term III - End Term Examination

< Business Analytics >
< 7 APRIL 2021 >
< Nidhi Tulsyan, P41088 >
ANS 1-
A)
> getwd()
[1] "C:/Users/nidhi/Documents"
> setwd("C:/Users/nidhi/Downloads")
> library(cluster)
> library(factoextra)
> library(magrittr)
> food=read.csv("food-texture.csv",header = TRUE)
> View(food)
> food=food[,-1]
> foodscale=scale(food)
#optimal number of clusters
> fviz_nbclust(foodscale,kmeans,method = "wss")
> fviz_nbclust(foodscale,kmeans,method = "silhouette")
##K-means clustering
> km.res=kmeans(foodscale,2,nstart = 25)
> fviz_cluster(km.res,data=food,ellipse.type = "convex",palette="jco",ggtheme = theme_minimal())
#PCA
> apply(food,2,mean)
> apply(food,2,var)
> pr.out=prcomp(food,scale=TRUE)
> pr.out$center
> pr.out$scale
> pr.out$rotation
> pr.var=(pr.out$sdev)^2
> pve=pr.var/sum(pr.var)
> pve
Cluster by Silhouette Method
fviz_nbclust(food,kmeans,method="silhouette")
With the within-cluster sum of squares (elbow) method, the kink in the curve occurs at k = 2, so the
optimal number of clusters is 2. The silhouette method agrees: the average silhouette width is highest
at k = 2, so the optimal number of clusters is again 2.
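The elbow reading above can be reproduced numerically. The sketch below is only an illustration: it uses hypothetical simulated data (`toy`, two well-separated groups) rather than food-texture.csv, and computes in base R the same total within-cluster sum of squares that the "wss" plot displays.

```r
# Hypothetical stand-in data (NOT the food-texture data):
# two well-separated groups of 25 points in 5 dimensions.
set.seed(42)
toy <- rbind(
  matrix(rnorm(25 * 5, mean = 0), ncol = 5),
  matrix(rnorm(25 * 5, mean = 4), ncol = 5)
)
toy_scaled <- scale(toy)

# Total within-cluster sum of squares for k = 1..6 (the "wss" curve).
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 25)$tot.withinss
})

# The elbow is where the drop in WSS from one k to the next flattens out.
drop_in_wss <- -diff(wss)
print(round(wss, 2))
print(round(drop_in_wss, 2))
```

For data with two clear groups, the largest drop is from k = 1 to k = 2 and later drops are small, which is the "kink at 2" pattern described above.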
B- Hierarchical
> res.hc=hclust(dist(foodscale),method="complete")  # foodscale assumed here; the object vish is not defined in this session
> fviz_dend(res.hc, k=2, cex=0.5, k_colors=c(1,2), color_labels_by_k = TRUE, rect=TRUE)
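As a rough companion to the dendrogram, cluster membership and cluster sizes can be read off with `cutree()`. This is a minimal sketch on hypothetical simulated data (`toy`), not on the food-texture data, using only base R.

```r
# Hypothetical stand-in data (NOT the food-texture data):
# one group of 10 points and one group of 15 points in 3 dimensions.
set.seed(1)
toy <- rbind(
  matrix(rnorm(10 * 3, mean = 0), ncol = 3),
  matrix(rnorm(15 * 3, mean = 5), ncol = 3)
)
toy_scaled <- scale(toy)

# Complete-linkage hierarchical clustering on Euclidean distances.
hc <- hclust(dist(toy_scaled), method = "complete")

# Cut the tree into 2 clusters and count the members of each.
groups <- cutree(hc, k = 2)
cluster_sizes <- table(groups)
print(cluster_sizes)
```

Per-cluster means of each variable could then be obtained with `aggregate()` on the unscaled data, grouped by `groups`.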
C- There are 2 clusters: cluster 1 has 16 observations and cluster 2 has 34.
The mean oil content is 16.51 in cluster 1 and 18.65 in cluster 2, which shows that cluster 2 pastries
contain more oil than cluster 1 pastries. People are willing to pay the price for cluster 1.
Pastries in cluster 1 score higher on density, fracture and hardness, while cluster 2 pastries are hard
but less dense and crisper.
D-
The first principal component explains 60.6% and the second principal component explains 25.91% of the
variability in the data. Cumulatively they explain 86.5% of the variability.
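The proportion of variance explained (PVE) quoted above comes from the squared standard deviations of the principal components. The sketch below repeats that calculation on R's built-in USArrests data, used here only as a stand-in because food-texture.csv is not reproduced in this document.

```r
# Stand-in data: the built-in USArrests data set (NOT the food-texture data).
pr <- prcomp(USArrests, scale. = TRUE)

# Variance of each principal component.
pr_var <- pr$sdev^2

# Proportion of variance explained by each component, and the running total.
pve <- pr_var / sum(pr_var)
cum_pve <- cumsum(pve)

print(round(pve, 4))
print(round(cum_pve, 4))
```

The cumulative figure for the first two components is the second entry of `cum_pve`, which is how the 60.6% and 25.91% above add up to about 86.5%.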
ANS 2-
A)- The scenario can be modelled as a Markov chain because a driver in a given zone has only two
options: stay in that zone or move to another zone. The probability of the driver going from one
state to another depends only on the current state and not on the previous states. The state space
of the model is the North zone, the South zone and the West zone. The driver's movement evolves over
time as a stochastic process with this memoryless property, which is what defines a Markov chain.
B)- The different states are the North zone, the South zone and the West zone, because the movement
of the driver is specified in terms of these zones.
C)-
> install.packages("markovchain")
> library(markovchain)
> tran_mat=matrix(c(0.3,0.3,0.4,0.4,0.4,0.2,0.5,0.3,0.2),nrow=3,byrow=TRUE)
> tran_mat
     [,1] [,2] [,3]
[1,]  0.3  0.3  0.4
[2,]  0.4  0.4  0.2
[3,]  0.5  0.3  0.2
> disp_trans=new("markovchain",transitionMatrix=tran_mat,states=c("North","South","West"),name="DriverMovement")
> disp_trans
DriverMovement
A 3 - dimensional discrete Markov Chain defined by the following states:
North, South, West
The transition matrix (by rows) is defined as follows:
      North South West
North   0.3   0.3  0.4
South   0.4   0.4  0.2
West    0.5   0.3  0.2
D)-
> steadyStates(disp_trans)
         North     South      West
[1,] 0.3888889 0.3333333 0.2777778
In the steady state, the driver has a probability of about 39% of being in the North zone, about 33%
of being in the South zone, and about 28% of being in the West zone.
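The steady-state vector can be cross-checked without the markovchain package. The sketch below is one possible base R approach, not the package's own method: it takes the left eigenvector of the transition matrix for eigenvalue 1 and rescales it to sum to 1. The names `P` and `steady` are chosen here for illustration.

```r
# Transition matrix from part C (rows = current zone, columns = next zone).
zones <- c("North", "South", "West")
P <- matrix(c(0.3, 0.3, 0.4,
              0.4, 0.4, 0.2,
              0.5, 0.3, 0.2),
            nrow = 3, byrow = TRUE,
            dimnames = list(zones, zones))

# A stationary distribution pi satisfies pi %*% P = pi, so it is a left
# eigenvector of P (an eigenvector of t(P)) for eigenvalue 1.
e <- eigen(t(P))
idx <- which.min(abs(e$values - 1))
v <- Re(e$vectors[, idx])

# Rescale so the probabilities sum to 1.
steady <- v / sum(v)
names(steady) <- zones
print(round(steady, 7))
```

The result equals 7/18, 6/18 and 5/18, which matches the `steadyStates()` output above.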
E)-
> current_state=c(0.2,0.45,0.35)
> current_state*disp_trans^2
      North  South  West
[1,] 0.3825 0.3345 0.283
The percentages of drivers in each zone after the next trip, that is, after 2 transitions,
are:
North= 38.25%
South= 33.45%
West= 28.3%
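The two-transition result can also be verified by plain matrix multiplication. This is a minimal base R sketch using the same transition matrix and starting distribution as above; the names `P`, `P2` and `after_two` are illustrative.

```r
# Transition matrix and starting distribution from the answer above.
zones <- c("North", "South", "West")
P <- matrix(c(0.3, 0.3, 0.4,
              0.4, 0.4, 0.2,
              0.5, 0.3, 0.2),
            nrow = 3, byrow = TRUE,
            dimnames = list(zones, zones))
current_state <- c(North = 0.2, South = 0.45, West = 0.35)

# Two transitions: multiply the row vector by P twice (i.e. by P %*% P).
P2 <- P %*% P
after_two <- current_state %*% P2
print(after_two)
```

For example, the North entry is 0.2 * 0.41 + 0.45 * 0.38 + 0.35 * 0.37 = 0.3825, where 0.41, 0.38 and 0.37 form the first column of `P2`.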
ANS 3-
A)
library(ISLR)  # the Carseats data set is provided by the ISLR package
model1 = lm(data = Carseats, Sales ~ Price)
summary(model1)
Call:
lm(formula = Sales ~ Price, data = Carseats)
Residuals:
    Min      1Q  Median      3Q     Max
-6.5224 -1.8442 -0.1459  1.6503  7.5108
Coefficients:
             Estimate Std. Error t value            Pr(>|t|)
(Intercept) 13.641915   0.632812  21.558 <0.0000000000000002 ***
Price       -0.053073   0.005354  -9.912 <0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.532 on 398 degrees of freedom
Multiple R-squared: 0.198, Adjusted R-squared: 0.196
F-statistic: 98.25 on 1 and 398 DF, p-value: < 0.00000000000000022
INTERPRETATION- Price is the independent variable. Price has a statistically significant effect on
Sales: its coefficient is significant at the 95% confidence level, with a p-value below 0.001. However,
the R-squared value is 0.198, which means the model explains only 19.8% of the variation in Sales, and
the adjusted R-squared is 0.196. Both values are very low. So the model is significant, but it is not
a good model, because it explains only a small share of the variation in Sales.
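To make the R-squared interpretation concrete, the sketch below recomputes R-squared and adjusted R-squared by hand. It uses R's built-in `cars` data as a stand-in (so it runs without the ISLR package), then applies the adjusted R-squared formula to the figures reported above for model1 (R-squared 0.198, 400 observations, 1 predictor).

```r
# Stand-in model on the built-in cars data (NOT the Carseats data).
fit <- lm(dist ~ speed, data = cars)

# R-squared = 1 - RSS / TSS: the share of variation in the response
# that the model explains.
rss <- sum(residuals(fit)^2)
tss <- sum((cars$dist - mean(cars$dist))^2)
r_squared <- 1 - rss / tss

# Adjusted R-squared penalises the number of predictors p.
n <- nrow(cars)
p <- 1
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# Same formula applied to the values reported for model1.
model1_adj <- 1 - (1 - 0.198) * (400 - 1) / (400 - 1 - 1)

print(c(r_squared = r_squared, adj_r_squared = adj_r_squared))
print(round(model1_adj, 3))
```

The last value rounds to 0.196, consistent with the adjusted R-squared in the summary output.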
B)-
model2=lm(data=Carseats, Sales ~.)
summary(model2)
Call:
lm(formula = Sales ~ ., data = Carseats)
Residuals:
    Min      1Q  Median      3Q     Max
-2.8692 -0.6908  0.0211  0.6636  3.4115
Coefficients:
                  Estimate Std. Error t value              Pr(>|t|)
(Intercept)      5.6606231  0.6034487   9.380 < 0.0000000000000002 ***
CompPrice        0.0928153  0.0041477  22.378 < 0.0000000000000002 ***
Income           0.0158028  0.0018451   8.565  0.000000000000000258 ***
Advertising      0.1230951  0.0111237  11.066 < 0.0000000000000002 ***
Population       0.0002079  0.0003705   0.561                 0.575
Price           -0.0953579  0.0026711 -35.700 < 0.0000000000000002 ***
ShelveLocGood    4.8501827  0.1531100  31.678 < 0.0000000000000002 ***
ShelveLocMedium  1.9567148  0.1261056  15.516 < 0.0000000000000002 ***
Age             -0.0460452  0.0031817 -14.472 < 0.0000000000000002 ***
Education       -0.0211018  0.0197205  -1.070                 0.285
UrbanYes         0.1228864  0.1129761   1.088                 0.277
USYes           -0.1840928  0.1498423  -1.229                 0.220
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.019 on 388 degrees of freedom
F-statistic: 243.4 on 11 and 388 DF, p-value: < 0.00000000000000022


We ran the model with all predictors and found that “Education”, “Urban”, “US” and “Population”
have very high p-values, so we remove those variables as insignificant. The rest of the
variables are relevant to us.
Let us run the model again without these variables:
model3=lm(data=Carseats, Sales ~. -Population -US -Urban -Education)
summary(model3)
Call:
lm(formula = Sales ~ . - Population - US - Urban - Education, data = Carseats)
Residuals:
    Min      1Q  Median      3Q     Max
-2.7728 -0.6954  0.0282  0.6732  3.3292
Coefficients:
                 Estimate Std. Error t value            Pr(>|t|)
(Intercept)      5.475226   0.505005   10.84 <0.0000000000000002 ***
CompPrice        0.092571   0.004123   22.45 <0.0000000000000002 ***
Income           0.015785   0.001838    8.59 <0.0000000000000002 ***
Advertising      0.115903   0.007724   15.01 <0.0000000000000002 ***
Price           -0.095319   0.002670  -35.70 <0.0000000000000002 ***
ShelveLocGood    4.835675   0.152499   31.71 <0.0000000000000002 ***
ShelveLocMedium  1.951993   0.125375   15.57 <0.0000000000000002 ***
Age             -0.046128   0.003177  -14.52 <0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.019 on 392 degrees of freedom
F-statistic: 381.4 on 7 and 392 DF, p-value: < 0.00000000000000022

Previously, an R-squared value of 87.34% was obtained, which means the full model explains 87.34%
of the variation in Sales.
With the new model we get an R-squared of 87.2%, which is nearly the same, and all of its variables
are significant with p-values below 0.001. So this model is better than the previous model.
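One way to back up the choice between a full and a reduced model is a partial F-test with `anova()`, alongside the R-squared comparison. The sketch below is an illustration on R's built-in `mtcars` data, not on Carseats; which variables are dropped here is an arbitrary choice made for the example.

```r
# Stand-in models on the built-in mtcars data (NOT the Carseats data).
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Drop two predictors with update(), mirroring the "Sales ~ . - X" style.
reduced <- update(full, . ~ . - qsec - drat)

# Partial F-test: does the full model fit significantly better
# than the reduced one?
comparison <- anova(reduced, full)
print(comparison)

# R-squared can only fall (or stay equal) when predictors are removed,
# so a small drop supports keeping the simpler model.
r2 <- c(full = summary(full)$r.squared,
        reduced = summary(reduced)$r.squared)
print(round(r2, 4))
```

A large p-value in the `anova()` table would indicate that the dropped variables add little, which is the same reasoning applied to Population, US, Urban and Education above.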
C)-
We tested the correlation among the predictor variables to see whether any pair of predictors is
highly correlated, which led us to assess synergy among the variables, that is, whether the
product (variable1*variable2) influences the model.
cor(subset(Carseats, select=-c(ShelveLoc,Urban,US)))
                  Sales   CompPrice       Income  Advertising   Population       Price
Sales        1.00000000  0.06407873  0.151950979  0.269506781  0.050470984 -0.44495073
CompPrice    0.06407873  1.00000000 -0.080653423 -0.024198788 -0.094706516  0.58484777
Income       0.15195098 -0.08065342  1.000000000  0.058994706 -0.007876994 -0.05669820
Advertising  0.26950678 -0.02419879  0.058994706  1.000000000  0.265652145  0.04453687
Population   0.05047098 -0.09470652 -0.007876994  0.265652145  1.000000000 -0.01214362
Price       -0.44495073  0.58484777 -0.056698202  0.044536874 -0.012143620  1.00000000
Age         -0.23181544 -0.10023882 -0.004670094 -0.004557497 -0.042663355 -0.10217684
Education   -0.05195524  0.02519705 -0.056855422 -0.033594307 -0.106378231  0.01174660
                     Age    Education
Sales       -0.231815440 -0.051955242
CompPrice   -0.100238817  0.025197050
Income      -0.004670094 -0.056855422
Advertising -0.004557497 -0.033594307
Population  -0.042663355 -0.106378231
Price       -0.102176839  0.011746599
Age          1.000000000  0.006488032
Education    0.006488032  1.000000000
Here we can observe that almost all pairwise correlations are below 0.5 in magnitude. The only
exception is CompPrice and Price (0.58). So we conclude that there is no strong synergy
between any two variables.
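Whether a product term (variable1*variable2) influences the model can also be checked directly by adding an interaction term to `lm()`. The sketch below is an illustration on hypothetical simulated data (`x1`, `x2`, `y`), not on Carseats. The two predictors are generated independently, so their correlation is near zero, yet the interaction is built into the response.

```r
# Hypothetical simulated data (NOT the Carseats data): x1 and x2 are
# independent, but y depends on their product.
set.seed(7)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 0.5 * x1 + 0.5 * x2 + 2 * x1 * x2 + rnorm(n)

# Correlation between the predictors.
predictor_cor <- cor(x1, x2)

# x1 * x2 in a formula expands to x1 + x2 + x1:x2,
# where x1:x2 is the interaction (synergy) term.
fit <- lm(y ~ x1 * x2)
coefs <- summary(fit)$coefficients

print(round(predictor_cor, 3))
print(round(coefs, 4))
```

The row labelled `x1:x2` in the coefficient table gives the estimate and p-value for the interaction, so a test of this kind on, for example, `Sales ~ Price * Advertising` would address synergy more directly than the correlation matrix alone.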
D)- The four diagnostic plots are here:
This residuals versus fitted plot shows that the model is homoscedastic, and there are only two
outliers in the dataset, observations 208 and 358.
We have also plotted the normal Q-Q plot to check whether the model satisfies the normality
assumption. The standardized residuals follow the dotted reference line closely, which means the
normality assumption is satisfied.
This red dotted line is roughly linear in pattern, so the linearity assumption is also
satisfied.
In this leverage plot there is no significant outlier at the extremes of the red dotted line
that could influence the model and its predicted values.
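The four diagnostic plots discussed above are the ones produced by calling `plot()` on a fitted `lm` object. The sketch below shows one way to draw them and to check the same points numerically. It uses R's built-in `cars` data as a stand-in, and the cut-off of 2 for standardized residuals is a common rule of thumb assumed here, not something taken from the exam.

```r
# Stand-in model on the built-in cars data (NOT the Carseats data).
fit <- lm(dist ~ speed, data = cars)

# When run as a script, send plots to a null device instead of a file.
if (!interactive()) pdf(NULL)

# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage.
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))

if (!interactive()) dev.off()

# Numerical companions to the plots.
std_res <- rstandard(fit)        # standardized residuals (Q-Q plot)
cooks <- cooks.distance(fit)     # influence (leverage plot)

# Observations with large standardized residuals (assumed cut-off: 2).
flagged <- which(abs(std_res) > 2)
print(flagged)

# The three most influential observations by Cook's distance.
print(head(sort(cooks, decreasing = TRUE), 3))
```

Replacing `fit` with the final Carseats model would list the observations that the plots label, such as the two outliers mentioned above.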