
Institute of Rural Management Anand

PGDM-RM41 – Term III – End Term Examination


< Business Analytics >
< 7 APRIL 2021 >
< Nidhi Tulsyan, P41088 >

ANS 1-

A) > getwd()

[1] "C:/Users/nidhi/Documents"

> setwd("C:/Users/nidhi/Downloads")

> library(cluster)

> library(factoextra)

> library(magrittr)

> food=read.csv("food-texture.csv",header = TRUE)

> View(food)

> food=food[,-1]

> foodscale=scale(food)

#optimal number of clusters

> fviz_nbclust(foodscale,kmeans,method = "wss")

> fviz_nbclust(foodscale,kmeans,method = "silhouette")

##K-means clustering

> km.res=kmeans(foodscale,2,nstart = 25)

> fviz_cluster(km.res,data=food,ellipse.type = "convex",palette="jco",ggtheme = theme_minimal())

#PCA

apply(food,2,mean)

apply(food, 2, var)

pr.out=prcomp(food , scale=TRUE)

pr.out$center

pr.out$scale

pr.out$rotation

pr.var=(pr.out$sdev)^2

pve=pr.var/sum(pr.var)

pve
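The proportion of variance explained (PVE) computed above can also be accumulated to see how much the first few components explain together. The following is only a sketch: it uses the built-in USArrests data as a stand-in for food-texture.csv, and the object names ending in .demo are assumptions made for illustration.

```r
# Sketch on the built-in USArrests data (a stand-in for food-texture.csv).
pr.demo <- prcomp(USArrests, scale = TRUE)

pr.var.demo <- pr.demo$sdev^2                 # variance of each principal component
pve.demo <- pr.var.demo / sum(pr.var.demo)    # proportion of variance explained
cum.pve.demo <- cumsum(pve.demo)              # cumulative proportion explained

round(rbind(pve = pve.demo, cumulative = cum.pve.demo), 4)
```

The cumulative row is the quantity to read when reporting how much variability the first two components explain together.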

Cluster by Silhouette Method

fviz_nbclust(food,kmeans,method="silhouette")

With the WSS (elbow) method there is a kink at 2, so the optimal number of clusters is 2. With the
silhouette method, the average silhouette width is maximum at 2, so the optimal number of clusters
is again 2.
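The two criteria read from the plots can also be computed directly. This is a minimal sketch, assuming the built-in USArrests data as a stand-in for the food data; the names x, ks, wss and avg_sil are chosen here for illustration only.

```r
library(cluster)   # for silhouette(); ships with standard R installations

set.seed(1)
x <- scale(USArrests)        # stand-in data, standardised like foodscale
ks <- 1:6

# Total within-cluster sum of squares for each k (the elbow criterion)
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# Average silhouette width for each k >= 2 (the silhouette criterion)
avg_sil <- sapply(2:6, function(k) {
  cl <- kmeans(x, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, dist(x))[, "sil_width"])
})
names(avg_sil) <- 2:6

wss                          # look for the kink (elbow) in this sequence
names(which.max(avg_sil))    # k with the largest average silhouette width
```

fviz_nbclust draws the same two quantities, so the sketch is mainly useful for reading off the exact numbers behind the plots.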

B- Hierarchical

> res.hc=hclust(dist(foodscale),method="complete")

> fviz_dend(res.hc, k=2, cex=0.5, k_colors=c(1,2), color_labels_by_k = TRUE, rect=TRUE)
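The dendrogram can be cut into a fixed number of groups to obtain the cluster membership of each observation. A short sketch under the assumption that the built-in USArrests data stands in for the food data (hc.demo and groups are illustrative names):

```r
# Complete-linkage hierarchical clustering on standardised stand-in data
hc.demo <- hclust(dist(scale(USArrests)), method = "complete")

groups <- cutree(hc.demo, k = 2)   # cluster label (1 or 2) for each observation
table(groups)                      # cluster sizes
```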

C- There are 2 clusters: cluster 1 is of size 16, while cluster 2 is of size 34.

The mean of oil is 16.51 in cluster 1 and 18.65 in cluster 2, which shows that cluster 2 pastries contain
more oil than cluster 1 pastries. People are willing to pay the price for cluster 1.

Pastries in cluster 1 score higher on density, fracture and hardness, while cluster 2 pastries are hard
but less dense and crisper.
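Cluster sizes and per-cluster means like the ones quoted above can be obtained from a k-means fit as sketched below. The sketch assumes the built-in USArrests data instead of the food data, and km.demo and cluster_means are illustrative names.

```r
set.seed(1)
km.demo <- kmeans(scale(USArrests), centers = 2, nstart = 25)

km.demo$size    # number of observations in each cluster

# Mean of every original (unscaled) variable within each cluster
cluster_means <- aggregate(USArrests,
                           by = list(cluster = km.demo$cluster),
                           FUN = mean)
cluster_means
```

Taking the means on the unscaled data keeps them in the original units, which makes the comparison between clusters easier to interpret.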

D-

The first principal component explains 60.6% of the variability in the data, while the second explains 25.91%.

Cumulatively they explain 86.5% of the variability in the data.

ANS 2-

A)- The scenario can be modelled using a Markov chain because a driver in a given zone either
stays in that zone or moves to another zone, and the probability of the driver going from one state
to another depends only on the current state and not on the previous states. The state space of the
model is the North zone, the South zone and the West zone, and the movement evolves over time as a
stochastic process; a process with this property is known as a Markov chain.

B)- The states are the North zone, the South zone and the West zone, because the movement of the
driver is specified in terms of these zones.

C)-
install.packages("markovchain")
library(markovchain)
> tran_mat=matrix(c(0.3,0.3,0.4,0.4,0.4,0.2,0.5,0.3,0.2),nrow=3,byrow=TRUE)
> tran_mat
[,1] [,2] [,3]
[1,] 0.3 0.3 0.4
[2,] 0.4 0.4 0.2
[3,] 0.5 0.3 0.2
> disp_trans=new("markovchain",transitionMatrix=tran_mat,states=c("North","South","West"),name="DriverMovement")
> disp_trans
DriverMovement
A 3 - dimensional discrete Markov Chain defined by the following states:
North, South, West
The transition matrix (by rows) is defined as follows:
North South West
North 0.3 0.3 0.4
South 0.4 0.4 0.2
West 0.5 0.3 0.2

D)-

>steadyStates(disp_trans)

North South West

[1,] 0.3888889 0.3333333 0.2777778

In the steady state, the driver has a probability of approximately 39% of being in the North zone,
33% of being in the South zone and 28% of being in the West zone.
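The steady-state vector can be cross-checked in base R without the markovchain package: it is the left eigenvector of the transition matrix for eigenvalue 1, rescaled to sum to 1. This is only a sketch; the names P, e, v and steady are chosen here for illustration.

```r
zones <- c("North", "South", "West")
P <- matrix(c(0.3, 0.3, 0.4,
              0.4, 0.4, 0.2,
              0.5, 0.3, 0.2),
            nrow = 3, byrow = TRUE, dimnames = list(zones, zones))

# Every row of a transition matrix must sum to 1
stopifnot(all(abs(rowSums(P) - 1) < 1e-12))

# Left eigenvectors of P are the eigenvectors of t(P)
e <- eigen(t(P))
v <- Re(e$vectors[, which.min(abs(e$values - 1))])

steady <- v / sum(v)     # rescale so the probabilities sum to 1
names(steady) <- zones
round(steady, 7)
```

The result equals 7/18, 6/18 and 5/18, which matches the steadyStates output above.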

E)-

> current_state=c(0.2,0.45,0.35)

> current_state*disp_trans^2

North South West

[1,] 0.3825 0.3345 0.283

The percentage of drivers expected in each of these zones after the next trip, that is, after 2
transitions, is:

North= 38.25%

South= 33.45%

West= 28.3%
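The same two-step distribution can be reproduced with plain matrix multiplication, which makes each transition visible. A minimal sketch in base R (the names P, current, one_step and two_step are illustrative):

```r
zones <- c("North", "South", "West")
P <- matrix(c(0.3, 0.3, 0.4,
              0.4, 0.4, 0.2,
              0.5, 0.3, 0.2),
            nrow = 3, byrow = TRUE, dimnames = list(zones, zones))

current <- c(North = 0.2, South = 0.45, West = 0.35)

one_step <- current %*% P     # distribution after 1 transition: 0.415 0.345 0.240
two_step <- one_step %*% P    # distribution after 2 transitions: 0.3825 0.3345 0.2830

two_step
```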

ANS3- A)

model1 = lm(data = Carseats, Sales ~ Price)


summary(model1)
Call:
lm(formula = Sales ~ Price, data = Carseats)

Residuals:

Min 1Q Median 3Q Max

-6.5224 -1.8442 -0.1459 1.6503 7.5108

Coefficients:

             Estimate Std. Error t value            Pr(>|t|)
(Intercept) 13.641915   0.632812  21.558 <0.0000000000000002 ***
Price       -0.053073   0.005354  -9.912 <0.0000000000000002 ***
---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.532 on 398 degrees of freedom

Multiple R-squared: 0.198, Adjusted R-squared: 0.196

F-statistic: 98.25 on 1 and 398 DF, p-value: < 0.00000000000000022

INTERPRETATION- Price is the independent variable. Price has a statistically significant
influence on Sales (p-value less than 0.001), but the R-squared value is 0.198, which means that
the model explains only 19.8% of the variation in Sales. The adjusted R-squared is 0.196; both
values are very low. So we can say that the model is significant, though it is not a good model
because it explains little of the variation in Sales.
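The figures in the summary above are tied together by simple identities, which gives a quick consistency check. The sketch below uses only numbers copied from the output; the variable names are illustrative assumptions.

```r
# Values copied from summary(model1)
intercept <- 13.641915
slope     <- -0.053073
t_price   <- -9.912
f_stat    <- 98.25
df_resid  <- 398

# With a single predictor, the F-statistic is the square of the slope's t value
t_price^2

# R-squared can be recovered from F and the residual degrees of freedom
r2 <- f_stat / (f_stat + df_resid)
round(r2, 3)

# Fitted Sales at an illustrative Price of 100
intercept + slope * 100
```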

B)-
model2=lm(data=Carseats, Sales ~.)
summary(model2)
Call:
lm(formula = Sales ~ ., data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max
-2.8692 -0.6908  0.0211  0.6636  3.4115

Coefficients:
                  Estimate Std. Error t value             Pr(>|t|)
(Intercept)      5.6606231  0.6034487   9.380 < 0.0000000000000002 ***
CompPrice        0.0928153  0.0041477  22.378 < 0.0000000000000002 ***
Income           0.0158028  0.0018451   8.565 0.000000000000000258 ***
Advertising      0.1230951  0.0111237  11.066 < 0.0000000000000002 ***
Population       0.0002079  0.0003705   0.561                0.575
Price           -0.0953579  0.0026711 -35.700 < 0.0000000000000002 ***
ShelveLocGood    4.8501827  0.1531100  31.678 < 0.0000000000000002 ***
ShelveLocMedium  1.9567148  0.1261056  15.516 < 0.0000000000000002 ***
Age             -0.0460452  0.0031817 -14.472 < 0.0000000000000002 ***
Education       -0.0211018  0.0197205  -1.070                0.285
UrbanYes         0.1228864  0.1129761   1.088                0.277
USYes           -0.1840928  0.1498423  -1.229                0.220
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.019 on 388 degrees of freedom


Multiple R-squared: 0.8734, Adjusted R-squared: 0.8698

F-statistic: 243.4 on 11 and 388 DF, p-value: < 0.00000000000000022


We ran the model with all predictors and saw that “Education”, “Urban”, “US” and “Population”
have very high p-values, so we are removing those variables as insignificant. The rest of the
variables are relevant to us.
Let us run the model again without these variables:
model3=lm(data=Carseats, Sales ~. -Population -US -Urban -Education)
summary(model3)
Call:
lm(formula = Sales ~ . - Population - US - Urban - Education, data = Carseats)

Residuals:
Min 1Q Median 3Q Max
-2.7728 -0.6954 0.0282 0.6732 3.3292

Coefficients:
                 Estimate Std. Error t value            Pr(>|t|)
(Intercept)      5.475226   0.505005   10.84 <0.0000000000000002 ***
CompPrice        0.092571   0.004123   22.45 <0.0000000000000002 ***
Income           0.015785   0.001838    8.59 <0.0000000000000002 ***
Advertising      0.115903   0.007724   15.01 <0.0000000000000002 ***
Price           -0.095319   0.002670  -35.70 <0.0000000000000002 ***
ShelveLocGood    4.835675   0.152499   31.71 <0.0000000000000002 ***
ShelveLocMedium  1.951993   0.125375   15.57 <0.0000000000000002 ***
Age             -0.046128   0.003177  -14.52 <0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.019 on 392 degrees of freedom


Multiple R-squared: 0.872, Adjusted R-squared: 0.8697

F-statistic: 381.4 on 7 and 392 DF, p-value: < 0.00000000000000022

The previous model had an R-squared of 87.34%, which means it explains 87.34% of the variation in
Sales (adjusted R-squared 0.8698).
The new model has an R-squared of 87.2% (adjusted R-squared 0.8697), which is practically the same
with four fewer predictors, and all the remaining variables are significant with p-values below
0.001. So this model is better than the previous model.
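The adjusted R-squared values of the two models can be reproduced from the reported R-squared, the sample size and the number of estimated slopes. A short sketch, where adj_r2 is a helper defined here for illustration and n = 400 follows from the residual degrees of freedom in the outputs (388 + 12 and 392 + 8):

```r
# Adjusted R-squared: penalises R-squared for the number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

full_model    <- adj_r2(0.8734, n = 400, p = 11)   # model2, 11 slope coefficients
reduced_model <- adj_r2(0.8720, n = 400, p = 7)    # model3, 7 slope coefficients

round(c(full = full_model, reduced = reduced_model), 4)
```

Both values come out at about 0.87, so dropping the four variables costs almost nothing in explanatory power.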
C)-
We tested the correlation among the predictor variables to understand whether any pair of
predictors is highly correlated, which we then used to judge whether there is synergy among the
variables, that is, whether an interaction term (variable1*variable2) is influencing the model or not.
cor(subset(Carseats, select=-c(ShelveLoc,Urban,US)))

                  Sales   CompPrice       Income  Advertising   Population       Price          Age    Education
Sales        1.00000000  0.06407873  0.151950979  0.269506781  0.050470984 -0.44495073 -0.231815440 -0.051955242
CompPrice    0.06407873  1.00000000 -0.080653423 -0.024198788 -0.094706516  0.58484777 -0.100238817  0.025197050
Income       0.15195098 -0.08065342  1.000000000  0.058994706 -0.007876994 -0.05669820 -0.004670094 -0.056855422
Advertising  0.26950678 -0.02419879  0.058994706  1.000000000  0.265652145  0.04453687 -0.004557497 -0.033594307
Population   0.05047098 -0.09470652 -0.007876994  0.265652145  1.000000000 -0.01214362 -0.042663355 -0.106378231
Price       -0.44495073  0.58484777 -0.056698202  0.044536874 -0.012143620  1.00000000 -0.102176839  0.011746599
Age         -0.23181544 -0.10023882 -0.004670094 -0.004557497 -0.042663355 -0.10217684  1.000000000  0.006488032
Education   -0.05195524  0.02519705 -0.056855422 -0.033594307 -0.106378231  0.01174660  0.006488032  1.000000000

Here we can observe that, apart from CompPrice and Price (0.58), no pair of variables has a
correlation exceeding + or - 0.5, so we can conclude that there is no strong synergy between any
two variables.
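Scanning a correlation matrix by eye is error-prone, so the pairs above a threshold can be listed programmatically. The following sketch uses a hypothetical helper, high_cor_pairs, and a small made-up data frame (toy) purely for illustration; neither is part of the original analysis.

```r
# List every pair of columns whose absolute correlation exceeds a threshold
high_cor_pairs <- function(df, threshold = 0.5) {
  cm <- cor(df)
  # upper.tri() keeps each pair once and skips the diagonal
  idx <- which(abs(cm) > threshold & upper.tri(cm), arr.ind = TRUE)
  data.frame(var1 = rownames(cm)[idx[, "row"]],
             var2 = colnames(cm)[idx[, "col"]],
             r    = cm[idx],
             stringsAsFactors = FALSE)
}

# Toy data: y is an exact multiple of x, z is uncorrelated with both
toy <- data.frame(x = c(1, 2, 3, 4),
                  y = c(2, 4, 6, 8),
                  z = c(1, -1, -1, 1))

pairs_found <- high_cor_pairs(toy)
pairs_found
```

Applied to the numeric Carseats columns, the same helper would return the pairs whose correlation exceeds the chosen cut-off.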

D)- The four diagnostic plots are here:

This residuals vs fitted plot shows that the model is homoscedastic and that there are only two
outliers in the dataset, which are observations 208 and 358.

We have also plotted the normal Q-Q plot to check whether the model satisfies the normality
assumption. The standardized residuals follow the dotted diagonal line closely, which means the
normality assumption is satisfied.

This red line is linear in pattern, so the linearity assumption is also satisfied.

There is no significant outlier at the extremes of the red dotted line of this leverage plot that
could influence the model and the predicted values.
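The visual checks above can be backed by the numbers behind the plots. This is a hedged sketch on the built-in cars data, not on the Carseats model; fit.demo is an illustrative name, and the cut-offs of 2 for standardized residuals and 4/n for Cook's distance are common rules of thumb, assumed here rather than taken from the original analysis.

```r
# Stand-in model on the built-in cars data
fit.demo <- lm(dist ~ speed, data = cars)

std_res <- rstandard(fit.demo)        # standardized residuals (Q-Q and scale-location plots)
lev     <- hatvalues(fit.demo)        # leverage of each observation
cooks   <- cooks.distance(fit.demo)   # influence of each observation on the fit

which(abs(std_res) > 2)               # candidate outliers
which(cooks > 4 / nrow(cars))         # candidate influential points

# The four diagnostic plots themselves:
# par(mfrow = c(2, 2)); plot(fit.demo)
```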
