
Advanced Statistics Project

Solutions:
Answer 1:
Basic summary of the data:
Dimension of the dataset: 13 variables and 100 observations.
Apart from the product id, the remaining 12 variables are numerical. The column names
of the dataset are: id, prodqual, Ecom, Techsup, Compres, Advertising, prodline, SalesFImage,
ComPricing, WartyClaim, OrdBilling, DelSpeed, Satisfaction.
Missing Values: The function anyNA() shows that there are no missing values in
the dataset.
Outliers: The boxplots show that outliers are present in E-commerce, Sales Force
Image, Order & Billing and Delivery Speed.
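The outlier check above relies on boxplots, which conventionally flag points lying more than 1.5 times the interquartile range beyond the quartiles. The report's analysis was run in R; the following is only a minimal Python sketch of that rule. The function name `iqr_outliers` and the sample values are made up for illustration, and quartile conventions differ slightly between tools, so results near the fences may not match R's boxplot exactly.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # "inclusive" treats the data as the whole population when interpolating quartiles.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical ratings, NOT taken from the dataset in this report.
sample = [2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 9.5]
outliers = iqr_outliers(sample)
print(outliers)
```

Applied column by column, a check like this would list the same kind of points that appear as isolated dots beyond the whiskers of a boxplot.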
Graphical Analysis:
Graphical analysis (histograms, boxplots and scatterplots) is performed on the dataset
after the product id column is dropped.
Histogram Analysis:
Scatterplot analysis:
Boxplot: to find outliers

Answer 2:
To check for multicollinearity, the corrplot function is used first. Since the
correlation values of some pairs of variables are above 0.5, the VIF values of the
variables are also computed.
Correlation plot diagram:
The VIF values of the variables are:
- ProdQual: 1.635797
- Ecom: 2.756694
- TechSup: 2.976796
- CompRes: 4.730448
- Advertising: 1.508933
- ProdLine: 3.488185
- SalesFImage: 3.439420
- ComPricing: 1.635000
- WartyClaim: 3.198337
- OrdBilling: 2.902999
- DelSpeed: 6.516014

The VIF value of delivery speed is 6.52, which is above the threshold value of 5, so
multicollinearity is present in the dataset.
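The threshold check can be reproduced directly from the reported VIF values. The sketch below is in Python rather than the R used for the report, and it is only an illustration: it flags every variable above the cut-off of 5 and inverts the standard identity VIF = 1 / (1 - R^2) to show how much of each flagged variable is explained by the other predictors. The helper name `implied_r_squared` is an assumption introduced here.

```python
# VIF values copied from the results listed above.
vif = {
    "ProdQual": 1.635797, "Ecom": 2.756694, "TechSup": 2.976796,
    "CompRes": 4.730448, "Advertising": 1.508933, "ProdLine": 3.488185,
    "SalesFImage": 3.439420, "ComPricing": 1.635000, "WartyClaim": 3.198337,
    "OrdBilling": 2.902999, "DelSpeed": 6.516014,
}

THRESHOLD = 5.0  # rule-of-thumb cut-off used in this report

def implied_r_squared(v):
    """Invert VIF = 1 / (1 - R^2): R^2 of one predictor regressed on all the others."""
    return 1.0 - 1.0 / v

# Keep only the variables whose VIF exceeds the threshold.
flagged = {name: v for name, v in vif.items() if v > THRESHOLD}
for name, v in flagged.items():
    print(f"{name}: VIF={v:.2f}, implied R^2={implied_r_squared(v):.3f}")
```

Only DelSpeed is flagged. CompRes, at 4.73, is the next highest and sits just under the cut-off.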

Answer 3:
A simple linear regression is performed for the dependent variable against each independent variable.
Keeping Customer Satisfaction as the dependent variable and each of the other 11 variables in
turn as the independent variable, the regressions are fitted using the lm function.
The following results are obtained:
- Satisfaction = 3.675 + 0.415*ProdQual
- Satisfaction = 5.151 + 0.481*Ecom
- Satisfaction = 6.447 + 0.087*TechSup
- Satisfaction = 3.680 + 0.595*CompRes
- Satisfaction = 5.625 + 0.322*Advertising
- Satisfaction = 4.022 + 0.498*ProdLine
- Satisfaction = 4.070 + 0.556*SalesFImage
- Satisfaction = 8.038 - 0.160*ComPricing
- Satisfaction = 5.35 + 0.25*WartyClaim
- Satisfaction = 4.054 + 0.699*OrdBilling
- Satisfaction = 3.279 + 0.936*DelSpeed
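Each equation above has the form Satisfaction = intercept + slope * predictor, which is what an ordinary least squares fit with a single predictor produces. The report used R's lm function; the Python sketch below only illustrates the closed-form calculation behind such a fit. The function name `fit_simple_ols` and the toy data are assumptions made for this example and are not drawn from the dataset.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    # slope = sum of cross-deviations / sum of squared deviations of x
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up toy data lying exactly on y = 1 + 2x (not from the report's dataset).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]
intercept, slope = fit_simple_ols(x, y)
print(f"y = {intercept:.3f} + {slope:.3f}*x")
```

Running the same calculation once per predictor, with Satisfaction as y, would give one intercept and slope pair per line of the list above.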

Answer 4:
Bartlett's test is run to check whether PCA/FA is appropriate, and its p-value shows
that it is. The KMO test shows that the sampling adequacy is sufficient to perform
PCA/FA.
PCA/Factor analysis is performed on the dataset and the eigenvalues are computed. The
scree plot is drawn to check how many factors to retain. From the scree plot it is clear
that only 4 factors are required, since only 4 eigenvalues are above 1 in the plot.

According to the Kaiser rule, factors with eigenvalues greater than 1 are retained and the
rest are dropped. Only 4 eigenvalues are above 1, so it is correct to select 4 factors.
Moreover, the 4 factors (PA1 to PA4) explain about 82% of the cumulative variance.
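The Kaiser rule and the cumulative-variance figure are both simple functions of the eigenvalues. Since the eigenvalues themselves are not listed in this report, the Python sketch below uses a deliberately small, made-up set of five eigenvalues purely to show the mechanics. The function name `kaiser_rule` is an assumption introduced here, and the numbers do not reproduce the 4 factors or the 82% obtained above.

```python
def kaiser_rule(eigenvalues, cutoff=1.0):
    """Return (number of retained factors, share of total variance they explain)."""
    retained = [ev for ev in eigenvalues if ev > cutoff]
    # For a correlation matrix the eigenvalues sum to the number of variables.
    share = sum(retained) / sum(eigenvalues)
    return len(retained), share

# Hypothetical eigenvalues for a 5-variable example (NOT the report's values).
example_eigenvalues = [2.5, 1.2, 0.7, 0.4, 0.2]
n_factors, explained = kaiser_rule(example_eigenvalues)
print(n_factors, round(explained, 2))
```

With the actual eigenvalues from the factor analysis, the first returned value would be the number of factors to keep and the second the cumulative proportion of variance explained.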

The FA diagram is plotted to decide which variables to group together in order to
reduce the dimension.

PA1 covers delivery speed, complaint resolution and order and billing, so these are grouped
and named PURCHASE EXPERIENCE. Column name is ‘purch_exp’.
PA2 covers sales force image, e-commerce and advertising, which describe the
brand value of the product, so it is named BRAND. Column name is ‘brand’.
PA3 covers warranty and claims and technical support, which describe the after-sales
experience, so it is named SALES SERVICE. Column name is ‘sal_serv’.
PA4 covers product quality, product line and competitive pricing, which describe the product
itself, so it is named PRODUCT. Column name is ‘prod’.

Answer 5:
To perform multiple linear regression, a new data frame is created with the 4 factors and
customer satisfaction as the 5th column. The columns are combined and named using the cbind
function, and then multiple linear regression is performed keeping customer satisfaction as
the dependent variable and the 4 factors as independent variables.
Therefore, Customer Satisfaction = 6.918 + 0.57*purch_exp + 0.619*brand + 0.056*sal_serv
+ 0.611*prod.
The R-squared value is 69.71%, which means that the 4 independent variables explain 69.71%
of the variability in customer satisfaction.
The adjusted R-squared value is 68.44%, which is the same measure after penalising for the
number of predictors.
The residual degrees of freedom are 95 and the F-statistic is 54.66 with a low p-value, which
indicates that the model as a whole is statistically significant.
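The adjusted R-squared and the F-statistic both follow from the R-squared, the number of observations (100) and the number of predictors (4), so the reported figures can be cross-checked by hand. The Python sketch below uses the standard textbook formulas and is not the code used to produce the report. It starts from the rounded R-squared of 0.6971, so small differences in the last digit are expected.

```python
n = 100      # observations in the dataset
p = 4        # predictors: purch_exp, brand, sal_serv, prod
r2 = 0.6971  # R-squared as reported (rounded)

df_resid = n - p - 1  # residual degrees of freedom

# Adjusted R^2 penalises R^2 for the number of predictors.
adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic: explained variance per predictor over residual variance per df.
f_stat = (r2 / p) / ((1 - r2) / df_resid)

print(df_resid, round(adj_r2, 3), round(f_stat, 2))
```

The residual degrees of freedom come out as 95, the adjusted R-squared as roughly 0.684 and the F-statistic as roughly 54.66, which agrees with the reported values up to rounding.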

Conclusion: From the coefficients of the multiple linear regression, we can conclude that:
- Customer satisfaction depends most on brand (0.619) and product (0.611), whose
coefficients are nearly equal, followed closely by purchase experience (0.57).
After-sales service (0.056) has by far the smallest effect.
