Assignment 2 - Factor Hair

Project – 2 Factor Hair Case Study
By Divyansh Srivastava
Table of Contents
S.No Content Page No.

1 Project Objective 3
1.0.1 Importing the dataset in R 4
1.0.2 Environment Set up and data import 4
1.1 1.1 EDA - Basic data summary, Univariate, Bivariate analysis, graphs. 5
1.1 EDA - Check for Outliers and missing values and check the summary of
1.2 the dataset. 8
2 Check for Multicollinearity - Plot the graph based on Multicollinearity. 9
3 3 Simple Linear Regression (with every variable) 10
Perform PCA/FA and Interpret the Eigen Values (apply Kaiser
4.1 Normalization Rule) 10
Output Interpretation Tell why only 4 factors are being asked in the
questions and tell whether it is correct in choosing 4 factors. Name the
4.2 factors with correct explanations. 14
Create a data frame with a minimum of 5 columns, 4 of which are
5.1 different factors and the 5th column is Customer Satisfaction 14
Perform Multiple Linear Regression with Customer Satisfaction as the
5.2 Dependent Variable and the four factors as Independent Variables. 14
MLR summary interpretation and significance (R, R2, Adjusted R2,
5.3 Degrees of Freedom, f-statistic, coefficients along with p-values) 15
5.4 Output Interpretation 16
Appendix 17
2
1. Project Objective
The objective of the report is to explore the Factor Hair data set (“Factor-Hair-Revised.csv”) in R, and
generate insights about the data set. This exploration report will consist of the following:
I. Importing the dataset in R

II. Insights from the dataset to answer the following questions
1.1 EDA - Basic data summary, Univariate, Bivariate analysis, graphs.

1.2 EDA - Check for Outliers and missing values and check the summary of the dataset.
2 Check for Multicollinearity - Plot the graph based on Multicollinearity.
3 Simple Linear Regression (with every variable).
4.1 Perform PCA/FA and Interpret the Eigen Values (apply Kaiser Normalization Rule).
4.2 Output Interpretation Tell why only 4 factors are being asked in the questions and tell whether it
is correct in choosing 4 factors. Name the factors with correct explanations.
5.1 Create a data frame with a minimum of 5 columns, 4 of which are different factors and the 5th
column is Customer Satisfaction.
5.2 Perform Multiple Linear Regression with Customer Satisfaction as the Dependent Variable and
the four factors as Independent Variables.
5.3 MLR summary interpretation and significance (R, R2, Adjusted R2, Degrees of Freedom, f-statistic,
coefficients along with p-values).
5.4 Output Interpretation.
3
1.0.1 Importing the dataset in R
1.0.2 Environment Set up and Data Import
Install necessary Packages and Invoke Libraries
The following packages were used in the analysis and representation of the dataset.
‘corrplot’ – This function is used to plot the graph of the correlation matrix. The correlation matrix to
visualize. To visualize a general matrix, please use is.corr=FALSE. The visualization method : “circle”,
“color”, “number”, etc.
‘ppcor’ - The R package ppcor provides users with four functions which are pcor(), pcor.test(), spcor(),
and spcor.test(). The function pcor() (spcor()) calculates the partial (semi-partial) correlations of all
pairs of two random variables of a matrix or a data frame and provides the matrices of statistics and
p-values of each pairwise partial (semi-partial) correlation.
‘tidyverse’ - The "tidyverse" collects some of the most versatile R packages: ggplot2, dplyr, tidyr, readr,
purrr, and tibble. The packages work in harmony to clean, process, model, and visualize data.
‘ggplot2’ - The ggplot2 package, created by Hadley Wickham, offers a powerful graphics language for
creating elegant and complex plots.
‘psych’ - The psych package has been developed at Northwestern University since 2005 to include
functions most useful for personality, psychometric, and psychological research. The package is also
meant to supplement a text on psychometric theory.
‘car’ - Calculates type-II or type-III analysis-of-variance tables for model objects produced by lm, glm,
multinom (in the nnet package), polr (in the MASS package), coxph (in the survival package), coxme
(in the coxme pckage), svyglm (in the survey package), rlm (in the MASS package), lmer in the lme4
package, lme in the nlme package, and (by the default method) for most models with a linear predictor
and asymptotically normal coefficients.
‘nFactors’ - Indices, heuristics and strategies to help determine the number of factors/components to
retain.
Please refer Appendix 4 for source code.
Set up working directory
Setting a working directory on starting of the R session makes importing and exporting data files and
code files easier. To set up the working directory, we use the command ‘setwd()’. To fetch the path of
the existing working directory, we use the command ‘getwd()’.
Import and Read the Dataset
4
The given dataset is in .csv format. Hence, the command ‘read.csv’ is used for importing the file.
We have named the dataset in the problem as ‘mydata’.
1.1 EDA - Basic data summary, Univariate, Bivariate analysis, graphs
str – After applying the command, we infer that all factors are numbers, except ‘ID’ which is an integer.
summary – Provides us deeper insight about the dataset which is self-explanatory.
dim - The dataset has 100 rows and 13 columns. Please refer Appendix 6 for source code.
attach – Please refer Appendix 8 for source code.
Since the first column is of no use to us, we discard it and create a new dataset called ‘newdata’.
Please refer to Appendix 9 for source code.
For plotting a histogram of all the independent variables, we use the following function.
‘par’ – we sue this function to divide the plotting space into 12 to display all the histograms at the
same time.
‘hist’ – This function is used for plotting histograms.
The resulting graph is displayed below.
5
For Bivariate analysis, we use the following functions
‘for’ – These are the basic control-flow constructs of the R language. They function in much the same
way as control statements in any Algol-like language.
‘plot’ – Generic function for plotting of R objects.
6
7
1.2 EDA - Check for Outliers and missing values and check the summary
of the dataset
For finding the missing values in our dataset ‘newdata’, we use the function ‘sum’ along with ‘is.na’.
The number of missing in the dataset come out to be zero.
Please refer Appendix 14 for the source code.
For finding the outliers in our dataset, we use the boxplot function. The resulting graph is displayed
below.
As we can see, there are 4 outliers present in E-Commerce, 2 outliers present in Salesforce Image, 3
outliers present in Order Billing, and 1 outlier present in delivery speed.
The summary of the dataset has already been checked in 1.1
8
2 Check for Multicollinearity - Plot the graph based on Multicollinearity
Before checking the correlation of the data we create a new dataset consisting of only the
independent variables called ‘newdata2’.
Now, we sue the function ‘corrplot’ for plotting the correlation between these independent variables.
Please note that blue shows positive correlation, and red shows negative correlation. Therefore, dark
blue shows highest positive correlation, and red shows highest negative correlation.
As per our observations from the graph, we infer the following.
 Order Billing and Delivery Speed are highly correlated
9
 Order Billing and Complain Resolution are highly correlated
 Delivery Speed and Complain Resolution are highly correlated
 E-Commerce and Salesforce Image are highly correlated
 Technical Support and Warranty Claim are highly correlated
We also use the Variable Inflation Factor (VIF) method to check for the multicollinearity.
The numerical value for VIF tells us (in decimal form) what percentage the variance (i.e. the standard
error squared) is inflated for each coefficient. A rule of thumb for interpreting the variance inflation
factor is
 1 – Not correlated
 Between 1 and 5 – Moderately Correlated
 Greater than 5 – Highly Correlated
From the values we get, we can infer that Delivery Speed is a cause of concern.
3 Simple Linear Regression (with every variable)
For Simple Linear Regression with every function, we use the function ‘lm’, and compare the
dependent variable ‘Satisfaction’, against every independent variable.
Please refer Appendix 17 for the source code and results.
4.1 Perform PCA/FA and Interpret the Eigen Values (apply Kaiser
Normalization Rule)
Before running PCA/FA on our dataset, we’ll first run the Kaiser-Meyer-Olkin factor adequacy test on
our dataset to check if this is a suitable method to be used here.
We’ll run this test on our earlier created correlation matrix which we named as ‘CorMat’.
Please refer Appendix for source code.
Since the MSA for the data is greater than 5, we can run Factor Analysis on our dataset.
Firstly, we’ll compute the eigen values for our dataset with independent varibles ‘newdata2’.
10
We then draw a Scree plot for the same to understand our data better.
As per the Kaiser Normalization rule, we take factors only greater than one for decide the number to
which these 11 variables will be reduced.
As we can see in the graph, the number if 4.
New, we run factor analysis on our data using principal axis method, where with the number of factors
as four.
After running the factor analysis method, we graphically represent the factor loadings, as mentioned
below.
Please refer Appendix 17, 18, 19 and 20 for source code.
In bringing down 11 factors to 4, we lost around 31% of the variance. Only the first four factors have
Eigen Value greater than 1.
Factor 1 = for 29.20% of the variance;
Factor 2 = 20.20% of the variance;
11
Factor 3 = 13.60% of the variance;
Factor 4 = 6.2% of the variance.
All four factors together account for 69% of the variance.
After rotating the data, we get the below mentioned loading graph.
12
The red dotted lines mean that the loadings are negative and that it marginally falls under PA4.
13
4.2 Output Interpretation Tell why only 4 factors are being asked in the
questions and tell whether it is correct in choosing 4 factors. Name
the factors with correct explanations
As already explained in the last answer, we brought down the number of factors down to four as per
the Kaiser Normalization rule.
The four factors are now named as mentioned below.
Variables Factors Renamed to Explanation

DelSpeed, CompRes, These variables are a part of the product
OrdBilling PA1 Product_Purchase purchase lifecycle
SalesFImage, Ecom, These variables are sub-components of
Advertising PA2 Marketing Marketing
These variables comprise the after sales
WartyClaim, TechSup PA3 After_Sales services of a product
ProdLine, ProdQual, These factors influence a product's positioning
CompPricing PA4 Positioning in the market
5.1 Create a data frame with a minimum of 5 columns, 4 of which are

different factors and the 5th column is Customer Satisfaction
We now combine the newly created independent variables PA1, PA2, PA3 and PA4, and combine them
with the dependent variable Customer Satisfaction.
We also name them as mentioned in the answer of 4.2.
Please refer Appendix for source code.
5.2 Perform Multiple Linear Regression with Customer Satisfaction as

the Dependent Variable and the four factors as Independent
Variables
14
After creating the new dataset ‘Zdata’, we create the ‘Test’ and ‘Train’ datasets out of it to test the
model.
After running the multiple linear regression analysis on the ‘Train’ dataset, we can see that
‘Product_Purchase’, ‘Marketing’, and ‘Positioning’ are highly significant, as can be noted from the
three stars, which have been highlighted in the Appendix as well.
Therefore, we will now run multiple linear regression for Customer Satisfaction with respect to these
three factors.
Please refer to Appendix 25 and 26 for the same.
5.3 MLR summary interpretation and significance (R, R2, Adjusted R2,
Degrees of Freedom, f-statistic, coefficients along with p-values)
The explanation of the summary is mentioned below.
R (Residual Standard Error) - Residual Standard Error is measure of the quality of a linear regression
fit. Theoretically, every linear model is assumed to contain an error term E. Due to the presence of
this error term, we are not capable of perfectly predicting our response variable (Customer
Satisfaction) from the predictor (Independent variables). The residual standard error in this case is
0.6683.
R2 (Multiple R-squared) - The R-squared (R2) statistic provides a measure of how well the model is
fitting the actual data. It takes the form of a proportion of variance. R2 is a measure of the linear
relationship between our predictor variables and our response / target variable (Customer
Satisfaction). It always lies between 0 and 1 (i.e. a number near 0 represents a regression that does
not explain the variance in the response variable well and a number close to 1 does explain the
observed variance in the response variable). In our example, the R2 we get is 0.6951. Or roughly 69%
of the variance found in the response variable (Customer Satisfaction) can be explained by the
predictor variables (Product_Purchase, Marketing and Positioning).
Adjusted R2(R-squared) - In multiple regression settings, the R2 will always increase as more variables
are included in the model. That’s why the adjusted R2 is the preferred measure as it adjusts for the
number of variables considered. The adjusted R2 in this case is 0.6856.
Degrees of Freedom - The Residual Standard Error was calculated with 96 degrees of freedom.
Simplistically, degrees of freedom are the number of data points that went into the estimation of the
parameters used after taking into account these parameters (restriction). In our case, we had 100 data
points.
15
F-statistic - F-statistic is a good indicator of whether there is a relationship between our predictor and
the response variables. The further the F-statistic is from 1 the better it is. However, how much larger
the F-statistic needs to be depends on both the number of data points and the number of predictors.
The F-statistic for this analysis is 72.96.
Coefficients along with p-values - A small p-value indicates that it is unlikely we will observe a
relationship between the predictor (Product_Purchase, Marketing and Positioning) and response
(Customer Satisfaction) variables due to chance. Typically, a p-value of 5% or less is a good cut-off
point. In our model example, the p-values are very close to zero. Note the ‘signif. Codes’ associated to
each estimate. Three stars (or asterisks) represent a highly significant p-value.
5.4 Output Interpretation
The output has been well interpreted in 5.3. Along with that, the following points were covered in this
project.
 Uni and Bi-variate analysis with every variable
 Check and remedy for multi-collinearity
 Simple Linear Regression with every variable
 Multiple Linear regression with Customer satisfaction as the dependent variable and other
factors as dependent variables
16
Appendix
##Appendix 1
setwd("D:/learning/BABI Online")
getwd()
## [1] "D:/learning/BABI Online"
##Appendix2
mydata=read.csv("Factor-Hair-Revised.csv")
##Appendix3
str(mydata)
## 'data.frame': 100 obs. of 13 variables:

## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ ProdQual : num 8.5 8.2 9.2 6.4 9 6.5 6.9 6.2 5.8 6.4 ...
## $ Ecom : num 3.9 2.7 3.4 3.3 3.4 2.8 3.7 3.3 3.6 4.5 ...
## $ TechSup : num 2.5 5.1 5.6 7 5.2 3.1 5 3.9 5.1 5.1 ...
## $ CompRes : num 5.9 7.2 5.6 3.7 4.6 4.1 2.6 4.8 6.7 6.1 ...
## $ Advertising : num 4.8 3.4 5.4 4.7 2.2 4 2.1 4.6 3.7 4.7 ...
## $ ProdLine : num 4.9 7.9 7.4 4.7 6 4.3 2.3 3.6 5.9 5.7 ...
## $ SalesFImage : num 6 3.1 5.8 4.5 4.5 3.7 5.4 5.1 5.8 5.7 ...
## $ ComPricing : num 6.8 5.3 4.5 8.8 6.8 8.5 8.9 6.9 9.3 8.4 ...
## $ WartyClaim : num 4.7 5.5 6.2 7 6.1 5.1 4.8 5.4 5.9 5.4 ...
## $ OrdBilling : num 5 3.9 5.4 4.3 4.5 3.6 2.1 4.3 4.4 4.1 ...
## $ DelSpeed : num 3.7 4.9 4.5 3 3.5 3.3 2 3.7 4.6 4.4 ...
## $ Satisfaction: num 8.2 5.7 8.9 4.8 7.1 4.7 5.7 6.3 7 5.5 ...
##Appendix4
library(corrplot)
## corrplot 0.84 loaded
library(tidyverse)
## -- Attaching packages -------------------------------------------------

--------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1 v purrr 0.3.2

## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------
--------------------------------------- tidyverse_conflicts() --
17
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':

##
## %+%, alpha
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':

##
## logit
## The following object is masked from 'package:dplyr':

##
## recode
## The following object is masked from 'package:purrr':

##
## some
library(cartools)
##
## Attaching package: 'cartools'
## The following object is masked from 'package:base':

##
## merge
library(ppcor)
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':

##
## select
library(nFactors)
## Loading required package: boot
18
##
## Attaching package: 'boot'
## The following object is masked from 'package:car':

##
## logit
## The following object is masked from 'package:psych':

##
## logit
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':

##
## melanoma
##
## Attaching package: 'nFactors'
## The following object is masked from 'package:lattice':

##
## parallel
##Appendix5
summary(mydata)
## ID ProdQual Ecom TechSup

## Min. : 1.00 Min. : 5.000 Min. :2.200 Min. :1.300
## 1st Qu.: 25.75 1st Qu.: 6.575 1st Qu.:3.275 1st Qu.:4.250
## Median : 50.50 Median : 8.000 Median :3.600 Median :5.400
## Mean : 50.50 Mean : 7.810 Mean :3.672 Mean :5.365
## 3rd Qu.: 75.25 3rd Qu.: 9.100 3rd Qu.:3.925 3rd Qu.:6.625
## Max. :100.00 Max. :10.000 Max. :5.700 Max. :8.500
## CompRes Advertising ProdLine SalesFImage
## Min. :2.600 Min. :1.900 Min. :2.300 Min. :2.900
## 1st Qu.:4.600 1st Qu.:3.175 1st Qu.:4.700 1st Qu.:4.500
## Median :5.450 Median :4.000 Median :5.750 Median :4.900
## Mean :5.442 Mean :4.010 Mean :5.805 Mean :5.123
## 3rd Qu.:6.325 3rd Qu.:4.800 3rd Qu.:6.800 3rd Qu.:5.800
## Max. :7.800 Max. :6.500 Max. :8.400 Max. :8.200
## ComPricing WartyClaim OrdBilling DelSpeed
## Min. :3.700 Min. :4.100 Min. :2.000 Min. :1.600
## 1st Qu.:5.875 1st Qu.:5.400 1st Qu.:3.700 1st Qu.:3.400
## Median :7.100 Median :6.100 Median :4.400 Median :3.900
## Mean :6.974 Mean :6.043 Mean :4.278 Mean :3.886
## 3rd Qu.:8.400 3rd Qu.:6.600 3rd Qu.:4.800 3rd Qu.:4.425
## Max. :9.900 Max. :8.100 Max. :6.700 Max. :5.500
## Satisfaction
19
## Min. :4.700
## 1st Qu.:6.000
## Median :7.050
## Mean :6.918
## 3rd Qu.:7.625
## Max. :9.900
##Appendix6
dim(mydata)
## [1] 100 13
##Appendix7
names(mydata)
## [1] "ID" "ProdQual" "Ecom" "TechSup"

## [5] "CompRes" "Advertising" "ProdLine" "SalesFImage"
## [9] "ComPricing" "WartyClaim" "OrdBilling" "DelSpeed"
## [13] "Satisfaction"
##Appendix8
attach(mydata)
##Appendix9
newdata=mydata[c(2:13)]
##Appendix10
names=c("Product Quality","E-Commerce","Technical Support","Complaint Reso
lution","Advertising","Product Line","Salesforce Image","Competitive Prici
ng","Warranty & Claims","Order & Billing","Delivery Speed","Customer Satis
faction")
##Appendix11
# Histogram of independent variables
par(mfrow = c(3,4))
for (i in (1:11)) {
h = round(max(newdata[,i]),0)+1
20
l = round(min(newdata[,i]),0)-1
n = names[i]
hist (newdata[,i], breaks = seq(l,h,((h-l)/6)), labels = T,

include.lowest=T, right=T,
col=8, border=1,
main = NULL, xlab= n, ylab=NULL,
cex.lab=1, cex.axis=1, cex.main=1, cex.sub=1,
xlim = c(0,11), ylim = c(0,70))
}
21
##Appendix12
# Bivariate Analysis ####
par(mfrow = c(4,3))
22
for (i in c(1:11)) {plot(newdata[,i],`Satisfaction`,
xlab = names[i], ylab = NULL, col = "red",
cex.lab=1, cex.axis=1, cex.main=1, cex.sub=1,
xlim = c(0,10),ylim = c(0,10))
abline(lm(formula = `Satisfaction` ~ newdata[,i]),col = "blue")}
23
##Appendix13
#Clear all plots
#Finding the missing values in the data
sum(is.na(newdata))
## [1] 0
##Appendix14
#Checking for outliers
boxplot(newdata)
##Appendix15
#Checking correlation
newdata2=newdata[1:11]
CorMat=corrplot(cor(newdata2))
24
CorMat
## ProdQual Ecom TechSup CompRes Advertis

ing
## ProdQual 1.00000000 -0.1371632174 0.0956004542 0.1063700 -0.05347
313
## Ecom -0.13716322 1.0000000000 0.0008667887 0.1401793 0.42989
071
## TechSup 0.09560045 0.0008667887 1.0000000000 0.0966566 -0.06287
007
## CompRes 0.10637000 0.1401792611 0.0966565978 1.0000000 0.19691
685
## Advertising -0.05347313 0.4298907110 -0.0628700668 0.1969168 1.00000
000
## ProdLine 0.47749341 -0.0526878383 0.1926254565 0.5614170 -0.01155
082
## SalesFImage -0.15181287 0.7915437115 0.0169905395 0.2297518 0.54220
366
## ComPricing -0.40128188 0.2294624014 -0.2707866821 -0.1279543 0.13421
689
## WartyClaim 0.08831231 0.0518981915 0.7971679258 0.1404083 0.01079
207
## OrdBilling 0.10430307 0.1561473316 0.0801018246 0.7568686 0.18423
559
## DelSpeed 0.02771800 0.1916360683 0.0254406935 0.8650917 0.27586
308
## ProdLine SalesFImage ComPricing WartyClaim OrdBilling
## ProdQual 0.47749341 -0.15181287 -0.40128188 0.08831231 0.10430307
## Ecom -0.05268784 0.79154371 0.22946240 0.05189819 0.15614733
25
## TechSup 0.19262546 0.01699054 -0.27078668 0.79716793 0.08010182
## CompRes 0.56141695 0.22975176 -0.12795425 0.14040830 0.75686859
## Advertising -0.01155082 0.54220366 0.13421689 0.01079207 0.18423559
## ProdLine 1.00000000 -0.06131553 -0.49494840 0.27307753 0.42440825
## SalesFImage -0.06131553 1.00000000 0.26459655 0.10745534 0.19512741
## ComPricing -0.49494840 0.26459655 1.00000000 -0.24498605 -0.11456703
## WartyClaim 0.27307753 0.10745534 -0.24498605 1.00000000 0.19706512
## OrdBilling 0.42440825 0.19512741 -0.11456703 0.19706512 1.00000000
## DelSpeed 0.60185021 0.27155126 -0.07287173 0.10939460 0.75100307
## DelSpeed
## ProdQual 0.02771800
## Ecom 0.19163607
## TechSup 0.02544069
## CompRes 0.86509170
## Advertising 0.27586308
## ProdLine 0.60185021
## SalesFImage 0.27155126
## ComPricing -0.07287173
## WartyClaim 0.10939460
## OrdBilling 0.75100307
## DelSpeed 1.00000000
##Appendix16
#Check for multicollinearity in independent variables using VIF
vif(lm(`Satisfaction`~.,newdata2))
## ProdQual Ecom TechSup CompRes Advertising ProdLine

## 1.635797 2.756694 2.976796 4.730448 1.508933 3.488185
## SalesFImage ComPricing WartyClaim OrdBilling DelSpeed
## 3.439420 1.635000 3.198337 2.902999 6.516014
##Appendix17
#Simple Linear Regression with every variable
SLM1=lm(Satisfaction~ProdQual)
summary(SLM1)
##
## Call:
## lm(formula = Satisfaction ~ ProdQual)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.88746 -0.72711 -0.01577 0.85641 2.25220
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
26
## (Intercept) 3.67593 0.59765 6.151 1.68e-08 ***
## ProdQual 0.41512 0.07534 5.510 2.90e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.047 on 98 degrees of freedom
## Multiple R-squared: 0.2365, Adjusted R-squared: 0.2287
## F-statistic: 30.36 on 1 and 98 DF, p-value: 2.901e-07
SLM2=lm(Satisfaction~Ecom)
summary(SLM2)
##
## Call:
## lm(formula = Satisfaction ~ Ecom)
##
## Residuals:
## -2.37200 -0.78971 0.04959 0.68085 2.34580
##
## Coefficients:
## (Intercept) 5.1516 0.6161 8.361 4.28e-13 ***
## Ecom 0.4811 0.1649 2.918 0.00437 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## F-statistic: 8.515 on 1 and 98 DF, p-value: 0.004368
SLM3=lm(Satisfaction~TechSup)
summary(SLM3)
##
## Call:
## lm(formula = Satisfaction ~ TechSup)
##
## Residuals:
## -2.26136 -0.93297 0.04302 0.82501 2.85617
##
## Coefficients:
## (Intercept) 6.44757 0.43592 14.791 <2e-16 ***
## TechSup 0.08768 0.07817 1.122 0.265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
27
SLM4=lm(Satisfaction~CompRes)
summary(SLM4)
##
## Call:
## lm(formula = Satisfaction ~ CompRes)
##
## Residuals:
## -2.40450 -0.66164 0.04499 0.63037 2.70949
##
## Coefficients:
## (Intercept) 3.68005 0.44285 8.310 5.51e-13 ***
## CompRes 0.59499 0.07946 7.488 3.09e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
SLM5=lm(Satisfaction~Advertising)
summary(SLM5)
##
## Call:
## lm(formula = Satisfaction ~ Advertising)
##
## Residuals:
## -2.34033 -0.92755 0.05577 0.79773 2.53412
##
## Coefficients:
## (Intercept) 5.6259 0.4237 13.279 < 2e-16 ***
## Advertising 0.3222 0.1018 3.167 0.00206 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
SLM6=lm(Satisfaction~ProdLine)
summary(SLM6)
##
## Call:
## lm(formula = Satisfaction ~ ProdLine)
##
## Residuals:
## -2.3634 -0.7795 0.1097 0.7604 1.7373
28
##
## Coefficients:
## (Intercept) 4.02203 0.45471 8.845 3.87e-14 ***
## ProdLine 0.49887 0.07641 6.529 2.95e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1 on 98 degrees of freedom
SLM7=lm(Satisfaction~SalesFImage)
summary(SLM7)
##
## Call:
## lm(formula = Satisfaction ~ SalesFImage)
##
## Residuals:
## -2.2164 -0.5884 0.1838 0.6922 2.0728
##
## Coefficients:
## (Intercept) 4.06983 0.50874 8.000 2.54e-12 ***
## SalesFImage 0.55596 0.09722 5.719 1.16e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
SLM8=lm(Satisfaction~ComPricing)
summary(SLM8)
##
## Call:
## lm(formula = Satisfaction ~ ComPricing)
##
## Residuals:
## -1.9728 -0.9915 -0.1156 0.9111 2.5845
##
## Coefficients:
## (Intercept) 8.03856 0.54427 14.769 <2e-16 ***
## ComPricing -0.16068 0.07621 -2.108 0.0376 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
29
SLM9=lm(Satisfaction~WartyClaim)
summary(SLM9)
##
## Call:
## lm(formula = Satisfaction ~ WartyClaim)
##
## Residuals:
## -2.36504 -0.90202 0.03019 0.90763 2.88985
##
## Coefficients:
## (Intercept) 5.3581 0.8813 6.079 2.32e-08 ***
## WartyClaim 0.2581 0.1445 1.786 0.0772 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
SLM10=lm(Satisfaction~OrdBilling)
summary(SLM10)
##
## Call:
## lm(formula = Satisfaction ~ OrdBilling)
##
## Residuals:
## -2.4005 -0.7071 -0.0344 0.7340 2.9673
##
## Coefficients:
## (Intercept) 4.0541 0.4840 8.377 3.96e-13 ***
## OrdBilling 0.6695 0.1106 6.054 2.60e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
SLM11=lm(Satisfaction~DelSpeed)
summary(SLM11)
##
## Call:
## lm(formula = Satisfaction ~ DelSpeed)
##
30
## Residuals:
## -2.22475 -0.54846 0.08796 0.54462 2.59432
##
## Coefficients:
## (Intercept) 3.2791 0.5294 6.194 1.38e-08 ***
## DelSpeed 0.9364 0.1339 6.994 3.30e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##Appendix18
#Factor Analysis
#Kaiser Test
KMO(CorMat)
## Kaiser-Meyer-Olkin factor adequacy

## Call: KMO(r = CorMat)
## Overall MSA = 0.65
## MSA for each item =
## ProdQual Ecom TechSup CompRes Advertising ProdLine
## 0.51 0.63 0.52 0.79 0.78 0.62
## SalesFImage ComPricing WartyClaim OrdBilling DelSpeed
## 0.62 0.75 0.51 0.76 0.67
##Appendix19
#Since MSA>5 we can run factor analysis on this data
#Eigen value computation
ev=eigen(cor(newdata2))
print(ev,digits=5)
## eigen() decomposition
## $values
## [1] 3.426971 2.550897 1.690976 1.086556 0.609424 0.551884 0.401518
## [8] 0.246952 0.203553 0.132842 0.098427
##
## $vectors
## [,1] [,2] [,3] [,4] [,5] [,6] [
,7]
## [1,] -0.13379 0.313498 0.062272 0.64314 0.231666 0.564570 -0.1916
413
## [2,] -0.16595 -0.446509 -0.235248 0.27238 0.422288 -0.263257 -0.0596
31
262
## [3,] -0.15769 0.230967 -0.610951 -0.19339 -0.023957 0.108769 0.0171
999
## [4,] -0.47068 -0.019444 0.210351 -0.20632 0.028657 0.028152 0.0084
996
## [5,] -0.18373 -0.363665 -0.088097 0.31789 -0.803870 0.200569 0.0630
696
## [6,] -0.38677 0.284781 0.116279 0.20290 0.116674 -0.098195 0.6081
476
## [7,] -0.20367 -0.470696 -0.241342 0.22218 0.204373 -0.104972 -0.0014
374
## [8,] 0.15169 -0.413457 0.053045 -0.33354 0.248926 0.709736 0.3082
489
## [9,] -0.21293 0.191672 -0.598564 -0.18530 -0.032927 0.139840 0.0306
402
## [10,] -0.43722 -0.026399 0.168930 -0.23685 0.026754 0.119480 -0.6593
199
## [11,] -0.47309 -0.073052 0.232625 -0.19733 -0.035433 -0.029800 0.2342
393
## [,8] [,9] [,10] [,11]
## [1,] 0.135473 0.031328 -0.066597 -0.182792
## [2,] -0.122026 -0.542511 -0.281558 -0.062339
## [3,] 0.464710 -0.359300 0.388171 0.051930
## [4,] 0.513398 0.093248 -0.534672 0.362534
## [5,] -0.053477 -0.154682 -0.037158 0.081187
## [6,] -0.333207 -0.084155 0.234798 0.385078
## [7,] 0.169107 0.644899 0.353412 0.084699
## [8,] -0.098832 -0.094144 0.045182 0.102958
## [9,] -0.443540 0.317566 -0.435348 -0.128932
## [10,] -0.366018 -0.099073 0.303865 0.194151
## [11,] 0.065391 -0.021885 0.120104 -0.775632
EigenValue=ev$values
EigenValue
## [1] 3.42697133 2.55089671 1.69097648 1.08655606 0.60942409 0.55188378

## [7] 0.40151815 0.24695154 0.20355327 0.13284158 0.09842702
Factor=c(1,2,3,4,5,6,7,8,9,10,11)
Scree=data.frame(Factor,EigenValue)
plot(Scree,main="Scree Plot", col="Blue",ylim=c(0,5))
lines(Scree,col="Blue")
32
##Appendix20
#we will take 4 values because any value less than 1 is not of value as pe
r Kaiser
Unrotate=fa(newdata2,nfactors=4,rotate="none",fm="pa")
print(Unrotate,digits=4)
## Factor Analysis using method = pa

## Call: fa(r = newdata2, nfactors = 4, rotate = "none", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PA1 PA2 PA3 PA4 h2 u2 com
## ProdQual 0.2013 -0.4080 -0.0581 0.4626 0.4243 0.57570 2.396
## Ecom 0.2901 0.6592 0.2700 0.2159 0.6382 0.36183 2.003
## TechSup 0.2776 -0.3808 0.7381 -0.1663 0.7946 0.20539 1.945
## CompRes 0.8623 0.0117 -0.2553 -0.1839 0.8428 0.15719 1.272
## Advertising 0.2861 0.4572 0.0824 0.1288 0.3142 0.68579 1.947
## ProdLine 0.6895 -0.4534 -0.1424 0.3148 0.8003 0.19971 2.300
## SalesFImage 0.3945 0.8007 0.3458 0.2508 0.9792 0.02076 2.115
## ComPricing -0.2316 0.5530 -0.0444 -0.2861 0.4433 0.55673 1.906
## WartyClaim 0.3793 -0.3245 0.7355 -0.1530 0.8135 0.18647 2.037
## OrdBilling 0.7470 0.0208 -0.1752 -0.1809 0.6218 0.37818 1.234
## DelSpeed 0.8951 0.0983 -0.3035 -0.1976 0.9420 0.05796 1.361
##
## PA1 PA2 PA3 PA4
## SS loadings 3.2150 2.2229 1.4987 0.6777
## Proportion Var 0.2923 0.2021 0.1362 0.0616
## Cumulative Var 0.2923 0.4944 0.6306 0.6922
## Proportion Explained 0.4222 0.2919 0.1968 0.0890
33
## Cumulative Proportion 0.4222 0.7142 0.9110 1.0000
##
## Mean item complexity = 1.9
## Test of the hypothesis that 4 factors are sufficient.
##
## The degrees of freedom for the null model are 55 and the objective fu
nction was 6.5531 with Chi Square of 619.2726
## The degrees of freedom for the model are 17 and the objective function
was 0.3297
##
## The root mean square of the residuals (RMSR) is 0.017
## The df corrected root mean square of the residuals is 0.0306
##
## The harmonic number of observations is 100 with the empirical chi squa
re 3.1886 with prob < 0.9999
## The total number of observations was 100 with Likelihood Chi Square =
30.2733 with prob < 0.02444
##
## Tucker Lewis Index of factoring reliability = 0.92146
## RMSEA index = 0.09639 and the 90 % confidence intervals are 0.03169
0.13934
## BIC = -48.0146
## Fit based upon off diagonal values = 0.9974
## Measures of factor score adequacy
## PA1 PA2 PA3
## Correlation of (regression) scores with factors 0.9806 0.9738 0.9528
## Multiple R square of scores with factors 0.9616 0.9483 0.9078
## Minimum correlation of possible factor scores 0.9232 0.8966 0.8155
## PA4
## Correlation of (regression) scores with factors 0.8825
## Multiple R square of scores with factors 0.7789
## Minimum correlation of possible factor scores 0.5577
fa.diagram(Unrotate)
34
Unrotate$loadings
##
## Loadings:
## PA1 PA2 PA3 PA4
## ProdQual 0.201 -0.408 0.463
## Ecom 0.290 0.659 0.270 0.216
## TechSup 0.278 -0.381 0.738 -0.166
## CompRes 0.862 -0.255 -0.184
## Advertising 0.286 0.457 0.129
## ProdLine 0.689 -0.453 -0.142 0.315
## SalesFImage 0.395 0.801 0.346 0.251
## ComPricing -0.232 0.553 -0.286
## WartyClaim 0.379 -0.324 0.735 -0.153
## OrdBilling 0.747 -0.175 -0.181
## DelSpeed 0.895 -0.303 -0.198
##
## PA1 PA2 PA3 PA4
## SS loadings 3.215 2.223 1.499 0.678
## Proportion Var 0.292 0.202 0.136 0.062
## Cumulative Var 0.292 0.494 0.631 0.692
##Appendix21
Rotate=fa(newdata2,nfactors=4,rotate="varimax",fm="pa")
print(Rotate,digits=4)
35
## Factor Analysis using method = pa
## Call: fa(r = newdata2, nfactors = 4, rotate = "varimax", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PA1 PA2 PA3 PA4 h2 u2 com
## ProdQual 0.0240 -0.0700 0.0157 0.6470 0.4243 0.57570 1.027
## Ecom 0.0676 0.7874 0.0279 -0.1132 0.6382 0.36183 1.059
## TechSup 0.0198 -0.0252 0.8832 0.1164 0.7946 0.20539 1.037
## CompRes 0.8977 0.1295 0.0535 0.1317 0.8428 0.15719 1.093
## Advertising 0.1662 0.5300 -0.0429 -0.0624 0.3142 0.68579 1.239
## ProdLine 0.5255 -0.0353 0.1273 0.7118 0.8003 0.19971 1.922
## SalesFImage 0.1154 0.9715 0.0635 -0.1345 0.9792 0.02076 1.076
## ComPricing -0.0757 0.2129 -0.2089 -0.5904 0.4433 0.55673 1.566
## WartyClaim 0.1026 0.0566 0.8851 0.1280 0.8135 0.18647 1.078
## OrdBilling 0.7682 0.1267 0.0882 0.0887 0.6218 0.37818 1.109
## DelSpeed 0.9487 0.1852 -0.0049 0.0874 0.9420 0.05796 1.094
##
## PA1 PA2 PA3 PA4
## SS loadings 2.6349 1.9671 1.6409 1.3714
## Proportion Var 0.2395 0.1788 0.1492 0.1247
## Cumulative Var 0.2395 0.4184 0.5675 0.6922
## Proportion Explained 0.3460 0.2583 0.2155 0.1801
## Cumulative Proportion 0.3460 0.6044 0.8199 1.0000
##
## Mean item complexity = 1.2
## Test of the hypothesis that 4 factors are sufficient.
##
## The degrees of freedom for the null model are 55 and the objective fu
nction was 6.5531 with Chi Square of 619.2726
## The degrees of freedom for the model are 17 and the objective function
was 0.3297
##
## The root mean square of the residuals (RMSR) is 0.017
## The df corrected root mean square of the residuals is 0.0306
##
## The harmonic number of observations is 100 with the empirical chi squa
re 3.1886 with prob < 0.9999
## The total number of observations was 100 with Likelihood Chi Square =
30.2733 with prob < 0.02444
##
## Tucker Lewis Index of factoring reliability = 0.92146
## RMSEA index = 0.09639 and the 90 % confidence intervals are 0.03169
0.13934
## BIC = -48.0146
## Fit based upon off diagonal values = 0.9974
## Measures of factor score adequacy
## PA1 PA2 PA3
## Correlation of (regression) scores with factors 0.9819 0.9861 0.9396
## Multiple R square of scores with factors 0.9641 0.9724 0.8828
## Minimum correlation of possible factor scores 0.9281 0.9448 0.7657
## PA4
## Correlation of (regression) scores with factors 0.8816
36
## Multiple R square of scores with factors 0.7772
## Minimum correlation of possible factor scores 0.5545
fa.diagram(Rotate)
Rotate$loadings
##
## Loadings:
## PA1 PA2 PA3 PA4
## ProdQual 0.647
## Ecom 0.787 -0.113
## TechSup 0.883 0.116
## CompRes 0.898 0.130 0.132
## Advertising 0.166 0.530
## ProdLine 0.525 0.127 0.712
## SalesFImage 0.115 0.971 -0.135
## ComPricing 0.213 -0.209 -0.590
## WartyClaim 0.103 0.885 0.128
## OrdBilling 0.768 0.127
## DelSpeed 0.949 0.185
##
## PA1 PA2 PA3 PA4
## SS loadings 2.635 1.967 1.641 1.371
## Proportion Var 0.240 0.179 0.149 0.125
## Cumulative Var 0.240 0.418 0.568 0.692
37
##Appendix22
#Data for all rows
head(Rotate$scores)
## PA1 PA2 PA3 PA4

## [1,] -0.1338871 0.9175166 -1.719604873 0.09135411
## [2,] 1.6297604 -2.0090053 -0.596361722 0.65808192
## [3,] 0.3637658 0.8361736 0.002979966 1.37548765
## [4,] -1.2225230 -0.5491336 1.245473305 -0.64421384
## [5,] -0.4854209 -0.4276223 -0.026980304 0.47360747
## [6,] -0.5950924 -1.3035333 -1.183019401 -0.95913571
Zdata=cbind(newdata[12],Rotate$scores)
##Appendix23
#Naming the new columns
names(Zdata)=c("Satisfaction","Product_Purchase","Marketing","After_Sales"
,"Positioning")
head(Zdata)
## Satisfaction Product_Purchase Marketing After_Sales Positioning

## 1 8.2 -0.1338871 0.9175166 -1.719604873 0.09135411
## 2 5.7 1.6297604 -2.0090053 -0.596361722 0.65808192
## 3 8.9 0.3637658 0.8361736 0.002979966 1.37548765
## 4 4.8 -1.2225230 -0.5491336 1.245473305 -0.64421384
## 5 7.1 -0.4854209 -0.4276223 -0.026980304 0.47360747
## 6 4.7 -0.5950924 -1.3035333 -1.183019401 -0.95913571
##Appendix24
set.seed(100)
#Creating two datasets to test and train the model
sample=sample(1:nrow(Zdata),0.7*nrow(Zdata))
Train=subset(Zdata,sample=T)
Test=subset(Zdata,sample=F)
MLTrain=lm(Satisfaction~.,Train)
summary(MLTrain)
##
## Call:
## lm(formula = Satisfaction ~ ., data = Train)
##
## Residuals:
## -1.7125 -0.4708 0.1024 0.4158 1.3483
##
38
## Coefficients:
## (Intercept) 6.91800 0.06696 103.317 < 2e-16 ***
## Product_Purchase 0.57963 0.06857 8.453 3.32e-13 ***
## Marketing 0.61978 0.06834 9.070 1.61e-14 ***
## After_Sales 0.05692 0.07173 0.794 0.429
## Positioning 0.61168 0.07656 7.990 3.16e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## F-statistic: 54.66 on 4 and 95 DF, p-value: < 2.2e-16
##Appendix25
MLR=lm(Satisfaction~Product_Purchase+Marketing+Positioning,data=Zdata)
summary(MLR)
##
## Call:
## lm(formula = Satisfaction ~ Product_Purchase + Marketing + Positioning,
## data = Zdata)
##
## Residuals:
## -1.68988 -0.46632 0.08656 0.41138 1.38575
##
## Coefficients:
## (Intercept) 6.91800 0.06683 103.517 < 2e-16 ***
## Product_Purchase 0.57944 0.06844 8.466 2.90e-13 ***
## Marketing 0.62068 0.06819 9.102 1.27e-14 ***
## Positioning 0.61488 0.07630 8.058 2.14e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## F-statistic: 72.96 on 3 and 96 DF, p-value: < 2.2e-16
39

Assignment 2 - Factor Hair

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Assignment 2 - Factor Hair

Uploaded by

Copyright:

Available Formats

Project – 2 Factor Hair Case Study

S.No Content Page No.

I. Importing the dataset in R

1.1 EDA - Basic data summary, Univariate, Bivariate analysis, graphs.

1.0.2 Environment Set up and Data Import

Install necessary Packages and Invoke Libraries

Please refer Appendix 4 for source code.

Set up working directory

Please refer Appendix 1 for source code.

Import and Read the Dataset

We have named the dataset in the problem as ‘mydata’.

Please refer Appendix 2 for source code.

1.1 EDA - Basic data summary, Univariate, Bivariate analysis, graphs

Please refer Appendix 3 for source code.

summary – Provides us deeper insight about the dataset which is self-explanatory.

Please refer Appendix 5 for source code.

attach – Please refer Appendix 8 for source code.

Please refer to Appendix 9 for source code.

Please refer Appendix 11 for source code.

‘hist’ – This function is used for plotting histograms.

Please refer Appendix 11 for source code.

The resulting graph is displayed below.

‘plot’ – Generic function for plotting of R objects.

Please refer Appendix 12 for source code.

The resulting graph is displayed below.

The number of missing in the dataset come out to be zero.

Please refer Appendix 14 for the source code.

The summary of the dataset has already been checked in 1.1

The resulting graph is displayed below.

Please refer Appendix 15 for the source code.

As per our observations from the graph, we infer the following.

 Order Billing and Delivery Speed are highly correlated

Please refer Appendix 16 for the source code.

3 Simple Linear Regression (with every variable)

Please refer Appendix 17 for the source code and results.

Please refer Appendix for source code.

The resulting graph is displayed below.

As we can see in the graph, the number if 4.

Please refer Appendix 17, 18, 19 and 20 for source code.

Factor 1 = for 29.20% of the variance;

Factor 2 = 20.20% of the variance;

Factor 4 = 6.2% of the variance.

All four factors together account for 69% of the variance.

Please refer Appendix 20 for source code.

The four factors are now named as mentioned below.

Variables Factors Renamed to Explanation

5.1 Create a data frame with a minimum of 5 columns, 4 of which are

We also name them as mentioned in the answer of 4.2.

Please refer Appendix for source code.

5.2 Perform Multiple Linear Regression with Customer Satisfaction as

Please refer to Appendix 25 and 26 for the same.

The explanation of the summary is mentioned below.

5.4 Output Interpretation

 Uni and Bi-variate analysis with every variable

 Check and remedy for multi-collinearity

 Simple Linear Regression with every variable

## [1] "D:/learning/BABI Online"

## 'data.frame': 100 obs. of 13 variables:

## corrplot 0.84 loaded

## -- Attaching packages -------------------------------------------------

## v ggplot2 3.2.1 v purrr 0.3.2

## The following objects are masked from 'package:ggplot2':

## Loading required package: carData