You are on page 1of 20

PROJECT 2

ADVANCED STATISTICS
MODULE

FACTOR HAIR REVISED

ASHWINI KRISHNAN

The project contains 5 Problem statements.


Below herein each problem is addressed followed by the answer.

Problem 1.

1. Perform exploratory data analysis on the dataset. Showcase some charts, graphs.
Check for outliers and missing values.

As a preliminary analysis of the data set given, we observe that the number of data in it is
large.

I took a dual approach in trying to identify the outliers in the data in order to double check
the validity of the same.

Data Check for Outliers in R:

In order to find outliers, an initial analysis was done to identify the IQR. The same was done
applied formula wise for each parameter in R.
In order to systematically establish the outliers, I carried forward the same to excel to help
establish the outliers faster
The resultant of which showed that the outliers exist in the following categories:

1. Ecom
2. TechSup
3. CompRes
4. SalesFImage
5. OrdBilling
6. DelSpeed

With majority of the outliers exisiting in TechSup and CompRes .


Find below the boxplot of Factors with Outliers and Factors without Outliers

For the purpose of an all-round solution to the problems further and for preservance of data,
outliers are not removed from this dataset.
Problem 2.

2. Is there evidence of multicollinearity? Showcase your analysis

In order to find multicollinearity, we are analyzing in R the dataset given to identify any
correlation factors as shown below:

From an initial test, it does not seem like the factors are correlated highly. Furthermore, we
proceed to do a test to confirm the presence of multicollinearity.
In the above test, we find that the multiple R Squared is very less – 61.52.
The possibility of a good correlation is significantly lower in our dataset is also represented
by the value of adjusted R Squared whose value is just 56.21

In discovering ideal methods to find the possibility of multicollinearity, a certain function in


R called VIF also known as Variance Inflation Factor was used to identify the same.
Under this function, we can determine how one variable is affected by another variable’s
variance.

As per the standard nomenclature, the VIF factor of 5 and above signifies high correlation
between variables.

In the test conducted below using function VIF, we see that for each factor, the value returns
are less than 5 except for two factors namely ProdLine and DelSpeed.

From the above factors which indicate a very low R Squared and VIF values, we can
conclude that there exists no multicollinearity between the said variables.

Problem 3.

3. Perform simple linear regression for the dependent variable with every independent
variable 

From the given 12 variables, Customer Satisfaction is the dependent variable.


Below given is the Linear Regression Model performed in R against all independent variable.
Problem 4.

4. Perform PCA/Factor analysis by extracting 4 factors. Interpret the output and name
the Factors
Based on rotational factors, we can identify the four important factors of Customer
Satisfaction, which has been the main principal dependent component are:

1. Delivery Speed, Complain Resolution, Order of Billing have high loading on factor 1,
this shows that these are the main factors contribute towards Product purchase
events
2. Sales Function Image, E-Commerce and Advertising have high loading on factor 2,
which shows that these are the main factors contributing towards online portal
purchases.
3. Technical Support and Warranty claim have high loading on factor 3, which shows
that these are the main factors contributing towards after purchase scenarios.
4. Product Quality has high loading on factor 4 which shows that it’s a key factor
contributing before purchase by a customer.

These are factors arrived at when we do the rotation analysis.

The unrotated analysis gives us the following factors:

1. Complain Resolution
2. Sales Function Image
3. Technical Support
4. Product Quality

By a narrow margin in an unrotated analysis, Complain resolution beats Delivery speed as


one of the top factors.
When we rotate the analysis, interpretation of data is more comprehensible, thereby it
eliminates any element of doubt as to choose the factors wisely.

Problem 4.

4. Perform Multiple linear regression with customer satisfaction as dependent variables and
the four factors as independent variables. Comment on the Model output and validity. Your
remarks should make it meaningful for everybody

Using the four major factors – DelSpeed, TechSup, SalesFImage and ProdQual, we have
arrived at a Multiple Linear Regression model against the dependent variable which is
Satisfaction.
Interpretation :

The summary table shows the Multiple R-squared: 0.7505, Adjusted R-squared: 0.74
respectively. This explains that the nearly 75% of variance of the data can be related to the
model’s output.

The coefficients table usually gives us the summary of the effect of each variable on the
dependent variable.
If the customer satisfaction decreases by 1, the major contributing factor towards it will be
the delivery speed and the sales image. Although in our factor analysis, it shows that
technical support and product quality as 2 of the other major factors that ultimately influences
the customer satisfaction, it shows here that it adds up only to 0.032 and 0.46 per cent of the
total contributing levels.
The positive or negative signs on the coefficient explains to us the increase or decrease
respectively of the independent variable towards the dependent variable.

Overall significance of the model could be interpreted as relatively less considering that R
squared does not amount to covering major data variance of the given data set. However,
given outliers predicted in problem statement 1, and the fact that only 4 major factors are
considered as dependent variable here, an R squared of 75 is acceptable.

However, since the p value is less than 5%, we are given to understand that the model is
perfectly valid in helping us understand that the customer satisfaction level is influenced
majorly by the 4 dependent factors established.

You might also like