
Advanced Statistics Project

MADHUR DHANANIWALA
PGP-DSBA Online December_C 2021
Date: 03/04/2022
Table of Contents

PROBLEM 1 A
Introduction
Sample of the Dataset
1.1 State the null and the alternate hypothesis for conducting one-way ANOVA for both Education and Occupation individually.
1.2 Perform one-way ANOVA for Education with respect to the variable 'Salary'. State whether the null hypothesis is accepted or rejected based on the ANOVA results.
1.3 Perform one-way ANOVA for variable Occupation with respect to the variable 'Salary'. State whether the null hypothesis is accepted or rejected based on the ANOVA results.

PROBLEM 1 B
1.5 What is the interaction between the two treatments? Analyze the effects of one variable on the other (Education and Occupation) with the help of an interaction plot.
1.6 Perform a two-way ANOVA based on Education and Occupation (along with their interaction Education*Occupation) with the variable 'Salary'. State the null and alternative hypotheses and state your results. How will you interpret this result?
1.7 Explain the business implications of performing ANOVA for this particular case study.

PROBLEM 2
Introduction
Sample of the Dataset
2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed]. What insight do you draw from the EDA?
2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling.
2.3 Comment on the comparison between the covariance and the correlation matrices from this data [on scaled data].
2.4 Check the dataset for outliers before and after scaling. What insight do you derive here?
2.5 Extract the eigenvalues and eigenvectors. [Using Sklearn PCA Print Both]
2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame with the original features.
2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two places of decimals only). [hint: write the linear equation of PC in terms of eigenvectors and corresponding features]
2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the optimum number of principal components? What do the eigenvectors indicate?
2.9 Explain the business implication of using the Principal Component Analysis for this case study. How may PCs help in the further analysis? [Hint: Write Interpretations of the Principal Components Obtained]
List of Figures

PROBLEM 1 B
Fig 1: Interaction plot (Education and Occupation)

PROBLEM 2
Fig 2: Displot of 'Apps'
Fig 3: Boxplot of 'Apps'
Fig 4: Displot of 'Accept'
Fig 5: Boxplot of 'Accept'
Fig 6: Displot of 'Enroll'
Fig 7: Boxplot of 'Enroll'
Fig 8: Displot of 'Top10perc'
Fig 9: Boxplot of 'Top10perc'
Fig 10: Displot of 'Top25perc'
Fig 11: Boxplot of 'Top25perc'
Fig 12: Displot of 'F.Undergrad'
Fig 13: Boxplot of 'F.Undergrad'
Fig 14: Displot of 'P.Undergrad'
Fig 15: Boxplot of 'P.Undergrad'
Fig 16: Displot of 'Outstate'
Fig 17: Boxplot of 'Outstate'
Fig 18: Displot of 'Room.Board'
Fig 19: Boxplot of 'Room.Board'
Fig 20: Displot of 'Books'
Fig 21: Boxplot of 'Books'
Fig 22: Displot of 'Personal'
Fig 23: Boxplot of 'Personal'
Fig 24: Displot of 'PhD'
Fig 25: Boxplot of 'PhD'
Fig 26: Displot of 'Terminal'
Fig 27: Boxplot of 'Terminal'
Fig 28: Displot of 'S.F.Ratio'
Fig 29: Boxplot of 'S.F.Ratio'
Fig 30: Displot of 'perc.alumni'
Fig 31: Boxplot of 'perc.alumni'
Fig 32: Displot of 'Expend'
Fig 33: Boxplot of 'Expend'
Fig 34: Displot of 'Grad.Rate'
Fig 35: Boxplot of 'Grad.Rate'
Fig 36: Pairplot
Fig 37: Heatmap
Fig 38: Boxplot before scaling
Fig 39: Boxplot after scaling
Fig 40: Scree plot
List of Tables

PROBLEM 1 A
Table 1: Dataset sample
Table 2: One-way ANOVA (Edu & salary)
Table 3: One-way ANOVA (Occupation & salary)

PROBLEM 1 B
Table 4: Two-way ANOVA (Education & Occupation)

PROBLEM 2
Table 5: Dataset sample
Table 6: z score
Table 7: Covariance matrix
Table 8: Correlation
Table 9: Eigenvectors
Table 10: Eigenvalues
Table 11: PCA exported into a data frame
Table 12: Cumulative values
Table 13: 5 principal components
PROBLEM 1 A

Introduction
Salary is hypothesized to depend on educational qualification and occupation. To understand this dependency, the salaries of 40 individuals were collected, and each person's educational qualification and occupation were noted. Educational qualification has three levels: High school graduate, Bachelor's, and Doctorate. Occupation has four levels: Administrative and clerical, Sales, Professional or specialty, and Executive or managerial. The number of observations differs across the education and occupation combinations.

Sample of the Dataset:

Below is the output from Python:

Table 1: Dataset sample

1.1) State the null and the alternate hypothesis for conducting
one-way ANOVA for both Education and Occupation
individually.

ANSWER)

One-way ANOVA (Education):

• Null hypothesis: The mean salary is the same across all three categories of education (Doctorate, Bachelors, HS-grad).
• Alternative hypothesis: The mean salary is different in at least one category of education.

One-way ANOVA (Occupation):

• Null hypothesis: The mean of the 'Salary' variable is equal across all four categories of occupation.
• Alternative hypothesis: The mean of the 'Salary' variable differs for at least one category of occupation.

1.2) Perform one-way ANOVA for Education with respect to the variable 'Salary'. State whether the null hypothesis is accepted or rejected based on the ANOVA results.

Answer)

As we can see from the Python output in Table 2, the p-value = 1.257709e-08 is less than the significance level (alpha = 0.05), so we reject the null hypothesis and conclude that there is a significant difference in the mean salaries for at least one category of education.

Table 2: One-way ANOVA (Edu & salary)
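A minimal sketch of how these one-way ANOVA tables can be produced with statsmodels, assuming the salary data is loaded into a DataFrame with columns 'Education', 'Occupation' and 'Salary' (the file name below is illustrative); the second model is the one used for Occupation in question 1.3:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load the salary data (file name assumed for illustration)
df = pd.read_csv("SalaryData.csv")

# One-way ANOVA: Salary explained by Education
model_edu = ols("Salary ~ C(Education)", data=df).fit()
print(sm.stats.anova_lm(model_edu, typ=1))

# One-way ANOVA: Salary explained by Occupation (question 1.3)
model_occ = ols("Salary ~ C(Occupation)", data=df).fit()
print(sm.stats.anova_lm(model_occ, typ=1))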

1.3) Perform one-way ANOVA for variable Occupation with
respect to the variable ‘Salary’. State whether the null
hypothesis is accepted or rejected based on the ANOVA
results.

Answer)

As we can see from the Python output in Table 3, the p-value = 0.458508 is greater than the significance level (alpha = 0.05), so we fail to reject the null hypothesis and conclude that there is no significant difference in the mean salaries across the four categories of occupation.

Table 3: One-way ANOVA (Occupation & salary)

PROBLEM 1 B

1.5) What is the interaction between the two treatments?
Analyze the effects of one variable on the other
(Education and Occupation) with the help of an
interaction plot.
Answer)

Fig 1: Interaction plot (Education and Occupation)

The interaction plot above shows that there is a significant amount of interaction between the categorical variables Education and Occupation. A sketch of how such a plot can be generated is given after the observations below.

The following are some of the observations from the interaction plot:

• People with an HS-grad education do not reach the Exec-managerial position; they hold only Adm-clerical, Sales and Prof-specialty occupations.
• People with a Bachelors or Doctorate education in the Adm-clerical and Sales occupations earn almost the same salaries, ranging from 170000 to 190000.
• People with a Bachelors education in Prof-specialty earn less than people with a Bachelors education in Adm-clerical and Sales.
• People with a Bachelors education in Sales earn more than people with a Bachelors education in Prof-specialty, whereas people with a Doctorate in Sales earn less than people with a Doctorate in Prof-specialty. We see a reversal in this part of the plot.
• Similarly, people with a Bachelors education in Prof-specialty earn less than people with a Bachelors education in Exec-managerial, whereas people with a Doctorate in Prof-specialty earn more than people with a Doctorate in Exec-managerial. There is a reversal in this part of the plot too.
• Sales people with a Bachelors or Doctorate education earn about the same salaries and earn more than those with an HS-grad education.
• Adm-clerical people with an HS-grad education earn the lowest salaries compared to those with a Bachelors or Doctorate.
• Prof-specialty people with a Doctorate earn the maximum salaries, and those with an HS-grad education earn the minimum.
• People with an HS-grad education earn the minimum salaries overall.
• There are no people with an HS-grad education who hold an Exec-managerial occupation.
• People with a Bachelors education in the Sales and Exec-managerial occupations earn about the same salaries.
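One way such an interaction plot can be generated is with statsmodels' interaction_plot; a sketch assuming the same DataFrame and column names as in the ANOVA sketch above:

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.factorplots import interaction_plot

df = pd.read_csv("SalaryData.csv")  # file name assumed for illustration

# Mean Salary at each Education level, one line per Occupation
fig, ax = plt.subplots(figsize=(10, 6))
interaction_plot(x=df["Education"], trace=df["Occupation"],
                 response=df["Salary"], ax=ax)
ax.set_ylabel("Mean Salary")
plt.show()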
1.6) Perform a two-way ANOVA based on the Education and
Occupation (along with their interaction Education*Occupation)
with the variable ‘Salary’. State the null and alternative hypotheses
and state your results. How will you interpret this result?

Answer)

NULL HYPOTHESIS:
The effect of the independent variable 'Education' on the mean 'Salary' does not depend on the effect of the other independent variable 'Occupation' (i.e. there is no interaction effect between the two independent variables, Education and Occupation).

ALTERNATIVE HYPOTHESIS:
There is an interaction effect between the independent variable 'Education' and the independent variable 'Occupation' on the mean 'Salary'.

Performing the two-way ANOVA gives the following output from Python:

Table 4: Two-way ANOVA (Education & Occupation)

From the table, we see that there is a significant amount of interaction between the variables Education and Occupation. As the p-value of the interaction term = 2.232500e-05 is less than the significance level (alpha = 0.05), we reject the null hypothesis. Thus, there is an interaction effect between education and occupation on the mean salary.
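A sketch of this two-way ANOVA with statsmodels, under the same column-name assumptions as the earlier sketches; the formula includes both main effects and the Education*Occupation interaction:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv("SalaryData.csv")  # file name assumed for illustration

# Two-way ANOVA: main effects plus the interaction term
model = ols("Salary ~ C(Education) + C(Occupation) + C(Education):C(Occupation)",
            data=df).fit()
print(sm.stats.anova_lm(model, typ=2))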

1.7) Explain the business implications of performing
ANOVA for this particular case study.
Answer)

The business implication of performing ANOVA for this particular case study is that, using the ANOVA results together with the interaction plot, we see that education combined with occupation leads to higher and better salaries. It is clearly seen that people with a Doctorate draw the maximum salaries and people with an HS-grad education earn the least. Thus, we can conclude that salary depends on both educational qualification and occupation.

PROBLEM 2

The dataset Education - Post 12th Standard.csv contains
information on various colleges. You are expected to do a Principal
Component Analysis for this case study according to the
instructions given. The data dictionary of the 'Education - Post 12th
Standard.csv' can be found in the following file: Data
Dictionary.xlsx.

Introduction
The given dataset contains one record per university or college, with the following variables: the number of applications received, accepted and enrolled; the percentage of new students from the top 10% and from the top 25% of their higher secondary class; the number of full-time and part-time undergraduate students; out-of-state tuition; the cost of room and board; estimated book costs and estimated personal spending per student; the percentage of faculty with a Ph.D. and with a terminal degree; the student/faculty ratio; the percentage of alumni who donate; the instructional expenditure per student; and the graduation rate.

Sample of the Dataset:

Below is the output from Python:

Table 5: Dataset sample

2.1) Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed]. What insight do you draw from the EDA?
Answer)
• The dataset consists of 777 rows and 18 columns.
• The 'Names' field is an object data type, 'S.F.Ratio' is a float, and the remaining 16 fields are integer numeric fields.
• There are no missing values in the dataset.

Now we are going to perform a univariate analysis on the dataset.

Univariate analysis helps us understand the distribution of the data in the dataset. With univariate analysis, we can find patterns and summarize the data.
1. Univariate analysis of "Apps":

Fig 2: Displot of 'Apps'    Fig 3: Boxplot of 'Apps'

Observation:

The box plot of the 'Apps' variable shows outliers and the distribution of the data is skewed. We can also see that most colleges or universities receive applications in the range of 3000 to 5000, while the maximum number of applications is around 50,000.

For the univariate analysis we use box plots and dist plots to find information or patterns in the data. From the box plot we can clearly see that there are outliers in this variable.
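The displot/boxplot pairs shown for each variable here can be produced with a short seaborn loop; a sketch, assuming the data is read from 'Education - Post 12th Standard.csv' (histplot with a KDE curve is used as the equivalent of the displot):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

edu = pd.read_csv("Education - Post 12th Standard.csv")
numeric_cols = edu.select_dtypes(include="number").columns

# One distribution plot and one box plot per numeric variable
for col in numeric_cols:
    fig, axes = plt.subplots(1, 2, figsize=(10, 3))
    sns.histplot(edu[col], kde=True, ax=axes[0])
    sns.boxplot(x=edu[col], ax=axes[1])
    fig.suptitle(col)
    plt.show()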
2. Univariate analysis on "Accept":

Fig 4: Displot of 'Accept'    Fig 5: Boxplot of 'Accept'

Observation:

The 'Accept' variable seems to have outliers. The dist plot shows that for the majority of universities the number of accepted applications is in the range of 70 to 1500. The 'Accept' variable is positively skewed.

3. Univariate analysis on "Enroll":

Fig 6: Displot of 'Enroll'    Fig 7: Boxplot of 'Enroll'

Observation:

The box plot of the 'Enroll' variable also has outliers. The distribution of the data is positively skewed. From the dist plot we can see that the majority of colleges enroll between 200 and 500 students.

4. Univariate analysis on "Top10perc":

Fig 8: Displot of 'Top10perc'    Fig 9: Boxplot of 'Top10perc'

Observation:

The box plot of the percentage of students from the top 10% of their higher secondary class shows outliers. The distribution is positively skewed. There is a good amount of intake, about 30 to 50 percent, of new students from the top 10% of the higher secondary class.

5. Univariate analysis on "Top25perc":

Fig 10: Displot of 'Top25perc'    Fig 11: Boxplot of 'Top25perc'

Observation:

The box plot for 'Top25perc' has no outliers. The distribution is approximately normal. A majority of the new students come from the top 25% of the higher secondary class.

6. Univariate analysis on "F.Undergrad" (full-time undergraduates):

Fig 12: Displot of 'F.Undergrad'    Fig 13: Boxplot of 'F.Undergrad'

Observation:

The box plot of the full-time undergraduates has outliers. The distribution of the data is positively skewed. Most universities have around 3000 to 5000 full-time undergraduates.
7. Univariate analysis on "P.Undergrad" (part-time undergraduates):

Fig 14: Displot of 'P.Undergrad'    Fig 15: Boxplot of 'P.Undergrad'

Observation:

The box plot of the part-time undergraduates has outliers. The distribution of the data is positively skewed. Most universities have around 1000 to 3000 part-time undergraduates.

8. Univariate analysis on "Outstate":

Fig 16: Displot of ‘Outstate’ Fig 17: Boxplot of ‘Outstate’

Observation:

The box plot of 'Outstate' has only one outlier. The distribution is approximately normal.

9. Univariate analysis on "Room.Board":

Fig 18: Displot of 'Room.Board'    Fig 19: Boxplot of 'Room.Board'

Observation:

The 'Room.Board' variable has a few outliers. The distribution is approximately normal.

10. Univariate analysis on "Books":

Fig 20 : Displot of ‘Books’ Fig 21: Boxplot of ‘Books’

Observation:

The box plot of 'Books' has outliers. The distribution appears to be bimodal. The estimated cost of books per student is mostly in the range of 500 to 1000.

11. Univariate analysis on "Personal":

Fig 22: Displot of 'Personal'    Fig 23: Boxplot of 'Personal'

Observation:

The box plot of 'Personal' has outliers; some students' personal expenses are much higher than the rest. The distribution is positively skewed.

12. Univariate analysis on "PhD":

Fig 24 : Displot of ‘PhD’ Fig 25: Boxplot of ‘PhD’

Observation:

The box plot of 'PhD' has outliers. The distribution is negatively skewed.

13. Univariate analysis on "Terminal":

Fig 26 : Displot of ‘Terminal’ Fig 27: Boxplot of ‘Terminal’

Observation:

The box plot of 'Terminal' has outliers. The distribution is also negatively skewed.

14. Univariate analysis on "S.F.Ratio":

Fig 28: Displot of 'S.F.Ratio'    Fig 29: Boxplot of 'S.F.Ratio'

Observation:

The 'S.F.Ratio' variable also has outliers. The distribution is approximately normal. The student/faculty ratio is broadly similar across the universities and colleges.

15. Univariate analysis on "perc.alumni":

Fig 30: Displot of ‘perc.alumni’ Fig 31: Boxplot of ‘perc.alumni’

Observation:

The box plot of 'perc.alumni' has outliers. The distribution is approximately normal.

16. Univariate analysis on "Expend":

Fig 32: Displot of ‘Expend’ Fig 33: Boxplot of ‘Expend’


Observation:

The 'Expend' variable also has outliers. The distribution of the expenditure is positively skewed.

17. Univariate analysis on "Grad.Rate":

Fig 34: Displot of 'Grad.Rate'    Fig 35: Boxplot of 'Grad.Rate'

Observation:

For most universities the graduation rate is above 60%. The box plot of 'Grad.Rate' has outliers. The distribution is approximately normal.

Now we are going to perform a multivariate analysis on the dataset.

The pair plot helps us understand the relationship between all the numerical variables in the dataset. By comparing all the variables with each other, we can identify patterns and trends in the data.

Fig 36 : Pairplot

The heat map gives us the correlation between each pair of numerical variables.

We can see that the 'Apps' variable is highly positively correlated with 'Accept', 'Enroll' and 'F.Undergrad'. This relationship suggests that when a student submits an application, it is often accepted and the student enrolls as a full-time undergraduate.

There is a negative correlation between 'Apps' and 'perc.alumni', which indicates that not all students later become donating alumni of their college or university. 'Apps' is positively correlated with 'Top10perc', 'Top25perc', 'Outstate', 'Room.Board', 'Books', 'Personal', 'PhD', 'Terminal', 'S.F.Ratio', 'Expend' and 'Grad.Rate'.

Fig 37: Heatmap
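A sketch of how the pair plot and the correlation heat map above can be generated (file name as assumed earlier):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

edu = pd.read_csv("Education - Post 12th Standard.csv")
num = edu.select_dtypes(include="number")

# Pair plot of all numeric variables
sns.pairplot(num, diag_kind="kde")
plt.show()

# Heat map of the correlation matrix, annotated with the coefficients
plt.figure(figsize=(12, 9))
sns.heatmap(num.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()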

2.2) Is scaling necessary for PCA in this case? Give justification and perform scaling.

Answer)

Yes, scaling is necessary to standardize the data before performing PCA. PCA computes a new projection of the data set, and the new axes are based on the variability of the variables. A variable with a high standard deviation will therefore get a higher weight in the calculation of the axes than a variable with a low standard deviation. If we standardize the data, all variables have the same standard deviation, so all variables get the same weight and PCA finds the relevant axes.

Before scaling, I dropped the 'Names' variable, which is categorical. Now that the dataset consists of only numerical values, I applied the z-score method for this case study. We could also use the min-max method to scale the variables.

The remaining 17 numerical columns are on very different scales. For example, 'Apps', 'Accept', 'Enroll', 'F.Undergrad' and 'P.Undergrad' are counts of students; 'Top10perc' and 'Top25perc' are percentages of students; 'Outstate', 'Room.Board', 'Books', 'Personal' and 'Expend' are amounts of money; 'PhD', 'Terminal', 'perc.alumni' and 'Grad.Rate' are percentages relating to faculty, alumni and graduates; and 'S.F.Ratio' is a ratio of students to faculty.

Table 6: z score

The z-score tells us how many standard deviations a data point is away from the mean, and in which direction. We can see that all the variables have now been scaled using the z-score function. Scaling is one of the most important steps to perform before building models.
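A sketch of the z-score scaling described above, assuming the 'Names' column name; sklearn's StandardScaler would give an equivalent result:

import pandas as pd
from scipy.stats import zscore

edu = pd.read_csv("Education - Post 12th Standard.csv")
num = edu.drop(columns=["Names"])   # drop the categorical field (column name assumed)

# z-score scaling: (x - mean) / standard deviation for every column
scaled = num.apply(zscore)
print(scaled.head())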

2.3) Comment on the comparison between the covariance and the correlation matrices from this data [on scaled data].

Answer)

• Both the covariance matrix and the correlation matrix measure the relationship and the dependency between pairs of variables.
• Scaling only changes the representation of the dataset; the relationships between the variables do not change. We are bringing all the variables onto one common unit.
• Covariance indicates the direction of the linear relationship between two variables, i.e. whether they are directly or inversely proportional.
• The snippet below is the covariance matrix of the scaled dataset. We can clearly see that the covariance matrix indicates the direction of the linear relationship between the variables.

Covariance Matrix

Table 7: Covariance matrix


• Correlation measures both the strength (how much?) and the direction of the linear relationship between two variables, i.e. how strongly they are positively or negatively correlated.
• The snippet below is the correlation matrix. It gives the strength and the direction of the relationship between the variables.
• The correlation matrix before scaling and after scaling remains the same.
• On z-score scaled data the covariance matrix and the correlation matrix are essentially identical, because every variable has unit variance after scaling.
• From this snippet, we can identify the variables that are highly positively correlated, the variables that are highly negatively correlated, and the variables that are moderately correlated with each other.
• 'Apps', 'Accept', 'Enroll' and 'F.Undergrad' are highly positively correlated.
• 'Top10perc' and 'Top25perc' are also highly positively correlated.

Table 8: Correlation
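A sketch of how the two matrices can be computed and compared on the scaled data (continuing the assumptions above); after z-scoring they should agree apart from the small n/(n-1) factor in the sample covariance:

import numpy as np
import pandas as pd
from scipy.stats import zscore

edu = pd.read_csv("Education - Post 12th Standard.csv")
scaled = edu.drop(columns=["Names"]).apply(zscore)  # as in the scaling step

# Covariance matrix of the scaled data (pandas uses the unbiased n-1 estimator)
print(scaled.cov().round(2))

# Correlation matrix of the scaled data
print(scaled.corr().round(2))

# The two matrices match element-wise up to the n/(n-1) factor
print(np.allclose(scaled.cov(), scaled.corr(), atol=0.01))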

2.4) Check the dataset for outliers before and after scaling.
What insight do you derive here?

Answer)

Checking the data before scaling:

Fig 38: Boxplot before scaling

Checking the dataset after scaling:


Fig 39: Boxplot after scaling


Observation:

The outliers are still present in the dataset after scaling.

Reason: scaling does not remove outliers; z-score scaling only re-expresses the values as distances from the mean in units of standard deviation. If we wish to treat the outliers for further processing, we can, for example, flag values beyond 3 standard deviations and remove them, or cap/impute them using the IQR method.
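A sketch of the before/after box plots and of one possible IQR-based treatment (capping at the whiskers), continuing the same assumptions:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import zscore

edu = pd.read_csv("Education - Post 12th Standard.csv")
num = edu.drop(columns=["Names"])
scaled = num.apply(zscore)

# Box plots before and after scaling: the outliers remain in both
fig, axes = plt.subplots(2, 1, figsize=(14, 8))
sns.boxplot(data=num, ax=axes[0]).set_title("Before scaling")
sns.boxplot(data=scaled, ax=axes[1]).set_title("After scaling")
plt.tight_layout()
plt.show()

# One possible treatment: cap each column at its IQR whiskers
def cap_outliers(s):
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

treated = scaled.apply(cap_outliers)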

2.5 Extract the eigenvalues and eigenvectors. [Using Sklearn
PCA Print Both]

Answer)

The eigenvalues and eigenvectors are given below:

Eigenvectors:

Table 9: Eigenvectors

Eigenvalues:

Table 10: Eigenvalues
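A sketch of extracting both with sklearn's PCA fitted on the scaled data: pca.components_ holds the eigenvectors (one row per component) and pca.explained_variance_ the corresponding eigenvalues:

import pandas as pd
from scipy.stats import zscore
from sklearn.decomposition import PCA

edu = pd.read_csv("Education - Post 12th Standard.csv")
scaled = edu.drop(columns=["Names"]).apply(zscore)

pca = PCA(n_components=17)   # one component per numeric feature
pca.fit(scaled)

print("Eigenvectors:\n", pca.components_)
print("Eigenvalues:\n", pca.explained_variance_)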

2.6) Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame with the original features

Answer)

PCA is performed and it is exported into a data frame.

Table 11: PCA is performed and it is exported into a data frame
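A sketch of exporting the component loadings into a data frame labelled with the original feature names (continuing from the fitted PCA above):

import pandas as pd

# Loadings of each principal component on the original features
loadings = pd.DataFrame(pca.components_,
                        columns=scaled.columns,
                        index=[f"PC{i+1}" for i in range(pca.n_components_)])
print(loadings.round(2))

# The component scores of each college can be exported in the same way
scores = pd.DataFrame(pca.transform(scaled), columns=loadings.index)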

2.7) Write down the explicit form of the first PC (in terms of
the eigenvectors. Use values with two places of decimals
only). [hint: write the linear equation of PC in terms of
eigenvectors and corresponding features]

Answer)

The linear equation of the first principal component is:

PC1 = 0.249 * Apps + 0.208 * Accept + 0.176 * Enroll + 0.354 * Top10perc + 0.344 * Top25perc + 0.155 * F.Undergrad + 0.026 * P.Undergrad + 0.295 * Outstate + 0.249 * Room.Board + 0.065 * Books - 0.043 * Personal + 0.318 * PhD + 0.317 * Terminal - 0.177 * S.F.Ratio + 0.205 * perc.alumni + 0.319 * Expend + 0.252 * Grad.Rate
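These coefficients can be read off programmatically from the first row of pca.components_; a short sketch (continuing from the fitted PCA) that prints the equation rounded to two decimal places as the question asks:

# Explicit linear form of the first principal component
pc1 = pca.components_[0]
terms = [f"{w:+.2f} * {feat}" for w, feat in zip(pc1, scaled.columns)]
print("PC1 =", " ".join(terms))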

2.8) Consider the cumulative values of the eigenvalues.
How does it help you to decide on the optimum
number of principal components? What do the
eigenvectors indicate?

Answer)

Table 12: Cumulative values

Adding up the explained variance of all the components gives a total of 100%.

To decide on the optimum number of principal components:

1. Check the cumulative variance and find the number of components at which it reaches about 90%.
2. The incremental variance explained by each additional component should not be less than five percent.

On this basis, the optimum number of principal components is around 5 to 6, because beyond that point the incremental variance explained by each additional component falls below 5%. So, we select 5 principal components for this case study.

Table 13: 5 principal components

• The first component explains 32.02% of the variance in the data.
• The first two components explain 58.36% of the variance in the data.
• The first three components explain 65.26% of the variance in the data.
• The first four components explain 71.18% of the variance in the data.
• The first five components explain 76.67% of the variance in the data.

The eigenvectors of the five retained principal components tell us how much each variable contributes to each component; in other words, they are the weights attached to each variable. From the eigenvectors we can see which variables carry more weight and influence in each principal component. PCA reduces the multicollinearity in the data, and with this reduced collinearity we can build models with improved efficiency.

Fig 40: Scree plot
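A sketch of the cumulative explained variance and the scree plot used above to choose the number of components (continuing from the fitted PCA):

import numpy as np
import matplotlib.pyplot as plt

# Cumulative percentage of variance explained by the components
cum_var = np.cumsum(pca.explained_variance_ratio_) * 100
print(cum_var.round(2))

# Scree plot: variance per component together with the cumulative curve
components = range(1, len(cum_var) + 1)
plt.plot(components, pca.explained_variance_ratio_ * 100, "o-", label="Individual")
plt.plot(components, cum_var, "s--", label="Cumulative")
plt.xlabel("Principal component")
plt.ylabel("Explained variance (%)")
plt.legend()
plt.show()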

2.9) Explain the business implication of using the Principal
Component Analysis for this case study. How may PCs
help in the further analysis? [Hint: Write Interpretations
of the Principal Components Obtained]

Answer)

This case study concerns an education dataset that contains the names of various colleges and universities along with their details. To understand the dataset, we performed univariate and multivariate analysis, which gives an understanding of the variables: the distribution of the data, its skewness, and patterns in the dataset. From the multivariate analysis we can see the correlations between the variables, several of which are highly correlated with each other. Scaling standardizes the variables onto one common scale. Outliers are imputed using IQR values, and once the values are imputed we can perform PCA. Principal Component Analysis is used to reduce the multicollinearity between the variables. Depending on the variance explained, we can reduce the number of components; for this business case 5 principal components capture most of the variance of the dataset. Using these components, further analysis and modelling can be carried out on a dataset with reduced multicollinearity.

THE END
