You are on page 1of 42

Advanced Statistics Project

Project: Advanced Statistics
[ ANOVA, EDA AND PCA]

Name – Priyanka Sanjay Patil


PGP-DSBA Online May’ 22
Date: 28/08/2022

1
Advanced Statistics Project

Content
 Problem 1A:
  Problem Statement:………………………………………………………………………………………………………………4

1.1 State the null and the alternate hypothesis for conducting one-way ANOVA for both Education
and Occupation individually……………………………………………………………………………………………………………5

1.2 Perform one-way ANOVA for Education with respect to the variable ‘Salary’. State whether the
null hypothesis is accepted or rejected based on the ANOVA results……………………………………………..5

1.3 Perform one-way ANOVA for variable Occupation with respect to the variable ‘Salary’. State
whether the null hypothesis is accepted or rejected based on the ANOVA results…………………………6

1.4 If the null hypothesis is rejected in either (1.2) or in (1.3), find out which class means are
significantly different. Interpret the result……………………………………………………………………………………..7

 Problem 1B:……………………………………………………………………………………………………………….…8

1.5 What is the interaction between the two treatments? Analyze the effects of one variable on
the other (Education and Occupation) with the help of an interaction plot…………………………………….8

1.6 Perform a two-way ANOVA based on the Education and Occupation (along with their
interaction Education*Occupation) with the variable ‘Salary’. State the null and alternative
hypotheses and state your results. How will you interpret this result?..............................................9

1.7 Explain the business implications of performing ANOVA for this particular case study……………10

 Problem 2: (Education - Post 12th Standard.csv)


Problem Statement: ………………………………………………………………………………………………………………………..11

2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed].
What insight do you draw from the EDA?.........................................................................................13

2
Advanced Statistics Project

2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling……………………25

2.3 Comment on the comparison between the covariance and the correlation matrices from this
data.[on scaled data]…………………………………………………………………………………………………………………….26

2.4 Check the dataset for outliers before and after scaling. What insight do you derive here?.......29

2.5 Extract the eigenvalues and eigenvectors. [Using Sklearn PCA Print Both]………………………………31

2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame
with the original features……………………………………………………………………………………………………………33

2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two
places of decimals only). [hint: write the linear equation of PC in terms of eigenvectors and
corresponding features] …………………………………………………………………………………………………………….36

2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the
optimum number of principal components? What do the eigenvectors indicate?..........................37

2.9 Explain the business implication of using the Principal Component Analysis for this case study.
How may PCs help in the further analysis? [Hint: Write Interpretations of the Principal
Components Obtained …………………………………………………………………………………………………………………40

Problem No 1:
Problem Statement:
3
Advanced Statistics Project

Salary is hypothesized to depend on educational qualification and occupation. To understand the


dependency, the salaries of 40 individuals [SalaryData.csv] are collected and each person’s educational
qualification and occupation are noted. Educational qualification is at three levels, High school graduate,
Bachelor, and Doctorate. Occupation is at four levels, Administrative and clerical, Sales, Professional or
specialty, and Executive or managerial. A different number of observations are in each level of education
– occupation combination.

  importing all libraries into Jupiter notebook.

In this data we have 40 rows and 3 columns.

Data types are 2 objective and 1 integer which means 2 categorical column and 1 numerical.

There is no null value In the data checked with isnull()

We have checked Education value count and Occupation value count.

Education value count:

Occupation value count:

4
Advanced Statistics Project

Problem No 1A:
1.1. State the null and the alternate hypothesis for conducting one-
way ANOVA for both Education and Occupation individually.

Solution:

One-way ANOVA(Education)

Null Hypothesis H0 = The mean salary is same with respect to the all three categories of Education (High
school graduate, Bachelor, and Doctorate)

Alternate Hypothesis H1 = The mean salary is not same, at least one category has different mean salary.

One-way ANOVA(Occupation)

Null Hypothesis H0: The mean salary is the same across all the 4 categories of occupation (Prof-
Specialty, Sales, Adm-clerical, Exec-Managerial).

Alternate Hypothesis H1: The mean salary is different in at least one category of occupation.

1.2. Perform a one-way ANOVA on Salary with respect to Education.


State whether the null hypothesis is accepted or rejected based
on the ANOVA results.

Solution:

The above ANOVA table is for salary with respect to Education.

Conclusion:

Since the p-value =1.257709e-08 is less than the significance level (0.05), we can reject the null
hypothesis. And we conclude that there is difference in mean salaries at least one category of Education.
WE have drawn the point plot, below is the screen.

5
Advanced Statistics Project

1.3. Perform a one-way ANOVA on Salary with respect to Occupation.


State whether the null hypothesis is accepted or rejected based on the
ANOVA results.

Solution:

The above ANOVA table is for Salary with respect to Occupation.

Conclusion:

Since the p-value is greater than the significance level (0.05), we can fail to reject the null hypothesis.
(we accept null hypothesis)

Hence there is no difference in the mean salaries across the categories of Occupation (Prof-specialty,
Sales, Adm-clerical, Exec-managerial)
WE have drawn the point plot, below is the screen.

6
Advanced Statistics Project

1.4.If the null hypothesis is rejected in either (2) or in (3), find out which
class means are significantly different. Interpret the result

Solution:

By Using, the Tukey Honest Significant Difference test, we get the following table for the category
education:

The above table shows that since the p- values (p-adj in the table) are lesser than the significance level
(0.05), for all the three categories of education, this implies that the mean salaries across all categories
of education are different.

For the category occupation,

7
Advanced Statistics Project

From the above table shows that, The Tukey Honest Significant Difference test has further confirmed
that the mean salaries across all occupation classes are significantly same. The table confirms the same,
where we see that all p-values are greater than 0.05.

Problem 1B:
1.5. What is the interaction between the two treatments? Analyze the
effects of one variable on the other (Education and Occupation) with
the help of an interaction plot.

Solution:

The interaction plot shows that there is significant amount of interaction between the categorical
variables, Education and Occupation.

The following are some of the observations from the interaction plot:

8
Advanced Statistics Project

 People with HS-grad education do not reach the position of Exec-managerial and they hold only
Adm-clerk, Sales and Prof-Specialty occupations.
 People with education as Bachelors or Doctorate and occupation as Adm-clerical and Sales
almost earn the same salaries (salaries ranging from 170000 – 190000).
 People with education as Bachelors and occupation as Prof-Specialty earn lesser than people
with education as Bachelors and occupations as Adm-clerical and Sales.
 People with education as Bachelors and occupation Sales earn higher than people with
education as Bachelors and occupation Prof-Specialty whereas people with education as
Doctorate and occupation Sales earn lesser than people with Doctorate and occupation Prof-
Specialty. We see a reversal in this part of the plot.
 Similarly, people with education as Bachelors and occupation as Prof-Specialty earn lesser than
people with education as Bachelors and occupation Exec-Managerial whereas people with
education as Doctorate and occupation as Prof-Specialty earn higher than people with education
as Doctorate and occupation Exec-Managerial. There is a reversal in this part of the plot too.
 Salespeople with Bachelors or Doctorate education earn the same salaries and earn higher than
people with education as HS-grad.
 Adm clerical people with education as HS-grad earn the lowest salaries when compared to
people with education as Bachelors or Doctorate.
 Prof-Specialty people with education as Doctorate earn maximum salaries and people with
education as HS-Grad earn the minimum.
 People with education as HS -Grad earn the minimum salaries.
 There are no people with education as HS -grad who hold Exec-managerial occupation.
 People with education as Bachelors and occupation, Sales and Exec-Managerial earn the same
salaries.

1.6 Perform a two-way ANOVA based on the Education and Occupation


(along with their interaction Education*Occupation) with the variable
‘Salary’. State the null and alternative hypotheses and state your
results. How will you interpret this result?

Solution:

H0 = The mean salary variable is equal with respect to all categories (Education, Occupation)
H1 = At least one of the mean of salary variable with respect to Education and Occupation is different
TWO-WAY ANOVA

Let us now perform the Two-Way ANOVA. We will now analyze the effect of both the treatments on the
'Salary' variable.

By performing two-way ANOVA, we get the following table:

9
Advanced Statistics Project

 Since the p-value is less than the 0.05, we reject the null hypothesis.
 There is interaction between Education, Occupation and variable Salary is significant, Hence,
here at least one of the mean of salary variable with respect to Education and Occupation is
different that means not equal.
 We see that the p-value of the interaction effect term of 'Education' and 'Occupation' suggests
that the Null Hypothesis is rejected in this case.

1.7 Explain the business implications of performing ANOVA for this


particular case study.
Solution:

From the ANOVA method and the interaction plot, we see that education combined with occupation
results in higher and better salaries among the people. It is clearly seen that people with education as
Doctorate draw the maximum salaries and people with education HS-grad earn the least. Thus, we can
conclude that Salary is dependent on educational qualifications and occupation.

10
Advanced Statistics Project

Problem 2:
Problem Statement:
The dataset Education - Post 12th Standard.csv contains information on various colleges. You are
expected to do a Principal Component Analysis for this case study according to the instructions given.
The data dictionary of the 'Education - Post 12th Standard.csv' can be found in the following file: Data
Dictionary.xlsx.

Importing all libraries and we check the version.

The given dataset consist of data points of names of various university and college which has number of
application received, accepted, and enrolled, percentage of new students from top 10% of higher
secondary class and top 25% of higher secondary class, Number of fulltime undergraduates, Number of
part time undergraduate students, Number of students for whom the particular college is out of state
tuition, cost of room and board, estimated book costs for a student, estimated personal spending for a
student, percentage of faculties with PHD, percentage of faculties with terminal degree, student/faculty
ratio, percentage of alumni who donate, The instructional expenditure per student, Graduation Rate.

Data dictionary:

In this data we have 777 rows × 18 columns

Data types are 1 objective , 1 float and 16 integer that means names column alone is a categorical value.

11
Advanced Statistics Project

The field ‘S.F. Ratio’ is a float data type

We also can see there are no any duplicates in the dataset.

The entire dataset does not have missing values or null values.

12
Advanced Statistics Project

2.1 Perform Exploratory Data Analysis [both univariate and multivariate


analysis to be performed]. What insight do you draw from the EDA?
Solution:
Univariate, bivariate and Multivariate analysis helps us to understand the distribution of data in the
dataset.
Univariate Analysis:
With univariate analysis we can find patterns and we can summarize the data-

APPS:

13
Advanced Statistics Project

For univariate analysis of apps, we are using box plot and hist plot to find information or patterns in the
data.

The Boxplot of APPS seems to have Outliers, The mean is higher than the median indicates that the
distribution is right skewed.as per above figure.

The number of applications received by a college are range from 81 to 48094.

Rutgers at New Brunswick receives the maximum number of applications (48094) and Christendom
College receives the least number of applications (81).

ACCEPT:

The Boxplot of ACCEPT seems to have Outliers.

The hist plot shows us the majority of applications accepted from each university are in the range from
m 72 to 26330.

Rutgers at New Brunswick accepts the maximum number of applications (26330) and Christendom
College accepts the least number of applications (79).The Mean is higher than the median indicates that
the distribution is right skewed, as per above figure.

14
Advanced Statistics Project

ENROLL:

The box plot of the Enroll variable also has outliers.

From the hist plot we can understand majority of the colleges have enrolled students in the range of 35
to 6392 students.

Texas A&M Univ. at College Station has maximum enrollment with 6392 students and Capitol College
has the least with 35 students.

The Mean is higher than the median indicates that the distribution is right skewed, as per above figure.

TOP 10 PERC:

The percentage of new students from top 10% of Higher Secondary class ranges from 1 to 96.

Massachusetts Institute of Technology has maximum percentage of new students from top 10% of
Higher Secondary class with 96% and three colleges (Center for Creative Studies, Fayetteville State
University and North Adams State College)have the least with 1%.

The Mean is higher than the median indicates that the distribution is right skewed, as per above figure.

15
Advanced Statistics Project

TOP 25 PERC:

The box plot of the TOP 25 PERC variable has no outliers.

The percentage of new students from top 25% of Higher Secondary class ranges from 9 to 100.

There are 7 universities/colleges which have the maximum percentage (100%) of new students from
top 25% of Higher Secondary class and Huron University has the least with 9%

The mean is very close to the median indicating that the distribution is almost normal.

FULL TIME UNDERGRADUATE:

The Box plot of Full Time Undergraduate also have Outlier.

The number of full time under graduate students ranges from 139 to 31643.

16
Advanced Statistics Project

Texas A&M University at College Station has maximum number of full time undergraduate students
(31643) students and Christendom College has the least number of full time under graduate
students( 139) .

The mean higher than the median indicates that the distribution is right skewed as per above figure.

PART TIME UNDERGRADUATE:

The Box plot of Part Time Undergraduate also have Outlier.

The Number of part-time undergraduate students ranges from 1 to 21836.

University of Minnesota Twin cities has the maximum number of part time undergraduate students
( 21836) and the below 4 colleges have the least with 1 student.

The mean higher than the median indicates that the distribution is right skewed as per above figure.

OUTSTATE:

The box plot of outstate has only one outlier. The distribution is almost normally distributed

17
Advanced Statistics Project

The number of students for whom the particular college or university is Out-of-state tuition ranges from
2340 to 21700.

Bennington College has the maximum number of students for whom the particular college or university
is Out-of-state tuition ( 21700) and Brigham Young University at Provo has the least students(2340)

ROOM BOARD:

The Room Board has outliers.

The cost of room and board ranges from $1780 to $8124.

Barnard College has the maximum cost of room and board ($ 8124) and North Carolina A & T State
University has the least cost of room and board ($1780).

The mean is quite close to the median indicating that the distribution is almost normal with a shorter
right tail as seen in the above figure.

BOOKS:

18
Advanced Statistics Project

 The books have outliers.


 The estimated book costs for a student ranges from $96 to $2340.
 Center for Creative Studies has the maximum estimated book costs ($ 2340) for a student and
Appalachian State University has the least book costs for a student ($ 96).
 The mean is greater than the median indicating that the distribution is right skewed as see in the
above figure.

PERSONAL: Estimated personal spending for a student

 The box plot of personal expense has outliers


 The estimated personal spending for a student ranges from $250 to $6800.

 Saint Louis University has the maximum estimated personal spending for a student ($6800) and
Benedictine College has the least ($ 250)
 The mean is greater than the median indicating that the distribution is right skewed as seen in
the above figure.

PHD: Percentage of faculties with Ph.D.’s

19
Advanced Statistics Project

 The box plot of PHD has outliers


 The percentage of faculties with Ph.D.’s ranges from 8% to 100%.
 Three colleges have the maximum percentage of faculties with Ph.D.’s (100%) and Center for
Creative Studies has the least (8%)
 The mean is lesser than the median indicates that the distribution is left skewed.

TERMINAL: Percentage of faculties with terminal degree

 The Box plot of Terminal has out liers.


 The percentage of faculties with terminal degree ranges from 24% to 100%.
 The below universities/colleges have the maximum percentage of faculties with terminal Degree
(100%) and Salem-Tokyo University has the least (24%).
 The mean is lesser than the median indicating that the distribution is left skewed

SF RATIO: Student/faculty ratio

20
Advanced Statistics Project

 The Box plot of S.F.Ratio has the Outliers.


 The student faculty ratio ranges from 2.5 to 39.8.
 Indiana Wesleyan University has the maximum student faculty ratio (39.8) and University of
Charleston has the least (2.5).
 The mean is quite close to the median indicating that the distribution is almost normal with
short right tail as seen above.
 The mean is quite close to the median indicating that the distribution is almost normal with
short right tail as seen above figure.

PERC ALUMINI: Percentage of alumni who donate

 The boxplot of perc.of alumani has the outliers.


 The percentage of alumni who donate ranges from 0% to 64%
 Williams College has the maximum percentage of alumni who donate (64%) and 2
 The mean is quite close to the median indicating that the distribution is almost normal with a
short right tail

21
Advanced Statistics Project

EXPENDITURE: The Instructional expenditure per student

 The boxplot of expenditure has the outliers.


 The instructional expenditure per student ranges from $3186 to $56233.
 Johns Hopkins University has the maximum instructional expenditure per student ($56233) and
Jamestown College has the least ($3186).
 The mean is greater than the median indicating that the distribution is right skewed as seen in
the above figure.

GRAD RATE: Graduation rate

 The boxplot of Gradrate has outliers.


 The graduation rate ranges from 10% to 100%.
 The mean is almost equal to the median indicating that the distribution is normal as seen in the
above figure.

22
Advanced Statistics Project

Bivariate/Multivariate Analysis:
Pair plot of the 17 variables-

The pair plot helps us to understand the relationship between all the numerical values in the dataset.

On comparing all the variables with each other we understand the patterns or trends in the dataset.

HEATMAP:

23
Advanced Statistics Project

This Heat map gives us the correlation between two numerical values.

We could understand the application variable is highly positively correlated with application, accepted,
students enrolled and full time graduates. So this relationship gives the insights on when student
submits the application it is accepted and the student is enrolled as fulltime graduate.

We can find negative correlation between application and percentage of alumni. This indicates us not all
students are part of alumni of their college or university.

The application with top 10, 25 of higher secondary class, outstate, room board, books, personal, PhD,
terminal, S.F ratio, expenditure and Graduation ratio are positively correlated.

24
Advanced Statistics Project

2.2 Is scaling necessary for PCA in this case? Give justification and
perform scaling.
Solution:

Yes, Scaling is necessary to normalize data before performing PCA.

Scaling is very important for PCA because of the way that the principle components are calculated.

The PCA calculates a new projection of data set, and the new axis are based on the standard deviation of
variables.

Before scaling I have dropped the names variable which is categorical. Now, the dataset consists of only
numerical values, I have applied z-score method for this case study.

the dataset has 17 numerical columns with different scales.

Z score tells us how many standard deviations is the point away from the mean and also the direction.
Now, we can understand that all the variables are scaled by using z score function. Scaling is one of the
most important method to follow before implementing models.

Z=X-μ/σ

25
Advanced Statistics Project

2.3 Comment on the comparison between the covariance and the


correlation matrices from this data. [on scaled data]

Solution:

 The comparison between the covariance and correlation matrix is that both of the terms
measures the relationship and the dependency between two variables.
 Scaling in general means representation of the dataset.
 The numbers will not change, we are bringing the dataset into one unit.
 Covariance indicates the direction of the linear relationship between the variables whether it is
positive or negative, by direction means it is directly proportional or inversely proportional.

Cov ( X , Y )=
∑ ( Xi− X )∗(Yi−Y )
N

 This below snippet is the covariance matrix on scaled dataset. We can clearly understand
covariance matrix indicates direction of the linear relationship between the variables.

26
Advanced Statistics Project

 Correlation measures the strength (how much?) and the direction of the linear
relationship between two variables.

27
Advanced Statistics Project

Cov(x , y)
Correlation=
σx∗σy
 This below screenshot is the correlation matrix. We can understand the correlation
matrix which gives the strength and the relationship between the variables.
 The correlation matrix before scaling and after scaling will remain the same.
 From this screenshot we can understand variables which are highly positively correlated
and the variables which are highly negatively correlated.
 We can see that application, acceptance, enrollment and fulltime graduates are highly
positively correlated Also the top 10 percentage and top 25 percentage are highly
positively correlated

28
Advanced Statistics Project

2.4 Check the dataset for outliers before and after scaling. What insight do you derive here?

Solution:

Checking the data before scaling

29
Advanced Statistics Project

Checking the dataset after scaling

30
Advanced Statistics Project

 The above two plots show the boxplots before and after scaling. From, the 2 boxplots, we see
the presence of outliers in the data. Normalization of data does not remove the outliers, but
only the range of the data changes.

2.5 Extract the eigenvalues and eigenvectors. [Using Sklearn PCA Print
Both]
Solution:

The Eigen vectors and Eigen values are given below:

31
Advanced Statistics Project

32
Advanced Statistics Project

Eigenvalues:

2.6 Perform PCA and export the data of the Principal Component
(eigenvectors) into a data frame with the original features
Solution:

PCA has been performed and the principal component scores have been loaded into a data frame. The
below gives the screenshot of the PC data frame. (Please refer Python file)

33
Advanced Statistics Project

Scree plot:

We check the cumulative explained variance ratio and find a cut off for selecting the number of PCs

34
Advanced Statistics Project

Principal components have been generated with a cumulative variance of 81%. We have chosen 8 PCs
since as a general rule 80-20 is taken

Here we check to how the original features matter to each PC by using subplot and we are only
considering the absolute value.

35
Advanced Statistics Project

2.7 Write down the explicit form of the first PC (in terms of the
eigenvectors. Use values with two places of decimals only). [hint: write
the linear equation of PC in terms of eigenvectors and corresponding
features]
Solution:

The linear equation of PC in terms of eigenvectors and corresponding features:

36
Advanced Statistics Project

2.8 Consider the cumulative values of the eigenvalues. How does it help
you to decide on the optimum number of principal components? What
do the eigenvectors indicate?
Solution:

For choosing the number of principal components which are chosen from the cumulative variance
explained.

Here, we see that 81% is achieved after the 6th Eigen value, hence 6 principal components have been
chosen.

The Eigenvectors determine the directions of the new attribute space, and the eigenvalues determine
their magnitude.

In the PCA, the components of the eigen vectors determine the PCs

37
Advanced Statistics Project

The eigen vector associated with the largest eigen value indicates the direction in which data has the
most variance. Similarly, the eigen vector associated with the second largest eigen value indicates the
direction in which the data has the second most variance and so on.

38
Advanced Statistics Project

The first components explain 32.02% variance in data the first two components explain 58.36% variance
in data. The first three components explains 65.26% variance in data The first four components explains
71.18% variance in data The first five components explains 76.67% variance in data

The PCA reduces the multi collinearity and with this reduced collinearity run models and improved
efficiency scores.

39
Advanced Statistics Project

2.9 Explain the business implication of using the Principal Component


Analysis for this case study. How may PCs help in the further analysis?
[Hint: Write Interpretations of the Principal Components Obtained]
Solution:

This business case study is about education dataset which contain the names of various colleges, which
has various details of colleges and university.
To understand more about the dataset, we perform univariate analysis and multivariate analysis which
gives us the understanding about the variables. From analysis we can understand the distribution of the
dataset, skew, and patterns in the dataset. From multivariate analysis we can understand the correlation
of variables. Inference of multivariate analysis shows we can understand multiple variables highly
correlated with each other. The scaling helps the dataset to standardize the variable in one scale.
Outliers are imputed using IQR values once the values are imputed we can perform PCA. The principal
component analysis is used reduce the multicollinearity between the variables. Depending on the
variance of the dataset we can reduce the PCA components. The PCA components for this business case
are 5 where we could understand the maximum variance of the dataset. Using the components, we
understand the reduced multicollinearity in the dataset

In this data set, using the information on the Eigen values, Eigen Vectors and Cumulative Variance
Explained, the 6 PCS out of the 17 have been identified. Since, choosing 6 PCS has captured 81% of the
variance and information in the original data set. As a general rule 80-20 is taken, for choosing the
number of principal components which are chosen from the cumulative variance explained. Here, we
see that 81% is achieved after the 6th Eigen value

The below gives the screenshot of the PCs data frame.(Please refer Python file)

The explicit form of the first PC is below,

40
Advanced Statistics Project

The above coefficients, 0.25, 0.21…. indicate the weights associated with each of these 17 attributes
which make up the first PC

Using a component loading on a heat map, the features that have maximum loading across the
components can be identified.

41
Advanced Statistics Project

THANK YOU…!

42

You might also like