

Advanced Statistics (AS)


Project Report

Date: 15th August 2021


Version 1.0

Table of Contents
1 Problem 1 Statement ............................................................................................................................ 1
1.1 Problem 1 A................................................................................................................................... 1
1.1.1 Preliminary Data Analysis ..................................................................................................... 1
1.1.2 State the null and the alternate hypothesis for conducting one-way ANOVA for both
Education and Occupation individually. ............................................................................................... 4
1.1.3 Perform a one-way ANOVA on Salary with respect to Education. State whether the null
hypothesis is accepted or rejected based on the ANOVA results......................................................... 4
1.1.4 Perform a one-way ANOVA on Salary with respect to Occupation. State whether the null
hypothesis is accepted or rejected based on the ANOVA results......................................................... 4
1.1.5 If the null hypothesis is rejected in either (2) or in (3), find out which class means are
significantly different. Interpret the result. (Non-Graded) ................................................................... 5
1.2 Problem 1B:................................................................................................................................... 7
1.2.1 What is the interaction between two treatments? Analyze the effects of one variable on
the other (Education and Occupation) with the help of an interaction plot. [hint: use the ‘point plot’
function from the ‘Seaborn’ function] .................................................................................................. 7
1.2.2 Perform a two-way ANOVA based on Salary with respect to both Education and Occupation
(along with their interaction Education*Occupation). State the null and alternative hypotheses and
state your results. How will you interpret this result? ......................................................................... 8
1.2.3 Explain the business implications of performing ANOVA for this particular case study. ..... 9
2 Problem 2 Statement .......................................................................................................................... 10
2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed].
What insight do you draw from the EDA? .............................................................................................. 10
2.1.1 College-wise Top/Bottom Criteria Data Insights................................................................. 11
2.1.2 Univariate Analysis .............................................................................................................. 14
2.1.3 Bi-Variate Analysis............................................................................................................... 22
2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling. ..................... 25
2.3 Comment on the comparison between the covariance and the correlation matrices from this
data [on scaled data]. ............................................................................................................................. 28
2.3.1 Correlation Matrix ............................................................................................................... 28
2.3.2 Covariance Matrix ............................................................................................................... 29
2.4 Check the dataset for outliers before and after scaling. What insight do you derive here? [Please
do not treat Outliers unless specifically asked to do so] ........................................................................ 31
2.5 Extract the eigenvalues and eigenvectors. [Using Sklearn PCA Print Both] ............................... 34
2.5.1 Eigen Values ........................................................................................................................ 34

2.5.2 Eigen Vectors ...................................................................................................................... 34


2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame
with the original features........................................................................................................................ 38
2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two
places of decimals only). [hint: write the linear equation of PC in terms of eigenvectors and
corresponding features] ......................................................................................................................... 39
2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the
optimum number of principal components? What do the eigenvectors indicate? ............................... 42
2.9 Explain the business implication of using the Principal Component Analysis for this case study.
How may PCs help in the further analysis? [Hint: Write Interpretations of the Principal Components
Obtained] ................................................................................................................................................ 43
2.9.1 Dimensionality Reduction ................................................................................................... 43
2.9.2 Reduced Dimension Data Set - PCA .................................................................................... 44
2.9.3 Multicollinearity Removed .................................................................................................. 45
2.9.4 Data Analytics Examples ..................................................................................................... 47
2.9.5 Conclusion ........................................................................................................................... 47

List of Figures
Figure 1-1 Salary Boxplot .............................................................................................................................. 2
Figure 1-2 Salary Histogram .......................................................................................................................... 3
Figure 1-3 Occupation-Salary Class Difference Boxplot................................................................................ 5
Figure 1-4 Education-Salary Class Difference Boxplot .................................................................................. 6
Figure 1-5 Education-Occupation Interaction Plot ....................................................................................... 7
Figure 2-1 Applications, Acceptance and Enrollment Histogram ............................................................... 15
Figure 2-2 New Students with Top HSC Scores (10% and 25%) Histogram ................................................ 16
Figure 2-3 Room-Board, Books and Personal Spends Histogram ............................................................... 16
Figure 2-4 Fulltime – Part-time Students Histogram .................................................................................. 17
Figure 2-5 Faculty in Colleges – Histogram ................................................................................................. 18
Figure 2-6 Instructional Expenditure Per Student and Number of Outstate Students - Histogram ........... 18
Figure 2-7 Percentage of Alumni Who Donate – Histogram ...................................................................... 19
Figure 2-8 Graduation Rate Histogram ....................................................................................................... 19
Figure 2-9 Unscaled Dataset Boxplot – See Outliers .................................................................................. 21
Figure 2-10 Data set Pair plot ..................................................................................................................... 22
Figure 2-11 Data set Heatmap .................................................................................................................... 24
Figure 2-12 Original Data Histogram Plot ................................................................................................... 27
Figure 2-13 Scaled Data Histogram Plot ..................................................................................................... 27
Figure 2-14 Original Data Boxplot – Outliers Identification ........................................................................ 32
Figure 2-15 Scaled Data Boxplot – Outliers Identification .......................................................................... 33
Figure 2-16 Eigen Values Ratio Cumulative Sum Scree Plot ....................................................................... 42
Figure 2-17 Eigen Values Scree Plot ............................................................................................................ 43
Figure 2-18 Reduced Dimensions Eigen Values Heat Map ......................................................................... 44
Figure 2-19 Transformed Data Set with Reduced Dimensionality .............................................................. 45
Figure 2-20 Original and Transformed Datasets ......................................................................................... 46

List of Tables
Table 1-1 Problem 1 Data Sample................................................................................................................. 1
Table 1-2 Education Values Distribution....................................................................................................... 2
Table 1-3 Occupation Values Distribution .................................................................................................... 2
Table 1-4 Descriptive Statistics of P1 Data ................................................................................................... 3
Table 1-5 Education AOV Table .................................................................................................................... 4
Table 1-6 Occupation AOV Table .................................................................................................................. 4
Table 1-7 Education Comparison of Means .................................................................................................. 5
Table 1-8 Two Way AOV Table...................................................................................................................... 8
Table 1-9 Two Way AOV Table with Interaction ........................................................................................... 8
Table 2-1 Data description (all columns) .................................................................................................... 11
Table 2-2 Top College Statistics- Application, Acceptance and Enrollment ............................................... 11
Table 2-3 Top College Statistics- Fulltime/Part-time Students and Graduation Rate ................................ 12
Table 2-4 Top College Statistics- Instructional Expenditure and Faculty Quality ....................................... 12
Table 2-5 Top College Statistics- Cost to Students ..................................................................................... 13
Table 2-6 Top College Statistics-Student HSC Scores.................................................................................. 13
Table 2-7 Top College Statistics-Most Outstate Students .......................................................................... 13
Table 2-8 Descriptive Statistics of P2 Data ................................................................................................. 14
Table 2-9 Outlier Presence in all Numerical Columns................................................................................. 20
Table 2-10 Heat Map Data Correlation Inference ...................................................................................... 23
Table 2-11 Data before scaling ................................................................................................................... 26
Table 2-12 Data after scaling ...................................................................................................................... 26
Table 2-13 Scaled Data Correlation Matrix ................................................................................................. 28
Table 2-14 Scaled Data Covariance Matrix ................................................................................................. 30
Table 2-15 Eigen Vectors in data frame (Part 1) ......................................................................................... 38
Table 2-16 Eigen Vectors in data frame (Part 2) ......................................................................................... 39
Table 2-17 Manual PC Calculation(Using Scaled Data) ............................................................................... 40
Table 2-18 Auto PC Calculation with Python sklearn (Using Scaled Data) ................................................. 41

1 Problem 1 Statement
Salary is hypothesized to depend on educational qualification and occupation. To understand the
dependency, the salaries of 40 individuals [SalaryData.csv] are collected and each person’s educational
qualification and occupation are noted.

Educational qualification is at three levels, High school graduate, Bachelor, and Doctorate. Occupation is
at four levels, Administrative and clerical, Sales, Professional or specialty, and Executive or managerial. A
different number of observations are in each level of education – occupation combination.

[Assume that the data follows a normal distribution. In reality, the normality assumption may not always
hold if the sample size is small.]

1.1 Problem 1 A
1.1.1 Preliminary Data Analysis
A small sample of the data set is as below:

S. No Education Occupation Salary


0 Doctorate Adm-clerical 153197
1 Doctorate Adm-clerical 115945
2 Doctorate Adm-clerical 175935
3 Doctorate Adm-clerical 220754
4 Doctorate Sales 170769
Table 1-1 Problem 1 Data Sample

Data info summary:


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Education 40 non-null object
1 Occupation 40 non-null object
2 Salary 40 non-null int64
dtypes: int64(1), object(2)
memory usage: 1.1+ KB
We draw the following conclusions from the initial analysis of raw data.

• The raw data given for analysis has a total of 40 rows and 3 columns.
• The columns ‘Education’ and ‘Occupation’ in the dataset are object data
• Salary column holds numerical data
• The dataset holds no null values; all values are valid
• There is no duplicated data


Unique values of the categorical column Education are as follows:

S. No Education Number Present % of total


1 Doctorate 16 40
2 Bachelors 15 38
3 HS-grad 9 23
Total 40 100
Table 1-2 Education Values Distribution

Unique values of Occupation are as follows:

S. No Occupation Number Present % of total


1 Prof-specialty 13 33
2 Sales 12 30
3 Adm-clerical 10 25
4 Exec-managerial 5 12
Total 40 100
Table 1-3 Occupation Values Distribution

The graphical box plot of the salary data is displayed below (Figure 1-1). In it, we can see the median and the first and third quartile values, along with the maximum and minimum values. We can see that the data is left skewed and it is evident that the salary data has no outliers.

Figure 1-1 Salary Boxplot

Descriptive statistics of the Salary column are summarized as follows.

                     Salary      Understanding
Mean                 162186.9    Average salary for all data entries
Standard Deviation    64860.41   Deviation of salary from the mean salary
Median               169100      Central salary, i.e. 50% of data entries are above and 50% are below this salary
25%                    99897.5   First quartile (Q1), i.e. 25% of data entries are below this salary
75%                   214440.8   Third quartile (Q3), i.e. 75% of data entries are below this salary
Skewness                  -0.17312   Negative skew refers to a longer or fatter tail on the left side of the salary distribution
Range                 210048      Range of salary recorded (Max - Min)
Maximum               260151      Maximum salary recorded
Minimum                50103      Minimum salary recorded
Count                     40      Number of data entries for salary
Table 1-4 Descriptive Statistics of P1 Data

The KDE plot shows that there is a fatter tail on the left, as confirmed by the negative skewness value.

Figure 1-2 Salary Histogram


1.1.2 State the null and the alternate hypothesis for conducting one-way ANOVA for both
Education and Occupation individually.
1.1.2.1 Education Hypothesis
Null Hypothesis: The mean salary for all education levels is the same.
Ho: µSalaryDoctorate = µSalaryBachelor = µSalaryHSGraduate
Alternate Hypothesis: At least one pair of mean salary for all education levels is different
Ha: At least one pair of means are not equal

1.1.2.2 Occupation Hypothesis


Null Hypothesis: The mean salary for all occupation levels is the same.
Ho: µSalProfSpecialty = µSalSales = µSalAdmClerical = µSalExecManagerial
Alternate Hypothesis: At least one pair of mean salary for all occupation levels is different
Ha: At least one pair of means are not equal

1.1.3 Perform a one-way ANOVA on Salary with respect to Education. State whether the null
hypothesis is accepted or rejected based on the ANOVA results.
The AOV table resulted from the one-way ANOVA calculation on salary with respect to education is
represented below.

df sum_sq mean_sq F PR(>F)


Education 2 1.03E+11 5.13E+10 30.95628 1.26E-08
Residual 37 6.14E+10 1.66E+09
Table 1-5 Education AOV Table

The p-value of 1.26e-08 is lower than the significance level α of 0.05. With this information, we reject the
null hypothesis and conclude with 95% confidence that there is indeed a difference in at least one pair of
the population means.

This indicates that the mean salary differs between at least one pair of education classes.
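
The AOV tables in this and the next sub-section can be reproduced as sketched below. This is a minimal sketch, assuming the CSV from the problem statement; salary_df is an illustrative variable name, not necessarily the one used in the original notebook.

# Minimal sketch of the one-way ANOVA calculations; salary_df is an illustrative name.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

salary_df = pd.read_csv("SalaryData.csv")

# Fit an OLS model of Salary on each categorical factor and build the AOV table
model_edu = ols("Salary ~ C(Education)", data=salary_df).fit()
print(sm.stats.anova_lm(model_edu, typ=1))     # p-value ~1.26e-08 -> reject H0

model_occ = ols("Salary ~ C(Occupation)", data=salary_df).fit()
print(sm.stats.anova_lm(model_occ, typ=1))     # p-value ~0.46 -> fail to reject H0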

1.1.4 Perform a one-way ANOVA on Salary with respect to Occupation. State whether the null
hypothesis is accepted or rejected based on the ANOVA results.
The AOV table resulted from the one-way ANOVA calculation on salary with respect to occupation is
represented below.

df sum_sq mean_sq F PR(>F)


Occupation 3 1.13E+10 3.75E+09 0.884144 0.458508
Residual 36 1.53E+11 4.24E+09
Table 1-6 Occupation AOV Table


As the p-value of 0.46 is greater than the significance level α of 0.05, we fail to reject the null hypothesis and cannot conclude with 95% confidence that there is a difference in at least one pair of the population means.

In other words, there is no evidence that the mean salary differs across occupations.

The salary earned in each occupation is represented in the boxplot below. It is evident that there is considerable overlap in the amounts earned in each occupation, which supports the null hypothesis that the population mean salary is the same for all occupations.

Figure 1-3 Occupation-Salary Class Difference Boxplot

1.1.5 If the null hypothesis is rejected in either (2) or in (3), find out which class means are
significantly different. Interpret the result. (Non-Graded)
We rejected the null hypothesis for the One-Way ANOVA on Salary with respect to Education i.e. the null
hypothesis of the population means of salary for all 3 education levels being equal is rejected as confirmed
in section 1.1.3.

In order to check which pairs of means are equal or not equal, we perform the pair-wise Tukey HSD test.
This test also lets us estimate the mean difference between each pair of population means of salary for
the education classes.
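
A minimal sketch of this test is given below, reusing the salary_df DataFrame assumed in the earlier ANOVA sketch; the resulting comparison table follows.

# Minimal sketch of the pairwise Tukey HSD test on Salary grouped by Education,
# reusing salary_df from the ANOVA sketch.
from statsmodels.stats.multicomp import pairwise_tukeyhsd

tukey = pairwise_tukeyhsd(endog=salary_df["Salary"],
                          groups=salary_df["Education"],
                          alpha=0.05)
print(tukey.summary())    # mean differences, adjusted p-values and reject flags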

Multiple Comparison of Means - Tukey HSD, FWER=0.05


Group1 Group2 Mean Diff P Value Lower Upper Reject
Bachelors Doctorate 43274.0667 0.0146 7541.1439 79006.9894 True
Bachelors HS-grad -90114.1556 0.001 -132035.1958 -48193.1153 True
Doctorate HS-grad -133388.2222 0.001 -174815.0876 -91961.3569 True
Table 1-7 Education Comparison of Means

1. It is seen that the mean difference in the salary earned by the education class groups ([Bachelors
- Doctorate], [Bachelors- HS-grad], [Doctorate - HS-grad]) is statistically significant.
2. The p-value for all three education class pairs indicates that the null hypothesis (that the mean
salary is the same for each education class combination) is to be rejected.


3. The Reject column confirms our understanding stated in point (2).

The education-salary data is plotted in the form of a boxplot below. Here we can see that the mean salary for each education class varies widely.

From the ANOVA, the Tukey HSD test and the box plot, we can conclude that the mean salary difference between the education classes is statistically significant.

Figure 1-4 Education-Salary Class Difference Boxplot


1.2 Problem 1B:


1.2.1 What is the interaction between two treatments? Analyze the effects of one variable on
the other (Education and Occupation) with the help of an interaction plot. [hint: use the
‘point plot’ function from the ‘Seaborn’ function]
The interaction between Education and Occupation with respect to the salary earned is displayed in a point plot. The interactions clearly show patterns, and the inferences drawn from these patterns are recorded below.
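
The point plot can be produced with seaborn as sketched below; this is a minimal sketch that again assumes the salary_df DataFrame from the earlier sketches.

# Minimal sketch of the interaction (point) plot, reusing salary_df.
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.pointplot(x="Education", y="Salary", hue="Occupation", data=salary_df)
plt.title("Education-Occupation Interaction Plot")
plt.show()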

Figure 1-5 Education-Occupation Interaction Plot

1. Adm-Clerical occupation
This occupation pays a significantly lower salary to HS Graduates – around 75K.
Bachelor and Doctorate degree holders earn around 160K-170K in the same occupation.

2. Sales occupation
HS Graduates in this occupation earn the lowest as compared to HS Graduates working in any of
the other occupations, around 50K only.
Bachelor and Doctorate degree holders earn much more, at around 195K. This is almost a 290%
increase as compared to the salary earned by HS Graduates in the same occupation.


3. Prof-Specialty occupation
This occupation pays the highest salary of around 250K as compared to other occupations.
However, this high salary is reserved only for Doctorate degree holders.
Bachelor’s degree holders and HS graduates in this occupation earn a much lower amount in the
95K-105K range.

4. Exec-Managerial
This occupation is held only by Bachelor and Doctorate degree holders; no HS Graduate in the data works in it.
Bachelor degree holders earn around 190K, which is less than the roughly 210K earned by Doctorate degree holders.

1.2.2 Perform a two-way ANOVA based on Salary with respect to both Education and Occupation
(along with their interaction Education*Occupation). State the null and alternative
hypotheses and state your results. How will you interpret this result?
Null Hypothesis, Ho: The mean Salary is the same across all Education and Occupation groups.

Alternate Hypothesis, Ha: The mean Salary differs for at least one Education or Occupation group.

The AOV table resulted from the two-way ANOVA calculation on salary with respect to education and
occupation is represented below.

df sum_sq mean_sq F PR(>F)


Education 2 1.03E+11 5.13E+10 31.25768 1.98E-08
Occupation 3 5.52E+09 1.84E+09 1.12008 0.354582
Residual 34 5.59E+10 1.64E+09
Table 1-8 Two Way AOV Table

When both education and occupation factors are considered, we see that occupation is not a statistically
significant factor as its p-value 0.3545 is greater than the significance level α of 0.05.

However, education plays a significant role in the salary factor as its p-value 1.98E-08 is lower than the
significance level α of 0.05.

The AOV table resulted from the two-way ANOVA calculation on salary with respect to education,
occupation and the interaction between education and occupation is represented below.

df sum_sq mean_sq F PR(>F)


Education 2 1.03E+11 5.13E+10 72.21196 5.47E-12
Occupation 3 5.52E+09 1.84E+09 2.587626 0.072116
Education:Occupation 6 3.63E+10 6.06E+09 8.519815 2.23E-05
Residual 29 2.06E+10 7.11E+08
Table 1-9 Two Way AOV Table with Interaction


We see that the p-value of the Education:Occupation interaction term is 2.23E-05, which is lower than the significance level α of 0.05. This leads us to conclude that there is a statistically significant interaction between these two factors as far as salary is concerned.
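
Both two-way models above can be fitted as sketched below; this is a minimal sketch reusing salary_df from the earlier sketches.

# Minimal sketch of the two-way ANOVA with interaction, reusing salary_df.
import statsmodels.api as sm
from statsmodels.formula.api import ols

formula = "Salary ~ C(Education) + C(Occupation) + C(Education):C(Occupation)"
model_2way = ols(formula, data=salary_df).fit()
print(sm.stats.anova_lm(model_2way, typ=1))    # interaction p-value ~2.23e-05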

1.2.3 Explain the business implications of performing ANOVA for this particular case study.
1.2.3.1 Education Choices
Choosing to take up further education to earn a Bachelor’s degree or a Doctorate is a time-consuming,
expensive endeavor.

1. ANOVA allows a student to understand the relationship between the salary earned and the
education level. In this case, ANOVA clearly shows that the mean salary earned by persons of
different education backgrounds is dissimilar.
2. This ANOVA conclusion helps a person decide whether pursuing higher education is the best course of action.
3. A HS graduate can decide to pursue a Bachelor’s degree to greatly enhance his earning capacity.
4. A Bachelor’s degree holder may deduce that earning a Doctorate degree may not give the salary
boost that justifies time/effort/money spent to get the additional education.

This proven education-salary dependency can be used by colleges to influence high-school graduates to
pursue college education instead of directly joining the workforce.

1.2.3.2 Occupation Choices


In terms of the occupation that a person can go for, ANOVA along with the Occupation-Education
interaction data allows him to choose judiciously.

1. It is seen that there is not enough evidence to reject the hypothesis that the mean salary earned by persons of different occupations is the same.
2. Nevertheless, the education-occupation interaction data will help the person to choose an
occupation which can ensure that he earns a better salary.
3. HS Graduates can opt for occupations like Administrative-Clerical that far out-pay jobs like Sales.
4. Similarly, Bachelor degree holders can avoid the Prof-Specialty occupation, which gives them a lower earning potential than the other fields.


2 Problem 2 Statement
The dataset Education - Post 12th Standard.csv contains information on various colleges. You are expected
to do a Principal Component Analysis for this case study according to the instructions given. The data
dictionary of the 'Education - Post 12th Standard.csv' can be found in the following file: Data
Dictionary.xlsx.

2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be
performed]. What insight do you draw from the EDA?
The total data info summary for this problem set is given below. The names of the columns, the number of valid data entries and the data type of each column are detailed.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777 entries, 0 to 776
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Names 777 non-null object
1 Apps 777 non-null int64
2 Accept 777 non-null int64
3 Enroll 777 non-null int64
4 Top10perc 777 non-null int64
5 Top25perc 777 non-null int64
6 F.Undergrad 777 non-null int64
7 P.Undergrad 777 non-null int64
8 Outstate 777 non-null int64
9 Room.Board 777 non-null int64
10 Books 777 non-null int64
11 Personal 777 non-null int64
12 PhD 777 non-null int64
13 Terminal 777 non-null int64
14 S.F.Ratio 777 non-null float64
15 perc.alumni 777 non-null int64
16 Expend 777 non-null int64
17 Grad.Rate 777 non-null int64
dtypes: float64(1), int64(16), object(1)
memory usage: 109.4+ KB
We draw the following conclusions from the initial analysis of raw data.


• The raw data given for analysis has a total of 777 rows and 18 columns (the ‘Names’ column plus 17 numerical columns).
• The column ‘Names’ holds object data and the rest of the columns all hold numerical data
• The dataset holds no null values; all values are valid
• There is no duplicated data

The description of the columns is tabulated below.

Column Names Description


Names Names of various university and colleges
Apps Number of applications received
Accept Number of applications accepted
Enroll Number of new students enrolled
Top10perc Percentage of new students from top 10% of Higher Secondary class
Top25perc Percentage of new students from top 25% of Higher Secondary class
F.Undergrad Number of full-time undergraduate students
P.Undergrad Number of part-time undergraduate students
Outstate             Number of students for whom the particular college or university is out-of-state tuition
Room.Board Cost of Room and board
Books Estimated book costs for a student
Personal Estimated personal spending for a student
PhD Percentage of faculties with Ph.D.’s
Terminal Percentage of faculties with terminal degree
S.F.Ratio Student/faculty ratio
perc.alumni Percentage of alumni who donate
Expend The Instructional expenditure per student
Grad.Rate Graduation rate
Table 2-1 Data description (all columns)

2.1.1 College-wise Top/Bottom Criteria Data Insights


2.1.1.1 Top Colleges – Applications, Acceptance and Enrollment
Maximum Applications:
  Rutgers at New Brunswick*1            48094
  Purdue University at West Lafayette   21804
  Boston University                     20192

Maximum Acceptance (%) ##1:
  Wayne State College         100
  Emporia State University    100
  Mayville State University   100

Maximum Enrollment (%) ##2:
  California Lutheran University     100
  Brewton-Parker College             97.8
  Mississippi University for Women   93.8
Table 2-2 Top College Statistics- Application, Acceptance and Enrollment

• ##1 - Maximum Acceptance (%) is computed as percentage of students who got acceptance after they
applied to that college
= Number of Accepted students / Number of Student Applications


• ##2 - Maximum Enrollment (%) is computed as percentage of students who have enrolled in that
college after they received acceptance
= Number of Enrolled students / Number of Accepted Students)
• *1 - Rutgers at New Brunswick and Purdue University at West Lafayette got the highest number of
applications. However, the number of applications received by Rutgers at New Brunswick is more than
double of what Purdue University at West Lafayette received.

2.1.1.2 Top Colleges – Graduation Rate and Fulltime/Part-time Students


Highest Graduation Rate:
  Cazenovia College*2      118
  University of Richmond   100
  Amherst College          100

Maximum Fulltime U-Graduate (%) ##3:
  College of Wooster         99.9
  Kenyon College             99.9
  Hampden - Sydney College   99.8

Maximum Part-time U-Graduate (%) ##4:
  Columbia College MO         88.5
  Alaska Pacific University   77.7
  Wilson College              77.2
Table 2-3 Top College Statistics- Fulltime/Part-time Students and Graduation Rate

• ##3 - Maximum Fulltime U-Graduate (%) is computed as percentage of fulltime UG students


= Number of fulltime UG students / Total number of fulltime and part-time UG students
• ##4 - Maximum Part-time U-Graduate (%) is computed as percentage of part-time UG students
= Number of part-time UG students / Total number of fulltime and part-time UG students
• *2 - Validate data sanctity as there seems to be a discrepancy; Graduation rate is greater than 100

2.1.1.3 Top Colleges – Instructional Expenditure, Student-Faculty Ratio and PhD Faculty
Highest Instructional Expenditure:
  Johns Hopkins University   56233
  Washington University      45702
  Antioch University         42926

Lowest Instructional Expenditure:
  Jamestown College          3186
  Central Wesleyan College   3365
  Lindenwood College         3480

Best Student-Faculty Ratio:
  University of Charleston          2.5
  Case Western Reserve University   2.9
  Johns Hopkins University          3.3

Most PhD Faculty (%):
  Texas A&M University at Galveston*3   103
  Pitzer College                        100
  Bryn Mawr College                     100
Table 2-4 Top College Statistics- Instructional Expenditure and Faculty Quality

• *3 - Validate data sanctity as there seems to be a discrepancy; Percent of PhD faculty is greater than
100

2.1.1.4 Top Colleges – Cost to Student


Highest Total Cost ##5:
  Saint Louis University     12330
  Lindenwood College         10000
  Northeastern University     9775

Lowest Total Cost ##5:
  Missouri Southern State College      3452
  University of Southern Mississippi   3470
  Delta State University               3580

Table 2-5 Top College Statistics- Cost to Students

• ##5 - Total cost to the student is total cost for Room-Board, Books and Personal expenditure

2.1.1.5 Top Colleges – Top 10%/25% HSC Scoring New Students

Maximum New Students from top 10% of HSC (%):
  Massachusetts Institute of Technology   96
  University of California at Berkeley    95
  Harvey Mudd College                     95

Maximum New Students from top 25% of HSC (%):
  University of California at Irvine   100
  University of Pennsylvania           100
  Harvey Mudd College                  100
Table 2-6 Top College Statistics-Student HSC Scores

2.1.1.6 Top Colleges – Most Outstate Students


Most Outstate Students:
  Bennington College                      21700
  Massachusetts Institute of Technology   20100
  Gettysburg College                      19964
Table 2-7 Top College Statistics-Most Outstate Students


2.1.2 Univariate Analysis


The below table holds the descriptive statistics of the P2 dataset.

             Mean      Std Deviation   Median    25%      75%      Range    Maximum   Minimum
Apps 3001.638 3870.201 1558 776 3624 48013 48094 81
Accept 2018.804 2451.114 1110 604 2424 26258 26330 72
Enroll 779.973 929.1762 434 242 902 6357 6392 35
Top10perc 27.55856 17.64036 23 15 35 95 96 1
Top25perc 55.79665 19.80478 54 41 69 91 100 9
F.Undergrad 3699.907 4850.421 1707 992 4005 31504 31643 139
P.Undergrad 855.2986 1522.432 353 95 967 21835 21836 1
Outstate 10440.67 4023.016 9990 7320 12925 19360 21700 2340
Room.Board 4357.526 1096.696 4200 3597 5050 6344 8124 1780
Books 549.381 165.1054 500 470 600 2244 2340 96
Personal 1340.642 677.0715 1200 850 1700 6550 6800 250
PhD 72.66023 16.32815 75 62 85 95 103 8
Terminal 79.7027 14.72236 82 71 92 76 100 24
S.F.Ratio 14.0897 3.958349 13.6 11.5 16.5 37.3 39.8 2.5
perc.alumni 22.74389 12.3918 21 13 31 64 64 0
Expend 9660.171 5221.768 8377 6751 10830 53047 56233 3186
Grad.Rate 65.46332 17.17771 65 53 78 108 118 10
Table 2-8 Descriptive Statistics of P2 Data

On cursory analysis of each row, we can draw some insights.

1. The maximum number of applications a college in the dataset received is 48094 and the minimum number of applications received is only
81. On average a college received around 3001 applications, with the median being 1558 applications.
2. While a college accepts around 2019 applications on average, only around 780 students actually end up enrolling.
3. On comparing the F.Undergrad and P.Undergrad numbers, it is evident that more students opt for a full-time undergraduate course as
compared to a part-time undergraduate course.
4. On average, a college generally has more staff with terminal degrees than with PhDs.


5. The maximum student-faculty ratio is observed to be 39.8 but the average is better at 14.
6. There seems to be bad data in Grad.Rate as the maximum value of 118 is larger than 100; needs validation from SME
7. There seems to be bad data in PhD as the maximum value of 103 is larger than 100; needs validation from SME

2.1.2.1 Applications, Acceptance and Enrollment Histogram


Applications, Acceptance and Enrollment are all discrete data. Their histograms with KDE lines show that all three are right skewed (the mean is higher than the median – Table 2-8 Descriptive Statistics of P2 Data), which means that high values of Applications, Acceptance and Enrollment are seen for only a few colleges. In addition, the three columns all follow the same kind of distribution.

Figure 2-1 Applications, Acceptance and Enrollment Histogram


2.1.2.2 New Students with Top HSC Scores (10% and 25%) Histogram
The histograms of the percentage of new students who were in the top 10% and top 25% of their HS class are shown below. Both show a roughly normal distribution, but Top10perc is slightly right skewed with a longer tail, indicating that fewer colleges have a large share of new students with top 10% HSC scores.

Figure 2-2 New Students with Top HSC Scores (10% and 25%) Histogram

2.1.2.3 Room-Board, Books and Personal Spends Histogram

Figure 2-3 Room-Board, Books and Personal Spends Histogram

The spending heads for a student are Room-Board, Books and Personal. The Room.Board plot shows a fairly normal distribution (the mean and median are close to each other, Table 2-8 Descriptive Statistics of P2 Data). Books also seems to follow a normal distribution, but there are some peaks in the histogram bars. The Personal histogram is right skewed, which indicates that the mean is higher than the median (Table 2-8 Descriptive Statistics of P2 Data).

2.1.2.4 Fulltime – Part-time Students Histogram

Figure 2-4 Fulltime – Part-time Students Histogram

The histogram of fulltime and part-time undergraduate students (discrete data) is right skewed which
indicates that the mean is higher than the median (Table 2-8 Descriptive Statistics of P2 Data). The x-axis
indicates number of students and y axis indicates number of colleges.

By checking the data, we see that on average, only 18.3% of students in a college are part-time and 81.6%
of students are full-time students.

2.1.2.5 Faculty in Colleges – Histogram


The histograms of the faculty qualification levels and the Student-Faculty ratio are depicted below. The PhD and Terminal faculty histograms are left skewed, which indicates that the mean is lower than the median (Table 2-8 Descriptive Statistics of P2 Data). The S.F.Ratio distribution is fairly normal, although it has a slightly longer right tail.


Figure 2-5 Faculty in Colleges – Histogram

2.1.2.6 Instructional Expenditure Per Student and Number of Outstate Students

Figure 2-6 Instructional Expenditure Per Student and Number of Outstate Students - Histogram

The histogram of the instructional expenditure per student shows that the data is right skewed, which indicates that the mean is higher than the median (Table 2-8 Descriptive Statistics of P2 Data). The plot also shows a sharp peak. The number of outstate students looks to be normally distributed with a slight right skew.


2.1.2.7 Percentage of Alumni Who Donate – Histogram


The histogram for the percentage of alumni who donate to the college has a longer right tail, without a very sharp peak.

Figure 2-7 Percentage of Alumni Who Donate – Histogram

2.1.2.8 Graduation Rate Histogram


The histogram for the graduation rate shows that the data is fairly normally distributed.

Figure 2-8 Graduation Rate Histogram


2.1.2.9 All Numeric Data Outliers Check


Numerical Columns Outlier Presence
Apps Outliers above top whisker only
Accept Outliers above top whisker only
Enroll Outliers above top whisker only
Top10perc Outliers above top whisker only
Top25perc No Outliers
F.Undergrad Outliers above top whisker only
P.Undergrad Outliers above top whisker only
Outstate Outliers above top whisker only
Room.Board Outliers above top whisker only
Books Outliers seen both above top whisker and below lower whisker
Personal Outliers above top whisker only
PhD Outliers seen only below lower whisker
Terminal Outliers seen only below lower whisker
S.F.Ratio Outliers seen both above top whisker and below lower whisker
perc.alumni Outliers above top whisker only
Expend Outliers above top whisker only
Grad.Rate Outliers seen both above top whisker and below lower whisker
Table 2-9 Outlier Presence in all Numerical Columns

The skewness seen in the box plots is confirmed by the histogram plots in the previous sections.


Figure 2-9 Unscaled Dataset Boxplot – See Outliers


2.1.3 Bi-Variate Analysis


2.1.3.1 Pair-Plot
We can create a pair-wise plot to see the relations between the data variables. How one variable varies with another can be seen from the scatter of the plotted points. The same information is captured in numerical form in the heat map in the next section.

Please note that due to the large number of data variables, the generated image is very small.
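
A minimal sketch of how these plots can be generated is given below; it assumes the file name from the problem statement, and edu_df is an illustrative variable name for the numerical columns.

# Minimal sketch of the bivariate plots; edu_df is an illustrative name for the
# numerical columns of the dataset (the 'Names' column is dropped).
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

edu_df = pd.read_csv("Education - Post 12th Standard.csv").drop(columns=["Names"])

sns.pairplot(edu_df, diag_kind="kde")                                 # pair-wise scatter plots
plt.show()

plt.figure(figsize=(12, 10))
sns.heatmap(edu_df.corr(), annot=True, fmt=".2f", cmap="coolwarm")    # correlation heat map
plt.show()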

Figure 2-10 Data set Pair plot


2.1.3.2 Heat Map


The dependency of the different variables on the other variables in the dataset can be depicted in the form of a heat map, which visualizes positive/negative and high/low correlations in the data. The inferences listed below are all strong correlations with an absolute value greater than 0.8.

Variable1      Variable2               Inference
Apps           Accept, Enroll          The Applications variable has a high positive correlation with the Acceptance and Enrollment variables.
Accept         Enroll                  There is a high positive correlation between the Acceptance and Enrollment variables.
Top25perc      Top10perc               There is a high positive correlation between the Top25perc and Top10perc variables.
F.Undergrad    Apps, Enroll, Accept    Fulltime undergraduate students have a high positive correlation with the Applications, Acceptance and Enrollment variables.
Terminal       PhD                     There is a high positive correlation between the Terminal and PhD variables.
Table 2-10 Heat Map Data Correlation Inference


Figure 2-11 Data set Heatmap


2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling.
Scaling is a preprocessing step applied to the independent variables in order to bring the data into a comparable range. Most datasets have features which vary widely in magnitude, units and range. Scaling ensures that all the features are given equal importance. If scaling is not done, algorithms which take only the magnitude of values into account will produce distorted models.

In the case of the Problem 2 data set, the disproportionate data magnitudes and units are clearly evident.

1. Number of fulltime/part-time undergraduate students in a university/college


2. Number of faculty members in a university/college which will generally be lesser than the number
of students
3. Number of college applications and the number of applications accepted
4. Cost of Books, Room-Board etc., which are monetary amounts
5. Percentage of graduation, percentage of faculties with PhD etc.

With the presence of these varied data magnitudes/units, it is best to apply scaling before any algorithm
processing is performed.

In this case, we will scale the data by computing the z-score for every value:

z = (x - μ) / σ

where μ is the mean and σ is the standard deviation of the respective column.

On doing this, the scaled values in every column will have a mean tending to 0 and a standard deviation of 1, i.e. the data will become centred and standardized.
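
A minimal sketch of this scaling step, reusing the edu_df DataFrame assumed in the earlier pair-plot sketch:

# Minimal sketch of z-score scaling, reusing edu_df from the pair-plot sketch.
from scipy.stats import zscore

edu_df_scaled = edu_df.apply(zscore)      # (x - mean) / std for every column

print(edu_df_scaled.mean().round(3))      # all ~0
print(edu_df_scaled.std().round(3))       # all ~1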

A sample (the complete data has 777 rows and 17 numerical columns) of both the original numerical dataset and the dataset post scaling is shown below.

Apps Accept Enroll Top10 perc Top25 perc F. Undergrad P. Undergrad Outstate Room. Board
0 1660 1232 721 23 52 2885 537 7440 3300
1 2186 1924 512 16 29 2683 1227 12280 6450
2 1428 1097 336 22 50 1036 99 11250 3750
3 417 349 137 60 89 510 63 12960 5450
4 193 146 55 16 44 249 869 7560 4120
Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
0 450 2200 70 78 18.1 12 7041 60
1 750 1500 29 30 12.2 16 10527 56
2 400 1165 53 66 12.9 30 8735 54
3 450 875 92 97 7.7 37 19016 59
4 800 1500 76 72 11.9 2 10922 15
Table 2-11 Data before scaling

Apps Accept Enroll Top10 perc Top25 perc F. Undergrad P. Undergrad Outstate Room. Board
0 -0.34688 -0.32121 -0.06351 -0.25858 -0.19183 -0.16812 -0.20921 -0.74636 -0.9649
1 -0.21088 -0.0387 -0.28858 -0.65566 -1.35391 -0.20979 0.244307 0.457496 1.909208
2 -0.40687 -0.37632 -0.47812 -0.31531 -0.29288 -0.54957 -0.49709 0.201305 -0.55432
3 -0.66826 -0.68168 -0.69243 1.840231 1.677612 -0.65808 -0.52075 0.626633 0.996791
4 -0.72618 -0.76455 -0.78073 -0.65566 -0.59603 -0.71192 0.009005 -0.71651 -0.21672
Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
0 -0.60231 1.270045 -0.16303 -0.11573 1.013776 -0.86757 -0.50191 -0.31825
1 1.21588 0.235515 -2.67565 -3.37818 -0.4777 -0.54457 0.16611 -0.55126
2 -0.90534 -0.25958 -1.20484 -0.93134 -0.30075 0.585935 -0.17729 -0.66777
3 -0.60231 -0.68817 1.185206 1.175657 -1.61527 1.151188 1.792851 -0.3765
4 1.518912 0.235515 0.204672 -0.52353 -0.55354 -1.67508 0.241803 -2.93961
Table 2-12 Data after scaling


A histogram plot of the original data and the scaled data are displayed below for visual comparison. It is
evident that the graphs both display the same properties but the axes of the graphs show that the data
has been scaled.

Figure 2-12 Original Data Histogram Plot

Figure 2-13 Scaled Data Histogram Plot


2.3 Comment on the comparison between the covariance and the correlation matrices from this data [on scaled data].
2.3.1 Correlation Matrix
              Apps  Accept Enroll Top10% Top25% F.UG  P.UG  Outstate Room.Board Books Personal PhD   Terminal S.F.Ratio perc.alumni Expend Grad.Rate
Apps          1     0.94   0.85   0.34   0.35   0.81  0.4   0.05     0.16       0.13  0.18     0.39  0.37     0.1       -0.09       0.26   0.15
Accept        0.94  1      0.91   0.19   0.25   0.87  0.44  -0.03    0.09       0.11  0.2      0.36  0.34     0.18      -0.16       0.12   0.07
Enroll        0.85  0.91   1      0.18   0.23   0.96  0.51  -0.16    -0.04      0.11  0.28     0.33  0.31     0.24      -0.18       0.06   -0.02
Top10%        0.34  0.19   0.18   1      0.89   0.14  -0.11 0.56     0.37       0.12  -0.09    0.53  0.49     -0.38     0.46        0.66   0.49
Top25%        0.35  0.25   0.23   0.89   1      0.2   -0.05 0.49     0.33       0.12  -0.08    0.55  0.52     -0.29     0.42        0.53   0.48
F.UG          0.81  0.87   0.96   0.14   0.2    1     0.57  -0.22    -0.07      0.12  0.32     0.32  0.3      0.28      -0.23       0.02   -0.08
P.UG          0.4   0.44   0.51   -0.11  -0.05  0.57  1     -0.25    -0.06      0.08  0.32     0.15  0.14     0.23      -0.28       -0.08  -0.26
Outstate      0.05  -0.03  -0.16  0.56   0.49   -0.22 -0.25 1        0.65       0.04  -0.3     0.38  0.41     -0.55     0.57        0.67   0.57
Room.Board    0.16  0.09   -0.04  0.37   0.33   -0.07 -0.06 0.65     1          0.13  -0.2     0.33  0.37     -0.36     0.27        0.5    0.42
Books         0.13  0.11   0.11   0.12   0.12   0.12  0.08  0.04     0.13       1     0.18     0.03  0.1      -0.03     -0.04       0.11   0
Personal      0.18  0.2    0.28   -0.09  -0.08  0.32  0.32  -0.3     -0.2       0.18  1        -0.01 -0.03    0.14      -0.29       -0.1   -0.27
PhD           0.39  0.36   0.33   0.53   0.55   0.32  0.15  0.38     0.33       0.03  -0.01    1     0.85     -0.13     0.25        0.43   0.31
Terminal      0.37  0.34   0.31   0.49   0.52   0.3   0.14  0.41     0.37       0.1   -0.03    0.85  1        -0.16     0.27        0.44   0.29
S.F.Ratio     0.1   0.18   0.24   -0.38  -0.29  0.28  0.23  -0.55    -0.36      -0.03 0.14     -0.13 -0.16    1         -0.4        -0.58  -0.31
perc.alumni   -0.09 -0.16  -0.18  0.46   0.42   -0.23 -0.28 0.57     0.27       -0.04 -0.29    0.25  0.27     -0.4      1           0.42   0.49
Expend        0.26  0.12   0.06   0.66   0.53   0.02  -0.08 0.67     0.5        0.11  -0.1     0.43  0.44     -0.58     0.42        1      0.39
Grad.Rate     0.15  0.07   -0.02  0.49   0.48   -0.08 -0.26 0.57     0.42       0     -0.27    0.31  0.29     -0.31     0.49        0.39   1
Table 2-13 Scaled Data Correlation Matrix

The correlation matrix created is a NxN square matrix where N is the number of columns in the original data input (here, we have 17 columns so it is a 17x17
matrix). The column names and the index of the matrix are the same i.e. the names of the columns are reflected as names of the rows of the correlation matrix.


This matrix is created to check the correlation of all the data variables on each other. The correlation can be a positive value which means the variables are directly
related to each other. In case the correlation is negative, the data variables are inversely related to one another.

The values in this matrix range between +1 and -1 only. The square matrix is symmetric, i.e. the top and the bottom half of the matrix along the diagonal are mirror images of each other, with the diagonal cells all having a value of 1. This matrix is used to generate the heat-map for quick correlation analysis.
See Figure 2-11 Data set Heatmap and Table 2-10 Heat Map Data Correlation Inference.

2.3.2 Covariance Matrix


The covariance matrix created here is an MxM square matrix where M is the number of rows in the original data input (here, we have 777 rows, so it is a 777x777 matrix). Mathematically, this matrix is obtained (up to a scaling factor) by multiplying the scaled data matrix with its transposed form. The row and column labels of the matrix are the same, i.e. the row numbers of the data appear as both the index and the column names of the covariance matrix.

The covariance between two variables X and Y is calculated as

cov(X, Y) = Σ (xi - x̄)(yi - ȳ) / (n - 1)

where x̄ is the mean of X, ȳ is the mean of Y and n is the number of data points.

The covariance matrix summarizes the variances (on the diagonal) and the covariances of the dataset. Like the correlation matrix, it is symmetric: the top and the bottom half of the matrix along the diagonal are mirror images of each other. Note that if the covariance is instead computed across the 17 scaled features (a 17x17 matrix), it is practically identical to the correlation matrix, since each scaled feature has unit variance.
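
A minimal sketch comparing the two matrices on the scaled data, reusing edu_df_scaled from the scaling sketch above:

# Minimal sketch comparing covariance and correlation on the scaled data,
# reusing edu_df_scaled from the scaling sketch.
import numpy as np

feature_cov  = edu_df_scaled.cov()    # 17x17 covariance matrix across features
feature_corr = edu_df_scaled.corr()   # 17x17 correlation matrix

# On z-scored data the two matrices are (almost) identical
print(np.allclose(feature_cov.values, feature_corr.values, atol=1e-2))

# The 777x777 matrix reported in this section comes from treating rows as variables
row_cov = np.cov(edu_df_scaled.values)   # shape (777, 777)
print(row_cov.shape)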

A sample of the created covariance matrix is shown below. The entire data set is a 777x777 matrix.

0 1 2 3 4 5 6 … 770 771 772 773 774 775 776


0 0.33 -0.15 -0.03 -0.31 0.06 -0.12 -0.1 … -0.04 -0.19 0.22 -0.1 -0.03 -0.46 0.14
1 -0.15 1.57 0.17 -0.43 0.24 -0.07 -0.13 … -0.47 -0.35 0.34 0.18 0.01 0.04 -0.24
2 -0.03 0.17 0.17 0.06 -0.13 0.02 -0.04 … 0.05 0.02 0.01 0.02 -0.02 0.18 -0.04
3 -0.31 -0.43 0.06 1.23 0.01 0.36 0.34 … 0.49 0.56 -0.68 -0.01 0.06 1.51 -0.29
4 0.06 0.24 -0.13 0.01 0.85 0.05 0.07 … -0.14 -0.24 0.21 -0.15 0.15 -0.09 -0.39
5 -0.12 -0.07 0.02 0.36 0.05 0.3 0 … 0.14 0.21 -0.27 -0.04 0.02 0.55 -0.17
6 -0.1 -0.13 -0.04 0.34 0.07 0 0.46 … 0.16 0.21 -0.18 0.09 -0.11 0.24 -0.18
… … … … … … … … … … … … … … … …
770 -0.04 -0.47 0.05 0.49 -0.14 0.14 0.16 … 0.55 0.39 -0.31 0.03 0.04 0.31 0.11


771 -0.19 -0.35 0.02 0.56 -0.24 0.21 0.21 … 0.39 0.52 -0.37 0.05 0.02 0.48 0
772 0.22 0.34 0.01 -0.68 0.21 -0.27 -0.18 … -0.31 -0.37 0.69 -0.04 0.04 -1.07 0.1
773 -0.1 0.18 0.02 -0.01 -0.15 -0.04 0.09 … 0.03 0.05 -0.04 0.15 -0.06 0.05 0.09
774 -0.03 0.01 -0.02 0.06 0.15 0.02 -0.11 … 0.04 0.02 0.04 -0.06 0.16 -0.04 -0.02
775 -0.46 0.04 0.18 1.51 -0.09 0.55 0.24 … 0.31 0.48 -1.07 0.05 -0.04 3.07 -0.52
776 0.14 -0.24 -0.04 -0.29 -0.39 -0.17 -0.18 … 0.11 0 0.1 0.09 -0.02 -0.52 0.57
Table 2-14 Scaled Data Covariance Matrix


2.4 Check the dataset for outliers before and after scaling. What insight do you derive
here? [Please do not treat Outliers unless specifically asked to do so]
An outlier is a data point in the dataset which lies an abnormal distance from other values in a random
sample from a population. These unusual data points are problematic for several statistical analyses
because they can cause tests to either miss significant findings or distort real results.

It is important to note that scaling a dataset does NOT eliminate the presence of outliers. Outliers present
in the original dataset will still be present in the scaled dataset. The only way to correctly eliminate outliers
is to identify them and treat them by replacing the outliers with acceptable values or dropping that data.
This can only be done with the support of a subject matter expert.

Box plots graphically display the presence of outliers in all the data. We have plotted boxplots of the
original unscaled data (Figure 2-14 Original Data Boxplot – Outliers Identification) and the scaled data
(Figure 2-15 Scaled Data Boxplot – Outliers Identification).

Looking at the boxplots we can see:

1. We see that all data columns except the Top25Perc column have outliers.
2. Scaling by itself does not affect the presence of outliers; we observe that the original data and the scaled data box plots are similar in shape and both show the presence of outliers.
3. The only difference between the unscaled and the scaled data box plots is the scale of the Y-axis; the scaled data Y-axis is a z-score scale.
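
A minimal sketch of an IQR-based outlier count on both datasets, reusing edu_df and edu_df_scaled from the earlier sketches:

# Minimal sketch of an IQR-based outlier count, reusing edu_df and edu_df_scaled.
def count_outliers(df):
    q1, q3 = df.quantile(0.25), df.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return ((df < lower) | (df > upper)).sum()   # outlier count per column

print(count_outliers(edu_df))         # counts on the original data
print(count_outliers(edu_df_scaled))  # identical counts on the scaled data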


Figure 2-14 Original Data Boxplot – Outliers Identification


Figure 2-15 Scaled Data Boxplot – Outliers Identification


2.5 Extract the eigenvalues and eigenvectors. [Using Sklearn PCA Print Both]
As a precautionary measure, it is prudent to first check if the dataset will benefit from PCA treatment. To
check this, we apply the Bartlett's Test of Sphericity and the Kaiser-Meyer-Olkin (KMO) measure of
sampling adequacy tests.

1. Bartlett's Test of Sphericity

We test the hypothesis that the data elements are uncorrelated in the population. The p-value should
be small so that we can reject the Null Hypothesis and PCA can be recommended.

Null Hypothesis, Ho: The variables in the population are uncorrelated
Alternate Hypothesis, Ha: At least one pair of variables in the population is correlated

Conclusion: The calculated p-value is 0.0 so the null hypothesis is rejected. We continue with the PCA
calculation.

2. KMO Measure of Sampling Adequacy

We calculate the MSA value. In order for PCA to be recommended, the value should be greater than 0.7.

Conclusion: The calculated KMO MSA is 0.813 so we continue with the PCA calculation.
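
A minimal sketch of both checks, reusing edu_df_scaled from the earlier sketches and assuming the factor_analyzer package is installed:

# Minimal sketch of the sphericity/adequacy checks, reusing edu_df_scaled.
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

chi_square, p_value = calculate_bartlett_sphericity(edu_df_scaled)
print(p_value)                 # ~0.0 -> reject H0, variables are correlated

kmo_per_variable, kmo_total = calculate_kmo(edu_df_scaled)
print(round(kmo_total, 3))     # ~0.81 -> sampling adequacy is acceptable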

Principal Component Analysis (PCA) is a statistical procedure which is used to convert a set of data observations of possibly correlated variables into a set of values of linearly uncorrelated variables, the principal components. PCA is applied only to the scaled numerical columns of the dataset.
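
A minimal sketch of the extraction with sklearn, reusing edu_df_scaled from the earlier sketches:

# Minimal sketch of extracting eigenvalues and eigenvectors with sklearn PCA.
from sklearn.decomposition import PCA

pca = PCA(n_components=17)
pca.fit(edu_df_scaled)

print(pca.explained_variance_)    # eigenvalues (17 values)
print(pca.components_)            # eigenvectors / loadings (17 x 17 matrix)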

2.5.1 Eigen Values


Below is the set of eigenvalues computed for the dataset. Each eigenvalue tells us how much variance there is in the data along the direction of the corresponding eigenvector. It is a list of 17 numbers (as the n_components given to the PCA model is 17):
Eigen Values: [5.45052162, 4.48360686, 1.17466761, 1.00820573,
0.93423123, 0.84849117, 0.6057878 , 0.58787222,
0.53061262, 0.4043029 , 0.31344588, 0.22061096,
0.16779415, 0.1439785 , 0.08802464, 0.03672545,
0.02302787]

2.5.2 Eigen Vectors


Below is the set of Eigen Vectors calculated for the dataset. In this case it forms a 17x17 matrix because:

1. The number of components to keep has been set as 17 – number of rows
2. The number of numerical columns in the data frame under analysis is 17 – number of columns


Eigen Vectors:
[[ 2.48765602e-01, 2.07601502e-01, 1.76303592e-01, 3.54273947e-01, 3.44001279e-01, 1.54640962e-01,
2.64425045e-02, 2.94736419e-01, 2.49030449e-01, 6.47575181e-02, -4.25285386e-02, 3.18312875e-01,
3.17056016e-01, -1.76957895e-01, 2.05082369e-01, 3.18908750e-01, 2.52315654e-01],

[ 3.31598227e-01, 3.72116750e-01, 4.03724252e-01,-8.24118211e-02, -4.47786551e-02, 4.17673774e-01,


3.15087830e-01, -2.49643522e-01, -1.37808883e-01, 5.63418434e-02, 2.19929218e-01, 5.83113174e-02,
4.64294477e-02, 2.46665277e-01, -2.46595274e-01,-1.31689865e-01, -1.69240532e-01],

[-6.30921033e-02, -1.01249056e-01, -8.29855709e-02, 3.50555339e-02, -2.41479376e-02, -6.13929764e-02,


1.39681716e-01, 4.65988731e-02, 1.48967389e-01, 6.77411649e-01, 4.99721120e-01, -1.27028371e-01,
-6.60375454e-02, -2.89848401e-01, -1.46989274e-01, 2.26743985e-01, -2.08064649e-01],

[ 2.81310530e-01, 2.67817346e-01, 1.61826771e-01,-5.15472524e-02, -1.09766541e-01, 1.00412335e-01,


-1.58558487e-01, 1.31291364e-01, 1.84995991e-01, 8.70892205e-02, -2.30710568e-01, -5.34724832e-01,
-5.19443019e-01, -1.61189487e-01, 1.73142230e-02, 7.92734946e-02, 2.69129066e-01],

[ 5.74140964e-03, 5.57860920e-02, -5.56936353e-02,-3.95434345e-01, -4.26533594e-01, -4.34543659e-02,


3.02385408e-01, 2.22532003e-01, 5.60919470e-01,-1.27288825e-01, -2.22311021e-01, 1.40166326e-01,
2.04719730e-01, -7.93882496e-02, -2.16297411e-01, 7.59581203e-02, -1.09267913e-01],

[-1.62374420e-02, 7.53468452e-03, -4.25579803e-02,-5.26927980e-02, 3.30915896e-02, -4.34542349e-02,


-1.91198583e-01, -3.00003910e-02, 1.62755446e-01, 6.41054950e-01, -3.31398003e-01, 9.12555212e-02,
1.54927646e-01, 4.87045875e-01, -4.73400144e-02,-2.98118619e-01, 2.16163313e-01],

[-4.24863486e-02, -1.29497196e-02, -2.76928937e-02,-1.61332069e-01, -1.18485556e-01, -2.50763629e-02,


6.10423460e-02, 1.08528966e-01, 2.09744235e-01,-1.49692034e-01, 6.33790064e-01, -1.09641298e-03,


-2.84770105e-02, 2.19259358e-01, 2.43321156e-01,-2.26584481e-01, 5.59943937e-01],

[-1.03090398e-01, -5.62709623e-02, 5.86623552e-02,-1.22678028e-01, -1.02491967e-01, 7.88896442e-02,


5.70783816e-01, 9.84599754e-03, -2.21453442e-01, 2.13293009e-01, -2.32660840e-01, -7.70400002e-02,
-1.21613297e-02, -8.36048735e-02, 6.78523654e-01,-5.41593771e-02, -5.33553891e-03],

[-9.02270802e-02, -1.77864814e-01, -1.28560713e-01, 3.41099863e-01, 4.03711989e-01, -5.94419181e-02,


5.60672902e-01, -4.57332880e-03, 2.75022548e-01, -1.33663353e-01, -9.44688900e-02, -1.85181525e-01,
-2.54938198e-01, 2.74544380e-01, -2.55334907e-01, -4.91388809e-02, 4.19043052e-02],

[ 5.25098025e-02, 4.11400844e-02, 3.44879147e-02, 6.40257785e-02, 1.45492289e-02, 2.08471834e-02,


-2.23105808e-01, 1.86675363e-01, 2.98324237e-01, -8.20292186e-02, 1.36027616e-01, -1.23452200e-01,
-8.85784627e-02, 4.72045249e-01, 4.22999706e-01, 1.32286331e-01, -5.90271067e-01],

[ 4.30462074e-02, -5.84055850e-02, -6.93988831e-02, -8.10481404e-03, -2.73128469e-01, -8.11578181e-02,


1.00693324e-01, 1.43220673e-01, -3.59321731e-01, 3.19400370e-02, -1.85784733e-02, 4.03723253e-02,
-5.89734026e-02, 4.45000727e-01, -1.30727978e-01, 6.92088870e-01, 2.19839000e-01],

[ 2.40709086e-02, -1.45102446e-01, 1.11431545e-02, 3.85543001e-02, -8.93515563e-02, 5.61767721e-02,


-6.35360730e-02, -8.23443779e-01, 3.54559731e-01, -2.81593679e-02, -3.92640266e-02, 2.32224316e-02,
1.64850420e-02, -1.10262122e-02, 1.82660654e-01, 3.25982295e-01, 1.22106697e-01],

[ 5.95830975e-01, 2.92642398e-01, -4.44638207e-01, 1.02303616e-03, 2.18838802e-02, -5.23622267e-01,


1.25997650e-01, -1.41856014e-01, -6.97485854e-02, 1.14379958e-02, 3.94547417e-02, 1.27696382e-01,
-5.83134662e-02, -1.77152700e-02, 1.04088088e-01, -9.37464497e-02, -6.91969778e-02],

[ 8.06328039e-02, 3.34674281e-02, -8.56967180e-02, -1.07828189e-01, 1.51742110e-01, -5.63728817e-02,

36
`Advanced Statistics- Project

1.92857500e-02, -3.40115407e-02, -5.84289756e-02, -6.68494643e-02, 2.75286207e-02, -6.91126145e-01,


6.71008607e-01, 4.13740967e-02, -2.71542091e-02, 7.31225166e-02, 3.64767385e-02],

[ 1.33405806e-01, -1.45497511e-01, 2.95896092e-02, 6.97722522e-01, -6.17274818e-01, 9.91640992e-03,


2.09515982e-02, 3.83544794e-02, 3.40197083e-03, -9.43887925e-03, -3.09001353e-03, -1.12055599e-01,
1.58909651e-01, -2.08991284e-02, -8.41789410e-03, -2.27742017e-01, -3.39433604e-03],

[ 4.59139498e-01, -5.18568789e-01, -4.04318439e-01, -1.48738723e-01, 5.18683400e-02, 5.60363054e-01,


-5.27313042e-02, 1.01594830e-01, -2.59293381e-02, 2.88282896e-03, -1.28904022e-02, 2.98075465e-02,
-2.70759809e-02, -2.12476294e-02, 3.33406243e-03, -4.38803230e-02, -5.00844705e-03],

[ 3.58970400e-01, -5.43427250e-01, 6.09651110e-01, -1.44986329e-01, 8.03478445e-02, -4.14705279e-01,


9.01788964e-03, 5.08995918e-02, 1.14639620e-03, 7.72631963e-04, -1.11433396e-03, 1.38133366e-02,
6.20932749e-03, -2.22215182e-03, -1.91869743e-02, -3.53098218e-02, -1.30710024e-02]]


2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame with the original features
Once the PCA is completed, we get the eigenvectors, which are the loadings or coefficients of all PCs. These are stored as a data frame with the column names
taken from the original data set and the indices labelled as PC numbers.
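A minimal sketch of how such a loadings table can be built, assuming a fitted PCA object pca and the scaled DataFrame scaled_df from the earlier steps (assumed names):

# Sketch: eigenvectors (loadings) as a DataFrame with the original column names
import pandas as pd

loadings_df = pd.DataFrame(
    pca.components_,                                      # one row per principal component
    columns=scaled_df.columns,                            # original feature names
    index=[f"PCA{i}" for i in range(pca.n_components_)],  # PCA0, PCA1, ...
)
print(loadings_df.round(6).head())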

Please note that, for easier reading, the table has been split into two parts with the index values of the rows repeated for clarity.

Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board
PCA0 0.248766 0.207602 0.176304 0.354274 0.344001 0.154641 0.026443 0.294736 0.24903
PCA1 0.331598 0.372117 0.403724 -0.08241 -0.04478 0.417674 0.315088 -0.24964 -0.13781
PCA2 -0.06309 -0.10125 -0.08299 0.035056 -0.02415 -0.06139 0.139682 0.046599 0.148967
PCA3 0.281311 0.267817 0.161827 -0.05155 -0.10977 0.100412 -0.15856 0.131291 0.184996
PCA4 0.005741 0.055786 -0.05569 -0.39543 -0.42653 -0.04345 0.302385 0.222532 0.560919
PCA5 -0.01624 0.007535 -0.04256 -0.05269 0.033092 -0.04345 -0.1912 -0.03 0.162755
PCA6 -0.04249 -0.01295 -0.02769 -0.16133 -0.11849 -0.02508 0.061042 0.108529 0.209744
PCA7 -0.10309 -0.05627 0.058662 -0.12268 -0.10249 0.07889 0.570784 0.009846 -0.22145
PCA8 -0.09023 -0.17786 -0.12856 0.3411 0.403712 -0.05944 0.560673 -0.00457 0.275023
PCA9 0.05251 0.04114 0.034488 0.064026 0.014549 0.020847 -0.22311 0.186675 0.298324
PCA10 0.043046 -0.05841 -0.0694 -0.0081 -0.27313 -0.08116 0.100693 0.143221 -0.35932
PCA11 0.024071 -0.1451 0.011143 0.038554 -0.08935 0.056177 -0.06354 -0.82344 0.35456
PCA12 0.595831 0.292642 -0.44464 0.001023 0.021884 -0.52362 0.125998 -0.14186 -0.06975
PCA13 0.080633 0.033467 -0.0857 -0.10783 0.151742 -0.05637 0.019286 -0.03401 -0.05843
PCA14 0.133406 -0.1455 0.02959 0.697723 -0.61727 0.009916 0.020952 0.038354 0.003402
PCA15 0.459139 -0.51857 -0.40432 -0.14874 0.051868 0.560363 -0.05273 0.101595 -0.02593
PCA16 0.35897 -0.54343 0.609651 -0.14499 0.080348 -0.41471 0.009018 0.0509 0.001146
Table 2-15 Eigen Vectors in data frame (Part 1)

Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
PCA0 0.064758 -0.04253 0.318313 0.317056 -0.17696 0.205082 0.318909 0.252316
PCA1 0.056342 0.219929 0.058311 0.046429 0.246665 -0.2466 -0.13169 -0.16924
PCA2 0.677412 0.499721 -0.12703 -0.06604 -0.28985 -0.14699 0.226744 -0.20806
PCA3 0.087089 -0.23071 -0.53472 -0.51944 -0.16119 0.017314 0.079273 0.269129
PCA4 -0.12729 -0.22231 0.140166 0.20472 -0.07939 -0.2163 0.075958 -0.10927
PCA5 0.641055 -0.3314 0.091256 0.154928 0.487046 -0.04734 -0.29812 0.216163
PCA6 -0.14969 0.63379 -0.0011 -0.02848 0.219259 0.243321 -0.22658 0.559944
PCA7 0.213293 -0.23266 -0.07704 -0.01216 -0.0836 0.678524 -0.05416 -0.00534
PCA8 -0.13366 -0.09447 -0.18518 -0.25494 0.274544 -0.25533 -0.04914 0.041904
PCA9 -0.08203 0.136028 -0.12345 -0.08858 0.472045 0.423 0.132286 -0.59027
PCA10 0.03194 -0.01858 0.040372 -0.05897 0.445001 -0.13073 0.692089 0.219839
PCA11 -0.02816 -0.03926 0.023222 0.016485 -0.01103 0.182661 0.325982 0.122107
PCA12 0.011438 0.039455 0.127696 -0.05831 -0.01772 0.104088 -0.09375 -0.0692
PCA13 -0.06685 0.027529 -0.69113 0.671009 0.041374 -0.02715 0.073123 0.036477
PCA14 -0.00944 -0.00309 -0.11206 0.15891 -0.0209 -0.00842 -0.22774 -0.00339
PCA15 0.002883 -0.01289 0.029808 -0.02708 -0.02125 0.003334 -0.04388 -0.00501
PCA16 0.000773 -0.00111 0.013813 0.006209 -0.00222 -0.01919 -0.03531 -0.01307
Table 2-16 Eigen Vectors in data frame (Part 2)

2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two places of decimals only). [hint: write
the linear equation of PC in terms of eigenvectors and corresponding features]
Let the observed variables in the data be denoted by x1, x2, ..., xp, where p is the number of variables (here p = 17).
The principal components are linear combinations of these variables. The equations may be written as follows:
PC1 = a1x1 + a2x2 + ... + apxp
PC2 = b1x1 + b2x2 + ... + bpxp
PC3 = c1x1 + c2x2 + ... + cpxp
where a, b, c, etc. are the extracted coefficients (the eigenvector loadings).
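Substituting the PCA0 loadings from Table 2-15 and Table 2-16 (rounded to two decimal places) gives the explicit form of the first principal component:

PC1 = 0.25*Apps + 0.21*Accept + 0.18*Enroll + 0.35*Top10perc + 0.34*Top25perc + 0.15*F.Undergrad + 0.03*P.Undergrad + 0.29*Outstate + 0.25*Room.Board + 0.06*Books - 0.04*Personal + 0.32*PhD + 0.32*Terminal - 0.18*S.F.Ratio + 0.21*perc.alumni + 0.32*Expend + 0.25*Grad.Rate

where each variable denotes its z-score scaled value.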


1. As part of our calculation, the data x1, x2, etc. are the z-score scaled values of the data set, as seen in Table 2-12 Data after scaling. These are placed in column X.
2. The coefficients are the eigenvectors computed for the scaled data set, as seen in Table 2-15 Eigen Vectors in data frame (Part 1) and Table 2-16 Eigen Vectors in data frame (Part 2). These values (rounded to two decimals) are placed in columns A and B.
3. The full computation results in a PC score matrix of 777 rows and 17 columns.
4. We show the calculation of the first two PCs with the formula and table values below (a code sketch of the same computation follows Table 2-17):
PC1 Value = Sum of (Column X * Column A)
PC2 Value = Sum of (Column X * Column B)

Columns   X (Data, scaled z-score)   A (Eigen Vector Row 1, Loading 1)   X*A (Product)   X (Data, scaled z-score)   B (Eigen Vector Row 2, Loading 2)   X*B (Product)
Apps -0.3469 0.25 -0.0867 -0.3469 0.33 -0.11448
Accept -0.3212 0.21 -0.0675 -0.3212 0.37 -0.11884
Enroll -0.0635 0.18 -0.0114 -0.0635 0.4 -0.0254
Top10perc -0.2586 0.35 -0.0905 -0.2586 -0.08 0.020688
Top25perc -0.1918 0.34 -0.0652 -0.1918 -0.04 0.007672
F.Undergrad -0.1681 0.15 -0.0252 -0.1681 0.42 -0.0706
P.Undergrad -0.2092 0.03 -0.0063 -0.2092 0.32 -0.06694
Outstate -0.7464 0.29 -0.2165 -0.7464 -0.25 0.1866
Room.Board -0.9649 0.25 -0.2412 -0.9649 -0.14 0.135086
Books -0.6023 0.06 -0.0361 -0.6023 0.06 -0.03614
Personal 1.27 -0.04 -0.0508 1.27 0.22 0.2794
PhD -0.163 0.32 -0.0522 -0.163 0.06 -0.00978
Terminal -0.1157 0.32 -0.037 -0.1157 0.05 -0.00579
S.F.Ratio 1.0138 -0.18 -0.1825 1.0138 0.25 0.25345
perc.alumni -0.8676 0.21 -0.1822 -0.8676 -0.25 0.2169
Expend -0.5019 0.32 -0.1606 -0.5019 -0.13 0.065247
Grad.Rate -0.3183 0.25 -0.0796 -0.3183 -0.17 0.054111
Total (PC1 = Sum of X*A) = -1.5928    Total (PC2 = Sum of X*B) = 0.767349
Table 2-17 Manual PC Calculation (Using Scaled Data)
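A minimal sketch of this manual computation in code, assuming scaled_df holds the z-score scaled data and loadings_df the loadings table built earlier (assumed names); using the full-precision loadings, the result closely matches the totals above:

# Sketch: manual PC scores for the first observation (row 0) as a dot product
import numpy as np

x = scaled_df.iloc[0].values          # column X: scaled values of the first college
a = loadings_df.loc["PCA0"].values    # column A: loadings of the first PC
b = loadings_df.loc["PCA1"].values    # column B: loadings of the second PC

pc1_manual = np.sum(x * a)            # Sum of (Column X * Column A)
pc2_manual = np.sum(x * b)            # Sum of (Column X * Column B)
print(round(pc1_manual, 4), round(pc2_manual, 4))   # close to -1.5928 and 0.7673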


The same output is obtained when we use the PCA class and its fit_transform method from the sklearn.decomposition library. This is preferred, as it is faster
and eliminates the possibility of human error, provided the inputs are specified correctly. A small sample of the consolidated output for the same dataset is
displayed below.

In the first row (index 0), the first two values (-1.59286 and 0.767334) match the manually computed PC1 and PC2 values above.

PCA0 PCA1 PCA2 PCA3 PCA4 PCA5 PCA6 PCA7 PCA8


0 -1.59286 0.767334 -0.10107 -0.92175 -0.74398 -0.29831 0.638443 -0.87939 0.093084
1 -2.1924 -0.57883 2.278798 3.588918 1.059997 -0.17714 0.236753 0.046925 1.11378
2 -1.43096 -1.09282 -0.43809 0.677241 -0.36961 -0.96059 -0.24828 0.30874 -0.10545
3 2.855557 -2.63061 0.141722 -1.29549 -0.18384 -1.05951 -1.24936 -0.14769 0.378997
4 -2.21201 0.021631 2.38703 -1.11454 0.684451 0.004918 -2.15922 -0.62441 -0.16038

PCA9 PCA10 PCA11 PCA12 PCA13 PCA14 PCA15 PCA16


0 0.048593 0.399747 -0.08969 -0.0521 0.18014 0.001752 -0.09314 0.093552
1 0.965154 -0.21251 0.097239 -0.24352 -0.7442 0.10371 -0.05026 -0.17406
2 0.64066 -0.15499 -0.34473 0.097551 0.227527 -0.02256 -0.00405 0.003759
3 0.461244 -0.42065 0.687143 -0.07546 -0.00338 -0.07318 -0.19155 -0.17525
4 0.363428 -0.15334 -0.05055 0.267207 -0.61441 -0.27399 0.010653 0.048344
Table 2-18 Auto PC Calculation with Python sklearn (Using Scaled Data)
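A minimal sketch of this library-based computation, assuming the raw numeric data is in a DataFrame num_df (assumed name):

# Sketch: scaling followed by PCA scores via sklearn (assumes a raw numeric DataFrame `num_df`)
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

scaled = StandardScaler().fit_transform(num_df)   # z-score scaling
pca = PCA(n_components=17)
scores = pca.fit_transform(scaled)                # PC scores: 777 rows x 17 columns

scores_df = pd.DataFrame(scores, columns=[f"PCA{i}" for i in range(17)])
print(scores_df.head())                           # compare row 0 with the first row of Table 2-18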


2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide
on the optimum number of principal components? What do the eigenvectors
indicate?
Consider the cumulative values of the eigenvalues; How does it help you to decide on the optimum
number of principal components?

The cumulative sum of the eigenvalues approaches the total variance of the scaled data (which, for standardised
data, equals the number of variables, i.e. 17). We use this to compute the ratio (see formula below), which allows a relative comparison
of the components and tells us how many principal components are relevant. This is done to reduce the dimensions.
Ratio = Eigen value of each PC / Sum of Eigen values of all PCs
Eigen Values Ratio: [0.32020628, 0.26340214, 0.06900917, 0.05922989,
0.05488405, 0.04984701, 0.03558871, 0.03453621,
0.03117234, 0.02375192, 0.01841426, 0.01296041,
0.00985754, 0.00845842, 0.00517126, 0.00215754,
0.00135284]
The graphical display of the cumulative eigenvalue ratio as a scree plot is shown below. The red horizontal line
indicates 80% cumulative variance. The PCs retained should explain between 70% and 90% of the
total variance; we reach this range at around 5 principal components.

Figure 2-16 Eigen Values Ratio Cumulative Sum Scree Plot
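A minimal sketch of how the cumulative ratio and such a scree plot can be produced, assuming the fitted pca object from earlier (assumed name):

# Sketch: cumulative explained-variance ratio and scree plot with an 80% reference line
import numpy as np
import matplotlib.pyplot as plt

ratio = pca.explained_variance_ratio_          # eigenvalue of each PC / sum of all eigenvalues
cum_ratio = np.cumsum(ratio)

plt.plot(range(1, len(cum_ratio) + 1), cum_ratio, marker="o")
plt.axhline(y=0.80, color="red")               # 80% cumulative variance reference
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative explained variance ratio")
plt.show()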

What do the Eigen vectors indicate?

When we perform eigen decomposition, we compute the eigenvalues and eigenvectors. The eigenvectors
indicate the directions of the principal components, and the corresponding eigenvalues are the magnitudes of
the variance captured along those directions.


2.9 Explain the business implication of using the Principal Component Analysis for this
case study. How may PCs help in the further analysis? [Hint: Write Interpretations of
the Principal Components Obtained]
2.9.1 Dimensionality Reduction
We can check the computed eigenvalues to help decide the number of principal components to be
retained. Kaiser's rule is simply to retain components whose eigenvalues are greater than 1. It is based on the
assumption that retaining a component that explains less variance than a single original variable is not
reasonable, i.e. such a PC does not do the job of even a lone original variable.

Going by this rule, 4 is the number of principal components to retain in order to reduce the dimensions
of the original dataset.
Eigen Values: [5.45052162, 4.48360686, 1.17466761, 1.00820573,
0.93423123, 0.84849117, 0.6057878 , 0.58787222,
0.53061262, 0.4043029 , 0.31344588, 0.22061096,
0.16779415, 0.1439785 , 0.08802464, 0.03672545,
0.02302787]
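A minimal sketch of applying Kaiser's rule to the eigenvalues listed above:

# Sketch: Kaiser's rule - retain components whose eigenvalue exceeds 1
import numpy as np

eigen_values = np.array([5.45052162, 4.48360686, 1.17466761, 1.00820573,
                         0.93423123, 0.84849117, 0.6057878, 0.58787222,
                         0.53061262, 0.4043029, 0.31344588, 0.22061096,
                         0.16779415, 0.1439785, 0.08802464, 0.03672545,
                         0.02302787])
n_keep = int(np.sum(eigen_values > 1))
print(n_keep)   # 4 components have eigenvalues greater than 1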

The computed eigenvalues are also displayed graphically as a scree plot to show which principal components
should be retained.

Figure 2-17 Eigen Values Scree Plot

Considering that we started with 17 columns of numeric data, reducing the number of relevant data fields
to only 4-5 is very useful. It reduces computation effort and focuses the dataset on significant variables.


2.9.2 Reduced Dimension Data Set - PCA


Once the PCA has been applied with reduced dimensions (n_components = 5), we plot the heat map of the new Eigen vectors to determine the names of the new
columns.

Figure 2-18 Reduced Dimensions Eigen Values Heat Map

The red-outlined blocks in the heat map indicate the highest loadings in each PC. We take the corresponding column names and rename the PCs to get meaningful labels.

• PC0 = pc_Expend
• PC1 = pc_application_acceptance_enrollment
• PC2 = pc_books_personal_expenditure
• PC3 = pc_faculty_phd_terminal
• PC4 = pc_top_10_25_hsc

Once the columns have been renamed and the ‘Names’ column has been added back to the dataset, the transformed dataset is ready. This has 777 rows and 6
columns. We show a sample portion of this below.
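A minimal sketch of how this reduced, relabelled dataset can be assembled, assuming scaled_df (scaled numeric data) and a pandas Series names holding the college names (assumed names):

# Sketch: 5-component PCA scores relabelled with meaningful PC names, with 'Names' added back
import pandas as pd
from sklearn.decomposition import PCA

pca5 = PCA(n_components=5)
scores5 = pca5.fit_transform(scaled_df)

pc_names = ["pc_Expend",
            "pc_application_acceptance_enrollment",
            "pc_books_personal_expenditure",
            "pc_faculty_phd_terminal",
            "pc_top_10_25_hsc"]
reduced_df = pd.DataFrame(scores5, columns=pc_names)
reduced_df.insert(0, "Names", names.values)   # college names kept as a separate column
print(reduced_df.shape)                        # (777, 6)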


Names   pc_Expend   pc_application_acceptance_enrollment   pc_books_personal_expenditure   pc_faculty_phd_terminal   pc_top_10_25_hsc
0 Abilene Christian University   -1.59286   0.767334   -0.10107   -0.92175   -0.74398
1 Adelphi University   -2.1924   -0.57883   2.278797   3.588919   1.059997
2 Adrian College   -1.43096   -1.09282   -0.43809   0.677241   -0.36961
3 Agnes Scott College   2.855557   -2.63061   0.14172   -1.29548   -0.18384
4 Alaska Pacific University   -2.21201   0.021631   2.387029   -1.11454   0.684451
Figure 2-19 Transformed Data Set with Reduced Dimensionality

With this reduced dimensionality (important parameters only), we can isolate the data needed for further analysis.

2.9.3 Multicollinearity Removed


On completion of the PCA process, we get a resultant dataset which can be used for further processing.
The key property of this transformed dataset is that its components are uncorrelated with one another, i.e.
the multicollinearity present in the original variables has been removed. This is evident when we plot the
heat map of the transformed data. Please see the comparison heat map of the original and transformed data below.
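A minimal sketch of such a comparison heat map, assuming scaled_df (original scaled variables) and reduced_df (transformed PC dataset) from the earlier steps (assumed names):

# Sketch: correlation heat maps before and after PCA to show multicollinearity removal
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(16, 6))
sns.heatmap(scaled_df.corr(), ax=axes[0], cmap="coolwarm")
axes[0].set_title("Original scaled variables")
sns.heatmap(reduced_df.drop(columns="Names").corr(), ax=axes[1], cmap="coolwarm")
axes[1].set_title("Transformed principal components")
plt.show()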


Figure 2-20 Original and Transformed Datasets


2.9.4 Data Analytics Examples


2.9.4.1 Colleges with High Application/Acceptance/Enrollment having Low Books/Personal Expenditure
With a combined filter of colleges whose books/personal expenditure is below the first quartile and whose
application/acceptance/enrollment is above the third quartile, we get 57 colleges for a student to choose
from.
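A minimal sketch of this kind of quartile filter on the transformed dataset, assuming the reduced_df built earlier (assumed name); the same pattern applies to the check in the next example:

# Sketch: colleges with high application/acceptance/enrollment but low books/personal expenditure
app_col = "pc_application_acceptance_enrollment"
exp_col = "pc_books_personal_expenditure"

shortlist = reduced_df[
    (reduced_df[app_col] > reduced_df[app_col].quantile(0.75))   # above the third quartile
    & (reduced_df[exp_col] < reduced_df[exp_col].quantile(0.25)) # below the first quartile
]
print(len(shortlist), "colleges shortlisted")
print(shortlist["Names"].head())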

2.9.4.2 Colleges with High PhD/Terminal Faculty and Top 10%/25% HSC
With a combined filter of colleges whose PhD/Terminal faculty component is above the third quartile and
whose new students come from the top 10%/25% of their HSC class, we get 54 colleges. This list of college
names will be helpful for prospective students.

2.9.5 Conclusion
With reduced dimensionality shining a light on only the important data parameters, future students
can decide which university to apply to for:

1. A better chance of acceptance
2. A lower cost to the student
3. Access to more qualified staff
4. Peers who are also gifted and motivated, helping in personal growth
