ABSTRACT

BY USING METHODS OF EXPLORATORY DATA ANALYSIS, ANOVA AND
PRINCIPAL COMPONENT ANALYSIS, THIS REPORT ANALYSES, INTERPRETS
AND DRAWS CONCLUSIONS FROM THE DATA SETS GIVEN UNDER THE
ADVANCED STATISTICS PROJECT.

RAJESH DONTHINENI

PROJECT REPORT ON ADVANCED STATISTICS

POST GRADUATE PROGRAM IN DATA SCIENCE AND BUSINESS ANALYTICS

INDEX
CONTENT

PROBLEM 1A
Table 1.1: Salary.csv sample dataset
Table 1.2: Descriptive statistics for the dataset
Table 1.3: Check for null values
1. State the null and the alternate hypothesis for conducting one-way ANOVA for both education and occupation individually
2. Perform a one-way ANOVA on salary with respect to education. State whether the null hypothesis is accepted or rejected based on the ANOVA results
Table 1.4: One-way ANOVA table for salary with respect to education
Plot 1.1: Point plot for salary with respect to education
3. Perform a one-way ANOVA on salary with respect to occupation. State whether the null hypothesis is accepted or rejected based on the ANOVA results
Table 1.5: One-way ANOVA table for salary with respect to occupation
Plot 1.2: Point plot for salary with respect to occupation

PROBLEM 1B
1. What is the interaction between two treatments? Analyse the effects of one variable on the other (education and occupation) with the help of an interaction plot. [hint: use the ‘pointplot’ function from the ‘seaborn’ library]
Plot 1.3: Interaction plot
2. Perform a two-way ANOVA based on salary with respect to both education and occupation (along with their interaction education*occupation). State the null and alternative hypotheses and state your results. How will you interpret this result?
Table 1.6: Two-way ANOVA table
3. Explain the business implications of performing ANOVA for this particular case study

PROBLEM 2
Table 2.1: Education - Post 12th Standard.csv dataset
Table 2.2: Check for data types
Table 2.3: Descriptive statistics for the dataset
Table 2.4: Check for duplicates
Table 2.5: Check for null values
Table 2.6: Numeric columns of the dataset
1. Perform exploratory data analysis [both univariate and multivariate analysis to be performed]. What insight do you draw from the EDA?
Plot 2.1: Univariate analysis of numeric variables using histogram plot
Plot 2.2: Univariate analysis of numeric variables using boxplot
Plot 2.3: Multivariate analysis of numeric columns using heatmap
2. Is scaling necessary for PCA in this case? Give justification and perform scaling
Table 2.7: Scaled dataset
3. Comment on the comparison between the covariance and the correlation matrices from this data [on scaled data]
Table 2.8: Correlation matrix of scaled data set
Table 2.9: Covariance matrix of scaled data set
4. Check the dataset for outliers before and after scaling. What insight do you derive here? [please do not treat outliers unless specifically asked to do so]
Plot 2.4: Outliers checking before scaling
Plot 2.5: Outliers checking after scaling
5. Extract the eigenvalues and eigenvectors. [using sklearn PCA print both]
Table 2.10: Eigenvectors and eigenvalues of dataset
6. Perform PCA and export the data of the principal component (eigenvectors) into a data frame with the original features
Table 2.11: New dataset of the principal components with original features
7. Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two places of decimals only). [hint: write the linear equation of PC in terms of eigenvectors and corresponding features]
Table 2.12: Explicit form of first PC
8. Consider the cumulative values of the eigenvalues. How does it help you to decide on the optimum number of principal components? What do the eigenvectors indicate?
Table 2.13: Cumulative values of the eigenvalues
Plot 2.6: Scree plot
Table 2.14: New data set of reduced principal components
Plot 2.7: Correlation between the principal components and the variables present in the dataset
9. Explain the business implication of using the principal component analysis for this case study. How may PCs help in the further analysis? [hint: write interpretations of the principal components obtained]


Problem 1A:
Salary is hypothesized to depend on educational qualification and occupation. To understand
the dependency, the salaries of 40 individuals are collected and each person’s educational
qualification and occupation are noted. Educational qualification is at three levels, High
school graduate, Bachelor, and Doctorate. Occupation is at four levels, Administrative and
clerical, Sales, Professional or specialty, and Executive or managerial. A different number of
observations are in each level of education – occupation combination.

TABLE 1.1: SALARY.CSV SAMPLE DATASET

 The given data set has three variables: Education, Occupation and Salary. Education and
Occupation are categorical variables and Salary is a continuous variable. The data
set has a total of 40 entries and three columns.

TABLE 1.2: DESCRIPTIVE STATISTICS FOR THE DATASET

 The table above shows three unique education levels and four unique occupation
levels.
 Doctorate is the top entry for education and Prof-specialty is the top entry for
occupation. Note that the top entry in such a summary is the most frequent level; it
does not by itself show which group has the highest salaries.
 Salaries in the data set range from a minimum of 50103/- to a maximum of 260151/-.
 Judging by the mean, median and standard deviation of the salary variable, the data
appears to be more or less normally distributed.

TABLE 1.3: CHECK FOR NULL VALUES

 The results above show that there are no null values in the dataset. The dataset has
1 integer column and 2 object columns.

1. State the null and the alternate hypothesis for conducting one-way ANOVA for both
Education and Occupation individually.

One-way ANOVA (Education)

NULL HYPOTHESIS- 𝐻0: The mean salary is the same across all the 3 categories of education
(Doctorate, Bachelors, HS-Grad).

ALTERNATE HYPOTHESIS- 𝐻1: The mean salary is different in at least one category of
education.

One-way ANOVA (Occupation)

NULL HYPOTHESIS- 𝐻0: The mean salary is the same across all the 4 categories of
occupation (Prof-Specialty, Sales, Adm-clerical, Exec-Managerial).

ALTERNATE HYPOTHESIS- 𝐻1: The mean salary is different in at least one category of
occupation.
2. Perform a one-way ANOVA on Salary with respect to Education. State whether the
null hypothesis is accepted or rejected based on the ANOVA results.
TABLE 1.4: ONE-WAY ANOVA TABLE FOR SALARY WITH RESPECT TO
EDUCATION

PLOT 1.1: POINT PLOT FOR SALARY WITH RESPECT TO EDUCATION

 The one-way ANOVA table for salary with respect to Education gives a P value smaller
than the alpha value (1.257709e-08 < 0.05), so we reject the null hypothesis and
conclude that the mean salary differs significantly for at least one category of
education.
 The point plot shows that Doctorate candidates have the highest salaries, followed by
Bachelors and HS-grad. In general, people with a higher educational background tend
to have higher salaries in the market.
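
As an illustration of the test used above, here is a minimal sketch of a one-way ANOVA in Python. It is not the code behind Table 1.4: the salaries are made up, and `scipy.stats.f_oneway` is swapped in as a compact way to obtain the F statistic and P value for a single factor.

```python
import pandas as pd
from scipy.stats import f_oneway

# Made-up salaries for illustration; NOT the values from Salary.csv.
df = pd.DataFrame({
    "Education": ["HS-grad"] * 4 + ["Bachelors"] * 4 + ["Doctorate"] * 4,
    "Salary": [70000, 75000, 80000, 72000,
               150000, 160000, 170000, 165000,
               200000, 210000, 220000, 215000],
})

# One array of salaries per education level.
groups = [g["Salary"].to_numpy() for _, g in df.groupby("Education")]

# H0: all group means are equal. H1: at least one group mean differs.
f_stat, p_value = f_oneway(*groups)

alpha = 0.05
reject_null = p_value < alpha
print(f"F = {f_stat:.2f}, p = {p_value:.3g}, reject H0: {reject_null}")
```

The decision rule is the same as in the text: the null hypothesis is rejected only when the P value is below alpha.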
3. Perform a one-way ANOVA on Salary with respect to Occupation. State whether
the null hypothesis is accepted or rejected based on the ANOVA results.

TABLE 1.5: ONE-WAY ANOVA TABLE FOR SALARY WITH RESPECT TO
OCCUPATION

PLOT 1.2: POINT PLOT FOR SALARY WITH RESPECT TO OCCUPATION

 The one-way ANOVA table for salary with respect to Occupation gives a P value greater
than the alpha value (0.4585 > 0.05), so we fail to reject the null hypothesis and
conclude that there is no significant difference in the mean salaries across the 4
categories of occupation.
 The point plot shows that Exec-Managerial candidates have the highest salaries,
followed by the other categories. People at managerial levels tend to have higher
salaries in the market, although the ANOVA result shows that this difference is not
statistically significant.

Problem 1B:

1. What is the interaction between two treatments? Analyse the effects of one variable
on the other (Education and Occupation) with the help of an interaction plot.
PLOT 1.3: INTERACTION PLOT

 The interaction plot suggests a considerable amount of interaction between the
categorical variables Education and Occupation: the salary ordering of the occupations
changes from one education level to another.
 People with HS-grad education hold only Adm-clerical, Sales and Prof-Specialty
occupations; none of them holds an Exec-managerial occupation.
 People with Bachelors or Doctorate education in Adm-clerical and Sales occupations
earn almost the same salaries. This is in line with the one-way ANOVA result of no
significant difference in mean salaries across the four occupation categories.
 People with Bachelors education in Prof-Specialty earn less than people with
Bachelors education in Adm-clerical and Sales.
 People with Bachelors education in Sales earn more than people with Bachelors
education in Prof-Specialty, whereas people with Doctorate education in Sales earn
less than people with Doctorate education in Prof-Specialty.
 Similarly, people with Bachelors education in Prof-Specialty earn less than people
with Bachelors education in Exec-Managerial, whereas people with Doctorate education
in Prof-Specialty earn more than people with Doctorate education in Exec-Managerial.
 Salespeople with Bachelors or Doctorate education earn the same salaries, and earn
more than salespeople with HS-grad education.
 Adm-clerical people with HS-grad education earn the lowest salaries compared to
those with Bachelors or Doctorate education.
 Among Prof-Specialty people, those with Doctorate education earn the maximum
salaries and those with HS-grad education earn the minimum.
 Overall, people with HS-grad education earn the minimum salaries.
 People with Bachelors education in Sales and in Exec-Managerial earn the same
salaries.
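
The comparisons above are read off the mean salary of each Education and Occupation combination. A minimal sketch of how those cell means can be tabulated with pandas is given below; the records are made up for illustration and are not the values from Salary.csv.

```python
import pandas as pd

# Made-up records for illustration; NOT the values from Salary.csv.
df = pd.DataFrame({
    "Education": ["HS-grad", "HS-grad", "Bachelors", "Bachelors",
                  "Bachelors", "Doctorate", "Doctorate", "Doctorate"],
    "Occupation": ["Adm-clerical", "Sales", "Adm-clerical", "Sales",
                   "Exec-managerial", "Sales", "Prof-specialty", "Exec-managerial"],
    "Salary": [60000, 70000, 170000, 190000, 185000, 180000, 250000, 210000],
})

# Mean salary for every Education x Occupation cell; NaN marks empty cells.
cell_means = df.pivot_table(index="Education", columns="Occupation",
                            values="Salary", aggfunc="mean")
print(cell_means)

# Combinations with no observations at all in this toy data.
empty_cells = [(edu, occ)
               for edu in cell_means.index
               for occ in cell_means.columns
               if pd.isna(cell_means.loc[edu, occ])]
print(empty_cells)
```

An interaction plot draws exactly these cell means, one line per level of the second factor (for example with seaborn's `pointplot` using `x`, `y` and `hue`). Lines that cross or are clearly not parallel point towards an interaction, which the two-way ANOVA then tests formally.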
2. Perform a two-way ANOVA based on Salary with respect to both Education and
Occupation (along with their interaction Education*Occupation). State the null and
alternative hypotheses and state your results. How will you interpret this result?

NULL HYPOTHESIS- 𝐻0: There is no interaction effect between the 2 independent
variables, education and occupation, on the mean salary. (The effect of the independent
variable ‘education’ on the mean ‘salary’ does not depend on the level of the other
independent variable ‘occupation’.)

ALTERNATE HYPOTHESIS- 𝐻1: There is an interaction effect between the
independent variable ‘education’ and the independent variable ‘occupation’ on the mean
salary.

TABLE 1.6: TWO-WAY ANOVA TABLE

 The two-way ANOVA table shows a significant interaction between the variables
Education and Occupation.
 As the p value = 2.232500e-05 is less than the significance level (alpha = 0.05), we
reject the null hypothesis.
 Thus, there is an interaction effect between education and occupation on
the mean salary.

3. Explain the business implications of performing ANOVA for this particular case
study.
 From the ANOVA results and the interaction plot, we see that education combined
with occupation results in higher salaries.
 People with Doctorate education draw the maximum salaries and people with HS-grad
education earn the least.
 Thus, we can conclude that salary depends on educational qualification and on its
combination with occupation.

Problem 2:
The dataset Education - Post 12th Standard.csv contains information on various colleges. You
are expected to do a Principal Component Analysis for this case study according to the
instructions given. The data dictionary of the 'Education - Post 12th Standard.csv' can be
found in the following file: Data Dictionary.xlsx.
TABLE 2.1: EDUCATION – POST 12TH STANDARD.CSV DATASET

TABLE 2.2: CHECK FOR DATA TYPES

 The given data set has 777 entries and a total of 18 columns.
 The dataset has one object column and 17 numeric (float and int) columns.
TABLE 2.3: DESCRIPTIVE STATISTICS FOR THE DATASET

 Comparing the maximum values with the rest of the summary statistics suggests that
outliers are present throughout the dataset.
 Each numeric field has a count of 777 entries, which shows that there are no null
values in the dataset.

TABLE 2.4: CHECK FOR DUPLICATES

 The table above shows that there are no duplicate rows in the dataset.

TABLE 2.5: CHECK FOR NULL VALUES

 The table above shows that there are no null values in the dataset.
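
The data type, duplicate and null checks above can be sketched with pandas as follows. The small frame is a made-up stand-in for the college data, so only the method calls carry over, not the values.

```python
import pandas as pd

# Made-up stand-in for the college data (values are illustrative only).
df = pd.DataFrame({
    "Names": ["College A", "College B", "College C", "College D"],
    "Apps": [1660, 2186, 1428, 417],
    "Accept": [1232, 1924, 1097, 349],
    "Grad.Rate": [60.0, 56.0, 54.0, 59.0],
})

# Data type of each column.
print(df.dtypes)

# Number of fully duplicated rows.
n_duplicates = int(df.duplicated().sum())
print("duplicate rows:", n_duplicates)

# Number of missing values per column.
null_counts = df.isnull().sum()
print(null_counts)

# Keep only the numeric columns, since PCA needs numeric input.
numeric_df = df.select_dtypes(include="number")
print(list(numeric_df.columns))
```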
TABLE 2.6: NUMERIC COLUMNS OF THE DATASET

 The dataset above consists of only the numeric columns, because Principal Component
Analysis needs numerical fields.

1. Perform Exploratory Data Analysis [both univariate and multivariate analysis to be
performed]. What insight do you draw from the EDA?

 Univariate analysis lets us understand the patterns and distribution of each variable
and summarise the data.
 For univariate analysis, boxplots and histograms are suitable visualization
techniques to study the behaviour of each variable.

PLOT 2.1: UNIVARIATE ANALYSIS OF NUMERIC VARIABLES USING
HISTOGRAM PLOT

PLOT 2.2: UNIVARIATE ANALYSIS OF NUMERIC VARIABLES USING
BOXPLOT

 The plots above show the histogram and boxplot of each numeric variable.
 The boxplots show that all numeric variables in the dataset have outliers except
“TOP 25 PERC”.
 The histogram of the Applications variable shows that most colleges or universities
have applications in the range of 3000 to 5000. The maximum number of applications
seems to be around 50,000. The distribution of the data is positively skewed.
 The histogram of the Acceptance variable shows that the majority of universities
accept between 80 and 1500 applications. The distribution of the data is positively
skewed.
 The histogram of the Enrol variable shows that the majority of colleges have enrolled
between 200 and 500 students. The distribution of the data is positively skewed.

 The histogram of the Top 10 PERC variable shows a good amount of intake, about 30
to 50 percent of new students, from the top 10 percent of the higher secondary class.
The distribution of the data is positively skewed.
 The histogram of the Top 25 PERC variable shows that the majority of the students
come from the top 25% of the higher secondary class. The data is almost normally
distributed.
 The histogram of the Full-Time Undergraduate variable shows that the universities
have about 3000 to 5000 full-time undergraduates. The distribution of the data is
positively skewed.
 The histogram of the Part-Time Undergraduate variable shows that the universities
have about 1000 to 3000 part-time undergraduates. The distribution of the data is
positively skewed.
 The histogram of the Outstate variable shows that the number of outstate students is
mostly in the range of 6000 to 13000. The data is almost normally distributed.
 The histogram of the Room Board variable shows that the cost of room and board is
mostly about 3600 to 4600. The data is approximately normally distributed.
 The histogram of the Books variable shows that the cost of books per student is
mostly in the range of 400 to 700. The distribution seems to be bimodal, and the
boxplot has outliers on both sides.
 The histogram of the Personal Expense variable shows that the personal expenses of
most students are in the range of 400 to 1600, while some students' personal
expenses are far higher than those of the rest. The distribution of the data seems to
be positively skewed.
 The histogram of the PHD variable shows that 75 to 85 percent of the faculty hold
PhDs in the universities. The distribution of the data seems to be negatively skewed.
 The histogram of the Terminal variable shows that most of the faculty hold a terminal
degree in the universities. The distribution of the data seems to be negatively
skewed.
 The histogram of the SF Ratio variable shows that the most common SF ratio is between
13 and 15 in the universities. The data is almost normally distributed.
 The histogram of the Alumni variable shows that 12 to 24 percent of alumni donate in
the universities. The data is almost normally distributed.
 The histogram of the Expenditure variable shows that the instructional expenditure
per student is mostly between 8000 and 10000. The distribution of the data is
positively skewed.
 The histogram of the Graduation Rate variable shows that the graduation rate in the
universities is generally above 65%. The data is approximately normally distributed.
PLOT 2.3: MULTIVARIATE ANALYSIS OF NUMERIC COLUMNS USING
HEATMAP

 Multivariate analysis examines the behaviour of two or more variables and the
relationships between those variables.
 The pair plot helps us to understand the relationship between all the numerical values
in the dataset. By comparing all the variables with each other we can understand the
patterns or trends in the dataset.

 The correlation heatmap shows the correlation between each pair of numeric
variables.
 I have used the correlation heatmap as part of the multivariate analysis to check how
strongly the variables are correlated with each other.
 The heatmap shows that the application variable is highly positively correlated with
applications accepted, students enrolled and full-time graduates.
 This relationship indicates that colleges receiving more applications also accept
more applications and enrol more full-time students.
 There is a negative correlation between applications and percentage of alumni:
colleges that receive more applications tend to have a lower percentage of alumni who
donate.
 Applications are positively correlated with top 10 and top 25 percent of higher
secondary class, outstate, room board, books, personal, PhD, terminal, S.F ratio,
expenditure and graduation rate.
2. Is scaling necessary for PCA in this case? Give justification and perform scaling.
 Scaling is necessary for principal component analysis in this case study because the
features in the data set are on different scales.
 In the original data set, the variables from Application to Outstate are numbers of
students.
 The exceptions in that group are the top 10 percent and top 25 percent variables,
whose values are given as percentages.
 The variables Room board, Books, Expenditure and Personal are values associated
with money.
 The variables PHD, Terminal, Percentage of alumni and Graduation rate are
percentages, and SF ratio is a ratio.
 Before performing PCA we need to make sure all the features in the data set are on
the same scale; otherwise the features with the largest numeric ranges would dominate
the principal components.
 I have used the Z-score method to scale this data set, so that every scaled variable
has a mean of 0 and a standard deviation of 1. Below is a sample of the scaled data
set.

TABLE 2.7: SCALED DATASET

 All features in the data set are now on the same scale after performing Z-score
scaling on the original data set.
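
A minimal sketch of Z-score scaling is given below. It assumes `scipy.stats.zscore` applied column by column and uses synthetic columns on deliberately different scales; the report does not state which routine produced Table 2.7.

```python
import numpy as np
from scipy.stats import zscore

rng = np.random.default_rng(1)

# Synthetic columns on very different scales (NOT the college data).
X = np.column_stack([
    rng.normal(3000, 1500, size=200),    # e.g. a count such as applications
    rng.normal(10000, 4000, size=200),   # e.g. an amount of money
    rng.uniform(0, 100, size=200),       # e.g. a percentage
])

# Column-wise z-score: subtract the column mean, divide by the column
# standard deviation (population standard deviation, ddof=0, by default).
X_scaled = zscore(X, axis=0)

print(X.std(axis=0).round(1))          # very different spreads before scaling
print(X_scaled.mean(axis=0).round(6))  # means after scaling
print(X_scaled.std(axis=0).round(6))   # standard deviations after scaling
```

After scaling, each column has mean 0 and standard deviation 1 up to floating point error, which is what puts all features on an equal footing for PCA.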
3. Comment on the comparison between the covariance and the correlation matrices
from this data [on scaled data]
 Covariance shows how two variables vary together, whereas correlation shows how
strongly the two variables are related. The correlation value lies between -1 and +1
and the covariance value lies between -∞ and +∞.
 Covariance is affected by a change in scale: if all the values of one variable are
multiplied by a constant and all the values of another variable are multiplied by the
same or a different constant, the covariance changes, but the correlation is not
influenced by the change in scale.
 What the covariance and correlation matrices have in common is that both measure the
relationship and the dependency between two variables.
 Covariance indicates the direction of the linear relationship between the variables:
positive when they move in the same direction and negative when they move in opposite
directions.
 Correlation measures both the strength and the direction of the linear relationship
between two variables. The sign gives the direction (positively or negatively
correlated) and the magnitude gives the strength.
 The correlation matrix remains the same before and after scaling.
 From the correlation matrix we can identify variables which are highly positively
correlated, highly negatively correlated and moderately correlated with each other.
 Below are the covariance and correlation matrices of the scaled data set.

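TABLE 2.8: CORRELATION MATRIX OF SCALED DATA SET

TABLE 2.9: COVARIANCE MATRIX OF SCALED DATA SET

 After scaling, the values of the covariance and correlation matrices are almost the
same. This is expected: every scaled variable has a standard deviation of 1, so the
covariance of two scaled variables equals their correlation, apart from a small factor
such as n/(n - 1) when the scaling and the covariance use different denominators.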
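
The near-equality of the two matrices can be checked numerically. The sketch below uses synthetic data with 777 rows (the same row count as the data set, but not its values) and assumes Z-scores computed with the population standard deviation, while `numpy.cov` uses the sample denominator n - 1.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 777  # same row count as the data set; the values themselves are synthetic

# Three correlated columns, then rescaled to very different units.
mixing = np.array([[1.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 1.0]])
X = rng.normal(size=(n, 3)) @ mixing
X = X * np.array([1000.0, 10.0, 0.1])

# Z-score with the population standard deviation (ddof=0).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

corr_scaled = np.corrcoef(Z, rowvar=False)
cov_scaled = np.cov(Z, rowvar=False)       # sample covariance, ddof=1
corr_original = np.corrcoef(X, rowvar=False)

# Covariance of scaled data = correlation times n / (n - 1).
factor = n / (n - 1)
print(np.allclose(cov_scaled, corr_scaled * factor))

# Correlation is unchanged by scaling.
print(np.allclose(corr_original, corr_scaled))
```

With n = 777 the factor n/(n - 1) is about 1.0013, which is why the two tables agree to roughly two decimal places.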

4. Check the dataset for outliers before and after scaling. What insight do you derive
here? [Please do not treat Outliers unless specifically asked to do so]
PLOT 2.4: OUTLIERS CHECKING BEFORE SCALING

PLOT 2.5: OUTLIERS CHECKING AFTER SCALING

 Comparing the plots before and after scaling, outliers are present in both, i.e. the
outliers are still present in the data set after scaling.
 The reason is that scaling does not remove outliers. Scaling only re-expresses each
value as a Z score, so the relative position of every point is unchanged.
 To get rid of outliers we need to treat them with an appropriate method, after
taking insights from the business on the importance of keeping or removing the
outliers.
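
To illustrate why scaling leaves the outliers in place, the sketch below counts boxplot-style outliers (points beyond 1.5 times the interquartile range from the quartiles) before and after Z-score scaling. The skewed sample is synthetic and the helper `count_iqr_outliers` is a hypothetical name introduced only for this example.

```python
import numpy as np

def count_iqr_outliers(values):
    """Count points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], as a boxplot would flag."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(np.sum((values < lower) | (values > upper)))

rng = np.random.default_rng(3)

# Synthetic positively skewed variable (NOT a column of the college data).
x = rng.lognormal(mean=8.0, sigma=0.8, size=500)

# Z-score scaling is a shift and a positive rescaling of the same values.
z = (x - x.mean()) / x.std()

outliers_before = count_iqr_outliers(x)
outliers_after = count_iqr_outliers(z)
print(outliers_before, outliers_after)
```

Because the quartiles and fences shift and rescale together with the data, the same points are flagged before and after scaling.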

5. Extract the eigenvalues and eigenvectors. [Using Sklearn PCA Print Both]
TABLE 2.10: EIGENVECTORS AND EIGENVALUES OF DATASET
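
A minimal sketch of extracting eigenvalues and eigenvectors with sklearn's `PCA` is given below. The data are synthetic, so the output does not reproduce Table 2.10; the sketch only shows which attributes hold the two quantities and cross-checks them against the covariance matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)

# Synthetic correlated features built from two latent factors (NOT the college data).
latent = rng.normal(size=(300, 2))
X = np.column_stack([
    latent[:, 0] + 0.1 * rng.normal(size=300),
    latent[:, 0] + 0.3 * rng.normal(size=300),
    latent[:, 1] + 0.2 * rng.normal(size=300),
    latent[:, 1] - latent[:, 0] + 0.5 * rng.normal(size=300),
])
X_scaled = StandardScaler().fit_transform(X)

pca = PCA()  # keep all components
pca.fit(X_scaled)

eigenvalues = pca.explained_variance_  # variance along each PC, largest first
eigenvectors = pca.components_         # one unit-length eigenvector per row

print(eigenvalues)
print(eigenvectors)

# Cross-check: these are the eigenvalues of the sample covariance matrix.
cov_eigenvalues = np.linalg.eigvalsh(np.cov(X_scaled, rowvar=False))[::-1]
print(np.allclose(eigenvalues, cov_eigenvalues))
```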

6. Perform PCA and export the data of the Principal Component (eigenvectors) into a
data frame with the original features
TABLE 2.11: NEW DATASET OF THE PRINCIPAL COMPONENTS WITH
ORIGINAL FEATURES

7. Write down the explicit form of the first PC (in terms of the eigenvectors. Use values
with two places of decimals only). [hint: write the linear equation of PC in terms of
eigenvectors and corresponding features]
 The linear equation of the 1st component:
 PC1 = 0.249 * Apps + 0.208 * Accept + 0.176 * Enrol + 0.354 * Top10perc + 0.344 *
Top25perc + 0.155 * F.Undergrad + 0.026 * P.Undergrad + 0.295 * Outstate + 0.249
* Room.Board + 0.065 * Books - 0.043 * Personal + 0.318 * PhD + 0.317 *
Terminal - 0.177 * S.F.Ratio + 0.205 * perc.alumni + 0.319 * Expend + 0.252 *
Grad.Rate
TABLE 2.12: EXPLICIT FORM OF FIRST PC
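
As a small illustration, the sketch below turns the loadings quoted above into an equation string rounded to two decimals and checks that the loading vector has (approximately) unit length, as an eigenvector should. The function `pc_equation` is a hypothetical helper written for this example, and the feature names follow the text with stray spaces removed.

```python
# Loadings of the first principal component as quoted in the text (three decimals).
pc1_loadings = {
    "Apps": 0.249, "Accept": 0.208, "Enrol": 0.176, "Top10perc": 0.354,
    "Top25perc": 0.344, "F.Undergrad": 0.155, "P.Undergrad": 0.026,
    "Outstate": 0.295, "Room.Board": 0.249, "Books": 0.065, "Personal": -0.043,
    "PhD": 0.318, "Terminal": 0.317, "S.F.Ratio": -0.177, "perc.alumni": 0.205,
    "Expend": 0.319, "Grad.Rate": 0.252,
}

def pc_equation(loadings, decimals=2):
    """Format a dict of feature -> loading as 'a * x1 + b * x2 - c * x3 ...'."""
    terms = []
    for i, (name, weight) in enumerate(loadings.items()):
        term = f"{abs(weight):.{decimals}f} * {name}"
        if i == 0:
            # The first term carries its sign directly, without a leading '+'.
            terms.append(f"-{term}" if weight < 0 else term)
        else:
            terms.append(f"{'-' if weight < 0 else '+'} {term}")
    return " ".join(terms)

print("PC1 =", pc_equation(pc1_loadings))

# An eigenvector has unit length, so the squared loadings should sum to about 1.
squared_norm = sum(w ** 2 for w in pc1_loadings.values())
print(round(squared_norm, 3))
```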

8. Consider the cumulative values of the eigenvalues. How does it help you to decide on
the optimum number of principal components? What do the eigenvectors indicate?
TABLE 2.13: CUMULATIVE VALUES OF THE EIGENVALUES

 An eigenvalue is the coefficient attached to an eigenvector that gives it its
magnitude. In PCA it is the variance of the data along that eigenvector.
 Eigenvectors are unit vectors, i.e. their length is equal to 1. They are often
referred to as right vectors, which simply means column vectors.
 Eigenvectors indicate the directions of maximum variance in the dataset. The first
eigenvector is the direction of the line of best fit through the data.
 Cumulative values are obtained by adding the variance explained by each principal
component one by one. They show how much variance is explained by the first few
principal components, from which we can derive the optimum number of principal
components needed for further analysis of the dataset with a reduced number of
variables.
 Adding the explained variance of all the eigenvalues gives the total, i.e. 100%.
 To decide on the optimum number of principal components we check for a cumulative
variance of 85% to 90%, depending on the data covered by each principal component.
 Then we identify the principal components that together account for 85% to 90% of
the variance.
 The incremental value between the components should not be less than five percent.
 On this basis the optimum number of principal components is 5, because after this the
incremental value is less than 5%. However, I am considering 6 principal components
for this case study because with them we are able to explain 85% of the variance
present in the data set.
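
The cumulative-variance rule described above can be sketched as follows. The eigenvalues are illustrative numbers for six standardised features, not the values of Table 2.13, and the 85% threshold is taken from the text.

```python
import numpy as np

# Illustrative eigenvalues, largest first (NOT the values from Table 2.13).
eigenvalues = np.array([2.7, 1.5, 0.8, 0.5, 0.35, 0.15])

# Share of total variance explained by each component, and the running total.
explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)
print(cumulative.round(3))

# Smallest number of components whose cumulative share reaches the threshold.
threshold = 0.85
n_components = int(np.argmax(cumulative >= threshold) + 1)
print(n_components)
```

In this toy example the first three components explain about 83% of the variance and the first four about 92%, so four components are needed to reach the 85% threshold.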
PLOT 2.6: SCREE PLOT

TABLE 2.14: NEW DATA SET WITH REDUCED PRINCIPAL COMPONENTS

PLOT 2.7: CORRELATION BETWEEN THE PRINCIPAL COMPONENTS AND
THE VARIABLES PRESENT IN THE DATASET

 The plot above shows that multicollinearity is highly reduced. We name each PC
after the characteristics it explains best, and can then start further analysis of the
given dataset using the 6 Principal Components.
9. Explain the business implication of using the Principal Component Analysis for this
case study. How may PCs help in the further analysis? [Hint: Write Interpretations
of the Principal Components Obtained]
 The given case study is about an education dataset which contains the names of
various colleges and universities together with various details about them.
 To understand more about the dataset, I have performed univariate and multivariate
analysis, which gives an understanding of the distribution of the dataset, its
skewness, its patterns and the correlation of its variables.
 From the multivariate analysis we could conclude that multiple variables are highly
correlated with each other.
 Scaling standardizes the variables to one scale, and it is very important to scale
the data before stepping into principal component analysis.
 Principal component analysis is used to reduce the multicollinearity between the
variables.
 Depending on the variance explained, we can reduce the number of principal
components.
 The number of principal components for this business case is 5 as per the conditions,
but I have taken 6 PCs because with the help of 6 PCs we are able to explain 85% of
the variance in the dataset.
 Using the 6 principal components we can now work with a dataset in which
multicollinearity is reduced.
 Principal Component Analysis is used to remove the redundant features from the
datasets without losing much information.
 The resulting components form a lower-dimensional representation of the data. The
first component has the highest variance, followed by the second, third and so on.
PCA works best on data sets having 3 or more dimensions, because with higher
dimensions it becomes increasingly difficult to make interpretations from the
resultant cloud of data.
 With this analysis we can perform further analysis and model building. PCA can
improve the efficiency of machine learning models.
