
PROJECT: Advanced Statistics

Meghna Batar | Advanced Statistics | March 19, 2020


PROBLEM 1A

Salary is hypothesized to depend on educational qualification and occupation. To
understand the dependency, the salaries of 40 individuals [SalaryData.csv] are
collected and each person’s educational qualification and occupation are noted.
Educational qualification is at three levels, High school graduate, Bachelor, and
Doctorate. Occupation is at four levels, Administrative and clerical, Sales,
Professional or specialty, and Executive or managerial. A different number of
observations are in each level of education – occupation combination.
[Assume that the data follows a normal distribution. In reality, the normality
assumption may not always hold if the sample size is small.]

1. State the null and the alternate hypothesis for conducting one-way ANOVA for both
Education and Occupation individually.
2. Perform a one-way ANOVA on Salary with respect to Education. State whether the
null hypothesis is accepted or rejected based on the ANOVA results.
3. Perform a one-way ANOVA on Salary with respect to Occupation. State whether the
null hypothesis is accepted or rejected based on the ANOVA results.
4. If the null hypothesis is rejected in either (2) or in (3), find out which class means are
significantly different. Interpret the result. (Non-Graded)

PROBLEM 1B

1. What is the interaction between two treatments? Analyse the effects of one variable
on the other (Education and Occupation) with the help of an interaction plot. [hint:
use the ‘pointplot’ function from the ‘seaborn’ library]
2. Perform a two-way ANOVA based on Salary with respect to both Education and
Occupation (along with their interaction Education*Occupation). State the null and
alternative hypotheses and state your results. How will you interpret this result?
3. Explain the business implications of performing ANOVA for this particular case study.

PROBLEM 2

The dataset Education - Post 12th Standard.csv contains information on various
colleges. You are expected to do a Principal Component Analysis for this case
study according to the instructions given. The data dictionary of the 'Education -
Post 12th Standard.csv' can be found in the following file: Data Dictionary.xlsx.

1. Perform Exploratory Data Analysis [both univariate and multivariate analysis to be
performed]. What insight do you draw from the EDA?
2. Is scaling necessary for PCA in this case? Give justification and perform scaling.
3. Comment on the comparison between the covariance and the correlation matrices
from this data [on scaled data].
4. Check the dataset for outliers before and after scaling. What insight do you derive
here? [Please do not treat Outliers unless specifically asked to do so]
5. Extract the eigenvalues and eigen vectors. [Using Sklearn PCA Print Both]
6. Perform PCA and export the data of the Principal Component (eigenvectors) into a
data frame with the original features
7. Write down the explicit form of the first PC (in terms of the eigenvectors. Use values
with two places of decimals only).  [Hint: write the linear equation of PC in terms of
eigenvectors and corresponding features]
8. Consider the cumulative values of the eigenvalues. How does it help you to decide on
the optimum number of principal components? What do the eigenvectors indicate?
9. Explain the business implication of using the Principal Component Analysis for this
case study. How may PCs help in the further analysis? [Hint: Write Interpretations of
the Principal Components Obtained]

SOLUTIONS

1. State the null and the alternate hypothesis for conducting one-way
ANOVA for both Education and Occupation individually.

Ans: The null and alternate hypotheses for the one-way ANOVA on Education are:

Ho: The mean salary is equal across all Education levels.

Ha1: The mean salary differs for at least one Education level.

The null and alternate hypotheses for the one-way ANOVA on Occupation are:

Ho: The mean salary is equal across all Occupation types.

Ha2: The mean salary differs for at least one Occupation type.

Where Alpha = 0.05:

• If the P-value is < 0.05, then we reject the null hypothesis.
• If the P-value is >= 0.05, then we fail to reject the null hypothesis.

2. Perform a one-way ANOVA on Salary with respect to Education.
State whether the null hypothesis is accepted or rejected based on
the ANOVA results.

Since the P-value is less than Alpha, we reject the null hypothesis (Ho) for
Education.

3. Perform a one-way ANOVA on Salary with respect to Occupation.
State whether the null hypothesis is accepted or rejected based on
the ANOVA results.

Since the P-value is greater than Alpha, we fail to reject the null hypothesis (Ho)
for Occupation.
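
The two one-way tests above can be sketched with scipy.stats.f_oneway. SalaryData.csv is not reproduced here, so the snippet builds a small synthetic data frame; the column names (Education, Occupation, Salary), the level labels and all values are assumptions made only for illustration, and the helper one_way_anova is hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for SalaryData.csv (column names and values are assumptions).
rng = np.random.default_rng(0)
education = np.repeat(["HS-grad", "Bachelors", "Doctorate"], 12)
occupation = np.tile(
    ["Adm-clerical", "Sales", "Prof-specialty", "Exec-managerial"], 9
)
base = {"HS-grad": 60000, "Bachelors": 160000, "Doctorate": 200000}
salary = np.array([base[e] for e in education]) + rng.normal(0, 20000, size=36)
df = pd.DataFrame(
    {"Education": education, "Occupation": occupation, "Salary": salary}
)


def one_way_anova(data, factor, response="Salary", alpha=0.05):
    """Return (F statistic, p-value, reject_H0) for one categorical factor."""
    groups = [g[response].to_numpy() for _, g in data.groupby(factor)]
    f_stat, p_value = stats.f_oneway(*groups)
    return f_stat, p_value, p_value < alpha


results = {f: one_way_anova(df, f) for f in ["Education", "Occupation"]}
for factor, (f_stat, p_value, reject) in results.items():
    decision = "reject Ho" if reject else "fail to reject Ho"
    print(f"{factor}: F = {f_stat:.2f}, p = {p_value:.4f} -> {decision}")
```

With the real file, df would instead come from pd.read_csv("SalaryData.csv"), and the decision rule stays the same: compare each P-value with Alpha = 0.05.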

PROBLEM 1B

1. What is the interaction between two treatments? Analyze the
effects of one variable on the other (Education and Occupation)
with the help of an interaction plot. [hint: use the ‘pointplot’
function from the ‘seaborn’ library].

Observation:

• Salaries for Exec-managerial and Sales jobs are fairly good for Bachelors
and Doctorates.
• Salaries for Sales jobs are low for HS-grads.
• Salaries for Prof-specialty jobs are high for Doctorates but fairly low
for Bachelors and HS-grads.
• Salaries for Adm-clerical jobs are reasonable for Bachelors and Doctorates
but close to nil for HS-grads.
• Bachelors earn higher salaries in Adm-clerical and Sales.
• Doctorates earn high salaries in Prof-specialty and Exec-managerial.
• HS-grads are in the low income range in every field of occupation.
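
An interaction plot draws the mean salary of every Education and Occupation combination, one line per Education level; lines that are not parallel (or that cross) point to an interaction. The sketch below computes those cell means with pandas and wraps the seaborn pointplot call in a hypothetical helper. The data frame is a tiny synthetic example (salaries in thousands), not the values from SalaryData.csv.

```python
import pandas as pd

# Tiny synthetic example; column names and salary values are assumptions.
df = pd.DataFrame({
    "Education": ["HS-grad", "HS-grad", "Bachelors", "Bachelors",
                  "Doctorate", "Doctorate"] * 2,
    "Occupation": ["Sales"] * 6 + ["Prof-specialty"] * 6,
    "Salary": [50, 54, 180, 190, 170, 176, 60, 64, 120, 126, 240, 250],
})

# Cell means: the values an interaction plot displays.
cell_means = df.pivot_table(
    index="Occupation", columns="Education", values="Salary", aggfunc="mean"
)
print(cell_means)


def plot_interaction(data):
    """Draw the interaction plot with seaborn.pointplot.

    Plotting libraries are imported here so the table above can be
    computed even where no plotting backend is installed.
    """
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.pointplot(data=data, x="Occupation", y="Salary", hue="Education")
    ax.set_title("Mean Salary by Occupation and Education")
    plt.show()
    return ax
```

In this made-up example Bachelors out-earn Doctorates in Sales while the order reverses in Prof-specialty, so the two lines would cross; calling plot_interaction(df) shows this visually.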

2. Perform a two-way ANOVA based on Salary with respect to both
Education and Occupation (along with their interaction
Education*Occupation). State the null and alternative
hypotheses and state your results. How will you interpret this
result?

The null and alternate hypotheses for the two-way ANOVA on Occupation type
and Education level are:
Ho: The mean salary is equal across all Occupation types and Education
levels.
Ha2: The mean salary differs for at least one Occupation type or Education
level.
For the interaction term, Ho states that there is no interaction between
Education and Occupation, and the alternative states that there is one.

Where, Alpha = 0.05

If the P-value is < 0.05, then we reject the null hypothesis.
If the P-value is >= 0.05, then we fail to reject the null hypothesis.

The interaction plot suggests some interaction between the two treatments, so we
include an interaction term when performing the two-way ANOVA.

Due to the inclusion of the interaction effect term, we can see changes in the
P-values of the first two treatments compared to the two-way ANOVA without the
interaction effect term.

The P-value of the Education*Occupation interaction effect term suggests that
the null hypothesis is rejected in this case.

3. Explain the business implication of performing ANOVA for this
particular case study.

ANOVA stands for ‘analysis of variance’. It is a statistical test that
compares the means of groups in order to determine whether at least one of
them differs, which shows how a dependent variable (here, salary) responds
to the levels of one or more independent variables (here, education and
occupation). It is used when more than two group means are compared. For
two group means, we can do a t-test.
In this business context, ANOVA helps manage salaries by showing whether
education level and occupation type are associated with differences in
pay.
The factors that ANOVA flags as significant can also feed into an analysis
of salary trends and future salary hikes, although ANOVA itself tests
differences in means and does not produce forecasts.
It is also a widely used statistical technique for examining the factors
behind a rise in salary, assuming this report is for an HR department or an
HR consulting firm. Some of the key takeaways are as below:

I. Salary increases as the education level rises. On average, Doctorates
earn higher salaries than Bachelors and HS-grads. However, it is possible
that being a Doctorate does not necessarily mean a significantly higher
salary than HS-grad or Bachelors employees. This suggests that Doctorates
are not suitable for every job role, or are not always preferred over other
education levels; they may sometimes be considered overqualified for
certain job roles.
II. Occupation has less significance for salary than education, but at
certain levels it does affect salary.
III. We must also note that, for a few occupations, Bachelor’s degree
holders are offered higher salaries than Doctorates. This points to
shortcomings in the dataset provided, which reduce the accuracy of the
tests and analysis done: other important variables can affect salary, such
as years of experience, specialization, industry/domain, etc.
IV. The HR department plays a more comprehensive role when setting up salary
bands. Similar job titles in different industries demand different salary
packages depending on the job profile, and years of experience also matter
when deciding a person's pay scale.
V. The ANOVA test indicates that education level combined with occupation
has a more significant influence on salary than occupation type alone.

PROBLEM 2

1. Perform EDA (univariate and multivariate). What insights do
you draw?

The first step is to get to know our data: understand it and get familiar
with it. What answers are we trying to get from the data? What variables
are we using, and what do they mean? How does it look from a statistical
perspective? Is the data formatted correctly? Do we have missing values?
Duplicates? What about outliers? All these questions can be answered step
by step as below:

Step 1: Import all the necessary libraries and the data.
Step 2: Describe the data after loading it. Check the datatypes, the number
of columns and rows, and missing values. Check the information by
using .info(). Depending on the requirement, drop missing values or
replace them.
Step 3: Review the resulting dataset, identify outliers with the
interquartile range (IQR) and visualize them.
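
The three steps can be sketched with pandas as follows. The frame below is a small synthetic stand-in for the college data; the column names Apps and Outstate and all values are assumptions, and iqr_outlier_mask is a hypothetical helper using the common 1.5 * IQR rule.

```python
import pandas as pd

# Step 1: load the data (a synthetic stand-in is built here instead).
df = pd.DataFrame({
    "Names": ["A", "B", "C", "D", "E", "F"],
    "Apps": [1200, 1500, 1300, 1400, 1250, 9000],
    "Outstate": [7000, 7500, 7200, 7100, 7400, 7300],
})

# Step 2: structure, summary statistics, missing values and duplicates.
df.info()
print(df.describe())
missing_per_column = df.isnull().sum()
n_duplicates = int(df.duplicated().sum())


# Step 3: flag outliers with the interquartile range (IQR) rule.
def iqr_outlier_mask(series, k=1.5):
    """True where a value lies more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


numeric = df.select_dtypes(include="number")
outlier_counts = {
    col: int(iqr_outlier_mask(numeric[col]).sum()) for col in numeric
}
print(missing_per_column, n_duplicates, outlier_counts, sep="\n")
```

The same outlier_counts dictionary can be recomputed after scaling to compare the number of flagged points per variable.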

EDA (UNIVARIATE ANALYSIS)

Univariate analysis refers to the analysis of a single variable. The main
purpose is to summarize and find patterns in the data.
For each numeric variable we use the statistical description, a histogram
or distplot to view the distribution, and a box plot to view the 5-point
summary and outliers, if any.

Observations:

• The data consists of 777 universities with 18 variables; apart from the
Name field, there is no categorical variable.
• Perc.alumni has a minimum value of 0, which needs to be cleaned.
• There are no missing values in the data.
• The one categorical field, Name, needs to be cleaned.
• Very few students fall under the topper categories (top 10% and top 25%).
• There are no duplicate records.

There are drastic differences between the upper and lower ranges of most of
the variables, which indicates the presence of outliers. To get the most accurate
prediction, outliers should be treated before scaling. A few pairs of variables
have very high correlation.

2. Is scaling necessary for PCA in this case? Give justification and
perform scaling.

Our data set initially has 18 attributes, hence we get 18 principal
components. Once we have the amount of variance explained by each
principal component, we can decide how many components we need for
our model based on the amount of information we want to retain.
Yes, it is necessary to normalize the data before performing PCA, because
PCA calculates a new projection of our data set based on variance. Without
scaling, the PCA is skewed towards high-magnitude features. If we normalize
our data, all variables have the same standard deviation, so all variables
have the same weight and the PCA calculates relevant axes. Scaling can also
speed up gradient descent and other calculations in algorithms.
Scaling of data can be done using the Z-score method or the standard scaler
(StandardScaler) in sklearn. Formula for Z-score:

z = (x - mean) / standard deviation

The standard scaler rescales each feature so that its distribution is
centered around 0, with a standard deviation of 1. It does not require the
data to be normally distributed, although the scaled values are easiest to
interpret when each feature is roughly normal.
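
As a quick check of the Z-score formula, the sketch below scales a small made-up matrix with sklearn's StandardScaler and with the formula written out by hand. The numbers are invented for illustration. Note that StandardScaler divides by the population standard deviation (ddof=0).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix with very different magnitudes per column.
X = np.array([
    [1200.0, 7000.0, 12.5],
    [1500.0, 7500.0, 14.0],
    [1300.0, 7200.0, 11.0],
    [9000.0, 7100.0, 18.5],
    [1250.0, 7400.0, 13.2],
])

scaled = StandardScaler().fit_transform(X)

# Same computation by hand: z = (x - mean) / standard deviation (ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.round(scaled.mean(axis=0), 6))  # each column mean is ~0
print(np.round(scaled.std(axis=0), 6))   # each column std is 1
```

Because pandas' describe() reports the sample standard deviation (ddof=1), a scaled data frame may show a value marginally above 1.0 there; for several hundred rows the difference is negligible.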

Observations:

After scaling, the standard deviation is 1.0 for all variables. The
difference between the 25% value and the minimum value is also smaller
than in the original dataset for most of the variables.

3. Comment on the comparison between the covariance and the
correlation matrices from this data (scaled data).

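Observation:

On scaled data the covariance matrix and the correlation matrix are
essentially identical, because every variable has unit variance.
The highest correlations are seen for the Outstate and Enroll variables with
F.Undergrad. The lowest correlations are observed for the SF Ratio variable.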
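
The relationship between the two matrices can be verified numerically. In the sketch below (synthetic data, not the college dataset) the covariance matrix of the standardized data equals the correlation matrix of the raw data, provided the same divisor (n - 1) is used for both the scaling and the covariance.

```python
import numpy as np

# Synthetic data with very different scales and one correlated pair.
rng = np.random.default_rng(2)
raw = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 1000.0])
raw[:, 1] += 0.02 * raw[:, 2]

# Standardize with the sample standard deviation (ddof=1).
z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

cov_scaled = np.cov(z, rowvar=False)        # covariance of scaled data
corr_raw = np.corrcoef(raw, rowvar=False)   # correlation of raw data

print(np.round(cov_scaled, 3))
print(np.round(corr_raw, 3))
```

If the data is scaled with StandardScaler (ddof=0) and the covariance is computed with the default n - 1 divisor, the two matrices differ only by the constant factor n / (n - 1).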

4. Check the dataset for outliers before and after scaling. What
insights do you derive here?

While performing the univariate analysis we plotted boxplots for all the
variables to check for the presence of outliers. Scaling shrinks the range
of the feature values shown. However, the outliers influence the empirical
mean and standard deviation, so the standard scaler cannot guarantee
balanced scales in the presence of outliers.
So, outliers in the data are not removed by standardization; they remain
outliers after scaling.
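
A small sketch makes this concrete: standardization is a linear rescaling, so the points flagged by the 1.5 * IQR rule are the same before and after scaling. The values below are invented, and iqr_outliers is a hypothetical helper.

```python
import numpy as np


def iqr_outliers(x, k=1.5):
    """Boolean mask of points more than k * IQR outside the quartiles."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)


# Made-up variable with one extreme value.
x = np.array([10.0, 12.0, 11.0, 13.0, 12.5, 11.5, 95.0])
z = (x - x.mean()) / x.std()   # Z-score scaling

before = iqr_outliers(x)
after = iqr_outliers(z)
print(before.sum(), after.sum())
```

Only the axis of the boxplot changes; the same observation sits outside the whiskers in both versions.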

5. Extract the eigen value and eigen vectors.
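
With sklearn, the eigenvalues are available as pca.explained_variance_ and the eigenvectors as the rows of pca.components_. The sketch below uses synthetic data (not the college dataset) and cross-checks the eigenvalues against a direct eigendecomposition of the covariance matrix of the scaled data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: three correlated columns plus one independent column.
rng = np.random.default_rng(3)
latent = rng.normal(size=(300, 1))
columns = [latent + 0.3 * rng.normal(size=(300, 1)) for _ in range(3)]
columns.append(rng.normal(size=(300, 1)))
X = np.hstack(columns)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA()
pca.fit(X_scaled)
eigenvalues = pca.explained_variance_   # sorted in descending order
eigenvectors = pca.components_          # one eigenvector per row

# Cross-check: eigenvalues of the covariance matrix (n - 1 divisor).
cov = np.cov(X_scaled, rowvar=False)
eigvals_np = np.linalg.eigvalsh(cov)[::-1]

print("Eigenvalues:", np.round(eigenvalues, 4))
print("Eigenvectors:\n", np.round(eigenvectors, 4))
```

The signs of individual eigenvectors may differ between libraries; an eigenvector and its negative describe the same axis.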

6. Perform PCA and export data of principal component (eigen
vector) into a data frame with original features.

In this table we can see that the first PC (array) explains 8.3% of the
variance in our dataset, while the first seven capture low variances.
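
One way to build such a table is to place pca.components_ in a data frame whose columns are the original feature names. The sketch below does this on synthetic data; the feature names are borrowed as assumptions for illustration and the values are random, so the numbers do not correspond to the table discussed above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data with assumed feature names.
features = ["Apps", "Accept", "Enroll", "Outstate"]
rng = np.random.default_rng(4)
data = pd.DataFrame(rng.normal(size=(100, 4)), columns=features)
data["Accept"] += data["Apps"]      # make some columns correlated
data["Enroll"] += data["Accept"]

X_scaled = StandardScaler().fit_transform(data)
pca = PCA().fit(X_scaled)

# Rows are principal components, columns are the original features.
pc_labels = [f"PC{i + 1}" for i in range(pca.n_components_)]
loadings = pd.DataFrame(pca.components_, columns=features, index=pc_labels)
explained = pd.Series(
    pca.explained_variance_ratio_, index=pc_labels,
    name="explained_variance_ratio",
)

print(loadings.round(2))
print(explained.round(3))
```

The loadings frame can be exported with loadings.to_csv(...) if the table is needed outside the notebook.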

7. Write down the explicit form of the first PC in the terms of eigen
vectors. Use values with 2 decimal place only.

In PCA, given a mean-centered dataset X with n samples and p variables, the
first principal component PC1 is given by a linear combination of the
original variables x1, x2, ..., xp:

PC1 = w11*x1 + w12*x2 + ... + w1p*xp

The first principal component PC1 is the component that retains
the maximum variance of the data. The weight vector w1 = (w11, w12, ..., w1p)
is the eigenvector of the covariance matrix with the largest eigenvalue.

The explicit form of the PC1 is:
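
An equation of this kind can be assembled directly from the first row of the eigenvector matrix. The helper pc_equation below is hypothetical, and the weights and variable names are placeholders chosen only to show the two-decimal formatting; they are not the loadings of this dataset.

```python
import numpy as np


def pc_equation(weights, names, label="PC1"):
    """Format a principal component as a signed linear combination."""
    terms = [f"{w:+.2f} * {name}" for w, name in zip(weights, names)]
    return f"{label} = " + " ".join(terms)


# Placeholder weights and names (not taken from the college data).
weights = np.array([0.2481, -0.3312, 0.0049])
names = ["x1", "x2", "x3"]

equation = pc_equation(weights, names)
print(equation)  # PC1 = +0.25 * x1 -0.33 * x2 +0.00 * x3
```

With a fitted sklearn PCA, the weights would be pca.components_[0] and the names the columns of the scaled data frame.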

8. Consider the cumulative values of the eigen values. How does it
help you to decide on the optimum number of principal
components? What do eigen vectors indicate?

The plot visually shows how much of the variance is explained by how many
principal components.
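
The same decision can be made numerically from the cumulative share of the eigenvalues. In the sketch below both the eigenvalues and the 90% threshold are assumptions chosen for illustration; the report does not state which cut-off was used.

```python
import numpy as np

# Hypothetical eigenvalues, sorted in descending order.
eigenvalues = np.array([4.8, 2.4, 1.4, 0.8, 0.4, 0.2])

explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative share reaches the threshold.
threshold = 0.90  # assumed cut-off
n_components = int(np.searchsorted(cumulative, threshold) + 1)

print(np.round(cumulative, 2))  # [0.48 0.72 0.86 0.94 0.98 1.  ]
print(n_components)             # 4
```

Plotting cumulative against the component number gives the cumulative explained variance curve; the optimum is usually taken where the curve crosses the chosen threshold or flattens out.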

PCA uses the eigenvectors of the covariance matrix to determine how the data
should be rotated. The eigenvectors (PCs) give the directions, or axes, along
which the linear transformation acts by stretching or compressing input vectors.
They are the lines of change that represent the action of the larger matrix.

9. Explain the business implications of using the principal
components analysis for this case study. How may PCs help in
further analysis?

PCA is an unsupervised statistical technique and a “dimensionality
reduction” method. It converts a set of correlated variables into a smaller
number of uncorrelated variables without losing the essence of the original
variables, and it provides an overview of the linear relationships among the
input variables.

Principal Component Analysis (PCA) performs well at identifying the
influencing factors that affect results in individual areas, for example the
correlated factors associated with a candidate's win or loss. Beyond election
analysis, the PCA technique is used in many applications across different
industries and fields.

Interpretation of the principal components is based on finding which variables are
most strongly correlated with each component, i.e., which of these numbers are
large in magnitude, the farthest from zero in either direction. Which numbers we
consider to be large or small is, of course, a subjective decision. You need to
determine at what level the correlation is of importance. Here a correlation above
0.5 is deemed important.

The first principal component is strongly correlated with five of the original
variables. The first principal component increases with increasing Arts, Health,
Transportation, Housing and Recreation scores. This suggests that these five
criteria vary together. If one increases, then the remaining ones tend to increase as
well. 

The second principal component increases with only one of the values,
decreasing Health. This component can be viewed as a measure of how unhealthy
the location is in terms of available health care including doctors, hospitals, etc.
The third principal component increases with increasing Crime and Recreation.
This suggests that places with high crime also tend to have better recreation
facilities.

To complete the analysis we often times would like to produce a scatter plot of the
component scores.

These correlations are obtained using the correlation procedure. In the variable
statement we include the first three principal components, "prin1, prin2, and
prin3", in addition to all nine of the original variables. We use the correlations
between the principal components and the original variables to interpret these
principal components.

Because of standardization, all principal components will have mean 0. The
standard deviation is also given for each of the components; these are the
square roots of the eigenvalues.
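
Both properties can be checked on the component scores. The sketch below uses synthetic data and sklearn; the only claim it illustrates is that each score column has mean 0 and a sample standard deviation equal to the square root of its eigenvalue.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated data (not the dataset used in this report).
rng = np.random.default_rng(5)
X = rng.normal(size=(250, 4)) @ rng.normal(size=(4, 4))

pca = PCA()
scores = pca.fit_transform(StandardScaler().fit_transform(X))

score_means = scores.mean(axis=0)
score_stds = scores.std(axis=0, ddof=1)          # sample standard deviation
sqrt_eigenvalues = np.sqrt(pca.explained_variance_)

print(np.round(score_means, 6))
print(np.round(score_stds, 4))
print(np.round(sqrt_eigenvalues, 4))
```

The ddof=1 divisor matters here because sklearn computes explained_variance_ with n - 1 in the denominator.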
