1. State the null and the alternate hypothesis for conducting one-way ANOVA for both
Education and Occupation individually.
2. Perform a one-way ANOVA on Salary with respect to Education. State whether the
null hypothesis is accepted or rejected based on the ANOVA results.
3. Perform a one-way ANOVA on Salary with respect to Occupation. State whether the
null hypothesis is accepted or rejected based on the ANOVA results.
4. If the null hypothesis is rejected in either (2) or in (3), find out which class means are
significantly different. Interpret the result. (Non-Graded)
PROBLEM 1B
1. What is the interaction between two treatments? Analyse the effects of one variable
on the other (Education and Occupation) with the help of an interaction plot. [hint:
use the ‘pointplot’ function from the ‘seaborn’ library]
2. Perform a two-way ANOVA based on Salary with respect to both Education and
Occupation (along with their interaction Education*Occupation). State the null and
alternative hypotheses and state your results. How will you interpret this result?
3. Explain the business implications of performing ANOVA for this particular case study.
PROBLEM 2
SOLUTIONS
1. State the null and the alternate hypothesis for conducting one-way
ANOVA for both Education and Occupation individually.
Ans: The null and alternate hypotheses for one-way ANOVA on Education are:
Ho: The mean salary is equal across all Education levels.
Ha1: The mean salary differs for at least one Education level.
The null and alternate hypotheses for one-way ANOVA on Occupation are:
Ho: The mean salary is equal across all Occupation types.
Ha2: The mean salary differs for at least one Occupation type.
Since the P-value is less than Alpha, we reject the null hypothesis (Ho) for
Education.
3. Perform a one-way ANOVA on Salary with respect to Occupation.
State whether the null hypothesis is accepted or rejected based on
the ANOVA results.
Since the P-value is greater than Alpha, we fail to reject the null hypothesis (Ho)
for Occupation.
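The decision rule used in the two one-way tests above can be sketched as follows. This is a minimal illustration, not the report's actual computation: the salary values are made up, the significance level of 0.05 is an assumption (the report only refers to "Alpha"), and `scipy.stats.f_oneway` is one common way to obtain the F statistic and P-value.

```python
from scipy import stats

# Hypothetical salary samples (NOT the report's dataset), one list per group.
groups = {
    "HS-grad": [50, 55, 52, 58, 54],
    "Bachelors": [70, 75, 72, 68, 74],
    "Doctorate": [95, 100, 98, 102, 97],
}

alpha = 0.05  # assumed significance level

# One-way ANOVA: Ho says all group means are equal.
f_stat, p_value = stats.f_oneway(*groups.values())

# Reject Ho when the P-value is below alpha; otherwise fail to reject it.
reject_h0 = p_value < alpha
print(f"F = {f_stat:.2f}, p = {p_value:.4g}, reject Ho: {reject_h0}")
```

The same call applies to Occupation by grouping the salaries by occupation type instead of education level.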
PROBLEM 1B
Observation:
2. Perform a two-way ANOVA based on Salary with respect to both
Education and Occupation (along with their interaction
Education*Occupation). State the null and alternative
hypotheses and state your results. How will you interpret this
result?
The null and alternate hypotheses for two-way ANOVA on Occupation type and
Education level are:
Ho: The mean salary is equal across all combinations of Occupation type and
Education level.
Ha2: The mean salary differs for at least one combination of Occupation type
and Education level.
As we can see, there is some interaction between the two treatments, so we
introduce an interaction term when performing the two-way ANOVA.
Because the interaction effect term is included, the P-values of the first two
treatments change compared with the two-way ANOVA without the interaction
effect term.
The P-value of the interaction term between Education and Occupation suggests
that the null hypothesis is rejected in this case.
ANOVA stands for ‘analysis of variance’. It is a statistical test that
compares the means of groups to determine whether there is a difference
between them, relating a categorical independent variable to a continuous
dependent variable. It is used when more than two group means are compared.
For two group means, we can use a t-test.
In a business context, ANOVA helps manage income/salary decisions; in this
case it compares salary across education levels and occupation types.
ANOVA can also support forecasting of salary trends by analysing patterns in
the data to better understand future salary hikes.
It is also a widely used statistical technique for comparing the factors
associated with a rise in salary, assuming this report is for an HR
department or HR consulting firm. Some of the key takeaways are as below:
I. As the Education level rises, salary increases. On average, Doctorates
earn a higher salary than Bachelors and HS-Grads. However, being a
Doctorate may not necessarily mean a significantly higher salary than
HS-Grad or Bachelors employees. This suggests that Doctorates are not
suitable for every job role, or are not always preferred over other
education levels; they may sometimes be considered overqualified for
certain job roles.
II. Although occupation has less significance for salary than education,
it does affect salary at certain levels.
III. We must also note that, for a few occupations, higher salaries are
offered to Bachelors degree holders than to Doctorates. This points to
shortcomings in the dataset provided that reduce the accuracy of the test
and analysis, as other important variables can affect salary, such as
years of experience, specialization, industry/domain, etc.
IV. The HR department plays a more comprehensive role when setting up salary
bands, as similar job titles in different industries demand varying salary
packages according to job profile, and years of experience also matter in
deciding a person's pay scale.
V. The ANOVA test indicates that education level coupled with occupation
has a more significant influence on salary than occupation type alone.
PROBLEM 2
Step 1: Import all the necessary libraries and the data.
Step 2: Describe the data after loading it. Check the datatypes, the number of
columns and rows, and missing values. Check the information
using .info(). Depending on the requirement, drop missing values or
replace them.
Step 3: Review the new dataset, identify outliers with the interquartile range
(IQR), and visualize them.
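Steps 2 and 3 can be sketched as follows. This is a minimal example on a made-up single-column frame (the report's dataset and file name are not reproduced here), and the 1.5 × IQR fence is the conventional boxplot rule, assumed rather than stated in the report.

```python
import pandas as pd

# Hypothetical data (NOT the report's dataset); "value" is an illustrative name.
df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 11, 14, 13, 12, 95]})

# Step 2: inspect datatypes, shape, and missing values.
df.info()
print(df.shape)
print(df.isnull().sum())

# Step 3: flag outliers with the 1.5 * IQR rule.
q1 = df["value"].quantile(0.25)
q3 = df["value"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["value"] < lower) | (df["value"] > upper)]
print(outliers)
```

A boxplot of the same column (for example with `seaborn.boxplot`) draws these fences as whiskers, which is the usual way to visualize the flagged points.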
Observations:
There are drastic differences between the upper and lower ranges of most of
the variables, which indicates the presence of outliers. To get the most
accurate prediction, we must treat outliers before scaling. A few pairs of
variables have very high correlation.
The standard scaler works best when the data within each feature is roughly
normally distributed; it scales each feature so that its distribution is
centered around 0, with a standard deviation of 1.
Observations:
After scaling, the standard deviation is 1.0 for all variables. After scaling,
the difference between the 25% value and the minimum value is smaller than in
the original dataset for most of the variables.
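The standardization described above amounts to subtracting each feature's mean and dividing by its standard deviation. The sketch below does this by hand with NumPy on a made-up matrix; it uses the population standard deviation (`ddof=0`), which is the convention scikit-learn's `StandardScaler` follows.

```python
import numpy as np

# Hypothetical feature matrix (NOT the report's data): rows = samples, cols = features.
X = np.array([
    [1.0, 200.0],
    [2.0, 300.0],
    [3.0, 400.0],
    [4.0, 1500.0],
])

mean = X.mean(axis=0)
std = X.std(axis=0)          # population std (ddof=0)
X_scaled = (X - mean) / std  # z-scores, column by column

print(X_scaled.mean(axis=0))  # approximately 0 for every feature
print(X_scaled.std(axis=0))   # approximately 1 for every feature
```

Note that pandas' `describe()` reports the sample standard deviation (`ddof=1`), so on scaled data it can show a value slightly above 1 rather than exactly 1.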
Observation:
The highest correlation is seen between the out state and enroll variables and
F.Undergrad. The lowest correlation is observed for the SF Ratio variable.
4. Check the dataset for outliers before and after scaling. What
insights do you derive here?
While performing univariate analysis, we plotted boxplots for all the
variables to check for the presence of outliers. Scaling shrinks the range of
the feature values shown. However, the outliers influence the computation of
the empirical mean and standard deviation, so the standard scaler cannot
guarantee balanced scales in the presence of outliers.
So, outliers in the data are not removed by standardization; they remain
outliers after scaling.
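One way to check this claim is to apply the same IQR rule before and after standardization and compare which points are flagged. The sketch below uses made-up values and a hypothetical helper, `iqr_outlier_mask`; since standardization is only a shift and a rescale, the same points are expected to be flagged in both cases.

```python
import numpy as np

def iqr_outlier_mask(values):
    """Return a boolean mask of points outside the 1.5 * IQR fences (illustrative helper)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Hypothetical values (NOT the report's data) with one obvious outlier.
x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 14.0, 13.0, 12.0, 95.0])
z = (x - x.mean()) / x.std()  # standardized copy

before = iqr_outlier_mask(x)
after = iqr_outlier_mask(z)
print(before.sum(), after.sum())
```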
6. Perform PCA and export data of principal component (eigen
vector) into a data frame with original features.
In this table we can see that the first PC (array) explains 8.3% of the
variance in our dataset, while the first seven capture low variances.
7. Write down the explicit form of the first PC in terms of the eigen
vectors. Use values with 2 decimal places only.
In PCA, given a mean-centered dataset X with n samples and p variables, the
first principal component PC1 is given by a linear combination of the original
variables x1, x2, ..., xp.
The first principal component PC1 is the component that retains
the maximum variance of the data. W1 corresponds to an eigen vector of the
covariance matrix.
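Written out, the linear combination described above takes the following general form. The weight symbols w11, ..., w1p (the entries of W1) are notation introduced here for illustration; the numeric loadings from the report's own output are not reproduced.

```latex
\mathrm{PC}_1 = w_{11} x_1 + w_{12} x_2 + \cdots + w_{1p} x_p
              = \mathbf{w}_1^{\top} \mathbf{x},
\qquad
\mathbf{w}_1 = \arg\max_{\lVert \mathbf{w} \rVert = 1}
               \operatorname{Var}\!\left(\mathbf{w}^{\top} \mathbf{x}\right)
```

Here w1 is the unit-length eigen vector of the covariance matrix with the largest eigenvalue, which is why PC1 retains the maximum variance.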
The plot shows how much of the variance is explained by how many
principal components.
PCA uses the eigen vectors of the covariance matrix to determine how the data
should be rotated. The eigen vectors (PCs) determine the directions or axes
along which the linear transformation acts, stretching or compressing input
vectors. They are the lines of change that represent the action of the larger
matrix.
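The link between the covariance matrix, its eigen vectors, and the explained-variance plot can be sketched with NumPy as below. The data are simulated (three features, two of them strongly correlated), so the numbers do not correspond to the report's results.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data (NOT the report's dataset): features 0 and 1 share a common signal.
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Standardize, then take the covariance matrix of the scaled data.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(Z, rowvar=False)

# eigh suits symmetric matrices; it returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # columns are the principal directions

# Share of total variance per component, and the running total a scree plot shows.
explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)
print(explained_ratio, cumulative)
```

Plotting `cumulative` against the component index gives the kind of curve used to decide how many components to keep.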
PCA is an unsupervised statistical technique and a “dimensionality
reduction” method. It converts a set of correlated variables into fewer
uncorrelated variables without losing the essence of the original variables.
It provides an overview of the linear relationships among the input variables.
The first principal component is strongly correlated with five of the original
variables. The first principal component increases with increasing Arts, Health,
Transportation, Housing and Recreation scores. This suggests that these five
criteria vary together. If one increases, then the remaining ones tend to increase as
well.
The second principal component increases with only one of the values,
decreasing Health. This component can be viewed as a measure of how unhealthy
the location is in terms of available health care including doctors, hospitals, etc.
The third principal component increases with increasing Crime and Recreation.
This suggests that places with high crime also tend to have better recreation
facilities.
To complete the analysis, we often want to produce a scatter plot of the
component scores.
These correlations are obtained using the correlation procedure. In the variable
statement we include the first three principal components, "prin1, prin2, and
prin3", in addition to all nine of the original variables. We use the correlations
between the principal components and the original variables to interpret these
principal components.
Because of standardization, all principal components will have mean 0. The
standard deviation is also given for each of the components; these are the
square roots of the eigenvalues.
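Both properties can be checked numerically. The sketch below uses simulated data and the sample convention (`ddof=1`) throughout, which matches the default of `np.cov`; the component scores are the standardized data projected onto the eigen vectors.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated correlated data (NOT the report's dataset): 100 samples, 4 features.
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigen decomposition of the covariance matrix of the standardized data.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Z, rowvar=False))

# Component scores: one column per principal component.
scores = Z @ eigenvectors

score_means = scores.mean(axis=0)
score_stds = scores.std(axis=0, ddof=1)
print(score_means)
print(score_stds, np.sqrt(eigenvalues))
```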