
ISHANT AS PROJECT

Business Report

1.1 State the null and the alternate hypothesis for conducting one-way ANOVA for
both Education and Occupation individually.

A) Head

B) Tail

C) Info
H0: The mean hours are equal across all the levels of A.
HA: H0 is false; for at least one level of A, the mean hours of relief differ from the others.

Statistical:
H0: μa1 = μa2 = μa3
HA: H0 is false.
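
A minimal sketch of how hypotheses of this form could be tested with a one-way ANOVA. The data here are synthetic and the column names (`A`, `response`) are placeholders rather than the report's columns; `scipy.stats.f_oneway` is used as one common way to run the test, and the 0.05 significance level is an assumption.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Synthetic illustration data: three levels of a factor A (not the report's data).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": np.repeat(["a1", "a2", "a3"], 30),
    "response": np.concatenate([
        rng.normal(10, 2, 30),
        rng.normal(10, 2, 30),
        rng.normal(14, 2, 30),  # this level has a shifted mean
    ]),
})

# One array of responses per level of A.
groups = [g["response"].to_numpy() for _, g in df.groupby("A")]
f_stat, p_value = f_oneway(*groups)

alpha = 0.05  # assumed significance level
reject_h0 = p_value < alpha
print(f"F = {f_stat:.3f}, p = {p_value:.4g}, reject H0: {reject_h0}")
```

If the p-value is below the chosen significance level, H0 is rejected, meaning at least one level mean differs from the others.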

1.2 Perform one-way ANOVA for Education with respect to the variable ‘Salary’. State
whether the null hypothesis is accepted or rejected based on the ANOVA results.
Problem-2: The dataset Education – Post 12th Standard.csv is a dataset that contains the names of
various colleges. This particular case study is based on various parameters of various institutions.
You are expected to do Principal Component Analysis for this case study according to the
instructions given in the following rubric.

2.1) Perform Exploratory Data Analysis [both univariate and multivariate analysis to be
performed].

Ans) Univariate Analysis:


The number of students accepted is the number who received an admit when they applied to the university.
We can intuitively expect it to have a distribution similar to that of the number of applications.
So the number of applications, followed by the number accepted and finally the number of
students enrolled, would have similar distributions, as they are interlinked with each other. The only
difference is that the numbers decrease at each stage, as this is an admission process. To conclude,
this column is also right skewed and h

We would do that at the end of the univariate analysis. We have a distribution difference between
these two columns, as we might have a good number of students from the top 25% in almost every
university or college, whereas we would have few students from the top 10%.

Visually, it looks roughly normal; however, it could be right skewed, as there is a steep step
towards the right, and we also have one outlier. As said, we will check for normality at
the end of this sub-topic.

The room and boarding cost has a few outliers because a few universities, depending on their location,
might have high costs, for example in cities like New York. Overall, the distribution looks good, but it is right
skewed.
It is a good sign that most of the universities have 60 to 85 percent of their faculty holding a PhD.
The percentage of alumni donating to their universities is positively skewed and there are
outliers. An increase in this percentage would help the institutions a lot.

The graduation rate is left skewed, and it is also evident that a few universities have a low
graduation rate. We cannot judge a university on the basis of graduation rate because we know that
top universities

When we used the describe function, we found some anomalies in PhD and Graduation rate: as these
are percentages, they cannot be more than a hundred,
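
A small sketch of how such a check could look in pandas. The toy values and the column names `PhD` and `Grad.Rate` are assumptions for illustration and are not taken from the report's notebook.

```python
import pandas as pd

# Toy frame; the values and column names are assumed for illustration only.
df = pd.DataFrame({
    "PhD": [70, 85, 103, 60],
    "Grad.Rate": [65, 118, 90, 72],
})

# The min/max rows of describe() make out-of-range percentages easy to spot.
print(df.describe().loc[["min", "max"]])

pct_cols = ["PhD", "Grad.Rate"]
# Count, per column, the rows where a percentage exceeds 100.
n_over_100 = (df[pct_cols] > 100).sum()
print(n_over_100)
```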
Bi-Variate analysis:

First we would look into the correlations between the variables.
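
As a rough sketch, the pairwise correlations can be computed with `DataFrame.corr()`. The data below are synthetic and the column names (`Apps`, `Accept`, `Enroll`) are placeholders chosen to mimic an admission funnel; they are not the report's actual values.

```python
import numpy as np
import pandas as pd

# Synthetic admission-funnel data (illustrative only).
rng = np.random.default_rng(1)
apps = rng.lognormal(7, 1, 200)
accept = apps * rng.uniform(0.5, 0.9, 200)    # a fraction of applicants is accepted
enroll = accept * rng.uniform(0.2, 0.5, 200)  # a fraction of those accepted enrolls
df = pd.DataFrame({"Apps": apps, "Accept": accept, "Enroll": enroll})

# Pearson correlation matrix (the default method of DataFrame.corr).
corr = df.corr()
print(corr.round(2))
```

The resulting matrix can be passed to a heatmap function of a plotting library for a visual view of the same information.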


From the above plot, we can see that, on average across all the institutions, more applications
mean more students getting an admit.

In the same way, the number of acceptances is also directly related to how many of those students
enroll after being accepted.
The above plot shows the relationship between the instructional expenditure and the Student/Faculty
Ratio; as we can observe from the plot, these two variables are negatively correlated.

These two variables are about the faculty. We see that if a good number of faculty members have a
terminal degree, there is a high chance that they hold a PhD as well.
The above bar plot depicts the top 10 universities with the lowest Student/Faculty ratio.
Here, "top" means the first ten in ascending order.

2.2) Scale the variables and write the inference for using the type of scaling function for this case
study.

Ans -  In the jupyter notebook, I have done my whole work in different combinations pertaining to
PCA like with and without outlier treatment.

- For our data, I have used two scaling techniques: Z score normalization and log transformation.

- Z score normalization is not so appealing here, as we have a lot of skewness in our variables.

- Log transformation is the better choice for us. Moreover, we do not have any negative or zero
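
The contrast between the two techniques can be sketched as follows, assuming a single strictly positive, right-skewed variable (synthetic here). The `skewness` helper is a hypothetical name written for this illustration. Z score normalization only shifts and rescales, so the skewness is unchanged, while the log transformation can reduce it; `np.log` requires strictly positive inputs.

```python
import numpy as np

# Synthetic strictly positive, right-skewed variable (illustrative only).
rng = np.random.default_rng(2)
x = rng.lognormal(mean=7.0, sigma=1.0, size=500)

def skewness(v):
    """Sample skewness: third central moment over variance to the power 1.5."""
    v = np.asarray(v, dtype=float)
    d = v - v.mean()
    return (d ** 3).mean() / (d ** 2).mean() ** 1.5

z = (x - x.mean()) / x.std()  # Z score normalization
log_x = np.log(x)             # log transformation (needs x > 0)

print("raw skew:    ", skewness(x))
print("z-score skew:", skewness(z))
print("log skew:    ", skewness(log_x))
```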

2.3) Comment on the comparison between covariance and the correlation matrix after scaling

Ans –
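
One general fact that is relevant to this comparison: for Z score scaled data, the covariance matrix of the scaled variables equals the correlation matrix of the original variables. This does not hold for a log transformation. A minimal sketch on synthetic data (not the report's data):

```python
import numpy as np

# Synthetic data with correlated columns (illustrative only).
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))

# Z score scaling, using the sample standard deviation (ddof=1) to match np.cov.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

cov_scaled = np.cov(Z, rowvar=False)      # covariance of the scaled data
corr_raw = np.corrcoef(X, rowvar=False)   # correlation of the original data

print(np.allclose(cov_scaled, corr_raw))
```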
2.4) Check the dataset for outliers before and after scaling. Draw your inferences from this
exercise

Ans - We have already seen this in the univariate analysis, which corresponds to the data before
scaling: we observed that all the variables had outliers except one. Later, I performed the outlier
treatment.
2.5) Build the covariance matrix and calculate the eigenvalues and the eigenvector.

Ans –
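
A sketch of this step on synthetic, Z score scaled data (the report's own matrix and values are not reproduced here). `np.linalg.eigh` is used because a covariance matrix is symmetric; it returns eigenvalues in ascending order, so they are reordered from largest to smallest.

```python
import numpy as np

# Synthetic data standing in for the scaled dataset (illustrative only).
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Covariance matrix with variables in columns.
cov_matrix = np.cov(Z, rowvar=False)

# Eigen decomposition of the symmetric covariance matrix.
eig_vals, eig_vecs = np.linalg.eigh(cov_matrix)

# Sort from the largest eigenvalue to the smallest; columns of eig_vecs follow.
order = np.argsort(eig_vals)[::-1]
eig_vals = eig_vals[order]
eig_vecs = eig_vecs[:, order]

print("Eigenvalues:", eig_vals.round(3))
print("First eigenvector:", eig_vecs[:, 0].round(3))
```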
2.6) Write the explicit form of the first PC (in terms of Eigen Vectors).

Ans – Here we have two things to do. One is the array representation of the first principal component
and the other is its linear equation.
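
Both forms can be sketched as follows, assuming synthetic scaled data and placeholder feature names (`x1`, `x2`, `x3`); the report's actual loadings are not reproduced. The first principal component is the eigenvector with the largest eigenvalue, and the linear equation is the weighted sum of the scaled variables using that eigenvector's entries as weights.

```python
import numpy as np

feature_names = ["x1", "x2", "x3"]  # placeholder names, not the report's columns

# Synthetic data standing in for the scaled dataset (illustrative only).
rng = np.random.default_rng(5)
X = rng.normal(size=(150, 3)) @ rng.normal(size=(3, 3))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

eig_vals, eig_vecs = np.linalg.eigh(np.cov(Z, rowvar=False))

# Array representation: the eigenvector belonging to the largest eigenvalue.
pc1 = eig_vecs[:, np.argmax(eig_vals)]

# Linear equation: weighted sum of the scaled variables.
equation = "PC1 = " + " + ".join(
    f"({w:.3f} * {name})" for w, name in zip(pc1, feature_names)
)
print(pc1)
print(equation)

# Scores of each observation on the first principal component.
pc1_scores = Z @ pc1
```

The sign of an eigenvector is arbitrary, so the whole equation may appear with all signs flipped without changing its meaning.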
2.7) Discuss the cumulative values of the eigenvalues. How does it help you to decide on the
optimum number of principal components? What do the eigenvectors indicate? Perform PCA and
export the data of the Principal Component scores into a data frame

Ans-

We have two ways to decide the optimum number of principal components. The first way is to check
the eigenvalues: we can take the number of eigenvalues that are greater than 1.
From 2.5, we got 2 such values.

We could take 2 principal components. However, we cannot conclude that here itself.

The other way is the cumulative variance explained by the principal components.
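
Both criteria, and the export of the scores into a data frame, can be sketched as follows. The data are synthetic, and the 90% cumulative variance threshold is an assumption for illustration, not a value stated in the report.

```python
import numpy as np
import pandas as pd

# Synthetic data standing in for the scaled dataset (illustrative only).
rng = np.random.default_rng(6)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

eig_vals, eig_vecs = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Criterion 1: count the eigenvalues greater than 1.
n_kaiser = int((eig_vals > 1).sum())

# Criterion 2: smallest number of components reaching the variance threshold.
cum_var = np.cumsum(eig_vals) / eig_vals.sum()
threshold = 0.90  # assumed threshold
n_cumulative = int(np.searchsorted(cum_var, threshold) + 1)

# Principal component scores exported to a data frame.
scores = pd.DataFrame(
    Z @ eig_vecs[:, :n_cumulative],
    columns=[f"PC{i + 1}" for i in range(n_cumulative)],
)

print("Eigenvalues > 1:", n_kaiser)
print("Cumulative variance:", cum_var.round(3))
print("Components for threshold:", n_cumulative)
print(scores.head())
```

The two criteria need not agree, which is why the eigenvalue count alone is not treated as conclusive.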
