
Assignment Answers

A1)
Raw datasets often contain attributes measured on very different scales.
For example, the ages of employees in a company may range from 21 to 70 years,
the size of the houses they live in from 500 to 5,000 sq. feet, and their salaries
from $30,000 to $80,000.
In a distance computation on the raw values, the age feature contributes almost
nothing because its values are orders of magnitude smaller than those of the
other features. However, it may still carry information that is useful for the task.
Hence, we rescale each feature independently to a common range, say
[0,1], so that all features contribute equally when computing distances.
We calculate normalization by:

x_new = (x - x_min)/(x_max - x_min)
After applying the formula, the largest value becomes 1 and the smallest
becomes 0, so every normalized value lies between 0 and 1.
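The normalization formula can be sketched in a few lines of Python. This is only an illustration: the function name min_max_normalize and the sample values are assumptions made for the example, chosen to fall inside the age and salary ranges mentioned earlier.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so the formula would divide by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample values (not real data).
ages = [21, 35, 50, 70]
salaries = [30000, 45000, 60000, 80000]

normalized_ages = min_max_normalize(ages)
normalized_salaries = min_max_normalize(salaries)
```

After rescaling, both lists run from 0 to 1, so a difference in age and a difference in salary carry comparable weight in a distance calculation.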
Data standardization is the process of rescaling one or more attributes so that
they have a mean value of 0 and a standard deviation of 1.
We calculate standardization by:

x_new = (x - μ)/σ
where μ is the mean of the attribute and σ is its standard deviation.
As a rule of thumb, we go for normalization when we do not know the distribution
of our data or the distribution is not Gaussian (a bell curve), and we go for
standardization when our data has an approximately Gaussian distribution.
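Standardization can be sketched the same way. The function name standardize and the sample ages are assumptions for illustration. The sketch uses the population standard deviation; using the sample standard deviation instead is an equally common choice and changes the result slightly.

```python
import statistics


def standardize(values):
    """Rescale values to have mean 0 and standard deviation 1."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumed)
    if sigma == 0:
        # Constant column: every value equals the mean.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical sample values (not real data).
ages = [21, 35, 50, 70]
standardized_ages = standardize(ages)
```

Unlike normalization, the standardized values are not confined to [0,1]; they are centred on 0 and can be negative.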
A2)

The Variance Inflation Factor (VIF) helps to detect multicollinearity in regression
analysis. Multicollinearity occurs when the independent variables in a model are
correlated with one another.

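The formula to calculate the VIF is:

VIF = 1/(1 - r^2)

Here r^2 is the coefficient of determination obtained by regressing one
independent variable on the others. With only two independent variables, it is
the square of their correlation coefficient r.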

A rule of thumb for interpreting the variance inflation factor:

• 1 = not correlated.
• Between 1 and 5 = moderately correlated.
• Greater than 5 = highly correlated.

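If the VIF is 2, the value of the correlation coefficient is calculated as follows:

2 = 1/(1 - r^2)

1 - r^2 = 1/2

r^2 = 1 - 1/2

r^2 = 0.5

|r| = 0.707 (approximately)

As r^2 is 0.5, about 50% of the variance of one variable is explained by the
other, so we can say that there is a moderate relationship between the two
variables. The VIF depends only on r^2, so it tells us the strength of the
relationship but not whether it is positive or negative.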
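The calculation can be checked with a short Python sketch. The helper names vif_from_r_squared and r_squared_from_vif are made up for this illustration and are not part of any library.

```python
import math


def vif_from_r_squared(r_squared):
    """VIF = 1 / (1 - r^2); r^2 must lie in [0, 1)."""
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must be in the range [0, 1)")
    return 1.0 / (1.0 - r_squared)


def r_squared_from_vif(vif):
    """Invert the formula: r^2 = 1 - 1/VIF; VIF is never below 1."""
    if vif < 1:
        raise ValueError("VIF cannot be smaller than 1")
    return 1.0 - 1.0 / vif


r_squared = r_squared_from_vif(2.0)   # 0.5
r_magnitude = math.sqrt(r_squared)    # about 0.707; the sign of r is not recoverable
```

Feeding r_squared back into vif_from_r_squared returns 2.0, which confirms the algebra in both directions.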

A3)
The chi-square test for independence is applied when you have two categorical
variables from a single population. It is used to determine whether there is a
significant association between the two variables.
For example, in a learning preference survey, students might be classified by
gender (male or female) and studying preference (online/books/classes).
We could use a chi-square test for independence to determine whether gender is
related to studying preference.

The test procedure is appropriate when the following conditions are met:

• The sampling method is simple random sampling.
• The variables under study are each categorical.
• If sample data are displayed in a contingency table, the expected frequency
count for each cell of the table is at least 5.

This approach consists of four steps: (1) state the hypotheses, (2) formulate an
analysis plan, (3) analyze sample data, and (4) interpret results.

State the Hypothesis

Variable gender has 2 levels, and variable studying preference has 3 levels.

The null hypothesis states that knowing the level of variable gender does not help
you predict the level of variable studying preference. That is, the variables are
independent.

Ho: Gender and studying preference are independent.

Ha: Gender and studying preference are not independent.

The alternative hypothesis is that knowing the level of variable gender can help
you predict the level of variable studying preference.

Formulate an Analysis Plan

For the analysis, we will use a significance level of 0.05. Using sample data, we will
conduct a chi-square test for independence.

Analyze Sample Data

Applying the chi-square test for independence to sample data, we compute the
degrees of freedom, the expected frequency counts, and the chi-square test
statistic. Based on the chi-square statistic and the degrees of freedom, we
determine the P-value.

Interpret Results

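If the P-value is less than the significance level (0.05), we reject the null hypothesis
and conclude that there is a relationship between gender and studying
preference. Otherwise, we fail to reject the null hypothesis: the sample does not
provide enough evidence of a relationship.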
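The four steps can be sketched in Python on a small contingency table. The counts below are invented purely for illustration and are not survey results. To stay within the standard library, the sketch computes the P-value with the closed form exp(-x/2), which is exact only for 2 degrees of freedom (the case of a 2 x 3 table); for other table sizes a statistics library such as SciPy (scipy.stats.chi2_contingency) would be the usual choice.

```python
import math


def chi_square_independence(observed):
    """Return (statistic, degrees of freedom, expected counts) for a table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


# Hypothetical counts. Rows: male, female. Columns: online, books, classes.
observed = [
    [30, 20, 50],
    [40, 30, 30],
]

statistic, dof, expected = chi_square_independence(observed)

# Condition check: every expected count should be at least 5.
expected_counts_ok = all(e >= 5 for row in expected for e in row)

# With exactly 2 degrees of freedom, P(chi-square > x) = exp(-x / 2).
if dof != 2:
    raise ValueError("this closed-form P-value is valid only for dof == 2")
p_value = math.exp(-statistic / 2)

alpha = 0.05
reject_null = p_value < alpha
```

For these made-up counts the statistic is about 8.43 with 2 degrees of freedom and the P-value is about 0.015, so the null hypothesis of independence would be rejected at the 0.05 level.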

A4)

Boxplots are a way of summarizing data by visualizing the five-number
summary, which consists of the minimum value, first quartile, median, third
quartile, and maximum value of a data set.

Now, an outlier is an observation that lies at an abnormal distance from other
values in a random sample from a population.

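With the help of box plots we can easily identify the values that lie beyond the
lower and upper limits (by the usual convention, more than 1.5 times the
interquartile range below the first quartile or above the third quartile). These
values are considered outliers, and we can discard them from our dataset before
making any further observations, for more accurate results.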
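The limits a box plot draws can also be computed directly. The sketch below assumes the common 1.5 x IQR rule for the lower and upper limits and uses invented sample values. Quartile definitions differ slightly between tools, so the exact limits may vary, although the flagged value here is the same under the usual definitions.

```python
import statistics

# Hypothetical sample values (not real data); 45 is the suspicious one.
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 45]

# First quartile, median, third quartile.
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Limits under the assumed 1.5 * IQR rule.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_limit or x > upper_limit]
cleaned = [x for x in data if lower_limit <= x <= upper_limit]
```

Here only the value 45 falls outside the limits, so cleaned keeps the other nine observations.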
A5)

First, we perform a missing value analysis and check the percentage of values that
are missing for each variable of our dataset. If it is less than 30%, we go for
imputation; otherwise, we remove all rows with null values from our dataset.

When going for imputation, we randomly choose one known value from a particular
column, set it to NA, and save the original value so that we can later check
whether the predicted value is close to the actual value.

Then we try the mean, median, mode, and KNN methods to determine which one
predicts the missing value closest to the actual value.

After this analysis, the method that gives the value closest to the actual value is
chosen, and all the null values are imputed using that method.
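The procedure can be sketched in Python as follows. The column values are invented for illustration, and the hidden position is fixed rather than random so that the example is reproducible. KNN imputation is left out because it needs additional feature columns and is normally done with a library; the sketch compares only mean, median, and mode.

```python
import statistics

# Hypothetical column with one genuinely missing entry (None).
column = [22, 25, 25, 27, 30, None, 31, 25, 40, 28]

# Candidate imputation methods (KNN omitted in this sketch).
methods = {
    "mean": statistics.fmean,
    "median": statistics.median,
    "mode": statistics.mode,
}

# Step 1: hide one known value and remember it.
held_out_index = 4                      # fixed here; chosen randomly in practice
actual_value = column[held_out_index]   # 30
trial = list(column)
trial[held_out_index] = None

# Step 2: predict the hidden value with each method.
known = [v for v in trial if v is not None]
predictions = {name: fn(known) for name, fn in methods.items()}

# Step 3: keep the method whose prediction is closest to the actual value.
best_method = min(predictions, key=lambda name: abs(predictions[name] - actual_value))

# Step 4: impute every missing value in the original column with that method.
original_known = [v for v in column if v is not None]
fill_value = methods[best_method](original_known)
imputed = [fill_value if v is None else v for v in column]
```

With these made-up values the mean comes closest to the hidden value, so the remaining missing entry is filled with the column mean. A single hidden value is a weak test; repeating the check on several positions would give a more reliable comparison.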
