
Assignment Name - Analytics Advanced

Problem Statement -
1. On what basis do we choose a data scaling method
(Normalization/Standardization)?
Answer.
Data scaling is used to standardize the range of features of data.
There are 2 ways to do this.
1. Normalization
2. Standardization
Normalization is a technique we use when we don't know the distribution of
the data. The goal of normalization is to change the values of numeric
columns in the dataset to a common scale, without distorting differences in
the ranges of values.
For example, consider a data set containing two features, age and income,
where age ranges from 0–100 while income ranges from 0–100,000 and
higher. Income is about 1,000 times larger than age, so these two features are
in very different ranges. Here we normalize the data to bring all the
variables to the same range.
Standardization means centering the variable at zero and scaling the
variance to 1. The procedure involves subtracting the mean from each
observation and then dividing by the standard deviation. The result of
standardization is that the features are rescaled so that they have the
properties of a standard normal distribution.
Standardizing the features around a center of 0 with a standard deviation
of 1 is important when we compare measurements that have different units.
Variables that are measured at different scales do not contribute equally to the
analysis and might end up creating a bias.
For example, a variable that ranges between 0 and 1,000 will outweigh a
variable that ranges between 0 and 1. Using these variables without
standardization gives the variable with the larger range a weight of 1,000 in
the analysis. Transforming the data to comparable scales prevents this
problem. Typical data standardization procedures equalize the range and/or
data variability.
While choosing the data scaling method, we need to check the distribution of
the data. If the data is approximately normally distributed, then Standardization
is the suitable method for the scaling purpose. Otherwise, if the data is not
normally distributed (or its distribution is unknown), we go with the
Normalization scaling method.
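As a sketch, both rescalings can be implemented directly with NumPy. The feature values below (age and income, echoing the example above) are made up for illustration:

```python
import numpy as np

# Hypothetical feature columns: age (0-100) and income (0-100,000)
age = np.array([25.0, 40.0, 60.0, 35.0])
income = np.array([20000.0, 55000.0, 90000.0, 30000.0])

def normalize(x):
    """Min-max normalization: rescale values to the [0, 1] range."""
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    """Standardization: center at 0 and scale to unit standard deviation."""
    return (x - x.mean()) / x.std()

age_norm = normalize(age)         # every value now lies in [0, 1]
income_std = standardize(income)  # mean 0, standard deviation 1
```

In practice the same transformations are available as scikit-learn's MinMaxScaler and StandardScaler, which also remember the fitted parameters for transforming new data.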
2. If the VIF is 2, then what is the value of the squared correlation coefficient (r²)?
Answer:
The variance inflation factor (VIF) quantifies the extent of correlation between
one predictor and the other predictors in a model. It is used for
diagnosing collinearity/multicollinearity. Higher values signify that it is difficult
or even impossible to accurately assess the contribution of predictors to a model.
The relation between VIF and the squared correlation coefficient (r²) is:

VIF = 1 / (1 - r²)

So, with VIF = 2:

2 = 1 / (1 - r²)
⇒ 2(1 - r²) = 1
⇒ 2 - 2r² = 1
⇒ 2r² = 1
⇒ r² = 1/2 = 0.5
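The arithmetic above can be checked with a trivial Python sketch:

```python
# Check of the relation VIF = 1 / (1 - r^2), solved for r^2
vif = 2
r_squared = 1 - 1 / vif
print(r_squared)  # → 0.5
```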

3. How do you interpret a chi-square result?
Answer.
A Chi-square test is designed to analyse categorical data. That means that the
data has been counted and divided into categories. It will not work with
parametric or continuous data (such as height in inches).
For example, if you want to test whether attending class influences how
students perform on an exam, using test scores (from 0–100) as data would not
be appropriate for a Chi-square test. However, arranging students into the
categories "Pass" and "Fail" would. Additionally, the data in a Chi-square grid
should not be in the form of percentages, or anything other than frequency
(count) data. Thus, a class of 54 can be divided into groups according to whether
they attended class and whether they passed the exam, yielding a contingency
table of counts.
Another way to describe the Chi-square test is that it tests the null
hypothesis that the variables are independent. The test compares the
observed data to a model that distributes the data according to the
expectation that the variables are independent. Wherever the observed data
doesn't fit the model, the evidence that the variables are dependent becomes
stronger, giving us grounds to reject the null hypothesis.
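As a sketch, SciPy's chi2_contingency runs this test on a contingency table. The counts below for the class of 54 are hypothetical:

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table for a class of 54 students:
# rows = attended / did not attend, columns = passed / failed
observed = [[25, 6],
            [8, 15]]

chi2, p, dof, expected = chi2_contingency(observed)
# A small p-value (e.g. < 0.05) is evidence against the null
# hypothesis that attendance and passing are independent
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")
```

The `expected` array holds the counts the independence model predicts; comparing it with `observed` shows where the dependence comes from.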
4. Why do we choose the boxplot method over others for outlier detection and
removal?
Answer.
Outliers are observations inconsistent with the rest of the dataset (global
outliers). They can arise from poor data quality or contamination, low-quality
measurement, manual error, or equipment malfunction. A boxplot is a type of
graph that displays a summary of a large amount of data in five numbers:
the median, upper quartile, lower quartile, and the minimum and maximum
data values.
Below are a few main reasons we consider the box plot method:
Handles Large Data Easily.
Due to the five-number summary, a box plot can handle and present a
summary of a large amount of data. A box plot consists of the median, which is
the midpoint of the range of data; the upper and lower quartiles, which mark
the boundaries of the top and bottom quarters of the data; and the minimum
and maximum data values. Organizing data in a box plot using five key numbers
is an efficient way of dealing with data too large or unmanageable for other
graphs, such as line plots or stem-and-leaf plots.
Exact Values Not Retained.
The box plot does not keep the exact values and details of the distribution,
which is the trade-off for handling such large amounts of data in this
graph type. A box plot shows only a simple summary of the distribution of
results, so that you can quickly view it and compare it with other data. Use a
box plot in combination with another statistical graph method, like a
histogram, for a more thorough, more detailed analysis of the data.
A Clear Summary
A box plot is a highly visually effective way of viewing a clear summary of one
or more sets of data. It is particularly useful for quickly summarizing and
comparing different sets of results from different experiments. At a glance, a
box plot allows a graphical display of the distribution of results and provides
indications of symmetry within the data.
Displays Outliers
A box plot is one of very few statistical graph methods that show outliers.
There might be one outlier or multiple outliers within a set of data, occurring
below or above the whiskers. Because the whiskers extend at most 1.5 times the
interquartile range (IQR) beyond the lower and upper quartiles, any data points
that fall outside the whiskers are flagged as outliers and are easy to identify
on a box plot graph.
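The 1.5 × IQR whisker rule described above can be sketched in a few lines of NumPy. The sample values are made up, with one obvious outlier:

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return points beyond k * IQR from the quartiles (the boxplot whisker rule)."""
    data = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return data[(data < lower) | (data > upper)]

sample = [10, 12, 11, 13, 12, 14, 11, 95]  # 95 is an obvious outlier
print(iqr_outliers(sample))  # → [95.]
```

Removal then amounts to keeping only the points inside the whisker bounds.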
5. How do we choose best method to impute missing value for a data?

Answer.
There are two imputation methods to impute the missing values.
1. Central statistics (mean, median, mode)
2. KNN imputation

Choosing the best method to impute the missing values of data follows trial
and error. The steps are as follows:
1. Create a subset of data from the population.
2. Delete some of the values manually.
3. Impute those deleted values with each imputation method.
4. Compare the imputed data with the actual data.
5. See which method comes closest to the actual values and choose that
method for the model.
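The trial-and-error steps above can be sketched with NumPy. The data here is synthetic, and for simplicity only mean and median imputation are compared; scikit-learn's KNNImputer could be evaluated the same way:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: a subset of data (synthetic, for illustration)
actual = rng.normal(50, 10, size=100)

# Step 2: delete some values manually
missing_idx = rng.choice(100, size=10, replace=False)
observed = actual.copy()
observed[missing_idx] = np.nan

# Step 3: impute the deleted values with each candidate method
mean_imputed = np.where(np.isnan(observed), np.nanmean(observed), observed)
median_imputed = np.where(np.isnan(observed), np.nanmedian(observed), observed)

# Steps 4-5: compare imputed vs. actual values (RMSE) and pick the winner
def rmse(imputed):
    return np.sqrt(np.mean((imputed[missing_idx] - actual[missing_idx]) ** 2))

print("mean RMSE:  ", rmse(mean_imputed))
print("median RMSE:", rmse(median_imputed))
```

The method with the lower error on the artificially deleted values is the one to use on the real missing data.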
