
Basic Statistical Interview Question

Basic Statistics
1. What are the ‘Mean’, ‘Median’, ‘Mode’ and ‘G.M.’ (geometric mean)?
Ans: All of these are measures of central tendency of the data.
Mean - determined by adding all the data points in a population and dividing the total by the number of points. The mean is affected by outliers and by the skewness of the distribution.
Median - the middle value of a data set that has been arranged in order of magnitude. The median is less affected by outliers and skewed data, so it can be used when the data are skewed or contain outliers.
Mode - the most frequently occurring value in a set of numbers.
G.M. - geometric mean. It is generally useful as a measure of central tendency for rates and ratios.
2. What is the coefficient of variation (C.V.)? When do we use the C.V.?
Ans: The coefficient of variation (CV) is a statistical measure of the dispersion of data points in a data series around the mean. It is the ratio of the standard deviation to the mean:
CV = (SD / Mean) × 100%.
Because it is unit-free, it is useful for comparing variability across variables on different scales. In a regression model we generally drop variables with a very low CV, since a variable with a low CV is nearly constant and carries little information.
3. Skewness and Kurtosis?
Skewness is a measure of asymmetry, i.e. a measure of deviation from symmetry. Theoretically it can take any value from −∞ to +∞. Skewness 0 means the distribution is symmetric. Negative values of skewness indicate data that are skewed left, and positive values indicate data that are skewed right. Skewed left means that the left tail is long relative to the right tail; similarly, skewed right means that the right tail is long relative to the left tail. If the data are multi-modal, this may affect the sign of the skewness.

Kurtosis - a measure of peakedness/tailedness. The kurtosis of the normal distribution is 3, and such a distribution is called mesokurtic. A distribution with kurtosis greater than 3 is called leptokurtic; one with kurtosis less than 3 is called platykurtic. Note that some software reports excess kurtosis (kurtosis minus 3), so be careful about the definition in use. Data sets with high kurtosis tend to have heavy tails (outliers); data sets with low kurtosis tend to have light tails (few outliers).

4. How do you detect outliers if observations come from a population having a quite symmetric distribution?
Ans: An outlier is an observation that diverges from the overall pattern of a sample. For a roughly symmetric distribution, a common rule is to flag observations lying more than about 3 standard deviations from the mean (the z-score rule); the boxplot rule, flagging points outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR], is also widely used.
5. Hypothesis testing
Ans: Null hypothesis (H0): the statement that is assumed to be true by default.
Alternative hypothesis (H1): the statement we seek evidence for, based on the sample data.

Mean tests - tests of the means of two populations can be done with the t-test and the z-test.
The t-test is used when the population variance is not known and is estimated from the sample; here the test statistic follows a t distribution. The z-test is used when the population variance is known; here the test statistic follows a normal distribution.
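The two-sample t-test can be sketched as follows (a minimal example assuming numpy and scipy are available; the sample sizes and distribution parameters are purely illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=5.0, scale=2.0, size=50)   # sample from population A
b = rng.normal(loc=5.5, scale=2.0, size=50)   # sample from population B

# Two-sample t-test: population variances unknown, estimated from the samples,
# so the test statistic follows a t distribution under H0 (equal means)
t_stat, p_value = stats.ttest_ind(a, b)
print(t_stat, p_value)
```

A small p-value would lead us to reject H0 that the two population means are equal.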

Chi-square and F tests - both are tests of variance. The chi-square test is applicable in the following ways:

• Chi-square test of a single variance: used to test a hypothesis about a specific value of the population variance. Example: H0: σ² = 15, H1: σ² ≠ 15.
• Chi-square goodness of fit: used to decide whether there is any difference between the observed (experimental) values and the expected (theoretical) values.
• Chi-square test of independence: suppose N observations are classified according to two characteristics, say A and B. To test whether the two characteristics are independent, we use the chi-square test of independence of two attributes. Example: H0: in the population, the two categorical variables are independent; H1: in the population, the two categorical variables are dependent.

• F-test of the equality of two variances: used to compare the variances of two quantitative variables. Example: H0: σ₁² = σ₂², H1: σ₁² ≠ σ₂².
• ANOVA: used to compare more than two means. The F-test in ANOVA assesses whether the expected values of a quantitative variable differ across several pre-defined groups.
• In regression, we use the t-test for the significance of individual coefficients and the F-test for the overall significance of the model; the F-test also helps compare the fit of different linear models on the same data. Example: H0: β₁ = β₂ = … = βk = 0; H1: at least one coefficient is non-zero.

6. Chebyshev’s Inequality
Ans: For any distribution with finite mean μ and variance σ², the proportion of values lying more than k standard deviations from the mean is at most 1/k²: P(|X − μ| ≥ kσ) ≤ 1/k². For example, at most 1/9 of the values can lie more than 3 standard deviations from the mean, whatever the distribution.
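Chebyshev’s inequality states that P(|X − μ| ≥ kσ) ≤ 1/k² for any distribution with finite variance. A quick empirical check (numpy assumed; the exponential sample is just an illustrative skewed, non-normal distribution):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=100_000)  # skewed, clearly non-normal sample
mu, sigma = x.mean(), x.std()

for k in (2, 3, 4):
    # observed tail probability P(|X - mu| >= k*sigma)
    observed = np.mean(np.abs(x - mu) >= k * sigma)
    bound = 1.0 / k**2  # Chebyshev's upper bound, valid for ANY distribution
    assert observed <= bound
    print(k, observed, bound)
```

The bound is loose but universal, which is exactly its point: it needs no distributional assumption.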
7. Distributions and their properties can be asked:
Discrete
a. Binomial
b. Uniform
c. Poisson
Continuous
a. Normal
b. Exponential
c. Log Normal

Regression and Time Series


8. What are the assumptions of multiple linear regression? What is the testing methodology for each assumption, and if an assumption fails, what are the possible ways to handle it?
9. What is multicollinearity? What steps can you follow to remove it?
10. What is heteroscedasticity? What problems may it cause in a linear regression model?

Assumptions –

1. Linearity and additivity of the relationship between dependent and independent variables
2. Statistical independence of the errors (no correlation between consecutive errors in the case of
time series data)
3. Homoscedasticity (constant variance) of the errors
4. Normality of the error distribution
5. No Multicollinearity of the independent variables (independent variables should not be
correlated)

Testing methodology and possible ways to handle -

Linearity-

• Plot of observed versus predicted values
• Plot of residuals versus predicted values

Prior to fitting the regression model, the correlation between the dependent variable and each independent variable is a measure of linearity; so are univariate coefficient tests and the univariate R-square / adjusted R-square between each independent variable and the dependent variable.

Handle - Take a non-linear transformation of the predictors, such as log, square, or exponential.

Statistical independence-

• Plot residuals vs. time
• Table or plot of residual autocorrelations
• Durbin-Watson (DW) statistic (for autocorrelation of lag one)
• Breusch-Godfrey test

Handle - Add lags of the dependent variable and/or lags of some of the independent variables. One can also model diff(y) on diff(x), which may solve the problem.
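The Durbin-Watson statistic can be computed directly from its definition, DW = Σ(e_t − e_{t−1})² / Σ e_t². A minimal numpy sketch (numpy assumed; the simulated residual series are illustrative):

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum of squared successive differences / sum of squared residuals.
    Values near 2 suggest no lag-1 autocorrelation; near 0, positive
    autocorrelation; near 4, negative autocorrelation."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(2)
white = rng.normal(size=500)   # independent errors -> DW close to 2
trend = np.cumsum(white)       # strongly autocorrelated errors -> DW close to 0
print(durbin_watson(white), durbin_watson(trend))
```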

Homoscedasticity –

Heteroscedasticity occurs when the variance of the error terms differ across observations.

Problem –

It makes it difficult to gauge the true standard deviation of the forecast errors, usually resulting in confidence intervals that are too wide or too narrow. In particular, if the variance of the errors is increasing over time, confidence intervals for out-of-sample predictions will tend to be unrealistically narrow.

• Plot residuals vs. fitted values
• Breusch-Pagan test
• White test

Handle - Transform the variables (e.g. take logs), or use weighted least squares. A simple fix is to work with shorter intervals of data within which volatility is more nearly constant.
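The Breusch-Pagan test can be sketched from first principles: regress the squared residuals on the predictors and use LM = n·R², which is asymptotically chi-square under homoscedasticity. A numpy-only version (numpy assumed; data simulated for illustration):

```python
import numpy as np

def breusch_pagan_lm(resid, x):
    """Breusch-Pagan LM statistic: regress squared residuals on the predictors
    (with intercept); LM = n * R^2 is asymptotically chi-square with
    (number of predictors) degrees of freedom under homoscedasticity."""
    n = len(resid)
    Z = np.column_stack([np.ones(n), x])
    y = np.asarray(resid, dtype=float) ** 2
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    ss_res = np.sum((y - Z @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return n * (1.0 - ss_res / ss_tot)

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=400)
homo = rng.normal(scale=1.0, size=400)       # constant error variance
hetero = rng.normal(scale=0.2 + 0.5 * x)     # error variance grows with x
print(breusch_pagan_lm(homo, x), breusch_pagan_lm(hetero, x))
```

A large LM statistic (compared with the chi-square critical value) indicates heteroscedasticity.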

Normality --

• Normal probability plot or normal quantile plot of the residuals
• Kolmogorov-Smirnov test
• Shapiro-Wilk test
• Jarque-Bera test
• Anderson-Darling test

Handle - A non-linear transformation of the variables (response or predictors) can improve the model.
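The normality checks above can be sketched with scipy (scipy assumed available; the residual vectors are simulated for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
resid_ok = rng.normal(size=300)        # residuals consistent with normality
resid_bad = rng.exponential(size=300)  # clearly non-normal residuals

# Shapiro-Wilk: a small p-value means we reject the null of normality
w_ok, p_ok = stats.shapiro(resid_ok)
w_bad, p_bad = stats.shapiro(resid_bad)
print(p_ok, p_bad)
```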

Multicollinearity –

• Variance Inflation Factor (VIF)
• Correlation matrix

Handle - Centring the data (that is, subtracting the variable's mean from each score) can help.
The simplest fix is to remove independent variables with high VIF values.
PCA can be used when there is a large number of variables.
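The VIF can be computed directly from its definition, VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the other predictors. A numpy sketch (numpy assumed; data simulated):

```python
import numpy as np

def vif(X, j):
    """Variance Inflation Factor of column j: regress X[:, j] on the other
    columns (plus intercept); VIF = 1 / (1 - R^2). A VIF above roughly 5-10
    is a common rule of thumb for problematic multicollinearity."""
    n = X.shape[0]
    y = X[:, j]
    Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r2 = 1.0 - np.sum((y - Z @ beta) ** 2) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(5)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)               # independent of x1 -> low VIF
x3 = x1 + 0.05 * rng.normal(size=200)   # nearly a copy of x1 -> high VIF
X = np.column_stack([x1, x2, x3])
print([round(vif(X, j), 2) for j in range(3)])
```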

Standard linear modelling steps: data → outlier detection and missing-value treatment (if required) → possible transformation of variables → stationarity check → univariate analysis (correlation, univariate adjusted R-square, coefficient significance) → development of candidate models with significant coefficients, correct signs on the variables and no multicollinearity → final model selection based on R-square / adjusted R-square and the statistical assumption tests.

11. How do you choose the better of two linear models?
Ans: Based on R-square and adjusted R-square (higher is better) and on AIC. AIC is a measure of information loss; the model with the minimum AIC value is the better model.
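The AIC comparison can be illustrated with a small sketch (numpy assumed; AIC is computed as n·log(RSS/n) + 2k, the Gaussian-likelihood form up to an additive constant, with k the number of estimated coefficients):

```python
import numpy as np

def fit_aic(X, y):
    """Least-squares fit; AIC = n*log(RSS/n) + 2k (Gaussian likelihood up to
    a constant), where k = number of estimated coefficients. Lower is better."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + 2 * X.shape[1]

rng = np.random.default_rng(6)
x = rng.uniform(-2, 2, size=150)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=150)  # y truly depends on x

X_full = np.column_stack([np.ones(150), x])  # model with the predictor
X_null = np.ones((150, 1))                   # intercept-only model
print(fit_aic(X_full, y), fit_aic(X_null, y))
```

The model that captures the real structure attains the lower AIC; the penalty term 2k guards against over-parameterised models winning on RSS alone.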

12. When do we go for a logistic model? How do you check that the independent and dependent variables can have a logistic relation? K-S statistic, Gini coefficient, AUC.
Ans: We go for a logistic model when the dependent variable is categorical (typically binary). If we plot X vs. E(Y), the relationship follows a sigmoid pattern. Model discrimination is commonly assessed with the K-S statistic, the Gini coefficient and the AUC.

13. What do you mean by stationarity of a time series? Why is stationarity required before time-series or regression modelling? If the data are not stationary, how do you make them stationary?
Ans: A stationary time series is one whose statistical properties, such as mean and variance, are constant over time, and whose autocovariance depends only on the lag, not on time. Our models are mostly developed on historical data, with the expectation that the market will behave in a similar fashion in the future. If the data are not stationary, the underlying process is not stable, so a model built on non-stationary data can produce predictions that differ from expectation.
• We can difference the data.
• If the data contain a trend, we can fit some type of curve to the data and then model the residuals from that fit.
• For non-constant variance, taking the logarithm or square root of the series may stabilise the variance.
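The differencing remedy can be illustrated quickly (numpy assumed; a simulated random walk stands in for a non-stationary series):

```python
import numpy as np

rng = np.random.default_rng(7)
walk = np.cumsum(rng.normal(size=1000))  # random walk: non-stationary in mean

# First difference removes the stochastic trend: diff(walk) is white noise again
diffed = np.diff(walk)

# A crude stationarity check: compare the means and variances of the two halves
def halves_stats(x):
    a, b = np.array_split(x, 2)
    return (a.mean(), b.mean()), (a.var(), b.var())

print(halves_stats(walk))    # half-means typically drift far apart
print(halves_stats(diffed))  # half-means and variances stay close
```

A formal check would use a unit-root test such as the augmented Dickey-Fuller test; the split-half comparison here is only a quick diagnostic.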
14. What is an ARIMA model? Explain the parameters of an ARIMA model.
Ans: ARIMA(p, d, q) combines an autoregressive part of order p (the series regressed on its own p past values), differencing of order d (applied to make the series stationary) and a moving-average part of order q (the current value modelled on the q past forecast errors).

Multivariate Analysis
15. Explain PCA and Factor Analysis in brief statements.

Principal Component Analysis (PCA) is a dimension-reduction tool that can be used to reduce a large set of
variables to a small set that still contains most of the information in the large set.
It is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly
correlated variables into a set of values of linearly uncorrelated variables.
The first principal component accounts for as much of the variability in the data as possible, and each
succeeding component accounts for as much of the remaining variability as possible.
How does PCA work –

1. Calculate the covariance matrix of the (centred) data points.
2. Calculate the eigenvectors and corresponding eigenvalues.
3. Sort the eigenvectors by their eigenvalues in decreasing order.
4. Choose the first k eigenvectors; these will be the new k dimensions.
5. Transform the original n-dimensional data points into the k dimensions.
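The five steps above can be sketched in numpy (numpy assumed; the 3-D example data are made up so that they really lie near a 2-D plane):

```python
import numpy as np

def pca(X, k):
    """PCA via eigendecomposition of the covariance matrix,
    following the five listed steps."""
    Xc = X - X.mean(axis=0)             # centre the data
    cov = np.cov(Xc, rowvar=False)      # 1. covariance matrix
    vals, vecs = np.linalg.eigh(cov)    # 2. eigenvalues / eigenvectors
    order = np.argsort(vals)[::-1]      # 3. sort by eigenvalue, descending
    vecs = vecs[:, order[:k]]           # 4. keep the first k eigenvectors
    return Xc @ vecs                    # 5. project onto k dimensions

rng = np.random.default_rng(8)
z = rng.normal(size=(300, 2))
# 3-D data whose third column is (almost) a sum of the first two
X = np.column_stack([z[:, 0], z[:, 1],
                     z[:, 0] + z[:, 1] + 0.01 * rng.normal(size=300)])
scores = pca(X, 2)
print(scores.shape)
```

The first component's scores have the largest variance, the second the largest remaining variance, matching the description above.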

16. State the steps of the K-means clustering method. How do you choose the optimum number of clusters?
1) Randomly select cluster centers.
2) Calculate the distance between each data point and cluster centers.
3) Assign the data point to the cluster center whose distance from the cluster center is
minimum of all the cluster centers.
4) Recalculate the new cluster center by taking mean of all data points assigned to that
centroid's cluster (summing over all the points of the group/cluster and dividing by the total
number of points).
5) Recalculate the distance between each data point and new obtained cluster centers.
6) If no data point was reassigned then stop, otherwise repeat from step 3).
The optimum number of clusters can be chosen with the elbow method: plot the within-cluster sum of squares against the number of clusters and pick the point where the curve bends (the "elbow"). The average silhouette score can also be used.
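The steps above can be sketched in numpy (numpy assumed; the two well-separated blobs are illustrative, and centres are initialised randomly from the data points as in step 1):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain k-means following the listed steps: assign each point to the
    nearest centre, then move each centre to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # 1. random centres
    for _ in range(iters):
        # 2-3. distance of every point to every centre; assign to the nearest
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # 4. recompute each centre as the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):   # 6. stop when nothing moves
            break
        centers = new                   # 5. distances recomputed next pass
    return labels, centers

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centers = kmeans(X, 2)
print(sorted(np.bincount(labels).tolist()))
```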

17. What exactly do we do in linear discriminant analysis? State your answer in one or two sentences.
Ans: LDA finds the linear combinations of the features that best separate two or more classes, by maximising the ratio of between-class variance to within-class variance; the resulting discriminants can be used for classification or dimension reduction.

R, Excel and SQL


18. a <- c(2,3,4); b <- c(1,2); What will a*b be? Ans: c(2,6,4), with a warning message that the longer object length is not a multiple of the shorter object length (b is recycled).
19. A set of observations (a vector in R) is given to you. State an R function which can print the outliers. (sapply, lapply)
20. How do you append and merge two data frames in R? (rbind, merge)
21. Suppose you have a data frame in R containing “Gender” and “Age” columns. Suggest a simple R function which can show the average “Age” for each “Gender”. (aggregate or tapply)
22. Use of vlookup and hlookup.
23. Which function will you use to remove spaces in EXCEL? (TRIM)
24. You are given two tables in EXCEL. One table has a unique “USERNAME” column and a corresponding “PASSWORD” column. The other table has “USERNAME” and “ACESS_LOCATION” columns. Suggest an EXCEL function to produce one consolidated table containing the “USERNAME”, “ACESS_LOCATION” and “PASSWORD” columns. (vlookup)
25. What is an inner join and an outer join in SQL? Explain briefly using two small tables.
Here are the different types of JOINs in SQL:
• (INNER) JOIN: returns records that have matching values in both tables
• LEFT (OUTER) JOIN: returns all records from the left table, and the matched records from the right table
• RIGHT (OUTER) JOIN: returns all records from the right table, and the matched records from the left table
• FULL (OUTER) JOIN: returns all records when there is a match in either the left or the right table
