
What is a p-value?

When you perform a hypothesis test in statistics, a p-value can help you determine the
strength of your results. The p-value is a number between 0 and 1 that denotes the
strength of the evidence. The claim that is on trial is called the null hypothesis.

A low p-value (≤ 0.05) indicates strong evidence against the null hypothesis, which means
we can reject the null hypothesis. A high p-value (> 0.05) indicates weak evidence against
the null hypothesis, which means we fail to reject it. A p-value right around 0.05 is
marginal and could go either way. To put it in another way,

High P values: your data are likely with a true null. Low P values: your data are unlikely
with a true null.

The p-value of the test statistic is a way of saying how extreme that statistic is for our
sample data. The smaller the p-value, the more unlikely the observed sample.

Difference Between P-Value and Alpha

To determine if an observed outcome is statistically significant, we compare the values of
alpha and the p-value. There are two possibilities that emerge:

 The p-value is less than or equal to alpha. In this case, we reject the null
hypothesis. When this happens, we say that the result is statistically significant. In
other words, we are reasonably sure that there is something besides chance alone
that gave us an observed sample.
 The p-value is greater than alpha. In this case, we fail to reject the null hypothesis.
When this happens, we say that the result is not statistically significant. In other
words, we are reasonably sure that our observed data can be explained by chance
alone.
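
As a minimal sketch of this decision rule in Python (the sample values and the null mean
of 5.0 are made up for illustration; SciPy's one-sample t-test does the work):

    # Sketch: comparing the p-value to alpha with a one-sample t-test.
    from scipy import stats

    sample = [5.1, 4.9, 6.2, 5.8, 5.5, 6.0, 5.3, 5.7]  # hypothetical measurements
    alpha = 0.05                                       # chosen significance level

    # H0: the population mean is 5.0
    t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)

    if p_value <= alpha:
        print(f"p = {p_value:.4f} <= {alpha}: reject the null hypothesis")
    else:
        print(f"p = {p_value:.4f} > {alpha}: fail to reject the null hypothesis")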

What is sampling? How many sampling methods do you know?

Data sampling is a statistical analysis technique used to select, manipulate, and analyze a
representative subset of data points to identify patterns and trends in the larger data set
being examined. Common methods include simple random sampling, systematic sampling,
stratified sampling, and cluster sampling.
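
As an illustrative sketch (the DataFrame and group proportions below are made up), two
of these methods with pandas:

    # Sketch: simple random vs. stratified sampling with pandas.
    import pandas as pd

    df = pd.DataFrame({
        "group": ["A"] * 80 + ["B"] * 20,   # imbalanced population
        "value": range(100),
    })

    # Simple random sampling: every row has the same chance of selection.
    simple = df.sample(n=10, random_state=0)

    # Stratified sampling: sample within each group, preserving proportions.
    stratified = df.groupby("group").sample(frac=0.1, random_state=0)

    print(simple["group"].value_counts())
    print(stratified["group"].value_counts())  # 8 from A, 2 from B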
What are the assumptions required for linear regression?

There are four major assumptions:

1. There is a linear relationship between the dependent variable and the regressors,
meaning the model you are creating actually fits the data.
2. The errors or residuals of the data are normally distributed and independent from
each other.
3. There is minimal multicollinearity between explanatory variables.
4. Homoscedasticity: the variance around the regression line is the same for all values
of the predictor variable.
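
A rough sketch of how each assumption might be checked, using statsmodels and SciPy on
synthetic data (the data-generating process and the usual rule-of-thumb thresholds are
assumptions for illustration):

    # Sketch: diagnostics for the four linear regression assumptions.
    import numpy as np
    import statsmodels.api as sm
    from scipy import stats
    from statsmodels.stats.stattools import durbin_watson
    from statsmodels.stats.diagnostic import het_breuschpagan

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))                 # two explanatory variables
    y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=100)

    # 1. Linearity: judge via R^2 and residual plots.
    model = sm.OLS(y, sm.add_constant(X)).fit()
    resid = model.resid

    # 2. Normality and independence of residuals.
    w, p_norm = stats.shapiro(resid)              # p > 0.05 suggests normality
    dw = durbin_watson(resid)                     # near 2 suggests no autocorrelation

    # 3. Multicollinearity: off-diagonal correlations should be small.
    corr = np.corrcoef(X, rowvar=False)

    # 4. Homoscedasticity: Breusch-Pagan p > 0.05 suggests constant variance.
    _, p_bp, _, _ = het_breuschpagan(resid, model.model.exog)

    print(f"R^2={model.rsquared:.3f}  Shapiro p={p_norm:.3f}  DW={dw:.2f}  BP p={p_bp:.3f}")
    print("regressor correlation:\n", corr)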

What is the difference between type I vs type II error?

A type I error occurs when the null hypothesis is true but is rejected (a false positive). A
type II error occurs when the null hypothesis is false but erroneously fails to be rejected
(a false negative).

What do you understand by the term Normal Distribution?

Data can be distributed in different ways: with a bias to the left or to the right, or all
jumbled up.

However, data can also be distributed around a central value without any bias to the left
or right, in which case it follows a normal distribution: the random variable is distributed
as a symmetrical, bell-shaped curve.

Properties of the normal distribution are as follows:

1. Unimodal: one mode.
2. Symmetrical: left and right halves are mirror images.
3. Bell-shaped: maximum height (mode) at the mean.
4. Mean, mode, and median are all located at the center.
5. Asymptotic: the tails approach, but never touch, the horizontal axis.
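
A quick sketch of these properties with NumPy (the loc and scale parameters are
arbitrary): in a large normal sample, the mean and median nearly coincide at the center.

    # Sketch: mean and median of a normal sample coincide at the center.
    import numpy as np

    rng = np.random.default_rng(42)
    data = rng.normal(loc=100, scale=15, size=100_000)

    print("mean  :", np.mean(data))    # ~100
    print("median:", np.median(data))  # ~100, matching the mean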

What is correlation and covariance in statistics?

Both correlation and covariance establish the relationship and measure the dependency
between two random variables. Though the two are similar in mathematical terms, they
are different from each other.

Correlation is considered the best technique for measuring and estimating the
quantitative relationship between two variables. Correlation measures how strongly two
variables are related, on a standardized scale from −1 to 1.

Covariance is a measure that indicates the extent to which two random variables change
in tandem. It is a statistical term that explains the systematic relation between a pair of
random variables, wherein a change in one variable is reciprocated by a corresponding
change in the other. Unlike correlation, covariance is not standardized, so its value
depends on the units of the variables.
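
A small sketch with NumPy on made-up data, showing that correlation is just covariance
rescaled by the two standard deviations:

    # Sketch: covariance vs. correlation.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    cov = np.cov(x, y)[0, 1]        # unit-dependent measure of co-movement
    corr = np.corrcoef(x, y)[0, 1]  # unitless, always in [-1, 1]

    print("covariance :", cov)
    print("correlation:", corr)
    # correlation == covariance / (std(x) * std(y)), with sample std (ddof=1)
    print("check      :", cov / (np.std(x, ddof=1) * np.std(y, ddof=1)))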

What is a confidence interval?

A confidence interval gives us a range of values which is likely to contain the population
parameter. The confidence interval is generally preferred, as it tells us how likely this
interval is to contain the population parameter.

This likeliness or probability is called the confidence level or confidence coefficient and is
represented by 1 − alpha, where alpha is the level of significance.

A confidence interval does not quantify variability. A 95% confidence interval is a range
of values that you can be 95% certain contains the true mean of the population. This is
not the same as a range that contains 95% of the values.


https://www.mathsisfun.com/data/confidence-interval.html

Step 1: start with

 the number of observations n
 the mean X̄
 the standard deviation s

Note: we should use the standard deviation of the entire population, but in many cases
we won't know it. We can use the standard deviation of the sample if we have enough
observations (at least n = 30, hopefully more).

Using our example:

 number of observations n = 40
 mean X̄ = 175
 standard deviation s = 20

Step 2: decide what Confidence Interval we want: 95% or 99% are common
choices. Then find the "Z" value for that Confidence Interval here:

Confidence Interval    Z
80%                    1.282
85%                    1.440
90%                    1.645
95%                    1.960
99%                    2.576
99.5%                  2.807
99.9%                  3.291

For 95% the Z value is 1.960

Step 3: use that Z value in this formula for the confidence interval

X̄ ± Z × s/√n

Where:

 X̄ is the mean
 Z is the chosen Z-value from the table above
 s is the standard deviation
 n is the number of observations

And we have:

175 ± 1.960 × 20/√40

Which is:

175 cm ± 6.20 cm

In other words: from 168.8 cm to 181.2 cm

A confidence interval is a range of values that likely would contain an unknown
population parameter. Confidence level refers to the percentage of probability, or
certainty, that the confidence interval would contain the true population parameter when
you draw a random sample many times. Or, in the vernacular, "We are 99% certain
(confidence level) that most of these datasets (confidence intervals) contain the true
population parameter."
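
A minimal sketch reproducing the worked example above in Python:

    # Sketch: the 95% confidence interval from the example (n=40, mean=175, s=20).
    import math

    n, mean, s = 40, 175, 20
    z = 1.960  # Z-value for 95% confidence, from the table above

    margin = z * s / math.sqrt(n)
    print(f"{mean} ± {margin:.2f}  ->  ({mean - margin:.1f}, {mean + margin:.1f})")
    # 175 ± 6.20  ->  (168.8, 181.2)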

What is Overfitting?
In overfitting, a statistical model describes random error or noise instead of the
underlying relationship. Overfitting occurs when a model is excessively complex, such as
having too many parameters relative to the number of observations. A model that has
been overfitted has poor predictive performance, as it overreacts to minor fluctuations in
the training data.

What is Underfitting?
Underfitting occurs when a statistical model or machine learning algorithm cannot
capture the underlying trend of the data. Underfitting would occur, for example, when
fitting a linear model to non-linear data. Such a model would also have poor predictive
performance.
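
A brief sketch contrasting the two with NumPy polynomial fits on synthetic quadratic
data (the degrees and noise level are illustrative choices; typically the middle degree
generalizes best):

    # Sketch: under- vs. overfitting with polynomials of increasing degree.
    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(0, 1, 20)
    y = x**2 + rng.normal(scale=0.05, size=x.size)   # true trend is quadratic

    x_test = np.linspace(0.025, 0.975, 20)           # held-out points
    y_test = x_test**2

    for degree in (1, 2, 9):                         # underfit, good fit, overfit
        coefs = np.polyfit(x, y, degree)
        mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
        print(f"degree {degree}: test MSE = {mse:.5f}")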

What is Machine Learning?

Machine Learning explores the study and construction of algorithms that can learn from
and make predictions on data. It is closely related to computational statistics and is used
to devise complex models and algorithms that lend themselves to prediction; in
commercial use, this is known as predictive analytics.

What is skewness?
Skewness is a measure of the asymmetry of the data around the sample mean. If
skewness is negative, the data are spread out more to the left of the mean than to the
right. If skewness is positive, the data are spread out more to the right.

What is Variance?
Variance (σ²) in statistics is a measurement of the spread between numbers in a data set.
That is, it measures how far each number in the set is from the mean and therefore from
every other number in the set.
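
A quick sketch computing both the skewness and variance just described, with NumPy
and SciPy on a made-up, right-skewed sample:

    # Sketch: variance (spread) and skewness (asymmetry) of a sample.
    import numpy as np
    from scipy import stats

    data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 12])  # 12 stretches the right tail

    print("variance:", np.var(data, ddof=1))  # average squared distance from the mean
    print("skewness:", stats.skew(data))      # > 0: longer tail to the right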

Central Limit Theorem

The central limit theorem states that the sampling distribution of the sample mean
approaches a normal distribution as the sample size gets larger no matter what the
shape of the population distribution.
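
A small simulation sketch of the theorem (the exponential source distribution, sample
size of 50, and 10,000 trials are illustrative choices):

    # Sketch: means of samples from a skewed distribution look normal.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    # 10,000 samples of size 50 from a heavily skewed exponential distribution
    sample_means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)

    # The exponential distribution has skewness 2; the sample means are far
    # more symmetric, and centered on the population mean of 1.0.
    print("skewness of sample means:", stats.skew(sample_means))
    print("mean of sample means    :", sample_means.mean())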

Mean vs Median
Outliers and skewed data have a smaller effect on the median than on the mean: when
some of the values are more extreme, the effect on the median is smaller. Of course, with
other types of changes, the median can change. When you have a skewed distribution, the
median is a better measure of central tendency than the mean.
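
A tiny sketch with made-up salary data showing how a single extreme value drags the
mean but barely moves the median:

    # Sketch: the median resists outliers better than the mean.
    import numpy as np

    salaries = np.array([40, 42, 45, 48, 50])   # hypothetical, in $1000s
    with_outlier = np.append(salaries, 500)     # one extreme value

    print("mean  :", salaries.mean(), "->", with_outlier.mean())          # 45.0 -> ~120.8
    print("median:", np.median(salaries), "->", np.median(with_outlier))  # 45.0 -> 46.5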

Null Hypothesis

A null hypothesis is a hypothesis that says there is no statistically significant relationship
between two variables. It is usually the hypothesis a researcher or experimenter will try
to disprove or discredit. An alternative hypothesis is one that states there is a statistically
significant relationship between two variables.

Standard Error

The standard error of the mean, also called the standard deviation of the mean, is a
method used to estimate the standard deviation of a sampling distribution. More
generally, the standard error (SE) of a statistic is the standard deviation of the sampling
distribution of that statistic. It measures the accuracy with which a sample represents a
population: a sample mean deviates from the actual mean of the population, and this
deviation is the standard error of the mean, estimated as SE = s/√n.
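
A minimal sketch computing the standard error of the mean from a made-up sample:

    # Sketch: standard error of the mean, SE = s / sqrt(n).
    import numpy as np
    from scipy import stats

    sample = np.array([172, 168, 175, 180, 169, 177, 174, 171])  # hypothetical heights

    se = np.std(sample, ddof=1) / np.sqrt(len(sample))
    print("SE (by hand):", se)
    print("SE (scipy)  :", stats.sem(sample))  # same value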

Categorical Variables
https://www.kaggle.com/alexisbcook/categorical-variables

A categorical variable takes only a limited number of values.

 Consider a survey that asks how often you eat breakfast and provides four options: "Never",
"Rarely", "Most days", or "Every day". In this case, the data is categorical, because
responses fall into a fixed set of categories.
 If people responded to a survey about what brand of car they owned, the responses
would fall into categories like "Honda", "Toyota", and "Ford". In this case, the data is also
categorical.

You will get an error if you try to plug these variables into most machine learning models
in Python without preprocessing them first. The tutorial linked above compares three
approaches you can use to prepare your categorical data.
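
As an illustrative sketch (not the linked tutorial's exact code; the DataFrame is made up),
two common ways to encode a categorical column before modeling:

    # Sketch: one-hot and ordinal encoding of a categorical column.
    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder

    df = pd.DataFrame({"brand": ["Honda", "Toyota", "Ford", "Toyota"]})

    # One-hot encoding: one 0/1 column per category.
    one_hot = pd.get_dummies(df, columns=["brand"])

    # Ordinal encoding: each category mapped to an integer.
    df["brand_code"] = OrdinalEncoder().fit_transform(df[["brand"]])

    print(one_hot)
    print(df)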
