When you perform a hypothesis test in statistics, a p-value helps you judge the strength of your results. The p-value is a number between 0 and 1, and its value denotes the strength of the evidence. The claim which is on trial is called the null hypothesis.
A low p-value (≤ 0.05) indicates strong evidence against the null hypothesis, which means we can reject the null hypothesis. A high p-value (> 0.05) indicates weak evidence against the null hypothesis, which means we fail to reject it; a p-value right around 0.05 is marginal and could go either way. To put it another way:
High p-values: your data are likely under a true null. Low p-values: your data are unlikely under a true null.
The p-value of the test statistic is a way of saying how extreme that statistic is for our sample data. The smaller the p-value, the more unlikely the observed sample under the null.
If the p-value is less than or equal to alpha, we reject the null hypothesis. When this happens, we say that the result is statistically significant. In other words, we are reasonably sure that there is something besides chance alone that gave us the observed sample.
If the p-value is greater than alpha, we fail to reject the null hypothesis. When this happens, we say that the result is not statistically significant. In other words, we are reasonably sure that our observed data can be explained by chance alone.
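The decision rule above can be sketched in pure Python with a two-sided one-sample z-test. The numbers (observed mean 52, null mean 50, sd 10, n = 100) are invented illustration values, not from the text:

```python
import math

def normal_cdf(z):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def z_test_p_value(sample_mean, pop_mean, pop_sd, n):
    """Two-sided p-value for a one-sample z-test."""
    z = (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))
    return 2.0 * (1.0 - normal_cdf(abs(z)))

# Null: population mean is 50 (sd 10). We observed mean 52 from n = 100.
p = z_test_p_value(52, 50, 10, 100)
# z = 2.0, so p ≈ 0.0455 — below alpha = 0.05, so we reject the null.
```

With an observed mean of exactly 50 the p-value would be 1.0: the data are as consistent with the null as they can possibly be.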
Data sampling is a statistical analysis technique used to select, manipulate and analyze a representative subset of data points to identify patterns and trends in the larger data set being examined.
What are the assumptions required for linear regression?
There are four major assumptions:
1. There is a linear relationship between the dependent variable and the regressors, meaning the model you are creating actually fits the data.
2. The errors (residuals) of the data are normally distributed and independent from each other.
3. There is minimal multicollinearity between explanatory variables.
4. Homoscedasticity: the variance around the regression line is the same for all values of the predictor variable.
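To make the residual-based assumptions concrete, here is a minimal pure-Python sketch: fit a simple regression line by ordinary least squares and compute the residuals (the data values are invented for illustration):

```python
def fit_line(xs, ys):
    """Ordinary least squares fit for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
# OLS residuals always sum to ~0 by construction; assumptions 2 and 4
# concern their distribution (normality) and their spread (constant
# variance), which you would inspect with a histogram and a
# residuals-vs-fitted plot.
```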
A type I error occurs when the null hypothesis is true but is rejected. A type II error occurs when the null hypothesis is false but erroneously fails to be rejected.
Data is usually distributed with a bias to the left or to the right, or it can be jumbled up.
However, data can also be distributed around a central value without any bias to the left or right, reaching a normal distribution in the form of a bell-shaped curve: the random variable is distributed as a symmetrical, bell-shaped curve.
Both correlation and covariance establish the relationship and measure the dependency between two random variables. Though they do similar work in mathematical terms, they are different from each other.
Correlation is described as the best technique for measuring and estimating the quantitative relationship between two variables: it measures how strongly two variables are related, on a standardized scale from -1 to 1.
Covariance is a measure that indicates the extent to which two random variables change together. It is a statistical term describing the systematic relation between a pair of random variables, wherein a change in one variable is accompanied by a corresponding change in the other.
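Both quantities are easy to compute by hand; the following pure-Python sketch (with made-up data) shows that correlation is just covariance rescaled by the two standard deviations:

```python
import math

def covariance(xs, ys):
    """Sample covariance (divide by n - 1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled to the range [-1, 1]."""
    sx = math.sqrt(covariance(xs, xs))  # cov(x, x) is the variance
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]   # perfectly linear in xs
# covariance(xs, ys) = 5.0 (unit-dependent),
# correlation(xs, ys) = 1.0 (unit-free, perfect positive relationship)
```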
Statistics
https://www.mathsisfun.com/data/confidence-interval.html
Step 1: find the number of observations n, the mean X, and the standard deviation s:
number of observations n = 40
mean X = 175
standard deviation s = 20
Step 2: decide what Confidence Interval we want: 95% or 99% are common choices. Then find the "Z" value for that Confidence Interval:

Confidence Interval   Z
80%                   1.282
85%                   1.440
90%                   1.645
95%                   1.960
99%                   2.576
99.5%                 2.807
99.9%                 3.291

Step 3: use that Z value in this formula for the Confidence Interval:
X ± Z (s/√n)
where X is the mean, Z is the chosen Z value, s is the standard deviation, and n is the number of observations. With our values this is:
175 ± 1.960 × (20/√40)
which is:
175 cm ± 6.20 cm
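The three steps can be reproduced in Python using the example's own numbers (n = 40, mean 175, s = 20, Z = 1.960 for 95%):

```python
import math

def confidence_interval(mean, sd, n, z):
    """Return (low, high) for mean ± z * sd / sqrt(n)."""
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin

low, high = confidence_interval(175, 20, 40, 1.960)  # 95% CI
# margin = 1.960 * 20 / sqrt(40) ≈ 6.20,
# so the interval is roughly (168.8, 181.2) cm
```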
What is Overfitting?
In overfitting, a statistical model describes random error or noise instead of the
underlying relationship. Overfitting occurs when a model is excessively complex, such
as having too many parameters relative to the number of observations. A model that has been overfitted has poor predictive performance, as it overreacts to minor fluctuations in the training data.
What is Underfitting?
Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Underfitting would occur, for example, when fitting a linear model to non-linear data. Such a model would also have poor predictive performance.
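The linear-model-on-non-linear-data case can be demonstrated in a few lines of pure Python (the parabola data is invented for illustration): the best-fitting straight line still leaves a large residual error, which is exactly what underfitting looks like.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

xs = list(range(-3, 4))
ys = [x * x for x in xs]          # clearly non-linear (a parabola)
a, b = fit_line(xs, ys)
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
# By symmetry the best line is flat (b = 0), and the sum of squared
# errors stays large: a straight line cannot capture the curvature.
```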
What is skewness?
Skewness is a measure of the asymmetry of the data around the sample mean. If
skewness is negative, the data are spread out more to the left of the mean than to the
right. If skewness is positive, the data are spread out more to the right.
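A minimal pure-Python sketch of the moment-based skewness measure, with two small invented data sets showing the sign convention described above:

```python
def skewness(xs):
    """Moment-based skewness: third central moment over sd cubed."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n   # variance (population form)
    m3 = sum((x - m) ** 3 for x in xs) / n   # third central moment
    return m3 / m2 ** 1.5

right_tailed = [1, 2, 2, 3, 3, 3, 10]       # long tail to the right
left_tailed = [-10, -3, -3, -3, -2, -2, -1] # long tail to the left
# skewness(right_tailed) > 0, skewness(left_tailed) < 0
```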
What is Variance?
Variance (σ²) in statistics is a measurement of the spread between numbers in a data set. That is, it measures how far each number in the set is from the mean and therefore from every other number in the set.
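Computed directly from the definition (the data set here is a common textbook-style example, chosen so the population variance comes out to a round number):

```python
def variance(xs, sample=True):
    """Average squared distance from the mean (n - 1 for a sample)."""
    n = len(xs)
    m = sum(xs) / n
    denom = n - 1 if sample else n
    return sum((x - m) ** 2 for x in xs) / denom

data = [2, 4, 4, 4, 5, 5, 7, 9]
# mean = 5; population variance = 4.0, sample variance = 32/7 ≈ 4.57
```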
The central limit theorem states that the sampling distribution of the sample mean
approaches a normal distribution as the sample size gets larger no matter what the
shape of the population distribution.
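The central limit theorem is easy to see by simulation. This sketch draws repeated samples from a uniform population (which is flat, not bell-shaped) and collects the sample means; the sample sizes and trial counts are arbitrary choices for illustration:

```python
import random
import statistics

random.seed(42)  # reproducible runs

def sample_means(sample_size, trials):
    """Means of repeated samples from a (non-normal) uniform population."""
    return [statistics.mean(random.random() for _ in range(sample_size))
            for _ in range(trials)]

means = sample_means(sample_size=50, trials=2000)
# The uniform population has mean 0.5 and sd sqrt(1/12) ≈ 0.289.
# The sample means cluster near 0.5 with sd ≈ 0.289 / sqrt(50) ≈ 0.041,
# and a histogram of them is approximately bell-shaped.
```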
Mean vs Median
Outliers and skewed data have a smaller effect on the median than on the mean: when some of the values are more extreme, the effect on the median is smaller. (With other types of changes, the median can of course change too.) When you have a skewed distribution, the median is a better measure of central tendency than the mean.
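A two-line demonstration with Python's standard library (the data is invented): adding a single extreme value moves the mean dramatically but barely shifts the median.

```python
import statistics

data = [1, 2, 3, 4, 5]
with_outlier = data + [100]

# One extreme value drags the mean far more than the median:
# mean jumps from 3 to ~19.17, median only from 3 to 3.5
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```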
Standard Error
The standard error of the mean, also called the standard deviation of the mean, estimates the standard deviation of the sampling distribution of the sample mean.
More generally, the standard error is the standard deviation of the sampling distribution of a statistic. It can be abbreviated as S.E.
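For the sample mean, the standard error has a simple closed form, s/√n. Plugging in the numbers from the confidence-interval example earlier (s = 20, n = 40):

```python
import math

def standard_error(sd, n):
    """S.E. of the mean: sample sd over the square root of sample size."""
    return sd / math.sqrt(n)

se = standard_error(20, 40)   # ≈ 3.16
# Note the 95% CI margin from before is just 1.960 * se ≈ 6.20.
```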
CATEGORICAL VARIABLE
https://www.kaggle.com/alexisbcook/categorical-variables
Consider a survey that asks how often you eat breakfast and provides four options: "Never",
"Rarely", "Most days", or "Every day". In this case, the data is categorical, because
responses fall into a fixed set of categories.
If people responded to a survey about what brand of car they owned, the responses would fall into categories like "Honda", "Toyota", and "Ford". In this case, the data is also categorical.
You will get an error if you try to plug these variables into most machine learning models in Python
without preprocessing them first. In this tutorial, we'll compare three approaches that you can use to
prepare your categorical data.
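The linked Kaggle tutorial uses pandas and scikit-learn; as a minimal pure-Python sketch of two of the common approaches, here is ordinal encoding (sensible for the breakfast survey, whose categories have a natural order) and one-hot encoding (better for car brands, which have no order). The survey responses are made-up illustration data:

```python
responses = ["Never", "Rarely", "Most days", "Every day", "Rarely"]

# Ordinal encoding: map each category to an integer, respecting
# the natural order of the categories.
order = {"Never": 0, "Rarely": 1, "Most days": 2, "Every day": 3}
ordinal = [order[r] for r in responses]          # [0, 1, 2, 3, 1]

# One-hot encoding: one 0/1 column per category, with exactly one 1
# per row. No order is implied, so it suits unordered categories.
categories = sorted(set(responses))
one_hot = [[int(r == c) for c in categories] for r in responses]
```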