
PROBABILITY AND STATISTICS FOR DATA SCIENCE

Probability and statistics form the basis of data science. Probability theory is essential for making predictions, and estimates and predictions form an important part of data science. With the help of statistical methods, we make estimates for further analysis.

Data

Data is the collected information (observations) we have about something, or facts and statistics collected together for reference or analysis.

Why does Data Matter?

• Helps in understanding more about the data by identifying relationships that may exist between two variables.
• Helps in predicting the future, or forecasting, based on previous trends in the data.
• Helps in determining patterns that may exist in the data.
• Helps in detecting fraud by uncovering anomalies in the data.

Descriptive Statistics

A descriptive statistic is a summary statistic that quantitatively describes or summarizes features of a collection of information. It helps us know our data better and is used to describe the characteristics of the data.

Population or Sample Data


Before performing any analysis of data, we should determine whether the data we're dealing with is a population or a sample.

Population: the collection of all items (N); it includes each and every unit of our study. A full population is often hard to observe, and a measure of its characteristics, such as the mean or mode, is called a parameter.

Sample: a subset of the population (n); it includes only a handful of units of the population. It is selected at random, and a measure of its characteristics is called a statistic.
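To make the distinction concrete, here is a minimal sketch in Python using a hypothetical population generated with NumPy (all the numbers are invented for illustration): the mean of the full population is a parameter, while the mean of a random sample of n = 100 units is a statistic that estimates it.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population of N = 10,000 measurements.
population = rng.normal(loc=170, scale=10, size=10_000)

# Parameter: computed from every unit of the population.
parameter_mean = population.mean()

# Statistic: computed from a random sample of n = 100 units.
sample = rng.choice(population, size=100, replace=False)
statistic_mean = sample.mean()

print(f"population mean (parameter): {parameter_mean:.2f}")
print(f"sample mean (statistic):     {statistic_mean:.2f}")
```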
   

Statistics is defined as the process of collecting data, classifying it, representing it for easy interpretation, and further analyzing it.

There are two methods of analyzing data in mathematical statistics that are used on a large scale:

• Descriptive Statistics
• Inferential Statistics

Descriptive Statistics

The descriptive method of statistics is used to describe the collected data and summarize it and its properties using the measures of central tendency and the measures of dispersion.

Inferential Statistics

Inferential statistics requires statistical tests performed on samples, and it draws conclusions, for example by identifying the differences between two groups. Tests calculate a p-value, which is compared with the significance level α, commonly 0.05. If the p-value is less than α, the result is concluded to be statistically significant.
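As a rough illustration, the sketch below runs SciPy's independent two-sample t-test on two invented groups and compares the resulting p-value against α = 0.05; the data and group sizes are assumptions made purely for demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two hypothetical groups, e.g. a control and a treatment sample.
group_a = rng.normal(loc=50, scale=5, size=40)
group_b = rng.normal(loc=53, scale=5, size=40)

# Independent two-sample t-test on the difference in means.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < alpha:
    print("p < alpha: the difference is statistically significant.")
else:
    print("p >= alpha: fail to reject the null hypothesis.")
```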
PROBABILITY AND STATISTICS FOR DATA SCIENCE

Different Models of Statistics

Statistics is a broad term used in various forms, so different models of statistics are used in different settings.

Skewness - In statistics, skewness is a measure of the asymmetry in a probability distribution; it measures how much the distribution of the data deviates from the symmetric normal curve.

ANOVA Statistics - ANOVA stands for Analysis of Variance. It is a measure used to test whether the mean differences across a given set of groups are significant. This model of statistics is used, for example, to compare the performance of stocks over a period of time.

Degrees of freedom - The degrees of freedom are the number of values in a calculation that are free to vary while estimating a parameter.

Regression Analysis - In this model, a statistical process determines the relationship between variables. The process describes how a dependent variable changes when an independent variable changes.
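The short sketch below illustrates two of these models with SciPy: measuring skewness on a symmetric versus a right-skewed sample, and running a one-way ANOVA across three hypothetical groups of stock returns. All the numbers are simulated for demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Skewness: near 0 for a symmetric sample, positive for a right-skewed one.
symmetric = rng.normal(size=1_000)
right_skewed = rng.exponential(size=1_000)
print(f"skewness (normal):      {stats.skew(symmetric):.3f}")
print(f"skewness (exponential): {stats.skew(right_skewed):.3f}")

# One-way ANOVA: do three groups (e.g. three stocks' returns) share a mean?
stock_a = rng.normal(0.05, 0.02, size=30)
stock_b = rng.normal(0.05, 0.02, size=30)
stock_c = rng.normal(0.08, 0.02, size=30)
f_stat, p_value = stats.f_oneway(stock_a, stock_b, stock_c)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")
```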

Measures of Central Tendency in Statistics

The measure of central tendency and the measure of dispersion are considered the basis of descriptive statistics. A measure of central tendency is a representative value for the given data that gives us an idea of where the data points are centered; measures of dispersion then describe how the data are scattered around this central value. The different measures of central tendency for the data are listed below (all five are computed in the sketch after this list):

• Arithmetic Mean
• Median
• Mode
• Geometric Mean
• Harmonic Mean
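A small sketch computing all five measures on an invented data set (the mode is computed with value counts so the code does not depend on any particular SciPy version):

```python
import numpy as np
from scipy import stats

# Hypothetical data set, used only to illustrate the five measures.
data = np.array([60, 70, 45, 30, 90, 70])

# Mode: the most frequent value, found via value counts.
values, counts = np.unique(data, return_counts=True)
mode = values[np.argmax(counts)]

print(f"arithmetic mean: {np.mean(data):.2f}")
print(f"median:          {np.median(data):.2f}")
print(f"mode:            {mode}")
print(f"geometric mean:  {stats.gmean(data):.2f}")
print(f"harmonic mean:   {stats.hmean(data):.2f}")
```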

The data you want to analyze can have any distribution, and probability distribution graphs can take on very distinct and recognizable shapes. Recognizing these graphs and distributions can help you find certain characteristics of your data and perform specific calculations on them.

What is Normal Distribution?

A normal distribution is a continuous probability distribution with a probability density function that gives a symmetrical bell curve. Simply put, it is a plot of the probability function of a variable that has most of its data concentrated around one point, with the remaining points tapering off symmetrically towards the two opposite ends.

A normal distribution is centered around the mean: the distribution has more data around the mean, and the density decreases as you move away from the center. The resulting curve is symmetrical about the mean and forms a bell-shaped distribution.
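As a quick illustration, the sketch below evaluates the standard normal density at a few points with SciPy and checks the symmetry about the mean; the parameters mu = 0 and sigma = 1 are chosen only for demonstration.

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 0, 1  # assumed parameters, for illustration only

# The density is highest at the mean and falls off symmetrically.
for x in [-2, -1, 0, 1, 2]:
    print(f"pdf({x:+d}) = {norm.pdf(x, loc=mu, scale=sigma):.4f}")

# Symmetry: pdf(mu - d) == pdf(mu + d) for any distance d.
print(np.isclose(norm.pdf(mu - 1.5), norm.pdf(mu + 1.5)))  # True
```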

Consider, for example, a plot of the distribution of people's heights: the distribution is centered about the mean, or average, of all heights, and most data lies around that mean. As you move away from it, the probability density decreases. This kind of curve is called a bell curve, and it is a common feature of a normal distribution.

1. Continuous Probability Distribution: a probability distribution in which the random variable, X, can take any value in a range, e.g., the amount of rainfall. You can record the rainfall received at a certain time as 9 inches, but this is not an exact value; the actual value could be 9.001234 inches or any of infinitely many other numbers. There is no definitive way to assign probability to a single point in this case, so instead you work with ranges of continuous values.

2. Probability Density Function: a function that describes how the probability is distributed over the range of values that a continuous random variable can take.

What is Standard Deviation?

The standard deviation is a measure of how much the values in your data differ from one another, or how spread out your data is.

The standard deviation measures how far apart the data points in your observations are from each other. You can calculate it by subtracting the mean from each data point and then finding the mean of the squared differences; this quantity is called the variance. The square root of the variance gives you the standard deviation.

Data: 60, 70, 45, 30, 90
Mean: mu = (60 + 70 + 45 + 30 + 90)/5 = 59
Variance: sigma^2 = sum((x - mu)^2)/n
= ((60-59)^2 + (70-59)^2 + (45-59)^2 + (30-59)^2 + (90-59)^2)/5
= (1 + 121 + 196 + 841 + 961)/5 = 424
SD: sigma = sqrt(424) ≈ 20.59
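The same calculation in NumPy, as a quick check (note that np.std divides by n by default, matching the population formula used above):

```python
import numpy as np

data = np.array([60, 70, 45, 30, 90])

mean = data.mean()                       # 59.0
variance = np.mean((data - mean) ** 2)   # 424.0 (population variance)
std_dev = np.sqrt(variance)              # ~20.59

print(mean, variance, std_dev)
print(np.std(data))  # same value: np.std uses ddof=0 (divide by n) by default
```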

What is Standard Normal Distribution?

A Standard Normal Distribution is a normal distribution with a mean of 0 and a standard deviation of 1. This means that the distribution has its center at 0 and is measured in intervals that increase by 1 standard deviation.

The mean and standard deviation of a normal distribution are not fixed; they can take on any values. However, when you standardize a normal distribution, the mean and standard deviation become fixed and are the same for all standard normal distributions.

What is Z-Score?

The z-score tells you how far a data point is from the mean. You calculate it using the mean and standard deviation, z = (x - mu)/sigma, so it can also be said that the z-score is how many standard deviations above or below the mean the data point lies.

The z-score is used to standardize your normal distribution. Using the z-score, you can convert each data point into a value expressed in terms of the mean and standard deviation, effectively converting the graph into a rescaled version. The z-score tells you how far each data point is from the mean in steps of one standard deviation, so with just the mean and standard deviation you can plot all the points on the graph.
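Reusing the data set from the standard deviation example, the sketch below standardizes each point with z = (x - mu)/sigma and verifies that the standardized values have mean 0 and standard deviation 1:

```python
import numpy as np

data = np.array([60, 70, 45, 30, 90])
mu = data.mean()      # 59.0
sigma = data.std()    # ~20.59 (population standard deviation)

# z = (x - mu) / sigma for every data point.
z_scores = (data - mu) / sigma
print(z_scores.round(2))

# Standardized data has mean 0 and standard deviation 1.
print(round(z_scores.mean(), 10), round(z_scores.std(), 10))
```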

Central Limit Theorem, Its Significance & Uses

Central Limit Theorem, also known as the CLT, is a crucial pillar of statistics
and machine learning. It is at the heart of hypothesis testing. 

What is the Central Limit Theorem?

The CLT is a statistical theory which states that if you take sufficiently large samples from a population with a finite variance, the distribution of the sample means will be approximately normal, and the mean of those sample means will be roughly equal to the population mean.

Significance of Central Limit Theorem

The CLT has several applications. Here are some of the places where you can use it.

 Political/election polling is a great example of how you can use CLT. These
polls are used to estimate the number of people who support a specific
candidate. You may have seen these results with confidence intervals on
news channels. The CLT aids in this calculation.

 You use the CLT in various census fields to calculate various population
details, such as family income, electricity consumption, individual salaries,
and so on.

Assumptions Behind the Central Limit Theorem

Before we move on, it is important to understand the assumptions behind the CLT:

 The data must adhere to the randomization rule. It needs to be sampled at random.

 The samples should be independent of one another. One sample should not influence the
others.

 When taking samples without replacement, the sample size should not exceed 10% of
the population.

Why n ≥ 30 Samples?

A sample size of 30 is generally considered sufficient for the effect of the CLT to appear. If the population distribution is close to a normal distribution, a smaller sample size is enough to demonstrate the central limit theorem; if the population distribution is highly skewed, you will need a larger sample size to see the CLT take effect.
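A small simulation sketch of this effect, using an intentionally skewed (exponential) population invented for demonstration: the means of repeated samples of size n = 30 cluster around the population mean, with spread close to sigma/sqrt(n), just as the CLT predicts.

```python
import numpy as np

rng = np.random.default_rng(7)

# A highly skewed (exponential) population: far from normal.
population = rng.exponential(scale=2.0, size=100_000)

# Draw many samples of size n and record each sample's mean.
n, trials = 30, 10_000
idx = rng.integers(0, population.size, size=(trials, n))
sample_means = population[idx].mean(axis=1)

# The sample means cluster near the population mean (2.0), and their
# spread shrinks like sigma / sqrt(n), as the CLT predicts.
print(f"population mean:      {population.mean():.3f}")
print(f"mean of sample means: {sample_means.mean():.3f}")
print(f"std of sample means:  {sample_means.std():.3f}")
print(f"sigma / sqrt(n):      {population.std() / np.sqrt(n):.3f}")
```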

Hypothesis Testing in Statistics

Hypothesis testing is a type of statistical analysis in which you put your assumptions about a population parameter to the test. It is used to estimate the relationship between two statistical variables.

Null Hypothesis and Alternate Hypothesis

 The Null Hypothesis is the assumption that the event will not occur. A null
hypothesis has no bearing on the study's outcome unless it is rejected.

 H0 is the symbol for it, and it is pronounced H-naught.

 The Alternate Hypothesis is the logical opposite of the null hypothesis. The
acceptance of the alternative hypothesis follows the rejection of the null hypothesis.

 H1 is the symbol for it.

Simple and Composite Hypothesis Testing

Depending on the population distribution, you can classify the statistical hypothesis into two types.

• Simple Hypothesis: A simple hypothesis specifies an exact value for the parameter.

• Composite Hypothesis: A composite hypothesis specifies a range of values.

One-Tailed and Two-Tailed Hypothesis Testing

The One-Tailed test, also called a directional test, considers a critical region of data that would result in the null hypothesis being rejected if the test sample falls into it, which in turn means the acceptance of the alternate hypothesis.

 In a one-tailed test, the critical distribution area is one-sided, meaning the test
sample is checked for being either greater than or less than a specific value, but not both.

 In a Two-Tailed test, the test sample is checked for being either greater than or less
than a range of values, meaning the critical distribution area is two-sided.
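To see the difference numerically, the sketch below converts an assumed z statistic of 1.8 (a made-up value) into one-tailed and two-tailed p-values using the standard normal survival function. At α = 0.05, the one-tailed test would reject the null hypothesis while the two-tailed test would not.

```python
from scipy.stats import norm

# Suppose a test yields a z statistic of 1.8 (hypothetical value).
z = 1.8

# One-tailed p-value: probability of a result at least this large.
p_one_tailed = norm.sf(z)            # 1 - CDF(z)

# Two-tailed p-value: probability of a result at least this extreme
# in either direction, i.e. counting both tails.
p_two_tailed = 2 * norm.sf(abs(z))

print(f"one-tailed: p = {p_one_tailed:.4f}")   # ~0.0359
print(f"two-tailed: p = {p_two_tailed:.4f}")   # ~0.0719
```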

Type 1 and Type 2 Error

A hypothesis test can result in two types of errors.

 Type 1 Error: A Type-I error occurs when the sample results lead to rejecting the null
hypothesis despite it being true.

 Type 2 Error: A Type-II error occurs, conversely, when the null hypothesis is not
rejected even though it is false.
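A brief simulation sketch of the Type I error: repeatedly testing two groups drawn from the same distribution (so the null hypothesis is true) should produce false rejections at a rate close to α. The sample sizes and trial count are arbitrary choices for demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha, trials = 0.05, 5_000

# Simulate tests in which the null hypothesis is TRUE (both groups
# come from the same distribution) and count false rejections.
rejections = sum(
    stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue < alpha
    for _ in range(trials)
)

# The observed Type I error rate should be close to alpha (about 5%).
print(f"observed Type I error rate: {rejections / trials:.3f}")
```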

Level of Significance

The alpha value is a criterion for determining whether a test statistic is statistically significant.

In a statistical test, alpha represents an acceptable probability of a Type I error. Because alpha is a probability, it can be anywhere between 0 and 1.

In practice, the most commonly used alpha values are 0.01, 0.05, and 0.1, which represent a 1%, 5%, and 10% chance of a Type I error, respectively (i.e., of rejecting the null hypothesis when it is in fact correct).

P-Value

A p-value is a metric that expresses the likelihood that an observed difference could have occurred by chance. As the p-value decreases, the statistical significance of the observed difference increases. If the p-value is low enough, you reject the null hypothesis.

Take an example in which you are trying to test whether a new advertising campaign has increased the product's sales. The null hypothesis states that there is no change in sales due to the new advertising campaign, and the p-value is the probability of observing a difference at least as large as the one you saw if that null hypothesis were true.

If the p-value is 0.30, there is a 30% chance of seeing a sales difference this large even if the campaign had no real effect. If the p-value is 0.03, there is only a 3% probability of seeing such a difference by chance alone.

As you can see, the lower the p-value, the stronger the evidence against the null hypothesis, and the more plausible it becomes that the new advertising campaign caused the increase or decrease in sales.
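An end-to-end sketch of the advertising example, with invented before/after sales figures, using SciPy's two-sample t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Hypothetical daily sales before and after the campaign (invented data).
sales_before = rng.normal(loc=100, scale=12, size=30)
sales_after = rng.normal(loc=108, scale=12, size=30)

t_stat, p_value = stats.ttest_ind(sales_after, sales_before)
print(f"p-value: {p_value:.4f}")

# A small p-value means data this extreme would be unlikely if the
# campaign truly had no effect, so we reject the null hypothesis.
if p_value < 0.05:
    print("Reject H0: the campaign plausibly changed sales.")
else:
    print("Fail to reject H0: no significant change detected.")
```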
