
PROBABILITY AND STATISTICS FOR DATA SCIENCE

Probability and statistics form the basis of data science. Probability theory is essential for making predictions, and estimates and predictions form an important part of data science. With the help of statistical methods, we make estimates for further analysis.

Data

Data is the collected information (observations) we have about something, or facts and statistics collected together for reference or analysis.

Why does Data Matter?

• Helps in understanding more about the data by identifying relationships that may exist between two variables.
• Helps in predicting the future, or forecasting, based on previous trends in the data.
• Helps in determining patterns that may exist in the data.
• Helps in detecting fraud by uncovering anomalies in the data.

Descriptive Statistics

A descriptive statistic is a summary statistic that quantitatively describes or summarizes features of a collection of information. It helps us know our data better and is used to describe the characteristics of the data.

Population or Sample Data


Before performing any analysis of data, we should determine whether the data we're dealing with is a population or a sample.

Population: the collection of all items (N); it includes each and every unit of our study. A full population is often hard to observe, and a measure of its characteristics, such as the mean or mode, is called a parameter.

Sample: a subset of the population (n); it includes only a handful of units of the population. It is selected at random, and a measure of its characteristics is called a statistic.
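To make the distinction concrete, here is a minimal sketch in Python using a hypothetical population generated with NumPy (all the numbers are invented for illustration): the mean of the full population is a parameter, while the mean of a random sample of n = 100 units is a statistic that estimates it.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population of N = 10,000 measurements.
population = rng.normal(loc=170, scale=10, size=10_000)

# Parameter: computed from every unit of the population.
parameter_mean = population.mean()

# Statistic: computed from a random sample of n = 100 units.
sample = rng.choice(population, size=100, replace=False)
statistic_mean = sample.mean()

print(f"population mean (parameter): {parameter_mean:.2f}")
print(f"sample mean (statistic):     {statistic_mean:.2f}")
```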
   

Statistics is defined as the process of collecting data, classifying it, representing it for easy interpretation, and further analyzing it.

There are two methods of analyzing data in mathematical statistics that are used on a large scale:

• Descriptive Statistics
• Inferential Statistics

Descriptive Statistics

The descriptive method of statistics is used to describe the collected data and summarize it and its properties using the measures of central tendency and the measures of dispersion.

Inferential Statistics

Inferential statistics requires statistical tests performed on samples, and it draws conclusions, for example by identifying the differences between two groups. Tests calculate a p-value, which is compared with the significance level α, commonly 0.05. If the p-value is less than α, the result is concluded to be statistically significant.
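As a rough illustration, the sketch below runs SciPy's independent two-sample t-test on two invented groups and compares the resulting p-value against α = 0.05; the data and group sizes are assumptions made purely for demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two hypothetical groups, e.g. a control and a treatment sample.
group_a = rng.normal(loc=50, scale=5, size=40)
group_b = rng.normal(loc=53, scale=5, size=40)

# Independent two-sample t-test on the difference in means.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < alpha:
    print("p < alpha: the difference is statistically significant.")
else:
    print("p >= alpha: fail to reject the null hypothesis.")
```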
PROBABILITY AND STATISTICS FOR DATA SCIENCE

Different Models of Statistics

Statistics is a broad term used in various forms, so different models of statistics are used in different settings.

Skewness - In statistics, skewness is a measure of the asymmetry in a probability distribution; it measures how much the distribution of the data deviates from the symmetric normal curve.

ANOVA Statistics - ANOVA stands for Analysis of Variance. It is a measure used to test whether the mean differences across a given set of groups are significant. This model of statistics is used, for example, to compare the performance of stocks over a period of time.

Degrees of freedom - The degrees of freedom are the number of values in a calculation that are free to vary while estimating a parameter.

Regression Analysis - In this model, a statistical process determines the relationship between variables. The process describes how a dependent variable changes when an independent variable changes.
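The short sketch below illustrates two of these models with SciPy: measuring skewness on a symmetric versus a right-skewed sample, and running a one-way ANOVA across three hypothetical groups of stock returns. All the numbers are simulated for demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Skewness: near 0 for a symmetric sample, positive for a right-skewed one.
symmetric = rng.normal(size=1_000)
right_skewed = rng.exponential(size=1_000)
print(f"skewness (normal):      {stats.skew(symmetric):.3f}")
print(f"skewness (exponential): {stats.skew(right_skewed):.3f}")

# One-way ANOVA: do three groups (e.g. three stocks' returns) share a mean?
stock_a = rng.normal(0.05, 0.02, size=30)
stock_b = rng.normal(0.05, 0.02, size=30)
stock_c = rng.normal(0.08, 0.02, size=30)
f_stat, p_value = stats.f_oneway(stock_a, stock_b, stock_c)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")
```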

Measures of Central Tendency in Statistics

The measure of central tendency and the measure of dispersion are considered the basis of descriptive statistics. A measure of central tendency is a representative value for the given data that gives us an idea of where the data points are centered; measures of dispersion then describe how the data are scattered around this central value. The different measures of central tendency for the data are listed below (all five are computed in the sketch after this list):

• Arithmetic Mean
• Median
• Mode
• Geometric Mean
• Harmonic Mean
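A small sketch computing all five measures on an invented data set (the mode is computed with value counts so the code does not depend on any particular SciPy version):

```python
import numpy as np
from scipy import stats

# Hypothetical data set, used only to illustrate the five measures.
data = np.array([60, 70, 45, 30, 90, 70])

# Mode: the most frequent value, found via value counts.
values, counts = np.unique(data, return_counts=True)
mode = values[np.argmax(counts)]

print(f"arithmetic mean: {np.mean(data):.2f}")
print(f"median:          {np.median(data):.2f}")
print(f"mode:            {mode}")
print(f"geometric mean:  {stats.gmean(data):.2f}")
print(f"harmonic mean:   {stats.hmean(data):.2f}")
```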

The data you want to analyze can have any distribution, and probability distribution graphs can take on very distinct and recognizable shapes. Recognizing these graphs and distributions can help you find certain characteristics of your data and perform specific calculations on them.

What is Normal Distribution?

A normal distribution is a continuous probability distribution with a probability density function that gives a symmetrical bell curve. Simply put, it is a plot of the probability function of a variable that has most of its data concentrated around one point, with the remaining points tapering off symmetrically towards the two opposite ends.

A normal distribution is centered around the mean: the distribution has more data around the mean, and the density decreases as you move away from the center. The resulting curve is symmetrical about the mean and forms a bell-shaped distribution.
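As a quick illustration, the sketch below evaluates the standard normal density at a few points with SciPy and checks the symmetry about the mean; the parameters mu = 0 and sigma = 1 are chosen only for demonstration.

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 0, 1  # assumed parameters, for illustration only

# The density is highest at the mean and falls off symmetrically.
for x in [-2, -1, 0, 1, 2]:
    print(f"pdf({x:+d}) = {norm.pdf(x, loc=mu, scale=sigma):.4f}")

# Symmetry: pdf(mu - d) == pdf(mu + d) for any distance d.
print(np.isclose(norm.pdf(mu - 1.5), norm.pdf(mu + 1.5)))  # True
```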

Consider, for example, a plot of the distribution of people's heights: the distribution is centered about the mean, or average, of all heights, and most data lies around that mean. As you move away from it, the probability density decreases. This kind of curve is called a bell curve, and it is a common feature of a normal distribution.

1. Continuous Probability Distribution: a probability distribution in which the random variable, X, can take any value in a range, e.g., the amount of rainfall. You can record the rainfall received at a certain time as 9 inches, but this is not an exact value; the actual value could be 9.001234 inches or any of infinitely many other numbers. There is no definitive way to assign probability to a single point in this case, so instead you work with ranges of continuous values.

2. Probability Density Function: a function that describes how the probability is distributed over the range of values that a continuous random variable can take.

What is Standard Deviation?

The standard deviation is a measure of how much the values in your data differ from one another, or how spread out your data is.

The standard deviation measures how far apart the data points in your observations are from each other. You can calculate it by subtracting the mean from each data point and then finding the mean of the squared differences; this quantity is called the variance. The square root of the variance gives you the standard deviation.

Data: 60, 70, 45, 30, 90
Mean: mu = (60 + 70 + 45 + 30 + 90)/5 = 59
Variance: sigma^2 = sum((x - mu)^2)/n
= ((60-59)^2 + (70-59)^2 + (45-59)^2 + (30-59)^2 + (90-59)^2)/5
= (1 + 121 + 196 + 841 + 961)/5 = 424
SD: sigma = sqrt(424) ≈ 20.59
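The same calculation in NumPy, as a quick check (note that np.std divides by n by default, matching the population formula used above):

```python
import numpy as np

data = np.array([60, 70, 45, 30, 90])

mean = data.mean()                       # 59.0
variance = np.mean((data - mean) ** 2)   # 424.0 (population variance)
std_dev = np.sqrt(variance)              # ~20.59

print(mean, variance, std_dev)
print(np.std(data))  # same value: np.std uses ddof=0 (divide by n) by default
```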

What is Standard Normal Distribution?

A Standard Normal Distribution is a normal distribution with a mean of 0 and a standard deviation of 1. This means that the distribution has its center at 0 and is measured in intervals that increase by 1 standard deviation.

The mean and standard deviation of a normal distribution are not fixed; they can take on any values. However, when you standardize a normal distribution, the mean and standard deviation become fixed and are the same for all standard normal distributions.

What is Z-Score?

The z-score tells you how far a data point is from the mean. You calculate it using the mean and standard deviation, z = (x - mu)/sigma, so it can also be said that the z-score is how many standard deviations above or below the mean the data point lies.

The z-score is used to standardize your normal distribution. Using the z-score, you can convert each data point into a value expressed in terms of the mean and standard deviation, effectively converting the graph into a rescaled version. The z-score tells you how far each data point is from the mean in steps of one standard deviation, so with just the mean and standard deviation you can plot all the points on the graph.
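Reusing the data set from the standard deviation example, the sketch below standardizes each point with z = (x - mu)/sigma and verifies that the standardized values have mean 0 and standard deviation 1:

```python
import numpy as np

data = np.array([60, 70, 45, 30, 90])
mu = data.mean()      # 59.0
sigma = data.std()    # ~20.59 (population standard deviation)

# z = (x - mu) / sigma for every data point.
z_scores = (data - mu) / sigma
print(z_scores.round(2))

# Standardized data has mean 0 and standard deviation 1.
print(round(z_scores.mean(), 10), round(z_scores.std(), 10))
```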

Central Limit Theorem, Its Significance & Uses

Central Limit Theorem, also known as the CLT, is a crucial pillar of statistics
and machine learning. It is at the heart of hypothesis testing. 

What is the Central Limit Theorem?

The CLT is a statistical theory which states that if you take sufficiently large samples from a population with a finite variance, the distribution of the sample means will be approximately normal, and the mean of those sample means will be roughly equal to the population mean.

Significance of Central Limit Theorem

The CLT has several applications. Here are some of the places where you can use it.

 Political/election polling is a great example of how you can use CLT. These
polls are used to estimate the number of people who support a specific
candidate. You may have seen these results with confidence intervals on
news channels. The CLT aids in this calculation.

 You use the CLT in various census fields to calculate various population
details, such as family income, electricity consumption, individual salaries,
and so on.

Assumptions Behind the Central Limit Theorem

Before we move on, it is important to understand the assumptions behind the CLT:

 The data must adhere to the randomization rule. It needs to be sampled at random.

 The samples should be independent of one another. One sample should not influence the
others.

 When taking samples without replacement, the sample size should not exceed 10% of
the population.

Why n ≥ 30 Samples?

A sample size of 30 is generally considered sufficient for the effect of the CLT to appear. If the population distribution is close to a normal distribution, a smaller sample size is enough to demonstrate the central limit theorem; if the population distribution is highly skewed, you will need a larger sample size to see the CLT take effect.
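A small simulation sketch of this effect, using an intentionally skewed (exponential) population invented for demonstration: the means of repeated samples of size n = 30 cluster around the population mean, with spread close to sigma/sqrt(n), just as the CLT predicts.

```python
import numpy as np

rng = np.random.default_rng(7)

# A highly skewed (exponential) population: far from normal.
population = rng.exponential(scale=2.0, size=100_000)

# Draw many samples of size n and record each sample's mean.
n, trials = 30, 10_000
idx = rng.integers(0, population.size, size=(trials, n))
sample_means = population[idx].mean(axis=1)

# The sample means cluster near the population mean (2.0), and their
# spread shrinks like sigma / sqrt(n), as the CLT predicts.
print(f"population mean:      {population.mean():.3f}")
print(f"mean of sample means: {sample_means.mean():.3f}")
print(f"std of sample means:  {sample_means.std():.3f}")
print(f"sigma / sqrt(n):      {population.std() / np.sqrt(n):.3f}")
```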

Hypothesis Testing in Statistics

Hypothesis testing is a type of statistical analysis in which you put your assumptions about a population parameter to the test. It is used to estimate the relationship between two statistical variables.

Null Hypothesis and Alternate Hypothesis

 The Null Hypothesis is the assumption that the event will not occur. A null
hypothesis has no bearing on the study's outcome unless it is rejected.

 H0 is the symbol for it, and it is pronounced H-naught.

 The Alternate Hypothesis is the logical opposite of the null hypothesis. The
acceptance of the alternative hypothesis follows the rejection of the null hypothesis.

 H1 is the symbol for it.

Simple and Composite Hypothesis Testing

Depending on the population distribution, you can classify the statistical hypothesis into two types.

• Simple Hypothesis: A simple hypothesis specifies an exact value for the parameter.

• Composite Hypothesis: A composite hypothesis specifies a range of values.

One-Tailed and Two-Tailed Hypothesis Testing

The One-Tailed test, also called a directional test, considers a critical region of data that would result in the null hypothesis being rejected if the test sample falls into it, which in turn means the acceptance of the alternate hypothesis.

 In a one-tailed test, the critical distribution area is one-sided, meaning the test
sample is checked for being either greater than or less than a specific value, but not both.

 In a Two-Tailed test, the test sample is checked for being either greater than or less
than a range of values, meaning the critical distribution area is two-sided.
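To see the difference numerically, the sketch below converts an assumed z statistic of 1.8 (a made-up value) into one-tailed and two-tailed p-values using the standard normal survival function. At α = 0.05, the one-tailed test would reject the null hypothesis while the two-tailed test would not.

```python
from scipy.stats import norm

# Suppose a test yields a z statistic of 1.8 (hypothetical value).
z = 1.8

# One-tailed p-value: probability of a result at least this large.
p_one_tailed = norm.sf(z)            # 1 - CDF(z)

# Two-tailed p-value: probability of a result at least this extreme
# in either direction, i.e. counting both tails.
p_two_tailed = 2 * norm.sf(abs(z))

print(f"one-tailed: p = {p_one_tailed:.4f}")   # ~0.0359
print(f"two-tailed: p = {p_two_tailed:.4f}")   # ~0.0719
```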

Type 1 and Type 2 Error

A hypothesis test can result in two types of errors.

 Type 1 Error: A Type-I error occurs when the sample results lead to rejecting the null
hypothesis despite it being true.

 Type 2 Error: A Type-II error occurs, conversely, when the null hypothesis is not
rejected even though it is false.
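A brief simulation sketch of the Type I error: repeatedly testing two groups drawn from the same distribution (so the null hypothesis is true) should produce false rejections at a rate close to α. The sample sizes and trial count are arbitrary choices for demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha, trials = 0.05, 5_000

# Simulate tests in which the null hypothesis is TRUE (both groups
# come from the same distribution) and count false rejections.
rejections = sum(
    stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue < alpha
    for _ in range(trials)
)

# The observed Type I error rate should be close to alpha (about 5%).
print(f"observed Type I error rate: {rejections / trials:.3f}")
```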

Level of Significance

The alpha value is a criterion for determining whether a test statistic is statistically significant.

In a statistical test, alpha represents an acceptable probability of a Type I error. Because alpha is a probability, it can be anywhere between 0 and 1.

In practice, the most commonly used alpha values are 0.01, 0.05, and 0.1, which represent a 1%, 5%, and 10% chance of a Type I error, respectively (i.e., of rejecting the null hypothesis when it is in fact correct).

P-Value

A p-value is a metric that expresses the likelihood that an observed difference could have occurred by chance. As the p-value decreases, the statistical significance of the observed difference increases. If the p-value is low enough, you reject the null hypothesis.

Take an example in which you are trying to test whether a new advertising campaign has increased the product's sales. The null hypothesis states that there is no change in sales due to the new advertising campaign, and the p-value is the probability of observing a difference at least as large as the one you saw if that null hypothesis were true.

If the p-value is 0.30, there is a 30% chance of seeing a sales difference this large even if the campaign had no real effect. If the p-value is 0.03, there is only a 3% probability of seeing such a difference by chance alone.

As you can see, the lower the p-value, the stronger the evidence against the null hypothesis, and the more plausible it becomes that the new advertising campaign caused the increase or decrease in sales.
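An end-to-end sketch of the advertising example, with invented before/after sales figures, using SciPy's two-sample t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Hypothetical daily sales before and after the campaign (invented data).
sales_before = rng.normal(loc=100, scale=12, size=30)
sales_after = rng.normal(loc=108, scale=12, size=30)

t_stat, p_value = stats.ttest_ind(sales_after, sales_before)
print(f"p-value: {p_value:.4f}")

# A small p-value means data this extreme would be unlikely if the
# campaign truly had no effect, so we reject the null hypothesis.
if p_value < 0.05:
    print("Reject H0: the campaign plausibly changed sales.")
else:
    print("Fail to reject H0: no significant change detected.")
```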
