
Normal distribution
Dhanya N.M.
What is normal?

🞅 Medical researchers have determined so-called normal intervals for a person’s
blood pressure, cholesterol, triglycerides, and the like. For example, the normal
range of systolic blood pressure is 110 to 140. The normal interval for a person’s
triglycerides is from 30 to 200 milligrams per deciliter (mg/dl). By measuring
these variables, a physician can determine if a patient’s vital statistics are
within the normal interval or if some type of treatment is needed to correct a
condition and avoid future illnesses.
🞅 The question then is: how does one determine these so-called normal intervals?
Histograms for the Distribution of Heights of Adult Women
Normal and skewed
Normal Distribution

The following variables are approximately normally distributed:

🞅 Height of a population
🞅 Blood pressure of adult humans
🞅 Position of a particle that experiences diffusion
🞅 Measurement errors
🞅 Residuals in regression
🞅 Shoe size of a population
🞅 Amount of time it takes for employees to reach home
🞅 A large number of educational measures
🞅 Additionally, many variables around us can be treated as normal with x% confidence, where x < 100.
Normal Distributions
🞅 The theoretical curve, called a normal distribution curve, can be used to study many
variables that are not perfectly normally distributed but are nevertheless
approximately normal.
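🞅 The mathematical equation for a normal distribution is
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
where μ is the mean and σ is the standard deviation.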
Shape of normal distribution
The Standard Normal Distribution

🞅 Since each normally distributed variable has its own mean and standard deviation,
the shape and location of these curves will vary.
🞅 In practical applications, then, you would have to have a table of areas under the
curve for each variable.
🞅 To simplify this situation, statisticians use what is called the standard normal distribution: a normal distribution with a mean of 0 and a standard deviation of 1.
Empirical rule
Finding Areas Under the Standard Normal Distribution Curve

🞅 For the solution of problems using the standard normal distribution, a four-step procedure
is recommended with the use of the Procedure Table shown.
🞅 Step 1 Draw the normal distribution curve and shade the area.
🞅 Step 2 Find the appropriate figure in the Procedure Table and follow the directions
given.
Find the area to the left of z = 1.99
Find the area to the right of z = -1.16
Find the area between z = 1.68 and z = 1.37
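The three area questions above can also be answered numerically instead of with a printed table. The following is a minimal sketch, assuming SciPy is available; the variable names are illustrative and not part of the slides.

```python
from scipy.stats import norm

# Area to the left of z = 1.99 is the cumulative probability at 1.99.
left_of_1_99 = norm.cdf(1.99)

# Area to the right of z = -1.16 is the complement of the cumulative probability.
right_of_neg_1_16 = 1 - norm.cdf(-1.16)

# Area between two z values is the difference of their cumulative probabilities.
between_1_37_and_1_68 = norm.cdf(1.68) - norm.cdf(1.37)

print(f"left of 1.99:          {left_of_1_99:.4f}")
print(f"right of -1.16:        {right_of_neg_1_16:.4f}")
print(f"between 1.37 and 1.68: {between_1_37_and_1_68:.4f}")
```

Values from a printed table may differ in the last digit because table entries are rounded.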
Find probabilities
Answers
🞅 0.1587
🞅 0.9335
🞅 0.3085
Find probability
Answers
Problem

🞅 It is known that IQ scores form a normal distribution with μ = 100 and σ =15.
Given this information, what is the probability of randomly selecting an individual
with an IQ score less than 120?

🞅 1. Transform the X values into z-scores.


🞅 2. Use the unit normal table to look up the proportions corresponding to the z-score
values.
Answer
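The two steps of the IQ problem can be sketched in code. This is one possible way to do it, assuming SciPy; `norm.cdf` plays the role of the unit normal table here.

```python
from scipy.stats import norm

mu, sigma = 100, 15   # IQ parameters given in the problem
x = 120

z = (x - mu) / sigma           # step 1: transform X into a z-score
p_below_120 = norm.cdf(z)      # step 2: proportion of the curve below z

print(f"z = {z:.2f}, p(X < 120) = {p_below_120:.4f}")
```

A printed table uses z rounded to 1.33, so its answer can differ slightly in the last digits.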
Question

🞅 The highway department conducted a study measuring driving speeds on a local
section of interstate highway. They found an average speed of μ = 58 miles per hour
with a standard deviation of σ = 10. The distribution was approximately normal.
🞅 Given this information, what proportion of the cars are traveling between 55 and 65
miles per hour? Using probability notation, we can express the problem as p(55 < X < 65) = ?
Answer
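A minimal numerical sketch of the driving-speed question, assuming SciPy: the proportion between two values is the difference of the two cumulative probabilities.

```python
from scipy.stats import norm

mu, sigma = 58, 10   # mean and standard deviation of driving speed (mph)

z_low = (55 - mu) / sigma    # -0.30
z_high = (65 - mu) / sigma   #  0.70

# p(55 < X < 65) is the area between the two z-scores.
p_between = norm.cdf(z_high) - norm.cdf(z_low)

print(f"z from {z_low:.2f} to {z_high:.2f}: p = {p_between:.4f}")
```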
Holiday Spending

🞅 A survey by the National Retail Federation found that women spend on average $146.21
for the Christmas holidays. Assume the standard deviation is $29.44. Find the
percentage of women who spend less than $160.00. Assume the variable is normally
distributed.
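One way to check the holiday-spending question numerically, assuming SciPy; passing `loc` and `scale` lets `norm.cdf` do the z-transformation internally.

```python
from scipy.stats import norm

mean_spend, sd_spend = 146.21, 29.44   # dollars, from the problem statement

# Proportion of women spending less than $160.00.
p_less_than_160 = norm.cdf(160.00, loc=mean_spend, scale=sd_spend)

print(f"{p_less_than_160:.1%} spend less than $160.00")
```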
Emergency Call Response Time

🞅 The American Automobile Association reports that the average time it takes to respond
to an emergency call is 25 minutes. Assume the variable is approximately normally
distributed and the standard deviation is 4.5 minutes. If 80 calls are randomly selected,
approximately how many will be responded to in less than 15 minutes?
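A short sketch of the emergency-call question, assuming SciPy: find the probability for one call, then multiply by the number of calls to get the expected count.

```python
from scipy.stats import norm

mu, sigma = 25, 4.5   # minutes
n_calls = 80

# Probability that a single call is answered in under 15 minutes.
p_under_15 = norm.cdf(15, loc=mu, scale=sigma)

# Expected number of such calls among the 80 selected.
expected_calls = n_calls * p_under_15

print(f"p = {p_under_15:.4f}, expected calls = {expected_calls:.2f}")
```

Rounding the expected count to the nearest whole call gives the approximate answer.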
Police Academy Qualifications

🞅 To qualify for a police academy, candidates must score in the top 10% on a general
abilities test. The test has a mean of 200 and a standard deviation of 20. Find the
lowest possible score to qualify. Assume the test scores are normally distributed.
Answer
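This problem runs in the opposite direction: the area is known and the score is wanted. A minimal sketch, assuming SciPy, uses the inverse of the cumulative distribution (`norm.ppf`) to find the z value that leaves 10% in the upper tail.

```python
import math
from scipy.stats import norm

mu, sigma = 200, 20

# The top 10% starts where 90% of the area lies to the left.
z_cutoff = norm.ppf(0.90)
cutoff = mu + z_cutoff * sigma

# Assuming whole-number scores, round up to the next integer.
lowest_qualifying_score = math.ceil(cutoff)

print(f"z = {z_cutoff:.2f}, cutoff = {cutoff:.2f}, lowest score = {lowest_qualifying_score}")
```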
Systolic Blood Pressure

🞅 For a medical study, a researcher wishes to select people in the middle 60% of the
population based on blood pressure. If the mean systolic blood pressure is 120 and
the standard deviation is 8, find the upper and lower readings that would qualify
people to participate in the study.
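The middle 60% leaves 20% in each tail, so the limits are the 20th and 80th percentiles. A sketch of that calculation, assuming SciPy:

```python
from scipy.stats import norm

mu, sigma = 120, 8   # mean and standard deviation of systolic blood pressure

# Middle 60% -> 20% of the area in each tail.
lower = norm.ppf(0.20, loc=mu, scale=sigma)
upper = norm.ppf(0.80, loc=mu, scale=sigma)

print(f"readings between {lower:.1f} and {upper:.1f} qualify")
```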
The Normal Approximation to the Binomial Distribution

🞅 A normal distribution is often used to solve problems that involve the binomial
distribution since when n is large (say, 100), the calculations are too difficult to do by
hand using the binomial distribution.
🞅 Recall that a binomial distribution has the following characteristics:
🞅 1. There must be a fixed number of trials.
🞅 2. The outcome of each trial must be independent.
🞅 3. Each experiment can have only two outcomes or outcomes that can be reduced
to two outcomes.
🞅 4. The probability of a success must remain the same for each trial.
Correction for continuity
🞅 In addition to the previous condition of np ≥ 5 and nq ≥ 5, a correction for continuity
may be used in the normal approximation.
🞅 A correction for continuity is a correction employed when a continuous distribution is
used to approximate a discrete distribution.
Reading While Driving

🞅 A magazine reported that 6% of American drivers read the newspaper while driving. If
300 drivers are selected at random, find the probability that exactly 25 say they read
the newspaper while driving.
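The problem above can be sketched as a normal approximation with a correction for continuity: "exactly 25" becomes the interval from 24.5 to 25.5. The exact binomial probability is computed alongside purely for comparison; that comparison is an addition here, not part of the slides. SciPy is assumed.

```python
from math import sqrt
from scipy.stats import binom, norm

n, p = 300, 0.06
q = 1 - p

# Conditions for using the normal approximation.
assert n * p >= 5 and n * q >= 5

mu = n * p                 # mean of the binomial
sigma = sqrt(n * p * q)    # standard deviation of the binomial

# Correction for continuity: P(X = 25) is approximated by P(24.5 < X < 25.5).
approx = norm.cdf(25.5, loc=mu, scale=sigma) - norm.cdf(24.5, loc=mu, scale=sigma)

# Exact binomial probability, for comparison only.
exact = binom.pmf(25, n, p)

print(f"mu = {mu:.1f}, sigma = {sigma:.3f}")
print(f"normal approximation: {approx:.4f}")
print(f"exact binomial:       {exact:.4f}")
```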
Taxonomy of Probability Distributions
Discrete probability distributions
🞅 Binomial distribution
🞅 Multinomial distribution
🞅 Poisson distribution
🞅 Hypergeometric distribution

Continuous probability distributions
🞅 Normal distribution
🞅 Standard normal distribution
🞅 Gamma distribution
🞅 Exponential distribution
🞅 Chi-square distribution
🞅 Lognormal distribution
🞅 Weibull distribution
Normal Distribution

🞅 What is so special about the normal probability distribution?
🞅 Why do so many data science and machine learning articles revolve around the
normal probability distribution?
Agenda

🞅 What is a probability distribution?
🞅 What does normal distribution mean?
🞅 Which variables exhibit a normal distribution?
🞅 How to check the distribution of your data set in Python?
🞅 How to make a variable normally distributed in Python?
🞅 Problems with normality
A Little Background First

🞅 The normal distribution is also known as the Gaussian distribution.
🞅 It is named after Carl Friedrich Gauss.
🞅 Simple predictive models are usually the most widely used models, because they can be explained and are well understood.
🞅 The same applies here: the normal distribution is simple, and its simplicity makes it extremely popular.
What Does Probability Distribution Mean?
Let me explain by laying out the building blocks first.
🞅 Consider the predictive models we might be interested in building in our
data science projects.
🞅 If we want to predict a variable accurately, the first task is to understand the underlying behaviour of our target variable.
🞅 We first need to determine the possible outcomes of the target variable and whether they are discrete (distinct values) or continuous (infinitely many values).
🞅 For the sake of simplicity, if we are estimating the behaviour of a die, the first step is to know that it can take any whole value from 1 to 6 (discrete).
🞅 The next step is to start assigning probabilities to the events (values). If a value cannot occur, it is assigned a probability of 0%.
The higher the probability, the more likely it is for the event to occur.

🞅 For instance, we can repeat an experiment a large number of times and note the values we observe for the variable.
🞅 We can then group the values into categories/buckets.
🞅 For each bucket, we record the number of times the variable took the value of that bucket.
🞅 For example, we can throw a die 10000 times and, as there are 6 possible values that a die can take, create 6 buckets.
🞅 We then record the number of occurrences for each value.
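The die experiment above can be simulated rather than carried out by hand. This is a minimal sketch, assuming NumPy and a fair die; the seed is arbitrary and only makes the run repeatable.

```python
import numpy as np

rng = np.random.default_rng(seed=0)        # arbitrary seed for repeatability
rolls = rng.integers(1, 7, size=10_000)    # 10000 throws of a fair die, values 1..6

# One bucket per face: counts[i] is how often face i + 1 came up.
counts = np.bincount(rolls, minlength=7)[1:]

# Dividing by the number of throws turns counts into estimated probabilities.
relative_freq = counts / counts.sum()

for face, (count, freq) in enumerate(zip(counts, relative_freq), start=1):
    print(f"face {face}: {count} times ({freq:.3f})")
```

Each relative frequency should land near 1/6, and together they sum to 1.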
Probability Distribution
🞅 We can plot these counts as a chart, and it will form a curve.
🞅 This curve is known as the probability distribution curve; it describes how likely the target variable is to take each value.
🞅 Once we understand how the values are distributed, we can start estimating the probabilities of events, including by means of formulas (known as probability distribution functions).
🞅 As a result, we can start understanding the variable's behaviour better.
🞅 A probability distribution is characterised by its moments, such as the mean, standard deviation, skewness and kurtosis.
🞅 If you add up all of the probabilities, they sum to 100%.
🞅 There are a large number of probability distributions, and the most widely used one is known as the “normal distribution”.
Let’s Now Move Onto Normal Probability Distribution
🞅 If a variable has a normal distribution, its probability distribution forms a symmetric bell-shaped curve and its mean, mode and median are equal.
🞅 This is an example of a bell-shaped normal distribution curve:
It is important to understand and estimate the probability distribution of your target variable.

What Is Normal Distribution?
🞅 A normal distribution is a distribution that is fully determined by two parameters: its mean and its standard deviation.
🞅 Mean: the average value of all the points in the sample.
🞅 Standard deviation: how much the data set deviates from the mean of the sample.
This characteristic makes the distribution extremely simple for statisticians to work with, and hence a variable that exhibits a normal distribution can often be forecast with higher accuracy.
Normal Distribution Is Simply … The Normal Behaviour That We Are Just So Familiar With
🞅 What is remarkable is that, once you examine their probability distributions, many variables in nature turn out to approximately follow a normal distribution.
🞅 The normal distribution is simple to explain. The reasons are:
🞅 The mean, mode and median of the distribution are equal.
🞅 We only need the mean and standard deviation to describe the entire distribution.
But how are so many variables approximately normally distributed? What is the logic behind it?

🞅 The idea revolves around a theorem: when you add together a large number of independent random variables, the distribution of their sum will be very close to normal, largely regardless of how the individual variables are distributed.
🞅 The height of a person is a random variable that is itself based on other random variables, such as the amount of nutrition a person consumes, the environment they live in, their genetics and so on, so the combined effect of these variables ends up being very close to normal.
🞅 This is known as the Central Limit Theorem.
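A small simulation can make the Central Limit Theorem concrete. The sketch below, assuming NumPy and SciPy, sums many draws from a strongly skewed exponential distribution and compares the skewness of a single draw with the skewness of the sums. The choice of distribution, the number of terms and the seed are all illustrative.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(seed=1)   # arbitrary seed for repeatability

# 10,000 repetitions; each one sums 200 independent exponential draws
# (each draw has mean 1 and is strongly right-skewed).
n_terms, n_repeats = 200, 10_000
draws = rng.exponential(scale=1.0, size=(n_repeats, n_terms))
sums = draws.sum(axis=1)

single_skew = skew(draws[:, 0])   # skewness of one exponential variable
sum_skew = skew(sums)             # skewness of the sum of 200 of them

print(f"skewness of a single draw: {single_skew:.2f}")
print(f"skewness of the sums:      {sum_skew:.2f}")
print(f"mean of the sums:          {sums.mean():.1f}")
```

A normal distribution has zero skewness, so the sums being far less skewed than the individual draws is the effect the theorem describes.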


This brings us to the core of the article:
🞅 We understood from the section above that a normal distribution arises as the sum of many random variables.
🞅 If we plot the normal distribution density function, its curve has the following characteristics:
Characteristics

🞅 The bell-shaped curve above has a mean of 100 and a standard deviation of 1.
🞅 The mean is the center of the curve. This is the highest point of the curve, as most of the points lie near the mean.
🞅 There are an equal number of points on each side of the center. The center of the curve has the highest density of points.
🞅 The total area under the curve is the total probability of all of the values that the variable can take.
🞅 The total area under the curve is therefore 100%.
Characteristics
🞅 Approximately 68.2% of all of the points are within 1 standard deviation of the mean (from -1 to +1).
🞅 About 95.5% of all of the points are within 2 standard deviations of the mean (from -2 to +2).
🞅 About 99.7% of all of the points are within 3 standard deviations of the mean (from -3 to +3).
🞅 This allows us to easily estimate how volatile a variable is and, given a confidence level, what its likely value is going to be.
🞅 For instance, in the bell-shaped curve above, there is a 68.2% chance that the value of the variable will be between 99 and 101.
🞅 Imagine the confidence you can now have when making future decisions with that information!
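The percentages above can be reproduced from the cumulative distribution function. A minimal sketch, assuming SciPy, using the curve with a mean of 100 and a standard deviation of 1 described earlier:

```python
from scipy.stats import norm

# Share of a normal distribution lying within k standard deviations of the mean.
for k in (1, 2, 3):
    within_k = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} standard deviation(s): {within_k:.2%}")

# The same rule on the original scale: mean 100, standard deviation 1.
mu, sigma = 100, 1
p_99_to_101 = norm.cdf(101, loc=mu, scale=sigma) - norm.cdf(99, loc=mu, scale=sigma)
print(f"p(99 < X < 101) = {p_99_to_101:.2%}")
```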
Normal Probability Distribution Function
🞅 The probability density function of the normal distribution is:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
🞅 The probability density function describes the relative likelihood of a continuous random variable taking a value near a given point; actual probabilities come from areas under the curve.

Normal distribution is a bell-shaped curve where mean = mode = median.

🞅 If you plot the probability distribution curve using its computed probability density function, the area under the curve for a given range gives the probability of the target variable being in that range.
🞅 This probability distribution curve is based on a probability distribution function, which itself is computed from a number of parameters such as the mean and standard deviation of the variable.
🞅 We could use this probability distribution function to find the relative chance of a random variable taking a value within a range. For instance, we could record the daily returns of a stock, group them into appropriate buckets and then find the probability of the stock making a 20–40% gain in the future.
🞅 The larger the standard deviation, the more volatility there is in the sample.
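To illustrate that the area under the density curve over a range is the probability of that range, the sketch below writes out the normal density by hand and integrates it numerically. It assumes NumPy and SciPy; `normal_pdf` is a hypothetical helper name and the parameter values are illustrative.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))


mu, sigma = 100.0, 1.0   # illustrative parameters

# Area under the density between 99 and 101, by numerical integration.
area, _ = quad(normal_pdf, 99.0, 101.0, args=(mu, sigma))

# The same probability from the cumulative distribution function.
from_cdf = norm.cdf(101.0, mu, sigma) - norm.cdf(99.0, mu, sigma)

print(f"density at the mean: {normal_pdf(mu, mu, sigma):.4f}")
print(f"area from 99 to 101: {area:.4f} (cdf difference: {from_cdf:.4f})")
```

The density at a single point is a height, not a probability; only the area over a range is.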
How Do I Find Feature Distribution In Python?

🞅 The simplest method I follow is to load all of the features in the data frame and then write this script:
🞅 Use the Python pandas library:
🞅 DataFrame.hist(bins=10)
🞅 # Make a histogram of each column of the DataFrame.
🞅 It shows us the empirical distributions of all of the numeric variables.
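A runnable version of this idea might look like the sketch below. It assumes pandas, NumPy and matplotlib are installed (pandas uses matplotlib for plotting); the data frame and its column names are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend, so no display is required

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2)   # arbitrary seed for repeatability

# Hypothetical features: one roughly bell-shaped, one right-skewed.
df = pd.DataFrame({
    "height_cm": rng.normal(loc=165, scale=7, size=1_000),
    "wait_minutes": rng.exponential(scale=10, size=1_000),
})

# One histogram per numeric column; returns an array of matplotlib Axes.
axes = df.hist(bins=10)

# Skewness near 0 suggests a symmetric shape; large values suggest a long tail.
print(df.skew())
```

In an interactive session the `matplotlib.use("Agg")` line can be dropped so the histograms appear on screen.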
What Does It Mean For A Variable To Have Normal Distribution?

🞅 What is even more fascinating is that once you add a large number of independent random variables with differing distributions together, your new variable will end up having an approximately normal distribution. This is essentially the Central Limit Theorem.
🞅 Variables that are normally distributed remain normally distributed under linear operations.
For instance, if A and B are two independent variables with normal distributions then:
🞅 A + B is normally distributed
🞅 aA + b is normally distributed for any constants a ≠ 0 and b
🞅 The product A x B, by contrast, is in general not normally distributed.
🞅 As a result, it is straightforward to forecast such a variable and find the probability of it falling within a range of values, because of the well-known probability distribution function.
What If The Sample Distribution Is Not Normal?

🞅 You can often transform the distribution of a feature so that it is closer to a normal distribution.
🞅 I have used a number of techniques to make a feature more normally distributed:
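1. Linear Transformation

🞅 Once we gather a sample for a variable, we can compute the Z-score by linearly transforming the sample:
🞅 Calculate the mean μ
🞅 Calculate the standard deviation σ
🞅 For each value x, compute Z using:
$$Z = \frac{x - \mu}{\sigma}$$
🞅 This rescales the sample to a mean of 0 and a standard deviation of 1; it does not change the shape of the distribution.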
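The three steps can be sketched with NumPy as follows. The sample values are made up, and the population standard deviation (NumPy's default, `ddof=0`) is an assumption; use `ddof=1` for the sample standard deviation instead.

```python
import numpy as np

sample = np.array([52.0, 61.0, 58.0, 70.0, 49.0, 66.0])   # made-up values

mean = sample.mean()               # step 1: calculate the mean
std = sample.std()                 # step 2: calculate the standard deviation (ddof=0)
z_scores = (sample - mean) / std   # step 3: Z = (x - mean) / std for each value

print(np.round(z_scores, 3))
```

After the transformation the values have mean 0 and standard deviation 1, but their relative spacing is unchanged.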
2. Using Box-Cox Transformation

🞅 You can use the SciPy package of Python to transform data towards a normal distribution (the input data must be strictly positive):
🞅 scipy.stats.boxcox(x, lmbda=None, alpha=None)
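A minimal usage sketch, assuming SciPy and NumPy. The skewed input is simulated from a log-normal distribution purely for illustration; Box-Cox requires strictly positive values.

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(seed=3)   # arbitrary seed for repeatability

# Simulated right-skewed, strictly positive feature.
x = rng.lognormal(mean=0.0, sigma=0.8, size=2_000)

# With lmbda=None, boxcox estimates lambda by maximum likelihood
# and returns the transformed data together with that estimate.
transformed, fitted_lambda = boxcox(x)

print(f"fitted lambda:        {fitted_lambda:.3f}")
print(f"skewness before:      {skew(x):.2f}")
print(f"skewness afterwards:  {skew(transformed):.2f}")
```

A fitted lambda close to 0 corresponds to a plain log transform, which is what log-normal data would be expected to need.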
3. Using Yeo-Johnson Transformation

🞅 Additionally, the Yeo-Johnson power transform can be used; unlike Box-Cox, it also accepts zero and negative values. Python’s scikit-learn provides the appropriate class:
🞅 sklearn.preprocessing.PowerTransformer(method='yeo-johnson', standardize=True, copy=True)
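A short usage sketch, assuming scikit-learn, NumPy and SciPy are installed. The feature is simulated and shifted so that it contains negative values, which is the case Yeo-Johnson handles and Box-Cox does not.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(seed=4)   # arbitrary seed for repeatability

# Simulated right-skewed feature that includes negative values.
feature = rng.exponential(scale=2.0, size=2_000) - 1.0
X = feature.reshape(-1, 1)            # scikit-learn expects a 2-D array

pt = PowerTransformer(method="yeo-johnson", standardize=True, copy=True)
X_t = pt.fit_transform(X)             # fits lambda, transforms, then standardizes

print(f"fitted lambda:       {pt.lambdas_[0]:.3f}")
print(f"skewness before:     {skew(feature):.2f}")
print(f"skewness afterwards: {skew(X_t[:, 0]):.2f}")
```

With `standardize=True` the output also has zero mean and unit variance.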
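Min-Max Normalization

🞅 Min-max normalization (often called feature scaling) performs a linear transformation on the original data: x' = (x - min) / (max - min).
🞅 This technique maps all of the scaled data into the range [0, 1]. Like the Z-score, it changes the scale of the data, not the shape of its distribution.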
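Min-max normalization can be sketched in a few lines of NumPy. The values are made up, and the code assumes the maximum is strictly greater than the minimum (otherwise the division is undefined).

```python
import numpy as np

values = np.array([12.0, 30.0, 18.0, 45.0, 27.0])   # made-up values

v_min, v_max = values.min(), values.max()

# Linear rescaling: the minimum maps to 0 and the maximum maps to 1.
scaled = (values - v_min) / (v_max - v_min)

print(scaled)
```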
Problems With Normality
🞅 As the normal distribution is simple and well understood, it is also overused in predictive projects.
🞅 Assuming normality has its own flaws.
🞅 For instance, we cannot assume that a stock price follows a normal distribution, as the price cannot be negative.
🞅 Therefore the stock price potentially follows a log-normal distribution, which ensures it is never below zero.
🞅 Returns, on the other hand, can be negative, so returns can follow a normal distribution.
Problems With Normality

🞅 It is not wise to assume that a variable follows a normal distribution without any analysis.
🞅 A variable can follow, for instance, a Poisson, Student's t or binomial distribution, and falsely assuming that it follows a normal distribution can lead to inaccurate results.
Problem

🞅 The population distribution of SAT scores is normal with a mean of μ = 500 and a
standard deviation of σ = 100. Given this information about the population and the
known proportions for a normal distribution, we can determine the probabilities
associated with specific samples. For example, what is the probability of randomly
selecting an individual from this population who has an SAT score greater than 700?
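A minimal sketch of the SAT question, assuming SciPy: convert 700 to a z-score, then take the area in the upper tail.

```python
from scipy.stats import norm

mu, sigma = 500, 100   # SAT population parameters from the problem

z = (700 - mu) / sigma            # 700 is 2 standard deviations above the mean
p_above_700 = 1 - norm.cdf(z)     # area to the right of z

print(f"z = {z:.2f}, p(X > 700) = {p_above_700:.4f}")
```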
