
Role of Probability Distribution in DA/ML
SANGEETA SHAH BHARADWAJ
What role does Probability Distribution play?

We need to predict the outcome of a future event, e.g. whether a customer will churn, whether an employee will attrite, or whether a transaction is fraudulent.

Such outcomes cannot be predicted with certainty.

An understanding of probability distributions helps us quantify that uncertainty and improves such predictions.


What is sample space?

Examples of outcomes
Predicting customer churn at the individual customer level: there are only two possibilities, either the customer will leave or the customer will not leave.
So the sample space is S = {churn, no churn}, which is binary (yes or no), or
S = {low, medium, high} or {AAA, AA, A, BBB, B} (credit rating), which is discrete and takes a finite number of values (the outcome is then described by a discrete random variable), or
S = {X | X is a real number in some interval} (the outcome is then described by a continuous random variable).
A sample space S is the universal set that consists of all possible outcomes of an experiment.
What is your interest? What do you want to analyse?
Do you want to study the probability of attrition for the entire set of employees, or only for the subset of employees whose probability of leaving the organization is more than 50% or 60%?

Only a subset.

So an event (E) is a subset of the sample space, and probability is usually calculated with respect to an event.
When will senior management be more concerned?
When the percentage of employees whose probability of leaving the organization is more than 60% is 20%, or when it is 50%?
Random Variables
Discrete random variables
Probability mass function (PMF): the probability that a random variable X takes a specific value. E.g. the probability that the number of fraudulent transactions at an e-commerce platform is 10 is written as P(X = 10).
Cumulative distribution function (CDF): the probability that a random variable X takes a value less than or equal to a given value, e.g. 10, written as P(X ≤ 10).

Continuous random variables

Probability density function (PDF): f(x) describes the probability that a random variable X takes a value in a small neighborhood of x, i.e. P(x ≤ X ≤ x + δx) ≈ f(x) δx.

Cumulative distribution function (CDF): the probability that a random variable X takes a value less than or equal to a, written as
$F(a) = P(X \le a) = \int_{-\infty}^{a} f(x)\,dx$
Binomial Distribution
For example, suppose it is known that 2% of all credit card transactions in a certain region are fraudulent. If there are 50 transactions per day in that region, we can use a Binomial Distribution Calculator to find the probability that more than a certain number of fraudulent transactions occur in a given day:
P(X > 1 fraudulent transaction) = 0.26423
P(X > 2 fraudulent transactions) = 0.07843
P(X > 3 fraudulent transactions) = 0.01776
Email companies use the binomial distribution to model the probability that a certain number of spam emails land in an
inbox per day.
For example, suppose it is known that 4% of all emails are spam. If an account receives 20 emails in a given day, we can
use a Binomial Distribution Calculator to find the probability that a certain number of those emails are spam:
P(X = 0 spam emails) = 0.44200
P(X = 1 spam email) = 0.36834
P(X = 2 spam emails) = 0.14580
Binomial Distribution
Retail stores use the binomial distribution to model the probability that they receive a certain number of shopping returns each week.
For example, suppose it is known that 10% of all orders get returned at a certain store each week. If there are 50 orders that week, we can use a Binomial Distribution Calculator to find the probability that the store receives more than a certain number of returns that week:
• P(X > 5 returns) = 0.38388
• P(X > 10 returns) = 0.00935
• P(X > 15 returns) = 0.00002
Binomial Distribution
It is a discrete probability distribution.
A random variable X is said to have a binomial distribution when:
◦ Each trial has only two outcomes (success or failure)
◦ The objective is to find the probability of getting x successes out of n trials
◦ The probability of success is p and thus the probability of failure is (1 − p)
◦ The probability p is constant and does not change between trials, and the trials are independent

PMF
$P(X = x) = \binom{n}{x} p^{x} (1-p)^{n-x}$

CDF
$P(X \le x) = \sum_{k=0}^{x} \binom{n}{k} p^{k} (1-p)^{n-k}$

Where $\binom{n}{x} = \frac{n!}{x!\,(n-x)!}$
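The binomial PMF and CDF can be checked numerically without a calculator. The following is a minimal sketch using only the Python standard library; the function names `binom_pmf`, `binom_cdf` and `binom_tail` are illustrative choices, not part of the slides.

```python
from math import comb


def binom_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binom_cdf(x, n, p):
    """P(X <= x): sum of the PMF for k = 0 .. x."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))


def binom_tail(x, n, p):
    """P(X > x) = 1 - P(X <= x)."""
    return 1.0 - binom_cdf(x, n, p)


# Spam example: 4% spam rate, 20 emails in a day
for k in range(3):
    print(f"P(X = {k} spam emails) = {binom_pmf(k, 20, 0.04):.5f}")

# Fraud example: 2% fraud rate, 50 transactions in a day
for k in range(1, 4):
    print(f"P(X > {k} fraudulent) = {binom_tail(k, 50, 0.02):.5f}")
```

The same helpers cover the retail example: `binom_tail(5, 50, 0.1)` is the probability of more than 5 returns out of 50 orders.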
Binomial Distribution
[Figure: binomial PMF for N = 20 with p = 0.1 (blue), p = 0.5 (green) and p = 0.8 (red)]

N = 20, p = 0.1 (say 10% returns)
P(exactly 5 customers will return) = 0.03192
P(at most 5 customers will return) = 0.9887
P(exactly 20 customers will return) ≈ 0
Mean and Variance of Binomial distribution
Mean of binomial distribution B(n, p) = np
Variance is np(1-p)
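The formulas np and np(1 − p) can be confirmed by computing the mean and variance directly from the PMF. This is a small sketch under the assumption N = 20 with p = 0.1 and p = 0.5; the helper names are illustrative.

```python
from math import comb


def binom_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binom_mean_var(n, p):
    """Mean E[X] and variance E[X^2] - E[X]^2, summed over the PMF."""
    mean = sum(x * binom_pmf(x, n, p) for x in range(n + 1))
    second_moment = sum(x * x * binom_pmf(x, n, p) for x in range(n + 1))
    return mean, second_moment - mean**2


for p in (0.1, 0.5):
    mean, var = binom_mean_var(20, p)
    print(f"p = {p}: mean = {mean:.4f}, variance = {var:.4f}")
```

The values summed from the PMF agree with np and np(1 − p) up to floating-point rounding.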

                                            N = 20, p = 0.1     N = 20, p = 0.5
                                            (10% returns)       (50% returns)
P(exactly 5 customers will return)          0.03192             0.0148
P(at most 5 customers will return)          0.9887              0.0207
P(exactly 20 customers will return)         ≈ 0                 ≈ 0
Avg no. of customers likely to return       2                   10
Variance                                    1.8                 5
Poisson Distribution
Examples
No. of order cancellations by customers on an e-commerce website in a day
No. of customer complaints at a call center in a day
Characteristics of the distribution
Events are independent of each other: the occurrence of one event does not affect the probability that another event will occur
The average rate (events per time period) is constant
Two events cannot occur at the same time

The Poisson process is the model for describing randomly occurring events; by itself it isn't that useful. We need the Poisson distribution to do interesting things, like finding the probability of a given number of events in a time period.
The Probability Mass Function (PMF) of Poisson Distribution
The PMF
$P(X = k) = \frac{e^{-\lambda} \lambda^{k}}{k!}, \quad k = 0, 1, 2, \ldots$

Where λ is the rate of occurrence of events per unit of time

Mean of Poisson distribution = λ

Standard deviation = $\sqrt{\lambda}$
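The Poisson PMF is easy to evaluate directly. Below is a minimal sketch with the standard library; the function names are illustrative, and the rate λ = 10 events per hour is just an example value.

```python
from math import exp, factorial, sqrt


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)


def poisson_tail(k, lam):
    """P(X > k) = 1 - sum of the PMF for i = 0 .. k."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(k + 1))


lam = 10.0  # example rate: 10 events per hour
for k in range(4):
    print(f"P(X = {k}) = {poisson_pmf(k, lam):.5f}")
print(f"mean = {lam}, standard deviation = {sqrt(lam):.4f}")
```

`poisson_tail(25, 20.0)` gives the probability of more than 25 events when the average rate is 20 per period.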
More examples
For example, suppose a given call center receives an average of 10 calls per hour. We can use a Poisson distribution calculator to find the probability that the call center receives 0, 1, 2, 3 … calls in a given hour:
P(X = 0 calls) = 0.00005
P(X = 1 call) = 0.00045
P(X = 2 calls) = 0.00227
P(X = 3 calls) = 0.00757

For example, suppose a given website receives an average of 20 visitors per hour. We can use the 
Poisson distribution calculator to find the probability that the website receives more than a certain number
of visitors in a given hour:
P(X > 25 visitors) = 0.11218
P(X > 30 visitors) = 0.01347
P(X > 35 visitors) = 0.00080
Binomial vs Poisson Distribution
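The Binomial and Poisson distributions are both discrete probability distributions. In some circumstances the distributions are very similar: when n is large and p is small, a binomial distribution B(n, p) is well approximated by a Poisson distribution with λ = np.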
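How similar the two distributions are can be measured by comparing their PMFs point by point. This sketch assumes illustrative values (n = 1000, p = 0.01, so np = 10) and uses a Poisson distribution with λ = np as the comparison; `max_abs_diff` is a hypothetical helper name.

```python
from math import comb, exp, factorial


def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)


def max_abs_diff(n, p, k_max):
    """Largest |binomial PMF - Poisson PMF| over k = 0 .. k_max, with lam = n * p."""
    lam = n * p
    return max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(k_max + 1))


# Large n, small p: the two PMFs nearly coincide
print(max_abs_diff(1000, 0.01, 30))
# Small n, large p (same np = 10): the PMFs differ noticeably
print(max_abs_diff(20, 0.5, 20))
```

Both cases have the same mean np = 10, so the gap comes from the size of p rather than from the mean.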
Exponential Distribution
Why do we need the exponential distribution?
To predict the amount of waiting time until the next event (i.e., success, failure, arrival, etc.)
For example, we want to predict the following:
The amount of time until the customer finishes browsing and actually
purchases something in your store (success)
The amount of time until the hardware on AWS EC2 fails (failure)
The amount of time you need to wait until the bus arrives (arrival)
Exponential Distribution
A random variable X is said to have an exponential distribution with parameter λ > 0, also called the rate, if its PDF is
f(x) = λe^(−λx) for x ≥ 0, and f(x) = 0 otherwise.

Mean and variance of a random variable X following an exponential distribution:
Mean -> E(X) = 1/λ
Variance -> Var(X) = (1/λ)²
The greater the rate, the faster the curve drops; the lower the rate, the flatter the curve.
Probability calculations - Exponential distribution
P{X ≤ x} = 1 − e^(−λx) corresponds to the area under the density curve to the left of x.
P{X > x} = e^(−λx) corresponds to the area under the density curve to the right of x.
P{x1 < X ≤ x2} = e^(−λx1) − e^(−λx2) corresponds to the area under the density curve between x1 and x2.
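These three area formulas translate directly into code. The sketch below assumes a hypothetical bus that arrives at a rate of λ = 0.1 per minute (mean wait 1/λ = 10 minutes); the function names are illustrative.

```python
from math import exp


def expon_cdf(x, lam):
    """P(X <= x): area to the left of x."""
    return 1.0 - exp(-lam * x) if x >= 0 else 0.0


def expon_sf(x, lam):
    """P(X > x): area to the right of x."""
    return exp(-lam * x) if x >= 0 else 1.0


def expon_between(x1, x2, lam):
    """P(x1 < X <= x2) for 0 <= x1 <= x2: area between x1 and x2."""
    return exp(-lam * x1) - exp(-lam * x2)


lam = 0.1  # hypothetical rate: 0.1 arrivals per minute
print(f"P(wait <= 10 min)     = {expon_cdf(10, lam):.5f}")
print(f"P(wait > 10 min)      = {expon_sf(10, lam):.5f}")
print(f"P(5 < wait <= 15 min) = {expon_between(5, 15, lam):.5f}")
```

The left and right areas always sum to 1, and the area between x1 and x2 equals the difference of the two CDF values.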
Normal Distribution
The normal distribution is a core concept in statistics, the backbone of data science. The function is
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)$

The standard normal distribution is a special case of the normal distribution with 𝜇 = 0 and 𝜎 = 1. Any normal distribution can be converted into the standard normal distribution using z = (x − 𝜇)/𝜎.
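The conversion to the standard normal distribution can be illustrated with Python's built-in `statistics.NormalDist`. The values 𝜇 = 100, 𝜎 = 15 and x = 130 below are hypothetical, chosen only to show that the probability is unchanged after standardizing.

```python
from math import exp, pi, sqrt
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))


def z_score(x, mu, sigma):
    """Standardize x: number of standard deviations away from the mean."""
    return (x - mu) / sigma


mu, sigma, x = 100.0, 15.0, 130.0  # hypothetical values
z = z_score(x, mu, sigma)
print(f"z = {z}")
# P(X <= x) under N(mu, sigma) equals P(Z <= z) under the standard normal
print(NormalDist(mu, sigma).cdf(x))
print(NormalDist().cdf(z))
```

Because the two CDF values coincide, a single standard normal table (or function) is enough for any normal distribution.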
Central limit theorem
Given a dataset with an unknown distribution (it could be uniform, binomial or something else entirely, as long as its variance is finite), the distribution of the sample means will approximate the normal distribution.

The samples should be sufficiently large.

The distribution of sample means, calculated from repeated sampling, tends to normality as the sample size gets larger.
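A short simulation makes the theorem concrete. The sketch below is an illustration under assumed settings: it draws samples from a skewed exponential distribution with mean 1 and standard deviation 1, uses a sample size of 50 and 2000 repetitions, and fixes the random seed for repeatability.

```python
import random
from math import sqrt
from statistics import mean, stdev


def sample_means(n, repetitions, seed=0):
    """Means of `repetitions` samples of size n from an exponential(1) population."""
    rng = random.Random(seed)
    return [
        mean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(repetitions)
    ]


means = sample_means(n=50, repetitions=2000)
m, s = mean(means), stdev(means)
# Share of sample means within one standard deviation of their centre;
# for a normal distribution this share is about 68%.
within_one_sd = sum(abs(x - m) <= s for x in means) / len(means)

print(f"mean of sample means  = {m:.4f} (population mean = 1)")
print(f"stdev of sample means = {s:.4f} (sigma/sqrt(n) = {1 / sqrt(50):.4f})")
print(f"share within 1 sd     = {within_one_sd:.3f}")
```

Even though the population is strongly skewed, the sample means cluster around the population mean with spread close to σ/√n.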
Hypothesis Testing
Hypothesis testing is a form of statistical inference that uses data from a sample to draw conclusions about a population parameter or a population probability distribution.
First, a tentative assumption is made about the parameter or distribution. This assumption is called the null hypothesis and is denoted by H0. An alternative hypothesis (denoted Ha), which is the opposite of what is stated in the null hypothesis, is then defined.
The hypothesis-testing procedure involves using sample data to determine whether or not H0 can be rejected. If H0 is rejected, the statistical conclusion is that the alternative hypothesis Ha is true.
Type I and II error
Ideally, the hypothesis-testing procedure leads to the acceptance of H0 when H0 is true and the rejection of H0 when H0 is false.
Unfortunately, since hypothesis tests are based on sample information, the possibility of errors must be considered.
A type I error corresponds to rejecting H0 when H0 is actually true, and a type II error corresponds to accepting H0 when H0 is false.
The probability of making a type I error is denoted by α, and the probability of making a type II error is denoted by β.
Level of significance
In using the hypothesis-testing procedure to determine if the null hypothesis should be rejected, we need to specify the maximum allowable probability of making a type I error, called the level of significance for the test.
Common choices for the level of significance are α = 0.05 and α = 0.01.
The p-value is often called the observed level of significance for the test and is compared with α: H0 is rejected when the p-value is less than or equal to α.
Although most applications of hypothesis testing control the probability of making a type I error, they do not always control the probability of making a type II error. A graph known as an operating-characteristic curve can be constructed to show how changes in the sample size affect the probability of making a type II error.
Parametric vs non parametric
A hypothesis test can be performed on parameters of one or more populations. In addition to the population mean, hypothesis-testing procedures are available for population parameters such as proportions, variances, standard deviations, and medians.
Hypothesis tests are also conducted in regression and correlation analysis to determine whether the regression relationship and the correlation coefficient are statistically significant.
Tests that do not assume a particular form for the population distribution are called non-parametric hypothesis tests.
Z-test
Z-test is used when
We need to test the value of the population mean, given that the population variance is known

The population is normally distributed and the population variance is known

The sample size is large (n > 30)

Z = (x̄ – μ) / (σ/√n) (example: pages 82-83 of the textbook)
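A one-sample Z-test can be sketched in a few lines. The numbers below (sample mean 52, hypothesized mean 50, known σ = 10, n = 100) are hypothetical and are not the textbook example; the function name `z_test` is illustrative. The p-value shown is two-sided.

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma, n):
    """Z statistic and two-sided p-value for H0: population mean = mu0."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p = z_test(sample_mean=52.0, mu0=50.0, sigma=10.0, n=100)  # hypothetical data
alpha = 0.05
print(f"Z = {z:.3f}, p-value = {p:.4f}")
print("reject H0" if p <= alpha else "fail to reject H0")
```

For a one-sided alternative, the p-value would use only one tail of the standard normal distribution instead of doubling it.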


One-Sample t-Test
t-tests are more appropriate when dealing with problems with a limited sample size (i.e., n < 30) and when the population standard deviation is not known.
t = (x̄ – μ) / (s/√n)
Where,
x̄ is the sample mean
μ is the hypothesized population mean
s is the sample standard deviation
n is the size of the given sample
Example: page 85 of the textbook
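The t statistic can be computed with the standard library alone. The data and the hypothesized mean below are hypothetical (not the textbook example), and `one_sample_t` is an illustrative name. Turning the statistic into a p-value needs the t distribution with n − 1 degrees of freedom, which this sketch leaves to a t-table or a statistics package.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(data, mu0):
    """t statistic and degrees of freedom for H0: population mean = mu0."""
    n = len(data)
    x_bar = mean(data)
    s = stdev(data)  # sample standard deviation (divides by n - 1)
    t = (x_bar - mu0) / (s / sqrt(n))
    return t, n - 1


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
t, df = one_sample_t(data, mu0=4)
print(f"t = {t:.4f} with {df} degrees of freedom")
# From a standard t-table, the two-sided critical value at alpha = 0.05
# for 7 degrees of freedom is about 2.365.
print("reject H0" if abs(t) > 2.365 else "fail to reject H0")
```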
Two sample t-Test
A two-sample t-test is used to test the difference between two population means when the population standard deviations are unknown.
Example: pages 86-87

Paired sample t-Test


Used to see the impact of an intervention, i.e. whether the difference in a parameter value is statistically significant before and after the intervention, or between two different types of intervention.
Chi-Square Goodness of Fit Test
A chi-square (χ²) statistic is used to test how well a model compares with actually observed data.

It is used to estimate how closely an observed distribution corresponds to a predicted distribution. This is referred to as a measure of 'goodness of fit'.

It is also used to test whether two random variables are independent.
When analyzing cross-tabulations of survey responses, which show the frequency and percentage of responses to questions by different segments or categories of respondents (gender, occupation, level of education, etc.), the chi-square test tells researchers whether or not there is a statistically significant difference in how a given question was answered by the different segments or categories.
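The goodness-of-fit statistic is the sum of (observed − expected)² / expected over all categories. The sketch below uses hypothetical counts for three categories and a hypothetical predicted split of 25% / 50% / 25%; `chi_square_gof` is an illustrative name.

```python
def chi_square_gof(observed, expected_probs):
    """Chi-square goodness-of-fit statistic and degrees of freedom."""
    total = sum(observed)
    expected = [total * p for p in expected_probs]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, len(observed) - 1


observed = [30, 50, 20]             # hypothetical observed counts
expected_probs = [0.25, 0.50, 0.25]  # hypothetical predicted distribution
chi2, df = chi_square_gof(observed, expected_probs)
print(f"chi-square = {chi2:.3f} with {df} degrees of freedom")
# From a standard chi-square table, the critical value at alpha = 0.05
# for 2 degrees of freedom is about 5.991.
print("reject H0" if chi2 > 5.991 else "fail to reject H0")
```

A statistic below the critical value means the observed counts are consistent with the predicted distribution at the chosen level of significance.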
Questions?
