Distribution in DA/ML
SANGEETA SHAH BHARADWAJ
What role does Probability Distribution play?
We need to predict the outcome of a future event, e.g. whether a customer will churn, whether an
employee will attrite, or whether a transaction is fraudulent.
Examples of outcomes
Predicting customer churn at the individual customer level: there are only two possibilities, either
a customer will leave or will not leave.
So the sample space is S = {churn, no churn}, which is binary (yes or no), or
S = {low, medium, high} or {AAA, AA, A, BBB, B} (credit rating), which is discrete and takes a finite
number of values (also referred to as a discrete random variable), or
S = {X | X ∈ R}, where X can take any real value in a range
(also referred to as a continuous random variable)
The sample space S is the universal set that consists of all possible outcomes of an experiment.
What is your interest? What do you want
to analyse?
Do you want to study the probability of attrition for the entire set of employees, or only the
subset of employees whose probability of leaving the organization is more than 50% or 60%?
Only a subset.
So an event (E) is a subset of the sample space, and probability is usually calculated with respect
to an event.
When will senior management be more concerned?
When the percentage of employees whose probability of leaving the organization is more than
60% is 20%, or when it is 50%?
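As a small illustration of an event being a subset of the sample space, the sketch below picks out the employees whose predicted probability of leaving exceeds 60% and reports what share of the workforce they represent. The probabilities are made-up numbers used only for illustration; in practice they would come from a fitted attrition model.

```python
# Hypothetical predicted probabilities of leaving, one per employee (made-up data).
leave_prob = [0.10, 0.75, 0.62, 0.30, 0.55, 0.81, 0.20, 0.65, 0.45, 0.90]

THRESHOLD = 0.60  # the 60% cut-off used in the question above

# The event E: the subset of employees whose probability of leaving exceeds the threshold.
event = [p for p in leave_prob if p > THRESHOLD]

# Share of all employees that falls inside the event.
share_at_risk = len(event) / len(leave_prob)

print(f"{len(event)} of {len(leave_prob)} employees are above {THRESHOLD:.0%}")
print(f"share at risk = {share_at_risk:.0%}")
```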
Random Variables
Discrete random variables
Probability mass function (PMF): the probability that a random variable X takes a specific value,
e.g. the probability that the number of fraudulent transactions at an e-commerce platform is 10 is written as P(X = 10).
Cumulative distribution function (CDF): the probability that a random variable X takes a value
less than or equal to 10 is written as P(X ≤ 10).
In general, the CDF is the probability that a random variable X takes a
value less than or equal to a, and is written as F(a) = P(X ≤ a).
Binomial Distribution
For example, suppose it is known that 2% of all credit card transactions in a certain region are fraudulent. If there are 50
transactions per day in that region, we can use a Binomial Distribution Calculator to find the probability that more
than a certain number of fraudulent transactions occur in a given day:
P(X > 1 fraudulent transaction) = 0.26423
P(X > 2 fraudulent transactions) = 0.07843
P(X > 3 fraudulent transactions) = 0.01776
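The tail probabilities above can be reproduced without a calculator. The following is a minimal sketch using only the Python standard library; the helper names `binom_pmf` and `binom_tail` are illustrative choices, not part of any library.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for a binomial random variable with n trials and success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binom_tail(x, n, p):
    """P(X > x), computed as 1 - P(X <= x)."""
    return 1.0 - sum(binom_pmf(k, n, p) for k in range(x + 1))

n, p = 50, 0.02  # 50 transactions per day, 2% fraudulent
for x in (1, 2, 3):
    print(f"P(X > {x}) = {binom_tail(x, n, p):.5f}")
```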
Email companies use the binomial distribution to model the probability that a certain number of spam emails land in an
inbox per day.
For example, suppose it is known that 4% of all emails are spam. If an account receives 20 emails in a given day, we can
use a Binomial Distribution Calculator to find the probability that a certain number of those emails are spam:
P(X = 0 spam emails) = 0.44200
P(X = 1 spam email) = 0.36834
P(X = 2 spam emails) = 0.14580
Binomial Distribution
Retail stores use the binomial distribution to model the probability that they
receive a certain number of shopping returns each week.
For example, suppose it is known that 10% of all orders get returned at a certain store
each week. If there are 50 orders that week, we can use a Binomial Distribution Calculator
to find the probability that the store receives more than a certain number of returns that
week:
P(X > 5 returns) = 0.38388
P(X > 10 returns) = 0.00935
P(X > 15 returns) = 0.00002
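A different way to sanity-check such tail probabilities is simulation. The sketch below is a Monte Carlo estimate, not an exact calculation: it simulates many weeks of 50 orders, each returned independently with probability 10%, and counts how often the number of returns exceeds each threshold. The number of simulated weeks and the seed are arbitrary assumptions, and the estimates should land close to the exact binomial tail probabilities.

```python
import random

def simulate_week(n_orders, p_return, rng):
    """Number of returned orders in one simulated week."""
    return sum(1 for _ in range(n_orders) if rng.random() < p_return)

def estimate_tails(thresholds, n_orders=50, p_return=0.10, n_weeks=20_000, seed=0):
    """Estimate P(X > t) for each threshold t from the same simulated weeks."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    weeks = [simulate_week(n_orders, p_return, rng) for _ in range(n_weeks)]
    return {t: sum(1 for w in weeks if w > t) / n_weeks for t in thresholds}

estimates = estimate_tails((5, 10, 15))
for t, est in estimates.items():
    print(f"estimated P(X > {t} returns) = {est:.4f}")
```

Because this is a simulation, the last digits change with the seed and the number of simulated weeks; very rare events such as more than 15 returns need many more weeks to estimate reliably.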
Binomial Distribution
It is a discrete probability distribution.
A random variable X is said to have a binomial distribution when:
◦ Each trial has only two outcomes (success or failure)
◦ The objective is to find the probability of getting x successes out of n trials
◦ The probability of success is p, and thus the probability of failure is (1-p)
◦ The probability p is constant and does not change between trials
PMF
$$P(X = x) = \binom{n}{x} p^{x} (1-p)^{n-x}$$
CDF
$$P(X \le x) = \sum_{k=0}^{x} \binom{n}{k} p^{k} (1-p)^{n-k}$$
Where $\binom{n}{x} = \frac{n!}{x!\,(n-x)!}$
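The binomial PMF and CDF translate almost directly into code. Here is a minimal sketch with the standard library, checked against the spam example from earlier (n = 20 emails, p = 0.04); the function names are illustrative.

```python
from math import comb

def binom_pmf(x, n, p):
    """PMF: probability of exactly x successes in n trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binom_cdf(x, n, p):
    """CDF: probability of at most x successes, i.e. the PMF summed from k = 0 to x."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))

n, p = 20, 0.04  # 20 emails per day, 4% spam
for x in range(3):
    print(f"P(X = {x}) = {binom_pmf(x, n, p):.5f}")
print(f"P(X <= 2) = {binom_cdf(2, n, p):.5f}")
```

The CDF is just a running sum of the PMF, so P(X ≤ 2) equals P(X = 0) + P(X = 1) + P(X = 2).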
Binomial Distribution, n = 20 (figure: PMF for p = 0.1 (blue), p = 0.5 (green) and p = 0.8 (red))
n = 20, p = 0.1 (say 10% returns):
P(exactly 5 customers will return) = 0.03192
P(at most 5 customers will return) = 0.9887
P(exactly 20 customers will return) ≈ 0
Average number of customers who are likely to return = 2
Variance = 1.8
n = 20, p = 0.5 (say 50% returns):
P(exactly 5 customers will return) = 0.0148
P(at most 5 customers will return) = 0.0207
P(exactly 20 customers will return) ≈ 0
Average number of customers who are likely to return = 10
Variance = 5
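The averages and variances quoted for n = 20 follow from the standard binomial results mean = np and variance = np(1-p). The sketch below computes both directly from the PMF and compares them with the closed-form values; the function names are illustrative.

```python
from math import comb

def binom_pmf(x, n, p):
    """Probability of exactly x successes in n trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binom_mean_var(n, p):
    """Mean and variance computed from the definition, by summing over the PMF."""
    mean = sum(x * binom_pmf(x, n, p) for x in range(n + 1))
    var = sum((x - mean) ** 2 * binom_pmf(x, n, p) for x in range(n + 1))
    return mean, var

n = 20
for p in (0.1, 0.5):
    mean, var = binom_mean_var(n, p)
    print(f"p = {p}: mean = {mean:.2f} (np = {n * p:.2f}), "
          f"variance = {var:.2f} (np(1-p) = {n * p * (1 - p):.2f})")
```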
Poisson Distribution
Examples
Number of order cancellations by customers at an e-commerce website in a day
Number of customer complaints at call centers in a day
Characteristics of distribution
Events are independent of each other. The occurrence of one event does not affect the probability that another event will
occur
The average rate (events per time period) is constant
Two events cannot occur at the same time
The Poisson Process is the model for describing randomly occurring events and
by itself, isn’t that useful. We need the Poisson Distribution to do interesting
things like finding the probability of a number of events in a time period
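The Probability Mass Function (PMF) of Poisson Distribution
The PMF gives the probability of observing exactly k events in a time period when the average rate is λ events per period:
$$P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots$$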
For example, suppose a given website receives an average of 20 visitors per hour. We can use the
Poisson distribution calculator to find the probability that the website receives more than a certain number
of visitors in a given hour:
P(X > 25 visitors) = 0.11218
P(X > 30 visitors) = 0.01347
P(X > 35 visitors) = 0.00080
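The visitor probabilities above can be recomputed from the standard Poisson PMF, P(X = k) = λ^k e^(-λ) / k!. This is a minimal sketch with the standard library; the function names are illustrative.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate is lam per period."""
    return lam**k * exp(-lam) / factorial(k)

def poisson_tail(k, lam):
    """P(X > k), computed as 1 - P(X <= k)."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(k + 1))

lam = 20  # average of 20 visitors per hour
for k in (25, 30, 35):
    print(f"P(X > {k} visitors) = {poisson_tail(k, lam):.5f}")
```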
Binomial vs Poisson Distribution
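The Binomial and Poisson distributions
are both discrete probability
distributions. When the number of trials n is large and the
success probability p is small, the Binomial(n, p) distribution
is very close to a Poisson distribution with λ = np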
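One well-known case where the two distributions nearly coincide is a large number of trials n with a small success probability p, using λ = np for the Poisson. The sketch below measures the largest gap between the two PMFs for one such case and for a case where p is not small. The parameter values are assumptions chosen for illustration.

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    """Binomial probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """Poisson probability of exactly k events at average rate lam."""
    return lam**k * exp(-lam) / factorial(k)

def max_pmf_gap(n, p, k_max):
    """Largest absolute difference between the two PMFs over k = 0..k_max."""
    lam = n * p  # match the Poisson mean to the binomial mean
    return max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(k_max + 1))

gap_rare = max_pmf_gap(n=1000, p=0.02, k_max=60)   # large n, small p
gap_common = max_pmf_gap(n=20, p=0.5, k_max=20)    # small n, large p

print(f"n=1000, p=0.02: max PMF gap = {gap_rare:.4f}")
print(f"n=20,   p=0.5 : max PMF gap = {gap_common:.4f}")
```

The first gap is much smaller than the second, which is why the Poisson is commonly used as a stand-in for the binomial when events are rare.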
Exponential Distribution
Why did we have to invent Exponential Distribution?
To predict the amount of waiting time until the next event (i.e., success,
failure, arrival, etc.)
For example, we want to predict the following:
The amount of time until the customer finishes browsing and actually
purchases something in your store (success)
The amount of time until the hardware on AWS EC2 fails (failure)
The amount of time you need to wait until the bus arrives (arrival)
Exponential Distribution
A random variable X is said to have
an exponential distribution with parameter λ > 0, also called the
rate, if its PDF is:
$$f(x) = \begin{cases} \lambda e^{-\lambda x}, & x \ge 0 \\ 0, & x < 0 \end{cases}$$
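To make the waiting-time interpretation concrete, the sketch below uses the exponential PDF together with its standard CDF, P(X ≤ x) = 1 - e^(-λx), and mean 1/λ. The bus scenario and its rate (one bus every 10 minutes on average) are hypothetical numbers chosen only for illustration.

```python
from math import exp

def expon_pdf(x, rate):
    """Density of the exponential distribution: rate * e^(-rate * x) for x >= 0."""
    return rate * exp(-rate * x) if x >= 0 else 0.0

def expon_cdf(x, rate):
    """P(X <= x): probability that the wait is at most x."""
    return 1.0 - exp(-rate * x) if x >= 0 else 0.0

rate = 1 / 10  # hypothetical: on average one bus every 10 minutes

print(f"P(wait <= 5 min)  = {expon_cdf(5, rate):.4f}")
print(f"P(wait > 15 min)  = {1 - expon_cdf(15, rate):.4f}")
print(f"mean waiting time = {1 / rate:.1f} min")
```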
Researchers often need to test whether two random variables are independent.
When analyzing cross-tabulations of survey responses, which show the frequency and
percentage of responses to a question by different segments or categories of respondents
(gender, occupation, level of education, etc.), the Chi-Square test tells researchers whether or not
there is a statistically significant difference in how a given question was answered by the
different segments or categories.
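As a rough sketch of how the Chi-Square test of independence works on a cross-tabulation, the code below computes Pearson's chi-square statistic (without any continuity correction) for a small 2×2 table. The survey counts are made up for illustration, and 3.841 is the standard critical value for 1 degree of freedom at the 5% significance level.

```python
def chi_square_statistic(observed):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, dof

# Hypothetical cross-tabulation: rows = two respondent segments, columns = answered yes / no.
observed = [[30, 20],
            [20, 30]]

stat, dof = chi_square_statistic(observed)
CRITICAL_5PCT_DF1 = 3.841  # critical value for 1 degree of freedom at the 5% level

print(f"chi-square = {stat:.3f}, degrees of freedom = {dof}")
print("significant difference" if stat > CRITICAL_5PCT_DF1 else "no significant difference")
```

With these made-up counts the statistic exceeds the critical value, so the two segments would be judged to have answered differently at the 5% level.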
Questions?