
Inferential Statistics

▪ What is inferential
Statistics?
▪ Making inferences
about populations
based on samples
Descriptive Vs Inferential Statistics

▪ Descriptive Statistics refers to
▪ Summary/Description of sample only (not population)

• Inferential Statistics refers to
– Description of population using sample
• Hypothesis Testing: Testing of statements about population based on sample characteristics
– Z Test
– T Test
– Chi-square Test
• Predictions
– Regression
– Classification
Some examples of inferential statistics

• Given the IQ scores of a sample of men and women, are the IQ levels of men and women the same?
• Given a crime data set (sample), check if Black and white people are equally likely to commit crime
Inferential Statistics using
Probability Density Functions(PDF)
• Box-plot vs Normal Distribution
Inferential Statistics using
Probability Density Functions(PDF)
Answers to the following questions are statistical inference about the population
• Will a client buy the product, or not?
• Will the medicine help the patient to recover, or not?
• Will the online ad be clicked on, or not?
• Modeling:
– Outcome of experiment: true or false
– Use Bernoulli distribution
Bernoulli Distribution:
$P(x) = p^{x} q^{1-x}$, $x \in \{0, 1\}$
where p is the probability of x=1, q=(1-p)
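As a minimal sketch, the Bernoulli probabilities can be evaluated directly from the formula and compared with `scipy.stats.bernoulli`. The click probability `p = 0.3` is an assumed value chosen only for illustration; it does not come from the slides.

```python
from scipy.stats import bernoulli

p = 0.3      # assumed probability that the ad is clicked (x = 1)
q = 1 - p    # probability that it is not clicked (x = 0)

# P(x) = p^x * q^(1-x) for x in {0, 1}
manual = {x: p**x * q**(1 - x) for x in (0, 1)}
library = {x: float(bernoulli.pmf(x, p)) for x in (0, 1)}

print("from the formula:", manual)
print("from scipy:      ", library)
```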
Inferential Statistics using
Probability Density Functions(PDF)
Suppose the questions about populations, when n trials are performed, are
• What is the probability that x clients buy a product out of n clients who visited a shop?
• Will the medicine help x patients out of n patients to recover?
• Will the online ad be clicked by x persons out of n persons?
• Modeling:
– Outcome of experiment is true or false, and n trials are performed
– Use Binomial Distribution:
$P(x) = \binom{n}{x} p^{x} q^{n-x}$
where p is the probability of success, q=1-p
Binomial Distribution

To find the probability of getting x heads from 10 trials:
from scipy.stats import binom
import matplotlib.pyplot as plt
num_trials = 10
heads_probability = .5
# P(x heads) for x = 0..10
probs = [binom.pmf(i, num_trials, heads_probability) for i in range(11)]
plt.stem(list(range(11)), probs)
plt.show()
Poisson Distribution
Suppose the questions about populations are
• What is the probability of k road accidents happening in Kandigai in a day?
• What is the probability that k clients make a purchase in your online store every X minutes?
• What is the probability that every Yth product coming off the assembly line is defective?
• What is the probability that x users access your ML model deployed in the cloud?
When to model using the Poisson distribution:
• Events corresponding to large values of the random variable occur rarely
• Experiment outcome: 0,1,2,3,….
Poisson Distribution:
$P(x) = \frac{\lambda^{x} e^{-\lambda}}{x!}$, where λ is the average number of events per interval
Note that $x! \gg \lambda^{x}$ for very large x
Poisson Distribution

import matplotlib.pyplot as plt
from scipy.stats import poisson
rate = 3.3
# P(k events) for k = 0..14
probs = [poisson.pmf(i, rate) for i in range(15)]
plt.stem(list(range(15)), probs)
plt.show()
Exponential Distribution
Suppose the questions about populations are
• What is the probability that two consecutive road accidents in Kandigai happen within x minutes?
• What is the probability that two consecutive clients make a purchase in your online store within X minutes?
• What is the probability that the time between two consecutive defective products coming off the assembly line is x minutes?
• What is the probability that the time between two consecutive users accessing your ML model deployed in the cloud is x seconds?

Exponential Distribution: $f(x) = \lambda e^{-\lambda x}$, $x \ge 0$
● Here λ > 0 is the parameter of the distribution, often called the rate parameter. It is equal to 1/μ
● The exponential distribution is often concerned with the amount of time until some specific event occurs.
● For example, the amount of time (beginning now) until an earthquake occurs has an exponential distribution.
● Other examples include the length, in minutes, of long distance business telephone calls, and the amount of time, in months, a car battery lasts.
● It can be shown, too, that the value of the change that you have in your pocket or purse approximately follows an exponential distribution.
Let X = amount of time (in minutes) a postal clerk spends with his or her customer. The time is known to have
an exponential distribution with the average amount of time equal to four minutes.

It is given that μ = 4 minutes. To do any calculations, you must know λ, the decay parameter. λ = 1/μ .
Therefore, λ = 1/4 = 0.25

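With the density $f(x) = \lambda e^{-\lambda x}$, for example, $f(5) = 0.25\, e^{-0.25 \cdot 5} \approx 0.072$. This is the value of the density at x = 5, i.e. for the postal clerk spending five minutes with a customer.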
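The postal clerk numbers can be checked with a short sketch. It assumes the standard exponential density λ·e^(−λx) and uses `scipy.stats.expon`, which is parameterised by `scale = 1/λ = μ`. The cumulative probability P(X < 5) is an extra quantity computed here for illustration; it is not part of the original example.

```python
import math
from scipy.stats import expon

mu = 4.0        # average service time in minutes (from the example)
lam = 1 / mu    # rate (decay) parameter, 0.25

# Density at x = 5 minutes: f(5) = lam * exp(-lam * 5)
density_at_5 = lam * math.exp(-lam * 5)

# Probability that a customer takes less than 5 minutes: 1 - exp(-lam * 5)
prob_under_5 = float(expon.cdf(5, scale=mu))

print(f"f(5)     = {density_at_5:.3f}")
print(f"P(X < 5) = {prob_under_5:.3f}")
```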
Exponential Distribution
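import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
scale = 1 / 3.3  # scale = 1 / rate, i.e. the mean time between events
draws = np.random.exponential(scale, size=1_000_000)
# fill=True replaces the deprecated shade=True in recent seaborn releases
sns.kdeplot(draws, fill=True, color='xkcd:lightish blue')
plt.show()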
Normal(Gaussian) Distribution
• What data does a Normal distribution fit?
– Data where values far from the mean (extreme values) have a low probability of occurring
• Example data where the Normal distribution fits
– Body temperature
– People's height
– Car mileage
– IQ scores
– Error distribution of observed values of sensors
• Why fit a distribution?
– To infer the probability of occurrence of events
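To make "extreme values are unlikely" concrete, the following sketch computes the probability of landing more than k standard deviations from the mean. The mean of 100 and standard deviation of 15 are assumed, IQ-like values used only for illustration.

```python
from scipy.stats import norm

mean, sd = 100.0, 15.0   # assumed parameters, for illustration only

tails = []
for k in (1, 2, 3):
    # Two-sided tail probability: P(|X - mean| > k * sd)
    tail = 2 * float(norm.sf(mean + k * sd, loc=mean, scale=sd))
    tails.append(tail)
    print(f"P(more than {k} SD from the mean) = {tail:.4f}")
```

The tail probability shrinks quickly as k grows, which is the property the bullet list describes.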
Sample Distribution

• If we take many samples (with the same number of observations in each sample) from a population, and calculate a mean for each sample, then the distribution of these means across the samples is called the sample distribution (more commonly, the sampling distribution of the mean)

• Central Limit Theorem (CLT):
– Let m be the mean and s be the standard deviation of a population P. For sufficiently large n, the sample distribution of the population is approximately a normal distribution with mean m and standard deviation s/sqrt(n), where n is the number of observations in each sample
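The CLT statement can be checked by simulation. The sketch below is illustrative only: the exponential population, the sample size of 100, and the 5,000 repetitions are all assumptions, chosen so that the population is clearly non-normal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed population: 1 million values from a skewed (exponential) distribution
population = rng.exponential(scale=2.0, size=1_000_000)
m, s = population.mean(), population.std()

n = 100  # observations per sample
# Draw 5,000 samples of size n and take the mean of each one
samples = rng.choice(population, size=(5000, n))
sample_means = samples.mean(axis=1)

print(f"population mean m      = {m:.3f}")
print(f"mean of sample means   = {sample_means.mean():.3f}")
print(f"s / sqrt(n)            = {s / np.sqrt(n):.3f}")
print(f"SD of sample means     = {sample_means.std():.3f}")
# Going the other way: sqrt(n) * SD of sample means estimates the population SD
print(f"sqrt(n) * SD of means  = {np.sqrt(n) * sample_means.std():.3f} (s = {s:.3f})")
```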
Sample Distribution

[Figures: frequency distribution of a population of size 1 million; frequency distribution of the sample distribution; frequency distributions of four different samples (of size 100 each)]


An Application of CLT in Data Science

• Suppose an advertising company tells a customer that a 20% expected click-through rate (CTR) will be provided, where CTR is normally distributed.
• But the customer draws a sample of size 100, and finds that only 16 people clicked the ad.
• Can the customer take the advertising company to court for not meeting the 20% CTR?
– Ans: No, not on the basis of a single sample
– The customer should find the mean of the sampling distribution
– If the mean of the sampling distribution falls well short of 0.2, then a complaint can be made
• As per CLT, the mean of a normally distributed population is the same as the mean of the sampling distribution
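A rough sketch of why a single sample of 100 is weak evidence here. It assumes, beyond what the slide states, that each of the 100 impressions is an independent click/no-click trial with click probability 0.2, so the number of clicks is binomial.

```python
from scipy.stats import binom

n = 100              # sample size drawn by the customer
claimed_ctr = 0.20   # CTR promised by the company
clicks = 16          # clicks observed in the sample

# Standard deviation of the sample CTR if the claim is true
standard_error = (claimed_ctr * (1 - claimed_ctr) / n) ** 0.5

# How far the observed CTR is from the claim, in standard errors
distance_in_se = (clicks / n - claimed_ctr) / standard_error

# Chance of seeing 16 or fewer clicks even when the claim holds
prob_16_or_fewer = float(binom.cdf(clicks, n, claimed_ctr))

print(f"standard error of sample CTR = {standard_error:.3f}")
print(f"observed CTR is {distance_in_se:.2f} standard errors from the claim")
print(f"P(16 or fewer clicks | CTR = 0.2) = {prob_16_or_fewer:.3f}")
```

Under these assumptions the observed 16% sits only one standard error below the claimed 20%, so one sample alone does not contradict the claim.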
Applications of CLT in Data Science
▪ Sensor error is usually normally distributed
▪ To find the expected error that a sensor/device makes, find the mean of the sample distribution
▪ To find the SD of the error that the device will make, find the SD of the sample distribution, say S. By CLT, the SD of the device (entire population) will be ES = sqrt(n) x S, where n is the sample size.
Statistical Hypothesis Testing

• What is Conjecture?
– Any statement which is either true or false
• What is Statistical Hypothesis?
– Conjecture that can be tested by experiments / observations
• Eg:
– Given drug X, and disease d, X is effective in treating d
– Avg monthly salary of an Indian is 10k
– Avg monthly salary of an Indian and a Chinese is the
same
– Performance of Algorithms A and B are statistically the
same
Statistical Hypothesis Testing
Hypothesis Testing using Z-Test to test
population mean

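Test statistic to check if the population mean is mu:
$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$, where $\bar{x}$ is the sample mean, $\sigma$ is the population standard deviation, and n is the sample size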


Two-Tailed Vs One tailed Test
A factory has a machine that dispenses 80 mL of fluid into a
bottle. An employee believes that the average amount of fluid
isn’t 80 mL. Using 40 samples, he measures the average
amount dispensed by the machine to be 78 mL. Machine
standard deviation is 2.5 mL. State the null and alternative
hypothesis. At a 95% confidence level, is there enough
evidence to support the idea that the machine is not working
properly?

Suppose later that further testing shows that the machine was
working properly, what type of error did the employee make
(Type 1 or Type 2)?
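One way to work through the factory question is sketched below. It assumes a two-tailed test (H0: mean = 80 mL, H1: mean ≠ 80 mL) at alpha = 0.05, which matches the employee's belief that the average "isn't 80 mL", and uses `scipy.stats.norm.ppf` for the critical value instead of a printed Z-table.

```python
import math
from scipy.stats import norm

mu0 = 80.0          # hypothesised population mean (mL)
sample_mean = 78.0  # measured average (mL)
sigma = 2.5         # machine standard deviation (mL)
n = 40              # number of measurements
alpha = 0.05        # 95% confidence level

# Z statistic for the population mean
z = (sample_mean - mu0) / (sigma / math.sqrt(n))

# Two-tailed critical value (about 1.96 for alpha = 0.05)
z_crit = float(norm.ppf(1 - alpha / 2))
reject_h0 = abs(z) > z_crit

print(f"Z = {z:.2f}, critical value = ±{z_crit:.2f}, reject H0: {reject_h0}")
```

Since |Z| is far larger than the critical value, H0 is rejected. If the machine later turns out to be working properly, a true null hypothesis was rejected, which is a Type 1 error.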
Z-Table (example lookup: P(x < 2.31) = 0.9896)
Steps of Z-test for left tail to test population mean
Step 1: Formulate H0 and H1
H0: PM=50 (PM denotes Population mean)
H1: PM<50
Note: The objective is to reject the null hypothesis when the population mean is significantly less than 50

Step 2: Select Significance Level
alpha = 5%

Step 3: Find Z* from the Z-table corresponding to the chosen alpha
Z* = -1.65

Step 4: Calculate the Z test statistic
Z = [(SM-PM)/SD] * sqrt(n) (SM denotes sample mean)
Z = [(46-50)/6]*2 = -4/3 = -1.33
(SM=46, Population SD is 6, Sample size is 4)

Step 5: If Z < Z*, reject H0.
As Z > Z* in the above example, we cannot reject the null hypothesis H0
Steps of Z-test for right tail to test population mean
Step 1: Formulate H0 and H1
H0: PM=50 (PM denotes Population mean)
H1: PM>50
Note: The objective is to reject the null hypothesis when the population mean is significantly more than 50

Step 2: Select Significance Level
alpha = 5%

Step 3: Find Z* from the Z-table corresponding to the chosen alpha
Z* = 1.65

Step 4: Calculate the Z test statistic
Z = [(SM-PM)/SD] * sqrt(n) (SM denotes sample mean)
Z = [(55-50)/6]*2 = 5/3 = 1.67
(SM=55, Population SD is 6, Sample size is 4)

Step 5: If Z > Z*, reject H0.
As Z > Z* in the above example, we can reject the null hypothesis H0.
Steps of Z-test for two tails to test population mean
Step 1: Formulate H0 and H1
H0: PM=50 (PM denotes Population mean)
H1: PM != 50
Note: The objective is to reject the null hypothesis when the population mean is significantly different from 50

Step 2: Select Significance Level, alpha = 5%

Step 3: Find Z* from the Z-table corresponding to alpha/2 = 2.5%, Z* = 1.96

Step 4: Calculate the Z test statistic
Z = [(SM-PM)/SD] * sqrt(n) (SM denotes sample mean)
Z = [(55-50)/6]*2 = 5/3 = 1.67
(SM=55, Population SD is 6, Sample size is 4)

Step 5: When Z is positive, if Z > Z*, reject H0
When Z is negative, if Z < -Z*, reject H0
* In the above example, we cannot reject the null hypothesis H0 as Z is positive and Z < Z*
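The left-tail, right-tail, and two-tailed procedures can be collected into one small helper. This is a sketch, not a reference implementation: the function name `z_test` and its parameters are made up for illustration, and the critical value comes from `scipy.stats.norm.ppf` (about 1.645 for alpha = 5%) instead of the rounded table value 1.65.

```python
import math
from scipy.stats import norm

def z_test(sample_mean, pop_mean, pop_sd, n, alpha=0.05, tail="two"):
    """Return (Z, reject_H0) for a one-sample Z-test on the population mean."""
    # Step 4: Z = [(SM - PM) / SD] * sqrt(n)
    z = (sample_mean - pop_mean) / pop_sd * math.sqrt(n)
    # Steps 3 and 5: compare Z with the critical value for the chosen tail
    if tail == "left":
        reject = z < norm.ppf(alpha)
    elif tail == "right":
        reject = z > norm.ppf(1 - alpha)
    else:  # two-tailed: use alpha / 2 in each tail
        reject = abs(z) > norm.ppf(1 - alpha / 2)
    return z, bool(reject)

# The three examples from the steps (PM = 50, SD = 6, n = 4)
print(z_test(46, 50, 6, 4, tail="left"))
print(z_test(55, 50, 6, 4, tail="right"))
print(z_test(55, 50, 6, 4, tail="two"))
```

The same Z = 1.67 is rejected in the right-tail test but not in the two-tailed test, because the two-tailed test splits alpha across both tails and so uses a larger critical value.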
Z-test to test difference of means
of two populations
• Two sample Z-test
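– Objective: Check if the means of two populations differ significantly
• Null Hypothesis: $H_0: \mu_1 = \mu_2$

• The test statistic: $Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}}$, where $\bar{x}_i$, $\sigma_i$ and $n_i$ are the sample mean, population standard deviation, and sample size for population i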
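A minimal sketch of a two-sample Z-test follows, using the standard statistic (difference of sample means divided by sqrt(σ1²/n1 + σ2²/n2)). The helper name `two_sample_z` and all input numbers are hypothetical, chosen only to show the calculation.

```python
import math
from scipy.stats import norm

def two_sample_z(mean1, mean2, sd1, sd2, n1, n2, alpha=0.05):
    """Two-tailed two-sample Z-test; sd1 and sd2 are population SDs."""
    z = (mean1 - mean2) / math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    reject = abs(z) > norm.ppf(1 - alpha / 2)
    return z, bool(reject)

# Hypothetical inputs: two sample means with known population SDs
z, reject = two_sample_z(mean1=52.0, mean2=50.0, sd1=6.0, sd2=5.0, n1=40, n2=50)
print(f"Z = {z:.3f}, reject H0 (means are equal): {reject}")
```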


When to use right tail
• A manufacturer of a certain brand of snack claims that the average
saturated fat content does not exceed 1.5 grams per serving.
State the null and alternative hypothesis to be used in testing this
claim, and state whether you will reject the hypothesis or not at 5%
significance level, given the standard deviation of the population is
2, and some observations(saturated fat content in gms) as
1.2 1.6 1.4 1.8 1.1 0.9 1.2 1.3 0.93
• Ans: SM = 1.27, n=9. As the population standard deviation is given in the question, we should go for the Z Test
• Let X be the saturated fat content in a serving
• H0: PM=1.5
• H1: PM>1.5 (right tail), as we are checking whether the saturated fat content exceeds the claimed value
• Alpha = 5% = 0.05 => z* = 1.65, since p(-inf < x < z*) = 0.95 (check the Z table)
• Z = [(1.27-1.5)/2] * sqrt(9) = -0.345, which is not more than 1.65. Hence H0 cannot be rejected
• Note: The possibility of rejecting the null hypothesis arises only when PM>1.5, as no complaints can be made against the manufacturer if PM <= 1.5
When to use left tail
• A trading consulting company states that stock price of a certain company
will increase 0.1% per day in average. State the null and alternative
hypothesis to be used in testing this claim, and state whether you will reject
the hypothesis or not at 5% significance level, given standard deviation of the
population is 2, and some observations(per day stock price) as
0.02 0.03 0.02 -0.01 -0.02 0.001 -0.03 0.001 -0.04
• Ans: n=9, SM = -0.003 (-0.00311 before rounding)
• Let X be the avg. stock price increase per day
• H0: PM=0.001 (0.1% expressed as a fraction)
• H1: PM<0.001 (left tail), as we are checking if the avg. per-day increase of the stock price is less than claimed
• Alpha = 5% = 0.05 => z* satisfies p(-inf < x < z*) = 0.05
( z* satisfies p(-inf < x < z*) = 0.05 <=> -z* satisfies p(-inf < x < -z*) = 0.95)
Therefore z* = -1.65 (check the Z table)
• Z = [(-0.00311-0.001)/2] * sqrt(9) = -0.00617, which is greater than -1.65. Hence H0 cannot be rejected
• Note: The possibility of rejecting the null hypothesis arises only when PM<0.001, as no complaints can be made against the company if PM >= 0.001
When to use two tailed test
• Suppose a faculty member says the avg. marks of students in his class is 50 in the data science course. You collected the marks of 9 students in the class, and the marks are:
55 45 60 70 65 61 39 88 90
State the null and alternative hypothesis to be used in testing this claim, and state whether you will reject the hypothesis or not at 5% significance level, given that the standard deviation of the population is 5.
• Ans: n=9, SM = 63.67
• Let X be the avg. mark of the students
• H0: PM=50
• H1: PM!=50
• Alpha = 5% => alpha/2 = 0.025 => for z* = 1.96, p(-inf < x < z*) = 0.975
Therefore the critical values are z* = 1.96 and -z* = -1.96
• Z = [(63.67-50)/5] * sqrt(9) = 8.2, which is more than 1.96. Hence H0 is rejected
• Note: Here, deviations of the mean from 50 to either the left or the right are not acceptable
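The three worked examples (right tail, left tail, two tails) can be recomputed from the raw observations as a sanity check. This sketch uses NumPy for the sample means and `scipy.stats.norm.ppf` for critical values, so the critical values are 1.645 and 1.960 instead of the rounded table entries. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

alpha = 0.05
# (tail, observations, hypothesised population mean, population SD)
examples = [
    ("right", [1.2, 1.6, 1.4, 1.8, 1.1, 0.9, 1.2, 1.3, 0.93], 1.5, 2.0),
    ("left", [0.02, 0.03, 0.02, -0.01, -0.02, 0.001, -0.03, 0.001, -0.04], 0.001, 2.0),
    ("two", [55, 45, 60, 70, 65, 61, 39, 88, 90], 50.0, 5.0),
]

results = {}
for tail, data, pop_mean, pop_sd in examples:
    x = np.asarray(data, dtype=float)
    # Z = [(SM - PM) / SD] * sqrt(n), with SM computed from the data
    z = (x.mean() - pop_mean) / pop_sd * np.sqrt(len(x))
    if tail == "right":
        reject = z > norm.ppf(1 - alpha)
    elif tail == "left":
        reject = z < norm.ppf(alpha)
    else:
        reject = abs(z) > norm.ppf(1 - alpha / 2)
    results[tail] = (float(z), bool(reject))
    print(f"{tail:>5}-tail: SM = {x.mean():.4f}, Z = {z:.5f}, reject H0: {bool(reject)}")
```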
T-Test
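• Use the standard deviation of the sample (s) in place of the population standard deviation in the Z statistic to obtain the t-statistic

• T-statistic to check for population mean: $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$, with n-1 degrees of freedom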


T-Test
t Table
T-Test
• Try the same questions/examples as given for the Z-Test. Just assume that the population standard deviation is not given in the questions. You need to check the t-Table with the number of degrees of freedom (n-1) to find the t* value corresponding to a significance level alpha.

• The major difference between the Z-Test and the T-Test is that the population standard deviation/variance is given for the Z-Test, whereas the sample standard deviation/variance is used for the T-Test. If the sample standard deviation/variance is not given in the question, you need to compute it from the given data before calculating the t-statistic.
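As a sketch of that exercise, the student-marks example can be redone as a two-tailed t-test, treating the population standard deviation as unknown. The sample standard deviation is computed from the data with `ddof=1`, the critical value comes from `scipy.stats.t.ppf` in place of the t-Table, and `scipy.stats.ttest_1samp` is used only as a cross-check.

```python
import numpy as np
from scipy import stats

marks = np.array([55, 45, 60, 70, 65, 61, 39, 88, 90], dtype=float)
pop_mean = 50.0   # H0: PM = 50
alpha = 0.05
n = len(marks)

# Sample standard deviation (divides by n - 1)
s = marks.std(ddof=1)

# t = (SM - PM) / (s / sqrt(n)), with n - 1 degrees of freedom
t_manual = (marks.mean() - pop_mean) / (s / np.sqrt(n))
t_crit = float(stats.t.ppf(1 - alpha / 2, df=n - 1))
reject_h0 = bool(abs(t_manual) > t_crit)

# Cross-check with SciPy's built-in one-sample t-test
t_scipy, p_value = stats.ttest_1samp(marks, popmean=pop_mean)

print(f"s = {s:.3f}, t = {t_manual:.3f}, t* = ±{t_crit:.3f}, reject H0: {reject_h0}")
```

With the population standard deviation replaced by the much larger sample value, the statistic drops from Z = 8.2 to a t-value only slightly above the critical value, so H0 is still rejected but by a much narrower margin.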
