
Sampling Distribution and Estimation
Car Mileage Case
Hybrid and electric cars are a vital part of reducing US gasoline consumption. The most effective way to conserve gasoline is to design gasoline-powered cars that are more fuel efficient. Virtually every gasoline-powered midsize car equipped with an automatic transmission has an EPA combined city and highway mileage estimate of 26 miles per gallon (mpg) or less. Suppose that the government has decided to offer a tax credit to any automaker selling a midsize model that achieves an EPA combined mileage estimate of at least 31 mpg.

Consider an automaker that has recently introduced a new midsize model and wants to show that this model qualifies for the tax credit. Consider the population of all cars of this type that will or could potentially be produced. The automaker will choose a sample of 50 of these cars. The manufacturer's production operation runs 8-hour shifts, with 100 midsize cars produced on each shift. Once all start-up problems have been corrected, the automaker selects 1 car at random from each of 50 shifts, and these cars are subjected to the EPA test.
Sampling Distribution of the Sample Mean

The sampling distribution of the sample mean is the probability distribution of the population of the sample means obtainable from all possible samples of size n from a population.
Example: The Population of Sample Means
Example: A Graph of the Probability Distribution
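The figures for these two example slides are not available here. As a stand-in, the following is a minimal sketch of what "the population of sample means" looks like for a small, hypothetical population of four values (the numbers 2, 4, 6, 8 and the sample size n = 2 are assumptions chosen for illustration, not taken from the slides).

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Hypothetical population of N = 4 values (not from the slides).
population = [2, 4, 6, 8]
n = 2  # sample size

# Every possible sample of size n, drawn without replacement.
samples = list(combinations(population, n))

# The "population of sample means": one mean per possible sample.
sample_means = [Fraction(sum(s), n) for s in samples]

# Each sample is equally likely, so the probability of a given mean is
# (number of samples with that mean) / (number of samples).
counts = Counter(sample_means)
distribution = {m: Fraction(c, len(samples)) for m, c in sorted(counts.items())}

for m, p in distribution.items():
    print(f"P(sample mean = {float(m)}) = {p}")

# The mean of all possible sample means equals the population mean.
mean_of_means = sum(sample_means) / len(sample_means)
population_mean = Fraction(sum(population), len(population))
print(mean_of_means == population_mean)
```

Exact fractions are used so that the comparison of the two means is not affected by floating-point rounding.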
Standard Error

• Variation in the values of a statistic from sample to sample is called sampling fluctuation and is measured by the STANDARD ERROR
Sampling Distribution of Mean

$$E(\bar{x}) = \mu$$

$$\hat{\mu} = \bar{x}$$

$$\mathrm{s.e.}(\bar{x}) = \frac{\sigma}{\sqrt{n}}$$

As sample size increases, standard error decreases


Result

$$\text{If } X \sim N(\mu, \sigma^2), \text{ then } \bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
Example
The foreman of a bottling plant has observed that the amount of soda in each "32-ounce" bottle is actually a normally distributed random variable, with a mean of 32.2 ounces and a standard deviation of 0.3 ounce.

If a customer buys one bottle, what is the probability that the bottle will contain more than 32 ounces?
Example

We want to find P(X > 32), where X is normally distributed with µ = 32.2 and σ = 0.3.

$$P(X > 32) = P\left(\frac{X - \mu}{\sigma} > \frac{32 - 32.2}{0.3}\right) = P(Z > -0.67) = 1 - 0.2514 = 0.7486$$

"There is about a 75% chance that a single bottle of soda contains more than 32 oz."
Example

The foreman of a bottling plant has observed that the amount of soda in each "32-ounce" bottle is actually a normally distributed random variable, with a mean of 32.2 ounces and a standard deviation of 0.3 ounce.

If a customer buys a carton of four bottles, what is the probability that the mean amount of the four bottles will be greater than 32 ounces?
Example

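We want to find P(X̄ > 32), the probability that the sample mean of n = 4 bottles exceeds 32, where X is normally distributed with µ = 32.2 and σ = 0.3.

Things we know:
1. X is normally distributed, therefore so is X̄.
2. The mean of X̄ equals the population mean:

$$\mu_{\bar{X}} = \mu = 32.2 \text{ oz}$$

3. The standard error of X̄ is

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{0.3}{\sqrt{4}} = 0.15 \text{ oz}$$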
Example

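If a customer buys a carton of four bottles, what is the probability that the mean amount of the four bottles will be greater than 32 ounces?

$$P(\bar{X} > 32) = P\left(Z > \frac{32 - 32.2}{0.3/\sqrt{4}}\right) = P(Z > -1.33) = 1 - 0.0918 = 0.9082$$

"There is about a 91% chance the mean of the four bottles will exceed 32 oz."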
[Figure: two normal curves, both centred at mean = 32.2. Left panel: the probability that one bottle will contain more than 32 ounces. Right panel: the probability that the mean of four bottles will exceed 32 oz.]
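The two bottle probabilities can be checked numerically. This is a minimal sketch using Python's standard-library `statistics.NormalDist`; because it does not round z to two decimals, the results differ slightly from the table-based values 0.7486 and 0.9082.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 32.2, 0.3  # mean and standard deviation given in the example

# One bottle: X ~ N(mu, sigma^2)
p_single = 1 - NormalDist(mu, sigma).cdf(32)

# Mean of n = 4 bottles: x-bar ~ N(mu, sigma^2 / n), so the
# standard deviation of x-bar is sigma / sqrt(n).
n = 4
p_mean = 1 - NormalDist(mu, sigma / sqrt(n)).cdf(32)

print(f"P(X > 32)     = {p_single:.4f}")
print(f"P(x-bar > 32) = {p_mean:.4f}")
```

The second probability is larger because the sample mean varies less than a single observation, so less of its distribution lies below 32.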
Central Limit Theorem (CLT)

If a random sample of size n is drawn from a population with mean µ and standard deviation σ, the distribution of the sample mean (x̄) approaches a normal distribution with mean µ and standard deviation σ/√n as the sample size (n) increases, i.e. approximately

$$\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$

If the population is normal, the distribution of the sample mean is normal regardless of sample size.
WHY CLT IS USEFUL

• When the sampling distribution of x̄ is approximately normal, we can use the Empirical Rule to predict how close sample means will be to the true population mean.
• Since the CLT holds for a large number of population distributions, it helps us to make inferences about the population mean regardless of the shape of the population distribution. This is often helpful in practice since we usually do not know the true shape of the population distribution (and often it is skewed).
How Large?

• How large is "large enough?"
• If the sample size is at least 30, then for most populations, the sampling distribution of the sample mean is approximately normal
• For a skewed distribution, a sample size of 50 or more may be needed
• For a heavy-tailed distribution, even more may be needed (100 or more)
• If the population is normal, then the sampling distribution of the sample mean is normal regardless of the sample size
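The claim that sample means from a skewed population behave approximately normally can be explored by simulation. The sketch below assumes an exponential population with mean 1 and standard deviation 1 (a choice made here for illustration, not one from the slides) and checks that the simulated sample means are centred near µ with spread near σ/√n.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)   # fixed seed so the run is repeatable
n, reps = 30, 5000       # sample size and number of simulated samples

# Skewed population (assumed): exponential with mean 1 and sd 1.
sample_means = [
    mean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

center = mean(sample_means)   # should be close to mu = 1
spread = stdev(sample_means)  # should be close to sigma / sqrt(n)

print(f"mean of sample means = {center:.3f} (mu = 1)")
print(f"sd of sample means   = {spread:.3f} (sigma/sqrt(n) = {1 / sqrt(n):.3f})")
```

Plotting a histogram of `sample_means` (not done here) would show a roughly bell-shaped curve even though the population itself is strongly right-skewed.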
Data Analysis

Mean 31.56
Standard Error 0.112812
Median 31.55
Mode 31.4
Standard Deviation 0.797701
Sample Variance 0.636327
Kurtosis -0.51125
Skewness -0.03422
Range 3.5
Minimum 29.8
Maximum 33.3
Sum 1578
Count 50
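Several rows of this output are related to one another, which gives a quick consistency check. The sketch below copies the sum, count, and standard deviation from the table and recomputes the mean, variance, and standard error from them.

```python
from math import sqrt

# Values copied from the descriptive-statistics output above.
total, count = 1578, 50
std_dev = 0.797701

sample_mean = total / count              # Sum / Count
sample_variance = std_dev ** 2           # square of the standard deviation
standard_error = std_dev / sqrt(count)   # s / sqrt(n)

print(f"mean           = {sample_mean:.2f}")
print(f"variance       = {sample_variance:.6f}")
print(f"standard error = {standard_error:.6f}")
```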
How to estimate parameters?

Already seen that

$$\hat{\mu} = \bar{x} \quad \text{(use the sample mean to estimate the population mean)}$$

For the car mileage case, the sample mean = 31.56

How to estimate the population standard deviation σ?

$$\hat{\sigma} = s = \text{sample sd} = \sqrt{\frac{1}{n-1}\sum (x - \bar{x})^2}$$

Note: These estimates are point estimates and may not be perfect.

Use "Interval Estimates"


Confidence Intervals
Interval Estimate = Point Estimate ± Margin of Error

Margin of Error = (point of the sampling distribution) × (standard error)
Confidence Intervals for a Mean: σ Known

• A confidence interval for a population mean is an interval constructed around the sample mean so that we are reasonably sure that it contains the population mean
• Any confidence interval is based on a confidence level
Elements of Interval Estimation

[Figure: a confidence interval drawn around the sample statistic (e.g. the sample mean), running from the lower confidence limit to the upper confidence limit, with an associated probability that the population parameter falls somewhere within the interval.]


The Car Mileage Case

• The automaker conducted mileage tests on n = 50 cars
• The sample mean is 31.56 mpg
• This is a point estimate of the population mean
• We do not know how good this estimate is
• So we will use a confidence interval
The Car Mileage Case

• There are many possible samples of 50 cars
• Each would give a different mean
• Consider the probability distribution of all the sample means
• This is called the sampling distribution of the sample mean

$$\hat{\mu} = \bar{x}, \qquad \mathrm{se}(\bar{x}) = \frac{\sigma}{\sqrt{n}}$$
The Car Mileage Case

1. Because the sampling distribution of the sample mean is a normal distribution, we can use the normal distribution to compute probabilities about the sample mean
2. The 95 percent confidence interval is

$$\bar{x} \pm 1.96\,\sigma_{\bar{x}} = \left[\bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}}\right]$$
What is happening?

[Figure: Sampling Distribution of the Mean. The central 95% of the curve lies within 1.96σ/√n of its centre, with 2.5% in each tail. Of the sample means x̄ plotted beneath the curve, 95% fall within the interval, 2.5% fall below the interval, and 2.5% fall above the interval.]
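The meaning of "95%" can be checked with a small simulation: draw many samples, build x̄ ± 1.96σ/√n from each, and count how often the interval contains µ. The sketch below assumes a normal population with µ = 31.5 and σ = 0.8; these "true" values are hypothetical and chosen only so that coverage can be counted.

```python
import random
from math import sqrt
from statistics import mean

rng = random.Random(1)         # fixed seed so the run is repeatable
mu, sigma, n = 31.5, 0.8, 50   # hypothetical "true" population values
reps = 2000                    # number of simulated samples

margin = 1.96 * sigma / sqrt(n)  # same margin for every sample (sigma known)

covered = 0
for _ in range(reps):
    xbar = mean(rng.gauss(mu, sigma) for _ in range(n))
    # Does this sample's interval contain the true mean?
    if xbar - margin <= mu <= xbar + margin:
        covered += 1

coverage = covered / reps
print(f"proportion of intervals containing mu: {coverage:.3f}")
```

The proportion should come out close to 0.95: the interval is random, the parameter is fixed, and roughly 95 out of every 100 such intervals capture it.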
1/20/2020 Statistical Inference
Generalizing

• The probability that the confidence interval will contain the population mean μ is denoted by 1 – α
• 1 – α is referred to as the confidence coefficient
• (1 – α) × 100% is called the confidence level
• It is usual to use two-decimal-place probabilities for 1 – α
• Here, focus on 1 – α = 0.95 or 0.99
General Confidence Interval

• In general, the probability is 1 – α that the population mean μ is contained in the interval

$$\bar{x} \pm z_{\alpha/2}\,\sigma_{\bar{x}} = \left[\bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}\right]$$

• The normal point zα/2 gives a right-hand tail area under the standard normal curve equal to α/2
• The normal point –zα/2 gives a left-hand tail area under the standard normal curve equal to α/2
• The area under the standard normal curve between –zα/2 and zα/2 is 1 – α
General Confidence Interval

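• If a population has standard deviation σ (known),
• and if the population is normal or if the sample size is large (n ≥ 30), then …
• … a (1 – α)100% confidence interval for μ is

$$\bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} = \left[\bar{x} - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}\right]$$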
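95% Confidence Interval

$$\bar{x} \pm z_{0.025}\,\sigma_{\bar{x}} = \left[\bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}}\right] = \left[\bar{x} - 1.96\,\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\,\frac{\sigma}{\sqrt{n}}\right]$$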
99% Confidence Interval

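• For 99% confidence, need the normal point z0.005
• (1 – 0.99) / 2 = 0.005
• z0.005 = 2.575
• The 99% confidence interval is

$$\bar{x} \pm z_{0.005}\,\sigma_{\bar{x}} = \left[\bar{x} \pm 2.575\,\frac{\sigma}{\sqrt{n}}\right] = \left[\bar{x} - 2.575\,\frac{\sigma}{\sqrt{n}},\ \bar{x} + 2.575\,\frac{\sigma}{\sqrt{n}}\right]$$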
The Effect of α on Confidence Interval Width
t-Based Confidence Intervals for a Mean:
σ Unknown

• If σ is unknown (which is usually the case), we can construct a confidence interval for μ based on the sampling distribution of

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

• If the population is normal, then for any sample size n, this sampling distribution is called the t distribution
The t Distribution

• The curve of the t distribution is similar to that of the standard normal curve
• Symmetrical and bell-shaped
• The t distribution is more spread out than the standard normal distribution
• The spread of the t distribution is determined by the number of degrees of freedom, which depends on the sample size
• Denoted by df
• For a sample of size n, there is one fewer degree of freedom, that is, df = n – 1
Degrees of Freedom and the
t-Distribution

As the number of degrees of freedom increases, the spread of the t distribution decreases and the t curve approaches the standard normal curve
t and Right Hand Tail Areas

• Use a t point denoted by tα
• tα is the point on the horizontal axis under the t curve that gives a right-hand tail area equal to α
• So the value of tα in a particular situation depends on the right-hand tail area α and the number of degrees of freedom
• df = n – 1
• 1 – α is the specified confidence coefficient
t and Right Hand Tail Areas
t-Based Confidence Intervals for a Mean:
σ Unknown

• If the sampled population is normally distributed with mean μ, then a (1 – α)100% confidence interval for μ is

$$\bar{x} \pm t_{\alpha/2}\,\frac{s}{\sqrt{n}}$$

• tα/2 is the t point giving a right-hand tail area of α/2 under the t curve having n – 1 degrees of freedom
Car Mileage estimation:
• Recall from the Chapter 3 example: x̄ = 31.56 mpg for a sample of size n = 50, and s = 0.8

$$s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{0.8}{\sqrt{50}} = 0.113$$

$$t_{0.025,\,49} = 2.010$$
Car Mileage: 95% Confidence interval of mean mileage

$$\bar{x} \pm t_{\alpha/2;\,n-1}\,\frac{s}{\sqrt{n}} = 31.56 \pm (2.010 \times 0.113) = 31.56 \pm 0.22713$$

95% CI of μ: [31.33, 31.79]
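The interval above can be reproduced in a few lines. This is a minimal sketch; the helper name `t_interval` is made up for illustration, and the t point 2.010 is taken from the slide rather than computed, since Python's standard library has no t distribution.

```python
from math import sqrt

def t_interval(xbar, s, n, t_crit):
    """Return (lower, upper) for x-bar +/- t_crit * s / sqrt(n)."""
    standard_error = s / sqrt(n)
    margin = t_crit * standard_error
    return xbar - margin, xbar + margin

# Car mileage case: x-bar = 31.56, s = 0.8, n = 50, t_{0.025, 49} = 2.010
lower, upper = t_interval(31.56, 0.8, 50, t_crit=2.010)
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
```

Since the whole interval lies above 31 mpg, the sample is consistent with the model meeting the tax-credit threshold.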
Practice Problem 1:
• A manufacturer of light bulbs claims that its light bulbs have a mean life of μ hours with a standard deviation of 85 hours. A random sample of 40 such bulbs is selected for testing. If the sample produces a mean value of 1505 hours, find the 95% confidence interval of μ.
Solution: Given n = 40 (large), σ = 85 (known), 1 – α = 0.95, α = 0.05,

$$\bar{x} = 1505, \qquad z_{\alpha/2} = z_{0.025} = 1.96$$

The 95% CI of μ is given by

$$\left[1505 - 1.96\,\frac{85}{\sqrt{40}},\ 1505 + 1.96\,\frac{85}{\sqrt{40}}\right] = [1478.66,\ 1531.34]$$
Practice Problem 2:
• Waiting times (in hours) at a popular restaurant are found to have a mean of 1.52 hours with a standard deviation of 2.25 hours for a sample of 50 customers. Construct the 99% confidence interval for the population mean.
Solution: Given n = 50 (large), s = 2.25 (estimated), 1 – α = 0.99, α = 0.01,

$$\bar{x} = 1.52, \qquad z_{\alpha/2} = z_{0.005} = 2.58$$

Therefore, the 99% CI of μ is given by

$$\left[1.52 - 2.58\,\frac{2.25}{\sqrt{50}},\ 1.52 + 2.58\,\frac{2.25}{\sqrt{50}}\right] = [0.70,\ 2.34]$$

Use the t-based confidence interval and observe the difference (assuming a normal population).
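Both practice problems use the same z-based formula, so they can be checked with one helper. This is a sketch: `z_interval` is a made-up name, and the z point is computed with `statistics.NormalDist().inv_cdf` instead of being read from a table, so it uses 1.95996 and 2.57583 rather than the rounded 1.96 and 2.58.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sd, n, confidence):
    """Return (lower, upper) for x-bar +/- z_{alpha/2} * sd / sqrt(n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # right-hand tail area alpha/2
    margin = z * sd / sqrt(n)
    return xbar - margin, xbar + margin

bulbs = z_interval(1505, 85, 40, 0.95)    # Practice Problem 1
waits = z_interval(1.52, 2.25, 50, 0.99)  # Practice Problem 2

print(f"bulbs: [{bulbs[0]:.2f}, {bulbs[1]:.2f}]")
print(f"waits: [{waits[0]:.2f}, {waits[1]:.2f}]")
```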
Hypothesis Testing
Null and Alternative Hypotheses and
Errors in Hypothesis Testing
• The null hypothesis, H0, is a statement of the basic proposition being tested
• It represents the status quo and is not rejected unless there is convincing sample evidence that it is false
• The alternative hypothesis, Ha, is an alternative accepted only if there is convincing sample evidence that it is true
• One-Sided, "Greater Than": H0: μ ≤ μ0 vs. Ha: μ > μ0
• One-Sided, "Less Than": H0: μ ≥ μ0 vs. Ha: μ < μ0
• Two-Sided, "Not Equal": H0: μ = μ0 vs. Ha: μ ≠ μ0
where μ0 is a given constant value (with the appropriate units) that is a comparative value
Car Mileage Case
• Null hypothesis, H0,

H0: μ ≤ 31

• Alternative hypothesis,

Ha: μ > 31
We write:

H0: μ ≤ 31 vs Ha: μ > 31


Example

Suppose a bank knows that its customers wait in line an average of 10.2 minutes during the lunch hour. The branch manager has decided to add an additional teller during the 12–2 p.m. period and wishes to test the hypothesis that the average wait has decreased due to the additional teller. Set up the null and alternative hypotheses for the bank manager.
H0: μ = 10.2
H1: μ < 10.2
Case:
Marketing Iced Coffee
• In order to capitalize on the iced coffee trend, Starbucks offered half-priced Frappuccino beverages between 3 pm and 5 pm for a limited time.
• Anne, the manager at a local Starbucks, determines the following from historical data:
• 43% of iced-coffee customers were women.
• 21% were teenage girls.
• Customers spent an average of $4.18 on iced coffee, with a standard deviation of $0.84.
Case:
Marketing Iced Coffee
• One month after the marketing period ends, Anne surveys 50 of her iced-coffee customers and finds:
✓ 46% were women.
✓ 34% were teenage girls.
✓ They spent an average of $4.26 on the drink.
• The manager wants to use this survey information to calculate the probability that:
✓ Customers spend an average of $4.26 or more on iced coffee.
✓ 46% or more of iced-coffee customers are women.
✓ 34% or more of iced-coffee customers are teenage girls.
Types of Decisions
• As a result of testing H0 vs. Ha, we reach one of two decisions about the null hypothesis H0:
• Do not reject H0, or reject H0

If the population is normal or n is large, the test statistic z follows a normal distribution

• To "test" H0 vs. Ha, use the "test statistic"

$$z = \frac{\bar{x} - \mu_0}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$

where μ0 is the population mean if the null hypothesis is true
• z measures the distance between μ0 and x̄ on the sampling distribution of the sample mean, in units of the standard error
Error Probabilities

• Type I Error: Rejecting H0 when it is true
• α is the probability of making a Type I error
• 1 – α is the probability of not making a Type I error
• Type II Error: Failing to reject H0 when it is false
• β is the probability of making a Type II error
• 1 – β is the probability of not making a Type II error
Typical Values

• Usually set α to a low value
• So there is a small chance of rejecting a true H0
• Typically, α = 0.05
• Strong evidence is required to reject H0
• Usually choose α between 0.01 and 0.05
• α = 0.01 requires very strong evidence to reject H0
• Tradeoff between α and β
• For fixed sample size, the lower α, the higher β
• And the higher α, the lower β
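The tradeoff between α and β can be made concrete for a one-sided "greater than" z test. The sketch below borrows μ0 = 31, σ = 1.65 and n = 50 from the car mileage case and assumes, purely for illustration, that the true mean is 31.5 mpg; β depends on that assumed value and would change with it. The function name `type_ii_error` is made up.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def type_ii_error(alpha, mu0, mu_true, sigma, n):
    """beta for H0: mu <= mu0 vs Ha: mu > mu0 when the true mean is mu_true."""
    z_alpha = Z.inv_cdf(1 - alpha)               # rejection point
    shift = (mu_true - mu0) / (sigma / sqrt(n))  # centre of z under mu_true
    # H0 is not rejected when z <= z_alpha; under mu_true, z ~ N(shift, 1).
    return Z.cdf(z_alpha - shift)

# mu_true = 31.5 is a hypothetical value chosen for illustration.
beta_05 = type_ii_error(0.05, mu0=31, mu_true=31.5, sigma=1.65, n=50)
beta_01 = type_ii_error(0.01, mu0=31, mu_true=31.5, sigma=1.65, n=50)

print(f"alpha = 0.05 -> beta = {beta_05:.3f}")
print(f"alpha = 0.01 -> beta = {beta_01:.3f}")
```

Lowering α moves the rejection point to the right, which makes it harder to reject H0 even when H0 is false, so β rises.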
z Tests about a Population Mean: σ
Known

• Test hypotheses about a population mean using the normal distribution
• Called z tests
• Require that the true value of the population standard deviation σ is known
• In most real-world situations, σ is not known
• But it is often estimated from s of a single sample
• When σ is unknown, test hypotheses about a population mean using the t distribution
• Here, assume that we know σ
Steps in Testing a “Greater Than” Alternative

1. State the null and alternative hypotheses
2. Specify the significance level α
3. Select the test statistic
4. Determine the critical value rule for deciding whether or not to reject H0
5. Collect the sample data and calculate the value of the test statistic
6. Decide whether to reject H0 by using the test statistic and the rejection rule (or the p-value)
7. Interpret the statistical results in managerial terms and assess their practical importance
Steps in Testing Car mileage Case

1. State the null and alternative hypotheses
H0: μ ≤ 31
Ha: μ > 31
2. Specify the significance level α
• α = 0.05
3. Select the test statistic
• Use the test statistic

$$z = \frac{\bar{x} - 31}{\sigma_{\bar{x}}} = \frac{\bar{x} - 31}{\sigma/\sqrt{n}}$$

• A positive value of this test statistic results from a sample mean that is greater than 31 mpg
Steps in Testing a “Greater Than” Alternative
in car mileage Case
4. Determine the rejection rule for deciding whether or
not to reject H0
• To decide how large the test statistic must be to reject H0 by
setting the probability of a Type I error to α, do the
following:
• The probability α is the area in the right-hand tail of the
standard normal curve
• Use the normal table to find the point zα (called the
rejection or critical point)
• Reject H0 in favor of Ha if the test statistic z is greater than
the rejection point zα
• In the mileage case, the rejection rule is to reject H0 if the
calculated test statistic z is > 1.645
Steps in Testing a “Greater Than” Alternative
in car mileage Case
5. Collect the sample data and calculate the value of the test statistic
• In the mileage case, assume that σ is known and σ = 1.65 mpg
• For a sample of n = 50, x̄ = 31.56 mpg. Then

$$z = \frac{\bar{x} - 31}{\sigma/\sqrt{n}} = \frac{31.56 - 31}{1.65/\sqrt{50}} = 2.39$$
Steps in Testing a “Greater Than” Alternative
in car mileage Case
6. Decide whether to reject H0
• Compare the value of the test statistic to the rejection
point according to the rejection rule
• Here, z = 2.39 is greater than z0.05 = 1.645
• Therefore reject H0: μ ≤ 31 in favor of
Ha: μ > 31 at the 0.05 significance level
7. Interpret the statistical results
• Conclude that the mean mileage of the new model exceeds 31 mpg, so the company is eligible for the tax credit.
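Steps 3 to 6 of the critical value approach can be sketched in a few lines, using the values stated in the mileage case. Note that the unrounded test statistic comes out at about 2.40; the slides report it as 2.39, and the decision is the same either way.

```python
from math import sqrt
from statistics import NormalDist

# Values from the car mileage case.
mu0, sigma, n, xbar, alpha = 31, 1.65, 50, 31.56, 0.05

# Step 5: test statistic z = (x-bar - mu0) / (sigma / sqrt(n))
z = (xbar - mu0) / (sigma / sqrt(n))

# Step 4: rejection point z_alpha (right-hand tail area alpha)
z_alpha = NormalDist().inv_cdf(1 - alpha)

# Step 6: reject H0 in favour of Ha: mu > mu0 if z > z_alpha
reject_h0 = z > z_alpha

print(f"z = {z:.2f}, z_alpha = {z_alpha:.3f}, reject H0: {reject_h0}")
```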
The p-Value

• The p-value, or observed level of significance, is the probability, computed assuming the null hypothesis H0 is true, of obtaining sample results at least as extreme as those actually observed
• The p-value is used to measure the weight of the evidence against the null hypothesis
• Sample results that are not likely if H0 is true have a low p-value and are evidence that H0 is not true
• The p-value is the smallest value of α for which we can reject H0
• The p-value approach is an alternative to comparing the z test statistic with a rejection point
Steps Using a p-value to Test a “Greater
Than” Alternative
4. Collect the sample data and compute the value of the test statistic
In the car mileage case, the value of the test statistic was calculated to be z = 2.39
5. Calculate the p-value corresponding to the test statistic value

$$P(Z > 2.39) = 1 - 0.9916 = 0.0084$$


Steps Using a p-value to Test a “Greater
Than” Alternative Continued
5. Continued
• If H0 is true, the probability is 0.0084 of obtaining a sample whose mean is 31.56 mpg or higher
• This is so low as to be evidence that H0 is false and should be rejected
6. Reject H0 if the p-value is less than α
• In the mileage case, α was set to 0.05
• The calculated p-value of 0.0084 is < α = 0.05
• This implies that the test statistic z = 2.39 is greater than the rejection point z0.05 = 1.645
• Therefore reject H0 at the α = 0.05 significance level
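The p-value for the mileage case can also be computed directly instead of being read from a normal table. This sketch shows both the table-style value for z = 2.39 and the value obtained from the unrounded statistic; the two differ only in the fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()
alpha = 0.05

# p-value using the rounded statistic reported on the slide.
p_table = 1 - std_normal.cdf(2.39)

# p-value using the unrounded statistic from the case values.
z = (31.56 - 31) / (1.65 / sqrt(50))
p_value = 1 - std_normal.cdf(z)

print(f"P(Z > 2.39)  = {p_table:.4f}")
print(f"P(Z > {z:.4f}) = {p_value:.4f}")
print(f"reject H0 at alpha = {alpha}: {p_value < alpha}")
```

Because the p-value is the smallest α at which H0 can be rejected, the same number also shows that H0 would be rejected at α = 0.01.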
• Example: From the Marketing Iced Coffee case, the manager wants to determine if the marketing campaign has had a lingering effect on the amount of money customers spend on iced coffee.
✓ Before the campaign, μ = $4.18 and σ = $0.84. Based on 50 customers sampled after the campaign, x̄ = $4.26.
✓ Let's find P(X̄ ≥ 4.26). Since n > 30, the central limit theorem states that X̄ is approximately normal. So,

$$P(\bar{X} \ge 4.26) = P\left(Z \ge \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}\right) = P\left(Z \ge \frac{4.26 - 4.18}{0.84/\sqrt{50}}\right) = P(Z \ge 0.67) = 1 - 0.7486 = 0.2514$$
t Tests about a Population Mean: σ
Unknown

• Assume the population being sampled is normally distributed
• The population standard deviation σ is unknown, as is the usual situation
• If the population standard deviation σ is unknown, then it will have to be estimated by the sample standard deviation s
• Under these two conditions, we have to use the t distribution to test hypotheses
Defining the t Statistic: σ Unknown

• Let x̄ be the mean of a sample of size n with standard deviation s
• Also, µ0 is the claimed value of the population mean
• Define a new test statistic

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$

• If the population being sampled is normal, and s is used to estimate σ, then the sampling distribution of the t statistic is a t distribution with n – 1 degrees of freedom
t Tests about a Population Mean: σ Unknown

Alternative     Reject H0 if:    p-value
Ha: µ > µ0      t > tα           Area under t distribution to right of t
Ha: µ < µ0      t < –tα          Area under t distribution to left of t
Ha: µ ≠ µ0      |t| > tα/2 *     Twice area under t distribution to right of |t|

tα, tα/2, and p-values are based on n – 1 degrees of freedom (for a sample of size n)
* either t > tα/2 or t < –tα/2
Problem
The manager of a small convenience store does not want her customers
standing in line for too long prior to a purchase. In particular, she is willing to
hire an employee for another cash register if the average wait time of the
customers is more than five minutes. She randomly observes the wait time (in
minutes) of customers during the day as:

3.5 5.8 7.2 1.9 6.8 8.1 5.4

a. Set up the null and the alternative hypotheses to determine if the manager needs
to hire another employee.
b. Calculate the value of the test statistic. What assumption regarding the
population is necessary to implement this step?
c. Use the critical value approach to decide whether the manager needs to hire
another employee at α=0.10.
d. Repeat the above analysis with the p-value approach.
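One possible way to work parts (a) to (c) is sketched below. It assumes the hypotheses H0: µ ≤ 5 vs. Ha: µ > 5 and a normally distributed population of wait times. The rejection point 1.440 for α = 0.10 with 6 degrees of freedom is read from a standard t table, because Python's standard library has no t distribution; for the same reason the exact p-value in part (d) is not computed here.

```python
from math import sqrt
from statistics import mean, stdev

# Observed wait times in minutes, from the problem statement.
waits = [3.5, 5.8, 7.2, 1.9, 6.8, 8.1, 5.4]
mu0 = 5  # hire another employee only if the mean wait exceeds 5 minutes

n = len(waits)
xbar = mean(waits)
s = stdev(waits)  # sample standard deviation (divides by n - 1)

# Test statistic for H0: mu <= 5 vs Ha: mu > 5
t_obs = (xbar - mu0) / (s / sqrt(n))

# Rejection point t_{0.10} with n - 1 = 6 df, from a standard t table.
t_crit = 1.440
reject_h0 = t_obs > t_crit

print(f"x-bar = {xbar:.3f}, s = {s:.3f}, t = {t_obs:.3f}")
print(f"reject H0 at alpha = 0.10: {reject_h0}")
```

Under these assumptions the test statistic falls well short of the rejection point, so the sample does not give the manager convincing evidence that the mean wait exceeds five minutes.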
Example
An automatic bottling machine fills cola into two-liter (2000 cc) bottles. A consumer advocate wants to challenge this average amount. A random sample of 40 bottles coming out of the machine was selected and the exact contents of the selected bottles were recorded. The sample mean was 1999.6 cc. The population standard deviation is known from past experience to be 1.30 cc.
Test an appropriate hypothesis.

H0: μ = 2000
H1: μ < 2000
Test statistic; p-value

$$z_{obs} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{1999.6 - 2000}{1.3/\sqrt{40}} = -1.95$$

$$z_{0.05} = 1.645, \qquad z_{obs} < -1.645$$

$$p\text{-value} = P(Z \le -1.95) = 0.0256 < 0.05$$

Reject the null hypothesis, i.e. the test is significant.

There is sufficient evidence for rejection.
Problem
I believe that, on average, a PGP student at IIMK spends 15 hours per week using library resources. A random sample of 8 students was selected, and the average number of hours they spent in the library came out to be 16.3 hrs. Assuming reading time follows a normal distribution with sd 3.6 hrs, test a suitable hypothesis.

To test
H0: μ = 15
H1: μ ≠ 15
Test statistic; p-value

$$z_{obs} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{16.3 - 15}{3.6/\sqrt{8}} = 1.02$$

$$z_{0.025} = 1.96, \qquad |z_{obs}| < 1.96$$

$$p\text{-value} = P(Z \ge 1.02) + P(Z \le -1.02) = 0.1539 \times 2 = 0.3078 > 0.05$$

Do not reject the null hypothesis, i.e. the test is insignificant.

There is not enough evidence to reject the belief.
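The cola example and the library problem differ only in the direction of the alternative, so both can be checked with one helper. This is a sketch: `z_test` and its `alternative` labels are made-up names, and the p-values differ slightly from the table-based ones because z is not rounded to two decimals.

```python
from math import sqrt
from statistics import NormalDist

def z_test(xbar, mu0, sigma, n, alternative):
    """Return (z, p_value) for a z test with known sigma.

    alternative: "less", "greater", or "two-sided".
    """
    z = (xbar - mu0) / (sigma / sqrt(n))
    cdf = NormalDist().cdf
    if alternative == "less":
        p = cdf(z)                   # area to the left of z
    elif alternative == "greater":
        p = 1 - cdf(z)               # area to the right of z
    else:
        p = 2 * (1 - cdf(abs(z)))    # twice the area beyond |z|
    return z, p

# Cola bottles: H0: mu = 2000 vs H1: mu < 2000
z_cola, p_cola = z_test(1999.6, 2000, 1.30, 40, "less")

# Library hours: H0: mu = 15 vs H1: mu != 15
z_lib, p_lib = z_test(16.3, 15, 3.6, 8, "two-sided")

print(f"cola:    z = {z_cola:.2f}, p-value = {p_cola:.4f}")
print(f"library: z = {z_lib:.2f}, p-value = {p_lib:.4f}")
```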
Example

New software companies that create programs for web applications believe that the average staff age at these companies is 27. A random sample of 18 staff is chosen from these companies and their ages are as follows: 41, 18, 25, 36, 26, 35, 24, 30, 28, 19, 22, 22, 26, 23, 24, 31, 22, 22. Test an appropriate hypothesis.

H0: µ = 27
H1: µ ≠ 27
n = 18, x̄ = 26.3, s = 6.15
For α = 0.05 and (18 – 1) = 17 df, the critical values of t are ±2.11
The test statistic is:

$$t_{obs} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{26.3 - 27}{6.15/\sqrt{18}} = -0.48$$

Since |t_obs| < 2.11, do not reject H0
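The summary statistics in this example can be recomputed from the raw ages. The sketch below uses the 18 values listed in the problem and the critical value 2.110 for 17 degrees of freedom from a standard t table. Working with the unrounded mean gives a test statistic of about -0.46 instead of the -0.48 obtained from the rounded x̄ = 26.3; the conclusion does not change.

```python
from math import sqrt
from statistics import mean, stdev

# Ages listed in the example.
ages = [41, 18, 25, 36, 26, 35, 24, 30, 28,
        19, 22, 22, 26, 23, 24, 31, 22, 22]
mu0 = 27  # claimed average staff age

n = len(ages)
xbar = mean(ages)
s = stdev(ages)  # sample standard deviation (divides by n - 1)

# Two-sided test: H0: mu = 27 vs H1: mu != 27
t_obs = (xbar - mu0) / (s / sqrt(n))

# t_{0.025} with n - 1 = 17 df, from a standard t table.
t_crit = 2.110
reject_h0 = abs(t_obs) > t_crit

print(f"n = {n}, x-bar = {xbar:.2f}, s = {s:.2f}, t = {t_obs:.2f}")
print(f"reject H0 at alpha = 0.05: {reject_h0}")
```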
