
Module 6: Sampling and Sampling Distribution
ISOM2500 BUSINESS STATISTICS
(L1-3, Spring 2022/23)
Jason HO
Contents
Samples and surveys
Sampling distribution of the sample mean
Central limit theorem
Sampling distribution of any statistic

2
Journey in This Course
Descriptive statistics (Module 1 & 2)
• Graphical Tools
• Numerical Tools
Building Blocks of Theory of Statistics
• Probability (Module 3)
• Random Variables (Module 4a, 4b & 5): discrete or continuous; jointly distributed
Inferential statistics (from now on)
• Estimation: Sampling Distribution
• Confidence Interval
• Hypothesis Testing
• Simple Linear Regression
3
Inferential Statistics: A General Set-up

[Diagram]
• Population: Parameter(s) to describe a characteristic of interest
• Survey: draw a Sample from the Population
• Find and compute Sample Statistic(s) to estimate the parameter
• Infer the population using the statistic(s)
4
Focus of This Module

[Diagram repeated from Slide 4: Population with Parameter(s) → Survey → Sample → find and compute Sample Statistic(s) to estimate the parameter → infer the population using the statistic(s)]
5
SAMPLES AND
SURVEYS

6
Samples and Surveys
A survey gathers information about a subgroup of entities (i.e., a sample)
that belong to a much larger group (i.e., the population), providing the
necessary ingredients, THE DATA, for parameter estimation

7
Example 1: Use of Surveys in Daily Lives
• When an election is approaching, there are nonstop reports and news about the latest opinion poll
• A retailer wants to know the market share of a brand before deciding to stock the items on its shelves
• The foreman of a warehouse will not accept a shipment of electronic components unless virtually all the components in the shipment operate correctly
• Managers in the human resources department determine the salary for new employees based on wages paid around the country

8
Representative Sample and Sampling Bias
A sample that presents a “good” snapshot of the population (i.e.,
showing/preserving systematic patterns of the population) is said
to be representative
Samples that distort the population (e.g., one that systematically
omits a portion of the population) are said to have sampling bias

To ensure a representative sample, we must pick entities/“members” of the population at random
• Simple random sampling, systematic sampling, stratified
sampling, cluster sampling and many others
9
Simple Random Sampling
A procedure that makes every sample of size n from the population
equally likely produces a simple random sample (SRS)
• The gold standard among all possible sampling methods
Note: Methods that give everyone in the population an equal
chance to be in the sample do NOT necessarily produce a
representative sample
• A shop has an equal # of male and female customers
• Flip a coin; if it lands heads, select 100 women at random; if it lands
tails, select 100 men at random
• Every sample is of a single sex, which is hardly representative!
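The coin-flip scheme above can be compared with an SRS in a short simulation. This is only a sketch: the customer list of 500 women and 500 men and the function names are assumptions made up for illustration, not part of the slides.

```python
import random

random.seed(0)

# Hypothetical shop: 500 female ("F") and 500 male ("M") customers.
customers = [("F", i) for i in range(500)] + [("M", i) for i in range(500)]

def coin_flip_sample(population, size=100):
    """Flip a fair coin, then draw `size` customers from one sex only."""
    sex = "F" if random.random() < 0.5 else "M"
    group = [c for c in population if c[0] == sex]
    return random.sample(group, size)

def simple_random_sample(population, size=100):
    """Every subset of `size` customers is equally likely."""
    return random.sample(population, size)

def share_of_women(sample):
    return sum(1 for sex, _ in sample if sex == "F") / len(sample)

coin_shares = [share_of_women(coin_flip_sample(customers)) for _ in range(1000)]
srs_shares = [share_of_women(simple_random_sample(customers)) for _ in range(1000)]

# Each customer has the same chance of selection under both schemes
# (0.5 * 100/500 = 100/1000 = 0.1), yet the coin-flip samples are
# always 0% or 100% women.
print(sorted(set(coin_shares)))
print(min(srs_shares), max(srs_shares))
```

Under the SRS the share of women varies around 50% from sample to sample, whereas the coin-flip scheme never produces a mixed sample even though every individual is equally likely to be chosen.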
10
Sample Statistic/Measure As Statistic

To estimate a population parameter (a number which is unknown to us), an intuitive
way of selecting a statistic is to use its sample counterpart, e.g.,
• use the sample mean ($\bar{X}$) to estimate a population mean ($\mu$)
• use the sample SD ($s$) to estimate a population SD ($\sigma$)
11
Sampling Variation/Sample-to-sample Variation
Sampling a small portion from a large population results in
sampling variation/sample-to-sample variation
• Every time, only n numbers are observed in a sample
• Next time, when we select another sample from the population, we will most likely get n completely different numbers
[Diagram: an SRS of size n]
This time: n particular numbers
Next time: another n numbers
……
The n numbers differ from time to time (never the same), yet there are only n numbers every time
12
Understand a Statistic
[Diagram: an SRS of size n and the sample value of any statistic]
This time: n particular numbers, sample value 1.5
Next time: another n numbers, sample value 2
……, sample value 100
We ONLY see 1 sample of size n and get 1 value!
There is no way that a point estimate such as 1.5 equals the parameter exactly! But how close is it?
It is necessary to understand how the statistic behaves as a RV
(i.e., the probability distribution of a statistic)
13
SAMPLING
DISTRIBUTION OF
THE SAMPLE MEAN

14
Population = An (Underlying) Probability Model or Distribution X or f(x)
Most statistical methods are developed under the assumption, often if not always, of an underlying probability distribution X or f(x) for the population:
• The population comprises infinitely many figures
• The percentage histogram (or the red smooth curve) of all these figures mimics the probability distribution f(x) of a RV, say, X
[Diagram: Population (infinitely many values) → percentage histogram → an underlying probability model, i.e., a random variable X or a probability distribution f(x)]

15
Data = IID Samples From X
Assume that the data arise as a representative sample of size n from
the population (with an underlying probability distribution X or f(x))
• The data are modeled as RVs, which are independent and identically distributed (iid) samples/draws from X or f(x), represented by $X_1, X_2, \ldots, X_n$
• Each data value is an independent realization from the underlying probability model X (i.e., like a random draw with replacement from all the elements which constitute the percentage histogram in Slide 15)
16
Sample Mean and its Sampling Distribution
From now on, let’s confine our statistic of interest to the sample
mean (i.e., the parameter of interest is a population mean $\mu$),
defined by
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
• As all $X_i$’s are RVs, the sample mean is a RV, and its probability
distribution is called the sampling distribution
The sampling distribution of the sample mean is the distribution of
the sample mean computed from a sample of size n. In theory, it
can be obtained from ALL possible samples of size n from the
population through repeated sampling
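Repeated sampling can be imitated on a computer. The sketch below assumes a Uniform(0, 1) population (an illustrative choice, not from the slides), draws many samples of size n, and prints a rough percentage histogram of the resulting sample means.

```python
import random
import statistics
from collections import Counter

random.seed(1)

def draw_sample(n):
    # Assumed population for illustration: Uniform(0, 1), whose mean is 0.5.
    return [random.random() for _ in range(n)]

n, reps = 25, 5000
sample_means = [statistics.fmean(draw_sample(n)) for _ in range(reps)]

# Percentage histogram of the simulated sample means (bins of width 0.02).
bin_width = 0.02
counts = Counter(round(m / bin_width) for m in sample_means)
for k in sorted(counts):
    pct = 100 * counts[k] / reps
    print(f"{k * bin_width:.2f} {'#' * round(pct)} {pct:.1f}%")

mean_of_means = statistics.fmean(sample_means)
print(mean_of_means)
```

With only 5,000 repetitions this approximates the sampling distribution; the theoretical version would use all possible samples of size n.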
17
[Figure: percentage histogram]

18
Example 1: Sampling Distribution
• https://onlinestatbook.com/stat_sim/sampling_dist/index.html

19
Example 2: Sampling Distribution of the Sample Mean

20
Normality of the Sample Mean from Normal Population
When the population is normal, the sample mean is always
normally distributed for all sample sizes (n = 1,2,3,…)
[Figure: sampling distributions of $\bar{X}$ from a normal population]

21
CENTRAL LIMIT
THEOREM

22
Central Limit Theorem
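For a random sample of size n from a population with mean $\mu$ and variance $\sigma^2$ (both finite), the sample mean $\bar{X}$ is approximately normal when n is large (≥30):
$$\bar{X} \overset{\text{approx.}}{\sim} N\left(\mu, \frac{\sigma^2}{n}\right)$$
Following from Module 5, Slide 22, $E(\bar{X}) = \mu$ and $\mathrm{Var}(\bar{X}) = \sigma^2/n$

A twin result: due to $\sum_{i=1}^{n} X_i = n\bar{X}$, the sum $\sum_{i=1}^{n} X_i$ is approximately $N(n\mu, n\sigma^2)$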
23
More Sample Means for Non-normal Population

The standardized sample mean:
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \overset{\text{approx.}}{\sim} N(0, 1)$$
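A minimal simulation sketch of the standardized sample mean for a non-normal population. The exponential population with rate 1 (so that the mean and SD both equal 1) and the sample size n = 40 are assumptions chosen for illustration.

```python
import math
import random
import statistics

random.seed(2)

# Assumed non-normal population: Exponential with rate 1, so mu = 1 and sigma = 1.
mu, sigma = 1.0, 1.0
n, reps = 40, 10000

def standardized_mean(n):
    sample = [random.expovariate(1.0) for _ in range(n)]
    xbar = statistics.fmean(sample)
    return (xbar - mu) / (sigma / math.sqrt(n))

z_values = [standardized_mean(n) for _ in range(reps)]

# If the normal approximation is reasonable, roughly 95% of the
# standardized means should fall within +/-1.96.
z_mean = statistics.fmean(z_values)
z_sd = statistics.pstdev(z_values)
coverage = sum(abs(z) <= 1.96 for z in z_values) / reps
print(round(z_mean, 3), round(z_sd, 3), coverage)
```

The population here is strongly right-skewed, yet the standardized means behave much like a standard normal variable once n is reasonably large.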


24
Example 3: Central Limit Theorem
• https://s3.amazonaws.com/he-assets-prod/interactives/051_central_limit_theorem/Launch.html

25
Importance & Takeaway From CLT
1. Justifies that the best way to estimate a population mean $\mu$ is to use the sample mean $\bar{X}$
   - Centers around $\mu$
   - Smaller variation as n increases
   - Bell-shaped
2. Provides a relatively easy way to compute approximate probabilities of averages (or sums) of RVs
3. Explains the fact that many real data are distributed like a bell-shaped curve
26
Example 4: Estimating a Population Mean

By CLT

27
Example 5: CLT
A recent report stated that the day-care cost per week in a region is
$109. Suppose this figure is taken as the mean cost per week and
that the standard deviation is known to be $20

Find the probability that a sample of 50 day-care centers would show a mean cost of $105 or less per week

Note: In this question, we are interested in the cost per week in any day-care center in the region, which is the population denoted by X, and X ~ (109, 20^2), i.e., mean 109 and variance 20^2, according to the question
28
Example 5 (Cont’d)
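The distribution of weekly cost in day-care centers, unknown to
us, may not be normal, but the sample size n = 50 is large. The CLT
applies, giving the mean cost per week of 50 day-care centers
$$\bar{X} \overset{\text{approx.}}{\sim} N\left(109, \frac{20^2}{50}\right)$$
The required probability is
$$P(\bar{X} \le 105) = P\left(Z \le \frac{105 - 109}{20/\sqrt{50}}\right) \approx P(Z \le -1.41) \approx 0.079$$
which is pretty small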
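The probability in Example 5 can be checked numerically. This sketch uses Python's `statistics.NormalDist` for the CLT approximation; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 109, 20, 50          # figures from Example 5
standard_error = sigma / sqrt(n)    # SD of the sample mean, sigma / sqrt(n)

# CLT approximation: the sample mean is roughly N(mu, sigma^2 / n).
xbar_dist = NormalDist(mu, standard_error)

prob = xbar_dist.cdf(105)           # P(sample mean <= 105)
print(round(standard_error, 3), round(prob, 4))
```

The result is slightly below 8%, in line with a table lookup at z ≈ -1.41.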


29
The Sample Proportion From a Binary Population is a Sample Mean
A population is said to be binary if there are only 2 kinds of
values/outcomes (e.g., myopia or not, left- or right-handedness)
• Need to know the proportion p of one outcome (called ‘success’)
Define a RV: X = 1 for a success and X = 0 otherwise, with $P(X = 1) = p$, so that $E(X) = p$ and $\mathrm{Var}(X) = p(1-p)$
Based on a SRS of size n from the binary population, the sample
proportion, defined as the # of successes out of n, is the sample mean
$$\hat{p} = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}$$
which is used as an estimate of the population proportion p


30
Sampling Distribution of the Sample Proportion
A special case of the CLT result at Slide 23, with its own large-sample condition:
For a SRS of size n from a binary population with proportion/mean p,
the sampling distribution of the sample proportion $\hat{p}$
• Has mean p
• Has variance $p(1-p)/n$
• Is approximately normal when np ≥ 10 and n(1-p) ≥ 10
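A small simulation sketch of these three properties. The values p = 0.3 and n = 50 are illustrative assumptions, not taken from the slides.

```python
import random
import statistics

random.seed(3)

p, n, reps = 0.3, 50, 10000   # illustrative values only

def sample_proportion(n, p):
    # Each draw counts as 1 (success) with probability p, else 0.
    return sum(random.random() < p for _ in range(n)) / n

p_hats = [sample_proportion(n, p) for _ in range(reps)]

condition_ok = n * p >= 10 and n * (1 - p) >= 10
mean_hat = statistics.fmean(p_hats)
var_hat = statistics.pvariance(p_hats)

print(condition_ok)
print(mean_hat, p)                    # simulated mean vs. p
print(var_hat, p * (1 - p) / n)       # simulated variance vs. p(1-p)/n
```

The simulated mean and variance of the sample proportion should land close to p and p(1-p)/n respectively.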

31
Example 6: Sample Proportion
• Coke bottles are filled by a machine so that the contents X have a normal distribution with mean 298ml and SD 3ml
  What is the proportion of bottles with less than 295ml?
• Let X be the content (in ml) of any Coke bottle, then $X \sim N(298, 3^2)$
• The required proportion of bottles is equivalent to
  $$P(X < 295) = P\left(Z < \frac{295 - 298}{3}\right) = P(Z < -1) \approx 0.1586$$
• What if we have a carton of 100 bottles of Coke?
  What is the chance that more than 15% of the bottles contain less than 295ml?


32

Example 6 (Cont’d 1)
Understand the relationship between the above 2 questions:
• The 1st question: Among all bottles of Coke produced by the machine, the proportion of bottles with <295ml is 15.86%. That means that if we had all the bottles in front of us, we could separate them into 2 groups: p = 15.86% of them with <295ml and 84.14% with ≥295ml
• Now, when we have a carton of 100 bottles of Coke, we are in effect sampling/selecting 100 bottles from these 2 groups (i.e., each bottle has a chance of p to have <295ml), and we want to know, among these 100 sampled bottles, the proportion of bottles with <295ml
Such a proportion is the sample proportion from a sample of size 100
33
Example 6 (Cont’d 2)
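• This proportion is a RV as it depends on which 100 bottles we have selected (i.e., sampling variation)
• We want to know the probability that this RV is greater than 15%
• The required probability is given by $P(\hat{p} > 0.15)$
• By CLT, since np = 100 × 0.1586 > 10 and n(1-p) = 100 × 0.8414 > 10,
  $$\hat{p} \overset{\text{approx.}}{\sim} N\left(0.1586, \frac{0.1586 \times 0.8414}{100}\right)$$
  so that
  $$P(\hat{p} > 0.15) \approx P\left(Z > \frac{0.15 - 0.1586}{\sqrt{0.1586 \times 0.8414/100}}\right) \approx P(Z > -0.24) \approx 0.59$$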
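The Example 6 calculation can be sketched in code as follows. The exact binomial tail is added here only as a cross-check and is not part of the slides; note that the code computes p from the normal model (about 0.1587) rather than using the rounded 0.1586.

```python
from math import comb, sqrt
from statistics import NormalDist

# Proportion of bottles under 295ml, from X ~ N(298, 3^2).
p = NormalDist(298, 3).cdf(295)
n = 100

# Normal (CLT) approximation to the sample proportion.
se = sqrt(p * (1 - p) / n)
approx = 1 - NormalDist(p, se).cdf(0.15)

# Cross-check: "more than 15% of 100 bottles" means at least 16 bottles,
# so sum the binomial probabilities for k = 16, ..., 100.
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(16, n + 1))

print(round(p, 4), round(approx, 4), round(exact, 4))
```

The two answers differ somewhat because the number of bottles is a whole number while the normal curve is continuous; the approximation is still in the right region.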

34
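Student’s t Statistic: Replacing an unknown $\sigma$ with s
When the population SD $\sigma$ is unknown, consider the standardized sample mean with $\sigma$ replaced by s:
$$T = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$
• when n ≥ 30, T is approximately N(0, 1)

• when n < 30 and the population is normal, T follows a t distribution with n − 1 degrees of freedom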
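A simulation sketch of why the small-sample case needs separate treatment. The normal population with mean 10 and SD 2, and the sample size n = 5, are illustrative assumptions.

```python
import math
import random
import statistics

random.seed(4)

mu, sigma = 10.0, 2.0   # illustrative normal population; sigma is treated as unknown
n, reps = 5, 20000

def t_statistic(n):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = statistics.fmean(sample)
    s = statistics.stdev(sample)            # sample SD (divisor n - 1)
    return (xbar - mu) / (s / math.sqrt(n))

t_values = [t_statistic(n) for _ in range(reps)]

# Under N(0, 1) only about 5% of values lie beyond +/-1.96.
# With n = 5 the statistic with s in place of sigma lands there more often.
tail_share = sum(abs(t) > 1.96 for t in t_values) / reps
print(tail_share)
```

Replacing the unknown SD by s adds extra variability, so for small n the statistic has heavier tails than the standard normal.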

35
SAMPLING
DISTRIBUTION OF
ANY STATISTIC

36
Are Other Statistics Approximately Normal?

Every statistic has a sampling distribution but, for statistics other than the
sample mean (e.g., sample variance, sample median and so on),
the distribution may be very different from being bell-shaped

Construction of an approximate sampling distribution for ANY
statistic (valid for all sample sizes and all populations):
1. take repeated samples of the same size from the population
2. construct a percentage histogram for the computed values
of the statistic over the many repeated samples
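The two steps above can be sketched as a small helper. The function name `approximate_sampling_distribution` and the exponential population are hypothetical choices made for illustration.

```python
import random
import statistics

random.seed(5)

def approximate_sampling_distribution(draw_sample, statistic, n, reps=5000):
    """Step 1: compute `statistic` on `reps` fresh samples of size n."""
    return [statistic(draw_sample(n)) for _ in range(reps)]

# Assumed population for illustration: Exponential with rate 1 (right-skewed).
def draw_exponential(n):
    return [random.expovariate(1.0) for _ in range(n)]

medians = approximate_sampling_distribution(draw_exponential, statistics.median, n=15)
variances = approximate_sampling_distribution(draw_exponential, statistics.variance, n=15)

# Step 2: percentage histogram of the sample variance (bins of width 0.5).
bins = {}
for v in variances:
    left = int(v / 0.5) * 0.5
    bins[left] = bins.get(left, 0) + 1
for left in sorted(bins):
    print(f"{left:4.1f}-{left + 0.5:4.1f}: {100 * bins[left] / len(variances):5.1f}%")
```

Any function that maps a sample to a number can be passed as `statistic`, which is what makes the repeated-sampling recipe usable for any statistic.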
The next Example looks at the sample maximum
37
Example 7: Winning the Lottery by
Betting on Birthdays
A lottery game of selecting 5 #s from the integers 1 to 39:
Grand prize won if all 5 #s match
Bet on 5 #s taken from the birthdays (day of the month) of 5 family members
No chance to win if the highest number drawn is 32 to 39
What is the probability of such a scenario?
The statistic of interest is the sample maximum: H = highest of 5
#s randomly selected without replacement from 1 to 39
• e.g., if the 5 selected #s are 3, 12, 22, 26, 28, then H = 28; or, if the
5 selected #s are 3, 12, 22, 26, 37, then H = 37
38
Example 7 (Cont’d)
Sampling distribution of H is far from being normal
[Figure: percentage histogram summarizing the values of H for 1,560 games]

Highest # over 31 occurred in 72% of the games
• No grand prize 72% of the time once the bet is made!
Indeed, the most common value of H is 39 (in 13.5% of games)
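For this particular statistic the probabilities can also be counted exactly, which gives a check on the empirical figures. The sketch below computes the exact values with `math.comb` and then simulates 1,560 games; the simulation is illustrative and is not the data behind the slide.

```python
import random
from math import comb

random.seed(6)

# H = highest of 5 numbers drawn without replacement from 1..39.
total = comb(39, 5)
p_no_win = 1 - comb(31, 5) / total   # P(H >= 32): not all 5 numbers are <= 31
p_h_39 = comb(38, 4) / total         # P(H = 39): 39 together with any 4 of the other 38

# Repeated sampling: simulate 1,560 games.
games = [max(random.sample(range(1, 40), 5)) for _ in range(1560)]
share_over_31 = sum(h > 31 for h in games) / len(games)

print(round(p_no_win, 4), round(p_h_39, 4), round(share_over_31, 3))
```

The exact values are about 70.5% for H ≥ 32 and about 12.8% for H = 39, close to the 72% and 13.5% observed over the 1,560 games.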
39
Estimation Process
[Diagram: Population X → Sample]
• Parameter of interest: Mean → Statistic: Sample mean (to be continued in the next slide)
• Parameter of interest: Summary other than mean → Statistic: Sample counterpart (a statistic other than the sample mean)
40
Statistic → Sampling distribution of the statistic
Sample mean:
• Normal X? Yes; $\sigma$ known? Yes: Slide 21
• Normal X? Yes; $\sigma$ known? No; sample size n ≥ 30? Yes or No: Slide 35
• Normal X? No; sample size n ≥ 30? Yes: Slide 23 or Slide 35
• Normal X? No; sample size n ≥ 30? No: (no result given)
Summary statistic other than sample mean:
• An approximate sampling distribution constructed via repeated sampling: Slide 37
41
Takeaway
Sampling variation; repeated sampling
Sampling distribution of the sample mean
Central Limit Theorem
Sampling distribution of the sample proportion
Sampling distribution of other sample statistics

42
