
Sampling Distributions Chapter 2

2 Sampling Distributions

Reasons for Sampling


There are many practical reasons for choosing a sample rather than a population, to
estimate a characteristic of a population:

Time
It may take too much time to contact the whole population. Even if you could contact the
whole population, the results may be meaningless because they would be out of date, e.g. if it
took 2 years to poll the voting population during the 1-year run-up to an election.

Cost

The cost may be prohibitively high, e.g. if I were to poll the 90 million people
who vote in a general election, the cost would be astronomical.

May be Physically Impossible

It may well be impossible to track the whole population down, e.g. it would be difficult to
find all 90 million voters in the run up to a general election.

Testing May Destroy Population


Testing every motor vehicle to see how well it stands up in a crash would leave
no motor vehicles to drive.

Results from the Sample are Adequate

The additional accuracy gained from testing a whole population rather than a sample may
not justify the additional time, cost or effort expended in doing so.

Section 1 - Chapter 2 63

Sampling Methods
When sampling we must ensure that we choose a sample which is representative of the
whole population.

The sampling methods that follow are just a few ways in which sampling can be carried
out. Other methods also exist which are not discussed here.

Simple Random Sampling


A simple random sample of size n from a finite population of size N is a sample selected
such that each possible sample of size n has the same probability of being selected.

A sample is selected so that each item or person in the population has the same chance of
being selected.

Example
An example of this would be writing this class’ names on pieces of paper and then
putting the names in a box and drawing names from that box.

Systematic Random Sampling


A random starting point is selected and then every kth member of the population is
selected.

The starting point is a random number between 1 and k. We then pick
every kth member after that.

k is calculated as the population size, N, divided by the sample size, n. If k is not a whole
number, round it down to the nearest whole number.

Example
If we had a population of size 2,000 and we wanted to choose a sample of size 100, then

$$k = \frac{2{,}000}{100} = 20.$$

We would then choose a random number between 1 and 20 as our
starting point.

If we choose 18 as our random starting point, then starting with the 18th observation,
every 20th observation (18, 38, 58,……) would be chosen. We would end up with a
sample of 100 observations.
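
The procedure above can be sketched in a few lines of Python. This is only an illustration: the function name `systematic_sample` and its optional `start` argument are assumptions made for this sketch, and the population is simply the numbers 1 to 2,000.

```python
import random

def systematic_sample(population, n, start=None):
    """Systematic sample of size n from a list (hypothetical helper)."""
    N = len(population)
    k = N // n                          # sampling interval, rounded down
    if start is None:
        start = random.randint(1, k)    # random start between 1 and k inclusive
    # Positions in the text are 1-based, so subtract 1 when indexing the list.
    positions = range(start, N + 1, k)
    return [population[p - 1] for p in positions][:n]

population = list(range(1, 2001))       # observations numbered 1..2000
sample = systematic_sample(population, 100, start=18)
print(sample[:3], len(sample))          # first three selected observations and sample size
```

Fixing `start=18` reproduces the worked example; leaving it out draws the starting point at random.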


Stratified Random Sampling


The population is divided up into subgroups called strata, and a simple random sample is
randomly selected from each stratum. The best results are obtained when the elements
within each stratum are much alike (homogeneous).

This is used to guarantee that each group is represented in the sample.

Example
Consider the advertising expenditure for the largest 352 companies in the United States.
Suppose we wanted to study whether firms with high returns on equity, spent more of
each sales dollar on advertising than firms with a low return or deficit.

Stratum  Profitability        Number of  Relative   Number
         (return on equity)   Firms      Frequency  Sampled
1        30% and over             8        0.02        1
2        20% up to 30%           35        0.10        5
3        10% up to 20%          189        0.54       27
4        0% up to 10%           115        0.33       16
5        Deficit                  5        0.01        1

Total                           352        1.00       50

If we used simple random sampling, firms in the 3rd and 4th strata would have a much
higher chance (87%) of being chosen, whereas firms in the 1st and 5th strata would have
little chance of being chosen and may well not be chosen at all.

If we want a sample of 50 firms, we can guarantee representation from each group by
randomly choosing the following numbers of firms (the relative frequencies are rounded, so
the products are approximate):
50 x 0.02 ≈ 1 from the 1st stratum
50 x 0.10 ≈ 5 from the 2nd stratum
50 x 0.54 ≈ 27 from the 3rd stratum
50 x 0.33 ≈ 16 from the 4th stratum
50 x 0.01 ≈ 1 from the 5th stratum
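
As a rough check, the allocation can be recomputed from the unrounded proportions (stratum size divided by 352). The sketch below is an illustration only; the variable names are assumptions. Simple rounding happens to sum to 50 for these data, but that is not guaranteed in general.

```python
# Stratum sizes taken from the table of 352 firms.
strata_sizes = {
    "30% and over": 8,
    "20% up to 30%": 35,
    "10% up to 20%": 189,
    "0% up to 10%": 115,
    "Deficit": 5,
}
sample_size = 50
total = sum(strata_sizes.values())      # 352 firms in the population

# Proportional allocation: each stratum gets its share of the sample, rounded.
allocation = {
    name: round(sample_size * size / total)
    for name, size in strata_sizes.items()
}

for name, count in allocation.items():
    print(f"{name:15s} {count}")
print("Total sampled:", sum(allocation.values()))
```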


Cluster Sampling
A population is divided into clusters using naturally occurring geographic or other
boundaries. Ideally each cluster is a representative small scale version of the population
(i.e. heterogeneous). A simple random sample of the clusters is then chosen. All elements
within each sampled (chosen) cluster form the sample.

So here, we will not have all clusters (groups) represented in our sample.

Sampling “Error”
Sampling error is the difference between a sample statistic and its corresponding
population parameter. In the case of the mean, it is

$$\bar{X} - \mu$$

Where:

$\bar{X}$ = the mean of the sample
$\mu$ = the population mean

Samples are used to estimate population characteristics. For example the mean of a
sample is used to estimate the mean of the population. However, since the sample is only
part of the population, it is unlikely that the sample mean will be exactly equal to the
population mean.

Likewise, the sample standard deviation is unlikely to be exactly the same as the
population standard deviation.

We would not be surprised then if the sample statistic is different from the corresponding
population parameter.

This difference is called the sampling error.


Example

Consider the population of 5 employees at Spurs Industries.

Last week the output for each employee was 97, 103, 96, 99 and 105 units.

Suppose we select two employees whose output was 97 and 105. The mean of this
sample is

$$\frac{97 + 105}{2} = 101.$$

Suppose we select another two employees whose output was 103 and 96. The mean of
this sample is

$$\frac{103 + 96}{2} = 99.5.$$

We know, however, that the mean of the population is

$$\frac{97 + 103 + 96 + 99 + 105}{5} = 100.$$

The sampling error for the 1st sample is 1 (101 - 100).

This was found from $\bar{X} - \mu$, where $\bar{X}$ = sample mean and $\mu$ = population mean.

The sampling error for the second sample is -0.5 (99.5 - 100).

Each of these differences, 1.0 and -0.5, is the sampling error made in estimating the
population mean based on the sample mean.

Each of the possible samples of size 2 has an equal chance of selection. Each sample may
have a different sample mean and hence, sampling error. The value of the sampling error
is based on the random selection of the sample. Therefore, sampling errors are random
and occur by chance.
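
To see how the sampling error varies from sample to sample, the short sketch below lists every possible sample of size 2 from the Spurs Industries population along with its sampling error. It is an illustration only; the variable names are assumptions.

```python
from itertools import combinations
from statistics import mean

outputs = [97, 103, 96, 99, 105]        # last week's output for the 5 employees
mu = mean(outputs)                      # population mean

# Sampling error (sample mean minus population mean) for every sample of size 2.
errors = {s: mean(s) - mu for s in combinations(outputs, 2)}

for s, err in errors.items():
    print(s, mean(s), err)
```

Across all ten possible samples the errors cancel out, which is another way of saying that the sample means average to the population mean.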

Sampling Distributions of the Sample Mean

Here our random variable will be a mean. Each observation represents the average of a
sample of size n.

Organizing the means of all possible samples of size n, into a probability distribution
would result in us obtaining the sampling distribution of the sample mean.

Example - Constructing the Sampling Distribution of the Sample Mean
Yerani Industries has seven production employees (the population). The hourly earnings
for each employee are given in the table below.

Employee Hourly Earnings

Joe $7
Sam $7
Sue $8
Bob $8
Jan $7
Art $8
Ted $9

What is the sampling distribution of the sample mean of the samples of size 2?

To arrive at the sampling distribution of the sample mean, all possible samples of size 2
need to be selected without replacement from the population, and their means computed.

There are 21 possible samples:

$$_7C_2 = \frac{7!}{2!\,5!} = 21.$$

Listed below are the 21 sample means from all samples of size 2.

Sample  Employees  Earnings  Mean     Sample  Employees  Earnings  Mean
 1      Joe, Sam   7, 7      7.00     12      Sue, Bob   8, 8      8.00
 2      Joe, Sue   7, 8      7.50     13      Sue, Jan   8, 7      7.50
 3      Joe, Bob   7, 8      7.50     14      Sue, Art   8, 8      8.00
 4      Joe, Jan   7, 7      7.00     15      Sue, Ted   8, 9      8.50
 5      Joe, Art   7, 8      7.50     16      Bob, Jan   8, 7      7.50
 6      Joe, Ted   7, 9      8.00     17      Bob, Art   8, 8      8.00
 7      Sam, Sue   7, 8      7.50     18      Bob, Ted   8, 9      8.50
 8      Sam, Bob   7, 8      7.50     19      Jan, Art   7, 8      7.50
 9      Sam, Jan   7, 7      7.00     20      Jan, Ted   7, 9      8.00
10      Sam, Art   7, 8      7.50     21      Art, Ted   8, 9      8.50
11      Sam, Ted   7, 9      8.00

Sampling Distribution of the Sample Mean for n=2

Sample Mean  Number of Means  Probability
$7.00         3               0.1429
$7.50         9               0.4286
$8.00         6               0.2857
$8.50         3               0.1429

Total        21               1.00
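
The table can be reproduced by enumerating all samples of size 2 without replacement. The following is a minimal sketch for illustration; the variable names are assumptions.

```python
from collections import Counter
from itertools import combinations

earnings = {"Joe": 7, "Sam": 7, "Sue": 8, "Bob": 8, "Jan": 7, "Art": 8, "Ted": 9}

# Mean hourly earnings for every possible pair of employees (7C2 = 21 pairs).
sample_means = [
    (earnings[a] + earnings[b]) / 2 for a, b in combinations(earnings, 2)
]
counts = Counter(sample_means)

print("Sample Mean  Number of Means  Probability")
for m in sorted(counts):
    print(f"${m:.2f}        {counts[m]:2d}               {counts[m] / len(sample_means):.4f}")
```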


Sampling Distribution of the Mean of Two Dice

Let us consider rolling a fair die an infinite number of times. We know that the possible
outcomes are 1, 2, 3, 4, 5, 6. The probability distribution of the random variable X is:

X 1 2 3 4 5 6
P(X) 1/6 1/6 1/6 1/6 1/6 1/6

The mean of this population is 3.5, from:

$$\mu = \sum x\,P(x) = 1\left(\frac{1}{6}\right) + 2\left(\frac{1}{6}\right) + 3\left(\frac{1}{6}\right) + 4\left(\frac{1}{6}\right) + 5\left(\frac{1}{6}\right) + 6\left(\frac{1}{6}\right) = 3.5$$

The variance is 2.92, from:

$$\sigma^2 = \sum (x - \mu)^2 P(x)$$

$$= (1 - 3.5)^2\left(\frac{1}{6}\right) + (2 - 3.5)^2\left(\frac{1}{6}\right) + (3 - 3.5)^2\left(\frac{1}{6}\right) + (4 - 3.5)^2\left(\frac{1}{6}\right) + (5 - 3.5)^2\left(\frac{1}{6}\right) + (6 - 3.5)^2\left(\frac{1}{6}\right) = 2.92$$

[Figure: The distribution of X]

We can create the sampling distribution of the mean of two dice by drawing samples of
size 2 from the population. For each sample of two dice we add their scores and divide by
2, that is, take the average. We have constructed a new random variable, $\bar{x}$.
Each sample mean can then be recorded. Using classical probability this will lead to the
following table:

Sample # Sample $\bar{x}$
1 1, 1 1.0
2 1, 2 1.5
3 1, 3 2.0
4 1, 4 2.5
5 1, 5 3.0
6 1, 6 3.5
7 2, 1 1.5
8 2, 2 2.0
9 2, 3 2.5
10 2, 4 3.0
11 2, 5 3.5
12 2, 6 4.0
13 3, 1 2.0
14 3, 2 2.5
15 3, 3 3.0
16 3, 4 3.5
17 3, 5 4.0
18 3, 6 4.5
19 4, 1 2.5
20 4, 2 3.0
21 4, 3 3.5
22 4, 4 4.0
23 4, 5 4.5
24 4, 6 5.0
25 5, 1 3.0
26 5, 2 3.5
27 5, 3 4.0
28 5, 4 4.5
29 5, 5 5.0
30 5, 6 5.5
31 6, 1 3.5
32 6, 2 4.0
33 6, 3 4.5
34 6, 4 5.0
35 6, 5 5.5
36 6, 6 6.0

There are 36 possible samples of size 2. Each sample outcome is equally likely and has a
probability of 1/36 of occurring. $\bar{x}$ can assume only 11 different possible values: 1.0,
1.5, ..., 6.0, with certain values of $\bar{x}$ occurring more frequently than others.


We can construct the sampling distribution of our new random variable $\bar{x}$.

Sampling Distribution of $\bar{x}$

$\bar{x}$ $P(\bar{x})$
1.0 1/36
1.5 2/36
2.0 3/36
2.5 4/36
3.0 5/36
3.5 6/36
4.0 5/36
4.5 4/36
5.0 3/36
5.5 2/36
6.0 1/36

The value $\bar{x}$ = 1.0 occurs only once, so its probability is 1/36. The value $\bar{x}$ = 3.5 occurs
6 times, so its probability is 6/36.

The mean of the sampling distribution of $\bar{x}$ is 3.5, from:

$$\mu_{\bar{x}} = \sum \bar{x}\,P(\bar{x})$$

$$= 1.0\left(\frac{1}{36}\right) + 1.5\left(\frac{2}{36}\right) + 2.0\left(\frac{3}{36}\right) + \cdots + 5.0\left(\frac{3}{36}\right) + 5.5\left(\frac{2}{36}\right) + 6.0\left(\frac{1}{36}\right) = 3.5$$

The variance is 1.46, from:

$$\sigma^2_{\bar{x}} = \sum (\bar{x} - \mu_{\bar{x}})^2 P(\bar{x})$$

$$= (1.0 - 3.5)^2\left(\frac{1}{36}\right) + (1.5 - 3.5)^2\left(\frac{2}{36}\right) + \cdots + (6.0 - 3.5)^2\left(\frac{1}{36}\right) = 1.46$$

Note that the mean of 3.5 is the same as the mean of the population of tossing a die.

Further, note that the variance of the sampling distribution of $\bar{x}$, where n = 2, is 1.46, which
is exactly half the variance of the population of the toss of a die (2.92).
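
The sampling distribution of the mean of two dice, and its mean and variance, can be checked with exact fractions. This is a minimal sketch for illustration; the variable names are assumptions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# Count how often each sample mean occurs among the 36 equally likely samples.
counts = Counter(Fraction(a + b, 2) for a, b in product(faces, repeat=2))
dist = {xbar: Fraction(c, 36) for xbar, c in counts.items()}

mean_xbar = sum(x * p for x, p in dist.items())
var_xbar = sum((x - mean_xbar) ** 2 * p for x, p in dist.items())

for xbar in sorted(dist):
    print(float(xbar), dist[xbar])
print("mean:", float(mean_xbar), "variance:", float(var_xbar))
```

Working in fractions shows that the variance is exactly 35/24, which is half of the population variance 35/12 (2.92 after rounding).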


[Figure: Sampling distribution of $\bar{X}$ for n = 2]

Compare this to the distribution of X. They are quite different distributions.

[Figure: The distribution of X]


Repeating the experiment with larger sample sizes n, the sampling distribution tends to
resemble a normal probability distribution.

[Figure: Sampling distribution of $\bar{X}$ for n = 5]

Mean of $\bar{x}$ = 3.5 and variance of $\bar{x}$ = 2.92/5

[Figure: Sampling distribution of $\bar{X}$ for n = 10]

Mean of $\bar{x}$ = 3.5 and variance of $\bar{x}$ = 2.92/10

[Figure: Sampling distribution of $\bar{X}$ for n = 25]

Mean of $\bar{x}$ = 3.5 and variance of $\bar{x}$ = 2.92/25

For each value of n, the mean of the sampling distribution is exactly the same as the
population mean.

That is:

Mean of the Population = Mean of the Sampling Distribution of sample mean

And after further investigation we see that:

The variance of the sampling distribution of the sample mean is

$$\sigma^2_{\bar{x}} = \frac{\sigma^2}{n}$$

and the standard deviation is

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}.$$

This is known as the standard error of the mean.

Where:

$\sigma$ = the standard deviation of the population
$n$ = the size of the sample

Also, as the sample size, n, increases, the sample means tend to cluster around the true
population mean.
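
The relationship between the variance of the sample mean and n can be verified exactly for the dice example. The sketch below builds the distribution of the total of n dice by repeated convolution; the function names are assumptions made for this illustration.

```python
from fractions import Fraction

def sum_distribution(n):
    """Exact distribution of the total of n fair dice (repeated convolution)."""
    dist = {0: Fraction(1)}
    for _ in range(n):
        new = {}
        for total, p in dist.items():
            for face in range(1, 7):
                new[total + face] = new.get(total + face, 0) + p * Fraction(1, 6)
        dist = new
    return dist

def mean_and_variance_of_xbar(n):
    """Mean and variance of the sample mean of n dice, as exact fractions."""
    dist = sum_distribution(n)
    mean = sum(Fraction(t, n) * p for t, p in dist.items())
    var = sum((Fraction(t, n) - mean) ** 2 * p for t, p in dist.items())
    return mean, var

for n in (2, 5, 10, 25):
    mean, var = mean_and_variance_of_xbar(n)
    print(n, float(mean), float(var), float(Fraction(35, 12) / n))
```

For each n the last two printed columns agree: the variance of the sample mean equals the population variance (35/12, or about 2.92) divided by n.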

Central Limit Theorem

The sampling distribution of the mean of a random sample drawn from any population is
approximately normal for a sufficiently large sample size, typically taken to be at least 30
observations. The larger the sample size, the more closely the sampling distribution will
resemble a normal distribution.

This means that as the sample size, n, gets larger, the sample means tend to follow a
normal probability distribution and tend to cluster around the true population mean. This
holds regardless of the distribution of the population from which the sample was drawn.

In summary, regardless of the type of distribution from which one draws a random sample,
the sampling distribution of the sample mean will be normal, or approximately normal,
under certain conditions:

1. if the population distribution is normal, the sampling distribution will be normal
regardless of sample size.
2. if the population distribution is approximately normal, the sampling distribution will
be approximately normal.
3. if the population is not normal, the sampling distribution will be approximately
normal if the sample is large enough, typically taken to be at least 30.
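
A small simulation can illustrate the third case. The sketch below is an illustration under stated assumptions: the population is taken to be exponential with mean 1 and standard deviation 1 (a strongly skewed distribution chosen only for this example), the sample size is 30, and the seed and number of trials are arbitrary.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)                  # fixed seed so the run is repeatable
n, trials = 30, 5000                    # sample size and number of samples drawn

# Each entry is the mean of one sample of size n from the skewed population.
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(trials)
]

mean_of_means = mean(sample_means)
sd_of_means = stdev(sample_means)
standard_error = 1.0 / sqrt(n)          # sigma / sqrt(n) with sigma = 1

# Share of sample means within one standard error of the population mean.
share_within_1se = sum(abs(m - 1.0) <= standard_error for m in sample_means) / trials

print(round(mean_of_means, 3), round(sd_of_means, 3), round(standard_error, 3))
print(round(share_within_1se, 3))
```

The mean of the sample means should come out close to 1, their standard deviation close to the standard error, and roughly two thirds of them should fall within one standard error of the population mean, as a normal distribution would suggest.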

Example
Here we have an underlying normal distribution x, with mean = $\mu$ and
standard deviation = $\sigma$.

[Figure: Normal distribution of x, with mean = $\mu$ and standard deviation = $\sigma$]

We will generate the $\bar{x}$ distribution (sample size n), with mean = $\mu$ and standard
deviation = $\sigma/\sqrt{n}$. This will also be a normal distribution, according to the central limit
theorem.

[Figure: Normal distribution of $\bar{x}$, with mean = $\mu$ and standard deviation = $\sigma/\sqrt{n}$]

Finally we will standardize the distribution of $\bar{x}$, giving us a standard normal distribution with
mean = 0 and standard deviation = 1.

[Figure: Standard normal distribution of z, with mean = 0 and standard deviation = 1]

This rule holds true for any underlying distribution of x. So even if the underlying
distribution, x, was not normal, the distribution of $\bar{x}$ would still be approximately normal
with mean = $\mu$ and standard deviation = $\sigma/\sqrt{n}$. This is the result of the central limit
theorem, and we must bear in mind that the sample size n must be at least 30.


Standard Deviation of Sample Means - Standard Error of the Mean

When the population standard deviation, $\sigma$, is KNOWN:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

When the population standard deviation, $\sigma$, is UNKNOWN:

$$s_{\bar{x}} = \frac{s}{\sqrt{n}}$$

Where:

$\sigma_{\bar{x}}$ = the standard deviation of the sample means
$\sigma$ = the standard deviation of the population
$s$ = the standard deviation of the sample
$s_{\bar{x}}$ = estimate of the standard deviation of the sample mean
$n$ = sample size

Mean of the Sample Means - with Known Population Mean

$$\mu_{\bar{x}} = \mu$$

$\mu_{\bar{x}}$ = the mean of the sample means
$\mu$ = the mean of the population

Mean of the Sample Means - with Unknown Population Mean

Here we take the average of the sample means and use that as an approximation to the
population mean. It is denoted by $\bar{\bar{x}}$.


Solving Sample Mean Probability Problems

Since the sample means tend to follow a normal probability distribution – (we know this
by looking at the Central Limit Theorem) we can use the ideas discussed earlier to
compute the probability that a sample mean will fall within a certain range.

We will want to convert any normal distribution to a STANDARD normal distribution.

We will convert using one of the following formulas.

When the population standard deviation and population mean are both KNOWN:

$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$

When the population standard deviation is UNKNOWN and the population mean is
KNOWN:

$$z = \frac{\bar{X} - \mu}{s / \sqrt{n}}$$

When the population standard deviation is KNOWN and the population mean is
UNKNOWN:

$$z = \frac{\bar{X} - \bar{\bar{x}}}{\sigma / \sqrt{n}}$$

When the population standard deviation and population mean are both UNKNOWN:

$$z = \frac{\bar{X} - \bar{\bar{x}}}{s / \sqrt{n}}$$

Where:

$\bar{X}$ = variable for sample mean
$\mu$ = mean of the population
$\sigma$ = the standard deviation of the population
$s$ = the standard deviation of the sample
$n$ = sample size
$\bar{\bar{x}}$ = the mean of the sample means


Example

The foreman of a bottling plant has observed that the amount of soda in each 32-ounce
bottle is actually a normally distributed random variable, with mean of 32.2 ounces and a
standard deviation of 0.3 ounces.

a) If a customer buys one bottle, what is the probability that the bottle will contain more
than 32 ounces?

b) If a customer buys a carton of four bottles, what is the probability that the mean
amount of the four bottles will be greater than 32 ounces?

Solution a.

(The solution uses table B8. You may also use tables B81 & B82 and Excel to solve. The
methods to achieve this were covered at great length in the previous chapter.)

Let X be the random variable representing the amount of soda in one bottle.
It is normally distributed with mean = 32.2 and SD = 0.3.

$$P(X > 32) = P\left(\frac{X - \text{mean}}{\text{SD}} > \frac{32 - 32.2}{0.3}\right) = P(Z > -0.67)$$

[Figure: Standard normal distribution (mean = 0, SD = 1); the required area is where z is greater than -0.67]

$$= P(-0.67 < Z < 0) + 0.5$$

$$= 0.2486 + 0.5 = 0.7486$$


Solution b.

(The solution uses table B8. You may also use tables B81 & B82 or Excel to solve. The
methods to achieve this were covered at great length in the previous chapter.)

Let $\bar{X}$ be the random variable representing the average amount of soda in four bottles.
It is normally distributed with mean = 32.2 and

$$\text{SD} = \frac{0.3}{\sqrt{4}} = 0.15$$

$$P(\bar{X} > 32) = P\left(\frac{\bar{X} - \text{mean}}{\text{SD}} > \frac{32 - 32.2}{0.15}\right) = P(Z > -1.33)$$

[Figure: Standard normal distribution (mean = 0, SD = 1); the required area is where z is greater than -1.33]

$$= P(-1.33 < Z < 0) + 0.5$$

$$= 0.4082 + 0.5 = 0.9082$$

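
Both parts of the bottling example can be checked numerically. The sketch below is an illustration only: it computes the standard normal CDF from `math.erf` instead of reading a table, and the helper names `normal_cdf` and `prob_mean_greater` are assumptions made for this example.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def prob_mean_greater(x, mu, sigma, n=1):
    """Return (z, P(sample mean > x)) for a sample of size n."""
    se = sigma / sqrt(n)                # standard error of the mean
    z = (x - mu) / se
    return z, 1.0 - normal_cdf(z)

z_a, p_a = prob_mean_greater(32, 32.2, 0.3)         # part a: one bottle
z_b, p_b = prob_mean_greater(32, 32.2, 0.3, n=4)    # part b: mean of four bottles

print(round(z_a, 2), round(p_a, 4))
print(round(z_b, 2), round(p_b, 4))
```

The results should be close to the table-based answers of 0.7486 and 0.9082. Small differences in the third decimal place are expected, because the table method rounds z to two decimals before the lookup.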

Example

Scores on a real estate exam are normally distributed with mean 430 and standard deviation 20.
If we randomly select 50 exams, what is the probability that the sample mean of these 50
exams will exceed a score of 458?

(The solution uses table B8. You may also use tables B81/B82 and Excel to solve. The
methods to achieve this were covered at great length in the previous chapter.)

The distribution of X has mean $\mu$ = 430 and standard deviation $\sigma$ = 20.

The distribution of $\bar{X}$ is normal and has mean $\mu$ = 430 and standard deviation
$\sigma/\sqrt{n} = 20/\sqrt{50}$.

$\bar{X}$ can be standardized so that we can use the tables:

$$P(\bar{X} > 458) = P\left(Z > \frac{458 - 430}{20/\sqrt{50}}\right) = P\left(Z > \frac{28}{2.828}\right) = P(Z > 9.899)$$

We require $P(Z > 9.899)$. Graphically this is:

[Figure: Standard normal distribution (mean = 0, SD = 1); the required area is where z is greater than 9.899]

This is approximately zero.
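
To see just how small this probability is, the upper tail can be computed directly. The sketch below is an illustration; it uses `math.erfc`, which stays accurate far out in the tail, and the variable names are assumptions.

```python
from math import erfc, sqrt

mu, sigma, n = 430, 20, 50
se = sigma / sqrt(n)                    # standard error of the mean, about 2.828
z = (458 - mu) / se                     # standardized value of the sample mean

# Upper-tail probability P(Z > z) of the standard normal distribution.
p_upper = 0.5 * erfc(z / sqrt(2))

print(round(se, 3), round(z, 3), p_upper)
```

The probability is positive but far smaller than anything a printed z table can show, which is why the text treats it as zero.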


Sampling Distribution of the Proportion

We may be interested in testing measures other than the sample mean. We may be
interested in measuring the percentage of people in the work force that would opt for
early retirement. Each person has two choices of either agreeing with early retirement or
not. This experiment follows a binomial probability distribution (there are three other
conditions; see the end of chapter 3).

The sample proportion, $\bar{p}$, can be calculated by

$$\bar{p} = \frac{\text{number of successes in the sample}}{\text{sample size}}$$

This is just the probability of success based on the sample.

As we do not know the proportion of people in the population of the workforce that
would opt for early retirement we can take samples and calculate the approximate
population proportion.

If the samples are large enough we may use the normal distribution as an approximation
to the binomial.

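The conditions that must apply for this to be the case are:

$n \times p \geq 5$ and $n \times q \geq 5$, where:

$p$ = the probability of success
$q$ = the probability of failure ($q = 1 - p$)
$n$ = the size of the sample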

Taking a number of samples of workers and working out the sample proportion, $\bar{p}$, for
each is the first step to finding an approximation for the population proportion. We may
then take all of the sample proportions and calculate the average. This will give us an
estimate of the population proportion.

The standard deviation of such a distribution, known as the standard error of the
proportion, is denoted by $\sigma_{\bar{p}}$, where:

$$\sigma_{\bar{p}} = \sqrt{\frac{p(1 - p)}{n}}$$


Example

Suppose we take 10 sample groups of 150 people in each group, and record the number
of people in each group that agree with early retirement. The following are the results:

Sample  Number of  Sample
Group   Successes  Proportion
 1      26         26/150 = 0.173
 2      18         18/150 = 0.120
 3      21         21/150 = 0.140
 4      30         30/150 = 0.200
 5      24         24/150 = 0.160
 6      21         21/150 = 0.140
 7      16         16/150 = 0.107
 8      28         28/150 = 0.187
 9      35         35/150 = 0.233
10      27         27/150 = 0.180

Averaging these all out gives an approximation for the population proportion:

$$\frac{0.173 + 0.12 + 0.14 + 0.2 + 0.16 + 0.14 + 0.107 + 0.187 + 0.233 + 0.18}{10} = 0.164$$

The standard error of the proportion for this sampling distribution is:

$$\sigma_{\bar{p}} = \sqrt{\frac{p(1 - p)}{n}} = \sqrt{\frac{0.164(1 - 0.164)}{150}} = \sqrt{0.000914} = 0.030$$

Now we can answer such questions as, “what is the probability that 20% or less of a
sample of the workforce will agree with early retirement?” We already have the mean and
standard error and we know we can use the normal distribution to approximate the binomial.

That is, $P(\bar{p} < 0.20)$.

[Figure: Sampling distribution of the proportion, with 0.2 marked on the p bar axis]

Standardizing gives

$$P\left(\frac{\bar{p} - 0.164}{0.030} < \frac{0.20 - 0.164}{0.030}\right) = P(Z < 1.20) = 0.8849$$

[Figure: Sampling distribution of the proportion, with area 0.8849 to the left of 1.2 on the z axis]

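
The early retirement example can be reproduced end to end in a few lines. This is a sketch for illustration only: the variable names are assumptions, and the normal CDF is computed from `math.erf` instead of a table.

```python
from math import erf, sqrt

successes = [26, 18, 21, 30, 24, 21, 16, 28, 35, 27]   # per sample group
n = 150                                                 # people in each group

proportions = [x / n for x in successes]
p_est = sum(proportions) / len(proportions)     # estimate of the population proportion
se = sqrt(p_est * (1 - p_est) / n)              # standard error of the proportion

# Conditions for the normal approximation: n*p >= 5 and n*q >= 5.
normal_ok = n * p_est >= 5 and n * (1 - p_est) >= 5

z = (0.20 - p_est) / se
prob = 0.5 * (1 + erf(z / sqrt(2)))             # P(sample proportion < 0.20)

print(round(p_est, 3), round(se, 3), normal_ok)
print(round(z, 2), round(prob, 4))
```

Because this sketch keeps the unrounded standard error, z comes out slightly below the 1.20 used in the text, and the probability is correspondingly a little below 0.8849.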

STATS
Exercises 2 Normal Distribution – Sampling & Central Limit Theorem

1. A normal population has a mean of 60 and a standard deviation of 12. You select a
random sample of 9. Compute the probability the sample mean is:

a. Greater than 63.
b. Less than 56.
c. Between 56 and 63.

2. A population of unknown shape has a mean of 75. You select a sample of 40. The
standard deviation of the sample is 5. Compute the probability the sample mean is:

a. Less than 74.
b. Between 74 and 76.
c. Between 76 and 77.
d. Greater than 77.

3. The mean rent for a one-bedroom apartment in Southern California is $2,200 per
month. The distribution of the monthly rent does not follow the normal distribution. In
fact, it is positively skewed. What is the probability of selecting a sample of 50 one-
bedroom apartments and finding the mean to be at least $1,950 per month? The standard
deviation of the sample is $250.

4. According to an IRS study, it takes an average of 330 minutes for taxpayers to prepare,
copy, and electronically file a 1040 tax form. A consumer watchdog agency selects a
random sample of 40 taxpayers and finds the standard deviation of the time to prepare,
copy, and electronically file form 1040 is 80 minutes.

a. What assumption or assumptions do you need to make about the shape of the
population?
b. What is the standard error of the mean?
c. What is the likelihood the sample mean is greater than 320 minutes?
d. What is the likelihood the sample mean is between 320 and 350 minutes?
e. What is the likelihood the sample mean is greater than 350 minutes?
