
Sampling Distribution

If you compute the mean of a sample of 10 numbers, the value you obtain will not equal
the population mean exactly; by chance it will be a little bit higher or a little bit lower. If
you sampled sets of 10 numbers over and over again (computing the mean for each set),
you would find that some sample means come much closer to the population mean than
others. Some would be higher than the population mean and some would be lower.
Imagine sampling 10 numbers and computing the mean over and over again, say about
1,000 times, and then constructing a relative frequency distribution of those 1,000 means.
This distribution of means is a very good approximation to the sampling distribution of
the mean. The sampling distribution of the mean is a theoretical distribution that is
approached as the number of samples in the relative frequency distribution increases.
With 1,000 samples, the relative frequency distribution is quite close; with 10,000 it is
even closer. As the number of samples approaches infinity, the relative frequency
distribution approaches the sampling distribution.
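
The repeated-sampling idea described above can be tried out with a short simulation. This is only a sketch: the population (normal, mean 50, standard deviation 10), the seed, and the helper name simulate_sample_means are assumptions made for illustration and do not come from the text.

```python
import random
import statistics

def simulate_sample_means(population_draw, sample_size, num_samples, rng):
    """Return a list of sample means, one per simulated sample."""
    means = []
    for _ in range(num_samples):
        sample = [population_draw(rng) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: normal with mean 50 and standard deviation 10.
def draw(r):
    return r.gauss(50, 10)

# 1,000 samples of 10 numbers each, as in the thought experiment.
means_1000 = simulate_sample_means(draw, sample_size=10, num_samples=1000, rng=rng)
print("mean of the 1,000 sample means:", round(statistics.fmean(means_1000), 2))
print("std. dev. of the sample means: ", round(statistics.stdev(means_1000), 2))
```

Raising num_samples makes the relative frequency distribution of the means a closer approximation to the sampling distribution for a sample size of 10; changing sample_size gives an approximation to a different sampling distribution.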

The sampling distribution of the mean for a sample size of 10 was just an example; there
is a different sampling distribution for other sample sizes. Also, keep in mind that the
relative frequency distribution approaches a sampling distribution as the number of
samples increases, not as the sample size increases since there is a different sampling
distribution for each sample size.

A sampling distribution can also be defined as the relative frequency distribution that
would be obtained if all possible samples of a particular sample size were taken. For
example, the sampling distribution of the mean for a sample size of 10 would be
constructed by computing the mean for each of the possible ways in which 10 scores
could be sampled from the population and creating a relative frequency distribution of
these means. Although these two definitions may seem different, they are actually the
same: Both procedures produce exactly the same sampling distribution.
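
For a very small population the "all possible samples" definition can be carried out by brute force. The sketch below assumes a made-up population of three equally likely scores and sampling with replacement; neither assumption comes from the text.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

population = [1, 2, 3]   # hypothetical population, each score equally likely
sample_size = 2

# Every ordered sample of size 2 drawn with replacement (3**2 = 9 samples).
all_samples = list(product(population, repeat=sample_size))
means = [Fraction(sum(s), sample_size) for s in all_samples]

# Relative frequency of each possible value of the sample mean.
counts = Counter(means)
sampling_distribution = {m: Fraction(c, len(means)) for m, c in sorted(counts.items())}

for m, p in sampling_distribution.items():
    print(f"mean = {float(m):.1f}  relative frequency = {p}")
```

With only nine possible samples the whole sampling distribution can be listed, which is not practical for realistic populations; there the repeated-sampling approach is the usual way to approximate it.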

Statistics other than the mean have sampling distributions too. The sampling distribution
of the median is the distribution that would result if the median instead of the mean were
computed in each sample.

Students often define "sampling distribution" as the sampling distribution of the mean.
That is a serious mistake.

Sampling distributions are very important since almost all inferential statistics are based
on sampling distributions.

Sampling distribution of the mean

The sampling distribution of the mean is a very important distribution. In later chapters
you will see that it is used to construct confidence intervals for the mean and for
significance testing.

Given a population with a mean of μ and a standard deviation of σ, the sampling
distribution of the mean has a mean of μ and a standard deviation of $\sigma/\sqrt{N}$, where N is the
sample size. The standard deviation of the sampling distribution of the mean is called the
standard error of the mean. It is designated by the symbol $\sigma_M$. Note that the spread of the
sampling distribution of the mean decreases as the sample size increases.

An example of the effect of sample size is shown above. Notice that the mean of the
distribution is not affected by sample size.
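
The effect of sample size on the spread can also be tabulated directly from the formula $\sigma_M = \sigma/\sqrt{N}$. The function name and the population standard deviation of 100 below are illustrative choices only.

```python
import math

def standard_error_of_mean(sigma, n):
    """Standard deviation of the sampling distribution of the mean: sigma / sqrt(N)."""
    return sigma / math.sqrt(n)

sigma = 100  # illustrative population standard deviation
for n in (1, 4, 16, 64):
    print(f"N = {n:2d}  standard error = {standard_error_of_mean(sigma, n):.1f}")
```

Each fourfold increase in N halves the standard error, while the mean of the sampling distribution stays at μ.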

Standard error

The standard error of a statistic is the standard deviation of the sampling distribution of
that statistic. Standard errors are important because they reflect how much sampling
fluctuation a statistic will show. The inferential statistics involved in the construction of
confidence intervals and significance testing are based on standard errors. The standard
error of a statistic depends on the sample size. In general, the larger the sample size the
smaller the standard error. The standard error of a statistic is usually designated by the
Greek letter sigma (σ) with a subscript indicating the statistic. For instance, the standard
error of the mean is indicated by the symbol: σM.

Central Limit Theorem


The central limit theorem states that given a distribution with a mean μ and variance $\sigma^2$,
the sampling distribution of the mean approaches a normal distribution with a mean (μ)
and a variance $\sigma^2/N$ as N, the sample size, increases.

The amazing and counter-intuitive thing about the central limit theorem is that no matter
what the shape of the original distribution, the sampling distribution of the mean
approaches a normal distribution. Furthermore, for most distributions, a normal
distribution is approached very quickly as N increases. Keep in mind that N is the sample
size for each mean and not the number of samples. Remember in a sampling distribution
the number of samples is assumed to be infinite. The sample size is the number of scores
in each sample; it is the number of scores that goes into the computation of each mean.

On the next page are shown the results of a simulation exercise to demonstrate the central
limit theorem. The computer sampled N scores from a uniform distribution and computed
the mean. This procedure was performed 500 times for each of the sample sizes 1, 4, 7,
and 10.

Below are shown the resulting frequency distributions each based on 500 means. For N =
4, 4 scores were sampled from a uniform distribution 500 times and the mean computed
each time. The same method was followed with means of 7 scores for N = 7 and 10
scores for N = 10.

Two things should be noted about the effect of increasing N:

1. The distributions become more and more normal.


2. The spread of the distributions decreases.
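
A simulation along the lines of the one described can be sketched as follows. The text does not give the range of the uniform distribution or any implementation details, so the uniform(0, 1) population, the seed, and the function name are assumptions.

```python
import random
import statistics

def simulate_means(n, num_means, rng):
    """Means of num_means samples, each of n scores drawn from uniform(0, 1)."""
    return [statistics.fmean([rng.random() for _ in range(n)])
            for _ in range(num_means)]

rng = random.Random(1)  # fixed seed so the run is repeatable
spread = {}
for n in (1, 4, 7, 10):
    means = simulate_means(n, 500, rng)  # 500 means per sample size, as in the text
    spread[n] = statistics.stdev(means)
    print(f"N = {n:2d}  mean of means = {statistics.fmean(means):.3f}  "
          f"std. dev. of means = {spread[n]:.3f}")
```

The printed standard deviations shrink as N grows, which is the second effect listed above. Plotting a histogram of each set of 500 means would show the first effect, the shape moving toward the normal curve.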

Area under the sampling distribution of the mean

Assume a test with a mean of 500 and a standard deviation of 100. Which is more likely:
(1) that the mean of a sample of 5 people is greater than 580 or (2) that the mean of a
sample of 10 people is greater than 580? Using your intuition, you may have been able to
figure out that a mean over 580 is more likely to occur with the smaller sample. One way
to approach problems of this kind is to think of the extremes. What is the probability that
the mean of a sample of 1000 people would be greater than 580? The probability is
practically zero since the mean of a sample that large will almost certainly be very close
to the population mean. The chance that it is more than 80 points away is practically nil.
On the other hand, with a small sample, the sample mean could very well be as many as
80 points from the population mean. Therefore, the larger the sample size, the less likely
it is that a sample mean will deviate greatly from the population mean. It follows that it is
more likely that the sample of 5 people will have a mean greater than 580 than will the
sample of 10 people.

To figure out the probabilities exactly, it is necessary to make the assumption that the
distribution is normal. Given normality and the formula for the standard error of the
mean, the probability that the mean of 5 students is over 580 can be calculated in a
manner almost identical to that used in calculating the area under portions of the
normal curve.

Since the question involves the probability of a mean of 5 numbers being over 580, it is
necessary to know the distribution of means of 5 numbers. But that is simply the
sampling distribution of the mean with an N of 5. The mean of the sampling distribution
of the mean is μ (500 in this example) and the standard deviation is $\sigma_M = \sigma/\sqrt{N} = 100/\sqrt{5}$ = 100/2.236 =
44.72. The sampling distribution of the mean is shown below.

The area to the left of 580 is shaded. What proportion of the curve is below 580? Since
580 is 80 points above the mean and the standard deviation is 44.72, 580 is 80/44.72 =
1.79 standard deviations above the mean.

The formula for z used here is a special case of the general formula for z: $z = \frac{X - \mu}{\sigma}$.
Since the distribution of interest is the distribution of means, the formula can be rewritten
as: $z = \frac{M - \mu}{\sigma_M}$ where M is a sample mean, μ is the mean of the distribution of means which is
equal to the population mean, and $\sigma_M$ is the standard error of the mean.

In general, when a problem asks about the probability of a mean, the sampling
distribution of the mean should be used. The standard error of the mean is used as the
standard deviation. Continuing with the calculations,

M = 580
μ = 500
$\sigma_M$ = 44.72

and therefore, $z = \frac{580 - 500}{44.72}$ = 1.79.

From a z table, it can be determined that 0.96 of the distribution is below 1.79. Therefore
the probability that the mean of 5 numbers will be greater than 580 is only 0.04. The
calculation of the probability with N = 10 is similar. The standard error of the mean ($\sigma_M$)
is equal to $100/\sqrt{10}$ = 31.62 which, of course, is smaller than the value of 44.72
obtained for N = 5.

Using the formula $z = \frac{M - \mu}{\sigma_M}$ to calculate z and a z table to calculate the
probability, it can be determined that the probability of obtaining a mean based on N = 10
that is greater than 580 is only 0.01. As expected, this is much lower than the probability
of .04 obtained for N = 5.
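
As a rough check on the two probabilities, the calculation can be done without a z table. The sketch below uses Python's statistics.NormalDist and the same normality assumption as the text; the helper name prob_mean_above is made up for illustration. Because it does not round z, its results agree with the table-based values only after rounding to two decimals.

```python
import math
from statistics import NormalDist

def prob_mean_above(cutoff, mu, sigma, n):
    """P(sample mean > cutoff) when the sampling distribution of the mean is normal."""
    standard_error = sigma / math.sqrt(n)
    z = (cutoff - mu) / standard_error
    return 1 - NormalDist().cdf(z)  # area above z under the standard normal curve

p5 = prob_mean_above(580, mu=500, sigma=100, n=5)
p10 = prob_mean_above(580, mu=500, sigma=100, n=10)
print(f"N = 5:  P(M > 580) = {p5:.3f}")
print(f"N = 10: P(M > 580) = {p10:.3f}")
```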

Summing up, finding an area under the sampling distribution of the mean is the same as
finding an area below any normal curve. In this case, the normal curve is the sampling
distribution of the mean. It has a mean of μ and a standard deviation of $\sigma_M = \sigma/\sqrt{N}$.

Sampling distribution, difference between independent means

This section applies only when the means are computed from independent samples. The
formulas are more complicated when the two means are not independent. Let's say that a
researcher has come up with a drug that improves memory. Consider two hypothetical
populations: the performance of people on a memory test if they had taken the drug and
the performance of people if they had not. Assume that the mean (μ) and the variance ($\sigma^2$)
of the distribution of people taking the drug are 50 and 25 respectively and that the mean
(μ) and the variance ($\sigma^2$) of the distribution of people not taking the drug are 40 and 24
respectively. It follows that the drug, on average, improves performance on the memory
test by 10 points. This 10-point improvement is for the whole population. Now consider
the sampling distribution of the difference between means. This distribution can be
understood by thinking of the following sampling plan:

Sample n scores from the population of people taking the drug and compute the mean.
This mean will be designated as M1. Then, sample n scores from the population of people
not taking the drug and compute the mean. This mean will be designated as M2. Finally
compute the difference between M1 and M2. This difference will be called Md where the
"d" stands for "difference." This is the statistic whose sampling distribution is of interest.
The sampling distribution could be approximated by repeating the above sampling
procedure over and over while plotting each value of Md. The resulting frequency
distribution would be an approximation to the sampling distribution. The mean and the
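variance of the sampling distribution of Md are:

$\mu_{M_d} = \mu_1 - \mu_2$ and $\sigma^2_{M_d} = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$.

If $\sigma_1^2 = \sigma_2^2 = \sigma^2$ and n1 = n2 = n then

$\sigma^2_{M_d} = \frac{2\sigma^2}{n}$.

For the present example,

$\mu_{M_d}$ = 50 - 40 = 10 and $\sigma^2_{M_d} = \frac{25}{n_1} + \frac{24}{n_2}$.

If n1 = 10 and n2 = 8 then $\sigma^2_{M_d}$ = 25/10 + 24/8 = 5.5.

Finally the standard error of Md is simply the square root of the variance of the sampling
distribution of Md. So, $\sigma_{M_d} = \sqrt{5.5}$ = 2.35.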

Once you know the mean and the standard error of the sampling distribution of the
difference between means, you can answer questions such as the following: Given that
one experiment with the memory drug just described is conducted, what is the probability
that the mean of the group of 10 subjects getting the drug will be 15 or more points
higher than the mean of the 8 subjects not getting the drug? Right away it can be
determined that the chances are not very high since on average the difference between the
drug and no-drug groups is only 10.

A look at a graph of the sampling distribution of Md makes the problem more concrete. The
mean of the sampling distribution is 10 and the standard deviation is 2.35. The graph shows the
mean of the distribution. Tick marks are one standard deviation apart. Back to the question: What
is the probability that the drug group will score 15 or more points higher? The blackened region
of the graph is 15 or higher: It is a small portion of the area. The probability can be determined
by computing the number of standard deviations above the mean 15 is. Since the mean is 10 and
the standard deviation is 2.35, 15 is (15-10)/2.35 = 2.13 standard deviations above the mean.
From a z table it can be determined that .983 of the area is below 2.13; therefore, .017 of the area
is above. It follows that the probability of a difference between the drug and no-drug means of 15
or larger is .017.
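
The memory-drug calculation can be reproduced in a few lines. This is a sketch under the same assumptions as the text (independent samples, normal sampling distribution); the function name is hypothetical. Since z is not rounded, the probability it prints is close to, but not digit-for-digit identical with, the table-based .017.

```python
import math
from statistics import NormalDist

def diff_means_distribution(mu1, var1, n1, mu2, var2, n2):
    """Normal sampling distribution of Md = M1 - M2 for independent samples."""
    mean = mu1 - mu2
    variance = var1 / n1 + var2 / n2
    return NormalDist(mean, math.sqrt(variance))

# Drug group: mean 50, variance 25, n = 10. No-drug group: mean 40, variance 24, n = 8.
dist_md = diff_means_distribution(50, 25, 10, 40, 24, 8)
p_15_or_more = 1 - dist_md.cdf(15)

print(f"mean of Md = {dist_md.mean:.1f}, variance = {dist_md.variance:.1f}, "
      f"standard error = {dist_md.stdev:.2f}")
print(f"P(Md >= 15) = {p_15_or_more:.3f}")
```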

The main difference between this problem and simple problems involving the area under
the normal distribution is that in this problem you first had to figure out the mean and
standard deviation of the sampling distribution of the difference between two means. In
the simpler problems involving the area under the normal distribution, you were always
given the mean and standard deviation of the distribution. The formula used there:

$z = \frac{X - \mu}{\sigma}$ is used here in a different form: $z = \frac{M_d - \mu_{M_d}}{\sigma_{M_d}}$.

Since the problem now concerns differences between means, the relevant statistic is the
difference between means obtained in the sample (experiment), the relevant population
mean (μ) is the mean of the sampling distribution of the difference between means, and
the relevant standard deviation is the standard deviation of the sampling distribution of
the difference between means.

Sampling distribution of a linear combination of means

Assume there are k populations each with the same variance ($\sigma^2$). Further assume that (1)
n subjects are sampled randomly from each population with the mean computed for each
sample and (2) a linear combination of these means is computed by multiplying each
mean by a coefficient and summing the results. Let the linear combination be designated
by the letter "L." If this sampling procedure were repeated over and over again, a
different value of L would be obtained each time. It is this distribution of the values of L
that makes up the sampling distribution of a linear combination of means. The
importance of linear combinations of means can be seen in the section "Confidence
interval on linear combination of means" where it is shown that many experimental
hypotheses can be stated in terms of linear combinations of the mean and that the choice
of coefficients determines the hypothesis tested. The formula for L is: L = a1M1 + a2M2 +
... + akMk where M1 is the mean of the numbers sampled from Population 1, M2 is the
mean of the numbers sampled from Population 2, etc. The coefficient a1 is used to
multiply the first mean, a2 is used to multiply the second mean, etc.
Assuming the means from which L is computed are independent, the mean and standard
deviation of the sampling distribution of L are: μL = a1 μ1 + a2 μ2 + ... + ak μk and

$\sigma_L = \sqrt{\frac{\sigma^2 \sum a_i^2}{n}}$

where μi is the mean for population i, $\sigma^2$ is the variance of each
population, and n is the number of elements sampled from each population. Consider an
example application using the sampling distribution of L. Assume that on a test of
reading ability, the population means for 10, 12, and 14 year olds are 60, 68, and 80
respectively. Further assume that the variance within each of these three populations is
100. Then, μ1 = 60, μ2 = 68, μ3 = 80, and $\sigma^2$ = 100. If eight 10-year-olds, eight 12-year-olds
and eight 14-year-olds are sampled randomly, what is the probability that the mean
for the 14 year olds will be 15 or more points higher than the average of the means for the
two younger groups?

Or, symbolically, what is the probability that $M_3 - (M_1 + M_2)/2 \geq 15$? The answer lies in the
sampling distribution of $M_3 - (M_1 + M_2)/2$. Letting a1 = -.5, a2 = -.5, and a3 = 1, L =
-.5M1 - .5M2 + M3. The mean and standard deviation of the sampling distribution of L for this
example are μL = a1 μ1 + a2 μ2 + ... + ak μk
= (-.5)(60) + (-.5)(68) + (1)(80) = 16

and $\sigma_L = \sqrt{\frac{(100)[(-.5)^2 + (-.5)^2 + (1)^2]}{8}} = \sqrt{18.75}$ = 4.33.

The sampling distribution of L therefore has a mean of 16 and a standard deviation of
4.33. The question is, what is the probability of getting a value of L greater than or equal
to 15? The formula $z = (L - \mu_L)/\sigma_L$ = (15 - 16)/4.33 = -.23 can be used to find out how many
standard deviations above μL an L of 15 is. Using a z table, it can be determined that .41
of the time a z of -.23 or lower would occur. Therefore the probability of a z of -.23 or
higher occurring is (1 - .41) = .59. So, the probability that $M_3 - (M_1 + M_2)/2$ is greater than 15
is .59.
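
The reading-ability example can be checked numerically. The sketch below assumes, as the text does, independent means, a common population variance, equal sample sizes, and a normal sampling distribution; the function name is made up for illustration.

```python
import math
from statistics import NormalDist

def linear_combination_distribution(coeffs, pop_means, variance, n):
    """Normal sampling distribution of L = a1*M1 + ... + ak*Mk (equal variance and n)."""
    mean_l = sum(a * mu for a, mu in zip(coeffs, pop_means))
    sd_l = math.sqrt(variance * sum(a * a for a in coeffs) / n)
    return NormalDist(mean_l, sd_l)

# Coefficients compare the 14-year-olds with the average of the two younger groups.
dist_l = linear_combination_distribution([-0.5, -0.5, 1], [60, 68, 80], variance=100, n=8)
p_l = 1 - dist_l.cdf(15)

print(f"mean of L = {dist_l.mean:.0f}, standard deviation of L = {dist_l.stdev:.2f}")
print(f"P(L >= 15) = {p_l:.2f}")
```

Changing the coefficient list changes the hypothesis being examined; for instance, [-1, 0, 1] would compare only the youngest and oldest groups.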

Sampling distribution of Pearson's r


Just like any other statistic, Pearson's r has a sampling distribution. If N pairs of scores
were sampled over and over again the resulting Pearson r's would form a distribution.
When the absolute value of the correlation in the population is low (say less than about
.4) then the sampling distribution of Pearson's r is approximately normal. However, with
high values of correlation, the distribution has a negative skew.
The graph on the right shows the sampling distribution of Pearson's r when the
population correlation is .8 and when N = 19. The strong negative skew is apparent.
Although no value of r ever exceeds 1.0, some values of r are very low. A transformation
called Fisher's z' transformation converts Pearson's r to a value that is normally
distributed, with a standard error of $\sigma_{z'} = 1/\sqrt{N - 3}$.

It stands to reason that the greater the sample size (N, the number of pairs of scores), the
smaller the standard error. Since N is in the denominator of the formula, the larger the
sample size, the smaller the standard error. Consider the following problem: If the
population correlation (rho) between scores on an aptitude test and grades in school were
.5, what is the probability that a correlation based on 19 students would be larger than
.75? The first step is to convert a correlation of .5 to z'. This can be done with the r to z'
table. The value of z' is .55. The standard error is $\sigma_{z'} = 1/\sqrt{19 - 3}$ = 1/4 = .25.

The second step is to convert .75 to z'. Again this is done with an r to z' table. The value is
.97.

The number of standard deviations from the mean can be calculated with the formula:
$z = \frac{z' - \mu}{\sigma_{z'}}$ where: z is the number of standard deviations above the z' associated with the
population correlation, z' is the value of Fisher's z' for the sample correlation (z' = .97 in
this case), μ is the value of z' for the population correlation (.55 in this case) and is the
mean of the sampling distribution of z', and $\sigma_{z'}$ is the standard error of Fisher's z'; it was
previously calculated to be .25 for N = 19.

Plugging the numbers into the formula: z = (.97 - .55)/.25 = 1.68. Therefore, a correlation
of .75 is associated with a value 1.68 standard deviations above the mean. As shown
previously, a z table can be used to determine the probability of a value more than 1.68
standard deviations above the mean. The proportion of the area below a z of 1.68 is .95.
Therefore there is a .05
probability of obtaining a Pearson's r of .75 or greater when the "true" correlation is only
.50.
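
Fisher's z' is the inverse hyperbolic tangent of r, so the r to z' table lookup can be replaced by math.atanh. The sketch below repeats the calculation that way; the function name is hypothetical. It works with unrounded z' values, so its z differs slightly from the 1.68 obtained from table values, but the probability still rounds to .05.

```python
import math
from statistics import NormalDist

def prob_r_above(r_cutoff, rho, n):
    """P(sample r > r_cutoff) using Fisher's z' and its standard error 1/sqrt(N - 3)."""
    z_prime_rho = math.atanh(rho)       # z' for the population correlation
    z_prime_cut = math.atanh(r_cutoff)  # z' for the cutoff correlation
    standard_error = 1 / math.sqrt(n - 3)
    z = (z_prime_cut - z_prime_rho) / standard_error
    return 1 - NormalDist().cdf(z)

p_r = prob_r_above(0.75, rho=0.5, n=19)
print(f"z' for .5 = {math.atanh(0.5):.2f}, z' for .75 = {math.atanh(0.75):.2f}")
print(f"P(r > .75) = {p_r:.2f}")
```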

The sampling distribution of the difference between two independent Pearson r's can be
approached in terms of the sampling distribution of z'. First both r's are converted to z'.
The standard error of the difference between independent z's is:

$\sigma_{z'_1 - z'_2} = \sqrt{\frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}}$

where N1 is the number of pairs of scores the first
correlation is based on and N2 is the number of pairs of scores the second correlation is
based upon. It is important to keep in mind that this formula only holds when the two
correlations are independent. This means that different subjects must be used for each
correlation. If three tests are given to the same subjects then the correlation between tests
one and two is not independent of the correlation between tests one and three.

Assume that in the population of females the correlation between a test of verbal ability
and a test of spatial ability is .6 whereas in the population of males the correlation is .5.

If a random sample of 40 females and 35 males is taken, what is the probability that the
correlation for the female sample will be lower than the correlation for the male sample?
Start by computing the mean and standard deviation of the sampling distribution of the
difference between z's. As can be calculated with the help of the r to z' procedure, r's of .6
and .5 correspond to z's of .69 and .55 respectively. Therefore the mean of the sampling
distribution is .69 - .55 = .14. The standard deviation is
$\sigma_{z'_1 - z'_2} = \sqrt{\frac{1}{40 - 3} + \frac{1}{35 - 3}}$ = .241.

The portion of the distribution for which the difference is negative (the correlation in the
female sample is lower) is shaded. What proportion of the area is this? A difference of 0
is (0 - .14)/.241 = -.58 standard deviations above (.58 sd's below) the mean. A z table can
be used to calculate that the probability of a z less than or equal to -.58 is .28.
Therefore the probability that the correlation will be lower in the female sample is .28.
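
The same problem can be worked without tables. This sketch assumes independent samples and uses math.atanh for the r to z' conversion; the function name is made up for illustration. The probability it prints rounds to the .28 found above.

```python
import math
from statistics import NormalDist

def prob_r1_below_r2(rho1, n1, rho2, n2):
    """P(sample r1 < sample r2) for independent samples, via Fisher's z'."""
    mean_diff = math.atanh(rho1) - math.atanh(rho2)
    se_diff = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    # r1 < r2 exactly when z'1 - z'2 < 0, because the transformation is increasing.
    return NormalDist(mean_diff, se_diff).cdf(0)

p_female_lower = prob_r1_below_r2(rho1=0.6, n1=40, rho2=0.5, n2=35)
print(f"P(female r < male r) = {p_female_lower:.2f}")
```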

Sampling distribution of median

The standard error of the median for large samples and normal distributions is:
$\sigma_{med} = 1.253\,\sigma/\sqrt{N}$. Thus, the standard error of the median is about 25% larger than
that for the mean. It is thus less efficient and more subject to sampling fluctuations. This
formula is fairly accurate even for small samples but can be very wrong for extremely
non-normal distributions. For non-normal distributions, the standard error of the median
is difficult to compute.
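
The "about 25% larger" claim can be checked by simulation. The sketch below draws repeated samples from a standard normal population and compares the spread of the sample medians with the spread of the sample means. The sample size of 101, the 2,000 repetitions, and the seed are arbitrary illustrative choices.

```python
import random
import statistics

rng = random.Random(2)   # fixed seed so the run is repeatable
n, reps = 101, 2000      # illustrative choices, not taken from the text

means, medians = [], []
for _ in range(reps):
    sample = [rng.gauss(0, 1) for _ in range(n)]
    means.append(statistics.fmean(sample))
    medians.append(statistics.median(sample))

# Ratio of the two simulated standard errors; expected to be in the vicinity of 1.25.
ratio = statistics.stdev(medians) / statistics.stdev(means)
print(f"SE(median) / SE(mean) is roughly {ratio:.2f}")
```

Swapping rng.gauss for a strongly skewed generator would show how far the ratio can move for non-normal populations.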

Sampling distribution of the standard deviation

The standard error of the standard deviation is: $\sigma_s = 0.71\,\sigma/\sqrt{N}$.

The distribution of the standard deviation is positively skewed for small N but is
approximately normal if N is 25 or greater. Thus, procedures for calculating the area
under the normal curve work for the sampling distribution of the standard deviation as
long as N is at least 25 and the distribution is approximately normal.

Sampling distribution of a proportion


Assume that .80 of all third grade students can pass a test of physical fitness. A random
sample of 20 students is chosen: 13 passed and 7 failed. The parameter π is used to
designate the proportion of subjects in the population that pass (.80 in this case) and the
statistic p is used to designate the proportion who pass in a sample (13/20 = .65 in this
case). The sample size (N) in this example is 20. If repeated samples of size N were
taken from the population and the proportion passing (p) were determined for each
sample, a distribution of values of p would be formed. If the sampling went on forever,
the distribution would be the sampling distribution of a proportion. The sampling
distribution of a proportion is equal to the binomial distribution. The mean and standard
deviation of the binomial distribution are μ = π and $\sigma_p = \sqrt{\pi(1 - \pi)/N}$. For the present
example, N = 20, π = .80, the mean of the sampling distribution of p (μ) is .8 and the
standard error of p (σp) is .089. The shape of the binomial distribution depends on both N
and π. With large values of N and values of π in the neighborhood of .5, the sampling
distribution is very close to a normal distribution.

The plot shown here is the sampling distribution for the present example. As you can see,
the distribution is not far from the normal distribution, although it does have some
negative skew.
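
The sampling distribution for this example can be built exactly from the binomial formula, which gives a way to confirm the mean of .8 and the standard error of .089. The function name below is made up for illustration.

```python
import math

def proportion_distribution(pi, n):
    """Exact sampling distribution of p: maps each possible p = k/N to its probability."""
    return {k / n: math.comb(n, k) * pi**k * (1 - pi)**(n - k) for k in range(n + 1)}

dist_p = proportion_distribution(0.80, 20)

# Mean and standard deviation computed directly from the distribution.
mean_p = sum(p * prob for p, prob in dist_p.items())
sd_p = math.sqrt(sum((p - mean_p) ** 2 * prob for p, prob in dist_p.items()))

print(f"mean of p = {mean_p:.3f}, standard error of p = {sd_p:.3f}")
print(f"P(p = .65), i.e. 13 of 20 passing: {dist_p[13 / 20]:.3f}")
```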

Assume that for the population of people applying for a job at a bank in a major city, .40
are able to pass a basic literacy test required to get the job. Out of a group of 20
applicants, what is the probability that 50% or more of them will pass? This problem
involves the sampling distribution of p with π = .40 and N = 20. The mean of the
sampling distribution is π = .40. The standard deviation is $\sigma_p = \sqrt{(.40)(.60)/20}$ = .11.

Using the normal approximation, a proportion of .50 is: (.50-.40)/.11 = 0.909 standard
deviations above the mean. From a z table it can be calculated that 0.818 of the area is
below a z of 0.909. Therefore the probability that 50% or more will pass the literacy test
is only about 1 - 0.818 = 0.182.

Correction for Continuity

Since the normal distribution is a continuous distribution, the probability that a sample
value will exactly equal any specific value is zero. However, this is not true when the
normal distribution is used to approximate the sampling distribution of a proportion. A
correction called the "correction for continuity" can be used to improve the
approximation.

The basic idea is that to estimate the probability of, say, 10 successes out of 20 when π is
0.4, one should compute the area between 9.5 and 10.5 as shown below.

Therefore to compute the probability of 10 or more successes, compute the area above
9.5 successes. In terms of proportions, 9.5 successes is 9.5/20 = 0.475. Therefore, 9.5
successes is (0.475 - 0.40)/.11 = 0.682 standard deviations above the mean. The probability of being
0.682 or more standard deviations above the mean is 0.247 rather than the 0.182 that was
obtained previously. The exact answer calculated using the binomial distribution is 0.245.
For small sample sizes the correction can make a much bigger difference than it did here.
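
The three answers for the literacy-test problem can be compared side by side. This sketch uses the unrounded standard error rather than .11, so the two normal-approximation figures may differ from the hand calculation in the third decimal place; the variable names are illustrative.

```python
import math
from statistics import NormalDist

pi, n, k = 0.40, 20, 10                 # P(pass), applicants, successes of interest
se = math.sqrt(pi * (1 - pi) / n)       # standard error of p
normal = NormalDist(pi, se)

# Normal approximation without the correction: area above p = 10/20.
approx = 1 - normal.cdf(k / n)
# With the correction for continuity: area above 9.5 successes, i.e. p = 9.5/20.
corrected = 1 - normal.cdf((k - 0.5) / n)
# Exact binomial probability of 10 or more successes.
exact = sum(math.comb(n, j) * pi**j * (1 - pi)**(n - j) for j in range(k, n + 1))

print(f"uncorrected = {approx:.3f}, corrected = {corrected:.3f}, exact = {exact:.3f}")
```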

Sampling distribution of the difference between two proportions

The mean of the sampling distribution of the difference between two independent
proportions (p1 - p2) is μp1 - p2 = π1 - π2. The standard error of p1 - p2 is

$\sigma_{p_1 - p_2} = \sqrt{\frac{\pi_1(1 - \pi_1)}{n_1} + \frac{\pi_2(1 - \pi_2)}{n_2}}$.

The sampling distribution of p1 - p2 is approximately normal as long as the proportions are
not too close to 1 or 0 and the sample sizes are not too small. As a rule of thumb, if n1 and
n2 are each at least 10 and neither π1 nor π2 is within .10 of 0 or 1 then the approximation is
satisfactory for most purposes. To see the application of this sampling distribution,
assume that .8 of high school graduates but only .4 of high school drop outs are able to
pass a basic literacy test. If 20 students are sampled from the population of high school
graduates and 25 students are sampled from the population of high school drop outs, what
is the probability that the proportion of drop outs that pass will be as high as the
proportion of graduates?

For this example, the mean of the sampling distribution of p1 - p2 is μ = π1 − π2 = .8 - .4 = .4.

The standard error is: $\sigma_{p_1 - p_2} = \sqrt{\frac{(.8)(.2)}{20} + \frac{(.4)(.6)}{25}}$ = .133.
The solution to the problem is the probability that p1 - p2 is less than or equal to 0. The
number of standard deviations above the mean associated with a difference in
proportions of 0 is: $z = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sigma_{p_1 - p_2}}$ = (0 - .4)/.133 = -3.01.

From a z table it can be determined that only .0013 of the time would p1 - p2 be 3.01 or
more standard deviations below the mean.
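
The graduates versus drop-outs example can be verified with the normal approximation in code. The function name is hypothetical, and the sketch relies on the same approximation as the text.

```python
import math
from statistics import NormalDist

def diff_proportions_distribution(pi1, n1, pi2, n2):
    """Approximate normal sampling distribution of p1 - p2 for independent samples."""
    mean = pi1 - pi2
    se = math.sqrt(pi1 * (1 - pi1) / n1 + pi2 * (1 - pi2) / n2)
    return NormalDist(mean, se)

# Graduates: pi = .8, n = 20. Drop outs: pi = .4, n = 25.
dist_diff = diff_proportions_distribution(0.8, 20, 0.4, 25)
p_not_higher = dist_diff.cdf(0)   # P(p1 - p2 <= 0)

print(f"mean = {dist_diff.mean:.1f}, standard error = {dist_diff.stdev:.3f}")
print(f"P(p1 - p2 <= 0) = {p_not_higher:.4f}")
```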

1. What are the mean and standard deviation of the sampling distribution of the mean?

2. Given a test that is normally distributed with a mean of 30 and a standard deviation of
6,

(a) What is the probability that a single score drawn at random will be greater than 34?

(b) What is the probability that a sample of 9 scores will have a mean greater than 34?

(c) What is the probability that the mean of a sample of 16 scores will be either less than
28 or greater than 32?

3. What is a standard error and why is it important?

4. What is the relationship between sample size and the standard error of the mean?

5. What is the symbol used to designate the standard error of the mean? The standard
error of the median?

6. Young children typically do not know that memory for a list of words is better if you
rehearse the words. Assume two populations: (1) four-year-old children instructed to
rehearse the words and (2) four-year-old children not given any specific instructions.
Assume that the mean and standard deviation of the number of words recalled by
Population 1 are 3.5 and .8. For Population 2, the mean and standard deviation are 2.4
and 0.9. If both populations are normally distributed, what is the probability that the
mean of a sample of 10 from Population 1 will exceed the mean of a sample of 12 from
Population 2 by 1.8 or more?

7. Assume four normally-distributed populations with means of 9.8, 11, 12, and 10.4 all
with the same standard deviation of 5. Nine subjects are sampled from each population
and the mean of each sample computed. What is the probability that the average of the means
of the samples from Populations 1 and 2 will be greater than the average of the means of
the samples from Populations 3 and 4?

8. If the correlation between reading achievement and math achievement in the
population of fifth graders were .45, what would be the probability that in a sample of 12
students, the sample correlation coefficient would be greater than .7?

9. If numerous samples of N = 15 are taken from a uniform distribution and a relative
frequency distribution of the means drawn, what would be the shape of the frequency
distribution?

10. If a fair coin is flipped 18 times, what is the probability it will come up heads 14 or
more times?

11. Some computer programs require the user to type in commands to get the program to
do what he or she wants. Others allow the user to choose from menus and push buttons.
Assume that .45 of office workers are able to solve a particular problem using a program
that is "command based" and that .83 of workers can solve the problem using a
"menu/button based" program. If two samples of 12 subjects are tested, one with each
program, what is the probability that the difference in the proportions solving the problem
will be greater than .5?

ANSWERS:

2a. 0.2524
2b. 0.0228
2c. 0.1826

6. 0.027

8. 0.125

10. Exact binomial probability = 0.0154; normal approximation = 0.0169. Show work to
get the normal approximation.
