You are on page 1of 5

Biostatistics Notes

Essential Statistics second edition


Instructors Copy
Moore/Notz/Fligner
Chapter 11 Sampling Distributions

The material presented in this module forms the backbone for much of the
material on statistical inference to follow; namely, confidence intervals and
significance tests based on large samples. One cannot understand the
meaning of a P-value, for example, without understanding sampling
distributions. It may not be easy for you to grasp the idea that a statistic has
its own probability distribution, and concepts such as the mean of a sample
mean confuse many newcomers. But, time invested here will yield rewards
repeatedly as we continue to study.
It is in this module that we will formally introduce the distinction between a
statistic and parameter for here we begin to turn toward inference from
sample to population. A common population parameter of interest is the
population mean . The sample mean x is commonly used to estimate , so
its sampling distribution is important.
In this module we examine several important results concerning this
particular sampling distribution:
(1)The mean of the statistic x is the same as the mean of the population
were sampling from, ;
(2)the standard deviation of the statistic x is n;
(3)because of this result, the Law of Large Numbers states that the
statistic x has less variability (greater consistency) when based on a
larger sample size;
(4)the Central Limit Theorem states the sampling distribution of x will be
approximately normal, no matter what population the sample is taken
from, provided the sample size is large enough.

Statistical inference (vs. Exploratory data analysis)


The process of statistical inference involves using information from a sample
to draw conclusions about a wider population.
Population entire group of individuals about which we want information.
Ex. People with chronic pain (but we cant study all of them, so)
Sample part of the population from which we actually collect information.
Ex. Selected group of people to get data from.

Make an inference about the Population from the Representative Sample.

How do you pick a representative sample?


Simple Random Sample (SRS) a sample of n individuals from the
population chosen in such a way that every set of n individuals has an equal
chance to be the sample actually selected. Does not favor any part of the
population; helps eliminate potential bias because the selection of samples is
left to probability and chance.

Define and identify parameters and statistics (Slide A for practice)


Parameter a number that describes the population. In statistical practice,
the value is not known because we cannot examine the entire population.
Statistic a number that describes some characteristic of a sample; the
value of a statistic can be computed from the sample data without making
use of any unknown parameters. In practice, we often use this to estimate an
unknown parameter.

Statistics come from Samples


x (x-bar) = sample mean; s = sample standard deviation
Parameters come from Populations
(mu) = population mean; = population standard deviation

Sampling Variability
Sampling variability the value of a statistic varies in repeated random
sampling.
Sampling variability is inevitable no matter how careful we are with random
sampling.
Think of a statistic as a random variable because it takes on numerical
values that describe the outcome of a random sampling process.

What happens when we take many different samples?


Chances are we will get many different statistics.

So how can we use the sample statistic as an accurate estimate of our


population parameter, since many random samples will produce different x-
bars?
How can x-bar (xx ) be an accurate estimate of mu ()?
One solution: increase sample size.

Describe the law of large numbers


Law of large numbers draw observations at random from any population
with finite mean mu (). As the number of observations drawn increases, the
sample mean [x-bar (x )] of the observed values gets closer and closer to the
mean [mu ()] of the population. (i.e. the statistic gets closer to the
parameter). This is the case for any population for any distribution (whether
normal or skewed distribution); if we use a large enough sample, we can an
estimate that is very close to the population parameter.

The problem with this is that we may not always be able to get a large
enough sample size.
How can we make inferences about a population based on findings from a
smaller sample.
Create a sampling distribution.

Define and describe sampling distributions


(1)Take every possible sample of a certain size
(2)calculate the mean for each sample
(3)graph all calculated means on histogram

Population distribution versus sampling distribution (slide B).


Population distribution distribution of values of the variable among all
individuals in the population. (i.e. each n within the population distribution
represents one person or one data point)
Sampling distribution distribution of values taken by a statistic in all
possible samples of the same size from the same population. (i.e. each n
within the sampling distribution represents one sample).
There are actually three distinct distributions involved when we sample
repeatedly and measure a variable of interest.
1) the population distribution gives the values of the variable for all the
individuals in the population.
Reality, its rarely known
2) the distribution of sample data shows the values of the variable for all
the individuals in the sample.
Reality, its whats created by doing exploratory data analysis on
collected data.
3) the sampling distribution shows the statistic values from all the
possible samples of the same size from the population. Simulation: using
software to imitate chance behavior (because we cant really take so many
samples in real life).
Reality, what we really need to look at when making statistical
inferences.

Describe the sampling distribution of sample means (x )


When we choose many SRSs from a population, the sampling
distribution of the sample mean is centered at the population mean
(mu) and is less spread out than the population distribution. (some
sample means may be greater than population mean; equally likely to get
some sample means that are less than the population mean; since there is
no systematic tendency to under or over estimate the parameter, the mean
of all the sample means will ultimately equal the population mean, so we can
say that x-bar (or the sample mean) is an unbiased estimator of the
parameter).

In terms of spread, since we are illustrating sample means in the sampling


distribution, most of the outliers seen in the population distribution gets
evened out; the spread of the sampling distribution is always smaller than
the population distribution. in fact as sample size gets larger, the standard
deviation of the sampling distribution gets smaller (this is because larger
samples have a greater likelihood of getting outliers from high and low end,
thus evening them out). So we can calculate the standard deviation of the
sampling distribution by dividing the sigma (the population standard
deviation) over the square root of n.

Suppose that x (x-bar) is the mean of an SRS of size n drawn from a large
population with mean (mu) and standard deviation . Then:
The mean of the sampling distribution of x is x =
The standard deviation of the sampling distribution of x is
x = /(square root of n)

These facts about the mean and standard deviation of xx (x-bar) are always
true
no matter what shape the population distribution has (left skewed, right
skewed, big spread, little spread..)!!!

What this tells us: When we take a large enough random sample, we can
trust that it will help estimate the true population mean accurately because a
large sample will give us a sample mean that is close to the parameter (law
of large numbers), and also a large sample size will ensure a small sampling
distribution standard deviation, which means that all large samples will have
means that are close to the population parameter. (Slide C).
Why are all the sampling distributions seen so far normal? Is it because the
population distributions are normal? What if the population distributions
arent normal?

Most population distributions are not Normal. What is the shape of the
sampling distribution of sample means when the population distribution isnt
Normal?

It is a remarkable fact that as the sample size increases, the distribution of


sample means changes its shape: it looks less like that of the population
distribution and more like a Normal distribution! (This is true no matter what
the population distribution looks like, so long as the population distribution
has a meaningful or finite standard deviation, and the sample size is large
enough; if the population distribution already resembles a Normal
distribution, then the sample size required to create a Normal sampling
distribution will be far less than when a population distribution is very
skewed. If we have a large enough sample size, then even a very skewed
population distribution can have a very normal sampling distribution = CLT)

Describe and apply the central limit theorem


Draw an SRS of size n from any population with mean (mu) and
finite standard deviation (sigma).
Central limit theorem (CLT) for large n, the sampling distribution of the
sample mean x (x-bar) is approximately Normal. (That is, averages are more
Normal than individual observations).

xx is approximately N(, /square root n)

when sample size is large, the sampling distribution of the sample mean x-
bar is approximately normal with the mean of the sample means equal to
mu (population parameter) and the standard deviation of sigma over the
square root of n.

The population distributions are never Normal. In order to do statistical


inference, we rely heavily on the normal distribution. So, CLT is VERY
IMPORTANT!!! (Slide D. for practice).

Normal condition for Sample Means


If the population distribution is Normal, then so is the sampling distribution of
x-bar. This is true no matter what the sample size n is.

If the population distribution is not Normal, the central limit theorem (CLT)
tells us that the sampling distribution of x-bar (sample means) will be
approximately Normal in most cases if n >= 30.
Remember:
(1) Means of random samples are LESS VARIABLE than
individual observations.
(2) Means of random samples are MORE NORMAL than
individual observations.