SESSION 2
What we will do
• Sampling
• The concept of a sampling distribution.
• Focus on the sampling distribution of the sample mean, and discuss the
role of a famous mathematical result called the central limit theorem.
• LLN/CLT
Sampling – why?
• To estimate properties of a population from the data observed
in the sample
• To contact the entire population is too time consuming and
expensive.
• Certain tests are destructive.
• Checking all the items is physically impossible.
• The mathematical procedures appropriate for performing this estimation
depend on which properties of the population are of interest and which
type of random sampling scheme is used.
• Because the details are considerably more complex for more complex
sampling schemes such as multistage sampling, we will focus on simple
random samples, where the mathematical details are relatively
straightforward.
Types of Sampling
• Simple random sampling
• A simple random sample of size n has the property that every possible
sample of size n has the same probability of being chosen.
• Simple random samples are the easiest to understand, and their statistical
properties are fairly straightforward.
• There are several ways simple random samples can be chosen, all of
which involve random numbers.
• Because each sampling unit has the same chance of being sampled,
simple random sampling can result in samples that are spread over the
entire population.
• Simple random sampling requires that all sampling units be identified
prior to sampling. Sometimes this is infeasible. In addition, by chance
alone, simple random sampling can yield underrepresentation or
overrepresentation of certain segments of the population.
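As a minimal sketch of how random numbers can be used to pick a simple random sample, the snippet below draws n distinct units from a fully listed sampling frame. The frame, the segment labels "A" and "B", and the helper name `simple_random_sample` are hypothetical and only serve as illustration.

```python
import random

def simple_random_sample(units, n, seed=None):
    """Draw n distinct units so that every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(units, n)

# Hypothetical sampling frame: 1,000 units, half in each of two segments.
frame = [("A", i) for i in range(500)] + [("B", i) for i in range(500)]

sample = simple_random_sample(frame, n=20, seed=1)

# By chance alone, the sample share of a segment can differ from 0.50.
share_a = sum(1 for segment, _ in sample if segment == "A") / len(sample)
print(f"share of segment A in the sample: {share_a:.2f} (population share: 0.50)")
```

Note that the whole frame has to exist as a list before any unit can be drawn, which is the identification requirement mentioned above.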
Nonrandom Samples
• Nonrandom samples produce tainted, sometimes unbelievable,
results
• Biased with respect to the population
• Results reflect only the subpopulation from which the data are obtained.
• Sources of bias in samples
• Bad sample design – e.g., home phone surveys conducted during working
hours
• Survey (non)response bias – e.g., hotel opinion surveys about service
quality
• Attrition bias from clinical trials - e.g., if the drug works, the subject does
not come back.
• Self selection – volunteering for a trial or an opinion sample.
• Participation bias – e.g., voluntary participation in the Literary Digest
poll (see next page)
The Lesson…
• Having a really big sample does not assure you of an accurate
result. It may assure you of a really solid, really bad
(inaccurate) result.
• A confidence interval
• is an interval around the point estimate, calculated from the sample data, that is
very likely to contain the true value of the population parameter.
• An unbiased estimate
• is a point estimate such that the mean of its sampling distribution is equal to the
true value of the population parameter being estimated.
Sampling Distributions…
• The distribution of a statistic in “repeated sampling” is the
sampling distribution.
• The random sample is itself random, since each member is random. (A
second sample will differ randomly from the first one.)
• Statistics computed from random samples will vary as well.
Note: $E(x) = \mu = \bar{x}$ (?)
• Let X be distributed as below:
X P(X)
10 0.25
20 0.5
30 0.25
• $E(x) = \sum_x x P(x) = 20$
• $\bar{x} = \frac{1}{n} \sum_i x_i$
• e.g., if samples are drawn randomly, the observed proportion of draws equal to 10 is
likely to be close to 0.25.
• What would happen if the sample is not randomly drawn?
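The comparison between $E(x)$ and $\bar{x}$ can be illustrated with a small simulation. This is only a sketch: the sample size of 10,000, the seed, and the "biased" scheme that never records the value 10 are assumptions chosen for illustration, not part of the original slides.

```python
import random

values = [10, 20, 30]
probs = [0.25, 0.5, 0.25]

# Population mean: E(x) = sum of x * P(x).
expected = sum(x * p for x, p in zip(values, probs))

rng = random.Random(0)
n = 10_000
draws = rng.choices(values, weights=probs, k=n)

sample_mean = sum(draws) / n          # x-bar from a random sample
share_of_10 = draws.count(10) / n     # observed proportion of 10s

# Hypothetical nonrandom scheme: the value 10 is never recorded.
biased_draws = [x for x in draws if x != 10]
biased_mean = sum(biased_draws) / len(biased_draws)

print(f"E(x) = {expected}")
print(f"random sample: x-bar = {sample_mean:.2f}, share of 10s = {share_of_10:.3f}")
print(f"biased sample: x-bar = {biased_mean:.2f}")
```

Under random sampling the sample mean lands near 20 and the share of 10s near 0.25; under the biased scheme the sample mean settles well above 20 no matter how many draws are taken.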
Sampling distribution
x 1 2 3 4 5 6
P(x) 1/6 1/6 1/6 1/6 1/6 1/6
…and the mean and variance are calculated as well:
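The formulas shown on the original slide are not reproduced in this text. As a worked sketch using the standard definitions for the fair-die distribution above (with $\mu$ and $\sigma^2$ denoting the population mean and variance):

```latex
\mu = \sum_x x\,P(x) = \frac{1+2+3+4+5+6}{6} = 3.5
\qquad
\sigma^2 = \sum_x (x-\mu)^2 P(x)
         = \frac{6.25 + 2.25 + 0.25 + 0.25 + 2.25 + 6.25}{6}
         = \frac{35}{12} \approx 2.92
```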
While there are 36 possible samples of size 2, there are only 11 values for
the sample mean, and some (e.g. $\bar{x} = 3.5$) occur more frequently than
others (e.g. $\bar{x} = 1$).
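The counts above can be checked by listing all 36 ordered samples of two die rolls. The following is a minimal sketch; the variable names are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# All ordered samples of size 2 from a fair die.
samples = list(product(faces, repeat=2))

# Sampling distribution of the sample mean: count how often each value occurs.
mean_counts = Counter(Fraction(a + b, 2) for a, b in samples)

for mean in sorted(mean_counts):
    print(f"x-bar = {float(mean):.1f}: {mean_counts[mean]}/36")

# Mean and variance of the sampling distribution, as exact fractions.
mean_of_means = sum(m * c for m, c in mean_counts.items()) / len(samples)
var_of_means = sum((m - mean_of_means) ** 2 * c for m, c in mean_counts.items()) / len(samples)
print(f"mean of x-bar = {mean_of_means}, variance of x-bar = {var_of_means}")
```

Exact fractions are used so that the mean and variance of the sampling distribution come out without floating-point rounding.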
Compare…
Compare the distribution of X…
• Unbiased estimates
• $E(\bar{X}) = \mu$
• In a random sampling situation, the estimation error, $\Delta = \bar{X} - \mu$, is zero on
average.
• This does not depend on the sample size.
• Consistent estimates
• The standard deviation of $\bar{X}$ is $SD(\bar{X}) = \frac{\sigma}{\sqrt{N}}$
• As N goes to ∞, $SD(\bar{X})$ goes to zero.
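A short simulation can illustrate how the spread of the sample mean shrinks with N. This sketch assumes fair-die rolls as the population (so $\sigma = \sqrt{35/12}$); the sample sizes, the 4,000 replications, and the helper name are arbitrary illustrative choices.

```python
import math
import random
import statistics

def sd_of_sample_means(n, reps, rng):
    """Simulate `reps` samples of n die rolls and return the SD of their means."""
    means = [statistics.fmean([rng.randint(1, 6) for _ in range(n)]) for _ in range(reps)]
    return statistics.pstdev(means)

sigma = math.sqrt(35 / 12)  # population SD of one fair die roll
rng = random.Random(0)

results = {}
for n in (1, 4, 16, 64):
    simulated = sd_of_sample_means(n, reps=4000, rng=rng)
    theory = sigma / math.sqrt(n)
    results[n] = (simulated, theory)
    print(f"N = {n:>2}: simulated SD = {simulated:.3f}, sigma/sqrt(N) = {theory:.3f}")
```

Quadrupling N roughly halves the standard deviation of the sample mean, which is why the estimate is consistent.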
Generalize…
We can generalize the mean and variance of the sampling of two
dice:
…to n-dice:
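The formulas from the original slide are missing in this text. A sketch of the standard results, assuming $\mu = 3.5$ and $\sigma^2 = 35/12$ are the mean and variance of a single fair die:

```latex
% Two dice (n = 2)
\mu_{\bar{x}} = \mu = 3.5, \qquad
\sigma^2_{\bar{x}} = \frac{\sigma^2}{2} = \frac{35}{24} \approx 1.46

% n dice
\mu_{\bar{x}} = \mu, \qquad
\sigma^2_{\bar{x}} = \frac{\sigma^2}{n}, \qquad
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
```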
Exercise
• The amount of time spent by North American adults watching TV per day is
normally distributed with a mean of 6 hours and a standard deviation of 1.5 hours.
• a) what is the probability that a randomly selected North American adult watches
TV for more than 7 hours per day?
• Ans)
• b) what is the probability that the average time watching TV by a random sample
of five North American adults is more than 7 hours?
• Ans)
• c) Compare your answers in parts (a) and (b) and explain what drives the
difference between the two.
Exercise - solutions
• The amount of time spent by North American adults watching TV per day is
normally distributed with a mean of 6 hours and a standard deviation of 1.5 hours.
• a) what is the probability that a randomly selected North American adult watches
TV for more than 7 hours per day?
• Ans) $P(x > 7) = P\left(Z > \frac{7-6}{1.5}\right) = P(Z > 0.67) = 1 - 0.7486 = 0.2514$
• b) what is the probability that the average time watching TV by a random sample
of five North American adults is more than 7 hours?
• Ans) $P(\bar{x} > 7) = P\left(Z > \frac{7-6}{1.5/\sqrt{5}}\right) = P(Z > 1.49) = 0.0681$
• c) Compare your answers in parts (a) and (b) and explain what drives the
difference between the two.
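The two probabilities can be checked numerically with Python's standard library. This is a sketch, not the course's own method; the exact values differ slightly from z-table answers because the table rounds z to two decimals (for example, z = 0.67 gives about 0.2514 for part (a)).

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 6.0, 1.5

# (a) One adult: X ~ Normal(6, 1.5)
single = NormalDist(mu, sigma)
p_a = 1 - single.cdf(7)

# (b) Mean of five adults: x-bar ~ Normal(6, 1.5 / sqrt(5))
mean_of_five = NormalDist(mu, sigma / sqrt(5))
p_b = 1 - mean_of_five.cdf(7)

print(f"(a) P(X > 7)     = {p_a:.4f}")
print(f"(b) P(x-bar > 7) = {p_b:.4f}")
```

The only thing that changes between the two calculations is the standard deviation: averaging five observations divides it by $\sqrt{5}$, so a mean above 7 hours is much less likely than a single observation above 7 hours.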
As $N \to \infty$, $P[|\bar{x} - \mu| > \varepsilon] \to 0$,
regardless of how small $\varepsilon$ is.
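The statement can be illustrated by simulation. The sketch below assumes fair-die rolls as the population ($\mu = 3.5$), a tolerance of $\varepsilon = 0.25$, and 1,000 replications per sample size; all of these are illustrative choices.

```python
import random

def share_far_from_mu(n, eps, reps, rng, mu=3.5):
    """Fraction of simulated samples whose mean misses mu by more than eps."""
    far = 0
    for _ in range(reps):
        xbar = sum(rng.randint(1, 6) for _ in range(n)) / n
        if abs(xbar - mu) > eps:
            far += 1
    return far / reps

rng = random.Random(0)
eps = 0.25
shares = {n: share_far_from_mu(n, eps, reps=1000, rng=rng) for n in (10, 100, 1000)}

for n, share in shares.items():
    print(f"N = {n:>4}: estimated P(|x-bar - mu| > {eps}) = {share:.3f}")
```

The estimated probability falls toward zero as N grows; picking a smaller $\varepsilon$ only means a larger N is needed before it does.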
Bernoulli knew…
Summary
• Experiment = A randomly picked application.
Let X = 0 if Rejected
Let X = 1 if Accepted
• X is DISCRETE (Binary). This is called a Bernoulli random variable.
The 13,444 observations are the population. The true proportion is μ = 0.780943. We
draw samples of size N from the 13,444 and use the observations to estimate μ.
• The average of the averages will hit the true mean (on
average)
• The mean is UNBIASED: No moral connotations
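Unbiasedness can be illustrated by repeatedly sampling from a stand-in for the application population. The actual data are not available here, so this sketch assumes 10,499 acceptances among the 13,444 applications (a count that reproduces the stated proportion 0.780943 after rounding); the sample size of 100 and the 2,000 repetitions are also illustrative.

```python
import random
import statistics

# Assumed reconstruction of the population: 1 = Accepted, 0 = Rejected.
population = [1] * 10_499 + [0] * 2_945
mu = sum(population) / len(population)

rng = random.Random(0)
n = 100

# Draw many samples of size n (without replacement) and record each sample mean.
sample_means = [statistics.fmean(rng.sample(population, n)) for _ in range(2000)]
average_of_averages = statistics.fmean(sample_means)

print(f"true proportion      = {mu:.6f}")
print(f"average of averages  = {average_of_averages:.6f}")
print(f"one sample's estimate = {sample_means[0]:.2f}")
```

Any single sample mean can be high or low, but the average across many samples sits close to the true proportion.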
Range of Uncertainty
• The point estimate will be off (high or low)
• Quantify uncertainty in ± sampling error.
• Look ahead: If I draw a sample of 100, what value(s) should I
expect?
• Based on unbiasedness, I should expect the mean to hit the
true value.
• Based on my empirical rule, the value should be within plus
or minus 2 standard deviations 95% of the time.
• What should I use for the standard deviation?
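• Estimate $\sigma^2$ using the data: $s^2 = \frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N-1}$
• Then divide $s^2$ by N to estimate the variance of $\bar{x}$; the standard error is $s/\sqrt{N}$.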
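The calculation can be sketched as follows. The sample below is hypothetical (100 applications with 78 acceptances) and is only meant to show the arithmetic of $s^2$, the standard error, and a plus-or-minus 2 standard error range.

```python
import math
import statistics

# Hypothetical sample of N = 100 applications: 78 accepted (1.0), 22 rejected (0.0).
sample = [1.0] * 78 + [0.0] * 22

n = len(sample)
x_bar = statistics.fmean(sample)

# statistics.variance divides by N - 1, matching the formula for s^2.
s2 = statistics.variance(sample)

# Standard error of the mean: sqrt(s^2 / N) = s / sqrt(N).
se = math.sqrt(s2 / n)

low, high = x_bar - 2 * se, x_bar + 2 * se
print(f"x-bar = {x_bar:.3f}, s^2 = {s2:.4f}, standard error = {se:.4f}")
print(f"x-bar plus or minus 2 SE: ({low:.3f}, {high:.3f})")
```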
• When you have a fairly small sample (under 30) and you have to
estimate σ using s, then both the empirical rule and the normal
distribution can be a bit misleading. The interval you are using is a bit
too narrow.
• You will find the appropriate widths for your interval in the “t table.” The
values depend on the sample size. (More specifically, on N-1 = the
degrees of freedom.)
Example 1a
The foreman of a bottling plant has observed that the amount of soda in
each “32-ounce” bottle is actually a normally distributed random
variable, with a mean of 32.2 ounces and a standard deviation of .3
ounce.
If a customer buys one bottle, what is the probability that the bottle will
contain more than 32 ounces?
Example 1a…
• We want to find P(X > 32), where X is normally distributed and µ =
32.2 and σ =.3
$P(X > 32) = P\left(\frac{X - \mu}{\sigma} > \frac{32 - 32.2}{.3}\right) = P(Z > -.67) = 1 - .2514 = .7486$
• “there is about a 75% chance that a single bottle of soda contains more than
32oz.”
Example 1b…
The foreman of a bottling plant has observed that the amount of soda in
each “32-ounce” bottle is actually a normally distributed random
variable, with a mean of 32.2 ounces and a standard deviation of .3
ounce.
Example 1.b…
We want to find $P(\bar{X} > 32)$, where X is normally distributed
with µ = 32.2 and σ = .3
Things we know:
1) X is normally distributed, therefore so is $\bar{X}$.
2) $\mu_{\bar{x}} = \mu = 32.2$ oz.
3) $\sigma_{\bar{x}} = \sigma/\sqrt{n} = .3/\sqrt{4} = .15$ oz.
Example 1.b…
If a customer buys a carton of four bottles, what is the probability that
the mean amount of the four bottles will be greater than 32 ounces?
“There is about a 91% chance the mean of the four bottles will exceed 32oz.”
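Both bottling-plant probabilities can be checked numerically. This is a sketch using Python's standard library rather than the z-table used in the slides, so the single-bottle result differs slightly from .7486 (the table rounds z to -.67).

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 32.2, 0.3

# Example 1a: one bottle, X ~ Normal(32.2, 0.3)
p_single = 1 - NormalDist(mu, sigma).cdf(32)

# Example 1b: mean of a carton of n = 4 bottles, x-bar ~ Normal(32.2, 0.3 / sqrt(4))
n = 4
p_carton = 1 - NormalDist(mu, sigma / sqrt(n)).cdf(32)

print(f"P(X > 32)     = {p_single:.4f}")
print(f"P(x-bar > 32) = {p_carton:.4f}")
```

The carton mean has half the standard deviation of a single bottle, so it is more tightly concentrated around 32.2 and more likely to exceed 32 ounces.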
Example 1.c…
797 794 817 813 817 793 762 719 804 811
837 804 790 796 807 801 805 811 835 787
800 771 794 805 797 724 820 601 817 801
798 797 788 802 792 779 803 807 789 787
794 792 786 808 808 844 790 763 784 739
805 817 804 807 800 785 796 789 842 829
Discussion
• Sampling distribution
• Justify inferring a sampling distribution from a single sample under the
assumption of repeated sampling
• Discuss why we place an emphasis on the sampling distribution of $\bar{x}$