
BUSINESS STATISTICS

SESSION 2

Young Bong Chang, Ph.D.


Sungkyunkwan University

What we will do
• Sampling
• The concept of a sampling distribution.
• Focus on the sampling distribution of the sample mean, and discuss the
role of a famous mathematical result called the central limit theorem.
• LLN/CLT

Populations and Samples


• Population: Collection of all possible observations (data points)
on a variable
• Sample: A subset of the data points in the population
• Random sample: Defined by the way the sample data are
obtained. All points in the population are equally likely to be
drawn in any particular sample.

• What is the purpose of obtaining a sample?


• Learn about the population.

Sampling – why?
• To estimate properties of a population from the data observed
in the sample
• Contacting the entire population is too time-consuming and
expensive.
• Certain tests are destructive.
• Checking all the items is physically impossible.
• The mathematical procedures appropriate for performing this estimation
depend on which properties of the population are of interest and which
type of random sampling scheme is used.
• Because the details are considerably more complex for more complex
sampling schemes such as multistage sampling, we will focus on simple
random samples, where the mathematical details are relatively
straightforward.

Types of Sampling
• Simple random sampling
• A simple random sample of size n has the property that every possible
sample of size n has the same probability of being chosen.
• Simple random samples are the easiest to understand, and their statistical
properties are fairly straightforward.
• There are several ways simple random samples can be chosen, all of
which involve random numbers.
• Because each sampling unit has the same chance of being sampled,
simple random sampling can result in samples that are spread over the
entire population.
• Simple random sampling requires that all sampling units be identified
prior to sampling. Sometimes this is infeasible.
• Because representation is left to chance, any particular simple random
sample can still underrepresent or overrepresent certain segments of
the population.

Nonrandom Samples
• Nonrandom samples produce tainted, sometimes unbelievable,
results
• Biased with respect to the population
• Results reflect only the subpopulation from which the data are obtained.
• Sources of bias in samples
• Bad sample design – e.g., home phone surveys conducted during working
hours
• Survey (non)response bias – e.g., hotel opinion surveys about service
quality
• Attrition bias from clinical trials - e.g., if the drug works, the subject does
not come back.
• Self selection – volunteering for a trial or an opinion sample.
• Participation bias – e.g., voluntary participation in the Literary Digest
poll (see next slide)

Nonrandom Samples – the classic case


• Literary Digest, 1936, Alf Landon vs. Franklin Roosevelt: Survey
result based on a HUGE sample. Prediction?
• Landon, 1,293,669
• Roosevelt, 972,897

• Final Returns in the Digest’s Poll of Ten Million Voters


• Literary Digest subscribers
• Telephone registrations and drivers’ license registrations – both
overrepresented on the republican side.

• Election result: Roosevelt by a landslide, 62%-38%



The Lesson…
• Having a really big sample does not assure you of an accurate
result. It may assure you of a really solid, really bad
(inaccurate) result.

• To make our sample look more like a random sample, ….

Key Terms in Sampling


• Unless you measure each member of the population – that is,
you take a census – you cannot learn the exact value of a
population parameter. Therefore, you instead take a random
sample of some type and estimate the population parameter
from the data in the sample.
• A point estimate is a single numeric value, a “best guess” of a population
parameter, based on the data in a random sample.
• The sampling error (or estimation error) is the difference between the point
estimate and the true value of the population parameter being estimated.

Key Terms in Sampling…


• The sampling distribution of any point estimate
• is the distribution of the point estimates from all possible samples (of a given
sample size) from the population.

• A confidence interval
• is an interval around the point estimate, calculated from the sample data, that is
very likely to contain the true value of the population parameter.

• An unbiased estimate
• is a point estimate such that the mean of its sampling distribution is equal to the
true value of the population parameter being estimated.

• The standard error of an estimate
• is the standard deviation of the sampling distribution of the estimate. It measures
how much estimates vary from sample to sample.

Sampling Distributions…
• The distribution of a statistic in “repeated sampling” is the
sampling distribution.
• The random sample is itself random, since each member is random. (A
second sample will differ randomly from the first one.)
• Statistics computed from random samples will vary as well.

• The sampling distribution is the theoretical population that


generates sample statistics.
• The method we will employ relies on the rules of probability and the
laws of expected value and variance to derive the sampling
distribution.
• e.g., the roll of one and two dice…

Note: Sample statistics


• Statistic = a quantity that is computed from a sample (a single
measure of some attributes of a sample).
• We will assume random samples.

• Ex. Sample sum: Total = Σᵢ xᵢ
• Ex. Sample mean: x̄ = (1/N) Σᵢ xᵢ
• Ex. Sample variance: s² = [1/(N − 1)] Σᵢ (xᵢ − x̄)²
• Ex. Sample minimum: x[1]
• Ex. Proportion of observations less than 10
• Ex. Median = the value M for which 50% of the observations are less than
M.
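These sample statistics are easy to compute directly. The sketch below uses a small made-up data set (the numbers are illustrative, not from the slides) and only Python's standard library:

```python
import statistics

# Hypothetical sample data (illustrative only)
x = [12, 7, 15, 9, 22, 7, 11, 18, 5, 14]
n = len(x)

total = sum(x)                                     # sample sum
mean = total / n                                   # sample mean: (1/N) * sum of x_i
var = sum((xi - mean) ** 2 for xi in x) / (n - 1)  # sample variance, N-1 divisor
minimum = min(x)                                   # sample minimum x[1]
prop_lt_10 = sum(xi < 10 for xi in x) / n          # proportion of observations < 10
med = statistics.median(x)                         # sample median

print(total, mean, var, minimum, prop_lt_10, med)
```

Note that `statistics.variance` uses the same N − 1 divisor, so it reproduces the s² formula above.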

Note: E(x) = μ = x̄ (?)
• Let X be distributed as below

X P(X)
10 0.25
20 0.5
30 0.25

• E(x) = Σₓ x·P(x) = 20
• x̄ = (1/n) Σᵢ xᵢ
• e.g., If samples are drawn randomly, the observed proportion of draws
equal to 10 is likely to be close to 0.25.
• What would happen if the sample is not randomly drawn?
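A quick simulation (a sketch using only the standard library) shows that with random draws the observed share of 10s settles near P(10) = 0.25 and the sample mean near E(x) = 20:

```python
import random

random.seed(1)
values, probs = [10, 20, 30], [0.25, 0.50, 0.25]

# 100,000 random draws from the distribution of X
sample = random.choices(values, weights=probs, k=100_000)

prop_10 = sample.count(10) / len(sample)   # should be near P(10) = 0.25
xbar = sum(sample) / len(sample)           # should be near E(x) = 20
print(prop_10, xbar)
```

A nonrandom draw (say, oversampling the value 30) would push both numbers away from 0.25 and 20, which is the point of the last bullet.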

Distributions of the Sample Sum and the Sample Mean
Samples obtained from N(500, 100²)

• The sample sum and sample mean are random variables.
• Each random sample produces a different sum and mean.

Distributions of the Sample Sum and the Sample Mean…
• The sample sum
• Mean of the sum: E[X1 + X2 + … + XN] = E[X1] + E[X2] + … + E[XN] = Nμ
• Variance of the sum (independent draws): Var[X1 + X2 + … + XN] =
Var[X1] + … + Var[XN] = Nσ²

• The sample mean
• Expected value of the sample mean: E[(1/N)(X1 + X2 + … + XN)] =
(1/N){E[X1] + E[X2] + … + E[XN]} = (1/N)Nμ = μ
• Variance of the sample mean: Var[(1/N)(X1 + X2 + … + XN)] =
(1/N²){Var[X1] + … + Var[XN]} = Nσ²/N² = σ²/N
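These two results can be checked by simulation. A sketch (standard library only) draws many samples of N = 20 from N(500, 100²), matching the running example on the surrounding slides:

```python
import random
import statistics

random.seed(42)
mu, sigma, N = 500, 100, 20

# 20,000 repeated samples of size N; record each sample mean
means = [statistics.fmean(random.gauss(mu, sigma) for _ in range(N))
         for _ in range(20_000)]

mean_of_means = statistics.fmean(means)  # should be near mu = 500
sd_of_means = statistics.stdev(means)    # should be near sigma/sqrt(N) ~ 22.36
print(mean_of_means, sd_of_means)
```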

Sample results vs. population value


Samples obtained from N(500, 100²)

The mean of the 10 means is 495.87. The true mean is 500.

The standard deviation of the 10 means is 16.72. σ/√N is 100/√20 = 22.361.

Sampling distribution

• The sample mean has a sampling mean and a sampling variance.
• The sample mean also has a probability distribution. It looks like a
normal distribution.

This is a histogram for 1,000 means of samples of 20 observations
from Normal[500, 100²].

Sampling Distribution of the Mean…


• A fair die is thrown infinitely many times, with the random
variable X = # of spots on any throw.
• The probability distribution of X is:

x 1 2 3 4 5 6
P(x) 1/6 1/6 1/6 1/6 1/6 1/6
…and the mean and variance are μ = 3.5 and σ² = 35/12 ≈ 2.92.

Sampling Distribution of Two Dice


• A sampling distribution is created by looking at all samples of size n=2
(i.e. two dice) and their means…

While there are 36 possible samples of size 2, there are only 11 values for
the sample mean, and some (e.g., x̄ = 3.5) occur more frequently than
others (e.g., x̄ = 1).

Sampling Distribution of Two Dice…


The sampling distribution of x̄ is shown below:
x̄ P(x̄)
1.0 1/36
1.5 2/36
2.0 3/36
2.5 4/36
3.0 5/36
3.5 6/36
4.0 5/36
4.5 4/36
5.0 3/36
5.5 2/36
6.0 1/36
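The table can be reproduced by enumerating all 36 equally likely samples; a short sketch:

```python
from collections import Counter
from fractions import Fraction

# All 36 equally likely ordered two-dice samples, reduced to their means
counts = Counter((d1 + d2) / 2 for d1 in range(1, 7) for d2 in range(1, 7))
dist = {m: Fraction(c, 36) for m, c in sorted(counts.items())}

for m, p in dist.items():
    print(m, p)   # 1.0 -> 1/36, 1.5 -> 2/36, ..., 3.5 -> 6/36, ..., 6.0 -> 1/36
```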

Compare…
Compare the distribution of X…

…with the sampling distribution of x̄.

As well, note that μ(x̄) = μ and σ²(x̄) = σ²/2 (for n = 2 dice).

Sampling Distribution of the Mean


• In random sampling from a normal population with mean μ and
variance σ², the sample mean will also have a normal distribution
with mean μ and variance σ²/N.
• Does this work for other distributions, such as Poisson and
Binomial?
• Does the mean have the same distribution as the population? Poisson or
binomial? No.
• Is the mean normally distributed? Approximately – to be pursued later.

Sampling Distributions in repeated samples


• The sampling distribution provides us with the behavior of the
sample mean in repeated samples.
• In practice, I have only one sample and one sample mean. Does this still make sense?

• Unbiased estimates
• E(X̄) = µ
• In a random sampling situation, the estimation error, ∆ = (X̄ − μ), is zero
on average (its expected value is zero).
• This does not depend on the sample size.

• Consistent estimates
• The standard deviation of X̄ is SD(X̄) = σ/√N
• As N goes to ∞, SD(X̄) goes to zero.

Generalize…
We can generalize the mean and variance of the sampling distribution of
two dice: μ(x̄) = μ, σ²(x̄) = σ²/2

…to n dice: μ(x̄) = μ, σ²(x̄) = σ²/n

The standard deviation of the sampling distribution is called the
standard error: σ(x̄) = σ/√n

Exercise
• The amount of time spent by North American adults watching TV per day is
normally distributed with a mean of 6 hours and a standard deviation of 1.5 hours.

• a) what is the probability that a randomly selected North American adult watches
TV for more than 7 hours per day?

• Ans)

• b) what is the probability that the average time watching TV by a random sample
of five North American adults is more than 7 hours?

• Ans)

• c) Compare your answers in parts (a) and (b) and explain what drives the
difference between the two.

Exercise - solutions
• The amount of time spent by North American adults watching TV per day is
normally distributed with a mean of 6 hours and a standard deviation of 1.5 hours.

• a) what is the probability that a randomly selected North American adult watches
TV for more than 7 hours per day?

• Ans) P(X > 7) = P(Z > (7 − 6)/1.5) = 0.2514

• b) what is the probability that the average time watching TV by a random sample
of five North American adults is more than 7 hours?

• Ans) P(x̄ > 7) = P(Z > (7 − 6)/(1.5/√5)) = 0.0681

• c) Compare your answers in parts (a) and (b) and explain what drives the
difference between the two.
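Both probabilities can be reproduced with the standard library's `statistics.NormalDist` (the slide values 0.2514 and 0.0681 come from a z-table with z rounded to two decimals, so the exact answers differ very slightly):

```python
from statistics import NormalDist

Z = NormalDist()          # standard normal
mu, sigma, n = 6, 1.5, 5

p_a = 1 - Z.cdf((7 - mu) / sigma)               # one adult: X ~ N(6, 1.5^2)
p_b = 1 - Z.cdf((7 - mu) / (sigma / n ** 0.5))  # sample mean of n = 5 adults
print(round(p_a, 4), round(p_b, 4))
```

The driver of the difference in part (c) is the smaller standard deviation of the sample mean, 1.5/√5, which makes a mean above 7 much less likely than a single value above 7.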

Two Major Theorems


• Law of Large Numbers: As the sample size gets larger,
sample statistics get ever closer to the population
characteristics.
• Central Limit Theorem: Statistics computed from sums or
means (such as the sample mean itself) are approximately
normally distributed, regardless of the parent distribution.

The Law of Large Numbers

As N → ∞, P[|x̄ − μ| > ε] → 0,
regardless of how small ε is.

Bernoulli knew…

The Law of Large Numbers…


e.g.,
An event consists of two random outcomes, YES and NO:
Prob[YES occurs] = θ, where θ need not be 1/2
Prob[NO occurs] = 1 − θ
The event is to be staged N times, independently.
N₁ = number of times YES occurs; P = N₁/N

LLN: As N → ∞, Prob[|P − θ| > ε] → 0,
no matter how small ε is.

For any N, P will deviate from θ because of randomness.
As N gets larger, the difference will disappear.
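A sketch of the LLN in action (standard library only): simulate the YES/NO event with an assumed θ = 0.3 and watch P = N₁/N approach θ as N grows:

```python
import random

random.seed(7)
theta = 0.3                     # assumed P(YES); need not be 1/2

props = {}
for N in (100, 10_000, 1_000_000):
    n1 = sum(random.random() < theta for _ in range(N))   # count of YES
    props[N] = n1 / N                                     # P = N1/N
    print(N, props[N])
```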

Application of the Law of Large Numbers

• The casino business is nothing more than a huge application of the
law of large numbers.
• The insurance business is close to this as well.

Implication of the Law of Large Numbers


• If the sample is large enough, the difference between the
sample mean and the true mean will be trivial.
• This follows from the fact that the variance of the mean is σ²/N → 0.

• An estimate of the population mean based on a larger sample
is better than an estimate based on a smaller one.

• The problem of a “biased” sample:
• As the sample size grows, a biased sample produces a better and better
estimator of the wrong quantity.
• Drawing a bigger sample does not make the bias go away. That was the
essential fallacy of the Literary Digest poll.

Central Limit Theorem…

• The sampling distribution of the mean of a random sample drawn


from any population is approximately normal for a sufficiently
large sample size.
• If the population is normal, then X̄ is normally distributed for all values of n.
• If the population is non-normal, then X̄ is approximately normal only for
larger values of n.
• The larger the sample size, the more closely the sampling distribution of X̄
will resemble a normal distribution.
• In most practical situations, a sample size of 30 may be sufficiently large
to allow us to use the normal distribution as an approximation for the
sampling distribution of X̄.
• Theorem (loosely): Regardless of the underlying distribution of
the sample observations, if the sample is sufficiently large
(generally > 30), the sample mean will be approximately normally
distributed with mean μ and standard deviation σ/√N.

Summary

• If X is normal, X̄ is normal. If X is nonnormal, X̄ is approximately
normal for sufficiently large sample sizes.

Discussion 1 : Estimating Acceptance Rate

• Experiment = A randomly
picked application.
Let X = 0 if Rejected
Let X = 1 if Accepted
• X is DISCRETE (Binary). This
is called a Bernoulli random
variable.

The 13,444 observations are the population. The true proportion is μ = 0.780943. We
draw samples of N from the 13,444 and use the observations to estimate μ.

Discussion 1 : Estimating Acceptance Rate


The sample proportion we are examining here is a sample mean:
X = 0 if the individual’s application is rejected
X = 1 if the individual’s application is accepted
The “acceptance rate” is x̄ = (1/N) Σᵢ xᵢ. The population proportion
is μ = 0.780943. x̄ is an estimator of μ, the population mean.

x̄ in 100 (repeated) samples with N = 144 in each sample

Discussion 1 : The Mean is A Good Estimator

• Sometimes the mean is too high, sometimes too low. On average, it


seems to be right.
• The sample mean of the 100 sample estimates is 0.7844. The population
mean (true proportion) is 0.7809.

Discussion 1 : What Makes it a Good Estimator?

• The average of the averages will hit the true mean (on
average)
• The mean is UNBIASED: No moral connotations

• The sampling variability in the estimator gets smaller as N gets


larger.
• If N gets large enough, we should hit the target exactly; the mean is
CONSISTENT.

Discussion 1 : Uncertainty in Estimation


• How to quantify the variability in the proportion estimator?

Range of Uncertainty
• The point estimate will be off (high or low)
• Quantify uncertainty in ± sampling error.
• Look ahead: If I draw a sample of 100, what value(s) should I
expect?
• Based on unbiasedness, I should expect the mean to hit the
true value.
• Based on my empirical rule, the value should be within plus
or minus 2 standard deviations 95% of the time.
• What should I use for the standard deviation?

Discussion 1 : Estimating the Variance of the


Distribution of Means
• We will have only one sample!
• Use what we know about the variance of the mean:
• V[mean] = σ²/N
• Estimate σ² using the data: s² = Σᵢ (xᵢ − x̄)² / (N − 1)
• Then, divide s² by N.

Discussion 1 : The Sampling Distribution


• For sampling from the population and using the sample mean
to estimate the population mean:
• The expected value of x̄ will equal μ
• The standard deviation of x̄ will equal σ/√N
• The CLT suggests a normal distribution

Discussion 1 : The Sampling Distribution


Sometimes the sample mean will be very close to the true mean.
Sometimes the sample mean will be quite far from the true mean.

This is the sampling variability of the mean as an estimator of μ



Discussion 1 : Accommodating Sampling


Variability
• To describe the distribution of sample means, use the sample mean
x̄ to estimate the population expected value.
• To describe the variability, use the sample standard deviation, s,
divided by the square root of N.
• To accommodate the distribution, use the empirical rule: 95%, 2
standard deviations.
• i.e., x̄ ± 2 × SD(x̄), or x̄ ± 1.96 × SD(x̄) if x̄ follows a normal distribution

Discussion 1 : Estimating the Sampling Distribution


• For the 1st sample,
• the mean was 0.849, s was 0.358. s/√N = .0298
• Forming the distribution I would use
0.849 ± 2 x 0.0298
• For a different sample,
• the mean was 0.750, s was 0.433, s/√N = .0361.
• Forming the distribution, I would use
0.750 ± 2 x 0.0361
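The arithmetic for both samples, plus a check of whether each interval actually covers the true proportion μ = 0.780943 from the slides:

```python
import math

true_mu = 0.780943                  # population proportion (from the slides)
covers = []
for xbar, s in [(0.849, 0.358), (0.750, 0.433)]:
    se = s / math.sqrt(144)         # s / sqrt(N)
    lo, hi = xbar - 2 * se, xbar + 2 * se
    covers.append(lo <= true_mu <= hi)
    print(f"xbar = {xbar}: [{lo:.3f}, {hi:.3f}], covers mu? {covers[-1]}")
```

Note that the first interval, [0.789, 0.909], just misses μ while the second covers it, which previews the question on the next slide.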

Discussion 1 : Will the Interval Contain the True


Value?
• How can we accommodate uncertainty?
• Form a confidence interval
• Note: The midpoint is random; it may be very high or low, in which case
the interval will not contain the true value.
• Sometimes it will contain the true value.

• The degree of (un)certainty depends on the width of the interval.


• Very narrow interval: very uncertain. (1 standard error)
• Wide interval: much more certain (2 standard errors)
• Extremely wide interval: nearly perfectly certain (2.5 standard errors)
• Infinitely wide interval: Absolutely certain.

• The interval is a “Confidence Interval”


• The degree of certainty is the degree of confidence.
• The standard in statistics is 95% certainty (about two standard errors).
• Interpretation of the interval
• Not a statement about the probability that μ will lie in a specific interval.
• (1 − α) percent of the time, the interval will contain the true parameter.

Discussion 1 : Where did the interval widths


come from?
• Empirical rule of thumb:
• 2/3 = 66 2/3% is contained in an interval that is the mean plus and minus
1 standard deviation
• 95% is contained in a 2 standard deviation interval
• 99% is contained in a 2.5 standard deviation interval.
• Based exactly on the normal distribution, the exact values would be
• 0.9675 standard deviations for 2/3 (rather than 1.00)
• 1.9600 standard deviations for 95% (rather than 2.00)
• 2.5760 standard deviations for 99% (rather than 2.50)
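The exact values come from the inverse standard normal CDF; a one-liner per coverage level with the standard library:

```python
from statistics import NormalDist

Z = NormalDist()
# Two-sided coverage c needs the z-value with cdf = 0.5 + c/2
zs = {cov: Z.inv_cdf(0.5 + cov / 2) for cov in (2/3, 0.95, 0.99)}
for cov, z in zs.items():
    print(f"{cov:.4f} coverage: z = {z:.4f}")
```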

Discussion 1: Large sample


• If the sample is moderately large (over 30), one can use the normal
distribution values instead of the empirical rule.
• The empirical rule is easier to remember. The values will be very close to
each other.

• When you have a fairly small sample (under 30) and you have to
estimate σ using s, then both the empirical rule and the normal
distribution can be a bit misleading. The interval you are using is a bit
too narrow.
• You will find the appropriate widths for your interval in the “t table.” The
values depend on the sample size. (More specifically, on N − 1 = the
degrees of freedom.)

Example 1a
The foreman of a bottling plant has observed that the amount of soda in
each “32-ounce” bottle is actually a normally distributed random
variable, with a mean of 32.2 ounces and a standard deviation of .3
ounce.

If a customer buys one bottle, what is the probability that the bottle will
contain more than 32 ounces?

Example 1a…
• We want to find P(X > 32), where X is normally distributed with µ = 32.2
and σ = .3

P(X > 32) = P((X − µ)/σ > (32 − 32.2)/.3) = P(Z > −.67) = 1 − .2514 = .7486

• “there is about a 75% chance that a single bottle of soda contains more than
32oz.”

Example 1b…
The foreman of a bottling plant has observed that the amount of soda in
each “32-ounce” bottle is actually a normally distributed random
variable, with a mean of 32.2 ounces and a standard deviation of .3
ounce.

If a customer buys a carton of four bottles, what is the probability that


the mean amount of the four bottles will be greater than 32 ounces?

Example 1.b…
We want to find P(X̄ > 32), where X is normally distributed
with µ = 32.2 and σ = .3

Things we know:
1) X is normally distributed, therefore so is X̄.
2) E(X̄) = µ = 32.2 oz.
3) SD(X̄) = σ/√n = .3/√4 = .15 oz.

Example 1.b…
If a customer buys a carton of four bottles, what is the probability that
the mean amount of the four bottles will be greater than 32 ounces?

“There is about a 91% chance the mean of the four bottles will exceed 32oz.”
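Both bottle probabilities in one sketch using `statistics.NormalDist` (the slide's .7486 uses z rounded to −.67; the unrounded answer is ≈ .7475):

```python
from statistics import NormalDist

mu, sigma = 32.2, 0.3
one = NormalDist(mu, sigma)               # a single bottle
four = NormalDist(mu, sigma / 4 ** 0.5)   # mean of a carton of four

p_one = 1 - one.cdf(32)     # about 0.75
p_four = 1 - four.cdf(32)   # about 0.91
print(round(p_one, 4), round(p_four, 4))
```

The mean of four bottles has the smaller standard deviation (.15 vs. .3), so it is more likely to stay above 32 oz.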

Example 1.c…

797 794 817 813 817 793 762 719 804 811
837 804 790 796 807 801 805 811 835 787
800 771 794 805 797 724 820 601 817 801
798 797 788 802 792 779 803 807 789 787
794 792 786 808 808 844 790 763 784 739
805 817 804 807 800 785 796 789 842 829

• The sample of 60 operators appears above. Suppose it is claimed that


the population that generated these data is Poisson with mean 800.
How likely is it to have observed these data if the claim is true?
• The sample mean is 793.23. The assumed population standard error
of the mean, as we saw earlier, is √(800/60) = 3.65. If the mean
really were 800 (and, as the Poisson distribution implies, the variance
were also 800), then the probability of observing a sample mean this
small would be
• P[z < (793.23 – 800)/3.65] = P[z < −1.855] = .0317981.
• This is fairly small. (Less than the usual 5% considered reasonable.)
This might cast some doubt on the claim.
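The calculation, sketched with the standard library (a Poisson claim of mean 800 implies variance 800 as well):

```python
import math
from statistics import NormalDist

claimed_mu, n, xbar = 800, 60, 793.23

se = math.sqrt(claimed_mu / n)       # sqrt(800/60), about 3.65
z = (xbar - claimed_mu) / se         # about -1.85
p = NormalDist().cdf(z)              # P[sample mean this small | mu = 800]
print(round(se, 2), round(z, 3), round(p, 4))
```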

Exercises – mark midterms


• The time it takes for a statistics professor to mark his midterm
test for one student is normally distributed with a mean of 4.8
minutes and a standard deviation of 1.3 minutes. There are 60
students in the class.
• A) What is the probability that he needs more than 5 hours to mark all the
midterm tests?
• B) What underlying assumption do you make in answering part (a)?
• C) Does your answer change if you discover that the times needed to mark
a midterm test are not normally distributed?

Exercises – mark midterms (solutions)



• A) P(X̄ > 5×60/60) = P(Z > (5 − 4.8)/(1.3/√60)) = 0.117
• B) The 60 midterm tests of the students in this year’s class
can be considered a random sample of the many thousands of
midterm tests the professor has marked (and will mark).
• C) No change. With n = 60, the central limit theorem makes X̄
approximately normal even if the marking times are not.
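Part (a) sketched with `statistics.NormalDist`: more than 5 hours total for 60 tests is equivalent to an average marking time above 5 minutes:

```python
from statistics import NormalDist

mu, sigma, n = 4.8, 1.3, 60
threshold = 5 * 60 / 60        # 5 hours over 60 tests -> xbar > 5 minutes

z = (threshold - mu) / (sigma / n ** 0.5)
p = 1 - NormalDist().cdf(z)
print(round(z, 2), round(p, 3))
```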

Discussion
• Sampling distribution
• Justify using a sampling distribution, which is defined in terms of repeated
samples, when in practice only one sample is drawn.
• Discuss why we place an emphasis on the sampling distribution of x̄

• Central limit theorem


• How large must n be for the approximation to be valid?
