
Mathematical Techniques for Computer Science Applications
CSCI-GA.1180-001
Parijat Dube
Summer 2022

Lecture material from Linear Algebra and Probability for Computer Science Applications, CRC Press, 2012.
Lecture 11: Confidence Intervals
Required Reading: Chapter 11 of Textbook
Chapter 8 (8.1-8.4) of online probability book
https://www.probabilitycourse.com/preface.php
Sampling error
Suppose that you poll 1000 people randomly chosen from the
American population on the question of whether they prefer blueberry
to apple pie. You find that 574 prefer apple pie. Presumably:
a) You can’t be sure that exactly .574 of the American population
prefers apple pie to blueberry pie.
b) You can be pretty sure that at least 30% of the American population
prefer apple pie to blueberry pie.
How sure can you be that most Americans prefer apple pie to blueberry
pie?
Point estimate vs interval estimate
• Refer to Chapter 8.2 and 8.3 of https://www.probabilitycourse.com
Confidence interval
• Plausible range of values for the population parameter is called a
confidence interval.
• If we report a point estimate, we probably won’t hit the exact
population parameter. If we report a range of plausible values we
have a good shot at capturing the parameter.
Confidence Interval
• From Central Limit Theorem
Calculation for a napkin at a dinner party
Let p be the (true, unknown) fraction and N the number in the sample.
As long as p and 1−p are both more than around 5/N, the
distribution of the sample mean closely follows a Gaussian with mean
p and standard deviation $\sigma = \sqrt{p(1-p)/N}$.
If $0.2 < p < 0.8$ then $0.4 < \sqrt{p(1-p)} \le 0.5$, so $\sigma \approx 0.5/\sqrt{N}$.
The 95% confidence interval is within $2\sigma \approx 1/\sqrt{N}$.
The 99% confidence interval is within $2.6\sigma \approx \tfrac{5}{4} \cdot 1/\sqrt{N}$.
The 99.9% confidence interval is within $3.3\sigma \approx \tfrac{5}{3} \cdot 1/\sqrt{N}$.
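The napkin rule can be sketched in a few lines of Python. This is only an illustration: the function name `napkin_interval` and the lookup table are made up for this example, and the multipliers are the rounded ones from the slide, so the rule is only reasonable when p is roughly between 0.2 and 0.8.

```python
import math

# Rounded multipliers from the napkin rule: half-width = multiplier / sqrt(N).
# Assumption: the true fraction lies roughly between 0.2 and 0.8.
NAPKIN_MULTIPLIER = {0.95: 1.0, 0.99: 5 / 4, 0.999: 5 / 3}


def napkin_interval(sample_fraction, n, level=0.95):
    """Return (low, high) for the rule-of-thumb confidence interval."""
    half_width = NAPKIN_MULTIPLIER[level] / math.sqrt(n)
    return sample_fraction - half_width, sample_fraction + half_width


# Poll from the opening slide: 574 of 1000 prefer apple pie.
for level in (0.95, 0.99, 0.999):
    low, high = napkin_interval(574 / 1000, 1000, level)
    print(f"{level:.1%}: [{low:.3f}, {high:.3f}]")
```

The third decimal is printed only to show the effect of rounding; as the slides note, precision beyond the second decimal is not meaningful here.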
Apple pie example
$\bar{X} = 0.574$, $N = 1000$.
$1/\sqrt{1000} \approx 0.03$. $\sqrt{p(1-p)} \approx 1/2$.

So the 95% confidence interval for p is about [.54,.60]


The 99% confidence interval for p is about [.53,.61]
The 99.9 % confidence interval for p is about [.52,.62]
Precision beyond the second decimal is meaningless.
Even simpler cocktail napkin formula
If the frequency is p and you sample N items, then you can be 95% sure that
the number in the sample is between $Np - \sqrt{N}$ and $Np + \sqrt{N}$,
assuming p is between 0.2 and 0.8.
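One way to sanity-check this rule is a small simulation. The sketch below is illustrative only: the function name, the seed, and the choices p = 0.5, N = 400, and 2000 trials are arbitrary assumptions, not values from the lecture.

```python
import math
import random


def fraction_of_counts_within_sqrt_n(p, n, trials, seed=0):
    """Simulate `trials` samples of size n and return the fraction of them
    whose count of successes lies within sqrt(n) of n * p."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        # Each draw is a Bernoulli(p) trial; the sum is the count in the sample.
        count = sum(rng.random() < p for _ in range(n))
        if abs(count - n * p) <= math.sqrt(n):
            hits += 1
    return hits / trials


coverage = fraction_of_counts_within_sqrt_n(p=0.5, n=400, trials=2000)
print(f"fraction within Np +/- sqrt(N): {coverage:.3f}")
```

The printed fraction should land near 0.95, with some run-to-run variation if the seed is changed.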
Important Points
• $\sigma$ is proportional to $1/\sqrt{N}$.
• To reduce the width of the confidence interval by a factor of k, it is
necessary to increase the sample size by a factor of $k^2$.
• The confidence interval does not at all depend on the size of the
population, as long as the population is large enough that one can
validly approximate sampling without replacement by sampling with
replacement.
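The square-root scaling can be made concrete with the napkin 95% rule (half-width about $1/\sqrt{N}$). The helper names below are hypothetical, and the rule still assumes p is between 0.2 and 0.8.

```python
import math


def napkin_half_width(n):
    """Approximate 95% half-width 1/sqrt(n) from the napkin rule."""
    return 1 / math.sqrt(n)


def sample_size_for_half_width(w):
    """Smallest n whose napkin 95% half-width is at most w."""
    return math.ceil(1 / w ** 2)


# Quadrupling the sample size halves the half-width (k = 2, k^2 = 4).
for n in (1000, 4000, 16000):
    print(n, round(napkin_half_width(n), 4))

# A +/- 1 percentage point interval needs on the order of 10,000 respondents.
print(sample_size_for_half_width(0.01))
```

Note that the population size never appears in either function, which matches the last bullet above.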
Rederiving the formula
Suppose that the true frequency of category C in the population is p. What
is the distribution of frequencies in the sample?
(Note that this is not actually the question we want to answer. We know the
frequency in the sample; we want to know the true frequency. We’ll come
back to that, twice.)
Answer: You are taking N independent* samples from a population where
the category has frequency p. We can model each of these as a
Bernoulli random variable $X_i$ with parameter p. Each has mean p and
variance p(1−p). The total number of C’s in the sample is the random
variable $Y = X_1 + X_2 + \dots + X_N$. So Y follows the binomial distribution with
parameters p and N. Y has mean Np and variance Np(1−p).
* We’ll come back to “independent” too.
Rederiving the formula
The total number of C’s in the sample is the random variable
$Y = X_1 + X_2 + \dots + X_N$. Y follows the binomial distribution with parameters p
and N. Y has mean Np and variance Np(1−p).
The fraction of C’s in the sample, Z = Y/N, has mean p, variance
p(1−p)/N, and standard deviation $\sqrt{p(1-p)/N}$.
By the central limit theorem, Z is well approximated by the Gaussian
distribution with mean p and standard deviation $\sqrt{p(1-p)/N}$. (The
error is $O(1/\sqrt{N})$.)
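To see how close the Gaussian approximation is, one can compare it with the exact binomial probability. The sketch below is an illustration under assumed values (p = 0.574, N = 1000, counts 543 to 605); the function names are invented for this example.

```python
import math


def binomial_pmf(k, n, p):
    """Exact P(Y = k) for Y ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def binomial_prob_count_between(k_lo, k_hi, n, p):
    """Exact P(k_lo <= Y <= k_hi), summing the pmf over integer counts."""
    return sum(binomial_pmf(k, n, p) for k in range(k_lo, k_hi + 1))


def gaussian_prob_between(a, b, mean, sd):
    """P(a <= Z <= b) for a Gaussian Z, using the error function."""
    def cdf(x):
        return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))
    return cdf(b) - cdf(a)


n, p = 1000, 0.574
sd = math.sqrt(p * (1 - p) / n)  # standard deviation of the sample fraction

exact = binomial_prob_count_between(543, 605, n, p)
approx = gaussian_prob_between(0.543, 0.605, mean=p, sd=sd)
print(f"exact binomial: {exact:.4f}, Gaussian approximation: {approx:.4f}")
```

The two numbers differ slightly because the binomial is discrete, but the gap is small for a sample of this size.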
What about the Xi being independent?
If you sample with replacement (each time, you draw randomly from
the entire population) then the Xi are independent.
If you are careful not to repeat elements then you are sampling without
replacement. And in that case, as we have discussed, the Xi are not
independent. If the first person you sample likes apple pie, then that
slightly lowers the probability that the next person you sample will like
apple pie.
But it hardly matters.
Sampling without replacement
Let W be the total population. If $N < \sqrt{W}$, then it is unlikely that there
will be any repeated items in your sample, even if you sample with
replacement. (Birthday paradox.) So sampling without and sampling
with replacement are essentially the same. In fact, as long as N << W,
the number of repetitions will be too small to make any difference.
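The birthday-paradox claim can be checked directly. In this sketch the population size of 300 million is a hypothetical round number, and the function names are made up for illustration.

```python
def prob_no_repeats(n, w):
    """P(all n draws are distinct) when drawing with replacement from w items."""
    prob = 1.0
    for i in range(n):
        prob *= (w - i) / w  # the (i+1)-th draw must avoid the i items already seen
    return prob


def expected_repeated_pairs(n, w):
    """Expected number of pairs of draws that hit the same item."""
    return n * (n - 1) / (2 * w)


W = 300_000_000  # hypothetical population size; sqrt(W) is about 17,000
for n in (1000, 17_000, 100_000):
    print(n, prob_no_repeats(n, W), expected_repeated_pairs(n, W))
```

Even when repeats become likely (N well above $\sqrt{W}$), the expected number of repeated pairs stays tiny compared with N as long as N << W, which is why the distinction hardly matters.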
Sampling without replacement
Extreme example. Suppose N=W.
If you sample without replacement, then the sample is the entire
population, so the fraction in the sample is equal to the fraction in the
population, with no possible error.
Deriving the confidence on other bounds
Suppose that we want $P(a \le Z \le b)$ for some bounds a, b, where Z follows
the Gaussian distribution with mean $\mu$ and standard deviation $\sigma$.
Any two Gaussians differ only in a shift corresponding to $\mu$ and a scaling
in the x axis corresponding to $\sigma$. Therefore, if you consider the random
variable $G = (Z - \mu)/\sigma$, G follows the Gaussian distribution with
mean 0 and standard deviation 1. And
$$P(a \le Z \le b) = P\left(\frac{a-\mu}{\sigma} < G < \frac{b-\mu}{\sigma}\right)$$
$$P(c < G < d) = \int_c^d \tilde{P}(G = t)\,dt$$
where $\tilde{P}$ denotes the probability density.
Matlab
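Matlab provides a closely related function erf, the error function, defined as
$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,dt$$
Using a little calculus, one can determine that
$$\int_c^d \tilde{P}(G = u)\,du = \frac{1}{2}\left(\operatorname{erf}(d/\sqrt{2}) - \operatorname{erf}(c/\sqrt{2})\right)$$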
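The same identity works outside Matlab. As a sketch, Python's standard library exposes the error function as `math.erf`; the wrapper name `standard_normal_prob` is an assumption made for this example. It is used here to check the coverage of the 2, 2.6, and 3.3 standard-deviation rules from the napkin slide.

```python
import math


def standard_normal_prob(c, d):
    """P(c < G < d) for a standard Gaussian G, via the error function."""
    return 0.5 * (math.erf(d / math.sqrt(2)) - math.erf(c / math.sqrt(2)))


# Coverage of the symmetric intervals used in the napkin rule.
for k in (2.0, 2.6, 3.3):
    print(f"within {k} standard deviations: {standard_normal_prob(-k, k):.4f}")
```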
Back to the apple pie example
We’ve polled 1000 people. In the population at large the fraction that
prefer apple is 0.574. What is the probability that, in our sample, the
fraction that prefer apple is between 0.56 and 0.60?
Answer:
$\mu = p = 0.574$. $\sigma = \sqrt{p(1-p)/N} = 0.0156$.
a = 0.56. b = 0.60.
c = (a−µ)/σ = −0.895. d = (b−µ)/σ = 1.663.
$P(a < X < b) = (\operatorname{erf}(d/\sqrt{2}) - \operatorname{erf}(c/\sqrt{2}))/2 = 0.766$
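As a cross-check of this arithmetic, the sketch below recomputes the example with `statistics.NormalDist` from the Python standard library instead of erf. The variable names are chosen for this illustration only.

```python
import math
from statistics import NormalDist

p, n = 0.574, 1000
sigma = math.sqrt(p * (1 - p) / n)  # standard deviation of the sample fraction

a, b = 0.56, 0.60
c = (a - p) / sigma  # lower bound in standard units
d = (b - p) / sigma  # upper bound in standard units

# Gaussian approximation to the distribution of the sample fraction.
sample_fraction = NormalDist(mu=p, sigma=sigma)
prob = sample_fraction.cdf(b) - sample_fraction.cdf(a)

print(f"sigma = {sigma:.4f}, c = {c:.3f}, d = {d:.3f}, probability = {prob:.3f}")
```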
What does “confidence interval” mean?
Naïve answer
The one we’ve already seen:
If the true frequency in the population is .574 and you take a random
sample of 1000 elements, then with probability .95, the fraction in the
sample will be between 0.543 and 0.605

Pollster: That’s not what I want to know. I have the frequency in the
sample, I need to know something about the frequency in the
population.
What does “confidence interval” mean?
Frequentist answer
Frequentist: I’ll give you a procedure for computing a 95% confidence
interval. If you follow this procedure whenever you have taken a
random sample, using the frequency in the sample as p, then 95% of
the time the true probability is in the confidence interval. For practical
purposes it’s the same as the naïve formula, in almost all cases.
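The frequentist claim is about repeated use of the procedure, so it can be illustrated by simulation. The sketch below assumes the Gaussian-approximation procedure with the sample frequency plugged in for p; the function name, the seed, and the choice of 1000 simulated polls are arbitrary.

```python
import math
import random


def coverage_rate(true_f, n, trials, z=1.96, seed=0):
    """Fraction of simulated polls whose 95% interval contains true_f."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    covered = 0
    for _ in range(trials):
        # Run one poll of size n and record the sample frequency.
        p = sum(rng.random() < true_f for _ in range(n)) / n
        # Interval computed from the sample alone, as the pollster would.
        half_width = z * math.sqrt(p * (1 - p) / n)
        if p - half_width <= true_f <= p + half_width:
            covered += 1
    return covered / trials


rate = coverage_rate(true_f=0.574, n=1000, trials=1000)
print(f"fraction of intervals containing the true frequency: {rate:.3f}")
```

The true frequency is fixed throughout; only the sample, and therefore the interval, varies from poll to poll.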
Discussion of frequentist answer
Pollster: Oh. That seems rather confusing and indirect. You are talking
about going from frequencies in the sample to frequencies in the
population, which is good. But only in terms of a whole collection of
hypothetical experiments that I am not planning to run. I have a
specific situation to deal with: I polled 1000 people at random; 574
prefer apple pie. Can’t I just say that with probability .95, the
fraction in the population is between 0.543 and 0.605?
Frequentist: No. You’re talking nonsense. There’s no such thing as “the
probability that the fraction is between 0.543 and 0.605.” The fraction
is whatever it is. It isn’t drawn from a sample space. It isn’t generated
by a stochastic process. It doesn’t have a probability distribution.
Frequentist procedure
for confidence interval
What we want: for sample size N and confidence level h:
Two monotonically increasing functions $L_{N,h}(p)$ and $U_{N,h}(p)$ such that:
If the true frequency of the property in the population is f,
and you take a sample of size N,
and the frequency of the property in the sample S is p,
then with probability ≥ h, $L_{N,h}(p) \le f \le U_{N,h}(p)$.
Note: This is about the probability distribution of p (which the frequentist
considers legitimate), not of f (which the frequentist considers illegitimate).
Whatever poll is being carried out, and whatever f is, if you take a random
sample, measure p, and compute the interval, then with probability at least h, the
interval contains f.
Finding L and U
f = true frequency. p=frequency in sample. h=confidence level
Find monotonically increasing functions Q(f) and R(f) such that, if the
true frequency is f, then P(Q(f) ≤ p ≤ R(f)) ≥ h. This was what we
did in the naïve answer.
$Q(f) \le p$ is the same as $f \le Q^{-1}(p)$ (Q inverse).
$p \le R(f)$ is the same as $R^{-1}(p) \le f$.
So choose $L = R^{-1}$ and $U = Q^{-1}$, and then P(L(p) ≤ f ≤ U(p)) ≥ h, which is
what we want.
Finally, if N is substantial and p is not close to 0 or 1, and you use the
Gaussian approximation we used before, you can just use the same
confidence intervals as before.
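The inversion can also be carried out numerically. The sketch below is one possible illustration, not the lecture's own procedure: it assumes the Gaussian approximation for Q and R with the 95% multiplier 1.96, inverts them by bisection, and restricts the search to f between 0.05 and 0.95, where both functions are increasing for N = 1000. All function names are hypothetical.

```python
import math

Z95 = 1.96  # two-sided 95% multiplier for a standard Gaussian


def q(f, n, z=Z95):
    """Q(f): lower end of the likely range of the sample frequency."""
    return f - z * math.sqrt(f * (1 - f) / n)


def r(f, n, z=Z95):
    """R(f): upper end of the likely range of the sample frequency."""
    return f + z * math.sqrt(f * (1 - f) / n)


def invert_increasing(func, target, lo=0.05, hi=0.95, iterations=60):
    """Solve func(f) = target by bisection, assuming func increases on [lo, hi]."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def confidence_interval(p, n):
    """Return (L(p), U(p)) with L = R inverse and U = Q inverse."""
    lower = invert_increasing(lambda f: r(f, n), p)
    upper = invert_increasing(lambda f: q(f, n), p)
    return lower, upper


lower, upper = confidence_interval(0.574, 1000)
print(f"[{lower:.4f}, {upper:.4f}]")
```

For the apple pie poll the result agrees with the naïve interval to about three decimal places, consistent with the remark that the two coincide when N is substantial and p is not close to 0 or 1.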
