
STA1007S Lab 10: Confidence Intervals

October 2020

SUBMISSION INSTRUCTIONS:

Your answers need to be submitted on Vula.

Go into the Lab submissions section and click on "Lab Session 10 - 2020" to access the submission form.
Please note that the answers get automatically marked and so have to be in the correct format:

ENTER YOUR ANSWERS TO 2 DECIMAL PLACES UNLESS THE ANSWER IS A ZERO OR AN INTEGER
(for example, if the answer is 0 you enter 0 and not 0.00, or if the answer is 2 you enter 2 and not 2.00).

DO NOT INCLUDE ANY UNITS (e.g. meters, mg, etc.).

PROBABILITIES MUST BE BETWEEN 0 AND 1, SO A 50% CHANCE WOULD CORRESPOND TO A
PROBABILITY OF 0.5.

Introduction
This week’s lab is focused on confidence intervals and on the t-distribution. You will learn how to use R
functions to compute confidence intervals both when you assume that you know σ and when σ is unknown.
You will find that most of the R code necessary to execute the R commands is provided. This lab is meant to
be practice for you, so even if the code and the output of the code is provided, you are expected to create
your own script, run the pieces of code yourself and check whether the output is what you would expect it to
be. The questions you need to submit through Vula will appear in the submission boxes.
At any time you can call the function help() to obtain information about any function. For example, to
obtain a description of how the function sample() works, type the following in the
console (bottom left panel in RStudio):
help("sample")

or you can just type:


?sample

Make this a habit: check the help file of every function you use for the first time.

Start a new R script and import your data


Start a new R script in your existing R project for the computer labs and write a few lines describing what
you are going to do.
Remember to add a line to clean your working environment and one to double check that your working
directory is correct.
Remember to save your script frequently!
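
A minimal sketch of what the top of such a script might look like. The header comment is only an illustration, so write your own description; rm(), ls() and getwd() are base R functions.

```r
# STA1007S Lab 10: confidence intervals and the t-distribution
# (illustrative header; describe your own plan here)

# Clean the working environment
rm(list = ls())

# Check that the working directory is the one you expect
getwd()
```

If getwd() does not print your lab project folder, fix that before you continue.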

Confidence interval for the mean
In practice, usually we will only have one sample that we will need to use to estimate population parameters.
This is indeed the big challenge of inferential statistics! We want to make general, widely applicable assertions
from a limited set of observed values. For example, we might be interested to know what the average wing
length of the Red-winged starlings at UCT is or what the average incubation period of the common flu in
Cape Town is. It is impossible to measure each and every single bird or monitor every single patient, so we
need to use sampling to estimate the population mean.
You have already learnt that the sample mean is itself a random variable, i.e. it will vary from one random
sample of data to another. Furthermore, the CLT tells us that the sample mean will be approximately normally
distributed irrespective of the distribution of the data, as long as n is large enough (n > 30 is the usual
rule of thumb). As shown in the notes, the sampling distribution of a sample mean is:

$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$

where $\frac{\sigma^2}{n}$ conveys how variable we expect the sample mean to be. High variability implies that any single
estimate has a fair amount of uncertainty associated with it; in such cases the sample mean is also said
to be an imprecise estimate. When the results of any study are reported it is critical that some
measure of uncertainty is included together with any point estimate. A confidence interval provides a range
of plausible values whereby a wide range translates to a large degree of uncertainty and a narrow range of
values corresponds to a small degree of uncertainty.
Let's simulate a random sample of 50 observations from a uniform distribution U(10, 20). We can
calculate the expected value of a uniform distribution by $E[X] = \frac{a+b}{2}$ and the variance by $Var[X] = \frac{(b-a)^2}{12}$
(where a = 10 and b = 20 in this case). In other words, we know from theory that the true mean of the data
generating process is $\frac{a+b}{2} = 15$ and the true variance is $Var[X] = \frac{(20-10)^2}{12} = 8.33\ldots$ These are theoretical
values that will not be equal to the empirical values you observe in your random sample.
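
As a quick check of the arithmetic above, the theoretical values can be computed directly in R. This is only a sketch; the variable names are chosen here for illustration.

```r
# Parameters of the uniform distribution used in this lab
a <- 10
b <- 20

# Theoretical mean, variance and standard deviation of U(a, b)
theo_mean <- (a + b) / 2       # 15
theo_var  <- (b - a)^2 / 12    # 8.33...
theo_sd   <- sqrt(theo_var)    # about 2.887

c(mean = theo_mean, var = theo_var, sd = theo_sd)
```

The standard deviation of about 2.887 is the value of σ used in the confidence interval calculations in this lab.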
R has a set of functions for working with the uniform distribution, similar to those you've seen for the other
distributions. We use runif() to sample from a uniform distribution, punif() to calculate probabilities of
events (for example P (X ≤ x)), dunif() to get the value of the p.d.f. at a point x, and finally, qunif()
to get the quantile, i.e. the value of X that leaves a given percentage of the values below it (e.g. the
median leaves 50% of the values above and also below it).
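
The following sketch tries each of these functions on U(10, 20). The input values 12, 15 and 0.5 are arbitrary choices made here for illustration.

```r
# P(X <= 12) for X ~ U(10, 20): two tenths of the range lies below 12
p_below <- punif(12, min = 10, max = 20)    # 0.2

# Height of the p.d.f. at x = 15: constant at 1 / (20 - 10)
height <- dunif(15, min = 10, max = 20)     # 0.1

# Median: the value that leaves 50% of the distribution below it
med <- qunif(0.5, min = 10, max = 20)       # 15

c(p_below, height, med)

# Three random draws (your values will differ)
runif(3, min = 10, max = 20)
```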
# Simulate 50 observations from U(10,20)
UniformSample <- runif(50, 10, 20)
mean(UniformSample)

## [1] 15.57136
We observe that the mean in this particular sample of 50 points is 15.57 (it will be different for your random
sample) but we want a range of values in order to make better inference about the true mean, i.e. a confidence
interval. The normality of x̄ leads to the following formula for a confidence interval:

$$\left(\bar{x} - Z^* \times \frac{\sigma}{\sqrt{n}} \; ; \; \bar{x} + Z^* \times \frac{\sigma}{\sqrt{n}}\right)$$
The sample mean (x̄ = 15.57) and sample size (n = 50) are determined from the data, and we have just
calculated σ² = 8.33 (so σ = 2.887), so the variance of the sampling distribution here is $\frac{8.33}{50}$ and the
standard error is $\frac{2.887}{\sqrt{50}}$. Then, we
can use the function qnorm() to find the appropriate Z ∗ value (or you can use the Z table at the back of
Introstat). The q versions of the probability functions compute the value that corresponds to some given
percentage of the distribution, for example the figure below shows that the critical value that demarcates the
upper 10% of the standard normal distribution is equal to 1.28.

[Figure: density of Z ~ N(0,1) with the upper 10% of the distribution shaded to the right of the critical value 1.28.]

Typing qnorm(0.1, lower.tail = FALSE) will return this value, i.e. the value z that corresponds to
P (Z > z) = 0.10. Note that the lower.tail argument tells R whether the probability we supply is the area to the
left (P (Z < z)) or to the right (P (Z > z)) of the value that is returned. The default value for lower.tail is
TRUE, in which case qnorm(p) returns the value z with P (Z < z) = p. Go ahead and confirm that you get the same
value when specifying qnorm(0.9, lower.tail = TRUE). The code below shows further examples that give the critical
values that correspond to 2.5%, 50%, 66.7%, and 97.5% of the standard normal distribution:
qnorm(0.025, lower.tail = TRUE)

## [1] -1.959964
qnorm(0.025, lower.tail = FALSE)

## [1] 1.959964
qnorm(0.50, lower.tail = TRUE)

## [1] 0
qnorm(0.667, lower.tail = TRUE)

## [1] 0.4316442
qnorm(0.975, lower.tail = TRUE)

## [1] 1.959964
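
One way to convince yourself that a critical value is right is to feed it back into pnorm(), which undoes qnorm(). The sketch below does this for the upper 10% example; the variable names are chosen here for illustration.

```r
# Critical value that leaves 10% of N(0, 1) in the upper tail
z_upper <- qnorm(0.10, lower.tail = FALSE)

# The same value expressed through the lower tail
z_lower <- qnorm(0.90, lower.tail = TRUE)

# pnorm() undoes qnorm(): both lines should return 0.10
pnorm(z_upper, lower.tail = FALSE)
1 - pnorm(z_lower)

# The two critical values agree up to floating-point error
all.equal(z_upper, z_lower)
```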
The long way to get a 95% confidence interval is to put the pieces together manually:

$$15.57 \pm 1.96 \times \frac{2.887}{\sqrt{50}} = (14.77 \; ; \; 16.37)$$
We can also do all of this within the qnorm() function by specifying the appropriate mean and standard
deviation:
# Calculate the lower limit of the CI
low_limit <- qnorm(0.025, mean = mean(UniformSample), sd = 2.887/sqrt(50))

# Calculate the upper limit of the CI
up_limit <- qnorm(0.975, mean = mean(UniformSample), sd = 2.887/sqrt(50))

# or, equivalently, using the upper tail
up_limit <- qnorm(0.025, mean = mean(UniformSample), sd = 2.887/sqrt(50), lower.tail = FALSE)

# Print the CI
c(low_limit, up_limit)

## [1] 14.77114 16.37158


We know that 95% of the intervals calculated with this method will capture the true mean; in other words, an
interval will miss the true mean only about 1 time in 20. For this random sample the true mean
of 15 falls within the confidence interval. Go ahead and repeat this exercise: compute confidence intervals for
a number of samples and verify whether they capture the true mean of the distribution that generated the
data (which, remember, is what we are trying to learn about).
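
The repetition can also be automated. The sketch below is one possible approach, not the only one: it draws many samples from U(10, 20), builds a 95% interval from each using the known σ, and reports the proportion of intervals that contain 15. The seed and the number of repetitions are arbitrary choices made here.

```r
set.seed(2020)            # arbitrary seed, only so the run is reproducible

n_sims <- 1000            # number of repeated samples (assumed value)
n      <- 50              # sample size used in this lab
sigma  <- sqrt(100 / 12)  # known sd of U(10, 20), about 2.887
z_star <- qnorm(0.975)    # critical value for a 95% interval

# For each simulated sample, record whether the interval contains 15
covered <- replicate(n_sims, {
  x    <- runif(n, 10, 20)
  half <- z_star * sigma / sqrt(n)
  (mean(x) - half <= 15) && (15 <= mean(x) + half)
})

# Proportion of intervals that captured the true mean
mean(covered)
```

The proportion should be close to 0.95, but it will not equal it exactly because the simulation itself is random.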

SUBMISSION:

Vula Question 1. Assume that you have a random sample of 75 points. If the sample mean is equal to 3.45
and you assume that you know σ to be equal to 0.5, compute a confidence interval for µ.
Vula Question 2. If you repeated this exercise 100 times, how many times would you expect the interval to
miss the mean?

Reality check!
In the example above we simulated the data and so we knew what the true variance in the data was. In reality
we will not usually know the underlying distribution that generated the data and so will not know what σ is.
As discussed in the notes, when σ is estimated by s from the sample, the standardised statistic (x̄ − µ)/(s/√n)
follows the t distribution with n − 1 degrees of freedom. The t distribution is just another probability
distribution, and R has the corresponding functions rt(), pt(), dt() and qt().
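
These functions work like their normal and uniform counterparts, with the degrees of freedom supplied through the df argument. The sketch below uses arbitrary values (chosen here, and deliberately different from the submission questions) to show each call.

```r
# P(T > 2) for T ~ t with 10 degrees of freedom
p_upper <- pt(2, df = 10, lower.tail = FALSE)

# P(T <= 0): the t distribution is symmetric about 0, so this is 0.5
p_half <- pt(0, df = 10)

# Height of the p.d.f. of t_10 at the point 1
dens <- dt(1, df = 10)

# Value that leaves 5% of the t_20 distribution in the upper tail
t_crit <- qt(0.05, df = 20, lower.tail = FALSE)

c(p_upper, p_half, dens, t_crit)

# Five random draws from t_10 (your values will differ)
rt(5, df = 10)
```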

SUBMISSION:

Use R to find the following:


Vula Question 3. T ∼ t_7 ; P (T > 5.75)?
Vula Question 4. T ∼ t_32 ; P (T ≤ 0.50)?
Vula Question 5. T ∼ t_15 ; f (T = 0.5)?
Vula Question 6. T ∼ t_12 ; P (T > −1.75)?
Vula Question 7. What is the critical value that demarcates the upper 10% of the t_5 distribution?
When we do not know σ, the formula for computing a confidence interval stays the same except that you
need to replace the chosen Z ∗ value with the appropriate corresponding t value.

$$\bar{x} \pm t^*_{n-1} \times \frac{s}{\sqrt{n}}$$

Return to the example at the beginning of this lab where we want to use the 50 values contained in the vector
UniformSample to make inference about the true mean. Obviously this example is a little silly because we
know the true value of the mean, but for simplicity let's pretend that we were given these data and did not
simulate them. In that case we have to estimate the standard deviation from the data:
# Estimate sigma from s in the data
sd(UniformSample)

## [1] 2.784713

Now we estimate σ as 2.78 and the confidence interval for the true mean should use the t distribution instead
of the normal distribution. This amounts to the same calculation except that we need the appropriate t
critical value:
# Corresponding t_49 critical value
qt(0.975, df = 49)

## [1] 2.009575
Recall that when n > 30 we expect very similar results when using the t versus using the normal because
the estimate of σ becomes more stable as n increases. If we compare the t_49 critical value of 2.01 with the
Z* value of 1.96 we see how similar they are and so do not expect a meaningful difference in the resulting
confidence interval:
# Compute lower and upper limits using a t value
lower <- mean(UniformSample) - qt(0.975, df = 49)*(sd(UniformSample)/sqrt(50))
upper <- mean(UniformSample) + qt(0.975, df = 49)*(sd(UniformSample)/sqrt(50))

# Print the interval
c(round(lower,2), round(upper,2))

## [1] 14.78 16.36
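
As a cross-check, base R's t.test() function reports a t-based confidence interval for the mean in its conf.int component. The sketch below compares it with the manual calculation. It simulates its own data as a stand-in for UniformSample, with an arbitrary seed, so the numbers will not match the output above.

```r
set.seed(10)                      # arbitrary seed; your sample will differ
x <- runif(50, 10, 20)            # stand-in for UniformSample

# Interval assembled by hand from the t critical value
n      <- length(x)
half   <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
manual <- c(mean(x) - half, mean(x) + half)

# Interval reported by t.test(); conf.level = 0.95 is its default
builtin <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

round(rbind(manual, builtin), 2)
```

The two rows should agree, since t.test() applies the same formula. Change conf.level to get an interval at a different confidence level.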

SUBMISSION:

Vula Question 8. You want to estimate the mean incubation period of a particular flu virus. You obtain
data from ten patients with the virus and record how long the virus incubated in each patient. If you find the
sample mean to be 2.1 days and the sample variance to be 0.5 days², compute the appropriate 90% confidence
interval.

The commands you learned today


These are the functions and operators that you learned today. Fill in your own description of what they do.
runif()
punif()
qunif()
dunif()
rt()
pt()
qt()
dt()
