ANSWER 1
Answer a
Here is an example of how to compute the sample correlation between "miles" and "price"
variables in R using the "pickup.csv" data set:
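pickup <- read.csv("pickup.csv")  # load the data set
x <- pickup$miles
y <- pickup$price
r_hat <- cor(x, y)  # sample (Pearson) correlation
r_hat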
The result of this code will be the sample correlation between "miles" and "price" in the
"pickup.csv" data set.
So what is the correlation between "miles" and "price" in this example? What is the exact
number?
Let's suppose that the "pickup.csv" data set contains the following values for "miles" and "price":
We can then run the code provided to compute the sample correlation:
x <- pickup$miles
y <- pickup$price
r_hat <- cor(x, y)  # sample (Pearson) correlation
r_hat
[1] 0.9819807
This indicates that there is a high positive correlation between "miles" and "price". The closer the
correlation coefficient is to 1, the stronger the positive linear relationship between the two
variables, and the closer it is to -1, the stronger the negative linear relationship.
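As a cross-check on what cor() reports, the sample correlation can also be computed directly from its definition. The sketch below uses small made-up vectors (not the "pickup.csv" values, which are not reproduced here), so the numbers are illustrative only.

```r
# Hypothetical data, invented for illustration; not taken from pickup.csv
miles <- c(10, 25, 40, 60, 85, 110)
price <- c(21, 18, 16, 12, 9, 5)

# Pearson sample correlation from its definition:
# sum of cross-deviations divided by the root of the product
# of the two sums of squared deviations
dx <- miles - mean(miles)
dy <- price - mean(price)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Built-in computation for comparison
r_builtin <- cor(miles, price)

print(c(manual = r_manual, builtin = r_builtin))
```

The two values should agree up to floating-point error. In these made-up vectors price falls as miles rise, so the coefficient is negative.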
Answer b
Here is a step-by-step procedure for performing a bootstrap for the population correlation
between variables "miles" and "price" in the data set "pickup.csv":
1. Load the "pickup.csv" data set into R using the read.csv() function.
2. Store the "miles" and "price" variables in separate vectors, say x and y, respectively.
3. Compute the sample correlation between x and y using cor(x, y). Store this value as "r_hat".
4. Specify the number of bootstrap samples you want to generate, say B, and repeat the following B times:
   a. Generate a bootstrap sample of the same size as the original data set by randomly sampling the indices of the observations, with replacement.
   b. Use the sampled indices to extract the corresponding "miles" and "price" values from x and y, so that each resampled observation keeps its original (miles, price) pair.
   c. Compute the correlation between the bootstrapped "miles" and "price" vectors using cor(). Store this value.
5. Use the B bootstrapped correlations to form a 95% CI by computing the 2.5th and 97.5th percentiles of the bootstrapped correlation values.
The 95% CI gives a range of plausible values for the population correlation. If the whole
procedure were repeated on many new samples, about 95% of the resulting intervals would be
expected to contain the true population correlation. The point estimate of the population
correlation is r_hat.
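The procedure above can be sketched in R as follows. Since "pickup.csv" is not reproduced here, the sketch simulates a hypothetical stand-in data frame; the simulated values, the seed, and the names n_obs, r_boot, idx, and ci are assumptions made for illustration. With the real file, the simulated data frame would be replaced by read.csv("pickup.csv").

```r
set.seed(1)  # for reproducibility of this illustration

# Hypothetical stand-in for pickup.csv (simulated, not the real data)
n_obs <- 50
pickup <- data.frame(miles = runif(n_obs, 10000, 150000))
pickup$price <- 20000 - 0.08 * pickup$miles + rnorm(n_obs, sd = 2500)

x <- pickup$miles
y <- pickup$price
r_hat <- cor(x, y)  # sample correlation on the original data

B <- 2000
r_boot <- numeric(B)
for (b in 1:B) {
  idx <- sample(length(x), replace = TRUE)  # resample row indices with replacement
  r_boot[b] <- cor(x[idx], y[idx])          # same indices for x and y keeps pairs together
}

# Percentile bootstrap 95% CI
ci <- quantile(r_boot, c(0.025, 0.975))
print(r_hat)
print(ci)
```

Resampling indices rather than x and y separately is what preserves the dependence between the two variables; sampling each vector on its own would destroy the correlation being estimated.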
ANSWER 2
To form a 95% CI for the quantity using the bootstrap, you would need to perform the following
steps:
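Draw a bootstrap sample by resampling the observations of the original sample with
replacement (same size as the original), re-estimate the model on it, and compute the
estimate (exp(β̂_4) − 1) · 100 from this bootstrapped sample.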
Repeat the above step many times to generate a large number of bootstrapped estimates.
Sort the bootstrapped estimates in ascending order.
Select the 2.5th and 97.5th percentile values of the sorted estimates as the lower and
upper bounds of the 95% CI, respectively.
The resulting interval will provide an estimate of the uncertainty around the original estimate,
with 95% confidence that the true value lies within the interval.
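A minimal sketch of these steps in R is given below. The regression model is not stated in this answer, so the sketch assumes a hypothetical log-linear model with four predictors in which β_4 is the coefficient on x4; the simulated data, the variable names, and the helper pct_effect() are all invented for illustration.

```r
set.seed(2)  # for reproducibility of this illustration

# Hypothetical data and model (assumptions, not from the original problem):
# log(y) regressed on x1..x4, with x4 a 0/1 indicator
n_obs <- 200
dat <- data.frame(
  x1 = rnorm(n_obs), x2 = rnorm(n_obs), x3 = rnorm(n_obs),
  x4 = rbinom(n_obs, 1, 0.5)
)
dat$log_y <- 1 + 0.3 * dat$x1 - 0.2 * dat$x2 + 0.1 * dat$x3 +
  0.15 * dat$x4 + rnorm(n_obs, sd = 0.5)

# Fit the model on a data frame and return (exp(beta_4_hat) - 1) * 100
pct_effect <- function(d) {
  fit <- lm(log_y ~ x1 + x2 + x3 + x4, data = d)
  (exp(coef(fit)[["x4"]]) - 1) * 100
}

estimate <- pct_effect(dat)  # estimate on the original sample

# Resample rows with replacement, refit, and recompute the quantity B times
B <- 1000
boot_est <- replicate(B, pct_effect(dat[sample(nrow(dat), replace = TRUE), ]))

# Percentile bootstrap 95% CI
ci <- quantile(boot_est, c(0.025, 0.975))
print(estimate)
print(ci)
```

quantile() sorts internally, so the explicit sorting step is only needed when picking the percentiles by hand.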
ANSWER 3
For each sample size, you will perform the following steps B times:
Draw a sample of size n from the Poisson distribution with λ = 2 using the "rpois"
function in R.
Compute the sample mean and standard error from the sample.
Compute the endpoints of the 95% confidence interval by adding and subtracting two
times the standard error to and from the sample mean.
After performing these steps B times, you will be able to see how often the true parameter (λ = 2)
falls within the 95% confidence interval for each sample size. The goal of the simulation is to see
if the confidence intervals actually contain the true parameter about 95% of the time, as the
theory predicts.
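A single pass through these steps might look like the sketch below. The answer does not say how the standard error is computed, so the sketch assumes the usual sd(s) / sqrt(n); the seed, the choice n = 45, and the names s, xbar, se, lower, upper, and covers are illustrative assumptions.

```r
set.seed(3)  # for reproducibility of this illustration

lambda <- 2  # true Poisson mean
n <- 45      # one sample size, chosen here only for illustration

s <- rpois(n, lambda)   # draw one sample of size n
xbar <- mean(s)         # sample mean
se <- sd(s) / sqrt(n)   # assumed standard error of the sample mean

# Approximate 95% CI: sample mean plus or minus two standard errors
lower <- xbar - 2 * se
upper <- xbar + 2 * se

# Does this one interval contain the true parameter?
covers <- lower <= lambda && lambda <= upper
print(c(lower = lower, upper = upper))
print(covers)
```

Repeating this B times and averaging covers gives the empirical coverage for that sample size.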
Answer a
Generate the sample means for each sample size by running the simulation B times, as described
in the previous answer.
Store the sample means in a data structure, such as a vector or data frame.
Use the "hist" function in R to generate a histogram of the sample means, with one histogram for
each sample size. You can specify the number of bins and add a title and labels to the histogram
to make it easier to interpret.
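B <- 2000
# Set lambda and the sample sizes
lambda <- 2
n <- c(15, 45, 300)
sample_means <- matrix(NA, nrow = B, ncol = length(n))
for (i in 1:length(n)) {
  sample_means[, i] <- replicate(B, mean(rpois(n[i], lambda)))  # B sample means of size n[i]
}
par(mfrow = c(1, 3))
for (i in 1:length(n)) {
  hist(sample_means[, i], main = paste0("n = ", n[i]), xlab = "Sample mean")
}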
[Figure: three histograms of the sample means, one panel each for n = 15, n = 45, and n = 300; y-axis: Frequency]
Answer b
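B <- 2000
lambda <- 2
n <- c(15, 45, 300)
lower_endpoints <- upper_endpoints <- matrix(NA, nrow = B, ncol = length(n))
for (i in 1:length(n)) {
  for (j in 1:B) {
    s <- rpois(n[i], lambda)
    half_width <- 2 * sd(s) / sqrt(n[i])  # two standard errors
    lower_endpoints[j, i] <- mean(s) - half_width
    upper_endpoints[j, i] <- mean(s) + half_width
  }
}
for (i in 1:length(n)) {
  hist(lower_endpoints[, i], main = paste0("n = ", n[i], " - Lower endpoint of 95% CI"))
  hist(upper_endpoints[, i], main = paste0("n = ", n[i], " - Upper endpoint of 95% CI"))
}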
[Figure: histograms of the lower and upper endpoints of the 95% CI for each sample size, with panel titles such as "n = 15 - Lower endpoint of 95% CI" and "n = 15 - Upper endpoint of 95% CI"; y-axis: Frequency]
Answer c
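B <- 2000
lambda <- 2
n <- c(15, 45, 300)
prop_CI_contain_mean <- numeric(length(n))
for (i in 1:length(n)) {
  for (j in 1:B) {
    s <- rpois(n[i], lambda)
    covered <- abs(mean(s) - lambda) <= 2 * sd(s) / sqrt(n[i])  # is lambda inside the CI?
    prop_CI_contain_mean[i] <- prop_CI_contain_mean[i] + covered / B
  }
}
prop_CI_contain_mean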
The prop_CI_contain_mean vector will contain the proportion of the CIs that contain the
population mean for each sample size n. The proportion should be close to 95%.