
Economics 570

ANSWER 1

Answer a

Here is an example of how to compute the sample correlation between "miles" and "price"
variables in R using the "pickup.csv" data set:

# Load the data set

pickup <- read.csv("pickup.csv")

# Extract the "miles" and "price" variables

x <- pickup$miles

y <- pickup$price

# Compute the sample correlation

r_hat <- cor(x, y)

# Print the result

r_hat

Running this code prints the sample correlation between "miles" and "price" in the
"pickup.csv" data set. The exact value depends on the data, so consider a concrete example.

Let's suppose that the "pickup.csv" data set contains the following values for "miles" and "price":

miles = c(15, 20, 25, 30, 35)

price = c(10, 20, 30, 40, 50)

We can then run the code provided to compute the sample correlation:

# Load the data set

pickup <- data.frame(miles, price)

# Extract the "miles" and "price" variables

x <- pickup$miles

y <- pickup$price

# Compute the sample correlation

r_hat <- cor(x, y)

# Print the result

r_hat

The result of running the code is:

[1] 1

This makes sense: in this example "price" is an exact linear function of "miles"
(price = 2·miles − 20), so the correlation is exactly 1, a perfect positive linear
relationship. In general, the closer the correlation coefficient is to 1, the stronger the
positive linear relationship between the two variables, and the closer it is to -1, the
stronger the negative linear relationship.

Answer b

Here is a step-by-step procedure for performing a bootstrap for the population correlation
between variables "miles" and "price" in the data set "pickup.csv":

1. Load the "pickup.csv" data set into R using the read.csv() function.

2. Store the "miles" and "price" variables in separate vectors, say x and y, respectively.

3. Compute the sample correlation between x and y using cor(x, y). Store this value as
"r_hat".

4. Specify the number of bootstrap samples you want to generate, say B.

5. Repeat the following steps B times:

a. Generate a bootstrap sample of the same size as the original data set by randomly sampling the
indices of the observations, with replacement.

b. Extract the corresponding "miles" and "price" values from x and y, respectively, to form the
bootstrapped "miles" and "price" vectors.

c. Compute the correlation between the bootstrapped "miles" and "price" vectors using cor().
Store this value.

6. Use the B bootstrapped correlations to form a 95% CI by computing the 2.5th and 97.5th
percentiles of the bootstrapped correlation values.

The 95% CI gives a range of plausible values for the population correlation: an interval
constructed this way would contain the true correlation in roughly 95% of repeated samples.
The point estimate of the population correlation is r_hat.
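In R, the procedure can be sketched as follows (a sketch assuming the "pickup.csv" file is
available as in part (a); the choice B = 2000 and the seed are arbitrary):

# Load the data and compute the sample correlation (as in part a)
pickup <- read.csv("pickup.csv")
x <- pickup$miles
y <- pickup$price
r_hat <- cor(x, y)

# Bootstrap the correlation
set.seed(1)  # for reproducibility
B <- 2000
n_obs <- length(x)
r_boot <- numeric(B)
for (b in 1:B) {
  # Resample observation indices with replacement
  idx <- sample(1:n_obs, size = n_obs, replace = TRUE)
  r_boot[b] <- cor(x[idx], y[idx])
}

# 95% percentile CI: 2.5th and 97.5th percentiles of the bootstrapped correlations
quantile(r_boot, c(0.025, 0.975))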

ANSWER 2

To form a 95% CI for the quantity using the bootstrap, you would need to perform the following
steps:

1. Draw a bootstrap sample (with replacement) from the original sample and compute the
estimate (exp(β̂₄) − 1) · 100 on this bootstrapped sample.

2. Repeat the above step many times to generate a large number of bootstrapped estimates.

3. Sort the bootstrapped estimates in ascending order.

4. Select the 2.5th and 97.5th percentile values of the sorted estimates as the lower and
upper bounds of the 95% CI, respectively.

The resulting interval will provide an estimate of the uncertainty around the original estimate,
with 95% confidence that the true value lies within the interval.
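These steps can be sketched in R. This is only an illustration: it assumes the original
estimate comes from a linear model fit with lm() on a data frame df, and that the coefficient
of interest is named "x4" in that model; both names are placeholders for whatever the actual
model uses.

# Sketch: case-resampling bootstrap for (exp(beta4_hat) - 1) * 100
set.seed(1)
B <- 2000
n_obs <- nrow(df)
est_boot <- numeric(B)
for (b in 1:B) {
  idx <- sample(1:n_obs, size = n_obs, replace = TRUE)
  fit_b <- lm(formula(fit), data = df[idx, ])  # refit the model on the bootstrap sample
  est_boot[b] <- (exp(coef(fit_b)["x4"]) - 1) * 100
}

# 95% percentile CI
quantile(est_boot, c(0.025, 0.975))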

ANSWER 3

For each sample size, you will perform the following steps B times:

1. Draw a sample of size n from the Poisson distribution with λ = 2 using the "rpois"
function in R.

2. Compute the sample mean and standard error from the sample.

3. Compute the endpoints of the 95% confidence interval as the sample mean plus and minus
two times the standard error.

After performing these steps B times, you will be able to see how often the true parameter (λ = 2)
falls within the 95% confidence interval for each sample size. The goal of the simulation is to see
if the confidence intervals actually contain the true parameter about 95% of the time, as the
theory predicts.
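For a single draw, the interval construction looks like this (using the sample standard
deviation to estimate the standard error):

# One simulated 95% CI for the mean of a Poisson(2) sample
set.seed(1)
n <- 45
sample <- rpois(n, lambda = 2)
sample_mean <- mean(sample)
standard_error <- sd(sample) / sqrt(n)
c(lower = sample_mean - 2 * standard_error,
  upper = sample_mean + 2 * standard_error)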

Answer a

Generate the sample means for each sample size by running the simulation B times, as described
in the previous answer.

Store the sample means in a data structure, such as a vector or data frame.

Use the "hist" function in R to generate a histogram of the sample means, with one histogram for
each sample size. You can specify the number of bins and add a title and labels to the histogram
to make it easier to interpret.

# Set sample sizes and number of simulations

n <- c(15, 45, 300)

B <- 2000

# Set lambda

lambda <- 2

# Create a list to store the sample means

sample_means <- list()

# Loop over sample sizes

for (i in 1:length(n)) {

# Generate B samples of size n[i] from the Poisson distribution
# (replicate returns an n[i] x B matrix, one sample per column)

samples <- replicate(B, rpois(n[i], lambda = lambda))

# Compute the B sample means, one per column

sample_means[[i]] <- colMeans(samples)

}

# Plot histograms of the sample means

par(mfrow = c(1, 3))

for (i in 1:length(n)) {

hist(sample_means[[i]], main = paste("n =", n[i]), xlab = "Sample Mean",
     ylab = "Frequency", col = "gray")

}

[Figure: histograms of the sample means for each sample size (panels "n = 15", "n = 45",
"n = 300"; x-axis "Sample Mean", y-axis "Frequency"). All three distributions are centered
near 2.]

Answer b

n <- c(15, 45, 300)

B <- 2000

lambda <- 2

lower_endpoints <- upper_endpoints <- matrix(NA, nrow = B, ncol = length(n))

for (i in 1:length(n)) {

for (j in 1:B) {

sample <- rpois(n[i], lambda)

sample_mean <- mean(sample)

standard_error <- sqrt(var(sample) / n[i])

lower_endpoints[j, i] <- sample_mean - 2 * standard_error

upper_endpoints[j, i] <- sample_mean + 2 * standard_error

}

}

# Plot histograms of the lower and upper endpoints for each sample size

par(mfrow = c(3, 2))

for (i in 1:length(n)) {

hist(lower_endpoints[, i], main = paste0("n = ", n[i], " - Lower endpoint of 95% CI"))

hist(upper_endpoints[, i], main = paste0("n = ", n[i], " - Upper endpoint of 95% CI"))

}

[Figure: histograms of the lower and upper endpoints of the 95% CIs for n = 15, 45, and 300.
The lower endpoints sit below 2 and the upper endpoints above 2, and both concentrate more
tightly around 2 as n grows.]

Answer c

n <- c(15, 45, 300)

B <- 2000

lambda <- 2

pop_mean <- lambda

prop_CI_contain_mean <- numeric(length(n))

for (i in 1:length(n)) {

sample_means <- numeric(B)

lower_endpoints <- numeric(B)

upper_endpoints <- numeric(B)

for (j in 1:B) {

sample <- rpois(n[i], lambda)

sample_mean <- mean(sample)

sample_means[j] <- sample_mean

# Standard error using the known variance (for a Poisson, the variance equals lambda)

standard_error <- sqrt(lambda / n[i])

lower_endpoints[j] <- sample_mean - 2 * standard_error

upper_endpoints[j] <- sample_mean + 2 * standard_error

}

prop_CI_contain_mean[i] <- mean(lower_endpoints <= pop_mean & pop_mean <= upper_endpoints)

}

prop_CI_contain_mean

The prop_CI_contain_mean vector will contain the proportion of the CIs that contain the
population mean for each sample size n. The proportion should be close to 95%.
