
Statistics 130

Lab 6

October 18, 2013

Lab #6: Bootstrap Intervals


Due: Friday, October 25, 2013, in your lab section

The Central Limit Theorem was key in formulating the specific confidence intervals we talked about in class. For instance, because the CLT says that the sample mean is approximately normally distributed ($\bar{X} \sim N(\mu, \sigma/\sqrt{n})$), we can use the standard normal table to get critical values for the confidence interval. These critical values scale the standard error of $\bar{X}$ (or any point estimator) to cover the desired level of confidence (e.g. 95%). What if we didn't have the CLT to tell us where to get these critical values from?
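
As a small illustration of the CLT-based interval described above, the following Python sketch computes "estimate plus or minus critical value times standard error" for a mean. The function name `normal_ci` and the sample values are hypothetical and not part of the lab; the critical value is taken from the standard library's `statistics.NormalDist` rather than a printed normal table.

```python
import math
from statistics import NormalDist, mean, stdev

def normal_ci(data, confidence=0.95):
    """CLT-based interval for a mean: xbar +/- z * s / sqrt(n)."""
    n = len(data)
    xbar = mean(data)
    se = stdev(data) / math.sqrt(n)          # estimated standard error of the mean
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value from the standard normal
    return xbar - z * se, xbar + z * se

# Hypothetical sample, for illustration only.
sample = [4.1, 5.3, 6.0, 5.5, 4.8, 5.9, 6.2, 5.1]
low, high = normal_ci(sample, confidence=0.95)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

The critical value here depends entirely on knowing that the estimator is (approximately) normal, which is exactly the knowledge bootstrapping lets us do without.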

Bootstrapping
The method of bootstrapping (or bootstrap sampling) allows us to estimate the confidence interval for a population parameter when we only know the estimator (a function) and have a sample. That is, we don't know where to get our critical value from. Finding bootstrap confidence intervals generally goes like this:

1. Get your sample of size $n$.
2. Decide on a confidence level $1 - \alpha$.
3. Repeat $m$ times:
   (a) Randomly draw a sample of size $n$ with replacement from your original sample (this is called a bootstrap sample).
   (b) Calculate your point estimate of interest on this bootstrap sample.
4. Look at the distribution of your $m$ estimates (the bootstrap distribution). If it is roughly symmetric, you may proceed. (There is another method for asymmetric distributions, but we won't get into that here.)
5. Order your $m$ bootstrap point estimates from lowest to highest.
6. Your $100(\alpha/2)\%$ lowest bootstrap estimate is your lower bound and your $100(1 - \alpha/2)\%$ lowest estimate is your upper bound on your confidence interval.

Thus, if we have 1000 bootstrap samples and a 95% confidence interval, our lower bound will be the 25th lowest estimate and our upper bound will be the 975th lowest estimate (the 26th largest estimate).
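
The procedure above can be sketched in a few lines of Python. This is only a minimal illustration, not the lab's required method (the lab uses StatKey): the function name `bootstrap_percentile_ci`, its parameters, and the data are hypothetical, and the symmetry check in step 4 is left to the reader.

```python
import random
from statistics import mean

def bootstrap_percentile_ci(sample, estimator=mean, confidence=0.95, m=1000, seed=None):
    """Percentile bootstrap interval following steps 1-6 (symmetry check omitted)."""
    rng = random.Random(seed)
    n = len(sample)
    # Step 3: draw m bootstrap samples of size n with replacement and
    # compute the point estimate on each. Step 5: sort the estimates.
    estimates = sorted(estimator(rng.choices(sample, k=n)) for _ in range(m))
    alpha = 1 - confidence
    # Step 6: ranks (1-based) of the 100(alpha/2)% and 100(1 - alpha/2)% lowest estimates.
    lower_rank = max(1, round(m * alpha / 2))
    upper_rank = min(m, round(m * (1 - alpha / 2)))
    return estimates[lower_rank - 1], estimates[upper_rank - 1]

# Hypothetical data, for illustration only.
data = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
lower, upper = bootstrap_percentile_ci(data, confidence=0.95, m=1000, seed=1)
print(f"95% bootstrap CI for the mean: ({lower:.2f}, {upper:.2f})")
```

With `m=1000` and `confidence=0.95` the two ranks are 25 and 975, matching the example in the text. Passing a different `estimator` (for example `statistics.median`) reuses the same steps for another statistic.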

Why it works
Bootstrapping gives us a reliable estimate of the confidence interval for a population parameter because, with an unbiased estimator:

1. Our sampling distribution is centered about our true population parameter value.

2. Our bootstrap distribution is centered about our sample's estimate of the true population value.

Thus, with a symmetric bootstrap distribution, we expect our bootstrap distribution to have the same relationship to our one sample estimate as our sampling distribution (of the estimator, if we could draw lots of real samples) would have to the true population value.
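
A short simulation can make this comparison concrete. The sketch below is only an illustration under stated assumptions: the population is hypothetical (normally distributed values generated with an arbitrary seed), and it uses the sample mean as the estimator. It compares the sampling distribution, which we can build here only because we invented the population, with the bootstrap distribution built from a single sample.

```python
import random
from statistics import mean, stdev

rng = random.Random(130)  # arbitrary seed, for reproducibility

# Hypothetical population, for illustration only.
population = [rng.gauss(50, 10) for _ in range(20_000)]
n, m = 40, 2000

# Sampling distribution: means of many real samples drawn from the population.
sampling_means = [mean(rng.sample(population, n)) for _ in range(m)]

# Bootstrap distribution: means of resamples (with replacement) from ONE sample.
one_sample = rng.sample(population, n)
bootstrap_means = [mean(rng.choices(one_sample, k=n)) for _ in range(m)]

print(f"population mean        {mean(population):.2f}")
print(f"sampling dist. center  {mean(sampling_means):.2f}, spread {stdev(sampling_means):.2f}")
print(f"sample mean            {mean(one_sample):.2f}")
print(f"bootstrap dist. center {mean(bootstrap_means):.2f}, spread {stdev(bootstrap_means):.2f}")
```

The two centers differ (population mean versus one sample's mean), but the spreads should be of similar size, which is what lets the bootstrap distribution stand in for the sampling distribution.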

Objective
The goal of this lab is to investigate how and why bootstrapping works for estimating confidence intervals. You could develop your own code for doing this in Matlab, but there is already a nice applet online that does all of this for us! So no Matlab coding; just go to lock5stat/statkey. (If you use Google Chrome: there is a StatKey app!)

Using StatKey for bootstrap confidence intervals


Proportions
Under Bootstrap Confidence Intervals, go to CI for Single Proportion. By default, a dataset on mixed nuts is loaded. See Figure 1.

1. Click Edit Data. Enter the count (number of successes) and sample size for your sample. You can see on the right side of the screen that the items under Original Sample have now been updated to reflect your sample.
2. Generate some number of bootstrap samples (sampling from the original sample with replacement) by clicking on Generate X Sample(s). Dots will appear in the graph, each one representing the proportion estimate from a bootstrap sample. Information on a particular bootstrap sample appears under Bootstrap Sample to the right (hover over points to see their info). NOTE: generation of samples is cumulative; clicking Generate 1 Sample then Generate 10 Samples will give you 11 samples total. Reset Plot erases all samples.
3. Summary statistics on all bootstrap samples generated thus far appear in the upper right of the plot window.
4. Clicking on Two-Tail pulls up a 95% (by default) bootstrap interval with lower AND upper bounds. This is what we are usually interested in. We didn't talk about one-sided confidence intervals in class, since they aren't as useful, but clicking left or right tail gives one-sided confidence intervals (just lower or upper bound, respectively).
5. You can adjust the confidence level by clicking on the rectangle with 0.950 in it, and entering your desired level in the dialogue that appears.

Figure 1: Annotated screenshot of using StatKey for a bootstrap confidence interval for a proportion.

6. Your bootstrap confidence interval can be read from the bounds that are reported in the rectangles below the x-axis of the plot. Note that the red dots represent the lowest and highest estimates, those that fall below the $\alpha/2$ quantile and above the $1 - \alpha/2$ quantile, respectively.
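
For readers who want to see what the applet is doing for a proportion, here is a minimal Python sketch. It is an assumption about the general technique, not a description of StatKey's actual implementation: the function name `bootstrap_proportions` and the counts are hypothetical. The original sample is rebuilt as a list of 1s (successes) and 0s from the count and sample size entered in step 1.

```python
import random

def bootstrap_proportions(count, n, m=1000, seed=None):
    """Return m bootstrap proportions from a sample with `count` successes out of n."""
    rng = random.Random(seed)
    original = [1] * count + [0] * (n - count)  # rebuild the 0/1 sample
    # Each bootstrap sample is n draws with replacement; its estimate is the share of 1s.
    return [sum(rng.choices(original, k=n)) / n for _ in range(m)]

# Hypothetical counts, for illustration only.
props = sorted(bootstrap_proportions(count=52, n=100, m=1000, seed=6))
lower, upper = props[24], props[974]  # 25th and 975th lowest of 1000 estimates
print(f"95% bootstrap CI for the proportion: ({lower:.2f}, {upper:.2f})")
```

Each value in `props` corresponds to one dot in the StatKey plot, and the two bounds correspond to the values shown in the rectangles below the x-axis after clicking Two-Tail.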

Mean
To find the bootstrap confidence interval for a mean, at the StatKey start screen, go to CI for Single Mean, Median, St.Dev. Just like with proportions, a default dataset is already loaded in. The steps proceed much like with proportions, but the data editing step is different here. So simply replace Step 1 in the Proportions section above with the following:

1. After clicking Edit Data, you should see the same dialogue as in Figure 2.
   (a) Replace PenMins with an appropriately descriptive name for your sample data.
   (b) Delete the default data below PenMins, and paste in (or manually enter) your sample data values.
   (c) Leave Data has header name checked, and click Ok.

Steps 2-6 from the Proportions section above then apply unchanged.

Figure 2: Annotated screenshot of editing the data when using StatKey for a bootstrap confidence interval for a mean.

Name: Suvayan Roy
Section: 2

Your assignment
From the initial survey data from our class (Sakai Resources Labs Initial survey 114-13.xlsx), pick a column to investigate. You will want to pick a column of values for which you can either calculate the sample proportion or the sample mean (so birthday date is a bad choice). You will also want to clean up your data: take out any nonsensical values (e.g., hours per day at computer greater than 24), and any NA values.

1. What is the name of the variable/column you are looking at?

Weight

2. Are you investigating the population proportion or mean? Write out the equation for the point estimator you are using. What is the point estimate for your sample from the spreadsheet?

I am investigating the population mean. The point estimator is the sample mean, $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. The point estimate for my sample is 156.444.

3. State what inference you are trying to make. What is the population you are investigating? What are you trying to say about that population by finding the confidence interval for the proportion/mean?

We are trying to infer the overall mean of the population (all Duke students) by looking at the mean of the sample. By creating a confidence interval, we are creating an interval that we are confident contains the population mean.

4. From the equations presented in lecture, calculate the 99% confidence interval for whichever population parameter you are dealing with.

Mean = 156.444, SD = 31.383, 99% z-score = 2.576
CI = 156.444 ± 2.576 × 31.383 = 75.60 to 237.29

5. Go to StatKey and, following the directions above, enter either the count and sample size (if doing proportion), or enter your sample data (if doing the mean). Generate 1 bootstrap sample, and compare the estimate to your sample estimate that you calculated above. Generate a second bootstrap sample and compare the mean of your two estimates to your sample estimate. Generate a few more, make the same comparison, then generate 10 more, then 100 more, making the same comparison to your sample's estimate every time. As you generate more bootstrap samples, what is the mean of the bootstrap distribution converging to?

After 1 bootstrap sample, mean: 160.171
After 2 bootstrap samples, mean: 159.772
After 12 bootstrap samples, mean: 158.370
After 112 bootstrap samples, mean: 156.744

The mean of the bootstrap distribution is converging to our sample mean (156.444).
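
The same convergence can be reproduced outside StatKey. The sketch below is a hypothetical illustration only: the class survey data are not reproduced here, so it generates stand-in weights (the sample size of 40 and the seed are assumptions), then accumulates bootstrap samples in the same batches used above and reports how far the mean of the bootstrap estimates is from the sample mean.

```python
import random
from statistics import mean

rng = random.Random(2013)  # arbitrary seed, for reproducibility

# Hypothetical stand-in weights; NOT the class survey data.
sample = [rng.gauss(156, 31) for _ in range(40)]
sample_mean = mean(sample)

boot_means = []
for total in (1, 2, 12, 112, 1112):
    # Sample generation is cumulative, as in StatKey: top up to `total` estimates.
    while len(boot_means) < total:
        boot_means.append(mean(rng.choices(sample, k=len(sample))))
    gap = abs(mean(boot_means) - sample_mean)
    print(f"{total:5d} bootstrap samples: mean of estimates {mean(boot_means):8.3f}, gap {gap:.3f}")
```

The gap is not guaranteed to shrink at every step, but it tends toward zero as the number of bootstrap samples grows, because each bootstrap sample is drawn from the original sample.
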
6. Reset the plot, and generate a total of 1000 samples. What is the standard deviation of the bootstrap distribution (upper right corner of plot window)? Find a new confidence interval by substituting this standard deviation for the standard error in your confidence interval calculation from part 4.

Mean = 156.379, SD = 4.940, 99% z-score = 2.576
CI = 156.379 ± 2.576 × 4.940 = 143.65 to 169.10

7. With the same plot (still with 1000 samples), click on Two-Tail and find the 99% bootstrap confidence interval from the output on the plot. Take a screenshot and attach it to this lab.

See attached screenshot

8. How do your three confidence intervals (from parts 4, 6, and 7) compare?

The intervals from parts 6 and 7 are nearly the same because they are built from the same bootstrap distribution. The interval in part 4 is much wider because it was computed with the sample standard deviation (31.383), which describes the spread of the individual data points. The bootstrap standard deviation (4.940) instead describes the spread of the sample mean across resamples, which is much smaller and therefore gives a narrower interval.
