You are on page 1of 18

Bootstrapping

A nonparametric Approach to Statistical Inference Mooney & Duval Series on Quantitative Applications in the Social Sciences, 95, Sage University Papers Seminar General Statistics, 25 October 2012 Wendy Post Statistician Department of Special Education University of Groningen

1

overview
• • • • • • • Introduction: what is bootstrapping Introduction of a real dataset Estimation of standard errors Confidence intervals Estimation of bias Testing Literature

2

1

What if the usual assumptions of the population cannot be made (normality assumptions.E. Kass (2011): the big picture Bootstrap method Quantitative research: testing and estimation require information about the population distribution. 2 .Quantitative research Real world theoretical world 3 scientific models Data statistical models ϴ Conclusions Based on R. or small samples): the traditional approach does not work Bootstrap: Estimates the population distribution by using the information based on a number of resamples from the sample.

allowing one to use fingers or a tool to provide greater force in pulling the boots on. Bootstrap method Impossible task: to pull oneself up by one's bootstraps A metaphor for ‘a self-sustaining process that proceeds without external help’.Bootstrap Wikipedia: Boots may have a tab. 3 . loop or handle at the top known as a bootstrap.

x2*.….x2.xn*) Statistic (sample mean) Based on Efron and Tibshirani (1993) Statistic Bootstrap sample (mean of X*) 4 .…. the mean) • Estimate the sample distribution of the statistic by the bootstrap sample distribution Bootstrap method 8 Real world unknown probability distribution/model Observed data bootstrap world estimated probability model Bootstrap sample P X = (x1.xn) ˆ P X* = (x1*.Bootstrap method Use the information of a number of resamples from the sample to estimate the population distribution Given a sample of size n: Procedure • treat the sample as population • Draw B samples of size n with replacement from your sample (the bootstrap samples) • Compute for each bootstrap sample the statistic of interest (for example.

72 1.85 -0.49 0.81 1.40 B_sample 2 1.73 5 .55 0.08 0.22 -0.15 0.27 -0.26 -0.20 0. . 0.07 -0.12.49 -0.55 0. .13 0.15 -0. 0.41 1.26 -0.11 0.45) Example Case number 1 2 3 4 5 6 7 8 9 .40 0.30 -0.27 0.26 -0.40 1.22 -0. .26 -0.17 0.20 -0.49 -0.17 0.20 -0.17 -0.61 0.07 0.08 0.Example Population: standard normal distribution Sample with sample size = 30.15 . 0.40 1.58 -0.1698 95% Confidence interval: (-0. 0.61 -0.22 0. Results sample (traditional analysis) Sample mean: 0.40 1.07 B_sample 1 -0.22 0.08 0. 1000 bootstrap samples.93 0.17 0.30 -1.27 -0.83 B_sample 3 1. 25 26 27 28 29 30 Mean Sd Original data 1.32 0.11 .15 0.49 1.72 0.52 1.46) Bootstrap results: sample mean: 0.40 -0.07 1.72 1.22 -0.1689.55 0. 95% Confidence interval : (-0.07 -0.55 1.52 -1.10.40 -0.08 -0.13 1.

30) summary(meansW) quantile(meansW. p.numeric(1000) #declaration of de bootstrap mean vector # bootstrap procedure for (i in 1:1000){ meansW[i] <.025.rnorm(30.975)) # traditional approach summary(sampleW) mean(sampleW) CI_up = mean(sampleW)+1.mean(sample(sampleW.96*sd(sampleW)/sqrt(30) 6 . c(0.R-script Based on Crawley: The R book.96*sd(sampleW)/sqrt(30) CI_lo = mean(sampleW)-1. 0.seed(146035) #start seed for simulation sampleW <. replace = T))} R-script # bootstrap results hist(meansW. 1) # draw sample of size 30 meansW <.320 set. 0.

Bootstrap results Results example Population: standard normal distribution Sample with sample size = 30.1698 95% Confidence interval: (-0.10.12. Results sample Sample mean: 0. 95% Confidence interval : (-0. 1000 bootstrap samples. 0.45) 7 . 0.46) Bootstrap results: sample mean: 0.1689.

1) # draw sample of size 30 mymean= function(sampleW.321 library(boot) set. R = 1000) Bootstrap Statistics : original t1* 0.R-script : alternative Based on Crawley: The R book. p. statistic = mymean. i) mean(sampleW[i]) myboot = boot(sampleW.697321e-05 std.seed(146035) #start seed for simulation sampleW <. error 0. R=1000) myboot R-script : output ORDINARY NONPARAMETRIC BOOTSTRAP Call: boot(data = sampleW. mymean.1698585 bias 7.1484104 8 . 0.rnorm(30.

Justification: 1. the bootstrap distribution approaches the sample distribution bootstrapping • Estimation of the standard error of your statistic of interest • Construction of confidence intervals • Estimation of bias • Bootstrap testing (for small samples. closely related to permutation tests) 9 . bootstrapping will provide a good approximation of the sample distribution. 2. If the number of resamples (B) from the original sample increases. If the sample is representative for the population.Why does bootstrapping work? Basic idea: If the sample is a good approximation of the population. the sample distribution (empirical distribution) approaches the population (theoretical) distribution if n increases.

measured by Attitudes Survey towards Inclusive Education (ASIE) 19 Manipulated data for the Buddy project Research question: What is the effect of the buddy intervention on the attitude compared to the control group? 20 10 .Comparison of two independent groups Buddy project (De Boer) Study aim: To determine the effect of a buddy program intervention for influencing the attitude of typically developing students towards children with severe intellectual and/or physical disabilities Comparison of two groups: Buddy intervention: 4 typically developing students each being a buddy for a child with disabilities Control group: 4 typically developing students without being a buddy for a child with disabilities Dependent variable: difference in attitude(post-pre).

25 22 11 .Manipulated data Data Buddy project 21 Manipulated data Buddy project Design: two independent groups Test statistic: difference in means = 6.

000 24 12 .…B 4.…x4* 2. Select B independent bootstrap samples (replications) with replacement from the data of the children in the control intervention x5*. Determine the bootstrap sample distribution of difmean* 4.manipulated data Buddy project How to bootstrap for estimation of the standard error of the sample statistic? General 1. Compute the bootstrap sample statistic for each replication.…x8* B times (for estimation of standard errors. Select B independent bootstrap samples (replications) with replacement from the data of the children in the buddy intervention x1*. Select B independent bootstrap samples (replications): sample with replacement: x1*. B = 25-200 is OK) 2. θ*(b). b = 1. 200.…x8* 3. difmean*(b). Determine the bootstrap sample distribution of the bootstrap sample statistic 4. Compute the difference in means in both interventions for each bootstrap sample.…B 3. Estimate the standard deviation of the bootstrap sample statistic (standard error of the estimate) 23 manipulated data Buddy project In our Buddy project: 1. b = 1. Estimate the standard deviation of the difmean* (standard error) We will do this for B = 25. 1000 and 10.

11 2.46 2.88 6.48 2.33 6.5 6.Frequency distribution of the bootstrap sample statistic (difmean*) Histogram of ms 8 B= 25 Frequency 40 50 60 Histogram of ms B= 200 Frequency 6 4 2 0 0 -2 10 20 30 0 2 4 ms 6 8 10 12 -2 0 2 4 ms 6 8 10 12 Histogram of ms Histogram of ms B=1000 Frequency B= 10000 15 0 F re q u e n c y 50 0 0 500 1000 10 0 1500 0 5 ms 10 0 ms 5 10 25 Buddy project statistics of the bootstrap distribution of difmean* Number of Mean Median replications 25 200 1000 10000 5.24 6.0 6.47 26 the standard error 13 .26 6.5 6.25 sd 3.

5 10.25 97.5 6.0 50% 6.25 10.5 10.0 1.5% 10.5 6.0 6.5 lower boundary upper boundary 27 bootstrap for bias estimation Bias of an estimator: difference between population value θ and the expected value of that estimator .30 1. In formula: bias = θ .5% replications sd 25 200 1000 10000 0. then the estimators may be biased 28 14 .Buddy project percentiles All percentiles can be computed: estimation of the confidence intervals: percentile method Number of 2.E( ) So is the estimator on average correct? If the theoretical assumptions about distributions are not satisfied.24 1.

25) 30 15 .88 6.25 Estimated bias = bootstrap mean – sample mean B 25 200 1000 10000 Mean θ* 5. x4* to the buddy project.g. Compute for each the bootstrap sample the difference in means. Assign the first 4 observations x1*.62 6.37 -0. and the last 4 observations. b = 1.24 29 Buddy project: Testing the difference between two interventions 1.08 0.…. x4*. Sample mean of difference was: 6.….x8* to the control group.17 6.24 6. 3. 6. 2.01 0.bootstrap for bias estimation The difference between original sample statistic and the bootstrap sample statistic can be regarded as the estimation of the bias.26 Bias θ*-0.33 6.…B 4. Approximate the p-value by the number of bootstrap sample differences larger than the original sample mean (e.26 6. θ*(b). Draw B independent bootstrap samples (replications): sample with replacement: x1*. Note: this is the distribution under the null hypothesis of equal distributions.…x8*.01 corrected mean 6.

000 gives sample value < -6.0282 • 287 cases of 10.Buddy project: Testing the difference between two interventions Results • 282 cases of 10. 32 16 . Number of bootstrap samples: The number of bootstrap samples depends on what you like to do.000 gives a sample mean difference > 6.25 • two-sided p-value: 0. For all other applications: Take B> 1000. For estimation of standard error: 25-200 resamples should be sufficient.057 31 sample size Sample size: Bootstrapping is working well for samples >30-50 (if the sampling procedure is random) For smaller sample sizes: there might be problems about accuracy and validity: be careful with the interpretation.25 • One sided P-value: 0.

• The strength of bootstrapping lies in estimation rather than in testing (certainly for small samples) • The validity of the bootstrap testing is questionable for small sample sizes 33 Comments about bootstrapping (cont) • There are more advanced ways to construct confidence intervals with bootstrapping – BCα-method (bias-correction and accelerated) – ABC-method (approximate confidence interval) • Testing with the bootstrap method in small samples is closely connected with the randomization tests: randomization tests are the exact versions. • Bootstrap can be applied to more complex designs and in more situations (there are several ways in regression models) 34 17 . Only the simplest cases are demonstrated here.Comments about bootstrapping • Only the basic principles are introduced here • The sampling method depends on the stochastic element one would like to generate (depends on the model!).

An introduction to the bootstrap.D. Ter Braak. G. C. R.Z. (1992).E. R. Kass (2011). London: Sage publications. (1993).). and Duval.H. R. Bootstrapping.J. Statistical inference: The big picture. C. Sendler.Rothe & W. Permutation versus bootstrap significance tests in multiple regression and anova.J. 79-86. In Bootstrapping and related techniques.Jockel. (eds. and Tibshirani. K. Efron. London: Chapman & Hall. (1993). Statistical Science 26(1) 1-9.literature Moony. B.F. 35 18 . A nonparametric approach to statistical inference.