
Bootstrapping and Power Analysis
Bootstrap Sampling in R

• Bootstrapping uses random sampling with replacement to estimate
statistics from a sample. By resampling from this sample we can
generate new data sets that stand in for further draws from the
population. This is loosely based on the law of large numbers. Instead of
estimating a statistic once, from the small sample we start with, the
statistic can be estimated many times. Hence, if we resample with
replacement 10 times, we compute 10 estimates of the desired statistic.
• The procedure in bootstrapping is as follows:
1. Resample the data with replacement n times
2. Compute the desired statistic on each resample, giving n estimates
that form a distribution of the statistic
3. Determine the standard error or confidence interval of the statistic
from this bootstrap distribution
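The three steps above can be sketched in base R without any packages. The sample `x` and the number of resamples `n_boot` are made-up values chosen only for illustration, and the percentile interval is one of several possible interval types.

```r
# Minimal bootstrap of a sample mean using only base R.
set.seed(42)                           # make the resampling reproducible
x <- c(12, 15, 9, 20, 17, 11, 14, 18)  # hypothetical sample, for illustration only
n_boot <- 2000                         # number of resamples (arbitrary choice)

# Steps 1 and 2: resample with replacement and recompute the statistic each time
boot_means <- replicate(
  n_boot,
  mean(sample(x, size = length(x), replace = TRUE))
)

# Step 3: standard error and a 95% percentile confidence interval
boot_se <- sd(boot_means)
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975))

print(boot_se)
print(boot_ci)
```

Each resample has the same size as the original sample, which is the usual convention; the spread of `boot_means` then approximates the sampling variability of the mean.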
• boot(data, statistic, R, sim = "ordinary", stype = "i",
strata = rep(1, n), L = NULL, m = 0, weights = NULL,
ran.gen = function(d, p) d, mle = NULL, ...)
• https://data-flair.training/blogs/bootstrapping-in-r/
• https://www.datacamp.com/tutorial/bootstrap-r
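As a rough illustration of the boot() signature listed above, the sketch below bootstraps a regression slope. The data frame `d` is synthetic (generated with a true slope of 2) and the helper name `slope_fn` is hypothetical; only `data`, `statistic` and `R` are passed, so every other argument keeps its default.

```r
library(boot)  # 'boot' is a recommended package bundled with standard R installs

set.seed(1)
# Synthetic data with a known linear relation (true slope = 2), illustration only
d <- data.frame(x = 1:30)
d$y <- 5 + 2 * d$x + rnorm(30, sd = 3)

# With stype = "i" (the default) the statistic receives the data and a vector
# of resampled row indices, and must return the statistic for those rows
slope_fn <- function(data, idx) {
  coef(lm(y ~ x, data = data[idx, ]))[["x"]]
}

b <- boot(data = d, statistic = slope_fn, R = 1000)

print(b$t0)                       # slope estimated on the original data
print(sd(b$t[, 1]))               # bootstrap standard error of the slope
print(boot.ci(b, type = "perc"))  # percentile confidence interval
```

The returned object stores the original estimate in `t0` and the R resampled estimates in the matrix `t`, which is what boot.ci() summarises.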
• Example: perform bootstrap resampling to estimate the relation between
the tuition cost of private and public colleges, using the historical
tuition dataset from TidyTuesday.
• Initially the data is in long format, where each row represents one
observation. There are multiple observations for each year, one for each
institution type: Public, Private, All Institutions. To use the data for
resampling we need to convert it to a wide format with one column
containing tuition costs for private schools and another containing
tuition costs for public schools. Each row in the wide format then holds
the historical tuition costs for one academic year. The columns of
interest for modeling are public and private, the tuition costs for
public and private schools, for the academic year given by the column
year.
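The long-to-wide conversion described above can be sketched with base R's reshape(). The data frame below is a tiny made-up stand-in, not the TidyTuesday data; the column names `type` and `tuition` and all values are assumptions for illustration.

```r
# Hypothetical long-format data: one row per (year, institution type)
long <- data.frame(
  year    = rep(c("2000-01", "2001-02", "2002-03"), each = 3),
  type    = rep(c("All Institutions", "Public", "Private"), times = 3),
  tuition = c(100, 80, 200, 110, 85, 215, 120, 90, 230)
)

# Keep the two institution types of interest
two_types <- long[long$type %in% c("Public", "Private"), ]

# Spread the types into columns: one row per year
wide <- reshape(two_types, idvar = "year", timevar = "type",
                direction = "wide")

# Tidy the generated names (tuition.Public -> public, tuition.Private -> private)
names(wide) <- tolower(sub("^tuition\\.", "", names(wide)))
rownames(wide) <- NULL

print(wide)
```

In a tidyverse workflow the same step is usually written with tidyr::pivot_wider(); base R is used here only to keep the sketch free of extra packages.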
• With bootstrapping, the estimated slope that describes the relation between
private and public school tuition costs is around 2.38, similar to what we obtained
without bootstrapping. This could be because the underlying relation between public
and private tuition is linear and the assumptions of linear models hold for
this dataset. The underlying assumptions in a linear regression are:
• The underlying relationship between the variables is linear
• The errors are normally distributed
• There is equal variance around the line of best fit (homoscedasticity of error)
• Observations are independent
• Furthermore, bootstrapping gives us lower and upper bounds for the slope,
which we did not obtain from fitting a single linear model to the original data.
Power Analysis
• When testing a hypothesis H0 against H1 with a test whose probabilities of
Type I and Type II errors are α and β respectively, the quantity (1 - β) is
called the power of the test.
• The power of the test depends upon the difference between the parameter
value specified by H0 and the actual value of the parameter.
• The level of significance may be defined as the probability of Type I error
that we are ready to tolerate in making a decision about H0.
• The size of a non-randomized test is defined as the size of the critical region,
that is, the probability that the test statistic falls in the critical region when
H0 is true. In practice, it is numerically the same as the level of significance.
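These definitions can be made concrete with a small sketch. It assumes a one-sided z-test of H0: mu = mu0 against H1: mu > mu0 with known standard deviation, a setting chosen here only because its power has a closed form; the function name `z_power` and its default values are hypothetical.

```r
# Power of a one-sided z-test with known sigma (illustrative assumption)
z_power <- function(mu1, mu0 = 0, sigma = 1, n = 25, alpha = 0.05) {
  z_crit <- qnorm(1 - alpha)               # boundary of the critical region
  shift  <- (mu1 - mu0) * sqrt(n) / sigma  # standardized distance of the true mean from H0
  1 - pnorm(z_crit - shift)                # P(reject H0 | true mean = mu1) = 1 - beta
}

# When the true mean equals the H0 value, the rejection probability is alpha
print(z_power(mu1 = 0))

# Power increases as the true mean moves away from the H0 value
print(z_power(mu1 = c(0.2, 0.5, 0.8)))
```

The first call recovers the significance level, and the second shows the dependence of power on the gap between the H0 value and the actual parameter value.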
• Power analysis is an important aspect of experimental design. It allows us to
determine the sample size required to detect an effect of a given size with a given
degree of confidence.
• Conversely, it allows us to determine the probability of detecting an effect of a
given size with a given level of confidence, under sample size constraints.
• If the power of the test is too low, we disregard conclusions drawn from the data set.
• In R, a power analysis involves the following four quantities:
• Sample size
• Effect size
• Significance level
• Power of the test
• Given any three of these quantities, we can calculate the fourth.
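The "any three determine the fourth" idea can be tried with power.t.test() from base R's stats package, used here as a stand-in for a two-sample t-test setting. The effect size of 0.5 and the other inputs are arbitrary illustration values; with `sd = 1`, `delta` is the difference in means in standard deviation units.

```r
# Leave one of n, delta or power unspecified and power.t.test() solves for it.
# (sd and sig.level have defaults, so they must be set to NULL explicitly
# if they are the quantity to solve for.)

# Solve for the sample size per group
res_n <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)

# Solve for the power, given 64 observations per group
res_power <- power.t.test(n = 64, delta = 0.5, sd = 1, sig.level = 0.05)

print(ceiling(res_n$n))   # round up: sample sizes must be whole numbers
print(res_power$power)
```

The functions of the pwr package follow the same convention of solving for whichever argument is left out.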
Power Analysis in R

Function          Power Calculation For
pwr.2p.test       two proportions (equal n)
pwr.2p2n.test     two proportions (unequal n)
pwr.anova.test    balanced one-way ANOVA
pwr.chisq.test    chi-square test
pwr.f2.test       general linear model
pwr.p.test        proportion (one sample)
pwr.r.test        correlation
pwr.t.test        t-tests (one sample, two samples, paired)
pwr.t2n.test      t-test (two samples with unequal n)
• The significance level α defaults to 0.05.
• Choosing the effect size is one of the difficult tasks. Your subject
expertise needs to be brought to bear here. Cohen gives the
following guidelines for the social sciences, shown here for
Cohen’s w, the effect size used in chi-square tests.

Effect size    Cohen’s w
Small          0.10
Medium         0.30
Large          0.50
• Example: for a one-way ANOVA comparing 4 groups, calculate the
sample size needed in each group to obtain a power of 0.80
when the effect size is moderate (Cohen’s f = 0.25) and a
significance level of 0.05 is employed.
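One way to work this exercise is with pwr.anova.test() from the table above, leaving n unspecified. The sketch below instead reproduces the underlying calculation in base R so that it runs without extra packages: it assumes the standard noncentral F formulation with noncentrality k * n * f^2, and the helper name `anova_power` is hypothetical.

```r
# Power of a balanced one-way ANOVA with k groups, n per group, effect size f
anova_power <- function(n, k, f, alpha = 0.05) {
  df1 <- k - 1
  df2 <- k * (n - 1)
  ncp <- k * n * f^2                  # noncentrality parameter
  f_crit <- qf(1 - alpha, df1, df2)   # critical value under H0
  pf(f_crit, df1, df2, ncp = ncp, lower.tail = FALSE)
}

# Find the (fractional) n per group giving power 0.80, then round up
sol <- uniroot(function(n) anova_power(n, k = 4, f = 0.25) - 0.80,
               interval = c(2, 1000))
n_per_group <- ceiling(sol$root)

print(n_per_group)
print(anova_power(n_per_group, k = 4, f = 0.25))  # achieved power after rounding
```

Under these assumptions the search lands on 45 observations per group, or 180 in total; rounding up means the achieved power is slightly above the 0.80 target.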
