
A reprint from

American Scientist
the magazine of Sigma Xi, The Scientific Research Society

This reprint is provided for personal and noncommercial use. For any other use, please send a request to Permissions,
American Scientist, P.O. Box 13975, Research Triangle Park, NC, 27709, U.S.A., or by electronic mail to perms@amsci.org.
© Sigma Xi, The Scientific Research Society and other rightsholders
Computing Science

The Bootstrap
Cosma Shalizi

Statisticians can reuse their data to quantify the uncertainty of complex models

Cosma Shalizi received his Ph.D. in physics from the University of Wisconsin–Madison in 2001. He is an assistant professor of statistics at Carnegie Mellon University and an external professor at the Santa Fe Institute. Address: 132 Baker Hall, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213. Internet: http://www.bactra.org

© 2010 Sigma Xi, The Scientific Research Society. Reproduction with permission only. Contact perms@amsci.org. American Scientist, Volume 98, 2010 May–June, www.americanscientist.org.

Statistics is the branch of applied mathematics that studies ways of drawing inferences from limited and imperfect data. We may want to know how a neuron in a rat's brain responds when one of its whiskers gets tweaked, or how many rats live in Manhattan, or how high the water will get under the Brooklyn Bridge, or the typical course of daily temperatures in the city over the year. We have some data on all of these things, but we know that our data are incomplete, and experience tells us that repeating our experiments or observations, even taking great care to replicate the conditions, gives more or less different answers every time. It is foolish to treat any inference from only the data in hand as certain.

If all data sources were totally capricious, there'd be nothing to do beyond piously qualifying every conclusion with "but we could be wrong about this." A mathematical science of statistics is possible because, although repeating an experiment gives different results, some types of results are more common than others; their relative frequencies are reasonably stable. We can thus model the data-generating mechanism through probability distributions and stochastic processes: random series with some indeterminacy about how the events might evolve over time, although some paths may be more likely than others. When and why we can use stochastic models are very deep questions, but ones for another time. But if we can use them in a problem, quantities such as these are represented as parameters of the stochastic models. In other words, they are functions of the underlying probability distribution. Parameters can be single numbers, such as the total rat population; vectors; or even whole curves, such as the expected time-course of temperature over the year. Statistical inference comes down to estimating those parameters, or testing hypotheses about them.

These estimates and other inferences are functions of the data values, which means that they inherit variability from the underlying stochastic process. If we "reran the tape" (as Stephen Jay Gould used to say) of an event that happened, we would get different data with a certain characteristic distribution, and applying a fixed procedure would yield different inferences, again with a certain distribution. Statisticians want to use this distribution to quantify the uncertainty of the inferences. For instance, by how much would our estimate of a parameter vary, typically, from one replication of the experiment to another: say, to be precise, what is the root-mean-square (the square root of the mean average of the squares) deviation of the estimate from its average value, or the standard error? Or we could ask, "What are all the parameter values that could have produced this data with at least some specified probability?" In other words, what are all the parameter values under which our data are not low-probability outliers? This gives us the confidence region for the parameter: rather than a point estimate, a promise that either the true parameter point lies in that region, or something very unlikely under any circumstances happened, or that our stochastic model is wrong.

To get standard errors or confidence intervals, we need to know the distribution of our estimates around the true parameters. These sampling distributions follow from the distribution of the data, because our estimates are functions of the data. Mathematically the problem is well defined, but actually computing anything is another story. Estimates are typically complicated functions of the data, and mathematically convenient distributions all may be poor approximations of the data source. Saying anything in closed form about the distribution of estimates can be simply hopeless. The two classical responses of statisticians have been to focus on tractable special cases, and to appeal to asymptotic analysis, a method that approximates the limits of functions.

Origin Myths
If you've taken an elementary statistics course, you were probably drilled in the special cases. From one end of the possible set of solutions, we can limit the kinds of estimator we use to those with a simple mathematical form (say, mean averages and other linear functions of the data). From the other, we can assume that the probability distributions featured in the stochastic model take one of a few forms for which exact calculation is possible, either analytically or via tables of special functions. Most such distributions have origin myths: The Gaussian bell curve arises from averaging many independent variables of equal size (say, the many genes that contribute to height in humans); the Poisson distribution comes from counting how many of a large number of independent and individually improbable events have occurred (say, radium nuclei decaying in a given second), and so on. Squeezed from both ends, the sampling distribution of estimators and other functions of the data becomes exactly calculable in terms of the aforementioned special functions.

That these origin myths invoke various limits is no accident. The great results of probability theory (the laws of large numbers, the ergodic theorem, the central limit theorem and so on) describe limits in which all stochastic processes in broad classes of models display the same asymptotic behavior. The central limit theorem (CLT), for instance, says that if we average more and more independent random quantities with a common distribution, and if that common distribution is not too pathological, then the distribution of their means approaches a Gaussian. (The non-Gaussian parts of the distribution wash away under averaging, but the average of two Gaussians is another Gaussian.) Typically, as in the CLT, the limits involve taking more and more data from the source, so statisticians use the theorems to find the asymptotic, large-sample distributions of their estimates. We have been especially devoted to rewriting our estimates as averages of independent quantities, so that we can use the CLT to get Gaussian asymptotics. Refinements to such results would consider, say, the rate at which the error of the asymptotic Gaussian approximation shrinks as the sample sizes grow.

To illustrate the classical approach and the modern alternatives, I'll introduce some data: The daily closing prices of the Standard and Poor's 500 stock index from October 1, 1999, to October 20, 2009. (I use these data because they happen to be publicly available and familiar to many readers, not to impart any kind of financial advice.) Professional investors care more about changes in prices than their level, specifically the log returns, the log of the price today divided by the price yesterday. For this time period of 2,529 trading days, there are 2,528 such values (see Figure 1). The "efficient market hypothesis" from financial theory says the returns can't be predicted from any public information, including their own past values. In fact, many financial models assume such series are sequences of independent, identically distributed (IID) Gaussian random variables. Fitting such a model yields the distribution function in the center graph of Figure 1.

[Figure 1: three panels. Left: log returns versus trading day. Center: probability density versus r. Right: density of sampling distribution versus q0.01.]
Figure 1. A series of log returns from the Standard and Poor's 500 stock index from October 1, 1999, to October 20, 2009 (left), can be used to illustrate a classical approach to probability. A financial model that assumes the series are sequences of independent, identically distributed Gaussian random variables yields the distribution function shown at center. A theoretical sampling distribution that models the smallest 1 percent of daily returns (denoted as q0.01) shows a value of -0.0326 ± 0.00104 (right), but we need a way to determine the uncertainty of this estimate.

[Figure 2: two schematic panels. Left: data, fitted model, parameter calculation, simulation, simulated data, estimator, re-estimate. Right: data, empirical distribution, parameter calculation, resampling, simulated data, estimator, re-estimate.]
Figure 2. A schematic for model-based bootstrapping (left) shows that simulated values are generated from the fitted model, and then they are treated like the original data, yielding a new parameter estimate. Alternately, in nonparametric bootstrapping, a schematic (right) shows that new data are simulated by resampling from the original data (allowing repeated values), then parameters are calculated directly from the empirical distribution.

An investor might want to know, for instance, how bad the returns could be. The lowest conceivable log return is negative infinity (with all the stocks in the index losing all value), but most investors worry less about an
apocalyptic end of American capitalism than about large-but-still-typical losses: say, how bad are the smallest 1 percent of daily returns? Call this number q0.01; if we know it, we know that we will do better about 99 percent of the time, and we can see whether we can handle occasional losses of that magnitude. (There are about 250 trading days in a year, so we should expect two or three days at least that bad in a year.) From the fitted distribution, we can calculate that q0.01 = -0.0326, or, undoing the logarithm, a 3.21 percent loss. How uncertain is this point estimate? The Gaussian assumption lets us calculate the asymptotic sampling distribution of q0.01, which turns out to be another Gaussian (see the right graph in Figure 1), implying a standard error of 0.00104. The 95 percent confidence interval is (-0.0347, -0.0306): Either the real q0.01 is in that range, or our data set is one big fluke (at 1-in-20 odds), or the IID-Gaussian model is wrong.

[Figure 3: two panels. Left: density versus r. Right: density versus q0.01.]
Figure 3. An empirical distribution (left, in red, smoothed for visual clarity) of the log returns from a stock-market index is more peaked and has substantially more large-magnitude returns than a Gaussian fit (blue). The black marks on the horizontal axis show all the observed values. The distribution of q0.01 based on 100,000 nonparametric replications is very non-Gaussian (right, in red). The empirical estimate is marked by the blue dashed line.

Fitting Models
From its origins in the 19th century through about the 1960s, statistics was split between developing general ideas about how to draw and evaluate statistical inferences, and working out the properties of inferential procedures in tractable special cases (like the one we just went through) or under asymptotic approximations. This yoked a very broad and abstract theory of inference to very narrow and concrete practical formulas, an uneasy combination often preserved in basic statistics classes.
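The tractable special case just worked through, a Gaussian fit, its 1 percent quantile and an asymptotic standard error, can be sketched in a few lines. This is an illustration under stated assumptions, not the computation behind the numbers quoted above: the function names are hypothetical, the returns are synthetic draws with a standard deviation of 0.014 (a value chosen only to be in the neighborhood the quoted numbers suggest), and the standard error uses the textbook large-sample formula for a sample quantile, sqrt(p(1 - p)/n) / f(q), where f is the fitted density.

```python
import math
import random
from statistics import NormalDist, fmean, stdev

def gaussian_quantile(returns, p=0.01):
    """Fit a Gaussian by mean and standard deviation; return its p-quantile."""
    fitted = NormalDist(fmean(returns), stdev(returns))
    return fitted.inv_cdf(p)

def quantile_standard_error(sigma, n, p=0.01):
    """Large-sample standard error of a sample p-quantile under a Gaussian.

    Uses sqrt(p * (1 - p) / n) / f(q), where f is the Gaussian density
    evaluated at its own p-quantile q.
    """
    model = NormalDist(0.0, sigma)
    density_at_q = model.pdf(model.inv_cdf(p))
    return math.sqrt(p * (1 - p) / n) / density_at_q

# Synthetic stand-in for the log returns (hypothetical, NOT the S&P 500 data).
rng = random.Random(0)
returns = [rng.gauss(0.0, 0.014) for _ in range(2528)]

q_hat = gaussian_quantile(returns)
se = quantile_standard_error(stdev(returns), len(returns))
interval = (q_hat - 1.96 * se, q_hat + 1.96 * se)
print(f"q0.01 = {q_hat:.4f}, standard error = {se:.5f}")
print(f"95% interval = ({interval[0]:.4f}, {interval[1]:.4f})")
```

With sigma = 0.014 and n = 2,528 the formula returns roughly 0.00104, the size of the standard error quoted above; treating it as the formula behind that figure is an inference, not something the text states.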
[Figure 4: scatter plot of tomorrow's return (vertical axis) versus today's return (horizontal axis).]
Figure 4. A scatter plot of black circles shows log returns from a stock-market index on successive days. The best-fit line (blue) is a linear function that minimizes the mean-squared prediction error. Its negative slope indicates that days with below-average returns tend to be followed by days with above-average returns, and vice versa. The red line shows an optimization procedure, called spline smoothing, that will become more or less curved depending on looser or tighter constraints.

The arrival of (comparatively) cheap and fast computers made it feasible for scientists and statisticians to record lots of data and to fit models to them. Sometimes the models were conventional ones, including the special-case assumptions, which often enough turned out to be detectably, and consequentially, wrong. At other times, scientists wanted more complicated or flexible models, some of
which had been proposed long before but now moved from being theoretical curiosities to stuff that could run overnight. In principle, asymptotics might handle either kind of problem, but convergence to the limit could be unacceptably slow, especially for more complex models.
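How slow convergence to the limit can be is easy to see in a small simulation. The sketch below is illustrative only: the exponential data source, the two sample sizes and the function name `coverage` are assumptions chosen for the example. It counts how often the nominal 95 percent interval from the Gaussian approximation (sample mean plus or minus 1.96 estimated standard errors) actually contains the true mean.

```python
import math
import random
from statistics import fmean, stdev

def coverage(n, n_experiments=2000, true_mean=1.0, seed=1):
    """Fraction of nominal 95% Gaussian-approximation intervals covering the mean.

    Each experiment draws n exponential variates (a skewed distribution),
    then forms sample mean +/- 1.96 * s / sqrt(n).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        sample = [rng.expovariate(1.0 / true_mean) for _ in range(n)]
        half_width = 1.96 * stdev(sample) / math.sqrt(n)
        if abs(fmean(sample) - true_mean) <= half_width:
            hits += 1
    return hits / n_experiments

small = coverage(5)
large = coverage(200)
print(f"n = 5:   coverage {small:.3f}")
print(f"n = 200: coverage {large:.3f}")
```

In this toy setting the small-sample interval falls well short of its nominal 95 percent, while the large-sample one comes close; a sample mean is about the friendliest estimator there is, so more complex estimators can be expected to need more data still.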
By the 1970s statistics faced the problem of quantifying the uncertainty of inferences without using either implausibly helpful assumptions or asymptotics; all of the solutions turned out to demand even more computation. Perhaps the most successful was a proposal by Stanford University statistician Bradley Efron, in a now-famous 1977 paper, to combine estimation with simulation. Over the last three decades, Efron's "bootstrap" has spread into all areas of statistics, sprouting endless elaborations; here I'll stick to its most basic forms.
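Those most basic forms, the model-based and the nonparametric bootstrap pictured in Figure 2, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code behind any numbers reported here: the function names, the order-statistic quantile rule, the percentile interval, the synthetic returns and the small replication count are all hypothetical choices made to keep the example short and free of dependencies.

```python
import random
from statistics import fmean, stdev

def empirical_quantile(data, p=0.01):
    """A simple order-statistic estimate of the p-quantile."""
    ordered = sorted(data)
    index = max(0, int(p * len(ordered)) - 1)
    return ordered[index]

def model_based_bootstrap(data, estimator, n_reps, rng):
    """Simulate from a Gaussian fitted to the data, then re-estimate."""
    mu, sigma = fmean(data), stdev(data)
    return [estimator([rng.gauss(mu, sigma) for _ in data])
            for _ in range(n_reps)]

def nonparametric_bootstrap(data, estimator, n_reps, rng):
    """Resample the data with replacement, then re-estimate."""
    return [estimator(rng.choices(data, k=len(data)))
            for _ in range(n_reps)]

def summarize(replicates):
    """Standard error and 95% percentile interval of the replicates."""
    ordered = sorted(replicates)
    lo = ordered[int(0.025 * len(ordered))]
    hi = ordered[int(0.975 * len(ordered)) - 1]
    return stdev(replicates), (lo, hi)

# Synthetic stand-in for the returns (hypothetical, NOT the S&P 500 data).
rng = random.Random(42)
returns = [rng.gauss(0.0, 0.014) for _ in range(1000)]

for name, scheme in [("model-based", model_based_bootstrap),
                     ("nonparametric", nonparametric_bootstrap)]:
    reps = scheme(returns, empirical_quantile, 500, rng)
    se, (lo, hi) = summarize(reps)
    print(f"{name}: se = {se:.5f}, 95% interval = ({lo:.4f}, {hi:.4f})")
```

The two schemes differ only in where the simulated data come from; the estimator and the summary step are shared. Larger replication counts shrink the simulation noise at the cost of run time.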
Remember that the key to dealing with uncertainty in parameters is the sampling distribution of estimators. Knowing what distribution we'd get for our estimates on repeating the experiment would give us quantities, such as standard errors. Efron's insight was that we can simulate replication. After all, we have already fitted a model to the data, which is a guess at the mechanism that generated the data. Running that mechanism generates simulated data that, by hypothesis, have nearly the same distribution as the real data. Feeding the simulated data through our estimator gives us one draw from the sampling distribution; repeating this many times yields the sampling distribution as a whole. Because the method gives itself its own uncertainty, Efron called this "bootstrapping"; unlike Baron von Münchhausen's plan for getting himself out of a swamp by pulling himself out by his bootstraps, it works.

Let's see how this works with the stock-index returns. Figure 2 shows the overall process: Fit a model to data, use the model to calculate the parameter, then get the sampling distribution by generating new, synthetic data from the model and repeating the estimation on the simulation output. The first time I recalculate q0.01 from a simulation, I get -0.0323. Replicated 100,000 times, I get a standard error of 0.00104, and a 95 percent confidence interval of (-0.0347, -0.0306), matching the theoretical calculations to three significant digits. This close agreement shows that I simulated properly! But the point of the bootstrap is that it doesn't rely on the Gaussian assumption, just on our ability to simulate.

[Figure 5: scatter plot of tomorrow's return (vertical axis) versus today's return (horizontal axis), with spline curves.]
Figure 5. The same spline fit from the previous figure (black line) is combined with 800 splines fit to bootstrapped resamples of the data (blue curves) and the resulting 95 percent confidence limits for the true regression curve (red lines).

Bootstrapping
The bootstrap approximates the sampling distribution, with three sources of approximation error. First there's simulation error, using finitely many replications to stand for the full sampling distribution. Clever simulation design can shrink this, but brute force (just using enough replications) can also make it arbitrarily small. Second, there's statistical error: The sampling distribution of the bootstrap reestimates under our fitted model is not exactly the same as the sampling distribution of estimates under the true data-generating process. The sampling distribution changes with the parameters, and our initial fit is not completely accurate. But it often turns out that the distribution of estimates around the truth is more nearly invariant than the distribution of estimates themselves, so subtracting the initial estimate from the bootstrapped values helps reduce the statistical error; there are many subtler tricks to the same end. The final source of error in bootstrapping is specification error: The data source doesn't exactly follow our model at all. Simulating the model then never quite matches the actual sampling distribution.

Here Efron had a second brilliant idea, which is to address specification error by replacing simulation from the model with resampling from the data. After all, our initial collection of data gives us a lot of information about the relative probabilities of different values, and in certain senses this "empirical distribution" is actually the least prejudiced estimate possible of the underlying distribution: anything else imposes biases or preconceptions, which are possibly accurate but also potentially misleading. We could estimate q0.01 directly from the empirical distribution, without the mediation of the Gaussian model. Efron's "nonparametric bootstrap" treats the original data set as a complete population and draws a new, simulated sample from it, picking each observation with equal probability (allowing repeated values) and then re-running the estimation (as shown in Figure 2).

This new method matters here because the Gaussian model is inaccurate; the true distribution is more sharply peaked around zero and has substantially more large-magnitude returns, in both directions, than the Gaussian (see the left graph in Figure 3). For the empirical distribution, q0.01 = -0.0392. This may seem close to our previous point estimate of -0.0326, but it's well beyond the confidence interval, and under the Gaussian model we should see values that negative only 0.25 percent of the
time, not 1 percent of the time. Doing 100,000 non-parametric replicates (that is, resampling from the data and reestimating q0.01 that many times) gives a very non-Gaussian sampling distribution (as shown in the right graph of Figure 3), yielding a standard error of 0.00364 and a 95 percent confidence interval of (-0.0477, -0.0346).

Although this is more accurate than the Gaussian model, it's still a really simple problem. Conceivably, some other nice distribution fits the returns better than the Gaussian, and it might even have analytical sampling formulas. The real strength of the bootstrap is that it lets us handle complicated models, and complicated questions, in exactly the same way as this simple case.

To continue with the financial example, a question of perennial interest is predicting the stock market. Figure 4 is a scatter plot of the log returns on successive days, the return for today being on the horizontal axis and that of tomorrow on the vertical. It's mostly just a big blob, because the market is hard to predict, but I have drawn two lines through it: a straight one in blue, and a curved one in black. These lines try to predict the average return tomorrow as functions of today's return; they're called regression lines or regression curves. The straight line is the linear function that minimizes the mean-squared prediction error, or the sum of the squares of the errors made in solving every single equation (called the least squares method). Its slope is negative (-0.0822), indicating that days with below-average returns tend to be followed by ones with above-average returns and vice versa, perhaps because people try to buy cheap after the market falls (pushing it up) and sell dear when it rises (pulling it down). Linear regressions with Gaussian fluctuations around the prediction function are probably the best-understood of all statistical models (their oldest forms go back two centuries now), but they're more venerable than accurate.

The black curve is a nonlinear estimate of the regression function, coming from a constrained optimization procedure called spline smoothing: Find the function that minimizes the prediction error, while capping the value of the average squared second derivative. As the constraint tightens, the optimal curve, the spline, straightens out, approaching the linear regression; as the constraint loosens, the spline wiggles to try to pass through each data point. (A spline was originally a flexible length of wood craftsmen used to draw smooth curves, fixing it to the points the curve had to go through and letting it flex to minimize elastic energy; stiffer splines yielded flatter curves, corresponding mathematically to tighter constraints.)

To actually get the spline, I need to pick the level of the constraint. Too small, and I get an erratic curve that memorizes the sample but won't generalize to new data; but too much smoothing erases real and useful patterns. I set the constraint through cross-validation: Remove one point from the data, fit multiple curves with multiple values of the constraint to the other points, and then see which curve best predicts the left-out point. Repeating this for each point in turn shows how much curvature the spline needs in order to generalize properly. In this case, we can see that we end up selecting a moderate amount of wiggliness; like the linear model, the spline predicts reversion in the returns but suggests that it's asymmetric: days of large negative returns being followed, on average, by bigger positive returns than the other way around. This might be because people are more apt to buy low than to sell high, but we should check that this is a real phenomenon before reading much into it.

There are three things we should note about spline smoothing. First, it's much more flexible than just fitting a straight line to the data; splines can approximate a huge range of functions to an arbitrary tolerance, so they can discover complicated nonlinear relationships, such as asymmetry, without guessing in advance what to look for. Second, there was no hope of using a smoothing spline on substantial data sets before fast computers, although now the estimation, including cross-validation, takes less than a second on a laptop. Third, the estimated spline depends on the data in two ways: Once we decide how much smoothing to do, it tries to match the data within the constraint; but we also use the data to decide how much smoothing to do. Any quantification of uncertainty here should reckon with both effects.

There are multiple ways to use bootstrapping to get uncertainty estimates for the spline, depending on what we're willing to assume about the system. Here I will be cautious and fall back on the safest and most straightforward procedure: Resample the points of the scatter plot (possibly getting multiple copies of the same point), and rerun the spline smoother on this new data set. Each replication will give a different amount of smoothing and ultimately a different curve. Figure 5 shows the individual curves from 800 bootstrap replicates, indicating the sampling distribution, together with 95 percent confidence limits for the curve as a whole. The overall negative slope and the asymmetry between positive and negative returns are still there, but we can also see that our estimated curve is much better pinned down for small-magnitude returns, where there are lots of data, than for large-magnitude returns, where there's little information and small perturbations can have more effect.

Smoothing Things Out
Bootstrapping has been ramified tremendously since Efron's original paper, and I have sketched only the crudest features. Nothing I've done here actually proves that it works, although I hope I've made that conclusion plausible. And indeed sometimes the bootstrap fails; it gives very poor answers, for instance, to questions about estimating the maximum (or minimum) of a distribution. Understanding the difference between that case and that of q0.01, for example, turns out to involve rather subtle math. Parameters are functions of the distribution generating the data, and estimates are functions of the data or of the empirical distribution. For the bootstrap to work, the empirical distribution has to converge rapidly on the true distribution, and the parameter must smoothly depend on the distribution, so that no outlier ends up unduly influencing the estimates. Making "influence" precise here turns out to mean taking derivatives in infinite-dimensional spaces of probability distribution functions, and the theory of the bootstrap is a delicate combination of functional analysis with probability theory. This sort of theory is essential to developing new bootstrap methods for new problems, such as ongoing work on resampling spatial data, or model-based bootstraps where the model grows in complexity with the data.

The bootstrap has earned its place in the statistician's toolkit because, of all the ways of handling uncertainty in complex models, it is at once the most straightforward and the most flexible. It will not lose that place so long as the era of big data and fast calculation endures.

Bibliography
Efron, B. 1979. Bootstrap methods: another look at the jackknife. Annals of Statistics 7:1–26.
