


19 February 2010

Confidence Intervals for Forecast Verification

Eric Gilleland

Institute for Mathematics Applied to Geosciences (IMAGe) and
Research Applications Laboratory

P. O. Box 3000
ISSN Print Edition 2153-2397
ISSN Electronic Edition 2153-2400


The Technical Notes series provides an outlet for a variety of NCAR
Manuscripts that contribute in specialized ways to the body of scientific
knowledge but that are not suitable for journal, monograph, or book
publication. Reports in this series are issued by the NCAR scientific
divisions. Designation symbols for the series include:

EDD – Engineering, Design, or Development Reports
Equipment descriptions, test results, instrumentation,
and operating and maintenance manuals.

IA – Instructional Aids
Instruction manuals, bibliographies, film supplements,
and other research or instructional aids.

PPR – Program Progress Reports
Field program reports, interim and working reports,
survey reports, and plans for experiments.

PROC – Proceedings
Documentation of symposia, colloquia, conferences,
workshops, and lectures. (Distribution may be limited to
attendees.)

STR – Scientific and Technical Reports
Data compilations, theoretical and numerical
investigations, and experimental results.

The National Center for Atmospheric Research (NCAR) is operated by
the nonprofit University Corporation for Atmospheric Research (UCAR)
under the sponsorship of the National Science Foundation. Any opinions,
findings, conclusions, or recommendations expressed in this publication
are those of the author(s) and do not necessarily reflect the views of
the National Science Foundation.


Contents

1 Introduction

2 Normal Approximation
   2.1 Mean
   2.2 Student's t-distribution
   2.3 Continuous Variables
   2.4 Median
   2.5 Variance and Standard Deviation
   2.6 Linear Correlation Coefficient
   2.7 Mean Squared Error (MSE)
   2.8 Categorical Measures
      2.8.1 Proportions
      2.8.2 Peirce Skill Score (PSS)
      2.8.3 Odds Ratio (OR)
   2.9 Rank Correlations
   2.10 Checking the Assumption of Normality
   2.11 Accounting for Dependence
   2.12 Simultaneous Confidence

3 The Bootstrap
   3.1 A review of the bootstrap algorithm
      3.1.1 Bootstrap formulation
      3.1.2 Heuristic overview of the bootstrap
   3.2 Bootstrap confidence intervals
      3.2.1 Percentile method
      3.2.2 Basic Bootstrap Confidence Intervals
      3.2.3 Bootstrap-t Confidence Intervals
      3.2.4 The BCa interval
      3.2.5 The ABC Method
      3.2.6 Calibration
   3.3 Examples of Bootstrapping with R
   3.4 Accounting for Dependence
   3.5 Sample size and Replication sample size issues

4 Summary


A Appendix
   A.1 Forecast Verification Review
   A.2 The verification package in R
   A.3 Simulating correlated random vectors
   A.4 Plotting in R

References

List of Figures

1 Forecast bias for 24-hour accumulated precipitation (in) calculated for increasing thresholds for ARW (blue), NMM (red) and the difference of NMM subtracted from ARW (green). Figure reproduced from Bernadet et al. (2009).

2 Plots of tables 1 (left column) and 4 (right column). Blue solid lines are "forecast" values and orange dotted lines are "observed" values. See section A.4 to learn how to make these graphs in R.

3 Box plots for the simulated data in table 1.

4 Example ROC plot with ties using data from Mason and Graham (2002) as available in the help file for roc.plot in the R package verification. Dotted blue lines show 95% confidence intervals for hit rate and false alarm rate individually (for threshold = 0.6) using the interval (11). Red lines show the 95% confidence region using the Bonferroni method described in section 2.12.

5 QQ-plots of randomly generated samples of size n = 30 from (top left) standard normal, (top right) Gumbel distribution centered at 0 with scale of 1, (lower left) exponential with unit rate, and (lower right) normal distribution with mean 0 and variance 12. Blue line is the 1-1 line.

6 Associated standardized normal qq-plots for the forecast (left) and observed (right) simulations from table 1. Simulations are standardized and the blue line is the 1-1 line.

List of Tables

1 Simulated forecast/observation pairs for continuous bivariate-normal distributed random variables. The "true" mean (standard deviation) for the forecast and observed are 12 (2.5) and 13 (2.7), respectively, and the "true" (linear) correlation between the forecast and observed values is 0.75. Values listed are rounded to 2 decimal places. See appendix A.3 to learn how these values were simulated.

2 Contingency table for Finley's (1884) tornado forecasts.

3 Verification statistics based on proportions from the contingency table in table 2. Associated confidence intervals are 95% intervals (i.e., α = 0.05); 1.96 is used for the critical value zα/2.

4 Simulations similar to those in table 1, but with temporal correlation (AR(1) with population autoregressive coefficient of 0.6). Unlike table 1, population variances and means are equivalent.

5 Summary statistics for table 4 (rounded to two decimal places). The variance inflation factor is estimated by Eq. (19).

6 A comparison summary of various methods for estimating confidence intervals through bootstrap resampling. First-order accuracy is best, and it should be noted that these are based on particular statistics and a known exact distribution.

7 95% bootstrap confidence intervals (percentile and BCa) for the verification statistics output by the verify function for the simulations of table 1. Note that results will vary because they are obtained through simulations.

8 95% (percentile) confidence intervals using the circular block bootstrap (CBB) approach to accounting for dependence in the bootstrap for the simulations from tables 1 (top) and 4 (bottom).

9 Generic 2 × 2 contingency table.

10 Common forecast verification scores based on 2 × 2 contingency tables. See table 9 for definitions of a, b, c, d, and n.

11 Common forecast verification scores based on 2 × 2 contingency tables, continued from table 10. See table 9 for definitions of a, b, c, d, and n.

12 Common forecast verification scores for continuous random variables.

PREFACE

Forecast verification is concerned with the performance of a forecast, usually in relation to other forecast models, previous versions of the same model, or different configurations of the same model. Often, a single point estimate about an aspect of forecast performance is reported. Statistically, such a point estimate is meaningless without information about its associated uncertainty. Recently, there has been an increase in emphasis on including confidence intervals for such calculations (e.g., in the Model Evaluation Tools (MET) software, http://www.dtcenter.org, which is a new major software tool designed to perform forecast verification for both operational use and research).

There are different approaches and numerous issues concerning the construction of confidence intervals, depending on the statistic of interest and characteristics of the sample. However, there does not seem to be any consolidated source for instructing a user how to correctly construct confidence intervals for the various situations that arise in practice, particularly in the context of forecast verification. The present manuscript is an attempt to present the various types of confidence intervals, their assumptions, and other issues as they pertain to forecast verification applications in a single document.

Confidence intervals can be categorized into parametric and non-parametric intervals. By far the most common parametric intervals are those based on the assumption of approximate normality for the underlying sample. Such intervals are discussed in detail for those verification statistics that use this approximation. Bootstrap intervals are also discussed, in addition to details about the assumptions involved and how to check and account for the assumptions. Of particular importance are the assumptions underlying the bootstrap, which are frequently overlooked because of a miscommunication that the procedure has no assumptions; essentially, it has less stringent assumptions than parametric intervals. When the assumptions are met, or they are accounted for within the bootstrap procedure, then this approach can provide highly accurate intervals for most statistics of interest. The various bootstrap confidence intervals (along with their pros and cons) and bootstrap methods are described.

This article was originally submitted on 30 July 2008.

1 Introduction

When obtaining a point estimate of a random quantity, say θ̂, in order to make inferences about the true value of the quantity, θ, it is important to acknowledge the uncertainty about the point estimate. One way to do this is in the form of an hypothesis test (e.g., testing whether θ is significantly different from zero). Another related, but more informative, method is to give a confidence interval for the quantity θ̂. Under the classical (or frequentist) paradigm, a (1 − α) · 100% confidence interval for a parameter estimate θ̂ is interpreted, awkwardly, as the interval such that, if it were reconstructed for 100 different realizations of the sample of random variables, one would expect that the true parameter, θ, would fall inside (1 − α) · 100 of the intervals.

The emphasis here is strictly on frequentist confidence intervals, and not on Bayesian Credible Intervals; for information on such intervals in general, see, e.g., Bernardo and Smith (2000). Confidence intervals for this paradigm are not discussed here, although the fiducial argument of Fisher (1935) is a frequentist approach, and much work has been focused on these intervals recently (e.g., Dawid and Stone, 1982; Hannig et al., 2006). Another type of frequentist confidence interval not discussed here is the profile-likelihood method (e.g., Coles, 2001; Gilleland and Katz, 2006), as this approach, being difficult to automate, is not typically useful for most forecast verification purposes.

While confidence intervals are treated in any introductory statistical textbook, it seems that there are numerous issues, particularly for verification scores, that are not to the best of my knowledge available in a single succinct treatment of the subject, though Jolliffe (2007) gives a nice overview of statistical inference for forecast verification, and Wilks (2006) and Jolliffe and Stephenson (2003) also give much information on the subject. This write-up is an attempt to consolidate as much pertinent information on confidence intervals as is reasonable (particularly as pertains to forecast verification) into a single manuscript.

Throughout this write-up, f will denote the forecast series (or field) and o the verification series (or field), which may be a direct observation, or an interpolation or analysis. In the context of forecast verification, the validity of confidence intervals may be affected by the nature of the verification series. Indeed, even if direct observations are taken, there is inevitably an error associated with them; the value observed is a single realization from a distribution of possible values. Observational error can be an important component of uncertainty that should be incorporated into the confidence intervals. Further, if multiple forecasts are being compared, and an analysis obtained from one of the forecasts, say forecast A, is used as the verification series, then there may be an artificial bias favoring forecast A over the others. However, this issue is not discussed further here.
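The frequentist coverage interpretation described above lends itself to a quick numerical check. The sketch below is purely illustrative and written in Python (the report's own examples use R); it repeatedly draws normal samples and counts how often the nominal 95% normal interval for the mean covers the true mean.

```python
import math
import random

def normal_ci(sample, z=1.96):
    """Nominal 95% normal-approximation interval for the mean of a sample."""
    n = len(sample)
    xbar = sum(sample) / n
    sd = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
    half = z * sd / math.sqrt(n)
    return xbar - half, xbar + half

random.seed(42)
mu, sigma, n, trials = 12.0, 2.5, 30, 2000
hits = 0
for _ in range(trials):
    lo, hi = normal_ci([random.gauss(mu, sigma) for _ in range(n)])
    if lo <= mu <= hi:
        hits += 1
print(hits / trials)  # close to the nominal 0.95
```

The observed proportion falls slightly below 0.95 because the sketch uses the z critical value with a small sample, which anticipates the Student's t discussion in section 2.2.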

Example 1. Figure 1 shows an example of confidence intervals obtained from bootstrapping for two versions of the WRF model for 24-hour accumulated precipitation (in) for increasing thresholds. The bias values were taken from adequately spaced points in time so that temporal dependence is not an issue. The figure graphically demonstrates the importance of accounting for uncertainty. Without such information, one could not make meaningful statements about the performance of the models. For example, it can be readily seen from the figure that there is much more certainty concerning the "true" value of the model bias for the lower thresholds than for the higher ones.

For this example, the bias for both models is relatively close to 1 (unbiased) at all of the thresholds, but it is never exactly 1. With the confidence intervals, it can be seen that for the two lowest thresholds (0.01 and 0.10 inches), the models' biases are not statistically significantly different from 1, as this value is contained within the intervals. For thresholds of 0.25, 0.5, and 0.75, the models under-predict precipitation, with statistical significance for one or both models in each case. Additionally, at the highest thresholds, both models tend to over-predict the amount of precipitation. The result is statistically significant because 1 is not contained in the intervals, but it is most likely associated with there being far fewer values above the highest thresholds.

Finally, figure 1 illustrates another important point concerning comparisons between the two models. It can be tempting to look at the red and blue lines, note that the confidence intervals overlap, and conclude that there is no statistically significant difference between the two models in terms of bias. However, such a conclusion ignores the structure of the uncertainty of the two models. In order to accurately assess whether the bias for the two models is statistically different between them, it is necessary to look at the differences in bias, and construct confidence intervals for the differences themselves. The green line in figure 1 shows the differences along with the confidence intervals.
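The point that model comparison should rest on an interval for the paired differences, not on whether two separate intervals overlap, can be sketched in a few lines. The bias values below are invented purely for illustration, and Python is used here for a self-contained sketch rather than the report's R.

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical paired bias estimates for two models at matched, well-separated
# times. These numbers are made up for illustration only.
a = [1.05, 0.98, 1.10, 1.02, 0.95, 1.08, 1.01, 0.99, 1.07, 1.03]
b = [0.97, 0.96, 1.02, 0.94, 0.93, 1.00, 0.95, 0.92, 1.01, 0.96]

d = [x - y for x, y in zip(a, b)]      # paired differences in bias
z = NormalDist().inv_cdf(0.975)        # approximately 1.96 for a 95% interval
half = z * stdev(d) / math.sqrt(len(d))
lo, hi = mean(d) - half, mean(d) + half

# The difference is significant when this interval excludes zero, even though
# the two models' individual intervals may well overlap.
significant = lo > 0 or hi < 0
print(significant)  # True for these numbers
```

Here the per-model intervals would overlap substantially, yet the difference interval lies entirely above zero, which is exactly the situation the green line in figure 1 is designed to reveal.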

Indeed, it can be seen that the ARW model has a statistically significantly higher bias than the NMM model for thresholds of 0.01 inches and 0.50 inches. Otherwise, the bias for the two models is not statistically significantly different, as zero is included in the remaining intervals. While these differences may not be practically significant, that is an issue for the practitioner interested in the results; in this document, the word significant refers to statistical significance. The confidence intervals displayed here pertain only to the sampling uncertainty.

Figure 1: Forecast bias for 24-hour accumulated precipitation (in) calculated for increasing thresholds for ARW (blue), NMM (red) and the difference of NMM subtracted from ARW (green). Figure reproduced from Bernadet et al. (2009).

There are two general types of variables of interest in meteorological forecast verification: continuous (e.g., temperature, relative humidity, etc.) and categorical (e.g., precipitation exceeding a threshold, storm categories, etc.), and continuous variables generally fulfill different assumptions than categorical variables. Typically, continuous variables lead to verification score estimates whose confidence intervals can be constructed using the normal approximation discussed in section 2.1. A result concerning the asymptotic behavior of the binomial distribution can be used for proportions, discussed in section 2.8, which allows for some verification scores aggregated over categorical variables to also make use of the normal approximation. Construction of confidence intervals often requires making assumptions about the underlying sample, and much of the remainder of this section deals with some of these issues and how to handle them. Some treatment of how to check the validity of the normal approximation is given in section 2.10. It is also important to account for any dependencies in the sample, and some information about this is provided in section 2.11. Section 2.12 shows how to obtain appropriate coverage for confidence regions (i.e., simultaneous confidence intervals). Finally, section 3 discusses the construction of confidence intervals using bootstrap resampling methods, which can be employed for most situations even when distributional assumptions, such as the normal approximation, cannot be made, though care must still be taken in performing appropriate resampling procedures for each particular situation. The appendix contains various ancillary information useful for carrying out the examples provided in the various sections, including a brief review of some verification statistics used in this manuscript (appendix A.1).

2 Normal Approximation

Because of a powerful theorem concerning the central tendency of large numbers (i.e., the Central Limit Theorem – CLT), the normal distribution is often used as an approximation for the distribution of θ̂. The general form of the normal confidence interval is given by θ̂ ± zα/2 · se(θ̂), where zα = Φ⁻¹(α), with Φ the cumulative normal distribution. Of course, the true standard error, se(θ̂), is not usually known in practice, so that it must be estimated somehow. For many parameters of interest, estimation of their standard errors can be odious, as will be seen in subsequent sections. Once an estimate, ŝe(θ̂), is obtained, the interval becomes

    θ̂ ± zα/2 · ŝe(θ̂).        (1)

Many other normal approximation intervals are available that do not rely on

the central limit theorem. For example, the intervals of sections 2.4 and 2.5 below rely on other results that still lead to a normal approximation interval.

2.1 Mean

Confidence intervals for the mean provide a classic example of using the normal approximation. Suppose we have a set of independent and identically distributed (iid) random variables, x1, . . . , xn, with mean µ and variance σ². Then, for n large enough, the CLT tells us that the mean µ̂ = (1/n) Σ_{i=1}^{n} xi follows a normal distribution, and the 100 · (1 − α)% CI is given by

    µ̂ ± zα/2 · σ/√n,        (2)

where σ is the standard deviation of x1, . . . , xn, and n is the sample size. However, because the true standard error, σ, is generally not known, we must use an estimate, σ̂, so that we actually use

    µ̂ ± zα/2 · σ̂/√n,

and typically this estimate is taken to be

    σ̂² = Σ_{i=1}^{n} (xi − x̄)²/(n − 1).        (3)

Example 2.1. Table 1 shows a realization of 30 simulated forecast and observation pairs that marginally follow iid normal distributions with mean (standard deviation) of 12 (2.5) and 13 (2.7), respectively, and a linear correlation of 0.75 between the pairs. Figure 2 (left column) shows a time series and a scatter plot for these simulated values. Their estimated means (standard deviations) are f̄ ≈ 12.60 (σ̂f ≈ 2.60) and ō ≈ 13.82 (σ̂o ≈ 2.58), and their estimated linear correlation is r̂ ≈ 0.76. To obtain 95% confidence intervals for f̄, we have (1 − α) · 100 = 95 =⇒ α = 0.05, and z_{1−0.05/2} ≈ 1.96. Plugging these values into the interval (2), we get

    12.60 ± 1.96 · 2.60/√30 ≈ (11.67, 13.53).
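Interval (2) with the summary statistics of Example 2.1 can be checked in a few lines in any language; for instance, in Python (illustrative only, since the report's examples use R):

```python
import math

mean, sd, n = 12.60, 2.60, 30      # summary statistics for the forecast column
z = 1.96                           # critical value z_{1 - 0.05/2}
half_width = z * sd / math.sqrt(n)
lo, hi = mean - half_width, mean + half_width
print(round(lo, 2), round(hi, 2))  # 11.67 13.53
```

The same arithmetic with the observed column's summary values reproduces the interval for ō quoted in the text.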

Similarly, the 95% confidence interval for ō for these simulated data is approximately (12.89, 14.74). Note that the "true" (i.e., population) values of the forecast and observed means fall inside their respective 95% confidence intervals. A useful exercise is to try simulating 100 such samples, and estimating 95% confidence intervals for the mean, to see how many times the population means fall inside the intervals.

In the R statistical programming language (R Development Core Team, 2008; henceforth, R), the critical values zα/2 can be obtained for any value of α using the function qnorm. For example, to find the value for z_{0.01/2}, which is about −2.58, from the R prompt type qnorm( 0.01/2). To obtain the estimated mean, use the function mean, and sd to get the estimated standard deviations. To obtain the means of the two columns of the matrix in table 1 (call it Z) simultaneously, we can make use of the apply function (i.e., apply( Z, 2, mean)). For help using any function in R, type ? followed by the function's name (e.g., ?mean), or use the help function (i.e., help( mean)). Some useful verification statistics can be easily obtained for Z using the verify function from the verification package as follows.

fit <- verify( Z[,"Observed"], Z[,"Forecast"], frcst.type="cont", obs.type="cont")
summary( fit)

This function will be used later in section 3.3 when finding bootstrap confidence intervals.

2.2 Student's t-distribution

If we do not observe a large enough number of values, say n < 30, then the uncertainty associated with the estimate, σ̂, in interval (2) leads to an interval that might well not be a good approximation. In this case, the Student's t-distribution

191593 2 14.389068 14. The “true" mean (standard deviation) for the forecast and observed are: 12 (2.150141 29 14.192419 23 16.184865 16.383291 3 9.491850 13.167156 19 13.355702 11 10.670986 28 14.750962 9 12. respectively.462885 10.848257 18.260947 22 11.158444 13.635971 17 9.924855 4 11. and the “true" (linear) cor- relation between the forecast and observed values is 0.889418 7 .947174 19.730549 14.75.251538 15.437557 13.854932 13 14.349605 10.676388 15.394073 9.591317 12.865105 6 12.432232 16.059593 8 13.905100 11.188449 16.873908 7 7.265871 12.434587 16.052257 27 12.7).500861 16. i Forecast Observed 1 12.890921 21 13.329994 12 12.496651 11.709277 20 11.693249 10 9.401932 26 9.5) and 13 (2.158311 16 14.823590 5 12.336874 11.819920 24 16.015671 11.437230 15 10.519104 18 15.473143 12.519409 11.398705 14.002107 16.914912 25 16.589335 30 7.909060 11.990255 14 17.Table 1: Simulated forecast/observation pairs for continuous bivariate-normal dis- tributed random variables.169597 12.953189 11.

Figure 2: Plots of tables 1 (left column) and 4 (right column). Blue solid lines are "forecast" values and orange dotted lines are "observed" values. See section A.4 to learn how to make these graphs in R. (Top panels: the simulated series plotted as time series; bottom panels: scatter plots of observed versus forecast values.)

can be used, so that we replace the zα/2 in (2) with tα/2,n−1 to get

    µ̂ ± tα/2,n−1 · σ̂/√n,        (4)

where tα/2,n−1 is the α/2 quantile from the t-distribution with n − 1 degrees of freedom. Note that interval (4) is the same as interval (2) except for the critical value tα/2,n−1 from the t-distribution with n − 1 degrees of freedom in place of the zα/2 critical value from the standard normal distribution. Note also that there is an assumption that the random variables x1, . . . , xn are independent, which will often not be the case for atmospheric data because they are usually both spatially and temporally correlated.

Example 2.2. This value can be obtained for the simulations in table 1 from R using qt. For example, for α = 0.05, use qt( 1-0.025, 29), where n − 1 = 30 − 1 = 29, and recalling that tα/2,n−1 = −t1−α/2,n−1. Note that this value (≈ 2.05) is close to the critical value of approximately 1.96 from the standard normal distribution for α = 0.05, so, for n this large, it might not matter, however, which of the two is used. This yields confidence intervals for the mean based on interval (4) of (11.63, 13.57) for f̄, and (12.85, 14.79) for ō from Example 2.1.

2.3 Continuous Variables

Some continuous measures of forecast performance may at least approximately follow a normal distribution, so that the confidence interval (1) may be employed. For example, the Mean Error (ME), being a mean of the errors (f − o), may be normally distributed either by the central limit theorem or by virtue of the errors themselves, which for some variables will be normally distributed even if the values of f or o are not normally distributed. So, for ME at least, interval (2) or (4) may be used for calculating confidence intervals, depending on the size of n. Further, even when the f and o series are each dependent, the errors, z1, . . . , zn, might still be independent of each other. For mean absolute error (MAE), the absolute value of the errors cannot be negative, so the assumptions for the interval (2) (or (4)) should be carefully checked for MAE (see section 2.10). There is still, however, an underlying assumption of normality. The next two sections derive confidence intervals for measures that are not normally distributed themselves,

but still rely on the underlying data to be iid normal with mean µ and variance σ². Some parameters (or verification scores) that are not normally distributed themselves can still make use of the CLT to determine appropriate confidence intervals. For example, if x1, . . . , xn ∼ iid N(0, 1), then their squares follow a χ²₁ distribution.

Tests for dependence and normality should be employed before calculating any confidence intervals for parameter estimates based on the data. At the very least, one should examine diagnostic graphs to determine the feasibility of the assumptions for using parametric intervals. Nevertheless, in the case of dependent data, the effect will be equivalent to a reduction in the sample size (because several data points will give redundant information), and subsequently an increase in the standard error. Therefore, the confidence intervals should be wider than the intervals that do not account for dependence in the data, and there are ways to account for dependence in the confidence intervals (see section 2.11).

Example 2.3. To estimate the mean error from the simulations of table 1, we first need to compute the errors z1 = f1 − o1, . . . , z30 = f30 − o30, and store them to a vector object named Z.diff. Let Z be a 30 × 2 R matrix containing the values from table 1. To do this in R, use the following command.

Z.diff <- -apply( Z, 1, diff)

Next, we need to estimate the mean and standard deviation of this new vector. For simplicity, we will make use of a function from the fields package, which is one of the packages required by the verification package (see section A.2). The function stats from this package will automatically calculate the estimates we seek, among several others. For example,

stats( Z) # Gives summary statistics for each column of table 1.

there is stat- istical significance at the 5% level that the “forecast" average is under-estimating the “observation" average. The approximate 95% confidence interval in the context of comparing medians from two populations (e. . The technique seems to only be applied to 95% confidence intervals throughout the literature. .86). Note that zero is not contained in this interval with positive limits. and (iii) a compromise between the two samples being compared having equal variances and having one variance dominate the other. with α = 0. the interquartile range). .. . The value 1. 2.e. and for example.g. gives the approximate 95% confidence interval for the estimate of ME for table 1 of approximately (0. Substituting these values into interval (2).05.58 × IQR ẋ ± √ . given by Eq. (ii) a relation between the variance of the sample median and the sample mean. and n is the sample size. . xn . 1.4 Median A method introduced by McGill et al. In particular.22 and 0. (3) is replaced with one that is less sensitive to outliers. Brown et al. Velleman and Hoaglin. . (1978) for finding confidence intervals for the median. but seems to work well for a wide-variety of distributions (e.58 in interval (5) is based on three components: (i) an alternative estimate for the standard deviation for a sample. call it ẋ. 1981.g. these three components are described next. in constructing notches on box plots with the R and MatLab programming languages. .. (i) The usual estimate for the standard deviation for a sample x1 . (5) n where IQR is the difference between the 0. stats( Z. ẋ is the estimated median. xn from a normal distri- bution. The original context for these intervals was in the comparison of medians from different populations. is based on assumptions of normality. . x1 .75 and 0. denoted σ̂x . 11 ..33 for the mean error and its standard deviation.58. side-by-side box plots with notches around the median) is given by 1. 
Two additional results for the normal distribution are needed. (i) The IQR for a normal distribution with variance σ² is given by IQR = 1.349σ, and the estimate used here is obtained by solving this equation for σ. That is, σ̂ = IQR/1.349. (ii) For large samples, another result for the normal distribution is that the variance of the sample median, ẋ, is π/2 times the variance of the sample mean. That is, var(x̄) = σ̂²/n, so that var(ẋ) = (π/2) · var(x̄) = (π/2) · σ̂²/n. Putting (i) and (ii) together gives an estimate for the standard error of the median as se(ẋ) = √(π/2) · IQR/(1.349·√n).

(iii) If two batches of samples are being compared, and they have equal variances, then var(x̄₁ − x̄₂) = var(x̄₁) + var(x̄₂) = 2var(x̄₁). The other extreme is if one variance dominates the other. In general, a compromise between these two extremes is to use the average (z_{1−α/2} + z_{1−α/2}/√2)/2 for the critical value. For 95% confidence (i.e., α = 0.05), this is (1.96 + 1.96/√2)/2 ≈ 1.7. The value 1.58 from interval (5) above is approximately the product √(π/2) · 1.7/1.349.

If interest is not in comparing two populations, but simply in obtaining a confidence interval about the estimated population median, ẋ, then the following interval is an approximate (1 − α) · 100% confidence interval:

    ẋ ± z_{1−α/2} · √π · IQR / (1.349 · √(2n))     (6)

Example 2.4. Box plots for the simulated forecast and observed pairs in table 1 are shown in figure 3, where Z is a 30 × 2 matrix holding the values in table 1. Estimates for the median of the forecast and observed values are approximately 12.49 and 13.39, respectively. The following commands in R will give the estimates for the upper and lower quartiles (first line) for each of the forecast and observed columns of Z, and their estimated interquartile ranges (second line).

    hold <- apply( Z, 2, quantile, probs=c(0.25, 0.75))
    apply( hold, 2, diff)

From this, we obtain the estimates of the 0.25 and 0.75 quantiles of the sample distributions (i.e., the quartiles), which give estimates for the interquartile ranges of approximately 3.17 and 4.14, respectively. Plugging the estimates into (5) yields the 95% confidence intervals for the medians as approximately (11.58, 13.40) and (12.20, 14.59), respectively. These intervals are displayed in the box plots of figure 3 using notches around the middle (median) line; the notches show the interval (5) for each column of the table. To create the box plots in figure 3, use the following command in R.

    boxplot( list( Forecast=Z[,1], Observed=Z[,2]), notch=TRUE)

2.5 Variance and Standard Deviation

Suppose x₁, . . . , xₙ iid ∼ N(µ, σ²), and denote the sample variance by s² = Σᵢ₌₁ⁿ (xᵢ − x̄)²/(n − 1). Confidence intervals for the variance, σ², make use of the following probability.

    Pr{ (n−1)s²/χ²_{α/2,n−1} < σ² < (n−1)s²/χ²_{1−α/2,n−1} } = 1 − α

Here, χ²_{α,ν} is a chi-square critical value. It is the number on the abscissa such that α of the area under the chi-square curve with ν degrees of freedom lies to the right of χ²_{α,ν}. Because the χ² distribution is not symmetric, the upper and lower limits cannot be computed by adding and subtracting the same quantity as in (2) and (4). They are given by

    ℓ(σ̂²) = (n−1)s²/χ²_{α/2,n−1},   u(σ̂²) = (n−1)s²/χ²_{1−α/2,n−1}.     (7)

Confidence intervals for the standard deviation, σ, are obtained by taking the square root of these limits.
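Interval (6) is simple to compute in any language. The following is a minimal Python sketch (the document's own examples use R; the sample values below are made-up, illustrative numbers, not the table 1 data).

```python
import math
from statistics import median

def median_ci(sample, z=1.96):
    """Normal-approximation CI for the median, interval (6):
    x_med +/- z * sqrt(pi) * IQR / (1.349 * sqrt(2n))."""
    n = len(sample)
    s = sorted(sample)

    def quantile(p):
        # Linear interpolation between order statistics (like R's default type 7).
        h = (n - 1) * p
        lo = math.floor(h)
        hi = min(lo + 1, n - 1)
        return s[lo] + (h - lo) * (s[hi] - s[lo])

    iqr = quantile(0.75) - quantile(0.25)
    half = z * math.sqrt(math.pi) * iqr / (1.349 * math.sqrt(2 * n))
    m = median(sample)
    return m - half, m + half

# Hypothetical sample standing in for one column of table 1.
data = [11.2, 12.9, 14.1, 12.5, 13.0, 12.2, 15.3, 11.8, 12.7, 13.9]
```

Note that the interval is symmetric about the sample median, as with all of the normal-approximation intervals in this section.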

Figure 3: Box plots for the simulated data in table 1.

Example 2. 29.025. type (note that anything after # on a line is a comment and need not be typed) the following (for α = 0. if we type qchisq( 0. We continue using the simulated values from table 1. respectively. or the help (?) function. respectively. from the R prompt. 3. which gives approximately (using the sqrt function in R) (2.49) and (2.tail=FALSE) # upper critical value qchisq( 0. and n−1 = 30−1 = 29. Therefore.18) and (4. 3. # lower critical value qchisq(p= 0.07).06. use the args function. We can use the R function qchisq to obtain these values for any value of α.24. we obtain a 95% confidence interval from (7) for the variance of the forecast and observed columns of table 1. we simply take the square root of these intervals. lower.025. From this. args( qchisq) lists out all of the required and optional arguments for the qchisq function. To see the list of arguments for any R function.47). we only need the appropriate critical values from the χ2 dis- tribution. For example. 15 .tail=FALSE) Note that the arguments p and df need not be written ex- plicitly as long as they are kept in the order of the argument list.27.07. In fact.5. 29„FALSE). they are approximately (4. df=29.975.05). 12. lower. To obtain 95% confidence inter- vals for the standard deviations. then we should get the same answer as we did from the first line above (note the double comma „ in the argument list to skip the argument ncp). 12. We have estimates for the variance (and standard de- viations) of each column. Specifically.

2.6 Linear Correlation Coefficient

Another important statistic of interest, for which it is desired to construct confidence intervals, is the linear correlation coefficient (e.g., between f and o). We obtain such intervals under the assumption that the forecast and observations are jointly distributed as a bivariate normal. First, define the following two quantities:

    c_ℓ = v − z_{α/2}/√(n−3)   and   c_u = v + z_{α/2}/√(n−3),     (8)

where v = ½ ln[(1 + r)/(1 − r)] and z_{α/2} is the 100(1 − α/2)% critical value from the normal distribution. Then, a 100(1 − α)% CI for ρ, the population correlation, is given by

    ( (e^{2c_ℓ} − 1)/(e^{2c_ℓ} + 1),  (e^{2c_u} − 1)/(e^{2c_u} + 1) ),     (9)

where c_ℓ and c_u are as defined in Eq. (8).

Example 2.6. Continuing with the simulations from table 1, which we have stored in an R object called Z, we can compute the (linear) correlation between the two columns of the table using cor(Z). Note that we get a 2 × 2 matrix returned from the cor command, but here we are only interested in one of the off-diagonal elements (the matrix is symmetric). If we only want this value returned, we could have used cor(Z)[1,2] (or equivalently, cor(Z)[2,1]). In doing this, we find that our estimated linear correlation is about 0.77; recall that the "true" population correlation is 0.75. To get 95% confidence intervals for this estimate, we first calculate v, then c_ℓ and c_u from (8) above (use log in R for ln). In doing so, we get v ≈ 1.02, c_ℓ ≈ 0.64 and c_u ≈ 1.39 (note that this uses z_{0.975} ≈ 1.96). Substituting into interval (9), we get approximately (0.57, 0.88) for our 95% confidence interval for the linear correlation between the forecast and observed simulations from table 1.
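The transformation in (8)-(9) is straightforward to code. A Python sketch (the document's examples use R; note that v and its back-transform in (9) are simply atanh and tanh):

```python
import math

def correlation_ci(r, n, z=1.96):
    """Intervals (8)-(9): Fisher transform v = (1/2) ln((1+r)/(1-r)),
    build the normal interval on the transformed scale, then
    back-transform each endpoint by (e^{2c} - 1)/(e^{2c} + 1) = tanh(c)."""
    v = 0.5 * math.log((1 + r) / (1 - r))
    half = z / math.sqrt(n - 3)
    c_lower, c_upper = v - half, v + half
    back = lambda c: (math.exp(2 * c) - 1) / (math.exp(2 * c) + 1)
    return back(c_lower), back(c_upper)
```

For instance, r = 0.77 with n = 30 gives roughly (0.57, 0.88); note that the back-transformed interval is not symmetric about r.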

2.7 Mean Squared Error (MSE)

The mean squared error (MSE) in forecast verification is defined to be the average squared difference between the forecast and observed values. This is different from the usual MSE familiar to statisticians. The MSE can be broken out into the mean error, the marginal variances, and the covariance as below. That is,

    MSE = (f̄ − ō)² + s²_f + s²_o − 2·cov(f, o),     (10)

where s²_f and s²_o denote the variances for the forecast and observed fields, respectively. Note that the first term is just the square of the mean error (as above). That is,

    f̄ − ō = (1/n)Σᵢ₌₁ⁿ fᵢ − (1/n)Σᵢ₌₁ⁿ oᵢ = (1/n)Σᵢ₌₁ⁿ (fᵢ − oᵢ).

The next two terms are the variances of the forecast and observed values, respectively, and, assuming iid normality for the underlying data, confidence intervals for the standard deviations can be obtained from (7). Finally, we need confidence intervals for the covariance term. We can re-write the covariance as cov(f, o) = s_f · s_o · r_{f,o}. Maybe this notation is getting obtuse, but it is just the product of the marginal standard deviations and the correlation between f and o. Because this is a factor of s_f and s_o, the standard deviations of the marginal distributions of the forecast and observations, we obtain confidence intervals for their correlation instead. Again, assuming the underlying data are iid normally distributed, interval (9) is used to construct confidence intervals for r_{f,o}.

Another breakdown for the MSE is given by MSE = Variance + Bias², where the first component is again covered by (7) and the bias term is a squared difference of two means. Confidence intervals can be calculated for each of these individual components. In terms of finding confidence intervals for the MSE as a whole unit, however, the bootstrap procedures described in section 3 are the best suited to the task.

2.8 Categorical Measures

Most of these measures are proportions, and can be handled simply using, for example, the Wald or Wilson methods as described in the next subsection. Confidence intervals for other types of categorical verification statistics are described subsequently.
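The decomposition in (10) is easy to verify numerically. A Python sketch with made-up forecast/observation pairs (hypothetical values, not the table 1 data); the identity is exact when the variances and covariance use a 1/n denominator, matching the 1/n average in the MSE itself:

```python
def mse_decomposition(f, o):
    """Check MSE = (mean error)^2 + var(f) + var(o) - 2 cov(f, o).
    All moments here use a 1/n denominator so the identity is exact."""
    n = len(f)
    fbar, obar = sum(f) / n, sum(o) / n
    var_f = sum((x - fbar) ** 2 for x in f) / n
    var_o = sum((y - obar) ** 2 for y in o) / n
    cov = sum((x - fbar) * (y - obar) for x, y in zip(f, o)) / n
    mse = sum((x - y) ** 2 for x, y in zip(f, o)) / n
    return mse, (fbar - obar) ** 2 + var_f + var_o - 2 * cov

f = [12.1, 13.4, 11.8, 14.2, 12.9]  # illustrative forecasts
o = [12.8, 13.1, 12.5, 13.9, 13.6]  # illustrative observations
```

Both returned values agree to machine precision, which is the point of the decomposition: each piece on the right can be given its own confidence interval.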

2.8.1 Proportions

Verification scores such as the hit rate (aka probability of detection (POD)) and false alarm rate (aka probability of false detection (POFD)) are simply proportions, and we can easily calculate confidence intervals for proportions. Given a proportion, p, estimated by p̂, the simplest interval to use is the normal approximation (a binomial-distributed random variable tends to normality as the sample size increases). This is known as the Wald confidence interval, and is given by

    p̂ ± z_{α/2} √( p̂(1 − p̂)/n ),     (11)

where z_{α/2} is the usual α/2-critical value obtained from the standard normal distribution. For smaller samples, there are more satisfactory intervals. One is Wilson's score method, which is generally an improvement over the Wald interval for large n as well, and is given by

    [ p̂ + z²_{α/2}/(2n) ± z_{α/2} √( p̂(1 − p̂)/n + z²_{α/2}/(4n²) ) ] / [ 1 + z²_{α/2}/n ].     (12)

A simpler, reasonable interval is given by adding two successes and two failures to the proportion, and then using the Wald confidence interval given by (11). Important examples of verification scores whose confidence intervals can be constructed by (11) or (12) are the hit rate, false alarm rate, and false alarm ratio (FAR).

Example 2.7. Table 2 shows the famous Finley (1884) tornado forecasts contingency table. Table 3 shows point estimates for some verification statistics based on proportions, along with their associated confidence intervals computed using intervals (11) and (12). Some of the point estimates in table 3 can be readily obtained using the verify function from the verification package in R as follows (see, e.g., the help file for verify).

    obs <- c(28, 23, 72, 2680)
    verify(obs, pred = NULL, frcst.type = "binary", obs.type = "binary")

Table 2: Contingency table for Finley's (1884) tornado forecasts.

                                Observed
                         Tornado   No tornado
    Forecast  Tornado       28          72      100
              No tornado    23        2680     2703
                            51        2752     2803

For the false alarm rate, the value can be calculated by hand, which gives 0.02616279 without rounding any further. This is the value used in the calculations for the two sets of confidence intervals shown in table 3. Note that if the rounded value of 0.03 is used instead, then the upper limit of both intervals (Wald and Wilson) increases to 0.04. It is easily seen that the intervals (11) and (12) both yield very similar results, at least for this example, with the latter tending to be slightly tighter than the former.

Example 2.8. Figure 4 shows the ROC plot for data from Mason and Graham (2002) as available in the help file for roc.plot of the R package verification. For a threshold of 0.6, the blue dotted lines show 95% confidence intervals for the hit and false alarm rates individually based on interval (11). The red lines show the corresponding confidence region using the Bonferroni method described in section 2.12.
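Intervals (11) and (12) are one-liners in any language. A Python sketch applied to the Finley hit rate (the document computes these with R's verification package):

```python
import math

def wald_ci(p_hat, n, z=1.96):
    """Interval (11): p_hat +/- z * sqrt(p_hat (1 - p_hat) / n)."""
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def wilson_ci(p_hat, n, z=1.96):
    """Interval (12), Wilson's score method."""
    center = p_hat + z ** 2 / (2 * n)
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2))
    denom = 1 + z ** 2 / n
    return (center - half) / denom, (center + half) / denom

# Hit rate for the Finley tornado forecasts: 28 hits out of 51 observed tornadoes.
hit_rate, n_h = 28 / 51, 51
```

To two decimal places, the Wald interval is (0.41, 0.69) and the Wilson interval (0.41, 0.68): very similar, with Wilson slightly tighter, as noted above.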

Figure 4: Example ROC plot with ties using data from Mason and Graham (2002) as available in the help file for roc.plot in the R package verification. Dotted blue lines show 95% confidence intervals for hit rate and false alarm rate individually (for threshold = 0.6) using the interval (11). Red lines show the 95% confidence region using the Bonferroni method described in section 2.12.

Table 3: Verification statistics based on proportions from the contingency table in table 2. Associated confidence intervals are 95% intervals (i.e., α = 0.05, so that 1.96 is used for the critical value z_{α/2}), and values listed are rounded to 2 decimal places.

                               Point Estimate   Wald interval   Wilson interval
    hit rate/POD                    0.55        (0.41, 0.69)    (0.41, 0.68)
    false alarm rate/POFD           0.03        (0.02, 0.03)    (0.02, 0.03)
    false alarm ratio (FAR)         0.72        (0.63, 0.81)    (0.63, 0.80)

2.8.2 Peirce Skill Score (PSS)

Because the Peirce Skill Score (PSS) can be written as the hit rate (H) minus the false alarm rate (F), confidence intervals can be calculated for it based on binomial sampling provided that H and F are mutually independent (Stephenson, 2000). As usual, we assume that H and F are normally distributed for large n. Thusly, the sampling distribution is the normal distribution, so that interval (1) applies with standard deviation given by

    ŝ_PSS = √( ŝ²_H + ŝ²_F ) = √( H(1 − H)/n_H + F(1 − F)/n_F ),     (13)

where n_H = hits + misses and n_F = false alarms + correct negatives.

Example 2.9. Using the estimates of hit rate ≈ 0.549 and false alarm rate ≈ 0.026 from the Finley tornado forecasts of example 2.7, we obtain an estimate for PSS of ≈ 0.523. Plugging into interval (13) with α = 0.05 yields a 95% confidence interval for the estimate of PSS of (0.39, 0.66).

2.8.3 Odds Ratio (OR)

100·(1 − α)% confidence bounds for the odds ratio (OR) can be obtained in several different ways, but one that suffices is the normal approximation to the logarithm of OR provided by Woolf (1955). Specifically,

    ln{OR} ± z_{α/2} · (1/a + 1/b + 1/c + 1/d)^{1/2},     (14)
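The PSS interval built from (13) for the Finley counts, as a Python sketch (the counts are taken from table 2):

```python
import math

def pss_ci(hits, misses, false_alarms, correct_negatives, z=1.96):
    """PSS = H - F with the standard error of (13):
    sqrt(H(1-H)/n_H + F(1-F)/n_F)."""
    n_h = hits + misses
    n_f = false_alarms + correct_negatives
    H, F = hits / n_h, false_alarms / n_f
    se = math.sqrt(H * (1 - H) / n_h + F * (1 - F) / n_f)
    pss = H - F
    return pss, (pss - z * se, pss + z * se)

pss, (lo, hi) = pss_ci(28, 23, 72, 2680)  # Finley tornado counts
```

Because n_F is so large here, almost all of the sampling variability comes from the hit-rate term.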

where z_{α/2} is the usual standard normal quantile as in (1), and a, b, c, and d are the usual contingency table counts: hits, false alarms, misses, and correct rejections, respectively (see table 9). Note that this is for the logarithm of OR. For OR, simply exponentiate the above formula (see also Stephenson, 2000). For an odds ratio of 1 (i.e., that the odds are 1 to 1 that a forecast tornado would be a hit vs. a false alarm), the logarithm of this value would be zero. If the odds ratio were between zero and one, then the odds would be in favor of a forecasted tornado being a false alarm, and the logarithm of the odds ratio would be negative.

Example 2.10. Continuing with the Finley tornado data from example 2.7, the odds ratio estimate is ≈ 45.3. Using interval (14), with α = 0.05, we get a 95% confidence interval for the log odds ratio of approximately (3.21, 4.41). Because the 95% confidence interval's lower limit is positive (i.e., OR > 1), there is statistically significant evidence that the odds of a forecasted tornado being a hit are greater than being a false alarm.

2.9 Rank Correlations

Rank statistics provide unique challenges for finding confidence intervals associated with them, at least to describe in a text; the calculations can be cumbersome, and the intention here is to direct the interested reader to appropriate references for specific statistics. Ties are usually the culprit for confusing the issue. Nevertheless, much work in the area has been done. There is an hypothesis test for the Spearman rank correlation (Best and Roberts, 1975), but as it is against no correlation only, it cannot be used to create a confidence interval. The only alternative is to employ the bootstrap techniques described in section 3 for this statistic. For n sufficiently large (i.e., to ensure the approximate asymptotic distribution assumption is valid, and that the estimated variance is not negative), an approximate 100(1 − α)% confidence-interval estimate of the population Kendall's τ coefficient is given by

    t ± z_{α/2} σ̂(T ),     (15)
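Interval (14) for the Finley counts, sketched in Python:

```python
import math

def log_odds_ratio_ci(a, b, c, d, z=1.96):
    """Woolf (1955) interval (14): ln(OR) +/- z * sqrt(1/a + 1/b + 1/c + 1/d),
    with OR = ad/(bc); a, b, c, d = hits, false alarms, misses, correct rejections."""
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or - z * se, log_or + z * se

lo, hi = log_odds_ratio_ci(28, 72, 23, 2680)  # Finley tornado counts
```

Exponentiating the endpoints gives the interval for OR itself; since the lower limit is positive (OR > 1 throughout the interval), the conclusion of example 2.10 follows directly.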
rather than to belabor over the details for specific cases.. to ensure the approximate asymptotic distribu- tion assumption is valid. misses. For n sufficiently large (i.3.

graph the sample 23 . The interval introduced by Cortes and Mohri (2004) is given by      σ(Ak ) σ(Ak ) min E(Ak ) − √ . which is related to the Wilcoxon- Mann-Whitney statistic. and suggest their own distribution-independent interval. Mason and Graham (2002). Estimates are provided in Cortes and Mohri (2004) for E(Ak ) and σ 2 (Ak ). which they showed to yield tighter intervals than other methods for most situations. here denoted A. 1992). specifically.10 Checking the Assumption of Normality Perhaps the most satisfying method for checking the assumption of normality is to check the quantile-quantile graph (aka qq-plot).3. much research has been focused on finding confidence intervals for the various statistics relevant to this analysis tool. max E(Ak ) + √ . the reader is deferred to any text on nonparametric statistics (e. Because of the wide use of the receiver operating characteristic (ROC) diagram in numerous different fields of study.g.Note that this is a normal approximation as in (1). E(Ak ) is the expected value of the AUC. Hypothesis testing for the AUC in the meteorological literature can be found in. 2. over all classifications with k errors. as well as details about how to determine εk . see example 3. one can make use of a normal approximation under the hood for large n that simplifies the calculations some.. The details for finding this estimate are rather odious. partic- ularly the area under the ROC curve (AUC).g. (16) k∈IK εk k∈IK εk where k. e. nonparametric) intervals for AUC. Ik is a closed interval in the positive integers. Instead. an example is given for finding such intervals using bootstrap resampling in section 3. Mason (2003) suggests that Wilson’s interval (12) for proportions can be used with n equal to the total sample size. Because of the intricacy of computations in deriving parametric (or. Gibbons and Chakraborti. and εk > 0 is set to determine the level of the confidence interval. 
Pepe (2003) is a good source for information on ROC curves. Cortes and Mohri (2004) present a useful summary of some of the intervals introduced. Although the interval (16) is a distribution-independent interval. the number of false alarms and misses. Wilks (2006). σ 2 (Ak ) is the variance of A for classifications with k errors. is fixed. and therefore related to the ranking quality of the classi- fication. including confid- ence intervals. in some cases. and it is not the intention here to bedizen the text with highly technical and esoteric mathematics. but the formulas are a bit more involved than is desired to detail here. but the estimation of σ̂(T ) is troublesome because of ties. Simply put.

Figure 5 shows examples of qq-plots for random samples drawn from 4 different distributions, plotted against the standard normal quantiles. The upper left graph shows the qq-plot for a sample from the standard normal distribution. Notice that the points hug the 1-1 line fairly closely. Clearly the top right and lower left are samples from distinctly non-normal distributions, and this is evident from the qq-plots (see the curvature in the layout of the points). The lower right qq-plot is from a normal distribution, but one with a different variance than that of the standard normal. The points fall in a fairly straight line, but have a different slope than 1 (hence do not follow the blue 1-1 line in the graph). Therefore, it is important to either standardize the sample before plotting, or estimate the mean and variance and use the quantiles from the normal distribution with these values for the parameters instead of the standard normal.

Of course, the qq-plot approach requires a human observer to make subjective assessments of appropriateness of the normal distribution, and this is not always feasible (e.g., when faced with thousands of separate samples). Therefore, procedures that can be automated are still useful. One such method involves looking at the linear correlation between the ordered sample values, x_{1:n}, . . . , x_{n:n}, and the values y_i = Φ^{-1}((i − a)/(n + 1 − 2a)) as used in the qq-plot described above, where i = 1, . . . , n, Φ is the standard normal cdf, and 0 ≤ a ≤ 1/2 is constant. More specifically, instead of using the quantiles of the normal on the abscissa, the expected values of the k-th order statistics are used. Various values for a have been proposed (e.g., Blom, 1959, uses a = 0.375).

The sample correlation coefficient for the n pairs (x_{i:n}, y_i) is calculated as

    r = [ n Σ x_{i:n} y_i − (Σ x_{i:n})(Σ y_i) ] / [ √( n Σ x²_{i:n} − (Σ x_{i:n})² ) · √( n Σ y²_i − (Σ y_i)² ) ].

The hypothesis of population normality is rejected if r ≤ c_α, where c_α is a critical value chosen so that when the population distribution is truly normal, the probability that r is at most c_α (i.e., incorrectly rejecting normality) is α. Table A.14 in Devore (1995) gives these values for the choice of a = 0.375 (as in Blom, 1959).

Other tests for normality include: the D'Agostino's K-squared, Jarque-Bera, Anderson-Darling, Cramér-von-Mises, Lilliefors (a modification of the Kolmogorov-Smirnov test), Pearson's chi-square, Shapiro-Wilk, and Shapiro-Francia tests for normality (see, e.g., Judge et al., 1988).

⁴ Probability plots are also sometimes used, and are related to the qq-plot.
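The correlation check can be automated as follows; a Python sketch using Blom's a = 0.375 (the normal scores come from the standard normal inverse cdf):

```python
import math
from statistics import NormalDist

def normality_correlation(sample, a=0.375):
    """Correlation r between the ordered sample values and the normal
    scores Phi^{-1}((i - a)/(n + 1 - 2a)), i = 1..n.
    Values near 1 support normality; small r leads to rejection."""
    x = sorted(sample)
    n = len(x)
    y = [NormalDist().inv_cdf((i - a) / (n + 1 - 2 * a)) for i in range(1, n + 1)]
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)
```

A sample that is a linear function of its own normal scores gives r = 1 exactly, which is the plain-language meaning of "the points fall on a straight line" in the qq-plot.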

Figure 5: QQ-plots of randomly generated samples of size n = 30 from (top left) standard normal, (top right) Gumbel distribution centered at 0 with scale of 1.25, (lower left) exponential with unit rate, and (lower right) normal distribution with mean 0 and variance 12.25. Blue line is the 1-1 line.

Example 2.11. Figure 5 shows results for random simulations from various distributions, so results from this example may not look exactly like the examples in this figure. Using R, the following code shows how these graphs were made. The function rgev comes from the R package evd (see section A.2 to learn how to install and load packages in R).

    # Put all four graphs on one device in a
    # 2 x 2 layout.
    par( mfrow=c(2,2))
    qqnorm( rnorm( 30))
    abline(0, 1, col="blue")
    qqnorm( rgev( 30, loc=0, scale=1.25, shape=0))
    abline( 0, 1, col="blue")
    qqnorm( rexp( 30))
    abline( 0, 1, col="blue")
    qqnorm( rnorm( 30, sd=3.5))
    abline( 0, 1, col="blue")

Example 2.12. Let us check the validity of the normality assumption for the simulations in table 1, using the function qqnorm to obtain qq-plots like those in figure 5. Here, the data are first standardized by subtracting the mean and dividing by the standard deviations. This is not necessary (i.e., the lines will still essentially follow a straight line), but it allows plotting the 1-1 line, making for easier diagnosing.

    par( mfrow=c(1,2))
    qqnorm( (Z[,1]-mean(Z[,1]))/sd(Z[,1]))
    abline(0, 1, col="blue")
    qqnorm( (Z[,2]-mean(Z[,2]))/sd(Z[,2]))
    abline(0, 1, col="blue")

Results from the above lines of code are shown in figure 6. The standard normal qq-plots for these simulations appear to be reasonably linear, suggesting that normal approximation confidence intervals are reasonable.

2.11 Accounting for Dependence

All of the intervals given so far assume that the data are independent. It is often the case in forecast verification that this assumption is not valid. Intervals calculated based on the independence assumption when the data are actually dependent (e.g., temporally or spatially) can be grossly inaccurate. Ideally, any dependence in the data should be accounted for by applying a statistical model to the data (e.g., an autoregressive process, geo-spatial model, generalized additive model, etc.). It is beyond the scope of this write-up to tackle such models, but it suffices to say that if an appropriate model can be found, then confidence intervals obtained from the resulting model will adequately account for dependence (see, e.g., Agresti, 1996; Brockwell and Davis, 2002; Cressie, 1993; Katz, 1982; Kleinbaum, 1996). For more about this approach, see Wilks (1997) and Zwiers and von Storch (1995).

If the data follow an autoregressive process, there is a simple method for inflating the variance in the face of dependent data. The method takes the variance of the mean to be

    var(x̄) = V · σ̂²/n,     (17)

where σ̂² is the sample variance of the data, and V is the variance inflation factor given by

    V = 1 + 2 Σᵢ₌₁ⁿ⁻¹ (1 − i/n) ϕ̂(i),     (18)

with ϕ̂(i) being estimates of the correlation between the process of interest (e.g., temperature) at time t and t + i. This is also referred to as the autocorrelation function. For an AR(1) process with lag-1 correlation coefficient ρ, we have that ϕ(i) = ρ^|i|. Note that the technique can give wholly inaccurate intervals if the autoregressive assumption is not valid, as is also the case if a dependence model is used that is not appropriate.
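Equations (18) and (19) in a short Python sketch; for a long AR(1) series, the general factor (18) computed from ϕ(i) = ρ^i approaches the large-sample shortcut (19):

```python
def inflation_factor(acf_values, n):
    """Eq. (18): V = 1 + 2 * sum_{i=1}^{n-1} (1 - i/n) * phi_hat(i).
    `acf_values` holds phi_hat(1), ..., phi_hat(n-1)."""
    return 1 + 2 * sum((1 - i / n) * phi
                       for i, phi in enumerate(acf_values, start=1))

def ar1_inflation_factor(rho):
    """Eq. (19), large-sample AR(1) approximation: (1 + rho) / (1 - rho)."""
    return (1 + rho) / (1 - rho)
```

For example, an AR(1) coefficient of 0.48 gives V ≈ 2.85, so a series of length 30 carries roughly 30/2.85 ≈ 11 effectively independent observations.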

Figure 6: Associated standardized normal qq-plots for the forecast (left) and observed (right) simulations from table 1. Simulations are standardized and the blue line is the 1-1 line.

Example 2.13. Because the data from table 1 were simulated to be independent, we do not expect to have much dependence (i.e., down the rows, not between the columns). Nevertheless, we fit an autoregressive model, say of order 1, to these columns using R, as follows.

    arima( Z[,1], order=c(1,0,0), method="ML")
    arima( Z[,2], order=c(1,0,0), method="ML")

As expected, the estimated autocorrelation coefficients are near zero for both columns of simulations. The function acf can be used to calculate estimates of the autocorrelation function, ϕ̂(i), for each lag i = 0, 1, . . . , n − 1, in order to compute V of Eq. (18). It is also possible to plot the autocorrelation (ACF) and partial autocorrelation (PACF) functions using the R functions acf and pacf, respectively.

For the case of the lag-1 autoregressive dependence structure (AR(1)), Eq. (18) also gives rise to an estimate of the effective sample size, ne, or the number of effectively independent observations, ne = n/V (see Brockwell and Davis, 2002, for details). A convenient large-sample approximation for the AR(1) case is given by

    V = (1 + ρ)/(1 − ρ).     (19)

For the median, this factor can also be used, but in estimating the autoregressive parameter, it is more appropriate to use the excursions above and below the median (i.e., the vector of ones and zeros, where 1 indicates a value above the median, and zero a value below the median; see, e.g., Brown et al. (1997)). It should be emphasized, however, that ne also only pertains to inferences on means. For a much more thorough treatment of this topic in the realm of forecast verification, see Wilks (2006, section 8.5).

Example 2.14. Table 4 holds simulations similar to those in table 1, but this time with temporal dependence (i.e., thinking of the columns of the table as time series). Specifically, the simulations both follow an AR(1) process with population autoregressive coefficient of 0.6. See appendix A.3 to read these values into R, as well as to see how they were simulated. Henceforth, it is assumed that these values are stored in R as Z2. In order to use the variance inflation factor, we need to estimate the value of the AR(1) coefficient. One way to do this in R follows.

    # Forecast column
    arima(x = Z2[,1], order = c(1,0,0))
    # Observed column
    arima(x = Z2[,2], order = c(1,0,0))
    # Compute various statistics for both columns
    stats( Z2)

From the above commands, we get point estimates of the autoregressive coefficient for the forecast and observed columns of about 0.48 and 0.56, respectively. The third line in the code above gives estimates for the means and standard deviations for each column of Z2. Table 5 summarizes the pertinent information to construct confidence intervals for this example. Using (19), we get a variance inflation factor of (1 + 0.48)/(1 − 0.48) ≈ 2.85 for the forecast column, and (1 + 0.56)/(1 − 0.56) ≈ 3.55 for the observed column. The effective sample size for each column is given by 30/2.85 ≈ 11 and 30/3.55 ≈ 9, respectively (rounding up). Then, 95% confidence intervals adjusted for dependence for the mean of each column are given by −0.26 ± 1.96 · 0.94 · √(2.85/30) ≈ (−0.83, 0.31) and −0.31 ± 1.96 · 0.93 · √(3.55/30) ≈ (−0.94, 0.32), respectively. The corresponding 95% confidence intervals obtained from (1) without inflating the variance to account for dependence are approximately (−0.60, 0.08) and (−0.64, 0.02), respectively, which are narrower as expected.

2.12 Simultaneous Confidence

There are numerous instances where confidence intervals for a vector of verification statistics are desired. That is, let the vector, θ_m, of verification statistics have length m. While it may be possible to construct confidence intervals for the individual components, a simultaneous confidence region around the sample vector with confidence level 1 − α requires that the individual component intervals each be constructed at a level smaller than α. That is, each individual component of the vector must have more precise confidence intervals than (1 − α)100% in order to achieve a (1 − α)100% simultaneous confidence region. It is not always a simple task to obtain such intervals, and numerous methods have been suggested.

One relatively simple method, called the Bonferroni method (or Bonferroni correction), is to increase the length of each individual interval based on the Bonferroni inequality. If the components of θ_m are at least approximately independent, then the region defined by m confidence intervals (for each component of θ_m), each with confidence level α_m = α/m, will jointly enclose θ_m with probability at least as large as 1 − α. The Bonferroni correction can be overly conservative in that the true confidence level, say α′, given by α′ = 1 − (1 − α_m)^m if the verification statistics are independent and α′ ≤ m·α_m otherwise, can be considerably smaller than α. Furthermore, it is not known exactly how much smaller it is for any given problem. Related to this issue is the fact that the intervals do not account for dependence in the components of θ_m.

A common situation in forecast verification where it is important to account for simultaneous confidence intervals arrives with the relative operating characteristic (ROC) diagrams (aka receiver operating characteristic, ROC, diagrams). Here, interest is primarily in confidence intervals for the two axes of the plot: hit rate (aka sensitivity) and false alarm rate (aka 1 − specificity). Section 2.8.1 describes how confidence intervals can be obtained for each of these proportions. Assuming approximate independence, a level α Bonferroni confidence region is obtained by replacing z_{α/2} in Eq. (11) with z_{α/4}.

Example 2.15. Here we use data from Mason and Graham (2002), which is available in the help file for the R function roc.plot of the verification package.
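For the ROC case (m = 2), the Bonferroni region simply widens each Wald interval from (11) by using z_{α/4} in place of z_{α/2}. A general-m Python sketch (the proportion/sample-size pairs below are illustrative placeholders, not the Mason and Graham data):

```python
import math
from statistics import NormalDist

def bonferroni_wald_region(stats, alpha=0.05):
    """Simultaneous Wald intervals for m proportions: each component
    interval is built at level alpha/m, i.e. with critical value
    z_{alpha/(2m)} (z_{alpha/4} when m = 2).
    `stats` is a list of (p_hat, n) pairs."""
    m = len(stats)
    z = NormalDist().inv_cdf(1 - alpha / (2 * m))
    region = []
    for p_hat, n in stats:
        half = z * math.sqrt(p_hat * (1 - p_hat) / n)
        region.append((p_hat - half, p_hat + half))
    return region

# Hypothetical hit rate and false alarm rate with their sample sizes.
region = bonferroni_wald_region([(0.55, 51), (0.03, 2752)])
```

Each side of the resulting box is wider than the corresponding individual 95% interval, which is the price paid for joint (at least) 95% coverage.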

47581190 9 -0.3 to learn how these values were simulated.3048584 0.1694675 -0.12639304 5 -0.56886759 11 -0.45436737 21 0.1231457 -0.2496801 -2.1796791 -1.4300334 -0.73388300 26 0. Unlike table 1.39070897 17 -0.2277186 0.01004306 22 2.43560061 8 0.02686002 13 -1.53966190 27 -1.76397733 3 -0.5347141 -1.2234683 -0.6).5693288 -0.6693022 0. but with temporal correlation (AR(1) with population autoregressive coefficient of 0.36194711 32 .1747598 -0.3186723 -1.81930500 12 -1.71297461 28 -1.50519692 4 -0.28454249 30 0.1823664 -1.75843656 23 1.Table 4: Simulations similar to those in table 1.4179113 0.5128639 0.7120746 -0.6023930 0.13244522 18 0.1709163 1.3412686 -0.1246895 0.56123172 29 -0.1626001 -0.28998716 24 0.91198788 20 -0.5412252 -1.22433912 25 1.97850052 16 -0.26301891 6 -1.25258151 19 -1.1914762 -1.7012047 -0.32459909 10 -0.5687755 1. population variances and means are equivalent. See appendix A.3768035 -0.7517067 -0.09641756 14 -1.7757895 -0.4932702 -0.02079795 15 -1.89639945 2 0.18606631 7 0. i Forecast Observed 1 -0.

Figure 4 shows the ROC plot for this example. 3 The Bootstrap Before discussing how to construct confidence intervals using bootstrap methods. it is expec- ted that the vector containing the true hit and false alarm rates will fall inside the box about 95 times. The variance inflation factor is estimated by Eq. There exist numerous papers and text books pertaining to bootstrap resampling.26 -0. DiCiccio and Efron (1996).Table 5: Summary statistics for table 4 (rounded to two decimal places). 33 .56 coefficient Variance Inflation 2. Efron and Tibshirani (1998). respectively (blue dotted lines).plot of the verification package.31 Standard 0. it is useful to review the bootstrap procedure.93 Deviation Autoregressive 0. for threshold of 0.85 3. The interpretation of the red box is that the vector of the true hit and false alarm rates falls within the red box with approximately 95% confidence. along with 95% confidence in- tervals for the hit and false alarm rates.94 0. Forecast Observed Mean -0. including: Davison and Hinkley (1997). Lahiri (2003). if the experiment were re-run 100 times.55 Factor roc.48 0. (19).6. That is.

3.1 A review of the bootstrap algorithm

In an effort not to bedevil the reader, a rather formal formulation of the procedure is given first in section 3.1.1; this will be followed by a more heuristic explanation in section 3.1.2, and this will allow for easier notation in subsequent sections.

3.1.1 Bootstrap formulation

The philosophy behind bootstrap resampling is that the observed sample of random variables provides an adequate representation of the full underlying population. Formally, suppose we observe the sample Xn = {X₁, . . . , Xn}, which follows an unknown distribution, F. We first construct an estimate, F̂n, of F from the available observations, and we assume that F̂n is a reasonable representative of the population distribution function. Usually, F̂n is taken to be the empirical distribution function of the observations. That is,

    F̂n ≡ (1/n) Σᵢ₌₁ⁿ I{Xᵢ ≤ ·},     (20)

where I{A} is 1 whenever the statement A is true and 0 otherwise. The estimate of level-1 population parameters, θ = θ(F), is obtained by θ̂n = θ(F̂n). For example, θ could be the frequency bias, probability of detecting an event, equitable threat score, etc. That is, F̂n is playing the part of F, so that θ(F̂n) is analogous to the otherwise unknown level-1 parameter θ(F).

The next step is to generate B iid random samples from F̂n, which we denote here by X*(b)_m = {X*(b)₁, . . . , X*(b)_m}, with m ≤ n, for the b-th replicate. Note that when using F̂n as in (20), a simple random sample with replacement from Xn is being taken to obtain a bootstrap sample.⁵ The bootstrap version of the usual sample-based estimator, θ̂n, is obtained for each replicate by replacing Xn with X*(b)_m, and calculating θ̂*(b)_m from this resample. For most applications, we take m = n, but there are instances when it is more appropriate to take m < n. For example, if the population distribution is heavy-tailed (i.e., where EXᵢ² = ∞), it is recommended to choose m as a function of n such that as n → ∞ both m → ∞ and m/n → 0, e.g., m = ⌊√n⌋, where ⌊y⌋ denotes the greatest integer less than y (cf. Athreya, 1987; Shao and Dongsheng, 1995; Bertail, 1997; Lee, 1999; Lahiri, 2003). In the case of heavy-tailed distributions, the notation for F̂n, etc., should reflect the possibility of m < n (e.g., F̂m,n), but this is not done here in favor of simpler notation.

⁵ There are methods that sample without replacement, but this is much less common.
2003.1 Bootstrap formulation The philosophy behind bootstrap resampling is that the observed sample of ran- dom variables provides an adequate representation of the full underlying popula- tion. 1997). 34 .g. but there are instances when it is more appropriate to take m < n. we take m = n. We first construct an estimate. which we denote ∗(b) ∗(b) ∗(b) here by Xm = {X1 . this will be followed by a more heuristic explanation in section 3. equitable threat score. The bootstrap version of the usual sample based estimator. but this is much less common. ..

Finally, the above is based on the situation of iid random variables. If this assumption is invalid, then the algorithm as described will not yield accurate results. There are various bootstrap methods available that account for dependent random variables, and these are discussed in section 3.4 below. For the block bootstrap methods discussed there, blocks of data are resampled, and these blocks are assumed to be iid so that the above algorithm is again valid.

Provided that the original sample is an adequate representation of the population, the estimator θ(F̂n) accurately reflects the characteristics of the population and the sample that together determine the sampling distribution of random variables constructed from both (e.g., θ̂ − θ). The bootstrap principle provides a general method for estimating level-2 parameters related to the unknown distribution Gn of such random variables (e.g., θ̂n − θ ∼ Gn), or functionals thereof. For example, the first two moments of θ̂n − θ are the magnitude bias and mean square error (MSE), both of which can be estimated by way of bootstrap resampling using

    Bias_hat = ∫ x Ĝn(dx) = E*(θ̂n*) − θ(F̂)   and   MSE_hat = ∫ x² Ĝn(dx) = E*(θ̂n* − θ(F̂))²,    (21)

where E* is the conditional expectation given X1, . . . , Xn. The above are valid for any estimator, θ̂n; for example, θ̂n could be the mean, probability of detection, false alarm rate, etc.

The bootstrap algorithm can be summarized as follows (e.g., Lahiri, 2003, p. 3).

1. Estimate the distribution function, F, from X1, . . . , Xn, denoted here by F̂n.
2. Estimate, if desired, the unknown parameter, θ(F), by θ̂n = θ(F̂n).
3. Sample with replacement from X1, . . . , Xn to get X1*, . . . , Xm*.
4. Calculate θ̂m* from X1*, . . . , Xm*.
5. Repeat steps 3 and 4 above B times to obtain a sample, θ̂m*(1), . . . , θ̂m*(B), from G of the parameter of interest.

6. Estimate G by Ĝn, e.g., by using the empirical cdf similar to (20), but with B in place of n and θ̂m*(b) in place of Xi.

To clarify step 3, suppose X1, X2 and X3 constitute the sample. A simple random sample with replacement is taken from these values; one possible replicated sample is X1* = X3, X2* = X2, X3* = X1.

It may not be necessary to estimate G in step 6 if only functionals, ϕ(G) (e.g., the cumulative distribution of θ̂, quantiles, MSE), of the distribution are desired. Similarly, step 2 is only needed if it is desired to estimate parameters involving the "population" parameter and the sample parameter (e.g., θ̂ − θ). The estimate F̂n in steps 1 (and 2) is usually obtained by the ecdf (20), so that simple random samples are taken with replacement in step 3, and the estimate θ(F̂n) may be equivalent to the usual MLE estimate.

For example, suppose interest is in the first moment of F, which happens to be θ(F̂n) = (1/n) Σ Xi = X̄, and an estimate for its MSE is needed. Assuming the ecdf (20) for F̂n, the estimate is just the first moment of this distribution, and the estimate of MSE is given by (21) to be

    E*(θ̂m* − θ(F̂))² = (1/B) Σ_{b=1}^{B} (θ̂m*(b) − X̄)².

3.2 Bootstrap confidence intervals

Bootstrap confidence intervals are estimated using the sample {θ̂*(b)}. There are several methods for obtaining such intervals, and each has positive and negative qualities. Some of the more commonly applied methods are the percentile, BCa, ABC, and bootstrap-t (DiCiccio and Efron, 1996) methods. A summary of the pros and cons of each follows, but see DiCiccio and Efron (1996) and/or Efron and Tibshirani (1998) for a more thorough description of these various confidence intervals.

The percentile method is perhaps the most widely used method, perhaps because it has a very simple, intuitive form. However, it is only first-order accurate, so that it can have poor coverage. It is also range-preserving (that is, the interval will not include values outside the range of possible values) and transformation-respecting. For example, the range for probability of detection (POD) is 0 to 1, so it would not be very useful to have a confidence interval whose lower and/or upper endpoints were outside this range (e.g., a lower endpoint of -0.2); a range-preserving method avoids this. An interval is transformation-respecting if, for any monotonically increasing function m(·), the interval for m(θ) can be obtained by applying m to the lower and upper endpoints of the interval for θ; for example, if (θ̂ℓ, θ̂u) is a confidence interval for the variance, then (√θ̂ℓ, √θ̂u) is the correct interval for the standard deviation.

The BCa is also a percentile interval, but it adjusts the usual percentile interval to correct for bias and account for non-constant variance. It is an automatic algorithm for producing highly accurate (second-order accurate) confidence limits from a bootstrap distribution, and it uses only the original bootstrap statistics, θ̂m*(b), in its calculation. The BCa interval is also transformation-respecting and range-preserving; in fact, if the parameter estimate is unbiased with constant variance, then BCa gives the same interval as the percentile method. The BCa is more numerically stable than the bootstrap-t, and generally this is the method recommended in practice, but it is also computationally inefficient so that it can be prohibitively slow for large data sets.

The ABC method is an approximation for the BCa interval that is also second-order accurate, but does not share the computational burden of the BCa approach. However, ABC can be too close to the standard intervals, and it only works for statistics that are defined smoothly in Xn.

The bootstrap-t method is considerably simpler than either the BCa or ABC methods, and is second-order accurate. However, results for this method can be erratic as well as sensitive to a few outliers, and it is not transformation-respecting. Further, if there is no formula available to estimate the standard error of the parameter of interest, then the standard error must be estimated at each iteration of the bootstrap using another bootstrap routine, thereby greatly increasing the computational cost of the procedure.

Calibration is a bootstrap technique for improving the coverage accuracy of any system (e.g., ABC) of approximate confidence intervals. Generally, it improves accuracy by an order; for ABC, it improves from second-order accuracy to third-order accuracy. Of course, it also increases the computational burden extensively.

It is also possible to employ the normal approximation interval given by Eq. (1) using bootstrap estimates of se(θ). This will not be discussed in the sections below as it (i) requires the assumption of normality, (ii) is not as accurate as the other methods in general, (iii) can only give symmetric intervals, (iv) is not range-preserving, and (v) is not transformation-respecting. Of course, this method cannot be used for the median, quantiles, and the like.

Table 6 summarizes some important characteristics of the bootstrap confidence interval estimation methods discussed here. Each of these methods is described below, and examples for each type are held until section 3.3.

3.2.1 Percentile method

The 100·(1−α)% confidence interval given by the percentile method is

    ( Ĝ⁻¹(α), Ĝ⁻¹(1−α) ) = ( θ̂*(α), θ̂*(1−α) ).    (22)

That is, θ̂*(α) is the αth percentile of the sample of θ̂* values obtained from the algorithm described in section 3.1. The basic bootstrap confidence interval is described next.
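In code, the percentile interval (22) is nothing more than a pair of empirical quantiles of the replicated statistics. A minimal sketch (Python for illustration; the document's own examples obtain the same interval in R via boot.ci with type="perc"):

```python
import numpy as np

def percentile_interval(theta_star, alpha):
    """Endpoints of Eq. (22): the alpha and (1 - alpha) empirical
    quantiles of the bootstrap replicates theta_star."""
    theta_star = np.asarray(theta_star)
    return (np.quantile(theta_star, alpha),
            np.quantile(theta_star, 1.0 - alpha))

# Illustration on synthetic replicates centered near 2.0.
rng = np.random.default_rng(0)
theta_star = rng.normal(loc=2.0, scale=0.1, size=2000)
lo, hi = percentile_interval(theta_star, alpha=0.025)
```

Because the endpoints are order statistics of the replicates themselves, the interval automatically stays inside the statistic's possible range, which is why the method is range-preserving.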

Table 6: A comparison summary of various methods for estimating confidence intervals through bootstrap resampling. It should be noted that these characterizations are based on particular statistics and a known exact distribution; second-order accuracy is better than first-order.

    Method          Accuracy    Range-      Transformation-  Other
                                preserving  respecting
    Percentile      1st order   Yes         Yes              May have poor coverage depending on the
                                                             distribution of the parameter of interest.
    BCa             2nd order   Yes         Yes              Computationally inefficient.
    ABC             2nd order   Yes         Yes              Only applicable for smooth statistics.
    Bootstrap-t     2nd order   No          No               Erratic for small n.  Sensitive to outliers.
    Basic           1st order   No          No               Does not adjust for bias, nonconstant
    Bootstrap                                                variance, or skewness.
    Normal          1st order   No          No               Intervals are symmetric.  Assumes normality.
    Approximation

3.2.2 Basic Bootstrap Confidence Intervals

The basic bootstrap interval is given by

    ( 2θ̂ − θ̂*((B+1)(1−α)), 2θ̂ − θ̂*((B+1)α) ).

This is the original interval suggested by Efron. It is generally only first-order accurate, although it can be second-order accurate in some instances, and it does not adjust for bias, nonconstant variance, or skewness. Note that, contrary to what one might think given the name and procedure, the distribution is estimated directly from the data, and is appropriate only for the data being considered.

3.2.3 Bootstrap-t Confidence Intervals

The bootstrap-t method allows us to find confidence intervals without assuming normality or any other distribution, at the cost of increased computations. From the algorithm given in section 3.1, in addition to estimating the statistic of interest, also obtain the standard error for each bootstrap replicate Xm*(b), b = 1, . . . , B. Using this and the estimated parameter, calculate the following t-statistic for each replicate,

    T* = ( θ̂*(b) − θ̂ ) / se*(b)(θ),    (23)

where θ̂ is the estimate of θ from the original sample, and se*(b)(θ) is the estimate of the standard error of θ from the bth bootstrap replicate. For example, if θ is the mean, then θ̂*(b) = (1/n) Σ x_i*(b) = x̄*(b) and

    se*(b)(θ) = sqrt( (1/(n−1)) Σ_{i=1}^{n} ( x_i*(b) − x̄*(b) )² ) / √n.

This yields a sample of B values of the statistic, T*. These values are ordered, and the α and 1−α percentiles are taken to get T̂(α) and T̂(1−α), respectively. The 100·(1−α)% confidence interval for θ̂ is then given by

    ( θ̂ − se_hat · T̂(1−α), θ̂ − se_hat · T̂(α) ),    (24)

where se_hat is the standard error of θ calculated from the original data. If a formula for the standard error estimate is not available, then bootstrap resampling can be done on each of the replicated samples to get an estimate of this parameter.
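The studentizing step in (23)-(24) can be sketched as follows for the mean, whose standard error has a closed form (Python for illustration; function names are ours). Note the reversal of the T quantiles in the final interval, exactly as in (24).

```python
import numpy as np

def bootstrap_t_interval(x, B=2000, alpha=0.025, rng=None):
    """Bootstrap-t interval for the mean: studentize each replicate by
    its own standard error (Eq. 23), then form the interval of Eq. (24)."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    n = len(x)
    theta_hat = x.mean()
    se_hat = x.std(ddof=1) / np.sqrt(n)   # SE from the original sample
    t_star = np.empty(B)
    for b in range(B):
        xb = rng.choice(x, size=n, replace=True)
        se_b = xb.std(ddof=1) / np.sqrt(n)
        t_star[b] = (xb.mean() - theta_hat) / se_b
    t_lo, t_hi = np.quantile(t_star, [alpha, 1.0 - alpha])
    # The upper T quantile gives the lower endpoint, and vice versa.
    return (theta_hat - se_hat * t_hi, theta_hat - se_hat * t_lo)

rng = np.random.default_rng(42)
x = rng.normal(size=30)
lo, hi = bootstrap_t_interval(x, rng=1)
```

For a statistic with no standard-error formula, the line computing se_b would itself require an inner bootstrap, which is exactly the extra computational cost described above.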

3.2.4 The BCa interval

In order to obtain better coverage than the percentile method, an adjusted percentile method that corrects for bias and non-constant variance is often used, known as the bias-corrected and accelerated, or BCa, interval. The interval is simply

    ( θ̂*(α1), θ̂*(α2) ),    (25)

where

    α1 = Φ( ẑ0 + (ẑ0 + z_α) / (1 − â(ẑ0 + z_α)) ),
    α2 = Φ( ẑ0 + (ẑ0 + z_{1−α}) / (1 − â(ẑ0 + z_{1−α})) ),

with Φ(·) the standard normal distribution, and z_α as usual (i.e., as defined in section 2). Note that there are two quantities that need to be estimated here, â and ẑ0. Note further that if both of these values are zero, then α1 = Φ(z_α) = α and α2 = Φ(z_{1−α}) = 1 − α, and this gives the same interval as the percentile interval of section 3.2.1.

The simplest estimate for ẑ0 is given by

    ẑ0 = Φ⁻¹( (1/B) Σ_{b=1}^{B} I{θ̂*(b) < θ̂} ),

where I{x∈A} is the indicator function, so that I{x∈A} = 1 if x ∈ A and 0 otherwise. Simply put, ẑ0 is the quantile of the standard normal distribution that corresponds with the proportion of bootstrap replications of θ̂* that are less than the original estimate, θ̂.

The parameter a, estimated by â, is called the acceleration. It measures how quickly the standard error is changing on the normalized scale. A simple estimate for this parameter is given by

    â = Σ_{i=1}^{n} ( θ̂(·) − θ̂(i) )³ / ( 6 { Σ_{i=1}^{n} ( θ̂(·) − θ̂(i) )² }^{3/2} ),

where θ̂(i) is the usual estimate of the parameter θ, but with the ith data point removed, and θ̂(·) = (1/n) Σ_{i=1}^{n} θ̂(i).

3.2.5 The ABC Method

The primary disadvantage of the BCa interval is the computational burden. The ABC method, which stands for approximate bootstrap confidence intervals, approximates the BCa interval endpoints analytically without using Monte Carlo
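The two BCa corrections are cheap to compute from the replicates themselves: ẑ0 from the proportion of replicates below θ̂, and â from jackknife values of the statistic. A sketch (Python for illustration; function names are ours — the R examples of section 3.3 obtain this interval via boot.ci with type="bca"):

```python
import numpy as np
from statistics import NormalDist

def bca_interval(x, statistic, B=2000, alpha=0.025, rng=None):
    """BCa interval (Eq. 25): percentile endpoints shifted by the bias
    correction z0-hat and the acceleration a-hat (jackknife estimate)."""
    nd = NormalDist()
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    n = len(x)
    theta_hat = statistic(x)
    reps = np.array([statistic(rng.choice(x, size=n, replace=True))
                     for _ in range(B)])
    # z0-hat: normal quantile of the proportion of replicates below theta-hat.
    z0 = nd.inv_cdf((reps < theta_hat).mean())
    # a-hat from the jackknife values theta_(i) (ith point removed).
    jack = np.array([statistic(np.delete(x, i)) for i in range(n)])
    d = jack.mean() - jack
    a = (d ** 3).sum() / (6.0 * (d ** 2).sum() ** 1.5)
    z = nd.inv_cdf(alpha)                 # z_alpha; note z_(1-alpha) = -z
    a1 = nd.cdf(z0 + (z0 + z) / (1.0 - a * (z0 + z)))
    a2 = nd.cdf(z0 + (z0 - z) / (1.0 - a * (z0 - z)))
    return (np.quantile(reps, a1), np.quantile(reps, a2))

rng = np.random.default_rng(7)
x = rng.gamma(2.0, size=40)   # a mildly skewed sample
lo, hi = bca_interval(x, np.mean, B=1500, rng=3)
```

With ẑ0 = â = 0 the adjusted levels a1 and a2 reduce to α and 1−α, recovering the plain percentile interval, as noted above.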

replications (Monte Carlo here refers to the bootstrap's manner of simulating data). The ABC method applies only to multiparameter exponential families and nonparametric problems using the multinomial distribution.

Suppose that the statistic of interest has the form θ̂ = T(P). We observe x = (x1, . . . , xn), and let P* denote the sampling vector. Assume that nP* follows a multinomial distribution with success probabilities P0 = (1/n, . . . , 1/n). The interval sought is given by

    ( θ̂(α), θ̂(1−α) ),    (26)

where the two endpoints are estimated as follows. The ith empirical influence component, Ṫi, is given by

    Ṫi = lim_{ε→0} [ T((1−ε)P0 + ε e_i) − T(P0) ] / ε,

with e_i the ith coordinate vector (0, . . . , 0, 1, 0, . . . , 0)ᵀ. The standard error is approximated by

    σ̂ = ( Σ_{i=1}^{n} Ṫi² / n² )^{1/2}.

The above estimates require the further estimation of the quantities â and ẑ0, as with the BCa method. Different estimates are used, however, for the ABC approach. They are given by

    â = (1/6) Σ_{i=1}^{n} Ṫi³ / ( Σ_{i=1}^{n} Ṫi² )^{3/2}

and

    ẑ0 = Φ⁻¹{ 2 · Φ(â) · Φ(−γ̂) } ≈ â − γ̂.

Next, the following quantities are calculated (given here for the (1−α) limit of expression (26)):

    ω ≡ ẑ0 + z_{1−α},
    λ ≡ ω / (1 − âω)²,
    δ̂ ≡ Ṫ(P0),    (27)
    θ̂(1−α) = T( P0 + λδ̂/σ̂ ),

where γ̂ = b̂/σ̂ − ĉq. The two quantities b̂ and ĉq must now also be estimated. The first, b̂, is given by

    b̂ = Σ_{i=1}^{n} T̈i / (2n²),

where T̈i is an element of the second-order influence function,

    T̈i = lim_{ε→0} [ T((1−ε)P0 + ε e_i) − 2T(P0) + T((1−ε)P0 − ε e_i) ] / ε².

Finally,

    ĉq = lim_{ε→0} [ T((1−ε)P0 + εṪ/(n²σ̂)) − 2T(P0) + T((1−ε)P0 − εṪ/(n²σ̂)) ] / ε².

Now, there is a further approximation to give a computationally more convenient form of the ABC endpoint. Again for the (1−α) limit of expression (26), this approximation is given by

    ω ≡ ẑ0 + z_{1−α},
    λ ≡ ω / (1 − âω)²,
    ξ ≡ λ + ĉq λ²,    (28)
    θ̂(1−α) = θ̂ + σ̂ξ.

This is the quadratic ABC confidence limit for θ. The estimates used to calculate the limit given by (28) are the same as used to calculate the limit in (27). For more about the ABC intervals (e.g., computational advantages and theoretical reasoning), see DiCiccio and Efron (1996) or Efron and Tibshirani (1998).

3.2.6 Calibration

Calibration is a bootstrap technique for improving the coverage accuracy of any system of approximate confidence intervals, such as the ABC method described in section 3.2.5 above. That is, if the approximation were perfectly correct, then, for example,

    β(α) = Pr{ θ < θ̂(α) } = α.    (29)

The idea is that if the calibration curve β(α) given in (29) were known, then it could be used to improve the approximate confidence limits. Because it is not known, bootstrapping is used to estimate it, giving

    β̂(α) = Pr*{ θ̂ < θ̂*(α) }.

To calibrate a system of approximate confidence intervals, the algorithm is given by:

1. Generate B bootstrap samples x*1, . . . , x*B, and for each one, compute the desired confidence limit (e.g., θ̂(α)*).
2. For each sample b = 1, . . . , B, calculate the value, α̂*, that makes θ̂ = θ̂(α̂*).
3. For each α, estimate β̂(α) by

       β̂(α) = (1/B) Σ_{b=1}^{B} I{ α̂*(b) < α }.    (30)

4. Find the value of α satisfying β̂(α) = α.

If the cumulative distribution function of α̂* is nearly uniform, then β̂(α) = α, and this gives an indication of accurate coverage for the system of intervals. If not, β̂(α) is used to improve the endpoints by finding the value of α such that β̂(α) = α.

3.3 Examples of Bootstrapping with R

Here, some examples of the IID bootstrap are explored, and heuristic comparisons of the various confidence interval methods are made. As explained in previous sections (e.g., sections 2.3 and 2.7), it is not always possible to obtain parametric confidence intervals. The problem might be estimating a valid standard error for the statistic of interest, or it may reside in the distributional assumptions. Therefore, the examples here will make use of such statistics, as well as some that can have confidence intervals computed parametrically, for comparison (e.g., the normal approximation interval (1)).

It is important to remember that although no specific distribution assumption is made for the IID bootstrap, there is still the assumption that the sample is independent and identically distributed. In short, knowledge of one point in the sample does not impact knowledge of the rest of the sample, and each point of the sample follows the same, possibly unknown, distribution as every other point in the sample. Further, there is the assumption that the sample is representative of the population (e.g., too small a sample may not fit this bill).
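The calibration algorithm is a nested ("double") bootstrap: each outer replicate needs its own inner bootstrap to find α̂*. A minimal sketch for the mean (Python for illustration; names are ours, and the replication counts are kept small because the cost grows as B × B_inner):

```python
import numpy as np

def calibrate_alpha(x, statistic, alpha=0.05, B=100, B_inner=100, rng=None):
    """Steps 1-4 above: for each outer resample, alpha-hat* is the inner
    bootstrap CDF evaluated at the original estimate; the calibrated level
    solving beta-hat(alpha') = alpha is then the alpha quantile of the
    alpha-hat* sample (cf. Eq. 30)."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    n = len(x)
    theta_hat = statistic(x)
    a_star = np.empty(B)
    for b in range(B):
        xb = rng.choice(x, size=n, replace=True)
        inner = np.array([statistic(rng.choice(xb, size=n, replace=True))
                          for _ in range(B_inner)])
        a_star[b] = (inner < theta_hat).mean()   # alpha-hat* for sample b
    return np.quantile(a_star, alpha)            # calibrated alpha level

rng = np.random.default_rng(2)
x = rng.normal(size=30)
alpha_cal = calibrate_alpha(x, np.mean, alpha=0.05, rng=4)
```

Checking whether the α̂* sample is nearly uniform, as in step 4, is also a quick diagnostic of whether the uncalibrated intervals already have accurate coverage.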

Example 3.1. As in example 2.1 of section 2.1, we have two samples, f1, . . . , fn and o1, . . . , on. Here, the function verify is used to calculate various verification statistics for the simulations from table 1, including ME, MAE and MSE, among other scores comparing forecast performance against a baseline low-skill forecast (e.g., persistence). The resulting R object (assigned the name fit) is a list object whose components include these estimates; the command names( fit) will show the components included in the list object fit. Individual components of a list object, for example MSE, can be accessed by fit[["MSE"]] or the short-hand fit$MSE.

From the summary of fit, the forecast MSE (4.61) is smaller than all of the reference MSE estimates (e.g., the baseline MSE is ≈ 6.46). This is not surprising given that the columns of table 1 were simulated to be highly correlated with each other, and differences in their population means and standard deviations are small. However, it is of interest to have some notion of whether the performance is significantly better than the reference forecasts or not; without such knowledge, the comparison is difficult to interpret.

Here, we make use of the R package boot, which is one of the required packages for verification (see appendix A.1). Bootstrap confidence intervals were already employed in section 2 for the hit and false alarm rates in the ROC plots. Note that the MSE estimated here is not strictly the same as that in Eq. (21), because the comparison is with another sample as opposed to a functional of the statistical distribution of the sample; the value θ(F̂) in Eq. (21) is a functional of the distribution of x1, . . . , xn, whereas here the MSE is E*(θ̂f* − θ̂o*)².

First, a function is needed that will compute the various statistics of interest, and that will work with the function boot. This is shown next, and assigned the name booter.

booter <- function( d, i) {
    # Function to compute verification statistics
    # for continuous data.  To be used in
    # conjunction with 'boot'.

    #
    # Arguments:
    #
    # 'd' a data frame with components named
    #     "Forecast" and "Observed."
    # 'i' indexes used with 'boot' indicating
    #     which sample points to include in
    #     a given replicate sample.
    #
    # Value returned:
    #
    # A numeric vector of the statistics: MAE,
    # ME, MSE, MSE (baseline), MSE (persistence),
    # and SS (baseline), as output by verify.
    # Additionally, the variance of ME is returned
    # along with the differences between the MSE
    # and the two reference MSE's.  See the help
    # file for verify for more on these values.
    #
    A <- verify( d[i,"Observed"], d[i,"Forecast"],
                frcst.type="cont", obs.type="cont")
    hold <- apply( d[i,], 1, diff)
    hold <- stats( hold)["Std.Dev."]
    seME <- hold/sqrt(30)
    return( c( A$MAE, A$ME, A$MSE, A$MSE.baseline,
              A$MSE.pers, A$SS.baseline, seME^2,
              A$MSE - A$MSE.baseline, A$MSE - A$MSE.pers))
} # end of 'booter' function.

The reason for returning the variance of ME is, in part, so that the bootstrap-t confidence intervals can be calculated for this statistic without having to further subsample at each iteration of the bootstrap. Now that this function is written, it is a simple matter to use the function boot to perform the bootstrapping, and then the function boot.ci to calculate

the various confidence intervals. Here, we will run 1000 replications in order to get fairly large sample distributions of the statistics (see section 3.5 for choosing B).

booted <- boot( Z, booter, 1000)

The output from boot is a list object, and has been assigned the name booted. The original sample estimates are included in the t0 component of this object; type booted$t0 to see them, and compare them with summary( fit) to ensure that the values are the same. The replicated samples of statistics are held as a 1000×9 matrix in booted$t. To check the dimensions of this matrix, use dim( booted$t).

Next, calculate the confidence intervals for the various statistics. This is done one-at-a-time for each of the output components, and the index argument is used to indicate for which of the output statistics to compute the intervals. Here, the 95-, 99- and 99.9% intervals are calculated. We will just compute the percentile and BCa intervals, except for ME, for which the variance is computed in booted. This enables calculations of the bootstrap-t, normal approximation, and basic bootstrap intervals. Some warning messages may appear for some of these (e.g., extreme endpoints used).

MAE.boot.ci <- boot.ci( booted, conf=c(0.95, 0.99, 0.999),
            type=c("perc", "bca"), index=1)
ME.boot.ci <- boot.ci( booted, conf=c(0.95, 0.99, 0.999),
            type="all", index=c(2,7))
MSE.boot.ci <- boot.ci( booted, conf=c(0.95, 0.99, 0.999),
            type=c("perc", "bca"), index=3)

MSEbase.boot.ci <- boot.ci( booted, conf=c(0.95, 0.99, 0.999),
            type=c("perc", "bca"), index=4)
MSEpers.boot.ci <- boot.ci( booted, conf=c(0.95, 0.99, 0.999),
            type=c("perc", "bca"), index=5)
SSbase.boot.ci <- boot.ci( booted, conf=c(0.95, 0.99, 0.999),
            type=c("perc", "bca"), index=6)
varME.boot.ci <- boot.ci( booted, conf=c(0.95, 0.99, 0.999),
            type=c("perc", "bca"), index=7)
MSEbasediff.boot.ci <- boot.ci( booted, conf=c(0.95, 0.99, 0.999),
            type=c("perc", "bca"), index=8)
MSEpersdiff.boot.ci <- boot.ci( booted, conf=c(0.95, 0.99, 0.999),
            type=c("perc", "bca"), index=9)

Each interval is computed on a separate line in the above example for clarity, but note that much terser code can be written by employing a for loop, being careful about the case for ME, where all of the types of confidence intervals are computed instead of only two of them. Table 7 shows a summary of the 95% intervals derived here.

Here, we have that the confidence interval for SS (based on the baseline reference forecast) straddles zero, suggesting that the forecast is not significantly better than the baseline reference forecast. This is justified by the fact that the confidence interval for the difference in MSE with this reference forecast also contains zero. However, the confidence interval for the difference in MSE between the forecast and the persistence reference forecast does not include zero (having only negative limits), indicating that the forecast is doing better than persistence in this case.

Example 3.2. In example 3.1, the percentile and BCa intervals were calculated for output statistics from the verify function for the simulations in table 1. Recall that for ME, we chose to compute bootstrap confidence intervals using all of the available methods in boot. In order for some of the methods to work (i.e., normal approximation, basic and bootstrap-t), there needs to be an estimated variance, and this is given by passing a vector of length 2 into the index argument, where the second value gives the position in the output of booter of the bootstrap-estimated variance of ME. Here, each of the types of bootstrap confidence intervals is compared for the forecast and observation means for these same simulations.

First, a new function is required for the statistic argument in boot. This is given below, and assigned the name booter2. Note that the estimate for the mean is output in the first position of the returned vector, and its variance in the second position.

booter2 <- function( d, f) {
    # Function to compute the mean and its
    # associated variance for use with the
    # 'boot' function.
    #
    # Arguments:
    #
    # 'd' is a numeric vector of the sample on
    #     which to perform the bootstrap.
    # 'f' is used internally by the 'boot'
    #     function.  It gives the frequencies of
    #     each point sampled.

    # For example, suppose 'd' contains three
    # elements, x1, x2 and x3, and a given
    # bootstrap replicated sample is x1, x3, x3.
    # Then 'f' would be the numeric vector with
    # values 1, 0 and 2, respectively.
    #
    # Value returned:
    #
    # A numeric vector of length 2 containing the
    # mean and its variance.
    #
    m <- sum( d*f)/sum(f)
    v <- (sum( d^2*f)/(sum(f)-1) -
          m^2*sum(f)/(sum(f)-1))/sum(f)
    return( c( m, v))
} # end of 'booter2' function.

Now that we have a function to use for the statistic of interest, we run the following commands to run the bootstrap algorithm, and obtain the various types of confidence intervals.

# First, for the forecast column.
fbar.boot <- boot( Z[,"Forecast"], booter2, 1000, stype="f")
boot.ci( fbar.boot, conf=c(0.95, 0.99, 0.999), type="all")

# Now for the observed column.
obar.boot <- boot( Z[,"Observed"], booter2, 1000, stype="f")
boot.ci( obar.boot, conf=c(0.95, 0.99, 0.999), type="all")

The output from running the above example is not duplicated here, as any given run will give slightly different results. Warning messages about the extreme end-points being used for some of the methods (and some of the confidence

levels) may appear. This is probably a result of performing the bootstrap on a sample of only 30 values.

Example 3.3. Here, we use the example from the help file for the verification function roc.area to compute BCa approximate 95% bootstrap confidence intervals for the area under the ROC curve (AUC) for the Mason and Graham (2002) data. Again, a function to compute this statistic that can be used in conjunction with boot is needed.

AUC.booter <- function( d, i) {
    # Function to compute the area under the ROC
    # curve (AUC) for use in conjunction with the
    # 'boot' function.
    #
    # Arguments:
    #
    # 'd' an R data frame containing the
    #     components:
    #     'event' binary indicator of
    #         event/no-event, and
    #     'p1' a probability prediction on the
    #         interval [0,1].
    # 'i' indexing argument for bootstrap
    #     algorithm (see help file for 'boot').
    #
    # Value: single numeric giving the AUC value.
    #
    return( roc.area( d$event[i], d$p1[i])$A)

Table 7: 95% bootstrap confidence intervals (percentile and BCa) for the verification statistics output by the verify function for the simulations of table 1. Note that results will vary because they are obtained through simulations. (Rows: MAE; ME; MSE; MSE (baseline); MSE (persistence); Skill Score (baseline); MSE − MSE (baseline); MSE − MSE (persistence). Columns: point estimate; percentile interval; BCa interval. The individual numeric entries did not survive extraction legibly.)

} # end of 'AUC.booter' function.

Next, construct the data frame of values for the Mason and Graham (2002) data as per the help file for roc.area. This is given below, where the entire data frame is constructed, but note that only the event and p1 components are used.

a <- c( 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988,
       1989, 1990, 1991, 1992, 1993, 1994, 1995)
b <- c( ... )   # binary event indicators, as in the roc.area help file
c <- c( ... )   # probability predictions (p1), as in the roc.area help file
d <- c( ... )   # probability predictions (p2), as in the roc.area help file
A <- data.frame(a, b, c, d)
names(A) <- c("year", "event", "p1", "p2")

Finally, run the bootstrap algorithm and construct approximate 95% confidence intervals for the AUC of these data using the BCa method.

AUC.booted <- boot( A, AUC.booter, 1000)
boot.ci( AUC.booted, conf=0.95, type="bca")

3.4 Accounting for Dependence

Numerous methods have been proposed to account for dependence using the bootstrap. Perhaps the best solution is to model the dependence directly using a parametric bootstrap approach. In this approach, instead of estimating the distribution function, F, from the unknown distribution, a distribution is assumed, but with unknown parameters. That is, F is known apart from its parameter(s) η, and interest is now in estimating θ(Fη). Of course, one must now assume a model for the underlying sample, whereas the iid bootstrap, and other methods for taking into account de-

pendence, do not require such assumptions. There are any number of possible models that can be used to model dependence (see section 2.11 for some possibilities for Fη). If a reasonable model can be found, then this is the most straightforward approach. Once such a model is found, the procedure from section 3.1 is modified as follows.

1. Estimate the parameters, η, of the distribution, Fη, denoted Fη̂.
2. Sample directly from Fη̂ to obtain a bootstrap sample X1*, . . . , Xm*.
3. Calculate θ̂m* from the sample in step 2 above.
4. Repeat steps 2 and 3 above B times to obtain a sample from G of the parameter(s) of interest, θ̂m*(1), . . . , θ̂m*(B).
5. Estimate G as in step 6 from the algorithm in section 3.1.

The key difference here is that the resamples are drawn at random from a distribution function instead of being sampled with replacement from the original sample. For more on the parametric bootstrap, see Efron and Tibshirani (1998), so this subject is not taken up here.

Other non-parametric methods to account for dependence in the bootstrap method include subsampling (e.g., Politis and Romano, 1994) and various block bootstrap methods. Included among the various types of block bootstrap approaches are: NBB-nonoverlapping block (e.g., Carlstein, 1986), MBB-moving block (e.g., Künsch, 1989; Liu and Singh, 1992), CBB-circular block (Politis and Romano, 1992), SB-stationary block (Politis and Romano, 1994), GBB-generalized block (Lahiri, 2003), and TBB-transformation-based (e.g., Hurvich and Zeger, 1987). All of these methods are nicely described in Lahiri (2003) and Lahiri (1999), along with comparisons and issues related to the approaches. Because the block bootstrap approaches require the fewest assumptions, and are perhaps the most similar to the iid bootstrap, some further attention will be given to these methods here — particularly the NBB, MBB and CBB methods: the NBB mostly because of its ease of description, and the MBB and CBB because Lahiri (1999) found that the overlapping block approaches are to be preferred over the NBB, and that fixed length blocks typically lead to smaller mean-squared errors than random block lengths. Further, the CBB is an extension of the MBB, and there are some advantages of the CBB over the MBB (Lahiri, 2003).

The NBB is the simplest block bootstrap method to describe, and the other methods are similar, particularly to the NBB; therefore, this method is described first. Suppose ℓ is an integer such that both ℓ and n/ℓ are large (e.g., let ℓ = ⌊√n⌋, where ⌊x⌋ denotes the greatest integer less than or equal to x).

Suppose ℓ is an integer such that both ℓ and n/ℓ are large (e.g., let ℓ = ⌊√n⌋, where ⌊x⌋ denotes the greatest integer less than or equal to x); here we take ℓ ≡ ℓn ∈ [1, n] to be an integer, where typically ℓ → ∞ and n−1ℓ → 0 as n → ∞. For simplicity, suppose b = n/ℓ is an integer. Partition the sample Xn into b blocks of length ℓ so that we have

Y1 = {X1, ..., Xℓ}, Y2 = {Xℓ+1, ..., X2ℓ}, ..., Yb = {X(b−1)ℓ+1, ..., Xn},

where Yk = {X(k−1)ℓ+1, ..., Xkℓ}, k = 1, ..., b. Now, the vectors Yi, i = 1, ..., b, are considered to be iid random vectors, and the bootstrap procedure described in sections 3.1 and 3.2 is performed on these vectors. As before, sampling using m < n can be done analogously as in the iid bootstrap by selecting k < b blocks, where m ≡ kℓ.

The MBB method is similar to the NBB, but the blocks overlap. Let Bi = (Xi, ..., Xi+ℓ−1) be the ith block of length ℓ starting with Xi, with 1 ≤ i ≤ N = n − ℓ + 1. One simply performs the usual iid bootstrap resampling on the blocks {Bi}, i = 1, ..., N. One problem with the MBB is that values toward the middle of the sample have a greater chance of being included in the resamples than values at the ends, because values toward the middle are included in more blocks (e.g., if n = 10 and ℓ = 3, then X3, ..., X8 are included in 3 blocks each, whereas X2 and X9 are only in two, and X1 and X10 each appear only once). The CBB method is merely an extension of the MBB whereby the (overlapping) blocks are periodically extended to avoid this problem.
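The block constructions above can be illustrated directly for the n = 10, ℓ = 3 case mentioned in the text; the vector 1:10 simply stands in for X1, ..., X10.

```r
# Illustration of the MBB and CBB blocks for n = 10 and l = 3.
x <- 1:10
l <- 3
N <- length( x) - l + 1
# MBB: the N = 8 overlapping blocks B_i = (X_i, ..., X_{i+l-1}),
# stored as the columns of a 3 x 8 matrix.
mbb.blocks <- sapply( 1:N, function( i) x[i:(i + l - 1)])
sum( mbb.blocks == 1)   # X_1 appears in only 1 block
sum( mbb.blocks == 3)   # X_3 appears in 3 blocks
# CBB: extend the series periodically so that every value appears
# in exactly l blocks.
xc <- c( x, x[1:(l - 1)])
cbb.blocks <- sapply( 1:length( x), function( i) xc[i:(i + l - 1)])
# An iid bootstrap resample of blocks then reconstructs a series.
set.seed( 1)
x.star <- as.vector( cbb.blocks[, sample( ncol( cbb.blocks),
    size=4, replace=TRUE)])
```

Counting occurrences in mbb.blocks confirms the imbalance described in the text, while every value appears exactly ℓ times among the CBB blocks.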
The bias and variance of a block bootstrap estimator are tied to the block length (Hall et al., 1995); therefore, the block length ℓ can bias the results, and should be chosen carefully. There are ways to determine optimal lengths (see, e.g., Lahiri, 2003), but they are very complicated and can be highly computationally burdensome. In the face of spatial dependence, block determination is more complicated still, and is not described here; the reader is referred to Lahiri (2003, chapter 7).

Example 3.4. Recall that the simulations in table 4 are temporally correlated following an AR(1) process. Here, the CBB approach is used to obtain confidence intervals for verification scores for the simulations in both tables 1 and 4. First, we write a function to compute the verification statistics in conjunction with the tsboot function.

booter.cbb <- function( data) {
# Function to compute verification statistics
# for continuous data. To be used in

# conjunction with 'tsboot' in order to
# account for dependence in the data.
#
# Arguments:
#
# 'data' is an R data frame object with two
# columns of continuous data named
# "Forecast" and "Observed."
#
# Value returned:
#
# A numeric vector of the statistics: MAE,
# ME, MSE, MSE (persistence), MSE (baseline),
# SS (baseline), the variance of ME, and the
# differences MSE - MSE (persistence) and
# MSE - MSE (baseline), resp., as output by
# verify.
#
# See the help file for 'verify' for more on
# these values.
A <- verify( data[,"Observed"], data[,"Forecast"],
	frcst.type="cont", obs.type="cont")
hold <- apply( data, 1, diff)
hold <- stats( hold)["Std.Dev.",]
seME <- hold/sqrt(30)
return( c( A$MAE, A$ME, A$MSE, A$MSE.pers,
	A$MSE.baseline, A$SS.baseline, seME^2,
	A$MSE - A$MSE.pers, A$MSE - A$MSE.baseline))
} # end of 'booter.cbb' function.

Because account is taken for dependence, it is assumed that the percentile intervals will not be biased, so for these intervals this is all that is calculated; the variance of ME is also returned so that the other interval types can be computed for ME. Now that we have a function to output the statistics that will work in conjunction with boot, the next step is to run the bootstrap algorithm and compute the confidence limits.

For the case of ME, we will attempt to compute all of the possible types of intervals, but note that the BCa method is not defined for this type of bootstrapping.

Zbooted.cbb <- tsboot( Z, booter.cbb, R=1000,
	l=floor( sqrt( dim( Z)[1])), sim="fixed")
cbb1MAE.ci <- boot.ci( Zbooted.cbb, conf=c(0.95, 0.99),
	type="perc", index=1)
cbb1ME.ci <- boot.ci( Zbooted.cbb, conf=c(0.95, 0.99),
	type="all", index=c(2,7))
cbb1MSE.ci <- boot.ci( Zbooted.cbb, conf=c(0.95, 0.99),
	type="perc", index=3)
cbb1MSEpers.ci <- boot.ci( Zbooted.cbb, conf=c(0.95, 0.99),
	type="perc", index=4)
cbb1MSEbase.ci <- boot.ci( Zbooted.cbb, conf=c(0.95, 0.99),
	type="perc", index=5)
cbb1SSbase.ci <- boot.ci( Zbooted.cbb, conf=c(0.95, 0.99),
	type="perc", index=6)
cbb1MSEpersdiff.ci <- boot.ci( Zbooted.cbb, conf=c(0.95, 0.99),
	type="perc", index=8)
cbb1MSEbasediff.ci <- boot.ci( Zbooted.cbb, conf=c(0.95, 0.99),
	type="perc", index=9)

Z2booted.cbb <- tsboot( Z2, booter.cbb, R=1000,
	l=floor( sqrt( dim( Z2)[1])), sim="fixed")
cbb2MAE.ci <- boot.ci( Z2booted.cbb, conf=c(0.95, 0.99),
	type="perc", index=1)
cbb2ME.ci <- boot.ci( Z2booted.cbb, conf=c(0.95, 0.99),
	type="all", index=c(2,7))
cbb2MSE.ci <- boot.ci( Z2booted.cbb, conf=c(0.95, 0.99),
	type="perc", index=3)
cbb2MSEpers.ci <- boot.ci( Z2booted.cbb, conf=c(0.95, 0.99),
	type="perc", index=4)
cbb2MSEbase.ci <- boot.ci( Z2booted.cbb, conf=c(0.95, 0.99),
	type="perc", index=5)
cbb2SSbase.ci <- boot.ci( Z2booted.cbb, conf=c(0.95, 0.99),
	type="perc", index=6)
cbb2MSEpersdiff.ci <- boot.ci( Z2booted.cbb, conf=c(0.95, 0.99),
	type="perc", index=8)
cbb2MSEbasediff.ci <- boot.ci( Z2booted.cbb, conf=c(0.95, 0.99),
	type="perc", index=9)

Warning messages are given for the ME cases above that the BCa intervals are not defined for time series bootstraps, and they are therefore not calculated.

Table 8 shows results for MSE and SS for the simulations of tables 1 and 4, respectively, for the 95% confidence intervals using the percentile method. Recall that the simulations from table 1 are not temporally dependent, so the results for them here should not be believed. Whereas in example 3.1 (table 7) the results suggest that there is no significant difference (at the 5% level) between the forecast and baseline reference forecast performance (in terms of SS), in the case of the simulations from table 4 the confidence interval for SS suggests that the baseline reference forecast is outperforming the forecast. Indeed, the limits for SS are very close to zero, but do not include zero. Nevertheless, it would be difficult to interpret this as being strong evidence that the forecast and baseline reference forecast are different here.
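The same CBB mechanics can be exercised in a self-contained way using only the boot package; the AR(1) series and the mean as the statistic of interest below are illustrative choices, not part of the examples above. In tsboot, sim="fixed" with the default endcorr=TRUE wraps blocks around the end of the series, i.e., the circular block bootstrap.

```r
library( boot)
set.seed( 1)
n <- 200
x <- as.vector( arima.sim( list( ar=0.6), n))  # an AR(1) series
# CBB: fixed-length blocks with end correction (circular wrapping).
bt <- tsboot( x, statistic=mean, R=500,
    l=floor( sqrt( n)), sim="fixed")
boot.ci( bt, conf=0.95, type="perc")
```

The percentile interval for the mean accounts for the serial dependence in the series through the block resampling, just as in the verification examples above.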

Table 8: 95% (percentile) confidence intervals using the circular block bootstrap (CBB) approach to accounting for dependence in the bootstrap for the simulations from tables 1 (top) and 4 (bottom).

3.5 Sample size and Replication sample size issues

Determining how large n should be is not straightforward with bootstrapping, because only the observed sample determines the distribution of the population. Instead, it is a judgement call on the part of the investigator that the sample given is representative of the population. Because the given sample is assumed to be representative of the population distribution, it is important to begin with a large enough sample for this assumption to be reasonable.

There is also no set answer for how large B, the sample size of replications, should be. It will depend greatly on the statistic(s) of interest and the power of the test, and may not be possible to determine for many parameters. A standard rule of thumb is as follows:

1. Choose a large but tolerable number of replications, and obtain the bootstrap estimates.

2. Change the random-number seed, and obtain the bootstrap estimates again using the same number of replications.

3. If the results change meaningfully, then the first number you chose was too small; try a larger number. If the results are similar enough, perform step 2 a few more times to be sure. If they remain similar, you probably have a large enough number.

4 Summary

This write-up is intended to consolidate various information about obtaining confidence intervals for statistics of interest in forecast verification. The focus is solely on frequentist confidence intervals, ignoring fiducial and profile-likelihood intervals. Bootstrap methods can be used to calculate confidence intervals for any verification statistic of interest, but care should be taken in how the bootstrap resampling is conducted for any given sample and statistic of interest.

Inaccurate intervals can result if certain features are not taken into account, including: dependence in the sample, if the statistic of interest is highly biased, or if the population distribution is heavy-tailed, etc. For the iid bootstrap, bias in the confidence intervals can be largely removed by using the BCa method. Accounting for dependency in the sample can be easily managed by employing a block bootstrap method, where the CBB approach is recommended; however, the results may still be highly biased if the block length is not chosen well, and determining the optimal block length can be challenging, or at least computationally inefficient. While accounting for dependency through a statistical model is to be preferred, knowledge of an appropriate model is necessary for any such method.

Confidence intervals for many verification statistics can be achieved by using the normal approximation given by (1), either directly, or by an indirect assumption of approximate normality if the sample is not identically distributed, as with the interval for the variance (7), the linear correlation coefficient (9), and Kendall's τ (15). Proportions such as the hit rate and false alarm rate can rely on theoretical arguments regarding binomially distributed random variables, which also results in a normal approximation. It should be noted, however, that bootstrap estimation for the mean of a sample outperforms the normal approximation in general, but at the cost of computational efficiency.

Alternative methods for obtaining uncertainty information about a parameter are also available, but are not discussed here. For example, Bayesian prediction intervals have a much more intuitive interpretation than the frequentist confidence intervals, and can be more easily adapted to account for other types of uncertainty besides sampling uncertainty (e.g., observational uncertainty, uncertainty in the physical processes, etc.).

A Appendix

A.1 Forecast Verification Review

Numerous papers have been written about forecast verification statistics as pertains to meteorology, and many more in other fields such as medicine. Here, a terse review and list of some prominent verification statistics (continuous variables and 2 × 2 contingency table scores only) are given as a reference. For more details, see Wilks (2006) and Jolliffe and Stephenson (2003). Table 9 shows a generic 2 × 2 contingency table to be used as a guide for tables 10 and 11.

Table 9: Generic 2 × 2 contingency table.

                         Observed
                    Yes                    No
Forecast   Yes      a = hits               b = false alarms         nf = forecast yes
           No       c = misses             d = correct negatives    n − nf = forecast no
                    no = observed yes      n − no = observed no     n = total

The HSS (table 11) is a generalized skill score that compares the forecast's performance against the reference forecast of random chance. In meteorology, however, it is often better to use a more informed reference forecast such as climatology or persistence.

Table 12 summarizes common verification scores for continuous variables, but note that the statistics from tables 10 and 11 can still be used by applying a threshold (or thresholds) to the continuous variables, and seeking information on events that exceed (or don't exceed) the threshold(s).

Other statistics exist for both categorical and continuous variables in addition to the statistics shown here. There are similar statistics for categorical variables with multiple categories, and there are statistics specifically for probabilistic forecasts (e.g., against dichotomous observations), which are also not discussed here. See the references cited above for more details, as the main focus of the present manuscript is on confidence intervals.

A.2 The verification package in R

There is an R package called verification that contains several useful functions for forecast verification, and it is assumed that this package is installed and loaded before trying the R code in the examples here. The package verification itself relies on a few other packages, and these must also be installed (they are automatically loaded when verification is loaded).

Installing R packages is easily accomplished from the R prompt. To do this for verification, use the command install.packages("verification"), which will prompt you to select a CRAN mirror. Do so, and then let it do its thing. R will try to install the package in a default directory, and if the permissions are not set for this directory (a common problem),

Table 10: Common forecast verification scores based on 2 × 2 contingency tables. See table 9 for definitions of a, b, c, d, and n.

Score                            Estimate          Perfect   Range     Other
Accuracy (fraction correct)      (a + d)/n         1         [0, 1]    Heavily influenced by the most
                                                                       common category.
Bias (frequency)                 (a + b)/(a + c)   1         [0, ∞)    Compares only relative
                                                                       frequencies, not how well they
                                                                       correspond.
Hit rate (probability of         a/(a + c)         1         [0, 1]    Susceptible to hedging. Should
detection, POD)                                                        be used with FAR. Sensitive to
                                                                       the climatological frequency of
                                                                       the event.
False alarm ratio (FAR)          b/(a + b)         0         [0, 1]    Susceptible to hedging. Should
                                                                       be used with POD.
False alarm rate (probability    b/(b + d)         0         [0, 1]    Mostly used in conjunction with
of false detection, POFD)                                              the hit rate in ROC diagrams.
Threat score, TS (critical       a/(a + b + c)     1         [0, 1]    0 indicates no skill. Poorer
success index, CSI)                                                    scores for rarer events.
                                                                       Sensitive to the climatological
                                                                       frequency of the event.

Table 11: Common forecast verification scores based on 2 × 2 contingency tables, continued from table 10. See table 9 for definitions of a, b, c, d, and n.

Score                            Estimate                        Perfect   Range       Other
Equitable threat score, ETS      (a − ar)/(a + b + c − ar),      1         [−1/3, 1]   0 indicates no skill.
(Gilbert skill score)            where ar = (a + c)(a + b)/n
Hanssen and Kuipers score,       POD − POFD                      1         [−1, 1]     0 indicates no skill. Not
HK (true skill statistic,                                                              good for rare events.
Peirce's skill score)
Heidke skill score, HSS          ((a + d) − e)/(n − e), where    1         (−∞, 1]     0 indicates no skill.
(Cohen's k)                      e = [(a + c)(a + b) +
                                 (c + d)(b + d)]/n
Odds ratio, OR                   ad/bc                           ∞         [0, ∞)      1 indicates no skill.
Log odds ratio                   log(OR)                         ∞         (−∞, ∞)     0 indicates no skill.
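The scores in tables 10 and 11 translate directly into code. The following sketch collects them in one function, evaluated at the commonly quoted cell counts for Finley's (1884) tornado forecasts (a = 28, b = 72, c = 23, d = 2680).

```r
scores2x2 <- function( a, b, c, d) {
    # Verification scores from the 2 x 2 contingency table of table 9.
    n <- a + b + c + d
    a.r <- (a + c)*(a + b)/n                      # hits by chance (ETS)
    e <- ((a + c)*(a + b) + (c + d)*(b + d))/n    # chance agreement (HSS)
    c( accuracy = (a + d)/n,
       bias     = (a + b)/(a + c),
       POD      = a/(a + c),
       FAR      = b/(a + b),
       POFD     = b/(b + d),
       TS       = a/(a + b + c),
       ETS      = (a - a.r)/(a + b + c - a.r),
       HK       = a/(a + c) - b/(b + d),
       HSS      = ((a + d) - e)/(n - e),
       OR       = (a*d)/(b*c))
}
round( scores2x2( 28, 72, 23, 2680), 3)
```

Note how the high accuracy for these counts is driven by the many correct negatives, while the threat-based and skill scores are far more modest, illustrating the caveats in the tables.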

Table 12: Common forecast verification scores for continuous random variables.

Score                         Estimate                             Perfect   Range      Other
Mean error, ME                Σi (fi − oi)/n = f¯ − o¯             0         (−∞, ∞)    Does not measure correspondence
(additive bias)                                                                         between forecast and observations.
(Multiplicative) bias         f¯/o¯                                1         (−∞, ∞)    Does not measure correspondence
                                                                                        between forecast and observations.
Mean absolute error, MAE      Σi |fi − oi|/n                       0         [0, ∞)     Does not measure the direction of
                                                                                        the errors.
Mean squared error, MSE       Σi (fi − oi)²/n                      0         [0, ∞)     Not the usual MSE in statistics.
Root mean squared error,      √MSE                                 0         [0, ∞)     Not the usual RMSE in statistics.
RMSE
(Linear) correlation          r = sf,o/(sf so), where              1         [−1, 1]    Does not account for forecast bias.
coefficient                   sf,o = Σi (fi − f¯)(oi − o¯),
                              sf² = Σi (fi − f¯)², and
                              so² = Σi (oi − o¯)²
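Similarly, the continuous scores in table 12 can be computed in a few lines of R; the forecast and observation vectors below are arbitrary illustrative values.

```r
cont.scores <- function( f, o) {
    # Verification scores of table 12 for continuous variables.
    c( ME   = mean( f - o),
       bias = mean( f)/mean( o),
       MAE  = mean( abs( f - o)),
       MSE  = mean( (f - o)^2),
       RMSE = sqrt( mean( (f - o)^2)),
       r    = cor( f, o))
}
set.seed( 1)
o <- rnorm( 30, mean=13, sd=2.7)
f <- o + rnorm( 30, mean=0.5, sd=1)   # a slightly biased "forecast"
round( cont.scores( f, o), 3)
```

As the table cautions, the correlation here is high even though the forecast carries an additive bias, which only ME and the multiplicative bias detect.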

then install the package elsewhere, say in /home/user/Rlibrary. To do this, use the lib argument (i.e., install.packages("verification", lib="/home/user/Rlibrary")). Once the package is installed, it must be loaded into any newly invoked R session (i.e., every time R is opened) in order to use it. This is accomplished by library( verification). Note that if the package is installed somewhere other than the default directory, then it may be necessary to use the lib.loc argument (i.e., library( verification, lib.loc="/home/user/Rlibrary")). The primary function in verification is called verify. Type ?verify at the R prompt to get started using the package.

A.3 Simulating correlated random vectors

The simulated data in table 1 were constructed in R as follows.

# First, obtain a 30 × 2 matrix of 2 samples from a standard normal
# distribution.
Z <- cbind( rnorm(30), rnorm(30))

# Now, create a matrix L such that LL^T = P, where P is a
# correlation matrix whose off-diagonal elements are ρ = 0.75.
#
# Use the Cholesky decomposition of P to obtain L (call it 'rho').
rho2 <- cbind( c(1, 0.75), c(0.75, 1))
rho <- chol( rho2)

# Now pre-multiply 'Z' by the transpose of 'rho' to obtain the
# correlated 30 × 2 matrix.
Z <- t( t( rho) %*% t( Z))

# Now give the first column a (population) mean of 12 and
# standard deviation of 2.5, and give the second column a
# mean of 13 and standard deviation of 2.7.
Z <- Z*cbind( rep( 2.5, 30), rep( 2.7, 30)) +
	cbind( rep( 12, 30), rep( 13, 30))
colnames( Z) <- c("Forecast", "Observed")

The first line of this code produces two independent vectors of 30 iid standard normal random variables, (column) binds them together to form a 30 × 2 matrix, and assigns this matrix to an object called Z. The next couple of lines use the fact that cov(ρT Z) = ρT cov(Z)ρ = ρT ΣZ ρ. That is, in order to correlate the simulated forecast and observed sample with a coefficient of r (e.g., r = 0.75), we seek a 2 × 2 matrix, ρ, such that

V = ρT ΣZ ρ = | σf²      rσf σo |
              | rσf σo   σo²    |.

It is simplest to begin by correlating standard normal vectors (i.e., so that σf = σo = 1), and then "un-standardize" to obtain the desired "population" means and variances. To do this, we set up a 2 × 2 matrix with 1's on the diagonal and r on the off-diagonal, and take the Cholesky decomposition; the second and third lines of the above code perform this procedure. Note that the transpose of the resulting matrix rho in this case is approximately

| 1     0    |
| 0.75  0.66 |,

and pre-multiplying the random matrix Z by this transpose obtains correlated random vectors with the desired covariance V as above. In order that the first (second) column has a "population" mean of 12 (13) and standard deviation of 2.5 (2.7), we must multiply the first (second) column by 2.5 (2.7) and add 12 (13); this is done in the penultimate line of code above. The last line of the code simply gives column names to the matrix Z. This is how the data in table 1 were simulated.

If it is desired to read in the values directly from table 1, then copy and paste these values into an ASCII file named, say, 'Z.dat', and include the column names as a single line on the top. Then, use the following command at the R prompt to read the simulations into R (it is assumed here that the file is in the working R directory, but the full path to the file may be used if this is not the case).

Z <- read.table("Z.dat", header=TRUE)

Simulations from table 4 are created in a similar way as above. Taking rho to be the same matrix from above, do the following.

phi <- 0.6
Z2 <- matrix( 0, nrow=30, ncol=2)
Z2[1,] <- rnorm( 2)
Z2[1,] <- t( t( rho) %*% matrix( Z2[1,], nrow=2, ncol=1))
for( i in 2:30) Z2[i,] <- phi*Z2[i-1,] +
	t( t( rho) %*% matrix( rnorm(2), nrow=2, ncol=1))
Z2 <- Z2*3 + 2

colnames( Z2) <- c("Forecast", "Observed")

The above code gives the forecast and observed columns both the same mean
of 2 and standard deviation of 3 (penultimate line), but note that this step is for
illustrative purposes only and was not used in creating the simulations of table 4.
The population autoregressive correlation coefficient is 0.6, and the population
correlation between the two columns is again 0.75.
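The correlating step can be checked empirically: with a large sample, the induced correlation should be close to ρ. A quick sketch (the sample size of 10000 is an arbitrary choice, and W is used here so as not to overwrite Z):

```r
# Empirical check of the Cholesky correlating step.
set.seed( 42)
n <- 10000
W <- cbind( rnorm( n), rnorm( n))
rho <- chol( cbind( c( 1, 0.75), c( 0.75, 1)))
W <- t( t( rho) %*% t( W))
cor( W[,1], W[,2])   # approximately 0.75
```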

A.4 Plotting in R
To make the plots shown in figure 2, the following code can be used. See the help
file for par to learn more about optional arguments to the plot function, and
plotting in general with R.

par( mfrow=c(2,2), mar=c(4, 4, 1, 1))
plot( 1:30, Z[,"Forecast"], ylim=range( c( Z)), type="l",
col="blue", lwd=2, xlab="", ylab="Simulations")
lines( 1:30, Z[,"Observed"], col="orange", lty=2, lwd=2)

plot( 1:30, Z2[,"Forecast"], ylim=range( c( Z2)), type="l",
col="blue", lwd=2, xlab="", ylab="")
lines( 1:30, Z2[,"Observed"], col="orange", lty=2, lwd=2)

plot( Z[,"Forecast"], Z[,"Observed"], cex=1.5,
xlab="Forecast", ylab="Observed")

plot( Z2[,"Forecast"], Z2[,"Observed"], cex=1.5,
xlab="Forecast", ylab="")

The first line sets up the plotting device. Each time the function plot is called,
whatever is on the device is overwritten. The option mfrow=c(2,2) in the call to
par tells R to put 4 plots on the same device in a 2×2 layout instead of overwriting
the previous plot. The mar argument controls the outer margins (again, see the
help file for par to learn more about these options, and other options). The lines
function, in contrast to plot, adds lines to an existing plot (without overwriting
the entire plot; see also, points).


Support for this manuscript was provided by the Air Force Weather Agency
(AFWA), the Developmental Testbed Center
(DTC), and the National Center for At-
mospheric Research (NCAR). The National Center for Atmospheric Research is
sponsored by the National Science Foundation. Any opinions, findings, and con-
clusions or recommendations expressed in this publication are those of the author
and do not necessarily reflect the views of the National Science Foundation.

Agresti, A., 1996: An introduction to categorical data analysis. Wiley, New York,
N.Y., 290pp.
Athreya, K., 1987: Bootstrap of the mean in the infinite variance case. Annals of
Statistics, 15, 724–731.
Bernadet, L., J. Wolff, L. Nance, A. Loughe, B. Weatherhead, E. Gilleland, and
B. Brown, 2009: Comparison between ARW and NMM objective verification
scores. 23rd Conf. on Wea. Anal. and Forec./19th Conf. on Numer. Wea. Predic.,
Amer. Meteor. Soc., Omaha, NE , 1–6.
Bernardo, J. and A. Smith, 2000: Bayesian Theory. Wiley, New York, N.Y.
Bertail, P., 1997: Second-order properties of an extrapolated bootstrap without
replacement under weak assumptions. Bernoulli, 3, 149–179.
Best, D. and D. Roberts, 1975: Algorithm as 89: The upper tail probabilities of
Spearman’s rho. Applied Statistics, 24, 377–379.
Blom, G., 1959: Statistical Estimates and Transformed Beta Variables. Wiley, New
York, N.Y., 176 pp.
Brockwell, P. and R. Davis, 2002: Introduction to Time Series and Forecasting.
Springer, New York, 420 pp.
Brown, B., G. Thompson, R. Bruintsjes, R. Bullock, and T. Kane, 1997: Inter-
comparison of in-flight icing algorithms. part II: Statistical verification results.
Wea. Forecasting, 12, 890–914.
Carlstein, E., 1986: The use of subseries methods for estimating the variance of
a general statistic from a stationary time series. Annals of Statistics, 14, 1171–


Coles, S., 2001: An introduction to statistical modeling of extreme values. Springer-
Verlag, London, UK, 208 pp.

Cortes, C. and M. Mohri, 2004: Confidence intervals for the area under the ROC
curve. Advances in neural information processing systems, 17, 305–312.

Cressie, N., 1993: Statistics for Spatial Data (Rev. Sub. Edition). Wiley, U.S.A.,

Davison, A. and D. Hinkley, 1997: Bootstrap Methods and Their Application.
Cambridge University Press.

Dawid, A. and M. Stone, 1982: The functional-model basis of fiducial inference.
Ann. Statistics, 10, 1054–1067.

Devore, J., 1995: Probability and statistics for engineering and the sciences (fourth
edition). International Thompson Publishing Company, Belmont, California, 743 pp.

DiCiccio, T. and B. Efron, 1996: Bootstrap confidence intervals. Statistical Sci-
ence, 11, 189–228.

Efron, B. and R. Tibshirani, 1998: An Introduction to the Bootstrap. Chapman
and Hall, Boca Raton.

Fisher, R., 1935: The fiducial argument in statistical inference. Ann. Eugenics, 6,

Gibbons, J. and S. Chakraborti, 1992: Nonparametric Statistical Inference. Third
Ed. Statistics: Textbooks and Monographs, 131. Marcel Dekker, Inc., New York,
N.Y., 544 pp.

Gilleland, E. and R. W. Katz, 2006: Analyzing seasonal to interannual extreme
weather and climate variability with the extremes toolkit (extremes). 18th Con-
ference on Climate Variability and Change, 86th American Meteorological Soci-
ety (AMS) Annual Meeting, 29 Jan - 2 Feb, Atlanta, Georgia, P2.15, 1–11.

Hall, P., J. Horowitz, and B. Jing, 1995: On blocking rules for the bootstrap with
dependent data. Biometrika, 82, 561–574.

Hannig, J., H. Iyer, and P. Patterson, 2006: Fiducial generalized
confidence intervals. J. Amer. Statistical Assoc., 101, 254–269,


Hurvich, C. and S. Zeger, 1987: Frequency domain bootstrap methods for time series. Statistics and Operations Research Working Paper, New York University, New York, 34 pp.
Jolliffe, I., 2007: Uncertainty and inference for verification measures. Wea. Forecasting, 22, 637–650.
Jolliffe, I. and D. Stephenson, eds., 2003: Forecast Verification. A Practitioner's Guide in Atmospheric Science. Wiley and Sons Ltd.
Judge, G., W. Griffiths, R. Carter Hill, H. Lütkepohl, and T. Lee, 1988: Introduction to the Theory and Practice of Econometrics (second edition). Wiley and Sons, New York, 1024 pp.
Katz, R., 1982: Statistical evaluation of climate experiments with general circulation models: A parametric time series modeling approach. J. Atmos. Sci., 39, 1446–1455.
Kleinbaum, D., 1996: Logistic regression: A self-learning text. Springer-Verlag, New York, 282 pp.
Künsch, H., 1989: The jacknife and the bootstrap for general stationary observations. Annals of Statistics, 17, 1217–1261.
Lahiri, S., 1999: Theoretical comparisons of block bootstrap methods. Annals of Statistics, 27, 386–404.
— 2003: Resampling Methods for Dependent Data. Springer-Verlag, New York.
Lee, S., 1999: On a class of m out of n bootstrap confidence intervals. J. Roy. Stat. Soc. B, 61, 901–911.
Lepage, R. and L. Billard, eds., 1992: Exploring the Limits of Bootstrap. Wiley, New York.
Liu, R. and K. Singh, 1992: Exploring the Limits of Bootstrap, chapter Moving blocks jacknife and bootstrap capture weak dependence. Wiley, New York, R. Lepage and L. Billard, Eds., 225–248.
Mason, I., 2003: Binary events. Forecast Verification. A Practitioner's Guide in Atmospheric Science, I. Jolliffe and D. Stephenson, Eds., Wiley, chapter 3, 37–76.
Mason, S. and N. Graham, 2002: Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: Statistical significance and interpretation. Q. J. Roy. Meteorol. Soc., 128, 2145–2166.
McGill, R., J. Tukey, and W. Larsen, 1978: Variations of box plots. Amer. Stat., 32, 12–16.

Pepe, M., 2003: The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford Statistical Science Series, Oxford University Press, Oxford, UK, 318 pp.
Politis, D. and J. Romano, 1992: Exploring the Limits of Bootstrap, chapter A circular block resampling procedure for stationary data. Wiley, New York, R. Lepage and L. Billard, Eds., 263–270.
— 1994: The stationary bootstrap. J. Amer. Statistical Assoc., 89, 1303–1313.
R Development Core Team, 2008: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0, URL http://www.R-project.org.
Shao, J. and Tu, Dongsheng, 1995: The jacknife and the bootstrap. Springer-Verlag, New York.
Stephenson, D., 2000: Use of the odds ratio for diagnosing forecast skill. Wea. Forecasting, 15, 221–232.
Velleman, P. and D. Hoaglin, 1981: Applications, basics, and computing of exploratory data analysis. Duxbury Press.
Wilks, D., 1997: Resampling hypothesis tests for autocorrelated fields. J. Climate, 10, 65–82.
— 2006: Statistical Methods in the Atmospheric Sciences, second edition. Academic Press, San Diego, 627 pp.
Woolf, B., 1955: On estimating the relation between blood group and disease. Ann. Human Genet. (London), 19, 251–253.
Zwiers, F. and H. von Storch, 1995: Taking serial correlation into account in tests of the mean. J. Climate, 8, 336–351.
