
STATISTICS IN MEDICINE

Statist. Med. 2000; 19:3219–3236

Analysis of cost data in randomized trials: an application


of the non-parametric bootstrap

Julie A. Barber¹,*,† and Simon G. Thompson²


1 MRC Clinical Trials Unit, 222 Euston Road, London NW1 2DA, U.K.
2 MRC Biostatistics Unit, Robinson Way, Cambridge CB2 2SR, U.K.

SUMMARY

Health economic evaluations are now more commonly being included in pragmatic randomized trials. However
a variety of methods are being used for the presentation and analysis of the resulting cost data, and in many
cases the approaches taken are inappropriate. In order to inform health care policy decisions, analysis needs
to focus on arithmetic mean costs, since these will reflect the total cost of treating all patients with the
disease. Thus, despite the often highly skewed distribution of cost data, standard non-parametric methods or
use of normalizing transformations are not appropriate. Although standard parametric methods of comparing
arithmetic means may be robust to non-normality for some data sets, this is not guaranteed. While the
randomization test can be used to overcome assumptions of normality, its use for comparing means is still
restricted by the need for similarly shaped distributions in the two groups. In this paper we show how the
non-parametric bootstrap provides a more flexible alternative for comparing arithmetic mean costs between
randomized groups, avoiding the assumptions which limit other methods. Details of several bootstrap methods
for hypothesis tests and confidence intervals are described and applied to cost data from two randomized trials.
The preferred bootstrap approaches are the bootstrap-t or variance stabilized bootstrap-t and the bias corrected
and accelerated percentile methods. We conclude that such bootstrap techniques can be recommended either
as a check on the robustness of standard parametric methods, or to provide the primary statistical analysis
when making inferences about arithmetic means for moderately sized samples of highly skewed data such as
costs. Copyright © 2000 John Wiley & Sons, Ltd.

1. INTRODUCTION

Pragmatic randomized controlled trials now often include an economic evaluation to inform deci-
sions about allocating health care resources [1]. In such trials an estimate of the cost of treatment
and its consequences may be obtained for each patient, for example from information about the
quantities of health care resources used by each patient over the study period. This resource use
information is weighted by ‘unit cost’ estimates, which give a fixed monetary value to each unit of resource type, and then summed to give patient specific estimates of total cost [2].

∗ Correspondence to: Julie A. Barber, MRC Clinical Trials Unit, 222 Euston Road, London NW1 2DA, U.K.
† E-mail: julie.barber@ctu.mrc.ac.uk

Contract/grant sponsor: London NHS Executive




When information about the costs of alternative treatments is to be used to guide health care
policy decisions, it is the total cost of treating all patients with the disease that is relevant. Given
this, it is clearly the arithmetic mean cost for a particular treatment policy which is the most
informative measure for cost data in a randomized trial [3–6]. However, cost data typically have
a highly positively skewed distribution, usually due to a minority of patients who utilize large
amounts of resources because they are particularly ill. Sometimes skewness is made even more
extreme by high proportions of patients with zero or very small cost values.
A recent review of statistical methods used for cost data in published randomized trials [3]
showed that only about half of the papers used any form of statistical inference to compare
costs between groups. Amongst these a variety of methods were used, with half using standard
parametric methods comparing arithmetic means and others using non-parametric analyses (for
example, Mann–Whitney U-test) or log transformation approaches. The review showed a lack
of appreciation amongst researchers involved in economic evaluations about the importance of
analysing arithmetic mean costs and an apparent confusion as to the most appropriate methods to
use for cost data.
In this paper appropriate methods of statistical inference for patient specific cost data from randomized controlled trials are discussed. Section 2 provides motivation for this work through a brief description of the cost data from two pragmatic randomized trials. Sections 3 and 4 then describe methods for analysis, the first appraising various conventional methods and the second
describing bootstrap approaches for hypothesis testing and estimation. In Section 5 the methods
are applied to each of the data sets described in Section 2, and the results compared. Section 6
comments on other proposed methods as well as extensions to bootstrap methods relevant to health
economic assessment, and gives overall conclusions and recommendations.

2. EXAMPLES

Two examples of pragmatic randomized trials which included an economic evaluation are given
here. In both cases the costs were obtained at a patient specific level, calculated from resource
use information. In our experience these data are typical of cost data obtained in health service
evaluations in randomized trials. The second example is from a small study, to exemplify the
particular problems of analysing skewed cost data when sample sizes are small. Such moderate
and small samples are common in practice [3], even though larger trials would of course be
preferable scientifically.
A pragmatic randomized trial was set up to assess the clinical effectiveness and costs of a
community based exercise programme for treating patients with low back pain [7]. In total 184
patients were randomized either to the exercise programme or usual general practitioner (GP)
care. Resource use information collected over a one year period from randomization was used
to calculate the total cost of treatment for each patient. Resources included were the number of
orthopaedic referrals, physiotherapy visits, GP visits, osteopathy visits, MRI scans, X-rays, nights
in hospital, costs of any special equipment purchased, days of work lost, plus a fixed treatment
cost for those randomized to the exercise programme. Total costs were available for 144 (78 per
cent) of the 184 patients randomized. The large amount of missing cost data is a consequence of
missing information for one or more of the resource use items used to calculate total costs, and
is typical of economic evaluations in trials [3].


Figure 1. (a) Distribution of total costs ($) in each group over one year for the back pain trial. (b) Distribution
of total costs ($) per month in each group for the parasuicide pilot trial.

The distributions of these costs are severely skewed (Figure 1(a)) with the coefficient of variation
in each group being greater than 1.5. The mean cost (standard deviation) for the 70 patients with
data in the exercise programme group is $360 (582) and for the 74 patients in the usual care
group $508 (1109). Both groups have a large proportion of patients with very small or zero cost
values: 34 per cent of those in the exercise group had a fixed cost of $25.20 (the cost of the
exercise programme) and 43 per cent of those in the usual care group incurred no cost over the
study period.
The second example is from a small pilot study comparing cognitive behavioural therapy with
usual care for treatment of patients with a history of deliberate self-harm (parasuicide) [8]; 34
subjects were randomized and followed up for 6 months. Resource use information collected over
this period was used to calculate a total cost per month for each patient. Included in the calculation
were in- and out-patient psychiatric hospital services, general hospital services, custodial services
such as police contact, primary care services, community psychiatric services, social services and
private consultations. The skewed distributions of total costs per month in each group are shown in
Figure 1(b). For the 18 patients in the therapy group the mean monthly cost (standard deviation)
was $648 (595) and for the 14 patients with cost data in the usual care group $1191 (1416).


In these examples, as in other similar trials, the statistical issue is how to provide a valid
comparison and inference regarding the arithmetic mean costs in the two treatment groups given
the skewness of the distributions of costs.

3. CONVENTIONAL METHODS

For the analysis of cost data from randomized controlled trials, primary interest is usually whether
the costs of a particular treatment group Y are greater (or less) than those associated with an
alternative treatment group Z. Given the often skewed distribution of cost data, conventional sta-
tistical guidelines [9] would recommend non-parametric methods or analysis of data which has
been transformed to achieve approximate normality. The following sections describe and discuss
the various conventional approaches used in practice or recommended for the analysis of costs.
For simplicity this focuses on tests, but the same arguments apply similarly to the calculation of
confidence intervals.

3.1. Data transformation


Transformation of skewed data to achieve approximate normality is widely used in practice.
A log transformation is most common, but others such as shifted log (adding a constant before
taking logs, particularly to enable the retention of zeros in the data), square root and reciprocal
transformations have also been advocated in principle for cost data [5, 10]. Other and possibly more justified reasons for such transformation may be to attain additivity for covariate effects in regression modelling [11] or homogeneity of variance [12].
Calculation of means after (non-linear) data transformations does not result in a comparison of
arithmetic means, and so is not appropriate for cost data. For example, log transformation results in
a comparison of geometric mean costs, and reciprocal transformation in a comparison of harmonic
mean costs. Valid inferences drawn from statistical analyses comparing geometric or harmonic
means cannot in general be taken also to apply to arithmetic means [13], since the relationship
between these di erent measures of location depends on the shapes of the distributions.
In the health economic literature, several authors have proposed that costs should be log transformed before analysis [10, 14, 15]. However, only in restricted conditions do such methods result in a comparison of arithmetic means. Even if the data have a log-normal distribution, a test of geometric means is equivalent to a test of arithmetic means only if the variances on the log scale in the two groups are equal. This is because for log-normal data, an estimate of the arithmetic mean is exp(μ_l + σ_l²/2), where μ_l and σ_l² are the mean and variance of the logged data [16]. Therefore testing the null hypothesis H0: μ_ly = μ_lz (where μ_ly and μ_lz are the underlying means of the logged distributions) for data which are log-normal is equivalent to testing equality of the underlying arithmetic means H0: μ_y = μ_z (where μ_y and μ_z are the underlying means of the untransformed distributions) only if the standard deviations on the logged scale are the same.
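Writing the equivalence condition out explicitly (using μ_l and σ_l for the log-scale mean and standard deviation in each group) makes the role of the log-scale variances clear:

```latex
% For log-normal costs, the arithmetic mean in each group is
\mu_y = \exp\!\left(\mu_{ly} + \tfrac{1}{2}\sigma_{ly}^{2}\right), \qquad
\mu_z = \exp\!\left(\mu_{lz} + \tfrac{1}{2}\sigma_{lz}^{2}\right),
% so equality of the arithmetic means corresponds to
\mu_y = \mu_z \iff \mu_{ly} + \tfrac{1}{2}\sigma_{ly}^{2} = \mu_{lz} + \tfrac{1}{2}\sigma_{lz}^{2},
% which coincides with the log-scale null hypothesis \mu_{ly} = \mu_{lz}
% only when \sigma_{ly} = \sigma_{lz}.
```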

3.2. Standard non-parametric methods


Standard non-parametric methods such as the Mann–Whitney U-test make a general comparison of
distributions between groups. The null hypothesis for the test is that the values in the two samples
y (drawn randomly from distribution F) and z (drawn randomly from distribution G) come from
the same population, that is H0: F = G. This null hypothesis is not specifically concerned with


differences in one aspect of the distributions, such as their means, and a significant result can arise because of a difference in location or shape of the distributions. The test is therefore only useful for a comparison of arithmetic means in cases where the distributions can be assumed to have similar shapes.

3.3. Two-sample t-test


A two-sample t-test or z-test on untransformed data allows a direct comparison of arithmetic means.
The null hypothesis is that the means of the underlying distributions for groups Y and Z are equal,
that is H0: μ_y = μ_z or H0: δ = 0, where δ = μ_y − μ_z. The usefulness of these standard approaches
for comparing arithmetic means for cost data, however, is limited by their assumption of normality
and, for the t-test, equal variances. Although violation of the equal variances assumption could
seriously affect the validity of the t-test, especially if the samples are small and unequal in size
[17], methods of carrying out a t-test with unequal variances exist [18]. For large samples with
unequal variances a simple z-test is appropriate. For data such as costs where unequal variance
and highly skewed distributions are likely to occur together, it is robustness of these methods to
non-normality which is most important. There is much early literature investigating this [19–21]
which has shown that the two-sample t-test will be robust to non-normality if the sample sizes are
large enough, if the sample size and the skewness in the two groups are similar, or if the data
are not too severely non-normal.
Although such information is helpful, it is difficult in practice to use it in deciding whether the standard parametric procedures may be misleading. Some indication of what is meant by ‘large’, ‘similar’ and ‘not too severely’ is required. In reference to the robustness of the t-test for log-normal cost data, Zhou and colleagues [6] have speculated that a total sample size of 1000 or more would be necessary. Unfortunately, however, no clear overall practical criteria for assessing robustness to non-normality have been defined, and indeed these may be impossible to specify. Faced with the analysis of a new non-normal data set it is difficult to judge whether or not standard t-test methods will be valid.
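For reference, the unequal-variance (Welch) t-test of Section 3.3 is available in standard software. A minimal Python sketch is given below; the arrays y and z are hypothetical per-patient cost vectors standing in for the two randomized groups:

```python
import numpy as np
from scipy import stats

# Hypothetical per-patient total costs in the two randomized groups
y = np.array([0.0, 25.2, 25.2, 310.5, 980.0, 2500.0])   # e.g. exercise group
z = np.array([0.0, 0.0, 150.0, 420.0, 3900.0])          # e.g. usual care group

# Welch's t-test compares arithmetic means without assuming equal variances,
# but still relies on approximate normality of the sample means
t_stat, p_value = stats.ttest_ind(y, z, equal_var=False)
print(t_stat, p_value)
```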

3.4. Permutation tests


For comparisons of arithmetic means in randomized studies where distributions are non-normal,
Fisher suggested the permutation test (also called the randomization test) [22]. This non-parametric
test determines statistical signi cance directly from observed values without the need to assume a
particular sampling distribution. It requires only that the subjects have been divided randomly into
the groups being compared. The null hypothesis of the test is that the two population distributions
from which the samples were drawn are identical, that is, H0: F = G. By using the difference in means as the test statistic, this test becomes especially sensitive to differences in population means
[17].
The permutation test is carried out by considering all possible ways in which the subjects could
have been randomized into two groups of the same size as in the original sample. Under H0 and
with N observations in total of which m are in group Y and n in group Z, each permutation
is equally likely, with exact probability 1/(N choose n). For each of the R possible permutations of the data, the difference in means δ̂*_r can be calculated. The set of values δ̂*_r (r = 1, 2, ..., R) has a distribution which has been formed under the null hypothesis, and when compared with the observed difference in means δ̂ gives a two-sided P-value for the test as P̂_perm = #{|δ̂*_r| ≥ |δ̂|}/R.


This P-value will be exact when all R possible permutations of the data have been included, this
being called a systematic permutation test. However, even with modern computers this approach
could prove very time consuming. An alternative is the randomization or sampled permutation test.
This method uses a random sample of the R permutations (recommended to be 1000 or more [23])
and results in a P-value which is a good approximation to the exact value from the systematic
permutation procedure.
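A minimal sketch of the sampled permutation (randomization) test for a difference in arithmetic means is shown below in Python (not the authors' implementation); y and z are placeholders for the per-patient cost vectors in the two groups:

```python
import numpy as np

def sampled_permutation_test(y, z, R=2000, seed=None):
    """Two-sided sampled permutation test for a difference in arithmetic means."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([y, z])
    m = len(y)
    d_obs = y.mean() - z.mean()
    d_perm = np.empty(R)
    for r in range(R):
        shuffled = rng.permutation(pooled)           # re-divide subjects at random
        d_perm[r] = shuffled[:m].mean() - shuffled[m:].mean()
    # Proportion of permuted differences at least as extreme as the observed one
    return np.mean(np.abs(d_perm) >= np.abs(d_obs))
```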
For testing differences in mean costs, the randomization test avoids any problems of assuming normality. However, given that the null hypothesis is actually concerned with a comparison of distributions, the test will only exclusively test means if it can be justifiably assumed that the shapes of the population distributions are identical [17]. The need for such an assumption has
been discussed in detail by Romano [24] and Boik [25].

3.5. Limitations of conventional methods


The methods of analysis described in the previous sections all require particular distributional
assumptions such as normality, log-normality, equal variances or same shaped distributions in
order to ensure that they test only for differences in arithmetic means. Skewed cost data may often not adhere to such requirements. For example, the back pain trial cost data introduced in Section 2 are not log-normally distributed; even a shifted log transformation does not lead to approximate normality. In addition the standard deviations in the two groups are unequal (exercise group $582, usual care group $1109). Thus, except in fortuitous circumstances, the above methods are
not appropriate. Certainly no one method can be advocated as generally suitable. For this reason
we discuss the application of non-parametric bootstrap techniques in the next section as a better
general strategy for the analysis of cost data.

4. BOOTSTRAP METHODS

The non-parametric bootstrap is an approach similar to the randomization test which can be used to
compare arithmetic mean costs while avoiding distributional assumptions [23]. Use of this method
for cost data has recently also been proposed by other authors [4–6], but without the necessary
level of detail for application in practice. In the next section a brief description of the bootstrap
principle is given, followed by exposition of bootstrap methods for carrying out hypothesis tests
and calculating confidence intervals for comparisons of arithmetic means. These are explained in detail, both so that bootstrap methods suited to the specific case of comparing arithmetic means are explicit, and to provide relevant information to assist with their practical application. Page
number references to two books on bootstrapping, Efron and Tibshirani (henceforth ET) [23] and
Davison and Hinkley (henceforth DH) [26] are given for the same reason.

4.1. The bootstrap principle


The bootstrap is a data-based simulation method for assessing statistical precision developed by
Efron in a series of articles from 1979 [27–29]. It is especially useful when the sampling distribu-
tion of an estimator is unknown or cannot be defined mathematically, so that classical methods of
statistical analysis are not available. In the ‘real world’ (using Efron’s phraseology) the observed
sample is chosen randomly from an unknown probability distribution F and the statistic of interest
calculated. This is mimicked in the non-parametric ‘bootstrap world’ by using the observed sample


as an empirical estimate F̂ of the unknown distribution F. Samples of the same size as the original
are drawn from F̂, by sampling with replacement from the observed data. In practice the number
of samples (B) required will depend on the measure of interest. For example, it is recommended that between 25 and 200 resamples are required to estimate a bootstrap standard error (ET p. 47 [23]), while at least 1000 are needed to obtain a bootstrap confidence interval (ET p. 275 [23]). For each resample the statistic of interest (for example the difference in means) is calculated. The distribution of these B values of the statistic provides an approximation to its population sampling distribution and can be used to estimate standard errors, confidence limits and to carry out hypoth-
esis tests. Using bootstrapping for statistical analysis avoids the need for parametric distributional
assumptions such as normality. However it does require the assumption that the true distribution
of the data is adequately represented by its empirical distribution, so that the true distribution of
the statistic of interest is well approximated by its bootstrap distribution.
When using bootstrapping for the analysis of randomized trials it is important to account for the fact that the groups have potentially different probability distributions. To allow for this, stratified bootstrap resampling should be used. This involves drawing samples of the same size as the original separately from each group (ET p. 88 [23], DH p. 71 [26]). The bootstrap methods for carrying out hypothesis tests and calculating confidence intervals for comparing two means are described in subsequent sections. Although these involve bootstrapping different statistics, all require the same resampling procedure. For two randomized groups Y and Z, with, respectively, a sample of m independent values y₁, y₂, ..., y_m and a sample of n independent values z₁, z₂, ..., z_n, the bootstrap resampling procedure can be summarized in two steps:
1. Form B bootstrap data sets (y*_b, z*_b), b = 1, 2, ..., B, where y*_b are m values sampled with replacement from y₁, y₂, ..., y_m and z*_b are n values sampled with replacement from z₁, z₂, ..., z_n.
2. For each bootstrap data set (y*_b, z*_b) compute the statistic of interest δ̂*_b. Use the distribution of these B values δ̂*_b as an estimate of the underlying distribution.

4.2. Bootstrap hypothesis tests for comparing arithmetic means


When carrying out a bootstrap hypothesis test, bootstrap resampling is used to obtain an approxi-
mation to the distribution of the test statistic under H0 . In a similar way as for the randomization
test, this distribution is then directly compared with the observed value of the test statistic to esti-
mate the P-value. For comparing arithmetic means in two groups, with null hypothesis H0: μ_y = μ_z,
this avoids having to make assumptions about equality of variances or shapes of the distributions
(ET p. 222 [23]).
The resampling procedure is given by the two-step approach described above. In this case the most reliable test statistic for comparison of means is the Studentized statistic given by the difference in means divided by its standard error (ET p. 223 [23], DH p. 171 [26]). In order to obtain an estimate of the distribution of the test statistic under H0, however, this is calculated for each bootstrap sample as t*_b = (δ̂*_b − δ̂)/SÊ*_b, where δ̂*_b is the difference in bootstrap means, δ̂ the observed difference in means, and SÊ*_b the standard error of the bootstrap difference in means. The latter is calculated as for the t-test with unequal variances, that is SÊ*_b = √(SD*²_yb/m + SD*²_zb/n), where SD*_yb and SD*_zb are the observed standard deviations of the bootstrap samples y*_b and z*_b.
The estimated distribution of the test statistic under H0 is used in place of a standard t-distribution to obtain the approximate two-sided P-value P̂_boot = #{|t*_b| ≥ |t_obs|}/B, where t_obs is the observed value of the test statistic, calculated as for a t-test with unequal variances. If the


distribution of bootstrap-t values is close to the standard t-distribution then the P-value from the
t-test and that from the bootstrap test will be similar. This implies that for the particular data set
being analysed the t-test is robust to violations of its assumptions.
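A minimal sketch of this bootstrap test in Python, under the assumptions above (stratified resampling, Studentized statistic with the unequal-variance standard error):

```python
import numpy as np

def welch_se(a, b):
    """Standard error of the difference in means, as for the unequal-variance t-test."""
    return np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

def bootstrap_t_test(y, z, B=2000, seed=None):
    rng = np.random.default_rng(seed)
    d_obs = y.mean() - z.mean()
    t_obs = d_obs / welch_se(y, z)
    t_star = np.empty(B)
    for b in range(B):
        yb = rng.choice(y, size=len(y), replace=True)
        zb = rng.choice(z, size=len(z), replace=True)
        # Studentize around the observed difference so the resampled statistics
        # approximate the distribution of the test statistic under H0
        t_star[b] = ((yb.mean() - zb.mean()) - d_obs) / welch_se(yb, zb)
    return np.mean(np.abs(t_star) >= np.abs(t_obs))
```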
This bootstrap test gives different results if a different scale is used for δ̂*_b, and will perform best on a variance stabilized scale where δ̂ and SÊ(δ̂) are independent (DH p. 111 [26]). This independence can be assessed from a plot of the bootstrap differences in means δ̂*_b against their corresponding standard errors SÊ*_b. If there is a strong relationship between these, the test should be carried out after a variance stabilizing transformation of the bootstrap values δ̂*_b, say function g. Note that this is not the same as transforming the data, since transforming the bootstrap values δ̂*_b will still allow a comparison of arithmetic means. The test statistic calculated for each resample is then given by t*_b = g(δ̂*_b) − g(δ̂). There is no longer a need to divide by the estimated standard error because on the variance stabilized scale this will be constant. Comparing the observed value of the test statistic t_obs = g(δ̂) − g(δ_0) (where δ_0 is the null value for the difference in means, usually 0) with the distribution of t*_b values will provide an estimate of the P-value for the test. Details of an ‘automatic’ method of finding an appropriate transformation g are given in the next section.

4.3. Bootstrap confidence intervals


Several methods for calculating bootstrap confidence intervals have been proposed [27–31] and compared in theory and empirically by several authors [23, 26, 32–34]. In this section the most well known and researched of the methods are described, beginning with the bootstrap-t [28] and variance stabilized bootstrap-t [31] methods, followed by the simple percentile [27] and bias-corrected and accelerated percentile approaches [29]. All are described in the context of confidence intervals for differences in arithmetic means.

4.3.1. Bootstrap-t methods. The bootstrap-t method involves generating bootstrap-t values for the particular data set, which can be used in place of the standard t-distribution values when calculating confidence intervals. The distribution of bootstrap-t values is obtained in exactly the same way as for the bootstrap hypothesis test (Section 4.2), using the test statistic t*_b = (δ̂*_b − δ̂)/SÊ*_b. To estimate a 100(1 − α) per cent confidence interval for δ, the 100(α/2) per cent and 100(1 − α/2) per cent percentiles of t* (t*_(α/2) and t*_(1−α/2), respectively) are used, the confidence interval being given by

(δ̂ − t*_(1−α/2) SÊ(δ̂),  δ̂ − t*_(α/2) SÊ(δ̂))

where δ̂ is the observed difference in means and SÊ(δ̂) its standard error. The 100αth percentile is estimated by the (B + 1)αth member of the ordered bootstrap sample if this is a whole number; otherwise interpolation must be used (DH p. 195 [26], ET p. 160 [23]). Efron has shown that estimating such quantiles reliably will require at least 1000 bootstrap resamples (ET p. 275 [23]). As before, if the distribution of the t*_b values approximates that of the standard t-distribution, the bootstrap confidence interval will be similar to the standard interval.
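A minimal sketch of this (unstabilized) bootstrap-t interval in Python, reusing the same resampling as the bootstrap test; note the reversal of the upper and lower bootstrap percentiles in the interval formula:

```python
import numpy as np

def bootstrap_t_interval(y, z, B=2000, alpha=0.05, seed=None):
    rng = np.random.default_rng(seed)
    d_obs = y.mean() - z.mean()
    se_obs = np.sqrt(y.var(ddof=1) / len(y) + z.var(ddof=1) / len(z))
    t_star = np.empty(B)
    for b in range(B):
        yb = rng.choice(y, size=len(y), replace=True)
        zb = rng.choice(z, size=len(z), replace=True)
        se_b = np.sqrt(yb.var(ddof=1) / len(yb) + zb.var(ddof=1) / len(zb))
        t_star[b] = ((yb.mean() - zb.mean()) - d_obs) / se_b
    # Percentiles of the bootstrap-t distribution (simple quantiles are used here;
    # the paper notes that interpolation of order statistics may be used instead)
    t_lo, t_hi = np.quantile(t_star, [alpha / 2, 1 - alpha / 2])
    return d_obs - t_hi * se_obs, d_obs - t_lo * se_obs
```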
In the same way as for the bootstrap test, this method of obtaining a confidence interval is not transformation respecting and is best performed on a variance stabilized scale where δ̂ and SÊ(δ̂) are independent (ET p. 164 [23], DH p. 111 [26]). If a strong relationship is seen on a plot of δ̂*_b against SÊ*_b, a suitable variance stabilizing transformation should be found. The confidence interval

is then calculated on the variance stabilized scale and back transformed to obtain an interval on
the original scale.
An ‘automatic’ method of finding an approximate variance stabilizing transformation based on a Taylor series argument has been recommended (ET p. 164 [23], DH p. 111 [26]). For a random variable X with mean θ and standard deviation s(θ) that varies as a function of θ, the transformation given by

g(x) = ∫^x 1/s(u) du     (1)

has the property that the variance of g(X) is approximately constant. The following steps describe the variance stabilized bootstrap-t method for a difference in means:
1. Form a set of B bootstrap data sets sampled with replacement from the two groups Y and Z, as described in Section 4.1.
2. Compute the difference in means δ̂*_b and standard error SÊ*_b for each bootstrap data set, b = 1, 2, ..., B.
3. Fit a curve to the points (δ̂*_b, SÊ*_b) using a non-linear regression technique to produce a smooth function s such that s(δ̂*_b) is the average SÊ*_b at δ̂*_b, for example using fractional polynomials [35] or a lowess running line smoother [36].
4. Estimate the variance stabilizing transformation g(δ̂) using equation (1) and a numerical integration technique based, for example, on the trapezoid rule or cubic splines.
5. Compute a bootstrap-t interval for the transformed values g(δ̂*_b) in the way described previously. The standard error will be approximately constant, so SÊ*_b and SÊ(g(δ̂)) can be set equal to 1.
6. The endpoints calculated on the transformed scale can be mapped back to the original scale using the inverse transformation g⁻¹.
The adequacy of the transformation can be checked by plotting δ̂*_b against the standard error on the variance stabilized scale, an approximation of which is given using the delta method (DH p. 113 [26]):

SÊ(g(δ̂*_b)) = SÊ*_b / s(δ̂*_b)
Note that when describing the general case of obtaining a variance stabilized interval, Efron and
Tibshirani (ET p. 165 [23]) use two sets of bootstrap resamples: one to estimate the transformation
and the other to obtain the confidence limits. This is in order to limit the computation required in cases where the standard error of the estimator cannot be calculated directly. For differences in
means one set of bootstrap resamples is adequate.
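The six steps above can be sketched in Python as follows. This is an illustrative implementation only, not the authors' Stata program: it uses a lowess smoother from statsmodels for step 3 and the trapezoid rule for step 4 (the authors used cubic splines), and the smoother bandwidth of 0.8 is an arbitrary choice:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess
from scipy.integrate import cumulative_trapezoid

def vs_bootstrap_t_interval(y, z, B=2000, alpha=0.05, seed=None):
    rng = np.random.default_rng(seed)
    m, n = len(y), len(z)
    d_obs = y.mean() - z.mean()
    # Steps 1-2: stratified resamples, differences in means and their standard errors
    yb = rng.choice(y, size=(B, m), replace=True)
    zb = rng.choice(z, size=(B, n), replace=True)
    d_b = yb.mean(axis=1) - zb.mean(axis=1)
    se_b = np.sqrt(yb.var(axis=1, ddof=1) / m + zb.var(axis=1, ddof=1) / n)
    # Step 3: smooth function s(.) describing how the SE varies with the difference
    sm = lowess(se_b, d_b, frac=0.8, return_sorted=True)
    x_grid, s_grid = sm[:, 0], np.maximum(sm[:, 1], 1e-12)
    # Step 4: g(x) = integral of 1/s(u) du, here by the trapezoid rule
    g_grid = cumulative_trapezoid(1.0 / s_grid, x_grid, initial=0.0)
    g = lambda x: np.interp(x, x_grid, g_grid)
    g_inv = lambda v: np.interp(v, g_grid, x_grid)
    # Step 5: bootstrap-t on the transformed scale, with standard error set to 1
    t_star = g(d_b) - g(d_obs)
    t_lo, t_hi = np.quantile(t_star, [alpha / 2, 1 - alpha / 2])
    # Step 6: map the endpoints back to the original scale
    return g_inv(g(d_obs) - t_hi), g_inv(g(d_obs) - t_lo)
```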

4.3.2. Percentile intervals. Using percentile methods to calculate a confidence interval for a difference in means requires resampling to obtain a bootstrap distribution of differences in means δ̂*_b. For the simplest bootstrap percentile method, the approximate 100(1 − α) per cent confidence interval is given by the 100(α/2) and 100(1 − α/2) percentiles of the distribution. If the bootstrap distribution is approximately normal these confidence limits will be close to those from the standard normal approach. Despite the attractive simplicity of this method, its general use is not recommended. It has been shown both theoretically [29, 33, 37] and through simulation studies [38, 39] that this


method can be inaccurate. In particular, the coverage error for these intervals is substantial if the
distribution of δ̂ is not nearly symmetric around the observed value.
The ‘adjusted’ or ‘bias corrected and accelerated’ method (BCa), proposed by Efron [29], is a more reliable percentile approach. The BCa interval is again given by percentiles of the bootstrap distribution of differences in means, but the percentiles used are chosen after correction for skewness or ‘acceleration’ â and bias ẑ₀. Efron has shown that the procedure requires at least 1000 bootstrap resamples (ET p. 275 [23]).
The BCa limits for a 100(1 − α) per cent confidence interval are given by the 100α₁ and 100α₂ percentiles of the distribution of bootstrap differences in means δ̂*_b, where

α₁ = Φ( ẑ₀ + (ẑ₀ + z_(α/2)) / (1 − â(ẑ₀ + z_(α/2))) )   and   α₂ = Φ( ẑ₀ + (ẑ₀ + z_(1−α/2)) / (1 − â(ẑ₀ + z_(1−α/2))) )

with Φ(·) being the standard normal cumulative distribution function and z_(x) the 100x percentile point of the standard normal distribution. The bias correction ẑ₀ is calculated from the proportion of bootstrap replicates less than the original estimate δ̂, and measures the median bias of δ̂* as the discrepancy between δ̂* and δ̂ in normal units, that is, ẑ₀ = Φ⁻¹(#{δ̂*_b < δ̂}/B). Since the difference in means is an unbiased estimator this value will be close to zero. The acceleration or skewness correction â can be estimated in different ways. One method proposed by Efron and Tibshirani [23] is to use one-sixth of the jack-knife estimate of skewness, calculated as

â = Σ_{i=1}^{n} (δ̂_(·) − δ̂_(i))³ / ( 6 { Σ_{i=1}^{n} (δ̂_(·) − δ̂_(i))² }^{3/2} )

where δ̂_(i) is the statistic calculated from the jack-knife sample with the ith point deleted and δ̂_(·) = Σ_{i=1}^{n} δ̂_(i)/n, the mean of the jack-knife values.
The BCa interval provides an improvement over the basic percentile approach both in theory and practice. A disadvantage of the method however is that the coverage error has been shown to increase as α goes to zero, that is, for higher percentage confidence intervals [32].
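A minimal sketch of the BCa calculation in Python. The bias correction and acceleration follow the formulae above; for the two-sample difference in means, the jack-knife is taken here over all N observations, deleting one at a time from whichever group it belongs to, which is a natural but not the only possible extension of the single-sample formula:

```python
import numpy as np
from scipy.stats import norm

def bca_interval(y, z, B=2000, alpha=0.05, seed=None):
    rng = np.random.default_rng(seed)
    m, n = len(y), len(z)
    d_obs = y.mean() - z.mean()
    # Bootstrap distribution of the difference in means (stratified resampling)
    d_b = (rng.choice(y, size=(B, m), replace=True).mean(axis=1)
           - rng.choice(z, size=(B, n), replace=True).mean(axis=1))
    # Bias correction: median bias of the bootstrap replicates in normal units
    z0 = norm.ppf(np.mean(d_b < d_obs))
    # Acceleration: one-sixth of the jack-knife estimate of skewness,
    # deleting one observation at a time from either group (assumed extension)
    jack = np.concatenate([
        [np.delete(y, i).mean() - z.mean() for i in range(m)],
        [y.mean() - np.delete(z, j).mean() for j in range(n)],
    ])
    dev = jack.mean() - jack
    a_hat = np.sum(dev**3) / (6.0 * np.sum(dev**2) ** 1.5)
    # Skewness- and bias-adjusted percentiles of the bootstrap distribution
    def adjusted(q):
        zq = norm.ppf(q)
        return norm.cdf(z0 + (z0 + zq) / (1.0 - a_hat * (z0 + zq)))
    lower = np.quantile(d_b, adjusted(alpha / 2))
    upper = np.quantile(d_b, adjusted(1 - alpha / 2))
    return lower, upper
```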

5. APPLICATION TO EXAMPLES

Here the methods described in the previous sections are applied to cost data from the two trials
introduced in Section 2. The stratified bootstrap resampling (Section 4.1) was implemented in Stata [40] as an extension of its single sample bootstrapping facility. The ‘automatic’ method of variance stabilization was programmed with a lowess running line smoother to describe the relationship between the difference in means and standard error, and cubic splines for numerical integration. The BCa method was implemented using calculations of bias and acceleration shown in the previous section. Stata programs are available from the first author. Similar implementations
are available in S-plus [26] and GAUSS [38]. In the analyses below, randomization and bootstrap
methods are based on 2000 resamples.

5.1. Back pain trial


Table I(a) shows results from the various methods of carrying out hypothesis tests for the back
pain trial cost data illustrated in Figure 1(a). An analysis of log transformed costs is included,
even though the cost data from this trial are not log-normal.


Table I. Results from comparison of total costs in the two groups for the back pain trial:
(a) hypothesis tests; (b) 95 per cent confidence intervals for difference in means ($).
(a)

Test P-value (two-sided)


Methods assuming normality:
t-test (equal variances) 0.321
t-test (unequal variances) 0.314

Resampling methods (B = 2000):


randomization test 0.348
bootstrap test 0.329
variance stabilized bootstrap test 0.271

Other methods:
t-test of log (cost + 4.65)∗ 0.008
Mann–Whitney U-test 0.023
∗ Constant added to allow inclusion of zeros, and chosen to minimize skewness of the log
transformed data.
(b)

Confidence interval method Lower limit Upper limit

Methods assuming normality:


standard t (equal variances) −146 442
standard t (unequal variances) −142 439

Bootstrap methods (B = 2000):


bootstrap-t −93 515
variance stabilized bootstrap-t −81 511
percentile interval −124 447
BCa percentile interval −90 506

Results from the standard t-tests assuming equal and unequal variances, randomization test, and
the bootstrap tests all give two-sided P-values of around 0.3, indicating no evidence of a difference
in mean cost between the randomized groups. Given the severely skewed distributions of the cost
data, such similarity of P-values is surprising, but clearly indicates the robustness of the t-test
for these data. Of the bootstrap tests the variance stabilized method should be preferred. This
is because a scatter plot of the 2000 bootstrap standard errors against their differences in means
shows a strong positive relationship (Figure 2(c)). Applying a variance stabilizing transformation
gives the variance stabilized bootstrap-t values depicted in Figure 2(d). This distribution is more
symmetrical around 0 than that on the untransformed scale (Figure 2(a)). A scatter plot of variance
stabilized standard errors against the di erences in means shows that the transformation has been
largely successful (Figure 2(e)). Despite the apparent need for this transformation, P-values from
the two bootstrap methods are similar in this case.
In contrast, results from the Mann–Whitney U-test and an analysis of shifted log transformed costs are significant (Table I(a)), both indicating that the exercise group therapy is more expensive than usual care (an effect in the opposite direction to that suggested by the arithmetic means). The significant P-values are a result of the large proportion of fixed costs in the two groups. For


example, when ranking the data for the Mann–Whitney U-test, the 34 per cent of patients with just the fixed cost of the exercise programme ($25.20) are ranked higher than the 43 per cent of patients with zero costs in the usual care group. Interpreting these test results simplistically would provide seriously misleading conclusions, since for these data the significant P-values do not reflect differences in arithmetic means.
The observed difference in mean cost between randomized groups (usual care minus exercise) is $148. Confidence intervals for this difference calculated using conventional and bootstrap methods are shown in Table I(b). The bootstrap confidence intervals are all slightly shifted to the right compared with the standard intervals, so that the lower and upper limits are larger and the intervals asymmetric about the observed difference in means. These features of the bootstrap intervals are a consequence of the skewness in the distributions of bootstrap-t values (Figure 2(a)) and bootstrap differences in means (Figure 2(b)). As for the bootstrap test, the relationship between the bootstrap standard errors and differences in means apparent in Figure 2(c) indicates that the variance stabilized confidence interval should be preferred to the bootstrap-t interval. However, using this approach makes little difference to the interval obtained. The percentile and BCa methods give rather different intervals; however, the asymmetry of the distribution of bootstrap differences in means (Figure 2(b)) indicates the BCa result should be preferred. Thus the favoured bootstrap methods in this example are the variance stabilized and BCa methods, for which the intervals are very similar.
The results in Table I show that the t-test is remarkably robust to non-normality for these data. This may be because the sample size was large enough for the central limit theorem to act sufficiently, or because the sample size and skewness were similar in the two groups. Differences between confidence intervals calculated assuming normality and those based on the bootstrap methods are, however, rather more obvious, although these would not result in substantially different interpretations.

5.2. Parasuicide trial


For this smaller data set with only 32 patients having cost data available for analysis (Figure 1(b)),
Table II(a) shows P-values from the various hypothesis tests. The P-values are all non-significant, indicating little evidence of a difference in costs between the groups in the trial. P-values from t-tests assuming equal and unequal variances are slightly different, reflecting the unequal standard deviations in the two groups (usual care group $1416, therapy group $595). The P-value from the randomization test is most similar to that from the t-test with equal variances and that of the bootstrap test more similar to the t-test with unequal variances. This is as expected given the null hypotheses of the randomization test (Section 3.4) and bootstrap test (Section 4.2). In the same way as described for the back pain data, the bootstrap differences in means and standard errors are not independent, and so a variance stabilizing transformation was found. Again for these data the variance stabilized result is most appropriate, and provides a P-value which, although not substantially different in interpretation from the unequal variance t-test, is somewhat smaller. The P-values from a comparison of shifted log transformed cost data and from the Mann–Whitney U-test are much larger than those from the other tests and are unlikely to be testing for differences
in arithmetic means.
The observed difference between groups (usual care minus therapy) in mean costs per month was $543. Confidence intervals for this difference are shown in Table II(b). Compared to the
methods which assume normality, the bootstrap intervals are all asymmetric having larger values


for the lower limits and, except for the percentile interval, higher values for the upper limits. Of
the bootstrap-t methods the variance stabilized approach provides the preferred interval, because
this adjusts for the dependence of the standard error on the difference in means. For the percentile methods the BCa procedure is preferred because the distribution of bootstrap differences in means is slightly asymmetric around the observed value. The variance stabilized bootstrap-t and BCa confidence intervals are similar and both rather different from those obtained under the assumption
of normality.

Figure 2. Bootstrap statistics for the back pain trial data (costs in $): (a) Q-Q plot of 2000 bootstrap-t values; (b) Q-Q plot of 2000 bootstrap differences in means; (c) scatter plot of bootstrap standard errors against bootstrap differences in means with lowess running line smoother (bandwidth = 0.8); (d) Q-Q plot of 2000 variance stabilized bootstrap-t values; (e) scatter plot (with lowess smoother) of variance stabilized bootstrap standard errors against differences in means.


Table II. Results of comparisons of total monthly costs between treatment groups for the parasuicide pilot study: (a) hypothesis tests; (b) 95 per cent confidence intervals for the difference in means ($).
(a)

Test P-value (two-sided)


Methods assuming normality:
t-test (equal variances) 0.151
t-test (unequal variances) 0.196

Resampling methods (B = 2000):


randomization test 0.148
bootstrap 0.207
variance stabilized bootstrap test 0.102

Other methods:
t-test of log (cost+56.39)∗ 0.680
Mann–Whitney U-test 0.621
∗ Constant added to allow inclusion of two zero cost values, and chosen to minimize skewness of
the log transformed data.

(b)

Confidence interval method Lower limit Upper limit

Methods assuming normality:


standard t (equal variances) −209 1295
standard t (unequal variances) −310 1396

Bootstrap methods (B = 2000):


bootstrap-t −125 1705
variance stabilized bootstrap-t −42 1511
percentile interval −145 1325
BCa percentile interval −65 1501

6. DISCUSSION

This paper highlights the importance of analysing arithmetic means when using cost data from
trials to inform health care policy decisions. Standard non-parametric methods and analyses of
transformed costs are generally inappropriate because they are not focused on arithmetic means.
t-test approaches are also not universally reliable for costs because it is difficult to be sure how
robust these methods are to non-normality for a particular data set. Amongst resampling procedures
for comparing arithmetic means, the usefulness of the randomization test is limited because it
requires an assumption of similarly shaped distributions in the two groups. Bootstrap methods are
preferred because they avoid the need for such assumptions.
In Section 5 standard t-test results were compared with bootstrap results for two example
data sets. This showed the t-test to be surprisingly robust to the severe non-normality of the
cost data in both examples. In particular the bootstrap and the t-test methods gave very similar
P-values for the moderately sized back pain trial data. For the smaller parasuicide trial, however,


tests and confidence intervals were more noticeably different, although this makes little impact on the overall interpretation. Amongst the methods of obtaining bootstrap confidence intervals considered, the most reliable are the bootstrap-t (with variance stabilization if necessary) and the BCa
procedures.
Experience with these and other similar examples indicates that the t-test may be adequate in many cases for the analysis of cost data. In particular, for trials large enough to influence health care policy, standard t-test based approaches will be robust and give results very similar to the bootstrap. However, for small and moderate sample sizes it is difficult to anticipate whether the theoretically more appropriate bootstrap analyses might give different results. Hence we recommend
that in these cases either the results from bootstrap analyses of cost data be reported directly, or
that they be used to check on the robustness of results from standard parametric methods.
We have dismissed general use of a log transformation for the analysis of cost data. Even if the
distribution is log-normal, this will only result in a comparison of arithmetic means if the variances
in the two groups on the log scale are equal. A more flexible method for testing differences in arithmetic means after a log transformation, allowing for different variances, has been proposed and applied to cost data by Tu et al. [6, 41]. Its application is, however, limited in that it still requires approximate log-normality of the data and does not provide a method for calculating confidence intervals for a difference in means. The shape of the distribution of cost data varies widely between studies, and methods relying on parametric assumptions, such as log-normality, are not generally applicable. Another method known as ‘smearing’ [11], which utilizes transformations, has been suggested for costs [10]. This approach was developed to provide an estimate of the predicted response on the untransformed scale after fitting a linear regression model on a transformed scale. It is a non-parametric method in that it does not assume a distribution for the residuals from the regression model. However ‘smearing’ is primarily concerned with prediction and does not provide a method for carrying out statistical inference about the difference in arithmetic mean
costs between groups.
Other analyses suggested in the literature for costs include some two-stage approaches. Coyle [15]
describes a method which requires first an analysis of means of restricted data, after removal of low and high cost outliers, and then a comparison of the proportions of values in low, restricted and high cost categories. Rutten-van Molken et al. [10] describe other two-stage procedures specifically for data where there are large numbers of zero cost values. In this case the first analysis considers the probability of a non-zero cost and the second the costs amongst those with positive values. Such two-stage methods are not appropriate for cost evaluations intended to inform health care policy decisions, since fitting models for different parts of the data will not give overall
information regarding a comparison of arithmetic mean costs.
Although we have concentrated on two-sample analyses of costs in this paper, bootstrap methods
can be extended to three or more group comparisons using bootstrap analysis of variance or
bootstrap regression techniques [23; 26]. In addition bootstrap multiple regression methods for costs
can be used to provide for baseline covariate adjustment and assessment of treatment–covariate
interactions (subgroup e ects) in randomized trials, and more generally for constructing models
for predicting costs from data in either trials or observational studies. Bootstrap methods are also
relevant to situations other than costs when analysis needs to focus on arithmetic means despite
non-normality of distributions. Examples include comparing use of health resources such as length
of hospital stay [42] or days lost from work due to sickness absence [43].
The non-parametric bootstrap resampling method is, however, not entirely assumption free. In
using bootstrap resampling there is an important assumption that the true distribution of the data is


adequately represented by its empirical distribution, so that bootstrap estimates are ‘only guaranteed
to be accurate as the sample size goes to infinity’ [23]. However, the performance of bootstrap methods depends on what estimator is of interest, and how well the bootstrap distribution of this estimator approximates its true distribution. For means, which are well behaved smooth functions of the sample data, bootstrap methods generally perform well [26]. Bootstrap methods for differences in means have been shown to be reliable for normal and skewed (folded normal) samples as small as 8 [39]. However, the properties of the bootstrap for small samples are still an area of current research [44]. For economic evaluations in practice bootstrap methods are likely to be reliable since these will require moderate to large sample sizes to ensure adequate power for comparisons between groups. In addition useful bootstrap diagnostic techniques exist [26]. In particular a ‘jack-knife after bootstrap’ can be used to check the sensitivity of bootstrap confidence intervals and significance levels to particular observations in the sample. Although we have not described this
jack-knife technique here, it is relevant when dealing with skewed data such as costs, particularly
when sample size is small.
In this paper we have concentrated on one of a number of important statistical issues arising in
economic evaluations. Other complex issues include how to deal with censored and missing cost
data [45], how to allow for the imprecision in fixed cost components, how to estimate required
sample sizes [46], how to draw inferences about costs and e ects jointly [47], and how to develop
a rational strategy for sensitivity analyses. As cost data becomes a more common feature in
randomized trials, there is an urgent requirement for further research to provide statistical guidance
on such issues.

ACKNOWLEDGEMENTS
This research received financial support from London NHS Executive. We thank Jennifer Klaber Moffett, David Torgerson and Peter Tyrer for permission to use their data, James Carpenter for helpful discussions
about bootstrapping and Tony Brady for help with programming in Stata.

REFERENCES
1. Drummond MF, Stoddart GL. Economic analysis and clinical trials. Controlled Clinical Trials 1984; 5:115 –128.
2. Drummond MF, O’Brien B, Stoddart GL, Torrance GW. Methods for the Economic Evaluation of Health Care
Programmes. Oxford University Press: Oxford, 1997.
3. Barber JA, Thompson SG. Analysis and interpretation of cost data in randomised controlled trials: review of published
studies. British Medical Journal 1998; 317:1195–1200.
4. Desgagne A, Castilloux A, Angers J, LeLorier J. The use of the bootstrap statistical method for the pharmacoeconomic
cost analysis of skewed data. Pharmacoeconomics 1998; 13:487– 497.
5. Briggs A, Gray A. The distribution of health care costs and their statistical analysis for economic evaluation.
Journal of Health Services Research and Policy 1998; 3:233–245.
6. Zhou X, Melfi CA, Hui SL. Methods for comparison of cost data. Annals of Internal Medicine 1997; 127:752–756.
7. Klaber Moffett J, Torgerson DJ, Bell-Syre S, Jackson D, Llewllyn-Phillips H, Farrin A, Barber JA. Randomized
controlled trial of exercise for low back pain: clinical outcomes, costs and preferences. British Medical Journal 1999;
319:279 –283.
8. Evans K, Tyrer P, Catalan J, Schmidt U, Davidson K, Dent J, Tata P, Thornton S, Barber J, Thompson S. Manual-
assisted-cognitive-behaviour therapy in the treatment of recurrent deliberate self harm: A randomized controlled trial.
Psychological Medicine 1999; 29:19 –25.
9. Altman DG, Gore SM, Gardner MJ, Pocock SJ. Statistical guidelines for contributors to medical journals. British
Medical Journal 1983; 286:1489 –1493.
10. Rutten-van Molken M, van Doorslaer EK, van Vliet RC. Statistical analysis of cost outcomes in a randomized controlled
clinical trial. Health Economics 1994; 3:333 –345.
11. Duan N. Smearing estimate: a non-parametric retransformation method. Journal of the American Statistical Association
1983; 78:605– 610.


12. Armitage P, Berry G. Statistical Methods in Medical Research. Blackwell Scientific Publications: Cambridge, 1987.
13. Thompson SG, Barber JA. How should cost data in randomized controlled trials be analysed? British Medical Journal
2000; 320:1197–1200.
14. Gray AM, Marshall M, Lockwood A, Morris J. Problems in conducting economic evaluations alongside clinical trials.
Lessons from a study of case management for people with mental disorders. British Journal of Psychiatry 1997;
170:47–52.
15. Coyle D. Statistical analysis in pharmacoeconomic studies: A review of current issues and standards.
Pharmacoeconomics 1996; 9:506 –516.
16. Kendall MG, Stuart A. The Advanced Theory of Statistics. Volume I: Distribution Theory. Charles Griffin & Company
Limited: London, 1969.
17. Bradley JV. Distribution-free Statistical Tests. Prentice-Hall: New Jersey, 1968.
18. Welch BL. The generalisation of ‘Student’s’ problem when several different population variances are involved.
Biometrika 1947; 34:28–35.
19. Gayan AK. Significance of difference between the means of two non-normal samples. Biometrika 1950; 37:399–408.
20. Pearson ES, Please NW. Relation between the shape of population distribution and the robustness of four simple test
statistics. Biometrika 1975; 62:223–241.
21. Geary RC. Testing for normality. Biometrika 1947; 36:353 –369.
22. Fisher RA. The Design of Experiments. Oliver & Boyd: Edinburgh, 1935.
23. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Chapman and Hall: New York, 1993.
24. Romano JP. On the behaviour of randomization tests without a group invariance assumption. Journal of the American
Statistical Association 1990; 85:686 – 692.
25. Boik RJ. The Fisher-Pitman permutation test: a non-robust alternative to the normal theory F test when variances are
heterogeneous. British Journal of Mathematical and Statistical Psychology 1987; 40:26–42.
26. Davison AC, Hinkley DV. Bootstrap Methods and their Application. Cambridge University Press, 1997.
27. Efron B. Bootstrap methods: another look at the jackknife. Annals of Statistics 1979; 7:1–26.
28. Efron B. Non-parametric standard errors and confidence intervals (with discussion). Canadian Journal of Statistics
1981; 9:139 –172.
29. Efron B. Better bootstrap con dence intervals. Journal of the American Statistical Association 1987; 82:171–200.
30. Carpenter JR. Test inversion bootstrap confidence intervals. Journal of the Royal Statistical Society, Series B 1999;
69:159 –172.
31. Tibshirani RJ. Variance stabilization and the bootstrap. Biometrika 1988; 75:433– 444.
32. Carpenter J, Bithell J. Bootstrap confidence intervals: when? which? what? A practical guide for medical statisticians.
Statistics in Medicine 2000; 19:1141–1164.
33. Hall P. Theoretical comparison of bootstrap confidence intervals. Annals of Statistics 1988; 16:927–953.
34. DiCiccio TJ, Efron B. Bootstrap confidence intervals (with discussion). Statistical Science 1996; 11:189–228.
35. Royston P, Altman DG. Regression using fractional polynomials of continuous covariates: parsimonious parametric
modelling. Applied Statistics 1994; 43:429 – 467.
36. Cleveland WS. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical
Association 1979; 74:829 –836.
37. Schenker N. Qualms about bootstrap confidence intervals. Journal of the American Statistical Association 1985;
80:360 –361.
38. Mooney CZ, Duval RD. Bootstrapping: A Nonparametric Approach to Statistical Inference. Sage Publications:
Newbury Park, 1993.
39. Hall P, Martin M. On the bootstrap and two sample problems. Australian Journal of Statistics 1988; 30A:179 –192.
40. Stata Corporation. Stata Reference Manual: Release 6. Stata Corporation: Texas, 1999.
41. Tu W, Zhou X. A Wald test comparing medical costs based on log-normal distributions with zero values costs.
Statistics in Medicine 1999; 18:2749 –2761.
42. Burns T, Creed F, Fahy T, Thompson S, Tyrer P, White I for the UK700 Group. Intensive versus standard case
management for severe psychotic illness: a randomised trial. Lancet 1999; 353:2185–2189.
43. Pocock SJ. When not to rely on the central limit theorem – An example from absenteeism data. Communications in
Statistics – Theory and Methods 1982; 11:2169 –2179.
44. Wolf-Ostermann K. Bootstrap testverfahren fur lokationsparameter univariater verteilungen. PhD thesis, Department of
Statistics, University of Dortmund, 1997.
45. Lin DY, Feuer EJ, Etzioni R, Wax Y. Estimating medical costs from incomplete follow-up data. Biometrics 1997;
53:419 – 434.
46. Al MG, van Hout BA, Michel BC, Rutten FF. Sample size calculation in economic evaluations. Health Economics
1998; 7:327–335.
47. Chaudhary MA, Stearns SC. Estimating confidence intervals for cost-effectiveness ratios: An example from a randomized
trial. Statistics in Medicine 1996; 15:1447–1458.

