
Published Ahead of Print on January 31, 2006, as 10.2105/AJPH.2003.036343
 PUBLIC HEALTH MATTERS 

Sufficiency and Stability of Evidence for Public Health Interventions Using Cumulative Meta-Analysis
| Paige Muellerleile, PhD, Brian Mullen, PhD

We propose cumulative meta-analysis as the procedure of completing a new meta-analysis at each successive wave in a research database. Two facets of cumulative knowledge are considered: the first, sufficiency, refers to whether the meta-analytic database adequately demonstrates that a public health intervention works. The second, stability, refers to the shifts over time in the accruing evidence about whether a public health intervention works. We used a hypothetical data set to develop the indicators of sufficiency and stability, and then applied them to existing, published datasets. Our discussion centers on the implications of the use of this procedure in evaluating public health interventions. (Am J Public Health. 2006;96:XXX–XXX. doi:10.2105/AJPH.2003.036343)

Meta-analysis is the statistical integration of the results of independent studies.1–4 This approach to the quantitative review of the weight of evidence has proven to be useful in helping determine the effectiveness of public health interventions. Meta-analysis has been used to gauge the effectiveness of interventions aimed at changing patient behavior,5–7 interventions aimed at changing physician behavior,8 and interventions aimed at more far-reaching public health policy.9

Traditional meta-analysis can inform public health interventions and policies, usually to determine whether an intervention has an impact on health practices, and the magnitude of that impact. However, traditional meta-analysis overlooks 2 aspects of public health information. The first is sufficiency. Sufficiency refers to whether the meta-analytic database adequately demonstrates that a public health intervention works. For example, 1 meta-analysis10 synthesizes the relationship between socioeconomic status and self-esteem, integrating the results of 446 hypothesis tests conducted among 312 940 participants. This number of hypothesis tests raises the question of whether there was sufficient justification for using valuable research and participant resources to conduct the 446th hypothesis test to help establish the relationship between socioeconomic status and self-esteem. If there was little value in adding the 446th hypothesis test, was there sufficient value in the 445th test? What about the 200th test?11 For many public health issues, collecting additional evidence for an already-established effect may waste more than research and participant resources: delaying implementation of effective risk-reduction interventions may also waste health care resources, employer costs, and lives.

The second of the aspects overlooked by traditional meta-analysis is stability. Stability refers to the shifts over time in the accruing evidence about whether a public health intervention works. For example, the purported effects of sex education have been controversial. Studies of the efficacy of sex education programs have rendered conflicting estimates of the effects of these programs on adolescent sexual activity: some studies indicate that sex education programs decrease sexual activity.12 Others indicate that sex education programs do not appear to influence rates of sexual activity.13 Still others indicate that sex education programs lead to increased sexual activity.14 As additional studies are added, the estimate of the typical effect of sex education programs on adolescent sexual activity may continue to fluctuate.15 For a number of public health issues, implementing effective interventions is a worthwhile effort. However, implementing ineffective interventions can waste health care resources, employer costs, and lives.

We describe these 2 aspects of cumulative knowledge in the public health context. We discuss previous efforts to interpret cumulative meta-analysis, explain indicators of sufficiency and stability to aid interpretation of cumulative meta-analysis, and consider the use of the indicators of sufficiency and stability in a set of previously published meta-analyses.

CUMULATIVE META-ANALYSIS

Cumulative meta-analysis refers to the process of performing new meta-analyses at successive points in time in a research domain.16 Therefore, at each “wave” of the database (each time a study is added), a separate meta-analysis is conducted. For simplicity, all of the examples in this paper are assumed to conform to usual standards for performing an informative meta-analysis. These standards involve thorough literature searches using well-defined criteria for a specific hypothesis, and consideration of the methodological soundness of the studies to be included. They also involve careful and consistent extraction of precise tests of significance and effect size. Comprehensive discussions of standards for performing meta-analyses can be found in several sources.1–4,17,18

To illustrate the examination of evidence for sufficiency and stability in cumulative meta-analysis, we will make use of a hypothetical data set that has previously been used to illustrate other meta-analytic issues.1,2,16,19,20 Table 1 describes the data set, which includes the results of 10 studies of the effects of X on Y. For this example, let X = some public health intervention (e.g., seatbelt laws) and let Y = some public health outcome (e.g., traffic fatalities). For each hypothesis test, Table 1 also presents the corresponding Z for significance and ZFisher for effect size (the Fisher logarithmic transformation of the product moment r for effect size). This use of ZFisher is consistent with the meta-analytic techniques4,21,22 used in this effort. Cumulative meta-analyses using Cohen d or Hedges δ, or

March 2006, Vol 96, No. 3 | American Journal of Public Health Muellerleile and Mullen | Peer Reviewed | Public Health Matters | 1

any other linear metric of effect size, could be conducted similarly.

TABLE 1—Hypothetical Meta-Analytic Database

                                                           Significance Levels   Effect Sizes
Study  Year  Statistic (df)       n    Direction of Effect(a)   Z      P          ZFisher  r
1      1981  χ²(1) = 23.000       110  +                        4.80   8.35E−7    0.49     .457
2      1982  r(78) = .335         80   +                        3.04   .00119     0.35     .335
3      1983  P = .000001          80   +                        4.75   .000001    0.59     .531
4      1984  t(98) = 6.500        100  +                        5.91   2.08E−9    0.62     .549
5      1985  F(1, 88) = 15.000    90   +                        3.71   .00010     0.40     .382
6      1986  F(1, 63) = 10.250    65   +                        3.07   .00107     0.39     .374
7      1987  r(68) = .535         70   +                        4.77   9.45E−7    0.60     .535
8      1988  Z = 3.891            70   +                        3.891  .00005     0.50     .465
9      1989  t(63) = 6.000        65   +                        5.31   5.79E−8    0.70     .603
10     1990  P = .01              60   +                        2.33   .01        0.31     .300

Note. Data in this table are fabricated.
(a) Results of these hypothesis tests are in the expected direction.
Source. Mullen B.1,2

Initially, assessment of the evidence for sufficiency and stability comes from visual examination of the results of a cumulative meta-analysis. Figure 1a presents a graph of the data set described in Table 1. The data point at wave 1 represents the effect size from the first study, published in 1981. Its value is ZFisher = 0.49. At wave 2, the data point represents the mean effect size, combining the data from the first and second studies. The value of the effect size from the second study is ZFisher = 0.35, resulting in a mean effect size of Z̄Fisher 2 = 0.42. Performing a new meta-analysis for each of the 10 waves in the database results in a mean effect size of Z̄Fisher 10 = 0.50.

The 95% confidence intervals (CIs) around each Z̄Fisher i are not intended for use as estimators of inferential probabilities. Cumulative meta-analysis necessarily involves multiple tests of the same hypothesis, and using CIs for estimating inferential probabilities therefore increases the likelihood of committing a Type I error. In this context, rather than being indications of the likelihood that the effects are significant, the CIs indicate the range of values that are statistically equivalent to the parameter. In other words, the CIi around the Z̄Fisher i for wave i indicates the range of values indistinguishable from the parameter value. Generally, the CIis become narrower as the number of hypothesis tests, ki, increases, and as the cumulative sample size, ΣNi, increases.19 For a Z̄Fisher i that remains constant, then, additional studies result in narrower CIis around that mean, which decreases the range of values for the effect size that are statistically equivalent to the true effect size.

From the first wave through the end of the database, the evidence for the effect of X on Y appeared to be sufficient: the CIi around the mean effect size did not include the value of zero. Put differently, the range of values for the mean effect size at each wave appeared to be statistically different from a null effect. Therefore, it would be hard to argue for additional research about the effects of X on Y, as it appears that the effect was there from the start. Similarly, from the first wave through the end of the database, the evidence for the effect of X on Y appeared to be stable: there is little change in the value of the mean effect. Therefore, it would be hard to argue for additional research to determine whether the emergent picture of the effect of X upon Y might change. Although the visual information presented in Figure 1a portrays a simple data set for which this interpretation is straightforward, one cannot expect real datasets to be so obliging. In real datasets, it may be very difficult to determine when there was sufficient evidence to determine that X had a particular effect on Y. Moreover, in real datasets, it may be very difficult to determine when the effect became stable; that is, the point at which the value of the effect of X upon Y did not change appreciably from one wave to the next.

PREVIOUS EFFORTS TO INTERPRET CUMULATIVE META-ANALYSES

Previous meta-analytic undertakings23–26 have not differentiated sufficiency from stability; however, both sufficiency and stability are implied in these efforts. For example, Lau and others24 observed that there was sufficient evidence for researchers to have shown intravenous streptokinase for acute infarction to be a lifesaving therapy 25 years before its approval by the Food and Drug Administration. Likewise, they noted that 2 additional clinical trials did not change the value of the therapy established by the preceding evidence. Nevertheless, previous efforts to interpret cumulative meta-analyses were based on visual inspection of the accumulating results, similar to the foregoing discussion of results portrayed in Figure 1. However, visual inspection of accumulating results may not yield a straightforward answer about whether there is sufficient evidence to determine that X has an effect on Y, nor does it necessarily yield a straightforward answer about whether that effect has achieved stability.

Pogue and Yusuf25 suggested a different approach for determining when accumulating evidence is statistically significant, which involves the adaptation of classical monitoring boundaries. They propose that the cumulative meta-analyst calculate an “optimum information size,” which is the cumulative sample size needed to demonstrate an effect, in light of event rates and the minimum reasonable values of the independent variable that would be considered consequential.

Although their efforts to produce a method for statistical inference within cumulative meta-analysis are commendable, there has been little debate about the efficacy of the proposed monitoring boundaries. We propose the use of more straightforward indicators of sufficiency and stability, even though there may not be accompanying inferential probabilities for them. The first reason for using more straightforward indicators is their simplicity. The second reason for using more


straightforward indicators is that Pogue and Yusuf25 require a priori specification of the optimum information size. However, a researcher must know what the event rates might be—which requires an understanding of what minimum effects of the independent variable are both consequential and reasonable—before specifying the optimum information size. In other words, the researcher would need extensive knowledge of the observed results of the accumulated research before undertaking a cumulative meta-analysis to understand the observed results of the accumulated research. Finally, the third reason for using more straightforward indicators is that Pogue and Yusuf were concerned only with sufficiency: whether additional evidence is needed to establish that X has some effect upon Y. They did not address whether that effect has become stable across waves in a database.25 For these reasons, we propose that cumulative meta-analysts make use of more straightforward indicators of (both) sufficiency and stability.

INDICATORS OF SUFFICIENCY AND STABILITY

The indicators we propose rely on inspection of graphs of a type of meta-analytic “time-series” data.16 Some researchers have argued there is little agreement among judges who interpret visual information,27,28 which may result in different conclusions about those data than conclusions on the basis of statistical analysis.29–31 However, a meta-analysis on the subject of visual interpretation of data showed that interjudge agreement can be quite good.32 Moreover, other scholars have recommended guidelines for creation of graphical presentations that facilitate interjudge agreement (e.g., consistent axes and scaling).33–41 Following such guidelines, we hope to reduce the potential for lack of agreement among judges.

Clearly, the hypothetical database presented in Table 1, used to generate Figure 1, appears to demonstrate sufficient evidence for the effect of X upon Y. It also appears that the effect of X upon Y is stable. However, real research databases are unlikely to be as clear-cut as this one. Therefore, we will outline the procedures for generating indicators of sufficiency and stability using the hypothetical database, and then use the same procedures in real databases.

Source. Mullen.2
FIGURE 1—Cumulative meta-analysis using Mullen’s hypothetical database (a), failsafe ratio (b), individual hypothesis tests (c), and cumulative slope (d).

The Failsafe Ratio

There is a bias in favor of publishing reports of significant results.42–47 The consequence of the bias is the possibility that unpublished or unknown studies with null results may exist in researchers’ file drawers.47 To address the file drawer problem, Rosenthal developed a technique for estimating the number of unpublished, unretrieved studies with null results that would have to exist in file drawers that would bring the overall combined probability to just significant at the α = 0.05 level. The resulting “failsafe number” (Nfs(P = 0.05)42) is calculated as follows:

(1)  $N_{fs(p=.05)} = \frac{\left(\sum Z\right)^{2}}{(1.645)^{2}} - k$

Rosenthal47 noted that it would be unlikely that there would be 5 times as many unretrieved studies as there were in the meta-analyst’s database. He proposed that Nfs(P = 0.05) exceed 5k + 10 (the addition of 10 studies would ensure that for very small meta-analytic databases of 1 or 2 studies, the number of unretrieved studies would be 15 or 20, rather than only 5 or 10). The importance of the failsafe number Nfs(P = 0.05) and Rosenthal’s47 5k + 10 standard is illustrated by the studies that use it.48–52 The “failsafe ratio” is an indicator of the relative sizes of


the failsafe number and the Rosenthal standard, and is calculated as follows:

(2)  $\text{Failsafe Ratio} = \frac{\frac{\left(\sum Z\right)_{i}^{2}}{(1.645)^{2}} - k_{i}}{5k_{i} + 10}$

where ki = the number of studies in the database at wave i. If the failsafe ratio is less than 1.000, then Nfs(P = 0.05)i at wave i has not exceeded the 5ki + 10 standard. Thus, the results at wave i are still vulnerable to future null results. If the failsafe ratio exceeds 1.000, then Nfs(P = 0.05)i at wave i has exceeded the 5ki + 10 standard. Thus, the results at wave i will tolerate future null results.

Figure 1b displays the cumulative meta-analysis from Figure 1a, with the addition of the failsafe ratio that was calculated at each wave of the database. For example, the first wave had 1 study (k1 = 1), and the Nfs(P = 0.05)1 = 7.5. Therefore, the value of the failsafe ratio would be:

(3)  $\text{Failsafe Ratio} = \frac{7.5}{5(1) + 10} = 0.500$

Because the value of the failsafe ratio is less than 1.000, the results at wave 1 are still vulnerable to future null results. The second wave added 1 study (k2 = 2), and the Nfs(P = 0.05)2 = 20.7. Therefore, the value of the failsafe ratio would be:

(4)  $\text{Failsafe Ratio} = \frac{20.7}{5(2) + 10} = 1.035$

Because the failsafe ratio exceeds 1.000, the results at wave 2 are likely to tolerate future null results. The value of the failsafe ratio continues to increase to a value of 10.483 by the 10th wave of the database.

Inspection of the failsafe ratio displayed in Figure 1b reveals that the number of studies in the database with null results needed to reduce the combined significance to P = 0.05 becomes excessive beyond the second wave in the database, where the failsafe ratio exceeds 1.000. From that point onward in time, there seems to be no need for additional research to establish the effect of X on Y; there is sufficient evidence that the phenomenon exists, and additional research is unlikely to change the weight of that evidence.

Although the failsafe ratio can indicate the sufficiency of a research database, it does not adequately address the stability of the effect size. To the extent that the results of additional studies are of different magnitudes (as long as they are not null effects, on average), there can be fluctuations in the magnitude of the cumulative effect size that will not be captured by examination of the failsafe ratio. It is necessary to consider a more direct indicator of stability.

The Cumulative Slope

One way to determine whether there is a change in a database over time is to plot the data and examine the slope of the plotted points. The combined effect sizes presented as Z̄Fisher i at each wave can mask the change in effect size in successive waves. In Figure 1c, each data point’s placement has been preserved across waves, rather than presenting the average effect for each wave. For example, the first study in the database appears at wave 1 (ZFisher = 0.49). That same data point is also displayed at subsequent waves. The second study in the database appears at wave 2 (ZFisher = 0.35). That data point, along with the first, is displayed at all subsequent waves. In the final wave of the database, all 10 data points appear, for a total of 55 data points in the figure.

Additionally, the regression line in Figure 1c is the result of the regression of the Σki = 55 data points across each wave upon ki as a predictor. The purpose of the regression is to estimate the rate of change (slope) across all of the waves of the meta-analytic database. It would be inappropriate to use the slope to derive inferential probabilities, because meta-analytic data violate the assumptions of the general linear model for statistical inference.1,2,4,53,54 However, the least-squares estimates of regression parameters like the slope and the intercept are not biased. In the hypothetical database, the regression equation that results from the 55 pairs of ZFisher-and-ki data points is ẐFisher = 0.46 + 0.004(k). In the cumulative meta-analysis, the slope (0.004) indicates that the best-fitting line levels off as the number of hypothesis tests increases. In other words, the effect becomes stable, not changing dramatically across waves in the database. A comparison of the size of the slope in successive waves in the database provides the cumulative meta-analyst with a means of determining whether a phenomenon has become stable.

Figure 1c does not show how the regression line may have changed across waves, which would indicate the point in the database at which the regression line became stable. In contrast, Figure 1d displays the cumulative meta-analysis from Figure 1a, with the addition of the cumulative slope, which changes as regressions are performed on each of the successive pairs of ZFisher-and-ki data points at each wave of the database. In Figure 1d, the absolute values of the slopes resulting from regressing effect size on each successive wave i comprise the “cumulative slope.” Absolute value is used because of the chance that the first few effect sizes are larger (resulting in a negative slope) or smaller (resulting in a positive slope) than the eventual mean effect size. For example, the first cumulative slope plotted at wave 2, β = −0.070, represents the slope from the 3 pairs of data points at waves 1 and 2. The values of the slopes fluctuate between |−0.070| at wave 2, and +0.023 at wave 4. After that point, they level out at around 0.000.

Inspection of the slopes displayed in Figure 1d reveals that the phenomenon becomes stable after the third wave in the database, where the value of the cumulative slope approaches 0.000. Thus, to the extent that the cumulative slope is different from 0.000, the cumulative weight of evidence continues to fluctuate. As the cumulative slope approaches 0.000, the cumulative weight of evidence has become stable. In other words, additional studies are unlikely to change the picture of the phenomenon.

Examining sufficiency and stability as complementary aspects of an emerging cumulative meta-analytic database allows the analyst to consider the separate contributions that sufficiency and stability can make toward understanding the phenomenon. In the case of a phenomenon that appears to be strong at the outset, a cumulative slope of 0.000 indicates that additional studies would continue to support the phenomenon’s existence


(high sufficiency). However, in the case of a phenomenon that appears to be negligible or null at the outset, a cumulative slope of 0.000 suggests that additional studies would not support existence of the phenomenon (low sufficiency). As such, the cumulative slope is a better indicator of the stability of a phenomenon than of sufficient evidence for it.

Summary

Figure 1a displays a hypothetical example of a database that is both sufficient and stable from the outset. The indicators of sufficiency (failsafe ratio) and stability (cumulative slope) permit the cumulative meta-analyst to determine when there was sufficient evidence for the existence of the phenomenon, and when it became stable. Because the failsafe ratio and cumulative slope established sufficiency and stability early in the hypothetical database, these indicators appear to correspond with the conclusion the analyst might have drawn from an examination of Figure 1a. The following section makes use of these indicators in real meta-analyses that are less obvious than the hypothetical example.

APPLICATIONS TO ACTUAL META-ANALYSES

A selection of meta-analyses published in the public health literature can illustrate the application of these indicators of sufficiency and stability. Selection of the following 4 meta-analyses was on the basis of 2 factors: they attempted to evaluate the effectiveness of a particular public health intervention, and they used compatible meta-analytic techniques. The selection includes McArthur’s55 integration of a school-based intervention on heart-healthy eating behaviors, White and Pitts’56 integration of drug education interventions for reducing drug use, Koger et al.’s57 integration of music therapy interventions for increasing skills among adults with dementia, and Acton and Kang’s58 integration of interventions to reduce burden among caregivers for adults with dementia. By happenstance, 2 of these datasets address issues for youth,55,56 and 2 datasets address issues for older adults.57,58 Table 2 provides descriptive information for these studies.

Figure 2 depicts the cumulative meta-analyses for the 4 meta-analytic databases, including the failsafe ratio and the cumulative slope. Examination of Figure 2a reveals that the heart-healthy nutrition programs55 achieved sufficiency (with the failsafe ratio exceeding the critical value of 1.000) from the outset, with a modest meta-analytic database of k1 = 2 hypothesis tests. This achievement of sufficiency occurred before 10 of 12, or approximately 83%, of all of the potentially includable hypothesis tests were conducted. Examination of Figure 2a also reveals that the heart-healthy nutrition programs achieved stability (with the cumulative slope approaching 0.000) at the second wave in the database, with a modest database of k2 = 7 hypothesis tests. Stability was achieved before 5 of 12, or approximately 42%, of all of the potentially includable hypothesis tests were conducted. Early examination of a cumulative meta-analysis of heart-healthy eating programs would have revealed that excessive time and effort were invested in evaluating a program for which sufficiency and stability had been established much earlier.

Examination of Figure 2b reveals a different pattern. The drug abuse prevention programs56 did not achieve sufficiency at all, even after 10 studies (k8 = 11 hypothesis tests). However, stability (at a null effect) was established early in the cumulative meta-analytic database (4 studies, k3 = 4 hypothesis tests, 9 years before the meta-analysis, and before 54% of the includable hypothesis tests). The cumulative meta-analysis for the drug abuse prevention programs would have revealed that a good deal of effort and resources had been invested in conducting research on a phenomenon that never achieved sufficiency, yet for which stability might have been established long ago.

A third picture emerges from examination of Figure 2c. The music therapy interventions57 did not achieve sufficiency until after 7 studies (k4 = 7 hypothesis tests, 6 years before the meta-analysis, and before 67% of the includable hypothesis tests). Further examination of Figure 2c reveals that the interventions achieved stability just after that point (k5 = 11 hypothesis tests, 5 years before the meta-analysis, and before 48% of the includable hypothesis tests). The cumulative meta-analysis for the effectiveness of music therapy for adults with dementia reveals that excessive time and effort were invested in evaluating programs for which sufficiency and stability had been established much earlier. However, unlike the nutrition programs in Figure 2a, the cumulative meta-analysis for music therapy would have indicated that more data needed to accumulate before the sufficiency and stability of the intervention effectiveness could be established.

Finally, the picture that emerges in Figure 2d is similar to that of Figure 2b. The caregiver burden reduction programs58 did not achieve sufficiency at all, even after 24 studies (k13 = 27 hypothesis tests). Stability,

TABLE 2—Actual Meta-Analyses Used to Illustrate Cumulative Meta-Analysis

Study                No. of Studies   No. of Hypothesis Tests   Total No. of Participants
McArthur55           9                12                        3828
  Hypothesis: School-based cardiovascular programs that focus on nutrition will increase heart-healthy eating.
White and Pitts56    10               11                        13 201
  Hypothesis: Interventions for adolescents and young adults reduce marijuana use at > 2-year follow-up.
Koger et al.57       21               21                        336
  Hypothesis: Music therapy improves behavioral, social, emotional, and cognitive skills of adults with dementia.
Acton and Kang58     24               27                        1293
  Hypothesis: Interventions for caregivers of adults with dementia reduce negative cognitive and behavioral consequences for the caregiver.
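
The two indicators applied to the meta-analyses in Table 2 can be sketched in a few lines of code. The following is a minimal illustration, not the authors' software: it assumes the failsafe ratio is the failsafe number divided by the 5k + 10 standard, and that the cumulative slope at wave i is the least-squares slope obtained when every study's ZFisher is repeated at each wave from its entry through wave i. The function names are hypothetical, and the input values are the fabricated Z and ZFisher columns of the hypothetical database in Table 1.

```python
"""Sketch: failsafe ratio and cumulative slope at each wave of a database."""

# Z (significance) and ZFisher (effect size) columns of the article's
# hypothetical Table 1, in order of publication year (fabricated data).
Z_SIGNIFICANCE = [4.80, 3.04, 4.75, 5.91, 3.71, 3.07, 4.77, 3.891, 5.31, 2.33]
Z_FISHER = [0.49, 0.35, 0.59, 0.62, 0.40, 0.39, 0.60, 0.50, 0.70, 0.31]


def failsafe_ratio(z_values, z_crit=1.645):
    """Failsafe number for the given studies divided by the 5k + 10 standard."""
    k = len(z_values)
    failsafe_number = sum(z_values) ** 2 / z_crit ** 2 - k
    return failsafe_number / (5 * k + 10)


def cumulative_slope(effect_sizes):
    """Least-squares slope of effect size on wave number.

    Each study's effect size is repeated at every wave from the wave in
    which it enters the database onward (1 + 2 + ... + k data points).
    Returns None when only one wave exists, because a slope is undefined.
    """
    waves, effects = [], []
    for wave in range(1, len(effect_sizes) + 1):
        for effect in effect_sizes[:wave]:
            waves.append(wave)
            effects.append(effect)
    n = len(waves)
    mean_wave = sum(waves) / n
    mean_effect = sum(effects) / n
    sxx = sum((w - mean_wave) ** 2 for w in waves)
    if sxx == 0:
        return None
    sxy = sum((w - mean_wave) * (e - mean_effect)
              for w, e in zip(waves, effects))
    return sxy / sxx


def indicators_by_wave(z_values, effect_sizes):
    """Return (wave, failsafe ratio, signed cumulative slope) for each wave."""
    rows = []
    for wave in range(1, len(z_values) + 1):
        rows.append((wave,
                     failsafe_ratio(z_values[:wave]),
                     cumulative_slope(effect_sizes[:wave])))
    return rows


if __name__ == "__main__":
    for wave, ratio, slope in indicators_by_wave(Z_SIGNIFICANCE, Z_FISHER):
        # The article plots the absolute value of the slope.
        slope_text = "n/a" if slope is None else f"{abs(slope):.3f}"
        print(f"wave {wave:2d}  failsafe ratio {ratio:6.3f}  "
              f"|cumulative slope| {slope_text}")
```

Because the inputs are the rounded values printed in Table 1, the failsafe ratios come out close to, rather than identical to, the reported 0.500, 1.035, and 10.483; the slopes match the reported −0.070 at wave 2, +0.023 at wave 4, and 0.004 at wave 10. A ratio above 1.000 would be read as sufficiency, and a slope near 0.000 as stability.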


however, was established relatively early in the cumulative meta-analytic database (k4 = 7 hypothesis tests, 12 years before the meta-analysis, and before 74% of the includable hypothesis tests). The cumulative meta-analysis for the caregiving burden interventions would have revealed that a good deal of effort and resources had been invested in conducting research on a phenomenon that never achieved sufficiency and yet for which stability might have been established long ago.

Source. McArthur,55 White et al.,56 Koger et al.,57 Acton et al.58
FIGURE 2—Cumulative meta-analysis using real meta-analytic databases: (a) heart-healthy eating programs,55 (b) drug abuse prevention programs,56 (c) music therapy programs,57 and (d) caregiver burden reduction programs.58

DISCUSSION

We have proposed the failsafe ratio as an indicator of sufficiency, and the cumulative slope as an indicator of stability. The indicators illustrate the complementary nature of the sufficiency and stability aspects of cumulative knowledge in the hypothetical data set presented in Table 1, and they also illustrate sufficiency and stability in the 4 real meta-analytic examples.55–58 Consider the heart-healthy nutrition programs,55 which appeared to have strong effects from the outset. Examining the failsafe ratio would have confirmed that additional studies were unlikely to change the weight of the evidence. Examining the cumulative slope would have confirmed that additional studies were unlikely to change the aggregate picture of the phenomenon. Similarly, drug abuse prevention programs56 appeared to have weak effects at the outset. Examining the failsafe ratio would have indicated that additional studies could change the weight of evidence for the (very weak) effect. Examining the cumulative slope, however, would have indicated that additional studies were unlikely to change the aggregate picture of the (very weak) effect. In those cases, and in the case of the caregiver burden reduction programs,58 the failsafe ratio and cumulative slope identify the points in the history of a cumulative database where additional tests of a hypothesis amount to flogging a dead horse.

The complementary aspects of cumulative knowledge, sufficiency and stability, correspond with 2 dimensions of study outcome: significance level and effect size. First, significance level refers to the likelihood of having obtained the observed results, or results more extreme, if in fact the null hypothesis of no difference is true, whereas sufficiency refers to whether the cumulative weight of evidence allows us to accept the existence of the phenomenon. Sufficiency requires a high cumulative probability. Second, effect size refers to the strength of a phenomenon, whereas stability refers to whether the cumulative weight of evidence has leveled off at a steady aggregate picture of the phenomenon. Stability requires a steady cumulative average effect. The cumulative meta-analytic context underscores the role of the size of the database. At the individual study level, significance levels and effect sizes are linked through the size of the sample. That is, a significant effect of P = 0.0499999 might be weak if on the basis of a large sample (n = 1000, ZFisher = 0.052), but strong if on the basis of a small sample (n = 3, ZFisher = 1.830).4 Given the correspondence between significance level/effect size and sufficiency/stability, the size of the database should play a pivotal role in cumulative meta-analysis. Indeed, this appears to be the point of Schmidt’s20 admonition: when is it possible to tell when there is sufficient evidence for the existence of a phenomenon?

PUBLIC HEALTH IMPLICATIONS OF CUMULATIVE META-ANALYSIS

The implications for using cumulative meta-analysis are varied. Among its possible uses are changing school curricula, changing recommendations for physicians, assessing research goals, or modifying criteria for funding research. Although cumulative meta-analysis

6 | Public Health Matters | Peer Reviewed | Muellerleile and Mullen American Journal of Public Health | March 2006, Vol 96, No. 3

cannot take the place of other considerations that inform decision-making practice, it is an additional tool that policy makers can use to make better decisions about implementing programs.

The cumulative meta-analysis generated from the integration of heart-healthy nutrition interventions55 demonstrated that, early on, both sufficiency and stability for an effective program were attained. However, the cumulative meta-analysis generated from the integration of drug abuse prevention programs56 demonstrated that sufficiency was never established, but stability for the essentially null effect was established by the fourth wave in the database. Yet these 2 programs appear to receive differential research support and commitment. For example, the Healthy People 2010 guidelines59 delineate only 1 objective for improving nutrition in school meals, but there are at least 7 objectives for decreasing substance use among schoolchildren. Although drug abuse is a serious public health problem, the Healthy People 2010 objectives appear to be made on the basis of some of the same studies that appeared in White and Pitts’ meta-analysis,56 indicating an overemphasis on promoting programs from which schoolchildren derive no benefit. Meanwhile, the objectives underemphasize a program from which schoolchildren derive significant benefits. Despite the emerging cultural alarm over obesity and its associated health problems, efficacious heart-healthy eating programs appear to be overlooked. Indeed, a simple MEDLINE search of the literature on schoolchildren corroborates this suspicion: A search for heart healthy and nutrition yielded 13 citations; a search for drug abuse and prevention yielded 651 citations. The rendered wisdom from current research objectives is that there is more promotion of (ineffective) drug abuse prevention programs than (effective) heart-healthy eating programs.

Consider the cumulative meta-analysis generated from the integration58 of interventions to reduce caregiver burden. The cumulative meta-analysis demonstrated that by the seventh wave in the database, stability for the negligible effect was attained, indicating no substantive changes to the accruing evidence that interventions do not reduce caregiver burden. However, 7 years after stability was established, 1 study60 set out recommendations for physicians to identify and intervene with overburdened caregivers. Their recommendations included the same educational, counseling, and respite-care services assessed in the primary-level studies integrated in Acton and Kang’s58 meta-analysis.

Moreover, 8 years after stability for the negligible effect was attained, the US Department of Health and Human Services61 issued a preliminary report on governmental commitments to programs for independent living, including caregiver burden reduction programs. The report claims that “a growing body of evidence confirms that the provision of supportive services can diminish caregiver burden, [and] permit caregivers to remain in the workforce. . . .”61 The 2001 appropriations for the National Caregiver Support Program were $125 000 000.61 To date, we have been unable to determine that any appropriations have been dedicated for music therapy programs. The rendered wisdom from current research objectives is that there is more promotion of (ineffective) caregiver burden reduction programs than (effective) music therapy programs.

The examples above make it clear that research in public health can benefit from tools for determining when sufficient evidence has accrued to establish intervention efficacy. There are several valuable applications of this approach. For example, for research questions involving moderators, cumulative meta-analysis can be used to examine sufficiency and stability separately within levels of the moderator: The evidence from studies testing the intervention at 1 level of the moderator may demonstrate sufficiency, whereas studies testing another level of the moderator may not demonstrate sufficiency. Similarly, cumulative meta-analysis can be used to gauge the fit of public policy recommendations: despite the evidence that the effect of caregiver burden reduction levels off at zero, policy recommendations favor more funding. Finally, this approach may provide an empirically based benchmark against which funding proposals can be evaluated by granting agencies: proposals for new studies that use cumulative meta-analysis to document that current evidence for an intervention has not yet achieved stability stand as particularly valuable opportunities to invest time, effort, and resources. The failsafe ratio and cumulative slope can reveal information about an emerging phenomenon to help researchers make the best use of limited resources needed to advance the state of the science and improve public health.

About the Authors
At the time this work was completed, Paige Muellerleile was with the Department of Psychology, University of Wisconsin–Marshfield, Marshfield, and Brian Mullen was with the Department of Psychology, Syracuse University, Syracuse, New York.
Requests for reprints should be sent to Paige Muellerleile, PhD, University of Wisconsin-Marshfield, 2000 West Fifth Street, Marshfield, Wisconsin 54449 (e-mail: pmueller@uwc.edu).
This article was accepted January 5, 2005.

Contributors
Both authors developed the conceptual perspective, analyzed the data, and wrote the article.

Human Participant Protection
No protocol approval was needed for this study.

References
1. Mullen B. Advanced BASIC Meta-Analysis. Hillsdale, NJ: Lawrence Erlbaum Associates; 1989.
2. Mullen B. Advanced BASIC Meta-Analysis. 2nd ed. Mahwah, NJ: Lawrence Erlbaum Associates. In press.
3. Mullen B, Rosenthal R. BASIC Meta-Analysis. Hillsdale, NJ: Lawrence Erlbaum Associates; 1985.
4. Rosenthal R. Meta-Analytic Procedures for Social Research. Newbury Park, CA: Sage; 1991.
5. McDonald HP, Garg AX, Haynes RB. Interventions to enhance patient adherence to medication prescriptions: scientific review. JAMA. 2002;288:2868–2879.
6. Peterson AM, Takiya L, Finley R. Meta-analysis of interventions to improve drug adherence in patients with hyperlipidemia. Pharmacotherapy. 2003;23:80–87.
7. Roter DL, Hall JA, Merisca R, Nordstrom B, Cretin D, Svarstad B. Effectiveness of interventions to improve patient compliance: a meta-analysis. Med Care. 1998;36:1138–1161.
8. Davis D, O’Brien MA, Freemantle N, Wolf FM, Mazmanian P, Taylor-Vaisey A. Impact of formal continuing medical education: do conferences, workshops, rounds, and other traditional continuing education activities change physician behavior or health care outcomes? JAMA. 1999;282:867–874.
9. Fichtenberg CM, Glantz SA. Effect of smoke-free workplaces on smoking behaviour: systematic review. BMJ. 2002;325:188–191.
10. Twenge JM, Campbell WK. Self-esteem and socioeconomic status: a meta-analytic review. Pers Soc Psychol Rev. 2002;6:59–71.


11. Schmidt FL. What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. Am Psychol. 1992;47:1173–1181.
12. Ku L, Sonenstein FL, Pleck JH. Factors influencing first intercourse for teenage men. Public Health Rep. 1993;108:680–694.
13. Eisen M, Zellman GL. Changes in incidence of sexual intercourse of unmarried teenagers following a community-based sex education program. J Sex Res. 1987;23:527–533.
14. Marsiglio W, Mott FL. The impact of sex education on sexual activity, contraceptive use and premarital pregnancy among American teenagers. Fam Plann Perspect. 1986;18:151–162.
15. Rosnow RL, Rosenthal R. Statistical procedures and the justification of knowledge in psychological science. Am Psychol. 1989;44:1276–1284.
16. Mullen B, Muellerleile P, Bryant B. Cumulative meta-analysis: a consideration of indicators of sufficiency and stability. Pers Soc Psychol Bull. 2001;27:1450–1462.
17. Cooper HM. The Integrative Research Review: A Social Science Approach. Beverly Hills, CA: Sage; 1984.
18. Light RJ, Pillemer DB. Summing Up: The Science of Reviewing Research. Cambridge, MA: Harvard University Press; 1984.
19. Johnson B, Mullen B, Salas E. A comparison of the three major meta-analytic approaches. J Appl Psychol. 1995;80:94–106.
20. Schmidt FL, Hunter JE. Comparison of three meta-analysis methods revisited: an analysis of Johnson, Mullen, & Salas (1995). J Appl Psychol. 1999;84:144–148.
21. Rosenthal R, Rubin DB. Interpersonal expectancy effects: the first 345 studies. Behav Brain Sci. 1978;3:410–415.
22. Rosenthal R, Rubin DB. Comment: assumptions and procedures in the file drawer problem. Stat Sci. 1988;3:120–125.
23. Antman EM, Lau J, Kupelnick B. A comparison of results of meta-analyses of randomized control trials and recommendations of clinical experts. JAMA. 1992;268:240–248.
24. Lau J, Antman EM, Jimenez-Silva J, Kupelnick B, Mosteller F, Chalmers TC. Cumulative meta-analysis of therapeutic trials for myocardial infarction. N Engl J Med. 1992;327:248–254.
25. Pogue JM, Yusuf S. Cumulating evidence from randomized trials: utilizing sequential monitoring boundaries for cumulative meta-analysis. Control Clin Trials. 1997;18:580–593.
26. Yusuf S, Held P, Furberg C. Update of effects of calcium antagonists in myocardial infarction or angina in light of the second Danish Verapamil Infarction Trial (DAVIT-II) and other recent studies. Am J Cardiol. 1991;67:1295–1297.
27. DeProspero A, Cohen S. Inconsistent visual analysis of intrasubject data. J Appl Behav Anal. 1979;12:573–579.
28. Furlong MJ, Wampold BE. Intervention effects and relative variation as dimensions in experts’ use of visual inference. J Appl Behav Anal. 1982;15:415–421.
29. Gottman JM, Glass GV. Analysis of interrupted time-series experiments. In: Kratochwill TR, ed. Single Subject Research: Strategies for Evaluating Change. New York, NY: Academic Press; 1978:197–235.
30. Jones R, Weinrott M, Vaught R. Effects of serial dependency on the agreement between visual and statistical inference. J Appl Behav Anal. 1978;11:277–283.
31. Tryon WW. A simplified time-series analysis for evaluating treatment interventions. J Appl Behav Anal. 1982;15:423–429.
32. Ottenbacher KJ. Interrater agreement of visual analysis in single subject decisions: quantitative review and analysis. Am J Ment Retard. 1993;98:135–142.
33. Chambers JM, Cleveland WS, Kleiner B, Tukey PA. Graphic Methods for Data Analysis. Belmont, CA: Wadsworth; 1983.
34. Cleveland WS. Elements of Graphing Data. Summit, NJ: Hobart Press; 1994.
35. Cleveland WS, McGill R. Graphical perception: theory, experimentation, and application to the development of graphical methods. J Am Stat Assoc. 1984;79:531–554.
36. Cleveland WS, McGill R. The many faces of a scatterplot. J Am Stat Assoc. 1984;79:807–822.
37. Mosteller F, Tukey JW. Data analysis, including statistics. In: Lindzey G, Aronson E, eds. The Handbook of Social Psychology. Vol 2. 2nd ed. Reading, MA: Addison-Wesley; 1968.
38. Tufte ER. Envisioning Information. Cheshire, CT: Graphics Press; 1990.
39. Tufte ER. Graphical Explanations. Cheshire, CT: Graphics Press; 1997.
40. Tukey JW. Data based graphics: visual display in the decades to come. Stat Sci. 1990;5:327–339.
41. Wainer H. Visual Revelations: Graphical Tales of Fate and Deception from Napoleon Bonaparte to Ross Perot. New York: Copernicus; 1997.
42. Cooper HM. Statistically combining independent studies: a meta-analysis of sex differences in conformity research. J Pers Soc Psychol. 1979;37:131–135.
43. Greenwald AG. Consequences of prejudice against the null hypothesis. Psychol Bull. 1975;82:1–20.
44. Hedges LV, Vevea JL. Estimating effect size under publication bias: small sample properties and robustness of a random effects selection model. J Educ Behav Stat. 1996;21:299–333.
45. Hojat M, Gonnella JS, Caelleigh AS. Impartial judgment by the “gatekeepers” of science: fallibility and accountability in the peer review process. Adv Health Sci Educ Theory Pract. 2003;8:75–96.
46. Olson CM, Rennie D, Cook D, et al. Publication bias in editorial decision making. JAMA. 2002;287:2825–2828.
47. Rosenthal R. The “file drawer problem” and tolerance for null results. Psychol Bull. 1979;86:638–641.
48. Beck CT. A meta-analysis of the relationship between postpartum depression and infant temperament. Nurs Res. 1996;45:225–230.
49. Herbert TB, Cohen S. Depression and immunity: a meta-analytic review. Psychol Bull. 1993;113:472–486.
50. Ito TA, Miller N, Pollock VE. Alcohol and aggression: a meta-analysis on the moderating effects of inhibitory cues, triggering events, and self-focused attention. Psychol Bull. 1996;120:60–82.
51. Sheeran P, Orbell S. Do intentions predict condom use? Meta-analysis and examination of six moderator variables. Br J Soc Psychol. 1998;37:231–250.
52. Sweeney PD, Anderson K, Bailey S. Attributional styles and depression: a meta-analytic review. J Pers Soc Psychol. 1986;50:974–991.
53. Hedges LV, Olkin I. Statistical Methods for Meta-Analysis. Orlando, FL: Academic Press; 1985.
54. McCain LJ, McCleary R. The statistical analysis of the simple interrupted time-series quasi-experiment. In: Cook TD, Campbell DT, eds. Quasi-Experimentation: Design and Analysis Issues for Field Settings. Chicago, IL: Rand McNally; 1979:233–293.
55. McArthur DB. Heart healthy eating behaviors of children following a school-based intervention: a meta-analysis. Issues Compr Pediatr Nurs. 1998;21:35–48.
56. White D, Pitts M. Educating young people about drugs: a systematic review. Addiction. 1998;93:1475–1487.
57. Koger SM, Chapin K, Brotons M. Is music therapy an effective intervention for dementia? A meta-analytic review of literature. J Music Ther. 1999;36:2–15.
58. Acton GJ, Kang J. Interventions to reduce the burden of caregiving for an adult with dementia: a meta-analysis. Res Nurs Health. 2001;24:349–360.
59. Healthy People 2010: Understanding and Improving Health. 2nd ed. Washington, DC: US Department of Health and Human Services; 2000.
60. Kasuya RT, Polgar-Bailey P, Takeuchi R. Caregiver burden and burnout: a guide for primary care physicians. Postgrad Med. 2000;108:119–123.
61. US Department of Health and Human Services. Delivering on the Promise: Preliminary Report. 2001. Available at: http://www.hhs.gov/newfreedom/prelim/caregive.html. Accessed on November 16, 2003.
