This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.
This document is copyrighted by the American Psychological Association or one of its allied publishers.

Comparative studies of psychotherapy often find few or no differences in the outcomes that alternative treatments produce. Although these findings may well reflect the comparability of alternative treatments, as a rule, studies are often not sufficiently powerful to detect the sorts of effect sizes likely to be found when two or more treatments are contrasted. The present survey evaluated the power of psychotherapy outcome studies to detect differences for contrasts of two or more treatments and treatment versus no-treatment control conditions. Outcome studies (N = 85) were drawn from nine journals over a 3-year period (1984-1986). Data in each article were examined first to provide estimates of effect sizes and then to evaluate statistical power at posttreatment and follow-up. The findings indicate that the power of studies to detect differences between treatment and no treatment is generally quite adequate given the relatively large effect sizes usually evident for this comparison. On the other hand, the power is relatively weak to detect the small-to-medium effect sizes likely to be evident when alternative treatments are contrasted with each other. Thus, the equivalent outcomes that treatments produce (i.e., "no difference") may be due to the relatively weak power of the tests. The implications for interpreting current outcome studies and for designing future comparative studies are highlighted.
In psychotherapy research, comparative outcome studies address the question of which of two or more techniques is the most effective for a particular clinical problem and patient population. Often the constituent treatments reflect conflicting conceptual views about the nature of dysfunction, the focus of treatment, and the techniques required to produce change. Consequently, comparative studies generate tremendous interest (see Heimberg & Becker, 1984; Kazdin, 1986; Lambert, Shapiro, & Bergin, 1986; Luborsky, Singer, & Luborsky, 1975; Rachman & Wilson, 1980; Stiles, Shapiro, & Elliott, 1986).

Outcome evidence on the relative effectiveness of alternative treatments has been evaluated extensively. Conclusions have been drawn from individual comparative outcome investigations (e.g., Sloane, Staples, Cristol, Yorkston, & Whipple, 1975), box-score (e.g., Luborsky et al., 1975) and narrative reviews (e.g., Kazdin & Wilson, 1978; Lambert et al., 1986; Rachman & Wilson, 1980), and meta-analyses (see Brown, 1987). Although individual studies and large-scale reviews occasionally argue for the superiority of one technique over another, evaluations of the literature usually suggest that treatments tend not to differ or at least not to differ very much in the outcomes they produce.

The absence of clear outcome differences between alternative treatments has served as an important point of departure regarding the nature of therapy and the processes that account for change (see Stiles et al., 1986). As a prominent example, Frank (1982) suggested that therapeutic change results from several features that are common among different techniques. Thus, no outcome differences between treatments might be expected given their common ingredients. This view has been bolstered in part by the recognition that different techniques often appear more diverse in theory than they do in clinical practice (e.g., Klein, Dittmann, Parloff, & Gill, 1969; Sloane et al., 1975). Common ingredients, particularly the special relationship between client and therapist and the provision of support, empathy, and concern, are pervasive among alternative techniques (Waterhouse & Strupp, 1984).

Current estimates suggest that well over 400 psychotherapy techniques are in use for adults (Karasu, personal communication, March 1, 1985) and that over 230 techniques are in use for children (Kazdin, 1988). If we assume that many of these techniques are effective in producing therapeutic change, it is difficult to conceive that they vary in effectiveness or operate through different therapeutic processes. Yet, this assumption is not tantamount to stating that the results from viable treatment contenders for a given clinical problem will be similar. Whether treatment outcomes differ can only be evaluated empirically. The finding that treatments do not differ in many, if not most, tests may mean that treatments are approximately equal in their effects. Yet, it is important to know if the studies are, as a rule, designed to detect outcome differences. When two or more active interventions are expected to produce change, the investigation must be sufficiently sensitive to detect what could prove to be relatively small differences.

Completion of this article was facilitated by Research Scientist Development Award MH00353 and by Grant MH35408 from the National Institute of Mental Health. We are extremely grateful to Jacob Cohen, Larry V. Hedges, and Kenneth I. Howard. Special thanks are also extended to Helena C. Kraemer, who provided comments and guidance on prior drafts.

Correspondence concerning this article should be addressed to Alan E. Kazdin, Western Psychiatric Institute and Clinic, 3811 O'Hara Street, Pittsburgh, Pennsylvania 15213.
COMPARATIVE OUTCOME STUDIES AND STATISTICAL POWER
A critical research issue is the extent to which an investigation can detect differences between groups when differences exist within the population. This notion is referred to as statistical power and reflects the probability that the test will lead to rejection of the null hypothesis.[1] Power is a function of the criterion for statistical significance (alpha), sample size (N), and the difference that exists between groups (effect size).

[1] Power (1 - beta) is the probability of rejecting the null hypothesis when it is false. Stated differently, power is the likelihood of finding differences between the treatments when, in fact, the treatments are truly different in their outcomes.

Although power is an issue in virtually all research, it raises special issues in studies where two or more conditions (groups) are not significantly different (Fagley, 1985; Freiman, Chalmers, Smith, & Kuebler, 1978; Kazdin, 1980). The absence of significant differences can contribute to knowledge under a variety of circumstances. However, an essential precondition is that the investigation be sufficiently powerful to detect meaningful differences. In the vast majority of psychotherapy outcome studies that have contrasted two or more treatments, the power may have been relatively weak due to small sample sizes.

There are many reasons to suspect that outcome studies as a general rule provide weak tests. Over 25 years ago, Cohen (1962) examined clinical research published in the Journal of Abnormal and Social Psychology for a 1-year period (1960). Over 2,000 statistical tests were identified (from 70 articles) that were considered to reflect direct tests of the hypotheses. To evaluate power, Cohen examined different effect sizes, that is, the magnitude of the differences between alternative groups based on standard deviation units. Cohen distinguished three levels of effect sizes (small = .25, medium = .50, and large = 1.00) and evaluated the power of published studies to detect differences at these levels, assuming alpha = .05 and using nondirectional (two-tailed) tests.

The results indicated that power was generally weak for detecting differences equivalent to small and medium effect sizes. For example, the mean power of studies to detect differences reflecting small and medium effect sizes was .18 and .48, respectively. This means that, on the average, studies had slightly less than a 1 in 5 chance to detect small effect sizes and less than a 1 in 2 chance to detect medium effect sizes. These levels are considerably below the recommended level of power, .80 (4 in 5 chance).[2] Cohen concluded that the power of the studies was weak and that sample sizes in future studies should routinely be increased (see also Cohen, 1977).

[2] The level of power that is adequate is not easily specified or justified mathematically. As with the level of confidence (alpha), the decision is based on convention about the margin of protection one should have against falsely accepting the null hypothesis (beta). Cohen (1965) recommended adoption of the convention that beta = .20 and hence power (1 - beta) = .80 when alpha = .05. This translates to the 4 in 5 likelihood of detecting an effect when a difference exists in the population. Although power >= .80 is used as a criterion for discussion in the present article, a higher level (.90, .95) is often encouraged as the acceptable criterion (e.g., Freiman, Chalmers, Smith, & Kuebler, 1978; Friedman, Furberg, & DeMets, 1985).

A more recent analysis examined if the situation has improved since Cohen's earlier portrayal. Rossi, Rossi, and Cottrill (1984) sampled research from the Journal of Personality and Social Psychology and the Journal of Abnormal Psychology, journals that are descendants of the publication Cohen analyzed. The data were obtained from 142 articles in the 1982 volume of each journal. Rossi et al. found that 3%, 26%, and 69% of the studies in 1982 had power above .80 for detecting small, medium, and large effects, respectively. This compared with parallel data from Cohen (1962) of 0%, 9%, and 79%, respectively. Although slight increases in power were evident for detecting small and medium effects, the vast majority of studies were quite weak with regard to detecting such effects.

The applicability of these findings to psychotherapy outcome research might be challenged. The surveys of Cohen (1962) and Rossi et al. (1984) covered only a 1-year period, sampled a small number of journals, and encompassed diverse areas of research within clinical and social psychology. Perhaps, more to the point, the surveys did not focus on psychotherapy outcome exclusively or primarily. Yet, the characteristics of psychotherapy outcome studies may make the conclusions even more applicable.

To begin with, small sample sizes may be dictated in part by the subject matter due to difficulties in recruiting, treating, and retaining large samples of homogeneous subjects (Kraemer, 1981). In contemporary outcome research, studies typically include 20 or fewer subjects per group (e.g., Cross, Sheehan, & Khan, 1982; DiLoreto, 1971; Liberman & Eckman, 1981; Rush, Beck, Kovacs, & Hollon, 1977). Indeed, it is not difficult to find studies comparing alternative treatment and control conditions in which the sample sizes are 10 or fewer cases per condition (e.g., Forman, 1980; Yu, Harris, Solovitz, & Franklin, 1986). Inadequate levels of power could be a rival explanation for the absence of treatment differences.

A second problem of psychotherapy research is the loss of subjects over time. Apart from the possible selection biases that attrition can introduce, sample sizes (and power) may be markedly reduced by the end of treatment and by the follow-up assessment (Garfield, 1980). The toll of attrition on the design is often high. In some treatment programs, between 40% and 50% of subjects may drop out of treatment (e.g., Fleischman, 1981; Patterson, 1974; Viale-Val, Rosenthal, Curtiss, & Marohn, 1984). In one large-scale comparative outcome study, almost 90% of the cases (396 of 450) that completed treatment were lost at follow-up 1 year later (Feldman, Caplinger, & Wodarski, 1983). Thus, in psychotherapy outcome studies, sample size and power may weaken considerably over time. If the conclusion of no differences between treatments is not evident at posttreatment, it may well be reached at follow-up.

One may be able to discern from psychotherapy research the approximate effect sizes for classes of comparisons such as treatment versus no treatment and treatment versus treatment. Indeed, such effect sizes have often been examined in the context of meta-analysis. Effect sizes obtained after experiments are completed provide an estimate of population effect sizes (Cohen, 1973). It is important to examine effect sizes for comparisons of alternative therapy techniques, not only to determine the magnitude of the differences with which we may be working, but also to interpret studies in which critical tests of treatment are provided and few or no differences emerge.

From the published research, one can estimate the power of studies to detect differences of the magnitudes that emerge for
ALAN E. KAZDIN AND DEBRA BASS
comparisons of alternative treatment conditions. There is, of course, the bias that derives from consulting only published studies. Such studies may be more prone to yield significant effects, whereas those that did not yield such effects may be unpublished and relegated to the investigator's file drawer (Rosenthal, 1979). However, comparative outcome studies with few or no differences are often published (e.g., Sloane et al., 1975). The reasons include the keen interest in the comparisons in their own right, the likely differences between the treatments and the no-treatment or waiting-list control condition, and the evaluation of therapeutic processes common to alternative techniques.

In the present article, we examine the power of treatment

goals (e.g., reading improvement). Outcome investigations referred to studies designed to measure some facet of psychological adjustment or functioning after treatment was completed (posttreatment). At least two groups or conditions were required for the study to be included. The groups could include any combination of treatment and control conditions. Although primary interest was in studies comparing two or more treatments, all psychotherapy outcome studies were included if there were at least two groups. This inclusion criterion was adopted to permit evaluation of power for the different comparisons (one treatment vs. another treatment vs. no-treatment).

To provide a sample of psychotherapy outcome research, nine journals were studied for a 3-year period (1984-1986). Four journals were selected because they were the most frequent contributors to meta-analyses of psychotherapy outcome research (Shapiro & Shapiro, 1982; Smith, Glass, & Miller, 1980). The journals included the Journal of Con-

Measure and Calculation of Effect Size

Within each study, an effect size was calculated between each pair of groups on each outcome measure. Effect size (ES) was defined as (m1 - m2)/s, where m1 and m2 refer to two group means (treatment or control) and s is the pooled within-group standard deviation.[5] Each ES was classified as coming from a comparison of two treatments, of treatment versus no treatment, or of treatment versus an active control. Also, each ES was classified as evaluated at posttreatment or at follow-up. If multiple follow-up assessments were reported, only the last (longest duration) assessment was considered. Thus, six types of ESs might have been derived from a single study. Within a study, multiple values of each effect size might have been obtained if there were several outcome measures. The ESs of each type within a study were averaged, so that each study contributed no more than one mean ES per comparison of interest.[6]

and, hence, are combined in subsequent analyses. The table details information relevant to the evaluation of power but also portrays several basic characteristics of psychotherapy outcome studies.

The mean sample size (N) for all studies was 53 subjects (SD = 39.2); the mean number of groups was 3 (SD = 1.2). All 85 studies included at least one treatment group, given that this was a selection criterion for inclusion; 75 (88.2%) of the studies included two or more treatment conditions; 40 (47.1%) of the studies included a no-treatment or waiting-list control condition. Only 8 (9.4%) studies included an active control (e.g., attention-placebo) condition. In terms of outcome evaluation, all studies included posttreatment, and 67.1% of the studies included a follow-up assessment.
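The effect size measure defined above — the difference between two group means divided by the pooled within-group standard deviation — can be sketched as follows. This is a minimal illustration; the function names and the means, standard deviations, and group sizes in the example are ours, not values from any surveyed study:

```python
from math import sqrt

def pooled_sd(s1, n1, s2, n2):
    """Pooled within-group standard deviation of two groups."""
    return sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))

def effect_size(m1, s1, n1, m2, s2, n2):
    """ES = (m1 - m2) / s, with s the pooled within-group SD."""
    return (m1 - m2) / pooled_sd(s1, n1, s2, n2)

# Hypothetical outcome means and SDs for a treatment group and a
# control group of 12 subjects each: ES = (24 - 18) / 8
es = effect_size(24.0, 8.0, 12, 18.0, 8.0, 12)
print(f"ES = {es:.2f}")
```

With equal group SDs the pooled SD reduces to that common SD, so the example yields a standardized difference of .75 of a standard deviation unit.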
Table 1
Psychotherapy Outcome Studies: Sources and Study Characteristics

                               Total    JCCP     JCP      BT      BRT      AGP      BJP    BJCP
No. of studies                 85       38       10       16      16       2        2      1
Sample size
  M                            53.60    55.63    81.10    38.44   45.31    71.00    42.50  64.00
  SD                           39.21    31.97    79.45    18.00   29.84    42.43    31.82  0
  Range                        12-298   12-146   31-298   14-66   17-131   41-101   20-65  —
No. of groups
  M                            3.14     3.05     3.70     2.56    3.31     4.00     2.00   8.00
  SD                           1.20     .93      1.06     .63     1.35     2.83     0      0
  Range                        2-8      2-6      2-6      2-4     2-6      2-6      —      —
No. of studies with
  Two or more treatments       75[a]    34       9        13      15       2        1      1
  No-treatment control         40[b]
  Active control               8[c]     5        0        0       3        0        0      0
  Follow-up assessment         57[d]    27       4        11      12       2        1      0
Follow-up length (months)
  M                            8.03     7.14     5.00     8.73    11.75    3.50     1.00   0
  SD                           7.55     5.11     4.97     6.23    12.48    3.54     0      0
  Range                        1-46     1-18     1-12     1-24    2-46     1-6      0      0

Note. JCCP = Journal of Consulting and Clinical Psychology; JCP = Journal of Counseling Psychology; BT = Behavior Therapy; BRT = Behaviour Research and Therapy; AGP = Archives of General Psychiatry; BJP = British Journal of Psychiatry; BJCP = British Journal of Clinical Psychology. Dash denotes that the high and low numbers for the range were the same or that only one entry was available.
[a] 88.24%. [b] 47.06%. [c] 9.41%. [d] 67.06%.
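As a quick arithmetic check, the bracketed percentage footnotes in Table 1 follow directly from the study counts out of the 85 surveyed studies:

```python
# Percentages reported in the footnotes of Table 1, derived from counts
total = 85
counts = {
    "two or more treatments": 75,
    "no-treatment control": 40,
    "active control": 8,
    "follow-up assessment": 57,
}
for label, count in counts.items():
    print(f"{label}: {count}/{total} = {100 * count / total:.2f}%")
```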
in which effect sizes of psychotherapy research have been examined.

As is evident in the table, the sample sizes were quite similar across treatment and control groups. Consideration of the data pooled across groups revealed that the mean group sizes at posttreatment and follow-up were 16.1 and 15.3, respectively. Examination of the 75th percentile conveyed that three fourths of the studies included fewer than 20 subjects per group.

Table 2
Sample Sizes (n) Obtained From Surveyed Studies

Group                        Posttreatment     Follow-up
Treatment
  M                          16.01 (11.12)     14.56 (7.83)
  Mdn                        12.00             13.00
  25th-75th percentiles      10.00-19.00       10.00-17.00
  Range                      6-107             3-74
No-treatment control
  M                          16.49 (16.52)     19.91 (21.70)
  Mdn                        12.00             13.00
  25th-75th percentiles      10.00-18.00       9.00-20.00
  Range                      5-114             7-84
Active control
  M                          16.89 (9.74)      20.17 (10.28)
  Mdn                        12.00             16.50
  25th-75th percentiles      12.00-24.50       11.50-32.50
  Range                      5-34              10-34
All
  M                          16.12 (12.09)     15.25 (9.81)
  Mdn                        12.00             13.50
  25th-75th percentiles      10.00-19.00       10.00-17.00
  Range                      5-114             3-84

Note. Standard deviations are expressed in parentheses.

Table 3 presents estimated effect sizes for the comparisons of interest. As a potential guideline for the interpretation of the data, it is useful to bear in mind Cohen's (1977) classification of small, medium, and large ESs at .20, .50, and .80.[9] When two or more treatments were compared with each other, the mean absolute ESs at posttreatment and follow-up were .50 and .47, respectively. These fall within the range of medium ESs. The mean absolute ES across all outcome studies comparing treatment versus no treatment was .85 at posttreatment and .89 at follow-up. These ESs fall within the range of large ESs. Few studies (8 at posttreatment, 5 at follow-up) were available that compared treatment versus active control conditions. The mean ES for this comparison was .38 at posttreatment and .32 at follow-up, both between small and medium ESs.[10]

[9] The magnitudes for small, medium, and large effect sizes have been revised from those specified originally by Cohen (1962), which were .25, .50, and 1.00. The effect sizes of .20, .50, and .80 reflect the magnitudes in more recent references (Cohen, 1977).

[10] Because of the small number of studies, the data for this comparison will not be discussed further. However, estimates of ES, power, and related information are presented in the tables.

The range of ESs obtained for the comparisons of interest is illustrated in Figure 1. This figure conveys the median ES and the 25th and 75th percentiles, a range that includes 50% of the studies for the comparison of interest. The data convey clearly that the comparisons of alternative treatments span the small-to-medium range. In contrast, the ESs for treatment versus no treatment are in the medium-to-large range.

Estimated Power

If ESs fall within the ranges noted previously, one can estimate the extent to which studies, as currently designed, are
Table 3
Estimated Effect Sizes for Comparisons of Interest

Comparison                   Posttreatment    Follow-up
Treatment vs. treatment
  M                          .50 (.31)        .47 (.32)
  Mdn                        .47              .40
  25th-75th percentiles      .26-.66          .27-.64
  Range                      .04-1.52         .02-1.76
  No. of articles            75               42
Treatment vs. no treatment
  M                          .85 (.47)        .89 (.46)
  Mdn                        .78              .87
  25th-75th percentiles      .54-1.05         .52-1.15
  Range                      .30-2.67         .26-1.85

Table 4
Estimated Power to Detect Differences Between Alternative Treatments and Treatment Versus Control Conditions

Comparison                   Posttreatment    Follow-up
Treatment vs. treatment
  Mdn                        .74              .63
  25th-75th percentiles      .27-.97          .38-.87
  Range                      .05-.995         .03-.995
  No. of articles            75               42
Treatment vs. no treatment
  Mdn                        .995             .995
  25th-75th percentiles      .89-.995         .92-.995
  Range                      .37-.995         .73-.995
  No. of articles            40               10
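The power estimates in Table 4 pair each study's own effect size with its own group size. The underlying calculation can be sketched with a normal approximation; this is an illustrative sketch, not the exact computation used in the article, and the example combines the median treatment-vs.-treatment ES (.47) with group sizes of interest:

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal cumulative distribution function, via erf."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power_two_sample(effect_size, n_per_group, z_alpha=1.96):
    """Approximate power of a two-tailed test of two independent means
    (normal approximation; alpha = .05 gives z_alpha = 1.96)."""
    noncentrality = effect_size * sqrt(n_per_group / 2.0)
    return normal_cdf(noncentrality - z_alpha)

# Power at the median treatment-vs.-treatment ES (.47) for the median
# group size actually used (12) and for progressively larger groups
for n in (12, 36, 71):
    print(f"ES = 0.47, n = {n}: power = {power_two_sample(0.47, n):.2f}")
```

At the group sizes typical of the surveyed comparisons, a medium ES is detected only a fraction of the time; power approaches the .80 criterion only as the group size nears the much larger values discussed below.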
beta = .20 so that power = .80. For the purpose of illustration, sample size and ESs will be considered for the different comparisons in relation to the posttreatment evaluation of outcome.

In the present survey, the median ES for comparing two treatments was .47 at posttreatment, with a range from .26 to .66 representing the 25th and 75th percentiles. Figure 2 conveys the corresponding sample sizes needed for these ESs. A sample size of 71 per group would be needed to retain power at the desired level for the median ES, with a corresponding range from 232 to 36 subjects for the 25th and 75th percentiles.[11] Examination of the figure shows the number of subjects per group actually used. The median number of subjects for studies comparing alternative treatments was 12.0, with a range from 10 to 19 for the 25th and 75th percentiles. These actual sample sizes are far below the number needed to detect treatment differences.

If the investigator wishes to compare treatment versus no treatment, the median ES is .78 at posttreatment. This ES is bounded by ESs of .54 and 1.05, which reflect the 25th and 75th percentiles and represent 50% of the studies surveyed. Figure 2 plots the corresponding sample sizes for these ESs. A sample size of 27 per group would be needed for the desired power with the median ES for this comparison; the corresponding range of 54 to 14 subjects per group would be required for the 25th and 75th percentiles. These numbers are larger but closer to the actual sample sizes used. Figure 2 shows the actual sample sizes, with a median of 12 at posttreatment bounded by a range from 10 to 18 for the respective percentiles.

[Figure 2 appears here: upper panel, "Required ns for power = .80"; lower panel, "Obtained ns from Survey"; y-axis, Group Size (n); x-axis, Treatment Compared To (T or NT, at P or FU).]

Figure 2. Sample sizes required for comparisons of treatment (T) with another treatment or with no treatment (NT) separately for posttreatment (P) and follow-up (FU). (The sample sizes required are based on alpha = .05 for a two-tailed test, with power set at .80 and an assumption of equal ns per group. The column for each comparison reflects the range of sample sizes from the 25th and 75th percentiles; the horizontal line within each column corresponds to the median sample size needed. The upper figure represents the sample sizes required for power = .80 given the effect sizes that are obtained for the different types of comparisons; the lower figure conveys the actual sample sizes used in studies of the present survey.)

Discussion

Psychotherapy outcome research was examined in nine journals over a 3-year period (1984-1986) to assess the extent to which studies are sufficiently powerful to detect differences between alternative treatment and control conditions. Effect sizes were estimated at posttreatment and follow-up, and these were used along with sample size data to evaluate power. The results can be discussed with reference to Cohen's (1977) criteria for small (.20), medium (.50), and large (.80) effect sizes. The present results indicate that comparisons of alternative treatments yield effect sizes close to the medium level (mean ESs = .50 at posttreatment and .47 at follow-up) and that comparisons of treatment versus no treatment tend to yield relatively large effect sizes (mean ESs = .85 at posttreatment and .89 at follow-up).

The question of interest is: To what extent are psychotherapy outcome studies sufficiently powerful to detect differences given these effect sizes? Using estimated effect sizes and sample sizes from the studies themselves and adopting an alpha of .05 (for two-tailed tests), we obtained power estimates. In evaluating the findings, we considered power >= .80 a criterion of adequate sensitivity of a test.

Power for comparisons of alternative treatments was below the recommended level (medians = .74 for posttreatment and .63 for follow-up). Indeed, for studies comparing two or more active treatments, fewer than half at posttreatment and fewer than one third at follow-up (45.3% and 28.6% of the studies) had power at or above the recommended level. The power to detect differences for comparisons of treatment versus no treatment was quite adequate (medians = .995 for both posttreatment and follow-up). Also, the majority of studies (82.5% and 90.0%) comparing treatment with a no-treatment control met or surpassed the power of .80 at posttreatment and follow-up.

A number of implications can be drawn from the present survey. First, power in psychotherapy outcome research is generally low for detecting small and medium effects. The need for improved power remains, and several recommendations made over 20 years ago (Cohen, 1965) continue to be timely. The concern with weak power in relation to psychotherapy research echoes points voiced in relation to research in clinical, social, educational, and applied psychology more generally (e.g., Brewer, 1972; Chase & Chase, 1976; Cohen, 1962; Rossi et al., 1984). Psychological research, of course, is not alone on this score. The absence of differences in major clinical trials of alternative interventions in medicine and public health may, in many instances, be attributed to insufficient power (see Freiman et al., 1978).

A second and more focused implication of the present results pertains to the special issues regarding comparative psychotherapy outcome research (Heimberg & Becker, 1984; Kazdin, 1986). In evaluations of alternative psychotherapy techniques, both in individual studies and in literature reviews, a frequent interpretation is that there are few or no outcome differences between alternative techniques. Perhaps most treatment techniques are not very different in the effects they produce, and the common processes of therapy advanced to explain the null-hypothesis findings are well-based. However, in light of the present findings, it is appropriate and parsimonious to raise another prospect. Possibly, studies that compare alternative treatments are not sufficiently powerful to detect differences between treatments unless the effect sizes are large.

Given the complexities of clinical problems and psychotherapy and the limitations and variability of outcome assessment, large effects, even if they are evident in the population, might be difficult to obtain in a given investigation. The present survey suggests that effect sizes for comparisons of alternative treatments are likely to be in the small-to-medium range. Given the usual sample sizes that are used, the majority of studies may not be sufficiently powerful to detect such differences.

There are several limitations of the present survey. To begin with, the methods of estimating and evaluating effect sizes in psychotherapy research are not entirely free from controversy. Significant issues such as the appropriate weighting of alternative outcome measures, the appropriate standard deviation unit, the extent to which the sample effect sizes within and among different studies can be pooled to estimate population effect sizes, the nonindependence of effect sizes computed for a given study for the types of comparisons of interest, and the exclusion of unpublished studies (among others) have been carefully discussed, but not entirely resolved, in other secondary analyses. The present method of computing effect sizes followed several practices used in prior meta-analyses and is subject to similar issues and concerns. The present survey suffers from greater interpretive obstacles by combining studies, effect sizes, and power estimates across broad classes of treatment in which the identity of the type of treatment, patient, measure, and other study characteristics were ignored. The impact of these and other characteristics of therapy on effect sizes has been studied in large-scale meta-analyses (e.g., Shapiro & Shapiro, 1982; Smith et al., 1980).

The purpose of the present survey was to examine broad classes of comparisons to estimate power. The estimates, when viewed in the range of 25th and 75th percentiles, provide an interval that suggests reasonable consistency, even though the distinctions between different techniques and studies were not made. One might want to examine whether the power of studies of particular treatments is weaker than that of others and to make technique distinctions. However, this was beyond the present goal.

therapy studies was neglected in the present survey and may have important implications for power. The present analysis focused on the power to detect differences between treatments. The analyses examined main effects of a single variable (treatment technique). This focus is tantamount to asking the question, does psychotherapy work? The question has long been rejected as much too general to warrant serious attention (e.g., Edwards & Cronbach, 1952). The global question has been replaced by a more specific focus aimed at identifying which type of treatment works best for which type of client, as administered by which type of therapist under which circumstances (Kiesler, 1966; Paul, 1967). This question focuses on interactions (e.g., Treatment X Patient X Therapist X Setting) rather than simply on the efficacy of treatment or the relative effectiveness of different treatments. Yet, testing interactions in factorial studies is likely to divide samples into smaller groups than those used to test main effects of treatment. If power is weak for testing the main effect of treatment techniques, a fortiori, it is likely to be weak for testing the interactions. This concern has been voiced in relation to other areas of psychological research in which the inadequacy of power to test statistical interactions is common (Chase & Tucker, 1975; Cohen, 1965).

Another limitation of the present evaluation is the exclusive focus on power and the limited discussion even within this domain. The focus has suggested that power needs to be increased in psychotherapy outcome studies given the magnitude of effect sizes usually reported. The discussion has emphasized that power is a function of alpha, sample size, and effect size. Sample size seems to be the only variable for the investigator to manipulate and improve given that the adherence to alpha (at .05 or .01) has become a strongly entrenched convention (Cohen, 1977) and that effect size, at first blush, seems merely to reflect the state of affairs evident in nature, that is, an estimate of true population differences.

Actually, power entails much more than the factors emphasized in this article. Power is a function of effect size, which is influenced in manifold ways by the care and consistency with which an investigation is conducted. Stated generally, effect size can be increased or rather optimized in an investigation by reducing error variance. Methodological features such as selecting homogeneous sets of patients, ensuring the integrity of treatment, standardizing the assessment conditions, carefully choosing outcome measures, and similar practices increase the power of an investigation by reducing variability in its execution. Although the present focus was limited to consideration

[11] The requisite sample (N) or group (n) sizes for a given alpha, power, and effect size of various increments can be obtained from various published tables (see Cohen, 1977; Hinkle & Oliver, 1983; Kraemer & Thiemann, 1987). For the present purposes, the requisite group sizes, when we assumed an equal number per group, were obtained by direct calculation (see Lachin, 1981; Snedecor & Cochran, 1980). For a two-tailed test of means from two independent samples of equal group sizes,

n = 2(z_{1-alpha/2} + z_{1-beta})^2 / ES^2.

For a two-tailed test with alpha = .05 and beta = .20, this translates to

n = (1.96 + .842)^2 (2) / ES^2.
146 ALAN E. KAZDIN AND DEBRA BASS
of power, power itself can be augmented by the careful execu- interpreted unambiguously without improved power. In the
tion of the study. context of psychotherapy research, as opposed to government
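As a rough numerical illustration of the relationships discussed above (a sketch using the standard normal approximation, not a computation taken from the article), the dependence of power on alpha, effect size (Cohen's d), and per-group sample size can be written out directly:

```python
# Illustrative sketch: normal-approximation formulas for the power of a
# two-tailed, two-group comparison and for the per-group sample size needed
# to reach a target power. Effect size d is Cohen's d; alpha is two-tailed.
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power to detect effect size d with n subjects per group."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    # Difference in means expressed in standard-error units, minus the
    # critical value (the small lower-tail contribution is ignored).
    return Z.cdf(d * sqrt(n_per_group / 2) - z_crit)

def required_n(d, alpha=0.05, power=0.80):
    """Per-group n such that approx_power(d, n) reaches the target power."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    z_pow = Z.inv_cdf(power)
    return ceil(2 * ((z_crit + z_pow) / d) ** 2)

# Large effects (d near .8), typical of treatment vs. no treatment,
# need modest samples; small-to-medium effects (d near .3), typical of
# treatment vs. treatment contrasts, need far larger ones.
print(required_n(0.8))                  # ~25 per group
print(required_n(0.3))                  # ~175 per group
print(round(approx_power(0.3, 20), 2))  # power with a common n of 20
```

The numbers echo the survey's central point: with the group sizes commonly used, contrasts of treatment versus no treatment are adequately powered, whereas contrasts of two treatments, where effects are small to medium, usually are not.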
The present survey illustrates the utility of power analyses in evaluating the results of previously conducted research. Such analyses can estimate the likelihood that research can detect differences given sample and effect sizes. Another and more important use of power analysis is the planning of a study to ensure that sample sizes can detect an effect of a given magnitude (Cohen, 1977; Kraemer & Thiemann, 1987; Meinert, 1986). Psychotherapy outcome research can profit greatly from this use of power analyses. The usual impediment to incorporating power in the design of research is hesitancy in estimating effect size. However, data from meta-analyses of psychotherapy (Brown, 1987) as well as descriptive analyses such as those found in the …

… interpreted unambiguously without improved power. In the context of psychotherapy research, as opposed to government and politics, it may be the absence rather than the presence of power that corrupts.

References

Brewer, J. K. (1972). On the power of statistical tests in the American Educational Research Journal. American Educational Research Journal, 9, 391-401.

Brown, J. (1987). A review of meta-analyses conducted on psychotherapy outcome research. Clinical Psychology Review, 7, 1-23.

Casey, R. J., & Berman, J. S. (1985). The outcome of psychotherapy with children. Psychological Bulletin, 98, 388-400.
Heimberg, R. G., & Becker, R. E. (1984). Comparative outcome research. In M. Hersen, L. Michelson, & A. S. Bellack (Eds.), Issues in psychotherapy research (pp. 251-283). New York: Plenum Press.

Hinkle, D. E., & Oliver, J. D. (1983). How large should the sample be? A question with no simple answer? Or . . . Educational and Psychological Measurement, 43, 1051-1060.

Kazdin, A. E. (1980). Research design in clinical psychology. New York: Harper & Row.

Kazdin, A. E. (1986). Comparative outcome studies of psychotherapy: Methodological issues and strategies. Journal of Consulting and Clinical Psychology, 54, 95-105.

Kazdin, A. E. (1988). Child psychotherapy: Developing and identifying effective treatments. New York: Pergamon Press.

Kazdin, A. E., & Wilcoxon, L. A. (1976). Systematic desensitization …

… Multiple settings, treatments, and criteria. Journal of Consulting and Clinical Psychology, 42, 471-481.

Paul, G. L. (1967). Outcome research in psychotherapy. Journal of Consulting Psychology, 31, 109-118.

Prioleau, L., Murdock, M., & Brody, N. (1983). An analysis of psychotherapy versus placebo studies. The Behavioral and Brain Sciences, 8, 275-285.

Rachman, S. J., & Wilson, G. T. (1980). The effects of psychological therapy (2nd ed.). Oxford, England: Pergamon Press.

Rosenthal, R. (1979). The "file drawer problem" and tolerance for null results. Psychological Bulletin, 86, 638-641.

Rossi, J. S., Rossi, S. R., & Cottrill, S. D. (1984, April). Statistical power of research in social and abnormal psychology: What have we gained in 20 years? Paper presented at the meeting of the Eastern Psychological Association, Baltimore, MD.