Effect Size of Treatment Effects

[REVIEW]
ESTIMATING THE SIZE OF TREATMENT

EFFECTS: Moving Beyond P Values
by JAMES J. MCGOUGH, MD AND STEPHEN V. FARAONE, PhD
Dr. McGough is Professor of Clinical Psychiatry at the Semel Institute for Neuroscience and Human Behavior and David Geffen School of Medicine at
the University of California, Los Angeles. Dr. Faraone is Professor of Psychiatry and Behavioral Sciences at the State University of New York Upstate
Medical University, Syracuse.
Psychiatry (Edgemont) 2009;6(10):21–29
ABSTRACT
Objective: To increase
understanding of effect size
calculations among clinicians who
over-rely on interpretations of P
values in their assessment of the
medical literature.
Design: We review five methods
of calculating effect sizes: Cohen’s d
(also known as the standardized
mean difference)—used in studies
that report efficacy in terms of a
continuous measurement and
calculated from two mean values and
their standard deviations; relative
risk—the ratio of patients responding
to treatment divided by the ratio of
patients responding to a different
treatment (or placebo), which is
particularly useful in prospective
clinical trials to assess differences
between treatments; odds ratio—
used to interpret results of
retrospective case-control studies
FUNDING: Funding for the development of this manuscript was provided by Shire Development
and provide estimates of the risk of
Inc.
side effects by comparing the
probability (odds) of an outcome FINANCIAL DISCLOSURE: Dr. McGough has served as a consultant to and received research
occurring in the presence or absence support from Eli Lilly & Company, Shire Pharmaceuticals, and the National Institutes of Health.
of a specified condition; number In the past year, Dr. Stephen Faraone has received consulting fees and has been on Advisory
needed to treat—the number of Boards for Eli Lilly and Shire and has received research support from Eli Lilly, Pfizer, Shire,
subjects one would expect to treat and the National Institutes of Health. In previous years, Dr. Faraone has received consulting
with agent A to have one more fees or has been on advisory boards or has been a speaker for the following sources: Shire,
success (or one less failure) than if McNeil, Janssen, Novartis, Pfizer and Eli Lilly. In previous years, he has received research
the same number were treated with support from Eli Lilly, Shire, Pfizer and the National Institutes of Health.
agent B; and area under the curve
ADDRESS CORRESPONDENCE TO: James J. McGough, MD, 300 UCLA Medical Plaza, Suite
(also known as the drug-placebo 1414, Los Angeles, CA 90095; E-mail: jmcgough@mednet.ucla.edu; Phone: (310) 794-7841;
response curve)—a six-step process Fax: (310) 267-0378.
that can be used to assess the effects
of medication on both worsening and KEY WORDS: Effect sizes, clinical trials, comparing treatment outcomes, pediatric
improvement and the probability that psychopharmacology
[VOLUME 6, NUMBER 10, OCTOBER] Psychiatry 2009 21

a medication-treated subject will necessary to detect differences difference in outcomes under two
have a better outcome than a between groups.1 The power of a different treatment interventions.
placebo-treated subject. particular study (defined as 1 minus Effect sizes thus inform clinicians
Conclusion: Effect size statistics the specified Type II error) is the about the magnitude of treatment
provide a better estimate of probability of finding a true effects. Some methods can also
treatment effects than P values difference between groups if one indicate whether the difference
alone. exists, given a study’s sample size observed between two treatments is
and assuming the anticipated clinically relevant. An effect size
INTRODUCTION magnitude of treatment effect estimate provides an interpretable
The purposes of this article are to reflects the actual magnitude. value on the direction and
review methods for comparing the Despite attention to these issues as a magnitude of an effect of an
efficacy of interventions across part of experimental design, an over- intervention and allows comparison
studies and to provide examples of reliance on P values in the of results with those of other studies
how these methods can be used to interpretation of clinical trial results that use comparable measures.2,3
make treatment decisions. can lead either to a naïve acceptance Interpretation of an effect size,
Significance (the probability that an of marginal treatments as efficacious however, still requires evaluation of
observed outcome of an experiment or the rejection of potentially the meaningfulness of the clinical
or trial is not due to chance alone), efficacious treatments for a lack of change and consideration of the
direction (positive or negative), statistical significance. study size and the variability of the
magnitude (absolute or relative An issue related to P values is the results. Moreover, similar to
size), and relevance (the degree to 95-percent confidence interval (CI). statistical significance, effect sizes
which a result addresses a research A 95-percent CI reveals a range of are also influenced by the study
topic) are critical elements in the values around a sample mean in design and random and
interpretation of treatment effects, which one can assume, with 95- measurement error. Effect size
but more often than not, clinicians percent certainty, that the true controls for only one of the many
commonly judge results from clinical population mean is found. If the 95- factors that can influence the results
trials based on their interpretation of percent CIs of two treatments do not of a study, namely differences in
P values, a measure of statistical overlap, by definition the means will variability. The main limitation of
significance only. Regardless of the be significantly different at a level of effect size estimates is that they can
statistic used, when outcomes from P<0.05. only be used in a meaningful way if
two treatments are different at a While a significant P value there is certainty that compared
level of P<0.05, it is generally suggests that something nonrandom studies are reasonably similar on
assumed that one treatment, often has occurred, it does not inform us study design features that might
an active drug, is superior to the about the clinical significance of the increase or decrease the effect size.
other, often a placebo. More nonrandom effect. For example, the For example, the comparison of
specifically, a P value indicates the magnitude of statistical significance effect sizes is questionable if the
probability, given a particular data is heavily influenced by the number studies differed substantially on
set with a particular design and of patients studied. Other things design features that might plausibly
sample size, that differences being equal, larger samples yield influence drug/placebo differences,
detected between two treatments more significant P values than such as the use of double-blind
are merely due to chance and do not smaller samples. Consequently, a methodology in one study and non-
reflect true differences. A P value large trial of a marginally effective blinded methodology in the other. It
gives a measure of Type I error—the treatment can have greater statistical would be impossible to determine
probability of incorrectly detecting a significance than a small trial of a whether the difference in effect size
difference between two treatments highly effective treatment. When the was attributable to differences in
when no difference exists. A related results of separate clinical trials are drug efficacy or differences in
concept, though less well reported as statistically significant, methodology. Alternatively, if one of
appreciated, is Type II error—the one cannot base decisions about the two studies being compared used a
probability of failing to detect relative effects of the different highly reliable and well-validated
significant differences when in fact treatments by comparing P values.2 outcome measure while the other
they do exist. Careful consideration used a measure of questionable
is required when attempting to EFFECT SIZE reliability and validity, these different
balance Type I (erroneous rejection An effect size is a statistical endpoint outcome measures could
of the null hypothesis) and Type II calculation that can be used to also lead to results that would not be
(erroneous failure to reject the null compare the efficacy of different meaningful.
hypothesis) errors because agents by quantifying the size of the Five methods of calculating effect
assumptions based on the topic difference between treatments. It is sizes are reviewed: (1) Cohen’s d,
studied will affect the sample sizes a dimensionless measure of the (2) relative risk (RR) ratios, (3)
22 Psychiatry 2009 [VOLUME 6, NUMBER 10, OCTOBER]

odds ratios (ORs), (4) number the Cohen’s d scores may not be TABLE 1. Conversion of Cohen’s d scores
needed to treat (NNT), and (5) area statistically significant. While P to percentiles5
under the curve (AUC). We include values are used to assess whether or
examples of each method. not an effect exists, the use of PERCENTAGE OF
95-percent CIs allow for assessment CONTROL GROUP WHO
1. COHEN’S d of uncertainty in the magnitude of COHEN’S d
WOULD HAVE A SCORE
Cohen’s d is used when studies the effect. Cohen’s d values are more BELOW THE AVERAGE
report efficacy in terms of a reliable when used in drug/drug (or SUBJECT IN THE
EXPERIMENTAL GROUP
continuous measurement, such as a placebo) comparisons than in
score on a rating scale.4 Cohen’s d is baseline/endpoint comparisons, 0.0 50
also known as the standardized mean although they are often used in the 0.1 54
difference. It should be noted that latter. It is important to note that the
the term effect size is sometimes interpretation of Cohen’s d can be 0.2 58
used to mean Cohen’s d in particular problematic in samples with non- 0.3 62
and not the several other methods normal distributions or restricted 0.4 66
discussed in this review. Of the four ranges, or if the measurement from
critical elements in the which it was derived had unknown 0.5 69
interpretation of treatment effects, reliability.5 0.6 73
Cohen’s d is most useful for Another method of interpreting
0.7 76
assessing magnitude of effects. effect sizes, provided by Coe (Table
Cohen’s d is calculated from two 1), converts Cohen’s d scores to 0.8 79
mean values and their standard percentiles.5 For example, in Table 1, 0.9 82
deviation (SD). These can be a Cohen’s d of 0.6 signifies that the
endpoint scores in response to drug mean score of subjects in the 1.0 84
A versus drug B, endpoint change experimental group is 0.6 SDs above 1.2 88
scores, or endpoint change scores the mean score of subjects in the
for drug A versus drug B. Cohen’s d control group, and that the mean 1.4 92
is computed with the following score of the experimental group 1.6 95
formula: exceeds the scores of 73 percent of
1.8 96
those in the control group.
Cohen’s d = To illustrate how Cohen’s d scores 2.0 98
Mean of experimental are calculated, data from placebo- 2.5 99
group - Mean of control group controlled studies by The Research
3.0 99.9
Pooled SD for entire sample Unit on Pediatric Psychopharmacology
Anxiety Study Group (RUPP) and by
A Cohen’s d score of zero means Emslie et al are used to demonstrate fluoxetine was significantly lower
that the treatment and comparison the process.6,7 than in the placebo group (38.4±14.8
agent have no differences in effect. A In the RUPP study of 128 children vs. 47.1±17.0; P=0.002), a difference
Cohen’s d greater than zero indicates and adolescents with anxiety of -8.7 (95% CI: -15.2,
the degree to which one treatment is disorders treated with fluvoxamine -2.2) and Cohen’s d=0.55
more efficacious than the other.3 A or placebo, the primary study (47.1-38.4=8.7; 8.7/15.9 [pooled
conventional rule is to consider a outcome was the Pediatric Anxiety SD]=0.55). Using the conventional
Cohen’s d of 0.2 as small, 0.5 as Rating Scale, a continuous measure.6 approach to interpreting this
medium, and 0.8 as large.4 A Cohen’s The final mean (±SD) score in the measure,4 the RUPP study suggests
d score is frequently accompanied by group treated with fluvoxamine was that fluvoxamine treatment of
a confidence interval (CI) so that the significantly lower than in the pediatric anxiety has a large effect
reliability of the comparison can be placebo group (9.0±7.0 vs. 15.9±5.3; size, while the Emslie et al study
assessed. Calculation of a 95-percent P<0.001), a difference of -6.9 (95% suggests that fluoxetine treatment of
CI around the Cohen’s d score can CI: -4.6, -9.2), and Cohen’s d=1.1 pediatric depression has a medium
facilitate the comparison of effect (15.9-9.0=6.9; 6.9/6.21 [pooled effect size.6,7 Of note, the Cohen’s d
sizes of different treatments. When SD]=1.1). In the Emslie et al study of score of 0.55 from the Emslie data
effect sizes of similar studies have 96 children and adolescents with means that 73 percent of placebo
CIs that do not overlap, this suggests depression treated with fluoxetine or patients had worse scores than the
that the Cohen’s d scores are likely placebo, the primary study outcome average fluoxetine-treated patient
to represent true differences was the Children’s Depression Rating (Table 1). The Cohen’s d of 1.1 in
between the studies. Effect sizes Scale-Revised, another continuous the RUPP study means that at least
that have overlapping CIs suggest measure.7 The final mean (±SD) 84 percent of the placebo group had
that the difference in magnitude of score in the group treated with worse scores than the average

TABLE 2. Use of Cohen’s d in studies of stimulants and nonstimulants in fluvoxamine-treated patient.
children, adolescents, and adults with ADHD, major depressive episode, and The growing appreciation of the
dysthymic disorder* value of an effect size, such as
Cohen’s d in the interpretation of
STUDY TREATMENT, SUBJECTS, AND SCALESa COHEN’S d clinical trials, in contrast to simple
reliance on whether or not P values
Biederman et al, 2003 XR-MPH (n=65) vs. placebo (n=71). Children. are “significant,” has led to an
0.9
(2 weeks)8 Conners ADHD/DSM-IV Scale
increased emphasis on effect sizes in
Faraone and 30mg, 1.4 clinical reports. Recognizing the
Lisdexamfetamine (n=218) vs. placebo (n=72).
Schreckengost, 2007 50mg, 1.4 limitations of comparing Cohen’s d
Children. ADHD Rating Scale
(4 weeks)9 70mg, 1.7 scores across studies with different
Faraone et al, 2004 IR-MPH (n=139) vs. placebo (n=113). Adults. populations, designs, and outcomes,
Crossover, 0.9; a summary of studies that report
(meta-analysis of 6 4 crossover, 2 parallel studies.
Parallel, 0.7 Cohen’s d in medication trials of
studies)10 ADHD Rating Scale
attention deficit hyperactivity
Faraone et al, 2002 IR-MAS (n=108) vs. IR-MPH (n=108). Children
0.25 (pooled); disorder (ADHD) appears in Table
(meta-analysis of 4 and adolescents. IOWA Conners Rating Scale,
range, –0.5–1.0 2.8–22
studies)11 ADHD Rating Scale, global scale
Implications for the clinician.
Kelsey et al, 2004 Atomoxetine (n=133) vs. placebo (n=64). Although Cohen suggested
0.7
(8 weeks)12 Children. ADHD Rating Scale conventions by which to interpret
Transdermal MPH vs. transdermal placebo the robustness of clinical effect
McGough et al, 2006 sizes,4 values derived for Cohen’s d
(N=79). Children. Crossover study. SKAMP 0.9
(1 week)13 do not have direct clinical
deportment scale
applicability. For example, although
Michelson et al, 2002 Atomoxetine (n=85) vs. placebo (n=85). Children deemed a “small” effect, Cohen’s d in
0.7
(6 weeks)14 and adolescents. ADHD Rating Scale
the range of 0.2 from a study
Study 1, atomoxetine (n=141) vs. placebo comparing treatment-related
Michelson et al, 2003 (n=139); Study 2, atomoxetine (n=129) vs. Study 1, 0.4; mortality rates for two
(10 weeks)15 placebo (n=127). Adults. Conners Adult ADHD Study 2, 0.4 chemotherapies for breast cancer
Rating Scale would have far greater clinical
consequence than a “large” effect of
Pelham et al, 2001 OROS-MPH vs. IR-MPH (N=68). Children.
2.0 0.8 from a study of treatment-related
(7 days)16 Crossover study. IOWA Conners Rating Scale
ADHD symptom reduction. More to
Spencer et al, 2002 Atomoxetine (n=129) vs. placebo (n=124). the point, one might choose a cancer
0.7 intervention with a better side-effect
(12 weeks)17 Children. ADHD Rating Scale
profile, provided the effect size
OROS-MPH or IR-MPH vs. placebo (n=64). OROS-MPH, difference for two treatments, with a
Swanson et al, 2003
Children. Crossover study. IOWA Conners 1.7; IR-MPH, concomitant difference in survival,
(7 days)18
Rating Scale 1.6
was small; whereas, small treatment
Weisler et al, 2006 XR-MAS (n=188) vs. placebo (n=60). Adults. effects in behavioral disorders, given
0.8
(4 weeks)19 ADHD Rating Scale the great degree of individual
response variability, are not usually
Wigal et al, 2005 XR-MAS (n=102) vs. atomoxetine (n=101). XR-MAS, 1.1; meaningful. Estimates of Cohen’s d
(3 weeks)20 Children. SKAMP deportment scale atomoxetine, 0.2 mainly serve to maximize successful
outcomes in randomized clinical
Wilens et al, 2005 XR-bupropion (n=81) vs. placebo (n=81). Adults. trials by providing a basis for
0.6
(8 weeks)21 ADHD Rating Scale determining study samples sizes
needed to provide sufficient power to
OROS-MPH, detect meaningful treatment effects.
Wolraich et al, 2001 OROS-MPH (n=95) or IR-MPH (n=97) vs. placebo
1.1; IR-MPH,
(1–4 weeks)22 (n=90). Children. IOWA Conners Rating Scale
1.0
2. RELATIVE RISK (RR)
a Cohen’s d is useful for estimating
All were parallel studies unless noted otherwise.
ADHD = attention-deficit/hyperactivity disorder; DSM-IV = Diagnostic and Statistical Manual, effect sizes from quantitative or
Fourth Edition; IOWA = Inattention and Overactivity With Aggression; IR = immediate dimensional measures. For
release; MAS = mixed amphetamine salts; MPH = methylphenidate; OROS = osmotic-release categorical measures, such as
oral system; SKAMP = Swanson, Kotkin, Agler, M-Flynn, and Pelham; XR = extended- “improved” versus “not improved” or
release. “present” versus “absent,” two
measures that can be used to assess

effects are RR and OR. Consideration in most clinical reports. It is intuitive manner. Like RR, OR can be
of RR is particularly useful in important to note that RR must be useful for assessing magnitude,
prospective clinical trials to assess interpreted in the context of the direction, and relevance of effects.
differences in treatments. When actual probability of the events An OR is computed as the ratio of
attempting to interpret treatment occurring. For example, if a study two odds: the odds that an event will
effects, RR can be useful for showed that the incidence of an occur compared with the probability
assessing magnitude, direction, and adverse event on medication was five that it will not occur. Specifically, it is
relevance of effects. percent (0.05) and with placebo it as follows:
The RR is the ratio of patients was one percent (0.01), then the RR
improving in a treatment group would be 5.0, meaning that the risk OR= Proportion with a given trait
divided by the probability of patients of experiencing the event on the improved (or worsened)/not
improving in a different treatment drug would be fivefold greater than improved (or not worsened)
(or placebo) group: on placebo. In assessing response to Proportion without a given trait
treatment, a response rate of 90 improved (or worsened)/not
RR= Probability of patients percent for Treatment A and a improved (or not worsened)
improving on medication response rate of 45 percent for
Probability of patients Treatment B yields an RR of 2 For a clinical trial, an OR indicates
improving on placebo (twofold greater probability of the increase in the odds of
responding to A). improvement (or worsening) that
RR is easy to interpret and RR, which reflects a ratio between can be associated with a secondary
consistent with the way in which two conditions, needs to be trait under consideration. An OR can
clinicians generally think. RR ratios differentiated from absolute risk, also have an associated CI so that the
can range from zero to infinity. In a calculated as a ratio or percentage of reliability of the comparison can be
study of two treatments, an RR of 1 an event occurring within one group assessed.
indicates that outcomes did not without any context. Differences in The OR has particular utility in
differ in the two groups, while an RR absolute risk are calculated as the the interpretation of retrospective
of 3 indicates that the treatment ratio or percentage of an event in case-controlled comparisons. It is
group had a threefold greater one group minus the same within a less frequently used in the
probability than the placebo group of comparison group. In the RUPP interpretation of randomized,
showing improvement. example described previously, controlled trials, but can be used to
In the RUPP anxiety study, 76 calculation of absolute risk (76% assess both positive outcomes (such
percent of the subjects receiving response in the active group minus as improvement) and adverse events.
fluvoxamine were treatment 29% in the control group) indicates Compared with RR, the OR has value
responders according to Clinical that 47-percent more subjects in predicting the likelihood of
Global Impressions (CGI) responded to fluvoxamine compared outcomes with low frequency, such
improvement ratings, compared with with placebo. This conveys as side effects, or to assess
29 percent in the placebo group.6 information differently than the RR, differential treatment effects in
This yields an RR of 2.6, suggesting which, as discussed, suggests a subjects with various secondary
that pediatric patients treated with threefold probability of improvement characteristics, such as sex, age
fluvoxamine for anxiety disorders with active treatment. group, or comorbid condition.
had an almost threefold greater A recent study used OR to assess
probability of responding than those 3. ODDS RATIO (OR) some characteristics of patients with
on placebo. In the pediatric While RR is an appropriate and without a diagnosis of
depression study,7 56 percent of the measure for prospective studies, dissociative disorders.23 The
fluoxetine-treated subjects versus 33 such as randomized clinical trials or respondents were 231 consecutive
percent of those on placebo were cohort studies, OR is suitable for admissions to an inner-city
treatment responders, suggesting an case-control studies, usually when outpatient psychiatric clinic. Of the
RR of 1.7. These results are subjects with a given characteristic 82 respondents who completed the
consistent with the respective effect are compared with those without the Dissociative Disorders Interview
sizes derived from each study. characteristic. An additional benefit Schedule, 24 were diagnosed with a
Implications for the clinician. of using OR as opposed to RR is that dissociative disorder and 58 were
While Cohen’s d is useful as a basis by using the log of OR in statistical not. Of those with dissociative
for the design of clinical trials, modeling, confounding variables can disorders, childhood physical abuse
calculations of RR provide clinicians be controlled. Although RR may be was reported in 17 (71%) and was
with a more intuitively obvious easier to understand in terms of denied in seven (29%). Of those
means of assessing treatment evaluating the meaningfulness of without dissociative disorders,
efficacy. The statistic is easily differences, OR can be used for the childhood physical abuse was
calculated from information provided same purpose, albeit in a less reported in 17 (29%) and was denied

four-repeat allele (OR: 3.0; 95% CI:
TABLE 3. Number needed to treat (NNT) analyses of data from 10 studies of children, 1.14, 8.14; P=0.03) as compared to
adolescents, and adults With ADHD children with other genotypes, while
the odds of developing social
STUDY TREATMENT, SUBJECTS, AND SCALESa NNT withdrawal with increasing dose were
almost fourfold greater for children
Biederman et al, LDX (n=214) vs. placebo (n=72). Children. with any copy of the seven-repeat
2
2007 (4 weeks)27 % CGI respondersa allele (OR: 3.92; 95% CI: 1.20, 12.86;
P=0.01).24
Greenhill et al, 2002 MR-MPH (n=155) vs. placebo (n=159). Implications for the clinician.
3
(3 weeks)28 Children and adolescents. % CGI respondersa ORs are useful for making inferences
about the risk or probability of one
Atomoxetine + fluoxetine (n=127) vs. outcome versus another, whether
Kratochvil et al, atomoxetine+placebo (n=46). Children and ADHD-RS, 12; these are rates of positive response or
2005 (5 weeks)29 adolescents. % responders: ADHD-RS or CDRS-R, 6 some adverse outcome. While an OR
CDRS-R scores 1 SD of age and sex norms
might be statistically significant, the
Bupropion vs. increased risk for differences in
Kuperman et al, SR-bupropion (n=11) vs. MPH (n=8) vs. outcome between groups might be
placebo, 3;
2001 (7 weeks)30 placebo (n=8). Adults. % CGI respondersa
MPH vs. placebo, 4 quite small and lack clinical
significance if multiple variables
Pliszka et al, 2000 MAS (n=20) vs. MPH (n=20) vs. placebo MAS vs. placebo, 2;
moderate response. ORs are most
(3 weeks)31 (n=18). Children. % CGI respondersa MPH vs. placebo, 3
useful in assessing characteristics
OROS-MPH (n=47) vs. placebo. Crossover associated with low-frequency
Reimherr et al, 2007
study. Adults. % responders: >50% 3 outcomes, such as differential
(4 weeks)32
improvement in WRAADDS scores responses between various patient
groups or in assessing side effects. ORs
Riggs et al, 2004 Pemoline (n=34) vs. placebo (n=33).
5 do not easily provide estimates of
(12 weeks)33 Adolescents. % CGI respondersa
clinical effect size comparable to
Cognitive-behavioral therapy + ADHD Cohen’s d and do not provide a
Safren et al, 2005
medications (n=16) vs. ADHD medications 2 reasonable basis for sample size
(15 weeks)34
(n=15). Adults. % CGI respondersa estimates in clinical trials. When
MPH (n=104) vs. placebo (n=42). Adults. % available, ORs provide additional
Spencer et al, 2005 information for predicting risk versus
responders: % CGI respondersa + >30% 2
(6 weeks)35 benefit in individual patients.
reduction in AISRS scores
Weiss et al, 2006 DEX (n=23) vs. paroxetine (n=24) vs. placebo DEX vs. paroxetine, 4. NUMBER NEEDED TO TREAT
(20 weeks)35 (n=26). Adults. % CGI respondersa 2; vs. placebo, 2
(NNT)
a
Patients with CGI ratings of much or very much improved (score 1 or 2). The NNT is defined as the number
ADHD=attention-deficit/hyperactivity disorder; ADHD-RS=ADHD Rating Scale IV; of subjects one would expect to treat
AISRS=Adult ADHD Investigator System Report Scale; CDRS-R=Children’s Depression with agent A to have one more success
Rating Scale–Revised; CGI=Clinical Global Impressions scale; DEX=dextroamphetamine; (or one less failure) than if the same
LDX=lisdexamfetamine; MAS=mixed amphetamine salts; MPH=methylphenidate; number were treated with agent B.
MR=modified release; OROS=osmotic-release oral system; SD=standard deviation;
The NNT has been described as the
SR=sustained release; WRAADDS=Wender-Reimherr Adult Attention Deficit Disorder Scale;
XR=extended release. effect size estimate “that seems to best
reflect clinical significance...for binary
(success/failure) outcomes.”2 NNT is a
in 41 (71%; OR: 5.86 [17/7]/[17/41], side effects that commonly occur measure related to absolute risk
P<0.001).23 This means that the odds with stimulants.24 Although reduction and may be most useful in
of having suffered childhood physical preliminary and requiring replication, assessing relevance of treatment
abuse were sixfold greater for results suggested that patient effects. The NNT is computed as
patients with dissociative disorders genotypes predicted risk for follows:
as compared to those without developing certain side effects. For
dissociative disorders. example, the odds of manifesting NNT = 100 divided by % improved on
The OR is frequently used to picking behaviors were threefold treatment - % improved on placebo
provide estimates of side effects risk. greater for children who were
In the Preschool ADHD Treatment homozygous for the dopamine In the RUPP anxiety study, 76
Study, children were treated with receptor D4 variable-number tandem percent of the subjects receiving
methylphenidate and assessed for repeat (DRD4-VNTR) polymorphism fluvoxamine versus 29 percent of the

placebo group were treatment
responders.6 This yields NNT=2.1
(100/[76-29]), suggesting that for
every two patients treated with
active medication, at least one will
have a better outcome than if treated
with placebo. In the fluoxetine
depression study, 56 percent of the
subjects receiving fluoxetine versus
33 percent of the placebo group were
treatment responders,7 yielding
NNT=4.3(100/[56-33]). This suggests
that it is necessary to treat four
patients with active medication to
get one better response than
treatment with placebo. These two
NNT values are consistent with
previously described effect sizes and
RRs for improvement.
NNT is also useful for assessing FIGURE 1. Drug-placebo response curve of total Conners Adult ADHD Rating Scale scores in
the risk of negative outcomes. In patients receiving atomoxetine or placebo. Used with permission from Biomed Central.
another study of adolescents with Faraone SV, et al. Behav Brain Funct. 2005;1:16.38
depression, suicide-related events
were reported in 4.5 percent of the change in a symptom score from (P=0.002 and 0.008, respectively).37
subjects treated with fluoxetine and baseline to endpoint; (2) for the drug Figure 1, based on previous
in 2.7 percent of those taking and placebo groups separately, analyses of investigator-rated total
placebo.25 This yields calculate for each observed score the CAARS scores,37 illustrates the
NNT=56(100/[4.5-2.7]), suggesting proportion of patients having that application of the drug-placebo
that 56 patients would need to be score or a better score; (3) for each response curve to identify additional
treated with active medication to observed score, plot the proportions differences in treatment groups. In
observe one additional patient computed in step 2 for the drug this figure, atomoxetine is shown to
evidencing suicidality compared with group on the vertical axis and for the have produced a drug-placebo
placebo. placebo group on the horizontal axis; response curve that is always above
NNTs for 10 studies of ADHD (4) connect the plotted points and the diagonal, indicating that
treatment in children, adolescents, label those that correspond to the atomoxetine was more efficacious
and adults are shown in Table 3.26–35 best response as the 25th percentile, than placebo throughout the full
Implications for the clinician. the median response as the 75th range of outcome scores. The AUC of
NNT is an easily calculated statistic percentile, and the worst response; 0.60 indicates that 60 percent of
that is readily interpretable and can (5) if the outcome variable is a subjects treated with atomoxetine
provide guidance as to the likelihood change score, also label the point showed symptom improvement,
of positive or negative outcomes in corresponding to no change; and (6) while only approximately 40 percent
actively treated versus placebo plot the diagonal line of no effect (no of those treated with placebo
groups. difference between the drug and improved. The point in the figure
placebo groups). labeled –7 is located at coordinates
5. AREA UNDER THE CURVE This approach has been applied to 0.54 and 0.4, indicating that a
(AUC) data from two studies of atomoxetine reduction of –7 in CAARS total
The method generally known as versus placebo in adults with scores was seen in 54 percent of
the AUC or “area under the ROC ADHD.37 Subjects (N=267 in study 1 patients treated with atomoxetine
curve” (ROC standing for “receiver and N=248 in study 2) received 60 to and in 40 percent of those given
operating characteristic”)2 has also 120mg of atomoxetine or placebo placebo. If clinical improvement is
been described as the “drug-placebo daily for 10 weeks and were assessed defined as a change score on the
response curve,” a generalization of with a variety of categorical and CAARS between –1 and –14, the
the ROC curve.36 This measure is quantitative outcomes. In both figure shows that a majority of
most useful for assessing relevance studies atomoxetine was significantly patients judged as responsive
of treatment effects. more efficacious than placebo received atomoxetine rather than
There are six steps to developing according to investigator and patient placebo. A value of zero indicates no
a drug-placebo response curve:36 (1) responses on the Conners Adult change in CAARS scores. Among the
choose an outcome variable, such as ADHD Rating Scale (CAARS) patients receiving atomoxetine, 82

percent had a score of >0, indicating clinical research could result in children and adolescents. N Engl J
that 18 percent experienced a better designed and more relevant Med. 2001;344:1279–1285.
worsening of symptoms. In contrast, studies. Clinicians need to bear in 7. Emslie GJ, Rush AJ, Weinberg WA,
71 percent of placebo patients had a mind that effect size estimates from et al. A double-blind, randomized,
score of >0 and thus 29 percent clinical trials, usually based on active placebo-controlled trial of
experienced a worsening of drug versus placebo comparisons, do fluoxetine in children and
symptoms.37 not correlate directly with patient adolescents with depression. Arch
Implications for the clinician. responses in real-world clinical Gen Psychiatry.
Although ROC methodology is fairly practice. Nonetheless, a greater 1997;54:1031–1037.
complex, it is instructive in two awareness of these approaches 8. Biederman J, Quinn D, Weiss M, et
regards. First, as shown in the above among clinicians would result in a al. Efficacy and safety of Ritalin®
example, the method highlights the more nuanced appreciation of the LA™, a new, once daily, extended-
effects of medication on both clinical trials literature than reliance release dosage form of
worsening and improvement. on P values alone. In turn, this methylphenidate, in children with
Although reports from clinical trials knowledge of treatment effect attention deficit hyperactivity
focus on the latter, it is of equal estimation may be used to promote disorder. Pediatr Drugs.
importance to know if the placebo patient compliance through the use 2003;5:833–841.
group shows more worsening than of epidemiological evidence 9. Faraone SV, Schreckengost J.
the treatment group. This helps describing the risks of behaviors or Lisdexamfetamine dimesylate
clinicians when they weigh the pros the benefits of treatments or effect size in children with
and cons of treatment. Although the interventions. attention-deficit/hyperactivity
choice to not treat with medication is disorder. Presented at 54th Annual
sometimes justified on the basis of ACKNOWLEDGMENTS Meeting of the American Academy
avoiding adverse events, not treating Editorial assistance was provided of Child & Adolescent Psychiatry;
has the adverse effect of worsening by Timothy Coffey, Robert Gregory, 2007 Oct 23-–28; Boston.
outcome in many cases. Another Rosa Real, and William Perlman, 10. Faraone SV, Spencer T, Aleardi M,
value of ROC is that the AUC Excerpta Medica, Bridgewater, New et al. Meta-analysis of the efficacy
statistic can be interpreted as the Jersey. of methylphenidate for treating
probability that a patient treated adult attention-deficit/hyperactivity
with medication will have a better REFERENCES disorder. J Clin
outcome than a placebo-treated 1. Schulz KF, Grimes DA. The Lancet Psychopharmacol. 2004;24:24–29.
patient. This can be useful when Handbook of Essential Concepts in 11. Faraone SV, Biederman J, Roe C.
communicating outcome probabilities Clinical Research. New York (NY): Comparative efficacy of Adderall
to patients or their parents. Elsevier Limited;2006. and methylphenidate in attention-
2. Kraemer HC, Kupfer DJ. Size of deficit/hyperactivity disorder: a
CONCLUSION treatment effects and their meta-analysis. J Clin
Several statistical methods have importance to clinical research and Psychopharmacol.
been described to help evaluate the practice. Biol Psychiatry. 2002;22:468–473.
magnitude of the differences 2006;59:990–996. 12. Kelsey DK, Sumner CR, Casat CD,
between two or more interventions 3. Faraone SV. Understanding the et al. Once-daily atomoxetine
in clinical studies and that provide a effect size of ADHD medications: treatment of children with
better estimate of treatment effects implications for clinical care. attention-deficit/hyperactivity
than simple reliance on P values. Medscape Psychiatry & Mental disorder, including an assessment
These include Cohen’s d (also known Health. 2003;8(2). of evening and morning behavior: a
as the standard mean difference), 4. Cohen J. Statistical Power double-blind, placebo-controlled
RR, OR, NNT, and AUC (also known Analysis for the Behavioral trial. Pediatrics. 2004;114:e1–e8.
as the drug-placebo response curve). Sciences, Second Edition. Hillsdale 13. McGough JJ, Wigal SB, Abikoff H,
None of these methods alone can be (NJ): Eribaum;1988. et al. A randomized, double-blind,
used to assess all four critical 5. Coe R. It’s the effect size, stupid: placebo-controlled, laboratory
elements in the interpretation of what effect size is and why it is classroom assessment of
treatment effects (significance, important. Presented at British methylphenidate transdermal
direction, magnitude, and relevance) Educational Research Association system in children with ADHD. J
and, therefore, a perfect measure of Annual Conference; 2002 Sept Atten Disord. 2006;9:476–485.
effect size does not exist. It is 12–14; Exeter, England. 14. Michelson D, Allen AJ, Busner J, et
incumbent on the clinician to put 6. The Research Unit on Pediatric al. Once-daily atomoxetine
these pieces of information together Psychopharmacology Anxiety treatment for children and
appropriately. Toward that end, more Study Group. Fluvoxamine for the adolescents with attention deficit
general use of these methods in treatment of anxiety disorders in hyperactivity disorder: a

randomized, placebo-controlled Prevalence of dissociative methylphenidate in adults with
study. Am J Psychiatry. disorders in psychiatric ADHD with assessment of
2002;159:1896–1901. outpatients. Am J Psychiatry. oppositional and emotional
15. Michelson D, Adler L, Spencer T, et 2006;163:623–629. dimensions of the disorder. J Clin
al. Atomoxetine in adults with 24. McGough J, McCracken J, Swanson Psychiatry. 2007;68:93–101.
ADHD: two randomized, placebo- J, et al. Pharmacogenetics of 32. Riggs PD, Hall SK, Mikulich-
controlled studies. Biol methylphenidate response in Gilbertson SK, et al. A randomized
Psychiatry. 2003;53:112–120. preschoolers with ADHD. J Am controlled trial of pemoline for
16. Pelham WE, Gnagy EM, Burrows- Acad Child Adolesc Psychiatry. attention-deficit/hyperactivity
Maclean L, et al. Once-a-day 2006;45:1314–1322. disorder in substance-abusing
Concerta methylphenidate versus 25. Emslie G, Kratochvil C, Vitiello B, adolescents. J Am Acad Child
three-times-daily methylphenidate et al; Columbia Suicidality Adolesc Psychiatry.
in laboratory and natural settings. Classification Group; TADS Team. 2004;43:420–429.
Pediatrics. 2001;107(6). Treatment for Adolescents with 33. Safren SA, Otto MW, Sprich S, et
17. Spencer T, Heiligenstein JH, Depression Study (TADS): safety al. Cognitive-behavioral therapy for
Biederman J, et al. Results from 2 results. J Am Acad Child Adolesc ADHD in medication-treated adults
proof-of-concept, placebo- Psychiatry. 2006;45:1440–1455. with continued symptoms. Behav
controlled studies of atomoxetine 26. Biederman J, Krishnan S, Zhang Y, Res Ther. 2005;43:831–842.
in children with attention- et al. Efficacy and tolerability of 34. Spencer T, Biederman J, Wilens T,
deficit/hyperactivity disorder. J lisdexamfetamine dimesylate et al. A large, double-blind,
Clin Psychiatry. (NRP-104) in children with randomized clinical trial of
2002;63:1140–1147. attention-deficit/hyperactivity methylphenidate in the treatment
18. Swanson J, Gupta S, Lam A, et al. disorder: a phase III, multicenter, of adults with attention-
Development of a new once-a-day randomized, double-blind, forced- deficit/hyperactivity disorder. Biol
formulation of methylphenidate for dose, parallel-group study. Clin Psychiatry. 2005;57:456–463.
the treatment of attention- Ther. 2007;29:450–463. 35. Weiss M, Hechtman L; Adult ADHD
deficit/hyperactivity disorder. Arch 27. Greenhill LL, Findling RL, Swanson Research Group. A randomized
Gen Psychiatry. 2003;60:204–211. JM; MPH MR ADHD Study Group. double-blind trial of paroxetine
19. Weisler RH, Biederman J, Spencer A double-blind, placebo-controlled and/or dextroamphetamine and
TJ, et al; SLI381.303 Study Group. study of modified-release problem-focused therapy for
Mixed amphetamine salts methylphenidate in children with attention-deficit/hyperactivity
extended-release in the treatment attention-deficit/hyperactivity disorder in adults. J Clin
of adult ADHD: a randomized, disorder. Pediatrics. 2002;109(3). Psychiatry. 2006;67:611–619.
controlled trial. CNS Spectr. 28. Kratochvil CJ, Newcorn JH, Arnold 36. Faraone SV, Biederman J, Spencer
2006;11:625–639. E, et al. Atomoxetine alone or TJ, Wilens TE. The drug-placebo
20. Wigal SB, McGough JJ, McCracken combined with fluoxetine for response curve: a new method for
JT, et al; SLI381.404 Study Group. treating ADHD with comorbid assessing drug effects in clinical
A laboratory school comparison of depressive or anxiety symptoms. J trials. J Clin Psychopharmacol.
mixed amphetamine salts extended Am Acad Child Adolesc 2000;20:673–679.
release (Adderall XR®) and Psychiatry. 2005;44:915–924. 37. Faraone SV, Biederman J, Spencer
atomoxetine (Strattera®) in school- 29. Kuperman S, Perry PJ, Gaffney T, et al. Efficacy of atomoxetine in
aged children with attention GR, et al. Bupropion SR vs. adult attention-deficit/hyperactivity
deficit/hyperactivity disorder. J methylphenidate vs. placebo for disorder: a drug-placebo response
Atten Disord. 2005;9:275–289. attention deficit hyperactivity curve analysis. Behav Brain
21. Wilens TE, Haight BR, Horrigan JP, disorder in adults. Ann Clin Funct. 2005;1:16.
et al. Bupropion XL in adults with Psychiatry. 2001;13:129–134.
attention-deficit/hyperactivity 30. Pliszka SR, Browne RG, Olvera RL,
disorder: a randomized, placebo- Wynne SK. A double-blind,
controlled study. Biol Psychiatry. placebo-controlled study of
2005;57:793–801. Adderall and methylphenidate in
22. Wolraich ML, Greenhill LL, Pelham the treatment of attention-
W, et al; Concerta Study Group. deficit/hyperactivity disorder. J Am
Randomized, controlled trial of Acad Child Adolesc Psychiatry.
OROS methylphenidate once a day 2000;39:619–626.
in children with attention- 31. Reimherr FW, Williams ED, Strong
deficit/hyperactivity disorder. RE, et al. A double-blind, placebo-
Pediatrics. 2001;108:883–892. controlled, crossover study of
23. Foote B, Smolin Y, Kaplan M, et al. osmotic release oral system

Effect Size of Treatment Effects

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Effect Size of Treatment Effects

Uploaded by

Copyright:

Available Formats

[REVIEW]

ESTIMATING THE SIZE OF TREATMENT

Psychiatry (Edgemont) 2009;6(10):21–29

[VOLUME 6, NUMBER 10, OCTOBER] Psychiatry 2009 21

22 Psychiatry 2009 [VOLUME 6, NUMBER 10, OCTOBER]

[VOLUME 6, NUMBER 10, OCTOBER] Psychiatry 2009 23

24 Psychiatry 2009 [VOLUME 6, NUMBER 10, OCTOBER]

[VOLUME 6, NUMBER 10, OCTOBER] Psychiatry 2009 25

26 Psychiatry 2009 [VOLUME 6, NUMBER 10, OCTOBER]

[VOLUME 6, NUMBER 10, OCTOBER] Psychiatry 2009 27

28 Psychiatry 2009 [VOLUME 6, NUMBER 10, OCTOBER]

[VOLUME 6, NUMBER 10, OCTOBER] Psychiatry 2009 29

You might also like