
This article was downloaded by: [University of Ulster Library]

On: 09 November 2014, At: 05:14


Publisher: Routledge
Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered
office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

The Journal of Experimental Education
Publication details, including instructions for authors and subscription information:
http://www.tandfonline.com/loi/vjxe20

Beyond Cohen's d: Alternative Effect Size Measures for Between-Subject Designs
Chao-Ying Joanne Peng & Li-Ting Chen
Indiana University—Bloomington
Published online: 08 Aug 2013.

To cite this article: Chao-Ying Joanne Peng & Li-Ting Chen (2014) Beyond Cohen's d: Alternative
Effect Size Measures for Between-Subject Designs, The Journal of Experimental Education, 82:1,
22-50, DOI: 10.1080/00220973.2012.745471

To link to this article: http://dx.doi.org/10.1080/00220973.2012.745471

THE JOURNAL OF EXPERIMENTAL EDUCATION, 82(1), 22–50, 2014
Copyright © Taylor & Francis Group, LLC
ISSN: 0022-0973 print/1940-0683 online
DOI: 10.1080/00220973.2012.745471

MEASUREMENT, STATISTICS,
AND RESEARCH DESIGN

Beyond Cohen’s d: Alternative Effect Size Measures


for Between-Subject Designs

Chao-Ying Joanne Peng and Li-Ting Chen


Indiana University—Bloomington

Given the long history of discussion of issues surrounding statistical testing and effect size indices
and various attempts by the American Psychological Association and by the American Educational
Research Association to encourage the reporting of effect size, most journals in education and
psychology have witnessed an increase in effect size reporting since 1999. Yet, effect size was often
reported with only three indices, namely, the unadjusted R2, Cohen's d, and η2, with a simple label of small,
medium, or large according to Cohen's (1969) criteria. In this article, the authors present several
alternatives to Cohen’s d to help researchers conceptualize effect size beyond standardized mean
differences for between-subject designs with two groups. The alternative effect size estimators are
organized into a typology and are empirically contrasted with Cohen’s d in terms of purposes, usages,
statistical properties, interpretability, and the potential for meta-analysis. Several sound alternatives
are identified to supplement the reporting of Cohen’s d. The article concludes with a discussion of
the choice of standardizers, the importance of assumptions, and the possibility of extending sound
alternative effect size indices to other research contexts.
Keywords effect size, meta-analysis, robust, statistical assumptions, typology

THE JOURNAL OF EXPERIMENTAL EDUCATION is one of the first to draw researchers’


attention to the importance of documenting effects, in addition to or in spite of statistical significance
testing (Thompson, 1988, 1989, 1993). This theme issue was followed by editorial
policies of several journals requiring the reporting of effect sizes (ESs; e.g., Educational and
Psychological Measurement; Thompson, 1994). Given this long history of discussion of issues
surrounding statistical testing and ES indices (Huberty, 2002) and various attempts by the

This research was supported in part by two Maris M. Proffitt and Mary Higgins Proffitt Endowment Grants of Indiana
University, awarded to C.-Y. J. Peng, and H.-M. Chiang and C.-Y. J. Peng, respectively.
Address correspondence to Chao-Ying Joanne Peng, Department of Counseling and Educational Psychology, ED
4072, 201 N. Rose Avenue, Indiana University, Bloomington, IN 47405, USA. E-mail: peng@indiana.edu

American Psychological Association (APA) and the American Educational Research Associa-
tion (AERA) to encourage the calculation and reporting of ES (American Educational Research
Association, 2006; American Psychological Association, 2010; APA Publications and Commu-
nications Board Working Group on Journal Article Reporting Standards, 2008; Wilkinson and
the APA Task Force on Statistical Inference, 1999), it is not surprising to note an increase in the
ES reporting rate among quantitative studies published in education and psychology journals.
Indeed, most journals improved the ES reporting rate to exceed 50% after 1999—the year when
the 1999 APA Task Force Report was released (e.g., Dunleavy, Barr, Glenn, & Miller, 2006; Fi-
dler et al., 2005; Harrison, Thompson, & Vannest, 2009; Matthews et al., 2008; Meline & Wang,
2004; Odgaard & Fowler, 2010; Osborne, 2008; Vacha-Haase, Nilsson, Reetz, Lance, Thompson,
2000; Wilkinson and the APA Task Force on Statistical Inference, 1999). The three most popular
ES measures reported continued to be the unadjusted R2, Cohen’s d, and η2 (e.g., Alhija & Levy,
2009; Anderson, McCullagh, & Wilson, 2007; Harrison et al., 2009; Keselman et al., 1998; Kirk,
1996; Matthews et al., 2008; Smith & Honoré, 2008). And when ES was interpreted, Cohen’s
(1969) criteria of small, medium, and large effects were often cited without a context, even after
1999 (Alhija & Levy, 2009; Fidler et al., 2005; Meline & Wang, 2004; Snyder & Thompson,
1998; Sun, Pan, & Wang, 2010; Thompson & Snyder, 1997, 1998; Vacha-Haase et al., 2000).
The three most frequently reported ES measures (i.e., the unadjusted R2, Cohen’s d, and η2)
have been criticized for bias, lack of robustness to outliers, and instability (i.e., large sampling
errors) under violations of statistical assumptions (e.g., Algina, Keselman, & Penfield, 2005;
Hedges & Olkin, 1985; Maxwell, Camp, & Arvey, 1981; Yin & Fan, 2001). Several correction
formulae have been developed to correct the positive bias found in the unadjusted R2 and η2
(Maxwell et al., 1981; Yin & Fan, 2001). Likewise, Hedges’s gu (Hedges, 1981) corrects bias
in Cohen’s d; various robust estimators (Algina et al., 2005; Keselman, Algina, Lix, Wilcox, &
Deering, 2008) are robust against outliers and the violation of normality, which is assumed by Co-
hen’s d. Several nonparametric alternative ES estimators [e.g., probability of superiority (Grissom
& Kim, 2001), stochastic superiority (Vargha & Delaney, 2000), or dominance statistic (Cliff,
1993)] make no assumptions about the population shapes. Furthermore, these nonparametric al-
ternatives reconceptualize ES as an effect throughout a distribution, not just at its center (Grissom
& Kim, 2001). Yet, these alternative ES measures have been infrequently, if ever, reported by
researchers to document effects of significance. Some have not been included or discussed
extensively in recent books on ES or meta-analysis (e.g., Cooper, Hedges, & Valentine, 2009; Grissom
& Kim, 2012) or books on applied statistical methods (e.g., Hinkle, 2003; Meyers, Gamst, &
Guarino, 2006; Tabachnick & Fidell, 2007).
Hence, this article aims to present several alternatives to Cohen’s d to help researchers concep-
tualize ES beyond standardized mean differences for between-subject designs with two groups.
The alternative ES indices are contrasted with Cohen’s d in terms of purposes, usages, statistical
assumptions, statistical properties, interpretability, and the potential for meta-analysis. All ES
indices discussed in this article are organized into a typology in order to facilitate readers'
understanding and application. Several viable alternative ES indices are identified from the typology to
supplement the reporting of Cohen's d. Because of the myriad ES indices in the literature and the
abundance of alternative indices already proposed and investigated as substitutes for the unadjusted
R2 and η2, we chose to focus on alternative ES indices that are particularly well suited for
documenting differences between two groups in between-subject designs. Many of these are suitable

for other research contexts as well (see the Discussion section). Estimators specifically proposed
for within-subject/single-subject designs, or for multiple dependent variables are not included.
The remainder of this article is divided into the following sections: What is an ES?, Estimators
of ES—Purposes and Usages, ES Estimators by Definition 1—Table 2a, ES Estimators by
Definition 2—Table 2b, and Discussion. The article concludes with a discussion of the choice of
standardizers in standardizing ES, the importance of statistical assumptions (primarily, normality
and equal variances), the identification of sound ES estimators, and the potential of the sound ES
estimators to be extended to between-subject designs with multiple groups, within-subject designs
with two or more repeated measures, and split-plot designs with multiple groups measured at
multiple time points.

What Is an ES?

Although the APA and AERA guidelines and reporting standards repeatedly refer to ES, this term
was tangentially defined by AERA as “quantitative indices of effect magnitude,” or “magnitude
of relations,” or “an index of the quantitative relation between variables” (AERA, 2006, p.37).
ES is used synonymously with “index of effect,” or “index of the effect” in AERA’s standards
(AERA, 2006, p. 37), but always as “effect size” in the APA guidelines and reporting standards
(APA Publications and Communications Board Working Group on Journal Articles Reporting
Standards, 2008; Wilkinson and the APA Task Force on Statistical Inference, 1999). Furthermore,
the language in the APA and AERA guidelines and reporting standards suggests ES to be a
population-based parameter. For example, the 6th edition of the APA Publication Manual states,
“Whenever possible, provide a confidence interval for each effect size reported to indicate the
precision of estimation of the effect size” (APA, 2010, p. 34). The AERA’s reporting standards
refer to “the estimated effect,” “the true effect,” and “the uncertainty of the findings” (AERA,
2006, p. 37). Thus, we can infer from these references that APA and AERA guidelines and
reporting standards intended to define ES in terms of population characteristics (i.e., parameters).
And these population-based ESs are estimated from samples by an estimator with uncertainty.
What exactly is meant by an ES for between-subject designs with two groups? We searched
the literature within PsycINFO and Web of Knowledge databases using three keywords: effect
size, practical significance, and clinical significance. Furthermore, we manually searched for
additional definitions referenced in published work found from keyword searches. These two
search stratgies resulted in seven definitions for ES in this context (Table 1 in chronological
order). Because the APA and AERA guidelines and reporting standards consider ES to be a
population-based parameter that is estimated from sample data with uncertainty, Definitions 4,
7a, 7b, and 7c were not adopted as they define ES as an estimate or sample statistic. Kelley
and Preacher (2012) also criticized Definitions 7a, 7b, and 7c as problematic because they use
the same term, ES, to mean three different things. Definitions 3 and 5 were worded vaguely and
can be read as describing sampling error, that is, a sample result departing from the null hypothesis
even when the null hypothesis is true. Hence, these two definitions were also not adopted. The remaining Definitions 1 and
2 are the only two definitions from Table 1 that meet the APA and AERA generic definition
of ES. For comparisons of two groups in a between-subject design, Definition 1 defines ES as
the difference of two population parameters, such as the population mean or median difference,
whereas Definition 2 defines ES as any departure from the null hypothesis, such as the degree
of overlap between two population distributions. Although one can argue that Definition 1 is
TABLE 1
Definitions of Effect Size in Chronological Order for Between-Subject Designs With Two Groups

1. Oakes (1986, p. 49): “The term ‘size of effect’ is used here in a broad sense to mean either the difference between two population means, or the variance of several population means, or the strength of association in a multivariate population between variables, etc.”

2. Cohen (1969, pp. 9–10): “Without intending any necessary implication of causality, it is convenient to use the phrase ‘effect size’ to mean ‘the degree to which the phenomenon is present in the population,’ or ‘the degree to which the null hypothesis is false.’ Whatever the manner of representation of a phenomenon in a particular research in the present treatment, the null hypothesis always means that the effect size is zero. By the above route, it can now readily be made clear that when the null hypothesis is false, it is false to some specific degree, i.e., the effect size (ES) is some specific nonzero value in the population. The larger this value, the greater the degree to which the phenomenon under study is manifested.”

3. Kramer & Rosenthal (1999, p. 60): “An estimate of the degree of departure from the null hypothesis.”

4. Kramer & Rosenthal (1999, p. 60): “An estimate of the magnitude of the relationship between X and Y, expressed by an effect size estimate such as r or d.”

5. Thompson (2002, p. 25): “An effect size characterizes the degree to which sample results diverge from the null hypothesis (cf. Cohen, 1988, 1994).”

6. Kline (2004, p. 97): “Effect size refers to the magnitude of the impact of the independent variable on the dependent variable. It can also be described as the degree of association or covariation between the two.”

7a. Nakagawa & Cuthill (2007, p. 593): “In the literature, the term ‘effect size’ has several different meanings. Firstly, effect size can mean a statistic which estimates the magnitude of an effect (e.g., mean difference, regression coefficient, Cohen’s d, correlation coefficient). We refer to this as an ‘effect statistic’ (it is sometimes called an effect size measurement or index).”

7b. Nakagawa & Cuthill (2007, p. 593): “Secondly, it also means the actual values calculated from certain effect statistics (e.g., mean difference = 30 or r = 0.7; in most cases, ‘effect size’ means this, or is written as ‘effect size value’).”

7c. Nakagawa & Cuthill (2007, p. 593): “The third meaning is a relevant interpretation of an estimated magnitude of an effect from the effect statistics. This is sometimes referred to as the biological importance of the effect, or the practical and clinical importance in social and medical sciences.”

Note. Definitions 1 and 2 define ES in terms of population characteristics, namely, the difference between two population means (Definition 1), or simply H0 being false (Definition 2). Definitions 3, 4, and 7b define ES as an estimate (i.e., the numerical value) of the corresponding population effect, presumably based on a sample estimator. Definition 7a refers to the sample estimator itself, whereas Definition 7c refers to the biological, practical, or clinical importance of the sample estimate. Definition 6 is not explicit about the reference to which it refers. Definition 5 defines ES in terms of sampling errors under H0, not ES per se. Furthermore, this definition can be problematic in goodness-of-fit types of analyses, for which the H0 of a good fit is preferred over the H1 of a poor fit, and the sample goodness-of-fit measure is the sample estimate of the population ES. ES = effect size.

logically subsumed under Definition 2 (a view with which we agree), it is easier to keep these two
definitions distinct in order to demonstrate, in the next three sections, how they lead to different
sample estimators.
A general framework for ES recently proposed by Kelley and Preacher (2012) subsumes many
current definitions of ES, including all listed in Table 1. Kelley and Preacher (2012) defined ES
as a “quantitative reflection of the magnitude of some phenomenon that is used for the purpose
of addressing a question of interest” (p. 140). Their definition of ES links ES to a question or
phenomenon of interest rather than to a null hypothesis, as almost all current ES definitions do.
This broad conceptualization of ES has implications for interpretation, comparison
of ESs (e.g., in meta-analysis), and evaluation of properties of ES indices. While we eagerly
anticipate future research that applies and critiques this new definition of ES in advancing research
methodology, its usefulness and applicability are not yet established for the discussion of alternative ES
measures in the specific research context under consideration (namely, between-subject designs
with two groups). Hence, we did not impose Kelley and Preacher's (2012) framework on the
typology developed in this article.

Estimators of ES—Purposes and Usages

Estimators such as Cohen's d or Hedges's gu estimate the population standardized mean
difference under Definition 1 of ES. These are referred to as the d-family estimators (e.g., Olejnik
& Algina, 2000; Rosenthal, 1994) and are summarized in Table 2a. The d-family estimators
estimate differences in population parameters, such as differences in means, proportions, probits,
logits, and so on. Other estimators, infrequently used by researchers, estimate the degree of
overlap (or lack thereof) between two populations. This degree of overlap is a special form
of the falsehood of H0 by Definition 2. Estimators that fit Definition 2 are summarized in
Table 2b. The first column in Tables 2a and 2b lists the ES parameters
and the second column presents sample ES estimators. Parametric estimators are bolded in the
second column, whereas nonparametric estimators are not. Estimators in raw units are italicized;
all others are not. All estimators, except for those in raw measurement units, are empirically
demonstrated using data that meet the statistical assumptions underlying their calculations.1

ES Estimators by Definition 1—Table 2a

Table 2a presents ES estimators that estimate the difference between two population means.
The larger the absolute values of these estimators, the larger is the corresponding population
ES estimated by these estimators, when statistical assumption(s) are met. The ES estimators
can be grouped into six categories, according to the metric (raw versus standardized), whether
normality and equal variances are assumed about the population distributions, and whether they
are robust by a certain definition (to be defined later). These six categories are as follows: (1) ES
estimators in raw units; (2) standardized ES estimators, assuming normality and equal variance;
(3) standardized ES estimators, assuming normality only; (4) robust standardized ES estimators,
assuming equal variance; (5) robust standardized ES estimators, not assuming equal variance;
and (6) a robust standardized ES estimator in order statistics, assuming equal variance.
1See pp. 2–8 of the Appendix, available at https://oncourse.iu.edu/access/content/user/peng/Appendix.Beyond%20Cohen%20d.Peng%2BChen.final.docx. Statistical properties of these estimators [e.g., bias, precision in confidence interval (CI), potential for meta-analysis, interpretability] are summarized from publications and conference papers; they are described on pp. 12–27 of the Appendix.
TABLE 2a
Effect Size Estimators of Two Population Mean Difference

(1) ES estimators in raw units

Population ES: Raw difference between two population means (Δ) = μ1 − μ2
Point estimator: Sample mean difference = X̄1 − X̄2

Population ES: Raw difference between two population medians
Point estimator: Sample median difference

Population ES: Raw difference between two population modes
Point estimator: Sample mode difference

(2) Standardized ES estimators, assuming normality and equal variance

Population ES: Standardized mean difference (δ) = (μ1 − μ2)/σ for one-tailed tests (where the alternative hypothesis is μ1 > μ2), or |μ1 − μ2|/σ for two-tailed tests, where
  μ1, μ2 = population means of groups 1 and 2, respectively, and
  σ = the common population standard deviation, assumed to be equal for groups 1 and 2.
  (Cohen, 1969, p. 18)
In experimental studies, δ = (μE − μC)/σ, where
  μE, μC = population means of the experimental and control groups, respectively, and
  σ = the common population standard deviation.
  (Hedges & Olkin, 1985, p. 76)

Point estimator: Cohen's d (Cohen, 1969, pp. 64–65)
  d = (X̄1 − X̄2)/σ̂pooled for one-tailed tests, or |X̄1 − X̄2|/σ̂pooled for two-tailed tests, where
  X̄1, X̄2 = sample means of groups 1 and 2, respectively, and
  σ̂²pooled = the pooled sample variance of groups 1 and 2 with (n1 + n2 − 2) in the denominator.
  (Cohen, 1969, pp. 17–18 on normality and equal variances assumptions)

Point estimator: Glass's g (Glass, 1976, Figures 1, 2, & 3)
  g = (X̄E − X̄C)/σ̂C, where
  X̄E, X̄C = sample means of the experimental and control groups, respectively, and
  σ̂C = the sample SD of the control group or the estimated within-class SD.
  (Glass, 1976, Figures 1, 2, & 3 on normality and equal variances assumptions)

Point estimator: Hedges's gu (Hedges, 1981, pp. 111–114)
  gu = δ̂ × c(m), where
  δ̂ = Cohen's d or Glass's g defined above,
  c(m) = Γ(m/2)/[√(m/2) × Γ((m − 1)/2)],
  Γ(x) = the gamma function, and
  m = (n1 + n2 − 2) for Cohen's d, or m = n1 (or n2) − 1 for Glass's g.
  The c(m) function can be approximated by [1 − 3/(4m − 1)], where m = the degrees of freedom.
  (Hedges, 1981, p. 108 on normality and equal variances assumptions)
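As a concrete illustration, the three estimators in this category can be sketched in plain Python. The data below are invented, and `hedges_gu` implements the exact gamma-function form of c(m); treat this as a sketch rather than a reference implementation.

```python
import math

def cohens_d(x1, x2):
    """Cohen's d: mean difference over the pooled SD with (n1 + n2 - 2) in the denominator."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    ss1 = sum((v - m1) ** 2 for v in x1)
    ss2 = sum((v - m2) ** 2 for v in x2)
    return (m1 - m2) / math.sqrt((ss1 + ss2) / (n1 + n2 - 2))

def glass_g(exp, ctrl):
    """Glass's g: mean difference standardized by the control group's SD."""
    me, mc = sum(exp) / len(exp), sum(ctrl) / len(ctrl)
    sd_c = math.sqrt(sum((v - mc) ** 2 for v in ctrl) / (len(ctrl) - 1))
    return (me - mc) / sd_c

def hedges_gu(d_hat, m):
    """Bias-corrected g_u = d_hat * c(m),
    where c(m) = Gamma(m/2) / (sqrt(m/2) * Gamma((m - 1)/2))."""
    c_m = math.gamma(m / 2) / (math.sqrt(m / 2) * math.gamma((m - 1) / 2))
    return d_hat * c_m

x1 = [12, 15, 14, 10, 13, 16, 11, 14]   # invented scores, group 1
x2 = [9, 11, 10, 12, 8, 10, 11, 9]      # invented scores, group 2
d = cohens_d(x1, x2)
gu = hedges_gu(d, len(x1) + len(x2) - 2)   # m = n1 + n2 - 2 for Cohen's d
```

For the degrees of freedom here (m = 14), the approximation 1 − 3/(4m − 1) agrees with the exact c(m) to about three decimal places; the correction always shrinks d slightly toward zero.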

(3) Standardized ES estimators, assuming normality only

Population ES: Standardized mean difference (δ*), when population variances are not equal,
  δ* = (μ1 − μ2)/√[(σ1² + σ2²)/2], where
  μ1, μ2 = population means of groups 1 and 2, respectively, and
  σ1², σ2² = population variances of groups 1 and 2, respectively.
  (Cohen, 1969, pp. 41–42)
Point estimator: Cohen's d* (Keselman et al., 2008, p. 118, Equation 16)
  d* = (X̄1 − X̄2)/√[(σ̂1² + σ̂2²)/2], where
  X̄1, X̄2 = sample means of groups 1 and 2, respectively, and
  σ̂1², σ̂2² = sample variances of groups 1 and 2, respectively.
  (Cohen, 1969, pp. 41–42 on normality assumption)

Population ES: Standardized mean difference (δj), when population variances are not equal,
  δj = (μ1 − μ2)/σj, where
  μ1, μ2 = population means of groups 1 and 2, respectively, and
  σj = the population standard deviation of either group 1 or 2.
  (Keselman et al., 2008, pp. 117–118)
Point estimator: Keselman and colleagues' dj (Glass et al., 1981, pp. 106–107; Keselman et al., 2008, pp. 117–118)
  dj = (X̄1 − X̄2)/σ̂j, where
  X̄1, X̄2 = sample means of groups 1 and 2, respectively, and
  σ̂j = the sample standard deviation of either group 1 or 2.
  (Keselman et al., 2008, pp. 116–118 on normality assumption)
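A minimal sketch of the two heteroscedastic estimators in this category, with invented data; the `reference` parameter is a convenience of this sketch (not of the original notation) that selects which group's standard deviation standardizes dj.

```python
import math

def var_u(x):
    """Unbiased sample variance (n - 1 denominator)."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / (len(x) - 1)

def cohens_d_star(x1, x2):
    """d*: mean difference over the square root of the average of the two group variances."""
    m1, m2 = sum(x1) / len(x1), sum(x2) / len(x2)
    return (m1 - m2) / math.sqrt((var_u(x1) + var_u(x2)) / 2)

def d_j(x1, x2, reference=2):
    """d_j: mean difference over the SD of one chosen group."""
    m1, m2 = sum(x1) / len(x1), sum(x2) / len(x2)
    return (m1 - m2) / math.sqrt(var_u(x2 if reference == 2 else x1))

x1 = [12, 15, 14, 10, 13, 16, 11, 14]   # invented scores, group 1
x2 = [9, 11, 10, 12, 8, 10, 11, 9]      # invented scores, group 2
```

Note that dj changes with the choice of reference group when the two variances differ, which is why the table lists the choice of σ̂j explicitly.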
(4) Robust standardized ES estimators (see Note d), assuming equal variance

Population ES: Robust standardized mean difference (δR), when population variances are equal,
  δR = (.642) × (μt1 − μt2)/σw, where
  μt1, μt2 = the 20% trimmed population means of groups 1 and 2, respectively,
  .642 = the Winsorized standard deviation for a 20% trimmed standard normal distribution, and
  σw = the 20% Winsorized common population standard deviation.
  (Algina et al., 2005, pp. 319–320)
Point estimator: Algina and colleagues' dR (Algina et al., 2005, p. 320)
  dR = (.642) × (X̄t1 − X̄t2)/√(σ̂w²), where
  X̄t1, X̄t2 = the 20% trimmed sample means of groups 1 and 2, respectively, and
  σ̂w² = the pooled 20% Winsorized sample variance of groups 1 and 2.
  (Algina et al., 2005, pp. 319–320 on equal variance assumption)

Population ES: Robust standardized mean difference (δR†), when population variances are equal,
  δR† = (μt1 − μt2)/σw, where
  μt1, μt2 = the 20% trimmed population means of groups 1 and 2, respectively, and
  σw = the 20% Winsorized common population standard deviation.
  (Algina et al., 2005, pp. 320–321)
Point estimator: Algina and colleagues' dR† (Algina et al., 2005, p. 321)
  dR† = (X̄t1 − X̄t2)/√(σ̂w²), where
  X̄t1, X̄t2 = the 20% trimmed sample means of groups 1 and 2, respectively, and
  σ̂w² = the pooled 20% Winsorized sample variance of groups 1 and 2.
  (Algina et al., 2005, pp. 319–320 on equal variance assumption)
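The trimming and Winsorizing steps behind dR can be sketched as follows, with invented data containing one outlier. Pooling the two Winsorized variances with (nj − 1) weights is an assumption of this sketch; Algina et al. (2005) should be consulted for the exact computation.

```python
import math

def trimmed_mean(x, prop=0.20):
    """Mean after dropping floor(prop * n) observations from each tail."""
    xs = sorted(x)
    g = int(prop * len(xs))
    core = xs[g:len(xs) - g]
    return sum(core) / len(core)

def winsorized_variance(x, prop=0.20):
    """Variance (n - 1 denominator) after pulling each tail's g values in
    to the nearest retained value (Winsorizing)."""
    xs = sorted(x)
    g = int(prop * len(xs))
    w = [max(min(v, xs[-g - 1]), xs[g]) for v in xs]
    m = sum(w) / len(w)
    return sum((v - m) ** 2 for v in w) / (len(w) - 1)

def d_R(x1, x2):
    """Algina et al.'s d_R; .642 rescales trimmed/Winsorized units back toward
    standard-normal units. Pooling with (n_j - 1) weights is an assumption here."""
    n1, n2 = len(x1), len(x2)
    pooled = ((n1 - 1) * winsorized_variance(x1)
              + (n2 - 1) * winsorized_variance(x2)) / (n1 + n2 - 2)
    return 0.642 * (trimmed_mean(x1) - trimmed_mean(x2)) / math.sqrt(pooled)

def d_R_dagger(x1, x2):
    """Algina et al.'s second variant: d_R without the .642 rescaling."""
    return d_R(x1, x2) / 0.642

x1 = [4, 5, 6, 6, 7, 7, 8, 8, 9, 30]   # invented data; 30 is an outlier
x2 = [1, 2, 3, 3, 4, 4, 5, 5, 6, 7]    # invented data
```

Because the outlier 30 is trimmed from the numerator and Winsorized out of the denominator, it has no influence on dR, which is the point of these estimators.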
(5) Robust standardized ES estimators (see Note d), not assuming equal variance

Population ES: A robust standardized mean difference (δRj), when population variances are not equal,
  δRj = (.642) × (μt1 − μt2)/σwj, where
  μt1, μt2 = the 20% trimmed population means of groups 1 and 2, respectively,
  .642 = the Winsorized standard deviation for a 20% trimmed standard normal distribution, and
  σwj = the 20% Winsorized population standard deviation of either group 1 or 2.
  (Keselman et al., 2008, pp. 117, 119)
Point estimator: Keselman and colleagues' dRj (Keselman et al., 2008, pp. 117, 119)
  dRj = (.642) × (X̄t1 − X̄t2)/σ̂wj, where
  X̄t1, X̄t2 = the 20% trimmed sample means of groups 1 and 2, respectively, and
  σ̂wj = the 20% Winsorized sample standard deviation of either group 1 or 2.
  (Keselman et al., 2008, p. 119 on the absence of normality or equal variance assumption)
Population ES: A robust standardized mean difference (δR*), when population variances are not equal,
  δR* = (μt1 − μt2)/√[(σw1² + σw2²)/2], where
  μt1, μt2 = the 20% trimmed population means of groups 1 and 2, respectively, and
  σw1², σw2² = the 20% Winsorized population variances of groups 1 and 2, respectively.
  (Keselman et al., 2008, pp. 117, 120)
Point estimator: Keselman and colleagues' dR* (Keselman et al., 2008, pp. 117, 120)
  dR* = (X̄t1 − X̄t2)/√[(σ̂w1² + σ̂w2²)/2], where
  X̄t1, X̄t2 = the 20% trimmed sample means of groups 1 and 2, respectively, and
  σ̂w1², σ̂w2² = the 20% Winsorized sample variances of groups 1 and 2, respectively.
  (Keselman et al., 2008, pp. 116, 118–120 on the absence of normality or equal variance assumption)
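The heteroscedastic robust variants differ only in the standardizer: a single group's Winsorized SD for dRj, or the root of the average Winsorized variance for dR*. A self-contained sketch with invented data and the usual 20% trimming (floor(.2n) values dropped per tail):

```python
import math

def trimmed_mean(x, prop=0.20):
    """Mean after dropping floor(prop * n) observations from each tail."""
    xs = sorted(x)
    g = int(prop * len(xs))
    core = xs[g:len(xs) - g]
    return sum(core) / len(core)

def winsorized_variance(x, prop=0.20):
    """Variance (n - 1 denominator) after pulling each tail's g values in
    to the nearest retained value."""
    xs = sorted(x)
    g = int(prop * len(xs))
    w = [max(min(v, xs[-g - 1]), xs[g]) for v in xs]
    m = sum(w) / len(w)
    return sum((v - m) ** 2 for v in w) / (len(w) - 1)

def d_Rj(x1, x2, reference=2):
    """Keselman et al.'s d_Rj: trimmed-mean difference over the Winsorized SD
    of one chosen group, rescaled by .642."""
    ref = x2 if reference == 2 else x1
    return 0.642 * (trimmed_mean(x1) - trimmed_mean(x2)) / math.sqrt(winsorized_variance(ref))

def d_R_star(x1, x2):
    """Keselman et al.'s d_R*: standardizer is the square root of the average
    of the two Winsorized variances (no .642 rescaling)."""
    avg = (winsorized_variance(x1) + winsorized_variance(x2)) / 2
    return (trimmed_mean(x1) - trimmed_mean(x2)) / math.sqrt(avg)

x1 = [4, 5, 6, 6, 7, 7, 8, 8, 9, 30]   # invented data; 30 is an outlier
x2 = [1, 2, 3, 3, 4, 4, 5, 5, 6, 7]    # invented data
```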

(6) Robust standardized ES estimator in order statistics, assuming equal variance

Population ES: Standardized mean difference (δ) = (μ1 − μ2)/σ for one-tailed tests (where the alternative hypothesis is μ1 > μ2), or |μ1 − μ2|/σ for two-tailed tests, where
  μ1, μ2 = population means of groups 1 and 2, respectively, and
  σ = the common population standard deviation, assumed to be equal for groups 1 and 2.
  (Cohen, 1969, p. 18)
In experimental studies, δ = (μE − μC)/σ, where
  μE, μC = population means of the experimental and control groups, respectively, and
  σ = the common population standard deviation.
  (Hedges & Olkin, 1985, p. 76)
Point estimator: Hedges and Olkin's d̃ (Hedges & Olkin, 1984, p. 575)
  d̃ = (XMdnE − XMdnC)/σ̃, where
  XMdnE, XMdnC = medians of the experimental and control groups, respectively,
  σ̃ = a2·XC(2) + · · · + an−1·XC(n−1), a linear combination of the ordered data of the control group, and
  XC(1) ≤ XC(2) ≤ · · · ≤ XC(n−1) ≤ XC(n) are the ordered data of the control group.
  (Hedges & Olkin, 1984, pp. 573, 575 on equal variance assumption)

Note. aES expressed in the raw measurement scale or metric are italicized; those in standardized units are not italicized. bParametric estimators are in bold, whereas nonparametric estimators are not. cStatistical properties of these estimators are summarized in the Appendix at https://oncourse.iu.edu/access/content/user/peng/Appendix.Beyond%20Cohen%20d.Peng%2BChen.final.docx. dThe robustness is judged by quantitative, qualitative, and infinitesimal criteria (Wilcox, 2005, Section 2.1). ES = effect size.
TABLE 2b
Effect Size Estimators of the Degree of Overlap Between Two Populations

(A) Estimators of degree of nonoverlap, assuming normality and equal variance

Population ES: Degree of nonoverlap between two population distributions; the less the two population distributions overlap, the greater the ES, and vice versa.
  1. U1 = the percentage of the nonoverlapping area out of the entire area covered by both populations;
  2. U2 = the percentage in the second population that exceeds the same percentage in the first population;
  3. U3 = the percentage of the first population exceeded by the upper half of the second population.
  (Cohen, 1969, pp. 19–21)
Point estimator: The sample estimates of U1, U2, and U3 can be converted from Cohen's d using Table 2.2.1 in Cohen (1969, p. 20). This table converts properly only when the two samples are nearly normally distributed with equal variances and equal sample sizes.
  (Cohen, 1969, p. 19 on normality, equal variance, and equal sample size assumptions)
(B) Estimators of dominance; only CL assumes normality and equal variances, the others assume neither

Population ES: The probability that a randomly selected score from one group is greater than a randomly selected score from the other group = Prob(X1 > X2).
  (McGraw & Wong, 1992)
Point estimator: McGraw and Wong's common language (CL) effect size (McGraw & Wong, 1992, p. 361)
  CL = Φ[(X̄1 − X̄2)/√(σ̂1² + σ̂2²)], where
  Φ(x) = the standard normal cumulative distribution function,
  X̄1, X̄2 = sample means of groups 1 and 2, respectively (it is assumed that X̄1 > X̄2), and
  σ̂1², σ̂2² = the unbiased sample variance estimators of groups 1 and 2, respectively.
  (McGraw & Wong, 1992, p. 364, on normality and equal variances assumptions)
Point estimator: Grissom and Kim's probability of superiority (PS) (Grissom, 1994, p. 314; Grissom & Kim, 2001, p. 140)
  PS = U/(n1 × n2), where
  U = the Wilcoxon-Mann-Whitney statistic that counts the number of times that scores in group 1 outrank scores in group 2, assuming no ties or equal allocation of ties,
  n1 = the sample size of group 1, and
  n2 = the sample size of group 2.

Population ES: Stochastic superiority (A12) = the probability that a randomly selected score from group 1 will be greater than a randomly selected score from group 2, plus .5 times the probability that a randomly selected score from group 1 equals a randomly selected score from group 2 = Prob(X1 > X2) + .5 × Prob(X1 = X2).
  (Vargha & Delaney, 2000)
Point estimator: Vargha and Delaney's Â12 (Vargha & Delaney, 2000, p. 107)
  Â12 = [#(X1 > X2) + .5 × #(X1 = X2)]/(n1 × n2), where
  # = the number of times the argument in the parentheses is true in a data set.
  Alternatively,
  Â12 = [R1/n1 − (n1 + 1)/2]/n2, where
  R1 = the rank sum of the first group's scores, ranked among all (n1 + n2) scores,
  n1 = the sample size of group 1, and
  n2 = the sample size of group 2.

Population ES: Dominance statistic (δCliff) = the probability that a randomly selected score from one group is greater than a randomly selected score from the other group, minus the probability that a randomly selected score from one group is smaller than a randomly selected score from the other group = Prob(X1 > X2) − Prob(X1 < X2).
  (Cliff, 1993, p. 494)
Point estimator: Cliff's d (Cliff, 1993, p. 495)
  Cliff's d = [#(X1 > X2) − #(X1 < X2)]/(n1 × n2), where
  # = the number of times the argument in the parentheses is true in a data set,
  n1 = the sample size of group 1, and
  n2 = the sample size of group 2.
  Note that ties are not counted in the numerator; they are counted in n1 and n2 in the denominator.
  Alternatively,
  Cliff's d = 2U/(n1 × n2) − 1, where
  U = the Wilcoxon-Mann-Whitney statistic,
  n1 = the sample size of group 1, and
  n2 = the sample size of group 2.

(C) Estimators of (mis)classifications, assuming multivariate normality and equal variances/covariances, except when LR is used for classification

Proportion of cases misclassified
(Levy, 1967)
Levy's p
(Levy, 1967, pp. 38–39)
Levy's p = the one-tailed probability associated with the normal deviate z.
In the univariate case,
z = (0.5) × t × √[(n1 + n2)/(n1 × n2)], or
z = t/√N, assuming n1 = n2;
(Levy, 1967, p. 38 on normality and equal variance assumptions)
in the multivariate case,
z = (0.5) × √[R²/(1 − R²)] × √[(N − m − 1)(n1 + n2)/(n1 × n2)], or
z = R/√(1 − R²), assuming n1 = n2 and N is large compared to m,
(Levy, 1967, pp. 38–39 on multivariate normality and equal variance/covariance assumptions)
where
t = the two-independent-sample t statistic,
R² = the multivariate point-biserial correlation coefficient,
m = the number of dependent measures,
N = the total sample size,
n1 = the sample size of group 1, and
n2 = the sample size of group 2.


Improvement-over-chance index (I) = the degree of nonoverlap between two distributions, more than what is expected by chance.
(Huberty & Lowman, 2000)
The population hit rate = Φ(δ/2),
where Φ = the cumulative normal distribution function,
δ = the standardized population mean difference.
(Hess, Olejnik, & Huberty, 2001, p. 918)
Huberty and colleagues' Î
(Huberty & Lowman, 2000, pp. 546–547)
Î = [(1 − He) − (1 − Hito)]/(1 − He)
= (Hito − He)/(1 − He), where
Hito = the observed hit rate; can be obtained from predictive discriminant analysis (PDA), logistic regression (LR), etc., in terms of group membership prediction, followed by a calculation of the group classification hit rate,
He = the hit rate by chance; can be obtained from He = (q1n1 + q2n2)/(n1 + n2) or He = max(q1, q2), where
qj is the prior probability, which should reflect the relative population sizes in each group (e.g., it can be estimated by sample size (qj = nj/N) given that the sample sizes are proportional to the population sizes).
(Huberty & Lowman, 2000, pp. 546, 551, 552, 559, 561 on normality and equal variance/covariance assumptions for PDA)

a Parametric estimators are in bold whereas nonparametric estimators are not.
b Statistical properties of these estimators are summarized in the Appendix at https://oncourse.iu.edu/access/content/user/peng/Appendix.Beyond%20Cohen%20d.Peng%2BChen.final.docx.
estimators in raw units; (2) standardized ES estimators assuming normality and equal variance;
(3) standardized ES estimators, assuming normality only; (4) robust standardized ES estimators,
assuming equal variance; (5) robust standardized ES estimators, not assuming equal variance;
and (6) robust standardized ES estimator in order statistics (e.g., medians, quartiles, percentiles,
ranked data), assuming equal variance. The definition for robustness is different for Categories
(4), (5), and (6) estimators. This difference is described later and in Table 2a.
The sample mean difference in raw units from Category (1) and estimators from Categories (2) and (3) are inherently parametric and all assume normality about the populations, but not
necessarily equal variances. Categories (4) and (5) estimators do not assume normality. Estimators
in Category (6) are defined in terms of sample medians; hence, they are invariant under monotonic
transformations. Each category of estimators is described later.

(1) ES Estimators in Raw Units

The first three rows in Table 2a present three ES estimators on the raw measurement scale: sample mean, median, and mode differences. Of the three, the mean difference is parametric, whereas median and mode differences are considered nonparametric. Compared with mean differences, median and mode differences are not extensively reported as ES measures, or researched for their suitability for meta-analysis (see the Appendix for statistical properties and comments in
Table 2a). According to the American Psychological Association (APA) guidelines and reporting
standards, the reporting of ES in raw units is preferred over standardized differences for easy
understanding, as long as the raw measurement scale or metric is meaningful (APA, 2010, p. 34;
Wilkinson and the APA Task Force on Statistical Inference, 1999, p. 599).

(2) Standardized ES Estimators, Assuming Normality and Equal Variance

Three standardized ES estimators (i.e., Cohen’s d, Glass’ g, and Hedges’ gu ) estimate the
same population parameter (δ) and assume normality and equal variance for populations. They
differ in the choice of the denominator—called a standardizer—in the standardization. Although these estimators are free of measurement units, nonlinear (e.g., log) transformations of the data do affect the magnitude and interpretation of these three ES estimators (Kraemer & Andrews, 1982). Thus, these three standardized ES estimators reflect the choice of the measurement, in addition to the magnitude of the treatment effect (Kraemer & Andrews, 1982). As long as the normality assumption holds,
all three ES estimators are consistent and asymptotically efficient estimators (Hedges & Olkin,
1984).
Cohen’s d is intuitively appealing for estimating an equally intuitive δ in the population.
However, researchers should be aware of several issues surrounding Cohen’s d. First, Cohen’s d is
a biased estimator of δ (Hedges, 1981). Second, the estimation of the common population standard deviation (σ), using the weighted average of two sample variances, depends on sample sizes (Keselman
et al., 2008). Third, when two population variances are different, estimating δ by Cohen’s d is
problematic because δ is undefined in this case, yet the magnitude of Cohen’s d is a function of
the ratio of two sample sizes—called base rates in Ruscio (2008). Fourth, d and δ are defined in
terms of means and variances that are least-squares statistics and parameters respectively. As a
result, δ is not a robust parameter (Staudte & Sheather, 1990) because of its sensitivity to subtle
changes to the population distribution, and Cohen’s d is sensitive to outliers in data. Kraemer and
Kupfer (2006) considered Cohen’s d to convey no clinically interpretable information, because
the threshold of preventing a disabling or fatal disease (e.g., polio) is different for different
treatments (e.g., a low-risk treatment, such as polio vaccine, versus a high-risk treatment, such as
radiation). Here, clinical significance is defined as the change in a patient/client's status and the
amount of change caused by a therapy/treatment/intervention (Jacobson, Follette, & Revenstorf,
1984; Jacobson & Truax, 1991). Therefore, without the knowledge of the threshold of clinical
significance for a specific treatment, Cohen’s d alone cannot convey the clinical significance of
that treatment. Last, the symbolic representation of Cohen’s d is confusing in the literature (see
Statistical Properties and Comments of Table 2a in the Appendix).
The second estimator (i.e., Glass’s g) in this category is defined as a standardized mean
difference between two groups, divided by the sample standard deviation of the control group
or “the estimated within-class standard deviation, assumed to be homogenous across the two
classes” (Glass, McGaw, & Smith, 1981, p. 38). Glass’s g was initially defined for meta-analysis
in the context of an experimental study that designated one group as an experimental group
and the other as a control group (Glass, 1976). Since the publication of Glass’s 1976 seminal
article on g, there has been much debate about the use of the control group’s standard deviation
as a standardizer (see Statistical Properties and Comments of Table 2a in the Appendix). We
agree with Glass et al. (1981) that the choice between the control group’s or the experimental
group’s sample standard deviation depends on the purpose of reporting g. Both standardizers are
acceptable because Glass’s gs computed from both express two distinct features of a finding.
One note of caution: Glass's g may estimate different population parameters in different research
designs (Kraemer & Andrews, 1982). For example, only when the time effects and the placebo
effects both equal zero, do values of Glass’s g from the pre-post design and the test-control design
estimate the same population parameter (Kraemer & Andrews, 1982). Furthermore, Kraemer and Kupfer's (2006) criticism of Cohen's d in terms of clinical significance may apply equally to Glass's g. Additional research is needed to explore this limitation.
Hedges’s gu corrects the bias found in Cohen’s d and Glass’s g by multiplying either estimator
with a correction coefficient, c(m) (Hedges, 1981). Hedges's gu is "the unique uniformly minimum variance unbiased estimator (UMVUE) of δ" (Hedges, 1981, p. 116), when two sample sizes are
equal and gu is derived from Cohen’s d. When the two population variances are different (i.e., the
equal variance assumption is violated), the estimation of δ by Hedges’s gu based on Cohen’s d still
depends on two sample sizes. So far, we have not located research that investigated the sensitivity
of gu to the base rates of the two groups. Thus, it remains to be seen if gu outperforms Cohen’s d
in terms of the sensitivity to base rates, especially when population variances are unequal. Also,
Kraemer and Kupfer's (2006) criticism of Cohen's d in terms of clinical significance may apply equally to Hedges's gu. Additional research is needed to explore this limitation.
Using Data Set 1 of 15 observations per group, randomly generated from two normal distribu-
tions with equal variances and population δ = 1.0 (p. 2 of the Appendix), we computed Cohen’s
d = 1.04, Glass’s g = 0.94, Hedges’ gu = 1.01 based on Cohen’s d, and Hedges’ gu = 0.89
based on Glass’s g. It is clear from these estimates that Hedges’ gu corrects the bias downward in
Cohen’s d and Glass’s g; the amount of correction is a gamma function of the degrees of freedom.
The greater the degrees of freedom, the smaller the correction of bias in either Cohen's d or
Glass’s g.
(3) Standardized ES Estimators, Assuming Normality Only

Two estimators—Cohen's d∗ and Keselman and colleagues' dj—fit into this category. Cohen's d∗ uses the square root of the unweighted average of two sample variances as its standardizer to estimate the corresponding population δ∗. Neither δ∗ nor d∗ assumes equal variances. As long as the normality assumption holds, Cohen's d∗ is a consistent and asymptotically efficient estimator of δ∗ (Hedges & Olkin, 1984). Cohen's d∗ is an unbiased estimator for δ∗ only when δ∗ = 0; otherwise, it is a biased estimator (Huynh, 1989). The variance of d∗ increases with δ∗, but decreases monotonically with increasing n1 and n2 (Huynh, 1989). In addition to the bias and instability of variance found in Cohen's d∗ (Huynh, 1989), its corresponding parameter (i.e., δ∗) is difficult to interpret because δ∗ is standardized with the standard deviation of the contrived population distribution (Grissom & Kim, 2001). For these reasons, Cohen's d∗ may not be a viable alternative to Cohen's d.
Keselman and colleagues’ dj was proposed explicitly to be an alternative to Cohen’s d when
population variances are unequal and sample sizes are unequal (Keselman et al., 2008). A
closer look at dj reveals that dj is computationally no different from Glass's g when the control group's SD is used in the denominator of dj. Yet, the rationale behind the two measures differs. It appears to us that Keselman and colleagues' dj used a single group's SD to
simplify the perplexing choice of a standardizer when population variances are unequal, even
though Glass et al. (1981) was not cited in Keselman et al. According to Glass et al. (1981),
the use of different standardizers (i.e., SD of the experimental or the control group) in Glass’s g
reflects different features of a finding (see Statistical Properties and Comments of Table 2a in the
Appendix). These different features complement, rather than contradict, each other. On this choice issue, we agree with Grissom and Kim's (2001) recommendation to select a standardizer that is
suitable for a research context when population variances differ (see Statistical Properties and
Comments of Table 2a in the Appendix). Similar to Kraemer and Andrews's (1982) criticism of
Glass’s g, Keselman et al.’s dj may estimate different population parameters in different research
designs. Thus, a careful consideration of the research context should guide researchers in deciding
whether to adopt dj as an alternative to Cohen’s d in meta-analysis, when population variances
are unequal.
Using Data Set 2 of 15 observations per group, randomly generated from two normal distributions with unequal variances (a ratio of 1:100 between Groups 1 and 2), the population δ∗ = 1.41, δ1 = 10.00, and δ2 = 1.00 (p. 5 of the Appendix), we computed Cohen's d∗ = 1.30, Keselman and colleagues' d1 = 7.80, and Keselman and colleagues' d2 = 0.92. Thus, the choice of
the standardizer is a crucial decision by researchers, when the two population variances are not
assumed to be equal.
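Under unequal variances, the two estimators just described differ only in their standardizer; a minimal Python sketch (function names are ours):

```python
import math
import statistics

def cohens_d_star(x1, x2):
    """Cohen's d*: the standardizer is the square root of the unweighted
    average of the two unbiased sample variances, so no equal-variance
    assumption is needed."""
    avg_var = (statistics.variance(x1) + statistics.variance(x2)) / 2
    return (statistics.mean(x1) - statistics.mean(x2)) / math.sqrt(avg_var)

def keselman_dj(x1, x2, j=1):
    """Keselman and colleagues' d_j: the standardizer is a single group's
    SD; the researcher chooses j to suit the research context."""
    s_j = statistics.stdev(x1 if j == 1 else x2)
    return (statistics.mean(x1) - statistics.mean(x2)) / s_j
```

For example, with group variances of 1 and 4 and a mean difference of 9, d1 = 9, d2 = 4.5, and d∗ = 9/√2.5, illustrating how strongly the standardizer choice matters when variances differ.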

(4) Robust Standardized ES Estimators, Assuming Equal Variance



Two estimators fit into this category (dR and dR′); two do not (dRj and dR∗). These two
estimators are defined as ratios of trimmed means and the square roots of Winsorized variances
that are robust to subtle changes to the distributions, compared to ordinarily defined means
and variances (Wilcox, 2005, Section 2.1). A trimmed mean is the average of scores, obtained
from removing a prespecified percentage (e.g., 20%) of the largest and the smallest scores (see
the Appendix for an illustration of this concept). A Winsorized variance is the variance of the
Winsorized scores. Winsorized scores are derived from replacing the smallest g scores by the
(g+1)th score and the largest g scores by the (n – g)th score, where n = the sample size,
2 × g/n = the percent of trimming (see p. 9 of the Appendix for an illustration of trimmed
means, Winsorized scores, and the Winsorized variance). It is unclear from Algina et al. (2005) and Keselman et al. (2008) if the ratio of two robust parameters/statistics (i.e., ratios of trimmed means over Winsorized variances) is itself robust by the same criteria as those discussed in Wilcox (2005, Section 2.1). Furthermore, these robust estimators are not necessarily robust against the
violation of normality and equal variance. Similarly, there is no guarantee that these robust ES
estimators are robust against all forms of outliers in the populations, as sample data are subject
to sampling errors and the population distributions are always unknown (Keselman et al., 2008).
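The trimming and Winsorizing steps described above can be sketched in Python (a minimal illustration with a per-tail count g, so that 2 × g/n is the total proportion trimmed; function names are ours):

```python
import statistics

def trimmed_mean(x, g):
    """Average after removing the g smallest and the g largest scores."""
    xs = sorted(x)
    return statistics.mean(xs[g:len(xs) - g])

def winsorized_variance(x, g):
    """Unbiased sample variance of the Winsorized scores: the g smallest
    scores are replaced by the (g+1)th ordered score and the g largest
    by the (n-g)th ordered score."""
    xs = sorted(x)
    low, high = xs[g], xs[len(xs) - g - 1]
    winsorized = [min(max(v, low), high) for v in x]
    return statistics.variance(winsorized)
```

For example, with x = [1, 2, 3, 4, 100] and g = 1, the trimmed mean is 3 and the Winsorized scores are [2, 2, 3, 4, 4], whose unbiased variance is 1 — far less outlier-driven than the ordinary variance of the raw scores.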
Using Data Set 3 of 15 observations per group, randomly generated from two uniform distributions with equal variances, population δR = 1.85 and δR′ = 2.89 (p. 6 of the Appendix), we computed Algina and colleagues' dR = 1.18 and dR′ = 1.83. We comment further on the interpretation of these two robust estimators and the next two introduced below.

(5) Robust Standardized ES Estimators, Not Assuming Equal Variance

Two estimators fit into this category: dRj and dR∗ . Similar to the robust ES estimators of Category
(4), these two estimators are defined as ratios of trimmed means and Winsorized variances that are
robust to subtle changes to the distributions, compared to ordinarily defined means and variances
(Wilcox, 2005, Section 2.1). As stated before, it is unclear from Algina et al. (2005) and Keselman
et al. (2008) if the ratio of two robust parameters/statistics (i.e., ratios of trimmed means over
Winsorized variances) is itself robust by the same criteria as those discussed in Wilcox (2005,
Section 2.1). Furthermore, these robust estimators are not necessarily robust against the violation
of normality and equal variance. Similarly, there is no guarantee that these robust ES estimators
are robust against all forms of outliers in the populations, as sample data are subject to sampling
errors and the population distributions are always unknown (Keselman et al., 2008).
The exact sampling distributions of these robust estimators cannot be obtained analytically;
they can be approximated by empirical methods under assumptions (Yuen, 1974; Yuen & Dixon,
1973). Without an acceptable sampling distribution, the null hypothesis significance testing and
confidence interval construction are not possible. Bonett (2008) further cautioned researchers with
the difficulty of interpreting robust estimators in terms of the separation between two nonnormal
distributions; under these conditions, the difference between two trimmed population means as
well as the Winsorized SD depend on the specific shapes of the population distributions which
are almost always unknown. Similar to Kraemer and Andrews's (1982) criticism of Glass's g,
Keselman et al.’s dRj may estimate different population parameters in different research designs.
Compared to Cohen’s d, the four robust estimators from Categories (4) and (5) are less
straightforward and less researched for their statistical properties (bias, consistency, or efficiency).
Furthermore, it is impossible to synthesize estimates of these robust estimators for meta-analysis,
if robust descriptive statistics, such as trimmed means and Winsorized variances/standard deviations,
are not available from primary studies. Because of the difficulty with sampling distributions,
interpretations, and the need for robust statistics, these four robust estimators, in our opinion, are
less suitable for meta-analysis than Cohen’s d.
Using Data Set 4 of 15 observations per group, randomly generated from two uniform distribu-
tions with unequal variances (a ratio of 1:4 between Groups 1 and 2), population δR1 = 1.11, δR2
= 0.56, and δR∗ = 1.10 (p. 7 of the Appendix), we computed Algina and colleagues’ dR1 = 2.55,
dR2 = 1.36, and dR∗ = 2.64. It is clear from these values that dR1 and dR2 can be quite different,
depending on the ratio of two different population variances. The population δR1, δR2, or δR∗
can be difficult to compute because of the difficulty in determining the Winsorized variances for
both populations.
(6) Robust Standardized ES Estimator in Order Statistics, Assuming Equal Variance

Hedges and Olkin (1984) proposed a robust estimator (d̃) against outliers or extreme scores in
the sample. It is defined as a standardized median difference between two groups. The denominator
in d̃ is computed from the scores in the control group without the lowest and the highest scores
and with optimal coefficients chosen to "minimize the variance of the σ̃. They are not easily described but are tabulated" (Hedges & Olkin, 1984, p. 575; see, e.g., Sarhan & Greenberg, 1962, pp. 218–251). So far, we have not been able to locate empirical studies of d̃ investigating its statistical properties (e.g., bias, consistency), or comparing it to Cohen's d or other
standardized estimators. A different standardizer proposed by Grissom and Kim (2001) uses
the biweight SD—another robust estimator of the population SD (Wilcox, 2005, pp. 93–95).
Zhang and Schoeps (1997) proposed yet another estimator that replaces the standardizer in d̃
with a difference in two percentiles obtained from the control group. Again, we have not been
able to locate empirical studies of Grissom and Kim’s or Zhang and Schoeps’s alternatives to
d̃, investigating their statistical properties (e.g., bias, consistency), or comparing them to
Cohen’s d or other standardized estimators. For these reasons, we are unable to comment on the
merits of d̃ or its two alternatives, relative to Cohen’s d, as an ES estimator. Last, it is impossible
to synthesize estimates of d̃ and its two alternatives for meta-analysis, if primary studies do not
provide raw data from which to compute these estimates.
Using Data Set 3 of 15 observations per group, randomly generated from two uniform distri-
butions with equal variances and the population δ = 1.73 (p. 6 from the Appendix), we computed
Hedges and Olkin’s d̃ = 1.63. In computing d̃, we chose to divide the median difference by
the standard deviation of the control group after removing the smallest and the largest scores.
We chose not to apply the optimal coefficients suggested by Hedges and Olkin (1984) because
these coefficients are applicable to normal distributions only (Sarhan & Greenberg, 1962, pp.
218–251). Thus, the sample estimate (d̃ = 1.63) would have been larger if optimal weights could have been applied.
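The simplified computation of d̃ just described can be sketched in Python (a minimal illustration mirroring the choice above — the tabulated optimal coefficients, which apply to normal data only, are omitted; the function name is ours):

```python
import statistics

def median_difference_es(x1, x2_control):
    """A simplified d-tilde: the median difference divided by the SD of
    the control scores after dropping the control group's single
    smallest and largest scores (no optimal coefficients)."""
    med_diff = statistics.median(x1) - statistics.median(x2_control)
    inner_control = sorted(x2_control)[1:-1]  # drop one score per tail
    return med_diff / statistics.stdev(inner_control)
```

Like the other median-based quantities discussed here, the result is unchanged by an outlier in either tail of the control group, since the extreme scores never enter the standardizer.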

ES Estimators by Definition 2—Table 2b

As stated earlier, estimators included in Table 2b estimate a population ES defined in terms of the overlap, or lack thereof, between two populations—a special form of the degree of falsehood in H0. Compared to Cohen's d, these estimators do not require a standardizer; they measure ES by considering the effect throughout the distributions, not at the centers alone; and their estimates
are bounded in a range (e.g., from 0 to 1). A majority of these estimators are nonparametric in nature; hence, they make few assumptions about the underlying populations. These ES estimators
can be grouped into three categories: (A) estimators of degree of nonoverlap, (B) estimators of
dominance, and (C) estimators of (mis)classifications. In the following paragraphs, we discuss
statistical properties of these estimators.

(A) Estimators of Degree of Nonoverlap, Assuming Normality and Equal Variance

Cohen (1969) proposed three indices—U1, U2, and U3—to measure the degree of nonoverlap between two populations (see p. 10 of the Appendix for a graphical illustration of these three concepts). They are interrelated and all are derived from Cohen's d under the normality, equal variance, and equal population size assumptions. U1 ranges from 0% to 100%, whereas U2 and U3 range from 50% to 100%. The greater (smaller) the separation between two populations, the greater (smaller) the ES, and hence U1, U2, and U3. Compared to Cohen's d, the three U indices
quantify the magnitude of an effect throughout two distributions, not the centers alone. Despite
Cohen’s claim that these U indices are “intuitively compelling and meaningful” (Cohen, 1969, p.
19), they are not reported in any of the journal articles reviewed by Peng and colleagues (2013) or by other reviews cited in Peng et al.
Using Data Set 1 of 15 observations per group, randomly generated from two normal distributions with equal variances, we computed sample U1 = 56.88%, U2 = 67.87%, and U3 = 85.12%, corresponding to population U1 = 55.38%, U2 = 69.15%, and U3 = 84.13%, respectively. The population U values were derived from a population Cohen's d = 1, and the sample U estimates were derived from a sample Cohen's d = 1.04 (p. 3 of the Appendix). Given this data set, obtained from the assumed normal populations with equal variances, the sample Us estimated the population Us well.
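Because the three U indices are derived from Cohen's d under these assumptions, they take closed forms in the normal CDF; the following Python sketch (function name is ours) reproduces the population values quoted above for δ = 1.0:

```python
from statistics import NormalDist

def cohens_u_indices(d):
    """Cohen's (1969) nonoverlap indices from d, under the normality,
    equal variance, and equal population size assumptions:
    U3 = Phi(d), U2 = Phi(d/2), U1 = (2*U2 - 1)/U2."""
    phi = NormalDist().cdf
    u3 = phi(d)             # proportion of the lower group below the higher group's mean
    u2 = phi(d / 2)         # proportion in one population exceeding the same proportion in the other
    u1 = (2 * u2 - 1) / u2  # percentage of nonoverlap between the two distributions
    return u1, u2, u3
```

`cohens_u_indices(1.0)` returns approximately (.5538, .6915, .8413), i.e., the population U1, U2, and U3 above.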

(B) Estimators of Dominance, Only CL Assumes Normality and Equal Variances, Others Do Not Assume Either

Although called by different names, four estimators—McGraw and Wong’s CL, Grissom and
Kim’s PS, Vargha and Delaney’s Â, and Cliff’s d—fit into this category because they all estimate
the degree of one distribution dominating over the other distribution (or degree of nonoverlapping
between two distributions). The larger (smaller) the degree of dominance of one population over
the other, the larger (smaller) is the ES. CL ranges from .5 to 1, PS and Â range from 0 to
1, and Cliff’s d ranges from −1 to 1. Of these four estimators, CL and PS estimate the same
probability of population superiority, with CL being parametric and PS being nonparametric. Â
and Cliff’s d are nonparametric as well, though they estimate different probabilities of dominance
in the populations. Â estimates the dominance, or superiority, by taking into consideration the
probability of ties in data. In contrast, Cliff’s d estimates the dominance by considering dominance
in both directions (i.e., X 1 > X 2 and X 2 > X 1 ), but not the probability of ties.
All four estimators can be used to test an appropriate H 0 of equal probability of dominance. The
treatment of ties in data differs. Ties are not an issue with CL because CL assumes continuous data.
PS assigns ties equally to both groups, as does Â. Cliff's d excludes ties from its computation
or its corresponding H 0 (see Statistical Properties and Comments of Table 2b and an illustration
of their differences in the Appendix).
Being parametric, CL assumes normality and equal variance (McGraw & Wong, 1992; Vargha & Delaney, 2000). Under nonnormal conditions, CL's performance is adequate; it deteriorates under the violation of both normality and equal variance (McGraw & Wong, 1992). In contrast, the other three estimators (i.e., PS, Â, and Cliff's d) do not require these assumptions and are invariant to monotonic transformations of data (see Statistical Properties and Comments of Table 2b in the
Appendix). All four estimators can be extended to multiple groups or correlated data (Cliff, 1993;
Grissom & Kim, 2001; McGraw & Wong, 1992; Vargha & Delaney, 2000), though Grissom and
Kim (2001, p. 141) cautioned that PS can be intransitive for three or more groups. When normality
and equal variance assumptions hold, CL is an unbiased estimator of Prob (X1 > X2 ), whereas
PS is theoretically its unbiased, consistent, and most efficient (i.e., having the smallest variance
and standard error) estimator, among all unbiased estimators (Grissom & Kim, 2001). Cliff's d is also an unbiased estimator of its corresponding parameter (Cliff, 1993). Cliff (1993) argued
that δ Cliff , or its estimator (Cliff’s d), is better than CL, because “[it] avoids the necessity of
modifiers that communicate the probability of ties” (p. 494). Kraemer and Kupfer (2006) referred
to Cliff’s d as success rate difference (SRD) and demonstrated its usefulness in conveying clinical
significance.
Similar to CL, PS, and Cliff's d, Vargha and Delaney's Â is also an unbiased estimator of its
corresponding population parameter. In addition, Â is complementary (i.e., A12 = 1 – A21 ), extends
CL to ordinal and categorical data (Vargha & Delaney, 2000), is a unique linear transformation of
Cliff’s d (Vargha & Delaney, 2000), is usefulness in conveying clinical significance (Kraemer &
Kupfer, 2006), is related to the area under ROC (receiver operating characteristic) curve (Kraemer
& Kupfer, 2006), is insensitive to base rates (Ruscio, 2008), and has the potential to quantify
the degree of stochastic homogeneity/heterogeneity among multiple groups or the degree of
stochastic equality among correlated data (Delaney & Vargha, 2002; Vargha & Delaney, 1998,
2000). According to Vargha and Delaney (2000), values of .56, .64, and .71 for A and values of .11, .28, and .43 for δCliff correspond to the small, medium, and large values of Cohen's δ. The conceptual
framework of Â developed by Vargha and Delaney (1998, 2000) and Delaney and Vargha (2002)
encompasses CL, PS, and Cliff’s d as special cases, and enables researchers to interpret Â
meaningfully in terms of stochastic equality/superiority or stochastic homogeneity/heterogeneity
in a variety of research contexts and for a variety of data types.
The discussion thus far presents CL, PS, Â, and Cliff's d as viable alternative ES estimators to
Cohen’s d. Yet, it is impossible to synthesize estimates of PS, Â, and Cliff’s d for meta-analysis,
if primary studies do not provide raw data from which to compute these estimates.
Using Data Set 1 of 15 observations per group, randomly generated from two normal popula-
tions with equal variances and a population CL = 0.76, we computed the sample CL estimate to
be 0.77 (p. 3 of the Appendix). Given this data set, obtained from the assumed normal distribu-
tions with equal variances, the sample CL estimated the population CL closely. Using Data Set 4
of 15 observations per group, randomly generated from two uniform distributions with unequal
variances, the population PS = A = .75, and population δ Cliff = .5, we computed the sample
PS = .85, Â12 = .85, and Cliff’s d = .71 (p. 8 of the Appendix). For this particular data set,
sample PS and Â estimated their corresponding population parameters better than Cliff's d for its
parameter.
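For a single pair of samples, all four dominance estimators can be computed by brute-force pairwise comparison; a minimal Python sketch (function name is ours; x1 is assumed to be the higher-scoring group):

```python
import math
from statistics import NormalDist, mean, variance

def dominance_estimates(x1, x2):
    """CL, PS, A-hat, and Cliff's d for two independent samples."""
    n_pairs = len(x1) * len(x2)
    gt = sum(a > b for a in x1 for b in x2)   # group 1 wins
    eq = sum(a == b for a in x1 for b in x2)  # ties
    lt = sum(a < b for a in x1 for b in x2)   # group 2 wins
    # CL (parametric): Phi of the mean difference over sqrt(s1^2 + s2^2).
    cl = NormalDist().cdf((mean(x1) - mean(x2)) /
                          math.sqrt(variance(x1) + variance(x2)))
    ps = (gt + 0.5 * eq) / n_pairs      # PS with ties allocated equally
    a_hat = (gt + 0.5 * eq) / n_pairs   # A-hat coincides with PS under this tie rule
    cliffs_d = (gt - lt) / n_pairs      # ties dropped from the numerator only
    return cl, ps, a_hat, cliffs_d
```

Note that Â12 = (Cliff's d + 1)/2 whenever ties are split equally — the unique linear transformation of Cliff's d mentioned above.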
(C) Estimators of (Mis)Classification, Assuming Multivariate Normality and Equal Variances/Covariances, Except When LR Is Used for Classification

Two estimators (i.e., Levy's p and Huberty and colleagues' Î) and their corresponding parameters express ES in terms of the rate of (mis)classifications. The lower (higher) the misclassification rate, the greater (smaller) the separation between two populations, hence, a greater (smaller) ES. Both estimators range from 0 to 1, and are applicable to univariate and multivariate data. The difference between them is that Levy's p estimates the probability of misclassification, whereas Î estimates the probability of correct classification beyond chance. Thus, Levy's p is an unadjusted probability, and Î is a probability adjusted for the chance probability. In the multivariate case,
both assume multivariate normality and equal variances/covariances, especially when Î derives the observed hit rate (Hito) from predictive discriminant analysis (PDA). According to Huberty and Lowman (2000), for two-group univariate equal variance conditions, values of Î less than .1 are comparable to Cohen's small values of r² (Cohen, 1988); values of Î between .15 and .25 are comparable to the medium values of r² (Cohen, 1988); and an Î greater than .3 is comparable to the large values of r² (Cohen, 1988). For k-group univariate equal variance conditions, Huberty
and Lowman (2000) suggested values of Î less than .1, between .15 and .25, and greater than .3, to be comparable to Cohen's small, medium, and large values of η²adj (Cohen, 1988).
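Given an observed hit rate and the chance hit rate, Î is simple to compute; a minimal sketch (the function name is ours; priors default to the sample-size proportions qj = nj/N, one of the two options in Table 2b):

```python
def improvement_over_chance(hit_observed, n1, n2, q1=None, q2=None):
    """Huberty and colleagues' I-hat: the observed hit rate H_ito adjusted
    for the chance hit rate H_e = (q1*n1 + q2*n2)/(n1 + n2)."""
    if q1 is None:
        # Default priors proportional to sample sizes: q_j = n_j / N.
        q1, q2 = n1 / (n1 + n2), n2 / (n1 + n2)
    h_e = (q1 * n1 + q2 * n2) / (n1 + n2)
    return (hit_observed - h_e) / (1 - h_e)
```

For example, with n1 = n2 = 15 and an observed hit rate of .80, He = .50 and Î = .60: the classifier recovers 60% of the correct classifications that chance leaves unexplained.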
The Î index may be derived from logistic regression (LR) as well. LR does not require these two assumptions (see Statistical Properties and Comments of Table 2b on pp. 26–27 of the Appendix).
We agree with Hess, Olejnik, and Huberty (2001) that using either PDA or LR to derive Î involves the difficult task of specifying the priors in PDA and the cutoff for LR. In their simulation study, the prior probability for linear and quadratic predictive discriminant analysis was maintained at .50, which was also used as a cutoff for LR. Hence, the results favored the equal-ns conditions over the unequal-ns conditions. Natesan and Thompson (2007) extended the work of Hess et al. (2001) to conditions with small but equal sample sizes (n = 10, 20, 30, and 40). Their simulation results found that LR worked better than PDA in conditions with n ≥ 20. When n = 10, LR and PDA performed poorly in estimating I. Furthermore, when the populations are not normal and/or variances and covariances are heterogeneous, different theoretical values of I may be obtained, even for the same population ES (δ) (Hess et al., 2001). In other words, the population value of I depends on the variance ratio and the population distribution shape. Since data cannot be guaranteed to come from two normal distributions with equal variances and covariances, these limitations with Î or I need to be taken into consideration when reporting and interpreting Huberty and colleagues' Î.
Because of the difficulty of computing the chance hit rate and the unclear definition of its
population parameter, Î is not, in our opinion, a viable alternative to Cohen’s d.
Levy’s p poses fewer problems than Î in computation and in the definition of its parameter. Yet it
assumes normality and equal variances/covariances for the populations. It further assumes that
misclassification occurs when a score deviates from its own group mean by more than half of the distance
between the two group means (the univariate case) or the two centroids (the multivariate case),
without considering the costs of misclassification (Levy, 1967). These three assumptions may not
be realistic or rational in a practical or clinical sense. Hence, we caution against reporting Levy’s
p as a substitute for Cohen’s d. Similar to the estimators in Categories (4), (5), and (6) of Table 2a
and those in Category (B) of Table 2b, both Î and Levy’s p are impossible to synthesize for
meta-analysis, if the primary studies do not provide raw data from which to derive their estimates.
Thus, they are not viable alternatives to Cohen’s d.
Using Data Set 1 of 15 observations per group, randomly generated from two normal populations
with equal variances, we computed Levy’s sample estimate p = .52 for a population
parameter of .5, which implied equal likelihood of classification into Population 1 or 2 (p. 4 of
the Appendix). For a population hit rate of .6915 based on equal likelihood of classification and a
population I of .38, we derived Huberty and colleagues’ sample estimate Î = .47 based on PDA,
or Î = .40 based on LR (p. 4 of the Appendix). For this data set, both Levy’s sample estimate p
and Huberty and colleagues’ Î obtained from LR estimated their corresponding population
parameters better than Huberty and colleagues’ Î obtained from PDA.
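Levy’s classification rule, and the chance correction applied to its hit rate, can both be sketched from raw scores. The rule below assigns each score to the nearer of the two sample means, which is equivalent to Levy’s half-distance criterion in the univariate case; the tiny data set is hypothetical, not the article’s Data Set 1:

```python
import numpy as np

def levy_p(a, b):
    """Proportion misclassified under Levy's (1967) rule: a score is
    misclassified when it lies closer to the other group's sample mean
    than to its own group's sample mean."""
    ma, mb = a.mean(), b.mean()
    wrong = (np.abs(a - ma) > np.abs(a - mb)).sum() \
          + (np.abs(b - mb) > np.abs(b - ma)).sum()
    return wrong / (len(a) + len(b))

a = np.array([0.0, 1.0, 2.0])        # hypothetical scores, group 1
b = np.array([3.0, 4.0, 10.0])       # hypothetical scores, group 2
p = levy_p(a, b)                     # Levy's p: misclassification rate
i_hat = ((1 - p) - 0.5) / (1 - 0.5)  # I-hat from the same data, equal priors
print(p, i_hat)
```

This makes the contrast concrete: p is the raw misclassification proportion, whereas Î rescales the complementary hit rate against the chance rate of .50.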
DISCUSSION

Since the publication of Glass’s (1976) seminal article on meta-analysis, the conceptualization
of ES has mushroomed from a descriptive, quantitative measure to well-developed statistical
theories and models for ES estimation and integration (Grissom & Kim, 2005, 2012; Hedges
& Olkin, 1985; Keselman et al., 2008). Given this long history of ES indices (Huberty, 2002)
and several attempts by APA and AERA to encourage the calculation and reporting of ES
(American Educational Research Association, 2006; American Psychological Association, 2010;
APA Publications and Communications Board Working Group on Journal Article Reporting
Standards, 2008; Wilkinson and the APA Task Force on Statistical Inference, 1999), it is not
surprising that the average rate of reporting ES has increased among quantitative studies published
in psychology and education journals (e.g., Dunleavy et al., 2006; Fidler et al., 2005; Harrison
et al., 2009; Matthews et al., 2008; Meline & Wang, 2004; Odgaard & Fowler, 2010; Osborne,
2008; Vacha-Haase et al., 2000). Yet, despite the increasing rate of reporting ES, several problematic
reporting practices have persisted, namely, the limited types of ES reported, the poor statistical properties
of the ES reported, insufficient interpretation of ES, and little effort by researchers to supplement
statistical significance testing results with practical or clinical significance, presumably measured
by ES. The three most popular ES indices were found to be the unadjusted R2, Cohen’s d, and η2
(e.g., Alhija & Levy, 2009; Andersen et al., 2007; Harrison et al., 2009; Keselman et al., 1998;
Kirk, 1996; Matthews et al., 2008; Smith & Honoré, 2008). Unfortunately, these three ES indices
have been criticized for bias, lack of robustness against outliers or departures from normality, and
instability under violations of statistical assumptions (e.g., Algina et al., 2005; Hedges & Olkin,
1985; Maxwell et al., 1981; Yin & Fan, 2001). Because there are myriad ES indices in the literature
and abundant alternative indices have already been proposed and investigated to substitute for the
unadjusted R2 and η2, we chose to focus on alternative ES indices that are particularly well suited
for documenting differences between two groups in between-subject designs.
In this article, we presented several alternatives to Cohen’s d to help researchers conceptualize
ES beyond standardized mean differences. These alternative estimators correspond to different
definitions of ES (Table 1), possess different statistical properties (Tables 2a and 2b in the
Appendix), and present opportunities, or challenges, for meta-analysts (Tables 2a and 2b in the
Appendix), all within the context of between-subject designs with two groups. In the remaining
paragraphs, we highlight issues associated with selecting one or more ES measures suitable for
research in education and psychology.
Standardized Versus Unstandardized ES Measures

While estimators in raw units, i.e., Category (1) of Table 2a, are preferred over standardized
estimators as long as the metric or measurement scale is meaningful (APA Publication Manual,
2010; Wilkinson and the APA Task Force on Statistical Inference, 1999), they present challenges
for comparisons across studies and for synthesis in meta-analysis because units of measurement
are rarely identical from one study to the next. Standardized estimators, in Categories (2) to (6)
of Table 2a, can be meaningful and appealing to researchers in general. One knotty challenge
in standardizing sample mean/median differences is the choice of a standardizer. The choice
depends on research contexts (e.g., Glass et al., 1981; Kelley & Preacher, 2012), the equal variance
assumption, and the ES estimator’s robustness to this assumption (e.g., Grissom & Kim, 2001).
Researchers should not assume that the equal variance assumption is met without testing it (Grissom
& Kim, 2001). Grissom (2000) offered a number of practical and useful solutions to this knotty
challenge. Glass et al. (1981) advocated the use of the control group’s SD as the standardizer in meta-
analysis, but recommended either the experimental or the control group’s SD as the standardizer
for two-group comparisons, because the Glass’s gs computed from the two SDs express two distinct features
of a finding.
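The consequence of the standardizer choice is easy to see in code. Below is a minimal sketch contrasting a pooled-SD standardizer (Cohen’s d) with the control group’s SD (Glass’s g); the scores are hypothetical and chosen so the two groups have unequal variances:

```python
import numpy as np

def cohens_d(exp, ctl):
    """Standardized mean difference with a pooled SD (assumes equal variances)."""
    exp, ctl = np.asarray(exp, float), np.asarray(ctl, float)
    ne, nc = len(exp), len(ctl)
    pooled_var = ((ne - 1) * exp.var(ddof=1) + (nc - 1) * ctl.var(ddof=1)) / (ne + nc - 2)
    return (exp.mean() - ctl.mean()) / np.sqrt(pooled_var)

def glass_g(exp, ctl):
    """Standardized mean difference using the control group's SD
    (Glass et al., 1981), useful when treatment also affects variability."""
    exp, ctl = np.asarray(exp, float), np.asarray(ctl, float)
    return (exp.mean() - ctl.mean()) / ctl.std(ddof=1)

exp = [4, 5, 6]   # hypothetical treatment scores
ctl = [0, 2, 4]   # hypothetical control scores (larger spread)
print(cohens_d(exp, ctl), glass_g(exp, ctl))  # the two standardizers disagree
```

With equal group variances the two estimators coincide; the more the variances diverge, the more the choice of standardizer matters.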
Robust Estimators of Standardized ES Measures

The second challenge faced by researchers in standardizing ES estimators lies in the normality
assumption made by all estimators in Categories (2) and (3) of Table 2a. To deal with
this challenge, robust estimators were proposed. One set of robust estimators [Categories (4)
and (5) of Table 2a] is robust against nonnormal populations. Though the robust estimators in
Categories (4) and (5) are defined as ratios of two estimators that are robust according to the three
criteria discussed in Wilcox (2005, Section 2.1), there is no guarantee that the ratios themselves
are robust by the same three criteria. Two of these estimators (dR and dR) further assume equal
variances. To our knowledge, the robustness of these estimators against the violation of the equal
variance assumption has not been comprehensively investigated. Bonett (2008) further cautioned that their
interpretations can be difficult in terms of the degree of separation between/among distributions,
because of the unknown population shapes and the bounded scales used in educational and
psychological measurements, as alluded to previously. In our empirical analyses of these robust
estimators, we became aware of another potential difficulty posed to researchers in attempting to
compute population trimmed means and Winsorized variances when the population distributions
are unknown. Consequently, the population robust parameters can be difficult to conceptualize,
let alone to estimate.
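To make the trimming and Winsorizing operations concrete, here is a minimal sketch in the spirit of Algina, Keselman, and Penfield’s (2005) robust variant of d: a trimmed-mean difference standardized by a pooled Winsorized SD. The .642 constant rescales a 20%-Winsorized SD to the normal-theory metric; details beyond this sketch (e.g., the exact divisor for the Winsorized variance) vary by author:

```python
import numpy as np

def trimmed_mean(x, prop=0.2):
    """Mean of the middle scores after dropping the lowest and highest
    `prop` proportion of the sorted sample."""
    x = np.sort(np.asarray(x, dtype=float))
    g = int(prop * len(x))
    return x[g:len(x) - g].mean()

def winsorized_var(x, prop=0.2):
    """Sample variance after pulling each tail in to the nearest retained score."""
    x = np.sort(np.asarray(x, dtype=float))
    g = int(prop * len(x))
    if g > 0:
        x[:g] = x[g]           # replace the low tail
        x[-g:] = x[-g - 1]     # replace the high tail
    return x.var(ddof=1)

def robust_d(a, b, prop=0.2):
    """Robust standardized difference: trimmed-mean difference over a
    pooled Winsorized SD, rescaled by .642 for 20% trimming."""
    na, nb = len(a), len(b)
    sw2 = ((na - 1) * winsorized_var(a, prop)
           + (nb - 1) * winsorized_var(b, prop)) / (na + nb - 2)
    return 0.642 * (trimmed_mean(a, prop) - trimmed_mean(b, prop)) / np.sqrt(sw2)
```

Because outliers are trimmed or pulled in before anything is averaged, a single extreme score moves this estimator far less than it moves Cohen’s d.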
The other class of robust estimators [i.e., Hedges and Olkin’s d̃ and its alternatives in Category
(6) of Table 2a] claims to be robust against outliers or extreme scores in samples and populations,
yet these estimators assume equal variances. So far, we have not been able to locate empirical studies
investigating their statistical properties or comparing them to Cohen’s d or other standardized
estimators. For these reasons, we are unable to comment further on these estimators.
ES Measures of Distribution Overlap

Estimators presented in Table 2b estimate the overlap (or lack thereof) of two populations—a more
general definition of ES than population mean (or other center) differences. They are expressed
as probabilities or differences in probabilities. Hence, they are neither standardized nor in raw
units. Except for McGraw and Wong’s CL, Levy’s p, and Huberty and colleagues’ Î, all the others are
nonparametric in nature, requiring virtually no assumptions about the underlying populations,
and can be computed and interpreted easily. All of them express ES “as a measure of the effect
of a treatment throughout a distribution, not just at its center” (Grissom & Kim, 2001, p. 140)—a
radical reconceptualization of ES according to Grissom and Kim (2001). In our empirical analyses
of PS, A, and Cliff’s population d, we became aware of the potential difficulty posed to researchers
in attempting to define the population dominance (the parameter) when population distributions
are unknown or cannot easily be assumed to follow standard shapes, such as the normal distribution.
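McGraw and Wong’s (1992) CL, one of the parametric members of this family, converts the standardized distance between two independent normal distributions into a probability of superiority. A minimal sketch using only the two means and SDs (the example values are hypothetical):

```python
import math

def common_language_es(m1, s1, m2, s2):
    """CL (McGraw & Wong, 1992): assuming normality and independence, the
    probability that a random draw from group 1 exceeds a random draw
    from group 2."""
    z = (m1 - m2) / math.sqrt(s1 ** 2 + s2 ** 2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # Phi(z), standard normal CDF

# Two unit-variance normals one SD apart: CL is about .76.
print(round(common_language_es(1.0, 1.0, 0.0, 1.0), 2))
```

The normality assumption is doing real work here: when the populations are skewed or heavy tailed, CL computed this way can differ noticeably from the nonparametric dominance estimators in the same table.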
Of the nine estimators summarized in Table 2b, we recommend the four estimators of dominance
in Category (B) to supplement Cohen’s d and to conceptualize ES beyond mean differences.
Of these four estimators, Vargha and Delaney’s Â stands out for its meaningful interpretability in
terms of stochastic equality/superiority or stochastic homogeneity/heterogeneity in a variety of
research contexts and for a variety of data types. Compared to Cohen’s d, Vargha and Delaney’s
Â represents a radical reconceptualization of ES with sound statistical properties and a well-developed
theoretical framework. A SAS macro written by Kromrey and Coughlin (2007) can be
used to facilitate the computation of Vargha and Delaney’s Â. As for meta-analysis, Vargha and
Delaney’s Â is not as readily suitable as Cohen’s d, because the rank information needed
to compute Â is usually not reported in primary studies. The same obstacle
is encountered when researchers try to conduct meta-analysis using the estimators in Categories (4),
(5), and (6) of Table 2a or those in Categories (B) and (C), except for CL, of Table 2b. This
may be why researchers have persisted in not reporting alternative ES indices, besides
Cohen’s d, in meta-analysis.
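The Â statistic itself is simple enough to compute directly from two samples, without special software. A sketch from first principles, with ties counted as one-half (the scores below are hypothetical):

```python
import numpy as np

def vargha_delaney_A(x, y):
    """A-hat (Vargha & Delaney, 2000): estimated probability that a randomly
    chosen score from x exceeds a randomly chosen score from y, with ties
    counted one-half. .5 = stochastic equality; 1 = complete superiority."""
    x, y = np.asarray(x), np.asarray(y)
    greater = (x[:, None] > y[None, :]).sum()   # pairs where x wins
    ties = (x[:, None] == y[None, :]).sum()     # tied pairs
    return (greater + 0.5 * ties) / (len(x) * len(y))

print(vargha_delaney_A([1, 2, 3], [1, 2, 3]))  # identical groups: 0.5
print(vargha_delaney_A([4, 5, 6], [1, 2, 3]))  # complete superiority: 1.0
```

Because Â depends only on the ordering of pairwise comparisons, it is invariant to any monotone transformation of the scores, which is exactly why it suits ordinal and nonnormal data.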

Standard Errors of ES Measures

So far, this article has discussed the usage and statistical properties of alternative ES point estimators
beyond Cohen’s d. Yet the 6th edition of the APA Publication Manual specifically states, “Whenever
possible, provide a confidence interval for each effect size reported to indicate the precision of
estimation of the effect size” (p. 34). Several methods exist in the literature [e.g., noncentral
t (Steiger & Fouladi, 1997), Bonett’s method (Bonett, 2008), and the bootstrap (Efron & Tibshirani, 1993)] to
assist researchers in constructing confidence intervals for ES estimates computed from sample
data. To the extent possible, we have summarized findings from studies that compared the coverage
probabilities and precisions of CIs constructed by various methods for the estimators considered
in this article (see the Appendix). To facilitate the reporting of ES, Hogarty and Kromrey (2001,
p. 10) also urged researchers to carefully consider factors (e.g., designs, operational details,
measurement reliability, sample characteristics) that affect the appropriate interpretation of ESs,
in addition to the limitations of certain ES indices.
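Of the CI methods just mentioned, the percentile bootstrap (Efron & Tibshirani, 1993) is the most general, because it only requires recomputing the estimate on resampled data. A generic sketch, with a hypothetical data set and a raw mean difference standing in for any of the ES estimators discussed above:

```python
import numpy as np

def bootstrap_ci(a, b, es, n_boot=5000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for a two-group effect size es(a, b):
    resample each group with replacement, recompute the ES, and take
    the empirical alpha/2 and 1 - alpha/2 percentiles."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    reps = [es(rng.choice(a, size=len(a)), rng.choice(b, size=len(b)))
            for _ in range(n_boot)]
    return tuple(np.percentile(reps, [100 * alpha / 2, 100 * (1 - alpha / 2)]))

# Example: 95% CI for a raw mean difference on hypothetical data.
rng = np.random.default_rng(42)
a, b = rng.normal(1.0, 1.0, 40), rng.normal(0.0, 1.0, 40)
lo, hi = bootstrap_ci(a, b, lambda x, y: x.mean() - y.mean())
print(lo, hi)
```

Swapping in a different `es` function (a robust d, Â, and so on) yields an interval for that estimator, though coverage properties differ across estimators and should be checked against the simulation literature summarized in the Appendix.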
We believe that the construction of a typology of ES is the first step toward systematically exploring
and studying the methodology of ES estimators. The typology presented in this article went beyond
the d-type versus r-type classification of ES estimators (Rosenthal, 1994) to include additional
facets, such as parametric versus nonparametric, robust versus nonrobust, and ES for estimating
population mean differences versus ES for estimating population overlap, or lack thereof. Although
the scope of this study is limited to ES measures suitable for between-subject designs, several
estimators are applicable to multiple groups in between-subject, within-subject, and split-plot
designs (e.g., Delaney & Vargha, 2002; Hedges & Olkin, 1984; Kraemer & Andrews, 1982;
Zhang & Schoeps, 1997). Additional studies are needed to investigate the performance of ES
estimators under various missing-data conditions, and to extend the typology and recommendations
offered in this article to broader research contexts (e.g., single-subject designs) that present unique
challenges in developing proper ES indices.

AUTHOR NOTES

Chao-Ying Joanne Peng is Professor of Inquiry Methodology and Adjunct Professor of Statis-
tics at Indiana University-Bloomington. Her research interests include effect size estimation,
research designs, and statistical computing. Li-Ting Chen is Research Associate at Indiana
University-Bloomington. Her research interests include effect size estimation, simulations, and
SAS programming.

REFERENCES

Algina, J., Keselman, H. J., & Penfield, R. D. (2005). An alternative to Cohen’s standardized mean difference effect size:
A robust parameter and confidence interval in the two independent groups case. Psychological Methods, 10, 317–328.
doi: 10.1037/1082-989X.10.3.317
Alhija, F. N.-A., & Levy, A. (2009). Effect size reporting practices in published articles. Educational and Psychological
Measurement, 69, 245–265. doi: 10.1177/0013164408315266
American Educational Research Association. (2006). Standards for reporting on empirical social science research in
AERA publications. Educational Researcher, 35, 33–40. doi: 10.3102/0013189X035006033
American Psychological Association. (2010). Publication manual of the American Psychological Association (6th ed.).
Washington, DC: Author.
Andersen, M. B., McCullagh, P., & Wilson, G. J. (2007). But what do the numbers really tell us?: Arbitrary metrics and
effect size reporting in sport psychology research. Journal of Sport & Exercise Psychology, 29, 664–672. Retrieved
from http://journals.humankinetics.com/jsep
APA Publications and Communications Board Working Group on Journal Article Reporting Standards. (2008). Reporting
standards for research in psychology: Why do we need them? What might they be? American Psychologist, 63,
839–851. doi: 10.1037/0003-066X.63.9.839
Bond, C. F. Jr., Wiitala, W. L., & Richard, F. D. (2003). Meta-analysis of raw mean differences. Psychological Methods,
8, 406–418. doi: 10.1037/1082-989X.8.4.406
Bonett, D. G. (2008). Confidence intervals for standardized linear contrasts of means. Psychological Methods, 13, 99–109.
doi: 10.1037/1082-989X.13.2.99
Brunner, E., & Puri, M. L. (2001). Nonparametric methods in factorial designs. Statistical Papers, 42, 1–52. doi:
10.1007/s003620000039
Burr, I. W., & Cislak, P. J. (1968). On a general system of distributions: I. Its curve-shape characteristics: II. The
sample median. Journal of American Statistical Association, 63, 627–635. Retrieved from http://www.amstat.org/
publications/jasa.cfm
Cliff, N. (1993). Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological Bulletin, 114, 494–509.
doi: 10.1037/0033-2909.114.3.494
Cliff, N. (1996). Ordinal methods for behavioral data analysis. Mahwah, NJ: Erlbaum.
Cohen, J. (1969). Statistical power analysis for the behavioral sciences. New York, NY: Academic Press.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003. doi: 10.1037/0003-066X.49.12.997
Cooper, H., Hedges, L. V., & Valentine, J. C. (Eds.). (2009). The handbook of research synthesis and meta-analysis (2nd
ed.). New York, NY: Russell Sage Foundation.
Delaney, H. D., & Vargha, A. (2002). Comparing several robust tests of stochastic equality with ordinally scaled variables
and small to moderate sized samples. Psychological Methods, 7, 485–503. doi: 10.1037/1082–989X.7.4.485
Dunleavy, E. M., Barr, C. D., Glenn, D. M., & Miller, K. R. (2006). Effect size reporting in applied psychology: How
are we doing? The Industrial-Organizational Psychologist, 43, 29–37.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York, NY: Chapman & Hall.
Ellison, B. E. (1964). Two theorems for inferences about the normal distribution with applications in acceptance sampling.
Journal of American Statistical Association, 59, 89–95. Retrieved from http://www.amstat.org/publications/jasa.cfm
Fidler, F., Cumming, G., Thomason, N., Pannuzzo, D., Smith, J., Fyffe, P., . . . Schmitt, R. (2005). Evaluating the
effectiveness of editorial policy to improve statistical practice: The case of the Journal of Consulting and Clinical
Psychology. Journal of Consulting and Clinical Psychology, 73, 136–143. doi: 10.1037/0022-006X.73.1.136
Glass, G. V. (1976). Primary, secondary, and meta-analysis of research. Educational Researcher, 5, 3–8. doi:
10.3102/0013189X005010003
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Newbury Park, CA: Sage.
Grissom, R. J. (1994). Probability of the superior outcome of one treatment over another. Journal of Applied Psychology,
79, 314–316. doi: 10.1037/0021-9010.79.2.31
Grissom, R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting and Clinical Psychology, 68,
155–165. doi: 10.1037/0022-006X.68.1.155
Grissom, R. J., & Kim, J. J. (2001). Review of assumptions and problems in the appropriate conceptualization of effect
size. Psychological Methods, 6, 135–146. doi: 10.1037/1082-989x.6.2.135
Grissom, R. J., & Kim, J. J. (2005). Effect sizes for research: A broad practical approach. Mahwah, NJ: Erlbaum.
Grissom, R. J., & Kim, J. J. (2012). Effect sizes for research: Univariate and multivariate applications (2nd ed.). New
York, NY: Routledge.
Harrison, J., Thompson, B., & Vannest, K. J. (2009). Interpreting the evidence for effective interventions to increase
the academic performance of students with ADHD: Relevance of the statistical significance controversy. Review of
Educational Research, 79, 740–775. doi: 10.3102/0034654309331516
Hedges, L. V. (1981). Distributional theory for Glass’s estimator of effect size and related estimators. Journal of Educa-
tional Statistics, 6, 107–128. doi: 10.2307/1164588
Hedges, L. V. (1982). Estimation of effect size from a series of independent experiments. Psychological Bulletin, 92,
490–499. doi: 10.1037/0033-2909.92.2.490
Hedges, L. V., & Olkin, I. (1984). Nonparametric estimators of effect size in meta-analysis. Psychological Bulletin, 96,
573–580. doi: 10.1037/0033-2909.96.3.573
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic Press.
Hess, M. R., & Kromrey, J. D. (2004, April). Robust confidence intervals for effect sizes: A comparative study of Cohen’s
d and Cliff’s delta under non-normality and heterogeneous variance. Paper presented at the annual meeting of the
American Educational Research Association, San Diego, CA.
Hess, B., Olejnik, S., & Huberty, C. J. (2001). The efficacy of two improvement-over-chance effect sizes for two-group
univariate comparisons under variance heterogeneity and nonnormality. Educational and Psychological Measurement,
61, 909–936. doi: 10.1177/00131640121971572
Hinkle, D. E. (2003). Applied statistics for the behavioral sciences. Boston, MA: Houghton Mifflin.
Hogarty, K. Y., & Kromrey, J. D. (2001, April). We’ve been reporting some effect sizes: Can you guess what they mean?
Paper presented at the annual meeting of the American Educational Research Association, Seattle, WA.
Huberty, C. J. (2002). A history of effect size indices. Educational and Psychological Measurement, 62, 227–240. doi:
10.1177/0013164402062002002
Huberty, C. J., & Lowman, L. L. (2000). Group overlap as a basis for effect size. Educational and Psychological
Measurement, 60, 543–563. doi: 10.1177/0013164400604004
Huynh, C. L. (1989, March). A unified approach to the estimation of effect size in meta-analysis. Paper presented at the
Annual Meeting of the American Educational Research Association, San Francisco. (ERIC Document Reproduction
Service No. 306 248)
Jacobson, N., Follette, W. C., & Revenstorf, D. (1984). Psychotherapy outcome research: Methods for reporting variability
and evaluating clinical significance. Behavior Therapy, 15, 336–352. doi: 10.1016/S0005-7894(84)80002-7
Jacobson, N., & Truax, P. (1991). Clinical significance: A statistical approach to defining meaningful change in psy-
chotherapy research. Journal of Consulting and Clinical Psychology, 59, 12–19. doi: 10.1037//0022-006X.59.1.12
James, G. S. (1951). The comparison of several groups of observations when the ratios of the population variances are
unknown. Biometrika, 38, 234–329. Retrieved from http://biomet.oxfordjournals.org
James, G. S. (1954). Tests of linear hypothesis in univariate and multivariate analysis when the ratios of the populations
variances are unknown. Biometrika, 41, 19–43. Retrieved from http://biomet.oxfordjournals.org
Kelley, K. (2005). The effects of nonnormal distributions on confidence intervals around the standardized mean differ-
ence: Bootstrap and parametric confidence intervals. Educational and Psychological Measurement, 65, 51–69. doi:
10.1177/0013164404264850.
Kelley, K., & Preacher, K. J. (2012). On effect size. Psychological Methods, 17, 137–152. doi: 10.1037/a0028086
Keselman, H. J., Algina, J., Lix, L. M., Wilcox, R. R., & Deering, K. N. (2008). A generally robust approach for testing
hypotheses and setting confidence intervals for effect sizes. Psychological Methods, 13, 110–129. doi: 10.1037/1082-
989X.13.2.110
Keselman, H. J., Huberty, C. J., Lix, L. M., Olejnik, S., Cribbie, R. A., Donahue, B., . . . Levin, J. R. (1998). Statistical
practices of educational researchers: An analysis of their ANOVA, MANOVA, and ANCOVA analyses. Review of
Educational Research, 68, 350–386. doi: 10.3102/00346543068003350
Kirk, R. E. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement,
56, 746–759. doi: 10.1177/0013164496056005002
Kirk, R. E. (2007). Effect magnitude: A different focus. Journal of Statistical Planning and Inference, 137, 1634–1646.
doi: 10.1016/j.jspi.2006.09.011
Kline, R. B. (2004). Beyond significant testing: Reforming data analysis methods in behavioral research. Washington,
DC: American Psychological Association.
Kraemer, H. C., & Andrews, G. (1982). A nonparametric technique for meta-analysis effect size calculation. Psychological
Bulletin, 91, 404–412. doi: 10.1037/0033-2909.91.2.404
Kraemer, H. C., & Kupfer, D. J. (2006). Size of treatment effects and their importance to clinical research and practice.
Biological Psychiatry, 59, 990–996. doi: 10.1016/j.biopsych.2005.09.014
Kramer, S., & Rosenthal, R. (1999). Effect sizes and significance levels in small-sample research. In R. Hoyle (Ed.),
Statistical strategies for small sample research (pp. 59–79). Thousand Oaks, CA: Sage.
Kromrey, J. D., & Coughlin, K. B. (2007, November). ROBUST ES: A SAS macro for computing robust estimates of
effect size. Paper presented at the annual meeting of the SouthEast SAS Users Group, Hilton Head, SC. Retrieved
from http://analytics.ncsu.edu/sesug/2007/PO19.pdf
Levy, P. (1967). Substantive significance of significant differences between groups. Psychological Bulletin, 67, 37–40.
doi: 10.1037/h0020415
McGrath, R. E., & Meyer, G. J. (2006). When effect sizes disagree: The case of r and d. Psychological Methods, 11,
386–401. doi: 10.1037/1082-989X.11.4.386
Matthews, M. S., Gentry, M., McCoach, D. B., Worrell, F. C., Matthews, D., & Dixon, F. (2008). Evaluating the
state of a field: Effect size reporting in gifted education. The Journal of Experimental Education, 77, 55–65. doi:
10.3200/JEXE.77.1.55-68
Maxwell, S. E., Camp, C. J., & Arvey, R. D. (1981). Measures of strength of association: A comparative examination.
Journal of Applied Psychology, 66, 525–534. doi: 10.1037/0021-9010.66.5.525
McGraw, K. O., & Wong, S. P. (1992). A common language effect size statistic. Psychological Bulletin, 111, 361–365.
doi: 10.1037/0033-2909.111.2.361
Meline, T., & Wang, B. (2004). Effect reporting practices in AJSLP and other ASHA journals, 1999–2003. American
Journal of Speech-Language Pathology, 13, 202–207. Retrieved from http://ajslp.asha.org
Meyers, L. S., Gamst, G., & Guarino, A. J. (2006). Applied multivariate research: Design and interpretation. Thousand
Oaks, CA: Sage.
Murphy, B. P. (1976). Comparison of some two sample means tests by simulation. Communications in Statistics B:
Simulations and Computation, 5, 23–32. doi: 10.1080/03610917608812004
Nakagawa, S., & Cuthill, I. C. (2007). Effect size, confidence interval, and statistical significance: A practical guide for
biologists. Biological Reviews, 82, 591–605. doi: 10.1111/j.1469-185X.2007.00027.x
Natesan, P., & Thompson, B. (2007). Extending improvement-over-chance I-index effect size simulation studies to cover
some small-sample cases. Educational and Psychological Measurement, 67, 59–72. doi: 10.1177/0013164406292028
Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences. New York, NY: Wiley.
Odgaard, E. C., & Fowler, R. L. (2010). Confidence intervals for effect sizes: Compliance and clinical significance in the
Journal of Consulting and Clinical Psychology. Journal of Consulting and Clinical Psychology, 78, 287–297. doi:
10.1037/a0019294
Olejnik, S., & Algina, J. (2000). Measures of effect size for comparative studies: Applications, interpretations, and
limitations. Contemporary Educational Psychology, 25, 241–286. doi: 10.1006/ceps.2000.1040
Osborne, J. W. (2008). Sweating the small stuff in educational psychology: How effect size and power reporting failed
to change from 1969 to 1999, and what that means for the future of changing practices. Educational Psychology, 28,
151–160. doi: 10.1080/01443410701491718
Peng, C.-Y. J., Chen, L.-T., Chiang, H.-M., & Chiang, Y.-C. (2013). The impact of APA and AERA guidelines on effect
size reporting. Educational Psychology Review, 25, 157–209. doi: 10.1007/s10648-013-9218-2
Pratt, J. W. (1964). Obustness [sic] of some procedures for the two-sample location problem. Journal of American
Statistical Association, 59, 665–680. Retrieved from http://www.amstat.org/publications/jasa.cfm
Pratt, J. W., & Gibbons, J. D. (1981). Concepts of nonparametric theory. New York, NY: Springer-Verlag.
Rosenthal, R. (1994). Parametric measures of effect size. In H. Cooper & L. V. Hedges (Eds.), The handbook of research
synthesis (pp. 231–244). New York, NY: Russell Sage Foundation.
Ruscio, J. (2008). A probability-based measure of effect size: Robustness to base rates and other factors. Psychological
Methods, 13, 19–30. doi: 10.1037/1082-989X.13.1.19
Sarhan, A. S., & Greenberg, B. G. (1962). Contributions to order statistics. New York, NY: Wiley.
Smith, M. L., & Glass, G. V. (1977). Meta-analysis of psychotherapy outcome studies. American Psychologist, 32,
752–760. doi: 10.1037/0003-066X.32.9.752
Smith, M. L., & Honoré, H. H. (2008). Effect size reporting in current health education literature. American Journal of
Health Studies, 23, 130–135. Retrieved from http://www.va-ajhs.com
Snyder, P. A., & Thompson, B. (1998). Use of tests of statistical significance and other analytic choices in a school
psychology journal: Review of practices and suggested alternatives. School Psychology Quarterly, 13, 335–348. doi:
10.1037/h0088990
Somers, R. H. (1968). An approach to the multivariate analysis of ordinal data. American Sociological Review, 971–977.
doi: 10.2307/2092687
Staudte, R. G., & Sheather, S. J. (1990). Robust estimation and testing. New York, NY: Wiley.
Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L.
Harlow, S. Mulaik & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 221–257). Hillsdale, NJ:
Erlbaum.
Sun, S. Y., Pan, W., & Wang, L. L. (2010). A comprehensive review of effect size reporting and interpreting prac-
tices in academic journals in education and psychology. Journal of Educational Psychology, 102, 989–1004. doi:
10.1037/a0019507
Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics. Boston, MA: Pearson/Allyn & Bacon.
Thompson, B. (1988). A note on statistical significance testing. Measurement and Evaluation in Counseling and Devel-
opment, 20, 146–148.
Thompson, B. (1989). Statistical significance, result importance, and result generalizability: Three noteworthy but some-
what different issues. Measurement and Evaluation in Counseling and Development, 22, 2–5.
Thompson, B. (Guest Ed.). (1993). Statistical significance testing in contemporary practice (theme issue). The Journal of
Experimental Education, 61, 285–393.
Thompson, B. (1994). Guidelines for authors. Educational and Psychological Measurement, 54, 837–847.
Thompson, B. (2002). What future quantitative social science research could look like: Confidence intervals for effect
sizes. Educational Researcher, 31, 25–32. doi: 10.3102/0013189X031003025
Thompson, B. (2008). Foundations of behavioral statistics: An insight-based approach. New York, NY: Guilford Press.
Thompson, B., & Snyder, P. A. (1997). Statistical significance testing practices. The Journal of Experimental Education,
66, 75–83. doi: 10.1080/00220979709601396
Vacha-Haase, T., Nilsson, J. E., Reetz, D. R., Lance, T. S., & Thompson, B. (2000). Reporting practices and APA
editorial policies regarding statistical significance and effect size. Theory and Psychology, 10, 413–425. doi:
10.1177/0959354300103006
Vargha, A., & Delaney, H. D. (1998). The Kruskal-Wallis test and stochastic homogeneity. Journal of Educational and
Behavioral Statistics, 23, 170–192. doi: 10.2307/1165320
Vargha, A., & Delaney, H. D. (2000). A critique and improvement of the CL common language effect size statistics of
McGraw and Wong. Journal of Educational and Behavioral Statistics, 25, 101–132. doi: 10.2307/1165329
Welch, B. L. (1951). On the comparison of several mean values: An alternative approach. Biometrika, 38, 330–
336.
Wilcox, R. R. (2005). Introduction to robust estimation and hypothesis testing (2nd ed.). San Diego, CA: Elsevier
Academic Press.
Wilkinson, L., & the APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals:
Guidelines and explanations. American Psychologist, 54, 594–604. doi: 10.1037/0003-066X.54.8.594
Yin, P., & Fan, X. (2001). Estimating R2 shrinkage in multiple regression: A comparison of different analytical methods.
The Journal of Experimental Education, 69, 203–224. doi: 10.1080/00220970109600656
Yuen, K. K. (1974). The two-sample trimmed t for unequal population variances. Biometrika, 61, 165–170. doi:
10.2307/2334299
Yuen, K. K., & Dixon, W. J. (1973). The approximate behavior and performance of the two-sample trimmed t. Biometrika,
60, 369–374. doi: 10.2307/2334550
Zhang, Z., & Schoeps, N. (1997). On robust estimation of effect size under semiparametric models. Psychometrika, 62,
201–214. doi: 10.1007/BF02295275
Zientek, L. R., Capraro, M. M., & Capraro, R. M. (2008). Reporting practices in quantitative teacher education research:
One look at the evidence cited in the AERA Panel Report. Educational Researcher, 37, 208–216. doi:
10.3102/0013189X08319762
Zimmerman, D. W., & Zumbo, B. D. (1993). Rank transformations and the power of the Student t test and Welch t′ test
for non-normal populations with unequal variances. Canadian Journal of Experimental Psychology, 47, 523–539.
doi: 10.1037/h0078850
