
Psychological Methods, 2001, Vol. 6, No. 2, 135-146
Copyright 2001 by the American Psychological Association, Inc. 1082-989X/01/$5.00 DOI: 10.1037//1082-989X.6.2.135

Review of Assumptions and Problems in the Appropriate Conceptualization of Effect Size
Robert J. Grissom and John J. Kim
San Francisco State University

Estimation of the effect size parameter, D, the standardized difference between
population means, is sensitive to heterogeneity of variance (heteroscedasticity),
which seems to abound in psychological data. Pooling $s^2$s assumes homoscedasticity,
as do methods for constructing a confidence interval for D, estimating D
from t or analysis of variance results, formulas that adjust estimates for inflation by
main effects or covariates, and the Q statistic. The common language effect size
statistic as an estimate of $\Pr(X_1 > X_2)$, the probability that a randomly sampled
member of Population 1 will outscore a randomly sampled member of Population
2, also assumes normality and homoscedasticity. Various proposed solutions are
reviewed, including measures that do not make these assumptions, such as the
probability of superiority estimate of $\Pr(X_1 > X_2)$. Ways to reconceptualize effect
size when treatments may affect moments such as the variance are also discussed.
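
As a rough illustration of why the choice of standardizer matters under heteroscedasticity, the sketch below computes four candidate estimates of D from the same two samples: the mean difference divided by the control-group s, by the experimental-group s, by the pooled s, and by the square root of the mean of the two sample variances. The function name `standardized_differences` and the toy scores are hypothetical and are not taken from the article; only the Python standard library is assumed.

```python
import math
import statistics


def standardized_differences(experimental, control):
    """Return the mean difference divided by four different standardizers."""
    diff = statistics.mean(experimental) - statistics.mean(control)
    var_e = statistics.variance(experimental)  # sample variance, n - 1 divisor
    var_c = statistics.variance(control)
    n_e, n_c = len(experimental), len(control)
    # Pooled variance weights each group's variance by its degrees of freedom.
    pooled_var = ((n_e - 1) * var_e + (n_c - 1) * var_c) / (n_e + n_c - 2)
    return {
        "control_sd": diff / math.sqrt(var_c),
        "experimental_sd": diff / math.sqrt(var_e),
        "pooled_sd": diff / math.sqrt(pooled_var),
        "root_mean_variance": diff / math.sqrt((var_e + var_c) / 2),
        "variance_ratio": max(var_e, var_c) / min(var_e, var_c),
    }


# Hypothetical scores: the experimental group has the larger mean and variance.
estimates = standardized_differences(experimental=[4, 8, 12], control=[1, 2, 3, 4, 5])
for name, value in estimates.items():
    print(f"{name}: {value:.2f}")
```

With these toy scores the mean difference is fixed at 5, yet the four standardized estimates range from 1.25 to about 3.16, which is the kind of divergence that a sample variance ratio of 6.4 can produce.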

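The two estimators of $\Pr(X_1 > X_2)$ named in the abstract can be sketched in a few lines: the probability of superiority as the proportion of all cross-group pairs in which the Group 1 score is higher ($U/mn$), and the common language statistic as the normal-curve area below $Z_{CL} = (\bar{X}_1 - \bar{X}_2)/(s_1^2 + s_2^2)^{1/2}$. The function names and scores are hypothetical, ties are split equally (one of several possible conventions), and `statistics.NormalDist` from the Python standard library (version 3.8 or later) is assumed.

```python
import math
import statistics
from statistics import NormalDist


def probability_of_superiority(group1, group2):
    """PS = U / (m * n): share of pairwise comparisons in which group1 scores higher."""
    wins = 0.0
    for x1 in group1:
        for x2 in group2:
            if x1 > x2:
                wins += 1.0
            elif x1 == x2:
                wins += 0.5  # assumption: a tie is split equally between groups
    return wins / (len(group1) * len(group2))


def common_language_from_summary(mean1, mean2, var1, var2):
    """CL = area of the standard normal curve below Z_CL."""
    z_cl = (mean1 - mean2) / math.sqrt(var1 + var2)
    return NormalDist().cdf(z_cl)


def common_language(group1, group2):
    """CL computed from raw scores via sample means and variances."""
    return common_language_from_summary(
        statistics.mean(group1), statistics.mean(group2),
        statistics.variance(group1), statistics.variance(group2),
    )


# Hypothetical scores for two independent groups.
treated = [3, 4, 5]
control = [1, 2, 3.5]
print(f"PS = {probability_of_superiority(treated, control):.3f}")
print(f"CL = {common_language(treated, control):.3f}")
```

The PS needs the raw scores, whereas the CL needs only means and variances; that difference is why the CL is the fallback when primary data are not archived.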
Robert J. Grissom and John J. Kim, Department of Psychology, San Francisco State University.
We gratefully acknowledge Rebecca Ray and Julie Gorecki for assistance with the preparation
of an earlier version of this article, and Jeffrey Keyes for assistance with data analysis.
Correspondence concerning this article should be addressed to Robert J. Grissom, 564 Apple
Grove Lane, Santa Barbara, California 93105. Electronic mail may be sent to rgrissom@sfsu.edu.

In univariate research that compares the means of two independent groups, the parameter
that is typically estimated to measure effect size is $D = (\mu_1 - \mu_2)/\sigma$ (Cohen, 1988).
Estimation of D is problematic when there is heterogeneity of variance (heteroscedasticity)
because there is no common $\sigma$, and different estimators of $\sigma$ can lead to very different
estimates of D. Moreover, heteroscedasticity should prompt different conceptions of effect
size. As discussed later, if two treatments produce distributions with different variances
and/or different shapes, measures of effect size other than the standardized difference
between means would be informative. This article describes the problems and reviews
proposed solutions.

Numerous results in samples suggest heteroscedasticity in a wide variety of areas of research
(Grissom, 2000; Keppel, 1991; Lix, Cribbie, & Keselman, 1996; Tomarken & Serlin, 1986;
Wilcox, 1987). When group $\bar{X}$s differ, group $s^2$s often differ in the same direction
(Fisher & van Belle, 1993; Fleiss, 1986; Norusis, 1995; Raudenbush & Bryk, 1987;
Sawilowsky & Blair, 1992; Snedecor & Cochran, 1989). Samples with greater positive skew
often have the larger means and variances. Reaction time, latency, difference-threshold,
and galvanic skin response data provide some examples. In a randomized design,
homoscedasticity is to be expected a priori, but the treatment after randomization can
produce heteroscedasticity as described in Bryk and Raudenbush (1988). However, research
using nonrandomly formed groups may well involve a priori heteroscedasticity (Grissom, 2000).

In educational research, Wilcox (1987) reported that ratios of largest to smallest sample
variances (maximum sample variance ratios [VRs]) exceeding 16 are not uncommon, and VR
values exceeding 400 have been reported (Lix et al., 1996). Maximum VRs of up to 12 are
considered to be realistic by Tomarken and Serlin (1986). In clinical outcome research with
children and adolescents, variances have been found to be significantly different in therapy
and control groups (Weisz, Weiss, Han, Granger, & Morton, 1995). In comparing a systematic
desensitization group with an implosive therapy and control group, Hekmat (1973) found
sample VRs of over 12 and nearly 29, respectively, on the Behavior Avoidance
Test. Inspection of several articles in one recent issue of the Journal of Consulting and
Clinical Psychology yielded sample VR values of 3.24, 4.00 (several), 6.48, 6.67, 7.32, 7.84,
25.00, and 281.79. The latter VR involved skewed distributions of the number of drinks per
day under two different treatments for depression in alcoholism, data that the authors
transformed to normalize (R. A. Brown, Evans, Miller, Burgess, & Mueller, 1997). When
control and therapy conditions for number of posttest panic attacks were compared, VRs of
8.02 and 6.65 were found for control $s^2$/treated $s^2$ and Therapy 1 $s^2$/Therapy 2 $s^2$,
respectively (Feske & Goldstein, 1997). When comparing activity levels of drugged rats with
those of a control group, Conti and Musty (1984) found a maximum sample VR of 6.87 across
dosages of the active component of marijuana.

The magnitudes of reported sample VRs strongly suggest heteroscedasticity. However, because
of the great sampling variability of $s^2$, estimates of population VRs are needed. It would
also be useful to have Monte Carlo studies of the "robustness" to heteroscedasticity of the
many measures of effect size that assume homoscedasticity. However, under heteroscedasticity,
a major issue beyond robustness of traditional measures of effect size is the
conceptualization of appropriate alternative measures of effects of treatments on moments
about the mean. It is hoped that the present review will prompt (a) more cautious
interpretation of current measures and conversion formulas of effect size that assume
homoscedasticity; (b) use of measures of overall effect of treatments on centers, tails,
spreads, and shapes of distributions; and (c) research on the estimation of population VRs.

Choice of Denominator for Effect Size When Comparing Two Means and Assuming Normality

In cases where there is heteroscedasticity and a control group, Glass, McGaw, and Smith
(1981) have proposed using $s_c$, the standard deviation of the control group, to estimate
$\sigma$. In this case the effect size estimator, $d_c = (\bar{X}_e - \bar{X}_c)/s_c$, is estimating
$(\mu_e - \mu_c)/\sigma_c$, an informative parameter under normality that indicates the location
of an average experimental population (e) participant with respect to the control
population's (c) distribution. Unfortunately for the measures of effect size that do so,
assuming normality may not be realistic (Micceri, 1989; Wilcox, 1996).

When estimating $D_e = (\mu_e - \mu_c)/\sigma_e$, using the standard deviation of an experimental
group, $s_e$, instead of $s_c$ can result in a numerically very different estimate as well as
a different definition of effect size. Treatment × Subject interaction or a ceiling or floor
effect could cause heteroscedasticity and, therefore, $D_c \neq D_e$. The problem of widely
varying values of d depending on choice of s exists even if there is equality of population
variances because sample variances are still likely to vary greatly. With df < 20, point
estimates of $\sigma$ can be in error by hundreds of percent (Hoaglin, Mosteller, & Tukey,
1991). In addition, although there is a case for using some standard measure to estimate
$\sigma$, such as $s_c$, one could question whether it is always appropriate to use $\sigma_c$ just
because there is a comparison (control) group. The comparison group may represent the
current standard treatment, but it may not turn out to be essentially different from the
"experimental" treatment.

Under heteroscedasticity, Cohen (1988) suggested estimating the parameter $D'$ that uses the
denominator $\sigma' = [(\sigma_1^2 + \sigma_2^2)/2]^{1/2}$. Huynh's (1989) Monte Carlo simulations
showed that $d' = (\bar{X}_1 - \bar{X}_2)/[(s_1^2 + s_2^2)/2]^{1/2}$, which uses $n_j - 1$ in the
denominator to calculate each $s^2$, is generally a biased estimator of $D'$, but is only
slightly biased when $D' = 1$ and is not at all biased when $D' = 0$. (There is also some
bias when the usual pooled estimator is used.) The estimator $d'$ was also found to be
unstable, the variance of $d'$ increasing with $D'$ and, not surprisingly, decreasing
monotonically with increasing ns. The simulation supported the use of $h = c_f d'$, an
adjusted estimator of $D'$ to correct bias, where
$c_f = \Gamma(f/2)/[(f/2)^{1/2}\,\Gamma((f - 1)/2)]$, $\Gamma$ is the gamma function, and
$f = [(n_1 - 1)(n_2 - 1)(s_1^2 + s_2^2)^2]/[(n_1 - 1)s_1^4 + (n_2 - 1)s_2^4]$. The simulation
assumed normality of $X_1$ and $X_2$. See Huynh, 1989, for a detailed discussion of a method
for addressing the instability of $d'$. However, issues of bias and instability aside, $D'$
is difficult to interpret because it uses the $\sigma$ of a contrived population.

In the case of homoscedasticity, the pooled estimator, $s_p$, is a less biased and less
variable estimator of the common $\sigma$ than is $s_c$, an estimator that is only available
when there is a control group. Thus, in two-group studies that rightly or wrongly assume
homoscedasticity, Hedges' $g = (\bar{X}_1 - \bar{X}_2)/s_p$ (or the version that removes
small-sample bias) is the most widely used estimator of D (Hedges & Olkin, 1984). However,
one could argue that, if $\mu$s differ, homoscedasticity is unlikely and pooling is
inappropriate. The formulas provided by Hedges and Olkin (1985)
for confidence limits for D also assume homoscedasticity.

When comparing two groups by estimating a D in a multigroup study that assumes
homoscedasticity, the commendable goal of attempting to obtain a less biased and less
variable estimate of $\sigma$ results in pooling within-group variances, $s_p = (MS_W)^{1/2}$.
However, some researchers may be assuming homoscedasticity without testing for it, or
having tested for it with one of the relatively low-power tests that are often provided in
statistical packages (e.g., Levene, 1960) or with one of the possibly more powerful but not
very powerful tests (M. B. Brown & Forsythe, 1974; R. G. O'Brien, 1978; Weston & Hopkins,
1998) that still failed to detect heteroscedasticity. Moreover, the amount of
heteroscedasticity that would be troublesome for estimating D is likely much less than what
would be troublesome for F testing in ANOVA. Under heteroscedasticity, the estimator of
$D_p$ based on pooling, $d_p = (\bar{X}_1 - \bar{X}_2)/(MS_W)^{1/2}$, is again problematic because
there is no common $\sigma$ to be estimated (the noncentrality parameter is not defined).

In summary, when homoscedasticity is assumed, the most efficient estimator of the common
$\sigma$ is $s_p$. However, using an $s_p$ that weights the two $s$s (approximately)
proportionally to $n_1$ and $n_2$ is problematic under heteroscedasticity. In the latter case,
one can use the s of whichever group is to be the comparison group to estimate that
population's $\sigma$, or use $s_1$ and $s_2$ to estimate the square root of the mean of
$\sigma_1^2$ and $\sigma_2^2$ to measure $D'$ by using $\sigma'$ (or the mean of $\sigma_1$ and
$\sigma_2$). Finally, under heteroscedasticity, the choice of denominator should depend on the
research context. For example, if the two genders are being compared, it may not be sensible
to use $\sigma_M$ or $\sigma_F$ for the denominator for the purpose of estimating a $D_M$ or
$D_F$. On the other hand, using $\sigma'$ results in a $D'$ in a contrived population whose
$\sigma$ is the average of the male and female populations' $\sigma$s.

Multigroup and Factorial Designs

Two overall estimators of standardized effect size for one-way ANOVA,
$d_{mm} = (\bar{X}_{max} - \bar{X}_{min})/s_p$ and $f = s_{\bar{X}}/s_p$, where $s_{\bar{X}}$ is the
standard deviation of all of the sample means, assume homoscedasticity. The f is related to
estimating $\eta$, the correlation between group membership and the dependent variable
(Winer, Brown, & Michels, 1991), that also assumes homoscedasticity. Maxwell (1998)
presented a promising measure of effect size for a powerful randomized longitudinal design
for comparing two or more groups, but this method also assumes homoscedasticity, and the
sensitivity of this measure to heteroscedasticity also is not known.

When factorial designs are used, various choices for estimating $\sigma$ arise, depending on
which two means are being compared. For example, consider a 2 × 2 design in which one factor
compares treatment and control and the other factor compares men and women. One might want
to estimate D for treatment versus control overall. Alternatively, one might want to
estimate treatment effect sizes separately for men and women as if only the single treatment
factor had been involved. The latter makes sense if there is interaction. However, in a
factorial design in which there is a classification factor, such as gender, and a treatment
factor with respect to which effect sizes are to be estimated as if treatment were the only
factor, a main effect of the classification factor reduces within-cell MS values. Oliver and
Hyde (1995) cite some meta-analyses that have inflated estimates of D by not using formulas
to correct for such reduction of $MS_W$ values for the purpose at hand. Correction formulas
are provided by Glass et al. (1981) and Smith, Glass, and Miller (1980), and are extended to
larger designs by Ray and Shadish (1996). However, the correction formulas assume
homoscedasticity. The correction formulas by Glass and his coworkers that attempt to render
estimates of D from factorial designs that are comparable to estimates from single-factor
designs require SS values that are typically not available in meta-analysis. Morris and
DeShon (1997) provided a modified correction formula that requires values of F and df
instead of SS values, but the modification assumes homoscedasticity. Abelson and Prentice
(1997) presented a measure of effect size for contrasts in two-way designs, but
homoscedasticity was again assumed and sensitivity to heteroscedasticity is not known.

Meta-Analysis: Converting Statistics to Estimates of D and Other Problems of Synthesis

The problem of heteroscedasticity is compounded in meta-analysis. When attempting to cumulate
effect sizes, meta-analysts are often averaging different kinds of estimates of different
kinds of D arising from the underlying primary studies. This is because primary researchers
usually provide the values of test statistics, such as t or F, but do not provide sufficient
information to calculate $d_c$ or $d_p$ directly. When using the primary researcher's t or
two-group F to estimate D by means of $t(1/n_e + 1/n_c)^{1/2}$ or $[F(1/n_e + 1/n_c)]^{1/2}$,
meta-analysts indirectly assume homoscedasticity because these formulas so assume. When
$s_e^2 > s_c^2$, $d_c > d_p$, and when $s_e^2 < s_c^2$, $d_c < d_p$. (Glass et al., 1981, provide
numerical values for the relationship between $s_e^2/s_c^2$ and $d_c/d_p$.) Therefore, whether
attempting to estimate $D_c$ or $D_p$, meta-analysts are often unwittingly averaging values
of d, among some of which $d_p > d_c$ and, among others, $d_p < d_c$. Unfortunately, the
averaging procedure is not likely to result in overestimations canceling underestimations
(Glass et al., 1981). The problem is compounded by the fact that one often has to convert
to, and average, values of $d_p$ from a mix of primary studies that collectively provide a
variety of forms of results, such as $d_p$ itself (rare), t, two-group F, one-way or
factorial ANOVA F, one-way or factorial repeated measures ANOVA F, analysis of covariance F,
change-score t, change-score F, reported p value, and inferable-only p value. Ray and
Shadish (1996) empirically demonstrated that different calculated conversions to $d_p$
arising from the different forms of results reported in primary studies can vary, sometimes
greatly. Also, heteroscedasticity can affect the power and rate of Type I error of the
Hedges and Olkin (1985) Q statistic that is often used to test for the significance of the
variability of the set of estimates of D that arise from the underlying primary studies. See
Harwell's 1997 Monte Carlo study for details on the sensitivity of Q to heteroscedasticity
under varying conditions, within the primary studies, of population VR, distributional
shape, and ns.

When different primary studies underlying a meta-analysis use different measures of the same
latent dependent variable, the primary effect sizes may vary simply because varying
measurement reliabilities result in varying values of $s_c$. The dependent variable can be
corrected for the unreliability that can reduce an uncorrected estimate of D by increasing s
(Becker, 1996; Schmidt & Hunter, 1996). However, the correction yields an estimate of a D
that is theoretically possible but not currently realizable in practice. Different estimates
of primary $D_p$s or $D_e$s can also arise because of varying values of $s_e$ when, in some
primary studies, subjects may show greater variation in responsiveness to a treatment than
in other studies of the same treatment.

It is not surprising that meta-analysts rarely explicitly state the nature of the D
parameter that they are estimating with the various kinds of d values that are being
averaged. Moreover, if the type of estimate of D is confounded with varying substantive
characteristics of the primary studies, interpretation of statistically significant
moderators of effect size will be misleading. Editors of journals would greatly enhance the
quality of primary studies and meta-analyses if they would require that primary studies
include all ns, $\bar{X}$s, and ss. Because of heteroscedasticity, the Publication Manual of
the American Psychological Association (American Psychological Association [APA], 1994)
oversimplifies when it states that estimates of effect size are readily obtainable from
statistics such as t or F with ns or dfs. The APA should consider endorsing the movement
toward requiring researchers to archive data as a condition for research funding and
publication (Eagly, 1997). Finally, because values of s can vary greatly when dfs are small,
varying ns across studies can exacerbate the indeterminacy of $\sigma$ and, therefore, of D.

Some Additional Proposed Solutions to Heteroscedasticity in Two-Group Comparisons

There are various additional proposed solutions to the problem of heteroscedasticity. Some
solutions attempt to improve estimation of D, whereas more radical solutions offer measures
of effect size that are conceptually different from D. A modest suggestion is that
researchers in the two-sample case reduce the problem of heteroscedasticity by using ns > 10
and either equal ns or ns in which neither group contains more than 60% of the total number
of participants (Huynh, 1989; Kraemer, 1983). In research domains in which the dependent
variable is a score on a test that has been normed on a vast sample, such as some clinical
and educational outcome studies, a possible solution to the estimation problem exists. In
this case one can divide $(\bar{X}_e - \bar{X}_n)$ by the $s_n$ of the normative (n) group to
estimate $D_n = (\mu_e - \mu_n)/\sigma_n$ (Kendall & Grove, 1988; Kendall, Marrs-Garcia, Nath, &
Sheldrick, 1999). The use of such a constant s is beneficial because, otherwise, values of
$(\bar{X}_1 - \bar{X}_2)$ may actually vary only slightly among primary studies whose values of
d vary greatly simply because of varying values of s.

In the two-group case Rosenthal (1991) suggested transforming the data to equate variances
before calculating d, but he came to prefer a reconceptualization of effect size as a
point-biserial correlation, $\rho_{pb}$, a measure of effect size in terms of the strength and
direction of the relationship between the grouping
variable and the dependent variable. (A variance-stabilizing transformation of the data may
greatly alter the values of D and possibly their interpretation, but this may be less of a
problem for primary researchers who are using scales that are already contrived.) Trimming
to reduce heteroscedasticity is possible before calculating $r_{pb}$, but this may result in a
value of $r_{pb}$ that estimates an ill-defined parameter (Wilcox, 1994). Although
heteroscedasticity does not necessarily directly influence the value of $\rho_{pb}$, it can
influence the t test that is used to test the significance of $r_{pb}$, and converting d or t
to $r_{pb}$ by the usual formulas assumes homoscedasticity. Also, if heteroscedasticity is
associated with one or more outliers, the latter can greatly affect $\rho_{pb}$. Even slightly
heavy tails in the distribution of the dependent variable scores can greatly affect
$\rho_{pb}$ and its associated confidence interval. Although the formula for $r_{pb}$ should be
corrected for attenuation caused by unequal ns ($r_{pbc}$), this seems to be rarely done in
practice. The attenuation increases with increasing disproportionality of ns and increasing
magnitude of $\rho_{pb}$ (McNemar, 1969). The corrected
$r_{pbc} = a\,r_{pb}/[(a^2 - 1)r_{pb}^2 + 1]^{1/2}$, where $a = [.25/(pq)]^{1/2}$ and p and q are the
proportions of total sample size in each group (Hunter & Schmidt, 1990). When a meta-analyst
averages uncorrected $r_{pb}$s, some of the variability in $r_{pb}$s may arise artifactually
from varying degrees of inequality of sample sizes from primary study to primary study.
Wilcox (1996) critiqued alternative measures of correlation in terms of resistance to
outliers and provided Minitab macros for them. These rs may prove to be useful alternative
robust estimators of correlational effect size measures.

Measures Using Estimators That Are More Robust

We have seen that, if normality and homoscedasticity are assumed, D can appropriately be
estimated by using $\bar{X}_c$, $\bar{X}_e$, and $s_p$. Assuming normality, the usual
interpretation of $D_c$ locates $\mu_e$ as $D_c \times \sigma_c$ units away from $\mu_c$. Thus, if
$\mu_e > \mu_c$ and, for example, $d_c = +1$, it is estimated that the average experimental
population subject scores at the 84th percentile (1 $\sigma_c$ unit above $\mu_c$) of the
control group's population distribution. However, if there is heteroscedasticity associated
with one or more outliers, normality does not hold and such interpretation of the estimate
$d_c$ is invalid. Also, because $\sigma$ can be very sensitive to shape, nonnormality can
greatly affect the value of a D type of effect size, as compellingly demonstrated by Wilcox
and Muska (1999) for heavy-tailed distributions. One of the solutions offered for this
problem of estimation and interpretation is from Hedges and Olkin (1985), who suggested
trimming the highest and lowest scores from the control group, replacing
$(\bar{X}_e - \bar{X}_c)$ with $(Mdn_e - Mdn_c)$, and replacing $s_c$ with the range of the trimmed
data or some other measure of variability.

One such alternative measure of variability is the median absolute deviation from the median
(MAD). A statistic should not only be resistant to outliers, as is MAD, but should also have
relatively small sampling variability to increase power and narrow confidence intervals. In
this latter regard, the biweight standard deviation, $s_{bw}$ (Goldberg & Iglewicz, 1992; Lax,
1985), is superior to MAD. Calculation of $s^2_{bw}$ by hand is laborious. First, calculate
$Y_i = (X_i - Mdn)/[9(MAD)]$ and set $a_i = 1$ if $|Y_i| < 1$ and $a_i = 0$ if $|Y_i| \geq 1$.
Then

\[
s_{bw} = \frac{n^{1/2}\left[\sum a_i (X_i - Mdn)^2 (1 - Y_i^2)^4\right]^{1/2}}
              {\left|\sum a_i (1 - Y_i^2)(1 - 5Y_i^2)\right|} \qquad (1)
\]

Wilcox (1997) presented an S-PLUS software function for calculating biweight midvariances
($s^2_{bw}$). Because raw scores are required to calculate $s^2_{bw}$, primary researchers, but
not meta-analysts who are synthesizing unarchived data, can thus use
$(Mdn_e - Mdn_c)/s_{bwc}$ as an estimate of effect size. Laird and Mosteller (1990) suggested
dividing $(Mdn_e - Mdn_c)$ by .75 (interquartile range of control group) to provide
resistance to outliers while maintaining a denominator that approximates $s_c$. Under
normality, .75 (interquartile range) approximates s. However, the interquartile range is
just one example of an interquantile range. Quantiles may be generally defined as scores
that are equal to or greater than certain specified proportions of the other scores in the
distribution. Results from Shoemaker (1999) indicate that a superior robust measure of
spread may sometimes be obtained by using interquantile ranges of more extreme quantiles
than are used in the interquartile range. These results suggest additional possible
denominators for measuring effect size, but the method may perform poorly under extreme
skew.

Kraemer and Andrews (1982) presented a nonparametric estimator of effect size, but this
estimator requires a pretest-posttest design. Hedges and Olkin (1984, 1985) presented
several nonparametric estimators, one of which, $d^*_c$, does not require homoscedasticity or
pretest scores. Assuming normality, this method estimates $D_c$ by using
$d^*_c = \Phi^{-1}(p_c)$, where $\Phi^{-1}$ is the inverse normal cumulative distribution
function and $p_c$ is the proportion of the control group scores that are below $Mdn_e$.
Because it requires raw scores, this method is of use to primary researchers but, again, not
to meta-analysts who are synthesizing unarchived data. Moreover, the sampling distribution
of this estimator is not known, so significance testing and construction of confidence
intervals are not developed.

The methods reviewed in this section address heteroscedasticity and nonnormality as problems
to be circumvented when estimating a D rather than as characteristics of data that can be
accommodated by new conceptual approaches to measuring effect size. All of the suggested
formulas are estimating parameters that are conceptually similar to D. Replacing $\sigma$
with a more resistant measure of scale is only a computational solution. A more radical
solution would involve reconceptualization of effect size as a measure of the effect of a
treatment throughout a distribution, not just at its center. Heteroscedasticity would then
no longer be a mere nuisance, as it is when attempting to measure effect size in the
traditional manner, but a utilized reflection of a treatment's effect on moments. The next
two sections address this issue.

Robust Estimation of $\Pr(X_1 > X_2)$

An intuitively appealing measure of effect size in the two-group case is the probability
that a randomly sampled member of a population given one treatment will have a score
($X_1$) that is superior to the score ($X_2$) of a randomly sampled member of a population
given another treatment. A theoretically unbiased and consistent estimator of this
probability has been called the probability of superiority (PS; Grissom, 1994a, 1994b,
1996). Theoretically, the PS has the smallest variance of all unbiased estimators of
$\Pr(X_1 > X_2)$. The PS can be based on raw data that are continuous, ordinal, or
ordinal-categorical (Grissom, 1994a, 1994b). The applicability of the PS estimator to
ordinal data is an advantage in much psychological research, in which dependent variables
are often likely to be monotonically, but not necessarily linearly, related to underlying
latent variables. Monotonic transformations leave $\Pr(X_1 > X_2)$ invariant. The
$PS = U/mn$, where U is the Mann-Whitney statistic and m and n are the sample sizes. The
value of U indicates the number of times that the m subjects given Treatment 1 have scores
that outrank those of the n subjects given Treatment 2, assuming no ties or equal allocation
of ties, in all possible comparisons of $X_1$ and $X_2$ values. There are mn such head-to-head
comparisons possible. Therefore, the proportion of times that subjects given Treatment 1 are
superior in their scores to subjects given Treatment 2 is $U/mn$. For example, if .7 of the
comparisons of all treated subjects with all control subjects result in better scores being
observed in the treated group, PS = .7. A proportion in a sample estimates a probability in
a population. Estimating $\Pr(X_1 > X_2)$ from the PS is robust to nonnormality. Wilcox
presented a Minitab macro (Wilcox, 1996) and S-PLUS software functions (Wilcox, 1997) for
constructing a confidence interval for $\Pr(X_1 > X_2)$ based on Fligner and Policello's
(1981) adjusted $U'$ statistic and on a method by Mee (1990) that appears to yield fairly
accurate confidence levels.

The Mann-Whitney U test is nonrobust to heteroscedasticity when testing for the equality of
centers (Murphy, 1976; Pratt, 1964; Zimmerman & Zumbo, 1993) because values of U are
functions of higher moments as well as means. However, sensitivity to the effect of a
treatment on variance and on higher moments may be considered to be an advantage for a
broader measure of effect size. In this context we are not testing $H_0$: $Mdn_1 = Mdn_2$
against $H_1$: $Mdn_1 \neq Mdn_2$ but a broader $H_0$: $\Pr(X_1 > X_2) = .5$ against $H_1$:
$\Pr(X_1 > X_2) \neq .5$. The alternative $H_1$: $Mdn_1 \neq Mdn_2$ (any other location parameter
could have been used) is an example of a special-case model, a shift model in which a
treatment merely adds a constant to what the score of each participant would have been
without the treatment. In this model, homoscedasticity is assumed because adding a constant
to each score merely shifts a distribution without changing its spread. However, the
alternative $H_1$: $\Pr(X_1 > X_2) \neq .5$, tested against $H_0$: $\Pr(X_1 > X_2) = .5$, is the
more general case of stochastic superiority of one treatment over another. Under this $H_1$,
the effect of a treatment need not be constant, so that the treatment may affect spread and,
therefore, homoscedasticity need not be assumed. This general case is perhaps more realistic
because treatments may well affect spread, and possibly shape, as well as the center of a
distribution. Using the PS to test $H_0$: $\Pr(X_1 > X_2) = .5$ against $H_1$:
$\Pr(X_1 > X_2) \neq .5$, or against a one-tailed alternative, is a consistent test in the
sense that its power approaches unity as sample sizes approach infinity.

When raw data are not available, the parameter
$\Pr(X_1 > X_2)$ can be estimated by using the common language effect size statistic (CL;
McGraw & Wong, 1992), which is calculated from sample means and variances. The CL is based
on a z score, $Z_{CL} = (\bar{X}_1 - \bar{X}_2)/(s_1^2 + s_2^2)^{1/2}$. The value of
$\Pr(X_1 > X_2)$ is estimated by the proportion of the area of the normal curve below
$Z_{CL}$. For example, if $Z_{CL} = +1.00$ or $-1.00$, CL = .84 or .16, respectively. The value
of $\Pr(X_1 > X_2)$ is estimated to be .84 or .16 in these cases. This CL method assumes
normality of $X_1$ and $X_2$, but it may not seem to some readers to require homoscedasticity
because the variance of $(X_1 - X_2)$ would be
$\sigma^2_{X_1 - X_2} = (\sigma^2_{X_1} + \sigma^2_{X_2})$ regardless of the values of $\sigma_1^2$
and $\sigma_2^2$. However, in theory, CL only estimates $\Pr(X_1 > X_2)$ under normality and
homoscedasticity (Pratt & Gibbons, 1981). McGraw and Wong (1992), who introduced the CL to
psychologists, assumed homoscedasticity and reported Monte Carlo simulations that indicated
fairly good performance in the face of nonnormality but somewhat poorer performance under
joint nonnormality and heteroscedasticity. In theory, the CL is not quite an unbiased
estimator unless it is adjusted (Pratt & Gibbons, 1981).

Various statistics and estimates of effect size can be converted to approximate values of
$Z_{CL}$ (Grissom, 1994a). For example, if $n_1 = n_2$, $Z_{CL} = t/n^{1/2}$, $Z_{CL} = .707d$, and
$Z_{CL} = r_{pb}[2(n - 1)]^{1/2}/[n(1 - r_{pb}^2)]^{1/2}$. For conversion formulas when
$n_1 \neq n_2$, see Grissom, 1994a. All of these conversion formulas assume homoscedasticity
and normality, and their sensitivity to violation of these assumptions is unknown.

Estimates of $\Pr(X_1 > X_2)$ have been used in a meta-analysis (Mosteller & Chalmers, 1992),
as well as in a meta-meta-analysis that was compelled to rely on the previously mentioned
possibly nonrobust conversion formulas for $Z_{CL}$ because of the unavailability of raw data
(Grissom, 1996). Again, the archiving of raw data would benefit meta-analyses that attempt
to estimate $\Pr(X_1 > X_2)$. For a meta-analysis using PS, the mean of the values of PS
should be obtained by weighting each PS by the reciprocal of its variance. For the variance,
use $(1/12)[(1/m) + (1/n) + (1/mn)]$. For an alternative approach, see Colditz, Miller, and
Mosteller (1988).

In the two-sample case, $\Pr(X_1 > X_2)$ is a useful measure of effect size, but this measure
may not be transitive in the k > 2 case. Group 1 may score sto-

and . . . ), see McGraw and Wong (1992). Also see Whitney's (1951) extension of the U test
for k > 2 and a discussion and tables of critical values for this extension in Mosteller and
Bush (1954). For more discussion of $\Pr(X_1 > X_2)$ and its estimators, see Lehmann (1975),
Laird and Mosteller (1990), Pratt and Gibbons (1981), and Vargha and Delaney (1998). Pratt
and Gibbons (1981) discuss various methods for dealing with tied scores.

Empirical Comparisons of PS and CL

Wolfe and Hogg (1971) asserted that the estimates from PS and CL do not differ "too much" in
practice. To test this assertion with real and then simulated data, we first searched two
published manuals of actual data to find sets of data that compared two independent groups
with respect to some psychologically relevant dependent variable. Eleven such sets were
found, and PS and CL were used to calculate estimates of $\Pr(X_1 > X_2)$ for each set. In the
four least discrepant of the 11 pairs of estimates from the PS and the CL, the larger
estimate in each case was less than 3% larger than the smaller of the two estimates.
However, in the three comparisons that resulted in the greatest discrepancies, the larger
estimate was 12.1%, 15.8%, or 15.9% larger than the smaller estimate.

Next, we inspected preliminary results from Monte Carlo comparisons of the behaviors of the
PS and the CL. Results as of this writing are based on $n_1 = n_2$, normality, population VRs
ranging from 1 to 8, and 25,000 samples for each simulated condition. Generally, when
$\mu_1 = \mu_2$, the means of the sampling distributions of the PS and the CL are very similar
and the correlation between the two sets of estimates is well over .9. However, as the
difference between $\mu_1$ and $\mu_2$ increases, this correlation sometimes decreases to a
value as low as approximately .2, and the CL tends to exhibit more sampling error than does
the PS. Even when assuming normality and homoscedasticity, primary researchers who are
estimating $\Pr(X_1 > X_2)$ should consider reporting both the CL and the PS.

Multiple-Valued Measures

Measures of effect size for comparing two groups can be misleading when there is
heteroscedasticity, different shapes of distributions, or both. For ex-
chastically higher than Group 2, and Group 2 may ample, suppose that a treatment makes some experi-
score higher than Group 3, but it is possible in this mental subjects perform better and some perform
case that Pr(X3 > X,) > .5. For an approximate CL worse than they would have had they been in the
approach to estimating Pr(X} > X2 and X, > X3 comparison group. In this case, the experimental
group's variability will be greater or less than that of the comparison group, depending on whether it is the better or poorer performing subjects who are helped or harmed by the experimental treatment, but the two means or medians may be nearly the same. In this case, using $(\bar{X}_e - \bar{X}_c)$ or $(Mdn_e - Mdn_c)$ in the numerator yields an estimated effect size near zero, but the treatment clearly has had an effect in the tails, if not in the center, of the distribution. In another case, the two centers and the two variabilities may differ. For example, if the more variable group has the higher mean (not uncommon), the proportions of its members among the high scorers and among the low scorers can be different than what is implied by the estimate of $D$. Consider a case in which, compared with controls, experimental participants have a small superiority, say, $D = +.3$ (using the $\sigma$ of the combined distribution), and a $\sigma_e^2$ that is only 15% greater than that of the control group. Even in this unexceptional case, if normality is assumed, approximately 2.5 times as many treated as control subjects would be in the highest 5% of the combined distribution (Hedges & Nowell, 1995). For further discussion and examples see Feingold (1992b, 1995) and P. C. O'Brien (1988). The kinds of problems that have been discussed can arise even under homoscedasticity if there is nonhomomerity (inequality of shapes of distributions).

Informative methods have been proposed for expressing group differences at different portions of a distribution, but at the cost of greater complexity. For example, Hedges and Friedman (1993), assuming normality for both populations, defined effect size in a portion of a tail beyond a fixed value, $X_\alpha$, in a distribution of scores combined from Group 1 and Group 2. This $D_\alpha$ is the difference between $\mu_{\alpha 1}$ and $\mu_{\alpha 2}$, for those scores that are beyond $X_\alpha$, divided by $\sigma$ for those scores, $\sigma_\alpha$. The value of $X_\alpha$ is chosen as a score that is exceeded by some $k$% of the scores in the combined distribution. Thus, if $X_\alpha$ is the score at the $100\alpha$ percentile point of the combined distribution, then the effect in the tail beyond $X_\alpha$ is $D_\alpha = (\mu_{\alpha 1} - \mu_{\alpha 2})/\sigma_\alpha$. Extensive computational details are found in the appendix of Hedges and Friedman (1993). The computations of the estimates of $D_\alpha$ are repeated for various values of $k$ that are of interest to the researcher. Note that when the overall $\mu_1 \neq \mu_2$, the combined distribution departs from normality.

Wilcox (1995, 1996, 1997) presented references for a variety of graphical methods and illustrated a method, based on the work of Doksum (1977), for comparing two groups separately at various quantiles. This method results in a graph of "shift functions" that indicates whether a treatment becomes more or less effective as we observe from the poorer performing to the better performing of the control group's members. Each shift function indicates how far the comparison group has to be shifted to attain the outcome of the treated group at each quantile of interest. Quantiles of the control group's scores at their various $q$th quantiles, $X_{qc}$ values, are plotted against the differences between $X_{qc}$ and $X_{qe}$, the quantiles of the experimental group's scores, defining the shift function, $\Delta X_{qc} = X_{qe} - X_{qc}$. It is more informative to compare distributions at various quantiles, such as deciles, than to compare them only at their centers, such as their means or medians (.50 quantile, 5th decile). Wilcox (1996) provided a Minitab macro for estimating shift functions and another for constructing confidence intervals for the difference between the two groups' deciles at various deciles throughout the comparison group's distribution. Wilcox (1997) also provided S-PLUS software functions for making robust inferences about shift functions and for constructing robust simultaneous confidence intervals for them. Lunneborg (1986) provided discussion and a method for constructing a confidence interval for the difference between corresponding quantiles of two distributions. A general definition of quantiles suits the purpose of the present article. However, statistical packages use a variety of definitions of sample quantiles (Hyndman & Fan, 1996). For an introduction to quantiles, see Hoaglin, Mosteller, and Tukey (1985). See Dielman, Lowry, and Pfaffenberger (1994) for results of Monte Carlo studies of the behavior of 10 supposedly distribution-free estimators of quantiles for a variety of distributions. For discussion of distribution-free univariate and multivariate methods for comparing the probability density functions of two groups, see Silverman (1986) and Izenman (1991).

Descriptive graphical methods for depicting differences between two distributions at levels in addition to their centers are the Wilk and Gnanadesikan (1968) percentile comparison graph, the Tukey (1984) sum-difference graph, both discussed by Cleveland (1985), and the ordinal dominance curve (Darlington, 1973) that is similar to the percentile comparison graph. Development of additional methods for graphic depiction of effect sizes may promote the greater use of measures of effect size. A simple example is the depiction of two boxplots, one above the other, using the Minitab STACK command.
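
The shift function described in this section, the difference between the experimental and control groups' scores at each quantile of interest, can be sketched in a few lines. This is only an illustrative sketch, not the Minitab or S-PLUS software of Wilcox (1996, 1997): the function name `shift_function`, the simulated scores, and the use of NumPy's default sample-quantile definition are all assumptions made here, and other quantile definitions (Hyndman & Fan, 1996) would give somewhat different values. No confidence intervals are computed.

```python
import numpy as np


def shift_function(control, treated, probs=None):
    """Quantile-by-quantile differences between two independent groups.

    Returns the probabilities q, the control-group quantiles X_qc, and the
    differences X_qe - X_qc (treated minus control) at each q.
    The name and signature are hypothetical, chosen for this illustration.
    """
    if probs is None:
        probs = np.arange(1, 10) / 10.0  # the nine deciles, .1 through .9
    probs = np.asarray(probs, dtype=float)
    # np.quantile interpolates linearly by default; other sample-quantile
    # definitions would give slightly different values.
    x_qc = np.quantile(np.asarray(control, dtype=float), probs)
    x_qe = np.quantile(np.asarray(treated, dtype=float), probs)
    return probs, x_qc, x_qe - x_qc


if __name__ == "__main__":
    rng = np.random.default_rng(0)  # simulated scores, for illustration only
    control = rng.normal(loc=50.0, scale=10.0, size=500)
    treated = rng.normal(loc=50.0, scale=20.0, size=500)  # same center, more spread
    probs, x_qc, delta = shift_function(control, treated)
    for q, xc, d in zip(probs, x_qc, delta):
        print(f"q = {q:.1f}   X_qc = {xc:6.2f}   X_qe - X_qc = {d:+6.2f}")
```

With two groups that share a center but differ in spread, as simulated above, the differences tend to be negative at the low deciles and positive at the high deciles while staying near zero at the median, which is the pattern that a comparison of means or medians alone would miss.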

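The tail illustration attributed to Hedges and Nowell (1995) earlier in this section ($D = +.3$ with an experimental variance 15% greater than the control variance) can be explored numerically. The sketch below rests on assumptions that are not stated in the text: equal group sizes, a control standard deviation of 1, and "the $\sigma$ of the combined distribution" read as the standard deviation of the equal-weight mixture of the two normal populations. The function names are hypothetical, and the result should be treated as a rough check rather than a reproduction of the original computation.

```python
import math


def normal_sf(x, mean=0.0, sd=1.0):
    """Upper-tail probability Pr(X > x) for a normal distribution."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2.0)))


def tail_ratio(d_combined, variance_ratio, top_fraction=0.05):
    """Ratio of treated to control proportions above the cutoff that marks
    the highest `top_fraction` of the combined distribution.

    Assumes equal group sizes, normality, and control SD = 1 (all assumptions
    of this sketch). `d_combined` is the mean difference divided by the SD of
    the combined (mixture) distribution; `variance_ratio` is the treated
    variance divided by the control variance.
    """
    sd_c = 1.0
    sd_e = math.sqrt(variance_ratio)
    mean_var = (sd_c ** 2 + sd_e ** 2) / 2.0
    # Mixture variance = mean_var + (delta / 2)^2, so solve
    # d = delta / sqrt(mean_var + delta^2 / 4) for the raw mean difference.
    delta = d_combined * math.sqrt(mean_var / (1.0 - d_combined ** 2 / 4.0))

    def mixture_sf(x):
        return 0.5 * normal_sf(x, 0.0, sd_c) + 0.5 * normal_sf(x, delta, sd_e)

    # Bisection for the cutoff exceeded by `top_fraction` of the mixture.
    lo, hi = -10.0, delta + 10.0 * sd_e
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if mixture_sf(mid) > top_fraction:
            lo = mid
        else:
            hi = mid
    cut = (lo + hi) / 2.0
    return normal_sf(cut, delta, sd_e) / normal_sf(cut, 0.0, sd_c)


if __name__ == "__main__":
    print(f"treated/control ratio in the top 5%: {tail_ratio(0.3, 1.15):.2f}")
```

The printed ratio can be compared with the figure of approximately 2.5 quoted in the text; setting `variance_ratio` to 1 shows how much of the imbalance in the tail is attributable to the difference in variances rather than to the difference in means.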
Postscripts

There is an embarrassment of riches in the variety of solutions to the problem of heteroscedasticity in significance testing (Grissom, 2000; Wilcox, 1996, 1997). Therefore, future primary studies may become a mix of reports of means, transformed means, trimmed means, medians, various other resistant measures of location, shift-functions, CLs, and point-biserial and other measures of correlation. If so, some of these studies' tests of significance may be better at controlling rates of Type I and Type II error under heteroscedasticity by avoiding the use of means, as encouraged by Wilcox (1996, 1997), but the effect size synthesizing problems of meta-analysts will be greatly exacerbated.

Workers in each area of research should learn about the degree of heteroscedasticity typically arising from its types of participants, measures, and treatments. Therefore, estimation of population VRs should be encouraged. In addition, development of a method for constructing simultaneous confidence intervals for population VRs that is robust to nonnormality would be very useful. Descriptive meta-analyses of sample VRs may be conducted (Feingold, 1992a). However, one should not use the arithmetic means of $s_a^2/s_b^2$ directly because, even if there is equality of population variances, variability of sample variances will cause the mean of the ratio $s_a^2/s_b^2$ to be equal to or greater than 1. For example, suppose that $\sigma_a^2 = \sigma_b^2$ and, because of sampling variability, say, $s_a^2 = 2s_b^2$ and $s_b^2 = 2s_a^2$ equally often across primary studies, yielding primary VR values of 2/1 = 2.0 and 1/2 = .5, respectively, equally often. The arithmetic mean of .5 and 2 is not 1, but 1.25. Feingold (1992a) and Shaffer (1992) discussed the proper method of cumulation of VRs through the use of median VR or mean ratio of logarithms of variances.

Sampling distributions for additional heteroscedastic estimators of effect size that do not assume normality are also needed so that significance of estimates can be tested and confidence intervals can be constructed, as they can be for the $\Pr(X_1 > X_2)$ measure. Monte Carlo simulations show that coverage for lower confidence bounds for $\Pr(X_1 > X_2)$ is generally very close to the nominal .95 level when the PS estimator is used (Mee, 1990).

References

Abelson, R. P., & Prentice, D. A. (1997). Contrast tests of interaction hypotheses. Psychological Methods, 2, 315-328.
American Psychological Association. (1994). Publication manual of the American Psychological Association (4th ed.). Washington, DC: Author.
Becker, G. (1996). Bias in the assessment of gender differences. American Psychologist, 51, 154-155.
Brown, M. B., & Forsythe, A. B. (1974). Robust tests for equality of variances. Journal of the American Statistical Association, 69, 364-367.
Brown, R. A., Evans, D. M., Miller, I. W., Burgess, E. S., & Mueller, T. I. (1997). Cognitive-behavioral treatment for depression in alcoholism. Journal of Consulting and Clinical Psychology, 65, 715-726.
Bryk, A. S., & Raudenbush, S. W. (1988). Heterogeneity of variance in experimental studies: A challenge to conventional interpretations. Psychological Bulletin, 104, 396-404.
Cleveland, W. S. (1985). The elements of graphing data. Monterey, CA: Wadsworth.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New York: Academic Press.
Colditz, G. A., Miller, J. N., & Mosteller, F. (1988). Measuring gain in the evaluation of medical technology: The probability of a better outcome. International Journal of Technology Assessment in Health Care, 4, 637-642.
Conti, L., & Musty, R. E. (1984). The effects of delta-9-tetrahydrocannabinol injections to the nucleus accumbens on the locomotor activity of rats. In S. Agurell, W. L. Dewey, & R. E. White (Eds.), The cannabinoids: Chemical, pharmacologic, and therapeutic aspects. New York: Academic Press.
Darlington, M. L. (1973). Comparing two groups by simple graphs. Psychological Bulletin, 79, 110-116.
Dielman, T., Lowry, C., & Pfaffenberger, R. (1994). A comparison of quantile estimators. Communications in Statistics B: Simulation and Computation, 23, 355-371.
Doksum, K. A. (1977). Some graphical methods in statistics: A review and some extensions. Statistica Neerlandica, 31, 53-68.
Eagly, A. (1997). Data archiving in psychology. Science Agenda, 10, 14.
Feingold, A. (1992a). Cumulation of variance ratios. Review of Educational Research, 62, 433-434.
Feingold, A. (1992b). Sex differences in variability in intellectual abilities: A new look at an old controversy. Review of Educational Research, 62, 61-84.
Feingold, A. (1995). The additive effects of differences in central tendency and variability are important in comparisons between groups. American Psychologist, 50, 5-13.
Feske, U., & Goldstein, A. J. (1997). Eye movement desensitization and reprocessing treatment for panic disorder: A controlled outcome and partial dismantling study.
Journal of Consulting and Clinical Psychology, 65, 1026-1035.
Fisher, L. D., & van Belle, G. (1993). Biostatistics: A methodology for the health sciences. New York: Wiley.
Fleiss, J. L. (1986). The design and analysis of clinical experiments. New York: Wiley.
Fligner, M. A., & Policello II, G. E. (1981). Robust rank procedures for the Behrens-Fisher problem. Journal of the American Statistical Association, 76, 162-168.
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Thousand Oaks, CA: Sage.
Goldberg, K. M., & Iglewicz, B. (1992). Bivariate extensions of the boxplot. Technometrics, 34, 307-320.
Grissom, R. J. (1994a). Probability of the superior outcome of one treatment over another. Journal of Applied Psychology, 79, 314-316.
Grissom, R. J. (1994b). Statistical analysis of ordinal categorical status after therapies. Journal of Consulting and Clinical Psychology, 62, 281-284.
Grissom, R. J. (1996). The magical number .7 ± .2: Meta-meta-analysis of the probability of superior outcome in comparisons involving therapy, placebo, and control. Journal of Consulting and Clinical Psychology, 64, 973-982.
Grissom, R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting and Clinical Psychology, 68, 155-165.
Harwell, M. (1997). An empirical study of Hedges's homogeneity test. Psychological Methods, 2, 219-231.
Hedges, L. V., & Friedman, L. (1993). Gender differences in variability in intellectual abilities: A reanalysis of Feingold's results. Review of Educational Research, 63, 94-105.
Hedges, L. V., & Nowell, A. (1995). Sex differences in mental test scores, variability, and numbers of high-scoring individuals. Science, 269, 41-45.
Hedges, L. V., & Olkin, I. (1984). Nonparametric estimators of effect size in meta-analysis. Psychological Bulletin, 96, 573-580.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. San Diego, CA: Academic Press.
Hekmat, H. (1973). Systematic versus semantic desensitization and implosive therapy: A comparative study. Journal of Consulting and Clinical Psychology, 40, 202-209.
Hoaglin, D. C., Mosteller, F., & Tukey, J. W. (Eds.). (1985). Exploring data tables, trends, and shapes. New York: Wiley.
Hoaglin, D. C., Mosteller, F., & Tukey, J. W. (Eds.). (1991). Fundamentals of exploring analysis of variance. New York: Wiley.
Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis. Newbury Park, CA: Sage.
Huynh, C. L. (1989, March). A unified approach to the estimation of effect size in meta-analysis. Paper presented at the Annual Meeting of the American Educational Research Association, San Francisco. (ERIC Document Reproduction Service No. ED 306 248).
Hyndman, R. J., & Fan, Y. (1996). Sample quantiles in statistical packages. The American Statistician, 50, 361-365.
Izenman, A. J. (1991). Recent developments in nonparametric density estimation. Journal of the American Statistical Association, 86, 205-224.
Kendall, P. C., & Grove, W. M. (1988). Normative comparisons in therapy outcome. Behavioral Assessment, 10, 147-158.
Kendall, P. C., Marss-Garcia, A., Nath, S. R., & Sheldrick, R. C. (1999). Normative comparisons for the evaluation of clinical significance. Journal of Consulting and Clinical Psychology, 67, 285-299.
Keppel, G. (1991). Design and analysis: A researcher's handbook (3rd ed.). Englewood Cliffs, NJ: Prentice-Hall.
Kraemer, H. C. (1983). Theory of estimation and testing of effect sizes: Use in meta-analysis. Journal of Educational Statistics, 8, 93-101.
Kraemer, H. C., & Andrews, G. (1982). A non-parametric technique for meta-analysis effect size calculation. Psychological Bulletin, 91, 404-412.
Laird, N. M., & Mosteller, F. (1990). Some statistical methods for combining experimental results. International Journal of Technology Assessment in Health Care, 6, 5-30.
Lax, D. A. (1985). Robust estimators of scale: Finite sample performance in long-tailed symmetric distributions. Journal of the American Statistical Association, 80, 736-741.
Lehmann, E. L. (1975). Nonparametrics. San Francisco: Holden-Day.
Levene, H. (1960). Robust tests for equality of variances. In I. Olkin (Ed.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 278-292). Stanford, CA: Stanford University Press.
Lix, L. M., Cribbie, R., & Keselman, H. J. (1996, June). The analysis of completely randomized univariate designs. Paper presented at the annual meeting of the Psychometric Society, Banff, Alberta, Canada.
Lunneborg, C. E. (1986). Confidence intervals for a quantile contrast: Application of the bootstrap. Journal of Applied Psychology, 71, 451-456.
Maxwell, S. E. (1998). Longitudinal designs in randomized group comparisons: When will intermediate observations increase statistical power? Psychological Methods, 3, 275-290.
McGraw, K. O., & Wong, S. P. (1992). A common language effect size statistic. Psychological Bulletin, 111, 361-365.
McNemar, Q. (1969). Psychological statistics (4th ed.). New York: Wiley.
Mee, R. W. (1990). Confidence intervals for probabilities and tolerance regions based on a generalization of the Mann-Whitney statistic. Journal of the American Statistical Association, 85, 793-800.
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105, 156-166.
Morris, S. B., & DeShon, R. P. (1997). Correcting effect sizes computed from factorial analysis of variance for use in meta-analysis. Psychological Methods, 2, 192-199.
Mosteller, F., & Bush, R. R. (1954). Selected quantitative techniques. In G. Lindzey & E. Aronson (Eds.), Handbook of social psychology (Vol. 1, pp. 289-334). Cambridge, MA: Addison-Wesley.
Mosteller, F., & Chalmers, T. C. (1992). Some progress and problems in meta-analysis of clinical trials. Statistical Science, 7, 227-236.
Murphy, B. P. (1976). Comparison of some two sample means tests by simulation. Communications in Statistics B: Simulation and Computation, 5, 23-32.
Norusis, M. J. (1995). SPSS 6.1 guide to data analysis. Englewood Cliffs, NJ: Prentice Hall.
O'Brien, P. C. (1988). Comparing two samples: Extensions of the t, rank-sum, and log-rank tests. Journal of the American Statistical Association, 83, 52-61.
O'Brien, R. G. (1978). Robust techniques for testing heterogeneity of variance. Psychometrika, 43, 327-342.
Oliver, M. B., & Hyde, J. S. (1995). Gender differences in attitudes toward homosexuality: A reply to Whitely and Kite. Psychological Bulletin, 117, 155-158.
Pratt, J. W. (1964). Obustness [sic] of some procedures for the two-sample location problem. American Statistical Association Journal, 59, 665-680.
Pratt, J. W., & Gibbons, J. D. (1981). Concepts of nonparametric theory. New York: Springer-Verlag.
Raudenbush, S. W., & Bryk, A. S. (1987). Examining correlates of diversity. Journal of Educational Statistics, 12, 241-269.
Ray, J. W., & Shadish, W. R. (1996). How interchangeable are different measures of effect size? Journal of Consulting and Clinical Psychology, 64, 1316-1325.
Rosenthal, R. (1991). Meta-analytic procedures for social research. Newbury Park, CA: Sage.
Sawilowsky, S. S., & Blair, R. C. (1992). A more realistic look at the robustness and Type II error properties of the t test to departures from population normality. Psychological Bulletin, 111, 352-360.
Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1, 199-223.
Shaffer, J. P. (1992). Caution on the use of variance ratios: A comment. Review of Educational Research, 62, 429-432.
Shoemaker, L. H. (1999). Interquantile tests for dispersion in skewed distributions. Communications in Statistics B: Simulation and Computation, 28, 189-205.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. New York: Chapman & Hall.
Smith, M. L., Glass, G. V., & Miller, T. I. (1980). The benefits of psychotherapy. Baltimore: Johns Hopkins University Press.
Snedecor, G. W., & Cochran, W. G. (1989). Statistical methods (8th ed.). Ames: Iowa State University.
Tomarken, A. J., & Serlin, R. C. (1986). Comparison of ANOVA alternatives under variance heterogeneity and specific noncentrality structures. Psychological Bulletin, 99, 90-99.
Tukey, J. W. (1984). The collected works of John W. Tukey. Monterey, CA: Wadsworth.
Vargha, A., & Delaney, H. D. (1998). The Kruskal-Wallis test and stochastic homogeneity. Journal of Educational and Behavioral Statistics, 23, 170-192.
Weisz, J. R., Weiss, B., Han, S. S., Granger, D. A., & Morton, T. (1995). Effects of psychotherapy with children and adolescents revisited: A meta-analysis of treatment outcome studies. Psychological Bulletin, 117, 450-468.
Weston, T., & Hopkins, K. D. (1998). Testing for homogeneity of variance: An evaluation of current practice. Unpublished manuscript, University of Colorado at Boulder.
Whitney, D. R. (1951). A bivariate extension of the U statistic. Annals of Mathematical Statistics, 22, 274-282.
Wilcox, R. R. (1987). New designs in analysis of variance. Annual Review of Psychology, 38, 29-60.
Wilcox, R. R. (1994). The percentage bend correlation coefficient. Psychometrika, 59, 601-616.
Wilcox, R. R. (1995). Comparing two independent groups via multiple quantiles. The Statistician, 44, 91-99.
Wilcox, R. R. (1996). Statistics for the social sciences. San Diego, CA: Academic Press.
Wilcox, R. R. (1997). Introduction to robust estimation and hypothesis testing. San Diego, CA: Academic Press.
Wilcox, R. R., & Muska, J. (1999). Measuring effect size: A non-parametric analogue of $\omega^2$. British Journal of Mathematical and Statistical Psychology, 52, 93-110.
Wilk, M. B., & Gnanadesikan, R. (1968). Probability plotting methods for the analysis of data. Biometrika, 55, 1-17.
Winer, B. J., Brown, D. R., & Michels, K. M. (1991). Statistical principles in experimental design (3rd ed.). New York: McGraw-Hill.
Wolfe, D. A., & Hogg, R. V. (1971). On constructing statistics and reporting data. The American Statistician, 25, 27-30.
Zimmerman, D. M., & Zumbo, B. D. (1993). Rank transformations and the power of the Student t test and Welch t' test for non-normal populations with unequal variances. Canadian Journal of Experimental Psychology, 47, 523-539.

Received January 20, 1999
Revision received August 28, 2000
Accepted December 21, 2000
