
CHILD DEVELOPMENT PERSPECTIVES

What Are Effect Sizes and Why Do We Need Them?


Larry V. Hedges
Northwestern University

ABSTRACT: Effect sizes are quantitative indexes of the relations between variables found in research studies. They can provide a broadly understandable summary of research findings that can be used to compare different studies or summarize results across studies. Unlike statistical significance (p values), effect sizes represent strength of relationships without regard to sample size. Three families of effect sizes are widely used: the standardized mean difference family, the standardized regression coefficient family, and the odds ratio family.

KEYWORDS: effect size; p values; statistics; meta-analysis; statistical significance

Correspondence concerning this article should be addressed to Larry V. Hedges, Department of Statistics, Northwestern University, 2040 N. Sheridan Road, Evanston, IL 60208; e-mail: l-hedges@northwestern.edu.
© 2008, Copyright the Author(s)
Journal Compilation © 2008, Society for Research in Child Development

Most scientific studies attempt to estimate the relation between variables of interest. Experimental studies focus on causal relations between discrete treatment variables and an outcome, estimated using random assignment of units to treatments. Nonexperimental studies often focus on relations between measured (discrete or continuous) variables, often controlling for the effects of other (confounding) variables.

In either kind of study, statistical methods are usually used to assess the strength, precision, and statistical reliability of the relation, information that is usually included when reporting a study's results.

WHAT ARE EFFECT SIZES?

Effect sizes are quantitative indexes of relations among variables. Although the term effect size has been used to refer to several more specific indexes, the term now generally refers to any index of relation between variables. Here, we use the term in that more general sense to refer to any index of relation between variables. In a sense, an effect size describes the degree to which the null hypothesis of no relation between variables is false. Although there are many different effect size indexes, there are reasons to prefer some indexes over others; thus, some effect sizes are better than others. There are many different possible effect sizes, including the difference between treatment and control group means divided by the standard deviation (Cohen's d), the correlation coefficient between the independent variable and the outcome, and the difference in proportions of individuals experiencing a particular outcome.

WHY DO WE NEED EFFECT SIZES?

We report results in science to communicate the findings to others who may use the results, including other scientists and policy makers. Effect sizes are used to communicate the strength of the relationship between variables found in the scientific study. Any research study is rooted in many details of design, measurement devices, and detailed procedures, and thorough evaluation of these details is necessary to evaluate the scientific integrity of the study. However, there is also a need to understand and communicate the central findings of the study in a way that can be widely understood and compared with the findings of related studies.

One of the barriers to broad understanding of results in social science research is that different researchers often use different instruments to measure outcome constructs (such as different achievement tests, different adjustment measures, and different indicators of deviance), which makes it difficult to compare the results of studies. If Study A finds that the treatment effect of an intervention is 2.3 scale score points on the Woodcock–Johnson reading comprehension test, but Study B finds that the treatment effect is 7.5 points on the Terra Nova reading comprehension scale, which effect is larger? It is hard to tell.

Effect sizes are intended to communicate a research study's findings about strength of relations between variables in a manner that captures the essential features of results but that can be broadly understood and compared with findings from other studies.
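The Study A and Study B comparison can be made concrete with a small sketch. It divides each raw treatment effect by a standard deviation, which is the Cohen's d idea mentioned earlier. The two raw effects (2.3 and 7.5 points) come from the text; the standard deviations and the helper name `standardized_mean_difference` are hypothetical, chosen only for illustration, and are not properties of the actual tests.

```python
def standardized_mean_difference(mean_difference, pooled_sd):
    """Return a mean difference expressed in standard deviation units."""
    if pooled_sd <= 0:
        raise ValueError("pooled_sd must be positive")
    return mean_difference / pooled_sd


# Raw treatment effects quoted in the text, in scale-score points.
effect_a_points = 2.3  # Study A, Woodcock-Johnson reading comprehension
effect_b_points = 7.5  # Study B, Terra Nova reading comprehension

# HYPOTHETICAL within-group standard deviations (not taken from the article
# or from the tests' manuals); they exist only to make the arithmetic visible.
sd_a = 10.0
sd_b = 40.0

d_a = standardized_mean_difference(effect_a_points, sd_a)
d_b = standardized_mean_difference(effect_b_points, sd_b)

print(f"Study A: d = {d_a:.2f}")
print(f"Study B: d = {d_b:.2f}")
```

Under these made-up standard deviations, Study A's effect is the larger one in standard deviation units even though its raw point difference is smaller. With other standard deviations the ordering could reverse, which is why raw point differences on different scales cannot be compared directly.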

Volume 2, Number 3, Pages 167–171


STATISTICAL SIGNIFICANCE DOES NOT SOLVE THE PROBLEM OF INTERPRETING AND COMPARING RESULTS ACROSS STUDIES

Hypothesis testing has become a nearly universal aspect of scientific procedure. The discrete significance test is often supplemented by a so-called exact p value. The p value is the probability of observing data that are at least as inconsistent with the null hypothesis as the data actually observed, given that the null hypothesis is true (i.e., that the true effect size is 0). Because the p value from every study is interpreted in the same manner, p values would seem to be broadly understood. Significance tests and exact significance values have the virtue that they are scale free; that is, any study of the relation between variables can yield a p value, regardless of whether the studies measure the outcome on the same scale. Thus, p values would seem to be a good candidate for an effect size metric. I argue that this is incorrect.

Further examination of the nature of p values makes it clear that they are not suitable for comparison of effect sizes. Consider the p value associated with one research design, the two-group study with $n_1$ individuals in one group and $n_2$ individuals in another group, that measures a continuous outcome variable Y. The analysis of this design would use a two-sample t test to measure the statistical significance of the treatment effect (which is just the mean difference). The p value depends on the test statistic

$$t = \sqrt{\frac{n_1 n_2}{n_1 + n_2}} \; \frac{\bar{Y}_1 - \bar{Y}_2}{S},$$

where $\bar{Y}_1$ and $\bar{Y}_2$ are the sample means in the two groups and $S$ is the pooled within-group standard deviation. Thus, the t statistic is influenced by the sample sizes $n_1$ and $n_2$. In particular, a study that had a larger sample size but obtained exactly the same summary statistics $\bar{Y}_1$, $\bar{Y}_2$, and $S$ would have a different t statistic and therefore a different p value. Thus, although p values are useful in determining how reliable the mean difference may be, they do not provide an index of how large the effect may be.

The Standardized Mean Difference: Cohen's d

Cohen (1977) introduced the effect size concept as a tool to aid in statistical power computations. He characterized the situation as follows:

$$(\text{test statistic}) = (\text{sample size piece}) \times (\text{effect size}).$$

In the situation of the t statistic I mentioned above, the sample size piece is $\sqrt{n_1 n_2 / (n_1 + n_2)}$ and the effect size is $(\bar{Y}_1 - \bar{Y}_2)/S$. This effect size is often called Cohen's d. Thus, Cohen's d is just the difference in means expressed in standard deviation units. Note that the effect size d does not depend explicitly on the sample sizes $n_1$ and $n_2$. Moreover, the mean of the sampling distribution of d is (almost exactly) the population effect size $\delta = (\mu_1 - \mu_2)/\sigma$, which depends only on the population parameters underlying the sample data.

This effect size is mathematically natural in the context of statistical power because statistical power depends on population structure (population parameters) only through the population effect size $\delta = (\mu_1 - \mu_2)/\sigma$. Thus, this effect size is the natural representation of "how false" the null hypotheses are in quantitative terms or, alternatively, the size of the relation between treatment and outcome.

The effect size d has several other virtues as a way to characterize treatment effect in the two-group study. First, the effect size depends only on the underlying population parameters, not on sample size, which is particular to the study. Second, the effect size d does not depend on the scale used to measure the outcome variable; any linear transformation of the data would yield exactly the same value of the effect size. This second point is important because classical measurement theory would interpret any two measures that are linear transformations of one another to be measures of the same thing. Thus, if a researcher were to choose from an array of different measures of the same thing (each with its own scale of measurement), the effect size computed would not depend on which measure he or she chose. In statistical terms, the effect size would be invariant to choice of scale. This scale invariance is very desirable in a measure of effect size. It means that effect sizes from different studies can be meaningfully compared even if the studies measure the same outcome construct using different measurement scales.

A caveat is appropriate here. Even measures that would generally be conceded to measure the same construct (e.g., two different tests of reading comprehension) are probably not exactly linearly equatable (i.e., they would not have correlation r = 1.0). However, many measures that attempt to measure the same construct come close to being linearly equatable, and this approximation is widely used in psychometric programs to equate different tests (Holland & Rubin, 1982). It is frequently a reasonable modeling assumption that permits an interpretation of d values from different studies as representing the treatment effect on a common scale that has a population standard deviation of unity.

In some circumstances, studies do not measure outcomes that can be called the same construct in a narrow psychometric sense. Effect size measures such as d can still be of use in such cases. However, statisticians often use d as a measure of the separation between two distributions of values. Thus, d can be interpreted as an index of separation (or overlap) between distributions of outcome scores of two groups. This interpretation of d can be applied even when comparing d values that are not based on measures of the same construct. In this case, d still represents a measure of separation of the distributions of scores from the two groups. This interpretation is appropriate when considering the question of whether the effect of a treatment on
one construct (say, academic achievement) is bigger or smaller than its effect on a different construct (say, deviance).

Cohen's d was appealing to many educational researchers because it corresponded to the way they often interpreted study results. Even before effect sizes came into common use, it was not uncommon for researchers to refer to a treatment effect as being as large as "half a standard deviation," clearly in the spirit of Cohen's d.

Cohen (1977) noted that not only the t statistic but also many other test statistics can be written as the product of a sample size piece that explicitly depends on sample size and design of the study and an effect size piece that depends only on the population parameters underlying the study. In fact, virtually all test statistics can be decomposed in this manner, yielding effect size indexes that are naturally related to statistical power and that can also be used to represent study results.

Like d, most of the effect size measures that arise naturally in conjunction with statistical power are scale invariant, meaning that they involve explicit or implicit standardization.

Complications

Although Cohen's d has much to recommend it in the simple case of a two-group study based on simple random samples of individuals, complications can arise that compromise comparability of effect sizes across studies. For example, studies may differ substantially in the selectivity of their sampling. This means that the standard deviation refers to a substantially different population of individuals in different studies, so that even if two studies measured the outcome in the same scale and had the same absolute treatment–control mean difference, the standardized mean differences in these two studies would be different. For example, Hyde (1981) compared the results of studies of cognitive gender differences using Cohen's d. One of Hyde's studies was based on a nationally representative sample, and another was based on subjects from the Terman study of geniuses, a sample with considerably more restricted cognitive ability. Although this example is extreme, less extreme differences in the restrictiveness of samples could hamper comparability of standardized effect sizes such as Cohen's d to some degree.

Another consideration that can compromise comparability of standardized effect sizes such as Cohen's d is differential measurement error. Unreliability in the outcome measure attenuates d (see Hedges, 1981), and if the reliability of outcome measures varies across studies, there will be differential attenuation. Fortunately, this effect is substantial only if there are large differences in the reliability of outcome measures across studies.

In studies with more complex designs, the notion of standardized effect sizes becomes more ambiguous. For example, in two-level designs (such as those that sample schools first and then children within schools), there are several possible standard deviations to use. One might use the within-school standard deviation, the total standard deviation, or even the between-school standard deviation. Each choice of standard deviation would yield a different effect size measure (see Hedges, 2007). The choice of which standardized effect size measure to use in these designs will depend on the purpose of the comparison and the kind of effect sizes one wishes to compare.

Families of Effect Sizes

Three families of effect size measures are common in the social sciences. The standardized mean difference, which I have already discussed, is a member of a family of related effect sizes that is designed for representing relations between dichotomous independent variables and continuous dependent variables.

The standardized regression coefficient family is designed for representing the relation between continuous independent variables and continuous dependent variables. This family includes the correlation coefficient as a special case: a standardized regression coefficient when there is only one predictor. The standardized regression coefficient describes how many standard deviations of change in the dependent variable are associated with a change of 1 SD in the independent variable. Thus, standardized regression coefficients may be suitable effect sizes for describing the relation between two constructs even when the independent and dependent variables are not measured in exactly the same way in all studies. When there is only a single independent variable and no other covariates, the standardized regression coefficient is identical to the correlation coefficient. When there are additional independent variables used as covariates, then the standardized regression coefficient describes an association between two variables, controlling for the other covariates.

The odds ratio family is designed to represent the relation between dichotomous independent variables and dichotomous dependent variables. This family includes the risk difference, the risk ratio, and the odds ratio (see, e.g., Cooper & Hedges, 1994). These indexes are all designed to describe a comparison of the proportions of individuals having one of the two possible outcomes (such as surviving a disease) in two groups defined by the values of the independent variable (such as the treatment and the control group). If we denote the proportion having the target outcome (e.g., survival) in Group 1 by $p_1$ and the proportion in Group 2 by $p_2$, the risk difference is simply the difference between these proportions: $p_1 - p_2$. The risk ratio is the ratio of these proportions: $p_1/p_2$. The odds ratio is the ratio of the odds of the target outcome, $p/(1 - p)$, in the two groups. Thus, the odds ratio is as follows:

$$\frac{p_1/(1 - p_1)}{p_2/(1 - p_2)} = \frac{p_1 (1 - p_2)}{p_2 (1 - p_1)}.$$

Although it seems least intuitive, the odds ratio has certain mathematical advantages, can be computed from many different
designs, and is probably the most widely used effect size in this family.

Choosing an Effect Size Index

Researchers should choose an effect size measure that is natural for the research design they use. It will make both calculation and interpretation of the effect size easier. The standardized mean difference and related effect sizes in that family are natural for studies that examine the relation between a dichotomous independent variable (such as treatment vs. control) and a continuous outcome measure. Effect sizes in the standardized regression coefficient family (such as the correlation coefficient) are natural for studies examining the relation between continuous independent and dependent variables. Effect sizes in the odds ratio family are natural for studies in which both the independent and the dependent variables are dichotomous. Effect sizes in each of the different families can be approximately translated into effect sizes in other metrics, providing a rough means of comparing effect sizes computed from studies with somewhat different designs, but these translations are only approximate (see, e.g., Hedges & Olkin, 1985).

Reporting Effect Sizes

Effect sizes, like all statistics, should be reported with enough detail about how they were calculated to permit a competent scientist to replicate that computation. Merely stating that a Cohen's d was calculated in a complex design is an ambiguous statement because there may be several legitimate ways in which such a d could be computed and these need not lead to the same answer.

Effect sizes computed from the data in research studies are statistics and are subject to sampling uncertainties. Effect sizes, like all such statistics, should always be reported along with some measure of their sampling uncertainty. The standard error is an appropriate measure of uncertainty, as is a confidence interval for the effect size. Methods for computing standard errors and confidence intervals for effect sizes are beyond the scope of this article but are available in many sources (e.g., Cooper & Hedges, 1994).

Interpretation of Effect Sizes

Statistical calculation is a mathematical process, whereas interpretation of effect sizes is an act of human judgment. No statistical theory can make these judgments. Such judgments will typically be made in some normative context involving the importance of the relationship and the magnitude of other relations to which an effect size may be compared. This means that it is difficult to interpret an effect size as large or small without implicitly or explicitly comparing it to other effect sizes.

Thus, the interpretation of effect sizes is often comparative and always depends on the context. There are many comparative standards that might be used to help in interpretations of effect sizes, including gaps between relevant groups (Black–White, male–female, etc.), indexes of growth (e.g., 1-year growth), or collections of related effect sizes (like Lipsey & Wilson's, 1993, compendium).

In some cases, interpretation of effect sizes can be aided by comparing effects with the natural range of variation in populations, such as the interquartile range. However, it is important to recognize that natural variation is different at different levels of a multilevel population. For example, in considering academic achievement, the variance of the population of school means is much smaller (typically only 20% as large) than the total variation (Hedges & Hedberg, 2007). Therefore, a difference that appears large when compared to the variation of school means may appear small when compared to the range of the total achievement distribution (see Konstantopoulos & Hedges, 2008).

Universal (decontextualized) criteria for interpretation of effect sizes are not helpful. Cohen (1977) reluctantly proposed one set of criteria but argued that they should be applied only in situations where there was no other knowledge available to make a more informed judgment. But it is particularly important to remember that they were proposed for power analysis, not for the interpretation of social research. Unfortunately, researchers have largely ignored Cohen's cautionary arguments and have reified his guidelines into the folklore of social research. The interpretation of effect sizes as large or small depends on the research context. A very small value of effect size on an outcome like death may be extremely important, whereas a much larger numerical value of that same effect size on an outcome like "discomfort" might be judged to be far less consequential. In addition, the uncertainty associated with the effect size (e.g., the width of the confidence interval) should influence its interpretation: Effect size estimates with very wide confidence intervals may be of less practical value than estimates with less uncertainty. Finally, effect sizes from designs that involve statistical controls (such as covariate analyses) depend on what is being controlled. Effect sizes from two studies that control for different covariates may not be comparable.

Interpretation of effect size estimates from data should always be made in light of their sampling uncertainties. Standard errors of effect sizes give some sense of how large this uncertainty might be, but confidence intervals are even more helpful because they provide a direct statement of the range of values in which the true effect size might be.

REFERENCES

Cohen, J. (1977). Statistical power analysis for the behavioral sciences (2nd ed.). New York: Academic Press.
Cooper, H., & Hedges, L. V. (1994). The handbook of research synthesis. New York: Russell Sage Foundation.
Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107–128.
Hedges, L. V. (2007). Effect sizes in cluster randomized designs. Journal of Educational and Behavioral Statistics, 32, 341–370.
Hedges, L. V., & Hedberg, E. C. (2007). Intraclass correlation values for planning group randomized trials in education. Educational Evaluation and Policy Analysis, 29, 60–87.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. New York: Academic Press.
Holland, P. W., & Rubin, D. B. (Eds.). (1982). Test equating. New York: Academic Press.
Hyde, J. S. (1981). How large are cognitive gender differences? A meta-analysis using ω² and d. American Psychologist, 36, 892–901.
Konstantopoulos, S., & Hedges, L. V. (2008). How large an effect can we expect from school reforms? Teachers College Record, 110, 1613–1640.
Lipsey, M. W., & Wilson, D. B. (1993). The efficacy of psychological, educational, and behavioral treatment: Confirmation from meta-analysis. American Psychologist, 48, 1181–1209.
