
Larry V. Hedges

Northwestern University

ABSTRACT—Effect sizes are quantitative indexes of the relations between variables found in research studies. They can provide a broadly understandable summary of research findings that can be used to compare different studies or summarize results across studies. Unlike statistical significance (p values), effect sizes represent strength of relationships without regard to sample size. Three families of effect sizes are widely used: the standardized mean difference family, the standardized regression coefficient family, and the odds ratio family.

KEYWORDS—effect size; p values; statistics; meta-analysis; statistical significance

Most scientific studies attempt to estimate the relation between variables of interest. Experimental studies focus on causal relations between discrete treatment variables and an outcome, estimated using random assignment of units to treatments. Nonexperimental studies often focus on relations between measured (discrete or continuous) variables, often controlling for the effects of other (confounding) variables.

In either kind of study, statistical methods are usually used to assess the strength, precision, and statistical reliability of the relation, information that is usually included when reporting a study's results.

WHAT ARE EFFECT SIZES?

Effect sizes are quantitative indexes of relations among variables. Although the term effect size has been used to refer to several more specific indexes, the term now generally refers to any index of relation between variables. Here, we use the term in that more general sense to refer to any index of relation between variables. In a sense, an effect size describes the degree to which the null hypothesis of no relation between variables is false. Although there are many different effect size indexes, there are reasons to prefer some indexes over others; thus, some effect sizes are better than others. There are many different possible effect sizes, including the difference between treatment and control group means divided by the standard deviation (Cohen's d), the correlation coefficient between the independent variable and the outcome, and the difference in proportions of individuals experiencing a particular outcome.

WHY DO WE NEED EFFECT SIZES?

We report results in science to communicate the findings to others who may use the results, including other scientists and policy makers. Effect sizes are used to communicate the strength of the relationship between variables found in the scientific study. Any research study is rooted in many details of design, measurement devices, and detailed procedures, and thorough evaluation of these details is necessary to evaluate the scientific integrity of the study. However, there is also a need to understand and communicate the central findings of the study in a way that can be widely understood and compared with the findings of related studies.

One of the barriers to broad understanding of results in social science research is that different researchers often use different instruments to measure outcome constructs (such as different achievement tests, different adjustment measures, and different indicators of deviance), which makes it difficult to compare the results of studies. If Study A finds that the treatment effect of an intervention is 2.3 scale score points on the Woodcock–Johnson reading comprehension test, but Study B finds that the treatment effect is 7.5 points on the Terra Nova reading comprehension scale, which effect is larger? It is hard to tell.

Effect sizes are intended to communicate a research study's findings about strength of relations between variables in a manner that captures the essential features of results but that can be broadly understood and compared with findings from other studies.

Correspondence concerning this article should be addressed to Larry V. Hedges, Department of Statistics, Northwestern University, 2040 N. Sheridan Road, Evanston, IL 60208; e-mail: l-hedges@northwestern.edu.

© 2008, Copyright the Author(s). Journal Compilation © 2008, Society for Research in Child Development.

168 | Larry V. Hedges
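The Study A/Study B puzzle above (2.3 points on one reading test vs. 7.5 points on another) is resolved by expressing each effect in standard deviation units. A minimal Python sketch; the standard deviations below are invented for illustration and are not actual norms of the Woodcock–Johnson or Terra Nova tests:

```python
# Standardizing raw mean differences so they can be compared across scales.
# The pooled SDs below are hypothetical illustration values.

def cohens_d(mean_diff, pooled_sd):
    """Standardized mean difference: the raw effect in SD units."""
    return mean_diff / pooled_sd

d_a = cohens_d(2.3, 10.0)   # Study A: 2.3 raw points, assumed SD = 10
d_b = cohens_d(7.5, 50.0)   # Study B: 7.5 raw points, assumed SD = 50

print(f"Study A: d = {d_a:.2f}")  # Study A: d = 0.23
print(f"Study B: d = {d_b:.2f}")  # Study B: d = 0.15
# On the common standardized scale, Study A's effect is the larger one,
# despite its smaller raw score difference.
```

Once both effects are on the standardized scale, the comparison the text asks for becomes straightforward.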

STATISTICAL SIGNIFICANCE DOES NOT SOLVE THE PROBLEM OF INTERPRETING AND COMPARING RESULTS ACROSS STUDIES

Hypothesis testing has become a nearly universal aspect of scientific procedure. The discrete significance test is often supplemented by a so-called exact p value. The p value is the probability of observing data that are as inconsistent with the null hypothesis as the data actually observed, given that the null hypothesis is true (i.e., that the true effect size is 0). Because the p value from every study is interpreted in the same manner, p values would seem to be broadly understood. Significance tests and exact significance values have the virtue that they are scale free; that is, any study of the relation between variables can yield a p value, regardless of whether the studies measure the outcome on the same scale. Thus, p values would seem to be a good candidate for an effect size metric. I argue that this is incorrect.

Further examination of the nature of p values makes it clear that they are not suitable for comparison of effect sizes. Consider the p value associated with one research design, the two-group study with n1 individuals in one group and n2 individuals in another group, that measures a continuous outcome variable Y. The analysis of this design would use a two-sample t test to measure the statistical significance of the treatment effect (which is just the mean difference). The p value depends on the test statistic

t = √[n1 n2 / (n1 + n2)] × (Ȳ1 − Ȳ2) / S,

where Ȳ1 and Ȳ2 are the sample means in the two groups and S is the pooled within-group standard deviation. Thus, the t statistic is influenced by the sample sizes n1 and n2. In particular, a study that had a larger sample size but obtained exactly the same summary statistics Ȳ1, Ȳ2, and S would have a different t statistic and therefore a different p value. Thus, although p values are useful in determining how reliable the mean difference may be, they do not provide an index of how large the effect may be.

The Standardized Mean Difference: Cohen's d

Cohen (1977) introduced the effect size concept as a tool to aid in statistical power computations. He characterized the situation as follows:

(test statistic) = (sample size piece) × (effect size).

In the situation of the t statistic I mentioned above, the sample size piece is √[n1 n2 / (n1 + n2)] and the effect size is (Ȳ1 − Ȳ2)/S. This effect size is often called Cohen's d. Thus, Cohen's d is just the difference in means expressed in standard deviation units. Note that the effect size d does not depend explicitly on the sample sizes n1 and n2. Moreover, the mean of the sampling distribution of d is (almost exactly) the population effect size δ = (μ1 − μ2)/σ, which depends only on the population parameters underlying the sample data.

This effect size is mathematically natural in the context of statistical power because statistical power depends on population structure (population parameters) only through the population effect size δ = (μ1 − μ2)/σ. Thus, this effect size is the natural representation of "how false" the null hypothesis is in quantitative terms or, alternatively, the size of the relation between treatment and outcome.

The effect size d has several other virtues as a way to characterize treatment effect in the two-group study. First, the effect size depends only on the underlying population parameters, not on sample size, which is particular to the study. Second, the effect size d does not depend on the scale used to measure the outcome variable; any linear transformation of the data would yield exactly the same value of the effect size. This second point is important because classical measurement theory would interpret any two measures that are linear transformations of one another to be measures of the same thing. Thus, if a researcher were to choose from an array of different measures of the same thing (each with its own scale of measurement), the effect size computed would not depend on which measure he or she chose. In statistical terms, the effect size would be invariant to choice of scale. This scale invariance is very desirable in a measure of effect size. It means that effect sizes from different studies can be meaningfully compared even if the studies measure the same outcome construct using different measurement scales.

A caveat is appropriate here. Even measures that would generally be conceded to measure the same construct (e.g., two different tests of reading comprehension) are probably not exactly linearly equatable (i.e., they would not have correlation r = 1.0). However, many measures that attempt to measure the same construct come close to being linearly equatable, and this approximation is widely used in psychometric programs to equate different tests (Holland & Rubin, 1982). It is frequently a reasonable modeling assumption that permits an interpretation of d values from different studies as representing the treatment effect on a common scale that has a population standard deviation of unity.

In some circumstances, studies do not measure outcomes that can be called the same construct in a narrow psychometric sense. Effect size measures such as d can still be of use in such cases. Statisticians often use d as a measure of the separation between two distributions of values. Thus, d can be interpreted as an index of separation (or overlap) between distributions of outcome scores of two groups. This interpretation of d can be applied even when comparing d values that are not based on measures of the same construct. In this case, d still represents a measure of separation of the distributions of scores from the two groups. This interpretation is appropriate when considering the question of whether the effect of a treatment on one construct (say, academic achievement) is bigger or smaller than its effect on a different construct (say, deviance).
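Cohen's decomposition can be checked numerically: enlarging the sample changes t (and hence p) while leaving d untouched. A small Python sketch; the summary statistics are made up for illustration:

```python
import math

# (test statistic) = (sample size piece) x (effect size), for the two-sample t.

def t_statistic(n1, n2, mean1, mean2, pooled_sd):
    size_piece = math.sqrt(n1 * n2 / (n1 + n2))  # depends only on the design
    effect_size = (mean1 - mean2) / pooled_sd    # Cohen's d: no n anywhere
    return size_piece * effect_size

mean1, mean2, sd = 105.0, 100.0, 15.0
d = (mean1 - mean2) / sd                           # 5/15, fixed by the data summary

t_small = t_statistic(20, 20, mean1, mean2, sd)    # 20 per group
t_large = t_statistic(200, 200, mean1, mean2, sd)  # 200 per group

print(f"d = {d:.3f}")                          # d = 0.333 (both studies)
print(f"t (n=20 per group):  {t_small:.2f}")   # 1.05
print(f"t (n=200 per group): {t_large:.2f}")   # 3.33
# Identical effect, very different t and therefore p: p values confound
# effect magnitude with sample size.
```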

Cohen's d was appealing to many educational researchers because it corresponded to the way they often interpreted study results. Even before effect sizes came into common use, it was not uncommon for researchers to refer to a treatment effect as being as large as "half a standard deviation," clearly in the spirit of Cohen's d.

Cohen (1977) noted that not only the t statistic but also many other test statistics can be written as the product of a sample size piece that explicitly depends on sample size and design of the study and an effect size piece that depends only on the population parameters underlying the study. In fact, virtually all test statistics can be decomposed in this manner, yielding effect size indexes that are naturally related to statistical power and that can also be used to represent study results. Like d, most of the effect size measures that arise naturally in conjunction with statistical power are scale invariant, meaning that they involve explicit or implicit standardization.

Complications

Although Cohen's d has much to recommend it, in the simple case of a two-group study based on simple random samples of individuals, complications can arise that compromise comparability of effect sizes across studies. For example, studies may differ substantially in the selectivity of their sampling. This means that the standard deviation refers to a substantially different population of individuals in different studies, so that even if two studies measured the outcome in the same scale and had the same absolute treatment–control mean difference, the standardized mean differences in these two studies would be different. For example, Hyde (1981) compared the results of studies of cognitive gender differences using Cohen's d. One of Hyde's studies was based on a nationally representative sample, and another was based on subjects from the Terman study of geniuses, a sample with considerably more restricted cognitive ability. Although this example is extreme, less extreme differences in the restrictiveness of samples could hamper comparability of standardized effect sizes such as Cohen's d to some degree.

Another consideration that can compromise comparability of standardized effect sizes such as Cohen's d is differential measurement error. Unreliability in the outcome measure attenuates d (see Hedges, 1981), and if the reliability of outcome measures varies across studies, there will be differential attenuation. Fortunately, this effect is substantial only if there are large differences in the reliability of outcome measures across studies.

In studies with more complex designs, the notion of standardized effect sizes becomes more ambiguous. For example, in two-level designs (such as those that sample schools first and then children within schools), there are several possible standard deviations to use. One might use the within-school standard deviation, the total standard deviation, or even the between-school standard deviation. Each choice of standard deviation would yield a different effect size measure (see Hedges, 2007). The choice of which standardized effect size measure to use in these designs will depend on the purpose of the comparison and the kind of effect sizes one wishes to compare.

Families of Effect Sizes

Three families of effect size measures are common in the social sciences. The standardized mean difference, which I have already discussed, is a member of a family of related effect sizes that is designed for representing relations between dichotomous independent variables and continuous dependent variables.

The standardized regression coefficient family is designed for representing the relation between continuous independent variables and continuous dependent variables. This family includes the correlation coefficient as a special case: a standardized regression coefficient when there is only one predictor. The standardized regression coefficient describes how many standard deviations of change in the dependent variable are associated with a change of 1 SD in the independent variable. Thus, standardized regression coefficients may be suitable effect sizes for describing the relation between two constructs even when the independent and dependent variables are not measured in exactly the same way in all studies. When there is only a single independent variable and no other covariates, the standardized regression coefficient is identical to the correlation coefficient. When there are additional independent variables used as covariates, then the standardized regression coefficient describes an association between two variables, controlling for the other covariates.

The odds ratio family is designed to represent the relation between dichotomous independent variables and dichotomous dependent variables. This family includes the risk difference, the risk ratio, and the odds ratio (see, e.g., Cooper & Hedges, 1994). These indexes are all designed to describe a comparison of the proportions of individuals having one of the two possible outcomes (such as surviving a disease) in two groups defined by the values of the independent variable (such as the treatment and the control group). If we denote the proportion having the target outcome (e.g., survival) in Group 1 by p1 and the proportion in Group 2 by p2, the risk difference is simply the difference between these proportions: p1 − p2. The risk ratio is the ratio of these proportions: p1/p2. The odds ratio is the ratio of the odds of the target outcome, p/(1 − p), in the two groups. Thus, the odds ratio is

[p1 / (1 − p1)] / [p2 / (1 − p2)] = [p1 (1 − p2)] / [p2 (1 − p1)].

Although it seems least intuitive, the odds ratio has certain mathematical advantages, can be computed from many different designs, and is probably the most widely used effect size in this family.
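The three indexes in this family are easy to compute side by side. A short Python sketch with invented proportions:

```python
# Risk difference, risk ratio, and odds ratio for two groups,
# using made-up illustrative proportions p1 and p2.

def risk_difference(p1, p2):
    return p1 - p2

def risk_ratio(p1, p2):
    return p1 / p2

def odds_ratio(p1, p2):
    # Ratio of the odds p/(1 - p) in the two groups; algebraically
    # equal to p1*(1 - p2) / (p2*(1 - p1)).
    return (p1 / (1 - p1)) / (p2 / (1 - p2))

p1, p2 = 0.8, 0.5  # proportion with the target outcome in each group
print(f"risk difference = {risk_difference(p1, p2):.2f}")  # 0.30
print(f"risk ratio      = {risk_ratio(p1, p2):.2f}")       # 1.60
print(f"odds ratio      = {odds_ratio(p1, p2):.2f}")       # 4.00
```

Note how the three indexes give quite different numerical impressions of the same pair of proportions, one reason a report should state exactly which index was computed.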

Choosing an Effect Size Index

Researchers should choose an effect size measure that is natural for the research design they use. It will make both calculation and interpretation of the effect size easier. The standardized mean difference and related effect sizes in that family are natural for studies that examine the relation between a dichotomous independent variable (such as treatment vs. control) and a continuous outcome measure. Effect sizes in the standardized regression coefficient family (such as the correlation coefficient) are natural for studies examining the relation between continuous independent and dependent variables. Effect sizes in the odds ratio family are natural for studies in which both the independent and the dependent variables are dichotomous. Effect sizes in each of the different families can be approximately translated into effect sizes in other metrics, providing a rough means of comparing effect sizes computed from studies with somewhat different designs, but these translations are only approximate (see, e.g., Hedges & Olkin, 1985).
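The mapping from research design to effect size family can be summarized as a lookup table. The following toy Python function is only an illustrative mnemonic for the guidance above, not a statistical procedure:

```python
# Natural effect size family by (independent, dependent) variable type.

def natural_family(independent, dependent):
    families = {
        ("dichotomous", "continuous"):
            "standardized mean difference family (e.g., Cohen's d)",
        ("continuous", "continuous"):
            "standardized regression coefficient family (e.g., correlation)",
        ("dichotomous", "dichotomous"):
            "odds ratio family (risk difference, risk ratio, odds ratio)",
    }
    return families.get((independent, dependent),
                        "no single natural family; translate with care")

print(natural_family("dichotomous", "continuous"))
# standardized mean difference family (e.g., Cohen's d)
```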

Reporting Effect Sizes

Effect sizes, like all statistics, should be reported with enough detail about how they were calculated to permit a competent scientist to replicate that computation. Merely stating that a Cohen's d was calculated in a complex design is an ambiguous statement because there may be several legitimate ways in which such a d could be computed, and these need not lead to the same answer.

Effect sizes computed from the data in research studies are statistics and are subject to sampling uncertainties. Effect sizes, like all such statistics, should always be reported along with some measure of their sampling uncertainty. The standard error is an appropriate measure of uncertainty, as is a confidence interval for the effect size. Methods for computing standard errors and confidence intervals for effect sizes are beyond the scope of this article but are available in many sources (e.g., Cooper & Hedges, 1994).

Interpretation of Effect Sizes

Statistical calculation is a mathematical process, whereas interpretation of effect sizes is an act of human judgment. No statistical theory can make these judgments. Such judgments will typically be made in some normative context involving the importance of the relationship and the magnitude of other relations to which an effect size may be compared. This means that it is difficult to interpret an effect size as large or small without implicitly or explicitly comparing it to other effect sizes. Thus, the interpretation of effect sizes is often comparative and always depends on the context. There are many comparative standards that might be used to help in interpretations of effect sizes, including gaps between relevant groups (Black–White, male–female, etc.), indexes of growth (e.g., 1-year growth), or collections of related effect sizes (like Lipsey & Wilson's, 1993, compendium).

In some cases, interpretation of effect sizes can be aided by comparing effects with the natural range of variation in populations, such as the interquartile range. However, it is important to recognize that natural variation is different at different levels of a multilevel population. For example, in considering academic achievement, the variance of the population of school means is much smaller (typically only 20% as large) than the total variation (Hedges & Hedberg, 2007). Therefore, a difference that appears large when compared to the variation of school means may appear small when compared to the range of the total achievement distribution (see Konstantopoulos & Hedges, 2008).

Universal (decontextualized) criteria for interpretation of effect sizes are not helpful. Cohen (1977) reluctantly proposed one set of criteria but argued that they should be applied only in situations where there was no other knowledge available to make a more informed judgment. But it is particularly important to remember that they were proposed for power analysis, not for the interpretation of social research. Unfortunately, researchers have largely ignored Cohen's cautionary arguments and have reified his guidelines into the folklore of social research. The interpretation of effect sizes as large or small depends on the research context. A very small value of effect size on an outcome like death may be extremely important, whereas a much larger numerical value of that same effect size on an outcome like "discomfort" might be judged to be far less consequential. In addition, the uncertainty associated with the effect size (e.g., the width of the confidence interval) should influence its interpretation: Effect size estimates with very wide confidence intervals may be of less practical value than estimates with less uncertainty. Finally, effect sizes from designs that involve statistical controls (such as covariate analyses) depend on what is being controlled. Effect sizes from two studies that control for different covariates may not be comparable.

Interpretation of effect size estimates from data should always be made in light of their sampling uncertainties. Standard errors of effect sizes give some sense of how large this uncertainty might be, but confidence intervals are even more helpful because they provide a direct statement of the range of values in which the true effect size might be.

REFERENCES

Cohen, J. (1977). Statistical power analysis for the behavioral sciences (2nd ed.). New York: Academic Press.

Cooper, H., & Hedges, L. V. (1994). The handbook of research synthesis. New York: Russell Sage Foundation.

Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107–128.

Hedges, L. V. (2007). Effect sizes in cluster randomized designs. Journal of Educational and Behavioral Statistics, 32, 341–370.

Hedges, L. V., & Hedberg, E. C. (2007). Intraclass correlation values for planning group randomized trials in education. Educational Evaluation and Policy Analysis, 29, 60–87.

Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. New York: Academic Press.

Holland, P. W., & Rubin, D. B. (Eds.). (1982). Test equating. New York: Academic Press.

Hyde, J. S. (1981). How large are cognitive gender differences? A meta-analysis using ω² and d. American Psychologist, 36, 892–901.

Konstantopoulos, S., & Hedges, L. V. (2008). How large an effect can we expect from school reforms? Teachers College Record, 110, 1613–1640.

Lipsey, M. W., & Wilson, D. B. (1993). The efficacy of psychological, educational, and behavioral treatment: Confirmation from meta-analysis. American Psychologist, 48, 1181–1209.
