
The Journal of General Psychology, 2011, 138(4), 292–299
Copyright © 2011 Taylor & Francis Group, LLC

Correcting Overestimated Effect Size Estimates in Multiple Trials
WOLFGANG WIEDERMANN, BARTOSZ GULA, PAUL CZECH, DENISE MUSCHIK
University of Klagenfurt

ABSTRACT. In a simulation study, Brand, Bradley, Best, and Stoica (2011) have shown that Cohen’s d is notably overestimated if computed for data aggregated over multiple trials. Although the phenomenon is highly important for studies and meta-analyses of studies structurally similar to the simulated scenario, the authors do not comprehensively address how the problem could be handled. In this comment, we first suggest a corrective term dc that includes the number and correlation of trials. Next, the results of a simulation study provide evidence that the proposed dc results in a more precise estimation of trial-level effects. We conclude that, in practice, dc together with plausible estimates of inter-trial correlation will produce a more precise effect size range compared to that suggested by Brand and colleagues (2011).

Keywords: aggregation bias, Cohen’s d effect size, multiple trials

The authors wish to thank Rainer Alexandrowicz and Reinhold Hatzinger for valuable comments on an earlier version of the article. Address correspondence to Wolfgang Wiedermann, University of Klagenfurt, Department of Psychology, Universitaetsstrasse 65-67, Klagenfurt, 9020 Austria; wolfgang.wiedermann@aau.at (e-mail).

RECENTLY, BRAND, BRADLEY, BEST, AND STOICA (2011) addressed the important topic of highly overestimated effect size estimates obtained from data aggregated across multiple trials, henceforth referred to as aggregation bias. Via simulations, the authors showed that Cohen’s d (Cohen, 1988), the standardized effect size of the mean difference of two independent samples, is generally overestimated if computed from the sums or means of repeated measurements, assuming that the effects of interest occur at trial level. Under some circumstances (e.g., 30 independent trials per person), the population effect is overestimated by a factor of up to nearly six. The authors explain this by the fact that the pooled variance of aggregated trials decreases as the number of trials increases. They suggest the additional use of more conservative split-plot ANOVA and the reporting of the average inter-trial correlation together with the number of trials. In this comment, (1) we argue that the suggested computation of a range lacks transparency and is of marginal practical value, (2) we propose a simple corrective term which attenuates the aggregation bias, and (3) we demonstrate the adequacy of the proposed correction method in a simulation study.

Brand and colleagues (2011, p. 9) suggest “... a liberal test and the more conservative split-plot design to gain a full understanding of a range of potential effect sizes.” Beyond this suggestion to incorporate trials as a separate factor in analyses, the authors provide no further information on why and how this should be an improvement. It has been shown that such split-plot analyses result in a power loss for the between-subjects effect with higher number of trials and increasing inter-trial correlation (Bradley & Russell, 1998). Although many effect size measures for split-plot ANOVAs exist, such as ε², ω², or generalized ω² (Olejnik & Algina, 2003), η² is commonly reported in the psychological literature (Pierce, Block, & Aguinis, 2004). However, depending on the inter-trial correlation and the number of trials, this interval will be of marginal value and misleading for the following reasons: First, partial η² is part of the standard output of statistical packages such as SPSS, which is likely to provoke an imprecise reporting of η² values, where partial η² is erroneously referred to as classic η² (Levine & Hullet, 2002; Pierce, Block, & Aguinis, 2004). Partial η² is defined as $SS_{effect}/(SS_{effect} + SS_{error})$, whereas classic η² is defined as $SS_{effect}/SS_{total}$, SS being the sum of squares. For single-trial studies, the results are identical for either measure as $SS_{total} = SS_{effect} + SS_{error}$. Both η² can be transformed into the same metric as Cohen’s d applying $d = 2\sqrt{\eta^2/(1 - \eta^2)}$ (e.g., see Cohen, 1988). If partial η² is used, then the overestimation will be the same as that for Cohen’s d calculated from aggregated variables. This follows from the fact that both aggregated Cohen’s d and partial η² for the main effect of treatment in the corresponding split-plot ANOVA rely on the same within-group sums of squares. Only classical η² provides a more conservative estimate of the true underlying trial-level effect. Second, even if an effect size range is computed from the classical η² (transformed into the Cohen’s d metric) and the mean-aggregated Cohen’s d, the range may be too large to be informative. As a consequence of effect size overestimation, should classical η² provide an accurate estimate for the true effect, then the least-biased estimate of the effect is not located somewhere in the middle of the range but equals its lower bound. Therefore, we suggest that classical η² and the corrected Cohen’s d suggested below should be compared in future studies. However, in the present article, we focus on the improvement of Cohen’s d, because it is more commonly reported in studies dealing with two-group comparisons.

A further conclusion the authors arrive at concerns a more sophisticated reporting practice, in which the number of trials as well as the average inter-trial correlation should be routinely reported to allow researchers to draw inferences about the strength of the results. Although we generally support the idea of a more sophisticated reporting routine in the presentation of empirical results, we doubt that this additional information allows for empirically sound conclusions about the actual inflation of effect size estimates. Instead of simply reporting the additional numbers, we suggest an empirical approach to assess the amount of overestimation, which is outlined in more detail in the next section.

A Simple Correction Method

The central limit theorem states that as the sample size (n) of independent samples increases, the actual distribution of the sample means approaches a normal distribution with mean µ and variance σ²/n. Thus, for k independent trials the individual means follow a normal distribution with mean µ and variance σ²/k. Here, σ²/k is the variance of each respondent’s mean across trials, with σ² the trial variance. In this case, the bias in the effect size estimate da obtained from average responses simply depends on the number of trials and can be corrected using $d_c = d_a/\sqrt{k}$. However, this approach requires all responses to be mutually independent, which seems rather implausible in practice. For correlated variables the variance of mean-aggregated measures of k trials is defined as $\frac{\sigma^2}{k} + \frac{k-1}{k}\rho\sigma^2$, with ρ and σ² being the inter-trial correlation and trial variance (assuming equal trial variances), respectively. The standard deviation of the aggregated scores is therefore $\sigma\sqrt{k + \rho(k^2 - k)}/k$, so that the aggregated estimate equals the trial-level effect multiplied by $k/\sqrt{k + \rho(k^2 - k)}$. Transforming the latter equation shows that the mean-aggregated effect size estimate can be corrected using $d_c = d_a/(k/\sqrt{k + \rho(k^2 - k)})$. Consequently, knowledge of the true population correlation ρ is required. Therefore, we performed a simulation study to investigate the properties of this correction, applying the sample-based average inter-trial correlation instead of the true population correlation.

Methods

In essence, we adopted the simulation design of Brand and colleagues (2011)¹ in order to ensure comparability of results. However, instead of rearranging simulated values to obtain an average correlation between trials, two independent n × k matrices (n = number of participants, k = number of trials) following a multivariate normal distribution were generated. Sample size was n = 38 per group. Each trial had a mean of 10 and a standard deviation of 2. Covariance structures of the multivariate distributions were varied to obtain the average correlation (ρ = 0, 0.2, 0.5, 0.8). Means of the second matrix were increased to obtain the three most common effect sizes, d = 0.20 (small), 0.50 (medium), and 0.80 (large). We will consider the case of mean-aggregated measures only and not the aggregation based on sums, which was also simulated by Brand and colleagues (2011). The mean-aggregation is more common in practice because missing values impair the sum aggregation due to unequal number of values per participant. For each of 100,000 iterations mean-aggregated variables were computed and the Cohen’s effect size estimate (da) as well as the corrected dc were logged. The proposed correction assumes knowledge of the true population correlation ρ; thus, additionally, $d_c = d_a/(k/\sqrt{k + r(k^2 - k)})$ was calculated by replacing ρ with the averaged sample estimate r. The simulation study was performed in R (R Development Core Team, 2011).

Results

Table 1 shows the amount of effect size inflation as a function of ρ and k for small, medium, and large true effects. For the uncorrected estimates (da), the results are virtually identical to those of Brand and colleagues (2011). Generally, da values are heavily inflated with increasing number of trials. For uncorrelated trials this overestimation is generally more pronounced (with distortions ranging from 129% to 461%) than for correlated trials.

Correcting Cohen’s d using the true population correlation (dc) eliminates the inflation for all simulated scenarios. However, dc values are still slightly above the true population effect. Across all scenarios overestimation ranged from 0% to 5%. This might be attributable to the fact that ordinary Cohen’s d is biased in the case of small sample sizes (Hedges & Olkin, 1985). Correcting these values using Hedges and Olkin’s formula $g = d \cdot \left(1 - \frac{3}{4N - 9}\right)$, where N is the total sample size, would additionally mitigate the overestimation (for space limitation not shown in Table 1).

Because the true population correlation is unknown in practice, the second dc rows in Table 1 refer to the corrected effect size measures using the sample-based average inter-trial correlation. For independent trials (ρ = 0) and small effects (d = 0.20) the distortion ranges from 0% to 15%. For medium true effects (d = 0.50) a maximum distortion of 68% is observed. For large true effects (d = 0.80) and k > 1 overestimation ranges from 28% to 130%. However, as Brand and colleagues (2011) already noted, zero inter-trial correlation appears extremely unrealistic. Therefore, scenarios involving correlated trials are of greater interest for practice. For small and medium true effects, the correction approach performs almost as well as using the true population correlation. Across all levels of trials and correlations the maximum distortion is 12%. In the case of large true effects (d = 0.80) together with a rather small inter-trial correlation (ρ = 0.2) effect size inflation ranges from 15% to 24% depending on the number of trials.
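The simulation design and correction described in the Methods can be sketched roughly as follows. This is a minimal illustration in Python, not the authors’ R code. All function names are hypothetical. The sketch assumes an equal pairwise correlation between trials, computes r as the mean pairwise Pearson correlation within each group and averages the two groups (the text does not specify this), and uses far fewer iterations than the 100,000 reported.

```python
import math
import random
from statistics import mean, variance


def inflation_factor(k, rho):
    """Factor by which mean-aggregation over k trials inflates the trial-level d."""
    return k / math.sqrt(k + rho * (k * k - k))


def correct_d(d_a, k, rho):
    """Corrected effect size: d_c = d_a / (k / sqrt(k + rho * (k^2 - k)))."""
    return d_a / inflation_factor(k, rho)


def cohens_d(x, y):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(y) - mean(x)) / math.sqrt(pooled_var)


def simulate_group(n, k, mu, sd, rho, rng):
    """n x k matrix of normal trials with equal pairwise correlation rho (assumed)."""
    rows = []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)  # person-level component shared by all trials
        rows.append([
            mu + sd * (math.sqrt(rho) * shared
                       + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0))
            for _ in range(k)
        ])
    return rows


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    ss_a = sum((u - ma) ** 2 for u in a)
    ss_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(ss_a * ss_b)


def average_trial_correlation(rows):
    """Mean of all pairwise correlations between trial columns (0.0 if k = 1)."""
    cols = list(zip(*rows))
    if len(cols) < 2:
        return 0.0
    return mean(pearson(cols[i], cols[j])
                for i in range(len(cols)) for j in range(i + 1, len(cols)))


def run(d=0.5, rho=0.5, k=5, n=38, iterations=200, seed=1):
    """Return the averages of d_a, d_c (true rho), and d_c (sample r)."""
    rng = random.Random(seed)
    mu, sd = 10.0, 2.0  # trial mean and standard deviation from the Methods
    d_a, d_c_rho, d_c_r = [], [], []
    for _ in range(iterations):
        g1 = simulate_group(n, k, mu, sd, rho, rng)
        g2 = simulate_group(n, k, mu + d * sd, sd, rho, rng)
        est = cohens_d([mean(row) for row in g1], [mean(row) for row in g2])
        r_hat = (average_trial_correlation(g1) + average_trial_correlation(g2)) / 2
        d_a.append(est)
        d_c_rho.append(correct_d(est, k, rho))
        d_c_r.append(correct_d(est, k, r_hat))
    return mean(d_a), mean(d_c_rho), mean(d_c_r)


if __name__ == "__main__":
    print(run())
```

Under these assumptions, `correct_d(0.71, 30, 0.4)` and `correct_d(0.71, 30, 0.6)` round to 0.46 and 0.56, the bounds quoted in the Conclusions. Because r is estimated within groups here, the sample-based results need not match the second dc rows of Table 1.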

TABLE 1. Averaged Cohen’s d (see text) of Either Small, Medium, or Large True Effects Derived from Aggregation Over 1, 5, 10, 20, and 30 Trials. Values in Parentheses Represent the Percentage of Overestimation Computed Based on Rounded Estimates.

Layout: columns give the Number of Trials (1, 5, 10, 20, 30); rows are grouped by Effect Size (Cohen’s d = 0.2, small effect; 0.5, medium effect; 0.8, large effect) and, within each effect size, by Correlation (0, 0.2, 0.5, 0.8), with three estimates per correlation (da, dc, and dc based on the sample correlation).

[Table body omitted: the individual cell values are not recoverable in row and column order.]
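The parenthesized percentages described in the Table 1 caption, and the small-sample correction mentioned in the Results, could be computed along the following lines. This is a hedged sketch: the function names are hypothetical, the estimates are first rounded to two decimals as the caption indicates, and N = 76 is an assumption derived from n = 38 per group.

```python
def percent_overestimation(estimate, true_d, digits=2):
    """Percentage by which the rounded estimate exceeds the true effect."""
    rounded = round(estimate, digits)
    return 100.0 * (rounded - true_d) / true_d


def hedges_g(d, total_n):
    """Small-sample correction g = d * (1 - 3 / (4N - 9)), N = total sample size."""
    return d * (1.0 - 3.0 / (4.0 * total_n - 9.0))


if __name__ == "__main__":
    # Illustrative (estimate, true effect) pairs of the kind reported in Table 1.
    for estimate, true_d in [(0.21, 0.2), (0.51, 0.5), (0.82, 0.8)]:
        print(true_d,
              round(percent_overestimation(estimate, true_d), 1),
              round(hedges_g(estimate, 76), 3))
```

With N = 76 the correction factor is 292/295, so it shrinks an estimate by only about 1%. A rounded estimate of 0.21 against a true effect of 0.20 already counts as 5% overestimation, which shows how sensitive the percentages for small effects are to rounding.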

Conclusions

The simulation shows that distortion of effect size estimates can be corrected if the true underlying inter-trial correlation is known. If, on the other hand, the true correlation is unknown, the sample estimate can be used to notably reduce the overestimation reported by Brand and colleagues (2011). In the more realistic case of ρ > 0 and the most extreme case of true Cohen’s d = 0.8 the overestimation was 24% (average dc = 0.99) compared to the distortion of 115% obtained from the uncorrected measure. Moreover, this correction does not vary as much as a function of trials in contrast to the uncorrected Cohen’s d. Hence, dc will be less prone to inflations in meta-analyses, where studies with different numbers of trials are combined.

A related problem and correction approach for inflated effect sizes has been discussed by Dunlap, Cortina, Vaslow, and Burke (1996) for dependent samples. For meta-analyses, where the sample estimate of the correlation between two dependent variables is unavailable, they suggested using estimates from previous studies. Similarly, if in meta-analyses of studies employing independent samples neither the true nor the sample estimates of the inter-trial ρ are given, surrogate values should be derived from previous studies. These would still allow for the application of the more precise dc. For example, given a study with 30 trials, each trial drawn from a population with true d = 0.5 and ρ = 0.5, and a plausible range of correlation between 0.4 and 0.6 suggested by previous studies, dc is biased by merely 8% to 12% ($d_c^{lower} = 0.71/(30/\sqrt{30 + 0.4(30^2 - 30)}) = 0.46$ and $d_c^{upper} = 0.71/(30/\sqrt{30 + 0.6(30^2 - 30)}) = 0.56$). This range still constitutes a considerable improvement over da.

An important question for future research on meta-analytical methods is to what degree the effect sizes reported in practice are distorted and how corrective terms, such as the suggested dc, should be included in the methodological repertoire in order to obtain valid knowledge of the empirical phenomena of interest.

NOTE

1. The authors want to thank Andrew Brand for supplying the R code used in the study.

AUTHOR NOTES

Wolfgang Wiedermann is a research associate at the Applied Psychology and Methods Research Unit, University of Klagenfurt and at the Department of Health Care Management, Carinthia University of Applied Sciences. His primary interests are statistical methods, addiction research, and health-related cognition. Bartosz Gula is assistant professor at the Cognitive Psychology Unit, University of Klagenfurt. His main research focuses on judgment and decision making, learning, and memory. Paul Czech is graduate student at the Applied Psychology and Methods Research Unit, University of Klagenfurt. His research focuses on statistical methods. Denise Muschik is graduate student at the Applied Psychology and Methods Research Unit, University of Klagenfurt. Her primary interests are applied statistics and consumer research.

REFERENCES

Bradley, D. R., & Russell, R. L. (1998). Some cautions regarding statistical power in split-plot designs. Behavior Research Methods, Instruments, & Computers, 30, 462–477.
Brand, A., Bradley, M. T., Best, L. A., & Stoica, G. (2011). Multiple trials may yield exaggerated effect size estimates. The Journal of General Psychology, 138, 1–11.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Dunlap, W. P., Cortina, J. M., Vaslow, J. B., & Burke, M. J. (1996). Meta-analysis of experiments with matched groups or repeated measures designs. Psychological Methods, 1, 170–177.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic Press.
Levine, T. R., & Hullet, C. R. (2002). Eta squared, partial eta squared, and misreporting of effect size in communication research. Human Communication Research, 28, 612–625.
Olejnik, S., & Algina, J. (2003). Generalized eta and omega squared statistics: Measures of effect size for some common research designs. Psychological Methods, 8, 434–447.
Pierce, C. A., Block, R. A., & Aguinis, H. (2004). Cautionary note on reporting eta-squared values from multifactor ANOVA designs. Educational and Psychological Measurement, 64, 916–924.
R Development Core Team (2011). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

Original manuscript received May 9, 2011
Final version accepted July 8, 2011