
British Journal of Anaesthesia, 115 (5): 699–707 (2015)
doi: 10.1093/bja/aev166
Advance Access Publication Date: 3 June 2015
Review Article

Pitfalls in reporting sample size calculation in randomized controlled trials published in leading anaesthesia journals: a systematic review

M. Abdulatif*, A. Mukhtar and G. Obayah
Anaesthetic Department, Faculty of Medicine, Cairo University, Giza, Egypt
*Corresponding author: 1 Al-Saray Street, Al-Manial, Cairo Postal Code 11559, Egypt. E-mail: mohamedabdulatif53@gmail.com

© The Author 2015. Published by Oxford University Press on behalf of the British Journal of Anaesthesia. All rights reserved. For Permissions, please email: journals.permissions@oup.com

Abstract
We have evaluated the pitfalls in reporting sample size calculation in randomized controlled trials (RCTs) published in the 10
highest impact factor anaesthesia journals.
Superiority RCTs published in 2013 were identified and checked for the basic components required for sample size
calculation and replication. The difference between the reported and replicated sample size was estimated. The sources used for
estimating the expected effect size (Δ) were identified, and the difference between the expected and observed effect sizes (Δ gap)
was estimated.
We enrolled 194 RCTs. Sample size calculation was reported in 91.7% of studies. Replication of sample size calculation was
possible in 80.3% of studies. The original and replicated sample sizes were identical in 67.8% of studies. The difference between
the replicated and reported sample sizes exceeded 10% in 28.7% of studies. The expected and observed effect sizes were
comparable in RCTs with positive outcomes (P=0.1). Studies with negative outcome tended to overestimate the effect size (Δ gap
42%, 95% confidence interval 32–51%), P<0.001. Post hoc power of negative studies was 20.2% (95% confidence interval 13.4–27.1%).
Studies using data derived from pilot studies for sample size calculation were associated with the smallest Δ gaps (P=0.008).
Sample size calculation is frequently reported in anaesthesia journals, but the details of basic elements for calculation are
not consistently provided. In almost one-third of RCTs, the reported and replicated sample sizes were not identical and the
assumptions for the expected effect size and variance were not supported by relevant literature or pilot studies.

Key words: research hypothesis, effect size; statistical power, sample size; study design, superiority trials

Editor’s key points

• In this systematic review, the authors identify errors in sample size calculation in the leading anaesthesia journals.
• Frequently, there were differences between the reported and the replicated sample sizes, with the assumptions for the expected effect size not supported by relevant data.

A randomized controlled trial (RCT) is the most frequently used, valid method to evaluate clinical interventions in medical subspecialties.1 The Consolidated Standards of Reporting Trials (CONSORT) group developed a set of evidence-based recommendations to improve the quality of design, analysis, and interpretation of RCTs.2 Sample size calculation contributes to the quality of RCTs3 4 and is an important item in the CONSORT checklist.2 Adequate reporting of sample size calculation should normally include four main components: the expected minimal clinically relevant difference between the study groups, the SD of measurements for continuous primary outcomes, the power of the study (generally set between 80 and 90%), and the type I error (usually 5%).5 Charles and colleagues6 demonstrated that 43% of RCTs published in the six highest impact factor medical journals did not report all the required parameters for sample size calculation. Moreover, in 30% of the reports that gave enough data to
recalculate the sample size, the variation in the replicated sample size was greater than 10%.6 Major deficiencies in sample size calculations have also been reported in other specialties.7–11 Previous reports indicated a significant improvement in the frequency of reporting of sample size calculation, from 52 to 86%, in anaesthesia journals.12 13 However, the integrity of sample size reporting for RCTs published in the anaesthesia literature has not previously been evaluated. This systematic review evaluated the pitfalls in reporting sample size calculation in parallel-group superiority RCTs published in the 10 highest impact factor anaesthesia journals during a 1 year period.
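To make the four components listed above concrete, they combine in the standard normal-approximation formula for the size of each group in a two-arm comparison of means (shown here as an illustration; it is not a formula prescribed by this review):

$$n = \frac{2\,\sigma^{2}\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}}{\Delta^{2}}$$

where Δ is the minimal clinically relevant difference, σ is the SD of the measurements, 1 − β is the power, and α is the type I error. For example, Δ = 10, σ = 12, α = 0.05, and 90% power give n = 2 × 144 × (1.96 + 1.28)²/100 ≈ 30.3, rounded up to 31 patients per group.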
Methods

The Medline database was searched via PubMed using the search terms ‘randomized controlled trials’, ‘controlled clinical trial’, and ‘randomized controlled trial’ for the period from January to December 2013. The tables of contents and the abstracts of clinical reports published in all issues of 10 of the leading anaesthesia journals were also screened to identify the relevant RCTs (Fig. 1). The choice of the included journals was based on their impact factor ranking at the time of conceptualization of this systematic review. All superiority RCTs with parallel-group designs were included in the analysis. Randomized controlled trials with crossover or factorial design, those involving animals, and simulation or manikin studies were excluded.

Fig 1 Flow chart of the included and excluded studies and details of sample size replication.
Included journals (in alphabetical order): Acta Anaesthesiologica Scandinavica; Anaesthesia; Anesthesiology; Anesthesia and Analgesia; British Journal of Anaesthesia; Canadian Journal of Anesthesia; European Journal of Anaesthesiology; Journal of Neurosurgical Anesthesiology; Pediatric Anesthesia; Regional Anesthesia and Pain Medicine.
Studies assessed for eligibility: 231. Included studies: 194. Excluded studies: 37 (crossover design 14; non-inferiority 8; equivalence 3; manikin/simulation 4; non-randomized 3; factorial design 1; step-up/step-down dose finding 1; premature termination 1; French language 1; conflict of interest 1*).
Of the 194 included studies, sample size calculation was reported in 178 and not reported in 16; of the 178, the calculation was replicated in 143 and could not be replicated in 35.
*The authors of this review contributed to this excluded study.14

Two of the authors (M.A. and A.M.) independently extracted data from the full text of the selected articles relating to the design and sample size calculation in duplicate. Any disagreement was resolved by discussion. We created a checklist to assess the frequency and adequacy of reporting of sample size estimation (Table 1). Sample size calculation was considered to be done if the authors explicitly stated this in the methodology section. Furthermore, we checked for the availability of a clearly specified primary outcome measure, the basic components of sample size calculations,15 and the source for estimating the minimal expected effect size. An effect size (Δ) is a standardized measure that describes the magnitude of the difference between study groups.16 The effect size was calculated according to the type of the primary outcome and the number of groups in each RCT, using the formulae outlined in Appendix I.16–18

The number of patients randomized and the observed effect sizes were extracted from the results section. We also noted whether the results of the trial were statistically significant for the primary outcome. The reviewers were not blinded to the journal name and authors.

Table 1 Data extraction checklist
• Primary outcome measure clearly specified
• Type of primary outcome measure
• Sample size estimation was described in the methodology section
• Source for justification and magnitude of the expected effect size (Δ)
• Power of the study and α level
• The expected and reported dropout rates
• Number of study groups
• Type of statistical test used for sample size calculation
• Hypothesis testing: one-tailed, two-tailed, or unspecified
• Compensation for multiple comparisons in studies including more than two groups
• Software used for sample size calculation
• The final number of patients analysed in each of the study groups
• Difference between reported and replicated sample size
• Results of the primary outcome: positive vs negative

Sample size replication was primarily done using the same software and statistical test specified by the authors. In studies not reporting sample size estimation software, the Power Analysis and Sample Size software (PASS 13; NCSS, LLC, Kaysville, UT, USA) was used for replication. Replicate estimations were done for studies providing all the required components for sample size estimation. For RCTs missing the details of the statistical test, we used the formulae adapted for the χ2 test and the two-tailed Student’s t-test for binary and continuous end points, respectively.19–21 When adjustments for multiple testing were not reported, we assumed no adjustment had been applied. Sample size replication was not possible for studies that did not report the expected effect size.

The difference between the reported and replicated sample size was defined as the reported sample size minus the recalculated sample size, divided by the reported sample size.6 A difference of 10% or more indicated an important deviation from the reported sample size6 and was confirmed by expert statistical consultation.6 8 The magnitude of the expected and observed dropout rates was recorded.
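As an illustration of this replication step, the sketch below recalculates a group size for a continuous primary outcome from published design parameters and applies the relative-difference definition above. It uses the two-sided normal approximation rather than the PASS-based procedure used in the review, and the function names and trial numbers are hypothetical.

```python
# Minimal sketch of sample size replication (normal approximation,
# two-tailed two-sample comparison of means); not the PASS procedure
# used in the review, and the example numbers are hypothetical.
from math import ceil
from statistics import NormalDist

def n_per_group_continuous(delta, sd, alpha=0.05, power=0.80):
    """n = 2*sd^2 * (z_(1-alpha/2) + z_(1-beta))^2 / delta^2, rounded up."""
    z = NormalDist().inv_cdf
    n = 2 * sd ** 2 * (z(1 - alpha / 2) + z(power)) ** 2 / delta ** 2
    return ceil(n)

def relative_difference(reported, replicated):
    """Reported minus recalculated sample size, divided by reported."""
    return (reported - replicated) / reported

# Hypothetical trial: expected difference 10, SD 12, alpha 0.05, power 0.9,
# with 40 patients per group reported.
replicated = n_per_group_continuous(delta=10, sd=12, power=0.9)
print(replicated)                           # replicated group size (31)
print(relative_difference(40, replicated))  # >0.10 flags a notable deviation
```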
The difference between the expected minimal effect size and the observed effect size in the primary outcome was considered the Δ gap.22 The magnitude of the Δ gap was compared between studies with positive and negative outcomes. Furthermore, the impact of the source for the assumption of the expected effect size on the Δ gap was evaluated. Post hoc power calculation was performed using standardized formulae for each study that reported a non-significant difference in the primary outcome.23

Statistical analysis

Continuous variables are expressed as means (SD) or medians (interquartile range; IQR) and discrete variables as counts (percentage), unless otherwise stated. We calculated the expected and observed Δ values and dropout rates across all trials and compared them using Student’s paired t-test. Differences among the groups were analysed using the Mann–Whitney U-test or Kruskal–Wallis test, as appropriate. The software SPSS v15.0 for Windows (SPSS, Inc., Chicago, IL, USA) was used for statistical analysis.
Results

A total of 231 RCTs were identified during the literature search. Of these reports, 194 studies were enrolled in the analysis and 37 did not fulfil the prespecified inclusion criteria.14 Sixteen studies (8.3%) did not report any calculation for sample size, while 178 (91.7%) reported one or more of the essential parameters of sample size estimation (Fig. 1). Most of the included studies did not specify the statistical test or the use of a one-tailed or two-tailed hypothesis for sample size calculation. The α error was most frequently set at 5%, and the reported power of the studies ranged between 70 and 95%. Thirty-four of the enrolled studies included more than two groups, but only three studies used Bonferroni correction to adjust the level of the α error to compensate for multiple comparisons (Table 2). Replication of sample size calculation was possible in 143 (80.3%) studies. In 35 (19.7%) studies, replication was not possible because of an unspecified expected effect size (Fig. 1).

Table 2 Reported parameters required for sample size calculation in the 178 included studies. Values are expressed as number (percentage)
• Expected effect size in studies with continuous primary outcome: reported 111 (62.4%); not reported 4 (2.2%)
• Expected effect size in studies with binary primary outcome: reported 56 (31.5%); not reported 7 (3.9%)
• Expected variance in studies with continuous primary outcome: reported 87 (48.9%); not reported 28 (15.7%)
• Source for justification of the expected effect size: relevant published studies 92 (51.7%); pilot study 29 (16.3%); unspecified 57 (32.0%)
• Hypothesis testing: two-tailed 50 (28.1%); one-tailed 1 (0.6%); unspecified 127 (71.3%)
• Type I error (α): 0.05, 173 (97.2%); 0.01, 5 (2.8%)
• Number of studies including more than two groups: 36 (20.2%)
• Number of studies that compensated for multiple comparisons: 3 (1.7%)
• Power of study: 0.7, 1 (0.6%); 0.8, 131 (73.6%); 0.85, 3 (1.7%); 0.87, 1 (0.6%); 0.9, 37 (20.7%); 0.95, 5 (2.8%)

Our analysis indicated that the replicated and reported sample sizes were different in 46 (32.2%) studies. This difference exceeded 10% in 41 (28.7%) studies. Ninety-two studies reported a compensation for a possible dropout rate. However, the observed dropout rate at the conclusion of the study was significantly lower than the expected rate (Table 3).

Table 3 Details of the reported and replicated sample sizes and dropout rates (143 studies). Values are expressed as number (percentage) or median [interquartile range (IQR)]. *The expected dropout rate was significantly higher than the observed rate (P<0.0001)
• Reported sample size: not finally achieved 26 (18.2%); studies compensating for possible dropouts 92 (64.3%); expected dropout rate 16.8% (IQR 9.5–20.3%); observed dropout rate 4.6% (IQR 0–6.6%)*
• Replicated sample size: identical to reported 97 (67.8%); more than the reported 29 (20.3%); less than the reported 17 (11.9%); difference exceeded 10%, 41 (28.0%); difference less than 10%, 5 (3.5%)

The median expected effect sizes for dichotomous and continuous outcomes were 0.5 (IQR 0.4–0.63) and 0.8 (IQR 0.6–1.0), respectively. The median observed effect sizes for dichotomous and continuous outcomes were 0.42 (IQR 0.2–0.6) and 0.7 (IQR 0.3–1.0), respectively. The overall median Δ gaps for dichotomous and continuous outcomes were 9% (IQR −5 to 30%) and 9% (IQR −30 to 30%), respectively. In studies with a positive outcome (73.7%), the expected and observed effect sizes were not significantly different [Δ gap −6.0%, 95% confidence interval (CI) −14 to 2.0%], P=0.1. In studies with a negative outcome (26.3%), the expected effect size was significantly higher than the observed effect size (Δ gap 42%, 95% CI 32–51%), P<0.001. The mean (95% CI) post hoc power of studies with a negative outcome was 20.2% (13.4–27.1%).

The assumption for the expected difference and variance between the control and intervention groups was based on relevant published reports in 92 (51.7%) studies and on pilot studies in 29 (16.3%) studies, while 57 (32.0%) studies did not specify the source of the expected difference. The Δ gaps were smaller for RCTs using pilot studies for estimating the magnitude of the expected effect size (P=0.008; Fig. 2).

Fig 2 The impact of the source of prediction of the magnitude of effect size (x axis: unspecified, previous literature, pilot study) on the calculated Δ gaps (y axis, %; the difference between the observed and expected effect size). Box represents median observations (horizontal rule), with 25th and 75th percentiles of observed data (top and bottom of box). Whiskers indicate the maximal and minimal values. *Denotes significance relative to the other two groups, P=0.008.
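The post hoc power figures above can be reproduced, in outline, with the same normal approximation. The sketch below takes an observed standardized effect size and the achieved group size; all numbers are hypothetical, and the Δ gap shown uses one plausible normalization (expected minus observed, relative to expected), which the text does not spell out explicitly.

```python
# Hedged sketch of a post hoc power calculation for a negative trial
# (two-tailed, two-sample normal approximation); numbers are hypothetical.
from math import sqrt
from statistics import NormalDist

def post_hoc_power(observed_d, n_per_group, alpha=0.05):
    """Approximate achieved power given the observed effect size d."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    return 1 - nd.cdf(z_crit - observed_d * sqrt(n_per_group / 2))

expected_d, observed_d = 0.8, 0.35           # design vs observed effect size
delta_gap = (expected_d - observed_d) / expected_d
print(f"Delta gap: {delta_gap:.0%}")          # 56%: marked overestimation
print(f"Post hoc power: {post_hoc_power(observed_d, 25):.1%}")
```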
Discussion

The results of this systematic review indicated that sample size calculation is frequently reported, in 91.7% of RCTs published in leading anaesthesia journals. However, the details of the basic elements required for appropriate sample size calculation or replication are not consistently reported. In almost one-third of the enrolled RCTs, the variation in the replicated sample size was greater than 10%, and the assumption for the expected minimal effect size was not supported by relevant published literature or pilot studies. Studies with a negative outcome were associated with significant overestimation of the expected effect size. In contrast, the least variations between the expected and observed effect sizes were encountered in RCTs using pilot studies as a source for estimating the minimal expected effect size.

In line with the findings of this systematic review, variations between the reported and replicated sample sizes have been reported at the protocol development stage,24 25 in trial registration databases,26 and after publication in high-impact medical journals.6 This finding is difficult to explain and could be related to suboptimal cooperation between expert statisticians and clinical investigators. Involvement of a statistician in sample size calculation has been recommended15 27 and was used in the present systematic review. Using multivariate logistic regression, Koletsi and colleagues8 recently reported that the involvement of a methodologist in planning statistical analysis and trial design is associated with two-fold higher odds of adequate reporting of sample size calculation (odds ratio 1.97, 95% CI 1.10–3.53). However, the final outcome of sample size estimation cannot be guaranteed without a clear estimate and justification for the expected clinically important effect size by the investigator.28 29 The rationale for choosing a predetermined magnitude for the expected effect size was not adequately justified in the RCTs included in this systematic review. A recent evaluation of research protocols submitted to UK ethics committees indicated that only 3% provided a reasoned explanation for the chosen effect size.24

The effect size is one of the most important indicators of clinical significance. Unfortunately, this fact is frequently overlooked, and many researchers mistakenly equate statistically significant outcomes with clinical relevance.30 The concept of clinical relevance in the primary outcome is identified in the literature by what is called the minimal clinically important difference (MCID), defined as ‘the smallest change that patients perceive as beneficial or detrimental, and is useful to aid the clinical interpretation of health status data, particularly in response to intervention’.31–33 The use of the MCID in calculation of an appropriate sample size for RCTs is expected to improve their clinical relevance and offers a threshold above which the outcome is experienced as relevant by the patient.34

The concept of the MCID was not addressed in the RCTs included in this systematic review, but the source used for estimating the expected effect size and variance was identified in 68% of studies
and was mostly derived from relevant published literature and pilot studies. Our finding is in line with a recent review by Clark and colleagues,24 who reported that only 59% of the protocols for pain and anaesthesia RCTs submitted to UK research ethics committees reported the source for the value of the expected effect size. A recent comprehensive systematic review and survey identified seven potential methods for specifying the target expected effect size in sample size calculation, with wide variability in awareness and use among expert triallists in the UK and Ireland.35 The use of pilot studies was recognized by 90% of responders as a source for identifying the minimal expected effect size required for sample size calculation.35

Clinical trials using data derived from pilot studies for sample size calculation were associated with the smallest Δ gaps between the expected and observed effect sizes. However, pilot studies are generally underpowered and could yield an imprecise estimate of the difference in the population.36 37 The use of the upper limit of the 95% CI of the variance derived from internal pilot studies has been shown to improve the precision of sample size estimation.38 For RCTs using variance derived from an external pilot study or published literature, the use of the 60% upper confidence limit of the SD has been recommended.39 An additional strategy to optimize sample size calculation is the use of a prespecified adaptive approach that allows recalculation of the sample size during the course of a clinical trial, with subsequent adjustments to the initially planned size.40 This approach is based on revision of the event rate, the variance, or the treatment effect.41

The expected and observed effect sizes were significantly different only in studies with negative outcomes. This might indicate overestimation of the expected treatment effect in the intervention group. Consistent with our findings, Aberegg and colleagues22 investigated the possible bias in the design of RCTs reporting negative results for mortality in adult critical care settings. The mean expected and observed differences in mortality rates were 10.1 and 1.4%, respectively (P<0.0001), with a Δ gap of 8.7%.22 Two possible explanations were suggested to account for the increased Δ gap in negative trials: first, the investigators might adjust the expected effect size to reduce the sample size, the ‘sample size samba’;8 27 and second, unrealistic optimism about the efficacy of treatment in the intervention study arm.36 42

This systematic review elicited several factors that might be associated with unrecognized underpowered RCTs. These include inappropriate estimation of the expected effect size, incorrect mathematical estimation of sample size, failure to achieve the minimal target sample size, and waiving of sample size calculation. An earlier study assessing the quality of RCTs published in the anaesthesia literature reported that 72% of studies that had negative results did not consider the possibility of an underlying type II error and inadequate a priori sample size estimation.12 It has been suggested that many published anaesthesia studies are underpowered.3 43 This is supported by the results of our post hoc power calculation for RCTs with a negative outcome. It should be noted, however, that combining the MCID with the 95% CI has been suggested as a valuable strategy for a more appropriate clinical interpretation of the outcome of negative RCTs.30 44 45 If the MCID lies within the 95% CI of the observed effect size, the treatment may be clinically effective regardless of the point estimate.30

Randomized controlled trials that are too small can be misleading, either by missing modest but clinically important treatment effects or by overestimating the effect of treatment and finding it significant merely by chance.46 47 Furthermore, it is generally considered unethical to enrol participants in underpowered trials with low potential to provide a reliable answer to the research question.47 The same ethical concern extends to exposing patients to unjustified risks in large clinical trials with overestimated sample sizes.48

The dropout rate during the conduct of RCTs may complicate statistical analysis by introducing bias in the findings and reducing statistical power.49 Long-term follow-up has been associated with an increased dropout rate.50 Apart from the relatively high dropout rate that might be encountered in chronic pain studies,51 RCTs evaluating anaesthetic interventions tend to have short follow-up periods. This could account for the relatively low (4.6%) observed attrition rate in the present systematic review. This finding could be used as a guide in optimizing sample size calculation for future RCTs in anaesthesia.

Appropriate sample size estimation commonly integrates scientific precision and available resources.52 There is currently an emerging approach incorporating study-specific optimal α and β errors that considers the clinically relevant effect size and the cost, rather than statistical significance alone, in sample size estimation.53 54 The ultimate goals of this approach are to improve the feasibility of the study and to minimize the consequences of type I and II errors on clinical decision making. In most of the RCTs included in this review, the α and β errors were set by convention at 5 and 20%, respectively. Studies that compared more than two groups did not indicate that adjustment of the α error was considered to compensate for multiple comparisons. A similar observation was reported in a recent review of the integrity of sample size determination in original research protocols.24 Anaesthesia trials frequently use a longitudinal design to assess the change in group outcomes at multiple time points.55 Multiple end points, multiple treatment arms, or interim analyses of the data require α error adjustment.56 It has been estimated that reducing the α error from 0.05 to 0.01 to compensate for five multiple tests will almost double the minimal sample size at a study power of 90%.37 Furthermore, estimation of sample size for studies with repeated measures mandates identification of the variance at each time measurement and the pattern of correlation among pairs of repeated measurements.57 Therefore, for a multi-arm study or a study with repeated measures, it would be useful to conduct a pilot study to provide sufficient details about the pattern of means, mean differences, and the variances of the response variable at each time point.58

Inadequate description of sample size calculation continues to be reported7–11 despite the recommendations of the CONSORT2 and the Standard Protocol Items: Recommendations for Interventional Trials (SPIRIT)59 groups. Adherence to these recommendations has been reported to be suboptimal in anaesthesia,60 critical care,61 and other medical journals.62 63 Furthermore, the instruction and obligation to adhere to all the elements of the CONSORT checklist in journal author guides have been reported to be heterogeneous.64–67 We therefore developed a simple structured algorithm, based on our results and on review of the literature, to guide investigators to conduct and report sample size estimation appropriately (Fig. 3). We also suggest that transparency could be improved by adopting an obligatory short checklist of all the basic elements of sample size calculation at the different stages of designing and reporting of RCTs. Our assumption is in line with the recommendation of the World Health Organization Minimal Registration Data Set68 and is supported by the finding that perfect 100% compliance was reported for obligatory data, while variable adherence was observed for optional data in one of the most popular trial registration databases.69
Fig 3 Algorithm to improve detailed transparent reporting of sample size estimation for parallel-group superiority randomized controlled clinical trials:
1. Identify the primary outcome measure: continuous or binary.
2. Justify the expected effect size/variance*: pilot study or relevant published literature.
3. Improve the precision of the variance: pilot study, 95% UCL of SD; literature, 60% UCL of SD.
4. Choose the appropriate statistical test: continuous outcome, 2 groups: t-test; binary outcome, 2 groups: Z/χ2 test; continuous outcome, >2 groups: ANOVA; binary outcome, >2 groups: χ2 test.
5. Specify the study power (80–90%), considering the magnitude of the expected effect size and the cost, logistics, and resources.
6. Specify the α error (significance)†: conventional, 0.05; to support critical decisions, 0.01.
7. Compensate for possible dropouts: short follow-up, 5–10%; long follow-up, guided by pilot data or literature.
8. Estimate the minimum sample size and adequately report all details.
*Estimation of the expected variance is required for continuous primary outcomes. †Adjustment of α for multiple comparisons is required for studies including more than two groups. ANOVA, analysis of variance; UCL, upper confidence limit.

Our review has some limitations. First, we have restricted our reappraisal of sample size estimation to parallel-group superiority RCTs to ensure uniform conclusions; therefore, our findings cannot be extended to other types of RCTs or other study designs. Second, our literature search covered all issues of the 10 highest impact factor anaesthesia journals published in 1 year. A wider spectrum could probably have pointed out additional shortcomings in sample size calculation. However, we planned to capture the most up-to-date status of this issue by including only RCTs published in 2013. Third, we extracted the relevant sample size data from published RCTs. Comparisons with the original study protocols of registered trials were not considered. Registration of interventional clinical trials is increasingly required in anaesthesia70 and other medical specialties.71 However, at the time of preparation of this systematic review, some of the selected journals had not yet adopted a compulsory prior trial registration policy.72

In conclusion, the results of this systematic review indicated that despite a high frequency (91.7%) of sample size reporting, some of the required basic assumptions for calculation were deficient or not supported by plausible reasoning in 19.7 and 32% of studies, respectively. This could explain the large differences between the design assumptions and the observed data. The continued suboptimal reporting of sample size estimation calls for stricter guidelines at the different stages of designing and reporting of RCTs. We suggest that the use of our proposed simple algorithmic approach at the design stage and an obligatory checklist in journals’ author guides could improve the integrity of reporting sample size calculation. A future follow-up evaluation will be required to validate this recommendation.

Authors’ contributions

M.A. and A.M. contributed to the design of the review, data extraction, analysis of the results, writing, revision, and final approval of the manuscript. G.O. contributed to tabulation of data, analysis, critical revision, and final approval of the manuscript.

Acknowledgements

We would like to acknowledge the support and advice of Professor Jeffrey S. Kane for data analysis and optimization of sample size replication.

Declaration of interest

M.A. is a member of the associate editorial board of the British Journal of Anaesthesia. A.M. and G.O. have no conflict of interest to declare.

Funding

Departmental resources.
References
1. Sessler DI, Devereaux PJ. Emerging trends in clinical trial design. Anesth Analg 2013; 116: 258–61
2. Moher D, Hopewell S, Schulz KF, et al. CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials. J Clin Epidemiol 2010; 63: e1–37
3. Pua HL, Lerman J, Crawford MW, Wright JG. An evaluation of the quality of clinical trials in anesthesia. Anesthesiology 2001; 95: 1068–73
4. Todd MM. Clinical research manuscripts in Anesthesiology. Anesthesiology 2001; 95: 1051–3
5. Whitley E, Ball J. Statistics review 4: sample size calculations. Crit Care 2002; 6: 335–41
6. Charles P, Giraudeau B, Dechartres A, Baron G, Ravaud P. Reporting of sample size calculation in randomised controlled trials: review. Br Med J 2009; 338: b1732
7. Nikolakopoulos S, Roes KC, van der Lee JH, van der Tweel I. Sample size calculation in pediatric clinical trials conducted in an ICU: a systematic review. Trials 2014; 15: 274
8. Koletsi D, Fleming PS, Seehra J, Bagos PG, Pandis N. Are sample sizes clear and justified in RCTs published in dental journals? PLoS One 2014; 9: e85949
9. Ayeni O, Dickson L, Ignacy TA, Thoma A. A systematic review of power and sample size reporting in randomized controlled trials within plastic surgery. Plast Reconstr Surg 2012; 130: 78e–86e
10. Abdul Latif L, Daud Amadera JE, Pimentel D, Pimentel T, Fregni F. Sample size calculation in physical medicine and rehabilitation: a systematic review of reporting, characteristics, and results in randomized controlled trials. Arch Phys Med Rehabil 2011; 92: 306–15
11. Weaver CS, Leonardi-Bee J, Bath-Hextall FJ, Bath PM. Sample size calculations in acute stroke trials: a systematic review of their reporting characteristics and relationship with outcome. Stroke 2004; 35: 1216–24
12. Greenfield ML, Rosenberg AL, O’Reilly M, Shanks AM, Sliwinski MJ, Nauss MD. The quality of randomized controlled trials in major anesthesiology journals. Anesth Analg 2005; 100: 1759–64
13. Greenfield ML, Mhyre JM, Mashour GA, Blum JM, Yen EC, Rosenberg AL. Improvement in the quality of randomized controlled trials among general anesthesiology journals 2000 to 2006: a 6-year follow-up. Anesth Analg 2009; 108: 1916–21
14. Abdulatif M, Ahmed A, Mukhtar A, Badawy S. The effect of magnesium sulphate infusion on the incidence and severity of emergence agitation in children undergoing adenotonsillectomy using sevoflurane anaesthesia. Anaesthesia 2013; 68: 1045–52
15. Noordzij M, Tripepi G, Dekker FW, Zoccali C, Tanck MW, Jager KJ. Sample size calculations: basic principles and common pitfalls. Nephrol Dial Transplant 2010; 25: 1388–93
16. Fritz CO, Morris PE, Richler JJ. Effect size estimates: current use, calculations, and interpretation. J Exp Psychol Gen 2012; 141: 2–18
17. Ellis PD. The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results. Cambridge, UK: Cambridge University Press, 2010
18. Cohen J. Statistical Power Analysis for the Behavioral Sciences, 2nd Edn. New Jersey, USA: Lawrence Erlbaum Associates, 1988
19. Jones SR, Carley S, Harrison M. An introduction to power and sample size estimation. Emerg Med J 2003; 20: 453–8
20. Moyé LA, Tita AT. Defending the rationale for the two-tailed test in clinical research. Circulation 2002; 105: 3062–5
21. Columb MO. One-sided P values, repeated measures analysis of variance: two sides of the same story! Anesth Analg 1997; 84: 701–2
22. Aberegg SK, Richards DR, O’Brien JM. Delta inflation: a bias in the design of randomized controlled trials in critical care medicine. Crit Care 2010; 14: R77
23. Bausell RB, Li Y-F. Power Analysis for Experimental Research. A Practical Guide for the Biological, Medical and Social Sciences. Cambridge, UK: Cambridge University Press, 2002
24. Clark T, Berger U, Mansmann U. Sample size determinations in original research protocols for randomised clinical trials submitted to UK research ethics committees: review. Br Med J 2013; 346: f1135
25. Chan AW, Hróbjartsson A, Jørgensen KJ, Gøtzsche PC, Altman DG. Discrepancies in sample size calculations and data analyses reported in randomised trials: comparison of publications with protocols. Br Med J 2008; 337: a2299
26. Huić M, Marušić M, Marušić A. Completeness and changes in registered data and reporting bias of randomized controlled trials in ICMJE journals after trial registration policy. PLoS One 2011; 6: e25258
27. Schulz KF, Grimes DA. Sample size calculations in randomised trials: mandatory and mystical. Lancet 2005; 365: 1348–53
28. Adams-Huet B, Ahn C. Bridging clinical investigators and statisticians: writing the statistical methodology for a research proposal. J Investig Med 2009; 57: 818–24
29. Gibbs NM, Weightman WM. Beyond effect size: consideration of the minimum effect size of interest in anesthesia trials. Anesth Analg 2012; 114: 471–5
30. Page P. Beyond statistical significance: clinical interpretation of rehabilitation research literature. Int J Sports Phys Ther 2014; 9: 726–36
31. Jones PW, Beeh KM, Chapman KR, Decramer M, Mahler DA, Wedzicha JA. Minimal clinically important differences in pharmacological trials. Am J Respir Crit Care Med 2014; 189: 250–5
32. Kvien TK, Heiberg T, Hagen KB. Minimal clinically important improvement/difference (MCII/MCID) and patient acceptable symptom state (PASS): what do these concepts mean? Ann Rheum Dis 2007; 66 Suppl 3: iii40–1
33. Kon SS, Canavan JL, Jones SE, et al. Minimum clinically important difference for the COPD assessment test: a prospective analysis. Lancet Respir Med 2014; 2: 195–203
34. Clement ND, Macdonald D, Burnett R. Predicting patient satisfaction using the Oxford knee score: where do we draw the line? Arch Orthop Trauma Surg 2013; 133: 689–94
35. Cook JA, Hislop J, Adewuyi TE, et al. Assessing methods to specify the target difference for randomised controlled trial: DELTA (Difference ELicitation in TriAls) review. Health Technol Assess 2014; 18: 1–175
36. Vickers AJ. Underpowering in randomized trials reporting a sample size calculation. J Clin Epidemiol 2003; 56: 717–20
37. Wittes J. Sample size calculations for randomized controlled trials. Epidemiol Rev 2002; 24: 39–53
38. Sim J, Lewis M. The size of a pilot study for a clinical trial should be calculated in relation to consideration of precision and efficiency. J Clin Epidemiol 2012; 65: 301–8
39. Chen H, Zhang N, Lu X, Chen S. Caution regarding the choice of standard deviations to guide sample size calculations in clinical trials. Clin Trials 2013; 10: 522–9
40. Friede T, Kieser M. Sample size recalculation in internal pilot study designs: a review. Biom J 2006; 48: 537–55
41. Wittes J, Brittain E. The role of internal pilot studies in increasing the efficiency of clinical trials. Stat Med 1990; 9: 65–71
42. Chalmers I, Matthews R. What are the implications of optimism bias in clinical research? Lancet 2006; 367: 449–50
43. Myles PS. Stopping trials early. Br J Anaesth 2013; 111: 133–5
44. Levine M, Ensom MH. Post hoc power analysis: an idea whose time has passed? Pharmacotherapy 2001; 21: 405–9
45. Goodman SN, Berlin JA. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med 1994; 121: 200–6
46. Avidan MS, Wildes TS. Power of negative thinking. Br J Anaesth 2015; 114: 3–5
47. Guyatt GH, Mills EJ, Elbourne D. In the era of systematic reviews, does the size of an individual trial still matter? PLoS Med 2008; 5: e4
48. Altman DG. Statistics and ethics in medical research: III How large a sample? Br Med J 1980; 281: 1336–8
49. Bell ML, Kenward MG, Fairclough DL, Horton NJ. Differential dropout and bias in randomised controlled trials: when it matters and when it may not. Br Med J 2013; 346: e8668
50. Fewtrell MS, Kennedy K, Singhal A, et al. How much loss to follow-up is acceptable in long-term randomised trials and prospective studies? Arch Dis Child 2008; 93: 458–61
51. Gehling M, Hermann B, Tryba M. Meta-analysis of dropout rates in randomized controlled clinical trials: opioid analgesia for osteoarthritis pain. Schmerz 2011; 25: 296–305
52. Kopman AF, Lien CA, Naguib M. Neuromuscular dose–response studies: determining sample size. Br J Anaesth 2011; 106: 194–8
53. Mudge JF, Baker LF, Edge CB, Houlahan JE. Setting an optimal α that minimizes errors in null hypothesis significance tests. PLoS One 2012; 7: e32734
54. Bacchetti P. Current sample size conventions: flaws, harms, and alternatives. BMC Med 2010; 8: 17
55. Ma Y, Mazumdar M, Memtsoudis SG. Beyond repeated-measures analysis of variance: advanced statistical methods for the analysis of longitudinal data in anesthesia research. Reg Anesth Pain Med 2012; 37: 99–105
56. Baron G, Perrodeau E, Boutron I, Ravaud P. Reporting of analyses from randomized controlled trials with multiple arms: a systematic review. BMC Med 2013; 11: 84
57. Guo Y, Logan HL, Glueck DH, Muller KE. Selecting a sample size for studies with repeated measures. BMC Med Res Methodol 2013; 13: 100
58. Brooks GP, Johanson GA. Sample size consideration for multiple comparison procedures in ANOVA. JMASM 2011; 10: 97–109
59. Chan AW, Tetzlaff JM, Gøtzsche PC, et al. SPIRIT 2013 explanation and elaboration: guidance for protocols of clinical trials. Br Med J 2013; 346: e7586
60. Can OS, Yilmaz AA, Hasdogan M, et al. Has the quality of abstracts for randomised controlled trials improved since the release of Consolidated Standards of Reporting Trial guideline for abstract reporting? A survey of four high-profile anaesthesia journals. Eur J Anaesthesiol 2011; 28: 485–92
61. Latronico N, Metelli M, Turin M, Piva S, Rasulo FA, Minelli C. Quality of reporting of randomized controlled trials published in Intensive Care Medicine from 2001 to 2010. Intensive Care Med 2013; 39: 1386–95
62. Turner L, Shamseer L, Altman DG, et al. Consolidated standards of reporting trials (CONSORT) and the completeness of reporting of randomised controlled trials (RCTs) published in medical journals. Cochrane Database Syst Rev 2012; 11: MR000030
63. Strech D, Soltmann B, Weikert B, Bauer M, Pfennig A. Quality of reporting of randomized controlled trials of pharmacologic treatment of bipolar disorders: a systematic review. J Clin Psychiatry 2011; 72: 1214–21
64. Folkes A, Urquhart R, Grunfeld E. Are leading medical journals following their own policies on CONSORT reporting? Contemp Clin Trials 2008; 29: 843–6
65. Hopewell S, Altman DG, Moher D, Schulz KF. Endorsement of the CONSORT statement by high impact factor medical journals: a survey of journal editors and journal ‘Instructions to Authors’. Trials 2008; 9: 20
66. Schriger DL, Arora S, Altman DG. The content of medical journal instructions for authors. Ann Emerg Med 2006; 48: 743–9
67. Altman DG. Endorsement of the CONSORT statement by high impact medical journals: survey of instructions for authors. Br Med J 2005; 330: 1056–7
68. Moja LP, Moschetti I, Nurbhai M, et al. Compliance of clinical trial registries with the World Health Organization minimum data set: a survey. Trials 2009; 10: 56
69. Ross JS, Mulvey GK, Hines EM, Nissen SE, Krumholz HM. Trial publication after registration in ClinicalTrials.gov: a cross-sectional analysis. PLoS Med 2009; 6: 1–9
70. Myles PS. Trial registration for anaesthesia studies. Br J Anaesth 2013; 110: 2–3
71. Huser V, Cimino JJ. Evaluating adherence to the International Committee of Medical Journal Editors’ policy of mandatory, timely clinical trial registration. J Am Med Inform Assoc 2013; 20: e169–74
72. Wager E, Elia N. Why should clinical trials be registered? Eur J Anaesthesiol 2014; 31: 397–400

Appendix I: Formulae used for effect size calculations

Continuous primary outcome
• For studies including two groups, the effect size was calculated by taking the difference in means and dividing that number by the SD of the control group (Glass’s Δ).16 17
• For studies including three or more groups, the effect size was calculated as follows. We first calculated Cohen’s d:

$$d = \frac{m_{\max} - m_{\min}}{\sigma}$$

where $m_{\max}$ and $m_{\min}$ are the highest and lowest means, respectively, and σ is the SD within the population. Then d was transformed to f using the following formula (Cohen’s f):

$$f = d\sqrt{\frac{1}{2k}}$$

where k is the number of population means.18
Binary primary outcome
• For studies including two groups, the effect size was calculated using the following formula:

$$\frac{p_1 - p_2}{\sqrt{\bar{P}\,(1 - \bar{P})}}$$

where $p_1$ and $p_2$ are the proportions in the two groups and $\bar{P} = (p_1 + p_2)/2$ is the mean of the two values.5
• For studies with more than two proportions, the χ2 statistic was used to estimate the effect size:

$$\omega = \sqrt{\sum_{i=1}^{m} \frac{(O_i - E_i)^2}{E_i}}$$

where ω is the effect size, $O_i$ and $E_i$ denote the observed and expected proportions, respectively, m is the total number of cells, and i is the cell number.
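For completeness, these formulae translate directly into code; the sketch below is a plain transcription with illustrative inputs (the observed and expected proportions passed to the χ2-based ω are hypothetical).

```python
# Direct transcription of the Appendix I effect size formulae.
from math import sqrt

def glass_delta(mean_treatment, mean_control, sd_control):
    """Two groups, continuous outcome: difference in means / control SD."""
    return (mean_treatment - mean_control) / sd_control

def cohens_f(means, sd_within):
    """>= 3 groups: d = (m_max - m_min)/sigma, then f = d*sqrt(1/(2k))."""
    d = (max(means) - min(means)) / sd_within
    return d * sqrt(1 / (2 * len(means)))

def two_group_binary_es(p1, p2):
    """Two groups, binary outcome: (p1 - p2)/sqrt(pbar*(1 - pbar))."""
    pbar = (p1 + p2) / 2
    return (p1 - p2) / sqrt(pbar * (1 - pbar))

def cohens_w(observed, expected):
    """> 2 proportions: w = sqrt(sum((O_i - E_i)^2 / E_i))."""
    return sqrt(sum((o - e) ** 2 / e for o, e in zip(observed, expected)))

print(glass_delta(14.0, 10.0, 8.0))               # 0.5
print(cohens_f([10.0, 12.0, 16.0], 8.0))
print(two_group_binary_es(0.30, 0.15))
print(cohens_w([0.5, 0.3, 0.2], [1/3, 1/3, 1/3]))
```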

Handling editor: J. G. Hardman
