
STATISTICS AND RESEARCH DESIGN

Sample calculations for comparison of 2 means


Nikolaos Pandis, Associate Editor of Statistics and Research Design
Bern, Switzerland, and Corfu, Greece

Am J Orthod Dentofacial Orthop 2012;141:519-21
0889-5406/$36.00
Copyright © 2012 by the American Association of Orthodontists.
doi:10.1016/j.ajodo.2011.12.010

A common question in orthodontic research is “how many patients do I need for my study?” The next articles will introduce relevant concepts that will help readers to understand how to appropriately plan the size of a trial.

The objective of a clinical trial is to provide reliable evidence regarding the effect or no effect of a treatment modality. A sufficient number of participants allows the researcher to detect a difference with reasonable precision (good power) if a difference exists, or allows one to be reasonably certain that no difference exists if the results show no difference. Small studies tend to be less convincing and inconclusive because they often have low power. Recruiting more patients than necessary is a waste of resources and even unethical, since more patients than necessary could be exposed to a potentially ineffective therapy. There is a close relationship between power and sample size; usually, as the sample size increases, study power is also expected to increase. Ideally, a balance between study power, a clinically important difference to be detected, trial feasibility, and credibility is required.

What is study power? Power is the probability of observing a difference between treatment groups when a difference exists. A study designed to detect a clinically important difference with, let's say, a power of 80% assumes an 80% chance of observing a difference if there is a difference, and also assumes a 20% chance of missing the difference (false negative) when such a difference exists. Allowing a 20% (power 80%) or a 10% (power 90%) chance of a false negative (type II error or beta) is unavoidable, since a sample calculation with 100% power (type II error approaching zero) would require an infinite number of participants. Type I error, or α (alpha), refers to false-positive results and indicates that we are willing to accept a 5% (α = 0.05) chance of observing a statistically significant difference when no such difference exists between the treatment groups. See Table I for descriptions and relationships of error types and power.

In this article, we will perform a sample calculation for a normally distributed quantitative outcome for a 2-arm trial with 1:1 allocation ratio (2-sided test). Sample calculations are based on assumptions, and we should aim to detect differences between treatment groups, if they exist, that have clinical importance rather than statistical significance.

Before we proceed with the sample calculation, we need to define the following.

• The research question.
• The principal outcome measure of the trial.
• μ1, the anticipated mean response for the standard or control treatment.
• μ2, the anticipated mean response for the alternative treatment and hence the minimum clinically important difference (μ2 − μ1) between treatment arms that we would like to detect.
• The standard deviation (for continuous outcomes only).
• The degree of certainty with which we want to be able to detect the treatment difference (power) and the level of significance (type I error or α).

We will use an example trial to illustrate the process. Pandis et al,1 in a study assessing treatment time to alignment and dental changes between self-ligating and conventional appliances, found that the molar width difference at the end of the follow-up period was 2 mm (SD, 2 mm), a statistically significant finding (Table II). This study was not randomized, and the authors used different wires. Was the 2-mm difference in molar width genuine, or was it observed because wires of different shapes were used for the treatment groups? We would like to confirm or refute those findings by adopting a randomized controlled trial design and using exactly the same wire shape and sequence for both treatment groups. As previously explained, to perform the sample calculation, we need to decide what would be a clinically important difference that we want to detect. We can refer to the previous study and assume that a molar width difference of 2 mm between the 2 appliances at a certain time after treatment initiation has clinical importance. Then we can design a randomized controlled trial with 90% power and a 5% level of significance, which
will detect a 2-mm difference between the treatment groups if such a difference really exists. Therefore, μ2 − μ1 = 2 mm, power = 90%, and α = 0.05, and let us assume that the standard deviation (σ) is 2 mm for both treatment arms by also referring to the cited study. We will use the following formula for 2 means from Pocock.2

$$n = f(\alpha, \beta) \cdot \frac{2\sigma^2}{(\mu_1 - \mu_2)^2}$$

where n is the number of participants per treatment arm, f(α, β) is a function of power and significance level, and Table III displays the appropriate substitution values. If we perform the appropriate substitutions in the formula, we will get:

$$n = 10.5 \times \frac{2 \times 2^2}{2^2} = 21$$

A total of 42 patients for both treatment arms are required to be able to detect a 2-mm difference in molar width between treatment groups with a power of 90% and a 5% level of significance. If we use α = 0.01 and power = 90%, then

$$n = 14.9 \times \frac{2 \times 2^2}{2^2} = 29.8 \approx 30$$

and a total of 60 patients are needed for both treatment arms. It would be prudent to add more participants to the calculated sample depending on the expected number of lost patients during the follow-up period.

In the above calculations, we assumed that the observations are independent, the numbers of participants per trial arm are the same, and there are no losses to follow-up. By experimenting with the formula, we can see that the required sample size increases when the required difference of clinical importance is decreased, the power level is increased, the alpha level is decreased, or the standard deviation is increased, and vice versa. Therefore, sample size calculations can be manipulated by changing the assumptions; however, changes should be sensible and preferably in accordance with previous research or a pilot study. Sample sizes are often calculated by using software or referring to tables.3

Power calculations should be considered at the design stage; they have limited or no value after the trial is conducted. After data analysis is complete, power is assessed by looking at the confidence intervals of the estimates. Narrow confidence intervals indicate high power and precision, and vice versa.

Finally, a statement such as “the trial has 90% power” is ambiguous. A more appropriate way to comment on power with our example is as follows: “With 21 subjects per group, the trial has 90% power to detect a difference of 2 mm in molar width between conventional and self-ligating appliances at the 5% significance level.”

The next article will present sample calculations for proportions.

KEY POINTS

• Sample calculation should be based on clinically meaningful differences, consider previous knowledge, and balance statistical precision, trial feasibility, and credibility.
• Power is considered at the design stage, and it has limited or no value after the trial is conducted.

Table I. Types of errors in hypothesis testing at a 5% significance level and 80% power

Result of           In reality, no difference exists            In reality, a difference exists
significance test
Not significant     1 − α (= 0.95 or 95%)                       β or type II error (= 0.20 or 20%)
                    Correct conclusion, accepting the null      β = 1 − power
                    hypothesis (Ho) when the Ho is true         Incorrect conclusion, rejecting the alternative
                                                                hypothesis (Ha) when the Ha is true
Significant         α (= 0.05 or 5%) or type I error            1 − β (= 1 − 0.20 = 0.8 or 80%)
                    α = level of significance                   1 − β = power
                    Incorrect conclusion, rejecting the Ho      Correct conclusion, rejecting the Ho when the
                    when the Ho is true                         Ha is true

Table II. Intermolar width changes induced by alignment per bracket group, adapted from Pandis et al1

Dental cast measurement         Conventional (n = 27)   Self-ligating (n = 27)
                                Mean (SD)               Mean (SD)
Initial intermolar width (mm)   44.2 (2.5)              44.2 (2.6)
Final intermolar width (mm)     44.6 (2.7)              46.2 (1.7)

Table III. Values of f(α, β) for different combinations of power and level of significance, adapted from Pocock2

            β = 0.05       β = 0.1        β = 0.2        β = 0.5
            (95% power)    (90% power)    (80% power)    (50% power)
α = 0.05    13.0           10.5           7.85           3.84
α = 0.01    17.8           14.9           11.7           6.63
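The Table III values and the worked example can be checked with a short script. This is a minimal sketch, not part of the article. It assumes that f(α, β) is the usual normal-approximation multiplier for a 2-sided test, (z for 1 − α/2 plus z for 1 − β) squared, which is an assumption that happens to reproduce the tabulated values to the precision shown. The function names are hypothetical, and the loss-to-follow-up adjustment (dividing by 1 minus an assumed loss rate) is one common convention rather than something the article specifies.

```python
import math
from statistics import NormalDist


def f_alpha_beta(alpha: float, beta: float) -> float:
    """Multiplier f(alpha, beta) for a 2-sided test.

    Assumed form (normal approximation): (z_{1-alpha/2} + z_{1-beta})^2.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = std_normal.inv_cdf(1 - beta)
    return (z_alpha + z_beta) ** 2


def n_per_group(alpha: float, beta: float, sd: float, diff: float) -> float:
    """Unrounded participants per arm: f(alpha, beta) * 2 * sd^2 / diff^2."""
    return f_alpha_beta(alpha, beta) * 2 * sd ** 2 / diff ** 2


def inflate_for_losses(n: float, loss_rate: float) -> int:
    """Hypothetical helper: inflate n for an assumed loss-to-follow-up rate."""
    return math.ceil(n / (1 - loss_rate))


if __name__ == "__main__":
    # Reproduce the layout of Table III (rows: alpha, columns: beta).
    for alpha in (0.05, 0.01):
        row = [round(f_alpha_beta(alpha, beta), 2) for beta in (0.05, 0.1, 0.2, 0.5)]
        print(f"alpha = {alpha}: {row}")

    # Worked example: 2-mm difference, SD 2 mm, 90% power, alpha 0.05.
    n = n_per_group(alpha=0.05, beta=0.10, sd=2.0, diff=2.0)
    print(f"unrounded n per group: {n:.2f}")

    # Illustrative only: a 10% loss rate is an assumption, not from the article.
    print("n per group allowing for 10% losses:", inflate_for_losses(21, 0.10))
```

With the unrounded multiplier, the per-group figure for the worked example comes out at about 21.01 rather than exactly 21. The article's 21 follows from using the rounded table value of 10.5, and a calculation that always rounds up would give 22 per group.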

American Journal of Orthodontics and Dentofacial Orthopedics, April 2012, Vol 141, Issue 4

REFERENCES

1. Pandis N, Polychronopoulou A, Eliades T. Self-ligating vs conventional brackets in the treatment of mandibular crowding: a prospective clinical trial of treatment duration and dental effects. Am J Orthod Dentofacial Orthop 2007;132:208-15.
2. Pocock SJ. Clinical trials: a practical approach. Chichester, United Kingdom: Wiley; 1983. p. 125-9.
3. Machin D, Campbell MJ, Tan SB, Tan SH. Sample size tables for clinical trials. 3rd ed. Oxford, United Kingdom: Wiley-Blackwell; 2009. p. 14-8.

