
Acta Pædiatrica ISSN 0803-5253

REVIEW ARTICLE

Understanding type I and type II errors, statistical power and sample size
Anthony K. Akobeng (aakobeng@sidra.org)1,2
1. Sidra Medical and Research Centre, Doha, Qatar
2. Royal Manchester Children’s Hospital, Manchester Academic Health Science Centre, University of Manchester, Manchester, UK

Keywords: Power, Sample size, Type I error, Type II error

Correspondence: Anthony K Akobeng, Sidra Medical and Research Center, PO Box 26999, Doha, Qatar.
Tel: +974 4012 5850 | Email: aakobeng@sidra.org

Received 11 October 2015; revised 6 December 2015; accepted 29 February 2016.

DOI: 10.1111/apa.13384

ABSTRACT
The results of a clinical trial may be subject to random error because of the variability in the measured data, which arises purely by chance. There are two types of random error – type I error and type II error. In this study, type I and type II errors are explained, and the important concepts of statistical power and sample size estimation are discussed.
Conclusion: The most important way of minimising random errors is to ensure an adequate sample size; that is, a sufficiently large number of patients should be recruited for the study.

INTRODUCTION
The aim of a clinical trial is to gather information on the benefits and risks of a treatment that would be applicable to a target population of people with a certain disease (e.g. babies with necrotising enterocolitis, children with asthma). Practically, it is usually impossible to perform a study on all the people in the target population, and studies are performed on a sample of individuals drawn from the population. It is then the hope that the results obtained for the sample can be extrapolated to all individuals in the population (1). For instance, when a new treatment is found to be more effective than placebo for treating acute exacerbations of asthma in individuals in a research study, we assume that the results from the study will be applicable to the whole population of people with asthma.

However, the accuracy of the study results and, therefore, the validity of our assumption are subject to random error, even if bias (systematic error) is excluded (1,2). Random error is due to the variability in the measured data, which arises purely by chance. Thus, even in a well-designed, well-executed clinical trial, we cannot be certain whether the results obtained accurately reflect the true population results, that is, whether the results in the study are real or arose by chance, as it is possible for random error (the play of chance) alone to lead to an inaccurate estimate of the treatment effect.

Statistical hypothesis testing allows us to make inferences about the whole of the target population based on information obtained from a sample of individuals from the population by providing some information on how much random error might be associated with a result. In other words, hypothesis testing quantifies the evidence that the collected data support a specified hypothesis about the population (2).

Key notes
• A type I error occurs when we wrongly conclude that one treatment is more effective than another when, in fact, it is not.
• A type II error occurs when we wrongly conclude that there is no difference in treatment effects when, in fact, there is a difference.
• An important way of minimising random errors is to ensure adequate sample size.

HYPOTHESIS TESTING
Hypothesis testing is used to determine whether there is a difference between two comparison groups in a study (3). To do this, we initially make an assumption that there is no difference between the groups.


This assumption of ‘no difference’ is known as the null hypothesis (4). Following statistical analysis, we either reject or fail to reject the null hypothesis (see below). When the null hypothesis is rejected, the alternate hypothesis (the opposite of the null hypothesis – that there is a difference between the groups) is accepted.

There are a number of statistical tests that may be used in hypothesis testing, and these can be broadly classified as parametric tests (e.g. t-test, ANOVA) or nonparametric tests (e.g. Mann–Whitney U-test, Kruskal–Wallis test). The choice of the correct test depends mainly on the nature of the data and the study design. Whatever test is used, the end result is to generate what is called a p value (or probability value) (1). Being a probability, the p value can take any value between 0 and 1.

The p value resulting from a statistical hypothesis test is used to establish whether the sample data support the null hypothesis or provide evidence of a difference as specified by the alternative hypothesis (2). The p value therefore represents the strength of evidence in support of the null hypothesis. A large p value suggests that the sample data support the null hypothesis, whereas a small p value suggests that they do not. By convention, the cut-off between a large and a small p value is set at 0.05 (5%) (5). What this means is that if the p value is less than 0.05, we say that the evidence in favour of the null hypothesis is so small that we reject the null hypothesis and accept the alternative hypothesis that there is a statistical difference between the treatment groups; that is, the results are unlikely to have occurred by chance. When the p value is below the cut-off, the result is generally referred to as being ‘statistically significant’. If the p value is equal to or greater than 0.05, then we say that the evidence in favour of the null hypothesis is big enough for us to fail to reject the null hypothesis that there is no difference between the groups; that is, the results might have occurred by chance. In such a situation, we are unable to demonstrate a statistical difference between the groups and the result is referred to as ‘not statistically significant’.
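To make this concrete, the short Python sketch below shows how a p value is generated for two comparison groups with both a parametric and a nonparametric test. It is an illustrative addition rather than part of the original article; the scipy library is assumed to be available, and the blood pressure values are made up purely for the example.

# Generating a p value for two hypothetical treatment groups (illustration only).
from scipy import stats

# Made-up systolic blood pressure (mmHg) in two groups
treatment = [128, 131, 125, 122, 130, 127, 124, 129]
placebo = [135, 138, 132, 136, 140, 133, 137, 134]

# Parametric test: independent-samples t-test
t_stat, p_ttest = stats.ttest_ind(treatment, placebo)

# Nonparametric alternative: Mann-Whitney U-test
u_stat, p_mwu = stats.mannwhitneyu(treatment, placebo, alternative='two-sided')

print(f"t-test p value:       {p_ttest:.4f}")
print(f"Mann-Whitney p value: {p_mwu:.4f}")

# If p < 0.05 we would reject the null hypothesis of 'no difference';
# otherwise we would fail to reject it.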
RANDOM ERRORS
As explained above, the aim of hypothesis testing is to get some idea of whether any observed differences seen between comparison groups are due to the treatment under investigation or due to chance. When p values are used, results are usually expressed as either ‘statistically significant’ (unlikely to have occurred by chance) or ‘not statistically significant’ (might have occurred by chance). There are four possible conclusions that may be drawn from these statistical results (6). These are illustrated in Figure 1. As shown in this figure, two of the four possibilities lead to correct conclusions whilst the other two possibilities result in wrong conclusions (errors). In the absence of bias (systematic error), random error is responsible for these wrong statistical conclusions, which lead to inaccurate estimates of the treatment effect. Two types of random error can occur when hypothesis testing is performed. These are referred to as type I (α) and type II (β) errors (6–8).

Figure 1 Four possible conclusions in a clinical trial: a significant test result when there truly is a difference, and a non-significant result when there is truly no difference, are correct conclusions; a significant result when there is truly no difference is a type I error; a non-significant result when there truly is a difference is a type II error.

Type I error
A type I error occurs when the null hypothesis of no difference is rejected when, in fact, it is true. In such a situation, we wrongly accept that there is a difference between treatment groups (a ‘false-positive’ result) and wrongly conclude that one treatment is more effective than another when, in fact, it is not. The probability of making a type I error (α) is equivalent to the cut-off value for statistical significance (9). Thus, the chosen cut-off value represents the probability of a type I error, that is, the risk of ‘false positives’ that is accepted as being tolerable in the study. What this means is that when a researcher uses the conventional 0.05 (5%) as the threshold for statistical significance, the probability of a type I error (α) is 5%, indicating that a positive conclusion based on finding a p value <0.05 will be wrong 5% of the time. In other words, if the study was repeated 20 times, the researchers would accept that the results of 1 of those 20 replicates could be a false-positive result (10). Although, by convention, 5% is generally taken to be an acceptable type I error rate, researchers may choose to use more stringent rates. For instance, if the consequences of a false-positive result are serious, for example if the risk of harm associated with a false-positive result is great, the researchers may decide to reduce the tolerable type I error rate to, say, 1% (p value <0.01).

There are a number of issues that could increase the likelihood of a type I error in a study (11,12). Most of these are related to multiple testing, which occurs when researchers investigate several endpoints and perform more and more statistical tests on the same data. The more tests are performed on a set of data, the more likely it is that one or more of them will yield a ‘statistically significant difference’ just because of the play of chance. The problem of multiple testing may also be encountered in other situations such as performing secondary analyses not planned in the original study design, performing multiple interim analyses of accumulating data before all required patients have been recruited and stopping trials earlier than planned (9,12).
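This inflation of the false-positive risk can be quantified: if k independent tests are each carried out at α = 0.05 on data in which no true differences exist, the probability of at least one false-positive result is 1 − (1 − α)^k. The short sketch below is an illustrative addition (not from the original article) and assumes the tests are independent.

# Chance of at least one false-positive result across k independent tests,
# each performed at the 5% significance level (illustration only).
alpha = 0.05
for k in (1, 5, 10, 20):
    p_any_false_positive = 1 - (1 - alpha) ** k
    print(f"{k:>2} tests: {p_any_false_positive:.2f}")
# Prints roughly 0.05, 0.23, 0.40 and 0.64: with 20 tests there is about a
# 64% chance of at least one spurious 'significant' finding.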
There are three ways of minimising the problem of type I errors (11).


One way is to avoid multiple testing as described above. The other way is to choose a more stringent p value as the cut-off level for statistical significance (say 0.01 instead of 0.05). This approach is, however, likely to increase the risk of a false-negative result (a type II error; see below) unless the sample size of the study is increased appropriately to take account of the new alpha level. Another approach would be to refrain from referring to results as being ‘statistically significant’ or ‘not statistically significant’ (i.e. not using hypothesis testing) and simply describing them as estimates with confidence intervals (1).
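One widely used way of choosing a more stringent cut-off when several endpoints are tested, although it is not discussed by name in this article, is the Bonferroni correction, which divides the overall significance level by the number of planned tests. The sketch below is an illustrative addition with made-up p values.

# Bonferroni correction: divide the overall significance level by the number
# of planned tests so that the overall type I error rate stays near 5%.
alpha_overall = 0.05
p_values = [0.003, 0.020, 0.040, 0.300, 0.750]   # made-up p values for 5 endpoints
alpha_per_test = alpha_overall / len(p_values)   # 0.01

for i, p in enumerate(p_values, start=1):
    verdict = "significant" if p < alpha_per_test else "not significant"
    print(f"endpoint {i}: p = {p:.3f} -> {verdict} at corrected alpha = {alpha_per_test}")
# Only the first endpoint (p = 0.003) remains significant after correction.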
Type II error
As shown in Figure 1, when the results of a study show ‘no statistically significant’ difference between treatments, this can mean two things: (i) that there is truly no difference between the treatments (true negative), or (ii) that the study has been unable to detect a difference because of the play of chance (false negative) (11). When the second possibility occurs, we say there has been a type II error. In other words, a type II error occurs when we accept the null hypothesis of no difference and conclude that there is no difference in treatment effects when, in fact, there is a difference. In such a situation, we obtain a ‘false-negative’ result and may wrongly conclude that a treatment is not effective when, in fact, it is.

The probability of making a type II error is referred to as beta (β). By convention, β is usually set at 20% or 0.20, which means that the researchers are willing to accept a 20% chance of wrongly concluding that there is no difference between groups. If the consequences of a false-negative conclusion are serious, the researchers may choose to decrease the acceptable level of β to, say, 10% or 5% (10).

The understanding of type II errors is very important, especially as there are several studies in the medical literature that firmly make conclusions of ‘no difference’ between interventions without due regard being paid to the possibility of a type II error. Closely related to type II errors is the concept of statistical power, which is described below.

STATISTICAL POWER
In simple terms, the concept of statistical power (usually referred to simply as ‘power’) refers to the ability of a statistical test to detect a true difference between two groups. In other words, it is a measure of the ability of the test to identify that there is a difference between treatments when such a difference truly exists. Power derives from the type II error rate (β) and is mathematically the complement of β, that is, power = 1 − β (7).

As with the setting of the level of statistical significance (α), the choice of β, and therefore of power (1 − β), is arbitrary, and the value should be decided by the researchers prior to the trial (6). By convention, a minimum power of 80% is generally considered adequate (13), and in studies where power is reported, researchers commonly choose a power of 80% (0.8). When, in a clinical trial, the statistical test is reported to have a power of 80%, it means that the test has an 80% probability of correctly detecting a difference between groups when such a difference exists. Put another way, it means that the probability of the test not being able to detect such a true difference (the probability of making a type II error) is 20%. When, as described under ‘Type II error’ (above), researchers choose to set β at 10%, the power of the study will be 90%, which means that the researchers are willing to accept a 10% chance of wrongly concluding that there is no difference between groups.

The power of a study is determined by several factors including the frequency of the outcome being studied, the magnitude of the effect, the study design and, more importantly, the sample size of the study. For an RCT to have a reasonable chance of answering the research question it sets out to address, the sample size must be large enough; that is, there must be enough participants in the study.
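Power can also be understood by simulation: if we repeatedly simulate a trial in which a true difference of a given size exists and count how often the test returns p < 0.05, that proportion estimates the power. The Python sketch below is an illustrative addition (not part of the original article); it assumes numpy and scipy are available, and the effect size, standard deviation and group size are made-up values.

# Estimating the power of a two-sample t-test by simulation (illustration only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n_per_group = 64        # patients per group (made-up value)
true_difference = 5.0   # true difference in means between the groups
sd = 10.0               # standard deviation of the outcome
n_simulations = 5000

significant = 0
for _ in range(n_simulations):
    control = rng.normal(0.0, sd, n_per_group)
    treated = rng.normal(true_difference, sd, n_per_group)
    _, p = stats.ttest_ind(treated, control)
    if p < 0.05:
        significant += 1

print(f"Estimated power: {significant / n_simulations:.2f}")
# Roughly 0.80 for these inputs: about 80% of such trials would detect the
# true difference, and about 20% would give a false-negative result.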

SAMPLE SIZE
At the planning stage of a study, a calculation of sample size (i.e. the number of patients needed for the trial) should be performed. The calculation allows the determination of a sample size that is large enough to have a good chance of detecting a benefit of treatment, if such a benefit exists, but which is also no larger than it needs to be (14).

Factors that affect sample size calculations
Four components are required to allow the calculation of an adequate sample size (6,7,15). These are as follows:
• the type I error rate (α),
• the type II error rate (β) or statistical power (1 − β),
• the smallest effect size of clinical interest,
• the nature of the data:
  o the variability of the outcome of interest (for continuous outcomes)
  o the expected frequency of the outcome of interest in the control group (for dichotomous outcomes).

Type I error rate (α)
This is the same as the critical level of statistical significance set by the researchers, which is usually 5% (see above).

Type II error rate (β) or, more commonly, statistical power (1 − β)
The type II error rate (β) is usually set at 20%, and power (1 − β) at 80% (see above).

The smallest effect size of clinical interest
This refers to a treatment effect that the researchers are interested in and is usually taken to be the smallest difference in outcome between the two groups that is considered to be clinically important. In other words, it is the standard that the treatment should meet to be considered efficacious (10).


Determining the minimum clinically important difference is not always straightforward. Some researchers base this on the results of previous similar studies or pilot work (16,17). Clinical judgment may also be used to estimate the clinically important difference, and one approach is to consider differences that have led to the adoption of new therapies for the condition of interest in the past. Another approach is to survey experts and patients to determine what difference in outcomes would need to be demonstrated for them to adopt the new therapy in view of its costs and known risks (17).

The smaller the clinically relevant difference, the more patients are required for the study, and the bigger the difference, the fewer patients are required. It must be noted that if, to reduce the calculated sample size, the chosen minimum clinically important difference is too big, then the study will still be underpowered to detect smaller but clinically important effects (18). If, on the other hand, the chosen minimum clinically important difference is too small, then the required sample size may be substantially and unnecessarily increased.

Nature of the data
When the outcome of interest is continuous (e.g. blood pressure, body mass index), sample size is affected by the variability of the outcome (19,20). In general, the greater the variability of the outcome variable, the larger the sample size required to assess whether an observed effect is a true effect. Variability, in this case, is estimated by means of the standard deviation (20). When, as is often the case, the variability is unknown, the researchers may estimate it from a previous similar study, pilot work or clinical observation.

For dichotomous outcomes (e.g. remission versus no remission), the expected frequency of the outcome of interest in the control group (the event rate in the control group) influences the number of patients required (6,7). Again, the researchers may estimate this from a similar study, pilot work or clinical observation. If the predicted frequency used for the study’s sample size calculation is very different from the rate actually observed in the study, the study may have been underpowered to show a difference between treatments (12).
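These components combine into simple closed-form estimates of the required sample size. The Python sketch below is an illustrative addition (not from the original article) that uses widely quoted normal-approximation formulas for a continuous outcome and for a dichotomous outcome; the scipy library is assumed to be available, and the example inputs (a 5 mmHg difference with a standard deviation of 10 mmHg, and an improvement in remission rate from 40% to 60%) are made up.

# Approximate per-group sample size from the components discussed above
# (normal-approximation formulas; illustration only).
from scipy.stats import norm

def n_continuous(delta, sd, alpha=0.05, power=0.80):
    # Per-group n to detect a difference 'delta' in a continuous outcome
    # with standard deviation 'sd', using a two-sided test.
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * sd ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2

def n_dichotomous(p_control, p_treatment, alpha=0.05, power=0.80):
    # Per-group n to detect a change from 'p_control' to 'p_treatment'
    # in a dichotomous outcome, using a two-sided test.
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return (z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2

print(round(n_continuous(delta=5, sd=10)))    # about 63 patients per group
print(round(n_dichotomous(0.40, 0.60)))       # about 94 patients per group

Because the required number of patients varies with the square of the inverse of the difference, halving the minimum clinically important difference in this sketch roughly quadruples the required sample size, which is the relationship described in the text above.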
The importance of sample size
Prestudy calculation of sample size is very important. When the sample size of a study is too small, it may be impossible to detect any true differences in outcome between the groups. Such a study might be a waste of resources and potentially unethical. Frequently, however, a number of small studies are published claiming no difference in outcome between groups without the power of the studies being reported. Right from the planning stage of a trial, researchers should aim to recruit enough participants to ensure that the study has a high probability of detecting, as statistically significant, the smallest effect that would be regarded as clinically important, and the sample size estimation and the related power calculation should be reported (13).

The primary endpoint of the study, that is, the clinical feature considered to be the most relevant to the condition and the treatment being studied, should be used as the basis for sample size calculations. It is also important to realise that when the stated power of a study is based on the total number of participants in the study, any subgroup analysis performed within the study may not have adequate statistical power, because the number of people in the subgroups will be lower than the total sample size of the study. One way of avoiding this problem is for researchers (at the planning stage) to base their sample size calculation on the statistical power needed to detect the desired treatment effect in the smallest anticipated subgroup (6).

CONCLUSIONS
The results of large studies are associated with less random error, whilst those of small studies may be subject to a greater degree of random error. Therefore, for a study to be able to properly answer the research question that it sets out to answer, due consideration should be paid to the number of patients that needs to be recruited into the study (the sample size). Researchers should report a priori estimates of sample size in clinical trial reports. Readers need to understand the principles discussed in this study and be critical of studies that do not report appropriate sample size calculations.

FUNDING AND CONFLICT OF INTERESTS
The author has no financial disclosures or conflicts of interest.

References
1. Akobeng AK. Confidence intervals and p-values in clinical decision making. Acta Paediatr 2008; 97: 1004–7.
2. Sedgwick P. Pitfalls of statistical hypothesis testing: type I and type II errors. BMJ 2014; 349: g4287.
3. Whitley E, Ball J. Statistics review 3: hypothesis testing and P values. Crit Care 2002; 6: 222–5.
4. Last JM. A dictionary of epidemiology. New York: Oxford University Press, 2001.
5. Akobeng AK. Understanding randomised controlled trials. Arch Dis Child 2005; 90: 840–4.
6. Fletcher RW, Fletcher SW. Clinical epidemiology – the essentials. Philadelphia: Lippincott Williams & Wilkins, 2005.
7. Schulz KF, Grimes DA. Sample size calculations in randomised trials: mandatory and mystical. Lancet 2005; 365: 1348–53.
8. Coggon D. Statistics in clinical practice. London: BMJ Publishing Group, 1995.
9. Altman DG. Practical statistics for medical research. Boca Raton: Chapman & Hall/CRC, 1991.
10. Ward MM. Primer: measuring the effects of treatment in clinical trials. Nat Clin Pract Rheumatol 2007; 3: 291–7.
11. Keirse MJ, Hanssens M. Control of error in randomized clinical trials. Eur J Obstet Gynecol Reprod Biol 2000; 92: 67–74.
12. Scales DC, Rubenfeld GD. Estimating sample size in critical care clinical trials. J Crit Care 2005; 20: 6–11.


13. Sedgwick P. Sample size: how many participants are needed in a trial? BMJ 2013; 346: f1041.
14. Beck RW. Sample size for a clinical trial: why do some trials need only 100 patients and others 1000 patients or more? Ophthalmology 2006; 113: 721–2.
15. Kadam P, Bhalerao S. Sample size calculation. Int J Ayurveda Res 2010; 1: 55–7.
16. Endacott R, Botti M. Clinical research 3: sample selection. Intensive Crit Care Nurs 2005; 21: 51–5.
17. Brasher PMA, Brant RF. Sample size calculations in randomized trials: common pitfalls. Can J Anaesth 2007; 54: 103–6.
18. Whitley E, Ball J. Statistics review 4: sample size calculations. Crit Care 2002; 6: 335–41.
19. Noordzij M, Tripepi G, Dekker FW, Zoccali C, Tanck MW, Jager KJ. Sample size calculations: basic principles and common pitfalls. Nephrol Dial Transplant 2010; 25: 1388–93.
20. Noordzij M, Dekker FW, Zoccali C, Jager KJ. Sample size calculations. Nephron Clin Pract 2011; 118: c319–23.

