Statistical Significance Versus Clinical Relevance
• A group of 2,000 adults aged 20 to 30, all of whom suffered from
constant tiredness, was recruited. The participants were then randomly
divided into two groups of 1,000 each.
• One group (the intervention group) was given the new drug,
"energylina". The other group (the control group) was given a dummy
(placebo) pill.
• The trial was double-blinded.
• The participants took the pills for 3 weeks, two per day.
Outcome: tiredness
• A scale measured participants' levels of tiredness before and after the
trial.
• It rated fatigue on a scale of 1 to 20, with 1 meaning the participant
felt entirely well-rested and 20 meaning the participant felt entirely
fatigued.
Results
• 90% of participants in the "energylina" group improved by 2
points on the scale.
• 80% of participants in the placebo group improved by 1 point on the
scale.
For example, imagine a safe treatment that could reduce the number of hours you
suffered with flu-like symptoms from 72 hours to 10 hours. Would you buy it?
Statistical significance is governed by the p-value (and confidence intervals). When we
find a difference with p < 0.05, we call it 'statistically significant', just like the results
of the hypothetical trial above. If a difference is statistically significant, it simply means it
was unlikely to have occurred by chance alone.
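As a rough sketch of how a comparison like this reaches p < 0.05, a two-proportion z-test on the hypothetical trial's improvement rates (900 of 1,000 vs 800 of 1,000 improving, treating "improved" as a binary outcome, which is an assumption made here for illustration) could look like:

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided two-proportion z-test using the pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value

# 900 of 1,000 improved on "energylina" vs 800 of 1,000 on placebo
z, p = two_proportion_z_test(900, 1000, 800, 1000)
print(f"z = {z:.2f}, two-sided p = {p:.1e}")  # p far below 0.05
```

With samples this large, even a modest 10-percentage-point difference produces a very small p-value, which is exactly the point the next paragraph makes.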
So it's important to consider that trial
results could be statistically significant without being clinically important.
• Take the "energylina" trial above: would you take the pill for its modest
improvement on the fatigue scale? Perhaps not. You might only be willing
to take this new pill if it were to lead to a bigger, more noticeable
benefit for you. For such a small improvement, it might not be worth the
cost of the pill. So although the results may be statistically significant,
they may not be clinically important.
To avoid falling into the trap of thinking that because a result is
statistically significant it must also be clinically important, you
can look out for a few things:
• Sample size
• Effect size
• So to conclude, just because a treatment has been shown to lead
to statistically significant improvements in symptoms does not
necessarily mean that these improvements will be clinically
significant (i.e. meaningful or relevant to patients). That’s for patients
and clinicians to decide.
• Hypothesis testing and statistical significance testing were developed by Fisher, Neyman
and Pearson in the 1920s and 1930s.
• Overtrusted
- when it is forgotten that "proper inference requires full reporting and transparency"
- Better-looking (smaller) P values alone do not guarantee full reporting and transparency.
In fact, smaller P values may hint at selective reporting and nontransparency.
• Misused - use of the P value to make "scientific conclusions and business or policy
decisions" based on "whether a P value passes a specific threshold", even though "a P
value, or statistical significance, does not measure the size of an effect or the importance
of a result", and "by itself, a P value does not provide a good measure of evidence".
EXAMPLE STUDIES
(Am Heart J 2015;170:110-6)
• Hypothesis: ACEi and ARBs may have detrimental effects on the
kidney if combined with intravenous iodine contrast.
• Type II error: failing to reject the null hypothesis when there is in fact a difference
between the groups - β

                           | H0 is actually true  | H0 is actually false
  We conclude H0 is true   | Correct conclusion   | Type II error
  We conclude H0 is false  | Type I error         | Correct conclusion

• Power is the ability to show that the null hypothesis is false when it actually is
false (1 − β).
• We definitely want power to be larger than 50%; we probably wouldn't even conduct the
study if we had less than a 50-50 chance of detecting a true effect. The usual power level
we aim for is 80%, but 90% is not uncommon in some areas.
Sample Size Calculation
• The following items should be specified:
– the primary variable
– the statistical test method
– the null hypothesis; the alternative hypothesis; the study
design
– the Type I error
– the Type II error
– how to deal with treatment withdrawals
Simple formula for a difference in means (equal-sized groups):

    n = 2σ²(z_{1−α/2} + z_{1−β})² / Δ²

where:
• n is the sample size in each group (assumes equal-sized groups)
• σ is the standard deviation of the outcome variable
• Δ is the effect size (the difference in means)
• z_{1−α/2} represents the desired level of statistical significance (typically 1.96)
• z_{1−β} represents the desired power (typically 0.84 for 80% power)
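This formula can be sketched in a few lines of code (the numbers in the example are made up for illustration; defaults use the conventional z values for α = 0.05 and 80% power):

```python
import math

def sample_size_per_group(sigma, delta, z_alpha=1.96, z_beta=0.84):
    """n per group = 2 * sigma^2 * (z_alpha + z_beta)^2 / delta^2,
    rounded up; assumes equal-sized groups."""
    return math.ceil(2 * sigma**2 * (z_alpha + z_beta)**2 / delta**2)

# Illustrative: outcome SD of 10, detecting a difference in means of 5,
# at alpha = 0.05 with 80% power.
print(sample_size_per_group(sigma=10, delta=5))  # 63 per group
```

Note how the required n grows with the square of σ/Δ: halving the detectable difference quadruples the sample size.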
Sample size calculators on the web…
• http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/PowerSampleSize
• http://calculators.stat.ucla.edu
• http://hedwig.mgh.harvard.edu/sample_size/size.html
These sample size calculations are idealized
•Consult a statistician!
Hypothesis testing versus effect estimation
• The goal of hypothesis testing is to answer a research question with either yes or no.
• Effect estimation, on the other hand, aims to provide a quantitative assessment of the
difference between two (or more) study groups and a range of likely values for
that difference.
• The measure of effect in our examples is the risk ratio.
• Results should be interpreted in the context of the study design.
• The P-value does not indicate the probability that the null hypothesis is true; it is the
probability of obtaining the observed result (or a more extreme one) if the null
hypothesis were true.
• The NNT is the average number of patients who need to be treated to prevent one
additional bad outcome (e.g. the number of patients that need to be treated for one of
them to benefit compared with a control in a clinical trial).
• The ideal NNT is 1, where everyone improves with treatment and no one improves with
control. The higher the NNT, the less effective the treatment.
Number needed to treat (NNT)
• We need to treat 48 patients for 5 years in order to prevent one major
atherosclerosis event (NNT = 1/(0.134 − 0.113) = 48).
• The numbers needed to treat to prevent each of the separate outcomes are even
bigger: 250 to prevent a single major coronary event, 100 to prevent one
non-haemorrhagic stroke and 67 to prevent a single revascularization.
• The number needed to treat to prevent one CIN occurrence was only 13:
NNT = 1/(0.108 − 0.109) = 13
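The NNT calculation above is simple enough to sketch directly; this uses the statin event rates from the slides (13.4% in the control group vs 11.3% in the treated group):

```python
def nnt(control_event_rate, treated_event_rate):
    """Number needed to treat = 1 / absolute risk reduction (ARR)."""
    arr = control_event_rate - treated_event_rate
    return round(1 / arr)

# Statin example from the slides: event rates 13.4% vs 11.3%
print(nnt(0.134, 0.113))  # 48
```

Because the NNT is the reciprocal of a small difference in rates, it is very sensitive to small changes in those rates, which is worth remembering when comparing NNTs across trials.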
• P reflects the chance of the observed result (or a more extreme one) occurring if the
null hypothesis were true.
• Selective reporting is perhaps the most common error in the scientific literature and one
of the worst.
• We must take into account the possibility that some of these differences may be chance
findings, for instance with a formal adjustment for multiple testing.
• Results that are not statistically significant are less likely to be reported.
Conclusions
• Any effect can produce statistical significance if the sample size is large enough or the
measurement precision is high enough, even if the difference itself is not clinically relevant.
• The effect should be estimated (e.g. a difference in means, a difference in frequencies, a
ratio of risks, etc.)
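The first conclusion can be demonstrated with hypothetical numbers: the same clinically trivial 0.1-point difference on the 20-point fatigue scale (assuming, for illustration, a known common SD of 3) is not significant in a small trial but becomes "significant" in a huge one:

```python
import math

def p_value_diff_in_means(mean1, mean2, sigma, n):
    """Two-sided p for a difference in means, equal group sizes n,
    known common SD (a simplification for illustration)."""
    se = sigma * math.sqrt(2 / n)
    z = (mean1 - mean2) / se
    return math.erfc(abs(z) / math.sqrt(2))

# A 0.1-point difference on the 20-point fatigue scale (SD = 3):
print(p_value_diff_in_means(10.0, 9.9, 3.0, n=100))     # not significant
print(p_value_diff_in_means(10.0, 9.9, 3.0, n=50_000))  # p < 0.05
```

The difference itself never changed; only the sample size did, which is why the effect estimate, not the p-value, should carry the clinical interpretation.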