
Statistical significance

versus clinical relevance


Senior Lecturer (Șef Lucrări) Dr Mugurel Apetrii
UMF Grigore T Popa IASI
What if I told you that I conducted a study which shows that a single
pill can significantly reduce tiredness without any adverse effects?
Let’s imagine this scenario…

• A group of 2,000 adults aged 20-30, all of whom suffer from constant
tiredness. The participants were randomly divided into 2 groups, with
1,000 participants in each.
• One group (the intervention group) was given the new drug,
“energylina”. The other group (the control group) was given a dummy
(placebo) pill.
• The trial was double-blinded.
• The participants took the pills for 3 weeks, 2 per day.
Outcome: tiredness
• A scale measured participants’ levels of tiredness before and after the
trial.
• It rated fatigue on a scale of 1 to 20, with 1 meaning the participant
felt entirely well-rested and 20 meaning the participant felt entirely
fatigued.
Results
• 90% of the participants in the “energylina” group improved by 2
points on the scale.
• 80% of participants in the placebo group improved by 1 point on the
scale

• This difference between the groups was statistically significant (p < 0.05).
So does that mean the treatment is effective? Should you take “energylina”? Should every doctor prescribe it?

Did the results convince you?


Not necessarily…
“Statistical significance vs. clinical significance”

Clinical significance is the practical importance of the treatment effect: whether it has a
real, palpable, noticeable effect on daily life.

For example, imagine a safe treatment that could reduce the number of hours you
suffered with flu-like symptoms from 72 hours to 10 hours. Would you buy it?

Statistical significance is ruled by the p-value (and confidence intervals). When we find a
difference where p < 0.05, we call this ‘statistically significant’. Just like our results from
the above hypothetical trial. If a difference is statistically significant, it simply means it
was unlikely to have occurred by chance.
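The arithmetic behind that “statistically significant” label can be sketched with a two-proportion z-test on the hypothetical trial numbers. This is a minimal sketch: the slides do not name the test used, so the choice of test and of “proportion improved” as the comparison are assumptions.

```python
from math import sqrt, erfc

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two proportions (pooled SE)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail probability
    return z, p_value

# 90% of 1,000 improved on "energylina" vs 80% of 1,000 on placebo
z, p = two_proportion_z_test(900, 1000, 800, 1000)
```

With 1,000 participants per group, even this modest difference yields z ≈ 6.3 and a p-value far below 0.05, which says nothing about whether a 1-point gain on a 20-point fatigue scale matters to patients.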
So it’s important to consider that trial
results could be…

• Statistically significant AND clinically important

• Not statistically significant BUT clinically important.

• Statistically significant BUT NOT clinically important. 


Going back to our hypothetical study, what have we got:
statistical significance, clinical significance, or both?

• Perhaps not. You might only be willing to take this new pill if it were
to lead to a bigger, more noticeable benefit for you. For such a small
improvement, it might not be worth the cost of the pill. So although
the results may be statistically significant, they may not be clinically
important.
To avoid falling in the trap of thinking that because a result is
statistically significant it must also be clinically important, you
can look out for a few things…

• Look to see if the authors have specifically mentioned whether
the differences they have observed are clinically important or
not.

• Sample size

• Effect size.
• So to conclude, just because a treatment has been shown to lead
to statistically significant improvements in symptoms does not
necessarily mean that these improvements will be clinically
significant (i.e. meaningful or relevant to patients). That’s for patients
and clinicians to decide.
• Hypothesis testing and statistical significance testing were developed by Fisher, Neyman and
Pearson in the 1920s and 1930s.

• In March 2016 the American Statistical Association (ASA) posted a statement on the
correct use of P-values. Their rationale: ‘while the P-value can be a useful
statistical measure, it is commonly misused and misinterpreted.’

• A p-value is a measure of the effect of chance within a study.
• It is not the probability that the result of the study is true or correct.
• It is the probability of obtaining the observed result, or a more extreme one, if the null
hypothesis is true and the results are not affected by bias or confounding.
• The null hypothesis is the theory that the exposure or intervention being
studied is not associated with the outcome of interest.
P values
• Misinterpreted: as “the probability that the studied hypothesis is true”.

• Overtrusted: when it is forgotten that proper inference requires full reporting and
transparency. Better-looking (smaller) P values alone do not guarantee these; in fact,
smaller P values may hint at selective reporting and nontransparency.

• Misused: when “scientific conclusions and business or policy decisions” are based on
“whether a P value passes a specific threshold”, even though “a P value, or statistical
significance, does not measure the size of an effect or the importance of a result”, and
“by itself, a P value does not provide a good measure of evidence”.
EXAMPLE STUDIES

Am Heart J 2015;170:110-6
• Hypothesis: ACEi and ARBs may have detrimental effects on the
kidney if combined with intravenous iodine contrast.

• Results: withholding ACEi/ARB treatment prior to cardiac
catheterization reduced the occurrence of CIN by 41% (P = 0.16).

• Conclusion: withholding ACEi/ARBs prior to cardiac catheterization
resulted in a ‘non-significant reduction in contrast-induced
acute kidney injury and a significant reduction in postprocedural rise
of creatinine.’
www.thelancet.com Vol 377 June 25, 2011
• Median follow-up of 4.9 years.

• The occurrence of any major atherosclerotic event was reduced by
17% in the lipid-lowering drugs group compared with the placebo group
(P = 0.0021).
• Conclusions:
- ‘lowering LDL cholesterol with the combination of simvastatin plus
ezetimibe safely reduces the risk of major atherosclerotic events in a
wide range of patients with CKD’
- ‘widespread use of LDL cholesterol-lowering therapy in patients with CKD
would result in a worthwhile reduction in cardiovascular disease
complications’
What do these results mean for clinical
practice?
• How must we interpret the results?
• Are the recommendations based on the magnitude of effect of the
intervention or the P-value of the effect?
• Is a P-value > 0.05 a chance finding?
• Should we disregard the results, postpone judgement and continue to
practice as if these data had never been published when the P-value is
greater than 0.05, and take any result as truth when the P-value is
lower than 0.05?
RR = 0.83 (95% CI 0.74 to 0.94)
The 95% CI means that when we repeat
a study many times in similar samples,
the observed CIs will cover the true risk
ratio 95% of the time.

RR = 0.61 (95% CI 0.31 to 1.19)
Conversely, a P-value of 0.05 means
that if the null hypothesis were true and
we were to repeat the study, a similar or
more extreme risk ratio would be
observed only 5% of the time.

Two theoretical samples with a similar P-value
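To make the CI interpretation above concrete, a 95% CI for a risk ratio is commonly built on the log scale. This is a minimal sketch; the event counts below are illustrative, chosen only to give a risk ratio near the first sample, and are not taken from either trial.

```python
from math import exp, log, sqrt

def risk_ratio_ci(events_t, n_t, events_c, n_c, z=1.96):
    """Risk ratio and 95% CI via the log-normal approximation."""
    rr = (events_t / n_t) / (events_c / n_c)
    # standard error of log(RR)
    se_log = sqrt(1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c)
    lower = exp(log(rr) - z * se_log)
    upper = exp(log(rr) + z * se_log)
    return rr, lower, upper

# Illustrative counts (not the published trial data):
# 113 events in 1,000 treated vs 134 events in 1,000 controls
rr, lo, hi = risk_ratio_ci(113, 1000, 134, 1000)
```

Note that when the interval crosses 1.0 the result is not statistically significant at the 5% level, which is the link between CIs and P-values used in the slides.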
The null hypothesis and alternative
hypothesis
• Null hypothesis—Ho: the incidence of CIN ‘is not different’
between a group of patients who discontinue ACEi/ARB
therapy and a group of patients who continue ACEi/ARB
therapy when undergoing cardiac catheterization.

• Alternative hypothesis—Ha: the incidence of CIN is different in the
group of patients who discontinue ACEi/ARB therapy compared with the
group of patients who continue ACEi/ARB therapy when undergoing
cardiac catheterization.
SHARP study
• Null hypothesis (Ho): the events ‘are not different’ between a group
of patients who use lipid-lowering drugs and a group of patients who
use placebo.

• Alternative hypothesis (Ha): the incidence of major atherosclerotic
events is different in the group of patients who use lipid-lowering
drugs compared with the group of patients who use placebo.
Type I and type II error
• Type I: the null hypothesis is rejected falsely; its probability α is conventionally set at 0.05.

• Type II: failing to reject the null hypothesis when there is in fact a difference between the
groups; its probability is β.

                          Ho is actually true    Ho is actually false
We conclude Ho is true    Correct conclusion     Type II error
We conclude Ho is false   Type I error           Correct conclusion

• The power is the ability to show that the null hypothesis is false when it actually is
false (1 − β).

• We definitely want power to be larger than 50%; we probably wouldn't even conduct the
study if we have less than even a 50-50 chance of getting it right. The usual power level
that we aim for is 80%, but 90% is not uncommon in some areas
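Under the normal approximation, the power of a two-sample comparison of means can be computed in closed form. This is a sketch only; the function name and the example numbers are illustrative, not from the slides.

```python
from math import sqrt, erf

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def power_two_means(n_per_group, difference, sigma, z_alpha2=1.96):
    """Approximate power of a two-sample z-test for a difference in means."""
    drift = difference / (sigma * sqrt(2 / n_per_group))
    return norm_cdf(drift - z_alpha2)

# 64 per group to detect a half-SD difference
power = power_two_means(64, difference=0.5, sigma=1.0)
```

With 64 per group and a half-SD difference this gives roughly 80% power, the conventional target mentioned above; larger samples push the power higher.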
Sample Size Calculation
• Following items should be specified
– a primary variable
– the statistical test method
– the null hypothesis; the alternative hypothesis; the study
design
– the Type I error
– the Type II error
– how to deal with treatment withdrawals
Simple formula for a difference in means (assumes equal-sized groups):

n = 2σ²(Zβ + Zα/2)² / difference²

where:
– n is the sample size in each group
– Zβ represents the desired power (typically 0.84 for 80% power)
– Zα/2 represents the desired level of statistical significance (typically 1.96)
– σ is the standard deviation of the outcome variable
– the difference is the effect size (the difference in means)
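The formula translates directly into code. In this sketch, the SD of 4 points and the 1-point minimal important difference on the 20-point fatigue scale are assumptions made for illustration, not values from the trial.

```python
from math import ceil

def n_per_group(sigma, difference, z_beta=0.84, z_alpha2=1.96):
    """Sample size per group for a two-sample difference in means,
    n = 2 * sigma^2 * (Z_beta + Z_alpha/2)^2 / difference^2, rounded up."""
    return ceil(2 * sigma**2 * (z_beta + z_alpha2)**2 / difference**2)

# Assumed: SD of 4 points on the fatigue scale, detect a 1-point difference
n = n_per_group(sigma=4, difference=1)
```

Halving the detectable difference quadruples the required sample size, which is why trials powered for small effects can be very large.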
Sample size calculators on the web…
• http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/PowerSampleSize

• http://calculators.stat.ucla.edu
• http://hedwig.mgh.harvard.edu/sample_size/size.html
These sample size calculations are idealized

• They do not account for losses to follow-up (prospective studies)
• They do not account for non-compliance (for an intervention trial or RCT)
• They assume that individuals are independent observations (not true in
clustered designs)

• Consult a statistician!
The goal of effect estimation
• The goal of hypothesis testing is to answer a research question with either yes or no.

• Effect estimation, on the other hand, aims to provide a quantitative assessment of the
difference between two (or more) study groups and a range of likely values for
that difference.
• The measure of effect in our examples is the risk ratio.
• Interpret results in the context of the study design.

• Effects that are not clinically important can be statistically significant if
the sample size is large enough (i.e. the standard error is small).

• Pay due attention to the effect size (the descriptive result) in the
context of the confidence interval (i.e. clinical significance).

Cohort study, n = 34,079 women → weight gained by the group exercising
≥21 hours/week versus the group with <7.5 hours of exercise/week
(p < 0.001).

• Effect size? The <7.5 h/week group gained only 0.15 kg more than the ≥21 h/week group.

• Extrapolation: if the study continued for 13 years, the ≥21 h/week group would
gain only 0.635 kg less than the <7.5 h/week group!

Physical Activity and Weight Gain Prevention. JAMA 2010;303:1173-1179


COMMON MISCONCEPTIONS AND
MISTAKES
• The false dichotomy of p < 0.05 vs p ≥ 0.05

• The P-value indicates the probability that the null hypothesis is true

• Multiple testing and selective reporting


The false dichotomy of p < 0.05 vs p ≥ 0.05
• A result is called ‘significant’ when the P-value is less than 0.05 and
‘non-significant’ (n.s.) when it is greater than or equal to 0.05.
• However, what is the difference between P = 0.055 and P = 0.045?
• A P-value has little meaning by itself.
• It has nothing to do with the magnitude or the importance of an
observed effect: ‘any effect, no matter how small in terms of clinical
impact, can produce statistical significance if the sample size is large
enough or measurement precision is high enough.’
Number needed to treat (NNT)
• The number needed to treat (NNT) is an epidemiological measure used in
communicating the effectiveness of a health-care intervention, typically a treatment with
medication.

• The NNT is the average number of patients who need to be treated to prevent one
additional bad outcome (e.g. the number of patients that need to be treated for one of
them to benefit compared with a control in a clinical trial).

• Defined as the inverse of the absolute risk reduction: NNT = 1/(Pa − Pb).

• The ideal NNT is 1, where everyone improves with treatment and no one improves with
control. The higher the NNT, the less effective the treatment.
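The definition translates into a one-liner. In this sketch the result is rounded up, since a fraction of a patient cannot be treated; the example uses the SHARP-style event risks quoted on the slides (13.4% vs 11.3%).

```python
from math import ceil

def nnt(risk_control, risk_treated):
    """Number needed to treat = 1 / absolute risk reduction, rounded up."""
    arr = risk_control - risk_treated
    if arr <= 0:
        raise ValueError("treatment shows no absolute risk reduction")
    return ceil(1 / arr)

# Event risks from the slides: 13.4% on placebo vs 11.3% on treatment
patients_to_treat = nnt(0.134, 0.113)
```

Note that the NNT depends on the absolute risks, not the relative reduction: a 17% relative reduction of a rare event can still mean treating hundreds of patients to prevent one outcome.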
Number needed to treat (NNT)
• We need to treat 48 patients for 5 years in order to prevent one major
atherosclerosis event (NNT=1/ (0,134-0,113) =48)

• The numbers needed to treat to prevent any of the separate outcomes are even
bigger, 250 to prevent a single major coronary event, 100 to prevent one non-
haemorrhagic stroke and 67 to prevent a single revascularization

• The number needed to treat to prevent one CIN occurrence was only 13
NNT = 1/(0,108- 0,109)=13
• P: it reflects the chance of the observed result occurring if the null hypothesis were true.

• Selective reporting is perhaps the most common error in the scientific literature and one
of the worst.

• It is aggravated by multiple testing.

• We must take into account the possibility that some of these differences may be chance
findings, for instance with a formal adjustment for multiple testing.

• Results that are not statistically significant are less likely to be reported.
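The cost of multiple testing can be quantified: with m independent tests each at level α, the chance of at least one false positive grows quickly. A minimal sketch, including the simplest formal adjustment (Bonferroni); the function names are illustrative.

```python
def familywise_error(alpha, m):
    """Probability of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m

def bonferroni_threshold(alpha, m):
    """Bonferroni-corrected per-test significance threshold."""
    return alpha / m

# With 20 independent tests at alpha = 0.05, the chance of at least
# one spurious "significant" finding is about 64%
fwer = familywise_error(0.05, 20)
```

This is why a single “significant” secondary outcome among many tested should be read with caution, and why adjusted thresholds (here 0.05/20 = 0.0025 per test) are used.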
Conclusions
• Any effect can produce statistical significance if the sample size is large enough or
measurement precision is high enough, even if the difference itself is not clinically relevant.

• The effect should be estimated (e.g. a difference in means, a difference in frequencies, a
ratio of risks, etc.).

• These effects should be presented with a CI.

• Should we consider lowering the P threshold?
