
Mark Kerr

Clinical Librarian
Is it relevant – PICO
Is it valid – chance
Is it valid – bias/confounding
Is it useful – effect size

Statistics for Validity (to avoid chance)


Descriptive Statistics (population, results)
Statistics to Demonstrate Difference (probability)
Why? Uncertainty
 1. PICO – if relevant, then read it
 2. Difference due to chance? Due to bias?
 If not then likely to be due to intervention
 3. Is difference big enough to be clinically useful?
 And is it ‘real’ (patient-oriented) or surrogate outcome?
 4. Does it change/inform knowledge or practice?
 Can I apply this to my patient?

1. SCREEN 2. APPRAISE 3. ASSESS 4. APPLY


IRRELEVANCE PICO

BIAS RAMMBO

CHANCE Sample size, P value, 95% CI

FUTILITY Effect Size, Clinical Significance

REPRODUCIBILITY Similar studies, meta-analysis

ACCEPTABILITY Social, regulatory & clinical approval

AFFORDABILITY Cost-effectiveness studies

GENERALISABILITY Are they sufficiently like us?


4 critical elements in interpreting treatment effects:
Significance
 the likelihood that the outcome of an experiment or trial is not due to chance
Direction
 positive or negative
Magnitude
 absolute or relative size
Relevance
 the degree to which a result addresses a research topic
Effect size estimates from clinical trials, usually based on
active drug vs placebo, do not correlate directly with
patient responses in real-world clinical practice
mean difference
relative risk
odds ratio
number needed to treat
area under the curve
 mean difference
 used in studies that report efficacy in terms of a continuous
measurement and calculated from two mean values and their
standard deviations
 relative risk
 the proportion of patients responding to one treatment divided by the
proportion responding to a different treatment (or placebo),
which is particularly useful in prospective clinical trials to assess
differences between treatments
 odds ratio
 used to interpret the results of retrospective studies; provides estimates of the risk of
side effects by comparing the odds of an outcome in the
presence or absence of a specified condition
 number needed to treat
 the number of subjects one would expect to treat with agent A to have
one more success (or one less failure) than if the same number were
treated with agent B
 area under the curve
 (aka drug-placebo response curve) - a 6-step process that can be used
to assess effects of medication on both worsening and improvement
and probability that a medication-treated subject will have a better
outcome than a placebo-treated subject.
Statistical Validity

The degree to which an observed result, such as a difference between
two measurements, can be relied upon and not attributed to errors in
sampling and measurement.

Validity is the lack of chance, bias or confounding
 To calculate the sample size, you need to know:
 The minimum change to be clinically significant
▪ i.e. A statistically significant outcome showing an intervention with
0.0001% difference is clinically useless
 The frequency and spread of data we might expect (3 typical
sources: previous relevant studies, a pilot study, or clinical
expectations)
 The acceptable margin of error (5% is typical)
 Type of study design (superiority, non-inferiority, equivalence)
 Type of primary outcome (dichotomous/continuous)
 The evidence: a statement on sample size calculation and
the expected sample – and proof that this was achieved:

CLOTS Trial: Lancet. 2009 June 6; 373(9679): 1958–1965.
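To show how these inputs combine, here is a minimal sketch of a sample size calculation for a superiority trial with a dichotomous primary outcome, using the standard two-proportion formula. The event rates, significance level and power below are assumed example values, not figures from the CLOTS trial.

```python
# Illustrative sample size calculation for comparing two proportions
# (dichotomous outcome, superiority design). All inputs are assumed values.
from scipy.stats import norm

p_control = 0.20   # expected event rate in the control arm (assumption)
p_treat   = 0.15   # smallest clinically important event rate on treatment (assumption)
alpha     = 0.05   # two-sided significance level ("acceptable margin of error")
power     = 0.80   # 1 - beta: chance of detecting the difference if it is real

z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for a two-sided 5% level
z_beta  = norm.ppf(power)           # 0.84 for 80% power

# Standard formula for the number needed per group
n_per_group = ((z_alpha + z_beta) ** 2 *
               (p_control * (1 - p_control) + p_treat * (1 - p_treat)) /
               (p_control - p_treat) ** 2)

print(round(n_per_group))   # ~903 per group, before allowing for dropouts
```

A real protocol would then inflate this figure for expected loss to follow-up and report both the target and the achieved sample.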


 The p value gives a measure of how likely it is that any
differences between control and experimental groups are
due to chance alone. P values range from 0 (the result is essentially
impossible under chance alone) to 1 (the result is entirely
consistent with chance).
 p=0.001 unlikely result happened by chance: 1 in 1000
 p=0.05 fairly unlikely result happened by chance: 1 in 20
 p=0.5 fairly likely the result happened by chance: 1 in 2
 p=0.75 very likely the result happened by chance: 3 in 4
 Results where p is less than 0.05 are said to be “significant.”
This is just an arbitrary figure as in 1 in 20 cases, the results
could be due to chance.
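The "1 in 20" point can be illustrated with a quick simulation (not from the original slides): when both groups are drawn from the same population, so the null hypothesis is true, roughly 5% of comparisons still come out "significant" at p < 0.05.

```python
# Simulate many trials in which the null hypothesis is TRUE (both groups
# come from the same distribution) and count how often p < 0.05.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
n_trials, n_per_group = 10_000, 30
false_positives = 0

for _ in range(n_trials):
    control = rng.normal(0, 1, size=n_per_group)
    treatment = rng.normal(0, 1, size=n_per_group)   # same population: no real effect
    _, p = ttest_ind(control, treatment)
    if p < 0.05:
        false_positives += 1

print(false_positives / n_trials)   # ~0.05, i.e. about 1 "significant" result in 20
```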
 0.05 is increasingly considered insufficient
 Particle physicists, who collect reams of data from atom-
smashing experiments, have long demanded a P value
below 0.0000003 (or 3 × 10−7)
 More than a decade ago, geneticists took similar steps to
establish a threshold of 5 × 10−8 (p= 0.00000005) for
genome-wide association studies, which look for
differences between people with a disease and those
without across hundreds of thousands of DNA-letter
variants.
 a significant P value suggests that something nonrandom
has occurred; it does not tell us about the clinical
significance of the nonrandom effect.
Type I (α) and Type II (β) Errors

Type 1 (α) error = the study concludes a relationship exists
between two variables when in fact there is no relationship,
leading us to wrongly reject the null hypothesis when it was
actually true.
 A study has avoided Type 1 error if P<0.05

Type 2 (β) error = the study concludes a relationship doesn't
exist between two variables when in fact there is one
(i.e. the study reports a high (poor) P value when the null
hypothesis should have been rejected).
 A study has avoided Type II error if Power>80%
 Used in the same way as p values in assessing the effects of
chance but can give more information.
 Any result obtained in a sample of patients only gives an estimate of the result
which would be obtained in the whole population.
 The real value will not be known, but the confidence interval can show the size
of the likely variation from the true figure.
 A 95% confidence interval means there is a 95% chance that the
true result lies within the range specified. (Equivalent to a p
value of 0.05).
 The larger the trial the narrower the confidence interval, and therefore the
more likely the result is to be definitive.
 In an odds ratio diagram, if the confidence interval crosses the line of no
effect (i.e. if the interval includes 1) it can mean either that there is no
significant difference between the treatments and/or that the sample size was
too small to allow us to be confident where the true result lies.
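As a sketch of why larger trials give narrower intervals, the snippet below computes a 95% confidence interval for a mean at increasing sample sizes. The data are simulated, and the "true" mean and SD are assumed values for illustration only.

```python
# Illustrate how the 95% CI for a mean narrows as the sample grows.
# The data are simulated purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, sd = 120, 15   # assumed "blood-pressure-like" values

for n in (20, 100, 1000):
    sample = rng.normal(true_mean, sd, size=n)
    mean = sample.mean()
    sem = stats.sem(sample)                              # standard error of the mean
    low, high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
    print(f"n={n:4d}  mean={mean:6.1f}  95% CI ({low:6.1f}, {high:6.1f})")
```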
 A measurement is ACCURATE if it matches exactly
an accepted standard
 A measurement is PRECISE if it has a small
random error of estimation
 A measurement is RELIABLE if it can be
repeated with minimal variation
 A measurement is VALID if it gives genuine information
about that which is being measured


Quick Validity Check
 Target sample size shown?
 Reached?
 Loss to follow ups or dropouts/withdrawals?
 Primary outcome?
 Proved? (e.g. to P<0.05, 95% CI)
 Baseline characteristics?
 ‘No significant difference’?
 Realistic & Appropriate?

VALIDITY only means “TRUE”, not necessarily “USEFUL”


Odds, Risk &
Numbers Needed to…
                Outcome
              YES    NO
Treatment      a      b
Control        c      d

 Risk: the number of participants having the event in a group divided by the total number of
participants
 Odds: the number of participants having the event divided by the number of participants not
having the event
 Risk ratio (relative risk): the risk of the event in the intervention group divided by the risk of
the event in the control group
 Odds ratio: the odds of the event in the intervention group divided by the odds of the event in
the control group
 Risk difference: the absolute change in risk that is attributable to the experimental
intervention
 Number needed to treat (NNT): the number of people you would have to treat with the
experimental intervention (compared with the control) to prevent one event (in a specific time
period).
EER = Experimental Event Rate, CER = Control Event Rate
                Outcome
              YES    NO
Treatment      a      b
Control        c      d

 In the treatment group:
   Risk = a/(a+b)
   Odds = a/b
 In the control group:
   Risk = c/(c+d)
   Odds = c/d
 OR = (a/b) ÷ (c/d) = ad/bc
Odds Ratios compare
how likely outcomes are
between two groups

 In a simple aspirin study, with a
beneficial outcome
(“headache goes away”):

                Outcome
              YES    NO
Aspirin        27     23
Placebo        10     40

 Risk of ‘cure’ on aspirin
 Risk of ‘cure’ on placebo
 Odds of ‘cure’ on aspirin
 Odds of ‘cure’ on placebo
 Relative Risk
 Odds ratio
 And then Number Needed to Treat
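Working through the aspirin table in code makes the definitions concrete. This is an illustrative sketch using the figures above (27/23 on aspirin, 10/40 on placebo); the helper function is not from the slides.

```python
# Risk, odds, relative risk, odds ratio and NNT from a 2x2 table:
#                Outcome YES   Outcome NO
#   Treatment        a             b
#   Control          c             d
def two_by_two(a, b, c, d):
    risk_t, risk_c = a / (a + b), c / (c + d)
    odds_t, odds_c = a / b, c / d
    rr = risk_t / risk_c            # relative risk
    or_ = odds_t / odds_c           # odds ratio (= ad/bc)
    arr = abs(risk_t - risk_c)      # absolute risk difference
    nnt = 1 / arr                   # number needed to treat
    return risk_t, risk_c, rr, or_, nnt

# Aspirin example above: 27 'cured' / 23 not on aspirin; 10 / 40 on placebo
risk_t, risk_c, rr, or_, nnt = two_by_two(27, 23, 10, 40)
print(f"Risk of cure: aspirin {risk_t:.2f}, placebo {risk_c:.2f}")   # 0.54 vs 0.20
print(f"RR {rr:.1f}, OR {or_:.1f}, NNT {nnt:.1f}")                   # RR 2.7, OR 4.7, NNT 2.9
```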
 OR – the ratio of the odds of having the outcome in the
experimental group relative to the odds of having the
outcome in the control group.
 It is particularly useful because as an effect-size
statistic, it gives clear and direct information to
clinicians about which treatment approach has the best
odds of benefiting the patient.
 Also used in cross-sectional studies and case-control
studies, where exposure versus non-exposure replaces
treatment and control, and the outcome is the presence or
absence of disease.
 133 women take an antibiotic for treatment of UTI
 14 still have UTI after 6 weeks
RISK
 Q1: What is the risk of remaining infected?

ODDS
 Q2: For these 133 women, what are the odds of having the event (still infected)?

COMPARING ODDS & RISK


 Q3: Both are similar because the risk is small. Now include the 148 women in this
trial receiving placebo, of whom 128 still had UTI after 6 weeks. In this group,
what is the risk of staying infected?
 Q4: What are the odds?
 Odds and Risk are never identical, although they are similar at low values. But depending
on the presentation, one can seem much more impressive than the other (especially in a
newspaper headline)
 [1] 14/133 = approx 0.1
 [2] 14/119 = still approx 0.1; more formally, it is 14/133 (risk of having the
event) divided by 119/133 (risk of not having the event), still 14/119 or 0.1
 [3] 128 (number with event – still infected)/148 (total number in the group)
 = 0.86
 [4] 128 (still infected)/20 (number cured) = 6.4 (very different).
Relative risk or risk ratio (RR)
Risk of event in one group divided by risk of the event in the other group.

RR = (no. with event in treatment group ÷ no. in treatment group)
     ÷ (no. with event in control group ÷ no. in control group)

   = (14/133) / (128/148)
   = 0.1 / 0.86
   = 0.12

RR = 1  Intervention has identical effect to control
RR < 1  Intervention reduces the chances of having the event
RR > 1  Intervention increases the chances of having the event
RR = 0  No events in treated group = 100% perfect treatment!

              Event   No Event   Total
Intervention    14      119       133
Control        128       20       148
Odds ratio (OR)
Odds in the treated group divided by odds in the control group.

OR = (no. with event in treatment group ÷ no. without event in treatment group)
     ÷ (no. with event in control group ÷ no. without event in control group)

   = (14/119) / (128/20)
   = 0.118 / 6.40
   = 0.018

OR = 1  Intervention has identical effect to control
OR < 1  Intervention reduces the chances of having the event
OR > 1  Intervention increases the chances of having the event
OR = 0  No events in treated group = 100% perfect treatment!

              Event   No Event   Total
Intervention    14      119       133
Control        128       20       148
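A quick check of those two calculations, using the same trial figures (14/133 still infected on antibiotic, 128/148 on placebo); purely a verification sketch.

```python
# Verify the RR and OR calculations for the UTI example above.
a, b = 14, 119   # antibiotic group: still infected / cured
c, d = 128, 20   # placebo group:   still infected / cured

rr = (a / (a + b)) / (c / (c + d))
or_ = (a / b) / (c / d)
print(f"RR = {rr:.2f}")    # 0.12
print(f"OR = {or_:.3f}")   # 0.018 - far more extreme-looking when events are common
```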
              Dead   Not Dead   Total
Albumin        726     2747      3473
Saline         729     2731      3460



SAFE Study
http://www.nejm.org/doi/pdf/10.1056/NEJMoa040232
NEJM Correspondence
http://www.nejm.org/doi/full/10.1056/NEJM200410283511818
Journal Club Commentaries
http://www.biomedcentral.com/content/pdf/cc3006.pdf
http://www.biomedcentral.com/content/pdf/cc8940.pdf

Type of outcome

Value of OR/RR   Adverse outcome (e.g. death)           Beneficial outcome (e.g. breathing unaided)
<1               New intervention better                New intervention worse
1                New intervention no better/no worse    New intervention no better/no worse
>1               New intervention worse                 New intervention better
 A measure of the relative efficacy / risk of a
treatment
 How many patients need to be exposed to a risk
factor (i.e. a treatment) over a specific period for
one extra patient to show benefit/harm who
would not otherwise have shown benefit/harm.
 1/Absolute Risk Reduction or 1/Risk Difference
 Consider also NNH (harm), NNV (vaccinate)
 http://www.nntonline.net/visualrx/
 A practical example can be taken from the recent concerns about
third generation oral contraceptive pills and the risk of Deep Vein
Thrombosis. It is thought that third generation pills carry a risk of
DVT of about 25 per 100,000 women per year of use; in
comparison second generation pills carry a risk of about 15 per
100,000 women years and in women who do not take the pill the
risk is about 5 per 100,000 per year.

 Switching users of third generation pills to a second generation
equivalent will result in an impressive sounding Relative Risk
Reduction of 40%, but as the risk of DVT is so low the Absolute
Risk Reduction is only 0.0001, giving an NNT of 10,000 women
needing to be changed to prevent a single DVT in one year.
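Those figures can be reproduced directly from the quoted rates (25 vs 15 DVTs per 100,000 women-years); a small illustrative sketch:

```python
# Relative vs absolute risk reduction for the contraceptive example above.
risk_third  = 25 / 100_000   # DVT risk per woman-year, third generation pill
risk_second = 15 / 100_000   # DVT risk per woman-year, second generation pill

arr = risk_third - risk_second   # absolute risk reduction from switching
rrr = arr / risk_third           # relative risk reduction
nnt = 1 / arr                    # women who must switch to prevent one DVT per year

print(f"RRR = {rrr:.0%}, ARR = {arr:.4f}, NNT = {nnt:.0f}")   # RRR = 40%, ARR = 0.0001, NNT = 10000
```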
 When dose-adjusted warfarin was compared to aspirin,
the absolute risk reduction of stroke was 0.6% in
warfarinized patients (NNT 167)
 Tamoxifen vs Placebo for BrCa: NNT 112 (5 yrs
treatment); NNH (Venous Thromboembolic Events) =
137, NNH (Mortality) = 256 (so the high risk of BrCa
needs to be weighed against the high risk of adverse events)
 Cochrane (2010): Vaccines for Influenza in healthy
adults – NNV = 33 (matched vaccines) to 100
(unmatched)
Number Needed to Treat

 4 out of 55 or 7.3% died on Streptomycin = EER
 14 out of 52 or 26.9% died on placebo = CER

               Died   Survived
Streptomycin     4       51
Placebo         14       38

ARR = CER – EER = 26.9% – 7.3% = 19.6% (or 0.196)
NNT = 1/0.196 = 5.1, rounded up to 6
6 patients with TB would need to be treated with
streptomycin to prevent 1 additional person dying
 Relative Risk of dying on streptomycin compared to placebo was 0.27
 Treatment with streptomycin showed 73% reduction in the risk of
death compared with placebo.
 Antibiotic treatment prevented approximately three quarters of the
deaths that would have occurred on placebo. BMJ 30.12.1948 769-80
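The same streptomycin calculation as a short sketch, so the EER/CER/ARR/NNT chain can be checked step by step (figures from the slide above):

```python
# Streptomycin trial (BMJ 1948): deaths / total in each arm.
import math

eer = 4 / 55    # experimental event rate (died on streptomycin), ~0.073
cer = 14 / 52   # control event rate (died on placebo), ~0.269

arr = cer - eer              # absolute risk reduction, ~0.196
nnt = math.ceil(1 / arr)     # ~5.1, rounded up to 6
rr  = eer / cer              # relative risk, ~0.27
rrr = 1 - rr                 # relative risk reduction, ~73%

print(f"ARR = {arr:.3f}, NNT = {nnt}, RR = {rr:.2f}, RRR = {rrr:.0%}")
```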
Forest Plots
& Meta-Analysis
Which statement is true, based on this forest plot comparing drug
treatment to placebo?

A. Three trials have been compared
B. Study B had a smaller sample size
C. There is a significant difference between drug X and placebo in Study A
D. There is an overall significant difference between drug X and placebo
E. None of the above
In this forest plot of hospital-acquired infections (Lancet 2018; 391: 1693–705),
should (a) all, (b) medical, or (c) surgical patients receive more or less oxygen
to avoid HAIs? And how certain are you?

n=38

Rao, S.C., Athalye-Jape, G.K., Deshpande, G.C., Simmer, K.N. and Patole, S.K., 2016. Probiotic supplementation and late-
onset sepsis in preterm infants: a meta-analysis. Pediatrics, 137(3), p.e20153684.
Use of weaning protocols for reducing duration of
mechanical ventilation in critically ill adult
patients: Cochrane systematic review and meta-analysis
BMJ 2011;342:c7237
 Heterogeneity primarily denotes that the range of
results varies among included trials. Although
heterogeneity may be due to chance alone, it can also be
caused by clinical or methodologic differences among
trials and might thereby result in systematic errors.

An example meta-analysis with
evidence of strong heterogeneity.
Note the CIs of some studies do
not overlap. Pooled data would
give a misleading overall
impression of relative risk.
Heterogeneity

Occurs where the results of different studies vary from each other more
than might be expected by chance. Visually, on a Forest Plot, this is where the CI
lines do NOT overlap. Significant heterogeneity would rule out meta-analysis;
alternatives would include sub-group or sensitivity analysis.

 Χ² (Cochran's Q) = variation in results above that
expected by chance – compared with the degrees of freedom
(DF = number of studies – 1); a value much higher than the DF
suggests heterogeneity
 A low P value for Χ² may indicate heterogeneity

 The Z statistic (and its associated P value) tests the overall
pooled effect, not heterogeneity; Z greater than about 2
corresponds to P < 0.05 for the pooled estimate
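A minimal sketch of the Χ² (Cochran's Q) heterogeneity test. The study estimates (log odds ratios) and standard errors below are invented for illustration, not taken from any of the reviews cited here.

```python
# Cochran's Q test for heterogeneity across studies in a meta-analysis.
# Study effects (log odds ratios) and standard errors are hypothetical.
import numpy as np
from scipy.stats import chi2

log_or = np.array([-0.45, -0.20, -0.60, 0.10, -0.35])   # assumed study estimates
se     = np.array([ 0.20,  0.15,  0.25, 0.18,  0.22])   # assumed standard errors

w = 1 / se**2                               # inverse-variance weights
pooled = np.sum(w * log_or) / np.sum(w)     # fixed-effect pooled estimate
Q = np.sum(w * (log_or - pooled)**2)        # Cochran's Q (the Χ² statistic)
df = len(log_or) - 1
p = chi2.sf(Q, df)                          # low p suggests heterogeneity
i_squared = max(0.0, (Q - df) / Q) * 100    # I²: % of variation beyond chance

print(f"Q = {Q:.2f} on {df} df, p = {p:.3f}, I² = {i_squared:.0f}%")
```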
Normal (Parametric) Distribution

Which of the following is an example of a normal
distribution?
When a distribution is negatively skewed, the mean will be
higher/lower/the same as the median?

A. Higher
B. Lower
C. Same
Book prices in UK: a "Recommended retail price" is generally the
modal price, and virtually nowhere would you have to pay more. But
some shops will discount, and a few will discount heavily.

Age at retirement: most people retire at 65-68 which is when the state
pension kicks in, very few people work longer, but some people retire
in their 50s and quite a lot in their early 60s.
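To see the mean/median relationship in a skewed sample, here is a small sketch with made-up "age at retirement"-style values (negatively skewed, as in the example above): the handful of early retirees pulls the mean below the median.

```python
# Negatively skewed data: most values high, a few very low.
# Made-up "age at retirement" values, purely for illustration.
import statistics

ages = [66, 67, 65, 66, 68, 67, 66, 65, 67, 66, 58, 55, 52, 61, 63]

print("mean   =", round(statistics.mean(ages), 1))   # 63.5 - pulled down by early retirees
print("median =", statistics.median(ages))           # 66 - stays at the typical value
```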
For a normal distribution, 68% of the observations are within +/- one
standard deviation of the mean, 95% are within +/- two standard
deviations, and 99.7% are within +/- three standard deviations.
1. Normal distributions are
symmetric around their mean.
2. The mean, median, & mode of a
normal distribution are equal.
3. The area under the normal curve
is equal to 1.0.
4. Normal distributions are denser
in center, less dense in tails.
5. Normal distributions defined by
two parameters, the mean (μ)
and the standard deviation (σ).
6. 68% of the area of a normal
distribution is within one standard
deviation of the mean.
7. Approx 95% of the area of a
normal distribution is within two
standard deviations of the mean.
The standard deviation summarises the amount by which every value
in a dataset varies from the mean. It indicates how tightly the values
are bunched around the mean. It is the most widely used
measure of dispersion since, unlike the range and inter-quartile range,
it takes into account every value in the dataset.
100 subjects’ ages are normally distributed with a mean of 41 years and
a standard deviation of 4 years. Select the true statement
A. At least 5 subjects will be older than 49
B. 16 subjects will be below 37 years of age
C. 75 members of the study will be aged between 37 and 45
D. 50 of the subjects will be between 39 and 43 years of age
E. 50% of the cohort will be between the ages of 37 and 45
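The options can be checked against the normal distribution directly (mean 41, SD 4, n = 100); an illustrative sketch. Only one option matches the expected counts.

```python
# Check each option against N(mean=41, sd=4) for a cohort of 100.
from scipy.stats import norm

n, mu, sd = 100, 41, 4

print("older than 49 (beyond +2 SD):", round(n * norm.sf(49, mu, sd)))    # ~2
print("below 37 (below -1 SD):      ", round(n * norm.cdf(37, mu, sd)))   # ~16
print("between 37 and 45 (+/-1 SD): ", round(n * (norm.cdf(45, mu, sd)
                                                  - norm.cdf(37, mu, sd))))  # ~68
print("between 39 and 43:           ", round(n * (norm.cdf(43, mu, sd)
                                                  - norm.cdf(39, mu, sd))))  # ~38
```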
[Example distributions: age at dementia diagnosis; length of hospital stay]

A positive skew distribution is one in which there are many values of a low magnitude
and a few values of extremely high magnitude, while a negative skew distribution is one
in which there are many values of a high magnitude with a few values of very low
magnitude.
Kurtosis is all about the tails of the distribution — not the peakedness or flatness. It is used to describe the extreme values in one
versus the other tail. It is actually the measure of outliers present in the distribution.

High kurtosis indicates that the data have heavy tails or outliers. If kurtosis is high, we need to investigate why there are so many
outliers: they may reflect data-entry errors or genuine extreme values. Investigate!

Low kurtosis indicates that the data have light tails or a lack of outliers. If low kurtosis looks "too good to be true", investigate
before trimming the dataset of unwanted results.

Mesokurtic: similar to the normal distribution; the extreme values of the distribution behave like those of a normal
distribution. On this definition the standard normal distribution has a kurtosis of 3.

Leptokurtic (Kurtosis > 3): Distribution is longer and its tails are fatter. The peak is higher and sharper than mesokurtic, and the data are
heavy-tailed, with a profusion of outliers. Outliers stretch the horizontal axis of the histogram, so the bulk of the data appear in a
narrow (“skinny”) vertical range, giving the “skinny” look of a leptokurtic distribution.

Platykurtic (Kurtosis < 3): Distribution is shorter and its tails are thinner than the normal distribution. The peak is lower and broader
than mesokurtic, which means the data are light-tailed or lack outliers.
This is because extreme values are less frequent than in a normal distribution.
Parametric tests assume data is drawn from a population with
a normal distribution.

They can be used with continuous data from a non-normal
population if the criteria below are met:

Parametric analysis    Sample size guideline for non-normal data
1-sample t test        Greater than 20
2-sample t test        Each group should be greater than 15
One-Way ANOVA          If 2-9 groups, each group should be greater than 15;
                       if 10-12 groups, each group should be greater than 20

Parametric tests usually have more statistical power than
nonparametric tests – you are more likely to detect a
significant effect when one truly exists.
 Parametric and nonparametric are two broad classifications of statistical procedures.
 Parametric tests are based on assumptions about the distribution of the underlying population from which the sample was
taken.
 The most common parametric assumption is that data are approximately normally distributed.
 Nonparametric tests do not rely on assumptions about the shape or parameters of the underlying population distribution.
 If the data deviate strongly from the assumptions of a parametric procedure, using the parametric procedure could lead to
incorrect conclusions.
 You should be aware of the assumptions associated with a parametric procedure and should learn methods to evaluate the
validity of those assumptions.
 If you determine that the assumptions of the parametric procedure are not valid, use an analogous nonparametric
procedure instead.
 The parametric assumption of normality is particularly worrisome for small sample sizes (n < 30). Nonparametric tests are
often a good option for these data.
 It can be difficult to decide whether to use a parametric or nonparametric procedure in some cases. Nonparametric
procedures generally have less power for the same sample size than the corresponding parametric procedure if the data
truly are normal. Interpretation of nonparametric procedures can also be more difficult than for parametric procedures.
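As a sketch of how this choice can look in practice, the snippet below checks normality with a Shapiro–Wilk test and then runs either a two-sample t test or a Mann–Whitney U test. The data are simulated, and the 0.05 cut-off and choice of tests are common conventions rather than a rule from these slides.

```python
# Choose a parametric or nonparametric two-group comparison after a
# normality check. Data are simulated for illustration only.
import numpy as np
from scipy.stats import shapiro, ttest_ind, mannwhitneyu

rng = np.random.default_rng(7)
group_a = rng.normal(50, 10, size=25)
group_b = rng.lognormal(mean=3.9, sigma=0.4, size=25)   # deliberately skewed

def looks_normal(x, alpha=0.05):
    """Shapiro-Wilk: a small p value suggests departure from normality."""
    return shapiro(x).pvalue > alpha

if looks_normal(group_a) and looks_normal(group_b):
    stat, p = ttest_ind(group_a, group_b)        # parametric
    test = "two-sample t test"
else:
    stat, p = mannwhitneyu(group_a, group_b)     # nonparametric
    test = "Mann-Whitney U test"

print(f"{test}: p = {p:.3f}")
```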
Indices of central tendency
& variability
Mean (Average) – Median (Middle) – Mode (Typical)
MEAN
- Easy to understand and calculate; influenced by outliers; includes every one of the values in
the data set; not actually one of the values (usually) – eg average no. of legs or eyes = 1.9

MEDIAN
- Middle score in an ascending set of data, good for odd numbers of scores, need to
average the middle two if even number; less affected by outliers and skew

MODE
- most frequent value; highest bar in a histogram or bar chart; ‘most popular’ score
- but not unique, there may be two modes, may not identify central tendency
Mean (Average) – Median (Middle) – Mode (Typical)
The more skewed the distribution, the greater the difference between the median and mean,
and the greater emphasis should be placed on using the median as opposed to the mean
If tests of normality show that the data are non-normal, it
is customary to use the median instead of the mean.

However, this is more a rule of thumb than a strict guideline. Sometimes, researchers wish to
report the mean of a skewed distribution if the median and mean are not appreciably different
(a subjective assessment), and if it allows easier comparisons to previous research to be made.

Type of Variable                Best measure of central tendency
Nominal                         Mode
Ordinal                         Median
Interval/Ratio (not skewed)     Mean
Interval/Ratio (skewed)         Median
Range – IQR – Variance – Standard Deviation
Prerequisites: percentiles, distributions, measures of central tendency

RANGE - a single number representing the spread of the data


The highest score minus the lowest score.

IQR - a measure of variability, based on dividing a data set into quartiles


The range of the middle 50% of the scores in a distribution

Variance - a number indicating how spread out the data is


The average squared difference of the scores from the mean

Standard Deviation - a number representing how far from the average each score is
The square root of the variance
Range – IQR – Variance – Standard Deviation

RANGE: what is the range of 2, 4, 6, 8 ? 8–2=6

IQR: what is the interquartile range of these numbers:


12 13 14 15 9 10 16 10 8 10 11 12 13 22 23 24 25 ?
25th% = 10, 75th% = 19, 19 - 10 = 9

Variance: Would the variance of 10, 12, 17, 20, 25, 27, 42, and 45 be larger if
the numbers represented a population or a sample?
Larger as a sample: the sample variance divides by N-1 (instead of N), giving a larger value.

Standard Deviation: What is the standard deviation of this sample:


13 9 6 15 12 15 7 15? 3.7033
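These worked answers can be checked quickly in code. Quartile conventions vary between packages, so the IQR below uses the same "median of each half" method as the example above.

```python
# Check the range, IQR and sample standard deviation answers above.
import statistics

# Range
values = [2, 4, 6, 8]
print("range =", max(values) - min(values))        # 6

# IQR using the "median of the lower half / upper half" method
data = sorted([12, 13, 14, 15, 9, 10, 16, 10, 8, 10, 11, 12, 13, 22, 23, 24, 25])
half = len(data) // 2
q1 = statistics.median(data[:half])                # 10
q3 = statistics.median(data[-half:])               # 19
print("IQR =", q3 - q1)                            # 9

# Sample standard deviation (divides by N-1)
sample = [13, 9, 6, 15, 12, 15, 7, 15]
print("sd =", round(statistics.stdev(sample), 4))  # 3.7033
```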
Library Service Web Sites
 ekhuft.nhs.uk/libraries
 netvibes.com/ekhuftlibrary
 flickr.com/photos/eastkenthospitals/sets
 twitter.com/EKHTlibraries

Mark Kerr, Clinical Librarian


mark.kerr@nhs.net
