
STATISTICAL FALLACIES AND ERRORS IN MEDICAL RESEARCH
By Muhammad Irfan Abdul Jalal
OUTLINE OF THE TALK
• Motivation
• Statistical Questions & Approaches
• Criticisms of frequentist methodology
• Bayesian Paradigm
• Recent Bayesian methodological developments
SAMPLE SIZE ISSUES
• Consequences of small sample sizes: Large treatment effects that are not replicable in
subsequent research (Pereira, Horwitz, Ioannidis 2012)
STUDY DESIGNS
• Biased samples
FAILURE TO CHECK FOR BIASES
• Selection bias and confirmation bias
• Simpson's Paradox, Berksonian Bias and the Hawthorne Effect
• Lead-time bias
P VALUE MISINTERPRETATIONS
• Part of the Null Hypothesis Significance Testing (NHST) paradigm.
• “Fathers” of NHST: Ronald Fisher, Egon S. Pearson (Karl Pearson’s son) and Jerzy Neyman.
• Followed Karl Popper’s philosophy of falsification: you cannot confirm the hypothesis that all swans are white, but you can falsify it by showing a single black swan.
• You can only REJECT or FAIL TO REJECT a null hypothesis (H0)
P VALUE MISINTERPRETATIONS
• Twelve misconceptions of p values (Goodman 2008):
P- HACKING
• Causes (Head et al. 2015):

conducting analyses midway through experiments to decide whether to


continue collecting data
recording many response variables and deciding which to report post-analysis
deciding whether to include or drop outliers post-analyses
excluding, combining, or splitting treatment groups post-analysis
including or excluding covariates post-analysis
 topping data exploration if an analysis yields a significant p-value

• Evidence : P Curve
P-CURVE
MULTIPLE COMPARISONS
• Example: stepwise regression modelling
• Problems:
 Bias in parameter estimation
 Inconsistencies among model selection algorithms
 Inappropriate focus on, or reliance upon, a single “best” model
• Remedies:
 Avoid stepwise regression; use more appropriate variable selection techniques such as the Lasso and Elastic Net (see the sketch below)
 For model fit assessment, use a separate data set (test set) or cross-validation (leave-one-out cross-validation, bootstrap CV)
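• A minimal sketch of the Lasso / Elastic Net remedy in R, assuming the glmnet package and hypothetical data (x: an n × p predictor matrix, y: a numeric outcome):

library(glmnet)
cv_lasso <- cv.glmnet(x, y, alpha = 1)     # alpha = 1: Lasso penalty
cv_enet  <- cv.glmnet(x, y, alpha = 0.5)   # 0 < alpha < 1: Elastic Net
coef(cv_lasso, s = "lambda.1se")           # sparse coefficients at the 1-SE cross-validated lambda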
OVER-INTERPRETATION OF NON-SIGNIFICANT
RESULTS
STATISTICAL MODEL ASSUMPTIONS
• Linear regression: residual checking for linearity, independence, normality and equal variances (LINE)
• Proportional hazards model: check the proportional hazards (PH) assumption using Schoenfeld residuals (both checks are sketched in R below)
• Diagnostics for influential points: plots of studentised residuals, deviance statistics, DFBETAs, etc.
• Nonparametric tests are based on the ranking of observations (the homogeneity-of-variance assumption is critical here)
• What are the alternatives when model assumptions are not fulfilled?
 Weighted least squares regression (when residuals are heteroscedastic), since the ordinary least squares estimator is then no longer the Best Linear Unbiased Estimator (BLUE)
 A time-varying covariate model (when the PH assumption is violated)
 Remove or retain influential observations? (This should be prespecified at the pre-analysis stage)
 Use of more robust statistical methods that are less reliant on statistical model assumptions
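• A minimal sketch of the two routine checks above in R, assuming a hypothetical data frame dat with the variables shown (coxph and cox.zph come from the survival package):

fit <- lm(bmi ~ age + sex, data = dat)
plot(fit)                      # residual diagnostics: linearity, normality (Q-Q), equal variance
library(survival)
cox_fit <- coxph(Surv(time, status) ~ treatment, data = dat)
cox.zph(cox_fit)               # Schoenfeld-residual test of the PH assumption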
CORRELATION AND CAUSATION
CIRCULAR INFERENCE
• Also known as “double dipping”
STATISTICAL ISSUES IN MACHINE
LEARNING
STATISTICAL QUESTIONS
• What is our estimate of the prevalence of diabetes in Malaysia?
(Point estimation problem)
• What is a plausible range of values for our estimated prevalence of
diabetes in the Malaysian population to reflect our degree of
uncertainty? (Interval estimation / uncertainty quantification)
• What is the probability that a future patient with a BMI of 27, aged 56, and with a history of gestational diabetes is diabetic? (Prediction problem)
• Is the prevalence of diabetes rising? (hypothesis testing problem)
STATISTICAL APPROACHES
• Frequentist and Bayesian methods address all these types of
problems: point estimation, interval estimation, prediction and
hypothesis testing. But they utilize different approaches to do so.
• Some typical frequentist approaches:
-Least squares method (point estimation)
-Maximum likelihood estimation (point estimation)
-Confidence interval (interval estimation)
-Test statistics and p values (hypothesis testing)
• Bayesian statistics does not use these familiar frequentist devices.
BAYESIAN STATISTICS: A VERY BRIEF
HISTORY
• The work of Reverend Thomas Bayes (1702-1761) was published in 1763, two years after his death.
• Bayes’ solution to a problem of “inverse probability” was presented in An Essay towards Solving a Problem in the Doctrine of Chances (1763), published posthumously by his friend Richard Price in the Philosophical Transactions of the Royal Society of London.
• This work gives the key result in Bayesian statistics: BAYES’ THEOREM
• Over the course of the next 100-150 years, it received little attention.
• In fact, some key figures in statistics, e.g. R.A. Fisher, outright rejected the idea of Bayesian statistics.
• During WW2, some of the world’s leading mathematicians resurrected Bayes’ rule in deepest secrecy to crack the coded messages of the Germans.
THOMAS BAYES (PILFERED FROM
WIKIPEDIA)
BAYESIAN STATISTICS: A BRIEF HISTORY
• Alan M. Turing (1912-1954): mathematician working at Bletchley Park (portrayed by Benedict Cumberbatch in The Imitation Game)
• Designed the bombe, an electro-mechanical machine for testing every possible permutation of a message produced on the Enigma machine; it could take up to 4 days to decode a message
• New system: Banburismus (named after Banbury, England, where the work was done), in which Bayesian methods [using Bayes factors] were used to quantify the belief in guesses of a stretch of letters in an Enigma message
• Certain permutations that were unlikely to be the original message were “thrown out” before they were even tested
• This greatly reduced the time it took to crack Enigma codes
ALAN TURING (WIKIPEDIA)
CRITICISMS OF FREQUENTIST APPROACH: BMI
EXAMPLE

• Suppose we are interested in whether or not there is any real difference between the BMIs of Fakulti Perubatan (FP) and Fakulti Sains Kesihatan (FSK) students and staff (personnel).
• In a two-sample t test, we often test the null hypothesis (H0) that there is no difference between the population means of the two groups:
H0: µFP = µFSK
• How can we test this hypothesis?
CRITICISMS OF FREQUENTIST APPROACH: BMI
EXAMPLE
• Suppose we obtain the following BMI values for a random sample of students and staff from both groups (values were generated using R):

Faculty                         BMI values
Fakulti Perubatan (FP)          23.1, 28.3, 35.3, 24.0, 32.4, 30.5, 30.1, 21.9, 22.1, 29.9, 27.2, 35.2
Fakulti Sains Kesihatan (FSK)   25.5, 27.3, 21.0, 23.3, 19.1, 22.2, 29.5, 27.5, 23.2

• From this we can obtain the descriptive summaries:
nFP = 12, nFSK = 9, x̄FP = 28.17, x̄FSK = 24.29, sFP = 4.54, sFSK = 3.40
• The pooled standard deviation is given by:
spooled = √[((nFP − 1)sFP² + (nFSK − 1)sFSK²) / (nFP + nFSK − 2)] = √[(11 × 4.54² + 8 × 3.40²) / 19] = 4.099
CRITICISMS OF FREQUENTIST APPROACH: BMI
EXAMPLE

• t = (x̄FP − x̄FSK) / (spooled × √(1/nFP + 1/nFSK)) = (28.17 − 24.29) / (4.099 × √(1/12 + 1/9)) = 2.147
• Using R, we obtain the p value = 0.0448 (based on 19 df).
• Therefore we reject H0 at the 5% level and conclude that there is a significant difference between the mean BMIs of these two populations. However, if we work at the 1% level, we have to retain H0!
• This illustrates the arbitrariness of fixing the level of significance at 0.05.
• Besides, what is the real interpretation of the p value?
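• A minimal R sketch of this pooled two-sample t test, using the sample values printed above:

fp  <- c(23.1, 28.3, 35.3, 24.0, 32.4, 30.5, 30.1, 21.9, 22.1, 29.9, 27.2, 35.2)
fsk <- c(25.5, 27.3, 21.0, 23.3, 19.1, 22.2, 29.5, 27.5, 23.2)
t.test(fp, fsk, var.equal = TRUE)   # pooled-variance test on 19 df; compare t and p with the slide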
R.A. FISHER’S TAKE ON P < 0.05
• In Statistical Methods and Scientific Inference (1956):
• “However, the calculation is absurdly academic, for in fact no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.” (Picture taken from https://makingscience.royalsociety.org/s/rs/people/fst00034451)
P VALUE INTERPRETATIONS: DOS AND DON’TS
• Pr (your data or more extreme data | your null hypothesis is TRUE)
• It indicates the incompatibility of your data with a prespecified
hypothesis (in this case, our NULL hypothesis) if only chance is the
cause of such discrepancy.
• P value misconceptions (Wasserstein and Lazar 2016):
 It does not measure the probability of the (null) hypothesis being true
 It does not measure the probability of your data being produced by random chance alone
 It does not, by itself, justify rejecting the null hypothesis when p < 0.05 (we should take into account other factors that drive scientific inference, such as the study design, the quality of measurement, and the validity of the assumptions used when analyzing the data)
BMI EXAMPLE: CONFIDENCE INTERVAL
• We could also obtain the 95% confidence interval for the population mean BMI of FP staff and students.
• We use x̄FP ± t(0.975, df = 11) × sFP / √nFP:
28.17 ± 2.201 × 4.54 / √12 = 28.17 ± 2.88, giving (25.28, 31.05)
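• The same interval as a quick R sketch (using the fp vector defined earlier):

mean(fp) + c(-1, 1) * qt(0.975, df = 11) * sd(fp) / sqrt(12)   # compare with (25.28, 31.05)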
BMI EXAMPLE: CONFIDENCE INTERVAL
INTERPRETATION

• What is Pr(25.28 < µFP < 31.05)? Is it 0.95?
• No! In the frequentist approach, population parameters (in this case µFP) are not random variables.
• This means µFP is a fixed (but unknown) quantity. So it is either in the interval or it is not, i.e.
Pr(25.28 < µFP < 31.05) = 1 OR Pr(25.28 < µFP < 31.05) = 0
• Equivalent intervals in the Bayesian framework (known as Bayesian credible intervals) have a more natural interpretation.
BMI EXAMPLE: CONFIDENCE INTERVAL
INTERPRETATION (TAN AND TAN 2010)
THE BAYESIAN PARADIGM: THE BMI
EXAMPLE
• As we shall see, Bayes’ theorem gives us a way of combining prior (often subjective) assessments with observed data.
• For example, what if, prior to observing the data, we believed that the FP students and staff were likely to have a considerably lower BMI than the FSK students and staff?
• What if, from a previous study, we knew that the FP personnel were quite likely to have a BMI somewhere between 25 and 33?
• Would it not be sensible to build this information into our analysis before forming our conclusions using the data alone?
• This could be a good idea: we have relatively small samples, which could be biased!
• A Bayesian analysis can do exactly this!
THE BAYESIAN PARADIGM: RATIONALE
• To be able to incorporate a person’s beliefs into an analysis of sample data is surely the right thing to do, especially if the person is an expert.
• This is in line with how scientific experiments are inductively carried out:
 The experimenter usually knows something
 The experimenter carries out the experiment in which data are collected
 The experimenter then updates his/her beliefs from these results
• In other words, the data are used to refine what the experimenter knows. In the Bayesian paradigm, we shall consider how:
 Prior beliefs can be combined with
 Experimental data to form
 Posterior beliefs, which combine both our prior knowledge (prior beliefs) and what we have learnt from the data.
BAYES THEOREM (A WEE BIT OF
PROBABILITY)
• P(A | B) = P(A ∩ B) / P(B) [1] [conditional probability]
• Similarly, P(B | A) = P(A ∩ B) / P(A); hence P(A ∩ B) = P(B | A) × P(A)
• Substitute P(A ∩ B) = P(B | A) × P(A) into [1]:
• P(A | B) = P(B | A) × P(A) / P(B) [Bayes’ theorem]
• By the Law of Total Probability, P(B) = Σj P(B | Aj) × P(Aj) over a partition {Aj}
• Hence, P(A | B) = P(B | A) × P(A) / [Σj P(B | Aj) × P(Aj)]
• However, since P(B) does not depend on A, we can absorb P(B) into the proportionality constant, and therefore:
• P(A | B) ∝ P(B | A) × P(A)
• Posterior ∝ Likelihood × Prior
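• A concrete sketch in R with hypothetical numbers (assumed prevalence, sensitivity and specificity, chosen purely for illustration):

prev <- 0.01; sens <- 0.90; spec <- 0.95        # assumed values, not from any real test
p_pos <- sens * prev + (1 - spec) * (1 - prev)  # Law of Total Probability: P(positive)
sens * prev / p_pos                             # Bayes' theorem: P(disease | positive) ≈ 0.15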
BAYES THEOREM (CONJUGATE PRIOR)
• Consider an expert dermatologist who is very competent at spot-diagnosing skin conditions (psoriasis, eczema, pityriasis rosea, etc.).
• Let θ = Pr(making a correct diagnosis).
• Based on our prior knowledge, θ is around 0.95, whilst Pr(θ < 0.8) is minuscule.
• Hence, we choose θ ~ Beta(77, 5) as our prior distribution:
• π(θ) = [Γ(82) / (Γ(77) Γ(5))] θ^76 (1 − θ)^4, 0 < θ < 1
• In an experiment, the dermatologist correctly diagnosed 9 out of 10 patients [data].
• Hence the likelihood of such an observation follows a Binomial distribution: (x | θ) ~ Bin(10, θ).
• The likelihood function is thus given by:
• f(x = 9 | θ) = C(10, 9) θ^9 (1 − θ)^1 = 10 θ^9 (1 − θ)
BAYES THEOREM (CONJUGATE PRIOR)
• The posterior distribution is then computed using Bayes’ theorem:
• π(θ | x = 9) ∝ π(θ) f(x = 9 | θ) = k θ^(76+9) (1 − θ)^(4+1) = k θ^85 (1 − θ)^5, 0 < θ < 1
• We recognise this posterior as a Beta distribution (to be more specific, Beta(86, 6)).
• The posterior distribution is from the same family as the prior distribution (i.e. both prior and posterior are Beta distributions): the prior is conjugate to the Binomial likelihood.
• We can use the posterior distribution to obtain the posterior estimate of the parameter and for inference purposes (e.g. to obtain a 95% Bayesian credible interval for θ).
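• A minimal R sketch of this conjugate update:

a <- 77; b <- 5                          # prior: Beta(77, 5)
x <- 9; n <- 10                          # data: 9 correct out of 10
a_post <- a + x; b_post <- b + (n - x)   # posterior: Beta(86, 6)
a_post / (a_post + b_post)               # posterior mean ≈ 0.935
qbeta(c(0.025, 0.975), a_post, b_post)   # 95% equal-tailed credible interval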
BAYES FACTOR
• A counterpart to the p value in the frequentist paradigm.
• Pr(H1 | x) / Pr(H0 | x) = [f(x | H1) / f(x | H0)] × [Pr(H1) / Pr(H0)]
• Posterior odds = Bayes factor (B10) × prior odds
• With H1 the alternative hypothesis and H0 the null hypothesis, the Bayes factor B10 can be interpreted as follows (Kass and Raftery 1995):

log10(B10)   B10          Evidence against H0
0 to 0.5     1 to 3.2     Not worth more than a bare mention
0.5 to 1     3.2 to 10    Substantial
1 to 2       10 to 100    Strong
>2           >100         Decisive
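• A quick sketch computing a Bayes factor for the dermatologist data, assuming a hypothetical point null H0: θ = 0.8 against H1: θ ~ Beta(77, 5) (the H0 value is chosen purely for illustration):

m1 <- choose(10, 9) * beta(77 + 9, 5 + 1) / beta(77, 5)   # marginal likelihood of x = 9 under H1
m0 <- dbinom(9, 10, 0.8)                                  # likelihood of x = 9 under H0: theta = 0.8
m1 / m0      # B10 ≈ 1.2: "not worth more than a bare mention"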
BAYESIAN PARADIGM: RECENT
DEVELOPMENTS
• One of the difficulties in early applications of Bayesian methodology was the maths.
• Combining prior beliefs with a probability model for the data often resulted in maths that was just too difficult, or impossible, to solve by hand, especially when the posterior distribution is a non-standard distribution.
• Then, in the early 1990s, a computer-intensive, simulation-based technique called Markov chain Monte Carlo (MCMC for short) was developed (Gelfand and Smith 1990).
• MCMC has revolutionized the use of Bayesian statistics, to the extent that Bayesian data analyses are routinely used by non-statisticians in fields as diverse as biology, civil engineering and social science.
• Two major MCMC algorithms: the Metropolis-Hastings algorithm (Metropolis et al. 1953; Hastings 1970) and Gibbs sampling (Geman and Geman 1984). A minimal sketch follows below.
• Other more recent developments: Integrated Nested Laplace Approximations (INLA), Approximate Bayesian Computation (ABC), Sequential Monte Carlo [beyond this talk!!]
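• A minimal random-walk Metropolis sketch in R, targeting the Beta(86, 6) posterior from the dermatologist example (chosen only for illustration: this posterior has a closed form, so MCMC is not actually needed here):

set.seed(1)
log_post <- function(theta) dbeta(theta, 86, 6, log = TRUE)   # log target density
n_iter <- 10000
theta <- numeric(n_iter); theta[1] <- 0.9                     # starting value
for (i in 2:n_iter) {
  prop <- theta[i - 1] + rnorm(1, sd = 0.02)                  # symmetric random-walk proposal
  accept <- prop > 0 && prop < 1 &&
    log(runif(1)) < log_post(prop) - log_post(theta[i - 1])   # Metropolis acceptance step
  theta[i] <- if (accept) prop else theta[i - 1]
}
mean(theta[-(1:1000)])                                        # posterior mean after burn-in
quantile(theta[-(1:1000)], c(0.025, 0.975))                   # compare with qbeta() interval above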
THANK YOU
SELECTED REFERENCES

• Bayes T. An Essay towards solving a Problem in the Doctrine of


Chances. Philosophical Transactions 1763; 53: 370–418
• Fisher RA 1956. Statistical Methods and Scientific Inference, Hafner
Publishing Co.
• Wasserstein RL, Lazar NA. The ASA Statement on p-Values: Context,
Process, and Purpose, The American Statistician 2016;70(2):129-133
• Tan SH, Tan SB. The Correct Interpretation of Confidence Intervals.
Proceedings of Singapore Healthcare. 2010;19(3):276-278.
• Kass EK, Raftery AE. Bayes Factor. Journal of the American Statistical
Association 1995;90(430):773-795
SELECTED REFERENCES

• Gelfand A, Smith AFM. Sampling-Based Approaches to Calculating


Marginal Densities. Journal of the American Statistical Association
1990;85(410):398-409.
• Geman S, Geman D. Stochastic Relaxation, Gibbs Distributions, and
the Bayesian Restoration of Images". IEEE Transactions on Pattern
Analysis and Machine Intelligence 1984;6(6):721–741.
• Metropolis N, Rosenbluth A, Rosenbluth M, Teller A, Teller
E. Equations of state calculations by fast computing machines. J.
Chem. Phys. 1953; 21(6):1087– 1092.
• Hastings W. Monte Carlo sampling methods using Markov chains and
their application. Biometrika 1970;57:97– 109.
