Biostatistics of HKU MMEDSC Session10handoutprint3
Statistics in practice
CMED6100 – Session 10
ST Ali
20 November 2021
sli.do/#hkubiostat21
Outline
Module aims
Module objectives
After completing this module, students will be able to:
6. Perform power and sample size calculations for one- and two-group
studies.
Part I
Presenting information
Rounding
Rounding the column percentages to one decimal place makes the comparison more difficult.

Table: Proportion of lung cancer cases and healthy men with different smoking habits.
Smoking habits    Lung cancer cases (n = 86)    Healthy men (n = 86)
Heavy smoker      65.1%                         36.0%
Light smoker      31.4%                         47.7%
Non-smoker        3.5%                          16.3%

Rounding to the nearest percent is sufficient here.

Table: Proportion of lung cancer cases and healthy men with different smoking habits.
Smoking habits    Lung cancer cases (n = 86)    Healthy men (n = 86)
[table values not recovered]
Percentages have been rounded so sums may not total.

... figures, we would change the “3%” to “3.5%”.
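The effect of the two rounding choices can be sketched in a few lines of Python. The counts below are an assumption: they are back-calculated from the percentages in the one-decimal table (n = 86 per group) and are not given in the handout.

```python
# Counts are an assumption: back-calculated from the percentages in the
# one-decimal table (n = 86 per group), not given in the handout.
counts = {
    "Lung cancer cases": {"Heavy smoker": 56, "Light smoker": 27, "Non-smoker": 3},
    "Healthy men": {"Heavy smoker": 31, "Light smoker": 41, "Non-smoker": 14},
}


def column_percentages(group_counts, digits):
    """Return each category as a percentage of the column total."""
    total = sum(group_counts.values())
    result = {}
    for category, count in group_counts.items():
        pct = 100 * count / total
        # round(pct) gives a whole number; round(pct, digits) keeps decimals.
        result[category] = round(pct) if digits == 0 else round(pct, digits)
    return result


one_decimal = {g: column_percentages(c, 1) for g, c in counts.items()}
whole_percent = {g: column_percentages(c, 0) for g, c in counts.items()}

for group in counts:
    print(group, one_decimal[group], whole_percent[group])
```

With these assumed counts, the whole-percent column for lung cancer cases sums to 99 rather than 100, which is why a note such as "sums may not total" is needed.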
ST Ali CMED6100 – Session 10 Slide 6
Formatting tables
Improved?
We have rounded, reordered the rows, and removed the gridlines. Patterns in
the data are clearer.
Successful graphs
[Figures: notification rate by age group (65-69, 70-74, 75-79, 80-84, 85 or over), redrawn several times with different axis ranges and different legend or label placement.]
[Figures: proportion not infected by day since intervention (0 to 8) for the Control, Hand hygiene and Mask+HH groups, redrawn with different y-axis ranges and label placement; the final version plots the proportion infected instead.]
Figure: Life expectancy by health expenditures per capita, 2007. Health expenditures are total (public and private), in PPP-converted US dollars. Data source: OECD.

Figure: Life expectancy by health expenditures per capita, 1970-2008. The data points are years. The other countries are Australia, Austria, Belgium, Canada, Denmark, Finland, France, Germany, Ireland, Italy, Japan, the Netherlands, New Zealand, Norway, Portugal, Spain, Sweden, Switzerland, and the United Kingdom. Data source: OECD.
Bump plots
Bump plots are similar to line graphs:
Figure: Tuberculosis notifications per 100,000 population in Hong Kong
by age group, 2012-2018.
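[Bump plot: one line per age group joining its 2012 value (left) to its 2018 value (right). Labels that survive: 2012: 85 or over 322, 80−84 296, 70−74 164; 2018: 85 or over 247, 75−79 155.]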
An example
These numbers are from an article published by The Lancet Publishing Group (for personal use; only reproduce with permission).
Table: Estimates of relative survival rates, by cancer site. Columns: Relative survival rate, % (SE). [table values not recovered]
These data can be graphically presented in a ‘bump plot’ ...
Figure: The 5-, 10-, 15-, and 20-year relative survival rates for various cancers.
(1) is at least 2 of fever ≥37.8 °C, cough, headache, sore throat, aches or pains in muscles or joints.
Part II
[Figure: density of a Normal(20, 1) distribution, plotted from 17 to 23.]
[Figure: histogram of Y (frequency), plotted from 17 to 24.]
Probability distributions Inference Comparing groups Choice of statistical methods
[Figure: two curves with spreads labelled s.d. and s.e.; the axis is marked at µ.]
Most observations will fall within ±2SD of the mean.
Mean ±2SE gives a 95% confidence interval for the mean.
[Figure: sampling distribution of X̄ centred on µ, with 2.5% of the area in each tail beyond µ − 1.96σ/√n and µ + 1.96σ/√n. An observed X̄ (●) is shown with the interval from X̄ − 1.96σ/√n to X̄ + 1.96σ/√n.]
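The distinction between s.d. and s.e., and the 95% interval built from the s.e., can be sketched as follows. The sample values are hypothetical and made up for illustration. The interval uses the Normal multiplier 1.96; for a sample this small a t multiplier would usually be preferred.

```python
import math
import statistics

# Hypothetical sample, made up for illustration only.
sample = [18.9, 19.4, 19.8, 20.0, 20.1, 20.3, 20.6, 21.0, 21.4, 19.5]

n = len(sample)
mean = statistics.mean(sample)
sd = statistics.stdev(sample)   # spread of individual observations
se = sd / math.sqrt(n)          # spread of the sample mean

# Normal-approximation 95% confidence interval for the mean.
ci_low = mean - 1.96 * se
ci_high = mean + 1.96 * se

print(f"mean = {mean:.2f}, SD = {sd:.2f}, SE = {se:.2f}")
print(f"95% CI: {ci_low:.2f} to {ci_high:.2f}")
```

The s.e. shrinks as n grows, whereas the s.d. describes the data themselves and does not.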
Comparing groups
[Figure: two sample means, x̄1 and x̄2, each marked with a point (●).]
Null hypothesis – assume both groups are samples from the same
distribution. What is the chance of getting a difference x̄1 − x̄2 as
unusual or more unusual than the difference observed?
Comparing groups
Under the null hypothesis, x̄1 − x̄2 will have a Normal distribution
with mean 0 and variance σ₁²/n₁ + σ₂²/n₂.
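A minimal sketch of this calculation, assuming the standard deviations are known (or the samples are large enough to treat them as known). The summary statistics are hypothetical, and the function name `two_sample_z` is introduced here for illustration only.

```python
import math
from statistics import NormalDist


def two_sample_z(mean1, mean2, sd1, sd2, n1, n2):
    """Return the standardised difference and its two-sided p-value."""
    # Standard error of the difference: sqrt(sd1^2/n1 + sd2^2/n2).
    se_diff = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    z = (mean1 - mean2) / se_diff
    # Probability of a difference at least this extreme in either direction.
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided


# Hypothetical summary statistics, made up for illustration only.
z, p = two_sample_z(mean1=21.0, mean2=20.0, sd1=2.0, sd2=2.0, n1=50, n2=50)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

Here the standard error of the difference is 0.4, so a difference of 1.0 is 2.5 standard errors from zero.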
[Figure: histograms (frequency) of % of member states in Africa, plotted from 0 to 80; one panel is labelled X=10.]
[Plot: sampling distribution of X̄1 − X̄2, x-axis from −6 to 6.]
Figure: An observed standardised difference of 5 is at the extremes of the
sampling distribution under the null hypothesis.
[Plot: sampling distribution of X̄1 − X̄2, x-axis from −6 to 6.]
Figure: If the null hypothesis were true, i.e. no difference between means, it would be very unusual to observe such a large difference (whether less than −5 or greater than 5). We would only observe such a large difference in 1% of repeated experiments.
• Small p-values, indicating that observed differences are unlikely under the
null hypothesis, are usually taken as evidence against the null hypothesis
• A common threshold is p < 0.05; in that case p-values less than 0.05 are
called ‘statistically significant’.
a Methods for these kinds of data are outside the scope of this course.
b Methods for testing a hypothesis about a single variable or paired difference include the 1-sample t-test, paired t-test, and the Wilcoxon signed rank test.
∗ Methods for these kinds of data are outside the scope of this course.
Part III
[Figure: all-cause death rates per 10 000 man-years by age group (20 to 70); one series is labelled Unfit −> Fit.]
Criticisms by Williams∗
• Blair only took one baseline measurement of fitness (and one
measurement at follow-up).
• What if a particular patient was feeling more energetic than
usual, on the day of his test?
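The concern behind these criticisms can be illustrated with a small simulation. It is a sketch under made-up assumptions (all means, standard deviations and the "lowest fifth" cut-off are hypothetical and not taken from Blair or Williams): every man's true fitness is unchanged between visits, yet men classified as unfit from a single noisy baseline test appear to improve at follow-up.

```python
import random
import statistics

rng = random.Random(2003)  # fixed seed so the sketch is reproducible

# All parameters below are made up for illustration only.
TRUE_MEAN, TRUE_SD = 15.0, 3.0   # "true" treadmill duration (minutes)
ERROR_SD = 2.0                   # day-to-day measurement error
N = 20_000

true_level = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
baseline = [t + rng.gauss(0, ERROR_SD) for t in true_level]
follow_up = [t + rng.gauss(0, ERROR_SD) for t in true_level]  # no true change

# Classify men as "unfit" from the single baseline measurement (lowest fifth).
cutoff = sorted(baseline)[N // 5]
unfit = [i for i in range(N) if baseline[i] < cutoff]

mean_baseline = statistics.mean(baseline[i] for i in unfit)
mean_follow_up = statistics.mean(follow_up[i] for i in unfit)

print(f"'Unfit' group at baseline:  {mean_baseline:.1f} minutes")
print(f"Same men at follow-up:      {mean_follow_up:.1f} minutes")
```

The apparent improvement arises only because the "unfit" group is enriched for men who happened to test below their true level on the day of the baseline test.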
[Figure: ‘true’ level versus observed level of baseline fitness, measured via treadmill test duration (minutes).]
∗ Williams PT. The illusion of improved physical fitness and reduced mortality.
Medicine & Science in Sports & Exercise. 2003; 35(5): 736-40.
Errors in assessment Misleading presentation Infographics Dishonest presentation
[Figure: iceberg of disease drawn two ways; labelled sections include GI consulters (10%) and Non-consulting GER subjects (46%).]
The iceberg of disease is a great concept. However, it is not well suited to displaying
quantitative information. To be correct, the area (not the height) of each section
should be proportional to the percentage of interest. The correct version is on the
right-hand side.
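The difference between scaling by height and scaling by area can be worked through for a simple case. This sketch assumes the iceberg is drawn as a triangle with its apex at the top, which is a simplification. The 10% and 46% shares are from the slide; the middle 44% is an assumption (the remainder) used only to complete the example.

```python
import math

# Shares of the whole, listed from the tip of the iceberg downwards.
# 10% and 46% are from the slide; the middle 44% is an assumed remainder.
shares = [
    ("GI consulters", 0.10),
    ("Middle section (assumed)", 0.44),
    ("Non-consulting GER subjects", 0.46),
]

TOTAL_HEIGHT = 100.0  # arbitrary drawing units


def section_depths(shares, total_height):
    """Depth of each section's lower edge in a triangle with its apex on top.

    The area above depth h is proportional to h**2, so the boundary at
    cumulative share c sits at depth total_height * sqrt(c).
    """
    depths, cumulative = [], 0.0
    for _, share in shares:
        cumulative += share
        depths.append(total_height * math.sqrt(cumulative))
    return depths


depths = section_depths(shares, TOTAL_HEIGHT)
for (label, share), depth in zip(shares, depths):
    print(f"{label}: share {share:.0%}, lower edge at depth {depth:.1f}")
```

Under these assumptions the top 10% of the area takes up roughly the top 32% of the height, so drawing that section with 10% of the height would badly understate it.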
[Figure: horizontal bar chart with bars including Primary care consulters and Non-consulting GER subjects.]
The horizontal bar chart can still give the general idea of a (half-) iceberg
shape, and this time the quantitative interpretation is correct.
Infographics
Source: http://www.forbes.com/sites/matthewherper/2013/02/19/a-graphic-that-drives-home-how-vaccines-have-changed-our-world/
“Information graphics or infographics are graphic visual representations of information, data or knowledge intended to present complex information quickly and clearly” (Wikipedia).
Source: Adapted from: Public Health Agency of Canada, Figure 8 – Measles Reported Incidence, Canada. http://www.phac-aspc.gc.ca/publicat/cig-gci/p04-meas-roug-eng.php
[Figure: Trends in air pollution. Concentration (µg/m³) by year, 1999 to 2010, with a single fitted line y = 0.4182x − 770.55, R² = 0.0159.]
[Figure: Trends in air pollution. Concentration (µg/m³) by year, 1999 to 2010, with two fitted lines, y = 6x − 11945 and y = −4.2286x + 8555; the R² labels are 0.686 and 0.6754.]
... in recent years, the second figure was used to argue that pollutant levels are ...
• If air pollutant levels remained similar from year to year, then variation in pollutant levels could not be explained by time. If the correlation is low, the R² would be low, but this does not imply that a horizontal line does not fit the data ...
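The point about R² can be checked numerically. The concentrations below are hypothetical, made up to fluctuate around 60 with no trend; they are not the data plotted on the slides.

```python
years = list(range(1999, 2011))
# Hypothetical concentrations (µg/m³), made up to fluctuate around 60
# with no trend; they are not the data plotted on the slides.
conc = [60, 62, 58, 61, 59, 60, 62, 58, 61, 59, 60, 60]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(conc) / n

sxx = sum((x - mean_x) ** 2 for x in years)
syy = sum((y - mean_y) ** 2 for y in conc)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, conc))

# Least-squares line and the proportion of variance it explains.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
r_squared = sxy**2 / (sxx * syy)

# Typical distance of the points from the horizontal line y = mean_y.
sd_about_mean = (syy / (n - 1)) ** 0.5

print(f"slope = {slope:.3f} per year, R^2 = {r_squared:.3f}")
print(f"mean = {mean_y:.1f}, SD about the mean = {sd_about_mean:.2f}")
```

R² is close to zero because year explains almost none of the variation, yet every point lies within about 2 µg/m³ of the horizontal line at 60, so the flat line describes these data well.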
Part IV
Practical issues
Practical 3 scenario
Suppose you would like to test H0 : OR = 1 at α = 0.05 (two-sided) in a case-control
study. The prevalence of exposure in the control population is assumed to be 25%.
You can request funding from a local agency, but the budget must be no higher than
$120,000. The cost of recruiting a case is $400, while controls are easier to find and
recruit and will only cost $200 each. The following table shows the power of
alternative possible study designs to detect odds ratios of 1.5, 1.8 and 2.0:
            Case-to-control ratio
            2:1      1:1      1:2      1:4      1:8
OR = 2.0    0.781    0.875    0.879    0.802    0.630
OR = 1.8    0.620    0.737    0.746    0.653    0.484
OR = 1.5    0.314    0.404    0.416    0.349    0.247
The 1:2 design with 150 cases and 300 controls has the highest power, and has power
of 75% and 88% to detect ORs of 1.8 and 2.0 respectively.
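The sample sizes that the budget allows under each design follow from simple arithmetic, sketched below. The costs, budget and ratios are from the scenario; the function name `design_size` and the idea of buying whole "sets" of cases and controls are choices made for this illustration. The power values in the table are not recomputed here.

```python
BUDGET = 120_000
COST_CASE = 400
COST_CONTROL = 200

# Case-to-control ratios from the table, written as (cases, controls).
ratios = [(2, 1), (1, 1), (1, 2), (1, 4), (1, 8)]


def design_size(cases_per_set, controls_per_set):
    """Largest design made of whole sets that fits within the budget."""
    cost_per_set = cases_per_set * COST_CASE + controls_per_set * COST_CONTROL
    sets = BUDGET // cost_per_set
    return sets * cases_per_set, sets * controls_per_set


designs = {f"{a}:{b}": design_size(a, b) for a, b in ratios}
for label, (cases, controls) in designs.items():
    cost = cases * COST_CASE + controls * COST_CONTROL
    print(f"{label}: {cases} cases, {controls} controls, cost ${cost:,}")
```

Each design spends the full $120,000; the 1:2 design gives the 150 cases and 300 controls quoted above.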
• Many alternatives:
• Complete case analysis
– Exclude all subjects with missing data on any variable of
interest
• Pairwise exclusion
– For each analysis, only exclude subjects with missing data on the
variables used in that analysis.
• The two best choices are the complete case analysis and
multiple imputation.
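The difference between complete case analysis and pairwise exclusion can be sketched on a toy dataset. The records and variable names below are hypothetical and made up for illustration; multiple imputation is not shown.

```python
# Hypothetical records, made up for illustration; None marks a missing value.
records = [
    {"age": 54, "bmi": 23.1, "sbp": 128},
    {"age": 61, "bmi": None, "sbp": 141},
    {"age": None, "bmi": 27.4, "sbp": 135},
    {"age": 47, "bmi": 25.0, "sbp": None},
    {"age": 58, "bmi": 29.3, "sbp": 150},
]


def complete_cases(rows, variables):
    """Keep only rows with no missing value on any of the listed variables."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


# Complete case analysis: one common subset used for every analysis.
complete = complete_cases(records, ["age", "bmi", "sbp"])

# Pairwise exclusion: each analysis keeps every row complete on its own variables.
age_sbp = complete_cases(records, ["age", "sbp"])
bmi_sbp = complete_cases(records, ["bmi", "sbp"])

print(f"Complete cases: n = {len(complete)}")
print(f"Age vs SBP (pairwise): n = {len(age_sbp)}")
print(f"BMI vs SBP (pairwise): n = {len(bmi_sbp)}")
```

Pairwise exclusion keeps more subjects per analysis, but different analyses are then based on different subsets, so the sample size used should be reported for each one.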
Data management
• https://www.youtube.com/watch?v=N2zK3sAtr-4
• Save your raw data and cleaned dataset, and document all changes
made during the cleaning process.
• Consider including dates or version numbers in your dataset
filenames.
• Document all of the steps taken in your analyses, including the
specific datasets used and the sample sizes included in each analysis.
• Software which allows you to write a series of commands is
particularly useful, as this ‘script’ or ‘syntax’ can be saved and used
again later to reproduce results.
Review
Further reading
• Altman DG, Bland JM. Missing data. BMJ, 2007; 334:424.
• Critical care series on medical statistics
http://ccforum.com/series/CC_Medical
• Statistics at square one http://www.bmj.com/statsbk/
• Peng RD, Dominici F, Zeger SL. Reproducible epidemiologic
research. Am J Epidemiol. 2006;163(9):783-9.
• Wicherts JM, Bakker M, Molenaar D. Willingness to Share
Research Data Is Related to the Strength of the Evidence and
the Quality of Reporting of Statistical Results. PLoS ONE,
2011; 6(11): e26828.
Sample size Missing data Replication Review
Course evaluation