Hypothesis tests
CMED6100 – Session 5

ST Ali

School of Public Health


The University of Hong Kong

16 October 2021

sli.do/#hkubiostat21

Announcements

• Assignment S5 due by Wednesday 20 October 11:59pm


• Mid-term exam on 23 October (Sat) 9:30am.
– Bring a calculator (not the one on your mobile phone),
– a pen,
– and a pencil.

• Venue will be at LTs 3&4.

Mid-term exam

• The mid-term exam covers the material from Sessions 1–5

• The exam will take 105 minutes

• MCQs and SAQs, similar style to the assignments and practice questions

• Simple calculations will be required for some questions

• The only formula you need to remember is for the standard error. Any other formulas necessary for calculations will be provided in the exam paper.


Objectives
After the lecture, students should be able to:

• Describe the statistical tests for comparing associations between two categorical variables (chi-squared test, Fisher's exact test, McNemar's test)

• Describe the statistical tests for comparing the mean of a continuous variable, or the mean of paired differences, with a null value (one-sample t-test, paired t-test, Wilcoxon signed-rank test)

• Describe the statistical tests for comparing the means of a continuous variable between two groups (independent two-sample t-test, Mann-Whitney U test / Wilcoxon rank-sum test)

What will be assessed?


• Students will not be required to perform hypothesis tests ‘by
hand’.
• But students will be expected to be able to describe the basic
approach to hypothesis testing,
– For example, the construction of an ‘expected counts’ table for
the chi-squared test, and the comparison of differences
between observed and expected counts.

• Students will also be expected to correctly describe the relevant null hypotheses, and any key assumptions of the tests described in this lecture.

Statistical hypothesis: an assertion or statement about a population characteristic (e.g. μ)

Null hypothesis: the hypothesis of no difference (H0: μ = μ0)
Alternative hypothesis: the hypothesis complementary to the null hypothesis (H1 or HA: μ ≠ μ0)

Critical value: the cut-off for the test statistic, illustrated for testing H0: μ = μ0 against H1: μ = μ1 (> μ0)
Pr(Type I error) = α (the level of significance)
Pr(Type II error) = β (1 − β is the power of the test)


Recap of hypothesis testing


• Remember the ideas of sampling theory

• Under a null hypothesis we know what kinds of data might be produced

• We can compare our data to the kinds of data that we might expect if the
null hypothesis were true

• Small p-values are taken as evidence against the null hypothesis

[Figure: density of the difference in means under the null hypothesis, x-axis from −6 to 6]

E.g., if the null hypothesis were true, i.e. no difference between means, it would be very unusual to observe large differences in means (whether less than −5 or greater than 5). We would only observe such a large difference in 1% of repeated experiments.

One sample vs repeated samples

• In practice we typically collect data on one sample, and then use the data to estimate a p-value or a confidence interval.

• The interpretation of a p-value or confidence interval is based on the idea of multiple samples, and sampling theory.


Part I

Tests for categorical variables


Accidents example
In a study of occupational health in a local factory, research was
conducted to investigate whether all employees faced similar risk of
various types of accident. A total of 117 accidents were classified
by the age of the employee and the type of accident:

Accident type
Age Sprain Burn Cut
Under 25 9 17 5
25 or over 61 13 12

How can we investigate whether accident type is independent of age?



Accidents example

This kind of table is called a 'contingency table'. We can also add the row and column totals, or the 'marginal totals'.

                  Accident type
Age           Sprain   Burn   Cut   Total
Under 25           9     17     5      31
25 or over        61     13    12      86
Total             70     30    17     117


Pearson’s χ2 test – example


• If the two variables age and accident type are independent, we can
calculate the expected value for each entry, given the same marginal
totals. These are shown in the table below
Table: Expected counts assuming independence.

                  Accident type
Age           Sprain    Burn     Cut   Total
Under 25       18.55    7.95    4.50      31
25 or over     51.45   22.05   12.50      86
Total             70      30      17     117
For example, the expected number of sprains in employees aged under 25
is 31 × 70/117 = 18.55.

χ2 test – example

We can calculate a test statistic based on the differences between the observed
and expected values, as follows:
T = Σ_{i=1}^{n} (O_i − E_i)² / E_i
  = (9 − 18.55)²/18.55 + (17 − 7.95)²/7.95 + (5 − 4.50)²/4.50
    + (61 − 51.45)²/51.45 + (13 − 22.05)²/22.05 + (12 − 12.50)²/12.50
  = 4.92 + 10.30 + 0.06 + 1.77 + 3.71 + 0.02
  = 20.78
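
If you want to reproduce this calculation in R (R Commander drives the same functions through its menus), a minimal sketch along the following lines works; the object name accidents is illustrative, not part of the lecture material.

    # Accidents table: rows = age group, columns = accident type
    accidents <- matrix(c(9, 61, 17, 13, 5, 12), nrow = 2,
                        dimnames = list(Age = c("Under 25", "25 or over"),
                                        Type = c("Sprain", "Burn", "Cut")))
    # Expected counts under independence: (row total x column total) / grand total
    expected <- outer(rowSums(accidents), colSums(accidents)) / sum(accidents)
    expected                                  # 18.55, 7.95, 4.50 / 51.45, 22.05, 12.50
    sum((accidents - expected)^2 / expected)  # test statistic, about 20.8
    chisq.test(accidents)                     # same statistic on 2 df; may warn because
                                              # one expected count (4.5) is just below 5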


χ2 test – example

[Figure: density of the χ² distribution with 2 degrees of freedom, with the observed test statistic marked on the x-axis (0 to 25)]

This test statistic can be compared with a χ² (chi-squared) distribution with 2 degrees of freedom, and we find that it corresponds to a p-value < 0.001. There is strong evidence against the null hypothesis.


Degrees of freedom

• When performing a chi-squared test, we need to know the 'degrees of freedom' of the test. In a simple 2 × 2 table we would have values A, B, C and D:

              Treatment   Control   Total
Event                 A         B   A + B
Not event             C         D   C + D
Total             A + C     B + D   A + B + C + D


Degrees of freedom – cont

• To perform the chi-squared test, we hold the marginal totals constant. Therefore if we choose a specific value for A, we have no further choices to make. We cannot choose B or C because the marginal totals A + B and A + C are fixed.

• In general, for a chi-squared test on a contingency table with m rows and n columns, the degrees of freedom are (m − 1) × (n − 1). For the 2 × 3 accidents table this gives (2 − 1) × (3 − 1) = 2 degrees of freedom.


Assumptions of the χ2 test

• The χ² test requires few assumptions.

• The test is not appropriate if the expected frequencies are too low.

• It will normally be acceptable so long as no more than 10% of the cells have expected frequencies below 5.

• An alternative for these situations is Fisher's exact test.


Fisher’s exact test


Here is an example from a small clinical trial of treatment versus control

Treatment Control Total


Event 2 7 9
Not event 8 2 10
Total 10 9 19

In this case the expected counts in some cells are below 5 (e.g. 9 × 9/19 = 4.3).
Fisher's test examines all the possible tables with the same marginal totals and calculates how probable the observed table (or more extreme tables) would be, assuming independence.

Fisher’s exact test


All possible tables with the same marginal totals can be generated

0 9 1 8 2 7
10 0 9 1 8 2
p1 = 0.00001 p2 = 0.00097 p3 = 0.01754

3 6 4 5 5 4
7 3 6 4 5 5
p4 = 0.10912 p5 = 0.28643 p6 = 0.34372

6 3 7 2 8 1
4 6 3 7 2 8
p7 = 0.19095 p8 = 0.04676 p9 = 0.00438

9 0
1 9
p10 = 0.00011

p1 + p2 + p3 + · · · + p10 = 1.

Fisher’s exact test


The one-tailed probability would be the sum of the separate
probabilities for the arrays:
0 9 1 8 2 7
10 0 9 1 8 2
p1 = 0.00001 p2 = 0.00097 p3 = 0.01754
sum = p1 + p2 + p3 = 0.01852 is the 1-sided p-value
The 2-sided p-value is 0.023
R Commander automatically calculates the p-value using Fisher’s
test for contingency tables in which the assumptions of the χ2 test
may not be met.
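
As a rough sketch of what happens behind those menus (the object name trial is illustrative), the exact test and the individual table probabilities can be obtained in base R:

    trial <- matrix(c(2, 8, 7, 2), nrow = 2,
                    dimnames = list(Outcome = c("Event", "Not event"),
                                    Group = c("Treatment", "Control")))
    fisher.test(trial)                 # two-sided p-value about 0.023, as above
    # Each enumerated table has a hypergeometric probability, e.g. the observed
    # table (2 of the 9 events fall in the treatment group of size 10):
    dhyper(2, m = 9, n = 10, k = 10)   # about 0.0175, matching p3 above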

Matched samples

Sometimes we have 2 × 2 tables with matched samples. For example, in the study below 1319 patients with athlete's foot were given alternative treatments – treatment X on one foot and treatment Y on the other foot.

                                  Treatment Y
                           Cured   Not cured   Total
Treatment X    Cured         212         144     356
               Not cured     256         707     963
Total                        468         851    1319

The outcome of either treatment is likely to be correlated within the same patient, and we should not ignore this correlation when we analyse the data.


Matched samples

                                  Treatment Y
                           Cured   Not cured   Total
Treatment X    Cured         212         144     356
               Not cured     256         707     963
Total                        468         851    1319

Our null hypothesis is that the proportion cured by treatment X is the same as
the proportion cured by treatment Y.


McNemar’s test

In the general form, the contingency table is:

                           First measurement
Second measurement        Yes       No       Total
Yes                         A        B       A + B
No                          C        D       C + D
Total                   A + C    B + D       A + B + C + D

The null hypothesis is that A+B=A+C. Or we could flip the problem and
consider the null hypothesis that C+D=B+D. In fact both of these null
hypotheses can be rephrased more simply as B=C.


McNemar’s test
The test statistic is very easy to calculate, and depends only on the
two off-diagonal cells.

                           First measurement
Second measurement        Yes       No       Total
Yes                         A        B       A + B
No                          C        D       C + D
Total                   A + C    B + D       A + B + C + D

The test statistic is (B − C)² / (B + C). It follows a chi-squared distribution with 1 degree of freedom.

McNemar’s test – example

In our example, with treatment X and Y, the table was:

                                  Treatment Y
                           Cured   Not cured   Total
Treatment X    Cured         212         144     356
               Not cured     256         707     963
Total                        468         851    1319

The test statistic is (144 − 256)² / (144 + 256) = 31.4, which is highly significant (under the null hypothesis the test statistic follows a chi-squared distribution with 1 degree of freedom, whose 95th percentile is 3.84).
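
A minimal R sketch (the object name feet is illustrative); note that R's default continuity correction has to be switched off to match the formula on the slide:

    feet <- matrix(c(212, 256, 144, 707), nrow = 2,
                   dimnames = list("Treatment X" = c("Cured", "Not cured"),
                                   "Treatment Y" = c("Cured", "Not cured")))
    mcnemar.test(feet, correct = FALSE)   # statistic (144 - 256)^2 / (144 + 256) = 31.4 on 1 df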


Summary

• When testing independence between two categorical variables, we usually use the χ² test.

• With small sample sizes (when more than 10% of expected counts are less than 5), use Fisher's exact test instead.

• In matched samples use McNemar's test
  – For example, repeated measures on the same individuals, or two assessments of the same experimental unit


Part II

Tests for continuous variables


Example
Randomized trial of dietary intervention on 32 patients. The
weight in kg of all patients was measured at baseline and then
after 2 weeks. Is the intervention effective?

Table: Pretest and post-test weights of 32 patients.


Control Intervention
Subject Baseline Post-test Difference Baseline Post-test Difference
1 74.2 71.6 −2.6 78.9 73.1 −5.8
2 92.9 90.6 −2.3 73.2 67.5 −5.7
3 103.1 99.2 −3.9 74.2 66.5 −7.7
4 59.8 56.5 −3.3 120.7 114.8 −5.9
5 94.8 90.0 −4.8 88.1 81.4 −6.7
... (subjects 6–15 not shown) ...
16 88.0 84.6 −3.4 79.4 73.4 −6.0
Mean 85.3 82.6 −2.8 83.4 78.3 −5.1
SD 11.0 10.7 1.2 13.8 14.0 1.8


First, plot the data

[Figure: weight (kg) at pre-test and post-test, for the control and intervention groups]


Plot the data – weight changes


[Figure: weight difference (kg) from pre-test to post-test, for the control and intervention groups]


Is the intervention effective?

1. We could compare the post-test weights with the pre-test weights in the intervention group
   – Weight loss on intervention would prove effectiveness?

2. We could compare mean post-test weights in the two groups
   – Allocation to intervention was randomized, so the distribution of baseline weights should be similar in the two groups.

3. We could compare the mean weight loss between the intervention group and the control group?


Comparing the two groups

[Diagram: the three possible comparisons]
(1) Intervention group: pre-test weight vs post-test weight
(2) Control vs intervention group: post-test weights
(3) Control vs intervention group: weight changes (∆)

We can compare three different approaches to investigate the effectiveness of the intervention on weight loss.


Diet data

• First approach – focus on evidence of weight loss in the intervention group ...


Student's t-test
• The z-test illustrated in the previous session is valid in large samples, when the Central Limit Theorem ensures that the sampling distribution of the sample mean will follow a Normal distribution.

• If we want to compare the means of two groups in a smaller dataset, we cannot rely on the Central Limit Theorem.

• An alternative test is available provided that the original data follow a Normal distribution.

• This test relies on a theoretical result that when sampling from a Normal distribution, the sample mean (standardised using the sample standard deviation) follows a t distribution.

• The test was originally derived and published by William Gosset in 1908, under the pen name 'Student'.


The T distribution
Figure: The t-distribution has 'fatter' tails than the Normal distribution, but converges to the Normal distribution as the degrees of freedom increase.


Population distribution and sample mean distribution

Distribution of observations    Sample size    Distribution of sample mean
in the population
Normal                          Small n        t distribution with n − 1 df
Normal                          Large n        Normal (Central Limit Theorem)
Not normal                      Small n        Unclear
Not normal                      Large n        Normal (Central Limit Theorem)


Check normality of the diet data


[Figure: histogram (density scale) of the weight differences in the intervention group]


Paired t-test
We could look at the weight loss in the intervention group to test
whether there was any change from baseline. This is equivalent to
testing H0 : ∆t = 0 in the intervention group. This is a type of
“one sample t-test” since we are evaluating whether our sample
mean is the same as or different from a null value.

• The test statistic is calculated as
  T = (D̄_t − 0) / (s_t / √n) = −5.1 / (1.8 / √16) = −11.3, with degrees of freedom n − 1 = 15.

• We can derive a p-value < 0.001 and reject the null hypothesis.
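
The full data are not shown on the slide, but assuming the intervention group's weights were stored in two hypothetical vectors baseline and post, the test is a one-line call in R:

    t.test(post, baseline, paired = TRUE)   # paired t-test of H0: mean difference = 0
    t.test(post - baseline, mu = 0)         # equivalent one-sample t-test on the differences

Both calls give identical t, degrees of freedom and p-value.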

Paired t-test (cont’d)


The p-value = 2 × Pr(T ≤ t_c | T ∼ t_15).

[Figure: density of the t_15 distribution, with the critical value t_c marked in the lower tail]



Paired t-test vs 1-sample t-test

• The paired t-test is used on paired data, to test the null hypothesis that the difference between the paired measurements is zero.
  – Although we have paired measurements, when we subtract one from the other the result is a single variable.

• The one-sample t-test is used on a single sample, to test the null hypothesis that the mean is zero (if we want to test a null hypothesis of a mean with another value, it is very simple to modify the test).

• These two tests are therefore the same: a paired t-test is a one-sample t-test applied to the within-pair differences.


Wilcoxon signed-rank test


• The t-test relies on the assumption that the population distribution is Normal.

• With a large sample, the Central Limit Theorem applies and the t-test converges to a z-test, without requiring the population to follow a Normal distribution.

• When the assumption that the population follows a Normal distribution is violated, and we have a small sample, we could use the Wilcoxon signed-rank test as a non-parametric alternative to the paired t-test or one-sample t-test.

Wilcoxon signed-rank test


• The simple explanation of the Wilcoxon signed-rank test is
that it examines how many observations fall below the null
value (e.g. 0) and how many fall above, and then evaluates
how unlikely it would be for so many (or so few) observations
to fall above or below the null value.
• The detailed explanation is a bit more complicated because
the test also takes into account the magnitude of the
observations. See a textbook for technical details.
• The p-value for H0 : ∆t = 0 in the intervention group is
< 0.001.
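
Using the same hypothetical vectors baseline and post for the intervention group, the non-parametric alternative is a single call:

    # Wilcoxon signed-rank test of H0: the differences are centred at 0
    wilcox.test(post - baseline, mu = 0)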

Diet data

• First approach – focus on evidence of weight loss in the intervention group.
  – Problem – what about regression to the mean and placebo effects, so that the control group also lost weight?

• Second approach – compare post-test weights in the intervention and control group ...


Distribution of post-test weights

[Figure: histograms of post-test weight (kg) in each group. Control: mean = 82.6, SD = 10.7. Intervention: mean = 78.3, SD = 14.0]


t-test comparing post-test weights


We can use an independent two-sample t-test to compare the
post-test weights between two groups. Individuals were randomized
so the distribution of pre-test weights should be similar.
• Set the null hypothesis H0: μ_c = μ_t.

• The test statistic can be calculated as T = (X̄_c − X̄_t) / (s_p √(2/n)), where s_p = √((s₁² + s₂²)/2).

• In this example T = (82.6 − 78.3) / (√((10.7² + 14.0²)/2) × √(2/16)) = 0.98, with degrees of freedom 2n − 2 = 30. Then the (2-sided) p-value = 0.33 > 0.05, and we cannot reject the null hypothesis.
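
Assuming the post-test weights were stored in hypothetical vectors post_control and post_intervention, the slide's calculation corresponds to a two-sample t-test with equal variances assumed:

    t.test(post_control, post_intervention, var.equal = TRUE)
    # t about 0.98 on 30 df, two-sided p about 0.33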

Diet data

• First approach – focus on evidence of weight loss in the intervention group.
  – Problem – what about regression to the mean and placebo effects, so that the control group also lost weight?

• Second approach – compare post-test weights in the intervention and control group.

• Third approach – compare weight losses in the intervention and control group.


Baseline vs post-test weights


Figure: Scatter plot of post-test weight (kg) against baseline weight (kg), showing high correlation between baseline and post-test weights in both the intervention and control groups.


Baseline vs weight differences


[Figure: scatter plot of weight difference (kg) against baseline weight (kg), for the intervention and control groups]



Check normality of the weight differences

[Figure: histograms (density scale) of the weight differences in the control and intervention groups]


Independent t-test comparing weight loss

The high correlation between baseline and follow-up measurements means that it may be better to compare the differences between post-test and baseline weights between the two groups.

• Set the null hypothesis H0: ∆_c = ∆_t.

• The test statistic is T = (D̄_c − D̄_t) / (s_p √(2/n)), where s_p = √((s_c² + s_t²)/2).

• Then T = (−2.8 − (−5.1)) / (√((1.2² + 1.8²)/2) × √(2/16)) = 4.25, with degrees of freedom 2n − 2 = 30. The 2-sided p-value is < 0.001, which is strong evidence against the null hypothesis.
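
The same R call applies to the weight changes; with hypothetical vectors diff_control and diff_intervention of within-patient differences:

    t.test(diff_control, diff_intervention, var.equal = TRUE)
    # t about 4.25 on 30 df, two-sided p < 0.001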


Assumptions of the two-sample t-test


• The null hypothesis of the two sample t-test is that both samples come
from the same underlying Normal distribution with the same mean and
the same variance.

• If the two sample distributions are quite skewed, we might consider a log
transformation before doing a t-test.

• If the two samples have quite different variances then the t-test may not
give an appropriate p-value.

• If the two samples have the same (or similar) variance but do not follow a
Normal distribution, we can consider an alternative non-parametric test,
such as the Mann-Whitney U (also called the Wilcoxon Rank-Sum Test).


Mann-Whitney U / Wilcoxon Rank Sum Test


Example – reaction times (ms) for patients without (A) and with (B)
anaesthetic. Subjects had to react on a simple visual stimulus.
A: 135, 141, 143, 149, 171, 172.
B: 142, 158, 170, 189, 254, 289.
Step 1: Place all the values together in rank order (i.e. from lowest to
highest). If there are two observations with the same value, the ‘A’
sample is ranked first.

A A B A A B B A A B B B
135 141 142 143 149 158 170 171 172 189 254 289


Mann-Whitney U / Wilcoxon Rank Sum Test

Example – reaction times (ms) for patients without (A) and with (B)
anaesthetic. Subjects had to react on a simple visual stimulus.
Step 2: Inspect each ‘B’ sample in turn and count the number of ‘A’s
which come before it. Add up the total to get a U value.

A A B A A B B A A B B B
135 141 142 143 149 158 170 171 172 189 254 289
2 4 4 6 6 6

Total, UA = 28.


Mann-Whitney U / Wilcoxon Rank Sum Test


Example – reaction times (ms) for patients without (A) and with (B)
anaesthetic. Subjects had to react on a simple visual stimulus.
Step 3: Repeat steps 1 and 2, but this time inspect each A in turn and
count the number of B’s which precede it. Add up the total to get a
second U value.

A A B A A B B A A B B B
135 141 142 143 149 158 170 171 172 189 254 289
0 0 1 1 3 3

Total, UB = 8.

Mann-Whitney U / Wilcoxon Rank Sum Test


Example – reaction times (ms) for patients without (A) and with (B)
anaesthetic. Subjects had to react on a simple visual stimulus.
Step 4: Take the smaller of the two U values (here we take UB = 8) and look up the corresponding p-value in a reference table; here it is 0.11.

• Note – the maximum value for U is 36 here (= 6 × 6). Under the null hypothesis of no difference between the means of the two groups we would expect to get UA = UB = 18 on average.

• The situation is a bit more complicated if there are tied values, and there is a quicker way to calculate U in larger datasets based on the ranks of the values.
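
In R the whole procedure is a single call to wilcox.test; the reaction-time vectors below are taken from the example above, and the variable names are just illustrative:

    A <- c(135, 141, 143, 149, 171, 172)   # without anaesthetic
    B <- c(142, 158, 170, 189, 254, 289)   # with anaesthetic
    wilcox.test(A, B)   # reports W = 8 (the smaller U above) and an exact two-sided p-value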

Diet example – Mann-Whitney U Test

• For the diet example, comparing weight losses between the two groups, the p-value is < 0.001.

• We could also use the Mann-Whitney U test to compare post-test weights between the two groups, and in that case the p-value is 0.11.


Comparing the two groups

[Diagram: the three comparisons and the tests used]
(1) Intervention group, pre-test vs post-test weight: one-sample t-test (on the differences)
(2) Control vs intervention group, post-test weights: independent two-sample t-test
(3) Control vs intervention group, weight changes (∆): independent two-sample t-test

We used three different approaches to investigate the effectiveness of the intervention on weight loss. Approach (3), comparing the differences between post-test and pre-test weights between the two groups, is the most appropriate.


Comparing the two groups


Summary statistics and p-values for the comparison of post-test weights between the intervention and control groups, and the comparison of weight losses between the intervention and control groups.

                           (2) Comparing         (3) Comparing differences
                           post-test weights     (post-test − pre-test)
Intervention mean (sd)     78.3 (14.0)           −5.1 (1.8)
Control mean (sd)          82.6 (10.7)           −2.8 (1.2)
t-test p-value             0.33                  < 0.001
Mann-Whitney U p-value     0.11                  < 0.001

The comparison of weight losses is much more 'powerful'.


Summary
• Comparing post-test data did not show any significant
intervention effect.
• Comparing differences between post-test and baseline allowed
us to detect the intervention effect.
– This is a common occurrence when baseline and post-test
measurements are highly correlated.
• In an example where the normal distribution assumption was
appropriate, the (parametric) t-test led to smaller p-values
than the (non-parametric) Mann-Whitney U test.
– This is a common occurrence when the assumptions of
parametric tests are met.

Summary of t-tests

• Paired t-test: for testing the hypothesis that there is no difference between the paired measurements (equivalently, that the mean difference is zero)

• One-sample t-test: for testing the hypothesis that the mean is zero (example test #1)

• Two-sample t-test: for comparing the means of two independent groups (example tests #2 and #3)


Part III

Analysis of Variance (ANOVA)


Background

We use ANOVA:

• To test for a statistically significant difference in the means of a continuous outcome variable between three or more treatment groups;

• Many similarities between ANOVA and linear regression, but the focus of ANOVA is on hypothesis testing.


Assumptions of ANOVA

For ANOVA to work correctly, some assumptions must be satisfied, including:

• The distribution of the outcome variable in each treatment group should be Normal.

• These Normal distributions should have the same variance.


Illustration

Small within-group variation → significant difference?

Large within-group variation → non-significant difference?


How does ANOVA work?

• We break down the total variation (ST ) in the data into two
parts:
– Between-group variation (SA );
– Remaining ‘error’ (within-group variation) (SE ).

• By comparing these two portions of the variance, we can test for a statistically significant difference between groups.


Why can’t we just use t-tests?

• t-tests are only for two groups;

• If using a 5% significance level, each t-test will have a 5% chance of giving a significant p-value when there is actually no difference between groups;

• If you do multiple t-tests, you increase the chance that you get a spurious result;

• Better to use one ANOVA instead of many t-tests.


Example data

• 24 rats are randomised to three different drugs which are thought to affect reactions.

• They are stimulated in a particular way.

• Their response times (ms) are recorded.

• We want to know if there is a difference between the drug effects.


Example data – response times

Response times are given in milliseconds (two rats received each drug in each block).

           Drug 1      Drug 2      Drug 3
Block 1    12   10     14   15     23   21
Block 2    17   14     17   13     26   24
Block 3    26   21     28   29     31   35
Block 4    13   16     13   12     16   18
mean       16.13       17.63       24.25


First plot the data – dotplot


[Figure: dot plot of reaction time (ms) by drug group]

Box-and-whisker plot
[Figure: box-and-whisker plot of reaction time (ms) by drug group]

Important to check homogeneity of variance


ANOVA assumes that the within-group variances are the same in each group.

We need to check that this assumption is reasonable (a formal test of equal variances should give a non-significant p-value).
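
One way to carry out such a check in R is Bartlett's test of equal variances; this is only a sketch using hypothetical vectors time and drug for the rat data (R Commander offers similar checks through its menus), and it is not necessarily the test shown on the original slide:

    # drug should be a factor with levels 1, 2, 3
    bartlett.test(time ~ drug)   # a non-significant p-value is consistent with equal variances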

Format of data/notation

             Drug 1      Drug 2      Drug 3
Block 1        ···         ···         ···
Block 2        ···         X_i         ···
Block 3        ···         ···         ···
Block 4        ···         ···         ···
sum (T_i)    T1 = 129    T2 = 141    T3 = 194

Total, T = T1 + T2 + T3 = 464; N1 = N2 = N3 = 8; N = 24.


Calculations
First, calculate the total sum of squares:

S_T = Σ X_i² − T²/N
    = 12² + 10² + · · · + 18² − 464²/24
    = 10076 − 8970.667
    = 1105.333

Then calculate the between-group sum of squares:

S_A = Σ (T_i² / N_i) − T²/N
    = 129²/8 + 141²/8 + 194²/8 − 464²/24
    = 9269.75 − 8970.667
    = 299.083


ANOVA table

                  Sum of Squares    df    Mean Square       F    p-value
Between Groups           299.083     2        149.542   3.895      0.036
Within Groups            806.250    21         38.393
Total                   1105.333    23

You can see the 299.083 and the 1105.333 that we calculated.
What do other values represent? ...
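
For reference, the whole table can be reproduced in R with a sketch like the one below (the vectors time and drug are built here from the data table above; the names are illustrative):

    time <- c(12, 10, 17, 14, 26, 21, 13, 16,    # drug 1
              14, 15, 17, 13, 28, 29, 13, 12,    # drug 2
              23, 21, 26, 24, 31, 35, 16, 18)    # drug 3
    drug <- factor(rep(1:3, each = 8))
    summary(aov(time ~ drug))                    # reproduces the ANOVA table above
    pf(3.895, df1 = 2, df2 = 21, lower.tail = FALSE)   # upper-tail F probability, about 0.036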


Explanation of table

• Degrees of freedom (df):
  – Between-group df = number of groups − 1 = 2
  – Within-group df = total number of individuals − number of groups = 21

• Mean square (MS):
  – sum of squares divided by degrees of freedom, in each row

• F ratio:
  – between-group MS divided by 'error' (within-group) MS


Calculation of p-value

• Use tables of the F distribution (or software):
  – Find the upper-tail probability p = Pr(F₂,₂₁ > 3.895)
  – We get p = 0.036

• There is a significant difference between the groups, at the 5% level (p = 0.036 < 0.05).

• We could further investigate the specific differences after finding a significant difference between groups.


Part IV

Review


Nieuwenhuis et al. 2011 Nature Neuroscience

“In theory, a comparison of two experimental effects requires a statistical test on their difference. In practice, this comparison is often based on an incorrect procedure involving two separate tests in which researchers conclude that effects differ when one effect is significant (P < 0.05) but the other is not (P > 0.05). We reviewed 513 behavioral, systems and cognitive neuroscience articles in five top-ranking journals (Science, Nature, Nature Neuroscience, Neuron and The Journal of Neuroscience) and found that 78 used the correct procedure and 79 used the incorrect procedure.”

Nieuwenhuis et al. 2011 Nature Neuroscience
Figure: Graphs illustrating the various types of situations in which the error of comparing significance levels occurs. (a) Comparing effect sizes in an experimental group/condition and a control group/condition. (b) Comparing effect sizes during a pre-test and a post-test. (c) Comparing several brain areas and claiming that a particular effect (property) is specific for one of these brain areas. (d) Data presented in a, after taking the difference of the two repeated measures (photoinhibition and baseline). Error bars indicate s.e.m.; ns, nonsignificant (P > 0.05); ∗ P < 0.05; ∗∗ P < 0.01.

First type of error

• When comparing effect sizes in an experimental group and a control group, it is incorrect to contrast the significance levels of the two effect sizes (panel a of the Figure).

• Instead, we should directly compare the effect sizes in a single hypothesis test.


Second type of error

• When evaluating the effect of an intervention, it is incorrect to contrast the significance levels of the pre-test values and the post-test values (panel b of the Figure).

• Instead, we should directly compare the before-to-after changes between the treatment group and control group in a single hypothesis test.

• (Similar to the previous type of error)


Third type of error


• When we have more than two groups to compare between, it
is incorrect to conduct multiple significance tests (e.g. many
2-way comparisons with t-tests, panel C on the figure).
• Instead, we should directly include all of the data in a single
model e.g. ANOVA, or a regression model (session 7), and
first examine evidence for an overall effect to avoid a problem
with multiple testing.
• If we believe that a treatment effect is unique to a particular
group (e.g. the first comparison in panel C) we would need to
include interaction terms in our model.

Review

• See the session handout for a summary of which hypothesis test to use in different scenarios.

• Parametric tests will be more powerful than non-parametric tests as long as their assumptions are met.

• Think carefully about the appropriate hypothesis to test.

References and further reading

• Nieuwenhuis S, Forstmann BU, Wagenmakers EJ. Erroneous analyses of interactions in neuroscience: a problem of significance. Nat Neurosci. 2011;14(9):1105-7.

• Altman DG, Bland JM. Comparing several groups using analysis of variance. BMJ. 1996;312:1472-3.

• Bland JM, Altman DG. Multiple significance tests: the Bonferroni method. BMJ. 1995;310:170.
  – http://bmj.bmjjournals.com/cgi/content/full/310/3973/170
