Hypothesis tests
CMED6100 – Session 5

ST Ali

School of Public Health


The University of Hong Kong

16 October 2021

sli.do/#hkubiostat21

Announcements

• Assignment S5 due by Wednesday 20 October 11:59pm


• Mid-term exam on 23 October (Sat) 9:30am.
– Bring a calculator (not the one on your mobile phone),
– a pen,
– and a pencil.

• Venue will be at LTs 3&4.

Mid-term exam

• The mid-term exam covers the material from Sessions 1–5

• The exam will take 105 minutes

• MCQs and SAQs, similar style to the assignments and practice questions

• Simple calculations will be required for some questions

• The only formula you need to remember is for the standard error. Any other formulas necessary for calculations will be provided in the exam paper.


Objectives
After the lecture, students should be able to:

• Describe the statistical tests for comparing associations between two categorical variables (chi-squared test, Fisher's exact test, McNemar's test)

• Describe the statistical tests for comparing the mean of a continuous variable, or the mean of paired differences, with a null value (one-sample t-test, paired t-test, Wilcoxon signed-rank test)

• Describe the statistical tests for comparing the means of a continuous variable between two groups (independent two-sample t-test, Mann-Whitney U test / Wilcoxon rank-sum test)

What will be assessed?


• Students will not be required to perform hypothesis tests ‘by
hand’.
• But students will be expected to be able to describe the basic
approach to hypothesis testing,
– For example, the construction of an ‘expected counts’ table for
the chi-squared test, and the comparison of differences
between observed and expected counts.

• Students will also be expected to correctly describe the relevant null hypotheses, and any key assumptions of the tests described in this lecture.

Statistical hypothesis: an assertion or statement about a population characteristic (e.g. μ)

Null hypothesis: the hypothesis of no difference (H0: μ = μ0)
Alternative hypothesis: the hypothesis complementary to the null hypothesis (H1 or HA: μ ≠ μ0)

Critical value: the cut-off for the test statistic, illustrated for testing H0: μ = μ0 against H1: μ = μ1 (> μ0)
Pr(Type I error) = α (the level of significance)
Pr(Type II error) = β (1 − β is the power of the test)


Recap of hypothesis testing


• Remember the ideas of sampling theory

• Under a null hypothesis we know what kinds of data might be produced

• We can compare our data to the kinds of data that we might expect if the
null hypothesis were true

• Small p-values are taken as evidence against the null hypothesis

[Figure: density of the difference in means under the null hypothesis, x-axis from −6 to 6]

E.g., if the null hypothesis were true, i.e. no difference between means, it would be very unusual to observe large differences in means (whether less than −5 or greater than 5). We would only observe such a large difference in 1% of repeated experiments.

One sample vs repeated samples

• In practice we typically collect data on one sample, and then use the data to estimate a p-value or a confidence interval.

• The interpretation of a p-value or confidence interval is based on the idea of multiple samples, and sampling theory.


Part I

Tests for categorical variables


Accidents example
In a study of occupational health in a local factory, research was
conducted to investigate whether all employees faced similar risk of
various types of accident. A total of 117 accidents were classified
by the age of the employee and the type of accident:

Accident type
Age Sprain Burn Cut
Under 25 9 17 5
25 or over 61 13 12

How can we investigate whether accident type is independent of age?



Accidents example

This kind of table is called a 'contingency table'. We can also add the row and column totals, or the 'marginal totals'.

                  Accident type
Age           Sprain   Burn   Cut   Total
Under 25           9     17     5      31
25 or over        61     13    12      86
Total             70     30    17     117


Pearson’s χ2 test – example


• If the two variables age and accident type are independent, we can
calculate the expected value for each entry, given the same marginal
totals. These are shown in the table below
Table: Expected counts assuming independence.

                  Accident type
Age           Sprain    Burn     Cut   Total
Under 25       18.55    7.95    4.50      31
25 or over     51.45   22.05   12.50      86
Total             70      30      17     117
For example, the expected number of sprains in employees aged under 25
is 31 × 70/117 = 18.55.

χ2 test – example

We can calculate a test statistic based on the differences between the observed
and expected values, as follows:
T = Σ_{i=1}^{n} (O_i − E_i)² / E_i
  = (9 − 18.55)²/18.55 + (17 − 7.95)²/7.95 + (5 − 4.50)²/4.50
    + (61 − 51.45)²/51.45 + (13 − 22.05)²/22.05 + (12 − 12.50)²/12.50
  = 4.92 + 10.30 + 0.06 + 1.77 + 3.71 + 0.02
  = 20.78
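
If you want to reproduce this calculation in R (R Commander drives the same functions through its menus), a minimal sketch along the following lines works; the object name accidents is illustrative, not part of the lecture material.

    # Accidents table: rows = age group, columns = accident type
    accidents <- matrix(c(9, 61, 17, 13, 5, 12), nrow = 2,
                        dimnames = list(Age = c("Under 25", "25 or over"),
                                        Type = c("Sprain", "Burn", "Cut")))
    # Expected counts under independence: (row total x column total) / grand total
    expected <- outer(rowSums(accidents), colSums(accidents)) / sum(accidents)
    expected                                  # 18.55, 7.95, 4.50 / 51.45, 22.05, 12.50
    sum((accidents - expected)^2 / expected)  # test statistic, about 20.8
    chisq.test(accidents)                     # same statistic on 2 df; may warn because
                                              # one expected count (4.5) is just below 5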


χ2 test – example

[Figure: density of the χ² distribution with 2 degrees of freedom, with the observed test statistic marked on the x-axis (0 to 25)]

This test statistic can be compared with a χ² (chi-squared) distribution with 2 degrees of freedom, and we find that it corresponds to a p-value < 0.001. There is strong evidence against the null hypothesis.


Degrees of freedom

• When performing a chi-squared test, we need to know the 'degrees of freedom' of the test. In a simple 2 × 2 table we would have values A, B, C and D:

              Treatment   Control   Total
Event                 A         B   A + B
Not event             C         D   C + D
Total             A + C     B + D   A + B + C + D


Degrees of freedom – cont

• To perform the chi-squared test, we hold the marginal totals constant. Therefore if we choose a specific value for A, we have no further choices to make. We cannot choose B or C because the marginal totals A + B and A + C are fixed.

• In general, for a chi-squared test on a contingency table with m rows and n columns, the degrees of freedom are (m − 1) × (n − 1). For the 2 × 3 accidents table this gives (2 − 1) × (3 − 1) = 2 degrees of freedom.


Assumptions of the χ2 test

• The χ² test requires few assumptions.

• The test is not appropriate if the expected frequencies are too low.

• It will normally be acceptable so long as no more than 10% of the cells have expected frequencies below 5.

• An alternative for these situations is Fisher's exact test.


Fisher’s exact test


Here is an example from a small clinical trial of treatment versus control

Treatment Control Total


Event 2 7 9
Not event 8 2 10
Total 10 9 19

In this case the expected counts in some cells are below 5 (e.g. 9 × 9/19 = 4.3).
Fisher's test examines all the possible tables with the same marginal totals and calculates how probable the observed table (or more extreme tables) would be, assuming independence.

Fisher’s exact test


All possible tables with the same marginal totals can be generated

0 9 1 8 2 7
10 0 9 1 8 2
p1 = 0.00001 p2 = 0.00097 p3 = 0.01754

3 6 4 5 5 4
7 3 6 4 5 5
p4 = 0.10912 p5 = 0.28643 p6 = 0.34372

6 3 7 2 8 1
4 6 3 7 2 8
p7 = 0.19095 p8 = 0.04676 p9 = 0.00438

9 0
1 9
p10 = 0.00011

p1 + p2 + p3 + · · · + p10 = 1.

Fisher’s exact test


The one-tailed probability would be the sum of the separate
probabilities for the arrays:
0 9 1 8 2 7
10 0 9 1 8 2
p1 = 0.00001 p2 = 0.00097 p3 = 0.01754
sum = p1 + p2 + p3 = 0.01852 is the 1-sided p-value
The 2-sided p-value is 0.023
R Commander automatically calculates the p-value using Fisher’s
test for contingency tables in which the assumptions of the χ2 test
may not be met.
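
As a rough sketch of what happens behind those menus (the object name trial is illustrative), the exact test and the individual table probabilities can be obtained in base R:

    trial <- matrix(c(2, 8, 7, 2), nrow = 2,
                    dimnames = list(Outcome = c("Event", "Not event"),
                                    Group = c("Treatment", "Control")))
    fisher.test(trial)                 # two-sided p-value about 0.023, as above
    # Each enumerated table has a hypergeometric probability, e.g. the observed
    # table (2 of the 9 events fall in the treatment group of size 10):
    dhyper(2, m = 9, n = 10, k = 10)   # about 0.0175, matching p3 above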

Matched samples

Sometimes we have 2 × 2 tables with matched samples. For example, in the study below 1319 patients with athlete's foot were given alternative treatments – treatment X on one foot and treatment Y on the other foot.

                                  Treatment Y
                           Cured   Not cured   Total
Treatment X    Cured         212         144     356
               Not cured     256         707     963
Total                        468         851    1319

The outcome of either treatment is likely to be correlated within the same patient, and we should not ignore this correlation when we analyse the data.


Matched samples

                                  Treatment Y
                           Cured   Not cured   Total
Treatment X    Cured         212         144     356
               Not cured     256         707     963
Total                        468         851    1319

Our null hypothesis is that the proportion cured by treatment X is the same as
the proportion cured by treatment Y.


McNemar’s test

In the general form, the contingency table is:

                           First measurement
Second measurement        Yes       No       Total
Yes                         A        B       A + B
No                          C        D       C + D
Total                   A + C    B + D       A + B + C + D

The null hypothesis is that A+B=A+C. Or we could flip the problem and
consider the null hypothesis that C+D=B+D. In fact both of these null
hypotheses can be rephrased more simply as B=C.


McNemar’s test
The test statistic is very easy to calculate, and depends only on the
two off-diagonal cells.

                           First measurement
Second measurement        Yes       No       Total
Yes                         A        B       A + B
No                          C        D       C + D
Total                   A + C    B + D       A + B + C + D

The test statistic is (B − C)² / (B + C). It follows a chi-squared distribution with 1 degree of freedom.

McNemar’s test – example

In our example, with treatment X and Y, the table was:

                                  Treatment Y
                           Cured   Not cured   Total
Treatment X    Cured         212         144     356
               Not cured     256         707     963
Total                        468         851    1319

The test statistic is (144 − 256)² / (144 + 256) = 31.4, which is highly significant (under the null hypothesis the test statistic follows a chi-squared distribution with 1 degree of freedom, whose 95th percentile is 3.84).
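
A minimal R sketch (the object name feet is illustrative); note that R's default continuity correction has to be switched off to match the formula on the slide:

    feet <- matrix(c(212, 256, 144, 707), nrow = 2,
                   dimnames = list("Treatment X" = c("Cured", "Not cured"),
                                   "Treatment Y" = c("Cured", "Not cured")))
    mcnemar.test(feet, correct = FALSE)   # statistic (144 - 256)^2 / (144 + 256) = 31.4 on 1 df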


Summary

• When testing independence between two categorical variables, we usually use the χ² test.

• With small sample sizes (when more than 10% of expected counts are less than 5), use Fisher's exact test instead.

• In matched samples use McNemar's test
  – For example, repeated measures on the same individuals, or two assessments of the same experimental unit


Part II

Tests for continuous variables


Example
Randomized trial of dietary intervention on 32 patients. The
weight in kg of all patients was measured at baseline and then
after 2 weeks. Is the intervention effective?

Table: Pretest and post-test weights of 32 patients.


Control Intervention
Subject Baseline Post-test Difference Baseline Post-test Difference
1 74.2 71.6 −2.6 78.9 73.1 −5.8
2 92.9 90.6 −2.3 73.2 67.5 −5.7
3 103.1 99.2 −3.9 74.2 66.5 −7.7
4 59.8 56.5 −3.3 120.7 114.8 −5.9
5 94.8 90.0 −4.8 88.1 81.4 −6.7
... (subjects 6–15 not shown) ...
16 88.0 84.6 −3.4 79.4 73.4 −6.0
Mean 85.3 82.6 −2.8 83.4 78.3 −5.1
SD 11.0 10.7 1.2 13.8 14.0 1.8


First, plot the data

[Figure: weight (kg) at pre-test and post-test, for the control and intervention groups]


Plot the data – weight changes


[Figure: weight difference (kg) from pre-test to post-test, for the control and intervention groups]


Is the intervention effective?

1. We could compare the post-test weights with the pre-test weights in the intervention group
   – Weight loss on intervention would prove effectiveness?

2. We could compare mean post-test weights in the two groups
   – Allocation to intervention was randomized, so the distribution of baseline weights should be similar in the two groups.

3. We could compare the mean weight loss between the intervention group and the control group?


Comparing the two groups

[Diagram: the three possible comparisons]
(1) Intervention group: pre-test weight vs post-test weight
(2) Control vs intervention group: post-test weights
(3) Control vs intervention group: weight changes (∆)

We can compare three different approaches to investigate the effectiveness of the intervention on weight loss.


Diet data

• First approach – focus on evidence of weight loss in the intervention group ...


Student's t-test
• The z-test illustrated in the previous session is valid in large samples, when the Central Limit Theorem ensures that the sampling distribution of the sample mean will follow a Normal distribution.

• If we want to compare the means of two groups in a smaller dataset, we cannot rely on the Central Limit Theorem.

• An alternative test is available provided that the original data follow a Normal distribution.

• This test relies on a theoretical result that when sampling from a Normal distribution, the sample mean (standardised using the sample standard deviation) follows a t distribution.

• The test was originally derived and published by William Gosset in 1908, under the pen name 'Student'.


The T distribution
Figure: The t-distribution has 'fatter' tails than the Normal distribution, but converges to the Normal distribution as the degrees of freedom increase.


Population distribution and sample mean distribution

Distribution of observations    Sample size    Distribution of sample mean
in the population
Normal                          Small n        t distribution with n − 1 df
Normal                          Large n        Normal (Central Limit Theorem)
Not normal                      Small n        Unclear
Not normal                      Large n        Normal (Central Limit Theorem)


Check normality of the diet data


[Figure: histogram (density scale) of the weight differences in the intervention group]


Paired t-test
We could look at the weight loss in the intervention group to test
whether there was any change from baseline. This is equivalent to
testing H0 : ∆t = 0 in the intervention group. This is a type of
“one sample t-test” since we are evaluating whether our sample
mean is the same as or different from a null value.

• The test statistic is calculated as
  T = (D̄_t − 0) / (s_t / √n) = −5.1 / (1.8 / √16) = −11.3, with degrees of freedom n − 1 = 15.

• We can derive a p-value < 0.001 and reject the null hypothesis.
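
The full data are not shown on the slide, but assuming the intervention group's weights were stored in two hypothetical vectors baseline and post, the test is a one-line call in R:

    t.test(post, baseline, paired = TRUE)   # paired t-test of H0: mean difference = 0
    t.test(post - baseline, mu = 0)         # equivalent one-sample t-test on the differences

Both calls give identical t, degrees of freedom and p-value.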

Paired t-test (cont’d)


The p-value = 2 × Pr(T ≤ t_c | T ∼ t_15).

[Figure: density of the t_15 distribution, with the critical value t_c marked in the lower tail]



Paired t-test vs 1-sample t-test

• The paired t-test is used on paired data, to test the null hypothesis that the difference between the paired measurements is zero.
  – Although we have paired measurements, when we subtract one from the other the result is a single variable.

• The one-sample t-test is used on a single sample, to test the null hypothesis that the mean is zero (if we want to test a null hypothesis of a mean with another value, it is very simple to modify the test).

• These two tests are therefore the same: a paired t-test is a one-sample t-test applied to the within-pair differences.


Wilcoxon signed-rank test


• The t-test relies on the assumption that the population distribution is Normal.

• With a large sample, the Central Limit Theorem applies and the t-test converges to a z-test, without requiring the population to follow a Normal distribution.

• When the assumption that the population follows a Normal distribution is violated, and we have a small sample, we could use the Wilcoxon signed-rank test as a non-parametric alternative to the paired t-test or one-sample t-test.

Wilcoxon signed-rank test


• The simple explanation of the Wilcoxon signed-rank test is
that it examines how many observations fall below the null
value (e.g. 0) and how many fall above, and then evaluates
how unlikely it would be for so many (or so few) observations
to fall above or below the null value.
• The detailed explanation is a bit more complicated because
the test also takes into account the magnitude of the
observations. See a textbook for technical details.
• The p-value for H0 : ∆t = 0 in the intervention group is
< 0.001.
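
Using the same hypothetical vectors baseline and post for the intervention group, the non-parametric alternative is a single call:

    # Wilcoxon signed-rank test of H0: the differences are centred at 0
    wilcox.test(post - baseline, mu = 0)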

Diet data

• First approach – focus on evidence of weight loss in the intervention group.
  – Problem – what about regression to the mean and placebo effects, so that the control group also lost weight?

• Second approach – compare post-test weights in the intervention and control group ...


Distribution of post-test weights

[Figure: histograms of post-test weight (kg) in each group. Control: mean = 82.6, SD = 10.7. Intervention: mean = 78.3, SD = 14.0]


t-test comparing post-test weights


We can use an independent two-sample t-test to compare the
post-test weights between two groups. Individuals were randomized
so the distribution of pre-test weights should be similar.
• Set the null hypothesis H0: μ_c = μ_t.

• The test statistic can be calculated as T = (X̄_c − X̄_t) / (s_p √(2/n)), where s_p = √((s₁² + s₂²)/2).

• In this example T = (82.6 − 78.3) / (√((10.7² + 14.0²)/2) × √(2/16)) = 0.98, with degrees of freedom 2n − 2 = 30. Then the (2-sided) p-value = 0.33 > 0.05, and we cannot reject the null hypothesis.
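
Assuming the post-test weights were stored in hypothetical vectors post_control and post_intervention, the slide's calculation corresponds to a two-sample t-test with equal variances assumed:

    t.test(post_control, post_intervention, var.equal = TRUE)
    # t about 0.98 on 30 df, two-sided p about 0.33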

Diet data

• First approach – focus on evidence of weight loss in the intervention group.
  – Problem – what about regression to the mean and placebo effects, so that the control group also lost weight?

• Second approach – compare post-test weights in the intervention and control group.

• Third approach – compare weight losses in the intervention and control group.


Baseline vs post-test weights


Figure: Scatter plot of post-test weight (kg) against baseline weight (kg), showing high correlation between baseline and post-test weights in both the intervention and control groups.


Baseline vs weight differences


[Figure: scatter plot of weight difference (kg) against baseline weight (kg), for the intervention and control groups]



Check normality of the weight differences

[Figure: histograms (density scale) of the weight differences in the control and intervention groups]


Independent t-test comparing weight loss

The high correlation between baseline and follow-up measurements means that it may be better to compare the differences between post-test and baseline weights between the two groups.

• Set the null hypothesis H0: ∆_c = ∆_t.

• The test statistic is T = (D̄_c − D̄_t) / (s_p √(2/n)), where s_p = √((s_c² + s_t²)/2).

• Then T = (−2.8 − (−5.1)) / (√((1.2² + 1.8²)/2) × √(2/16)) = 4.25, with degrees of freedom 2n − 2 = 30. The 2-sided p-value is < 0.001, which is strong evidence against the null hypothesis.
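
The same R call applies to the weight changes; with hypothetical vectors diff_control and diff_intervention of within-patient differences:

    t.test(diff_control, diff_intervention, var.equal = TRUE)
    # t about 4.25 on 30 df, two-sided p < 0.001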


Assumptions of the two-sample t-test


• The null hypothesis of the two sample t-test is that both samples come
from the same underlying Normal distribution with the same mean and
the same variance.

• If the two sample distributions are quite skewed, we might consider a log
transformation before doing a t-test.

• If the two samples have quite different variances then the t-test may not
give an appropriate p-value.

• If the two samples have the same (or similar) variance but do not follow a
Normal distribution, we can consider an alternative non-parametric test,
such as the Mann-Whitney U (also called the Wilcoxon Rank-Sum Test).


Mann-Whitney U / Wilcoxon Rank Sum Test


Example – reaction times (ms) for patients without (A) and with (B)
anaesthetic. Subjects had to react on a simple visual stimulus.
A: 135, 141, 143, 149, 171, 172.
B: 142, 158, 170, 189, 254, 289.
Step 1: Place all the values together in rank order (i.e. from lowest to
highest). If there are two observations with the same value, the ‘A’
sample is ranked first.

A A B A A B B A A B B B
135 141 142 143 149 158 170 171 172 189 254 289


Mann-Whitney U / Wilcoxon Rank Sum Test

Example – reaction times (ms) for patients without (A) and with (B)
anaesthetic. Subjects had to react on a simple visual stimulus.
Step 2: Inspect each ‘B’ sample in turn and count the number of ‘A’s
which come before it. Add up the total to get a U value.

A A B A A B B A A B B B
135 141 142 143 149 158 170 171 172 189 254 289
2 4 4 6 6 6

Total, UA = 28.


Mann-Whitney U / Wilcoxon Rank Sum Test


Example – reaction times (ms) for patients without (A) and with (B)
anaesthetic. Subjects had to react on a simple visual stimulus.
Step 3: Repeat steps 1 and 2, but this time inspect each A in turn and
count the number of B’s which precede it. Add up the total to get a
second U value.

A A B A A B B A A B B B
135 141 142 143 149 158 170 171 172 189 254 289
0 0 1 1 3 3

Total, UB = 8.

Mann-Whitney U / Wilcoxon Rank Sum Test


Example – reaction times (ms) for patients without (A) and with (B)
anaesthetic. Subjects had to react on a simple visual stimulus.
Step 4: Take the smaller of the two U values (here we take UB = 8) and look up the corresponding p-value in a reference table; here it is 0.11.

• Note – the maximum value for U is 36 here (= 6 × 6). Under the null hypothesis of no difference between the means of the two groups we would expect to get UA = UB = 18 on average.

• The situation is a bit more complicated if there are tied values, and there is a quicker way to calculate U in larger datasets based on the ranks of the values.
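
In R the whole procedure is a single call to wilcox.test; the reaction-time vectors below are taken from the example above, and the variable names are just illustrative:

    A <- c(135, 141, 143, 149, 171, 172)   # without anaesthetic
    B <- c(142, 158, 170, 189, 254, 289)   # with anaesthetic
    wilcox.test(A, B)   # reports W = 8 (the smaller U above) and an exact two-sided p-value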

Diet example – Mann-Whitney U Test

• For the diet example, comparing weight losses between the two groups, the p-value is < 0.001.

• We could also use the Mann-Whitney U test to compare post-test weights between the two groups, and in that case the p-value is 0.11.


Comparing the two groups

[Diagram: the three comparisons and the tests used]
(1) Intervention group, pre-test vs post-test weight: one-sample t-test (on the differences)
(2) Control vs intervention group, post-test weights: independent two-sample t-test
(3) Control vs intervention group, weight changes (∆): independent two-sample t-test

We used three different approaches to investigate the effectiveness of the intervention on weight loss. Approach (3), comparing the differences between post-test and pre-test weights between the two groups, is the most appropriate.


Comparing the two groups


Summary statistics and p-values for the comparison of post-test weights between the intervention and control groups, and the comparison of weight losses between the intervention and control groups.

                           (2) Comparing         (3) Comparing differences
                           post-test weights     (post-test − pre-test)
Intervention mean (sd)     78.3 (14.0)           −5.1 (1.8)
Control mean (sd)          82.6 (10.7)           −2.8 (1.2)
t-test p-value             0.33                  < 0.001
Mann-Whitney U p-value     0.11                  < 0.001

The comparison of weight losses is much more 'powerful'.


Summary
• Comparing post-test data did not show any significant
intervention effect.
• Comparing differences between post-test and baseline allowed
us to detect the intervention effect.
– This is a common occurrence when baseline and post-test
measurements are highly correlated.
• In an example where the normal distribution assumption was
appropriate, the (parametric) t-test led to smaller p-values
than the (non-parametric) Mann-Whitney U test.
– This is a common occurrence when the assumptions of
parametric tests are met.

Summary of t-tests

• Paired t-test: for testing the hypothesis that there is no difference between the paired measurements (equivalently, that the mean difference is zero)

• One-sample t-test: for testing the hypothesis that the mean is zero (example test #1)

• Two-sample t-test: for comparing the means of two independent groups (example tests #2 and #3)


Part III

Analysis of Variance (ANOVA)


Background

We use ANOVA:

• To test for a statistically significant difference in the means of a continuous outcome variable between three or more treatment groups;

• Many similarities between ANOVA and linear regression, but the focus of ANOVA is on hypothesis testing.


Assumptions of ANOVA

For ANOVA to work correctly, some assumptions must be satisfied, including:

• The distribution of the outcome variable in each treatment group should be Normal.

• These Normal distributions should have the same variance.


Illustration

Small within-group variation → significant difference?

Large within-group variation → non-significant difference?


How does ANOVA work?

• We break down the total variation (ST ) in the data into two
parts:
– Between-group variation (SA );
– Remaining ‘error’ (within-group variation) (SE ).

• By comparing these two portions of the variance, we can test for a statistically significant difference between groups.


Why can’t we just use t-tests?

• t-tests are only for two groups;

• If using a 5% significance level, each t-test will have a 5% chance of giving a significant p-value when there is actually no difference between groups;

• If you do multiple t-tests, you increase the chance that you get a spurious result;

• Better to use one ANOVA instead of many t-tests.


Example data

• 24 rats are randomised to three different drugs which are thought to affect reactions.

• They are stimulated in a particular way.

• Their response times (ms) are recorded.

• We want to know if there is a difference between the drug effects.


Example data – response times

Response times are given in milliseconds (two rats received each drug in each block).

           Drug 1      Drug 2      Drug 3
Block 1    12   10     14   15     23   21
Block 2    17   14     17   13     26   24
Block 3    26   21     28   29     31   35
Block 4    13   16     13   12     16   18
mean       16.13       17.63       24.25


First plot the data – dotplot


[Figure: dot plot of reaction time (ms) by drug group]

Box-and-whisker plot
[Figure: box-and-whisker plot of reaction time (ms) by drug group]

Important to check homogeneity of variance


ANOVA assumes that the within-group variances are the same in each group.

We need to check that this assumption is reasonable (a formal test of equal variances should give a non-significant p-value).
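
One way to carry out such a check in R is Bartlett's test of equal variances; this is only a sketch using hypothetical vectors time and drug for the rat data (R Commander offers similar checks through its menus), and it is not necessarily the test shown on the original slide:

    # drug should be a factor with levels 1, 2, 3
    bartlett.test(time ~ drug)   # a non-significant p-value is consistent with equal variances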

Format of data/notation

             Drug 1      Drug 2      Drug 3
Block 1        ···         ···         ···
Block 2        ···         X_i         ···
Block 3        ···         ···         ···
Block 4        ···         ···         ···
sum (T_i)    T1 = 129    T2 = 141    T3 = 194

Total, T = T1 + T2 + T3 = 464; N1 = N2 = N3 = 8; N = 24.


Calculations
First, calculate the total sum of squares:

S_T = Σ X_i² − T²/N
    = 12² + 10² + · · · + 18² − 464²/24
    = 10076 − 8970.667
    = 1105.333

Then calculate the between-group sum of squares:

S_A = Σ (T_i² / N_i) − T²/N
    = 129²/8 + 141²/8 + 194²/8 − 464²/24
    = 9269.75 − 8970.667
    = 299.083


ANOVA table

                  Sum of Squares    df    Mean Square       F    p-value
Between Groups           299.083     2        149.542   3.895      0.036
Within Groups            806.250    21         38.393
Total                   1105.333    23

You can see the 299.083 and the 1105.333 that we calculated.
What do other values represent? ...
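
For reference, the whole table can be reproduced in R with a sketch like the one below (the vectors time and drug are built here from the data table above; the names are illustrative):

    time <- c(12, 10, 17, 14, 26, 21, 13, 16,    # drug 1
              14, 15, 17, 13, 28, 29, 13, 12,    # drug 2
              23, 21, 26, 24, 31, 35, 16, 18)    # drug 3
    drug <- factor(rep(1:3, each = 8))
    summary(aov(time ~ drug))                    # reproduces the ANOVA table above
    pf(3.895, df1 = 2, df2 = 21, lower.tail = FALSE)   # upper-tail F probability, about 0.036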


Explanation of table

• Degrees of freedom (df):
  – Between-group df = number of groups − 1 = 2
  – Within-group df = total number of individuals − number of groups = 21

• Mean square (MS):
  – sum of squares divided by degrees of freedom, in each row

• F ratio:
  – between-group MS divided by 'error' (within-group) MS


Calculation of p-value

• Use tables of the F distribution (or software):
  – Find the upper-tail probability p = Pr(F₂,₂₁ > 3.895)
  – We get p = 0.036

• There is a significant difference between the groups, at the 5% level (p = 0.036 < 0.05).

• We could further investigate the specific differences after finding a significant difference between groups.


Part IV

Review


Nieuwenhuis et al. 2011 Nature Neuroscience

“In theory, a comparison of two experimental effects requires a statistical test on their difference. In practice, this comparison is often based on an incorrect procedure involving two separate tests in which researchers conclude that effects differ when one effect is significant (P < 0.05) but the other is not (P > 0.05). We reviewed 513 behavioral, systems and cognitive neuroscience articles in five top-ranking journals (Science, Nature, Nature Neuroscience, Neuron and The Journal of Neuroscience) and found that 78 used the correct procedure and 79 used the incorrect procedure.”

Nieuwenhuis et al. 2011 Nature Neuroscience
Figure: Graphs illustrating the various types of situations in which the error of comparing significance levels occurs. (a) Comparing effect sizes in an experimental group/condition and a control group/condition. (b) Comparing effect sizes during a pre-test and a post-test. (c) Comparing several brain areas and claiming that a particular effect (property) is specific for one of these brain areas. (d) Data presented in a, after taking the difference of the two repeated measures (photoinhibition and baseline). Error bars indicate s.e.m.; ns, nonsignificant (P > 0.05); ∗ P < 0.05; ∗∗ P < 0.01.

First type of error

• When comparing effect sizes in an experimental group and a control group, it is incorrect to contrast the significance levels of the two effect sizes (panel a of the Figure).

• Instead, we should directly compare the effect sizes in a single hypothesis test.


Second type of error

• When evaluating the effect of an intervention, it is incorrect to contrast the significance levels of the pre-test values and the post-test values (panel b of the Figure).

• Instead, we should directly compare the before-to-after changes between the treatment group and control group in a single hypothesis test.

• (Similar to the previous type of error)


Third type of error


• When we have more than two groups to compare between, it
is incorrect to conduct multiple significance tests (e.g. many
2-way comparisons with t-tests, panel C on the figure).
• Instead, we should directly include all of the data in a single
model e.g. ANOVA, or a regression model (session 7), and
first examine evidence for an overall effect to avoid a problem
with multiple testing.
• If we believe that a treatment effect is unique to a particular
group (e.g. the first comparison in panel C) we would need to
include interaction terms in our model.

Review

• See the session handout for a summary of which hypothesis test to use in different scenarios.

• Parametric tests will be more powerful than non-parametric tests as long as their assumptions are met.

• Think carefully about the appropriate hypothesis to test.

References and further reading

• Nieuwenhuis S, Forstmann BU, Wagenmakers EJ. Erroneous analyses of interactions in neuroscience: a problem of significance. Nat Neurosci. 2011;14(9):1105-7.

• Altman DG, Bland JM. Comparing several groups using analysis of variance. BMJ. 1996;312:1472-3.

• Bland JM, Altman DG. Multiple significance tests: the Bonferroni method. BMJ. 1995;310:170.
  – http://bmj.bmjjournals.com/cgi/content/full/310/3973/170
