Mathematical Biology IA

University of Cambridge

M. Castle

MATHEMATICAL BIOLOGY LENT: ANOVA AND LINEAR MODELS


Lecture 1: One-Way Analysis of Variance (ANOVA)
Aims
1) To introduce the One-Way ANOVA, a test to compare the means of multiple groups.
2) To introduce the concept of partitioning of variance for statistical inference.
Objectives: After the lecture, students should be able to:
1) Compute Sums of Squares, degrees of freedom, Mean Squares and F statistics for a
One-Way ANOVA.
2) Make inferences on the mean of multiple groups using an ANOVA table.
In the last lecture, you saw how you can compare the means of two samples with a t-test.
However, we often want to deal with more complex problems that involve several groups. To
tackle these sorts of problems, we need to develop a more general framework, called Analysis
of Variance (ANOVA). In this lecture, we will concentrate on the simplest form of ANOVA (a
One-Way ANOVA, comparing the mean of several groups), and we will then expand this
framework to deal with more complex problems. As in previous lectures, we will confine
ourselves to normally distributed populations, and we will assume that these populations
have equal variances.

About notation
First of all, we need to look at the notation required to understand a One-Way ANOVA.
Subscripts are used to define the origin of each data point. Each observation is denoted as $y_{gi}$,
where g represents the group (sample) it comes from, and i defines the individual response
within the group. So, $y_{23}$ is the third observation in the second group. If we consider the small
dataset of meerkat weights in Table 1, this would be the third observation from Kuruman
River, which is 597 g. A dataset has k groups (meaning that g = 1 ... k), and each group has $n_g$
observations. In the meerkat example, we have two locations (k = 2), $n_1$ is the sample size for
Deception Valley ($n_1 = 6$) and $n_2$ is the sample size for Kuruman River ($n_2 = 6$). The overall
sample size is N ($N = n_1 + n_2 = 12$).

Table 1. Weights in grams of a number of meerkats caught in a field study.

Deception Valley, Botswana (g=1)
  y11   y12   y13   y14   y15   y16
  514   519   568   571   553   531

Kuruman River, South Africa (g=2)
  y21   y22   y23   y24   y25   y26
  624   542   597   597   577   678
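
The notation can be mirrored in a few lines of code. The sketch below is only an illustration: it uses Python rather than the R used for the output later in these notes, and the names `weights` and `y` are made up for this example.

```python
# Meerkat weights (g) from Table 1, keyed by the group index g (1-based, as in the text).
weights = {
    1: [514, 519, 568, 571, 553, 531],  # Deception Valley, Botswana
    2: [624, 542, 597, 597, 577, 678],  # Kuruman River, South Africa
}

def y(g, i):
    """Return observation y_gi: the i-th response in group g (both 1-based)."""
    return weights[g][i - 1]

k = len(weights)                                 # number of groups
n = {g: len(obs) for g, obs in weights.items()}  # n_g for each group
N = sum(n.values())                              # overall sample size

print(y(2, 3), k, n, N)
```

Here `y(2, 3)` picks out the third observation from Kuruman River, matching the example in the text.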

The Analysis of Variance framework


The analysis of variance framework is based on the idea that the variance of the response can
be partitioned into components that correspond to the source of variation (namely, one or
more components due to changes in the values of the independent variable(s) and a
component due to random error). In a One-Way ANOVA, we have only one independent
variable (e.g. location), which is a discrete factor (i.e. a categorical variable).
To understand how this is computed, we need to think about each observation as a deviation
from the overall mean (Fig. 1),

$$y_{gi} = \mu + \alpha_g + \varepsilon_{gi}$$

where,
$\mu$ = overall mean
$\alpha_g$ = the group effect
$\varepsilon_{gi}$ = the random error component

Fig. 1: Diagram showing the partitioning of an individual weight $y_{gi}$.

In the meerkat example, this implies that there is a mean weight for this species ($\mu$), which is
then affected by the location (with the group effect $\alpha_g > 0$ if meerkats are bigger than average at that location,
and $\alpha_g < 0$ if the site is not a good one and they are smaller). However, even within a site, not all
individuals will have the same weight, and the individual deviation from the site mean is given
by $\varepsilon_{gi}$ (where $\varepsilon_{gi}$ is normally distributed with a mean of 0). As the purpose of a One-Way ANOVA
is to determine whether the means of several groups differ significantly, the null hypothesis
can be phrased as H0: $\mu_1 = \mu_2 = \ldots = \mu_k$. The H1 states that $\mu_1 \ldots \mu_k$ are not all equal.
So, how can we estimate the variance components? If you think back to your first lecture, the
numerator of the formula for the variance is the sum of squared deviations (SS) from the mean.
SS are a good way to summarise the level of variability around a mean, and we can use a
similar approach here (Fig. 2). A further property of SS is that they are additive. So,
SSTot = SSG + SSE
where
SSTot = Total Sum of Squares
SSG = Group Sum of Squares (also known as the Treatment SS)
SSE = Error Sum of Squares
First, let us estimate the Total Sum of Squares. This is the total deviation of the dataset from
the overall mean (Fig. 2). Even though we don't know the true overall mean, we can estimate
it by pooling all the observations from all the groups and then taking their mean, $\bar{y}$. So,

$$SS_{Tot} = \sum_{g=1}^{k} \sum_{i=1}^{n_g} (y_{gi} - \bar{y})^2$$

In the meerkat dataset, the overall mean is 572.6. The sum of squared deviations from the
overall mean (SSTot) is

$$(514 - 572.6)^2 + (519 - 572.6)^2 + (568 - 572.6)^2 + \ldots + (678 - 572.6)^2 = 24342.9$$

[All figures are rounded to 1 decimal place (DP). It is convention to give summary statistics to
an accuracy of 1 DP more than the accuracy of the original data, which in the meerkat
example were to zero DP].
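
As a cross-check on the arithmetic, the Total Sum of Squares can be recomputed from the raw data. This is a minimal sketch in Python (the notes themselves use R); the variable names are illustrative only, and the unrounded overall mean is used throughout.

```python
# Meerkat weights (g) from Table 1.
deception = [514, 519, 568, 571, 553, 531]
kuruman = [624, 542, 597, 597, 577, 678]

# Pool every observation to estimate the overall mean.
pooled = deception + kuruman
grand_mean = sum(pooled) / len(pooled)

# Total Sum of Squares: squared deviation of every observation from the overall mean.
ss_tot = sum((obs - grand_mean) ** 2 for obs in pooled)

print(round(grand_mean, 1), round(ss_tot, 1))
```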

Fig. 2: ANOVA Sums of Squares. Three panels (SSTot, SSG and SSE), each plotting Weight (g) against Observations for Deception Valley and Kuruman River.

We now need to consider the Group Sum of Squares (SSG). What we want to know here is how
much variability in the dataset comes from the fact that the group means are different from
the overall mean. Again, we don't know exactly what the true group mean $\mu_g$ is, but we can take the mean for each
group, $\bar{y}_g$ (which is our best estimate of $\mu_g$). Now we can estimate the amount of
deviation due to the group effect as

$$SS_G = \sum_{g=1}^{k} \sum_{i=1}^{n_g} (\bar{y}_g - \bar{y})^2 = \sum_{g=1}^{k} n_g (\bar{y}_g - \bar{y})^2$$

In the meerkat example, the mean for Deception Valley, $\bar{y}_1$, is 542.7, and the mean for
Kuruman River, $\bar{y}_2$, is 602.5. Since the sample size is 6 for both groups ($n_1 = n_2 = 6$), we have

$$SS_G = 6 \times (542.7 - 572.6)^2 + 6 \times (602.5 - 572.6)^2 = 10740.1$$
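
The Group Sum of Squares can be checked in the same way. The sketch below (Python, with illustrative names such as `groups` and `ss_g`) weights each squared deviation of a group mean from the overall mean by the group size. It works with unrounded means, so it agrees with the hand calculation only to the precision quoted in the text.

```python
# Meerkat weights (g) from Table 1, one list per location.
groups = {
    "Deception Valley": [514, 519, 568, 571, 553, 531],
    "Kuruman River": [624, 542, 597, 597, 577, 678],
}

pooled = [obs for sample in groups.values() for obs in sample]
grand_mean = sum(pooled) / len(pooled)

# Group Sum of Squares: n_g times the squared deviation of each group mean
# from the overall mean, summed over the groups.
ss_g = sum(
    len(sample) * (sum(sample) / len(sample) - grand_mean) ** 2
    for sample in groups.values()
)

print(round(ss_g, 1))
```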


Finally, we need to estimate the amount of variation in the data due to the random error (i.e.
those random deviations due to individual effects). Since our best estimate of $\mu_g$ is $\bar{y}_g$,
we can write:

$$SS_E = \sum_{g=1}^{k} \sum_{i=1}^{n_g} (y_{gi} - \bar{y}_g)^2$$

For the meerkat example,

$$SS_E = (514 - 542.7)^2 + \ldots + (531 - 542.7)^2 + (624 - 602.5)^2 + \ldots + (678 - 602.5)^2 = 13602.8$$
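
A short numerical check of the Error Sum of Squares, and of the additivity of the three Sums of Squares, might look as follows. This is a sketch in Python with made-up variable names, not part of the original R analysis.

```python
# Meerkat weights (g) from Table 1, one list per location.
groups = [
    [514, 519, 568, 571, 553, 531],  # Deception Valley
    [624, 542, 597, 597, 577, 678],  # Kuruman River
]

pooled = [obs for sample in groups for obs in sample]
grand_mean = sum(pooled) / len(pooled)

ss_tot = sum((obs - grand_mean) ** 2 for obs in pooled)

ss_g = 0.0
ss_e = 0.0
for sample in groups:
    group_mean = sum(sample) / len(sample)
    # Between-group part: group mean against the overall mean.
    ss_g += len(sample) * (group_mean - grand_mean) ** 2
    # Within-group part: each observation against its own group mean.
    ss_e += sum((obs - group_mean) ** 2 for obs in sample)

print(round(ss_e, 1))
print(abs(ss_tot - (ss_g + ss_e)) < 1e-6)  # additivity, up to floating-point error
```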
Now that we have created estimates of the components of variance, we can use them to test
our null hypothesis. As usual, we need to say how confident we are that our results do not
deviate from the H0 simply as a result of random chance. The obvious approach here is to say
that, if SSG is a rather large proportion of SSTot, then the null hypothesis that all groups are
similar is unlikely to be true. We can rephrase this as saying that, if the amount of variance
accounted for by the group effect is large in comparison to the amount of variance due to error
(keep in mind that SSTot = SSG + SSE), then the group effect is likely to be real (i.e. significant).
At this stage, you might be tempted to simply compare SSG with SSE. However, note that we
obtained the different sums of squares using very different amounts of information. The SSG
referred to a small number of estimates (the group means) compared to the overall mean,
whereas the SSE is the sum of a large number of individual deviations (we compared all data
points to their respective group means). So, to compare the variance components, we first
need to standardise them according to the number of parameters involved (i.e. the amount
of information).
We need to develop a standardisation parameter that allows us to compare SS. This parameter
is called the degrees of freedom (usually denoted as df), and it is related to the
number of parameters we had to estimate in order to compute a set of squared deviations. In
general, the degrees of freedom for a given set of deviations is equal to the number of
parameters/observations minus the number of reference parameter values we used to
compute the deviations. So, the df for SSTot (dfTot) is equal to the sample size minus 1 (as we
only have to estimate the overall mean). The dfG is equal to the number of groups minus 1 (as
the group means are compared to a single overall mean). Finally, dfE is equal to the sample
size minus the number of groups (as we had to estimate a mean for each group).
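
The three rules for the degrees of freedom translate directly into code. The helper below is a hypothetical illustration in Python (the function name and dictionary keys are not from the notes).

```python
def degrees_of_freedom(group_sizes):
    """Return the total, group and error df for a One-Way ANOVA."""
    n_total = sum(group_sizes)  # overall sample size N
    k = len(group_sizes)        # number of groups
    return {
        "total": n_total - 1,   # one overall mean estimated
        "group": k - 1,         # k group means compared to one overall mean
        "error": n_total - k,   # one mean estimated per group
    }

# Two meerkat locations with six animals each.
print(degrees_of_freedom([6, 6]))
```

Note that the group and error df always add up to the total df, just as the Sums of Squares do.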

Source   SS      df              MS                F
Group    SSG     dfG = k - 1     MSG = SSG / dfG   MSG / MSE
Error    SSE     dfE = N - k     MSE = SSE / dfE
Total    SSTot   dfTot = N - 1

We want to concentrate on the group and error components, so that we can test the null
hypothesis stated above by asking whether the variance component due to the group is large
relative to the component due to error. We can estimate the mean square deviation (MS, our
standardised estimator of variation) for each component by dividing each SS by its
appropriate df. Our confidence in the H0 depends on how much variation in the dataset can
be attributed to the group effect vs. the amount due to random error. So, to get a handle on
this, we can estimate the ratio between the group MS (MSG) and the error MS (MSE). The ratio
of two variances has a well-known behaviour, described by the F distribution. The F
distribution is somewhat different from the distributions you have encountered so far, as it is
defined by two degrees of freedom, one for the variance in the numerator and one for the
variance in the denominator.
So, if we write out the ANOVA table for meerkats:

Source   SS        df   MS        F
Group    10740.1    1   10740.1   7.90
Error    13602.8   10    1360.3
Total    24342.9   11
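
The Mean Squares and the F statistic in this table can be reproduced from the raw weights. The following Python sketch uses illustrative variable names and takes the critical value of 4.96 from the statistical tables quoted in the text; it does not compute a p value.

```python
# Meerkat weights (g) from Table 1, one list per location.
groups = [
    [514, 519, 568, 571, 553, 531],  # Deception Valley
    [624, 542, 597, 597, 577, 678],  # Kuruman River
]

pooled = [obs for sample in groups for obs in sample]
n_total, k = len(pooled), len(groups)
grand_mean = sum(pooled) / n_total

ss_g = sum(len(s) * (sum(s) / len(s) - grand_mean) ** 2 for s in groups)
ss_e = sum((obs - sum(s) / len(s)) ** 2 for s in groups for obs in s)

# Standardise each Sum of Squares by its degrees of freedom.
ms_g = ss_g / (k - 1)
ms_e = ss_e / (n_total - k)
f_stat = ms_g / ms_e

critical_value = 4.96  # F(1, 10) at the 5% level, taken from the text
print(round(ms_g, 1), round(ms_e, 1), round(f_stat, 2))
print("reject H0" if f_stat > critical_value else "do not reject H0")
```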

From your statistical tables, the critical value for $F_{1,10}$ at $\alpha = 0.05$ is 4.96. Since 7.90, our
estimated F, is larger than the critical value, we conclude that the result is significant, and we
reject the null hypothesis that all populations are equal in weight. The exact p value is 0.018,
as we can see from the R output:
Analysis of Variance Table

Response: Weight
          Df  Sum Sq Mean Sq F value  Pr(>F)
Location   1 10740.1 10740.1  7.8955 0.01848 *
Residuals 10 13602.8  1360.3
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since we are comparing only two groups, we can now repeat the analysis with a t-test
(assuming equal variances), and confirm that the result does not change:
        Two Sample t-test

data:  Weight by Location
t = -2.8099, df = 10, p-value = 0.01848
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -107.27897  -12.38769
sample estimates:
mean in group Deception Valley    mean in group Kuruman River
                      542.6667                       602.5000
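
With only two groups, the square of the equal-variance t statistic equals the F statistic from the One-Way ANOVA, which is why the two p values coincide. The sketch below checks this numerically in Python; the helper `mean` and the variable names are illustrative, and the pooled-variance formula is the standard one for a two-sample t-test with equal variances.

```python
import math

# Meerkat weights (g) from Table 1.
deception = [514, 519, 568, 571, 553, 531]
kuruman = [624, 542, 597, 597, 577, 678]

def mean(values):
    return sum(values) / len(values)

n1, n2 = len(deception), len(kuruman)

# Pooled variance: within-group sums of squares over n1 + n2 - 2 degrees of freedom.
ss1 = sum((obs - mean(deception)) ** 2 for obs in deception)
ss2 = sum((obs - mean(kuruman)) ** 2 for obs in kuruman)
pooled_var = (ss1 + ss2) / (n1 + n2 - 2)

# Standard error of the difference in means, then the t statistic.
std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_stat = (mean(deception) - mean(kuruman)) / std_err

print(round(t_stat, 4), round(t_stat ** 2, 4))
```

The pooled variance here is the same quantity as MSE in the ANOVA table, which is where the equivalence comes from.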

How do we report this result? In project write-ups (and later in papers that you might author),
you will be interested in the biology, not the statistics. So, describe your results, and use the
statistics to back them up:
Meerkat weights in Kuruman River were significantly different from those in Deception Valley
($F_{1,10} = 7.90$, $p < 0.05$).
Important: Note that you should ALWAYS provide the statistics (in this case F), the number of
degrees of freedom for parametric tests (for F, we have two values, 1 and 10) or the sample
sizes for certain non-parametric tests, and an indication of the p value (p<0.05, or even better,
the exact p value to 3 DP, p=0.018).

A worked example using three groups


The ANOVA framework is very powerful and flexible. Let us see how we can use a One-Way
ANOVA to compare the means of three groups.
We have obtained another six meerkat weights, this time coming from Addo Elephant Park. If
we add these new data to the previous dataset, we obtain
Table 2. Weights in grams of a number of meerkats caught in a field study.

Deception Valley, Botswana (g=1)
  y11   y12   y13   y14   y15   y16
  514   519   568   571   553   531

Kuruman River, South Africa (g=2)
  y21   y22   y23   y24   y25   y26
  624   542   597   597   577   678

Addo Elephant Park, South Africa (g=3)
  y31   y32   y33   y34   y35   y36
  591   641   677   653   673   595
We want to test whether there is a difference in average weight among the populations.

First, we plot the data:

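So, H0: $\mu_1 = \mu_2 = \mu_3$ (i.e. there is no difference). H1 is that not all $\mu_1 \ldots \mu_k$ are equal.

Compute the overall mean:

$$\bar{y} = \frac{\sum_{g=1}^{k} \sum_{i=1}^{n_g} y_{gi}}{\sum_{g=1}^{k} n_g} = \frac{514 + 519 + 568 + \ldots + 673 + 595}{(6 + 6 + 6)} = 594.5$$

Mean for each group:

$$\bar{y}_1 = \frac{\sum_{i=1}^{n_1} y_{1i}}{n_1} = \frac{514 + 519 + 568 + 571 + 553 + 531}{6} = 542.7$$

$$\bar{y}_2 = \frac{\sum_{i=1}^{n_2} y_{2i}}{n_2} = \frac{624 + 542 + 597 + 597 + 577 + 678}{6} = 602.5$$

$$\bar{y}_3 = \frac{\sum_{i=1}^{n_3} y_{3i}}{n_3} = \frac{591 + 641 + 677 + 653 + 673 + 595}{6} = 638.3$$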

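Sum of Squares:

$$SS_{Tot} = \sum_{g=1}^{k} \sum_{i=1}^{n_g} (y_{gi} - \bar{y})^2 = (514 - 594.5)^2 + (519 - 594.5)^2 + (568 - 594.5)^2 + \ldots + (595 - 594.5)^2 = 48672.5$$

$$SS_G = \sum_{g=1}^{k} n_g (\bar{y}_g - \bar{y})^2 = 6 \times (542.7 - 594.5)^2 + 6 \times (602.5 - 594.5)^2 + 6 \times (638.3 - 594.5)^2 = 28032.3$$

$$SS_E = SS_{Tot} - SS_G = 48672.5 - 28032.3 = 20640.2$$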
Degrees of freedom:

dfTot = N - 1 = (6 + 6 + 6) - 1 = 18 - 1 = 17
dfG = k - 1 = 3 - 1 = 2
dfE = dfTot - dfG = 17 - 2 = 15

Mean Squares:

MSG = SSG / dfG = 28032.3 / 2 = 14016.2
MSE = SSE / dfE = 20640.2 / 15 = 1376.0

F statistics:

$F_{2,15}$ = MSG / MSE = 14016.2 / 1376.0 = 10.19

Populate the ANOVA table:

Source   SS        df   MS        F
Group    28032.3    2   14016.2   10.19
Error    20640.2   15    1376.0
Total    48672.5   17

Critical value for $F_{2,15}$ at $\alpha = 0.05$ is 3.68. Since 10.19 > 3.68, we reject H0 and accept H1.


Using R, we can confirm that our calculations are correct:
Analysis of Variance Table

Response: Weight
          Df Sum Sq Mean Sq F value   Pr(>F)
Location   2  28032   14016  10.186 0.001606 **
Residuals 15  20640    1376
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Report: Meerkats from the three locations differed significantly in weight
($F_{2,15} = 10.2$, $p = 0.002$), with the population in Deception Valley being the lightest and the one in
Addo Elephant Park being the heaviest.
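
The whole worked example can be collected into one small function that works for any number of groups. This is a sketch in Python rather than the R used above; `one_way_anova` and its dictionary keys are hypothetical names chosen for this illustration, and no p value is computed.

```python
def one_way_anova(groups):
    """Return the One-Way ANOVA table entries for a dict {group name: observations}."""
    pooled = [obs for sample in groups.values() for obs in sample]
    n_total, k = len(pooled), len(groups)
    grand_mean = sum(pooled) / n_total

    # Partition the total variation into group and error components.
    ss_tot = sum((obs - grand_mean) ** 2 for obs in pooled)
    ss_g = sum(
        len(sample) * (sum(sample) / len(sample) - grand_mean) ** 2
        for sample in groups.values()
    )
    ss_e = ss_tot - ss_g

    df_g, df_e = k - 1, n_total - k
    ms_g, ms_e = ss_g / df_g, ss_e / df_e
    return {
        "ss_g": ss_g, "ss_e": ss_e, "ss_tot": ss_tot,
        "df_g": df_g, "df_e": df_e, "df_tot": n_total - 1,
        "ms_g": ms_g, "ms_e": ms_e, "f": ms_g / ms_e,
    }

# Meerkat weights (g) from Table 2.
meerkats = {
    "Deception Valley": [514, 519, 568, 571, 553, 531],
    "Kuruman River": [624, 542, 597, 597, 577, 678],
    "Addo Elephant Park": [591, 641, 677, 653, 673, 595],
}

table = one_way_anova(meerkats)
for name, value in table.items():
    print(name, round(value, 2))
```

The function never assumes equal group sizes, so the same sketch applies to unbalanced designs.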

Additivity of Sums of Squares


This section is not examinable and is for reference only.
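We can show algebraically that the total sum of squares will always be equal to the group sum
of squares plus the error sum of squares, i.e. we can show that:

$$SS_{Tot} = SS_G + SS_E$$

always.
Consider $SS_{Tot}$:

$$SS_{Tot} = \sum_{g=1}^{k} \sum_{i=1}^{n_g} (y_{gi} - \bar{y})^2$$

$$= \sum_{g=1}^{k} \sum_{i=1}^{n_g} \left[ (\bar{y}_g - \bar{y}) + (y_{gi} - \bar{y}_g) \right]^2$$

$$= \sum_{g=1}^{k} \sum_{i=1}^{n_g} \left[ (\bar{y}_g - \bar{y})^2 + (y_{gi} - \bar{y}_g)^2 + 2 (\bar{y}_g - \bar{y})(y_{gi} - \bar{y}_g) \right]$$

$$= \sum_{g=1}^{k} \sum_{i=1}^{n_g} (\bar{y}_g - \bar{y})^2 + \sum_{g=1}^{k} \sum_{i=1}^{n_g} (y_{gi} - \bar{y}_g)^2 + 2 \sum_{g=1}^{k} \sum_{i=1}^{n_g} (\bar{y}_g - \bar{y})(y_{gi} - \bar{y}_g)$$

$$= SS_G + SS_E + 2 \sum_{g=1}^{k} \sum_{i=1}^{n_g} (\bar{y}_g - \bar{y})(y_{gi} - \bar{y}_g)$$

So all we need to do is show that the third term is actually identically zero:

$$2 \sum_{g=1}^{k} \sum_{i=1}^{n_g} (\bar{y}_g - \bar{y})(y_{gi} - \bar{y}_g) = 2 \sum_{g=1}^{k} \left[ (\bar{y}_g - \bar{y}) \sum_{i=1}^{n_g} (y_{gi} - \bar{y}_g) \right]$$

$$= 2 \sum_{g=1}^{k} \left[ (\bar{y}_g - \bar{y}) \left( \sum_{i=1}^{n_g} y_{gi} - n_g \bar{y}_g \right) \right]$$

$$= 2 \sum_{g=1}^{k} \left[ (\bar{y}_g - \bar{y}) (n_g \bar{y}_g - n_g \bar{y}_g) \right] = 0$$

Therefore

$$SS_{Tot} = SS_G + SS_E$$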
Lecture 1: One-Way Analysis of Variance (ANOVA) Questions


1) We measure the feeding rate (no. of items/5 minute focal observation) of
oystercatchers at three sites (exposed, partial, and sheltered).
exposed   partial   sheltered
14.2      18.4      24.1
16.5      13.0      22.2
 9.3      17.4      25.3
15.1      20.4      25.1
13.4      16.5      21.5

Is there any evidence that the feeding rate differs among locations?
Provide one or two sentences that you could use in a paper to summarise your
analysis.
2) Juvenile lobsters in aquaculture were grown on three different diets (fresh
mussels, semi-dry pellets and dry flakes). After nine weeks, their wet weight in
grams was:
mussels   pellets   flakes
151.6     117.7     101.8
132.1     110.8     102.9
104.2     128.6      90.4
153.5     110.1     132.8
132.0     175.2     129.3
119.0               129.4
161.9


Is there any evidence that the diet affects the growth rate of lobsters?
Provide one or two sentences that you could use in a paper to summarise your
analysis.
3) We recorded the biomass (g) of three species of bacteria (A, B, and C) grown
in flasks with a glucose broth. After a day, their mass was:
A      B      C
59.7   50.0   48.5
52.2   45.6   61.5
55.4   50.1   55.2
59.4   40.1   45.2
52.7   49.3   51.5


Do the bacteria species differ in their ability to grow under the conditions of the
experiment?
Provide one or two sentences that you could use in a paper to summarise your
analysis.