
Mathematical Biology IA

University of Cambridge

M. Castle

Lecture 1: One-Way Analysis of Variance (ANOVA)

Aims

1) To introduce the One-Way ANOVA, a test to compare the means of multiple groups.

2) To introduce the concept of partitioning of variance for statistical inference.

Objectives: After the lecture, students should be able to:

1) Compute Sums of Squares, degrees of freedom, Mean Squares and F statistics for a One-Way ANOVA.

2) Make inferences on the means of multiple groups using an ANOVA table.

In the last lecture, you saw how you can compare the means of two samples with a t-test. However, we often want to deal with more complex problems that involve several groups. To tackle these sorts of problems, we need to develop a more general framework, called Analysis of Variance (ANOVA). In this lecture, we will concentrate on the simplest form of ANOVA (a One-Way ANOVA, comparing the means of several groups), and we will then expand this framework to deal with more complex problems. As in previous lectures, we will confine ourselves to normally distributed populations, and we will assume that these populations have equal variances.

About notation

First of all, we need to look at the notation required to understand a One-Way ANOVA. Subscripts are used to define the origin of each data point. Each observation is denoted as $y_{gi}$, where $g$ represents the group (sample) it comes from, and $i$ defines the individual response within the group. So, $y_{23}$ is the third observation in the second group. If we consider the small dataset of meerkat weights in Table 1, this would be the third observation from Kuruman River, which is 597 g. A dataset has $k$ groups (meaning that $g = 1, \dots, k$), and each group has $n_g$ observations. In the meerkat example, we have two locations ($k = 2$), $n_1$ is the sample size for Deception Valley ($n_1 = 6$) and $n_2$ is the sample size for Kuruman River ($n_2 = 6$). The overall sample size is $N$ ($N = n_1 + n_2 = 12$).

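Table 1. Weights in grams of a number of meerkats caught in a field study in Deception Valley, Botswana (g=1) and Kuruman River (g=2).

Deception Valley (g=1)    Kuruman River (g=2)
y11   514                 y21   624
y12   519                 y22   542
y13   568                 y23   597
y14   571                 y24   597
y15   553                 y25   577
y16   531                 y26   678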
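As a rough illustration of the subscript notation, the meerkat data can be stored as one list per group. This is only a minimal Python sketch: the names `weights` and `y` are hypothetical helpers introduced here, not part of the lecture, and Python's zero-based indexing is shifted by one so that `y(2, 3)` corresponds to the third observation in the second group.

```python
# Meerkat weights (g), one list per group.
weights = {
    1: [514, 519, 568, 571, 553, 531],  # Deception Valley (g = 1)
    2: [624, 542, 597, 597, 577, 678],  # Kuruman River (g = 2)
}


def y(g, i):
    """Return the i-th response in group g (both indices are 1-based)."""
    return weights[g][i - 1]


k = len(weights)                                  # number of groups
n = {g: len(obs) for g, obs in weights.items()}   # sample size of each group
N = sum(n.values())                               # overall sample size

print(y(2, 3))   # third observation from Kuruman River
print(k, n, N)
```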

The analysis of variance framework is based on the idea that the variance of the response can be partitioned into components that correspond to the source of variation (namely, one or more components due to changes in the values of the independent variable(s) and a component due to random error). In a One-Way ANOVA, we have only one independent variable (e.g. location), which is a discrete factor (i.e. a categorical variable).

To understand how this is computed, we need to think about each observation as a deviation from the overall mean (Fig. 1),

$$y_{gi} = \mu + \alpha_g + \varepsilon_{gi}$$

where

$\mu$ = the overall mean

$\alpha_g$ = the group effect

$\varepsilon_{gi}$ = the random error component

Fig. 1: Diagram showing the partitioning of an individual weight $y_{gi}$.

In the meerkat example, this implies that there is a mean weight for this species ($\mu$), which is then affected by the location (with $\alpha_g > 0$ if meerkats are bigger than average at that location, and $\alpha_g < 0$ if the site is not a good one and they are smaller). However, even within a site, not all individuals will have the same weight, and the individual deviation from the site mean is given by $\varepsilon_{gi}$ (where $\varepsilon_{gi}$ is normally distributed with a mean of 0). As the purpose of a One-Way ANOVA is to determine whether the means of several groups differ significantly, the null hypothesis can be phrased as H0: $\mu_1 = \mu_2 = \dots = \mu_k$, where $\mu_g = \mu + \alpha_g$ is the true mean of group $g$. The alternative hypothesis, H1, states that $\mu_1, \dots, \mu_k$ are not all equal.
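The partitioning of a single observation into an overall mean, a group effect and a random error can be sketched numerically. The snippet below is a minimal Python illustration using the meerkat weights. The function name `decompose` is a hypothetical helper, and the sample means stand in for the unknown true values.

```python
# Meerkat weights (g) for the two locations.
weights = {
    "Deception Valley": [514, 519, 568, 571, 553, 531],
    "Kuruman River": [624, 542, 597, 597, 577, 678],
}

all_obs = [w for obs in weights.values() for w in obs]
overall_mean = sum(all_obs) / len(all_obs)   # estimate of the overall mean


def decompose(group, i):
    """Split the i-th (1-based) observation of a group into its three parts."""
    obs = weights[group]
    group_mean = sum(obs) / len(obs)
    effect = group_mean - overall_mean       # estimated group effect
    error = obs[i - 1] - group_mean          # estimated random error
    return overall_mean, effect, error


overall, effect, error = decompose("Kuruman River", 3)
print(round(overall, 1), round(effect, 1), round(error, 1))
# The three parts add back up to the original observation.
print(round(overall + effect + error, 1))
```

Here the group effect is positive because the Kuruman River mean lies above the overall mean, while the error is negative because this particular animal is lighter than its site mean.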

So, how can we estimate the variance components? If you think back to your first lecture, the numerator of the formula for the variance is the sum of squared deviations (SS) from the mean. SS are a good way to summarise the level of variability around a mean, and we can use a similar approach here (Fig. 2). A further property of SS is that they are additive. So,

SSTot = SSG + SSE

where

SSTot = Total Sum of Squares

SSG = Group Sum of Squares (also known as the Treatment SS)

SSE = Error Sum of Squares

First, let us estimate the Total Sum of Squares. This is the total deviation of the dataset from the overall mean (Fig. 2). Even though we don't know the true overall mean, we can estimate it by pooling all the observations from all the groups and then taking their mean, $\bar{y}$. So,

$$SS_{Tot} = \sum_{g=1}^{k} \sum_{i=1}^{n_g} (y_{gi} - \bar{y})^2$$


In the meerkat dataset, the overall mean is 572.6. The sum of squared deviations from the overall mean (SSTot) is

$$SS_{Tot} = (514 - 572.6)^2 + \dots + (678 - 572.6)^2 = 24342.9$$

[All figures are rounded to 1 decimal place (DP). It is convention to give summary statistics to an accuracy of 1 DP more than the accuracy of the original data, which in the meerkat example were given to zero DP.]

[Fig. 2: three panels labelled SSTot, SSG and SSE, each plotting Weight (g) against Observations for Deception Valley and Kuruman River.]

We now need to consider the Group Sum of Squares (SSG). What we want to know here is how much variability in the dataset comes from the fact that the group means are different from the overall mean. Again, we don't know exactly what the true group mean $\mu_g = \mu + \alpha_g$ is, but we can take the mean for each group, $\bar{y}_g$ (which is our best estimate of $\mu + \alpha_g$). Now we can estimate the amount of deviation due to the group effect as

$$SS_G = \sum_{g=1}^{k} \sum_{i=1}^{n_g} (\bar{y}_g - \bar{y})^2 = \sum_{g=1}^{k} n_g (\bar{y}_g - \bar{y})^2$$

In the meerkat example, the mean for Deception Valley, $\bar{y}_1$, is 542.7, and the mean for Kuruman River, $\bar{y}_2$, is 602.5. Since the sample size is 6 for both groups ($n_1 = n_2 = 6$), we have

$$SS_G = 6 \times (542.7 - 572.6)^2 + 6 \times (602.5 - 572.6)^2 = 10740.1$$

Finally, we need to estimate the amount of variation in the data due to the random error (i.e. those random deviations due to individual effects). Since our best estimate of $\mu + \alpha_g$ is $\bar{y}_g$, we can write:

$$SS_E = \sum_{g=1}^{k} \sum_{i=1}^{n_g} (y_{gi} - \bar{y}_g)^2$$

$$SS_E = (514 - 542.7)^2 + \dots + (531 - 542.7)^2 + (624 - 602.5)^2 + \dots + (678 - 602.5)^2 = 13602.8$$
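The three sums of squares can be checked with a few lines of Python. This is a minimal sketch using the meerkat weights. The variable names (`ss_tot`, `ss_g`, `ss_e`) are illustrative choices, and the calculation uses unrounded means, so the results can differ slightly from hand calculations based on figures rounded to 1 DP.

```python
# Meerkat weights (g) for the two locations.
weights = {
    "Deception Valley": [514, 519, 568, 571, 553, 531],
    "Kuruman River": [624, 542, 597, 597, 577, 678],
}


def mean(values):
    return sum(values) / len(values)


all_obs = [w for obs in weights.values() for w in obs]
grand_mean = mean(all_obs)

# Total SS: every observation against the overall mean.
ss_tot = sum((w - grand_mean) ** 2 for w in all_obs)

# Group SS: each group mean against the overall mean, weighted by group size.
ss_g = sum(len(obs) * (mean(obs) - grand_mean) ** 2 for obs in weights.values())

# Error SS: every observation against its own group mean.
ss_e = sum((w - mean(obs)) ** 2 for obs in weights.values() for w in obs)

print(round(ss_tot, 1), round(ss_g, 1), round(ss_e, 1))
# Additivity check: SSTot should equal SSG + SSE up to floating-point error.
print(abs(ss_tot - (ss_g + ss_e)) < 1e-6)
```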

Now that we have created estimates of the components of variance, we can use them to test our null hypothesis. As usual, we need to say how confident we are that our results do not deviate from H0 simply as a result of random chance. The obvious approach here is to say that, if SSG is a rather large proportion of SSTot, then the null hypothesis that all groups are similar is unlikely to be true. We can rephrase this as saying that, if the amount of variance accounted for by the group effect is large in comparison to the amount of variance due to error (keep in mind that SSTot = SSG + SSE), then the group effect is likely to be real (i.e. significant).

At this stage, you might be tempted to simply compare SSG with SSE. However, note that we obtained the different sums of squares using very different amounts of information. The SSG referred to a small number of estimates (the group means) compared to the overall mean, whereas the SSE is the sum of a large number of individual deviations (we compared all data points to their respective group means). So, to compare the variance components, we first need to standardise them according to the number of parameters involved (i.e. the amount of information).

We need to develop a standardisation parameter that allows us to compare SS. This parameter is called the degrees of freedom (usually denoted as df), and it is related to the number of parameters we had to estimate in order to compute a set of squared deviations. In general, the degrees of freedom for a given set of deviations is equal to the number of parameters/observations minus the number of reference parameter values we used to compute the deviations. So, the df for SSTot (dfTot) is equal to the sample size minus 1 (as we only have to estimate the overall mean). The dfG is equal to the number of groups minus 1 (as the group means are compared to a single overall mean). Finally, dfE is equal to the sample size minus the number of groups (as we had to estimate a mean for each group).
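The three rules for the degrees of freedom can be written as a small function. This is a minimal Python sketch. `degrees_of_freedom` is a hypothetical helper name, and it takes only the list of group sizes as input.

```python
def degrees_of_freedom(group_sizes):
    """Return (df_total, df_group, df_error) for a One-Way ANOVA."""
    k = len(group_sizes)    # number of groups
    N = sum(group_sizes)    # overall sample size
    df_tot = N - 1          # one overall mean was estimated
    df_g = k - 1            # k group means compared to a single overall mean
    df_e = N - k            # one mean was estimated for each group
    return df_tot, df_g, df_e


# Two meerkat sites with six animals each.
df_tot, df_g, df_e = degrees_of_freedom([6, 6])
print(df_tot, df_g, df_e)
# Like the sums of squares, the degrees of freedom are additive.
print(df_tot == df_g + df_e)
```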


Source   SS      df              MS                F
Group    SSG     dfG = k - 1     MSG = SSG / dfG   MSG / MSE
Error    SSE     dfE = N - k     MSE = SSE / dfE
Total    SSTot   dfTot = N - 1

We want to concentrate on the group and error components, to be able to test our null hypothesis as outlined above (i.e. by asking whether the variance component due to the group is large relative to the component due to error). We can estimate the mean square deviation (MS, our standardised estimator of variation) for each component by dividing each SS by its appropriate df. Our confidence in H0 depends on how much variation in the dataset can be attributed to the group effect vs. the amount due to random error. So, to get a handle on this, we can estimate the ratio between the group MS (MSG) and the error MS (MSE). The ratio of two variances has a well-known behaviour, described by the F distribution. The F distribution is somewhat different from the distributions you have encountered so far, as it is defined by two degrees of freedom, one for the variance in the numerator and one for the variance in the denominator.
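The steps from sums of squares to the F ratio can be sketched as a single function. This is a minimal Python illustration, not a replacement for a statistics package. `one_way_anova` is a hypothetical helper name, and it is applied here to the two-site meerkat data.

```python
def one_way_anova(groups):
    """Return SS, df, MS and F for a list of groups of observations."""
    all_obs = [y for obs in groups for y in obs]
    N, k = len(all_obs), len(groups)
    grand_mean = sum(all_obs) / N
    means = [sum(obs) / len(obs) for obs in groups]

    ss_g = sum(len(obs) * (m - grand_mean) ** 2 for obs, m in zip(groups, means))
    ss_e = sum((y - m) ** 2 for obs, m in zip(groups, means) for y in obs)
    df_g, df_e = k - 1, N - k

    ms_g = ss_g / df_g   # mean square for the group component
    ms_e = ss_e / df_e   # mean square for the error component
    return {"SSG": ss_g, "SSE": ss_e, "dfG": df_g, "dfE": df_e,
            "MSG": ms_g, "MSE": ms_e, "F": ms_g / ms_e}


deception = [514, 519, 568, 571, 553, 531]
kuruman = [624, 542, 597, 597, 577, 678]
table = one_way_anova([deception, kuruman])
print({name: round(value, 2) for name, value in table.items()})
```

The F value is the group mean square divided by the error mean square, so it carries the two degrees of freedom `dfG` and `dfE`.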

So, if we write out the ANOVA table for meerkats:

Source   SS        df   MS        F
Group    10740.1   1    10740.1   7.90
Error    13602.8   10   1360.3
Total    24342.9   11

From your statistical tables, the critical value for $F_{1,10}$ at $\alpha = 0.05$ is 4.96. Since 7.90, our estimated F, is larger than the critical value, we conclude that the result is significant, and we reject the null hypothesis that all populations are equal in weight. The exact p value is 0.018, as we can see from the R output:

Analysis of Variance Table

Response: Weight
          Df  Sum Sq Mean Sq F value  Pr(>F)
Location   1 10740.1 10740.1  7.8955 0.01848 *
Residuals 10 13602.8  1360.3
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
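For readers curious where the exact p value comes from, the upper tail of the F distribution can be approximated with the Python standard library alone. The sketch below is an illustration under stated assumptions, not the method R uses. `f_upper_tail` is a hypothetical helper that relies on the standard identity linking the F distribution to the regularised incomplete beta function, and it evaluates the integral numerically with Simpson's rule. In practice, a statistics package would be the sensible choice.

```python
import math


def f_upper_tail(f, df1, df2, steps=2000):
    """Approximate P(F > f) for an F distribution with (df1, df2) df.

    Uses P(F > f) = I_x(df2/2, df1/2) with x = df2 / (df2 + df1 * f),
    where I_x is the regularised incomplete beta function, evaluated by
    composite Simpson integration. Assumes f > 0 and df2 >= 2, so that
    the integrand stays finite on [0, x]; steps must be even.
    """
    a, b = df2 / 2, df1 / 2
    x = df2 / (df2 + df1 * f)
    beta = math.gamma(a) * math.gamma(b) / math.gamma(a + b)

    def integrand(t):
        return t ** (a - 1) * (1 - t) ** (b - 1)

    h = x / steps
    total = integrand(0.0) + integrand(x)
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * integrand(j * h)
    return total * h / 3 / beta


# F value and degrees of freedom from the meerkat ANOVA table.
p = f_upper_tail(7.8955, 1, 10)
print(round(p, 5))
```

The printed value should agree closely with the `Pr(>F)` column of the R output.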

Since we are comparing only two groups, we can now repeat the analysis with a t-test (assuming equal variances), and confirm that the result does not change:

        Two Sample t-test

data:  Weight by Location
t = -2.8099, df = 10, p-value = 0.01848
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -107.27897  -12.38769
sample estimates:
mean in group Deception Valley    mean in group Kuruman River
                      542.6667                       602.5000
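The agreement between the two tests is not a coincidence: with two groups, the square of the equal-variance t statistic equals the One-Way ANOVA F statistic. The following minimal Python sketch checks this on the meerkat data. `pooled_t` is a hypothetical helper name.

```python
import math

deception = [514, 519, 568, 571, 553, 531]
kuruman = [624, 542, 597, 597, 577, 678]


def pooled_t(a, b):
    """Two-sample t statistic assuming equal variances, with its df."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    ss_a = sum((y - mean_a) ** 2 for y in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    df = len(a) + len(b) - 2
    pooled_var = (ss_a + ss_b) / df   # same quantity as MSE in the ANOVA
    se = math.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))
    return (mean_a - mean_b) / se, df


t, df = pooled_t(deception, kuruman)
print(round(t, 4), df)     # t statistic and its degrees of freedom
print(round(t ** 2, 4))    # squaring t gives the F statistic of the ANOVA
```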


How do we report this result? In project write-ups (and later in papers that you might author), you will be interested in the biology, not the statistics. So, describe your results, and use the statistics to back them up:

Meerkat weights in Kuruman River were significantly different from those in Deception Valley ($F_{1,10} = 7.90$, $p < 0.05$).

Important: Note that you should ALWAYS provide the test statistic (in this case F), the number of degrees of freedom for parametric tests (for F, we have two values, 1 and 10) or the sample sizes for certain non-parametric tests, and an indication of the p value ($p < 0.05$ or, even better, the exact p value to 3 DP, $p = 0.018$).

The ANOVA framework is very powerful and flexible. Let us see how we can use a One-Way

ANOVA to compare the means of three groups.

We have obtained another six meerkat weights, this time coming from Addo Elephant Park. If we add these new data to the previous dataset, we obtain Table 2.

Table 2. Weights in grams of a number of meerkats caught in a field study in Deception Valley, Botswana (g=1), Kuruman River (g=2) and Addo Elephant Park (g=3).

Deception Valley (g=1)    Kuruman River (g=2)    Addo Elephant Park (g=3)
y11   514                 y21   624              y31   591
y12   519                 y22   542              y32   641
y13   568                 y23   597              y33   677
y14   571                 y24   597              y34   653
y15   553                 y25   577              y35   673
y16   531                 y26   678              y36   595

We want to test whether there is a difference in average weight among the populations.


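So, H0: $\mu_1 = \mu_2 = \mu_3$ (i.e. there is no difference). H1 is that not all of $\mu_1, \dots, \mu_k$ are equal.

The overall mean and the group means are:

$$\bar{y} = \frac{\sum_{g=1}^{k} \sum_{i=1}^{n_g} y_{gi}}{\sum_{g=1}^{k} n_g} = \frac{10701}{(6 + 6 + 6)} = 594.5$$

$$\bar{y}_1 = \frac{\sum_{i=1}^{n_1} y_{1i}}{n_1} = \frac{3256}{6} = 542.7$$

$$\bar{y}_2 = \frac{\sum_{i=1}^{n_2} y_{2i}}{n_2} = \frac{3615}{6} = 602.5$$

$$\bar{y}_3 = \frac{\sum_{i=1}^{n_3} y_{3i}}{n_3} = \frac{3830}{6} = 638.3$$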

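Sum of Squares:

$$SS_{Tot} = \sum_{g=1}^{k} \sum_{i=1}^{n_g} (y_{gi} - \bar{y})^2 = (514 - 594.5)^2 + \dots + (595 - 594.5)^2 = 48672.5$$

$$SS_G = \sum_{g=1}^{k} n_g (\bar{y}_g - \bar{y})^2 = 6 \times (542.7 - 594.5)^2 + 6 \times (602.5 - 594.5)^2 + 6 \times (638.3 - 594.5)^2 = 28032.3$$

$$SS_E = SS_{Tot} - SS_G = 48672.5 - 28032.3 = 20640.2$$

Degrees of freedom:

dfTot = N - 1 = (6 + 6 + 6) - 1 = 18 - 1 = 17

dfG = k - 1 = 3 - 1 = 2

dfE = dfTot - dfG = 17 - 2 = 15

Mean Squares:

MSG = SSG / dfG = 28032.3 / 2 = 14016.2

MSE = SSE / dfE = 20640.2 / 15 = 1376.0

F statistic:

F = MSG / MSE = 14016.2 / 1376.0 = 10.19

Source   SS        df   MS        F
Group    28032.3   2    14016.2   10.19
Error    20640.2   15   1376.0
Total    48672.5   17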
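The same calculation for three groups can be sketched in Python as a check on the hand-worked figures. This is a minimal illustration with hypothetical variable names. As in the worked example, the error sum of squares is obtained by subtraction, relying on the additivity of the sums of squares.

```python
# Meerkat weights (g) for the three locations.
groups = {
    "Deception Valley": [514, 519, 568, 571, 553, 531],
    "Kuruman River": [624, 542, 597, 597, 577, 678],
    "Addo Elephant Park": [591, 641, 677, 653, 673, 595],
}

all_obs = [w for obs in groups.values() for w in obs]
N, k = len(all_obs), len(groups)
grand_mean = sum(all_obs) / N
group_means = {name: sum(obs) / len(obs) for name, obs in groups.items()}

ss_tot = sum((w - grand_mean) ** 2 for w in all_obs)
ss_g = sum(len(obs) * (group_means[name] - grand_mean) ** 2
           for name, obs in groups.items())
ss_e = ss_tot - ss_g                 # by additivity of the sums of squares

df_g, df_e = k - 1, N - k
ms_g, ms_e = ss_g / df_g, ss_e / df_e
f_stat = ms_g / ms_e

# Print the ANOVA table in the same layout as the hand-worked version.
print(f"{'Source':<8}{'SS':>10}{'df':>5}{'MS':>10}{'F':>8}")
print(f"{'Group':<8}{ss_g:>10.1f}{df_g:>5}{ms_g:>10.1f}{f_stat:>8.2f}")
print(f"{'Error':<8}{ss_e:>10.1f}{df_e:>5}{ms_e:>10.1f}")
print(f"{'Total':<8}{ss_tot:>10.1f}{N - 1:>5}")
```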

Using R, we can confirm that our calculations are correct:

Analysis of Variance Table

Response: Weight
          Df Sum Sq Mean Sq F value   Pr(>F)
Location   2  28032   14016  10.186 0.001606 **
Residuals 15  20640    1376
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Meerkat weights differed significantly among the three locations ($F_{2,15} = 10.2$, $p = 0.002$), with the population in Deception Valley being the lightest and the one in Addo Elephant Park being the heaviest.


This section is not examinable and is for reference only.

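We can show algebraically that the total sum of squares will always be equal to the group sum of squares plus the error sum of squares, i.e. we can show that

$$SS_{Tot} = SS_G + SS_E$$

always.

Consider $SS_{Tot}$:

$$
\begin{aligned}
SS_{Tot} &= \sum_{g=1}^{k} \sum_{i=1}^{n_g} (y_{gi} - \bar{y})^2 \\
&= \sum_{g=1}^{k} \sum_{i=1}^{n_g} \left[ (y_{gi} - \bar{y}_g) + (\bar{y}_g - \bar{y}) \right]^2 \\
&= \sum_{g=1}^{k} \sum_{i=1}^{n_g} \left[ (y_{gi} - \bar{y}_g)^2 + (\bar{y}_g - \bar{y})^2 + 2 (y_{gi} - \bar{y}_g)(\bar{y}_g - \bar{y}) \right] \\
&= \sum_{g=1}^{k} \sum_{i=1}^{n_g} (y_{gi} - \bar{y}_g)^2 + \sum_{g=1}^{k} \sum_{i=1}^{n_g} (\bar{y}_g - \bar{y})^2 + 2 \sum_{g=1}^{k} \sum_{i=1}^{n_g} (y_{gi} - \bar{y}_g)(\bar{y}_g - \bar{y}) \\
&= SS_E + SS_G + 2 \sum_{g=1}^{k} \sum_{i=1}^{n_g} (y_{gi} - \bar{y}_g)(\bar{y}_g - \bar{y})
\end{aligned}
$$

So all we need to do is show that the third term is identically zero:

$$
\begin{aligned}
2 \sum_{g=1}^{k} \sum_{i=1}^{n_g} (y_{gi} - \bar{y}_g)(\bar{y}_g - \bar{y})
&= 2 \sum_{g=1}^{k} \left[ (\bar{y}_g - \bar{y}) \sum_{i=1}^{n_g} (y_{gi} - \bar{y}_g) \right] \\
&= 2 \sum_{g=1}^{k} \left[ (\bar{y}_g - \bar{y}) \left( \sum_{i=1}^{n_g} y_{gi} - n_g \bar{y}_g \right) \right] \\
&= 2 \sum_{g=1}^{k} \left[ (\bar{y}_g - \bar{y}) (n_g \bar{y}_g - n_g \bar{y}_g) \right] \\
&= 0
\end{aligned}
$$

The last step uses the definition of the group mean, $\sum_{i=1}^{n_g} y_{gi} = n_g \bar{y}_g$.

Therefore

$$SS_{Tot} = SS_G + SS_E$$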


1) We measure the feeding rate (no. of items per 5-minute focal observation) of oystercatchers at three sites (exposed, partial, and sheltered).

exposed     14.2   16.5    9.3   15.1   13.4
partial     18.4   13.0   17.4   20.4   16.5
sheltered   24.1   22.2   25.3   25.1   21.5

Is there any evidence that the feeding rate differs among locations?

Provide one or two sentences that you could use in a paper to summarise your analysis.

2) Juvenile lobsters in aquaculture were grown on three different diets (fresh mussels, semi-dry pellets and dry flakes). After nine weeks, their wet weight in grams was:

mussels   151.6   132.1   104.2   153.5   132.0   119.0   161.9
pellets   117.7   110.8   128.6   110.1   175.2
flakes    101.8   102.9    90.4   132.8   129.3   129.4

Is there any evidence that the diet affects the growth rate of lobsters?

Provide one or two sentences that you could use in a paper to summarise your analysis.

3) We recorded the biomass (g) of three species of bacteria (A, B, and C) grown in flasks with a glucose broth. After a day, their mass was:

A   59.7   52.2   55.4   59.4   52.7
B   50.0   45.6   50.1   40.1   49.3
C   48.5   61.5   55.2   45.2   51.5

Do the bacteria species differ in their ability to grow under the conditions of the experiment?

Provide one or two sentences that you could use in a paper to summarise your analysis.
