
Use of F distribution (Analysis of Variance (ANOVA))

We have seen that the t-test can be applied to decide whether the
difference in the mean values of two groups of observations can be
attributed to sampling variation, that is, whether the two groups
come from the same homogeneous population.
When there are more than two groups to be tested, an
alternative procedure is needed for testing the hypothesis that
all the samples are drawn from the same population, that is,
that they have the same mean. For example, five fertilisers are
each applied to four plots of wheat, and the yield of wheat on each
of the plots is recorded. We may be interested in finding out
whether the effects of these fertilisers on the yields are
significantly different. The answer to this problem is provided
by the technique of Analysis of Variance. The basic purpose of
the analysis of variance (ANOVA) is to test the homogeneity of
several means. This analysis is based on the assumption that
the total variation present in a set of observations may, under
certain conditions, be partitioned into a number of
components associated with the classification of the data. The total
variation in any set of numerical data is due to a number of
causes, which may be classified as:

(i) Assignable causes, and (ii) Chance causes.

The variation due to assignable causes can be detected
and measured, whereas the variation due to chance causes is
beyond human control and cannot be traced
separately.

In short: ANOVA is the analysis of variation in an experimental
outcome, and especially of a statistical variance, in order
to determine the contributions of given factors or
variables to that variance.

Consider the following two investigations:

(a) A car magazine wishes to compare the average
petrol consumption of THREE similar models of
car and has available six vehicles of each model.
(b) A teacher is interested in a comparison of the
average percentage marks attained in the
examinations of FIVE different subjects and has
available the marks of eight students who
completed each examination.

In both these investigations, interest is centred on
a comparison of more than two populations:
THREE models of car, FIVE examinations.

Basic assumptions in ANOVA

For the validity of the F-test in ANOVA, the following
assumptions are made:
(i) The observations are independent;
(ii) The parent population from which the
observations are taken is normal; and
(iii) The various treatment and environment
effects are additive in nature.

Applications of ANOVA

After reading the above, you might be wondering how
this can be used in the real world. ANOVA
comes in handy in a large number of real-life situations.
For instance, in the social sciences, there is much
research devoted to figuring out what factors influence
people's opinions and behaviours. You could design a
study in which you have a group of Democrats, a
group of Republicans, and a group of Independents and
give them a survey that asks them about their views on
same-sex marriage. You could then use an ANOVA to
compare the average level of support for same-sex
marriage across the three groups.

ANOVAs can also be used in the medical
profession. When scientists want to test the
effectiveness of a new drug, they can use an
ANOVA to see how effective (or possibly
ineffective) that drug is. For example, if doctors want to
test the effectiveness of a new cancer treatment, they
could design an experiment that involves 4 different
levels of the independent variable (which is the cancer
treatment). One group of participants could receive
chemotherapy, another group could receive radiation
treatment, still another could receive no treatment, and
the last group would receive the new drug in
question. By comparing the percentage reduction in
cancer cells in each treatment group with an ANOVA,
scientists could test whether the treatments differ in
their average effectiveness.

These are not the only situations in which an
ANOVA can be useful. It can be used in any experiment
that involves two (but usually three or more) levels of
the independent variable. Additionally, it can work for
more than one independent variable being tested
simultaneously.

Introduction: Any data set has variability. Variability exists
(i) within groups and (ii) between groups. Now the question
is:

(i) Is this variance significant, or
(ii) Is it merely due to chance?

The answer to this question is provided by the technique of
Analysis of Variance.

Difference between t-test and ANOVA

• The difference between ANOVA and the t-test is
that ANOVA can be used in situations where there
are two or more means being compared, whereas
the t-test is limited to situations where only two
means are involved.

• The test statistic for ANOVA is an F-ratio, which is a
ratio of two sample variances. In the context of
ANOVA, the sample variances are called mean
squares, or MS values.

• The top of the F-ratio, MS_between, measures the size of
mean differences between samples. The bottom of
the ratio, MS_within, measures the magnitude of
differences that would be expected without any
treatment effects.
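
The F-ratio described in the bullets above can be written compactly as follows. This is a sketch of the standard one-way form; the symbols SS (sum of squares), g (number of groups) and n (total number of observations) are notation assumed here, not taken from the text above.

```latex
F = \frac{MS_{\text{between}}}{MS_{\text{within}}},
\qquad
MS_{\text{between}} = \frac{SS_{\text{between}}}{g - 1},
\qquad
MS_{\text{within}} = \frac{SS_{\text{within}}}{n - g}
```

A value of F close to 1 suggests that the differences between the sample means are about what chance alone would produce; a large F suggests a treatment effect.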

Exercise 1.

In a diet survey of antenatal mothers, the iron intake of a
sample of ten mothers from each of four villages was
recorded as follows. It is desired to find out whether
the differences in the mean iron intake between the samples in
the four villages are due to chance or can be regarded as
statistically significant.

Iron intake in mg.

         Village 1   Village 2   Village 3   Village 4

         11.5        9.5         18.5        20.0
         12.5        18.5        16.5        16.5
         8.5         16.0        24.5        17.0
         21.0        22.0        30.0        24.0
         28.0        30.0        28.5        10.0
         26.0        14.5        14.0        12.5
         14.0        19.0        19.0        18.0
         22.0        24.0        17.0        22.0
         10.0        19.5        18.0        17.0
         22.0        15.0        29.0        15.5
Total    175.5       188.0       215.0       172.5

Solution. Null hypothesis H0: There is no significant
difference between the mean iron intakes of the samples
in the four villages.

Various steps in the computation are as under:

(i) Calculate the sum of all observations (the total of all
40 observations) = $\sum x_{ij}$, where $x_{ij}$ represents each
observation.
(ii) Calculate the sum of the observations for each village
(175.5, 188.0, 215.0 and 172.5) = $T_i$, $i = 1, 2, 3, 4$.
(iii) Calculate the sum of squares of all observations:
$(11.5)^2 + (12.5)^2 + \dots + (15.5)^2 = \sum x_{ij}^2$.
(iv) Calculate the total sum of squares = $\sum x_{ij}^2 - (\sum x_{ij})^2 / n$,
where $n$ is the total number of observations.
(v) Calculate the 'between villages' sum of squares
= $\sum (T_i^2 / k_i) - (\sum x_{ij})^2 / n$,
where $k_i$ is the number of observations in the $i$th village.
(vi) The 'within villages' sum of squares is obtained as the
difference between the total sum of squares and the
sum of squares between the villages.

Sum of all the observations: $11.5 + 12.5 + \dots + 15.5 = 751$

Total sum of squares $= (11.5^2 + 12.5^2 + \dots + 15.5^2) - (751)^2 / 40 = 1319.5$

Sum of squares between the villages
$= \frac{(175.5)^2 + (188.0)^2 + (215.0)^2 + (172.5)^2}{10} - \frac{(751)^2}{40} = 112.55$

Sum of squares within villages $= 1319.5 - 112.55 = 1206.95$
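
The arithmetic in steps (i) to (vi) can be checked with a short Python sketch using only the standard library. The variable names are illustrative assumptions. Because the sketch does not round intermediate values, its between-villages figure comes out slightly below the 112.55 quoted above.

```python
# Iron intake (mg) of ten mothers in each of the four villages (Exercise 1).
villages = {
    "Village 1": [11.5, 12.5, 8.5, 21.0, 28.0, 26.0, 14.0, 22.0, 10.0, 22.0],
    "Village 2": [9.5, 18.5, 16.0, 22.0, 30.0, 14.5, 19.0, 24.0, 19.5, 15.0],
    "Village 3": [18.5, 16.5, 24.5, 30.0, 28.5, 14.0, 19.0, 17.0, 18.0, 29.0],
    "Village 4": [20.0, 16.5, 17.0, 24.0, 10.0, 12.5, 18.0, 22.0, 17.0, 15.5],
}

observations = [x for values in villages.values() for x in values]
n = len(observations)

# Step (i): sum of all observations.
grand_total = sum(observations)

# Step (ii): sum of the observations in each village, T_i.
village_totals = {name: sum(values) for name, values in villages.items()}

# Step (iii): sum of squares of all observations.
raw_sum_of_squares = sum(x * x for x in observations)

# Step (iv): total sum of squares, using the correction term (sum x)^2 / n.
correction = grand_total ** 2 / n
total_ss = raw_sum_of_squares - correction

# Step (v): between-villages sum of squares, sum(T_i^2 / k_i) - correction.
between_ss = sum(sum(v) ** 2 / len(v) for v in villages.values()) - correction

# Step (vi): within-villages sum of squares, by difference.
within_ss = total_ss - between_ss

print("Village totals:", village_totals)
print(f"Grand total      = {grand_total:.1f}")
print(f"Total SS         = {total_ss:.3f}")
print(f"Between villages = {between_ss:.3f}")
print(f"Within villages  = {within_ss:.3f}")
```

Each village total is divided by its own group size `len(v)`, so the same sketch would also cover samples of unequal size.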

Analysis of variance table

Source of sum       D.F.    Sum of      Mean sum of        F
of squares                  squares     squares (MSS)

Between villages    3       112.55      37.517             1.119

Within villages     36      1206.95     33.526
---------------------------------------------------------------------
Total               39      1319.5

• There are 4 villages, so the d.f. for between
villages = 4 - 1 = 3.

• There are 40 observations in all, so the d.f. for
the total sum of squares (TSS) = 40 - 1 = 39.

• The d.f. for the within-villages sum of
squares is the difference between the above
two: 39 - 3 = 36.

• The MSS in each line is obtained by dividing the sum
of squares by the respective d.f.

• F, which is the variance ratio, is obtained by
dividing the "between mean sum of squares" by
the "within mean sum of squares". The calculated
F value is compared with the tabulated F for
3 and 36 d.f. The tabulated F (5% level) is 2.87.

Conclusion: The calculated F (1.119) is less than the
tabulated F (2.87), so the difference is not significant
and the null hypothesis is not rejected. The differences in
the mean iron intake between the samples in
the 4 villages can be attributed to chance.
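
The analysis of variance table and the decision can be rebuilt from the two sums of squares in a few lines of Python. This is a minimal sketch: the inputs are the rounded figures from the worked solution, the tabulated F of 2.87 is the one quoted in the text, and reading it as the 5% critical value is an assumption.

```python
# Figures taken from the worked solution for Exercise 1.
between_ss = 112.55    # sum of squares between villages
within_ss = 1206.95    # sum of squares within villages
num_villages = 4
num_observations = 40

# Degrees of freedom.
df_between = num_villages - 1          # 4 - 1 = 3
df_total = num_observations - 1        # 40 - 1 = 39
df_within = df_total - df_between      # 39 - 3 = 36

# Mean sums of squares: each sum of squares divided by its d.f.
ms_between = between_ss / df_between
ms_within = within_ss / df_within

# Variance ratio F.
f_ratio = ms_between / ms_within

# Tabulated F for (3, 36) d.f. as quoted in the text (assumed 5% level).
tabulated_f = 2.87
significant = f_ratio > tabulated_f

print(f"Between villages: d.f. = {df_between}, MSS = {ms_between:.3f}")
print(f"Within villages:  d.f. = {df_within}, MSS = {ms_within:.3f}")
print(f"F = {f_ratio:.3f}, tabulated F = {tabulated_f}")
print("Significant" if significant else "Not significant")
```

With these inputs the sketch gives F = 1.119 and reports the difference as not significant, in line with the conclusion above.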

Exercise 2: A paper manufacturer makes grocery bags and
is interested in increasing the tensile strength of the
product. It is thought that tensile strength is a function of
the hardwood concentration in the pulp. An investigation is
carried out to compare four levels of hardwood
concentration: 5%, 10%, 15% and 20%. Six test specimens
are made at each level and all 24 specimens are then tested
in random order. The results are shown below:

Hardwood              Tensile strength
concentration (%)     (psi)

5                     7, 8, 15, 11, 9, 10
10                    12, 17, 13, 18, 19, 15
15                    14, 18, 19, 17, 16, 18
20                    19, 25, 22, 23, 18, 20
Total
Solution: Null hypothesis H0 : Tensile strength is
independent of the hardwood concentration.
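
One way to carry the computation through is to wrap the steps used in Exercise 1 in a small function and apply it to the tensile strength data. This is a sketch only; `one_way_anova` is a hypothetical helper written for this example, not a library routine.

```python
def one_way_anova(groups):
    """Return (between_ss, within_ss, df_between, df_within, f_ratio).

    `groups` is a list of samples, one list of observations per treatment.
    """
    observations = [x for group in groups for x in group]
    n = len(observations)
    correction = sum(observations) ** 2 / n

    # Total, between-groups and within-groups sums of squares.
    total_ss = sum(x * x for x in observations) - correction
    between_ss = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    within_ss = total_ss - between_ss

    # Degrees of freedom and the variance ratio.
    df_between = len(groups) - 1
    df_within = n - len(groups)
    f_ratio = (between_ss / df_between) / (within_ss / df_within)
    return between_ss, within_ss, df_between, df_within, f_ratio


# Tensile strength (psi) at each hardwood concentration (%), from the table.
tensile_strength = {
    5: [7, 8, 15, 11, 9, 10],
    10: [12, 17, 13, 18, 19, 15],
    15: [14, 18, 19, 17, 16, 18],
    20: [19, 25, 22, 23, 18, 20],
}

between_ss, within_ss, df_between, df_within, f_ratio = one_way_anova(
    list(tensile_strength.values())
)

print(f"Between concentrations: SS = {between_ss:.2f}, d.f. = {df_between}")
print(f"Within concentrations:  SS = {within_ss:.2f}, d.f. = {df_within}")
print(f"F = {f_ratio:.2f}")
```

The calculated F is then compared with the tabulated F for 3 and 20 degrees of freedom at the chosen level of significance, exactly as in Exercise 1.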

Exercise 2: Ten varieties of wheat are each grown in 3 plots and the
following yields per acre are obtained:

Plots/Variety 1 2 3 4 5 6 7 8 9 10
------------------------------------------------------------------------------

I 7 7 14 11 9 6 9 8 12 9
II 8 9 13 10 9 7 13 13 11 11
III 7 6 16 11 12 5 12 11 11 11
Total 22 22 43 32 30 18 34 32 34 31
------------------------------------------------------------------------------
Test the significance of the difference between the yields of the varieties.

Exercise 3: Four different feeds were each fed to a lot of 5 baby chicks,
and the gains in weight (in kg.) were as given below:

Gain in weight of baby chicks in a feeding experiment

Feed Gain in weight (in kg.) Total

A 2.75, 2.45, 2.10, 1.05, 2.60 10.95

B 3.05, 5.60, 1.50, 4.45, 3.15 17.75

C 2.10, 4.85, 4.05, 4.75, 4.60 20.35

D 8.45, 6.85, 8.45, 4.25, 7.70 35.70

84.75

Test whether the differences in the gain in weight due to feeds were
due to chance or can be regarded as statistically significant.

Exercise 4: The following table gives the quality ratings of ten service
stations by five professional raters:

SERVICE STATION

RATER 1 2 3 4 5 6 7 8 9 10

A 99 70 90 99 65 85 75 70 85 92

B 96 65 80 95 70 88 70 51 84 91

C 95 60 48 87 48 75 71 93 80 93

D 98 65 70 95 67 82 73 94 86 80

E 97 65 62 99 60 80 76 92 90 89

Exercise 5: An experiment was conducted to determine the effects of
different dates of planting and different methods of planting on the
yield of sugarcane. The data below show the yields of sugarcane for 4
different dates and 3 methods of planting:

Method of      Date of planting
planting       October    November    February    March

I              7.10       3.69        4.70        1.90

II             10.29      4.79        4.58        2.64

III            8.30       3.58        4.90        1.80

Carry out an analysis of variance for the above data.

------------------0---------------------
