
www.analyttica.com

Introduction to
ANOVA

© Analyttica Datalab Inc., 2016. All rights reserved.


https://leaps.analyttica.com

Table of Contents

What is ANOVA?

Assumptions of ANOVA

One-Way ANOVA

Two-Way ANOVA

Advantages and Limitations of ANOVA


What is ANOVA?
Analysis of Variance (ANOVA) is a statistical method used to test differences between
two or more means by analysing the variations in observations between and within
different groups. It was developed by Ronald Fisher.

ANOVA is based on the principle of partitioning variance: the total observed variation is
split into two components, namely the variation between the groups and the variation
within the groups. Comparing these two components lets us test statistically whether the
group means differ significantly. This process is explained later in this document.

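ANOVA designs rest on three principles of experimental design: randomization, replication
and local control (blocking). The two examples below illustrate them: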

1) Randomization: Suppose a hospital is analysing the effect of three drugs, Drug A, Drug
B and Drug C, which are to be tested on (say) 10 patients. Randomization means that
each patient is equally likely to receive any of the three drugs, so there is no
pre-experiment bias in how the drugs are administered. One-way ANOVA corresponds to a
design built on the randomization principle, also called a completely randomized design
(CRD). In this example, patients 1, 3, 4 and 7 may receive Drug A, patients 2, 5 and 9 may
receive Drug B, and the remaining patients may receive Drug C.

2) Replication: Suppose a farmer wants to test the effectiveness of three fertilizers on his
crops, Fertilizer A, Fertilizer B and Fertilizer C. He also wants to consider the effect of
soil type on his crops. Suppose his plot of land has five different types of soil, and he
wants to plant 15 crops in total. The ideal design in such a case is to divide the land
into five groups (called blocks), one per soil type, so that the soil is homogeneous within
each block and differs between blocks, with three crops in each block. Within each block,
he then applies the three fertilizers in a random manner, making sure that all three
fertilizers are used in every block. Such a design is called a randomized block design
(RBD) or two-way ANOVA. It respects both the randomization principle, since the treatments
are applied randomly within each block, and the replication principle, since the same
treatments are repeated in every block.
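
The two designs above can be sketched in a few lines of Python. This is only an illustration: the function names, the soil labels and the fixed seed are assumptions made for the example, not part of the original material.

```python
import random

def completely_randomized(units, treatments, rng):
    """CRD: shuffle the units, then deal them out to the treatments in turn."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    return {t: sorted(shuffled[i::len(treatments)]) for i, t in enumerate(treatments)}

def randomized_block(blocks, treatments, rng):
    """RBD: every block receives all treatments once, in a random order."""
    return {block: rng.sample(treatments, len(treatments)) for block in blocks}

rng = random.Random(2016)  # fixed seed only so that the sketch is reproducible

drugs = ["Drug A", "Drug B", "Drug C"]
crd = completely_randomized(range(1, 11), drugs, rng)  # patients 1..10

fertilizers = ["Fertilizer A", "Fertilizer B", "Fertilizer C"]
soils = ["Soil 1", "Soil 2", "Soil 3", "Soil 4", "Soil 5"]  # hypothetical labels
rbd = randomized_block(soils, fertilizers, rng)

for drug, patients in crd.items():
    print(drug, patients)
for soil, order in rbd.items():
    print(soil, order)
```

With 10 patients and three drugs the group sizes come out as 4, 3 and 3, as in the example above, while every soil block receives each fertilizer exactly once.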

Figure 1: Sample Design of RBD


Assumptions of ANOVA
While carrying out ANOVA, one must keep in mind the following assumptions:

1) Normality: The observations in each group are drawn from a normally distributed
population.

2) Independence: All sample units are independent of each other.

3) Homoscedasticity: The variance is the same across the different groups.

4) Continuity: The dependent variable must be a continuous numerical variable.

These assumptions are needed to derive the ANOVA results theoretically. In practice,
however, few real-life datasets satisfy all of them exactly, and in such cases the user may
choose to carry out ANOVA anyway, keeping the violations in mind when interpreting the
results.
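
As an informal first look at the homoscedasticity assumption, one can compare the sample variances of the groups. The sketch below uses made-up scores and a hypothetical helper name. It is not a formal test: it only reports the ratio of the largest to the smallest group variance, where a ratio far above 1 suggests unequal variances.

```python
from statistics import variance

def variance_ratio(groups):
    """Return the sample variance of each group and the largest/smallest ratio."""
    variances = {name: variance(values) for name, values in groups.items()}
    ratio = max(variances.values()) / min(variances.values())
    return variances, ratio

# Hypothetical data, for illustration only
scores = {
    "Group A": [62, 70, 65, 71],
    "Group B": [75, 80, 78, 83],
    "Group C": [68, 72, 66, 70],
}

variances, ratio = variance_ratio(scores)
print(variances)
print("largest / smallest variance:", round(ratio, 2))
```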

One-Way ANOVA
One-way ANOVA involves one independent categorical variable and one dependent
continuous variable. This technique is used to analyse whether the different categories of
the independent variable differ significantly, based on the differences in the mean value
of the dependent variable for each category. It involves dividing the total variation in the
dependent variable into the explained variation and the unexplained variation.
The explained variation is due to the application of the different treatments. The
unexplained variation is the variation that cannot be attributed to the treatments. It may
be due to experimental error or sampling error.

Suppose the independent variable has k classes, and y represents the dependent variable.
Consider the following notations:

$y_{ij}$ = the $j$th unit in the $i$th class, where $i = 1, 2, \dots, k$ and $j = 1, 2, \dots, n_i$

$n$ = total number of sample units

$n_i$ = number of units in the $i$th class, with $\sum_{i=1}^{k} n_i = n$

$\bar{y}_{i.}$ = mean of the $i$th class

$\bar{y}_{..}$ = overall mean

Then, the total sum of squares is defined as,

$$\text{Total sum of squares} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{..})^2$$


The total sum of squares, after simplification, can be written as follows,

$$\sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{..})^2 = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (\bar{y}_{i.} - \bar{y}_{..})^2 + \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2$$

The first term on the right hand side is called the Treatment Sum of Squares, or the
Between Sum of squares, and the second term on the right hand side is called the Error
Sum of Squares or the Within Sum of Squares.
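
As a sketch of the simplification step behind this decomposition, adding and subtracting the class mean $\bar{y}_{i.}$ inside the square and expanding gives

$$
\begin{aligned}
\sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{..})^2
&= \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left[ (\bar{y}_{i.} - \bar{y}_{..}) + (y_{ij} - \bar{y}_{i.}) \right]^2 \\
&= \sum_{i=1}^{k} \sum_{j=1}^{n_i} (\bar{y}_{i.} - \bar{y}_{..})^2
 + \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2
 + 2 \sum_{i=1}^{k} (\bar{y}_{i.} - \bar{y}_{..}) \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})
\end{aligned}
$$

The last (cross) term vanishes because $\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.}) = 0$ for every class $i$, which leaves only the Treatment and Error sums of squares.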

Using these values, our objective is to test the null hypothesis $H_0$ vs the alternative
hypothesis $H_1$, where,

$H_0$: All categories have equal mean

$H_1$: Not all categories have equal mean

We use the F test to test the hypothesis, where the test statistic is defined as

$$F_{\text{test}} = \frac{\text{Variance between Treatments}}{\text{Variance within Treatments}} = \frac{SS_{\text{Treatment}} / (k - 1)}{SS_{\text{Error}} / (n - k)}$$
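Under the null hypothesis, $F_{\text{test}}$ is expected to be close to 1, and in that case we
fail to reject the null hypothesis. As the value increases above 1, the evidence against the
null hypothesis increases; formally, $F_{\text{test}}$ is compared with the F distribution with
$(k - 1, n - k)$ degrees of freedom.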
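
The quantities defined so far can be computed directly from their formulas. The following is a minimal sketch using only the Python standard library. The scores are made up for illustration and the function name is an assumption, not something from the original material.

```python
from statistics import mean

def one_way_anova(groups):
    """Sums of squares and F ratio for a list of groups (each a list of numbers)."""
    k = len(groups)                       # number of classes
    n = sum(len(g) for g in groups)       # total sample units
    grand_mean = mean([y for g in groups for y in g])
    class_means = [mean(g) for g in groups]

    ss_total = sum((y - grand_mean) ** 2 for g in groups for y in g)
    ss_treatment = sum(len(g) * (m - grand_mean) ** 2
                       for g, m in zip(groups, class_means))
    ss_error = sum((y - m) ** 2
                   for g, m in zip(groups, class_means) for y in g)

    f_test = (ss_treatment / (k - 1)) / (ss_error / (n - k))
    return {"ss_total": ss_total, "ss_treatment": ss_treatment,
            "ss_error": ss_error, "f_test": f_test}

# Hypothetical scores for three groups
groups = [[62, 70, 65, 71], [75, 80, 78, 83], [68, 72, 66, 70]]
result = one_way_anova(groups)

# The decomposition: total = treatment + error (up to rounding)
assert abs(result["ss_total"] - (result["ss_treatment"] + result["ss_error"])) < 1e-9
print(result)
```

For these made-up numbers the treatment sum of squares is about 330.67 on 2 degrees of freedom and the error sum of squares is 108 on 9 degrees of freedom, giving an F ratio of about 13.78.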
As an example, we consider a data case where we want to find whether the final score in
a Mathematics exam is significantly different among different study groups.
Our null hypothesis and alternative hypothesis are as follows:

$H_0$: Maths score does not differ significantly across study groups

$H_1$: Maths score differs significantly across study groups

Here, our independent variable is ‘Study Group’ and the dependent variable is ‘Maths
Score’. We obtain the final ANOVA table:


Figure 2: Summary Table for One-Way ANOVA

As we see from the table, the F value is high, which indicates that we should reject the
null hypothesis.

Another way to test the hypothesis is through the p-value. The p-value is the probability,
computed assuming the null hypothesis is true, of obtaining a test statistic at least as
extreme as the one observed. Usually, we use a level of significance of 5%. This means
that, if the p-value is less than 5%, we reject the null hypothesis. Otherwise, we fail to
reject the null hypothesis. However, the desired level of significance can change
depending on our requirement, and correspondingly, our inference and decision to reject
or not reject the null hypothesis will change.
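
To make the meaning of the p-value concrete, the sketch below approximates it by simulation: it generates many datasets in which the null hypothesis holds (all groups drawn from the same normal distribution) and counts how often the simulated F ratio is at least as large as the observed one. This is an illustrative assumption, not the method used to produce the tables in this document. Statistical software normally reads the p-value from the F distribution directly. The data, function names and seed are made up.

```python
import random
from statistics import mean

def f_ratio(groups):
    """One-way ANOVA F ratio for a list of groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean([y for g in groups for y in g])
    ss_treatment = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_error = sum((y - mean(g)) ** 2 for g in groups for y in g)
    return (ss_treatment / (k - 1)) / (ss_error / (n - k))

def simulated_p_value(groups, n_sim=2000, seed=0):
    """Share of null-hypothesis datasets whose F ratio is >= the observed one."""
    rng = random.Random(seed)
    observed = f_ratio(groups)
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_sim):
        # Under H0 every group comes from the same normal population; the F
        # ratio does not depend on that population's mean or variance.
        null_groups = [[rng.gauss(0.0, 1.0) for _ in range(size)] for size in sizes]
        if f_ratio(null_groups) >= observed:
            hits += 1
    return hits / n_sim

groups = [[62, 70, 65, 71], [75, 80, 78, 83], [68, 72, 66, 70]]  # hypothetical
p_value = simulated_p_value(groups)
print("reject H0 at 5%" if p_value < 0.05 else "fail to reject H0 at 5%")
```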

In the above data case, we see that the p-value is extremely low. So, we can safely reject
the null hypothesis at the 5% level of significance. Our final inference is that not all
study groups have the same mean test score in Mathematics. In other words, study
group has a significant effect on the Maths score.

Two-Way ANOVA
In two-way ANOVA, we have two independent categorical variables and a dependent
variable. Along with testing the equality of means across the levels of each categorical
variable, two-way ANOVA also helps in testing the significance of the interaction between
the two independent variables in their effect on the dependent variable.
The calculations and testing of hypotheses are similar to one-way ANOVA. The only
difference is that we have the additional terms for the sum of squares of the second
categorical variable, along with the interaction sum of squares.
If we have two categorical variables A and B, then the total sum of squares can be written
as,

$$SS_{\text{Total}} = SS_A + SS_B + SS_{AB} + SS_{\text{Error}}$$

For two-way ANOVA, we have the following null hypotheses:


$H_{01}$: The levels (or categories) of variable A do not differ significantly

$H_{02}$: The levels of variable B do not differ significantly

$H_{03}$: There is no significant interaction effect between variables A and B

The alternative hypotheses are the negations of the corresponding null hypotheses.
Just like in one-way ANOVA, we use the F test to test each of the hypotheses. In each
case, if the test statistic value is close to 1, we fail to reject the corresponding null
hypothesis. As it rises above 1, the evidence against that null hypothesis increases.
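
As a sketch of how the three test statistics are formed, the function below divides each mean square by the error mean square. It assumes a balanced fixed-effects design with a levels of A, b levels of B and r replicates per cell. The sums of squares passed in are made-up numbers and the function name is hypothetical.

```python
def two_way_f_ratios(ss_a, ss_b, ss_ab, ss_error, a, b, r):
    """F ratios for factor A, factor B and the A x B interaction."""
    df = {
        "A": a - 1,
        "B": b - 1,
        "AB": (a - 1) * (b - 1),
        "Error": a * b * (r - 1),
    }
    ms_error = ss_error / df["Error"]  # error mean square, the common denominator
    return {
        "A": (ss_a / df["A"]) / ms_error,
        "B": (ss_b / df["B"]) / ms_error,
        "AB": (ss_ab / df["AB"]) / ms_error,
    }

# Hypothetical sums of squares for a 3 x 2 design with 2 replicates per cell
f_values = two_way_f_ratios(ss_a=240.0, ss_b=90.0, ss_ab=12.0, ss_error=60.0,
                            a=3, b=2, r=2)
print(f_values)
```

Each ratio is then compared with the F distribution whose numerator degrees of freedom belong to the effect being tested and whose denominator degrees of freedom belong to the error term.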
In one-way ANOVA, we had studied the effect of Study Group on the Maths score. Let us
introduce another categorical variable: Test Preparation. We want to test whether the
Maths score varies significantly across different study groups and different levels of test
preparation, and we also want to find out the significance of the interaction effect of
study group and test preparation on Maths score.

The null hypotheses are:

$H_{01}$: Maths score does not differ significantly across study groups

$H_{02}$: Maths score does not differ significantly across test preparation levels

$H_{03}$: There is no significant interaction effect between study group and test
preparation on Maths score

The corresponding alternative hypotheses are:

$H_{11}$: Maths score differs significantly across study groups

$H_{12}$: Maths score differs significantly across test preparation levels

$H_{13}$: There is significant interaction effect between study group and test
preparation on Maths score

We obtain the following ANOVA table:

Figure 3: Summary Table for Two-Way ANOVA


As we see, the p-values for both Study Group and Test Preparation are very small,
indicating that the Maths score differs significantly between different Study Groups and
between different Test Preparation levels.

Advantages and Limitations of ANOVA


Advantages

1) ANOVA is fairly robust to moderate violations of its assumptions.

2) ANOVA facilitates testing of differences among multiple means in a single test, without
inflating the Type I error rate as running many separate pairwise tests would.

3) Two-way ANOVA looks at the interaction between factors and reduces unexplained
random variability. It also provides a mechanism to look at the effect of the second
variable after controlling for the first variable.
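
The point about the Type I error rate can be made concrete with a short calculation. Comparing k groups pairwise needs k(k - 1)/2 tests. If each test is run at level α and the tests are treated as independent (a simplifying assumption made here for illustration), the chance of at least one false rejection is 1 - (1 - α)^m for m tests. The function name below is hypothetical.

```python
from math import comb

def familywise_error_rate(k, alpha=0.05):
    """Number of pairwise tests among k groups and the chance of at least one
    false rejection, treating the tests as independent."""
    m = comb(k, 2)
    return m, 1 - (1 - alpha) ** m

for k in (3, 5):
    m, rate = familywise_error_rate(k)
    print(f"{k} groups: {m} pairwise tests, error rate about {rate:.3f}")
```

Under this approximation, three groups already give an error rate of about 0.143 and five groups about 0.401, compared with 0.05 for a single ANOVA F test.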

Limitations

1) It requires that the population distributions are normal, and it assumes equal variances
across groups, which may not hold in practice.

2) A one-way ANOVA can show that not all group means are equal; however, it does not
show which groups differ. If $H_0$ is rejected, to find out exactly which groups differ in
their means, you need to run a post-hoc procedure such as Fisher’s LSD or pairwise
t-tests.
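
A rough sketch of the pairwise follow-up is given below. In the style of Fisher's LSD, each pair of group means is compared using the pooled error mean square from the ANOVA as the variance estimate. The data and function name are made up. Each t statistic would then be compared with the t distribution on n - k degrees of freedom, a step not shown here.

```python
from itertools import combinations
from math import sqrt
from statistics import mean

def pairwise_t_statistics(groups):
    """t statistic for every pair of groups, using the pooled error mean square."""
    n = sum(len(g) for g in groups.values())
    k = len(groups)
    ss_error = sum((y - mean(g)) ** 2 for g in groups.values() for y in g)
    ms_error = ss_error / (n - k)
    result = {}
    for (name_1, g1), (name_2, g2) in combinations(groups.items(), 2):
        standard_error = sqrt(ms_error * (1 / len(g1) + 1 / len(g2)))
        result[(name_1, name_2)] = (mean(g1) - mean(g2)) / standard_error
    return result

# Hypothetical scores for three groups
scores = {
    "Group A": [62, 70, 65, 71],
    "Group B": [75, 80, 78, 83],
    "Group C": [68, 72, 66, 70],
}
t_stats = pairwise_t_statistics(scores)
for pair, t in t_stats.items():
    print(pair, round(t, 3))
```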


Write to us at

support@analyttica.com

USA Address
Analyttica Datalab Inc.
1007 N. Orange St, Floor-4,
Wilmington, Delaware - 19801
Tel: +1 917 300 3289/3325

India Address
Analyttica Datalab Pvt. Ltd.
702, Brigade IRV Centre,2nd Main Rd,
Nallurhalli,
Whitefield, Bengaluru - 560066.
Tel : +91 80 4650 7300
