
Chapter 2

Analysis of Variance: Testing Equality of Means across groups

2.1 Two-sample test
    2.1.1 Sources of Variability
2.2 One-way ANOVA
    2.2.1 One-way ANOVA Test Statistic
    2.2.2 One-way ANOVA Table
2.3 Two-Way ANOVA


2.1 Two-sample test

In a two-sample t-test, we compare two populations/groups on a quantitative variable, e.g., comparing average sales in the beverage category at Starbucks outlets located in residential areas of two US cities: Dallas and Houston. Equal variances can reasonably be assumed here, as the beverage-drinking behavior of the populations in those two cities is considered to be homogeneous.
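To test $H_0 : \mu_1 = \mu_2$, we use the test statistic
\[
t = \frac{\bar{X}_1 - \bar{X}_2}{SD(\bar{X}_1 - \bar{X}_2)},
\]
where the estimated $SD(\bar{X}_1 - \bar{X}_2)$ is
\[
s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}
\quad \text{with} \quad
s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}
\quad \text{(equal variances)},
\qquad \text{or} \qquad
\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}
\quad \text{(unequal variances)},
\]
provided the data distribution is normal and the population standard deviations are unknown.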
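As a minimal sketch of the equal-variance case, the pooled statistic can be computed directly from two samples. The function name `pooled_t_statistic` and the sales figures below are hypothetical, made up purely for illustration; they are not data from these notes.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(x1, x2):
    """Two-sample t statistic assuming equal population variances.

    Returns the statistic and its degrees of freedom (n1 + n2 - 2).
    """
    n1, n2 = len(x1), len(x2)
    s1_sq, s2_sq = variance(x1), variance(x2)  # sample variances (divisor n - 1)
    # Pooled variance: weighted average of the two sample variances
    sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
    # Estimated SD of the difference in sample means
    se = math.sqrt(sp_sq) * math.sqrt(1 / n1 + 1 / n2)
    return (mean(x1) - mean(x2)) / se, n1 + n2 - 2


# Hypothetical beverage sales for two cities (illustrative numbers only)
dallas = [5.1, 4.8, 5.6, 5.0, 4.5]
houston = [4.2, 4.9, 4.4, 4.0, 4.5]

t_stat, df = pooled_t_statistic(dallas, houston)
print(f"t = {t_stat:.3f} on {df} degrees of freedom")
```

The resulting value would then be compared with a t distribution on $n_1 + n_2 - 2$ degrees of freedom.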

Let us now look at the following situations:


1. Is the average CGPA (year 1) in sections A, B and C the same in IIMV?
2. Is the average night sleeping time the same for PGP, FPM and PGPEx?

If we simply compare two groups at a time, we will have to carry out more than one t-test.
Carrying out multiple tests inflates the Type I error probability. So, we want to carry out an

12 DS 1 Lecture Notes

overall test to see whether there are differences among the groups so as to control the overall
Type I error probability. The overall test that we carry out is called the Analysis of Variance (ANOVA).
Here we compare a quantitative variable across several (> 2) groups at the same time. The
grouping (categorical) variable is typically referred to as a factor or treatment.
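
To get a feel for how quickly the Type I error probability inflates, the sketch below computes the chance of at least one false rejection when all pairs among k groups are tested at level 0.05. It relies on the simplifying assumption that the tests are independent, which pairwise t-tests sharing a group do not satisfy exactly, so the numbers are only indicative. The function name `familywise_error` is hypothetical.

```python
from math import comb


def familywise_error(k, alpha=0.05):
    """Number of pairwise tests among k groups and the probability of at
    least one false rejection, ASSUMING the tests are independent."""
    m = comb(k, 2)  # number of distinct pairs of groups
    return m, 1 - (1 - alpha) ** m


for k in (3, 4, 5):
    m, fwer = familywise_error(k)
    print(f"k = {k}: {m} pairwise tests, overall Type I error about {fwer:.3f}")
```

Under this assumption, three groups already push the overall error probability to roughly 0.14 instead of the nominal 0.05.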

Example: Is the amount of time spent sleeping (per day) related to the CGPA grade (A, B, or C)
of IIMV students? We can start by visualizing the data using side-by-side boxplots. Suppose
the following side-by-side boxplot displays the time that 30 randomly selected IIMV
students spend sleeping per day, categorized by their CGPA grade (A, B, or C).

FIGURE 2.1

Questions:

• Does the amount of time spent sleeping appear to be related to CGPA?

• How can we prove or disprove such a claim in such a way that the overall Type I error is not inflated?

2.1.1 Sources of Variability

Example: Suppose that the sales of beverages at the Moonbucks outlet in Lucknow airport
last Monday were Rs. 20000 and those at Varanasi airport were Rs. 15000. The difference
between the sales at these two outlets is Rs. 5000. What is a major “explanation” for this
difference between the two sales figures? Is this factor the ONLY reason why the two sales
figures differ? Will the difference between the sales at Lucknow and Vizag also be Rs. 5000?

Thus, we can say that:

Total difference (variability) in sales of beverages
    = Difference (variability) due to airport category
    + Difference (variability) due to other sources

Suppose we want to test whether there is any significant difference in the average sales
of beverages at Moonbucks outlets located at the Varanasi, Lucknow and Vizag airports. So, we test
H0 : µ1 = µ2 = µ3 against H1 : the means differ for at least two of the airports. We
carry out the test using ANOVA.

2.2 One-way ANOVA

Suppose we have k (≥ 2) independent populations (groups) of interest. We want to test:

H0 : µ1 = µ2 = · · · = µk vs. H1 : at least two µ’s differ.

For inference, we draw a sample from each population; the combined sample size (over all
groups/populations) is n. We then use one-way ANOVA, utilizing the fact that

Total Variability in Data = Between-Group Variability + Within-Group (Error) Variability.

Assumptions of ANOVA

• Data/Variable of interest is quantitative (numeric).



• Parent populations are normally distributed with unknown means µ1 , µ2 , . . . , µk .

• All the populations have the same unknown variance σ².

ANOVA is thus a generalization of the two-sample t-test. If H0 is true, we expect the
Between-Group Variability to be small. As the total variability is fixed, we compare the size of
the Between-Group Variance relative to the size of the Within-Group Variance.

Suppose samples of sizes $n_1, n_2, \dots, n_k$ are drawn from the k populations such that

Sample 1 of size $n_1$: $x_{11}, x_{12}, \dots, x_{1n_1}$    Mean = $\bar{x}_1$
Sample 2 of size $n_2$: $x_{21}, x_{22}, \dots, x_{2n_2}$    Mean = $\bar{x}_2$
...
Sample k of size $n_k$: $x_{k1}, x_{k2}, \dots, x_{kn_k}$    Mean = $\bar{x}_k$

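Denote the overall mean by $\bar{x}$ and let $n = \sum_{i=1}^{k} n_i$. Then
\[
SS(\text{Total}) = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2,
\]
\[
SS(P) = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2,
\]
\[
SSE = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 .
\]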
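
One way to see why the total variability splits into exactly these two pieces is to add and subtract the group mean inside SS(Total). The short derivation below is a sketch in the notation above:

```latex
\begin{align*}
SS(\text{Total})
  &= \sum_{i=1}^{k} \sum_{j=1}^{n_i}
     \bigl[ (x_{ij} - \bar{x}_i) + (\bar{x}_i - \bar{x}) \bigr]^2 \\
  &= \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2
     + \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2
     + 2 \sum_{i=1}^{k} (\bar{x}_i - \bar{x}) \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i) \\
  &= SSE + SS(P).
\end{align*}
```

The cross term vanishes because the deviations from a group's own mean sum to zero, i.e. $\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i) = 0$ for every group $i$.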

2.2.1 One-way ANOVA Test Statistic

The test statistic for a one-way ANOVA test is:


\[
F = \frac{\text{between-group variance}}{\text{within-group variance}}
  = \frac{MS(P)}{MSE} \sim F_{k-1,\,n-k} \quad \text{under } H_0,
\]
where

• MS(P) is the Between-Population (or Group) Mean Square (Variance): variability between populations/groups, based on the distances of the population/group means from the overall mean.

• MSE is the Within-Population (or Error) Mean Square: variability within populations/groups, based on the squared deviations of the observations from their respective group averages, which are then pooled together [it is a pooled MSE].

$F_{k-1,\,n-k}$ denotes the F-distribution with $k-1$ and $n-k$ degrees of freedom. It is a one-sided F
test, hence P-value $= P(F_{k-1,\,n-k} \ge \text{calculated } F)$.

2.2.2 One-way ANOVA Table

The analysis of variance test is summarized in an ANOVA table:


Source     d.f.     SS          MS                       F                p-value
Between    k − 1    SS(P)       MS(P) = SS(P)/(k − 1)    F = MS(P)/MSE    P(F_{k−1, n−k} ≥ F)
Within     n − k    SSE         MSE = SSE/(n − k)
Total      n − 1    SS(Total)

Things to remember:

• If H0 is rejected, we conclude that the means of at least two populations differ.

• To determine which populations are different, various follow-up procedures called post-hoc analyses can be used, e.g., pairwise comparisons, comparisons with a control group only, etc. They control for the overall Type I error rate.

• We call it “one-way” because the k populations differ with respect to a single “categorical” feature.

Example 1: Data on average sales of beverages (in Rs. 1000) at Moonbucks outlets located
at the Jaipur, Lucknow and Vizag airports, for 3-hour time periods from 4:30 AM to 10:30 PM on
Mondays in Q1 of 2019, are given below:
Jaipur Lucknow Vizag
3.1 4.2 3.3
2.5 2.5 2.6
2.2 1.7 1.7
1.5 3.5 3.9
0.7 1.2 2.8
2.4 3.1 3.5
Is there evidence that the average Moonbucks beverage sales at these three airports are different
at the 5% level of significance? Proceed using Excel:

• In Excel, go to Data → Data Analysis. Choose “Anova: Single Factor”.

• In Input Range, provide the full data range: A1:C7. Check the box: Labels in First Row.

Let us now look at the corresponding Anova Table.
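
The entries of this table can also be reproduced outside Excel. The following is a minimal sketch in plain Python using the Example 1 data; the function name `one_way_anova` and the dictionary keys are hypothetical choices for illustration. The p-value line uses a closed form for the upper tail of the F distribution that holds only when the numerator degrees of freedom equal 2 (here k = 3); for general k, an F-distribution routine from a statistics library would be needed.

```python
from statistics import mean

# Sales (in Rs. 1000) from Example 1, one list per airport
sales = {
    "Jaipur":  [3.1, 2.5, 2.2, 1.5, 0.7, 2.4],
    "Lucknow": [4.2, 2.5, 1.7, 3.5, 1.2, 3.1],
    "Vizag":   [3.3, 2.6, 1.7, 3.9, 2.8, 3.5],
}


def one_way_anova(groups):
    """Return the entries of a one-way ANOVA table for a list of samples."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [mean(g) for g in groups]
    # Between-group sum of squares: group means around the overall mean
    ss_p = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group (error) sum of squares: observations around their group mean
    sse = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    ms_p = ss_p / (k - 1)
    mse = sse / (n - k)
    return {
        "df_between": k - 1, "df_within": n - k,
        "SS(P)": ss_p, "SSE": sse,
        "MS(P)": ms_p, "MSE": mse, "F": ms_p / mse,
    }


table = one_way_anova(list(sales.values()))

# Upper-tail probability of F with (2, d) degrees of freedom:
# P(F >= f) = (1 + 2f/d)^(-d/2). Valid ONLY because k - 1 = 2 here.
d = table["df_within"]
p_value = (1 + 2 * table["F"] / d) ** (-d / 2)

for name, value in table.items():
    print(f"{name:>10}: {value:.4f}")
print(f"{'p-value':>10}: {p_value:.4f}")
```

For these data the sketch should give F of about 1.49 with a p-value of about 0.26, so H0 would not be rejected at the 5% level.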

2.3 Two-Way ANOVA


Consider the “Beverage Sales” example. We considered only one factor, “Airport”, to explain the
variability in beverage sales. However, there could be one or more confounding factors, i.e., factors
that have an effect on the response but are not considered in the model, e.g., the 3-hour time periods. If we
want to control for a confounding factor, we have to use Two-Way ANOVA. This involves two
factors: the factor of interest (Airport) and a confounding variable (e.g., 3-hour time periods).
We therefore treat the 3-hour time periods as blocks. The general layout is:

                   Population 1   Population 2   ...   Population K   Block Mean
Block 1            x11            x21            ...   xK1            ȳ1
Block 2            x12            x22            ...   xK2            ȳ2
...                ...            ...            ...   ...            ...
Block B            x1B            x2B            ...   xKB            ȳB
Population Means   x̄1             x̄2             ...   x̄K             x̄

Here xij denotes the observation for population i in block j.
So, in our model, we now account for two specific sources of variability: Population and Block.
The remaining variation, due to factors not accounted for in the model, is denoted as Error, as before.
As in one-way ANOVA, the degrees of freedom add up here as well (note: n = KB): n − 1 = (K − 1) + (B − 1) + (K − 1)(B − 1).
The following decomposition of the Sum of Squares (SS) holds:

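Total SS = Population SS + Block SS + Error SS, or SS(Total) = SS(P) + SS(B) + SSE, where
\[
SS(\text{Total}) = \sum_{i=1}^{K} \sum_{j=1}^{B} (x_{ij} - \bar{x})^2,
\]
\[
SS(P) = B \sum_{i=1}^{K} (\bar{x}_i - \bar{x})^2,
\]
\[
SS(B) = K \sum_{j=1}^{B} (\bar{y}_j - \bar{x})^2,
\]
\[
SSE = \sum_{i=1}^{K} \sum_{j=1}^{B} (x_{ij} - \bar{x}_i - \bar{y}_j + \bar{x})^2 .
\]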

As in one-way ANOVA, the test statistic for testing differences between treatments is
$F = MS(P)/MSE \sim F_{K-1,\,(B-1)(K-1)}$ under $H_0$, where $MS(P) = SS(P)/(K-1)$ and
$MSE = SSE/[(B-1)(K-1)]$. The results are summarized in an ANOVA table.
Example 1 (continued): Suppose the beverage sales data were obtained from the following 6 blocks:
Blocks                Jaipur   Lucknow   Vizag
4:30 AM - 7:30 AM     3.1      4.2       3.3
7:30 AM - 10:30 AM    2.5      2.5       2.6
10:30 AM - 1:30 PM    2.2      1.7       1.7
1:30 PM - 4:30 PM     1.5      3.5       3.9
4:30 PM - 7:30 PM     0.7      1.2       2.8
7:30 PM - 10:30 PM    2.4      3.1       3.5
We proceed with the steps while using Excel:

• Under Data Analysis, choose “Anova: Two-Factor Without Replication”.


• In Input Range, provide the full data range A1:E7, and check “Labels” box.
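
As a cross-check on the Excel output, the two-way decomposition can be computed by hand. The sketch below is a minimal plain-Python version using the block data above; all variable names are hypothetical. As in the one-way case, the p-value line uses a closed form that is valid only because the numerator degrees of freedom for the airport factor equal 2.

```python
# Rows are the six 3-hour blocks; columns are Jaipur, Lucknow, Vizag
data = [
    [3.1, 4.2, 3.3],
    [2.5, 2.5, 2.6],
    [2.2, 1.7, 1.7],
    [1.5, 3.5, 3.9],
    [0.7, 1.2, 2.8],
    [2.4, 3.1, 3.5],
]

B, K = len(data), len(data[0])  # number of blocks, number of populations
grand_mean = sum(sum(row) for row in data) / (B * K)
pop_means = [sum(row[i] for row in data) / B for i in range(K)]  # column means
block_means = [sum(row) / K for row in data]                     # row means

ss_total = sum((x - grand_mean) ** 2 for row in data for x in row)
ss_p = B * sum((m - grand_mean) ** 2 for m in pop_means)
ss_b = K * sum((m - grand_mean) ** 2 for m in block_means)
sse = ss_total - ss_p - ss_b  # what remains after Population and Block

df_p, df_b, df_e = K - 1, B - 1, (K - 1) * (B - 1)
ms_p, ms_b, mse = ss_p / df_p, ss_b / df_b, sse / df_e
f_p = ms_p / mse  # test statistic for differences between airports
f_b = ms_b / mse  # corresponding ratio for the blocks

# P(F >= f) for (2, d) degrees of freedom; valid ONLY because df_p = 2 here
p_value_p = (1 + 2 * f_p / df_e) ** (-df_e / 2)

print(f"SS(P) = {ss_p:.4f}, SS(B) = {ss_b:.4f}, SSE = {sse:.4f}")
print(f"F (airports) = {f_p:.3f}, p-value = {p_value_p:.3f}")
print(f"F (blocks)   = {f_b:.3f}")
```

With these data the error mean square should drop from about 0.86 in the one-way analysis to about 0.46 once the blocks are accounted for, giving F of about 2.77 for airports with a p-value of about 0.11.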

Two-Way vs One-Way ANOVA: Examples

• Comparison of several routes for driving to work. Here the response variable is driving time. A
confounding factor may be the day of the week. One-way ANOVA treats all driving days for each
route as equivalent. Two-way ANOVA allows for differences due to the day of the week by treating
it as a block.

• Suppose an experiment compares the sales of different oil brands in Reliance Fresh stores in
different states in India. Here the response variable is ............., the “population” is ................, and
the block is ..............

• Comparing 3 methods of rounding first base in baseball: Round Out, Narrow Angle, and
Wide Angle. The design for one-way ANOVA assigns each method some runners, but the
runners may differ in their running skills. So, for a two-way ANOVA design, blocks can be
..............
