
Quantitative Methods 2

ECON 20003

WEEK 5
COMPARING SEVERAL POPULATION CENTRAL
LOCATIONS WITH ONE-WAY ANALYSIS OF
VARIANCE (ANOVA) BASED ON INDEPENDENT
SAMPLES AND RANDOMISED BLOCKS
Reference:
SSK: §15.1, 15.3-15.4, 20.3

Dr László Kónya
January 2020
ANALYSIS OF VARIANCE (ANOVA)

• In week 3 you learnt how to compare two population central locations
with parametric and nonparametric tests. Analysis of variance, ANOVA in
brief, is an extension of these tests to several (k ≥ 2) populations.

ANOVA: a class of statistical procedures used to divide the total
variation present in a set of data into several components,
and to measure the contribution of each of the possible
sources of variation to the total variation, in order to find out
whether several populations have the same central location.

When the population means exist, the aim is to test the
composite null hypothesis,
H0 : μ1 = μ2 = … = μk vs. HA : not all μi's are equal.

This is a relatively 'weak' null hypothesis: even if it is
rejected, we still do not know which particular means differ.

Given these hypotheses, analysis of variance may appear a strange name
but, as you will see, it is indeed based on comparisons of variances.
L. Kónya, 2020 UoM, ECON 20003, Week 5 2
• One might think of testing H0 with a series of t-tests that compare all
possible pairs of means, but this is not a good idea for two reasons:

i. Since there are K = k(k − 1) / 2 possible pairs, the number of t-tests
to be carried out increases rapidly with the number of populations
and K can be prohibitively large.
ii. If these t-tests are calculated independently of each other using
a single sample and the same significance level, say α, then the
probability of avoiding a Type I error is (1 − α) for each test, but it
is only (1 − α)^K for the whole set of K tests.

For example, if there are k = 10 (sub-) populations, one has to perform
K = (10 × 9) / 2 = 45 separate t-tests.
If each of these tests is performed at the 5% significance level, then the
probability of avoiding a Type I error (i.e. not rejecting a true H0) is 0.95
for each test, but it is only 0.95^45 ≈ 0.10 for the whole set of K = 45 tests.
In other words, the probability of a Type I error in any given test is 0.05,
while the probability of incorrectly rejecting H0 in at least one of the 45 tests
is about 0.90. This inflated error rate is clearly unacceptable in practice.
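The arithmetic above is easy to check. A minimal sketch (in Python; the deck itself uses R):

```python
# Probability of at least one Type I error across K independent 5% tests
k = 10                           # number of (sub-)populations
K = k * (k - 1) // 2             # number of pairwise t-tests
p_none = 0.95 ** K               # probability of no Type I error in any test
p_at_least_one = 1 - p_none      # probability of at least one false rejection

print(K, round(p_none, 3), round(p_at_least_one, 3))
```

With α = 0.05 and K = 45 this returns roughly 0.10 for avoiding all Type I errors and roughly 0.90 for making at least one.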



• In order to perform ANOVA, one has to draw an independent random
sample from each population and compare the sample means to the
corresponding sample items, to each other and also to the overall or
grand mean.

For k = 3, this can be illustrated as follows:

[Figure: samples drawn from the three populations, showing the grand mean
(red dot), the three sample means (yellow dots) and the sample items (blue dots)]

Apparently, the three sample means (yellow dots) are far closer to the
corresponding sample items (blue dots) than to each other or to the grand
mean (red dot), and there is far smaller variability within each sample than
between the three sample means.

Hence, H0 is unlikely to be true.



Consider now the following scenario:

In this case the three sample means (yellow dots) do not seem to be closer
to the corresponding sample items (blue dots) than to each other or to the
grand mean (red dot), and variability within the samples seems to be no
smaller than the variability between the sample means.

Hence, H0 is probably true.

• ANOVA has its own “vocabulary”.

Treatment: a possible source of variation in the set of data
which is under the experimenter's control.
Experimental unit: an entity that receives the treatment.
Factor: a set of treatments of a single kind; it defines the
(sub-) populations.
For example, a factor might be a set of fertilizers, a treatment can be a new
fertilizer, and the experimental units might be plots of ground.

Measurements are taken on the experimental units in order to obtain
observations for the variable of interest, called the response variable.

• There are several different ANOVAs depending on their experimental
designs. The differences are due to (i) the number of factors, (ii) the
source of measurements and (iii) the selection of sampled populations.

i. Number of factors: single-factor design (one-way ANOVA) versus
multifactor design (two-way, three-way etc. ANOVA).

ii. Source of measurements, i.e. whether the observations represent
different sets of experimental units or a similar / the same set of units:
completely randomised design versus randomised block design.
With a completely randomised design the experimental
units are assigned to the treatments completely at random,
while with a randomised block design the experimental units
are divided into blocks on the basis of some blocking
variable and then within each block they are randomly
assigned to the treatments.
A block is a group of experimental units that are identical, or at least
similar, with respect to all known sources of variability.

A special case of the randomised block design is the
repeated measures design, where each experimental unit is
assigned to all treatments in a random order and a block is
the collection of the measurements on a given unit.

Suppose, for example, that we intend to design an experiment to determine
whether there is any difference between three grades of petrol.
In the completely randomised design we could assign ten randomly
selected test cars to each grade and measure their fuel consumption.
In the randomised block design each block could be a given make and
model of cars, and three randomly selected cars from each block could run
on the three grades of petrol (one on each), or each randomly selected car
would run on each of the three grades of gasoline in a random order
(repeated measures design).

Note: The completely randomised and the randomised block designs are the
generalizations of the experiments based on two independent samples
and on matched pairs, respectively, to more than two populations.
iii. Depending on the way the sampled sub-populations are selected
in the experiment, an ANOVA model can be a
Fixed-effects model: all possible populations of interest are
included in the analysis;
Random-effects model: the sampled populations are chosen
randomly from all possible populations
of interest.

In the case of fixed effects the inferences are limited to the specific
populations that appear in the experiment, while in the case of
random effects the inferences can be generalised.
As regards the actual steps involved in the analysis, there is no
difference between fixed-effects and random-effects models,
but the results must be interpreted differently.

For example, if we compare the costs of living in the eight Australian state
capitals using analysis of variance, then the model is a fixed-effects ANOVA
model and the results are valid only for these cities.
However, if we randomly select 20 big cities from all across the world in
order to study the costs of living in major cities in general, then the model is
a random-effects ANOVA model and the results can be generalized.
ONE-WAY ANOVA: INDEPENDENT SAMPLES

• One-way ANOVA is the simplest version of ANOVA as it considers a
single factor, i.e. only one kind of treatment.
Its parametric version based on the completely randomised design,
i.e. on independent samples, is an extension of the two-independent-
sample Z or t test for the difference between two population means.

• One-way independent ANOVA is based on the following assumptions:

i. The data set constitutes k independent random samples drawn
from k (sub-) populations.
ii. Each (sub-) population is normally distributed,
Xj ~ N(μj ; σ)
iii. … and has the same variance, σ².

Under these assumptions the common variance σ² can be estimated
with the weighted average of the k sample variances, i.e. by the pooled
estimator, sp², which is an unbiased estimator of σ².
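As a numerical sketch (Python rather than R, with made-up samples), the pooled estimator is just the degrees-of-freedom-weighted average of the k sample variances, which equals SSE / (n − k):

```python
import numpy as np

# Hypothetical samples from k = 3 (sub-)populations
samples = [np.array([5.1, 4.8, 5.6, 5.0]),
           np.array([4.2, 4.9, 4.6]),
           np.array([5.5, 5.2, 5.8, 5.4, 5.1])]

k = len(samples)
n = sum(len(s) for s in samples)

# sp^2: weighted average of the sample variances, weights (nj - 1)
sp2 = sum((len(s) - 1) * s.var(ddof=1) for s in samples) / (n - k)

# Equivalently, sp^2 = SSE / (n - k), the within-sample sum of squares
sse = sum(((s - s.mean()) ** 2).sum() for s in samples)
```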



If, in addition, H0 : μ1 = μ2 = … = μk is also true, then all sampled (sub-)
populations have the same distribution and the sample data comprise k
independent random samples drawn from the same population, i.e. from
N(μ ; σ).
If the sample sizes are equal, then the k sample means can be
considered as a simple random sample drawn from the same
sampling distribution of the sample mean, and the sample
variance of this sampling distribution provides an alternative
unbiased estimator, s02, of the population variance.

It can be proved that these two estimators of σ², i.e. sp² and s0², are
independent of each other and their ratio follows Fisher's F distribution,
granted that H0 is correct. Namely,

Fobs = s0² / sp² ~ F(k − 1, n − k)

If H0 is true, the two estimates are expected to be similar, so Fobs ≈ 1.
Otherwise, s0² is biased upward and Fobs > 1.
Reject H0 if Fobs is greater than the Fα,k-1,n-k critical value and
hence the p-value is smaller than α.
L. Kónya, 2020 UoM, ECON 20003, Week 5 10
The calculations are based on the following equality (see the formulas
for the grand mean on the next slide):

SS = SST + SSE

Total Sum of Squares, SS: overall variation
Sum of Squares for Treatments, SST: variation between samples
Sum of Squares for Error, SSE: variation within samples

The sample presents the strongest
possible support for H0 when all
sample means are equal to the
grand mean and thus
SST = MST = Fobs = 0.

Mean Squares: MST = SST / (k − 1), MSE = SSE / (n − k)

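The decomposition can be verified numerically. A minimal Python sketch with hypothetical data (scipy's f_oneway implements the same F-test as R's aov):

```python
import numpy as np
from scipy import stats

# Hypothetical independent samples from k = 3 populations
samples = [np.array([21.0, 23.5, 22.1, 24.0]),
           np.array([25.2, 26.1, 24.8, 25.5]),
           np.array([22.9, 23.8, 24.4, 23.1])]

k = len(samples)
n = sum(len(s) for s in samples)
grand = np.concatenate(samples).mean()

ss  = ((np.concatenate(samples) - grand) ** 2).sum()          # total variation
sst = sum(len(s) * (s.mean() - grand) ** 2 for s in samples)  # between samples
sse = sum(((s - s.mean()) ** 2).sum() for s in samples)       # within samples

mst, mse = sst / (k - 1), sse / (n - k)
f_obs = mst / mse
p_val = stats.f.sf(f_obs, k - 1, n - k)

# Cross-check against scipy's built-in one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*samples)
```

The manual F statistic and p-value agree with the library routine, and SS = SST + SSE holds exactly.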


Ex 1:
Suppose we want to compare the cholesterol contents (milligrams per package)
of k = 4 competing diet foods (A, B, C, D) on the basis of four independent
random samples of size nj = 3 each.
Each brand of diet food is a specific treatment and the packages are the
experimental units.

a) Are the differences between the sample means significant or can
they be attributed to chance? Use α = 0.05.

H0 : μ1 = μ2 = μ3 = μ4 versus HA : at least one μj (j = 1, 2, 3, 4) differs.

The analysis is based on a completely randomised design and, assuming that
there are no other diet foods, the proper setup is a fixed-effects ANOVA model.

Grand mean: the mean of all n observations pooled together; if all sample
sizes are the same, say m, it is the simple average of the k sample means.
F,df1,df2 = F0.05,3,8 = 4.07 Fobs = 2.25 < 4.07 = Fcrit, so H0 is maintained.
Consequently, we conclude at the 5% level that the differences among the
cholesterol contents of the four competing diet foods are insignificant.
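The critical value and the p-value of Fobs = 2.25 can be checked against any F table; for instance (Python sketch, using the degrees of freedom of this example):

```python
from scipy import stats

f_crit = stats.f.ppf(0.95, dfn=3, dfd=8)   # 5% critical value of F(3, 8)
p_val  = stats.f.sf(2.25, 3, 8)            # p-value of Fobs = 2.25
```

Since p_val exceeds 0.05, H0 is maintained, consistent with the table-based decision.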

This test can be reproduced in R with the
summary(aov(Cholesterol ~ as.factor(Food))) command, which returns

One of the critical assumptions behind the ANOVA F-test is that the sampled
(sub-) populations have the same variance (i.e. they are homoskedastic).
When we are uncertain about the validity of this assumption, it is better to
perform the Welch F-test, a generalization of the Welch t-test (see slides
13 and 17 of week 3) that does not require equal variances.

To run this test in R, execute the oneway.test(Cholesterol ~ Food) command,
which returns

Again, H0 is maintained.
Note: The ANOVA F-test is very reliant on the assumption of equal variances.
In addition, both the ANOVA F-test and the Welch F-test might lead to
incorrect conclusions if the populations are strongly non-normal. When in
doubt, it is better to use some nonparametric alternative to these tests.

• The Kruskal-Wallis test is a nonparametric counterpart of one-way
independent ANOVA and a generalization of the Wilcoxon rank-sum test
(also known as the Mann-Whitney U-test; week 3, slides #19-21).
It can be used to compare the central locations (medians) of
several populations when the data are ranked or quantitative but
not normal. The hypotheses are

H0 : the k population central locations are all the same
HA : at least two population central locations differ

This procedure assumes that
i. the data consist of independent random samples drawn from
ii. populations that differ at most with respect to their central locations
(i.e. medians), and
iii. the variable of interest is continuous and the measurement scale is
at least ordinal.
Like the Wilcoxon rank-sum test, the Kruskal-Wallis test is based on the
ranks in the pooled set of the k independent samples.
Combine the k independent samples of sizes n1, n2, …, nk,
and rank the observations from the smallest (1) to the largest
(n = Σ nj), averaging the ranks of tied observations.
Let Tj denote the sum of ranks assigned to the observations in
the j th sample.

The test statistic is

H = [12 / (n(n + 1))] Σ (Tj² / nj) − 3(n + 1)

If H0 is true, T1²/n1, T2²/n2, …, Tk²/nk are fairly similar and their sum is
relatively small. Hence, a 'large' H value indicates that H0 is probably
incorrect.
The sampling distribution of H is non-standard, but if H0 is true and each
sample size is sufficiently large (say, at least 5), H is approximately
chi-square distributed with k − 1 degrees of freedom.
Reject H0 if the observed test statistic exceeds
the small-sample or the chi-square critical value,
whichever is appropriate.
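The ranking procedure and the H statistic can be sketched in a few lines of Python (hypothetical, tie-free data; scipy's kruskal gives the same statistic when there are no ties):

```python
import numpy as np
from scipy import stats

# Hypothetical samples from k = 3 populations (no tied values)
samples = [np.array([1.2, 3.4, 5.1, 6.0, 8.2]),
           np.array([2.2, 4.0, 6.3, 7.1, 9.5]),
           np.array([0.5, 2.9, 7.7, 8.8, 10.1])]

pooled = np.concatenate(samples)
n = len(pooled)
ranks = stats.rankdata(pooled)        # would average the ranks of ties

# Split the ranks back into the k groups and sum them
sizes = [len(s) for s in samples]
groups = np.split(ranks, np.cumsum(sizes)[:-1])
T = [g.sum() for g in groups]         # rank sum per sample

H = 12 / (n * (n + 1)) * sum(t ** 2 / m for t, m in zip(T, sizes)) - 3 * (n + 1)

# Cross-check against scipy's Kruskal-Wallis test
H_scipy, p = stats.kruskal(*samples)
chi2_crit = stats.chi2.ppf(0.95, df=len(samples) - 1)
```

H is then compared with the chi-square critical value (here with k − 1 = 2 degrees of freedom).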
Note: For k = 2, we can apply either the Wilcoxon rank-sum test or the
Kruskal-Wallis test. However, while the Wilcoxon rank-sum test can be
used for one-sided and two-sided alternative hypotheses alike, the
Kruskal-Wallis test can only determine whether a significant difference
exists between the sample medians.

Ex 1 (cont.)
b) In part (a) we tacitly assumed that the necessary requirements are satisfied.
Since the sample sizes are equal but small, we cannot statistically verify
normality and the equality of variances. Given this uncertainty, let's perform
the Kruskal-Wallis test, first manually and then with R.

H0 : 1 = 2 = 3 = 4
HA : at least one i is different



The chi-square critical value is χ²α,k-1 = χ²0.05,3 = 7.81, larger than Hobs = 5.17.

However, each nj = 3 is smaller than 5, so the chi-square approximation might
be misleading. The more accurate 'small-sample' 5% critical value for k = 4
and nj = 3 (j = 1, 2, 3, 4) is 6.8974 (see the relevant table on LMS). It is smaller
than the chi-square critical value, but still larger than the observed test statistic
value (5.17).

We maintain H0 at the 5% level and conclude that the differences
among the cholesterol contents of the four competing diet foods are
insignificant.
It is reassuring that we drew the same conclusion from all three tests,
i.e. the ANOVA F-test, the Welch F-test and the Kruskal-Wallis test.

In R the Kruskal-Wallis test can be performed by executing the
kruskal.test(Cholesterol, Food) command. It produces the following printout:

H0 is maintained.
Note: The p-value reported by R is based on the asymptotically valid chi-
square distribution, even if the sample sizes are too small to justify the
chi-square approximation. It is our task to check whether the samples are
large enough.
In this case, for example, the p-value is somewhat inaccurate, but it still
leads to the same conclusion as the 'small-sample' KW critical value
(see the previous slide).



ONE-WAY ANOVA: RANDOMISED BLOCKS

• Parametric one-way ANOVA based on a randomised block design is the
multi-population equivalent of the matched-pairs Z or t test for the
difference between two population means.
The randomised block design can make it easier to detect differences
among the treatments by reducing the variation within them.

Suppose, for example, that a statistician wants to determine whether
incentive pay plans offered to employees are effective. To do so, he selects
three groups of five workers who assemble the same equipment, and offers
a different incentive plan to each group.
The treatments are the incentive plans, the response variable is the
production output, and the experimental units are the workers.

Since productivity likely depends on various characteristics of the workers,
such as age, gender or experience, the experiment can be made more efficient
by forming groups of workers with no or only small differences with respect
to these characteristics, i.e. by eliminating individual differences.
• One-way ANOVA using randomised block design (with k treatments and
b blocks) is based on four assumptions:
i. Each observed xij constitutes an independent random sample of
size 1 drawn from one of the k × b (sub-) populations considered.
ii. Each (sub-) population is normally distributed,
iii. … and has the same variance, σ².
iv. The block and treatment effects are additive.
There is supposed to be no interaction between blocks and
treatments, i.e. the effect of any given block-treatment
combination is exactly the same as the sum of their
individual effects.

• When the experimental design is the completely randomised design
(i.e. independent samples), the total sum of squares is decomposed into
two sources of variation, treatment and error (see slide #11).

For the randomised block design, it is divided into three components
('Treatment', 'Block', 'Error'):

SS = SST + SSB + SSE
Denoting the mean for treatment j and for block i as x̄T,j and x̄B,i,
respectively, these sums of squares are as follows.

Total Sum of Squares: SS = Σi Σj (xij − x̄)²

Sum of Squares for Treatment: SST = b Σj (x̄T,j − x̄)²

Sum of Squares for Blocks: SSB = k Σi (x̄B,i − x̄)²

Sum of Squares for Error: SSE = SS − SST − SSB


• Randomised block design allows us to perform two different tests, one
for the treatment means and another one for the block means.
In both cases, the hypotheses, test statistics and decision rules are
similar to the ones in independent-samples one-way ANOVA.

1) Testing treatment means

Under H0, F = MST / MSE ~ F(k − 1, (k − 1)(b − 1)),
where MST = SST / (k − 1) and MSE = SSE / ((k − 1)(b − 1)).

2) Testing block means

Under H0, F = MSB / MSE ~ F(b − 1, (k − 1)(b − 1)),
where MSB = SSB / (b − 1).

In both tests, reject H0 if Fobs > Fα,df1,df2.
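Both tests can be sketched numerically in a few lines (Python, hypothetical data; rows are blocks, columns are treatments):

```python
import numpy as np
from scipy import stats

# Hypothetical randomised-block data: b = 4 blocks x k = 3 treatments
x = np.array([[10.2, 11.5, 12.1],
              [ 9.8, 11.0, 11.7],
              [10.9, 12.2, 12.6],
              [ 9.5, 10.4, 11.2]])
b, k = x.shape
grand = x.mean()

sst = b * ((x.mean(axis=0) - grand) ** 2).sum()   # treatments (column means)
ssb = k * ((x.mean(axis=1) - grand) ** 2).sum()   # blocks (row means)
ss  = ((x - grand) ** 2).sum()                    # total
sse = ss - sst - ssb                              # error, by subtraction

mst = sst / (k - 1)
msb = ssb / (b - 1)
mse = sse / ((k - 1) * (b - 1))

F_treat = mst / mse
F_block = msb / mse
p_treat = stats.f.sf(F_treat, k - 1, (k - 1) * (b - 1))
p_block = stats.f.sf(F_block, b - 1, (k - 1) * (b - 1))
```

The error sum of squares obtained by subtraction equals the one computed directly from the residuals of the additive model, which is a useful sanity check.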


Ex 2: (Selvanathan, p. 645, ex. 15.42)
As an experiment to understand measurement error, a statistics professor asks
four students to measure the heights of the professor (PR), a male student
(MS), and a female student (FS). The differences (in centimetres) between the
correct heights and the heights measured by the students (Error) are listed
below.
Each student measured the
height of the same three people,
hence this experiment is based
on a repeated-measures design,
i.e. on a special case of the
randomised block design.
a) Can we infer that there are differences in the errors between the subjects
being measured? Use α = 0.05.
b) Can we infer that there are differences in the errors between the students
who obtained the measurements? Use α = 0.05.

Let’s consider the subjects being measured as the treatments (k = 3) and the
students who measure as the blocks (b = 4). To answer these questions we
need to test both the treatment means (a) and the block means (b).
The total sum of squares can be obtained from the overall sample variance,
SS = (n − 1)s², where n = kb.

The error sum of squares could be computed using the definitional formula,
but it is easier to obtain it from SSE = SS − SST − SSB.


i. Treatment means

F,df1,df2 = F0.05,2,6 = 5.14 > 5.13, so we cannot reject H0 at the 5% level.


Hence, the treatment means are only insignificantly different, i.e. the
differences of measurement errors between the subjects are
insignificant.

ii. Block means

F,df1,df2 = F0.05,3,6 = 4.76 < 23.18, so we can reject H0 at the 5% level.


Hence, the block means are significantly different, i.e. the students
differ in terms of measurement errors they make.
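Both critical values used above can be reproduced from the F distribution (a quick Python check):

```python
from scipy import stats

f_crit_treat = stats.f.ppf(0.95, dfn=2, dfd=6)   # 5% critical value of F(2, 6)
f_crit_block = stats.f.ppf(0.95, dfn=3, dfd=6)   # 5% critical value of F(3, 6)
```

These return approximately 5.14 and 4.76, matching the tabulated values on this slide.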
To run this test in R, execute the summary(aov(Error ~ as.factor(Student) +
as.factor(Subject))) command, which returns

At the 5% significance level H0 is maintained for the treatment (i.e. Subject)
means, but it is rejected for the block (i.e. Student) means. Notice, however,
that both null hypotheses could be rejected at the 5.2% level.

• The Friedman test is a nonparametric alternative to one-way ANOVA on
randomised blocks and a generalization of the Wilcoxon signed ranks
test for matched pairs (week 2, slides 24-25 and week 3, slide 4).
It can be used to compare the central locations of two or more
(sub-) populations when the data are ranked or quantitative but
not normal.
The hypotheses are the same as in the Kruskal-Wallis test, i.e.

H0 : the k population central locations are all the same
HA : at least two population central locations differ


However, the Friedman test is based on the ranks within each
block.
If there are no ties, the test statistic is

Fr = [12 / (bk(k + 1))] Σ Tj² − 3b(k + 1)

where b and k are the number of blocks and treatments, respectively,
and Tj is the sum of ranks for treatment j.
If there are ties, Fr has to be corrected for the number of ties; the
corrected test statistic is Frc = Fr / C, where C is a correction factor
that depends on ti, the number of tied scores in the i th block.
Note: The correction factor always satisfies 0 < C ≤ 1. Therefore, Frc ≥ Fr and it
can happen that Fr is below but Frc is above the critical value.
The test based on Frc has potentially more power against H0.
If H0 is true, T1, T2, …, Tk are fairly similar and the sum of their squares is
relatively small. Hence, a 'large' Fr (Frc) value indicates that H0 is
probably incorrect.
The sampling distribution of Fr (Frc) is non-standard, but if H0 is true and
k and/or b is sufficiently large (k > 6 and/or b > 24), Fr (Frc) is
approximately chi-square distributed with k − 1 degrees of freedom.

Reject H0 if the observed test statistic exceeds
the small-sample or the chi-square critical value,
whichever is appropriate.
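The within-block ranking and the Fr statistic can be sketched as follows (Python, hypothetical data with no ties in any block, so C = 1 and Frc = Fr; scipy's friedmanchisquare runs the same test):

```python
import numpy as np
from scipy import stats

# Hypothetical blocked data: b = 5 blocks (rows) x k = 3 treatments (columns)
x = np.array([[7.1, 8.3, 6.0],
              [5.4, 6.8, 5.9],
              [8.0, 9.1, 7.2],
              [6.3, 7.7, 6.9],
              [7.8, 8.9, 7.0]])
b, k = x.shape

ranks = np.apply_along_axis(stats.rankdata, 1, x)  # rank within each block
T = ranks.sum(axis=0)                              # rank sum per treatment

Fr = 12 / (b * k * (k + 1)) * (T ** 2).sum() - 3 * b * (k + 1)

# scipy expects one array per treatment, i.e. the columns of x
Fr_scipy, p = stats.friedmanchisquare(*x.T)
```

With no ties the manual statistic matches the library value exactly.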

Ex 3: (Selvanathan et al., p. 912, ex. 20.45)
The following data are from a blocked experiment. Conduct the Friedman test
to determine whether at least two population central locations differ.
Use α = 0.05.

k = 3, b = 5



The ranks have to be assigned by moving across blocks (rows),
and the rank sums are calculated for the treatments (columns).

Fr,obs = 4.8 is the uncorrected Friedman test statistic.

This time, however, the corrected Friedman test statistic Frc is exactly the same
because there is not a single tie, hence each ti = 0 and the correction factor
is C = 1 (see the formulas on slide 28).



From the Friedman critical value table on LMS, the 5% small-sample critical
value is 6.4 and since Fr,obs = 4.8 < 6.4 = Fcrit, we fail to reject H0. Hence, it is
not possible to conclude at the 5% level that at least two population central
locations differ.

To run this test in R, execute the friedman.test(Y ~ Treatment | Block)
command, which returns

H0 is maintained.

Note: The p-values reported by R for nonparametric tests are based on the
asymptotically valid distributions, even if the required conditions for
reasonably accurate approximations are not satisfied. It is our task to
check whether these approximations are acceptable.



WHAT SHOULD YOU KNOW?

• The rationale behind analysis of variance (ANOVA).
• The difference between the completely randomised design and the
randomised block design.
• The difference between fixed-effects and random-effects models.
• How to perform parametric ANOVA based on the completely randomised
design and on the randomised block design, manually and with R.
• How to perform nonparametric ANOVA based on the completely randomised
design and on the randomised block design, manually and with R.
