Department of Statistics
Study Guide
This module, STA3701, follows on STA3703, which deals with multivariate distributions. You should therefore be familiar with the multivariate normal distribution and with the distribution of quadratic forms which results from it. The main results on matrix theory and distribution theory needed for this module are listed in chapter 1. The proofs and derivations of these results are not required for this module, but you should be able to state and apply them.
Prescribed book:

Stoker, D.J.: Statistical tables. 3rd ed. Pretoria: Academica, 1977 (or earlier editions).

There is no other prescribed textbook for this module, and no recommended textbook which you are required to read in order to complete the module. You may consult the following textbooks if you wish to learn more about the topics dealt with in this module:
Bowerman, B.L. and R. O'Connell: Linear statistical models: an applied approach. 2nd ed. PWS-KENT, 1990. (For chapters 3 and 4 of this Guide.)

Brownlee, K.A.: Statistical theory and methodology in science and engineering. 2nd ed. Wiley, 1965. (For chapters 3, 4 and 5 of this Guide.)

Draper, N.R. and H. Smith: Applied regression analysis. Wiley, 1st ed. 1966, 2nd ed. 1981 or 3rd ed. 1998. (For chapter 4 of this Guide.)

Dunn, O.J. and V.A. Clark: Applied statistics: analysis of variance and regression. Wiley, 1974. (For chapters 3, 4 and 5 of this Guide.)

Kleinbaum, D.G., Kupper, L.L., Nizam, A. and K.E. Muller: Applied regression analysis and other multivariable methods. 4th ed. Duxbury, 2008. (For most of the Guide.)

Kleinbaum, D.G. and L.L. Kupper: Applied regression analysis and other multivariable methods. Duxbury, 1st ed. 1978 or 2nd ed. 1988. (For most of the Guide.)

Kutner, M.H., Nachtsheim, C.J., Neter, J. and W. Li: Applied linear statistical models. 5th ed. McGraw-Hill Irwin, 2005. (For most of the Guide.)

Mickey, R.M., Dunn, O.J. and V.A. Clark: Applied statistics: analysis of variance and regression. 3rd ed. Wiley, 2004. (For chapters 3, 4 and 5 of this Guide.)

Neter, J., Wasserman, W. and M.H. Kutner: Applied linear statistical models: regression, analysis of variance and experimental designs. 2nd ed. Richard D. Irwin, 1985. (For chapters 3, 4 and 5 of this Guide.)

Neter, J. and W. Wasserman: Applied linear statistical models. Richard D. Irwin, 1974. (For chapters 3, 4 and 5 of this Guide.)

Ross, S.M.: Introduction to probability and statistics for engineers and scientists. Wiley, 1987. (For chapters 3 and 4 of this Guide.)

Ross, S.M.: Introduction to probability and statistics for engineers and scientists. 4th ed. Elsevier, 2009. (For chapters 3 and 4 of this Guide.)

Sall, J., Creighton, L. and A. Lehman: JMP start statistics: a guide to statistics and data analysis using JMP. 4th ed. SAS Institute, 2007. (For chapters 3, 4 and 5 of this Guide, especially the data analysis.)

Scheffé, H.: The analysis of variance. Wiley, 1959. (For chapters 2, 3 and 5 of this Guide.)

Searle, S.R.: Linear models. Wiley, 1997. (For chapters 1 and 2 of this Guide.)

Wackerly, D.D., Mendenhall, W. and R.L. Scheaffer: Mathematical statistics with applications. 7th ed. Duxbury, 2008. (For chapters 3 and 4 of this Guide.)
Chapter 1
THE TOOLBOX
The statistical techniques which are the subject of this module, Applied Statistics III, form part of a more general theory called the linear model. In chapter 2 we discuss this general model; this general theory should enable you to analyse a very general class of problems. In chapter 3 we discuss some specific models contained in the general theory under the heading analysis of variance. Analysis of variance is a procedure which enables one to test whether a number of parameters which determine the values of population means are equal or not. In chapter 4 we discuss regression analysis. In this type of model we assume that there is a linear relationship between a random response variable called yield, and a number of non-random variables. The purpose is to estimate the parameters of this relationship and to draw statistical inference about these parameters. In chapter 5 we discuss a third class of problems, called analysis of covariance, which may be regarded as a hybrid between regression analysis and analysis of variance. In this model we assume that there are a number of regression functions, each corresponding to a different population, and we want to know whether these regression functions are equal or not.
A word about notation. In this module we usually use Greek letters to denote parameters, capital letters to denote matrices and underlined small letters to denote vectors. A vector without a prime, for example x, will be a column vector and its transpose, for example x′, will therefore be a row vector. Small letters which are not underlined will usually be scalars (1×1 matrices), including specific elements of matrices or vectors. The usual distinction between random variables and observations, which is often made by using capitals for the former and small letters for the latter, will not be made here. It should be clear from the context whether a symbol stands for a random variable or not. A circumflex (or hat) above the symbol for a parameter or vector of parameters, for example β̂, is used to denote an unbiased estimator of the parameter(s).
This section contains a number of definitions and results on matrices which, for ease of reference,
will be numbered (M1), (M2), et cetera. These results will be used frequently in chapter 2. If they
do not seem interesting at first reading, you will soon find out how they are applied.
Example 1.1.
Write the following quadratic forms in matrix notation using a symmetric matrix:

(a) $2x^2 + 3y^2 - 4z^2 + 2xy - 6xz + 10yz$

(b) $x_1^2 + 4x_1x_3 - 6x_2x_3$

Solution 1.1.

(a) $2x^2 + 3y^2 - 4z^2 + 2xy - 6xz + 10yz = 2x^2 + 3y^2 - 4z^2 + xy + yx - 3xz - 3zx + 5yz + 5zy$.

Then $x' = \begin{pmatrix} x & y & z \end{pmatrix}$ and $A = \begin{pmatrix} 2 & 1 & -3 \\ 1 & 3 & 5 \\ -3 & 5 & -4 \end{pmatrix}$.

Thus
$$x'Ax = \begin{pmatrix} x & y & z \end{pmatrix}\begin{pmatrix} 2 & 1 & -3 \\ 1 & 3 & 5 \\ -3 & 5 & -4 \end{pmatrix}\begin{pmatrix} x \\ y \\ z \end{pmatrix}.$$

(b) $x_1^2 + 4x_1x_3 - 6x_2x_3 = x_1^2 + 0x_2^2 + 0x_3^2 + 0x_1x_2 + 0x_2x_1 + 2x_1x_3 + 2x_3x_1 - 3x_2x_3 - 3x_3x_2$.

Then $x' = \begin{pmatrix} x_1 & x_2 & x_3 \end{pmatrix}$ and $A = \begin{pmatrix} 1 & 0 & 2 \\ 0 & 0 & -3 \\ 2 & -3 & 0 \end{pmatrix}$.

Thus
$$x'Ax = \begin{pmatrix} x_1 & x_2 & x_3 \end{pmatrix}\begin{pmatrix} 1 & 0 & 2 \\ 0 & 0 & -3 \\ 2 & -3 & 0 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}.$$
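The correspondence between a quadratic form and its symmetric matrix is easy to check numerically. The sketch below (an illustration only, not part of the original guide) evaluates the polynomial in (a) directly and compares it with x′Ax at a random point:

```python
import numpy as np

# Symmetric matrix for (a): 2x^2 + 3y^2 - 4z^2 + 2xy - 6xz + 10yz
A = np.array([[2.0, 1.0, -3.0],
              [1.0, 3.0,  5.0],
              [-3.0, 5.0, -4.0]])

rng = np.random.default_rng(1)
x, y, z = rng.normal(size=3)
v = np.array([x, y, z])

quad_direct = 2*x**2 + 3*y**2 - 4*z**2 + 2*x*y - 6*x*z + 10*y*z
quad_matrix = v @ A @ v   # x'Ax

assert np.isclose(quad_direct, quad_matrix)
```

Note how the off-diagonal coefficients are split in half between the (i, j) and (j, i) entries, which is what makes A symmetric.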
(M2) The p × p matrix A is called positive definite if $x'Ax > 0$ for all $x \neq 0$ (where 0 is a vector of zeros).

(M3) If the > sign in (M2) is replaced by ≥ (and the equality holds for at least one $x \neq 0$), then A is said to be positive semidefinite.
Example 1.2.
Are the following matrices positive definite or positive semidefinite?

$$\text{(a) } \begin{pmatrix} 1 & 0 \\ 0 & 4 \end{pmatrix} \qquad \text{(b) } \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix} \qquad \text{(c) } \begin{pmatrix} 1 & -\tfrac{9}{2} \\ -\tfrac{9}{2} & 4 \end{pmatrix}$$

Solution 1.2.

(a)
$$x'Ax = \begin{pmatrix} x_1 & x_2 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & 4 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} x_1 & 4x_2 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = x_1^2 + 4x_2^2,$$
which is > 0 for all $x \neq 0$, so the matrix is positive definite.

(b)
$$x'Ax = \begin{pmatrix} x_1 & x_2 \end{pmatrix}\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} x_1 - x_2 & -x_1 + x_2 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = x_1^2 - x_2x_1 - x_1x_2 + x_2^2 = (x_1 - x_2)^2,$$
which equals 0 whenever $x_1 = x_2$ (whether or not $x_1 = x_2 = 0$) and is > 0 whenever $x_1 \neq x_2$. The matrix is therefore positive semidefinite.

(c)
$$x'Ax = \begin{pmatrix} x_1 & x_2 \end{pmatrix}\begin{pmatrix} 1 & -\tfrac{9}{2} \\ -\tfrac{9}{2} & 4 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = x_1^2 - \tfrac{9}{2}x_1x_2 - \tfrac{9}{2}x_1x_2 + 4x_2^2 = x_1^2 - 9x_1x_2 + 4x_2^2,$$
which is > 0 if $x_1 = 0$, $x_2 \neq 0$ or if $x_2 = 0$, $x_1 \neq 0$, but < 0 if $x_1 = x_2 = 1$. The matrix is therefore neither positive definite nor positive semidefinite.
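A symmetric matrix is positive definite exactly when all its eigenvalues are positive, and positive semidefinite when they are all non-negative; this eigenvalue criterion is not stated in the guide but gives a quick numerical check of the three answers above (a sketch for illustration only):

```python
import numpy as np

def classify(A, tol=1e-12):
    """Classify a symmetric matrix using the eigenvalue criterion."""
    eig = np.linalg.eigvalsh(A)
    if np.all(eig > tol):
        return "positive definite"
    if np.all(eig > -tol):
        return "positive semidefinite"
    return "neither"

Aa = np.array([[1.0, 0.0], [0.0, 4.0]])
Ab = np.array([[1.0, -1.0], [-1.0, 1.0]])
Ac = np.array([[1.0, -4.5], [-4.5, 4.0]])

print(classify(Aa))  # positive definite
print(classify(Ab))  # positive semidefinite
print(classify(Ac))  # neither
```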
(M4) (i) The rank of a matrix A, denoted by r(A), is defined to be the number of linearly
independent rows of A (or, equivalently, the number of linearly independent columns).
(iii) If A is a p × p matrix and r(A) = p then A is said to be non-singular and A−1 exists;
if r(A) < p then A is singular and A−1 does not exist.
(M6) The trace of a p × p matrix A, denoted by tr(A), is defined to be the sum of its diagonal elements: if $A = (a_{ij})$, then $\mathrm{tr}(A) = \sum_{i=1}^{p} a_{ii}$.

Thus
$$x'Ax = \mathrm{tr}(x'Ax) = \mathrm{tr}(x\,x'A) = \mathrm{tr}(Ax\,x').$$
Example 1.3.
Let $x = \begin{pmatrix} x \\ y \end{pmatrix}$ and $A = \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}$. Verify that $x'Ax = \mathrm{tr}(x\,x'A) = \mathrm{tr}(Ax\,x')$.

Solution 1.3.

$$x'Ax = \begin{pmatrix} x & y \end{pmatrix}\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x + 2y & 2x + 4y \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = x^2 + 2yx + 2xy + 4y^2 = x^2 + 4xy + 4y^2.$$

$$x\,x'A = \begin{pmatrix} x^2 & xy \\ yx & y^2 \end{pmatrix}\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix} = \begin{pmatrix} x^2 + 2xy & 2x^2 + 4xy \\ yx + 2y^2 & 2yx + 4y^2 \end{pmatrix}, \qquad \mathrm{tr}(x\,x'A) = x^2 + 4xy + 4y^2.$$

$$Ax\,x' = \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix}\begin{pmatrix} x & y \end{pmatrix} = \begin{pmatrix} x + 2y \\ 2x + 4y \end{pmatrix}\begin{pmatrix} x & y \end{pmatrix} = \begin{pmatrix} x^2 + 2yx & xy + 2y^2 \\ 2x^2 + 4yx & 2xy + 4y^2 \end{pmatrix}, \qquad \mathrm{tr}(Ax\,x') = x^2 + 4xy + 4y^2.$$

Note: $(x\,x'A)' = Ax\,x'$.

If A is idempotent, that is AA = A, then r(A) = tr(A).
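The trace identity in (M6) can be confirmed numerically at an arbitrary point; the following NumPy sketch (illustrative only) uses the matrix of Example 1.3:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=2)
A = np.array([[1.0, 2.0], [2.0, 4.0]])

xAx = x @ A @ x                     # the scalar x'Ax
t1 = np.trace(np.outer(x, x) @ A)   # tr(x x' A)
t2 = np.trace(A @ np.outer(x, x))   # tr(A x x')

assert np.isclose(xAx, t1) and np.isclose(xAx, t2)
```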
Example 1.4.
Which of the following matrices are idempotent?

$$\text{(a) } (1) \qquad \text{(b) } \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \qquad \text{(c) } \begin{pmatrix} \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix} \qquad \text{(d) } \begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix}$$

Give the ranks of the idempotent matrices. Which of the matrices are singular?

Solution 1.4.

(a) $(1)(1) = (1)$, so A is idempotent with $r(A) = \mathrm{tr}(A) = 1$; A is non-singular.

(b)
$$AA = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = A,$$
so A is idempotent with $r(A) = \mathrm{tr}(A) = 2$; A is non-singular.

(c)
$$AA = \begin{pmatrix} \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix}\begin{pmatrix} \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix} = \begin{pmatrix} \tfrac{1}{4}+\tfrac{1}{4} & \tfrac{1}{4}+\tfrac{1}{4} \\ \tfrac{1}{4}+\tfrac{1}{4} & \tfrac{1}{4}+\tfrac{1}{4} \end{pmatrix} = \begin{pmatrix} \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix} = A.$$
A is idempotent. Thus $r(A) = \mathrm{tr}(A) = \tfrac{1}{2} + \tfrac{1}{2} = 1$, and A is singular.

(d)
$$AA = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix}\begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac{1}{4}+\tfrac{1}{4} & \tfrac{1}{4}+\tfrac{1}{4} \\ 0 & \tfrac{1}{4}+\tfrac{1}{4} & \tfrac{1}{4}+\tfrac{1}{4} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix} = A.$$
A is idempotent. Thus $r(A) = \mathrm{tr}(A) = 1 + \tfrac{1}{2} + \tfrac{1}{2} = 2 < 3$, and A is singular.
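The four matrices can be checked in one loop; this illustrative NumPy sketch verifies idempotency and the rank-equals-trace rule for each:

```python
import numpy as np

mats = {
    "a": np.array([[1.0]]),
    "b": np.eye(2),
    "c": np.full((2, 2), 0.5),
    "d": np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.5, 0.5],
                   [0.0, 0.5, 0.5]]),
}

for name, A in mats.items():
    idem = np.allclose(A @ A, A)           # AA = A?
    rank = np.linalg.matrix_rank(A)        # should equal tr(A) when idempotent
    print(name, idem, rank, np.trace(A))
```

All four are idempotent, with ranks 1, 2, 1 and 2; (c) and (d) have rank less than their dimension and are therefore singular.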
(M11) If X is an n × m matrix with rank r(X) = m (thus m ≤ n) then $X(X'X)^{-1}X'$ is idempotent with rank m. Furthermore, $I_n - X(X'X)^{-1}X'$ is idempotent with rank n − m. Finally, $(X(X'X)^{-1}X')(I_n - X(X'X)^{-1}X') = O$, the n × n matrix of zeros.

(M13) From (M11) and (M12) it follows that $1(1'1)^{-1}1'$ and $I_n - 1(1'1)^{-1}1'$ are idempotent with ranks 1 and n − 1 respectively, and that their product is equal to the null matrix O.
Example 1.5.

(a) Let $X = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{pmatrix}$. Compute $X(X'X)^{-1}X'$.

(b) Let $x = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}$. Compute $x(x'x)^{-1}x'$ and $I_3 - x(x'x)^{-1}x'$.

Solution 1.5.

(a)
$$X'X = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{pmatrix} = \begin{pmatrix} 3 & 3 \\ 3 & 5 \end{pmatrix}.$$
$\det(X'X) = 15 - 9 = 6$, so
$$(X'X)^{-1} = \frac{1}{6}\begin{pmatrix} 5 & -3 \\ -3 & 3 \end{pmatrix} = \begin{pmatrix} \tfrac{5}{6} & -\tfrac{1}{2} \\ -\tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix}.$$
Then
$$X(X'X)^{-1}X' = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{pmatrix}\begin{pmatrix} \tfrac{5}{6} & -\tfrac{1}{2} \\ -\tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix}\begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \end{pmatrix} = \begin{pmatrix} \tfrac{5}{6} & -\tfrac{1}{2} \\ \tfrac{1}{3} & 0 \\ -\tfrac{1}{6} & \tfrac{1}{2} \end{pmatrix}\begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \end{pmatrix} = \begin{pmatrix} \tfrac{5}{6} & \tfrac{2}{6} & -\tfrac{1}{6} \\ \tfrac{2}{6} & \tfrac{2}{6} & \tfrac{2}{6} \\ -\tfrac{1}{6} & \tfrac{2}{6} & \tfrac{5}{6} \end{pmatrix}.$$

(b) $x'x = \begin{pmatrix} 1 & 1 & 1 \end{pmatrix}\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} = 3$. Then
$$x(x'x)^{-1}x' = \frac{1}{3}\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}\begin{pmatrix} 1 & 1 & 1 \end{pmatrix} = \begin{pmatrix} \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{pmatrix}$$
and
$$I_3 - x(x'x)^{-1}x' = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} - \begin{pmatrix} \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{pmatrix} = \begin{pmatrix} \tfrac{2}{3} & -\tfrac{1}{3} & -\tfrac{1}{3} \\ -\tfrac{1}{3} & \tfrac{2}{3} & -\tfrac{1}{3} \\ -\tfrac{1}{3} & -\tfrac{1}{3} & \tfrac{2}{3} \end{pmatrix}.$$
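The matrix in (a), together with the properties promised by (M11), can be verified numerically; the sketch below (illustrative only) builds the projection matrix and checks idempotency, the ranks and the zero product:

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])

H = X @ np.linalg.inv(X.T @ X) @ X.T   # X(X'X)^{-1}X'
M = np.eye(3) - H                      # I_n - X(X'X)^{-1}X'

expected = np.array([[5, 2, -1],
                     [2, 2, 2],
                     [-1, 2, 5]]) / 6

assert np.allclose(H, expected)
assert np.allclose(H @ H, H)                   # idempotent, rank m = 2
assert np.allclose(M @ M, M)                   # idempotent, rank n - m = 1
assert np.allclose(H @ M, np.zeros((3, 3)))    # product is the null matrix
```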
where
$$\frac{\partial}{\partial x} = \begin{pmatrix} \dfrac{\partial}{\partial x_1} \\ \vdots \\ \dfrac{\partial}{\partial x_p} \end{pmatrix}.$$

(In general, the derivative of a 1 × n row vector y′ with respect to the p × 1 column vector x is defined to be the p × n matrix
$$\frac{\partial y'}{\partial x} = (a_{ij}) \quad \text{where} \quad a_{ij} = \frac{\partial y_j}{\partial x_i}.)$$
Example 1.6.
Let $x' = \begin{pmatrix} u & v & w \end{pmatrix}$ and $A = \begin{pmatrix} 3 & 1 & 2 \\ 1 & 4 & 5 \\ 2 & 5 & 6 \end{pmatrix}$. Write down $\dfrac{\partial}{\partial x}(x'Ax)$.

Solution 1.6.

$$\frac{\partial}{\partial x}(x'Ax) = 2Ax = 2\begin{pmatrix} 3 & 1 & 2 \\ 1 & 4 & 5 \\ 2 & 5 & 6 \end{pmatrix}\begin{pmatrix} u \\ v \\ w \end{pmatrix} = 2\begin{pmatrix} 3u + v + 2w \\ u + 4v + 5w \\ 2u + 5v + 6w \end{pmatrix} = \begin{pmatrix} 6u + 2v + 4w \\ 2u + 8v + 10w \\ 4u + 10v + 12w \end{pmatrix}.$$

Alternatively, multiply out first:
$$x'Ax = \begin{pmatrix} u & v & w \end{pmatrix}\begin{pmatrix} 3 & 1 & 2 \\ 1 & 4 & 5 \\ 2 & 5 & 6 \end{pmatrix}\begin{pmatrix} u \\ v \\ w \end{pmatrix} = 3u^2 + vu + 2wu + uv + 4v^2 + 5wv + 2uw + 5vw + 6w^2,$$
so that
$$\frac{\partial}{\partial u}(x'Ax) = 6u + 2v + 4w, \qquad \frac{\partial}{\partial v}(x'Ax) = 2u + 8v + 10w, \qquad \frac{\partial}{\partial w}(x'Ax) = 4u + 10v + 12w.$$
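The formula ∂(x′Ax)/∂x = 2Ax can be checked against a finite-difference approximation of the gradient; the following NumPy sketch (illustrative only, at an arbitrary point) does exactly that for the matrix of Example 1.6:

```python
import numpy as np

A = np.array([[3.0, 1.0, 2.0],
              [1.0, 4.0, 5.0],
              [2.0, 5.0, 6.0]])

def f(x):
    return x @ A @ x   # the quadratic form x'Ax

x0 = np.array([1.0, -2.0, 0.5])
analytic = 2 * A @ x0  # the gradient 2Ax from the example

# independent check: central finite differences
h = 1e-6
numeric = np.array([(f(x0 + h*e) - f(x0 - h*e)) / (2*h) for e in np.eye(3)])

assert np.allclose(analytic, numeric, atol=1e-4)
```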
In this section we list a number of results from distribution theory. Their application will become
clear in the next chapter.
Let y be a column vector consisting of p random variables, that is $y' = (y_1 \cdots y_p)$. Let the expected value of $y_i$ be $\mu_i$ and let $\mu' = (\mu_1 \cdots \mu_p)$. Also, let the covariance of $y_i$ and $y_j$ be $\sigma_{ij}$. Write
$$\Sigma = (\sigma_{ij}) = \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1p} \\ \vdots & & \vdots \\ \sigma_{p1} & \cdots & \sigma_{pp} \end{pmatrix}.$$
Obviously $\sigma_{ij} = \sigma_{ji}$, so that Σ is symmetric. Σ is called the covariance matrix of y. We also write $\mathrm{Cov}(y, y') = \Sigma$.
(D1) A covariance matrix must be positive definite (or at least positive semidefinite).
(D2) The p × 1 random vector y is said to have a multivariate normal distribution with mean vector μ and covariance matrix Σ, also written $y \sim n(\mu; \Sigma)$, if the probability density function of y is given by
$$f(y) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left[-\tfrac{1}{2}(y - \mu)'\Sigma^{-1}(y - \mu)\right].$$
The covariance matrix Σ must be positive definite otherwise f (y) will not be a probability
density function.
(D3) If y is a random vector with E(y) = μ and Cov(y, y′) = Σ, and if A is a matrix of constants, then z = Ay is a random vector with E(z) = Aμ and Cov(z, z′) = AΣA′.
Example 1.7.

$$x \sim n\left(\begin{pmatrix} 10 \\ 15 \\ 12 \end{pmatrix};\ \begin{pmatrix} 4 & 3 & -1 \\ 3 & 9 & 3 \\ -1 & 3 & 1 \end{pmatrix}\right). \qquad \text{What is the distribution of } y = \begin{pmatrix} 1 & -1 & 0 \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{pmatrix}x\,?$$

Solution 1.7.

Here $\mu = \begin{pmatrix} 10 & 15 & 12 \end{pmatrix}'$ and Σ is as given above. Now $y = Ax$ where
$$A = \begin{pmatrix} 1 & -1 & 0 \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{pmatrix}.$$
Then
$$A\mu = \begin{pmatrix} 1 & -1 & 0 \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{pmatrix}\begin{pmatrix} 10 \\ 15 \\ 12 \end{pmatrix} = \begin{pmatrix} -5 \\ \tfrac{37}{3} \end{pmatrix}$$
and
$$A\Sigma A' = \begin{pmatrix} 1 & -1 & 0 \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{pmatrix}\begin{pmatrix} 4 & 3 & -1 \\ 3 & 9 & 3 \\ -1 & 3 & 1 \end{pmatrix}\begin{pmatrix} 1 & \tfrac{1}{3} \\ -1 & \tfrac{1}{3} \\ 0 & \tfrac{1}{3} \end{pmatrix} = \begin{pmatrix} 1 & -6 & -4 \\ 2 & 5 & 1 \end{pmatrix}\begin{pmatrix} 1 & \tfrac{1}{3} \\ -1 & \tfrac{1}{3} \\ 0 & \tfrac{1}{3} \end{pmatrix} = \begin{pmatrix} 7 & -3 \\ -3 & \tfrac{8}{3} \end{pmatrix}.$$

Thus, by (D3),
$$y \sim n\left(\begin{pmatrix} -5 \\ \tfrac{37}{3} \end{pmatrix};\ \begin{pmatrix} 7 & -3 \\ -3 & \tfrac{8}{3} \end{pmatrix}\right).$$
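The matrix arithmetic in Example 1.7 reduces to two products, Aμ and AΣA′, which makes it a convenient place for a numerical cross-check (an illustrative sketch only):

```python
import numpy as np

mu = np.array([10.0, 15.0, 12.0])
Sigma = np.array([[4.0, 3.0, -1.0],
                  [3.0, 9.0,  3.0],
                  [-1.0, 3.0, 1.0]])
A = np.array([[1.0, -1.0, 0.0],
              [1/3, 1/3, 1/3]])

mean_y = A @ mu            # mean vector of y = Ax
cov_y = A @ Sigma @ A.T    # covariance matrix of y, by (D3)

assert np.allclose(mean_y, [-5.0, 37/3])
assert np.allclose(cov_y, [[7.0, -3.0], [-3.0, 8/3]])
```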
(D5) If y ∼ n(0; Ip ) then y 0 y ∼ χ2p (that is y 0 y has a chi-square distribution with p degrees of
freedom).
(D7) If y ∼ n(µ; σ 2 Ip ) then y 0 y/σ 2 is said to have a non-central chi-square distribution with p
degrees of freedom and noncentrality parameter (n.c.p.) λ = µ0 µ/σ 2 . (Some textbooks use
the alternative convention λ = 21 µ0 µ/σ 2 ).
When λ = 0 the distribution becomes the (central) chi-square distribution. The n.c.p. λ is
non-negative by definition. As λ increases, the distribution shifts to the right. This is also
reflected in the following theorem.
(D8) If x is a noncentral chi-square variate with p degrees of freedom and noncentrality parameter λ, then x has mean and variance
$$E(x) = p + \lambda \qquad \text{and} \qquad \mathrm{Var}(x) = 2p + 4\lambda.$$
Example 1.8.

$$y \sim n\left(\begin{pmatrix} 3 \\ 4 \end{pmatrix};\ \begin{pmatrix} 5 & 0 \\ 0 & 5 \end{pmatrix}\right).$$

(a) What is the distribution of $\tfrac{1}{5}(y_1^2 + y_2^2)$? Calculate its mean and variance.

(b) Which function of $y_1$ and $y_2$ has a (central) $\chi^2_2$ distribution?

Solution 1.8.

(a) Using (D7), $\tfrac{1}{5}(y_1^2 + y_2^2) \sim \chi^2_{2,\lambda}$ where $\lambda = \mu'\mu/\sigma^2$. Here
$$\mu'\mu = \begin{pmatrix} 3 & 4 \end{pmatrix}\begin{pmatrix} 3 \\ 4 \end{pmatrix} = 25,$$
so $\lambda = \mu'\mu/\sigma^2 = \tfrac{25}{5} = 5$. Using (D8), the mean is $p + \lambda = 2 + 5 = 7$ and the variance is $2p + 4\lambda = 4 + 20 = 24$.

(b) The random variable that has a $\chi^2_2$ distribution is $\tfrac{1}{5}\left[(y_1 - 3)^2 + (y_2 - 4)^2\right]$.
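The mean and variance found in (a) can be checked by simulation; the sketch below (illustrative only, with an arbitrary seed) draws a large sample of y vectors and inspects the sample moments of the quadratic form:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
# y ~ n((3, 4)'; 5 I_2), so each coordinate has standard deviation sqrt(5)
y = rng.normal(loc=[3.0, 4.0], scale=np.sqrt(5.0), size=(n, 2))

# should be noncentral chi-square with 2 df and n.c.p. lambda = 5
w = (y[:, 0]**2 + y[:, 1]**2) / 5.0

print(w.mean(), w.var())   # close to 7 and 24
```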
(D10) Let u and v be independent random variables with u a noncentral chi-square variate with p
degrees of freedom and n.c.p. λ and v ∼ χ2q . Then
u/p
F =
v/q
is said to have a noncentral F-distribution with p and q degrees of freedom and n.c.p. λ.
The effect of the n.c.p. λ on the distribution of F is the same as the effect of λ on the
distribution of u: increasing λ causes a shift to the right.
(D11) Let y ∼ n(µ; σ 2 Ip ). Let A be a symmetric matrix of constants. Then y 0 Ay/σ 2 has a (non-
central) chi-square distribution with r(A) degrees of freedom and n.c.p. λ = µ0 Aµ/σ 2 if and
only if A is idempotent, that is AA = A.
(D12) Let y ∼ n(µ; σ 2 Ip ) and let A and B be constant matrices. Then y 0 Ay/σ 2 and y 0 By/σ 2 are
independent if and only if AB = O.
Example 1.9.

$$y \sim n\left(\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix};\ \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}\right).$$
Let $x = \tfrac{1}{3}(y_1 + y_2 + y_3)^2$ and $z = y_1^2 + y_2^2 + y_3^2 - \tfrac{1}{3}(y_1 + y_2 + y_3)^2$.

(a) Write x and z as quadratic forms $y'Ay$ and $y'By$ with A and B symmetric.

(b) Show that x and z are independent and have chi-square distributions; compute the number of degrees of freedom of each.

Solution 1.9.

(a) Here $y \sim n(\mu; \sigma^2I_p)$ with $\mu = 0$, $\sigma^2 = 1$ and $p = 3$. First,
$$x = \tfrac{1}{3}(y_1 + y_2 + y_3)(y_1 + y_2 + y_3) = \tfrac{1}{3}(y_1^2 + y_2^2 + y_3^2 + y_1y_2 + y_2y_1 + y_1y_3 + y_3y_1 + y_2y_3 + y_3y_2)$$
$$= \begin{pmatrix} y_1 & y_2 & y_3 \end{pmatrix}\begin{pmatrix} \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{pmatrix}\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix}, \qquad \text{so} \quad A = \begin{pmatrix} \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{pmatrix}.$$
Next,
$$z = y_1^2 + y_2^2 + y_3^2 - \tfrac{1}{3}(y_1 + y_2 + y_3)^2 = \tfrac{2}{3}y_1^2 + \tfrac{2}{3}y_2^2 + \tfrac{2}{3}y_3^2 - \tfrac{1}{3}y_1y_2 - \tfrac{1}{3}y_2y_1 - \tfrac{1}{3}y_1y_3 - \tfrac{1}{3}y_3y_1 - \tfrac{1}{3}y_2y_3 - \tfrac{1}{3}y_3y_2,$$
$$\text{so} \quad B = \begin{pmatrix} \tfrac{2}{3} & -\tfrac{1}{3} & -\tfrac{1}{3} \\ -\tfrac{1}{3} & \tfrac{2}{3} & -\tfrac{1}{3} \\ -\tfrac{1}{3} & -\tfrac{1}{3} & \tfrac{2}{3} \end{pmatrix}.$$

(b) Direct multiplication gives
$$AA = A, \qquad BB = B, \qquad AB = O.$$
Since AA = A and BB = B, (D11) shows that x and z have chi-square distributions with $r(A) = \mathrm{tr}(A) = 1$ and $r(B) = \mathrm{tr}(B) = 2$ degrees of freedom respectively: $x \sim \chi^2_1$ and $z \sim \chi^2_2$. Since AB = O, (D12) shows that x and z are independent.
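The three matrix facts used in (b) — AA = A, BB = B and AB = O — take one line each to confirm numerically (an illustrative sketch):

```python
import numpy as np

J = np.full((3, 3), 1/3)   # A: the matrix of the quadratic form x
A = J
B = np.eye(3) - J          # B: the matrix of the quadratic form z

assert np.allclose(A @ A, A)                 # A idempotent -> chi-square
assert np.allclose(B @ B, B)                 # B idempotent -> chi-square
assert np.allclose(A @ B, np.zeros((3, 3)))  # AB = O -> independence
print(np.trace(A), np.trace(B))              # degrees of freedom: 1 and 2
```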
(D13) If $y \sim n(\mu; \sigma^2I_p)$ then the quadratic form $y'Ay/\sigma^2$ and the linear form $By$ are independent if and only if $BA = O$.
(D14) More generally, if y ∼ n(µ; V ) then y 0 Ay has a noncentral chi-square distribution with r(A)
degrees of freedom and n.c.p. λ = µ0 Aµ if and only if AV is idempotent.
Example 1.10.

$$y \sim n\left(\begin{pmatrix} \mu \\ \mu \end{pmatrix};\ \begin{pmatrix} \sigma^2 & \sigma^2\rho \\ \sigma^2\rho & \sigma^2 \end{pmatrix}\right).$$

Solution 1.10.

Write $y \sim n\left(\begin{pmatrix} \mu \\ \mu \end{pmatrix};\ \sigma^2V\right)$ where $V = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$ and $r(V) = 2$.

Using (D16), consider $(y - \mu)'V^{-1}(y - \mu)/\sigma^2$, where
$$V^{-1} = \frac{1}{1 - \rho^2}\begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix}.$$
Then
$$\frac{(y - \mu)'V^{-1}(y - \mu)}{\sigma^2} = \frac{1}{\sigma^2(1 - \rho^2)}\begin{pmatrix} y_1 - \mu & y_2 - \mu \end{pmatrix}\begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix}\begin{pmatrix} y_1 - \mu \\ y_2 - \mu \end{pmatrix}$$
$$= \frac{1}{\sigma^2(1 - \rho^2)}\begin{pmatrix} y_1 - \mu - \rho(y_2 - \mu) & -\rho(y_1 - \mu) + (y_2 - \mu) \end{pmatrix}\begin{pmatrix} y_1 - \mu \\ y_2 - \mu \end{pmatrix}$$
$$= \frac{1}{\sigma^2(1 - \rho^2)}\left[(y_1 - \mu)^2 - \rho(y_2 - \mu)(y_1 - \mu) - \rho(y_1 - \mu)(y_2 - \mu) + (y_2 - \mu)^2\right]$$
$$= \frac{(y_1 - \mu)^2 - 2\rho(y_1 - \mu)(y_2 - \mu) + (y_2 - \mu)^2}{\sigma^2(1 - \rho^2)}.$$
Exercise 1.1

1. Write the following quadratic forms in matrix notation. Ensure that each matrix is symmetric.

2. Determine whether each of the following matrices is positive definite or positive semidefinite.
$$\text{(a) } \begin{pmatrix} 9 & -3 \\ -3 & 1 \end{pmatrix} \qquad \text{(b) } \begin{pmatrix} \tfrac{2}{3} & -\tfrac{1}{3} & -\tfrac{1}{3} \\ -\tfrac{1}{3} & \tfrac{2}{3} & -\tfrac{1}{3} \\ -\tfrac{1}{3} & -\tfrac{1}{3} & \tfrac{2}{3} \end{pmatrix} \qquad \text{(c) } \begin{pmatrix} 2 & -\tfrac{2}{3} \\ -\tfrac{2}{3} & 3 \end{pmatrix} \qquad \text{(d) } \begin{pmatrix} 1 & 2 & 0 \\ 2 & 5 & -1 \\ 0 & -1 & 1 \end{pmatrix}$$

3. Evaluate the following statements as true or false, and substantiate your answers for statements which are false. Consider the matrix
$$A = \begin{pmatrix} \tfrac{1}{2} & \tfrac{1}{2} & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$

(a) Let $y \sim n(\mu; \sigma^2I_p)$. Then $\dfrac{y'Ay}{\sigma^2}$ has a (noncentral) chi-square distribution with three degrees of freedom and noncentrality parameter λ.

(b) Let $y \sim n(\mu; \sigma^2I_p)$ and $B = \begin{pmatrix} \tfrac{1}{2} & \tfrac{1}{2} & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} & 0 \\ 0 & 0 & 0 \end{pmatrix}$. Then $\dfrac{y'Ay}{\sigma^2}$ and $\dfrac{y'By}{\sigma^2}$ are independent.

(c) The p × 1 random vector y is said to have a multivariate normal distribution with mean vector $\mu' = \begin{pmatrix} 3 & 9 & 2 \end{pmatrix}$ and covariance matrix $\Sigma = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac{1}{4} & \tfrac{1}{4} \\ 0 & \tfrac{1}{4} & \tfrac{1}{4} \end{pmatrix}$. Then
$$f(y) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left[-\tfrac{1}{2}(y - \mu)'\Sigma^{-1}(y - \mu)\right]$$
is a probability density function.

(d) The quadratic form $2(r - \tfrac{1}{2}s)(r + \tfrac{3}{2}s) - q(r - q)$ can be written in matrix notation, …

$$q_1 = x_1^2 + \tfrac{1}{10}(x_2 + 3x_3)^2 \qquad q_2 = k(3x_2 - x_3)^2 \qquad q_3 = \tfrac{1}{2}(x_1 - x_2)^2$$

(d) Determine the degrees of freedom and the noncentrality parameters of $q_1$ and of $q_2$.

(a) What is the distribution of $\dfrac{y'y}{\sigma^2}$?

(b) Calculate the mean and variance of the distribution in (a).

6. Let $x: 2 \times 1 \sim n(\mu; \Sigma)$ with $\mu = \begin{pmatrix} 3 \\ 2 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} \tfrac{3}{2} & 1 \\ 1 & \tfrac{3}{2} \end{pmatrix}$.

(a) What is the distribution of $x'Ax$ if $A = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$?

(b) If $B = \begin{pmatrix} 2 & -4 \\ -4 & 2 \end{pmatrix}$, are $x'Ax$ and $x'Bx$ independent? Explain.
Chapter 2

THE LINEAR MODEL
We assume that
$$y = X\beta + \varepsilon$$
where y is an n × 1 vector of random variables, X is an n × p matrix of known constants with n > p, β is a p × 1 vector of unknown parameters¹, and ε is an n × 1 vector of (unobservable) random variables. Thus ε is not a vector of parameters, but a Greek letter is used nevertheless, since ε is unknown. In this model X is selected in advance, y is observed and β must be estimated; ε is a random disturbance. The following assumptions are usually made:
Assumption 1
The matrix X has full column rank, that is r(X) = p. This implies that the rank of the p × p symmetric matrix X′X is p and therefore X′X is non-singular.

¹We usually assume that β is a vector of constants. In chapter 3 we will consider cases where some of the βᵢ are random variables.

This assumption is not essential for the linear model in general, but it simplifies the algebra, and in most practical applications the model can be formulated in such a way that X′X is in fact non-singular. Without this assumption one would first have to study the subject of generalised inverses of matrices.
Assumption 2
$$E(\varepsilon) = 0.$$
This assumption is essential; if E(ε) ≠ 0 one could incorporate E(ε) into the vector of unknown parameters β, possibly adding an extra parameter to β and an extra column to X. Under this assumption,
$$E(y) = X\beta.$$
The term ε is usually called the error term: it provides for the random fluctuation inherent in a statistical experiment.
Assumption 3
$$E(\varepsilon\varepsilon') = \sigma^2I$$
which means that the components of ε are uncorrelated and all have the same variance σ².
Assumption 4
Each column of X represents n values of a variable, often called an independent variable, predictor, concomitant variable, et cetera. We refer to them here as design variables, and X will be called the design matrix. We assume that (ideally) the design matrix X was chosen in advance, and that the n values of the response variable y which constitute the vector y were observed accordingly by means of an experiment.
The linear model in general does not exclude the possibility that X may be a random matrix. In
the problems dealt with in this module we will usually have a non-random matrix X. Many (but
not all) of the results derived in this module will be valid if X is in fact random.
Assumption 5
ε has a multivariate normal distribution: ε ∼ n(0; σ²I).
This assumption is not essential for the purpose of estimating β, but as soon as we want to test hypotheses about the parameters or construct confidence intervals for them, we have to make this assumption. Also, least squares estimators are maximum likelihood estimators if assumption 5 holds.
Assumption 6
n > p, that is, there are more observations than parameters. This is a very important assumption; without it the least squares estimators are not unique. If n < p then r(X′X) < p and therefore X′X will be singular (cf. assumption 1).
In this section it is shown that some of the statistical problems dealt with in most elementary
courses are special cases of the linear model.
$$y_1 = \mu + \varepsilon_1$$
$$\vdots$$
$$y_n = \mu + \varepsilon_n$$
where ε₁, …, εₙ are independent n(0; σ²) variates, that is ε ∼ n(0; σ²I). This may be written as
$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}(\mu) + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
or $y = X\beta + \varepsilon$.
Let y₁, …, y_{n₁} be a random sample from a n(μ₁; σ²) distribution and v₁, …, v_{n₂} a random sample from a n(μ₂; σ²) distribution such that the two samples are independent. We rename the second sample by writing
$$y_{n_1+i} = v_i, \qquad i = 1, \ldots, n_2.$$
Then
$$\begin{pmatrix} y_1 \\ \vdots \\ y_{n_1} \\ y_{n_1+1} \\ \vdots \\ y_{n_1+n_2} \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 0 & 1 \\ \vdots & \vdots \\ 0 & 1 \end{pmatrix}\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_{n_1} \\ \varepsilon_{n_1+1} \\ \vdots \\ \varepsilon_{n_1+n_2} \end{pmatrix}$$
$$E(y_i) = \alpha + \beta x_i, \qquad \mathrm{Var}(y_i) = \sigma^2.$$
Thus
$$y_1 = \alpha + \beta x_1 + \varepsilon_1$$
$$\vdots$$
$$y_n = \alpha + \beta x_n + \varepsilon_n,$$
that is
$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}\begin{pmatrix} \alpha \\ \beta \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
or $y = X\beta + \varepsilon$.
Example 2.1.
Assume that E(y_i) = β₀ + β₁x_i and Var(y_i) = σ²/w_i, i = 1, …, n, where y₁, …, yₙ are independent and w₁, …, wₙ are known constants. This is a particular "weighted regression" model. Write it in matrix notation.

Solution 2.1.
$$y_i = \beta_0 + \beta_1x_i + \varepsilon_i,$$
that is
$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
2.3 Estimation
The problem is to estimate β. The estimator will be denoted by β̂. The corresponding vector of values of the response variable, found by replacing β by β̂ and ε by 0 in the model, is denoted by ŷ. Thus ŷ = Xβ̂.

The difference between the observed and predicted response, y − ŷ, is called the vector of residuals and is said to be an estimator for ε. (This is an unconventional use of the term "estimator", since one usually estimates parameters while ε is a vector of unobservable random variables.) Let
$$e = y - \hat{y} = y - X\hat{\beta}.$$
Then, provided E(β̂) = β,
$$E(e) = E(y) - XE(\hat{\beta}) = X\beta - X\beta = 0 = E(\varepsilon)$$
by assumption. The fact that E(e) = E(ε) is the justification for calling e an estimator for ε.
The least squares criterion states that the estimator β̂ must be found in such a way that e′e, the sum of squares of the residuals, is a minimum. Thus we have to minimise
$$s = e'e = (y - X\hat{\beta})'(y - X\hat{\beta}) = y'y - \hat{\beta}'X'y - y'X\hat{\beta} + \hat{\beta}'X'X\hat{\beta} = y'y - 2\hat{\beta}'X'y + \hat{\beta}'X'X\hat{\beta}$$
since $\hat{\beta}'X'y$ and $y'X\hat{\beta}$ are scalars and therefore equal.

In order to minimise s we set $\partial s/\partial\hat{\beta} = 0$. Using (M14) and (M15) and noting that X′X is symmetric, we find
$$\frac{\partial s}{\partial\hat{\beta}} = -2X'y + 2X'X\hat{\beta}.$$
Setting this equal to zero gives
$$X'X\hat{\beta} = X'y \qquad \therefore\ \hat{\beta} = (X'X)^{-1}X'y$$
since we have assumed that X′X is non-singular. This is the least squares estimator. (Can you prove that β̂ minimises rather than maximises s?)
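The normal equations X′Xβ̂ = X′y can be solved directly in NumPy; the sketch below (an illustration with simulated data, not part of the guide) cross-checks the result against NumPy's built-in least squares routine:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3
# design matrix with an intercept column and two random design variables
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# least squares estimator from the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# cross-check against NumPy's built-in least squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(beta_hat, beta_lstsq)
```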
If we assume in addition that ε has a multivariate normal distribution, that is ε ∼ n(0; σ²I), then the same estimator is also the maximum likelihood estimator. More generally, if we assume that ε ∼ n(0; σ²V), that is y ∼ n(Xβ; σ²V), where V is non-singular, then the joint probability density function (p.d.f.) of the elements of y, which is the likelihood function, is
$$L = \frac{1}{(2\pi\sigma^2)^{n/2}|V|^{1/2}}\exp\left[-\frac{1}{2\sigma^2}(y - X\beta)'V^{-1}(y - X\beta)\right]$$
$$\therefore\ \ln L = -\tfrac{1}{2}n\ln(2\pi\sigma^2) - \tfrac{1}{2}\ln|V| - \frac{1}{2\sigma^2}(y - X\beta)'V^{-1}(y - X\beta).$$
In order to maximise ln L (and therefore L) with respect to β we have to minimise
$$s = (y - X\beta)'V^{-1}(y - X\beta) = y'V^{-1}y - 2\beta'X'V^{-1}y + \beta'X'V^{-1}X\beta.$$
Now
$$\frac{\partial s}{\partial\beta} = -2X'V^{-1}y + 2X'V^{-1}X\beta = 0 \qquad \text{if } X'V^{-1}X\beta = X'V^{-1}y.$$
We assume that V has full rank and X has full column rank, so that X′V⁻¹X is non-singular and therefore
$$\hat{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}y.$$
The (ordinary) least squares (OLS) estimator is a special case of this estimator (replace V by In ).
The more general estimator is called the generalised least squares (GLS) estimator; any property
of the GLS estimator will automatically be a property of the OLS estimator when the latter is
appropriate, that is when V = In .
Another special case of GLS is weighted least squares (WLS), which is used when V is a diagonal matrix, that is, if V = (v_{ij}) then v_{ij} = 0 for i ≠ j. In this case
$$V = \begin{pmatrix} v_{11} & 0 & \cdots & 0 \\ 0 & v_{22} & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & v_{nn} \end{pmatrix}, \qquad V^{-1} = \begin{pmatrix} 1/v_{11} & 0 & \cdots & 0 \\ 0 & 1/v_{22} & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & 1/v_{nn} \end{pmatrix}$$
and
$$(y - \hat{y})'V^{-1}(y - \hat{y}) = \sum_{i=1}^{n}\frac{1}{v_{ii}}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}w_i(y_i - \hat{y}_i)^2$$
where the weight $w_i = 1/v_{ii}$. Obviously OLS is also a special case of WLS with $v_{ii} = 1$, i = 1, …, n.
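The GLS formula translates directly into a short function; the sketch below (illustrative only — the helper name `gls` and the simulated data are not from the guide) also confirms that GLS with V = I reduces to OLS:

```python
import numpy as np

def gls(X, y, V):
    """Generalised least squares: (X'V^{-1}X)^{-1} X'V^{-1} y."""
    Vi = np.linalg.inv(V)
    return np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)

rng = np.random.default_rng(4)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])
v = rng.uniform(0.5, 2.0, size=n)   # diagonal of V (the WLS case)
V = np.diag(v)
y = X @ beta + rng.normal(scale=np.sqrt(v))

beta_gls = gls(X, y, V)

# with V = I the GLS estimator coincides with the OLS estimator
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(gls(X, y, np.eye(n)), beta_ols)
```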
We now assume that
$$y = X\beta + \varepsilon$$
and write
$$\hat{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}y = Ay \qquad \text{where} \qquad A = (X'V^{-1}X)^{-1}X'V^{-1}.$$
Thus β̂ is a linear estimator in the sense that each element of β̂ is a linear combination of y₁, …, yₙ. We are interested in the distribution of the random vector β̂.
Theorem 2.3.1
$$E(\hat{\beta}) = \beta \qquad \text{and} \qquad \mathrm{Cov}(\hat{\beta}, \hat{\beta}') = \sigma^2(X'V^{-1}X)^{-1}.$$
Proof:
By (D3),
$$E(\hat{\beta}) = AE(y) = (X'V^{-1}X)^{-1}X'V^{-1}(X\beta) = I_p\beta = \beta.$$
Thus β̂ is unbiased; we say it is an unbiased linear estimator. The covariance matrix of β̂ is likewise derived from (D3):
$$\mathrm{Cov}(\hat{\beta}, \hat{\beta}') = A\,\mathrm{Cov}(y, y')A' = (X'V^{-1}X)^{-1}X'V^{-1}(\sigma^2V)V^{-1}X(X'V^{-1}X)^{-1} = \sigma^2(X'V^{-1}X)^{-1}.$$
In the special case V = I,
$$\hat{\beta} = (X'X)^{-1}X'y, \qquad E(\hat{\beta}) = \beta, \qquad \mathrm{Cov}(\hat{\beta}, \hat{\beta}') = \sigma^2(X'X)^{-1}.$$
Theorem 2.3.2
If y ∼ n(Xβ; σ²V) and
$$\hat{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}y$$
then $\hat{\beta} \sim n\left[\beta;\ \sigma^2(X'V^{-1}X)^{-1}\right]$.
We apply the foregoing results to the three examples in section 2.2. In each case V = Iₙ.

For the one-sample problem,
$$X'X = n; \qquad (X'X)^{-1} = 1/n; \qquad X'y = y_1 + y_2 + \cdots + y_n$$
$$\therefore\ \hat{\mu} = (X'X)^{-1}X'y = \tfrac{1}{n}(y_1 + \cdots + y_n) = \bar{y}.$$

For the two-sample problem,
$$X' = \begin{pmatrix} 1 & \cdots & 1 & 0 & \cdots & 0 \\ 0 & \cdots & 0 & 1 & \cdots & 1 \end{pmatrix}$$
$$X'X = \begin{pmatrix} n_1 & 0 \\ 0 & n_2 \end{pmatrix}; \qquad (X'X)^{-1} = \begin{pmatrix} 1/n_1 & 0 \\ 0 & 1/n_2 \end{pmatrix}; \qquad X'y = \begin{pmatrix} y_1 + \cdots + y_{n_1} \\ y_{n_1+1} + \cdots + y_{n_1+n_2} \end{pmatrix}$$
$$\hat{\beta} = \begin{pmatrix} \hat{\mu}_1 \\ \hat{\mu}_2 \end{pmatrix} = \begin{pmatrix} (y_1 + \cdots + y_{n_1})/n_1 \\ (y_{n_1+1} + \cdots + y_{n_1+n_2})/n_2 \end{pmatrix} = \begin{pmatrix} \bar{y}_1 \\ \bar{y}_2 \end{pmatrix}, \text{ say}.$$
The covariance matrix is
$$\sigma^2(X'X)^{-1} = \begin{pmatrix} \sigma^2/n_1 & 0 \\ 0 & \sigma^2/n_2 \end{pmatrix}.$$
For the simple linear regression model,
$$|X'X| = n\sum x_i^2 - \left(\sum x_i\right)^2 = n\sum(x_i - \bar{x})^2.$$
$$(X'X)^{-1} = \begin{pmatrix} \dfrac{\sum x_i^2}{n\sum(x_i - \bar{x})^2} & -\dfrac{\sum x_i}{n\sum(x_i - \bar{x})^2} \\[2ex] -\dfrac{\sum x_i}{n\sum(x_i - \bar{x})^2} & \dfrac{n}{n\sum(x_i - \bar{x})^2} \end{pmatrix}; \qquad X'y = \begin{pmatrix} \sum y_i \\ \sum x_iy_i \end{pmatrix}$$
$$\hat{\beta} = \begin{pmatrix} \hat{\alpha} \\ \hat{\beta} \end{pmatrix} = (X'X)^{-1}X'y = \begin{pmatrix} \dfrac{\sum x_i^2\sum y_i - \sum x_i\sum x_iy_i}{n\sum(x_i - \bar{x})^2} \\[2ex] \dfrac{n\sum x_iy_i - \sum x_i\sum y_i}{n\sum(x_i - \bar{x})^2} \end{pmatrix}$$
Note that $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$.
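The explicit formulas for α̂ and β̂, and the identity α̂ = ȳ − β̂x̄, can be checked on simulated data; the sketch below (illustrative only) also cross-checks against NumPy's polynomial fit:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=40)
y = 1.5 + 0.8 * x + rng.normal(scale=0.2, size=40)
n = len(x)

D = n * np.sum((x - x.mean())**2)   # the common denominator n * sum((x_i - xbar)^2)
beta_hat = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / D
alpha_hat = (np.sum(x**2) * np.sum(y) - np.sum(x) * np.sum(x * y)) / D

# identity alpha-hat = ybar - beta-hat * xbar
assert np.isclose(alpha_hat, y.mean() - beta_hat * x.mean())

# cross-check against a degree-1 polynomial fit (slope, intercept)
coef = np.polyfit(x, y, 1)
assert np.isclose(beta_hat, coef[0]) and np.isclose(alpha_hat, coef[1])
```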
Example 2.2.
Repeat the analysis for the weighted regression model in example 2.1.

Solution 2.2.
The model is $y_i = \beta_0 + \beta_1x_i + \varepsilon_i$, that is
$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
where E(ε) = 0 and E(εε′) = σ²V with
$$V = \begin{pmatrix} \tfrac{1}{w_1} & 0 & \cdots & 0 \\ 0 & \tfrac{1}{w_2} & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & \tfrac{1}{w_n} \end{pmatrix} \qquad \text{and} \qquad V^{-1} = \begin{pmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & w_n \end{pmatrix}.$$
With $X' = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{pmatrix}$,
$$X'V^{-1}X = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{pmatrix}\begin{pmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & w_n \end{pmatrix}\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} = \begin{pmatrix} w_1 & w_2 & \cdots & w_n \\ w_1x_1 & w_2x_2 & \cdots & w_nx_n \end{pmatrix}\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} = \begin{pmatrix} \sum w_i & \sum w_ix_i \\ \sum w_ix_i & \sum w_ix_i^2 \end{pmatrix}$$
$$(X'V^{-1}X)^{-1} = \frac{1}{\sum w_i\sum w_ix_i^2 - \left(\sum w_ix_i\right)^2}\begin{pmatrix} \sum w_ix_i^2 & -\sum w_ix_i \\ -\sum w_ix_i & \sum w_i \end{pmatrix}$$
$$X'V^{-1}y = \begin{pmatrix} w_1 & w_2 & \cdots & w_n \\ w_1x_1 & w_2x_2 & \cdots & w_nx_n \end{pmatrix}\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} \sum w_iy_i \\ \sum w_ix_iy_i \end{pmatrix}$$
Then
$$\hat{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}y = \begin{pmatrix} \dfrac{\sum w_ix_i^2\sum w_iy_i - \sum w_ix_i\sum w_ix_iy_i}{\sum w_i\sum w_ix_i^2 - \left(\sum w_ix_i\right)^2} \\[2ex] \dfrac{\sum w_i\sum w_ix_iy_i - \sum w_ix_i\sum w_iy_i}{\sum w_i\sum w_ix_i^2 - \left(\sum w_ix_i\right)^2} \end{pmatrix}$$
We have seen that the GLS estimator is an unbiased linear estimator. We will now show that the GLS estimator is, in fact, the best linear unbiased estimator (BLUE) in the sense that the variance of each component of β̂, say β̂ᵢ, is smaller than (or equal to) the variance of any other unbiased linear estimator of βᵢ. In fact, suppose we wish to estimate a linear combination of the components of β, say t′β where t is a vector of known constants. In the one-sample problem we may want to estimate (1)(μ) = μ; in the two-sample problem we may want to estimate
$$\begin{pmatrix} 1 & -1 \end{pmatrix}\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} = \mu_1 - \mu_2;$$
while in the simple linear regression model our purpose may be to estimate
$$\begin{pmatrix} 1 & x \end{pmatrix}\begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \alpha + \beta x$$
for given x.

The Gauss-Markov theorem states that, of all linear combinations of y₁, …, yₙ, that is of all estimators of the form ℓ′y such that E(ℓ′y) = t′β, the estimator with the smallest variance is
$$t'\hat{\beta} = t'(X'V^{-1}X)^{-1}X'V^{-1}y,$$
that is, the one with
$$\ell' = t'(X'V^{-1}X)^{-1}X'V^{-1}.$$
This property holds whatever the form of V, thus also for the OLS estimator when V = Iₙ.
For any given vector t, the best unbiased linear estimator of t′β is t′β̂, where β̂ is the GLS estimator of β.

Proof
$$t'\hat{\beta} = t'(X'V^{-1}X)^{-1}X'V^{-1}y = c'y, \text{ say},$$
where
$$c' = t'(X'V^{-1}X)^{-1}X'V^{-1}. \tag{2.1}$$
Then
$$E(c'y) = c'X\beta = t'(X'V^{-1}X)^{-1}X'V^{-1}X\beta = t'\beta \qquad \text{(from 2.1)}$$
so that c′y is an unbiased estimator of t′β. Let ℓ′y be any other unbiased linear estimator of t′β, so that
$$E(\ell'y) = \ell'X\beta = t'\beta. \tag{2.2}$$
Write, without loss of generality,
$$\ell = c + d.$$
Then
$$\ell'y = c'y + d'y. \tag{2.3}$$
Taking expected values,
$$t'\beta + d'X\beta = t'\beta \qquad \text{(from 2.2)}$$
$$\therefore\ d'X\beta = 0.$$
This must hold for all β (we do not want an estimator which is unbiased only if β has certain values),
$$\therefore\ X'd = 0. \tag{2.4}$$
The variance of ℓ′y is
$$\mathrm{Var}(\ell'y) = \sigma^2\ell'V\ell = \sigma^2(c + d)'V(c + d) = \sigma^2(c'Vc + c'Vd + d'Vc + d'Vd).$$
Now
$$c'Vd = t'(X'V^{-1}X)^{-1}X'V^{-1}Vd = t'(X'V^{-1}X)^{-1}X'd = 0. \qquad \text{(from 2.4)}$$
Similarly d′Vc = 0.
$$\therefore\ \mathrm{Var}(\ell'y) = \sigma^2(c'Vc + d'Vd) = \mathrm{Var}(c'y) + \sigma^2d'Vd$$
$$\therefore\ \mathrm{Var}(\ell'y) \geq \mathrm{Var}(c'y)$$
since V is positive definite, so that d′Vd ≥ 0.
The theorem may also be proved by means of Lagrange multipliers. We want to choose ℓ so as to minimise
$$\mathrm{Var}(\ell'y) = \sigma^2\ell'V\ell$$
subject to the unbiasedness condition X′ℓ = t, that is, to minimise
$$\ell'V\ell - 2\theta'(X'\ell - t)$$
with respect to ℓ and the vector θ of Lagrange multipliers. The result is
$$\ell' = t'(X'V^{-1}X)^{-1}X'V^{-1}$$
as before.
The Gauss-Markov theorem may also be approached as follows. Let β̂* be any unbiased linear estimator of β and let β̂ be the GLS estimator. Then, similar to our proof of the theorem, it may be shown that the matrix
$$\mathrm{Cov}(\hat{\beta}^{*}, \hat{\beta}^{*\prime}) - \mathrm{Cov}(\hat{\beta}, \hat{\beta}')$$
is positive semidefinite.
In the remainder of this study guide we will assume throughout that the error term ε is multivariate normally distributed with mean 0 and covariance matrix σ²Iₙ. Thus the model is y = Xβ + ε where ε ∼ n(0; σ²Iₙ), and the least squares estimator is
$$\hat{\beta} = (X'X)^{-1}X'y.$$
We consider hypotheses of the form
$$H_0: K\beta = m$$
where K and m are assumed to be a matrix and a vector of known constants, and K has full row rank. The requirement that K must be of full row rank is not overly restrictive, since one may delete rows of K which are linear combinations of the remaining rows. For example, the two hypotheses β₁ − β₂ = 0 and β₁ − β₃ = 0 automatically imply that β₂ − β₃ = 0, the latter being redundant. We must of course be sure that the deleted hypotheses are consistent with the remaining ones (e.g. β₂ − β₃ = 10 would not be consistent with β₁ − β₂ = 0 and β₁ − β₃ = 0).
In the one-sample problem we may, for instance, test
$$H_0: \mu = C \qquad \text{or} \qquad K\beta = m.$$

In the two-sample problem, consider
$$H_0: \mu_1 - \mu_2 = 0; \quad \mu_1 = 10; \quad \mu_2 = 10,$$
that is
$$\begin{pmatrix} 1 & -1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 10 \\ 10 \end{pmatrix}.$$
Since the third row of the matrix on the left is a linear combination of the first two rows (notice that the second hypothesis minus the first yields the third), we may as well delete the third hypothesis to arrive at
$$H_0: \begin{pmatrix} 1 & -1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 10 \end{pmatrix}$$
or Kβ = m.
In the simple linear regression model, consider
$$H_0: \beta = 5; \quad \alpha + 3\beta = 10; \quad \alpha + 5\beta = 20.$$
The last row is a linear combination of the first two rows (twice the first hypothesis added to the second yields the third) and can be deleted. Thus we have
$$H_0: \begin{pmatrix} 0 & 1 \\ 1 & 3 \end{pmatrix}\begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \begin{pmatrix} 5 \\ 10 \end{pmatrix}$$
or Kβ = m.
To test
$$H_0: K\beta = m$$
it is natural to consider the statistic
$$K\hat{\beta} - m.$$
Its expected value is
$$KE(\hat{\beta}) - m = K\beta - m$$
(which is equal to zero if H₀ is true) and, by (D3), its covariance matrix is
$$\sigma^2K(X'X)^{-1}K'.$$
Thus, if H₀ is true,
$$K\hat{\beta} - m \sim n\left(0;\ \sigma^2K(X'X)^{-1}K'\right).$$
We will now study a number of quadratic forms which are relevant in the testing of linear hypothe-
ses:
q = y′y = y′Iₙy
q1 = y′(1(1′1)⁻¹1′)y
q2 = q − q1
q3 = β̂′(X′X)β̂
q4 = q − q3
In each case we will show that the quadratic form can be written in the form qi = y 0 Ai y where
Ai Ai = Ai , that is Ai is idempotent.
Thus qi /σ 2 is, in general, a noncentral chi-square variable with ri = r(Ai ) degrees of freedom and
noncentrality parameter (E(y))0 Ai (E(y))/σ 2 = β 0 X 0 Ai Xβ/σ 2 . We will also show that in a number
of cases Ai Aj = O, which proves that qi and qj are independent (cf (D11) and (D12)).
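These algebraic facts are easy to verify numerically. The following sketch uses an arbitrary illustrative design matrix (n = 4, p = 2) and the projection A3 = X(X′X)⁻¹X′ defined below; numpy is an assumption of the illustration:

```python
import numpy as np

# Illustrative design matrix, n = 4, p = 2.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
A3 = X @ np.linalg.inv(X.T @ X) @ X.T   # projection onto the column space of X

assert np.allclose(A3 @ A3, A3)         # A3 is idempotent
assert round(np.trace(A3)) == 2         # rank = trace = p for a projection matrix
A4 = np.eye(4) - A3
assert np.allclose(A3 @ A4, np.zeros((4, 4)))  # A3 A4 = O, so q3 and q4 are independent
print("checks passed")
```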
Example 2.3.
Solution 2.3.
q = y′y
  = [y1 y2 y3] [y1; y2; y3]
  = y1² + y2² + y3²
q1 = y′(1(1′1)⁻¹1′)y

where, using (M12),

1(1′1)⁻¹1′ = [1/3 1/3 1/3; 1/3 1/3 1/3; 1/3 1/3 1/3].

Now

q1 = y′(1(1′1)⁻¹1′)y
   = (1/3)[y1 y2 y3] [1 1 1; 1 1 1; 1 1 1] [y1; y2; y3]
   = (1/3)[y1+y2+y3  y1+y2+y3  y1+y2+y3] [y1; y2; y3]
   = (1/3)(y1² + y2y1 + y3y1 + y1y2 + y2² + y3y2 + y1y3 + y2y3 + y3²)
   = (1/3)(y1² + y2² + y3² + 2y1y2 + 2y1y3 + 2y2y3)
   = (1/3)(y1 + y2 + y3)²
   = (1/3)(Σᵢ₌₁³ yᵢ)²
   = (1/3)(3ȳ)²    since ȳ = (Σᵢ₌₁³ yᵢ)/3 ⇒ 3ȳ = Σᵢ₌₁³ yᵢ
   = 3ȳ²
q2 = q − q1                                        (2.5)
   = Σᵢ₌₁³ yᵢ² − nȳ²    (here n = 3)
   = Σᵢ₌₁³ (yᵢ − ȳ)²
   = (y1 − ȳ)² + (y2 − ȳ)² + (y3 − ȳ)²             (2.6)
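The identity q = q1 + q2 can be checked numerically. A minimal sketch, with an illustrative sample of three observations (numpy is assumed):

```python
import numpy as np

# Check q = q1 + q2 for an illustrative sample of n = 3 observations.
y = np.array([1.0, 2.0, 6.0])
n = len(y)
q = y @ y                          # y'y = 41
q1 = n * y.mean() ** 2             # n*ybar^2 = 27
q2 = ((y - y.mean()) ** 2).sum()   # sum of squared deviations = 14
print(q, q1, q2)
assert abs(q - (q1 + q2)) < 1e-12
```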
In general, for any n,

q1 = y′(1(1′1)⁻¹1′)y
   = (y1 ··· yn) [1/n ··· 1/n; ⋮ ⋱ ⋮; 1/n ··· 1/n] (y1; ⋮; yn)
   = nȳ².                                          (2.7)
This is an important quadratic form to remember. It is sometimes called the sum of squares due
to the mean. We may write
We conclude that q1/σ² is a noncentral chi-square variate with one degree of freedom and n.c.p.

λ1 = (1/n) β′X′1 1′Xβ/σ².                          (2.8)

It follows that λ1 = 0 if β = 0 or if X′1 = 0. The latter condition implies that the sum of the
elements in each row of X′ (that is, each column of X) is zero.
Example 2.4.
Assume X = [−1 1; −1 −1; 1 1; 1 −1]. Compute λ1.
Solution 2.4.
λ1 = β′X′1(1′1)⁻¹1′Xβ/σ²
   = (1/n) β′X′1 1′Xβ/σ²    since (1′1)⁻¹ = 1/n.

It follows that λ1 = 0 if β = 0 or if X′1 = 0. Now X′1 = 0 if the sum of the elements in each
row of X′ (that is, each column of X) is zero. In this case X′1 = 0 since the elements in each
column of X add up to zero, ⇒ λ1 = 0.
q2 = q − q1 = y′(In − A1)y = y′A2y, say. A2 is idempotent since

A2A2 = (In − A1)(In − A1)
     = In − A1 − A1 + A1A1
     = In − A1 = A2.                               (2.9)
that is, the first column of X consists of ones. This may be achieved in many of the well-known
linear model problems by a judicious choice of parameters (cf section 2.8).
β = (β1 0 · · · 0)0 .
Therefore Xβ = β1 1 and β′X′ = β1 1′.
= 0.
q2 = q − q1
   = Σ yᵢ² − nȳ²
   = Σ (yᵢ − ȳ)².                                  (2.11)
This is another important quadratic form to remember: q2 is called the total sum of squares (ad-
justed for the mean).
Independence of q1 and q2
Since q1 = y 0 A1 y and q2 = y 0 (In − A1 )y, and
A1 (In − A1 ) = A1 − A1 A1 = A1 − A1 = O
it follows that q1 and q2 are stochastically independent. Thus we have proved a well-known result:
that nȳ² and Σ(yᵢ − ȳ)² are stochastically independent.
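This independence argument rests on A1(In − A1) = O, which a few lines of code can confirm. A minimal numeric sketch for n = 3 (numpy assumed):

```python
import numpy as np

# A1 = 1(1'1)^{-1}1' and A2 = I - A1 for n = 3.
n = 3
A1 = np.ones((n, n)) / n
A2 = np.eye(n) - A1

assert np.allclose(A1 @ A1, A1)                 # A1 is idempotent
assert np.allclose(A1 @ A2, np.zeros((n, n)))   # A1(I - A1) = O, hence independence
print("A1(I - A1) = O confirmed")
```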
q3 = β̂′(X′X)β̂
   = β̂′X′y
   = y′X(X′X)⁻¹X′y
   = y′A3y.                                        (2.12)
From (M11) we know that A3 is idempotent with rank p. Thus q3 /σ 2 is a noncentral chi-square
variate with p degrees of freedom and noncentrality parameter
λ3 = β′X′A3Xβ/σ²
   = β′X′Xβ/σ².                                    (2.13)
Since X 0 X is positive definite, λ3 = 0 if and only if β = 0. The quadratic form q3 is usually termed
the regression sum of squares or the sum of squares due to the model.
q4 = q − q3
   = y′Iₙy − y′A3y
   = y′(In − A3)y
   = y′A4y.                                        (2.14)
From (M11) we know that A4 is idempotent and has rank n - p. Thus q4 /σ 2 is a noncentral
chi-square variate with n - p degrees of freedom and n.c.p.
λ4 = β′X′A4Xβ/σ²
   = (β′X′Xβ − β′X′Xβ)/σ²
   = 0.                                            (2.15)
Thus q4/σ² has a central chi-square distribution with n − p degrees of freedom, whatever the value
of β, provided the model is correct.
The residuals are y − ŷ = y − Xβ̂, and their sum of squares is

(y − Xβ̂)′(y − Xβ̂)
= y′y − y′Xβ̂ − β̂′X′y + β̂′X′Xβ̂
= y′y − β̂′X′Xβ̂    since y′Xβ̂ = β̂′X′y = β̂′X′Xβ̂
= q − q3 = q4.                                     (2.16)
Thus q4 is the sum of squares of the residuals. It is also termed the residual sum of squares or the
error sum of squares.
Since q4/σ² ∼ χ²ₙ₋ₚ, it follows that

E(q4/σ²) = n − p
∴ E(q4/(n − p)) = σ².

Therefore

σ̂² = q4/(n − p)

is an unbiased estimator of σ²; σ̂² is termed the error variance and σ̂ the standard error.
Independence of q3 and q4
q3 = y 0 A3 y
q4 = y 0 (In − A3 )y
and, using (D14) with Kβ̂ − m in the place of y, it follows that q5/σ² is a noncentral chi-square
variate with h degrees of freedom.
The quadratic form q5 is called the sum of squares due to the hypothesis.
Independence of q4 and q5
We want to prove that q4 and q5 are independent. In equations 2.14 and 2.17 we expressed q4 as
a quadratic form in y and q5 as a quadratic form in K β̂ − m. In order to prove independence, we
will have to express them both in terms of the same normal vector.
Then, by substituting β̂ = (X 0 X)−1 X 0 y and using the fact that X 0 (In − X(X 0 X)−1 X 0 )X = O, it
may be shown that
and
q5 = v 0 (B(B 0 B)−1 B 0 )v
where
B = X(X 0 X)−1 K 0 ,
Activity 2.1.
Prove these assertions as an exercise.
(a) Find a quadratic form qi such that qi/σ² has a central chi-square distribution if H0 is true and
a noncentral chi-square distribution if H0 is not true. Let qi/σ² have ri degrees of freedom.
(b) Find a second quadratic form qj which is independent of qi and such that qj/σ² has a central
chi-square distribution whether H0 is true or not. Let qj/σ² have rj degrees of freedom.
(c) Compute

f = (qi/ri) / (qj/rj)

which has a central Fri;rj distribution if H0 is true and a noncentral F-distribution if H0 is
not true. If f > Fα;ri;rj reject H0 at the 100α% level.
From table 2.1 we see that q3 and q4 are the appropriate quadratic forms. Thus
f = (β̂′X′Xβ̂ / p) / ((y − Xβ̂)′(y − Xβ̂) / (n − p))
  = (SSR / p) / (SSE / (n − p)) = SSR / (pσ̂²)
(b) H0 : Kβ = m
(c) H0 : c0 β = m
Partition (X′X)⁻¹, β̂ and β conformably as

(X′X)⁻¹ = [T11 T12; T12′ T22],    where T11 is h×h and T22 is (p − h)×(p − h),

β̂ = [β̂1; β̂2],    β = [β1; β2],    where β̂1 and β1 are h×1 and β̂2 and β2 are (p − h)×1.
q5 = β̂1′ T11⁻¹ β̂1

where q5/σ² ∼ χ²ₕ if β1 = 0.
Since this is a special case of q5 , we need not prove again that it is independent of q4 and we can
define the F-statistic directly.
f = (β̂1′ T11⁻¹ β̂1 / h) / ((y − Xβ̂)′(y − Xβ̂) / (n − p))

has an Fh;n−p distribution provided β1 = 0, that is

β1 = ... = βh = 0.
Obviously we can in this way test the null hypothesis that any subset of the βi s is zero, since the
ordering of the βi s is arbitrary.
Note: You will recall that the square of a td -variate has an F1;d distribution.
Example 2.5.
Let y = Xβ + ε, where ε ∼ n(0; σ²I4) and

X′ = [1 1 1 1; 1 1 0 0; −2 2 4 −4],    y′ = [4 4 0 −4].

(a) Show, with formulae and calculations, that β̂′ = [−2 6 2/5] and σ̂² = 1.6.
Solution 2.5.
y = Xβ + ε

(a) X′X = [1 1 1 1; 1 1 0 0; −2 2 4 −4][1 1 −2; 1 1 2; 1 0 4; 1 0 −4] = [4 2 0; 2 2 0; 0 0 40]

Row-reducing (X′X | I3):

[4 2 0 | 1 0 0]
[2 2 0 | 0 1 0]
[0 0 40 | 0 0 1]

(1/2)(R1 − R2):       [1 0 0 | 1/2 −1/2 0]
(1/2)R2 − new R1:     [0 1 0 | −1/2 1 0]
(1/40)R3:             [0 0 1 | 0 0 1/40]

Thus (X′X)⁻¹ = [1/2 −1/2 0; −1/2 1 0; 0 0 1/40]

n = 4 and p = 3
X′y = [1 1 1 1; 1 1 0 0; −2 2 4 −4][4; 4; 0; −4] = [4; 8; 16]

β̂ = (X′X)⁻¹X′y = [1/2 −1/2 0; −1/2 1 0; 0 0 1/40][4; 8; 16] = [−2; 6; 2/5]

Xβ̂ = [1 1 −2; 1 1 2; 1 0 4; 1 0 −4][−2; 6; 2/5] = [3.2; 4.8; −0.4; −3.6]
e′e = (y − Xβ̂)′(y − Xβ̂)
    = [4−3.2  4−4.8  0−(−0.4)  −4−(−3.6)][4−3.2; 4−4.8; 0−(−0.4); −4−(−3.6)]
    = [0.8 −0.8 0.4 −0.4][0.8; −0.8; 0.4; −0.4]
    = 1.6

∴ σ̂² = e′e/(n − p) = 1.6/(4 − 3) = 1.6
(b) H0: β1 − β3 = 0

K = [1 0 −1] and m = [0]

Test statistic: f = (SSH/h)/(SSE/(n − p)), which has an Fh;n−p distribution under H0; reject H0 if f > Fα;h;n−p.

Kβ̂ − m = [1 0 −1][−2; 6; 0.4] − 0 = −2.4

K(X′X)⁻¹K′ = [1 0 −1][1/2 −1/2 0; −1/2 1 0; 0 0 1/40][1; 0; −1]
           = [1/2 −1/2 −1/40][1; 0; −1]
           = 21/40 = 0.525

SSH = (Kβ̂ − m)′[K(X′X)⁻¹K′]⁻¹(Kβ̂ − m)
    = (−2.4)(40/21)(−2.4)
    = 10.9714

f = (SSH/h)/(SSE/(n − p)) = 10.9714/(1 × 1.6) ≈ 6.8571

The critical value is Fα;h;n−p = F0.05;1;1 = 161; reject H0 if f > 161.
Since 6.8571 < 161, we do not reject H0 at the 5% level of significance and
conclude that β1 − β3 = 0.
OR

Using the hypothesis in the form H0: c′β = m with c′ = [1 0 −1] and m = 0, the computations
are identical:

SSH = (−2.4)(40/21)(−2.4) = 10.9714

f = (SSH/h)/(SSE/(n − p)) = 10.9714/(1 × 1.6) ≈ 6.8571, with Fα;h;n−p = F0.05;1;1 = 161; reject H0 if f > 161.

Since 6.8571 < 161, we do not reject H0 at the 5% level of significance and conclude that
β1 − β3 = 0.
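The whole test can be scripted. A sketch reproducing the numbers of part (b) (numpy assumed; h = 1 here):

```python
import numpy as np

# Test of H0: beta1 - beta3 = 0 in Example 2.5 (b).
X = np.array([[1.0, 1.0, -2.0],
              [1.0, 1.0,  2.0],
              [1.0, 0.0,  4.0],
              [1.0, 0.0, -4.0]])
y = np.array([4.0, 4.0, 0.0, -4.0])
K = np.array([[1.0, 0.0, -1.0]])
m = np.array([0.0])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
d = K @ beta_hat - m                               # K*beta-hat - m = [-2.4]
SSH = d @ np.linalg.inv(K @ XtX_inv @ K.T) @ d     # about 10.9714
e = y - X @ beta_hat
s2 = (e @ e) / (4 - 3)
f = (SSH / 1) / s2
print(round(f, 4))   # 6.8571, far below F_{0.05;1;1} = 161, so H0 is not rejected
```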
Confidence regions
Consider the random variable K β̂ − Kβ where K is an h×p matrix of rank h. We know that
Therefore

f = (q5*/h) / (SSE/(n − p))

has an Fh;n−p distribution. Thus

1 − α = P(f ≤ Fα;h;n−p)
      = P(q5* ≤ (h·SSE/(n − p)) Fα;h;n−p)
      = P[(Kβ̂ − Kβ)′(K(X′X)⁻¹K′)⁻¹(Kβ̂ − Kβ) ≤ hσ̂²Fα;h;n−p].
The inequality inside the square brackets defines a joint confidence region for the h elements of
Kβ. A few examples of this follow.
The set of all values of β such that the inequality in square brackets is satisfied, forms a joint
confidence region for β1 , · · · , βp .
As before, let

β = [β1; β2],    β̂ = [β̂1; β̂2],    (X′X)⁻¹ = [T11 T12; T12′ T22],

where β1 and β̂1 are h×1, β2 and β̂2 are (p − h)×1, and T11 is h×h.
Figure 2.2:
∴ β̂i − √aii σ̂ t½α;n−p ≤ βi ≤ β̂i + √aii σ̂ t½α;n−p.

∴ c′β̂ − σ̂ √(c′(X′X)⁻¹c) t½α;n−p ≤ c′β ≤ c′β̂ + σ̂ √(c′(X′X)⁻¹c) t½α;n−p.
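As a numeric sketch of the interval β̂i ± √aii σ̂ t½α;n−p, the data of Example 2.5 can be reused (n − p = 1 there, so the tabulated value t0.025;1 = 12.706 applies; the numbers below are taken from that example):

```python
import math

# 95% confidence interval for beta_3 in Example 2.5:
# beta3-hat = 0.4, a33 = 1/40, sigma-hat^2 = 1.6, t_{0.025;1} = 12.706 (tabulated).
beta3_hat = 0.4
a33 = 1 / 40
s2 = 1.6
t = 12.706

half = math.sqrt(a33) * math.sqrt(s2) * t   # sqrt(1.6/40) = 0.2, so half-width = 2.5412
print(beta3_hat - half, beta3_hat + half)   # approximately (-2.1412 ; 2.9412)
```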
Multiple comparisons
Suppose we have rejected a null hypothesis of the form Kβ = m which really consists of h different
hypotheses. The result is that we no longer believe that all h of the relationships in the null
hypothesis are true, but we still do not know which of the h relationships are true and which are
not. Thus we have to pursue the analysis a bit further. (Of course, if H0 is not rejected, there is no
evidence that any of the h relationships do not hold and a further analysis would be redundant.)
We consider the case h = 2 first, because it is easier to draw a picture in two dimensions. The
joint confidence region for β1 and β2 is an ellipse, as was seen before. Suppose the ellipse is as
shown in figure 2.3:
Figure 2.3:
Thus we are 100(1 - α)% sure that the true β1 and β2 must both be inside the ellipse. Where
can β1 lie then? Between the extremes of the ellipse as viewed from the β1 -axis, of course, and
similarly for β2 (cf figure 2.4).
Figure 2.4:
In the above figure we would reject H0 : β1 = 0, β2 = 0 since the origin is outside the confidence
ellipse. We would also reject the subhypothesis H0 : β2 = 0 since zero falls outside the confidence
interval for β2 . However, we would not reject H0 : β1 = 0 since zero falls inside the confidence
interval for β1 .
Thus we have seen that a joint confidence region for two or more parameters implies a confidence
interval for each of the parameters. These confidence intervals are somewhat wider than those we
would have obtained if we had constructed a confidence interval for one parameter only. However,
if we want to make a number of probability statements based on the same set of data, it is more
reassuring if we could attach one probability to all our statements jointly. (For example, if we
make 20 separate probability statements with confidence 0.95 each, the probability that all our
statements are true may be as low as (0.95)20 ≈ 0.36.)
From the ellipse in figure 2.3 we may in principle obtain a confidence interval for any linear
combination c1β1 + c2β2 where c1 and c2 are constants. We must draw the line c1β1 + c2β2 = 0 on the
graph, and project the extremes of the ellipse onto the axis orthogonal to that line (cf figure 2.5).
Figure 2.5:
The problem of selecting a scale on that axis must of course be solved, but usually the whole
procedure of finding a confidence interval is done algebraically. The graphs were used merely to
illustrate the procedure.
The method which we will discuss is one of several methods available, and is known as Scheffé’s
method. Assume, as before, that K is an h × p matrix of rank h. Thus
is a quadratic form such that q5*/σ² is a chi-square variate with h degrees of freedom. Now let t
be an h×1 vector of known constants. Then t′Kβ is a scalar and

qt* = (t′Kβ̂ − t′Kβ)′(t′K(X′X)⁻¹K′t)⁻¹(t′Kβ̂ − t′Kβ)    (2.19)

is a quadratic form such that qt*/σ² has a chi-square distribution with one degree of freedom. The
method of Scheffé is based on the following inequality:
Theorem 2.7.6
We have seen before that a 100(1 - α)% confidence region for Kβ is defined by
This defines a confidence region for t0 Kβ with confidence at least 100(1 - α)%.
while H0 : t0 β = 0 is rejected if
Example 2.6.
Define the following:

X = [1 0 1; 1 0 −1; 0 2 0; 1 −1 0; −1 −1 0],    y = [2; 7; 4; 3; 8],    β = [α; γ; ϕ].
−1 −1 0 8
Solution 2.6.
X = [1 0 1; 1 0 −1; 0 2 0; 1 −1 0; −1 −1 0],    y = [2; 7; 4; 3; 8]    and    β = [α; γ; ϕ]
(a) X′X = [1 1 0 1 −1; 0 0 2 −1 −1; 1 −1 0 0 0][1 0 1; 1 0 −1; 0 2 0; 1 −1 0; −1 −1 0] = [4 0 0; 0 6 0; 0 0 2]

(X′X)⁻¹ = [1/4 0 0; 0 1/6 0; 0 0 1/2]    Note: inverse of a diagonal matrix
X′y = [1 1 0 1 −1; 0 0 2 −1 −1; 1 −1 0 0 0][2; 7; 4; 3; 8] = [4; −3; −5]

β̂ = (X′X)⁻¹X′y = [1/4 0 0; 0 1/6 0; 0 0 1/2][4; −3; −5] = [1; −1/2; −5/2]
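The diagonal form of X′X and the estimate can be reproduced numerically. A minimal sketch (numpy assumed):

```python
import numpy as np

# Reproducing the estimate of Example 2.6 (a).
X = np.array([[ 1.0,  0.0,  1.0],
              [ 1.0,  0.0, -1.0],
              [ 0.0,  2.0,  0.0],
              [ 1.0, -1.0,  0.0],
              [-1.0, -1.0,  0.0]])
y = np.array([2.0, 7.0, 4.0, 3.0, 8.0])

XtX = X.T @ X
print(np.diag(XtX))                        # X'X = diag(4, 6, 2)
beta_hat = np.linalg.solve(XtX, X.T @ y)
print(beta_hat)                            # approximately [1, -0.5, -2.5]
```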
(b) H0: α = γ − ϕ − 2, γ = −α − ϕ − 3, γ + ϕ = α − 4

that is H0: α − γ + ϕ = −2, α + γ + ϕ = −3, −α + γ + ϕ = −4

K = [1 −1 1; 1 1 1; −1 1 1] and m = [−2; −3; −4]

Test statistic: f = (SSH/h)/(SSE/(n − p)), which has an Fh;n−p distribution under H0.
Kβ̂ − m = [1 −1 1; 1 1 1; −1 1 1][1; −1/2; −5/2] − [−2; −3; −4]
        = [−1; −2; −4] − [−2; −3; −4]
        = [1; 1; 0]
K(X′X)⁻¹K′ = [1 −1 1; 1 1 1; −1 1 1][1/4 0 0; 0 1/6 0; 0 0 1/2][1 1 −1; −1 1 1; 1 1 1]
           = [1/4 −1/6 1/2; 1/4 1/6 1/2; −1/4 1/6 1/2][1 1 −1; −1 1 1; 1 1 1]
           = [11/12 7/12 1/12; 7/12 11/12 5/12; 1/12 5/12 11/12]
Write A = K(X′X)⁻¹K′. Then |A| = 1/3 and, computing the inverse by cofactors,

(A⁻¹)11 = (−1)² |11/12 5/12; 5/12 11/12| / (1/3) = 2
(A⁻¹)22 = (−1)⁴ |11/12 1/12; 1/12 11/12| / (1/3) = 5/2
(A⁻¹)33 = (−1)⁶ |11/12 7/12; 7/12 11/12| / (1/3) = 3/2
(A⁻¹)12 = (A⁻¹)21 = (−1)³ |7/12 5/12; 1/12 11/12| / (1/3) = −3/2
(A⁻¹)13 = (A⁻¹)31 = (−1)⁴ |7/12 11/12; 1/12 5/12| / (1/3) = 1/2
(A⁻¹)23 = (A⁻¹)32 = (−1)⁵ |11/12 7/12; 1/12 5/12| / (1/3) = −1

Thus, [K(X′X)⁻¹K′]⁻¹ = [2 −3/2 1/2; −3/2 5/2 −1; 1/2 −1 3/2]
q4 = (y − Xβ̂)′(y − Xβ̂)
   = [2−(−3/2)  7−7/2  4−(−1)  3−3/2  8−(−1/2)][2−(−3/2); 7−7/2; 4−(−1); 3−3/2; 8−(−1/2)]
   = [7/2 7/2 5 3/2 17/2][7/2; 7/2; 5; 3/2; 17/2]
   = 124.
Then,

SSH = (Kβ̂ − m)′[K(X′X)⁻¹K′]⁻¹(Kβ̂ − m)
    = [1 1 0][2 −3/2 1/2; −3/2 5/2 −1; 1/2 −1 3/2][1; 1; 0]
    = [1/2 1 −1/2][1; 1; 0]
    = 3/2

f = (SSH/h)/(SSE/(n − p)) = (1.5/3)/(124/(5 − 3)) = 0.5/62 ≈ 0.0081
A 95% confidence interval for γ is

γ̂ ± √a22 σ̂ t½α;n−p

With σ̂² = SSE/(n − p) = 124/(5 − 3) = 62, γ̂ = −1/2, a22 = 1/6 and t0.025;2 = 4.303, this gives

(−1/2 − √(1/6) √62 (4.303) ; −1/2 + √(1/6) √62 (4.303))
= (−1/2 − 13.8322 ; −1/2 + 13.8322)
= (−14.3322 ; 13.3322)
In many examples of linear models the first column of X is 1. The model is therefore

[y1; ⋮; yn] = [1 x12 ··· x1p; ⋮ ⋮ ⋮; 1 xn2 ··· xnp][β1; β2; ⋮; βp] + [ε1; ⋮; εn]

that is

yi = β1 + β2xi2 + ··· + βpxip + εi.
or y* = X*β(2) + ε*
where
This model may now be used to estimate β2 , · · · , βp while β̂1 is found by setting
Even though ε* is not an n(0, σ²In) vector, it can be shown that the solution of
β̂(2) = (X*′X*)⁻¹X*′y*
gives exactly the same estimators β̂2 , · · · , β̂p as the last (p − 1) elements of
β̂ = (X 0 X)−1 X 0 y,
while the first element of β̂ is exactly the same as β̂1 defined above. The advantages of reducing
the model in this way are:
(a) A smaller matrix has to be inverted (and we all know that it is a lot easier to invert a 3 × 3
matrix than a 4 × 4 matrix).
(b) Numerical accuracy is improved, especially when working on an electronic computer. The
elements of X*′X* are usually much smaller than the elements of X′X and the result is that
the round-off errors involved in the inversion of X*′X* are much smaller than in the inversion
of X′X.
The models of the one-sample problem and simple linear regression are examples of models
with a column of ones. The model of the two-sample problem can be reparameterised as follows. Let
µ = (µ1 + µ2)/2 and α = (µ1 − µ2)/2. Then µ1 = µ + α and µ2 = µ − α. Thus we have
[y1; ⋮; y_{n1}; y_{n1+1}; ⋮; y_{n1+n2}] = [1 1; ⋮ ⋮; 1 1; 1 −1; ⋮ ⋮; 1 −1][µ; α] + [ε1; ⋮; ε_{n1}; ε_{n1+1}; ⋮; ε_{n1+n2}].
y = Xβ + ε

that is

[y1; ⋮; yn] = β1[x11; ⋮; xn1] + β2[x12; ⋮; xn2] + ··· + βp[x1p; ⋮; xnp] + [ε1; ⋮; εn].
The first vector y consists of n observations of the random variable y. The constants β1 , β2 , · · · , βp
are unknown parameters. The first column of X, say x1 , consists of n values of the non-random
design variable x1 , et cetera. Thus we have p design variables x1 , · · · , xp in our model, each corre-
sponding to one of the p parameters β1 , · · · , βp .
(a) If there is a constant term in the model, the first column of X usually consists of ones only:
x1 = 1. (This is, strictly speaking, not a variable.)
(b) An important type of design variable is one that assumes the values 0 and 1 only. Such a
variable usually indicates the absence or presence of some characteristics or treatment, and
is usually called a dummy variable. Suppose, for example, one of the characteristics to be
included in a model is the marital status of a person. Suppose that there are four categories:
Single, Married, Divorced and Widowed. If one does not think about it carefully, one may
be tempted to define the variable x = 1 for single persons; = 2 for married persons; = 3 for
divorced persons and = 4 for widowed persons. However, in a model of the type
y = β1 + β2 x
this would mean that the effect on y of being divorced is three times the effect of being single,
et cetera. We should rather define the following dummy variables (we assume that x1 = 1,
that is β1 represents the constant term):
x2 = 1 for single persons; = 0 for others
x3 = 1 for married persons; = 0 for others
x4 = 1 for divorced persons; = 0 for others
x5 = 1 for widowed persons; = 0 for others.

If, however, dummy variables were defined for all four categories, we would have

x1 = x2 + x3 + x4 + x5,
that is the first column of X would be equal to the sum of the next four columns and the
rank of X would be less than the number of columns. Thus if a categorical variable has k
categories, one would usually need to define (k − 1) dummy variables.
Dummy variables could also be defined to assume the values -1 and 1 rather than 0 and 1.
The interpretation of the corresponding coefficient βi is somewhat different then.
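The (k − 1)-dummy-variable coding can be sketched in code. In the snippet below the category names and the choice of "Single" as reference category are illustrative assumptions (numpy assumed):

```python
import numpy as np

# Build k - 1 = 3 dummy columns for a four-category variable, taking "Single"
# as the (arbitrary) reference category.
statuses = ["Single", "Married", "Divorced", "Widowed", "Married"]
levels = ["Married", "Divorced", "Widowed"]   # "Single" is the reference

D = np.array([[1.0 if s == lev else 0.0 for lev in levels] for s in statuses])
print(D)
# A "Single" row is all zeros; adding a fourth dummy for "Single" would make the
# columns, together with the constant column of ones, linearly dependent.
```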
(c) The third type of design variable is a continuous variable. This type of variable is assumed
to be non-random in the models considered in this course. We could transform any such
variable and either replace a column by the transformed values or add additional columns.
Suppose, for example, our model is
yi = β0 + β1ui + β2ui² + β3√ui + εi,

in which case the design matrix is

[1 u1 u1² √u1; ⋮ ⋮ ⋮ ⋮; 1 un un² √un].
Three of the most important special cases of the linear model, which will be dealt with in the next
three chapters, are:
(a) Analysis of variance (ANOVA): Except for the first column of ones, all the other variables
are dummy variables. The problem is to test whether certain categorical variables (called
treatments) have an effect on the response variable y.
(b) Regression analysis: The first column of X is usually (but not always) 1 and all the other
variables are continuous variables. (If the first column of X is not 1 we speak of regression
through the origin.) The purpose of the analysis may be to test whether certain variables
have an effect on the response, or the purpose may be to use the estimated equation
ŷ = β̂1 x1 + . . . + β̂p xp
to predict the response y which we could expect to obtain if selected values of x1 , · · · , xp are
used in a further experiment.
(c) Analysis of covariance: The first column of X may or may not be 1, but some of the
remaining variables are dummy variables and some are continuous. The purpose of the
analysis could be partly predictive and partly to test whether the regression equation (of y
on the continuous variables) differs from groups as defined by the dummy variables.
Example 2.7.
In an experiment to compare the results of roadrunners in different circumstances, the following
sample was used:
M 3 A 60 y1
F 8 B 56 y2
M 7 C 70 y3
F 4 A 68 y4
F 11 C 77 y5
M 2 B 65 y6
Construct a design matrix for this experiment and explain your answer.
Solution 2.7.
xi4 = 1 for programme A; = 0 otherwise
xi5 = 1 for programme B; = 0 otherwise
e = y − ŷ = y − X β̂,
which is an “estimator” of , the vector of random errors. If our model is correct these residuals
should contain no information about the response vector y. The error term is included in the
model to account for the unexplained variation, and if e contains information about y it means
that our model is inadequate. The residuals may be examined and analysed in many different
ways, but we will speak briefly about graphical methods only.
e = y − Xβ̂
  = y − X(X′X)⁻¹X′y
  = (In − X(X′X)⁻¹X′)y

so that Cov(e, e′) = σ²(In − X(X′X)⁻¹X′). This covariance matrix is not in general a diagonal matrix and the diagonal elements are not
necessarily equal. Thus the residuals are, in general, not independent and do not have the same
variance. In fact, the covariance matrix of e is singular, so that e has a singular multivariate normal
distribution. (The dependence does become smaller as n − p increases, however.)
Nevertheless, for an informal examination of the residuals we treat them as if they were independent
and had the same variance. The following graphical analyses are usually made:
If the sample size n is large, we can construct a histogram to decide whether the error terms
may reasonably be regarded as normal variates (cf figures 2.6, 2.7).
We can also use normal probability paper, plotting e(i) against 100i/(n + 1) (where e(i) is the
i-th smallest residual). If the points on the graph are reasonably spread around a straight
line, we may feel safe about the assumption of normality. We should be on the lookout
especially for systematic deviations from a straight line. An example of such graph paper is
presented in figure 2.8.
Note that the vertical axis is marked in percentages, that is we must plot 100i/(n + 1) on
the vertical axis.
We plot the residuals against each of the design variables (except of course the column 1).
If e is unrelated to xi we may feel reasonably secure about our model as far as the design
variable xi is concerned.
The graph should have the appearance of figure 2.9, where the points are spread randomly
around 0. If the points follow a pattern, such as in figure 2.10, we may also have to include
additional terms such as x2i , x3i , xki , log xi , et cetera to try to improve the model. If the
points are spread around zero, but with a pattern in the variability such as in figure 2.11, it
indicates that the variances of all the εi's may not be the same. In fact, the variance may
be a function of xi. For the data illustrated in figure 2.11 it may well be that the standard
deviation of εi is a multiple of xi; thus
Var(yi ) = σ 2 x2i , i = 1, · · · , n
Figure 2.9:
Figure 2.10:
Figure 2.11:
Any patterns in this graph may indicate that a transformation of the response variable may
be needed (cf section 2.11).
If the time sequence of the observations is available, it is important to plot the residuals
against time. In this way we may decide whether the residuals are autocorrelated or not,
that is whether each residual depends in a specific way on the previous one. If autocorrelation
is present, the least squares estimators are inappropriate and econometric methods must be
employed to estimate β.
One operator may perform better than another, a factory may operate more efficiently on
certain days of the week, et cetera, and this may show up in an appropriate residual plot.
Before leaving the residuals, we discuss one more result which one should know about. Suppose
there is a constant term in the model, that is the first column of X is 1. We know that

X′e = X′(In − X(X′X)⁻¹X′)y
    = (X′ − X′X(X′X)⁻¹X′)y
    = (X′ − X′)y
    = 0
∴ [1 ··· 1; x12 ··· xn2; ⋮ ⋮; x1p ··· xnp][e1; ⋮; en] = [0; ⋮; 0]
This shows that there are p linear relationships among the residuals. The first of these p equations
is
e1 + . . . + en = 0
which proves that the sum of the residuals is zero (provided there is a constant term in the model).
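This property is easy to confirm numerically. A minimal sketch with illustrative data (numpy assumed; the data values are arbitrary):

```python
import numpy as np

# With a constant term (first column of X is 1), the residuals satisfy X'e = 0,
# and in particular they sum to zero. Illustrative data.
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0],
              [1.0, 8.0]])
y = np.array([1.0, 4.0, 2.0, 9.0])

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
e = y - H @ y                          # residual vector
print(X.T @ e)                         # numerically zero: p linear restrictions on e
assert abs(e.sum()) < 1e-10
```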
The following are some of the basic assumptions of the linear model:
(ii) Homogeneity of variances: We assume that all elements of y have the same variance (or,
at least, that the variances are known up to a common constant factor).
(iii) Additivity: We assume that the effects of the design variables add up. For example, if
y = β1 x1 + β2 x2 + . . . + βp xp +
then, if x1 is increased by 1 unit and x2 is also increased by one unit, the result will be that
y is increased by β1 + β2 units.
Certain types of data are known not to satisfy all of these assumptions, and sometimes it is possible
to transform the data (for example replacing yi by log(yi )) in order to ensure that the assumptions
are satisfied, at least approximately. Consider the following illustration.
• Illustration 2.11.1
Suppose two types of insect repellent are tested on tomatoes and potatoes, and each treatment
is applied to five tomato and five potato plants. The response variable is the number of moths
visiting the plants during the hour after spraying. Suppose the results are as follows:
It seems that the variance is a function of the mean. The variance is plotted against the mean in
figure 2.12 and a linear relationship seems likely.
Figure 2.12:
The appropriate transformation in this situation is to replace yi by √(yi + 3/8) (see following
discussion) and, if we do this, we obtain the following:
If we now plot the variance against the mean (cf figure 2.13) we find that there is not a marked
relationship. We say that we have stabilised the variance.
Note that this kind of graphical analysis is possible only if a number of sets of repetitions at iden-
tical conditions are available (that is if X contains a number of sets of identical rows).
Figure 2.13:
We will now discuss a number of standard transformations, their use and their effect.
Suppose each yi was obtained as follows: a number of trials, say ni trials, each resulted in
either a success or a failure; let ui be the number of successes, that is ui is a binomial variable
with parameters ni and probability of success equal to πi , say.
Let yi = ui/ni = (number of successes)/(number of trials) = proportion of successes.
E(ui ) = ni πi ; Var(ui ) = ni πi (1 − πi )
Thus we see that the variance of yi is a function of the expected value of yi (and of ni ). The
famous statistician, Sir Ronald Fisher, discovered that the random variable
zi = 2 arcsin √yi (= 2 sin⁻¹ √yi)

has variance approximately equal to 1/ni (which does not depend on πi). Remember that
the angles are in radians in this definition. It has therefore become standard practice to
apply the arcsin transformation to proportions before analysing the linear model. Note that
Var(zi ) is still a function of ni , the number of trials involved. If the number of trials is not
the same throughout the experiment, one will have to apply weighted least squares. Special
problems occur when y = 0 or 1, and some tables exist which cater for this situation. If you
are unfamiliar with the arcsine (sin−1 ) function consult the textbook:
With regard to STA3701 it is sufficient to be able to use the sin−1 function which is found
on most pocket calculators.
Note that: 1 radian = 180/π degrees and 1 degree = π/180 radians,
but it will not be used in this module.
Data which consist of counts (like the number of ticks on a sheep) or index numbers often
have a tendency for the standard deviation to be proportional to the mean. In such cases
the recommended variance stabilising transformation is z = log10 y or z = loge y. Since
special problems occur when the counts are small numbers, the transformation is changed to
z = log(1 + y) when the counts are low.
The effect of this transformation is also to change a moderately skew distribution into a fairly
symmetric one which resembles the normal distribution more closely.
This information is also used when changes in the design variables cause proportional changes
in the response. In such cases a likely model is
y = αxβ
Suppose y is a Poisson variable, which is usually associated with the number of arrivals (for
example cars, telephone calls, customers, et cetera) in a fixed period of time.
We know that
E(y) = Var(y).
Thus the mean is equal to the variance. In such cases where the variance is proportional to
the mean, the appropriate transformation is
√
z= y
This transformation has the effect of changing quite skew distributions into fairly symmetric
ones.
If the standard deviation of y increases even more rapidly than E(y), namely proportionally
to (E(y))², the transformation

z = 1/y
is often recommended for stabilising the variance. This transformation has the effect of
changing a very skew distribution into a fairly symmetric one.
2.12 Illustration
Let X′ = [1 1 1 1 1 1 1 1 1; 1 1 1 0 0 0 0 0 0; 0 0 0 1 1 1 0 0 0; −1 0 1 −1 0 1 −1 0 1]
y 0 = (19 39 50 25 24 32 23 28 39).
Such data could for example be the result of an experiment to test the effect of fertilizers on the
yield of a hectare of maize. Column 1 of X (that is row 1 of X′) provides for a constant term (we
do not expect zero yield if no fertilizer is used). Column 2 of X denotes District A and column
three District B (thus there are three districts involved in the experiment). Column 4 denotes the
fertilizer dosage. Suppose three dosages are used, namely 0 tons/ha, 10 tons/ha and 20 tons/ha.
Then define x4 = (dosage - 10)/10. Furthermore, y is the yield in bags/ha.
We compute

X′X = [1 1 1 1 1 1 1 1 1; 1 1 1 0 0 0 0 0 0; 0 0 0 1 1 1 0 0 0; −1 0 1 −1 0 1 −1 0 1] × [1 1 0 −1; 1 1 0 0; 1 1 0 1; 1 0 1 −1; 1 0 1 0; 1 0 1 1; 1 0 0 −1; 1 0 0 0; 1 0 0 1]

    = [9 3 3 0; 3 3 0 0; 3 0 3 0; 0 0 0 6].
2. THE LINEAR MODEL 96
Row-reducing (X′X | I4):

[9 3 3 0 | 1 0 0 0]
[3 3 0 0 | 0 1 0 0]
[3 0 3 0 | 0 0 1 0]
[0 0 0 6 | 0 0 0 1]

(1/3)(R1 − R2 − R3): [1 0 0 0 | 1/3 −1/3 −1/3 0]
(1/3)R2:             [1 1 0 0 | 0 1/3 0 0]
(1/3)R3:             [1 0 1 0 | 0 0 1/3 0]
(1/6)R4:             [0 0 0 1 | 0 0 0 1/6]

then

(1/3)R2 − new R1:    [0 1 0 0 | −1/3 2/3 1/3 0]
(1/3)R3 − new R1:    [0 0 1 0 | −1/3 1/3 2/3 0]

Thus, (X′X)⁻¹ = [1/3 −1/3 −1/3 0; −1/3 2/3 1/3 0; −1/3 1/3 2/3 0; 0 0 0 1/6]
X′y = [1 1 1 1 1 1 1 1 1; 1 1 1 0 0 0 0 0 0; 0 0 0 1 1 1 0 0 0; −1 0 1 −1 0 1 −1 0 1][19; 39; 50; 25; 24; 32; 23; 28; 39] = [279; 108; 81; 54]

β̂ = (X′X)⁻¹X′y = [1/3 −1/3 −1/3 0; −1/3 2/3 1/3 0; −1/3 1/3 2/3 0; 0 0 0 1/6][279; 108; 81; 54] = [30; 6; −3; 9]
We have ŷ = Xβ̂, so that

ŷ′ = (27 36 45 18 27 36 21 30 39)
e′ = y′ − ŷ′ = (−8 3 5 7 −3 −4 2 −2 0)
e′1 = 0, as is to be expected
e′e = 180
σ̂² = e′e/(n − p) = 180/(9 − 4) = 36
σ̂ = 6.
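The fit of this illustration can be reproduced in a few lines. A minimal sketch (numpy assumed):

```python
import numpy as np

# Reproducing the fertilizer illustration of section 2.12.
X = np.array([[1, 1, 0, -1], [1, 1, 0, 0], [1, 1, 0, 1],
              [1, 0, 1, -1], [1, 0, 1, 0], [1, 0, 1, 1],
              [1, 0, 0, -1], [1, 0, 0, 0], [1, 0, 0, 1]], dtype=float)
y = np.array([19, 39, 50, 25, 24, 32, 23, 28, 39], dtype=float)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
s2 = (e @ e) / (9 - 4)

print(beta_hat)    # approximately [30, 6, -3, 9]
print(e @ e, s2)   # approximately 180 and 36
```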
A 95% confidence interval for β4 is 9 ± √(1/6) (6) t0.025;5 = 9 ± (2.4495)(2.571), that is

(2.7024 ; 15.2976).

For each additional ten tons of fertilizer a yield increase of between 2.7 and 15.3 bags/ha can be
expected. Normally such an experiment would of course be conducted on a larger scale and a much
narrower confidence interval would be obtained.
A joint confidence region for β2 and β3 is given by

[β̂2 − β2  β̂3 − β3] T⁻¹ [β̂2 − β2; β̂3 − β3] ≤ σ̂² 2Fα;2;5

∴ [6 − β2  −3 − β3][2 −1; −1 2][6 − β2; −3 − β3] ≤ (36)(2)(5.79)

with α = 0.05.
This represents an ellipse. In order to sketch the ellipse, choose various values of β2 and solve for
β3 from the quadratic equation each time. For example, if β2 = 6 we have
−14.44 ≤ β3 + 3 ≤ 14.44
−17.44 ≤ β3 ≤ 11.44.
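This slice of the ellipse can be checked with one line of arithmetic. A minimal sketch:

```python
import math

# At beta2 = 6 the ellipse inequality reduces to 2(beta3 + 3)^2 <= (36)(2)(5.79) = 416.88.
bound = math.sqrt(416.88 / 2)
print(round(bound, 2))   # 14.44, giving -17.44 <= beta3 <= 11.44
```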
H0 : β2 − β3 = 0; β1 + β2 − β4 = 25
(that is the expected yield in Districts A and B are identical and the expected yield in District A
with no fertilizer is 25 bags/ha).
∴ Kβ = m.
SSH = 561/4

∴ f = SSH/(hσ̂²) = (561/4)/(2 × 36) = 1.9479
The 5% critical value is F0,05;2;5 = 5.79. Reject H0 if f > 5.79.
Since 1.9479 < 5.79, H0 is not rejected at the 5% level of significance.
Example 2.8.
An experiment was performed to determine whether three factors have an effect on the taste of
sosaties: the type of lamb (lean or fat); the time the meat is kept in the marinade (1, 2 or 3 days); and
the temperature of the oven (190°, 205° and 220°C). Ten dishes of sosaties were prepared according
to different recipes and each presented to a different panel of gourmets. The panel awarded a
coefficient of tastiness (y) to each. The results were as follows:
(a) Fit a linear model with constant term; let x₂ = 1 for fat meat and x₂ = 0 for lean;
x₃ = days − 2; x₄ = (temp − 205)/15.
Solution 2.8.
\[
x_{i3} = \text{Days} - 2, \qquad x_{i4} = \frac{\text{Temp} - 205}{15}
\]
\[
X = \begin{bmatrix}
1 & 1 & 1 & 0 \\
1 & 1 & 0 & 1 \\
1 & 0 & 0 & 1 \\
1 & 1 & 1 & -1 \\
1 & 0 & 0 & -1 \\
1 & 1 & -1 & -1 \\
1 & 1 & -1 & 1 \\
1 & 0 & 1 & 1 \\
1 & 0 & 0 & -1 \\
1 & 0 & -1 & 0
\end{bmatrix}
\quad\text{and}\quad
y' = \begin{pmatrix} 16 & 19 & 20 & 43 & 27 & 25 & 22 & 15 & 29 & 9 \end{pmatrix}
\]

\[
X'X = \begin{bmatrix}
10 & 5 & 0 & 0 \\
5 & 5 & 0 & 0 \\
0 & 0 & 6 & 0 \\
0 & 0 & 0 & 8
\end{bmatrix}
\]
Row reduction of (X'X | I) (divide to obtain leading ones: \tfrac15(R_1 - R_2), \tfrac15 R_2, \tfrac16 R_3, \tfrac18 R_4, and then R_2 - R_1) gives

\[
(X'X)^{-1} = \begin{bmatrix}
\tfrac15 & -\tfrac15 & 0 & 0 \\
-\tfrac15 & \tfrac25 & 0 & 0 \\
0 & 0 & \tfrac16 & 0 \\
0 & 0 & 0 & \tfrac18
\end{bmatrix}
\]
\[
X'y = \begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 0 & 1 & 0 & 1 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 & -1 & -1 & 1 & 0 & -1 \\
0 & 1 & 1 & -1 & -1 & -1 & 1 & 1 & -1 & 0
\end{bmatrix}
\begin{bmatrix} 16 \\ 19 \\ 20 \\ 43 \\ 27 \\ 25 \\ 22 \\ 15 \\ 29 \\ 9 \end{bmatrix}
= \begin{bmatrix} 225 \\ 125 \\ 18 \\ -48 \end{bmatrix}
\]
\[
\hat\beta = (X'X)^{-1}X'y
= \begin{bmatrix}
\tfrac15 & -\tfrac15 & 0 & 0 \\
-\tfrac15 & \tfrac25 & 0 & 0 \\
0 & 0 & \tfrac16 & 0 \\
0 & 0 & 0 & \tfrac18
\end{bmatrix}
\begin{bmatrix} 225 \\ 125 \\ 18 \\ -48 \end{bmatrix}
= \begin{bmatrix} 20 \\ 5 \\ 3 \\ -6 \end{bmatrix}
\]
Then

ŷ' = (Xβ̂)' = (28 19 14 34 26 28 16 17 26 17).

We have

e' = y' − ŷ' = (−12 0 6 9 1 −3 6 −2 3 −8)

Σᵢ eᵢ = 0, as is to be expected

e'e = 384

σ̂² = e'e/(n − p) = 384/(10 − 4) = 64

σ̂ = 8.
\[
\hat\beta_i \pm t_{\frac{\alpha}{2};n-p}\sqrt{a_{ii}}\,\hat\sigma.
\]
For β₃:
\[
\hat\beta_3 \pm t_{\frac{\alpha}{2};n-p}\sqrt{a_{33}}\,\hat\sigma
= 3 \pm 1.943 \times \sqrt{\tfrac16} \times 8
= 3 \pm 6.3458,
\]
that is (−3.3458 ; 9.3458).
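A quick numerical check of β̂ and of the half-width of the interval for β₃ (the t value 1.943 is taken from the text):

```python
from fractions import Fraction as F
import math

# (X'X)^{-1} and X'y as computed above for the sosatie example.
A = [[F(1, 5), F(-1, 5), 0, 0],
     [F(-1, 5), F(2, 5), 0, 0],
     [0, 0, F(1, 6), 0],
     [0, 0, 0, F(1, 8)]]
Xty = [225, 125, 18, -48]

beta = [sum(a * v for a, v in zip(row, Xty)) for row in A]
print([int(b) for b in beta])        # → [20, 5, 3, -6]

# Half-width of beta3_hat +/- t * sqrt(a33) * sigma_hat,
# with t = 1.943 and sigma_hat = 8 as in the text.
half_width = 1.943 * math.sqrt(1 / 6) * 8
print(round(half_width, 4))          # → 6.3458
```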
\[
T^{-1} = \begin{bmatrix} \tfrac16 & 0 \\ 0 & \tfrac18 \end{bmatrix}^{-1}
= \begin{bmatrix} 6 & 0 \\ 0 & 8 \end{bmatrix}
\]

Now

\[
\begin{pmatrix} \hat\beta_3-\beta_3 & \hat\beta_4-\beta_4 \end{pmatrix} T^{-1}
\begin{pmatrix} \hat\beta_3-\beta_3 \\ \hat\beta_4-\beta_4 \end{pmatrix}
\le \hat\sigma^2\, 2F_{\alpha;2;n-p}
\]
\[
\begin{pmatrix} 3-\beta_3 & -6-\beta_4 \end{pmatrix}
\begin{bmatrix} 6 & 0 \\ 0 & 8 \end{bmatrix}
\begin{pmatrix} 3-\beta_3 \\ -6-\beta_4 \end{pmatrix}
\le 64 \times 2 \times 5.14
\]
\[
6(3-\beta_3)^2 + 8(-6-\beta_4)^2 \le 657.92
\]
(d) H₀: β₂ = 0; β₃ = 9; β₁ + β₂ + β₃ − β₄ = 50, that is

\[
H_0:\ \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 1 & 1 & -1 \end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{bmatrix}
= \begin{bmatrix} 0 \\ 9 \\ 50 \end{bmatrix}
\]

∴ Kβ = m.
\[
K\hat\beta - m
= \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 1 & 1 & -1 \end{bmatrix}
\begin{bmatrix} 20 \\ 5 \\ 3 \\ -6 \end{bmatrix}
- \begin{bmatrix} 0 \\ 9 \\ 50 \end{bmatrix}
= \begin{bmatrix} 5 \\ 3 \\ 34 \end{bmatrix} - \begin{bmatrix} 0 \\ 9 \\ 50 \end{bmatrix}
= \begin{bmatrix} 5 \\ -6 \\ -16 \end{bmatrix}
\]
\[
K(X'X)^{-1}K'
= \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 1 & 1 & -1 \end{bmatrix}
\begin{bmatrix}
\tfrac15 & -\tfrac15 & 0 & 0 \\
-\tfrac15 & \tfrac25 & 0 & 0 \\
0 & 0 & \tfrac16 & 0 \\
0 & 0 & 0 & \tfrac18
\end{bmatrix}
\begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & -1 \end{bmatrix}
= \begin{bmatrix}
\tfrac25 & 0 & \tfrac15 \\
0 & \tfrac16 & \tfrac16 \\
\tfrac15 & \tfrac16 & \tfrac{59}{120}
\end{bmatrix}
\]
Inverting by row reduction (as before) gives

\[
(K(X'X)^{-1}K')^{-1} = \begin{bmatrix}
\tfrac{65}{18} & \tfrac{20}{9} & -\tfrac{20}{9} \\
\tfrac{20}{9} & \tfrac{94}{9} & -\tfrac{40}{9} \\
-\tfrac{20}{9} & -\tfrac{40}{9} & \tfrac{40}{9}
\end{bmatrix}.
\]
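The inversion above, and the resulting test statistic, can be checked as follows (a sketch; the 5% critical value F must still be read from an F table):

```python
from fractions import Fraction as F

def inverse(A):
    """Gauss-Jordan inversion with exact rational arithmetic."""
    n = len(A)
    M = [[F(x) for x in row] + [F(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for i in range(n):
        p = next(r for r in range(i, n) if M[r][i] != 0)  # pivot row
        M[i], M[p] = M[p], M[i]
        M[i] = [x / M[i][i] for x in M[i]]
        for r in range(n):
            if r != i and M[r][i] != 0:
                M[r] = [x - M[r][i] * y for x, y in zip(M[r], M[i])]
    return [row[n:] for row in M]

# K (X'X)^{-1} K' from the text, and K beta_hat - m = (5, -6, -16)'.
T = [[F(2, 5), 0, F(1, 5)],
     [0, F(1, 6), F(1, 6)],
     [F(1, 5), F(1, 6), F(59, 120)]]
d = [5, -6, -16]

Tinv = inverse(T)
print(Tinv[0][0], Tinv[1][1], Tinv[2][2])   # diagonal: 65/18 94/9 40/9

SSH = sum(d[i] * Tinv[i][j] * d[j] for i in range(3) for j in range(3))
f = float(SSH / (3 * 64))   # h = 3 constraints, sigma_hat^2 = 64
print(round(f, 4))          # ≈ 5.0674
```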
Now

SSH = (Kβ̂ − m)'(K(X'X)^{-1}K')^{-1}(Kβ̂ − m) = 17513/18 = 972.94

∴ f = SSH/(hσ̂²) = 972.94/(3 × 64) = 5.0674.

The 5% critical value is F_{0.05;3;6} = 4.76. Since 5.0674 > 4.76, H₀ is rejected at the 5% level of significance.
Exercise 2.1

1. Let
\[
X' = \begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
-1 & -1 & -1 & 0 & 0 & 0 & 1 & 1 & 1 \\
-1 & 0 & 1 & -1 & 0 & 1 & -1 & 0 & 1
\end{bmatrix}
\]
and y' = (18 20 29 17 20 27 11 28 20).
(d) Construct a 95% joint confidence region for β2 and β3 . Would you conclude that β2 =
β3 = 0?
(f) Plot
(i) e against ŷ
(ii) e against y
2. In an experiment to investigate the mass increase of young oxen under specific feeding con-
ditions, the following sample of size 12 was used:

Breed        Age   Feed   Mass
Sussex        6     A     175    y₁
Beefmaster   12     B     255    y₂
Simmentaler  18     A     350    y₃
Brahman       6     B     190    y₄
Sussex       12     A     230    y₅
Beefmaster   18     B     345    y₆
Simmentaler   6     A     200    y₇
Brahman      12     B     225    y₈
Sussex       18     A     320    y₉

Age is given in months, mass in kilogram; feed type A is pasture (grazing) and feed type B
is pasture and feeding paddock.
(i) state which analysis you would apply, namely analysis of variance, regression analysis
or analysis of covariance (substantiate your answers);
\[
\text{(a) } X = \begin{bmatrix} 1 & 1 & -2 \\ 1 & 1 & 2 \\ 1 & 0 & 4 \\ 1 & 0 & -4 \end{bmatrix}
\qquad
\text{(b) } X = \begin{bmatrix} 1 & 0 & 15 \\ 1 & 0 & 10 \\ 0 & 1 & 16 \\ 0 & 1 & 32 \end{bmatrix}
\qquad
\text{(c) } X = \begin{bmatrix} 1 & 10 \\ 1 & 12 \\ 1 & 14 \\ 1 & 16 \end{bmatrix}
\]
(a) State whether you would not reject or reject each of the following hypotheses and say
why:
(i) β1 = 0, β2 = 0
(ii) β1 = 0
(iii) β2 = 0
β4 = 0
β2 − β3 = 0
β2 − β3 + β4 = 0
(b) Test H0 : β1 = 5.
6. In an experiment, the mass change of newborn babies, fed on four different milk formulas
was recorded (after a set period of time) as follows:
A B C D
142 76
50
(b) Test, at the 5% level of significance, whether there is a difference between the four milk
formulas.
NB: Write the hypothesis in the form H₀: Kβ = m and use f = SSH/(hσ̂²).

(c) Compute a 95% confidence interval for the difference in means between formulas B and
C, that is µ_B − µ_C.
ANALYSIS OF VARIANCE
3.1 Introduction
A typical experiment, the results of which can be analysed by means of the technique known as
analysis of variance, is the following: a number of experimental units (people, cattle, potato lands,
trees, et cetera) are subjected to a number of treatments (diets, medicines, fertilizer applications,
et cetera), each at a number of levels (the three levels of the treatment diet could be two, three
and four meals per day; the four levels of fertilizer could be no fertilizer, 1 ton/ha, 2 tons/ha and 4
tons/ha; et cetera). Sometimes there are other factors present, called blocking factors, which divide
the experimental units into groups which are more uniform (before the experiment starts) than
all the units jointly. Such a factor has two or more levels (for example boys and girls; Leghorns,
Black Australorps and Rhode Island Reds; married, divorced, widowed and single).
After one or more treatments have been applied to such groups of experimental units, the result
is observed. This result is called the yield (for example body weight of experimental person, the
yield of potatoes in kg/ha, the egg production of hens). The purpose of the analysis of variance
is to test whether the treatment had an effect on the yield, and sometimes whether the groups
defined by the blocking factors differ with respect to yield.
One of the problems which faces a statistician who acts as consultant to a research worker, is the
fact that the experiment is often completed before he or she is consulted, with the result that the
115
experiment is usually poorly planned. The statistician’s client does not always tell the statistician
about all the important factors, because the client possibly does not know that they are impor-
tant. It is the duty of the statistician to question his or her client about the way in which the
experiment was performed, in order to try and find out whether such factors do indeed exist or not.
The manner in which the experimental units are assigned to the treatment levels is also very im-
portant. It is always desirable that these be assigned randomly.
If there is only one factor (treatment or blocking factor) present, we speak of a one-way clas-
sification and the corresponding analysis is termed one-way analysis of variance. If there are
two factors, we have a two-way classification and we speak of two-way analysis of variance, and so
on. These names originate from the arrangement of the data in tables as in the following examples.
When discussing the analysis of these examples, including analyses of chapters 4 and 5, some out-
put from a standard package for personal computers, called SAS JMP, will be shown. You should
make sure that you are able to interpret these results. One of the techniques used in SAS JMP
is a plot of treatment means diamonds. The line across each diamond represents the group mean.
The vertical span of each diamond represents the 95% confidence interval for each group. The
standard error of any mean is given by s/√nᵢ, where s² is the pooled estimate of the error variance.
Suppose six potato tubers were planted during each of the four moon phases. The purpose is to
test whether the yield of the plants is influenced by the phase of the moon at planting time. The
results are as follows (yield in grams per plant):
This is a typical example of a one-way classification, and will be analysed in section 3.4.
This experiment may be refined. One source of variation in potato yield is the season: apart
from the possible effect of the moon phase, there is also an optimum planting time with respect
to seasons. To a certain extent the difference in yield found in the experiment may be ascribable
to seasonal and climatic differences. It would be better if the experiment were repeated over a
number of moon months and possibly over a number of years.
A housewife does her grocery shopping once a week. Once a month she buys the main items and
is then able to compare the prices of a number of supermarkets before buying. In the other three
weeks of the month she buys only the week’s supplies of perishable foods, and she would like to
decide on one of the three supermarkets in her area for this shopping. She therefore decides to buy
at each of the three supermarkets in turn once every month for four months in order to compare
prices. Her results are as follows (prices to the nearest rand):
Month Supermarket
A B C
May 34 18 29
June 30 20 34
July 27 30 36
August 33 32 37
One source of variation is of course the fact that her shopping list varies from week to week; a
more sensitive analysis might be possible if she recorded only those items which remain on her list
from week to week; the results would, however, not answer her question fully.
A man has a choice of three routes to drive to work. He decides to perform an experiment in order
to enable him to select a route. He knows that his travelling time depends on the day of the week
and on the time he leaves home. He performs his experiment over a period of nine weeks, and
allocates the routes and departure times randomly to the days of each week. His travelling times
(in minutes) for the nine weeks are as follows:
(i) The investigator would of course try to restrict his or her investigations to “normal” weeks,
ie weeks without public holidays, weeks which do not coincide with school holidays. Weeks
in which the last day of the month falls should possibly be avoided as well, and might be the
subject of a separate study.
(ii) It is possible that the traffic volume may increase slightly over the nine weeks. Also, the
performance of a car may deteriorate gradually after being serviced. For these reasons it is
important that the routes and departure times be allocated randomly to the days and weeks.
If route I were selected the first three weeks, route II the next three weeks and route III the
last three weeks, then an observed difference in travelling times may be due to a difference
in routes or changes in traffic volume or a change in car performance or other factors.
(iii) It is also possible to select weeks as a fourth factor and thus obtain a four-way classification.
However, the model would then be much more complicated. (See nested experiments in
section 3.3.)
We first define the concept cell. Each combination of levels of the factors in an experiment is called
a cell. In section 3.1.1 there are four cells (the four moon phases); in section 3.1.2 there are 4×3
= 12 cells (eg Supermarket B in July, Supermarket C in May); in section 3.1.3 there are 3×3×5 =
45 cells. Depending on the number of observations (replicates) in each cell, experiments may be
classified as follows:
(a) Experiments in which observations do not occur in every cell. Apart from poor planning,
there may be two reasons for this:
(i) The experiment has been designed specifically in such a way that some cells remain
empty. This may have been done because the experimenter is unable to cope with as
many experimental units as there are cells. The experiment is then designed in such a
way that a precise analysis is still possible.
(ii) Some observations may have been lost; a rabbit ate one of the plants, a test tube broke,
a laboratory mouse died or a patient decided to discontinue his or her treatment. In such
cases it is still possible to analyse the results, but the analysis is much more complicated.
The missing observations may be estimated by means of least squares, and the number
of degrees of freedom in the error sum of squares is decreased accordingly.
(b) Balanced experiments, in which there are equal numbers of observations per cell. We
distinguish between two cases:
(i) One observation per cell: Sections 3.1.2 and 3.1.3 are such examples. As will be seen,
there is some information which is not contained in such an experiment; in particular
it is not possible to ascertain whether the model is an adequate representation of the
data.
(ii) More than one observation per cell: This is the most desirable type of experiment,
firstly because the model may be verified and secondly because the computations are
not too complicated.
(c) Unbalanced experiments in which there are unequal numbers of observations per cell. It
is possible to analyse the data in a meaningful way, but the algebra, the notation and the
computations are much more complicated.
y = Xβ + ε
However, sometimes some elements of β may be random variables, and the distributions of the
quadratic forms may differ from those derived in chapter 2.
Consider a factor with k levels. If these k levels constitute the complete set of levels in which one
is interested, the effects of the levels of the factor on the yield are called fixed effects. In section
3.1.1 we were interested only in the four phases of the moon, and observations were made during
every phase. In section 3.1.2 we were interested in the three supermarkets only, and in section
3.1.3 in the three routes only.
On the other hand, a factor may exist at a large number of levels and experiments are performed
at a randomly selected set of these levels. The effects of these levels are called random effects.
A simple example of a random effect is the following: A farmer wishes to test the effects of four
diets on the body mass of piglets. He knows that piglets of the same litter are more uniform (before
treatment) than all the piglets jointly. He therefore selects five litters at random and four piglets
at random from each litter. One piglet is assigned at random to each diet. The results (mass of
Litters Diets
A B C D
I 30 51 40 31
II 51 56 45 48
III 44 59 57 36
IV 40 47 42 39
V 50 57 41 36
If the farmer wants to draw conclusions about these specific five litters only, he may regard the
factor “Litters” as fixed; however, if he wants to draw conclusions about the population of litters
from which the five litters may be regarded as a random sample, the factor “Litters” must be
regarded as a random factor.
Another typical example is a large consignment of wool (or other produce). A random sample of
bags is selected, and then a number of samples from each bag is selected and subjected to various
treatments. The investigator wants to make statements about the whole consignment.
We may distinguish further between factors with levels which are selected at random from finite
(small) populations and infinite (large) populations. We shall concentrate mainly on the theory of
large populations.
Thus we may regard any factor in an experiment as fixed or random. Experiments may be classified
accordingly:
(a) Fixed effects model or model I: all factors (treatments or blocking factors) are fixed.
(b) Random effects model or model II: all factors are random.
(c) Mixed model: some factors are fixed and some are random.
Example 3.1.
Determine in the following scenarios whether the factors are fixed or random effects:
(a) In an effort to improve the quality of video tapes, the effects of four kinds of treatment A, B,
C and D on reproducing quality of image are compared.
(b) Pieces of skin were exposed to four randomly chosen light intensities for a fixed period. After
exposure the elasticity of the skin was measured. The purpose was to compare the effects of
the four intensities on the elasticity of the skin.
(c) An investigator wishes to evaluate the effect on reaction time of medicine administered for
colds. Data were recorded that represent the reaction (in seconds) on a stimulus one hour
after each of four randomly chosen brands of this type of medicine was administered to ten
people.
(d) A study was done as to how the concentration of a certain drug in the blood, 24 hours after
being injected, is affected by age and sex. An analysis of the blood samples of 40 people
who had received the drug yielded concentration (in milligrams per cubic centimetre) for age
groups 11-25 years, 26-40 years, 41-65 years and >65 years.
Solution 3.1.

(a) The four kinds of treatment A, B, C and D constitute the complete set of treatments of interest, so the factor treatment is a fixed effect.

(b) The four light intensities were chosen at random, so the factor light intensity is a random effect.

(c) The four brands of medicine were chosen at random, so the factor brand is a random effect.

(d) The factor gender is a fixed effect and the factor age is a fixed effect.
To test hypotheses regarding fixed effects, we shall use the method of testing linear hypotheses in
section 2.7. In the case of random effects, we shall test hypotheses of the form H₀: σ²_A = 0 (which
is equivalent to testing H₀: σ²_B + cσ²_A = σ²_B, where c is a positive constant). The procedure for
testing such a hypothesis is as follows:
We know that

\[
f \sim \left(1 + c\,\frac{\sigma_A^2}{\sigma_B^2}\right) F_{r_i;r_j}.
\]

Thus f ∼ F_{r_i;r_j} if and only if σ²_A = 0 (that is, if and only if H₀ is true). If σ²_A > 0, we may expect f
to be larger than the F_{r_i;r_j} statistic. Hence, if f > F_{α;r_i;r_j}, reject H₀: σ²_A = 0 at the 100α% level.
For each model (I, II or mixed), we shall set up an analysis of variance table. The relevant quadratic
forms which make up the ratio f will be obtained from the "mean square" or "MS" column of the
table. These mean squares can be shown to be independent and, apart from scale factors, chi-square
variates. It should become apparent that the ratio f will be formed from two mean squares, say MSᵢ
and MSⱼ, which are chosen in such a way that E(MSᵢ) = E(MSⱼ) if and only if the null hypothesis
of interest is true. This will ensure that f has a central F-distribution under H₀.
The above procedures will be used to derive the results for the one-way models in detail. The
results presented for the two-way models can be derived analogously.
3.3.2.1 Illustration 1
Suppose there are six comparable grade eight classes in a large school, and six teachers who are
able to teach mathematics. Three methods of teaching mathematics are to be compared. Each
teacher is assigned (randomly) to one class to teach according to a prescribed method. After one
term five pupils are selected at random from each class and tested.
Superficially this may look like an ordinary two-way classification with five observations per cell.
However, the designations “Teacher 1” and “Teacher 2” are misleading, since each of the six classes
was taught by a different teacher. A more accurate exposition would be as follows (in order to
save space, the observations are not repeated, but each cross indicates a set of five observations):
Such an experiment is termed a hierarchical or nested experiment; in the above example we say
that the factor Teacher is nested in the factor Teaching method. In such experiments there are
empty cells, not because of a lack of time or facilities, but because the empty cells are not relevant.
Another similar example is the following:
3.3.2.2 Illustration 2
A farmer wants to determine the effect of three sprays on his peach trees. He selects 12 comparable
trees at random, and assigns them randomly to the three sprays. After a week he selects six leaves
at random from each tree (Question: how would he do that?) and tests the nitrogen content of
each leaf. The results are as follows:
Spray Tree
1 2 3 4
A 4.5 5.5 13.5 11.5
7.0 7.5 15.0 10.0
5.0 12.5 12.5 12.0
5.5 6.0 12.0 9.5
6.5 4.0 10.0 10.5
7.5 6.5 9.0 12.5
B 15.5 14.5 11.0 12.5
15.0 15.0 10.0 12.0
14.5 12.5 12.0 13.5
14.0 16.0 13.0 10.0
16.0 13.5 10.5 10.5
15.0 12.5 9.5 13.5
C 7.0 3.0 8.0 12.5
8.0 4.5 9.5 10.0
5.0 4.0 8.5 13.0
7.5 5.5 10.5 10.5
8.5 7.0 7.5 11.0
6.0 6.0 10.0 9.0
In this case we do not deal with the same four trees for every spray, but with 12 different trees,
and the experiment is a nested experiment.
The analysis
We now assume that there is one treatment or blocking factor present with k levels, and that
there are nᵢ experimental units at the i-th level. We also assume that these k levels represent
all levels of interest. The k levels define k populations, and the observations may be regarded as
independent random samples from the k populations. Let yᵢⱼ be the observation on the j-th individual
corresponding to the i-th level of the treatment or blocking factor. The model is

\[
y_{ij} = \mu_i + \varepsilon_{ij}, \qquad j = 1,\dots,n_i;\ i = 1,\dots,k. \tag{3.1}
\]

Let N = \sum_{i=1}^{k} n_i be the total number of observations.
The model may be written as y = Xβ + ε as follows:

\[
\begin{bmatrix} y_{11} \\ \vdots \\ y_{1n_1} \\ \vdots \\ y_{k1} \\ \vdots \\ y_{kn_k} \end{bmatrix}
= \begin{bmatrix}
1 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
1 & 0 & \cdots & 0 \\
 & & \vdots & \\
0 & 0 & \cdots & 1 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 1
\end{bmatrix}
\begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_k \end{bmatrix}
+ \begin{bmatrix} \varepsilon_{11} \\ \vdots \\ \varepsilon_{1n_1} \\ \vdots \\ \varepsilon_{k1} \\ \vdots \\ \varepsilon_{kn_k} \end{bmatrix}
\]
\[
X'X = \begin{bmatrix}
n_1 & 0 & \cdots & 0 \\
0 & n_2 & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & n_k
\end{bmatrix};
\qquad
(X'X)^{-1} = \begin{bmatrix}
\tfrac{1}{n_1} & 0 & \cdots & 0 \\
0 & \tfrac{1}{n_2} & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & \tfrac{1}{n_k}
\end{bmatrix}
\]
\[
X'y = \begin{bmatrix}
y_{11} + \cdots + y_{1n_1} \\
y_{21} + \cdots + y_{2n_2} \\
\vdots \\
y_{k1} + \cdots + y_{kn_k}
\end{bmatrix}
= \begin{bmatrix} y_{1.} \\ y_{2.} \\ \vdots \\ y_{k.} \end{bmatrix}
\]

say, where

\[
y_{i.} = \sum_{j=1}^{n_i} y_{ij}.
\]
\[
\therefore \hat\beta = (X'X)^{-1}X'y = \begin{bmatrix} \bar y_{1.} \\ \bar y_{2.} \\ \vdots \\ \bar y_{k.} \end{bmatrix}
\]

where \bar y_{i.} = \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij} = y_{i.}/n_i, the mean of the i-th sample.

Let
\[
\bar y_{..} = \frac{1}{N}\sum_{i=1}^{k}\sum_{j=1}^{n_i} y_{ij} = \sum_{i=1}^{k} n_i\bar y_{i.} \Big/ \sum_{i=1}^{k} n_i.
\]
Of importance in this model are the two quadratic forms q₄ and q₅ of chapter 2. The error sum of
squares is

\[
\begin{aligned}
q_4 &= y'(I - X(X'X)^{-1}X')y \\
&= y'y - y'X(X'X)^{-1}X'y \\
&= \sum_{i=1}^{k}\sum_{j=1}^{n_i} y_{ij}^2 - \sum_{i=1}^{k} n_i\bar y_{i.}^2 \\
&= \sum_{i=1}^{k}\left[\sum_{j=1}^{n_i} y_{ij}^2 - n_i\bar y_{i.}^2\right] \\
&= \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar y_{i.})^2
\end{aligned} \tag{3.2}
\]

If we write

\[
q_4 = \sum_{j=1}^{n_1}(y_{1j} - \bar y_{1.})^2 + \cdots + \sum_{j=1}^{n_k}(y_{kj} - \bar y_{k.})^2
\]
then we see that q4 measures the variation within samples. (Each term of q4 measures the variation
within one of the k samples.)
The hypothesis which one usually wants to test is that the k population means are all equal:
H0 : µ1 = µ2 = · · · = µk .
(In exceptional circumstances one may want to test µ1 = µ2 = · · · = µk = c with c specified, but
that problem will not be discussed here.)
This hypothesis may be written as the k − 1 equations µ₁ = µ₂; µ₂ = µ₃; ⋯; µ_{k−1} = µ_k.
\[
\begin{aligned}
q_5 &= \sum_{i=1}^{k} n_i(\bar y_{i.} - \bar y_{..})^2 \\
&= \sum_{i=1}^{k} n_i\bar y_{i.}^2 - N\bar y_{..}^2 \\
&= y'A_5 y = y'(B - C)y
\end{aligned} \tag{3.3}
\]
where C = 1(1'1)^{-1}1' is the N × N matrix with every element equal to 1/N, and

\[
B = \begin{bmatrix}
1(1'1)^{-1}1' & O & \cdots & O \\
O & 1(1'1)^{-1}1' & \cdots & O \\
\vdots & & \ddots & \vdots \\
O & O & \cdots & 1(1'1)^{-1}1'
\end{bmatrix}
\]

is block diagonal, the i-th diagonal block being the nᵢ × nᵢ matrix with every element equal to 1/nᵢ.
BB = B;  CC = C;  BC = CB = C

r(C) = 1;  r(B) = k;  r(A₅) = r(B − C) = k − 1, and

A₅A₅ = (B − C)(B − C) = BB − CB − BC + CC = B − C − C + C = A₅

which shows, as before, that q₅/σ² is a noncentral chi-square variate with k − 1 degrees of freedom
and noncentrality parameter

\[
\lambda_5 = \sum_{i=1}^{k} n_i(\mu_i - \bar\mu_.)^2/\sigma^2
\qquad\left(\text{where } \bar\mu_. = \frac{1}{N}\sum_{i=1}^{k} n_i\mu_i\right)
\]

which is equal to zero if and only if H₀ is true.
From chapter 2 we know that q₄ and q₅ are independent and that q₄/σ² is a (central) χ²_{N−k} variate.
This can also be shown directly by noting that

\[
q_4 = \sum_i\sum_j y_{ij}^2 - \sum_{i=1}^{k} n_i\bar y_{i.}^2 = y'Iy - y'By = y'(I - B)y,
\]

that r(I − B) = N − k and that (I − B)(B − C) = B − C − BB + BC = O.

Thus \frac{q_4}{N-k} is an unbiased estimator of σ², and we write

\[
\hat\sigma^2 = q_4/(N-k) = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij} - \bar y_{i.})^2/(N-k).
\]
Consider q₅:

\[
q_5 = \sum_{i=1}^{k} n_i(\bar y_{i.} - \bar y_{..})^2,
\]
which is the weighted sum of squares of deviations of the sample means from the grand mean.
We say that q5 is the “between samples” sum of squares while q4 is the “within samples” sum of
squares. Furthermore, since
\[
I - C = (I - B) + (B - C)
\]
or A₂ = A₄ + A₅,
\[
\therefore y'(I - C)y = y'A_4y + y'A_5y
\]
or
\[
q_2 = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij} - \bar y_{..})^2
= \sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij} - \bar y_{i.})^2 + \sum_{i=1}^{k} n_i(\bar y_{i.} - \bar y_{..})^2
\]
\[
\therefore \text{SS}_{\text{Total}}^{\,1} = \text{SS}_{\text{within samples}} + \text{SS}_{\text{between samples}}.
\]
The test statistic is f = \dfrac{q_5/(k-1)}{q_4/(N-k)}. Thus H₀: µ₁ = … = µ_k is rejected if f > F_{α;k−1;N−k}.
The results are usually presented in tabular form as follows:

Source             SS∗   d.f.    MS∗            f
Between samples    q₅    k − 1   q₅/(k − 1)     MS_between/MS_within
Within samples     q₄    N − k   q₄/(N − k)
Total              q₂    N − 1
∗ SS = sum of squares; MS = mean square
1
This expression should, strictly speaking, be called “SSTotal adjusted for the mean ”. We adopt the more
concise “SSTotal ” from now on.
Multiple comparisons

If H₀: µ₁ = … = µ_k is rejected, one may want to test for specific differences; usually one wants
to test contrasts of the form H₀: \sum c_i\mu_i = 0 where \sum c_i = 0. In such cases one may retain an
overall significance level of α if H₀ is rejected when

\[
\frac{\left(\sum c_i\bar y_{i.}\right)^2}{\sum_i (c_i^2/n_i)} > (k-1)\hat\sigma^2 F_{\alpha;k-1;N-k}
\]

(this can be shown by specifying t' = (t₁ t₂ ⋯ t_{k−1}), where t_i = \sum_{r=1}^{i} c_r for i = 1, …, k − 1, in q_{t^*},
equations (2.7.5) and (2.7.7) of section 2.7).

In particular, H₀: µ_r = µ_s is rejected if

\[
\frac{(\bar y_{r.} - \bar y_{s.})^2}{\dfrac{1}{n_r} + \dfrac{1}{n_s}} > (k-1)\hat\sigma^2 F_{\alpha;k-1;N-k}.
\]
Example 3.2.
Using data in section 3.1.1:
(b) Test at the 5% level of significance if there is a difference in yields among the moon phases.
(c) Test at the 5% level of significance whether the mean yield of potatoes planted at full moon
is equal to the mean yield of potatoes planted at new moon.
(d) Test whether the mean yield of potatoes planted at full moon is equal to the mean yield of
potatoes planted during all other phases. Use α = 0.05.
Solution 3.2.
(a)
Treatments
       New moon   First quarter   Full moon   Last quarter
         201          189            206          205
         197          200            214          194
         190          199            202          202
         194          196            205          198
         200          194            203          200
         206          198            206          201
n          6            6              6            6
ȳᵢ.      198          196            206          200

ȳ.. = 200
\[
\begin{aligned}
\sum_j (y_{1j} - \bar y_{1.})^2 &= (201-198)^2 + (197-198)^2 + \dots + (206-198)^2 = 158 \\
\sum_j (y_{2j} - \bar y_{2.})^2 &= (189-196)^2 + (200-196)^2 + \dots + (198-196)^2 \\
&= (-7)^2 + 4^2 + 3^2 + 0^2 + (-2)^2 + 2^2 = 82 \\
\sum_j (y_{3j} - \bar y_{3.})^2 &= (206-206)^2 + (214-206)^2 + \dots + (206-206)^2 = 90 \\
\sum_j (y_{4j} - \bar y_{4.})^2 &= (205-200)^2 + (194-200)^2 + \dots + (201-200)^2 = 70 \\
q_4 &= \sum_i\sum_j (y_{ij} - \bar y_{i.})^2 = 158 + 82 + 90 + 70 = 400 \\
q_5 &= \sum_i n_i(\bar y_{i.} - \bar y_{..})^2 = 336
\end{aligned}
\]
Source SS d.f. MS f
Between 336 3 112 112/20 = 5.6
Within 400 20 20 = σ̂ 2
Total 736 23
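A minimal sketch reproducing the analysis-of-variance computations for these data:

```python
# One-way ANOVA for the moon-phase data of section 3.1.1.
data = {
    "new moon":      [201, 197, 190, 194, 200, 206],
    "first quarter": [189, 200, 199, 196, 194, 198],
    "full moon":     [206, 214, 202, 205, 203, 206],
    "last quarter":  [205, 194, 202, 198, 200, 201],
}

N = sum(len(v) for v in data.values())
k = len(data)
grand = sum(sum(v) for v in data.values()) / N

q4 = sum(sum((y - sum(v) / len(v)) ** 2 for y in v) for v in data.values())  # within
q5 = sum(len(v) * (sum(v) / len(v) - grand) ** 2 for v in data.values())     # between

MS_between = q5 / (k - 1)
MS_within = q4 / (N - k)        # = sigma_hat^2
f = MS_between / MS_within
print(q4, q5, f)                # → 400.0 336.0 5.6
```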
(b) H₀: µ₁ = µ₂ = µ₃ = µ₄ (H₀: α₁ = α₂ = α₃ = α₄ = 0). From the table f = 5.6, and the 5%
critical value is F_{0.05;3;20} = 3.10. Since 5.6 > 3.10, H₀ is rejected at the 5% level of significance.
(c) H₀: µ₁ = µ₃

Reject H₀ if
\[
\frac{(\bar y_{r.} - \bar y_{s.})^2}{\tfrac{1}{n_r} + \tfrac{1}{n_s}} > (k-1)\hat\sigma^2 F_{\alpha;k-1;N-k} = 3 \times 20 \times 3.10 = 186.
\]

Now
\[
\frac{(\bar y_{1.} - \bar y_{3.})^2}{\tfrac16 + \tfrac16} = \frac{(198-206)^2 \times 6}{2} = 192.
\]
Since 192 > 186, we reject H0 at the 5% level of significance and conclude that the mean
yield of potatoes planted at full moon is not equal to the mean yield of potatoes planted at
new moon.
(d) We want to test whether the mean yield of potatoes planted at full moon is equal to the mean
yield of potatoes planted during all other phases.
H₀: µ₃ = (µ₁ + µ₂ + µ₄)/3,

that is H₀: −⅓µ₁ − ⅓µ₂ + µ₃ − ⅓µ₄ = 0 ⇒ c₁ = −⅓, c₂ = −⅓, c₃ = 1 and c₄ = −⅓.

Then
\[
\frac{\left(\sum c_i\bar y_{i.}\right)^2}{\sum c_i^2/n_i}
= \frac{\left((-\tfrac13)(198) + (-\tfrac13)(196) + (1)(206) + (-\tfrac13)(200)\right)^2}
{(\tfrac19)(\tfrac16) + (\tfrac19)(\tfrac16) + (1)(\tfrac16) + (\tfrac19)(\tfrac16)} = 288.
\]
Since 288 > 186 we reject H0 at the 5% level of significance and conclude that the mean yield
of potatoes planted at full moon is not equal to the mean yield of potatoes planted during the
other phases.
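The contrast computations in (c) and (d) can be checked with a short function (the critical value F_{0.05;3;20} = 3.10 is read from tables):

```python
# Multiple-comparison (contrast) statistics for Example 3.2:
# (sum c_i ybar_i)^2 / sum(c_i^2 / n_i), compared with (k-1) sigma_hat^2 F.
ybar = [198, 196, 206, 200]
n = [6, 6, 6, 6]
sigma2, k, F_crit = 20, 4, 3.10

def contrast_stat(c):
    num = sum(ci * yi for ci, yi in zip(c, ybar)) ** 2
    den = sum(ci * ci / ni for ci, ni in zip(c, n))
    return num / den

threshold = (k - 1) * sigma2 * F_crit            # ≈ 186
print(contrast_stat([1, 0, -1, 0]))              # (c): ≈ 192 > 186, reject
print(contrast_stat([-1/3, -1/3, 1, -1/3]))      # (d): ≈ 288 > 186, reject
```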
The SAS JMP output for this example can be seen in figure 3.1. Compare the tables with the
answers obtained above. The graph shows clearly that the yield is highest during moon phase 3.
Note: If the confidence intervals (the points of the diamonds) do not overlap, the means are
significantly different. Conversely, if the means do not differ much, they all lie close to the
grand mean and the diamonds overlap.
The analysis rests on the following assumptions:

(i) normality

(ii) equality of variances

(iii) independence
Assumption (i) may be investigated using the N residuals yᵢⱼ − ŷᵢⱼ = yᵢⱼ − ȳᵢ.. The assumption
of equal variances may be tested formally, using a test known as Bartlett's test, or a test based on
the statistic \max_i s_i^2 / \min_i s_i^2,

where
\[
s_i^2 = \frac{1}{n_i - 1}\sum_{j=1}^{n_i}(y_{ij} - \bar y_{i.})^2.
\]
The assumption of independence is much more difficult to verify − one has to infer this from the
way in which the experiment was performed.
If (i) is not true, one may try a transformation or use a rank test called the Kruskal-Wallis test.
If (ii) is not true, one may try to stabilise the variance by means of a transformation; otherwise
one is faced with the Behrens-Fisher problem, for which approximate solutions exist. If (iii) is
not true, the problem is much more difficult to solve.
A reparameterisation

A form in which the one-way analysis of variance model is often presented is one with a constant
term:

\[
y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \qquad j = 1,\dots,n_i;\ i = 1,\dots,k,
\]

where, as before, the εᵢⱼ are independently n(0; σ²) distributed. Now there are k + 1 parameters instead
of k parameters. This introduces an indeterminacy, which is reflected in the fact that X in the
representation y = Xβ + ε does not have full column rank (the first column of X is equal to the
sum of the other columns):
\[
\begin{bmatrix} y_{11} \\ \vdots \\ y_{1n_1} \\ \vdots \\ y_{k1} \\ \vdots \\ y_{kn_k} \end{bmatrix}
= \begin{bmatrix}
1 & 1 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
1 & 1 & \cdots & 0 \\
 & & \vdots & \\
1 & 0 & \cdots & 1 \\
\vdots & \vdots & & \vdots \\
1 & 0 & \cdots & 1
\end{bmatrix}
\begin{bmatrix} \mu \\ \alpha_1 \\ \vdots \\ \alpha_k \end{bmatrix}
+ \begin{bmatrix} \varepsilon_{11} \\ \vdots \\ \varepsilon_{kn_k} \end{bmatrix}
\]
Since the model represents the same problem as the model in equation 3.1, we must have
µi = αi + µ, i = 1, . . . , k
that is αi = µi − µ.
The αᵢ are not uniquely defined in terms of the µᵢ; we still need to choose a suitable definition
of µ (subject to certain conditions). The usual choice is to let

\[
\mu = \frac{1}{k}\sum_{i=1}^{k}\mu_i
\]

in which case

\[
\sum\alpha_i = \sum(\mu_i - \mu) = 0.
\]

(Another choice is \mu = \frac{1}{N}\sum n_i\mu_i, in which case \sum n_i\alpha_i = 0.)
The parameters α1 , · · · , αk are termed the treatment effects, with αi the effect of the i-th treatment
level. The following two null hypotheses are equivalent:
H0 : µ1 = µ2 = . . . = µk , and
H0 : α1 = . . . = αk−1 = αk (= 0).
Also, H0 : µi = µj is equivalent to H0 : αi = αj .
You may recall from your earlier courses in statistics the distinction between statistics and
parameters. Statistics are numerical attributes that describe the characteristics of a sample, whilst
parameters are numerical attributes that describe the characteristics of a population. A statistic is
used to estimate a population parameter.

You also learnt that there are two types of estimates: point estimates and interval estimates. A
point estimate uses a single value to estimate a population parameter, whilst an interval estimate
uses a range of values to approximate a population parameter.
The values of µ, αi , σ 2 and ij in the model can be estimated from our data. These are related to
some of the summary statistics which investigators would initially calculate from their data.
\[
E(\bar y_{i.}) = \frac{(\mu + \alpha_i) + \dots + (\mu + \alpha_i)}{n}
= \frac{n\mu + n\alpha_i}{n} = \mu + \alpha_i
\]

\[
E(\bar y_{..}) = \frac{(\mu + \alpha_1) + \dots + (\mu + \alpha_1) + \dots\dots + (\mu + \alpha_k) + \dots + (\mu + \alpha_k)}{kn}
= \frac{kn\mu + n(\alpha_1 + \alpha_2 + \dots + \alpha_k)}{kn} = \mu
\]

(since α₁ + α₂ + … + α_k = 0).
Thus ȳ.. can be used to estimate µ, ȳi. can be used to estimate µ + αi and ȳi. − ȳ.. is an estimate
of αi . The population variance σ 2 is estimated by s2 , the pooled estimate of the variance.
In summary, the list of the parameters and their point estimate are as follows:
In certain cases the sample sizes are not equal. In this case the overall mean is given by

\[
\bar y_{..} = \frac{1}{N}\sum_{i=1}^{k}\sum_{j=1}^{n_i} y_{ij}
= \frac{\sum_{i=1}^{k} n_i\bar y_{i.}}{\sum_{i=1}^{k} n_i}. \tag{3.4}
\]
In order to calculate the confidence interval we need to derive the variance for each estimator.
\[
\bar y_{..} \pm t_{\frac{\alpha}{2}}(N-k)\times\sqrt{\operatorname{var}(\bar y_{..})}
\]
\[
\bar y_{..} \pm t_{\frac{\alpha}{2}}(N-k)\times\sqrt{\frac{s^2}{N}} \tag{3.6}
\]

where s² = MSE.
Var(α̂ᵢ) = Var(ȳᵢ. − ȳ..). The variance of the effect of the first category is Var(α̂₁) = Var(ȳ₁. − ȳ..), where

\[
\bar y_{1.} - \bar y_{..} = \bar y_{1.} - \frac{\sum n_i\bar y_{i.}}{N}
= \bar y_{1.} - \frac{n_1\bar y_{1.} + n_2\bar y_{2.} + n_3\bar y_{3.} + \dots + n_k\bar y_{k.}}{N}
= \frac{N - n_1}{N}\bar y_{1.} - \frac{n_2\bar y_{2.}}{N} - \frac{n_3\bar y_{3.}}{N} - \dots - \frac{n_k\bar y_{k.}}{N}.
\]

Thus, since the ȳᵢ. are independent with Var(ȳᵢ.) = σ²/nᵢ,

\[
\begin{aligned}
\operatorname{Var}(\bar y_{1.} - \bar y_{..})
&= \frac{(N-n_1)^2}{N^2}\operatorname{Var}(\bar y_{1.}) + \frac{n_2^2}{N^2}\operatorname{Var}(\bar y_{2.}) + \dots + \frac{n_k^2}{N^2}\operatorname{Var}(\bar y_{k.}) \\
&= \frac{(N-n_1)^2}{N^2}\,\frac{\sigma^2}{n_1} + \frac{n_2^2}{N^2}\,\frac{\sigma^2}{n_2} + \dots + \frac{n_k^2}{N^2}\,\frac{\sigma^2}{n_k} \\
&= \frac{\sigma^2}{N^2}\left(\frac{(N-n_1)^2}{n_1} + n_2 + n_3 + \dots + n_k\right) \\
&= \frac{\sigma^2}{N^2}\left(\frac{N^2 - 2Nn_1 + n_1^2}{n_1} + n_2 + \dots + n_k\right) \\
&= \frac{\sigma^2}{N^2}\left(\frac{N^2}{n_1} - 2N + n_1 + n_2 + \dots + n_k\right) \\
&= \frac{\sigma^2}{N^2}\left(\frac{N^2}{n_1} - 2N + N\right) \qquad\text{since } n_1 + n_2 + \dots + n_k = N \\
&= \frac{\sigma^2}{N^2}\left(\frac{N^2}{n_1} - N\right)
= \frac{\sigma^2}{N^2}\cdot\frac{N(N-n_1)}{n_1}
= \frac{\sigma^2}{N}\cdot\frac{N-n_1}{n_1}.
\end{aligned}
\]

In general,

\[
\operatorname{Var}(\hat\alpha_i) = \frac{\sigma^2}{N}\cdot\frac{N-n_i}{n_i}. \tag{3.7}
\]
Furthermore,

Var(α̂i − α̂j) = Var(ȳi. − ȳj.)
             = Var(ȳi.) + Var(ȳj.)   since they are independent
             = σ²/ni + σ²/nj
             = σ²(1/ni + 1/nj).   (3.9)
Similarly,

Var(µ̂ + α̂i) = Var(ȳi.) = σ²/ni .   (3.11)
From chapter 2 you will recall that σ̂² = q4/(N − k) and q4/σ² ∼ χ²_{N−k}.
Example 3.3.
Twenty-four plots of land were divided at random into four groups of six, and subjected to the
following four treatments:
Wheat was planted on each plot, and the yields were as follows:
Treatment (1): 66 89 62 30 51 74
Treatment (2): 72 86 95 94 74 89
Treatment (3): 47 54 62 53 71 49
Treatment (4): 81 77 69 61 68 82
(a) Select a model for this experiment and test the null hypothesis that there is no treatment
effect (5% level).
(b) Find a 95% confidence interval for the difference between the means of treatments 1 and 2.
(c) Find a 95% confidence interval for the difference between the mean of the control group and the mean of the other three treatments.
Solution 3.3.

yij = µi + εij ;  i = 1, 2, . . . , k;  j = 1, . . . , ni
Treatments
1 2 3 4
66 72 47 81
89 86 54 77
62 95 62 69
30 94 53 61
51 74 71 68
74 89 49 82
ni 6 6 6 6
ȳi. 62 85 56 73
N = 24   ȳ.. = 69
Now

Σj (y1j − ȳ1.)² = (66 − 62)² + (89 − 62)² + . . . + (74 − 62)² = 2 034
Σj (y2j − ȳ2.)² = (72 − 85)² + (86 − 85)² + . . . + (89 − 85)² = 488
Σj (y3j − ȳ3.)² = (47 − 56)² + (54 − 56)² + . . . + (49 − 56)² = 404
Σj (y4j − ȳ4.)² = (81 − 73)² + (77 − 73)² + . . . + (82 − 73)² = 346

q4 = Σi Σj (yij − ȳi.)² = 2 034 + 488 + 404 + 346 = 3 272
q5 = Σi ni (ȳi. − ȳ..)² = 2 940.
Source SS d.f. MS f
Between 2 940 3 980 5.9902
Within 3 272 20 163.6
Total 6 212 23
Since 5.9902 > 3.1, we reject H0 at the 5% level of significance and conclude that the means
are significantly different from each other.
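The ANOVA table above is easy to verify with a short Python sketch (standard library only; variable names are ours):

```python
# One-way ANOVA for the wheat-yield data of Example 3.3.
groups = [
    [66, 89, 62, 30, 51, 74],   # treatment 1
    [72, 86, 95, 94, 74, 89],   # treatment 2
    [47, 54, 62, 53, 71, 49],   # treatment 3
    [81, 77, 69, 61, 68, 82],   # treatment 4
]
k = len(groups)
N = sum(len(g) for g in groups)
means = [sum(g) / len(g) for g in groups]
grand = sum(sum(g) for g in groups) / N

q4 = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)    # within SS
q5 = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))  # between SS
f = (q5 / (k - 1)) / (q4 / (N - k))

print(q4, q5, round(f, 4))  # 3272.0 2940.0 5.9902
```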
α = 0.05,  α/2 = 0.025,  t_{α/2;N−k} = t_{0.025;20} = 2.086
nr = ns = 6,  ȳ1. = 62,  ȳ2. = 85  and  σ̂² = 163.6.
The 95% confidence interval for the difference between the means of treatments 1 and 2 is
(ȳr. − ȳs.) ± t_{α/2;N−k} × √(σ̂²(1/nr + 1/ns))
(62 − 85) ± 2.086 × √(163.6(1/6 + 1/6))
−23 ± 2.086 × √54.5333
3. ANALYSIS OF VARIANCE 147 STA3701/1
−23 ± 15.4044
(−38.4044 ; −7.5956).
(c) Let the control group be sample 1 and the other three groups be sample 2; then ȳ1. = 62, ȳ2. = 71.3333, n1 = 6 and n2 = 18.
The 95% confidence interval for the difference between the mean of the control group and the mean of the other three treatments is

(ȳ1. − ȳ2.) ± t_{α/2;N−k} × √(σ̂²(1/n1 + 1/n2))
(62 − 71.3333) ± 2.086 × √(163.6(1/6 + 1/18))
−9.3333 ± 2.086 × √36.3556
−9.3333 ± 12.5777
(−21.911 ; 3.2444).
(d) The means are 62, 85, 56 and 73. Potassium and nitrogen gave the highest yield, with nitrogen and phosphate the second highest. A lack of nitrogen (treatments 1 and 3) led to the poorest results.
Example 3.4.
Ten randomly selected mental institutions were examined to determine the effects of three different
antipsychotic drugs on patients with the same types of symptoms. Each institution used one and
only one of the three drugs exclusively for a one-year period. The proportion of treated patients in
each institution who were discharged after one year of treatment was as follows for each drug used:
(b) Test to see if there are any significant differences among drugs with regard to the average proportion of patients discharged.
(c) What basic ANOVA assumptions might be violated here? How would you test whether these assumptions are indeed violated (no need to test, just give a description or method)?
(i) The overall mean proportion of treated patients who were discharged for the three drugs.
(iv) α1 − α3 .
(v) σ 2 .
(e) Obtain 95% confidence intervals for each of the estimates in part (d) of the question. Use the
method of multiple comparisons to obtain the confidence interval of the estimate in (d)(iv).
(f ) Use the method of multiple comparisons and test the following hypothesis: H0 : α2 = (α1 +
α3 )/2. Use α = 0.05.
Solution 3.4.
ȳ1. = 0.11

Σj (y1j − ȳ1.)² = (n1 − 1)s1² = (4 − 1)(0.0258)² = 0.002

or

Σj (y1j − ȳ1.)² = (0.1 − 0.11)² + (0.12 − 0.11)² + (0.08 − 0.11)² + (0.14 − 0.11)² = 0.002
ȳ2. = 0.15

Σj (y2j − ȳ2.)² = (n2 − 1)s2² = (3 − 1)(0.0361)² = 0.0026

or

Σj (y2j − ȳ2.)² = (0.12 − 0.15)² + (0.14 − 0.15)² + (0.19 − 0.15)² = 0.0026

ȳ3. = 0.2

Σj (y3j − ȳ3.)² = (n3 − 1)s3² = (3 − 1)(0.05)² = 0.005

or

Σj (y3j − ȳ3.)² = (0.2 − 0.2)² + (0.25 − 0.2)² + (0.15 − 0.2)² = 0² + (0.05)² + (−0.05)² = 0.005
q4 = Σi Σj (yij − ȳi.)² = 0.0096

q5 = Σi ni (ȳi. − ȳ..)²
   = 4(0.11 − 0.149)² + 3(0.15 − 0.149)² + 3(0.2 − 0.149)²
   = 0.01389
   ≈ 0.0139
Source                    SS      d.f.  MS      f
Drugs (between samples)   0.0139  2     0.0070  5
Error (within samples)    0.0096  7     0.0014
Total                     0.0235  9
(b) H0 : µ1 = µ2 = µ3 or H0 : α1 = α2 = α3 (= 0)
Under H0 :  f = [Between samples SS/(k − 1)] / [Within samples SS/(N − k)]
          = [Σi ni (ȳi. − ȳ..)²/(k − 1)] / [Σi Σj (yij − ȳi.)²/(N − k)] ∼ F_{k−1;N−k}
Since 5 > 4.74, H0 is rejected. The mean levels of patients discharged for the three drugs do
differ significantly at α = 0.05.
(c) • The assumption of equal variances might be violated. One can use Bartlett's test to check this assumption; if it is violated, an alternative such as Welch's test may be applied to the one-way ANOVA.
• The assumption of normality might also be violated. One can check this by graphical examination of Q-Q plots and histograms of the observations in each group, and also by a test such as the Shapiro-Wilk test. If the assumption is violated, one may use the Kruskal-Wallis test instead.
(d) (i) The point estimate of the overall mean proportion of treated patients who were discharged
for the three drugs is
µ̂ = ȳ.. = (1/N) Σi Σj yij = (Σi ni ȳi.)/(Σi ni)
   = [4(0.11) + 3(0.15) + 3(0.2)]/(4 + 3 + 3)
   = 0.149.
(ii)
µ̂ + α̂i = ȳi.
µ̂ + α̂1 = ȳ1. = 0.11
(iii)
α̂1 = ȳ1. − ȳ.. = 0.11 − 0.149 = −0.039

(iv)
α̂i − α̂j = ȳi. − ȳj.
α̂1 − α̂3 = ȳ1. − ȳ3. = 0.11 − 0.2 = −0.09
(v)
σ̂² = MSE = Σi Σj (yij − ȳi.)²/(N − k) = 0.0096/7 = 0.001371 ≈ 0.0014
(e) (i)
Var(µ̂) = Var(ȳ..) = σ²/N = 0.0014/10 = 0.00014 ≈ 0.0001

µ̂ ± t_{α/2;N−k} × √var(µ̂)

α = 0.05, α/2 = 0.025, and t_{α/2;N−k} = t_{0.025;7} = 2.365

The 95% confidence interval is

ȳ.. ± t_{α/2;N−k} × √var(ȳ..)
0.149 ± 2.365 × √0.0001
0.149 ± 0.02365 ≈ 0.149 ± 0.0237
(0.1253 ; 0.1727)
(ii)
Var(µ̂ + α̂i) = σ²/ni ;  Var(µ̂ + α̂1) = 0.0014/4 = 0.00035 ≈ 0.0004

(µ̂ + α̂1) ± t_{α/2;N−k} × √var(µ̂ + α̂1), that is ȳ1. ± t_{α/2;N−k} × √var(ȳ1.)

α = 0.05, α/2 = 0.025, and t_{α/2;N−k} = t_{0.025;7} = 2.365

0.11 ± 2.365 × √0.0004
0.11 ± 0.0473
(0.0627 ; 0.1523).
(iii)
Var(α̂i) = Var(ȳi. − ȳ..) = σ²(N − ni)/(N ni)

var(α̂1) = σ²(N − n1)/(N n1) = 0.0014 × (10 − 4)/(10 × 4) = 0.0014 × 6/40 = 0.00021 ≈ 0.0002

α̂1 ± t_{0.025;7} × √var(α̂1)
−0.039 ± 2.365 × √0.0002
−0.039 ± 0.0334
(−0.0724 ; −0.0056)
(iv)
Var(α̂i − α̂j) = σ²(1/ni + 1/nj)
Var(α̂1 − α̂3) = σ²(1/n1 + 1/n3) = 0.0014(1/4 + 1/3) = 0.000817 ≈ 0.0008

(α̂1 − α̂3) ± t_{α/2;N−k} × √var(α̂1 − α̂3), that is (ȳ1. − ȳ3.) ± t_{α/2;N−k} × √var(ȳ1. − ȳ3.)

(0.11 − 0.2) ± 2.365 × √0.0008
−0.09 ± 0.0669
(−0.1569 ; −0.0231)
(v) σ̂² = q4/(N − k) ≈ 0.0014 and q4/σ² ∼ χ²_{N−k}

∴ 0.95 = P(χ²_{0.975;7} ≤ q4/σ² ≤ χ²_{0.025;7})
       = P(1.69 ≤ 0.0096/σ² ≤ 16.013)
       = P(0.0096/16.013 ≤ σ² ≤ 0.0096/1.69)
       = P(0.000599512 ≤ σ² ≤ 0.005680473)
       = P(0.0006 ≤ σ² ≤ 0.0057)
For multiple comparisons, H0 is rejected if (Σ ci ȳi.)²/Σ(ci²/ni) > (k − 1)σ̂² F_{α;k−1;N−k}.

(f) For H0 : α2 = (α1 + α3)/2 take c1 = −1/2, c2 = 1, c3 = −1/2. Then

(Σ ci ȳi.)²/Σ(ci²/ni) = [0.15 − (0.11 + 0.2)/2]²/(1/16 + 1/3 + 1/12) ≈ 0.0001

and

(k − 1)σ̂² F_{0.05;2;7} = 2(0.0014)(4.74) = 0.013272 ≈ 0.0133
Since 0.0001 < 0.0133, we do not reject H0 at the 5% level of significance and conclude that
the mean of the second sample is equal to the average of the first and third samples.
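The contrast test in (f) can be reproduced in a few lines; this sketch hard-codes the group means, sample sizes, MSE and the tabled value F_{0.05;2;7} = 4.74 from Example 3.4:

```python
means = [0.11, 0.15, 0.20]   # drug group means ybar_i.
sizes = [4, 3, 3]            # n_i
mse = 0.0014                 # sigma^2 hat
k = 3
c = [-0.5, 1.0, -0.5]        # contrast for H0: alpha2 = (alpha1 + alpha3)/2

lhs = sum(ci * m for ci, m in zip(c, means)) ** 2 / sum(
    ci ** 2 / n for ci, n in zip(c, sizes))
rhs = (k - 1) * mse * 4.74   # (k-1) * sigma^2 hat * F_{0.05;2;7}

print(round(lhs, 4), round(rhs, 4), lhs > rhs)  # 0.0001 0.0133 False
```

Since the left-hand side does not exceed the right-hand side, H0 is not rejected, in agreement with the conclusion above.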
Analysis
We now consider the problem which arises if k populations are selected at random from a large
number of populations, and a random sample is drawn from each of these populations. For sim-
plicity we shall assume that all sample sizes are equal, that is
n1 = . . . = nk = n, say.
The model is

yij = µ + ai + εij ;  j = 1, · · · , n;  i = 1, · · · , k

where the parameters a1 , · · · , ak constitute a random sample of size k from a population of parameters. The following assumptions are usually made:

εij ∼ n(0; σ²);  j = 1, · · · , n;  i = 1, · · · , k
ai ∼ n(0; ω²);  i = 1, · · · , k
As before, let

Var(yij) = Var(µ) + Var(ai) + Var(εij)   since a1 , . . . , ak and ε11 , . . . , εkn are mutually independent
         = ω² + σ².

Now

E(yij²) = Var(yij) + [E(yij)]² = σ² + ω² + µ².
ȳi. = (1/n) Σj (µ + ai + εij) = µ + ai + (1/n) Σj εij

E(ȳi.) = E(µ) + E(ai) + (1/n) Σj E(εij) = µ

Var(ȳi.) = Var(µ) + Var(ai) + (1/n²) Σj Var(εij)
         = ω² + (1/n²)(nσ²)
         = ω² + σ²/n
ȳ.. = (1/kn) Σi Σj (µ + ai + εij) = µ + (1/k) Σi ai + (1/kn) Σi Σj εij

E(ȳ..) = E(µ) + (1/k) Σi E(ai) + (1/kn) Σi Σj E(εij) = µ

Var(ȳ..) = (1/k²) Σi Var(ai) + (1/(kn)²) Σi Σj Var(εij)
         = (1/k²)(kω²) + (1/(kn)²)(knσ²)
         = ω²/k + σ²/(kn)
If we write q4 = y′A4 y and q5 = y′A5 y, we see that A4 and A5 are exactly the same as in equations (3.2) and (3.3), section 3.4. Thus A4 A4 = A4 , A5 A5 = A5 , A4 A5 = O, r(A4) = N − k and r(A5) = k − 1. Hence

E(q4) = (N − k)σ² as before
E(q5) = E(n Σi ȳi.² − nk ȳ..²)
      = n Σi E(ȳi.²) − nk E(ȳ..²)
      = n Σi (µ² + ω² + σ²/n) − kn(µ² + σ²/(nk) + ω²/k)
      = (k − 1)σ² + n(k − 1)ω²

∴ E(q4/σ²) = N − k and E(q5/(σ² + nω²)) = k − 1.
From these expected values, the results above and (D8) it follows that q4/σ² and q5/(σ² + nω²) are independent central chi-square variates with N − k and k − 1 degrees of freedom respectively. (That these are central chi-square variates can be proved more formally by showing that, since E(yij) = µ for all i and j, λ4 = λ5 = 0.)
In this type of model we are usually not interested specifically in a1 , · · · , ak since the k populations
were selected at random from a large number of populations. The interest is usually in the overall
mean µ and the variance ω 2 of the parameters ai . An unbiased estimator for µ is
µ̂ = ȳ..
while

E(ȳ..) = µ

and

Var(ȳ..) = ω²/k + σ²/(nk) = (1/N)(nω² + σ²).

As was seen, an unbiased estimator for nω² + σ² is q5/(k − 1); thus

(ȳ.. − µ)/√((1/N)(q5/(k − 1))) ∼ t_{k−1}.

Thus we write ω̂² = (1/n)[q5/(k − 1) − q4/(N − k)].
There is a problem with this estimator: while ω 2 ≥ 0 one may sometimes find that ω̂ 2 < 0. In
such a case it is standard practice to set ω̂ 2 = 0.
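The point estimators, including the truncation at zero, can be sketched as follows (the function name and the illustrative inputs are ours):

```python
def variance_components(q4, q5, k, N, n):
    """Point estimates for the balanced random-effects one-way model:
    sigma^2 hat = q4/(N - k) and
    omega^2 hat = (q5/(k - 1) - q4/(N - k)) / n, truncated at zero."""
    sigma2 = q4 / (N - k)
    omega2 = (q5 / (k - 1) - sigma2) / n
    return sigma2, max(omega2, 0.0)

# Hypothetical balanced design: k = 4 groups of n = 6 observations, N = 24.
s2, w2 = variance_components(q4=228, q5=336, k=4, N=24, n=6)
print(round(s2, 4), round(w2, 4))  # 11.4 16.7667
```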
Thus f ∼ F_{k−1;N−k} if and only if ω² = 0. If ω² > 0 we may expect f to be larger than an F_{k−1;N−k} statistic, and H0 : ω² = 0 is rejected if f assumes large values.
An approximate confidence interval for ω 2 may be found, but we do not discuss this topic. Of
greater interest are the quantities ω 2 /σ 2 , σ 2 /(ω 2 + σ 2 ) and ω 2 /(ω 2 + σ 2 ). The latter is of particular
interest: The variance of each observation yij is σ 2 + ω 2 . A portion ω 2 of this variance is ascribable
to the differences in population means, and ω 2 /(ω 2 + σ 2 ) is the proportion of the total variance
which is due to these differences.
As before, let

f = [q5/(k − 1)] / [q4/(N − k)].

Then [σ²/(nω² + σ²)] f ∼ F_{k−1;N−k}. Let

F1 = F_{1−α/2;k−1;N−k} = 1/F_{α/2;N−k;k−1}
F2 = F_{α/2;k−1;N−k}.
Then

1 − α = P(F1 < [σ²/(nω² + σ²)] f < F2)
      = P(f/F2 < (nω² + σ²)/σ² < f/F1)
      = P(f/F2 < 1 + nω²/σ² < f/F1)
      = P(−1/n + f/(nF2) < ω²/σ² < −1/n + f/(nF1))
      = P((f − F2)/(nF2) < ω²/σ² < (f − F1)/(nF1))
      = P(nF1/(f − F1) < σ²/ω² < nF2/(f − F2))
      = P(1 + nF1/(f − F1) < 1 + σ²/ω² < 1 + nF2/(f − F2))
      = P((f − F1 + nF1)/(f − F1) < (ω² + σ²)/ω² < (f − F2 + nF2)/(f − F2))
      = P((f − F2)/(f − F2 + nF2) < ω²/(ω² + σ²) < (f − F1)/(f − F1 + nF1))
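The final interval can be wrapped in a small helper (a sketch; the function name and the illustrative inputs are ours):

```python
def icc_ci(f, F1, F2, n):
    """Confidence limits for omega^2/(omega^2 + sigma^2) from the
    derivation above: lower = (f - F2)/(f - F2 + n*F2) and
    upper = (f - F1)/(f - F1 + n*F1)."""
    lower = (f - F2) / (f - F2 + n * F2)
    upper = (f - F1) / (f - F1 + n * F1)
    return lower, upper

# Hypothetical inputs: f = 4.8942, F1 = 0.1706, F2 = 3.06, n = 4.
lo, hi = icc_ci(4.8942, 0.1706, 3.06, 4)
print(round(lo, 4), round(hi, 4))  # 0.1303 0.8738
```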
Example 3.5.
A factory has a large number of machines which produce the same product. The mass of each
product unit is a random variable which varies from unit to unit; the variance of this random
variable is σ 2 . The mean mass per unit also varies from machine to machine due to the fact
that the machines are not always calibrated precisely. Suppose that four machines are selected at
random, and a sample of six units produced by each of these machines is selected randomly and
weighed. The results (mass in grams) are as follows:
Unit Machine
1 2 3 4
1 201 198 211 204
2 198 196 215 202
3 209 201 207 203
4 197 200 209 206
5 203 206 208 201
6 204 199 210 208
(c) Estimate the proportion of the total variation ascribable to the machines.
(d) Find the 90% confidence interval for ω²/(σ² + ω²).
(e) Estimate the overall mean and find the 95% confidence interval for µ.
Solution 3.5.

The model is yij = µ + ai + εij ; j = 1, · · · , n; i = 1, · · · , k, where k = 4 and n = 6, and E(ai) = E(εij) = 0, ai ∼ n(0; ω²), εij ∼ n(0; σ²) and the ai and εij are mutually independent. We compute
ȳ1. = 202;  ȳ2. = 200;  ȳ3. = 210;  ȳ4. = 204;  ȳ.. = 204

Σj (y1j − ȳ1.)² = (201 − 202)² + (198 − 202)² + . . . + (204 − 202)² = 96
Σj (y2j − ȳ2.)² = (198 − 200)² + (196 − 200)² + . . . + (199 − 200)² = 58
Σj (y3j − ȳ3.)² = (211 − 210)² + (215 − 210)² + . . . + (208 − 210)² = 40
Σj (y4j − ȳ4.)² = (204 − 204)² + (202 − 204)² + . . . + (208 − 204)² = 34

q5 = n Σi (ȳi. − ȳ..)² = 6[(−2)² + (−4)² + 6² + 0²] = 6(4 + 16 + 36 + 0) = 336
(b) H0 : ω² = 0
q4 = 96 + 58 + 40 + 34 = 228, so f = [q5/(k − 1)]/[q4/(N − k)] = [336/3]/[228/20] = 112/11.4 ≈ 9.8246.
F_{α;k−1;N−k} = F_{0.05;3;20} = 3.10. Reject H0 if f > 3.1.
Since 9.8246 > 3.10, the F-value is significant at the 5% level and our conclusion is that there is significant variation between machines − they should perhaps be calibrated.
(c) We have

σ̂² = 11.4
σ̂² + 6ω̂² = 112
ω̂² = 16.7667

ω̂²/(σ̂² + ω̂²) = 16.7667/(11.4 + 16.7667) ≈ 0.60.

Thus we estimate that about 60% of the variation is due to differences between machines.
(d) The confidence limits are

L1 = (f − F2)/(f − F2 + nF2)  and  L2 = (f − F1)/(f − F1 + nF1)

with

F1 = F_{1−α/2;k−1;N−k} = 1/F_{α/2;N−k;k−1} = 1/F_{0.05;20;3} = 1/8.66 ≈ 0.1155

and F2 = F_{0.05;3;20} = 3.10, so that L1 = (9.8246 − 3.10)/(9.8246 − 3.10 + 6(3.10)) ≈ 0.2655 and L2 = (9.8246 − 0.1155)/(9.8246 − 0.1155 + 6(0.1155)) ≈ 0.9334.
(e) The estimate of the overall mean is ȳ.. = 204 and the confidence interval for µ is

ȳ.. ± t_{α/2;k−1} × √(q5/(N(k − 1)))
204 ± 3.182 × √(336/(24 × 3))
204 ± 6.8739
(197.1261 ; 210.8739).
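The interval in (e) can be checked numerically (a minimal sketch with the tabled value t_{0.025;3} = 3.182):

```python
import math

# 95% CI for mu in Example 3.5: ybar.. +- t_{0.025;k-1} * sqrt(q5/(N(k-1))).
q5, N, k, ybar = 336, 24, 4, 204
half = 3.182 * math.sqrt(q5 / (N * (k - 1)))
print(round(ybar - half, 4), round(ybar + half, 4))  # 197.1261 210.8739
```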
From the SAS JMP output in figure 3.2 below, we note that machine 3, especially, produces units with higher mass than the others. Machine 3's diamond does not overlap with any of the other diamonds; thus one can expect it to produce units significantly different from all the other machines. Note also that, like most statistical computer programs, SAS JMP does not distinguish between random and fixed effects models − the distinction has to be made by the user.
Example 3.6.
A large company has a number of personnel officers, and management wants to find out whether
the personnel selection is uniform or whether the variation between personnel officers is significant
compared to the variation between candidates. Five of the personnel officers are selected at ran-
dom, and each is assigned four applicants for testing. Their ratings of the applicants are as follows:
(a) and test at the 5% level whether the officers award different ratings on the average;
(b) estimate the mean rating of all candidates and construct a 90% confidence interval for it;
(c) estimate ω 2 /(σ 2 + ω 2 ) and construct a 90% confidence interval for it.
Solution 3.6.
yij = µ + ai + εij ;  j = 1, · · · , n;  i = 1, · · · , k

where k = 5 and n = 4, and E(ai) = E(εij) = 0, ai ∼ n(0; ω²), εij ∼ n(0; σ²) and the ai and εij are mutually independent.
(a)
Personnel officer
A B C D E
1 75 47 67 76 77
2 91 50 75 59 65
3 72 63 82 82 86
4 86 64 80 67 76
ȳi. 81 56 76 71 76
N = nk = 20 ȳ.. = 72
Σj (y1j − ȳ1.)² = (75 − 81)² + (91 − 81)² + (72 − 81)² + (86 − 81)² = (−6)² + 10² + (−9)² + 5² = 242
Σj (y2j − ȳ2.)² = (47 − 56)² + (50 − 56)² + (63 − 56)² + (64 − 56)² = (−9)² + (−6)² + 7² + 8² = 230
Σj (y3j − ȳ3.)² = (67 − 76)² + (75 − 76)² + (82 − 76)² + (80 − 76)² = (−9)² + (−1)² + 6² + 4² = 134
Σj (y4j − ȳ4.)² = (76 − 71)² + (59 − 71)² + (82 − 71)² + (67 − 71)² = 306
Σj (y5j − ȳ5.)² = (77 − 76)² + (65 − 76)² + (86 − 76)² + (76 − 76)² = 1² + (−11)² + 10² + 0² = 222
q5 = n Σi (ȳi. − ȳ..)² = n Σi ȳi.² − nk ȳ..²
   = 4(81² + 56² + 76² + 71² + 76²) − (4)(5)(72)²
   = 1 480

or alternatively

q5 = n Σi (ȳi. − ȳ..)²
   = 4[(81 − 72)² + (56 − 72)² + (76 − 72)² + (71 − 72)² + (76 − 72)²]
   = 4(370)
   = 1 480
q4 = 242 + 230 + 134 + 306 + 222 = 1 134, so f = [q5/(k − 1)]/[q4/(N − k)] = [1 480/4]/[1 134/15] = 370/75.6 ≈ 4.8942.
Since 4.8942 > 3.06, H0 is rejected at the 5% level and our conclusion is that there is significant variation between personnel officers.
(b)
µ̂ = ȳ.. = (1/N) Σi Σj yij = (1/k) Σi ȳi.
   = (81 + 56 + 76 + 71 + 76)/5 = 360/5 = 72.
Thus, the 90% confidence interval for the mean rating of all candidates is

ȳ.. ± t_{α/2;k−1} × √(q5/(N(k − 1)))
72 ± 2.132 × √(1 480/(20(5 − 1)))
72 ± 2.132 × √18.5
72 ± 9.1701
(62.8299 ; 81.1701)
(c) Estimating ω²/(σ² + ω²): σ̂² = 75.6. Now

σ̂² + 4ω̂² = 370
4ω̂² = 294.4
⇒ ω̂² = 73.6.

Then

ω̂²/(σ̂² + ω̂²) = 73.6/(75.6 + 73.6) = 73.6/149.2 ≈ 0.4933.
The confidence interval for ω²/(σ² + ω²) is given by

P((f − F2)/(f − F2 + nF2) < ω²/(σ² + ω²) < (f − F1)/(f − F1 + nF1))

where F1 = 1/F_{α/2;N−k;k−1} and F2 = F_{α/2;k−1;N−k}.

Now f = 4.8942, α = 0.1, α/2 = 0.05, F_{α/2;N−k;k−1} = F_{0.05;15;4} = 5.86, so F1 = 1/5.86 ≈ 0.1706 and F2 = F_{0.05;4;15} = 3.06.

The 90% confidence interval for ω²/(σ² + ω²) is

0.90 = P((4.8942 − 3.06)/(4.8942 − 3.06 + 4(3.06)) < ω²/(σ² + ω²) < (4.8942 − 0.1706)/(4.8942 − 0.1706 + 4(0.1706)))
     = P(1.8342/14.0742 < ω²/(σ² + ω²) < 4.7236/5.406)
     = P(0.1303 < ω²/(σ² + ω²) < 0.8738).
We now consider experiments in which there are two factors (treatments or blocking factors), say
factor A with k levels A1 , · · · , Ak and factor B with m levels B1 · · · , Bm . The expected value of
an observation in the cell defined by levels Ai and Bj is µij , say. The expected values are then as
follows:
          B1    B2    · · ·  Bm
A1        µ11   µ12   · · ·  µ1m
A2        µ21   µ22   · · ·  µ2m
· · ·
Ak        µk1   µk2   · · ·  µkm

(rows: levels of factor A; columns: levels of factor B)
An additive model specifies that the following relationship exists between these expected values:
µij = µ + αi + βj ; j = 1, · · · , m; i = 1, · · · , k.
B1 B2 B3
A1 µ11 =10 µ12 =15 µ13 =12
A2 µ21 =20 µ22 =25 µ23 =22
The expected yield at level B2 is five units more than the expected yield at level B1 , irrespective
of the level of A(µ12 − µ11 = µ22 − µ21 = 5). Likewise, the increase in yield from A1 to A2 is ten
units, irrespective of the level of B(µ21 − µ11 = µ22 − µ12 = µ23 − µ13 = 10).
In practice it sometimes happens that the model is not additive, but that interaction is present.
Examples of interaction
The addition of calcium to the soil may have a beneficial effect on some plants, but will have a
detrimental effect on acid-loving plants. We say the factor calcium interacts with the factor type
of plant.
A specific hormone treatment may be beneficial to women but have no effect on men. (In such a
case sex and hormone treatment interact.)
Administering either drug A or drug B to a patient may be beneficial, but both drugs together
may have a detrimental effect. In other examples drug A and drug B jointly may be much more
beneficial than the sum of the effect of the two drugs separately. In both cases the factors drug A
and drug B interact.
Two chemicals may, if used separately, have very little effect on a chemical process, but jointly
they may have a profound effect.
You may think of further examples of interaction. The situation may be presented graphically as
in the following figures (figure 3.3-figure 3.6) (we assume that factor A has two levels A1 and A2
while B has three levels B1 , B2 and B3 ).
Figure 3.3:
Figure 3.4:
Figure 3.5:
Figure 3.6:
If the lines joining the expected yields at A1 and A2 are parallel (figures 3.3 and 3.4) there is no
interaction between A and B present (otherwise it is present, as in figures 3.5 and 3.6).
The presence of interaction is a factor which complicates the interpretation of main effects. In
some applications it may appear as if one or both treatments have no effect on the yield at all,
while the presence of interaction is actually an indication that both factors do have an effect on
the yield. Consider the following example:
B1 B2 B3 M ean
A1 µ11 = 10 µ12 = 15 µ13 = 20 µ1. = 15
A2 µ21 = 20 µ22 = 15 µ23 = 10 µ2. = 15
M ean µ.1 = 15 µ.2 = 15 µ.3 = 15 µ.. = 15
An analysis of variance test for the A-effect is actually a test for H0 : µ̄1. = µ̄2. = . . . = µ̄k. while a
test for the B-effect is a test for H0 : µ̄.1 = µ̄.2 = . . . = µ̄.m. . In the above example both hypotheses
are true, while it is certainly not true that the treatments have no effect at all on the yield. Other
examples may be constructed where for instance the A-effect is significant and the B-effect not
significant, while the presence of interaction implies that treatment B does influence the yield, but
the magnitude of the influence depends on the level of A.
Thus, if an analysis of variance shows that there is interaction, the possible non-significance of the
main effects (effects of A and B) must not be seen as a proof that A and/or B has no influence on
the yield. Graphs like figures 3.3 and 3.4, but with sample means rather than expected yield on
the vertical axis, may be useful in interpreting the results.
If interaction is present (or rather, unless one is sure that there is no interaction), the means are
described as follows:
µij = µ + αi + βj + (αβ)ij
where (αβ) is a symbol like α and β and does not indicate multiplication; sometimes γ is used
instead of (αβ). In this model α1 , · · · , αk are the main effects for treatment A, β1 , · · · , βm are the
main effects for treatment B and (αβ)ij ; i = 1, · · · , k; j = 1, · · · , m are the interaction effects.
with

Σi (αβ)ij = 0 for all j;
Σj (αβ)ij = 0 for all i;
In the remainder of this chapter we shall assume throughout that there are equal numbers of
observations per cell. The reason for this assumption is the fact that the notation and algebra
needed to provide for unequal numbers of observations per cell are somewhat more complicated.
The correct formulae may be found in any textbook on the analysis of variance.
We now consider experiments in which there are two factors (treatments or blocking factors)
present. Both factors are assumed to be fixed, and there are equal numbers of observations per
cell. We distinguish between the two cases: one observation per cell and n observations per cell
with n > 1.
the original purpose of the experiment. Since there is only one observation per cell, there is no
way of determining whether interaction exists or not (for example whether the difference between
supermarkets A and B is increasing, whether one supermarket was most expensive to begin with
and was overtaken by another supermarket, et cetera). One simply has to assume that there
is no interaction. (There is, however, a test based on the special model (αβ)ij = γαi βj to test
H0 : γ = 0. See J.W. Tukey (1949): One degree of freedom for non-additivity. Biometrics,
page 232.) For this reason this type of experiment should be performed only if it is impossible to
replicate the experiment for practical reasons or if one knows from past experience that interaction
has never been present in similar experiments.
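To make the Tukey reference concrete, here is a sketch of the one-degree-of-freedom statistic SS_nonadd = (Σi Σj yij ai bj)²/(Σi ai² Σj bj²), with ai = ȳi. − ȳ.. and bj = ȳ.j − ȳ.., applied to the boiling-time data of Example 3.8 later in this chapter (our implementation, so treat it as illustrative):

```python
# Tukey's one-degree-of-freedom test for non-additivity,
# based on the special model (alpha beta)_ij = gamma * alpha_i * beta_j.
y = [
    [21, 22, 23],
    [23, 33, 43],
    [25, 33, 26],
    [31, 36, 44],
]
k, m = len(y), len(y[0])
row = [sum(r) / m for r in y]                                  # ybar_i.
col = [sum(y[i][j] for i in range(k)) / k for j in range(m)]   # ybar_.j
grand = sum(map(sum, y)) / (k * m)
a = [r - grand for r in row]
b = [c - grand for c in col]

num = sum(y[i][j] * a[i] * b[j] for i in range(k) for j in range(m)) ** 2
ss_nonadd = num / (sum(ai ** 2 for ai in a) * sum(bj ** 2 for bj in b))
ss_res = sum((y[i][j] - row[i] - col[j] + grand) ** 2
             for i in range(k) for j in range(m))
f = ss_nonadd / ((ss_res - ss_nonadd) / ((k - 1) * (m - 1) - 1))
print(round(ss_nonadd, 4), round(f, 4))  # compare f with F_{0.05;1;5} = 6.61
```

Here f ≈ 3.23 < 6.61, so H0 : γ = 0 would not be rejected for these data.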
The model is

yij = µ + αi + βj + εij ;  i = 1, · · · , k;  j = 1, · · · , m

where Σi αi = Σj βj = 0.
In this formulation α1 , · · · , αk are called the effects of treatment A and β1 , · · · , βm the effects of
treatment B. The null hypotheses to be tested are usually
H0 : α1 = · · · = αk (= 0)
and H0 : β1 = · · · = βm (= 0).
ȳi. = (1/m) Σj yij ;  i = 1, · · · , k,
ȳ.j = (1/k) Σi yij ;  j = 1, · · · , m,
ȳ.. = (1/N) Σi Σj yij = (1/m) Σj ȳ.j = (1/k) Σi ȳi. .
Writing
yij − ȳ.. = (yij − ȳi. − ȳ.j + ȳ.. ) + (ȳi. − ȳ.. ) + (ȳ.j − ȳ.. ),
taking squares on both sides, summing over i and j and noting that the sums of the cross-products
of the terms on the right-hand side are all equal to zero, we obtain the identity
q2 = q4 + q5 + q6
As with one-way analysis of variance, we may write q4 , q5 and q6 in the forms qi = y 0 Ai y and
show that A4 A4 = A4 , A5 A5 = A5 , A6 A6 = A6 , A4 A5 = A4 A6 = A5 A6 = O, r(A4 ) =
(k − 1)(m − 1), r(A5 ) = k − 1, r(A6 ) = m − 1. This is left as an exercise for you to do.
λ4 = 0;  λ5 = m Σi αi²/σ²;  λ6 = k Σj βj²/σ².
yij = µ + αi + βj + εij

ȳ.j = (1/k) Σi (µ + αi + βj + εij)
    = µ + (1/k) Σi αi + βj + (1/k) Σi εij
    = µ + βj + (1/k) Σi εij   since Σi αi = 0.

E(ȳ.j) = E(µ) + E(βj) + (1/k) Σi E(εij) = µ + βj

Var(ȳ.j) = Var(µ) + Var(βj) + (1/k²) Σi Var(εij) = (1/k²)(kσ²) = σ²/k

E(ȳ.j²) = Var(ȳ.j) + [E(ȳ.j)]² = σ²/k + (µ + βj)².
Also

ȳ.. = (1/km) Σi Σj (µ + αi + βj + εij)
    = µ + (1/k) Σi αi + (1/m) Σj βj + (1/km) Σi Σj εij
    = µ + (1/km) Σi Σj εij

E(ȳ..) = E(µ + (1/km) Σi Σj εij) = µ

Var(ȳ..) = Var(µ) + (1/(km)²) Σi Σj Var(εij) = (1/(km)²)(kmσ²) = σ²/(km)

E(ȳ..²) = Var(ȳ..) + [E(ȳ..)]² = σ²/(km) + µ².
Now

q6 = k Σj (ȳ.j − ȳ..)² = k(Σj ȳ.j² − m ȳ..²)

∴ E(q6) = k Σj (σ²/k + (µ + βj)²) − km(µ² + σ²/(km))
        = k Σj (σ²/k + µ² + 2µβj + βj²) − kmµ² − σ²
        = mσ² + kmµ² + 2µk Σj βj + k Σj βj² − kmµ² − σ²
        = mσ² + kmµ² + k Σj βj² − kmµ² − σ²
        = (m − 1)σ² + k Σj βj²   (remember that Σj βj = 0)

∴ E(q6/σ²) = (m − 1) + k Σj βj²/σ²
∴ λ6 = k Σj βj²/σ² from (D8).
Example 3.7.
Using data in section 3.1.2:
Solution 3.7.
Factor B: Supermarkets
ȳ.1 = 31;  ȳ.2 = 25;  ȳ.3 = 34

Grand mean
ȳ.. = 30

Factor A: Months
ȳ1. = 27;  ȳ2. = 28;  ȳ3. = 31;  ȳ4. = 34
We see that there is a consistent increase in the average price from month to month. (This
may of course be a result of a real price increase and/or an expanding shopping list.)
SSA = m Σi (ȳi. − ȳ..)²
    = 3[(27 − 30)² + (28 − 30)² + (31 − 30)² + (34 − 30)²]
    = 3((−3)² + (−2)² + 1² + 4²)
    = 90

SSB = k Σj (ȳ.j − ȳ..)²
    = 4[(31 − 30)² + (25 − 30)² + (34 − 30)²]
    = 4(1² + (−5)² + 4²)
    = 168

SSTotal = Σi Σj (yij − ȳ..)² = 384

SSResidual = SSTotal − SSA − SSB = 384 − 90 − 168 = 126
Actually it is advisable to compute SSResidual directly and use it as a test of the accuracy of
the computations.
SSResidual = Σi Σj (yij − ȳi. − ȳ.j + ȳ..)² = . . . + (37 − 34 − 34 + 30)² = 126
The SAS JMP output which follows, shows that supermarket 2 (B) had the lowest prices on aver-
age, while the prices rose steadily over the four months.
In such small experiments there must be a considerable difference between level means before
a significant result is obtained. The housewife may now decide to continue the experiment for a
few more months, maybe trying to eliminate some of the sources of variation. She is, however,
more likely to decide that she will buy at the second supermarket in future.
The residual which may be examined for signs of deviations from the model is êij = yij − ȳi. − ȳ.j + ȳ.. .
Example 3.8.
In order to decide in which of three types of saucepans water will boil the quickest, the three types
of saucepans were tested on four types of stoves in a Domestic Science laboratory. A fixed amount
of water at room temperature was placed in each saucepan, and the time to boil (in seconds, with
timing started exactly three minutes after the stoves had been switched on) recorded. The results are:
Stoves Saucepans
I II III
1 21 22 23
2 23 33 43
3 25 33 26
4 31 36 44
Test
(a) the null hypothesis that the saucepans are not different;
(b) the null hypothesis that the stoves are not different (with respect to boiling speed).
Solution 3.8.
Saucepans
Stoves I II III ȳi.
1 21 22 23 22
2 23 33 43 33
3 25 33 26 28
4 31 36 44 37
ȳ.j 25 31 34
Factor A: Stoves
Factor B: Saucepans

Grand mean
ȳ.. = 30
SSA = m Σi (ȳi. − ȳ..)²
    = 3[(22 − 30)² + (33 − 30)² + (28 − 30)² + (37 − 30)²]
    = 3((−8)² + 3² + (−2)² + 7²)
    = 378

SSB = k Σj (ȳ.j − ȳ..)²
    = 4[(25 − 30)² + (31 − 30)² + (34 − 30)²]
    = 4((−5)² + 1² + 4²)
    = 168

SSResidual = Σi Σj (yij − ȳi. − ȳ.j + ȳ..)² = 158
(a) H0 : α1 = α2 = α3 = α4 = 0
f = [SSA/(k − 1)]/[SSResidual/((k − 1)(m − 1))] = [378/3]/[158/6] = 126/26.3333 ≈ 4.7848
F_{α;k−1;(k−1)(m−1)} = F_{0.05;3;6} = 4.76. Reject H0 if f > 4.76. Since 4.7848 > 4.76, we reject H0 at the 5% level of significance and conclude that the stoves are significantly different with respect to boiling speed.
(b) H0 : β1 = β2 = β3 = 0
f = [SSB/(m − 1)]/[SSResidual/((k − 1)(m − 1))] = [168/2]/[158/6] = 84/26.3333 ≈ 3.1899
F_{α;m−1;(k−1)(m−1)} = F_{0.05;2;6} = 5.14. Reject H0 if f > 5.14.
Since 3.1899 < 5.14, we do not reject H0 at the 5% level of significance and conclude that the saucepans do not differ significantly with respect to boiling speed.
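Both F-statistics of Example 3.8 can be reproduced with a short sketch (standard library only; variable names are ours):

```python
# Two-way ANOVA with one observation per cell: stoves (rows) x saucepans (columns).
y = [[21, 22, 23], [23, 33, 43], [25, 33, 26], [31, 36, 44]]
k, m = len(y), len(y[0])
row = [sum(r) / m for r in y]                                  # ybar_i.
col = [sum(y[i][j] for i in range(k)) / k for j in range(m)]   # ybar_.j
grand = sum(map(sum, y)) / (k * m)

ss_a = m * sum((r - grand) ** 2 for r in row)      # stoves
ss_b = k * sum((c - grand) ** 2 for c in col)      # saucepans
ss_res = sum((y[i][j] - row[i] - col[j] + grand) ** 2
             for i in range(k) for j in range(m))

df_res = (k - 1) * (m - 1)
f_a = (ss_a / (k - 1)) / (ss_res / df_res)
f_b = (ss_b / (m - 1)) / (ss_res / df_res)
print(ss_a, ss_b, ss_res, round(f_a, 4), round(f_b, 4))
# 378.0 168.0 158.0 4.7848 3.1899
```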
3.7.1 Illustration
The data in section 3.3.1 are an example of a mixed model (diets a fixed effect and litters a random
effect) but we cannot distinguish between the analysis of the three types of model if there is one
observation per cell only. The SAS JMP output (see figures 3.9 and 3.10) actually assumes a fixed
effects model, but the results which follow are easy to interpret.
The model is

yijh = µ + αi + βj + (αβ)ij + εijh ;  i = 1, · · · , k;  j = 1, · · · , m;  h = 1, · · · , n;

where

Σi αi = Σj βj = 0
Σi (αβ)ij = 0 for all j
Σj (αβ)ij = 0 for all i
µ = overall mean response; the average of the mean responses for the km populations.
αi = effect of the i-th level of the first factor averaged over the m levels of the second factor,
(the i-th level of the first factor adds αi to the overall mean µ).
βj = effect of the j-th level of the second factor averaged over the k levels of the first factor.
(αβ)ij = interaction between the i-th level of the first factor and the j-th level
of the second factor (the population means for the ij-th treatment minus µ + αi + βj ).
εijh = deviation of yijh from the population mean response for the ij-th treatment combination.
The terms αi and βj are called main effects. The term (αβ)ij is an interaction.
H0 : α1 = . . . = αk (= 0)
(if all αi are equal and Σαi = 0 then each must be equal to zero)
H0 : β1 = . . . = βm (= 0)
ȳi.. = (1/mn) Σj Σh yijh = (1/m) Σj ȳij.   (the level means of the levels of factor A);  i = 1, · · · , k

ȳ.j. = (1/nk) Σi Σh yijh = (1/k) Σi ȳij.   (the level means of the levels of factor B);  j = 1, · · · , m

ȳ... = (1/N) Σi Σj Σh yijh   (the overall mean)
     = (1/m) Σj ȳ.j. = (1/k) Σi ȳi.. = (1/km) Σi Σj ȳij. .
SSA = mn Σi (ȳi.. − ȳ...)²
SSB = kn Σj (ȳ.j. − ȳ...)²
SSAB = n Σi Σj (ȳij. − ȳi.. − ȳ.j. + ȳ...)²
SSWS = SSWithin samples = Σi Σj Σh (yijh − ȳij.)²
SSTotal = Σi Σj Σh (yijh − ȳ...)².
Activity 3.1.
Prove as an exercise that SSTotal = SSA + SSB + SSAB + SSWithin samples .
(This equality holds only if the number of observations per cell is the same for all cells.)
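Activity 3.1 can also be verified numerically; the sketch below checks the decomposition on a small hypothetical balanced layout (k = m = n = 2; the data are invented):

```python
# Numerical check of the identity SSTotal = SSA + SSB + SSAB + SSWS.
y = {(1, 1): [3, 5], (1, 2): [6, 8], (2, 1): [2, 2], (2, 2): [9, 11]}
k, m, n = 2, 2, 2
cell = {ij: sum(obs) / n for ij, obs in y.items()}              # cell means
yi = {i: sum(cell[(i, j)] for j in (1, 2)) / m for i in (1, 2)}  # ybar_i..
yj = {j: sum(cell[(i, j)] for i in (1, 2)) / k for j in (1, 2)}  # ybar_.j.
grand = sum(cell.values()) / (k * m)

ss_a = m * n * sum((yi[i] - grand) ** 2 for i in (1, 2))
ss_b = k * n * sum((yj[j] - grand) ** 2 for j in (1, 2))
ss_ab = n * sum((cell[(i, j)] - yi[i] - yj[j] + grand) ** 2
                for i in (1, 2) for j in (1, 2))
ss_ws = sum((v - cell[ij]) ** 2 for ij, obs in y.items() for v in obs)
ss_total = sum((v - grand) ** 2 for obs in y.values() for v in obs)

print(ss_a, ss_b, ss_ab, ss_ws, ss_total)  # 0.5 60.5 12.5 6.0 79.5
```

The four components sum to SSTotal, as the activity asks you to prove in general for balanced layouts.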
These SSs may also be written in the form of y 0 Ai y and in this way it may be proved that the four
SSs on the right-hand side are independent with degrees of freedom and sums of squares as given
in the following ANOVA table:
Example 3.9.
Derive the formula for E(MS) for the A effect, i.e. E(MSA).
Solution 3.9.

SSA = mn Σi (ȳi.. − ȳ...)² = mn Σi ȳi..² − kmn ȳ...²

ȳi.. = (1/mn) Σj Σh (µ + αi + βj + (αβ)ij + εijh)
     = µ + αi + (1/m) Σj βj + (1/m) Σj (αβ)ij + (1/mn) Σj Σh εijh

E(ȳi..) = E(µ) + E(αi) + (1/m) Σj E(βj) + (1/m) Σj E((αβ)ij) + (1/mn) Σj Σh E(εijh)
        = µ + αi   (since Σj βj = Σj (αβ)ij = 0)

Var(ȳi..) = Var(µ) + Var(αi) + (1/m²) Σj Var(βj) + (1/m²) Σj Var((αβ)ij) + (1/(mn)²) Σj Σh Var(εijh)
          = σ²/(mn)

E(ȳi..²) = Var(ȳi..) + [E(ȳi..)]² = σ²/(mn) + (µ + αi)²
ȳ... = (1/kmn) Σi Σj Σh (µ + αi + βj + (αβ)ij + εijh)
     = µ + (1/k) Σi αi + (1/m) Σj βj + (1/km) Σi Σj (αβ)ij + (1/kmn) Σi Σj Σh εijh

E(ȳ...) = µ

Var(ȳ...) = (1/k²) Σi Var(αi) + (1/m²) Σj Var(βj) + (1/(km)²) Σi Σj Var((αβ)ij) + (1/(kmn)²) Σi Σj Σh Var(εijh)
          = σ²/(kmn)

E(ȳ...²) = Var(ȳ...) + [E(ȳ...)]² = σ²/(kmn) + µ²
k
!
X
E(SSA ) = E mn (ȳi.. − ȳ... )2
i=1
k
X
2 2
= mn E(ȳi.. ) − kmnE(ȳ... )
i=1
k
σ2
2
X
2 σ 2
= mn + (µ + αi ) − kmnE +µ
i=1
mn kmn
k
kmnσ 2 X kmnσ 2
= + mn (µ2 + 2µαi + αi2 ) − − kmnµ2
mn i=1
kmn
k
X k
X
2 2
= kσ + kmnµ + 2mnµ αi + mn αi2 − σ 2 − kmnµ2
i=1 i=1
k
X k
X
2
= σ (k − 1) + mn αi2 since αi = 0
i=1 i=1
3. ANALYSIS OF VARIANCE 195 STA3701/1
E(SSA )
E(M SA ) =
k−1
σ 2 (k − 1) + mn ki=1 αi2
P
=
k−1
k
2 mn X 2
= σ + α
k − 1 i=1 i
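The result for $E(MS_A)$ can be illustrated by simulation. In the sketch below all parameter values are illustrative assumptions (k = 3, m = 2, n = 4, σ = 1, α = (−1, 0, 1), with the β and interaction effects taken as zero); the average of MS_A over many simulated experiments should be close to the theoretical value 1 + 4 × 2 = 9:

```python
import numpy as np

# Monte Carlo illustration of E(MS_A) = sigma^2 + mn/(k-1) * sum(alpha_i^2).
# All parameter values are illustrative assumptions, not from the Guide.
rng = np.random.default_rng(0)
k, m, n, sigma, mu = 3, 2, 4, 1.0, 10.0
alpha = np.array([-1.0, 0.0, 1.0])            # fixed A effects, sum to zero

reps = 4000
msa = np.empty(reps)
for r in range(reps):
    y = mu + alpha[:, None, None] + rng.normal(0.0, sigma, size=(k, m, n))
    ybar_A = y.mean(axis=(1, 2))
    msa[r] = m * n * np.sum((ybar_A - y.mean()) ** 2) / (k - 1)

expected = sigma**2 + m * n / (k - 1) * np.sum(alpha**2)   # 1 + 4*2 = 9
print(msa.mean(), expected)                   # the two values should be close
```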
Activity 3.2.
Derive the other E(MS) as an exercise.
Remarks:
$H_0: \beta_r = \beta_s$ is rejected if
\[
\frac{(\bar{y}_{.r.} - \bar{y}_{.s.})^2}{2/(kn)} > (m-1)\hat{\sigma}^2 F_{\alpha;\,m-1;\,km(n-1)} .
\]
Example 3.10.
In order to test whether the method of display has an effect on bread sales, a bakery selected 12
comparable supermarkets and requested each to display its bread according to a specification of shelf
height (bottom, middle and top) and width of shelf (regular and wide). The bread sales were as
follows:
Test the three null hypotheses (regarding interaction, A-effect and B-effect), each at the 5% level.
Solution 3.10.
The model is
\[
y_{ijh} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijh}, \qquad i = 1,\dots,k;\ j = 1,\dots,m;\ h = 1,\dots,n,
\]
where
\[
\sum_i \alpha_i = \sum_j \beta_j = 0, \qquad
\sum_i (\alpha\beta)_{ij} = 0 \text{ for all } j, \qquad
\sum_j (\alpha\beta)_{ij} = 0 \text{ for all } i .
\]
\[
SS_A = mn\sum_{i=1}^{k}(\bar{y}_{i..} - \bar{y}_{...})^2 = 1\,544
\]
\[
SS_B = kn\sum_{j=1}^{m}(\bar{y}_{.j.} - \bar{y}_{...})^2 = 6[(-1)^2 + 1^2] = 12
\]
\[
SS_{AB} = n\sum_{i=1}^{k}\sum_{j=1}^{m}(\bar{y}_{ij.} - \bar{y}_{.j.} - \bar{y}_{i..} + \bar{y}_{...})^2 = 24
\]
\[
SS_{WS} = \sum_{i=1}^{k}\sum_{j=1}^{m}\sum_{h=1}^{n}(y_{ijh} - \bar{y}_{ij.})^2
= (47-45)^2 + (43-45)^2 + (46-43)^2 + \dots + (42-44)^2 + (46-44)^2 = 62
\]
\[
SS_{\text{Total}} = \sum_{i=1}^{k}\sum_{j=1}^{m}\sum_{h=1}^{n}(y_{ijh} - \bar{y}_{...})^2
= (47-51)^2 + (43-51)^2 + (46-51)^2 + \dots + (42-51)^2 + (46-51)^2 = 1\,642
\]
We now assume that the k levels of factor A constitute a random sample from a large population
of possible levels of A, and the m levels of factor B constitute a random sample from a large
population of possible levels of B.
The model is
\[
y_{ijh} = \mu + a_i + b_j + (ab)_{ij} + \epsilon_{ijh}, \qquad i = 1,\dots,k;\ j = 1,\dots,m;\ h = 1,\dots,n,
\]
where
\[
a_i \sim n(0;\sigma_a^2), \quad b_j \sim n(0;\sigma_b^2), \quad (ab)_{ij} \sim n(0;\sigma_{ab}^2), \quad \epsilon_{ijh} \sim n(0;\sigma^2)
\]
and where $a_1,\dots,a_k$, $b_1,\dots,b_m$, $(ab)_{11},\dots,(ab)_{km}$ and $\epsilon_{111},\dots,\epsilon_{kmn}$ are mutually independent.
As with model II one-way analysis of variance, the parameters of prime interest are $\mu$, $\sigma_a^2$, $\sigma_b^2$, $\sigma_{ab}^2$ and $\sigma^2$. We use the same notation as in section 3.7, but the expected mean squares in the ANOVA table are now different.
We will derive $E(MS_B)$; the other E(MS) are left as an exercise for you to do.
Example 3.11.
Derive $E(MS_B)$.
Solution 3.11.
\[
SS_B = kn\sum_{j=1}^{m}(\bar{y}_{.j.} - \bar{y}_{...})^2 = kn\sum_{j=1}^{m}\bar{y}_{.j.}^2 - kmn\,\bar{y}_{...}^2
\]
\[
\bar{y}_{.j.} = \frac{1}{kn}\sum_{i=1}^{k}\sum_{h=1}^{n}\bigl(\mu + a_i + b_j + (ab)_{ij} + \epsilon_{ijh}\bigr)
= \mu + \frac{1}{k}\sum_{i=1}^{k} a_i + b_j + \frac{1}{k}\sum_{i=1}^{k}(ab)_{ij} + \frac{1}{kn}\sum_{i=1}^{k}\sum_{h=1}^{n}\epsilon_{ijh}
\]
\[
E(\bar{y}_{.j.}) = \mu, \qquad
Var(\bar{y}_{.j.}) = \frac{\sigma_a^2}{k} + \sigma_b^2 + \frac{\sigma_{ab}^2}{k} + \frac{\sigma^2}{kn}
\]
\[
\Longrightarrow E(\bar{y}_{.j.}^2) = Var(\bar{y}_{.j.}) + \bigl(E(\bar{y}_{.j.})\bigr)^2
= \frac{\sigma_a^2}{k} + \sigma_b^2 + \frac{\sigma_{ab}^2}{k} + \frac{\sigma^2}{kn} + \mu^2
\]
\[
\bar{y}_{...} = \mu + \frac{1}{k}\sum_{i} a_i + \frac{1}{m}\sum_{j} b_j + \frac{1}{km}\sum_{i}\sum_{j}(ab)_{ij} + \frac{1}{kmn}\sum_{i}\sum_{j}\sum_{h}\epsilon_{ijh}
= \mu + \bar{a}_{.} + \bar{b}_{.} + \overline{(ab)}_{..} + \bar{\epsilon}_{...}
\]
\[
E(\bar{y}_{...}) = E\bigl(\mu + \bar{a}_{.} + \bar{b}_{.} + \overline{(ab)}_{..} + \bar{\epsilon}_{...}\bigr) = \mu
\]
\[
Var(\bar{y}_{...}) = \frac{\sigma_a^2}{k} + \frac{\sigma_b^2}{m} + \frac{\sigma_{ab}^2}{km} + \frac{\sigma^2}{kmn}
\]
\[
\Longrightarrow E(\bar{y}_{...}^2) = Var(\bar{y}_{...}) + \bigl(E(\bar{y}_{...})\bigr)^2
= \frac{\sigma_a^2}{k} + \frac{\sigma_b^2}{m} + \frac{\sigma_{ab}^2}{km} + \frac{\sigma^2}{kmn} + \mu^2 .
\]
Then
\[
E(SS_B) = kn\sum_{j=1}^{m} E(\bar{y}_{.j.}^2) - kmn\,E(\bar{y}_{...}^2)
\]
\[
= kmn\frac{\sigma_a^2}{k} + kmn\sigma_b^2 + kmn\frac{\sigma_{ab}^2}{k} + kmn\frac{\sigma^2}{kn} + kmn\mu^2
- kmn\frac{\sigma_a^2}{k} - kmn\frac{\sigma_b^2}{m} - kmn\frac{\sigma_{ab}^2}{km} - kmn\frac{\sigma^2}{kmn} - kmn\mu^2
\]
\[
= mn\sigma_a^2 + kmn\sigma_b^2 + mn\sigma_{ab}^2 + m\sigma^2 - mn\sigma_a^2 - kn\sigma_b^2 - n\sigma_{ab}^2 - \sigma^2
\]
\[
= (m-1)kn\sigma_b^2 + (m-1)n\sigma_{ab}^2 + (m-1)\sigma^2
\]
\[
E(MS_B) = \frac{E(SS_B)}{m-1} = kn\sigma_b^2 + n\sigma_{ab}^2 + \sigma^2 .
\]
As in one-way model II, $SS_A$, $SS_B$, $SS_{AB}$ and $SS_{WS}$ are independent multiples of central chi-square variates.

If $\sigma_{ab}^2 = 0$,
\[
f = \frac{SS_{AB}/(k-1)(m-1)}{SS_{WS}/km(n-1)}
\]
is an $F_{(k-1)(m-1);\,km(n-1)}$ variate, and this fact is used to test $H_0: \sigma_{ab}^2 = 0$.

If $\sigma_a^2 = 0$,
\[
f = \frac{SS_A/(k-1)}{SS_{AB}/(k-1)(m-1)}
\]
is an $F_{k-1;\,(k-1)(m-1)}$ variate, and this fact is used to test $H_0: \sigma_a^2 = 0$.
(Note: In model I, $SS_A$ is compared to $SS_{WS}$, but here $SS_A$ is compared to $SS_{AB}$; the replicates in the cells are not utilised when $H_0: \sigma_a^2 = 0$ is tested.)

Thirdly, if $\sigma_b^2 = 0$,
\[
f = \frac{SS_B/(m-1)}{SS_{AB}/(k-1)(m-1)}
\]
is an $F_{m-1;\,(k-1)(m-1)}$ variate. This fact is used to test $H_0: \sigma_b^2 = 0$.

The variance components are estimated by
\[
\hat{\sigma}_a^2 = \frac{1}{mn}\Bigl(\frac{SS_A}{k-1} - \frac{SS_{AB}}{(k-1)(m-1)}\Bigr), \qquad
\hat{\sigma}_b^2 = \frac{1}{kn}\Bigl(\frac{SS_B}{m-1} - \frac{SS_{AB}}{(k-1)(m-1)}\Bigr),
\]
\[
\hat{\sigma}_{ab}^2 = \frac{1}{n}\Bigl(\frac{SS_{AB}}{(k-1)(m-1)} - \frac{SS_{WS}}{km(n-1)}\Bigr).
\]
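These estimators are easy to code. The sketch below is my own (the function name and dictionary keys are illustrative choices); applied to the sums of squares of Example 3.12 further on (SSA = 1.56, SSB = 7.56, SSAB = 1.64, SSWS = 1.24 with k = m = 3, n = 2) it reproduces $\hat\sigma_a^2 \approx 0.0617$, $\hat\sigma_b^2 \approx 0.5617$ and $\hat\sigma_{ab}^2 \approx 0.1361$:

```python
# Method-of-moments estimates of the model II variance components; the
# function name and dictionary keys are my own choices, not the Guide's.
def variance_components(SSA, SSB, SSAB, SSWS, k, m, n):
    MSA = SSA / (k - 1)
    MSB = SSB / (m - 1)
    MSAB = SSAB / ((k - 1) * (m - 1))
    MSWS = SSWS / (k * m * (n - 1))
    return {
        "sigma2": MSWS,                        # error variance
        "sigma2_a": (MSA - MSAB) / (m * n),    # factor A component
        "sigma2_b": (MSB - MSAB) / (k * n),    # factor B component
        "sigma2_ab": (MSAB - MSWS) / n,        # interaction component
    }
```

Note that these moment estimates can come out negative, in which case the usual convention is to truncate them at zero.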
As in one-way analysis of variance, one may find confidence intervals for the three variance com-
ponents, but this subject is not dealt with in this module.
An unbiased estimator of µ is
µ̂ = ȳ···
This fact may be used to test H0 : µ = µ0 with µ0 specified or to find a confidence interval for µ
as before. The unbiased estimator of V ar(µ̂) is not a multiple of a chi-square variate, and the test
or confidence limits are only approximately true.
Example 3.12.
A consumer product agency wants to evaluate the accuracy of determining the level of calcium in
a food supplement. There are a large number of possible testing laboratories and a large number of
chemical assays for calcium. The agency randomly selects three laboratories and three assays for
use in the study. Each laboratory will use all three assays in the study. Eighteen samples containing
10 mg of calcium are prepared and each assay−laboratory combination is randomly assigned to two
samples. The calcium content is given in the following table:
Laboratory
Assay 1 2 3
1 10.9 10.5 9.7
10.9 9.8 10.0
2 11.3 9.4 8.8
11.7 10.2 9.2
3 11.8 10.0 10.4
11.2 10.7 10.7
(a) Perform an analysis of variance for this experiment. Conduct all tests with α = 0.05.
(b) Estimate all variance components and determine their proportional allocation to the total
variability.
Solution 3.12.
(a)
ȳ11. = 10.9 ȳ12. = 10.15 ȳ13. = 9.85
ȳ21. = 11.5 ȳ22. = 9.8 ȳ23. = 9.0
ȳ31. = 11.5 ȳ32. = 10.35 ȳ33. = 10.55
ȳ... = 10.4
\[
SS_{\text{Total}} = \sum_{i=1}^{k}\sum_{j=1}^{m}\sum_{h=1}^{n}(y_{ijh} - \bar{y}_{...})^2 = 12
\]
\[
SS_A = mn\sum_{i=1}^{k}(\bar{y}_{i..} - \bar{y}_{...})^2
= 3 \times 2[(10.3-10.4)^2 + (10.1-10.4)^2 + (10.8-10.4)^2] = 1.56
\]
\[
SS_B = kn\sum_{j=1}^{m}(\bar{y}_{.j.} - \bar{y}_{...})^2 = 7.56
\]
\[
SS_{AB} = n\sum_{i=1}^{k}\sum_{j=1}^{m}(\bar{y}_{ij.} - \bar{y}_{.j.} - \bar{y}_{i..} + \bar{y}_{...})^2 = 1.64
\]
\[
SS_{WS} = \sum_{i=1}^{k}\sum_{j=1}^{m}\sum_{h=1}^{n}(y_{ijh} - \bar{y}_{ij.})^2 = 1.24
\]
$H_0: \sigma_a^2 = 0$:
\[
f = \frac{SS_A/(k-1)}{SS_{AB}/(k-1)(m-1)} = \frac{MS_A}{MS_{AB}} = \frac{0.78}{0.41} \approx 1.9024
\]
Since 1.9024 < 6.94, we do not reject H0 at the 5% level of significance and conclude that
there is insufficient evidence to indicate a significant variability in calcium determinations
from assay to assay.
$H_0: \sigma_b^2 = 0$:
\[
f = \frac{MS_B}{MS_{AB}} = \frac{3.78}{0.41} \approx 9.2195
\]
Since 9.2195 > 6.94, H0 is rejected at the 5% level of significance and we conclude that there is significant variability in calcium determinations from laboratory to laboratory.
$H_0: \sigma_{ab}^2 = 0$:
\[
f = \frac{MS_{AB}}{MS_{WS}} = \frac{0.41}{0.1378} \approx 2.9753
\]
Since 2.9753 < 3.63, we do not reject H0 at the 5% level of significance. There does not appear to be significant interaction between the levels of the factors for assays and laboratories.
(b) The variance component estimates follow from the expected mean squares:
\[
\hat{\sigma}^2 = MS_{WS} = 0.1378
\]
\[
\hat{\sigma}^2 + 2\hat{\sigma}_{ab}^2 = MS_{AB} = 0.41
\;\Longrightarrow\; 2\hat{\sigma}_{ab}^2 = 0.41 - 0.1378 = 0.2722
\;\Longrightarrow\; \hat{\sigma}_{ab}^2 = 0.1361
\]
\[
\hat{\sigma}^2 + 2\hat{\sigma}_{ab}^2 + 6\hat{\sigma}_b^2 = MS_B = 3.78
\;\Longrightarrow\; \hat{\sigma}_b^2 = (3.78 - 0.41)/6 = 0.5617
\]
\[
\hat{\sigma}^2 + 2\hat{\sigma}_{ab}^2 + 6\hat{\sigma}_a^2 = MS_A = 0.78
\;\Longrightarrow\; \hat{\sigma}_a^2 = (0.78 - 0.41)/6 = 0.0617
\]
The total of the estimates is 0.0617 + 0.5617 + 0.1361 + 0.1378 = 0.8973, so the proportional allocation is:

Source        Estimate    Proportion
Assays        0.0617      0.0617/0.8973 ≈ 0.0688
Labs          0.5617      0.5617/0.8973 ≈ 0.6260
Interaction   0.1361      0.1361/0.8973 ≈ 0.1517
Error         0.1378      0.1378/0.8973 ≈ 0.1536
Total         0.8973
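The whole calculation of Solution 3.12 can be reproduced from the raw data with a few lines of NumPy (small rounding differences from the Guide's hand-rounded values are to be expected):

```python
import numpy as np

# Calcium data of Example 3.12: axis 0 = assay (A), axis 1 = laboratory (B),
# axis 2 = the two samples per assay-laboratory combination.
y = np.array([[[10.9, 10.9], [10.5,  9.8], [ 9.7, 10.0]],
              [[11.3, 11.7], [ 9.4, 10.2], [ 8.8,  9.2]],
              [[11.8, 11.2], [10.0, 10.7], [10.4, 10.7]]])
k, m, n = y.shape

ybar_cell = y.mean(axis=2)
ybar_A, ybar_B, ybar = y.mean(axis=(1, 2)), y.mean(axis=(0, 2)), y.mean()

SSA = m * n * np.sum((ybar_A - ybar) ** 2)      # 1.56
SSB = k * n * np.sum((ybar_B - ybar) ** 2)      # 7.56
SSAB = n * np.sum((ybar_cell - ybar_A[:, None] - ybar_B[None, :] + ybar) ** 2)  # 1.64
SSWS = np.sum((y - ybar_cell[:, :, None]) ** 2)  # 1.24

MSA, MSB = SSA / (k - 1), SSB / (m - 1)
MSAB, MSWS = SSAB / ((k - 1) * (m - 1)), SSWS / (k * m * (n - 1))

f_A, f_B, f_AB = MSA / MSAB, MSB / MSAB, MSAB / MSWS
print(round(f_A, 4), round(f_B, 4), round(f_AB, 4))
```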
Note: Since there was a significant variability in the determination of calcium in the samples, the
estimate of an overall mean level µ would not be of interest to the researcher. However, in
this case we want to illustrate the methodology.
We now assume that the k levels of factor A are the only levels of interest, and thus A is a fixed
factor. The m levels of factor B constitute a random sample from a large population of possible
levels of B, and thus B is a random factor. The resulting experiment gives rise to a mixed model:
\[
y_{ijh} = \mu + \alpha_i + b_j + (\alpha b)_{ij} + \epsilon_{ijh}, \qquad i = 1,\dots,k;\ j = 1,\dots,m;\ h = 1,\dots,n,
\]
where
\[
\sum_{i=1}^{k}\alpha_i = 0, \qquad \sum_{i=1}^{k}(\alpha b)_{ij} = 0 \text{ for all } j, \qquad E(b_j) = E\bigl((\alpha b)_{ij}\bigr) = 0,
\]
\[
b_j \sim n(0;\sigma_b^2); \qquad (\alpha b)_{ij} \sim n\Bigl(0;\frac{k-1}{k}\sigma_{\alpha b}^2\Bigr); \qquad \epsilon_{ijh} \sim n(0;\sigma^2);
\]
and $b_1,\dots,b_m$ are independent.
Any two interaction terms $(\alpha b)_{ij}$ and $(\alpha b)_{rs}$ are independent unless they refer to the same random level of B, that is $j = s$; in this case it is assumed that
\[
Cov\bigl((\alpha b)_{ij},(\alpha b)_{rj}\bigr) = -\sigma_{\alpha b}^2/k .
\]
This gives
\[
\sum_{i=1}^{k}(\alpha b)_{ij} = 0 \text{ for all } j,
\]
which is commensurate with the assumption concerning the covariance between two interaction terms.
The B and AB rows of the ANOVA table become:

B:  $SS_B$, d.f. $m-1$, MS $SS_B/(m-1)$, E(MS) $\sigma^2 + kn\sigma_b^2$
AB: $SS_{AB}$, d.f. $(k-1)(m-1)$, MS $SS_{AB}/(k-1)(m-1)$, E(MS) $\sigma^2 + n\sigma_{\alpha b}^2$
Example 3.13.
Derive the formula for $E(MS_A)$.
Solution 3.13.
\[
SS_A = mn\sum_{i=1}^{k}(\bar{y}_{i..} - \bar{y}_{...})^2 = mn\sum_{i=1}^{k}\bar{y}_{i..}^2 - kmn\,\bar{y}_{...}^2
\]
\[
\bar{y}_{i..} = \frac{1}{mn}\sum_{j=1}^{m}\sum_{h=1}^{n}\bigl(\mu + \alpha_i + b_j + (\alpha b)_{ij} + \epsilon_{ijh}\bigr)
= \mu + \alpha_i + \frac{1}{m}\sum_{j}b_j + \frac{1}{m}\sum_{j}(\alpha b)_{ij} + \frac{1}{mn}\sum_{j}\sum_{h}\epsilon_{ijh}
\]
\[
E(\bar{y}_{i..}) = \mu + \alpha_i
\]
\[
Var(\bar{y}_{i..}) = \frac{1}{m^2}\sum_{j} Var(b_j) + \frac{1}{m^2}\sum_{j} Var\bigl((\alpha b)_{ij}\bigr) + \frac{1}{(mn)^2}\sum_{j}\sum_{h} Var(\epsilon_{ijh})
= \frac{m\sigma_b^2}{m^2} + \frac{m\sigma_{\alpha b}^2}{m^2} + \frac{mn\sigma^2}{(mn)^2}
= \frac{\sigma_b^2}{m} + \frac{\sigma_{\alpha b}^2}{m} + \frac{\sigma^2}{mn}
\]
\[
E(\bar{y}_{i..}^2) = Var(\bar{y}_{i..}) + \bigl(E(\bar{y}_{i..})\bigr)^2
= \frac{\sigma_b^2}{m} + \frac{\sigma_{\alpha b}^2}{m} + \frac{\sigma^2}{mn} + (\mu + \alpha_i)^2
\]
\[
\bar{y}_{...} = \mu + \frac{1}{k}\sum_{i}\alpha_i + \frac{1}{m}\sum_{j}b_j + \frac{1}{km}\sum_{i}\sum_{j}(\alpha b)_{ij} + \frac{1}{kmn}\sum_{i}\sum_{j}\sum_{h}\epsilon_{ijh}
\]
\[
E(\bar{y}_{...}) = \mu
\]
\[
Var(\bar{y}_{...}) = \frac{m\sigma_b^2}{m^2} + \frac{km\sigma_{\alpha b}^2}{(km)^2} + \frac{kmn\sigma^2}{(kmn)^2}
= \frac{\sigma_b^2}{m} + \frac{\sigma_{\alpha b}^2}{km} + \frac{\sigma^2}{kmn}
\]
\[
E(\bar{y}_{...}^2) = Var(\bar{y}_{...}) + \bigl(E(\bar{y}_{...})\bigr)^2
= \frac{\sigma_b^2}{m} + \frac{\sigma_{\alpha b}^2}{km} + \frac{\sigma^2}{kmn} + \mu^2
\]
\[
E(SS_A) = mn\sum_{i=1}^{k} E(\bar{y}_{i..}^2) - kmn\,E(\bar{y}_{...}^2)
\]
\[
= mn\sum_{i=1}^{k}\Bigl[\frac{\sigma_b^2}{m} + \frac{\sigma_{\alpha b}^2}{m} + \frac{\sigma^2}{mn} + (\mu + \alpha_i)^2\Bigr]
- kmn\Bigl[\frac{\sigma_b^2}{m} + \frac{\sigma_{\alpha b}^2}{km} + \frac{\sigma^2}{kmn} + \mu^2\Bigr]
\]
\[
= kn\sigma_b^2 + kn\sigma_{\alpha b}^2 + k\sigma^2 + kmn\mu^2 + 2mn\mu\sum_{i}\alpha_i + mn\sum_{i}\alpha_i^2
- kn\sigma_b^2 - n\sigma_{\alpha b}^2 - \sigma^2 - kmn\mu^2
\]
\[
= n(k-1)\sigma_{\alpha b}^2 + \sigma^2(k-1) + mn\sum_{i=1}^{k}\alpha_i^2 \qquad \text{since } \sum_{i}\alpha_i = 0
\]
\[
E(MS_A) = \frac{E(SS_A)}{k-1}
= \frac{n(k-1)\sigma_{\alpha b}^2 + \sigma^2(k-1) + mn\sum_{i=1}^{k}\alpha_i^2}{k-1}
= n\sigma_{\alpha b}^2 + \sigma^2 + \frac{mn}{k-1}\sum_{i=1}^{k}\alpha_i^2
\]
Activity 3.3.
Derive the other E(M S) as an exercise.
The three null hypotheses and the appropriate statistics are as follows:

H0: $\alpha_1 = \dots = \alpha_k\,(= 0)$; $f = \dfrac{SS_A/(k-1)}{SS_{AB}/(k-1)(m-1)}$; d.f. $k-1;\ (k-1)(m-1)$

H0: $\sigma_{\alpha b}^2 = 0$; $f = \dfrac{SS_{AB}/(k-1)(m-1)}{SS_{WS}/km(n-1)}$; d.f. $(k-1)(m-1);\ km(n-1)$

H0: $\sigma_b^2 = 0$; $f = \dfrac{SS_B/(m-1)}{SS_{WS}/km(n-1)}$; d.f. $m-1;\ km(n-1)$

Reject $H_0: \alpha_r = \alpha_s$ if
\[
\frac{(\bar{y}_{r..} - \bar{y}_{s..})^2}{2/(mn)} > (k-1)\bigl(\hat{\sigma}^2 + n\hat{\sigma}_{\alpha b}^2\bigr)F_{\alpha;\,k-1;\,(k-1)(m-1)} .
\]
Example 3.14.
Preliminary research on the production of imitation pearls entailed studying how the number of coats of a special lacquer (factor A), applied to the opalescent plastic bead used as the base of the pearl, affects the market value of the pearl. Four batches of 12 beads (factor B) were used in the study, and it is desired to also consider their effect on the market value. The three levels of factor A (six, eight and ten coats) were fixed in advance, while the four batches can be regarded as a random sample of batches from the bead production process. The market value of each pearl was determined by a panel of experts. The market value data (coded) are as follows:
(c) Do the data provide sufficient evidence to indicate an interaction between number of coats
and batch?
Hint: Test at the 5% significance level. Give the hypotheses, test statistics, formulas and conclusions explicitly.
Solution 3.14.
(a) The factor A has three levels and the factor B has four levels.
(c)
ȳ11. = 71.7 ȳ12. = 74.275 ȳ13. = 75.625 ȳ14. = 70.825
ȳ21. = 75.525 ȳ22. = 78.35 ȳ23. = 78.5 ȳ24. = 74.8
ȳ31. = 75.625 ȳ32. = 78.35 ȳ33. = 79 ȳ34. = 74.725
ȳ... = 75.6083
\[
SS_{\text{Total}} = \sum_{i=1}^{k}\sum_{j=1}^{m}\sum_{h=1}^{n}(y_{ijh} - \bar{y}_{...})^2 = 478.7167
\]
\[
SS_A = mn\sum_{i=1}^{k}(\bar{y}_{i..} - \bar{y}_{...})^2
= 4 \times 4[(73.1063-75.6083)^2 + (76.7938-75.6083)^2 + (76.925-75.6083)^2] = 150.3858
\]
\[
SS_B = kn\sum_{j=1}^{m}(\bar{y}_{.j.} - \bar{y}_{...})^2 = 152.8522
\]
\[
SS_{WS} = \sum_{i=1}^{k}\sum_{j=1}^{m}\sum_{h=1}^{n}(y_{ijh} - \bar{y}_{ij.})^2 = 173.625
\]
\[
SS_{AB} = SS_{\text{Total}} - SS_A - SS_B - SS_{WS} = 1.8537
\]

Source   SS        d.f.  MS
A        150.3858  2     75.1929
B        152.8522  3     50.9507
A×B      1.8537    6     0.3090
Error    173.625   36    4.8229
Total    478.7167  47

$H_0: \sigma_{\alpha b}^2 = 0$:
\[
f = \frac{MS_{AB}}{MS_{WS}} = \frac{0.3090}{4.8229} \approx 0.0641
\]
$F_{\alpha;(k-1)(m-1);km(n-1)} = F_{0.05;6;36} = 2.38$. Reject H0 if f > 2.38.
Since 0.0641 < 2.38, H0 is not rejected at the 5% level of significance and we conclude that there is no interaction between number of coats and batch.
(d) $H_0: \alpha_1 = \alpha_2 = \alpha_3 = 0$
\[
f = \frac{SS_A/(k-1)}{SS_{AB}/(k-1)(m-1)} = \frac{75.1929}{0.3090} \approx 243.3427
\]
Since 243.3427 > 5.14, we reject H0 at the 5% level of significance and we conclude that the number of coats affects the market value of the pearls.
$H_0: \sigma_b^2 = 0$
\[
f = \frac{SS_B/(m-1)}{SS_{WS}/km(n-1)} = \frac{50.9507}{4.8229} \approx 10.5643
\]
Since 10.5643 > 2.88, we reject H0 at the 5% level of significance and conclude that batches have an effect on the market value of the pearls.
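The three test statistics of Solution 3.14 follow directly from the mean squares in the ANOVA table; a quick check:

```python
# Mean squares from the ANOVA table of Solution 3.14.
MSA, MSB, MSAB, MSWS = 75.1929, 50.9507, 0.3090, 4.8229

f_AB = MSAB / MSWS    # interaction against within-cell error
f_A = MSA / MSAB      # fixed factor A against the interaction
f_B = MSB / MSWS      # random factor B against within-cell error
print(f_AB, f_A, f_B)
```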
We now refer to data in sections 3.3.2 and 3.3.3. The usual model for this type of experiment is
\[
y_{ijh} = \mu + \alpha_i + b_{ij} + \epsilon_{ijh}, \qquad i = 1,\dots,k;\ j = 1,\dots,m;\ h = 1,\dots,n,
\]
where $\sum_{i=1}^{k}\alpha_i = 0$.
In this formulation i refers to the treatments (teaching methods in section 3.3.2.1, sprays in section
3.3.2.2), j refers to the units assigned at random to the treatments (teachers and trees respectively)
and h to the repetitions (children and leaves respectively).
\[
SS_A = mn\sum_{i=1}^{k}(\bar{y}_{i..} - \bar{y}_{...})^2
\]
\[
SS_B = n\sum_{i=1}^{k}\sum_{j=1}^{m}(\bar{y}_{ij.} - \bar{y}_{i..})^2
\]
\[
SS_{\text{Error}} = \sum_{i=1}^{k}\sum_{j=1}^{m}\sum_{h=1}^{n}(y_{ijh} - \bar{y}_{ij.})^2
\]
\[
SS_{\text{Total}} = \sum_{i=1}^{k}\sum_{j=1}^{m}\sum_{h=1}^{n}(y_{ijh} - \bar{y}_{...})^2 .
\]
Example 3.15.
Derive E(M SA ).
Solution 3.15.
The model is yijh = µ + αi + bij + εijh , i = 1, ..., k; j = 1, ..., m; h = 1, ..., n.
\[
\bar{y}_{i..} = \mu + \alpha_i + \frac{1}{m}\sum_{j=1}^{m} b_{ij} + \frac{1}{mn}\sum_{j=1}^{m}\sum_{h=1}^{n}\epsilon_{ijh}
= \mu + \alpha_i + \bar{b}_{i.} + \bar{\epsilon}_{i..}
\]
\[
E(\bar{y}_{i..}) = \mu + \alpha_i, \qquad
Var(\bar{y}_{i..}) = \frac{\sigma_b^2}{m} + \frac{\sigma^2}{mn}
\]
\[
\bar{y}_{...} = \mu + \frac{1}{k}\sum_{i}\alpha_i + \frac{1}{km}\sum_{i}\sum_{j} b_{ij} + \frac{1}{kmn}\sum_{i}\sum_{j}\sum_{h}\epsilon_{ijh}
= \mu + \bar{\alpha}_{.} + \bar{b}_{..} + \bar{\epsilon}_{...}
\]
\[
E(\bar{y}_{...}) = \mu, \qquad
Var(\bar{y}_{...}) = \frac{\sigma_b^2}{km} + \frac{\sigma^2}{kmn}
\]
Now
\[
SS_A = mn\sum_{i}\bar{y}_{i..}^2 - kmn\,\bar{y}_{...}^2
\]
\[
E(SS_A) = mn\sum_{i} E(\bar{y}_{i..}^2) - kmn\,E(\bar{y}_{...}^2)
= mn\sum_{i}\Bigl[\frac{\sigma_b^2}{m} + \frac{\sigma^2}{mn} + (\mu + \alpha_i)^2\Bigr]
- kmn\Bigl[\frac{\sigma_b^2}{km} + \frac{\sigma^2}{kmn} + \mu^2\Bigr]
\]
\[
= kn\sigma_b^2 + k\sigma^2 + kmn\mu^2 + 2mn\mu\sum_{i}\alpha_i + mn\sum_{i}\alpha_i^2
- n\sigma_b^2 - \sigma^2 - kmn\mu^2
\]
\[
= (k-1)n\sigma_b^2 + (k-1)\sigma^2 + mn\sum_{i}\alpha_i^2 \qquad \text{since } \sum_{i}\alpha_i = 0
\]
\[
= (k-1)\Bigl[n\sigma_b^2 + \sigma^2 + \frac{mn}{k-1}\sum_{i}\alpha_i^2\Bigr]
\]
Thus,
\[
E(MS_A) = \frac{E(SS_A)}{k-1}
\]
\[
\therefore\ E(MS_A) = n\sigma_b^2 + \sigma^2 + \frac{mn}{k-1}\sum_{i}\alpha_i^2 .
\]
Activity 3.4.
Derive the other E(M S) as an exercise.
$H_0: \alpha_1 = \dots = \alpha_k = 0$ is rejected if
\[
f = \frac{SS_A/(k-1)}{SS_B/k(m-1)} > F_{\alpha;\,k-1;\,k(m-1)} .
\]
$H_0: \sigma_b^2 = 0$ is rejected if
\[
f = \frac{SS_B/k(m-1)}{SS_E/km(n-1)} > F_{\alpha;\,k(m-1);\,km(n-1)} .
\]
Example 3.16.
Using the data in section 3.3.2.1, perform a complete analysis of variance and interpret the results.
Complete the ANOVA table, state the hypothesis and give all formulas explicitly.
Solution 3.16.
The means and sums of squares of deviations from the means are as follows (the first index refers to teaching methods, the second to teachers and the third to pupils).
\[
\bar{y}_{11.} = 58, \quad \textstyle\sum_h (y_{11h} - \bar{y}_{11.})^2 = (60-58)^2 + \dots + (65-58)^2 = 178
\]
\[
\bar{y}_{12.} = 62, \quad \textstyle\sum_h (y_{12h} - \bar{y}_{12.})^2 = (61-62)^2 + \dots + (72-62)^2 = 166, \qquad \bar{y}_{1..} = 60
\]
\[
\bar{y}_{21.} = 47, \quad \textstyle\sum_h (y_{21h} - \bar{y}_{21.})^2 = (49-47)^2 + \dots + (48-47)^2 = 46
\]
\[
\bar{y}_{22.} = 55, \quad \textstyle\sum_h (y_{22h} - \bar{y}_{22.})^2 = (54-55)^2 + \dots + (63-55)^2 = 132, \qquad \bar{y}_{2..} = 51
\]
\[
\bar{y}_{31.} = 66, \quad \textstyle\sum_h (y_{31h} - \bar{y}_{31.})^2 = (68-66)^2 + \dots + (75-66)^2 = 138
\]
\[
\bar{y}_{32.} = 72, \quad \textstyle\sum_h (y_{32h} - \bar{y}_{32.})^2 = (64-72)^2 + \dots + (75-72)^2 = 114, \qquad \bar{y}_{3..} = 69
\]
\[
\bar{y}_{...} = 60
\]
\[
SS_A = mn\sum_{i=1}^{k}(\bar{y}_{i..} - \bar{y}_{...})^2
= 2 \times 5[(60-60)^2 + (51-60)^2 + (69-60)^2] = 10[0^2 + (-9)^2 + 9^2] = 1\,620
\]
\[
SS_B = n\sum_{i=1}^{k}\sum_{j=1}^{m}(\bar{y}_{ij.} - \bar{y}_{i..})^2
= 5[(58-60)^2 + (62-60)^2 + (47-51)^2 + (55-51)^2 + (66-69)^2 + (72-69)^2] = 290
\]
\[
SS_{\text{Error}} = \sum_{i=1}^{k}\sum_{j=1}^{m}\sum_{h=1}^{n}(y_{ijh} - \bar{y}_{ij.})^2 = 774
\]
\[
SS_T = \sum_{i=1}^{k}\sum_{j=1}^{m}\sum_{h=1}^{n}(y_{ijh} - \bar{y}_{...})^2 = 0^2 + (-5)^2 + \dots + 15^2 = 2\,684
\]
$H_0: \alpha_1 = \alpha_2 = \alpha_3 = 0$
\[
f = \frac{SS_A/(k-1)}{SS_B/k(m-1)} = \frac{1\,620/2}{290/3} \approx 8.3793
\]
$F_{\alpha;k-1;k(m-1)} = F_{0.05;2;3} = 9.55$. Reject H0 if f > 9.55.
Since 8.3793 < 9.55, we do not reject H0 at the 5% level of significance and conclude that there is no significant difference between the teaching methods.
$H_0: \sigma_b^2 = 0$:
\[
f = \frac{SS_B/k(m-1)}{SS_{\text{Error}}/km(n-1)} = \frac{290/3}{774/24} \approx 2.9974
\]
$F_{\alpha;k(m-1);km(n-1)} = F_{0.05;3;24} = 3.01$. Reject H0 if f > 3.01.
Since 2.9974 < 3.01, we do not reject H0 at the 5% level of significance and conclude that the variance ascribable to the teachers is not significant.
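A quick check of the two test statistics from the sums of squares of Solution 3.16:

```python
# Sums of squares from Solution 3.16: k = 3 methods, m = 2 teachers per
# method, n = 5 pupils per teacher.
SSA, SSB, SSE = 1620.0, 290.0, 774.0
k, m, n = 3, 2, 5

f_A = (SSA / (k - 1)) / (SSB / (k * (m - 1)))            # 810 / 96.67
f_B = (SSB / (k * (m - 1))) / (SSE / (k * m * (n - 1)))  # 96.67 / 32.25
print(f_A, f_B)
```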
The SAS JMP graph of the means shows that teaching method 3 led to the highest marks on average, while pupils who were taught by method 2 had the poorest results. SAS JMP does not do the full analysis for this type of experiment, and therefore we display the graph only.
Example 3.17.
Using the data in section 3.3.2.2, perform a complete analysis of variance and interpret the results.
Complete the ANOVA table, state the hypothesis and give all formulas explicitly.
Solution 3.17.
The model is
\[
y_{ijh} = \mu + \alpha_i + b_{ij} + \epsilon_{ijh}, \qquad i = 1,\dots,k;\ j = 1,\dots,m;\ h = 1,\dots,n,
\]
where $\sum_{i=1}^{k}\alpha_i = 0$.
Now i refers to the sprays, that is, i = 1, 2, 3; j refers to the trees, that is, j = 1, 2, 3, 4 and h to
the repetitions (leaves), that is, h = 1, 2, . . . , 6.
Now
\[
\bar{y}_{11.} = 6 \quad \bar{y}_{12.} = 7 \quad \bar{y}_{13.} = 12 \quad \bar{y}_{14.} = 11
\]
\[
\bar{y}_{21.} = 15 \quad \bar{y}_{22.} = 14 \quad \bar{y}_{23.} = 11 \quad \bar{y}_{24.} = 12
\]
\[
\bar{y}_{31.} = 7 \quad \bar{y}_{32.} = 5 \quad \bar{y}_{33.} = 9 \quad \bar{y}_{34.} = 11
\]
\[
SS_A = mn\sum_{i=1}^{k}(\bar{y}_{i..} - \bar{y}_{...})^2
= 4 \times 6[(9-10)^2 + (13-10)^2 + (8-10)^2] = 24[(-1)^2 + 3^2 + (-2)^2] = 336
\]
\[
SS_B = n\sum_{i=1}^{k}\sum_{j=1}^{m}(\bar{y}_{ij.} - \bar{y}_{i..})^2 = 336
\]
\[
SS_{\text{Error}} = \sum_{i=1}^{k}\sum_{j=1}^{m}\sum_{h=1}^{n}(y_{ijh} - \bar{y}_{ij.})^2 = 151 .
\]
$H_0: \alpha_1 = \alpha_2 = \alpha_3 = 0$
\[
f = \frac{SS_A/(k-1)}{SS_B/k(m-1)} = \frac{336/2}{336/9} = 4.5
\]
$F_{\alpha;k-1;k(m-1)} = F_{0.05;2;9} = 4.26$. Reject H0 if f > 4.26.
Since 4.5 > 4.26, we reject H0 at the 5% level of significance and conclude that the effects of the sprays are significantly different from one another.
$H_0: \sigma_b^2 = 0$:
\[
f = \frac{SS_B/k(m-1)}{SS_{\text{Error}}/km(n-1)} = \frac{336/9}{151/60} \approx 14.8344
\]
$F_{\alpha;k(m-1);km(n-1)} = F_{0.05;9;60} = 2.04$. Reject H0 if f > 2.04.
Since 14.8344 > 2.04, we reject H0 at the 5% level of significance and conclude that the variance ascribable to the trees is significant; that is, the hypothesis $\sigma_b^2 = 0$ is rejected.
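Again, the two test statistics can be checked directly from the sums of squares:

```python
# Sums of squares from Solution 3.17: k = 3 sprays, m = 4 trees per spray,
# n = 6 leaves per tree.
SSA, SSB, SSE = 336.0, 336.0, 151.0
k, m, n = 3, 4, 6

f_A = (SSA / (k - 1)) / (SSB / (k * (m - 1)))            # 168 / 37.33
f_B = (SSB / (k * (m - 1))) / (SSE / (k * m * (n - 1)))  # 37.33 / 2.52
print(f_A, f_B)
```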
The SAS JMP graph follows - you should be able to interpret it.
Experiments often include more than two factors; the data in section 3.1.3 are an example of a three-way analysis of variance. The model is
\[
y_{ijk\ell} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + (\alpha\beta\gamma)_{ijk} + \epsilon_{ijk\ell},
\]
\[
i = 1,\dots,a;\ j = 1,\dots,b;\ k = 1,\dots,c;\ \ell = 1,\dots,n;
\]
with
\[
\sum_{i=1}^{a}\alpha_i = \sum_{j=1}^{b}\beta_j = \sum_{k=1}^{c}\gamma_k
= \sum_{i=1}^{a}(\alpha\beta)_{ij} = \sum_{j=1}^{b}(\alpha\beta)_{ij}
= \sum_{i=1}^{a}(\alpha\gamma)_{ik} = \sum_{k=1}^{c}(\alpha\gamma)_{ik}
= \sum_{j=1}^{b}(\beta\gamma)_{jk} = \sum_{k=1}^{c}(\beta\gamma)_{jk}
\]
\[
= \sum_{i=1}^{a}(\alpha\beta\gamma)_{ijk} = \sum_{j=1}^{b}(\alpha\beta\gamma)_{ijk} = \sum_{k=1}^{c}(\alpha\beta\gamma)_{ijk} = 0 .
\]
The three-way analysis of variance is an expansion of the method used for the two-way analysis
of variance. Since the calculations are complicated and time-consuming, SAS JMP, or a similar
computer package, is usually used.
Example 3.18.
In this example interactions were ignored and you will find the SAS JMP output fairly simple to
interpret.
Example 3.19.
A three-way analysis of variance was performed to investigate the effect of rainfall (factor A), type
of soil (factor B) and fertilizer (factor C) on yield, allowing for possible interactions.
Rainfall 1 2
Soil type 1 2 1 2
Fertilizer
1 18.6 10.2 12.3 12.3
18.8 10.5 10.9 10.7
15.8 8.5 10.7 12
17.9 10.6 11.8 11.8
2 19.4 14.7 13.2 10.2
18.9 17.8 15.5 8.5
18.4 15.6 11 7.1
18.9 16.5 11.8 6.5
3 15.9 20.9 19.4 13.2
16.5 21 21.2 13.5
17.2 21.03 20.4 12.8
16.5 20.4 20.3 15.2
The AC-interaction seems to be the strongest and the most prominent feature in figure 3.20(middle)
is the following:
The combination of level 2 of C and level 2 of A contributes to a very low yield; on the other hand, the combination of level 2 of C and level 1 of A contributes to a relatively high yield.
Figure 3.21 and figure 3.22 are representations of the ABC-interaction: figure 3.21 represents the
AC-interaction for soil type 1 and figure 3.22 represents the AC-interaction for soil type 2.
If figure 3.21 and figure 3.22 had been the same, you would be inclined to conclude that the ABC-
interaction was not significant. The difference between figure 3.21 and figure 3.22 is obvious, thus
the results of the ANOVA table for the significance of the ABC-interaction are confirmed.
Figure 3.20 left and middle and figures 3.21 and 3.22: the 1 and 2 inside the graph represent levels
1 and 2 for rainfall;
Figure 3.20 right: the 1 and 2 inside the graph represent levels 1 and 2 for soil.
Exercise 3.1
(ii) the hypotheses and conclusions you expect to test and make from the analysis WITH-
OUT actually performing the analysis.
SCENARIOS
(a) The effect of salinity on the growth of fish (measured by increase in weight) was examined. A sample of five fish was measured at each salinity: full-strength seawater (32%), brackish water (18%) and fresh water (0.5%). Analyse the experiment.
(b) An experiment was conducted to test the effects of three different levels of factor A with
three different levels of factor B. The experiment called for nine identical experimental
units, each to be tested at one of the combinations of factors A and B. Analyse the
experiment.
(c) Given the scenario in (b), there was some controversy about the assumption of no inter-
action between factors A and B and it was decided to test 36 experimental units, four
in each combination, to allow for the possibility of interaction. Analyse the experiment.
2. Four chemical treatments for curtain fabrics were tested with regard to their ability to im-
prove colour fastness. Due to the limited quantities of the two types of fabrics available for
the experiment, it was decided to apply each chemical to a sample of each type of fabric. The
results are expressed in percentages with regard to the retaining of colour, after the treated
fabrics were submitted to severe testing:
Fabric
1 2
1 53 92
Treatment 2 32 81
3 79 99
4 38 67
(b) Are there significant differences in the effect of treatments on fabrics (α = 0.05)?
(d) What would you recommend in order to make any claims as to the effectiveness of the
treatments?
3. The strain readings of glass cathode supports from five different machines were investigated.
Each machine had four “heads” on which the glass was formed, and four samples were taken
from each head. The data were as follows:
Machine A B C
Head 1 2 3 4 5 6 7 8 9 10 11 12
6 13 1 7 10 2 4 0 0 10 8 7
2 3 10 4 9 1 1 3 0 11 5 2
0 9 0 7 7 1 7 4 5 6 0 5
8 8 6 9 12 10 9 1 5 7 7 4
Machine D E
Head 13 14 15 16 17 18 19 20
11 5 1 0 1 6 3 3
0 10 8 8 4 7 0 7
6 8 9 6 7 0 2 4
4 3 4 5 9 3 2 0
Analyse and interpret the results. Use the 5% level of significance. Give the model, ANOVA
table, hypotheses, test statistics and conclusions explicitly.
4. Consider the two-way analysis of variance: model II. Show that $E(MS_{AB}) = \sigma^2 + n\sigma_{ab}^2$.
5. An experiment was conducted to determine the effect of different pressure levels on the
products manufactured by a machine. A summary of the data recorded from products man-
ufactured by the machine at each of four randomly selected pressure levels is given as follows:
\[
\bar{y}_{1\cdot} = 27.4 \quad \bar{y}_{2\cdot} = 29.8 \quad \bar{y}_{3\cdot} = 30.7 \quad \bar{y}_{4\cdot} = 29.2 \quad \bar{y}_{\cdot\cdot} = 29.275
\]
\[
n_1 = n_2 = n_3 = n_4 = 10, \qquad \sum_i\sum_j (y_{ij} - \bar{y}_{\cdot\cdot})^2 = 207.975
\]
\[
y_{ij} = \mu + a_i + \epsilon_{ij}, \qquad i = 1,\dots,k = 4;\ j = 1,\dots,n = 10
\]
(a) Complete the ANOVA table and test at the 5% level whether the change in pressure
has a significant effect on the products.
(b) Estimate the proportion of the total variance ascribable to pressure variation and com-
pute a 90% confidence interval.
A
25◦ 30◦ 35◦
(d) Now assume that the three levels of factor A constitute a random sample from a large
population of possible levels of A. Give the appropriate model and do questions (i)-(iii)
of (c) for the model in (d).
Chapter 4
REGRESSION ANALYSIS
4.1 Introduction
As was said in chapter 2, the regression model is a special case of the general linear model
\[
y = X\beta + \epsilon
\]
in which each column of X (except possibly the first) represents a series of values of a continuous
variable. The variables represented by the columns of X are called the independent variables or
predictors. The rows of X and y represent, as usual, cases or data points. We already know that
the least squares estimator of β is
\[
\hat{\beta} = (X'X)^{-1}X'y
\]
and that
\[
Cov(\hat{\beta}, \hat{\beta}') = \sigma^2 (X'X)^{-1} .
\]
Furthermore, $\sigma^2$ is estimated by
\[
\hat{\sigma}^2 = \frac{(y - X\hat{\beta})'(y - X\hat{\beta})}{n - p} .
\]
4. REGRESSION ANALYSIS 240
The analysis of variance has been treated in detail, but the interpretation of the various quadratic
forms in the regression model is discussed here.
The observations are represented by the dots. In figure 4.1 the observations are projected onto
the y-axis without regard to the x-values. The variation between these projected values, depicted
by crosses, reflects the total variation in the y-values. If we did not know about the x-values, we
would ascribe this variation to chance.
In figure 4.2 a straight line is fitted to the data and the observations projected parallel to the fitted
line. The variation among these projected points is the variation about the regression line and is
now regarded as the chance variation. The amount of decrease in the variation is said to be the
variation explained by the regression or the variation due to regression.
In figure 4.3 the predicted values, which are obtained by projecting the data points vertically onto
the regression line, are in turn projected onto the y-axis. The variation between these projected
points is said to be due to the regression because there would have been no such variation if the
slope of the fitted line had been zero, that is if y had not depended on x.
The residual and total rows of the ANOVA table are

Residual: $\sum(y_i - \hat{y}_i)^2$, d.f. $n-p$, MS $\sum(y_i - \hat{y}_i)^2/(n-p)$, E(MS) $\sigma^2$
Total: $\sum(y_i - \bar{y})^2$, d.f. $n-1$

where in the E(MS) column $\beta_1$ is the $(p-1) \times 1$ vector obtained by deleting the first element of $\beta$, and $A_{11}$ is the matrix obtained by deleting the first row and column of $(X'X)^{-1}$. In the case $p = 2$, that is the regression of y on one variable x, say $y = \beta_1 + \beta_2 x + \epsilon$, we have
\[
E\Bigl(\sum(\hat{y}_i - \bar{y})^2/(p-1)\Bigr) = \sigma^2 + \beta_2^2\sum(x_i - \bar{x})^2 .
\]
The following data, with the dependent variable the time needed to complete a race (y) and the
independent variable the time spent on exercise in the two months prior to the race (x) were
analysed with regard to the relation between the dependent and the independent variable.
x : 178.6 221.6 190.5 227.7 206.2 245.6 209.1 260.3 212.2 280.1
y: 27.1 39.8 32.6 42.3 33.6 44.8 33.8 50.9 38.3 54.1
A straight line was fitted and a regression analysis was performed with the following SAS JMP
output in figure 4.4:
Since the F-value in the computer output is very large, the hypothesis H0 : β0 = β1 = 0 is rejected
and the assumption that the straight line describes the relation between x and y, can be accepted.
“R-squared” given in the computer output is defined as the percentage of the variation in y which
is accounted for by the fitted equation. Thus, 97.26% of the variation in y is accounted for by the model.
If we have repeated measurements in some or all of the points, that is if there are sets of equal
rows in the matrix X, a more precise analysis can be made, namely a lack of fit test.
Thus, suppose that $y_{11},\dots,y_{1n_1}$ correspond to one set of equal rows of X, and let $\bar{y}_{1.} = \frac{1}{n_1}\sum_{i=1}^{n_1} y_{1i}$; suppose $y_{21},\dots,y_{2n_2}$ correspond to a second set of equal rows and let $\bar{y}_{2.} = \frac{1}{n_2}\sum_{i=1}^{n_2} y_{2i}$, et cetera. Suppose there are c such sets with sizes $n_1,\dots,n_c$ such that $\sum_{i=1}^{c} n_i = n$, the total sample size.
Then
\[
SS_{\text{Error}} = \sum_{i=1}^{c}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_{i.})^2
\]
is said to be the sum of squares due to pure error, since the only explanation for a variation among
observations which correspond to equal rows of X is random variation.
The difference between the observed mean value ȳi. and the predicted value for the i-th set ŷi , is
an indication of how well the model y = Xβ + fits the data. We write
\[
SS_{\text{Lack of fit}} = \sum_{i=1}^{c} n_i(\hat{y}_i - \bar{y}_{i.})^2 .
\]
SSLack of fit is a measure of the adequacy of the suggested model. We can write
Figure 4.7: Differences between means and predicted values suggest the model is inadequate
Lack of fit: $\sum_{i=1}^{c} n_i(\hat{y}_i - \bar{y}_{i.})^2$, d.f. $c-p$, MS $SS_{\text{Lof}}/(c-p)$, E(MS) $\sigma^2 + \xi^2/(c-p)$
Error: $\sum_{i=1}^{c}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_{i.})^2$, d.f. $n-c$, MS $SS_E/(n-c)$, E(MS) $\sigma^2$
Total: $\sum_i\sum_j (y_{ij} - \bar{y}_{..})^2$, d.f. $n-1$
where
\[
\xi^2 = \sum_{i=1}^{c} n_i\bigl(E(\bar{y}_{i.}) - X_i'\beta\bigr)^2
\]
and where X 0i is the row of X corresponding to the i-th group of observations. We see that ξ = 0
if, for all i,
E(ȳi. ) = X 0i β
that is if the model contains all important independent variables (which have an influence on y).
Example 4.1.
A study was performed on the number of parts assembled in a factory as a function of the time
spent on a specific job. Twelve employees, divided into three groups, were assigned to three time
intervals, with the following results:
Solution 4.1.
(a) A straight line was fitted first to the data, with the time levels coded $-1$, $0$ and $1$ (four employees per level):
\[
X' = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ -1 & -1 & -1 & -1 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \end{pmatrix}
\]
\[
X'X = \begin{pmatrix} 12 & 0 \\ 0 & 8 \end{pmatrix}, \qquad
(X'X)^{-1} = \begin{pmatrix} \frac{1}{12} & 0 \\ 0 & \frac{1}{8} \end{pmatrix}
\]
With $y' = (27,\ 32,\ 26,\ \dots,\ 52,\ 49)$,
\[
X'y = \begin{pmatrix} 469 \\ 77 \end{pmatrix}
\]
\[
\hat{\beta} = (X'X)^{-1}X'y
= \begin{pmatrix} \frac{1}{12} & 0 \\ 0 & \frac{1}{8} \end{pmatrix}\begin{pmatrix} 469 \\ 77 \end{pmatrix}
= \begin{pmatrix} 39.0833 \\ 9.625 \end{pmatrix} .
\]
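The estimate β̂ can be verified with NumPy; the sketch below rebuilds the design matrix from the coded levels and reuses the totals X′y = (469, 77)′ given above:

```python
import numpy as np

# Rebuild the design matrix (intercept plus the coded levels -1, 0, 1, four
# employees per level) and reuse the totals X'y = (469, 77)' from the solution.
X = np.column_stack([np.ones(12), np.repeat([-1.0, 0.0, 1.0], 4)])
XtX = X.T @ X                          # diag(12, 8)
Xty = np.array([469.0, 77.0])

beta_hat = np.linalg.solve(XtX, Xty)
print(beta_hat)                        # [39.0833..., 9.625]
```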
with
\[
\sum_i\sum_j (y_{ij} - \bar{y}_{..})^2 = 982.9167 .
\]
The within-group sums of squares (deviations from the group means $\bar{y}_{1.} = 29.75$, $\bar{y}_{2.} = 38.5$ and $\bar{y}_{3.} = 49$) are
\[
\sum_j (y_{1j} - \bar{y}_{1.})^2 = (27 - 29.75)^2 + \dots + (34 - 29.75)^2 = 44.75
\]
\[
\sum_j (y_{2j} - \bar{y}_{2.})^2 = (35 - 38.5)^2 + \dots + (47 - 38.5)^2 = 169
\]
\[
\sum_j (y_{3j} - \bar{y}_{3.})^2 = (45 - 49)^2 + \dots + (49 - 49)^2 = 26
\]
\[
SS_{\text{Error}} = \sum_i\sum_j (y_{ij} - \bar{y}_{i.})^2 = 44.75 + 169 + 26 = 239.75
\]
\[
SS_{\text{Reg}} = \sum_{i=1}^{c} n_i(\hat{y}_i - \bar{y}_{..})^2
= 4(29.4583 - 39.0833)^2 + 4(39.0833 - 39.0833)^2 + 4(48.7083 - 39.0833)^2 = 741.125
\]
\[
SS_{\text{Lof}} = \sum_{i=1}^{c} n_i(\hat{y}_i - \bar{y}_{i.})^2
= 4(29.4583 - 29.75)^2 + 4(39.0833 - 38.5)^2 + 4(48.7083 - 49)^2 = 2.0417
\]
Source        SS                                                 d.f.        MS
Regression    $\sum n_i(\hat{y}_i - \bar{y}_{..})^2 = 741.125$   $p-1 = 1$   741.125
Lack of fit   $\sum n_i(\hat{y}_i - \bar{y}_{i.})^2 = 2.0417$    $c-p = 1$   2.0417
Error         $\sum\sum(y_{ij} - \bar{y}_{i.})^2 = 239.75$       $n-c = 9$   26.6389
Total         $\sum\sum(y_{ij} - \bar{y}_{..})^2 = 982.9167$     $n-1 = 11$
Since $f = MS_{\text{Lof}}/MS_E = 2.0417/26.6389 \approx 0.0766$ is far below the critical value $F_{0.05;1;9} = 5.12$, the lack of fit is not significant and the straight-line model is not contradicted by the data.
If the lack of fit test is significant, one will have to rethink the model: more terms may have to be
included, the variables may have to be transformed or one may have to design a completely new
experiment.
In many practical applications of regression analysis the straight line seems to be inadequate to
deal with the complexity of the problem. This problem can be solved by application of multiple
regression or polynomial regression.
In certain types of problems one may not know exactly which kind of function to fit to a given set
of data, since many functions may be written in the approximate form
f (x) = c0 + c1 x + c2 x2 + · · ·.
It is often possible to approximate the unknown function by means of a polynomial. This fit is
usually fairly close over a limited interval. In matrix form the approximate model then is
[ y1 ]   [ 1  x1  · · ·  x1^k ] [ β0 ]   [ ε1 ]
[ ⋮  ] = [ ⋮   ⋮          ⋮   ] [ ⋮  ] + [ ⋮  ]
[ yn ]   [ 1  xn  · · ·  xn^k ] [ βk ]   [ εn ]
β̂ = (X′X)⁻¹X′y, that is,

[ β̂0 ]   [ n        Σxi        Σxi²       · · ·  Σxi^k    ]⁻¹ [ Σyi      ]
[ β̂1 ]   [ Σxi      Σxi²       Σxi³       · · ·  Σxi^(k+1)]   [ Σxi yi   ]
[  ⋮ ] = [  ⋮                                     ⋮       ]   [  ⋮       ]
[ β̂k ]   [ Σxi^k    Σxi^(k+1)  Σxi^(k+2)  · · ·  Σxi^(2k) ]   [ Σxi^k yi ]
1. If one fits a polynomial of degree k, that is, one with p = k + 1 parameters, then one has to
have at least k + 2 observations (that is, n > k + 1), and at least k + 1 different values of x
(that is, c ≥ k + 1, where c is defined in section 4.2). If the last restriction is violated, X′X
will be singular. If the first restriction is violated, the degrees of freedom for estimating
σ² will be n − p = n − k − 1 ≤ 0.
Figure 4.8:
4. There is a particular danger in using a fitted polynomial to extrapolate far outside the data.
As in figure 4.8 the polynomial is likely to become completely inappropriate outside the range
of observation.
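The fit itself reduces to assembling X column by column and solving the normal equations displayed above. A minimal sketch (illustrative data only; numpy assumed available) for a quadratic, so that p = 3:

```python
import numpy as np

# Illustrative data: n = 6 observations, 6 distinct x-values, degree k = 2,
# so n > k + 1 and c >= k + 1, and the normal equations have a unique solution.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 7.2, 12.8, 21.1, 30.9])
k = 2

X = np.vander(x, k + 1, increasing=True)   # columns 1, x, x^2, ..., x^k
XtX = X.T @ X                              # entries are the sums of x_i^(j+l)
beta = np.linalg.solve(XtX, X.T @ y)       # solves (X'X) beta = X'y
print(beta)                                # estimates b0, b1, ..., bk
```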
• Illustration
An investigator wants to determine the relation between the dependent variable y and the inde-
pendent variable x in the following data set:
x : 38 45 49 57 69 78 84 89 79 64 54 41
y : 99 91 78 61 55 63 80 95 65 56 74 93
The SAS JMP output for the regression analysis can be seen in figure 4.9.
For the straight-line fit, R² = 0.1241; thus only 12.41% of the variability in y is explained by the model. Among the regression coefficients only the intercept is significant, with a p-value close to zero.
For the quadratic fit, R² = 0.9197; thus 91.97% of the variability in y is explained by the model. The regression coefficients are all significant, with p-values less than 0.05.
The quadratic equation is a better fit, as seen in the graph and in the "R-square" values.
y = β0 + β1 x1 + β2 x2 + . . . + βk xk + ε.
The estimated model ŷ = β̂0 + β̂1x1 + · · · + β̂kxk is chosen in such a way as to minimise
SSE = Σᵢ₌₁ⁿ (yi − ŷi)².
The degree of difficulty of the calculations is the basic difference between the fitting of the straight
line and multiple regression models. With regard to the above model, (k + 1) linear equations are
solved simultaneously to obtain the (k + 1) estimated coefficients β̂0 , β̂1 , · · · , β̂k .
Several computer packages are available to perform multiple regression analyses, and since the
output of the programs is similar, the interpretation is fairly simple.
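As a sketch of what such a package computes internally, the function below fits a multiple regression and reports R². The data are made up and numpy is assumed available; this is an illustration, not the output of any particular package.

```python
import numpy as np

def multiple_regression(X_raw, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bk*xk; returns (beta, R2)."""
    X = np.column_stack([np.ones(len(y)), X_raw])   # prepend constant column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_error = resid @ resid
    ss_total = ((y - y.mean()) ** 2).sum()
    return beta, 1.0 - ss_error / ss_total

# Made-up data with k = 2 independent variables
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(20, 2))
y = 3.0 + 2.0 * X_raw[:, 0] - 1.0 * X_raw[:, 1] + rng.normal(scale=0.1, size=20)
beta, r2 = multiple_regression(X_raw, y)
print(beta, round(r2, 4))
```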
4.4.1 Illustration
An investigator suspects that factors x1, x2 and x3 and/or combinations of these factors affect the dependent variable y.
A multiple regression analysis was suggested and performed on the following data set:
y x1 x2 x3
87 40 11 14
133 36 13 30
174 34 19 30
385 41 33 39
363 39 25 33
274 42 23 34
235 40 22 37
104 31 9 20
141 36 13 27
208 34 17 40
115 30 18 19
271 40 23 31
163 37 14 35
193 41 13 28
203 38 24 31
279 38 31 35
179 24 16 26
244 45 19 34
165 34 20 30
257 40 30 38
252 41 22 35
280 42 21 41
167 35 16 23
168 33 18 24
115 36 18 21
The F-statistic indicates that the βs are not all zero, since f = 13.6612 > F0.05;6;18 = 2.66.
R² = 1 − SSError/SSTotal,
thus representing the proportion of the sample variance of the y-values accounted for by the estimated model. In this case R² = 0.8199; thus 81.99% of the variability in y is accounted for by the model.
The p-value in the table is the probability, under the null hypothesis that the coefficient concerned is zero, of observing a test statistic at least as extreme as the one obtained.
Figure 4.11:
A small p-value implies that the coefficient differs significantly from zero at every significance level α > p; for example, if p = 0.08 then the coefficient is significantly different from zero at the 10% level (α = 0.10) but not at the 5% level (α = 0.05).
It often happens that one has a large number of independent variables available, and one has to
make a decision as to which variables to use in the regression equation. There are arguments in
favour of both extreme views: include as many variables as possible and include as few variables
as possible. We shall go into some of these arguments.
y = β0 + β1x1 + β2x2 + ε
and
y = β0* + β1*x1 + ε*.
If β2 = 0 the two models are equivalent, but if β2 ≠ 0 they are different. If β2 ≠ 0 and one
nevertheless fits the second model, then β̂0* and β̂1* will be biased, that is E(β̂0*) ≠ β0* and E(β̂1*) ≠ β1*,
because E(ε*) = E(β2x2 + ε) = β2x2 and thus the assumption E(ε*) = 0 is not true. For given
values of x1 and x2 the predicted value of the corresponding response y will be
ŷ* = β̂0* + β̂1*x1
under the smaller model, and
ŷ = β̂0 + β̂1x1 + β̂2x2
under the larger model.
Thus Var(ŷ) ≥ Var(ŷ*). This result holds generally: every additional variable included in the
equation increases the variance of a prediction.
The inclusion of too many variables in the equation has other serious side-effects as well:
1. The matrix X′X becomes very large and often ill-conditioned (that is, singular or nearly
singular), with the result that numerical accuracy is lost in computing (X′X)⁻¹.
2. A regression equation with many terms is cumbersome to use, and often one may have to
observe variables which are difficult or expensive to obtain in order to compute a prediction.
One should always aim at simplicity (within reason, of course).
Consider the model
y = Xβ + ε
and let
η = x′β
be the true response which we are trying to predict. Let η̂ = x′β̂ be the estimated response, and
let E(η̂) = x′E(β̂) be the expected value of the estimated response. If E(β̂) = β then E(η̂) = η
and the estimator is unbiased. However, a biased estimator may sometimes be preferable, as will
now be seen. Write
η̂ − η = (η̂ − E(η̂)) + (E(η̂) − η).
Taking squares and then expectations on both sides, and noting that the expected value of
(η̂ − E(η̂))(E(η̂) − η) is equal to zero, we obtain
E(η̂ − η)² = E(η̂ − E(η̂))² + (E(η̂) − η)²
or
E(η̂ − η)² = Var(η̂) + (η − E(η̂))².
The term on the left is the mean (or expected) squared error of prediction. The first term on the
right is obviously the variance of η̂ and the last term is the measure of the bias of η̂. Suppose now
that, for a given set of data, we add the variables x1 , x2 · · · , xp to the equation one by one in that
order.
We have already seen that the variance of the prediction will increase after the inclusion of every
variable while the squared bias will decrease to zero if all important variables are included in our
set of variables. As is seen in figure 4.14, the mean squared error is likely to decrease and then
increase: thus an optimum subset of the variables x1 , · · · , xp will exist which yields the optimum
prediction with the smallest mean squared error. One will never know which variables they are,
but one should know that they exist.
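This trade-off can be examined by simulation: predict at a fixed point with the full model and with a reduced model that omits one important variable, and estimate the variance, squared bias and mean squared error of each predictor over many simulated data sets. All numbers below are made up; numpy is assumed available.

```python
import numpy as np

# Monte Carlo check of E(eta_hat - eta)^2 = Var(eta_hat) + (bias)^2,
# comparing the full model with a reduced model that omits x2.
rng = np.random.default_rng(1)
n, reps = 30, 4000
x1 = np.linspace(0.0, 1.0, n)
x2 = x1 ** 2                               # second variable, correlated with x1
b = np.array([1.0, 2.0, 1.5])              # true coefficients (b0, b1, b2)
sigma = 0.5
x0 = np.array([1.0, 0.9, 0.81])            # prediction point (1, x1, x2)
eta = x0 @ b                               # true response at x0

X_full = np.column_stack([np.ones(n), x1, x2])
X_red = X_full[:, :2]                      # reduced model omits x2

pred = {"full": [], "reduced": []}
for _ in range(reps):
    y = X_full @ b + rng.normal(scale=sigma, size=n)
    pred["full"].append(x0 @ np.linalg.lstsq(X_full, y, rcond=None)[0])
    pred["reduced"].append(x0[:2] @ np.linalg.lstsq(X_red, y, rcond=None)[0])

results = {}
for name, values in pred.items():
    v = np.array(values)
    results[name] = (((v - eta) ** 2).mean(), v.var(), (v.mean() - eta) ** 2)
    print(name, "MSE=%.4f Var=%.4f bias^2=%.5f" % results[name])
```

The full model shows the larger variance, the reduced model the larger squared bias, exactly as described above.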
Figure 4.12:
Figure 4.13:
Figure 4.14:
We describe a number of methods which may be employed in order to select the variables for
inclusion in the equation. In the final analysis the choice should not be based on statistical con-
siderations alone, but on knowledge of the practical situation as well. Methods such as stepwise
regression analysis may be used to help you to think about a problem, but can never be a substi-
tute for thinking. None of the methods to be described can be said to be superior to all others,
and any of them may be preferable in some circumstances.
1. The X′X matrix may be ill-conditioned, and it may be virtually impossible to fit the complete
model.
2. If two independent variables are highly correlated, it may well happen that their coefficients
are both insignificant, but that they are jointly significant. Exclusion of both will thus result
in a loss of information.
Significance level
If one tests p coefficients, one must bear in mind that the overall significance level may be larger
than one thinks. A safe procedure would be to use the level α/p for each test, in which case the
overall level will at most be α.
1. If p is large, one may have a tremendous amount of computing to do, and it will be virtually
impossible even to read all the answers which emerge from the computer; the approach is
wasteful in that case.
2. It is virtually impossible to attach a specific significance level to this procedure in any prac-
tical application, since the distributional properties of the process are too complicated.
Forward selection
The forward selection method works as follows: First select the independent variable which is
most highly correlated with y. If this correlation coefficient is r1 say, then R12 = r12 is the squared
multiple correlation coefficient between y and the variable selected. Next, that variable is selected
which has the highest partial correlation coefficient with y, given the variable already selected.
This variable also causes the largest increase in R2 . Thus the variables selected in the first two
steps have the largest R2 with y of all pairs of variables which include the variable selected first.
For example, if x3 is included first and then x6 , then R2 of y with x3 and x6 will be larger than
R2 of y with x3 and x1 , with x3 and x2 , x3 and x4 , et cetera. In this way one proceeds until the
addition of further variables causes the R2 to increase by very small amounts.
1. From a distribution theoretical point of view this approach is so complex that it is impossible
to maintain a specified significance level. The cutoff point has to be selected subjectively.
2. It may happen that the subset selected at a particular step is not the optimal subset. For
example, x3 , x4 and x6 may be selected after three steps, while x2 , x5 and x6 may have a
larger R2 with y. In practice it has been found that the subset selected by this method
differs only rarely from the best subset, and usually the optimal R2 is only slightly higher
than the R2 selected. Theoretically the differences may be substantial, however.
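A minimal version of forward selection, driven purely by the increase in R², might look as follows (made-up data; numpy assumed available; the cutoff min_gain plays the role of the subjectively chosen stopping rule mentioned above):

```python
import numpy as np

def r_squared(cols, y):
    """R^2 of the least-squares regression of y on a constant plus cols."""
    X = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

def forward_select(X_raw, y, min_gain=0.01):
    """Add, at each step, the variable giving the largest increase in R^2."""
    selected, best = [], 0.0
    remaining = list(range(X_raw.shape[1]))
    while remaining:
        gains = [(r_squared([X_raw[:, c] for c in selected]
                            + [X_raw[:, j]], y), j) for j in remaining]
        new_best, j_best = max(gains)
        if new_best - best < min_gain:       # R^2 increases too little: stop
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best = new_best
    return selected, best

# Made-up data: only variables 1 and 3 actually influence y
rng = np.random.default_rng(2)
X_raw = rng.normal(size=(40, 5))
y = 2.0 * X_raw[:, 1] - 3.0 * X_raw[:, 3] + rng.normal(scale=0.5, size=40)
selected, best = forward_select(X_raw, y)
print(selected, round(best, 4))
```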
Backward elimination
This method is the reverse of the previous one. The full set is selected at first. Next, the variable
which will cause the smallest decrease in R2 if deleted from the equation is eliminated and the
equation is recomputed. This process is continued until the decrease in R2 is intolerably large,
and the previous equation is selected. This procedure is preferred to the previous one by many
authors, but it has the disadvantages of the previous one:
Stepwise regression
This is a variation on the forward selection method and is used very widely. After the inclusion
of each variable, all variables already in the equation are scrutinised to see whether any one of
them could be deleted without decreasing R² by much. Suppose x4 is selected first, then x3 and
x5 . After three steps it may be found that x4 is no longer necessary since it is highly correlated
with x3 and x5 , and the information about y contained in x4 is also contained in x3 and x5 . The
disadvantages of this method are:
2. Even though one is required to supply an "F-level for inclusion" and an "F-level for exclusion",
it is impossible to attach a specific significance level to the procedure. Incidentally, the “F-
levels” commonly used in existing computer programs actually refer to critical values of the
F-distribution, ostensibly with 1 and n − k − 1 degrees of freedom at the k-th step, but exact
critical values for this procedure are unobtainable. The best recommendation is to select
small “critical values” and let the procedure continue for as many steps as are possible, and
then to make a subjective judgement based on the series of R2 s and any other (non-statistical)
information available.
One of the most widely used methods is the forward stepwise regression.
Recall when using the p-values, a coefficient is significant if the p-value is less than α. It means
H0 : βi = 0 is rejected in favour of H1 : βi 6= 0 with the probability of a type I error equal to α.
Before performing a stepwise regression we choose two alpha values, called αentry and αstay
respectively, which determine whether a variable enters or stays in the model. The αentry is the "probability of a type
I error related to entering an independent variable into the regression model”. The αstay is the
“probability of a type I error related to retaining an independent variable previously entered into
the model.” Normally the value of 0.05 is used for both αentry and αstay .
Thus, a variable will enter if its p-value is less than 0.05 and a variable will remain if its p-value
after adding the other variable is less than 0.05. For example, suppose we have three independent
variables x1 , x2 and x3 . If we perform simple linear regression of each variable on y and we obtain
the following p-values for each model:
y = β0 + β1x1 + ε      p-value (0.0004)
y = β0 + β1x2 + ε      p-value (0.006)
y = β0 + β1x3 + ε      p-value (0.002)
Of these variables x1 is the most significant one, thus the model y = β0 + β1x1 + ε will be
considered, since its p-value of 0.0004 is highly significant. Thus the hypothesis H0 : β1 = 0 is
rejected since 0.0004 is less than αentry = 0.05.
In the next step suppose the possible regression models with two independent variables gave the
models of the form:
y = β0 + β1x1 + β2x2 + ε      p-values (0.0015) (0.0163)
y = β0 + β1x1 + β2x3 + ε      p-values (0.0506) (0.1225)
Note that in the second equation the p-value for β2 is now 0.1225, which is greater than 0.05;
thus x3 cannot stay in the model. For the first model, however, the p-value for β2 is 0.0163, thus
x1 and x2 are retained for further use in the stepwise regression.
Suppose that when all three variables are entered the model is:
y = β0 + β1x1 + β2x2 + β3x3 + ε      p-values (0.0185) (0.0578) (0.4627)
The p-value for β3 is 0.4627, which is greater than 0.05. Thus it is not significant at the αentry
level, so this model cannot be chosen and we come back to
y = β0 + β1x1 + β2x2 + ε.
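A simplified sketch of this procedure, using coefficient p-values with αentry = αstay = 0.05, is given below. It is an illustration only (made-up data; numpy and scipy assumed available) and omits the safeguards of a production routine.

```python
import numpy as np
from scipy import stats

def coef_pvalues(cols, y):
    """Two-sided t-test p-values; a constant column is prepended."""
    n = len(y)
    X = np.column_stack([np.ones(n)] + cols)
    k = X.shape[1]
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    s2 = resid @ resid / (n - k)
    se = np.sqrt(s2 * np.diag(XtX_inv))
    return 2.0 * stats.t.sf(np.abs(beta / se), n - k)

def stepwise(X_raw, y, a_entry=0.05, a_stay=0.05):
    """Simplified stepwise selection on coefficient p-values."""
    selected = []
    while True:
        # Entry step: candidate with the smallest p-value, if below a_entry.
        entries = [(coef_pvalues([X_raw[:, c] for c in selected]
                                 + [X_raw[:, j]], y)[-1], j)
                   for j in range(X_raw.shape[1]) if j not in selected]
        if not entries or min(entries)[0] >= a_entry:
            return selected
        selected.append(min(entries)[1])
        # Stay step: drop variables whose p-value now exceeds a_stay.
        while selected:
            pvs = coef_pvalues([X_raw[:, c] for c in selected], y)[1:]
            worst = int(np.argmax(pvs))
            if pvs[worst] <= a_stay:
                break
            selected.pop(worst)

# Made-up data: only variables 0 and 2 actually influence y
rng = np.random.default_rng(3)
X_raw = rng.normal(size=(50, 4))
y = 1.5 * X_raw[:, 0] + 2.0 * X_raw[:, 2] + rng.normal(size=50)
chosen = stepwise(X_raw, y)
print(chosen)
```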
• Application to illustration 4.4.1 A stepwise regression was performed on the data of illus-
tration 4.4.1.
Note: The stepwise regression algorithm allows an X-variable, brought into the model at an earlier
stage, to be dropped subsequently if it is no longer helpful in conjunction with variables
added at later stages.
The comparison of “R-squared (Adj for df)” in figure 4.15 and figure 4.16 suggests that the full
regression is unnecessary, since the analysis with two factors in figure 4.16 provides satisfactory
results.
Sometimes one may be confronted with a mass of historical data, that is data collected rather
haphazardly over a period of time, and one may be tempted to try building a regression model
from this data. Some authors advise strongly against doing this at all, or they advise using such
a study only as a guidance for planning more systematic experiments. Some of the hidden pitfalls
in such historical data are as follows:
1. The whole process may have changed over time and the data may no longer be applicable to
the present situation.
2. Such data usually show a highly correlated structure among the “independent” variables. The
result is that the estimated coefficients are highly correlated and it is impossible to judge
their significance independently. In some studies, for example a number of measurements on
a number of people, such a correlation structure may be unavoidable and may in fact contain
information about the structure underlying the data. However, in many situations, such as
the daily records of a chemical plant, this may be due to the way in which the plant is run.
3. In a chemical plant, as in many other experimental situations, important variables are usually
controlled very strictly since small fluctuations may have a significant effect on the response.
The result is that the importance of such a variable will not be reflected by the data. This
will certainly lead to inappropriate conclusions.
4. Other more subtle effects may be reflected in the data without the statistician ever know-
ing about them. In a chemical plant, two variables which influence the response may be
temperature and pressure. An increase in either of these variables will increase the yield,
but if either of them rises too high, it may cause an explosion. Therefore the operator has
strict instructions to turn down the heat immediately the pressure increases; the result will
of course be a decrease in yield. However, if one analyses the historical data, it will appear
as if an increase in pressure causes a decrease in yield − the way in which the experiment is
run is confused with inherent properties of the system.
1. One will try to select the design matrix X in such a way that (X′X)⁻¹ is a diagonal matrix or
as close to diagonal as possible (except perhaps for the first row and column which represent
the constant term). This will enable one to estimate and judge the effect of each variable
independently.
2. Secondly, all variables which are expected to influence the yield must be varied deliberately,
and as many times as are needed. Even if the chemical engineer is not willing to vary
a certain variable, one may be able to convince him or her that small changes will not
upset the whole system. The variance of the estimated coefficient β̂j of the j-th variable is
σ²/Σh(xjh − x̄j.)². In order to obtain a precise estimate of βj, the variance of β̂j must be
small, that is, Σh(xjh − x̄j.)² must be large. This may be achieved either by varying the
variable xj substantially or by choosing a large sample size (provided not all values of xj are
equal to x̄j.).
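The effect of the spread of xj on the precision of β̂j can be seen numerically: doubling the spread of the x-values quarters the variance of the estimated slope (illustrative numbers; numpy assumed available).

```python
import numpy as np

# Var(beta_hat_j) = sigma^2 / sum_h (x_jh - xbar_j)^2 for a single slope.
sigma = 2.0
x_narrow = np.array([4.0, 4.5, 5.0, 5.5, 6.0])
x_wide = np.array([3.0, 4.0, 5.0, 6.0, 7.0])   # twice the spread about the mean

def slope_var(x, sigma):
    return sigma ** 2 / ((x - x.mean()) ** 2).sum()

print(slope_var(x_narrow, sigma), slope_var(x_wide, sigma))
```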
Exercise 4.1
1. (a) Suppose the following data have been obtained from five randomly selected objects
(regression of y on x):
x 1 3 5 7 9
y 4 3 6 8 9
(iii) Test at the 5% level whether the expected value of y (at x = 15) is 20.
x 0 1 2 3 4
y 110 90 76 85 91
2. Fourteen observations were obtained on the yield from a chemical experiment. In this exper-
iment, temperature, concentration and length of agitation were set by the experimenter.
X1 X2 X3 Y
Temperature Concentration Time Yield
10 40 20 22
10 40 20 33
15 40 20 36
15 40 20 33
10 40 25 26
15 40 25 37
10 50 20 37
10 50 20 40
15 50 20 42
15 50 20 45
10 50 25 39
10 50 25 42
15 50 25 45
15 50 25 47
X′X = [  14    175    640    310
        175   2275   8000   3875
        640   8000  29600  14200
        310   3875  14200   6950 ]

where X = (1, x1, x2, x3), and

(X′X)⁻¹ = [ 11.9857  −0.1429  −0.115   −0.220
            −0.1429    0.0114   0        0
            −0.115     0        0.003   −0.001
            −0.220     0       −0.001    0.012 ]
X′y = [   524
         6665
        24330
        11660 ]

and the sum of squared residuals is Σ(yi − ŷi)² = 96.5571 (i = 1, . . . , 14).
(a) Fit a multiple regression model to the data. Give the values of β0 , β1 , β2 and β3 .
(b) Give a 95% confidence interval for β2 . Can you conclude that H0 : β2 = 0?
y = β1 + β2x + β3x² + ε
to the data
(ii) by first reducing the variables as described in section 2.8. Draw a graph of the data
and the fitted regression curve.
(e) Test, at the 5% level, the hypothesis that the expected response y at x = 1 is 10.
x: −1 −1 −1 0 0 0 1 1 1
y: 11 16 18 7 9 14 8 9 16
(a) Fit a straight line and perform a lack of fit test at the 5% level.
(b) Fit a second-degree polynomial and test by means of an F-test at the 5% level whether
the quadratic term is zero. Compare your answer with that of (a).
(c) Draw a graph of the data and the two regression lines of (a) and (b). Comment on the
graph with reference to your conclusions in (a) and (b).
yi = β0 + β1xi + εi ; i = 1, · · · , n; εi independent n(0; σ²).
Find a 100(1 − α)% confidence interval for β0 + β1 x. Regard the lower and upper limits for
β0 + β1 x as functions of x and derive their form. Sketch β0 + β1 x and the lower and upper
limits as functions of x.
xi yi xi yi
17 37.75 25 45.20
17 37.42 25 46.70
17 36.62 25 48.10
22 41.15 28 52.45
22 42.65 28 50.21
22 39.70 28 51.75
(b) Assume that

(X′X)⁻¹ = [  2.75505  −0.11616
            −0.11616   0.00505 ]

and show through calculation that ŷ = 14.2139 + 1.2966x.
7. A marketing research firm has obtained the following prescription sales data.
y = average weekly prescription sales over the past year (in units of $1 000)
x1 = floor space (in square feet)
x2 = percentage of floor space allocated to the prescription department
x3 = number of parking spaces available for the store
x4 = monthly per capita income for the surrounding community (in units of $100)
x5 = an indicator variable: equals 1 if the pharmacy is located in a shopping centre
and equals 0 otherwise.
(b) Based on pairwise correlation and density ellipses (use a 95% confidence ellipse), which
variable is the best single predictor?
(g) The following stepwise output was obtained with probability of entry = 0.15 and prob-
ability of leaving equal to 0.15.
Chapter 5
ANALYSIS OF COVARIANCE
5.1 Introduction
In chapter 2 it was pointed out that the analysis of covariance is a hybrid between analysis of
variance and regression analysis. As in the analysis of variance, we have a number of factors
(treatments or blocking factors) which define subsets among the experimental units. In addition
there are (continuous) independent variables which are assumed to have a linear effect on the
response. In general the analysis of covariance is used to eliminate the linear or higher degree
relation between one or more independent variables (called covariates) and a dependent variable
(yield), with the main objective to be able to determine the effect of the factor(s) (treatment) on
the dependent variable. Analysis of covariance can also be regarded as the simultaneous study of
several regressions. Because the general techniques are complicated, we confine ourselves to a
one-way analysis of variance with a linear regression on one single variable. Here the independent
variable (covariate) is related to the dependent (response) variable, but is not itself affected by the
factor (treatment). It is often impossible to control this covariate at a constant level, but it can
be measured together with the dependent (response) variable.
yij = αi + βixij + εij ; i = 1, 2; j = 1, · · · , ni
where ε11, · · · , ε2n2 are independent n(0; σ²) variates, with αi the effect of the factor being investigated.
5. ANALYSIS OF COVARIANCE 284
If the model is accepted, the situation shown in figure 5.1 might arise:
Figure 5.1:
Here we have three regression lines, each with a specific slope. The differences between the lines
do not necessarily relate to the treatments, thus it is impossible to investigate the differences in y
for the three treatments.
If the slopes are equal, that is, the regression lines are parallel, we accept the model
yij = αi + βxij + εij
Figure 5.2:
Now the differences with regard to y can be observed directly, thus the comparison of the three
equations is possible. The parameters can be estimated and the differences in y, for any value of
x, can be determined.
We start with an illustrative example, and then discuss two of the simplest models contained in
the general theory. The analysis of covariance is of course part of the broad theory of the general
linear model.
5.2 Illustration
A wholesaler wants to increase the sales of a certain product. He selects ten dealers whom he
believes to be comparable, and requests five of them to display his product close to the entrance
of the shop. The other five are requested to place the product close to the cash registers. The
results are as follows (units sold in one week):
Source SS d.f. MS F
Treatments 10 1 10 0.24
Error 330 8 41.25
Total 340 9
The F-statistic is not significant at any of the usual levels, and it would seem that there is no
difference between the responses to the two display methods.
However, at this stage somebody suggested that the sales before treatment should also be taken
into account, and the records show the following:
Figure 5.4:
Figure 5.4 suggests that one might fit a regression line to each group of data points and compare
the regression lines in some way or other. This is the purpose of the analysis of covariance.
The model often suggested for this type of problem is the following:
yij = αi + βxij + εij ; i = 1, 2; j = 1, · · · , ni
where the εij are independent n(0; σ²) variates. Note especially the following two assumptions: the
two regression lines have the same slope β, and the two error variances are equal.
In section 5.4 the model with unequal slopes is discussed. An F-test for the equality of the vari-
ances may be constructed − if the variances are unequal, one is confronted with the Behrens-Fisher
problem.
[ y11  ]   [ 1  0  x11  ]            [ ε11  ]
[  ⋮   ]   [ ⋮  ⋮   ⋮   ]  [ α1 ]    [  ⋮   ]
[ y1n1 ] = [ 1  0  x1n1 ]  [ α2 ] +  [ ε1n1 ]
[ y21  ]   [ 0  1  x21  ]  [ β  ]    [ ε21  ]
[  ⋮   ]   [ ⋮  ⋮   ⋮   ]            [  ⋮   ]
[ y2n2 ]   [ 0  1  x2n2 ]            [ ε2n2 ]
Example 5.1.
Using the data given in section 5.2, perform an analysis of covariance, assuming equal slopes, at
the 5% level of significance.
Solution 5.1.
x̄1. = 40    ȳ1. = 50    x̄2. = 50    ȳ2. = 52
Σ(x1j − x̄1.)² = (32 − 40)² + (38 − 40)² + (40 − 40)² + (44 − 40)² + (46 − 40)²
              = (−8)² + (−2)² + 0² + 4² + 6²
              = 120

Σ(x2j − x̄2.)² = (44 − 50)² + (45 − 50)² + (51 − 50)² + (53 − 50)² + (57 − 50)²
              = (−6)² + (−5)² + 1² + 3² + 7²
              = 120

Σ y1j(x1j − x̄1.) = 42(32 − 40) + 45(38 − 40) + . . . + 59(46 − 40) = 140

Σ y2j(x2j − x̄2.) = 44(44 − 50) + 47(45 − 50) + . . . + 58(57 − 50) = 130
β̂ = [Σ y1j(x1j − x̄1.) + Σ y2j(x2j − x̄2.)] / [Σ(x1j − x̄1.)² + Σ(x2j − x̄2.)²]
  = (140 + 130)/(120 + 120)
  = 270/240
  = 1.125

α̂1 = ȳ1. − β̂ x̄1. = 50 − 1.125(40) = 5
α̂2 = ȳ2. − β̂ x̄2. = 52 − 1.125(50) = −4.25
SSR = Σᵢ Σⱼ (yij − α̂i − β̂xij)²

with

Σⱼ(y1j − α̂1 − β̂x1j)² = (42 − 5 − 1.125 × 32)² + . . . + (59 − 5 − 1.125 × 46)²
                      = 16.875

Σⱼ(y2j − α̂2 − β̂x2j)² = (44 + 4.25 − 1.125 × 44)² + . . . + (58 + 4.25 − 1.125 × 57)²
                      = 9.375

so that

SSR = 16.875 + 9.375 = 26.25
σ̂² = SSR/(n1 + n2 − 3)
   = 26.25/(5 + 5 − 3)
   = 26.25/7
   = 3.75
s²x = Σ(x1j − x̄1.)² + Σ(x2j − x̄2.)² = 120 + 120 = 240
H0 : α1 = α2

f = (α̂1 − α̂2)² / { σ̂² [ 1/n1 + 1/n2 + (x̄1. − x̄2.)²/s²x ] }
  = (5 − (−4.25))² / { 3.75 [ 1/5 + 1/5 + (40 − 50)²/240 ] }
  = (9.25)² / { 3.75 [ 0.4 + 100/240 ] }
  = 85.5625 / (3.75 × 0.8166667)
  = 85.5625 / 3.0625
  ≈ 27.9388
Since 27.9388 > 5.59, we reject H0 at the 5% level of significance and conclude that the lines differ
significantly.
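The equal-slopes computations of Solution 5.1 follow a fixed recipe which is easy to program. The function below implements the pooled-slope formulas and the F-statistic for H0 : α1 = α2; the data are made up (they are not the data of section 5.2), and numpy and scipy are assumed available.

```python
import numpy as np
from scipy import stats

def ancova_equal_slopes(x1, y1, x2, y2):
    """Common-slope ANCOVA for two groups; F-test of H0: alpha1 = alpha2."""
    n1, n2 = len(x1), len(x2)
    sxx1 = ((x1 - x1.mean()) ** 2).sum()
    sxx2 = ((x2 - x2.mean()) ** 2).sum()
    beta = (y1 @ (x1 - x1.mean()) + y2 @ (x2 - x2.mean())) / (sxx1 + sxx2)
    a1 = y1.mean() - beta * x1.mean()
    a2 = y2.mean() - beta * x2.mean()
    ssr = ((y1 - a1 - beta * x1) ** 2).sum() + ((y2 - a2 - beta * x2) ** 2).sum()
    s2 = ssr / (n1 + n2 - 3)                       # sigma-hat squared
    s2x = sxx1 + sxx2
    f = (a1 - a2) ** 2 / (s2 * (1 / n1 + 1 / n2
                                + (x1.mean() - x2.mean()) ** 2 / s2x))
    return beta, a1, a2, f, stats.f.sf(f, 1, n1 + n2 - 3)

# Made-up data (not the data of section 5.2)
x1 = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y1 = np.array([5.0, 8.0, 12.0, 15.0, 19.0])
x2 = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y2 = np.array([10.0, 13.0, 17.0, 20.0, 24.0])
beta, a1, a2, f, p = ancova_equal_slopes(x1, y1, x2, y2)
print(round(beta, 4), round(a1, 4), round(a2, 4), round(f, 4))
```

The closed-form estimates agree with a least-squares fit of the design matrix of section 5.3 (two indicator columns and the covariate), which is a convenient check.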
If one is unwilling to assume that the two regression lines are parallel, one may select the following
model:
yij = αi + βixij + εij ; i = 1, 2; j = 1, · · · , ni .
In this model we estimate (α1, β1) and (α2, β2) independently, but σ² jointly from the two samples:
β̂i = Σⱼ yij(xij − x̄i.) / Σⱼ(xij − x̄i.)²  and  α̂i = ȳi. − β̂i x̄i.
Similarly, we might wish to test whether the response for each population is equal at a particular
point x, that is we might test H0 : α1 + β1 x = α2 + β2 x by comparing
Example 5.2.
Using data in section 5.2, perform an analysis of covariance assuming unequal slopes at the 5%
level of significance.
Solution 5.2.
Recall

β̂1 = Σ y1j(x1j − x̄1.) / Σ(x1j − x̄1.)² = 140/120 = 7/6

β̂2 = Σ y2j(x2j − x̄2.) / Σ(x2j − x̄2.)² = 130/120 = 13/12
With α̂2 = ȳ2. − β̂2 x̄2. = 52 − (13/12)(50) = −13/6,

Σⱼ(y2j − α̂2 − β̂2 x2j)² = (44 + 13/6 − (13/12)(44))² + . . . + (58 + 13/6 − (13/12)(57))²
                        = (−3/2)² + (5/12)² + (23/12)² + (3/4)² + (−19/12)²
                        = 55/6

Similarly Σⱼ(y1j − α̂1 − β̂1 x1j)² = 50/3, so that

σ̂² = (50/3 + 55/6)/(n1 + n2 − 4)
   = (155/6)/6
   = 155/36
H0 : β1 = β2

f = (β̂1 − β̂2)² / { σ̂² [ 1/Σ(x1j − x̄1.)² + 1/Σ(x2j − x̄2.)² ] }
  = (7/6 − 13/12)² / { (155/36)(1/120 + 1/120) }
  = (1/12)² / (31/432)
  = 3/31
  ≈ 0.0968
Since 0.0968 < 5.99, we do not reject H0 at the 5% level of significance and conclude that the
assumption of equal slope is valid. Parallel lines seem to fit the data quite well.
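The test for equal slopes can be programmed in the same way; again the data below are made up, and numpy and scipy are assumed available.

```python
import numpy as np
from scipy import stats

def equal_slopes_test(x1, y1, x2, y2):
    """Separate lines per group; F-test of H0: beta1 = beta2."""
    n1, n2 = len(x1), len(x2)
    sxx1 = ((x1 - x1.mean()) ** 2).sum()
    sxx2 = ((x2 - x2.mean()) ** 2).sum()
    b1 = y1 @ (x1 - x1.mean()) / sxx1
    b2 = y2 @ (x2 - x2.mean()) / sxx2
    a1 = y1.mean() - b1 * x1.mean()
    a2 = y2.mean() - b2 * x2.mean()
    ssr = ((y1 - a1 - b1 * x1) ** 2).sum() + ((y2 - a2 - b2 * x2) ** 2).sum()
    s2 = ssr / (n1 + n2 - 4)                 # four parameters estimated
    f = (b1 - b2) ** 2 / (s2 * (1 / sxx1 + 1 / sxx2))
    return b1, b2, f, stats.f.sf(f, 1, n1 + n2 - 4)

# Made-up data with nearly parallel lines
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y1 = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])
x2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y2 = np.array([5.0, 7.2, 8.9, 11.1, 13.0, 14.8])
b1, b2, f, p = equal_slopes_test(x1, y1, x2, y2)
print(round(b1, 4), round(b2, 4), round(f, 4), round(p, 4))
```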
The procedure may obviously be extended to more than two regression lines and more than one
independent variable.
5.5 Illustration
A company wants to investigate the effect of heartbeat per minute (< 60 or > 60) on the results
(y) obtained in an intensive, very demanding one-month course. It is expected that the heartbeat
is related to the mean number of kilometres (x) the course attendants ran/walked monthly in the
year prior to the course. The following results were obtained:
< 60 > 60
x y x y
8 63 50 59
22 80 91 82
11 71 72 76
12 69 86 81
14 78 63 70
23 79 79 85
The ANOVA table, with x a covariate, is as follows (SAS JMP figure 5.6):
Figure 5.7 is a graphical representation of the data and the two straight lines fitted if x is regarded
as a covariate.
Figure 5.7:
Figure 5.7 shows clearly that the two straight lines, obtained from the analysis of covariance (x
the covariate), are acceptable representations of the situation as reflected by the given data.
The following two graphs illustrate the importance of looking at a study as a whole and not only
at certain aspects.
Figure 5.8 is a graphical representation of x against the results, without considering heartbeat. A
straight line fitted to the data is obviously not an acceptable prediction.
Figure 5.8:
Figure 5.9:
Exercise 5.1
Set 1 Set 2
x y x y
4.8 9.912 8.8 10.596
7.2 9.383 6.2 9.697
5.5 9.734 7.5 10.700
6.0 9.551 4.9 10.610
8.3 8.959 5.4 10.145
7.6 9.474 5.8 10.191
5.9 9.179 7.3 9.855
8.0 9.359 8.6 10.682
4.3 9.580 8.8 10.160
5.1 9.245 6.0 9.982
(a) Test whether the slopes of the fitted regression lines are significantly different.
[Given:
Set 2: x̄2. = 6.93, Σ(x2j − x̄2.)² = 19.381, Σ y2j(x2j − x̄2.) = 1.45996 ]
2. A lecturer wants to confirm that results in the 2002 examination were better than in the 2000
examination. He takes a sample (n = 6) of the examination results of students from these
groups (y). The means for the assignments for each student are also recorded (x). Results
are as follows:
2000 2002
x y x y
50 60 59 55
62 70 45 49
76 68 33 50
38 63 73 61
52 73 57 44
64 80 45 41
(a) Draw a graph of the data to examine the possibility of a covariate. Briefly interpret
your graph.
(b) The regression lines
ŷ1 = α̂1 + β̂1x = 54.5862 + 0.2529x
ŷ2 = α̂2 + β̂2x = 33.6632 + 0.3142x
were fitted to these two sets of data. Confirm, by testing the hypothesis H0 : β1 = β2,
that the assumption of equal slopes is valid.
(c) Perform a covariance analysis. [State the hypothesis, test statistic, all formulae and the
conclusion explicitly.]
3. An experiment was performed to determine whether balloons of different colours are similar
in terms of the time taken for inflation to a diameter of 7 inches (1 inch ≈ 2.5 cm). Two
colours were selected from a single manufacturer. An assistant blew up the balloons and the
times (y) (to the nearest 1/10th of a second) were taken with a stopwatch. The data, in the
order collected (x), are given below, where the codes 1 and 2 denote the colours yellow and
blue, respectively.
Order 1 2 3 4 5 6 7 8
Colour 1 2 2 2 1 1 2 1
Time 19.8 28.5 25.7 28.8 17.1 19.3 18.3 14.0
Order 9 10 11 12
Colour 1 2 2 1
Time 16.6 18.1 18.9 16.0
(a) Ignore order. Construct an analysis of variance table and test the hypothesis that colour
has no effect on inflation time.
(b) Include order as a covariate. Test the hypothesis H0 : β1 = β2 that the slopes are equal.
[NB: Give the formulae, test statistics, decisions and conclusions explicitly.]
Set 1
x: 20 20 30 30 40 40 50 50
y: 61 65 90 92 124 124 151 157
Set 2
x: 20 20 20 30 30 40 40 40
y: 55 56 60 88 92 113 115 117
(a) Assume that parallel regression lines are applicable; give a model and perform an anal-
ysis of covariance (at the 5% level).
(b) Represent the data and the estimated regression lines graphically and comment on the
assumptions and conclusions of (a) in the light of the graph.
Set 1
x: 1 1 2 2 3 3 4 4
y: 12 16 15 15 17 19 21 25
Set 2
x: 2 2 3 3 4 4 5 5
y: 14 16 20 24 22 24 26 30
(a) Fit a regression line through each set of data and draw a graph of the data and the
fitted lines. Give a model for the data.
(b) Test, at the 10% level, whether the slopes of the fitted regression lines are significantly
different.
(c) Test, at the 10% level, the null hypothesis that the two expected values of y which
correspond to x = 3 are equal.
Chapter 6
APPENDIX
Many of our students make certain mistakes in their assignments and examination papers, and it
seems necessary to point out these typical errors. You should read these remarks carefully and
make sure that you do not commit similar errors.
The main source of errors is failure to take note of the order of a matrix. If A is a matrix with p
rows and q columns, then A is said to be of order p × q.
Matrix addition
If A and B are matrices, then A + B is defined only if A and B are of the same order. If A is p × q
and B is p × q; then A+B is also p × q.
Matrix multiplication
The product AB is defined only if the number of columns of A is the same as the number of rows
of B. If A is p × q and B is q × r; then AB is defined and AB is a p × r matrix.
Before writing down a sum such as AB + CDE, check that the orders conform: if A is p × q and B is q × r, then AB is p × r; if C is p × s, D is s × t and E is t × r, then CDE is also p × r, so the sum AB + CDE is a p × r matrix and is defined.
Matrix division
If A and B are matrices, then the quotient A/B is defined only if B is a 1 × 1 matrix, that is a scalar, and not equal to zero. We often write down the quotient of two quadratic forms or other 1 × 1 expressions in matrices, for example
x′Ax / x′Bx
where x is a p × 1 vector and A and B are p × p matrices. However, to “cancel out” the xs and equate the expression to A/B is unforgivable.
The following concepts are defined for square matrices only:
trace
determinant
inverse
positive definite; positive semi-definite
quadratic form
symmetric
singular; non-singular
idempotent
Thus, before you write |X| or tr(X) or X⁻¹, et cetera, make sure that X is square.
Determinants
If A and B are both square matrices of the same order, then
|AB| = |A| |B| = |BA|.
However, if A and B are not square (but AB and BA are) then |AB| and |BA| need not be equal.
• Counterexample 6.1.1
Let A = (1  2) and B = [ 3 ]
                       [ 4 ].
Then AB = (11)
∴ |AB| = 11,
whereas BA = [ 3  6 ]
             [ 4  8 ]
∴ |BA| = 24 − 24 = 0.
Of course, |A| and |B| are not defined in this example. Also, if |A| = |B| then it does not
follow that A = B.
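Counterexamples like this are easy to check by machine. A minimal plain-Python sketch (the `matmul` helper is ours, not part of the guide):

```python
def matmul(A, B):
    """Multiply matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

A = [[1, 2]]      # 1 x 2
B = [[3], [4]]    # 2 x 1

AB = matmul(A, B)                                   # 1 x 1
BA = matmul(B, A)                                   # 2 x 2
det_AB = AB[0][0]                                   # |(a)| = a
det_BA = BA[0][0] * BA[1][1] - BA[0][1] * BA[1][0]  # ad - bc

print(det_AB, det_BA)  # 11 0
```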
Trace
If tr(A) = tr(B) then it does not follow that A = B.
• Counterexample 6.1.2
Let A = [ 1  −10 ]  and B = [ 2  0 ]
        [ 6    3 ]          [ 0  2 ].
Then tr(A) = 1 + 3 = 4 = tr(B), but A ≠ B.
While this may seem trivial, students have been known to write
tr(I − AB) = 0
∴ I = AB,
which is of course not a valid deduction.
Ranks
r(AB) is not necessarily equal to r(BA).
• Counterexample 6.1.3
Let A = [ 1  −2 ]       B = [ 2  2   2 ]
        [ 1   1 ]  ;        [ 0  1  −1 ].
        [ 1   1 ]
Then AB = [ 2  0  4 ]
          [ 2  3  1 ]  ∴ r(AB) = 2;
          [ 2  3  1 ]
while BA = [ 6  0 ]
           [ 0  0 ]  ∴ r(BA) = 1.
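A sketch of a machine check for this counterexample; the `rank` helper (ours, not part of the guide) does Gaussian elimination in exact rational arithmetic:

```python
from fractions import Fraction

def matmul(A, B):
    """Multiply matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def rank(M):
    """Rank via Gaussian elimination with exact fractions."""
    M = [[Fraction(x) for x in row] for row in M]
    r = 0
    for col in range(len(M[0])):
        pivot = next((i for i in range(r, len(M)) if M[i][col] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(len(M)):
            if i != r and M[i][col] != 0:
                f = M[i][col] / M[r][col]
                M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

A = [[1, -2], [1, 1], [1, 1]]
B = [[2, 2, 2], [0, 1, -1]]
print(rank(matmul(A, B)), rank(matmul(B, A)))  # 2 1
```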
Inverse
The inverse of a matrix A is defined only if A is non-singular (and of course square). Thus, while
(AB)−1 is defined if AB is non-singular, it does not necessarily follow that A−1 and B−1 exist; in
fact A and B need not be square. The well-known least squares estimator in the linear model is
β̂ = (X′X)⁻¹X′y
where X is n × p of rank p where p < n. Thus X⁻¹ does not exist, and one may not write
β̂ = X⁻¹(X′)⁻¹X′y = X⁻¹Iₙy = X⁻¹y.
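To make the point concrete, here is a sketch with a small made-up data set (the 4 × 2 design matrix below is hypothetical, chosen so the arithmetic is exact): β̂ = (X′X)⁻¹X′y exists because X′X is 2 × 2 and non-singular, even though X itself is not square and has no inverse.

```python
from fractions import Fraction as F

# Hypothetical data: y = 1 + 2x exactly, so beta-hat should be (1, 2)'.
X = [[1, 0], [1, 1], [1, 2], [1, 3]]   # n = 4, p = 2: X is not square, X^{-1} undefined
y = [[1], [3], [5], [7]]

def t(M):
    """Transpose a matrix given as a list of rows."""
    return [list(c) for c in zip(*M)]

def matmul(A, B):
    """Multiply matrices in exact rational arithmetic."""
    return [[sum(F(A[i][k]) * F(B[k][j]) for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

XtX = matmul(t(X), X)                  # 2 x 2, non-singular
Xty = matmul(t(X), y)                  # 2 x 1
d = XtX[0][0] * XtX[1][1] - XtX[0][1] * XtX[1][0]
XtX_inv = [[ XtX[1][1] / d, -XtX[0][1] / d],
           [-XtX[1][0] / d,  XtX[0][0] / d]]
beta = matmul(XtX_inv, Xty)
print(beta)  # [[Fraction(1, 1)], [Fraction(2, 1)]]
```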
If A is non-singular, then A may be cancelled from the equation AX = AY by premultiplying both sides by A⁻¹:
A⁻¹AX = A⁻¹AY
∴ IX = IY
∴ X = Y.
If A is singular (or not square), however, this cancellation is not valid, as the following counterexamples show.
• Counterexample 6.1.4
Let A = [ 1  2  3 ]   B = [  6  0  0 ]       [ 1 ]
        [ 4  5  6 ] ;     [ 15  0  0 ] ; x = [ 1 ]
        [ 7  8  9 ]       [ 24  0  0 ]       [ 1 ].
Then Ax = Bx = [  6 ]
               [ 15 ]  but A ≠ B.
               [ 24 ]
• Counterexample 6.1.5
Let A = [ 1  2   3 ]   B = [ 6  0  0 ]       [ 1  2 ]
        [ 1  0  −1 ] ;     [ 0  0  0 ] ; C = [ 1  2 ]
                                             [ 1  2 ].
Then AC = BC = [ 6  12 ]  but A ≠ B.
               [ 0   0 ]
Zero products
If AB = 0 then it does not follow that A = 0 or B = 0. You should be able to construct your own
counterexample.
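One such counterexample, checked in plain Python (our own sketch, not the guide's):

```python
def matmul(A, B):
    """Multiply matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

A = [[0, 1], [0, 0]]   # A is not the zero matrix
B = [[1, 0], [0, 0]]   # B is not the zero matrix
print(matmul(A, B))    # [[0, 0], [0, 0]]
```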
After one of the examinations a student commented that it had taken him an hour to answer a
question like the following:
Let x₁, x₂, x₃ and x₄ be independent n(µ; 1) variates. Write the following quadratic forms as qᵢ = x′Aᵢx where Aᵢ is symmetric, determine whether each has a chi-square distribution, give the degrees of freedom and find out which are independent:
q₁ = x₁x₂ − x₃x₄
q₂ = (x₁ + x₂ + x₃ + x₄)²/4
q₃ = Σᵢ₌₁⁴ (xᵢ − x₁)²
Actually it can be completed in five minutes − if you know how. Let us do another example step by step. Let
q = (1/5)(x₁ − 2x₃)² + x₂²
where x₁, x₂ and x₃ are independent n(0; σ²) variates. We wish to show that a multiple of q has a chi-square distribution. For this purpose we first write q in the form x′Ax where A is symmetric: recall that the matrix A of the quadratic form x′Ax is assumed to be symmetric in this guide.
First, we write out q term by term − this step is not really necessary but it helps to explain the process.
q = (1/5)x₁² − (4/5)x₁x₃ + (4/5)x₃² + x₂²
  = (1/5)x₁² + (4/5)x₃² + x₂² − (2/5)x₁x₃ − (2/5)x₃x₁
    + 0x₁x₂ + 0x₂x₁ + 0x₂x₃ + 0x₃x₂.
The coefficient of x₁² goes in the (1, 1) position of A, that of x₂² in the (2, 2) position and that of x₃² in the (3, 3) position:
A = [ 1/5  ·   ·  ]
    [  ·   1   ·  ]
    [  ·   ·  4/5 ].
Since A must be symmetric, we place one half of the coefficient of xᵢxⱼ (where i ≠ j) in the (i, j) position and the other half in the (j, i) position:
A = [  1/5  0  −2/5 ]
    [   0   1    0  ]
    [ −2/5  0   4/5 ]
check: q = (x₁  x₂  x₃) [  1/5  0  −2/5 ] [ x₁ ]
                        [   0   1    0  ] [ x₂ ]
                        [ −2/5  0   4/5 ] [ x₃ ].
Now we show that AA = A and we conclude that q/σ² is a chi-square variate with tr(A) = 2 degrees of freedom. The noncentrality parameter is 0′A0 = 0, thus q/σ² has a central chi-square distribution. Quite easily done.
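The two facts used above (AA = A and tr(A) = 2) can also be verified mechanically; a short check in exact arithmetic:

```python
from fractions import Fraction as F

# The matrix A of the quadratic form constructed above.
A = [[F(1, 5),  F(0), F(-2, 5)],
     [F(0),     F(1), F(0)],
     [F(-2, 5), F(0), F(4, 5)]]

def matmul(A, B):
    """Multiply matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

print(matmul(A, A) == A)               # True: A is idempotent
print(sum(A[i][i] for i in range(3)))  # 2, the degrees of freedom
```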
Efficient methods exist for performing matrix calculations, and we assume that you know these methods. These methods are usually aimed at computer usage, and may be rather lengthy if applied manually. We therefore point out certain shortcuts which may be useful in an exam situation when time is limited.
Computing X′X
In the linear model one very often has to compute X 0 X where X (or X 0 ) is given. It is not necessary
to write out X 0 and then X next to it, if one remembers that the (i, j)-th element of X 0 X is found
by multiplying every element in the i-th row of X 0 (or the i-th column of X) by the corresponding
element in the j-th row of X 0 (or j-th column of X) and adding these products. The diagonal terms
are found as the sum of squares of elements in the corresponding row of X 0 (or column of X). Also,
X 0 X is symmetric and thus the (i, j)-th element of X 0 X is the same as the (j, i)-th element.
• Example 6.3.1
Let X′ = [  1   1   1  1  1  1  1 ]
         [ −3  −2  −1  0  1  2  3 ]
         [  1   1   2  2  3  3  4 ]
Then (X′X)₁₁ = 1² + 1² + 1² + 1² + 1² + 1² + 1² = 7
(X′X)₂₂ = (−3)² + (−2)² + (−1)² + 0² + 1² + 2² + 3² = 28
(X′X)₃₃ = 1² + 1² + 2² + 2² + 3² + 3² + 4² = 44
(X′X)₁₂ = −3 − 2 − 1 + 0 + 1 + 2 + 3 = 0
(X′X)₁₃ = 1 + 1 + 2 + 2 + 3 + 3 + 4 = 16
(X′X)₂₃ = −3 − 2 − 2 + 0 + 3 + 6 + 12 = 14
∴ X′X = [  7   0  16 ]
        [  0  28  14 ]
        [ 16  14  44 ].
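The shortcut is intended for hand calculation, but the result of Example 6.3.1 is easy to verify by machine; a plain-Python check:

```python
Xt = [[1,  1,  1, 1, 1, 1, 1],
      [-3, -2, -1, 0, 1, 2, 3],
      [1,  1,  2, 2, 3, 3, 4]]

X = [list(row) for row in zip(*Xt)]   # transpose back to the 7 x 3 matrix X

# (X'X)_{ij} = sum over k of (row i of X') times (column j of X)
XtX = [[sum(Xt[i][k] * X[k][j] for k in range(7)) for j in range(3)]
       for i in range(3)]
print(XtX)  # [[7, 0, 16], [0, 28, 14], [16, 14, 44]]
```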
• Exercise 6.3.2
Find X′X if
X′ = [ 1   1  1   1  1   1  1   1 ]
     [ 1  −1  2  −2  3  −3  4  −4 ]
     [ 8   7  6   5  4   3  2   1 ]
Answer:
X′X = [  8   0   36 ]
      [  0  60   10 ]
      [ 36  10  204 ]
Determinants
If A is a square matrix, say p × p, and k is a common factor of every element of A, then we may write
A = kB, say.
Then |A| = kᵖ|B|. This fact may often be used to simplify the calculations involved in computing a determinant. The determinant of a small matrix can be determined quite rapidly:
1 × 1 matrices
If A = (a) then |A| = a (in this case the determinant sign should not be confused with the absolute value sign):
|(a)| is the determinant of the 1 × 1 matrix, and is equal to a, which could be negative or positive.
• Example 6.3.3
If A = (−3) then |A| = −3.
2 × 2 matrices
If A = [ a  b ]  then |A| = ad − bc.
       [ c  d ]
• Example 6.3.4
If A = [ 1  2 ]  then |A| = 1 × 8 − 2 × 3 = 2.
       [ 3  8 ]
• Example 6.3.5
If B = [  4  −6 ]  then |B| = 4 × 3 − (−6)(−6) = −24.
       [ −6   3 ]
3 × 3 matrices
The first step is to write down the elements of the matrix, and repeat the first column after the third and the second after that. If
A = [ a  b  c ]          a  b  c  a  b
    [ d  e  f ]   write  d  e  f  d  e
    [ g  h  i ]          g  h  i  g  h
|A| consists of six terms; the first three are obtained from the diagonals that run down from left to right. These three terms are added:
+(a × e × i) + (b × f × g) + (c × d × h).
The last three terms are subtracted, and are obtained from the diagonals that run down from right to left:
−(c × e × g) − (a × f × h) − (b × d × i).
• Example 6.3.6
A = [  1  2  −3 ]
    [  2  4   6 ]
    [ −3  6  10 ]
Write down
 1  2  −3   1  2
 2  4   6   2  4
−3  6  10  −3  6
Then |A| = (1)(4)(10) + (2)(6)(−3) + (−3)(2)(6) − (−3)(4)(−3) − (1)(6)(6) − (2)(2)(10)
= 40 − 36 − 36 − 36 − 36 − 40 = −144.
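Sarrus' rule translates directly into code; a small sketch checking the matrix of Example 6.3.6:

```python
def det3(M):
    """3 x 3 determinant by Sarrus' rule: three added and three subtracted diagonals."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a*e*i + b*f*g + c*d*h - c*e*g - a*f*h - b*d*i

A = [[1, 2, -3], [2, 4, 6], [-3, 6, 10]]
print(det3(A))  # -144
```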
• Exercise 6.3.7
B = [ 2  −3   4 ]
    [ 2   6  −1 ] .  Show that |B| = 68.
    [ 4  −1   9 ]
4 × 4 matrices
This is rather more complicated. The determinant is first expanded in terms of four 3 × 3 determinants, each of which is then computed as above:
| a  b  c  d |
| e  f  g  h |  = a | f  g  h | − b | e  g  h | + c | e  f  h | − d | e  f  g |
| i  j  k  ℓ |      | j  k  ℓ |     | i  k  ℓ |     | i  j  ℓ |     | i  j  k |
| m  n  p  q |      | n  p  q |     | m  p  q |     | m  n  q |     | m  n  p |
Note that the signs alternate (+ − + −) and each term consists of an element in the first row multiplied by the 3 × 3 determinant obtained by deleting the row and column in which that element occurs.
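The cofactor expansion described here works for any order, not just 4 × 4, and is naturally recursive; a sketch, checked on a triangular matrix whose determinant is simply the product of its diagonal elements:

```python
def det(M):
    """Determinant by cofactor expansion along the first row."""
    n = len(M)
    if n == 1:
        return M[0][0]
    total = 0
    sign = 1
    for j in range(n):
        # delete row 1 and column j to form the minor
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += sign * M[0][j] * det(minor)
        sign = -sign               # signs alternate + - + - ...
    return total

A = [[2, 1, 3, 4],
     [0, 1, 5, 6],
     [0, 0, 3, 7],
     [0, 0, 0, 2]]
print(det(A))  # 12 (= 2 * 1 * 3 * 2 for a triangular matrix)
```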
Inverses
Some points to remember:
(b) If A = kB where k is a non-zero constant, then A⁻¹ = (1/k)B⁻¹. (This fact may be used to simplify computations if a common factor exists for each element of A.)
(c) If A = [ A₁  O  ⋯  O  ]
           [ O   A₂ ⋯  O  ]
           [ O   O  ⋯  Aₘ ]
is block diagonal with each Aᵢ square and non-singular, then A⁻¹ is block diagonal with blocks A₁⁻¹, A₂⁻¹, …, Aₘ⁻¹.
(d) If A is symmetric, then A−1 is also symmetric. In this case one has to compute the diagonal of
A−1 and the elements above the main diagonal (say), while the elements below the diagonal
can be written down directly from the corresponding elements above the diagonal. We now
show the computations for small matrices.
1 × 1 matrices
If A = (a) with a ≠ 0, then A⁻¹ = (1/a).
2 × 2 matrices
If A = [ a  b ]  then A⁻¹ = [  d/|A|  −b/|A| ]
       [ c  d ]             [ −c/|A|   a/|A| ].
• Example 6.3.8
A = [ 2  3 ]
    [ 3  6 ]
|A| = 12 − 9 = 3
A⁻¹ = [  6/3  −3/3 ]  =  [  2  −1  ]
      [ −3/3   2/3 ]     [ −1  2/3 ].
• Exercise 6.3.9
B = [  4  −2 ] ;  B⁻¹ = [ 1/2  1/2 ]
    [ −2   2 ]          [ 1/2   1  ].
3 × 3 matrices
(A⁻¹)ᵢⱼ = (−1)^(i+j) × (the determinant of the 2 × 2 matrix obtained by deleting the j-th row and the i-th column of A) ÷ (the determinant of A).
• Example 6.3.10
A = [ 2   1   0 ]
    [ 1   3  −2 ]        |A| = 2
    [ 0  −2   2 ]
(A⁻¹)₁₁ = (−1)² |  3  −2 | ÷ |A| = (6 − 4)/2 = 1
                [ −2   2 ]
(A⁻¹)₂₂ = (−1)⁴ | 2  0 | ÷ |A| = 4/2 = 2
                [ 0  2 ]
(A⁻¹)₃₃ = (−1)⁶ | 2  1 | ÷ |A| = 5/2
                [ 1  3 ]
(A⁻¹)₁₂ = (A⁻¹)₂₁ = (−1)³ |  1  0 | ÷ |A| = −2/2 = −1
                          [ −2  2 ]
(A⁻¹)₁₃ = (A⁻¹)₃₁ = (−1)⁴ | 1   0 | ÷ |A| = −2/2 = −1
                          [ 3  −2 ]
(A⁻¹)₂₃ = (A⁻¹)₃₂ = (−1)⁵ | 2   0 | ÷ |A| = −(−4)/2 = 2
                          [ 1  −2 ]
∴ A⁻¹ = [  1  −1  −1  ]
        [ −1   2   2  ]
        [ −1   2  5/2 ]
(Test: AA⁻¹ = I)
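The suggested test AA⁻¹ = I can be carried out exactly with Python's fractions module:

```python
from fractions import Fraction as F

A = [[2, 1, 0], [1, 3, -2], [0, -2, 2]]
Ainv = [[F(1),  F(-1), F(-1)],
        [F(-1), F(2),  F(2)],
        [F(-1), F(2),  F(5, 2)]]

def matmul(A, B):
    """Multiply matrices in exact rational arithmetic."""
    return [[sum(F(A[i][k]) * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

print(matmul(A, Ainv))  # the 3 x 3 identity matrix
```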
• Example 6.3.11
A = [ 12   0   0 ]       [ 2  0  0 ]
    [  0  18   6 ]  = 6  [ 0  3  1 ]
    [  0   6  12 ]       [ 0  1  2 ]
A⁻¹ = (1/6) [ 2  0  0 ]⁻¹
            [ 0  3  1 ]
            [ 0  1  2 ]
    = (1/6) [ 1/2   0     0  ]
            [  0   2/5  −1/5 ]
            [  0  −1/5   3/5 ]
    = [ 1/12    0      0   ]
      [  0    1/15  −1/30  ]
      [  0   −1/30   1/10  ].
• Exercise 6.3.12
B = [  3/4  −1/2  1/4 ] ;  B⁻¹ = [  4   2  −4 ]
    [ −1/2    1    0  ]          [  2   2  −2 ]
    [  1/4    0   1/4 ]          [ −4  −2   8 ].