Department of Statistics
Study Guide
This module, STA3701, follows on STA3703, which deals with multivariate distributions. You should therefore be familiar with the multivariate normal distribution and with the distribution of quadratic forms which results from it. The main results on matrix theory and distribution theory needed for this module are listed in chapter 1. The proofs and derivations of these results are not required for this module, but you should be able to state and apply them.
Prescribed book:

Stoker, D.J.: Statistical tables. 3rd ed. Pretoria: Academica, 1977 (or earlier editions).

There is no other prescribed textbook for this module, and no recommended textbook which you are required to read in order to complete the module. You may consult the following textbooks if you wish to learn more about the topics dealt with in this module:
Bowerman, B.L. and R. O'Connell: Linear statistical models: an applied approach. 2nd ed. PWS-KENT, 1990. (For chapters 3 and 4 of this Guide.)

Brownlee, K.A.: Statistical theory and methodology in science and engineering. 2nd ed. Wiley, 1965. (For chapters 3, 4 and 5 of this Guide.)

Draper, N.R. and H. Smith: Applied regression analysis. Wiley, 1st ed. 1966, 2nd ed. 1981 or 3rd ed. 1998. (For chapter 4 of this Guide.)

Dunn, O.J. and V.A. Clark: Applied statistics: analysis of variance and regression. Wiley, 1974. (For chapters 3, 4 and 5 of this Guide.)

Kleinbaum, D.G., Kupper, L.L., Nizam, A. and K.E. Muller: Applied regression analysis and other multivariable methods. 4th ed. Duxbury, 2008. (For most of the Guide.)

Kleinbaum, D.G. and L.L. Kupper: Applied regression analysis and other multivariable methods. Duxbury, 1st ed. 1978 or 2nd ed. 1988. (For most of the Guide.)

Kutner, M.H., Nachtsheim, C.J., Neter, J. and W. Li: Applied linear statistical models. 5th ed. McGraw-Hill Irwin, 2005. (For most of the Guide.)

Mickey, R.M., Dunn, O.J. and V.A. Clark: Applied statistics: analysis of variance and regression. 3rd ed. Wiley, 2004. (For chapters 3, 4 and 5 of this Guide.)

Neter, J., Wasserman, W. and M.H. Kutner: Applied linear statistical models: regression, analysis of variance and experimental designs. 2nd ed. Richard D. Irwin, 1985. (For chapters 3, 4 and 5 of this Guide.)

Neter, J. and W. Wasserman: Applied linear statistical models. Richard D. Irwin, 1974. (For chapters 3, 4 and 5 of this Guide.)

Ross, S.M.: Introduction to probability and statistics for engineers and scientists. Wiley, 1987. (For chapters 3 and 4 of this Guide.)

Ross, S.M.: Introduction to probability and statistics for engineers and scientists. 4th ed. Elsevier, 2009. (For chapters 3 and 4 of this Guide.)

Sall, J., Creighton, L. and A. Lehman: JMP start statistics: a guide to statistics and data analysis using JMP. 4th ed. SAS Institute, 2007. (For chapters 3, 4 and 5 of this Guide, especially the data analysis.)

Scheffé, H.: The analysis of variance. Wiley, 1959. (For chapters 2, 3 and 5 of this Guide.)

Searle, S.R.: Linear models. Wiley, 1997. (For chapters 1 and 2 of this Guide.)

Wackerly, D.D., Mendenhall, W. and R.L. Scheaffer: Mathematical statistics with applications. 7th ed. Duxbury, 2008. (For chapters 3 and 4 of this Guide.)
Chapter 1
THE TOOLBOX
The statistical techniques which are the subject of this module, Applied Statistics III, form part of a more general theory called the linear model. In chapter 2 we discuss this general model; this general theory should enable you to analyse a very general class of problems. In chapter 3 we discuss some specific models contained in the general theory under the heading analysis of variance. Analysis of variance is a procedure which enables one to test whether a number of parameters which determine the values of population means are equal or not. In chapter 4 we discuss regression analysis. In this type of model we assume that there is a linear relationship between a random response variable called yield, and a number of non-random variables. The purpose is to estimate the parameters of this relationship and to draw statistical inference about these parameters. In chapter 5 we discuss a third class of problems, called analysis of covariance, which may be regarded as a hybrid between regression analysis and analysis of variance. In this model we assume that there are a number of regression functions, each corresponding to a different population, and we want to know whether these regression functions are equal or not.
A word about notation. In this module we usually use Greek letters to denote parameters, capital letters to denote matrices and underlined small letters to denote vectors. A vector without a prime, for example x, will be a column vector and its transpose, for example x′, will therefore be a row vector. Small letters which are not underlined will usually be scalars (1×1 matrices), including specific elements of matrices or vectors. The usual distinction between random variables and observations, which is often made by using capitals for the former and small letters for the latter, will not be made here. It should be clear from the context whether a symbol stands for a random variable or not. A circumflex (or hat) above the symbol for a parameter or vector of parameters, for example β̂, is used to denote an unbiased estimator of the parameter(s).
This section contains a number of definitions and results on matrices which, for ease of reference,
will be numbered (M1), (M2), et cetera. These results will be used frequently in chapter 2. If they
do not seem interesting at first reading, you will soon find out how they are applied.
Example 1.1.
Write the following quadratic forms in matrix notation using a symmetric matrix:

(a) $2x^2 + 3y^2 - 4z^2 + 2xy - 6xz + 10yz$

(b) $x_1^2 + 4x_1x_3 - 6x_2x_3$

Solution 1.1.

(a) $2x^2 + 3y^2 - 4z^2 + 2xy - 6xz + 10yz = 2x^2 + 3y^2 - 4z^2 + xy + yx - 3xz - 3zx + 5yz + 5zy$.

Then $x' = \begin{pmatrix} x & y & z \end{pmatrix}$ and $A = \begin{pmatrix} 2 & 1 & -3 \\ 1 & 3 & 5 \\ -3 & 5 & -4 \end{pmatrix}$.

Thus
$$x'Ax = \begin{pmatrix} x & y & z \end{pmatrix}\begin{pmatrix} 2 & 1 & -3 \\ 1 & 3 & 5 \\ -3 & 5 & -4 \end{pmatrix}\begin{pmatrix} x \\ y \\ z \end{pmatrix}.$$

(b) $x_1^2 + 4x_1x_3 - 6x_2x_3 = x_1^2 + 0x_2^2 + 0x_3^2 + 0x_1x_2 + 0x_2x_1 + 2x_1x_3 + 2x_3x_1 - 3x_2x_3 - 3x_3x_2$.

Then $x' = \begin{pmatrix} x_1 & x_2 & x_3 \end{pmatrix}$ and $A = \begin{pmatrix} 1 & 0 & 2 \\ 0 & 0 & -3 \\ 2 & -3 & 0 \end{pmatrix}$.

Thus
$$x'Ax = \begin{pmatrix} x_1 & x_2 & x_3 \end{pmatrix}\begin{pmatrix} 1 & 0 & 2 \\ 0 & 0 & -3 \\ 2 & -3 & 0 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}.$$
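The correspondence between a quadratic form and its symmetric matrix is easy to check numerically. The sketch below (an illustration only, not part of the original guide) evaluates the polynomial in (a) directly and compares it with x′Ax at a random point:

```python
import numpy as np

# Symmetric matrix for (a): 2x^2 + 3y^2 - 4z^2 + 2xy - 6xz + 10yz
A = np.array([[2.0, 1.0, -3.0],
              [1.0, 3.0,  5.0],
              [-3.0, 5.0, -4.0]])

rng = np.random.default_rng(1)
x, y, z = rng.normal(size=3)
v = np.array([x, y, z])

quad_direct = 2*x**2 + 3*y**2 - 4*z**2 + 2*x*y - 6*x*z + 10*y*z
quad_matrix = v @ A @ v   # x'Ax

assert np.isclose(quad_direct, quad_matrix)
```

Note how the off-diagonal coefficients are split in half between the (i, j) and (j, i) entries, which is what makes A symmetric.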
(M2) The p × p matrix A is called positive definite if $x'Ax > 0$ for all $x \neq 0$ (where 0 is a vector of zeros).

(M3) If the > sign in (M2) is replaced by ≥ (and the equality holds for at least one $x \neq 0$), then A is said to be positive semidefinite.
Example 1.2.
Are the following matrices positive definite or positive semidefinite?

$$\text{(a) } \begin{pmatrix} 1 & 0 \\ 0 & 4 \end{pmatrix} \qquad \text{(b) } \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix} \qquad \text{(c) } \begin{pmatrix} 1 & -\tfrac{9}{2} \\ -\tfrac{9}{2} & 4 \end{pmatrix}$$

Solution 1.2.

(a)
$$x'Ax = \begin{pmatrix} x_1 & x_2 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & 4 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} x_1 & 4x_2 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = x_1^2 + 4x_2^2,$$
which is > 0 for all $x \neq 0$, so the matrix is positive definite.

(b)
$$x'Ax = \begin{pmatrix} x_1 & x_2 \end{pmatrix}\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} x_1 - x_2 & -x_1 + x_2 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = x_1^2 - x_2x_1 - x_1x_2 + x_2^2 = (x_1 - x_2)^2,$$
which equals 0 whenever $x_1 = x_2$ (whether or not $x_1 = x_2 = 0$) and is > 0 whenever $x_1 \neq x_2$. The matrix is therefore positive semidefinite.

(c)
$$x'Ax = \begin{pmatrix} x_1 & x_2 \end{pmatrix}\begin{pmatrix} 1 & -\tfrac{9}{2} \\ -\tfrac{9}{2} & 4 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = x_1^2 - \tfrac{9}{2}x_1x_2 - \tfrac{9}{2}x_1x_2 + 4x_2^2 = x_1^2 - 9x_1x_2 + 4x_2^2,$$
which is > 0 if $x_1 = 0$, $x_2 \neq 0$ or if $x_2 = 0$, $x_1 \neq 0$, but < 0 if $x_1 = x_2 = 1$. The matrix is therefore neither positive definite nor positive semidefinite.
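A symmetric matrix is positive definite exactly when all its eigenvalues are positive, and positive semidefinite when they are all non-negative; this eigenvalue criterion is not stated in the guide but gives a quick numerical check of the three answers above (a sketch for illustration only):

```python
import numpy as np

def classify(A, tol=1e-12):
    """Classify a symmetric matrix using the eigenvalue criterion."""
    eig = np.linalg.eigvalsh(A)
    if np.all(eig > tol):
        return "positive definite"
    if np.all(eig > -tol):
        return "positive semidefinite"
    return "neither"

Aa = np.array([[1.0, 0.0], [0.0, 4.0]])
Ab = np.array([[1.0, -1.0], [-1.0, 1.0]])
Ac = np.array([[1.0, -4.5], [-4.5, 4.0]])

print(classify(Aa))  # positive definite
print(classify(Ab))  # positive semidefinite
print(classify(Ac))  # neither
```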
(M4) (i) The rank of a matrix A, denoted by r(A), is defined to be the number of linearly
independent rows of A (or, equivalently, the number of linearly independent columns).
(iii) If A is a p × p matrix and r(A) = p then A is said to be non-singular and A−1 exists;
if r(A) < p then A is singular and A−1 does not exist.
(M6) The trace of a p × p matrix A, denoted by tr(A), is defined to be the sum of its diagonal elements: if $A = (a_{ij})$, then $\mathrm{tr}(A) = \sum_{i=1}^{p} a_{ii}$.

Thus
$$x'Ax = \mathrm{tr}(x'Ax) = \mathrm{tr}(x\,x'A) = \mathrm{tr}(Ax\,x').$$
Example 1.3.
Let $x = \begin{pmatrix} x \\ y \end{pmatrix}$ and $A = \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}$. Verify that $x'Ax = \mathrm{tr}(x\,x'A) = \mathrm{tr}(Ax\,x')$.

Solution 1.3.

$$x'Ax = \begin{pmatrix} x & y \end{pmatrix}\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x + 2y & 2x + 4y \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = x^2 + 2yx + 2xy + 4y^2 = x^2 + 4xy + 4y^2.$$

$$x\,x'A = \begin{pmatrix} x^2 & xy \\ yx & y^2 \end{pmatrix}\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix} = \begin{pmatrix} x^2 + 2xy & 2x^2 + 4xy \\ yx + 2y^2 & 2yx + 4y^2 \end{pmatrix}, \qquad \mathrm{tr}(x\,x'A) = x^2 + 4xy + 4y^2.$$

$$Ax\,x' = \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix}\begin{pmatrix} x & y \end{pmatrix} = \begin{pmatrix} x + 2y \\ 2x + 4y \end{pmatrix}\begin{pmatrix} x & y \end{pmatrix} = \begin{pmatrix} x^2 + 2yx & xy + 2y^2 \\ 2x^2 + 4yx & 2xy + 4y^2 \end{pmatrix}, \qquad \mathrm{tr}(Ax\,x') = x^2 + 4xy + 4y^2.$$

Note: $(x\,x'A)' = Ax\,x'$.

If A is idempotent, that is AA = A, then r(A) = tr(A).
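The trace identity in (M6) can be confirmed numerically at an arbitrary point; the following NumPy sketch (illustrative only) uses the matrix of Example 1.3:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=2)
A = np.array([[1.0, 2.0], [2.0, 4.0]])

xAx = x @ A @ x                     # the scalar x'Ax
t1 = np.trace(np.outer(x, x) @ A)   # tr(x x' A)
t2 = np.trace(A @ np.outer(x, x))   # tr(A x x')

assert np.isclose(xAx, t1) and np.isclose(xAx, t2)
```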
Example 1.4.
Which of the following matrices are idempotent?

$$\text{(a) } (1) \qquad \text{(b) } \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \qquad \text{(c) } \begin{pmatrix} \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix} \qquad \text{(d) } \begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix}$$

Give the ranks of the idempotent matrices. Which of the matrices are singular?

Solution 1.4.

(a) $(1)(1) = (1)$, so A is idempotent with $r(A) = \mathrm{tr}(A) = 1$; A is non-singular.

(b)
$$AA = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = A,$$
so A is idempotent with $r(A) = \mathrm{tr}(A) = 2$; A is non-singular.

(c)
$$AA = \begin{pmatrix} \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix}\begin{pmatrix} \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix} = \begin{pmatrix} \tfrac{1}{4}+\tfrac{1}{4} & \tfrac{1}{4}+\tfrac{1}{4} \\ \tfrac{1}{4}+\tfrac{1}{4} & \tfrac{1}{4}+\tfrac{1}{4} \end{pmatrix} = \begin{pmatrix} \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix} = A.$$
A is idempotent. Thus $r(A) = \mathrm{tr}(A) = \tfrac{1}{2} + \tfrac{1}{2} = 1$, and A is singular.

(d)
$$AA = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix}\begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac{1}{4}+\tfrac{1}{4} & \tfrac{1}{4}+\tfrac{1}{4} \\ 0 & \tfrac{1}{4}+\tfrac{1}{4} & \tfrac{1}{4}+\tfrac{1}{4} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix} = A.$$
A is idempotent. Thus $r(A) = \mathrm{tr}(A) = 1 + \tfrac{1}{2} + \tfrac{1}{2} = 2 < 3$, and A is singular.
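The four matrices can be checked in one loop; this illustrative NumPy sketch verifies idempotency and the rank-equals-trace rule for each:

```python
import numpy as np

mats = {
    "a": np.array([[1.0]]),
    "b": np.eye(2),
    "c": np.full((2, 2), 0.5),
    "d": np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.5, 0.5],
                   [0.0, 0.5, 0.5]]),
}

for name, A in mats.items():
    idem = np.allclose(A @ A, A)           # AA = A?
    rank = np.linalg.matrix_rank(A)        # should equal tr(A) when idempotent
    print(name, idem, rank, np.trace(A))
```

All four are idempotent, with ranks 1, 2, 1 and 2; (c) and (d) have rank less than their dimension and are therefore singular.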
(M11) If X is an n × m matrix with rank r(X) = m (thus m ≤ n) then $X(X'X)^{-1}X'$ is idempotent with rank m. Furthermore, $I_n - X(X'X)^{-1}X'$ is idempotent with rank n − m. Finally, $(X(X'X)^{-1}X')(I_n - X(X'X)^{-1}X') = O$, the n × n matrix of zeros.

(M13) From (M11) and (M12) it follows that $1(1'1)^{-1}1'$ and $I_n - 1(1'1)^{-1}1'$ are idempotent with ranks 1 and n − 1 respectively, and that their product is equal to the null matrix O.
Example 1.5.

(a) Let $X = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{pmatrix}$. Compute $X(X'X)^{-1}X'$.

(b) Let $x = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}$. Compute $x(x'x)^{-1}x'$ and $I_3 - x(x'x)^{-1}x'$.

Solution 1.5.

(a)
$$X'X = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{pmatrix} = \begin{pmatrix} 3 & 3 \\ 3 & 5 \end{pmatrix}.$$
$\det(X'X) = 15 - 9 = 6$, so
$$(X'X)^{-1} = \frac{1}{6}\begin{pmatrix} 5 & -3 \\ -3 & 3 \end{pmatrix} = \begin{pmatrix} \tfrac{5}{6} & -\tfrac{1}{2} \\ -\tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix}.$$
Then
$$X(X'X)^{-1}X' = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{pmatrix}\begin{pmatrix} \tfrac{5}{6} & -\tfrac{1}{2} \\ -\tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix}\begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \end{pmatrix} = \begin{pmatrix} \tfrac{5}{6} & -\tfrac{1}{2} \\ \tfrac{1}{3} & 0 \\ -\tfrac{1}{6} & \tfrac{1}{2} \end{pmatrix}\begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \end{pmatrix} = \begin{pmatrix} \tfrac{5}{6} & \tfrac{2}{6} & -\tfrac{1}{6} \\ \tfrac{2}{6} & \tfrac{2}{6} & \tfrac{2}{6} \\ -\tfrac{1}{6} & \tfrac{2}{6} & \tfrac{5}{6} \end{pmatrix}.$$

(b) $x'x = \begin{pmatrix} 1 & 1 & 1 \end{pmatrix}\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} = 3$. Then
$$x(x'x)^{-1}x' = \frac{1}{3}\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}\begin{pmatrix} 1 & 1 & 1 \end{pmatrix} = \begin{pmatrix} \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{pmatrix}$$
and
$$I_3 - x(x'x)^{-1}x' = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} - \begin{pmatrix} \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{pmatrix} = \begin{pmatrix} \tfrac{2}{3} & -\tfrac{1}{3} & -\tfrac{1}{3} \\ -\tfrac{1}{3} & \tfrac{2}{3} & -\tfrac{1}{3} \\ -\tfrac{1}{3} & -\tfrac{1}{3} & \tfrac{2}{3} \end{pmatrix}.$$
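The matrix in (a), together with the properties promised by (M11), can be verified numerically; the sketch below (illustrative only) builds the projection matrix and checks idempotency, the ranks and the zero product:

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])

H = X @ np.linalg.inv(X.T @ X) @ X.T   # X(X'X)^{-1}X'
M = np.eye(3) - H                      # I_n - X(X'X)^{-1}X'

expected = np.array([[5, 2, -1],
                     [2, 2, 2],
                     [-1, 2, 5]]) / 6

assert np.allclose(H, expected)
assert np.allclose(H @ H, H)                   # idempotent, rank m = 2
assert np.allclose(M @ M, M)                   # idempotent, rank n - m = 1
assert np.allclose(H @ M, np.zeros((3, 3)))    # product is the null matrix
```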
where
$$\frac{\partial}{\partial x} = \begin{pmatrix} \dfrac{\partial}{\partial x_1} \\ \vdots \\ \dfrac{\partial}{\partial x_p} \end{pmatrix}.$$

(In general, the derivative of a 1 × n row vector y′ with respect to the p × 1 column vector x is defined to be the p × n matrix
$$\frac{\partial y'}{\partial x} = (a_{ij}) \quad \text{where} \quad a_{ij} = \frac{\partial y_j}{\partial x_i}.)$$
Example 1.6.
Let $x' = \begin{pmatrix} u & v & w \end{pmatrix}$ and $A = \begin{pmatrix} 3 & 1 & 2 \\ 1 & 4 & 5 \\ 2 & 5 & 6 \end{pmatrix}$. Write down $\dfrac{\partial}{\partial x}(x'Ax)$.

Solution 1.6.

$$\frac{\partial}{\partial x}(x'Ax) = 2Ax = 2\begin{pmatrix} 3 & 1 & 2 \\ 1 & 4 & 5 \\ 2 & 5 & 6 \end{pmatrix}\begin{pmatrix} u \\ v \\ w \end{pmatrix} = 2\begin{pmatrix} 3u + v + 2w \\ u + 4v + 5w \\ 2u + 5v + 6w \end{pmatrix} = \begin{pmatrix} 6u + 2v + 4w \\ 2u + 8v + 10w \\ 4u + 10v + 12w \end{pmatrix}.$$

Alternatively, multiply out first:
$$x'Ax = \begin{pmatrix} u & v & w \end{pmatrix}\begin{pmatrix} 3 & 1 & 2 \\ 1 & 4 & 5 \\ 2 & 5 & 6 \end{pmatrix}\begin{pmatrix} u \\ v \\ w \end{pmatrix} = 3u^2 + vu + 2wu + uv + 4v^2 + 5wv + 2uw + 5vw + 6w^2,$$
so that
$$\frac{\partial}{\partial u}(x'Ax) = 6u + 2v + 4w, \qquad \frac{\partial}{\partial v}(x'Ax) = 2u + 8v + 10w, \qquad \frac{\partial}{\partial w}(x'Ax) = 4u + 10v + 12w.$$
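The formula ∂(x′Ax)/∂x = 2Ax can be checked against a finite-difference approximation of the gradient; the following NumPy sketch (illustrative only, at an arbitrary point) does exactly that for the matrix of Example 1.6:

```python
import numpy as np

A = np.array([[3.0, 1.0, 2.0],
              [1.0, 4.0, 5.0],
              [2.0, 5.0, 6.0]])

def f(x):
    return x @ A @ x   # the quadratic form x'Ax

x0 = np.array([1.0, -2.0, 0.5])
analytic = 2 * A @ x0  # the gradient 2Ax from the example

# independent check: central finite differences
h = 1e-6
numeric = np.array([(f(x0 + h*e) - f(x0 - h*e)) / (2*h) for e in np.eye(3)])

assert np.allclose(analytic, numeric, atol=1e-4)
```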
In this section we list a number of results from distribution theory. Their application will become
clear in the next chapter.
Let y be a column vector consisting of p random variables, that is $y' = (y_1 \cdots y_p)$. Let the expected value of $y_i$ be $\mu_i$ and let $\mu' = (\mu_1 \cdots \mu_p)$. Also, let the covariance of $y_i$ and $y_j$ be $\sigma_{ij}$. Write
$$\Sigma = (\sigma_{ij}) = \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1p} \\ \vdots & & \vdots \\ \sigma_{p1} & \cdots & \sigma_{pp} \end{pmatrix}.$$
Obviously $\sigma_{ij} = \sigma_{ji}$, so that Σ is symmetric. Σ is called the covariance matrix of y. We also write $\mathrm{Cov}(y, y') = \Sigma$.
(D1) A covariance matrix must be positive definite (or at least positive semidefinite).
(D2) The p × 1 random vector y is said to have a multivariate normal distribution with mean vector μ and covariance matrix Σ, also written $y \sim n(\mu; \Sigma)$, if the probability density function of y is given by
$$f(y) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left[-\tfrac{1}{2}(y - \mu)'\Sigma^{-1}(y - \mu)\right].$$
The covariance matrix Σ must be positive definite otherwise f (y) will not be a probability
density function.
(D3) If y is a random vector with E(y) = μ and Cov(y, y′) = Σ, and if A is a matrix of constants, then z = Ay is a random vector with E(z) = Aμ and Cov(z, z′) = AΣA′.
Example 1.7.

$$x \sim n\left(\begin{pmatrix} 10 \\ 15 \\ 12 \end{pmatrix};\ \begin{pmatrix} 4 & 3 & -1 \\ 3 & 9 & 3 \\ -1 & 3 & 1 \end{pmatrix}\right). \qquad \text{What is the distribution of } y = \begin{pmatrix} 1 & -1 & 0 \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{pmatrix}x\,?$$

Solution 1.7.

Here $\mu = \begin{pmatrix} 10 & 15 & 12 \end{pmatrix}'$ and Σ is as given above. Now $y = Ax$ where
$$A = \begin{pmatrix} 1 & -1 & 0 \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{pmatrix}.$$
Then
$$A\mu = \begin{pmatrix} 1 & -1 & 0 \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{pmatrix}\begin{pmatrix} 10 \\ 15 \\ 12 \end{pmatrix} = \begin{pmatrix} -5 \\ \tfrac{37}{3} \end{pmatrix}$$
and
$$A\Sigma A' = \begin{pmatrix} 1 & -1 & 0 \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{pmatrix}\begin{pmatrix} 4 & 3 & -1 \\ 3 & 9 & 3 \\ -1 & 3 & 1 \end{pmatrix}\begin{pmatrix} 1 & \tfrac{1}{3} \\ -1 & \tfrac{1}{3} \\ 0 & \tfrac{1}{3} \end{pmatrix} = \begin{pmatrix} 1 & -6 & -4 \\ 2 & 5 & 1 \end{pmatrix}\begin{pmatrix} 1 & \tfrac{1}{3} \\ -1 & \tfrac{1}{3} \\ 0 & \tfrac{1}{3} \end{pmatrix} = \begin{pmatrix} 7 & -3 \\ -3 & \tfrac{8}{3} \end{pmatrix}.$$

Thus, by (D3),
$$y \sim n\left(\begin{pmatrix} -5 \\ \tfrac{37}{3} \end{pmatrix};\ \begin{pmatrix} 7 & -3 \\ -3 & \tfrac{8}{3} \end{pmatrix}\right).$$
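The matrix arithmetic in Example 1.7 reduces to two products, Aμ and AΣA′, which makes it a convenient place for a numerical cross-check (an illustrative sketch only):

```python
import numpy as np

mu = np.array([10.0, 15.0, 12.0])
Sigma = np.array([[4.0, 3.0, -1.0],
                  [3.0, 9.0,  3.0],
                  [-1.0, 3.0, 1.0]])
A = np.array([[1.0, -1.0, 0.0],
              [1/3, 1/3, 1/3]])

mean_y = A @ mu            # mean vector of y = Ax
cov_y = A @ Sigma @ A.T    # covariance matrix of y, by (D3)

assert np.allclose(mean_y, [-5.0, 37/3])
assert np.allclose(cov_y, [[7.0, -3.0], [-3.0, 8/3]])
```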
(D5) If y ∼ n(0; Ip ) then y 0 y ∼ χ2p (that is y 0 y has a chi-square distribution with p degrees of
freedom).
(D7) If y ∼ n(µ; σ 2 Ip ) then y 0 y/σ 2 is said to have a non-central chi-square distribution with p
degrees of freedom and noncentrality parameter (n.c.p.) λ = µ0 µ/σ 2 . (Some textbooks use
the alternative convention λ = 21 µ0 µ/σ 2 ).
When λ = 0 the distribution becomes the (central) chi-square distribution. The n.c.p. λ is
non-negative by definition. As λ increases, the distribution shifts to the right. This is also
reflected in the following theorem.
(D8) If x is a noncentral chi-square variate with p degrees of freedom and noncentrality parameter λ, then x has mean and variance
$$E(x) = p + \lambda \qquad \text{and} \qquad \mathrm{Var}(x) = 2p + 4\lambda.$$
Example 1.8.

$$y \sim n\left(\begin{pmatrix} 3 \\ 4 \end{pmatrix};\ \begin{pmatrix} 5 & 0 \\ 0 & 5 \end{pmatrix}\right).$$

(a) What is the distribution of $\tfrac{1}{5}(y_1^2 + y_2^2)$? Calculate its mean and variance.

(b) Which function of $y_1$ and $y_2$ has a (central) $\chi^2_2$ distribution?

Solution 1.8.

(a) Using (D7), $\tfrac{1}{5}(y_1^2 + y_2^2) \sim \chi^2_{2,\lambda}$ where $\lambda = \mu'\mu/\sigma^2$. Here
$$\mu'\mu = \begin{pmatrix} 3 & 4 \end{pmatrix}\begin{pmatrix} 3 \\ 4 \end{pmatrix} = 25,$$
so $\lambda = \mu'\mu/\sigma^2 = \tfrac{25}{5} = 5$. Using (D8), the mean is $p + \lambda = 2 + 5 = 7$ and the variance is $2p + 4\lambda = 4 + 20 = 24$.

(b) The random variable that has a $\chi^2_2$ distribution is $\tfrac{1}{5}\left[(y_1 - 3)^2 + (y_2 - 4)^2\right]$.
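The mean and variance found in (a) can be checked by simulation; the sketch below (illustrative only, with an arbitrary seed) draws a large sample of y vectors and inspects the sample moments of the quadratic form:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
# y ~ n((3, 4)'; 5 I_2), so each coordinate has standard deviation sqrt(5)
y = rng.normal(loc=[3.0, 4.0], scale=np.sqrt(5.0), size=(n, 2))

# should be noncentral chi-square with 2 df and n.c.p. lambda = 5
w = (y[:, 0]**2 + y[:, 1]**2) / 5.0

print(w.mean(), w.var())   # close to 7 and 24
```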
(D10) Let u and v be independent random variables with u a noncentral chi-square variate with p
degrees of freedom and n.c.p. λ and v ∼ χ2q . Then
u/p
F =
v/q
is said to have a noncentral F-distribution with p and q degrees of freedom and n.c.p. λ.
The effect of the n.c.p. λ on the distribution of F is the same as the effect of λ on the
distribution of u: increasing λ causes a shift to the right.
(D11) Let y ∼ n(µ; σ 2 Ip ). Let A be a symmetric matrix of constants. Then y 0 Ay/σ 2 has a (non-
central) chi-square distribution with r(A) degrees of freedom and n.c.p. λ = µ0 Aµ/σ 2 if and
only if A is idempotent, that is AA = A.
(D12) Let y ∼ n(µ; σ 2 Ip ) and let A and B be constant matrices. Then y 0 Ay/σ 2 and y 0 By/σ 2 are
independent if and only if AB = O.
Example 1.9.

$$y \sim n\left(\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix};\ \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}\right).$$
Let $x = \tfrac{1}{3}(y_1 + y_2 + y_3)^2$ and $z = y_1^2 + y_2^2 + y_3^2 - \tfrac{1}{3}(y_1 + y_2 + y_3)^2$.

(a) Write x and z as quadratic forms $y'Ay$ and $y'By$ with A and B symmetric.

(b) Show that x and z are independent and have chi-square distributions; compute the number of degrees of freedom of each.

Solution 1.9.

(a) Here $y \sim n(\mu; \sigma^2I_p)$ with $\mu = 0$, $\sigma^2 = 1$ and $p = 3$. First,
$$x = \tfrac{1}{3}(y_1 + y_2 + y_3)(y_1 + y_2 + y_3) = \tfrac{1}{3}(y_1^2 + y_2^2 + y_3^2 + y_1y_2 + y_2y_1 + y_1y_3 + y_3y_1 + y_2y_3 + y_3y_2)$$
$$= \begin{pmatrix} y_1 & y_2 & y_3 \end{pmatrix}\begin{pmatrix} \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{pmatrix}\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix}, \qquad \text{so} \quad A = \begin{pmatrix} \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{pmatrix}.$$
Next,
$$z = y_1^2 + y_2^2 + y_3^2 - \tfrac{1}{3}(y_1 + y_2 + y_3)^2 = \tfrac{2}{3}y_1^2 + \tfrac{2}{3}y_2^2 + \tfrac{2}{3}y_3^2 - \tfrac{1}{3}y_1y_2 - \tfrac{1}{3}y_2y_1 - \tfrac{1}{3}y_1y_3 - \tfrac{1}{3}y_3y_1 - \tfrac{1}{3}y_2y_3 - \tfrac{1}{3}y_3y_2,$$
$$\text{so} \quad B = \begin{pmatrix} \tfrac{2}{3} & -\tfrac{1}{3} & -\tfrac{1}{3} \\ -\tfrac{1}{3} & \tfrac{2}{3} & -\tfrac{1}{3} \\ -\tfrac{1}{3} & -\tfrac{1}{3} & \tfrac{2}{3} \end{pmatrix}.$$

(b) Direct multiplication gives
$$AA = A, \qquad BB = B, \qquad AB = O.$$
Since AA = A and BB = B, (D11) shows that x and z have chi-square distributions with $r(A) = \mathrm{tr}(A) = 1$ and $r(B) = \mathrm{tr}(B) = 2$ degrees of freedom respectively: $x \sim \chi^2_1$ and $z \sim \chi^2_2$. Since AB = O, (D12) shows that x and z are independent.
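The three matrix facts used in (b) — AA = A, BB = B and AB = O — take one line each to confirm numerically (an illustrative sketch):

```python
import numpy as np

J = np.full((3, 3), 1/3)   # A: the matrix of the quadratic form x
A = J
B = np.eye(3) - J          # B: the matrix of the quadratic form z

assert np.allclose(A @ A, A)                 # A idempotent -> chi-square
assert np.allclose(B @ B, B)                 # B idempotent -> chi-square
assert np.allclose(A @ B, np.zeros((3, 3)))  # AB = O -> independence
print(np.trace(A), np.trace(B))              # degrees of freedom: 1 and 2
```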
(D13) If $y \sim n(\mu; \sigma^2I_p)$ then the quadratic form $y'Ay/\sigma^2$ and the linear form $By$ are independent if and only if $BA = O$.
(D14) More generally, if y ∼ n(µ; V ) then y 0 Ay has a noncentral chi-square distribution with r(A)
degrees of freedom and n.c.p. λ = µ0 Aµ if and only if AV is idempotent.
Example 1.10.

$$y \sim n\left(\begin{pmatrix} \mu \\ \mu \end{pmatrix};\ \begin{pmatrix} \sigma^2 & \sigma^2\rho \\ \sigma^2\rho & \sigma^2 \end{pmatrix}\right).$$

Solution 1.10.

Write $y \sim n\left(\begin{pmatrix} \mu \\ \mu \end{pmatrix};\ \sigma^2V\right)$ where $V = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$ and $r(V) = 2$.

Using (D16), consider $(y - \mu)'V^{-1}(y - \mu)/\sigma^2$, where
$$V^{-1} = \frac{1}{1 - \rho^2}\begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix}.$$
Then
$$\frac{(y - \mu)'V^{-1}(y - \mu)}{\sigma^2} = \frac{1}{\sigma^2(1 - \rho^2)}\begin{pmatrix} y_1 - \mu & y_2 - \mu \end{pmatrix}\begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix}\begin{pmatrix} y_1 - \mu \\ y_2 - \mu \end{pmatrix}$$
$$= \frac{1}{\sigma^2(1 - \rho^2)}\begin{pmatrix} y_1 - \mu - \rho(y_2 - \mu) & -\rho(y_1 - \mu) + (y_2 - \mu) \end{pmatrix}\begin{pmatrix} y_1 - \mu \\ y_2 - \mu \end{pmatrix}$$
$$= \frac{1}{\sigma^2(1 - \rho^2)}\left[(y_1 - \mu)^2 - \rho(y_2 - \mu)(y_1 - \mu) - \rho(y_1 - \mu)(y_2 - \mu) + (y_2 - \mu)^2\right]$$
$$= \frac{(y_1 - \mu)^2 - 2\rho(y_1 - \mu)(y_2 - \mu) + (y_2 - \mu)^2}{\sigma^2(1 - \rho^2)}.$$
Exercise 1.1

1. Write the following quadratic forms in matrix notation. Ensure that each matrix is symmetric.

2. Determine whether each of the following matrices is positive definite or positive semidefinite.
$$\text{(a) } \begin{pmatrix} 9 & -3 \\ -3 & 1 \end{pmatrix} \qquad \text{(b) } \begin{pmatrix} \tfrac{2}{3} & -\tfrac{1}{3} & -\tfrac{1}{3} \\ -\tfrac{1}{3} & \tfrac{2}{3} & -\tfrac{1}{3} \\ -\tfrac{1}{3} & -\tfrac{1}{3} & \tfrac{2}{3} \end{pmatrix} \qquad \text{(c) } \begin{pmatrix} 2 & -\tfrac{2}{3} \\ -\tfrac{2}{3} & 3 \end{pmatrix} \qquad \text{(d) } \begin{pmatrix} 1 & 2 & 0 \\ 2 & 5 & -1 \\ 0 & -1 & 1 \end{pmatrix}$$

3. Evaluate the following statements as true or false, and substantiate your answers for statements which are false. Consider the matrix
$$A = \begin{pmatrix} \tfrac{1}{2} & \tfrac{1}{2} & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$

(a) Let $y \sim n(\mu; \sigma^2I_p)$. Then $\dfrac{y'Ay}{\sigma^2}$ has a (noncentral) chi-square distribution with three degrees of freedom and noncentrality parameter λ.

(b) Let $y \sim n(\mu; \sigma^2I_p)$ and $B = \begin{pmatrix} \tfrac{1}{2} & \tfrac{1}{2} & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} & 0 \\ 0 & 0 & 0 \end{pmatrix}$. Then $\dfrac{y'Ay}{\sigma^2}$ and $\dfrac{y'By}{\sigma^2}$ are independent.

(c) The p × 1 random vector y is said to have a multivariate normal distribution with mean vector $\mu' = \begin{pmatrix} 3 & 9 & 2 \end{pmatrix}$ and covariance matrix $\Sigma = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac{1}{4} & \tfrac{1}{4} \\ 0 & \tfrac{1}{4} & \tfrac{1}{4} \end{pmatrix}$. Then
$$f(y) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left[-\tfrac{1}{2}(y - \mu)'\Sigma^{-1}(y - \mu)\right]$$
is a probability density function.

(d) The quadratic form $2(r - \tfrac{1}{2}s)(r + \tfrac{3}{2}s) - q(r - q)$ can be written in matrix notation, …

$$q_1 = x_1^2 + \tfrac{1}{10}(x_2 + 3x_3)^2 \qquad q_2 = k(3x_2 - x_3)^2 \qquad q_3 = \tfrac{1}{2}(x_1 - x_2)^2$$

(d) Determine the degrees of freedom and the noncentrality parameters of $q_1$ and of $q_2$.

(a) What is the distribution of $\dfrac{y'y}{\sigma^2}$?

(b) Calculate the mean and variance of the distribution in (a).

6. Let $x: 2 \times 1 \sim n(\mu; \Sigma)$ with $\mu = \begin{pmatrix} 3 \\ 2 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} \tfrac{3}{2} & 1 \\ 1 & \tfrac{3}{2} \end{pmatrix}$.

(a) What is the distribution of $x'Ax$ if $A = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$?

(b) If $B = \begin{pmatrix} 2 & -4 \\ -4 & 2 \end{pmatrix}$, are $x'Ax$ and $x'Bx$ independent? Explain.
Chapter 2

THE LINEAR MODEL
We assume that
$$y = X\beta + \varepsilon$$
where y is an n × 1 vector of random variables, X is an n × p matrix of known constants with n > p, β is a p × 1 vector of unknown parameters¹, and ε is an n × 1 vector of (unobservable) random variables. Thus ε is not a vector of parameters, but a Greek letter is used nevertheless, since ε is unknown. In this model X is selected in advance, y is observed and β must be estimated; ε is a random disturbance. The following assumptions are usually made:
Assumption 1
The matrix X has full column rank, that is r(X) = p. This implies that the rank of the p × p symmetric matrix X′X is p and therefore X′X is non-singular.

¹We usually assume that β is a vector of constants. In chapter 3 we will consider cases where some of the βᵢ are random variables.

This assumption is not essential for the linear model in general, but it simplifies the algebra, and in most practical applications the model can be formulated in such a way that X′X is in fact non-singular. Without this assumption one would first have to study the subject of generalised inverses of matrices.
Assumption 2
$$E(\varepsilon) = 0.$$
This assumption is essential; if E(ε) ≠ 0 one could incorporate E(ε) into the vector of unknown parameters β, possibly adding an extra parameter to β and an extra column to X. Under this assumption,
$$E(y) = X\beta.$$
The term ε is usually called the error term: it provides for the random fluctuation inherent in a statistical experiment.
Assumption 3
$$E(\varepsilon\varepsilon') = \sigma^2I$$
which means that the components of ε are uncorrelated and all have the same variance σ².
Assumption 4
Each column of X represents n values of a variable, often called an independent variable, predictor, concomitant variable, et cetera. We refer to them here as design variables, and X will be called the design matrix. We assume that (ideally) the design matrix X was chosen in advance, and that the n values of the response variable y which constitute the vector y were observed accordingly by means of an experiment.
The linear model in general does not exclude the possibility that X may be a random matrix. In
the problems dealt with in this module we will usually have a non-random matrix X. Many (but
not all) of the results derived in this module will be valid if X is in fact random.
Assumption 5
ε has a multivariate normal distribution: ε ∼ n(0; σ²I).
This assumption is not essential for the purpose of estimating β, but as soon as we want to test hypotheses about the parameters or construct confidence intervals for them, we have to make this assumption. Also, least squares estimators are maximum likelihood estimators if assumption 5 holds.
Assumption 6
n > p, that is, there are more observations than parameters. This is a very important assumption; without it the least squares estimators are not unique. If n < p then r(X′X) < p and therefore X′X will be singular (cf. assumption 1).
In this section it is shown that some of the statistical problems dealt with in most elementary
courses are special cases of the linear model.
$$y_1 = \mu + \varepsilon_1$$
$$\vdots$$
$$y_n = \mu + \varepsilon_n$$
where ε₁, …, εₙ are independent n(0; σ²) variates, that is ε ∼ n(0; σ²I). This may be written as
$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}(\mu) + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
or $y = X\beta + \varepsilon$.
Let y₁, …, y_{n₁} be a random sample from a n(μ₁; σ²) distribution and v₁, …, v_{n₂} a random sample from a n(μ₂; σ²) distribution such that the two samples are independent. We rename the second sample by writing
$$y_{n_1+i} = v_i, \qquad i = 1, \ldots, n_2.$$
Then
$$\begin{pmatrix} y_1 \\ \vdots \\ y_{n_1} \\ y_{n_1+1} \\ \vdots \\ y_{n_1+n_2} \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 0 & 1 \\ \vdots & \vdots \\ 0 & 1 \end{pmatrix}\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_{n_1} \\ \varepsilon_{n_1+1} \\ \vdots \\ \varepsilon_{n_1+n_2} \end{pmatrix}$$
$$E(y_i) = \alpha + \beta x_i, \qquad \mathrm{Var}(y_i) = \sigma^2.$$
Thus
$$y_1 = \alpha + \beta x_1 + \varepsilon_1$$
$$\vdots$$
$$y_n = \alpha + \beta x_n + \varepsilon_n,$$
that is
$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}\begin{pmatrix} \alpha \\ \beta \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
or $y = X\beta + \varepsilon$.
Example 2.1.
Assume that E(y_i) = β₀ + β₁x_i and Var(y_i) = σ²/w_i, i = 1, …, n, where y₁, …, yₙ are independent and w₁, …, wₙ are known constants. This is a particular "weighted regression" model. Write it in matrix notation.

Solution 2.1.
$$y_i = \beta_0 + \beta_1x_i + \varepsilon_i,$$
that is
$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
2.3 Estimation
The problem is to estimate β. The estimator will be denoted by β̂. The corresponding vector of values of the response variable, found by replacing β by β̂ and ε by 0 in the model, is denoted by ŷ. Thus ŷ = Xβ̂.

The difference between the observed and predicted response, y − ŷ, is called the vector of residuals and is said to be an estimator for ε. (This is an unconventional use of the term "estimator", since one usually estimates parameters while ε is a vector of unobservable random variables.) Let
$$e = y - \hat{y} = y - X\hat{\beta}.$$
Then, provided E(β̂) = β,
$$E(e) = E(y) - XE(\hat{\beta}) = X\beta - X\beta = 0 = E(\varepsilon)$$
by assumption. The fact that E(e) = E(ε) is the justification for calling e an estimator for ε.
The least squares criterion states that the estimator β̂ must be found in such a way that e′e, the sum of squares of the residuals, is a minimum. Thus we have to minimise
$$s = e'e = (y - X\hat{\beta})'(y - X\hat{\beta}) = y'y - \hat{\beta}'X'y - y'X\hat{\beta} + \hat{\beta}'X'X\hat{\beta} = y'y - 2\hat{\beta}'X'y + \hat{\beta}'X'X\hat{\beta}$$
since $\hat{\beta}'X'y$ and $y'X\hat{\beta}$ are scalars and therefore equal.

In order to minimise s we set $\partial s/\partial\hat{\beta} = 0$. Using (M14) and (M15) and noting that X′X is symmetric, we find
$$\frac{\partial s}{\partial\hat{\beta}} = -2X'y + 2X'X\hat{\beta}.$$
Setting this equal to zero gives
$$X'X\hat{\beta} = X'y \qquad \therefore\ \hat{\beta} = (X'X)^{-1}X'y$$
since we have assumed that X′X is non-singular. This is the least squares estimator. (Can you prove that β̂ minimises rather than maximises s?)
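The normal equations X′Xβ̂ = X′y can be solved directly in NumPy; the sketch below (an illustration with simulated data, not part of the guide) cross-checks the result against NumPy's built-in least squares routine:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3
# design matrix with an intercept column and two random design variables
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# least squares estimator from the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# cross-check against NumPy's built-in least squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(beta_hat, beta_lstsq)
```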
If we assume in addition that ε has a multivariate normal distribution, that is ε ∼ n(0; σ²I), then the same estimator is also the maximum likelihood estimator. More generally, if we assume that ε ∼ n(0; σ²V), that is y ∼ n(Xβ; σ²V), where V is non-singular, then the joint probability density function (p.d.f.) of the elements of y, which is the likelihood function, is
$$L = \frac{1}{(2\pi\sigma^2)^{n/2}|V|^{1/2}}\exp\left[-\frac{1}{2\sigma^2}(y - X\beta)'V^{-1}(y - X\beta)\right]$$
$$\therefore\ \ln L = -\tfrac{1}{2}n\ln(2\pi\sigma^2) - \tfrac{1}{2}\ln|V| - \frac{1}{2\sigma^2}(y - X\beta)'V^{-1}(y - X\beta).$$
In order to maximise ln L (and therefore L) with respect to β we have to minimise
$$s = (y - X\beta)'V^{-1}(y - X\beta) = y'V^{-1}y - 2\beta'X'V^{-1}y + \beta'X'V^{-1}X\beta.$$
Now
$$\frac{\partial s}{\partial\beta} = -2X'V^{-1}y + 2X'V^{-1}X\beta = 0 \qquad \text{if } X'V^{-1}X\beta = X'V^{-1}y.$$
We assume that V has full rank and X has full column rank, so that X′V⁻¹X is non-singular and therefore
$$\hat{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}y.$$
The (ordinary) least squares (OLS) estimator is a special case of this estimator (replace V by In ).
The more general estimator is called the generalised least squares (GLS) estimator; any property
of the GLS estimator will automatically be a property of the OLS estimator when the latter is
appropriate, that is when V = In .
Another special case of GLS is weighted least squares (WLS), which is used when V is a diagonal matrix, that is, if V = (v_{ij}) then v_{ij} = 0 for i ≠ j. In this case
$$V = \begin{pmatrix} v_{11} & 0 & \cdots & 0 \\ 0 & v_{22} & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & v_{nn} \end{pmatrix}, \qquad V^{-1} = \begin{pmatrix} 1/v_{11} & 0 & \cdots & 0 \\ 0 & 1/v_{22} & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & 1/v_{nn} \end{pmatrix}$$
and
$$(y - \hat{y})'V^{-1}(y - \hat{y}) = \sum_{i=1}^{n}\frac{1}{v_{ii}}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}w_i(y_i - \hat{y}_i)^2$$
where the weight $w_i = 1/v_{ii}$. Obviously OLS is also a special case of WLS with $v_{ii} = 1$, i = 1, …, n.
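The GLS formula translates directly into a short function; the sketch below (illustrative only — the helper name `gls` and the simulated data are not from the guide) also confirms that GLS with V = I reduces to OLS:

```python
import numpy as np

def gls(X, y, V):
    """Generalised least squares: (X'V^{-1}X)^{-1} X'V^{-1} y."""
    Vi = np.linalg.inv(V)
    return np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)

rng = np.random.default_rng(4)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])
v = rng.uniform(0.5, 2.0, size=n)   # diagonal of V (the WLS case)
V = np.diag(v)
y = X @ beta + rng.normal(scale=np.sqrt(v))

beta_gls = gls(X, y, V)

# with V = I the GLS estimator coincides with the OLS estimator
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(gls(X, y, np.eye(n)), beta_ols)
```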
We now assume that
$$y = X\beta + \varepsilon$$
and write
$$\hat{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}y = Ay \qquad \text{where} \qquad A = (X'V^{-1}X)^{-1}X'V^{-1}.$$
Thus β̂ is a linear estimator in the sense that each element of β̂ is a linear combination of y₁, …, yₙ. We are interested in the distribution of the random vector β̂.
Theorem 2.3.1
$$E(\hat{\beta}) = \beta \qquad \text{and} \qquad \mathrm{Cov}(\hat{\beta}, \hat{\beta}') = \sigma^2(X'V^{-1}X)^{-1}.$$
Proof:
By (D3),
$$E(\hat{\beta}) = AE(y) = (X'V^{-1}X)^{-1}X'V^{-1}(X\beta) = I_p\beta = \beta.$$
Thus β̂ is unbiased; we say it is an unbiased linear estimator. The covariance matrix of β̂ is likewise derived from (D3):
$$\mathrm{Cov}(\hat{\beta}, \hat{\beta}') = A\,\mathrm{Cov}(y, y')A' = (X'V^{-1}X)^{-1}X'V^{-1}(\sigma^2V)V^{-1}X(X'V^{-1}X)^{-1} = \sigma^2(X'V^{-1}X)^{-1}.$$
In the special case V = I,
$$\hat{\beta} = (X'X)^{-1}X'y, \qquad E(\hat{\beta}) = \beta, \qquad \mathrm{Cov}(\hat{\beta}, \hat{\beta}') = \sigma^2(X'X)^{-1}.$$
Theorem 2.3.2
If y ∼ n(Xβ; σ²V) and
$$\hat{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}y$$
then $\hat{\beta} \sim n\left[\beta;\ \sigma^2(X'V^{-1}X)^{-1}\right]$.
We apply the foregoing results to the three examples in section 2.2. In each case V = Iₙ.

For the one-sample problem,
$$X'X = n; \qquad (X'X)^{-1} = 1/n; \qquad X'y = y_1 + y_2 + \cdots + y_n$$
$$\therefore\ \hat{\mu} = (X'X)^{-1}X'y = \tfrac{1}{n}(y_1 + \cdots + y_n) = \bar{y}.$$

For the two-sample problem,
$$X' = \begin{pmatrix} 1 & \cdots & 1 & 0 & \cdots & 0 \\ 0 & \cdots & 0 & 1 & \cdots & 1 \end{pmatrix}$$
$$X'X = \begin{pmatrix} n_1 & 0 \\ 0 & n_2 \end{pmatrix}; \qquad (X'X)^{-1} = \begin{pmatrix} 1/n_1 & 0 \\ 0 & 1/n_2 \end{pmatrix}; \qquad X'y = \begin{pmatrix} y_1 + \cdots + y_{n_1} \\ y_{n_1+1} + \cdots + y_{n_1+n_2} \end{pmatrix}$$
$$\hat{\beta} = \begin{pmatrix} \hat{\mu}_1 \\ \hat{\mu}_2 \end{pmatrix} = \begin{pmatrix} (y_1 + \cdots + y_{n_1})/n_1 \\ (y_{n_1+1} + \cdots + y_{n_1+n_2})/n_2 \end{pmatrix} = \begin{pmatrix} \bar{y}_1 \\ \bar{y}_2 \end{pmatrix}, \text{ say}.$$
The covariance matrix is
$$\sigma^2(X'X)^{-1} = \begin{pmatrix} \sigma^2/n_1 & 0 \\ 0 & \sigma^2/n_2 \end{pmatrix}.$$
For the simple linear regression model,
$$|X'X| = n\sum x_i^2 - \left(\sum x_i\right)^2 = n\sum(x_i - \bar{x})^2.$$
$$(X'X)^{-1} = \begin{pmatrix} \dfrac{\sum x_i^2}{n\sum(x_i - \bar{x})^2} & -\dfrac{\sum x_i}{n\sum(x_i - \bar{x})^2} \\[2ex] -\dfrac{\sum x_i}{n\sum(x_i - \bar{x})^2} & \dfrac{n}{n\sum(x_i - \bar{x})^2} \end{pmatrix}; \qquad X'y = \begin{pmatrix} \sum y_i \\ \sum x_iy_i \end{pmatrix}$$
$$\hat{\beta} = \begin{pmatrix} \hat{\alpha} \\ \hat{\beta} \end{pmatrix} = (X'X)^{-1}X'y = \begin{pmatrix} \dfrac{\sum x_i^2\sum y_i - \sum x_i\sum x_iy_i}{n\sum(x_i - \bar{x})^2} \\[2ex] \dfrac{n\sum x_iy_i - \sum x_i\sum y_i}{n\sum(x_i - \bar{x})^2} \end{pmatrix}$$
Note that $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$.
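The explicit formulas for α̂ and β̂, and the identity α̂ = ȳ − β̂x̄, can be checked on simulated data; the sketch below (illustrative only) also cross-checks against NumPy's polynomial fit:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=40)
y = 1.5 + 0.8 * x + rng.normal(scale=0.2, size=40)
n = len(x)

D = n * np.sum((x - x.mean())**2)   # the common denominator n * sum((x_i - xbar)^2)
beta_hat = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / D
alpha_hat = (np.sum(x**2) * np.sum(y) - np.sum(x) * np.sum(x * y)) / D

# identity alpha-hat = ybar - beta-hat * xbar
assert np.isclose(alpha_hat, y.mean() - beta_hat * x.mean())

# cross-check against a degree-1 polynomial fit (slope, intercept)
coef = np.polyfit(x, y, 1)
assert np.isclose(beta_hat, coef[0]) and np.isclose(alpha_hat, coef[1])
```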
Example 2.2.
Repeat the analysis for the weighted regression model in example 2.1.

Solution 2.2.
The model is $y_i = \beta_0 + \beta_1x_i + \varepsilon_i$, that is
$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
where E(ε) = 0 and E(εε′) = σ²V with
$$V = \begin{pmatrix} \tfrac{1}{w_1} & 0 & \cdots & 0 \\ 0 & \tfrac{1}{w_2} & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & \tfrac{1}{w_n} \end{pmatrix} \qquad \text{and} \qquad V^{-1} = \begin{pmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & w_n \end{pmatrix}.$$
With $X' = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{pmatrix}$,
$$X'V^{-1}X = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{pmatrix}\begin{pmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & w_n \end{pmatrix}\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} = \begin{pmatrix} w_1 & w_2 & \cdots & w_n \\ w_1x_1 & w_2x_2 & \cdots & w_nx_n \end{pmatrix}\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} = \begin{pmatrix} \sum w_i & \sum w_ix_i \\ \sum w_ix_i & \sum w_ix_i^2 \end{pmatrix}$$
$$(X'V^{-1}X)^{-1} = \frac{1}{\sum w_i\sum w_ix_i^2 - \left(\sum w_ix_i\right)^2}\begin{pmatrix} \sum w_ix_i^2 & -\sum w_ix_i \\ -\sum w_ix_i & \sum w_i \end{pmatrix}$$
$$X'V^{-1}y = \begin{pmatrix} w_1 & w_2 & \cdots & w_n \\ w_1x_1 & w_2x_2 & \cdots & w_nx_n \end{pmatrix}\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} \sum w_iy_i \\ \sum w_ix_iy_i \end{pmatrix}$$
Then
$$\hat{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}y = \begin{pmatrix} \dfrac{\sum w_ix_i^2\sum w_iy_i - \sum w_ix_i\sum w_ix_iy_i}{\sum w_i\sum w_ix_i^2 - \left(\sum w_ix_i\right)^2} \\[2ex] \dfrac{\sum w_i\sum w_ix_iy_i - \sum w_ix_i\sum w_iy_i}{\sum w_i\sum w_ix_i^2 - \left(\sum w_ix_i\right)^2} \end{pmatrix}$$
We have seen that the GLS estimator is an unbiased linear estimator. We will now show that the GLS estimator is, in fact, the best linear unbiased estimator (BLUE) in the sense that the variance of each component of β̂, say β̂ᵢ, is smaller than (or equal to) the variance of any other unbiased linear estimator of βᵢ. In fact, suppose we wish to estimate a linear combination of the components of β, say t′β where t is a vector of known constants. In the one-sample problem we may want to estimate (1)(μ) = μ; in the two-sample problem we may want to estimate
$$\begin{pmatrix} 1 & -1 \end{pmatrix}\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} = \mu_1 - \mu_2;$$
while in the simple linear regression model our purpose may be to estimate
$$\begin{pmatrix} 1 & x \end{pmatrix}\begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \alpha + \beta x$$
for given x.

The Gauss-Markov theorem states that, of all linear combinations of y₁, …, yₙ, that is of all estimators of the form ℓ′y such that E(ℓ′y) = t′β, the estimator with the smallest variance is
$$t'\hat{\beta} = t'(X'V^{-1}X)^{-1}X'V^{-1}y,$$
that is, the one with
$$\ell' = t'(X'V^{-1}X)^{-1}X'V^{-1}.$$
This property holds whatever the form of V, thus also for the OLS estimator when V = Iₙ.
For any given vector t, the best unbiased linear estimator of t′β is t′β̂, where β̂ is the GLS estimator of β.

Proof
$$t'\hat{\beta} = t'(X'V^{-1}X)^{-1}X'V^{-1}y = c'y, \text{ say},$$
where
$$c' = t'(X'V^{-1}X)^{-1}X'V^{-1}. \tag{2.1}$$
Then
$$E(c'y) = c'X\beta = t'(X'V^{-1}X)^{-1}X'V^{-1}X\beta = t'\beta \qquad \text{(from 2.1)}$$
so that c′y is an unbiased estimator of t′β. Let ℓ′y be any other unbiased linear estimator of t′β, so that
$$E(\ell'y) = \ell'X\beta = t'\beta. \tag{2.2}$$
Write, without loss of generality,
$$\ell = c + d.$$
Then
$$\ell'y = c'y + d'y. \tag{2.3}$$
Taking expected values,
$$t'\beta + d'X\beta = t'\beta \qquad \text{(from 2.2)}$$
$$\therefore\ d'X\beta = 0.$$
This must hold for all β (we do not want an estimator which is unbiased only if β has certain values),
$$\therefore\ X'd = 0. \tag{2.4}$$
The variance of ℓ′y is
$$\mathrm{Var}(\ell'y) = \sigma^2\ell'V\ell = \sigma^2(c + d)'V(c + d) = \sigma^2(c'Vc + c'Vd + d'Vc + d'Vd).$$
Now
$$c'Vd = t'(X'V^{-1}X)^{-1}X'V^{-1}Vd = t'(X'V^{-1}X)^{-1}X'd = 0. \qquad \text{(from 2.4)}$$
Similarly d′Vc = 0.
$$\therefore\ \mathrm{Var}(\ell'y) = \sigma^2(c'Vc + d'Vd) = \mathrm{Var}(c'y) + \sigma^2d'Vd$$
$$\therefore\ \mathrm{Var}(\ell'y) \geq \mathrm{Var}(c'y)$$
since V is positive definite, so that d′Vd ≥ 0.
The theorem may also be proved by means of Lagrange multipliers. We want to choose ℓ so as to minimise
$$\mathrm{Var}(\ell'y) = \sigma^2\ell'V\ell$$
subject to the unbiasedness condition X′ℓ = t, that is, to minimise
$$\ell'V\ell - 2\theta'(X'\ell - t)$$
with respect to ℓ and the vector θ of Lagrange multipliers. The result is
$$\ell' = t'(X'V^{-1}X)^{-1}X'V^{-1}$$
as before.
The Gauss-Markov theorem may also be approached as follows. Let β̂* be any unbiased linear estimator of β and let β̂ be the GLS estimator. Then, similar to our proof of the theorem, it may be shown that the matrix
$$\mathrm{Cov}(\hat{\beta}^{*}, \hat{\beta}^{*\prime}) - \mathrm{Cov}(\hat{\beta}, \hat{\beta}')$$
is positive semidefinite.
In the remainder of this study guide we will assume throughout that the error term ε is multivariate normally distributed with mean 0 and covariance matrix σ²Iₙ. Thus the model is y = Xβ + ε where ε ∼ n(0; σ²Iₙ), and the least squares estimator is
$$\hat{\beta} = (X'X)^{-1}X'y.$$
We consider hypotheses of the form
$$H_0: K\beta = m$$
where K and m are assumed to be a matrix and a vector of known constants, and K has full row rank. The requirement that K must be of full row rank is not overly restrictive, since one may delete rows of K which are linear combinations of the remaining rows. For example, the two hypotheses β₁ − β₂ = 0 and β₁ − β₃ = 0 automatically imply that β₂ − β₃ = 0, the latter being redundant. We must of course be sure that the deleted hypotheses are consistent with the remaining ones (e.g. β₂ − β₃ = 10 would not be consistent with β₁ − β₂ = 0 and β₁ − β₃ = 0).
In the one-sample problem we may, for instance, test
$$H_0: \mu = C \qquad \text{or} \qquad K\beta = m.$$

In the two-sample problem, consider
$$H_0: \mu_1 - \mu_2 = 0; \quad \mu_1 = 10; \quad \mu_2 = 10,$$
that is
$$\begin{pmatrix} 1 & -1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 10 \\ 10 \end{pmatrix}.$$
Since the third row of the matrix on the left is a linear combination of the first two rows (notice that the second hypothesis minus the first yields the third), we may as well delete the third hypothesis to arrive at
$$H_0: \begin{pmatrix} 1 & -1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 10 \end{pmatrix}$$
or Kβ = m.
In the simple linear regression model, consider
$$H_0: \beta = 5; \quad \alpha + 3\beta = 10; \quad \alpha + 5\beta = 20.$$
The last row is a linear combination of the first two rows (twice the first hypothesis added to the second yields the third) and can be deleted. Thus we have
$$H_0: \begin{pmatrix} 0 & 1 \\ 1 & 3 \end{pmatrix}\begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \begin{pmatrix} 5 \\ 10 \end{pmatrix}$$
or Kβ = m.
To test
$$H_0: K\beta = m$$
it is natural to consider the statistic
$$K\hat{\beta} - m.$$
Its expected value is
$$KE(\hat{\beta}) - m = K\beta - m$$
(which is equal to zero if H₀ is true) and, by (D3), its covariance matrix is
$$\sigma^2K(X'X)^{-1}K'.$$
Thus, if H₀ is true,
$$K\hat{\beta} - m \sim n\left(0;\ \sigma^2K(X'X)^{-1}K'\right).$$
We will now study a number of quadratic forms which are relevant in the testing of linear hypothe-
ses:
q = y′y = y′Iₙy
q1 = y′(1(1′1)⁻¹1′)y
q2 = q − q1
q3 = β̂′(X′X)β̂
q4 = q − q3
In each case we will show that the quadratic form can be written in the form qi = y 0 Ai y where
Ai Ai = Ai , that is Ai is idempotent.
Thus qi /σ 2 is, in general, a noncentral chi-square variable with ri = r(Ai ) degrees of freedom and
noncentrality parameter (E(y))0 Ai (E(y))/σ 2 = β 0 X 0 Ai Xβ/σ 2 . We will also show that in a number
of cases Ai Aj = O, which proves that qi and qj are independent (cf (D11) and (D12)).
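These algebraic facts are easy to verify numerically. The following sketch uses an arbitrary illustrative design matrix (n = 4, p = 2) and the projection A3 = X(X′X)⁻¹X′ defined below; numpy is an assumption of the illustration:

```python
import numpy as np

# Illustrative design matrix, n = 4, p = 2.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
A3 = X @ np.linalg.inv(X.T @ X) @ X.T   # projection onto the column space of X

assert np.allclose(A3 @ A3, A3)         # A3 is idempotent
assert round(np.trace(A3)) == 2         # rank = trace = p for a projection matrix
A4 = np.eye(4) - A3
assert np.allclose(A3 @ A4, np.zeros((4, 4)))  # A3 A4 = O, so q3 and q4 are independent
print("checks passed")
```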
Example 2.3.
Solution 2.3.
q = y′y
  = [y1 y2 y3] [y1; y2; y3]
  = y1² + y2² + y3²
q1 = y′(1(1′1)⁻¹1′)y

where, using (M12),

1(1′1)⁻¹1′ = [1/3 1/3 1/3; 1/3 1/3 1/3; 1/3 1/3 1/3].

Now

q1 = y′(1(1′1)⁻¹1′)y
   = (1/3)[y1 y2 y3] [1 1 1; 1 1 1; 1 1 1] [y1; y2; y3]
   = (1/3)[y1+y2+y3  y1+y2+y3  y1+y2+y3] [y1; y2; y3]
   = (1/3)(y1² + y2y1 + y3y1 + y1y2 + y2² + y3y2 + y1y3 + y2y3 + y3²)
   = (1/3)(y1² + y2² + y3² + 2y1y2 + 2y1y3 + 2y2y3)
   = (1/3)(y1 + y2 + y3)²
   = (1/3)(Σᵢ₌₁³ yᵢ)²
   = (1/3)(3ȳ)²    since ȳ = (Σᵢ₌₁³ yᵢ)/3 ⇒ 3ȳ = Σᵢ₌₁³ yᵢ
   = 3ȳ²
q2 = q − q1                                        (2.5)
   = Σᵢ₌₁³ yᵢ² − nȳ²    (here n = 3)
   = Σᵢ₌₁³ (yᵢ − ȳ)²
   = (y1 − ȳ)² + (y2 − ȳ)² + (y3 − ȳ)²             (2.6)
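The identity q = q1 + q2 can be checked numerically. A minimal sketch, with an illustrative sample of three observations (numpy is assumed):

```python
import numpy as np

# Check q = q1 + q2 for an illustrative sample of n = 3 observations.
y = np.array([1.0, 2.0, 6.0])
n = len(y)
q = y @ y                          # y'y = 41
q1 = n * y.mean() ** 2             # n*ybar^2 = 27
q2 = ((y - y.mean()) ** 2).sum()   # sum of squared deviations = 14
print(q, q1, q2)
assert abs(q - (q1 + q2)) < 1e-12
```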
In general, for any n,

q1 = y′(1(1′1)⁻¹1′)y
   = (y1 ··· yn) [1/n ··· 1/n; ⋮ ⋱ ⋮; 1/n ··· 1/n] (y1; ⋮; yn)
   = nȳ².                                          (2.7)
This is an important quadratic form to remember. It is sometimes called the sum of squares due
to the mean. We may write
We conclude that q1/σ² is a noncentral chi-square variate with one degree of freedom and n.c.p.

λ1 = (1/n) β′X′1 1′Xβ/σ².                          (2.8)

It follows that λ1 = 0 if β = 0 or if X′1 = 0. The latter condition implies that the sum of the
elements in each row of X′ (that is, each column of X) is zero.
Example 2.4.
Assume X = [−1 1; −1 −1; 1 1; 1 −1]. Compute λ1.
Solution 2.4.
λ1 = β′X′1(1′1)⁻¹1′Xβ/σ²
   = (1/n) β′X′1 1′Xβ/σ²    since (1′1)⁻¹ = 1/n.

It follows that λ1 = 0 if β = 0 or if X′1 = 0. Now X′1 = 0 if the sum of the elements in each
row of X′ (that is, each column of X) is zero. In this case X′1 = 0 since the elements in each
column of X add up to zero, ⇒ λ1 = 0.
q2 = q − q1 = y′(In − A1)y = y′A2y, say. A2 is idempotent since

A2A2 = (In − A1)(In − A1)
     = In − A1 − A1 + A1A1
     = In − A1 = A2.                               (2.9)
that is, the first column of X consists of ones. This may be achieved in many of the well-known
linear model problems by a judicious choice of parameters (cf section 2.8).
β = (β1 0 · · · 0)0 .
Therefore Xβ = β1 1 and β′X′ = β1 1′.
= 0.
q2 = q − q1
   = Σ yᵢ² − nȳ²
   = Σ (yᵢ − ȳ)².                                  (2.11)
This is another important quadratic form to remember: q2 is called the total sum of squares (ad-
justed for the mean).
Independence of q1 and q2
Since q1 = y 0 A1 y and q2 = y 0 (In − A1 )y, and
A1 (In − A1 ) = A1 − A1 A1 = A1 − A1 = O
it follows that q1 and q2 are stochastically independent. Thus we have proved a well-known result:
that nȳ² and Σ(yᵢ − ȳ)² are stochastically independent.
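This independence argument rests on A1(In − A1) = O, which a few lines of code can confirm. A minimal numeric sketch for n = 3 (numpy assumed):

```python
import numpy as np

# A1 = 1(1'1)^{-1}1' and A2 = I - A1 for n = 3.
n = 3
A1 = np.ones((n, n)) / n
A2 = np.eye(n) - A1

assert np.allclose(A1 @ A1, A1)                 # A1 is idempotent
assert np.allclose(A1 @ A2, np.zeros((n, n)))   # A1(I - A1) = O, hence independence
print("A1(I - A1) = O confirmed")
```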
q3 = β̂′(X′X)β̂
   = β̂′X′y
   = y′X(X′X)⁻¹X′y
   = y′A3y.                                        (2.12)
From (M11) we know that A3 is idempotent with rank p. Thus q3 /σ 2 is a noncentral chi-square
variate with p degrees of freedom and noncentrality parameter
λ3 = β′X′A3Xβ/σ²
   = β′X′Xβ/σ².                                    (2.13)
Since X 0 X is positive definite, λ3 = 0 if and only if β = 0. The quadratic form q3 is usually termed
the regression sum of squares or the sum of squares due to the model.
q4 = q − q3
   = y′Iₙy − y′A3y
   = y′(In − A3)y
   = y′A4y.                                        (2.14)
From (M11) we know that A4 is idempotent and has rank n - p. Thus q4 /σ 2 is a noncentral
chi-square variate with n - p degrees of freedom and n.c.p.
λ4 = β′X′A4Xβ/σ²
   = (β′X′Xβ − β′X′Xβ)/σ²
   = 0.                                            (2.15)
Thus q4/σ² has a central chi-square distribution with n − p degrees of freedom, whatever the value
of β, provided the model is correct.
The residuals are y − ŷ = y − Xβ̂, and their sum of squares is

(y − Xβ̂)′(y − Xβ̂)
= y′y − y′Xβ̂ − β̂′X′y + β̂′X′Xβ̂
= y′y − β̂′X′Xβ̂    since y′Xβ̂ = β̂′X′y = β̂′X′Xβ̂
= q − q3 = q4.                                     (2.16)
Thus q4 is the sum of squares of the residuals. It is also termed the residual sum of squares or the
error sum of squares.
Since q4/σ² ∼ χ²ₙ₋ₚ, it follows that

E(q4/σ²) = n − p
∴ E(q4/(n − p)) = σ².

Therefore

σ̂² = q4/(n − p)

is an unbiased estimator of σ²; σ̂² is termed the error variance and σ̂ the standard error.
Independence of q3 and q4
q3 = y 0 A3 y
q4 = y 0 (In − A3 )y
and, using (D14) with Kβ̂ − m in the place of y, it follows that q5/σ² is a noncentral chi-square
variate with h degrees of freedom.
The quadratic form q5 is called the sum of squares due to the hypothesis.
Independence of q4 and q5
We want to prove that q4 and q5 are independent. In equations 2.14 and 2.17 we expressed q4 as
a quadratic form in y and q5 as a quadratic form in K β̂ − m. In order to prove independence, we
will have to express them both in terms of the same normal vector.
Then, by substituting β̂ = (X 0 X)−1 X 0 y and using the fact that X 0 (In − X(X 0 X)−1 X 0 )X = O, it
may be shown that
and
q5 = v 0 (B(B 0 B)−1 B 0 )v
where
B = X(X 0 X)−1 K 0 ,
Activity 2.1.
Prove these assertions as an exercise.
(a) Find a quadratic form qi such that qi/σ² has a central chi-square distribution if H0 is true and
a noncentral chi-square distribution if H0 is not true. Let qi/σ² have ri degrees of freedom.
(b) Find a second quadratic form qj which is independent of qi and such that qj/σ² has a central
chi-square distribution whether H0 is true or not. Let qj/σ² have rj degrees of freedom.
(c) Compute

f = (qi/ri) / (qj/rj)

which has a central Fri;rj distribution if H0 is true and a noncentral F-distribution if H0 is
not true. If f > Fα;ri;rj reject H0 at the 100α% level.
From table 2.1 we see that q3 and q4 are the appropriate quadratic forms. Thus
f = (β̂′X′Xβ̂ / p) / ((y − Xβ̂)′(y − Xβ̂) / (n − p))
  = (SSR / p) / (SSE / (n − p)) = SSR / (pσ̂²)
(b) H0 : Kβ = m
(c) H0 : c0 β = m
Partition (X′X)⁻¹, β̂ and β conformably as

(X′X)⁻¹ = [T11 T12; T12′ T22],    where T11 is h×h and T22 is (p − h)×(p − h),

β̂ = [β̂1; β̂2],    β = [β1; β2],    where β̂1 and β1 are h×1 and β̂2 and β2 are (p − h)×1.
q5 = β̂1′ T11⁻¹ β̂1

where q5/σ² ∼ χ²ₕ if β1 = 0.
Since this is a special case of q5 , we need not prove again that it is independent of q4 and we can
define the F-statistic directly.
f = (β̂1′ T11⁻¹ β̂1 / h) / ((y − Xβ̂)′(y − Xβ̂) / (n − p))

has an Fh;n−p distribution provided β1 = 0, that is

β1 = ... = βh = 0.
Obviously we can in this way test the null hypothesis that any subset of the βi s is zero, since the
ordering of the βi s is arbitrary.
Note: You will recall that the square of a td -variate has an F1;d distribution.
Example 2.5.
Let y = Xβ + ε, where ε ∼ n(0; σ²I4) and

X′ = [1 1 1 1; 1 1 0 0; −2 2 4 −4],    y′ = [4 4 0 −4].

(a) Show, with formulae and calculations, that β̂′ = [−2 6 2/5] and σ̂² = 1.6.
Solution 2.5.
y = Xβ + ε

(a) X′X = [1 1 1 1; 1 1 0 0; −2 2 4 −4][1 1 −2; 1 1 2; 1 0 4; 1 0 −4] = [4 2 0; 2 2 0; 0 0 40]

Row-reducing (X′X | I3):

[4 2 0 | 1 0 0]
[2 2 0 | 0 1 0]
[0 0 40 | 0 0 1]

(1/2)(R1 − R2):       [1 0 0 | 1/2 −1/2 0]
(1/2)R2 − new R1:     [0 1 0 | −1/2 1 0]
(1/40)R3:             [0 0 1 | 0 0 1/40]

Thus (X′X)⁻¹ = [1/2 −1/2 0; −1/2 1 0; 0 0 1/40]

n = 4 and p = 3
X′y = [1 1 1 1; 1 1 0 0; −2 2 4 −4][4; 4; 0; −4] = [4; 8; 16]

β̂ = (X′X)⁻¹X′y = [1/2 −1/2 0; −1/2 1 0; 0 0 1/40][4; 8; 16] = [−2; 6; 2/5]

Xβ̂ = [1 1 −2; 1 1 2; 1 0 4; 1 0 −4][−2; 6; 2/5] = [3.2; 4.8; −0.4; −3.6]
e′e = (y − Xβ̂)′(y − Xβ̂)
    = [4−3.2  4−4.8  0−(−0.4)  −4−(−3.6)][4−3.2; 4−4.8; 0−(−0.4); −4−(−3.6)]
    = [0.8 −0.8 0.4 −0.4][0.8; −0.8; 0.4; −0.4]
    = 1.6

∴ σ̂² = e′e/(n − p) = 1.6/(4 − 3) = 1.6
(b) H0: β1 − β3 = 0

K = [1 0 −1] and m = [0]

Test statistic: f = (SSH/h)/(SSE/(n − p)), which has an Fh;n−p distribution under H0; reject H0 if f > Fα;h;n−p.

Kβ̂ − m = [1 0 −1][−2; 6; 0.4] − 0 = −2.4

K(X′X)⁻¹K′ = [1 0 −1][1/2 −1/2 0; −1/2 1 0; 0 0 1/40][1; 0; −1]
           = [1/2 −1/2 −1/40][1; 0; −1]
           = 21/40 = 0.525

SSH = (Kβ̂ − m)′[K(X′X)⁻¹K′]⁻¹(Kβ̂ − m)
    = (−2.4)(40/21)(−2.4)
    = 10.9714

f = (SSH/h)/(SSE/(n − p)) = 10.9714/(1 × 1.6) ≈ 6.8571

The critical value is Fα;h;n−p = F0.05;1;1 = 161; reject H0 if f > 161.
Since 6.8571 < 161, we do not reject H0 at the 5% level of significance and
conclude that β1 − β3 = 0.
OR

Using the hypothesis in the form H0: c′β = m with c′ = [1 0 −1] and m = 0, the computations
are identical:

SSH = (−2.4)(40/21)(−2.4) = 10.9714

f = (SSH/h)/(SSE/(n − p)) = 10.9714/(1 × 1.6) ≈ 6.8571, with Fα;h;n−p = F0.05;1;1 = 161; reject H0 if f > 161.

Since 6.8571 < 161, we do not reject H0 at the 5% level of significance and conclude that
β1 − β3 = 0.
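The whole test can be scripted. A sketch reproducing the numbers of part (b) (numpy assumed; h = 1 here):

```python
import numpy as np

# Test of H0: beta1 - beta3 = 0 in Example 2.5 (b).
X = np.array([[1.0, 1.0, -2.0],
              [1.0, 1.0,  2.0],
              [1.0, 0.0,  4.0],
              [1.0, 0.0, -4.0]])
y = np.array([4.0, 4.0, 0.0, -4.0])
K = np.array([[1.0, 0.0, -1.0]])
m = np.array([0.0])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
d = K @ beta_hat - m                               # K*beta-hat - m = [-2.4]
SSH = d @ np.linalg.inv(K @ XtX_inv @ K.T) @ d     # about 10.9714
e = y - X @ beta_hat
s2 = (e @ e) / (4 - 3)
f = (SSH / 1) / s2
print(round(f, 4))   # 6.8571, far below F_{0.05;1;1} = 161, so H0 is not rejected
```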
Confidence regions
Consider the random variable K β̂ − Kβ where K is an h×p matrix of rank h. We know that
Therefore

f = (q5*/h) / (SSE/(n − p))

has an Fh;n−p distribution. Thus

1 − α = P(f ≤ Fα;h;n−p)
      = P(q5* ≤ (h·SSE/(n − p)) Fα;h;n−p)
      = P[(Kβ̂ − Kβ)′(K(X′X)⁻¹K′)⁻¹(Kβ̂ − Kβ) ≤ hσ̂²Fα;h;n−p].
The inequality inside the square brackets defines a joint confidence region for the h elements of
Kβ. A few examples of this follow.
The set of all values of β such that the inequality in square brackets is satisfied, forms a joint
confidence region for β1 , · · · , βp .
As before, let

β = [β1; β2],    β̂ = [β̂1; β̂2],    (X′X)⁻¹ = [T11 T12; T12′ T22],

where β1 and β̂1 are h×1, β2 and β̂2 are (p − h)×1, and T11 is h×h.
Figure 2.2:
∴ β̂i − √aii σ̂ t½α;n−p ≤ βi ≤ β̂i + √aii σ̂ t½α;n−p.

∴ c′β̂ − σ̂ √(c′(X′X)⁻¹c) t½α;n−p ≤ c′β ≤ c′β̂ + σ̂ √(c′(X′X)⁻¹c) t½α;n−p.
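As a numeric sketch of the interval β̂i ± √aii σ̂ t½α;n−p, the data of Example 2.5 can be reused (n − p = 1 there, so the tabulated value t0.025;1 = 12.706 applies; the numbers below are taken from that example):

```python
import math

# 95% confidence interval for beta_3 in Example 2.5:
# beta3-hat = 0.4, a33 = 1/40, sigma-hat^2 = 1.6, t_{0.025;1} = 12.706 (tabulated).
beta3_hat = 0.4
a33 = 1 / 40
s2 = 1.6
t = 12.706

half = math.sqrt(a33) * math.sqrt(s2) * t   # sqrt(1.6/40) = 0.2, so half-width = 2.5412
print(beta3_hat - half, beta3_hat + half)   # approximately (-2.1412 ; 2.9412)
```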
Multiple comparisons
Suppose we have rejected a null hypothesis of the form Kβ = m which really consists of h different
hypotheses. The result is that we no longer believe that all h of the relationships in the null
hypothesis are true, but we still do not know which of the h relationships are true and which are
not. Thus we have to pursue the analysis a bit further. (Of course, if H0 is not rejected, there is no
evidence that any of the h relationships do not hold and a further analysis would be redundant.)
We consider the case h = 2 first, because it is easier to draw a picture in two dimensions. The
joint confidence region for β1 and β2 is an ellipse, as was seen before. Suppose the ellipse is as
shown in figure 2.3:
Figure 2.3:
Thus we are 100(1 - α)% sure that the true β1 and β2 must both be inside the ellipse. Where
can β1 lie then? Between the extremes of the ellipse as viewed from the β1 -axis, of course, and
similarly for β2 (cf figure 2.4).
Figure 2.4:
In the above figure we would reject H0 : β1 = 0, β2 = 0 since the origin is outside the confidence
ellipse. We would also reject the subhypothesis H0 : β2 = 0 since zero falls outside the confidence
interval for β2 . However, we would not reject H0 : β1 = 0 since zero falls inside the confidence
interval for β1 .
Thus we have seen that a joint confidence region for two or more parameters implies a confidence
interval for each of the parameters. These confidence intervals are somewhat wider than those we
would have obtained if we had constructed a confidence interval for one parameter only. However,
if we want to make a number of probability statements based on the same set of data, it is more
reassuring if we could attach one probability to all our statements jointly. (For example, if we
make 20 separate probability statements with confidence 0.95 each, the probability that all our
statements are true may be as low as (0.95)20 ≈ 0.36.)
From the ellipse in figure 2.3 we may in principle obtain a confidence interval for any linear
combination c1β1 + c2β2 where c1 and c2 are constants. We must draw the line c1β1 + c2β2 = 0 on the
graph, and project the extremes of the ellipse onto the axis orthogonal to that line (cf figure 2.5).
Figure 2.5:
The problem of selecting a scale on that axis must of course be solved, but usually the whole
procedure of finding a confidence interval is done algebraically. The graphs were used merely to
illustrate the procedure.
The method which we will discuss is one of several methods available, and is known as Scheffé’s
method. Assume, as before, that K is an h × p matrix of rank h. Thus
is a quadratic form such that q5*/σ² is a chi-square variate with h degrees of freedom. Now let t
be an h×1 vector of known constants. Then t′Kβ is a scalar and

qt* = (t′Kβ̂ − t′Kβ)′(t′K(X′X)⁻¹K′t)⁻¹(t′Kβ̂ − t′Kβ)    (2.19)

is a quadratic form such that qt*/σ² has a chi-square distribution with one degree of freedom. The
method of Scheffé is based on the following inequality:
Theorem 2.7.6
We have seen before that a 100(1 - α)% confidence region for Kβ is defined by
This defines a confidence region for t0 Kβ with confidence at least 100(1 - α)%.
while H0 : t0 β = 0 is rejected if
Example 2.6.
Define the following:

X = [1 0 1; 1 0 −1; 0 2 0; 1 −1 0; −1 −1 0],    y = [2; 7; 4; 3; 8],    β = [α; γ; ϕ].
−1 −1 0 8
Solution 2.6.
X = [1 0 1; 1 0 −1; 0 2 0; 1 −1 0; −1 −1 0],    y = [2; 7; 4; 3; 8]    and    β = [α; γ; ϕ]
(a) X′X = [1 1 0 1 −1; 0 0 2 −1 −1; 1 −1 0 0 0][1 0 1; 1 0 −1; 0 2 0; 1 −1 0; −1 −1 0] = [4 0 0; 0 6 0; 0 0 2]

(X′X)⁻¹ = [1/4 0 0; 0 1/6 0; 0 0 1/2]    Note: inverse of a diagonal matrix
X′y = [1 1 0 1 −1; 0 0 2 −1 −1; 1 −1 0 0 0][2; 7; 4; 3; 8] = [4; −3; −5]

β̂ = (X′X)⁻¹X′y = [1/4 0 0; 0 1/6 0; 0 0 1/2][4; −3; −5] = [1; −1/2; −5/2]
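The diagonal form of X′X and the estimate can be reproduced numerically. A minimal sketch (numpy assumed):

```python
import numpy as np

# Reproducing the estimate of Example 2.6 (a).
X = np.array([[ 1.0,  0.0,  1.0],
              [ 1.0,  0.0, -1.0],
              [ 0.0,  2.0,  0.0],
              [ 1.0, -1.0,  0.0],
              [-1.0, -1.0,  0.0]])
y = np.array([2.0, 7.0, 4.0, 3.0, 8.0])

XtX = X.T @ X
print(np.diag(XtX))                        # X'X = diag(4, 6, 2)
beta_hat = np.linalg.solve(XtX, X.T @ y)
print(beta_hat)                            # approximately [1, -0.5, -2.5]
```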
(b) H0: α = γ − ϕ − 2, γ = −α − ϕ − 3, γ + ϕ = α − 4

that is H0: α − γ + ϕ = −2, α + γ + ϕ = −3, −α + γ + ϕ = −4

K = [1 −1 1; 1 1 1; −1 1 1] and m = [−2; −3; −4]

Test statistic: f = (SSH/h)/(SSE/(n − p)), which has an Fh;n−p distribution under H0.
Kβ̂ − m = [1 −1 1; 1 1 1; −1 1 1][1; −1/2; −5/2] − [−2; −3; −4]
        = [−1; −2; −4] − [−2; −3; −4]
        = [1; 1; 0]
K(X′X)⁻¹K′ = [1 −1 1; 1 1 1; −1 1 1][1/4 0 0; 0 1/6 0; 0 0 1/2][1 1 −1; −1 1 1; 1 1 1]
           = [1/4 −1/6 1/2; 1/4 1/6 1/2; −1/4 1/6 1/2][1 1 −1; −1 1 1; 1 1 1]
           = [11/12 7/12 1/12; 7/12 11/12 5/12; 1/12 5/12 11/12]
Write A = K(X′X)⁻¹K′. Then |A| = 1/3 and, computing the inverse by cofactors,

(A⁻¹)11 = (−1)² |11/12 5/12; 5/12 11/12| / (1/3) = 2
(A⁻¹)22 = (−1)⁴ |11/12 1/12; 1/12 11/12| / (1/3) = 5/2
(A⁻¹)33 = (−1)⁶ |11/12 7/12; 7/12 11/12| / (1/3) = 3/2
(A⁻¹)12 = (A⁻¹)21 = (−1)³ |7/12 5/12; 1/12 11/12| / (1/3) = −3/2
(A⁻¹)13 = (A⁻¹)31 = (−1)⁴ |7/12 11/12; 1/12 5/12| / (1/3) = 1/2
(A⁻¹)23 = (A⁻¹)32 = (−1)⁵ |11/12 7/12; 1/12 5/12| / (1/3) = −1

Thus, [K(X′X)⁻¹K′]⁻¹ = [2 −3/2 1/2; −3/2 5/2 −1; 1/2 −1 3/2]
q4 = (y − Xβ̂)′(y − Xβ̂)
   = [2−(−3/2)  7−7/2  4−(−1)  3−3/2  8−(−1/2)][2−(−3/2); 7−7/2; 4−(−1); 3−3/2; 8−(−1/2)]
   = [7/2 7/2 5 3/2 17/2][7/2; 7/2; 5; 3/2; 17/2]
   = 124.
Then,

SSH = (Kβ̂ − m)′[K(X′X)⁻¹K′]⁻¹(Kβ̂ − m)
    = [1 1 0][2 −3/2 1/2; −3/2 5/2 −1; 1/2 −1 3/2][1; 1; 0]
    = [1/2 1 −1/2][1; 1; 0]
    = 3/2

f = (SSH/h)/(SSE/(n − p)) = (1.5/3)/(124/(5 − 3)) = 0.5/62 ≈ 0.0081
A 95% confidence interval for γ is

γ̂ ± √a22 σ̂ t½α;n−p

With σ̂² = SSE/(n − p) = 124/(5 − 3) = 62, γ̂ = −1/2, a22 = 1/6 and t0.025;2 = 4.303, this gives

(−1/2 − √(1/6) √62 (4.303) ; −1/2 + √(1/6) √62 (4.303))
= (−1/2 − 13.8322 ; −1/2 + 13.8322)
= (−14.3322 ; 13.3322)
In many examples of linear models the first column of X is 1. The model is therefore

[y1; ⋮; yn] = [1 x12 ··· x1p; ⋮ ⋮ ⋮; 1 xn2 ··· xnp][β1; β2; ⋮; βp] + [ε1; ⋮; εn]

that is

yi = β1 + β2xi2 + ··· + βpxip + εi.
or y* = X*β(2) + ε*
where
This model may now be used to estimate β2 , · · · , βp while β̂1 is found by setting
Even though ε* is not an n(0, σ²In) vector, it can be shown that the solution of
β̂(2) = (X*′X*)⁻¹X*′y*
gives exactly the same estimators β̂2 , · · · , β̂p as the last (p − 1) elements of
β̂ = (X 0 X)−1 X 0 y,
while the first element of β̂ is exactly the same as β̂1 defined above. The advantages of reducing
the model in this way are:
(a) A smaller matrix has to be inverted (and we all know that it is a lot easier to invert a 3 × 3
matrix than a 4 × 4 matrix).
(b) Numerical accuracy is improved, especially when working on an electronic computer. The
elements of X*′X* are usually much smaller than the elements of X′X and the result is that
the round-off errors involved in the inversion of X*′X* are much smaller than in the inversion
of X′X.
The models of the one-sample problem and simple linear regression are examples of models
with a column of ones. The model of the two-sample problem can be reparameterised as follows. Let
µ = (µ1 + µ2)/2 and α = (µ1 − µ2)/2. Then µ1 = µ + α and µ2 = µ − α. Thus we have
[y1; ⋮; y_{n1}; y_{n1+1}; ⋮; y_{n1+n2}] = [1 1; ⋮ ⋮; 1 1; 1 −1; ⋮ ⋮; 1 −1][µ; α] + [ε1; ⋮; ε_{n1}; ε_{n1+1}; ⋮; ε_{n1+n2}].
y = Xβ + ε

that is

[y1; ⋮; yn] = β1[x11; ⋮; xn1] + β2[x12; ⋮; xn2] + ··· + βp[x1p; ⋮; xnp] + [ε1; ⋮; εn].
The first vector y consists of n observations of the random variable y. The constants β1 , β2 , · · · , βp
are unknown parameters. The first column of X, say x1 , consists of n values of the non-random
design variable x1 , et cetera. Thus we have p design variables x1 , · · · , xp in our model, each corre-
sponding to one of the p parameters β1 , · · · , βp .
(a) If there is a constant term in the model, the first column of X usually consists of ones only:
x1 = 1. (This is, strictly speaking, not a variable.)
(b) An important type of design variable is one that assumes the values 0 and 1 only. Such a
variable usually indicates the absence or presence of some characteristics or treatment, and
is usually called a dummy variable. Suppose, for example, one of the characteristics to be
included in a model is the marital status of a person. Suppose that there are four categories:
Single, Married, Divorced and Widowed. If one does not think about it carefully, one may
be tempted to define the variable x = 1 for single persons; = 2 for married persons; = 3 for
divorced persons and = 4 for widowed persons. However, in a model of the type
y = β1 + β2 x
this would mean that the effect on y of being divorced is three times the effect of being single,
et cetera. We should rather define the following dummy variables (we assume that x1 = 1,
that is β1 represents the constant term):
x2 = 1 for single persons; = 0 for others
x3 = 1 for married persons; = 0 for others
x4 = 1 for divorced persons; = 0 for others
x5 = 1 for widowed persons; = 0 for others.

If, however, dummy variables were defined for all four categories, we would have

x1 = x2 + x3 + x4 + x5,
that is the first column of X would be equal to the sum of the next four columns and the
rank of X would be less than the number of columns. Thus if a categorical variable has k
categories, one would usually need to define (k − 1) dummy variables.
Dummy variables could also be defined to assume the values -1 and 1 rather than 0 and 1.
The interpretation of the corresponding coefficient βi is somewhat different then.
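The (k − 1)-dummy-variable coding can be sketched in code. In the snippet below the category names and the choice of "Single" as reference category are illustrative assumptions (numpy assumed):

```python
import numpy as np

# Build k - 1 = 3 dummy columns for a four-category variable, taking "Single"
# as the (arbitrary) reference category.
statuses = ["Single", "Married", "Divorced", "Widowed", "Married"]
levels = ["Married", "Divorced", "Widowed"]   # "Single" is the reference

D = np.array([[1.0 if s == lev else 0.0 for lev in levels] for s in statuses])
print(D)
# A "Single" row is all zeros; adding a fourth dummy for "Single" would make the
# columns, together with the constant column of ones, linearly dependent.
```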
(c) The third type of design variable is a continuous variable. This type of variable is assumed
to be non-random in the models considered in this course. We could transform any such
variable and either replace a column by the transformed values or add additional columns.
Suppose, for example, our model is
yi = β0 + β1ui + β2ui² + β3√ui + εi,

in which case the design matrix is

[1 u1 u1² √u1; ⋮ ⋮ ⋮ ⋮; 1 un un² √un].
Three of the most important special cases of the linear model, which will be dealt with in the next
three chapters, are:
(a) Analysis of variance (ANOVA): Except for the first column of ones, all the other variables
are dummy variables. The problem is to test whether certain categorical variables (called
treatments) have an effect on the response variable y.
(b) Regression analysis: The first column of X is usually (but not always) 1 and all the other
variables are continuous variables. (If the first column of X is not 1 we speak of regression
through the origin.) The purpose of the analysis may be to test whether certain variables
have an effect on the response, or the purpose may be to use the estimated equation
ŷ = β̂1 x1 + . . . + β̂p xp
to predict the response y which we could expect to obtain if selected values of x1 , · · · , xp are
used in a further experiment.
(c) Analysis of covariance: The first column of X may or may not be 1, but some of the
remaining variables are dummy variables and some are continuous. The purpose of the
analysis could be partly predictive and partly to test whether the regression equation (of y
on the continuous variables) differs from groups as defined by the dummy variables.
Example 2.7.
In an experiment to compare the results of roadrunners in different circumstances, the following
sample was used:
M 3 A 60 y1
F 8 B 56 y2
M 7 C 70 y3
F 4 A 68 y4
F 11 C 77 y5
M 2 B 65 y6
Construct a design matrix for this experiment and explain your answer.
Solution 2.7.
xi4 = 1 for programme A; = 0 otherwise
xi5 = 1 for programme B; = 0 otherwise
e = y − ŷ = y − X β̂,
which is an “estimator” of , the vector of random errors. If our model is correct these residuals
should contain no information about the response vector y. The error term is included in the
model to account for the unexplained variation, and if e contains information about y it means
that our model is inadequate. The residuals may be examined and analysed in many different
ways, but we will speak briefly about graphical methods only.
e = y − Xβ̂
  = y − X(X′X)⁻¹X′y
  = (In − X(X′X)⁻¹X′)y

so that Cov(e, e′) = σ²(In − X(X′X)⁻¹X′). This covariance matrix is not in general a diagonal matrix and the diagonal elements are not
necessarily equal. Thus the residuals are, in general, not independent and do not have the same
variance. In fact, the covariance matrix of e is singular, so that e has a singular multivariate normal
distribution. (The dependence does become smaller as n − p increases, however.)
Nevertheless, for an informal examination of the residuals we treat them as if they were independent
and had the same variance. The following graphical analyses are usually made:
If the sample size n is large, we can construct a histogram to decide whether the error terms
may reasonably be regarded as normal variates (cf figures 2.6, 2.7).
We can also use normal probability paper, plotting e(i) against 100i/(n + 1) (where e(i) is the
i-th smallest residual). If the points on the graph are reasonably spread around a straight
line, we may feel safe about the assumption of normality. We should be on the lookout
especially for systematic deviations from a straight line. An example of such graph paper is
presented in figure 2.8.
Note that the vertical axis is marked in percentages, that is we must plot 100i/(n + 1) on
the vertical axis.
We plot the residuals against each of the design variables (except of course the column 1).
If e is unrelated to xi we may feel reasonably secure about our model as far as the design
variable xi is concerned.
The graph should have the appearance of figure 2.9, where the points are spread randomly
around 0. If the points follow a pattern, such as in figure 2.10, we may also have to include
additional terms such as x2i , x3i , xki , log xi , et cetera to try to improve the model. If the
points are spread around zero, but with a pattern in the variability such as in figure 2.11, it
indicates that the variances of all the εi's may not be the same. In fact, the variance may
be a function of xi. For the data illustrated in figure 2.11 it may well be that the standard
deviation of εi is a multiple of xi; thus
Var(yi ) = σ 2 x2i , i = 1, · · · , n
Figure 2.9:
Figure 2.10:
Figure 2.11:
Any patterns in this graph may indicate that a transformation of the response variable may
be needed (cf section 2.11).
If the time sequence of the observations is available, it is important to plot the residuals
against time. In this way we may decide whether the residuals are autocorrelated or not,
that is whether each residual depends in a specific way on the previous one. If autocorrelation
is present, the least squares estimators are inappropriate and econometric methods must be
employed to estimate β.
One operator may perform better than another, a factory may operate more efficiently on
certain days of the week, et cetera, and this may show up in an appropriate residual plot.
Before leaving the residuals, we discuss one more result which one should know about. Suppose
there is a constant term in the model, that is the first column of X is 1. We know that

X′e = X′(In − X(X′X)⁻¹X′)y
    = (X′ − X′X(X′X)⁻¹X′)y
    = (X′ − X′)y
    = 0
∴ [1 ··· 1; x12 ··· xn2; ⋮ ⋮; x1p ··· xnp][e1; ⋮; en] = [0; ⋮; 0]
This shows that there are p linear relationships among the residuals. The first of these p equations
is
e1 + . . . + en = 0
which proves that the sum of the residuals is zero (provided there is a constant term in the model).
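This property is easy to confirm numerically. A minimal sketch with illustrative data (numpy assumed; the data values are arbitrary):

```python
import numpy as np

# With a constant term (first column of X is 1), the residuals satisfy X'e = 0,
# and in particular they sum to zero. Illustrative data.
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0],
              [1.0, 8.0]])
y = np.array([1.0, 4.0, 2.0, 9.0])

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
e = y - H @ y                          # residual vector
print(X.T @ e)                         # numerically zero: p linear restrictions on e
assert abs(e.sum()) < 1e-10
```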
The following are some of the basic assumptions of the linear model:
(ii) Homogeneity of variances: We assume that all elements of y have the same variance (or,
at least, that the variances are known up to a common constant factor).
(iii) Additivity: We assume that the effects of the design variables add up. For example, if
y = β1 x1 + β2 x2 + . . . + βp xp +
then, if x1 is increased by 1 unit and x2 is also increased by one unit, the result will be that
y is increased by β1 + β2 units.
Certain types of data are known not to satisfy all of these assumptions, and sometimes it is possible
to transform the data (for example replacing yi by log(yi )) in order to ensure that the assumptions
are satisfied, at least approximately. Consider the following illustration.
• Illustration 2.11.1
Suppose two types of insect repellent are tested on tomatoes and potatoes, and each treatment
is applied to five tomato and five potato plants. The response variable is the number of moths
visiting the plants during the hour after spraying. Suppose the results are as follows:
It seems that the variance is a function of the mean. The variance is plotted against the mean in
figure 2.12 and a linear relationship seems likely.
Figure 2.12:
The appropriate transformation in this situation is to replace yi by √(yi + 3/8) (see following
discussion) and, if we do this, we obtain the following:
If we now plot the variance against the mean (cf figure 2.13) we find that there is not a marked
relationship. We say that we have stabilised the variance.
Note that this kind of graphical analysis is possible only if a number of sets of repetitions at iden-
tical conditions are available (that is if X contains a number of sets of identical rows).
Figure 2.13:
We will now discuss a number of standard transformations, their use and their effect.
Suppose each yi was obtained as follows: a number of trials, say ni trials, each resulted in
either a success or a failure; let ui be the number of successes, that is ui is a binomial variable
with parameters ni and probability of success equal to πi , say.
Let yi = ui/ni = (number of successes)/(number of trials) = proportion of successes.
E(ui ) = ni πi ; Var(ui ) = ni πi (1 − πi )
Thus we see that the variance of yi is a function of the expected value of yi (and of ni ). The
famous statistician, Sir Ronald Fisher, discovered that the random variable
zi = 2 arcsin √yi (= 2 sin⁻¹ √yi)

has variance approximately equal to 1/ni (which does not depend on πi). Remember that
the angles are in radians in this definition. It has therefore become standard practice to
apply the arcsin transformation to proportions before analysing the linear model. Note that
Var(zi ) is still a function of ni , the number of trials involved. If the number of trials is not
the same throughout the experiment, one will have to apply weighted least squares. Special
problems occur when y = 0 or 1, and some tables exist which cater for this situation. If you
are unfamiliar with the arcsine (sin−1 ) function consult the textbook:
With regard to STA3701 it is sufficient to be able to use the sin−1 function which is found
on most pocket calculators.
Note that: 1 radian = 180/π degrees and 1 degree = π/180 radians,
but it will not be used in this module.
Data which consist of counts (like the number of ticks on a sheep) or index numbers often
have a tendency for the standard deviation to be proportional to the mean. In such cases
the recommended variance stabilising transformation is z = log10 y or z = loge y. Since
special problems occur when the counts are small numbers, the transformation is changed to
z = log(1 + y) when the counts are low.
The effect of this transformation is also to change a moderately skew distribution into a fairly
symmetric one which resembles the normal distribution more closely.
This information is also used when changes in the design variables cause proportional changes
in the response. In such cases a likely model is
y = αxβ
Suppose y is a Poisson variable, which is usually associated with the number of arrivals (for
example cars, telephone calls, customers, et cetera) in a fixed period of time.
We know that
E(y) = Var(y).
Thus the mean is equal to the variance. In such cases where the variance is proportional to
the mean, the appropriate transformation is
√
z= y
This transformation has the effect of changing quite skew distributions into fairly symmetric
ones.
If the standard deviation of y increases even more rapidly than E(y), namely proportionally
to (E(y))², the transformation

z = 1/y
is often recommended for stabilising the variance. This transformation has the effect of
changing a very skew distribution into a fairly symmetric one.
2.12 Illustration
Let X′ = [1 1 1 1 1 1 1 1 1; 1 1 1 0 0 0 0 0 0; 0 0 0 1 1 1 0 0 0; −1 0 1 −1 0 1 −1 0 1]
y 0 = (19 39 50 25 24 32 23 28 39).
Such data could for example be the result of an experiment to test the effect of fertilizers on the
yield of a hectare of maize. Column 1 of X (that is row 1 of X′) provides for a constant term (we
do not expect zero yield if no fertilizer is used). Column 2 of X denotes District A and column
three District B (thus there are three districts involved in the experiment). Column 4 denotes the
fertilizer dosage. Suppose three dosages are used, namely 0 tons/ha, 10 tons/ha and 20 tons/ha.
Then define x4 = (dosage - 10)/10. Furthermore, y is the yield in bags/ha.
We compute

X′X = [1 1 1 1 1 1 1 1 1; 1 1 1 0 0 0 0 0 0; 0 0 0 1 1 1 0 0 0; −1 0 1 −1 0 1 −1 0 1] × [1 1 0 −1; 1 1 0 0; 1 1 0 1; 1 0 1 −1; 1 0 1 0; 1 0 1 1; 1 0 0 −1; 1 0 0 0; 1 0 0 1]

    = [9 3 3 0; 3 3 0 0; 3 0 3 0; 0 0 0 6].
2. THE LINEAR MODEL 96
Row-reducing (X′X | I4):

[9 3 3 0 | 1 0 0 0]
[3 3 0 0 | 0 1 0 0]
[3 0 3 0 | 0 0 1 0]
[0 0 0 6 | 0 0 0 1]

(1/3)(R1 − R2 − R3): [1 0 0 0 | 1/3 −1/3 −1/3 0]
(1/3)R2:             [1 1 0 0 | 0 1/3 0 0]
(1/3)R3:             [1 0 1 0 | 0 0 1/3 0]
(1/6)R4:             [0 0 0 1 | 0 0 0 1/6]

then

(1/3)R2 − new R1:    [0 1 0 0 | −1/3 2/3 1/3 0]
(1/3)R3 − new R1:    [0 0 1 0 | −1/3 1/3 2/3 0]

Thus, (X′X)⁻¹ = [1/3 −1/3 −1/3 0; −1/3 2/3 1/3 0; −1/3 1/3 2/3 0; 0 0 0 1/6]
X′y = [1 1 1 1 1 1 1 1 1; 1 1 1 0 0 0 0 0 0; 0 0 0 1 1 1 0 0 0; −1 0 1 −1 0 1 −1 0 1][19; 39; 50; 25; 24; 32; 23; 28; 39] = [279; 108; 81; 54]

β̂ = (X′X)⁻¹X′y = [1/3 −1/3 −1/3 0; −1/3 2/3 1/3 0; −1/3 1/3 2/3 0; 0 0 0 1/6][279; 108; 81; 54] = [30; 6; −3; 9]
We have ŷ = Xβ̂, so that

ŷ′ = (27 36 45 18 27 36 21 30 39)
e′ = y′ − ŷ′ = (−8 3 5 7 −3 −4 2 −2 0)
e′1 = 0, as is to be expected
e′e = 180
σ̂² = e′e/(n − p) = 180/(9 − 4) = 36
σ̂ = 6.
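The fit of this illustration can be reproduced in a few lines. A minimal sketch (numpy assumed):

```python
import numpy as np

# Reproducing the fertilizer illustration of section 2.12.
X = np.array([[1, 1, 0, -1], [1, 1, 0, 0], [1, 1, 0, 1],
              [1, 0, 1, -1], [1, 0, 1, 0], [1, 0, 1, 1],
              [1, 0, 0, -1], [1, 0, 0, 0], [1, 0, 0, 1]], dtype=float)
y = np.array([19, 39, 50, 25, 24, 32, 23, 28, 39], dtype=float)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
s2 = (e @ e) / (9 - 4)

print(beta_hat)    # approximately [30, 6, -3, 9]
print(e @ e, s2)   # approximately 180 and 36
```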
A 95% confidence interval for β4 is 9 ± √(1/6) (6) t0.025;5 = 9 ± (2.4495)(2.571), that is

(2.7024 ; 15.2976).

For each additional ten tons of fertilizer a yield increase of between 2.7 and 15.3 bags/ha can be
expected. Normally such an experiment would of course be conducted on a larger scale and a much
narrower confidence interval would be obtained.
A joint confidence region for β2 and β3 is given by

[β̂2 − β2  β̂3 − β3] T⁻¹ [β̂2 − β2; β̂3 − β3] ≤ σ̂² 2Fα;2;5

∴ [6 − β2  −3 − β3][2 −1; −1 2][6 − β2; −3 − β3] ≤ (36)(2)(5.79)

with α = 0.05.
This represents an ellipse. In order to sketch the ellipse, choose various values of β2 and solve for
β3 from the quadratic equation each time. For example, if β2 = 6 we have
−14.44 ≤ β3 + 3 ≤ 14.44
−17.44 ≤ β3 ≤ 11.44.
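This slice of the ellipse can be checked with one line of arithmetic. A minimal sketch:

```python
import math

# At beta2 = 6 the ellipse inequality reduces to 2(beta3 + 3)^2 <= (36)(2)(5.79) = 416.88.
bound = math.sqrt(416.88 / 2)
print(round(bound, 2))   # 14.44, giving -17.44 <= beta3 <= 11.44
```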
H0 : β2 − β3 = 0; β1 + β2 − β4 = 25
(that is the expected yield in Districts A and B are identical and the expected yield in District A
with no fertilizer is 25 bags/ha).
∴ Kβ = m.
SSH = 561/4

∴ f = SSH/(hσ̂²) = (561/4)/(2 × 36) = 1.9479
The 5% critical value is F0,05;2;5 = 5.79. Reject H0 if f > 5.79.
Since 1.9479 < 5.79, H0 is not rejected at the 5% level of significance.
Example 2.8.
An experiment was performed to determine whether three factors have an effect on the taste of
sosaties: the type of lamb (lean or fat); the time the meat is kept in the marinade (1, 2 or 3 days); and
the temperature of the oven (190°, 205° and 220°C). Ten dishes of sosaties were prepared according
to different recipes and each presented to a different panel of gourmets. The panel awarded a
coefficient of tastiness (y) to each. The results were as follows:
(a) Fit a linear model with constant term; let x₂ = 1 for fat meat and x₂ = 0 for lean;
x₃ = days − 2; x₄ = (temp − 205)/15.
Solution 2.8.
\[
x_{i3} = \text{Days} - 2, \qquad x_{i4} = \frac{\text{Temp} - 205}{15}
\]
\[
X = \begin{bmatrix}
1 & 1 & 1 & 0 \\
1 & 1 & 0 & 1 \\
1 & 0 & 0 & 1 \\
1 & 1 & 1 & -1 \\
1 & 0 & 0 & -1 \\
1 & 1 & -1 & -1 \\
1 & 1 & -1 & 1 \\
1 & 0 & 1 & 1 \\
1 & 0 & 0 & -1 \\
1 & 0 & -1 & 0
\end{bmatrix}
\quad\text{and}\quad
y' = \begin{pmatrix} 16 & 19 & 20 & 43 & 27 & 25 & 22 & 15 & 29 & 9 \end{pmatrix}
\]

\[
X'X = \begin{bmatrix}
10 & 5 & 0 & 0 \\
5 & 5 & 0 & 0 \\
0 & 0 & 6 & 0 \\
0 & 0 & 0 & 8
\end{bmatrix}
\]
Row reduction of (X'X | I) (divide to obtain leading ones: \tfrac15(R_1 - R_2), \tfrac15 R_2, \tfrac16 R_3, \tfrac18 R_4, and then R_2 - R_1) gives

\[
(X'X)^{-1} = \begin{bmatrix}
\tfrac15 & -\tfrac15 & 0 & 0 \\
-\tfrac15 & \tfrac25 & 0 & 0 \\
0 & 0 & \tfrac16 & 0 \\
0 & 0 & 0 & \tfrac18
\end{bmatrix}
\]
\[
X'y = \begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 0 & 1 & 0 & 1 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 & -1 & -1 & 1 & 0 & -1 \\
0 & 1 & 1 & -1 & -1 & -1 & 1 & 1 & -1 & 0
\end{bmatrix}
\begin{bmatrix} 16 \\ 19 \\ 20 \\ 43 \\ 27 \\ 25 \\ 22 \\ 15 \\ 29 \\ 9 \end{bmatrix}
= \begin{bmatrix} 225 \\ 125 \\ 18 \\ -48 \end{bmatrix}
\]
\[
\hat\beta = (X'X)^{-1}X'y
= \begin{bmatrix}
\tfrac15 & -\tfrac15 & 0 & 0 \\
-\tfrac15 & \tfrac25 & 0 & 0 \\
0 & 0 & \tfrac16 & 0 \\
0 & 0 & 0 & \tfrac18
\end{bmatrix}
\begin{bmatrix} 225 \\ 125 \\ 18 \\ -48 \end{bmatrix}
= \begin{bmatrix} 20 \\ 5 \\ 3 \\ -6 \end{bmatrix}
\]
Then

ŷ' = (Xβ̂)' = (28 19 14 34 26 28 16 17 26 17).

We have

e' = y' − ŷ' = (−12 0 6 9 1 −3 6 −2 3 −8)

Σᵢ eᵢ = 0, as is to be expected

e'e = 384

σ̂² = e'e/(n − p) = 384/(10 − 4) = 64

σ̂ = 8.
\[
\hat\beta_i \pm t_{\frac{\alpha}{2};n-p}\sqrt{a_{ii}}\,\hat\sigma.
\]
For β₃:
\[
\hat\beta_3 \pm t_{\frac{\alpha}{2};n-p}\sqrt{a_{33}}\,\hat\sigma
= 3 \pm 1.943 \times \sqrt{\tfrac16} \times 8
= 3 \pm 6.3458,
\]
that is (−3.3458 ; 9.3458).
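A quick numerical check of β̂ and of the half-width of the interval for β₃ (the t value 1.943 is taken from the text):

```python
from fractions import Fraction as F
import math

# (X'X)^{-1} and X'y as computed above for the sosatie example.
A = [[F(1, 5), F(-1, 5), 0, 0],
     [F(-1, 5), F(2, 5), 0, 0],
     [0, 0, F(1, 6), 0],
     [0, 0, 0, F(1, 8)]]
Xty = [225, 125, 18, -48]

beta = [sum(a * v for a, v in zip(row, Xty)) for row in A]
print([int(b) for b in beta])        # → [20, 5, 3, -6]

# Half-width of beta3_hat +/- t * sqrt(a33) * sigma_hat,
# with t = 1.943 and sigma_hat = 8 as in the text.
half_width = 1.943 * math.sqrt(1 / 6) * 8
print(round(half_width, 4))          # → 6.3458
```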
\[
T^{-1} = \begin{bmatrix} \tfrac16 & 0 \\ 0 & \tfrac18 \end{bmatrix}^{-1}
= \begin{bmatrix} 6 & 0 \\ 0 & 8 \end{bmatrix}
\]

Now

\[
\begin{pmatrix} \hat\beta_3-\beta_3 & \hat\beta_4-\beta_4 \end{pmatrix} T^{-1}
\begin{pmatrix} \hat\beta_3-\beta_3 \\ \hat\beta_4-\beta_4 \end{pmatrix}
\le \hat\sigma^2\, 2F_{\alpha;2;n-p}
\]
\[
\begin{pmatrix} 3-\beta_3 & -6-\beta_4 \end{pmatrix}
\begin{bmatrix} 6 & 0 \\ 0 & 8 \end{bmatrix}
\begin{pmatrix} 3-\beta_3 \\ -6-\beta_4 \end{pmatrix}
\le 64 \times 2 \times 5.14
\]
\[
6(3-\beta_3)^2 + 8(-6-\beta_4)^2 \le 657.92
\]
(d) H₀: β₂ = 0; β₃ = 9; β₁ + β₂ + β₃ − β₄ = 50, that is

\[
H_0:\ \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 1 & 1 & -1 \end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{bmatrix}
= \begin{bmatrix} 0 \\ 9 \\ 50 \end{bmatrix}
\]

∴ Kβ = m.
\[
K\hat\beta - m
= \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 1 & 1 & -1 \end{bmatrix}
\begin{bmatrix} 20 \\ 5 \\ 3 \\ -6 \end{bmatrix}
- \begin{bmatrix} 0 \\ 9 \\ 50 \end{bmatrix}
= \begin{bmatrix} 5 \\ 3 \\ 34 \end{bmatrix} - \begin{bmatrix} 0 \\ 9 \\ 50 \end{bmatrix}
= \begin{bmatrix} 5 \\ -6 \\ -16 \end{bmatrix}
\]
\[
K(X'X)^{-1}K'
= \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 1 & 1 & -1 \end{bmatrix}
\begin{bmatrix}
\tfrac15 & -\tfrac15 & 0 & 0 \\
-\tfrac15 & \tfrac25 & 0 & 0 \\
0 & 0 & \tfrac16 & 0 \\
0 & 0 & 0 & \tfrac18
\end{bmatrix}
\begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & -1 \end{bmatrix}
= \begin{bmatrix}
\tfrac25 & 0 & \tfrac15 \\
0 & \tfrac16 & \tfrac16 \\
\tfrac15 & \tfrac16 & \tfrac{59}{120}
\end{bmatrix}
\]
Inverting by row reduction (as before) gives

\[
(K(X'X)^{-1}K')^{-1} = \begin{bmatrix}
\tfrac{65}{18} & \tfrac{20}{9} & -\tfrac{20}{9} \\
\tfrac{20}{9} & \tfrac{94}{9} & -\tfrac{40}{9} \\
-\tfrac{20}{9} & -\tfrac{40}{9} & \tfrac{40}{9}
\end{bmatrix}.
\]
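The inversion above, and the resulting test statistic, can be checked as follows (a sketch; the 5% critical value F must still be read from an F table):

```python
from fractions import Fraction as F

def inverse(A):
    """Gauss-Jordan inversion with exact rational arithmetic."""
    n = len(A)
    M = [[F(x) for x in row] + [F(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for i in range(n):
        p = next(r for r in range(i, n) if M[r][i] != 0)  # pivot row
        M[i], M[p] = M[p], M[i]
        M[i] = [x / M[i][i] for x in M[i]]
        for r in range(n):
            if r != i and M[r][i] != 0:
                M[r] = [x - M[r][i] * y for x, y in zip(M[r], M[i])]
    return [row[n:] for row in M]

# K (X'X)^{-1} K' from the text, and K beta_hat - m = (5, -6, -16)'.
T = [[F(2, 5), 0, F(1, 5)],
     [0, F(1, 6), F(1, 6)],
     [F(1, 5), F(1, 6), F(59, 120)]]
d = [5, -6, -16]

Tinv = inverse(T)
print(Tinv[0][0], Tinv[1][1], Tinv[2][2])   # diagonal: 65/18 94/9 40/9

SSH = sum(d[i] * Tinv[i][j] * d[j] for i in range(3) for j in range(3))
f = float(SSH / (3 * 64))   # h = 3 constraints, sigma_hat^2 = 64
print(round(f, 4))          # ≈ 5.0674
```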
Now

SSH = (Kβ̂ − m)'(K(X'X)^{-1}K')^{-1}(Kβ̂ − m) = 17513/18 = 972.94

∴ f = SSH/(hσ̂²) = 972.94/(3 × 64) = 5.0674.

The 5% critical value is F_{0.05;3;6} = 4.76. Since 5.0674 > 4.76, H₀ is rejected at the 5% level of significance.
Exercise 2.1

1. Let
\[
X' = \begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
-1 & -1 & -1 & 0 & 0 & 0 & 1 & 1 & 1 \\
-1 & 0 & 1 & -1 & 0 & 1 & -1 & 0 & 1
\end{bmatrix}
\]
and y' = (18 20 29 17 20 27 11 28 20).
(d) Construct a 95% joint confidence region for β2 and β3 . Would you conclude that β2 =
β3 = 0?
(f) Plot
(i) e against ŷ
(ii) e against y
2. In an experiment to investigate the mass increase of young oxen under specific feeding con-
ditions, the following sample of size 12 was used:

Breed        Age   Feed   Mass
Sussex        6     A     175    y₁
Beefmaster   12     B     255    y₂
Simmentaler  18     A     350    y₃
Brahman       6     B     190    y₄
Sussex       12     A     230    y₅
Beefmaster   18     B     345    y₆
Simmentaler   6     A     200    y₇
Brahman      12     B     225    y₈
Sussex       18     A     320    y₉

Age is given in months, mass in kilogram; feed type A is pasture (grazing) and feed type B
is pasture and feeding paddock.
(i) state which analysis you would apply, namely analysis of variance, regression analysis
or analysis of covariance (substantiate your answers);
\[
\text{(a) } X = \begin{bmatrix} 1 & 1 & -2 \\ 1 & 1 & 2 \\ 1 & 0 & 4 \\ 1 & 0 & -4 \end{bmatrix}
\qquad
\text{(b) } X = \begin{bmatrix} 1 & 0 & 15 \\ 1 & 0 & 10 \\ 0 & 1 & 16 \\ 0 & 1 & 32 \end{bmatrix}
\qquad
\text{(c) } X = \begin{bmatrix} 1 & 10 \\ 1 & 12 \\ 1 & 14 \\ 1 & 16 \end{bmatrix}
\]
(a) State whether you would not reject or reject each of the following hypotheses and say
why:
(i) β1 = 0, β2 = 0
(ii) β1 = 0
(iii) β2 = 0
β4 = 0
β2 − β3 = 0
β2 − β3 + β4 = 0
(b) Test H0 : β1 = 5.
6. In an experiment, the mass change of newborn babies, fed on four different milk formulas
was recorded (after a set period of time) as follows:
A B C D
142 76
50
(b) Test, at the 5% level of significance, whether there is a difference between the four milk
formulas.
NB: Write the hypothesis in the form H₀: Kβ = m and use f = SSH/(hσ̂²).

(c) Compute a 95% confidence interval for the difference in means between formulas B and
C, that is µ_B − µ_C.
ANALYSIS OF VARIANCE
3.1 Introduction
A typical experiment, the results of which can be analysed by means of the technique known as
analysis of variance, is the following: a number of experimental units (people, cattle, potato lands,
trees, et cetera) are subjected to a number of treatments (diets, medicines, fertilizer applications,
et cetera), each at a number of levels (the three levels of the treatment diet could be two, three
and four meals per day; the four levels of fertilizer could be no fertilizer, 1 ton/ha, 2 tons/ha and 4
tons/ha; et cetera). Sometimes there are other factors present, called blocking factors, which divide
the experimental units into groups which are more uniform (before the experiment starts) than
all the units jointly. Such a factor has two or more levels (for example boys and girls; Leghorns,
Black Australorps and Rhode Island Reds; married, divorced, widowed and single).
After one or more treatments have been applied to such groups of experimental units, the result
is observed. This result is called the yield (for example body weight of experimental person, the
yield of potatoes in kg/ha, the egg production of hens). The purpose of the analysis of variance
is to test whether the treatment had an effect on the yield, and sometimes whether the groups
defined by the blocking factors differ with respect to yield.
One of the problems which faces a statistician who acts as consultant to a research worker, is the
fact that the experiment is often completed before he or she is consulted, with the result that the
115
experiment is usually poorly planned. The statistician’s client does not always tell the statistician
about all the important factors, because the client possibly does not know that they are impor-
tant. It is the duty of the statistician to question his or her client about the way in which the
experiment was performed, in order to try and find out whether such factors do indeed exist or not.
The manner in which the experimental units are assigned to the treatment levels is also very im-
portant. It is always desirable that these be assigned randomly.
If there is only one factor (treatment or blocking factor) present, we speak of a one-way clas-
sification and the corresponding analysis is termed one-way analysis of variance. If there are
two factors, we have a two-way classification and we speak of two-way analysis of variance, and so
on. These names originate from the arrangement of the data in tables as in the following examples.
When discussing the analysis of these examples, including analyses of chapters 4 and 5, some out-
put from a standard package for personal computers, called SAS JMP, will be shown. You should
make sure that you are able to interpret these results. One of the techniques used in SAS JMP
is a plot of treatment means diamonds. The line across each diamond represents the group mean.
The vertical span of each diamond represents the 95% confidence interval for each group. The
standard error of any mean is given by s/√nᵢ, where s² is the pooled estimate of the error variance.
Suppose six potato tubers were planted during each of the four moon phases. The purpose is to
test whether the yield of the plants is influenced by the phase of the moon at planting time. The
results are as follows (yield in grams per plant):
This is a typical example of a one-way classification, and will be analysed in section 3.4.
This experiment may be refined. One source of variation in potato yield is the season: apart
from the possible effect of the moon phase, there is also an optimum planting time with respect
to seasons. To a certain extent the difference in yield found in the experiment may be ascribable
to seasonal and climatic differences. It would be better if the experiment were repeated over a
number of moon months and possibly over a number of years.
A housewife does her grocery shopping once a week. Once a month she buys the main items and
is then able to compare the prices of a number of supermarkets before buying. In the other three
weeks of the month she buys only the week’s supplies of perishable foods, and she would like to
decide on one of the three supermarkets in her area for this shopping. She therefore decides to buy
at each of the three supermarkets in turn once every month for four months in order to compare
prices. Her results are as follows (prices to the nearest rand):
Month Supermarket
A B C
May 34 18 29
June 30 20 34
July 27 30 36
August 33 32 37
One source of variation is of course the fact that her shopping list varies from week to week; a
more sensitive analysis might be possible if she recorded only those items which remain on her list
from week to week; the results would, however, not answer her question fully.
A man has a choice of three routes to drive to work. He decides to perform an experiment in order
to enable him to select a route. He knows that his travelling time depends on the day of the week
and on the time he leaves home. He performs his experiment over a period of nine weeks, and
allocates the routes and departure times randomly to the days of each week. His travelling times
(in minutes) for the nine weeks are as follows:
(i) The investigator would of course try to restrict his or her investigations to “normal” weeks,
ie weeks without public holidays, weeks which do not coincide with school holidays. Weeks
in which the last day of the month falls should possibly be avoided as well, and might be the
subject of a separate study.
(ii) It is possible that the traffic volume may increase slightly over the nine weeks. Also, the
performance of a car may deteriorate gradually after being serviced. For these reasons it is
important that the routes and departure times be allocated randomly to the days and weeks.
If route I were selected the first three weeks, route II the next three weeks and route III the
last three weeks, then an observed difference in travelling times may be due to a difference
in routes or changes in traffic volume or a change in car performance or other factors.
(iii) It is also possible to select weeks as a fourth factor and thus obtain a four-way classification.
However, the model would then be much more complicated. (See nested experiments in
section 3.3.)
We first define the concept cell. Each combination of levels of the factors in an experiment is called
a cell. In section 3.1.1 there are four cells (the four moon phases); in section 3.1.2 there are 4×3
= 12 cells (eg Supermarket B in July, Supermarket C in May); in section 3.1.3 there are 3×3×5 =
45 cells. Depending on the number of observations (replicates) in each cell, experiments may be
classified as follows:
(a) Experiments in which observations do not occur in every cell. Apart from poor planning,
there may be two reasons for this:
(i) The experiment has been designed specifically in such a way that some cells remain
empty. This may have been done because the experimenter is unable to cope with as
many experimental units as there are cells. The experiment is then designed in such a
way that a precise analysis is still possible.
(ii) Some observations may have been lost; a rabbit ate one of the plants, a test tube broke,
a laboratory mouse died or a patient decided to discontinue his or her treatment. In such
cases it is still possible to analyse the results, but the analysis is much more complicated.
The missing observations may be estimated by means of least squares, and the number
of degrees of freedom in the error sum of squares is decreased accordingly.
(b) Balanced experiments, in which there are equal numbers of observations per cell. We
distinguish between two cases:
(i) One observation per cell: Sections 3.1.2 and 3.1.3 are such examples. As will be seen,
there is some information which is not contained in such an experiment; in particular
it is not possible to ascertain whether the model is an adequate representation of the
data.
(ii) More than one observation per cell: This is the most desirable type of experiment,
firstly because the model may be verified and secondly because the computations are
not too complicated.
(c) Unbalanced experiments in which there are unequal numbers of observations per cell. It
is possible to analyse the data in a meaningful way, but the algebra, the notation and the
computations are much more complicated.
y = Xβ + ε
However, sometimes some elements of β may be random variables, and the distributions of the
quadratic forms may differ from those derived in chapter 2.
Consider a factor with k levels. If these k levels constitute the complete set of levels in which one
is interested, the effects of the levels of the factor on the yield are called fixed effects. In section
3.1.1 we were interested only in the four phases of the moon, and observations were made during
every phase. In section 3.1.2 we were interested in the three supermarkets only, and in section
3.1.3 in the three routes only.
On the other hand, a factor may exist at a large number of levels and experiments are performed
at a randomly selected set of these levels. The effects of these levels are called random effects.
A simple example of a random effect is the following: A farmer wishes to test the effects of four
diets on the body mass of piglets. He knows that piglets of the same litter are more uniform (before
treatment) than all the piglets jointly. He therefore selects five litters at random and four piglets
at random from each litter. One piglet is assigned at random to each diet. The results (mass of
Litters Diets
A B C D
I 30 51 40 31
II 51 56 45 48
III 44 59 57 36
IV 40 47 42 39
V 50 57 41 36
If the farmer wants to draw conclusions about these specific five litters only, he may regard the
factor “Litters” as fixed; however, if he wants to draw conclusions about the population of litters
from which the five litters may be regarded as a random sample, the factor “Litters” must be
regarded as a random factor.
Another typical example is a large consignment of wool (or other produce). A random sample of
bags is selected, and then a number of samples from each bag is selected and subjected to various
treatments. The investigator wants to make statements about the whole consignment.
We may distinguish further between factors with levels which are selected at random from finite
(small) populations and infinite (large) populations. We shall concentrate mainly on the theory of
large populations.
Thus we may regard any factor in an experiment as fixed or random. Experiments may be classified
accordingly:
(a) Fixed effects model or model I: all factors (treatments or blocking factors) are fixed.
(b) Random effects model or model II: all factors are random.
(c) Mixed model: some factors are fixed and some are random.
Example 3.1.
Determine in the following scenarios whether the factors are fixed or random effects:
(a) In an effort to improve the quality of video tapes, the effects of four kinds of treatment A, B,
C and D on reproducing quality of image are compared.
(b) Pieces of skin were exposed to four randomly chosen light intensities for a fixed period. After
exposure the elasticity of the skin was measured. The purpose was to compare the effects of
the four intensities on the elasticity of the skin.
(c) An investigator wishes to evaluate the effect on reaction time of medicine administered for
colds. Data were recorded that represent the reaction (in seconds) on a stimulus one hour
after each of four randomly chosen brands of this type of medicine was administered to ten
people.
(d) A study was done as to how the concentration of a certain drug in the blood, 24 hours after
being injected, is affected by age and sex. An analysis of the blood samples of 40 people
who had received the drug yielded concentration (in milligrams per cubic centimetre) for age
groups 11-25 years, 26-40 years, 41-65 years and >65 years.
Solution 3.1.

(a) The four kinds of treatment A, B, C and D constitute the complete set of treatments of interest, so the factor treatment is a fixed effect.

(b) The four light intensities were chosen at random, so the factor light intensity is a random effect.

(c) The four brands of medicine were chosen at random, so the factor brand is a random effect.

(d) The factor gender is a fixed effect and the factor age is a fixed effect.
To test hypotheses regarding fixed effects, we shall use the method of testing linear hypotheses in
section 2.7. In the case of random effects, we shall test hypotheses of the form H₀: σ²_A = 0 (which
is equivalent to testing H₀: σ²_B + cσ²_A = σ²_B, where c is a positive constant). The procedure for
testing such a hypothesis is as follows:
We know that

\[
f \sim \left(1 + c\,\frac{\sigma_A^2}{\sigma_B^2}\right) F_{r_i;r_j}.
\]

Thus f ∼ F_{r_i;r_j} if and only if σ²_A = 0 (that is, if and only if H₀ is true). If σ²_A > 0, we may expect f
to be larger than the F_{r_i;r_j} statistic. Hence, if f > F_{α;r_i;r_j}, reject H₀: σ²_A = 0 at the 100α% level.
For each model (I, II or mixed), we shall set up an analysis of variance table. The relevant quadratic
forms which make up the ratio f will be obtained from the "mean square" or "MS" column of the
table. These mean squares can be shown to be independent and, apart from scale factors, chi-square
variates. It should become apparent that the ratio f will be formed from two mean squares, say MSᵢ
and MSⱼ, which are chosen in such a way that E(MSᵢ) = E(MSⱼ) if and only if the null hypothesis
of interest is true. This will ensure that f has a central F-distribution under H₀.
The above procedures will be used to derive the results for the one-way models in detail. The
results presented for the two-way models can be derived analogously.
3.3.2.1 Illustration 1
Suppose there are six comparable grade eight classes in a large school, and six teachers who are
able to teach mathematics. Three methods of teaching mathematics are to be compared. Each
teacher is assigned (randomly) to one class to teach according to a prescribed method. After one
term five pupils are selected at random from each class and tested.
Superficially this may look like an ordinary two-way classification with five observations per cell.
However, the designations “Teacher 1” and “Teacher 2” are misleading, since each of the six classes
was taught by a different teacher. A more accurate exposition would be as follows (in order to
save space, the observations are not repeated, but each cross indicates a set of five observations):
Such an experiment is termed a hierarchical or nested experiment; in the above example we say
that the factor Teacher is nested in the factor Teaching method. In such experiments there are
empty cells, not because of a lack of time or facilities, but because the empty cells are not relevant.
Another similar example is the following:
3.3.2.2 Illustration 2
A farmer wants to determine the effect of three sprays on his peach trees. He selects 12 comparable
trees at random, and assigns them randomly to the three sprays. After a week he selects six leaves
at random from each tree (Question: how would he do that?) and tests the nitrogen content of
each leaf. The results are as follows:
Spray Tree
1 2 3 4
A 4.5 5.5 13.5 11.5
7.0 7.5 15.0 10.0
5.0 12.5 12.5 12.0
5.5 6.0 12.0 9.5
6.5 4.0 10.0 10.5
7.5 6.5 9.0 12.5
B 15.5 14.5 11.0 12.5
15.0 15.0 10.0 12.0
14.5 12.5 12.0 13.5
14.0 16.0 13.0 10.0
16.0 13.5 10.5 10.5
15.0 12.5 9.5 13.5
C 7.0 3.0 8.0 12.5
8.0 4.5 9.5 10.0
5.0 4.0 8.5 13.0
7.5 5.5 10.5 10.5
8.5 7.0 7.5 11.0
6.0 6.0 10.0 9.0
In this case we do not deal with the same four trees for every spray, but with 12 different trees,
and the experiment is a nested experiment.
The analysis
We now assume that there is one treatment or blocking factor present with k levels, and that
there are nᵢ experimental units at the i-th level. We also assume that these k levels represent
all levels of interest. The k levels define k populations, and the observations may be regarded as
independent random samples from the k populations. Let yᵢⱼ be the observation on the j-th individual
corresponding to the i-th level of the treatment or blocking factor. The model is

\[
y_{ij} = \mu_i + \varepsilon_{ij}, \qquad j = 1,\dots,n_i;\ i = 1,\dots,k. \tag{3.1}
\]

Let N = \sum_{i=1}^{k} n_i be the total number of observations.
The model may be written as y = Xβ + ε as follows:

\[
\begin{bmatrix} y_{11} \\ \vdots \\ y_{1n_1} \\ \vdots \\ y_{k1} \\ \vdots \\ y_{kn_k} \end{bmatrix}
= \begin{bmatrix}
1 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
1 & 0 & \cdots & 0 \\
 & & \vdots & \\
0 & 0 & \cdots & 1 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 1
\end{bmatrix}
\begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_k \end{bmatrix}
+ \begin{bmatrix} \varepsilon_{11} \\ \vdots \\ \varepsilon_{1n_1} \\ \vdots \\ \varepsilon_{k1} \\ \vdots \\ \varepsilon_{kn_k} \end{bmatrix}
\]
\[
X'X = \begin{bmatrix}
n_1 & 0 & \cdots & 0 \\
0 & n_2 & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & n_k
\end{bmatrix};
\qquad
(X'X)^{-1} = \begin{bmatrix}
\tfrac{1}{n_1} & 0 & \cdots & 0 \\
0 & \tfrac{1}{n_2} & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & \tfrac{1}{n_k}
\end{bmatrix}
\]
\[
X'y = \begin{bmatrix}
y_{11} + \cdots + y_{1n_1} \\
y_{21} + \cdots + y_{2n_2} \\
\vdots \\
y_{k1} + \cdots + y_{kn_k}
\end{bmatrix}
= \begin{bmatrix} y_{1.} \\ y_{2.} \\ \vdots \\ y_{k.} \end{bmatrix}
\]

say, where

\[
y_{i.} = \sum_{j=1}^{n_i} y_{ij}.
\]
\[
\therefore \hat\beta = (X'X)^{-1}X'y = \begin{bmatrix} \bar y_{1.} \\ \bar y_{2.} \\ \vdots \\ \bar y_{k.} \end{bmatrix}
\]

where \bar y_{i.} = \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij} = y_{i.}/n_i, the mean of the i-th sample.

Let
\[
\bar y_{..} = \frac{1}{N}\sum_{i=1}^{k}\sum_{j=1}^{n_i} y_{ij} = \sum_{i=1}^{k} n_i\bar y_{i.} \Big/ \sum_{i=1}^{k} n_i.
\]
Of importance in this model are the two quadratic forms q₄ and q₅ of chapter 2. The error sum of
squares is

\[
\begin{aligned}
q_4 &= y'(I - X(X'X)^{-1}X')y \\
&= y'y - y'X(X'X)^{-1}X'y \\
&= \sum_{i=1}^{k}\sum_{j=1}^{n_i} y_{ij}^2 - \sum_{i=1}^{k} n_i\bar y_{i.}^2 \\
&= \sum_{i=1}^{k}\left[\sum_{j=1}^{n_i} y_{ij}^2 - n_i\bar y_{i.}^2\right] \\
&= \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar y_{i.})^2
\end{aligned} \tag{3.2}
\]

If we write

\[
q_4 = \sum_{j=1}^{n_1}(y_{1j} - \bar y_{1.})^2 + \cdots + \sum_{j=1}^{n_k}(y_{kj} - \bar y_{k.})^2
\]
then we see that q4 measures the variation within samples. (Each term of q4 measures the variation
within one of the k samples.)
The hypothesis which one usually wants to test is that the k population means are all equal:
H0 : µ1 = µ2 = · · · = µk .
(In exceptional circumstances one may want to test µ1 = µ2 = · · · = µk = c with c specified, but
that problem will not be discussed here.)
This hypothesis may be written as the k − 1 equations µ₁ = µ₂; µ₂ = µ₃; ⋯; µ_{k−1} = µ_k.
\[
\begin{aligned}
q_5 &= \sum_{i=1}^{k} n_i(\bar y_{i.} - \bar y_{..})^2 \\
&= \sum_{i=1}^{k} n_i\bar y_{i.}^2 - N\bar y_{..}^2 \\
&= y'A_5 y = y'(B - C)y
\end{aligned} \tag{3.3}
\]
where C = 1(1'1)^{-1}1' is the N × N matrix with every element equal to 1/N, and

\[
B = \begin{bmatrix}
1(1'1)^{-1}1' & O & \cdots & O \\
O & 1(1'1)^{-1}1' & \cdots & O \\
\vdots & & \ddots & \vdots \\
O & O & \cdots & 1(1'1)^{-1}1'
\end{bmatrix}
\]

is block diagonal, the i-th diagonal block being the nᵢ × nᵢ matrix with every element equal to 1/nᵢ.
BB = B;  CC = C;  BC = CB = C

r(C) = 1;  r(B) = k;  r(A₅) = r(B − C) = k − 1, and

A₅A₅ = (B − C)(B − C) = BB − CB − BC + CC = B − C − C + C = A₅

which shows, as before, that q₅/σ² is a noncentral chi-square variate with k − 1 degrees of freedom
and noncentrality parameter

\[
\lambda_5 = \sum_{i=1}^{k} n_i(\mu_i - \bar\mu_.)^2/\sigma^2
\qquad\left(\text{where } \bar\mu_. = \frac{1}{N}\sum_{i=1}^{k} n_i\mu_i\right)
\]

which is equal to zero if and only if H₀ is true.
From chapter 2 we know that q₄ and q₅ are independent and that q₄/σ² is a (central) χ²_{N−k} variate.
This can also be shown directly by noting that

\[
q_4 = \sum_i\sum_j y_{ij}^2 - \sum_{i=1}^{k} n_i\bar y_{i.}^2 = y'Iy - y'By = y'(I - B)y,
\]

that r(I − B) = N − k and that (I − B)(B − C) = B − C − BB + BC = O.

Thus \frac{q_4}{N-k} is an unbiased estimator of σ², and we write

\[
\hat\sigma^2 = q_4/(N-k) = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij} - \bar y_{i.})^2/(N-k).
\]
Consider q₅:

\[
q_5 = \sum_{i=1}^{k} n_i(\bar y_{i.} - \bar y_{..})^2,
\]
which is the weighted sum of squares of deviations of the sample means from the grand mean.
We say that q5 is the “between samples” sum of squares while q4 is the “within samples” sum of
squares. Furthermore, since
\[
I - C = (I - B) + (B - C)
\]
or A₂ = A₄ + A₅,
\[
\therefore y'(I - C)y = y'A_4y + y'A_5y
\]
or
\[
q_2 = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij} - \bar y_{..})^2
= \sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij} - \bar y_{i.})^2 + \sum_{i=1}^{k} n_i(\bar y_{i.} - \bar y_{..})^2
\]
\[
\therefore \text{SS}_{\text{Total}}^{\,1} = \text{SS}_{\text{within samples}} + \text{SS}_{\text{between samples}}.
\]
The test statistic is f = \dfrac{q_5/(k-1)}{q_4/(N-k)}. Thus H₀: µ₁ = … = µ_k is rejected if f > F_{α;k−1;N−k}.
The results are usually presented in tabular form as follows:

Source             SS∗   d.f.    MS∗            f
Between samples    q₅    k − 1   q₅/(k − 1)     MS_between/MS_within
Within samples     q₄    N − k   q₄/(N − k)
Total              q₂    N − 1
∗ SS = sum of squares; MS = mean square
1
This expression should, strictly speaking, be called “SSTotal adjusted for the mean ”. We adopt the more
concise “SSTotal ” from now on.
Multiple comparisons

If H₀: µ₁ = … = µ_k is rejected, one may want to test for specific differences; usually one wants
to test contrasts of the form H₀: \sum c_i\mu_i = 0 where \sum c_i = 0. In such cases one may retain an
overall significance level of α if H₀ is rejected when

\[
\frac{\left(\sum c_i\bar y_{i.}\right)^2}{\sum_i (c_i^2/n_i)} > (k-1)\hat\sigma^2 F_{\alpha;k-1;N-k}
\]

(this can be shown by specifying t' = (t₁ t₂ ⋯ t_{k−1}), where t_i = \sum_{r=1}^{i} c_r for i = 1, …, k − 1, in q_{t^*},
equations (2.7.5) and (2.7.7) of section 2.7).

In particular, H₀: µ_r = µ_s is rejected if

\[
\frac{(\bar y_{r.} - \bar y_{s.})^2}{\dfrac{1}{n_r} + \dfrac{1}{n_s}} > (k-1)\hat\sigma^2 F_{\alpha;k-1;N-k}.
\]
Example 3.2.
Using data in section 3.1.1:
(b) Test at the 5% level of significance if there is a difference in yields among the moon phases.
(c) Test at the 5% level of significance whether the mean yield of potatoes planted at full moon
is equal to the mean yield of potatoes planted at new moon.
(d) Test whether the mean yield of potatoes planted at full moon is equal to the mean yield of
potatoes planted during all other phases. Use α = 0.05.
Solution 3.2.
(a)
Treatments
       New moon   First quarter   Full moon   Last quarter
         201          189            206          205
         197          200            214          194
         190          199            202          202
         194          196            205          198
         200          194            203          200
         206          198            206          201
n          6            6              6            6
ȳᵢ.      198          196            206          200

ȳ.. = 200
\[
\begin{aligned}
\sum_j (y_{1j} - \bar y_{1.})^2 &= (201-198)^2 + (197-198)^2 + \dots + (206-198)^2 = 158 \\
\sum_j (y_{2j} - \bar y_{2.})^2 &= (189-196)^2 + (200-196)^2 + \dots + (198-196)^2 \\
&= (-7)^2 + 4^2 + 3^2 + 0^2 + (-2)^2 + 2^2 = 82 \\
\sum_j (y_{3j} - \bar y_{3.})^2 &= (206-206)^2 + (214-206)^2 + \dots + (206-206)^2 = 90 \\
\sum_j (y_{4j} - \bar y_{4.})^2 &= (205-200)^2 + (194-200)^2 + \dots + (201-200)^2 = 70 \\
q_4 &= \sum_i\sum_j (y_{ij} - \bar y_{i.})^2 = 158 + 82 + 90 + 70 = 400 \\
q_5 &= \sum_i n_i(\bar y_{i.} - \bar y_{..})^2 = 336
\end{aligned}
\]
Source SS d.f. MS f
Between 336 3 112 112/20 = 5.6
Within 400 20 20 = σ̂ 2
Total 736 23
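A minimal sketch reproducing the analysis-of-variance computations for these data:

```python
# One-way ANOVA for the moon-phase data of section 3.1.1.
data = {
    "new moon":      [201, 197, 190, 194, 200, 206],
    "first quarter": [189, 200, 199, 196, 194, 198],
    "full moon":     [206, 214, 202, 205, 203, 206],
    "last quarter":  [205, 194, 202, 198, 200, 201],
}

N = sum(len(v) for v in data.values())
k = len(data)
grand = sum(sum(v) for v in data.values()) / N

q4 = sum(sum((y - sum(v) / len(v)) ** 2 for y in v) for v in data.values())  # within
q5 = sum(len(v) * (sum(v) / len(v) - grand) ** 2 for v in data.values())     # between

MS_between = q5 / (k - 1)
MS_within = q4 / (N - k)        # = sigma_hat^2
f = MS_between / MS_within
print(q4, q5, f)                # → 400.0 336.0 5.6
```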
(b) H₀: µ₁ = µ₂ = µ₃ = µ₄ (H₀: α₁ = α₂ = α₃ = α₄ = 0). From the table f = 5.6, and the 5%
critical value is F_{0.05;3;20} = 3.10. Since 5.6 > 3.10, H₀ is rejected at the 5% level of significance.
(c) H₀: µ₁ = µ₃

Reject H₀ if
\[
\frac{(\bar y_{r.} - \bar y_{s.})^2}{\tfrac{1}{n_r} + \tfrac{1}{n_s}} > (k-1)\hat\sigma^2 F_{\alpha;k-1;N-k} = 3 \times 20 \times 3.10 = 186.
\]

Now
\[
\frac{(\bar y_{1.} - \bar y_{3.})^2}{\tfrac16 + \tfrac16} = \frac{(198-206)^2 \times 6}{2} = 192.
\]
Since 192 > 186, we reject H0 at the 5% level of significance and conclude that the mean
yield of potatoes planted at full moon is not equal to the mean yield of potatoes planted at
new moon.
(d) We want to test whether the mean yield of potatoes planted at full moon is equal to the mean
yield of potatoes planted during all other phases.
H₀: µ₃ = (µ₁ + µ₂ + µ₄)/3,

that is H₀: −⅓µ₁ − ⅓µ₂ + µ₃ − ⅓µ₄ = 0 ⇒ c₁ = −⅓, c₂ = −⅓, c₃ = 1 and c₄ = −⅓.

Then
\[
\frac{\left(\sum c_i\bar y_{i.}\right)^2}{\sum c_i^2/n_i}
= \frac{\left((-\tfrac13)(198) + (-\tfrac13)(196) + (1)(206) + (-\tfrac13)(200)\right)^2}
{(\tfrac19)(\tfrac16) + (\tfrac19)(\tfrac16) + (1)(\tfrac16) + (\tfrac19)(\tfrac16)} = 288.
\]
Since 288 > 186 we reject H0 at the 5% level of significance and conclude that the mean yield
of potatoes planted at full moon is not equal to the mean yield of potatoes planted during the
other phases.
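The contrast computations in (c) and (d) can be checked with a short function (the critical value F_{0.05;3;20} = 3.10 is read from tables):

```python
# Multiple-comparison (contrast) statistics for Example 3.2:
# (sum c_i ybar_i)^2 / sum(c_i^2 / n_i), compared with (k-1) sigma_hat^2 F.
ybar = [198, 196, 206, 200]
n = [6, 6, 6, 6]
sigma2, k, F_crit = 20, 4, 3.10

def contrast_stat(c):
    num = sum(ci * yi for ci, yi in zip(c, ybar)) ** 2
    den = sum(ci * ci / ni for ci, ni in zip(c, n))
    return num / den

threshold = (k - 1) * sigma2 * F_crit            # ≈ 186
print(contrast_stat([1, 0, -1, 0]))              # (c): ≈ 192 > 186, reject
print(contrast_stat([-1/3, -1/3, 1, -1/3]))      # (d): ≈ 288 > 186, reject
```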
The SAS JMP output for this example can be seen in figure 3.1. Compare the tables with the
answers obtained above. The graph shows clearly that the yield is highest during moon phase 3.
Note: If the confidence intervals (the points of the diamonds) do not overlap, the means are
significantly different. Conversely, if the means do not differ much, they all lie close to the
grand mean and the diamonds overlap.
The analysis rests on the following assumptions:

(i) normality

(ii) equality of variances

(iii) independence
Assumption (i) may be investigated using the N residuals yᵢⱼ − ŷᵢⱼ = yᵢⱼ − ȳᵢ.. The assumption
of equal variances may be tested formally, using a test known as Bartlett's test, or a test based on
the statistic \max_i s_i^2 / \min_i s_i^2,

where
\[
s_i^2 = \frac{1}{n_i - 1}\sum_{j=1}^{n_i}(y_{ij} - \bar y_{i.})^2.
\]
The assumption of independence is much more difficult to verify − one has to infer this from the
way in which the experiment was performed.
If (i) is not true, one may try a transformation or use a rank test called the Kruskal-Wallis test.
If (ii) is not true, one may try to stabilise the variance by means of a transformation; otherwise
one is faced with the Behrens-Fisher problem, for which approximate solutions exist. If (iii) is
not true, the problem is much more difficult to solve.
A reparameterisation

A form in which the one-way analysis of variance model is often presented is one with a constant
term:

\[
y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \qquad j = 1,\dots,n_i;\ i = 1,\dots,k,
\]

where, as before, the εᵢⱼ are independently n(0; σ²) distributed. Now there are k + 1 parameters instead
of k parameters. This introduces an indeterminacy, which is reflected in the fact that X in the
representation y = Xβ + ε does not have full column rank (the first column of X is equal to the
sum of the other columns):
\[
\begin{bmatrix} y_{11} \\ \vdots \\ y_{1n_1} \\ \vdots \\ y_{k1} \\ \vdots \\ y_{kn_k} \end{bmatrix}
= \begin{bmatrix}
1 & 1 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
1 & 1 & \cdots & 0 \\
 & & \vdots & \\
1 & 0 & \cdots & 1 \\
\vdots & \vdots & & \vdots \\
1 & 0 & \cdots & 1
\end{bmatrix}
\begin{bmatrix} \mu \\ \alpha_1 \\ \vdots \\ \alpha_k \end{bmatrix}
+ \begin{bmatrix} \varepsilon_{11} \\ \vdots \\ \varepsilon_{kn_k} \end{bmatrix}
\]
Since the model represents the same problem as the model in equation 3.1, we must have
µi = αi + µ, i = 1, . . . , k
that is αi = µi − µ.
The αᵢ are not uniquely defined in terms of the µᵢ; we still need to choose a suitable definition
of µ (subject to certain conditions). The usual choice is to let

\[
\mu = \frac{1}{k}\sum_{i=1}^{k}\mu_i
\]

in which case

\[
\sum\alpha_i = \sum(\mu_i - \mu) = 0.
\]

(Another choice is \mu = \frac{1}{N}\sum n_i\mu_i, in which case \sum n_i\alpha_i = 0.)
The parameters α1 , · · · , αk are termed the treatment effects, with αi the effect of the i-th treatment
level. The following two null hypotheses are equivalent:
H0 : µ1 = µ2 = . . . = µk , and
H0 : α1 = . . . = αk−1 = αk (= 0).
Also, H0 : µi = µj is equivalent to H0 : αi = αj .
You may recall from your earlier courses in statistics the distinction between statistics and
parameters. Statistics are numerical attributes that describe the characteristics of a sample, whilst
parameters are numerical attributes that describe the characteristics of a population. A statistic is
used to estimate a population parameter.

You also learnt that there are two types of estimates: point estimates and interval estimates. A
point estimate uses a single value to estimate a population parameter, whilst an interval estimate
uses a range of values to approximate a population parameter.
The values of µ, αi , σ 2 and ij in the model can be estimated from our data. These are related to
some of the summary statistics which investigators would initially calculate from their data.
\[
E(\bar y_{i.}) = \frac{(\mu + \alpha_i) + \dots + (\mu + \alpha_i)}{n}
= \frac{n\mu + n\alpha_i}{n} = \mu + \alpha_i
\]

\[
E(\bar y_{..}) = \frac{(\mu + \alpha_1) + \dots + (\mu + \alpha_1) + \dots\dots + (\mu + \alpha_k) + \dots + (\mu + \alpha_k)}{kn}
= \frac{kn\mu + n(\alpha_1 + \alpha_2 + \dots + \alpha_k)}{kn} = \mu
\]

(since α₁ + α₂ + … + α_k = 0).
Thus ȳ.. can be used to estimate µ, ȳi. can be used to estimate µ + αi and ȳi. − ȳ.. is an estimate
of αi . The population variance σ 2 is estimated by s2 , the pooled estimate of the variance.
In summary, the list of the parameters and their point estimate are as follows:
In certain cases the sample sizes are not equal. In this case the overall mean is given by

\[
\bar y_{..} = \frac{1}{N}\sum_{i=1}^{k}\sum_{j=1}^{n_i} y_{ij}
= \frac{\sum_{i=1}^{k} n_i\bar y_{i.}}{\sum_{i=1}^{k} n_i}. \tag{3.4}
\]
In order to calculate the confidence interval we need to derive the variance for each estimator.
\[
\bar y_{..} \pm t_{\frac{\alpha}{2}}(N-k)\times\sqrt{\operatorname{var}(\bar y_{..})}
\]
\[
\bar y_{..} \pm t_{\frac{\alpha}{2}}(N-k)\times\sqrt{\frac{s^2}{N}} \tag{3.6}
\]

where s² = MSE.
Var(α̂ᵢ) = Var(ȳᵢ. − ȳ..). The variance of the effect of the first category is Var(α̂₁) = Var(ȳ₁. − ȳ..), where

\[
\bar y_{1.} - \bar y_{..} = \bar y_{1.} - \frac{\sum n_i\bar y_{i.}}{N}
= \bar y_{1.} - \frac{n_1\bar y_{1.} + n_2\bar y_{2.} + n_3\bar y_{3.} + \dots + n_k\bar y_{k.}}{N}
= \frac{N - n_1}{N}\bar y_{1.} - \frac{n_2\bar y_{2.}}{N} - \frac{n_3\bar y_{3.}}{N} - \dots - \frac{n_k\bar y_{k.}}{N}.
\]

Thus, since the ȳᵢ. are independent with Var(ȳᵢ.) = σ²/nᵢ,

\[
\begin{aligned}
\operatorname{Var}(\bar y_{1.} - \bar y_{..})
&= \frac{(N-n_1)^2}{N^2}\operatorname{Var}(\bar y_{1.}) + \frac{n_2^2}{N^2}\operatorname{Var}(\bar y_{2.}) + \dots + \frac{n_k^2}{N^2}\operatorname{Var}(\bar y_{k.}) \\
&= \frac{(N-n_1)^2}{N^2}\,\frac{\sigma^2}{n_1} + \frac{n_2^2}{N^2}\,\frac{\sigma^2}{n_2} + \dots + \frac{n_k^2}{N^2}\,\frac{\sigma^2}{n_k} \\
&= \frac{\sigma^2}{N^2}\left(\frac{(N-n_1)^2}{n_1} + n_2 + n_3 + \dots + n_k\right) \\
&= \frac{\sigma^2}{N^2}\left(\frac{N^2 - 2Nn_1 + n_1^2}{n_1} + n_2 + \dots + n_k\right) \\
&= \frac{\sigma^2}{N^2}\left(\frac{N^2}{n_1} - 2N + n_1 + n_2 + \dots + n_k\right) \\
&= \frac{\sigma^2}{N^2}\left(\frac{N^2}{n_1} - 2N + N\right) \qquad\text{since } n_1 + n_2 + \dots + n_k = N \\
&= \frac{\sigma^2}{N^2}\left(\frac{N^2}{n_1} - N\right)
= \frac{\sigma^2}{N^2}\cdot\frac{N(N-n_1)}{n_1}
= \frac{\sigma^2}{N}\cdot\frac{N-n_1}{n_1}.
\end{aligned}
\]

In general,

\[
\operatorname{Var}(\hat\alpha_i) = \frac{\sigma^2}{N}\cdot\frac{N-n_i}{n_i}. \tag{3.7}
\]
Furthermore,

Var(α̂i − α̂j) = Var(ȳi. − ȳj.)
             = Var(ȳi.) + Var(ȳj.)   since they are independent
             = σ²/ni + σ²/nj
             = σ²(1/ni + 1/nj).   (3.9)
Similarly,

Var(µ̂ + α̂i) = Var(ȳi.) = σ²/ni .   (3.11)
From chapter 2 you will recall that σ̂² = q4/(N − k) and q4/σ² ∼ χ²_{N−k}.
Example 3.3.
Twenty-four plots of land were divided at random into four groups of six, and subjected to the
following four treatments:
Wheat was planted on each plot, and the yields were as follows:
Treatment (1): 66 89 62 30 51 74
Treatment (2): 72 86 95 94 74 89
Treatment (3): 47 54 62 53 71 49
Treatment (4): 81 77 69 61 68 82
(a) Select a model for this experiment and test the null hypothesis that there is no treatment
effect (5% level).
(b) Find a 95% confidence interval for the difference between the means of treatments 1 and 2.
(c) Find a 95% confidence interval for the difference between the mean of the control group and the mean of the other three treatments.
Solution 3.3.

yij = µi + εij ;  i = 1, 2, . . . , k;  j = 1, . . . , ni
Treatments
1 2 3 4
66 72 47 81
89 86 54 77
62 95 62 69
30 94 53 61
51 74 71 68
74 89 49 82
ni 6 6 6 6
ȳi. 62 85 56 73
N = 24   ȳ.. = 69
Now

Σj (y1j − ȳ1.)² = (66 − 62)² + (89 − 62)² + . . . + (74 − 62)² = 2 034
Σj (y2j − ȳ2.)² = (72 − 85)² + (86 − 85)² + . . . + (89 − 85)² = 488
Σj (y3j − ȳ3.)² = (47 − 56)² + (54 − 56)² + . . . + (49 − 56)² = 404
Σj (y4j − ȳ4.)² = (81 − 73)² + (77 − 73)² + . . . + (82 − 73)² = 346

q4 = Σi Σj (yij − ȳi.)² = 2 034 + 488 + 404 + 346 = 3 272
q5 = Σi ni (ȳi. − ȳ..)² = 2 940.
Source SS d.f. MS f
Between 2 940 3 980 5.9902
Within 3 272 20 163.6
Total 6 212 23
Since 5.9902 > 3.1, we reject H0 at the 5% level of significance and conclude that the means
are significantly different from each other.
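The ANOVA table above is easy to verify with a short Python sketch (standard library only; variable names are ours):

```python
# One-way ANOVA for the wheat-yield data of Example 3.3.
groups = [
    [66, 89, 62, 30, 51, 74],   # treatment 1
    [72, 86, 95, 94, 74, 89],   # treatment 2
    [47, 54, 62, 53, 71, 49],   # treatment 3
    [81, 77, 69, 61, 68, 82],   # treatment 4
]
k = len(groups)
N = sum(len(g) for g in groups)
means = [sum(g) / len(g) for g in groups]
grand = sum(sum(g) for g in groups) / N

q4 = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)    # within SS
q5 = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))  # between SS
f = (q5 / (k - 1)) / (q4 / (N - k))

print(q4, q5, round(f, 4))  # 3272.0 2940.0 5.9902
```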
α = 0.05,  α/2 = 0.025,  t_{α/2;N−k} = t_{0.025;20} = 2.086
nr = ns = 6,  ȳ1. = 62,  ȳ2. = 85  and  σ̂² = 163.6.
The 95% confidence interval for the difference between the means of treatments 1 and 2 is
(ȳr. − ȳs.) ± t_{α/2;N−k} × √(σ̂²(1/nr + 1/ns))
(62 − 85) ± 2.086 × √(163.6(1/6 + 1/6))
−23 ± 2.086 × √54.5333
3. ANALYSIS OF VARIANCE 147 STA3701/1
−23 ± 15.4044
(−38.4044 ; −7.5956).
(c) Let the control group be sample 1 and the other three groups be sample 2; then ȳ1. = 62, ȳ2. = 71.3333, n1 = 6 and n2 = 18.
The 95% confidence interval for the difference between the mean of the control group and the mean of the other three treatments is

(ȳ1. − ȳ2.) ± t_{α/2;N−k} × √(σ̂²(1/n1 + 1/n2))
(62 − 71.3333) ± 2.086 × √(163.6(1/6 + 1/18))
−9.3333 ± 2.086 × √36.3556
−9.3333 ± 12.5777
(−21.911 ; 3.2444).
(d) The means are 62, 85, 56 and 73. Potassium and nitrogen gave the highest yield, with nitrogen and phosphate the second highest. A lack of nitrogen (treatments 1 and 3) led to the poorest results.
Example 3.4.
Ten randomly selected mental institutions were examined to determine the effects of three different
antipsychotic drugs on patients with the same types of symptoms. Each institution used one and
only one of the three drugs exclusively for a one-year period. The proportion of treated patients in
each institution who were discharged after one year of treatment was as follows for each drug used:
(b) Test to see if there are any significant differences among drugs with regard to the average proportion of patients discharged.
(c) What basic ANOVA assumptions might be violated here? How would you test whether these assumptions are indeed violated (no need to test, just give a description or method)?
(i) The overall mean proportion of treated patients who were discharged for the three drugs.
(iv) α1 − α3 .
(v) σ 2 .
(e) Obtain 95% confidence intervals for each of the estimates in part (d) of the question. Use the
method of multiple comparisons to obtain the confidence interval of the estimate in (d)(iv).
(f ) Use the method of multiple comparisons and test the following hypothesis: H0 : α2 = (α1 +
α3 )/2. Use α = 0.05.
Solution 3.4.
ȳ1. = 0.11

Σj (y1j − ȳ1.)² = (n1 − 1)s1² = (4 − 1)(0.0258)² = 0.002

or

Σj (y1j − ȳ1.)² = (0.1 − 0.11)² + (0.12 − 0.11)² + (0.08 − 0.11)² + (0.14 − 0.11)² = 0.002
ȳ2. = 0.15

Σj (y2j − ȳ2.)² = (n2 − 1)s2² = (3 − 1)(0.0361)² = 0.0026

or

Σj (y2j − ȳ2.)² = (0.12 − 0.15)² + (0.14 − 0.15)² + (0.19 − 0.15)² = 0.0026

ȳ3. = 0.2

Σj (y3j − ȳ3.)² = (n3 − 1)s3² = (3 − 1)(0.05)² = 0.005

or

Σj (y3j − ȳ3.)² = (0.2 − 0.2)² + (0.25 − 0.2)² + (0.15 − 0.2)² = 0² + (0.05)² + (−0.05)² = 0.005
q4 = Σi Σj (yij − ȳi.)² = 0.0096

q5 = Σi ni (ȳi. − ȳ..)²
   = 4(0.11 − 0.149)² + 3(0.15 − 0.149)² + 3(0.2 − 0.149)²
   = 0.01389
   ≈ 0.0139
Source                    SS      d.f.  MS      f
Drugs (between samples)   0.0139  2     0.0070  5
Error (within samples)    0.0096  7     0.0014
Total                     0.0235  9
(b) H0 : µ1 = µ2 = µ3 or H0 : α1 = α2 = α3 (= 0)
Under H0 :  f = [Between samples SS/(k − 1)] / [Within samples SS/(N − k)]
          = [Σi ni (ȳi. − ȳ..)²/(k − 1)] / [Σi Σj (yij − ȳi.)²/(N − k)] ∼ F_{k−1;N−k}
Since 5 > 4.74, H0 is rejected. The mean levels of patients discharged for the three drugs do
differ significantly at α = 0.05.
(c) • The assumption of equal variances might be violated. One can use Bartlett's test to check this assumption; if it is violated, an alternative such as Welch's test may be applied to the one-way ANOVA.
• The assumption of normality might also be violated. One can check this by graphical examination of Q-Q plots and histograms of the observations in each group, and also by a test such as the Shapiro-Wilk test. If the assumption is violated, one may use the Kruskal-Wallis test instead.
(d) (i) The point estimate of the overall mean proportion of treated patients who were discharged
for the three drugs is
µ̂ = ȳ.. = (1/N) Σi Σj yij = (Σi ni ȳi.)/(Σi ni)
   = [4(0.11) + 3(0.15) + 3(0.2)]/(4 + 3 + 3)
   = 0.149.
(ii)
µ̂ + α̂i = ȳi.
µ̂ + α̂1 = ȳ1. = 0.11
(iii)
α̂1 = ȳ1. − ȳ.. = 0.11 − 0.149 = −0.039

(iv)
α̂i − α̂j = ȳi. − ȳj.
α̂1 − α̂3 = ȳ1. − ȳ3. = 0.11 − 0.2 = −0.09
(v)
σ̂² = MSE = Σi Σj (yij − ȳi.)²/(N − k) = 0.0096/7 = 0.001371 ≈ 0.0014
(e) (i)
Var(µ̂) = Var(ȳ..) = σ²/N = 0.0014/10 = 0.00014 ≈ 0.0001

µ̂ ± t_{α/2;N−k} × √var(µ̂)

α = 0.05, α/2 = 0.025, and t_{α/2;N−k} = t_{0.025;7} = 2.365

The 95% confidence interval is

ȳ.. ± t_{α/2;N−k} × √var(ȳ..)
0.149 ± 2.365 × √0.0001
0.149 ± 0.02365 ≈ 0.149 ± 0.0237
(0.1253 ; 0.1727)
(ii)
Var(µ̂ + α̂i) = σ²/ni ;  Var(µ̂ + α̂1) = 0.0014/4 = 0.00035 ≈ 0.0004

(µ̂ + α̂1) ± t_{α/2;N−k} × √var(µ̂ + α̂1), that is ȳ1. ± t_{α/2;N−k} × √var(ȳ1.)

α = 0.05, α/2 = 0.025, and t_{α/2;N−k} = t_{0.025;7} = 2.365

0.11 ± 2.365 × √0.0004
0.11 ± 0.0473
(0.0627 ; 0.1523).
(iii)
Var(α̂i) = Var(ȳi. − ȳ..) = σ²(N − ni)/(N ni)

var(α̂1) = σ²(N − n1)/(N n1) = 0.0014 × (10 − 4)/(10 × 4) = 0.0014 × 6/40 = 0.00021 ≈ 0.0002

α̂1 ± t_{0.025;7} × √var(α̂1)
−0.039 ± 2.365 × √0.0002
−0.039 ± 0.0334
(−0.0724 ; −0.0056)
(iv)
Var(α̂i − α̂j) = σ²(1/ni + 1/nj)
Var(α̂1 − α̂3) = σ²(1/n1 + 1/n3) = 0.0014(1/4 + 1/3) = 0.000817 ≈ 0.0008

(α̂1 − α̂3) ± t_{α/2;N−k} × √var(α̂1 − α̂3), that is (ȳ1. − ȳ3.) ± t_{α/2;N−k} × √var(ȳ1. − ȳ3.)

(0.11 − 0.2) ± 2.365 × √0.0008
−0.09 ± 0.0669
(−0.1569 ; −0.0231)
(v) σ̂² = q4/(N − k) ≈ 0.0014 and q4/σ² ∼ χ²_{N−k}

∴ 0.95 = P(χ²_{0.975;7} ≤ q4/σ² ≤ χ²_{0.025;7})
       = P(1.69 ≤ 0.0096/σ² ≤ 16.013)
       = P(0.0096/16.013 ≤ σ² ≤ 0.0096/1.69)
       = P(0.000599512 ≤ σ² ≤ 0.005680473)
       = P(0.0006 ≤ σ² ≤ 0.0057)
For multiple comparisons, H0 is rejected if (Σ ci ȳi.)²/Σ(ci²/ni) > (k − 1)σ̂² F_{α;k−1;N−k}.

(f) For H0 : α2 = (α1 + α3)/2 take c1 = −1/2, c2 = 1, c3 = −1/2. Then

(Σ ci ȳi.)²/Σ(ci²/ni) = [0.15 − (0.11 + 0.2)/2]²/(1/16 + 1/3 + 1/12) ≈ 0.0001

and

(k − 1)σ̂² F_{0.05;2;7} = 2(0.0014)(4.74) = 0.013272 ≈ 0.0133
Since 0.0001 < 0.0133, we do not reject H0 at the 5% level of significance and conclude that
the mean of the second sample is equal to the average of the first and third samples.
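The contrast test in (f) can be reproduced in a few lines; this sketch hard-codes the group means, sample sizes, MSE and the tabled value F_{0.05;2;7} = 4.74 from Example 3.4:

```python
means = [0.11, 0.15, 0.20]   # drug group means ybar_i.
sizes = [4, 3, 3]            # n_i
mse = 0.0014                 # sigma^2 hat
k = 3
c = [-0.5, 1.0, -0.5]        # contrast for H0: alpha2 = (alpha1 + alpha3)/2

lhs = sum(ci * m for ci, m in zip(c, means)) ** 2 / sum(
    ci ** 2 / n for ci, n in zip(c, sizes))
rhs = (k - 1) * mse * 4.74   # (k-1) * sigma^2 hat * F_{0.05;2;7}

print(round(lhs, 4), round(rhs, 4), lhs > rhs)  # 0.0001 0.0133 False
```

Since the left-hand side does not exceed the right-hand side, H0 is not rejected, in agreement with the conclusion above.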
Analysis
We now consider the problem which arises if k populations are selected at random from a large
number of populations, and a random sample is drawn from each of these populations. For sim-
plicity we shall assume that all sample sizes are equal, that is
n1 = . . . = nk = n, say.
The model is

yij = µ + ai + εij ;  j = 1, · · · , n;  i = 1, · · · , k

where the parameters a1 , · · · , ak constitute a random sample of size k from a population of parameters. The following assumptions are usually made:

εij ∼ n(0; σ²);  j = 1, · · · , n;  i = 1, · · · , k
ai ∼ n(0; ω²);  i = 1, · · · , k
As before, let

Var(yij) = Var(µ) + Var(ai) + Var(εij)   since a1 , . . . , ak and ε11 , . . . , εkn are mutually independent
         = ω² + σ².

Now

E(yij²) = Var(yij) + [E(yij)]² = σ² + ω² + µ².
ȳi. = (1/n) Σj (µ + ai + εij) = µ + ai + (1/n) Σj εij

E(ȳi.) = E(µ) + E(ai) + (1/n) Σj E(εij) = µ

Var(ȳi.) = Var(µ) + Var(ai) + (1/n²) Σj Var(εij)
         = ω² + (1/n²)(nσ²)
         = ω² + σ²/n
ȳ.. = (1/kn) Σi Σj (µ + ai + εij) = µ + (1/k) Σi ai + (1/kn) Σi Σj εij

E(ȳ..) = E(µ) + (1/k) Σi E(ai) + (1/kn) Σi Σj E(εij) = µ

Var(ȳ..) = (1/k²) Σi Var(ai) + (1/(kn)²) Σi Σj Var(εij)
         = (1/k²)(kω²) + (1/(kn)²)(knσ²)
         = ω²/k + σ²/(kn)
If we write q4 = y′A4 y and q5 = y′A5 y, we see that A4 and A5 are exactly the same as in equations (3.2) and (3.3), section 3.4. Thus A4 A4 = A4 , A5 A5 = A5 , A4 A5 = O, r(A4) = N − k and r(A5) = k − 1. Hence

E(q4) = (N − k)σ² as before
E(q5) = E(n Σi ȳi.² − nk ȳ..²)
      = n Σi E(ȳi.²) − nk E(ȳ..²)
      = n Σi (µ² + ω² + σ²/n) − kn(µ² + σ²/(nk) + ω²/k)
      = (k − 1)σ² + n(k − 1)ω²

∴ E(q4/σ²) = N − k and E(q5/(σ² + nω²)) = k − 1.
From these expected values, the results above and (D8) it follows that q4/σ² and q5/(σ² + nω²) are independent central chi-square variates with N − k and k − 1 degrees of freedom respectively. (That these are central chi-square variates can be proved more formally by showing that, since E(yij) = µ for all i and j, λ4 = λ5 = 0.)
In this type of model we are usually not interested specifically in a1 , · · · , ak since the k populations
were selected at random from a large number of populations. The interest is usually in the overall
mean µ and the variance ω 2 of the parameters ai . An unbiased estimator for µ is
µ̂ = ȳ..
while

E(ȳ..) = µ

and

Var(ȳ..) = ω²/k + σ²/(nk) = (1/N)(nω² + σ²).

As was seen, an unbiased estimator for nω² + σ² is q5/(k − 1); thus

(ȳ.. − µ)/√((1/N)(q5/(k − 1))) ∼ t_{k−1}.

Thus we write ω̂² = (1/n)[q5/(k − 1) − q4/(N − k)].
There is a problem with this estimator: while ω 2 ≥ 0 one may sometimes find that ω̂ 2 < 0. In
such a case it is standard practice to set ω̂ 2 = 0.
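The point estimators, including the truncation at zero, can be sketched as follows (the function name and the illustrative inputs are ours):

```python
def variance_components(q4, q5, k, N, n):
    """Point estimates for the balanced random-effects one-way model:
    sigma^2 hat = q4/(N - k) and
    omega^2 hat = (q5/(k - 1) - q4/(N - k)) / n, truncated at zero."""
    sigma2 = q4 / (N - k)
    omega2 = (q5 / (k - 1) - sigma2) / n
    return sigma2, max(omega2, 0.0)

# Hypothetical balanced design: k = 4 groups of n = 6 observations, N = 24.
s2, w2 = variance_components(q4=228, q5=336, k=4, N=24, n=6)
print(round(s2, 4), round(w2, 4))  # 11.4 16.7667
```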
Thus f ∼ F_{k−1;N−k} if and only if ω² = 0. If ω² > 0 we may expect f to be larger than an F_{k−1;N−k} statistic, and H0 : ω² = 0 is rejected if f assumes large values.
An approximate confidence interval for ω 2 may be found, but we do not discuss this topic. Of
greater interest are the quantities ω 2 /σ 2 , σ 2 /(ω 2 + σ 2 ) and ω 2 /(ω 2 + σ 2 ). The latter is of particular
interest: The variance of each observation yij is σ 2 + ω 2 . A portion ω 2 of this variance is ascribable
to the differences in population means, and ω 2 /(ω 2 + σ 2 ) is the proportion of the total variance
which is due to these differences.
As before, let

f = [q5/(k − 1)] / [q4/(N − k)].

Then [σ²/(nω² + σ²)] f ∼ F_{k−1;N−k}. Let

F1 = F_{1−α/2;k−1;N−k} = 1/F_{α/2;N−k;k−1}
F2 = F_{α/2;k−1;N−k}.
Then

1 − α = P(F1 < [σ²/(nω² + σ²)] f < F2)
      = P(f/F2 < (nω² + σ²)/σ² < f/F1)
      = P(f/F2 < 1 + nω²/σ² < f/F1)
      = P(−1/n + f/(nF2) < ω²/σ² < −1/n + f/(nF1))
      = P((f − F2)/(nF2) < ω²/σ² < (f − F1)/(nF1))
      = P(nF1/(f − F1) < σ²/ω² < nF2/(f − F2))
      = P(1 + nF1/(f − F1) < 1 + σ²/ω² < 1 + nF2/(f − F2))
      = P((f − F1 + nF1)/(f − F1) < (ω² + σ²)/ω² < (f − F2 + nF2)/(f − F2))
      = P((f − F2)/(f − F2 + nF2) < ω²/(ω² + σ²) < (f − F1)/(f − F1 + nF1))
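The final interval can be wrapped in a small helper (a sketch; the function name and the illustrative inputs are ours):

```python
def icc_ci(f, F1, F2, n):
    """Confidence limits for omega^2/(omega^2 + sigma^2) from the
    derivation above: lower = (f - F2)/(f - F2 + n*F2) and
    upper = (f - F1)/(f - F1 + n*F1)."""
    lower = (f - F2) / (f - F2 + n * F2)
    upper = (f - F1) / (f - F1 + n * F1)
    return lower, upper

# Hypothetical inputs: f = 4.8942, F1 = 0.1706, F2 = 3.06, n = 4.
lo, hi = icc_ci(4.8942, 0.1706, 3.06, 4)
print(round(lo, 4), round(hi, 4))  # 0.1303 0.8738
```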
Example 3.5.
A factory has a large number of machines which produce the same product. The mass of each
product unit is a random variable which varies from unit to unit; the variance of this random
variable is σ 2 . The mean mass per unit also varies from machine to machine due to the fact
that the machines are not always calibrated precisely. Suppose that four machines are selected at
random, and a sample of six units produced by each of these machines is selected randomly and
weighed. The results (mass in grams) are as follows:
Unit Machine
1 2 3 4
1 201 198 211 204
2 198 196 215 202
3 209 201 207 203
4 197 200 209 206
5 203 206 208 201
6 204 199 210 208
(c) Estimate the proportion of the total variation ascribable to the machines.
(d) Find the 90% confidence interval for ω²/(σ² + ω²).
(e) Estimate the overall mean and find the 95% confidence interval for µ.
Solution 3.5.

The model is yij = µ + ai + εij ; j = 1, · · · , n; i = 1, · · · , k, where k = 4 and n = 6, and E(ai) = E(εij) = 0, ai ∼ n(0; ω²), εij ∼ n(0; σ²) and the ai and εij are mutually independent. We compute
ȳ1. = 202;  ȳ2. = 200;  ȳ3. = 210;  ȳ4. = 204;  ȳ.. = 204

Σj (y1j − ȳ1.)² = (201 − 202)² + (198 − 202)² + . . . + (204 − 202)² = 96
Σj (y2j − ȳ2.)² = (198 − 200)² + (196 − 200)² + . . . + (199 − 200)² = 58
Σj (y3j − ȳ3.)² = (211 − 210)² + (215 − 210)² + . . . + (208 − 210)² = 40
Σj (y4j − ȳ4.)² = (204 − 204)² + (202 − 204)² + . . . + (208 − 204)² = 34

q5 = n Σi (ȳi. − ȳ..)² = 6[(−2)² + (−4)² + 6² + 0²] = 6(4 + 16 + 36 + 0) = 336
(b) H0 : ω² = 0
q4 = 96 + 58 + 40 + 34 = 228, so f = [q5/(k − 1)]/[q4/(N − k)] = [336/3]/[228/20] = 112/11.4 ≈ 9.8246.
F_{α;k−1;N−k} = F_{0.05;3;20} = 3.10. Reject H0 if f > 3.1.
Since 9.8246 > 3.10, the F-value is significant at the 5% level and our conclusion is that there is significant variation between machines − they should perhaps be calibrated.
(c) We have

σ̂² = 11.4
σ̂² + 6ω̂² = 112
ω̂² = 16.7667

ω̂²/(σ̂² + ω̂²) = 16.7667/(11.4 + 16.7667) ≈ 0.60.

Thus we estimate that about 60% of the variation is due to differences between machines.
(d) The confidence limits are

L1 = (f − F2)/(f − F2 + nF2)  and  L2 = (f − F1)/(f − F1 + nF1)

with

F1 = F_{1−α/2;k−1;N−k} = 1/F_{α/2;N−k;k−1} = 1/F_{0.05;20;3} = 1/8.66 ≈ 0.1155

and F2 = F_{0.05;3;20} = 3.10, so that L1 = (9.8246 − 3.10)/(9.8246 − 3.10 + 6(3.10)) ≈ 0.2655 and L2 = (9.8246 − 0.1155)/(9.8246 − 0.1155 + 6(0.1155)) ≈ 0.9334.
(e) The estimate of the overall mean is ȳ.. = 204 and the confidence interval for µ is

ȳ.. ± t_{α/2;k−1} × √(q5/(N(k − 1)))
204 ± 3.182 × √(336/(24 × 3))
204 ± 6.8739
(197.1261 ; 210.8739).
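The interval in (e) can be checked numerically (a minimal sketch with the tabled value t_{0.025;3} = 3.182):

```python
import math

# 95% CI for mu in Example 3.5: ybar.. +- t_{0.025;k-1} * sqrt(q5/(N(k-1))).
q5, N, k, ybar = 336, 24, 4, 204
half = 3.182 * math.sqrt(q5 / (N * (k - 1)))
print(round(ybar - half, 4), round(ybar + half, 4))  # 197.1261 210.8739
```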
From the SAS JMP output in figure 3.2 below, we note that machine 3, especially, produces units with higher mass than the others. Machine 3's diamond does not overlap with any of the other diamonds; thus one can expect it to produce units significantly different from all the other machines. Note also that, like most statistical computer programs, SAS JMP does not distinguish between random and fixed effects models − the distinction has to be made by the user.
Example 3.6.
A large company has a number of personnel officers, and management wants to find out whether
the personnel selection is uniform or whether the variation between personnel officers is significant
compared to the variation between candidates. Five of the personnel officers are selected at ran-
dom, and each is assigned four applicants for testing. Their ratings of the applicants are as follows:
(a) and test at the 5% level whether the officers award different ratings on the average;
(b) estimate the mean rating of all candidates and construct a 90% confidence interval for it;
(c) estimate ω 2 /(σ 2 + ω 2 ) and construct a 90% confidence interval for it.
Solution 3.6.
yij = µ + ai + εij ;  j = 1, · · · , n;  i = 1, · · · , k

where k = 5 and n = 4, and E(ai) = E(εij) = 0, ai ∼ n(0; ω²), εij ∼ n(0; σ²) and the ai and εij are mutually independent.
(a)
Personnel officer
A B C D E
1 75 47 67 76 77
2 91 50 75 59 65
3 72 63 82 82 86
4 86 64 80 67 76
ȳi. 81 56 76 71 76
N = nk = 20 ȳ.. = 72
Σj (y1j − ȳ1.)² = (75 − 81)² + (91 − 81)² + (72 − 81)² + (86 − 81)² = (−6)² + 10² + (−9)² + 5² = 242
Σj (y2j − ȳ2.)² = (47 − 56)² + (50 − 56)² + (63 − 56)² + (64 − 56)² = (−9)² + (−6)² + 7² + 8² = 230
Σj (y3j − ȳ3.)² = (67 − 76)² + (75 − 76)² + (82 − 76)² + (80 − 76)² = (−9)² + (−1)² + 6² + 4² = 134
Σj (y4j − ȳ4.)² = (76 − 71)² + (59 − 71)² + (82 − 71)² + (67 − 71)² = 306
Σj (y5j − ȳ5.)² = (77 − 76)² + (65 − 76)² + (86 − 76)² + (76 − 76)² = 1² + (−11)² + 10² + 0² = 222
q5 = n Σi (ȳi. − ȳ..)² = n Σi ȳi.² − nk ȳ..²
   = 4(81² + 56² + 76² + 71² + 76²) − (4)(5)(72)²
   = 1 480

or alternatively

q5 = n Σi (ȳi. − ȳ..)²
   = 4[(81 − 72)² + (56 − 72)² + (76 − 72)² + (71 − 72)² + (76 − 72)²]
   = 4(370)
   = 1 480
q4 = 242 + 230 + 134 + 306 + 222 = 1 134, so f = [q5/(k − 1)]/[q4/(N − k)] = [1 480/4]/[1 134/15] = 370/75.6 ≈ 4.8942.
Since 4.8942 > 3.06, H0 is rejected at the 5% level and our conclusion is that there is significant variation between personnel officers.
(b)
µ̂ = ȳ.. = (1/N) Σi Σj yij = (1/k) Σi ȳi.
   = (81 + 56 + 76 + 71 + 76)/5 = 360/5 = 72.
Thus, the 90% confidence interval for the mean rating of all candidates is

ȳ.. ± t_{α/2;k−1} × √(q5/(N(k − 1)))
72 ± 2.132 × √(1 480/(20(5 − 1)))
72 ± 2.132 × √18.5
72 ± 9.1701
(62.8299 ; 81.1701)
(c) Estimating ω²/(σ² + ω²): σ̂² = 75.6. Now

σ̂² + 4ω̂² = 370
4ω̂² = 294.4
⇒ ω̂² = 73.6.

Then

ω̂²/(σ̂² + ω̂²) = 73.6/(75.6 + 73.6) = 73.6/149.2 ≈ 0.4933.
The confidence interval for ω²/(σ² + ω²) is given by

P((f − F2)/(f − F2 + nF2) < ω²/(σ² + ω²) < (f − F1)/(f − F1 + nF1))

where F1 = 1/F_{α/2;N−k;k−1} and F2 = F_{α/2;k−1;N−k}.

Now f = 4.8942, α = 0.1, α/2 = 0.05, F_{α/2;N−k;k−1} = F_{0.05;15;4} = 5.86, so F1 = 1/5.86 ≈ 0.1706 and F2 = F_{0.05;4;15} = 3.06.

The 90% confidence interval for ω²/(σ² + ω²) is

0.90 = P((4.8942 − 3.06)/(4.8942 − 3.06 + 4(3.06)) < ω²/(σ² + ω²) < (4.8942 − 0.1706)/(4.8942 − 0.1706 + 4(0.1706)))
     = P(1.8342/14.0742 < ω²/(σ² + ω²) < 4.7236/5.406)
     = P(0.1303 < ω²/(σ² + ω²) < 0.8738).
We now consider experiments in which there are two factors (treatments or blocking factors), say
factor A with k levels A1 , · · · , Ak and factor B with m levels B1 · · · , Bm . The expected value of
an observation in the cell defined by levels Ai and Bj is µij , say. The expected values are then as
follows:
          B1    B2    · · ·  Bm
A1        µ11   µ12   · · ·  µ1m
A2        µ21   µ22   · · ·  µ2m
· · ·
Ak        µk1   µk2   · · ·  µkm

(rows: levels of factor A; columns: levels of factor B)
An additive model specifies that the following relationship exists between these expected values:
µij = µ + αi + βj ; j = 1, · · · , m; i = 1, · · · , k.
B1 B2 B3
A1 µ11 =10 µ12 =15 µ13 =12
A2 µ21 =20 µ22 =25 µ23 =22
The expected yield at level B2 is five units more than the expected yield at level B1 , irrespective
of the level of A(µ12 − µ11 = µ22 − µ21 = 5). Likewise, the increase in yield from A1 to A2 is ten
units, irrespective of the level of B(µ21 − µ11 = µ22 − µ12 = µ23 − µ13 = 10).
In practice it sometimes happens that the model is not additive, but that interaction is present.
Examples of interaction
The addition of calcium to the soil may have a beneficial effect on some plants, but will have a
detrimental effect on acid-loving plants. We say the factor calcium interacts with the factor type
of plant.
A specific hormone treatment may be beneficial to women but have no effect on men. (In such a
case sex and hormone treatment interact.)
Administering either drug A or drug B to a patient may be beneficial, but both drugs together
may have a detrimental effect. In other examples drug A and drug B jointly may be much more
beneficial than the sum of the effect of the two drugs separately. In both cases the factors drug A
and drug B interact.
Two chemicals may, if used separately, have very little effect on a chemical process, but jointly
they may have a profound effect.
You may think of further examples of interaction. The situation may be presented graphically as
in the following figures (figure 3.3-figure 3.6) (we assume that factor A has two levels A1 and A2
while B has three levels B1 , B2 and B3 ).
Figure 3.3:
Figure 3.4:
Figure 3.5:
Figure 3.6:
If the lines joining the expected yields at A1 and A2 are parallel (figures 3.3 and 3.4) there is no
interaction between A and B present (otherwise it is present, as in figures 3.5 and 3.6).
The presence of interaction is a factor which complicates the interpretation of main effects. In
some applications it may appear as if one or both treatments have no effect on the yield at all,
while the presence of interaction is actually an indication that both factors do have an effect on
the yield. Consider the following example:
B1 B2 B3 M ean
A1 µ11 = 10 µ12 = 15 µ13 = 20 µ1. = 15
A2 µ21 = 20 µ22 = 15 µ23 = 10 µ2. = 15
M ean µ.1 = 15 µ.2 = 15 µ.3 = 15 µ.. = 15
An analysis of variance test for the A-effect is actually a test for H0 : µ̄1. = µ̄2. = . . . = µ̄k. while a
test for the B-effect is a test for H0 : µ̄.1 = µ̄.2 = . . . = µ̄.m. . In the above example both hypotheses
are true, while it is certainly not true that the treatments have no effect at all on the yield. Other
examples may be constructed where for instance the A-effect is significant and the B-effect not
significant, while the presence of interaction implies that treatment B does influence the yield, but
the magnitude of the influence depends on the level of A.
Thus, if an analysis of variance shows that there is interaction, the possible non-significance of the
main effects (effects of A and B) must not be seen as a proof that A and/or B has no influence on
the yield. Graphs like figures 3.3 and 3.4, but with sample means rather than expected yield on
the vertical axis, may be useful in interpreting the results.
If interaction is present (or rather, unless one is sure that there is no interaction), the means are
described as follows:
µij = µ + αi + βj + (αβ)ij
where (αβ) is a symbol like α and β and does not indicate multiplication; sometimes γ is used
instead of (αβ). In this model α1 , · · · , αk are the main effects for treatment A, β1 , · · · , βm are the
main effects for treatment B and (αβ)ij ; i = 1, · · · , k; j = 1, · · · , m are the interaction effects.
with

Σi (αβ)ij = 0 for all j;
Σj (αβ)ij = 0 for all i;
In the remainder of this chapter we shall assume throughout that there are equal numbers of
observations per cell. The reason for this assumption is the fact that the notation and algebra
needed to provide for unequal numbers of observations per cell are somewhat more complicated.
The correct formulae may be found in any textbook on the analysis of variance.
We now consider experiments in which there are two factors (treatments or blocking factors)
present. Both factors are assumed to be fixed, and there are equal numbers of observations per
cell. We distinguish between the two cases: one observation per cell and n observations per cell
with n > 1.
the original purpose of the experiment. Since there is only one observation per cell, there is no
way of determining whether interaction exists or not (for example whether the difference between
supermarkets A and B is increasing, whether one supermarket was most expensive to begin with
and was overtaken by another supermarket, et cetera). One simply has to assume that there
is no interaction. (There is, however, a test based on the special model (αβ)ij = γαi βj to test
H0 : γ = 0. See J.W. Tukey (1949): One degree of freedom for non-additivity. Biometrics,
page 232.) For this reason this type of experiment should be performed only if it is impossible to
replicate the experiment for practical reasons or if one knows from past experience that interaction
has never been present in similar experiments.
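To make the Tukey reference concrete, here is a sketch of the one-degree-of-freedom statistic SS_nonadd = (Σi Σj yij ai bj)²/(Σi ai² Σj bj²), with ai = ȳi. − ȳ.. and bj = ȳ.j − ȳ.., applied to the boiling-time data of Example 3.8 later in this chapter (our implementation, so treat it as illustrative):

```python
# Tukey's one-degree-of-freedom test for non-additivity,
# based on the special model (alpha beta)_ij = gamma * alpha_i * beta_j.
y = [
    [21, 22, 23],
    [23, 33, 43],
    [25, 33, 26],
    [31, 36, 44],
]
k, m = len(y), len(y[0])
row = [sum(r) / m for r in y]                                  # ybar_i.
col = [sum(y[i][j] for i in range(k)) / k for j in range(m)]   # ybar_.j
grand = sum(map(sum, y)) / (k * m)
a = [r - grand for r in row]
b = [c - grand for c in col]

num = sum(y[i][j] * a[i] * b[j] for i in range(k) for j in range(m)) ** 2
ss_nonadd = num / (sum(ai ** 2 for ai in a) * sum(bj ** 2 for bj in b))
ss_res = sum((y[i][j] - row[i] - col[j] + grand) ** 2
             for i in range(k) for j in range(m))
f = ss_nonadd / ((ss_res - ss_nonadd) / ((k - 1) * (m - 1) - 1))
print(round(ss_nonadd, 4), round(f, 4))  # compare f with F_{0.05;1;5} = 6.61
```

Here f ≈ 3.23 < 6.61, so H0 : γ = 0 would not be rejected for these data.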
The model is

yij = µ + αi + βj + εij ;  i = 1, · · · , k;  j = 1, · · · , m

where Σi αi = Σj βj = 0.
In this formulation α1 , · · · , αk are called the effects of treatment A and β1 , · · · , βm the effects of
treatment B. The null hypotheses to be tested are usually
H0 : α1 = · · · = αk (= 0)
and H0 : β1 = · · · = βm (= 0).
ȳi. = (1/m) Σj yij ;  i = 1, · · · , k,
ȳ.j = (1/k) Σi yij ;  j = 1, · · · , m,
ȳ.. = (1/N) Σi Σj yij = (1/m) Σj ȳ.j = (1/k) Σi ȳi. .
Writing
yij − ȳ.. = (yij − ȳi. − ȳ.j + ȳ.. ) + (ȳi. − ȳ.. ) + (ȳ.j − ȳ.. ),
taking squares on both sides, summing over i and j and noting that the sums of the cross-products
of the terms on the right-hand side are all equal to zero, we obtain the identity
q2 = q4 + q5 + q6
As with one-way analysis of variance, we may write q4 , q5 and q6 in the forms qi = y 0 Ai y and
show that A4 A4 = A4 , A5 A5 = A5 , A6 A6 = A6 , A4 A5 = A4 A6 = A5 A6 = O, r(A4 ) =
(k − 1)(m − 1), r(A5 ) = k − 1, r(A6 ) = m − 1. This is left as an exercise for you to do.
λ4 = 0;  λ5 = m Σi αi²/σ²;  λ6 = k Σj βj²/σ².
yij = µ + αi + βj + εij

ȳ.j = (1/k) Σi (µ + αi + βj + εij)
    = µ + (1/k) Σi αi + βj + (1/k) Σi εij
    = µ + βj + (1/k) Σi εij   since Σi αi = 0.

E(ȳ.j) = E(µ) + E(βj) + (1/k) Σi E(εij) = µ + βj

Var(ȳ.j) = Var(µ) + Var(βj) + (1/k²) Σi Var(εij) = (1/k²)(kσ²) = σ²/k

E(ȳ.j²) = Var(ȳ.j) + [E(ȳ.j)]² = σ²/k + (µ + βj)².
Also

ȳ.. = (1/km) Σi Σj (µ + αi + βj + εij)
    = µ + (1/k) Σi αi + (1/m) Σj βj + (1/km) Σi Σj εij
    = µ + (1/km) Σi Σj εij

E(ȳ..) = E(µ + (1/km) Σi Σj εij) = µ

Var(ȳ..) = Var(µ) + (1/(km)²) Σi Σj Var(εij) = (1/(km)²)(kmσ²) = σ²/(km)

E(ȳ..²) = Var(ȳ..) + [E(ȳ..)]² = σ²/(km) + µ².
Now

q6 = k Σj (ȳ.j − ȳ..)² = k(Σj ȳ.j² − m ȳ..²)

∴ E(q6) = k Σj (σ²/k + (µ + βj)²) − km(µ² + σ²/(km))
        = k Σj (σ²/k + µ² + 2µβj + βj²) − kmµ² − σ²
        = mσ² + kmµ² + 2µk Σj βj + k Σj βj² − kmµ² − σ²
        = mσ² + kmµ² + k Σj βj² − kmµ² − σ²
        = (m − 1)σ² + k Σj βj²   (remember that Σj βj = 0)

∴ E(q6/σ²) = (m − 1) + k Σj βj²/σ²
∴ λ6 = k Σj βj²/σ² from (D8).
Example 3.7.
Using data in section 3.1.2:
Solution 3.7.
Factor B: Supermarkets
ȳ.1 = 31;  ȳ.2 = 25;  ȳ.3 = 34

Grand mean
ȳ.. = 30

Factor A: Months
ȳ1. = 27;  ȳ2. = 28;  ȳ3. = 31;  ȳ4. = 34
We see that there is a consistent increase in the average price from month to month. (This
may of course be a result of a real price increase and/or an expanding shopping list.)
SSA = m Σi (ȳi. − ȳ..)²
    = 3[(27 − 30)² + (28 − 30)² + (31 − 30)² + (34 − 30)²]
    = 3((−3)² + (−2)² + 1² + 4²)
    = 90

SSB = k Σj (ȳ.j − ȳ..)²
    = 4[(31 − 30)² + (25 − 30)² + (34 − 30)²]
    = 4(1² + (−5)² + 4²)
    = 168

SSTotal = Σi Σj (yij − ȳ..)² = 384

SSResidual = SSTotal − SSA − SSB = 384 − 90 − 168 = 126
Actually it is advisable to compute SSResidual directly and use it as a test of the accuracy of
the computations.
SSResidual = Σi Σj (yij − ȳi. − ȳ.j + ȳ..)² = . . . + (37 − 34 − 34 + 30)² = 126
The SAS JMP output which follows, shows that supermarket 2 (B) had the lowest prices on aver-
age, while the prices rose steadily over the four months.
In such small experiments there must be a considerable difference between level means before
a significant result is obtained. The housewife may now decide to continue the experiment for a
few more months, maybe trying to eliminate some of the sources of variation. She is, however,
more likely to decide that she will buy at the second supermarket in future.
The residual which may be examined for signs of deviations from the model is êij = yij − ȳi. − ȳ.j + ȳ.. .
Example 3.8.
In order to decide in which of three types of saucepans water will boil the quickest, the three types
of saucepans were tested on four types of stoves in a Domestic Science laboratory. A fixed amount
of water at room temperature was placed in each saucepan, and the time to boil (in seconds, with
timing started exactly three minutes after the stoves had been switched on) recorded. The results are:
Stoves Saucepans
I II III
1 21 22 23
2 23 33 43
3 25 33 26
4 31 36 44
Test
(a) the null hypothesis that the saucepans are not different;
(b) the null hypothesis that the stoves are not different (with respect to boiling speed).
Solution 3.8.
Saucepans
Stoves I II III ȳi.
1 21 22 23 22
2 23 33 43 33
3 25 33 26 28
4 31 36 44 37
ȳ.j 25 31 34
Factor A: Stoves
Factor B: Saucepans

Grand mean
ȳ.. = 30
SSA = m Σi (ȳi. − ȳ..)²
    = 3[(22 − 30)² + (33 − 30)² + (28 − 30)² + (37 − 30)²]
    = 3((−8)² + 3² + (−2)² + 7²)
    = 378

SSB = k Σj (ȳ.j − ȳ..)²
    = 4[(25 − 30)² + (31 − 30)² + (34 − 30)²]
    = 4((−5)² + 1² + 4²)
    = 168

SSResidual = Σi Σj (yij − ȳi. − ȳ.j + ȳ..)² = 158
(a) H0 : α1 = α2 = α3 = α4 = 0
f = [SSA/(k − 1)]/[SSResidual/((k − 1)(m − 1))] = [378/3]/[158/6] = 126/26.3333 ≈ 4.7848
F_{α;k−1;(k−1)(m−1)} = F_{0.05;3;6} = 4.76. Reject H0 if f > 4.76. Since 4.7848 > 4.76, we reject H0 at the 5% level of significance and conclude that the stoves are significantly different with respect to boiling speed.
(b) H0 : β1 = β2 = β3 = 0
f = [SSB/(m − 1)]/[SSResidual/((k − 1)(m − 1))] = [168/2]/[158/6] = 84/26.3333 ≈ 3.1899
F_{α;m−1;(k−1)(m−1)} = F_{0.05;2;6} = 5.14. Reject H0 if f > 5.14.
Since 3.1899 < 5.14, we do not reject H0 at the 5% level of significance and conclude that the saucepans do not differ significantly with respect to boiling speed.
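Both F-statistics of Example 3.8 can be reproduced with a short sketch (standard library only; variable names are ours):

```python
# Two-way ANOVA with one observation per cell: stoves (rows) x saucepans (columns).
y = [[21, 22, 23], [23, 33, 43], [25, 33, 26], [31, 36, 44]]
k, m = len(y), len(y[0])
row = [sum(r) / m for r in y]                                  # ybar_i.
col = [sum(y[i][j] for i in range(k)) / k for j in range(m)]   # ybar_.j
grand = sum(map(sum, y)) / (k * m)

ss_a = m * sum((r - grand) ** 2 for r in row)      # stoves
ss_b = k * sum((c - grand) ** 2 for c in col)      # saucepans
ss_res = sum((y[i][j] - row[i] - col[j] + grand) ** 2
             for i in range(k) for j in range(m))

df_res = (k - 1) * (m - 1)
f_a = (ss_a / (k - 1)) / (ss_res / df_res)
f_b = (ss_b / (m - 1)) / (ss_res / df_res)
print(ss_a, ss_b, ss_res, round(f_a, 4), round(f_b, 4))
# 378.0 168.0 158.0 4.7848 3.1899
```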
3.7.1 Illustration
The data in section 3.3.1 are an example of a mixed model (diets a fixed effect and litters a random
effect) but we cannot distinguish between the analysis of the three types of model if there is one
observation per cell only. The SAS JMP output (see figures 3.9 and 3.10) actually assumes a fixed
effects model, but the results which follow are easy to interpret.
The model is

yijh = µ + αi + βj + (αβ)ij + εijh ;  i = 1, · · · , k;  j = 1, · · · , m;  h = 1, · · · , n;

where

Σi αi = Σj βj = 0
Σi (αβ)ij = 0 for all j
Σj (αβ)ij = 0 for all i
µ = overall mean response; the average of the mean responses for the km populations.
αi = effect of the i-th level of the first factor averaged over the m levels of the second factor,
(the i-th level of the first factor adds αi to the overall mean µ).
βj = effect of the j-th level of the second factor averaged over the k levels of the first factor.
(αβ)ij = interaction between the i-th level of the first factor and the j-th level
of the second factor (the population means for the ij-th treatment minus µ + αi + βj ).
εijh = deviation of yijh from the population mean response for the ij-th treatment combination.
The terms αi and βj are called main effects. The term (αβ)ij is an interaction.
H0 : α1 = . . . = αk (= 0)
(if all αi are equal and Σαi = 0 then each must be equal to zero)
H0 : β1 = . . . = βm (= 0)
ȳi.. = (1/mn) Σj Σh yijh = (1/m) Σj ȳij.   (the level means of the levels of factor A);  i = 1, · · · , k

ȳ.j. = (1/nk) Σi Σh yijh = (1/k) Σi ȳij.   (the level means of the levels of factor B);  j = 1, · · · , m

ȳ... = (1/N) Σi Σj Σh yijh   (the overall mean)
     = (1/m) Σj ȳ.j. = (1/k) Σi ȳi.. = (1/km) Σi Σj ȳij. .
SSA = mn Σi (ȳi.. − ȳ...)²
SSB = kn Σj (ȳ.j. − ȳ...)²
SSAB = n Σi Σj (ȳij. − ȳi.. − ȳ.j. + ȳ...)²
SSWS = SSWithin samples = Σi Σj Σh (yijh − ȳij.)²
SSTotal = Σi Σj Σh (yijh − ȳ...)².
Activity 3.1.
Prove as an exercise that SSTotal = SSA + SSB + SSAB + SSWithin samples .
(This equality holds only if the number of observations per cell is the same for all cells.)
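Activity 3.1 can also be verified numerically; the sketch below checks the decomposition on a small hypothetical balanced layout (k = m = n = 2; the data are invented):

```python
# Numerical check of the identity SSTotal = SSA + SSB + SSAB + SSWS.
y = {(1, 1): [3, 5], (1, 2): [6, 8], (2, 1): [2, 2], (2, 2): [9, 11]}
k, m, n = 2, 2, 2
cell = {ij: sum(obs) / n for ij, obs in y.items()}              # cell means
yi = {i: sum(cell[(i, j)] for j in (1, 2)) / m for i in (1, 2)}  # ybar_i..
yj = {j: sum(cell[(i, j)] for i in (1, 2)) / k for j in (1, 2)}  # ybar_.j.
grand = sum(cell.values()) / (k * m)

ss_a = m * n * sum((yi[i] - grand) ** 2 for i in (1, 2))
ss_b = k * n * sum((yj[j] - grand) ** 2 for j in (1, 2))
ss_ab = n * sum((cell[(i, j)] - yi[i] - yj[j] + grand) ** 2
                for i in (1, 2) for j in (1, 2))
ss_ws = sum((v - cell[ij]) ** 2 for ij, obs in y.items() for v in obs)
ss_total = sum((v - grand) ** 2 for obs in y.values() for v in obs)

print(ss_a, ss_b, ss_ab, ss_ws, ss_total)  # 0.5 60.5 12.5 6.0 79.5
```

The four components sum to SSTotal, as the activity asks you to prove in general for balanced layouts.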
These SSs may also be written in the form of y 0 Ai y and in this way it may be proved that the four
SSs on the right-hand side are independent with degrees of freedom and sums of squares as given
in the following ANOVA table:
Example 3.9.
Derive the formula for E(MS) for the A effect, i.e. E(MSA).
Solution 3.9.

SSA = mn Σi (ȳi.. − ȳ...)² = mn Σi ȳi..² − kmn ȳ...²

ȳi.. = (1/mn) Σj Σh (µ + αi + βj + (αβ)ij + εijh)
     = µ + αi + (1/m) Σj βj + (1/m) Σj (αβ)ij + (1/mn) Σj Σh εijh

E(ȳi..) = E(µ) + E(αi) + (1/m) Σj E(βj) + (1/m) Σj E((αβ)ij) + (1/mn) Σj Σh E(εijh)
        = µ + αi   (since Σj βj = Σj (αβ)ij = 0)

Var(ȳi..) = Var(µ) + Var(αi) + (1/m²) Σj Var(βj) + (1/m²) Σj Var((αβ)ij) + (1/(mn)²) Σj Σh Var(εijh)
          = σ²/(mn)

E(ȳi..²) = Var(ȳi..) + [E(ȳi..)]² = σ²/(mn) + (µ + αi)²
ȳ... = (1/kmn) Σi Σj Σh (µ + αi + βj + (αβ)ij + εijh)
     = µ + (1/k) Σi αi + (1/m) Σj βj + (1/km) Σi Σj (αβ)ij + (1/kmn) Σi Σj Σh εijh

E(ȳ...) = µ

Var(ȳ...) = (1/k²) Σi Var(αi) + (1/m²) Σj Var(βj) + (1/(km)²) Σi Σj Var((αβ)ij) + (1/(kmn)²) Σi Σj Σh Var(εijh)
          = σ²/(kmn)

E(ȳ...²) = Var(ȳ...) + [E(ȳ...)]² = σ²/(kmn) + µ²
k
!
X
E(SSA ) = E mn (ȳi.. − ȳ... )2
i=1
k
X
2 2
= mn E(ȳi.. ) − kmnE(ȳ... )
i=1
k
σ2
2
X
2 σ 2
= mn + (µ + αi ) − kmnE +µ
i=1
mn kmn
k
kmnσ 2 X kmnσ 2
= + mn (µ2 + 2µαi + αi2 ) − − kmnµ2
mn i=1
kmn
k
X k
X
2 2
= kσ + kmnµ + 2mnµ αi + mn αi2 − σ 2 − kmnµ2
i=1 i=1
k
X k
X
2
= σ (k − 1) + mn αi2 since αi = 0
i=1 i=1
3. ANALYSIS OF VARIANCE 195 STA3701/1
E(SSA )
E(M SA ) =
k−1
σ 2 (k − 1) + mn ki=1 αi2
P
=
k−1
k
2 mn X 2
= σ + α
k − 1 i=1 i
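The result for $E(MS_A)$ can be illustrated by simulation. In the sketch below all parameter values are illustrative assumptions (k = 3, m = 2, n = 4, σ = 1, α = (−1, 0, 1), with the β and interaction effects taken as zero); the average of MS_A over many simulated experiments should be close to the theoretical value 1 + 4 × 2 = 9:

```python
import numpy as np

# Monte Carlo illustration of E(MS_A) = sigma^2 + mn/(k-1) * sum(alpha_i^2).
# All parameter values are illustrative assumptions, not from the Guide.
rng = np.random.default_rng(0)
k, m, n, sigma, mu = 3, 2, 4, 1.0, 10.0
alpha = np.array([-1.0, 0.0, 1.0])            # fixed A effects, sum to zero

reps = 4000
msa = np.empty(reps)
for r in range(reps):
    y = mu + alpha[:, None, None] + rng.normal(0.0, sigma, size=(k, m, n))
    ybar_A = y.mean(axis=(1, 2))
    msa[r] = m * n * np.sum((ybar_A - y.mean()) ** 2) / (k - 1)

expected = sigma**2 + m * n / (k - 1) * np.sum(alpha**2)   # 1 + 4*2 = 9
print(msa.mean(), expected)                   # the two values should be close
```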
Activity 3.2.
Derive the other E(MS) as an exercise.
Remarks:
$H_0: \beta_r = \beta_s$ is rejected if
\[
\frac{(\bar{y}_{.r.} - \bar{y}_{.s.})^2}{2/(kn)} > (m-1)\hat{\sigma}^2 F_{\alpha;\,m-1;\,km(n-1)} .
\]
Example 3.10.
In order to test whether the method of display has an effect on bread sales, a bakery selected 12
comparable supermarkets and requested each to display its bread according to a specification of shelf
height (bottom, middle and top) and width of shelf (regular and wide). The bread sales were as
follows:
Test the three null hypotheses (regarding interaction, A-effect and B-effect), each at the 5% level.
Solution 3.10.
The model is
\[
y_{ijh} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijh}, \qquad i = 1,\dots,k;\ j = 1,\dots,m;\ h = 1,\dots,n,
\]
where
\[
\sum_i \alpha_i = \sum_j \beta_j = 0, \qquad
\sum_i (\alpha\beta)_{ij} = 0 \text{ for all } j, \qquad
\sum_j (\alpha\beta)_{ij} = 0 \text{ for all } i .
\]
\[
SS_A = mn\sum_{i=1}^{k}(\bar{y}_{i..} - \bar{y}_{...})^2 = 1\,544
\]
\[
SS_B = kn\sum_{j=1}^{m}(\bar{y}_{.j.} - \bar{y}_{...})^2 = 6[(-1)^2 + 1^2] = 12
\]
\[
SS_{AB} = n\sum_{i=1}^{k}\sum_{j=1}^{m}(\bar{y}_{ij.} - \bar{y}_{.j.} - \bar{y}_{i..} + \bar{y}_{...})^2 = 24
\]
\[
SS_{WS} = \sum_{i=1}^{k}\sum_{j=1}^{m}\sum_{h=1}^{n}(y_{ijh} - \bar{y}_{ij.})^2
= (47-45)^2 + (43-45)^2 + (46-43)^2 + \dots + (42-44)^2 + (46-44)^2 = 62
\]
\[
SS_{\text{Total}} = \sum_{i=1}^{k}\sum_{j=1}^{m}\sum_{h=1}^{n}(y_{ijh} - \bar{y}_{...})^2
= (47-51)^2 + (43-51)^2 + (46-51)^2 + \dots + (42-51)^2 + (46-51)^2 = 1\,642
\]
We now assume that the k levels of factor A constitute a random sample from a large population
of possible levels of A, and the m levels of factor B constitute a random sample from a large
population of possible levels of B.
The model is
\[
y_{ijh} = \mu + a_i + b_j + (ab)_{ij} + \epsilon_{ijh}, \qquad i = 1,\dots,k;\ j = 1,\dots,m;\ h = 1,\dots,n,
\]
where
\[
a_i \sim n(0;\sigma_a^2), \quad b_j \sim n(0;\sigma_b^2), \quad (ab)_{ij} \sim n(0;\sigma_{ab}^2), \quad \epsilon_{ijh} \sim n(0;\sigma^2)
\]
and where $a_1,\dots,a_k$, $b_1,\dots,b_m$, $(ab)_{11},\dots,(ab)_{km}$ and $\epsilon_{111},\dots,\epsilon_{kmn}$ are mutually independent.
As with model II one-way analysis of variance, the parameters of prime interest are $\mu$, $\sigma_a^2$, $\sigma_b^2$, $\sigma_{ab}^2$ and $\sigma^2$. We use the same notation as in section 3.7, but the expected mean squares in the ANOVA table are now different.
We will derive $E(MS_B)$; the other E(MS) are left as an exercise for you to do.
Example 3.11.
Derive $E(MS_B)$.
Solution 3.11.
\[
SS_B = kn\sum_{j=1}^{m}(\bar{y}_{.j.} - \bar{y}_{...})^2 = kn\sum_{j=1}^{m}\bar{y}_{.j.}^2 - kmn\,\bar{y}_{...}^2
\]
\[
\bar{y}_{.j.} = \frac{1}{kn}\sum_{i=1}^{k}\sum_{h=1}^{n}\bigl(\mu + a_i + b_j + (ab)_{ij} + \epsilon_{ijh}\bigr)
= \mu + \frac{1}{k}\sum_{i=1}^{k} a_i + b_j + \frac{1}{k}\sum_{i=1}^{k}(ab)_{ij} + \frac{1}{kn}\sum_{i=1}^{k}\sum_{h=1}^{n}\epsilon_{ijh}
\]
\[
E(\bar{y}_{.j.}) = \mu, \qquad
Var(\bar{y}_{.j.}) = \frac{\sigma_a^2}{k} + \sigma_b^2 + \frac{\sigma_{ab}^2}{k} + \frac{\sigma^2}{kn}
\]
\[
\Longrightarrow E(\bar{y}_{.j.}^2) = Var(\bar{y}_{.j.}) + \bigl(E(\bar{y}_{.j.})\bigr)^2
= \frac{\sigma_a^2}{k} + \sigma_b^2 + \frac{\sigma_{ab}^2}{k} + \frac{\sigma^2}{kn} + \mu^2
\]
\[
\bar{y}_{...} = \mu + \frac{1}{k}\sum_{i} a_i + \frac{1}{m}\sum_{j} b_j + \frac{1}{km}\sum_{i}\sum_{j}(ab)_{ij} + \frac{1}{kmn}\sum_{i}\sum_{j}\sum_{h}\epsilon_{ijh}
= \mu + \bar{a}_{.} + \bar{b}_{.} + \overline{(ab)}_{..} + \bar{\epsilon}_{...}
\]
\[
E(\bar{y}_{...}) = E\bigl(\mu + \bar{a}_{.} + \bar{b}_{.} + \overline{(ab)}_{..} + \bar{\epsilon}_{...}\bigr) = \mu
\]
\[
Var(\bar{y}_{...}) = \frac{\sigma_a^2}{k} + \frac{\sigma_b^2}{m} + \frac{\sigma_{ab}^2}{km} + \frac{\sigma^2}{kmn}
\]
\[
\Longrightarrow E(\bar{y}_{...}^2) = Var(\bar{y}_{...}) + \bigl(E(\bar{y}_{...})\bigr)^2
= \frac{\sigma_a^2}{k} + \frac{\sigma_b^2}{m} + \frac{\sigma_{ab}^2}{km} + \frac{\sigma^2}{kmn} + \mu^2 .
\]
Then
\[
E(SS_B) = kn\sum_{j=1}^{m} E(\bar{y}_{.j.}^2) - kmn\,E(\bar{y}_{...}^2)
\]
\[
= kmn\frac{\sigma_a^2}{k} + kmn\sigma_b^2 + kmn\frac{\sigma_{ab}^2}{k} + kmn\frac{\sigma^2}{kn} + kmn\mu^2
- kmn\frac{\sigma_a^2}{k} - kmn\frac{\sigma_b^2}{m} - kmn\frac{\sigma_{ab}^2}{km} - kmn\frac{\sigma^2}{kmn} - kmn\mu^2
\]
\[
= mn\sigma_a^2 + kmn\sigma_b^2 + mn\sigma_{ab}^2 + m\sigma^2 - mn\sigma_a^2 - kn\sigma_b^2 - n\sigma_{ab}^2 - \sigma^2
\]
\[
= (m-1)kn\sigma_b^2 + (m-1)n\sigma_{ab}^2 + (m-1)\sigma^2
\]
\[
E(MS_B) = \frac{E(SS_B)}{m-1} = kn\sigma_b^2 + n\sigma_{ab}^2 + \sigma^2 .
\]
As in one-way model II, $SS_A$, $SS_B$, $SS_{AB}$ and $SS_{WS}$ are independent multiples of central chi-square variates.

If $\sigma_{ab}^2 = 0$,
\[
f = \frac{SS_{AB}/(k-1)(m-1)}{SS_{WS}/km(n-1)}
\]
is an $F_{(k-1)(m-1);\,km(n-1)}$ variate, and this fact is used to test $H_0: \sigma_{ab}^2 = 0$.

If $\sigma_a^2 = 0$,
\[
f = \frac{SS_A/(k-1)}{SS_{AB}/(k-1)(m-1)}
\]
is an $F_{k-1;\,(k-1)(m-1)}$ variate, and this fact is used to test $H_0: \sigma_a^2 = 0$.
(Note: In model I, $SS_A$ is compared to $SS_{WS}$, but here $SS_A$ is compared to $SS_{AB}$; the replicates in the cells are not utilised when $H_0: \sigma_a^2 = 0$ is tested.)

Thirdly, if $\sigma_b^2 = 0$,
\[
f = \frac{SS_B/(m-1)}{SS_{AB}/(k-1)(m-1)}
\]
is an $F_{m-1;\,(k-1)(m-1)}$ variate. This fact is used to test $H_0: \sigma_b^2 = 0$.

The variance components are estimated by
\[
\hat{\sigma}_a^2 = \frac{1}{mn}\Bigl(\frac{SS_A}{k-1} - \frac{SS_{AB}}{(k-1)(m-1)}\Bigr), \qquad
\hat{\sigma}_b^2 = \frac{1}{kn}\Bigl(\frac{SS_B}{m-1} - \frac{SS_{AB}}{(k-1)(m-1)}\Bigr),
\]
\[
\hat{\sigma}_{ab}^2 = \frac{1}{n}\Bigl(\frac{SS_{AB}}{(k-1)(m-1)} - \frac{SS_{WS}}{km(n-1)}\Bigr).
\]
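These estimators are easy to code. The sketch below is my own (the function name and dictionary keys are illustrative choices); applied to the sums of squares of Example 3.12 further on (SSA = 1.56, SSB = 7.56, SSAB = 1.64, SSWS = 1.24 with k = m = 3, n = 2) it reproduces $\hat\sigma_a^2 \approx 0.0617$, $\hat\sigma_b^2 \approx 0.5617$ and $\hat\sigma_{ab}^2 \approx 0.1361$:

```python
# Method-of-moments estimates of the model II variance components; the
# function name and dictionary keys are my own choices, not the Guide's.
def variance_components(SSA, SSB, SSAB, SSWS, k, m, n):
    MSA = SSA / (k - 1)
    MSB = SSB / (m - 1)
    MSAB = SSAB / ((k - 1) * (m - 1))
    MSWS = SSWS / (k * m * (n - 1))
    return {
        "sigma2": MSWS,                        # error variance
        "sigma2_a": (MSA - MSAB) / (m * n),    # factor A component
        "sigma2_b": (MSB - MSAB) / (k * n),    # factor B component
        "sigma2_ab": (MSAB - MSWS) / n,        # interaction component
    }
```

Note that these moment estimates can come out negative, in which case the usual convention is to truncate them at zero.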
As in one-way analysis of variance, one may find confidence intervals for the three variance com-
ponents, but this subject is not dealt with in this module.
An unbiased estimator of µ is
µ̂ = ȳ···
This fact may be used to test H0 : µ = µ0 with µ0 specified or to find a confidence interval for µ
as before. The unbiased estimator of V ar(µ̂) is not a multiple of a chi-square variate, and the test
or confidence limits are only approximately true.
Example 3.12.
A consumer product agency wants to evaluate the accuracy of determining the level of calcium in
a food supplement. There are a large number of possible testing laboratories and a large number of
chemical assays for calcium. The agency randomly selects three laboratories and three assays for
use in the study. Each laboratory will use all three assays in the study. Eighteen samples containing
10 mg of calcium are prepared and each assay−laboratory combination is randomly assigned to two
samples. The calcium content is given in the following table:
Laboratory
Assay 1 2 3
1 10.9 10.5 9.7
10.9 9.8 10.0
2 11.3 9.4 8.8
11.7 10.2 9.2
3 11.8 10.0 10.4
11.2 10.7 10.7
(a) Perform an analysis of variance for this experiment. Conduct all tests with α = 0.05.
(b) Estimate all variance components and determine their proportional allocation to the total
variability.
Solution 3.12.
(a)
ȳ11. = 10.9 ȳ12. = 10.15 ȳ13. = 9.85
ȳ21. = 11.5 ȳ22. = 9.8 ȳ23. = 9.0
ȳ31. = 11.5 ȳ32. = 10.35 ȳ33. = 10.55
ȳ... = 10.4
\[
SS_{\text{Total}} = \sum_{i=1}^{k}\sum_{j=1}^{m}\sum_{h=1}^{n}(y_{ijh} - \bar{y}_{...})^2 = 12
\]
\[
SS_A = mn\sum_{i=1}^{k}(\bar{y}_{i..} - \bar{y}_{...})^2
= 3 \times 2[(10.3-10.4)^2 + (10.1-10.4)^2 + (10.8-10.4)^2] = 1.56
\]
\[
SS_B = kn\sum_{j=1}^{m}(\bar{y}_{.j.} - \bar{y}_{...})^2 = 7.56
\]
\[
SS_{AB} = n\sum_{i=1}^{k}\sum_{j=1}^{m}(\bar{y}_{ij.} - \bar{y}_{.j.} - \bar{y}_{i..} + \bar{y}_{...})^2 = 1.64
\]
\[
SS_{WS} = \sum_{i=1}^{k}\sum_{j=1}^{m}\sum_{h=1}^{n}(y_{ijh} - \bar{y}_{ij.})^2 = 1.24
\]
$H_0: \sigma_a^2 = 0$:
\[
f = \frac{SS_A/(k-1)}{SS_{AB}/(k-1)(m-1)} = \frac{MS_A}{MS_{AB}} = \frac{0.78}{0.41} \approx 1.9024
\]
Since 1.9024 < 6.94, we do not reject H0 at the 5% level of significance and conclude that
there is insufficient evidence to indicate a significant variability in calcium determinations
from assay to assay.
$H_0: \sigma_b^2 = 0$:
\[
f = \frac{MS_B}{MS_{AB}} = \frac{3.78}{0.41} \approx 9.2195
\]
Since 9.2195 > 6.94, H0 is rejected at the 5% level of significance and we conclude that there is significant variability in calcium determinations from laboratory to laboratory.
$H_0: \sigma_{ab}^2 = 0$:
\[
f = \frac{MS_{AB}}{MS_{WS}} = \frac{0.41}{0.1378} \approx 2.9753
\]
Since 2.9753 < 3.63, we do not reject H0 at the 5% level of significance. There does not appear to be significant interaction between the levels of the factors for assays and laboratories.
(b) The variance component estimates follow from the expected mean squares:
\[
\hat{\sigma}^2 = MS_{WS} = 0.1378
\]
\[
\hat{\sigma}^2 + 2\hat{\sigma}_{ab}^2 = MS_{AB} = 0.41
\;\Longrightarrow\; 2\hat{\sigma}_{ab}^2 = 0.41 - 0.1378 = 0.2722
\;\Longrightarrow\; \hat{\sigma}_{ab}^2 = 0.1361
\]
\[
\hat{\sigma}^2 + 2\hat{\sigma}_{ab}^2 + 6\hat{\sigma}_b^2 = MS_B = 3.78
\;\Longrightarrow\; \hat{\sigma}_b^2 = (3.78 - 0.41)/6 = 0.5617
\]
\[
\hat{\sigma}^2 + 2\hat{\sigma}_{ab}^2 + 6\hat{\sigma}_a^2 = MS_A = 0.78
\;\Longrightarrow\; \hat{\sigma}_a^2 = (0.78 - 0.41)/6 = 0.0617
\]
The total of the estimates is 0.0617 + 0.5617 + 0.1361 + 0.1378 = 0.8973, so the proportional allocation is:

Source        Estimate    Proportion
Assays        0.0617      0.0617/0.8973 ≈ 0.0688
Labs          0.5617      0.5617/0.8973 ≈ 0.6260
Interaction   0.1361      0.1361/0.8973 ≈ 0.1517
Error         0.1378      0.1378/0.8973 ≈ 0.1536
Total         0.8973
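The whole calculation of Solution 3.12 can be reproduced from the raw data with a few lines of NumPy (small rounding differences from the Guide's hand-rounded values are to be expected):

```python
import numpy as np

# Calcium data of Example 3.12: axis 0 = assay (A), axis 1 = laboratory (B),
# axis 2 = the two samples per assay-laboratory combination.
y = np.array([[[10.9, 10.9], [10.5,  9.8], [ 9.7, 10.0]],
              [[11.3, 11.7], [ 9.4, 10.2], [ 8.8,  9.2]],
              [[11.8, 11.2], [10.0, 10.7], [10.4, 10.7]]])
k, m, n = y.shape

ybar_cell = y.mean(axis=2)
ybar_A, ybar_B, ybar = y.mean(axis=(1, 2)), y.mean(axis=(0, 2)), y.mean()

SSA = m * n * np.sum((ybar_A - ybar) ** 2)      # 1.56
SSB = k * n * np.sum((ybar_B - ybar) ** 2)      # 7.56
SSAB = n * np.sum((ybar_cell - ybar_A[:, None] - ybar_B[None, :] + ybar) ** 2)  # 1.64
SSWS = np.sum((y - ybar_cell[:, :, None]) ** 2)  # 1.24

MSA, MSB = SSA / (k - 1), SSB / (m - 1)
MSAB, MSWS = SSAB / ((k - 1) * (m - 1)), SSWS / (k * m * (n - 1))

f_A, f_B, f_AB = MSA / MSAB, MSB / MSAB, MSAB / MSWS
print(round(f_A, 4), round(f_B, 4), round(f_AB, 4))
```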
Note: Since there was a significant variability in the determination of calcium in the samples, the
estimate of an overall mean level µ would not be of interest to the researcher. However, in
this case we want to illustrate the methodology.
We now assume that the k levels of factor A are the only levels of interest, and thus A is a fixed
factor. The m levels of factor B constitute a random sample from a large population of possible
levels of B, and thus B is a random factor. The resulting experiment gives rise to a mixed model:
\[
y_{ijh} = \mu + \alpha_i + b_j + (\alpha b)_{ij} + \epsilon_{ijh}, \qquad i = 1,\dots,k;\ j = 1,\dots,m;\ h = 1,\dots,n,
\]
where
\[
\sum_{i=1}^{k}\alpha_i = 0, \qquad \sum_{i=1}^{k}(\alpha b)_{ij} = 0 \text{ for all } j, \qquad E(b_j) = E\bigl((\alpha b)_{ij}\bigr) = 0,
\]
\[
b_j \sim n(0;\sigma_b^2); \qquad (\alpha b)_{ij} \sim n\Bigl(0;\frac{k-1}{k}\sigma_{\alpha b}^2\Bigr); \qquad \epsilon_{ijh} \sim n(0;\sigma^2);
\]
and $b_1,\dots,b_m$ are independent.
Any two interaction terms $(\alpha b)_{ij}$ and $(\alpha b)_{rs}$ are independent unless they refer to the same random level of B, that is $j = s$; in this case it is assumed that
\[
Cov\bigl((\alpha b)_{ij},(\alpha b)_{rj}\bigr) = -\sigma_{\alpha b}^2/k .
\]
This gives
\[
\sum_{i=1}^{k}(\alpha b)_{ij} = 0 \text{ for all } j,
\]
which is commensurate with the assumption concerning the covariance between two interaction terms.
The B and AB rows of the ANOVA table become:

B:  $SS_B$, d.f. $m-1$, MS $SS_B/(m-1)$, E(MS) $\sigma^2 + kn\sigma_b^2$
AB: $SS_{AB}$, d.f. $(k-1)(m-1)$, MS $SS_{AB}/(k-1)(m-1)$, E(MS) $\sigma^2 + n\sigma_{\alpha b}^2$
Example 3.13.
Derive the formula for $E(MS_A)$.
Solution 3.13.
\[
SS_A = mn\sum_{i=1}^{k}(\bar{y}_{i..} - \bar{y}_{...})^2 = mn\sum_{i=1}^{k}\bar{y}_{i..}^2 - kmn\,\bar{y}_{...}^2
\]
\[
\bar{y}_{i..} = \frac{1}{mn}\sum_{j=1}^{m}\sum_{h=1}^{n}\bigl(\mu + \alpha_i + b_j + (\alpha b)_{ij} + \epsilon_{ijh}\bigr)
= \mu + \alpha_i + \frac{1}{m}\sum_{j}b_j + \frac{1}{m}\sum_{j}(\alpha b)_{ij} + \frac{1}{mn}\sum_{j}\sum_{h}\epsilon_{ijh}
\]
\[
E(\bar{y}_{i..}) = \mu + \alpha_i
\]
\[
Var(\bar{y}_{i..}) = \frac{1}{m^2}\sum_{j} Var(b_j) + \frac{1}{m^2}\sum_{j} Var\bigl((\alpha b)_{ij}\bigr) + \frac{1}{(mn)^2}\sum_{j}\sum_{h} Var(\epsilon_{ijh})
= \frac{m\sigma_b^2}{m^2} + \frac{m\sigma_{\alpha b}^2}{m^2} + \frac{mn\sigma^2}{(mn)^2}
= \frac{\sigma_b^2}{m} + \frac{\sigma_{\alpha b}^2}{m} + \frac{\sigma^2}{mn}
\]
\[
E(\bar{y}_{i..}^2) = Var(\bar{y}_{i..}) + \bigl(E(\bar{y}_{i..})\bigr)^2
= \frac{\sigma_b^2}{m} + \frac{\sigma_{\alpha b}^2}{m} + \frac{\sigma^2}{mn} + (\mu + \alpha_i)^2
\]
\[
\bar{y}_{...} = \mu + \frac{1}{k}\sum_{i}\alpha_i + \frac{1}{m}\sum_{j}b_j + \frac{1}{km}\sum_{i}\sum_{j}(\alpha b)_{ij} + \frac{1}{kmn}\sum_{i}\sum_{j}\sum_{h}\epsilon_{ijh}
\]
\[
E(\bar{y}_{...}) = \mu
\]
\[
Var(\bar{y}_{...}) = \frac{m\sigma_b^2}{m^2} + \frac{km\sigma_{\alpha b}^2}{(km)^2} + \frac{kmn\sigma^2}{(kmn)^2}
= \frac{\sigma_b^2}{m} + \frac{\sigma_{\alpha b}^2}{km} + \frac{\sigma^2}{kmn}
\]
\[
E(\bar{y}_{...}^2) = Var(\bar{y}_{...}) + \bigl(E(\bar{y}_{...})\bigr)^2
= \frac{\sigma_b^2}{m} + \frac{\sigma_{\alpha b}^2}{km} + \frac{\sigma^2}{kmn} + \mu^2
\]
\[
E(SS_A) = mn\sum_{i=1}^{k} E(\bar{y}_{i..}^2) - kmn\,E(\bar{y}_{...}^2)
\]
\[
= mn\sum_{i=1}^{k}\Bigl[\frac{\sigma_b^2}{m} + \frac{\sigma_{\alpha b}^2}{m} + \frac{\sigma^2}{mn} + (\mu + \alpha_i)^2\Bigr]
- kmn\Bigl[\frac{\sigma_b^2}{m} + \frac{\sigma_{\alpha b}^2}{km} + \frac{\sigma^2}{kmn} + \mu^2\Bigr]
\]
\[
= kn\sigma_b^2 + kn\sigma_{\alpha b}^2 + k\sigma^2 + kmn\mu^2 + 2mn\mu\sum_{i}\alpha_i + mn\sum_{i}\alpha_i^2
- kn\sigma_b^2 - n\sigma_{\alpha b}^2 - \sigma^2 - kmn\mu^2
\]
\[
= n(k-1)\sigma_{\alpha b}^2 + \sigma^2(k-1) + mn\sum_{i=1}^{k}\alpha_i^2 \qquad \text{since } \sum_{i}\alpha_i = 0
\]
\[
E(MS_A) = \frac{E(SS_A)}{k-1}
= \frac{n(k-1)\sigma_{\alpha b}^2 + \sigma^2(k-1) + mn\sum_{i=1}^{k}\alpha_i^2}{k-1}
= n\sigma_{\alpha b}^2 + \sigma^2 + \frac{mn}{k-1}\sum_{i=1}^{k}\alpha_i^2
\]
Activity 3.3.
Derive the other E(M S) as an exercise.
The three null hypotheses and the appropriate statistics are as follows:

H0: $\alpha_1 = \dots = \alpha_k\,(= 0)$; $f = \dfrac{SS_A/(k-1)}{SS_{AB}/(k-1)(m-1)}$; d.f. $k-1;\ (k-1)(m-1)$

H0: $\sigma_{\alpha b}^2 = 0$; $f = \dfrac{SS_{AB}/(k-1)(m-1)}{SS_{WS}/km(n-1)}$; d.f. $(k-1)(m-1);\ km(n-1)$

H0: $\sigma_b^2 = 0$; $f = \dfrac{SS_B/(m-1)}{SS_{WS}/km(n-1)}$; d.f. $m-1;\ km(n-1)$

Reject $H_0: \alpha_r = \alpha_s$ if
\[
\frac{(\bar{y}_{r..} - \bar{y}_{s..})^2}{2/(mn)} > (k-1)\bigl(\hat{\sigma}^2 + n\hat{\sigma}_{\alpha b}^2\bigr)F_{\alpha;\,k-1;\,(k-1)(m-1)} .
\]
Example 3.14.
Preliminary research on the production of imitation pearls entailed studying how the number of coats of a special lacquer (factor A), applied to the opalescent plastic bead used as the base of the pearl, affects the market value of the pearl. Four batches of 12 beads (factor B) were used in the study, and it is desired to also consider their effect on the market value. The three levels of factor A (six, eight and ten coats) were fixed in advance, while the four batches can be regarded as a random sample of batches from the bead production process. The market value of each pearl was determined by a panel of experts. The market value data (coded) are as follows:
(c) Do the data provide sufficient evidence to indicate an interaction between number of coats
and batch?
Hint: Test at the 5% significance level. Give the hypotheses, test statistics, formulas and conclusions explicitly.
Solution 3.14.
(a) The factor A has three levels and the factor B has four levels.
(c)
ȳ11. = 71.7 ȳ12. = 74.275 ȳ13. = 75.625 ȳ14. = 70.825
ȳ21. = 75.525 ȳ22. = 78.35 ȳ23. = 78.5 ȳ24. = 74.8
ȳ31. = 75.625 ȳ32. = 78.35 ȳ33. = 79 ȳ34. = 74.725
ȳ... = 75.6083
\[
SS_{\text{Total}} = \sum_{i=1}^{k}\sum_{j=1}^{m}\sum_{h=1}^{n}(y_{ijh} - \bar{y}_{...})^2 = 478.7167
\]
\[
SS_A = mn\sum_{i=1}^{k}(\bar{y}_{i..} - \bar{y}_{...})^2
= 4 \times 4[(73.1063-75.6083)^2 + (76.7938-75.6083)^2 + (76.925-75.6083)^2] = 150.3858
\]
\[
SS_B = kn\sum_{j=1}^{m}(\bar{y}_{.j.} - \bar{y}_{...})^2 = 152.8522
\]
\[
SS_{WS} = \sum_{i=1}^{k}\sum_{j=1}^{m}\sum_{h=1}^{n}(y_{ijh} - \bar{y}_{ij.})^2 = 173.625
\]
\[
SS_{AB} = SS_{\text{Total}} - SS_A - SS_B - SS_{WS} = 1.8537
\]

Source   SS        d.f.  MS
A        150.3858  2     75.1929
B        152.8522  3     50.9507
A×B      1.8537    6     0.3090
Error    173.625   36    4.8229
Total    478.7167  47

$H_0: \sigma_{\alpha b}^2 = 0$:
\[
f = \frac{MS_{AB}}{MS_{WS}} = \frac{0.3090}{4.8229} \approx 0.0641
\]
$F_{\alpha;(k-1)(m-1);km(n-1)} = F_{0.05;6;36} = 2.38$. Reject H0 if f > 2.38.
Since 0.0641 < 2.38, H0 is not rejected at the 5% level of significance and we conclude that there is no interaction between number of coats and batch.
(d) $H_0: \alpha_1 = \alpha_2 = \alpha_3 = 0$
\[
f = \frac{SS_A/(k-1)}{SS_{AB}/(k-1)(m-1)} = \frac{75.1929}{0.3090} \approx 243.3427
\]
Since 243.3427 > 5.14, we reject H0 at the 5% level of significance and we conclude that the number of coats affects the market value of the pearls.
$H_0: \sigma_b^2 = 0$
\[
f = \frac{SS_B/(m-1)}{SS_{WS}/km(n-1)} = \frac{50.9507}{4.8229} \approx 10.5643
\]
Since 10.5643 > 2.88, we reject H0 at the 5% level of significance and conclude that batches have an effect on the market value of the pearls.
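The three test statistics of Solution 3.14 follow directly from the mean squares in the ANOVA table; a quick check:

```python
# Mean squares from the ANOVA table of Solution 3.14.
MSA, MSB, MSAB, MSWS = 75.1929, 50.9507, 0.3090, 4.8229

f_AB = MSAB / MSWS    # interaction against within-cell error
f_A = MSA / MSAB      # fixed factor A against the interaction
f_B = MSB / MSWS      # random factor B against within-cell error
print(f_AB, f_A, f_B)
```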
We now refer to data in sections 3.3.2 and 3.3.3. The usual model for this type of experiment is
\[
y_{ijh} = \mu + \alpha_i + b_{ij} + \epsilon_{ijh}, \qquad i = 1,\dots,k;\ j = 1,\dots,m;\ h = 1,\dots,n,
\]
where $\sum_{i=1}^{k}\alpha_i = 0$.
In this formulation i refers to the treatments (teaching methods in section 3.3.2.1, sprays in section
3.3.2.2), j refers to the units assigned at random to the treatments (teachers and trees respectively)
and h to the repetitions (children and leaves respectively).
\[
SS_A = mn\sum_{i=1}^{k}(\bar{y}_{i..} - \bar{y}_{...})^2
\]
\[
SS_B = n\sum_{i=1}^{k}\sum_{j=1}^{m}(\bar{y}_{ij.} - \bar{y}_{i..})^2
\]
\[
SS_{\text{Error}} = \sum_{i=1}^{k}\sum_{j=1}^{m}\sum_{h=1}^{n}(y_{ijh} - \bar{y}_{ij.})^2
\]
\[
SS_{\text{Total}} = \sum_{i=1}^{k}\sum_{j=1}^{m}\sum_{h=1}^{n}(y_{ijh} - \bar{y}_{...})^2 .
\]
Example 3.15.
Derive E(M SA ).
Solution 3.15.
The model is yijh = µ + αi + bij + εijh , i = 1, ..., k; j = 1, ..., m; h = 1, ..., n.
\[
\bar{y}_{i..} = \mu + \alpha_i + \frac{1}{m}\sum_{j=1}^{m} b_{ij} + \frac{1}{mn}\sum_{j=1}^{m}\sum_{h=1}^{n}\epsilon_{ijh}
= \mu + \alpha_i + \bar{b}_{i.} + \bar{\epsilon}_{i..}
\]
\[
E(\bar{y}_{i..}) = \mu + \alpha_i, \qquad
Var(\bar{y}_{i..}) = \frac{\sigma_b^2}{m} + \frac{\sigma^2}{mn}
\]
\[
\bar{y}_{...} = \mu + \frac{1}{k}\sum_{i}\alpha_i + \frac{1}{km}\sum_{i}\sum_{j} b_{ij} + \frac{1}{kmn}\sum_{i}\sum_{j}\sum_{h}\epsilon_{ijh}
= \mu + \bar{\alpha}_{.} + \bar{b}_{..} + \bar{\epsilon}_{...}
\]
\[
E(\bar{y}_{...}) = \mu, \qquad
Var(\bar{y}_{...}) = \frac{\sigma_b^2}{km} + \frac{\sigma^2}{kmn}
\]
Now
\[
SS_A = mn\sum_{i}\bar{y}_{i..}^2 - kmn\,\bar{y}_{...}^2
\]
\[
E(SS_A) = mn\sum_{i} E(\bar{y}_{i..}^2) - kmn\,E(\bar{y}_{...}^2)
= mn\sum_{i}\Bigl[\frac{\sigma_b^2}{m} + \frac{\sigma^2}{mn} + (\mu + \alpha_i)^2\Bigr]
- kmn\Bigl[\frac{\sigma_b^2}{km} + \frac{\sigma^2}{kmn} + \mu^2\Bigr]
\]
\[
= kn\sigma_b^2 + k\sigma^2 + kmn\mu^2 + 2mn\mu\sum_{i}\alpha_i + mn\sum_{i}\alpha_i^2
- n\sigma_b^2 - \sigma^2 - kmn\mu^2
\]
\[
= (k-1)n\sigma_b^2 + (k-1)\sigma^2 + mn\sum_{i}\alpha_i^2 \qquad \text{since } \sum_{i}\alpha_i = 0
\]
\[
= (k-1)\Bigl[n\sigma_b^2 + \sigma^2 + \frac{mn}{k-1}\sum_{i}\alpha_i^2\Bigr]
\]
Thus,
\[
E(MS_A) = \frac{E(SS_A)}{k-1}
\]
\[
\therefore\ E(MS_A) = n\sigma_b^2 + \sigma^2 + \frac{mn}{k-1}\sum_{i}\alpha_i^2 .
\]
Activity 3.4.
Derive the other E(M S) as an exercise.
$H_0: \alpha_1 = \dots = \alpha_k = 0$ is rejected if
\[
f = \frac{SS_A/(k-1)}{SS_B/k(m-1)} > F_{\alpha;\,k-1;\,k(m-1)} .
\]
$H_0: \sigma_b^2 = 0$ is rejected if
\[
f = \frac{SS_B/k(m-1)}{SS_E/km(n-1)} > F_{\alpha;\,k(m-1);\,km(n-1)} .
\]
Example 3.16.
Using the data in section 3.3.2.1, perform a complete analysis of variance and interpret the results.
Complete the ANOVA table, state the hypothesis and give all formulas explicitly.
Solution 3.16.
The means and sums of squares of deviations from the means are as follows (the first index refers to teaching methods, the second to teachers and the third to pupils).
\[
\bar{y}_{11.} = 58, \quad \textstyle\sum_h (y_{11h} - \bar{y}_{11.})^2 = (60-58)^2 + \dots + (65-58)^2 = 178
\]
\[
\bar{y}_{12.} = 62, \quad \textstyle\sum_h (y_{12h} - \bar{y}_{12.})^2 = (61-62)^2 + \dots + (72-62)^2 = 166, \qquad \bar{y}_{1..} = 60
\]
\[
\bar{y}_{21.} = 47, \quad \textstyle\sum_h (y_{21h} - \bar{y}_{21.})^2 = (49-47)^2 + \dots + (48-47)^2 = 46
\]
\[
\bar{y}_{22.} = 55, \quad \textstyle\sum_h (y_{22h} - \bar{y}_{22.})^2 = (54-55)^2 + \dots + (63-55)^2 = 132, \qquad \bar{y}_{2..} = 51
\]
\[
\bar{y}_{31.} = 66, \quad \textstyle\sum_h (y_{31h} - \bar{y}_{31.})^2 = (68-66)^2 + \dots + (75-66)^2 = 138
\]
\[
\bar{y}_{32.} = 72, \quad \textstyle\sum_h (y_{32h} - \bar{y}_{32.})^2 = (64-72)^2 + \dots + (75-72)^2 = 114, \qquad \bar{y}_{3..} = 69
\]
\[
\bar{y}_{...} = 60
\]
\[
SS_A = mn\sum_{i=1}^{k}(\bar{y}_{i..} - \bar{y}_{...})^2
= 2 \times 5[(60-60)^2 + (51-60)^2 + (69-60)^2] = 10[0^2 + (-9)^2 + 9^2] = 1\,620
\]
\[
SS_B = n\sum_{i=1}^{k}\sum_{j=1}^{m}(\bar{y}_{ij.} - \bar{y}_{i..})^2
= 5[(58-60)^2 + (62-60)^2 + (47-51)^2 + (55-51)^2 + (66-69)^2 + (72-69)^2] = 290
\]
\[
SS_{\text{Error}} = \sum_{i=1}^{k}\sum_{j=1}^{m}\sum_{h=1}^{n}(y_{ijh} - \bar{y}_{ij.})^2 = 774
\]
\[
SS_T = \sum_{i=1}^{k}\sum_{j=1}^{m}\sum_{h=1}^{n}(y_{ijh} - \bar{y}_{...})^2 = 0^2 + (-5)^2 + \dots + 15^2 = 2\,684
\]
$H_0: \alpha_1 = \alpha_2 = \alpha_3 = 0$
\[
f = \frac{SS_A/(k-1)}{SS_B/k(m-1)} = \frac{1\,620/2}{290/3} \approx 8.3793
\]
$F_{\alpha;k-1;k(m-1)} = F_{0.05;2;3} = 9.55$. Reject H0 if f > 9.55.
Since 8.3793 < 9.55, we do not reject H0 at the 5% level of significance and conclude that there is no significant difference between the teaching methods.
$H_0: \sigma_b^2 = 0$:
\[
f = \frac{SS_B/k(m-1)}{SS_{\text{Error}}/km(n-1)} = \frac{290/3}{774/24} \approx 2.9974
\]
$F_{\alpha;k(m-1);km(n-1)} = F_{0.05;3;24} = 3.01$. Reject H0 if f > 3.01.
Since 2.9974 < 3.01, we do not reject H0 at the 5% level of significance and conclude that the variance ascribable to the teachers is not significant.
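A quick check of the two test statistics from the sums of squares of Solution 3.16:

```python
# Sums of squares from Solution 3.16: k = 3 methods, m = 2 teachers per
# method, n = 5 pupils per teacher.
SSA, SSB, SSE = 1620.0, 290.0, 774.0
k, m, n = 3, 2, 5

f_A = (SSA / (k - 1)) / (SSB / (k * (m - 1)))            # 810 / 96.67
f_B = (SSB / (k * (m - 1))) / (SSE / (k * m * (n - 1)))  # 96.67 / 32.25
print(f_A, f_B)
```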
The SAS JMP graph of the means shows that teaching method 3 led to the highest marks on average, while pupils who were taught by method 2 had the poorest results. SAS JMP does not do the full analysis for this type of experiment, and therefore we display the graph only.
Example 3.17.
Using the data in section 3.3.2.2, perform a complete analysis of variance and interpret the results.
Complete the ANOVA table, state the hypothesis and give all formulas explicitly.
Solution 3.17.
The model is
\[
y_{ijh} = \mu + \alpha_i + b_{ij} + \epsilon_{ijh}, \qquad i = 1,\dots,k;\ j = 1,\dots,m;\ h = 1,\dots,n,
\]
where $\sum_{i=1}^{k}\alpha_i = 0$.
Now i refers to the sprays, that is, i = 1, 2, 3; j refers to the trees, that is, j = 1, 2, 3, 4 and h to
the repetitions (leaves), that is, h = 1, 2, . . . , 6.
Now
\[
\bar{y}_{11.} = 6 \quad \bar{y}_{12.} = 7 \quad \bar{y}_{13.} = 12 \quad \bar{y}_{14.} = 11
\]
\[
\bar{y}_{21.} = 15 \quad \bar{y}_{22.} = 14 \quad \bar{y}_{23.} = 11 \quad \bar{y}_{24.} = 12
\]
\[
\bar{y}_{31.} = 7 \quad \bar{y}_{32.} = 5 \quad \bar{y}_{33.} = 9 \quad \bar{y}_{34.} = 11
\]
\[
SS_A = mn\sum_{i=1}^{k}(\bar{y}_{i..} - \bar{y}_{...})^2
= 4 \times 6[(9-10)^2 + (13-10)^2 + (8-10)^2] = 24[(-1)^2 + 3^2 + (-2)^2] = 336
\]
\[
SS_B = n\sum_{i=1}^{k}\sum_{j=1}^{m}(\bar{y}_{ij.} - \bar{y}_{i..})^2 = 336
\]
\[
SS_{\text{Error}} = \sum_{i=1}^{k}\sum_{j=1}^{m}\sum_{h=1}^{n}(y_{ijh} - \bar{y}_{ij.})^2 = 151 .
\]
$H_0: \alpha_1 = \alpha_2 = \alpha_3 = 0$
\[
f = \frac{SS_A/(k-1)}{SS_B/k(m-1)} = \frac{336/2}{336/9} = 4.5
\]
$F_{\alpha;k-1;k(m-1)} = F_{0.05;2;9} = 4.26$. Reject H0 if f > 4.26.
Since 4.5 > 4.26, we reject H0 at the 5% level of significance and conclude that the effects of the sprays are significantly different from one another.
$H_0: \sigma_b^2 = 0$:
\[
f = \frac{SS_B/k(m-1)}{SS_{\text{Error}}/km(n-1)} = \frac{336/9}{151/60} \approx 14.8344
\]
$F_{\alpha;k(m-1);km(n-1)} = F_{0.05;9;60} = 2.04$. Reject H0 if f > 2.04.
Since 14.8344 > 2.04, we reject H0 at the 5% level of significance and conclude that the variance ascribable to the trees is significant; that is, the hypothesis $\sigma_b^2 = 0$ is rejected.
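Again, the two test statistics can be checked directly from the sums of squares:

```python
# Sums of squares from Solution 3.17: k = 3 sprays, m = 4 trees per spray,
# n = 6 leaves per tree.
SSA, SSB, SSE = 336.0, 336.0, 151.0
k, m, n = 3, 4, 6

f_A = (SSA / (k - 1)) / (SSB / (k * (m - 1)))            # 168 / 37.33
f_B = (SSB / (k * (m - 1))) / (SSE / (k * m * (n - 1)))  # 37.33 / 2.52
print(f_A, f_B)
```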
The SAS JMP graph follows - you should be able to interpret it.
Experiments often include more than two factors; the data in section 3.1.3 are an example of a three-way analysis of variance. The model is
\[
y_{ijk\ell} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + (\alpha\beta\gamma)_{ijk} + \epsilon_{ijk\ell},
\]
\[
i = 1,\dots,a;\ j = 1,\dots,b;\ k = 1,\dots,c;\ \ell = 1,\dots,n;
\]
with
\[
\sum_{i=1}^{a}\alpha_i = \sum_{j=1}^{b}\beta_j = \sum_{k=1}^{c}\gamma_k
= \sum_{i=1}^{a}(\alpha\beta)_{ij} = \sum_{j=1}^{b}(\alpha\beta)_{ij}
= \sum_{i=1}^{a}(\alpha\gamma)_{ik} = \sum_{k=1}^{c}(\alpha\gamma)_{ik}
= \sum_{j=1}^{b}(\beta\gamma)_{jk} = \sum_{k=1}^{c}(\beta\gamma)_{jk}
\]
\[
= \sum_{i=1}^{a}(\alpha\beta\gamma)_{ijk} = \sum_{j=1}^{b}(\alpha\beta\gamma)_{ijk} = \sum_{k=1}^{c}(\alpha\beta\gamma)_{ijk} = 0 .
\]
The three-way analysis of variance is an expansion of the method used for the two-way analysis
of variance. Since the calculations are complicated and time-consuming, SAS JMP, or a similar
computer package, is usually used.
Example 3.18.
In this example interactions were ignored and you will find the SAS JMP output fairly simple to
interpret.
Example 3.19.
A three-way analysis of variance was performed to investigate the effect of rainfall (factor A), type
of soil (factor B) and fertilizer (factor C) on yield, allowing for possible interactions.
Rainfall 1 2
Soil type 1 2 1 2
Fertilizer
1 18.6 10.2 12.3 12.3
18.8 10.5 10.9 10.7
15.8 8.5 10.7 12
17.9 10.6 11.8 11.8
2 19.4 14.7 13.2 10.2
18.9 17.8 15.5 8.5
18.4 15.6 11 7.1
18.9 16.5 11.8 6.5
3 15.9 20.9 19.4 13.2
16.5 21 21.2 13.5
17.2 21.03 20.4 12.8
16.5 20.4 20.3 15.2
The AC-interaction seems to be the strongest and the most prominent feature in figure 3.20(middle)
is the following:
The combination of level 2 of C and level 2 of A contributes to a very low yield; on the other hand, the combination of level 2 of C and level 1 of A contributes to a relatively high yield.
Figure 3.21 and figure 3.22 are representations of the ABC-interaction: figure 3.21 represents the
AC-interaction for soil type 1 and figure 3.22 represents the AC-interaction for soil type 2.
If figure 3.21 and figure 3.22 had been the same, you would be inclined to conclude that the ABC-
interaction was not significant. The difference between figure 3.21 and figure 3.22 is obvious, thus
the results of the ANOVA table for the significance of the ABC-interaction are confirmed.
Figure 3.20 left and middle and figures 3.21 and 3.22: the 1 and 2 inside the graph represent levels
1 and 2 for rainfall;
Figure 3.20 right: the 1 and 2 inside the graph represent levels 1 and 2 for soil.
Exercise 3.1
(ii) the hypotheses and conclusions you expect to test and make from the analysis WITH-
OUT actually performing the analysis.
SCENARIOS
(a) The effect of salinity on the growth of fish (measured by increase in weight) was examined. A sample of five fish was measured at each salinity: full-strength seawater (32%), brackish water (18%) and fresh water (0.5%). Analyse the experiment.
(b) An experiment was conducted to test the effects of three different levels of factor A with
three different levels of factor B. The experiment called for nine identical experimental
units, each to be tested at one of the combinations of factors A and B. Analyse the
experiment.
(c) Given the scenario in (b), there was some controversy about the assumption of no inter-
action between factors A and B and it was decided to test 36 experimental units, four
in each combination, to allow for the possibility of interaction. Analyse the experiment.
2. Four chemical treatments for curtain fabrics were tested with regard to their ability to im-
prove colour fastness. Due to the limited quantities of the two types of fabrics available for
the experiment, it was decided to apply each chemical to a sample of each type of fabric. The
results are expressed in percentages with regard to the retaining of colour, after the treated
fabrics were submitted to severe testing:
Fabric
1 2
1 53 92
Treatment 2 32 81
3 79 99
4 38 67
(b) Are there significant differences in the effect of treatments on fabrics (α = 0.05)?
(d) What would you recommend in order to make any claims as to the effectiveness of the
treatments?
3. The strain readings of glass cathode supports from five different machines were investigated.
Each machine had four “heads” on which the glass was formed, and four samples were taken
from each head. The data were as follows:
Machine A B C
Head 1 2 3 4 5 6 7 8 9 10 11 12
6 13 1 7 10 2 4 0 0 10 8 7
2 3 10 4 9 1 1 3 0 11 5 2
0 9 0 7 7 1 7 4 5 6 0 5
8 8 6 9 12 10 9 1 5 7 7 4
Machine D E
Head 13 14 15 16 17 18 19 20
11 5 1 0 1 6 3 3
0 10 8 8 4 7 0 7
6 8 9 6 7 0 2 4
4 3 4 5 9 3 2 0
Analyse and interpret the results. Use the 5% level of significance. Give the model, ANOVA
table, hypotheses, test statistics and conclusions explicitly.
4. Consider the two-way analysis of variance: model II. Show that $E(MS_{AB}) = \sigma^2 + n\sigma_{ab}^2$.
5. An experiment was conducted to determine the effect of different pressure levels on the
products manufactured by a machine. A summary of the data recorded from products man-
ufactured by the machine at each of four randomly selected pressure levels is given as follows:
\[
\bar{y}_{1\cdot} = 27.4 \quad \bar{y}_{2\cdot} = 29.8 \quad \bar{y}_{3\cdot} = 30.7 \quad \bar{y}_{4\cdot} = 29.2 \quad \bar{y}_{\cdot\cdot} = 29.275
\]
\[
n_1 = n_2 = n_3 = n_4 = 10, \qquad \sum_i\sum_j (y_{ij} - \bar{y}_{\cdot\cdot})^2 = 207.975
\]
\[
y_{ij} = \mu + a_i + \epsilon_{ij}, \qquad i = 1,\dots,k = 4;\ j = 1,\dots,n = 10
\]
(a) Complete the ANOVA table and test at the 5% level whether the change in pressure
has a significant effect on the products.
(b) Estimate the proportion of the total variance ascribable to pressure variation and com-
pute a 90% confidence interval.
A
25◦ 30◦ 35◦
(d) Now assume that the three levels of factor A constitute a random sample from a large
population of possible levels of A. Give the appropriate model and do questions (i)-(iii)
of (c) for the model in (d).
Chapter 4
REGRESSION ANALYSIS
4.1 Introduction
As was said in chapter 2, the regression model is a special case of the general linear model
\[
y = X\beta + \epsilon
\]
in which each column of X (except possibly the first) represents a series of values of a continuous
variable. The variables represented by the columns of X are called the independent variables or
predictors. The rows of X and y represent, as usual, cases or data points. We already know that
the least squares estimator of β is
\[
\hat{\beta} = (X'X)^{-1}X'y
\]
and that
\[
Cov(\hat{\beta}, \hat{\beta}') = \sigma^2 (X'X)^{-1} .
\]
Furthermore, $\sigma^2$ is estimated by
\[
\hat{\sigma}^2 = \frac{(y - X\hat{\beta})'(y - X\hat{\beta})}{n - p} .
\]
4. REGRESSION ANALYSIS 240
The analysis of variance has been treated in detail, but the interpretation of the various quadratic
forms in the regression model is discussed here.
The observations are represented by the dots. In figure 4.1 the observations are projected onto
the y-axis without regard to the x-values. The variation between these projected values, depicted
by crosses, reflects the total variation in the y-values. If we did not know about the x-values, we
would ascribe this variation to chance.
In figure 4.2 a straight line is fitted to the data and the observations projected parallel to the fitted
line. The variation among these projected points is the variation about the regression line and is
now regarded as the chance variation. The amount of decrease in the variation is said to be the
variation explained by the regression or the variation due to regression.
In figure 4.3 the predicted values, which are obtained by projecting the data points vertically onto
the regression line, are in turn projected onto the y-axis. The variation between these projected
points is said to be due to the regression because there would have been no such variation if the
slope of the fitted line had been zero, that is if y had not depended on x.
The residual and total rows of the ANOVA table are

Residual: $\sum(y_i - \hat{y}_i)^2$, d.f. $n-p$, MS $\sum(y_i - \hat{y}_i)^2/(n-p)$, E(MS) $\sigma^2$
Total: $\sum(y_i - \bar{y})^2$, d.f. $n-1$

where in the E(MS) column $\beta_1$ is the $(p-1) \times 1$ vector obtained by deleting the first element of $\beta$, and $A_{11}$ is the matrix obtained by deleting the first row and column of $(X'X)^{-1}$. In the case $p = 2$, that is the regression of y on one variable x, say $y = \beta_1 + \beta_2 x + \epsilon$, we have
\[
E\Bigl(\sum(\hat{y}_i - \bar{y})^2/(p-1)\Bigr) = \sigma^2 + \beta_2^2\sum(x_i - \bar{x})^2 .
\]
The following data, with the dependent variable the time needed to complete a race (y) and the
independent variable the time spent on exercise in the two months prior to the race (x) were
analysed with regard to the relation between the dependent and the independent variable.
x : 178.6 221.6 190.5 227.7 206.2 245.6 209.1 260.3 212.2 280.1
y: 27.1 39.8 32.6 42.3 33.6 44.8 33.8 50.9 38.3 54.1
A straight line was fitted and a regression analysis was performed with the following SAS JMP
output in figure 4.4:
Since the F-value in the computer output is very large, the hypothesis H0 : β0 = β1 = 0 is rejected
and the assumption that the straight line describes the relation between x and y, can be accepted.
“R-squared” given in the computer output is defined as the percentage of the variation in y which
is accounted for by the fitted equation. Thus, 97.26% of the variation in y is accounted for by the model.
If we have repeated measurements in some or all of the points, that is if there are sets of equal
rows in the matrix X, a more precise analysis can be made, namely a lack of fit test.
Thus, suppose that $y_{11},\dots,y_{1n_1}$ correspond to one set of equal rows of X, and let $\bar{y}_{1.} = \frac{1}{n_1}\sum_{i=1}^{n_1} y_{1i}$; suppose $y_{21},\dots,y_{2n_2}$ correspond to a second set of equal rows and let $\bar{y}_{2.} = \frac{1}{n_2}\sum_{i=1}^{n_2} y_{2i}$, et cetera. Suppose there are c such sets with sizes $n_1,\dots,n_c$ such that $\sum_{i=1}^{c} n_i = n$, the total sample size.
Then
\[
SS_{\text{Error}} = \sum_{i=1}^{c}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_{i.})^2
\]
is said to be the sum of squares due to pure error, since the only explanation for a variation among
observations which correspond to equal rows of X is random variation.
The difference between the observed mean value ȳi. and the predicted value for the i-th set ŷi , is
an indication of how well the model y = Xβ + fits the data. We write
\[
SS_{\text{Lack of fit}} = \sum_{i=1}^{c} n_i(\hat{y}_i - \bar{y}_{i.})^2 .
\]
SSLack of fit is a measure of the adequacy of the suggested model. We can write
Figure 4.7: Differences between means and predicted values suggest the model is inadequate
Lack of fit: $\sum_{i=1}^{c} n_i(\hat{y}_i - \bar{y}_{i.})^2$, d.f. $c-p$, MS $SS_{\text{Lof}}/(c-p)$, E(MS) $\sigma^2 + \xi^2/(c-p)$
Error: $\sum_{i=1}^{c}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_{i.})^2$, d.f. $n-c$, MS $SS_E/(n-c)$, E(MS) $\sigma^2$
Total: $\sum_i\sum_j (y_{ij} - \bar{y}_{..})^2$, d.f. $n-1$
where
\[
\xi^2 = \sum_{i=1}^{c} n_i\bigl(E(\bar{y}_{i.}) - X_i'\beta\bigr)^2
\]
and where X 0i is the row of X corresponding to the i-th group of observations. We see that ξ = 0
if, for all i,
E(ȳi. ) = X 0i β
that is if the model contains all important independent variables (which have an influence on y).
Example 4.1.
A study was performed on the number of parts assembled in a factory as a function of the time
spent on a specific job. Twelve employees, divided into three groups, were assigned to three time
intervals, with the following results:
Solution 4.1.
(a) A straight line was fitted first to the data, with the time levels coded $-1$, $0$ and $1$ (four employees per level):
\[
X' = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ -1 & -1 & -1 & -1 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \end{pmatrix}
\]
\[
X'X = \begin{pmatrix} 12 & 0 \\ 0 & 8 \end{pmatrix}, \qquad
(X'X)^{-1} = \begin{pmatrix} \frac{1}{12} & 0 \\ 0 & \frac{1}{8} \end{pmatrix}
\]
With $y' = (27,\ 32,\ 26,\ \dots,\ 52,\ 49)$,
\[
X'y = \begin{pmatrix} 469 \\ 77 \end{pmatrix}
\]
\[
\hat{\beta} = (X'X)^{-1}X'y
= \begin{pmatrix} \frac{1}{12} & 0 \\ 0 & \frac{1}{8} \end{pmatrix}\begin{pmatrix} 469 \\ 77 \end{pmatrix}
= \begin{pmatrix} 39.0833 \\ 9.625 \end{pmatrix} .
\]
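The estimate β̂ can be verified with NumPy; the sketch below rebuilds the design matrix from the coded levels and reuses the totals X′y = (469, 77)′ given above:

```python
import numpy as np

# Rebuild the design matrix (intercept plus the coded levels -1, 0, 1, four
# employees per level) and reuse the totals X'y = (469, 77)' from the solution.
X = np.column_stack([np.ones(12), np.repeat([-1.0, 0.0, 1.0], 4)])
XtX = X.T @ X                          # diag(12, 8)
Xty = np.array([469.0, 77.0])

beta_hat = np.linalg.solve(XtX, Xty)
print(beta_hat)                        # [39.0833..., 9.625]
```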
with
\[
\sum_i\sum_j (y_{ij} - \bar{y}_{..})^2 = 982.9167 .
\]
The within-group sums of squares (deviations from the group means $\bar{y}_{1.} = 29.75$, $\bar{y}_{2.} = 38.5$ and $\bar{y}_{3.} = 49$) are
\[
\sum_j (y_{1j} - \bar{y}_{1.})^2 = (27 - 29.75)^2 + \dots + (34 - 29.75)^2 = 44.75
\]
\[
\sum_j (y_{2j} - \bar{y}_{2.})^2 = (35 - 38.5)^2 + \dots + (47 - 38.5)^2 = 169
\]
\[
\sum_j (y_{3j} - \bar{y}_{3.})^2 = (45 - 49)^2 + \dots + (49 - 49)^2 = 26
\]
\[
SS_{\text{Error}} = \sum_i\sum_j (y_{ij} - \bar{y}_{i.})^2 = 44.75 + 169 + 26 = 239.75
\]
\[
SS_{\text{Reg}} = \sum_{i=1}^{c} n_i(\hat{y}_i - \bar{y}_{..})^2
= 4(29.4583 - 39.0833)^2 + 4(39.0833 - 39.0833)^2 + 4(48.7083 - 39.0833)^2 = 741.125
\]
\[
SS_{\text{Lof}} = \sum_{i=1}^{c} n_i(\hat{y}_i - \bar{y}_{i.})^2
= 4(29.4583 - 29.75)^2 + 4(39.0833 - 38.5)^2 + 4(48.7083 - 49)^2 = 2.0417
\]
Source        SS                                                 d.f.        MS
Regression    $\sum n_i(\hat{y}_i - \bar{y}_{..})^2 = 741.125$   $p-1 = 1$   741.125
Lack of fit   $\sum n_i(\hat{y}_i - \bar{y}_{i.})^2 = 2.0417$    $c-p = 1$   2.0417
Error         $\sum\sum(y_{ij} - \bar{y}_{i.})^2 = 239.75$       $n-c = 9$   26.6389
Total         $\sum\sum(y_{ij} - \bar{y}_{..})^2 = 982.9167$     $n-1 = 11$
Since $f = MS_{\text{Lof}}/MS_E = 2.0417/26.6389 \approx 0.0766$ is far below the critical value $F_{0.05;1;9} = 5.12$, the lack of fit is not significant and the straight-line model is not contradicted by the data.
If the lack of fit test is significant, one will have to rethink the model: more terms may have to be
included, the variables may have to be transformed or one may have to design a completely new
experiment.
In many practical applications of regression analysis the straight line seems to be inadequate to
deal with the complexity of the problem. This problem can be solved by application of multiple
regression or polynomial regression.
In certain types of problems one may not know exactly which kind of function to fit to a given set
of data, since many functions may be written in the approximate form
f (x) = c0 + c1 x + c2 x2 + · · ·.
It is often possible to approximate the unknown function by means of a polynomial. This fit is
usually fairly close over a limited interval. In matrix form the approximate model then is
[ y1 ]   [ 1  x1  · · ·  x1^k ] [ β0 ]   [ ε1 ]
[ ⋮  ] = [ ⋮   ⋮          ⋮   ] [ ⋮  ] + [ ⋮  ]
[ yn ]   [ 1  xn  · · ·  xn^k ] [ βk ]   [ εn ]
β̂ = (X′X)⁻¹X′y, that is,

[ β̂0 ]   [ n        Σxi        Σxi²       · · ·  Σxi^k    ]⁻¹ [ Σyi      ]
[ β̂1 ]   [ Σxi      Σxi²       Σxi³       · · ·  Σxi^(k+1)]   [ Σxi yi   ]
[  ⋮ ] = [  ⋮                                     ⋮       ]   [  ⋮       ]
[ β̂k ]   [ Σxi^k    Σxi^(k+1)  Σxi^(k+2)  · · ·  Σxi^(2k) ]   [ Σxi^k yi ]
1. If one fits a polynomial of degree k, that is, one with p = k + 1 parameters, then one has to
have at least k + 2 observations (that is, n > k + 1), and at least k + 1 different values of x
(that is, c ≥ k + 1, where c is defined in section 4.2). If the last restriction is violated, X′X
will be singular. If the first restriction is violated, the degrees of freedom for estimating
σ² will be n − p = n − k − 1 ≤ 0.
Figure 4.8:
4. There is a particular danger in using a fitted polynomial to extrapolate far outside the data.
As in figure 4.8 the polynomial is likely to become completely inappropriate outside the range
of observation.
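The fit itself reduces to assembling X column by column and solving the normal equations displayed above. A minimal sketch (illustrative data only; numpy assumed available) for a quadratic, so that p = 3:

```python
import numpy as np

# Illustrative data: n = 6 observations, 6 distinct x-values, degree k = 2,
# so n > k + 1 and c >= k + 1, and the normal equations have a unique solution.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 7.2, 12.8, 21.1, 30.9])
k = 2

X = np.vander(x, k + 1, increasing=True)   # columns 1, x, x^2, ..., x^k
XtX = X.T @ X                              # entries are the sums of x_i^(j+l)
beta = np.linalg.solve(XtX, X.T @ y)       # solves (X'X) beta = X'y
print(beta)                                # estimates b0, b1, ..., bk
```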
• Illustration
An investigator wants to determine the relation between the dependent variable y and the inde-
pendent variable x in the following data set:
x : 38 45 49 57 69 78 84 89 79 64 54 41
y : 99 91 78 61 55 63 80 95 65 56 74 93
The SAS JMP output for the regression analysis can be seen in figure 4.9.
For the straight-line fit, R² = 0.1241; thus only 12.41% of the variability in y is explained by the model. Among the regression coefficients only the intercept is significant, with a p-value close to zero.
For the quadratic fit, R² = 0.9197; thus 91.97% of the variability in y is explained by the model. The regression coefficients are all significant, with p-values less than 0.05.
The quadratic equation is a better fit, as seen in the graph and in the "R-square" values.
y = β0 + β1 x1 + β2 x2 + . . . + βk xk + ε.
The estimated model ŷ = β̂0 + β̂1x1 + · · · + β̂kxk is chosen in such a way as to minimise
SSE = Σᵢ₌₁ⁿ (yi − ŷi)².
The degree of difficulty of the calculations is the basic difference between the fitting of the straight
line and multiple regression models. With regard to the above model, (k + 1) linear equations are
solved simultaneously to obtain the (k + 1) estimated coefficients β̂0 , β̂1 , · · · , β̂k .
Several computer packages are available to perform multiple regression analyses, and since the
output of the programs is similar, the interpretation is fairly simple.
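As a sketch of what such a package computes internally, the function below fits a multiple regression and reports R². The data are made up and numpy is assumed available; this is an illustration, not the output of any particular package.

```python
import numpy as np

def multiple_regression(X_raw, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bk*xk; returns (beta, R2)."""
    X = np.column_stack([np.ones(len(y)), X_raw])   # prepend constant column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_error = resid @ resid
    ss_total = ((y - y.mean()) ** 2).sum()
    return beta, 1.0 - ss_error / ss_total

# Made-up data with k = 2 independent variables
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(20, 2))
y = 3.0 + 2.0 * X_raw[:, 0] - 1.0 * X_raw[:, 1] + rng.normal(scale=0.1, size=20)
beta, r2 = multiple_regression(X_raw, y)
print(beta, round(r2, 4))
```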
4.4.1 Illustration
An investigator suspects that factors x1, x2 and x3 and/or combinations of these factors affect the dependent variable y.
A multiple regression analysis was suggested and performed on the following data set:
y x1 x2 x3
87 40 11 14
133 36 13 30
174 34 19 30
385 41 33 39
363 39 25 33
274 42 23 34
235 40 22 37
104 31 9 20
141 36 13 27
208 34 17 40
115 30 18 19
271 40 23 31
163 37 14 35
193 41 13 28
203 38 24 31
279 38 31 35
179 24 16 26
244 45 19 34
165 34 20 30
257 40 30 38
252 41 22 35
280 42 21 41
167 35 16 23
168 33 18 24
115 36 18 21
The F-statistic indicates that the βs are not all zero, since f = 13.6612 > F0.05;6;18 = 2.66.
R² = 1 − SSError/SSTotal,
thus representing the proportion of the sample variance of the y-values accounted for by the estimated model. In this case R² = 0.8199; thus 81.99% of the variability in y is accounted for by the model.
The p-value in the table is the probability, under the null hypothesis that the coefficient concerned is zero, of observing a test statistic at least as extreme as the one obtained.
Figure 4.11:
A small p-value implies that the coefficient differs significantly from zero at every significance level α > p; for example, if p = 0.08 then the coefficient is significantly different from zero at the 10% level (α = 0.10) but not at the 5% level (α = 0.05).
It often happens that one has a large number of independent variables available, and one has to
make a decision as to which variables to use in the regression equation. There are arguments in
favour of both extreme views: include as many variables as possible and include as few variables
as possible. We shall go into some of these arguments.
y = β0 + β1x1 + β2x2 + ε
and
y = β0* + β1*x1 + ε*.
If β2 = 0 the two models are equivalent, but if β2 ≠ 0 they are different. If β2 ≠ 0 and one
nevertheless fits the second model, then β̂0* and β̂1* will be biased, that is E(β̂0*) ≠ β0* and E(β̂1*) ≠ β1*,
because E(ε*) = E(β2x2 + ε) = β2x2 and thus the assumption E(ε*) = 0 is not true. For given
values of x1 and x2 the predicted value of the corresponding response y will be
ŷ* = β̂0* + β̂1*x1
under the smaller model, and
ŷ = β̂0 + β̂1x1 + β̂2x2
under the larger model.
Thus Var(ŷ) ≥ Var(ŷ*). This result holds generally: every additional variable included in the
equation increases the variance of a prediction.
The inclusion of too many variables in the equation has other serious side-effects as well:
1. The matrix X′X becomes very large and often ill-conditioned (that is, singular or nearly
singular), with the result that numerical accuracy is lost in computing (X′X)⁻¹.
2. A regression equation with many terms is cumbersome to use, and often one may have to
observe variables which are difficult or expensive to obtain in order to compute a prediction.
One should always aim at simplicity (within reason, of course).
Consider the model
y = Xβ + ε
and let
η = x′β
be the true response which we are trying to predict. Let η̂ = x′β̂ be the estimated response, and
let E(η̂) = x′E(β̂) be the expected value of the estimated response. If E(β̂) = β then E(η̂) = η
and the estimator is unbiased. However, a biased estimator may sometimes be preferable, as will
now be seen. Write
η̂ − η = (η̂ − E(η̂)) + (E(η̂) − η).
Taking squares and then expectations on both sides, and noting that the expected value of
(η̂ − E(η̂))(E(η̂) − η) is equal to zero, we obtain
E(η̂ − η)² = E(η̂ − E(η̂))² + (E(η̂) − η)²
or
E(η̂ − η)² = Var(η̂) + (η − E(η̂))².
The term on the left is the mean (or expected) squared error of prediction. The first term on the
right is obviously the variance of η̂ and the last term is the measure of the bias of η̂. Suppose now
that, for a given set of data, we add the variables x1 , x2 · · · , xp to the equation one by one in that
order.
We have already seen that the variance of the prediction will increase after the inclusion of every
variable while the squared bias will decrease to zero if all important variables are included in our
set of variables. As is seen in figure 4.14, the mean squared error is likely to decrease and then
increase: thus an optimum subset of the variables x1 , · · · , xp will exist which yields the optimum
prediction with the smallest mean squared error. One will never know which variables they are,
but one should know that they exist.
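This trade-off can be examined by simulation: predict at a fixed point with the full model and with a reduced model that omits one important variable, and estimate the variance, squared bias and mean squared error of each predictor over many simulated data sets. All numbers below are made up; numpy is assumed available.

```python
import numpy as np

# Monte Carlo check of E(eta_hat - eta)^2 = Var(eta_hat) + (bias)^2,
# comparing the full model with a reduced model that omits x2.
rng = np.random.default_rng(1)
n, reps = 30, 4000
x1 = np.linspace(0.0, 1.0, n)
x2 = x1 ** 2                               # second variable, correlated with x1
b = np.array([1.0, 2.0, 1.5])              # true coefficients (b0, b1, b2)
sigma = 0.5
x0 = np.array([1.0, 0.9, 0.81])            # prediction point (1, x1, x2)
eta = x0 @ b                               # true response at x0

X_full = np.column_stack([np.ones(n), x1, x2])
X_red = X_full[:, :2]                      # reduced model omits x2

pred = {"full": [], "reduced": []}
for _ in range(reps):
    y = X_full @ b + rng.normal(scale=sigma, size=n)
    pred["full"].append(x0 @ np.linalg.lstsq(X_full, y, rcond=None)[0])
    pred["reduced"].append(x0[:2] @ np.linalg.lstsq(X_red, y, rcond=None)[0])

results = {}
for name, values in pred.items():
    v = np.array(values)
    results[name] = (((v - eta) ** 2).mean(), v.var(), (v.mean() - eta) ** 2)
    print(name, "MSE=%.4f Var=%.4f bias^2=%.5f" % results[name])
```

The full model shows the larger variance, the reduced model the larger squared bias, exactly as described above.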
Figure 4.12:
Figure 4.13:
Figure 4.14:
We describe a number of methods which may be employed in order to select the variables for
inclusion in the equation. In the final analysis the choice should not be based on statistical con-
siderations alone, but on knowledge of the practical situation as well. Methods such as stepwise
regression analysis may be used to help you to think about a problem, but can never be a substi-
tute for thinking. None of the methods to be described can be said to be superior to all others,
and any of them may be preferable in some circumstances.
1. The X′X matrix may be ill-conditioned, and it may be virtually impossible to fit the complete
model.
2. If two independent variables are highly correlated, it may well happen that their coefficients
are both insignificant, but that they are jointly significant. Exclusion of both will thus result
in a loss of information.
Significance level
If one tests p coefficients, one must bear in mind that the overall significance level may be larger
than one thinks. A safe procedure would be to use the level α/p for each test, in which case the
overall level will at most be α.
1. If p is large, one may have a tremendous amount of computing to do, and it will be virtually
impossible even to read all the answers which emerge from the computer; the approach is
wasteful in that case.
2. It is virtually impossible to attach a specific significance level to this procedure in any prac-
tical application, since the distributional properties of the process are too complicated.
Forward selection
The forward selection method works as follows: First select the independent variable which is
most highly correlated with y. If this correlation coefficient is r1 say, then R12 = r12 is the squared
multiple correlation coefficient between y and the variable selected. Next, that variable is selected
which has the highest partial correlation coefficient with y, given the variable already selected.
This variable also causes the largest increase in R2 . Thus the variables selected in the first two
steps have the largest R2 with y of all pairs of variables which include the variable selected first.
For example, if x3 is included first and then x6 , then R2 of y with x3 and x6 will be larger than
R2 of y with x3 and x1 , with x3 and x2 , x3 and x4 , et cetera. In this way one proceeds until the
addition of further variables causes the R2 to increase by very small amounts.
1. From a distribution theoretical point of view this approach is so complex that it is impossible
to maintain a specified significance level. The cutoff point has to be selected subjectively.
2. It may happen that the subset selected at a particular step is not the optimal subset. For
example, x3 , x4 and x6 may be selected after three steps, while x2 , x5 and x6 may have a
larger R2 with y. In practice it has been found that the subset selected by this method
differs only rarely from the best subset, and usually the optimal R2 is only slightly higher
than the R2 selected. Theoretically the differences may be substantial, however.
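A minimal version of forward selection, driven purely by the increase in R², might look as follows (made-up data; numpy assumed available; the cutoff min_gain plays the role of the subjectively chosen stopping rule mentioned above):

```python
import numpy as np

def r_squared(cols, y):
    """R^2 of the least-squares regression of y on a constant plus cols."""
    X = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

def forward_select(X_raw, y, min_gain=0.01):
    """Add, at each step, the variable giving the largest increase in R^2."""
    selected, best = [], 0.0
    remaining = list(range(X_raw.shape[1]))
    while remaining:
        gains = [(r_squared([X_raw[:, c] for c in selected]
                            + [X_raw[:, j]], y), j) for j in remaining]
        new_best, j_best = max(gains)
        if new_best - best < min_gain:       # R^2 increases too little: stop
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best = new_best
    return selected, best

# Made-up data: only variables 1 and 3 actually influence y
rng = np.random.default_rng(2)
X_raw = rng.normal(size=(40, 5))
y = 2.0 * X_raw[:, 1] - 3.0 * X_raw[:, 3] + rng.normal(scale=0.5, size=40)
selected, best = forward_select(X_raw, y)
print(selected, round(best, 4))
```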
Backward elimination
This method is the reverse of the previous one. The full set is selected at first. Next, the variable
which will cause the smallest decrease in R2 if deleted from the equation is eliminated and the
equation is recomputed. This process is continued until the decrease in R2 is intolerably large,
and the previous equation is selected. This procedure is preferred to the previous one by many
authors, but it has the disadvantages of the previous one:
Stepwise regression
This is a variation on the forward selection method and is used very widely. After the inclusion
of each variable, all variables already in the equation are scrutinised to see whether any one of
them could be deleted without decreasing R² by much. Suppose x4 is selected first, then x3 and
x5 . After three steps it may be found that x4 is no longer necessary since it is highly correlated
with x3 and x5 , and the information about y contained in x4 is also contained in x3 and x5 . The
disadvantages of this method are:
2. Even though one is required to supply an "F-level for inclusion" and an "F-level for exclusion",
it is impossible to attach a specific significance level to the procedure. Incidentally, the “F-
levels” commonly used in existing computer programs actually refer to critical values of the
F-distribution, ostensibly with 1 and n − k − 1 degrees of freedom at the k-th step, but exact
critical values for this procedure are unobtainable. The best recommendation is to select
small “critical values” and let the procedure continue for as many steps as are possible, and
then to make a subjective judgement based on the series of R2 s and any other (non-statistical)
information available.
One of the most widely used methods is the forward stepwise regression.
Recall when using the p-values, a coefficient is significant if the p-value is less than α. It means
H0 : βi = 0 is rejected in favour of H1 : βi 6= 0 with the probability of a type I error equal to α.
Before performing a stepwise regression we choose two alpha values, called αentry and αstay
respectively, which determine whether a variable enters or stays in the model. The αentry is the "probability of a type
I error related to entering an independent variable into the regression model”. The αstay is the
“probability of a type I error related to retaining an independent variable previously entered into
the model.” Normally the value of 0.05 is used for both αentry and αstay .
Thus, a variable will enter if its p-value is less than 0.05 and a variable will remain if its p-value
after adding the other variable is less than 0.05. For example, suppose we have three independent
variables x1 , x2 and x3 . If we perform simple linear regression of each variable on y and we obtain
the following p-values for each model:
y = β0 + β1x1 + ε      p-value (0.0004)
y = β0 + β1x2 + ε      p-value (0.006)
y = β0 + β1x3 + ε      p-value (0.002)
Of these variables x1 is the most significant one, thus the model y = β0 + β1x1 + ε will be
considered, since its p-value of 0.0004 is highly significant. Thus the hypothesis H0 : β1 = 0 is
rejected since 0.0004 is less than αentry = 0.05.
In the next step suppose the possible regression models with two independent variables gave the
models of the form:
y = β0 + β1x1 + β2x2 + ε      p-values (0.0015) (0.0163)
y = β0 + β1x1 + β2x3 + ε      p-values (0.0506) (0.1225)
Note that in the second equation the p-value for β2 is now 0.1225, which is greater than 0.05;
thus x3 cannot stay in the model. For the first model, however, the p-value for β2 is 0.0163, thus
x1 and x2 are retained for further use in the stepwise regression.
Suppose that when all three variables are entered the model is:
y = β0 + β1x1 + β2x2 + β3x3 + ε      p-values (0.0185) (0.0578) (0.4627)
The p-value for β3 is 0.4627, which is greater than 0.05. Thus it is not significant at the αentry
level, so this model cannot be chosen and we come back to
y = β0 + β1x1 + β2x2 + ε.
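A simplified sketch of this procedure, using coefficient p-values with αentry = αstay = 0.05, is given below. It is an illustration only (made-up data; numpy and scipy assumed available) and omits the safeguards of a production routine.

```python
import numpy as np
from scipy import stats

def coef_pvalues(cols, y):
    """Two-sided t-test p-values; a constant column is prepended."""
    n = len(y)
    X = np.column_stack([np.ones(n)] + cols)
    k = X.shape[1]
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    s2 = resid @ resid / (n - k)
    se = np.sqrt(s2 * np.diag(XtX_inv))
    return 2.0 * stats.t.sf(np.abs(beta / se), n - k)

def stepwise(X_raw, y, a_entry=0.05, a_stay=0.05):
    """Simplified stepwise selection on coefficient p-values."""
    selected = []
    while True:
        # Entry step: candidate with the smallest p-value, if below a_entry.
        entries = [(coef_pvalues([X_raw[:, c] for c in selected]
                                 + [X_raw[:, j]], y)[-1], j)
                   for j in range(X_raw.shape[1]) if j not in selected]
        if not entries or min(entries)[0] >= a_entry:
            return selected
        selected.append(min(entries)[1])
        # Stay step: drop variables whose p-value now exceeds a_stay.
        while selected:
            pvs = coef_pvalues([X_raw[:, c] for c in selected], y)[1:]
            worst = int(np.argmax(pvs))
            if pvs[worst] <= a_stay:
                break
            selected.pop(worst)

# Made-up data: only variables 0 and 2 actually influence y
rng = np.random.default_rng(3)
X_raw = rng.normal(size=(50, 4))
y = 1.5 * X_raw[:, 0] + 2.0 * X_raw[:, 2] + rng.normal(size=50)
chosen = stepwise(X_raw, y)
print(chosen)
```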
• Application to illustration 4.4.1 A stepwise regression was performed on the data of illus-
tration 4.4.1.
Note: The stepwise regression algorithm allows an X-variable, brought into the model at an earlier
stage, to be dropped subsequently if it is no longer helpful in conjunction with variables
added at later stages.
The comparison of “R-squared (Adj for df)” in figure 4.15 and figure 4.16 suggests that the full
regression is unnecessary, since the analysis with two factors in figure 4.16 provides satisfactory
results.
Sometimes one may be confronted with a mass of historical data, that is data collected rather
haphazardly over a period of time, and one may be tempted to try building a regression model
from this data. Some authors advise strongly against doing this at all, or they advise using such
a study only as a guidance for planning more systematic experiments. Some of the hidden pitfalls
in such historical data are as follows:
1. The whole process may have changed over time and the data may no longer be applicable to
the present situation.
2. Such data usually show a highly correlated structure among the “independent” variables. The
result is that the estimated coefficients are highly correlated and it is impossible to judge
their significance independently. In some studies, for example a number of measurements on
a number of people, such a correlation structure may be unavoidable and may in fact contain
information about the structure underlying the data. However, in many situations, such as
the daily records of a chemical plant, this may be due to the way in which the plant is run.
3. In a chemical plant, as in many other experimental situations, important variables are usually
controlled very strictly since small fluctuations may have a significant effect on the response.
The result is that the importance of such a variable will not be reflected by the data. This
will certainly lead to inappropriate conclusions.
4. Other more subtle effects may be reflected in the data without the statistician ever know-
ing about them. In a chemical plant, two variables which influence the response may be
temperature and pressure. An increase in either of these variables will increase the yield,
but if either of them rises too high, it may cause an explosion. Therefore the operator has
strict instructions to turn down the heat immediately the pressure increases; the result will
of course be a decrease in yield. However, if one analyses the historical data, it will appear
as if an increase in pressure causes a decrease in yield − the way in which the experiment is
run is confused with inherent properties of the system.
1. One will try to select the design matrix X in such a way that (X′X)⁻¹ is a diagonal matrix or
as close to diagonal as possible (except perhaps for the first row and column which represent
the constant term). This will enable one to estimate and judge the effect of each variable
independently.
2. Secondly, all variables which are expected to influence the yield must be varied deliberately,
and as many times as are needed. Even if the chemical engineer is not willing to vary
a certain variable, one may be able to convince him or her that small changes will not
upset the whole system. The variance of the estimated coefficient β̂j of the j-th variable is
σ²/Σh(xjh − x̄j.)². In order to obtain a precise estimate of βj, the variance of β̂j must be
small, that is, Σh(xjh − x̄j.)² must be large. This may be achieved either by varying the
variable xj substantially or by choosing a large sample size (provided not all values of xj are
equal to x̄j.).
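The effect of the spread of xj on the precision of β̂j can be seen numerically: doubling the spread of the x-values quarters the variance of the estimated slope (illustrative numbers; numpy assumed available).

```python
import numpy as np

# Var(beta_hat_j) = sigma^2 / sum_h (x_jh - xbar_j)^2 for a single slope.
sigma = 2.0
x_narrow = np.array([4.0, 4.5, 5.0, 5.5, 6.0])
x_wide = np.array([3.0, 4.0, 5.0, 6.0, 7.0])   # twice the spread about the mean

def slope_var(x, sigma):
    return sigma ** 2 / ((x - x.mean()) ** 2).sum()

print(slope_var(x_narrow, sigma), slope_var(x_wide, sigma))
```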
Exercise 4.1
1. (a) Suppose the following data have been obtained from five randomly selected objects
(regression of y on x):
x 1 3 5 7 9
y 4 3 6 8 9
(iii) Test at the 5% level whether the expected value of y (at x = 15) is 20.
x 0 1 2 3 4
y 110 90 76 85 91
2. Fourteen observations were obtained on the yield from a chemical experiment. In this exper-
iment, temperature, concentration and length of agitation were set by the experimenter.
X1 X2 X3 Y
Temperature Concentration Time Yield
10 40 20 22
10 40 20 33
15 40 20 36
15 40 20 33
10 40 25 26
15 40 25 37
10 50 20 37
10 50 20 40
15 50 20 42
15 50 20 45
10 50 25 39
10 50 25 42
15 50 25 45
15 50 25 47
X′X = [  14    175    640    310
        175   2275   8000   3875
        640   8000  29600  14200
        310   3875  14200   6950 ]

where X = (1, x1, x2, x3), and

(X′X)⁻¹ = [ 11.9857  −0.1429  −0.115   −0.220
            −0.1429    0.0114   0        0
            −0.115     0        0.003   −0.001
            −0.220     0       −0.001    0.012 ]
X′y = [   524
         6665
        24330
        11660 ]

and the sum of squared residuals is Σ(yi − ŷi)² = 96.5571 (i = 1, . . . , 14).
(a) Fit a multiple regression model to the data. Give the values of β0 , β1 , β2 and β3 .
(b) Give a 95% confidence interval for β2 . Can you conclude that H0 : β2 = 0?
y = β1 + β2x + β3x² + ε
to the data
(ii) by first reducing the variables as described in section 2.8. Draw a graph of the data
and the fitted regression curve.
(e) Test, at the 5% level, the hypothesis that the expected response y at x = 1 is 10.
x: −1 −1 −1 0 0 0 1 1 1
y: 11 16 18 7 9 14 8 9 16
(a) Fit a straight line and perform a lack of fit test at the 5% level.
(b) Fit a second-degree polynomial and test by means of an F-test at the 5% level whether
the quadratic term is zero. Compare your answer with that of (a).
(c) Draw a graph of the data and the two regression lines of (a) and (b). Comment on the
graph with reference to your conclusions in (a) and (b).
yi = β0 + β1xi + εi ; i = 1, · · · , n; εi independent n(0; σ²).
Find a 100(1 − α)% confidence interval for β0 + β1 x. Regard the lower and upper limits for
β0 + β1 x as functions of x and derive their form. Sketch β0 + β1 x and the lower and upper
limits as functions of x.
xi yi xi yi
17 37.75 25 45.20
17 37.42 25 46.70
17 36.62 25 48.10
22 41.15 28 52.45
22 42.65 28 50.21
22 39.70 28 51.75
(b) Assume that

(X′X)⁻¹ = [  2.75505  −0.11616
            −0.11616   0.00505 ]

and show through calculation that ŷ = 14.2139 + 1.2966x.
7. A marketing research firm has obtained the following prescription sales data.
y = average weekly prescription sales over the past year (in units of $1 000)
x1 = floor space (in square feet)
x2 = percentage of floor space allocated to the prescription department
x3 = number of parking spaces available for the store
x4 = monthly per capita income for the surrounding community (in units of $100)
x5 = an indicator variable: equals 1 if the pharmacy is located in a shopping centre
and equals 0 otherwise.
(b) Based on pairwise correlation and density ellipses (use a 95% confidence ellipse), which
variable is the best single predictor?
(g) The following stepwise output was obtained with probability of entry = 0.15 and prob-
ability of leaving equal to 0.15.
Chapter 5
ANALYSIS OF COVARIANCE
5.1 Introduction
In chapter 2 it was pointed out that the analysis of covariance is a hybrid between analysis of
variance and regression analysis. As in the analysis of variance, we have a number of factors
(treatments or blocking factors) which define subsets among the experimental units. In addition
there are (continuous) independent variables which are assumed to have a linear effect on the
response. In general the analysis of covariance is used to eliminate the linear or higher degree
relation between one or more independent variables (called covariates) and a dependent variable
(yield), with the main objective to be able to determine the effect of the factor(s) (treatment) on
the dependent variable. Analysis of covariance can also be regarded as the simultaneous study of
several regressions. Because the general techniques are complicated, we confine ourselves to a
one-way analysis of variance with a linear regression on one single variable. Here the independent
variable (covariate) is related to the dependent (response) variable, but is not itself affected by the
factor (treatment). It is often impossible to control this covariate at a constant level, but it can
be measured together with the dependent (response) variable.
yij = αi + βixij + εij ; i = 1, 2; j = 1, · · · , ni
where ε11, · · · , ε2n2 are independent n(0; σ²) variates, with αi the effect of the factor being investigated.
5. ANALYSIS OF COVARIANCE 284
If the model is accepted, the situation shown in figure 5.1 might arise:
Figure 5.1:
Here we have three regression lines, each with a specific slope. The differences between the lines
do not necessarily relate to the treatments, thus it is impossible to investigate the differences in y
for the three treatments.
If the slopes are equal, that is, the regression lines are parallel, we accept the model
yij = αi + βxij + εij
Figure 5.2:
Now the differences with regard to y can be observed directly, thus the comparison of the three
equations is possible. The parameters can be estimated and the differences in y, for any value of
x, can be determined.
We start with an illustrative example, and then discuss two of the simplest models contained in
the general theory. The analysis of covariance is of course part of the broad theory of the general
linear model.
5.2 Illustration
A wholesaler wants to increase the sales of a certain product. He selects ten dealers whom he
believes to be comparable, and requests five of them to display his product close to the entrance
of the shop. The other five are requested to place the product close to the cash registers. The
results are as follows (units sold in one week):
Source SS d.f. MS F
Treatments 10 1 10 0.24
Error 330 8 41.25
Total 340 9
The F-statistic is not significant at any of the usual levels, and it would seem that there is no
difference between the responses to the two display methods.
However, at this stage somebody suggested that the sales before treatment should also be taken
into account, and the records show the following:
Figure 5.4:
Figure 5.4 suggests that one might fit a regression line to each group of data points and compare
the regression lines in some way or other. This is the purpose of the analysis of covariance.
The model often suggested for this type of problem is the following:
yij = αi + βxij + εij ; i = 1, 2; j = 1, · · · , ni
where the εij are independent n(0; σ²) variates. Note especially the following two assumptions: the
two regression lines have the same slope β, and the two error variances are equal.
In section 5.4 the model with unequal slopes is discussed. An F-test for the equality of the vari-
ances may be constructed − if the variances are unequal, one is confronted with the Behrens-Fisher
problem.
[ y11  ]   [ 1  0  x11  ]            [ ε11  ]
[  ⋮   ]   [ ⋮  ⋮   ⋮   ]  [ α1 ]    [  ⋮   ]
[ y1n1 ] = [ 1  0  x1n1 ]  [ α2 ] +  [ ε1n1 ]
[ y21  ]   [ 0  1  x21  ]  [ β  ]    [ ε21  ]
[  ⋮   ]   [ ⋮  ⋮   ⋮   ]            [  ⋮   ]
[ y2n2 ]   [ 0  1  x2n2 ]            [ ε2n2 ]
Example 5.1.
Using the data given in section 5.2, perform an analysis of covariance, assuming equal slopes, at
the 5% level of significance.
Solution 5.1.
x̄1. = 40    ȳ1. = 50    x̄2. = 50    ȳ2. = 52
Σ(x1j − x̄1.)² = (32 − 40)² + (38 − 40)² + (40 − 40)² + (44 − 40)² + (46 − 40)²
              = (−8)² + (−2)² + 0² + 4² + 6²
              = 120

Σ(x2j − x̄2.)² = (44 − 50)² + (45 − 50)² + (51 − 50)² + (53 − 50)² + (57 − 50)²
              = (−6)² + (−5)² + 1² + 3² + 7²
              = 120

Σ y1j(x1j − x̄1.) = 42(32 − 40) + 45(38 − 40) + . . . + 59(46 − 40) = 140

Σ y2j(x2j − x̄2.) = 44(44 − 50) + 47(45 − 50) + . . . + 58(57 − 50) = 130
β̂ = [Σ y1j(x1j − x̄1.) + Σ y2j(x2j − x̄2.)] / [Σ(x1j − x̄1.)² + Σ(x2j − x̄2.)²]
  = (140 + 130)/(120 + 120)
  = 270/240
  = 1.125

α̂1 = ȳ1. − β̂ x̄1. = 50 − 1.125(40) = 5
α̂2 = ȳ2. − β̂ x̄2. = 52 − 1.125(50) = −4.25
SSR = Σᵢ Σⱼ (yij − α̂i − β̂xij)²

with

Σⱼ(y1j − α̂1 − β̂x1j)² = (42 − 5 − 1.125 × 32)² + . . . + (59 − 5 − 1.125 × 46)²
                      = 16.875

Σⱼ(y2j − α̂2 − β̂x2j)² = (44 + 4.25 − 1.125 × 44)² + . . . + (58 + 4.25 − 1.125 × 57)²
                      = 9.375

so that

SSR = 16.875 + 9.375 = 26.25
σ̂² = SSR/(n1 + n2 − 3)
   = 26.25/(5 + 5 − 3)
   = 26.25/7
   = 3.75
s²x = Σ(x1j − x̄1.)² + Σ(x2j − x̄2.)² = 120 + 120 = 240
H0 : α1 = α2

f = (α̂1 − α̂2)² / { σ̂² [ 1/n1 + 1/n2 + (x̄1. − x̄2.)²/s²x ] }
  = (5 − (−4.25))² / { 3.75 [ 1/5 + 1/5 + (40 − 50)²/240 ] }
  = (9.25)² / { 3.75 [ 0.4 + 100/240 ] }
  = 85.5625 / (3.75 × 0.8166667)
  = 85.5625 / 3.0625
  ≈ 27.9388
Since 27.9388 > 5.59, we reject H0 at the 5% level of significance and conclude that the lines differ
significantly.
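The equal-slopes computations of Solution 5.1 follow a fixed recipe which is easy to program. The function below implements the pooled-slope formulas and the F-statistic for H0 : α1 = α2; the data are made up (they are not the data of section 5.2), and numpy and scipy are assumed available.

```python
import numpy as np
from scipy import stats

def ancova_equal_slopes(x1, y1, x2, y2):
    """Common-slope ANCOVA for two groups; F-test of H0: alpha1 = alpha2."""
    n1, n2 = len(x1), len(x2)
    sxx1 = ((x1 - x1.mean()) ** 2).sum()
    sxx2 = ((x2 - x2.mean()) ** 2).sum()
    beta = (y1 @ (x1 - x1.mean()) + y2 @ (x2 - x2.mean())) / (sxx1 + sxx2)
    a1 = y1.mean() - beta * x1.mean()
    a2 = y2.mean() - beta * x2.mean()
    ssr = ((y1 - a1 - beta * x1) ** 2).sum() + ((y2 - a2 - beta * x2) ** 2).sum()
    s2 = ssr / (n1 + n2 - 3)                       # sigma-hat squared
    s2x = sxx1 + sxx2
    f = (a1 - a2) ** 2 / (s2 * (1 / n1 + 1 / n2
                                + (x1.mean() - x2.mean()) ** 2 / s2x))
    return beta, a1, a2, f, stats.f.sf(f, 1, n1 + n2 - 3)

# Made-up data (not the data of section 5.2)
x1 = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y1 = np.array([5.0, 8.0, 12.0, 15.0, 19.0])
x2 = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y2 = np.array([10.0, 13.0, 17.0, 20.0, 24.0])
beta, a1, a2, f, p = ancova_equal_slopes(x1, y1, x2, y2)
print(round(beta, 4), round(a1, 4), round(a2, 4), round(f, 4))
```

The closed-form estimates agree with a least-squares fit of the design matrix of section 5.3 (two indicator columns and the covariate), which is a convenient check.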
If one is unwilling to assume that the two regression lines are parallel, one may select the following
model:
yij = αi + βixij + εij ; i = 1, 2; j = 1, · · · , ni .
In this model we estimate (α1, β1) and (α2, β2) independently, but σ² jointly from the two samples:
β̂i = Σⱼ yij(xij − x̄i.) / Σⱼ(xij − x̄i.)²  and  α̂i = ȳi. − β̂i x̄i.
Similarly, we might wish to test whether the response for each population is equal at a particular
point x, that is we might test H0 : α1 + β1 x = α2 + β2 x by comparing
Example 5.2.
Using data in section 5.2, perform an analysis of covariance assuming unequal slopes at the 5%
level of significance.
Solution 5.2.
Recall

β̂1 = Σ y1j(x1j − x̄1.) / Σ(x1j − x̄1.)² = 140/120 = 7/6

β̂2 = Σ y2j(x2j − x̄2.) / Σ(x2j − x̄2.)² = 130/120 = 13/12
With α̂2 = ȳ2. − β̂2 x̄2. = 52 − (13/12)(50) = −13/6,

Σⱼ(y2j − α̂2 − β̂2 x2j)² = (44 + 13/6 − (13/12)(44))² + . . . + (58 + 13/6 − (13/12)(57))²
                        = (−3/2)² + (5/12)² + (23/12)² + (3/4)² + (−19/12)²
                        = 55/6

Similarly Σⱼ(y1j − α̂1 − β̂1 x1j)² = 50/3, so that

σ̂² = (50/3 + 55/6)/(n1 + n2 − 4)
   = (155/6)/6
   = 155/36
H0 : β1 = β2

f = (β̂1 − β̂2)² / { σ̂² [ 1/Σ(x1j − x̄1.)² + 1/Σ(x2j − x̄2.)² ] }
  = (7/6 − 13/12)² / { (155/36)(1/120 + 1/120) }
  = (1/12)² / (31/432)
  = 3/31
  ≈ 0.0968
Since 0.0968 < 5.99, we do not reject H0 at the 5% level of significance and conclude that the
assumption of equal slope is valid. Parallel lines seem to fit the data quite well.
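The test for equal slopes can be programmed in the same way; again the data below are made up, and numpy and scipy are assumed available.

```python
import numpy as np
from scipy import stats

def equal_slopes_test(x1, y1, x2, y2):
    """Separate lines per group; F-test of H0: beta1 = beta2."""
    n1, n2 = len(x1), len(x2)
    sxx1 = ((x1 - x1.mean()) ** 2).sum()
    sxx2 = ((x2 - x2.mean()) ** 2).sum()
    b1 = y1 @ (x1 - x1.mean()) / sxx1
    b2 = y2 @ (x2 - x2.mean()) / sxx2
    a1 = y1.mean() - b1 * x1.mean()
    a2 = y2.mean() - b2 * x2.mean()
    ssr = ((y1 - a1 - b1 * x1) ** 2).sum() + ((y2 - a2 - b2 * x2) ** 2).sum()
    s2 = ssr / (n1 + n2 - 4)                 # four parameters estimated
    f = (b1 - b2) ** 2 / (s2 * (1 / sxx1 + 1 / sxx2))
    return b1, b2, f, stats.f.sf(f, 1, n1 + n2 - 4)

# Made-up data with nearly parallel lines
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y1 = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])
x2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y2 = np.array([5.0, 7.2, 8.9, 11.1, 13.0, 14.8])
b1, b2, f, p = equal_slopes_test(x1, y1, x2, y2)
print(round(b1, 4), round(b2, 4), round(f, 4), round(p, 4))
```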
The procedure may obviously be extended to more than two regression lines and more than one
independent variable.
5.5 Illustration
A company wants to investigate the effect of heartbeat per minute (< 60 or > 60) on the results
(y) obtained in an intensive, very demanding one-month course. It is expected that the heartbeat
is related to the mean number of kilometres (x) the course attendants ran/walked monthly in the
year prior to the course. The following results were obtained:
< 60 > 60
x y x y
8 63 50 59
22 80 91 82
11 71 72 76
12 69 86 81
14 78 63 70
23 79 79 85
The ANOVA table, with x a covariate, is as follows (SAS JMP figure 5.6):
Figure 5.7 is a graphical representation of the data and the two straight lines fitted if x is regarded
as a covariate.
Figure 5.7:
Figure 5.7 shows clearly that the two straight lines, obtained from the analysis of covariance (x
the covariate), are acceptable representations of the situation as reflected by the given data.
The following two graphs illustrate the importance of looking at a study as a whole and not only
at certain aspects.
Figure 5.8 is a graphical representation of x against the results, without considering heartbeat. A
straight line fitted to the data is obviously not an acceptable prediction.
Figure 5.8:
Figure 5.9:
Exercise 5.1
Set 1 Set 2
x y x y
4.8 9.912 8.8 10.596
7.2 9.383 6.2 9.697
5.5 9.734 7.5 10.700
6.0 9.551 4.9 10.610
8.3 8.959 5.4 10.145
7.6 9.474 5.8 10.191
5.9 9.179 7.3 9.855
8.0 9.359 8.6 10.682
4.3 9.580 8.8 10.160
5.1 9.245 6.0 9.982
(a) Test whether the slopes of the fitted regression lines are significantly different.
[Given:
Set 2: x̄2. = 6.93, Σ(x2j − x̄2.)² = 19.381, Σ y2j(x2j − x̄2.) = 1.45996 ]
2. A lecturer wants to confirm that results in the 2002 examination were better than in the 2000
examination. He takes a sample (n = 6) of the examination results of students from these
groups (y). The means for the assignments for each student are also recorded (x). Results
are as follows:
2000 2002
x y x y
50 60 59 55
62 70 45 49
76 68 33 50
38 63 73 61
52 73 57 44
64 80 45 41
(a) Draw a graph of the data to examine the possibility of a covariate. Briefly interpret
your graph.
(b) The regression lines
ŷ1 = α̂1 + β̂1x = 54.5862 + 0.2529x
ŷ2 = α̂2 + β̂2x = 33.6632 + 0.3142x
were fitted to these two sets of data. Confirm, by testing the hypothesis H0 : β1 = β2,
that the assumption of equal slopes is valid.
(c) Perform a covariance analysis. [State the hypothesis, test statistic, all formulae and the
conclusion explicitly.]
3. An experiment was performed to determine whether balloons of different colours are similar
in terms of the time taken for inflation to a diameter of 7 inches (1 inch ≈ 2.5 cm). Two
colours were selected from a single manufacturer. An assistant blew up the balloons and the
times (y) (to the nearest 1/10th of a second) were taken with a stopwatch. The data, in the
order collected (x), are given below, where the codes 1 and 2 denote the colours yellow and
blue, respectively.
Order 1 2 3 4 5 6 7 8
Colour 1 2 2 2 1 1 2 1
Time 19.8 28.5 25.7 28.8 17.1 19.3 18.3 14.0
Order 9 10 11 12
Colour 1 2 2 1
Time 16.6 18.1 18.9 16.0
(a) Ignore order. Construct an analysis of variance table and test the hypothesis that colour
has no effect on inflation time.
(b) Include order as a covariate. Test the hypothesis H0 : β1 = β2 that the slopes are equal.
[NB: Give the formulae, test statistics, decisions and conclusions explicitly.]
Set 1
x: 20 20 30 30 40 40 50 50
y: 61 65 90 92 124 124 151 157
Set 2
x: 20 20 20 30 30 40 40 40
y: 55 56 60 88 92 113 115 117
(a) Assume that parallel regression lines are applicable; give a model and perform an anal-
ysis of covariance (at the 5% level).
(b) Represent the data and the estimated regression lines graphically and comment on the
assumptions and conclusions of (a) in the light of the graph.
Set 1
x: 1 1 2 2 3 3 4 4
y: 12 16 15 15 17 19 21 25
Set 2
x: 2 2 3 3 4 4 5 5
y: 14 16 20 24 22 24 26 30
(a) Fit a regression line through each set of data and draw a graph of the data and the
fitted lines. Give a model for the data.
(b) Test, at the 10% level, whether the slopes of the fitted regression lines are significantly
different.
(c) Test, at the 10% level, the null hypothesis that the two expected values of y which
correspond to x = 3 are equal.
Chapter 6
APPENDIX
Many of our students make certain mistakes in their assignments and examination papers, and it
seems necessary to point out these typical errors. You should read these remarks carefully and
make sure that you do not commit similar errors.
The main source of errors is failure to take note of the order of a matrix. If A is a matrix with p
rows and q columns, then A is said to be of order p × q.
Matrix addition
If A and B are matrices, then A + B is defined only if A and B are of the same order. If A is p × q
and B is p × q; then A+B is also p × q.
Matrix multiplication
The product AB is defined only if the number of columns of A is the same as the number of rows
of B. If A is p × q and B is q × r; then AB is defined and AB is a p × r matrix.
Before writing down a sum such as AB + CDE, check that the orders conform: if A is p × q and B is q × r, then AB is p × r; if C is p × s, D is s × t and E is t × r, then CDE is also p × r, so the sum AB + CDE is a p × r matrix and is defined.
Matrix division
If A and B are matrices, then the quotient A/B is defined only if B is a 1 × 1 matrix, that is a scalar, and not equal to zero. We often write down the quotient of two quadratic forms or other 1 × 1 expressions in matrices, for example
x′Ax / x′Bx
where x is a p × 1 vector and A and B are p × p matrices. However, to “cancel out” the xs and equate the expression to A/B is unforgivable.
The following concepts are defined for square matrices only:
trace
determinant
inverse
positive definite; positive semi-definite
quadratic form
symmetric
singular; non-singular
idempotent
Thus, before you write |X| or tr(X) or X⁻¹, et cetera, make sure that X is square.
Determinants
If A and B are both square matrices of the same order, then
|AB| = |A| |B| = |BA|.
However, if A and B are not square (but AB and BA are) then |AB| and |BA| need not be equal.
• Counterexample 6.1.1
Let A = (1  2) and B = [ 3 ]
                       [ 4 ].
Then AB = (11)
∴ |AB| = 11,
whereas BA = [ 3  6 ]
             [ 4  8 ]
∴ |BA| = 24 − 24 = 0.
Of course, |A| and |B| are not defined in this example. Also, if |A| = |B| then it does not
follow that A = B.
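Counterexamples like this are easy to check by machine. A minimal plain-Python sketch (the `matmul` helper is ours, not part of the guide):

```python
def matmul(A, B):
    """Multiply matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

A = [[1, 2]]      # 1 x 2
B = [[3], [4]]    # 2 x 1

AB = matmul(A, B)                                   # 1 x 1
BA = matmul(B, A)                                   # 2 x 2
det_AB = AB[0][0]                                   # |(a)| = a
det_BA = BA[0][0] * BA[1][1] - BA[0][1] * BA[1][0]  # ad - bc

print(det_AB, det_BA)  # 11 0
```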
Trace
If tr(A) = tr(B) then it does not follow that A = B.
• Counterexample 6.1.2
Let A = [ 1  −10 ]  and B = [ 2  0 ]
        [ 6    3 ]          [ 0  2 ].
Then tr(A) = 1 + 3 = 4 = tr(B), but A ≠ B.
While this may seem trivial, students have been known to write
tr(I − AB) = 0
∴ I = AB,
which is of course not a valid deduction.
Ranks
r(AB) is not necessarily equal to r(BA).
• Counterexample 6.1.3
Let A = [ 1  −2 ]       B = [ 2  2   2 ]
        [ 1   1 ]  ;        [ 0  1  −1 ].
        [ 1   1 ]
Then AB = [ 2  0  4 ]
          [ 2  3  1 ]  ∴ r(AB) = 2;
          [ 2  3  1 ]
while BA = [ 6  0 ]
           [ 0  0 ]  ∴ r(BA) = 1.
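A sketch of a machine check for this counterexample; the `rank` helper (ours, not part of the guide) does Gaussian elimination in exact rational arithmetic:

```python
from fractions import Fraction

def matmul(A, B):
    """Multiply matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def rank(M):
    """Rank via Gaussian elimination with exact fractions."""
    M = [[Fraction(x) for x in row] for row in M]
    r = 0
    for col in range(len(M[0])):
        pivot = next((i for i in range(r, len(M)) if M[i][col] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(len(M)):
            if i != r and M[i][col] != 0:
                f = M[i][col] / M[r][col]
                M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

A = [[1, -2], [1, 1], [1, 1]]
B = [[2, 2, 2], [0, 1, -1]]
print(rank(matmul(A, B)), rank(matmul(B, A)))  # 2 1
```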
Inverse
The inverse of a matrix A is defined only if A is non-singular (and of course square). Thus, while
(AB)−1 is defined if AB is non-singular, it does not necessarily follow that A−1 and B−1 exist; in
fact A and B need not be square. The well-known least squares estimator in the linear model is
β̂ = (X′X)⁻¹X′y
where X is n × p of rank p where p < n. Thus X⁻¹ does not exist, and one may not write
β̂ = X⁻¹(X′)⁻¹X′y = X⁻¹Iₙy = X⁻¹y.
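To make the point concrete, here is a sketch with a small made-up data set (the 4 × 2 design matrix below is hypothetical, chosen so the arithmetic is exact): β̂ = (X′X)⁻¹X′y exists because X′X is 2 × 2 and non-singular, even though X itself is not square and has no inverse.

```python
from fractions import Fraction as F

# Hypothetical data: y = 1 + 2x exactly, so beta-hat should be (1, 2)'.
X = [[1, 0], [1, 1], [1, 2], [1, 3]]   # n = 4, p = 2: X is not square, X^{-1} undefined
y = [[1], [3], [5], [7]]

def t(M):
    """Transpose a matrix given as a list of rows."""
    return [list(c) for c in zip(*M)]

def matmul(A, B):
    """Multiply matrices in exact rational arithmetic."""
    return [[sum(F(A[i][k]) * F(B[k][j]) for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

XtX = matmul(t(X), X)                  # 2 x 2, non-singular
Xty = matmul(t(X), y)                  # 2 x 1
d = XtX[0][0] * XtX[1][1] - XtX[0][1] * XtX[1][0]
XtX_inv = [[ XtX[1][1] / d, -XtX[0][1] / d],
           [-XtX[1][0] / d,  XtX[0][0] / d]]
beta = matmul(XtX_inv, Xty)
print(beta)  # [[Fraction(1, 1)], [Fraction(2, 1)]]
```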
If A is non-singular, then A may be cancelled from the equation AX = AY by premultiplying both sides by A⁻¹:
A⁻¹AX = A⁻¹AY
∴ IX = IY
∴ X = Y.
If A is singular (or not square), however, this cancellation is not valid, as the following counterexamples show.
• Counterexample 6.1.4
Let A = [ 1  2  3 ]   B = [  6  0  0 ]       [ 1 ]
        [ 4  5  6 ] ;     [ 15  0  0 ] ; x = [ 1 ]
        [ 7  8  9 ]       [ 24  0  0 ]       [ 1 ].
Then Ax = Bx = [  6 ]
               [ 15 ]  but A ≠ B.
               [ 24 ]
• Counterexample 6.1.5
Let A = [ 1  2   3 ]   B = [ 6  0  0 ]       [ 1  2 ]
        [ 1  0  −1 ] ;     [ 0  0  0 ] ; C = [ 1  2 ]
                                             [ 1  2 ].
Then AC = BC = [ 6  12 ]  but A ≠ B.
               [ 0   0 ]
Zero products
If AB = 0 then it does not follow that A = 0 or B = 0. You should be able to construct your own
counterexample.
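One such counterexample, checked in plain Python (our own sketch, not the guide's):

```python
def matmul(A, B):
    """Multiply matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

A = [[0, 1], [0, 0]]   # A is not the zero matrix
B = [[1, 0], [0, 0]]   # B is not the zero matrix
print(matmul(A, B))    # [[0, 0], [0, 0]]
```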
After one of the examinations a student commented that it had taken him an hour to answer a
question like the following:
Let x₁, x₂, x₃ and x₄ be independent n(µ; 1) variates. Write the following quadratic forms as qᵢ = x′Aᵢx where Aᵢ is symmetric, determine whether each has a chi-square distribution, give the degrees of freedom and find out which are independent:
q₁ = x₁x₂ − x₃x₄
q₂ = (x₁ + x₂ + x₃ + x₄)²/4
q₃ = Σᵢ₌₁⁴ (xᵢ − x₁)²
Actually it can be completed in five minutes − if you know how. Let us do another example step by step. Let
q = (1/5)(x₁ − 2x₃)² + x₂²
where x₁, x₂ and x₃ are independent n(0; σ²) variates. We wish to show that a multiple of q has a chi-square distribution. For this purpose we first write q in the form x′Ax where A is symmetric: recall that the matrix A of the quadratic form x′Ax is assumed to be symmetric in this guide.
First, we write out q term by term − this step is not really necessary but it helps to explain the process.
q = (1/5)x₁² − (4/5)x₁x₃ + (4/5)x₃² + x₂²
  = (1/5)x₁² + (4/5)x₃² + x₂² − (2/5)x₁x₃ − (2/5)x₃x₁
    + 0x₁x₂ + 0x₂x₁ + 0x₂x₃ + 0x₃x₂.
The coefficient of x₁² goes in the (1, 1) position of A, that of x₂² in the (2, 2) position and that of x₃² in the (3, 3) position:
A = [ 1/5  ·   ·  ]
    [  ·   1   ·  ]
    [  ·   ·  4/5 ].
Since A must be symmetric, we place one half of the coefficient of xᵢxⱼ (where i ≠ j) in the (i, j) position and the other half in the (j, i) position:
A = [  1/5  0  −2/5 ]
    [   0   1    0  ]
    [ −2/5  0   4/5 ]
check: q = (x₁  x₂  x₃) [  1/5  0  −2/5 ] [ x₁ ]
                        [   0   1    0  ] [ x₂ ]
                        [ −2/5  0   4/5 ] [ x₃ ].
Now we show that AA = A and we conclude that q/σ² is a chi-square variate with tr(A) = 2 degrees of freedom. The noncentrality parameter is 0′A0 = 0, thus q/σ² has a central chi-square distribution. Quite easily done.
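The two facts used above (AA = A and tr(A) = 2) can also be verified mechanically; a short check in exact arithmetic:

```python
from fractions import Fraction as F

# The matrix A of the quadratic form constructed above.
A = [[F(1, 5),  F(0), F(-2, 5)],
     [F(0),     F(1), F(0)],
     [F(-2, 5), F(0), F(4, 5)]]

def matmul(A, B):
    """Multiply matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

print(matmul(A, A) == A)               # True: A is idempotent
print(sum(A[i][i] for i in range(3)))  # 2, the degrees of freedom
```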
Efficient methods exist for performing matrix calculations, and we assume that you know these methods. These methods are usually aimed at computer usage, and may be rather lengthy if applied manually. We therefore point out certain shortcuts which may be useful in an exam situation when time is limited.
Computing X′X
In the linear model one very often has to compute X 0 X where X (or X 0 ) is given. It is not necessary
to write out X 0 and then X next to it, if one remembers that the (i, j)-th element of X 0 X is found
by multiplying every element in the i-th row of X 0 (or the i-th column of X) by the corresponding
element in the j-th row of X 0 (or j-th column of X) and adding these products. The diagonal terms
are found as the sum of squares of elements in the corresponding row of X 0 (or column of X). Also,
X 0 X is symmetric and thus the (i, j)-th element of X 0 X is the same as the (j, i)-th element.
• Example 6.3.1
Let X′ = [  1   1   1  1  1  1  1 ]
         [ −3  −2  −1  0  1  2  3 ]
         [  1   1   2  2  3  3  4 ]
Then (X′X)₁₁ = 1² + 1² + 1² + 1² + 1² + 1² + 1² = 7
(X′X)₂₂ = (−3)² + (−2)² + (−1)² + 0² + 1² + 2² + 3² = 28
(X′X)₃₃ = 1² + 1² + 2² + 2² + 3² + 3² + 4² = 44
(X′X)₁₂ = −3 − 2 − 1 + 0 + 1 + 2 + 3 = 0
(X′X)₁₃ = 1 + 1 + 2 + 2 + 3 + 3 + 4 = 16
(X′X)₂₃ = −3 − 2 − 2 + 0 + 3 + 6 + 12 = 14
∴ X′X = [  7   0  16 ]
        [  0  28  14 ]
        [ 16  14  44 ].
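The shortcut is intended for hand calculation, but the result of Example 6.3.1 is easy to verify by machine; a plain-Python check:

```python
Xt = [[1,  1,  1, 1, 1, 1, 1],
      [-3, -2, -1, 0, 1, 2, 3],
      [1,  1,  2, 2, 3, 3, 4]]

X = [list(row) for row in zip(*Xt)]   # transpose back to the 7 x 3 matrix X

# (X'X)_{ij} = sum over k of (row i of X') times (column j of X)
XtX = [[sum(Xt[i][k] * X[k][j] for k in range(7)) for j in range(3)]
       for i in range(3)]
print(XtX)  # [[7, 0, 16], [0, 28, 14], [16, 14, 44]]
```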
• Exercise 6.3.2
Find X′X if
X′ = [ 1   1  1   1  1   1  1   1 ]
     [ 1  −1  2  −2  3  −3  4  −4 ]
     [ 8   7  6   5  4   3  2   1 ]
Answer:
X′X = [  8   0   36 ]
      [  0  60   10 ]
      [ 36  10  204 ]
Determinants
If A is a square matrix, say p × p, and k is a common factor of every element of A, then we may write
A = kB, say.
Then |A| = kᵖ|B|. This fact may often be used to simplify the calculations involved in computing a determinant. The determinant of a small matrix can be determined quite rapidly:
1 × 1 matrices
If A = (a) then |A| = a (in this case the determinant sign should not be confused with the absolute value sign):
|(a)| is the determinant of the 1 × 1 matrix, and is equal to a, which could be negative or positive.
• Example 6.3.3
If A = (−3) then |A| = −3.
2 × 2 matrices
If A = [ a  b ]  then |A| = ad − bc.
       [ c  d ]
• Example 6.3.4
If A = [ 1  2 ]  then |A| = 1 × 8 − 2 × 3 = 2.
       [ 3  8 ]
• Example 6.3.5
If B = [  4  −6 ]  then |B| = 4 × 3 − (−6)(−6) = −24.
       [ −6   3 ]
3 × 3 matrices
The first step is to write down the elements of the matrix, and repeat the first column after the third and the second after that. If
A = [ a  b  c ]          a  b  c  a  b
    [ d  e  f ]   write  d  e  f  d  e
    [ g  h  i ]          g  h  i  g  h
|A| consists of six terms; the first three are obtained from the diagonals that run down from left to right. These three terms are added:
+(a × e × i) + (b × f × g) + (c × d × h).
The last three terms are subtracted, and are obtained from the diagonals that run down from right to left:
−(c × e × g) − (a × f × h) − (b × d × i).
• Example 6.3.6
A = [  1  2  −3 ]
    [  2  4   6 ]
    [ −3  6  10 ]
Write down
 1  2  −3   1  2
 2  4   6   2  4
−3  6  10  −3  6
Then |A| = (1)(4)(10) + (2)(6)(−3) + (−3)(2)(6) − (−3)(4)(−3) − (1)(6)(6) − (2)(2)(10)
= 40 − 36 − 36 − 36 − 36 − 40 = −144.
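Sarrus' rule translates directly into code; a small sketch checking the matrix of Example 6.3.6:

```python
def det3(M):
    """3 x 3 determinant by Sarrus' rule: three added and three subtracted diagonals."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a*e*i + b*f*g + c*d*h - c*e*g - a*f*h - b*d*i

A = [[1, 2, -3], [2, 4, 6], [-3, 6, 10]]
print(det3(A))  # -144
```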
• Exercise 6.3.7
B = [ 2  −3   4 ]
    [ 2   6  −1 ] .  Show that |B| = 68.
    [ 4  −1   9 ]
4 × 4 matrices
This is rather more complicated. The determinant is first expanded in terms of four 3 × 3 determinants, each of which is then computed as above:
| a  b  c  d |
| e  f  g  h |  = a | f  g  h | − b | e  g  h | + c | e  f  h | − d | e  f  g |
| i  j  k  ℓ |      | j  k  ℓ |     | i  k  ℓ |     | i  j  ℓ |     | i  j  k |
| m  n  p  q |      | n  p  q |     | m  p  q |     | m  n  q |     | m  n  p |
Note that the signs alternate (+ − + −) and each term consists of an element in the first row multiplied by the 3 × 3 determinant obtained by deleting the row and column in which that element occurs.
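The cofactor expansion described here works for any order, not just 4 × 4, and is naturally recursive; a sketch, checked on a triangular matrix whose determinant is simply the product of its diagonal elements:

```python
def det(M):
    """Determinant by cofactor expansion along the first row."""
    n = len(M)
    if n == 1:
        return M[0][0]
    total = 0
    sign = 1
    for j in range(n):
        # delete row 1 and column j to form the minor
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += sign * M[0][j] * det(minor)
        sign = -sign               # signs alternate + - + - ...
    return total

A = [[2, 1, 3, 4],
     [0, 1, 5, 6],
     [0, 0, 3, 7],
     [0, 0, 0, 2]]
print(det(A))  # 12 (= 2 * 1 * 3 * 2 for a triangular matrix)
```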
Inverses
Some points to remember:
(b) If A = kB where k is a non-zero constant, then A⁻¹ = (1/k)B⁻¹. (This fact may be used to simplify computations if a common factor exists for each element of A.)
(c) If A = [ A₁  O  ⋯  O  ]
           [ O   A₂ ⋯  O  ]
           [ O   O  ⋯  Aₘ ]
is block diagonal with each Aᵢ square and non-singular, then A⁻¹ is block diagonal with blocks A₁⁻¹, A₂⁻¹, …, Aₘ⁻¹.
(d) If A is symmetric, then A−1 is also symmetric. In this case one has to compute the diagonal of
A−1 and the elements above the main diagonal (say), while the elements below the diagonal
can be written down directly from the corresponding elements above the diagonal. We now
show the computations for small matrices.
1 × 1 matrices
If A = (a) with a ≠ 0, then A⁻¹ = (1/a).
2 × 2 matrices
If A = [ a  b ]  then A⁻¹ = [  d/|A|  −b/|A| ]
       [ c  d ]             [ −c/|A|   a/|A| ].
• Example 6.3.8
A = [ 2  3 ]
    [ 3  6 ]
|A| = 12 − 9 = 3
A⁻¹ = [  6/3  −3/3 ]  =  [  2  −1  ]
      [ −3/3   2/3 ]     [ −1  2/3 ].
• Exercise 6.3.9
B = [  4  −2 ] ;  B⁻¹ = [ 1/2  1/2 ]
    [ −2   2 ]          [ 1/2   1  ].
3 × 3 matrices
(A⁻¹)ᵢⱼ = (−1)^(i+j) × (the determinant of the 2 × 2 matrix obtained by deleting the j-th row and the i-th column of A) ÷ (the determinant of A).
• Example 6.3.10
A = [ 2   1   0 ]
    [ 1   3  −2 ]        |A| = 2
    [ 0  −2   2 ]
(A⁻¹)₁₁ = (−1)² |  3  −2 | ÷ |A| = (6 − 4)/2 = 1
                [ −2   2 ]
(A⁻¹)₂₂ = (−1)⁴ | 2  0 | ÷ |A| = 4/2 = 2
                [ 0  2 ]
(A⁻¹)₃₃ = (−1)⁶ | 2  1 | ÷ |A| = 5/2
                [ 1  3 ]
(A⁻¹)₁₂ = (A⁻¹)₂₁ = (−1)³ |  1  0 | ÷ |A| = −2/2 = −1
                          [ −2  2 ]
(A⁻¹)₁₃ = (A⁻¹)₃₁ = (−1)⁴ | 1   0 | ÷ |A| = −2/2 = −1
                          [ 3  −2 ]
(A⁻¹)₂₃ = (A⁻¹)₃₂ = (−1)⁵ | 2   0 | ÷ |A| = −(−4)/2 = 2
                          [ 1  −2 ]
∴ A⁻¹ = [  1  −1  −1  ]
        [ −1   2   2  ]
        [ −1   2  5/2 ]
(Test: AA⁻¹ = I)
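The suggested test AA⁻¹ = I can be carried out exactly with Python's fractions module:

```python
from fractions import Fraction as F

A = [[2, 1, 0], [1, 3, -2], [0, -2, 2]]
Ainv = [[F(1),  F(-1), F(-1)],
        [F(-1), F(2),  F(2)],
        [F(-1), F(2),  F(5, 2)]]

def matmul(A, B):
    """Multiply matrices in exact rational arithmetic."""
    return [[sum(F(A[i][k]) * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

print(matmul(A, Ainv))  # the 3 x 3 identity matrix
```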
• Example 6.3.11
A = [ 12   0   0 ]       [ 2  0  0 ]
    [  0  18   6 ]  = 6  [ 0  3  1 ]
    [  0   6  12 ]       [ 0  1  2 ]
A⁻¹ = (1/6) [ 2  0  0 ]⁻¹
            [ 0  3  1 ]
            [ 0  1  2 ]
    = (1/6) [ 1/2   0     0  ]
            [  0   2/5  −1/5 ]
            [  0  −1/5   3/5 ]
    = [ 1/12    0      0   ]
      [  0    1/15  −1/30  ]
      [  0   −1/30   1/10  ].
• Exercise 6.3.12
B = [  3/4  −1/2  1/4 ] ;  B⁻¹ = [  4   2  −4 ]
    [ −1/2    1    0  ]          [  2   2  −2 ]
    [  1/4    0   1/4 ]          [ −4  −2   8 ].