
STA3701/1

Department of Statistics

STA3701: Applied Statistics III

Study Guide
Contents

1 THE TOOLBOX 1

1.1 General remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Matrix results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.3 Distribution theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2 THE LINEAR MODEL 29

2.1 The model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

2.2 Examples of linear models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

2.2.1 One-sample problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

2.2.2 Two-sample problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

2.2.3 Simple linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

2.3 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

2.3.1 One-sample problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

2.3.2 Two-sample problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

2.3.3 Simple linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

2.4 The Gauss-Markov theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

2.5 Linear hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

2.5.1 One-sample problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

2.5.2 Two-sample problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

2.5.3 Simple linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

2.6 Quadratic forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

2.7 Statistical inference on β . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

2.8 Models with a constant term . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

2.9 The design matrix X . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

2.10 Examination of the residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

2.11 Transformation of variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

2.12 Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

3 ANALYSIS OF VARIANCE 115

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

3.1.1 One-way analysis of variance . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

3.1.2 Two-way analysis of variance . . . . . . . . . . . . . . . . . . . . . . . . . . 117

3.1.3 Three-way analysis of variance . . . . . . . . . . . . . . . . . . . . . . . . . . 118

3.2 Balanced experiments and replicates . . . . . . . . . . . . . . . . . . . . . . . . . . 119

3.3 Fixed effects and random effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

3.3.1 Example on mixed model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

3.3.2 Nested or hierarchical experiments . . . . . . . . . . . . . . . . . . . . . . . . 123

3.3.2.1 Illustration 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

3.3.2.2 Illustration 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

3.4 One-way analysis of variance: Model I . . . . . . . . . . . . . . . . . . . . . . . . . 126

3.4.1 Estimates of parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

3.4.1.1 Point estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

3.4.1.2 Confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . 140

3.5 One-way analysis of variance: Model II . . . . . . . . . . . . . . . . . . . . . . . . . 156

3.6 Two-way classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172


3.7 Two-way analysis of variance: Model I . . . . . . . . . . . . . . . . . . . . . . . . . 177

3.7.1 Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189

3.8 Two-way analysis of variance: Model II . . . . . . . . . . . . . . . . . . . . . . . . . 198

3.9 Two-way analysis of variance: Mixed model . . . . . . . . . . . . . . . . . . . . . . 208

3.10 Nested experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216

3.11 Three-way analysis of variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226

4 REGRESSION ANALYSIS 239

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239

4.2 Analysis of variance in regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240

4.3 Polynomial regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251

4.4 Multiple regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255

4.4.1 Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256

4.5 Inclusion or exclusion of variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259

4.6 Selection of variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263

4.7 Historical data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269

5 ANALYSIS OF COVARIANCE 283

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283

5.2 Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285

5.3 Analysis of covariance model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289

5.4 Comparing two regression lines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292

5.5 Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296

6 APPENDIX 303

6.1 Remarks on matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303

6.2 Quadratic forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308

6.3 Rapid calculations for small matrices . . . . . . . . . . . . . . . . . . . . . . . . . . 309


PREFACE

This module follows on from STA3703, which deals with multivariate distributions. You should therefore
be familiar with the multivariate normal distribution and the distribution of quadratic forms which
results from it. The main results on matrix theory and distribution theory needed for this module
are listed in chapter 1. The proofs and derivations of these results are not required for this module,
but you should be able to state and apply these results.

You should acquire the following book of tables:

D.J. Stoker: Statistical Tables, Pretoria, Academica, 3rd ed. 1977 (or earlier editions).

There is no other prescribed textbook for this module, and no recommended textbook which you
are required to read in order to complete the module. You may consult the following textbooks if
you wish to learn more about the topics dealt with in this module:

Bowerman, B.L. and R. O’Connel: Linear statistical models: an applied approach. 2nd ed.
PWS-KENT, 1990. (For chapters 3 and 4 of this Guide.)

Brownlee, K.A.: Statistical theory and methodology in science and engineering. 2nd ed.
Wiley, 1965. (For chapters 3, 4 and 5 of this Guide.)

Draper, N.R. and H. Smith: Applied regression analysis. Wiley, 1st ed. 1966, or 2nd ed.
1981 or 3rd ed 1998. (For chapter 4 of this Guide.)

Dunn, O.J. and V.A. Clark: Applied statistics: analysis of variance and regression. Wiley,
1974. (For chapters 3, 4 and 5 of this Guide.)


Kleinbaum, D.G., Kupper L.L., Nizam A. and K.E. Muller: Applied regression analysis and
other multivariable methods. Duxbury, 4th ed. 2008. (For most of the Guide.)

Kleinbaum, D.G. and L.L. Kupper: Applied regression analysis and other multivariable meth-
ods. Duxbury, 1st ed. 1978 or 2nd ed. 1988. (For most of the Guide.)

Kutner, M.H., Nachtsheim C.J., Neter J. and W. Li.: Applied linear Statistical Models. 5th
ed. McGraw-Hill Irwin, 2005. (For most of the Guide.)

Mickey, R.M., Dunn, O.J. and V.A. Clark: Applied statistics: analysis of variance and
regression. 3rd ed. Wiley, 2004. (For chapters 3, 4 and 5 of this Guide.)

Neter, J., Wasserman W. and M.H. Kutner: Applied linear statistical models, regression,
analysis of variance and experimental designs. 2nd ed. Richard D. Irwin Inc., 1985. (For
chapters 3, 4 and 5 of this Guide.)

Neter, J. and W. Wasserman: Applied linear statistical models. Richard D. Irwin Inc., 1974.
(For chapters 3, 4 and 5 of this Guide.)

Quenouille, M.H.: Introductory statistics. Butterworth-Springer, 1950. (For section 2.11 of
chapter 2 of this Guide.)

Ross, S.M.: Introduction to probability and statistics for engineers and scientists. Wiley
1987. (For chapters 3 and 4 of this Guide.)

Ross, S.M.: Introduction to probability and statistics for engineers and scientists. 4th ed.
Elsevier 2009. (For chapters 3 and 4 of this Guide.)

Sall, J., Creighton L. and A. Lehman: JMP start statistics: a guide to statistics and data
analysis using JMP. SAS Institute, 4th ed 2007. (For chapters 3, 4 and 5 of this Guide -
especially in data analysis.)

Scheffé, H.: The analysis of variance. Wiley, 1959. (For chapters 2, 3 and 5 of this Guide.)

Searle, S.R.: Linear models. Wiley, 1997. (For chapters 1 and 2 of this Guide.)

Wackerly, D.D., Mendenhall W. and R.L. Scheaffer: Mathematical statistics with applications.
7th ed. Duxbury 2008. (For chapters 3 and 4 of this Guide.)
Chapter 1

THE TOOLBOX

1.1 General remarks

The statistical techniques which are the subject of this module, Applied Statistics III, form part
of a more general theory called the linear model. In chapter 2 we discuss this general model; this
general theory should enable you to analyse a very general class of problems. In chapter 3 we dis-
cuss some specific models contained in the general theory under the heading analysis of variance.
Analysis of variance is a procedure which enables one to test whether a number of parameters
which determine the values of population means are equal or not. In chapter 4 we discuss regres-
sion analysis. In this type of model we assume that there is a linear relationship between a random
response variable called yield, and a number of non-random variables. The purpose is to estimate
the parameters of this relationship and to draw statistical inference about these parameters. In
chapter 5 we discuss a third class of problems, called analysis of covariance, which may be regarded
as a hybrid between regression analysis and analysis of variance. In this model we assume that
there are a number of regression functions, each corresponding to a different population, and we
want to know whether these regression functions are equal or not.

A word about notation. In this module we usually use Greek letters to denote parameters, capital
letters to denote matrices and underlined small letters to denote vectors. A vector without a
prime, for example x, will be a column vector and its transpose, for example x′, will therefore
be a row vector. Small letters which are not underlined will usually be scalars (1 × 1 matrices),
including specific elements of matrices or vectors. The usual distinction between random variables
and observations, which is often made by using capitals for the former and small letters for the
latter, will not be made here. It should be clear from the context whether a symbol is for a random
variable or not. A circumflex (or hat) above the symbol for a parameter or vector of parameters,
for example β̂, is used to denote an unbiased estimator of the parameter(s).

1.2 Matrix results

This section contains a number of definitions and results on matrices which, for ease of reference,
will be numbered (M1), (M2), et cetera. These results will be used frequently in chapter 2. If they
do not seem interesting at first reading, you will soon find out how they are applied.

(M1) Let x = (x1 . . . xp)′ be a p × 1 vector and A = (aij) a p × p matrix. Then
\[
x'Ax = \sum_{i=1}^{p}\sum_{j=1}^{p} a_{ij}x_i x_j
\]
is called a quadratic form in x. The matrix A of any quadratic form x′Ax can always be chosen to
be symmetric; hence every quadratic form in this guide will be considered to have a symmetric matrix.

Example 1.1.
Write the following quadratic forms in matrix notation using a symmetric matrix:

(a) 2x² + 3y² − 4z² + 2xy − 6xz + 10yz

(b) x₁² + 4x₁x₃ − 6x₂x₃

Solution 1.1.

(a) 2x² + 3y² − 4z² + 2xy − 6xz + 10yz = 2x² + 3y² − 4z² + xy + yx − 3xz − 3zx + 5yz + 5zy.

Take x = (x y z)′ and
\[
A = \begin{pmatrix} 2 & 1 & -3 \\ 1 & 3 & 5 \\ -3 & 5 & -4 \end{pmatrix}.
\]
Thus
\[
x'Ax = \begin{pmatrix} x & y & z \end{pmatrix}
\begin{pmatrix} 2 & 1 & -3 \\ 1 & 3 & 5 \\ -3 & 5 & -4 \end{pmatrix}
\begin{pmatrix} x \\ y \\ z \end{pmatrix}.
\]

(b) x₁² + 4x₁x₃ − 6x₂x₃ = x₁² + 0x₂² + 0x₃² + 0x₁x₂ + 0x₂x₁ + 2x₁x₃ + 2x₃x₁ − 3x₂x₃ − 3x₃x₂.

Take x = (x₁ x₂ x₃)′ and
\[
A = \begin{pmatrix} 1 & 0 & 2 \\ 0 & 0 & -3 \\ 2 & -3 & 0 \end{pmatrix},
\]
so that
\[
x'Ax = \begin{pmatrix} x_1 & x_2 & x_3 \end{pmatrix}
\begin{pmatrix} 1 & 0 & 2 \\ 0 & 0 & -3 \\ 2 & -3 & 0 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}.
\]

(M2) The p × p matrix A is called positive definite if x′Ax > 0 for all x ≠ 0 (where 0 is a vector
of zeros).

(M3) If the > sign in (M2) is replaced by ≥ (and the equality holds for at least one x ≠ 0), then
A is said to be positive semidefinite.

Example 1.2.
Are the following matrices positive definite or positive semidefinite?
\[
\text{(a) } \begin{pmatrix} 1 & 0 \\ 0 & 4 \end{pmatrix} \qquad
\text{(b) } \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix} \qquad
\text{(c) } \begin{pmatrix} 1 & -\tfrac92 \\ -\tfrac92 & 4 \end{pmatrix}
\]

Solution 1.2.

(a)
\[
x'Ax = \begin{pmatrix} x_1 & x_2 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & 4 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
= x_1^2 + 4x_2^2.
\]
Now x₁² + 4x₂² > 0 if x₁ ≠ 0 or x₂ ≠ 0, and x₁² + 4x₂² = 0 only if x₁ = x₂ = 0. Thus A is positive definite.

(b)
\[
x'Ax = \begin{pmatrix} x_1 & x_2 \end{pmatrix}\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
= x_1^2 - 2x_1x_2 + x_2^2 = (x_1 - x_2)^2,
\]
which equals 0 whenever x₁ = x₂ (whether or not x₁ = x₂ = 0) and is > 0 if x₁ ≠ x₂. Thus A is positive semidefinite.

(c)
\[
x'Ax = \begin{pmatrix} x_1 & x_2 \end{pmatrix}\begin{pmatrix} 1 & -\tfrac92 \\ -\tfrac92 & 4 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
= x_1^2 - 9x_1x_2 + 4x_2^2 = (x_1 - 2x_2)^2 - 5x_1x_2,
\]
which is > 0 if x₁ = 0, x₂ ≠ 0 or if x₂ = 0, x₁ ≠ 0, but < 0 if x₁ = x₂ = 1.
Thus A is neither positive definite nor positive semidefinite.

(M4) (i) The rank of a matrix A, denoted by r(A), is defined to be the number of linearly
independent rows of A (or, equivalently, the number of linearly independent columns).

(ii) If A is an n × m matrix then r(A) ≤ m and r(A) ≤ n.

(iii) If A is a p × p matrix and r(A) = p then A is said to be non-singular and A⁻¹ exists;
if r(A) < p then A is singular and A⁻¹ does not exist.

(iv) A positive definite matrix is non-singular.

(M5) If A is an n × p matrix, then r(A) = r(A′A) = r(AA′).

(M6) The trace of a p × p matrix A, denoted by tr(A), is defined to be the sum of its diagonal
elements. If A = (aij), then
\[
\operatorname{tr}(A) = \sum_{i=1}^{p} a_{ii}.
\]

(M7) If A is an m × n matrix and B an n × m matrix, then tr(AB) = tr(BA). Similarly, if A is
m × n, B is n × k and C is k × m, then tr(ABC) = tr(BCA) = tr(CAB).

(M8) In particular, if x is a p × 1 vector and A a p × p matrix, then the quadratic form x′Ax is a
1 × 1 matrix which is equal to its trace. Thus
\[
x'Ax = \operatorname{tr}(x'Ax) = \operatorname{tr}(xx'A) = \operatorname{tr}(Axx').
\]

Example 1.3.
\[
x = \begin{pmatrix} x \\ y \end{pmatrix} \qquad A = \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}
\]
Compute x′Ax, xx′A and Axx′ and the trace of each.

Solution 1.3.
\[
x'Ax = \begin{pmatrix} x & y \end{pmatrix}\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix}
= \begin{pmatrix} x+2y & 2x+4y \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix}
= x^2 + 4xy + 4y^2.
\]
This is a 1 × 1 matrix, thus tr(x′Ax) = x² + 4xy + 4y².
\[
xx'A = \begin{pmatrix} x \\ y \end{pmatrix}\begin{pmatrix} x & y \end{pmatrix}\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}
= \begin{pmatrix} x^2 & xy \\ yx & y^2 \end{pmatrix}\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}
= \begin{pmatrix} x^2+2xy & 2x^2+4xy \\ yx+2y^2 & 2yx+4y^2 \end{pmatrix},
\]
so that tr(xx′A) = x² + 2xy + 2xy + 4y² = x² + 4xy + 4y².
\[
Axx' = \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix}\begin{pmatrix} x & y \end{pmatrix}
= \begin{pmatrix} x+2y \\ 2x+4y \end{pmatrix}\begin{pmatrix} x & y \end{pmatrix}
= \begin{pmatrix} x^2+2yx & xy+2y^2 \\ 2x^2+4yx & 2xy+4y^2 \end{pmatrix},
\]
so that tr(Axx′) = x² + 2yx + 2xy + 4y² = x² + 4xy + 4y².

Note: (xx′A)′ = Axx′ and tr(x′Ax) = tr(xx′A) = tr(Axx′).
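The trace identity in (M8) is easy to check numerically. Below is a minimal sketch (the numerical values of x and A are arbitrary choices made for the illustration, not part of the guide):

```python
import numpy as np

# Arbitrary illustrative values (assumed for this sketch only)
x = np.array([[2.0], [3.0]])           # 2x1 column vector
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])             # symmetric 2x2 matrix

q  = (x.T @ A @ x).item()              # x'Ax is 1x1, i.e. a scalar
t1 = np.trace(x @ x.T @ A)             # tr(xx'A)
t2 = np.trace(A @ x @ x.T)             # tr(Axx')

print(q, t1, t2)                       # all three are equal, as (M8) states
```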

(M9) The matrix A is called idempotent if AA = A.

(M10) If A is idempotent then the rank of A is equal to its trace:

r(A) = tr(A).

Example 1.4.
Which of the following matrices are idempotent?
\[
\text{(a) } (1) \qquad
\text{(b) } \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \qquad
\text{(c) } \begin{pmatrix} \tfrac12 & \tfrac12 \\ \tfrac12 & \tfrac12 \end{pmatrix} \qquad
\text{(d) } \begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac12 & \tfrac12 \\ 0 & \tfrac12 & \tfrac12 \end{pmatrix}
\]
Give the ranks of the idempotent matrices. Which of the matrices are singular?

Solution 1.4.

(a) 1 × 1 = 1, so A is idempotent; its rank is 1 and it is non-singular.

(b)
\[
AA = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = A.
\]
A is idempotent and r(A) = tr(A) = 1 + 1 = 2; A is non-singular.

(c)
\[
AA = \begin{pmatrix} \tfrac12 & \tfrac12 \\ \tfrac12 & \tfrac12 \end{pmatrix}\begin{pmatrix} \tfrac12 & \tfrac12 \\ \tfrac12 & \tfrac12 \end{pmatrix}
= \begin{pmatrix} \tfrac14+\tfrac14 & \tfrac14+\tfrac14 \\ \tfrac14+\tfrac14 & \tfrac14+\tfrac14 \end{pmatrix}
= \begin{pmatrix} \tfrac12 & \tfrac12 \\ \tfrac12 & \tfrac12 \end{pmatrix} = A.
\]
A is idempotent, so r(A) = tr(A) = ½ + ½ = 1. A is singular.

(d)
\[
AA = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac12 & \tfrac12 \\ 0 & \tfrac12 & \tfrac12 \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac12 & \tfrac12 \\ 0 & \tfrac12 & \tfrac12 \end{pmatrix}
= \begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac12 & \tfrac12 \\ 0 & \tfrac12 & \tfrac12 \end{pmatrix} = A.
\]
A is idempotent, so r(A) = tr(A) = 1 + ½ + ½ = 2. Thus A is singular.

(M11) If X is an n × m matrix with rank r(X) = m (thus m ≤ n) then X(X′X)⁻¹X′ is idempotent
with rank m. Furthermore, In − X(X′X)⁻¹X′ is idempotent with rank n − m. Finally,
(X(X′X)⁻¹X′)(In − X(X′X)⁻¹X′) = O, the n × n matrix of zeros.
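As a quick numerical illustration of (M10) and (M11), the following sketch builds the two idempotent matrices from an arbitrary small design matrix (the values of X are an assumption made only for this example):

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])                     # n = 3, m = 2, full column rank

H = X @ np.linalg.inv(X.T @ X) @ X.T           # X(X'X)^{-1}X'
M = np.eye(3) - H                              # I_n - X(X'X)^{-1}X'

print(np.allclose(H @ H, H), np.allclose(M @ M, M))  # both idempotent
print(np.trace(H), np.trace(M))                      # ranks m and n - m, by (M10)
print(np.allclose(H @ M, np.zeros((3, 3))))          # the product is the zero matrix
```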

(M12) An important vector is the n × 1 vector of ones:
\[
1 = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}; \qquad
1'1 = (1 \;\dots\; 1)\begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} = n;
\]
\[
11' = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}(1 \;\dots\; 1)
= \begin{pmatrix} 1 & \dots & 1 \\ \vdots & & \vdots \\ 1 & \dots & 1 \end{pmatrix};
\qquad
1(1'1)^{-1}1' = \begin{pmatrix} \tfrac1n & \dots & \tfrac1n \\ \vdots & & \vdots \\ \tfrac1n & \dots & \tfrac1n \end{pmatrix};
\]
\[
I_n - 1(1'1)^{-1}1' =
\begin{pmatrix}
1-\tfrac1n & -\tfrac1n & \dots & -\tfrac1n \\
-\tfrac1n & 1-\tfrac1n & \dots & -\tfrac1n \\
\vdots & & & \vdots \\
-\tfrac1n & -\tfrac1n & \dots & 1-\tfrac1n
\end{pmatrix}.
\]

(M13) From (M11) and (M12) it follows that 1(1′1)⁻¹1′ and In − 1(1′1)⁻¹1′ are idempotent with
ranks 1 and n − 1 respectively, and their product is equal to the null matrix O.
Example 1.5.

(a) Let
\[
X = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{pmatrix}.
\]
Compute X(X′X)⁻¹X′.

(b) Let x = (1 1 1)′. Compute x(x′x)⁻¹x′ and I₃ − x(x′x)⁻¹x′.

Solution 1.5.

(a)
\[
X'X = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{pmatrix}
= \begin{pmatrix} 3 & 3 \\ 3 & 5 \end{pmatrix},
\qquad
\det(X'X) = 15 - 9 = 6,
\qquad
(X'X)^{-1} = \tfrac16\begin{pmatrix} 5 & -3 \\ -3 & 3 \end{pmatrix}.
\]
\[
X(X'X)^{-1}X'
= \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{pmatrix}\tfrac16\begin{pmatrix} 5 & -3 \\ -3 & 3 \end{pmatrix}\begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \end{pmatrix}
= \begin{pmatrix} \tfrac56 & -\tfrac12 \\ \tfrac13 & 0 \\ -\tfrac16 & \tfrac12 \end{pmatrix}\begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \end{pmatrix}
= \begin{pmatrix} \tfrac56 & \tfrac26 & -\tfrac16 \\ \tfrac26 & \tfrac26 & \tfrac26 \\ -\tfrac16 & \tfrac26 & \tfrac56 \end{pmatrix}.
\]

(b)
\[
x = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}, \qquad
x'x = (1\;\;1\;\;1)\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} = 3.
\]
Then
\[
x(x'x)^{-1}x' = \tfrac13\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}(1\;\;1\;\;1)
= \begin{pmatrix} \tfrac13 & \tfrac13 & \tfrac13 \\ \tfrac13 & \tfrac13 & \tfrac13 \\ \tfrac13 & \tfrac13 & \tfrac13 \end{pmatrix},
\]
\[
I_3 - x(x'x)^{-1}x'
= \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
- \begin{pmatrix} \tfrac13 & \tfrac13 & \tfrac13 \\ \tfrac13 & \tfrac13 & \tfrac13 \\ \tfrac13 & \tfrac13 & \tfrac13 \end{pmatrix}
= \begin{pmatrix} \tfrac23 & -\tfrac13 & -\tfrac13 \\ -\tfrac13 & \tfrac23 & -\tfrac13 \\ -\tfrac13 & -\tfrac13 & \tfrac23 \end{pmatrix}.
\]

(M14) If A is a p × n matrix of constants and x′ = (x1 . . . xp), then
\[
\frac{\partial}{\partial x}(x'A) = A
\]
where
\[
\frac{\partial}{\partial x} = \begin{pmatrix} \dfrac{\partial}{\partial x_1} \\ \vdots \\ \dfrac{\partial}{\partial x_p} \end{pmatrix}.
\]
(In general, the derivative of a 1 × n row vector y′ with respect to the p × 1 column vector x is
defined to be the p × n matrix
\[
\frac{\partial y'}{\partial x} = (a_{ij}) \quad \text{where} \quad a_{ij} = \frac{\partial y_j}{\partial x_i}.)
\]

(M15) If A is a p × p matrix then
\[
\frac{\partial}{\partial x}(x'Ax) = A'x + Ax.
\]
If A is symmetric, that is A = A′, then
\[
\frac{\partial}{\partial x}(x'Ax) = 2Ax.
\]
Example 1.6.
\[
x = \begin{pmatrix} u \\ v \\ w \end{pmatrix} \qquad
A = \begin{pmatrix} 3 & 1 & 2 \\ 1 & 4 & 5 \\ 2 & 5 & 6 \end{pmatrix}.
\]
Write down ∂(x′Ax)/∂x. Also write out x′Ax and its derivatives with respect to u, v and w.

Solution 1.6.
\[
\frac{\partial}{\partial x}(x'Ax) = 2Ax
= 2\begin{pmatrix} 3 & 1 & 2 \\ 1 & 4 & 5 \\ 2 & 5 & 6 \end{pmatrix}\begin{pmatrix} u \\ v \\ w \end{pmatrix}
= 2\begin{pmatrix} 3u + v + 2w \\ u + 4v + 5w \\ 2u + 5v + 6w \end{pmatrix}
= \begin{pmatrix} 6u + 2v + 4w \\ 2u + 8v + 10w \\ 4u + 10v + 12w \end{pmatrix}
\]
or
\[
x'Ax = \begin{pmatrix} u & v & w \end{pmatrix}
\begin{pmatrix} 3 & 1 & 2 \\ 1 & 4 & 5 \\ 2 & 5 & 6 \end{pmatrix}
\begin{pmatrix} u \\ v \\ w \end{pmatrix}
= 3u^2 + 4v^2 + 6w^2 + 2uv + 4uw + 10vw,
\]
so that
\[
\frac{\partial}{\partial u}(x'Ax) = 6u + 2v + 4w, \qquad
\frac{\partial}{\partial v}(x'Ax) = 2u + 8v + 10w, \qquad
\frac{\partial}{\partial w}(x'Ax) = 4u + 10v + 12w.
\]

1.3 Distribution theory

In this section we list a number of results from distribution theory. Their application will become
clear in the next chapter.

Let y be a column vector consisting of p random variables, that is y′ = (y1 · · · yp). Let the expected
value of yi be µi and let µ′ = (µ1 · · · µp). Also, let the covariance of yi and yj be σij.
Thus E(yi − µi)(yj − µj) = σij (so that σii is the variance of yi). Let Σ be the matrix
\[
\Sigma = (\sigma_{ij}) = \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1p} \\ \vdots & & \vdots \\ \sigma_{p1} & \cdots & \sigma_{pp} \end{pmatrix}.
\]
Obviously σij = σji so that Σ is symmetric. Σ is called the covariance matrix of y. We would also
write
\[
\Sigma = E(y - \mu)(y - \mu)' = \operatorname{Cov}(y, y').
\]

(D1) A covariance matrix must be positive definite (or at least positive semidefinite).

(D2) The p × 1 random vector y is said to have a multivariate normal distribution with mean vector
µ and covariance matrix Σ, also written y ∼ n(µ; Σ), if the probability density function of y
is given by
\[
f(y) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}}\exp\!\left[-\tfrac12 (y-\mu)'\Sigma^{-1}(y-\mu)\right].
\]
The covariance matrix Σ must be positive definite otherwise f(y) will not be a probability
density function.

(D3) If y is a random vector with E(y) = µ and Cov(y, y′) = Σ and if A is a matrix of constants,
then z = Ay is a random vector with E(z) = Aµ and Cov(z, z′) = AΣA′.

(D4) If y ∼ n(µ; Σ) then Ay ∼ n(Aµ; AΣA′).


Example 1.7.
\[
x \sim n\left( \begin{pmatrix} 10 \\ 15 \\ 12 \end{pmatrix};\;
\begin{pmatrix} 4 & 3 & -1 \\ 3 & 9 & 3 \\ -1 & 3 & 1 \end{pmatrix} \right).
\qquad \text{What is the distribution of } y = \begin{pmatrix} 1 & -1 & 0 \\ \tfrac13 & \tfrac13 & \tfrac13 \end{pmatrix}x\,?
\]

Solution 1.7.
Here x ∼ n(µ; Σ) with
\[
\mu = \begin{pmatrix} 10 \\ 15 \\ 12 \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} 4 & 3 & -1 \\ 3 & 9 & 3 \\ -1 & 3 & 1 \end{pmatrix},
\]
and y = Ax where
\[
A = \begin{pmatrix} 1 & -1 & 0 \\ \tfrac13 & \tfrac13 & \tfrac13 \end{pmatrix}.
\]
If x ∼ n(µ; Σ) then y ∼ n(Aµ; AΣA′) (using (D4)). Now
\[
A\mu = \begin{pmatrix} 1 & -1 & 0 \\ \tfrac13 & \tfrac13 & \tfrac13 \end{pmatrix}\begin{pmatrix} 10 \\ 15 \\ 12 \end{pmatrix}
= \begin{pmatrix} -5 \\ \tfrac{37}{3} \end{pmatrix}
\]
and
\[
A\Sigma A' = \begin{pmatrix} 1 & -1 & 0 \\ \tfrac13 & \tfrac13 & \tfrac13 \end{pmatrix}
\begin{pmatrix} 4 & 3 & -1 \\ 3 & 9 & 3 \\ -1 & 3 & 1 \end{pmatrix}
\begin{pmatrix} 1 & \tfrac13 \\ -1 & \tfrac13 \\ 0 & \tfrac13 \end{pmatrix}
= \begin{pmatrix} 1 & -6 & -4 \\ 2 & 5 & 1 \end{pmatrix}
\begin{pmatrix} 1 & \tfrac13 \\ -1 & \tfrac13 \\ 0 & \tfrac13 \end{pmatrix}
= \begin{pmatrix} 7 & -3 \\ -3 & \tfrac83 \end{pmatrix}.
\]
Thus
\[
y \sim n\left( \begin{pmatrix} -5 \\ \tfrac{37}{3} \end{pmatrix};\; \begin{pmatrix} 7 & -3 \\ -3 & \tfrac83 \end{pmatrix} \right).
\]

(D5) If y ∼ n(0; Ip) then y′y ∼ χ²p (that is, y′y has a chi-square distribution with p degrees of
freedom).

(D6) If y ∼ n(µ; σ²Ip) then (y − µ)′(y − µ)/σ² ∼ χ²p.

(D7) If y ∼ n(µ; σ²Ip) then y′y/σ² is said to have a non-central chi-square distribution with p
degrees of freedom and noncentrality parameter (n.c.p.) λ = µ′µ/σ². (Some textbooks use
the alternative convention λ = ½µ′µ/σ².)

We write y′y/σ² ∼ χ²p;λ.



When λ = 0 the distribution becomes the (central) chi-square distribution. The n.c.p. λ is
non-negative by definition. As λ increases, the distribution shifts to the right. This is also
reflected in the following theorem.

(D8) If x is a noncentral chi-square variate with p degrees of freedom and noncentrality parameter
λ, then x has mean and variance

E(x) = p + λ; Var(x) = 2p + 4λ.

Thus both the mean and the variance increase as λ increases.
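A small Monte Carlo sketch of (D7) and (D8): simulate y ∼ n(µ; σ²Ip) and compare the sample mean and variance of y′y/σ² with p + λ and 2p + 4λ. The values of µ and σ below are arbitrary assumptions made for the illustration.

```python
import numpy as np

rng   = np.random.default_rng(1)
mu    = np.array([3.0, 4.0])          # assumed mean vector
sigma = np.sqrt(5.0)                  # assumed standard deviation
p     = len(mu)
lam   = mu @ mu / sigma**2            # noncentrality parameter, as in (D7)

y = rng.normal(mu, sigma, size=(200_000, p))   # each row is an independent y
w = (y**2).sum(axis=1) / sigma**2              # y'y / sigma^2 for each replicate

print(w.mean(), p + lam)              # sample mean     vs  p + lambda      (D8)
print(w.var(),  2*p + 4*lam)          # sample variance vs  2p + 4*lambda   (D8)
```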

Example 1.8.
\[
y \sim n\left( \begin{pmatrix} 3 \\ 4 \end{pmatrix};\; \begin{pmatrix} 5 & 0 \\ 0 & 5 \end{pmatrix} \right).
\]

(a) What is the distribution of (1/5)(y1² + y2²)? Calculate its mean and variance.

(b) Which random variable has a χ²₂-distribution?

Solution 1.8.

(a) (1/5)(y1² + y2²) ∼ χ²₂;λ where λ = µ′µ/σ² (using (D7)). Now
\[
\mu'\mu = \begin{pmatrix} 3 & 4 \end{pmatrix}\begin{pmatrix} 3 \\ 4 \end{pmatrix} = 25,
\]
so λ = µ′µ/σ² = 25/5 = 5. Using (D8):

Mean = E((1/5)(y1² + y2²)) = p + λ = 2 + 5 = 7.

Variance = Var((1/5)(y1² + y2²)) = 2p + 4λ = 2(2) + 4(5) = 24.

Note: y′y/σ² ∼ χ²p;λ with λ = µ′µ/σ² (using (D7)).

(b) The random variable that has a χ²₂-distribution is (1/5)[(y1 − 3)² + (y2 − 4)²] (using (D6)).

(D9) Let u and v be independent with u ∼ χ²p and v ∼ χ²q. Then
\[
F = \frac{u/p}{v/q}
\]
is said to have an Fp;q distribution.

(D10) Let u and v be independent random variables with u a noncentral chi-square variate with p
degrees of freedom and n.c.p. λ, and v ∼ χ²q. Then
\[
F = \frac{u/p}{v/q}
\]
is said to have a noncentral F-distribution with p and q degrees of freedom and n.c.p. λ.
The effect of the n.c.p. λ on the distribution of F is the same as the effect of λ on the
distribution of u: increasing λ causes a shift to the right.

(D11) Let y ∼ n(µ; σ²Ip). Let A be a symmetric matrix of constants. Then y′Ay/σ² has a (non-
central) chi-square distribution with r(A) degrees of freedom and n.c.p. λ = µ′Aµ/σ² if and
only if A is idempotent, that is AA = A.

(D12) Let y ∼ n(µ; σ²Ip) and let A and B be constant matrices. Then y′Ay/σ² and y′By/σ² are
independent if and only if AB = O.

Example 1.9.
\[
y \sim n\left( \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix};\;
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \right).
\]

(a) Write x = (1/3)(y1 + y2 + y3)² and z = y1² + y2² + y3² − (1/3)(y1 + y2 + y3)²
in matrix form as x = y′Ay and z = y′By.

(b) Show that x and z are independent and have chi-square distributions; compute the number
of degrees of freedom of each.

(c) What is the distribution of 2x/z?



Solution 1.9.

(a) Here y ∼ n(µ; σ²Ip) with µ = 0 and σ² = 1. Now
\[
x = \tfrac13(y_1 + y_2 + y_3)^2
= \tfrac13(y_1^2 + y_2^2 + y_3^2 + y_1y_2 + y_2y_1 + y_1y_3 + y_3y_1 + y_2y_3 + y_3y_2)
= y'Ay
\]
with
\[
A = \begin{pmatrix} \tfrac13 & \tfrac13 & \tfrac13 \\ \tfrac13 & \tfrac13 & \tfrac13 \\ \tfrac13 & \tfrac13 & \tfrac13 \end{pmatrix}.
\]
Similarly
\[
z = y_1^2 + y_2^2 + y_3^2 - \tfrac13(y_1 + y_2 + y_3)^2
= \tfrac23 y_1^2 + \tfrac23 y_2^2 + \tfrac23 y_3^2
- \tfrac13 y_1y_2 - \tfrac13 y_2y_1 - \tfrac13 y_1y_3 - \tfrac13 y_3y_1 - \tfrac13 y_2y_3 - \tfrac13 y_3y_2
= y'By
\]
with
\[
B = \begin{pmatrix} \tfrac23 & -\tfrac13 & -\tfrac13 \\ -\tfrac13 & \tfrac23 & -\tfrac13 \\ -\tfrac13 & -\tfrac13 & \tfrac23 \end{pmatrix}.
\]

(b) Checking for idempotence:
\[
AA = \begin{pmatrix} \tfrac13 & \tfrac13 & \tfrac13 \\ \tfrac13 & \tfrac13 & \tfrac13 \\ \tfrac13 & \tfrac13 & \tfrac13 \end{pmatrix}
\begin{pmatrix} \tfrac13 & \tfrac13 & \tfrac13 \\ \tfrac13 & \tfrac13 & \tfrac13 \\ \tfrac13 & \tfrac13 & \tfrac13 \end{pmatrix} = A,
\qquad
BB = \begin{pmatrix} \tfrac23 & -\tfrac13 & -\tfrac13 \\ -\tfrac13 & \tfrac23 & -\tfrac13 \\ -\tfrac13 & -\tfrac13 & \tfrac23 \end{pmatrix}
\begin{pmatrix} \tfrac23 & -\tfrac13 & -\tfrac13 \\ -\tfrac13 & \tfrac23 & -\tfrac13 \\ -\tfrac13 & -\tfrac13 & \tfrac23 \end{pmatrix} = B.
\]
Furthermore
\[
AB = \begin{pmatrix} \tfrac13 & \tfrac13 & \tfrac13 \\ \tfrac13 & \tfrac13 & \tfrac13 \\ \tfrac13 & \tfrac13 & \tfrac13 \end{pmatrix}
\begin{pmatrix} \tfrac23 & -\tfrac13 & -\tfrac13 \\ -\tfrac13 & \tfrac23 & -\tfrac13 \\ -\tfrac13 & -\tfrac13 & \tfrac23 \end{pmatrix}
= \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} = O.
\]
Thus x and z are independent (using (D12)), and since AA = A and BB = B they have chi-square
distributions (using (D11)) with r(A) = tr(A) = 1 and r(B) = tr(B) = 2 degrees of freedom
respectively: x ∼ χ²₁ and z ∼ χ²₂.

(c) The distribution of 2x/z is F1;2, since
\[
F = \frac{u/p}{v/q} = \frac{x/1}{z/2} = 2x/z \quad \text{(using (D9))}.
\]
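The matrix checks in Solution 1.9 can also be carried out numerically; the sketch below uses the A and B of the example.

```python
import numpy as np

A = np.full((3, 3), 1/3)                       # matrix of the quadratic form x = y'Ay
B = np.eye(3) - A                              # matrix of the quadratic form z = y'By

print(np.allclose(A @ A, A), np.allclose(B @ B, B))   # both idempotent, so chi-square (D11)
print(np.allclose(A @ B, 0))                          # AB = O, so x and z are independent (D12)
print(np.trace(A), np.trace(B))                       # degrees of freedom 1 and 2, by (M10)
```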

(D13) If y ∼ n(µ; σ²Ip) then the quadratic form y′Ay/σ² and the linear form (1/σ)By are independent
if and only if BA = O.

(D14) More generally, if y ∼ n(µ; V) then y′Ay has a noncentral chi-square distribution with r(A)
degrees of freedom and n.c.p. λ = µ′Aµ if and only if AV is idempotent.

(D15) If y ∼ n(µ; V) then y′Ay and y′By are independent if and only if AVB = O.

(D16) If y ∼ n(µ; σ²V) where V is non-singular, then y′V⁻¹y/σ² is a noncentral chi-square variate
with r(V) degrees of freedom and n.c.p. λ = µ′V⁻¹µ/σ².

Example 1.10.
\[
y \sim n\left( \begin{pmatrix} \mu \\ \mu \end{pmatrix};\;
\begin{pmatrix} \sigma^2 & \sigma^2\rho \\ \sigma^2\rho & \sigma^2 \end{pmatrix} \right).
\]
Construct a variable that will have a χ²₂-distribution.

Solution 1.10.
\[
y \sim n\left( \begin{pmatrix} \mu \\ \mu \end{pmatrix};\; \sigma^2\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right)
\quad \text{where} \quad V = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \text{ and } r(V) = 2.
\]
Applying (D16) to y − µ ∼ n(0; σ²V), the variate (y − µ)′V⁻¹(y − µ)/σ² has a (central) chi-square
distribution with r(V) = 2 degrees of freedom. Now
\[
V^{-1} = \frac{1}{1-\rho^2}\begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix},
\]
so that
\[
\frac{(y-\mu)'V^{-1}(y-\mu)}{\sigma^2}
= \frac{1}{\sigma^2(1-\rho^2)}\begin{pmatrix} y_1-\mu & y_2-\mu \end{pmatrix}
\begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix}
\begin{pmatrix} y_1-\mu \\ y_2-\mu \end{pmatrix}
= \frac{(y_1-\mu)^2 - 2\rho(y_1-\mu)(y_2-\mu) + (y_2-\mu)^2}{\sigma^2(1-\rho^2)}
\;\sim\; \chi^2_2.
\]

Exercise 1.1

1. Write the following quadratic forms in matrix notation. Ensure that each matrix is
symmetric.

(a) 4x₁² + 6x₁x₂ + 7x₂²

(b) 2x₁² + 4x₂² − 2x₁x₃ − x₂x₃ + 5x₃²

(c) (2/3)x₁² + (2/3)x₂² + (2/3)x₃² − (2/3)x₁x₂ − (2/3)x₂x₃ − (2/3)x₁x₃

(d) 3x₁² + 13x₂² + x₃² + 10x₁x₂ + 2x₁x₃

(e) x₁² − 2x₁x₂ − x₂² − 3x₂x₃ + 8x₃²

2. Determine whether each of the following matrices is positive definite or positive semidefinite.
\[
\text{(a) } \begin{pmatrix} 9 & -3 \\ -3 & 1 \end{pmatrix} \qquad
\text{(b) } \begin{pmatrix} \tfrac23 & -\tfrac13 & -\tfrac13 \\ -\tfrac13 & \tfrac23 & -\tfrac13 \\ -\tfrac13 & -\tfrac13 & \tfrac23 \end{pmatrix} \qquad
\text{(c) } \begin{pmatrix} 2 & -\tfrac32 \\ -\tfrac32 & 3 \end{pmatrix} \qquad
\text{(d) } \begin{pmatrix} 1 & 2 & 0 \\ 2 & 5 & -1 \\ 0 & -1 & 1 \end{pmatrix}
\]

3. Evaluate the following statements as true or false, and substantiate your answers for statements
which are false. Consider the matrix
\[
A = \begin{pmatrix} \tfrac12 & \tfrac12 & 0 \\ \tfrac12 & \tfrac12 & 0 \\ 0 & 0 & 1 \end{pmatrix}.
\]

(a) Let y ∼ n(µ; σ²Ip). Then y′Ay/σ² has a (noncentral) chi-square distribution with three degrees
of freedom and noncentrality parameter λ.
(b) Let y ∼ n(µ; σ²Ip) and
\[
B = \begin{pmatrix} \tfrac12 & \tfrac12 & 0 \\ \tfrac12 & \tfrac12 & 0 \\ 0 & 0 & 0 \end{pmatrix}.
\]
Then y′Ay/σ² and y′By/σ² are independent.

(c) The p × 1 random vector y is said to have a multivariate normal distribution with mean
vector µ′ = (3 9 2) and covariance matrix
\[
\Sigma = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac14 & \tfrac14 \\ 0 & \tfrac14 & \tfrac14 \end{pmatrix}.
\]
Then
\[
f(y) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\!\left[-\tfrac12(y-\mu)'\Sigma^{-1}(y-\mu)\right]
\]
is a probability density function.

(d) The quadratic form $2\left[(r - \tfrac12 s)(r + \tfrac32 s) - q(r - q)\right]$ can be written in matrix notation,
with symmetric matrix, as
\[
\begin{pmatrix} q & r & s \end{pmatrix}
\begin{pmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & -\tfrac32 \end{pmatrix}
\begin{pmatrix} q \\ r \\ s \end{pmatrix}.
\]

4. Suppose xi, i = 1, 2, 3, has independent n(µi; 1) distributions such that 2µ1 = µ2 = 3µ3.

Let
\[
q_1 = x_1^2 + \tfrac{1}{10}(x_2 + 3x_3)^2, \qquad
q_2 = k(3x_2 - x_3)^2, \qquad
q_3 = \tfrac12(x_1 - x_2)^2.
\]
(a) Write qi in the form qi = x′Aix where Ai is symmetric, i = 1, 2, 3.

(b) Determine whether q1 has a chi-squared distribution.

(c) Determine k such that q2 has a chi-squared distribution.

(d) Determine the degrees of freedom and the noncentrality parameters of q1 and of q2.

(e) Determine whether q1 and q2 are independent.

(f) Suppose x ∼ n(µ; V); x′ = (x1 x2 x3); µ′ = (µ1 µ2 µ3);
\[
V = \begin{pmatrix} \tfrac58 & -\tfrac38 & -\tfrac18 \\ -\tfrac38 & \tfrac58 & -\tfrac18 \\ -\tfrac18 & -\tfrac18 & \tfrac58 \end{pmatrix};
\]
2µ1 = µ2 = 3µ3 and q4 ∼ χ²q (a central chi-squared distribution).

(i) Determine whether q1 has a chi-squared distribution.

(ii) Determine the degrees of freedom p and the noncentrality parameter λ of q3.

(iii) Determine whether F = (q3/p)/(q4/q) has a noncentral F-distribution with p and q degrees
of freedom and n.c.p. λ. Substantiate your answer in detail.
5. Let y ∼ n(µ; σ²Ip) with µ′ = (5 3 4) and σ = 3.

(a) What is the distribution of y′y/σ²?

(b) Calculate the mean and variance of the distribution in (a).
6. Let x : 2 × 1 ∼ n(µ; Σ) with
\[
\mu = \begin{pmatrix} 3 \\ 2 \end{pmatrix} \quad \text{and} \quad
\Sigma = \begin{pmatrix} 3 & 1 \\ 1 & \tfrac23 \end{pmatrix}.
\]

(a) What is the distribution of x′Ax if
\[
A = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}?
\]

(b) If
\[
B = \begin{pmatrix} 2 & -4 \\ -4 & 2 \end{pmatrix},
\]
are x′Ax and x′Bx independent? Explain.
Chapter 2

THE LINEAR MODEL

2.1 The model

We assume that
\[
\underset{n\times 1}{y} = \underset{n\times p}{X}\;\underset{p\times 1}{\beta} + \underset{n\times 1}{\epsilon}
\]
where y is an n×1 vector of random variables, X is an n×p matrix of known constants with n>p,
β is a p×1 vector of unknown parameters¹, and ε is an n×1 vector of (unobservable) random
variables. Thus ε is not a vector of parameters, but a Greek letter is used nevertheless, since ε is
unknown. In this model X is selected in advance, y is observed and β must be estimated; ε is a
random disturbance. The following assumptions are usually made:

Assumption 1

The matrix X has full column rank, that is r(X) = p. This implies that the rank of the p×p
symmetric matrix X′X is p and therefore X′X is non-singular.

¹We usually assume that β is a vector of constants. In chapter 3, we will consider cases where some of
the βi are random variables.

This assumption is not essential for the linear model in general, but it simplifies the algebra and
in most practical applications the model can be formulated in such a way that X0 X is in fact
non-singular. Without this assumption one would first have to study the subject of generalised
inverses of matrices.

Assumption 2

The random vector ε has expectation 0, that is E(ε) = 0.

This assumption is essential; if E(ε) ≠ 0 one could incorporate E(ε) into the vector of unknown
parameters β, possibly adding an extra parameter into β and an extra column to X.

The result of assumption 2 is that
\[
E(y) = X\beta + E(\epsilon) = X\beta \qquad \text{(remember that } X \text{ and } \beta \text{ are not random).}
\]
The term ε is usually called the error term: it provides for the random fluctuation inherent in a
statistical experiment.

Assumption 3

The covariance matrix of ε (and thus of y) is known at least up to a constant factor.

For example, E(εε′) = σ²V with V known or, more usually,
\[
E(\epsilon\epsilon') = \sigma^2 I,
\]
which means that the components of ε are uncorrelated and have the same variance σ².

Assumption 4

X is a matrix of known constants.

Each column of X represents n values of a variable, often called an independent variable, predictand,
concomitant variable, et cetera. We refer to it here as a design variable, and X will be called the
design matrix. We assume that (ideally) the design matrix X was chosen in advance, and the n
values of the response variable y which constitute the vector y were observed accordingly by means
of an experiment.

The linear model in general does not exclude the possibility that X may be a random matrix. In
the problems dealt with in this module we will usually have a non-random matrix X. Many (but
not all) of the results derived in this module will be valid if X is in fact random.

Assumption 5

Sometimes we will assume that ε has a multivariate normal distribution.

This assumption is not essential for the purpose of estimating β, but as soon as we want to test
hypotheses about or construct confidence intervals for the parameters, we have to make this
assumption. Also, least squares estimators are maximum likelihood estimators if assumption 5 holds.

Assumption 6

n>p, that is there are more observations than parameters. This is a very important assumption -
without it, the least squares estimators are not unique. If n<p then r(X0 X) <p and therefore X0 X
will be singular (cf. assumption 2).

2.2 Examples of linear models

In this section it is shown that some of the statistical problems dealt with in most elementary
courses are special cases of the linear model.

2.2.1 One-sample problem

Let y1 , . . . , yn be a random sample from a n(µ; σ 2 ) distribution.

Then we may write


\[
y_1 = \mu + \epsilon_1, \;\;\ldots\;\;, y_n = \mu + \epsilon_n
\]
where ε1, . . . , εn are independent n(0; σ²) variates, that is ε ∼ n(0; σ²I). This may be written as
\[
\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}
= \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}(\mu)
+ \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}
\]
or y = Xβ + ε. In this case p = 1 and X = 1.

2.2.2 Two-sample problem

Let y1, . . . , yn1 be a random sample from a n(µ1; σ²) distribution and v1, . . . , vn2 a random sample
from a n(µ2; σ²) distribution such that the two samples are independent. We rename the second
sample by writing yn1+i = vi, i = 1, . . . , n2. The model is then
\[
\begin{pmatrix} y_1 \\ \vdots \\ y_{n_1} \\ y_{n_1+1} \\ \vdots \\ y_{n_1+n_2} \end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 0 & 1 \\ \vdots & \vdots \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}
+
\begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_{n_1} \\ \epsilon_{n_1+1} \\ \vdots \\ \epsilon_{n_1+n_2} \end{pmatrix}
\]
or y = Xβ + ε where ε ∼ n(0; σ²I). In this case p = 2 and n = n1 + n2.


2.2.3 Simple linear regression

Assume y1, . . . , yn are independent with
\[
E(y_i) = \alpha + \beta x_i, \qquad \operatorname{Var}(y_i) = \sigma^2.
\]
This implies that
\[
y_i = \alpha + \beta x_i + \epsilon_i, \quad i = 1, \ldots, n,
\]
that is
\[
\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}
= \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}
\begin{pmatrix} \alpha \\ \beta \end{pmatrix}
+ \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}
\]
or y = Xβ + ε, where E(ε) = 0 and E(εε′) = σ²I. In this case p = 2.

Example 2.1.

Assume that E(yi) = β0 + β1xi; Var(yi) = σ²/wi, i = 1, . . . , n, where y1, . . . , yn are independent
and w1, . . . , wn are known constants. This is a particular “weighted regression” model. Write this
in matrix notation.

Solution 2.1.
\[
y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, \ldots, n,
\]
that is
\[
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
= \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}
+ \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}.
\]

2.3 Estimation

Let y = Xβ + ε where E(ε) = 0 and E(εε′) = σ²I.

The problem is to estimate β. The estimator will be denoted by β̂. The corresponding vector of
values of the response variable, found by replacing β by β̂ and ε by 0 in the model, is denoted by
ŷ. Thus ŷ = Xβ̂.

The difference between the observed and predicted response, y − ŷ, is called the vector of residuals
and is said to be an estimator for ε. (This is an unconventional use of the term “estimator”, since
one usually estimates parameters while ε is a vector of unobservable random variables.)

Let e = y − ŷ = y − Xβ̂. Then
\[
E(e) = E(y) - E(X\hat\beta) = X\beta - X\beta = 0 = E(\epsilon),
\]
provided E(β̂) = β, that is provided β̂ is an unbiased estimator for β. The fact that E(e) = E(ε)
is the justification for calling e an estimator for ε.

The least squares criterion states that the estimator β̂ must be found in such a way that e′e, the
sum of squares of the residuals, is a minimum. Thus we have to minimise
\[
s = e'e = (y - X\hat\beta)'(y - X\hat\beta)
= y'y - \hat\beta'X'y - y'X\hat\beta + \hat\beta'X'X\hat\beta
= y'y - 2\hat\beta'X'y + \hat\beta'X'X\hat\beta
\]
since β̂′X′y and y′Xβ̂ are scalars and therefore equal.

In order to minimise s we set ∂s/∂β̂ = 0. Using (M14) and (M15) and noting that X′X is symmetric,
we find
\[
\frac{\partial s}{\partial \hat\beta} = -2X'y + 2X'X\hat\beta.
\]
Equating this to zero we find
\[
X'X\hat\beta = X'y \qquad \therefore \quad \hat\beta = (X'X)^{-1}X'y
\]
since we have assumed that X′X is non-singular. This is the least squares estimator. (Can you
prove that β̂ minimises rather than maximises s?)
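A minimal numerical sketch of the least squares estimator β̂ = (X′X)⁻¹X′y follows; the data values are invented for the illustration, and in practice a numerically stabler routine such as numpy.linalg.lstsq would be preferred to an explicit inverse.

```python
import numpy as np

# Simple linear regression example: the columns of X are (1, x_i)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
X = np.column_stack([np.ones_like(x), x])

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y    # (X'X)^{-1} X'y
y_hat    = X @ beta_hat                        # fitted values
e        = y - y_hat                           # residuals

print(beta_hat)                                # estimates of (alpha, beta)
print(np.allclose(X.T @ e, 0))                 # normal equations: X'e = 0
```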

If we assume in addition that ε has a multivariate normal distribution, that is ε ∼ n(0; σ²I), then
the same estimator is also the maximum likelihood estimator. More generally, if we assume that
ε ∼ n(0; σ²V), that is y ∼ n(Xβ; σ²V), where V is non-singular, then the joint probability density
function (p.d.f.) of the elements of y, which is the likelihood function, is
\[
L = \frac{1}{(2\pi\sigma^2)^{n/2}\,|V|^{1/2}}\exp\!\left[-\frac{1}{2\sigma^2}(y - X\beta)'V^{-1}(y - X\beta)\right]
\]
\[
\therefore\;\; \ln L = -\tfrac{n}{2}\ln(2\pi\sigma^2) - \tfrac12\ln|V| - \frac{1}{2\sigma^2}(y - X\beta)'V^{-1}(y - X\beta).
\]
In order to maximise ln L (and therefore L) with respect to β we have to minimise
\[
s = (y - X\beta)'V^{-1}(y - X\beta) = y'V^{-1}y - 2\beta'X'V^{-1}y + \beta'X'V^{-1}X\beta.
\]
\[
\frac{\partial s}{\partial \beta} = -2X'V^{-1}y + 2X'V^{-1}X\beta = 0 \quad \text{if} \quad X'V^{-1}X\beta = X'V^{-1}y.
\]
We assume that V has full rank and X has full column rank, so that X′V⁻¹X is non-singular and
therefore
\[
\beta = (X'V^{-1}X)^{-1}X'V^{-1}y.
\]
The maximum likelihood estimator for β is therefore
\[
\hat\beta = (X'V^{-1}X)^{-1}X'V^{-1}y.
\]
The (ordinary) least squares (OLS) estimator is a special case of this estimator (replace V by In).
The more general estimator is called the generalised least squares (GLS) estimator; any property
of the GLS estimator will automatically be a property of the OLS estimator when the latter is
appropriate, that is when V = In.

Another special case of GLS is weighted least squares (WLS), which is used when V is a diagonal
matrix, that is, if V = (vij) then vij = 0 for i ≠ j. In this case
\[
V = \begin{pmatrix} v_{11} & 0 & \dots & 0 \\ 0 & v_{22} & \dots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \dots & v_{nn} \end{pmatrix},
\qquad
V^{-1} = \begin{pmatrix} 1/v_{11} & 0 & \dots & 0 \\ 0 & 1/v_{22} & \dots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \dots & 1/v_{nn} \end{pmatrix},
\]
\[
(y - \hat y)'V^{-1}(y - \hat y) = \sum_{i=1}^{n}\frac{1}{v_{ii}}(y_i - \hat y_i)^2 = \sum_{i=1}^{n} w_i (y_i - \hat y_i)^2
\]
where the weight wi = 1/vii. Obviously OLS is also a special case of WLS when vii = 1, i = 1, . . . , n.
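As a sketch of weighted least squares: with V = diag(1/w1, . . . , 1/wn), the GLS formula β̂ = (X′V⁻¹X)⁻¹X′V⁻¹y amounts to weighting each observation by wi. All numbers below are assumptions made up for the illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.8, 6.3, 7.9])
w = np.array([1.0, 2.0, 1.0, 0.5])             # known weights, Var(y_i) = sigma^2 / w_i

X    = np.column_stack([np.ones_like(x), x])
Vinv = np.diag(w)                              # V^{-1} = diag(w_1, ..., w_n)

beta_wls = np.linalg.inv(X.T @ Vinv @ X) @ X.T @ Vinv @ y
print(beta_wls)                                # weighted least squares estimate of (beta0, beta1)
```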
We now assume that
\[
y = X\beta + \epsilon
\]
where E(ε) = 0 and E(εε′) = σ²V. The generalised least squares estimator is
\[
\hat\beta = (X'V^{-1}X)^{-1}X'V^{-1}y = Ay, \text{ say, where } A = (X'V^{-1}X)^{-1}X'V^{-1}.
\]
Thus β̂ is a linear estimator in the sense that each element of β̂ is a linear combination of y1, . . . , yn.
We are interested in the distribution of the random vector β̂.
Theorem 2.3.1

E(β̂) = β and Cov(β̂, β̂′) = σ²(X′V⁻¹X)⁻¹.

Proof:

By (D3),
\[
E(\hat\beta) = E(Ay) = A\,E(y) = (X'V^{-1}X)^{-1}X'V^{-1}(X\beta) = I_p\beta = \beta.
\]
Thus β̂ is unbiased; we say it is an unbiased linear estimator. The covariance matrix of β̂ is likewise
derived from (D3):
\[
\operatorname{Cov}(\hat\beta, \hat\beta') = \operatorname{Cov}(Ay, (Ay)') = A\operatorname{Cov}(y, y')A'
= (X'V^{-1}X)^{-1}X'V^{-1}(\sigma^2 V)V^{-1}X(X'V^{-1}X)^{-1}
\]
\[
= \sigma^2 (X'V^{-1}X)^{-1}X'V^{-1}X(X'V^{-1}X)^{-1}
= \sigma^2 (X'V^{-1}X)^{-1}.
\]
In the special case V = In we have
\[
\hat\beta = (X'X)^{-1}X'y, \qquad E(\hat\beta) = \beta, \qquad \operatorname{Cov}(\hat\beta, \hat\beta') = \sigma^2(X'X)^{-1}.
\]
The following theorem follows from (D4).


Theorem 2.3.2

If y ∼ n(Xβ; σ²V) and
\[
\hat\beta = (X'V^{-1}X)^{-1}X'V^{-1}y
\]
then β̂ ∼ n(β; σ²(X′V⁻¹X)⁻¹).

We apply the foregoing results to the three examples in section 2.2. In each case V = In.

2.3.1 One-sample problem

In this case X′ = (1 . . . 1) = 1′ and y′ = (y1, . . . , yn). Then
\[
X'X = n; \qquad (X'X)^{-1} = 1/n; \qquad X'y = y_1 + y_2 + \dots + y_n
\]
\[
\therefore\;\; \hat\mu = (X'X)^{-1}X'y = \tfrac1n(y_1 + \dots + y_n) = \bar y.
\]
Furthermore, E(µ̂) = µ and Var(µ̂) = σ²(X′X)⁻¹ = σ²/n, a well-known result.

2.3.2 Two-sample problem

In this case
\[
X' = \begin{pmatrix} 1 & \dots & 1 & 0 & \dots & 0 \\ 0 & \dots & 0 & 1 & \dots & 1 \end{pmatrix}, \qquad
y' = (y_1 \cdots y_{n_1}\; y_{n_1+1} \cdots y_{n_1+n_2}),
\]
\[
X'X = \begin{pmatrix} n_1 & 0 \\ 0 & n_2 \end{pmatrix}; \qquad
(X'X)^{-1} = \begin{pmatrix} 1/n_1 & 0 \\ 0 & 1/n_2 \end{pmatrix}; \qquad
X'y = \begin{pmatrix} y_1 + \dots + y_{n_1} \\ y_{n_1+1} + \dots + y_{n_1+n_2} \end{pmatrix},
\]
\[
\hat\beta = \begin{pmatrix} \hat\mu_1 \\ \hat\mu_2 \end{pmatrix}
= \begin{pmatrix} (y_1 + \dots + y_{n_1})/n_1 \\ (y_{n_1+1} + \dots + y_{n_1+n_2})/n_2 \end{pmatrix}
= \begin{pmatrix} \bar y_1 \\ \bar y_2 \end{pmatrix}, \text{ say.}
\]
The covariance matrix is
\[
\sigma^2(X'X)^{-1} = \begin{pmatrix} \sigma^2/n_1 & 0 \\ 0 & \sigma^2/n_2 \end{pmatrix},
\]
again a well-known result.

2.3.3 Simple linear regression

Here
\[
X' = \begin{pmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_n \end{pmatrix}, \qquad y' = (y_1 \cdots y_n),
\]
\[
X'X = \begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix}, \qquad
|X'X| = n\sum x_i^2 - \Big(\sum x_i\Big)^2 = n\sum (x_i - \bar x)^2.
\]
Thus X′X is non-singular provided Σ(xi − x̄)² ≠ 0.
\[
(X'X)^{-1} = \begin{pmatrix}
\dfrac{\sum x_i^2}{n\sum(x_i-\bar x)^2} & -\dfrac{\sum x_i}{n\sum(x_i-\bar x)^2} \\[10pt]
-\dfrac{\sum x_i}{n\sum(x_i-\bar x)^2} & \dfrac{n}{n\sum(x_i-\bar x)^2}
\end{pmatrix}, \qquad
X'y = \begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix},
\]
\[
\hat\beta = \begin{pmatrix} \hat\alpha \\ \hat\beta \end{pmatrix} = (X'X)^{-1}X'y
= \begin{pmatrix}
\dfrac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{n\sum(x_i-\bar x)^2} \\[12pt]
\dfrac{-\sum x_i \sum y_i + n\sum x_i y_i}{n\sum(x_i-\bar x)^2}
\end{pmatrix}
\]
which can be written
\[
\hat\beta = \sum y_i (x_i - \bar x)\Big/\sum (x_i - \bar x)^2, \qquad
\hat\alpha = \bar y - \hat\beta \bar x.
\]
The covariance matrix of β̂ is σ²(X′X)⁻¹. Thus
\[
\operatorname{Var}(\hat\alpha) = \sigma^2 \sum x_i^2 \Big/ \Big[n \sum (x_i - \bar x)^2\Big], \qquad
\operatorname{Var}(\hat\beta) = \sigma^2 \Big/ \sum (x_i - \bar x)^2, \qquad
\operatorname{Cov}(\hat\alpha, \hat\beta) = -\sigma^2 \sum x_i \Big/ \Big[n \sum (x_i - \bar x)^2\Big].
\]
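The closed-form expressions for α̂ and β̂ can be checked against the matrix formula; below is a short sketch with invented data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Matrix form (X'X)^{-1} X'y
X = np.column_stack([np.ones_like(x), x])
beta_mat = np.linalg.inv(X.T @ X) @ X.T @ y

# Closed-form expressions from this section
b = np.sum(y * (x - x.mean())) / np.sum((x - x.mean())**2)
a = y.mean() - b * x.mean()

print(beta_mat, (a, b))                        # identical estimates of (alpha, beta)
```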

Example 2.2.

Repeat the analysis for the weighted regression model in example 2.1.

Solution 2.2.

The model is yi = β0 + β1xi + εi, that is
\[
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
= \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}
+ \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}
\]
where E(ε) = 0 and E(εε′) = σ²V with
\[
V = \begin{pmatrix} \tfrac{1}{w_1} & 0 & \dots & 0 \\ 0 & \tfrac{1}{w_2} & \dots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \dots & \tfrac{1}{w_n} \end{pmatrix},
\qquad
V^{-1} = \begin{pmatrix} w_1 & 0 & \dots & 0 \\ 0 & w_2 & \dots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \dots & w_n \end{pmatrix},
\qquad
X' = \begin{pmatrix} 1 & 1 & \dots & 1 \\ x_1 & x_2 & \dots & x_n \end{pmatrix}.
\]
Then
\[
X'V^{-1}X = \begin{pmatrix} 1 & 1 & \dots & 1 \\ x_1 & x_2 & \dots & x_n \end{pmatrix}
\begin{pmatrix} w_1 & 0 & \dots & 0 \\ 0 & w_2 & \dots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \dots & w_n \end{pmatrix}
\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}
= \begin{pmatrix} \sum w_i & \sum w_i x_i \\ \sum w_i x_i & \sum w_i x_i^2 \end{pmatrix},
\]
\[
(X'V^{-1}X)^{-1} = \frac{1}{\sum w_i \sum w_i x_i^2 - \big(\sum w_i x_i\big)^2}
\begin{pmatrix} \sum w_i x_i^2 & -\sum w_i x_i \\ -\sum w_i x_i & \sum w_i \end{pmatrix},
\qquad
X'V^{-1}y = \begin{pmatrix} \sum w_i y_i \\ \sum w_i x_i y_i \end{pmatrix}.
\]
Then
\[
\hat\beta = (X'V^{-1}X)^{-1}X'V^{-1}y
= \begin{pmatrix}
\dfrac{\sum w_i x_i^2 \sum w_i y_i - \sum w_i x_i \sum w_i x_i y_i}{\sum w_i \sum w_i x_i^2 - \big(\sum w_i x_i\big)^2} \\[14pt]
\dfrac{-\sum w_i x_i \sum w_i y_i + \sum w_i \sum w_i x_i y_i}{\sum w_i \sum w_i x_i^2 - \big(\sum w_i x_i\big)^2}
\end{pmatrix}
\]
and Cov(β̂, β̂′) = σ²(X′V⁻¹X)⁻¹.

2.4 The Gauss-Markov theorem

We have seen that the GLS estimator is an unbiased linear estimator. We will now show that
the GLS estimator is, in fact, the best linear unbiased estimator (BLUE) in the sense that the
variance of each component of β̂, say β̂i, is smaller than (or equal to) the variance of any other
unbiased linear estimator of βi. In fact, suppose we wish to estimate a linear combination of the
components of β, say t′β where t is a vector of known constants. In the one-sample problem
we may want to estimate (1)(µ) = (µ); in the two-sample problem we may want to estimate
\[
(1 \;\; -1)\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} = \mu_1 - \mu_2;
\]
while our purpose may be to estimate
\[
(1 \;\; x)\begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \alpha + \beta x
\]
for given x in the simple linear regression model.

The Gauss-Markov theorem states that, of all linear combinations of y1, . . . , yn, that is of all
estimators of the form ℓ′y such that E(ℓ′y) = t′β, the estimator with the smallest variance is
\[
t'\hat\beta = t'(X'V^{-1}X)^{-1}X'V^{-1}y,
\]
that is, the best choice of ℓ′ is
\[
\ell' = t'(X'V^{-1}X)^{-1}X'V^{-1}.
\]

This property holds whatever the form of V, thus also for the OLS estimator when V = In .
Theorem 2.4.1 (The Gauss-Markov theorem)

For any given vector t, the best unbiased linear estimator of t′β is t′β̂ where β̂ is the GLS estimator
of β.

Proof

Consider first the GLS estimator t′β̂ = t′(X′V⁻¹X)⁻¹X′V⁻¹y = c′y, say, where
\[
c' = t'(X'V^{-1}X)^{-1}X'V^{-1} \tag{2.1}
\]
\[
\therefore\;\; E(c'y) = c'E(y) = t'(X'V^{-1}X)^{-1}X'V^{-1}X\beta = t'\beta \tag{2.2}
\]
\[
\operatorname{Var}(t'\hat\beta) = \operatorname{Var}(c'y) = \sigma^2 c'Vc = \sigma^2 t'(X'V^{-1}X)^{-1}t \tag{2.3}
\]
(from 2.1).

Let ℓ′y be any other unbiased linear estimator of t′β. Write, without loss of generality, ℓ = c + d.
Thus the estimator of t′β is
\[
\ell'y = c'y + d'y.
\]
This estimator must be unbiased:
\[
\therefore\;\; E(\ell'y) = E(c'y) + E(d'y) = t'\beta
\quad \therefore\;\; t'\beta + d'X\beta = t'\beta \;\; \text{(from 2.2)}
\quad \therefore\;\; d'X\beta = 0.
\]
This must hold for all β (we do not want an estimator which is unbiased only if β has certain
values),
\[
\therefore\;\; d'X = 0' \quad \text{or, equivalently,} \quad X'd = 0. \tag{2.4}
\]
The variance of ℓ′y is
\[
\operatorname{Var}(\ell'y) = \sigma^2 \ell'V\ell = \sigma^2(c+d)'V(c+d) = \sigma^2(c'Vc + c'Vd + d'Vc + d'Vd).
\]
Now
\[
c'Vd = t'(X'V^{-1}X)^{-1}X'V^{-1}Vd \;\; \text{(from 2.1)}
= t'(X'V^{-1}X)^{-1}X'd = 0 \;\; \text{(from 2.4)},
\]
and similarly d′Vc = 0,
\[
\therefore\;\; \operatorname{Var}(\ell'y) = \sigma^2(c'Vc + d'Vd) = \operatorname{Var}(c'y) + \sigma^2 d'Vd.
\]
Since σ²V is a covariance matrix and V is therefore positive semidefinite (and furthermore it is
assumed here that V⁻¹ exists and V is therefore positive definite), it follows that d′Vd ≥ 0 for all d,
\[
\therefore\;\; \operatorname{Var}(\ell'y) \ge \operatorname{Var}(c'y)
\]
which proves the theorem.

The theorem may also be proved by means of Lagrange multipliers. We want to choose ℓ so as to
minimise
\[
\operatorname{Var}(\ell'y) = \sigma^2 \ell'V\ell
\]
subject to the constraints
\[
E(\ell'y) = \ell'X\beta = t'\beta \text{ for all } \beta, \quad \text{that is } t = X'\ell,
\]
which constitutes p constraints. Thus we choose −2θ as vector of Lagrange multipliers and minimise
\[
\ell'V\ell - 2\theta'(X'\ell - t).
\]
The result is
\[
\ell' = t'(X'V^{-1}X)^{-1}X'V^{-1}
\]
which proves the theorem.

The Gauss-Markov theorem may also be approached as follows. Let β̂* be any unbiased linear
estimator of β and let β̂ be the GLS estimator. Then, similar to our proof of the theorem, it may
be shown that the matrix
\[
\operatorname{Cov}(\hat\beta^*, \hat\beta^{*\prime}) - \operatorname{Cov}(\hat\beta, \hat\beta')
\]
is positive semidefinite. This implies that
\[
\operatorname{Var}(t'\hat\beta^*) - \operatorname{Var}(t'\hat\beta) \ge 0
\]
and in particular that Var(β̂i*) ≥ Var(β̂i) for all i.
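The content of the Gauss-Markov theorem can be illustrated by simulation. For the one-sample problem the GLS (here OLS) estimator of µ is ȳ, and any other unbiased linear estimator ℓ′y has at least as large a variance. The alternative weights below are an arbitrary choice summing to one (an assumption made only for this sketch).

```python
import numpy as np

rng   = np.random.default_rng(2)
n     = 5
mu    = 10.0
sigma = 2.0

ell = np.array([0.4, 0.3, 0.1, 0.1, 0.1])      # weights sum to 1, so l'y is unbiased for mu

y     = rng.normal(mu, sigma, size=(100_000, n))
ybar  = y.mean(axis=1)                         # BLUE of mu
ell_y = y @ ell                                # competing unbiased linear estimator

print(ybar.mean(), ell_y.mean())               # both close to mu (unbiased)
print(ybar.var(),  ell_y.var())                # Var(ybar) = sigma^2/n is the smaller of the two
```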

2.5 Linear hypotheses

In the remainder of this study guide we will assume throughout that the error term ε
is multivariate normally distributed with mean 0 and covariance matrix σ²In.
Thus the model is y = Xβ + ε where ε ∼ n(0; σ²In).

We know that the OLS estimator,
\[
\hat\beta = (X'X)^{-1}X'y,
\]
has a n(β; σ²(X′X)⁻¹) distribution.

It is often necessary to test hypotheses of the general form
\[
H_0:\; \underset{h\times p}{K}\,\underset{p\times 1}{\beta} = \underset{h\times 1}{m}
\]
where K is an h×p matrix of rank h, which automatically means that h ≤ p.

K and m are assumed to be a matrix and a vector of known constants. The requirement that K
must be of full row rank is not overly restrictive, since one may delete rows of K which are linear
combinations of the remaining rows. For example, the two hypotheses β1 − β2 = 0 and β1 − β3 = 0
automatically imply that β2 − β3 = 0, the latter being redundant. We must of course be sure
that the deleted hypotheses are consistent with the remaining ones (eg β2 − β3 = 10 would not be
consistent with β1 − β2 = 0 and β1 − β3 = 0).

2.5.1 One-sample problem
\[
H_0: \mu = c, \quad \text{that is} \quad (1)(\mu) = (c), \quad \text{or} \quad K\beta = m.
\]

2.5.2 Two-sample problem

We may be interested in testing the hypotheses
\[
H_0: \mu_1 - \mu_2 = 0; \quad \mu_1 = 10; \quad \mu_2 = 10,
\]
that is
\[
\begin{pmatrix} 1 & -1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}
= \begin{pmatrix} 0 \\ 10 \\ 10 \end{pmatrix}.
\]
Since the third row of the matrix on the left is a linear combination of the first two rows (notice
that the second hypothesis minus the first yields the third) we may as well delete the third
hypothesis to arrive at
\[
H_0: \begin{pmatrix} 1 & -1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 10 \end{pmatrix}
\]
or Kβ = m.

2.5.3 Simple linear regression

Suppose we are interested in testing the null hypotheses
\[
H_0: \beta = 5; \quad \alpha + 3\beta = 10; \quad \alpha + 5\beta = 20.
\]
This can be written
\[
\begin{pmatrix} 0 & 1 \\ 1 & 3 \\ 1 & 5 \end{pmatrix}\begin{pmatrix} \alpha \\ \beta \end{pmatrix}
= \begin{pmatrix} 5 \\ 10 \\ 20 \end{pmatrix}.
\]
The last row is a linear combination of the first two rows (twice the first hypothesis added to the
second yields the third) and can be deleted. Thus we have
\[
H_0: \begin{pmatrix} 0 & 1 \\ 1 & 3 \end{pmatrix}\begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \begin{pmatrix} 5 \\ 10 \end{pmatrix}
\]
or Kβ = m.

Thus suppose we want to test H₀: Kβ = m. Consider the h×1 random vector
\[
K\hat\beta - m.
\]
Its expected value is
\[
K\,E(\hat\beta) - m = K\beta - m
\]
(which is equal to zero if H₀ is true) and, by (D3), its covariance matrix is
\[
\sigma^2 K(X'X)^{-1}K'.
\]
Thus, if H₀ is true,
\[
K\hat\beta - m \sim n(0;\; \sigma^2 K(X'X)^{-1}K').
\]
This topic will be pursued further in the next section.
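As a small sketch, the simple linear regression hypothesis of section 2.5.3 can be set up numerically and the vector Kβ̂ − m and the matrix K(X′X)⁻¹K′ computed directly; the data values are invented for the illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 6.0])
y = np.array([14.0, 21.0, 26.0, 30.0, 41.0])
X = np.column_stack([np.ones_like(x), x])

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# H0: beta = 5 and alpha + 3*beta = 10, written as K beta = m
K = np.array([[0.0, 1.0],
              [1.0, 3.0]])
m = np.array([5.0, 10.0])

d = K @ beta_hat - m                           # K beta_hat - m
C = K @ np.linalg.inv(X.T @ X) @ K.T           # K(X'X)^{-1}K'; its covariance is sigma^2 * C
print(d)
print(C)
```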

2.6 Quadratic forms

We will now study a number of quadratic forms which are relevant in the testing of linear
hypotheses:
\[
\begin{aligned}
q   &= y'y = y'I_ny \\
q_1 &= y'(1(1'1)^{-1}1')y \\
q_2 &= y'(I_n - 1(1'1)^{-1}1')y = q - q_1 \\
q_3 &= \hat\beta'(X'X)\hat\beta \\
q_4 &= q - q_3 \\
q_5 &= (K\hat\beta - m)'(K(X'X)^{-1}K')^{-1}(K\hat\beta - m)
\end{aligned}
\]
In each case we will show that the quadratic form can be written in the form qi = y′Aiy where
AiAi = Ai, that is, Ai is idempotent.

Thus qi/σ² is, in general, a noncentral chi-square variable with ri = r(Ai) degrees of freedom and
noncentrality parameter (E(y))′Ai(E(y))/σ² = β′X′AiXβ/σ². We will also show that in a number
of cases AiAj = O, which proves that qi and qj are independent (cf. (D11) and (D12)).

Example 2.3.

Assume y′ = (y1 y2 y3); compute q, q1 and q2 as defined above.

Solution 2.3.
\[
q = y'y = \begin{pmatrix} y_1 & y_2 & y_3 \end{pmatrix}\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix}
= y_1^2 + y_2^2 + y_3^2.
\]
For q1 we have q1 = y′(1(1′1)⁻¹1′)y where, using (M12),
\[
1(1'1)^{-1}1' = \begin{pmatrix} \tfrac13 & \tfrac13 & \tfrac13 \\ \tfrac13 & \tfrac13 & \tfrac13 \\ \tfrac13 & \tfrac13 & \tfrac13 \end{pmatrix}.
\]
Now
\[
q_1 = \tfrac13\begin{pmatrix} y_1 & y_2 & y_3 \end{pmatrix}
\begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix}
= \tfrac13(y_1^2 + y_2^2 + y_3^2 + 2y_1y_2 + 2y_1y_3 + 2y_2y_3)
= \tfrac13(y_1 + y_2 + y_3)^2
= \tfrac13\Big(\sum_{i=1}^{3} y_i\Big)^2
= \tfrac13(3\bar y)^2 = 3\bar y^2.
\]
Note: q1 = y′(1(1′1)⁻¹1′)y = nȳ². In this case n = 3. Finally,
\[
q_2 = q - q_1 = \sum_{i=1}^{3} y_i^2 - n\bar y^2 \tag{2.5}
\]
\[
= \sum_{i=1}^{3}(y_i - \bar y)^2 = (y_1 - \bar y)^2 + (y_2 - \bar y)^2 + (y_3 - \bar y)^2. \tag{2.6}
\]
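A numerical sketch of q, q1 and q2 for an arbitrary y (invented values), confirming that q1 = nȳ², q2 = Σ(yi − ȳ)² and q = q1 + q2:

```python
import numpy as np

y   = np.array([4.0, 7.0, 10.0])
n   = len(y)
one = np.ones((n, 1))

A1 = one @ one.T / n                           # 1(1'1)^{-1}1'
A2 = np.eye(n) - A1                            # I_n - 1(1'1)^{-1}1'

q  = y @ y
q1 = y @ A1 @ y
q2 = y @ A2 @ y

print(q1, n * y.mean()**2)                     # q1 = n * ybar^2
print(q2, np.sum((y - y.mean())**2))           # q2 = sum of squared deviations
print(np.isclose(q, q1 + q2))                  # q = q1 + q2
```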
The quadratic form q
\[
q = y'y = \sum_{i=1}^{n} y_i^2
\]
is usually termed the total sum of squares. Since q = y′Iny, InIn = In and r(In) = n, it follows
that q/σ² is a noncentral chi-square variate with n degrees of freedom and noncentrality parameter
(n.c.p.)
\[
\lambda = (X\beta)'I_n(X\beta)/\sigma^2 = \beta'X'X\beta/\sigma^2.
\]
Since X′X is positive definite, λ = 0 if and only if β = 0.

The quadratic form q1

We have
\[
q_1 = y'(1(1'1)^{-1}1')y
= (y_1 \cdots y_n)\begin{pmatrix} \tfrac1n & \cdots & \tfrac1n \\ \vdots & & \vdots \\ \tfrac1n & \cdots & \tfrac1n \end{pmatrix}
\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}
= n\bar y^2. \tag{2.7}
\]
This is an important quadratic form to remember. It is sometimes called the sum of squares due
to the mean. We may write q1 = y′A1y where A1 = 1(1′1)⁻¹1′. From (M13) it follows that A1 is
idempotent and has rank 1.

We conclude that q1/σ² is a noncentral chi-square variate with one degree of freedom and n.c.p.
\[
\lambda_1 = \beta'X'1(1'1)^{-1}1'X\beta/\sigma^2 = \tfrac1n\,\beta'X'11'X\beta/\sigma^2. \tag{2.8}
\]
It follows that λ1 = 0 if β = 0 or if X′1 = 0. The latter condition implies that the sum of the
elements in each row of X′ (that is, each column of X) is zero.

Example 2.4.

Assume
\[
X = \begin{pmatrix} -1 & 1 \\ -1 & -1 \\ 1 & 1 \\ 1 & -1 \end{pmatrix}.
\]
Compute λ1.

Solution 2.4.
\[
\lambda_1 = \frac{\beta'X'1(1'1)^{-1}1'X\beta}{\sigma^2}
= \frac{\tfrac1n\,\beta'X'11'X\beta}{\sigma^2} \quad \text{since } (1'1)^{-1} = \tfrac1n.
\]
It follows that λ1 = 0 if β = 0 or if X′1 = 0. Now X′1 = 0 if the sum of the elements in each row
of X′ is zero (that is, each column of X sums to zero). Thus, in this case X′1 = 0 since the
elements in each column of X add up to zero, so λ1 = 0.

The quadratic form q2

q2 = y 0 (In − 1(10 1)−1 10 )y

= y 0 A2 y

where A2 = In − 1(10 1)−1 10 = In − A1 .

Now A2 A2 = (In − A1 )(In − A1 )

= In − A1 − A1 + A1 A1

= In − 2A1 + A1 since A1 is idempotent



= In − A1 = A2 .                                                          (2.9)

Thus A2 is idempotent and in (M13) it was shown that r(A2 ) = n − 1. Therefore q2 /σ 2 is a


noncentral chi-square variate with n − 1 degrees of freedom and n.c.p.

λ2 = β 0 X 0 (In − 1(10 1)−1 10 )Xβ/σ 2 . (2.10)

We now study λ2 in order to find a situation in which λ2 = 0. Of course λ2 = 0 if β = 0 but we


will show that λ2 could be equal to zero even if β ≠ 0.

First, we assume that X is of the form


 
    [ 1  x12  · · ·  x1p ]
X = [ ·    ·          ·  ] ,
    [ 1  xn2  · · ·  xnp ]

that is, the first column of X consists of ones. This may be achieved in many of the well-known
linear model problems by a judicious choice of parameters (cf section 2.8).

Next, assume that β is of the form

β = (β1 0 · · · 0)0 .

Therefore the model is


          
[ y1 ]   [ 1  x12  · · ·  x1p ] [ β1 ]   [ ε1 ]   [ β1 ]   [ ε1 ]
[ ·  ] = [ ·                  ] [  0 ] + [ ·  ] = [ ·  ] + [ ·  ] ;
[ yn ]   [ 1  xn2  · · ·  xnp ] [  · ]   [ εn ]   [ β1 ]   [ εn ]
                                [  0 ]

that is, the n yi s have the same expected value β1 .

Therefore Xβ = β1 1 and β'X' = β1 1'.

∴ λ2 = β1 1'(In − 1(1'1)^{-1}1')1 β1 /σ^2
     = β1 (1'1 − 1'1(1'1)^{-1}1'1) β1 /σ^2
     = 0.

In this case, therefore, q2 /σ 2 is a (central) chi-square variate with n − 1 degrees of freedom,


regardless of the value of β1 . In fact, q2 is a well-known quadratic form for

q2 = q − q1 = Σ yi^2 − nȳ^2 = Σ (yi − ȳ)^2 .                              (2.11)

This is another important quadratic form to remember: q2 is called the total sum of squares (ad-
justed for the mean).

Independence of q1 and q2
Since q1 = y 0 A1 y and q2 = y 0 (In − A1 )y, and

A1 (In − A1 ) = A1 − A1 A1 = A1 − A1 = O

it follows that q1 and q2 are stochastically independent. Thus we have proved a well-known result:
that nȳ^2 and Σ (yi − ȳ)^2 are stochastically independent.

The quadratic form q3


Since β̂ = (X'X)^{-1}X'y and β̂' = y'X(X'X)^{-1}, q3 may be written in a number of alternative ways:

q3 = β̂'(X'X)β̂
   = β̂'X'y
   = y'X(X'X)^{-1}X'y
   = y'A3 y                                                               (2.12)

where A3 = X(X'X)^{-1}X'.

From (M11) we know that A3 is idempotent with rank p. Thus q3 /σ 2 is a noncentral chi-square
variate with p degrees of freedom and noncentrality parameter

λ3 = β'X'A3 Xβ/σ^2 = β'X'Xβ/σ^2.                                          (2.13)

Since X 0 X is positive definite, λ3 = 0 if and only if β = 0. The quadratic form q3 is usually termed
the regression sum of squares or the sum of squares due to the model.

The quadratic form q4


We have

q4 = q − q3 = y'In y − y'A3 y = y'(In − A3)y = y'A4 y, say.               (2.14)

From (M11) we know that A4 is idempotent and has rank n - p. Thus q4 /σ 2 is a noncentral
chi-square variate with n - p degrees of freedom and n.c.p.

λ4 = β'X'A4 Xβ/σ^2
   = β'X'(In − X(X'X)^{-1}X')Xβ/σ^2
   = (β'X'Xβ − β'X'X(X'X)^{-1}X'Xβ)/σ^2
   = (β'X'Xβ − β'X'Xβ)/σ^2
   = 0.                                                                   (2.15)

Thus q4 /σ^2 has a central chi-square distribution with n − p degrees of freedom, whatever the value
of β, provided the model is correct.

The quadratic form q4 can be written in another form. We know that

y − ŷ = y − X β̂

is the vector of residuals. The sum of squares of the residuals is

(y − Xβ̂)'(y − Xβ̂)
   = y'y − y'Xβ̂ − β̂'X'y + β̂'X'Xβ̂
   = y'y − β̂'X'Xβ̂          since y'Xβ̂ = β̂'X'y = β̂'X'Xβ̂
   = q − q3 = q4 .                                                        (2.16)

Thus q4 is the sum of squares of the residuals. It is also termed the residual sum of squares or the
error sum of squares.

Since q4/σ^2 ∼ χ^2_{n−p}, it follows that

E(q4/σ^2) = n − p,   ∴ E(q4/(n − p)) = σ^2.

Therefore

σ̂ 2 = (y − X β̂)0 (y − X β̂)/(n − p) = q4 /(n − p)

is an unbiased estimator of σ 2 ; σ̂ 2 is termed the error variance and σ̂ the standard error.

Independence of q3 and q4

q3 = y 0 A3 y

q4 = y 0 (In − A3 )y

Therefore, from (M11) it follows that q3 and q4 are stochastically independent.



The quadratic form q5


We now consider the quadratic form

q5 = (K β̂ − m)0 (K(X 0 X)−1 K 0 )−1 (K β̂ − m). (2.17)

We have seen in section 2.5 that

K β̂ − m ∼ n(Kβ − m; σ 2 K(X 0 X)−1 K 0 )

and, using (D14) with Kβ̂ − m in the place of y, it follows that q5/σ^2 is a noncentral chi-square
variate with degrees of freedom

r[(K(X'X)^{-1}K')^{-1}] = r(K(X'X)^{-1}K') = h

(because K(X'X)^{-1}K' is an h × h non-singular covariance matrix) and n.c.p.

λ5 = (Kβ − m)0 (K(X 0 X)−1 K 0 )−1 (Kβ − m)/σ 2 . (2.18)

Note that λ5 = 0 if and only if Kβ = m.

The quadratic form q5 is called the sum of squares due to the hypothesis.

The quadratic form q3 is a special case of q5 with K = Ip and m = 0.

Independence of q4 and q5
We want to prove that q4 and q5 are independent. In equations 2.14 and 2.17 we expressed q4 as
a quadratic form in y and q5 as a quadratic form in K β̂ − m. In order to prove independence, we
will have to express them both in terms of the same normal vector.

Let v = y − XK 0 (KK 0 )−1 m.



Then, by substituting β̂ = (X 0 X)−1 X 0 y and using the fact that X 0 (In − X(X 0 X)−1 X 0 )X = O, it
may be shown that

q4 = v 0 (In − X(X 0 X)−1 X 0 )v

and

q5 = v 0 (B(B 0 B)−1 B 0 )v

where

B = X(X 0 X)−1 K 0 ,

and also that

(In − X(X 0 X)−1 X 0 )(B(B 0 B)−1 B 0 ) = O

so that q4 and q5 are independent.

Activity 2.1.
Prove these assertions as an exercise.

Figure 2.1: Summary of quadratic forms



2.7 Statistical inference on β

Testing linear hypotheses


A general rule for testing a hypothesis in the linear model is as follows:

(a) Find a quadratic form qi such that qi /σ^2 has a central chi-square distribution if H0 is true and
    a noncentral chi-square distribution if H0 is not true. Let qi /σ^2 have ri degrees of freedom.

(b) Find a second quadratic form qj which is independent of qi and such that qj /σ^2 has a central
    chi-square distribution whether H0 is true or not. Let qj /σ^2 have rj degrees of freedom.

(c) Compute

    f = (qi /ri) / (qj /rj)

    which has a central Fri;rj distribution if H0 is true and a noncentral F-distribution if H0 is
    not true. If f > Fα;ri;rj reject H0 at the 100α% level.
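
As an illustration only (not from the study guide), this rule can be written as a small function, assuming numpy and scipy are available; qi, qj and the degrees of freedom are supplied by whichever quadratic forms apply to the hypothesis at hand.

import numpy as np
from scipy import stats

def f_test(qi, ri, qj, rj, alpha=0.05):
    """General F-test: qi is the hypothesis quadratic form, qj the error quadratic form."""
    f = (qi / ri) / (qj / rj)
    crit = stats.f.ppf(1 - alpha, ri, rj)     # the critical value F_{alpha; ri; rj}
    return f, crit, f > crit                  # reject H0 when f exceeds the critical value

# Hypothetical call, using the numbers that arise in Example 2.5 below:
print(f_test(qi=10.9714, ri=1, qj=1.6, rj=1))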

We discuss four types of hypothesis.

(a) H0 : β = 0 (that is all the parameters are zero)

From table 2.1 we see that q3 and q4 are the appropriate quadratic forms. Thus
    f = [β̂'X'Xβ̂ / p] / [(y − Xβ̂)'(y − Xβ̂) / (n − p)]
      = (SSR / p) / (SSE / (n − p)) = SSR / (pσ̂^2)

has an Fp;n−p distribution if H0 is true.

(b) H0 : Kβ = m

The test is based on q5 and q4 :


    f = (SSH / h) / (SSE / (n − p)) = SSH / (hσ̂^2)
has an Fh;n−p distribution if H0 is true.

(c) H0 : c0 β = m

This is a special case of (b) with h = 1 and K = c0 .

(d) H0 : β1 = β2 = · · · = βh = 0; h < p (that is some of the parameters are zero)

This is a special case of (b) with


 
         [ 1  0 · · · 0   0 · · · 0 ]
    K  = [ 0  1 · · · 0   0 · · · 0 ]  = (Ih  O)          (K is h × p)
         [ ·                        ]
         [ 0  0 · · · 1   0 · · · 0 ]

    and m = 0.

In this case q5 can be computed by first partitioning (X 0 X)−1 , β̂ and β as follows:

    (X'X)^{-1} = [ T11   T12 ]
                 [ T12'  T22 ]

    where T11 is h × h and T22 is (p − h) × (p − h), and

    β̂ = [ β̂1 ] ,    β = [ β1 ]
        [ β̂2 ]          [ β2 ]

    where β̂1 and β1 are h × 1 and β̂2 and β2 are (p − h) × 1.

    Now β̂1 ∼ n(β1 ; σ^2 T11) and the quadratic form q5 can be written

    q5 = β̂1' T11^{-1} β̂1

    where q5/σ^2 ∼ χ^2_h if β1 = 0.

Since this is a special case of q5 , we need not prove again that it is independent of q4 and we can
define the F-statistic directly.
f = [β̂1' T11^{-1} β̂1 / h] / [(y − Xβ̂)'(y − Xβ̂) / (n − p)]

has an Fh;n−p distribution provided β1 = 0, that is

β1 = . . . = βh = 0.

Obviously we can in this way test the null hypothesis that any subset of the βi s is zero, since the
ordering of the βi s is arbitrary.

In particular we may use the statistic


f = [β̂i^2 / aii] / [SSE / (n − p)] = β̂i^2 / (σ̂^2 aii)

to test H0 : βi = 0, where aii is the (i,i)-th element of (X'X)^{-1}. The statistic has an F1;n−p
distribution if βi = 0. Equivalently we may use

t = √f = β̂i / (σ̂ √aii)

which has a tn−p distribution if βi = 0 (and a noncentral t-distribution if βi ≠ 0).

Note: You will recall that the square of a td -variate has an F1;d distribution.

Example 2.5.
Let y = Xβ + ε, where ε ∼ n(0; σ^2 I4) and

     [  1  1  1   1 ]
X' = [  1  1  0   0 ] ,    y' = [ 4  4  0  −4 ].
     [ −2  2  4  −4 ]

(a) Show, with formulae and calculations, that β̂' = [ −2  6  2/5 ] and σ̂^2 = 1.6.

(b) Test H0 : β1 − β3 = 0 (α = 0.05).



Solution 2.5.

The model is y = Xβ + ε with n = 4 and p = 3.

          [  1  1  1   1 ] [ 1  1  −2 ]   [ 4  2   0 ]
(a) X'X = [  1  1  0   0 ] [ 1  1   2 ] = [ 2  2   0 ]
          [ −2  2  4  −4 ] [ 1  0   4 ]   [ 0  0  40 ]
                           [ 1  0  −4 ]

    Calculating the inverse matrix:

       4  2   0 | 1  0  0        ½(R1 − R2):   1  0  0 |  1/2  −1/2    0
       2  2   0 | 0  1  0        ½R2 − R1 :    0  1  0 | −1/2    1     0
       0  0  40 | 0  0  1        (1/40)R3 :    0  0  1 |   0     0    1/40

                 [  1/2  −1/2    0   ]
    (X'X)^{-1} = [ −1/2    1     0   ]
                 [   0     0    1/40 ]
  
          [  1  1  1   1 ] [  4 ]   [  4 ]
    X'y = [  1  1  0   0 ] [  4 ] = [  8 ]
          [ −2  2  4  −4 ] [  0 ]   [ 16 ]
                           [ −4 ]

                           [  1/2  −1/2    0   ] [  4 ]   [ −2  ]
    β̂ = (X'X)^{-1} X'y =   [ −1/2    1     0   ] [  8 ] = [  6  ]
                           [   0     0    1/40 ] [ 16 ]   [ 2/5 ]

   
         [ 1  1  −2 ] [ −2  ]   [  3.2 ]
    Xβ̂ = [ 1  1   2 ] [  6  ] = [  4.8 ]
         [ 1  0   4 ] [ 2/5 ]   [ −0.4 ]
         [ 1  0  −4 ]           [ −3.6 ]

    e'e = (y − Xβ̂)'(y − Xβ̂)
        = (0.8  −0.8  0.4  −0.4)(0.8  −0.8  0.4  −0.4)'
        = 1.6

    ∴ σ̂^2 = e'e/(n − p) = 1.6/(4 − 3) = 1.6

(b) H0 : β1 − β3 = 0

    K = [ 1  0  −1 ]  and  m = [ 0 ]

    Test statistic: f = (SSH/h) / (SSE/(n − p)), which has an Fh;n−p distribution under H0.

    Kβ̂ − m = (1  0  −1)(−2  6  2/5)' − 0 = −2.4

                                [  1/2  −1/2    0   ] [  1 ]
    K(X'X)^{-1}K' = (1  0  −1)  [ −1/2    1     0   ] [  0 ]
                                [   0     0    1/40 ] [ −1 ]
                  = (1/2  −1/2  −1/40)(1  0  −1)'
                  = 1/2 + 1/40 = 21/40 = 0.525

    SSH = (Kβ̂ − m)'[K(X'X)^{-1}K']^{-1}(Kβ̂ − m)
        = (−2.4)(40/21)(−2.4)
        = 10.9714

    f = (SSH/h) / (SSE/(n − p)) = 10.9714/(1 × 1.6) ≈ 6.8571

Fα;h;n−p = F0.05;1;1 = 161. Reject H0 if f > 161.

Since 6.8571 < 161, we do not reject H0 at the 5% level of significance and
conclude that β1 − β3 = 0.

    OR

    Using the hypothesis in the form H0 : c'β = m (in this case K = c' since h = 1):

    SSH = (c'β̂ − m)'[c'(X'X)^{-1}c]^{-1}(c'β̂ − m)
        = (−2.4)(40/21)(−2.4)
        = 10.9714

    f = (SSH/h) / (SSE/(n − p)) = 10.9714/(1 × 1.6) ≈ 6.8571  and  Fα;h;n−p = F0.05;1;1 = 161.
    Reject H0 if f > 161. Since 6.8571 < 161, we do not reject H0 at the 5% level of significance
    and conclude that β1 − β3 = 0.
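
Purely as a check on the arithmetic above (this is not part of the original solution), the computations of Example 2.5 can be reproduced as follows, assuming numpy and scipy are available.

import numpy as np
from scipy import stats

X = np.array([[1, 1, -2], [1, 1, 2], [1, 0, 4], [1, 0, -4]], dtype=float)
y = np.array([4., 4., 0., -4.])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                      # (-2, 6, 0.4)
e = y - X @ beta_hat
sigma2_hat = e @ e / (n - p)                      # 1.6

K = np.array([[1., 0., -1.]]); m = np.array([0.])
d = K @ beta_hat - m
SSH = d @ np.linalg.inv(K @ XtX_inv @ K.T) @ d    # 10.9714
f = (SSH / K.shape[0]) / sigma2_hat               # 6.857
crit = stats.f.ppf(0.95, K.shape[0], n - p)       # about 161
print(beta_hat, sigma2_hat, f, crit, f > crit)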

Confidence regions
Consider the random variable K β̂ − Kβ where K is an h×p matrix of rank h. We know that

K β̂ − Kβ ∼ n(0, σ 2 K(X 0 X)−1 K 0 ).

Thus, using (D14), the quadratic form

q5∗ /σ 2 = (K β̂ − Kβ)0 (K(X 0 X)−1 K 0 )−1 (K β̂ − Kβ)/σ 2


has a central chi-square distribution with h degrees of freedom.

Therefore

f = (q5*/h) / (SSE/(n − p))

has an Fh;n−p distribution. Thus

1 − α = P(f ≤ Fα;h;n−p)
      = P(q5* ≤ (h·SSE/(n − p)) Fα;h;n−p)
      = P[(Kβ̂ − Kβ)'(K(X'X)^{-1}K')^{-1}(Kβ̂ − Kβ) ≤ hσ̂^2 Fα;h;n−p].

The inequality inside the square brackets defines a joint confidence region for the h elements of
Kβ. A few examples of this follow.

(a) Confidence region for β

With K = Ip and h = p we have

P [(β̂ − β)0 (X 0 X)(β̂ − β) ≤ pσ̂ 2 Fα;p;n−p ] = 1 − α

The set of all values of β such that the inequality in square brackets is satisfied, forms a joint
confidence region for β1 , · · · , βp .

(b) Joint confidence region for β1 , · · · , βh (h < p)

    As before, let

    β = [ β1 ] ,    β̂ = [ β̂1 ] ,    (X'X)^{-1} = [ T11   T12 ]
        [ β2 ]          [ β̂2 ]                   [ T12'  T22 ]

    where β1 and β̂1 are h × 1, β2 and β̂2 are (p − h) × 1 and T11 is h × h.

    Then the joint confidence region for β1 , · · · , βh is given by

    P[(β̂1 − β1)' T11^{-1} (β̂1 − β1) ≤ hσ̂^2 Fα;h;n−p] = 1 − α.

In particular if h = 2 we have, with probability 1 − α, that

                      [ a11  a12 ]^{-1} [ β̂1 − β1 ]
(β̂1 − β1   β̂2 − β2)  [ a12  a22 ]      [ β̂2 − β2 ]   ≤ 2σ̂^2 Fα;2;n−p

where aij is the (i,j)-th element of (X'X)^{-1}, that is

∴ [a22(β̂1 − β1)^2 − 2a12(β̂1 − β1)(β̂2 − β2) + a11(β̂2 − β2)^2] / (a11 a22 − a12^2) ≤ 2σ̂^2 Fα;2;n−p .
This represents the interior of an ellipse with centre (β̂1 , β̂2 ) in the (β1 , β2 ) plane. We are
100(1 - α)% sure that the true (β1 , β2 ) lies inside the ellipse.

Figure 2.2:

Likewise, a 100(1 - α)% interval for βi is given by


(β̂i − βi)(1/aii)(β̂i − βi) ≤ σ̂^2 Fα;1;n−p

∴ (β̂i − βi)^2 ≤ aii σ̂^2 Fα;1;n−p

∴ |β̂i − βi| ≤ √aii σ̂ t_{α/2;n−p}

∴ β̂i − √aii σ̂ t_{α/2;n−p} ≤ βi ≤ β̂i + √aii σ̂ t_{α/2;n−p} .

(c) Confidence interval for c0 β

    Setting K = c' and h = 1 we have, with probability 1 − α, that

    (c'β̂ − c'β)(c'(X'X)^{-1}c)^{-1}(c'β̂ − c'β) ≤ σ̂^2 Fα;1;n−p

    and, since c'β̂ − c'β and c'(X'X)^{-1}c are scalars, this reduces to

    (c'β̂ − c'β)^2 ≤ σ̂^2 (c'(X'X)^{-1}c) Fα;1;n−p

    ∴ |c'β̂ − c'β| ≤ σ̂ √(c'(X'X)^{-1}c) t_{α/2;n−p}

    ∴ c'β̂ − σ̂ √(c'(X'X)^{-1}c) t_{α/2;n−p} ≤ c'β ≤ c'β̂ + σ̂ √(c'(X'X)^{-1}c) t_{α/2;n−p} .
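
A minimal sketch (not in the original text) of this confidence interval, assuming numpy and scipy are available; X, y and c are placeholders for whatever design matrix, response vector and contrast vector are of interest.

import numpy as np
from scipy import stats

def ci_for_c_beta(X, y, c, alpha=0.05):
    """100(1 - alpha)% confidence interval for c'beta in the model y = X beta + eps."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    e = y - X @ beta_hat
    sigma_hat = np.sqrt(e @ e / (n - p))          # standard error sigma-hat
    half = sigma_hat * np.sqrt(c @ XtX_inv @ c) * stats.t.ppf(1 - alpha / 2, n - p)
    centre = c @ beta_hat
    return centre - half, centre + half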

Multiple comparisons
Suppose we have rejected a null hypothesis of the form Kβ = m which really consists of h different
hypotheses. The result is that we no longer believe that all h of the relationships in the null
hypothesis are true, but we still do not know which of the h relationships are true and which are
not. Thus we have to pursue the analysis a bit further. (Of course, if H0 is not rejected, there is no
evidence that any of the h relationships do not hold and a further analysis would be redundant.)

We consider the case h = 2 first, because it is easier to draw a picture in two dimensions. The
joint confidence region for β1 and β2 is an ellipse, as was seen before. Suppose the ellipse is as
shown in figure 2.3:

Figure 2.3:

Thus we are 100(1 - α)% sure that the true β1 and β2 must both be inside the ellipse. Where
can β1 lie then? Between the extremes of the ellipse as viewed from the β1 -axis, of course, and
similarly for β2 (cf figure 2.4).

Figure 2.4:

In the above figure we would reject H0 : β1 = 0, β2 = 0 since the origin is outside the confidence
ellipse. We would also reject the subhypothesis H0 : β2 = 0 since zero falls outside the confidence
interval for β2 . However, we would not reject H0 : β1 = 0 since zero falls inside the confidence
interval for β1 .

Thus we have seen that a joint confidence region for two or more parameters implies a confidence
interval for each of the parameters. These confidence intervals are somewhat wider than those we
would have obtained if we had constructed a confidence interval for one parameter only. However,
if we want to make a number of probability statements based on the same set of data, it is more
reassuring if we could attach one probability to all our statements jointly. (For example, if we
make 20 separate probability statements with confidence 0.95 each, the probability that all our
statements are true may be as low as (0.95)^20 ≈ 0.36.)

From the ellipse in figure 2.3 we may in principle obtain a confidence interval for any linear
combination c1 β1 + c2 β2 where c1 and c2 are constants. We must draw a line c1 β1 + c2 β2 = 0 on the
graph, and project the extremes of the ellipse onto the axis orthogonal to that line (cf figure 2.5).

Figure 2.5:

The problem of selecting a scale on that axis must of course be solved, but usually the whole
procedure of finding a confidence interval is done algebraically. The graphs were used merely to
illustrate the procedure.

The method which we will discuss is one of several methods available, and is known as Scheffé’s
method. Assume, as before, that K is an h × p matrix of rank h. Thus

q5∗ = (K β̂ − Kβ)0 (K(X 0 X)−1 K 0 )−1 (K β̂ − Kβ)

is a quadratic form such that q5*/σ^2 is a chi-square variate with h degrees of freedom. Now let t
be an h × 1 vector of known constants. Then t'Kβ is a scalar and

t'Kβ̂ − t'Kβ ∼ n(0; σ^2 t'K(X'X)^{-1}K't).

Therefore

qt* = (t'Kβ̂ − t'Kβ)'(t'K(X'X)^{-1}K't)^{-1}(t'Kβ̂ − t'Kβ)                  (2.19)

is a quadratic form such that qt*/σ^2 has a chi-square distribution with one degree of freedom. The
method of Scheffé is based on the following inequality:

Theorem 2.7.6

qt∗ ≤ q5∗ for all t

(This result is related to the Cauchy-Schwarz inequality.)

We have seen before that a 100(1 - α)% confidence region for Kβ is defined by

P(q5∗ ≤ hσ̂ 2 F α;h;n−p ) = 1 − α.

Since qt∗ ≤ q5∗ we have

P (qt∗ ≤ hσ̂ 2 F α;h;n−p ) ≥ 1 − α. (2.20)

This defines a confidence region for t0 Kβ with confidence at least 100(1 - α)%.

In particular, if K = Ip and h = p, the joint confidence region for β1 , · · · , βp is

(β̂ − β)'(X'X)(β̂ − β) ≤ pσ̂^2 Fα;p;n−p

and the corresponding confidence interval for βi is

(β̂i − βi )2 ≤ aii p σ̂ 2 Fα;p;n−p

where aii is the (i,i)-th element of (X 0 X)−1 .

Similarly, if H0 : β = 0 is rejected, we reject H0 : βi = 0 if

β̂i2 > aii p σ̂ 2 Fα;p;n−p

while H0 : t0 β = 0 is rejected if

(t0 β̂)2 > (t0 (X 0 X)−1 t)p σ̂ 2 Fα;p;n−p .
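
The following sketch (an illustration only, not from the text; it assumes numpy and scipy) applies Scheffé's bound with K = Ip, so that any contrast t'β is covered simultaneously with confidence at least 100(1 − α)%.

import numpy as np
from scipy import stats

def scheffe_interval(X, y, t, alpha=0.05):
    """Scheffe simultaneous interval for t'beta, taking K = I_p and h = p."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    e = y - X @ beta_hat
    sigma2_hat = e @ e / (n - p)
    half = np.sqrt((t @ XtX_inv @ t) * p * sigma2_hat * stats.f.ppf(1 - alpha, p, n - p))
    centre = t @ beta_hat
    return centre - half, centre + half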



Example 2.6.

Define

        [  1   0   1 ]        [ 2 ]
        [  1   0  −1 ]        [ 7 ]
    X = [  0   2   0 ] ,  y = [ 4 ] ,  β = (α  γ  ϕ)' .
        [  1  −1   0 ]        [ 3 ]
        [ −1  −1   0 ]        [ 8 ]

(a) Fit the model y = Xβ + ε where ε ∼ n(0, σ^2 I).

(b) Test the hypothesis H0 : α = γ − ϕ − 2,  γ = −α − ϕ − 3,  γ + ϕ = α − 4 at the 5% level of
    significance.

(c) Determine a 95% confidence interval for γ.

Solution 2.6.

(a) With X, y and β as given, n = 5 and p = 3.

          [ 1   1  0   1  −1 ] [  1   0   1 ]   [ 4  0  0 ]
    X'X = [ 0   0  2  −1  −1 ] [  1   0  −1 ] = [ 0  6  0 ]
          [ 1  −1  0   0   0 ] [  0   2   0 ]   [ 0  0  2 ]
                               [  1  −1   0 ]
                               [ −1  −1   0 ]

                 [ 1/4   0    0  ]
    (X'X)^{-1} = [  0   1/6   0  ]       (the inverse of a diagonal matrix)
                 [  0    0   1/2 ]

          [ 1   1  0   1  −1 ] [ 2 ]   [  4 ]
    X'y = [ 0   0  2  −1  −1 ] [ 7 ] = [ −3 ]
          [ 1  −1  0   0   0 ] [ 4 ]   [ −5 ]
                               [ 3 ]
                               [ 8 ]

                          [ 1/4   0    0  ] [  4 ]   [   1  ]
    β̂ = (X'X)^{-1} X'y =  [  0   1/6   0  ] [ −3 ] = [ −1/2 ]
                          [  0    0   1/2 ] [ −5 ]   [ −5/2 ]

(b) H0 : α = γ − ϕ − 2, γ = −α − ϕ − 3, γ + ϕ = α − 4

H0 : α − γ + ϕ = −2, γ + α + ϕ = −3, γ + ϕ − α = −4

   
        [  1  −1  1 ]             [ −2 ]
    K = [  1   1  1 ]   and  m =  [ −3 ]
        [ −1   1  1 ]             [ −4 ]

    Test statistic: f = (SSH/h) / (SSE/(n − p)), which has an Fh;n−p distribution under H0.

    
               [  1  −1  1 ] [   1  ]   [ −2 ]   [ −1 ]   [ −2 ]   [ 1 ]
    Kβ̂ − m  =  [  1   1  1 ] [ −1/2 ] − [ −3 ] = [ −2 ] − [ −3 ] = [ 1 ]
               [ −1   1  1 ] [ −5/2 ]   [ −4 ]   [ −4 ]   [ −4 ]   [ 0 ]

                       [  1  −1  1 ] [ 1/4   0    0  ] [ 1   1  −1 ]
    K(X'X)^{-1}K'  =   [  1   1  1 ] [  0   1/6   0  ] [ −1  1   1 ]
                       [ −1   1  1 ] [  0    0   1/2 ] [ 1   1   1 ]

                       [ 11/12   7/12   1/12 ]
                   =   [  7/12  11/12   5/12 ]
                       [  1/12   5/12  11/12 ]

    Determinant of K(X'X)^{-1}K' = (1331 + 35 + 35 − 11 − 275 − 539)/1728 = 576/1728 = 1/3.
    Let [K(X'X)^{-1}K']^{-1} = A^{-1}. Each element of A^{-1} is a cofactor of A divided by det A = 1/3;
    for example

    (A^{-1})11 = (−1)^2 det[ 11/12  5/12 ; 5/12  11/12 ] / (1/3) = (2/3)/(1/3) = 2.

    Similarly (A^{-1})22 = 5/2, (A^{-1})33 = 3/2, (A^{-1})12 = (A^{-1})21 = −3/2,
    (A^{-1})13 = (A^{-1})31 = 1/2 and (A^{-1})23 = (A^{-1})32 = −1, so that

                                [  2    −3/2   1/2 ]
    [K(X'X)^{-1}K']^{-1}   =    [ −3/2   5/2   −1  ] .
                                [  1/2   −1    3/2 ]

    The residual sum of squares is

    q4 = (y − Xβ̂)'(y − Xβ̂)
       = (7/2  7/2  5  3/2  17/2)(7/2  7/2  5  3/2  17/2)'
       = 49/4 + 49/4 + 25 + 9/4 + 289/4
       = 124.

    Then,

    SSH = (Kβ̂ − m)'[K(X'X)^{-1}K']^{-1}(Kβ̂ − m)

                      [  2    −3/2   1/2 ] [ 1 ]
        = (1  1  0)   [ −3/2   5/2   −1  ] [ 1 ]
                      [  1/2   −1    3/2 ] [ 0 ]

        = (1/2  1  −1/2)(1  1  0)'
        = 3/2

    f = (SSH/h) / (SSE/(n − p)) = (3/2 / 3) / (124/(5 − 3)) = 0.5/62 ≈ 0.0081

Fα;h;n−p = F0.05;3;2 = 19.2. Reject H0 if f > 19.2.

Since 0.0081 < 19.2, we do not reject H0 at the 5% level of significance.

(c) A confidence interval for γ is given by

    γ̂ ± √a22 σ̂ t_{α/2;n−p} .

    With σ̂^2 = SSE/(n − p) = 124/(5 − 3) = 62, γ̂ = −1/2, a22 = 1/6 and t_{α/2;n−p} = t0.025;2 = 4.303,
    the confidence interval is

    (γ̂ − √a22 σ̂ t_{α/2;n−p} ; γ̂ + √a22 σ̂ t_{α/2;n−p})
    = (−1/2 − √(1/6) √62 (4.303) ; −1/2 + √(1/6) √62 (4.303))
    = (−1/2 − 13.8322 ; −1/2 + 13.8322)
    = (−14.3322 ; 13.3322).
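
Again purely as an arithmetic check (not part of the original solution), assuming numpy and scipy are available:

import numpy as np
from scipy import stats

X = np.array([[1, 0, 1], [1, 0, -1], [0, 2, 0], [1, -1, 0], [-1, -1, 0]], dtype=float)
y = np.array([2., 7., 4., 3., 8.])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                      # (1, -0.5, -2.5)
e = y - X @ beta_hat
SSE = e @ e                                       # 124
sigma2_hat = SSE / (n - p)                        # 62

K = np.array([[1., -1., 1.], [1., 1., 1.], [-1., 1., 1.]])
m = np.array([-2., -3., -4.])
d = K @ beta_hat - m
SSH = d @ np.linalg.inv(K @ XtX_inv @ K.T) @ d    # 1.5
f = (SSH / 3) / (SSE / (n - p))                   # approx. 0.0081
print(beta_hat, sigma2_hat, f, stats.f.ppf(0.95, 3, n - p))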

2.8 Models with a constant term

In many examples of linear models the first column of X is 1. The model is therefore

[ y1 ]   [ 1  x12  · · ·  x1p ] [ β1 ]   [ ε1 ]
[ ·  ] = [ ·                  ] [ ·  ] + [ ·  ] .
[ yn ]   [ 1  xn2  · · ·  xnp ] [ βp ]   [ εn ]

Thus we have, for i = 1, · · ·, n

yi = β1 + β2 xi2 + · · · + βp xip + εi .

Summation for i from 1 to n gives

Σ yi = nβ1 + β2 Σ xi2 + · · · + βp Σ xip + Σ εi ,

that is ȳ = β1 + β2 x̄2 + · · · + βp x̄p + ε̄.

If we subtract this from yi we obtain

yi − ȳ = β2 (xi2 − x̄2) + . . . + βp (xip − x̄p) + (εi − ε̄),

which gives a reduced model

[ y1 − ȳ ]   [ x12 − x̄2   · · ·   x1p − x̄p ] [ β2 ]   [ ε1 − ε̄ ]
[    ·   ] = [     ·                    ·  ] [  · ] + [    ·    ]
[ yn − ȳ ]   [ xn2 − x̄2   · · ·   xnp − x̄p ] [ βp ]   [ εn − ε̄ ]

or  y* = X* β(2) + ε*

where

y* = A2 y = (In − 1(1'1)^{-1}1')y;    X* = A2 X(2);    X = (1  X(2));    β' = (β1  β(2)')

and ε* ∼ n(0, σ^2 A2) since A2 I A2 = A2 A2 = A2 .

This model may now be used to estimate β2 , · · · , βp while β̂1 is found by setting

β̂1 = ȳ − β̂2 x̄2 − . . . − β̂p x̄p .

Even though ε* is not an n(0, σ^2 In) vector it can be shown that the solution of

β̂(2) = (X*'X*)^{-1} X*'y*

gives exactly the same estimators β̂2 , · · · , β̂p as the last (p − 1) elements of

β̂ = (X 0 X)−1 X 0 y,

while the first element of β̂ is exactly the same as β̂1 defined above. The advantages of reducing
the model in this way are:

(a) A smaller matrix has to be inverted (and we all know that it is a lot easier to invert a 3 × 3
matrix than a 4 × 4 matrix).

(b) Numerical accuracy is improved, especially when working on an electronic computer. The
    elements of X*'X* are usually much smaller than the elements of X'X and the result is that
    the round-off errors involved in the inversion of X*'X* are much smaller than in the inversion
    of X'X.

The models of the one-sample problem and simple linear regression are examples of models
with a column of ones. The model of the two-sample problem can be reparameterised as follows. Let
µ = (µ1 + µ2)/2 and α = (µ1 − µ2)/2. Then µ1 = µ + α and µ2 = µ − α. Thus we have

[ y1      ]   [ 1   1 ]         [ ε1      ]
[ ·       ]   [ ·   · ]         [ ·       ]
[ yn1     ]   [ 1   1 ] [ µ ]   [ εn1     ]
[ yn1+1   ] = [ 1  −1 ] [ α ] + [ εn1+1   ] .
[ ·       ]   [ ·   · ]         [ ·       ]
[ yn1+n2  ]   [ 1  −1 ]         [ εn1+n2  ]
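
A small sketch (an illustration, not from the text) showing that the reduced (centred) model reproduces β̂2 , · · · , β̂p , assuming numpy; the design used is just a made-up example with a constant term.

import numpy as np

rng = np.random.default_rng(0)
n = 8
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = rng.normal(size=n)

# Full model.
beta_full = np.linalg.solve(X.T @ X, X.T @ y)

# Reduced model: centre y and the non-constant columns, drop the column of ones.
X2 = X[:, 1:]
Xs = X2 - X2.mean(axis=0)             # X* = A2 X(2)
ys = y - y.mean()                     # y* = A2 y
beta2 = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
beta1 = y.mean() - X2.mean(axis=0) @ beta2

print(beta_full)                      # the same values as (beta1, beta2)
print(beta1, beta2)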

2.9 The design matrix X

We repeat the model:

y   =   X   β   +   ε
(n × 1)  (n × p)(p × 1)  (n × 1),

that is

[ y1 ]        [ x11 ]        [ x12 ]                 [ x1p ]   [ ε1 ]
[  · ] = β1   [  ·  ] + β2   [  ·  ] + . . . + βp    [  ·  ] + [  · ] .
[ yn ]        [ xn1 ]        [ xn2 ]                 [ xnp ]   [ εn ]
The first vector y consists of n observations of the random variable y. The constants β1 , β2 , · · · , βp
are unknown parameters. The first column of X, say x1 , consists of n values of the non-random
design variable x1 , et cetera. Thus we have p design variables x1 , · · · , xp in our model, each corre-
sponding to one of the p parameters β1 , · · · , βp .

The design variables may be classified broadly into three types:

(a) If there is a constant term in the model, the first column of X usually consists of ones only:
x1 = 1. (This is, strictly speaking, not a variable.)

(b) An important type of design variable is one that assumes the values 0 and 1 only. Such a
variable usually indicates the absence or presence of some characteristics or treatment, and
is usually called a dummy variable. Suppose, for example, one of the characteristics to be
included in a model is the marital status of a person. Suppose that there are four categories:
Single, Married, Divorced and Widowed. If one does not think about it carefully, one may
be tempted to define the variable x = 1 for single persons; = 2 for married persons; = 3 for
divorced persons and = 4 for widowed persons. However, in a model of the type

y = β1 + β2 x

this would mean that the effect on y of being divorced is three times the effect of being single,
et cetera. We should rather define the following dummy variables (we assume that x1 = 1,
that is β1 represents the constant term):

x2 = 1 for single persons

= 0 for others

x3 = 1 for married persons

= 0 for others

x4 = 1 for divorced persons

= 0 for others.

One might be tempted to define in addition

x5 = 1 for widowed persons

= 0 for others.

In such a case, however, one would have an identity

x1 = x2 + x3 + x4 + x5 ,

that is the first column of X would be equal to the sum of the next four columns and the
rank of X would be less than the number of columns. Thus if a categorical variable has k
categories, one would usually need to define (k − 1) dummy variables.

Dummy variables could also be defined to assume the values -1 and 1 rather than 0 and 1.
The interpretation of the corresponding coefficient βi is somewhat different then.

(c) The third type of design variable is a continuous variable. This type of variable is assumed
to be non-random in the models considered in this course. We could transform any such
variable and either replace a column by the transformed values or add additional columns.
    Suppose, for example, our model is

    yi = β0 + β1 ui + β2 ui^2 + β3 √ui + εi ,

    then the design matrix would be the following:

    [ 1  u1  u1^2  √u1 ]
    [ ·   ·    ·    ·  ]
    [ 1  un  un^2  √un ]

Three of the most important special cases of the linear model, which will be dealt with in the next
three chapters, are:

(a) Analysis of variance (ANOVA): Except for the first column of ones, all the other variables
are dummy variables. The problem is to test whether certain categorical variables (called
treatments) have an effect on the response variable y.

(b) Regression analysis: The first column of X is usually (but not always) 1 and all the other
variables are continuous variables. (If the first column of X is not 1 we speak of regression
through the origin.) The purpose of the analysis may be to test whether certain variables
have an effect on the response, or the purpose may be to use the estimated equation

ŷ = β̂1 x1 + . . . + β̂p xp

to predict the response y which we could expect to obtain if selected values of x1 , · · · , xp are
used in a further experiment.

(c) Analysis of covariance: The first column of X may or may not be 1, but some of the
remaining variables are dummy variables and some are continuous. The purpose of the
analysis could be partly predictive and partly to test whether the regression equation (of y
on the continuous variables) differs between the groups defined by the dummy variables.

Example 2.7.
In an experiment to compare the results of roadrunners in different circumstances, the following
sample was used:

Sex   Duration (months)   Programme   Initial results   Final results
M            3                A              60              y1
F            8                B              56              y2
M            7                C              70              y3
F            4                A              68              y4
F           11                C              77              y5
M            2                B              65              y6

(Programmes A, B and C represent three different training programmes.)

Construct a design matrix for this experiment and explain your answer.

Solution 2.7.

xi1 = 1 (constant term)
xi2 = 1 for male; 0 for female
xi3 = duration in months
xi4 = 1 for programme A; 0 otherwise
xi5 = 1 for programme B; 0 otherwise
xi6 = initial results

The design matrix is


 
    [ 1  1   3  1  0  60 ]
    [ 1  0   8  0  1  56 ]
X = [ 1  1   7  0  0  70 ]
    [ 1  0   4  1  0  68 ]
    [ 1  0  11  0  0  77 ]
    [ 1  1   2  0  1  65 ]
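
A sketch (not from the text) of how such a design matrix could be built programmatically, assuming numpy; the encoding follows the dummy-variable scheme of Solution 2.7.

import numpy as np

# (sex, duration in months, programme, initial results) for the six roadrunners
data = [("M", 3, "A", 60), ("F", 8, "B", 56), ("M", 7, "C", 70),
        ("F", 4, "A", 68), ("F", 11, "C", 77), ("M", 2, "B", 65)]

rows = []
for sex, months, prog, initial in data:
    rows.append([1,                        # x1: constant term
                 1 if sex == "M" else 0,   # x2: dummy variable for male
                 months,                   # x3: duration in months
                 1 if prog == "A" else 0,  # x4: dummy variable for programme A
                 1 if prog == "B" else 0,  # x5: dummy variable for programme B
                 initial])                 # x6: initial results
X = np.array(rows, dtype=float)
print(X)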

2.10 Examination of the residuals

We now turn our attention to the vector of residuals.

e = y − ŷ = y − X β̂,

which is an “estimator” of ε, the vector of random errors. If our model is correct these residuals
should contain no information about the response vector y. The error term ε is included in the
model to account for the unexplained variation, and if e contains information about y it means
that our model is inadequate. The residuals may be examined and analysed in many different
ways, but we will speak briefly about graphical methods only.

At the outset we have to remember that, since

e = y − X β̂

= y − X(X 0 X)−1 X 0 y

= (In − X(X 0 X)−1 X 0 )y

it follows that e has a multivariate normal distribution with expectation

(In − X(X 0 X)−1 X 0 )Xβ = 0

and covariance matrix

(In − X(X 0 X)−1 X 0 )σ 2 In (In − X(X 0 X)−1 X 0 )

= σ 2 (In − X(X 0 X)−1 X 0 ).

This covariance matrix is not in general a diagonal matrix and the diagonal elements are not
necessarily equal. Thus the residuals are, in general, not independent and do not have the same
variance. In fact, the covariance matrix of e is singular, so that e has a singular multivariate normal
distribution. (The dependence does become smaller as n − p increases, however.)

Nevertheless, for an informal examination of the residuals we treat them as if they were independent
and had the same variance. The following graphical analyses are usually made:

(a) The assumption of normality

If the sample size n is large, we can construct a histogram to decide whether the error terms
may reasonably be regarded as normal variates (cf figures 2.6, 2.7).

Figure 2.6: Normal residuals

Figure 2.7: Non-normal residuals

We can also use normal probability paper, plotting e(i) against 100i/(n + 1) (where e(i) is the
i-th smallest residual). If the points on the graph are reasonably spread around a straight
line, we may feel safe about the assumption of normality. We should be on the lookout
especially for systematic deviations from a straight line. An example of such graph paper is
presented in figure 2.8.

Figure 2.8: Normal probability paper



Note that the vertical axis is marked in percentages, that is we must plot 100i/(n + 1) on
the vertical axis.

(b) Plot e against xi , the i-th column of X

We plot the residuals against each of the design variables (except of course the column 1).
If e is unrelated to xi we may feel reasonably secure about our model as far as the design
variable xi is concerned.

The graph should have the appearance of figure 2.9, where the points are spread randomly
around 0. If the points follow a pattern, such as in figure 2.10, we may also have to include
additional terms such as xi^2, xi^3, xi^k, log xi , et cetera to try to improve the model. If the
points are spread around zero, but with a pattern in the variability such as in figure 2.11, it
indicates that the variances of all the εi s may not be the same. In fact, the variance of ε may
be a function of xi . For the data illustrated in figure 2.11 it may well be that the standard
deviation of ε is a multiple of xi ; thus

Var(yi) = σ^2 xi^2 ,   i = 1, · · · , n

and we may have to apply weighted least squares.

Figure 2.9:

Figure 2.10:

Figure 2.11:

(c) Plot e against ŷ

Any patterns in this graph may indicate that a transformation of the response variable may
be needed (cf section 2.11).

(d) Plot e against time

If the time sequence of the observations is available, it is important to plot the residuals
against time. In this way we may decide whether the residuals are autocorrelated or not,
that is whether each residual depends in a specific way on the previous one. If autocorrelation
is present, the least squares estimators are inappropriate and econometric methods must be
employed to estimate β.

(e) Plot e against any other information available

One operator may perform better than another, a factory may operate more efficiently on
certain days of the week, et cetera, and this may show up in an appropriate residual plot.

Before leaving the residuals, we discuss one more result which one should know about. Suppose
there is a constant term in the model, that is the first column of X is 1. We know that

e = (In − X(X 0 X)−1 X 0 )y

∴ X 0 e = X 0 (In − X(X 0 X)−1 X 0 )y

= (X 0 − X 0 X(X 0 X)−1 X 0 )y

= (X 0 − X 0 )y

= 0
    
     [ 1    · · ·  1   ] [ e1 ]   [ 0 ]
     [ x12  · · ·  xn2 ] [ ·  ]   [ · ]
∴    [ ·            ·  ] [ ·  ] = [ · ] .
     [ x1p  · · ·  xnp ] [ en ]   [ 0 ]

This shows that there are p linear relationships among the residuals. The first of these p equations
is

e1 + . . . + en = 0

which proves that the sum of the residuals is zero (provided there is a constant term in the model).
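
As an illustration only (assuming numpy and matplotlib are available, and not part of the study guide), the usual residual plots can be produced along the lines below; X and y stand for whatever fitted model is being examined.

import numpy as np
import matplotlib.pyplot as plt

def residual_plots(X, y):
    """Histogram of residuals, residuals against each design variable and against y-hat."""
    n, p = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    y_hat = X @ beta_hat
    e = y - y_hat

    fig, axes = plt.subplots(1, p + 1, figsize=(4 * (p + 1), 3))
    axes[0].hist(e, bins=10)                   # (a) informal normality check
    axes[0].set_title("residuals")
    for j in range(1, p):                      # (b) residuals against each design variable
        axes[j].scatter(X[:, j], e)
        axes[j].axhline(0.0)
        axes[j].set_title(f"e vs x{j + 1}")
    axes[p].scatter(y_hat, e)                  # (c) residuals against fitted values
    axes[p].axhline(0.0)
    axes[p].set_title("e vs y-hat")
    plt.tight_layout()
    plt.show()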

2.11 Transformation of variables

The following are some of the basic assumptions of the linear model:

(i) Normality: We assume y is a vector of normally distributed random variables, particularly


if the aim is to test hypotheses or compute confidence intervals.

(ii) Homogeneity of variances: We assume that all elements of y have the same variance (or,
at least, that the variances are known up to a common constant factor).

(iii) Additivity: We assume that the effects of the design variables add up. For example, if

y = β1 x1 + β2 x2 + . . . + βp xp + 

then, if x1 is increased by 1 unit and x2 is also increased by one unit, the result will be that
y is increased by β1 + β2 units.

Certain types of data are known not to satisfy all of these assumptions, and sometimes it is possible
to transform the data (for example replacing yi by log(yi )) in order to ensure that the assumptions
are satisfied, at least approximately. Consider the following illustration.

• Illustration 2.11.1

Suppose two types of insect repellent are tested on tomatoes and potatoes, and each treatment
is applied to five tomato and five potato plants. The response variable is the number of moths
visiting the plants during the hour after spraying. Suppose the results are as follows:

Insect repellent Plant Results (moths) Mean Variance


Killitt Tomatoes 5; 7; 2; 4; 2 4 4.5
Killitt Potatoes 6; 3; 8; 4; 9 6 6.5
Bang Tomatoes 12; 5; 6; 7; 10 8 8.5
Bang Potatoes 8; 14; 8; 7; 13 10 10.5

It seems that the variance is a function of the mean. The variance is plotted against the mean in
figure 2.12 and a linear relationship seems likely.

Figure 2.12:

The appropriate transformation in this situation is to replace yi by √(yi + 3/8) (see the following
discussion) and, if we do this, we obtain the following:

Insect repellent Plant Transformed response Mean Variance


Killitt Tomatoes 2.32 2.72 1.54 2.09 1.54 2.04 0.261
Killitt Potatoes 2.53 1.84 2.89 2.09 3.06 2.48 0.267
Bang Tomatoes 3.52 2.32 2.52 2.72 3.22 2.86 0.248
Bang Potatoes 2.89 3.79 2.89 2.72 3.66 3.19 0.245

If we now plot the variance against the mean (cf figure 2.13) we find that there is not a marked
relationship. We say that we have stabilised the variance.

Note that this kind of graphical analysis is possible only if a number of sets of repetitions at iden-
tical conditions are available (that is if X contains a number of sets of identical rows).

Figure 2.13:

We will now discuss a number of standard transformations, their use and their effect.

(a) The arcsine transformation

Suppose each yi was obtained as follows: a number of trials, say ni trials, each resulted in
either a success or a failure; let ui be the number of successes, that is ui is a binomial variable
with parameters ni and probability of success equal to πi , say.

Let yi = ui/ni = (number of successes)/(number of trials) = proportion of successes.

Then we know that

E(ui ) = ni πi ; Var(ui ) = ni πi (1 − πi )

∴ E(yi ) = πi ; Var(yi ) = πi (1 − πi )/ni .

Thus we see that the variance of yi is a function of the expected value of yi (and of ni ). The
famous statistician, Sir Ronald Fisher, discovered that the random variable
zi = 2 arcsin √yi   (= 2 sin^{-1} √yi)

has variance approximately equal to 1/ni (which does not depend on πi). Remember that
the angles are in radians in this definition. It has therefore become standard practice to
apply the arcsin transformation to proportions before analysing the linear model. Note that

Var(zi ) is still a function of ni , the number of trials involved. If the number of trials is not
the same throughout the experiment, one will have to apply weighted least squares. Special
problems occur when y = 0 or 1, and some tables exist which cater for this situation. If you
are unfamiliar with the arcsine (sin−1 ) function consult the textbook:

Spiegel, Murray R. Theory and problems of Advanced Calculus of SCHAUM’S OUTLINE


SERIES

or any other trigonometry textbook.

With regard to STA3701 it is sufficient to be able to use the sin−1 function which is found
on most pocket calculators.
Note that: 1 radian = 180/π degrees and 1 degree = π/180 radians,
but it will not be used in this module.

(b) The logarithmic transformation

Data which consist of counts (like the number of ticks on a sheep) or index numbers often
have a tendency for the standard deviation to be proportional to the mean. In such cases
the recommended variance stabilising transformation is z = log10 y or z = loge y. Since
special problems occur when the counts are small numbers, the transformation is changed to
z = log(1 + y) when the counts are low.

The effect of this transformation is also to change a moderately skew distribution into a fairly
symmetric one which resembles the normal distribution more closely.

This transformation is also used when changes in the design variables cause proportional changes
in the response. In such cases a likely model is

y = αx^β,   that is   log y = log α + β log x.

The logarithmic transformation changes a multiplicative model into an additive one.



(c) Square root transformation

Suppose y is a Poisson variable, which is usually associated with the number of arrivals (for
example cars, telephone calls, customers, et cetera) in a fixed period of time.

We know that

E(y) = Var(y).

Thus the mean is equal to the variance. In such cases where the variance is proportional to
the mean, the appropriate transformation is

z = √y

or, if the counts are low,


z = √(y + 3/8).

This transformation has the effect of changing quite skew distributions into fairly symmetric
ones; a small numerical illustration follows this list of transformations.

(d) Reciprocal transformation

If the variance of y increases even more rapidly than (E(y))^2, namely proportionally
to (E(y))^4, the transformation

z = 1/y

is often recommended for stabilising the variance. This transformation has the effect of
changing a very skew distribution into a fairly symmetric one.

A more complete account of this subject is found in chapter 8 of Quenouille.
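
The sketch below (an illustration only, not from the text; it assumes numpy) collects the transformations above as small functions and applies the square-root version to the moth counts of Illustration 2.11.1.

import numpy as np

def arcsine_transform(successes, trials):
    return 2 * np.arcsin(np.sqrt(successes / trials))   # variance approximately 1/n_i

def log_transform(y):
    return np.log10(1 + np.asarray(y))                  # counts or index numbers, low counts

def sqrt_transform(y):
    return np.sqrt(np.asarray(y) + 3 / 8)               # Poisson-like counts

def reciprocal_transform(y):
    return 1 / np.asarray(y)

killitt_tomatoes = np.array([5, 7, 2, 4, 2])
z = sqrt_transform(killitt_tomatoes)
print(np.round(z, 2), round(z.mean(), 2), round(z.var(ddof=1), 3))  # mean approx. 2.04, variance approx. 0.26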

2.12 Illustration
 
1 1 1 1 1 1 1 1 1
 
 
0
 1 1 1 0 0 0 0 0 0 
Let X = 
 

 0 0 0 1 1 1 0 0 0 
 
−1 0 1 −1 0 1 −1 0 1

y 0 = (19 39 50 25 24 32 23 28 39).

Such data could for example be the result of an experiment to test the effect of fertilizers on the
yield of a hectare of maize. Column 1 of X (that is row 1 of X0 ) provides for a constant term (we
do not expect zero yield if no fertilizer is used). Column 2 of X denotes District A and column
three District B (thus there are three districts involved in the experiment). Column 4 denotes the
fertilizer dosage. Suppose three dosages are used, namely 0 tons/ha, 10 tons/ha and 20 tons/ha.
Then define x4 = (dosage - 10)/10. Furthermore, y is the yield in bags/ha.

We compute

        [  1  1  1   1  1  1   1  1  1 ] [ 1  1  0  −1 ]   [ 9  3  3  0 ]
X'X  =  [  1  1  1   0  0  0   0  0  0 ] [ 1  1  0   0 ]   [ 3  3  0  0 ]
        [  0  0  0   1  1  1   0  0  0 ] [ 1  1  0   1 ] = [ 3  0  3  0 ]
        [ −1  0  1  −1  0  1  −1  0  1 ] [ 1  0  1  −1 ]   [ 0  0  0  6 ]
                                         [ 1  0  1   0 ]
                                         [ 1  0  1   1 ]
                                         [ 1  0  0  −1 ]
                                         [ 1  0  0   0 ]
                                         [ 1  0  0   1 ]

Calculating the inverse matrix:

   9  3  3  0 | 1  0  0  0        ⅓(R1 − R2 − R3):  1  0  0  0 |  1/3  −1/3  −1/3   0
   3  3  0  0 | 0  1  0  0        ⅓R2 − new R1  :   0  1  0  0 | −1/3   2/3   1/3   0
   3  0  3  0 | 0  0  1  0        ⅓R3 − new R1  :   0  0  1  0 | −1/3   1/3   2/3   0
   0  0  0  6 | 0  0  0  1        ⅙R4           :   0  0  0  1 |   0     0     0   1/6

                   [  1/3  −1/3  −1/3   0  ]
Thus, (X'X)^{-1} = [ −1/3   2/3   1/3   0  ]
                   [ −1/3   1/3   2/3   0  ]
                   [   0     0     0   1/6 ]

 
        [  1  1  1   1  1  1   1  1  1 ]         [ 279 ]
X'y  =  [  1  1  1   0  0  0   0  0  0 ]  y  =   [ 108 ]
        [  0  0  0   1  1  1   0  0  0 ]         [  81 ]
        [ −1  0  1  −1  0  1  −1  0  1 ]         [  54 ]

                       [  1/3  −1/3  −1/3   0  ] [ 279 ]   [ 30 ]
β̂ = (X'X)^{-1} X'y  =  [ −1/3   2/3   1/3   0  ] [ 108 ] = [  6 ]
                       [ −1/3   1/3   2/3   0  ] [  81 ]   [ −3 ]
                       [   0     0     0   1/6 ] [  54 ]   [  9 ]

The expected yield in District A is 36 + 9x4 ; in District B it is 27 + 9x4 and in District C 30 +


9x4 , where x4 is the (coded) fertilizer dosage. For example, we predict that the yield in District A
with a dosage of 15 ton/ha will be 36 + 9 (15 - 10)/10 = 40.5.

We have

ŷ = Xβ̂,   ŷ' = (27  36  45  18  27  36  21  30  39)

e' = y' − ŷ' = (−8  3  5  7  −3  −4  2  −2  0)

Σ ei = 0, as is to be expected

e'e = 180

σ̂^2 = e'e/(n − p) = 180/(9 − 4) = 36,   σ̂ = 6.

A 95% confidence interval for βi is

β̂i ± t_{α/2;n−p} √aii σ̂.

Now β̂4 = 9, σ̂ = 6, a44 = 1/6, α = 0.05 ⇒ t_{α/2;n−p} = t0.025;5 = 2.571.

The confidence interval for β4 is

β̂4 ± t_{α/2;n−p} √a44 σ̂ = 9 ± 2.571(6)√(1/6) = 9 ± 6.2976,   that is   (2.7024 ; 15.2976).

For each additional ten tons of fertilizer a yield increase of between 2.7 and 15.3 bags/ha can be
expected. Normally such an experiment would of course be conducted on a larger scale and a much
narrower confidence interval would be obtained.
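
The arithmetic of this illustration can be checked with a short script (not part of the text), assuming numpy and scipy are available:

import numpy as np
from scipy import stats

Xt = np.array([[1, 1, 1, 1, 1, 1, 1, 1, 1],
               [1, 1, 1, 0, 0, 0, 0, 0, 0],
               [0, 0, 0, 1, 1, 1, 0, 0, 0],
               [-1, 0, 1, -1, 0, 1, -1, 0, 1]], dtype=float)
X = Xt.T
y = np.array([19., 39., 50., 25., 24., 32., 23., 28., 39.])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y             # (30, 6, -3, 9)
e = y - X @ beta_hat
sigma2_hat = e @ e / (n - p)             # 36

# 95% confidence interval for beta_4
t = stats.t.ppf(0.975, n - p)            # 2.571
half = t * np.sqrt(XtX_inv[3, 3] * sigma2_hat)
print(beta_hat, sigma2_hat, (beta_hat[3] - half, beta_hat[3] + half))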

Suppose we want to obtain a joint confidence region for β2 and β3 . Then

           [ 2/3  1/3 ]^{-1}     [  2  −1 ]
T11^{-1} = [ 1/3  2/3 ]       =  [ −1   2 ] .

The confidence region is thus given by

(β̂2 − β2   β̂3 − β3) T11^{-1} [ β̂2 − β2 ]  ≤ 2σ̂^2 Fα;2;5
                              [ β̂3 − β3 ]

                      [  2  −1 ] [  6 − β2 ]
∴ (6 − β2   −3 − β3)  [ −1   2 ] [ −3 − β3 ]  ≤ (36)(2)(5.79)     with α = 0.05

∴ 2(6 − β2)^2 + 2(6 − β2)(3 + β3) + 2(3 + β3)^2 ≤ 416.88

∴ (β2 − 6)^2 − (β2 − 6)(β3 + 3) + (β3 + 3)^2 ≤ 208.44.

This represents an ellipse. In order to sketch the ellipse, choose various values of β2 and solve for
β3 from the quadratic equation each time. For example, if β2 = 6 we have

(β3 + 3)2 ≤ 208.44

−14.44 ≤ β3 + 3 ≤ 14.44

−17.44 ≤ β3 ≤ 11.44.

Finally, suppose we wish to test the null hypothesis

H0 : β2 − β3 = 0; β1 + β2 − β4 = 25

(that is the expected yield in Districts A and B are identical and the expected yield in District A
with no fertilizer is 25 bags/ha).

Thus the null hypothesis is

       [ 0  1  −1   0 ] [ β1 ]   [  0 ]
H0 :   [ 1  1   0  −1 ] [ β2 ] = [ 25 ]
                        [ β3 ]
                        [ β4 ]

∴ Kβ = m.

We thus have h = 2 (the rank of K) and we must compute SSH.

              [ 0  1  −1   0 ] [ 30 ]   [  0 ]   [ 9 ]
Kβ̂ − m   =   [ 1  1   0  −1 ] [  6 ] − [ 25 ] = [ 2 ]
                               [ −3 ]
                               [  9 ]

                   [ 0  1  −1   0 ] [  1/3  −1/3  −1/3   0  ] [  0   1 ]   [ 2/3  1/3 ]
K(X'X)^{-1}K'  =   [ 1  1   0  −1 ] [ −1/3   2/3   1/3   0  ] [  1   1 ] = [ 1/3  1/2 ]
                                    [ −1/3   1/3   2/3   0  ] [ −1   0 ]
                                    [   0     0     0   1/6 ] [  0  −1 ]

                          [  9/4  −3/2 ]
(K(X'X)^{-1}K')^{-1}  =   [ −3/2    3  ]

                [  9/4  −3/2 ] [ 9 ]
SSH  =  (9  2)  [ −3/2    3  ] [ 2 ]  =  561/4

∴ f = SSH/(hσ̂^2) = (561/4)/(2 × 36) = 1.9479
The 5% critical value is F0,05;2;5 = 5.79. Reject H0 if f > 5.79.
Since 1.9479 < 5.79, H0 is not rejected at the 5% level of significance.

Example 2.8.
An experiment was performed to determine whether three factors have an effect on the taste of
sosaties: the type of lamb (lean or fat); time the meat is kept in the marinade (1, 2 or 3 days) and
the temperature of the oven (190◦ , 205◦ and 220◦ C). Ten dishes of sosaties were prepared accord-
ing to different recipes and each presented to a different panel of gourmets. The panel awarded a
coefficient of tastiness (y) to each. The results were as follows:

Meat Time (days) Temperature (◦ C) Tastiness


Fat 3 205 16
Fat 2 220 19
Lean 2 220 20
Fat 3 190 43
Lean 2 190 27
Fat 1 190 25
Fat 1 220 22
Lean 3 220 15
Lean 2 190 29
Lean 1 205 9

(a) Fit a linear model with constant term; let x2 = 1 for fat meat and = 0 for lean; x3 = days -
2; x4 = (temp - 205)/15.

(b) Find a 90% confidence interval for β3 .

(c) Find a 95% joint confidence region for β3 and β4 .

(d) Test H0 : β2 = 0; β3 = 9; β1 + β2 + β3 − β4 = 50 at the 5% level.



Solution 2.8.

(a) xi1 = 1 (constant term)
    xi2 = 1 for fat meat; 0 for lean meat
    xi3 = days − 2
    xi4 = (temp − 205)/15

        [ 1  1   1   0 ]
        [ 1  1   0   1 ]
        [ 1  0   0   1 ]
        [ 1  1   1  −1 ]
    X = [ 1  0   0  −1 ]    and    y' = [ 16  19  20  43  27  25  22  15  29  9 ]
        [ 1  1  −1  −1 ]
        [ 1  1  −1   1 ]
        [ 1  0   1   1 ]
        [ 1  0   0  −1 ]
        [ 1  0  −1   0 ]

 
          [ 1  1  1   1   1   1   1  1   1   1 ]       [ 10  5  0  0 ]
    X'X = [ 1  1  0   1   0   1   1  0   0   0 ] X  =  [  5  5  0  0 ]
          [ 1  0  0   1   0  −1  −1  1   0  −1 ]       [  0  0  6  0 ]
          [ 0  1  1  −1  −1  −1   1  1  −1   0 ]       [  0  0  0  8 ]

    Calculating the inverse matrix:

      10  5  0  0 | 1  0  0  0        ⅕(R1 − R2):    1  0  0  0 |  1/5  −1/5   0    0
       5  5  0  0 | 0  1  0  0        ⅕R2 − new R1:  0  1  0  0 | −1/5   2/5   0    0
       0  0  6  0 | 0  0  1  0        ⅙R3:           0  0  1  0 |   0     0   1/6   0
       0  0  0  8 | 0  0  0  1        ⅛R4:           0  0  0  1 |   0     0    0   1/8

                       [  1/5  −1/5   0    0  ]
    Thus, (X'X)^{-1} = [ −1/5   2/5   0    0  ]
                       [   0     0   1/6   0  ]
                       [   0     0    0   1/8 ]

 
          [ 1  1  1   1   1   1   1  1   1   1 ] [ 16 ]   [ 225 ]
    X'y = [ 1  1  0   1   0   1   1  0   0   0 ] [ 19 ]   [ 125 ]
          [ 1  0  0   1   0  −1  −1  1   0  −1 ] [  · ] = [  18 ]
          [ 0  1  1  −1  −1  −1   1  1  −1   0 ] [  9 ]   [ −48 ]

                            [  1/5  −1/5   0    0  ] [ 225 ]   [ 20 ]
    β̂ = (X'X)^{-1} X'y  =   [ −1/5   2/5   0    0  ] [ 125 ] = [  5 ]
                            [   0     0   1/6   0  ] [  18 ]   [  3 ]
                            [   0     0    0   1/8 ] [ −48 ]   [ −6 ]

    ŷi = 20xi1 + 5xi2 + 3xi3 − 6xi4

Then

    ŷ1 = 20(1) + 5(1) + 3(1) − 6(0) = 28
    ŷ2 = 20(1) + 5(1) + 3(0) − 6(1) = 19
    ŷ3 = 20(1) + 5(0) + 3(0) − 6(1) = 14
      ·           ·           ·
    ŷ10 = 20(1) + 5(0) + 3(−1) − 6(0) = 17

    ⇒ ŷ' = [ 28  19  14  34  26  28  16  17  26  17 ].

    We have

    e' = y' − ŷ' = (−12  0  6  9  1  −3  6  −2  3  −8)

    Σ ei = 0, as is to be expected

    e'e = 384

    σ̂^2 = e'e/(n − p) = 384/(10 − 4) = 64,   σ̂ = 8.

(b) A 90% confidence interval for βi is

    β̂i ± t_{α/2;n−p} √aii σ̂.

    Now β̂3 = 3, σ̂ = 8, a33 = 1/6, α = 0.1 ⇒ t_{α/2;n−p} = t0.05;6 = 1.943.

    The confidence interval for β3 is

    β̂3 ± t_{α/2;n−p} √a33 σ̂ = 3 ± 1.943 × 8 × √(1/6) = 3 ± 6.3458,

    that is (−3.3458 ; 9.3458), i.e. −3.3458 ≤ β3 ≤ 9.3458.



(c) The joint confidence region for β3 and β4 is given by

    (β̂3 − β3   β̂4 − β4) T11^{-1} [ β̂3 − β3 ]  ≤ 2σ̂^2 Fα;2;n−p
                                  [ β̂4 − β4 ]

    where

    T11^{-1} = [ 1/6   0  ]^{-1}  =  [ 6  0 ]
               [  0   1/8 ]          [ 0  8 ]

    and β̂3 = 3, β̂4 = −6, Fα;2;n−p = F0.05;2;6 = 5.14.

    Now

    (3 − β3   −6 − β4) [ 6  0 ] [  3 − β3 ]  ≤ 64 × 2 × 5.14
                       [ 0  8 ] [ −6 − β4 ]

    6(3 − β3)^2 + 8(−6 − β4)^2 ≤ 657.92

    3(β3 − 3)^2 + 4(6 + β4)^2 ≤ 328.96.

(d) H0 : β2 = 0;  β3 = 9;  β1 + β2 + β3 − β4 = 50

         [ 0  1  0   0 ]             [  0 ]
    K =  [ 0  0  1   0 ]   and  m =  [  9 ] ,   so that H0 : Kβ = m.
         [ 1  1  1  −1 ]             [ 50 ]

              [ 0  1  0   0 ] [ 20 ]   [  0 ]   [  5 ]   [  0 ]   [   5 ]
    Kβ̂ − m =  [ 0  0  1   0 ] [  5 ] − [  9 ] = [  3 ] − [  9 ] = [  −6 ]
              [ 1  1  1  −1 ] [  3 ]   [ 50 ]   [ 34 ]   [ 50 ]   [ −16 ]
                              [ −6 ]

                      [ 0  1  0   0 ] [  1/5  −1/5   0    0  ] [ 0  0   1 ]     [ 2/5   0     1/5   ]
    K(X'X)^{-1}K'  =  [ 0  0  1   0 ] [ −1/5   2/5   0    0  ] [ 1  0   1 ]  =  [  0   1/6    1/6   ]
                      [ 1  1  1  −1 ] [   0     0   1/6   0  ] [ 0  1   1 ]     [ 1/5  1/6   59/120 ]
                                      [   0     0    0   1/8 ] [ 0  0  −1 ]

    Calculating the inverse matrix (by row reduction of (K(X'X)^{-1}K' | I)) gives

                               [  65/18   20/9   −20/9 ]
    (K(X'X)^{-1}K')^{-1}   =   [  20/9    94/9   −40/9 ] .
                               [ −20/9   −40/9    40/9 ]

    Now

    SSH = (Kβ̂ − m)'(K(X'X)^{-1}K')^{-1}(Kβ̂ − m)

                       [  65/18   20/9   −20/9 ] [   5 ]
        = (5  −6  −16) [  20/9    94/9   −40/9 ] [  −6 ]
                       [ −20/9   −40/9    40/9 ] [ −16 ]

        = (725/18   176/9   −500/9)(5  −6  −16)'

        = 17 513/18

    ∴ f = SSH/(hσ̂^2) = (17 513/18)/(3 × 64) = 17 513/3 456 ≈ 5.0674
Fα;h;n−p = F0,05;3;6 = 4.76. Reject H0 if f > 4.76.

Since 5.0674 > 4.76, we reject H0 at the 5% level of significance.
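
Once more purely as an arithmetic check (not part of the original solution), assuming numpy and scipy are available:

import numpy as np
from scipy import stats

X = np.array([[1, 1, 1, 0], [1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, -1], [1, 0, 0, -1],
              [1, 1, -1, -1], [1, 1, -1, 1], [1, 0, 1, 1], [1, 0, 0, -1], [1, 0, -1, 0]],
             dtype=float)
y = np.array([16., 19., 20., 43., 27., 25., 22., 15., 29., 9.])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                      # (20, 5, 3, -6)
e = y - X @ beta_hat
sigma2_hat = e @ e / (n - p)                      # 64

K = np.array([[0., 1., 0., 0.], [0., 0., 1., 0.], [1., 1., 1., -1.]])
m = np.array([0., 9., 50.])
d = K @ beta_hat - m
SSH = d @ np.linalg.inv(K @ XtX_inv @ K.T) @ d    # 17513/18
f = SSH / (3 * sigma2_hat)                        # approx. 5.067
print(beta_hat, sigma2_hat, f, stats.f.ppf(0.95, 3, n - p))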



Exercise 2.1
 
            [  1   1   1   1  1  1   1  1  1 ]
1. Let X' = [ −1  −1  −1   0  0  0   1  1  1 ]
            [ −1   0   1  −1  0  1  −1  0  1 ]

   and y' = [ 18  20  29  17  20  27  11  28  20 ].

(a) Fit the model y = Xβ + ε where ε ∼ n (0; σ 2 I) .

(b) Test H0 : β3 = 0 at the 5% level of significance.

(c) Construct a 95% confidence interval for β2 .

(d) Construct a 95% joint confidence region for β2 and β3 . Would you conclude that β2 =
β3 = 0?

(e) Test H0 : β1 + β2 + β3 = 20 at the 5% level of significance.

(f) Plot

(i) e against ŷ

(ii) e against y

and comment on the adequacy of the model.

(g) Would you be able to reduce the model y = Xβ + ε? (Substantiate.)



2. In an experiment to investigate the mass increase of young oxen under specific feeding con-
ditions, the following sample of size 12 was used:

Breed Age Type of feed Initial mass Mass increase

Sussex 6 A 175 y1

Beefmaster 12 B 255 y2

Simmentaler 18 A 350 y3

Brahman 6 B 190 y4

Sussex 12 A 230 y5

Beefmaster 18 B 345 y6

Simmentaler 6 A 200 y7

Brahman 12 B 225 y8

Sussex 18 A 320 y9

Beefmaster 6 B 190 y10

Simmentaler 12 A 260 y11

Brahman 18 B 315 y12

Age is given in months, mass in kilogram, feed type A is pasture (grazing) and feed type B
is pasture and feeding paddock.

Construct a design matrix for this experiment. Define variables explicitly.

3. For each of the given design matrices,

(i) state which analysis you would apply, namely analysis of variance, regression analysis
or analysis of covariance (substantiate your answers);

(ii) state the purpose of the analysis clearly.



 
            [ 1  1  −2 ]
    (a) X = [ 1  1   2 ]
            [ 1  0   4 ]
            [ 1  0  −4 ]

            [ 1  0  15 ]
    (b) X = [ 1  0  10 ]
            [ 0  1  16 ]
            [ 0  1  32 ]

            [ 1  10 ]
    (c) X = [ 1  12 ]
            [ 1  14 ]
            [ 1  16 ]

4. A graphical representation of a joint confidence region for β1 and β2 is as follows:

(a) State whether you would not reject or reject each of the following hypotheses and say
why:

(i) β1 = 0, β2 = 0

(ii) β1 = 0

(iii) β2 = 0

(b) Suppose we wish to test the following hypothesis concerning β 0 = [β1 β2 β3 β4 ] :

β4 = 0

β2 − β3 = 0

β2 − β3 + β4 = 0

2β2 − 2β3 + 5β4 = 0

Write the above in matrix form.

5. Let y = Xβ + ε where ε ∼ n(0; σ^2 I3) and

        [ 1  0 ]        [ 12 ]
    X = [ 1  1 ] ,  y = [  6 ] .
        [ 1  0 ]        [  2 ]

(a) Estimate β and σ 2 .

(b) Test H0 : β1 = 5.

6. In an experiment, the mass change of newborn babies, fed on four different milk formulas
was recorded (after a set period of time) as follows:

      A     B     C     D
     200   150    75   220
     188   125   110    87
     203   176         102
           142          76
                         50

The model which describes the observations is: yij = µi + εij , i = 1, 2, 3, 4;


j = 1, ..., ni ; n1 = 3; n2 = 4; n3 = 2; n4 = 5 and εij ∼ independent n (0; σ 2 ) .

(a) Estimate the parameters in the model.

(b) Test, at the 5% level of significance, whether there is a difference between the four milk
formulas.

    NB: Write the hypothesis in the form H0 : Kβ = m and use f = SSH/(hσ̂^2).

(c) Compute a 95% confidence interval for the difference in means between formulas B and
    C, that is µB − µC .

(d) Is it possible to reduce this model? Substantiate.

(e) Give the advantages of the reduction of a model.


Chapter 3

ANALYSIS OF VARIANCE

3.1 Introduction

A typical experiment, the results of which can be analysed by means of the technique known as
analysis of variance, is the following: a number of experimental units (people, cattle, potato lands,
trees, et cetera) are subjected to a number of treatments (diets, medicines, fertilizer applications,
et cetera) each at a number of levels (the three levels of the treatment diet could be two, three
and four meals per day; the two levels of fertilizer could be no fertilizer, 1 ton/ha, 2 tons/ha and 4
tons/ha; et cetera). Sometimes there are other factors present, called blocking factors, which divide
the experimental units into groups which are more uniform (before the experiment starts) than
all the units jointly. Such a factor has two or more levels (for example boys and girls; Leghorns,
Black Australorps and Rhode Island Reds; married, divorced, widowed and single).

After one or more treatments have been applied to such groups of experimental units, the result
is observed. This result is called the yield (for example body weight of experimental person, the
yield of potatoes in kg/ha, the egg production of hens). The purpose of the analysis of variance
is to test whether the treatment had an effect on the yield, and sometimes whether the groups
defined by the blocking factors differ with respect to yield.

One of the problems which faces a statistician who acts as consultant to a research worker, is the
fact that the experiment is often completed before he or she is consulted, with the result that the


experiment is usually poorly planned. The statistician’s client does not always tell the statistician
about all the important factors, because the client possibly does not know that they are impor-
tant. It is the duty of the statistician to question his or her client about the way in which the
experiment was performed, in order to try and find out whether such factors do indeed exist or not.

The manner in which the experimental units are assigned to the treatment levels is also very im-
portant. It is always desirable that these be assigned randomly.

If there is only one factor (treatment or blocking factor) present, we speak of a one-way clas-
sification and the corresponding analysis is termed one-way analysis of variance. If there are
two factors, we have a two-way classification and we speak of two-way analysis of variance, and so
on. These names originate from the arrangement of the data in tables as in the following examples.

When discussing the analysis of these examples, including analyses of chapters 4 and 5, some out-
put from a standard package for personal computers, called SAS JMP, will be shown. You should
make sure that you are able to interpret these results. One of the techniques used in SAS JMP
is a plot of treatment means diamonds. The line across each diamond represents the group mean.
The vertical span of each diamond represents the 95% confidence interval for each group. The
standard error of any mean is given by

[S.E.(means)]² = s²/(number of observations included in the mean)

where s2 is the residual mean square or error mean square.

3.1.1 One-way analysis of variance

Suppose six potato tubers were planted during each of the four moon phases. The purpose is to
test whether the yield of the plants is influenced by the phase of the moon at planting time. The
results are as follows (yield in grams per plant):

New moon : 201 197 190 194 200 206


First quarter : 189 200 199 196 194 198
Full moon : 206 214 202 205 203 206
Last quarter : 205 194 202 198 200 201

This is a typical example of a one-way classification, and will be analysed in section 3.4.

This experiment may be refined. One source of variation in potato yield is the season: apart
from the possible effect of the moon phase, there is also an optimum planting time with respect
to seasons. To a certain extent the difference in yield found in the experiment may be ascribable
to seasonal and climatic differences. It would be better if the experiment were repeated over a
number of moon months and possibly over a number of years.

3.1.2 Two-way analysis of variance

A housewife does her grocery shopping once a week. Once a month she buys the main items and
is then able to compare the prices of a number of supermarkets before buying. In the other three
weeks of the month she buys only the week’s supplies of perishable foods, and she would like to
decide on one of the three supermarkets in her area for this shopping. She therefore decides to buy
at each of the three supermarkets in turn once every month for four months in order to compare
prices. Her results are as follows (prices to the nearest rand):

Month Supermarket
A B C
May 34 18 29
June 30 20 34
July 27 30 36
August 33 32 37

One source of variation is of course the fact that her shopping list varies from week to week; a
more sensitive analysis might be possible if she recorded only those items which remain on her list
from week to week; the results would, however, not answer her question fully.

3.1.3 Three-way analysis of variance

A man has a choice of three routes to drive to work. He decides to perform an experiment in order
to enable him to select a route. He knows that his travelling time depends on the day of the week
and on the time he leaves home. He performs his experiment over a period of nine weeks, and
allocates the routes and departure times randomly to the days of each week. His travelling times
(in minutes) for the nine weeks are as follows:

Departure 07:00 07:15 07:30


Route I II III I II III I II III
Day:
M 34 34 40 32 46 54 36 52 50
T 19 39 32 32 35 47 33 40 47
W 24 33 36 31 40 46 32 44 47
T 15 33 39 32 41 38 34 37 46
F 38 41 53 43 58 55 45 57 60

A few remarks about the experiment:

(i) The investigator would of course try to restrict his or her investigations to “normal” weeks,
ie weeks without public holidays, weeks which do not coincide with school holidays. Weeks
in which the last day of the month falls should possibly be avoided as well, and might be the
subject of a separate study.

(ii) It is possible that the traffic volume may increase slightly over the nine weeks. Also, the
performance of a car may deteriorate gradually after being serviced. For these reasons it is
important that the routes and departure times be allocated randomly to the days and weeks.
If route I were selected the first three weeks, route II the next three weeks and route III the
last three weeks, then an observed difference in travelling times may be due to a difference
in routes or changes in traffic volume or a change in car performance or other factors.

(iii) It is also possible to select weeks as a fourth factor and thus obtain a four-way classification.
However, the model would then be much more complicated. (See nested experiments in
section 3.3.)

3.2 Balanced experiments and replicates

We first define the concept cell. Each combination of levels of the factors in an experiment is called
a cell. In section 3.1.1 there are four cells (the four moon phases); in section 3.1.2 there are 4×3
= 12 cells (eg Supermarket B in July, Supermarket C in May); in section 3.1.3 there are 3×3×5 =
45 cells. Depending on the number of observations (replicates) in each cell, experiments may be
classified as follows:

(a) Incomplete experiment:

Observations do not occur in every cell. Apart from poor planning, there may be two reasons
for this:

(i) The experiment has been designed specifically in such a way that some cells remain
empty. This may have been done because the experimenter is unable to cope with as
many experimental units as there are cells. The experiment is then designed in such a
way that a precise analysis is still possible.

(ii) Some observations may have been lost; a rabbit ate one of the plants, a test tube broke,
a laboratory mouse died or a patient decided to discontinue his or her treatment. In such
cases it is still possible to analyse the results, but the analysis is much more complicated.
The missing observations may be estimated by means of least squares, and the number
of degrees of freedom in the error sum of squares is decreased accordingly.

(b) Balanced experiment:

In such an experiment there are equal numbers of observations per cell. We distinguish
between two cases:

(i) One observation per cell: Sections 3.1.2 and 3.1.3 are such examples. As will be seen,
there is some information which is not contained in such an experiment; in particular
it is not possible to ascertain whether the model is an adequate representation of the
data.

(ii) More than one observation per cell: This is the most desirable type of experiment,
firstly because the model may be verified and secondly because the computations are
not too complicated.

(c) Unbalanced experiments in which there are unequal numbers of observations per cell. It
is possible to analyse the data in a meaningful way, but the algebra, the notation and the
computations are much more complicated.

3.3 Fixed effects and random effects

In chapter 2 the general linear model

y = Xβ + ε

was described, and it was stated that β is a vector of unknown constants.

However, sometimes some elements of β may be random variables, and the distributions of the
quadratic forms may differ from those derived in chapter 2.

Consider a factor with k levels. If these k levels constitute the complete set of levels in which one
is interested, the effects of the levels of the factor on the yield are called fixed effects. In section
3.1.1 we were interested only in the four phases of the moon, and observations were made during
every phase. In section 3.1.2 we were interested in the three supermarkets only, and in section
3.1.3 in the three routes only.

On the other hand, a factor may exist at a large number of levels and experiments are performed
at a randomly selected set of these levels. The effects of these levels are called random effects.

3.3.1 Example on mixed model

A simple example of a random effect is the following: A farmer wishes to test the effects of four
diets on the body mass of piglets. He knows that piglets of the same litter are more uniform (before
treatment) than all the piglets jointly. He therefore selects five litters at random and four piglets
at random from each litter. One piglet is assigned at random to each diet. The results (mass of
each piglet after a fixed period) are as follows:

Litters Diets
A B C D
I 30 51 40 31
II 51 56 45 48
III 44 59 57 36
IV 40 47 42 39
V 50 57 41 36

If the farmer wants to draw conclusions about these specific five litters only, he may regard the
factor “Litters” as fixed; however, if he wants to draw conclusions about the population of litters
from which the five litters may be regarded as a random sample, the factor “Litters” must be
regarded as a random factor.

Another typical example is a large consignment of wool (or other produce). A random sample of
bags is selected, and then a number of samples from each bag is selected and subjected to various
treatments. The investigator wants to make statements about the whole consignment.

We may distinguish further between factors with levels which are selected at random from finite
(small) populations and infinite (large) populations. We shall concentrate mainly on the theory of
large populations.

Thus we may regard any factor in an experiment as fixed or random. Experiments may be classified
accordingly:

(a) Fixed effects model or model I: all factors (treatments or blocking factors) are fixed.

(b) Random effects model or model II: all factors are random.

(c) Mixed model: some factors are fixed and some are random.

Example 3.1.
Determine in the following scenarios whether the factors are fixed or random effects:

(a) In an effort to improve the quality of video tapes, the effects of four kinds of treatment A, B,
C and D on reproducing quality of image are compared.

(b) Pieces of skin were exposed to four randomly chosen light intensities for a fixed period. After
exposure the elasticity of the skin was measured. The purpose was to compare the effects of
the four intensities on the elasticity of the skin.

(c) An investigator wishes to evaluate the effect on reaction time of medicine administered for
colds. Data were recorded that represent the reaction (in seconds) on a stimulus one hour
after each of four randomly chosen brands of this type of medicine was administered to ten
people.

(d) A study was done as to how the concentration of a certain drug in the blood, 24 hours after
being injected, is affected by age and sex. An analysis of the blood samples of 40 people
who had received the drug yielded concentration (in milligrams per cubic centimetre) for age
groups 11-25 years, 26-40 years, 41-65 years and >65 years.

Solution 3.1.

(a) The factor treatment is a fixed effect.

(b) The factor light intensity is a random effect.

(c) The factor brand is a random effect.

(d) The factor gender is a fixed effect and the factor age is a fixed effect.

Hypothesis testing in the analysis of variance

To test hypotheses regarding fixed effects, we shall use the method of testing linear hypotheses in
section 2.7. In the case of random effects, we shall test hypotheses of the form H0 : σA2 = 0 (which
is equivalent to testing H0 : σB2 + cσA2 = σB2 , where c is a positive constant). The procedure for
testing such a hypothesis is as follows:

(a) Find two quadratic forms qi and qj such that

        qi/(σB² + cσA²)    and    qj/σB²

    are independent central chi-square variates with ri and rj degrees of freedom respectively,
    irrespective of whether H0 is true. Then

        [ (qi/ri)/(σB² + cσA²) ] / [ (qj/rj)/σB² ] ∼ Fri;rj .

(b) Compute f = (qi/ri)/(qj/rj).

We know that

    f ∼ (1 + c σA²/σB²) Fri;rj .

Thus f ∼ Fri;rj if and only if σA² = 0 (that is, if and only if H0 is true). If σA² > 0, we may expect f
to be larger than an Fri;rj statistic. Hence, if f > Fα;ri;rj , reject H0 : σA² = 0 at the 100α% level.

For each model (I, II or mixed), we shall set up an analysis of variance table. The relevant quadratic
forms which make up the ratio f will be obtained from the “mean square” or “MS” column of the
table. These mean squares can be shown to be independent chi-square variates. It should become
apparent that the ratio f will be formed from two mean squares, say MSi and MSj , which are
chosen in such a way that E(MSi ) = E(MSj ) if and only if the null hypothesis of interest is true.
This will ensure that f has a central F -distribution under H0 .

The above procedures will be used to derive the results for the one-way models in detail. The
results presented for the two-way models can be derived analogously.

3.3.2 Nested or hierarchical experiments

3.3.2.1 Illustration 1

Suppose there are six comparable grade eight classes in a large school, and six teachers who are
able to teach mathematics. Three methods of teaching mathematics are to be compared. Each
teacher is assigned (randomly) to one class to teach according to a prescribed method. After one
term five pupils are selected at random from each class and tested.

Suppose their marks are as follows:

Teacher Teaching method


I II III
1 60 49 68
55 44 60
48 43 65
62 51 62
65 48 75
2 61 54 64
64 52 73
57 58 70
56 48 78
72 63 75

Superficially this may look like an ordinary two-way classification with five observations per cell.
However, the designations “Teacher 1” and “Teacher 2” are misleading, since each of the six classes
was taught by a different teacher. A more accurate exposition would be as follows (in order to
save space, the observations are not repeated, but each cross indicates a set of five observations):

Teacher Teaching method


I II III
1 X
2 X
3 X
4 X
5 X
6 X

Such an experiment is termed a hierarchical or nested experiment; in the above example we say
that the factor Teacher is nested in the factor Teaching method. In such experiments there are
empty cells, not because of a lack of time or facilities, but because the empty cells are not relevant.
Another similar example is the following:

3.3.2.2 Illustration 2

A farmer wants to determine the effect of three sprays on his peach trees. He selects 12 comparable
trees at random, and assigns them randomly to the three sprays. After a week he selects six leaves
at random from each tree (Question: how would he do that?) and tests the nitrogen content of
each leaf. The results are as follows:

Spray Tree
1 2 3 4
A 4.5 5.5 13.5 11.5
7.0 7.5 15.0 10.0
5.0 12.5 12.5 12.0
5.5 6.0 12.0 9.5
6.5 4.0 10.0 10.5
7.5 6.5 9.0 12.5
B 15.5 14.5 11.0 12.5
15.0 15.0 10.0 12.0
14.5 12.5 12.0 13.5
14.0 16.0 13.0 10.0
16.0 13.5 10.5 10.5
15.0 12.5 9.5 13.5
C 7.0 3.0 8.0 12.5
8.0 4.5 9.5 10.0
5.0 4.0 8.5 13.0
7.5 5.5 10.5 10.5
8.5 7.0 7.5 11.0
6.0 6.0 10.0 9.0

In this case we do not deal with the same four trees for every spray, but with 12 different trees,
and the experiment is a nested experiment.

3.4 One-way analysis of variance: Model I

The analysis
We now assume that there is one treatment or blocking factor present with k levels, and that
there are ni experimental units at the i-th level. We also assume that these k levels represent
all levels of interest. The k levels define k populations, and the observations may be regarded as
independent random samples from k populations. Let yij be the observation on the j-th individual
corresponding to the i-th level of the treatment or blocking factor. The model is

    yij = µi + εij ,  j = 1, · · · , ni ;  i = 1, · · · , k                    (3.1)

where the εij are independently n(0; σ²) distributed.

Note: This model can be reparameterised to

    yij = µ + αi + εij ,  j = 1, · · · , ni ;  i = 1, · · · , k

to be discussed in the next section.

Let N = n1 + n2 + · · · + nk be the total number of observations.

We write the model in the matrix form

    y = Xβ + ε

as follows: y is the N × 1 vector (y11 , . . . , y1n1 , . . . , yk1 , . . . , yknk )′, β = (µ1 , µ2 , . . . , µk )′,
ε = (ε11 , . . . , εknk )′, and X is the N × k matrix of indicators whose i-th column contains ones in
the ni rows corresponding to the observations at the i-th level and zeros elsewhere.
Then

    X′X = diag(n1 , n2 , . . . , nk)    and    (X′X)⁻¹ = diag(1/n1 , 1/n2 , . . . , 1/nk),

    X′y = (y1. , y2. , . . . , yk.)′,  say,  where  yi. = Σj yij  (the sum over j = 1, . . . , ni),

    ∴ β̂ = (X′X)⁻¹X′y = (ȳ1. , ȳ2. , . . . , ȳk.)′

where ȳi. = yi./ni = (1/ni) Σj yij , the mean of the i-th sample.

Let ȳ.. = (1/N) Σi Σj yij = Σi ni ȳi. / Σi ni

(the grand mean of all the observations).
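As a purely illustrative aside (not part of the guide, which uses SAS JMP elsewhere), the following minimal Python/NumPy sketch, with hypothetical data, builds the indicator design matrix X and checks that β̂ = (X′X)⁻¹X′y reproduces the sample means ȳi. :

    # Minimal sketch (assumed Python/NumPy); data below are hypothetical.
    import numpy as np

    # k = 3 groups with n_i = 2, 3, 2 observations
    y_groups = [np.array([5.0, 7.0]), np.array([6.0, 8.0, 10.0]), np.array([3.0, 5.0])]
    y = np.concatenate(y_groups)
    n_i = [len(g) for g in y_groups]

    # indicator (dummy) columns: one column per treatment level
    X = np.zeros((sum(n_i), len(n_i)))
    row = 0
    for i, n in enumerate(n_i):
        X[row:row + n, i] = 1.0
        row += n

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # (X'X)^{-1} X'y
    group_means = np.array([g.mean() for g in y_groups])
    print(beta_hat)        # equals the group means ybar_i.
    print(group_means)

Because X′X is diagonal, solving the normal equations reduces to dividing each group total by its sample size, which is exactly what the printout confirms.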

Of importance in this model are the two quadratic forms q4 and q5 of chapter 2. The error sum of
squares is

    q4 = y′(I − X(X′X)⁻¹X′)y
       = y′y − y′X(X′X)⁻¹X′y
       = Σi Σj yij² − Σi ni ȳi.²
       = Σi [ Σj yij² − ni ȳi.² ]
       = Σi Σj (yij − ȳi.)²                                                   (3.2)

If we write
    q4 = Σj=1..n1 (y1j − ȳ1.)² + · · · + Σj=1..nk (ykj − ȳk.)²

then we see that q4 measures the variation within samples. (Each term of q4 measures the variation
within one of the k samples.)

The hypothesis which one usually wants to test is that the k population means are all equal:

H0 : µ1 = µ2 = · · · = µk .

(In exceptional circumstances one may want to test µ1 = µ2 = · · · = µk = c with c specified, but
that problem will not be discussed here.)

Now H0 is true if and only if the following is true:

µ1 = µ2 ; µ2 = µ3 ; · · · ; µk−1 = µk

We can thus write H0 in the form Kβ = m, where K is the (k − 1) × k matrix with rows
(1, −1, 0, . . . , 0), (0, 1, −1, 0, . . . , 0), . . . , (0, 0, . . . , 0, 1, −1) and m = (0, 0, . . . , 0)′.

The rank of K is k − 1, so that q5 corresponding to this K will have k − 1 degrees of freedom.


After the computation of (K(X 0 X)−1 K 0 )−1 , which involves some intricate matrix manipulations,
it can be shown that

    q5 = Σi ni (ȳi. − ȳ..)²
       = Σi ni ȳi.² − N ȳ..²
       = y′A5 y
       = y′(B − C)y                                                           (3.3)

where C = 1(1′1)⁻¹1′ is the N × N matrix with every element equal to 1/N, and B is the block
diagonal matrix B = diag( 1(1′1)⁻¹1′ , . . . , 1(1′1)⁻¹1′ ) whose i-th diagonal block is the ni × ni
matrix with every element equal to 1/ni , all other elements of B being zero.

It is easily seen (by direct multiplication) that


    BB = B;   CC = C;   BC = CB = C;
    r(C) = 1;   r(B) = k;   r(A5) = r(B − C) = k − 1,  and
    A5A5 = (B − C)(B − C) = BB − CB − BC + CC = B − C − C + C = A5

which shows, as before, that q5 /σ 2 is a noncentral chi-square variate with k − 1 degrees of freedom
and noncentrality parameter
    λ5 = Σi ni (µi − µ̄.)²/σ²        (where µ̄. = (1/N) Σi ni µi)
which is equal to zero if and only if H0 is true.

From chapter 2 we know that q4 and q5 are independent and that q4/σ² is a (central) χ²N−k variate.
This can also be shown directly by noting that

    q4 = Σi Σj yij² − Σi ni ȳi.² = y′Iy − y′By = y′(I − B)y,

that r(I − B) = N − k and that (I − B)(B − C) = B − C − BB + BC = O.

From (D8) it follows that

    E(q4/σ²) = N − k    ∴ E(q4/(N − k)) = σ²

    E(q5/σ²) = k − 1 + Σi ni (µi − µ̄.)²/σ²
    ∴ E(q5/(k − 1)) = σ² + Σi ni (µi − µ̄.)²/(k − 1).

Thus q4/(N − k) is an unbiased estimator of σ² and we write

    σ̂² = q4/(N − k) = Σi Σj (yij − ȳi.)²/(N − k).

Consider q5:

    q5 = Σi ni (ȳi. − ȳ..)² ,

which is the weighted sum of squares of deviations of the sample means from the grand mean.
We say that q5 is the “between samples” sum of squares while q4 is the “within samples” sum of
squares. Furthermore, since

    I − C = (I − B) + (B − C)

or

    A2 = A4 + A5 ,

    ∴ y′(I − C)y = y′A4 y + y′A5 y

or

    q2 = Σi Σj (yij − ȳ..)² = Σi Σj (yij − ȳi.)² + Σi ni (ȳi. − ȳ..)²

    ∴ SSTotal¹ = SSwithin samples + SSbetween samples .

From (D10) it follows that


    f = [q5/(k − 1)] / [q4/(N − k)]

has a central Fk−1;N−k distribution if H0 is true (and a noncentral distribution if H0 is not true).

Thus H0 : µ1 = . . . = µk is rejected if f > Fα;k−1;N −k . The results are usually presented in tabular
form as follows:

Source             SS*    d.f.*    MS*           E(MS)
Between samples    q5     k − 1    q5/(k − 1)    σ² + Σ ni(µi − µ̄.)²/(k − 1)
Within samples     q4     N − k    q4/(N − k)    σ²
Total              q2     N − 1

* SS = sum of squares; d.f. = degrees of freedom; MS = mean square

¹ This expression should, strictly speaking, be called “SSTotal adjusted for the mean”. We adopt the more
concise “SSTotal” from now on.

Such a table is known as an ANOVA table (analysis of variance table).
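The quantities in such a table are easy to compute directly. The following is an illustrative sketch in Python with NumPy/SciPy (assumed here purely for illustration; the guide itself shows SAS JMP output), applied to the moon-phase data of section 3.1.1, and can be checked against example 3.2 below:

    # Sketch (assumed Python/NumPy/SciPy) of the one-way ANOVA table quantities.
    import numpy as np
    from scipy import stats

    groups = {
        "New moon":      [201, 197, 190, 194, 200, 206],
        "First quarter": [189, 200, 199, 196, 194, 198],
        "Full moon":     [206, 214, 202, 205, 203, 206],
        "Last quarter":  [205, 194, 202, 198, 200, 201],
    }
    data = [np.array(v, dtype=float) for v in groups.values()]
    k = len(data)
    N = sum(len(g) for g in data)
    grand_mean = np.concatenate(data).mean()

    q4 = sum(((g - g.mean()) ** 2).sum() for g in data)             # within-samples SS
    q5 = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in data)   # between-samples SS

    f = (q5 / (k - 1)) / (q4 / (N - k))
    p_value = stats.f.sf(f, k - 1, N - k)
    print(q5, q4, f, p_value)    # 336, 400, 5.6, ...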

Multiple comparisons
If H0 : µ1 = . . . = µk is rejected, one may want to test for specific differences; usually one wants
to test contrasts of the form H0 : Σ ci µi = 0 where Σ ci = 0. In such cases one may retain an
overall significance level of α if H0 is rejected when

    (Σ ci ȳi.)² / Σ(ci²/ni) > (k − 1)σ̂² Fα;k−1;N−k

(this can be shown by specifying t′ = (t1 t2 · · · tk−1 ), where ti = Σr=1..i cr for i = 1, . . . , k − 1,
in qt∗ , equations (2.7.5) and (2.7.7) of section 2.7).

In particular, H0 : µr = µs is rejected if

    (ȳr. − ȳs.)² / (1/nr + 1/ns) > (k − 1)σ̂² Fα;k−1;N−k .
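A hedged computational sketch of this rule (Python/SciPy, assumed only for illustration) follows; the numbers correspond to the full-moon-versus-the-rest contrast worked out in example 3.2(d) below:

    # Sketch (assumed Python/SciPy) of the multiple-comparison rule for a contrast.
    import numpy as np
    from scipy import stats

    ybar = np.array([198.0, 196.0, 206.0, 200.0])   # group means
    n = np.array([6, 6, 6, 6])
    c = np.array([-1/3, -1/3, 1.0, -1/3])           # contrast coefficients, sum to 0
    k, N = 4, n.sum()
    sigma2_hat = 20.0                                # error mean square from the ANOVA table
    alpha = 0.05

    lhs = (c @ ybar) ** 2 / ((c ** 2 / n).sum())
    rhs = (k - 1) * sigma2_hat * stats.f.ppf(1 - alpha, k - 1, N - k)
    print(lhs, rhs, lhs > rhs)   # 288, about 186, True -> reject H0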
Example 3.2.
Using data in section 3.1.1:

(a) Construct an ANOVA table.

(b) Test at the 5% level of significance if there is a difference in yields among the moon phases.

(c) Test at the 5% level of significance whether the mean yield of potatoes planted at full moon
is equal to the mean yield of potatoes planted at new moon.

(d) Test whether the mean yield of potatoes planted at full moon is equal to the mean yield of
potatoes planted during all other phases. Use α = 0.05.

Solution 3.2.

(a)

Treatments
New moon First quarter Full moon Last quarter
201 189 206 205
197 200 214 194
190 199 202 202
194 196 205 198
200 194 203 200
206 198 206 201
n 6 6 6 6
y i. 198 196 206 200

y .. = 200

    Σj (y1j − ȳ1.)² = (201 − 198)² + (197 − 198)² + . . . + (206 − 198)²
                    = 3² + (−1)² + (−8)² + (−4)² + 2² + 8²
                    = 158

    Σj (y2j − ȳ2.)² = (189 − 196)² + (200 − 196)² + . . . + (198 − 196)²
                    = (−7)² + 4² + 3² + 0² + (−2)² + 2²
                    = 82

    Σj (y3j − ȳ3.)² = (206 − 206)² + (214 − 206)² + . . . + (206 − 206)²
                    = 0² + 8² + (−4)² + (−1)² + (−3)² + 0²
                    = 90

    Σj (y4j − ȳ4.)² = (205 − 200)² + (194 − 200)² + . . . + (201 − 200)²
                    = 5² + (−6)² + 2² + (−2)² + 0² + 1²
                    = 70
    q4 = ΣΣ (yij − ȳi.)² = 158 + 82 + 90 + 70 = 400

    q5 = Σ ni (ȳi. − ȳ..)²
       = 6(198 − 200)² + 6(196 − 200)² + 6(206 − 200)² + 6(200 − 200)²
       = 6(−2)² + 6(−4)² + 6(6²) + 6(0²)
       = 336

Thus the ANOVA table is as follows:

Source SS d.f. MS f
Between 336 3 112 112/20 = 5.6
Within 400 20 20 = σ̂ 2
Total 736 23

(b) H0 : µ1 = µ2 = µ3 = µ4 (H0 : α1 = α2 = α3 = α4 = 0)

Fα;k−1;N −k = F0.05;3;20 = 3.10. Reject H0 if f > 3.1.


Since 5.6 > 3.1 we reject H0 at the 5% level of significance and conclude that the yield of the
plants is influenced by the phase of the moon at planting time.

(c) H0 : µ1 = µ3

For multiple comparisons, we first compute

(k − 1)σ̂ 2 Fα;k−1;N −k = 3 × 20 × 3.1 = 186.

    Reject H0 if (ȳr. − ȳs.)²/(1/nr + 1/ns) > (k − 1)σ̂² Fα;k−1;N−k .

    Now

    (ȳ1. − ȳ3.)²/(1/6 + 1/6) = (198 − 206)²/(2/6)
                             = 64 × 6/2
                             = 192

Since 192 > 186, we reject H0 at the 5% level of significance and conclude that the mean
yield of potatoes planted at full moon is not equal to the mean yield of potatoes planted at
new moon.

(d) We want to test whether the mean yield of potatoes planted at full moon is equal to the mean
yield of potatoes planted during all other phases.

H0 : µ3 = (µ1 + µ2 + µ4 )/3

    that is H0 : −(1/3)µ1 − (1/3)µ2 + µ3 − (1/3)µ4 = 0
    ⇒ c1 = −1/3, c2 = −1/3, c3 = 1 and c4 = −1/3.

    Then

    (Σ ci ȳi.)² / Σ(ci²/ni) = [(−1/3)(198) + (−1/3)(196) + (1)(206) + (−1/3)(200)]²
                               / [(1/9)(1/6) + (1/9)(1/6) + (1)(1/6) + (1/9)(1/6)]
                            = 288.

Since 288 > 186 we reject H0 at the 5% level of significance and conclude that the mean yield
of potatoes planted at full moon is not equal to the mean yield of potatoes planted during the
other phases.

The SAS JMP output for this example can be seen in figure 3.1. Compare the tables with the
answers obtained above. The graph shows clearly that the yield is highest during moon phase 3.

Figure 3.1: SAS JMP output for example 3.2

Note: If the confidence intervals (the points of the diamonds) do not overlap, the means are signif-
icantly different. In other words if the means are not much different, they are close to the
grand mean.

Comments about the assumptions


The three important assumptions in the model are

(i) normality

(ii) equal variances

(iii) independence

The assumption (i) may be investigated using the N residuals yij − ŷij = yij − ȳi. . The assumption
of equal variances may be tested formally, using a test known as Bartlett’s test or a test based on
the statistic

    maxi si² / mini si²

where si² = Σj (yij − ȳi.)²/(ni − 1).

The latter test is applicable only if n1 = . . . = nk .

The assumption of independence is much more difficult to verify − one has to infer this from the
way in which the experiment was performed.

If (i) is not true, one may try a transformation or use a rank test known as the Kruskal-Wallis test.
If (ii) is not true, one may try to stabilize the variance by means of a transformation; otherwise
one is faced with the Behrens-Fisher problem, for which approximate solutions exist. If (iii) is
not true, the problem is much more difficult to solve.
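These checks can be carried out with standard routines. The following Python/SciPy sketch (assumed for illustration only; scipy.stats.bartlett, scipy.stats.shapiro and scipy.stats.kruskal are the assumed library calls) applies them to the moon-phase data:

    # Sketch (assumed Python/SciPy) of the assumption checks described above.
    import numpy as np
    from scipy import stats

    groups = [np.array([201, 197, 190, 194, 200, 206], float),
              np.array([189, 200, 199, 196, 194, 198], float),
              np.array([206, 214, 202, 205, 203, 206], float),
              np.array([205, 194, 202, 198, 200, 201], float)]

    # residuals y_ij - ybar_i. for a normality check (e.g. Shapiro-Wilk or a Q-Q plot)
    residuals = np.concatenate([g - g.mean() for g in groups])
    print(stats.shapiro(residuals))

    # Bartlett's test for equal variances
    print(stats.bartlett(*groups))

    # the simple max/min variance ratio (equal n_i only)
    s2 = np.array([g.var(ddof=1) for g in groups])
    print(s2.max() / s2.min())

    # Kruskal-Wallis rank test as a fallback when normality fails
    print(stats.kruskal(*groups))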

A reparameterisation

A form in which the one-way analysis of variance model is often presented is one with a constant
term

    yij = µ + αi + εij ;  j = 1, . . . , ni ;  i = 1, . . . , k

where, as before, the εij are independently n(0; σ²) distributed. Now there are k + 1 parameters
instead of k parameters. This introduces an indeterminacy which is reflected in the fact that X in
the representation y = Xβ + ε does not have full column rank (the first column of X is equal to the
sum of the other columns):
    y = Xβ + ε,  where now β = (µ, α1 , . . . , αk )′ and X is the N × (k + 1) matrix whose first
    column consists entirely of ones and whose remaining k columns are the indicator columns of
    the k treatment levels, as before.

Since the model represents the same problem as the model in equation 3.1, we must have

µi = αi + µ, i = 1, . . . , k

that is αi = µi − µ.

The αi are not uniquely defined in terms of the µi − we still need to choose a suitable definition
of µ (subject to certain conditions). The usual choice is to let
    µ = (1/k) Σi µi

in which case

    Σi αi = Σi (µi − µ) = 0.

(Another choice is µ = (1/N) Σi ni µi , in which case Σi ni αi = 0.)

The parameters α1 , · · · , αk are termed the treatment effects, with αi the effect of the i-th treatment
level. The following two null hypotheses are equivalent:
H0 : µ1 = µ2 = . . . = µk , and
H0 : α1 = . . . = αk−1 = αk (= 0).

Also, H0 : µi = µj is equivalent to H0 : αi = αj .

3.4.1 Estimates of parameters

You might recall from your earlier courses in statistics the distinction between statistics and
parameters. Statistics are numerical attributes that describe the characteristics of a sample, whilst
parameters are numerical attributes that describe the characteristics of a population. A statistic is
used to estimate (approximate) a population parameter.

You learnt that there are two types of estimates, namely point estimates and interval estimates. A
point estimate uses a single value to estimate a population parameter, whilst an interval estimate
uses a range of values to approximate a population parameter.

3.4.1.1 Point estimates

The values of µ, αi , σ² and εij in the model can be estimated from our data. These are related to
some of the summary statistics which investigators would initially calculate from their data.

The mean of the ith sample is denoted by ȳi. . It is calculated as

    ȳi. = (yi1 + yi2 + . . . + yin)/n

which is a linear combination of the yij . The expected value is

    E(ȳi.) = [(µ + αi) + . . . + (µ + αi)]/n = (nµ + nαi)/n = µ + αi .

Thus we use ȳi. to estimate µ + αi .

The overall mean ȳ.. can be expressed as

    ȳ.. = [(y11 + . . . + y1n) + (y21 + . . . + y2n) + . . . + (yk1 + . . . + ykn)]/(kn).

Its expected value is

    E(ȳ..) = [(µ + α1) + . . . + (µ + α1) + . . . + (µ + αk) + . . . + (µ + αk)]/(kn)
           = [knµ + n(α1 + α2 + . . . + αk)]/(kn)
           = µ        (since α1 + α2 + . . . + αk = 0).

Thus ȳ.. can be used to estimate µ, ȳi. can be used to estimate µ + αi and ȳi. − ȳ.. is an estimate
of αi . The population variance σ 2 is estimated by s2 , the pooled estimate of the variance.

In summary, the list of the parameters and their point estimate are as follows:

Parameter Estimate Interpretation of estimate


µ ȳ.. Overall mean
αi ȳi. − ȳ.. Effect of the ith category
µ + αi ȳi. Mean of the ith category
αi − αj ȳi. − ȳj. Difference between the means of the ith and j th categories
σ2 s2 Common variance

In certain cases the sample sizes are not equal. In this case the overall mean is given by

    ȳ.. = (1/N) Σi Σj yij = Σi ni ȳi. / Σi ni .                               (3.4)
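As an illustration, the following Python/NumPy sketch (assumed, not part of the guide) computes the point estimates in the table above for unequal sample sizes, using the drug proportions of example 3.4 further on, so the printed values can be checked against that example:

    # Sketch (assumed Python/NumPy) of the point estimates for unequal sample sizes.
    import numpy as np

    groups = [np.array([0.10, 0.12, 0.08, 0.14]),
              np.array([0.12, 0.14, 0.19]),
              np.array([0.20, 0.25, 0.15])]
    n_i = np.array([len(g) for g in groups])
    N, k = n_i.sum(), len(groups)

    ybar_i = np.array([g.mean() for g in groups])
    mu_hat = (n_i * ybar_i).sum() / N                     # overall mean, equation (3.4)
    alpha_hat = ybar_i - mu_hat                           # effect of each category
    s2 = sum(((g - g.mean()) ** 2).sum() for g in groups) / (N - k)   # pooled variance
    print(mu_hat, alpha_hat, s2)    # 0.149, (-0.039, 0.001, 0.051), about 0.0014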

3.4.1.2 Confidence intervals

In order to calculate the confidence interval we need to derive the variance for each estimator.

For the overall mean the variance is

    Var(ȳ..) = Var( (1/N) Σi Σj yij )
             = (1/N²) Σi Σj Var(yij)
             = (1/N²)(Nσ²)
             = σ²/N .                                                         (3.5)

The confidence interval for estimating µ is given by

    µ̂ ± tα/2 (N − k) × √var(µ̂)

that is

    ȳ.. ± tα/2 (N − k) × √(s²/N)                                              (3.6)

where s² = MSE.

Now Var(α̂i) = Var(ȳi. − ȳ..). Consider the effect of the first category:

    ȳ1. − ȳ.. = ȳ1. − (Σi ni ȳi.)/N
              = ȳ1. − (n1 ȳ1. + n2 ȳ2. + n3 ȳ3. + . . . + nk ȳk.)/N
              = [(N − n1)/N] ȳ1. − (n2/N) ȳ2. − (n3/N) ȳ3. − . . . − (nk/N) ȳk. .

Thus, since the sample means are independent,

    Var(ȳ1. − ȳ..) = [(N − n1)²/N²] Var(ȳ1.) + (n2²/N²) Var(ȳ2.) + . . . + (nk²/N²) Var(ȳk.)
                   = [(N − n1)²/N²](σ²/n1) + (n2²/N²)(σ²/n2) + . . . + (nk²/N²)(σ²/nk)
                   = (σ²/N²)[ (N − n1)²/n1 + n2 + n3 + . . . + nk ]
                   = (σ²/N²)[ (N² − 2Nn1 + n1²)/n1 + n2 + n3 + . . . + nk ]
                   = (σ²/N²)[ N²/n1 − 2N + N ]        since n1 + n2 + . . . + nk = N
                   = (σ²/N²)[ N²/n1 − N ]
                   = (σ²/N)(N − n1)/n1 .

In general

    Var(α̂i) = (σ²/N)(N − ni)/ni .                                             (3.7)

The confidence interval for αi is

    α̂i ± tα/2 (N − k) √var(α̂i)

that is

    (ȳi. − ȳ..) ± tα/2 (N − k) √[ (s²/N)(N − ni)/ni ] .                        (3.8)

Next,

    Var(α̂i − α̂j) = Var(ȳi. − ȳj.)
                 = Var(ȳi.) + Var(ȳj.)        since they are independent
                 = σ²/ni + σ²/nj
                 = σ²(1/ni + 1/nj) .                                          (3.9)

The confidence interval for α̂i − α̂j is

    (α̂i − α̂j) ± tα/2 (N − k) √var(α̂i − α̂j)

that is

    (ȳi. − ȳj.) ± tα/2 (N − k) √[ s²(1/ni + 1/nj) ] .                          (3.10)
Similarly, the variance of the estimator of µ + αi is

    Var(ȳi.) = σ²/ni .                                                        (3.11)

The confidence interval for µ + αi is

    (µ + αi)^ ± tα/2 (N − k) × √var( (µ + αi)^ )

that is

    ȳi. ± tα/2 (N − k) × √(s²/ni) .                                           (3.12)

From chapter 2 you will recall that σ̂² = q4/(N − k) and q4/σ² ∼ χ²N−k .

The confidence interval for σ² is given by

    1 − α = P( χ²1−α/2;N−k ≤ q4/σ² ≤ χ²α/2;N−k ) .                             (3.13)
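The intervals (3.6), (3.8), (3.10) and (3.13) are straightforward to evaluate from sample summaries. The following Python/SciPy sketch (assumed for illustration only) continues the drug data used in the earlier sketch and can be compared with example 3.4(e) below:

    # Sketch (assumed Python/SciPy) of the confidence intervals (3.6)-(3.13).
    import numpy as np
    from scipy import stats

    n_i = np.array([4, 3, 3]); N, k = n_i.sum(), len(n_i)
    ybar_i = np.array([0.11, 0.15, 0.20])
    mu_hat = (n_i * ybar_i).sum() / N
    q4 = 0.0096
    s2 = q4 / (N - k)                            # MSE
    alpha = 0.05
    t = stats.t.ppf(1 - alpha / 2, N - k)

    # (3.6): CI for mu
    ci_mu = (mu_hat - t * np.sqrt(s2 / N), mu_hat + t * np.sqrt(s2 / N))

    # (3.8): CI for alpha_1
    var_a1 = (s2 / N) * (N - n_i[0]) / n_i[0]
    a1_hat = ybar_i[0] - mu_hat
    ci_a1 = (a1_hat - t * np.sqrt(var_a1), a1_hat + t * np.sqrt(var_a1))

    # (3.10): CI for alpha_1 - alpha_3
    d_hat = ybar_i[0] - ybar_i[2]
    var_d = s2 * (1 / n_i[0] + 1 / n_i[2])
    ci_d = (d_hat - t * np.sqrt(var_d), d_hat + t * np.sqrt(var_d))

    # (3.13): CI for sigma^2 from q4/sigma^2 ~ chi-square(N - k)
    ci_s2 = (q4 / stats.chi2.ppf(1 - alpha / 2, N - k), q4 / stats.chi2.ppf(alpha / 2, N - k))
    print(ci_mu, ci_a1, ci_d, ci_s2)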

Example 3.3.
Twenty-four plots of land were divided at random into four groups of six, and subjected to the
following four treatments:

(1) (control group): no fertilizer

(2) potassium and nitrogen

(3) potassium and phosphate

(4) nitrogen and phosphate

Wheat was planted on each plot, and the yields were as follows:

(1) 66 89 62 30 51 74
(2) 72 86 95 94 74 89
Treatment
(3) 47 54 62 53 71 49
(4) 81 77 69 61 68 82

(a) Select a model for this experiment and test the null hypothesis that there is no treatment
effect (5% level).

(b) Find a 95% confidence interval for the difference between the means of treatments 1 and 2.

(c) Find a 95% confidence interval for the difference between the mean of the control group and
the mean of the other three treatments.

(d) Interpret the results.

Solution 3.3.

(a) The model is

    yij = µi + εij ;   i = 1, 2, . . . , k;   j = 1, . . . , ni

where the εij are independently n(0; σ²) distributed.

The hypotheses are:


H0 : µ1 = µ2 = µ3 = µ4
H1 : At least one of the means is different from the others.

Treatments
1 2 3 4
66 72 47 81
89 86 54 77
62 95 62 69
30 94 53 61
51 74 71 68
74 89 49 82
ni 6 6 6 6
ȳi. 62 85 56 73

N = 24    ȳ.. = 69

Now

    Σj (y1j − ȳ1.)² = (66 − 62)² + (89 − 62)² + . . . + (74 − 62)²
                    = 4² + 27² + 0² + (−32)² + (−11)² + 12²
                    = 2 034

    Σj (y2j − ȳ2.)² = (72 − 85)² + (86 − 85)² + . . . + (89 − 85)²
                    = (−13)² + 1² + 10² + 9² + (−11)² + 4²
                    = 488

    Σj (y3j − ȳ3.)² = (47 − 56)² + (54 − 56)² + . . . + (49 − 56)²
                    = (−9)² + (−2)² + 6² + (−3)² + 15² + (−7)²
                    = 404

    Σj (y4j − ȳ4.)² = (81 − 73)² + (77 − 73)² + . . . + (82 − 73)²
                    = 8² + 4² + (−4)² + (−12)² + (−5)² + 9²
                    = 346
    q4 = ΣΣ (yij − ȳi.)² = 2 034 + 488 + 404 + 346 = 3 272

    q5 = Σ ni (ȳi. − ȳ..)²
       = 6(62 − 69)² + 6(85 − 69)² + 6(56 − 69)² + 6(73 − 69)²
       = 6(−7)² + 6(16)² + 6(−13)² + 6(4²)
       = 2 940.

Thus the ANOVA table is as follows:

Source SS d.f. MS f
Between 2 940 3 980 5.9902
Within 3 272 20 163.6
Total 6 212 23

Fα;k−1;N −k = F0.05;3;20 = 3.10. Reject H0 if f > 3.1.

Since 5.9902 > 3.1, we reject H0 at the 5% level of significance and conclude that the means
are significantly different from each other.

(b) The confidence interval is

        (ȳr. − ȳs.) ± tα/2;N−k × √[ σ̂²(1/nr + 1/ns) ]

    α = 0.05,  α/2 = 0.025,  tα/2;N−k = t0.025;20 = 2.086,
    nr = ns = 6,  ȳ1. = 62,  ȳ2. = 85  and  σ̂² = 163.6.

    The 95% confidence interval for the difference between the means of treatments 1 and 2 is

        (62 − 85) ± 2.086 × √[ 163.6(1/6 + 1/6) ]
        −23 ± 2.086 × √54.5333
−23 ± 15.4044

(−38.4044 ; −7.5956).

(c) Let the control group be sample 1 and the other three groups be sample 2; then ȳ1. = 62,
    ȳ2. = 71.3333, n1 = 6 and n2 = 18.

    The 95% confidence interval for the difference between the mean of the control group and the
    mean of the other three treatments is

        (ȳr. − ȳs.) ± tα/2;N−k × √[ σ̂²(1/nr + 1/ns) ]
        (62 − 71.3333) ± 2.086 × √[ 163.6(1/6 + 1/18) ]
        −9.3333 ± 2.086 × √36.3556
        −9.3333 ± 12.5777

(−21.911 ; 3.2444).

(d) The means are 62, 85, 56 and 73. Potassium and nitrogen gave the highest yield, with
    nitrogen and phosphate the second highest. A lack of nitrogen (treatments 1 and 3) led to
    the poorest results.

Example 3.4.
Ten randomly selected mental institutions were examined to determine the effects of three different
antipsychotic drugs on patients with the same types of symptoms. Each institution used one and
only one of the three drugs exclusively for a one-year period. The proportion of treated patients in
each institution who were discharged after one year of treatment was as follows for each drug used:

Drug 1: 0.10, 0.12, 0.08, 0.14 (Ȳ1 = 0.11, S1 = 0.0258)


Drug 2: 0.12, 0.14, 0.19 (Ȳ2 = 0.15, S2 = 0.0361)
Drug 3: 0.20, 0.25, 0.15 (Ȳ3 = 0.20, S3 = 0.0500)

(a) Use the appropriate model and complete an ANOVA table.



(b) Test to see if there are any significant differences among drugs with regards to the average
proportion of patients discharged.

(c) What basic ANOVA assumptions might be violated here? How would you test if these as-
sumptions are indeed violated (no need to test, just give a description or method).

(d) Write down point estimates for the following parameters:

(i) The overall mean proportion of treated patients who were discharged for the three drugs.

(ii) µ + α1 , the mean for the first drug.

(iii) α1 , the effect of the first drug.

(iv) α1 − α3 .

(v) σ 2 .

(e) Obtain 95% confidence intervals for each of the estimates in part (d) of the question. Use the
method of multiple comparisons to obtain the confidence interval of the estimate in (d)(iv).

(f ) Use the method of multiple comparisons and test the following hypothesis: H0 : α2 = (α1 +
α3 )/2. Use α = 0.05.

Solution 3.4.

(a) One-way analysis of variance: Model I

yij = µi + εij ; i = 1, ..., k = 3; j = 1, ..., ni .

    ȳ1. = 0.11

    Σj (y1j − ȳ1.)² = (n1 − 1)s1² = (4 − 1)(0.0258)² = 0.002

    or

    Σj (y1j − ȳ1.)² = (0.10 − 0.11)² + (0.12 − 0.11)² + (0.08 − 0.11)² + (0.14 − 0.11)²
                    = (−0.01)² + (0.01)² + (−0.03)² + (0.03)²
                    = 0.002

    ȳ2. = 0.15

    Σj (y2j − ȳ2.)² = (n2 − 1)s2² = (3 − 1)(0.0361)² = 0.0026

    or

    Σj (y2j − ȳ2.)² = (0.12 − 0.15)² + (0.14 − 0.15)² + (0.19 − 0.15)²
                    = (−0.03)² + (−0.01)² + (0.04)²
                    = 0.0026

    ȳ3. = 0.2

    Σj (y3j − ȳ3.)² = (n3 − 1)s3² = (3 − 1)(0.05)² = 0.005

    or

    Σj (y3j − ȳ3.)² = (0.20 − 0.2)² + (0.25 − 0.2)² + (0.15 − 0.2)²
                    = 0² + (0.05)² + (−0.05)²
                    = 0.005

    q4 = Σ (yij − ȳi.)² = 0.002 + 0.0026 + 0.005 = 0.0096

    Since the sample sizes are unequal,

    ȳ.. = (1/N) Σi Σj yij = Σi ni ȳi. / Σi ni
        = [4(0.11) + 3(0.15) + 3(0.2)]/(4 + 3 + 3)
        = (0.44 + 0.45 + 0.6)/10
        = 1.49/10
        = 0.149

    q5 = Σi ni (ȳi. − ȳ..)²
       = 4(0.11 − 0.149)² + 3(0.15 − 0.149)² + 3(0.2 − 0.149)²
       = 0.006084 + 0.000003 + 0.007803
       = 0.01389 ≈ 0.0139

The ANOVA table is:

Source                     SS       d.f.   MS       f
Drugs (between samples)    0.0139   2      0.007    5
Error (within samples)     0.0096   7      0.0014
Total                      0.0235   9

(b) H0 : µ1 = µ2 = µ3 or H0 : α1 = α2 = α3 (= 0)

    Under H0:  f = [Between samples SS/(k − 1)] / [Within samples SS/(N − k)]
                 = [Σi ni(ȳi. − ȳ..)²/(k − 1)] / [Σi Σj (yij − ȳi.)²/(N − k)]  ∼  Fk−1;N−k

    The critical value is Fα;k−1;N−k = F0.05;2;7 = 4.74. Reject H0 if f > 4.74.

    Now, f = 0.007/0.0014 = 5.

Since 5 > 4.74, H0 is rejected. The mean levels of patients discharged for the three drugs do
differ significantly at α = 0.05.

(c) • The assumption of equal variances might be violated. Can use Bartlett’s test to check
for this assumption and if it is violated an alternative like Welch’s test may be applied
to the one-way ANOVA.

• The assumption of normality might also be violated. Can check this by graphical exami-
nation of Q-Q plots and histograms of the observations in each group and also by a test
like the Shapiro-Wilk test. If the assumption is violated one may use the Kruskal-Wallis
test instead.

(d) (i) The point estimate of the overall mean proportion of treated patients who were discharged
for the three drugs is

        µ̂ = ȳ.. = (1/N) Σi Σj yij = Σi ni ȳi. / Σi ni
           = [4(0.11) + 3(0.15) + 3(0.2)]/(4 + 3 + 3)
           = 0.149.

    (ii)

        (µ + αi)^ = ȳi.
        (µ + α1)^ = ȳ1. = 0.11

    (iii)

        α̂1 = ȳ1. − ȳ.. = 0.11 − 0.149 = −0.039
    (iv)

        (αi − αj)^ = ȳi. − ȳj.
        (α1 − α3)^ = ȳ1. − ȳ3. = 0.11 − 0.2 = −0.09

    (v)

        σ̂² = MSE = Σi Σj (yij − ȳi.)²/(N − k) = 0.0096/7 = 0.001371 ≈ 0.0014

(e) (i)

        Var(µ̂) = Var(ȳ..) = σ²/N = 0.0014/10 = 0.00014 ≈ 0.0001

    The confidence interval for µ is

        µ̂ ± tα/2;N−k × √var(µ̂),  that is,  ȳ.. ± tα/2;N−k × √var(ȳ..)

    α = 0.05, α/2 = 0.025, and tα/2;N−k = t0.025;7 = 2.365.
    The 95% confidence interval is

        ȳ.. ± tα/2;N−k × √var(ȳ..)
        0.149 ± 2.365 × √0.0001
        0.149 ± 0.0237
        (0.149 − 0.0237 ; 0.149 + 0.0237)
        (0.1253 ; 0.1727)

    (ii)

        Var((µ + αi)^) = σ²/ni = 0.0014/4 = 0.00035 ≈ 0.0004

    The confidence interval for µ + α1 is

        (µ + α1)^ ± tα/2;N−k × √var((µ + α1)^),  that is,  ȳ1. ± tα/2;N−k × √var(ȳ1.)

    α = 0.05, α/2 = 0.025, and tα/2;N−k = t0.025;7 = 2.365.

    The 95% confidence interval is

        ȳ1. ± tα/2;N−k × √var(ȳ1.)
        0.11 ± 2.365 × √0.0004
        0.11 ± 0.0473
        (0.11 − 0.0473 ; 0.11 + 0.0473)
        (0.0627 ; 0.1573).

    (iii)

        Var(α̂i) = Var(ȳi. − ȳ..) = (σ²/N)(N − ni)/ni

        var(α̂1) = (σ²/N)(N − n1)/n1
                 = (0.0014/10)(10 − 4)/4
                 = 0.0014 × 6/40
                 = 0.00021 ≈ 0.0002

    The 95% confidence interval for α1 is

        α̂1 ± tα/2;N−k √var(α̂1)
        (ȳ1. − ȳ..) ± tα/2;N−k √var(α̂1)
        (0.11 − 0.149) ± 2.365 × √0.0002
        −0.039 ± 0.0334
        (−0.039 − 0.0334 ; −0.039 + 0.0334)
        (−0.0724 ; −0.0056)

    (iv)

        Var(α̂i − α̂j) = σ²(1/ni + 1/nj)

        Var((α1 − α3)^) = σ²(1/n1 + 1/n3)
                        = 0.0014(1/4 + 1/3)
                        = 0.000817 ≈ 0.0008

    The 95% confidence interval for α1 − α3 is

        (α1 − α3)^ ± tα/2;N−k √var(α̂i − α̂j)
        (ȳ1. − ȳ3.) ± tα/2;N−k √var(ȳ1. − ȳ3.)
        (0.11 − 0.2) ± 2.365 × √0.0008
        −0.09 ± 0.0669
        (−0.09 − 0.0669 ; −0.09 + 0.0669)
        (−0.1569 ; −0.0231)
    (v)  σ̂² = q4/(N − k) = 0.0014

        q4/σ² ∼ χ²N−k

        ∴ 0.95 = P( χ²0.975;7 ≤ q4/σ² ≤ χ²0.025;7 )
               = P( 1.69 ≤ 0.0096/σ² ≤ 16.013 )
               = P( 0.0096/16.013 ≤ σ² ≤ 0.0096/1.69 )
               = P( 0.0006 ≤ σ² ≤ 0.0057 )

(f) The hypothesis to test is

        H0 : α2 = (α1 + α3)/2    or    H0 : −(1/2)α1 + α2 − (1/2)α3 = 0
        ⇒ c1 = −1/2,  c2 = 1,  c3 = −1/2.

    For multiple comparisons, H0 is rejected if (Σ ci ȳi.)²/Σ(ci²/ni) > (k − 1)σ̂² Fα;k−1;N−k .

        (Σ ci ȳi.)²/Σ(ci²/ni) = [−(1/2)(0.11) + (1)(0.15) − (1/2)(0.2)]²
                                 / [ (1/4)(1/4) + (1)(1/3) + (1/4)(1/3) ]
                              = (−0.005)²/0.479167
                              = 0.000052 ≈ 0.0001

        (k − 1)σ̂² Fα;k−1;N−k = 2 × 0.0014 × 4.74 = 0.013272 ≈ 0.0133

Since 0.0001 < 0.0133, we do not reject H0 at the 5% level of significance and conclude that
the mean of the second sample is equal to the average of the first and third samples.

3.5 One-way analysis of variance: Model II

Analysis
We now consider the problem which arises if k populations are selected at random from a large
number of populations, and a random sample is drawn from each of these populations. For sim-
plicity we shall assume that all sample sizes are equal, that is

n1 = . . . = nk = n, say.

The model is

    yij = µ + ai + εij ;  j = 1, · · · , n;  i = 1, · · · , k

where the parameters a1 , · · · , ak constitute a random sample of size k from a population of param-
eters. The following assumptions are usually made:

    εij ∼ n(0; σ²);  j = 1, · · · , n;  i = 1, · · · , k
    ai ∼ n(0; ω²);  i = 1, · · · , k

and a1 , · · · , ak and ε11 , · · · , εkn are mutually independent.

As before, let

    N = nk = total sample size

    ȳi. = (1/n) Σj yij

    ȳ.. = (1/N) Σi Σj yij = (1/k) Σi ȳi. .

From the model yij = µ + ai + εij we see that

    E(yij) = E(µ) + E(ai) + E(εij) = µ        since E(ai) = 0 and E(εij) = 0

    Var(yij) = Var(µ) + Var(ai) + Var(εij)     since the ai and εij are mutually independent
             = ω² + σ² .

Now Var(yij) = E(yij²) − (E(yij))², so that

    E(yij²) = Var(yij) + (E(yij))² = σ² + ω² + µ² .

Further,

    ȳi. = (1/n) Σj (µ + ai + εij) = µ + ai + (1/n) Σj εij

    E(ȳi.) = µ

    Var(ȳi.) = Var(ai) + (1/n²) Σj Var(εij) = ω² + σ²/n

    ∴ E(ȳi.²) = Var(ȳi.) + (E(ȳi.))² = ω² + σ²/n + µ² .

Similarly,

    ȳ.. = (1/kn) Σi Σj (µ + ai + εij) = µ + (1/k) Σi ai + (1/kn) Σi Σj εij

    E(ȳ..) = µ

    Var(ȳ..) = (1/k²) Σi Var(ai) + (1/(kn)²) Σi Σj Var(εij) = ω²/k + σ²/(kn)

    ∴ E(ȳ..²) = Var(ȳ..) + (E(ȳ..))² = ω²/k + σ²/(kn) + µ² .

Consider again the two quadratic forms

    q4 = Σi Σj (yij − ȳi.)² = Σi Σj yij² − n Σi ȳi.²

    q5 = n Σi (ȳi. − ȳ..)² = n Σi ȳi.² − nk ȳ..² .

If we write q4 = y′A4 y and q5 = y′A5 y, we see that A4 and A5 are exactly the same as in equations
(3.2) and (3.3), section 3.4. Thus A4A4 = A4 , A5A5 = A5 , A4A5 = O, r(A4) = N − k and r(A5)
= k − 1 so that q4 and q5 are multiples of chi-square variates and are independent.

Now consider the expected values of q4 and q5 :


    E(q4) = E( Σi Σj yij² − n Σi ȳi.² )
          = Σi Σj E(yij²) − n Σi E(ȳi.²)
          = kn(µ² + ω² + σ²) − nk(µ² + ω² + σ²/n)
          = (N − k)σ²                                  as before

    E(q5) = E( n Σi ȳi.² − nk ȳ..² )
          = n Σi E(ȳi.²) − nk E(ȳ..²)
          = n Σi (µ² + ω² + σ²/n) − kn(µ² + σ²/(nk) + ω²/k)
          = (k − 1)σ² + n(k − 1)ω²

    ∴ E(q4/σ²) = N − k    and    E( q5/(σ² + nω²) ) = k − 1.

From these expected values, the results above and (D8) it follows that q4/σ² and q5/(σ² + nω²) are
independent central chi-square variates with N − k and k − 1 degrees of freedom respectively. (That
these are central chi-square variates can be proved more formally by showing that, since E(yij) = µ
for all i and j, λ4 = λ5 = 0.)

In this type of model we are usually not interested specifically in a1 , · · · , ak since the k populations
were selected at random from a large number of populations. The interest is usually in the overall
mean µ and the variance ω 2 of the parameters ai . An unbiased estimator for µ is

µ̂ = ȳ..

while

E(ȳ.. ) = µ

and
    Var(ȳ..) = ω²/k + σ²/(nk) = (nω² + σ²)/N .

As was seen, an unbiased estimator for nω² + σ² is q5/(k − 1); thus

    (ȳ.. − µ) / √[ q5/(N(k − 1)) ] ∼ tk−1 .

Thus a 100(1 − α)% confidence interval for µ is given by

    ȳ.. ± tα/2;k−1 √[ q5/(N(k − 1)) ] .

A test for H0 : µ = µ0 , where µ0 is specified, may be derived in a similar manner.

To find an unbiased estimator for ω 2 , note that

    E(q4/(N − k)) = σ²
    E(q5/(k − 1)) = σ² + nω²

    ∴ E[ (1/n)( q5/(k − 1) − q4/(N − k) ) ] = ω².

Thus we write ω̂² = (1/n)( q5/(k − 1) − q4/(N − k) ).

There is a problem with this estimator: while ω 2 ≥ 0 one may sometimes find that ω̂ 2 < 0. In
such a case it is standard practice to set ω̂ 2 = 0.

A test for H0 : ω 2 = 0 may be derived as follows:


    [ (q5/(k − 1)) / (σ² + nω²) ] / [ (q4/(N − k)) / σ² ] ∼ Fk−1;N−k

    ∴ f = [q5/(k − 1)] / [q4/(N − k)] ∼ (1 + nω²/σ²) Fk−1;N−k .

Thus f ∼ Fk−1;N −k if and only if ω 2 = 0. If ω 2 > 0 we may expect f to be larger than a Fk−1;N −k
statistic, and H0 : ω 2 = 0 is rejected if f assumes large values.
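The estimate ω̂² and the F test above can be computed directly. The following Python/NumPy/SciPy sketch (assumed for illustration only) uses the machine data of example 3.5 further on, so its output can be checked against that example:

    # Sketch (assumed Python/NumPy/SciPy) of the model II estimate of omega^2
    # and the F test of H0: omega^2 = 0.
    import numpy as np
    from scipy import stats

    data = np.array([[201, 198, 211, 204],
                     [198, 196, 215, 202],
                     [209, 201, 207, 203],
                     [197, 200, 209, 206],
                     [203, 206, 208, 201],
                     [204, 199, 210, 208]], dtype=float)   # rows = units, columns = machines
    groups = data.T
    k, n = groups.shape
    N = k * n
    grand_mean = groups.mean()

    q4 = sum(((g - g.mean()) ** 2).sum() for g in groups)        # within machines
    q5 = n * sum((g.mean() - grand_mean) ** 2 for g in groups)   # between machines

    ms_within, ms_between = q4 / (N - k), q5 / (k - 1)
    omega2_hat = max((ms_between - ms_within) / n, 0.0)          # set to 0 if negative
    f = ms_between / ms_within
    print(omega2_hat, f, f > stats.f.ppf(0.95, k - 1, N - k))    # about 16.77, 9.82, True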

An approximate confidence interval for ω 2 may be found, but we do not discuss this topic. Of
greater interest are the quantities ω 2 /σ 2 , σ 2 /(ω 2 + σ 2 ) and ω 2 /(ω 2 + σ 2 ). The latter is of particular
interest: The variance of each observation yij is σ 2 + ω 2 . A portion ω 2 of this variance is ascribable
to the differences in population means, and ω 2 /(ω 2 + σ 2 ) is the proportion of the total variance
which is due to these differences.

As before, let f = [q5/(k − 1)] / [q4/(N − k)].

Then [σ²/(nω² + σ²)] f ∼ Fk−1;N−k . Let

    F1 = F1−α/2;k−1;N−k = 1/Fα/2;N−k;k−1
    F2 = Fα/2;k−1;N−k .

Then

    1 − α = P( F1 < [σ²/(nω² + σ²)] f < F2 )
          = P( f/F2 < (nω² + σ²)/σ² < f/F1 )
          = P( f/F2 < 1 + nω²/σ² < f/F1 )
          = P( −1/n + f/(nF2) < ω²/σ² < −1/n + f/(nF1) )
          = P( (f − F2)/(nF2) < ω²/σ² < (f − F1)/(nF1) )
          = P( nF1/(f − F1) < σ²/ω² < nF2/(f − F2) )
          = P( 1 + nF1/(f − F1) < 1 + σ²/ω² < 1 + nF2/(f − F2) )
          = P( (f − F1 + nF1)/(f − F1) < (ω² + σ²)/ω² < (f − F2 + nF2)/(f − F2) )
          = P( (f − F2)/(f − F2 + nF2) < ω²/(ω² + σ²) < (f − F1)/(f − F1 + nF1) )

which defines a 100(1 − α)% confidence interval for ω 2 /(ω 2 + σ 2 ).
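Continuing the previous sketch, this interval is simple to evaluate numerically (Python/SciPy, again assumed for illustration only); compare example 3.5(d):

    # Sketch (assumed Python/SciPy) of the CI for omega^2/(omega^2 + sigma^2).
    from scipy import stats

    f, n, k, N, alpha = 9.8246, 6, 4, 24, 0.10
    F1 = stats.f.ppf(alpha / 2, k - 1, N - k)        # lower point = 1/F_{alpha/2; N-k; k-1}
    F2 = stats.f.ppf(1 - alpha / 2, k - 1, N - k)    # upper alpha/2 critical value

    L1 = (f - F2) / (f - F2 + n * F2)
    L2 = (f - F1) / (f - F1 + n * F1)
    print(L1, L2)    # roughly (0.27, 0.93), as in example 3.5(d)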

Example 3.5.
A factory has a large number of machines which produce the same product. The mass of each
product unit is a random variable which varies from unit to unit; the variance of this random
variable is σ 2 . The mean mass per unit also varies from machine to machine due to the fact
that the machines are not always calibrated precisely. Suppose that four machines are selected at
random, and a sample of six units produced by each of these machines is selected randomly and
weighed. The results (mass in grams) are as follows:

Unit Machine
1 2 3 4
1 201 198 211 204
2 198 196 215 202
3 209 201 207 203
4 197 200 209 206
5 203 206 208 201
6 204 199 210 208

(a) Construct the ANOVA table.

(b) Test H0 : σa2 = 0 at the 5% level of significance.

(c) Estimate the proportion of the total variation ascribable to the machines.

(d) Find the 90% confidence interval for ω²/(σ² + ω²).

(e) Estimate the overall mean and find the 95% confidence interval for µ.

Solution 3.5.

(a) We select the model yij = µ + ai + εij ; j = 1, · · · , n; i = 1, · · · , k

    where k = 4 and n = 6, and E(ai) = E(εij) = 0, ai ∼ n(0; ω²), εij ∼ n(0; σ²) and the ai and
    εij are mutually independent. We compute

ȳ1. = 202; ȳ2. = 200; ȳ3. = 210; ȳ4. = 204; ȳ.. = 204

    Σj (y1j − ȳ1.)² = (201 − 202)² + (198 − 202)² + . . . + (204 − 202)²
                    = (−1)² + (−4)² + 7² + (−5)² + 1² + 2²
                    = 96

    Σj (y2j − ȳ2.)² = (198 − 200)² + (196 − 200)² + . . . + (199 − 200)²
                    = (−2)² + (−4)² + 1² + 0² + 6² + (−1)²
                    = 58

    Σj (y3j − ȳ3.)² = (211 − 210)² + (215 − 210)² + . . . + (208 − 210)²
                    = 1² + 5² + (−3)² + (−1)² + (−2)² + 0²
                    = 40

    Σj (y4j − ȳ4.)² = (204 − 204)² + (202 − 204)² + . . . + (208 − 204)²
                    = 0² + (−2)² + (−1)² + 2² + (−3)² + 4²
                    = 34

    q4 = Σ (yij − ȳi.)² = 96 + 58 + 40 + 34 = 228

    q5 = n Σ (ȳi. − ȳ..)²
       = 6[(202 − 204)² + (200 − 204)² + (210 − 204)² + (204 − 204)²]
       = 6[(−2)² + (−4)² + 6² + 0²]
       = 6(4 + 16 + 36 + 0)
       = 336

The ANOVA table is as follows:

Source SS d.f. MS E(MS) f


Machines 336 3 112 σ 2 + 6ω 2 9.8246
Replicates 228 20 11.4 σ2
Total 564 23

(b) H0 : σa2 = 0
Fα;k−1;N−k = F0.05;3;20 = 3.10. Reject H0 if f > 3.1.

The F-value is significant at the 5% level and our conclusion is that there is a significant
variation between machines − they should perhaps be calibrated.

(c) We have

σ̂ 2 = 11.4

σ̂ 2 + 6ω̂ 2 = 112

∴ 6ω̂ 2 = 112 − 11.4 = 100.6

    ω̂² = 16.7667

    ω̂²/(σ̂² + ω̂²) = 16.7667/(11.4 + 16.7667) ≈ 0.60.

Thus we estimate that about 60% of the variation is due to a difference in machine.

(d) The 90% confidence interval for ω 2 /(σ 2 + ω 2 ) is (L1 ; L2 ) where

        L1 = (f − F2)/(f − F2 + nF2)    and    L2 = (f − F1)/(f − F1 + nF1)

        F1 = F1−α/2;k−1;N−k = 1/Fα/2;N−k;k−1 = 1/F0.05;20;3 = 1/8.66 ≈ 0.1155
        F2 = Fα/2;k−1;N−k = F0.05;3;20 = 3.10

        L1 = (9.8246 − 3.10)/(9.8246 − 3.10 + 6(3.10)) = 6.7246/25.3246 ≈ 0.2655
        L2 = (9.8246 − 0.1155)/(9.8246 − 0.1155 + 6(0.1155)) = 9.7091/10.4021 ≈ 0.9334.

Thus, the 90% confidence interval is (0.2655; 0.9334).

(e) The estimate of the overall mean is ȳ.. = 204 and the confidence interval for µ is
        ȳ.. ± tα/2;k−1 √Var(ȳ..)

    tα/2;k−1 = t0.025;3 = 3.182. Thus, the 95% confidence interval is

        ȳ.. ± tα/2;k−1 √[ q5/(N(k − 1)) ]
        204 ± 3.182 × √(112/24)
204 ± 6.8739

(197.1261 ; 210.8739).

From the SAS JMP output in figure 3.2 below, we note that machine 3, especially, produces units
with higher mass than the others. Machine 3’s diamond does not overlap with any of the other
diamonds, thus, one can expect it to produce units significantly different from all the other ma-
chines. Note also that, like most statistical computer programs, SAS JMP does not distinguish
between random and fixed effects models − the distinction has to be made by the user.

Figure 3.2: SAS JMP output for example 3.5



Example 3.6.
A large company has a number of personnel officers, and management wants to find out whether
the personnel selection is uniform or whether the variation between personnel officers is significant
compared to the variation between candidates. Five of the personnel officers are selected at ran-
dom, and each is assigned four applicants for testing. Their ratings of the applicants are as follows:

Applicant Personnel officer


A B C D E
1 75 47 67 76 77
2 91 50 75 59 65
3 72 63 82 82 86
4 86 64 80 67 76

Select a model for this experiment

(a) and test at the 5% level whether the officers award different ratings on the average;

(b) estimate the mean rating of all candidates and construct a 90% confidence interval for it;

(c) estimate ω 2 /(σ 2 + ω 2 ) and construct a 90% confidence interval for it.

Solution 3.6.

We select the model

    yij = µ + ai + εij ;  j = 1, · · · , n;  i = 1, · · · , k

where k = 5 and n = 4, and E(ai) = E(εij) = 0, ai ∼ n(0; ω²), εij ∼ n(0; σ²) and the ai and εij
are mutually independent.

(a)

Personnel officer
A B C D E
1 75 47 67 76 77
2 91 50 75 59 65
3 72 63 82 82 86
4 86 64 80 67 76
ȳi. 81 56 76 71 76

N = nk = 20 ȳ.. = 72

The hypotheses are H0 : σa² = 0 against H1 : σa² ≠ 0.


    Σj (y1j − ȳ1.)² = (75 − 81)² + (91 − 81)² + (72 − 81)² + (86 − 81)²
                    = (−6)² + 10² + (−9)² + 5²
                    = 242

    Σj (y2j − ȳ2.)² = (47 − 56)² + (50 − 56)² + (63 − 56)² + (64 − 56)²
                    = (−9)² + (−6)² + 7² + 8²
                    = 230

    Σj (y3j − ȳ3.)² = (67 − 76)² + (75 − 76)² + (82 − 76)² + (80 − 76)²
                    = (−9)² + (−1)² + 6² + 4²
                    = 134

    Σj (y4j − ȳ4.)² = (76 − 71)² + (59 − 71)² + (82 − 71)² + (67 − 71)²
                    = 5² + (−12)² + 11² + (−4)²
                    = 306

    Σj (y5j − ȳ5.)² = (77 − 76)² + (65 − 76)² + (86 − 76)² + (76 − 76)²
                    = 1² + (−11)² + 10² + 0²
                    = 222

    q4 = Σ (yij − ȳi.)² = 242 + 230 + 134 + 306 + 222 = 1 134

    q5 = n Σ (ȳi. − ȳ..)² = n Σi ȳi.² − nk ȳ..²
       = 4(81² + 56² + 76² + 71² + 76²) − (4)(5)(72)²
       = 4(26 290) − 103 680
       = 105 160 − 103 680
       = 1 480

    or alternatively

    q5 = n Σ (ȳi. − ȳ..)²
       = 4[(81 − 72)² + (56 − 72)² + (76 − 72)² + (71 − 72)² + (76 − 72)²]
       = 4[9² + (−16)² + 4² + (−1)² + 4²]
       = 4(370)
       = 1 480

The ANOVA table is as follows:


Source SS d.f. MS E(MS) f
Personnel officers 1 480 4 370 σ 2 + 4ω 2 4.8942
Replicates 1 134 15 75.6 σ2
Total 2 614 19

Fα;k−1;N−k = F0.05;4;15 = 3.06. Reject H0 if f > 3.06.

Since 4.8942 > 3.06, H0 is rejected at the 5% level and our conclusion is that there is a
significant variation between personnel officers.

(b) The mean rating of all candidates is

        µ̂ = ȳ.. = (1/N) Σi Σj yij = (1/k) Σi ȳi.
           = (81 + 56 + 76 + 71 + 76)/5
           = 360/5
           = 72.

    The confidence interval for the overall mean is

        ȳ.. ± tα/2;k−1 × √[ q5/(N(k − 1)) ] .

    Now ȳ.. = 72, q5 = 1 480, N = 20, α = 0.10, α/2 = 0.05 and tα/2;k−1 = t0.05;4 = 2.132.

    Thus, the 90% confidence interval for the mean rating of all candidates is

        ȳ.. ± tα/2;k−1 × √[ q5/(N(k − 1)) ]
        72 ± 2.132 × √[ 1 480/(20(5 − 1)) ]
        72 ± 2.132 × √18.5
        72 ± 9.1701

(62.8299 ; 81.1701)

(c) Estimating ω²/(σ² + ω²):

    σ̂² = 75.6. Now

        σ̂² + 4ω̂² = 370
        75.6 + 4ω̂² = 370
        4ω̂² = 370 − 75.6 = 294.4
        ⇒ ω̂² = 73.6.

    Then

        ω̂²/(σ̂² + ω̂²) = 73.6/(75.6 + 73.6) = 73.6/149.2 ≈ 0.4933.

    The confidence interval for ω²/(σ² + ω²) is

        P( (f − F2)/(f − F2 + nF2) < ω²/(σ² + ω²) < (f − F1)/(f − F1 + nF1) )

    where F1 = 1/Fα/2;N−k;k−1 and F2 = Fα/2;k−1;N−k .

    Now f = 4.8942, α = 0.1, α/2 = 0.05, Fα/2;N−k;k−1 = F0.05;15;4 = 5.86,
    F1 = 1/5.86 ≈ 0.1706 and F2 = F0.05;4;15 = 3.06.

    The 90% confidence interval for ω²/(σ² + ω²) is

        0.90 = P( (f − F2)/(f − F2 + nF2) < ω²/(σ² + ω²) < (f − F1)/(f − F1 + nF1) )
             = P( (4.8942 − 3.06)/(4.8942 − 3.06 + 4(3.06)) < ω²/(σ² + ω²)
                  < (4.8942 − 0.1706)/(4.8942 − 0.1706 + 4(0.1706)) )
             = P( 1.8342/14.0742 < ω²/(σ² + ω²) < 4.7236/5.406 )
             = P( 0.1303 < ω²/(σ² + ω²) < 0.8738 ).

3.6 Two-way classification

We now consider experiments in which there are two factors (treatments or blocking factors), say
factor A with k levels A1 , · · · , Ak and factor B with m levels B1 · · · , Bm . The expected value of
an observation in the cell defined by levels Ai and Bj is µij , say. The expected values are then as
follows:

Factor A Factor B
B1 B2 Bm
A1 µ11 µ12 ··· µ1m
A2 µ21 µ22 ··· µ2m
.. ..
. .
Ak µk1 µk2 ··· µkm

An additive model specifies that the following relationship exists between these expected values:

µij = µ + αi + βj ; j = 1, · · · , m; i = 1, · · · , k.

An example of such an additive model is the following:

B1 B2 B3
A1 µ11 =10 µ12 =15 µ13 =12
A2 µ21 =20 µ22 =25 µ23 =22

The expected yield at level B2 is five units more than the expected yield at level B1 , irrespective
of the level of A(µ12 − µ11 = µ22 − µ21 = 5). Likewise, the increase in yield from A1 to A2 is ten
units, irrespective of the level of B(µ21 − µ11 = µ22 − µ12 = µ23 − µ13 = 10).

In practice it sometimes happens that the model is not additive, but that interaction is present.

Examples of interaction

The addition of calcium to the soil may have a beneficial effect on some plants, but will have a
detrimental effect on acid-loving plants. We say the factor calcium interacts with the factor type
of plant.

A specific hormone treatment may be beneficial to women but have no effect on men. (In such a
case sex and hormone treatment interact.)

Administering either drug A or drug B to a patient may be beneficial, but both drugs together
may have a detrimental effect. In other examples drug A and drug B jointly may be much more
beneficial than the sum of the effect of the two drugs separately. In both cases the factors drug A
and drug B interact.

Two chemicals may, if used separately, have very little effect on a chemical process, but jointly
they may have a profound effect.

You may think of further examples of interaction. The situation may be presented graphically as
in the following figures (figure 3.3-figure 3.6) (we assume that factor A has two levels A1 and A2
while B has three levels B1 , B2 and B3 ).

Figure 3.3:

Figure 3.4:

Figure 3.5:

Figure 3.6:

If the lines joining the expected yields at A1 and A2 are parallel (figures 3.3 and 3.4) there is no
interaction between A and B present (otherwise it is present, as in figures 3.5 and 3.6).

The presence of interaction is a factor which complicates the interpretation of main effects. In
some applications it may appear as if one or both treatments have no effect on the yield at all,
while the presence of interaction is actually an indication that both factors do have an effect on
the yield. Consider the following example:

B1 B2 B3 M ean
A1 µ11 = 10 µ12 = 15 µ13 = 20 µ1. = 15
A2 µ21 = 20 µ22 = 15 µ23 = 10 µ2. = 15
M ean µ.1 = 15 µ.2 = 15 µ.3 = 15 µ.. = 15

An analysis of variance test for the A-effect is actually a test for H0 : µ̄1. = µ̄2. = . . . = µ̄k. while a
test for the B-effect is a test for H0 : µ̄.1 = µ̄.2 = . . . = µ̄.m. . In the above example both hypotheses
are true, while it is certainly not true that the treatments have no effect at all on the yield. Other
examples may be constructed where for instance the A-effect is significant and the B-effect not
significant, while the presence of interaction implies that treatment B does influence the yield, but
the magnitude of the influence depends on the level of A.

Thus, if an analysis of variance shows that there is interaction, the possible non-significance of the
main effects (effects of A and B) must not be seen as a proof that A and/or B has no influence on
the yield. Graphs like figures 3.3 and 3.4, but with sample means rather than expected yield on
the vertical axis, may be useful in interpreting the results.

If interaction is present (or rather, unless one is sure that there is no interaction), the means are
described as follows:

µij = µ + αi + βj + (αβ)ij

where (αβ) is a symbol like α and β and does not indicate multiplication; sometimes γ is used
instead of (αβ). In this model α1 , · · · , αk are the main effects for treatment A, β1 , · · · , βm are the
main effects for treatment B and (αβ)ij ; i = 1, · · · , k; j = 1, · · · , m are the interaction effects.

As in the one-way model, certain restrictions are placed on the parameters:


Σi αi = 0;   Σj βj = 0;

Σi (αβ)ij = 0 for all j;

Σj (αβ)ij = 0 for all i;

from which it follows that

Σi Σj (αβ)ij = 0.

In the remainder of this chapter we shall assume throughout that there are equal numbers of
observations per cell. The reason for this assumption is the fact that the notation and algebra
needed to provide for unequal numbers of observations per cell are somewhat more complicated.
The correct formulae may be found in any textbook on the analysis of variance.

3.7 Two-way analysis of variance: Model I

We now consider experiments in which there are two factors (treatments or blocking factors)
present. Both factors are assumed to be fixed, and there are equal numbers of observations per
cell. We distinguish between the two cases: one observation per cell and n observations per cell
with n > 1.

A One observation per cell


The data in section 3.1.2 are an example of such an experiment. The purpose of the analysis
is to decide whether there is a significant difference between the three supermarkets. Out of
curiosity one may also test whether there is a difference between months, although that was not

the original purpose of the experiment. Since there is only one observation per cell, there is no
way of determining whether interaction exists or not (for example whether the difference between
supermarkets A and B is increasing, whether one supermarket was most expensive to begin with
and was overtaken by another supermarket, et cetera). One simply has to assume that there
is no interaction. (There is, however, a test based on the special model (αβ)ij = γαi βj to test
H0 : γ = 0. See JW Tukey (1949): One degree of freedom for non-additivity. Biometrics,
page 232.) For this reason this type of experiment should be performed only if it is impossible to
replicate the experiment for practical reasons or if one knows from past experience that interaction
has never been present in similar experiments.

The model is

yij = µ + αi + βj + εij ;  i = 1, · · · , k; j = 1, · · · , m

where Σi αi = Σj βj = 0

and the εij are independent n(0; σ²) variates.

In this formulation α1 , · · · , αk are called the effects of treatment A and β1 , · · · , βm the effects of
treatment B. The null hypotheses to be tested are usually

H0 : α1 = · · · = αk (= 0)

and H0 : β1 = · · · = βm (= 0).

Let N = km, the total number of observations,

ȳi. = (1/m) Σj yij ;  i = 1, · · · , k,

ȳ.j = (1/k) Σi yij ;  j = 1, · · · , m,

ȳ.. = (1/N) Σi Σj yij = (1/m) Σj ȳ.j = (1/k) Σi ȳi. .

The quadratic forms of interest are


q2 = Σi Σj (yij − ȳ.. )² = SSTotal

q4 = Σi Σj (yij − ȳi. − ȳ.j + ȳ.. )² = SSResidual

q5 = m Σi (ȳi. − ȳ.. )² = SSA

q6 = k Σj (ȳ.j − ȳ.. )² = SSB .

Writing

yij − ȳ.. = (yij − ȳi. − ȳ.j + ȳ.. ) + (ȳi. − ȳ.. ) + (ȳ.j − ȳ.. ),

taking squares on both sides, summing over i and j and noting that the sums of the cross-products
of the terms on the right-hand side are all equal to zero, we obtain the identity

q2 = q4 + q5 + q6

or SSTotal = SSResidual + SSA + SSB .

As with one-way analysis of variance, we may write q4 , q5 and q6 in the forms qi = y′Ai y and
show that A4 A4 = A4 , A5 A5 = A5 , A6 A6 = A6 , A4 A5 = A4 A6 = A5 A6 = O, r(A4 ) =
(k − 1)(m − 1), r(A5 ) = k − 1, r(A6 ) = m − 1. This is left as an exercise for you to do.
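For readers who want to check this exercise numerically, the sketch below (illustrative only, not part of the prescribed derivation) builds the matrices A2 , A4 , A5 and A6 as Kronecker products for small k and m, assuming the observations are stacked as (y11 , . . . , y1m , y21 , . . . , ykm ), and verifies the idempotency, orthogonality and rank claims.

# Numerical check of the quadratic-form matrices for the two-way layout
# with one observation per cell (illustrative sketch).
import numpy as np

k, m = 4, 3
J_k, J_m = np.ones((k, k)) / k, np.ones((m, m)) / m   # averaging matrices
C_k, C_m = np.eye(k) - J_k, np.eye(m) - J_m           # centring matrices

# y assumed stacked as (y11,...,y1m, y21,...,y2m, ..., yk1,...,ykm)
A5 = np.kron(C_k, J_m)        # q5 = m * sum_i (ybar_i. - ybar_..)^2
A6 = np.kron(J_k, C_m)        # q6 = k * sum_j (ybar_.j - ybar_..)^2
A4 = np.kron(C_k, C_m)        # q4 = residual sum of squares
A2 = np.eye(k * m) - np.ones((k * m, k * m)) / (k * m)   # q2 = total SS

for A in (A4, A5, A6):
    assert np.allclose(A @ A, A)                       # idempotent
assert np.allclose(A4 @ A5, 0) and np.allclose(A4 @ A6, 0) and np.allclose(A5 @ A6, 0)
assert np.allclose(A2, A4 + A5 + A6)                   # q2 = q4 + q5 + q6
print([np.linalg.matrix_rank(A) for A in (A4, A5, A6)])  # (k-1)(m-1), k-1, m-1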

Thus q4 /σ², q5 /σ² and q6 /σ² are independently distributed as noncentral chi-square variates with
(k − 1)(m − 1), (k − 1) and (m − 1) degrees of freedom respectively and noncentrality parameters

λ4 = 0

λ5 = m Σi αi² /σ²

λ6 = k Σj βj² /σ² .

We shall show how λ6 is found; λ4 and λ5 can be found similarly. We have

yij = µ + αi + βj + ij

k
1X
ȳ.j = (µ + αi + βj + ij )
k 1
k k
1X 1X
= µ+ αi + βj + ij
k 1 k 1
k k
1X X
= µ + βj + εij since αi = 0.
k 1 1

k
1X
E(ȳ.j ) = E(µ + βj + εij )
k 1
k
1X
= E(µ) + E(βj ) + E(εij )
k 1
∴ E(ȳ.j ) = µ + βj

k
1X
Var(ȳ.j ) = Var(µ + βj + εij )
k 1
k
1 X
= V ar(µ) + V ar(βj ) + 2 V ar(εij )
k 1
1 2
=

k2
Var(ȳ.j ) = σ 2 /k

∴ E(ȳ.j2 ) = V ar(ȳ.j ) + (E(ȳ.j ))2

= σ 2 /k + (µ + βj )2 .

Also
k m
1 XX
ȳ.. = (µ + αi + βj + ε)
km i=1 j=1
k m k m
1X 1 X 1 XX
= µ+ αi + βj + ij
k i=1 m j=1 km i=1 j=1
1 XX
= µ+ ij
km i j
!
1 XX
E(ȳ.. ) = E µ+ ij
km i j
= µ

!
1 XX
Var(ȳ.. ) = Var µ + ij
km i j
1 XX
= Var(µ) + Var(ij )
(km)2 i j
1
= kmσ 2
(km)2
= σ 2 /(km)

E(ȳ..2 ) = Var(ȳ.. ) + (E(y.. ))2

= σ 2 /(km) + µ2

Now
m
X
q6 = k (ȳ.j − ȳ.. )2
j=1
Xm
= k( ȳ.j2 − mȳ..2 )
j=1

m
X
∴ E(q6 ) = k (σ 2 /k + (µ + βj )2 ) − km(µ2 + σ 2 /(km))
j=1
Xm
= k (σ 2 /k + µ2 + 2µβj + βj2 ) − kmµ2 − σ 2
j=1
m
X m
X
2 2
= mσ + kmµ + 2µk βj + k βj2 − kmµ2 − σ 2
j=1 j=1
Xm
= mσ 2 + kmµ2 + k βj2 − kmµ2 − σ 2
j=1
Xm
= (mσ 2 − σ 2 ) + k βj2
j=1
X m m
X
= (m − 1)σ 2 + k βj2 (remember that βj = 0)
j=1 j=1

m
X
∴ E(q6 /σ 2 ) = (m − 1) + k βj2 /σ 2
j=1
m
X
∴ λ6 = k βj2 /σ 2 from (D8).
j=1

The ANOVA table is as follows:

Source     SS                                       d.f.             MS                      E(MS)
A          q5 = m Σi [ȳi. − ȳ.. ]²                  k − 1            q5 /(k − 1)             σ² + m Σ αi² /(k − 1)
B          q6 = k Σj [ȳ.j − ȳ.. ]²                  m − 1            q6 /(m − 1)             σ² + k Σ βj² /(m − 1)
Residual   q4 = Σi Σj [yij − ȳi. − ȳ.j + ȳ.. ]²     (k − 1)(m − 1)   q4 /((k − 1)(m − 1))    σ²
Total      q2 = Σi Σj [yij − ȳ.. ]²                 km − 1

Thus H0 : α1 = . . . = αk (= 0) is rejected at the α-level if

[q5 /(k − 1)] / [q4 /((k − 1)(m − 1))] > Fα;k−1;(k−1)(m−1) .

Likewise H0 : β1 = . . . = βm (= 0) is rejected at the α-level if

[q6 /(m − 1)] / [q4 /((k − 1)(m − 1))] > Fα;m−1;(k−1)(m−1) .
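These two statistics are easy to compute for any k × m table. The following Python function is an illustrative sketch (it assumes NumPy and SciPy are available; the function name is ours, not from the guide). For example 3.7 below, y would hold the 4 × 3 table of month-by-supermarket prices from section 3.1.2.

# Two-way ANOVA with one observation per cell (sketch).
import numpy as np
from scipy.stats import f as f_dist

def two_way_anova_one_obs(y, alpha=0.05):
    y = np.asarray(y, dtype=float)          # k x m table: rows = levels of A, columns = levels of B
    k, m = y.shape
    grand = y.mean()
    row_means, col_means = y.mean(axis=1), y.mean(axis=0)
    SSA = m * np.sum((row_means - grand) ** 2)
    SSB = k * np.sum((col_means - grand) ** 2)
    SSTotal = np.sum((y - grand) ** 2)
    SSRes = SSTotal - SSA - SSB
    dfA, dfB, dfRes = k - 1, m - 1, (k - 1) * (m - 1)
    fA = (SSA / dfA) / (SSRes / dfRes)
    fB = (SSB / dfB) / (SSRes / dfRes)
    return {"SSA": SSA, "SSB": SSB, "SSRes": SSRes, "fA": fA, "fB": fB,
            "reject A": fA > f_dist.ppf(1 - alpha, dfA, dfRes),
            "reject B": fB > f_dist.ppf(1 - alpha, dfB, dfRes)}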

Example 3.7.
Using data in section 3.1.2:

(a) Construct an ANOVA table.

(b) Using α = 0.05, test whether the

(i) months have an effect on the average price.

(ii) supermarkets have an effect on the average price.

Solution 3.7.

(a) The means are as follows:

Factor B: Supermarkets

ȳ.1 = 31

ȳ.2 = 25 (the cheapest)

ȳ.3 = 34

Grand mean

ȳ.. = 30

Factor A: Months

ȳ1. = 27

ȳ2. = 28

ȳ3. = 31

ȳ4. = 34

We see that there is a consistent increase in the average price from month to month. (This
may of course be a result of a real price increase and/or an expanding shopping list.)
SSA = m Σi (ȳi. − ȳ.. )²
    = 3[(27 − 30)² + (28 − 30)² + (31 − 30)² + (34 − 30)²]
    = 3((−3)² + (−2)² + 1² + 4²)
    = 90

SSB = k Σj (ȳ.j − ȳ.. )²
    = 4[(31 − 30)² + (25 − 30)² + (34 − 30)²]
    = 4(1² + (−5)² + 4²)
    = 168

SSTotal = Σi Σj (yij − ȳ.. )²
        = (34 − 30)² + (18 − 30)² + . . . + (37 − 30)²
        = 384

SSResidual = SSTotal − SSA − SSB
           = 384 − 90 − 168
           = 126

Actually it is advisable to compute SSResidual directly and use it as a test of the accuracy of
the computations.

SSResidual = (34 − 31 − 27 + 30)² + (18 − 25 − 27 + 30)² + . . . + (37 − 34 − 34 + 30)²
           = 126

The ANOVA table is as follows:

Source     SS     d.f.   MS    E(MS)            f
B          168     2     84    σ² + 2 Σ βj²     84/21 = 4
A           90     3     30    σ² + Σ αi²       30/21 = 1.4286
Residual   126     6     21    σ²
Total      384    11

(b)(i) Test for A-effects (months): H0 : α1 = α2 = α3 = α4 = 0


Fα;k−1;(k−1)(m−1) = F0,05;3;6 = 4.76. Reject H0 if f > 4.76.

Since 4 < 4.76, we do not reject H0 at the 5% level of significance.


∴ A-effects not significant.

(b)(ii) Test for B-effects (supermarkets): H0 : β1 = β2 = β3 = 0


Fα;m−1;(k−1)(m−1) = F0,05;2;6 = 5.14. Reject H0 if f > 5.14.

Since 1.4286 < 5.14, we do not reject H0 at the 5% level of significance.


∴ B-effects not significant.

The SAS JMP output which follows, shows that supermarket 2 (B) had the lowest prices on aver-
age, while the prices rose steadily over the four months.

Figure 3.7: SAS JMP output for example 3.7

Figure 3.8: SAS JMP output for example 3.7

In such small experiments there must be a considerable difference between level means before
a significant result is obtained. The housewife may now decide to continue the experiment for a
few more months, maybe trying to eliminate some of the sources of variation. She is, however,
more likely to decide that she will buy at the second supermarket in future.

The residual which may be examined for signs of deviations from the model is

yij − ŷij = yij − ȳi. − ȳ.j + ȳ.. .

Example 3.8.
In order to decide in which of three types of saucepans water will boil the quickest, the three types
of saucepans were tested on four types of stoves in a Domestic Science laboratory. A fixed amount
of water at room temperature was placed in each saucepan, and the time to boil (in seconds, with
timing started exactly three minutes after the stoves had been switched on) recorded. The results are:

Stoves Saucepans
I II III
1 21 22 23
2 23 33 43
3 25 33 26
4 31 36 44

Test

(a) the null hypothesis that the saucepans are not different;

(b) the null hypothesis that the stoves are not different (with respect to boiling speed).

Use the 5% level of significance in each case.

Solution 3.8.

(a) The model is yij = µ + αi + βj + εij ;  i = 1, · · · , k; j = 1, · · · , m

    where Σi αi = Σj βj = 0

    and the εij are independent n(0; σ²) variates.

Saucepans
Stoves I II III ȳi.
1 21 22 23 22
2 23 33 43 33
3 25 33 26 28
4 31 36 44 37
ȳ.j 25 31 34

The means are as follows:

Factor A: Stoves

ȳ1. = 22 ȳ2. = 33 ȳ3. = 28 and ȳ4. = 37

Factor B: Saucepans

ȳ.1 = 25, ȳ.2 = 31 and ȳ.3 = 34

Grand mean

ȳ.. = 30

SSA = m Σi (ȳi. − ȳ.. )²
    = 3[(22 − 30)² + (33 − 30)² + (28 − 30)² + (37 − 30)²]
    = 3((−8)² + 3² + (−2)² + 7²)
    = 378

SSB = k Σj (ȳ.j − ȳ.. )²
    = 4[(25 − 30)² + (31 − 30)² + (34 − 30)²]
    = 4((−5)² + 1² + 4²)
    = 168

SSResidual = Σi Σj (yij − ȳi. − ȳ.j + ȳ.. )²
           = (21 − 22 − 25 + 30)² + (22 − 22 − 31 + 30)² + . . . + (44 − 34 − 37 + 30)²
           = 4² + (−1)² + (−3)² + (−5)² + (−1)² + 6² + 2² + . . . + (−2)² + 3²
           = 158

The ANOVA table is as follows:

Source     SS     d.f.   MS         E(MS)                      f
A          378     3     126        σ² + m Σ αi² /(k − 1)      4.7848
B          168     2      84        σ² + k Σ βj² /(m − 1)      3.1899
Residual   158     6     26.3333    σ²
Total      704    11

H0 : α1 = α2 = α3 = α4 = 0
Fα;k−1;(k−1)(m−1) = F0,05;3;6 = 4.76. Reject H0 if f > 4.76. Since 4.7848 > 4.76, we reject
H0 at the 5% level of significance and conclude that the stoves are significantly different with
respect to boiling speed.

(b) H0 : β1 = β2 = β3 = 0
Fα;m−1;(k−1)(m−1) = F0,05;2;6 = 5.14. Reject H0 if f > 5.14.

Since 3.1899 < 5.14, we do not reject H0 at the 5% level of significance and conclude that
the saucepans do not differ significantly with respect to boiling speed.

3.7.1 Illustration

The data in section 3.3.1 are an example of a mixed model (diets a fixed effect and litters a random
effect) but we cannot distinguish between the analysis of the three types of model if there is one
observation per cell only. The SAS JMP output (see figures 3.9 and 3.10) actually assumes a fixed
effects model, but the results which follow are easy to interpret.

Figure 3.9: SAS JMP output for data in section 3.3.1

Figure 3.10: SAS JMP output for data in section 3.3.1



B More than one observation per cell

Consider the model

yijh = µ + αi + βj + (αβ)ij + ijh ,

i = 1, · · · , k; j = 1, · · · , m; h = 1, · · · , n;

where Σi αi = Σj βj = 0,

Σi (αβ)ij = 0 for all j,

Σj (αβ)ij = 0 for all i,

and where the εijh are independent n(0; σ²) variates.

µ = overall mean response; the average of the mean responses for the km populations.

αi = effect of the i-th level of the first factor averaged over the m levels of the second factor,
(the i-th level of the first factor adds αi to the overall mean µ).

βj = effect of the j-th level of the second factor averaged over the k levels of the first factor.

(αβ)ij = interaction between the i-th level of the first factor and the j-th level
of the second factor (the population mean for the ij-th treatment minus (µ + αi + βj )).

εijh = deviation of yijh from the population mean response for the ij-th cell.

The terms αi and βj are called main effects. The term (αβ)ij is an interaction.

The null hypotheses usually tested are

H0 : α1 = . . . = αk (= 0)

(if all αi are equal and Σαi = 0 then each must be equal to zero)

H0 : β1 = . . . = βm (= 0)

H0 : (αβ)11 = . . . = (αβ)km (= 0).


Let N = kmn be the total number of observations,

ȳij. = (1/n) Σh yijh ;  i = 1, · · · , k; j = 1, · · · , m   (the cell means)

ȳi.. = (1/(mn)) Σj Σh yijh = (1/m) Σj ȳij. ;  i = 1, · · · , k   (the level means of the levels of factor A)

ȳ.j. = (1/(kn)) Σi Σh yijh = (1/k) Σi ȳij. ;  j = 1, · · · , m   (the level means of the levels of factor B)

ȳ... = (1/N) Σi Σj Σh yijh = (1/m) Σj ȳ.j. = (1/k) Σi ȳi.. = (1/(km)) Σi Σj ȳij.   (the overall mean)

The quadratic forms of interest are


SSA = mn Σi (ȳi.. − ȳ... )²

SSB = kn Σj (ȳ.j. − ȳ... )²

SSAB = n Σi Σj (ȳij. − ȳ.j. − ȳi.. + ȳ... )²

SSWS = SSWithin samples = Σi Σj Σh (yijh − ȳij. )²

SSTotal = Σi Σj Σh (yijh − ȳ... )² .

Activity 3.1.
Prove as an exercise that SSTotal = SSA + SSB + SSAB + SSWithin samples .

(This equality holds only if the number of observations per cell is the same for all cells.)

These SSs may also be written in the form of y 0 Ai y and in this way it may be proved that the four
SSs on the right-hand side are independent with degrees of freedom and sums of squares as given
in the following ANOVA table:

Source   SS       d.f.             MS                        E(MS)
A        SSA      k − 1            SSA /(k − 1)              σ² + mn Σ αi² /(k − 1)
B        SSB      m − 1            SSB /(m − 1)              σ² + kn Σ βj² /(m − 1)
AB       SSAB     (k − 1)(m − 1)   SSAB /((k − 1)(m − 1))    σ² + n ΣΣ (αβ)ij² /((k − 1)(m − 1))
Error    SSWS     km(n − 1)        SSWS /(km(n − 1))         σ²
Total    SSTot    kmn − 1

We write σ̂ 2 = SSWS /(km(n − 1)).

Example 3.9.
Derive the formulae for E(M S) for the A effect, ie, E(M SA ).

Solution 3.9.

k
X
SSA = mn (ȳi.. − ȳ... )2
i=1
Xk
2 2
= mn ȳi.. − kmnȳ...
i=1

yijh = µ + αi + βj + (αβ)ij + ijh

m n
1 XX
ȳi.. = (µ + αi + βj + (αβ)ij + ijh )
mn j=1 h=1
m m m n
1 X 1 X 1 XX
= µ + αi + βj + (αβ)ij + ijh
m j=1 m j=1 mn j=1 h=1

m m m n
!
1 X 1 X 1 XX
E(ȳi.. ) = E µ + αi + βj + (αβ)ij + ijh
m j=1 m j=1 mn j=1 h=1
m m m n
1 X 1 X 1 XX
= E(µ) + E(αi ) + E(βj ) + E((αβ)ij ) + E(ijh )
m j=1 m j=1 mn j=1 h=1
= µ + αi

m m m n
!
1 X 1 X 1 XX
V ar(ȳi.. ) = V ar µ + αi + βj + (αβ)ij + ijh
m j=1 m j=1 mn j=1 h=1
m m m n
1 X 1 X 1 XX
= V ar(µ) + V ar(αi ) + 2 V ar(βj ) + 2 V ar((αβ)ij ) + V ar(ijh )
m j=1 m j=1 (mn)2 j=1 h=1
σ2
=
mn

2
E(ȳi.. ) = V ar(ȳi.. ) + E(ȳi.. )2
σ2
= + (µ + αi )2
mn

k m n
1 XXX
ȳ... = (µ + αi + βj + (αβ)ij + ijh )
kmn i=1 j=1 h=1
k m k m k m n
1X 1 X 1 XX 1 XXX
= µ+ αi + βj + (αβ)ij + ijh
k i=1 m j=1 km i=1 j=1 kmn i=1 j=1 h=1

k m k m k m n
!
1X 1 X 1 XX 1 XXX
E(ȳ... ) = E µ+ αi + βj + (αβ)ij + ijh
k i=1 m j=1 km i=1 j=1 kmn i=1 j=1 h=1
k m k m k m n
1X 1 X 1 XX 1 XXX
= E(µ) + E(αi ) + E(βj ) + E((αβ)ij ) + E(ijh )
k i=1 m j=1 km i=1 j=1 kmn i=1 j=1 h=1
= µ
k m k m k m n
!
1X 1 X 1 XX 1 XXX
V ar(ȳ... ) = V ar µ + αi + βj + (αβ)ij + ijh
k i=1 m j=1 km i=1 j=1 kmn i=1 j=1 h=1
k m k m
1 X 1 X 1 XX
= V ar(µ) + 2 V ar(αi ) + 2 V ar(βj ) + V ar((αβ)ij )
k i=1 m j=1 (km)2 i=1 j=1
k X m X n
1 X
+ V ar(ijh )
(kmn)2 i=1 j=1 h=1
k m k m
1 X 1 X 1 XX
= 2 V ar(αi ) + 2 V ar(βj ) + V ar(αβ)ij
k i=1 m j=1 (km)2 i=1 j=1
k X m X n
1 X
+ V ar(ijh )
(kmn)2 i=1 j=1 h=1
σ2
=
kmn

2
E(ȳ... ) = V ar(ȳ... ) + E(ȳ... )2
σ2
= + µ2
kmn

k
!
X
E(SSA ) = E mn (ȳi.. − ȳ... )2
i=1
k
X
2 2
= mn E(ȳi.. ) − kmnE(ȳ... )
i=1
k 
σ2
  2 
X
2 σ 2
= mn + (µ + αi ) − kmnE +µ
i=1
mn kmn
k
kmnσ 2 X kmnσ 2
= + mn (µ2 + 2µαi + αi2 ) − − kmnµ2
mn i=1
kmn
k
X k
X
2 2
= kσ + kmnµ + 2mnµ αi + mn αi2 − σ 2 − kmnµ2
i=1 i=1
k
X k
X
2
= σ (k − 1) + mn αi2 since αi = 0
i=1 i=1

E(SSA )
E(M SA ) =
k−1
σ 2 (k − 1) + mn ki=1 αi2
P
=
k−1
k
2 mn X 2
= σ + α
k − 1 i=1 i

Activity 3.2.
Derive the other E(MS) as an exercise.

The three null hypotheses are tested by

SSA /(k − 1) SSB /(m − 1) SSAB /(k − 1)(m − 1)


; ; .
SSWS /km(n − 1) SSWS /km(n − 1) SSWS /km(n − 1)

Remarks:

Multiple comparisons are performed as before. For example:

H0 : βr = βs is rejected if
(ȳ.r. − ȳ.s. )2
> (m − 1)σ̂ 2 Fα;m−1;km(n−1) .
(2/kn)

Example 3.10.
In order to test whether the method of display has an effect on bread sales, a bakery selected 12
comparable supermarkets and requested each to display its bread according to a specification of shelf
height (bottom, middle and top) and width of shelf (regular and wide). The bread sales were as
follows:

Factor A Factor B (display width)


(display height) B1 (regular) B2 (wide)
A1 (Bottom) 47 46
43 40
A2 (Middle) 62 67
68 71
A3 (Top) 41 42
39 46

Test the three null hypotheses (regarding interaction, A-effect and B-effect), each at the 5% level.

Solution 3.10.

The model is yijh = µ + αi + βj + (αβ)ij + ijh ,

i = 1, · · · , k; j = 1, · · · , m; h = 1, · · · , n;

where Σi αi = Σj βj = 0,

Σi (αβ)ij = 0 for all j,

Σj (αβ)ij = 0 for all i,

and where the εijh are independent n(0; σ²) variates.

Factor A Factor B(display width)


(display height) B1 (regular) B2 (wide) ȳi..
A1 (Bottom) 47 46 44
43 40
ȳ11. = 45 ȳ12. = 43
A2 (Middle) 62 67 67
68 71
ȳ21. = 65 ȳ22. = 69
A3 (Top) 41 42 42
39 46
ȳ31. = 40 ȳ32. = 44
ȳ.j. 50 52

ȳ... = 51 k=3 m = 2 and n = 2

SSA = mn Σi (ȳi.. − ȳ... )²
    = 2(2)[(44 − 51)² + (67 − 51)² + (42 − 51)²]
    = 4[(−7)² + (16)² + (−9)²]
    = 1 544

SSB = kn Σj (ȳ.j. − ȳ... )²
    = 3(2)[(50 − 51)² + (52 − 51)²]
    = 6[(−1)² + 1²]
    = 12

SSAB = n Σi Σj (ȳij. − ȳ.j. − ȳi.. + ȳ... )²
     = 2[(45 − 50 − 44 + 51)² + (43 − 52 − 44 + 51)² + . . . + (44 − 52 − 42 + 51)²]
     = 2[2² + (−2)² + (−1)² + 1² + (−1)² + 1²]
     = 24

SSWS = SSWithin samples = Σi Σj Σh (yijh − ȳij. )²
     = (47 − 45)² + (43 − 45)² + (46 − 43)² + . . . + (42 − 44)² + (46 − 44)²
     = 2² + (−2)² + 3² + (−3)² + (−3)² + 3² + (−2)² + 2² + 1² + (−1)² + (−2)² + 2²
     = 62

SSTotal = Σi Σj Σh (yijh − ȳ... )²
        = (47 − 51)² + (43 − 51)² + (46 − 51)² + . . . + (42 − 51)² + (46 − 51)²
        = (−4)² + (−8)² + (−5)² + (−11)² + (11)² + (17)² + (16)² + (20)² + . . . + (−5)²
        = 1 642

The ANOVA table is as follows:

Source   SS       d.f.   MS         E(MS)                                       f
A        1 544     2     772        σ² + mn Σ αi² /(k − 1)                      74.7099
B           12     1      12        σ² + kn Σ βj² /(m − 1)                       1.1613
AB          24     2      12        σ² + n ΣΣ (αβ)ij² /((k − 1)(m − 1))          1.1613
Error       62     6     10.3333    σ²
Total    1 642    11

Test regarding interaction: H0 : (αβ)11 = (αβ)12 = (αβ)21 = . . . = (αβ)32 (= 0)


Fα;(k−1)(m−1);km(n−1) = F0,05;2;6 = 5.14. Reject H0 if f > 5.14. Since 1.1613 < 5.14, we do not
reject H0 at the 5% level of significance and conclude that there is no interaction.

Test regarding A-effect: H0 : α1 = α2 = α3 = 0


Fα;k−1;km(n−1) = F0,05;2;6 = 5.14. Reject H0 if f > 5.14. Since 74.7099 > 5.14, we reject H0 at the
5% level of significance and conclude that there is a significant difference in sales due to display
height.

Test regarding B-effect: H0 : β1 = β2 = 0


Fα;m−1;km(n−1) = F0,05;1;6 = 5.99. Reject H0 if f > 5.99. Since 1.1613 < 5.99, we do not reject H0
at the 5% level of significance and conclude that there is no significant difference in sales due to
display width.
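The whole of solution 3.10 can be reproduced with a few lines of NumPy. The sketch below (illustrative only, not prescribed code) stores the bread sales as a k × m × n array and recomputes the sums of squares and the three F statistics of the ANOVA table above.

# Two-way ANOVA with replication for example 3.10 (sketch).
import numpy as np

# y[i, j, h] = h-th observation at display height i and display width j
y = np.array([[[47, 43], [46, 40]],    # A1: bottom shelf
              [[62, 68], [67, 71]],    # A2: middle shelf
              [[41, 39], [42, 46]]],   # A3: top shelf
             dtype=float)
k, m, n = y.shape
grand = y.mean()
cell = y.mean(axis=2)                        # ybar_ij.
a_mean = y.mean(axis=(1, 2))                 # ybar_i..
b_mean = y.mean(axis=(0, 2))                 # ybar_.j.

SSA = m * n * np.sum((a_mean - grand) ** 2)                                  # 1544
SSB = k * n * np.sum((b_mean - grand) ** 2)                                  # 12
SSAB = n * np.sum((cell - a_mean[:, None] - b_mean[None, :] + grand) ** 2)   # 24
SSWS = np.sum((y - cell[:, :, None]) ** 2)                                   # 62
MSE = SSWS / (k * m * (n - 1))
fA = (SSA / (k - 1)) / MSE                           # ~74.71
fB = (SSB / (m - 1)) / MSE                           # ~1.16
fAB = (SSAB / ((k - 1) * (m - 1))) / MSE             # ~1.16
print(SSA, SSB, SSAB, SSWS, round(fA, 4), round(fB, 4), round(fAB, 4))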

3.8 Two-way analysis of variance: Model II

We now assume that the k levels of factor A constitute a random sample from a large population
of possible levels of A, and the m levels of factor B constitute a random sample from a large
population of possible levels of B.

The model is

yijh = µ + ai + bj + (ab)ij + ijh ;



i = 1, · · · , k; j = 1, · · · , m; h = 1, · · · , n;

where

E(ai ) = E(bj ) = E((ab)ij ) = 0 for all i and j;

ai ∼ n(0; σa² );   bj ∼ n(0; σb² );   (ab)ij ∼ n(0; σab² )   and   εijh ∼ n(0; σ² )

and where a1 , · · · , ak , b1 , · · · , bm , (ab)11 , · · · , (ab)km and ε111 , · · · , εkmn are mutually independent.

As with model II one-way analysis of variance, the parameters of prime interest are µ, σa² , σb² and
σab² . We use the same notation as in section 3.7, but the expected mean squares in the ANOVA
table are now different.

Source   SS      d.f.             MS                        E(MS)
A        SSA     k − 1            SSA /(k − 1)              σ² + nσab² + mnσa²
B        SSB     m − 1            SSB /(m − 1)              σ² + nσab² + knσb²
AB       SSAB    (k − 1)(m − 1)   SSAB /((k − 1)(m − 1))    σ² + nσab²
Error    SSWS    km(n − 1)        SSWS /(km(n − 1))         σ²
Total    SSTot   kmn − 1

We will derive the E(M SB ), and the other E(MS) are left as an exercise for you to do.

Example 3.11.
Derive E(M SB ).

Solution 3.11.

m
X
SSB = kn (ȳ.j. − ȳ... )2
j=1
Xm
2 2
= kn ȳ.j. − kmnȳ...
j=1

ȳijh = µ + ai + bj + (ab)ij + ijh



k n
1 XX
ȳ.j. = (µ + ai + bj + (ab)ij + ijh )
kn i=1 h=1
k k k n
1X 1X 1 XX
ȳ.j. = µ + ai + b j + (ab)ij + εijh
k i=1 k i=1 kn i=1 h=1

ȳ.j. = µ + ā. + bj + (āb).j + ε̄.j.

E(ȳ.j. ) = E(µ + ā. + bj + (āb).j + ε̄.j. )

= µ

V ar(ȳ.j. ) = V ar(µ + ā. + bj + (āb).j + ε̄.j. )


σ2 σ2 σ2
= a + σb2 + ab +
k k kn
Now
2
V ar(ȳ.j. ) = E(ȳ.j. ) − (E(ȳ.j. ))2

2
=⇒ E(ȳ.j. ) = V ar(ȳ.j. ) + (E(ȳ.j. ))2
σa2 σ2 σ2
= + σb2 + ab + + µ2
k k kn
k m k m k m n
1X 1 X 1 XX 1 XXX
ȳ... = µ + ai + bj + (ab)ij + εijh
k i=1 m j=1 km i=1 j=1 kmn i=1 j=1 h=1
¯ .. + ε̄...
ȳ... = µ + ā. + b̄. + (ab)

¯ .. + ε̄... )
E(ȳ... ) = E(µ + ā. + b̄. + (ab)

= µ

¯ .. + ε̄... )
V ar(ȳ... ) = V ar(µ + ā. + b̄. + (ab)
σ2 σ2 σ2 σ2
= a + b + ab + .
k m km kmn
Now
2
V ar(ȳ... ) = E(ȳ... ) − (E(ȳ... ))2

2
=⇒ E(ȳ... ) = V ar(ȳ... ) + (E(ȳ... ))2
σ2 σ2 σ2 σ2
= a + b + ab + + µ2 .
k m km kmn
Then
X
SSB = kn y 2.j. − kmny 2...
j

X
E(SSB) = E(kn y 2.j. − kmny 2... )
j
X
= kn E(y 2.j. ) − kmnE(y 2... )
j
X  σ2 σ2 σ2

a
= kn + σb2 + ab + + µ2
j
k k kn
σa2 σb2 2
σ2
 
σab 2
−kmn + + + +µ
k m km kmn
σa2 2
2
σab σ2
= kmn + kmnσb + kmn + kmn
k k kn
2 2
σ σ σ2
+kmnµ2 − kmn a − kmn b − kmn ab
k m km
σ2
−kmn − kmnµ2
kmn
= mnσa + kmnσb2 + mnσab
2 2
+ mσ 2

−mnσa2 − knσb2 − nσab


2
− σ2

= (m − 1)knσb2 + (m − 1)nσab
2
+ (m − 1)σ 2

= (m − 1)σ 2 + n(m − 1)σab


2
+ kn(m − 1)σb2

E(SSB )
E(M SB ) =
m−1

(m − 1)σ 2 + n(m − 1)σab


2
+ kn(m − 1)σb2
=
m−1
2 2 2
∴ E(M SB ) = σ + nσab + knσb .

As in one-way model II, SSA , SSB , SSAB and SSWS are independent multiples of central chi-square
variates.

If σab² = 0, f = [SSAB /((k − 1)(m − 1))] / [SSWS /(km(n − 1))]

is an F(k−1)(m−1);km(n−1) variate, and this fact is used to test

H0 : σab² = 0.

If σa² = 0, f = [SSA /(k − 1)] / [SSAB /((k − 1)(m − 1))]

is an Fk−1;(k−1)(m−1) variate and this fact is used to test

H0 : σa² = 0.

(Note: In model I, SSA is compared to SSWS , but here SSA is compared to SSAB ; the replicates
in the cells are not utilised when H0 : σa² = 0 is tested.)

Thirdly, if σb² = 0, f = [SSB /(m − 1)] / [SSAB /((k − 1)(m − 1))]

is an Fm−1;(k−1)(m−1) variate. This fact is used to test

H0 : σb² = 0.

The unbiased estimators for σ², σa² , σb² and σab² are

σ̂² = SSWS /(km(n − 1))

σ̂a² = ( SSA /(k − 1) − SSAB /((k − 1)(m − 1)) ) /(mn)

σ̂b² = ( SSB /(m − 1) − SSAB /((k − 1)(m − 1)) ) /(kn)

σ̂ab² = ( SSAB /((k − 1)(m − 1)) − SSWS /(km(n − 1)) ) /n.

As in one-way analysis of variance, one may find confidence intervals for the three variance com-
ponents, but this subject is not dealt with in this module.

An unbiased estimator of µ is

µ̂ = ȳ...

while Var(µ̂) = σ²/(kmn) + σab²/(km) + σa²/k + σb²/m.

For the estimator V̂ar(µ̂), substitute σ² with σ̂², σab² with σ̂ab², σa² with σ̂a² and σb² with σ̂b².

This fact may be used to test H0 : µ = µ0 with µ0 specified or to find a confidence interval for µ
as before. The unbiased estimator of V ar(µ̂) is not a multiple of a chi-square variate, and the test
or confidence limits are only approximately true.

Example 3.12.
A consumer product agency wants to evaluate the accuracy of determining the level of calcium in
a food supplement. There are a large number of possible testing laboratories and a large number of
chemical assays for calcium. The agency randomly selects three laboratories and three assays for
use in the study. Each laboratory will use all three assays in the study. Eighteen samples containing
10 mg of calcium are prepared and each assay−laboratory combination is randomly assigned to two
samples. The calcium content is given in the following table:

Laboratory
Assay 1 2 3
1 10.9 10.5 9.7
10.9 9.8 10.0
2 11.3 9.4 8.8
11.7 10.2 9.2
3 11.8 10.0 10.4
11.2 10.7 10.7

(a) Perform an analysis of variance for this experiment. Conduct all tests with α = 0.05.

(b) Estimate all variance components and determine their proportional allocation to the total
variability.

Solution 3.12.

(a)
ȳ11. = 10.9 ȳ12. = 10.15 ȳ13. = 9.85
ȳ21. = 11.5 ȳ22. = 9.8 ȳ23. = 9.0
ȳ31. = 11.5 ȳ32. = 10.35 ȳ33. = 10.55

ȳ1.. = 10.3 ȳ2.. = 10.1 ȳ3.. = 10.8


ȳ.1. = 11.3 ȳ.2. = 10.1 ȳ.3. = 9.8

ȳ... = 10.4

k X
X m X
n
SSTotal = (yijh − ȳ... )2
i=1 j=1 h=1

= (10.9 − 10.4)2 + (10.9 − 10.4)2 + . . . + (10.7 − 10.4)2

= 12
k
X
SSA = mn (yi.. − ȳ... )2
i=1
= 3 × 2[(10.3 − 10.4)2 + (10.1 − 10.4)2 + (10.8 − 10.4)2 ]

= 6[(−0.1)2 + (−0.3)2 + (0.4)2 ]

= 1.56
m
X
SSB = kn (y.j. − ȳ... )2
j=1

= 3 × 2[(11.3 − 10.4)2 + (10.1 − 10.4)2 + (9.8 − 10.4)2 ]

= 6[(0.9)2 + (−0.3)2 + (−0.6)2 ]

= 7.56
k X
X m
SSAB = n (ȳij. − ȳ.j. − ȳi.. + ȳ... )2
i=1 j=1

= 2[(10.9 − 10.3 − 11.3 + 10.4)2 + . . . + (10.55 − 10.8 − 9.8 + 10.4)2 ]

= 2[(−0.3)2 + (0.15)2 + (0.15)2 + (0.5)2 + . . . + (−0.15)2 + (0.35)2 ]

= 1.64

k X
X m X
n
SSWS = (yijh − ȳij. )2
i=1 j=1 h=1

= (10.9 − 10.9)2 + (10.9 − 10.9)2 + . . . + (10.4 − 10.55)2 + (10.7 − 10.55)2

= 02 + 02 + (−0.2)2 + (0.2)2 + . . . + (0.3)2 + (−0.3)2 + (−0.35)2 + (0.15)2

= 1.24

The ANOVA table is as follows:

Source              SS      d.f.   MS       E(MS)
A (Assay)            1.56    2     0.78     σ² + 2σab² + 6σa²
B (Lab)              7.56    2     3.78     σ² + 2σab² + 6σb²
AB (Assay × Lab)     1.64    4     0.41     σ² + 2σab²
Error                1.24    9     0.1378   σ²
Total               12.00   17

H0 : σa2 = 0:

Fα;k−1;(k−1)(m−1) = F0,05;2;4 = 6.94. Reject H0 if f > 6.94.

SSA /k − 1
f =
SSAB /(k − 1)(m − 1)
M SA
=
M SAB
0.78
=
0.41
≈ 1.9024

Since 1.9024 < 6.94, we do not reject H0 at the 5% level of significance and conclude that
there is insufficient evidence to indicate a significant variability in calcium determinations
from assay to assay.

H0 : σb2 = 0

Fα;m−1;(k−1)(m−1) = F0,05;2;4 = 6.94. Reject H0 if f > 6.94.


SSB /k − 1
f =
SSAB /(k − 1)(m − 1)
M SB
=
M SAB
3.78
=
0.41
≈ 9.2195

Since 9.2195 > 6.94, H0 is rejected at the 5% level of significance and we conclude that there
is a significant variability in calcium concentrations from lab to lab.

2
H0 : σab =0

Fα;(k−1)(m−1);km(n−1) = F0,05;4;9 = 3.63. Reject H0 if f > 3.63.


SSAB /(k − 1)(m − 1)
f =
SSWS /km(n − 1)
M SAB
=
M SWS
0.41
=
0.1378
≈ 2.9753

Since 2.9753 < 3.63, we do not reject H0 at the 5% level of significance. There does not
appear to be a significant interaction between the levels of factors for assays and lab.

(b) Estimating the variance components:

σ 2 = M SE = 0.1378

σ 2 + 2σab
2
= 0.41
2
2σab = 0.41 − 0.1378
2
2σab = 0.2722
2
∴ σab = 0.1361

σ 2 + 2σab
2
+ 6σb2 = 3.78

0.41 + 6σb2 = 3.78


3.78 − 0.41
σb2 =
6
3.37
σb2 =
6
2
∴ σb = 0.5617

σ 2 + 2σab
2
+ 6σa2 = 0.78

0.41 + 6σa2 = 0.78


0.78 − 0.41
σa2 =
6
0.37
σa2 =
6
2
∴ σa = 0.0617

σy2 = σa2 + σb2 + σab


2
+ σ2

= 0.0617 + 0.5617 + 0.1361 + 0.1378

= 0.8973

The estimated components and their proportions of the total variability are

Source of variance   Estimate   Proportion of total
Assays               0.0617     0.0617/0.8973 ≈ 0.0688
Labs                 0.5617     0.5617/0.8973 ≈ 0.6260
Interaction          0.1361     0.1361/0.8973 ≈ 0.1517
Error                0.1378     0.1378/0.8973 ≈ 0.1536
Total                0.8973

Note: Since there was a significant variability in the determination of calcium in the samples, the
estimate of an overall mean level µ would not be of interest to the researcher. However, in
this case we want to illustrate the methodology.
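The variance-component estimates in (b) follow directly from the mean squares by the unbiased estimators given earlier in this section. The short sketch below (illustrative only) repeats the arithmetic; the sums of squares themselves can be obtained as in the sketch following example 3.10.

# Variance components for example 3.12 from the mean squares above (sketch).
k, m, n = 3, 3, 2
MSA, MSB, MSAB, MSE = 0.78, 3.78, 0.41, 0.1378

sigma2 = MSE
sigma2_a = (MSA - MSAB) / (m * n)     # 0.0617
sigma2_b = (MSB - MSAB) / (k * n)     # 0.5617
sigma2_ab = (MSAB - MSE) / n          # 0.1361
total = sigma2 + sigma2_a + sigma2_b + sigma2_ab     # 0.8973
for name, comp in [("assays", sigma2_a), ("labs", sigma2_b),
                   ("interaction", sigma2_ab), ("error", sigma2)]:
    print(name, round(comp, 4), round(comp / total, 4))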

3.9 Two-way analysis of variance: Mixed model

We now assume that the k levels of factor A are the only levels of interest, and thus A is a fixed
factor. The m levels of factor B constitute a random sample from a large population of possible
levels of B, and thus B is a random factor. The resulting experiment gives rise to a mixed model:

yijh = µ + αi + bj + (αb)ij + ijh ;

i = 1, · · · , k; j = 1, · · · , m; h = 1, · · · , n;
where Σi αi = 0,

Σi (αb)ij = 0 for all j,

E(bj ) = E((αb)ij ) = 0,

bj ∼ n(0; σb² );   (αb)ij ∼ n(0; ((k − 1)/k) σαb² );   εijh ∼ n(0; σ² );

b1 , · · · , bm are independent

(αb)i1 , · · · , (αb)im are independent

ε111 , · · · , εkmn are independent

εijh are independent of bj and (αb)ij

bj and (αb)ij are independent.

Any two interaction terms (αb)ij and (αb)rs are independent unless they refer to the same random
level of B, that is j = s; in this case it is assumed that

Cov((αb)ij , (αb)rj ) = −σαb² /k.

Note also the restriction

Σi (αb)ij = 0 for all j,

which is commensurate with the assumption concerning the covariance between two interaction
terms.

For this model the ANOVA table is as follows:

Source   SS      d.f.             MS                        E(MS)
A        SSA     k − 1            SSA /(k − 1)              σ² + mn Σ αi² /(k − 1) + nσαb²
B        SSB     m − 1            SSB /(m − 1)              σ² + knσb²
AB       SSAB    (k − 1)(m − 1)   SSAB /((k − 1)(m − 1))    σ² + nσαb²
Error    SSWS    km(n − 1)        SSWS /(km(n − 1))         σ²
Total    SSTot   kmn − 1

Example 3.13.
Derive the formulae for E(M SA ).

Solution 3.13.

k
X
SSA = mn (ȳi.. − ȳ... )2
i=1
Xk
2 2
= mn ȳi.. − kmnȳ...
i=1

yijh = µ + αi + bj + (αb)ij + ijh

m n
1 XX
ȳi.. = (µ + αi + bj + (αb)ij + ijh )
mn j=1 h=1
m m m n
1 X 1 X 1 XX
= µ + αi + bj + (αb)ij + ijh
m j=1 m j=1 mn j=1 h=1

m m m n
!
1 X 1 X 1 XX
E(ȳi.. ) = E µ + αi + bj + (αb)ij + ijh
m j=1 m j=1 mn j=1 h=1
m m m n
1 X 1 X 1 XX
= E(µ) + E(αi ) + E(bj ) + E((αb)ij ) + E(ijh )
m j=1 m j=1 mn j=1 h=1
= µ + αi

m m m n
!
1 X 1 X 1 XX
V ar(ȳi.. ) = V ar µ + αi + bj + (αb)ij + ijh
m j=1 m j=1 mn j=1 h=1
m m m n
1 X 1 X 1 XX
= V ar(µ) + V ar(αi ) + 2 V ar(bj ) + 2 V ar((αb)ij ) + V ar(ijh )
m j=1 m j=1 (mn)2 j=1 h=1
mσb2 mσαb2
mnσ 2
= + +
m2 m2 (mn)2
2 2 2
σ σ σ
= b + αb +
m m mn

2
E(ȳi.. ) = V ar(ȳi.. ) + E(ȳi.. )2
σ2 σ2 σ2
= b + αb + + (µ + αi )2
m m mn

k m n
1 XXX
ȳ... = (µ + αi + bj + (αb)ij + ijh )
kmn i=1 j=1 h=1
k m k m k m n
1X 1 X 1 XX 1 XXX
= µ+ αi + bj + (αb)ij + ijh
k i=1 m j=1 km i=1 j=1 kmn i=1 j=1 h=1

k m k m k m n
!
1X 1 X 1 XX 1 XXX
E(ȳ... ) = E µ+ αi + bj + (αb)ij + ijh
k i=1 m j=1 km i=1 j=1 kmn i=1 j=1 h=1
k m k m k m n
1X 1 X 1 XX 1 XXX
= E(µ) + E(αi ) + E(bj ) + E((αb)ij ) + E(ijh )
k i=1 m j=1 km i=1 j=1 kmn i=1 j=1 h=1
= µ

k m k m k m n
!
1X 1 X 1 XX 1 XXX
V ar(ȳ... ) = V ar µ + αi + bj + (αb)ij + ijh
k i=1 m j=1 km i=1 j=1 kmn i=1 j=1 h=1
k m k m
1 X 1 X 1 XX
= V ar(µ) + 2 V ar(αi ) + 2 V ar(bj ) + V ar((αb)ij )
k i=1 m j=1 (km)2 i=1 j=1
k X m X n
1 X
+ V ar(ijh )
(kmn)2 i=1 j=1 h=1
k m k m
1 X 1 X 1 XX
= 2 V ar(αi ) + 2 V ar(bj ) + V ar(αb)ij
k i=1 m j=1 (km)2 i=1 j=1
k X m X n
1 X
+ V ar(ijh )
(kmn)2 i=1 j=1 h=1

mσb2 kmσαb
2
kmnσ 2
= + +
m2 (km)2 (kmn)2
σ2 σ2 σ2
= b + αb +
m km kmn

2
E(ȳ... ) = V ar(ȳ... ) + E(ȳ... )2
σ2 σ2 σ2
= b + αb + + µ2
m km kmn

k
!
X
E(SSA ) = E mn (ȳi.. − ȳ... )2
i=1
k
X
2 2
= mn E(ȳi.. ) − kmnE(ȳ... )
i=1
k  2
σb2 σαb σ2 2
σb2 σαb σ2
X   
= mn + + + (µ + αi )2 − kmn + + + µ2
i=1
m m mn m km kmn
k
kmnσb2 kmnσαb
2
kmnσ 2 X
= + + + mn (µ2 + 2µαi + αi2 )
m m mn i=1
−knσb2 − nσαb
2
− σ 2 − kmnµ2
k
X
= knσb2 + knσαb
2
+ kσ 2 + kmnµ2 + 2mnµ αi
i=1
k
X
+mn αi2 − knσb2 − nσαb
2
− σ 2 − kmnµ2
i=1
k
X k
X
2 2 2 2
= knσαb − nσαb + kσ − σ + mn αi2 since αi = 0
i=1 i=1
k
X
2
= n(k − 1)σαb + σ 2 (k − 1) + mn αi2
i=1

E(SSA )
E(M SA ) =
k−1
2
+ σ 2 (k − 1) + mn ki=1 αi2
P
n(k − 1)σαb
=
k−1
k
2 2 mn X 2
= nσαb + σ + α
k − 1 i=1 i

Activity 3.3.
Derive the other E(M S) as an exercise.

The three null hypotheses and the appropriate statistics are as follows:

H0                        f                                                 d.f.
α1 = . . . = αk (= 0)     [SSA /(k − 1)] / [SSAB /((k − 1)(m − 1))]         k − 1; (k − 1)(m − 1)
σb² = 0                   [SSB /(m − 1)] / [SSWS /(km(n − 1))]              m − 1; km(n − 1)
σαb² = 0                  [SSAB /((k − 1)(m − 1))] / [SSWS /(km(n − 1))]    (k − 1)(m − 1); km(n − 1)

As before, we can obtain estimates of σ², σb² and σαb² , and multiple comparison tests of the form
H0 : αr = αs can be performed:

reject H0 : αr = αs if

(ȳr.. − ȳs.. )² / (2/(mn)) > (k − 1)(σ̂² + nσ̂αb² ) Fα;k−1;(k−1)(m−1) .

(Can you see why (σ̂² + nσ̂αb² ) rather than σ̂² is used in the above formula?)

Example 3.14.
Preliminary research on the production of imitation pearls entailed studying the effect of the num-
bers of coats of a special lacquer (factor A) applied to an opalescent plastic bead used as the base
of the pearl on the market value of the pearl. Four batches of 12 beads (factor B) were used in
the study, and it is desired to also consider their effect on the market value. The three levels of
factor A (six, eight and ten coats) were fixed in advance, while the four batches can be regarded as
a random sample of batches from the bead production process. The market value of each pearl was
determined by a panel of experts. The market value data (coded) are as follows:

Factor A Factor B (batch)


(number of coats) 1 2 3 4
6 72.0 72.1 75.2 70.4
74.6 76.9 73.8 68.1
67.4 74.8 75.7 72.4
72.8 73.3 77.8 72.4
8 76.9 80.3 80.2 74.3
78.1 79.3 76.6 77.6
72.9 76.6 77.3 74.4
74.2 77.2 79.9 72.9
10 76.3 80.9 79.2 71.6
74.1 73.7 78.0 77.7
77.1 78.6 77.6 75.2
75.0 80.2 81.2 74.4

(a) Give the number of levels of each factor.

(b) Identify each factor as fixed or random.

(c) Do the data provide sufficient evidence to indicate an interaction between number of coats
and batch?

(d) Test for factor A and factor B main effects.

Hint: Test at the 5% significance level: Give the hypothesis, test statistics, formulas and conclusions
explicitly.

Solution 3.14.

(a) The factor A has three levels and the factor B has four levels.

(b) Factor A is fixed effects and factor B is random effects.



(c)
ȳ11. = 71.7 ȳ12. = 74.275 ȳ13. = 75.625 ȳ14. = 70.825
ȳ21. = 75.525 ȳ22. = 78.35 ȳ23. = 78.5 ȳ24. = 74.8
ȳ31. = 75.625 ȳ32. = 78.35 ȳ33. = 79 ȳ34. = 74.725

ȳ1.. = 73.1063 ȳ2.. = 76.7938 ȳ3.. = 76.925


ȳ.1. = 74.2833 ȳ.2. = 76.9917 ȳ.3. = 77.7083 ȳ.4. = 73.45

ȳ... = 75.6083

k X
X m X
n
SSTotal = (yijh − ȳ... )2
i=1 j=1 h=1

= (72 − 75.6083)2 + . . . + (74.4 − 75.6083)2

= 478.7167

k
X
SSA = mn (yi.. − ȳ... )2
i=1
= 4 × 4[(73.1063 − 75.6083)2 + (76.7938 − 75.6083)2 + (76.925 − 75.6083)2

= 150.3858

m
X
SSB = kn (y.j. − ȳ... )2
j=1

= 3 × 4[(74.2833 − 75.6083)2 + (76.9917 − 75.6083)2 +

(77.7083 − 75.6083)2 + (73.45 − 75.6083)2 ]

= 152.8522

k X
X m X
n
SSWS = (yijh − ȳij. )2
i=1 j=1 h=1

= (72 − 71.7)2 + (74.6 − 71.7)2 + . . . + (74.4 − 74.725)2

= 28.2 + 12.8475 + 8.2475 + 12.5675 + 17.1675 + 9.09 + 9.9

+11.86 + 5.3475 + 31.61 + 7.84 + 18.9475

= 173.625

SSAB = SSTotal − SSA − SSB − SSWS

= 478.7167 − 150.3858 − 152.8522 − 173.625

= 1.8537

The ANOVA table is as follows:

Source SS d.f. MS
A 150.3858 2 75.1929
B 152.8522 3 50.9507
A×B 1.8537 6 0.3090
Error 173.625 36 4.8229
Total 478.7167 47

2
H0 : σαb = 0:
Fα;(k−1)(m−1);km(n−1) = F0,05;6;36 = 2.38. Reject H0 if f > 2.38.

SSAB /(k − 1)(m − 1)


f =
SSWS /km(n − 1)
0.3090
=
4.8229
≈ 0.0641

Since 0.0641 < 2.38, we do not reject H0 at the 5% level of significance and conclude that there
is no interaction between number of coats and batch.

(d) H0 : α1 = α2 = α3 = 0

Fα;(k−1);(k−1)(m−1) = F0,05;2;6 = 5.14. Reject H0 if f > 5.14.

SSA /(k − 1)
f =
SSAB /(k − 1)(m − 1)
75.1929
=
0.3090
≈ 243.3427

Since 243.3427 > 5.14, we reject H0 at the 5% level of significance and we conclude that the
number of coats affects the market value of the pearls.

H0 : σb2 = 0

Fα;(m−1);(k−1)(m−1) = F0,05;3;36 = 2.88. Reject H0 if f > 2.88.

SSB /(m − 1)
f =
SSWS /km(n − 1)
50.9507
=
4.8229
≈ 10.5643

Since 10.5643 > 2.88, we reject H0 at the 5% level of significance and conclude that batches
have an effect on the market value of the pearls.
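The three tests in this solution differ from the model I case only in the choice of denominator, which is read off the E(MS) column of the mixed-model ANOVA table. The sketch below (illustrative only, assuming SciPy) repeats the calculations from the mean squares in the table; the critical values are computed exactly, so they may differ slightly from the tabulated 5.14, 2.88 and 2.38.

# Mixed-model F tests for example 3.14 from the mean squares above (sketch).
from scipy.stats import f as f_dist

k, m, n = 3, 4, 4
MSA, MSB, MSAB, MSE = 75.1929, 50.9507, 0.3090, 4.8229

tests = {
    "A (coats, fixed)":     (MSA / MSAB, k - 1, (k - 1) * (m - 1)),
    "B (batches, random)":  (MSB / MSE,  m - 1, k * m * (n - 1)),
    "AB interaction":       (MSAB / MSE, (k - 1) * (m - 1), k * m * (n - 1)),
}
for name, (f_stat, df1, df2) in tests.items():
    crit = f_dist.ppf(0.95, df1, df2)
    print(f"{name}: f = {f_stat:.4f}, F_0.05;{df1};{df2} = {crit:.2f}, "
          f"reject H0: {f_stat > crit}")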

3.10 Nested experiments

We now refer to the data in sections 3.3.2.1 and 3.3.2.2. The usual model for this type of experiment is

yijh = µ + αi + bij + εijh ;

i = 1, · · · , k; j = 1, · · · , m; h = 1, · · · , n;

where Σi αi = 0,

bij are independent n(0; σb² ) variates,

εijh are independent n(0; σ² ) variates,

and the bij and εijh are mutually independent.

In this formulation i refers to the treatments (teaching methods in section 3.3.2.1, sprays in section
3.3.2.2), j refers to the units assigned at random to the treatments (teachers and trees respectively)
and h to the repetitions (children and leaves respectively).

The sums of squares of interest are

SSA = mn Σi (ȳi.. − ȳ... )²

SSB = n Σi Σj (ȳij. − ȳi.. )²

SSError = Σi Σj Σh (yijh − ȳij. )²

SSTotal = Σi Σj Σh (yijh − ȳ... )² .

The following relationship holds:

SSTotal = SSA + SSB + SSError .

The analysis of variance table is as follows:

Source   SS     d.f.        MS                  E(MS)
A        SSA    k − 1       SSA /(k − 1)        σ² + nσb² + mn Σ αi² /(k − 1)
B        SSB    k(m − 1)    SSB /(k(m − 1))     σ² + nσb²
Error    SSE    km(n − 1)   SSE /(km(n − 1))    σ²
Total    SST    kmn − 1

Example 3.15.
Derive E(M SA ).

Solution 3.15.
The model is yijh = µ + αi + bij + εijh , i = 1, ..., k; j = 1, ..., m; h = 1, ..., n.

m m n
1 X 1 XX
y i.. = µ + αi + bij + εijh
m j=1 mn j=1 h=1
y i.. = µ + αi + bi. + εi..

E(y i.. ) = E(µ + αi + bi. + εi.. )

= µ + αi

V ar(y i.. ) = V ar(µ + αi + bi. + εi.. )


σ2 σ2
= b +
m mn
Now

V ar(y i.. ) = E(y 2i.. ) − (E(y i.. ))2

⇒ E(y 2i.. ) = V ar(y i.. ) + (E(y i.. ))2


σ2 σ2
= b + + (µ + αi )2
m mn

k k m k m n
1X 1 XX 1 XXX
y ... = µ+ αi + bij + εijh
k i=1 km i=1 j=1 kmn i=1 j=1 h=1
y ... = µ + α. + b.. + ε...

E(y ... ) = E(µ + α. + b.. + ε... )

= µ

V ar(y ... ) = V ar(µ + α. + b.. + ε... )


σb2 σ2
= +
km kmn

Now

V ar(y ... ) = E(y 2... ) − (E(y ... ))2

⇒ E(y 2... ) = V ar(y ... ) + (E(y ... ))2


σb2 σ2
= + + µ2
km kmn

X
SSA = mn y 2i.. − kmny 2...
i

X
E(SSA) = E(mn y 2i.. − kmny 2... )
i
X
= mn E(y 2i.. ) − kmnE(y 2... )
i
X  σ2
σ2

b 2
= mn + + (µ + αi )
i
m mn
 2
σ2

σb 2
−kmn + +µ
km kmn
" #
kσb2 kσ 2 X 2
= mn + + (µ + 2µαi + αi2 )
m mn i
kmnσb2 kmnσ 2
− − − kmnµ2
" km kmn #
2 2
kσb kσ X X X
= mn + + µ2 + 2µ αi + αi2
m mn i i i
−nσb2 − σ 2 − kmnµ2
X
= nkσb2 + σ 2 k + kmnµ2 + mn αi2
i
X
−nσb2 2
− σ − kmnµ since 2
αi = 0
X
= nkσb2 − nσb2 + σ 2 k − σ 2 + mn αi2
i
X
= (k − 1)nσb2 2
+ (k − 1)σ + mn αi2
i
" #
mn X 2
= (k − 1) nσb2 + σ 2 + α
k−1 i i

Thus,

E(SSA )
E(M SA ) =
k−1

 
mn
(k − 1) nσb2 + σ 2 + αi2
P
k−1
i
=
k−1
2 2 mn X 2
∴ E(M SA ) = nσb + σ + α .
k−1 i i

Activity 3.4.
Derive the other E(M S) as an exercise.

The hypotheses and the appropriate statistics are:


H0 : α1 = . . . = αk (= 0) is rejected if

f = [SSA /(k − 1)] / [SSB /(k(m − 1))] > Fα;k−1;k(m−1) .

H0 : σb² = 0 is rejected if

f = [SSB /(k(m − 1))] / [SSE /(km(n − 1))] > Fα;k(m−1);km(n−1) .
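For a balanced nested layout these two tests can be computed directly from a k × m × n array of observations. The function below is an illustrative sketch (the name and return format are ours, not from the guide); applied to the marks of section 3.3.2.1 it should reproduce the ANOVA table of example 3.16.

# Balanced nested (hierarchical) ANOVA (sketch).
import numpy as np
from scipy.stats import f as f_dist

def nested_anova(y, alpha=0.05):
    y = np.asarray(y, dtype=float)       # y[i, j, h]: treatment i, unit j within i, repetition h
    k, m, n = y.shape
    grand = y.mean()
    unit_means = y.mean(axis=2)          # ybar_ij.
    trt_means = y.mean(axis=(1, 2))      # ybar_i..
    SSA = m * n * np.sum((trt_means - grand) ** 2)
    SSB = n * np.sum((unit_means - trt_means[:, None]) ** 2)
    SSE = np.sum((y - unit_means[:, :, None]) ** 2)
    fA = (SSA / (k - 1)) / (SSB / (k * (m - 1)))          # treatments tested against B(A)
    fB = (SSB / (k * (m - 1))) / (SSE / (k * m * (n - 1)))
    return {"SSA": SSA, "SSB": SSB, "SSE": SSE, "fA": fA, "fB": fB,
            "reject A": fA > f_dist.ppf(1 - alpha, k - 1, k * (m - 1)),
            "reject B": fB > f_dist.ppf(1 - alpha, k * (m - 1), k * m * (n - 1))}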
Example 3.16.
Using the data in section 3.3.2.1, perform a complete analysis of variance and interpret the results.
Complete the ANOVA table, state the hypothesis and give all formulas explicitly.

Solution 3.16.
The means and sums of squares of deviations from the means, are as follows (the first index refers
to teaching methods, the second to teachers and the third to pupils).

X
ȳ11. = 58 (y11h − ȳ11. )2 = (60 − 58)2 + . . . + (65 − 58)2 = 178
X
ȳ12 . = 62 (y12h − ȳ12. )2 = (61 − 62)2 + . . . + (72 − 62)2 = 166
ȳ1.. = 60
X
ȳ21. = 47 (y21h − ȳ21. )2 = (49 − 47)2 + . . . + (48 − 47)2 = 46
X
ȳ22. = 55 (y22h − ȳ22. )2 = (54 − 55)2 + . . . + (63 − 55)2 = 132
ȳ2.. = 51
X
ȳ31. = 66 (y31h − ȳ31. )2 = (68 − 66)2 + . . . + (75 − 66)2 = 138
X
ȳ32. = 72 (y32h − ȳ32. )2 = (64 − 72)2 + . . . + (75 − 72)2 = 114
ȳ3.. = 69
ȳ··· = 60

k
X
SSA = mn (y i.. − ȳ... )2
i=1
= 2 × 5[(60 − 60)2 + (51 − 60)2 + (69 − 60)2 ]

= 10[02 + (−9)2 + 92 ]

= 1 620

k X
X m
SSB = n (ȳij. − ȳi.. )2
i=1 j=1

= 5[(58 − 60)2 + (62 − 60)2 + (47 − 51)2 + (55 − 51)2 + (66 − 69)2 + (72 − 69)2 ]

= 5[(−2)2 + 22 + (−4)2 + 42 + (−3)2 + 32 ]

= 290

k X
X m X
n
SSError = (yijh − ȳij. )2
i=1 j=1 h=1

= [(60 − 58) + (55 − 58)2 + . . . + (75 − 72)2 ]


2

= 178 + 166 + 46 + 132 + 138 + 114

= 774

k X
X m X
n
SST = (yijh − ȳ... )2
i=1 j=1 h=1

= [(60 − 60) + (55 − 60)2 + . . . + (75 − 60)2 ]


2

= 02 + (−5)2 + . . . + (15)2

= 2 684

The ANOVA table is as follows:

Source              SS      d.f.   MS        E(MS)                  f
Teaching methods    1 620    2     810       σ² + 5σb² + 5 Σ αi²    8.3793
Teachers              290    3     96.6667   σ² + 5σb²              2.9974
Error                 774   24     32.25     σ²
Total               2 684   29

H0 : α1 = α2 = α3 = 0
Fα;k−1;k(m−1) = F0,05;2;3 = 9.55. Reject H0 if f > 9.55.

Since 8.3792 < 9.55, we do not reject H0 at the 5% level of significance and conclude that there is
no significant difference between the teaching methods.

H0 : σb² = 0:
Fα;k(m−1);km(n−1) = F0,05;3;24 = 3.01. Reject H0 if f > 3.01.

Since 2.9974 < 3.01, we do not reject H0 at the 5% level of significance and conclude that the
variance ascribable to the teachers is not significant.

The SAS JMP graph of the means shows that teaching method 3 led to the highest marks on
average, while pupils who were taught by method 2 had the poorest results. SAS JMP does not
do the full analysis for this type of experiment, and therefore we display the graph only.

Figure 3.11: SAS JMP graph for example 3.16

Example 3.17.

Using the data in section 3.3.2.2, perform a complete analysis of variance and interpret the results.
Complete the ANOVA table, state the hypothesis and give all formulas explicitly.

Solution 3.17.

The model is yijh = µ + αi + bij + εijh ;

i = 1, · · · , k; j = 1, · · · , m; h = 1, · · · , n;

where Σi αi = 0,

bij are independent n(0; σb² ) variates,

εijh are independent n(0; σ² ) variates,

and the bij and εijh are mutually independent.

Now i refers to the sprays, that is, i = 1, 2, 3; j refers to the trees, that is, j = 1, 2, 3, 4 and h to
the repetitions (leaves), that is, h = 1, 2, . . . , 6.

Now
ȳ11. = 6 ȳ12. = 7 ȳ13. = 12 ȳ14. = 11
ȳ21. = 15 ȳ22. = 14 ȳ23. = 11 ȳ24. = 12
ȳ31. = 7 ȳ32. = 5 ȳ33. = 9 ȳ34. = 11

ȳ1.. = 9 ȳ2.. = 13 ȳ3.. = 8 ȳ... = 10

k
X
SSA = mn (y i.. − ȳ... )2
i=1
= 4 × 6[(9 − 10)2 + (13 − 10)2 + (8 − 10)2 ]

= 24[(−1)2 + 32 + (−2)2 ]

= 336

k X
X m
SSB = n (ȳij. − ȳi.. )2
i=1 j=1

= 6[(6 − 9)2 + (7 − 9)2 + (12 − 9)2 + . . . + (9 − 8)2 + (11 − 8)2 ]

= 6[(−3)2 + (−2)2 + 32 + 22 + 22 + 12 + (−2)2 + (−1)2 + (−1)2 + (−3)2 + 12 + 32 ]

= 336

k X
X m X
n
SSError = (yijh − ȳij. )2
i=1 j=1 h=1

= [(4.5 − 6) + (7 − 6)2 + . . . + (11 − 11)2 + (9 − 11)2 ]


2

= 7 + 43 + 24.5 + 7 + 2.5 + 10 + 8.5 + 11 + 8.5 + 10.5 + 7 + 11.5

= 151.

The ANOVA table is as follows:

Source   SS     d.f.   MS        E(MS)                              f
A        336     2     168       σ² + nσb² + mn Σ αi² /(k − 1)      4.5
B        336     9     37.3333   σ² + nσb²                          14.8342
Error    151    60     2.5167    σ²
Total    823    71

Testing effect of sprays:

H0 : α1 = α2 = α3 = 0
Fα;k−1;k(m−1) = F0,05;2;9 = 4.26. Reject H0 if f > 4.26.

Since 4.5 > 4.26, we reject H0 at the 5% level of significance and conclude that the effects of the
sprays are significantly different from one another.

H0 : σb2 = 0:
Fα;k(m−1);km(n−1) = F0,05;9;60 = 2.04. Reject H0 if f > 2.04.

Since 14.8342 > 2.04, we reject H0 at the 5% level of significance and conclude that the variances
ascribable to the trees are significant, that is, the hypothesis σb2 = 0 is rejected.

The SAS JMP graph follows - you should be able to interpret it.

Figure 3.12: SAS JMP output for example 3.17



3.11 Three-way analysis of variance

Experiments often include more than two factors, the data in section 3.1.3 are an example of a
three-way analysis of variance.

The general model for a three-way analysis of variance is

yijkℓ = µ + αi + βj + γk + (αβ)ij + (αγ)ik + (βγ)jk + (αβγ)ijk + εijkℓ ;

i = 1, · · · , a; j = 1, · · · , b; k = 1, · · · , c; ℓ = 1, · · · , n;

with

Σi αi = Σj βj = Σk γk = 0,

Σi (αβ)ij = Σj (αβ)ij = 0,   Σi (αγ)ik = Σk (αγ)ik = 0,   Σj (βγ)jk = Σk (βγ)jk = 0,

Σi (αβγ)ijk = Σj (αβγ)ijk = Σk (αβγ)ijk = 0,

and the εijkℓ are independent n(0; σe² ) variates.

The three-way analysis of variance is an expansion of the method used for the two-way analysis
of variance. Since the calculations are complicated and time-consuming, SAS JMP, or a similar
computer package, is usually used.

Example 3.18.
In this example interactions were ignored and you will find the SAS JMP output fairly simple to
interpret.

Figure 3.13: SAS JMP output for example 3.18

Figure 3.14: SAS JMP output for example 3.18

Figure 3.15: SAS JMP output for example 3.18



Figure 3.16: SAS JMP output for example 3.18

Figure 3.17: SAS JMP output for example 3.18



Example 3.19.
A three-way analysis of variance was performed to investigate the effect of rainfall (factor A), type
of soil (factor B) and fertilizer (factor C) on yield, allowing for possible interactions.

Rainfall 1 2
Soil type 1 2 1 2
Fertilizer
1 18.6 10.2 12.3 12.3
18.8 10.5 10.9 10.7
15.8 8.5 10.7 12
17.9 10.6 11.8 11.8
2 19.4 14.7 13.2 10.2
18.9 17.8 15.5 8.5
18.4 15.6 11 7.1
18.9 16.5 11.8 6.5
3 15.9 20.9 19.4 13.2
16.5 21 21.2 13.5
17.2 21.03 20.4 12.8
16.5 20.4 20.3 15.2

The SAS JMP output is shown in figure 3.18.

From figure 3.20 the occurrence of interaction is quite clear.

The AC-interaction seems to be the strongest and the most prominent feature in figure 3.20(middle)
is the following:

The combination of level 2 of C and level 2 of A contributes to a very low yield, on the other hand
the combination of level 2 of C and level 1 of A contributes to a relatively high yield.

Figure 3.21 and figure 3.22 are representations of the ABC-interaction: figure 3.21 represents the
AC-interaction for soil type 1 and figure 3.22 represents the AC-interaction for soil type 2.

If figure 3.21 and figure 3.22 had been the same, you would be inclined to conclude that the ABC-
interaction was not significant. The difference between figure 3.21 and figure 3.22 is obvious, thus
the results of the ANOVA table for the significance of the ABC-interaction are confirmed.

Figure 3.18: SAS JMP output for example 3.19

Notice the following with regard to the figures:

Figure 3.20 left and middle and figures 3.21 and 3.22: the 1 and 2 inside the graph represent levels
1 and 2 for rainfall;

Figure 3.20 right: the 1 and 2 inside the graph represent levels 1 and 2 for soil.

Figure 3.19: SAS JMP output for example 3.19

Figure 3.20: SAS JMP output for example 3.19



Figure 3.21: SAS JMP output for example 3.19

Figure 3.22: SAS JMP output for example 3.19



Exercise 3.1

1. For each of the following scenarios, give

(i) the appropriate ANOVA model and the assumptions;

(ii) the hypotheses and conclusions you expect to test and make from the analysis WITH-
OUT actually performing the analysis.

SCENARIOS

(a) The effect of salinity was examined on the growth of fish (measured by increase in
weight). A sample of five fish each was measured from full-strength seawater (32%),
brackish water (18%) and fresh water (0.5%). Analyse the experiment.

(b) An experiment was conducted to test the effects of three different levels of factor A with
three different levels of factor B. The experiment called for nine identical experimental
units, each to be tested at one of the combinations of factors A and B. Analyse the
experiment.

(c) Given the scenario in (b), there was some controversy about the assumption of no inter-
action between factors A and B and it was decided to test 36 experimental units, four
in each combination, to allow for the possibility of interaction. Analyse the experiment.

2. Four chemical treatments for curtain fabrics were tested with regard to their ability to im-
prove colour fastness. Due to the limited quantities of the two types of fabrics available for
the experiment, it was decided to apply each chemical to a sample of each type of fabric. The
results are expressed in percentages with regard to the retaining of colour, after the treated
fabrics were submitted to severe testing:

Fabric

1 2

1 53 92
Treatment 2 32 81
3 79 99
4 38 67

(a) Give the appropriate model with assumptions.

(b) Are there significant differences in the effect of treatments on fabrics (α = 0.05)?

(c) Are there significant differences between fabrics?

(d) What would you recommend in order to make any claims as to the effectiveness of the
treatments?

3. The strain readings of glass cathode supports from five different machines were investigated.
Each machine had four “heads” on which the glass was formed, and four samples were taken
from each head. The data were as follows:

Machine A B C
Head 1 2 3 4 5 6 7 8 9 10 11 12
6 13 1 7 10 2 4 0 0 10 8 7
2 3 10 4 9 1 1 3 0 11 5 2
0 9 0 7 7 1 7 4 5 6 0 5
8 8 6 9 12 10 9 1 5 7 7 4

Machine D E
Head 13 14 15 16 17 18 19 20
11 5 1 0 1 6 3 3
0 10 8 8 4 7 0 7
6 8 9 6 7 0 2 4
4 3 4 5 9 3 2 0

Analyse and interpret the results. Use the 5% level of significance. Give the model, ANOVA
table, hypotheses, test statistics and conclusions explicitly.

4. Consider the two-way analysis of variance: model II. Show that E(M SAB ) = σ 2 + nσab
2
.

5. An experiment was conducted to determine the effect of different pressure levels on the
products manufactured by a machine. A summary of the data recorded from products man-
ufactured by the machine at each of four randomly selected pressure levels is given as follows:

ȳ1. = 27.4    ȳ2. = 29.8
ȳ3. = 30.7    ȳ4. = 29.2
ȳ.. = 29.275
n1 = n2 = n3 = n4 = 10
Σi Σj (yij − ȳ.. )² = 207.975

Use a one-way analysis of variance (model II):

yij = µ + ai + εij i = 1, . . . , k = 4 j = 1, . . . , n = 10

(a) Complete the ANOVA table and test at the 5% level whether the change in pressure
has a significant effect on the products.

(b) Estimate the proportion of the total variance ascribable to pressure variation and com-
pute a 90% confidence interval.

6. In manufacturing a certain beverage, an important measurement is the percentage of im-


purities present in the final product. The following data show the percentage of impurities
present in samples taken from products manufactured at the three different temperatures
(A), and the two sterilisation times (B) in the manufacturing process. The three levels of A
were actually 25◦ C, 30◦ C and 35◦ C. The two levels of B were 15 and 20 minutes. The data
are as follows:

                          A
                 25◦C     30◦C     35◦C
      15 min     14.05    10.55     7.55
                 14.93     9.48     6.59
B
      20 min     16.56    13.63     9.23
                 15.85    11.75     8.78

(a) Give the appropriate model and the assumptions.

(b) Complete an ANOVA table.

(c) (i) Is there an interaction effect present?


(ii) Are there significant differences between temperatures?

(iii) Is there a significant difference between the two sterilisation times?

Give the hypotheses, test statistics and formulae.

(d) Now assume that the three levels of factor A constitute a random sample from a large
population of possible levels of A. Give the appropriate model and do questions (i)-(iii)
of (c) for the model in (d).
Chapter 4

REGRESSION ANALYSIS

4.1 Introduction

As was said in chapter 2, the regression model is a special case of the general linear model

y = X β + ε
(n×1)  (n×p)(p×1)  (n×1)

in which each column of X (except possibly the first) represents a series of values of a continuous
variable. The variables represented by the columns of X are called the independent variables or
predictors. The rows of X and y represent, as usual, cases or data points. We already know that
the least squares estimator of β is

β̂ = (X′X)⁻¹X′y

and that

Cov(β̂, β̂′) = σ²(X′X)⁻¹.

Furthermore, σ² is estimated by

σ̂² = y′(I − X(X′X)⁻¹X′)y/(n − p).


Thus H0 : βi = βi0 (with βi0 specified) may be tested by means of

t = (β̂i − βi0 ) / (σ̂ √aii )

where aii is the (i,i)-th element of (X′X)⁻¹, or equivalently by means of

f = t² = (β̂i − βi0 )² / (σ̂² aii )

which is compared to Fα;1;n−p .

Likewise, H0 : c′β = d (with c and d known) is tested by means of

t = (c′β̂ − d) / ( σ̂ √( c′(X′X)⁻¹c ) )

or by means of f = t², which is compared to Fα;1;n−p .
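As an illustration (a sketch, not prescribed code, written under the assumption that X already contains a column of ones when the model has a constant term), the test of H0 : c′β = d can be coded directly from the formulas above.

# t test for H0: c'beta = d in the linear model (sketch).
import numpy as np
from scipy.stats import t as t_dist

def test_linear_hypothesis(X, y, c, d=0.0):
    X, y, c = np.asarray(X, float), np.asarray(y, float), np.asarray(c, float)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y            # least squares estimator
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)    # estimate of sigma^2
    t_stat = (c @ beta_hat - d) / np.sqrt(sigma2_hat * (c @ XtX_inv @ c))
    p_value = 2 * t_dist.sf(abs(t_stat), n - p)     # two-sided p-value
    return beta_hat, t_stat, p_value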

In this chapter we shall discuss a number of topics peculiar to regression analysis.

4.2 Analysis of variance in regression

The analysis of variance has been treated in detail, but the interpretation of the various quadratic
forms in the regression model is discussed here.

We first look at the equation

Σ (yi − ȳ)² = Σ (yi − ŷi )² + Σ (ŷi − ȳ)²

or

SSTotal = SSResidual + SSRegression .



These sums of squares are depicted in figures 4.1-4.3.

Figure 4.1: Total variation

The observations are represented by the dots. In figure 4.1 the observations are projected onto
the y-axis without regard to the x-values. The variation between these projected values, depicted
by crosses, reflects the total variation in the y-values. If we did not know about the x-values, we
would ascribe this variation to chance.

Figure 4.2: Variation about the regression line

In figure 4.2 a straight line is fitted to the data and the observations projected parallel to the fitted
line. The variation among these projected points is the variation about the regression line and is
now regarded as the chance variation. The amount of decrease in the variation is said to be the
variation explained by the regression or the variation due to regression.

Figure 4.3: Variation due to regression



In figure 4.3 the predicted values, which are obtained by projecting the data points vertically onto
the regression line, are in turn projected onto the y-axis. The variation between these projected
points is said to be due to the regression because there would have been no such variation if the
slope of the fitted line had been zero, that is if y had not depended on x.

The ANOVA table is as follows:

Source        SS              d.f.     MS                       E(MS)
Regression    Σ(ŷi − ȳ)²      p − 1    Σ(ŷi − ȳ)²/(p − 1)       σ² + β₁′A₁₁⁻¹β₁/(p − 1)
Residual      Σ(yi − ŷi)²     n − p    Σ(yi − ŷi)²/(n − p)      σ²
Total         Σ(yi − ȳ)²      n − 1

where in the E(MS) column β₁ is the (p − 1) × 1 vector obtained by deleting the first element of
β and A₁₁ is the matrix obtained by deleting the first row and column of (X′X)⁻¹. In the case
p = 2, that is the regression of y on one variable x, say y = β1 + β2x + ε, we have

E(Σ(ŷi − ȳ)²/(p − 1)) = σ² + β2² Σ(xi − x̄)².

The F-test for testing H0 : β2 = . . . = βp = 0 is based on

f = [Σ(ŷi − ȳ)²/(p − 1)] / [Σ(yi − ŷi)²/(n − p)].
• Illustration

The following data, with the dependent variable the time needed to complete a race (y) and the
independent variable the time spent on exercise in the two months prior to the race (x) were
analysed with regard to the relation between the dependent and the independent variable.

x : 178.6 221.6 190.5 227.7 206.2 245.6 209.1 260.3 212.2 280.1
y: 27.1 39.8 32.6 42.3 33.6 44.8 33.8 50.9 38.3 54.1

A straight line was fitted and a regression analysis was performed with the following SAS JMP
output in figure 4.4:

Figure 4.4: SAS JMP output for example 4.2.2

The fitted regression line on the data is:

ŷ = β̂0 + β̂1 x = −20.0118 + 0.2677x.

Since the F-value in the computer output is very large, the hypothesis H0 : β1 = 0 is rejected,
and we can accept that the straight line describes the relation between x and y.

“R-squared” given in the computer output is defined as the percentage of the variation in y which
is accounted for by the fitted equation. Thus, 97.26% of the variation in y is accounted for by the
model.
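The fit above is easy to verify with a few lines of code. The sketch below (a Python check, not the SAS JMP run shown in figure 4.4) reproduces the estimates and the R-squared value.

import numpy as np

x = np.array([178.6, 221.6, 190.5, 227.7, 206.2, 245.6, 209.1, 260.3, 212.2, 280.1])
y = np.array([27.1, 39.8, 32.6, 42.3, 33.6, 44.8, 33.8, 50.9, 38.3, 54.1])

X = np.column_stack([np.ones_like(x), x])        # design matrix with a constant term
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # (X'X)^{-1} X'y
y_hat = X @ beta_hat

ss_total = np.sum((y - y.mean()) ** 2)
ss_resid = np.sum((y - y_hat) ** 2)
r_squared = 1 - ss_resid / ss_total

print(beta_hat)     # approximately [-20.0118, 0.2677]
print(r_squared)    # approximately 0.9726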

If we have repeated measurements in some or all of the points, that is if there are sets of equal
rows in the matrix X, a more precise analysis can be made, namely a lack of fit test.

Thus, suppose that y11, · · · , y1n1 correspond to one set of equal rows of X, and let ȳ1. = Σⱼ y1j /n1;
suppose y21, · · · , y2n2 correspond to a second set of equal rows and let ȳ2. = Σⱼ y2j /n2, et cetera.
Suppose there are c such sets with sizes n1, · · · , nc such that n1 + · · · + nc = n, the total sample
size. Then

SSError = Σᵢ Σⱼ (yij − ȳi.)²   (i = 1, · · · , c;  j = 1, · · · , ni)

is said to be the sum of squares due to pure error, since the only explanation for a variation among
observations which correspond to equal rows of X is random variation.

The difference between the observed mean value ȳi. and the predicted value for the i-th set, ŷi, is
an indication of how well the model y = Xβ + ε fits the data. We write

SSLack of fit = Σᵢ ni (ŷi − ȳi.)².

It is easily shown that

SSResidual = SSError + SSLack of fit .



This situation is illustrated in figures 4.5, 4.6 and 4.7.

Figure 4.5: Observations and means

Figure 4.6: Observations and line fitted

SSLack of fit is a measure of the adequacy of the suggested model. We can write

SSTotal = SSRegression + SSLack of fit + SSError .



Figure 4.7: Differences between means and predicted values suggest the model is inadequate

The ANOVA table is as follows:

Source         SS                      d.f.     MS               E(MS)
Regression     Σᵢ ni(ŷi − ȳ..)²        p − 1    SSR/(p − 1)      σ² + β₁′A₁₁⁻¹β₁/(p − 1)
Lack of fit    Σᵢ ni(ŷi − ȳi.)²        c − p    SSLof/(c − p)    σ² + ξ²/(c − p)
Error          Σᵢ Σⱼ (yij − ȳi.)²      n − c    SSE/(n − c)      σ²
Total          Σᵢ Σⱼ (yij − ȳ..)²      n − 1

where

ξ² = Σᵢ ni (E(ȳi.) − Xᵢ′β)²

and where Xᵢ′ is the row of X corresponding to the i-th group of observations. We see that ξ = 0
if, for all i,

E(ȳi.) = Xᵢ′β

that is if the model contains all important independent variables (which have an influence on y).

A test of the adequacy of the model is therefore based on


f = [SSLack of fit/(c − p)] / [SSError/(n − c)],

which is compared to Fα;c−p;n−c .

Example 4.1.
A study was performed on the number of parts assembled in a factory as a function of the time
spent on a specific job. Twelve employees, divided into three groups, were assigned to three time
intervals, with the following results:

Time (x) Number of parts assembled (y)


10 minutes 27 32 26 34
15 minutes 35 30 42 47
20 minutes 45 50 52 49

(a) Fit a straight line to the data.

(b) Construct an ANOVA table for the lack of fit test.

(c) Test at the 5% level of significance whether the model is adequate.

Solution 4.1.

(a) A straight line was fitted first to the data, with the three time levels coded as x = −1, 0, 1:

X′ = [  1  1  1  1  1  1  1  1  1  1  1  1 ]
     [ −1 −1 −1 −1  0  0  0  0  1  1  1  1 ]

X′X = [ 12  0 ]          (X′X)⁻¹ = (1/96) [ 8   0 ]  =  [ 1/12    0  ]
      [  0  8 ]                           [ 0  12 ]     [  0    1/8  ]

X′y = [ 469 ]            β̂ = (X′X)⁻¹X′y = [ 469/12 ]  =  [ 39.0833 ]
      [  77 ]                              [  77/8  ]     [  9.625  ]

The straight line fitted to the data is

ŷi = β̂0 + β̂1xi = 39.0833 + 9.625xi

with

ŷ′ = (29.4583  39.0833  48.7083).

(b) ȳ1. = 29.75   ȳ2. = 38.5   ȳ3. = 49   ȳ.. = 39.0833

SSTotal = Σᵢ Σⱼ (yij − ȳ..)²
        = (27 − 39.0833)² + (32 − 39.0833)² + . . . + (49 − 39.0833)²
        = 982.9167

Σⱼ (y1j − ȳ1.)² = (27 − 29.75)² + . . . + (34 − 29.75)² = 44.75

Σⱼ (y2j − ȳ2.)² = (35 − 38.5)² + . . . + (47 − 38.5)² = 169

Σⱼ (y3j − ȳ3.)² = (45 − 49)² + . . . + (49 − 49)² = 26

SSError = Σᵢ Σⱼ (yij − ȳi.)²
        = Σⱼ (y1j − ȳ1.)² + Σⱼ (y2j − ȳ2.)² + Σⱼ (y3j − ȳ3.)²
        = 44.75 + 169 + 26
        = 239.75

SSReg = Σᵢ ni (ŷi − ȳ..)²
      = 4(29.4583 − 39.0833)² + 4(39.0833 − 39.0833)² + 4(48.7083 − 39.0833)²
      = 741.125

SSLof = Σᵢ ni (ŷi − ȳi.)²
      = 4(29.4583 − 29.75)² + 4(39.0833 − 38.5)² + 4(48.7083 − 49)²
      = 4(−0.2917)² + 4(0.5833)² + 4(−0.2917)²
      = 2.0417

The ANOVA table for the lack of fit test is:

Source         SS                              d.f.          MS
Regression     Σ ni(ŷi − ȳ..)²  = 741.125      p − 1 = 1     741.125
Lack of fit    Σ ni(ŷi − ȳi.)²  = 2.0417       c − p = 1     2.0417
Error          ΣΣ (yij − ȳi.)²  = 239.75       n − c = 9     26.6389
Total          ΣΣ (yij − ȳ..)²  = 982.9167     n − 1 = 11

(c) H0 : E(Y ) = β0 + β1X (model is adequate)   H1 : E(Y ) ≠ β0 + β1X

Since

f = [SSLack of fit/(c − p)] / [SSError/(n − c)]
  = 2.0417/26.6389
  = 0.0766

Fα;c−p;n−c = F0.05;1;9 = 5.12. Reject H0 if f > 5.12.

Since 0.0766 < 5.12, H0 is not rejected, thus the model can be assumed to be adequate.
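The arithmetic in this solution is easy to check by computer. The following Python sketch (given purely as a check on the hand calculations above) recomputes the pure error and lack of fit sums of squares and the F statistic.

import numpy as np
from scipy import stats

x = np.repeat([-1.0, 0.0, 1.0], 4)               # coded time levels 10, 15, 20 minutes
y = np.array([27, 32, 26, 34, 35, 30, 42, 47, 45, 50, 52, 49], dtype=float)

X = np.column_stack([np.ones_like(x), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # [39.0833, 9.625]
y_hat = X @ beta_hat

levels = np.unique(x)
ss_error = sum(np.sum((y[x == lv] - y[x == lv].mean()) ** 2) for lv in levels)
ss_lof = sum(np.sum(x == lv) * (y_hat[x == lv][0] - y[x == lv].mean()) ** 2
             for lv in levels)

n, p, c = len(y), 2, len(levels)
f = (ss_lof / (c - p)) / (ss_error / (n - c))    # approximately 0.0766
f_crit = stats.f.ppf(0.95, c - p, n - c)         # approximately 5.12
print(ss_lof, ss_error, f, f_crit)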

If the lack of fit test is significant, one will have to rethink the model: more terms may have to be
included, the variables may have to be transformed or one may have to design a completely new
experiment.

In many practical applications of regression analysis the straight line seems to be inadequate to
deal with the complexity of the problem. This problem can be solved by application of multiple
regression or polynomial regression.

4.3 Polynomial regression

In certain types of problems one may not know exactly which kind of function to fit to a given set
of data, since many functions may be written in the approximate form

f (x) = c0 + c1 x + c2 x2 + · · ·.

It is often possible to approximate the unknown function by means of a polynomial. This fit is
usually fairly close over a limited interval. In matrix form the approximate model then is

[ y1 ]   [ 1  x1  · · ·  x1^k ] [ β0 ]   [ ε1 ]
[ ·  ] = [ ·   ·           ·  ] [ ·  ] + [ ·  ]
[ yn ]   [ 1  xn  · · ·  xn^k ] [ βk ]   [ εn ]

The least squares estimator is

β̂ = (X′X)⁻¹X′y

that is,

[ β̂0 ]   [  n        Σxi        Σxi²      · · ·  Σxi^k     ]⁻¹ [ Σyi      ]
[ β̂1 ]   [  Σxi      Σxi²       Σxi³      · · ·  Σxi^(k+1) ]   [ Σxiyi    ]
[ ·  ] = [  ·                                               ]   [  ·       ]
[ β̂k ]   [  Σxi^k    Σxi^(k+1)  Σxi^(k+2) · · ·  Σxi^(2k)  ]   [ Σxi^k yi ]

Remarks about polynomial regression

1. If one fits a polynomial of degree k that is one with p = k + 1 parameters, then one has to
have at least k + 2 observations (that is n > k + 1), and at least k + 1 different values of x
(that is c ≥ k + 1 where c is defined in section 4.2). If the last restriction is violated, X 0 X
will be singular. If the first restriction is violated then the degrees of freedom for estimating
σ 2 will be n − p = n − k − 1 ≤ 0.

2. Suppose an experiment is performed at k + 1 different values of x, with replications at some


or all of these values of x. Let ȳi. be the mean value of the observations at xi . Then it may
be proved that a polynomial of degree k fitted to these observations will pass exactly through
the k + 1 points (xi ; ȳi. ), i = 1, · · · , k + 1. In such a case one will not be able to perform a
lack of fit test (the number of degrees of freedom for lack of fit is c − p, but in such a case
c = k + 1 and p = k + 1). For example, it is not possible to perform a lack of fit test on a
quadratic polynomial through only three points.

3. Although the mathematical requirement for k, the degree of the polynomial, is k ≤ c − 1


where c is the number of different x values, one should never try to fit a polynomial of high
degree even if one has a large enough c. The reason is that polynomials of high degree may fit
exactly through the data but oscillate wildly in between − see figure 4.8. Thus it is desirable
to restrict oneself to polynomials of degree 2 or 3 at the most.

Figure 4.8:

4. There is a particular danger in using a fitted polynomial to extrapolate far outside the data.
As in figure 4.8 the polynomial is likely to become completely inappropriate outside the range
of observation.

• Illustration

An investigator wants to determine the relation between the dependent variable y and the inde-
pendent variable x in the following data set:

x : 38 45 49 57 69 78 84 89 79 64 54 41
y : 99 91 78 61 55 63 80 95 65 56 74 93

Two regression lines are fitted, that is

(a) ŷ = β̂0 + β̂1 x = 95.748537 − 0.319923x

(b) ŷ = β̂0 + β̂1 x + β̂2 x2 = 322.738382 − 8.05069x + 0.061173x2

The SAS JMP output for the regression analysis can be seen in figure 4.9.

Figure 4.9: SAS JMP output for example 4.3.2



Using the linear fit:

R² = 0.1241. Thus, only 12.41% of the variability in y is explained by the model. Of the
regression coefficients, only the intercept is significant, with a p-value close to zero.

Using the quadratic fit:

R² = 0.9197. Thus, 91.97% of the variability in y is explained by the model. The regression
coefficients are all significant, with p-values less than 0.05.

The quadratic equation is a better fit, as seen in the graph and “R-square” values.
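As with the earlier illustration, the two fits can be checked numerically. The sketch below (Python, offered only as a check on the SAS JMP output in figure 4.9) fits both polynomials by least squares and computes R-squared for each.

import numpy as np

x = np.array([38, 45, 49, 57, 69, 78, 84, 89, 79, 64, 54, 41], dtype=float)
y = np.array([99, 91, 78, 61, 55, 63, 80, 95, 65, 56, 74, 93], dtype=float)

def fit_poly(x, y, degree):
    X = np.vander(x, degree + 1, increasing=True)    # columns 1, x, x^2, ...
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    y_hat = X @ beta_hat
    r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    return beta_hat, r2

print(fit_poly(x, y, 1))   # roughly (95.749, -0.320) with R^2 = 0.1241
print(fit_poly(x, y, 2))   # roughly (322.738, -8.051, 0.0612) with R^2 = 0.9197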

4.4 Multiple regression

The general form of a multiple regression model is

y = β0 + β1 x1 + β2 x2 + . . . + βk xk + ε.

The dependent variable y is described as a function of k independent variables x1 , · · · , xk . As


discussed in chapter 2, the error term ε accounts for the chance variation inherent in a statisti-
cal experiment. The method of least squares estimation used to fit a straight line, is also used here.

The estimated model ŷ = β̂0 + β̂1x1 + · · · + β̂kxk is chosen in such a way as to minimise

SSE = Σᵢ (yi − ŷi)².

The degree of difficulty of the calculations is the basic difference between the fitting of the straight
line and multiple regression models. With regard to the above model, (k + 1) linear equations are
solved simultaneously to obtain the (k + 1) estimated coefficients β̂0 , β̂1 , · · · , β̂k .

Several computer packages are available to perform multiple regression analyses, and since the
output of the programs is similar, the interpretation is fairly simple.
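To fix ideas, the following Python sketch shows how such a fit can be computed from first principles for a model with three independent variables and their pairwise interactions, the form used in illustration 4.4.1 below. The function and variable names are illustrative only; the data would be the columns y, x1, x2 and x3 of the table that follows.

import numpy as np
from scipy import stats

def fit_with_interactions(y, x1, x2, x3, alpha=0.05):
    # design matrix: constant, main effects and pairwise interactions
    X = np.column_stack([np.ones_like(y), x1, x2, x3, x1 * x2, x1 * x3, x2 * x3])
    n, p = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    y_hat = X @ beta_hat

    ss_total = np.sum((y - y.mean()) ** 2)
    ss_error = np.sum((y - y_hat) ** 2)
    r2 = 1 - ss_error / ss_total                     # R-squared

    # overall F-test of H0: all coefficients except the constant are zero
    f = ((ss_total - ss_error) / (p - 1)) / (ss_error / (n - p))
    f_crit = stats.f.ppf(1 - alpha, p - 1, n - p)
    return beta_hat, r2, f, f_crit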

4.4.1 Illustration

An investigator suspects that factors x1, x2 and x3 and/or combinations of these factors affect the
dependent variable y.

A multiple regression analysis was suggested and performed on the following data set:
y x1 x2 x3
87 40 11 14
133 36 13 30
174 34 19 30
385 41 33 39
363 39 25 33
274 42 23 34
235 40 22 37
104 31 9 20
141 36 13 27
208 34 17 40
115 30 18 19
271 40 23 31
163 37 14 35
193 41 13 28
203 38 24 31
279 38 31 35
179 24 16 26
244 45 19 34
165 34 20 30
257 40 30 38
252 41 22 35
280 42 21 41
167 35 16 23
168 33 18 24
115 36 18 21

The SAS JMP output is as follows:

Figure 4.10: SAS JMP output for example 4.4.1



The fitted model is

ŷ = β̂0 + β̂1 x1 + β̂2 x2 + β̂3 x3 + β̂4 x1 x2 + β̂5 x1 x3 + β̂6 x2 x3

and the F-statistic indicates that the βs are not all zero, since f = 13.6612 > F0,05;6;18 = 2.66.

The “R-squared” term in the output is defined as

R2 = 1 − SSError /SSTotal

thus representing the portion of the sample variance of the y-value accounted for by the estimated
model. In this case R2 = 0.8199. Thus, 81.99% of the variability in y is accounted for by the
model.

The p-value in the table is the exceedance probability of the observed test statistic under the null
hypothesis that the coefficient concerned is zero.

For an observed value F0 of F,

P = ∫ from F0 to ∞ of fF(z) dz

Figure 4.11:

with fF (x) the density function of the F distribution.

A small p-value implies that the coefficient differs significantly from zero at every significance level
α greater than p; for example, if p = 0.08, then the coefficient is significantly different from zero at
the 10% level (α = 0.10) but not at the 5% level (α = 0.05).

4.5 Inclusion or exclusion of variables

It often happens that one has a large number of independent variables available, and one has to
make a decision as to which variables to use in the regression equation. There are arguments in
favour of both extreme views: include as many variables as possible and include as few variables
as possible. We shall go into some of these arguments.

The effect of excluding a variable


Consider the two alternative models:

y = β0 + β1x1 + β2x2 + ε

and

y = β0* + β1*x1 + ε*.

If β2 = 0 the two models are equivalent, but if β2 ≠ 0 they are different. If β2 ≠ 0 and one
nevertheless fits the second model, then β̂0* and β̂1* will be biased, that is E(β̂0*) ≠ β0 and E(β̂1*) ≠ β1,
because E(ε*) = E(β2x2 + ε) = β2x2 and thus the assumption E(ε*) = 0 is not true. For given
values of x1 and x2 the predicted value of the corresponding response y will be

ŷ = β̂0* + β̂1*x1

and E(ŷ) ≠ β0 + β1x1 + β2x2, the true expected response.

Thus the exclusion of important variables introduces bias.



The effect of including a variable


Consider the two simplest models

y = β0* + ε*

and

y = β0 + β1x + ε.

In the first model one would predict, for any x,

ŷ* = β̂0* = ȳ,  with Var(ŷ*) = σ²/n.

In the second model one would predict

ŷ = β̂0 + β̂1x,  with Var(ŷ) = σ²/n + σ²(x − x̄)²/Σ(xi − x̄)².

Thus Var(ŷ) ≥ Var(ŷ ∗ ). This result holds generally: every additional variable included in the
equation increases the variance of a prediction.

The inclusion of too many variables in the equation has other serious side-effects as well:

1. The matrix X 0 X becomes very large and often ill-conditioned (that is singular or nearly
singular) with the result that numerical accuracy is lost in computing (X 0 X)−1 .

2. A regression equation with many terms is cumbersome to use, and often one may have to
observe variables which are difficult or expensive to obtain in order to compute a prediction.
One should always aim at simplicity (within reason, of course).

Mean squared error


Consider the model

y = Xβ + ε

and let x be a vector of values of the independent variables. Let

η = x′β

be the true response which we are trying to predict. Let η̂ = x′β̂ be the estimated response, and
let E(η̂) = x′E(β̂) be the expected value of the estimated response. If E(β̂) = β then E(η̂) = η
and the estimator is unbiased. However, a biased estimator may sometimes be preferable, as will
now be seen. Write

η̂ − η = (η̂ − E(η̂)) − (η − E(η̂)).

Taking squares and then expectations on both sides and noting that the expected value of (η̂ −
E(η̂))(η − E(η̂)) is equal to zero, we obtain

E(η̂ − η)² = E(η̂ − E(η̂))² + (η − E(η̂))²,

or

MSE(η̂) = Var(η̂) + B 2 (η̂).

The term on the left is the mean (or expected) squared error of prediction. The first term on the
right is obviously the variance of η̂ and the last term is the squared bias of η̂. Suppose now
that, for a given set of data, we add the variables x1 , x2 · · · , xp to the equation one by one in that
order.

We have already seen that the variance of the prediction will increase after the inclusion of every
variable while the squared bias will decrease to zero if all important variables are included in our
set of variables. As is seen in figure 4.14, the mean squared error is likely to decrease and then
increase: thus an optimum subset of the variables x1 , · · · , xp will exist which yields the optimum
prediction with the smallest mean squared error. One will never know which variables they are,
but one should know that they exist.

Figure 4.12:

Figure 4.13:

Figure 4.14:

4.6 Selection of variables

We describe a number of methods which may be employed in order to select the variables for
inclusion in the equation. In the final analysis the choice should not be based on statistical con-
siderations alone, but on knowledge of the practical situation as well. Methods such as stepwise
regression analysis may be used to help you to think about a problem, but can never be a substi-
tute for thinking. None of the methods to be described can be said to be superior to all others,
and any of them may be preferable in some circumstances.

Testing coefficients in the full model


One obvious method is to fit the full model (that is include all possible variables) and test all
coefficients, retaining all those variables corresponding to the coefficients which are significant. If
the X 0 X matrix is diagonal or if the off-diagonal elements are relatively small, this method may
be recommended. There are two problems with this approach, both associated with the situation
where the independent variables are highly correlated. The two problems are:

1. The X 0 X matrix may be ill-conditioned, and it may be virtually impossible to fit the complete
model.

2. If two independent variables are highly correlated, it may well happen that their coefficients
are both insignificant, but that they are jointly significant. Exclusion of both will thus result
in a loss of information.

Significance level

If one tests p coefficients, one must bear in mind that the overall significance level may be larger
than one thinks. A safe procedure would be to use the level α/p for each test, in which case the
overall level will at most be α.

All possible subsets


A second approach is to compute all possible regression equations: of y on each x separately,
on each subset of two variables, et cetera. Thus there are 2ᵖ possible regression equations. The
criterion on which the decision is based is the squared multiple correlation coefficient R2 . R2
increases as every new variable is added, and one may plot a graph of the largest R2 of every
subset size against the size. The point where the increase ceases to be substantial may be used as
the cutoff point. There are two problems with this approach.

1. If p is large, one may have a tremendous amount of computing to do and it will be virtually
impossible even to read all the answers which emerge from the computer. This is a wasteful
approach if p is large.

2. It is virtually impossible to attach a specific significance level to this procedure in any prac-
tical application, since the distributional properties of the process are too complicated.

Forward selection
The forward selection method works as follows: First select the independent variable which is
most highly correlated with y. If this correlation coefficient is r1 say, then R12 = r12 is the squared
multiple correlation coefficient between y and the variable selected. Next, that variable is selected
which has the highest partial correlation coefficient with y, given the variable already selected.
This variable also causes the largest increase in R2 . Thus the variables selected in the first two
steps have the largest R2 with y of all pairs of variables which include the variable selected first.
For example, if x3 is included first and then x6 , then R2 of y with x3 and x6 will be larger than

R2 of y with x3 and x1 , with x3 and x2 , x3 and x4 , et cetera. In this way one proceeds until the
addition of further variables causes the R2 to increase by very small amounts.

Some disadvantages of this method:

1. From a distribution theoretical point of view this approach is so complex that it is impossible
to maintain a specified significance level. The cutoff point has to be selected subjectively.

2. It may happen that the subset selected at a particular step is not the optimal subset. For
example, x3 , x4 and x6 may be selected after three steps, while x2 , x5 and x6 may have a
larger R2 with y. In practice it has been found that the subset selected by this method
differs only rarely from the best subset, and usually the optimal R2 is only slightly higher
than the R2 selected. Theoretically the differences may be substantial, however.

Backward elimination
This method is the reverse of the previous one. The full set is selected at first. Next, the variable
which will cause the smallest decrease in R2 if deleted from the equation is eliminated and the
equation is recomputed. This process is continued until the decrease in R2 is intolerably large,
and the previous equation is selected. This procedure is preferred to the previous one by many
authors, but it has the disadvantages of the previous one:

1. Difficulty with inverting X 0 X if p is large.

2. Significance level impossible to compute.

3. The subset selected may not be optimal.

Stepwise regression
This is a variation on the forward selection method and is used very widely. After the inclusion
of each variable, all variables already in the equation are scrutinised to see whether any one of
them could be deleted without decreasing R² by much. Suppose x4 is selected first, then x3 and
x5 . After three steps it may be found that x4 is no longer necessary since it is highly correlated
with x3 and x5 , and the information about y contained in x4 is also contained in x3 and x5 . The
disadvantages of this method are:

1. The subset selected may not be optimal.

2. Even though one is required to supply an “F-level for inclusion” and an “F-level for exclusion”,
it is impossible to attach a specific significance level to the procedure. Incidentally, the “F-
levels” commonly used in existing computer programs actually refer to critical values of the
F-distribution, ostensibly with 1 and n − k − 1 degrees of freedom at the k-th step, but exact
critical values for this procedure are unobtainable. The best recommendation is to select
small “critical values” and let the procedure continue for as many steps as are possible, and
then to make a subjective judgement based on the series of R2 s and any other (non-statistical)
information available.

One of the most widely used methods is the forward stepwise regression.

Forward stepwise regression


Stepwise regression is generally carried out using a computer. Assume that y is the dependent
variable and x1, x2, . . . , xp are the p potential independent variables. In stepwise regression we use
the t-statistic and the related p-values to determine the significance of each independent variable
in the model.

Recall that when using p-values, a coefficient is significant if the p-value is less than α. This means
H0 : βi = 0 is rejected in favour of H1 : βi ≠ 0 with the probability of a type I error equal to α.
Before performing a stepwise regression we choose two alpha values that determine the entry and
retention of a variable, called αentry and αstay respectively. The αentry is the “probability of a type
I error related to entering an independent variable into the regression model”. The αstay is the
“probability of a type I error related to retaining an independent variable previously entered into
the model.” Normally the value of 0.05 is used for both αentry and αstay.

Thus, a variable will enter if its p-value is less than 0.05 and a variable will remain if its p-value
after adding the other variable is less than 0.05. For example, suppose we have three independent
variables x1 , x2 and x3 . If we perform simple linear regression of each variable on y and we obtain
the following p-values for each model:

y = β0 + β1x1 + ε      p-value (0.0004)
y = β0 + β1x2 + ε      p-value (0.006)
y = β0 + β1x3 + ε      p-value (0.002)

Of these variables, x1 is the most significant, so the model y = β0 + β1x1 + ε is considered first,
since its p-value of 0.0004 is highly significant. The hypothesis H0 : β1 = 0 is rejected since 0.0004
is less than αentry = 0.05.

In the next step suppose the possible regression models with two independent variables gave the
models of the form:

y = β0 + β1x1 + β2x2 + ε      p-values (0.0015) (0.0163)
y = β0 + β1x1 + β2x3 + ε      p-values (0.0506) (0.1225)

Note that in the second equation the p-value for β2 is now 0.1225, which is greater than 0.05;
thus x3 cannot stay in the model. However, in the first model the p-value for β2 is 0.0163, so x1
and x2 are retained for further use in the stepwise regression.

Suppose when all three variables are entered in the equation the model is:

y = β0 + β1x1 + β2x2 + β3x3 + ε      p-values (0.0185) (0.0578) (0.4627)

It can be observed that the p-value for β3 (0.4627) is greater than 0.05. Thus, it is not significant
at the αentry level, so this model cannot be chosen and we come back to

y = β0 + β1x1 + β2x2 + ε.

In principle that is how it works.
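The following Python sketch shows one way the entry/stay rules described above could be coded. It is an illustrative outline only (the helper names are invented here and it is not the SAS JMP stepwise procedure); to keep the sketch simple, a variable that has been dropped is not reconsidered.

import numpy as np
from scipy import stats

def coef_p_values(X, y):
    # two-sided p-values of the t-statistics for all coefficients in y = X beta + e
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)
    t = beta_hat / np.sqrt(sigma2_hat * np.diag(XtX_inv))
    return 2 * stats.t.sf(np.abs(t), n - p)

def forward_stepwise(y, predictors, alpha_entry=0.05, alpha_stay=0.05):
    chosen, remaining = [], list(range(len(predictors)))
    while remaining:
        # p-value of each candidate when added to the current model
        trials = []
        for j in remaining:
            X = np.column_stack([np.ones_like(y)] +
                                [predictors[k] for k in chosen + [j]])
            trials.append((coef_p_values(X, y)[-1], j))
        best_p, best_j = min(trials)
        if best_p >= alpha_entry:          # no candidate qualifies for entry
            break
        chosen.append(best_j)
        remaining.remove(best_j)
        # drop any previously entered variable whose p-value now exceeds alpha_stay
        X = np.column_stack([np.ones_like(y)] + [predictors[k] for k in chosen])
        p_vals = coef_p_values(X, y)[1:]   # skip the constant term
        dropped = [k for k, pv in zip(chosen, p_vals) if pv >= alpha_stay]
        chosen = [k for k in chosen if k not in dropped]
    return chosen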

• Application to illustration 4.4.1 A stepwise regression was performed on the data of illus-
tration 4.4.1.

Note: The stepwise regression algorithm allows an X-variable, brought into the model at an earlier
stage, to be dropped subsequently if it is no longer helpful in conjunction with variables
added at later stages.

The SAS JMP output is shown in figures 4.15 and 4.16.

Figure 4.15: SAS JMP output for illustration 4.4.1



Figure 4.16: SAS JMP output for illustration 4.4.1

The comparison of “R-squared (Adj for df)” in figure 4.15 and figure 4.16 suggests that the full
regression is unnecessary, since the analysis with two factors in figure 4.16 provides satisfactory
results.

4.7 Historical data

Sometimes one may be confronted with a mass of historical data, that is data collected rather
haphazardly over a period of time, and one may be tempted to try building a regression model
from this data. Some authors advise strongly against doing this at all, or they advise using such
a study only as a guidance for planning more systematic experiments. Some of the hidden pitfalls
in such historical data are as follows:

1. The whole process may have changed over time and the data may no longer be applicable to
the present situation.

2. Such data usually show a highly correlated structure among the “independent” variables. The
result is that the estimated coefficients are highly correlated and it is impossible to judge
their significance independently. In some studies, for example a number of measurements on

a number of people, such a correlation structure may be unavoidable and may in fact contain
information about the structure underlying the data. However, in many situations, such as
the daily records of a chemical plant, this may be due to the way in which the plant is run.

3. In a chemical plant, as in many other experimental situations, important variables are usually
controlled very strictly since small fluctuations may have a significant effect on the response.
The result is that the importance of such a variable will not be reflected by the data. This
will certainly lead to inappropriate conclusions.

4. Other more subtle effects may be reflected in the data without the statistician ever know-
ing about them. In a chemical plant, two variables which influence the response may be
temperature and pressure. An increase in either of these variables will increase the yield,
but if either of them rises too high, it may cause an explosion. Therefore the operator has
strict instructions to turn down the heat immediately the pressure increases; the result will
of course be a decrease in yield. However, if one analyses the historical data, it will appear
as if an increase in pressure causes a decrease in yield − the way in which the experiment is
run is confused with inherent properties of the system.

In contradistinction to such historical data, consider the ideally designed experiment.

1. One will try to select the design matrix X in such a way that (X 0 X)−1 is a diagonal matrix or
as close to diagonal as possible (except perhaps for the first row and column which represent
the constant term). This will enable one to estimate and judge the effect of each variable
independently.

2. Secondly, all variables which are expected to influence the yield must be varied deliberately,
and as many times as are needed. Even if the chemical engineer is not willing to vary
a certain variable, one may be able to convince him or her that small changes will not
upset the whole system. The variance of the estimated coefficient β̂j of the j-th variable is
σ²/Σₕ (xjh − x̄j.)². In order to obtain a precise estimate of βj, the variance of β̂j must be
small, that is Σₕ (xjh − x̄j.)² must be large. This may be achieved either by varying the
variable xj substantially or by choosing a large sample size (provided not all values of xj are
equal to x̄j.).
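The point about the variance of β̂j can be seen in a small numerical illustration (the x-settings below are hypothetical):

import numpy as np

# Var(beta_hat_j) = sigma^2 / sum_h (x_jh - x_bar_j)^2, so spreading the settings
# of x_j, or replicating them, makes the estimated coefficient more precise.
sigma2 = 4.0
for label, x in [("narrow settings", np.array([9.0, 10.0, 11.0])),
                 ("wide settings", np.array([5.0, 10.0, 15.0])),
                 ("wide settings, replicated", np.array([5.0, 10.0, 15.0] * 4))]:
    var_slope = sigma2 / np.sum((x - x.mean()) ** 2)
    print(label, var_slope)    # 2.0, then 0.08, then 0.02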

Exercise 4.1

1. (a) Suppose the following data have been obtained from five randomly selected objects
(regression of y on x):

x 1 3 5 7 9

y 4 3 6 8 9

(i) Fit a linear regression line.

(ii) Estimate the variance, σ 2 .

(iii) Test at the 5% level whether the expected value of y (at x = 15) is 20.

(b) Consider the following data (regression of y on x):

x 0 1 2 3 4

y 110 90 76 85 91

(i) Fit a model of the form y = β0 + β1 x + β2 x2 to the data.


 
Given (X′X)⁻¹ = (1/700) [  620   −540    100
                          −540    870   −200
                           100   −200     50 ].
(ii) Estimate σ 2 .

(iii) Test whether the quadratic term is necessary in the model.



2. Fourteen observations were obtained on the yield from a chemical experiment. In this exper-
iment, temperature, concentration and length of agitation were set by the experimenter.

X1 X2 X3 Y
Temperature Concentration Time Yield
10 40 20 22
10 40 20 33
15 40 20 36
15 40 20 33
10 40 25 26
15 40 25 37
10 50 20 37
10 50 20 40
15 50 20 42
15 50 20 45
10 50 25 39
10 50 25 42
15 50 25 45
15 50 25 47

The following may be used without checking:

 
X′X = [  14    175    640    310
        175   2275   8000   3875
        640   8000  29600  14200
        310   3875  14200   6950 ]      where X = (1, x1, x2, x3)

(X′X)⁻¹ = [ 11.9857   −0.1429   −0.115   −0.220
            −0.1429    0.0114    0         0
            −0.115     0         0.003    −0.001
            −0.220     0        −0.001     0.012 ]

X′y = [   524
         6665
        24330
        11660 ]        Σ (yi − ŷi)² = 96.5571  (summed over the 14 observations)

(a) Fit a multiple regression model to the data. Give the values of β0 , β1 , β2 and β3 .

(b) Give a 95% confidence interval for β2 . Can you conclude that H0 : β2 = 0?

(c) Test the hypothesis: H0 : 2β1 + 3β2 + β3 = 6.

(d) Give a 95% confidence interval for 2β1 + 3β2 + β3 .

3. Consider the following data:


x: 1 1 1 2 2 2 3 3 3
y: 13 8 6 10 15 11 20 16 21

(a) Fit a model of the form

y = β1 + β2x + β3x² + ε

to the data

(i) by fitting the complete model directly,

(ii) by first reducing the variables as described in section 2.8. Draw a graph of the data
and the fitted regression curve.

(b) Estimate σ 2 and find a 95% confidence interval for σ 2 .

(c) Determine whether a quadratic model is really necessary by testing H0 : β3 = 0 at the


5% level.

(d) Test H0 : β1 = β2 = β3 at the 5% level.

(e) Test, at the 5% level, the hypothesis that the expected response y at x = 1 is 10.

4. Consider the following regression data (regression of y on x):

x: −1 −1 −1 0 0 0 1 1 1
y: 11 16 18 7 9 14 8 9 16

(a) Fit a straight line and perform a lack of fit test at the 5% level.

(b) Fit a second-degree polynomial and test by means of an F-test at the 5% level whether
the quadratic term is zero. Compare your answer with that of (a).

(c) Draw a graph of the data and the two regression lines of (a) and (b). Comment on the
graph with reference to your conclusions in (a) and (b).

5. Consider the simple linear regression model

yi = β0 + β1xi + εi ;  i = 1, · · · , n;  εi independent n(0; σ²).

Find a 100(1 − α)% confidence interval for β0 + β1 x. Regard the lower and upper limits for
β0 + β1 x as functions of x and derive their form. Sketch β0 + β1 x and the lower and upper
limits as functions of x.

6. Consider the following data:

xi yi xi yi

17 37.75 25 45.20

17 37.42 25 46.70

17 36.62 25 48.10

22 41.15 28 52.45

22 42.65 28 50.21

22 39.70 28 51.75

A regression line of y on x is suggested.

(a) Write down the design matrix X.



 
(b) Assume that (X′X)⁻¹ = [  2.75505   −0.11616
                            −0.11616    0.00505 ]  and show through calculation that

ŷ = 14.2139 + 1.2966x.

(c) Perform a lack-of-fit test.


Hint: Σᵢ Σⱼ (yij − ȳi.)² = 11.651  (i = 1, · · · , c;  j = 1, · · · , ni).
(d) If you are the investigator, what would you recommend to your client?

7. A marketing research firm has obtained the following prescription sales data.

Sales Floor space Prescription % Parking Income Shopping centre


Pharmacy y x1 x2 x3 x4 x5
1 22 4 900 9 40 18 1
2 19 5 800 10 50 20 1
3 24 5 000 11 55 17 1
4 28 4 400 12 30 19 0
5 18 3 850 13 42 10 0
6 21 5 300 15 20 22 1
7 29 4 100 20 25 8 0
8 15 4 700 22 60 15 1
9 12 5 600 24 45 16 1
10 14 4 900 27 82 14 1
11 18 3 700 28 56 12 0
12 19 3 800 31 38 8 0
13 15 2 400 36 35 6 0
14 22 1 800 37 28 4 0
15 13 3 100 40 43 6 0
16 16 2 300 41 20 5 0
17 8 4 400 42 46 7 1
18 6 3 300 42 15 4 0
19 7 2 900 45 30 9 1
20 17 2 400 46 16 3 0

These variables can be described precisely as follows:



y = average weekly prescription sales over the past year (in units of $1 000)
x1 = floor space (in square feet)
x2 = percentage of floor space allocated to the prescription department
x3 = number of parking spaces available for the store
x4 = monthly per capita income for the surrounding community (in units of $100)
x5 = an indicator variable: equals 1 if the pharmacy is located in a shopping centre
and equals 0 otherwise.

The following outputs were obtained:

Figure 4.17: Output for question 7



Figure 4.18: Output for question 7



Figure 4.19: Output for question 7



(a) Based on the R2 value, which variable is the best predictor?

(b) Based on pairwise correlation and density ellipses (use a 95% confidence ellipse), which
variable is the best single predictor?

(c) Give the multiple regression model from the output.

(d) Comment on the significance of each coefficient.

(e) Interpret the R2 for the model.

(f) Comment on the residuals.

(g) The following stepwise output was obtained with probability of entry = 0.15 and prob-
ability of leaving equal to 0.15.

Figure 4.20: Output for question 7



Figure 4.21: Output for question 7



(i) What is the first independent variable entered?

(ii) What is the second independent variable entered?

(iii) Has the first independent variable been retained? Why?

(iv) What is the final model arrived at by stepwise regression?


Chapter 5

ANALYSIS OF COVARIANCE

5.1 Introduction

In chapter 2 it was pointed out that the analysis of covariance is a hybrid between analysis of
variance and regression analysis. As in the analysis of variance, we have a number of factors
(treatments or blocking factors) which define subsets among the experimental units. In addition
there are (continuous) independent variables which are assumed to have a linear effect on the
response. In general the analysis of covariance is used to eliminate the linear or higher degree
relation between one or more independent variables (called covariates) and a dependent variable
(yield), with the main objective to be able to determine the effect of the factor(s) (treatment) on
the dependent variable. Analysis of covariance can also be regarded as the simultaneous study of
several regressions. Due to complicated techniques, we confine ourselves to a one-way analysis of
variance with a linear regression on one single variable. Here we have an independent variable
(covariate) which is related to the dependent (response) variable but is not itself affected by the
factor (treatment). It is often impossible to control this covariate at a constant level, but it can be
measured together with the dependent (response) variable.

The model usually proposed is

yij = αi + βi xij + εij ; i = 1, 2; j = 1,· · · , ni ;

ε11 , · · · , ε2n2 are independent n(0; σ 2 ) variates; with αi the effect of the factor being investigated


and βi the additional contribution due to the covariate.

If the model is accepted, the situation shown in figure 5.1 might arise:

Figure 5.1:

Here we have three regression lines, each with a specific slope. The differences between the lines
do not necessarily relate to the treatments, thus it is impossible to investigate the differences in y
for the three treatments.

If the slopes are equal, that is the regression lines are parallel, we accept the model

yij = αi + βxij + εij ;  i = 1, 2;  j = 1, · · · , ni ;

with ε11 , · · · , ε2n2 independent n(0; σ 2 ) variates.



The situation shown in figure 5.2 now arises:

Figure 5.2:

Now the differences with regard to y can be observed directly, thus the comparison of the three
equations is possible. The parameters can be estimated and the differences in y, for any value of
x, can be determined.

We start with an illustrative example, and then discuss two of the simplest models contained in
the general theory. The analysis of covariance is of course part of the broad theory of the general
linear model.

5.2 Illustration

A wholesaler wants to increase the sales of a certain product. He selects ten dealers whom he
believes to be comparable, and requests five of them to display his product close to the entrance
of the shop. The other five are requested to place the product close to the cash registers. The
results are as follows (units sold in one week):

Treatment level 1 (close to cash register) 42 45 51 53 59


Treatment level 2 (close to entrance) 44 47 55 56 58
The means are ȳ1. = 50 (level 1) and ȳ2. = 52 (level 2).

The (one-way) analysis of variance table is as follows:

Source SS d.f. MS F
Treatments 10 1 10 0.24
Error 330 8 41.25
Total 340 9

The F-statistic is not significant at any of the usual levels, and it would seem that there is no
difference between the responses to the two display methods.

Figure 5.3: SAS JMP output for sales data by level



However, at this stage somebody suggested that the sales before treatment should also be taken
into account, and the records show the following:

Level 1   Before (x)   32   38   40   44   46
          After (y)    42   45   51   53   59

Level 2   Before (x)   44   45   51   53   57
          After (y)    44   47   55   56   58
One way of bringing the sales before treatment into the analysis is by computing the differences
(y −x) and regarding the values obtained in this way as two independent samples of sales increases.
A one-way analysis of variance would then compare the mean increases due to the treatments. How-
ever, a more sensitive analysis is suggested by a plot of y against x, using different symbols for the
two treatment levels.

Figure 5.4:

Figure 5.4 suggests that one might fit a regression line to each group of data points and compare
the regression lines in some way or other. This is the purpose of the analysis of covariance.

5.3 Analysis of covariance model

The model often suggested for this type of problem is the following:

yij = αi + βxij + εij ; j = 1,· · · , ni ; i = 1, 2;

where the εij are independent n(0; σ 2 ) variates. Note especially the following two assumptions:

(i) equal slopes

(ii) equal variances

In section 5.4 the model with unequal slopes is discussed. An F-test for the equality of the vari-
ances may be constructed − if the variances are unequal, one is confronted with the Behrens-Fisher
problem.

The model may be written as y = Xβ + ε as follows:

[ y11  ]   [ 1  0  x11  ]            [ ε11  ]
[  ·   ]   [ ·  ·   ·   ]   [ α1 ]   [  ·   ]
[ y1n1 ]   [ 1  0  x1n1 ]   [ α2 ]   [ ε1n1 ]
[ y21  ] = [ 0  1  x21  ]   [ β  ] + [ ε21  ]
[  ·   ]   [ ·  ·   ·   ]            [  ·   ]
[ y2n2 ]   [ 0  1  x2n2 ]            [ ε2n2 ]

The least squares estimators are

β̂ = [Σ y1j(x1j − x̄1.) + Σ y2j(x2j − x̄2.)] / [Σ (x1j − x̄1.)² + Σ (x2j − x̄2.)²]

α̂1 = ȳ1. − β̂ x̄1.

α̂2 = ȳ2. − β̂ x̄2. .



The residual sum of squares is

SSR = Σᵢ Σⱼ (yij − α̂i − β̂xij)²

and the estimator of σ² is

σ̂² = SSR / (n1 + n2 − 3).

The test for H0 : α1 = α2 is then based on

f = (α̂1 − α̂2)² / { σ̂² [ 1/n1 + 1/n2 + (x̄1. − x̄2.)²/Sx² ] }

where Sx² = Σ (x1j − x̄1.)² + Σ (x2j − x̄2.)²; this statistic is compared to Fα;1;n1+n2−3 .
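Before working through example 5.1 by hand, note that the whole calculation can be checked in a few lines. The Python sketch below (a check only; the data are those of section 5.2) evaluates β̂, α̂1, α̂2, σ̂² and the F statistic from the formulas above.

import numpy as np
from scipy import stats

x1 = np.array([32., 38., 40., 44., 46.]); y1 = np.array([42., 45., 51., 53., 59.])
x2 = np.array([44., 45., 51., 53., 57.]); y2 = np.array([44., 47., 55., 56., 58.])
n1, n2 = len(y1), len(y2)

sxx = np.sum((x1 - x1.mean()) ** 2) + np.sum((x2 - x2.mean()) ** 2)
beta_hat = (np.sum(y1 * (x1 - x1.mean())) + np.sum(y2 * (x2 - x2.mean()))) / sxx
a1_hat = y1.mean() - beta_hat * x1.mean()          # 5
a2_hat = y2.mean() - beta_hat * x2.mean()          # -4.25

ss_res = np.sum((y1 - a1_hat - beta_hat * x1) ** 2) + \
         np.sum((y2 - a2_hat - beta_hat * x2) ** 2)
sigma2_hat = ss_res / (n1 + n2 - 3)                # 3.75

f = (a1_hat - a2_hat) ** 2 / (sigma2_hat *
    (1/n1 + 1/n2 + (x1.mean() - x2.mean()) ** 2 / sxx))    # about 27.94
f_crit = stats.f.ppf(0.95, 1, n1 + n2 - 3)         # 5.59
print(beta_hat, a1_hat, a2_hat, sigma2_hat, f, f_crit)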

Example 5.1.
Using the data given in section 5.2, perform an analysis of covariance assuming equal slopes, testing
at the 5% level of significance.

Solution 5.1.
x̄1. = 40   ȳ1. = 50   x̄2. = 50   ȳ2. = 52

Σ (x1j − x̄1.)² = (32 − 40)² + (38 − 40)² + (40 − 40)² + (44 − 40)² + (46 − 40)²
               = (−8)² + (−2)² + 0² + 4² + 6²
               = 120

Σ (x2j − x̄2.)² = (44 − 50)² + (45 − 50)² + (51 − 50)² + (53 − 50)² + (57 − 50)²
               = (−6)² + (−5)² + 1² + 3² + 7²
               = 120

Σ y1j(x1j − x̄1.) = 42(32 − 40) + 45(38 − 40) + . . . + 59(46 − 40)
                 = 42(−8) + 45(−2) + 51(0) + 53(4) + 59(6)
                 = 140

Σ y2j(x2j − x̄2.) = 44(44 − 50) + 47(45 − 50) + . . . + 58(57 − 50)
                 = 44(−6) + 47(−5) + 55(1) + 56(3) + 58(7)
                 = 130

β̂ = [Σ y1j(x1j − x̄1.) + Σ y2j(x2j − x̄2.)] / [Σ (x1j − x̄1.)² + Σ (x2j − x̄2.)²]
   = (140 + 130)/(120 + 120)
   = 270/240
   = 1.125

α̂1 = ȳ1. − β̂ x̄1. = 50 − 1.125(40) = 5

α̂2 = ȳ2. − β̂ x̄2. = 52 − 1.125(50) = −4.25
SSR = Σᵢ Σⱼ (yij − α̂i − β̂xij)²

Σⱼ (y1j − α̂1 − β̂x1j)² = (42 − 5 − 1.125 × 32)² + . . . + (59 − 5 − 1.125 × 46)²
                       = 1² + (−2.75)² + 1² + (−1.5)² + (2.25)²
                       = 16.875

Σⱼ (y2j − α̂2 − β̂x2j)² = (44 + 4.25 − 1.125 × 44)² + . . . + (58 + 4.25 − 1.125 × 57)²
                       = (−1.25)² + (0.625)² + (1.875)² + (0.625)² + (−1.875)²
                       = 9.375

SSR = Σᵢ Σⱼ (yij − α̂i − β̂xij)²
    = 16.875 + 9.375
    = 26.25

σ̂² = SSR / (n1 + n2 − 3)
   = 26.25/(5 + 5 − 3)
   = 26.25/7
   = 3.75

Sx² = Σ (x1j − x̄1.)² + Σ (x2j − x̄2.)²
    = 120 + 120
    = 240

H0 : α1 = α2

Fcritical = Fα;1;n1+n2−3 = F0.05;1;7 = 5.59. Reject H0 if f > 5.59.

f = (α̂1 − α̂2)² / { σ̂² [ 1/n1 + 1/n2 + (x̄1. − x̄2.)²/Sx² ] }
  = (5 − (−4.25))² / { 3.75 [ 1/5 + 1/5 + (40 − 50)²/240 ] }
  = (9.25)² / { 3.75 [ 0.4 + 100/240 ] }
  = 85.5625/3.75(0.8166667)
  = 85.5625/3.0625
  ≈ 27.9388

Since 27.9388 > 5.59, we reject H0 at the 5% level of significance and conclude that the lines differ
significantly.

5.4 Comparing two regression lines

If one is unwilling to assume that the two regression lines are parallel, one may select the following
model:

yij = αi + βixij + εij ;  j = 1, · · · , ni ;  i = 1, 2;

where the εij are independent n(0; σ²) variates.

In this model we estimate (α1, β1) and (α2, β2) independently, but σ² jointly from the two samples:

β̂i = Σⱼ yij(xij − x̄i.) / Σⱼ (xij − x̄i.)²

α̂i = ȳi. − β̂i x̄i.

σ̂² = Σᵢ Σⱼ (yij − α̂i − β̂ixij)² / (n1 + n2 − 4).

H0 : β1 = β2 is tested by comparing

f = (β̂1 − β̂2)² / { σ̂² [ 1/Σ(x1j − x̄1.)² + 1/Σ(x2j − x̄2.)² ] }

with Fα;1;n1+n2−4 .

Similarly, we might wish to test whether the response for each population is equal at a particular
point x, that is we might test H0 : α1 + β1x = α2 + β2x by comparing

fx = (α̂1 + β̂1x − α̂2 − β̂2x)² / { σ̂² [ 1/n1 + (x − x̄1.)²/Σ(x1j − x̄1.)² + 1/n2 + (x − x̄2.)²/Σ(x2j − x̄2.)² ] }

with Fα;1;n1+n2−4 .

Example 5.2.
Using data in section 5.2, perform an analysis of covariance assuming unequal slopes at the 5%
level of significance.

Solution 5.2.
Recall

x̄1. = 40   ȳ1. = 50   x̄2. = 50   ȳ2. = 52

Σ (x1j − x̄1.)² = 120      Σ (x2j − x̄2.)² = 120

Σ y1j(x1j − x̄1.) = 140    Σ y2j(x2j − x̄2.) = 130

β̂1 = Σ y1j(x1j − x̄1.) / Σ (x1j − x̄1.)² = 140/120 = 7/6

β̂2 = Σ y2j(x2j − x̄2.) / Σ (x2j − x̄2.)² = 130/120 = 13/12

α̂1 = ȳ1. − β̂1 x̄1. = 50 − (7/6)(40) = 10/3

α̂2 = ȳ2. − β̂2 x̄2. = 52 − (13/12)(50) = −13/6
σ̂² = Σᵢ Σⱼ (yij − α̂i − β̂ixij)² / (n1 + n2 − 4)

Σⱼ (y1j − α̂1 − β̂1x1j)² = (42 − 10/3 − (7/6) × 32)² + . . . + (59 − 10/3 − (7/6) × 46)²
                        = (4/3)² + (−8/3)² + 1² + (−5/3)² + 2²
                        = 50/3

Σⱼ (y2j − α̂2 − β̂2x2j)² = (44 + 13/6 − (13/12) × 44)² + . . . + (58 + 13/6 − (13/12) × 57)²
                        = (−3/2)² + (5/12)² + (23/12)² + (3/4)² + (−19/12)²
                        = 55/6

σ̂² = Σᵢ Σⱼ (yij − α̂i − β̂ixij)² / (n1 + n2 − 4)
   = (50/3 + 55/6)/(5 + 5 − 4)
   = (155/6)/6
   = 155/36

The estimates are


α̂1 = 10/3
α̂2 = −13/6
β̂1 = 7/6
β̂2 = 13/12
σ̂² = 155/36.

These regression lines are illustrated in figure 5.5.

Figure 5.5: SAS JMP output for example 5.2

H0 : β1 = β2

Fcritical = Fα;1;n1+n2−4 = F0.05;1;6 = 5.99. Reject H0 if f > 5.99.

f = (β̂1 − β̂2)² / { σ̂² [ 1/Σ(x1j − x̄1.)² + 1/Σ(x2j − x̄2.)² ] }
  = (7/6 − 13/12)² / { (155/36)(1/120 + 1/120) }
  = (1/12)²/(31/432)
  = 3/31
  ≈ 0.0968

Since 0.0968 < 5.99, we do not reject H0 at the 5% level of significance and conclude that the
assumption of equal slopes is valid. Parallel lines seem to fit the data quite well.

The procedure may obviously be extended to more than two regression lines and more than one
independent variable.
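For completeness, here is the corresponding Python check for the separate-slopes analysis of this section, again on the section 5.2 data; it reproduces the values found by hand in example 5.2.

import numpy as np
from scipy import stats

x1 = np.array([32., 38., 40., 44., 46.]); y1 = np.array([42., 45., 51., 53., 59.])
x2 = np.array([44., 45., 51., 53., 57.]); y2 = np.array([44., 47., 55., 56., 58.])
n1, n2 = len(y1), len(y2)

s1 = np.sum((x1 - x1.mean()) ** 2); s2 = np.sum((x2 - x2.mean()) ** 2)
b1 = np.sum(y1 * (x1 - x1.mean())) / s1            # 7/6
b2 = np.sum(y2 * (x2 - x2.mean())) / s2            # 13/12
a1 = y1.mean() - b1 * x1.mean(); a2 = y2.mean() - b2 * x2.mean()

ss_res = np.sum((y1 - a1 - b1 * x1) ** 2) + np.sum((y2 - a2 - b2 * x2) ** 2)
sigma2_hat = ss_res / (n1 + n2 - 4)                # 155/36

f = (b1 - b2) ** 2 / (sigma2_hat * (1 / s1 + 1 / s2))   # about 0.0968
f_crit = stats.f.ppf(0.95, 1, n1 + n2 - 4)              # 5.99
print(b1, b2, sigma2_hat, f, f_crit)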

5.5 Illustration

Disclaimer: The data below are fictitious.

A company wants to investigate the effect of heartbeat per minute (< 60 or > 60) on the results
(y) obtained in an intensive, very demanding one-month course. It is expected that the heartbeat
is related to the mean number of kilometres (x) the course attendants ran/walked monthly in the
year prior to the course. The following results were obtained:

< 60 > 60
x y x y
8 63 50 59
22 80 91 82
11 71 72 76
12 69 86 81
14 78 63 70
23 79 79 85

The ANOVA table, with x a covariate, is as follows (SAS JMP figure 5.6):

Figure 5.6: SAS JMP output for data in illustration 5.5

Figure 5.7 is a graphical representation of the data and the two straight lines fitted if x is regarded
as a covariate.

Figure 5.7:

Figure 5.7 shows clearly that the two straight lines, obtained from the analysis of covariance (x
the covariate), are acceptable representations of the situation as reflected by the given data.

The following two graphs illustrate the importance of looking at a study as a whole and not only
at certain aspects.

Figure 5.8 is a graphical representation of x against the results, without considering heartbeat. A
straight line fitted to the data is obviously not an acceptable prediction.

Figure 5.8:

Figure 5.9 is a graphical representation of heartbeat against results, where x is ignored. It is


obvious that the study is not valid if x is not taken into consideration.

Figure 5.9:

Exercise 5.1

1. The following data were given for two sets:

Set 1 Set 2
x y x y
4.8 9.912 8.8 10.596
7.2 9.383 6.2 9.697
5.5 9.734 7.5 10.700
6.0 9.551 4.9 10.610
8.3 8.959 5.4 10.145
7.6 9.474 5.8 10.191
5.9 9.179 7.3 9.855
8.0 9.359 8.6 10.682
4.3 9.580 8.8 10.160
5.1 9.245 6.0 9.982

The following regression lines were fitted to the data:


Set 1: yb1 = 10.144723 − 0.112779x

Set 2: yb2 = 9.739767 + 0.075329x

(a) Test whether the slopes of the fitted regression lines are significantly different.

(b) Assume equal slopes and variances, and test H0 : α1 = α2 .

[Given:

Set 1: x̄1· = 6.27   Σ (x1j − x̄1·)² = 17.961   Σ y1j(x1j − x̄1·) = −2.02562

Set 2: x̄2· = 6.93   Σ (x2j − x̄2·)² = 19.381   Σ y2j(x2j − x̄2·) = 1.45996 ]

2. A lecturer wants to confirm that results in the 2002 examination were better than in the 2000
examination. He takes a sample (n = 6) of the examination results of students from these
groups (y). The means for the assignments for each student are also recorded (x). Results
are as follows:

2000 2002

x y x y

50 60 59 55

62 70 45 49

76 68 33 50

38 63 73 61

52 73 57 44

64 80 45 41

[The following is given and can be accepted as correct:

Σ y1j(x1j − x̄1.) = 220     Σ y2j(x2j − x̄2.) = 306

Σ (x1j − x̄1.)² = 870       Σ (x2j − x̄2.)² = 974]

(a) Draw a graph of the data to examine the possibility of a covariate. Briefly interpret
your graph.

(b) Regression lines:

ŷ1 = α̂1 + β̂1x = 54.5862 + 0.2529x

ŷ2 = α̂2 + β̂2x = 33.6632 + 0.3142x
were fitted to these two sets of data. Confirm, by testing the hypothesis H0 : β1 = β2 ,
that the assumption of equal slopes is valid.

(c) Perform a covariance analysis. [State the hypothesis, test statistic, all formulae and the
conclusion explicitly.]

3. An experiment was performed to determine whether balloons of different colours are similar
in terms of the time taken for inflation to a diameter of 7 inches (1 inch ≈ 2.5 cm). Two
colours were selected from a single manufacturer. An assistant blew up the balloons and the
times (y) (to the nearest 1/10th of a second) were taken with a stopwatch. The data, in the
order collected (x), are given below, where the codes 1 and 2 denote the colours yellow and
blue, respectively.

Order 1 2 3 4 5 6 7 8
Colour 1 2 2 2 1 1 2 1
Time 19.8 28.5 25.7 28.8 17.1 19.3 18.3 14.0
Order 9 10 11 12
Colour 1 2 2 1
Time 16.6 18.1 18.9 16.0

[The following is given and can be assumed correct:

Σy1j (x1j − x1· ) = −27.9632 Σy2j (x2j − x2· ) = −86.5546

Σ (x1j − x1· )2 = 70.8333 Σ (x2j − x2· )2 = 70.8333]

Assume equal variances.

(a) Ignore order. Construct an analysis of variance table and test the hypothesis that colour
has no effect on inflation time.

(b) Include order as a covariate. Test the hypothesis H0 : β1 = β2 that the slopes are equal.
[NB: Give the formulae, test statistics, decisions and conclusions explicitly.]

(c) Comment on the inclusion of order as a covariate.

4. Consider the following two data sets (regression of y on x):

Set 1
x: 20 20 30 30 40 40 50 50
y: 61 65 90 92 124 124 151 157
Set 2
x: 20 20 20 30 30 40 40 40
y: 55 56 60 88 92 113 115 117

(a) Assume that parallel regression lines are applicable; give a model and perform an anal-
ysis of covariance (at the 5% level).

(b) Represent the data and the estimated regression lines graphically and comment on the
assumptions and conclusions of (a) in the light of the graph.

5. Consider the following two sets of data (regression of y on x):

Set 1
x: 1 1 2 2 3 3 4 4
y: 12 16 15 15 17 19 21 25

Set 2
x: 2 2 3 3 4 4 5 5
y: 14 16 20 24 22 24 26 30

(a) Fit a regression line through each set of data and draw a graph of the data and the
fitted lines. Give a model for the data.

(b) Test, at the 10% level, whether the slopes of the fitted regression lines are significantly
different.

(c) Test, at the 10% level, the null hypothesis that the two expected values of y which
correspond to x = 3 are equal.
Chapter 6

APPENDIX

6.1 Remarks on matrices

Many of our students make certain mistakes in their assignments and examination papers, and it
seems necessary to point out these typical errors. You should read these remarks carefully and
make sure that you do not commit similar errors.

The main source of errors is failure to take note of the order of a matrix. If A is a matrix with p
rows and q columns, then A is said to be of order p × q.

Matrix addition
If A and B are matrices, then A + B is defined only if A and B are of the same order. If A is p × q
and B is p × q; then A+B is also p × q.

Matrix multiplication
The product AB is defined only if the number of columns of A is the same as the number of rows
of B. If A is p × q and B is q × r; then AB is defined and AB is a p × r matrix.

Thus, for example, if we write

AB + CDE


then this expression is valid only if A is p × q, B is q × r, C is p × s, D is s × t and E is t × r. When


in doubt, always write the order of the matrices beneath the expression:

  A       B    +    C      D      E
(p×q)   (q×r)     (p×s)  (s×t)  (t×r)

so that both AB and CDE are of order p × r.

Matrix division
If A and B are matrices, then the quotient A/B is defined only if B is a 1 × 1 matrix, that is a scalar,
and not equal to zero. We often write down the quotient of two quadratic forms or other 1 × 1
expressions in matrices, for example

x′Ax / x′Bx

where x is a p × 1 vector and A and B are p × p matrices. However, to “cancel out” the xs and
equate the expression to A/B is unforgivable.

Properties of square matrices


The following properties and operations are defined for square matrices only (that is matrices with
as many rows as columns):

trace
determinant
inverse
positive definite; positive semi-definite
quadratic form
symmetric
singular; non-singular
idempotent

Thus, before you write |X| or tr(X) or X −1 , et cetera make sure that X is square.

Determinants
If A and B are both square matrices, then

|AB| = |A||B| = |BA|.

However, if A and B are not square (but AB and BA are) then |AB| and |BA| need not be equal.

• Counterexample 6.1.1

 
Let A = (1  2) and B = [ 3
                         4 ].

Then AB = (11)

∴ |AB| = 11,

whereas BA = [ 3  6
               4  8 ]

∴ |BA| = 0.

Of course, |A| and |B| are not defined in this example. Also, if |A| = |B| then it does not
follow that A = B.

Trace
If tr(A) = tr(B) then it does not follow that A = B.

• Counterexample 6.1.2
   
Let A = [ 1  −10 ]   and   B = [ 2  0 ]
        [ 6    3 ]             [ 0  2 ].

Then tr(A) = tr(B) = 4.

While this may seem trivial, students have been known to write

tr(I-AB) = 0

∴ I = AB,

which is of course nonsense.

Ranks
r(AB) is not necessarily equal to r(BA).

• Counterexample 6.1.3
 
1 −2  
  2 2 2
Let A =  1 1  ; B =  .
 
  0 1 −1
1 1

 
2 0 4
 
Then AB =  2 3 1  ∴ r(AB) = 2;
 
 
2 3 1

 
6 0
while BA =   ∴ r(BA) = 1.
0 0

Inverse
The inverse of a matrix A is defined only if A is non-singular (and of course square). Thus, while
(AB)−1 is defined if AB is non-singular, it does not necessarily follow that A−1 and B−1 exist; in
fact A and B need not be square. The well-known least squares estimator in the linear model is

β̂ = (X 0 X)−1 X 0 y

where X is n × p of rank p where p < n. Thus X −1 does not exist, and one may not write

β̂ = X −1 X 0−1 X 0 y = X −1 In y = X −1 y.

The last equation would be valid only if n = p and r(X) = n.

“Cancelling out” matrices and vectors


Suppose A is non-singular, and

AX = AY.

Then we may pre-multiply each side of the equation by A−1 to obtain

A−1 AX = A−1 AY

∴ IX = IY

∴ X = Y.

Likewise, if A is non-singular and UA = VA, then it follows that U = V. However, if A is singular
or if A is not a square matrix, then AX = AY does not imply that X = Y and UA = VA does not
imply that U = V. The above also holds if A is replaced by a vector.

• Counterexample 6.1.4

Let A = (1  2  3)        ( 6  0  0)        (1)
        (4  5  6)  ; B = (15  0  0)  ; x = (1)
        (7  8  9)        (24  0  0)        (1).

Then Ax = Bx = ( 6)
               (15)    but A ≠ B.
               (24)

• Counterexample 6.1.5

Let A = (1  2   3)        (6  0  0)        (1  2)
        (1  0  −1)  ; B = (0  0  0)  ; C = (1  2)
                                           (1  2).

Then AC = BC = (6  12)
               (0   0)    but A ≠ B.

Zero products

If AB = 0 then it does not follow that A = 0 or B = 0. You should be able to construct your own
counterexample.

6.2 Quadratic forms

After one of the examinations a student commented that it had taken him an hour to answer a
question like the following:

Let x1, x2, x3 and x4 be independent n(µ; 1) variates. Write the following quadratic forms as
qi = x'Ai x where Ai is symmetric, determine whether each has a chi-square distribution, give the
degrees of freedom and find out which are independent:

q1 = x1x2 − x3x4

q2 = (Σ xi)²/4

q3 = Σ (xi − x1)²

where both sums run over i = 1, . . . , 4.

Actually it can be completed in five minutes, if you know how. Let us do another example step
by step. Let

q = (1/5)(x1 − 2x3)² + x2²

where x1, x2 and x3 are independent n(0; σ²) variates. We wish to show that a multiple of q has a
chi-square distribution. For this purpose we first write q in the form x'Ax where A is symmetric:
recall that the matrix A of the quadratic form x'Ax is assumed to be symmetric in this guide.

First, we write out q term by term; this step is not really necessary but it helps to explain the
process.

q = (1/5)x1² − (4/5)x1x3 + (4/5)x3² + x2²

  = (1/5)x1² + (4/5)x3² + x2² − (2/5)x1x3 − (2/5)x3x1
      + 0x1x2 + 0x2x1 + 0x2x3 + 0x3x2.

The coefficient of x1² comes in the (1; 1) position of A, that of x2² in the (2; 2) position and that of
x3² in the (3; 3) position:

A = (1/5   ·    · )
    ( ·    1    · )
    ( ·    ·   4/5).

Since A must be symmetric, we place ½ (the coefficient of xixj, where i ≠ j) in the (i; j) position
and the other half in the (j; i) position:

A = ( 1/5   0   −2/5)
    (  0    1     0 )
    (−2/5   0    4/5)

  
check: (x1  x2  x3) ( 1/5   0   −2/5) (x1)
                    (  0    1     0 ) (x2)
                    (−2/5   0    4/5) (x3)

     = (1/5)x1² + x2² + (4/5)x3² − (4/5)x1x3.

Now we show that AA = A and we conclude that q/σ² is a chi-square variate with tr(A) = 2
degrees of freedom. The noncentrality parameter is 0'A0 = 0, thus q/σ² has a central chi-square
distribution. Quite easily done.
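A short NumPy check of this example may be reassuring (an informal sketch only; the test vector x is arbitrary):

import numpy as np

# Matrix of the quadratic form q = (1/5)(x1 - 2*x3)^2 + x2^2.
A = np.array([[ 1/5, 0.0, -2/5],
              [ 0.0, 1.0,  0.0],
              [-2/5, 0.0,  4/5]])

print(np.allclose(A @ A, A))   # True: A is idempotent
print(np.trace(A))             # 2.0: the degrees of freedom of q / sigma^2

# Spot check that x'Ax reproduces q for an arbitrary x, say x = (1, 2, 3)'.
x = np.array([1.0, 2.0, 3.0])
print(x @ A @ x, (1/5) * (x[0] - 2 * x[2])**2 + x[1]**2)   # both equal 9.0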

6.3 Rapid calculations for small matrices

Efficient methods exist for performing matrix calculations, and we assume that you know these
methods. These methods are usually aimed at computer usage, and may be rather lengthy if ap-
plied manually. We therefore point out certain shortcuts which may be useful in an exam situation
when time is limited.

Computing X'X
In the linear model one very often has to compute X'X where X (or X') is given. It is not necessary
to write out X' and then X next to it, if one remembers that the (i, j)-th element of X'X is found
by multiplying every element in the i-th row of X' (or the i-th column of X) by the corresponding
element in the j-th row of X' (or j-th column of X) and adding these products. The diagonal terms
are found as the sum of squares of elements in the corresponding row of X' (or column of X). Also,
X'X is symmetric and thus the (i, j)-th element of X'X is the same as the (j, i)-th element.

• Example 6.3.1
 
Let X' = ( 1   1   1  1  1  1  1)
         (−3  −2  −1  0  1  2  3)
         ( 1   1   2  2  3  3  4)

Then (X'X)11 = 1² + 1² + 1² + 1² + 1² + 1² + 1² = 7

(X'X)22 = (−3)² + (−2)² + (−1)² + 0² + 1² + 2² + 3² = 28

(X'X)33 = 1² + 1² + 2² + 2² + 3² + 3² + 4² = 44

(X'X)12 = (X'X)21 = (1)(−3) + (1)(−2) + (1)(−1) + (1)(0) + (1)(1) + (1)(2) + (1)(3) = 0

(X'X)13 = (X'X)31 = (1)(1) + (1)(1) + (1)(2) + (1)(2) + (1)(3) + (1)(3) + (1)(4) = 16

(X'X)23 = (X'X)32 = (−3)(1) + (−2)(1) + (−1)(2) + (0)(2) + (1)(3) + (2)(3) + (3)(4) = 14

∴ X'X = ( 7   0  16)
        ( 0  28  14)
        (16  14  44).
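If a computer is at hand the shortcut can be verified at once; the NumPy lines below are only an informal check of Example 6.3.1.

import numpy as np

# Xt holds X' as given above, so X'X is simply Xt @ Xt.T.
Xt = np.array([[ 1,  1,  1, 1, 1, 1, 1],
               [-3, -2, -1, 0, 1, 2, 3],
               [ 1,  1,  2, 2, 3, 3, 4]])

print(Xt @ Xt.T)
# [[ 7  0 16]
#  [ 0 28 14]
#  [16 14 44]]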

• Exercise 6.3.2

Find X'X if

X' = (1   1  1   1  1   1  1   1)
     (1  −1  2  −2  3  −3  4  −4)
     (8   7  6   5  4   3  2   1)

Answer:

X'X = ( 8   0   36)
      ( 0  60   10)
      (36  10  204)

Determinants
If A is a square matrix, say p × p, and k is a common factor of every element of A, then we may
write

A = kB, say.

Then |A| = k^p |B|. This fact may often be used to simplify the calculations involved in computing
a determinant. The determinant of a small matrix can be determined quite rapidly:

1 × 1 matrices

If A = (a) then |A| = a (in this case the determinant sign should not be confused with the absolute
value sign):

|a| is the absolute value of a, that is |a| ≥ 0;

|(a)| is the determinant of the 1 × 1 matrix, and is equal to a, which could be negative or
positive.

• Example 6.3.3

If A = (-3) then |A| = -3.

2 × 2 matrices
 
If A = (a  b)
       (c  d)   then |A| = ad − bc.

• Example 6.3.4
 
If A = (1  2)
       (3  8)   then |A| = 1 × 8 − 2 × 3 = 2.

• Example 6.3.5
 
If B = ( 4  −6)
       (−6   3)   then |B| = 4 × 3 − (−6) × (−6) = −24.

3 × 3 matrices

The first step is to write down the elements of the matrix, and repeat the first column after the
third and the second after that. If

A = (a  b  c)              a  b  c  a  b
    (d  e  f)     write    d  e  f  d  e
    (g  h  i)              g  h  i  g  h

|A| consists of six terms; the first three are obtained from the diagonals that run down from left
to right. These three terms are added:

+(a × e × i) + (b × f × g) + (c × d × h)

The last three terms are subtracted, and are obtained from the diagonals that run down from right
to left:

−(c × e × g) − (a × f × h) − (b × d × i)

Thus |A| = aei + bfg + cdh − ceg − afh − bdi.

• Example 6.3.6
 
A = ( 1  2  −3)
    ( 2  4   6)
    (−3  6  10)

Write down     1  2  −3   1  2
               2  4   6   2  4
              −3  6  10  −3  6

|A| = (1 × 4 × 10) + (2 × 6 × (−3)) + ((−3) × 2 × 6)
      − ((−3) × 4 × (−3)) − (1 × 6 × 6) − (2 × 2 × 10) = −144.
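For readers who like to verify such hand calculations on a computer, here is an informal Python sketch of the rule; the function name sarrus is ours, and NumPy is assumed only for the cross-check.

import numpy as np

def sarrus(M):
    # Determinant of a 3 x 3 matrix by the diagonal rule described above.
    (a, b, c), (d, e, f), (g, h, i) = M
    return a*e*i + b*f*g + c*d*h - c*e*g - a*f*h - b*d*i

A = [[1, 2, -3], [2, 4, 6], [-3, 6, 10]]
print(sarrus(A))                           # -144
print(round(np.linalg.det(np.array(A))))   # -144, agreeing with NumPy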



• Exercise 6.3.7
 
B = ( 2  −3   4)
    (−3   6  −1) .   Show that |B| = −47.
    ( 4  −1   9)

4 × 4 matrices

This is rather more complicated. The determinant is first expanded in terms of four 3 × 3 deter-
minants, each of which is then computed as above.

| a  b  c  d |
| e  f  g  h |        | f  g  h |       | e  g  h |       | e  f  h |       | e  f  g |
| i  j  k  ℓ |  = a · | j  k  ℓ | − b · | i  k  ℓ | + c · | i  j  ℓ | − d · | i  j  k |
| m  n  p  q |        | n  p  q |       | m  p  q |       | m  n  q |       | m  n  p |

Note that the signs alternate (+ − +−) and each term consists of an element in the first row mul-
tiplied by the 3 × 3 determinant obtained by deleting the row and column in which that element
occurs.

This should give you a foretaste of how a 5 × 5 or larger determinant is computed.
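The same expansion is easy to code. The following short Python function is only an informal sketch (the 4 × 4 matrix used to test it is invented); it expands a determinant of any size along its first row, exactly as described above.

def det(M):
    # Determinant by cofactor expansion along the first row.
    n = len(M)
    if n == 1:
        return M[0][0]
    total = 0
    for j in range(n):
        minor = [row[:j] + row[j+1:] for row in M[1:]]   # delete the first row and column j+1
        total += (-1) ** j * M[0][j] * det(minor)
    return total

print(det([[1, 0, 2, -1],
           [3, 0, 0,  5],
           [2, 1, 4, -3],
           [1, 0, 5,  0]]))   # 30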

Inverses
Some points to remember:

(a) If |A| = 0 then A⁻¹ does not exist.

(b) If A = kB where k is a constant, then A⁻¹ = (1/k)B⁻¹. (This fact may be used to simplify
computations if a common factor exists for each element of A.)
 
(c) If A = ( A1  O   · · ·  O  )
           ( O   A2  · · ·  O  )
           ( O   O   · · ·  Am )

where A1, A2, · · · , Am are square matrices (and where A1, · · · , Am must be non-singular if A
is non-singular) then

A⁻¹ = ( A1⁻¹  O     · · ·  O    )
      ( O     A2⁻¹  · · ·  O    )
      ( O     O     · · ·  Am⁻¹ )

This fact can often be used to simplify calculations; a small numerical sketch is given after
point (d) below.

(d) If A is symmetric, then A⁻¹ is also symmetric. In this case one has to compute the diagonal of
A⁻¹ and the elements above the main diagonal (say), while the elements below the diagonal
can be written down directly from the corresponding elements above the diagonal. We now
show the computations for small matrices.
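Before turning to the small cases, here is an informal NumPy sketch of point (c); the two blocks A1 and A2 are invented for illustration.

import numpy as np

A1 = np.array([[2.0]])                     # a 1 x 1 block
A2 = np.array([[3.0, 1.0], [1.0, 2.0]])    # a 2 x 2 block

# Assemble the block-diagonal matrix and invert it blockwise.
A = np.block([[A1, np.zeros((1, 2))],
              [np.zeros((2, 1)), A2]])
A_inv = np.block([[np.linalg.inv(A1), np.zeros((1, 2))],
                  [np.zeros((2, 1)), np.linalg.inv(A2)]])

print(np.allclose(A_inv, np.linalg.inv(A)))   # True: blockwise inversion agrees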

1 × 1 matrices

If A = (a) then A⁻¹ = (1/a) if a ≠ 0 (if a = 0 then A is singular).

2 × 2 matrices
If A = (a  b)                ( d/|A|   −b/|A| )
       (c  d)   then A⁻¹ =   ( −c/|A|   a/|A| ) .

• Example 6.3.8
 
A = (2  3)
    (3  6)

|A| = 12 − 9 = 3

A⁻¹ = ( 6/3  −3/3)     ( 2   −1 )
      (−3/3   2/3)  =  (−1   2/3) .

• Exercise 6.3.9

B = ( 4  −2)            (1/2  1/2)
    (−2   2)  ;  B⁻¹ =  (1/2   1 ) .

3 × 3 matrices

The (i, j)-th element of A⁻¹ is obtained as follows:

(A⁻¹)ij = (−1)^(i+j) × (the determinant of the 2 × 2 matrix obtained by deleting the j-th row and
the i-th column of A) ÷ (the determinant of A).

• Example 6.3.10
 
A = (2   1   0)
    (1   3  −2)        |A| = 2
    (0  −2   2)

(each 2 × 2 determinant below is written with its rows separated by a semicolon)

(A⁻¹)11 = (−1)² |3  −2; −2  2| / |A| = 1

(A⁻¹)22 = (−1)⁴ |2  0;  0  2| / |A| = 2

(A⁻¹)33 = (−1)⁶ |2  1;  1  3| / |A| = 2½

(A⁻¹)12 = (A⁻¹)21 = (−1)³ |1  0; −2  2| / |A| = −1

(A⁻¹)13 = (A⁻¹)31 = (−1)⁴ |1  0;  3  −2| / |A| = −1

(A⁻¹)23 = (A⁻¹)32 = (−1)⁵ |2  0;  1  −2| / |A| = 2

∴ A⁻¹ = ( 1  −1  −1 )
        (−1   2   2 )
        (−1   2  2½ )

(Test: AA⁻¹ = I)
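The test is quick to run in NumPy (an informal check of Example 6.3.10 only):

import numpy as np

A = np.array([[2.0,  1.0,  0.0],
              [1.0,  3.0, -2.0],
              [0.0, -2.0,  2.0]])
A_inv = np.array([[ 1.0, -1.0, -1.0],
                  [-1.0,  2.0,  2.0],
                  [-1.0,  2.0,  2.5]])

print(np.allclose(A @ A_inv, np.eye(3)))      # True: AA^{-1} = I
print(np.allclose(A_inv, np.linalg.inv(A)))   # True: agrees with NumPy's inverse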

• Example 6.3.11
   
A = (12   0   0)         (2  0  0)
    ( 0  18   6)  =  6 × (0  3  1)
    ( 0   6  12)         (0  1  2)

A⁻¹ = 1/6 × (2  0  0)⁻¹          (1/2    0     0 )
            (0  3  1)    = 1/6 × ( 0    2/5  −1/5)
            (0  1  2)            ( 0   −1/5   3/5)

    = (1/12     0      0  )
      (  0     1/15  −1/30)
      (  0    −1/30   1/10)

• Exercise 6.3.12

   
B = ( 3/4  −2/4  1/4)            ( 4   2  −4)
    (−2/4   4/4   0 )  ;  B⁻¹ =  ( 2   2  −2) .
    ( 1/4    0   1/4)            (−4  −2   8)
