You are on page 1of 40

Chapter 3

Multiple Linear Regression Model


We consider the problem of regression when the study variable depends on more than one explanatory or
independent variables, called a multiple linear regression model. This model generalizes the simple linear
regression in two ways. It allows the mean function E ( y ) to depend on more than one explanatory variables
and to have shapes other than straight lines, although it does not allow for arbitrary shapes.

The linear model:


Let y denotes the dependent (or study) variable that is linearly related to k independent (or explanatory)
variables X 1 , X 2 ,..., X k through the parameters 1 ,  2 ,...,  k and we write

y  X 11  X 2  2  ...  X k  k   .
This is called the multiple linear regression model. The parameters 1 ,  2 ,...,  k are the regression

coefficients associated with X 1 , X 2 ,..., X k respectively and  is the random error component reflecting the
difference between the observed and fitted linear relationship. There can be various reasons for such
difference, e.g., the joint effect of those variables not included in the model, random factors which can not
be accounted for in the model etc.

Note that the j th regression coefficient  j represents the expected change in y per unit change in the j th

independent variable X j . Assuming E ( )  0,

E ( y )
j  .
X j

Linear model:
y E ( y )
A model is said to be linear when it is linear in parameters. In such a case (or equivalently )
 j  j

should not depend on any  ' s . For example,


i) y   0  1 X is a linear model as it is linear in the parameters.

ii) y   0 X 1 can be written as

log y  log  0  1 log X


y*   0*  1 x*
which is linear in the parameter  0* and 1 , but nonlinear is variables y*  log y, x*  log x. So it is

a linear model.
Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
1
iii) y   0  1 X   2 X 2
is linear in parameters  0 , 1 and  2 but it is nonlinear is variables X . So it is a linear model

1
iv) y  0 
X  2
is nonlinear in the parameters and variables both. So it is a nonlinear model.
v) y   0  1 X 2
is nonlinear in the parameters and variables both. So it is a nonlinear model.
vi) y   0  1 X   2 X 2  3 X 3
is a cubic polynomial model which can be written as
y   0  1 X   2 X 2  3 X 3
which is linear in the parameters  0 , 1 ,  2 ,  3 and linear in the variables X 1  X , X 2  X 2 , X 3  X 3 .

So it is a linear model.

Example:
The income and education of a person are related. It is expected that, on average, a higher level of education
provides higher income. So a simple linear regression model can be expressed as
income   0  1 education   .

Not that 1 reflects the change in income with respect to per unit change in education and  0 reflects the

income when education is zero as it is expected that even an illiterate person can also have some income.

Further, this model neglects that most people have higher income when they are older than when they are
young, regardless of education. So 1 will over-state the marginal impact of education. If age and education
are positively correlated, then the regression model will associate all the observed increase in income with an
increase in education. So a better model is
income   0  1 education   2 age   .
Often it is observed that the income tends to rise less rapidly in the later earning years than is early years. To
accommodate such a possibility, we might extend the model to
income   0  1education   2 age  3age 2  
This is how we proceed for regression modeling in real-life situation. One needs to consider the experimental
condition and the phenomenon before making the decision on how many, why and how to choose the
dependent and independent variables.

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
2
Model set up:
Let an experiment be conducted n times, and the data is obtained as follows:
Observation number Response Explanatory variables
y X1 X 2  X k

1 y1 x11 x12  x1k


2 y2 x21 x22  x2 k
     
yn xn1 xn 2  xnk
n

Assuming that the model is


y   0  1 X 1   2 X 2  ...   k X k   ,
the n-tuples of observations are also assumed to follow the same model. Thus they satisfy
y1   0  1 x11   2 x12  ...   k x1k  1
y2   0  1 x21   2 x22  ...   k x2 k   2
 
yn   0  1 xn1   2 xn 2  ...   k xnk   n .
These n equations can be written as
 y1  1 x11 x12  x1k   0   1 
      
 y2    1 x21 x22  x2 k  1    2 

         
      
 yn   1 xn1 xn 2  xnk    k    n 

or y  X    .
In general, the model with k explanatory variables can be expressed as
y  X 
where y  ( y1 , y2 ,..., yn ) ' is a n  1 vector of n observation on study variable,

 x11 x12  x1k 


 
 x21 x22  x2 k 
X
    
 
 xn1 xn 2  xnk 
is a n  k matrix of n observations on each of the k explanatory variables,   ( 1 ,  2 ,...,  k ) ' is a k 1
vector of regression coefficients and   (1 ,  2 ,...,  n ) ' is a n 1 vector of random error components or
disturbance term.

If intercept term is present, take first column of X to be (1,1,…,1)’.

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
3
Assumptions in multiple linear regression model
Some assumptions are needed in the model y  X    for drawing the statistical inferences. The following
assumptions are made:
(i) E ( )  0

(ii) E ( ')   2 I n

(iii) Rank ( X )  k
(iv) X is a non-stochastic matrix
(v)  ~ N (0,  2 I n ) .

These assumptions are used to study the statistical properties of the estimator of regression coefficients. The
following assumption is required to study, particularly the large sample properties of the estimators.

 X 'X 
(vi) lim     exists and is a non-stochastic and nonsingular matrix (with finite elements).
n 
 n 

The explanatory variables can also be stochastic in some cases. We assume that X is non-stochastic unless
stated separately.

We consider the problems of estimation and testing of hypothesis on regression coefficient vector under the
stated assumption.

Estimation of parameters:
A general procedure for the estimation of regression coefficient vector is to minimize
n n

 M ( i )   M ( yi  xi11  xi 2  2  ...  xik  k )


i 1 i 1

for a suitably chosen function M .


Some examples of choice of M are
M ( x)  x
M ( x)  x 2

M ( x)  x , in general.
p

We consider the principle of least square which is related to M ( x)  x 2 and method of maximum likelihood
estimation for the estimation of parameters.

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
4
Principle of ordinary least squares (OLS)
Let B be the set of all possible vectors  . If there is no further information, the B is k -dimensional real
Euclidean space. The object is to find a vector b '  (b1 , b2 ,..., bk ) from B that minimizes the sum of squared

deviations of  i ' s, i.e.,


n
S (  )    i2   '   ( y  X  ) '( y  X  )
i 1

for given y and X . A minimum will always exist as S (  ) is a real-valued, convex and differentiable
function. Write
S (  )  y ' y   ' X ' X   2 ' X ' y .
Differentiate S (  ) with respect to 
S (  )
 2X ' X   2X ' y

 2 S ( )
 2 X ' X (atleast non-negative definite).
 2
The normal equation is
S (  )
0

 X ' Xb  X ' y
where the following result is used:
Result: If f ( z )  Z ' AZ is a quadratic form, Z is a m 1 vector and A is any m  m symmetric matrix

then F ( z )  2 Az .
z

Since it is assumed that rank ( X )  k (full rank), then X ' X is a positive definite and unique solution of the
normal equation is
b  ( X ' X ) 1 X ' y
which is termed as ordinary least squares estimator (OLSE) of  .

 2 S ( )
Since is at least non-negative definite, so b minimize S (  ) .
 2

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
5
In case, X is not of full rank, then
b  ( X ' X )  X ' y   I  ( X ' X )  X ' X  

where ( X ' X )  is the generalized inverse of X ' X and  is an arbitrary vector. The generalized inverse

( X ' X )  of X ' X satisfies

X ' X ( X ' X ) X ' X  X ' X


X ( X ' X ) X ' X  X
X ' X ( X ' X ) X '  X '
Theorem:
(i) Let ŷ  Xb be the empirical predictor of y . Then ŷ has the same value for all solutions b of
X ' Xb  X ' y.
(ii) S (  ) attains the minimum for any solution of X ' Xb  X ' y.
Proof:
(i) Let b be any member in
b  ( X ' X )  X ' y   I  ( X ' X )  X ' X   .

Since X ( X ' X )  X ' X  X , so then

Xb  X ( X ' X )  X ' y  X  I  ( X ' X )  X ' X  

= X ( X ' X ) X ' y
which is independent of  . This implies that ŷ has the same value for all solution b of X ' Xb  X ' y.
(ii) Note that for any  ,

S (  )   y  Xb  X (b   )   y  Xb  X (b   ) 
 ( y  Xb)( y  Xb)  (b   ) X ' X (b   )  2(b   ) X ( y  Xb)
 ( y  Xb)( y  Xb)  (b   ) X ' X (b   ) (Using X ' Xb  X ' y )
 ( y  Xb)( y  Xb)  S (b)
 y ' y  2 y ' Xb  b ' X ' Xb
 y ' y  b ' X ' Xb
 y ' y  yˆ ' yˆ .

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
6
Fitted values:
If ̂ is any estimator of  for the model y  X    , then the fitted values are defined as

ŷ  X ˆ where ̂ is any estimator of  .

In the case of ˆ  b,
yˆ  Xb
 X ( X ' X ) 1 X ' y
 Hy

where H  X ( X ' X ) 1 X ' is termed as Hat matrix which is


(i) symmetric
(ii) idempotent (i.e., HH  H ) and

(iii) tr H  tr X ( X X ) 1 X '  tr X ' X ( X ' X ) 1  tr I k  k .

Residuals
The difference between the observed and fitted values of the study variable is called as residual. It is
denoted as
e  y ~ yˆ
 y  yˆ
 y  Xb
 y  Hy
 (I  H ) y
 Hy

where H  I  H .

Note that
(i) H is a symmetric matrix
(ii) H is an idempotent matrix, i.e.,
HH  ( I  H )( I  H )  ( I  H )  H and

(iii) trH  trI n  trH  (n  k ).

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
7
Properties of OLSE
(i) Estimation error:
The estimation error of b is
b    ( X ' X ) 1 X ' y  
 ( X ' X ) 1 X '( X    )  
 ( X ' X ) 1 X ' 

(ii) Bias
Since X is assumed to be nonstochastic and E ( )  0

E (b   )  ( X ' X ) 1 X ' E ( )
 0.
Thus OLSE is an unbiased estimator of  .

(iii) Covariance matrix


The covariance matrix of b is
V (b)  E (b   )(b   ) '
 E ( X ' X ) 1 X '  ' X ( X ' X ) 1 
 ( X ' X ) 1 X ' E ( ') X ( X ' X ) 1
  2 ( X ' X ) 1 X ' IX ( X ' X ) 1
  2 ( X ' X ) 1.

(iv) Variance
The variance of b can be obtained as the sum of variances of all b1 , b2 ,..., bk which is the trace of covariance

matrix of b . Thus
Var (b)  tr V (b) 
k
  E (bi   i ) 2
i 1
k
  Var (bi ).
i 1

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
8
Estimation of  2
The least-squares criterion can not be used to estimate  2 because  2 does not appear in S (  ) . Since

E ( i2 )   2 , so we attempt with residuals ei to estimate  2 as follows:


e  y  yˆ
 y  X ( X ' X ) 1 X ' y
 [ I  X ( X ' X ) 1 X '] y
 Hy.
Consider the residual sum of squares
n
SSr e s   ei2
i 1

 e 'e
 ( y  Xb) '( y  Xb)
 y '( I  H )( I  H ) y
 y '( I  H ) y
 y ' Hy.
Also
SS r e s  ( y  Xb) '( y  Xb)
 y ' y  2b ' X ' y  b ' X ' Xb
 y ' y  b ' X ' y (Using X ' Xb  X ' y )

SSr e s  y ' Hy
 (X    )'H (X   )
  ' H  (Using HX  0)

Since  ~ N (0,  2 I ) , so y ~ N ( X  ,  2 I ) . Hence y ' Hy ~  2 (n  k ) .

Thus E[ y ' Hy ]  (n  k ) 2

 y ' Hy 
or E  2
 nk 
or E  MSr e s    2

SSr e s
where MSr e s  is the mean sum of squares due to residual.
nk
Thus an unbiased estimator of  2 is
ˆ 2  MSr e s  s 2 (say)
which is a model-dependent estimator.

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
9
Variance of ŷ
The variance of ŷ is
V ( yˆ )  V ( Xb)
 XV (b) X '
  2 X ( X ' X ) 1 X '
  2H.

Gauss-Markov Theorem:
The ordinary least squares estimator (OLSE) is the best linear unbiased estimator (BLUE) of  .
Proof: The OLSE of  is

b  ( X ' X ) 1 X ' y
which is a linear function of y . Consider the arbitrary linear estimator

b*  a ' y
of linear parametric function  '  where the elements of a are arbitrary constants.

Then for b* ,
E (b* )  E (a ' y )  a ' X 

and so b* is an unbiased estimator of  '  when

E (b* )  a ' X    ' 


 a ' X   '.
Since we wish to consider only those estimators that are linear and unbiased, so we restrict ourselves to
those estimators for which a ' X   '.

Further
Var (a ' y )  a 'Var ( y )a   2 a ' a
Var ( ' b)   'Var (b)
  2 a ' X ( X ' X ) 1 X ' a.
Consider
Var (a ' y )  Var ( ' b)   2  a ' a  a ' X ( X ' X ) 1 X ' a 
  2 a '  I  X ( X ' X ) 1 X ' a
  2 a '( I  H )a.

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
10
Since ( I  H ) is a positive semi-definite matrix, so
Var (a ' y )  Var ( ' b)  0 .

This reveals that if b* is any linear unbiased estimator then its variance must be no smaller than that of b .
Consequently b is the best linear unbiased estimator, where ‘best’ refers to the fact that b is efficient within
the class of linear and unbiased estimators.

Maximum likelihood estimation:


In the model, y  X    , it is assumed that the errors are normally and independently distributed with

constant variance  2 or  ~ N (0,  2 I ).


The normal density function for the errors is
1  1 
f ( i )  exp   2  i2  i  1, 2,..., n. .
 2  2 
The likelihood function is the joint density of 1 ,  2 ,...,  n given as
n
L(  ,  2 )   f ( i )
i 1

1  1 n 2

(2 2 ) n /2
exp   2 2   i 
 i 1 
1  1 
 exp   2  '  
(2 )
2 n /2
 2 
 1
1 
 exp   2 ( y  X  ) '( y  X  )  .
(2 )  2
2 n /2

Since the log transformation is monotonic, so we maximize ln L(  ,  2 ) instead of L(  ,  2 ) .
n 1
ln L(  ,  2 )   ln(2 2 )  2 ( y  X  ) '( y  X  ) .
2 2
The maximum likelihood estimators (m.l.e.) of  and  2 are obtained by equating the first-order

derivatives of ln L(  ,  2 ) with respect to  and  2 to zero as follows:

 ln L(  ,  2 ) 1
 2 X '( y  X  )  0
 2 2
 ln L(  ,  2 ) n 1
 2  ( y  X  ) '( y  X  ).
 2
2 2( 2 ) 2
The likelihood equations are given by
X 'X  X 'y
1
 2  ( y  X  ) '( y  X  ).
n
Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
11
Since rank( X )  k , so that the unique m.l.e. of  and  2 are obtained as

  ( X ' X ) 1 X ' y
1
 2  ( y  X  ) '( y  X  ).
n

Further to verify that these values maximize the likelihood function, we find
 2 ln L(  ,  2 ) 1
 2 X 'X
 2

 2 ln L(  ,  2 ) n 1
  6 ( y  X  ) '( y  X  )
 ( )
2 2 2
2 4

 2 ln L(  ,  2 ) 1
  4 X '( y  X  ).
 2

Thus the Hessian matrix of second-order partial derivatives of ln L(  ,  2 ) with respect to  and  2 is

  2 ln L(  ,  2 )  2 ln L(  ,  2 ) 
 
  2  2 
  2 ln L(  ,  2 )  ln L(  ,  ) 
2 2

 
  2   2 ( 2 ) 2 

which is negative definite at    and  2   2 . This ensures that the likelihood function is maximized at
these values.

Comparing with OLSEs, we find that


(i) OLSE and m.l.e. of  are same. So m.l.e. of  is also an unbiased estimator of  .
nk 2
(ii) OLSE of  2 is s 2 which is related to m.l.e. of  2 as  2  s . So m.l.e. of  2 is a
n
biased estimator of  2 .

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
12
Consistency of estimators
(i) Consistency of b :
 X 'X 
Under the assumption that lim     exists as a nonstochastic and nonsingular matrix (with finite
n 
 n 
elements), we have
1
1 X 'X 
lim V (b)   2 lim  
n  n  n
 n 
1
  2 lim  1
n  n

 0.
This implies that OLSE converges to  in quadratic mean. Thus OLSE is a consistent estimator of  . This
holds true for maximum likelihood estimators also.

The same conclusion can also be proved using the concept of convergence in probability.
An estimator ˆn converges to  in probability if

lim P  ˆn       0 for any   0


n   

and is denoted as plim(ˆn )   .

The consistency of OLSE can be obtained under the weaker assumption that
 X 'X 
plim    * .
 n 
exists and is a nonsingular and nonstochastic matrix such that
 X ' 
plim    0.
 n 
Since
b    ( X ' X ) 1 X ' 
1
 X ' X  X '
  .
 n  n
So
1
 X 'X   X ' 
plim(b   )  plim   plim  
 n   n 
 *1.0
 0.
Thus b is a consistent estimator of  . Same is true for m.l.e. also.
Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
13
(ii) Consistency of s 2
Now we look at the consistency of s 2 as an estimate of  2 as
1
s2  e 'e
nk
1
  ' H
nk
1
1 k 
 1    '    ' X ( X ' X ) 1 X '  
n n
 k    '  ' X  X ' X  X ' 
1 1

 1       .
 n   n n  n  n 

 ' 1 n 2
Note that
n
consists of   i and { i2 , i  1, 2,..., n} is a sequence of independently and identically
n i 1

distributed random variables with mean  2 . Using the law of large numbers
  ' 
 
2
plim 
 n 
  ' X  X ' X  1 X '     'X    X ' X  
1
X ' 
plim       plim   plim     plim 
 n  n  n   n    n    n 
 0.*1.0
0
 plim( s 2 )  (1  0) 1  2  0 
  2.

Thus s 2 is a consistent estimator of  2 . The same holds true for m.l.e. also.

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
14
Cramer-Rao lower bound
Let   (  ,  2 ) ' . Assume that both  and  2 are unknown. If E (ˆ)   , then the Cramer-Rao lower

bound for ˆ is grater than or equal to the matrix inverse of


  2 ln L( ) 
I ( )   E  
  ' 
   ln L(  ,  2 )    ln L(  ,  2 )  
 E   E 
   2    
2


   ln L(  ,  2 )    ln L(  ,  )  
2
 E   E 
       ( )  
2 2 2 2

  X 'X   X '( y  X  )  
 E   2  E
 4  
   
 
 (y  X )' X   n ( y  X  ) '( y  X  )  
 E   E 4   
  4  2 6 
X 'X 
 2 0 
 .
 0 n 
 2 4 
Then
 2 ( X ' X ) 1 0 
 I ( )   
1
2 4 
0
 n 
is the Cramer-Rao lower bound matrix of  and  2 .

The covariance matrix of OLSEs of  and  2 is

 2 ( X ' X ) 1 0 
 OLS   0 2 4 

 n  k 
which means that the Cramer-Rao have bound is attained for the covariance of b but not for s 2 .

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
15
Standardized regression coefficients:
Usually, it is difficult to compare the regression coefficients because the magnitude of ˆ j reflects the units

of measurement of j th explanatory variable X j . For example, in the following fitted regression model

yˆ  5  X 1  1000 X 2 ,
y is measured in litres, X 1 in litres and X 2 in millilitres. Although ˆ2  ˆ1 but the effect of both

explanatory variables is identical. One litre change in either X 1 and X 2 when another variable is held fixed

produces the same change is ŷ .

Sometimes it is helpful to work with scaled explanatory variables and study variable that produces
dimensionless regression coefficients. These dimensionless regression coefficients are called as
standardized regression coefficients.

There are two popular approaches for scaling, which gives standardized regression coefficients. We discuss
them as follows:

1. Unit normal scaling:


Employ unit normal scaling to each explanatory variable and study variable. So define
xij  x j
zij  , i  1, 2,..., n, j  1, 2,..., k
sj
yi  y
yi* 
sy
1 n 1 n
where s 2j  
n  1 i 1
( xij  x j ) 2 and s y2  
n  1 i 1
( yi  y ) 2 are the sample variances of j th explanatory

variable and study variable, respectively.

All scaled explanatory variable and scaled study variable has mean zero and sample variance unity, i.e.,
using these new variables, the regression model becomes
yi*   1 zi1   2 zi 2  ...   k zik   i , i  1, 2,..., n.

Such centering removes the intercept term from the model. The least-squares estimate of   ( 1 ,  2 ,...,  k ) '
is
ˆ  ( Z ' Z ) 1 Z ' y* .
This scaling has a similarity to standardizing a normal random variable, i.e., observation minus its mean and
divided by its standard deviation. So it is called as a unit normal scaling.

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
16
2. Unit length scaling:
In unit length scaling, define
xij  x j
ij  , i  1, 2,..., n; j  1, 2,..., k
S 1/2
jj

yi  y
yi0 
SST1/2
n
where S jj   ( xij  x j ) 2 is the corrected sum of squares for j th explanatory variables X j and
i 1
n
ST  SST   ( yi  y ) 2 is the total sum of squares. In this scaling, each new explanatory variable W j has
i 1

1 n n
mean  j   ij  0 and length
n i 1
 (
i 1
ij   j ) 2  1.

In terms of these variables, the regression model is


yio  1i1   2i 2  ...   k ik   i , i  1, 2,..., n.
The least-squares estimate of the regression coefficient   (1 ,  2 ,...,  k ) ' is

ˆ  (W 'W )1W ' y 0 .


In such a case, the matrix W 'W is in the form of the correlation matrix, i.e.,
1 r12 r13  r1k 
 
 r12 1 r23  r2 k 
W 'W   r13 r23 1  r3k 
 
      
r r3k  1 
 1k r2 k
where
n

 (x ui  xi )( xuj  x j )
Sij
rij  u 1
1/2

( Sii S jj ) ( Sii S jj )1/2
is the simple correlation coefficient between the explanatory variables X i and X j . Similarly

W ' y o  (r1 y , r2 y ,..., rky ) '

where
n

 (x uj  x j )( yu  y )
Siy
rjy  u 1
1/2

( S jj SST ) ( S jj SST )1/2

is the simple correlation coefficient between the j th explanatory variable X j and study variable y .

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
17
Note that it is customary to refer rij and rjy as correlation coefficient though X i ' s are not random variable.

If unit normal scaling is used, then


Z ' Z  (n  1)W 'W .

So the estimates of regression coefficient in unit normal scaling (i.e., ˆ ) and unit length scaling (i.e., ˆ ) are

identical. So it does not matter which scaling is used, so ˆ  ˆ .

The regression coefficients obtained after such scaling, viz., ˆ or ˆ usually called standardized regression
coefficients.

The relationship between the original and standardized regression coefficients is


1/ 2
 SS 
b j  ˆ j  T  , j  1, 2,..., k
 S
 jj 
and
k
b0  y   b j x j
j 1

where b0 is the OLSE of intercept term and b j are the OLSE of slope parameters.

The model in deviation form


The multiple linear regression model can also be expressed in the deviation form.
First, all the data is expressed in terms of deviations from the sample mean.
The estimation of regression parameters is performed in two steps:
 First step: Estimate the slope parameters.
 Second step : Estimate the intercept term.
The multiple linear regression model in deviation form is expressed as follows:
Let
1
A  I   '
n
where   1,1,...,1 ' is a n  1 vector of each element unity. So

1 0  0 1 1  1
0 1  0  1 1
 1  1
A  .
     n    
   
0 0  1 1 1  1
Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
18
Then
 y1 
 
1 n 1 y 1
y   yi  1,1,...,1  2    ' y
n i 1 n    n
 
 yn 
Ay  y  y   y1  y , y2  y ,..., yn  y  '.

Thus pre-multiplication of any column vector by A produces a vector showing those observations in
deviation form:

Note that
1
A     ' 
n
1
   .n
n
 
0
and A is a symmetric and idempotent matrix.

In the model
y  X  ,
the OLSE of  is

b X 'X  X 'y


1

and the residual vector is


e  y  Xb.

Note that Ae  e.

If the n  k matrix is partitioned as


X   X 1 X 2* 

where X 1  1,1,...,1 ' is n  1 vector with all elements unity, X 2* is n   k  1 matrix of observations of

 k  1 explanatory variables X 2 , X 3 ,..., X k and OLSE b   b1 , b2* ' is suitably partitioned with OLSE of

intercept term 1 as b1 and b2 as a  k  1 1 vector of OLSEs associated with  2 ,  3 ,...,  k .


Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
19
Then
y  X 1b1  X 2*b2*  e.

Premultiply by A,

Ay  AX 1b1  AX 2*b2*  Ae
 AX 2*b2*  e.

Premultiply by X 2* gives

X 2* ' Ay  X 2* ' AX 2*b2*  X 2* ' e


 X 2* ' AX 2*b2* .

Since A is symmetric and idempotent,

 AX  '  Ay    AX  '  AX  b
*
2
*
2
*
2
*
2 ..

This equation can be compared with the normal equations X ' y  X ' Xb in the model y  X    . Such a
comparison yields the following conclusions:
 b2* is the sub vector of OLSE.

 Ay is the study variables vector in deviation form.

 AX 2* is the explanatory variable matrix in deviation form.

 This is the normal equation in terms of deviations. Its solution gives OLS of slope coefficients as

b2*   AX 2*  '  AX 2*    AX  '  Ay  .


1
*
2

The estimate of the intercept term is obtained in the second step as follows:
1
Premultiplying y  Xb  e by  ' gives
n
1 1 1
 ' y   ' Xb   ' e
n n n
 b1 
 
b
y  1 X 2 X 3 ... X k   2   0

 
 bk 
 b1  y  b2 X 2  b3 X 3  ...  bk X k .
Now we explain various sums of squares in terms of this model.

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
20
The expression of the total sum of squares (TSS) remains the same as earlier and is given by
TSS  y ' Ay.
Since
Ay  AX 2*b2*  e
y ' Ay  y ' AX 2*b2*  y ' e
  Xb  e  ' AX 2*b2*  y ' e

  X 1b1  X 2*b2*  e  ' AX 2*b2*   X 1b1  X 2*b2*  e  ' e

 b2* ' X 2* ' AX 2*b2*  e ' e


TSS  SSreg  SS res

where the sum of squares due to regression is


SS reg  b2* ' X 2* ' AX 2*b2*

and the sum of squares due to residual is


SSres  e ' e .

Testing of hypothesis:
There are several important questions which can be answered through the test of hypothesis concerning the
regression coefficients. For example
1. What is the overall adequacy of the model?
2. Which specific explanatory variables seem to be important?
etc.

In order the answer such questions, we first develop the test of hypothesis for a general framework, viz.,
general linear hypothesis. Then several tests of hypothesis can be derived as its special cases. So first, we
discuss the test of a general linear hypothesis.

Test of hypothesis for H 0 : R  r


We consider a general linear hypothesis that the parameters in  are contained in a subspace of parameter
space for which R  r , where R is ( J  k ) a matrix of known elements and r is a ( J  1 ) vector of known
elements.

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
21
In general, the null hypothesis
H 0 : R  r
is termed as general linear hypothesis and
H1 : R   r
is the alternative hypothesis.

We assume that rank ( R)  J , i.e., full rank so that there is no linear dependence in the hypothesis.
Some special cases and interesting example of H 0 : R  r are as follows:

(i) H 0 : i  0
Choose J  1, r  0, R  [0, 0,..., 0,1, 0,..., 0] where 1 occurs at the i th position is R .
This particular hypothesis explains whether X i has any effect on the linear model or not.

(ii) H 0 :  3   4 or H 0 : 3   4  0

Choose J  1, r  0, R  [0, 0,1, 1, 0,..., 0]


(iii) H 0 : 3   4  5

or H 0 : 3   4  0, 3   5  0

0 0 1  1 0 0 ... 0 
Choose J  2, r  (0, 0) ', R   .
0 0 1 0  1 0 ... 0 
(iv) H 0 :  3 5  4  2

Choose J  1, r  2, R   0, 0,1,5, 0...0

(v) H 0 :  2   3  ...   k  0
J  k 1
r  (0, 0,..., 0) '
0 1 0 ... 0  0 
0 0 1 ... 0  0 I 
R   k 1 
.
       
   
0 0 0 ... 1  ( k 1)k 0 
This particular hypothesis explains the goodness of fit. It tells whether  i has a linear effect or not and are

they of any importance. It also tests that X 2 , X 3 ,..., X k have no influence in the determination of y . Here

1  0 is excluded because this involves additional implication that the mean level of y is zero. Our main
concern is to know whether the explanatory variables help to explain the variation in y around its mean
value or not.
Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
22
We develop the likelihood ratio test for H 0 : R   r.

Likelihood ratio test:


The likelihood ratio test statistic is

max L(  ,  2 | y, X ) Lˆ ()
 
max L(  ,  | y, X , R   r ) Lˆ ( )
2

where  is the whole parametric space and  is the sample space.

If both the likelihoods are maximized, one constrained, and the other unconstrained, then the value of the
unconstrained will not be smaller than the value of the constrained. Hence   1.

First, we discuss the likelihood ratio test for a more straightforward case when
R  I k and r   0 , i.e.,    0 . This will give us a better and detailed understanding of the minor details,

and then we generalize it for R  r , in general.

Likelihood ratio test for H 0 :    0


Let the null hypothesis related to k  1 vector  is
H 0 :   0

where  0 is specified by the investigator. The elements of  0 can take on any value, including zero. The

concerned alternative hypothesis is


H1 :    0 .

Since  ~ N (0,  2 I ) in y  X    , so y ~ N ( X  ,  2 I ). Thus the whole parametric space and sample


space are  and  respectively given by

 : (  ,  2 ) :     i  ,  2  0, i  1, 2,..., k
 : (  ,  2 ) :    0 ,  2  0 .

The unconstrained likelihood under  .


1  1 
L(  ,  2 | y, X )  exp   2 ( y  X  ) '( y  X  )  .
(2 )2 n /2
 2 

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
23
This is maximized over  when
  ( X ' X ) 1 X ' y
1
 2  ( y  X  ) '( y  X  ).
n

where  and  2 are the maximum likelihood estimates of  and  2 which are the values maximizing the
likelihood function.
Lˆ ()  max L   ,  2 | y, X ) 
 
 
1  ( y  X  ) '( y  X  ) 
 exp 
n
  2( y  X  ) '( y  X  )  
 2    2
  
 n ( y  X  ) '( y  X  )    n  
 n
n n /2 exp   
  2 .
n
(2 ) ( y  X  ) '( y  X  )  2
n /2

The constrained likelihood under  is


1  1 
Lˆ ( )  max L(  ,  2 | y, X ,    0 )  exp   2 ( y  X  0 ) '( y  X  0 )  .
(2 ) 2 n /2
 2 
Since  0 is known, so the constrained likelihood function has an optimum variance estimator

1
2  ( y  X  0 ) '( y  X  0 )
n
 n
n n /2 exp   
Lˆ ( )   2 .
n /2
(2 ) ( y  X  0 ) '( y  X  0) 
n /2

The likelihood ratio is


 
 n n /2 exp(n / 2) 
 (2 ) ( y  X  ) '( y  X  )  
n /2
   
n /2
Lˆ ()  

ˆ
L( )  
 n n /2 exp(n / 2) 
 (2 ) n /2 ( y  X  ) '( y  X  )  n /2 
  0 0  
n /2
 ( y  X  0 ) '( y  X  0 ) 
 
 ( y  X  ) '( y  X  ) 
n/ 2
 2 
  
n /2
 2 
  

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
24
where
( y  X  0 ) '( y  X  0 )

( y  X  ) '( y  X  )
is the ratio of the quadratic forms. Now we simplify the numerator in  as follows:

( y  X  0 ) '( y  X  0 )  ( y  X  )  X (    0 )  ( y  X  )  X (    0 ) 
 ( y  X  ) '( y  X  )  2 y '  I  X ( X ' X ) 1 X ' X (    0 )  (    0 ) ' X ' X (    0 )
 ( y  X  ) '( y  X  )  (    0 ) ' X ' X (    0 ).
Thus
( y  X  ) '( y  X  )  (    0 ) ' X ' X (    0 )

( y  X  ) '( y  X  )
(    0 ) ' X ' X (    0 )
 1
( y  X  ) '( y  X  )
(    0 ) ' X ' X (    0 )
or   1  0 
( y  X  ) '( y  X  )

where 0  0  .

Distribution of ratio of quadratic forms


Now we find the distribution of the quadratic forms involved is 0 to find the distribution of 0 as follows:

( y  X  ) '( y  X  )  e ' e
 y '  I  X ( X ' X ) 1 X ' y
 y ' Hy
 (X    )'H (X    )
  ' H (using HX  0)
 (n  k )ˆ 2

Result: If Z is a n  1 random vector that is distributed as N (0,  2 I n ) and A is any symmetric idempotent

Z ' AZ
n  n matrix of rank, p then ~  2 ( p). If B is another n  n symmetric idempotent matrix of rank
 2

Z ' BZ
q , then ~  2 (q) . If AB  0 then Z ' AZ is distributed independently of Z ' BZ .
2
So using this result, we have
y ' Hy (n  k )ˆ 2
 ~  2 (n  k ).
 2
 2

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
25
Further, if H 0 is true, then    0 and we have the numerator in 0 . Rewriting the numerator in 0 , in
general, we have
(    ) ' X ' X (    )   ' X ( X ' X ) 1 X ' X ( X ' X ) 1 X ' 
  ' X ( X ' X ) 1 X ' 
  ' H
where H is an idempotent matrix with rank k . Thus using this result, we have
 ' H   ' X '( X ' X ) 1 X ' 
 ~  2 (k ).
2 2
Furthermore, the product of the quadratic form matrices in the numerator ( ' H  ) and denominator ( ' H  )
of 0 is

 I  X ( X ' X ) 1 X ' X ( X ' X ) 1 X '  X ( X ' X ) 1 X ' X ( X ' X ) 1 X ' X ( X ' X ) 1 X '  0
and hence the  2 random variables in the numerator and denominator of 0 are independent. Dividing

each of the  2 random variables by their respective degrees of freedom

 
   
 (  0 ) ' X ' X (  0 ) 
 2 
 k 
1   
  (n  k )ˆ  2

 
   2
 
  n  k  
 
   
(    0 ) ' X ' X (    0 )

kˆ 2
( y  X  0 ) '( y  X  0 )  ( y  X  ) '( y  X  )

kˆ 2
~ F (k , n  k ) under H 0 .
Note that
( y  X  0 ) '( y  X  0 ) : Restricted error sum of squares
( y  X  ) '( y  X  ) : Unrestricted error sum of squares

Numerator in 1 : Difference between the restricted and unrestricted error sum of squares.

The decision rule is to reject H 0 :    0 at  level of significance whenever

1  F (k , n  k )
where F (k , n  k ) is the upper critical points on the central F -distribution with k and n  k degrees of

freedom.
Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
26
Likelihood ratio test for H 0 : R  r
The same logic and reasons used in the development of the likelihood ratio test for H 0 :    0 can be

extended to develop the likelihood ratio test for H 0 : R   r as follows.

  (  ,  2 ) :     i  ,  2  0, i  1, 2,..., k 
  (  ,  2 ) :     i  , R   r ,  2  0 .

Let   ( X ' X ) 1 X ' y. .


Then
E ( R  )  R
V ( R  )  E  R(    )(    ) ' R '
 RV (  ) R '
  2 R( X ' X ) 1 R '.

Since  ~ N   ,  2 ( X ' X ) 1 
so R  ~ N  R ,  2 R ( X ' X ) 1 R '
R   r  R   R  R(    ) ~ N 0,  2 R( X ' X ) 1 R ' .

1
There exists a matrix Q such that  R( X ' X ) 1 R '  QQ ' and then

  QR(b   )  N (0,  2 I n ) . Therefore under H 0 : R   r  0, so

 ' ( R  r ) ' QQ '( R  r )



2 2
1
( R  r ) '  R ( X ' X ) 1 R ' ( R   r )
=
2
1
(    ) ' R '  R ( X ' X ) 1 R ' R (    )

2
1
 ' X ( X ' X ) 1 R '  R( X ' X ) 1 R ' R( X ' X ) 1 X ' 

2
~  2 ( J ).
1
which is obtained as X ( X ' X ) 1 R '  R ( X ' X ) 1 R ' R ( X ' X ) 1 X ' is an idempotent matrix, and its trace is J

which is the associated degrees of freedom.

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
27
Also, irrespective of whether H 0 is true or not,

e ' e ( y  X  ) '( y  X  ) y ' Hy (n  k )ˆ 2


   ~  2 (n  k ).
2 2 2 2
Moreover, the product of quadratic form matrices of e ' e and
1
(    ) ' R '  R( X ' X ) 1 R ' R(    ) is zero implying that both the quadratic forms are independent. So in

terms of likelihood ratio test statistic


 ( R   r ) '  R( X ' X ) 1 R ' 1 ( R   r ) 
   
 2 
 J 
 
1   
 (n  k )ˆ 2

 
 
2

nk

  
1
R  r ) '  R ( X ' X ) 1 R ' R   r

J ˆ 2
~ F ( J , n  k ) under H 0 .
So the decision rule is to reject H 0 whenever

1  F ( J , n  k )
where F ( J , n  k ) is the upper critical points on the central F distribution with J and (n  k ) degrees of
freedom.

Test of significance of regression (Analysis of variance)


If we set R  [0 I k 1 ], r  0, then the hypothesis H 0 : R  r reduces to the following null hypothesis:

H 0 :  2   3  ...   k  0
against the alternative hypothesis
H1 :  j  0 for at least one j  2,3,..., k
This hypothesis determines if there is a linear relationship between y and any set of the explanatory
variables X 2 , X 3 ,..., X k . Notice that X 1 corresponds to the intercept term in the model and hence xi1  1

for all i  1, 2,..., n .

This is an overall or global test of model adequacy. Rejection of the null hypothesis indicates that at least
one of the explanatory variables among X 2 , X 3 ,..., X k . contributes significantly to the model. This is called
as analysis of variance.
Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
28
Since  ~ N (0,  2 I ),

so y ~ N ( X  ,  2 I )
b  ( X ' X ) 1 X ' y ~ N   ,  2 ( X ' X ) 1  .

SS res
Also ˆ 2 
nk
( y  yˆ ) '( y  yˆ )

nk
y '  I  X ( X ' X ) 1 X ' y y ' Hy y' y b' X ' y
   .
nk nk nk
Since ( X ' X )-1 X ' H  0, so b and ˆ 2 are independently distributed.

Since y ' Hy   ' H  and H is an idempotent matrix, so

SS r e s ~  (2n  k ) ,

i.e., central  2 distribution with (n  k ) degrees of freedom.

Partition X  [ X 1 , X 2* ] where the submatrix X 2* contains the explanatory variables X 2 , X 3 ,..., X k

and partition   [ 1 ,  2* ] where the subvector  2* contains the regression coefficients  2 ,  3 ,...,  k .

Now partition the total sum of squares due to y ' s as


SST  y ' Ay
 SSr e g  SS r e s

where SSr e g  b2* ' X 2* ' AX 2*b2* is the sum of squares due to regression and the sum of squares due to residuals

is given by
SS r e s  ( y  Xb) '( y  Xb)
 y ' Hy
 SST  SSr e g .

Further
SSr e g   * ' X * ' AX *  *   * ' X * ' AX *  *
~  k21  2 2 2 2 2  , i.e., non-central  2 distribution with non  centrality parameter 2 2 2 2 2 ,
2  2  2

SST   * ' X * ' AX *  *   * ' X * ' AX *  *


~  n21  2 2 2 2 2  , i.e., non-central  2 distribution with non  centrality parameter 2 2 2 2 2 .
2  2  2

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
29
Since X 2 H  0, so SSr e g and SS r e s are independently distributed. The mean squares due to regression is

SSr e g
MSr e g 
k 1
and the mean square due to error is
SSr e s
MSres  .
nk
Then
MSreg   * ' X * ' AX *  * 
~ Fk 1,n  k  2 2 2 2 2 
MS res  2 
which is a non-central F -distribution with (k  1, n  k ) degrees of freedom and noncentrality parameter

 2* ' X 2* ' AX 2*  2*
.
2 2

Under H 0 :  2  3  ...   k ,

MSreg
F ~ Fk 1,n  k .
MSres

The decision rule is to reject at  level of significance whenever


F  F (k  1, n  k ).

The calculation of F -statistic can be summarized in the form of an analysis of variance (ANOVA) table
given as follows:

Source of variation Sum of squares Degrees of freedom Mean squares F


Regression SS r e g k 1 MS reg  SSr e g / k  1 F
Error nk
SSr e s MSres  SS r e s /(n  k )

Total SST n 1

Rejection of H 0 indicates that it is likely that atleast one  i  0 (i  1, 2,..., k ).

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
30
Test of hypothesis on individual regression coefficients
In case, if the test in analysis of variance is rejected, then another question arises is that which of the
regression coefficients is/are responsible for the rejection of the null hypothesis. The explanatory variables
corresponding to such regression coefficients are important for the model.

Adding such explanatory variables also increases the variance of fitted values ŷ , so one needs to be careful
that only those regressors are added that are of real value in explaining the response. Adding unimportant
explanatory variables may increase the residual mean square, which may decrease the usefulness of the
model.

To test the null hypothesis


H0 :  j  0

versus the alternative hypothesis


H1 :  j  0

has already been discussed is the case of a simple linear regression model. In the present case, if H 0 is

accepted, it implies that the explanatory variable X j can be deleted from the model. The corresponding test

statistic is
bj
t ~ t (n  k  1) under H 0
se(b j )

where the standard error of OLSE b j of  j is

se(b j )  ˆ 2C jj where C jj denotes the j th diagonal element of ( X ' X ) 1 corresponding to b j .

The decision rule is to reject H 0 at  level of significance if

t  t .
, n  k 1
2

Note that this is only a partial or marginal test because ˆ j depends on all the other explanatory variables

X i (i  j that are in the model. This is a test of the contribution of X j given the other explanatory variables

in the model.

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
31
Confidence interval estimation
The confidence intervals in a multiple regression model can be constructed for individual regression
coefficients as well as jointly. We consider both of them as follows:

Confidence interval on the individual regression coefficient:


Assuming  i ' s are identically and independently distributed following N (0,  2 ) in y  X    , we have

y ~ N ( X  , 2 I )
b ~ N (  ,  2 ( X ' X ) 1 ).

Thus the marginal distribution of any regression coefficient estimate


b j ~ N (  j ,  2C jj )

where C jj is the j th diagonal element of ( X ' X ) 1 .

Thus
bj   j
tj  ~ t (n  k ) under H 0 , j  1, 2,...
ˆ 2C jj

SS r e s y ' y  b ' X ' y


where ˆ 2   .
nk nk
So the 100(1   )% confidence interval for  j ( j  1, 2,..., k ) is obtained as follows:

 b j 
P  t  j  t   1  
 2 ,nk ˆ 2C jj ,nk 
 2

 
P b j  t ˆ 2C jj   j  b j  t ˆ 2C jj   1   .
,nk ,n  k
 2 2 

Thus the confidence interval is


 
 b j  t , n  k ˆ C jj , b j  t , n  k ˆ C jj
2 2
.
 2 2 

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
32
Simultaneous confidence intervals on regression coefficients:
A set of confidence intervals that are true simultaneously with probability (1   ) are called simultaneous or
joint confidence intervals.
It is relatively easy to define a joint confidence region for  in multiple regression model.
Since
(b   ) ' X ' X (b   )
~ Fk ,n  k
k MS r e s
 (b   ) ' X ' X (b   ) 
 P  F (k , n  k )   1   .
 k MSr e s 
So a 100 (1   )% joint confidence region for all of the parameters in  is
(b   ) ' X ' X (b   )
 F (k , n  k )
k MSr e s
which describes an elliptically shaped region.

Coefficient of determination ( R 2 ) and adjusted R 2


Let R be the multiple correlation coefficient between y , and X 1 , X 2 ,..., X k . Then square of multiple

correlation coefficient ( R 2 ) is called a coefficient of determination. The value of R 2 commonly describes


how well the sample regression line fits the observed data. This is also treated as a measure of goodness of
fit of the model.

Assuming that the intercept term is present in the model as


yi  1   2 X i 2   3 X i 3  ...   k X ik  ui , i  1, 2,..., n
then
e 'e
R2  1  n

 ( y  y)
i 1
i
2

SSres SS r e g
 1 
SST SST
where
SSr e s : sum of squares due to residuals,

SST : total sum of squares

SSr e g : sum of squares due to regression.

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
33
R 2 measures the explanatory power of the model, which in turn reflects the goodness of fit of the model. It
reflects the model adequacy in the sense of how much is the explanatory power of the explanatory variables.
Since
e ' e  y '  I  X ( X ' X ) 1 X ' y  y ' Hy,
n n

 ( yi  y )2   yi2  ny 2 ,
i 1 i 1

1 n
1
where y  
n i 1
yi   ' y
n
with   1,1,...,1 ', y   y1 , y2 ,..., yn  '

Thus
n
 1 
( y  y)
i 1
i  y ' y  n  2  ' yy '  
2

n 
1
 y ' y  y '  ' y
n
1
 y ' y  y '  ( '  )  ' y
 y '  I  ( ' ) 1  ' y
 y ' Ay

where A  I  ( ' ) 1  ' .

y ' Hy
So R2  1  .
y ' Ay

The limits of R 2 are 0 and 1, i.e.,


0  R 2  1.

R 2  0 indicates the poorest fit of the model.


R 2  1 indicates the best fit of the model
R 2  0.95 indicates that 95% of the variation in y is explained by R 2 . In simple words, the model is 95%
good.

Similarly, any other value of R 2 between 0 and 1 indicates the adequacy of the fitted model.

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
34
Adjusted R 2
If more explanatory variables are added to the model, then R 2 increases. In case the variables are
irrelevant, then R 2 will still increase and gives an overly optimistic picture.

With a purpose of correction in the overly optimistic picture, adjusted R 2 , denoted as R 2 or adj R 2 is used
which is defined as
SSr e s / (n  k )
R2  1
SST / (n  1)
 n 1 
 1   (1  R ).
2

 nk 
We will see later that (n  k ) and (n  1) are the degrees of freedom associated with the distributions of SSres

SSr e s SST
and SST . Moreover, the quantities and are based on the unbiased estimators of respective
nk n 1
variances of e and y in the context of analysis of variance.

The adjusted R 2 will decline if the addition if an extra variable produces too small a reduction in (1  R 2 ) to

 n 1 
compensate for the increase in  .
nk 

Another limitation of adjusted R 2 is that it can be negative also. For example, if k  3, n  10, R 2  0.16,
then
9
R 2  1   0.97  0.25  0
7
which has no interpretation.

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
35
Limitations
1. If the constant term is absent in the model, then R 2 can not be defined. In such cases, R 2 can be
negative. Some ad-hoc measures based on R 2 for regression line through origin have been proposed
in the literature.

Reason that why R 2 is valid only in linear models with intercept term:
In the model y  X    , the ordinary least squares estimator of  is b  ( X ' X ) 1 X ' y . Consider the
fitted model as
y  Xb  ( y  Xb)
 Xb  e
where e is the residual. Note that
y  ly  Xb  e  ly
 yˆ  e  ly
where ŷ  Xb is the fitted value and l  (1,1,...,1) ' is a n  1 vector of elements unity. The total sum of
n
squares TSS   ( yi  y ) 2 is then obtained as
i 1

TSS  ( y  ly ) '( y  ly )  [( yˆ  ly )  e]'[( yˆ  ly )  e]


 ( yˆ  ly ) '( yˆ  ly )  e ' e  2( yˆ  ly ) ' e
  
 SS reg  SSres  2( Xb  ly ) ' e (because yˆ  Xb)
 SSreg  SSres  2 yl ' e (because X ' e  0).

The Fisher Cochran theorem requires TSS  SS reg  SS res to hold true in the context of analysis of

variance and further to define the R2. In order that TSS  SS reg  SS res holds true, we need that

l ' e should be zero, i.e. l ' e =l '( y  yˆ )  0 which is possible only when there is an intercept term in the
model. We show this claim as follows:

First, we consider a no intercept simple linear regression model yi  1 xi   i , (i  1, 2,..., n) where the
n

x y i i n n n
parameter 1 is estimated as b  *
1
i 1
n
. Then l ' e =  ei   ( yi  yˆi )   ( yi  b1* xi ) 0, in general.
x
i 1
2
i
i 1 i 1 i 1

Similarly, in a no intercept multiple linear regression model y  X    , we find that


l ' e =l '( y  yˆ )  l '( X     Xb) =  l ' X (b   )  l '   0 , in general.
Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
36
Next, we consider a simple linear regression model with intercept term yi   0  1 xi   i , (i  1, 2,..., n)

sxy
where the parameters  0 and 1 are estimated as b0  y  b1 x and b1  respectively, where
sxx
n n
1 n 1 n
sxy   ( xi  x )( yi  y ), sxx   ( xi  x ) 2 , x   i n
x y  yi . We find that
i 1 i 1 n i 1 i 1

n n
l ' e =  ei   ( yi  yˆi )
i 1 i 1
n
  ( yi  b0  b1 xi )
i 1
n
  ( yi  y  b1 x  b1 xi )
i 1
n
  [( yi  y )  b1 ( xi  x )]
i 1
n n
  ( yi  y )  b1  ( xi  x )
i 1 i 1

 0.

In a multiple linear regression model with an intercept term y   0l  X    where the parameters  0

and  are estimated as ˆ0  y  bx and b  ( X ' X ) 1 X ' y , respectively. We find that

l ' e =l '( y  yˆ )
=l '( y  ˆ0  Xb)
=l '( y  y  Xb  Xb) ,
=l '( y  y )  l '( X  X )b
=0.
Thus we conclude that for the Fisher Cochran to hold true in the sense that the total sum of squares can
be divided into two orthogonal components, viz., the sum of squares due to regression and sum of
squares due to errors, it is necessary that l ' e =l '( y  yˆ )  0 holds and which is possible only when the
intercept term is present in the model.

2. R 2 is sensitive to extreme values, so R 2 lacks robustness.

3. R 2 always increases with an increase in the number of explanatory variables in the model. The main
drawback of this property is that even when the irrelevant explanatory variables are added in the
model, R 2 still increases. This indicates that the model is getting better, which is not really correct.
Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
37
4. Consider a situation where we have the following two models:
yi  1   2 X i 2  ...   k X ik  ui , i  1, 2,.., n
log yi   1   2 X i 2  ...   k X ik  vi
The question is now which model is better?
For the first model,
n

 ( y  yˆ )
i i
2

R12  1  i 1
n

 ( y  y)
i 1
i
2

and for the second model, an option is to define R 2 as


n

 (log y  log yˆ )
i i
2

R22  1  i 1
n
.
 (log y  log y )
i 1
i
2

As such R12 and R22 are not comparable. If still, the two models are needed to be compared, a better

proposition to define R 2 can be as follows:


n

 ( y  anti log yˆ )
i
*
i
R32  1  i 1
n

 ( y  y)
i 1
i
2

 y . Now
where y  log
*
R12 and R32 on the comparison may give an idea about the adequacy of the two
i i

models.

Relationship of analysis of variance test and coefficient of determination


Assuming the 1 to be an intercept term, then for H 0 :  2  3  ...   k  0, the F  statistic in analysis of
variance test is
MSr e g
F
MSres
(n  k ) SSr e g

(k  1) SS r e s
 n  k  SSr e g
 
 k  1  SST  SSr e g
SSr e g
 n  k  SST nk  R
2
   
 k  1  1  SS r e g  k  1  1  R
2

SST
Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
38
where R 2 is the coefficient of determination. So F and R 2 are closely related. When R 2  0, then F  0.

In the limit, when R 2  1, F   . So both F and R 2 vary directly. Larger R 2 implies greater F value.
That is why the F test under the analysis of variance is termed as the measure of the overall significance of
estimated regression. It is also a test of significance of R 2 . If F is highly significant, it implies that we can
reject H 0 , i.e. y is linearly related to X ' s.

Prediction of values of study variable


The prediction in the multiple regression model has two aspects
1. Prediction of the average value of study variable or mean response.
2. Prediction of the actual value of the study variable.

1. Prediction of average value of y


We need to predict E ( y ) at a given x0  ( x01 , x02 ,..., x0 k ) '.
The predictor as a point estimate is
p  x0 b  x0 ( X ' X ) 1 X ' y
E ( p)  x0  .

So p is an unbiased predictor for E ( y ) .


Its variance is
Var ( p)  E  p  E ( y )  '  p  E ( y ) 
= 2 x0 ( X ' X ) 1 x0

Then
E ( yˆ 0 )  x0   E ( y | x0 )
Var ( yˆ 0 )   2 x0 ( X ' X ) 1 x0

The confidence interval on the mean response at a particular point, such as x01 , x02 ,..., x0 k can be found as
follows:
Define x0  ( x01 , x02 ,..., x0 k ) '. The fitted value at x0 is yˆ 0  x0 b.
Then
 yˆ 0  E ( y | x0 ) 
P  t   t   1  
 2 , n  k ˆ 2 x0 ( X ' X ) 1 x0 2
,n k

 
P  yˆ 0  t  ˆ 2 x0 ( X ' X ) 1 x0  E ( y | x0 )  yˆ 0  t  ˆ 2 x0 ( X ' X ) 1 x0   1   .
,n  k ,nk
 2 2 
Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
39
The 100 (1   )% confidence interval on the mean response at the point x01 , x02 ,..., x0 k , i.e., E ( y / x0 ) is

 
 yˆ 0  t , n  k ˆ x0 ( X ' X ) x0 , yˆ 0  t , n  k  ˆ x0 ( X ' X ) x0
2 1 2 1
.
 2 2 

2. Prediction of actual value of y


We need to predict y at a given x0  ( x01 , x02 ,..., x0 k ) '.

The predictor as a point estimate is


p f  x0 b
E ( p f )  x0 
So p f is an unbiased for y. It's variance is
Var ( p f )  E  ( p f  y )( p f  y ) ' 
  2 1  x0 ( X ' X ) 1 x0  .

The 100 (1   )% confidence interval for this future observation is

 
 p f  t , n  k ˆ [1  x0 ( X ' X ) x0 ], p f  t , n  k ˆ [1  x0 ( X ' X ) x0 ]  .
2 1 2 1

 2 2 

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
40

You might also like